# NeoBERT Model

This is a NeoBERT model trained with `pszemraj/NeoBERT` and exported to the Hugging Face `transformers` format.
## Model Details

- **Architecture:** NeoBERT
- **Hidden Size:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **Vocab Size:** 30592
- **Max Length:** 4096
- **Dtype:** float32
## Runtime Dependencies

Exported NeoBERT inference does not require Liger kernels, flash-attn, or other custom CUDA extensions. The exported `modeling_neobert.py` runs on standard PyTorch + Transformers attention paths.

- `torch`: 2.10.0+cu128
- `transformers`: 4.57.6
- `safetensors`: required (weights are exported as `model.safetensors`)
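The versions listed above are those of the export environment, not hard minimums; nearby releases will usually work. As a minimal sketch (assuming compatibility with versions at least as new as the export environment, which is an untested assumption), you can compare installed versions against these pins with only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

# Reference versions from the export environment above (assumption:
# versions >= these are compatible; older ones may still work).
EXPORT_VERSIONS = {"torch": "2.10.0", "transformers": "4.57.6"}

def parse(v: str) -> tuple:
    """Parse '2.10.0+cu128' -> (2, 10, 0), dropping local build tags."""
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())

def check(pkg: str, reference: str) -> str:
    """Report the installed version of `pkg` relative to the export pin."""
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: NOT INSTALLED"
    flag = "ok" if parse(installed) >= parse(reference) else "older than export env"
    return f"{pkg}: {installed} ({flag})"

for pkg, ref in EXPORT_VERSIONS.items():
    print(check(pkg, ref))
```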
## Exported Artifacts

- `config.json`
- `model.safetensors`
- `modeling_neobert.py`
- `rotary.py`
- tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, ...)
## Usage

Make sure to update `repo_id` to your actual Hugging Face repo ID or local path.
### For Masked Language Modeling (Fill-Mask)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "baseline-neobert-100m-bert_tok-SmolLM2mix_p4_100000"  # Update this to your HF repo ID

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Example: fill in masked tokens
text = "NeoBERT is the most [MASK] model of its kind!"
# Replace the display mask with the tokenizer's actual mask token if different
text = text.replace("[MASK]", tokenizer.mask_token)
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
if len(mask_positions[0]) == 0:
    raise ValueError("No mask token found in input")

mask_pos = mask_positions[1][0]
predictions = outputs.logits[0, mask_pos].topk(5)

# Display top predictions
for idx, score in zip(predictions.indices, predictions.values):
    token = tokenizer.decode([idx])
    print(f"{token}: {score:.2f}")
```
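The scores printed above are raw logits. To report probabilities instead, softmax the full vocabulary row before taking the top-k. The arithmetic can be sketched in plain Python with a dummy logits row (no model required; with real outputs you would apply this to `outputs.logits[0, mask_pos]`, or use `torch.softmax` directly):

```python
import math

def topk_probs(logits, k=5):
    """Softmax a row of raw logits and return the top-k
    (index, probability) pairs, highest probability first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    return sorted(probs, key=lambda p: p[1], reverse=True)[:k]

# Dummy logits standing in for one vocabulary row of model output
dummy = [0.1, 2.0, -1.0, 3.0, 0.5]
for idx, p in topk_probs(dummy, k=3):
    print(f"token id {idx}: {p:.3f}")
```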
### For Embeddings / Feature Extraction

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "baseline-neobert-100m-bert_tok-SmolLM2mix_p4_100000"  # Update this to your HF repo ID

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Example: generate embeddings
text = "NeoBERT is an efficient transformer model!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the [CLS] token embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")  # (1, 768)
```
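Mean pooling over non-padding tokens is a common alternative to the `[CLS]` embedding (the training config below uses `mteb_pooling: mean` for evaluation). The masked-average arithmetic can be sketched in plain Python; with tensors you would instead mask via `attention_mask.unsqueeze(-1)`, sum, and divide by the mask sum:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1.

    hidden_states: list of token vectors (seq_len x hidden_dim)
    attention_mask: list of 0/1 ints, length seq_len
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, x in enumerate(vec):
                total[j] += x
    # Guard against an all-zero mask to avoid division by zero
    return [t / max(count, 1) for t in total]

# Two real tokens plus one padding position (mask 0), which is ignored
hidden = [[1.0, 3.0], [3.0, 5.0], [99.0, 99.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # -> [2.0, 4.0]
```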
## Training Configuration

<details>
<summary>Full config (click to expand)</summary>

```yaml
model:
  name: null
  hidden_size: 768
  num_hidden_layers: 12
  num_attention_heads: 12
  intermediate_size: 3072
  max_position_embeddings: 4096
  vocab_size: 30592
  rope: true
  rms_norm: true
  hidden_act: swiglu
  dropout_prob: 0.0
  norm_eps: 1.0e-05
  embedding_init_range: 0.02
  decoder_init_range: 0.02
  classifier_init_range: 0.02
  attn_backend: flash_attn_varlen
  kernel_backend: auto
  ngpt: false
  base_scale: 0.03227486121839514
  pad_token_id: 0
  from_hub: false
dataset:
  name: EleutherAI/SmolLM2-1.7B-stage-4-100B
  config: null
  path: ''
  num_workers: 4
  pin_memory: false
  persistent_workers: true
  prefetch_factor: 8
  streaming: true
  cache_dir: null
  trust_remote_code: false
  max_seq_length: 1024
  text_column: null
  validation_split: null
  train_split: train
  eval_split: null
  eval_samples: 4096
  num_proc: 8
  shuffle_buffer_size: 10000
  pre_tokenize: false
  pre_tokenize_output: null
  load_all_from_disk: false
  force_redownload: false
  min_length: 5
  alpha: 1.0
tokenizer:
  name: bert-base-uncased
  path: null
  max_length: 4096
  padding: max_length
  truncation: true
  vocab_size: 30592
  trust_remote_code: false
  revision: null
  allow_special_token_rewrite: false
optimizer:
  name: muonclip
  lr: 0.0001
  weight_decay: 0.01
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  muon_config:
    muon_beta: 0.95
    muon_decay: 0.01
    ns_steps: 5
    enable_clipping: false
    clipping_threshold: 50.0
    clipping_alpha: 0.5
    clipping_warmup_steps: 0
    clipping_interval: 10
    clipping_qk_chunk_size: 1024
    capture_last_microbatch_only: true
    detect_anomalies: false
    orthogonalization: polar_express
    algorithm: null
    polar_express: null
    clipping_layers_mapping: {}
scheduler:
  name: cosine
  warmup_steps: 5000
  total_steps: null
  decay_steps: null
  final_lr_ratio: 0.1
  warmup_percent: null
  decay_percent: null
trainer:
  per_device_train_batch_size: 32
  per_device_eval_batch_size: 32
  gradient_accumulation_steps: 4
  max_steps: 100000
  save_steps: 10000
  eval_steps: 5000
  eval_max_batches: null
  logging_steps: 25
  enforce_full_packed_batches: true
  log_train_accuracy: false
  log_grad_norm: true
  output_dir: ./outputs/baseline-neobert-100m-bert_tok-SmolLM2mix_p4
  overwrite_output_dir: true
  gradient_checkpointing: false
  gradient_clipping: null
  mixed_precision: bf16
  masked_logits_only_loss: true
  torch_compile: true
  torch_compile_dynamic: null
  torch_compile_backend: inductor
  resume_from_checkpoint: false
  num_train_epochs: 3
  eval_strategy: steps
  save_strategy: steps
  save_total_limit: 1
  early_stopping: 0
  metric_for_best_model: null
  greater_is_better: true
  load_best_model_at_end: false
  save_model: true
  disable_tqdm: false
  dataloader_num_workers: 0
  use_cpu: false
  report_to: []
  tf32: true
  max_ckpt: null
  log_weight_norms: true
  train_batch_size: null
  eval_batch_size: null
datacollator:
  mlm_probability: 0.25
  pad_to_multiple_of: 8
  mask_all: false
  pack_sequences: true
  max_length: null
wandb:
  enabled: true
  project: neobert-pretraining
  entity: null
  name: neobert-100m-SmolLM2mix_p4-bert_tok
  tags: []
  mode: online
  watch: gradients
  log_interval: 100
  resume: never
  dir: logs/wandb
glue:
  task_name: cola
  num_labels: 2
  max_seq_length: 128
  pretrained_model_path: null
  pretrained_checkpoint_dir: null
  pretrained_checkpoint: null
  allow_random_weights: false
  classifier_dropout: 0.1
  classifier_init_range: 0.02
  transfer_from_task: false
  num_workers: 4
  preprocessing_num_proc: 4
contrastive:
  temperature: 0.05
  pooling: avg
  loss_type: simcse
  hard_negative_weight: 0.0
  pretraining_prob: 0.3
  pretrained_checkpoint_dir: null
  pretrained_checkpoint: null
  allow_random_weights: false
task: pretraining
accelerate_config_file: null
mteb_task_type: all
mteb_batch_size: 32
mteb_pooling: mean
mteb_overwrite_results: false
pretrained_checkpoint: latest
use_deepspeed: false
seed: 69
debug: false
pretraining_metadata: {}
```

</details>
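As a quick sanity check, the token budget implied by the trainer config can be computed from `per_device_train_batch_size`, `gradient_accumulation_steps`, `max_seq_length`, and `max_steps`. Note these are per-device figures; the config does not record the number of GPUs, so multiply by world size for the global total, and with `pack_sequences: true` most of these tokens are real rather than padding:

```python
# Values taken from the training config above (per device)
per_device_batch = 32
grad_accum = 4
max_seq_length = 1024
max_steps = 100_000

seqs_per_step = per_device_batch * grad_accum      # 128 sequences per optimizer step
tokens_per_step = seqs_per_step * max_seq_length   # 131072 tokens per step
total_tokens = tokens_per_step * max_steps

print(f"{total_tokens / 1e9:.1f}B tokens per device")  # -> 13.1B tokens per device
```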