NeoBERT Model

This is a NeoBERT model trained with the pszemraj/NeoBERT codebase and exported to the Hugging Face transformers format.

Model Details

  • Architecture: NeoBERT
  • Hidden Size: 768
  • Layers: 12
  • Attention Heads: 12
  • Vocab Size: 30592
  • Max Length: 4096
  • Dtype: float32

Runtime Dependencies

Exported NeoBERT inference does not require Liger kernels, flash-attn, or other custom CUDA extensions. The exported modeling_neobert.py runs on standard PyTorch + Transformers attention paths.

  • torch: 2.10.0+cu128
  • transformers: 4.57.6
  • safetensors: required (weights are exported as model.safetensors)
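As a quick sanity check, a small stdlib-only helper like the following (an illustrative sketch, not part of the export) can report which of these packages are installed. Note the versions listed above are the versions used at export time, not strict minimums:

```python
from importlib import metadata

def installed_version(pkg):
    """Return the installed version string for pkg, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("torch", "transformers", "safetensors"):
    version = installed_version(pkg)
    print(f"{pkg}: {version if version else 'NOT INSTALLED'}")
```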

Exported Artifacts

  • config.json
  • model.safetensors
  • modeling_neobert.py
  • rotary.py
  • tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, ...)
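A convenience check like this (hypothetical helper, not shipped with the export) can confirm a local export directory contains the core artifacts before attempting to load it:

```python
from pathlib import Path

# Core files produced by the export; tokenizer sidecar files vary by tokenizer.
CORE_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "modeling_neobert.py",
    "rotary.py",
    "tokenizer.json",
)

def missing_artifacts(export_dir):
    """Return the core artifact filenames missing from export_dir."""
    root = Path(export_dir)
    return [name for name in CORE_ARTIFACTS if not (root / name).is_file()]

# An empty return value means the directory looks complete.
```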

Usage

Update repo_id in the examples below to your actual Hugging Face repo ID or local path before running them.

For Masked Language Modeling (Fill-Mask)

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

repo_id = "baseline-neobert-100m-bert_tok-SmolLM2mix_p4_100000"  # Update this to your HF repo ID
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Example: Fill in masked tokens
text = "NeoBERT is the most [MASK] model of its kind!"

# Replace display mask with actual mask token if different
text = text.replace("[MASK]", tokenizer.mask_token)

inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
    if len(mask_positions[0]) == 0:
        raise ValueError("No mask token found in input")
    mask_pos = mask_positions[1][0]
    predictions = outputs.logits[0, mask_pos].topk(5)

# Display top predictions
for idx, score in zip(predictions.indices, predictions.values):
    token = tokenizer.decode([idx])
    print(f"{token}: {score:.2f}")

For Embeddings / Feature Extraction

from transformers import AutoModel, AutoTokenizer
import torch

repo_id = "baseline-neobert-100m-bert_tok-SmolLM2mix_p4_100000"  # Update this to your HF repo ID
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Example: Generate embeddings
text = "NeoBERT is an efficient transformer model!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get CLS token embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")
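The CLS embedding is one option; masked mean pooling over last_hidden_state is a common alternative for sentence embeddings (and matches the mteb_pooling: mean setting in the config below). A minimal, self-contained sketch with toy tensors in place of real model outputs:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Toy example: batch of 1, seq len 3 (last position is padding), hidden size 2
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(hidden, mask))  # averages only the two real tokens
```

With real model outputs, pass outputs.last_hidden_state and inputs["attention_mask"] instead of the toy tensors.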

Training Configuration

Full training config:

model:
  name: null
  hidden_size: 768
  num_hidden_layers: 12
  num_attention_heads: 12
  intermediate_size: 3072
  max_position_embeddings: 4096
  vocab_size: 30592
  rope: true
  rms_norm: true
  hidden_act: swiglu
  dropout_prob: 0.0
  norm_eps: 1.0e-05
  embedding_init_range: 0.02
  decoder_init_range: 0.02
  classifier_init_range: 0.02
  attn_backend: flash_attn_varlen
  kernel_backend: auto
  ngpt: false
  base_scale: 0.03227486121839514
  pad_token_id: 0
  from_hub: false
dataset:
  name: EleutherAI/SmolLM2-1.7B-stage-4-100B
  config: null
  path: ''
  num_workers: 4
  pin_memory: false
  persistent_workers: true
  prefetch_factor: 8
  streaming: true
  cache_dir: null
  trust_remote_code: false
  max_seq_length: 1024
  text_column: null
  validation_split: null
  train_split: train
  eval_split: null
  eval_samples: 4096
  num_proc: 8
  shuffle_buffer_size: 10000
  pre_tokenize: false
  pre_tokenize_output: null
  load_all_from_disk: false
  force_redownload: false
  min_length: 5
  alpha: 1.0
tokenizer:
  name: bert-base-uncased
  path: null
  max_length: 4096
  padding: max_length
  truncation: true
  vocab_size: 30592
  trust_remote_code: false
  revision: null
  allow_special_token_rewrite: false
optimizer:
  name: muonclip
  lr: 0.0001
  weight_decay: 0.01
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  muon_config:
    muon_beta: 0.95
    muon_decay: 0.01
    ns_steps: 5
    enable_clipping: false
    clipping_threshold: 50.0
    clipping_alpha: 0.5
    clipping_warmup_steps: 0
    clipping_interval: 10
    clipping_qk_chunk_size: 1024
    capture_last_microbatch_only: true
    detect_anomalies: false
    orthogonalization: polar_express
    algorithm: null
    polar_express: null
    clipping_layers_mapping: {}
scheduler:
  name: cosine
  warmup_steps: 5000
  total_steps: null
  decay_steps: null
  final_lr_ratio: 0.1
  warmup_percent: null
  decay_percent: null
trainer:
  per_device_train_batch_size: 32
  per_device_eval_batch_size: 32
  gradient_accumulation_steps: 4
  max_steps: 100000
  save_steps: 10000
  eval_steps: 5000
  eval_max_batches: null
  logging_steps: 25
  enforce_full_packed_batches: true
  log_train_accuracy: false
  log_grad_norm: true
  output_dir: ./outputs/baseline-neobert-100m-bert_tok-SmolLM2mix_p4
  overwrite_output_dir: true
  gradient_checkpointing: false
  gradient_clipping: null
  mixed_precision: bf16
  masked_logits_only_loss: true
  torch_compile: true
  torch_compile_dynamic: null
  torch_compile_backend: inductor
  resume_from_checkpoint: false
  num_train_epochs: 3
  eval_strategy: steps
  save_strategy: steps
  save_total_limit: 1
  early_stopping: 0
  metric_for_best_model: null
  greater_is_better: true
  load_best_model_at_end: false
  save_model: true
  disable_tqdm: false
  dataloader_num_workers: 0
  use_cpu: false
  report_to: []
  tf32: true
  max_ckpt: null
  log_weight_norms: true
  train_batch_size: null
  eval_batch_size: null
datacollator:
  mlm_probability: 0.25
  pad_to_multiple_of: 8
  mask_all: false
  pack_sequences: true
  max_length: null
wandb:
  enabled: true
  project: neobert-pretraining
  entity: null
  name: neobert-100m-SmolLM2mix_p4-bert_tok
  tags: []
  mode: online
  watch: gradients
  log_interval: 100
  resume: never
  dir: logs/wandb
glue:
  task_name: cola
  num_labels: 2
  max_seq_length: 128
  pretrained_model_path: null
  pretrained_checkpoint_dir: null
  pretrained_checkpoint: null
  allow_random_weights: false
  classifier_dropout: 0.1
  classifier_init_range: 0.02
  transfer_from_task: false
  num_workers: 4
  preprocessing_num_proc: 4
contrastive:
  temperature: 0.05
  pooling: avg
  loss_type: simcse
  hard_negative_weight: 0.0
  pretraining_prob: 0.3
  pretrained_checkpoint_dir: null
  pretrained_checkpoint: null
  allow_random_weights: false
task: pretraining
accelerate_config_file: null
mteb_task_type: all
mteb_batch_size: 32
mteb_pooling: mean
mteb_overwrite_results: false
pretrained_checkpoint: latest
use_deepspeed: false
seed: 69
debug: false
pretraining_metadata: {}
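For orientation, the trainer and dataset settings above imply the following per-device token budget. This is a rough sketch: the global figure also depends on the number of devices, and assumes fully packed sequences (as pack_sequences: true and enforce_full_packed_batches: true suggest):

```python
# Values taken from the training config above
per_device_batch = 32   # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
max_seq_length = 1024   # dataset.max_seq_length
max_steps = 100_000     # trainer.max_steps

seqs_per_step = per_device_batch * grad_accum       # 128 sequences/step per device
tokens_per_step = seqs_per_step * max_seq_length    # 131,072 tokens/step
total_tokens = tokens_per_step * max_steps          # ~13.1B tokens per device

print(f"{seqs_per_step} sequences/step, {tokens_per_step:,} tokens/step")
print(f"~{total_tokens / 1e9:.1f}B tokens per device over training")
```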
Dataset used to train pszemraj/neobert-100m-bert_tok-SmolLM2mix_p4: EleutherAI/SmolLM2-1.7B-stage-4-100B