# ModernBERT-32K Hallucination Detector with Early Exit Adapters

*Fast and faithful long-context hallucination detection: a 32K-token encoder for RAG verification with configurable early exit for production deployment.*
## Overview
This repository contains early exit adapters for the llm-semantic-router/modernbert-base-32k-haldetect-combined model, enabling configurable accuracy-latency tradeoffs for production deployment.
| Component | Description |
|---|---|
| Base Model | llm-semantic-router/modernbert-base-32k-haldetect-combined |
| This Repo | Early exit adapters (1.5MB) at layers 6, 11, 16 |
| Architecture | ModernBERT (32K context, RoPE + Flash Attention 2) |
| Task | Token-level hallucination detection |
## Key Features

### 1. Long-Context Support (32K tokens)

- Process entire legal contracts, financial reports, and scientific papers
- No chunking required: single-pass inference (see the sketch below)
- 4× longer context than previous encoder-based detectors (8K)
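For illustration, a long retrieved document and its response fit in one forward pass (a minimal sketch reusing the `tokenizer` from the Usage section; `contract.txt` and `summary` are hypothetical placeholders):

```python
# Minimal sketch: a long document is verified in a single pass, no chunking.
# `contract.txt` and `summary` are hypothetical placeholders.
long_document = open("contract.txt").read()  # e.g. a ~20K-token contract
summary = "The agreement auto-renews annually unless cancelled in writing."

inputs = tokenizer(
    long_document,
    summary,
    return_tensors="pt",
    max_length=32768,  # the full 32K window; no sliding-window chunking
    truncation=True,
)
print(inputs["input_ids"].shape)  # e.g. torch.Size([1, 20487]) -- one pass
```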
### 2. Configurable Early Exit

Exit at different layers for accuracy-latency tradeoffs:

| Exit Layer | F1 Score | Relative Accuracy | Speedup |
|---|---|---|---|
| L6 | 48.2% | 48% | 3.9× |
| L11 | 81.2% | 81% | 2.3× |
| L16 | 95.5% | 97% | 1.4× |
| L22 (full) | 98.4% | 100% | 1.0× |

**Key insight:** Speedup increases with context length (3.4× at 512 tokens → 3.9× at 24K tokens).
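The table can be read as a simple layer-selection rule; a minimal sketch (the helper name and profile dict are illustrative, with relative-accuracy values taken from the table above):

```python
# Hypothetical helper: shallowest (fastest) exit layer meeting an accuracy target.
# Relative-accuracy values come from the table above.
EXIT_PROFILE = {6: 0.48, 11: 0.81, 16: 0.97, 22: 1.00}

def pick_exit_layer(min_relative_accuracy: float) -> int:
    for layer, rel_acc in sorted(EXIT_PROFILE.items()):
        if rel_acc >= min_relative_accuracy:
            return layer
    return 22  # fall back to the full model

print(pick_exit_layer(0.95))  # -> 16 (1.4x speedup per the table)
```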
### 3. Production Performance on RAGTruth

| Metric | Score |
|---|---|
| Example F1 | 77.0% |
| Token F1 | 53.4% |
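Example F1 scores each response as a single binary prediction (hallucinated or not), while Token F1 scores the individual token labels, which is why the two numbers differ. A minimal sketch of the distinction (hypothetical labels, scikit-learn for the metric):

```python
from sklearn.metrics import f1_score

# Per-token gold/predicted labels for three responses (1 = hallucinated token).
gold = [[0, 0, 1, 1], [0, 0, 0], [1, 0]]
pred = [[0, 1, 1, 0], [0, 0, 0], [1, 1]]

# Token F1: score every token individually.
token_f1 = f1_score(sum(gold, []), sum(pred, []))

# Example F1: a response counts as positive if any token is hallucinated.
example_f1 = f1_score([int(any(g)) for g in gold], [int(any(p)) for p in pred])

print(f"token F1 = {token_f1:.2f}, example F1 = {example_f1:.2f}")
```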
## Installation

```bash
pip install transformers torch
```
## Usage

### Basic Hallucination Detection (Full Model)
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load base model
model_name = "llm-semantic-router/modernbert-base-32k-haldetect-combined"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Format: context + response as a sentence pair
context = "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
response = "The Eiffel Tower was built in 1920 and is 500 meters tall."

inputs = tokenizer(
    context,
    response,
    return_tensors="pt",
    max_length=32768,
    truncation=True,
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    # 0 = faithful, 1 = hallucinated (per token)
```
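To turn the per-token predictions into human-readable evidence, the fast tokenizer's offset mapping can locate flagged spans in the response (a sketch building on the `tokenizer`, `model`, `context`, and `response` above):

```python
# Sketch: recover hallucinated character spans in the response text.
enc = tokenizer(
    context,
    response,
    return_tensors="pt",
    return_offsets_mapping=True,
    max_length=32768,
    truncation=True,
)
offsets = enc.pop("offset_mapping")[0]
seq_ids = enc.sequence_ids(0)  # 0 = context, 1 = response, None = special token

with torch.no_grad():
    preds = model(**enc.to(model.device)).logits.argmax(dim=-1)[0]

for i, label in enumerate(preds.tolist()):
    if label == 1 and seq_ids[i] == 1:  # hallucinated token inside the response
        start, end = offsets[i].tolist()
        print(f"flagged: {response[start:end]!r}")
```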
### Early Exit Inference (Faster)
```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load base model
model_name = "llm-semantic-router/modernbert-base-32k-haldetect-combined"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    output_hidden_states=True,
)
model = model.cuda().eval()

# Download and load early exit adapters
adapter_path = hf_hub_download(
    repo_id="HuaminChen/modernbert-32k-hallucination-early-exit",
    filename="early_exit_adapters.pt",
)
adapter_weights = torch.load(adapter_path, map_location="cpu")

# Adapter architecture: LayerNorm + two-layer 256-dim bottleneck MLP head
class EarlyExitAdapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck_size=256, num_classes=2):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, bottleneck_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(bottleneck_size, bottleneck_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(bottleneck_size, num_classes),
        )

    def forward(self, hidden_states):
        return self.adapter(hidden_states)

# Load adapters for each exit layer
adapters = {}
for layer in [6, 11, 16]:
    adapters[layer] = EarlyExitAdapter().to(torch.bfloat16).cuda()
    # Select this layer's weights and strip the "{layer}." key prefix
    prefix = f"{layer}."
    state_dict = {
        k[len(prefix):]: v
        for k, v in adapter_weights.items()
        if k.startswith(prefix)
    }
    adapters[layer].load_state_dict(state_dict)
    adapters[layer].eval()
```
```python
def early_exit_predict(text_context, text_response, exit_layer=16):
    """
    Predict at a fixed exit layer.

    Args:
        exit_layer: Which layer to exit at (6, 11, 16, or 22 = full model)
    """
    inputs = tokenizer(
        text_context,
        text_response,
        return_tensors="pt",
        max_length=32768,
        truncation=True,
    ).to("cuda")

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

        if exit_layer == 22:
            # Use the full model's classifier head
            logits = outputs.logits
        else:
            # Use the early exit adapter on that layer's hidden states
            hidden = outputs.hidden_states[exit_layer]
            logits = adapters[exit_layer](hidden)

    predictions = torch.argmax(logits, dim=-1)
    probs = torch.softmax(logits, dim=-1)
    return predictions, probs

# Example usage
context = "The contract specifies a 30-day notice period for termination."
response = "According to the contract, termination requires 60 days notice."

# Fast inference at L16 (97% relative accuracy, 1.4x speedup)
preds, probs = early_exit_predict(context, response, exit_layer=16)
print(f"Predictions: {preds}")
print(f"Max hallucination probability: {probs[0, :, 1].max():.2%}")
```
### Dynamic Early Exit (Adaptive)
```python
def dynamic_early_exit(text_context, text_response, thresholds=None):
    """
    Dynamically choose the exit layer based on confidence:
    exit early when confident, otherwise continue to deeper layers.
    """
    if thresholds is None:
        thresholds = {6: 0.95, 11: 0.9, 16: 0.85}

    inputs = tokenizer(
        text_context,
        text_response,
        return_tensors="pt",
        max_length=32768,
        truncation=True,
    ).to("cuda")

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

        for layer in [6, 11, 16]:
            hidden = outputs.hidden_states[layer]
            logits = adapters[layer](hidden)
            probs = torch.softmax(logits, dim=-1)
            confidence = probs.max(dim=-1).values.mean()
            if confidence >= thresholds[layer]:
                return torch.argmax(logits, dim=-1), layer, confidence.item()

    # Fall back to the full model
    return torch.argmax(outputs.logits, dim=-1), 22, 1.0

# Example
preds, exit_layer, conf = dynamic_early_exit(context, response)
print(f"Exited at layer {exit_layer} with confidence {conf:.2%}")
```

Note that these examples still run the full 22-layer forward pass and read `hidden_states` afterwards, so they demonstrate adapter accuracy rather than latency; realizing the speedups in the table above requires stopping the encoder at the exit layer (e.g. with forward hooks or a truncated layer stack).
## Model Architecture

```
┌──────────────────────────────────────────────────────────┐
│                       Input Tokens                       │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│               ModernBERT Encoder (Frozen)                │
│                                                          │
│  Layer 1-5:   [██████████████████████████████]           │
│                             │                            │
│  Layer 6:     [██████████████████████████████]──┬──► Adapter 6  ──► Exit (48% F1)
│                             │                   │                   3.9× speedup
│  Layer 7-10:  [██████████████████████████████]  │
│                             │                   │
│  Layer 11:    [██████████████████████████████]──┼──► Adapter 11 ──► Exit (81% F1)
│                             │                   │                   2.3× speedup
│  Layer 12-15: [██████████████████████████████]  │
│                             │                   │
│  Layer 16:    [██████████████████████████████]──┼──► Adapter 16 ──► Exit (96% F1)
│                             │                   │                   1.4× speedup
│  Layer 17-21: [██████████████████████████████]  │
│                             │                   │
│  Layer 22:    [██████████████████████████████]──┴──► Classifier ──► Exit (98% F1)
│                                                                     1.0× speedup
└──────────────────────────────────────────────────────────┘
```
## Training Details

### Base Model Training

- Extended from 8K to 32K tokens using YaRN RoPE scaling (see the simplified sketch below)
- Fine-tuned on the RAGTruth dataset for hallucination detection
- Achieves 77.0% Example F1
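Intuitively, the extension rescales RoPE so that 32K positions map into the frequency range seen during 8K pretraining. A simplified sketch (plain position interpolation only; YaRN additionally interpolates per-frequency and rescales attention temperature, which this omits):

```python
import torch

# Simplified sketch of RoPE context extension -- not the exact YaRN formula.
dim, base = 64, 10000.0
orig_len, new_len = 8192, 32768
factor = new_len / orig_len  # 4x extension

# Standard RoPE inverse frequencies for a head dimension of 64.
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
positions = torch.arange(new_len).float()

# Interpolation: positions beyond 8K are squeezed into the trained range.
angles = torch.outer(positions / factor, inv_freq)
print(angles.shape)  # torch.Size([32768, 32])
```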
### Early Exit Adapter Training

- Method: Self-distillation from Layer 22 to earlier layers (loss sketched below)
- Adapters: Lightweight bottleneck adapters (256-dim) at layers 6, 11, 16
- Loss: KL divergence + task loss
- Training data: RAGTruth + a long-context hallucination benchmark
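A minimal sketch of such a distillation objective (the temperature, mixing weight, and names are illustrative assumptions, not the exact training recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KL term against the frozen Layer-22 teacher plus the token task loss.
    Temperature T and mixing weight alpha are illustrative choices."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard task loss on gold token labels (0 = faithful, 1 = hallucinated).
    ce = F.cross_entropy(student_logits.view(-1, 2), labels.view(-1))
    return alpha * kl + (1 - alpha) * ce
```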
## Files in This Repository

| File | Description |
|---|---|
| `early_exit_adapters.pt` | PyTorch weights for early exit adapters (1.5MB) |
| `config.json` | Model configuration and performance metrics |
| `inference.py` | Example inference code |
## Limitations

- Language: Primarily trained on English data
- Domain: Best performance on factual/encyclopedic content
- Memory: Full 32K context requires ~8GB of GPU memory
- Calibration: Early exit thresholds may need task-specific tuning (see the sketch below)
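One simple calibration recipe is to sweep confidence thresholds on held-out data and keep the lowest threshold at which early-exit predictions still agree with the full model (a sketch; `val_pairs` and the agreement target are hypothetical, and `early_exit_predict` is defined in Usage above):

```python
import numpy as np

# Sketch: tune a per-layer confidence threshold against the full model.
# `val_pairs` is a hypothetical list of (context, response) examples.
def calibrate_threshold(layer, val_pairs, target_agreement=0.97):
    confidences, agreements = [], []
    for ctx, resp in val_pairs:
        early_preds, probs = early_exit_predict(ctx, resp, exit_layer=layer)
        full_preds, _ = early_exit_predict(ctx, resp, exit_layer=22)
        confidences.append(probs.max(dim=-1).values.mean().item())
        agreements.append((early_preds == full_preds).float().mean().item())

    # Lowest threshold where examples that would exit early match the full model.
    for threshold in np.arange(0.80, 1.00, 0.01):
        kept = [a for c, a in zip(confidences, agreements) if c >= threshold]
        if kept and sum(kept) / len(kept) >= target_agreement:
            return float(threshold)
    return 1.0  # never exit early at this layer
```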
## Citation

```bibtex
@article{modernbert-32k-hallucination,
  title={Fast and Faithful: Long-Context Hallucination Detection with Early Exit Adapters},
  author={Anonymous},
  year={2026},
  note={Under review}
}
```
## License
MIT License