# 🥬 ModernBERT-base-32k Hallucination Detector
A hallucination detection model fine-tuned on the RAGTruth dataset with Data2txt augmentation, built on an extended 32K-context ModernBERT. It is designed specifically for long documents that exceed 8K tokens.
## Why 32K Context Matters
| Scenario | 8K Model | 32K Model |
|---|---|---|
| 15K-token legal contract | ❌ Truncates 47% | ✅ Full context |
| Multi-document RAG | ❌ Loses evidence | ✅ Sees all docs |
| Long-form summarization | ❌ Misses details | ✅ Complete view |
## Performance
### RAGTruth Benchmark (Standard, <3K tokens)
Evaluated on the RAGTruth test set (2,700 samples):
| Metric | This Model | LettuceDetect BASE | LettuceDetect LARGE |
|---|---|---|---|
| Example-Level F1 | 76.56% ✅ | 75.99% | 79.22% |
| Token-Level F1 | 53.77% | 56.27% | - |
| Context Window | 32K | 8K | 8K |
✅ Exceeds LettuceDetect BASE on short documents while supporting a 4× longer context
### Long-Context Benchmark (8K-24K tokens)
Evaluated on llm-semantic-router/longcontext-haldetect (337 test samples, avg 17,550 tokens):
| Metric | 32K ModernBERT | 8K LettuceDetect | Improvement |
|---|---|---|---|
| Samples Truncated | 0 (0%) | 320 (95%) | -95% |
| Hallucination Recall | 0.547 | 0.056 | +877% |
| Hallucination F1 | 0.499 | 0.101 | +393% |
## Model Description
This model detects hallucinations in LLM-generated text by classifying each token as either Supported (grounded in context) or Hallucinated (not supported by context).
### Key Features
- 32K Context Window: Built on llm-semantic-router/modernbert-base-32k with YaRN RoPE scaling (a quick configuration check follows this list)
- Token-Level Classification: Identifies specific spans that are hallucinated
- RAG Optimized: Trained on RAGTruth benchmark for RAG applications
- Data2txt Augmentation: Enhanced with DART and E2E datasets for better structured data handling
- Long Document Support: Handles legal contracts, financial reports, research papers
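A quick way to confirm the extended window after loading is to inspect the model configuration. This is a minimal sketch and assumes the checkpoint exposes `max_position_embeddings` the way stock ModernBERT does:

```python
from transformers import AutoConfig

# Sanity-check the extended context window (attribute name follows stock ModernBERT).
config = AutoConfig.from_pretrained(
    "llm-semantic-router/modernbert-base-32k-haldetect", trust_remote_code=True
)
print(config.max_position_embeddings)  # expected to report the ~32K window
```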
## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "llm-semantic-router/modernbert-base-32k-haldetect"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, trust_remote_code=True)

# Format: context + question + answer
text = """Context: The Eiffel Tower is located in Paris, France. It was completed in 1889.
Question: Where is the Eiffel Tower and when was it built?
Answer: The Eiffel Tower is located in London, England and was completed in 1920."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=24000)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# 0 = Supported, 1 = Hallucinated
# Tokens for "London, England" and "1920" will be marked as hallucinated
```
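Continuing from the snippet above, one way to turn the per-token predictions into readable spans is to re-tokenize with offset mapping (requires a fast tokenizer) and merge consecutive tokens labeled 1. This is an illustrative sketch, not part of the model's API; the merging heuristic and variable names are assumptions:

```python
# Map tokens predicted as hallucinated (label 1) back to character spans in `text`.
enc = tokenizer(text, return_tensors="pt", truncation=True,
                max_length=24000, return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    labels = model(**enc).logits.argmax(dim=-1)[0].tolist()

spans, current = [], None
for (start, end), label in zip(offsets, labels):
    if start == end:                      # special tokens have empty offsets
        continue
    if label == 1:                        # hallucinated token
        if current is not None and start <= current[1] + 1:
            current[1] = end              # extend the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
    elif current is not None:
        spans.append(tuple(current))
        current = None
if current is not None:
    spans.append(tuple(current))

for start, end in spans:
    print("hallucinated:", text[start:end])
```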
### With the LettuceDetect Library
```python
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="llm-semantic-router/modernbert-base-32k-haldetect",
    max_length=24000,  # use the extended context
)

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889."
question = "Where is the Eiffel Tower?"
answer = "The Eiffel Tower is located in London, England."

spans = detector.predict(context, question, answer)
# Returns: [{"text": "London, England", "start": 35, "end": 50, "confidence": 0.95}]
```
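The returned dictionaries can be reported directly; this short loop, continuing from the example above, only relies on the fields shown in that return format:

```python
# Print each detected hallucination with its character range and confidence.
for span in spans:
    print(f"hallucinated: {span['text']!r} "
          f"(chars {span['start']}-{span['end']}, confidence {span['confidence']:.2f})")
```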
## Training Details
### Datasets
| Dataset | Samples | Task Type | Description |
|---|---|---|---|
| RAGTruth | 17,790 | QA, Summary, Data2txt | Human-annotated hallucination spans |
| DART | 2,000 | Data2txt | LLM-generated structured data responses |
| E2E | 1,500 | Data2txt | LLM-generated restaurant descriptions |
| Total | 21,290 | Mixed | Balanced task distribution |
The DART and E2E datasets were synthetically generated using Qwen2.5-72B-Instruct to create both faithful and intentionally hallucinated responses from structured data, then LLM-annotated for span-level hallucinations.
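Before training, span annotations like these are converted into per-token 0/1 labels. The sketch below shows one way to do that conversion with offset mapping; the helper name, the -100 ignore label, and the example span are illustrative assumptions, not the actual preprocessing code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/modernbert-base-32k", trust_remote_code=True
)

def spans_to_token_labels(text: str, hallucinated_spans: list[tuple[int, int]],
                          max_length: int = 8192) -> list[int]:
    """Label each token 1 if it overlaps an annotated hallucination span, else 0."""
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True)
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:            # special tokens: ignored by the loss
            labels.append(-100)
            continue
        inside = any(start < s_end and end > s_start
                     for s_start, s_end in hallucinated_spans)
        labels.append(1 if inside else 0)
    return labels

text = "Context: ... Answer: The tower was completed in 1920."
span = (text.find("1920"), text.find("1920") + len("1920"))
labels = spans_to_token_labels(text, [span])
```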
### Configuration
```yaml
base_model: llm-semantic-router/modernbert-base-32k
max_length: 8192
batch_size: 32
learning_rate: 1e-5
epochs: 6
loss: CrossEntropyLoss (weighted)
scheduler: None (constant LR)
early_stopping_patience: 4
```
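A minimal fine-tuning sketch that mirrors this configuration (constant 1e-5 learning rate, 6 epochs, weighted CrossEntropyLoss). The single toy example and the [1.0, 10.0] class weights are placeholders, not the actual training data or weighting:

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "llm-semantic-router/modernbert-base-32k"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    base, num_labels=2, trust_remote_code=True
).to(device)

# Toy batch: one example with every token marked Supported (label 0).
enc = tokenizer("Context: ... Answer: ...", return_tensors="pt",
                truncation=True, max_length=8192)
enc = {k: v.to(device) for k, v in enc.items()}
labels = torch.zeros_like(enc["input_ids"])

optimizer = AdamW(model.parameters(), lr=1e-5)  # constant LR, no scheduler
loss_fn = CrossEntropyLoss(weight=torch.tensor([1.0, 10.0], device=device))  # assumed weights

model.train()
for epoch in range(6):
    logits = model(**enc).logits                # (batch, seq_len, num_labels)
    loss = loss_fn(logits.reshape(-1, 2), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```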
### Hardware
- AMD Instinct MI300X GPU (192GB HBM3) - Trained entirely on AMD ROCm
- Training time: ~17 minutes (6 epochs)
- Framework: PyTorch 2.9 + HuggingFace Transformers on ROCm 7.0
## When to Use This Model
| Use Case | Recommended Model |
|---|---|
| Documents > 8K tokens | ✅ This model |
| Multi-document RAG | ✅ This model |
| Legal/Financial docs | ✅ This model |
| Structured data (tables, lists) | ✅ This model |
| Short QA (<3K tokens) | Either model works |
| Speed critical | 8K model (faster) |
## Limitations
- Trained primarily on English text
- Best performance on RAG-style prompts (context + question + answer format)
- Longer contexts require more GPU memory; loading in half precision (sketched below this list) reduces the footprint
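A minimal sketch of half-precision loading; fp16 here is an assumption for saving memory, not the published inference recipe:

```python
import torch
from transformers import AutoModelForTokenClassification

# Load weights in fp16 to roughly halve GPU memory for long-context inference.
model = AutoModelForTokenClassification.from_pretrained(
    "llm-semantic-router/modernbert-base-32k-haldetect",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
```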
## Related Resources
### Datasets
- Long-Context Benchmark: llm-semantic-router/longcontext-haldetect - 3,366 samples, 8K-24K tokens
- DART Hallucination Spans: llm-semantic-router/dart-halspans - 2,000 Data2txt samples
- E2E Hallucination Spans: llm-semantic-router/e2e-halspans - 1,500 restaurant descriptions
### Models
- Base Model: llm-semantic-router/modernbert-base-32k - Extended ModernBERT
- Combined Model: modernbert-base-32k-haldetect-combined - Trained on RAGTruth + HaluEval
## Citation
```bibtex
@misc{modernbert-32k-haldetect,
  title={ModernBERT-32K Hallucination Detector with Data2txt Augmentation},
  author={LLM Semantic Router Team},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect}
}
```
## Acknowledgments
- Built on LettuceDetect framework
- Uses ModernBERT architecture
- Trained on RAGTruth dataset
- Data2txt augmentation from DART and E2E datasets