# 🥬 ModernBERT-base-32k Hallucination Detector
A hallucination detection model fine-tuned on the RAGTruth dataset with Data2txt augmentation, built on an extended 32K-context ModernBERT. It is designed specifically for long documents that exceed 8K tokens.
## Why 32K Context Matters
| Scenario | 8K Model | 32K Model |
|---|---|---|
| 15K-token legal contract | ❌ Truncates 47% | ✅ Full context |
| Multi-document RAG | ❌ Loses evidence | ✅ Sees all docs |
| Long-form summarization | ❌ Misses details | ✅ Complete view |
## Performance
### RAGTruth Benchmark (Standard, <3K tokens)
Evaluated on the RAGTruth test set (2,700 samples):
| Metric | This Model | LettuceDetect BASE | LettuceDetect LARGE |
|---|---|---|---|
| Example-Level F1 | 76.56% ✅ | 75.99% | 79.22% |
| Token-Level F1 | 53.77% | 56.27% | - |
| Context Window | 32K | 8K | 8K |
✅ Exceeds LettuceDetect BASE on short documents while supporting a 4× longer context
### Long-Context Benchmark (8K-24K tokens)
Evaluated on llm-semantic-router/longcontext-haldetect (337 test samples, avg 17,550 tokens):
| Metric | 32K ModernBERT | 8K LettuceDetect | Improvement |
|---|---|---|---|
| Samples Truncated | 0 (0%) | 320 (95%) | -95% |
| Hallucination Recall | 0.547 | 0.056 | +877% |
| Hallucination F1 | 0.499 | 0.101 | +393% |
## Model Description
This model detects hallucinations in LLM-generated text by classifying each token as either Supported (grounded in context) or Hallucinated (not supported by context).
### Key Features
- 32K Context Window: Built on llm-semantic-router/modernbert-base-32k with YaRN RoPE scaling (a quick configuration check follows this list)
- Token-Level Classification: Identifies specific spans that are hallucinated
- RAG Optimized: Trained on RAGTruth benchmark for RAG applications
- Data2txt Augmentation: Enhanced with DART and E2E datasets for better structured data handling
- Long Document Support: Handles legal contracts, financial reports, research papers
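A quick way to confirm the extended window after loading is to inspect the model configuration. This is a minimal sketch and assumes the checkpoint exposes `max_position_embeddings` the way stock ModernBERT does:

```python
from transformers import AutoConfig

# Sanity-check the extended context window (attribute name follows stock ModernBERT).
config = AutoConfig.from_pretrained(
    "llm-semantic-router/modernbert-base-32k-haldetect", trust_remote_code=True
)
print(config.max_position_embeddings)  # expected to report the ~32K window
```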
## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "llm-semantic-router/modernbert-base-32k-haldetect"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, trust_remote_code=True)

# Format: context + question + answer
text = """Context: The Eiffel Tower is located in Paris, France. It was completed in 1889.
Question: Where is the Eiffel Tower and when was it built?
Answer: The Eiffel Tower is located in London, England and was completed in 1920."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=24000)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# 0 = Supported, 1 = Hallucinated
# Tokens for "London, England" and "1920" will be marked as hallucinated
```
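Continuing from the snippet above, one way to turn the per-token predictions into readable spans is to re-tokenize with offset mapping (requires a fast tokenizer) and merge consecutive tokens labeled 1. This is an illustrative sketch, not part of the model's API; the merging heuristic and variable names are assumptions:

```python
# Map tokens predicted as hallucinated (label 1) back to character spans in `text`.
enc = tokenizer(text, return_tensors="pt", truncation=True,
                max_length=24000, return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    labels = model(**enc).logits.argmax(dim=-1)[0].tolist()

spans, current = [], None
for (start, end), label in zip(offsets, labels):
    if start == end:                      # special tokens have empty offsets
        continue
    if label == 1:                        # hallucinated token
        if current is not None and start <= current[1] + 1:
            current[1] = end              # extend the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
    elif current is not None:
        spans.append(tuple(current))
        current = None
if current is not None:
    spans.append(tuple(current))

for start, end in spans:
    print("hallucinated:", text[start:end])
```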
### With the LettuceDetect Library
```python
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="llm-semantic-router/modernbert-base-32k-haldetect",
    max_length=24000,  # use the extended context
)

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889."
question = "Where is the Eiffel Tower?"
answer = "The Eiffel Tower is located in London, England."

spans = detector.predict(context, question, answer)
# Returns: [{"text": "London, England", "start": 35, "end": 50, "confidence": 0.95}]
```
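The returned dictionaries can be reported directly; this short loop, continuing from the example above, only relies on the fields shown in that return format:

```python
# Print each detected hallucination with its character range and confidence.
for span in spans:
    print(f"hallucinated: {span['text']!r} "
          f"(chars {span['start']}-{span['end']}, confidence {span['confidence']:.2f})")
```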
## Training Details
### Datasets
| Dataset | Samples | Task Type | Description |
|---|---|---|---|
| RAGTruth | 17,790 | QA, Summary, Data2txt | Human-annotated hallucination spans |
| DART | 2,000 | Data2txt | LLM-generated structured data responses |
| E2E | 1,500 | Data2txt | LLM-generated restaurant descriptions |
| Total | 21,290 | Mixed | Balanced task distribution |
The DART and E2E datasets were synthetically generated using Qwen2.5-72B-Instruct to create both faithful and intentionally hallucinated responses from structured data, then LLM-annotated for span-level hallucinations.
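Before training, span annotations like these are converted into per-token 0/1 labels. The sketch below shows one way to do that conversion with offset mapping; the helper name, the -100 ignore label, and the example span are illustrative assumptions, not the actual preprocessing code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/modernbert-base-32k", trust_remote_code=True
)

def spans_to_token_labels(text: str, hallucinated_spans: list[tuple[int, int]],
                          max_length: int = 8192) -> list[int]:
    """Label each token 1 if it overlaps an annotated hallucination span, else 0."""
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True)
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:            # special tokens: ignored by the loss
            labels.append(-100)
            continue
        inside = any(start < s_end and end > s_start
                     for s_start, s_end in hallucinated_spans)
        labels.append(1 if inside else 0)
    return labels

text = "Context: ... Answer: The tower was completed in 1920."
span = (text.find("1920"), text.find("1920") + len("1920"))
labels = spans_to_token_labels(text, [span])
```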
### Configuration
```yaml
base_model: llm-semantic-router/modernbert-base-32k
max_length: 8192
batch_size: 32
learning_rate: 1e-5
epochs: 6
loss: CrossEntropyLoss (weighted)
scheduler: None (constant LR)
early_stopping_patience: 4
```
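A minimal fine-tuning sketch that mirrors this configuration (constant 1e-5 learning rate, 6 epochs, weighted CrossEntropyLoss). The single toy example and the [1.0, 10.0] class weights are placeholders, not the actual training data or weighting:

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "llm-semantic-router/modernbert-base-32k"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    base, num_labels=2, trust_remote_code=True
).to(device)

# Toy batch: one example with every token marked Supported (label 0).
enc = tokenizer("Context: ... Answer: ...", return_tensors="pt",
                truncation=True, max_length=8192)
enc = {k: v.to(device) for k, v in enc.items()}
labels = torch.zeros_like(enc["input_ids"])

optimizer = AdamW(model.parameters(), lr=1e-5)  # constant LR, no scheduler
loss_fn = CrossEntropyLoss(weight=torch.tensor([1.0, 10.0], device=device))  # assumed weights

model.train()
for epoch in range(6):
    logits = model(**enc).logits                # (batch, seq_len, num_labels)
    loss = loss_fn(logits.reshape(-1, 2), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```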
### Hardware
- AMD Instinct MI300X GPU (192GB HBM3) - Trained entirely on AMD ROCm
- Training time: ~17 minutes (6 epochs)
- Framework: PyTorch 2.9 + HuggingFace Transformers on ROCm 7.0
## When to Use This Model
| Use Case | Recommended Model |
|---|---|
| Documents > 8K tokens | ✅ This model |
| Multi-document RAG | ✅ This model |
| Legal/Financial docs | ✅ This model |
| Structured data (tables, lists) | ✅ This model |
| Short QA (<3K tokens) | Either model works |
| Speed critical | 8K model (faster) |
## Limitations
- Trained primarily on English text
- Best performance on RAG-style prompts (context + question + answer format)
- Longer contexts require more GPU memory; loading in half precision (sketched below this list) reduces the footprint
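A minimal sketch of half-precision loading; fp16 here is an assumption for saving memory, not the published inference recipe:

```python
import torch
from transformers import AutoModelForTokenClassification

# Load weights in fp16 to roughly halve GPU memory for long-context inference.
model = AutoModelForTokenClassification.from_pretrained(
    "llm-semantic-router/modernbert-base-32k-haldetect",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
```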
## Related Resources
### Datasets
- Long-Context Benchmark: llm-semantic-router/longcontext-haldetect - 3,366 samples, 8K-24K tokens
- DART Hallucination Spans: llm-semantic-router/dart-halspans - 2,000 Data2txt samples
- E2E Hallucination Spans: llm-semantic-router/e2e-halspans - 1,500 restaurant descriptions
### Models
- Base Model: llm-semantic-router/modernbert-base-32k - Extended ModernBERT
- Combined Model: modernbert-base-32k-haldetect-combined - Trained on RAGTruth + HaluEval
## Citation
```bibtex
@misc{modernbert-32k-haldetect,
  title={ModernBERT-32K Hallucination Detector with Data2txt Augmentation},
  author={LLM Semantic Router Team},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect}
}
```
## Acknowledgments
- Built on LettuceDetect framework
- Uses ModernBERT architecture
- Trained on RAGTruth dataset
- Data2txt augmentation from DART and E2E datasets