πŸ₯¬ ModernBERT-base-32k Hallucination Detector

A hallucination detection model fine-tuned on the RAGTruth dataset with Data2txt augmentation, built on a ModernBERT backbone extended to a 32K-token context. It is designed specifically for long documents that exceed 8K tokens.

πŸš€ Why 32K Context Matters

| Scenario | 8K Model | 32K Model |
|---|---|---|
| 15K-token legal contract | ❌ Truncates 47% | βœ… Full context |
| Multi-document RAG | ❌ Loses evidence | βœ… Sees all docs |
| Long-form summarization | ❌ Misses details | βœ… Complete view |

Performance

RAGTruth Benchmark (Standard, <3K tokens)

Evaluated on RAGTruth test set (2,700 samples):

| Metric | This Model | LettuceDetect BASE | LettuceDetect LARGE |
|---|---|---|---|
| Example-Level F1 | 76.56% βœ… | 75.99% | 79.22% |
| Token-Level F1 | 53.77% | 56.27% | - |
| Context Window | 32K | 8K | 8K |

βœ… Exceeds LettuceDetect BASE in example-level F1 on short documents while supporting a 4x longer context
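
To clarify the two metric granularities, here is a minimal sketch. It assumes the usual RAGTruth/LettuceDetect convention that an answer counts as hallucinated at the example level if it contains at least one hallucinated token; the actual evaluation script is not reproduced here.

# Sketch of example-level vs. token-level F1 (assumed convention, not the official script).
from sklearn.metrics import f1_score

# token_labels / token_preds: one 0/1 label per answer token, per example
token_labels = [[0, 0, 1, 1], [0, 0, 0, 0]]
token_preds  = [[0, 1, 1, 1], [0, 0, 0, 0]]

# Token-level F1: pool all tokens across examples
flat_labels = [t for ex in token_labels for t in ex]
flat_preds  = [t for ex in token_preds for t in ex]
print("token-level F1:", f1_score(flat_labels, flat_preds))

# Example-level F1: an example is "hallucinated" if any of its tokens is flagged
ex_labels = [int(any(ex)) for ex in token_labels]
ex_preds  = [int(any(ex)) for ex in token_preds]
print("example-level F1:", f1_score(ex_labels, ex_preds))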

Long-Context Benchmark (8K-24K tokens)

Evaluated on llm-semantic-router/longcontext-haldetect (337 test samples, avg 17,550 tokens):

| Metric | 32K ModernBERT | 8K LettuceDetect | Improvement |
|---|---|---|---|
| Samples Truncated | 0 (0%) | 320 (95%) | -95% |
| Hallucination Recall | 0.547 | 0.056 | +877% |
| Hallucination F1 | 0.499 | 0.101 | +393% |

Model Description

This model detects hallucinations in LLM-generated text by classifying each token as either Supported (grounded in context) or Hallucinated (not supported by context).

Key Features

  • 32K Context Window: Built on llm-semantic-router/modernbert-base-32k with YaRN RoPE scaling (see the quick check after this list)
  • Token-Level Classification: Identifies specific spans that are hallucinated
  • RAG Optimized: Trained on RAGTruth benchmark for RAG applications
  • Data2txt Augmentation: Enhanced with DART and E2E datasets for better structured data handling
  • Long Document Support: Handles legal contracts, financial reports, research papers
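
To confirm the extended window before feeding long documents, a quick config check is enough. This is a minimal sketch that assumes the base checkpoint exposes the window via max_position_embeddings.

from transformers import AutoConfig

# Verify the extended context window of the base model (assumption: reported
# via max_position_embeddings, as in the standard ModernBERT config).
cfg = AutoConfig.from_pretrained(
    "llm-semantic-router/modernbert-base-32k", trust_remote_code=True
)
print(cfg.max_position_embeddings)  # expected: 32768 for the 32K window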

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "llm-semantic-router/modernbert-base-32k-haldetect"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, trust_remote_code=True)

# Format: context + question + answer
text = """Context: The Eiffel Tower is located in Paris, France. It was completed in 1889.
Question: Where is the Eiffel Tower and when was it built?
Answer: The Eiffel Tower is located in London, England and was completed in 1920."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=24000)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# 0 = Supported, 1 = Hallucinated
# Tokens for "London, England" and "1920" will be marked as hallucinated
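
To turn token predictions into readable spans, you can align them with character offsets from the (fast) tokenizer. This is a generic Hugging Face pattern rather than an API specific to this model; the snippet continues from the example above.

# Map hallucinated tokens back to character spans in the input text.
enc = tokenizer(text, return_tensors="pt", truncation=True,
                max_length=24000, return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits[0]
labels = logits.argmax(dim=-1)

hallucinated_spans = [
    text[start:end]
    for (start, end), label in zip(offsets.tolist(), labels.tolist())
    if label == 1 and end > start  # skip special tokens with (0, 0) offsets
]
print(hallucinated_spans)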

With LettuceDetect Library

from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="llm-semantic-router/modernbert-base-32k-haldetect",
    max_length=24000  # Use extended context
)

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889."
question = "Where is the Eiffel Tower?"
answer = "The Eiffel Tower is located in London, England."

spans = detector.predict(
    context=[context],
    question=question,
    answer=answer,
    output_format="spans",
)
# Example output (offsets are relative to the answer):
# [{"text": "London, England", "start": 31, "end": 46, "confidence": 0.95}]

Training Details

Datasets

| Dataset | Samples | Task Type | Description |
|---|---|---|---|
| RAGTruth | 17,790 | QA, Summary, Data2txt | Human-annotated hallucination spans |
| DART | 2,000 | Data2txt | LLM-generated structured data responses |
| E2E | 1,500 | Data2txt | LLM-generated restaurant descriptions |
| Total | 21,290 | Mixed | Balanced task distribution |

The DART and E2E portions were created by prompting Qwen2.5-72B-Instruct to produce both faithful and intentionally hallucinated responses from structured data records, which were then LLM-annotated for span-level hallucinations.
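
The generation pipeline itself is not reproduced in this card; the sketch below only illustrates the idea. The endpoint, prompts, and helper function are assumptions, and the actual generation and annotation prompts used for this model may differ.

# Illustrative only: generating faithful vs. deliberately hallucinated Data2txt
# responses from a DART-style record. Endpoint, model name, and prompts are
# assumptions, not the released pipeline.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical local endpoint

record = [["Eiffel Tower", "location", "Paris"], ["Eiffel Tower", "completed", "1889"]]

def generate(mode: str) -> str:
    instruction = (
        "Write a short description using only the facts in the triples."
        if mode == "faithful"
        else "Write a short description, but deliberately change one fact."
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": f"{instruction}\nTriples: {record}"}],
    )
    return resp.choices[0].message.content

faithful = generate("faithful")
hallucinated = generate("hallucinated")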

Configuration

base_model: llm-semantic-router/modernbert-base-32k
max_length: 8192
batch_size: 32
learning_rate: 1e-5
epochs: 6
loss: CrossEntropyLoss (weighted)
scheduler: None (constant LR)
early_stopping_patience: 4
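
The training script is not included in this card; the sketch below shows one way the listed hyperparameters could map onto a standard Hugging Face token-classification setup. The class weights, dataset objects, and preprocessing are assumptions, not the released configuration.

import torch
from torch import nn
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "llm-semantic-router/modernbert-base-32k"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=2, trust_remote_code=True
)
# Context + question + answer pairs would be tokenized to max_length=8192 here (omitted).

class WeightedTrainer(Trainer):
    """Trainer with class-weighted CrossEntropyLoss (the weights below are hypothetical)."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=torch.tensor([1.0, 10.0], device=outputs.logits.device),  # assumed weights
            ignore_index=-100,
        )
        loss = loss_fct(outputs.logits.reshape(-1, 2), labels.reshape(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="modernbert-32k-haldetect",
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=6,
    lr_scheduler_type="constant",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=eval_ds,  # assumed datasets
#                           callbacks=[EarlyStoppingCallback(early_stopping_patience=4)])
# trainer.train()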

Hardware

  • AMD Instinct MI300X GPU (192GB HBM3) - Trained entirely on AMD ROCm
  • Training time: ~17 minutes (6 epochs)
  • Framework: PyTorch 2.9 + HuggingFace Transformers on ROCm 7.0

When to Use This Model

| Use Case | Recommended Model |
|---|---|
| Documents > 8K tokens | βœ… This model |
| Multi-document RAG | βœ… This model |
| Legal/Financial docs | βœ… This model |
| Structured data (tables, lists) | βœ… This model |
| Short QA (<3K tokens) | Either model works |
| Speed critical | 8K model (faster) |

Limitations

  • Trained primarily on English text
  • Best performance on RAG-style prompts (context + question + answer format)
  • Longer contexts require more GPU memory

Related Resources

Datasets

  • RAGTruth: human-annotated hallucination benchmark used for training and the standard evaluation
  • llm-semantic-router/longcontext-haldetect: long-context (8K-24K token) evaluation set used above

Models

  • llm-semantic-router/modernbert-base-32k: 32K-context base model this detector is fine-tuned from

Citation

@misc{modernbert-32k-haldetect,
  title={ModernBERT-32K Hallucination Detector with Data2txt Augmentation},
  author={LLM Semantic Router Team},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect}
}

Acknowledgments
