---
license: apache-2.0
language:
- en
tags:
- modernbert
- hallucination-detection
- rag
- fact-checking
- long-context
- 32k
- amd
- rocm
- mi300x
datasets:
- llm-semantic-router/longcontext-haldetect
- llm-semantic-router/dart-halspans
- llm-semantic-router/e2e-halspans
base_model:
- llm-semantic-router/modernbert-base-32k
pipeline_tag: token-classification
model-index:
- name: modernbert-base-32k-haldetect
  results:
  - task:
      type: token-classification
      name: Hallucination Detection
    dataset:
      name: RAGTruth Test Set
      type: ragtruth
    metrics:
    - name: Example-Level F1
      type: f1
      value: 76.56
    - name: Token-Level F1
      type: f1
      value: 53.77
  - task:
      type: token-classification
      name: Long-Context Hallucination Detection
    dataset:
      name: Long-Context Benchmark (8K-24K tokens)
      type: llm-semantic-router/longcontext-haldetect
    metrics:
    - name: Hallucination F1
      type: f1
      value: 49.86
---

# 🥬 ModernBERT-base-32k Hallucination Detector

A hallucination detection model fine-tuned on the RAGTruth dataset with Data2txt augmentation, built on the extended 32K-context ModernBERT.
**Specifically designed for long documents that exceed 8K tokens.**

## 🚀 Why 32K Context Matters

| Scenario | 8K Model | 32K Model |
|----------|----------|-----------|
| 15K-token legal contract | ❌ Truncates 47% | ✅ Full context |
| Multi-document RAG | ❌ Loses evidence | ✅ Sees all docs |
| Long-form summarization | ❌ Misses details | ✅ Complete view |

## Performance

### RAGTruth Benchmark (Standard, <3K tokens)

Evaluated on the RAGTruth test set (2,700 samples):

| Metric | This Model | LettuceDetect BASE | LettuceDetect LARGE |
|--------|------------|--------------------|---------------------|
| **Example-Level F1** | **76.56%** ✅ | 75.99% | 79.22% |
| Token-Level F1 | 53.77% | 56.27% | - |
| Context Window | **32K** | 8K | 8K |

✅ **Exceeds LettuceDetect BASE** on short documents while supporting **4x longer context**.

### Long-Context Benchmark (8K-24K tokens)

Evaluated on [llm-semantic-router/longcontext-haldetect](https://huggingface.co/datasets/llm-semantic-router/longcontext-haldetect) (337 test samples, avg. 17,550 tokens):

| Metric | 32K ModernBERT | 8K LettuceDetect | Improvement |
|--------|----------------|------------------|-------------|
| **Samples Truncated** | 0 (0%) | 320 (95%) | **-95%** |
| Hallucination Recall | 0.547 | 0.056 | **+877%** |
| **Hallucination F1** | **0.499** | 0.101 | **+393%** |

## Model Description

This model detects hallucinations in LLM-generated text by classifying each token as either **Supported** (grounded in the context) or **Hallucinated** (not supported by the context).
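Since the model emits one label per token, downstream code typically merges runs of consecutive Hallucinated tokens back into character-level spans via the tokenizer's offset mapping (`return_offsets_mapping=True`). A minimal sketch of that merging step; the `merge_hallucinated_spans` helper and the hand-written toy offsets/labels below are illustrative, not part of the model's API:

```python
def merge_hallucinated_spans(offsets, labels, text):
    """Merge consecutive tokens labeled 1 (Hallucinated) into character spans.

    offsets: per-token (start, end) character offsets, e.g. from
             tokenizer(..., return_offsets_mapping=True)
    labels:  per-token 0/1 predictions (0 = Supported, 1 = Hallucinated)
    text:    the original string the offsets index into
    """
    spans, current = [], None
    for (start, end), label in zip(offsets, labels):
        if start == end:           # special tokens have empty offsets; skip them
            continue
        if label == 1:
            if current is None:    # open a new span
                current = [start, end]
            else:                  # extend the currently open span
                current[1] = end
        elif current is not None:  # a supported token closes the open span
            spans.append({"text": text[current[0]:current[1]],
                          "start": current[0], "end": current[1]})
            current = None
    if current is not None:        # flush a span that runs to the end
        spans.append({"text": text[current[0]:current[1]],
                      "start": current[0], "end": current[1]})
    return spans

# Toy example with hand-written offsets and labels:
answer = "The tower is in London, England."
offsets = [(0, 3), (4, 9), (10, 12), (13, 15), (16, 22), (22, 23), (24, 31), (31, 32)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]
spans = merge_hallucinated_spans(offsets, labels, answer)
print(spans)  # [{'text': 'London, England', 'start': 16, 'end': 31}]
```

In real use the offsets come from the same tokenizer call that produced the model inputs, and the labels from `outputs.logits.argmax(dim=-1)` as shown in the Usage section.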
### Key Features

- **32K Context Window**: Built on [llm-semantic-router/modernbert-base-32k](https://huggingface.co/llm-semantic-router/modernbert-base-32k) with YaRN RoPE scaling
- **Token-Level Classification**: Identifies the specific spans that are hallucinated
- **RAG Optimized**: Trained on the RAGTruth benchmark for RAG applications
- **Data2txt Augmentation**: Enhanced with the DART and E2E datasets for better handling of structured data
- **Long Document Support**: Handles legal contracts, financial reports, research papers

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "llm-semantic-router/modernbert-base-32k-haldetect"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, trust_remote_code=True)

# Format: context + question + answer
text = """Context: The Eiffel Tower is located in Paris, France. It was completed in 1889.

Question: Where is the Eiffel Tower and when was it built?

Answer: The Eiffel Tower is located in London, England and was completed in 1920."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=24000)

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(dim=-1)  # 0 = Supported, 1 = Hallucinated
# Tokens for "London, England" and "1920" will be marked as hallucinated
```

### With LettuceDetect Library

```python
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="llm-semantic-router/modernbert-base-32k-haldetect",
    max_length=24000,  # use the extended context
)

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889."
question = "Where is the Eiffel Tower?"
answer = "The Eiffel Tower is located in London, England."
spans = detector.predict(
    context=[context],  # LettuceDetect expects a list of context passages
    question=question,
    answer=answer,
    output_format="spans",
)
# e.g. [{"text": "London, England", "start": 31, "end": 46, "confidence": 0.95}]
```

## Training Details

### Datasets

| Dataset | Samples | Task Type | Description |
|---------|---------|-----------|-------------|
| **[RAGTruth](https://github.com/ParticleMedia/RAGTruth)** | 17,790 | QA, Summary, Data2txt | Human-annotated hallucination spans |
| **[DART](https://huggingface.co/datasets/llm-semantic-router/dart-halspans)** | 2,000 | Data2txt | LLM-generated structured-data responses |
| **[E2E](https://huggingface.co/datasets/llm-semantic-router/e2e-halspans)** | 1,500 | Data2txt | LLM-generated restaurant descriptions |
| **Total** | 21,290 | Mixed | Balanced task distribution |

The DART and E2E datasets were synthetically generated with Qwen2.5-72B-Instruct, which produced both faithful and intentionally hallucinated responses from structured data; the responses were then LLM-annotated for span-level hallucinations.

### Configuration

```yaml
base_model: llm-semantic-router/modernbert-base-32k
max_length: 8192
batch_size: 32
learning_rate: 1e-5
epochs: 6
loss: CrossEntropyLoss (weighted)
scheduler: None (constant LR)
early_stopping_patience: 4
```

### Hardware

- **AMD Instinct MI300X GPU** (192GB HBM3)
- Trained entirely on AMD ROCm
- Training time: ~17 minutes (6 epochs)
- Framework: PyTorch 2.9 + HuggingFace Transformers on ROCm 7.0

## When to Use This Model

| Use Case | Recommended Model |
|----------|-------------------|
| Documents > 8K tokens | ✅ **This model** |
| Multi-document RAG | ✅ **This model** |
| Legal/Financial docs | ✅ **This model** |
| Structured data (tables, lists) | ✅ **This model** |
| Short QA (<3K tokens) | Either model works |
| Speed critical | 8K model (faster) |

## Limitations

- Trained primarily on English text
- Best performance on RAG-style prompts (context + question + answer format)
- Longer contexts require more GPU memory

## Related Resources

### Datasets

- **Long-Context Benchmark**: [llm-semantic-router/longcontext-haldetect](https://huggingface.co/datasets/llm-semantic-router/longcontext-haldetect) - 3,366 samples, 8K-24K tokens
- **DART Hallucination Spans**: [llm-semantic-router/dart-halspans](https://huggingface.co/datasets/llm-semantic-router/dart-halspans) - 2,000 Data2txt samples
- **E2E Hallucination Spans**: [llm-semantic-router/e2e-halspans](https://huggingface.co/datasets/llm-semantic-router/e2e-halspans) - 1,500 restaurant descriptions

### Models

- **Base Model**: [llm-semantic-router/modernbert-base-32k](https://huggingface.co/llm-semantic-router/modernbert-base-32k) - Extended ModernBERT
- **Combined Model**: [modernbert-base-32k-haldetect-combined](https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect-combined) - Trained on RAGTruth + HaluEval

## Citation

```bibtex
@misc{modernbert-32k-haldetect,
  title={ModernBERT-32K Hallucination Detector with Data2txt Augmentation},
  author={LLM Semantic Router Team},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect}
}
```

## Acknowledgments

- Built on the [LettuceDetect](https://github.com/KRLabTech/LettuceDetect) framework
- Uses the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture
- Trained on the [RAGTruth](https://github.com/ParticleMedia/RAGTruth) dataset
- Data2txt augmentation from the [DART](https://github.com/Yale-LILY/dart) and [E2E](https://github.com/tuetschek/e2e-dataset) datasets