Decoding Text Spans for Efficient and Accurate Named-Entity Recognition Paper • 2604.20447 • Published 4 days ago • 2
GlotSuite Collection GlotSuite: Paving the Way for Bringing Generative AI to Underserved Communities • 17 items • Updated 10 days ago • 3
Article How we OCR'ed 30,000 papers using Codex, open OCR models and Jobs 18 days ago • 59
fiNERweb Collection A multilingual dataset for NER covering 91 languages and 25 scripts • 3 items • Updated Dec 16, 2025 • 3
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World Paper • 2603.19223 • Published Mar 19 • 31
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano and Super v3. • 28 items • Updated 5 days ago • 126
Nemotron-Cascade 2 Collection Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation • 4 items • Updated 5 days ago • 50
Article Efficient LLM Pretraining: Packed Sequences and Masked Attention Oct 7, 2024 • 70
Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA Paper • 2603.14782 • Published Mar 16 • 1
Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike Paper • 2603.15130 • Published Mar 16 • 1
Article FlashHead: Accelerating Language Model Inference ~ *Efficient drop-in replacement for the classification head* Mar 11 • 2