We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed for training the next generation of foundation models (LLMs) from scratch. Developed at SKT AI LABS, this corpus is more than a collection of data; it is part of a mission to decentralize high-grade AI training for regional languages and global knowledge.
💎 Key Highlights:
• Massive Scale: A multi-terabyte corpus targeting the 146-trillion-token (146T) level.
• Pure Quality: Curated from 500+ elite sources.
• Structured for MoE: Sharded into standardized 3.5 GB units (SKT-𝕻 series) for seamless distributed training (see the sketch below).
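For readers curious what sharding into ~3.5 GB units might look like in practice, here is a minimal, hypothetical Python sketch that splits a JSONL corpus into shards of roughly that size. The file names, paths, and format are illustrative assumptions, not the actual SKT pipeline.

```python
# Hypothetical sketch: split a large JSONL corpus into ~3.5 GB shard files,
# roughly the standardized unit size described above. Paths and naming are
# illustrative only.
import os

SHARD_BYTES = int(3.5 * 1024**3)  # target shard size: ~3.5 GB

def shard_corpus(src_path: str, out_dir: str) -> None:
    """Stream a JSONL corpus and rewrite it as ~3.5 GB shard files."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, written = 0, 0
    out = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.jsonl"), "wb")
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new shard once the current one would exceed the target size.
            if written + len(line) > SHARD_BYTES and written > 0:
                out.close()
                shard_idx, written = shard_idx + 1, 0
                out = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.jsonl"), "wb")
            out.write(line)
            written += len(line)
    out.close()

if __name__ == "__main__":
    shard_corpus("corpus.jsonl", "shards/")  # illustrative paths
```

Fixed-size shards like these keep per-worker I/O balanced, which is why they pair well with distributed MoE training loops.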
🤝 Open for Collaboration!
We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design—let’s build the future together.
Introducing GRM2, a powerful 3-billion-parameter model designed for long-form reasoning and strong performance on complex tasks.
Despite having only 3 billion parameters, it outperforms Qwen3-32B on several benchmarks and complex reasoning tasks.
It can also generate extensive, complex code of over 1,000 lines and use tools at a level comparable to much larger models, making it well suited to agentic tasks.
GRM2 is licensed under Apache 2.0, making it an ideal base for fine-tuning on other tasks. You can see more here: OrionLLM/GRM2-3b
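As a quick illustration, here is a minimal usage sketch that assumes the OrionLLM/GRM2-3b checkpoint on the Hugging Face Hub exposes the standard transformers causal-LM and chat-template interface; check the model card for the actual prompt format and recommended generation settings.

```python
# Minimal usage sketch (assumes OrionLLM/GRM2-3b follows the standard
# transformers causal-LM and chat-template interface; this is not an
# official example from the model authors).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionLLM/GRM2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A simple code-generation prompt, in line with the coding claims above.
messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```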
Nanochat Moroccan is the first language model family built specifically for Moroccan Darija.
This project brings together a small family of models and datasets centered on Darija, with the goal of building something genuinely useful for a language that is still underserved in AI.
Moroccan Darija is spoken by millions of people, yet it remains underrepresented in language technology. Nanochat Moroccan is a step toward building tools that take the language seriously.