dataflare/df-arc

@@ -17,12 +17,12 @@ datasets:
 **DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.
-It achieves near 1:1 fertility (1.26) and high semantic density.
+It achieves near 1:1 fertility (1.16) and high semantic density.
 ## Key Highlights
 - **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
-- **Vocab Size**: 64,000 tokens.
+- **Vocab Size**: 128,000 tokens.
 - **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
 - **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
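
For context on the fertility figure changed above: fertility is commonly defined as the average number of tokens produced per whitespace-delimited word, so a value near 1.0 means almost one token per word. The sketch below shows one way such a number could be measured. It assumes that definition; the README does not state how 1.16 was computed. `fertility` and `toy_tokenize` are hypothetical names introduced for illustration, and `toy_tokenize` is a stand-in that only splits off the definite article, not DF-Arc's actual segmentation.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits a leading definite article off each word."""
    article = "ال"
    tokens: List[str] = []
    for word in text.split():
        if word.startswith(article) and len(word) > len(article):
            tokens.extend([article, word[len(article):]])
        else:
            tokens.append(word)
    return tokens


sample = ["الكتاب على الطاولة"]  # 3 words; the toy tokenizer yields 5 tokens
print(round(fertility(sample, toy_tokenize), 2))
```

To measure a real tokenizer, a callable such as the `tokenize` method of a loaded Hugging Face tokenizer could be passed in place of `toy_tokenize`, keeping in mind that SentencePiece-style tokenizers may emit extra marker pieces that affect the count.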