Update README.md
README.md
CHANGED
@@ -17,12 +17,12 @@ datasets:
 
 **DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.
 
-It achieves near 1:1 fertility (1.
+It achieves near 1:1 fertility (1.16) and high semantic density.
 
 ## Key Highlights
 
 - **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
-- **Vocab Size**:
+- **Vocab Size**: 128,000 tokens.
 - **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
 - **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
 
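
The fertility figure in the changed line is not defined in this diff. A common reading is the average number of tokens produced per whitespace-separated word, and the sketch below assumes that definition. The names `fertility` and `toy_tokenize` are hypothetical and exist only for illustration; the prefix rule in `toy_tokenize` is a stand-in, not DF-Arc's actual segmentation.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits a leading 'ال' off each word."""
    tokens: List[str] = []
    for word in text.split():
        if word.startswith("ال") and len(word) > 2:
            # Emit the definite-article prefix and the stem as separate tokens.
            tokens.extend(["ال", word[2:]])
        else:
            tokens.append(word)
    return tokens
```

To measure a real tokenizer under the same assumption, pass its `tokenize` method in place of `toy_tokenize`, for example the `tokenize` method of a loaded `LlamaTokenizer`. The resulting number depends on the evaluation corpus, which this diff does not state.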