Commit a4db065 by fr3on · verified · 1 parent: e675913

Update README.md

  1. README.md +2 -2
README.md CHANGED
@@ -17,12 +17,12 @@ datasets:
 
 **DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.
 
-It achieves near 1:1 fertility (1.26) and high semantic density.
+It achieves near 1:1 fertility (1.16) and high semantic density.
 
 ## Key Highlights
 
 - **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
-- **Vocab Size**: 64,000 tokens.
+- **Vocab Size**: 128,000 tokens.
 - **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
 - **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
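
The fertility figure changed in this commit is usually read as the average number of tokens produced per word, so a value near 1.0 means most words map to a single token. The sketch below shows one common way to compute it; the README does not state how DF-Arc's number was measured, so the whitespace word count and the `toy_tokenize` stand-in are assumptions for illustration only, not the project's evaluation code.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())      # assumption: words are split on whitespace
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits the definite article off each word."""
    article = "\u0627\u0644"  # Arabic "al-" prefix (alef + lam)
    tokens: List[str] = []
    for word in text.split():
        if word.startswith(article) and len(word) > len(article):
            tokens.extend([article, word[len(article):]])
        else:
            tokens.append(word)
    return tokens


# Three words; two of them carry the article, so the toy tokenizer emits five tokens.
sample = ["\u0627\u0644\u0643\u062a\u0627\u0628 \u0639\u0644\u0649 \u0627\u0644\u0637\u0627\u0648\u0644\u0629"]
score = fertility(sample, toy_tokenize)
```

To measure a real tokenizer, the same function could be given the `tokenize` method of a tokenizer loaded through `transformers`, for example via `AutoTokenizer.from_pretrained`; the repository ID to load is not shown in this diff.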