New Research & Demo: Intelligent Tokenizer — Attention Needs No Vocabulary
We propose a vocabulary-free, byte-level tokenizer that learns directly from raw UTF-8 bytes.
- 105M parameters; streams input in 256-byte chunks
- Trained across 204 languages (Flores-200)
- Preliminary accuracy: English 95%, Korean 97% (single-language training), multilingual average 47%
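To make the byte-level, vocabulary-free idea concrete, here is a minimal sketch of 256-byte streaming chunking over raw UTF-8 (my own illustration, not the paper's implementation; the character-boundary back-off is one plausible way to keep chunks decodable):

```python
def byte_tokenize(text: str, chunk_size: int = 256) -> list[list[int]]:
    """Map raw UTF-8 bytes to token IDs (0-255) and split into chunks,
    backing off so no chunk ends partway through a multi-byte character."""
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        # UTF-8 continuation bytes match 0b10xxxxxx; back off until the
        # next chunk would begin at a character boundary.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(list(data[start:end]))
        start = end
    return chunks
```

With this scheme every byte is its own "token" (an ID in 0-255), so there is no vocabulary to build or maintain; the model must learn character and word structure from the byte stream itself.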
Paper (DOI: 10.5281/zenodo.17116281)
Hugging Face Space (demo): [link]
I’d love feedback from the tokenizer community on:
- Handling CJK (Chinese/Japanese/Korean) UTF-8 sequences more effectively
- Strategies for efficiently balancing multilingual training data
- Potential integration with existing LLM frameworks
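For context on the CJK question: most CJK characters occupy 3 bytes in UTF-8, so a byte-level model sees roughly 3x longer sequences for CJK text than for ASCII, and a fixed-size byte window can land mid-character. A quick standalone illustration (not tied to the model):

```python
# Each CJK character below encodes to 3 UTF-8 bytes: one lead byte
# followed by two continuation bytes (0b10xxxxxx).
for ch in "中あ한":
    encoded = ch.encode("utf-8")
    print(ch, [f"0x{b:02X}" for b in encoded], len(encoded))  # 3 bytes each

# ASCII is 1 byte per character, so CJK inflates byte-sequence length ~3x.
print(len("tokenizer".encode("utf-8")))  # 9 bytes for 9 characters
```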