Intelligent Tokenizer: Attention Needs No Vocabulary (Demo + Paper)

We propose a vocabulary-free, byte-level tokenizer that learns directly from raw UTF-8 bytes.

  • 105M parameters, streaming with 256-byte chunks
  • Trained across 204 languages (Flores-200)
  • Preliminary results: English 95% accuracy, Korean 97% (single-language training); multilingual average 47%
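For readers unfamiliar with the setup, here is a minimal sketch of what feeding raw UTF-8 bytes in fixed 256-byte chunks can look like. Each byte value (0–255) serves directly as a token id, so no vocabulary lookup is needed; the function name and chunking details are illustrative, not taken from the paper.

```python
def byte_chunks(text: str, chunk_size: int = 256) -> list[list[int]]:
    """Encode text to raw UTF-8 bytes and split into fixed-size chunks.

    Illustrative only: each byte (0-255) is its own token id, so the
    "vocabulary" is just the 256 possible byte values.
    """
    data = text.encode("utf-8")
    return [list(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

# Korean characters occupy 3 bytes each in UTF-8, so byte counts grow
# faster than character counts for CJK text.
chunks = byte_chunks("안녕하세요, world!")
print(len(chunks))      # short input fits in a single chunk
print(chunks[0][:3])    # first three byte ids of the chunk
```

Note that a naive fixed-size split like this can cut a multi-byte UTF-8 character across a chunk boundary, which is one reason CJK handling (raised below) is non-trivial for byte-level models.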

Paper: DOI 10.5281/zenodo.17116281
Hugging Face Space (demo): [link]

I’d love feedback from the tokenizer community on:

  • Handling CJK (Chinese/Japanese/Korean) UTF-8 sequences more effectively
  • Ideas for efficient multilingual balancing
  • Potential integration with existing LLM frameworks