On Vacation 🏝️
Parvesh Rawal (Parveshiiii)
67 followers · 35 following
AI & ML interests
I love deep neural nets.
Recent Activity
liked a model about 17 hours ago: Parveshiiii/microtok
reacted to their post with 🔥 about 20 hours ago (post quoted below)
posted an update about 20 hours ago:
Just did something I’ve been meaning to try for ages. In only 3 hours, on 10 billion+ tokens, I trained a custom BPE + tiktoken-style tokenizer using my new library microtok — and it hits the same token efficiency as Qwen3.

Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated. Turns out it doesn’t have to be. microtok makes the whole process stupidly simple — literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.

If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for. I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.

Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → https://huggingface.co/Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok
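The post doesn’t show microtok’s three-line API, so here is a minimal sketch of what training a byte-level BPE tokenizer looks like with the underlying Hugging Face tokenizers library the post says microtok builds on. The vocabulary size, special token, and corpus file name are all illustrative assumptions, not microtok’s actual interface or settings.

```python
# A minimal sketch using the Hugging Face `tokenizers` library, which the
# post says microtok wraps. microtok's own 3-line API is not shown in the
# post; every name and value below is an illustrative assumption.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# Byte-level pre-tokenization is what GPT-2 / tiktoken-style tokenizers use.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                 # assumed size; the post gives none
    special_tokens=["<|endoftext|>"],  # tiktoken-style special token
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")
```

If the published repo ships a standard tokenizer.json, the trained tokenizer can presumably be loaded with transformers’ AutoTokenizer.from_pretrained("Parveshiiii/microtok"), though the post doesn’t confirm the exact loading path.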
Parveshiiii's models (14), sorted by recently updated:
Parveshiiii/microtok • Updated 8 days ago • 1
Parveshiiii/BadGPT-1 • 0.5B • Updated 16 days ago • 13
Parveshiiii/Qwen-3.5-0.8B-OCR • 0.9B • Updated 16 days ago • 53 • 2
Parveshiiii/BadGPT-2 • 0.6B • Updated Feb 6 • 1
Parveshiiii/Embedding • 0.6B • Updated Feb 6 • 77
Parveshiiii/PixelGen-6B • 6B • Updated Dec 10, 2025 • 9
Parveshiiii/Dad-GPT • Text Generation • 0.5B • Updated Dec 3, 2025 • 1
Parveshiiii/Classifier • 0.2B • Updated Nov 4, 2025 • 2 • 1
Parveshiiii/ayu-0.6B-2 • 0.6B • Updated Oct 18, 2025 • 1
Parveshiiii/ayu-0.6B • 0.6B • Updated Oct 17, 2025 • 1
Parveshiiii/Auto-Completer-0.2 • Text Generation • 0.4B • Updated Sep 9, 2025 • 2 • 2
Parveshiiii/Auto-Completer-0.1 • Text Generation • 0.4B • Updated Sep 9, 2025 • 2 • 1
Parveshiiii/Gemma-4b-pt-LateX • Updated Aug 24, 2025
Parveshiiii/mistral-small-int8 • Text Generation • 7B • Updated Jul 8, 2025 • 3 • 1