On Vacation 🏝️
Parvesh Rawal (Parveshiiii)
67 followers · 35 following
AI & ML interests
I love deep neural nets.
Recent Activity
liked a model about 17 hours ago: Parveshiiii/microtok
reacted to their post with 🔥 about 20 hours ago (post quoted below)
posted an update about 20 hours ago:
Just did something I’ve been meaning to try for ages. In only 3 hours, on 10 billion+ tokens, I trained a custom BPE + tiktoken-style tokenizer using my new library microtok — and it hits the same token efficiency as Qwen3.

Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated. Turns out it doesn’t have to be. microtok makes the whole process stupidly simple — literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.

If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for. I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.

Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → https://huggingface.co/Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok
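The post doesn’t show microtok’s three-line API, so here is a minimal sketch of what training a byte-level BPE tokenizer looks like with the underlying Hugging Face tokenizers library the post says microtok builds on. The vocabulary size, special token, and corpus file name are all illustrative assumptions, not microtok’s actual interface or settings.

```python
# A minimal sketch using the Hugging Face `tokenizers` library, which the
# post says microtok wraps. microtok's own 3-line API is not shown in the
# post; every name and value below is an illustrative assumption.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# Byte-level pre-tokenization is what GPT-2 / tiktoken-style tokenizers use.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                 # assumed size; the post gives none
    special_tokens=["<|endoftext|>"],  # tiktoken-style special token
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")
```

If the published repo ships a standard tokenizer.json, the trained tokenizer can presumably be loaded with transformers’ AutoTokenizer.from_pretrained("Parveshiiii/microtok"), though the post doesn’t confirm the exact loading path.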
Parveshiiii's models (14), sorted by recently updated:
Parveshiiii/microtok • Updated 8 days ago • 1
Parveshiiii/BadGPT-1 • 0.5B • Updated 16 days ago • 13
Parveshiiii/Qwen-3.5-0.8B-OCR • 0.9B • Updated 16 days ago • 53 • 2
Parveshiiii/BadGPT-2 • 0.6B • Updated Feb 6 • 1
Parveshiiii/Embedding • 0.6B • Updated Feb 6 • 77
Parveshiiii/PixelGen-6B • 6B • Updated Dec 10, 2025 • 9
Parveshiiii/Dad-GPT • Text Generation • 0.5B • Updated Dec 3, 2025 • 1
Parveshiiii/Classifier • 0.2B • Updated Nov 4, 2025 • 2 • 1
Parveshiiii/ayu-0.6B-2 • 0.6B • Updated Oct 18, 2025 • 1
Parveshiiii/ayu-0.6B • 0.6B • Updated Oct 17, 2025 • 1
Parveshiiii/Auto-Completer-0.2 • Text Generation • 0.4B • Updated Sep 9, 2025 • 2 • 2
Parveshiiii/Auto-Completer-0.1 • Text Generation • 0.4B • Updated Sep 9, 2025 • 2 • 1
Parveshiiii/Gemma-4b-pt-LateX • Updated Aug 24, 2025
Parveshiiii/mistral-small-int8 • Text Generation • 7B • Updated Jul 8, 2025 • 3 • 1