A Byte-Pair Encoding (BPE) tokenizer trained on over **3.4 lakh (340,000) cleaned Telugu text keys** from the AI4Bharat Sangraha dataset and other open sources. It is intended for pretraining or fine-tuning Telugu language models.
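To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules are learned from a corpus. It is an illustration only, not the procedure used to train this tokenizer: the toy English corpus and the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) are assumptions, and production trainers differ in normalization, whitespace handling, and tie-breaking.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from single code points; each word is a tuple of symbols.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words


if __name__ == "__main__":
    merges, words = learn_bpe("low low low lower lowest", num_merges=3)
    print(merges)
    print(words)
```

Each learned merge adds one entry to the vocabulary, so the number of merges (together with the base symbols and special tokens) determines the final vocabulary size.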
Library: transformers + sentencepiece
Special tokens:
- <unk> → Unknown token
- <pad> → Padding
- <s> → Start of sequence
- </s> → End of sequence
- \n, ₹, •, - → User-defined symbols preserved during training
from transformers import T5Tokenizer
# Load tokenizer from Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("Vipplav/telugu-bpe-23k")
# Sample Telugu input
text = "పరిశీలన తేదీ: 15-06-2025"
# Tokenize the input
tokens = tokenizer.tokenize(text)
# Decode tokens back to text
decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=True)
# Display results
print(f"\n📥 Input : {text}")
print(f"📤 Tokens : {tokens}")
print(f"🔁 Decoded : {decoded}")
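The tokenize-then-decode pattern above can be wrapped in a small helper that checks several strings at once. This is a sketch under stated assumptions: `roundtrip_report` is a hypothetical helper, and a character-level stand-in tokenizer is used so the example runs without downloading anything.

```python
def roundtrip_report(texts, tokenize, detokenize):
    """Return one dict per text: token count and whether decoding restores it exactly."""
    report = []
    for text in texts:
        tokens = tokenize(text)
        restored = detokenize(tokens)
        report.append(
            {"text": text, "num_tokens": len(tokens), "lossless": restored == text}
        )
    return report


# Stand-in character-level tokenizer (an assumption, used only for offline runs).
char_tokenize = list
char_detokenize = "".join

samples = ["పరిశీలన తేదీ: 15-06-2025", "₹ 500"]
for row in roundtrip_report(samples, char_tokenize, char_detokenize):
    print(row)
```

To check the real tokenizer, pass `tokenizer.tokenize` as `tokenize` and `lambda toks: tokenizer.decode(tokenizer.convert_tokens_to_ids(toks), skip_special_tokens=True)` as `detokenize`. SentencePiece-based tokenizers may normalize whitespace or Unicode, so an exact match is not guaranteed for every input.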
If you use this tokenizer, please cite:
APA:
Vipplav AI. (2025). Telugu BPE Tokenizer (23k vocab). Hugging Face. https://huggingface.co/Vipplav/telugu-bpe-23k
AI4Bharat. (2023). Sangraha: A Large-Scale Multidomain Corpus for Indian Languages. Hugging Face Datasets. https://huggingface.co/datasets/ai4bharat/sangraha
BibTeX:
@misc{vipplav_telugu_tokenizer,
author = {Vipplav AI},
title = {Telugu BPE Tokenizer (23k vocab)},
year = {2025},
url = {https://huggingface.co/Vipplav/telugu-bpe-23k}
}
@dataset{sangraha2023,
author = {AI4Bharat},
title = {Sangraha Dataset},
year = {2023},
url = {https://huggingface.co/datasets/ai4bharat/sangraha}
}