---
library_name: transformers
tags:
- llama
- gpt
- malayalam
- text-generation-inference
license: mit
datasets:
- uonlp/CulturaX
language:
- ml
pipeline_tag: text-classification
---

### About

- This tokenizer was trained on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset, from which 1.2 million datapoints were sampled at random.
- Training was done with Google's [SentencePiece](https://github.com/google/sentencepiece) library.
- The trained tokens were then merged into the `LlamaTokenizer`, expanding the vocabulary from the original 32,000 tokens to a total of 49,120.
- The merge followed the approach used in [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py)'s `merge_tokenizers.py` script.

### Usage

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("learninbit/malayalam-llama-2-tokenizer-v0.1")

text = "ഹനഫസ ഹഫഞ്ചഥ ചകഡു ടെണല ഡൃൊമത്തീഴ ടഞ്ഞഭഞ റദ്ധഷ ഌിപത്മഫഥ ടജ്ജഡ്ഡപ്പെവ പഴുണൊ."
tokens = tokenizer.tokenize(text)
```
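The vocabulary merge described above boils down to one idea: keep the base Llama vocabulary and append only the newly trained pieces it does not already contain. A minimal, self-contained sketch of that logic (the tiny vocab lists here are hypothetical placeholders, not the real 32,000/49,120-piece vocabularies):

```python
# Conceptual sketch of the tokenizer merge (toy, hypothetical vocabularies).
base_vocab = ["<s>", "</s>", "▁the", "▁and"]   # stands in for Llama's base pieces
new_pieces = ["▁the", "▁മലയാളം", "▁ഭാഷ"]        # stands in for the trained Malayalam pieces

merged = list(base_vocab)
seen = set(base_vocab)
for piece in new_pieces:
    if piece not in seen:      # only append pieces the base tokenizer lacks
        merged.append(piece)
        seen.add(piece)

print(len(merged))  # 6: "▁the" was a duplicate, two Malayalam pieces were appended
```

In the real script this dedup-and-append runs over the SentencePiece model protos of both tokenizers, which is why the final count (49,120) is slightly less than 32,000 plus the raw number of trained pieces.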