Hi everyone, I am Raghav Hinduja, a Switzerland-based IT professional. I’m looking for suggestions on how to preprocess or tokenize data for training language models using Hugging Face.
What tools, workflows, or best practices have worked well for you?
If you’re unsure about what to fine-tune, starting with either the LLM Course or the smol course will help you avoid confusion.
1) Start by choosing your training objective (it determines the “right” preprocessing)
Causal LM (CLM, decoder-only; “next token prediction”)
- You typically tokenize → concatenate → chunk into fixed-length blocks (`block_size`) for efficient training.
- This is the pattern used in HF’s canonical CLM example script (`run_clm.py`). (GitHub)
Chat / instruction SFT (still CLM under the hood, but formatted as messages)
- Your biggest risk is formatting + special tokens + label masking, not raw tokenization.
- The safe default is to use chat templates correctly (details below). (Hugging Face)
MLM (BERT-style)
- Tokenization is similar, but masking is usually applied by a data collator at batch time.
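In Transformers that batch-time masking is handled by a data collator; the logic it implements (BERT’s 15% / 80-10-10 rule) can be sketched in plain Python. The function name and defaults below are illustrative, not the HF API:

```python
import random

def mlm_mask(input_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style dynamic masking: pick ~mlm_prob of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Labels are -100 (ignored by the loss) everywhere else."""
    rng = random.Random(seed)
    ids = list(input_ids)
    labels = [-100] * len(ids)
    for i in range(len(ids)):
        if rng.random() < mlm_prob:
            labels[i] = ids[i]  # supervise the original token at this position
            r = rng.random()
            if r < 0.8:
                ids[i] = mask_id                    # 80%: replace with [MASK]
            elif r < 0.9:
                ids[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return ids, labels
```

Because masking happens per batch rather than once at preprocessing time, each epoch sees different masks over the same data.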
2) Core tools in the Hugging Face stack (and what each is for)
datasets (I/O + transformations)
- Load data from files or the Hub; transform with `map()`, `filter()`, `shuffle()`; stream big corpora.
- If your dataset is too large to store locally, load in streaming mode to get an `IterableDataset`. (Hugging Face)
transformers tokenizers (text → token IDs)
- Prefer Fast tokenizers (Rust-backed) for speed and consistent behavior. (Hugging Face)
Optional: large-scale data pipelines (dedup/filtering)
- For web-scale preprocessing (filtering, dedup, etc.), HF’s DataTrove provides reference pipelines (e.g., the FineWeb processing script). (GitHub)
3) Data cleaning & quality filtering (what matters most before tokenization)
This step often dominates downstream model quality.
Minimum “always do it” cleaning
- Normalize whitespace / remove null bytes / fix obvious encoding issues.
- Drop pathological samples (extremely short, extremely long, repetitive junk).
- Remove markup if your source is HTML.
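The “always do it” steps above can be sketched with the standard library; the thresholds and the repetition heuristic here are illustrative placeholders you should tune to your corpus:

```python
import re
import unicodedata

def basic_clean(text, min_chars=20, max_chars=100_000):
    """Minimal cleaning: strip null bytes, normalize Unicode and whitespace,
    drop pathological samples. Returns None for samples to drop."""
    text = text.replace("\x00", "")                 # null bytes
    text = unicodedata.normalize("NFC", text)       # canonical Unicode form
    text = re.sub(r"[ \t]+", " ", text)             # collapse spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # collapse blank-line runs
    if not (min_chars <= len(text) <= max_chars):
        return None                                 # too short / too long
    # crude repetition check: very few distinct lines relative to total
    lines = text.splitlines()
    if lines and len(set(lines)) < max(1, len(lines) // 4):
        return None
    return text
```

With `datasets`, the same function plugs into `map()` (to normalize) plus `filter()` (to drop the `None`s).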
Deduplicate (especially for pretraining / continued pretraining)
Duplicate data wastes compute and can leak evaluation examples into training.
- FineWeb explicitly documents a pipeline of cleaning + dedup, and points to a working script for the full process. (Hugging Face)
- The DataTrove repository includes an example script used to create FineWeb. (GitHub)
If you’re not operating at web scale, even exact-match dedup (hash the normalized text) gives a meaningful win.
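A minimal sketch of that hash-based exact dedup (the normalization choice is an assumption; adjust it to what counts as “the same document” for you):

```python
import hashlib

def normalize_for_hash(text):
    # Case-fold and collapse whitespace so trivial variants hash equal.
    return " ".join(text.lower().split())

def exact_dedup(texts):
    """Keep only the first occurrence of each normalized text."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(normalize_for_hash(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```

The same idea works streaming-style with `datasets`: keep the `seen` set outside a `filter()` call, or hash inside a `map()` and filter on the hash column.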
4) Tokenizer strategy: reuse vs train a new one
Fine-tuning an existing model
Use the model’s tokenizer as-is. Changing vocab has knock-on effects and usually isn’t worth it.
Pretraining from scratch (or new language/domain where the tokenizer is a bad fit)
Train a tokenizer on a representative slice of your corpus.
- HF’s LLM course shows `train_new_from_iterator()` as a practical approach (works with fast tokenizers). (Hugging Face)
- The Transformers tokenizer docs explain fast vs slow tokenizers and expected capabilities. (Hugging Face)
- HF also published a late-2025 overview of tokenization for LLMs (useful for updated mental models and API direction). (Hugging Face)
5) Tokenize efficiently with datasets.map() (speed + reproducibility)
Use batch mapping (batched=True)
Batch mapping is explicitly designed to speed up tokenization because tokenizers run faster on batches. (Hugging Face)
```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("json", data_files={"train": "train.jsonl"})
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def tokenize(batch):
    # No truncation here: for CLM we concatenate and chunk afterwards.
    return tok(batch["text"], truncation=False)

tokenized = ds["train"].map(
    tokenize,
    batched=True,  # tokenizers are much faster on batches of texts
    remove_columns=ds["train"].column_names,  # drop raw text early
)
```
Caching and saving artifacts
- HF Datasets uses caching; if caching is disabled, your transforms can be recomputed and then deleted at session end unless you explicitly save the result. (Hugging Face)
```python
tokenized.save_to_disk("./tokenized_train")
```
When map() slows down near the end
This is a common report in real workflows (often due to I/O, cache writes, or skewed example sizes). A typical mitigation is to shard, reduce output columns, and ensure fast local storage. (Hugging Face Forums)
6) CLM preprocessing: packing (concatenate + chunk) and boundary handling
The standard “group_texts” approach
The canonical CLM recipe is: tokenize → concatenate → slice into `block_size` chunks (often with `labels = input_ids`). This is the approach discussed around `run_clm.py`. (GitHub)
Boundary pitfall: “Should I insert EOS between documents?”
This is a frequently debated detail; there’s a dedicated issue asking whether run_clm.py should separate documents with a special token. (GitHub)
Practical guidance
- If your samples are independent documents, append an EOS to each doc before concatenation to prevent unnatural “doc bleed”.
- If your data is already a continuous stream (e.g., book text split into lines), you may choose not to.
Block-size pitfall: remainder handling
A known failure mode is producing chunks that aren’t exactly block_size, causing training errors. There’s an issue specifically about group_texts needing to drop incorrect-length sequences. (GitHub)
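Both pitfalls above (EOS between documents, exact-`block_size` chunks with the remainder dropped) can be handled in one small packing function. This is a sketch over already-tokenized id lists, not the `group_texts` from `run_clm.py` itself:

```python
def pack_documents(doc_token_ids, eos_id, block_size):
    """Concatenate independent documents (EOS-separated) and slice into
    chunks of exactly block_size, dropping the trailing remainder."""
    stream = []
    for ids in doc_token_ids:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // block_size
    chunks = [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]
    # For CLM, labels are a copy of input_ids (the shift happens inside the model).
    return [{"input_ids": c, "labels": list(c)} for c in chunks]
```

Because only full chunks are emitted, every example has exactly `block_size` tokens, which avoids the ragged-length training errors mentioned above.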
7) Chat / instruction SFT: use chat templates correctly (most important for your case)
Recommended default: apply_chat_template(..., tokenize=True)
Transformers explicitly warns that chat templates generally already include the special tokens; templating into text and then tokenizing “normally” can insert special tokens twice and degrade performance. (Hugging Face)
```python
def chat_to_features(example, tokenizer):
    # example["messages"] = [{"role": "system"|"user"|"assistant", "content": "..."}]
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=True,                # template + tokenize in one step
        add_generation_prompt=False,  # training on full turns, not generating
        return_dict=True,             # get input_ids / attention_mask back
    )
```
If you do template → tokenize in two steps
Set `add_special_tokens=False` when tokenizing the rendered string, exactly as the docs recommend. (Hugging Face)
This issue shows a concrete example where templating then encoding results in duplicated BOS. (GitHub)
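A toy model of that pitfall (the tokenizer and template below are stand-ins, not real HF objects): the rendered template already contains BOS, so a tokenizer that adds BOS by default duplicates it.

```python
class ToyTokenizer:
    """Stand-in for a fast tokenizer that prepends BOS by default."""
    BOS = "<s>"

    def apply_chat_template(self, messages):
        # Real chat templates typically bake special tokens into the text.
        body = "".join(f"[{m['role']}]{m['content']}" for m in messages)
        return f"{self.BOS}{body}"

    def encode(self, text, add_special_tokens=True):
        toks = text.replace(self.BOS, f" {self.BOS} ").split()
        if add_special_tokens:
            toks = [self.BOS] + toks  # default behavior duplicates BOS here
        return toks

tok = ToyTokenizer()
rendered = tok.apply_chat_template([{"role": "user", "content": "hi"}])
wrong = tok.encode(rendered)                            # BOS appears twice
right = tok.encode(rendered, add_special_tokens=False)  # BOS appears once
```

The cheap sanity check for real models: decode a templated-then-tokenized sample and count the special tokens.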
8) Labels and loss masking (assistant-only / completion-only training)
If you want loss only on the assistant output (common in instruction tuning):
- TRL documents `DataCollatorForCompletionOnlyLM` and states it works only when `packing=False`. (Hugging Face)
- There’s also an explicit TRL issue asking if you can combine packing with completion-only training (short answer: not directly “as-is”). (GitHub)
Practical recommendation
- Start with correctness: completion-only + no packing (simple, reliable).
- Only introduce packing after you have tests that confirm label masking does not cross sample boundaries.
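What completion-only masking has to guarantee can be sketched without TRL: given token ids plus the token spans of the assistant turns, labels are -100 everywhere else. The span bookkeeping here is illustrative (TRL locates spans via a response-template string instead):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_non_assistant(input_ids, assistant_spans):
    """assistant_spans: list of (start, end) token-index pairs (end exclusive)
    covering assistant responses. Every other position is masked out."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        for i in range(start, min(end, len(input_ids))):
            labels[i] = input_ids[i]  # supervise only assistant tokens
    return labels
```

A unit test asserting exactly this shape (all -100 outside assistant spans) is the kind of check worth having before you ever turn packing on.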
9) Large datasets: when to stream instead of materialize
If the corpus is too large for local disk/RAM, use streaming:
- `streaming=True` yields an `IterableDataset` you can iterate without downloading everything. (Hugging Face)
- Be aware: streaming has different performance characteristics, and there are ongoing questions/issues about throughput and how it compares to map-style datasets. (GitHub)
A common production pattern is:
- stream + light filtering →
- write cleaned shards (e.g., parquet/jsonl) →
- train on the stable shards with map-style datasets for speed.
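The stream → filter → shard pattern can be sketched with plain JSONL files; the filter and shard size are placeholders, and in practice `lines` would come from an `IterableDataset` created with `load_dataset(..., streaming=True)`:

```python
import json
from pathlib import Path

def stream_filter_shard(lines, out_dir, shard_size=1000, min_len=20):
    """Stream JSONL records, keep those passing a light filter, and write
    fixed-size shards that a later map-style training run can load fast."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    buf, shard_idx = [], 0

    def flush():
        nonlocal buf, shard_idx
        if buf:
            path = out / f"shard-{shard_idx:05d}.jsonl"
            path.write_text("".join(json.dumps(r) + "\n" for r in buf))
            shard_idx += 1
            buf = []

    for line in lines:
        rec = json.loads(line)
        if len(rec.get("text", "")) >= min_len:  # light quality filter
            buf.append(rec)
            if len(buf) >= shard_size:
                flush()
    flush()  # trailing partial shard
    return shard_idx
```

Writing Parquet instead of JSONL follows the same shape and is usually faster to train from.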
10) A “best-practice checklist” (what tends to work well)
Tokenization & formatting
- Use fast tokenizers (`use_fast=True`). (Hugging Face)
- Use `Dataset.map(..., batched=True)` for tokenization speed. (Hugging Face)
- For chat SFT: prefer `apply_chat_template(tokenize=True)`; if not, set `add_special_tokens=False`. (Hugging Face)
CLM packing
- Ensure chunking outputs exactly `block_size` (drop the remainder). (GitHub)
- Decide and document whether you insert EOS between documents (and keep it consistent). (GitHub)
Dataset ops & reproducibility
- Remove unused columns early (`remove_columns=...`) to reduce I/O and cache size. (Hugging Face)
- If caching is disabled, call `save_to_disk()` or you’ll lose results at session end. (Hugging Face)
Scaling
- Stream very large corpora, and materialize only cleaned/filtered shards you intend to train on. (Hugging Face)
- For web-scale, follow a pipeline-style approach with filtering + dedup (FineWeb + DataTrove are good reference points). (Hugging Face)
Recommended “reading order” (fast path)
- Batch mapping (`datasets.map` with `batched=True`). (Hugging Face)
- Chat templating (and the special-token pitfall). (Hugging Face)
- Completion-only SFT constraints in TRL (packing vs masking). (Hugging Face)
- Streaming docs for big data. (Hugging Face)
- FineWeb/DataTrove pipeline as a reference for real-world filtering/dedup. (Hugging Face)