Whisper fine-tuned with custom tokens works with model.generate but not with pipeline()

Hello!
I am using transformers 4.57.3 and python 3.19.3 for whisper. I am fine-tuning openai/whisper-small.en with custom tokens added.

I add the tokens with tokenizer.add_tokens and call model.resize_token_embeddings. After fine-tuning, testing 30-second audio clips with model.generate shows the custom tokens generated correctly, so the fine-tuning itself works. However, when I move to a pipeline, passing my custom processor and model as arguments, the custom tokens never appear in the output.

Is there something I’m missing in this process? Not sure what it could be.

Thanks!


This seems to be a known edge case.


You are not missing a training step. You are hitting a Whisper pipeline decoding edge case: the ASR pipeline uses Whisper’s timestamp-aware “ASR decode” logic, and that logic often treats high token IDs as timestamps. Your added tokens get appended to the end of the vocabulary, so they become “high IDs,” and the pipeline decoder can misclassify them and drop them.

This exact failure mode is documented in public issues, with a minimal reproduction. (GitHub)


First: sanity on versions (small but important)

  • Transformers 4.57.3 is a real release (stable v4 line, late 2025). (Hugging Face Forums)
  • “Python 3.19.3” is not a CPython release line. The Python devguide shows the “future” main branch is 3.15, not 3.19. (Python Developer’s Guide)
    You likely meant something like Python 3.9.19 (which exists). (Python.org)

This does not cause the token drop, but it matters for reproducibility.


Why model.generate() shows your tokens but pipeline() hides them

What you see with model.generate()

When you do:

  1. ids = model.generate(...)
  2. text = tokenizer.decode(ids[0])

…you are using the tokenizer’s “normal” decoding path.

What the ASR pipeline does differently

The automatic-speech-recognition pipeline is not “just generate.” It also does Whisper-specific postprocessing that splits segments and handles timestamps. That splitting logic is implemented in tokenizer._decode_asr (or equivalent), and the Transformers maintainers explicitly call out that the pipeline uses this path to “split up segments by timestamps.” (GitHub)
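
For reference, here is a simplified sketch of the Whisper branch of the pipeline’s postprocess step (_decode_asr is a private helper, so the exact call below is an assumption and may differ between versions):

# Simplified sketch of what the ASR pipeline's postprocess does for Whisper.
# NOTE: _decode_asr is private; the call below is an assumption and may change.
text, optional = asr_pipe.tokenizer._decode_asr(
    model_outputs,            # per-chunk dicts holding the generated "tokens"
    return_timestamps=False,
    return_language=None,
    time_precision=0.02,      # Whisper's timestamp granularity in seconds
)
# ...instead of a plain asr_pipe.tokenizer.batch_decode(ids, skip_special_tokens=True)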

Now the critical part:

  • Whisper has special timestamp tokens (<|0.00|> … <|30.00|>) with IDs in a reserved region.
  • Many Whisper decoders use a boundary like timestamp_begin and logic like: if token_id >= timestamp_begin then “this is a timestamp token.”

When you add tokens with tokenizer.add_tokens(...), your tokens are appended at the end of the vocab. That often makes them land above that boundary.

This is not theoretical. It is reproduced in Transformers issue #35330 (a code sketch of the repro follows this list):

  • tokenizer.decode(..., decode_with_timestamps=False) returns the added token (“newword1”)
  • tokenizer.decode(..., decode_with_timestamps=True) turns it into timestamp text like <|30.02|> and <|30.24|>
  • The issue points directly at the condition if token >= timestamp_begin: as the cause (GitHub)
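
Here is that repro as a rough sketch (assuming tokenizer is a WhisperTokenizer containing the added token; “newword1” is the token name used in the issue):

new_id = tokenizer.convert_tokens_to_ids("newword1")

print(tokenizer.decode([new_id], decode_with_timestamps=False))  # -> "newword1"
print(tokenizer.decode([new_id], decode_with_timestamps=True))   # -> timestamp text such as "<|30.02|>"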

Transformers.js has the same bug class, and the proposed fix is exactly what you’d expect: treat timestamps as a bounded range (timestamp_begin <= token <= timestamp_end) so user-added tokens beyond <|30.00|> decode normally. (GitHub)

So in your case:

  • The model is generating your custom token IDs correctly.
  • The pipeline’s Whisper ASR post-decoder is misinterpreting them as timestamps or non-text tokens.
  • Result: your custom tokens “disappear” from the final pipeline output.

Quick confirmation for your exact setup (2 minutes)

Run these checks with the same tokenizer you pass to the pipeline.

1) Verify the pipeline is using your modified tokenizer

If the pipeline loads a base tokenizer by accident, your custom tokens cannot appear.

print("tokenizer size:", len(tokenizer))
print("pipe tokenizer size:", len(asr_pipe.tokenizer))

print("added vocab (tokenizer):", len(tokenizer.get_added_vocab()))
print("added vocab (pipe):", len(asr_pipe.tokenizer.get_added_vocab()))

If the pipeline’s added vocab count is 0 (or smaller), you are not actually decoding with the modified tokenizer.

2) Check whether your custom token IDs collide with Whisper timestamp logic

Compute:

custom = "YOUR_CUSTOM_TOKEN"
custom_id = tokenizer.convert_tokens_to_ids(custom)

timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1
print("custom_id:", custom_id)
print("timestamp_begin:", timestamp_begin)
print("custom_id >= timestamp_begin:", custom_id >= timestamp_begin)

If custom_id >= timestamp_begin is True, you are in the same bug class as #35330. (GitHub)


Practical fixes (ordered by “works now”)

Fix A: Keep pipeline preprocessing, but decode yourself (most robust)

Use the processor to create features, run generate, then decode with plain batch_decode instead of _decode_asr.

This avoids the timestamp-aware ASR post-decoder that is dropping your added tokens.

# deps: transformers, torch, librosa (or soundfile)
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained("YOUR_FINETUNED_DIR")
model = WhisperForConditionalGeneration.from_pretrained("YOUR_FINETUNED_DIR").to(device)

def transcribe(audio_array, sampling_rate=16000, **gen_kwargs):
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    feats = inputs.input_features.to(device)

    with torch.no_grad():
        ids = model.generate(feats, **gen_kwargs)

    # Key point: do NOT call tokenizer._decode_asr
    return processor.tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
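
A usage sketch (the file path is illustrative; librosa resamples to 16 kHz if needed):

import librosa

audio, sr = librosa.load("your_clip.wav", sr=16000)   # illustrative path
print(transcribe(audio, sampling_rate=sr, max_new_tokens=256))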

If you need chunking for long audio, you can still chunk externally and concatenate text, or keep the pipeline for chunking but override postprocess (next fix).

Fix B: Subclass the pipeline and override postprocess decoding

If you want pipeline conveniences (audio loading, chunking, batching) but want safe decoding, override the final decode step to use batch_decode instead of _decode_asr.

This is a targeted workaround for the exact mismatch described in #23231 and the token/timestamp confusion described in #35330. (GitHub)

(If you want, paste your current pipeline construction code and I can show a minimal subclass that matches your parameters.)
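
In the meantime, here is a minimal sketch reusing model and processor from Fix A. It assumes the pipeline’s postprocess receives per-chunk dicts with a "tokens" entry, which is an internal detail that may differ between Transformers versions:

# Minimal sketch: keep the pipeline's audio loading/chunking/batching, but decode
# with the plain tokenizer path instead of the timestamp-aware _decode_asr.
# Assumption: each item in model_outputs is a dict holding a "tokens" array/tensor.
from transformers import AutomaticSpeechRecognitionPipeline

class PlainDecodeASRPipeline(AutomaticSpeechRecognitionPipeline):
    def postprocess(self, model_outputs, **kwargs):
        pieces = []
        for output in model_outputs:
            tokens = output["tokens"]
            if hasattr(tokens, "tolist"):
                tokens = tokens.tolist()
            if tokens and isinstance(tokens[0], list):   # drop a leading batch dimension
                tokens = tokens[0]
            pieces.append(self.tokenizer.decode(tokens, skip_special_tokens=True))
        return {"text": "".join(pieces)}

asr_pipe = PlainDecodeASRPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,            # older versions may require an int device index
)
# e.g. asr_pipe("long_audio.wav", chunk_length_s=30)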

Fix C: Patch the underlying condition (most “correct,” most invasive)

The public proposal is:

  • Change token >= timestamp_begin
  • To timestamp_begin <= token <= timestamp_end (where timestamp_end is <|30.00|>)

This is explicitly proposed in transformers.js issue #803, with the rationale that user-added tokens sit at the end of the vocab, beyond the true timestamp range. (GitHub)

Transformers issue #35330 also identifies the unbounded token >= timestamp_begin check as the root cause. (GitHub)

In Python Transformers, doing this “cleanly” usually means vendoring or monkeypatching the ASR decode function. It is a maintenance burden but it preserves timestamps and custom tokens.
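
As a standalone illustration of the bounded-range idea (not a drop-in replacement for _decode_asr; time_precision=0.02 is Whisper’s timestamp step, so <|30.00|> sits 1500 ids above <|0.00|>):

# Illustrative helper: only ids between <|0.00|> and <|30.00|> are rendered as
# timestamps; anything above that range (e.g. your added tokens) decodes as text.
def decode_bounded_timestamps(tokenizer, token_ids, time_precision=0.02):
    timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1   # id of <|0.00|>
    timestamp_end = timestamp_begin + round(30.0 / time_precision)              # id of <|30.00|>

    pieces = []
    for tid in token_ids:
        if timestamp_begin <= tid <= timestamp_end:
            pieces.append(f"<|{(tid - timestamp_begin) * time_precision:.2f}|>")
        else:
            pieces.append(tokenizer.decode([tid], skip_special_tokens=True))
    return "".join(pieces)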

Fix D: Consider not adding tokens at all (often best for Whisper)

Whisper’s tokenizer is already a large BPE vocabulary and can form uncommon words from subword pieces. HF forum guidance often recommends trying the pretrained tokenizer first and using prompting/biasing for jargon rather than extending the vocab. (Hugging Face Forums)

If your goal is “make the transcript contain domain terms,” using prompting (initial_prompt or similar) can work without touching token IDs, so you avoid all timestamp-ID collisions.
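
A sketch of the prompting route, assuming your Transformers version supports Whisper prompting via get_prompt_ids / prompt_ids (the glossary text is a placeholder, and feats is the feature tensor from the Fix A snippet):

# Sketch: bias Whisper toward domain terms with a prompt instead of new tokens.
prompt_ids = processor.get_prompt_ids(
    "glossary: anodization, ferrule, torque wrench",   # placeholder domain terms
    return_tensors="pt",
).to(device)

with torch.no_grad():
    ids = model.generate(feats, prompt_ids=prompt_ids)   # feats as in Fix A

print(processor.tokenizer.batch_decode(ids, skip_special_tokens=True)[0])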


Extra pitfalls that can make it look worse

1) Are you accidentally treating your tokens as “special tokens”?

If you used add_special_tokens (or marked them special), then any decode with skip_special_tokens=True will remove them. In #35330, you can see how skip_special_tokens changes what remains. (GitHub)
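
A quick check (using <my_term> as a placeholder for one of your added tokens; added_tokens_decoder is available in recent Transformers versions):

tid = tokenizer.convert_tokens_to_ids("<my_term>")          # placeholder token
print(tokenizer.added_tokens_decoder[tid])                  # AddedToken(..., special=...)
print(tokenizer.decode([tid], skip_special_tokens=True))    # empty if treated as special
print(tokenizer.decode([tid], skip_special_tokens=False))   # always shows the token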

2) Timestamp mode and pipeline timestamp bugs

Whisper timestamp behavior has had multiple pipeline bugs and refactors. There are separate issues around return_timestamps=True in the ASR pipeline. (GitHub)
Even if you are not requesting timestamps, the pipeline’s “ASR decode” path is still timestamp-aware because Whisper uses timestamp tokens internally for segmentation. (GitHub)


My best diagnosis for your case

Given:

  • fine-tuning works
  • model.generate shows the token
  • pipeline output loses it

The most likely explanation is:

  1. Your added tokens got IDs at the end of the vocab.
  2. Whisper pipeline uses _decode_asr style logic.
  3. That logic treats IDs beyond a boundary as timestamps.
  4. Your tokens are therefore mis-decoded or discarded in postprocessing.

That is exactly the reproduced behavior in #35330 and the fix rationale in transformers.js #803. (GitHub)


Summary

  • Your model is generating the custom token IDs correctly.
  • The ASR pipeline’s Whisper post-decoder (_decode_asr) can misclassify appended token IDs as timestamps. (GitHub)
  • Confirm by checking custom_id >= timestamp_begin.
  • Fastest fix: bypass pipeline decoding and call batch_decode yourself.
  • “Correct” fix: bound timestamp detection to the real timestamp range (<= <|30.00|>). (GitHub)

Modifying _decode_asr to use token <= timestamp_end worked. Thank you.

