prof-freakenstein/whisper-librispeech-finetuned

Fine-tuned openai/whisper-small on the LibriSpeech ASR dataset (clean subset, train.100 split).

Model Details

Property         Value
Base model       openai/whisper-small
Language         English
Task             Automatic Speech Recognition (transcribe)
Dataset          LibriSpeech ASR – clean / train.100
Training epochs  3
Batch size       16 (per device) × 2 grad-accum = 32 effective
Learning rate    1e-05
Warmup steps     500
Precision        FP16
Hardware         NVIDIA L4 GPU, Intel Xeon CPU
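The effective batch size in the table is simply the per-device batch size multiplied by the number of gradient-accumulation steps (with a single GPU); as a quick sanity check:

```python
# Effective batch size = per-device batch × gradient-accumulation steps
# (× number of GPUs, which is 1 here).
per_device_batch = 16
grad_accum_steps = 2
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # -> 32
```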

Training Configuration

# =============================================================================
#  Whisper Fine-Tuning Configuration – LibriSpeech on NVIDIA L4 / Intel Xeon
# =============================================================================

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
model:
  name: "openai/whisper-small"      # Options: whisper-tiny | whisper-base | whisper-small | whisper-medium | whisper-large-v3
  language: "english"
  task: "transcribe"
  # Set to true to freeze encoder weights and train only the decoder
  # (faster training, lower VRAM); false fine-tunes the full model.
  freeze_encoder: false

# ---------------------------------------------------------------------------
# Dataset  – LibriSpeech via the Hugging Face datasets hub
# ---------------------------------------------------------------------------
dataset:
  name: "librispeech_asr"
  config_name: "clean"              # "clean" | "other"
  train_split: "train.100"          # train.100 | train.360 | train.500
  validation_split: "validation"
  test_split: "test"
  streaming: false                  # Set true to avoid downloading full dataset
  num_proc: 8                       # Parallel CPU workers for preprocessing (Xeon friendly)
  cache_dir: "./data/cache"

# ---------------------------------------------------------------------------
# Feature extraction
# ---------------------------------------------------------------------------
feature_extraction:
  sampling_rate: 16000
  max_input_length_seconds: 30.0

# ---------------------------------------------------------------------------
# Training  – tuned for NVIDIA L4 (24 GB VRAM)
# ---------------------------------------------------------------------------
training:
  output_dir: "./outputs/whisper-librispeech"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 8
  gradient_accumulation_steps: 2     # Effective batch = 32 per GPU
  learning_rate: 1.0e-5
  warmup_steps: 500
  weight_decay: 0.01
  lr_scheduler_type: "linear"
  fp16: true                         # L4 supports FP16; set bf16: true if preferred
  bf16: false
  dataloader_num_workers: 4          # Intel Xeon – use multiple cores
  save_strategy: "epoch"
  evaluation_strategy: "epoch"
  logging_steps: 25
  save_total_limit: 3
  load_best_model_at_end: true
  metric_for_best_model: "wer"
  greater_is_better: false
  predict_with_generate: true
  generation_max_length: 225
  report_to: "tensorboard"
  seed: 42

# ---------------------------------------------------------------------------
# Hardware hints
# ---------------------------------------------------------------------------
hardware:
  gpu: "NVIDIA L4"
  vram_gb: 24
  cpu: "Intel Xeon"
  # Use torch.compile for extra throughput (requires PyTorch >= 2.0)
  use_torch_compile: false
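The key names in the training section above line up with the parameters of transformers.Seq2SeqTrainingArguments, so a training script can forward them more or less directly. The sketch below is an assumption about how the config is consumed, not the actual scripts/train.py code; a plain dict stands in for the parsed YAML.

```python
# Plain-dict stand-in for the parsed "training" section of the YAML above;
# in the real script this would come from yaml.safe_load() on the config file.
training_cfg = {
    "output_dir": "./outputs/whisper-librispeech",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1.0e-5,
    "warmup_steps": 500,
    "fp16": True,
}

# These keys match Seq2SeqTrainingArguments parameter names, so a script
# could forward them directly, e.g.:
#   args = Seq2SeqTrainingArguments(**training_cfg)

# Maximum input length (in samples) implied by the feature_extraction section:
sampling_rate = 16_000
max_input_length_seconds = 30.0
max_input_samples = int(sampling_rate * max_input_length_seconds)
print(max_input_samples)  # -> 480000
```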

Usage

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="prof-freakenstein/whisper-librispeech-finetuned",
    chunk_length_s=30,
    stride_length_s=5,
)
result = asr("path/to/audio.wav")
print(result["text"])

Or with the WhisperProcessor / WhisperForConditionalGeneration API:

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model = WhisperForConditionalGeneration.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model.eval()

# load your audio as a 16 kHz numpy array `audio_array` …
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
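Checkpoint selection during training is driven by word error rate (metric_for_best_model: "wer" in the config above). In practice WER is usually computed with the evaluate or jiwer libraries; purely for illustration, a minimal stdlib version of word-level WER looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis positions.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # delete a reference word
                       d[j - 1] + 1,      # insert a hypothesis word
                       prev + (r != h))   # substitute (or match for free)
            prev = cur
    return d[-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # -> 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```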

Training Script

This model was trained using the transcription-as-thought pipeline:

python scripts/train.py --config config/training_config.yaml

Limitations

  • Optimised for English speech; accuracy on other languages may vary.
  • Trained on read/narrated speech (LibriSpeech); performance on conversational or noisy audio may be lower than on clean recordings.

License

This model is released under the Apache 2.0 licence, consistent with the base Whisper model.
