prof-freakenstein/whisper-librispeech-finetuned

Fine-tuned openai/whisper-small on the LibriSpeech ASR dataset (clean subset, train.100 split).

Model Details

Property         Value
Base model       openai/whisper-small
Language         English
Task             Automatic Speech Recognition (transcribe)
Dataset          LibriSpeech ASR – clean / train.100
Training epochs  3
Batch size       16 (per device) × 2 grad-accum = 32 effective
Learning rate    1e-05
Warmup steps     500
Precision        FP16
Hardware         NVIDIA L4 GPU, Intel Xeon CPU
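The effective batch size in the table is simply the per-device batch size multiplied by the number of gradient-accumulation steps (with a single GPU); as a quick sanity check:

```python
# Effective batch size = per-device batch × gradient-accumulation steps
# (× number of GPUs, which is 1 here).
per_device_batch = 16
grad_accum_steps = 2
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # -> 32
```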

Training Configuration

# =============================================================================
#  Whisper Fine-Tuning Configuration – LibriSpeech on NVIDIA L4 / Intel Xeon
# =============================================================================

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
model:
  name: "openai/whisper-small"      # Options: whisper-tiny | whisper-base | whisper-small | whisper-medium | whisper-large-v3
  language: "english"
  task: "transcribe"
  # Set to true to freeze encoder weights and train only the decoder
  # (faster training, lower VRAM); false fine-tunes the full model.
  freeze_encoder: false

# ---------------------------------------------------------------------------
# Dataset  – LibriSpeech via the Hugging Face datasets hub
# ---------------------------------------------------------------------------
dataset:
  name: "librispeech_asr"
  config_name: "clean"              # "clean" | "other"
  train_split: "train.100"          # train.100 | train.360 | train.500
  validation_split: "validation"
  test_split: "test"
  streaming: false                  # Set true to avoid downloading full dataset
  num_proc: 8                       # Parallel CPU workers for preprocessing (Xeon friendly)
  cache_dir: "./data/cache"

# ---------------------------------------------------------------------------
# Feature extraction
# ---------------------------------------------------------------------------
feature_extraction:
  sampling_rate: 16000
  max_input_length_seconds: 30.0

# ---------------------------------------------------------------------------
# Training  – tuned for NVIDIA L4 (24 GB VRAM)
# ---------------------------------------------------------------------------
training:
  output_dir: "./outputs/whisper-librispeech"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 8
  gradient_accumulation_steps: 2     # Effective batch = 32 per GPU
  learning_rate: 1.0e-5
  warmup_steps: 500
  weight_decay: 0.01
  lr_scheduler_type: "linear"
  fp16: true                         # L4 supports FP16; set bf16: true if preferred
  bf16: false
  dataloader_num_workers: 4          # Intel Xeon – use multiple cores
  save_strategy: "epoch"
  evaluation_strategy: "epoch"
  logging_steps: 25
  save_total_limit: 3
  load_best_model_at_end: true
  metric_for_best_model: "wer"
  greater_is_better: false
  predict_with_generate: true
  generation_max_length: 225
  report_to: "tensorboard"
  seed: 42

# ---------------------------------------------------------------------------
# Hardware hints
# ---------------------------------------------------------------------------
hardware:
  gpu: "NVIDIA L4"
  vram_gb: 24
  cpu: "Intel Xeon"
  # Use torch.compile for extra throughput (requires PyTorch >= 2.0)
  use_torch_compile: false
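The key names in the training section above line up with the parameters of transformers.Seq2SeqTrainingArguments, so a training script can forward them more or less directly. The sketch below is an assumption about how the config is consumed, not the actual scripts/train.py code; a plain dict stands in for the parsed YAML.

```python
# Plain-dict stand-in for the parsed "training" section of the YAML above;
# in the real script this would come from yaml.safe_load() on the config file.
training_cfg = {
    "output_dir": "./outputs/whisper-librispeech",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1.0e-5,
    "warmup_steps": 500,
    "fp16": True,
}

# These keys match Seq2SeqTrainingArguments parameter names, so a script
# could forward them directly, e.g.:
#   args = Seq2SeqTrainingArguments(**training_cfg)

# Maximum input length (in samples) implied by the feature_extraction section:
sampling_rate = 16_000
max_input_length_seconds = 30.0
max_input_samples = int(sampling_rate * max_input_length_seconds)
print(max_input_samples)  # -> 480000
```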

Usage

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="prof-freakenstein/whisper-librispeech-finetuned",
    chunk_length_s=30,
    stride_length_s=5,
)
result = asr("path/to/audio.wav")
print(result["text"])

Or with the WhisperProcessor / WhisperForConditionalGeneration API:

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model = WhisperForConditionalGeneration.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model.eval()

# load your audio as a 16 kHz numpy array `audio_array` …
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
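Checkpoint selection during training is driven by word error rate (metric_for_best_model: "wer" in the config above). In practice WER is usually computed with the evaluate or jiwer libraries; purely for illustration, a minimal stdlib version of word-level WER looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis positions.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # delete a reference word
                       d[j - 1] + 1,      # insert a hypothesis word
                       prev + (r != h))   # substitute (or match for free)
            prev = cur
    return d[-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # -> 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```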

Training Script

This model was trained using the transcription-as-thought pipeline:

python scripts/train.py --config config/training_config.yaml

Limitations

  • Optimised for English speech; accuracy on other languages may vary.
  • Trained on read/narrated speech (LibriSpeech); performance on conversational or noisy audio may be lower than on clean recordings.

License

This model is released under the Apache 2.0 licence, consistent with the base Whisper model.
