# prof-freakenstein/whisper-librispeech-finetuned
Fine-tuned `openai/whisper-small` on the LibriSpeech ASR dataset (clean subset, `train.100` split).
## Model Details
| Property | Value |
|---|---|
| Base model | openai/whisper-small |
| Language | English |
| Task | Automatic Speech Recognition (transcribe) |
| Dataset | LibriSpeech ASR – clean / train.100 |
| Training epochs | 3 |
| Batch size | 16 (per device) × 2 grad-accum = 32 effective |
| Learning rate | 1e-05 |
| Warmup steps | 500 |
| Precision | FP16 |
| Hardware | NVIDIA L4, Intel Xeon |
## Training Configuration

```yaml
# =============================================================================
# Whisper Fine-Tuning Configuration – LibriSpeech on NVIDIA L4 / Intel Xeon
# =============================================================================

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
model:
  name: "openai/whisper-small"   # whisper-tiny | whisper-base | whisper-small | whisper-medium | whisper-large-v3
  language: "english"
  task: "transcribe"
  # Set to true to freeze encoder weights and train only the decoder
  # (faster training, lower VRAM).
  freeze_encoder: false
  # freeze_feature_extractor is not used; configure freezing via freeze_encoder above.

# ---------------------------------------------------------------------------
# Dataset – LibriSpeech via the Hugging Face datasets hub
# ---------------------------------------------------------------------------
dataset:
  name: "librispeech_asr"
  config_name: "clean"           # "clean" | "other"
  train_split: "train.100"       # train.100 | train.360 | train.500
  validation_split: "validation"
  test_split: "test"
  streaming: false               # Set true to avoid downloading the full dataset
  num_proc: 8                    # Parallel CPU workers for preprocessing (Xeon-friendly)
  cache_dir: "./data/cache"

# ---------------------------------------------------------------------------
# Feature extraction
# ---------------------------------------------------------------------------
feature_extraction:
  sampling_rate: 16000
  max_input_length_seconds: 30.0

# ---------------------------------------------------------------------------
# Training – tuned for NVIDIA L4 (24 GB VRAM)
# ---------------------------------------------------------------------------
training:
  output_dir: "./outputs/whisper-librispeech"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 8
  gradient_accumulation_steps: 2   # Effective batch = 32 per GPU
  learning_rate: 1.0e-5
  warmup_steps: 500
  weight_decay: 0.01
  lr_scheduler_type: "linear"
  fp16: true                       # L4 supports FP16; set bf16: true if preferred
  bf16: false
  dataloader_num_workers: 4        # Intel Xeon – use multiple cores
  save_strategy: "epoch"
  evaluation_strategy: "epoch"
  logging_steps: 25
  save_total_limit: 3
  load_best_model_at_end: true
  metric_for_best_model: "wer"
  greater_is_better: false
  predict_with_generate: true
  generation_max_length: 225
  report_to: "tensorboard"
  seed: 42

# ---------------------------------------------------------------------------
# Hardware hints
# ---------------------------------------------------------------------------
hardware:
  gpu: "NVIDIA L4"
  vram_gb: 24
  cpu: "Intel Xeon"
  use_torch_compile: false   # torch.compile for extra throughput (requires PyTorch >= 2.0)
```
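With these settings, the optimizer-step budget is easy to sanity-check. A minimal sketch, assuming the commonly cited figure of about 28,539 utterances in `train.100` (an assumption — verify against your local copy of the dataset):

```python
import math

# Assumed utterance count for LibriSpeech train-clean-100 (verify locally).
NUM_TRAIN_EXAMPLES = 28_539

per_device_batch = 16
grad_accum = 2
epochs = 3
warmup_steps = 500

effective_batch = per_device_batch * grad_accum           # 32, matching the table above
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

Under that assumption this gives 892 steps per epoch and 2,676 total optimizer steps, so the 500 warmup steps cover roughly the first 19 % of training.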
## Usage

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="prof-freakenstein/whisper-librispeech-finetuned",
    chunk_length_s=30,
    stride_length_s=5,
)

result = asr("path/to/audio.wav")
print(result["text"])
```
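The `chunk_length_s=30` / `stride_length_s=5` arguments make the pipeline transcribe long audio as overlapping 30-second windows. As an illustration of the window schedule (a simplified sketch, not the actual `transformers` internals):

```python
def chunk_boundaries(total_s, chunk_s=30.0, stride_s=5.0):
    """Illustrative sliding-window schedule: each window overlaps its
    neighbours by `stride_s` seconds on each side, so the hop between
    window starts is chunk_s - 2 * stride_s."""
    hop = chunk_s - 2 * stride_s
    bounds, start = [], 0.0
    while start < total_s:
        bounds.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break
        start += hop
    return bounds

print(chunk_boundaries(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

The 5-second overlap on each side lets predictions near chunk edges be reconciled, which reduces word breakage at window boundaries.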
Or with the `WhisperProcessor` / `WhisperForConditionalGeneration` API:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model = WhisperForConditionalGeneration.from_pretrained("prof-freakenstein/whisper-librispeech-finetuned")
model.eval()

# load your audio as a 16 kHz numpy array `audio_array` …
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
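Training selects the best checkpoint by word error rate (`metric_for_best_model: "wer"`). For reference, WER is the word-level edit distance divided by the number of reference words; a dependency-free sketch (in practice the `evaluate` or `jiwer` packages are typically used instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick fox"))  # 0.25 (one deleted word out of four)
```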
## Training Script

This model was trained using the transcription-as-thought pipeline:

```bash
python scripts/train.py --config config/training_config.yaml
```
## Limitations
- Optimised for English speech; accuracy on other languages may vary.
- Trained on read/narrated speech (LibriSpeech); performance on conversational or noisy audio may be lower than on clean recordings.
## License
This model is released under the Apache 2.0 licence, consistent with the base Whisper model.