Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base model and training on a large Finnish corpus, it generalizes zero-shot to unseen speakers: on held-out speakers, word error rate falls from 28.94% to 2.76% and the mean opinion score rises from 2.29 to 4.34.
Performance (Zero-Shot OOD)
The following metrics were calculated on Out-of-Distribution (OOD) speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
|---|---|---|---|
| Avg Word Error Rate (WER) | 28.94% | 2.76% | ~10.5x lower WER |
| Mean Opinion Score (MOS) | 2.29 / 5.0 | 4.34 / 5.0 | +2.05 Quality Points |
Note: MOS is an automated estimate obtained with the Gemini 3 Flash API, not a human listening panel, and WER was calculated from transcripts produced by Faster-Whisper Finnish Large v3. We read the 4.34 MOS as professional-grade output that approaches natural human speech.
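For readers who want to reproduce the scoring step, here is a minimal sketch of a word-level WER calculation. It assumes the ASR transcript is already available as a string, and the normalization (lowercasing and whitespace splitting) is an assumption rather than the exact procedure used for the numbers above. The Finnish sentence is a made-up example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four.
example_wer = word_error_rate(
    "hyvää huomenta kaikille ystäville",
    "hyvää huomenta kaikille ystävälle",
)

# Relative improvement implied by the reported WER values.
baseline_wer, finetuned_wer = 28.94, 2.76
reduction_factor = baseline_wer / finetuned_wer
```

The last two lines show where the "~10.5x" figure comes from: it is the ratio of baseline WER to fine-tuned WER.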
Audio Comparison (OOD Speakers)
Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers never seen during training.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
|---|---|---|
| cv-15_11 | (audio sample) | (audio sample) |
| cv-15_16 | (audio sample) | (audio sample) |
| cv-15_2 | (audio sample) | (audio sample) |
Data Processing & Transparency
The model was trained on a diverse corpus of 16,604 samples to capture the nuances of Finnish phonetics, including vowel length and gemination.
- Sources: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
- Zero-Shot Integrity: Specific speakers (cv-15_11, cv-15_16, cv-15_2) were strictly excluded from training to ensure valid OOD testing.
- Traceability: Full attribution and filtering lineage are provided in attribution.csv.
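The speaker exclusion described above can be sketched as a speaker-disjoint split. This is an illustration only: the `speaker_id` and `path` fields, the example rows, and the extra speaker ID `cv-15_3` are hypothetical, and the real schema of attribution.csv may differ.

```python
from typing import Iterable

# Speaker IDs listed above as held out for OOD evaluation.
OOD_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}


def split_by_speaker(samples: Iterable[dict], held_out: set[str]) -> tuple[list[dict], list[dict]]:
    """Return (train, ood) so that no held-out speaker appears in train."""
    train, ood = [], []
    for sample in samples:
        (ood if sample["speaker_id"] in held_out else train).append(sample)
    return train, ood


# Hypothetical manifest rows, for illustration only.
manifest = [
    {"speaker_id": "cv-15_11", "path": "a.wav"},
    {"speaker_id": "cv-15_3", "path": "b.wav"},
    {"speaker_id": "cv-15_2", "path": "c.wav"},
]
train_set, ood_set = split_by_speaker(manifest, OOD_SPEAKERS)
```

Splitting by speaker rather than by utterance is what makes the evaluation zero-shot: no audio from a test voice can leak into training.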
Phase 2 Research: Single-Speaker Fine-Tuning
As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).
Results & Optimization
We used sweep_params.py to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of 4.63 MOS.
Best Parameters for Finnish:
- repetition_penalty: 1.5 (balanced for Finnish long vowels)
- temperature: 0.8
- exaggeration: 0.5
- cfg_weight: 0.3
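A parameter sweep of this kind can be sketched as a plain grid search. This is not the contents of sweep_params.py: the grid values and the `toy_score` function are hypothetical stand-ins, and a real run would synthesize audio for each combination and score it (for example with MOS) instead.

```python
from itertools import product
from typing import Callable

# Hypothetical search grid; the values actually swept are not documented here.
GRID = {
    "repetition_penalty": [1.2, 1.5, 1.8],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.4, 0.5, 0.6],
    "cfg_weight": [0.3, 0.5],
}


def sweep(score_fn: Callable[[dict], float], grid: dict) -> tuple[dict, float]:
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: dict) -> float:
    """Stand-in for a real quality metric: peaks at the settings listed above."""
    target = {"repetition_penalty": 1.5, "temperature": 0.8, "exaggeration": 0.5, "cfg_weight": 0.3}
    return -sum((params[k] - target[k]) ** 2 for k in target)


best, _ = sweep(toy_score, GRID)
```

With this grid the search evaluates 3 x 3 x 3 x 2 = 54 combinations, so an exhaustive sweep stays cheap as long as each parameter has only a few candidate values.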
Research Samples (Cloned Voice)
- Everyday Phrases: Polite Request | Morning Greeting
Note: The single-speaker weights are not included in this repository.
Hardware & Infrastructure
- Platform: Verda (NVIDIA A100 80GB)
- Mixed Precision: BF16 for stability.
- Repetition Guard: Custom threshold of 10 tokens in AlignmentStreamAnalyzer to support Finnish phonology.
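To illustrate what a repetition guard with a 10-token threshold might look like, here is a minimal sketch. It assumes the guard checks for a run of identical trailing tokens; the function name `is_stuck_repeating` is hypothetical and this is not the actual AlignmentStreamAnalyzer code.

```python
def is_stuck_repeating(generated_tokens: list[int], threshold: int = 10) -> bool:
    """True when the last `threshold` tokens are all the same token."""
    if len(generated_tokens) < threshold:
        return False
    return len(set(generated_tokens[-threshold:])) == 1


# A short run of a repeated token does not trip the guard...
short_run = [7, 7, 7, 7, 3]
# ...but ten identical tokens in a row does.
long_run = [5] + [9] * 10
```

The likely motivation for a higher threshold is that Finnish long vowels and geminates can legitimately produce short runs of similar tokens, so a very strict guard could cut speech off early.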
Quick Start
Option A: Dev Container (recommended)
Open this repo in VS Code with the Dev Containers extension. Everything (dependencies, base model weights, GPU detection) is handled automatically by postCreateCommand.
Option B: Manual Setup
# 1. Clone (with LFS for model weights)
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish
# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh
# 3. Download pretrained base models from ResembleAI
python setup.py
# 4. Run inference
python inference_example.py
GPU compatibility: The install script detects your GPU and picks the right PyTorch build automatically:
- Blackwell (sm_120+), e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
- Older GPUs (A100, RTX 30/40xx, etc.) → PyTorch 2.5.1 + CUDA 12.4
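The selection rule above can be expressed as a small function. This is a sketch of the decision only, not the logic of install_dependencies.sh: the function name `pick_torch_build` is hypothetical, and it takes the compute capability as plain integers so it runs without a GPU. On a machine with PyTorch already installed, `torch.cuda.get_device_capability()` returns such a `(major, minor)` pair.

```python
def pick_torch_build(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to the PyTorch/CUDA pair listed above."""
    if (major, minor) >= (12, 0):  # sm_120 and newer (Blackwell)
        return {"torch": "2.10.0", "cuda": "12.8"}
    return {"torch": "2.5.1", "cuda": "12.4"}


# A100 reports compute capability 8.0; sm_120 corresponds to 12.0.
a100_build = pick_torch_build(8, 0)
blackwell_build = pick_torch_build(12, 0)
```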
Running Inference
import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)
# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)
# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
Or just run the included example script directly:
python inference_example.py # outputs output_finnish.wav
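Because the inference snippet loads weights with `strict=False`, a key-name mismatch would pass silently and leave the base weights in place. The sketch below shows one way to check the prefix-stripping step on plain dictionaries. The key names and helper functions are hypothetical and only illustrate the idea.

```python
def strip_prefix(state: dict, prefix: str = "t3.") -> dict:
    """Drop `prefix` from keys that carry it, leaving other keys untouched."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v for k, v in state.items()}


def compare_keys(model_keys: set[str], checkpoint_keys: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) relative to the model's own parameter names."""
    return model_keys - checkpoint_keys, checkpoint_keys - model_keys


# Hypothetical key names, for illustration only.
checkpoint = {"t3.tfmr.layers.0.weight": 1, "t3.text_emb.weight": 2}
model_keys = {"tfmr.layers.0.weight", "text_emb.weight", "speech_head.weight"}

missing, unexpected = compare_keys(model_keys, set(strip_prefix(checkpoint)))
```

In PyTorch itself, `load_state_dict` returns an object with `missing_keys` and `unexpected_keys` fields, so printing that return value after the load gives the same information for the real model.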
Acknowledgments & Credits
- Exploration Foundation: Initial fine-tuning exploration was based on the chatterbox-finetuning toolkit by gokhaneraslan.
- Model Authors: Deep thanks to the team at ResembleAI for the Chatterbox model.
- Single-speaker fine-tuning: Huge thanks to Mape for letting me fine-tune using audio from the Growth Mindset Builder YouTube channel (https://www.youtube.com/@Growthmindsetbuilder).
- Data Sourcing: Thanks to #Jobik at Nordic AI Discord for the dataset insights.
Disclaimer
- Don't use this model to do bad things.
Evaluation results
- Word Error Rate (WER) on Mozilla Common Voice 15.0 (Finnish OOD), test set, self-reported: 2.76%
- Mean Opinion Score (MOS) on Mozilla Common Voice 15.0 (Finnish OOD), test set, self-reported: 4.34