Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This repository hosts a fine-tuned version of the Chatterbox TTS model, optimized for the Finnish language. Fine-tuning the multilingual base on a large Finnish corpus gives strong zero-shot generalization to unseen speakers: on held-out voices, WER drops from 28.94% to 2.76% and MOS rises from 2.29 to 4.34.

🚀 Performance (Zero-Shot OOD)

The following metrics were calculated on Out-of-Distribution (OOD) speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

Metric	Baseline (Original Multilingual)	Fine-Tuned (Step 986)	Improvement
Avg Word Error Rate (WER)	28.94%	2.76%	~10.5x Lower WER
Mean Opinion Score (MOS)	2.29 / 5.0	4.34 / 5.0	+2.05 Quality Points

Note: MOS was estimated automatically using the Gemini 3 Flash API, and WER was calculated from transcripts produced by Faster-Whisper Finnish Large v3. A MOS of 4.34 sits in the range usually described as "Professional Grade", close to natural human speech.
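For readers unfamiliar with the metric, the sketch below shows how a word-level WER and the "~10.5x" figure in the table can be computed. It is a minimal illustration, not the evaluation script used for this model: the lowercase-and-split normalization and the helper names `word_error_rate` and `relative_reduction` are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, fine_tuned_wer: float) -> float:
    """How many times lower the fine-tuned WER is than the baseline WER."""
    return baseline_wer / fine_tuned_wer


# A vowel-length error ("tuli" vs "tuuli") counts as one substituted word.
example_wer = word_error_rate("tuli on kuuma", "tuuli on kuuma")
ratio = relative_reduction(28.94, 2.76)
print(f"example WER: {example_wer:.3f}, reduction: {ratio:.1f}x")
```

Because WER counts whole words, a single wrong vowel length makes the entire word an error, which is why the metric is sensitive to the Finnish length contrasts mentioned in this README.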

🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers never seen during training.

Speaker ID	Baseline (Generic Multilingual)	Fine-Tuned (Finnish Golden)
cv-15_11
cv-15_16
cv-15_2

🛠 Data Processing & Transparency

The model was trained on a diverse corpus of 16,604 samples to capture the nuances of Finnish phonetics, including vowel length and gemination.

Sources: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
Zero-Shot Integrity: Specific speakers (cv-15_11, cv-15_16, cv-15_2) were strictly excluded from training to ensure valid OOD testing.
Traceability: Full attribution and filtering lineage are provided in attribution.csv.
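The speaker exclusion described above can be sketched as a split keyed on speaker ID. This is an illustration only: the manifest contents and the `speaker_id`, `audio_path`, and `source` column names are hypothetical, and the real attribution.csv may be laid out differently.

```python
import csv
import io

# Speaker IDs listed in this README as held out for OOD testing.
HELD_OUT_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}

# Hypothetical manifest; the real attribution.csv columns may differ.
MANIFEST = """speaker_id,audio_path,source
cv-15_11,clips/a.wav,common_voice
cv-15_3,clips/b.wav,common_voice
parl_07,clips/c.wav,parliament
cv-15_2,clips/d.wav,common_voice
cv-15_21,clips/e.wav,common_voice
"""


def split_by_speaker(rows, held_out):
    """Route every row of a held-out speaker to the OOD set, the rest to training."""
    train, ood = [], []
    for row in rows:
        (ood if row["speaker_id"] in held_out else train).append(row)
    return train, ood


rows = list(csv.DictReader(io.StringIO(MANIFEST)))
train_rows, ood_rows = split_by_speaker(rows, HELD_OUT_SPEAKERS)

# Sanity check: no held-out speaker may leak into the training split.
leaked = {row["speaker_id"] for row in train_rows} & HELD_OUT_SPEAKERS
print(len(train_rows), len(ood_rows), leaked)
```

Exact set membership matters here: a prefix match on `cv-15_2` would also drop the unrelated hypothetical speaker `cv-15_21` from training.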

🔬 Phase 2 Research: Single-Speaker Fine-Tuning

As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).

Results & Optimization

We used sweep_params.py to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of 4.63 MOS.

Best Parameters for Finnish:

repetition_penalty: 1.5 (Balanced for Finnish long vowels)
temperature: 0.8
exaggeration: 0.5
cfg_weight: 0.3
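A parameter sweep of this kind can be sketched as a plain grid search. The contents of sweep_params.py are not shown in this README, so everything below is an assumption: the grid values, the `sweep` helper, and in particular `toy_score`, which stands in for a real quality metric and peaks at the reported settings purely by construction.

```python
from itertools import product

# Hypothetical search grid; the values actually swept are not documented here.
GRID = {
    "repetition_penalty": [1.2, 1.5, 1.8],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.5, 0.7],
    "cfg_weight": [0.3, 0.5],
}

# Settings reported in this README as best for Finnish.
REPORTED_BEST = {
    "repetition_penalty": 1.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "cfg_weight": 0.3,
}


def sweep(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in scorer (NOT a real metric): higher when closer to REPORTED_BEST."""
    return -sum(abs(params[key] - REPORTED_BEST[key]) for key in REPORTED_BEST)


best_params, best_score = sweep(GRID, toy_score)
print(best_params)
```

In a real sweep, `score_fn` would synthesize the holdout sentences with the given parameters and return a quality score such as the MOS estimate.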

Research Samples (Cloned Voice)

Everyday Phrases: Polite Request | Morning Greeting

Note: The single-speaker weights are not included in this repository.

💻 Hardware & Infrastructure

Platform: Verda (NVIDIA A100 80GB)
Mixed Precision: BF16 for stability.
Repetition Guard: Custom threshold of 10 tokens in AlignmentStreamAnalyzer to support Finnish phonology.
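To illustrate why a repetition threshold matters, here is a simplified sketch of a guard that stops generation only after the same token has been emitted many times in a row. It is not the code in AlignmentStreamAnalyzer: the function names are hypothetical, and treating the threshold as a count of consecutive identical tokens is an assumption.

```python
def trailing_run_length(tokens):
    """Length of the run of identical tokens at the end of the sequence."""
    if not tokens:
        return 0
    run = 1
    for token in reversed(tokens[:-1]):
        if token != tokens[-1]:
            break
        run += 1
    return run


def should_stop(tokens, threshold=10):
    """Flag a runaway repetition once the trailing run reaches the threshold."""
    return trailing_run_length(tokens) >= threshold


# A short run of identical tokens is tolerated.
short_run = [4, 9, 7, 7, 7]
# A run of ten identical tokens trips the guard.
runaway = [4, 9] + [7] * 10
print(should_stop(short_run), should_stop(runaway))
```

A low threshold would cut off legitimately sustained sounds, which is the concern this README raises for Finnish long vowels and geminates; a higher threshold trades that for slower detection of true loops.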

🚀 Quick Start

Option A — Dev Container (recommended)

Open this repo in VS Code with the Dev Containers extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by postCreateCommand.

Option B — Manual Setup

# 1. Clone (with LFS for model weights)
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish

# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh

# 3. Download pretrained base models from ResembleAI
python setup.py

# 4. Run inference
python inference_example.py

GPU compatibility: The install script detects your GPU and picks the right PyTorch build automatically:

Blackwell (sm_120+) e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8

Older GPUs (A100, RTX 30/40xx, etc.) → PyTorch 2.5.1 + CUDA 12.4
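The selection rule above can be expressed as a comparison on the GPU's compute capability. This is a sketch of the rule as stated in this README, not the logic of install_dependencies.sh itself; the `pick_torch_build` helper and its string input format are assumptions.

```python
def pick_torch_build(compute_capability: str) -> dict:
    """Map a compute capability such as "8.0" or "12.0" to the builds listed above."""
    major, minor = (int(part) for part in compute_capability.split("."))
    if (major, minor) >= (12, 0):
        # Blackwell (sm_120 and newer)
        return {"torch": "2.10.0", "cuda": "12.8"}
    # Older architectures such as A100 (sm_80) or RTX 40xx (sm_89)
    return {"torch": "2.5.1", "cuda": "12.4"}


for capability in ("8.0", "8.9", "12.0"):
    print(capability, pick_torch_build(capability))
```

Comparing the `(major, minor)` tuple rather than the raw string avoids the lexicographic trap where "12.0" sorts before "8.0".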

🏃 Running Inference

import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)

# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
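# Strip the "t3." prefix so checkpoint keys match the T3 submodule's parameter names
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
# strict=False tolerates keys that are missing from, or extra to, the checkpoint
engine.t3.load_state_dict(t3_state, strict=False)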

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)

sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)

Or just run the included example script directly:

python inference_example.py  # outputs output_finnish.wav
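Because the weights are loaded with `strict=False`, key mismatches do not raise an error, so it can be worth checking that the prefix stripping produced the names the model expects. The sketch below shows that check on plain dictionaries, without PyTorch; the key names and the `strip_prefix` and `diff_keys` helpers are hypothetical.

```python
def strip_prefix(state, prefix="t3."):
    """Drop a leading prefix from every key that has it; leave other keys unchanged."""
    return {
        key[len(prefix):] if key.startswith(prefix) else key: value
        for key, value in state.items()
    }


def diff_keys(expected_keys, loaded_keys):
    """Return (missing, unexpected) key lists, mirroring what strict=False ignores."""
    expected, loaded = set(expected_keys), set(loaded_keys)
    return sorted(expected - loaded), sorted(loaded - expected)


# Hypothetical key names for illustration; real checkpoint keys will differ.
checkpoint = {"t3.tfmr.weight": 1, "t3.text_emb.weight": 2, "extra.bias": 3}
model_keys = ["tfmr.weight", "text_emb.weight", "speech_head.weight"]

remapped = strip_prefix(checkpoint)
missing, unexpected = diff_keys(model_keys, remapped)
print("missing:", missing, "unexpected:", unexpected)
```

In PyTorch itself, `load_state_dict(..., strict=False)` returns an object with `missing_keys` and `unexpected_keys` fields that can be inspected the same way.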

🙏 Acknowledgments & Credits

Exploration Foundation: Initial fine-tuning exploration was based on the chatterbox-finetuning toolkit by gokhaneraslan.
Model Authors: Deep thanks to the team at ResembleAI for the Chatterbox model.
Single-speaker fine-tuning: Huge thanks to Mape for letting me fine-tune using audio from the Growth Mindset Builder YouTube channel (https://www.youtube.com/@Growthmindsetbuilder).
Data Sourcing: Thanks to #Jobik at Nordic AI Discord for the dataset insights.

Disclaimer

Don't use this model to do bad things.

Evaluation results (Mozilla Common Voice 15.0, Finnish OOD test set, self-reported)

Word Error Rate (WER): 2.76
Mean Opinion Score (MOS): 4.34