Soprano: Instant, Ultra‑Realistic Text‑to‑Speech

📰 News

2026.01.14 - Soprano-1.1-80M released! 95% fewer hallucinations and a 63% preference rate over Soprano-80M.
2026.01.13 - Soprano-Factory released! You can now train/fine-tune your own Soprano models.
2025.12.22 - Soprano-80M released! Code | Demo


Overview

Soprano is an ultra‑lightweight, on-device text‑to‑speech (TTS) model designed for expressive, high‑fidelity speech synthesis at unprecedented speed. It offers the following features:

  • Up to 2000x real-time generation on GPU and 20x real-time on CPU
  • Lossless streaming with <15 ms latency on GPU, <250 ms on CPU
  • <1 GB memory usage with a compact 80M parameter architecture
  • Infinite generation length with automatic text splitting
  • Highly expressive, crystal clear audio generation at 32kHz
  • Widespread support for CUDA, CPU, and MPS devices on Windows, Linux, and Mac
  • Supports a WebUI, a CLI, and an OpenAI-compatible endpoint for easy, production-ready inference

Installation

Install with wheel (CUDA-only for now)

pip install soprano-tts

To get the latest features, you can install from source instead.

Install from source (CUDA)

git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .[lmdeploy]

Install from source (CPU/MPS)

git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
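
To verify either from-source install, a quick sanity check (the package and class names follow the Python usage section below):

python -c "from soprano import SopranoTTS; print('Soprano import OK')"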

⚠️ Warning: Windows CUDA users

On Windows with CUDA, pip will install a CPU-only PyTorch build. To ensure CUDA support works as expected, reinstall PyTorch explicitly with the correct CUDA wheel after installing Soprano:

pip uninstall -y torch
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
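
After reinstalling, you can confirm that the CUDA build of PyTorch is active:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"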

Usage

WebUI

Start WebUI:

soprano-webui # hosted on http://127.0.0.1:7860 by default

Tip: You can increase the cache size and decoder batch size to speed up inference at the cost of higher memory usage. For example:

soprano-webui --cache-size 1000 --decoder-batch-size 4

CLI

soprano "Soprano is an extremely lightweight text to speech model."

optional arguments:
  --output, -o                  Output audio file path (non-streaming only). Defaults to 'output.wav'
  --model-path, -m              Path to local model directory (optional)
  --device, -d                  Device to use for inference. Supported: auto, cuda, cpu, mps. Defaults to 'auto'
  --backend, -b                 Backend to use for inference. Supported: auto, transformers, lmdeploy. Defaults to 'auto'
  --cache-size, -c              Cache size in MB (for lmdeploy backend). Defaults to 100
  --decoder-batch-size, -bs     Decoder batch size. Defaults to 1
  --streaming, -s               Enable streaming playback to speakers

Tip: You can increase the cache size and decoder batch size to speed up inference at the cost of higher memory usage.
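
For example, a call that saves to a custom path on CUDA with a larger cache and decoder batch size (flags as listed above):

soprano "Soprano is an extremely lightweight text to speech model." -o demo.wav -d cuda -c 500 -bs 4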

Note: The CLI reloads the model on every invocation, so inference will be slower than with the other methods.

OpenAI-compatible endpoint

Start server:

uvicorn soprano.server:app --host 0.0.0.0 --port 8000

Use the endpoint like this:

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Soprano is an extremely lightweight text to speech model."
  }' \
  --output speech.wav

Note: Currently, this endpoint only supports non-streaming output.
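
As an alternative to curl, a minimal Python client using the requests library (assuming the same JSON body as the curl example above):

import requests

# Request speech from the local Soprano server and save the WAV response
response = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Soprano is an extremely lightweight text to speech model."},
    timeout=120,
)
response.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(response.content)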

Python script

from soprano import SopranoTTS

model = SopranoTTS(backend='auto', device='auto', cache_size_mb=100, decoder_batch_size=1)

Tip: You can increase cache_size_mb and decoder_batch_size to speed up inference at the cost of higher memory usage.

# Basic inference
out = model.infer("Soprano is an extremely lightweight text to speech model.") # can achieve 2000x real-time with sufficiently long input!

# Save output to a file
out = model.infer("Soprano is an extremely lightweight text to speech model.", "out.wav")

# Custom sampling parameters
out = model.infer(
    "Soprano is an extremely lightweight text to speech model.",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)


# Batched inference
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10) # can achieve 2000x real-time with sufficiently large input size!

# Save batch outputs to a directory
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10, "/dir")


# Streaming inference
from soprano.utils.streaming import play_stream
stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)
play_stream(stream) # plays audio with <15 ms latency!
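
To save streamed audio to a file instead of playing it, here is a minimal sketch. It assumes each chunk yielded by infer_stream is a 1-D array of samples at the 32 kHz output rate listed above; adapt it if the actual chunk objects differ:

import numpy as np
import soundfile as sf

# Collect streamed chunks and write them to a single WAV file (assumes 32 kHz samples)
stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)
chunks = [np.asarray(chunk) for chunk in stream]
sf.write("streamed.wav", np.concatenate(chunks), 32000)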

Usage tips:

  • Soprano works best when each sentence is between 2 and 15 seconds long.
  • Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them. Best results are achieved by spelling these out in their spoken form (1+1 -> one plus one, etc.); see the sketch after this list.
  • If Soprano produces an unsatisfactory result, simply regenerate; each run can produce a new, potentially better output. You can also change the sampling settings for more varied results.
  • Avoid improper grammar, such as not using contractions or leaving multiple spaces.
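
The number/symbol tip above can be handled with a small pre-processing step. The sketch below is not part of Soprano; it assumes the third-party num2words package for spelling out integers and only covers a few common symbols:

import re
from num2words import num2words  # third-party: pip install num2words

SYMBOLS = {"+": " plus ", "=": " equals ", "%": " percent"}

def normalize(text: str) -> str:
    # Spell out standalone integers, e.g. "42" -> "forty-two"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Replace a few common symbols with words
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    # Collapse any doubled spaces introduced above
    return re.sub(r"\s+", " ", text).strip()

print(normalize("1+1=2 is 100% true."))  # one plus one equals two is one hundred percent true.

The normalized string can then be passed to model.infer() as in the Python example above.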

Limitations

Soprano is currently English-only and does not support voice cloning. In addition, Soprano was trained on only 1,000 hours of audio (~100x less than other TTS models), so mispronunciation of uncommon words may occur. This is expected to diminish as Soprano is trained on more data.


License

This project is licensed under the Apache-2.0 license. See LICENSE for details.
