# Audio-Omni

**Unified Audio Understanding, Generation, and Editing (SIGGRAPH 2026)**
## Overview
Audio-Omni is the first end-to-end framework that unifies understanding, generation, and editing across general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.
## Capabilities
- Understanding: Audio/video captioning, question answering
- Generation: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
- Editing: Add, Remove, Extract, Style Transfer
## Model Files
| File | Description |
|------|-------------|
| `Audio-Omni.json` | Model configuration |
| `model.ckpt` | Model checkpoint (~21 GB) |
| `synchformer_state_dict.pth` | Synchformer checkpoint for video conditioning |
## Quick Start

### Installation

```bash
# Clone the GitHub repository
git clone https://github.com/ZeyueT/Audio-Omni.git
cd Audio-Omni

# Install dependencies
pip install -e .
conda install -c conda-forge ffmpeg libsndfile

# Download the model from Hugging Face
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
```
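After downloading, it can be useful to sanity-check that the files listed under Model Files actually landed in `model/`. A minimal sketch; `check_model_dir` is a hypothetical helper, not part of the Audio-Omni package:

```python
from pathlib import Path

# File names from the Model Files section above.
EXPECTED_FILES = [
    "Audio-Omni.json",
    "model.ckpt",
    "synchformer_state_dict.pth",
]

def check_model_dir(model_dir):
    """Return the names of expected model files missing from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).exists()]

missing = check_model_dir("model/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All model files present.")
```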
### Python API

```python
from audio_omni import AudioOmni
import torchaudio

# Load model
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
```
### 1. Understanding

```python
# Audio understanding
response = model.understand(
    "Describe the sounds in this audio.",
    audio="example.wav",
)

# Video understanding
response = model.understand(
    "What is happening in this video?",
    video="example.mp4",
)

# Audio + video understanding
response = model.understand(
    "Does the audio match the video?",
    audio="example.wav",
    video="example.mp4",
)
```
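When running the same question over many clips, a small wrapper that assembles the keyword arguments and enforces that at least one modality is supplied can catch mistakes before the model is even invoked. A hypothetical convenience helper, not part of the Audio-Omni API:

```python
def build_understand_kwargs(prompt, audio=None, video=None):
    """Collect arguments for model.understand, requiring at least one modality.

    Hypothetical helper; model.understand itself is the real entry point.
    """
    if audio is None and video is None:
        raise ValueError("Provide at least one of audio or video.")
    kwargs = {}
    if audio is not None:
        kwargs["audio"] = audio
    if video is not None:
        kwargs["video"] = video
    return prompt, kwargs

# Usage with the model loaded above:
# prompt, kwargs = build_understand_kwargs("Describe the sounds.", audio="example.wav")
# response = model.understand(prompt, **kwargs)
```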
### 2. Generation

```python
# Text-to-Audio
audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)

# Text-to-Music
audio = model.generate(
    "T2M",
    prompt="Compose a bright jazz swing instrumental with walking bass.",
)
torchaudio.save("music.wav", audio, model.sample_rate)

# Video-to-Audio
audio = model.generate("V2A", video_path="example.mp4")
torchaudio.save("v2a_output.wav", audio, model.sample_rate)

# Text-to-Speech
audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
torchaudio.save("tts_output.wav", audio, model.sample_rate)

# Text-to-Speech with voice cloning
audio = model.generate(
    "TTS",
    prompt="Hello, welcome to Audio-Omni.",
    voice_prompt_path="ref_voice.wav",
    voice_ref_text="This is the reference transcript.",
)
torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
```
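Each task code above takes a different extra argument, so a pre-flight check can fail fast on a malformed call. This sketch covers only the task codes shown in the examples; `validate_generate_args` is a hypothetical helper, not part of the Audio-Omni API:

```python
# Task codes from the examples above and the extra argument each one requires.
GENERATION_TASKS = {
    "T2A": {"prompt"},       # Text-to-Audio
    "T2M": {"prompt"},       # Text-to-Music
    "V2A": {"video_path"},   # Video-to-Audio
    "TTS": {"prompt"},       # Text-to-Speech (voice-cloning args are optional)
}

def validate_generate_args(task, **kwargs):
    """Check that a generate() call supplies the arguments its task code needs."""
    if task not in GENERATION_TASKS:
        raise ValueError(f"Unknown task code: {task!r}")
    missing = GENERATION_TASKS[task] - kwargs.keys()
    if missing:
        raise ValueError(f"{task} requires: {sorted(missing)}")
    return True

validate_generate_args("T2A", prompt="A clock ticking.")  # passes silently
```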
### 3. Editing

```python
# Add a sound
audio = model.edit("Add", "input.wav", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)

# Remove a sound
audio = model.edit("Remove", "input.wav", desc="female singing")
torchaudio.save("output_remove.wav", audio, model.sample_rate)

# Extract a sound
audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
torchaudio.save("output_extract.wav", audio, model.sample_rate)

# Style transfer
audio = model.edit(
    "Style Transfer",
    "input.wav",
    source_category="playing electric guitar",
    target_category="playing saxophone",
)
torchaudio.save("output_transfer.wav", audio, model.sample_rate)
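As with generation, the edit operations expect different keyword arguments: the first three take `desc`, while style transfer takes a source and a target category. A hypothetical pre-flight check mirroring the examples above, not part of the Audio-Omni API:

```python
# Edit operations from the examples above and the keyword arguments each expects.
EDIT_OPS = {
    "Add": {"desc"},
    "Remove": {"desc"},
    "Extract": {"desc"},
    "Style Transfer": {"source_category", "target_category"},
}

def validate_edit_args(op, **kwargs):
    """Check that an edit() call supplies the arguments its operation needs."""
    if op not in EDIT_OPS:
        raise ValueError(f"Unknown edit operation: {op!r}")
    missing = EDIT_OPS[op] - kwargs.keys()
    if missing:
        raise ValueError(f"{op!r} requires: {sorted(missing)}")
    return True

validate_edit_args(
    "Style Transfer",
    source_category="playing electric guitar",
    target_category="playing saxophone",
)  # passes silently
```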
## Gradio Demo

```bash
# Launch the interactive demo
python run_gradio.py \
    --model-config model/Audio-Omni.json \
    --ckpt-path model/model.ckpt \
    --server-port 7777
```
Visit http://localhost:7777 to access the web interface.
## Documentation
For detailed documentation, training instructions, and more examples, visit the GitHub repository.
## Citation

```bibtex
@article{tian2026audioomni,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:submit/7470507},
  year={2026}
}
```
## License
CC-BY-NC-4.0 (Non-commercial use only)
Commercial use of the model weights requires explicit written authorization from the authors.
For commercial licensing inquiries, contact: ztianad@connect.ust.hk
## Contact
- Zeyue Tian: ztianad@connect.ust.hk
For full installation guide, API reference, and advanced usage, see the GitHub repository.