MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
Paper: arXiv:2509.18196
This repository demonstrates the performance of the Qwen2.5-Omni and Qwen2-Audio models fine-tuned on the MNV-17 dataset for ASR with nonverbal vocalization (NV) recognition. It also provides inference scripts for both models.
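As a rough illustration of what such an inference script looks like, here is a hedged sketch using the Hugging Face `transformers` Qwen2-Audio API. The checkpoint ID points at the public base model (substitute your fine-tuned weights), and the prompt wording is an assumption, not the exact prompt used in the paper. The heavy imports are kept inside `transcribe` so the helper can be used without `transformers`/`librosa` installed.

```python
# Sketch: single-utterance transcription with a Qwen2-Audio checkpoint.
# MODEL_ID and the instruction text are assumptions; replace with your
# fine-tuned MNV-17 checkpoint and the prompt from the repo's scripts.
MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"


def build_conversation(audio_path: str) -> list:
    """Build the chat-template message list expected by Qwen2-Audio."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                # Hypothetical instruction; the paper's actual prompt may differ.
                {"type": "text", "text": "Transcribe the speech, marking nonverbal vocalizations."},
            ],
        }
    ]


def transcribe(audio_path: str, max_new_tokens: int = 256) -> str:
    """Load the model and run one generation pass (requires transformers, librosa)."""
    import librosa
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

    conversation = build_conversation(audio_path)
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audio, _ = librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)

    out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    out_ids = out_ids[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0]
```

See the repository's own scripts for the exact prompts and decoding settings used to produce the numbers below.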
Click here for interactive audio demo
Crucial Note: All demo samples are from speakers who were completely unseen during training.
This demonstrates that the model learned universal NV vocalization patterns rather than merely fitting specific speakers' habits, showcasing excellent cross-speaker generalization.
According to the experimental results reported in our paper:
| Model | Joint CER | NV Recognition Accuracy |
|---|---|---|
| Qwen2.5-Omni | 3.60% | 57.29% |
| Qwen2-Audio | 4.84% | 56.28% |
| SenseVoice | 8.71% | 57.29% |
| Paraformer | 5.70% | 28.64% |
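The Joint CER column above is a character error rate; lower is better, while NV recognition accuracy is higher-is-better. As a reference point only, here is a minimal sketch of the base CER metric (Levenshtein edits divided by reference length). The paper's "joint" variant presumably scores NV tags together with the transcript; the exact normalization there may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, O(len(ref)*len(hyp)) DP."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Example: one substitution in a four-character reference -> CER 0.25.
print(cer("abcd", "abed"))  # -> 0.25
```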
Base model: Qwen/Qwen2-Audio-7B-Instruct