MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
Paper: arXiv:2509.18196
This repository demonstrates the performance of the Qwen2.5-Omni and Qwen2-Audio models fine-tuned on the MNV-17 dataset for ASR with nonverbal vocalization (NV) recognition. It also provides inference scripts for both models.
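As a rough illustration of what such an inference script looks like, here is a hedged sketch using the Hugging Face `transformers` Qwen2-Audio API. The checkpoint ID points at the public base model (substitute your fine-tuned weights), and the prompt wording is an assumption, not the exact prompt used in the paper. The heavy imports are kept inside `transcribe` so the helper can be used without `transformers`/`librosa` installed.

```python
# Sketch: single-utterance transcription with a Qwen2-Audio checkpoint.
# MODEL_ID and the instruction text are assumptions; replace with your
# fine-tuned MNV-17 checkpoint and the prompt from the repo's scripts.
MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"


def build_conversation(audio_path: str) -> list:
    """Build the chat-template message list expected by Qwen2-Audio."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                # Hypothetical instruction; the paper's actual prompt may differ.
                {"type": "text", "text": "Transcribe the speech, marking nonverbal vocalizations."},
            ],
        }
    ]


def transcribe(audio_path: str, max_new_tokens: int = 256) -> str:
    """Load the model and run one generation pass (requires transformers, librosa)."""
    import librosa
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

    conversation = build_conversation(audio_path)
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audio, _ = librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)

    out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    out_ids = out_ids[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0]
```

See the repository's own scripts for the exact prompts and decoding settings used to produce the numbers below.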
Click here for interactive audio demo
Crucial Note: All demo samples are from speakers who were completely unseen during training.
This demonstrates that the model learned universal NV vocalization patterns rather than merely fitting specific speakers' habits, showcasing excellent cross-speaker generalization.
According to the experimental results reported in our paper:
| Model | Joint CER | NV Recognition Accuracy |
|---|---|---|
| Qwen2.5-Omni | 3.60% | 57.29% |
| Qwen2-Audio | 4.84% | 56.28% |
| SenseVoice | 8.71% | 57.29% |
| Paraformer | 5.70% | 28.64% |
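The Joint CER column above is a character error rate; lower is better, while NV recognition accuracy is higher-is-better. As a reference point only, here is a minimal sketch of the base CER metric (Levenshtein edits divided by reference length). The paper's "joint" variant presumably scores NV tags together with the transcript; the exact normalization there may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, O(len(ref)*len(hyp)) DP."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Example: one substitution in a four-character reference -> CER 0.25.
print(cer("abcd", "abed"))  # -> 0.25
```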
Base model: Qwen/Qwen2-Audio-7B-Instruct