FormalASR: End-to-End Spoken Chinese to Formal Text
Paper • 2605.19266 • Published
FormalASR-0.6B is an automatic speech recognition (ASR) model fine-tuned from Qwen3-ASR-0.6B and optimized for formal, written-style transcription: it outputs clean, punctuated, written-form text rather than colloquial spoken transcripts.
| Attribute | Value |
|---|---|
| Architecture | Qwen3ASRForConditionalGeneration |
| Base Model | Qwen3-ASR-0.6B |
| Parameters | ~0.6B |
| Dtype | bfloat16 |
| Audio Encoder | Whisper-like (18 layers, d_model=896) |
| Text Decoder | Qwen3 (28 layers, hidden=1024) |
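
The parameter count and dtype in the table give a rough lower bound on the memory needed just to hold the weights. The sketch below works through that arithmetic. It assumes exactly 0.6e9 parameters (the table only says "~0.6B") and ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate from the table above.
PARAMS = 0.6e9        # "~0.6B" from the table; the exact count is an assumption
BYTES_PER_PARAM = 2   # bfloat16 stores each value in 16 bits


def weight_memory_gib(params: float, bytes_per_param: int) -> float:
    """Return the size of the raw weights in GiB (2**30 bytes)."""
    return params * bytes_per_param / 2**30


estimate = weight_memory_gib(PARAMS, BYTES_PER_PARAM)
print(f"~{estimate:.2f} GiB of weights in bfloat16")
```

Under these assumptions the weights alone come to a little over 1 GiB.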
```bash
pip install -U qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "TaurenMountain/FormalASR-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)
print(results[0].text)
```
```python
import torch
from modelscope import snapshot_download
from qwen_asr import Qwen3ASRModel

# Download the model to a local directory (fetched automatically on the first run)
model_dir = snapshot_download("TaurenMountain/FormalASR-0.6B")

model = Qwen3ASRModel.from_pretrained(
    model_dir,
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)
print(results[0].text)
```
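
To transcribe several recordings, one simple option is to loop over the files and reuse the same `transcribe` call shown above. The helper below is a minimal sketch, not part of `qwen_asr`: `transcribe_files` is a hypothetical name, and it only assumes that `model.transcribe(audio=..., language=...)` returns a list whose first item has a `.text` attribute, as in the examples above.

```python
from typing import Any, Dict, Iterable


def transcribe_files(model: Any, paths: Iterable[str], language: str = "Chinese") -> Dict[str, str]:
    """Transcribe each file in turn and return a mapping of path -> text.

    `model` is expected to behave like the Qwen3ASRModel loaded above
    (hypothetical helper; one transcribe call per file, no batching).
    """
    transcripts: Dict[str, str] = {}
    for path in paths:
        results = model.transcribe(audio=path, language=language)
        transcripts[path] = results[0].text
    return transcripts


# Example usage with the model loaded earlier (file names are placeholders):
# transcripts = transcribe_files(model, ["meeting_01.wav", "meeting_02.wav"])
# for path, text in transcripts.items():
#     print(path, text)
```

Calling the model once per file keeps the sketch independent of any batching behavior in the library. It trades throughput for simplicity.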
This model is fine-tuned from Qwen3-ASR-0.6B on a curated dataset of formal speech paired with formal-style transcriptions. The fine-tuning process focuses on:
The model is evaluated on the SpeechIO-Formal benchmark, a formal-domain Chinese speech recognition test set covering news, presentations, lectures, and other formal speech scenarios.
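
The card does not state which metric the benchmark reports. Chinese ASR is commonly scored with character error rate (CER), so the sketch below assumes CER and shows how a formal-style hypothesis could be scored against a reference. The function names are illustrative, and no text normalization is applied. Punctuation therefore counts as characters here, which may differ from the benchmark's actual protocol.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# A hypothesis that drops one character and the final punctuation mark.
reference = "今天天气很好。"
hypothesis = "今天天气好"
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because formal transcription includes punctuation, whether punctuation is stripped before scoring can change the result noticeably. The example keeps it in to make that effect visible.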
Apache 2.0
If you use this model in your research, please cite:
```bibtex
@misc{ning2026formalasrendtoendspokenchinese,
  title={FormalASR: End-to-End Spoken Chinese to Formal Text},
  author={Wanyi Ning and Yinshang Guo and Haitao Qian and Jiyuan Cheng and Wei Zhou and Weiyuan Feng and Yufei Zhang},
  year={2026},
  eprint={2605.19266},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.19266},
}
```