FormalASR-0.6B

FormalASR-0.6B is an ASR (automatic speech recognition) model fine-tuned from Qwen3-ASR-0.6B and optimized for formal, written-style transcription: it outputs clean, punctuated, written-form text rather than colloquial spoken transcripts.

Model Description

Attribute	Value
Architecture	Qwen3ASRForConditionalGeneration
Base Model	Qwen3-ASR-0.6B
Parameters	~0.6B
Dtype	bfloat16
Audio Encoder	Whisper-like (18 layers, d_model=896)
Text Decoder	Qwen3 (28 layers, hidden=1024)

Key Features

🎯 Formal-style output: Produces formal, punctuated text suitable for documentation, subtitles, and professional use
⚡ Compact: Only 0.6B parameters, suitable for edge deployment
🔊 Long-form audio: Supports inference on up to 800 windows (~160 seconds) of audio
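
The two figures in the long-form bullet imply roughly 0.2 seconds of audio per window (160 s / 800). The sketch below shows one way to check a clip against that limit and to plan chunk boundaries for longer recordings. The helper names and the fixed-length chunking strategy are assumptions made for illustration; they are not part of the model's API, and the per-window duration is derived from the stated figures rather than documented.

```python
import math

MAX_WINDOWS = 800      # limit stated in the feature list
MAX_SECONDS = 160.0    # approximate duration stated in the feature list


def windows_needed(duration_s):
    """Estimate how many windows a clip occupies (assumes ~0.2 s per window)."""
    return math.ceil(duration_s * MAX_WINDOWS / MAX_SECONDS)


def fits_in_one_pass(duration_s):
    """True if the clip stays within the stated 800-window limit."""
    return windows_needed(duration_s) <= MAX_WINDOWS


def plan_chunks(duration_s, chunk_s=MAX_SECONDS):
    """Return (start, end) offsets in seconds for fixed-length chunks."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    print(windows_needed(30.0))
    print(plan_chunks(400.0))
```

Cutting on fixed boundaries can split a word in half, so in practice one would likely prefer to cut at silences; that refinement is left out here.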

Usage

Installation

pip install -U qwen-asr

HuggingFace

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "TaurenMountain/FormalASR-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)

print(results[0].text)

ModelScope

import torch
from modelscope import snapshot_download
from qwen_asr import Qwen3ASRModel

# Download the model to a local directory (fetched automatically on first run)
model_dir = snapshot_download("TaurenMountain/FormalASR-0.6B")

model = Qwen3ASRModel.from_pretrained(
    model_dir,
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)

print(results[0].text)
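
For more than one recording, the same call can be wrapped in a small loop. This is a minimal sketch that relies only on the `transcribe(audio=..., language=...)` call and the `results[0].text` attribute shown above. The helper names `transcribe_files` and `save_transcripts` and the one-text-file-per-recording output layout are hypothetical.

```python
from pathlib import Path


def transcribe_files(model, audio_paths, language="Chinese"):
    """Transcribe each file in turn and return {audio path: transcript}.

    `model` is expected to be the Qwen3ASRModel loaded as shown above.
    """
    transcripts = {}
    for path in audio_paths:
        results = model.transcribe(audio=str(path), language=language)
        transcripts[str(path)] = results[0].text
    return transcripts


def save_transcripts(transcripts, out_dir):
    """Write each transcript to <out_dir>/<audio stem>.txt (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for audio_path, text in transcripts.items():
        target = out_dir / (Path(audio_path).stem + ".txt")
        target.write_text(text, encoding="utf-8")
```

A call such as `save_transcripts(transcribe_files(model, ["a.wav", "b.wav"]), "out")` would then leave `out/a.txt` and `out/b.txt` on disk.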

Training Details

This model is fine-tuned from Qwen3-ASR-0.6B on a curated dataset of formal speech paired with formal-style transcriptions. The fine-tuning process focuses on:

Converting spoken language patterns to formal written text
Proper punctuation insertion
Handling of filler words and disfluencies
Improved text normalization

Evaluation

The model is evaluated on the SpeechIO-Formal benchmark, a formal-domain Chinese speech recognition test set covering news, presentations, lectures, and other formal speech scenarios.
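
This card does not state which metric the benchmark reports. Chinese ASR is commonly scored with character error rate (CER), so the sketch below assumes that metric for readers who want to score their own transcripts against references. It is an illustrative implementation, not the benchmark's official scoring script, and it counts punctuation as ordinary characters, which may differ from how the benchmark normalizes text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Whitespace is ignored; punctuation is kept (an assumption of this sketch).
    """
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

Because formal-style output includes punctuation, whether punctuation is scored changes the number noticeably; it is worth stating that choice alongside any CER reported with this sketch.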

License

Apache 2.0

Citation

If you use this model in your research, please cite:

@misc{ning2026formalasrendtoendspokenchinese,
      title={FormalASR: End-to-End Spoken Chinese to Formal Text},
      author={Wanyi Ning and Yinshang Guo and Haitao Qian and Jiyuan Cheng and Wei Zhou and Weiyuan Feng and Yufei Zhang},
      year={2026},
      eprint={2605.19266},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.19266},
}
