FormalASR-0.6B

FormalASR-0.6B is an automatic speech recognition (ASR) model fine-tuned from Qwen3-ASR-0.6B and optimized for formal, written-style transcription: it outputs clean, punctuated, written-form text rather than colloquial spoken transcripts.

Model Description

Attribute        Value
Architecture     Qwen3ASRForConditionalGeneration
Base Model       Qwen3-ASR-0.6B
Parameters       ~0.6B
Dtype            bfloat16
Audio Encoder    Whisper-like (18 layers, d_model=896)
Text Decoder     Qwen3 (28 layers, hidden=1024)

Key Features

  • 🎯 Formal-style output: Produces formal, punctuated text suitable for documentation, subtitles, and professional use
  • Compact: Only 0.6B parameters, suitable for edge deployment
  • 🔊 Long-form audio: Supports inference on up to 800 windows (~160 seconds) of audio
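
The 800-window limit and the ~160-second figure together imply roughly 0.2 seconds of audio per window. The following is a minimal sketch of a pre-flight length check built on that arithmetic, assuming a PCM WAV input. The constants and the helpers `wav_duration_seconds` and `fits_in_one_pass` are hypothetical names introduced for illustration; they are not part of `qwen_asr`.

```python
import wave

MAX_WINDOWS = 800            # limit quoted in the feature list
APPROX_MAX_SECONDS = 160.0   # approximate duration quoted in the feature list
# Derived from the two figures above, not documented separately.
SECONDS_PER_WINDOW = APPROX_MAX_SECONDS / MAX_WINDOWS


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_in_one_pass(duration_seconds: float) -> bool:
    """Return True if the clip is within the approximate 160-second limit."""
    return duration_seconds <= APPROX_MAX_SECONDS
```

Because the 160-second figure is approximate, a clip close to the limit may be safer to split before transcription.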

Usage

Installation

pip install -U qwen-asr

HuggingFace

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "TaurenMountain/FormalASR-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)

print(results[0].text)

ModelScope

import torch
from modelscope import snapshot_download
from qwen_asr import Qwen3ASRModel

# Download the model to a local directory (downloaded automatically on first run)
model_dir = snapshot_download("TaurenMountain/FormalASR-0.6B")

model = Qwen3ASRModel.from_pretrained(
    model_dir,
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=512,
)

results = model.transcribe(
    audio="your_audio.wav",
    language="Chinese",
)

print(results[0].text)

Training Details

This model is fine-tuned from Qwen3-ASR-0.6B on a curated dataset of formal speech paired with formal-style transcriptions. The fine-tuning process focuses on:

  • Converting spoken language patterns to formal written text
  • Proper punctuation insertion
  • Handling of filler words and disfluencies
  • Improved text normalization
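
To make the spoken-to-formal transformation concrete, here is a toy, rule-based illustration of filler removal. It is not how FormalASR works: the model learns this mapping end to end during fine-tuning. The filler inventory `FILLERS` and the function `strip_fillers` are hypothetical, and a naive rule like this one also deletes legitimate uses of the same characters.

```python
import re

# Hypothetical filler inventory, for illustration only.
FILLERS = ("嗯", "呃", "啊")


def strip_fillers(text: str) -> str:
    """Remove every listed filler character and a comma directly after it."""
    pattern = "(?:" + "|".join(map(re.escape, FILLERS)) + ")[,,]?"
    return re.sub(pattern, "", text)


print(strip_fillers("嗯,我们今天呃开会。"))  # 我们今天开会。
```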

Evaluation

The model is evaluated on the SpeechIO-Formal benchmark, a formal-domain Chinese speech recognition test set covering news, presentations, lectures, and other formal speech scenarios.
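
This card does not state which metric is reported. Chinese ASR is commonly scored with character error rate (CER), so the following is a minimal sketch of CER under that assumption: the Levenshtein distance between reference and hypothesis, divided by the reference length. The function names are illustrative and are not taken from any benchmark toolkit.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("今天天气很好", "今天天气好")` counts one deleted character against a six-character reference, giving 1/6. Real evaluations usually normalize both strings first (for example, punctuation and digit handling); that step is omitted here.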

License

Apache 2.0

Citation

If you use this model in your research, please cite:

@misc{ning2026formalasrendtoendspokenchinese,
      title={FormalASR: End-to-End Spoken Chinese to Formal Text},
      author={Wanyi Ning and Yinshang Guo and Haitao Qian and Jiyuan Cheng and Wei Zhou and Weiyuan Feng and Yufei Zhang},
      year={2026},
      eprint={2605.19266},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.19266},
}