Model Card: primeline-parakeet
Description
primeline-parakeet is a state-of-the-art, 600-million-parameter multilingual Automatic Speech Recognition (ASR) model, specifically optimized for high-precision German transcription. It is based on the NVIDIA parakeet-tdt-0.6b-v3 architecture, utilizing the efficient FastConformer encoder and Token-and-Duration Transducer (TDT) decoder.
While the base model provides broad European language support, primeline-parakeet has been refined to deliver superior accuracy in German contexts, significantly reducing Word Error Rates (WER) across diverse benchmarks compared to both the original NVIDIA release and various Whisper-based architectures.
Key Features
- Optimized for German: Exceptional performance on German datasets like Tuda-De.
- High Efficiency: Built on the TDT architecture, offering significantly higher throughput than standard Transducer models.
- Rich Outputs: Includes automatic punctuation, capitalization, and precise word-level timestamps.
- Robustness: Maintains high accuracy across different domains, from clean read speech to spontaneous conversations.
- Long-Audio Support: Capable of transcribing audio files up to several hours in length using local attention mechanisms.
Performance
The following table compares the Word Error Rate (WER %) of primeline-parakeet against the base model and other industry standards. Lower is better.
| Model | All (Avg) | Tuda-De | Multilingual LibriSpeech | Common Voice 19.0 |
|---|---|---|---|---|
| primeline-parakeet | 2.95 | 4.11 | 2.60 | 3.03 |
| nvidia-parakeet-tdt-0.6b-v3 | 3.64 | 7.05 | 2.95 | 3.70 |
| openai-whisper-large-v3 | 3.28 | 7.86 | 2.85 | 3.46 |
| openai-whisper-large-v3-turbo | 3.64 | 8.20 | 3.19 | 3.85 |
Analysis
primeline-parakeet demonstrates a significant leap in performance for German speech-to-text:
- 41% relative improvement on the Tuda-De benchmark compared to the NVIDIA base model (4.11 vs. 7.05 WER).
- Outperforms OpenAI Whisper-large-v3 across all tested categories while maintaining a much smaller and more efficient parameter count (0.6B).
Model Architecture
- Architecture Type: FastConformer-TDT (Hybrid Transducer/CTC)
- Parameters: 600 Million
- Input: 16 kHz mono-channel audio (WAV, FLAC); see the preprocessing sketch after this list for converting other formats.
- Output: Text (including Punctuation and Capitalization)
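Since the model expects 16 kHz mono input, audio in other sample rates or channel layouts should be converted first. Below is a minimal preprocessing sketch, assuming the librosa and soundfile packages are installed; the input filename is hypothetical.

```python
import librosa
import soundfile as sf

# Load any audio file and resample to 16 kHz mono, as expected by the model.
# "interview.mp3" is a hypothetical input file.
audio, sr = librosa.load("interview.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV that can then be passed to asr_model.transcribe().
sf.write("interview_16k.wav", audio, sr)
```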
Use Cases
This model is designed for developers and researchers requiring high-speed, high-accuracy German transcription for:
- Media & Entertainment: Subtitle generation and automated captioning.
- Enterprise: Meeting minutes, call center analytics, and documentation.
- Accessibility: Real-time speech-to-text services.
- Research: Large-scale linguistic data processing.
Why This Model is a Game Changer: Instant Domain Adaptation
Beyond its compact 600M parameter size and blazing-fast TDT inference speed, the true power of primeline-parakeet lies in its architecture's compatibility with external Language Models (LM).
Unlike many modern ASR models that are "locked" after training, this model supports Shallow Fusion with KenLM-based N-gram models. This allows for massive accuracy gains without the need to retrain the neural network itself:
- Zero-Retrain Customization: You can enhance the ASR accuracy by simply training a lightweight, "cheap" LM on pure text data (e.g., legal documents, medical records, or company-specific jargon).
- Drastic Error Reduction: Internal tests show that adding a general-purpose LM can lower the Word Error Rate (WER) by up to 20%.
- Niche Specialization: When targeting specific industries with unique vocabulary, the enhancement is even more significant, allowing the model to recognize specialized terms that standard models would miss.
- Low Resource Requirements: Since the LM only processes text and works alongside the pre-trained ASR model, you can adapt your pipeline to new domains in minutes on standard CPU hardware.
This makes primeline-parakeet not just a static model, but a highly adaptable ASR engine that grows with your specific data needs.
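The sketch below illustrates the shallow-fusion idea conceptually: candidate transcriptions are rescored by adding a weighted n-gram LM score to the ASR score. It is an illustration only, not the model's built-in fusion path: it assumes the `kenlm` Python package and an already-built binary LM (`domain.bin`), and the hypotheses, scores, and weight are hypothetical. For the actual fusion setup, refer to NeMo's ASR language-modeling documentation.

```python
import kenlm

# Hypothetical n-gram LM trained on domain-specific German text (e.g. built with KenLM's lmplz).
lm = kenlm.Model("domain.bin")

# Hypothetical n-best hypotheses from the ASR model, paired with their ASR log-scores.
hypotheses = [
    ("der patient erhält zehn milligramm ramipril", -4.2),
    ("der patient erhält zehn milligramm rami pril", -4.0),
]

def fused_score(text: str, asr_score: float, lm_weight: float = 0.5) -> float:
    # Shallow fusion: combine the ASR score with the weighted LM log-probability.
    return asr_score + lm_weight * lm.score(text, bos=True, eos=True)

# Pick the hypothesis with the best combined score.
best_text, _ = max(hypotheses, key=lambda h: fused_score(h[0], h[1]))
print(best_text)
```

In practice, the LM weight is tuned on a small development set; if it is set too high, the text LM can override acoustically clear words.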
Usage
To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
```bash
pip install -U nemo_toolkit['asr']
```
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
```python
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import ASRModel

# Download the .nemo checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="primeline/parakeet-primeline", filename="2_95WER.nemo"
)

# Restore the model on CPU and switch to inference mode
asr_model = ASRModel.restore_from(model_path, map_location="cpu")
asr_model.eval()
```
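If a GPU is available, the restored model can be moved there for faster inference (a minimal sketch, assuming a CUDA device is present):

```python
# Optional: move the model to a CUDA GPU; skip this step for CPU-only inference.
asr_model = asr_model.to("cuda")
```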
Transcribing using Python
First, let's get a sample:
```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
Transcribing with timestamps
To transcribe with timestamps:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)

# by default, timestamps are enabled at char, word, and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
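```

For subtitle generation (see the Media & Entertainment use case above), the segment-level timestamps can be written out as a simple SRT file. This is a minimal sketch reusing `segment_timestamps` from the previous snippet; the output filename is an assumption.

```python
def to_srt_time(seconds: float) -> str:
    # Format seconds as the SRT timestamp format HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Write one SRT cue per transcribed segment
with open("transcript.srt", "w", encoding="utf-8") as srt:
    for i, stamp in enumerate(segment_timestamps, start=1):
        srt.write(f"{i}\n")
        srt.write(f"{to_srt_time(stamp['start'])} --> {to_srt_time(stamp['end'])}\n")
        srt.write(f"{stamp['segment']}\n\n")
```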
Transcribing long-form audio
```python
# update the self-attention model of the FastConformer encoder:
# switch to local attention with left/right context sizes of 256
asr_model.change_attention_model(self_attention_model="rel_pos_local_attn", att_context_size=[256, 256])

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
Technical Limitations
- Accuracy: While highly accurate, transcripts may still contain errors depending on audio quality, heavy accents, or extreme background noise.
- Out-of-Vocabulary (OOV): Rare technical terms or highly specific jargon not present in the training data may not be recognized correctly.
Ethical Considerations
Users should be aware of potential biases inherent in the training data. This model is intended for transcription purposes only and should be evaluated for specific use cases to ensure it meets safety and fairness requirements.
License
Use of this model is governed by the CC-BY-4.0 license, consistent with the base model's licensing.
Disclaimer
This model is not a product of the primeLine Group.
It represents research conducted by [Florian Zimmermeister](https://huggingface.co/flozi00), with computing power sponsored by primeLine.
The model is published under this account by primeLine, but it is not a commercial product of primeLine Solutions GmbH.
Please be aware that while we have tested and developed this model to the best of our abilities, errors may still occur.
Use of this model is at your own risk. We do not accept liability for any incorrect outputs generated by this model.
Model author: Florian Zimmermeister