Parakeet Realtime EOU 120M v1 ONNX
Converted ONNX package of nvidia/parakeet_realtime_eou_120m-v1 for use with @asrjs/speech-recognition.
This repository is not the original NVIDIA training checkpoint repo. It contains exported runtime artifacts for browser and Node.js inference.
Included Artifacts
- encoder-model.onnx
- decoder_joint-model.onnx
- encoder-model.fp16.onnx
- decoder_joint-model.fp16.onnx
- encoder-model.int8.onnx
- decoder_joint-model.int8.onnx
- vocab.txt
- config.json
Model Summary
Parakeet Realtime EOU 120M v1 is a streaming English ASR model with:
- cache-aware FastConformer encoder
- RNNT decoder
- explicit <EOU> token emission for end-of-utterance detection
- low-latency voice-agent-oriented output
The model:
- supports English only
- does not emit punctuation or capitalization
- may emit empty visible text for non-speech audio
- keeps <EOU> in raw/native output; user-visible text should strip it
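The split between raw and user-visible text can be sketched in a few lines. This is a minimal illustration, not part of @asrjs/speech-recognition: the helper names `toVisibleText` and `endsWithEou` are hypothetical, and the only thing taken from the model card is the `<EOU>` marker itself.

```javascript
// Hypothetical helpers (not a library API): derive user-visible text
// from raw model output that may still contain the <EOU> marker.
const EOU_TOKEN = '<EOU>';

// Remove every <EOU> marker and collapse leftover whitespace.
function toVisibleText(rawText) {
  return rawText
    .split(EOU_TOKEN)
    .join(' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// True when the raw text ends with an end-of-utterance marker.
function endsWithEou(rawText) {
  return rawText.trimEnd().endsWith(EOU_TOKEN);
}
```

For non-speech audio the raw text may contain only the marker, in which case `toVisibleText` returns an empty string.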
Architecture details:
- FastConformer-RNNT
- 17 encoder layers
- about 120M parameters
Frontend / Preprocessing
The upstream model expects raw 16 kHz mono audio and uses a NeMo mel frontend internally.
For @asrjs/speech-recognition, this ONNX package is intended to run with the shared in-repo JavaScript NeMo frontend. A dedicated nemo80.onnx or nemo128.onnx preprocessor is intentionally not required.
Frontend assumptions:
- sample rate: 16000 Hz
- mono audio
- mel bins: 128
- valid length mode: centered
- frontend output: raw log-mel features
Note that this differs from some other NeMo exports: this model does not use the normalized nemo128 frontend contract.
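Because the model expects 16 kHz mono input, it can help to check audio before handing it to the frontend. The sketch below is an illustration only: `downmixToMono` and `assertSampleRate` are hypothetical helpers, not library functions, and the code assumes interleaved float samples. It does not resample; audio at another rate would need a separate resampler.

```javascript
// Hypothetical input checks (not a library API) for the stated
// frontend assumptions: 16000 Hz sample rate, mono audio.
const EXPECTED_SAMPLE_RATE = 16000;

// Average interleaved multi-channel samples into a single mono channel.
function downmixToMono(interleaved, channelCount) {
  if (!Number.isInteger(channelCount) || channelCount < 1) {
    throw new RangeError('channelCount must be a positive integer');
  }
  if (interleaved.length % channelCount !== 0) {
    throw new RangeError('sample count is not a multiple of channelCount');
  }
  const frames = interleaved.length / channelCount;
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channelCount; c++) {
      sum += interleaved[i * channelCount + c];
    }
    mono[i] = sum / channelCount;
  }
  return mono;
}

// Fail early instead of feeding audio at the wrong rate to the model.
function assertSampleRate(sampleRate) {
  if (sampleRate !== EXPECTED_SAMPLE_RATE) {
    throw new Error(
      `expected ${EXPECTED_SAMPLE_RATE} Hz audio, got ${sampleRate} Hz`
    );
  }
}
```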
Quantization Notes
Included variants:
- FP32
- FP16
- INT8 encoder
- INT8 decoder
Port validation summary on the smoke fixture:
- FP32: exact token/text/raw-text parity
- FP16: exact token/text/raw-text parity
- decoder-only INT8: exact token/text/raw-text parity
- encoder-only INT8: not exact
- full int8/int8: not exact
Recommended default pairings:
- fp32/fp32
- fp16/fp16
- fp32/int8 if you specifically want decoder-only INT8
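The pairings above are written as encoder/decoder and line up with the file names under Included Artifacts. As a rough sketch, the mapping from a pairing to its two files could look like this. `artifactFiles` and `hasExactParity` are hypothetical helpers written for illustration, not part of @asrjs/speech-recognition, and the parity list only reflects the smoke-fixture results reported here.

```javascript
// Hypothetical helpers (not a library API). File name suffixes follow
// the artifact list: no suffix for FP32, ".fp16" and ".int8" otherwise.
const QUANT_SUFFIX = new Map([
  ['fp32', ''],
  ['fp16', '.fp16'],
  ['int8', '.int8'],
]);

// Pairings (encoder/decoder) reported as exact on the smoke fixture.
const EXACT_PARITY_PAIRINGS = new Set(['fp32/fp32', 'fp16/fp16', 'fp32/int8']);

// Resolve the encoder and decoder file names for a quantization pairing.
function artifactFiles(encoderQuant, decoderQuant) {
  for (const quant of [encoderQuant, decoderQuant]) {
    if (!QUANT_SUFFIX.has(quant)) {
      throw new Error(`unknown quantization: ${quant}`);
    }
  }
  return {
    encoder: `encoder-model${QUANT_SUFFIX.get(encoderQuant)}.onnx`,
    decoder: `decoder_joint-model${QUANT_SUFFIX.get(decoderQuant)}.onnx`,
  };
}

// True for the pairings listed as recommended defaults.
function hasExactParity(encoderQuant, decoderQuant) {
  return EXACT_PARITY_PAIRINGS.has(`${encoderQuant}/${decoderQuant}`);
}
```

For example, `artifactFiles('fp32', 'int8')` selects the FP32 encoder together with the INT8 decoder, which is the decoder-only INT8 pairing.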
Usage with @asrjs/speech-recognition
Preset usage
import { createSpeechPipeline, PcmAudioBuffer } from '@asrjs/speech-recognition';

const pipeline = createSpeechPipeline({ cacheModels: true });
const loaded = await pipeline.loadModel({
  preset: 'parakeet',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
});

// pcmFloat32 is assumed to hold mono 16 kHz float PCM samples supplied by the caller.
const audio = PcmAudioBuffer.fromMono(pcmFloat32, 16000);
const result = await loaded.transcribe(audio, {
  detail: 'detailed',
  responseFlavor: 'canonical+native',
});

console.log(result.canonical.text);
console.log(result.native.rawUtteranceText);
Direct source usage
// Reuses the `pipeline` created in the preset example above.
const loaded = await pipeline.loadModel({
  family: 'nemo-rnnt',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
  options: {
    source: {
      kind: 'huggingface',
      repoId: 'ysdede/parakeet-realtime-eou-120m-v1-onnx',
      preprocessorBackend: 'js',
      encoderQuant: 'fp32',
      decoderQuant: 'fp32',
    },
  },
});
Voice-Agent Context
The original model card highlights voice-agent usage, especially streaming end-of-utterance detection.
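One way a voice agent might act on the `<EOU>` marker is to buffer raw text until the marker appears and then hand the finished utterance to the next stage. The sketch below is an assumption-heavy illustration: `UtteranceSegmenter` is a hypothetical class, not part of @asrjs/speech-recognition, and it assumes the application receives raw text in incremental chunks, which this model card does not specify.

```javascript
// Hypothetical utterance segmenter (not a library API): buffers raw
// text chunks and emits one cleaned utterance per <EOU> marker.
class UtteranceSegmenter {
  constructor(eouToken = '<EOU>') {
    this.eouToken = eouToken;
    this.pending = '';
  }

  // Feed newly decoded raw text; returns utterances completed by it.
  push(rawDelta) {
    this.pending += rawDelta;
    const completed = [];
    let index = this.pending.indexOf(this.eouToken);
    while (index !== -1) {
      const text = this.pending.slice(0, index).replace(/\s+/g, ' ').trim();
      this.pending = this.pending.slice(index + this.eouToken.length);
      // Skip empty utterances, e.g. a marker emitted for non-speech audio.
      if (text.length > 0) {
        completed.push(text);
      }
      index = this.pending.indexOf(this.eouToken);
    }
    return completed;
  }

  // Text of the utterance still in progress.
  get partial() {
    return this.pending.replace(/\s+/g, ' ').trim();
  }
}

// Usage: chunks arrive over time; a turn ends when <EOU> shows up.
const segmenter = new UtteranceSegmenter();
segmenter.push('turn on the');
for (const utterance of segmenter.push(' kitchen lights<EOU>')) {
  console.log('utterance complete:', utterance);
}
```

Keeping the in-progress text in `partial` lets a UI show interim results while the agent waits for the marker before responding.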
Upstream Model and License
Original model: nvidia/parakeet_realtime_eou_120m-v1
This converted package follows the upstream NVIDIA Open Model License terms.