Parakeet Realtime EOU 120M v1 ONNX

Converted ONNX package of nvidia/parakeet_realtime_eou_120m-v1 for use with @asrjs/speech-recognition.

This repository is not the original NVIDIA training checkpoint repo. It contains exported runtime artifacts for browser and Node.js inference.

Included Artifacts

  • encoder-model.onnx
  • decoder_joint-model.onnx
  • encoder-model.fp16.onnx
  • decoder_joint-model.fp16.onnx
  • encoder-model.int8.onnx
  • decoder_joint-model.int8.onnx
  • vocab.txt
  • config.json

Model Summary

Parakeet Realtime EOU 120M v1 is a streaming English ASR model with:

  • cache-aware FastConformer encoder
  • RNNT decoder
  • explicit <EOU> token emission for end-of-utterance detection
  • low-latency voice-agent-oriented output

The model:

  • supports English only
  • does not emit punctuation or capitalization
  • may emit empty visible text for non-speech audio
  • keeps <EOU> in raw/native output while user-visible text should strip it
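Since raw output keeps the <EOU> marker while display text should not, a minimal sketch of the stripping step might look like this (the helper name is hypothetical, not part of the package):

```javascript
// Hypothetical helper: remove the <EOU> marker from raw RNNT output
// before showing text to users. The marker itself stays available in
// the raw/native output for end-of-utterance handling.
function stripEou(rawText) {
  return rawText.replaceAll('<EOU>', '').trim();
}

console.log(stripEou('turn off the lights <EOU>')); // "turn off the lights"
```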

Architecture details:

  • FastConformer-RNNT
  • 17 encoder layers
  • about 120M parameters

Frontend / Preprocessing

The upstream model expects raw 16 kHz mono audio and uses a NeMo mel frontend internally.

For @asrjs/speech-recognition, this ONNX package is intended to run with the shared in-repo JavaScript NeMo frontend; it intentionally does not ship a dedicated nemo80.onnx or nemo128.onnx preprocessor.

Frontend assumptions:

  • sample rate: 16000
  • mono audio
  • mel bins: 128
  • valid length mode: centered
  • frontend output: raw log-mel features

Note the distinction: this model consumes raw log-mel features and does not use the normalized nemo128 frontend contract reused by some other NeMo exports.
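Because the frontend assumes 16 kHz mono input, stereo captures need a downmix before building the audio buffer. A minimal sketch (this helper is illustrative, not part of @asrjs/speech-recognition):

```javascript
// Hypothetical pre-processing step: downmix interleaved stereo
// Float32 PCM to mono by averaging the two channels, so the result
// matches the frontend's 16 kHz mono assumption (resampling to
// 16 kHz, if needed, is a separate step not shown here).
function downmixToMono(interleaved) {
  const mono = new Float32Array(interleaved.length / 2);
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}
```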

Quantization Notes

Included variants:

  • FP32
  • FP16
  • INT8 encoder
  • INT8 decoder

Port validation summary on the smoke fixture:

  • FP32: exact token/text/raw-text parity
  • FP16: exact token/text/raw-text parity
  • decoder-only INT8: exact token/text/raw-text parity
  • encoder-only INT8: not exact
  • full int8/int8: not exact

Recommended default pairings:

  • fp32/fp32
  • fp16/fp16
  • fp32/int8 if you specifically want decoder-only INT8
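The validated pairings above can be captured in a small lookup that rejects the combinations that did not reach exact parity. This is an illustrative sketch (the map and function names are hypothetical); the `encoderQuant`/`decoderQuant` values mirror the source options shown later in this card:

```javascript
// Hypothetical convenience map: each entry is a pairing that reached
// exact token/text/raw-text parity on the smoke fixture.
const VALIDATED_PAIRINGS = {
  'fp32/fp32': { encoderQuant: 'fp32', decoderQuant: 'fp32' },
  'fp16/fp16': { encoderQuant: 'fp16', decoderQuant: 'fp16' },
  'fp32/int8': { encoderQuant: 'fp32', decoderQuant: 'int8' }, // decoder-only INT8
};

function quantOptions(pairing) {
  const opts = VALIDATED_PAIRINGS[pairing];
  // Reject pairings that were not exact (e.g. encoder-only INT8, int8/int8).
  if (!opts) throw new Error(`unvalidated pairing: ${pairing}`);
  return opts;
}
```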

Usage with @asrjs/speech-recognition

Preset usage

import { createSpeechPipeline, PcmAudioBuffer } from '@asrjs/speech-recognition';

const pipeline = createSpeechPipeline({ cacheModels: true });

const loaded = await pipeline.loadModel({
  preset: 'parakeet',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
});

const audio = PcmAudioBuffer.fromMono(pcmFloat32, 16000);
const result = await loaded.transcribe(audio, {
  detail: 'detailed',
  responseFlavor: 'canonical+native',
});

console.log(result.canonical.text);
console.log(result.native.rawUtteranceText);

Direct source usage

const loaded = await pipeline.loadModel({
  family: 'nemo-rnnt',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
  options: {
    source: {
      kind: 'huggingface',
      repoId: 'ysdede/parakeet-realtime-eou-120m-v1-onnx',
      preprocessorBackend: 'js',
      encoderQuant: 'fp32',
      decoderQuant: 'fp32',
    },
  },
});

Voice-Agent Context

The original model card highlights voice-agent usage, especially streaming end-of-utterance detection.

Upstream Model and License

Original model: nvidia/parakeet_realtime_eou_120m-v1

This converted package follows the upstream NVIDIA Open Model License terms.
