Parakeet Realtime EOU 120M v1 ONNX

Converted ONNX package of nvidia/parakeet_realtime_eou_120m-v1 for use with @asrjs/speech-recognition.

This repository is not the original NVIDIA training checkpoint repo. It contains exported runtime artifacts for browser and Node.js inference.

Included Artifacts

  • encoder-model.onnx
  • decoder_joint-model.onnx
  • encoder-model.fp16.onnx
  • decoder_joint-model.fp16.onnx
  • encoder-model.int8.onnx
  • decoder_joint-model.int8.onnx
  • vocab.txt
  • config.json

Model Summary

Parakeet Realtime EOU 120M v1 is a streaming English ASR model with:

  • cache-aware FastConformer encoder
  • RNNT decoder
  • explicit <EOU> token emission for end-of-utterance detection
  • low-latency voice-agent-oriented output

The model:

  • supports English only
  • does not emit punctuation or capitalization
  • may emit empty visible text for non-speech audio
  • keeps <EOU> in raw/native output while user-visible text should strip it
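Since raw output keeps the <EOU> marker while display text should not, a minimal sketch of the stripping step might look like this (the helper name is hypothetical, not part of the package):

```javascript
// Hypothetical helper: remove the <EOU> marker from raw RNNT output
// before showing text to users. The marker itself stays available in
// the raw/native output for end-of-utterance handling.
function stripEou(rawText) {
  return rawText.replaceAll('<EOU>', '').trim();
}

console.log(stripEou('turn off the lights <EOU>')); // "turn off the lights"
```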

Architecture details:

  • FastConformer-RNNT
  • 17 encoder layers
  • about 120M parameters

Frontend / Preprocessing

The upstream model expects raw 16 kHz mono audio and uses a NeMo mel frontend internally.

For @asrjs/speech-recognition, this ONNX package is intended to run with the shared in-repo JavaScript NeMo frontend; it intentionally does not ship a dedicated nemo80.onnx or nemo128.onnx preprocessor.

Frontend assumptions:

  • sample rate: 16000
  • mono audio
  • mel bins: 128
  • valid length mode: centered
  • frontend output: raw log-mel features

Note the distinction: this model consumes raw log-mel features and does not use the normalized nemo128 frontend contract reused by some other NeMo exports.
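Because the frontend assumes 16 kHz mono input, stereo captures need a downmix before building the audio buffer. A minimal sketch (this helper is illustrative, not part of @asrjs/speech-recognition):

```javascript
// Hypothetical pre-processing step: downmix interleaved stereo
// Float32 PCM to mono by averaging the two channels, so the result
// matches the frontend's 16 kHz mono assumption (resampling to
// 16 kHz, if needed, is a separate step not shown here).
function downmixToMono(interleaved) {
  const mono = new Float32Array(interleaved.length / 2);
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}
```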

Quantization Notes

Included variants:

  • FP32
  • FP16
  • INT8 encoder
  • INT8 decoder

Port validation summary on the smoke fixture:

  • FP32: exact token/text/raw-text parity
  • FP16: exact token/text/raw-text parity
  • decoder-only INT8: exact token/text/raw-text parity
  • encoder-only INT8: not exact
  • full int8/int8: not exact

Recommended default pairings:

  • fp32/fp32
  • fp16/fp16
  • fp32/int8 if you specifically want decoder-only INT8
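The validated pairings above can be captured in a small lookup that rejects the combinations that did not reach exact parity. This is an illustrative sketch (the map and function names are hypothetical); the `encoderQuant`/`decoderQuant` values mirror the source options shown later in this card:

```javascript
// Hypothetical convenience map: each entry is a pairing that reached
// exact token/text/raw-text parity on the smoke fixture.
const VALIDATED_PAIRINGS = {
  'fp32/fp32': { encoderQuant: 'fp32', decoderQuant: 'fp32' },
  'fp16/fp16': { encoderQuant: 'fp16', decoderQuant: 'fp16' },
  'fp32/int8': { encoderQuant: 'fp32', decoderQuant: 'int8' }, // decoder-only INT8
};

function quantOptions(pairing) {
  const opts = VALIDATED_PAIRINGS[pairing];
  // Reject pairings that were not exact (e.g. encoder-only INT8, int8/int8).
  if (!opts) throw new Error(`unvalidated pairing: ${pairing}`);
  return opts;
}
```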

Usage with @asrjs/speech-recognition

Preset usage

import { createSpeechPipeline, PcmAudioBuffer } from '@asrjs/speech-recognition';

const pipeline = createSpeechPipeline({ cacheModels: true });

const loaded = await pipeline.loadModel({
  preset: 'parakeet',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
});

const audio = PcmAudioBuffer.fromMono(pcmFloat32, 16000);
const result = await loaded.transcribe(audio, {
  detail: 'detailed',
  responseFlavor: 'canonical+native',
});

console.log(result.canonical.text);
console.log(result.native.rawUtteranceText);

Direct source usage

const loaded = await pipeline.loadModel({
  family: 'nemo-rnnt',
  modelId: 'nvidia/parakeet_realtime_eou_120m-v1',
  backend: 'wasm',
  options: {
    source: {
      kind: 'huggingface',
      repoId: 'ysdede/parakeet-realtime-eou-120m-v1-onnx',
      preprocessorBackend: 'js',
      encoderQuant: 'fp32',
      decoderQuant: 'fp32',
    },
  },
});

Voice-Agent Context

The original model card highlights voice-agent usage, especially streaming end-of-utterance detection.

Upstream Model and License

Original model: nvidia/parakeet_realtime_eou_120m-v1

This converted package follows the upstream NVIDIA Open Model License terms.
