Does decoding efficiency decrease as the audio length increases?
Dear Developers,
Title: RNNT Streaming Decoding: Performance and Incremental Output
Issue Description:
During decoding, the model repeatedly feeds the previously decoded content as input for the next step. For very long audio, could this cause inference efficiency to decrease over time?
Currently, each output returns the full decoded text, for example:
"Hello" "Hello, How" "Hello, How are you?"Ideally, we would like to output only the incremental portion, e.g.:
"Hello" ", How" "are you?"
Test Results:
Average inference time (ms) over steps:
steps 0-99 : 70.14
steps 100-199 : 54.08
steps 200-299 : 54.14
steps 300-399 : 55.54
steps 400-499 : 54.73
steps 500-599 : 55.57
steps 600-699 : 65.92
steps 700-799 : 55.98
steps 800-899 : 57.52
steps 900-999 : 58.23
steps 1000-1099: 60.70
steps 1100-1199: 63.02
steps 1200-1299: 64.57
steps 1300-1399: 64.03
steps 1400-1499: 65.82
steps 1500-1599: 66.68
steps 1600-1699: 64.36
steps 1700-1799: 66.54
steps 1800-1899: 68.32
steps 1900-1946: 76.98
The results indicate that inference time gradually increases as more steps are processed, with a noticeable rise in later stages.
Attempted Solution:
Tried periodically clearing previous_hypotheses and pred_out_stream (e.g., every 100 inference steps) to limit the historical context; a rough sketch of this reset is shown after this list. However, this approach causes two problems:
- Certain punctuation may completely disappear in some segments.
- Portions of the transcribed text can be lost.
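For reference, a self-contained sketch of the reset experiment. stream_step() below is a placeholder standing in for one streaming inference step, not a real NeMo call:

```python
# Hypothetical sketch of the "clear history every N steps" experiment.
# stream_step() is a placeholder, not a real NeMo API.
from typing import Any, List, Optional, Tuple

def stream_step(chunk: Any,
                pred_out_stream: Optional[List[str]],
                previous_hypotheses: Optional[List[str]]) -> Tuple[List[str], List[str]]:
    """Placeholder for one streaming inference step that extends both states."""
    pred_out_stream = (pred_out_stream or []) + [f"token_{chunk}"]
    previous_hypotheses = (previous_hypotheses or []) + [f"hyp_{chunk}"]
    return pred_out_stream, previous_hypotheses

RESET_EVERY = 100  # assumed reset interval
pred_out_stream, previous_hypotheses = None, None

for step in range(1000):  # stand-in for iterating over streamed audio chunks
    if step > 0 and step % RESET_EVERY == 0:
        # Dropping the accumulated history: this is the point that coincided
        # with missing punctuation and lost text in some segments.
        pred_out_stream, previous_hypotheses = None, None
    pred_out_stream, previous_hypotheses = stream_step(
        step, pred_out_stream, previous_hypotheses
    )
```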
Expected Behavior:
- Maintain stable inference speed even for long audio sequences, without slowdown due to growing history.
- Support incremental output (only new text since the last step) to improve streaming display efficiency.
I will provide any additional information as soon as possible if needed.
Hi @Kerwin11, thank you for the question and for sharing the profiling results.
To better understand your setup, it would be useful to know how you are running inference (e.g., audio chunking strategy, step size, how decoder state is managed across chunks, and what inference script you are using).
Regarding incremental output: although the default example aggregates and returns the full decoded hypothesis for convenience, the cache-aware streaming pipeline does expose chunk-level predictions. You can access the partial (incremental) transcript for each chunk via partial_transcript in the reference streaming inference script.
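As a rough illustration of the consumption pattern (the generator below is a stand-in, not the actual code of the reference script), the idea is to emit each chunk's partial_transcript as it arrives instead of re-printing the accumulated hypothesis:

```python
# Illustrative only: fake_chunk_outputs() stands in for the per-chunk results
# produced by the cache-aware streaming script; partial_transcript mirrors the
# variable mentioned above.
def fake_chunk_outputs():
    """Stand-in generator yielding the incremental text of each chunk."""
    yield from ["Hello", ", How", " are you?"]

emitted = []
for partial_transcript in fake_chunk_outputs():
    print(partial_transcript, end="", flush=True)  # stream only the new text
    emitted.append(partial_transcript)
print()
print("Full transcript:", "".join(emitted))
```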
On the concern about efficiency degradation over long audio: the full decoded text is not fed back into the model at each step. The model forward pass only carries a fixed-size encoder cache and the RNN-T decoder hidden state, both of which are of fixed size and do not grow with audio length. From a modeling perspective, inference complexity should therefore remain stable over time.
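A toy illustration of that point (not NeMo code; the cache and hidden-state shapes below are assumptions chosen for the example): the state carried across streaming steps has a fixed shape, so each step does a constant amount of work no matter how many chunks have already been processed.

```python
# Toy illustration: fixed-size state carried across streaming steps.
import torch

CACHE_FRAMES = 70   # assumed bounded encoder-cache length
HIDDEN = 640        # assumed hidden size
CHUNK_FRAMES = 8    # assumed frames per streaming chunk

encoder_cache = torch.zeros(1, CACHE_FRAMES, HIDDEN)             # fixed shape
decoder_state = (torch.zeros(1, 1, HIDDEN),                      # fixed shape
                 torch.zeros(1, 1, HIDDEN))

def stream_step(chunk, encoder_cache, decoder_state):
    """Placeholder step: the cache is rolled, never grown."""
    encoder_cache = torch.roll(encoder_cache, shifts=-chunk.shape[1], dims=1)
    encoder_cache[:, -chunk.shape[1]:, :] = chunk                 # overwrite oldest frames
    return encoder_cache, decoder_state

for _ in range(10_000):                                           # many chunks
    chunk = torch.randn(1, CHUNK_FRAMES, HIDDEN)
    encoder_cache, decoder_state = stream_step(chunk, encoder_cache, decoder_state)
    assert encoder_cache.shape == (1, CACHE_FRAMES, HIDDEN)       # never grows
```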
The previous_hypotheses bookkeeping is currently retained to enable optional post-processing features such as LM rescoring or word boosting. For very long audio streams, we would also recommend chunking the input into smaller segments (e.g., 10–20 minutes) and running streaming inference per segment if that fits your use case.
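If it helps, a minimal sketch of that segmentation step, assuming 16 kHz mono audio already loaded as a 1-D NumPy array; the segment length is just one value in the suggested range:

```python
# Minimal segmentation sketch: split a long recording into ~15-minute pieces
# and run the streaming pipeline on each piece independently.
import numpy as np

SAMPLE_RATE = 16_000              # assumed sample rate
SEGMENT_SECONDS = 15 * 60         # within the suggested 10-20 minute range

def split_into_segments(audio: np.ndarray, segment_samples: int):
    """Yield consecutive fixed-length slices of the waveform."""
    for start in range(0, len(audio), segment_samples):
        yield audio[start:start + segment_samples]

# Example with a synthetic one-hour recording:
one_hour = np.zeros(SAMPLE_RATE * 3600, dtype=np.float32)
for i, segment in enumerate(split_into_segments(one_hour, SAMPLE_RATE * SEGMENT_SECONDS)):
    # run cache-aware streaming inference on `segment` here
    print(f"segment {i}: {len(segment) / SAMPLE_RATE / 60:.1f} min")
```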
Hope this helps clarify the behavior; happy to dig deeper once we understand your inference setup a bit more.
Thank you very much! After retesting, I can confirm that the inference speed does not decrease with the length of the audio.
