Does decoding efficiency decrease as the audio length increases?
Dear Developers,
Title: RNNT Streaming Decoding: Performance and Incremental Output
Issue Description:
During decoding, the model repeatedly feeds the previously decoded content as input for the next step. For very long audio, could this cause inference efficiency to decrease over time?
Currently, each output returns the full decoded text, for example:
"Hello" "Hello, How" "Hello, How are you?"Ideally, we would like to output only the incremental portion, e.g.:
"Hello" ", How" "are you?"
Test Results:
Average inference time (ms) over steps:
steps 0-99 : 70.14
steps 100-199 : 54.08
steps 200-299 : 54.14
steps 300-399 : 55.54
steps 400-499 : 54.73
steps 500-599 : 55.57
steps 600-699 : 65.92
steps 700-799 : 55.98
steps 800-899 : 57.52
steps 900-999 : 58.23
steps 1000-1099: 60.70
steps 1100-1199: 63.02
steps 1200-1299: 64.57
steps 1300-1399: 64.03
steps 1400-1499: 65.82
steps 1500-1599: 66.68
steps 1600-1699: 64.36
steps 1700-1799: 66.54
steps 1800-1899: 68.32
steps 1900-1946: 76.98
The results indicate that inference time gradually increases as more steps are processed, with a noticeable rise in later stages.
Attempted Solution:
Tried periodically clearing previous_hypotheses and pred_out_stream (e.g., every 100 inference steps) to limit the historical context; a rough sketch of this reset is shown after this list. However, this approach causes two problems:
- Certain punctuation may completely disappear in some segments.
- Portions of the transcribed text can be lost.
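For reference, a self-contained sketch of the reset experiment. stream_step() below is a placeholder standing in for one streaming inference step, not a real NeMo call:

```python
# Hypothetical sketch of the "clear history every N steps" experiment.
# stream_step() is a placeholder, not a real NeMo API.
from typing import Any, List, Optional, Tuple

def stream_step(chunk: Any,
                pred_out_stream: Optional[List[str]],
                previous_hypotheses: Optional[List[str]]) -> Tuple[List[str], List[str]]:
    """Placeholder for one streaming inference step that extends both states."""
    pred_out_stream = (pred_out_stream or []) + [f"token_{chunk}"]
    previous_hypotheses = (previous_hypotheses or []) + [f"hyp_{chunk}"]
    return pred_out_stream, previous_hypotheses

RESET_EVERY = 100  # assumed reset interval
pred_out_stream, previous_hypotheses = None, None

for step in range(1000):  # stand-in for iterating over streamed audio chunks
    if step > 0 and step % RESET_EVERY == 0:
        # Dropping the accumulated history: this is the point that coincided
        # with missing punctuation and lost text in some segments.
        pred_out_stream, previous_hypotheses = None, None
    pred_out_stream, previous_hypotheses = stream_step(
        step, pred_out_stream, previous_hypotheses
    )
```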
Expected Behavior:
- Maintain stable inference speed even for long audio sequences, without slowdown due to growing history.
- Support incremental output (only new text since the last step) to improve streaming display efficiency.
I will provide any additional information as soon as possible if needed.
Hi @Kerwin11, thank you for the question and for sharing the profiling results.
To better understand your setup, it would be useful to know how you are running inference (e.g., audio chunking strategy, step size, how decoder state is managed across chunks, and what inference script you are using).
Regarding incremental output: although the default example aggregates and returns the full decoded hypothesis for convenience, the cache-aware streaming pipeline does expose chunk-level predictions. You can access the partial (incremental) transcript for each chunk via partial_transcript in the reference streaming inference script.
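As a rough illustration of the consumption pattern (the generator below is a stand-in, not the actual code of the reference script), the idea is to emit each chunk's partial_transcript as it arrives instead of re-printing the accumulated hypothesis:

```python
# Illustrative only: fake_chunk_outputs() stands in for the per-chunk results
# produced by the cache-aware streaming script; partial_transcript mirrors the
# variable mentioned above.
def fake_chunk_outputs():
    """Stand-in generator yielding the incremental text of each chunk."""
    yield from ["Hello", ", How", " are you?"]

emitted = []
for partial_transcript in fake_chunk_outputs():
    print(partial_transcript, end="", flush=True)  # stream only the new text
    emitted.append(partial_transcript)
print()
print("Full transcript:", "".join(emitted))
```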
On the concern about efficiency degradation over long audio: the full decoded text is not fed back into the model at each step. The model forward pass only carries a fixed-size encoder cache and the RNN-T decoder hidden state, both of which are of fixed size and do not grow with audio length. From a modeling perspective, inference complexity should therefore remain stable over time.
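A toy illustration of that point (not NeMo code; the cache and hidden-state shapes below are assumptions chosen for the example): the state carried across streaming steps has a fixed shape, so each step does a constant amount of work no matter how many chunks have already been processed.

```python
# Toy illustration: fixed-size state carried across streaming steps.
import torch

CACHE_FRAMES = 70   # assumed bounded encoder-cache length
HIDDEN = 640        # assumed hidden size
CHUNK_FRAMES = 8    # assumed frames per streaming chunk

encoder_cache = torch.zeros(1, CACHE_FRAMES, HIDDEN)             # fixed shape
decoder_state = (torch.zeros(1, 1, HIDDEN),                      # fixed shape
                 torch.zeros(1, 1, HIDDEN))

def stream_step(chunk, encoder_cache, decoder_state):
    """Placeholder step: the cache is rolled, never grown."""
    encoder_cache = torch.roll(encoder_cache, shifts=-chunk.shape[1], dims=1)
    encoder_cache[:, -chunk.shape[1]:, :] = chunk                 # overwrite oldest frames
    return encoder_cache, decoder_state

for _ in range(10_000):                                           # many chunks
    chunk = torch.randn(1, CHUNK_FRAMES, HIDDEN)
    encoder_cache, decoder_state = stream_step(chunk, encoder_cache, decoder_state)
    assert encoder_cache.shape == (1, CACHE_FRAMES, HIDDEN)       # never grows
```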
The previous_hypotheses bookkeeping is currently retained to enable optional post-processing features such as LM rescoring or word boosting. For very long audio streams, we would also recommend chunking the input into smaller segments (e.g., 10–20 minutes) and running streaming inference per segment if that fits your use case.
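If it helps, a minimal sketch of that segmentation step, assuming 16 kHz mono audio already loaded as a 1-D NumPy array; the segment length is just one value in the suggested range:

```python
# Minimal segmentation sketch: split a long recording into ~15-minute pieces
# and run the streaming pipeline on each piece independently.
import numpy as np

SAMPLE_RATE = 16_000              # assumed sample rate
SEGMENT_SECONDS = 15 * 60         # within the suggested 10-20 minute range

def split_into_segments(audio: np.ndarray, segment_samples: int):
    """Yield consecutive fixed-length slices of the waveform."""
    for start in range(0, len(audio), segment_samples):
        yield audio[start:start + segment_samples]

# Example with a synthetic one-hour recording:
one_hour = np.zeros(SAMPLE_RATE * 3600, dtype=np.float32)
for i, segment in enumerate(split_into_segments(one_hour, SAMPLE_RATE * SEGMENT_SECONDS)):
    # run cache-aware streaming inference on `segment` here
    print(f"segment {i}: {len(segment) / SAMPLE_RATE / 60:.1f} min")
```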
Hope this helps clarify the behavior; happy to dig deeper once we understand your inference setup a bit more.
Thank you very much! After retesting, I can confirm that the inference speed does not decrease with the length of the audio.
