Papers
arxiv:2606.25041

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Published on Jun 23
· Submitted by
Lianghua Huang
on Jun 25
#2 Paper of the day
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities.

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.

Community

Paper submitter

Wan-Streamer v0.1 is a native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds on video in real time — at 25 fps with ~200 ms model-side latency, all within a single Transformer.

framework

Paper submitter
edited 2 days ago

Comparison with other interactive models / systems:

image

The two groups measure different things. The top group is full end-to-end interaction loops — they perceive the user and produce a response; only Wan Streamer also outputs video. The bottom group is avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.

image

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates — understanding and responding at the same time. Wan Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.

·

Comparison with other interactive models / systems:

image

The two groups measure different things. The top group is full end-to-end interaction loops — they perceive the user and produce a response; only Wan Streamer also outputs video. The bottom group is avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.

image

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates — understanding and responding at the same time. Wan Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.

Hello, I'm curious about the performance benchmark results of this model. Will these be made public later?

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.25041
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.25041 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.25041 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.25041 in a Space README.md to link it from this page.

Collections including this paper 9