OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation • Paper • 2601.15369 • Published 1 day ago • 6
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders • Paper • 2601.16208 • Published about 12 hours ago • 19
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR • Paper • 2601.14251 • Published 3 days ago • 18
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer • Paper • 2601.14250 • Published 3 days ago • 35
CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation • Paper • 2601.11096 • Published 7 days ago • 8
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding • Paper • 2601.10611 • Published 8 days ago • 26
Transition Matching Distillation for Fast Video Generation • Paper • 2601.09881 • Published 8 days ago • 31
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking • Paper • 2601.04720 • Published 15 days ago • 47
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization • Paper • 2601.05432 • Published 14 days ago • 160
SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices • Paper • 2601.08303 • Published 10 days ago • 16
Yume-1.5: A Text-Controlled Interactive World Generation Model • Paper • 2512.22096 • Published 28 days ago • 60