---
library_name: mlx
license: apache-2.0
tags:
- mlx
---

# NexaAI/qwen3vl-30B-A3B-mlx

## 🔧 Quickstart

Run directly with the [nexa-sdk](https://github.com/NexaAI/nexa-sdk) CLI:

```bash
nexa infer NexaAI/qwen3vl-30B-A3B-mlx
```

> ⚠️ **Note:** You need at least **64 GB of RAM** on your Mac to run this model.

---

## 🧠 Model Overview

**Qwen3-VL-30B-A3B-Instruct** is a vision-language model from the Qwen3 series, offering advanced reasoning, spatial perception, long-context understanding, and seamless integration between text and visual data. The A3B suffix denotes its Mixture-of-Experts design: roughly 3B of the 30B total parameters are activated per token.

### 🔑 Key Features

* **Visual Agent Capabilities**
  Understands and interacts with GUIs, software tools, and system elements for agentic task automation.
* **Visual Coding Generation**
  Converts images or video layouts into HTML, CSS, JS, or diagramming formats such as Draw.io.
* **Spatial & Temporal Reasoning**
  Handles complex spatial tasks (2D/3D object grounding, occlusion) and aligns language with events in video.
* **Multimodal Reasoning**
  Excels at STEM, math, and logic tasks with causal, evidence-based answers across text, image, and video modalities.
* **256K+ Context Length**
  Handles ultra-long documents and hours of video input with second-level indexing and full recall.
* **High-Performance OCR**
  Recognizes 32 languages, including ancient scripts and scientific notation, and performs well on low-light or blurry inputs.
* **Multilingual & Instruction Following**
  Supports over 100 languages with robust multilingual instruction tuning and translation quality.

---

## 🏗️ Architecture Details

* **Model Type**: Vision-Language Causal Transformer
* **Architecture Enhancements**:
  * *Interleaved-MRoPE*: Improved positional embeddings for long-horizon vision tasks.
  * *DeepStack*: Multi-level ViT feature fusion for fine-grained alignment.
  * *Text-Timestamp Alignment*: Enhanced video temporal localization.
* **Context Length**: Up to 256K tokens (expandable to 1M)
* **Model Size**: 30B total parameters (≈3B activated per token)
* **Architecture**: MoE (Mixture of Experts)
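The 64 GB RAM note above can be sanity-checked with a rough back-of-envelope on weight memory alone. This is an illustrative sketch, not the exact footprint: actual usage also depends on the quantization the MLX conversion uses, KV-cache size for long contexts, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return n_params * bits_per_param / 8 / 1e9

total_params = 30e9  # 30B total parameters

# Weight memory at common precisions (KV cache and activations come on top):
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(total_params, bits):.0f} GB")
# 16-bit: ~60 GB
# 8-bit:  ~30 GB
# 4-bit:  ~15 GB
```

At 16-bit precision the weights alone approach 60 GB, which is why a 64 GB unified-memory Mac is the stated floor; lower-bit quantizations leave correspondingly more headroom for the KV cache at long context lengths.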