---
library_name: mlx
license: apache-2.0
tags:
- mlx
---
# NexaAI/qwen3vl-30B-A3B-mlx
## 🔧 Quickstart
Run directly with the [nexa-sdk](https://github.com/NexaAI/nexa-sdk) CLI:
```bash
nexa infer NexaAI/qwen3vl-30B-A3B-mlx
```
> ⚠️ **Note:** You need at least **64 GB of RAM** on your Mac to run this model.
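As a rough sanity check on that requirement, weight memory alone for a 30B-parameter model scales with the quantization bit-width, before accounting for the KV cache, activations, and the vision encoder. The figures below are a back-of-envelope sketch, not measured numbers:

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: 1e9 params * (bits / 8) bytes."""
    return n_params_billion * bits_per_weight / 8

# 30B parameters at common quantization levels (weights only):
for bits in (4, 8, 16):
    print(f"{bits}-bit: ~{weight_memory_gb(30, bits):.0f} GB")
# 4-bit: ~15 GB, 8-bit: ~30 GB, 16-bit: ~60 GB
```

Even at 4-bit, headroom for the KV cache and the rest of the system is why 64 GB of unified memory is the stated floor.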
---
## 🧠 Model Overview
**Qwen3-VL-30B-A3B-Instruct** is a vision-language model from the Qwen3-VL series, offering advanced reasoning, spatial perception, long-context understanding, and tight integration between text and visual input. It is the instruct-tuned A3B variant: a Mixture-of-Experts configuration with 30B total parameters, of which roughly 3B are activated per token.
### 🔑 Key Features
* **Visual Agent Capabilities**
Understands and interacts with GUIs, software tools, and system elements for agentic task automation.
* **Visual Coding Generation**
Converts images or video layouts into HTML, CSS, JS, or diagramming tools like Draw.io.
* **Spatial & Temporal Reasoning**
Handles complex visual spatial tasks (2D/3D object grounding, occlusion) and aligns language with video events.
* **Multimodal Reasoning**
Excels in STEM, math, and logic tasks with causal, evidence-based answers across text and image/video modalities.
* **256K+ Context Length**
Handles ultra-long documents and hours of video input with second-level indexing and full recall.
* **High-Performance OCR**
Recognizes 32 languages including ancient scripts, scientific notations, and performs well under low-light/blurry conditions.
* **Multilingual & Instruction Following**
Supports over 100 languages with robust multilingual instruction tuning and translation quality.
---
## 🏗️ Architecture Details
* **Model Type**: Vision-Language Causal Transformer
* **Architecture Enhancements**:
* *Interleaved-MRoPE*: Improved positional embeddings for long-horizon vision tasks.
* *DeepStack*: Multi-level ViT feature fusion for fine-grained alignment.
* *Text-Timestamp Alignment*: Enhanced video temporal localization.
* **Context Length**: Up to 256K tokens (expandable to 1M)
* **Model Size**: 30B parameters
* **Architecture**: Mixture of Experts (MoE); roughly 3B of the 30B parameters are active per token, as the "A3B" suffix indicates
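The MoE sizing implied by the model name can be sketched numerically. The totals come from the "30B-A3B" name itself; treating the ratio as the per-token compute fraction is a simplification of expert routing, not a claim from this card:

```python
# MoE sizing implied by "30B-A3B": 30B total parameters,
# with ~3B activated per token by the expert router.
total_params_b = 30.0
active_params_b = 3.0

active_fraction = active_params_b / total_params_b
print(f"~{active_fraction:.0%} of parameters active per token")
# ~10% — per-token compute closer to a 3B dense model,
# while total capacity (and memory footprint) stays at 30B.
```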