---
library_name: mlx
license: apache-2.0
tags:
- mlx
---

# NexaAI/qwen3vl-30B-A3B-mlx

## 🔧 Quickstart

Run directly with the [nexa-sdk](https://github.com/NexaAI/nexa-sdk) CLI:

```bash
nexa infer NexaAI/qwen3vl-30B-A3B-mlx
```

> ⚠️ **Note:** You need at least **64 GB of RAM** on your Mac to run this model.

---
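
The 64 GB figure is easy to sanity-check with a back-of-the-envelope weight-memory estimate. The function below is illustrative only (it counts weights at a given precision and ignores KV cache, activations, and runtime overhead); actual checkpoint sizes depend on the MLX conversion used.

```python
def model_memory_gib(n_params_billion, bits_per_param):
    """Rough weight memory: parameter count × bytes per parameter, in GiB.
    Ignores KV cache, activations, and framework overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

# 30B parameters at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{model_memory_gib(30, bits):.0f} GiB")
```

At 16-bit precision the weights alone approach 56 GiB, which is why 64 GB of unified memory is the practical floor even before accounting for the KV cache and the OS.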

## 🧠 Model Overview

**Qwen3-VL-30B-A3B-Instruct** is a cutting-edge vision-language model from the Qwen3 series, offering advanced reasoning, spatial perception, long-context understanding, and seamless integration between text and visual data. The "A3B" in the name denotes its Mixture-of-Experts design: roughly 3B of its 30B parameters are activated per token. This is the instruct-tuned variant of that lineup.

### 🔑 Key Features

* **Visual Agent Capabilities**
  Understands and interacts with GUIs, software tools, and system elements for agentic task automation.

* **Visual Coding Generation**
  Converts images or video layouts into HTML/CSS/JS code, or into diagram formats for tools like Draw.io.

* **Spatial & Temporal Reasoning**
  Handles complex visual spatial tasks (2D/3D object grounding, occlusion) and aligns language with video events.

* **Multimodal Reasoning**
  Excels in STEM, math, and logic tasks with causal, evidence-based answers across text and image/video modalities.

* **256K+ Context Length**
  Handles ultra-long documents and hours of video input with second-level indexing and full recall.

* **High-Performance OCR**
  Recognizes text in 32 languages, including ancient scripts and scientific notation, and performs well on low-light or blurry images.

* **Multilingual & Instruction Following**
  Supports over 100 languages with robust multilingual instruction tuning and translation quality.

---

## 🏗️ Architecture Details

* **Model Type**: Vision-Language Causal Transformer

* **Architecture Enhancements**:

  * *Interleaved-MRoPE*: Improved positional embeddings for long-horizon vision tasks.
  * *DeepStack*: Multi-level ViT feature fusion for fine-grained alignment.
  * *Text-Timestamp Alignment*: Enhanced video temporal localization.

* **Context Length**: Up to 256K tokens (expandable to 1M)

* **Model Size**: 30B total parameters (≈3B activated per token)

* **Architecture**: Mixture of Experts (MoE)
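
The MoE design above is what keeps inference cheap relative to the 30B total parameter count: a router scores all experts per token but only the top-k are actually run. The sketch below is a generic top-k routing illustration, not Qwen's actual implementation; the expert count and logits are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1,
    so the token's output is a weighted mix of only k expert outputs.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# Hypothetical router scores for one token over 8 experts:
logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.5, 0.0, 0.4]
selected = route(logits, k=2)
print(selected)  # the two highest-scoring experts, weights summing to 1.0
```

With k experts active out of E total, per-token compute in the MoE layers scales with the activated parameters (here ≈3B) rather than the full 30B, while memory still has to hold all experts.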