# UniVLA (VLA-Arena Fine-tuned)
## About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively characterize the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing toward generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this with a structured task design framework that quantifies difficulty along three orthogonal axes:
- Task Structure: 170+ tasks grouped into four key dimensions:
  - Safety: Operating reliably under strict constraints.
  - Distractor: Handling environmental unpredictability.
  - Extrapolation: Generalizing to unseen scenarios.
  - Long Horizon: Executing complex, multi-step tasks.
- Language Command: Variations in instruction complexity.
- Visual Observation: Perturbations in visual input.
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.
## Model Overview

This model is UniVLA fine-tuned on demonstration data generated from VLA-Arena. UniVLA distinguishes itself by employing a Latent Action Model (LAM) for action generation, separating policy learning into a high-level vision-language planner and a low-level latent action decoder.
Unlike typical parameter-efficient fine-tuning (PEFT) approaches where the backbone is frozen, this checkpoint involves training both the VLA backbone components and the dedicated action model.
## Model Architecture
UniVLA utilizes a hierarchical structure involving a VLA backbone for semantic understanding and a specialized Latent Action Model (LAM) for discrete action token generation.
| Component | Description |
|---|---|
| Backbone | VLA (Vision-Language Backbone) |
| Action Generation | Latent Action Model (LAM) |
| Action Space | Discrete Codebook (Size 16) |
| Training State | Unfrozen (Both VLA Backbone and Action Model are trained) |
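The two-stage action path summarized above can be sketched as follows. This is an illustrative sketch, not the released implementation: the planner and decoder are stand-ins, and the 7-DoF action dimension is an assumption; only the codebook size (16) and latent dimension (128) come from this card.

```python
import numpy as np

# Illustrative sketch of UniVLA's hierarchical action path (NOT the released code).
# Codebook size 16 and latent dim 128 follow this card; the 7-DoF action
# dimension and all module internals are assumptions.
CODEBOOK_SIZE, LATENT_DIM, ACTION_DIM = 16, 128, 7

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # discrete latent-action vocabulary
decoder_w = rng.normal(size=(LATENT_DIM, ACTION_DIM))    # stand-in for the latent action decoder

def planner(image, instruction):
    """Stand-in for the VLA backbone: maps (observation, language) to a latent token id."""
    return int(rng.integers(CODEBOOK_SIZE))  # a real planner predicts this from the inputs

def decode_action(token_id):
    """Stand-in for the LAM decoder: latent token -> continuous robot action."""
    return codebook[token_id] @ decoder_w

token = planner(image=np.zeros((224, 224, 3)), instruction="pick up the red block")
action = decode_action(token)
print(action.shape)  # (7,)
```

The point of the split is that the backbone only has to model a small discrete vocabulary of 16 latent tokens, while the decoder handles the mapping back to continuous control.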
### Key Feature: Latent Action Model (LAM)
The LAM acts as a specialized tokenizer and predictor for robotic actions. It compresses continuous actions into a compact discrete latent space, allowing for efficient sequence modeling.
| LAM Parameter | Value |
|---|---|
| Codebook Size | 16 |
| Model Dimension | 768 |
| Latent Dimension | 128 |
| Structure | 12 Encoder Blocks / 12 Decoder Blocks |
| Window Size | 12 |
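As a rough illustration of how a codebook of this size discretizes actions, here is a minimal nearest-neighbor vector quantizer in the spirit of VQ-style latent action models. This is a sketch under assumptions: the actual LAM uses 12-block encoder/decoder transformers, which are not shown; only the codebook size, latent dimension, and window size come from the table above.

```python
import numpy as np

# Minimal nearest-neighbor quantizer illustrating a size-16 codebook over
# 128-dim latents (values from the table above). This sketches the general
# VQ mechanism, not UniVLA's actual LAM.
rng = np.random.default_rng(42)
codebook = rng.normal(size=(16, 128))

def quantize(z):
    """Map continuous latents z of shape (..., 128) to nearest codebook indices."""
    d = np.linalg.norm(z[..., None, :] - codebook, axis=-1)  # (..., 16)
    return d.argmin(axis=-1)

# A window of 12 latent steps (the LAM window size) becomes 12 discrete tokens.
z_window = rng.normal(size=(12, 128))
tokens = quantize(z_window)
print(tokens.shape)  # (12,)
```

With only 16 codes, each action token carries 4 bits, so a 12-step window compresses to a very short discrete sequence for the backbone to model.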
## Training Details

### Dataset
This model was trained on the VLA-Arena/VLA_Arena_L0_L_rlds dataset. The data consists of diverse robotic manipulation demonstrations formatted in RLDS (Reinforcement Learning Datasets) standard.
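RLDS stores data as episodes of timesteps with standardized keys. Below is a minimal sketch of one episode's structure: the `steps` list and the `is_first`/`is_last`/`is_terminal`, `observation`, `action`, and `reward` fields follow the RLDS convention, but the specific observation keys and the 7-DoF action shown here are illustrative assumptions, not this dataset's actual schema.

```python
import numpy as np

# Sketch of one RLDS-style episode. The field names are standard RLDS;
# the observation keys and 7-DoF action are illustrative assumptions.
episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((224, 224, 3), dtype=np.uint8),
                "instruction": "pick up the red block",
            },
            "action": np.zeros(7, dtype=np.float32),
            "reward": 0.0,
            "is_first": t == 0,
            "is_last": t == 9,
            "is_terminal": False,
        }
        for t in range(10)
    ]
}
print(len(episode["steps"]))  # 10
```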
### Hyperparameters
The training utilized gradient accumulation to achieve an effective batch size of 16. Notably, the backbone was not frozen, allowing for deeper adaptation to the VLA-Arena tasks.
| Parameter | Value |
|---|---|
| Max Training Steps | 30,000 |
| Batch Size (Per Device) | 8 |
| Gradient Accumulation | 2 steps |
| Effective Total Batch Size | 16 |
| Optimizer | AdamW |
| Learning Rate ($\eta$) | $3.5 \times 10^{-4}$ (Fixed) |
| Shuffle Buffer Size | 16,000 |
| Image Augmentation | Enabled (TRUE) |
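The effective batch size follows from per-device batch × accumulation steps (8 × 2 = 16). Here is a minimal sketch of that accumulation pattern in plain NumPy so it stays framework-agnostic; the toy model, loss, and data are placeholders, while the batch sizes and learning rate come from the table above.

```python
import numpy as np

PER_DEVICE_BATCH, ACCUM_STEPS = 8, 2
EFFECTIVE_BATCH = PER_DEVICE_BATCH * ACCUM_STEPS  # 16, as in the table above

# Placeholder "model": one weight vector with a toy quadratic loss.
w = np.ones(4)
grad_buffer = np.zeros_like(w)

for micro_step in range(ACCUM_STEPS):
    batch = np.ones((PER_DEVICE_BATCH, 4))  # placeholder data
    grad = 2 * w * batch.mean(axis=0)       # gradient of the toy loss on this micro-batch
    grad_buffer += grad / ACCUM_STEPS       # average gradients across micro-batches

lr = 3.5e-4                 # fixed learning rate from the table
w -= lr * grad_buffer       # one optimizer step per 16 samples
print(EFFECTIVE_BATCH)  # 16
```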
### LoRA Configuration

LoRA adapters were enabled, but the configuration also left the VLA backbone unfrozen, indicating a hybrid fine-tuning approach rather than pure parameter-efficient fine-tuning.
| Parameter | Value |
|---|---|
| LoRA Rank ($r$) | 32 |
| LoRA Dropout | 0.0 |
| Use 4-bit Quantization | Disabled (FALSE) |
| Backbone Freeze | Disabled (FALSE) |
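For reference, LoRA adds a low-rank correction $\Delta W = (\alpha / r)\, B A$ to a weight matrix, with rank $r = 32$ here. A minimal NumPy sketch follows; the layer size (768×768) and $\alpha$ are assumptions, and only the rank comes from the table above.

```python
import numpy as np

# LoRA sketch: W_eff = W + (alpha / r) * B @ A, with rank r = 32 from the table.
# The 768x768 layer size and alpha value are illustrative assumptions.
r, alpha = 32, 32
d_in = d_out = 768

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # base weight (also trained here, since the
                                       # backbone is NOT frozen in this run)
A = rng.normal(size=(r, d_in)) * 0.01  # LoRA down-projection (trained)
B = np.zeros((d_out, r))               # LoRA up-projection, zero-initialized so
                                       # training starts exactly from W

W_eff = W + (alpha / r) * B @ A
print(np.allclose(W_eff, W))  # True: zero-init B means no change at step 0
```

The low-rank factorization means each adapted layer trains only $r(d_{in} + d_{out})$ extra parameters instead of $d_{in} \cdot d_{out}$.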
## Evaluation & Usage
This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).
For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.