UniVLA (VLA-Arena Fine-tuned)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

Model Overview

This checkpoint is UniVLA fine-tuned on demonstration data generated from VLA-Arena. UniVLA distinguishes itself by employing a Latent Action Model (LAM) for action generation, splitting policy learning into a high-level vision-language planner and a low-level latent action decoder.

Unlike typical parameter-efficient fine-tuning (PEFT) approaches where the backbone is frozen, this checkpoint involves training both the VLA backbone components and the dedicated action model.


Model Architecture

UniVLA utilizes a hierarchical structure involving a VLA backbone for semantic understanding and a specialized Latent Action Model (LAM) for discrete action token generation.

  • Backbone: VLA (Vision-Language backbone)
  • Action Generation: Latent Action Model (LAM)
  • Action Space: Discrete codebook (size 16)
  • Training State: Unfrozen (both the VLA backbone and the action model are trained)

Key Feature: Latent Action Model (LAM)

The LAM acts as a specialized tokenizer and predictor for robotic actions. It compresses continuous actions into a compact discrete latent space, allowing for efficient sequence modeling.

  • Codebook Size: 16
  • Model Dimension: 768
  • Latent Dimension: 128
  • Structure: 12 encoder blocks / 12 decoder blocks
  • Window Size: 12
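The quantization step at the heart of a codebook-based latent action model can be sketched in a few lines. The shapes below follow the parameters listed above (codebook size 16, latent dimension 128, window size 12); the encoder and decoder networks are omitted, and all names are illustrative rather than UniVLA's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of discrete latent action tokens
LATENT_DIM = 128     # dimension of each codebook entry

# Learned codebook (here a random stand-in for illustration).
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(z):
    """Map continuous encoder outputs (T, LATENT_DIM) to discrete token ids."""
    # Squared Euclidean distance from each latent to each codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)   # nearest codebook entry per timestep
    return ids, codebook[ids]    # token ids and their quantized latents

# A window of 12 continuous action latents (matching the window size above).
z = rng.normal(size=(12, LATENT_DIM))
ids, z_q = quantize(z)
print(ids.shape, z_q.shape)  # (12,) (12, 128)
```

The discrete ids are what the vision-language planner predicts as tokens; the decoder then maps the corresponding codebook vectors back to continuous robot actions.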

Training Details

Dataset

This model was trained on the VLA-Arena/VLA_Arena_L0_L_rlds dataset. The data consists of diverse robotic manipulation demonstrations formatted in the RLDS (Reinforcement Learning Datasets) standard.
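RLDS organizes data as episodes containing ordered steps, each carrying an observation, an action, and boundary flags. The sketch below mimics that nesting with plain dictionaries so the structure is visible; field names follow RLDS conventions, but the concrete values are made up (real RLDS data is typically loaded with `tensorflow_datasets`).

```python
# Minimal stand-in for one RLDS-formatted demonstration episode.
episode = {
    "steps": [
        {
            "observation": {"image": f"frame_{t}.png",
                            "instruction": "pick up the cube"},  # hypothetical task
            "action": [0.1 * t, 0.0, -0.05],  # e.g. end-effector deltas
            "is_first": t == 0,
            "is_last": t == 2,
            "is_terminal": t == 2,
        }
        for t in range(3)
    ]
}

# Iterate the episode the way a policy trainer would: (observation, action) pairs.
pairs = [(s["observation"]["instruction"], s["action"]) for s in episode["steps"]]
print(len(pairs))  # 3
```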

Hyperparameters

The training utilized gradient accumulation to achieve an effective batch size of 16. Notably, the backbone was not frozen, allowing for deeper adaptation to the VLA-Arena tasks.

  • Max Training Steps: 30,000
  • Batch Size (Per Device): 8
  • Gradient Accumulation: 2 steps
  • Effective Total Batch Size: 16
  • Optimizer: AdamW
  • Learning Rate ($\eta$): $3.5 \times 10^{-4}$ (fixed)
  • Shuffle Buffer Size: 16,000
  • Image Augmentation: Enabled
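Gradient accumulation reaches the effective batch size of 16 by summing gradients over 2 micro-batches of 8 before each optimizer step. A schematic loop (plain Python with a scalar stand-in for the gradient, not the actual training code):

```python
PER_DEVICE_BATCH = 8
ACCUM_STEPS = 2
EFFECTIVE_BATCH = PER_DEVICE_BATCH * ACCUM_STEPS  # 16, as in the table above

def micro_batch_grad(step):
    # Stand-in for loss.backward() on one micro-batch of 8 samples.
    return 1.0

optimizer_steps = 0
accumulated = 0.0
for step in range(4):                     # 4 micro-batches -> 2 optimizer steps
    # Divide by ACCUM_STEPS so the applied gradient is an average, not a sum.
    accumulated += micro_batch_grad(step) / ACCUM_STEPS
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer_steps += 1              # optimizer.step() would go here
        accumulated = 0.0                 # optimizer.zero_grad()

print(EFFECTIVE_BATCH, optimizer_steps)  # 16 2
```

This trades memory for wall-clock time: each device only ever holds a batch of 8, but the update statistics match a batch of 16.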

LoRA Configuration

LoRA was enabled, yet the configuration also left the VLA backbone unfrozen, indicating a hybrid approach that combines low-rank adapters with full backbone updates rather than standard adapter-only fine-tuning.

  • LoRA Rank ($r$): 32
  • LoRA Dropout: 0.0
  • 4-bit Quantization: Disabled
  • Backbone Freeze: Disabled
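LoRA with rank $r = 32$ adds a low-rank update $B A$ alongside each adapted weight matrix; because the backbone here is also unfrozen, the adapters add to, rather than substitute for, full fine-tuning. The parameter arithmetic for one hypothetical 768×768 projection (the scaling factor `alpha` is a common default, not stated in this card):

```python
import numpy as np

d_in, d_out, r = 768, 768, 32  # rank 32 as configured above
alpha = 32                     # assumed scaling; not specified by the card

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # base weight
A = rng.normal(size=(r, d_in)) * 0.01  # low-rank factor (trainable)
B = np.zeros((d_out, r))               # B starts at zero, so the adapter
                                       # initially leaves W unchanged

W_eff = W + (alpha / r) * (B @ A)      # effective weight with the adapter

full_params = d_out * d_in             # parameters in the full matrix
lora_params = r * (d_in + d_out)       # trainable adapter parameters
print(full_params, lora_params)        # 589824 49152
```

The adapter trains roughly 1/12 of the parameters of the full projection, which is the usual memory argument for LoRA; here it supplements the unfrozen backbone instead of replacing its updates.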

Evaluation & Usage

This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
