UniVLA (VLA-Arena Fine-tuned)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

Model Overview

This checkpoint is UniVLA fine-tuned on demonstration data generated from VLA-Arena. UniVLA distinguishes itself by employing a Latent Action Model (LAM) for action generation, splitting policy learning into a high-level vision-language planner and a low-level latent action decoder.

Unlike typical parameter-efficient fine-tuning (PEFT) approaches where the backbone is frozen, this checkpoint involves training both the VLA backbone components and the dedicated action model.


Model Architecture

UniVLA utilizes a hierarchical structure involving a VLA backbone for semantic understanding and a specialized Latent Action Model (LAM) for discrete action token generation.

  • Backbone: VLA (Vision-Language backbone)
  • Action Generation: Latent Action Model (LAM)
  • Action Space: Discrete codebook (size 16)
  • Training State: Unfrozen (both the VLA backbone and the action model are trained)

Key Feature: Latent Action Model (LAM)

The LAM acts as a specialized tokenizer and predictor for robotic actions. It compresses continuous actions into a compact discrete latent space, allowing for efficient sequence modeling.

  • Codebook Size: 16
  • Model Dimension: 768
  • Latent Dimension: 128
  • Structure: 12 encoder blocks / 12 decoder blocks
  • Window Size: 12
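The quantization step at the heart of a codebook-based latent action model can be sketched in a few lines. The shapes below follow the parameters listed above (codebook size 16, latent dimension 128, window size 12); the encoder and decoder networks are omitted, and all names are illustrative rather than UniVLA's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of discrete latent action tokens
LATENT_DIM = 128     # dimension of each codebook entry

# Learned codebook (here a random stand-in for illustration).
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(z):
    """Map continuous encoder outputs (T, LATENT_DIM) to discrete token ids."""
    # Squared Euclidean distance from each latent to each codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)   # nearest codebook entry per timestep
    return ids, codebook[ids]    # token ids and their quantized latents

# A window of 12 continuous action latents (matching the window size above).
z = rng.normal(size=(12, LATENT_DIM))
ids, z_q = quantize(z)
print(ids.shape, z_q.shape)  # (12,) (12, 128)
```

The discrete ids are what the vision-language planner predicts as tokens; the decoder then maps the corresponding codebook vectors back to continuous robot actions.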

Training Details

Dataset

This model was trained on the VLA-Arena/VLA_Arena_L0_L_rlds dataset. The data consists of diverse robotic manipulation demonstrations formatted in the RLDS (Reinforcement Learning Datasets) standard.
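RLDS organizes data as episodes containing ordered steps, each carrying an observation, an action, and boundary flags. The sketch below mimics that nesting with plain dictionaries so the structure is visible; field names follow RLDS conventions, but the concrete values are made up (real RLDS data is typically loaded with `tensorflow_datasets`).

```python
# Minimal stand-in for one RLDS-formatted demonstration episode.
episode = {
    "steps": [
        {
            "observation": {"image": f"frame_{t}.png",
                            "instruction": "pick up the cube"},  # hypothetical task
            "action": [0.1 * t, 0.0, -0.05],  # e.g. end-effector deltas
            "is_first": t == 0,
            "is_last": t == 2,
            "is_terminal": t == 2,
        }
        for t in range(3)
    ]
}

# Iterate the episode the way a policy trainer would: (observation, action) pairs.
pairs = [(s["observation"]["instruction"], s["action"]) for s in episode["steps"]]
print(len(pairs))  # 3
```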

Hyperparameters

The training utilized gradient accumulation to achieve an effective batch size of 16. Notably, the backbone was not frozen, allowing for deeper adaptation to the VLA-Arena tasks.

  • Max Training Steps: 30,000
  • Batch Size (Per Device): 8
  • Gradient Accumulation: 2 steps
  • Effective Total Batch Size: 16
  • Optimizer: AdamW
  • Learning Rate ($\eta$): $3.5 \times 10^{-4}$ (fixed)
  • Shuffle Buffer Size: 16,000
  • Image Augmentation: Enabled
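Gradient accumulation reaches the effective batch size of 16 by summing gradients over 2 micro-batches of 8 before each optimizer step. A schematic loop (plain Python with a scalar stand-in for the gradient, not the actual training code):

```python
PER_DEVICE_BATCH = 8
ACCUM_STEPS = 2
EFFECTIVE_BATCH = PER_DEVICE_BATCH * ACCUM_STEPS  # 16, as in the table above

def micro_batch_grad(step):
    # Stand-in for loss.backward() on one micro-batch of 8 samples.
    return 1.0

optimizer_steps = 0
accumulated = 0.0
for step in range(4):                     # 4 micro-batches -> 2 optimizer steps
    # Divide by ACCUM_STEPS so the applied gradient is an average, not a sum.
    accumulated += micro_batch_grad(step) / ACCUM_STEPS
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer_steps += 1              # optimizer.step() would go here
        accumulated = 0.0                 # optimizer.zero_grad()

print(EFFECTIVE_BATCH, optimizer_steps)  # 16 2
```

This trades memory for wall-clock time: each device only ever holds a batch of 8, but the update statistics match a batch of 16.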

LoRA Configuration

LoRA was enabled, yet the configuration also left the VLA backbone unfrozen, indicating a hybrid approach that combines low-rank adapters with full backbone updates rather than standard adapter-only fine-tuning.

  • LoRA Rank ($r$): 32
  • LoRA Dropout: 0.0
  • 4-bit Quantization: Disabled
  • Backbone Freeze: Disabled
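LoRA with rank $r = 32$ adds a low-rank update $B A$ alongside each adapted weight matrix; because the backbone here is also unfrozen, the adapters add to, rather than substitute for, full fine-tuning. The parameter arithmetic for one hypothetical 768×768 projection (the scaling factor `alpha` is a common default, not stated in this card):

```python
import numpy as np

d_in, d_out, r = 768, 768, 32  # rank 32 as configured above
alpha = 32                     # assumed scaling; not specified by the card

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # base weight
A = rng.normal(size=(r, d_in)) * 0.01  # low-rank factor (trainable)
B = np.zeros((d_out, r))               # B starts at zero, so the adapter
                                       # initially leaves W unchanged

W_eff = W + (alpha / r) * (B @ A)      # effective weight with the adapter

full_params = d_out * d_in             # parameters in the full matrix
lora_params = r * (d_in + d_out)       # trainable adapter parameters
print(full_params, lora_params)        # 589824 49152
```

The adapter trains roughly 1/12 of the parameters of the full projection, which is the usual memory argument for LoRA; here it supplements the unfrozen backbone instead of replacing its updates.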

Evaluation & Usage

This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
