Ο€β‚€-FAST (VLA-Arena Fine-tuned)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

Model Overview

This model is Ο€β‚€-FAST fine-tuned on demonstration data generated from VLA-Arena. It serves as a strong baseline for evaluating performance across the benchmark's Safety, Distractor, Extrapolation, and Long Horizon dimensions.

This checkpoint utilizes FAST action tokenization, enabling efficient and stable multi-step action prediction (10-step horizon) for continuous control.


Model Architecture

The model combines a Vision-Language Model (VLM) backbone with a specialized action tokenizer to handle continuous robotic control.

| Component | Description |
| --- | --- |
| Backbone | Gemma-2B (Vision-Language Model) with LoRA adaptation |
| Action Space | 7-DoF continuous control (end-effector pose + gripper) |
| Tokenization | FAST (compresses temporally extended action sequences) |
| Prediction | Multi-step prediction (horizon: 10 steps) |
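As an illustration of the prediction setup above, the sketch below treats the policy output as a (10, 7) action chunk and executes it in a receding-horizon loop. The `fake_policy` function and observation shape are hypothetical stand-ins, not the actual openpi API:

```python
import numpy as np

HORIZON, ACTION_DIM = 10, 7  # horizon and action dimension from the table above

def fake_policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the real model: returns one 10-step, 7-DoF action chunk."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((HORIZON, ACTION_DIM))

# Receding-horizon rollout: predict a chunk, execute every step, then re-plan.
obs = np.zeros(8)  # dummy observation
executed = []
for _ in range(3):  # three planning cycles
    chunk = fake_policy(obs)
    assert chunk.shape == (HORIZON, ACTION_DIM)
    executed.extend(chunk)  # in a real loop, each row would be sent to the robot

print(len(executed))  # 30 actions executed over 3 chunks
```

Executing the full chunk before re-querying the model is what makes multi-step prediction cheaper and more stable than predicting one action per forward pass.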

Key Feature: FAST Tokenizer

The FAST tokenizer compresses action sequences into a compact token representation. This improves learning efficiency and rollout stability compared to standard per-step prediction, making it particularly effective for the complex tasks found in VLA-Arena.
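FAST combines a discrete cosine transform with byte-pair encoding over the quantized coefficients. The toy sketch below illustrates only the frequency-space quantization idea on a smooth action chunk; the real tokenizer's vocabulary, scales, and BPE stage differ:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis (n x n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

# Toy action chunk: 10 timesteps x 7 DoF of smooth sinusoidal trajectories.
t = np.linspace(0, 1, 10)
chunk = np.stack([np.sin(2 * np.pi * t + p) for p in np.linspace(0, 1, 7)], axis=1)

D = dct_matrix(10)
coeffs = D @ chunk                           # frequency-space representation per DoF
tokens = np.round(coeffs * 64).astype(int)   # coarse quantization -> integer tokens
recon = D.T @ (tokens / 64.0)                # inverse DCT of dequantized coefficients

print(np.abs(recon - chunk).max() < 0.05)    # True: small round-trip error
```

Because smooth trajectories concentrate energy in a few low-frequency coefficients, most quantized tokens are zero or near-zero and compress well, which is the intuition behind the efficiency gains.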


Training Details

Dataset

This model was trained on the VLA-Arena/VLA_Arena_L0_L_lerobot_openpi dataset. This dataset contains demonstration data collected from VLA-Arena, formatted specifically for LeRobot and OpenPi training pipelines.

Hyperparameters

The model was fine-tuned using LoRA (Low-Rank Adaptation) with the following configuration:

| Parameter | Value |
| --- | --- |
| Max Training Steps | 60,000 |
| Global Batch Size | 32 |
| Optimizer | AdamW |
| LR Schedule | CosineDecaySchedule |
| EMA | Disabled |
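LoRA keeps each pretrained weight matrix W frozen and learns a low-rank update scaled by alpha/r, so the adapted layer computes x(W + (alpha/r)·BA)ᵀ. A minimal numpy sketch with illustrative shapes (not the actual Gemma-2B adapter configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 8, 16  # illustrative sizes, not the real config

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Frozen base path plus scaled low-rank update; only A and B are trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
print(np.allclose(lora_forward(x), x @ W.T))  # True: zero-init B makes the update a no-op
```

Zero-initializing B means fine-tuning starts exactly from the pretrained model, which is the standard LoRA initialization.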

Tokenization Config

| Parameter | Value |
| --- | --- |
| Action Dimension | 7 |
| Action Horizon | 10 |
| Max Token Length | 180 |

Evaluation & Usage

This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
