Ο€β‚€-FAST (VLA-Arena Fine-tuned)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

Model Overview

This model is Ο€β‚€-FAST fine-tuned on demonstration data generated from VLA-Arena. It serves as a strong baseline for evaluating performance across the benchmark's Safety, Distractor, Extrapolation, and Long Horizon dimensions.

This checkpoint utilizes FAST action tokenization, enabling efficient and stable multi-step action prediction (10-step horizon) for continuous control.


Model Architecture

The model combines a Vision-Language Model (VLM) backbone with a specialized action tokenizer to handle continuous robotic control.

| Component | Description |
| --- | --- |
| Backbone | Gemma-2B (Vision-Language Model) with LoRA adaptation |
| Action Space | 7-DoF continuous control (end-effector pose + gripper) |
| Tokenization | FAST (compresses temporally extended action sequences) |
| Prediction | Multi-step prediction (horizon: 10 steps) |
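As an illustration of the prediction setup above, the sketch below treats the policy output as a (10, 7) action chunk and executes it in a receding-horizon loop. The `fake_policy` function and observation shape are hypothetical stand-ins, not the actual openpi API:

```python
import numpy as np

HORIZON, ACTION_DIM = 10, 7  # horizon and action dimension from the table above

def fake_policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the real model: returns one 10-step, 7-DoF action chunk."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((HORIZON, ACTION_DIM))

# Receding-horizon rollout: predict a chunk, execute every step, then re-plan.
obs = np.zeros(8)  # dummy observation
executed = []
for _ in range(3):  # three planning cycles
    chunk = fake_policy(obs)
    assert chunk.shape == (HORIZON, ACTION_DIM)
    executed.extend(chunk)  # in a real loop, each row would be sent to the robot

print(len(executed))  # 30 actions executed over 3 chunks
```

Executing the full chunk before re-querying the model is what makes multi-step prediction cheaper and more stable than predicting one action per forward pass.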

Key Feature: FAST Tokenizer

The FAST tokenizer compresses action sequences into a compact token representation. This improves learning efficiency and rollout stability compared to standard per-step prediction, making it particularly effective for the complex tasks found in VLA-Arena.
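FAST combines a discrete cosine transform with byte-pair encoding over the quantized coefficients. The toy sketch below illustrates only the frequency-space quantization idea on a smooth action chunk; the real tokenizer's vocabulary, scales, and BPE stage differ:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis (n x n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

# Toy action chunk: 10 timesteps x 7 DoF of smooth sinusoidal trajectories.
t = np.linspace(0, 1, 10)
chunk = np.stack([np.sin(2 * np.pi * t + p) for p in np.linspace(0, 1, 7)], axis=1)

D = dct_matrix(10)
coeffs = D @ chunk                           # frequency-space representation per DoF
tokens = np.round(coeffs * 64).astype(int)   # coarse quantization -> integer tokens
recon = D.T @ (tokens / 64.0)                # inverse DCT of dequantized coefficients

print(np.abs(recon - chunk).max() < 0.05)    # True: small round-trip error
```

Because smooth trajectories concentrate energy in a few low-frequency coefficients, most quantized tokens are zero or near-zero and compress well, which is the intuition behind the efficiency gains.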


Training Details

Dataset

This model was trained on the VLA-Arena/VLA_Arena_L0_L_lerobot_openpi dataset. This dataset contains demonstration data collected from VLA-Arena, formatted specifically for LeRobot and OpenPi training pipelines.

Hyperparameters

The model was fine-tuned using LoRA (Low-Rank Adaptation) with the following configuration:

| Parameter | Value |
| --- | --- |
| Max Training Steps | 60,000 |
| Global Batch Size | 32 |
| Optimizer | AdamW |
| LR Schedule | CosineDecaySchedule |
| EMA | Disabled |
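LoRA keeps each pretrained weight matrix W frozen and learns a low-rank update scaled by alpha/r, so the adapted layer computes x(W + (alpha/r)·BA)ᵀ. A minimal numpy sketch with illustrative shapes (not the actual Gemma-2B adapter configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 8, 16  # illustrative sizes, not the real config

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Frozen base path plus scaled low-rank update; only A and B are trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
print(np.allclose(lora_forward(x), x @ W.T))  # True: zero-init B makes the update a no-op
```

Zero-initializing B means fine-tuning starts exactly from the pretrained model, which is the standard LoRA initialization.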

Tokenization Config

| Parameter | Value |
| --- | --- |
| Action Dimension | 7 |
| Action Horizon | 10 |
| Max Token Length | 180 |

Evaluation & Usage

This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
