# Model Card for RoboNoid's EgoActor 4B
EgoActor is a unified vision-language model designed to translate natural language instructions into precise spatial and temporal action sequences that control humanoid robots.
## Model Details
### Model Description
EgoActor is a unified vision-language model that translates natural language instructions into precise spatial and temporal action sequences for controlling humanoid robots. It combines perception, planning, and action execution by grounding instructions into egocentric, spatial-aware motor behaviors, spanning movement, manipulation, perception, and human interaction, and thereby bridges abstract task planning and concrete embodied control.
- Developed by: BAAI-Agents team
- Model type: Vision-Language Model (VLM)
- Language(s) (NLP): English (natural language instructions); the model additionally consumes egocentric visual input.
- License: Apache-2.0
- Finetuned from model: Qwen3-VL
### Model Sources
- Repository: https://github.com/BAAI-Agents/RoboNoid/tree/main/papers/EgoActor
- Paper: https://arxiv.org/abs/2602.04515
- Demo: https://baai-agents.github.io/RoboNoid/EgoActor/
## Uses
### Direct Use
EgoActor is intended for robotics and embodied AI research where:
- Researchers require instruction-to-action grounding for humanoid robots.
- Models interact with egocentric vision and spatial reasoning to produce motor commands.
- Simulation and real-world robot testing are needed for mobile manipulation tasks. 
Example tasks include:
- Approaching and picking up objects from first-person camera input.
- Navigation and humanoid manipulation tasks specified by natural language prompts.
### Out-of-Scope Use
- This model is not designed for tasks that require only high-speed, low-level control (e.g., hobby drones or micro-robotics).
- Not intended for natural language dialogue or general non-embodied tasks.
- Does not provide general LLM capabilities outside of embodied action sequencing.
## Bias, Risks, and Limitations
- Egocentric vision dependence: Model performance depends on egocentric RGB inputs; behavior may degrade with poor sensor data.
- Generalization: Although trained on diverse simulated and real data, adaptation to drastically different robot hardware or unstructured environments may require fine-tuning.
- Safety considerations: Using the model in physical robots introduces safety risks (unexpected movements; collision hazards). Appropriate safety controls must be in place.
- Data bias: The training datasets and environments shape which tasks the model performs well on; unusual or adversarial scenarios may result in failures or unpredictable actions.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
This model is served using vLLM (v0.11.0) and supports multi-image vision–language inputs for embodied action prediction from egocentric robot observations.
### 1. Serve the Model with vLLM
Install the required vLLM version:
```shell
pip install vllm==0.11.0
```
Start the vLLM server with multi-modal support enabled:
```shell
vllm serve [model_name] \
    --host 127.0.0.1 \
    --port 8000 \
    --max-model-len 6400 \
    --chat-template-content-format auto \
    --limit-mm-per-prompt.image 15 \
    --served-model-name "vlm_agent"
```
### 2. System Prompt
The model is designed for first-person embodied robot perception and action planning. Use the following system prompt to initialize model behavior:
```python
system_message = """You are a Vision Language Model specialized in processing the first person view images of embodied robots.
Your task is to analyze the provided image and respond to queries with answers. Focus on the spatial relations in the image and make the right decisions."""
```
### 3. User Query Format
The model predicts the next executable action sequence given:
- a high-level language instruction,
- sampled historical observation frames,
- recent observation frames and actions.
Example query construction:
```python
# `instruction` holds the high-level task string for the current episode.
sample_data = {
    "query": (
        "Given the following instruction, a series of sampled historical observation "
        "and recent observation image frames, predict a usable action sequence that "
        "you should perform next. Output format: "
        "'Turn [direction] [degrees] degrees; "
        "Look [direction] [degrees] degrees; "
        "Move [direction] [distance] meters; "
        "[direction] sidewalk [distance] meters; "
        "[manipulation action text]; "
        "[interaction action text]; "
        "Stop and no action'.\n\n"
        "Your task is: " + instruction
    )
}
```
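The query above is only one field of the larger sample dict that the formatting code in step 4 consumes. As a sketch of the expected shape, here is a hypothetical instance; the field names match the code in step 4, but the paths, frame counts, query text, and actions are invented for illustration:

```python
# Hypothetical input sample for the step-4 formatting function.
# Field names are taken from that code; all values here are made up.
sample = {
    # Full query string as built in step 3 (shortened here for readability).
    "query": "Predict the next action sequence. Your task is: Walk to the table.",
    # Paths to all egocentric frames observed so far in the episode.
    "historical_images": [f"/data/episode01/resized/{i:04d}.jpg" for i in range(30)],
    # Recent (frame, action) pairs; an action of "None" marks the frame
    # whose action the model should predict.
    "image_action_pairs": [
        ("/data/episode01/resized/0030.jpg", "Move forward 0.5 meters"),
        ("/data/episode01/resized/0031.jpg", "None"),
    ],
}
print(sorted(sample.keys()))  # ['historical_images', 'image_action_pairs', 'query']
```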
### 4. Multi-Modal Chat Message Formatting
Inputs are formatted as OpenAI-style chat messages with mixed text and image content. The following function constructs a complete multi-modal prompt including:
- system instructions,
- uniformly sampled historical observations,
- recent observation frames and actions.
```python
def sample_n_elements(my_list, n):
    # Uniformly sample n elements from my_list, preserving order.
    if n > len(my_list):
        return my_list
    if n <= 0:
        return []
    interval = len(my_list) / n
    sampled_elements = []
    for i in range(n):
        index = int(round(i * interval))
        if index < len(my_list):
            sampled_elements.append(my_list[index])
    return sampled_elements
```
```python
def format_data_sft_with_step_and_hist(sample, system_message):
    # Build an OpenAI-style multi-modal message list: system prompt,
    # query text, sampled historical frames, then recent frame/action pairs.
    formatted_sample = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        }
    ]
    tmp_content = [
        {"type": "text", "text": sample["query"] + "\n"},
        {"type": "text", "text": "\nSampled Historical Observations:\n"},
    ]
    # Keep at most 10 uniformly sampled historical frames.
    sampled_historical_images = sample_n_elements(sample["historical_images"], 10)
    for img in sampled_historical_images:
        tmp_content.append({
            "type": "image",
            "image": "file://" + img.replace("resized", "shrinked"),
        })
    tmp_content.append({"type": "text", "text": "\nRecent Observations:\n"})
    pos = 0
    for image, action in sample["image_action_pairs"]:
        if pos == 0:
            # The first recent frame carries the query and history context.
            formatted_sample.append({
                "role": "user",
                "content": tmp_content + [
                    {"type": "image", "image": "file://" + image},
                    {"type": "text", "text": "Next action: "},
                ],
            })
        else:
            formatted_sample.append({
                "role": "user",
                "content": [
                    {"type": "image", "image": "file://" + image},
                    {"type": "text", "text": "Next action: "},
                ],
            })
        # An action of "None" marks the frame whose action is to be predicted,
        # so no assistant turn is emitted for it.
        if action.lower() != "none":
            formatted_sample.append({
                "role": "assistant",
                "content": [{"type": "text", "text": action}],
            })
        pos += 1
    return formatted_sample
```
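The messages produced above use a bare `{"type": "image"}` content part. OpenAI-compatible chat endpoints, including the vLLM server started in step 1, instead expect image parts as `{"type": "image_url", "image_url": {"url": ...}}`. The helper below is a hypothetical converter (not part of the EgoActor release) that performs this translation:

```python
# Hypothetical helper, not from the EgoActor code: rewrite "image" content
# parts into the "image_url" shape used by OpenAI-compatible chat APIs.
def to_openai_messages(formatted_sample):
    messages = []
    for msg in formatted_sample:
        content = []
        for part in msg["content"]:
            if part["type"] == "image":
                content.append({
                    "type": "image_url",
                    "image_url": {"url": part["image"]},
                })
            else:
                content.append(part)
        messages.append({"role": msg["role"], "content": content})
    return messages

# Smoke test on a hand-built single-turn message list.
demo = [{"role": "user", "content": [
    {"type": "text", "text": "Next action: "},
    {"type": "image", "image": "file:///tmp/frame0.jpg"},
]}]
print(to_openai_messages(demo)[0]["content"][1]["type"])  # image_url
```

The converted messages can then be sent to the server from step 1, e.g. via the `openai` Python client with `base_url="http://127.0.0.1:8000/v1"` and `model="vlm_agent"` (the `--served-model-name` chosen above); that final call is not shown here.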
## Training Details
### Training Procedure
Refer to Section IV of the paper.
#### Training Hyperparameters
- Training regime: Refer to Section IV of the paper.
## Citation
**BibTeX:**
```bibtex
@article{bai2026EgoActor,
  title={{E}go{A}ctor: {G}rounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models},
  author={Yu Bai and Mingming Yu and Chaojie Li and Ziyi Bai and Xinlong Wang and Börje F. Karlsson},
  journal={arXiv preprint arXiv:2602.04515},
  year={2026},
  url={https://arxiv.org/abs/2602.04515}
}
```