# Model Card for RoboNoid's EgoActor 4B
EgoActor is a unified vision-language model designed to translate natural language instructions into precise spatial and temporal action sequences that control humanoid robots.
## Model Details
### Model Description
EgoActor is a unified vision-language model that translates natural language instructions into precise spatial and temporal action sequences for controlling humanoid robots. It combines perception, planning, and action execution by grounding instructions into egocentric, spatial-aware motor behaviors, spanning movement, manipulation, perception, and human interaction, and thereby bridges abstract task planning and concrete embodied control.
- Developed by: BAAI-Agents team
- Model type: Vision-Language Model (VLM)
- Language(s) (NLP): English (natural language instructions); the model additionally consumes egocentric visual input.
- License: Apache-2.0
- Finetuned from model: Qwen3-VL
### Model Sources
- Repository: https://github.com/BAAI-Agents/RoboNoid/tree/main/papers/EgoActor
- Paper: https://arxiv.org/abs/2602.04515
- Demo: https://baai-agents.github.io/RoboNoid/EgoActor/
## Uses
### Direct Use
EgoActor is intended for robotics and embodied AI research where:
- Researchers require instruction-to-action grounding for humanoid robots.
- Models interact with egocentric vision and spatial reasoning to produce motor commands.
- Simulation and real-world robot testing are needed for mobile manipulation tasks. 
Example tasks include:
- Approaching and picking up objects from first-person camera input.
- Navigation and humanoid manipulation tasks specified by natural language prompts.
### Out-of-Scope Use
- This model is not designed for tasks that require only high-speed, low-level control (e.g., hobby drones or micro-robotics).
- Not intended for natural language dialogue or general non-embodied tasks.
- Does not provide general LLM capabilities outside of embodied action sequencing.
## Bias, Risks, and Limitations
- Egocentric vision dependence: Model performance depends on egocentric RGB inputs; behavior may degrade with poor sensor data.
- Generalization: Although trained on diverse simulated and real data, adaptation to drastically different robot hardware or unstructured environments may require fine-tuning.
- Safety considerations: Using the model in physical robots introduces safety risks (unexpected movements; collision hazards). Appropriate safety controls must be in place.
- Data bias: The training datasets and environments shape which tasks the model performs well on; unusual or adversarial scenarios may result in failures or unpredictable actions.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
This model is served using vLLM (v0.11.0) and supports multi-image vision–language inputs for embodied action prediction from egocentric robot observations.
### 1. Serve the Model with vLLM
Install the required vLLM version:
```shell
pip install vllm==0.11.0
```
Start the vLLM server with multi-modal support enabled:
```shell
vllm serve [model_name] \
    --host 127.0.0.1 \
    --port 8000 \
    --max-model-len 6400 \
    --chat-template-content-format auto \
    --limit-mm-per-prompt.image 15 \
    --served-model-name "vlm_agent"
```
### 2. System Prompt
The model is designed for first-person embodied robot perception and action planning. Use the following system prompt to initialize model behavior:
```python
system_message = """You are a Vision Language Model specialized in processing the first person view images of embodied robots.
Your task is to analyze the provided image and respond to queries with answers. Focus on the spatial relations in the image and make the right decisions."""
```
### 3. User Query Format
The model predicts the next executable action sequence given:
- a high-level language instruction,
- sampled historical observation frames,
- recent observation frames and actions.
Example query construction:
```python
# `instruction` holds the high-level task string for the current episode.
sample_data = {
    "query": (
        "Given the following instruction, a series of sampled historical observation "
        "and recent observation image frames, predict a usable action sequence that "
        "you should perform next. Output format: "
        "'Turn [direction] [degrees] degrees; "
        "Look [direction] [degrees] degrees; "
        "Move [direction] [distance] meters; "
        "[direction] sidewalk [distance] meters; "
        "[manipulation action text]; "
        "[interaction action text]; "
        "Stop and no action'.\n\n"
        "Your task is: " + instruction
    )
}
```
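The query above is only one field of the larger sample dict that the formatting code in step 4 consumes. As a sketch of the expected shape, here is a hypothetical instance; the field names match the code in step 4, but the paths, frame counts, query text, and actions are invented for illustration:

```python
# Hypothetical input sample for the step-4 formatting function.
# Field names are taken from that code; all values here are made up.
sample = {
    # Full query string as built in step 3 (shortened here for readability).
    "query": "Predict the next action sequence. Your task is: Walk to the table.",
    # Paths to all egocentric frames observed so far in the episode.
    "historical_images": [f"/data/episode01/resized/{i:04d}.jpg" for i in range(30)],
    # Recent (frame, action) pairs; an action of "None" marks the frame
    # whose action the model should predict.
    "image_action_pairs": [
        ("/data/episode01/resized/0030.jpg", "Move forward 0.5 meters"),
        ("/data/episode01/resized/0031.jpg", "None"),
    ],
}
print(sorted(sample.keys()))  # ['historical_images', 'image_action_pairs', 'query']
```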
### 4. Multi-Modal Chat Message Formatting
Inputs are formatted as OpenAI-style chat messages with mixed text and image content. The following function constructs a complete multi-modal prompt including:
- system instructions,
- uniformly sampled historical observations,
- recent observation frames and actions.
```python
def sample_n_elements(my_list, n):
    # Uniformly sample n elements from my_list, preserving order.
    if n > len(my_list):
        return my_list
    if n <= 0:
        return []
    interval = len(my_list) / n
    sampled_elements = []
    for i in range(n):
        index = int(round(i * interval))
        if index < len(my_list):
            sampled_elements.append(my_list[index])
    return sampled_elements
```
```python
def format_data_sft_with_step_and_hist(sample, system_message):
    # Build an OpenAI-style multi-modal message list: system prompt,
    # query text, sampled historical frames, then recent frame/action pairs.
    formatted_sample = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        }
    ]
    tmp_content = [
        {"type": "text", "text": sample["query"] + "\n"},
        {"type": "text", "text": "\nSampled Historical Observations:\n"},
    ]
    # Keep at most 10 uniformly sampled historical frames.
    sampled_historical_images = sample_n_elements(sample["historical_images"], 10)
    for img in sampled_historical_images:
        tmp_content.append({
            "type": "image",
            "image": "file://" + img.replace("resized", "shrinked"),
        })
    tmp_content.append({"type": "text", "text": "\nRecent Observations:\n"})
    pos = 0
    for image, action in sample["image_action_pairs"]:
        if pos == 0:
            # The first recent frame carries the query and history context.
            formatted_sample.append({
                "role": "user",
                "content": tmp_content + [
                    {"type": "image", "image": "file://" + image},
                    {"type": "text", "text": "Next action: "},
                ],
            })
        else:
            formatted_sample.append({
                "role": "user",
                "content": [
                    {"type": "image", "image": "file://" + image},
                    {"type": "text", "text": "Next action: "},
                ],
            })
        # An action of "None" marks the frame whose action is to be predicted,
        # so no assistant turn is emitted for it.
        if action.lower() != "none":
            formatted_sample.append({
                "role": "assistant",
                "content": [{"type": "text", "text": action}],
            })
        pos += 1
    return formatted_sample
```
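The messages produced above use a bare `{"type": "image"}` content part. OpenAI-compatible chat endpoints, including the vLLM server started in step 1, instead expect image parts as `{"type": "image_url", "image_url": {"url": ...}}`. The helper below is a hypothetical converter (not part of the EgoActor release) that performs this translation:

```python
# Hypothetical helper, not from the EgoActor code: rewrite "image" content
# parts into the "image_url" shape used by OpenAI-compatible chat APIs.
def to_openai_messages(formatted_sample):
    messages = []
    for msg in formatted_sample:
        content = []
        for part in msg["content"]:
            if part["type"] == "image":
                content.append({
                    "type": "image_url",
                    "image_url": {"url": part["image"]},
                })
            else:
                content.append(part)
        messages.append({"role": msg["role"], "content": content})
    return messages

# Smoke test on a hand-built single-turn message list.
demo = [{"role": "user", "content": [
    {"type": "text", "text": "Next action: "},
    {"type": "image", "image": "file:///tmp/frame0.jpg"},
]}]
print(to_openai_messages(demo)[0]["content"][1]["type"])  # image_url
```

The converted messages can then be sent to the server from step 1, e.g. via the `openai` Python client with `base_url="http://127.0.0.1:8000/v1"` and `model="vlm_agent"` (the `--served-model-name` chosen above); that final call is not shown here.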
## Training Details
### Training Procedure
Refer to Section IV of the paper.
#### Training Hyperparameters
- Training regime: Refer to Section IV of the paper.
## Citation
**BibTeX:**
```bibtex
@article{bai2026EgoActor,
  title={{E}go{A}ctor: {G}rounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models},
  author={Yu Bai and Mingming Yu and Chaojie Li and Ziyi Bai and Xinlong Wang and Börje F. Karlsson},
  journal={arXiv preprint arXiv:2602.04515},
  year={2026},
  url={https://arxiv.org/abs/2602.04515}
}
```