If you've been following Embodied AI research, you've likely noticed a frustrating pattern: every new Vision-Language-Action (VLA) paper comes with its own codebase, its own data format, its own evaluation protocol. Want to fairly compare two methods? Prepare to spend weeks just setting up environments. StarVLA was built to solve exactly this problem — a modular "Lego-like" framework that lets you swap backbones and action heads like building blocks, all sharing the same trainer, dataloader, and deployment stack.
Presented at ICLR 2026, the project has already garnered over 1,500 GitHub stars within weeks of release. In this tutorial, we'll walk through everything from architecture understanding to installation, training, and inference with StarVLA.
Why StarVLA?
Before StarVLA, experimenting with VLA models meant dealing with:
- Fragmentation: OpenVLA, Octo, RT-2, π₀ — each with its own repo, training pipeline, and evaluation setup.
- Unfair comparisons: Different data pipelines, different preprocessing → results aren't comparable.
- Wasted time: Want to try a new action head on an existing backbone? Rewrite everything from scratch.
StarVLA solves this by cleanly separating the backbone (the "see and understand language" part) from the action head (the "decide what to do" part). If you're new to VLA concepts, check out our VLA Models overview first. These two components can be swapped independently — like changing a camera lens without replacing the camera body.
StarVLA Architecture
Backbone — The Vision-Language "Brain"
StarVLA supports two main backbone types:
1. VLM Backbone (Vision-Language Model)
- Qwen3-VL (0.8B, 2B, 4B, 9B) — latest backbone with strong multilingual support
- Qwen2.5-VL — stable, well-tested version
- InternVL — open-source alternative
- Florence-2 — lightweight backbone from Microsoft
2. World Model Backbone
- Cosmos (NVIDIA) — world model that predicts next states
- Enables the VLA to "imagine" action outcomes before execution
Action Head — The Decision-Making "Arm"
This is where things get interesting. StarVLA provides 4 distinct action head types, each representing a different paradigm in VLA:
| Action Head | Paradigm | Description |
|---|---|---|
| StarVLA-FAST | Autoregressive Discrete | Tokenizes actions into discrete tokens, decodes sequentially like a language model |
| StarVLA-OFT | Parallel Continuous | MLP head decodes continuous actions in parallel, fastest inference |
| StarVLA-PI | Flow-Matching Diffusion | Uses flow-matching to generate continuous actions, highest accuracy |
| StarVLA-GR00T | Dual-System | VLM as "System 2" (deliberation), Flow-Matching as "System 1" (reflexes) |
StarVLA-FAST is ideal when you want to leverage the autoregressive power of VLMs. StarVLA-OFT offers the fastest inference through parallel decoding. StarVLA-PI delivers the highest accuracy with continuous action spaces — if you want to dive deeper into flow-matching, see our Diffusion Policy tutorial. StarVLA-GR00T is the most advanced, combining the "slow thinking" of VLMs with the "fast reflexes" of flow-matching — inspired by Daniel Kahneman's Dual Process theory.
The key insight: all 4 action heads share the same trainer, dataloader, and deployment stack. You only need to change a config parameter to switch between them.
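The swap mechanism can be pictured as a simple registry keyed by a config string. The sketch below is illustrative only — the names (`HEAD_REGISTRY`, `build_action_head`, the `describe` method) are assumptions, not StarVLA's actual internals:

```python
# Minimal sketch of config-driven action-head swapping.
# All names here are illustrative, not StarVLA's real API.

HEAD_REGISTRY = {}

def register_head(name):
    """Decorator that maps a config string to a head class."""
    def wrapper(cls):
        HEAD_REGISTRY[name] = cls
        return cls
    return wrapper

@register_head("oft")
class OFTHead:
    def describe(self):
        return "parallel continuous decoding"

@register_head("fast")
class FASTHead:
    def describe(self):
        return "autoregressive discrete tokens"

def build_action_head(config):
    # One config key selects the paradigm; the trainer code never changes.
    return HEAD_REGISTRY[config["action_head"]]()

head = build_action_head({"action_head": "oft"})
print(head.describe())  # parallel continuous decoding
```

Because the trainer only ever calls the shared interface, swapping `"oft"` for `"fast"` in the config is the entire change.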
Installation
System Requirements
- Python 3.10+
- CUDA 12.0 or 12.4 (confirmed compatible)
- GPU: minimum 24GB VRAM (A5000, RTX 4090, A100)
- RAM: 32GB+
Step 1: Clone and Set Up Environment
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Create conda environment
conda create -n starvla python=3.10 -y
conda activate starvla
# Install dependencies
pip install -r requirements.txt
# FlashAttention2 — IMPORTANT: must match your CUDA version
pip install flash-attn --no-build-isolation
# Install StarVLA in development mode
pip install -e .
Important note: FlashAttention2 must be compatible with your CUDA toolkit and PyTorch versions. The framework has confirmed compatibility with flash-attn==2.7.4.post1 on CUDA 12.0 and 12.4.
Step 2: Download Pretrained Backbone
# Create model directory
mkdir -p playground/Pretrained_models
# Download Qwen3-VL-4B (recommended starting point)
huggingface-cli download Qwen/Qwen3-VL-4B-Instruct \
--local-dir playground/Pretrained_models/Qwen3-VL-4B-Instruct
Step 3: Verify Installation
# Quick test — run minimal forward pass
python starVLA/model/framework/QwenGR00T.py
If no errors appear, you're ready to go.
Training a VLA Model
Data Preparation
StarVLA uses a standardized data format. Each sample consists of:
- Observation: Camera images (supports multi-view)
- Language instruction: Task description (e.g., "pick up the red cup")
- Action: Corresponding action sequence (position, rotation, gripper state)
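One way to picture a single sample is as a small record with these three fields. The field names and the 7-DoF action layout below are assumptions for illustration, not StarVLA's exact schema:

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of one training sample; field names and the
# 7-DoF action layout are assumptions, not StarVLA's exact schema.
@dataclass
class VLASample:
    images: List[bytes]          # one entry per camera view
    instruction: str             # natural-language task description
    actions: List[List[float]]   # chunk of [x, y, z, roll, pitch, yaw, gripper]

sample = VLASample(
    images=[b"<wrist-cam jpeg bytes>", b"<front-cam jpeg bytes>"],
    instruction="pick up the red cup",
    actions=[[0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]],
)
print(len(sample.images), sample.instruction)  # 2 pick up the red cup
```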
The framework supports major datasets out of the box:
- Open X-Embodiment — largest multi-robot dataset
- LIBERO — popular manipulation benchmark
- RoboTwin 2.0 — digital twin benchmark
- RoboCasa — household interaction benchmark
Training on LIBERO (Concrete Example)
LIBERO is the best starting point — small dataset, easy to download, and quick to validate results.
# Train StarVLA-OFT on LIBERO
bash examples/LIBERO/train_files/run_libero_train.sh
The training script includes these key parameters:
# Key training parameters
python -m starVLA.train \
--backbone qwen3-vl-4b \
--action_head oft \
--dataset libero \
--batch_size 16 \
--learning_rate 1e-4 \
--num_epochs 50 \
--output_dir ./checkpoints/libero-oft
Key parameter explanations:
- --backbone: Choose the VLM backbone (qwen3-vl-4b, qwen2.5-vl-3b, ...)
- --action_head: Choose the paradigm (fast, oft, pi, gr00t)
- --dataset: Training dataset
- --batch_size: Depends on VRAM — 16 for an A100 80GB, 4-8 for an RTX 4090
Switching Between Action Heads
This is the real power of StarVLA. Want to try a different action head? Just change one line:
# Try StarVLA-FAST (autoregressive discrete)
python -m starVLA.train --action_head fast ...
# Try StarVLA-PI (flow-matching diffusion)
python -m starVLA.train --action_head pi ...
# Try StarVLA-GR00T (dual-system)
python -m starVLA.train --action_head gr00t ...
Everything else stays the same — same data pipeline, same trainer. You only change how the model decodes actions.
Evaluation and Benchmarks
Two-Terminal Architecture
StarVLA uses a policy server + simulator client pattern to avoid dependency conflicts:
# Terminal 1: Run policy server (model inference)
bash examples/LIBERO/eval_files/run_policy_server.sh &
# Terminal 2: Run simulator environment
bash examples/LIBERO/eval_files/eval_libero.sh
This design has a key advantage: model serving and simulation run in separate processes, communicating via WebSocket. This means you can deploy the same model to both simulation and real robots without rewriting any serving code.
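The exchange between the two processes can be sketched as JSON messages over the socket. The wire format below is an assumption for illustration — the real schema lives in StarVLA's serve and client code:

```python
import base64
import json

# Hypothetical wire format for the policy server; the actual schema
# is defined in StarVLA's serve/client code.
def encode_request(image_bytes, instruction):
    """Client side: pack an observation and instruction into one message."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "instruction": instruction,
    })

def decode_response(payload):
    """Client side: unpack the server's reply into an action chunk."""
    return json.loads(payload)["action"]

req = encode_request(b"\xff\xd8fake-jpeg", "pick up the red cup")
action = decode_response('{"action": [0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]}')
print(action)  # [0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]
```

Keeping the protocol this thin is what lets the same server sit behind a simulator today and a real robot tomorrow.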
Benchmark Results
Here are the published results from the paper:
| Benchmark | StarVLA-OFT | StarVLA-GR00T | Notes |
|---|---|---|---|
| LIBERO | 96.6% | — | Average success rate |
| SimplerEnv | — | 71.4% | WidowX success rate |
| RoboCasa | 48.8% | — | Average success |
| RoboTwin 2.0 | 88.32 | — | Hard average score |
Notably, these results were achieved with minimal data engineering — just simple training recipes, no complex tricks. The framework matches or surpasses prior methods across multiple benchmarks.
Inference and Deployment
Running Inference in Simulation
# Start policy server with trained checkpoint
python -m starVLA.serve \
--checkpoint ./checkpoints/libero-oft/best.pt \
--backbone qwen3-vl-4b \
--action_head oft \
--port 8765
# Connect from simulator
python -m starVLA.eval.libero_client \
--policy_url ws://localhost:8765 \
--task_suite libero_spatial
Deploying to a Real Robot
StarVLA provides complete examples for the Franka Panda robot:
# Real robot deployment example
from starVLA.deploy import PolicyClient

# Connect to policy server
client = PolicyClient("ws://localhost:8765")

# Control loop — camera, robot, and done come from your own robot stack
while not done:
    # Capture observation from camera
    image = camera.capture()
    # Send observation + instruction, receive action
    action = client.predict(
        image=image,
        instruction="pick up the red cup",
    )
    # Execute action on robot
    robot.execute(action)
The unified WebSocket interface is a major strength: write deployment code once, run it on both simulation and real hardware.
Extending StarVLA — Custom Action Heads
One of StarVLA's greatest strengths is extensibility. Want to add a new action head? You just need to:
- Implement the action head module — inherit from the base class
- Register the module — add it to the registry
- Done — the framework automatically integrates it with the trainer and deployment stack
import torch.nn as nn
import torch.nn.functional as F

from starVLA.model.action_heads import BaseActionHead, register_head

@register_head("my_custom_head")
class MyCustomActionHead(BaseActionHead):
    def __init__(self, config):
        super().__init__(config)
        # Define layers
        self.action_mlp = nn.Sequential(
            nn.Linear(config.hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, config.action_dim),
        )

    def forward(self, backbone_features, **kwargs):
        # Decode action from backbone features
        return self.action_mlp(backbone_features)

    def compute_loss(self, pred_actions, gt_actions):
        return F.mse_loss(pred_actions, gt_actions)
Similarly, adding a new backbone only requires implementing a lightweight adapter.
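Such an adapter might look like the sketch below. The base-class name, method signature, and dummy implementation are all assumptions for illustration — consult the repo for the real interface:

```python
from abc import ABC, abstractmethod

# Illustrative backbone-adapter interface; the class and method names
# are assumptions, not StarVLA's actual base class.
class BaseBackboneAdapter(ABC):
    @abstractmethod
    def encode(self, images, instruction):
        """Return fused vision-language features for the action head."""

class DummyBackbone(BaseBackboneAdapter):
    def __init__(self, hidden_dim=8):
        self.hidden_dim = hidden_dim

    def encode(self, images, instruction):
        # Stand-in: a real adapter would run the VLM's forward pass here.
        return [0.0] * self.hidden_dim

features = DummyBackbone().encode(images=[], instruction="pick up the red cup")
print(len(features))  # 8
```

As long as `encode` returns features in the shape the action heads expect, the rest of the stack needs no changes.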
Comparison with Other Frameworks
| Criteria | StarVLA | LeRobot | OpenVLA |
|---|---|---|---|
| Modular backbone | ✅ Free swapping | ❌ Fixed | ❌ Fixed |
| Multiple action heads | ✅ 4 paradigms | ✅ 2-3 paradigms | ❌ 1 paradigm |
| Integrated benchmarks | ✅ 5+ benchmarks | ✅ 3-4 | ❌ Self-setup |
| Real robot deploy | ✅ WebSocket API | ✅ Native | ❌ Needs adapter |
| World model support | ✅ Cosmos | ❌ | ❌ |
| Community | 1.5k+ stars | 10k+ stars | 5k+ stars |
StarVLA doesn't replace LeRobot or OpenVLA — they serve different purposes. LeRobot excels at data collection and end-to-end training on real hardware (see our LeRobot hands-on guide). OpenVLA pioneered simple VLA design. StarVLA shines when you need to compare multiple architectures or develop new methods — it's a tool built for researchers.
Practical Tips
Where to start?
- Use StarVLA-OFT + Qwen3-VL-4B for your first run — fast inference, solid results
- Train on LIBERO first — small dataset, easy to debug
- Once comfortable, try StarVLA-GR00T — the dual-system architecture offers deeper insights
Pitfalls to avoid:
- Don't use a FlashAttention build that doesn't match your CUDA version — mismatches typically fail at import time with cryptic errors
- Batch size too large will cause OOM — start small and scale up
- When comparing action heads, keep everything else constant (backbone, data, hyperparams)
Summary
StarVLA represents an important step toward standardizing VLA research. Instead of every lab building their own framework, the community now has a shared foundation to build on, compare against, and reproduce results with. With its "Lego-like" modular design, you can focus on research ideas rather than writing boilerplate code.
Resources:
- Paper: StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing — Jinhui Ye et al., arXiv 2026
- GitHub: github.com/starVLA/starVLA
- Documentation: starvla.github.io
- Model Zoo: Hugging Face (Qwen3-VL-4B-Action, Qwen2.5-VL-3B-Action)
Related Posts
- VLA Models — When Robots Understand Language — Overview of Vision-Language-Action models and how they work
- Diffusion Policy for Robot Manipulation — Understanding flow-matching and diffusion-based action generation
- SpatialVLA — Enhancing VLA with Spatial Awareness — VLA framework with advanced spatial understanding