
StarVLA: Build Modular VLA Models

Step-by-step guide to building Vision-Language-Action models with StarVLA — a Lego-like modular framework from ICLR 2026 supporting 4 action head architectures.

Nguyễn Anh Tuấn · April 12, 2026 · 9 min read

If you've been following Embodied AI research, you've likely noticed a frustrating pattern: every new Vision-Language-Action (VLA) paper comes with its own codebase, its own data format, its own evaluation protocol. Want to fairly compare two methods? Prepare to spend weeks just setting up environments. StarVLA was built to solve exactly this problem — a modular "Lego-like" framework that lets you swap backbones and action heads like building blocks, all sharing the same trainer, dataloader, and deployment stack.

Presented at ICLR 2026, the project has already garnered over 1,500 GitHub stars within weeks of release. In this tutorial, we'll walk through everything from architecture understanding to installation, training, and inference with StarVLA.

Why StarVLA?

Before StarVLA, experimenting with VLA models meant dealing with:

  • Fragmentation: OpenVLA, Octo, RT-2, π₀ — each with its own repo, training pipeline, and evaluation setup.
  • Unfair comparisons: Different data pipelines, different preprocessing → results aren't comparable.
  • Wasted time: Want to try a new action head on an existing backbone? Rewrite everything from scratch.

StarVLA solves this by cleanly separating the backbone (the "see and understand language" part) from the action head (the "decide what to do" part). If you're new to VLA concepts, check out our VLA Models overview first. These two components can be swapped independently — like changing a camera lens without replacing the camera body.

StarVLA's modular architecture — backbone and action head can be swapped independently

StarVLA Architecture

Backbone — The Vision-Language "Brain"

StarVLA supports two main backbone types:

1. VLM Backbone (Vision-Language Model)

  • Qwen3-VL (0.8B, 2B, 4B, 9B) — latest backbone with strong multilingual support
  • Qwen2.5-VL — stable, well-tested version
  • InternVL — open-source alternative
  • Florence-2 — lightweight backbone from Microsoft

2. World Model Backbone

  • Cosmos (NVIDIA) — world model that predicts next states
  • Enables the VLA to "imagine" action outcomes before execution

Action Head — The Decision-Making "Arm"

This is where things get interesting. StarVLA provides 4 distinct action head types, each representing a different paradigm in VLA:

| Action Head | Paradigm | Description |
|---|---|---|
| StarVLA-FAST | Autoregressive discrete | Tokenizes actions into discrete tokens, decodes sequentially like a language model |
| StarVLA-OFT | Parallel continuous | MLP head decodes continuous actions in parallel, fastest inference |
| StarVLA-PI | Flow-matching diffusion | Uses flow matching to generate continuous actions, highest accuracy |
| StarVLA-GR00T | Dual-system | VLM as "System 2" (deliberation), flow matching as "System 1" (reflexes) |

StarVLA-FAST is ideal when you want to leverage the autoregressive power of VLMs. StarVLA-OFT offers the fastest inference through parallel decoding. StarVLA-PI delivers the highest accuracy with continuous action spaces — if you want to dive deeper into flow-matching, see our Diffusion Policy tutorial. StarVLA-GR00T is the most advanced, combining the "slow thinking" of VLMs with the "fast reflexes" of flow-matching — inspired by Daniel Kahneman's Dual Process theory.

The key insight: all 4 action heads share the same trainer, dataloader, and deployment stack. You only need to change a config parameter to switch between them.
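
To make the "swap via one config parameter" idea concrete, here is a minimal registry-dispatch sketch. The names `HEAD_REGISTRY`, `register_head`, and `build_action_head` are illustrative assumptions for explanation, not StarVLA's actual API.

```python
# Hypothetical sketch of config-driven action-head swapping.
# HEAD_REGISTRY / register_head / build_action_head are illustrative names.

HEAD_REGISTRY = {}

def register_head(name):
    """Decorator that maps a config string to an action-head class."""
    def wrap(cls):
        HEAD_REGISTRY[name] = cls
        return cls
    return wrap

@register_head("oft")
class OFTHead:
    def decode(self, features):
        # Parallel continuous decoding (stubbed)
        return f"parallel-continuous({features})"

@register_head("fast")
class FASTHead:
    def decode(self, features):
        # Autoregressive discrete decoding (stubbed)
        return f"autoregressive-discrete({features})"

def build_action_head(config):
    # One config field selects the paradigm; trainer and dataloader
    # never need to change.
    return HEAD_REGISTRY[config["action_head"]]()

head = build_action_head({"action_head": "oft"})
print(head.decode("z"))  # parallel-continuous(z)
```

The trainer only ever calls the shared interface (`decode` here), which is what lets every paradigm reuse the same training loop.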

Installation

System Requirements

  • Python 3.10+
  • CUDA 12.0 or 12.4 (confirmed compatible)
  • GPU: minimum 24GB VRAM (A5000, RTX 4090, A100)
  • RAM: 32GB+

Step 1: Clone and Set Up Environment

git clone https://github.com/starVLA/starVLA.git
cd starVLA

# Create conda environment
conda create -n starvla python=3.10 -y
conda activate starvla

# Install dependencies
pip install -r requirements.txt

# FlashAttention2 — IMPORTANT: must match your CUDA version
pip install flash-attn --no-build-isolation

# Install StarVLA in development mode
pip install -e .

Important note: FlashAttention2 must be compatible with your CUDA toolkit and PyTorch versions. The framework has confirmed compatibility with flash-attn==2.7.4.post1 on CUDA 12.0 and 12.4.

Step 2: Download Pretrained Backbone

# Create model directory
mkdir -p playground/Pretrained_models

# Download Qwen3-VL-4B (recommended starting point)
huggingface-cli download Qwen/Qwen3-VL-4B-Instruct \
  --local-dir playground/Pretrained_models/Qwen3-VL-4B-Instruct

Step 3: Verify Installation

# Quick test — run minimal forward pass
python starVLA/model/framework/QwenGR00T.py

If no errors appear, you're ready to go.

StarVLA supports both simulation and real-robot deployment

Training a VLA Model

Data Preparation

StarVLA uses a standardized data format. Each sample consists of:

  • Observation: Camera images (supports multi-view)
  • Language instruction: Task description (e.g., "pick up the red cup")
  • Action: Corresponding action sequence (position, rotation, gripper state)
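
The three fields above can be pictured as one sample dictionary. The field names and the 7-DoF action layout below are assumptions for illustration, not StarVLA's exact schema.

```python
# Illustrative shape of one training sample in a standardized VLA format.
# Field names and action layout are assumptions, not StarVLA's exact schema.
sample = {
    "observation": {
        # Multi-view camera images; nested lists stand in for H x W x C arrays
        "images": {"front": [[0, 0], [0, 0]], "wrist": [[0, 0], [0, 0]]},
    },
    "instruction": "pick up the red cup",
    # Action chunk: per step, xyz position (3) + axis-angle rotation (3)
    # + gripper open/close (1) = 7 values
    "action": [
        [0.42, -0.10, 0.31, 0.0, 0.0, 1.57, 1.0],
        [0.43, -0.09, 0.30, 0.0, 0.0, 1.57, 0.0],
    ],
}
```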

The framework supports major datasets out of the box:

  • Open X-Embodiment — largest multi-robot dataset
  • LIBERO — popular manipulation benchmark
  • RoboTwin 2.0 — digital twin benchmark
  • RoboCasa — household interaction benchmark

Training on LIBERO (Concrete Example)

LIBERO is the best starting point — small dataset, easy to download, and quick to validate results.

# Train StarVLA-OFT on LIBERO
bash examples/LIBERO/train_files/run_libero_train.sh

The training script includes these key parameters:

# Key training parameters
python -m starVLA.train \
  --backbone qwen3-vl-4b \
  --action_head oft \
  --dataset libero \
  --batch_size 16 \
  --learning_rate 1e-4 \
  --num_epochs 50 \
  --output_dir ./checkpoints/libero-oft

Key parameter explanations:

  • --backbone: Choose VLM backbone (qwen3-vl-4b, qwen2.5-vl-3b, ...)
  • --action_head: Choose paradigm (fast, oft, pi, gr00t)
  • --dataset: Training dataset
  • --batch_size: Depends on VRAM — 16 for A100 80GB, 4-8 for RTX 4090
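
When the target batch size does not fit in VRAM, the standard workaround is gradient accumulation: keep the effective batch constant by accumulating several micro-batches per optimizer step. The helper below is a sketch of the arithmetic; it is not a StarVLA CLI option.

```python
# Sketch: reach a target effective batch size on a small GPU via gradient
# accumulation. Illustrative only; not a StarVLA flag.

def accumulation_steps(target_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """How many micro-batches to accumulate before each optimizer step."""
    per_step = per_device_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch must be divisible by per-step batch")
    return target_batch // per_step

# A100 80GB: batch 16 fits directly, so no accumulation is needed
print(accumulation_steps(target_batch=16, per_device_batch=16))  # 1

# RTX 4090: only 4 samples fit, so accumulate 4 micro-batches
print(accumulation_steps(target_batch=16, per_device_batch=4))   # 4
```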

Switching Between Action Heads

This is the real power of StarVLA. Want to try a different action head? Just change one line:

# Try StarVLA-FAST (autoregressive discrete)
python -m starVLA.train --action_head fast ...

# Try StarVLA-PI (flow-matching diffusion)
python -m starVLA.train --action_head pi ...

# Try StarVLA-GR00T (dual-system)
python -m starVLA.train --action_head gr00t ...

Everything else stays the same — same data pipeline, same trainer. You only change how the model decodes actions.

Evaluation and Benchmarks

Two-Terminal Architecture

StarVLA uses a policy server + simulator client pattern to avoid dependency conflicts:

# Terminal 1: Run policy server (model inference)
bash examples/LIBERO/eval_files/run_policy_server.sh &

# Terminal 2: Run simulator environment
bash examples/LIBERO/eval_files/eval_libero.sh

This design has a key advantage: model serving and simulation run in separate processes, communicating via WebSocket. This means you can deploy the same model to both simulation and real robots without rewriting any serving code.
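
To illustrate the client-server exchange, here is a sketch of what the messages over such a socket could look like: the client sends an image plus instruction, the server replies with an action vector. The JSON field names are assumptions for explanation; StarVLA's actual wire format may differ.

```python
# Sketch of an observation/action exchange over a policy-server socket.
# Field names ("image", "instruction", "action") are illustrative assumptions.
import base64
import json

def encode_observation(image_bytes: bytes, instruction: str) -> str:
    """Client -> server: raw image bytes are base64-encoded to survive JSON."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "instruction": instruction,
    })

def decode_action(message: str) -> list:
    """Server -> client: a flat action vector (e.g. a 7-DoF target)."""
    return json.loads(message)["action"]

request = encode_observation(b"\x89PNG...", "pick up the red cup")
reply = json.dumps({"action": [0.1, 0.0, 0.2, 0.0, 0.0, 1.57, 1.0]})
print(decode_action(reply))  # [0.1, 0.0, 0.2, 0.0, 0.0, 1.57, 1.0]
```

Because both the simulator client and the real-robot client speak the same message format, the server side never has to know which one is on the other end.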

Benchmark Results

Here are the published results from the paper:

| Benchmark | Result | Notes |
|---|---|---|
| LIBERO | 96.6% | Average success rate |
| SimplerEnv | 71.4% | WidowX success rate |
| RoboCasa | 48.8% | Average success rate |
| RoboTwin 2.0 | 88.32 | Hard-set average score |

Notably, these results were achieved with minimal data engineering — just simple training recipes, no complex tricks. The framework matches or surpasses prior methods across multiple benchmarks.

Inference and Deployment

Running Inference in Simulation

# Start policy server with trained checkpoint
python -m starVLA.serve \
  --checkpoint ./checkpoints/libero-oft/best.pt \
  --backbone qwen3-vl-4b \
  --action_head oft \
  --port 8765

# Connect from simulator
python -m starVLA.eval.libero_client \
  --policy_url ws://localhost:8765 \
  --task_suite libero_spatial

Deploying to a Real Robot

StarVLA provides complete examples for the Franka Panda robot:

# Real robot deployment example; `camera` and `robot` stand in for your
# own hardware interfaces
from starVLA.deploy import PolicyClient

# Connect to policy server
client = PolicyClient("ws://localhost:8765")

# Control loop (set `done` from your own task/episode logic)
done = False
while not done:
    # Capture observation from camera
    image = camera.capture()
    
    # Send observation + instruction, receive action
    action = client.predict(
        image=image,
        instruction="pick up the red cup"
    )
    
    # Execute action on robot
    robot.execute(action)

The unified WebSocket interface is a major strength: write deployment code once, run it on both simulation and real hardware.

From simulation to real robot — StarVLA provides a unified deployment interface

Extending StarVLA — Custom Action Heads

One of StarVLA's greatest strengths is extensibility. Want to add a new action head? You just need to:

  1. Implement the action head module — inherit from the base class
  2. Register the module — add it to the registry
  3. Done — the framework automatically integrates it with the trainer and deployment stack

import torch.nn as nn
import torch.nn.functional as F

from starVLA.model.action_heads import BaseActionHead, register_head

@register_head("my_custom_head")
class MyCustomActionHead(BaseActionHead):
    def __init__(self, config):
        super().__init__(config)
        # Define layers
        self.action_mlp = nn.Sequential(
            nn.Linear(config.hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, config.action_dim)
        )
    
    def forward(self, backbone_features, **kwargs):
        # Decode action from backbone features
        return self.action_mlp(backbone_features)
    
    def compute_loss(self, pred_actions, gt_actions):
        return F.mse_loss(pred_actions, gt_actions)

Similarly, adding a new backbone only requires implementing a lightweight adapter.

Comparison with Other Frameworks

| Criteria | StarVLA | LeRobot | OpenVLA |
|---|---|---|---|
| Modular backbone | ✅ Free swapping | ❌ Fixed | ❌ Fixed |
| Multiple action heads | ✅ 4 paradigms | ✅ 2-3 paradigms | ❌ 1 paradigm |
| Integrated benchmarks | ✅ 5+ benchmarks | ✅ 3-4 | ❌ Self-setup |
| Real-robot deployment | ✅ WebSocket API | ✅ Native | ❌ Needs adapter |
| World model support | ✅ Cosmos | — | — |
| Community | 1.5k+ stars | 10k+ stars | 5k+ stars |

StarVLA doesn't replace LeRobot or OpenVLA — they serve different purposes. LeRobot excels at data collection and end-to-end training on real hardware (see our LeRobot hands-on guide). OpenVLA pioneered simple VLA design. StarVLA shines when you need to compare multiple architectures or develop new methods — it's a tool built for researchers.

Practical Tips

Where to start?

  1. Use StarVLA-OFT + Qwen3-VL-4B for your first run — fast inference, solid results
  2. Train on LIBERO first — small dataset, easy to debug
  3. Once comfortable, try StarVLA-GR00T — the dual-system architecture offers deeper insights

Pitfalls to avoid:

  • Don't install a FlashAttention build that mismatches your CUDA version: the resulting failures can be hard to diagnose
  • Batch size too large will cause OOM — start small and scale up
  • When comparing action heads, keep everything else constant (backbone, data, hyperparams)
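
The third tip can be enforced mechanically: generate one run config per action head from a single shared base, so only the head (and its output directory) varies. The config keys mirror the CLI flags shown earlier but are illustrative, not a StarVLA API.

```python
# Sketch: sweep action heads while holding everything else fixed, so the
# comparison stays fair. Keys mirror the CLI flags above; illustrative only.

BASE_CONFIG = {
    "backbone": "qwen3-vl-4b",
    "dataset": "libero",
    "batch_size": 16,
    "learning_rate": 1e-4,
    "num_epochs": 50,
}

def make_run_configs(action_heads):
    """One run per head; only `action_head` and `output_dir` differ."""
    runs = []
    for head in action_heads:
        cfg = dict(BASE_CONFIG)
        cfg["action_head"] = head
        cfg["output_dir"] = f"./checkpoints/libero-{head}"
        runs.append(cfg)
    return runs

for cfg in make_run_configs(["fast", "oft", "pi", "gr00t"]):
    print(cfg["action_head"], cfg["output_dir"])
```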

Summary

StarVLA represents an important step toward standardizing VLA research. Instead of every lab building their own framework, the community now has a shared foundation to build on, compare against, and reproduce results with. With its "Lego-like" modular design, you can focus on research ideas rather than writing boilerplate code.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
