If you've been following Embodied AI research, you've likely noticed a frustrating pattern: every new Vision-Language-Action (VLA) paper comes with its own codebase, its own data format, its own evaluation protocol. Want to fairly compare two methods? Prepare to spend weeks just setting up environments. StarVLA was built to solve exactly this problem — a modular "Lego-like" framework that lets you swap backbones and action heads like building blocks, all sharing the same trainer, dataloader, and deployment stack.
Presented at ICLR 2026, the project has already garnered over 1,500 GitHub stars within weeks of release. In this tutorial, we'll walk through everything from architecture understanding to installation, training, and inference with StarVLA.
Why StarVLA?
Before StarVLA, experimenting with VLA models meant dealing with:
- Fragmentation: OpenVLA, Octo, RT-2, π₀ — each with its own repo, training pipeline, and evaluation setup.
- Unfair comparisons: Different data pipelines, different preprocessing → results aren't comparable.
- Wasted time: Want to try a new action head on an existing backbone? Rewrite everything from scratch.
StarVLA solves this by cleanly separating the backbone (the "see and understand language" part) from the action head (the "decide what to do" part). If you're new to VLA concepts, check out our VLA Models overview first. These two components can be swapped independently — like changing a camera lens without replacing the camera body.
StarVLA Architecture
Backbone — The Vision-Language "Brain"
StarVLA supports two main backbone types:
1. VLM Backbone (Vision-Language Model)
- Qwen3-VL (0.8B, 2B, 4B, 9B) — latest backbone with strong multilingual support
- Qwen2.5-VL — stable, well-tested version
- InternVL — open-source alternative
- Florence-2 — lightweight backbone from Microsoft
2. World Model Backbone
- Cosmos (NVIDIA) — world model that predicts next states
- Enables the VLA to "imagine" action outcomes before execution
Action Head — The Decision-Making "Arm"
This is where things get interesting. StarVLA provides 4 distinct action head types, each representing a different paradigm in VLA:
| Action Head | Paradigm | Description |
|---|---|---|
| StarVLA-FAST | Autoregressive Discrete | Tokenizes actions into discrete tokens, decodes sequentially like a language model |
| StarVLA-OFT | Parallel Continuous | MLP head decodes continuous actions in parallel, fastest inference |
| StarVLA-PI | Flow-Matching Diffusion | Uses flow-matching to generate continuous actions, highest accuracy |
| StarVLA-GR00T | Dual-System | VLM as "System 2" (deliberation), Flow-Matching as "System 1" (reflexes) |
StarVLA-FAST is ideal when you want to leverage the autoregressive power of VLMs. StarVLA-OFT offers the fastest inference through parallel decoding. StarVLA-PI delivers the highest accuracy with continuous action spaces — if you want to dive deeper into flow-matching, see our Diffusion Policy tutorial. StarVLA-GR00T is the most advanced, combining the "slow thinking" of VLMs with the "fast reflexes" of flow-matching — inspired by Daniel Kahneman's Dual Process theory.
The key insight: all 4 action heads share the same trainer, dataloader, and deployment stack. You only need to change a config parameter to switch between them.
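The swap mechanism can be pictured as a simple registry keyed by a config string. The sketch below is illustrative only — the names (`HEAD_REGISTRY`, `build_action_head`, the `describe` method) are assumptions, not StarVLA's actual internals:

```python
# Minimal sketch of config-driven action-head swapping.
# All names here are illustrative, not StarVLA's real API.

HEAD_REGISTRY = {}

def register_head(name):
    """Decorator that maps a config string to a head class."""
    def wrapper(cls):
        HEAD_REGISTRY[name] = cls
        return cls
    return wrapper

@register_head("oft")
class OFTHead:
    def describe(self):
        return "parallel continuous decoding"

@register_head("fast")
class FASTHead:
    def describe(self):
        return "autoregressive discrete tokens"

def build_action_head(config):
    # One config key selects the paradigm; the trainer code never changes.
    return HEAD_REGISTRY[config["action_head"]]()

head = build_action_head({"action_head": "oft"})
print(head.describe())  # parallel continuous decoding
```

Because the trainer only ever calls the shared interface, swapping `"oft"` for `"fast"` in the config is the entire change.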
Installation
System Requirements
- Python 3.10+
- CUDA 12.0 or 12.4 (confirmed compatible)
- GPU: minimum 24GB VRAM (A5000, RTX 4090, A100)
- RAM: 32GB+
Step 1: Clone and Set Up Environment
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Create conda environment
conda create -n starvla python=3.10 -y
conda activate starvla
# Install dependencies
pip install -r requirements.txt
# FlashAttention2 — IMPORTANT: must match your CUDA version
pip install flash-attn --no-build-isolation
# Install StarVLA in development mode
pip install -e .
Important note: FlashAttention2 must be compatible with your CUDA toolkit and PyTorch versions. The framework has confirmed compatibility with flash-attn==2.7.4.post1 on CUDA 12.0 and 12.4.
Step 2: Download Pretrained Backbone
# Create model directory
mkdir -p playground/Pretrained_models
# Download Qwen3-VL-4B (recommended starting point)
huggingface-cli download Qwen/Qwen3-VL-4B-Instruct \
--local-dir playground/Pretrained_models/Qwen3-VL-4B-Instruct
Step 3: Verify Installation
# Quick test — run minimal forward pass
python starVLA/model/framework/QwenGR00T.py
If no errors appear, you're ready to go.
Training a VLA Model
Data Preparation
StarVLA uses a standardized data format. Each sample consists of:
- Observation: Camera images (supports multi-view)
- Language instruction: Task description (e.g., "pick up the red cup")
- Action: Corresponding action sequence (position, rotation, gripper state)
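One way to picture a single sample is as a small record with these three fields. The field names and the 7-DoF action layout below are assumptions for illustration, not StarVLA's exact schema:

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of one training sample; field names and the
# 7-DoF action layout are assumptions, not StarVLA's exact schema.
@dataclass
class VLASample:
    images: List[bytes]          # one entry per camera view
    instruction: str             # natural-language task description
    actions: List[List[float]]   # chunk of [x, y, z, roll, pitch, yaw, gripper]

sample = VLASample(
    images=[b"<wrist-cam jpeg bytes>", b"<front-cam jpeg bytes>"],
    instruction="pick up the red cup",
    actions=[[0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]],
)
print(len(sample.images), sample.instruction)  # 2 pick up the red cup
```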
The framework supports major datasets out of the box:
- Open X-Embodiment — largest multi-robot dataset
- LIBERO — popular manipulation benchmark
- RoboTwin 2.0 — digital twin benchmark
- RoboCasa — household interaction benchmark
Training on LIBERO (Concrete Example)
LIBERO is the best starting point — small dataset, easy to download, and quick to validate results.
# Train StarVLA-OFT on LIBERO
bash examples/LIBERO/train_files/run_libero_train.sh
The training script includes these key parameters:
# Key training parameters
python -m starVLA.train \
--backbone qwen3-vl-4b \
--action_head oft \
--dataset libero \
--batch_size 16 \
--learning_rate 1e-4 \
--num_epochs 50 \
--output_dir ./checkpoints/libero-oft
Key parameter explanations:
- --backbone: Choose the VLM backbone (qwen3-vl-4b, qwen2.5-vl-3b, ...)
- --action_head: Choose the paradigm (fast, oft, pi, gr00t)
- --dataset: Training dataset
- --batch_size: Depends on VRAM — 16 for an A100 80GB, 4-8 for an RTX 4090
Switching Between Action Heads
This is the real power of StarVLA. Want to try a different action head? Just change one line:
# Try StarVLA-FAST (autoregressive discrete)
python -m starVLA.train --action_head fast ...
# Try StarVLA-PI (flow-matching diffusion)
python -m starVLA.train --action_head pi ...
# Try StarVLA-GR00T (dual-system)
python -m starVLA.train --action_head gr00t ...
Everything else stays the same — same data pipeline, same trainer. You only change how the model decodes actions.
Evaluation and Benchmarks
Two-Terminal Architecture
StarVLA uses a policy server + simulator client pattern to avoid dependency conflicts:
# Terminal 1: Run policy server (model inference)
bash examples/LIBERO/eval_files/run_policy_server.sh &
# Terminal 2: Run simulator environment
bash examples/LIBERO/eval_files/eval_libero.sh
This design has a key advantage: model serving and simulation run in separate processes, communicating via WebSocket. This means you can deploy the same model to both simulation and real robots without rewriting any serving code.
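The exchange between the two processes can be sketched as JSON messages over the socket. The wire format below is an assumption for illustration — the real schema lives in StarVLA's serve and client code:

```python
import base64
import json

# Hypothetical wire format for the policy server; the actual schema
# is defined in StarVLA's serve/client code.
def encode_request(image_bytes, instruction):
    """Client side: pack an observation and instruction into one message."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "instruction": instruction,
    })

def decode_response(payload):
    """Client side: unpack the server's reply into an action chunk."""
    return json.loads(payload)["action"]

req = encode_request(b"\xff\xd8fake-jpeg", "pick up the red cup")
action = decode_response('{"action": [0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]}')
print(action)  # [0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]
```

Keeping the protocol this thin is what lets the same server sit behind a simulator today and a real robot tomorrow.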
Benchmark Results
Here are the published results from the paper:
| Benchmark | StarVLA-OFT | StarVLA-GR00T | Notes |
|---|---|---|---|
| LIBERO | 96.6% | — | Average success rate |
| SimplerEnv | — | 71.4% | WidowX success rate |
| RoboCasa | 48.8% | — | Average success |
| RoboTwin 2.0 | 88.32 | — | Hard average score |
Notably, these results were achieved with minimal data engineering — just simple training recipes, no complex tricks. The framework matches or surpasses prior methods across multiple benchmarks.
Inference and Deployment
Running Inference in Simulation
# Start policy server with trained checkpoint
python -m starVLA.serve \
--checkpoint ./checkpoints/libero-oft/best.pt \
--backbone qwen3-vl-4b \
--action_head oft \
--port 8765
# Connect from simulator
python -m starVLA.eval.libero_client \
--policy_url ws://localhost:8765 \
--task_suite libero_spatial
Deploying to a Real Robot
StarVLA provides complete examples for the Franka Panda robot:
# Real robot deployment example
from starVLA.deploy import PolicyClient

# Connect to policy server
client = PolicyClient("ws://localhost:8765")

# Control loop — camera, robot, and done come from your own robot stack
while not done:
    # Capture observation from camera
    image = camera.capture()
    # Send observation + instruction, receive action
    action = client.predict(
        image=image,
        instruction="pick up the red cup",
    )
    # Execute action on robot
    robot.execute(action)
The unified WebSocket interface is a major strength: write deployment code once, run it on both simulation and real hardware.
Extending StarVLA — Custom Action Heads
One of StarVLA's greatest strengths is extensibility. Want to add a new action head? You just need to:
- Implement the action head module — inherit from the base class
- Register the module — add it to the registry
- Done — the framework automatically integrates it with the trainer and deployment stack
import torch.nn as nn
import torch.nn.functional as F

from starVLA.model.action_heads import BaseActionHead, register_head

@register_head("my_custom_head")
class MyCustomActionHead(BaseActionHead):
    def __init__(self, config):
        super().__init__(config)
        # Define layers
        self.action_mlp = nn.Sequential(
            nn.Linear(config.hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, config.action_dim),
        )

    def forward(self, backbone_features, **kwargs):
        # Decode action from backbone features
        return self.action_mlp(backbone_features)

    def compute_loss(self, pred_actions, gt_actions):
        return F.mse_loss(pred_actions, gt_actions)
Similarly, adding a new backbone only requires implementing a lightweight adapter.
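Such an adapter might look like the sketch below. The base-class name, method signature, and dummy implementation are all assumptions for illustration — consult the repo for the real interface:

```python
from abc import ABC, abstractmethod

# Illustrative backbone-adapter interface; the class and method names
# are assumptions, not StarVLA's actual base class.
class BaseBackboneAdapter(ABC):
    @abstractmethod
    def encode(self, images, instruction):
        """Return fused vision-language features for the action head."""

class DummyBackbone(BaseBackboneAdapter):
    def __init__(self, hidden_dim=8):
        self.hidden_dim = hidden_dim

    def encode(self, images, instruction):
        # Stand-in: a real adapter would run the VLM's forward pass here.
        return [0.0] * self.hidden_dim

features = DummyBackbone().encode(images=[], instruction="pick up the red cup")
print(len(features))  # 8
```

As long as `encode` returns features in the shape the action heads expect, the rest of the stack needs no changes.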
Comparison with Other Frameworks
| Criteria | StarVLA | LeRobot | OpenVLA |
|---|---|---|---|
| Modular backbone | ✅ Free swapping | ❌ Fixed | ❌ Fixed |
| Multiple action heads | ✅ 4 paradigms | ✅ 2-3 paradigms | ❌ 1 paradigm |
| Integrated benchmarks | ✅ 5+ benchmarks | ✅ 3-4 | ❌ Self-setup |
| Real robot deploy | ✅ WebSocket API | ✅ Native | ❌ Needs adapter |
| World model support | ✅ Cosmos | ❌ | ❌ |
| Community | 1.5k+ stars | 10k+ stars | 5k+ stars |
StarVLA doesn't replace LeRobot or OpenVLA — they serve different purposes. LeRobot excels at data collection and end-to-end training on real hardware (see our LeRobot hands-on guide). OpenVLA pioneered simple VLA design. StarVLA shines when you need to compare multiple architectures or develop new methods — it's a tool built for researchers.
Practical Tips
Where to start?
- Use StarVLA-OFT + Qwen3-VL-4B for your first run — fast inference, solid results
- Train on LIBERO first — small dataset, easy to debug
- Once comfortable, try StarVLA-GR00T — the dual-system architecture offers deeper insights
Pitfalls to avoid:
- Don't use a FlashAttention build that doesn't match your CUDA version — mismatches typically fail at import time with cryptic errors
- Batch size too large will cause OOM — start small and scale up
- When comparing action heads, keep everything else constant (backbone, data, hyperparams)
Summary
StarVLA represents an important step toward standardizing VLA research. Instead of every lab building their own framework, the community now has a shared foundation to build on, compare against, and reproduce results with. With its "Lego-like" modular design, you can focus on research ideas rather than writing boilerplate code.
Resources:
- Paper: StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing — Jinhui Ye et al., arXiv 2026
- GitHub: github.com/starVLA/starVLA
- Documentation: starvla.github.io
- Model Zoo: Hugging Face (Qwen3-VL-4B-Action, Qwen2.5-VL-3B-Action)
Related Posts
- VLA Models — When Robots Understand Language — Overview of Vision-Language-Action models and how they work
- Diffusion Policy for Robot Manipulation — Understanding flow-matching and diffusion-based action generation
- SpatialVLA — Enhancing VLA with Spatial Awareness — VLA framework with advanced spatial understanding