VLA-0: State-of-the-Art Robot VLA Without Architecture Changes
The robot learning community is locked in an arms race of complexity. Every new Vision-Language-Action (VLA) paper adds another layer of engineering: custom action heads, diffusion decoders, flow matching modules, separate discrete tokenizers, specialized continuous action pipelines. Modern VLA codebases regularly exceed 10,000 lines of code.
Then NVIDIA NVlabs asked a different question:
"What happens if we change nothing?"
The answer is VLA-0 — a VLA that achieves state-of-the-art results on the LIBERO benchmark with a 94.7% average success rate, outperforming pi_0, GR00T-N1, OpenVLA-OFT, and SmolVLA. No custom action head. No diffusion decoder. Just Qwen2.5-VL-3B fine-tuned to predict actions as ordinary text. The entire codebase is roughly 1,200 lines — compared to 10,000+ in competing approaches.
Paper: VLA-0: Building State-of-the-Art VLAs with Zero Modification (Goyal, Hadfield, Yang, Blukis, Ramos — NVIDIA NVlabs). GitHub: NVlabs/vla0. Project page: vla0.github.io.
The Problem With Existing VLAs
To understand why VLA-0 matters, examine the foundational assumption baked into every preceding approach.
Robot actions are inherently continuous: end-effector coordinates, joint angles, forces, velocities, and so on. These continuous values have no direct counterpart in a language model's token vocabulary. The community responded with two dominant strategies:
Strategy 1 — Discrete tokenization: OpenVLA bins continuous action values and adds them to the LLM vocabulary. Downsides: requires vocabulary resizing (touching the embedding layer), loses resolution with coarse bins, complicates training.
Strategy 2 — Continuous action head: pi_0 and GR00T-N1 attach a flow matching or diffusion head on top of the VLM output that handles the continuous output separately. Downsides: complex architecture, difficult training, large codebase.
Both strategies share the same hidden assumption: "VLMs can't handle continuous actions; we need specialized components."
VLA-0 directly challenges this assumption.
Core Idea: Action As Text
VLA-0 asks: What if we simply ask the VLM to predict actions as a sequence of integers?
The representation works like this:
- Normalize: Each dimension of the action vector is mapped to the integer range [0, 1000].
- Serialize: Values are written as a text string, e.g., "524 341 892 127 650 200".
- Predict: The VLM is fine-tuned to generate this string like any other text.
- Decode: The integers are parsed and denormalized back to the actual action space.
Concrete example:
Input: [Camera frame] + "Pick up the red cup and place it on the plate"
Output: "524 341 892 127 483 671 201 890 ..."
↓ parse & denormalize ↓
Action: [0.524, 0.341, 0.892, 0.127, 0.483, ...] (joint targets)
Action as text: instead of adding a complex custom head, VLA-0 simply asks the VLM to predict integer sequences — surprisingly effective
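To make the four steps concrete, here is a minimal NumPy sketch of the encode/decode round trip. The helper names (encode_actions, decode_actions) and the per-dimension min/max limits are illustrative assumptions, not the repository's actual API:

import numpy as np

# Illustrative helpers for the action-as-text idea; names and the per-dimension
# min/max statistics are assumptions, not the official VLA-0 interface.
NUM_BINS = 1000  # actions are mapped to integers in [0, 1000]

def encode_actions(actions, low, high):
    """Normalize a (chunk, dim) array to [0, 1000] integers and serialize to text."""
    norm = (actions - low) / (high - low)          # -> [0, 1]
    ints = np.clip(np.round(norm * NUM_BINS), 0, NUM_BINS).astype(int)
    return " ".join(str(v) for v in ints.flatten())

def decode_actions(text, low, high, action_dim):
    """Parse the generated integer string back into continuous actions."""
    ints = np.array([int(tok) for tok in text.split()], dtype=np.float64)
    return ints.reshape(-1, action_dim) / NUM_BINS * (high - low) + low

# Example: a 2-step chunk for a 3-dim action space with limits [-1, 1]
low, high = np.array([-1.0] * 3), np.array([1.0] * 3)
chunk = np.array([[0.05, -0.30, 0.78],
                  [0.06, -0.28, 0.80]])
text = encode_actions(chunk, low, high)          # "525 350 890 530 360 900"
recovered = decode_actions(text, low, high, action_dim=3)

Swapping NUM_BINS for 100 or 10000 changes the resolution without touching anything else, which is exactly the flexibility discussed below.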
Why does this work? VLMs are pretrained on massive corpora that include code, mathematics, and numerical sequences. The ability to model integer patterns is already deeply embedded in the weights from pretraining. Robot manipulation just needs a fine-tuning signal that teaches the model the mapping from (image, instruction) → action sequence.
The key distinction from OpenVLA's discrete tokenization: VLA-0 adds no new tokens to the vocabulary and doesn't resize the embedding layer. The string "524" is composed of ordinary characters already in Qwen2.5-VL's vocabulary. Resolution is also freely adjustable — swap [0, 1000] for [0, 100] or [0, 10000] without touching architecture.
Architecture: Qwen2.5-VL-3B, Unmodified
VLA-0 uses Qwen2.5-VL-3B-Instruct as its backbone — a 3-billion-parameter VLM from Alibaba with strong image-and-text understanding.
Qwen2.5-VL-3B's architecture:
- Vision Encoder: ViT-based, processes robot camera frames
- Language Model: Qwen2.5-3B transformer decoder, autoregressively generates tokens
- Visual-Language Bridge: An MLP-based merger that projects visual patch features into the language model's token stream
And VLA-0 changes none of this:
- ❌ No separate action head
- ❌ No diffusion module
- ❌ No flow matching
- ❌ No custom tokenizer
- ❌ No embedding layer resizing
Training is standard supervised fine-tuning with causal language modeling loss — identical to fine-tuning any instruction-following LLM, except the "text" targets happen to be action integer sequences. The loss is still cross-entropy over next-token prediction, unchanged.
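To emphasize how ordinary the objective is, here is a minimal sketch of that masked next-token cross-entropy with stand-in tensors. The vocabulary size, prompt length, and -100 ignore-index masking follow standard causal-LM fine-tuning conventions and are illustrative, not lifted from the VLA-0 training code:

import torch
import torch.nn.functional as F

# Stand-in tensors: in real training, `logits` come from the VLM and `labels`
# are the tokenized prompt + action string. Values here are illustrative only.
vocab_size, seq_len = 151_936, 64              # roughly Qwen2.5-scale vocab, toy length
logits = torch.randn(1, seq_len, vocab_size)   # model output (stand-in)
labels = torch.randint(0, vocab_size, (1, seq_len))

prompt_len = 40                  # image + instruction tokens carry no loss
labels[:, :prompt_len] = -100    # ignore_index masks the prompt positions

# Shift so position t predicts token t+1, exactly as in ordinary LLM SFT
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)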
Action Chunking and Temporal Ensembling
A critical technique VLA-0 inherits is action chunking — instead of predicting one action per timestep, the model predicts a "chunk" of n consecutive future actions.
This technique originates from ACT (Action Chunking Transformer) and has been adopted by models like OpenVLA-OFT. The advantages are significant:
- Temporal consistency: Actions within a chunk are jointly optimized, producing smoother trajectories
- Long-horizon awareness: The model "sees" several steps into the future and avoids myopic decisions
- Temporal ensembling: Overlapping chunks are combined via weighted average (exponential decay), increasing stability
Concretely: at timestep t, the model predicts the chunk [a_t, a_{t+1}, ..., a_{t+n-1}]. The next inference at t+1 predicts an overlapping chunk. The weighted average across overlapping predictions determines the final action at each step.
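A compact sketch of that ensembling step, assuming each chunk is stored with the timestep at which it was predicted; the exponential weighting direction and decay constant are illustrative choices in the spirit of ACT, not VLA-0's exact implementation:

import numpy as np

def ensemble_action(chunk_buffer, t, decay=0.01):
    """Combine overlapping chunk predictions for timestep t.

    chunk_buffer: list of (start_step, chunk) pairs, where chunk has shape
    (chunk_size, action_dim). Predictions of step t made further in the past
    are down-weighted with an exponential factor (illustrative scheme).
    """
    preds, weights = [], []
    for start, chunk in chunk_buffer:
        offset = t - start                     # how far into this chunk step t lies
        if 0 <= offset < len(chunk):
            preds.append(chunk[offset])
            weights.append(np.exp(-decay * offset))
    if not preds:
        raise ValueError(f"no chunk in the buffer covers timestep {t}")
    weights = np.array(weights) / np.sum(weights)
    return np.average(np.stack(preds), axis=0, weights=weights)

# At each control step: append (t, model_prediction) to chunk_buffer,
# then execute ensemble_action(chunk_buffer, t).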
LIBERO Benchmark: The Test Bed
LIBERO is the standard VLA evaluation benchmark for robot manipulation, consisting of 4 suites each targeting a distinct capability:
| Suite | What it tests | Example task |
|---|---|---|
| LIBERO-Spatial | Spatial reasoning | "Put the bowl to the left of the mug" |
| LIBERO-Object | Object recognition | "Pick up the black bowl" |
| LIBERO-Goal | Goal-directed behavior | "Stack the red block on the blue block" |
| LIBERO-Long | Multi-step long-horizon | Complete 3–4 sequential sub-tasks |
Each suite has 10 tasks, each evaluated over 50 episodes — 2,000 total evaluation rollouts.
VLA-0 achieves a 94.7% average success rate across all LIBERO suites, outperforming every listed competitor:
- SmolVLA — pretrained on large-scale real robot data
- OpenVLA-OFT — VLA with discrete action tokenizer
- pi_0 and pi_0.5-KI — flow matching VLAs from Physical Intelligence
- GR00T-N1 — NVIDIA's own diffusion-based VLA
- MolmoAct — Allen AI's VLA
Particularly striking: VLA-0 achieves this without large-scale robotics pretraining. It is fine-tuned only on the LIBERO dataset — yet beats models that consumed orders of magnitude more robotics data.
Environment Setup
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24GB) | 2× A100 (80GB) |
| RAM | 32GB | 64GB |
| Storage | 50GB | 100GB |
| CUDA | 11.8+ | 12.1+ |
Step 1: Clone the Repository
# Clone with submodules — RoboVerse is a submodule, don't skip this flag
git clone --recurse-submodules [email protected]:NVlabs/vla0.git
cd vla0
The --recurse-submodules flag is mandatory. VLA-0 uses RoboVerse as a git submodule for the dataset loading pipeline.
Step 2: Create the Environment
conda create -n vla0 python=3.10 -y
conda activate vla0
# Install lerobot with LIBERO extras
pip install -e "libs/lerobot[libero]"
# Install the vla0 package
pip install -e .
Step 3: Download the LIBERO Dataset
LIBERO is available on HuggingFace Hub through lerobot:
# Download all 4 LIBERO suites (~20GB total)
python scripts/download_libero.py --all
# Or download individual suites
python scripts/download_libero.py --suite spatial
python scripts/download_libero.py --suite object
python scripts/download_libero.py --suite goal
python scripts/download_libero.py --suite long
Step 4: Verify Setup
python scripts/verify_setup.py
Training Configuration
VLA-0 uses YAML-based configuration. Create configs/my_vla0.yaml:
MODEL:
  name: "Qwen/Qwen2.5-VL-3B-Instruct"   # HuggingFace model ID
  freeze_vision_encoder: false          # Fine-tune end-to-end
TRAINING:
  batch_size: 8                         # Adjust for your VRAM
  learning_rate: 2.0e-5
  num_epochs: 50
  warmup_ratio: 0.05
  weight_decay: 0.01
  fp16: true                            # Mixed precision
  gradient_checkpointing: true          # Required for 24GB GPUs
ACTION:
  chunk_size: 16                        # Predict 16 actions per inference step
  action_bins: 1000                     # Normalize actions to [0, 1000]
  ensemble_k: 5                         # Average 5 overlapping chunks
DATALOADER:
  ROBOVERSE:
    cfg_path: "libs/RoboVerse/roboverse/configs/img_libero_aug.yaml"
    num_workers: 4
LOGGING:
  output_dir: "checkpoints/my_vla0"
  save_every_n_epochs: 5
  eval_every_n_epochs: 10
Key hyperparameter notes:
- action_bins: 1000 — integer range for action normalization. Increase to 10000 for higher precision; decrease to 256 for faster inference.
- chunk_size: 16 — number of future actions per prediction. Increase for long-horizon tasks.
- ensemble_k: 5 — number of overlapping predictions to average. Set to 1 to disable ensembling (faster inference, less stable).
- gradient_checkpointing: true — required to fit the 3B model on a 24GB RTX 3090.
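As a quick sanity check on the action_bins trade-off, the quantization step size per bin count can be computed directly; the 1 m workspace range used here is an illustrative assumption, not a LIBERO statistic:

# Worst-case rounding error for one action dimension as a function of action_bins.
# The 1.0 m workspace range is an illustrative assumption.
workspace_range_m = 1.0
for bins in (256, 1000, 10000):
    step = workspace_range_m / bins   # size of one quantization bin
    print(f"{bins:>5} bins -> step {step * 1000:.2f} mm, max error {step / 2 * 1000:.2f} mm")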
Running Training
# Train on all 4 LIBERO suites
python train.py --config configs/my_vla0.yaml
# Train on a single suite (faster iteration)
python train.py --config configs/my_vla0.yaml \
DATALOADER.ROBOVERSE.suite=spatial
# Multi-GPU with torchrun (recommended for A100s)
torchrun --nproc_per_node=4 \
train.py --config configs/my_vla0.yaml
Estimated training times:
| Hardware | Dataset | Time |
|---|---|---|
| 1× RTX 3090 | LIBERO-Spatial | ~8 hours |
| 1× A100 80GB | All LIBERO | ~6 hours |
| 4× A100 80GB | All LIBERO | ~2 hours |
Monitor training:
# TensorBoard
tensorboard --logdir checkpoints/my_vla0/logs --port 6006
# Weights & Biases
wandb login
python train.py --config configs/my_vla0.yaml LOGGING.wandb=true
Evaluation on LIBERO
After training, run the LIBERO evaluator:
# Evaluate all 4 suites, 50 episodes per task
python eval_libero.py \
--checkpoint checkpoints/my_vla0/best.pt \
--suite all \
--num_episodes 50 \
--render # Optional: render the simulation
# Quick check — 1 suite, 10 episodes
python eval_libero.py \
--checkpoint checkpoints/my_vla0/best.pt \
--suite spatial \
--num_episodes 10
Example output:
LIBERO-Spatial: Success 46/50 = 92.0%
LIBERO-Object: Success 48/50 = 96.0%
LIBERO-Goal: Success 44/50 = 88.0%
LIBERO-Long: Success 49/50 = 98.0%
Average: Success 187/200 = 93.5%
Real Robot Inference
Integrating VLA-0 into a real robot control loop:
import torch
import numpy as np
from vla0 import VLA0

# Load the trained model
model = VLA0.from_pretrained(
    "checkpoints/my_vla0/best.pt",
    device="cuda"
)
model.eval()

# Control loop ("robot" is a placeholder for your own hardware interface)
instruction = "Stack the red block on the blue block"
chunk_buffer = []   # Buffer for temporal ensembling
max_steps = 500     # Episode step budget (illustrative; tune for your task)

for timestep in range(max_steps):
    # Get camera image
    image = robot.get_camera_image()   # numpy (H, W, 3)

    # Predict a new action chunk
    with torch.no_grad():
        action_chunk = model.predict(
            image=image,
            instruction=instruction,
            chunk_size=16
        )                               # shape: (16, action_dim)

    # Temporal ensembling over overlapping chunks
    chunk_buffer.append(action_chunk)
    current_action = model.ensemble_actions(chunk_buffer, k=5)

    # Send to robot
    robot.execute_action(current_action)

    if robot.is_task_complete():
        break
VLA-0 transfers to real hardware and outperforms SmolVLA — a model pretrained on large-scale real robot data
Important deployment notes:
- Camera images must be resized to the resolution used during training (typically 224×224 or 336×336).
- Action space must use the same normalization statistics (min/max) as the training data.
- With chunk_size=16 and a 10 Hz control rate, you only need to run inference once every 1.6 seconds — a manageable compute budget even on edge hardware.
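A minimal preprocessing sketch covering the first two notes above; the stats file path, its JSON layout, and the 224×224 resolution are assumptions for illustration, not artifacts the repository is guaranteed to produce:

import json
import numpy as np
from PIL import Image

# Hypothetical deployment helpers: resize frames to the training resolution and
# denormalize predicted integers with the training min/max statistics.
TRAIN_RESOLUTION = (224, 224)   # assumed training resolution

with open("checkpoints/my_vla0/action_stats.json") as f:   # assumed location/format
    stats = json.load(f)
low, high = np.array(stats["min"]), np.array(stats["max"])

def preprocess_image(frame: np.ndarray) -> Image.Image:
    """Resize a raw camera frame to the resolution used during training."""
    return Image.fromarray(frame).resize(TRAIN_RESOLUTION, Image.BILINEAR)

def denormalize(ints: np.ndarray) -> np.ndarray:
    """Map predicted integers in [0, 1000] back to the robot's action ranges."""
    return ints / 1000.0 * (high - low) + low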
Comparison With Other VLA Approaches
| Aspect | VLA-0 | OpenVLA-OFT | pi_0 | GR00T-N1 |
|---|---|---|---|---|
| Action representation | Text integers | Discrete tokens | Flow matching | Diffusion |
| Architecture modified | ❌ None | ✅ Vocab resize | ✅ Action head | ✅ Diffusion head |
| Codebase (lines) | ~1,200 | 10,000+ | 10,000+ | 10,000+ |
| LIBERO avg | 94.7% | ~80% | ~86% | ~88% |
| Fine-tuning difficulty | Easy | Moderate | Hard | Hard |
| Needs large-scale data | ❌ | ❌ | ✅ | ✅ |
Engineering Lessons
VLA-0 isn't just a benchmark result — it's an engineering philosophy lesson.
Lesson 1: Audit your assumptions. The entire VLA community assumed continuous actions needed special handling, yet nobody had tested this assumption rigorously. VLA-0 did exactly that and found it wrong. Before you engineer a solution, verify that the problem actually exists.
Lesson 2: Leverage existing capability. Qwen2.5-VL already learned to model integer sequences from pretraining. VLA-0 doesn't teach this from scratch — it redirects an existing capability toward a new domain. Reuse beats rebuild.
Lesson 3: Complexity isn't virtue. When a simpler approach beats a complex one, it usually signals overengineering, not an innovation gap. VLA-0 is a reminder that simple solutions deserve fair comparison before we architect complexity.
Lesson 4: Validate transfer. VLA-0 was tested on real hardware, not just simulation. This validates both the benchmark's utility and the method's practical applicability — a step many robotics papers skip.
Limitations and Future Work
VLA-0 is not perfect. Honest limitations to keep in mind:
- Discretization loss: Normalizing to [0, 1000] loses some precision compared to fully continuous representations. For tasks requiring fine force control, this may matter.
- Token sequence length: Each action dimension is written out as its own integer, so high-DOF robots (7-DOF+) and long chunks produce long output sequences → slower inference.
- Limited real-world scale: Real hardware results are promising but not yet tested at production scale.
- Camera dependency: VLA-0 uses RGB images. Noisy or low-quality cameras will degrade performance.
Natural extensions:
- Multi-camera inputs (wrist + overhead)
- Proprioception fusion (joint states + force/torque readings)
- Scaling to Qwen2.5-VL-7B or 72B backbone
- Cross-embodiment transfer to different robot platforms
Conclusion
VLA-0 is compelling evidence that the answer isn't always more engineering — sometimes it's less. By refusing to add custom components and instead representing actions as text, NVIDIA NVlabs produced the simplest and most effective VLA in its class.
If you're starting with VLA for your robot, VLA-0 is the right starting point: small codebase, easy to understand, easy to debug, and proven on real hardware. Start here, then add complexity only if you can demonstrate it helps.