
VLA-0: State-of-the-Art Robot VLA Without Architecture Changes

NVIDIA NVlabs shows that action-as-text reaches 94.7% on LIBERO, beating pi_0 and GR00T-N1 with zero architecture modification — just Qwen2.5-VL-3B.

Nguyễn Anh Tuấn · May 4, 2026 · 12 min read


The robot learning community is locked in an arms race of complexity. Every new Vision-Language-Action (VLA) paper adds another layer of engineering: custom action heads, diffusion decoders, flow matching modules, separate discrete tokenizers, specialized continuous action pipelines. Modern VLA codebases regularly exceed 10,000 lines of code.

Then NVIDIA NVlabs asked a different question:

"What happens if we change nothing?"

The answer is VLA-0 — a VLA that achieves state-of-the-art results on the LIBERO benchmark with a 94.7% average success rate, outperforming pi_0, GR00T-N1, OpenVLA-OFT, and SmolVLA. No custom action head. No diffusion decoder. Just Qwen2.5-VL-3B fine-tuned to predict actions as ordinary text. The entire codebase is roughly 1,200 lines — compared to 10,000+ in competing approaches.

Paper: VLA-0: Building State-of-the-Art VLAs with Zero Modification (Goyal, Hadfield, Yang, Blukis, Ramos — NVIDIA NVlabs). GitHub: NVlabs/vla0. Project page: vla0.github.io.


The Problem With Existing VLAs

To understand why VLA-0 matters, examine the foundational assumption baked into every preceding approach.

Robot actions are inherently continuous: end-effector coordinates, joint angles, forces, velocities... These values don't map to any word token in a language model's vocabulary. The community responded with two dominant strategies:

Strategy 1 — Discrete tokenization: OpenVLA bins continuous action values and adds them to the LLM vocabulary. Downsides: requires vocabulary resizing (touching the embedding layer), loses resolution with coarse bins, complicates training.

Strategy 2 — Continuous action head: pi_0 and GR00T-N1 attach a flow matching or diffusion head on top of the VLM output that handles the continuous output separately. Downsides: complex architecture, difficult training, large codebase.

Both strategies share the same hidden assumption: "VLMs can't handle continuous actions; we need specialized components."

VLA-0 directly challenges this assumption.


Core Idea: Action As Text

VLA-0 asks: What if we simply ask the VLM to predict actions as a sequence of integers?

The representation works like this:

  1. Normalize: Each dimension of the action vector is mapped to the integer range [0, 1000].
  2. Serialize: Values are written as a text string, e.g., "524 341 892 127 650 200".
  3. Predict: The VLM is fine-tuned to generate this string like any other text.
  4. Decode: The integers are parsed and denormalized back to the actual action space.

Concrete example:

Input:  [Camera frame] + "Pick up the red cup and place it on the plate"
Output: "524 341 892 127 483 671 201 890 ..."
         ↓ parse & denormalize ↓
Action: [0.524, 0.341, 0.892, 0.127, 0.483, ...] (joint targets)
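The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's actual API: `encode_action`/`decode_action` and the per-dimension `ACTION_MIN`/`ACTION_MAX` statistics are hypothetical names, and the real pipeline computes those statistics from the training data.

```python
import numpy as np

# Hypothetical per-dimension normalization statistics (computed from
# the training dataset in the real pipeline).
ACTION_MIN = np.array([-1.0, -1.0, -1.0, -1.0, -1.0])
ACTION_MAX = np.array([ 1.0,  1.0,  1.0,  1.0,  1.0])
BINS = 1000  # integer range [0, 1000]

def encode_action(action: np.ndarray) -> str:
    """Normalize each dimension to [0, BINS] and serialize as space-separated integers."""
    norm = (action - ACTION_MIN) / (ACTION_MAX - ACTION_MIN)
    ints = np.clip(np.round(norm * BINS), 0, BINS).astype(int)
    return " ".join(str(i) for i in ints)

def decode_action(text: str) -> np.ndarray:
    """Parse the generated string back to a continuous action vector."""
    ints = np.array([int(tok) for tok in text.split()], dtype=np.float64)
    return ints / BINS * (ACTION_MAX - ACTION_MIN) + ACTION_MIN

a = np.array([0.048, -0.318, 0.784, -0.746, 0.3])
s = encode_action(a)       # "524 341 892 127 650"
a_hat = decode_action(s)   # round-trip error bounded by half a bin width
```

Note the round-trip error is at most half a bin width per dimension, which is the discretization loss discussed later.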

Figure: Action as text — instead of adding a complex custom head, VLA-0 simply asks the VLM to predict integer sequences, which turns out to be surprisingly effective.

Why does this work? VLMs are pretrained on massive corpora that include code, mathematics, and numerical sequences. The ability to model integer patterns is already deeply embedded in the weights from pretraining. Robot manipulation just needs a fine-tuning signal that teaches the model the mapping from (image, instruction) → action sequence.

The key distinction from OpenVLA's discrete tokenization: VLA-0 adds no new tokens to the vocabulary and doesn't resize the embedding layer. The string "524" is composed of ordinary characters already in Qwen2.5-VL's vocabulary. Resolution is also freely adjustable — swap [0, 1000] for [0, 100] or [0, 10000] without touching architecture.


Architecture: Qwen2.5-VL-3B, Unmodified

VLA-0 uses Qwen2.5-VL-3B-Instruct as its backbone — a 3-billion-parameter VLM from Alibaba with strong image-and-text understanding.

Qwen2.5-VL-3B's architecture:

  • Vision Encoder: ViT-based, processes robot camera frames
  • Language Model: Qwen2.5-3B transformer decoder, autoregressively generates tokens
  • Visual-Language Bridge: A projector (patch-merger MLP) that maps visual features into the language model's token stream

And VLA-0 changes none of this:

  • ❌ No separate action head
  • ❌ No diffusion module
  • ❌ No flow matching
  • ❌ No custom tokenizer
  • ❌ No embedding layer resizing

Training is standard supervised fine-tuning with causal language modeling loss — identical to fine-tuning any instruction-following LLM, except the "text" targets happen to be action integer sequences. The loss is still cross-entropy over next-token prediction, unchanged.
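To make the "just SFT" point concrete, here is a toy sketch of how such a training sample could be assembled: the prompt tokens are masked out of the loss, and cross-entropy applies only to the action string. The helper names and the whitespace "tokenizer" are purely illustrative (the real model uses Qwen's BPE tokenizer and chat template).

```python
# Hypothetical illustration of standard SFT label masking for VLA-0-style data.
def build_sample(tokenize, prompt: str, action_text: str, ignore_index: int = -100):
    prompt_ids = tokenize(prompt)
    target_ids = tokenize(action_text)
    input_ids = prompt_ids + target_ids
    # No loss on the prompt; cross-entropy only on the action-string tokens.
    labels = [ignore_index] * len(prompt_ids) + target_ids
    return input_ids, labels

# Toy whitespace "tokenizer" just for demonstration.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

ids, labels = build_sample(toy_tokenize, "Pick up the red cup :", "524 341 892")
```

Everything downstream (optimizer, scheduler, mixed precision) is identical to ordinary instruction tuning.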


Action Chunking and Temporal Ensembling

A critical technique VLA-0 inherits is action chunking — instead of predicting one action per timestep, the model predicts a "chunk" of n consecutive future actions.

This technique originates from ACT (Action Chunking Transformer) and has been adopted by models like OpenVLA-OFT. The advantages are significant:

  • Temporal consistency: Actions within a chunk are jointly optimized, producing smoother trajectories
  • Long-horizon awareness: The model "sees" into the future and avoids myopic decisions
  • Temporal ensembling: Overlapping chunks are combined via weighted average (exponential decay), increasing stability

Concretely: at timestep t, the model predicts the chunk [a_t, a_{t+1}, ..., a_{t+n-1}]. The next inference at t+1 predicts an overlapping chunk. The weighted average across overlapping predictions determines the final action at each step.
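The ensembling step above can be sketched as follows. This is a minimal version under the stated scheme (exponentially decaying weights by prediction age); the function name and exact weighting are assumptions, not the repo's implementation.

```python
import numpy as np

def ensemble_action(chunks, t_chunks, t, alpha=0.5):
    """Combine overlapping chunk predictions for the current timestep t.

    chunks:   list of (n, action_dim) arrays, one per past inference
    t_chunks: timestep at which each chunk was predicted
    alpha:    exponential decay; a prediction made `age` steps ago gets weight alpha**age
    """
    preds, weights = [], []
    for chunk, t0 in zip(chunks, t_chunks):
        offset = t - t0
        if 0 <= offset < len(chunk):       # this chunk still covers timestep t
            preds.append(chunk[offset])
            weights.append(alpha ** offset)
    w = np.array(weights) / np.sum(weights)
    return np.sum(np.stack(preds) * w[:, None], axis=0)
```

Setting `alpha` closer to 0 trusts the freshest prediction almost exclusively; `alpha=1` is a plain average over all overlapping chunks.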


LIBERO Benchmark: The Test Bed

LIBERO is the standard VLA evaluation benchmark for robot manipulation, consisting of 4 suites each targeting a distinct capability:

Suite            What it tests             Example task
LIBERO-Spatial   Spatial reasoning         "Put the bowl to the left of the mug"
LIBERO-Object    Object recognition        "Pick up the black bowl"
LIBERO-Goal      Goal-directed behavior    "Stack the red block on the blue block"
LIBERO-Long      Multi-step long-horizon   Complete 3–4 sequential sub-tasks

Each suite has 10 tasks, each evaluated over 50 episodes — 2,000 total evaluation rollouts.

VLA-0 achieves a 94.7% average success rate across all LIBERO suites, outperforming every listed competitor:

  • SmolVLA — pretrained on large-scale real robot data
  • OpenVLA-OFT — VLA with discrete action tokenizer
  • pi_0 and pi_0.5-KI — flow matching VLAs from Physical Intelligence
  • GR00T-N1 — NVIDIA's own diffusion-based VLA
  • MolmoAct — Allen AI's VLA

Particularly striking: VLA-0 achieves this without large-scale robotics pretraining. It is fine-tuned only on the LIBERO dataset — yet beats models that consumed orders of magnitude more robotics data.


Environment Setup

Hardware Requirements

Component   Minimum           Recommended
GPU         RTX 3090 (24GB)   2× A100 (80GB)
RAM         32GB              64GB
Storage     50GB              100GB
CUDA        11.8+             12.1+

Step 1: Clone the Repository

# Clone with submodules — RoboVerse is a submodule, don't skip this flag
git clone --recurse-submodules git@github.com:NVlabs/vla0.git
cd vla0

The --recurse-submodules flag is mandatory. VLA-0 uses RoboVerse as a git submodule for the dataset loading pipeline.

Step 2: Create the Environment

conda create -n vla0 python=3.10 -y
conda activate vla0

# Install lerobot with LIBERO extras
pip install -e "libs/lerobot[libero]"

# Install the vla0 package
pip install -e .

Step 3: Download the LIBERO Dataset

LIBERO is available on HuggingFace Hub through lerobot:

# Download all 4 LIBERO suites (~20GB total)
python scripts/download_libero.py --all

# Or download individual suites
python scripts/download_libero.py --suite spatial
python scripts/download_libero.py --suite object
python scripts/download_libero.py --suite goal
python scripts/download_libero.py --suite long

Step 4: Verify Setup

python scripts/verify_setup.py

Training Configuration

VLA-0 uses YAML-based configuration. Create configs/my_vla0.yaml:

MODEL:
  name: "Qwen/Qwen2.5-VL-3B-Instruct"  # HuggingFace model ID
  freeze_vision_encoder: false            # Fine-tune end-to-end

TRAINING:
  batch_size: 8                  # Adjust for your VRAM
  learning_rate: 2.0e-5
  num_epochs: 50
  warmup_ratio: 0.05
  weight_decay: 0.01
  fp16: true                     # Mixed precision
  gradient_checkpointing: true   # Required for 24GB GPUs

ACTION:
  chunk_size: 16                 # Predict 16 actions per inference step
  action_bins: 1000              # Normalize actions to [0, 1000]
  ensemble_k: 5                  # Average 5 overlapping chunks

DATALOADER:
  ROBOVERSE:
    cfg_path: "libs/RoboVerse/roboverse/configs/img_libero_aug.yaml"
    num_workers: 4

LOGGING:
  output_dir: "checkpoints/my_vla0"
  save_every_n_epochs: 5
  eval_every_n_epochs: 10

Key hyperparameter notes:

  • action_bins: 1000 — integer range for action normalization. Increase to 10000 for higher precision; decrease to 256 for faster inference.
  • chunk_size: 16 — number of future actions per prediction. Increase for long-horizon tasks.
  • ensemble_k: 5 — number of overlapping predictions to average. Set to 1 to disable ensembling (faster inference, less stable).
  • gradient_checkpointing: true — required to fit the 3B model on a 24GB RTX 3090.

Running Training

# Train on all 4 LIBERO suites
python train.py --config configs/my_vla0.yaml

# Train on a single suite (faster iteration)
python train.py --config configs/my_vla0.yaml \
    DATALOADER.ROBOVERSE.suite=spatial

# Multi-GPU with torchrun (recommended for A100s)
torchrun --nproc_per_node=4 \
    train.py --config configs/my_vla0.yaml

Estimated training times:

Hardware       Dataset          Time
1× RTX 3090    LIBERO-Spatial   ~8 hours
1× A100 80GB   All LIBERO       ~6 hours
4× A100 80GB   All LIBERO       ~2 hours

Monitor training:

# TensorBoard
tensorboard --logdir checkpoints/my_vla0/logs --port 6006

# Weights & Biases
wandb login
python train.py --config configs/my_vla0.yaml LOGGING.wandb=true

Evaluation on LIBERO

After training, run the LIBERO evaluator:

# Evaluate all 4 suites, 50 episodes per task
python eval_libero.py \
    --checkpoint checkpoints/my_vla0/best.pt \
    --suite all \
    --num_episodes 50 \
    --render  # Optional: render the simulation

# Quick check — 1 suite, 10 episodes
python eval_libero.py \
    --checkpoint checkpoints/my_vla0/best.pt \
    --suite spatial \
    --num_episodes 10

Example output:

LIBERO-Spatial:  Success 46/50 = 92.0%
LIBERO-Object:   Success 48/50 = 96.0%
LIBERO-Goal:     Success 44/50 = 88.0%
LIBERO-Long:     Success 49/50 = 98.0%
Average:         Success 187/200 = 93.5%

Real Robot Inference

Integrating VLA-0 into a real robot control loop:

import torch
import numpy as np
from vla0 import VLA0

# Load trained model
model = VLA0.from_pretrained(
    "checkpoints/my_vla0/best.pt",
    device="cuda"
)
model.eval()

# Control loop — `robot` stands in for your hardware interface; adapt to your stack
max_steps = 500  # episode horizon
instruction = "Stack the red block on the blue block"
chunk_buffer = []  # Buffer for temporal ensembling

for timestep in range(max_steps):
    # Get camera image
    image = robot.get_camera_image()  # numpy (H, W, 3)
    
    # Predict new action chunk
    with torch.no_grad():
        action_chunk = model.predict(
            image=image,
            instruction=instruction,
            chunk_size=16
        )  # shape: (16, action_dim)
    
    # Temporal ensembling
    chunk_buffer.append(action_chunk)
    current_action = model.ensemble_actions(chunk_buffer, k=5)
    
    # Send to robot
    robot.execute_action(current_action)
    
    if robot.is_task_complete():
        break

Figure: Robot arm executing a real manipulation task. VLA-0 transfers to real hardware and outperforms SmolVLA, a model pretrained on large-scale real robot data.

Important deployment notes:

  • Camera images must be resized to the resolution used during training (typically 224×224 or 336×336).
  • Action space must use the same normalization statistics (min/max) as the training data.
  • With chunk_size=16 and a 10Hz robot, you only need inference once every 1.6 seconds — a manageable compute budget even on edge hardware.
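The first two deployment notes can be captured in a small preprocessing sketch. Everything here is illustrative: the resize routine stands in for your actual image pipeline (PIL, torchvision, etc.), and the stats dictionary stands in for the normalization statistics saved alongside your checkpoint.

```python
import numpy as np

def preprocess_image(frame: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Nearest-neighbor resize to the training resolution (stand-in for your
    real image pipeline, e.g. PIL or torchvision transforms)."""
    h, w = frame.shape[:2]
    rows = (np.arange(size[0]) * h / size[0]).astype(int)
    cols = (np.arange(size[1]) * w / size[1]).astype(int)
    return frame[rows][:, cols]

def denormalize(ints: np.ndarray, stats: dict, bins: int = 1000) -> np.ndarray:
    """Map predicted integers back to the robot's action space using the SAME
    min/max statistics that were used at training time."""
    lo, hi = np.array(stats["min"]), np.array(stats["max"])
    return ints / bins * (hi - lo) + lo

# Stats would normally be loaded from the checkpoint directory (path illustrative).
stats = {"min": [-1.0, -1.0, 0.0], "max": [1.0, 1.0, 1.0]}
action = denormalize(np.array([500, 250, 1000]), stats)  # [0.0, -0.5, 1.0]
```

A mismatch in either step (wrong resolution, or stats from a different dataset) is a common, silent cause of poor real-robot performance.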

Comparison With Other VLA Approaches

Aspect                   VLA-0           OpenVLA-OFT       pi_0             GR00T-N1
Action representation    Text integers   Discrete tokens   Flow matching    Diffusion
Architecture modified    ❌ None         ✅ Vocab resize   ✅ Action head   ✅ Diffusion head
Codebase (lines)         ~1,200          10,000+           10,000+          10,000+
LIBERO avg               94.7%           ~80%              ~86%             ~88%
Fine-tuning difficulty   Easy            Moderate          Hard             Hard
Needs large-scale data   No              Yes               Yes              Yes

Engineering Lessons

VLA-0 isn't just a benchmark result — it's an engineering philosophy lesson.

Lesson 1: Audit your assumptions The entire VLA community assumed continuous actions needed special handling. Nobody had tested this assumption rigorously. VLA-0 did exactly that and found it wrong. Before you engineer a solution, verify that the problem actually exists.

Lesson 2: Leverage existing capability Qwen2.5-VL already learned to model integer sequences from pretraining. VLA-0 doesn't teach this from scratch — it redirects an existing capability toward a new domain. Reuse beats rebuild.

Lesson 3: Complexity isn't virtue When a simpler approach beats a complex one, it usually signals overengineering, not innovation gap. VLA-0 is a reminder that simple solutions deserve fair comparison before we architect complexity.

Lesson 4: Validate transfer VLA-0 was tested on real hardware, not just simulation. This validates both the benchmark's utility and the method's practical applicability — a step many robotics papers skip.


Limitations and Future Work

VLA-0 is not perfect. Honest limitations to keep in mind:

  • Discretization loss: Normalizing to [0, 1000] loses some precision compared to fully continuous representations. For tasks requiring fine force control, this may matter.
  • Token sequence length: Each action dimension is a separate token. High-DOF robots (7-DOF+) produce long sequences → slower inference.
  • Limited real-world scale: Real hardware results are promising but not yet tested at production scale.
  • Camera dependency: VLA-0 uses RGB images. Noisy or low-quality cameras will degrade performance.
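The sequence-length concern is easy to quantify with a back-of-envelope estimate. The character count below is an upper bound; Qwen's BPE tokenizer may merge multi-digit runs, so actual token counts are somewhat lower.

```python
# Rough per-chunk generation cost for a 7-DOF robot with action-as-text.
chunk_size = 16       # actions per prediction
action_dim = 7        # e.g. 6-DoF end-effector pose + gripper
chars_per_value = 5   # up to 4 digits in [0, 1000] plus a separating space
approx_chars = chunk_size * action_dim * chars_per_value
print(approx_chars)   # 560 characters per chunk, generated autoregressively
```

Doubling `action_bins` to boost precision adds roughly one digit per value, so precision and decoding latency trade off directly.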

Natural extensions:

  • Multi-camera inputs (wrist + overhead)
  • Proprioception fusion (joint states + force/torque readings)
  • Scaling to Qwen2.5-VL-7B or 72B backbone
  • Cross-embodiment transfer to different robot platforms

Conclusion

VLA-0 is compelling evidence that the answer isn't always more engineering — sometimes it's less. By refusing to add custom components and instead representing actions as text, NVIDIA NVlabs produced the simplest and most effective VLA in its class.

If you're starting with VLA for your robot, VLA-0 is the right starting point: small codebase, easy to understand, easy to debug, and proven on real hardware. Start here, then add complexity only if you can demonstrate it helps.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
