
X-VLA ICLR 2026: Soft-Prompted 0.9B VLA on LeRobot

X-VLA tutorial — flow-matching 0.9B VLA hitting SOTA on 6 sims + 3 real robots, native LeRobot, open-source on HuggingFace.

Nguyễn Anh Tuấn · May 20, 2026 · 10 min read
Vision-Language-Action (VLA) research is moving in two directions: scaling up to billions of parameters (RT-2, OpenVLA 7B, π0 3B) or scaling down to consumer GPUs (VLA-Adapter 0.5B). Both directions skip an important question: how do you train a single model that runs across many different robots (Franka, WidowX, Google Robot, AgileX bimanual) without fine-tuning it separately for each one?

This is exactly what X-VLA (ICLR 2026, paper arXiv:2510.10274) solves. With just 0.9B parameters, X-VLA achieves SOTA on 6 simulation benchmarks + 3 real robots, won the AgiBot World Challenge at IROS 2025, and most importantly: it's been natively integrated into LeRobot — a single line policy.type=xvla is enough to train.

In this tutorial I'll walk you from the "soft prompt" idea → flow-matching architecture → LeRobot installation → training on your dataset → inference on a real robot. Beginner-friendly, no heavy Transformer background required.

X-VLA soft-prompted cross-embodiment vision-language-action model

1. Why this paper matters

The cross-embodiment problem

Imagine you have 3 robots: a 7-DOF Franka, a 6-DOF WidowX, and a pair of AgileX bimanual arms. Each robot has:

  • Different action space — 7 joints vs 6 joints vs 14 joints
  • Different camera setup — 1 wrist + 1 third-person vs 2 wrist vs 3 cameras
  • Different gripper — parallel jaw vs underactuated vs custom

The old way (OpenVLA, π0): train separately for each embodiment, or try to tokenize actions as text and let the LLM figure it out — but quality is uneven, and every new robot needs heavy fine-tuning.

X-VLA's answer: use a "soft prompt", a learnable set of embeddings per robot type, and share the Transformer backbone across all of them. Just as one LLM handles different tasks with different prompts, X-VLA uses one Transformer with a different prompt for each robot.

Striking results

| Benchmark | Embodiment | X-VLA score | Comparison |
|---|---|---|---|
| LIBERO (4 suites) | Franka | 98.1% | π0: ~97%, OpenVLA: 76.5% |
| SimplerEnv WidowX | WidowX | 95.8% | RT-1-X: 64% |
| Google Robot (VM) | Google Robot | 83.5% | OpenVLA: 71% |
| CALVIN (ABCD→D) | Franka | 4.43/5 | RoboFlamingo: 3.49 |
| RoboTwin2 | AgileX bimanual | 70% | π0-FAST: 58% |

More importantly, X-VLA-LIBERO achieves near-π0 performance while training roughly 300× fewer parameters: adapting to a new embodiment only updates the soft prompts (~9M params), whereas fine-tuning π0 updates its entire 3B backbone.

2. X-VLA architecture

Flow-matching overview

X-VLA doesn't use DDPM-style diffusion, nor autoregressive tokens like RT-2. It uses flow matching — a generative model family that learns the vector field between noise and data, generating 32-step action chunks in a few denoising steps.
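
To make "a few denoising steps" concrete, here is a minimal sketch of flow-matching sampling with Euler integration. It is a toy and not X-VLA's implementation: the `velocity` function stands in for the learned network and uses the closed-form field for one known target, and the names `velocity`, `sample_chunk`, and `num_steps` are assumptions made for illustration.

```python
import random

def velocity(x, t, target):
    """Toy stand-in for the learned vector field.

    For a straight-line path from noise to one known target, the exact
    velocity at (x, t) is (target - x) / (1 - t). A real model predicts
    this from observations instead of being handed the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_chunk(target, num_steps=5, seed=0):
    """Integrate from Gaussian noise (t=0) toward an action (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, target)
        # One Euler step along the vector field.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target_action = [0.1, -0.2, 0.3]
result = sample_chunk(target_action, num_steps=5)
```

Because this toy field is exact, any number of steps lands on the target. With a learned field, fewer steps trade some accuracy for speed, which is the knob referred to later as num_inference_steps.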

The main pipeline:

[RGB images] → Vision encoder ────┐
[Language instruction] → Text enc ┤
[Proprioceptive state] → MLP ─────┤── Transformer (24 layers, 1024 dim)
[Soft prompt embeddings] ─────────┤              │
[Domain ID embedding] ────────────┘              ▼
                                         Flow-matching head
                                                 │
                                                 ▼
                                    Action chunk (32 steps × 20-D)

Soft prompt mechanism

Each embodiment has a set of 32 learnable embedding vectors (dim 1024). During training, the model learns these vectors as a "preamble" that gets prepended to the Transformer input. For a new robot:

  • Phase I (pretrain): Train the backbone + all soft prompts on 290K episodes from 7 platforms
  • Phase II (adapt): Freeze the backbone, only train a new soft prompt (~9M params) for the new embodiment

This is the big difference from LoRA: LoRA injects low-rank updates into weights, while soft prompts inject embeddings into inputs — simpler, lighter, and well-validated by NLP prompt tuning work (Prefix Tuning, P-Tuning v2).
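
The prepending step can be sketched in a few lines. This is an illustration under stated assumptions, not the X-VLA code: `prompt_bank`, `init_prompt`, and `build_input` are hypothetical names, plain Python lists stand in for tensors, and only the sizes (32 vectors, dim 1024) come from the text.

```python
import random

NUM_PROMPT_TOKENS = 32   # from the text: 32 vectors per embodiment
HIDDEN_DIM = 1024        # from the text: dim 1024

def init_prompt(seed):
    """Create one randomly initialised soft prompt (32 x 1024)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
            for _ in range(NUM_PROMPT_TOKENS)]

# One prompt per embodiment, keyed by a hypothetical domain id.
prompt_bank = {0: init_prompt(0), 1: init_prompt(1)}

def build_input(domain_id, obs_tokens):
    """Prepend the embodiment's soft prompt to the observation tokens."""
    return prompt_bank[domain_id] + obs_tokens

obs_tokens = [[0.0] * HIDDEN_DIM for _ in range(10)]  # dummy fused tokens
sequence = build_input(1, obs_tokens)
print(len(sequence), len(sequence[0]))
```

Adding a robot means adding one entry to the bank while the backbone stays untouched. Note that 32 × 1024 is 32,768 values, so the ~9M figure quoted in this post presumably counts other per-embodiment parameters as well; that breakdown is not given here.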

Unified EE6D action space

X-VLA uses EE6D (End-Effector 6D) as the canonical action space: 3D position + 6D rotation representation + gripper signal + padding = 20 dimensions. Every other robot (joint-space 7-DOF, bimanual 14-DOF) gets mapped into this 20-D space via an Action Registry — if the robot has fewer dimensions, the rest is zero-padded and ignored in the loss.

This allows a single forward pass to handle any embodiment, differing only in soft prompt + domain ID.
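
The "zero-padded and ignored in the loss" part can be sketched as follows. This covers only padding and masking, not the conversion into EE6D that the Action Registry performs; `pad_action` and `masked_mse` are hypothetical helper names, and mean squared error is used here only as a simple stand-in loss.

```python
ACTION_DIM = 20  # canonical width stated in the text

def pad_action(action, width=ACTION_DIM):
    """Zero-pad a robot-specific action and return (padded, mask)."""
    if len(action) > width:
        raise ValueError(f"action has {len(action)} dims, max is {width}")
    pad = width - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1.0] * len(action) + [0.0] * pad
    return padded, mask

def masked_mse(pred, target, mask):
    """Mean squared error over the valid (mask == 1) dimensions only."""
    valid = sum(mask)
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / valid

target, mask = pad_action([0.5] * 7)   # e.g. a 7-D action
pred = [0.4] * 7 + [9.9] * 13          # arbitrary values in the padded slots
loss = masked_mse(pred, target, mask)
```

The large values in the padded slots do not affect `loss`, which is the property that lets robots with different action widths share one output head.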

Real robot arm controlled by a VLA model

3. Installing LeRobot with X-VLA

Hardware requirements

| Purpose | Minimum GPU | Recommended |
|---|---|---|
| Inference only | RTX 3060 12GB | RTX 4090 |
| Soft-prompt fine-tune | RTX 4090 24GB | A100 40GB |
| Full pretrain (290K episodes) | 8× A100 80GB | 8× H100 |

For beginners: an RTX 4090, or an A100 rented on Vast.ai (~$1/hour), is enough to fine-tune soft prompts for your own task.

Environment setup

# Create conda env
conda create -n xvla python=3.10 -y
conda activate xvla

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Install LeRobot with X-VLA extras
pip install -e ".[xvla]"

# Verify
python -c "from lerobot.policies.xvla import XVLAPolicy; print('OK')"

Load a pretrained checkpoint

X-VLA ships several HuggingFace checkpoints:

| Checkpoint | Description | Use case |
|---|---|---|
| lerobot/xvla-base | 0.9B pretrained on 290K episodes | Fine-tune for new tasks |
| lerobot/xvla-libero | Fine-tuned on LIBERO (98.1%) | Evaluate LIBERO immediately |
| lerobot/xvla-widowx | WidowX pick-and-place | SimplerEnv demo |
| lerobot/xvla-folding | Cloth folding 100% | Hard bimanual task |
| lerobot/xvla-agibot-world | AgileX dexterous | General bimanual |
| lerobot/xvla-google-robot | Google Robot RT-1 setup | Cross-domain demo |

Quick inference test:

from lerobot.policies.xvla import XVLAPolicy
import torch

# Load model
policy = XVLAPolicy.from_pretrained("lerobot/xvla-base")
policy = policy.to("cuda").eval()

# Dummy observation
obs = {
    "observation.images.primary": torch.randn(1, 3, 224, 224).cuda(),
    "observation.images.wrist": torch.randn(1, 3, 224, 224).cuda(),
    "observation.state": torch.randn(1, 7).cuda(),
    "task": ["pick up the red block"],
    "domain_id": torch.tensor([0]).cuda(),
}

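# Inference: request a 32-step action chunk.
# Note: in many LeRobot policies select_action() returns one action per call
# from an internal queue, while predict_action_chunk() returns the whole chunk.
# Check which behaviour your installed version has for X-VLA.
with torch.no_grad():
    action_chunk = policy.select_action(obs)
print(action_chunk.shape)  # [1, 32, 20] if the full chunk is returned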

4. Train on your dataset

Dataset format

LeRobot datasets follow a standard schema — see LeRobot Ecosystem for the teleop-to-dataset recording flow. Required layout:

your-dataset/
├── meta/
│   ├── episodes.jsonl       # per-episode metadata
│   ├── tasks.jsonl          # natural language instructions
│   └── stats.json           # mean/std for normalization
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet  # state + action
└── videos/
    └── chunk-000/
        └── observation.images.primary/
            └── episode_000000.mp4

Each episode needs:

  • Images: at least 1 camera (RGB, 224×224 or higher)
  • State: proprioceptive vector (joints + gripper)
  • Action: vector of matching dimension
  • Task: natural language string
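
Before uploading, it can help to sanity-check a few frames against the four requirements above. The sketch below is a hypothetical helper, not part of LeRobot: `check_frame` is an invented name, and the key names follow the ones used in the inference example in this post plus an assumed `action` key.

```python
def check_frame(frame):
    """Return a list of problems found in one frame dict (empty if OK)."""
    problems = []
    cameras = [k for k in frame if k.startswith("observation.images.")]
    if not cameras:
        problems.append("no camera image")
    if "observation.state" not in frame:
        problems.append("missing observation.state")
    if "action" not in frame:
        problems.append("missing action")
    task = frame.get("task")
    if not isinstance(task, str) or not task.strip():
        problems.append("missing or empty task string")
    return problems

good = {
    "observation.images.primary": "episode_000000.mp4",
    "observation.state": [0.0] * 7,
    "action": [0.0] * 7,
    "task": "pick up the red block",
}
bad = {"observation.state": [0.0] * 7, "task": ""}

print(check_frame(good))
print(check_frame(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a dataset in a single pass.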

Basic fine-tune

For a new task on a standard robot (Franka/SO-101/UR-5):

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --output_dir=./outputs/xvla_my_task \
  --policy.path="lerobot/xvla-base" \
  --policy.dtype=bfloat16 \
  --policy.action_mode=auto \
  --steps=20000 \
  --policy.device=cuda \
  --policy.freeze_vision_encoder=false \
  --policy.freeze_language_encoder=false \
  --policy.train_policy_transformer=true \
  --policy.train_soft_prompts=true \
  --batch_size=8 \
  --num_workers=4

Key parameters:

  • --policy.action_mode=auto: use this for new robots; X-VLA auto-detects the dataset's action dimension and pads or trims to fit
  • --policy.train_soft_prompts=true: train soft prompts (required for new embodiments)
  • --policy.dtype=bfloat16: stores weights in 2 bytes instead of float32's 4, cutting VRAM by about 50% with near-zero accuracy loss
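
The VRAM claim for bfloat16 is easy to check for the weights alone. This is back-of-the-envelope arithmetic using the 0.9B figure from this post; it ignores activations, gradients, and optimizer state, so real savings during training will differ.

```python
PARAMS = 0.9e9  # model size stated in the text
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_gib(dtype):
    """Memory for the weights alone, in GiB."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 2**30

fp32 = weight_gib("float32")
bf16 = weight_gib("bfloat16")
print(f"float32: {fp32:.2f} GiB, bfloat16: {bf16:.2f} GiB")
```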

Soft-prompts-only fine-tune (PEFT-style)

If you only have 24GB VRAM:

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --policy.path="lerobot/xvla-base" \
  --policy.freeze_vision_encoder=true \
  --policy.freeze_language_encoder=true \
  --policy.train_policy_transformer=false \
  --policy.train_soft_prompts=true \
  --policy.dtype=bfloat16 \
  --steps=10000

Only ~9M trainable params, roughly 100× fewer than a full fine-tune of the 0.9B model, yet it still achieves 90%+ performance on similar tasks. This is the "Phase II" mode in the paper and X-VLA's main selling point for small teams.

Critical hyperparameter

The paper recommends: train the VLM (vision + language encoder) at 1/10 base learning rate, other components at full LR. Reason: the VLM is already strongly pretrained, touching it too much causes catastrophic forgetting.

LeRobot config handles this when --policy.freeze_vision_encoder=false — but if you write a custom trainer, remember to set per-group LRs.
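
For a custom trainer, per-group learning rates might be set up like this. It is a sketch under assumptions: the base LR value, the `build_param_groups` helper, and the parameter-name prefixes (`vision_encoder.`, `language_encoder.`) are hypothetical and should be replaced by the names your model actually uses.

```python
BASE_LR = 1e-4  # hypothetical base learning rate

def build_param_groups(named_params, base_lr=BASE_LR, vlm_scale=0.1,
                       vlm_prefixes=("vision_encoder.", "language_encoder.")):
    """Split parameters into a VLM group (scaled LR) and everything else."""
    vlm, other = [], []
    for name, param in named_params:
        (vlm if name.startswith(vlm_prefixes) else other).append(param)
    return [
        {"params": vlm, "lr": base_lr * vlm_scale},
        {"params": other, "lr": base_lr},
    ]

# Stand-in for model.named_parameters(); strings replace real tensors here.
fake_params = [
    ("vision_encoder.layer0.weight", "w_v"),
    ("language_encoder.embed.weight", "w_l"),
    ("policy_transformer.block0.weight", "w_p"),
    ("soft_prompts.3", "w_s"),
]
groups = build_param_groups(fake_params)
for g in groups:
    print(g["lr"], g["params"])
```

With real tensors from `model.named_parameters()`, a list of dicts in this shape is what PyTorch optimizers such as `torch.optim.AdamW` accept as per-parameter-group options.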

5. Inference on a real robot

Server-client architecture

X-VLA separates the model server from the robot environment over HTTP — important because robot dependencies (ROS, drivers) often conflict with PyTorch CUDA.

Server (GPU machine):

lerobot-serve \
  --policy.path="./outputs/xvla_my_task" \
  --port 8765 \
  --device cuda

Client (machine connected to robot):

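import base64
import time

import numpy as np
import requests

SERVER_URL = "http://gpu-server:8765/act"
CHUNK_STEP = 8    # actions executed from each chunk before re-querying
CONTROL_HZ = 30

def encode_image(img):
    # Assumed wire format: base64 of the raw uint8 HxWx3 buffer.
    # Adjust to whatever encoding your server expects (e.g. PNG or JPEG).
    raw = np.ascontiguousarray(img, dtype=np.uint8).tobytes()
    return base64.b64encode(raw).decode("ascii")

def get_action(images_dict, state, instruction, domain_id=0):
    payload = {
        "observation.state": state.tolist(),
        "task": instruction,
        "domain_id": domain_id,
    }
    for cam_name, img in images_dict.items():
        payload[f"observation.images.{cam_name}"] = encode_image(img)

    response = requests.post(SERVER_URL, json=payload, timeout=2.0)
    response.raise_for_status()  # surface HTTP errors before parsing
    return np.array(response.json()["action"])

# 30 Hz control loop; `robot` is your own robot interface object
while True:
    obs = robot.get_observation()
    action_chunk = get_action(obs["images"], obs["state"], "pick up the cup")

    # The chunk has 32 steps: execute the first CHUNK_STEP, then re-query
    for action in action_chunk[:CHUNK_STEP]:
        tick = time.monotonic()
        robot.execute(action)
        # Sleep out the rest of the control period to hold ~30 Hz
        time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - tick)))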

Async inference for real-time

Action chunking (32 steps) enables async inference: while the robot executes the current chunk, the server can compute the next one. Effective latency drops from 200-400ms per action to ~30-50ms.
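
The overlap of execution and inference can be sketched with a single background worker. This is a simulation, not LeRobot's async implementation: `fake_infer` stands in for the server request, the sleeps are made-up latencies, and `run` and `CHUNK_STEP` are names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_STEP = 8  # actions executed before switching to the next chunk

def fake_infer(step):
    """Stand-in for the HTTP request to the policy server."""
    time.sleep(0.05)  # pretend inference latency
    return [f"a{step}_{i}" for i in range(32)]

def run(num_chunks=3):
    executed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_infer, 0)
        for step in range(num_chunks):
            chunk = future.result()  # blocks only if inference is late
            # Request the next chunk before executing the current one.
            future = pool.submit(fake_infer, step + 1)
            for action in chunk[:CHUNK_STEP]:
                time.sleep(0.01)  # pretend robot.execute(action)
                executed.append(action)
        future.cancel()  # drop the last prefetch if it has not started
    return executed

executed = run()
print(len(executed), executed[0], executed[8])
```

One caveat of this pattern: the prefetched chunk is computed from an observation taken before the current chunk finished, so a shorter `CHUNK_STEP` keeps the policy more reactive at the cost of more requests.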

The LeRobot HilSerl Real Robot RL post has sample code for this async pattern.

Bimanual robot manipulation with VLA controller

6. Evaluating results

Eval on LIBERO

lerobot-eval \
  --policy.path="lerobot/xvla-libero" \
  --env.type=libero \
  --env.task=libero_spatial,libero_goal,libero_10,libero_object \
  --env.control_mode=absolute \
  --eval.batch_size=1 \
  --eval.n_episodes=50 \
  --env.episode_length=800 \
  --seed=142

Expected results after ~30 minutes on an A100:

  • LIBERO-Spatial: 96-98%
  • LIBERO-Goal: 96-99%
  • LIBERO-Object: 98-100%
  • LIBERO-10: 92-95%

This is the baseline to compare your custom training against.

WandB logging

Add to your training command:

--wandb.enable=true \
--wandb.project=xvla-finetune \
--wandb.run_name=my-task-v1

Track: loss/flow_matching, loss/gripper_bce, validation/success_rate, gradients/soft_prompt_norm.

7. Practical tips from real use

Common errors

  1. CUDA OOM on load — Use --policy.dtype=bfloat16 instead of float32, halves VRAM
  2. Action dimension mismatch — Set --policy.action_mode=auto so X-VLA handles padding
  3. Soft prompt won't converge — Check the learning rate, soft prompts need higher LR than backbone (~5e-4 vs 1e-4)
  4. Slow inference — Reduce flow-matching num_inference_steps from 10 to 4-5 (slight accuracy loss but 2× faster)

Domain ID — don't forget!

Each embodiment has its own domain_id:

| Dataset | Domain ID |
|---|---|
| Bridge | 0 |
| RT-1 | 1 |
| CALVIN | 2 |
| LIBERO | 3 |
| WidowX (air) | 4 |
| AIR-AGILEX-HQ | 5 |
| AGIBOT-challenge | 9 |

If you forget domain_id at inference, the model falls back to the default (0 = Bridge), picks the wrong soft prompt, and the policy fails. Always match domain_id to the checkpoint you trained.
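
One way to avoid the silent fallback is to resolve the ID through a lookup that refuses unknown names. The mapping below is copied from the table in this section; `resolve_domain_id` is a hypothetical helper, not a LeRobot function.

```python
# Mapping copied from the table in this section.
DOMAIN_IDS = {
    "Bridge": 0,
    "RT-1": 1,
    "CALVIN": 2,
    "LIBERO": 3,
    "WidowX (air)": 4,
    "AIR-AGILEX-HQ": 5,
    "AGIBOT-challenge": 9,
}

def resolve_domain_id(dataset_name):
    """Look up a domain ID, refusing to fall back to a default."""
    try:
        return DOMAIN_IDS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_IDS))
        raise ValueError(
            f"unknown dataset {dataset_name!r}; known: {known}"
        ) from None

print(resolve_domain_id("LIBERO"))
```

Failing loudly at startup is cheaper than debugging a policy that moves the arm using another robot's prompt.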

When to use X-VLA vs alternatives?

| Situation | Pick |
|---|---|
| Multi-robot fleet (3+ embodiments) | X-VLA: pretrain once, swap prompts |
| Single robot, small dataset (<5K eps) | π0-FAST or VLA-Adapter |
| Single robot, large dataset, 1 task | OpenVLA or fine-tune RT-2 |
| Bimanual humanoid | X-VLA-AgiBot or WholeBodyVLA |
| Consumer GPU (RTX 3060/4090) | VLA-Adapter 0.5B |

8. Learning roadmap

After grasping X-VLA, go further with:

  1. Read the paper: arXiv 2510.10274 (33 pages; section 3 on soft prompt design is worth reading carefully)
  2. Collect your own 100-500 episodes via teleop, train soft prompts for your task
  3. Compare with baselines — train the same dataset with ACT, Diffusion Policy, OpenVLA to understand trade-offs
  4. Contribute a custom action mode to upstream LeRobot if your robot is unusual (just 30 lines like the example in docs)

Conclusion

X-VLA is a clear step forward for cross-embodiment VLA: instead of training n models for n robots, train 1 backbone + n soft prompts. With LeRobot integration, beginners can now:

  • Load lerobot/xvla-base in one line
  • Fine-tune for their own task with ~9M trainable params on an RTX 4090
  • Deploy via HTTP server-client, safe for ROS-based robot stacks

Code, weights, dataset are all open-source under Apache 2.0 — no barriers for teams wanting to research or ship products. If you're building a robot fleet or doing manipulation research, X-VLA is the most reasonable bet in the 2026 VLA landscape.
