VnRobo
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

X-VLA ICLR 2026: Soft-Prompted 0.9B VLA on LeRobot

X-VLA tutorial — flow-matching 0.9B VLA hitting SOTA on 6 sims + 3 real robots, native LeRobot, open-source on HuggingFace.

Nguyễn Anh Tuấn · May 20, 2026 · 10 min read · Updated: Jun 14, 2026

Vision-Language-Action (VLA) research is moving in two directions: scaling up to billions of parameters (RT-2, OpenVLA 7B, π0 3B) or scaling down to consumer GPUs (VLA-Adapter 0.5B). But both directions skip an important question: how do you train a single model that runs across many different robots (Franka, WidowX, Google Robot, AgileX bimanual) without fine-tuning each one separately?

This is the problem X-VLA (ICLR 2026, arXiv:2510.10274) addresses. With just 0.9B parameters, X-VLA reports state-of-the-art results on 6 simulation benchmarks and 3 real robots, won the AgiBot World Challenge at IROS 2025, and, most importantly for practitioners, is natively integrated into LeRobot: setting policy.type=xvla is enough to start training.

In this tutorial I'll walk you from the "soft prompt" idea → flow-matching architecture → LeRobot installation → training on your dataset → inference on a real robot. Beginner-friendly, no heavy Transformer background required.

X-VLA soft-prompted cross-embodiment vision-language-action model

1. Why this paper matters

The cross-embodiment problem

Imagine you have 3 robots: a 7-DOF Franka, a 6-DOF WidowX, and a pair of AgileX bimanual arms. Each robot has:

  • Different action space — 7 joints vs 6 joints vs 14 joints
  • Different camera setup — 1 wrist + 1 third-person vs 2 wrist vs 3 cameras
  • Different gripper — parallel jaw vs underactuated vs custom

The old way (OpenVLA, π0): train separately for each embodiment, or try to tokenize actions as text and let the LLM figure it out — but quality is uneven, and every new robot needs heavy fine-tuning.

X-VLA's answer: use a "soft prompt" — a learnable embedding set per robot type, share the Transformer backbone across all of them. Like the same LLM with different prompts for different tasks, X-VLA uses the same Transformer with different prompts for different robots.

Striking results

Benchmark | Embodiment | X-VLA score | Comparison
LIBERO (4 suites) | Franka | 98.1% | π0: ~97%, OpenVLA: 76.5%
SimplerEnv WidowX | WidowX | 95.8% | RT-1-X: 64%
Google Robot (VM) | Google Robot | 83.5% | OpenVLA: 71%
CALVIN (ABCD→D) | Franka | 4.43/5 | RoboFlamingo: 3.49
RoboTwin2 | AgileX bimanual | 70% | π0-FAST: 58%

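More importantly: in its soft-prompt-only setting, X-VLA on LIBERO achieves near-π0 performance while training roughly 300× fewer parameters. Only the soft prompts (~9M params) are trained per new embodiment, versus all 3B parameters when fine-tuning π0 (3B / 9M ≈ 333×). Relative to X-VLA's own 0.9B parameters, ~9M is about 1% of the weights.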

2. X-VLA architecture

Flow-matching overview

X-VLA doesn't use DDPM-style diffusion, nor autoregressive tokens like RT-2. It uses flow matching — a generative model family that learns the vector field between noise and data, generating 32-step action chunks in a few denoising steps.
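To make the sampling side of flow matching concrete, here is a minimal sketch that integrates a velocity field from noise (t = 0) to an action chunk (t = 1) with fixed Euler steps. The `toy_velocity` function is a made-up stand-in: in X-VLA the velocity would come from the Transformer and flow-matching head, conditioned on images, language, and state. The function names and the time convention are assumptions for illustration, not LeRobot's API.

```python
import numpy as np

def sample_action_chunk(velocity_fn, chunk_len=32, action_dim=20, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((chunk_len, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the velocity field
    return x

# Hypothetical stand-in for the trained network. For straight-line paths
# x_t = (1 - t) * noise + t * target, the velocity that carries x_t to the
# target is (target - x_t) / (1 - t).
TARGET = np.full((32, 20), 0.5)

def toy_velocity(x, t):
    return (TARGET - x) / (1.0 - t)

chunk = sample_action_chunk(toy_velocity, num_steps=10)
print(chunk.shape)
```

With this idealized straight-line field even a handful of steps lands on the target; a learned field is only approximately straight, which is why lowering `num_steps` trades some accuracy for speed.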

The main pipeline:

[RGB images] → Vision encoder ────┐
[Language instruction] → Text enc ┤
[Proprioceptive state] → MLP ─────┤── Transformer (24 layers, 1024 dim)
[Soft prompt embeddings] ─────────┤              │
[Domain ID embedding] ────────────┘              ▼
                                         Flow-matching head
                                                 │
                                                 ▼
                                    Action chunk (32 steps × 20-D)

Soft prompt mechanism

Each embodiment has a set of 32 learnable embedding vectors (dim 1024). During training, the model learns these vectors as a "preamble" that gets prepended to the Transformer input. For a new robot:

  • Phase I (pretrain): Train the backbone + all soft prompts on 290K episodes from 7 platforms
  • Phase II (adapt): Freeze the backbone, only train a new soft prompt (~9M params) for the new embodiment

This is the big difference from LoRA: LoRA injects low-rank updates into weights, while soft prompts inject embeddings into inputs — simpler, lighter, and well-validated by NLP prompt tuning work (Prefix Tuning, P-Tuning v2).
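A rough sketch of the prepend step, using NumPy arrays in place of trainable tensors. The embodiment names, the observation token count, and the initialization scale are invented for illustration. Note that a single 32 × 1024 table holds 32,768 values; the post does not break down how the ~9M figure is composed, so this count covers only the one table described here.

```python
import numpy as np

NUM_PROMPT_TOKENS = 32   # soft prompt length per embodiment (from the text)
HIDDEN_DIM = 1024        # Transformer width (from the text)

rng = np.random.default_rng(0)

# One learnable table per embodiment; the keys are illustrative names.
soft_prompts = {
    "franka": rng.standard_normal((NUM_PROMPT_TOKENS, HIDDEN_DIM)) * 0.02,
    "widowx": rng.standard_normal((NUM_PROMPT_TOKENS, HIDDEN_DIM)) * 0.02,
}

def build_transformer_input(tokens, embodiment):
    """Prepend the embodiment's soft prompt to the observation tokens."""
    prompt = soft_prompts[embodiment]
    return np.concatenate([prompt, tokens], axis=0)

# Image + text + state tokens; 200 is a made-up count.
obs_tokens = rng.standard_normal((200, HIDDEN_DIM))
seq = build_transformer_input(obs_tokens, "franka")
params_per_prompt = NUM_PROMPT_TOKENS * HIDDEN_DIM
print(seq.shape, params_per_prompt)  # (232, 1024) 32768
```

Switching robots only changes which table is looked up; the sequence layout and the backbone weights stay the same.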

Unified EE6D action space

X-VLA uses EE6D (End-Effector 6D) as the canonical action space: 3D position + 6D rotation representation + gripper signal, i.e. 3 + 6 + 1 = 10 values per end effector, padded out to 20 dimensions. Every other robot (joint-space 7-DOF, bimanual 14-DOF) gets mapped into this 20-D space via an Action Registry. If the robot uses fewer dimensions, the rest is zero-padded and ignored in the loss.

This allows a single forward pass to handle any embodiment, differing only in soft prompt + domain ID.
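The "zero-pad and ignore in the loss" idea can be sketched as follows. This is an illustration of the masking principle only: the function names are hypothetical and the real Action Registry in LeRobot may organize the mapping differently.

```python
import numpy as np

CANONICAL_DIM = 20  # width of the shared action space (from the text)

def pad_action(action, canonical_dim=CANONICAL_DIM):
    """Zero-pad a 1-D native action to the canonical width and return a validity mask."""
    action = np.asarray(action, dtype=np.float32)
    n = action.shape[0]
    if n > canonical_dim:
        raise ValueError(f"action has {n} dims, canonical space has {canonical_dim}")
    padded = np.zeros(canonical_dim, dtype=np.float32)
    mask = np.zeros(canonical_dim, dtype=np.float32)
    padded[:n] = action
    mask[:n] = 1.0  # 1 marks real dimensions, 0 marks padding
    return padded, mask

def masked_mse(pred, target, mask):
    """Mean squared error over the valid (unpadded) dimensions only."""
    return float(((pred - target) ** 2 * mask).sum() / mask.sum())

# A 10-D action (3 position + 6 rotation + 1 gripper) in a 20-D space.
target, mask = pad_action(np.ones(10))
pred = np.zeros(CANONICAL_DIM, dtype=np.float32)
pred[10:] = 99.0  # arbitrary values in the padded slots
loss = masked_mse(pred, target, mask)
print(loss)  # 1.0: the padded slots do not contribute
```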

Real robot arm controlled by a VLA model

3. Installing LeRobot with X-VLA

Hardware requirements

Purpose | Minimum GPU | Recommended
Inference only | RTX 3060 12GB | RTX 4090
Soft-prompt fine-tune | RTX 4090 24GB | A100 40GB
Full pretrain (290K episodes) | 8× A100 80GB | 8× H100

For beginners: an RTX 4090, or a rented A100 on Vast.ai (~$1/hour), is enough to fine-tune soft prompts for your own task.

Environment setup

# Create conda env
conda create -n xvla python=3.10 -y
conda activate xvla

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Install LeRobot with X-VLA extras
pip install -e ".[xvla]"

# Verify
python -c "from lerobot.policies.xvla import XVLAPolicy; print('OK')"

Load a pretrained checkpoint

X-VLA ships several HuggingFace checkpoints:

Checkpoint | Description | Use case
lerobot/xvla-base | 0.9B pretrained on 290K episodes | Fine-tune for new tasks
lerobot/xvla-libero | Fine-tuned LIBERO (98.1%) | Evaluate LIBERO immediately
lerobot/xvla-widowx | WidowX pick-and-place | SimplerEnv demo
lerobot/xvla-folding | Cloth folding 100% | Hard bimanual task
lerobot/xvla-agibot-world | AgileX dexterous | General bimanual
lerobot/xvla-google-robot | Google Robot RT-1 setup | Cross-domain demo

Quick inference test:

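import torch
from lerobot.policies.xvla import XVLAPolicy

# Load the pretrained policy and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
policy = XVLAPolicy.from_pretrained("lerobot/xvla-base").to(device).eval()

# Dummy observation (batch size 1); image values in [0, 1]
obs = {
    "observation.images.primary": torch.rand(1, 3, 224, 224, device=device),
    "observation.images.wrist": torch.rand(1, 3, 224, 224, device=device),
    "observation.state": torch.randn(1, 7, device=device),
    "task": ["pick up the red block"],
    "domain_id": torch.tensor([0], device=device),
}

# Inference: generate a 32-step action chunk.
# Assumption: the policy follows LeRobot's usual policy interface, where
# select_action() returns one action per call (popped from an internal queue)
# and predict_action_chunk() returns the whole chunk. Real inputs normally go
# through the policy's preprocessing pipeline first.
with torch.no_grad():
    action_chunk = policy.predict_action_chunk(obs)
print(action_chunk.shape)  # expected from the architecture above: [1, 32, 20]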

4. Train on your dataset

Dataset format

LeRobot datasets follow a standard schema — see LeRobot Ecosystem for the teleop-to-dataset recording flow. Required layout:

your-dataset/
├── meta/
│   ├── episodes.jsonl       # per-episode metadata
│   ├── tasks.jsonl          # natural language instructions
│   └── stats.json           # mean/std for normalization
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet  # state + action
└── videos/
    └── chunk-000/
        └── observation.images.primary/
            └── episode_000000.mp4

Each episode needs:

  • Images: at least 1 camera (RGB, 224×224 or higher)
  • State: proprioceptive vector (joints + gripper)
  • Action: vector of matching dimension
  • Task: natural language string
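Before launching a training run, it can help to confirm that a local dataset folder matches the layout shown above. The checker below is a hypothetical helper written for this post, not a LeRobot utility; it only tests that the expected files exist and does not validate their contents.

```python
from pathlib import Path

REQUIRED_META = ("episodes.jsonl", "tasks.jsonl", "stats.json")

def _has_match(directory, pattern):
    """True if the directory exists and at least one path matches the glob pattern."""
    return directory.is_dir() and any(directory.glob(pattern))

def check_dataset_layout(root):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    for name in REQUIRED_META:
        if not (root / "meta" / name).is_file():
            problems.append(f"missing meta/{name}")
    if not _has_match(root / "data", "chunk-*/episode_*.parquet"):
        problems.append("no data/chunk-*/episode_*.parquet files")
    if not _has_match(root / "videos", "chunk-*/*/episode_*.mp4"):
        problems.append("no videos/chunk-*/<camera>/episode_*.mp4 files")
    return problems

if __name__ == "__main__":
    for problem in check_dataset_layout("your-dataset"):
        print(problem)
```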

Basic fine-tune

For a new task on a standard robot (Franka/SO-101/UR-5):

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --output_dir=./outputs/xvla_my_task \
  --policy.path="lerobot/xvla-base" \
  --policy.dtype=bfloat16 \
  --policy.action_mode=auto \
  --steps=20000 \
  --policy.device=cuda \
  --policy.freeze_vision_encoder=false \
  --policy.freeze_language_encoder=false \
  --policy.train_policy_transformer=true \
  --policy.train_soft_prompts=true \
  --batch_size=8 \
  --num_workers=4

Key parameters:

  • --policy.action_mode=auto: use this for new robots; X-VLA auto-detects the dataset's action dimension and pads or trims to fit
  • --policy.train_soft_prompts=true: trains the soft prompts (required for new embodiments)
  • --policy.dtype=bfloat16: stores values in 2 bytes instead of float32's 4, roughly halving VRAM with near-zero accuracy loss
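The bfloat16 saving is simple arithmetic on bytes per parameter. The sketch below covers the weights only, using the 0.9B parameter count quoted in this post; gradients, optimizer state, and activations come on top and depend on batch size and which parts are frozen.

```python
PARAMS = 0.9e9  # model size quoted in this post

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

fp32_gb = weight_memory_gb(PARAMS, "float32")
bf16_gb = weight_memory_gb(PARAMS, "bfloat16")
print(f"float32: {fp32_gb:.1f} GB, bfloat16: {bf16_gb:.1f} GB")
# float32: 3.6 GB, bfloat16: 1.8 GB
```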

Soft-prompts-only fine-tune (PEFT-style)

If you only have 24GB VRAM:

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --policy.path="lerobot/xvla-base" \
  --policy.freeze_vision_encoder=true \
  --policy.freeze_language_encoder=true \
  --policy.train_policy_transformer=false \
  --policy.train_soft_prompts=true \
  --policy.dtype=bfloat16 \
  --steps=10000

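Only ~9M parameters are trainable, about 100× fewer than a full fine-tune of the 0.9B model (and roughly 300× fewer than fine-tuning a 3B model such as π0), yet this still achieves 90%+ performance on similar tasks. This is the "Phase II" mode in the paper and X-VLA's main selling point for small teams.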

Critical hyperparameter

The paper recommends: train the VLM (vision + language encoder) at 1/10 base learning rate, other components at full LR. Reason: the VLM is already strongly pretrained, touching it too much causes catastrophic forgetting.

LeRobot config handles this when --policy.freeze_vision_encoder=false — but if you write a custom trainer, remember to set per-group LRs.
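For a custom trainer, one way to express the 1/10 rule is with optimizer parameter groups. The sketch below is an assumption-laden illustration: the module-name prefixes are hypothetical (inspect your model's parameter names for the real ones), and the 1e-4 base rate is taken from the tips later in this post.

```python
def build_param_groups(named_params, base_lr=1e-4, vlm_lr_scale=0.1,
                       vlm_prefixes=("vision_encoder.", "language_encoder.")):
    """Give VLM parameters a reduced learning rate and everything else the base rate.

    named_params: iterable of (name, parameter) pairs, e.g. model.named_parameters().
    vlm_prefixes: hypothetical module-name prefixes; check your model's real names.
    """
    vlm_params, other_params = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters stay out of the optimizer
        if name.startswith(vlm_prefixes):
            vlm_params.append(param)
        else:
            other_params.append(param)

    groups = []
    if vlm_params:
        groups.append({"params": vlm_params, "lr": base_lr * vlm_lr_scale})
    if other_params:
        groups.append({"params": other_params, "lr": base_lr})
    return groups

# With PyTorch installed, the result can be passed straight to an optimizer:
#   optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
```

Keeping the split in one function makes it easy to add a third group later, for example a higher rate for the soft prompts.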

5. Inference on a real robot

Server-client architecture

X-VLA separates the model server from the robot environment over HTTP — important because robot dependencies (ROS, drivers) often conflict with PyTorch CUDA.

Server (GPU machine):

lerobot-serve \
  --policy.path="./outputs/xvla_my_task" \
  --port 8765 \
  --device cuda

Client (machine connected to robot):

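import base64
import io
import time

import numpy as np
import requests

SERVER_URL = "http://gpu-server:8765/act"
CHUNK_STEPS = 8    # actions executed from each chunk before re-querying
CONTROL_HZ = 30

def encode_image(img):
    # Assumption: the server decodes base64-encoded .npy bytes. Adapt this to
    # whatever wire format your server expects (base64 JPEG, multipart, ...).
    buf = io.BytesIO()
    np.save(buf, np.asarray(img))
    return base64.b64encode(buf.getvalue()).decode("ascii")

def get_action(images_dict, state, instruction, domain_id=0):
    payload = {
        "observation.state": np.asarray(state).tolist(),
        "task": instruction,
        "domain_id": domain_id,
    }
    for cam_name, img in images_dict.items():
        payload[f"observation.images.{cam_name}"] = encode_image(img)

    response = requests.post(SERVER_URL, json=payload, timeout=2.0)
    response.raise_for_status()  # fail loudly on HTTP errors
    return np.array(response.json()["action"])

# 30 Hz control loop; `robot` is your own robot interface object
while True:
    obs = robot.get_observation()
    action_chunk = get_action(obs["images"], obs["state"], "pick up the cup")

    # Execute the first CHUNK_STEPS actions of the 32-step chunk, then re-query
    for action in action_chunk[:CHUNK_STEPS]:
        step_start = time.monotonic()
        robot.execute(action)
        # Sleep off the remainder of the control period
        time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - step_start)))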

Async inference for real-time

Action chunking (32 steps) enables async inference: while the robot executes the current chunk, the server can compute the next one. Effective latency drops from 200-400ms per action to ~30-50ms.
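The overlap can be sketched with a single worker thread that fetches the next chunk while the current one runs. This is a generic prefetch pattern, not the LeRobot implementation; `fake_get_chunk` stands in for the HTTP call, and the other names are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_STEPS = 8  # actions executed from each chunk before switching

def run_async_loop(get_chunk, execute, get_obs, num_chunks):
    """Execute the current chunk while the next one is computed in a worker thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(get_chunk, get_obs())
        for i in range(num_chunks):
            chunk = future.result()  # blocks only if inference is still running
            if i + 1 < num_chunks:
                # Request the next chunk now, from the latest observation
                future = pool.submit(get_chunk, get_obs())
            for action in chunk[:CHUNK_STEPS]:
                execute(action)

# Stand-ins for the policy server and the robot
calls = []
executed = []

def fake_get_chunk(obs):
    calls.append(obs)
    time.sleep(0.01)  # pretend inference latency
    return list(range(32))

run_async_loop(fake_get_chunk, executed.append, lambda: "obs", num_chunks=3)
print(len(calls), len(executed))  # 3 24
```

The trade-off is staleness: each prefetched chunk is planned from an observation taken before the current chunk finished executing, so shorter execution windows keep the plan closer to reality.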

The LeRobot HIL-SERL Real Robot RL post has sample code for this async pattern.

Bimanual robot manipulation with VLA controller

6. Evaluating results

Eval on LIBERO

lerobot-eval \
  --policy.path="lerobot/xvla-libero" \
  --env.type=libero \
  --env.task=libero_spatial,libero_goal,libero_10,libero_object \
  --env.control_mode=absolute \
  --eval.batch_size=1 \
  --eval.n_episodes=50 \
  --env.episode_length=800 \
  --seed=142

Expected results after ~30 minutes on an A100:

  • LIBERO-Spatial: 96-98%
  • LIBERO-Goal: 96-99%
  • LIBERO-Object: 98-100%
  • LIBERO-10: 92-95%

This is the baseline to compare your custom training against.

WandB logging

Add to your training command:

--wandb.enable=true \
--wandb.project=xvla-finetune \
--wandb.run_name=my-task-v1

Track: loss/flow_matching, loss/gripper_bce, validation/success_rate, gradients/soft_prompt_norm.

7. Practical tips from real use

Common errors

  1. CUDA OOM on load — Use --policy.dtype=bfloat16 instead of float32, halves VRAM
  2. Action dimension mismatch — Set --policy.action_mode=auto so X-VLA handles padding
  3. Soft prompt won't converge — Check the learning rate, soft prompts need higher LR than backbone (~5e-4 vs 1e-4)
  4. Slow inference — Reduce flow-matching num_inference_steps from 10 to 4-5 (slight accuracy loss but 2× faster)

Domain ID — don't forget!

Each embodiment has its own domain_id:

Dataset | Domain ID
Bridge | 0
RT-1 | 1
CALVIN | 2
LIBERO | 3
WidowX (air) | 4
AIR-AGILEX-HQ | 5
AGIBOT-challenge | 9

Forgetting domain_id on inference → model uses default (0 = Bridge) → wrong soft prompt → policy fails. Always match domain_id with your training checkpoint.
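One low-tech guard is to resolve the ID by name and refuse to fall back to a default. The mapping below copies the table above; the helper itself is a hypothetical convenience, not part of LeRobot.

```python
# Domain IDs as listed in the table above
DOMAIN_IDS = {
    "Bridge": 0,
    "RT-1": 1,
    "CALVIN": 2,
    "LIBERO": 3,
    "WidowX (air)": 4,
    "AIR-AGILEX-HQ": 5,
    "AGIBOT-challenge": 9,
}

def resolve_domain_id(dataset_name):
    """Look up the domain ID by dataset name, raising instead of silently using 0."""
    try:
        return DOMAIN_IDS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_IDS))
        raise ValueError(f"unknown dataset {dataset_name!r}; known: {known}") from None

print(resolve_domain_id("LIBERO"))  # 3
```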

When to use X-VLA vs alternatives?

Situation | Pick
Multi-robot fleet (3+ embodiments) | X-VLA (pretrain once, swap prompts)
Single robot, small dataset (<5K eps) | π0-FAST or VLA-Adapter
Single robot, large dataset, 1 task | OpenVLA or fine-tune RT-2
Bimanual humanoid | X-VLA-AgiBot or WholeBodyVLA
Consumer GPU (RTX 3060/4090) | VLA-Adapter 0.5B

8. Learning roadmap

After grasping X-VLA, go further with:

  1. Read the paper — arXiv 2510.10274 (33 pages, worth reading section 3 on soft prompt design carefully)
  2. Collect your own 100-500 episodes via teleop, train soft prompts for your task
  3. Compare with baselines — train the same dataset with ACT, Diffusion Policy, OpenVLA to understand trade-offs
  4. Contribute a custom action mode to upstream LeRobot if your robot is unusual (just 30 lines like the example in docs)

Conclusion

X-VLA is a clear step forward for cross-embodiment VLA: instead of training n models for n robots, train 1 backbone + n soft prompts. With LeRobot integration, beginners can now:

  • Load lerobot/xvla-base in one line
  • Fine-tune for their own task with ~9M trainable params on an RTX 4090
  • Deploy via HTTP server-client, safe for ROS-based robot stacks

Code, weights, dataset are all open-source under Apache 2.0 — no barriers for teams wanting to research or ship products. If you're building a robot fleet or doing manipulation research, X-VLA is the most reasonable bet in the 2026 VLA landscape.

References

  • Paper X-VLA arXiv 2510.10274 — Soft-Prompted Transformer as Scalable Cross-Embodiment VLA, ICLR 2026
  • GitHub 2toinf/X-VLA — Reference implementation
  • LeRobot X-VLA docs — Integration guide
  • HuggingFace lerobot/xvla-base — 0.9B pretrained checkpoint
  • Project page — Demo videos + cloth folding dataset

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
