Vision-Language-Action (VLA) research is moving in two directions: scaling up to billions of parameters (RT-2, OpenVLA 7B, π0 3B) or scaling down to consumer GPUs (VLA-Adapter 0.5B). But both directions skip an important question: how do you train a single model that runs across many different robots — Franka, WidowX, Google Robot, Agilex bimanual — without separately fine-tuning each one?
This is exactly what X-VLA (ICLR 2026, paper arXiv:2510.10274) solves. With just 0.9B parameters, X-VLA achieves SOTA on 6 simulation benchmarks + 3 real robots, won the AgiBot World Challenge at IROS 2025, and most importantly: it's been natively integrated into LeRobot — a single line policy.type=xvla is enough to train.
In this tutorial I'll walk you from the "soft prompt" idea → flow-matching architecture → LeRobot installation → training on your dataset → inference on a real robot. Beginner-friendly, no heavy Transformer background required.
1. Why this paper matters
The cross-embodiment problem
Imagine you have 3 robots: a 7-DOF Franka, a 6-DOF WidowX, and a pair of AgileX bimanual arms. Each robot has:
- Different action space — 7 joints vs 6 joints vs 14 joints
- Different camera setup — 1 wrist + 1 third-person vs 2 wrist vs 3 cameras
- Different gripper — parallel jaw vs underactuated vs custom
The old way (OpenVLA, π0): train separately for each embodiment, or try to tokenize actions as text and let the LLM figure it out — but quality is uneven, and every new robot needs heavy fine-tuning.
X-VLA's answer: use a "soft prompt" — a learnable embedding set per robot type, share the Transformer backbone across all of them. Like the same LLM with different prompts for different tasks, X-VLA uses the same Transformer with different prompts for different robots.
Striking results
| Benchmark | Embodiment | X-VLA Score | Comparison |
|---|---|---|---|
| LIBERO (4 suites) | Franka | 98.1% | π0: ~97%, OpenVLA: 76.5% |
| SimplerEnv WidowX | WidowX | 95.8% | RT-1-X: 64% |
| Google Robot (VM) | Google Robot | 83.5% | OpenVLA: 71% |
| CALVIN (ABCD→D) | Franka | 4.43/5 | RoboFlamingo: 3.49 |
| RoboTwin2 | AgileX bimanual | 70% | π0-FAST: 58% |
More importantly, X-VLA-LIBERO achieves near-π₀ performance while using roughly 300× fewer trainable parameters: adapting to a new embodiment trains only the soft prompts (~9M params), compared with fine-tuning a full 3B model such as π₀. Measured against X-VLA's own 0.9B backbone, that is still about 100× fewer.
2. X-VLA architecture
Flow-matching overview
X-VLA doesn't use DDPM-style diffusion, nor autoregressive tokens like RT-2. It uses flow matching — a generative model family that learns the vector field between noise and data, generating 32-step action chunks in a few denoising steps.
The main pipeline:
[RGB images]             → Vision encoder ──┐
[Language instruction]   → Text encoder ────┤
[Proprioceptive state]   → MLP ─────────────┼──▶ Transformer (24 layers, 1024 dim)
[Soft prompt embeddings] ───────────────────┤                  │
[Domain ID embedding]    ───────────────────┘                  ▼
                                                      Flow-matching head
                                                               │
                                                               ▼
                                                Action chunk (32 steps × 20-D)
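To make the "few denoising steps" idea concrete, here is a minimal sketch of flow-matching sampling with plain Euler integration. It is not X-VLA's actual sampler: the function names, the time convention (noise at t=0, data at t=1), and the toy velocity field standing in for the trained network are all assumptions made for illustration.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=32, action_dim=20, num_steps=10, seed=0):
    """Integrate a velocity field from noise (t=0) to an action chunk (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy stand-in for the trained network (assumption): the exact velocity field
# of a straight-line path that ends at a known target chunk.
target = np.full((32, 20), 0.5)

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

chunk = sample_action_chunk(toy_velocity, num_steps=5)
print(chunk.shape)  # (32, 20)
```

Because the toy path is a straight line, even 5 Euler steps land on the target; a learned field is only approximately straight, which is why fewer steps trade some accuracy for speed.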
Soft prompt mechanism
Each embodiment has a set of 32 learnable embedding vectors (dim 1024). During training, the model learns these vectors as a "preamble" that gets prepended to the Transformer input. For a new robot:
- Phase I (pretrain): Train the backbone + all soft prompts on 290K episodes from 7 platforms
- Phase II (adapt): Freeze the backbone, only train a new soft prompt (~9M params) for the new embodiment
This is the big difference from LoRA: LoRA injects low-rank updates into weights, while soft prompts inject embeddings into inputs — simpler, lighter, and well-validated by NLP prompt tuning work (Prefix Tuning, P-Tuning v2).
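The prepend step described above can be sketched in a few lines of NumPy. This is an illustration only: the class name, the initialization scale, and the way the domain ID indexes the bank are assumptions, not the X-VLA implementation. Only the 32 × 1024 prompt shape comes from the text.

```python
import numpy as np

NUM_PROMPT_TOKENS = 32  # from the text above
HIDDEN_DIM = 1024       # from the text above

class SoftPromptBank:
    """One learnable (32, 1024) prompt per embodiment, selected by domain ID."""

    def __init__(self, num_domains, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init (assumption); in training these are learned.
        self.prompts = 0.02 * rng.standard_normal(
            (num_domains, NUM_PROMPT_TOKENS, HIDDEN_DIM)
        )

    def prepend(self, tokens, domain_ids):
        """tokens: (batch, seq, hidden); domain_ids: (batch,) integer array."""
        prompt = self.prompts[domain_ids]  # (batch, 32, hidden)
        return np.concatenate([prompt, tokens], axis=1)

bank = SoftPromptBank(num_domains=7)
tokens = np.zeros((2, 100, HIDDEN_DIM))
out = bank.prepend(tokens, np.array([0, 3]))
print(out.shape)  # (2, 132, 1024)
```

In the Phase II setting, only the row of the bank that belongs to the new embodiment would receive gradient updates while the Transformer stays frozen.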
Unified EE6D action space
X-VLA uses EE6D (End-Effector 6D) as the canonical action space: 3D position + 6D rotation representation + gripper signal + padding = 20 dimensions. Every other robot (joint-space 7-DOF, bimanual 14-DOF) gets mapped into this 20-D space via an Action Registry — if the robot has fewer dimensions, the rest is zero-padded and ignored in the loss.
This allows a single forward pass to handle any embodiment, differing only in soft prompt + domain ID.
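The zero-pad-and-ignore rule can be sketched as a padding helper plus a masked loss. The function names and the 10-D single-arm example (3 position + 6 rotation + 1 gripper) are assumptions for illustration; the real Action Registry may lay out dimensions differently.

```python
import numpy as np

CANONICAL_DIM = 20

def pad_action(action, canonical_dim=CANONICAL_DIM):
    """Zero-pad a (T, d) action sequence to (T, canonical_dim); also return a validity mask."""
    action = np.asarray(action, dtype=np.float64)
    steps, dim = action.shape
    if dim > canonical_dim:
        raise ValueError(f"action dim {dim} exceeds canonical dim {canonical_dim}")
    padded = np.zeros((steps, canonical_dim))
    padded[:, :dim] = action
    mask = np.zeros(canonical_dim, dtype=bool)
    mask[:dim] = True
    return padded, mask

def masked_mse(pred, target, mask):
    """Mean squared error over valid dimensions only; padded slots contribute nothing."""
    diff = (pred - target)[:, mask]
    return float(np.mean(diff ** 2))

single_arm = np.ones((32, 10))          # hypothetical 10-D single-arm actions
padded, mask = pad_action(single_arm)
pred = np.zeros((32, CANONICAL_DIM))
pred[:, 10:] = 99.0                     # garbage in padded slots must not affect the loss
print(padded.shape, int(mask.sum()), masked_mse(pred, padded, mask))  # (32, 20) 10 1.0
```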
3. Installing LeRobot with X-VLA
Hardware requirements
| Purpose | Minimum GPU | Recommended |
|---|---|---|
| Inference only | RTX 3060 12GB | RTX 4090 |
| Soft-prompt fine-tune | RTX 4090 24GB | A100 40GB |
| Full pretrain (290K episodes) | 8× A100 80GB | 8× H100 |
For beginners, an RTX 4090 or a rented A100 on Vast.ai (~$1/hour) is enough to fine-tune soft prompts for your own task.
Environment setup
# Create conda env
conda create -n xvla python=3.10 -y
conda activate xvla
# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Install LeRobot with X-VLA extras
pip install -e ".[xvla]"
# Verify
python -c "from lerobot.policies.xvla import XVLAPolicy; print('OK')"
Load a pretrained checkpoint
X-VLA ships several HuggingFace checkpoints:
| Checkpoint | Description | Use case |
|---|---|---|
| lerobot/xvla-base | 0.9B pretrained on 290K episodes | Fine-tune for new tasks |
| lerobot/xvla-libero | Fine-tuned LIBERO (98.1%) | Evaluate LIBERO immediately |
| lerobot/xvla-widowx | WidowX pick-and-place | SimplerEnv demo |
| lerobot/xvla-folding | Cloth folding 100% | Hard bimanual task |
| lerobot/xvla-agibot-world | AgileX dexterous | General bimanual |
| lerobot/xvla-google-robot | Google Robot RT-1 setup | Cross-domain demo |
Quick inference test:
from lerobot.policies.xvla import XVLAPolicy
import torch
# Load model
policy = XVLAPolicy.from_pretrained("lerobot/xvla-base")
policy = policy.to("cuda").eval()
# Dummy observation
obs = {
    "observation.images.primary": torch.randn(1, 3, 224, 224).cuda(),
    "observation.images.wrist": torch.randn(1, 3, 224, 224).cuda(),
    "observation.state": torch.randn(1, 7).cuda(),
    "task": ["pick up the red block"],
    "domain_id": torch.tensor([0]).cuda(),
}
# Inference: generate a 32-step action chunk
with torch.no_grad():
    action_chunk = policy.select_action(obs)
print(action_chunk.shape)  # [1, 32, 20]
4. Train on your dataset
Dataset format
LeRobot datasets follow a standard schema — see LeRobot Ecosystem for the teleop-to-dataset recording flow. Required layout:
your-dataset/
├── meta/
│   ├── episodes.jsonl                 # per-episode metadata
│   ├── tasks.jsonl                    # natural language instructions
│   └── stats.json                     # mean/std for normalization
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet     # state + action
└── videos/
    └── chunk-000/
        └── observation.images.primary/
            └── episode_000000.mp4
Each episode needs:
- Images: at least 1 camera (RGB, 224×224 or higher)
- State: proprioceptive vector (joints + gripper)
- Action: vector of matching dimension
- Task: natural language string
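Before uploading, it can help to sanity-check each episode against the four requirements above. The sketch below works on a simplified in-memory representation; the frame dictionary keys are hypothetical and do not mirror the on-disk LeRobot schema, and "matching dimension" is interpreted here as "consistent across frames", which is an assumption.

```python
def validate_episode(frames, min_image_side=224):
    """Return a list of problems found in one episode (empty list means it looks usable).

    frames: list of dicts with hypothetical keys
      "images": {camera_name: (height, width)}, "state": list, "action": list, "task": str
    """
    if not frames:
        return ["episode has no frames"]
    problems = []
    state_dim = len(frames[0]["state"])
    action_dim = len(frames[0]["action"])
    for i, frame in enumerate(frames):
        if not frame["images"]:
            problems.append(f"frame {i}: no camera image")
        for cam, (height, width) in frame["images"].items():
            if min(height, width) < min_image_side:
                problems.append(f"frame {i}: {cam} is {height}x{width}, below {min_image_side}")
        if len(frame["state"]) != state_dim:
            problems.append(f"frame {i}: state dim {len(frame['state'])} != {state_dim}")
        if len(frame["action"]) != action_dim:
            problems.append(f"frame {i}: action dim {len(frame['action'])} != {action_dim}")
        if not frame["task"].strip():
            problems.append(f"frame {i}: empty task string")
    return problems

good = {"images": {"primary": (224, 224)}, "state": [0.0] * 7, "action": [0.0] * 7, "task": "pick up the cup"}
bad = {"images": {"primary": (120, 160)}, "state": [0.0] * 7, "action": [0.0] * 6, "task": " "}
print(validate_episode([good, good]))  # []
print(len(validate_episode([good, bad])))  # 3
```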
Basic fine-tune
For a new task on a standard robot (Franka/SO-101/UR-5):
lerobot-train \
--dataset.repo_id=YOUR_USER/your-task-dataset \
--output_dir=./outputs/xvla_my_task \
--policy.path="lerobot/xvla-base" \
--policy.dtype=bfloat16 \
--policy.action_mode=auto \
--steps=20000 \
--policy.device=cuda \
--policy.freeze_vision_encoder=false \
--policy.freeze_language_encoder=false \
--policy.train_policy_transformer=true \
--policy.train_soft_prompts=true \
--batch_size=8 \
--num_workers=4
Key parameters:
- --policy.action_mode=auto: use this for new robots; X-VLA auto-detects the dataset dimension and pads/trims
- --policy.train_soft_prompts=true: train soft prompts (required for new embodiments)
- --policy.dtype=bfloat16: cuts VRAM 50% with near-zero accuracy loss
Soft-prompts-only fine-tune (PEFT-style)
If you only have 24GB VRAM:
lerobot-train \
--dataset.repo_id=YOUR_USER/your-task-dataset \
--policy.path="lerobot/xvla-base" \
--policy.freeze_vision_encoder=true \
--policy.freeze_language_encoder=true \
--policy.train_policy_transformer=false \
--policy.train_soft_prompts=true \
--policy.dtype=bfloat16 \
--steps=10000
Only ~9M params are trainable, about 100× fewer than a full fine-tune of the 0.9B model (and about 300× fewer than fine-tuning a 3B model such as π0), yet this still achieves 90%+ performance on similar tasks. This is the "Phase II" mode in the paper and X-VLA's main selling point for small teams.
Critical hyperparameter
The paper recommends training the VLM (vision + language encoder) at 1/10 of the base learning rate and all other components at the full LR. The reason: the VLM is already strongly pretrained, and updating it too aggressively causes catastrophic forgetting.
LeRobot config handles this when --policy.freeze_vision_encoder=false — but if you write a custom trainer, remember to set per-group LRs.
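For a custom trainer, the per-group learning rates could be set up roughly as follows. This is a sketch under assumptions: the parameter-name prefixes are hypothetical and must be adapted to the actual module names in your model, and the helper itself is not part of LeRobot.

```python
def build_param_groups(named_params, base_lr=1e-4, vlm_lr_scale=0.1,
                       vlm_prefixes=("vision_encoder.", "language_encoder.")):
    """Split parameters into a VLM group at reduced LR and everything else at base LR."""
    vlm, rest = [], []
    for name, param in named_params:
        (vlm if name.startswith(vlm_prefixes) else rest).append(param)
    groups = []
    if vlm:  # skip empty groups; optimizers reject an empty parameter list
        groups.append({"params": vlm, "lr": base_lr * vlm_lr_scale})
    if rest:
        groups.append({"params": rest, "lr": base_lr})
    return groups

# Stand-in for model.named_parameters(): strings play the role of tensors here.
fake_params = [
    ("vision_encoder.layer0.weight", "w_vis"),
    ("language_encoder.embed.weight", "w_txt"),
    ("transformer.block0.weight", "w_tr"),
    ("soft_prompts", "w_sp"),
]
groups = build_param_groups(fake_params)
print([len(g["params"]) for g in groups])  # [2, 2]
```

With a real model, the returned list can be passed directly to a PyTorch optimizer, for example `torch.optim.AdamW(build_param_groups(model.named_parameters()))`, since PyTorch optimizers accept a list of dicts with per-group `lr` values.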
5. Inference on a real robot
Server-client architecture
X-VLA separates the model server from the robot environment over HTTP — important because robot dependencies (ROS, drivers) often conflict with PyTorch CUDA.
Server (GPU machine):
lerobot-serve \
--policy.path="./outputs/xvla_my_task" \
--port 8765 \
--device cuda
Client (machine connected to robot):
import requests
import numpy as np

CHUNK_STEP = 8  # actions executed from each 32-step chunk before re-querying

def get_action(images_dict, state, instruction, domain_id=0):
    payload = {
        "observation.state": state.tolist(),
        "task": instruction,
        "domain_id": domain_id,
    }
    # Encode images (base64 or multipart). encode_image is a helper you supply;
    # its output must match the encoding the server expects.
    for cam_name, img in images_dict.items():
        payload[f"observation.images.{cam_name}"] = encode_image(img)
    response = requests.post(
        "http://gpu-server:8765/act",
        json=payload,
        timeout=2.0,
    )
    response.raise_for_status()  # fail loudly instead of parsing an error body
    return np.array(response.json()["action"])

# 30Hz control loop (assumes robot.execute blocks for one control period)
while True:
    obs = robot.get_observation()
    action_chunk = get_action(obs["images"], obs["state"], "pick up the cup")
    # Execute the first CHUNK_STEP actions of the 32-step chunk, then re-query
    for action in action_chunk[:CHUNK_STEP]:
        robot.execute(action)
Async inference for real-time
Action chunking (32 steps) enables async inference: while the robot executes the current chunk, the server can compute the next one. Effective latency drops from 200-400ms per action to ~30-50ms.
The LeRobot HilSerl Real Robot RL post has sample code for this async pattern.
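As a rough illustration of the overlap idea (not the code from that post), the sketch below requests the next chunk in a background thread while the current one executes. All function names are hypothetical, and the observation and server are replaced by toy stand-ins so the control flow is easy to follow.

```python
from concurrent.futures import ThreadPoolExecutor

def run_async_control(get_obs, fetch_chunk, execute, n_chunks, chunk_step=8):
    """Request chunk k+1 in a background thread while chunk k is being executed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_chunk, get_obs())  # only this first request is waited on in full
        for k in range(n_chunks):
            chunk = future.result()
            if k + 1 < n_chunks:
                # Start the next request now so it overlaps with execution below.
                future = pool.submit(fetch_chunk, get_obs())
            for action in chunk[:chunk_step]:
                execute(action)

# Toy stand-ins (hypothetical): observations are counters, and the "server"
# labels each action with the observation it was computed from.
obs_counter = iter(range(1000))
executed = []
run_async_control(
    get_obs=lambda: next(obs_counter),
    fetch_chunk=lambda obs: [(obs, step) for step in range(32)],
    execute=executed.append,
    n_chunks=3,
)
print(len(executed), executed[0], executed[8])  # 24 (0, 0) (1, 0)
```

One trade-off of this simple version: each chunk is computed from an observation taken before the previous chunk finished executing, so the actions are slightly stale. Shorter `chunk_step` values reduce that staleness at the cost of more requests.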
6. Evaluating results
Eval on LIBERO
lerobot-eval \
--policy.path="lerobot/xvla-libero" \
--env.type=libero \
--env.task=libero_spatial,libero_goal,libero_10,libero_object \
--env.control_mode=absolute \
--eval.batch_size=1 \
--eval.n_episodes=50 \
--env.episode_length=800 \
--seed=142
Expected results after ~30 minutes on an A100:
- LIBERO-Spatial: 96-98%
- LIBERO-Goal: 96-99%
- LIBERO-Object: 98-100%
- LIBERO-10: 92-95%
This is the baseline to compare your custom training against.
WandB logging
Add to your training command:
--wandb.enable=true \
--wandb.project=xvla-finetune \
--wandb.run_name=my-task-v1
Track: loss/flow_matching, loss/gripper_bce, validation/success_rate, gradients/soft_prompt_norm.
7. Practical tips from real use
Common errors
- CUDA OOM on load: use --policy.dtype=bfloat16 instead of float32, which halves VRAM
- Action dimension mismatch: set --policy.action_mode=auto so X-VLA handles padding
- Soft prompt won't converge: check the learning rate; soft prompts need a higher LR than the backbone (~5e-4 vs 1e-4)
- Slow inference: reduce flow-matching num_inference_steps from 10 to 4-5 (slight accuracy loss but 2× faster)
Domain ID — don't forget!
Each embodiment has its own domain_id:
| Dataset | Domain ID |
|---|---|
| Bridge | 0 |
| RT-1 | 1 |
| CALVIN | 2 |
| LIBERO | 3 |
| WidowX (air) | 4 |
| AIR-AGILEX-HQ | 5 |
| AGIBOT-challenge | 9 |
Forgetting domain_id on inference → model uses default (0 = Bridge) → wrong soft prompt → policy fails. Always match domain_id with your training checkpoint.
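One way to avoid the silent fallback is to resolve the ID explicitly and fail on anything unknown. The mapping below copies the table above; the helper name is hypothetical and not part of LeRobot.

```python
DOMAIN_IDS = {
    "Bridge": 0,
    "RT-1": 1,
    "CALVIN": 2,
    "LIBERO": 3,
    "WidowX (air)": 4,
    "AIR-AGILEX-HQ": 5,
    "AGIBOT-challenge": 9,
}

def resolve_domain_id(dataset_name):
    """Look up the domain ID explicitly so a typo never falls back to 0 (Bridge)."""
    try:
        return DOMAIN_IDS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_IDS))
        raise ValueError(f"unknown dataset {dataset_name!r}; known: {known}") from None

print(resolve_domain_id("LIBERO"))  # 3
```

Passing the result of this lookup as `domain_id` in every inference request keeps the soft prompt aligned with the checkpoint you trained.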
When to use X-VLA vs alternatives?
| Situation | Pick |
|---|---|
| Multi-robot fleet (3+ embodiments) | X-VLA — pretrain once, swap prompts |
| Single robot, small dataset (<5K eps) | π0-FAST or VLA-Adapter |
| Single robot, large dataset, 1 task | OpenVLA or fine-tune RT-2 |
| Bimanual humanoid | X-VLA-AgiBot or WholeBodyVLA |
| Consumer GPU (RTX 3060/4090) | VLA-Adapter 0.5B |
8. Learning roadmap
After grasping X-VLA, go further with:
- Read the paper — arXiv 2510.10274 (33 pages, worth reading section 3 on soft prompt design carefully)
- Collect your own 100-500 episodes via teleop, train soft prompts for your task
- Compare with baselines — train the same dataset with ACT, Diffusion Policy, OpenVLA to understand trade-offs
- Contribute a custom action mode to upstream LeRobot if your robot is unusual (just 30 lines like the example in docs)
Conclusion
X-VLA is a clear step forward for cross-embodiment VLA: instead of training n models for n robots, train 1 backbone + n soft prompts. With LeRobot integration, beginners can now:
- Load lerobot/xvla-base in one line
- Fine-tune for their own task with ~9M trainable params on an RTX 4090
- Deploy via HTTP server-client, safe for ROS-based robot stacks
Code, weights, dataset are all open-source under Apache 2.0 — no barriers for teams wanting to research or ship products. If you're building a robot fleet or doing manipulation research, X-VLA is the most reasonable bet in the 2026 VLA landscape.
References
- Paper X-VLA arXiv 2510.10274 — Soft-Prompted Transformer as Scalable Cross-Embodiment VLA, ICLR 2026
- GitHub 2toinf/X-VLA — Reference implementation
- LeRobot X-VLA docs — Integration guide
- HuggingFace lerobot/xvla-base — 0.9B pretrained checkpoint
- Project page — Demo videos + cloth folding dataset