X-VLA ICLR 2026: Soft-Prompted 0.9B VLA on LeRobot

Vision-Language-Action (VLA) research is moving in two directions: scaling up to billions of parameters (RT-2, OpenVLA 7B, π0 3B) or scaling down to consumer GPUs (VLA-Adapter 0.5B). But both directions skip an important question: how do you train a single model that runs across many different robots — Franka, WidowX, Google Robot, Agilex bimanual — without separately fine-tuning each one?

This is exactly what X-VLA (ICLR 2026, paper arXiv:2510.10274) solves. With just 0.9B parameters, X-VLA achieves SOTA on 6 simulation benchmarks + 3 real robots, won the AgiBot World Challenge at IROS 2025, and most importantly: it's been natively integrated into LeRobot — a single line policy.type=xvla is enough to train.

In this tutorial I'll walk you from the "soft prompt" idea → flow-matching architecture → LeRobot installation → training on your dataset → inference on a real robot. Beginner-friendly, no heavy Transformer background required.

X-VLA soft-prompted cross-embodiment vision-language-action model

1. Why this paper matters

The cross-embodiment problem

Imagine you have 3 robots: a 7-DOF Franka, a 6-DOF WidowX, and a pair of AgileX bimanual arms. Each robot has:

Different action space — 7 joints vs 6 joints vs 14 joints
Different camera setup — 1 wrist + 1 third-person vs 2 wrist vs 3 cameras
Different gripper — parallel jaw vs underactuated vs custom

The old way (OpenVLA, π0): train separately for each embodiment, or try to tokenize actions as text and let the LLM figure it out — but quality is uneven, and every new robot needs heavy fine-tuning.

X-VLA's answer: use a "soft prompt", a learnable set of embeddings per robot type, and share the Transformer backbone across all of them. Just as one LLM takes different prompts for different tasks, X-VLA uses the same Transformer with a different prompt for each robot.

Striking results

Benchmark	Embodiment	X-VLA Score	Comparison
LIBERO (4 suites)	Franka	98.1%	π0: ~97%, OpenVLA: 76.5%
SimplerEnv WidowX	WidowX	95.8%	RT-1-X: 64%
Google Robot (VM)	Google Robot	83.5%	OpenVLA: 71%
CALVIN (ABCD→D)	Franka	4.43/5	RoboFlamingo: 3.49
RoboTwin2	AgileX bimanual	70%	π0-FAST: 58%

More importantly: X-VLA-LIBERO reaches near-π0 performance with roughly 300× fewer trainable parameters. Adapting to a new embodiment trains only the soft prompts (~9M params), whereas fully fine-tuning π0 updates its entire ~3B-parameter model.

2. X-VLA architecture

Flow-matching overview

X-VLA doesn't use DDPM-style diffusion, nor autoregressive tokens like RT-2. It uses flow matching — a generative model family that learns the vector field between noise and data, generating 32-step action chunks in a few denoising steps.
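To make the idea of "a few denoising steps" concrete, here is a minimal sketch of flow-matching sampling with plain Euler integration. It is not X-VLA's sampler: `toy_velocity` is a hypothetical stand-in that is handed the target directly, whereas the real network predicts the velocity from images, language, and state. The time convention (noise at t=0, actions at t=1) is also an assumption.

```python
import random


def toy_velocity(x, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * noise + t * target.

    Hypothetical stand-in for the trained network, which would predict this
    field from observations instead of being given the target.
    """
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]


def sample_actions(velocity_fn, dim, num_steps, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# A flattened 32-step x 20-D chunk, filled with a dummy target value.
target_chunk = [0.5] * (32 * 20)
chunk = sample_actions(
    lambda x, t: toy_velocity(x, t, target_chunk), dim=32 * 20, num_steps=5
)
print(max(abs(c - 0.5) for c in chunk))  # distance from the target after 5 steps
```

With a learned velocity field the path is only approximately straight, which is why the number of integration steps trades speed against accuracy.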

The main pipeline:

[RGB images] → Vision encoder ────┐
[Language instruction] → Text enc ┤
[Proprioceptive state] → MLP ─────┤── Transformer (24 layers, 1024 dim)
[Soft prompt embeddings] ─────────┤              │
[Domain ID embedding] ────────────┘              ▼
                                         Flow-matching head
                                                 │
                                                 ▼
                                    Action chunk (32 steps × 20-D)

Soft prompt mechanism

Each embodiment has a set of 32 learnable embedding vectors (dim 1024). During training, the model learns these vectors as a "preamble" that gets prepended to the Transformer input. For a new robot:

Phase I (pretrain): Train the backbone + all soft prompts on 290K episodes from 7 platforms
Phase II (adapt): Freeze the backbone, only train a new soft prompt (~9M params) for the new embodiment

This is the big difference from LoRA: LoRA injects low-rank updates into weights, while soft prompts inject embeddings into inputs — simpler, lighter, and well-validated by NLP prompt tuning work (Prefix Tuning, P-Tuning v2).
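The prepend mechanism described above can be sketched in a few lines. This is an illustration in plain Python, not X-VLA's code: the class name `SoftPromptBank` and its methods are hypothetical, and a real implementation would hold the prompts as trainable tensors.

```python
import random

PROMPT_LEN = 32    # soft-prompt tokens per embodiment (from the text)
HIDDEN_DIM = 1024  # Transformer width (from the text)


class SoftPromptBank:
    """One learnable prompt (PROMPT_LEN x dim) per embodiment, keyed by domain ID."""

    def __init__(self, domain_ids, dim=HIDDEN_DIM, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self.prompts = {d: self._new_prompt() for d in domain_ids}

    def _new_prompt(self):
        # Small random init; training would update these values.
        return [
            [self._rng.gauss(0.0, 0.02) for _ in range(self.dim)]
            for _ in range(PROMPT_LEN)
        ]

    def add_embodiment(self, domain_id):
        """Phase II: allocate a fresh prompt; the shared backbone is untouched."""
        if domain_id in self.prompts:
            raise ValueError(f"domain_id {domain_id} already has a prompt")
        self.prompts[domain_id] = self._new_prompt()

    def prepend(self, domain_id, tokens):
        """Backbone input: prompt tokens followed by the observation tokens."""
        return self.prompts[domain_id] + tokens


# Tiny dimensions so the demo runs instantly.
bank = SoftPromptBank(domain_ids=[0, 1, 2], dim=8)
obs_tokens = [[0.0] * 8 for _ in range(5)]
sequence = bank.prepend(1, obs_tokens)
print(len(sequence))             # prompt tokens + observation tokens
print(PROMPT_LEN * HIDDEN_DIM)   # values in one prompt at the quoted sizes
```

At the sizes quoted above, one prompt holds 32 × 1024 = 32,768 values, so the ~9M figure presumably counts additional embodiment-specific parameters; this sketch does not try to reproduce that number.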

Unified EE6D action space

X-VLA uses EE6D (End-Effector 6D) as the canonical action space: 3D position + 6D rotation representation + gripper signal + padding = 20 dimensions. Every other robot (joint-space 7-DOF, bimanual 14-DOF) gets mapped into this 20-D space via an Action Registry — if the robot has fewer dimensions, the rest is zero-padded and ignored in the loss.

This allows a single forward pass to handle any embodiment, differing only in soft prompt + domain ID.
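As a rough illustration of the padding-and-masking idea, the sketch below packs a single-arm pose into a 20-D vector and computes a loss that ignores the padded slots. The slot ordering (position, rotation, gripper, then zeros) and the function names are assumptions for illustration; the 6D rotation encoding used here is the common "first two columns of the rotation matrix" form.

```python
ACTION_DIM = 20  # canonical width quoted in the text


def rotation_to_6d(R):
    """6D rotation representation: the first two columns of a 3x3 rotation matrix."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def pack_ee6d(position, R, gripper):
    """Return (action, mask): a zero-padded 20-D vector and a 0/1 loss mask."""
    used = list(position) + rotation_to_6d(R) + [gripper]  # 3 + 6 + 1 = 10 values
    pad = ACTION_DIM - len(used)
    action = used + [0.0] * pad
    mask = [1.0] * len(used) + [0.0] * pad
    return action, mask


def masked_mse(pred, target, mask):
    """Mean squared error over only the dimensions the robot actually uses."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / sum(mask)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
action, mask = pack_ee6d([0.1, 0.2, 0.3], identity, gripper=1.0)

# Errors in the padded slots do not contribute to the loss.
noisy = action[:10] + [5.0] * 10
print(masked_mse(noisy, action, mask))
```

A single arm fills 10 of the 20 slots under this layout, which would leave room for a second arm on bimanual platforms.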

Real robot arm controlled by a VLA model

3. Installing LeRobot with X-VLA

Hardware requirements

Purpose	Minimum GPU	Recommended
Inference only	RTX 3060 12GB	RTX 4090
Soft-prompt fine-tune	RTX 4090 24GB	A100 40GB
Full pretrain (290K episodes)	8× A100 80GB	8× H100

For beginners: an RTX 4090, or an A100 rented on Vast.ai (~$1/hour), is enough to fine-tune soft prompts for your own task.

Environment setup

# Create conda env
conda create -n xvla python=3.10 -y
conda activate xvla

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Install LeRobot with X-VLA extras
pip install -e ".[xvla]"

# Verify
python -c "from lerobot.policies.xvla import XVLAPolicy; print('OK')"

Load a pretrained checkpoint

X-VLA ships several HuggingFace checkpoints:

Checkpoint	Description	Use case
`lerobot/xvla-base`	0.9B pretrained on 290K episodes	Fine-tune for new tasks
`lerobot/xvla-libero`	Fine-tuned LIBERO (98.1%)	Evaluate LIBERO immediately
`lerobot/xvla-widowx`	WidowX pick-and-place	SimplerEnv demo
`lerobot/xvla-folding`	Cloth folding 100%	Hard bimanual task
`lerobot/xvla-agibot-world`	AgileX dexterous	General bimanual
`lerobot/xvla-google-robot`	Google Robot RT-1 setup	Cross-domain demo

Quick inference test:

from lerobot.policies.xvla import XVLAPolicy
import torch

# Load model
policy = XVLAPolicy.from_pretrained("lerobot/xvla-base")
policy = policy.to("cuda").eval()

# Dummy observation
obs = {
    "observation.images.primary": torch.randn(1, 3, 224, 224).cuda(),
    "observation.images.wrist": torch.randn(1, 3, 224, 224).cuda(),
    "observation.state": torch.randn(1, 7).cuda(),
    "task": ["pick up the red block"],
    "domain_id": torch.tensor([0]).cuda(),
}

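# Inference: request the full 32-step action chunk.
# In LeRobot, select_action() conventionally returns one action per call from
# an internal queue, while predict_action_chunk() returns the whole chunk.
with torch.no_grad():
    action_chunk = policy.predict_action_chunk(obs)
print(action_chunk.shape)  # [1, 32, 20]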

4. Train on your dataset

Dataset format

LeRobot datasets follow a standard schema — see LeRobot Ecosystem for the teleop-to-dataset recording flow. Required layout:

your-dataset/
├── meta/
│   ├── episodes.jsonl       # per-episode metadata
│   ├── tasks.jsonl          # natural language instructions
│   └── stats.json           # mean/std for normalization
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet  # state + action
└── videos/
    └── chunk-000/
        └── observation.images.primary/
            └── episode_000000.mp4

Each episode needs:

Images: at least 1 camera (RGB, 224×224 or higher)
State: proprioceptive vector (joints + gripper)
Action: vector of matching dimension
Task: natural language string
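Before launching a training run, it can save time to sanity-check the folder against the layout shown above. The helper below is a hypothetical sketch, not part of LeRobot; it assumes exactly the file names from this post, and other dataset format versions may be organized differently.

```python
from pathlib import Path

REQUIRED_META = ("episodes.jsonl", "tasks.jsonl", "stats.json")


def check_layout(root):
    """Return a list of problems found under a dataset root; empty means it looks fine."""
    root = Path(root)
    problems = [
        f"missing meta/{name}"
        for name in REQUIRED_META
        if not (root / "meta" / name).is_file()
    ]
    # Every episode needs a parquet file with state + action ...
    episodes = sorted((root / "data").glob("chunk-*/episode_*.parquet"))
    if not episodes:
        problems.append("no data/chunk-*/episode_*.parquet files")
    # ... and at least one camera video with the same episode name.
    videos = {p.stem for p in (root / "videos").glob("chunk-*/*/episode_*.mp4")}
    for episode in episodes:
        if episode.stem not in videos:
            problems.append(f"{episode.stem} has no video in any camera folder")
    return problems


for problem in check_layout("your-dataset"):
    print(problem)
```

The check is structural only: it does not open the parquet files or verify that state and action dimensions match.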

Basic fine-tune

For a new task on a standard robot (Franka/SO-101/UR-5):

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --output_dir=./outputs/xvla_my_task \
  --policy.path="lerobot/xvla-base" \
  --policy.dtype=bfloat16 \
  --policy.action_mode=auto \
  --steps=20000 \
  --policy.device=cuda \
  --policy.freeze_vision_encoder=false \
  --policy.freeze_language_encoder=false \
  --policy.train_policy_transformer=true \
  --policy.train_soft_prompts=true \
  --batch_size=8 \
  --num_workers=4

Key parameters:

--policy.action_mode=auto: use this for new robots; X-VLA auto-detects the dataset's action dimension and pads or trims to fit
--policy.train_soft_prompts=true: train soft prompts (required for new embodiments)
--policy.dtype=bfloat16: halves weight memory relative to float32, with near-zero accuracy loss
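The memory effect of the dtype flag is simple arithmetic on the parameter count. The snippet below estimates weight storage only, using the 0.9B figure from this post; gradients, optimizer state, and activations come on top, so total VRAM during training will not shrink by exactly half.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NUM_PARAMS = 0.9e9  # model size quoted in this post

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

This works out to roughly 3.4 GiB in float32 versus 1.7 GiB in bfloat16.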

Soft-prompts-only fine-tune (PEFT-style)

If you only have 24GB VRAM:

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --policy.path="lerobot/xvla-base" \
  --policy.freeze_vision_encoder=true \
  --policy.freeze_language_encoder=true \
  --policy.train_policy_transformer=false \
  --policy.train_soft_prompts=true \
  --policy.dtype=bfloat16 \
  --steps=10000

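Only ~9M trainable params: about 1% of the 0.9B model (roughly 100× fewer than a full fine-tune, and about 300× fewer than fully fine-tuning a 3B model such as π0), yet it still achieves 90%+ performance on similar tasks. This is the "Phase II" mode in the paper and X-VLA's main selling point for small teams.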

Critical hyperparameter

The paper recommends: train the VLM (vision + language encoder) at 1/10 base learning rate, other components at full LR. Reason: the VLM is already strongly pretrained, touching it too much causes catastrophic forgetting.

LeRobot config handles this when --policy.freeze_vision_encoder=false — but if you write a custom trainer, remember to set per-group LRs.
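For a custom trainer, the 1/10 rule can be expressed as optimizer parameter groups. The sketch below is an assumption-laden illustration: the name prefixes `vision_encoder.` and `language_encoder.` are hypothetical and need to match your model's actual parameter names. The returned list of `{"params", "lr"}` dicts follows the per-parameter-group format that PyTorch optimizers accept.

```python
def build_param_groups(
    named_params,
    base_lr,
    vlm_prefixes=("vision_encoder.", "language_encoder."),  # hypothetical names
    vlm_scale=0.1,
):
    """Split parameters into a VLM group (scaled LR) and everything else (base LR)."""
    vlm, rest = [], []
    for name, param in named_params:
        (vlm if name.startswith(vlm_prefixes) else rest).append(param)
    groups = [
        {"params": vlm, "lr": base_lr * vlm_scale},
        {"params": rest, "lr": base_lr},
    ]
    # Drop empty groups, e.g. when the VLM is frozen and filtered out upstream.
    return [g for g in groups if g["params"]]


# Strings stand in for real parameter tensors in this demo.
named = [
    ("vision_encoder.patch_embed", "w0"),
    ("language_encoder.embed", "w1"),
    ("policy_transformer.layer0", "w2"),
    ("soft_prompts.bank", "w3"),
]
for group in build_param_groups(named, base_lr=1e-4):
    print(len(group["params"]), group["lr"])
```

With a real model you would pass `model.named_parameters()` (filtered to trainable parameters) and hand the result to an optimizer such as `torch.optim.AdamW`.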

5. Inference on a real robot

Server-client architecture

X-VLA separates the model server from the robot environment over HTTP — important because robot dependencies (ROS, drivers) often conflict with PyTorch CUDA.

Server (GPU machine):

lerobot-serve \
  --policy.path="./outputs/xvla_my_task" \
  --port 8765 \
  --device cuda

Client (machine connected to robot):

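import base64
import io
import time

import numpy as np
import requests
from PIL import Image

SERVER_URL = "http://gpu-server:8765/act"
CONTROL_HZ = 30   # target control rate
CHUNK_STEP = 8    # actions executed from each 32-step chunk before re-querying

def encode_image(img):
    # Assumption: img is an HxWx3 uint8 array and the server accepts
    # base64-encoded PNG; adapt this to whatever your server expects.
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")

def get_action(images_dict, state, instruction, domain_id=0):
    payload = {
        "observation.state": state.tolist(),
        "task": instruction,
        "domain_id": domain_id,
    }
    for cam_name, img in images_dict.items():
        payload[f"observation.images.{cam_name}"] = encode_image(img)

    response = requests.post(SERVER_URL, json=payload, timeout=2.0)
    response.raise_for_status()  # fail loudly instead of parsing an error body
    return np.array(response.json()["action"])

# 30 Hz control loop; `robot` is your own hardware interface object
while True:
    obs = robot.get_observation()
    action_chunk = get_action(obs["images"], obs["state"], "pick up the cup")

    # Execute the first CHUNK_STEP actions of the 32-step chunk, then re-query
    for action in action_chunk[:CHUNK_STEP]:
        start = time.monotonic()
        robot.execute(action)
        # Sleep off the remainder of the control period to hold the rate
        time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - start)))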

Async inference for real-time

Action chunking (32 steps) enables async inference: while the robot executes the current chunk, the server can compute the next one. Effective latency drops from 200-400ms per action to ~30-50ms.

The LeRobot HilSerl Real Robot RL post has sample code for this async pattern.
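One simple way to overlap inference with execution is to request the next chunk in a background thread while the current one is being played out. The sketch below uses only the standard library; `fake_policy` is a hypothetical stand-in for the HTTP call to the model server, and the function names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

EXECUTE_STEPS = 8  # actions executed from each chunk before switching to the next


def fake_policy(obs):
    """Stand-in for the server request; returns a 32-step chunk of dummy actions."""
    time.sleep(0.05)  # simulated inference latency
    return [(obs, step) for step in range(32)]


def run_async(num_chunks, get_obs, policy, execute):
    """Execute chunks back to back while the next one is computed in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(policy, get_obs())  # first request
        for i in range(num_chunks):
            chunk = pending.result()  # blocks only if inference is still running
            if i + 1 < num_chunks:
                # Note: this observation predates the actions about to be
                # executed, so the next chunk is planned from slightly stale state.
                pending = pool.submit(policy, get_obs())
            for action in chunk[:EXECUTE_STEPS]:
                execute(action)


executed = []
obs_counter = iter(range(1000))
run_async(3, lambda: next(obs_counter), fake_policy, executed.append)
print(len(executed))
```

The staleness noted in the comment is the main design trade-off: executing fewer steps per chunk keeps plans fresher but leaves less time to hide inference latency.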

Bimanual robot manipulation with VLA controller

6. Evaluating results

Eval on LIBERO

lerobot-eval \
  --policy.path="lerobot/xvla-libero" \
  --env.type=libero \
  --env.task=libero_spatial,libero_goal,libero_10,libero_object \
  --env.control_mode=absolute \
  --eval.batch_size=1 \
  --eval.n_episodes=50 \
  --env.episode_length=800 \
  --seed=142

Expected results after ~30 minutes on an A100:

LIBERO-Spatial: 96-98%
LIBERO-Goal: 96-99%
LIBERO-Object: 98-100%
LIBERO-10: 92-95%

This is the baseline to compare your custom training against.

WandB logging

Add to your training command:

--wandb.enable=true \
--wandb.project=xvla-finetune \
--wandb.run_name=my-task-v1

Track: loss/flow_matching, loss/gripper_bce, validation/success_rate, gradients/soft_prompt_norm.

7. Practical tips from real use

Common errors

CUDA OOM on load — Use --policy.dtype=bfloat16 instead of float32, halves VRAM
Action dimension mismatch — Set --policy.action_mode=auto so X-VLA handles padding
Soft prompt won't converge — Check the learning rate, soft prompts need higher LR than backbone (~5e-4 vs 1e-4)
Slow inference — Reduce flow-matching num_inference_steps from 10 to 4-5 (slight accuracy loss but 2× faster)

Domain ID — don't forget!

Each embodiment has its own domain_id:

Dataset	Domain ID
Bridge	0
RT-1	1
CALVIN	2
LIBERO	3
WidowX (air)	4
AIR-AGILEX-HQ	5
AGIBOT-challenge	9

Forgetting domain_id on inference → model uses default (0 = Bridge) → wrong soft prompt → policy fails. Always match domain_id with your training checkpoint.
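A small lookup that fails loudly is one way to avoid silently falling back to 0. The mapping below simply copies the table from this post; the helper name `domain_id_for` is hypothetical.

```python
# Domain IDs as listed in the table above.
DOMAIN_IDS = {
    "Bridge": 0,
    "RT-1": 1,
    "CALVIN": 2,
    "LIBERO": 3,
    "WidowX (air)": 4,
    "AIR-AGILEX-HQ": 5,
    "AGIBOT-challenge": 9,
}


def domain_id_for(dataset_name):
    """Return the domain ID for a dataset, raising instead of defaulting to 0."""
    try:
        return DOMAIN_IDS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_IDS))
        raise ValueError(f"unknown dataset {dataset_name!r}; known: {known}") from None


print(domain_id_for("LIBERO"))
```

Passing the result explicitly on every inference call keeps the soft prompt tied to the checkpoint you trained.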

When to use X-VLA vs alternatives?

Situation	Pick
Multi-robot fleet (3+ embodiments)	X-VLA — pretrain once, swap prompts
Single robot, small dataset (<5K eps)	π0-FAST or VLA-Adapter
Single robot, large dataset, 1 task	OpenVLA or fine-tune RT-2
Bimanual humanoid	X-VLA-AgiBot or WholeBodyVLA
Consumer GPU (RTX 3060/4090)	VLA-Adapter 0.5B

8. Learning roadmap

After grasping X-VLA, go further with:

Read the paper — arXiv 2510.10274 (33 pages, worth reading section 3 on soft prompt design carefully)
Collect your own 100-500 episodes via teleop, train soft prompts for your task
Compare with baselines — train the same dataset with ACT, Diffusion Policy, OpenVLA to understand trade-offs
Contribute a custom action mode to upstream LeRobot if your robot is unusual (just 30 lines like the example in docs)

Conclusion

X-VLA is a clear step forward for cross-embodiment VLA: instead of training n models for n robots, train 1 backbone + n soft prompts. With LeRobot integration, beginners can now:

Load lerobot/xvla-base in one line
Fine-tune for their own task with ~9M trainable params on an RTX 4090
Deploy via HTTP server-client, safe for ROS-based robot stacks

Code, weights, dataset are all open-source under Apache 2.0 — no barriers for teams wanting to research or ship products. If you're building a robot fleet or doing manipulation research, X-VLA is the most reasonable bet in the 2026 VLA landscape.

References

Paper X-VLA arXiv 2510.10274 — Soft-Prompted Transformer as Scalable Cross-Embodiment VLA, ICLR 2026
GitHub 2toinf/X-VLA — Reference implementation
LeRobot X-VLA docs — Integration guide
HuggingFace lerobot/xvla-base — 0.9B pretrained checkpoint
Project page — Demo videos + cloth folding dataset

Nguyễn Anh Tuấn
