X-VLA (ICLR 2026): A 0.9B Soft-Prompted VLA for LeRobot Beginners

Vision-Language-Action (VLA) models are racing in two directions: scaling up to billions of parameters (RT-2, OpenVLA 7B, π0 3B) or scaling down to consumer GPUs (VLA-Adapter 0.5B). Both directions skip an important question: how do you train a single model that runs on many different robots (Franka, WidowX, Google Robot, AgileX bimanual) without fine-tuning separately for each one?

This is exactly what X-VLA (ICLR 2026, arXiv:2510.10274) addresses. With only 0.9B parameters, X-VLA reaches SOTA on 6 simulation benchmarks plus 3 real robots, won the AgiBot World Challenge at IROS 2025, and, most importantly for Vietnamese readers, is integrated natively into LeRobot: a single policy.type=xvla setting is enough to start training.

In this post I'll walk you from the "soft prompt" idea → the flow-matching architecture → LeRobot installation → training on your own dataset → inference on a real robot. It is beginner-friendly and needs no deep Transformer background.

[Figure: X-VLA, a soft-prompted cross-embodiment vision-language-action model]

1. Why does this paper matter?

The cross-embodiment problem

Imagine you have three robots: a 7-DOF Franka, a 6-DOF WidowX, and an AgileX bimanual pair. Each robot has:

A different action space: 7 joints vs 6 joints vs 14 joints
A different camera setup: 1 wrist + 1 third-person vs 2 wrist vs 3 cameras
A different gripper: parallel jaw vs underactuated vs custom

The old approach (OpenVLA, π0): train separately for each embodiment, or tokenize actions as text and let the LLM sort it out. Quality is uneven, and every new robot needs heavy fine-tuning.

X-VLA's answer: use a "soft prompt", a set of learnable embeddings specific to each robot type, while sharing one Transformer backbone across all of them. Just as one LLM takes different prompts for different tasks, X-VLA uses one Transformer with different prompts for different robots.

Impressive results

Benchmark	Embodiment	X-VLA score	Comparison
LIBERO (4 suites)	Franka	98.1%	π0: ~97%, OpenVLA: 76.5%
SimplerEnv WidowX	WidowX	95.8%	RT-1-X: 64%
Google Robot (VM)	Google Robot	83.5%	OpenVLA: 71%
CALVIN (ABCD→D)	Franka	4.43/5	RoboFlamingo: 3.49
RoboTwin2	AgileX bimanual	70%	π0-FAST: 58%

More importantly, X-VLA-LIBERO reaches near-π₀ performance while using 300× fewer trainable parameters, because each new embodiment only requires training soft prompts (~9M params) rather than a whole 3B backbone.

2. X-VLA architecture

Flow-matching overview

X-VLA uses neither DDPM-style diffusion nor autoregressive tokens like RT-2. It uses flow matching, a family of generative models that learn a vector field between noise and data, and it generates a 32-step action chunk in a few denoising steps.
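To make "a vector field between noise and data" concrete, here is a minimal sketch of flow-matching sampling with Euler steps. It is not X-VLA's code: the learned network is replaced by a toy `oracle_velocity` that already knows the target, and the 3-D `target_action` is a made-up example. The only point is to show how a handful of integration steps carry a noise sample to an action.

```python
import random


def oracle_velocity(x, t, target):
    """Toy stand-in for the learned vector field.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    A real policy would predict this with a neural network instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=5, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_action = [0.1, -0.2, 0.3]  # hypothetical 3-D action, for illustration
sampled = sample_action(target_action, num_steps=5)
```

With this oracle field the sample lands on the target for any step count. With a learned, imperfect field, fewer steps usually mean a larger integration error, which is the speed versus accuracy trade-off you tune through the number of inference steps.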

Main pipeline:

[RGB images] → Vision encoder ────┐
[Language instruction] → Text enc ┤
[Proprioceptive state] → MLP ─────┤── Transformer (24 layers, 1024 dim)
[Soft prompt embeddings] ─────────┤              │
[Domain ID embedding] ────────────┘              ▼
                                         Flow-matching head
                                                 │
                                                 ▼
                                    Action chunk (32 steps × 20-D)

The soft prompt mechanism

Each embodiment has its own set of 32 learnable embedding vectors (dimension 1024). During training, the model learns these vectors as a "preamble" that is prepended to the Transformer input. Training runs in two phases, and the second is the one you repeat when a new robot arrives:

Phase I (pretrain): train the backbone plus all soft prompts on 290K episodes from 7 platforms
Phase II (adapt): freeze the backbone and train only the new soft prompt (~9M params) for the new embodiment

This is the key difference from LoRA: LoRA injects low-rank updates into the weights, whereas a soft prompt injects embeddings into the input. It is simpler and lighter, and it builds on prompt tuning, which has already proven effective in NLP (Prefix Tuning, P-Tuning v2).
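The prepend step described above can be sketched in plain Python. This is only an illustration of the idea, not X-VLA's implementation: the embodiment names, the `make_prompt` and `build_input` helpers, and the 50 dummy observation tokens are all assumptions. Only the 32 × 1024 prompt shape comes from the text.

```python
import random

NUM_PROMPT_TOKENS = 32   # per the article: 32 vectors per embodiment
EMBED_DIM = 1024         # per the article: each vector has size 1024


def make_prompt(seed):
    """Create one embodiment's soft prompt: a 32 x 1024 table of floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
            for _ in range(NUM_PROMPT_TOKENS)]


# One learnable prompt per embodiment; the names are hypothetical.
soft_prompts = {"franka": make_prompt(0), "widowx": make_prompt(1)}


def build_input(embodiment, obs_tokens):
    """Prepend the embodiment's soft prompt to the observation tokens."""
    return soft_prompts[embodiment] + obs_tokens


# 50 dummy observation tokens standing in for vision/language/state features.
obs_tokens = [[0.0] * EMBED_DIM for _ in range(50)]
franka_input = build_input("franka", obs_tokens)
widowx_input = build_input("widowx", obs_tokens)

prompt_params = NUM_PROMPT_TOKENS * EMBED_DIM  # values in one prompt table
```

The two inputs share the same observation tokens and differ only in their first 32 rows, which is all the backbone needs to tell the robots apart. Note that one 32 × 1024 table holds 32,768 values, far fewer than the ~9M trainable parameters quoted above, so that figure presumably counts more than a single prompt table; the post does not break it down.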

A unified EE6D action space

X-VLA uses EE6D (End-Effector 6D) as its standard action space: 3 position values + a 6D rotation representation + a gripper signal + padding = 20 dimensions. Every other robot (7-DOF joint space, 14-DOF bimanual) is mapped into this 20-D space through the Action Registry; if a robot has fewer dimensions, the remainder is zero-padded and ignored in the loss.
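Here is a small sketch of the "zero-pad and ignore in the loss" idea. The helper names `pad_action` and `masked_mse` are hypothetical, and how the real Action Registry lays out each robot inside the 20 dimensions is not covered here, so treat the layout as an assumption.

```python
ACTION_DIM = 20  # unified action width quoted in the article


def pad_action(action, width=ACTION_DIM):
    """Zero-pad a native action to the unified width and return a mask.

    The mask is 1.0 for real dimensions and 0.0 for padding.
    """
    if len(action) > width:
        raise ValueError(f"action has {len(action)} dims, max is {width}")
    pad = width - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1.0] * len(action) + [0.0] * pad
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the unmasked dimensions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / sum(mask)


# A 7-D action (hypothetical values) padded into the 20-D space.
target, mask = pad_action([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0])
# A prediction that is off by 0.5 on every dimension, padding included.
pred = [t + 0.5 for t in target]
loss = masked_mse(pred, target, mask)
```

The error on the 13 padded dimensions never reaches the loss, so the model is not penalized for whatever it outputs there.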

This lets the same forward pass handle every embodiment; only the soft prompt and domain ID differ.

[Figure: a real robot arm controlled by a VLA model]

3. Installing LeRobot with X-VLA

Hardware requirements

Purpose	Minimum GPU	Recommended
Inference only	RTX 3060 12GB	RTX 4090
Fine-tune soft prompts	RTX 4090 24GB	A100 40GB
Full pretrain (290K episodes)	8× A100 80GB	8× H100

For beginners in Vietnam: an RTX 4090, or a rented A100 on Vast.ai (~$1/hour), is enough to fine-tune soft prompts for your own task.

Set up the environment

# Create the conda env
conda create -n xvla python=3.10 -y
conda activate xvla

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Install LeRobot with the X-VLA dependencies
pip install -e ".[xvla]"

# Verify install
python -c "from lerobot.policies.xvla import XVLAPolicy; print('OK')"

Load pretrained checkpoint

Several X-VLA checkpoints are available on Hugging Face:

Checkpoint	Description	Use case
`lerobot/xvla-base`	0.9B, pretrained on 290K episodes	Fine-tune for a new task
`lerobot/xvla-libero`	Fine-tuned on LIBERO (98.1%)	Evaluate on LIBERO right away
`lerobot/xvla-widowx`	WidowX pick-and-place	SimplerEnv demo
`lerobot/xvla-folding`	Cloth folding, 100%	Hard bimanual task
`lerobot/xvla-agibot-world`	AgileX dexterous	General bimanual
`lerobot/xvla-google-robot`	Google Robot RT-1 setup	Cross-domain demo

Quick inference test:

from lerobot.policies.xvla import XVLAPolicy
import torch

# Load model
policy = XVLAPolicy.from_pretrained("lerobot/xvla-base")
policy = policy.to("cuda").eval()

# Dummy observation
obs = {
    "observation.images.primary": torch.randn(1, 3, 224, 224).cuda(),
    "observation.images.wrist": torch.randn(1, 3, 224, 224).cuda(),
    "observation.state": torch.randn(1, 7).cuda(),
    "task": ["pick up the red block"],
    "domain_id": torch.tensor([0]).cuda(),
}

# Inference: generates a 32-step action chunk
with torch.no_grad():
    action_chunk = policy.select_action(obs)
print(action_chunk.shape)  # [1, 32, 20]

4. Training on your own dataset

Dataset format

LeRobot datasets follow a standard schema; see the LeRobot Ecosystem post for how to record a dataset from teleoperation. A summary of the required structure:

your-dataset/
├── meta/
│   ├── episodes.jsonl       # per-episode metadata
│   ├── tasks.jsonl          # natural language instructions
│   └── stats.json           # mean/std for normalization
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet  # state + action
└── videos/
    └── chunk-000/
        └── observation.images.primary/
            └── episode_000000.mp4

Each episode needs:

Images from at least 1 camera (RGB, 224×224 or higher)
A proprioceptive state vector (joints + gripper)
An action vector of the same dimension
A natural-language task string
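Before a long training run, it can help to sanity-check your recordings against the four requirements above. The sketch below works on a made-up in-memory dict, not on the LeRobot on-disk format or API; `check_frame` and the dict layout are assumptions for illustration only.

```python
def check_frame(frame, min_size=224):
    """Return a list of problems found in one frame record (empty if OK).

    `frame` is a hypothetical in-memory dict, not the on-disk LeRobot format:
    {"images": {camera_name: (height, width)}, "state": [...],
     "action": [...], "task": "..."}
    """
    problems = []
    images = frame.get("images", {})
    if not images:
        problems.append("need at least one camera image")
    for name, (height, width) in images.items():
        if height < min_size or width < min_size:
            problems.append(f"camera '{name}' is below {min_size}x{min_size}")
    if not frame.get("state"):
        problems.append("missing proprioceptive state")
    if not frame.get("action"):
        problems.append("missing action")
    elif frame.get("state") and len(frame["action"]) != len(frame["state"]):
        problems.append("state and action dimensions differ")
    if not str(frame.get("task", "")).strip():
        problems.append("missing task string")
    return problems


good = {"images": {"primary": (480, 640)}, "state": [0.0] * 7,
        "action": [0.0] * 7, "task": "pick up the red block"}
bad = {"images": {"primary": (120, 160)}, "state": [0.0] * 7,
       "action": [0.0] * 6, "task": ""}
```

Calling `check_frame(bad)` reports the low-resolution camera, the dimension mismatch, and the empty task string, while `check_frame(good)` returns an empty list.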

Basic fine-tuning

For a new task on a standard robot (Franka/SO-101/UR-5):

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --output_dir=./outputs/xvla_my_task \
  --policy.path="lerobot/xvla-base" \
  --policy.dtype=bfloat16 \
  --policy.action_mode=auto \
  --steps=20000 \
  --policy.device=cuda \
  --policy.freeze_vision_encoder=false \
  --policy.freeze_language_encoder=false \
  --policy.train_policy_transformer=true \
  --policy.train_soft_prompts=true \
  --batch_size=8 \
  --num_workers=4

Key parameters:

--policy.action_mode=auto: use this for a new robot; X-VLA detects the dataset's action dimension and pads or trims automatically
--policy.train_soft_prompts=true: also trains the soft prompts (required for a new embodiment)
--policy.dtype=bfloat16: cuts VRAM by about 50% compared with float32, with almost no loss in accuracy
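The 50% figure follows from simple arithmetic on the weights: float32 stores 4 bytes per parameter and bfloat16 stores 2. The quick calculation below uses the 0.9B size from this post and counts weights only, which is an assumption that leaves out activations, gradients, and optimizer state.

```python
NUM_PARAMS = 0.9e9  # model size quoted in the article

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gib(dtype, num_params=NUM_PARAMS):
    """Memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


fp32_gib = weight_gib("float32")
bf16_gib = weight_gib("bfloat16")
print(f"float32: {fp32_gib:.2f} GiB, bfloat16: {bf16_gib:.2f} GiB")
```

That is roughly 3.35 GiB versus 1.68 GiB for the weights. During training, gradients and optimizer state come on top, so the overall saving you see on your GPU may not be exactly half.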

Fine-tune soft prompts only (PEFT-style)

If you only have 24GB of VRAM:

lerobot-train \
  --dataset.repo_id=YOUR_USER/your-task-dataset \
  --policy.path="lerobot/xvla-base" \
  --policy.freeze_vision_encoder=true \
  --policy.freeze_language_encoder=true \
  --policy.train_policy_transformer=false \
  --policy.train_soft_prompts=true \
  --policy.dtype=bfloat16 \
  --steps=10000

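Only ~9M parameters are trainable, roughly 100× fewer than a full fine-tune of the 0.9B model (and about 300× fewer than a 3B backbone such as π0), yet this still reaches 90%+ performance on similar tasks. This is the "Phase II" mode from the paper and X-VLA's main selling point for small teams.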

Critical hyperparameter

The paper recommends training the VLM (vision + language encoders) at 1/10 of the base learning rate while the other components use the full LR. The reason: the VLM is already strongly pretrained, and disturbing it too much causes catastrophic forgetting.

The LeRobot config handles this automatically when --policy.freeze_vision_encoder=false, but if you write a custom trainer, remember to set a separate LR for each parameter group.
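If you do write a custom trainer, the 1/10 rule might look like the sketch below. The module prefixes `vision_encoder.` and `language_encoder.` are hypothetical, so check the real parameter names in your model. The function only builds a list of dicts in the shape that PyTorch optimizers accept as parameter groups, and plain strings stand in for tensors so the example runs without PyTorch.

```python
VLM_PREFIXES = ("vision_encoder.", "language_encoder.")  # hypothetical names


def build_param_groups(named_params, base_lr, vlm_scale=0.1):
    """Split parameters into a slow VLM group and a full-LR group.

    `named_params` is an iterable of (name, parameter) pairs, such as the one
    returned by `model.named_parameters()` in PyTorch. The result is a list of
    dicts in the shape that `torch.optim` optimizers accept as param groups.
    """
    vlm, rest = [], []
    for name, param in named_params:
        (vlm if name.startswith(VLM_PREFIXES) else rest).append(param)
    return [
        {"params": vlm, "lr": base_lr * vlm_scale},
        {"params": rest, "lr": base_lr},
    ]


# Stand-in parameters (plain strings) so the sketch runs without PyTorch.
fake_params = [
    ("vision_encoder.patch_embed.weight", "p0"),
    ("language_encoder.layers.0.weight", "p1"),
    ("transformer.layers.0.weight", "p2"),
    ("soft_prompts.franka", "p3"),
]
groups = build_param_groups(fake_params, base_lr=1e-4)
```

In a real trainer you would pass `model.named_parameters()` and hand `groups` to an optimizer such as `torch.optim.AdamW(groups)`.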

5. Inference on a real robot

Server-client architecture

X-VLA separates the model server from the robot environment over HTTP, which matters because robot dependencies (ROS, drivers) often conflict with PyTorch CUDA.

Server (the machine with the GPU):

lerobot-serve \
  --policy.path="./outputs/xvla_my_task" \
  --port 8765 \
  --device cuda

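Client (the machine connected to the robot):

import base64
import time

import numpy as np
import requests


def encode_image(img):
    # Assumed wire format (not confirmed against the server): raw uint8 pixels as base64.
    # Adjust this to whatever your server actually expects.
    return base64.b64encode(np.asarray(img, dtype=np.uint8).tobytes()).decode("ascii")


def get_action(images_dict, state, instruction, domain_id=0):
    payload = {
        "observation.state": state.tolist(),
        "task": instruction,
        "domain_id": domain_id,
    }
    # Encode images (base64 or multipart)
    for cam_name, img in images_dict.items():
        payload[f"observation.images.{cam_name}"] = encode_image(img)

    response = requests.post(
        "http://gpu-server:8765/act",
        json=payload,
        timeout=2.0,
    )
    response.raise_for_status()  # fail fast on HTTP errors
    return np.array(response.json()["action"])


# Control loop at 30 Hz; `robot` is your own robot interface object
while True:
    obs = robot.get_observation()
    action_chunk = get_action(obs["images"], obs["state"], "pick up the cup")

    # The chunk has 32 steps: execute the first 8, then re-query
    for action in action_chunk[:8]:
        robot.execute(action)
        time.sleep(1 / 30)  # hold the 30 Hz rate, assuming execute() does not block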
Async inference for real-time control

Action chunking (32 steps) enables async inference: while the robot executes the current chunk, the server can compute the next one. Effective latency is ~30-50ms instead of 200-400ms per action.
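The overlap can be sketched with a background thread that prefetches the next chunk while the current one is being executed. Everything here is a stand-in: `fake_policy` replaces the HTTP call, the sleeps mimic inference and control time, and the list `executed` replaces the real robot. It is one possible pattern, not the one LeRobot ships.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 32       # chunk length quoted in the article
EXECUTE_STEPS = 8    # steps executed before switching to the next chunk


def fake_policy(step):
    """Stand-in for the remote policy call; sleeps to mimic inference time."""
    time.sleep(0.02)
    return [f"action_{step}_{i}" for i in range(CHUNK_LEN)]


def run(num_chunks):
    """Execute chunks while the next one is computed in the background."""
    executed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_policy, 0)
        for step in range(num_chunks):
            chunk = future.result()  # wait for the current chunk
            if step + 1 < num_chunks:
                # Prefetch the next chunk while this one is executed.
                future = pool.submit(fake_policy, step + 1)
            for action in chunk[:EXECUTE_STEPS]:
                executed.append(action)  # robot.execute(action) would go here
                time.sleep(0.005)        # mimic one control period
    return executed


executed = run(num_chunks=3)
```

One design caveat: the prefetched chunk is computed from an observation taken before the current chunk has finished executing, so a real system has to decide how stale that observation is allowed to be.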

The LeRobot HilSerl Real Robot RL post has sample code for this async pattern.

[Figure: bimanual robot manipulation with a VLA controller]

6. Evaluating results

Evaluating on LIBERO

lerobot-eval \
  --policy.path="lerobot/xvla-libero" \
  --env.type=libero \
  --env.task=libero_spatial,libero_goal,libero_10,libero_object \
  --env.control_mode=absolute \
  --eval.batch_size=1 \
  --eval.n_episodes=50 \
  --env.episode_length=800 \
  --seed=142

Expected results after ~30 minutes on an A100:

LIBERO-Spatial: 96-98%
LIBERO-Goal: 96-99%
LIBERO-Object: 98-100%
LIBERO-10: 92-95%

This is the baseline to compare your custom training against.

Logging with WandB

Add to the training command:

--wandb.enable=true \
--wandb.project=xvla-finetune \
--wandb.run_name=my-task-v1

Track: loss/flow_matching, loss/gripper_bce, validation/success_rate, gradients/soft_prompt_norm.

7. Tips from hands-on experience

Common errors

CUDA OOM on load: use --policy.dtype=bfloat16 instead of float32 to cut VRAM by about 50%
Action dimension mismatch: set --policy.action_mode=auto so X-VLA handles the padding itself
Soft prompt does not converge: check the learning rate; soft prompts need a higher LR than the backbone (~5e-4 vs 1e-4)
Slow inference: reduce the flow-matching num_inference_steps from 10 to 4-5 (a small accuracy loss for about 2× the speed)

Domain ID: don't forget it!

Each embodiment has its own domain_id:

Dataset	Domain ID
Bridge	0
RT-1	1
CALVIN	2
LIBERO	3
WidowX (air)	4
AIR-AGILEX-HQ	5
AGIBOT-challenge	9

If you forget to set domain_id at inference time, the model falls back to the default (0 = Bridge), picks the wrong soft prompt, and the policy fails. Always match domain_id to the checkpoint you trained.
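One way to avoid the silent fallback is to look the ID up by name and fail loudly on anything unknown. The sketch below copies the values from the table above; `domain_id_for` is a hypothetical helper, not part of LeRobot.

```python
# Values copied from the table above.
DOMAIN_IDS = {
    "Bridge": 0,
    "RT-1": 1,
    "CALVIN": 2,
    "LIBERO": 3,
    "WidowX (air)": 4,
    "AIR-AGILEX-HQ": 5,
    "AGIBOT-challenge": 9,
}


def domain_id_for(dataset_name):
    """Look up a domain ID, failing loudly instead of defaulting to 0."""
    try:
        return DOMAIN_IDS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_IDS))
        raise ValueError(
            f"unknown dataset '{dataset_name}'; known datasets: {known}"
        ) from None


libero_id = domain_id_for("LIBERO")
```

Passing the result explicitly, for example `get_action(..., domain_id=libero_id)` in the client above, keeps the ID tied to the checkpoint you are serving.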

When should you use X-VLA vs the alternatives?

Situation	Choice
Multi-robot fleet (3+ embodiments)	X-VLA: pretrain once, swap prompts
Single robot, small dataset (<5K eps)	π0-FAST or VLA-Adapter
Single robot, large dataset, 1 task	OpenVLA or fine-tuned RT-2
Bimanual humanoid	X-VLA-AgiBot or WholeBodyVLA
GPU consumer (RTX 3060/4090)	VLA-Adapter 0.5B

8. Learning roadmap

Once you are comfortable with X-VLA, good next steps are:

Read the original paper: arXiv 2510.10274 (33 pages; section 3 on soft prompt design is worth a close read)
Collect your own dataset of 100-500 episodes via teleoperation and train soft prompts for your task
Compare against baselines: train ACT, Diffusion Policy, and OpenVLA on the same dataset to understand the trade-offs
Contribute a custom action mode upstream to LeRobot if your robot is unusual (only about 30 lines of code, as in the docs example)

Conclusion

X-VLA is a clear step forward for cross-embodiment VLAs: instead of training n models for n robots, you train 1 backbone + n soft prompts. With the LeRobot integration, beginners in Vietnam can now:

Load the lerobot/xvla-base checkpoint in one line
Fine-tune for their own task with ~9M trainable params on an RTX 4090
Deploy over an HTTP server-client setup that is safe for ROS-based robots

Code, weights, and datasets are all open-source under Apache 2.0, so nothing stands in the way of Vietnamese teams that want to do research or build real products. If you are building a robot fleet or researching manipulation for ASEAN, this is the VLA most worth betting on in 2026.

References

Paper X-VLA arXiv 2510.10274 — Soft-Prompted Transformer as Scalable Cross-Embodiment VLA, ICLR 2026
GitHub 2toinf/X-VLA — Reference implementation
LeRobot X-VLA docs — Integration guide
HuggingFace lerobot/xvla-base — 0.9B pretrained checkpoint
Project page — Demo videos + cloth folding dataset

Nguyễn Anh Tuấn
