Why a third WholeBodyVLA post?
If you have already read the WholeBodyVLA ICLR 2026 research breakdown (ideas and contributions) and the open-source architecture deep-dive (codebase walkthrough), this post zooms in on the practical road from a teleoperated demo to a humanoid that runs the task on its own: where you sit, what the robot does, how data flows, which GPUs train what, and when you finally press "Run" on an AgiBot X2.
This is a pipeline-level tutorial — not a line-by-line clone (the OpenDriveLab/WholebodyVLA repository as of May 2026 only ships the paper and resources, not the full codebase yet) — but a field map that lets a robotics engineer start building an equivalent stack today using existing open-source pieces (LeRobot, Isaac Lab, AgiBot teleop suite).
What WholeBodyVLA actually solves
Whole-body loco-manipulation is not "walking plus grasping" — it is walking while grasping, in large workspaces, with non-trivial payloads. The paper (arxiv.org/abs/2512.11047) by Jiang, Chen et al. (Fudan University, OpenDriveLab & MMLab @ HKU, AGIBOT, SII) demonstrates:
- Bimanual grasping and placing items on shelves across a large room (the robot must walk, turn and squat to reach).
- Pushing a cart > 50 kg — legs and arms must coordinate to keep balance.
- Wiping tables, vacuuming — long-horizon tasks combining navigation and manipulation.
Two prior schools both failed:
- Modular: a locomotion policy (RL or MPC) and a manipulation policy (VLA or IL) glued by a state machine. Failure mode: the two policies know nothing about each other — the robot falls when reaching high because the legs do not pre-compensate the shifted CoM.
- End-to-end monolithic: one net outputs every joint torque. Failure mode: whole-body teleop data is rare and expensive; training never converges for long-horizon tasks.
WholeBodyVLA takes a third path: a VLA learns latent actions from action-free egocentric video, paired with an LMO RL policy (Loco-Manipulation-Oriented) running at 50 Hz for execution. The VLA emits latent intent, the LMO compiles it into real joint torques.
Pipeline at a glance
┌──────────────────────────────────────────────────────────────┐
│ Input: Egocentric RGB (head cam) + Language instruction │
│ "Pick up the red box and put it on the shelf" │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ VLM Encoder (frozen pretrained, e.g. Qwen-VL/InternVL) │
│ → vision tokens + language tokens │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Latent Action Model (LAM) — pretrained on action-free │
│ video, learns inverse dynamics from frame t → t+k │
│ → latent action tokens z_t (~10 Hz) │
└────────────────────────┬─────────────────────────────────────┘
│
┌────────────┴────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────────┐
│ Dual-arm decoder │ │ Locomotion command decoder │
│ → joint targets │ │ → (v_x, v_y, ω_z, body_height) │
│ for both arms │ │ for the LMO RL policy │
└──────────┬───────────┘ └──────────────┬───────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────────┐
│ Arm low-level ctrl │ │ LMO RL Policy (PPO, 50 Hz) │
│ (PD @ 200-500 Hz) │ │ Joint torque for legs + waist │
└──────────────────────┘ └──────────────────────────────────┘
The two-rate design matters:
- VLA is slow (~10 Hz) because it runs a large transformer with vision processing on a GPU.
- LMO is fast (50 Hz) because balance and locomotion must react to disturbances (pushes, contacts, uneven floor).
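Concretely, the only thing that crosses the boundary between the two loops is a small locomotion command packet. A minimal sketch of that interface (the field names are my own, not the paper's):

# loco_command.py (sketch)
from dataclasses import dataclass

@dataclass
class LocoCommand:
    """What the VLA hands to the LMO policy every ~100 ms."""
    v_x: float          # forward velocity, m/s
    v_y: float          # lateral velocity, m/s
    omega_z: float      # yaw rate, rad/s
    body_height: float  # desired body height, m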
Headline number from the paper: +21.3% average success over modular baselines, with the biggest gains on large-space and long-horizon tasks.
Step 1 — Hardware and environment
Reference hardware
- Robot: AgiBot X2 (1.7 m humanoid, ~50 kg, 36+ DoF) — or Unitree G1, Booster T1, Fourier GR-1 if you do not have an X2. The recipe is the same.
- Training compute: 8× A100/H100 80GB (the paper uses a larger cluster for LAM pretraining).
- On-robot compute: Jetson AGX Orin 64GB or an onboard PC with an RTX 4090 Mobile — enough for VLA at 10 Hz.
Suggested software stack
conda create -n wbvla python=3.10 -y
conda activate wbvla
# Core
pip install torch==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.45 accelerate datasets
# Robot learning frameworks
pip install lerobot # teleop + dataset format
pip install isaac-lab # locomotion RL sim
pip install hydra-core wandb einops
Note: the official WholeBodyVLA training code is not fully released yet. The recipe below reproduces the idea using LeRobot + Isaac Lab + the AgiBot teleop SDK on hardware you can actually obtain.
Step 2 — Collecting teleop data
This is the most expensive part of any VLA project. WholeBodyVLA reduces the cost by leveraging action-free video for the manipulation prior, but you still need a small teleop set to bootstrap the action decoders.
2.1. AgiBot X2 teleop setup
AgiBot ships a teleop suite (VR controllers + exoskeleton arms). The standard flow:
# On the control station
git clone https://github.com/AgiBot/teleop-suite
cd teleop-suite
./scripts/calibrate_operator.sh # measure height, arm length
./scripts/start_teleop.sh --robot agibot_x2 --record
The operator wears a VR headset and holds controllers. The system maps:
- Head → robot camera angle.
- Hands → end-effector poses (online IK).
- Walking/turning/squatting → commands (v_x, v_y, ω_z, height).
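The walking/turning/squatting mapping is conceptually just a scaling of controller axes into the command vector. A toy sketch (axis layout, gains and ranges are hypothetical; the AgiBot suite implements its own mapping):

# vr_to_loco.py (sketch)
def controller_to_loco_cmd(left_stick, right_stick, squat_axis,
                           max_v=0.8, max_w=1.0, h_range=(0.55, 0.75)):
    """Map VR controller axes (each in [-1, 1]) to a (v_x, v_y, w_z, height) command."""
    v_x = max_v * left_stick[1]      # push the left stick forward -> walk forward
    v_y = max_v * left_stick[0]      # left stick sideways -> strafe
    w_z = max_w * right_stick[0]     # right stick sideways -> turn
    height = h_range[0] + (squat_axis + 1.0) / 2.0 * (h_range[1] - h_range[0])
    return (v_x, v_y, w_z, height)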
2.2. Data format
Each episode is stored in the LeRobot v2 schema:
# episode_000123.parquet
{
"observation.images.head_cam": [T, 3, 480, 640], # uint8 RGB
"observation.images.wrist_cam_l": [T, 3, 480, 640],
"observation.images.wrist_cam_r": [T, 3, 480, 640],
"observation.state": [T, 36], # joints
"action.arm_joints": [T, 14], # 7+7 DoF arms
"action.loco_cmd": [T, 4], # vx, vy, wz, h
"language_instruction": "pick the red box and place on shelf",
"task_id": "longhorizon_pick_place_001",
}
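After a recording session it is worth opening a few episodes to verify the schema. A quick inspection sketch (the exact row layout depends on the LeRobot version, so treat this as a sanity check rather than a loader):

# inspect_episode.py (sketch)
import pandas as pd

ep = pd.read_parquet("episode_000123.parquet")
print(ep.columns.tolist())          # should list the keys from the schema above
print(len(ep), "rows")
if "language_instruction" in ep.columns:
    print(ep["language_instruction"].iloc[0])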
2.3. How much data?
The paper shows that with LAM pretraining on label-free video you only need about 2-5k labeled teleop episodes (50-150 hours) to fine-tune the decoders — roughly 10× less than fully end-to-end. The reason: the LAM has already learned inverse dynamics from far richer video.
Step 3 — Pretrain the Latent Action Model (LAM)
This is the core innovation of WholeBodyVLA. The idea: learn a latent action z_t from a pair of frames (frame_t, frame_{t+k}) without action labels — much like the inverse dynamics models in Genie or LAPO.
3.1. Pretraining video sources
- Ego4D, Epic-Kitchens, RH20T: thousands of hours of first-person video.
- AgiBot World 2026 (dataset overview): large-scale humanoid teleop video.
- Any internal first-person manipulation video you have.
Goal: the encoder learns a short-horizon action representation purely from visual change, without knowing what the robot is doing in joint space.
3.2. Loss and training
# Simplified — the key idea (VQVAE and Transformer are placeholder modules)
import torch.nn as nn
import torch.nn.functional as F

class LAM(nn.Module):
    def __init__(self, vit_backbone):
        super().__init__()
        self.vision = vit_backbone                         # frozen ViT encoder
        self.action_quantizer = VQVAE(codebook_size=1024)  # discrete latent actions
        self.forward_dynamics = Transformer(depth=8)

    def forward(self, frame_t, frame_tk):
        feat_t = self.vision(frame_t)
        feat_tk = self.vision(frame_tk)
        # encode the action diff into a latent z_t
        z_t = self.action_quantizer.encode(feat_tk - feat_t)
        # predict frame_tk from frame_t + z_t
        pred_tk = self.forward_dynamics(feat_t, z_t)
        loss = F.mse_loss(pred_tk, feat_tk) + self.action_quantizer.commit_loss
        return loss
Training run:
torchrun --nproc_per_node=8 train_lam.py \
--data.video_roots ego4d,epic_kitchens,agibot_world \
--model.vision_backbone vit-l-14 \
--model.codebook_size 1024 \
--train.batch_size 256 \
--train.lr 1.5e-4 \
--train.steps 200000
About 4-6 days on 8× A100.
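At deployment time you only need the encoder path of the LAM. A minimal sketch of pulling a latent action out of two ego frames (dummy tensors stand in for real camera frames, vit_backbone is the same placeholder as above, and the checkpoint path is the one from the run above):

# lam_encode.py (sketch)
import torch

lam = LAM(vit_backbone)
lam.load_state_dict(torch.load("outputs/lam/final.pt"))
lam.eval()

frame_t  = torch.zeros(1, 3, 224, 224)   # ego frame at time t (dummy here)
frame_tk = torch.zeros(1, 3, 224, 224)   # ego frame at time t+k

with torch.no_grad():
    feat_t, feat_tk = lam.vision(frame_t), lam.vision(frame_tk)
    z_t = lam.action_quantizer.encode(feat_tk - feat_t)   # latent action tokens for the decoders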
Step 4 — Train the action decoders (arm + locomotion command)
With LAM in hand, you train two small decoders that map z_t to concrete actions, using the teleop data from Step 2.
torchrun --nproc_per_node=4 train_decoders.py \
--lam_ckpt outputs/lam/final.pt \
--data.teleop_dataset /data/agibot_x2_teleop \
--model.arm_decoder mlp_depth=4 \
--model.loco_decoder mlp_depth=3 \
--train.steps 50000
About 1-2 days on 4× A100. Output is two lightweight checkpoints (~50-100M params each).
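The decoders themselves can be small. A hedged sketch of what the two heads might look like, with action and state dimensions taken from the data schema in Step 2 (latent size and hidden widths are assumptions):

# decoders.py (sketch)
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512, depth=4):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.GELU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class ActionDecoders(nn.Module):
    """Map latent action tokens + proprioception to concrete robot actions."""
    def __init__(self, z_dim=256, state_dim=36):
        super().__init__()
        self.arm_decoder = mlp(z_dim + state_dim, 14, depth=4)   # 7+7 DoF arm joint targets
        self.loco_decoder = mlp(z_dim + state_dim, 4, depth=3)   # (v_x, v_y, w_z, body_height)

    def forward(self, z_t, state):
        x = torch.cat([z_t, state], dim=-1)
        return self.arm_decoder(x), self.loco_decoder(x)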
Step 5 — Train the LMO RL policy in simulation
LMO runs independently of the VLA. It takes loco_cmd from the VLA and turns it into leg + waist torques that keep balance under disturbances (pushes, payloads, uneven floors).
5.1. Isaac Lab task
git clone https://github.com/isaac-sim/IsaacLab
cd IsaacLab
./isaaclab.sh -p source/standalone/workflows/skrl/train.py \
--task Isaac-Humanoid-LocoManip-AgiBotX2-v0 \
--num_envs 4096 --headless
Key rewards:
- Accurate tracking of (v_x, v_y, ω_z, body_height).
- Robustness: domain randomization on mass, friction, push forces.
- Manipulation-aware: penalize torso oscillation (it would shake the arms) — this is the differentiator from generic locomotion RL.
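A hedged sketch of how these reward terms might look in an Isaac Lab-style task (weights and kernel scales are my own placeholders, not the paper's values):

# rewards.py (sketch)
import torch

def loco_manip_reward(cmd, base_lin_vel, base_ang_vel, base_height, torso_ang_vel):
    """cmd = (v_x, v_y, w_z, body_height); all tensors are batched over num_envs."""
    # command tracking: exponential kernels on velocity and height errors
    lin_err = torch.sum((cmd[:, :2] - base_lin_vel[:, :2]) ** 2, dim=-1)
    ang_err = (cmd[:, 2] - base_ang_vel[:, 2]) ** 2
    h_err = (cmd[:, 3] - base_height) ** 2
    r_track = torch.exp(-4.0 * lin_err) + 0.5 * torch.exp(-4.0 * ang_err) + 0.5 * torch.exp(-20.0 * h_err)

    # manipulation-aware: penalize torso angular velocity so the arms ride on a quiet base
    r_torso = -0.2 * torch.sum(torso_ang_vel ** 2, dim=-1)

    return r_track + r_torso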
Roughly 24-48 hours on 1× A100 with 4096 parallel envs. See Booster Gym ICRA 2026 for a comparable sim-to-real recipe.
5.2. Sim-to-real
After sim training, real-robot deployment typically needs:
- Actuator network: a small MLP that maps motor command → actual torque (collect ~1 hour of random commands on the real robot); a minimal training sketch follows this list.
- Curriculum: start with small commands, expand range over training.
- Noise injection: inject IMU/joint noise in sim to avoid jitter on real hardware.
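The actuator network is plain supervised regression on that logged hour of data. A minimal sketch, assuming you logged (position error, joint velocity, measured torque) triplets; the architecture and the dataloader are placeholders:

# actuator_net.py (sketch)
import torch
import torch.nn as nn

actuator_net = nn.Sequential(     # tiny model: (position error, velocity) -> realized torque
    nn.Linear(2, 32), nn.Softsign(),
    nn.Linear(32, 32), nn.Softsign(),
    nn.Linear(32, 1),
)

opt = torch.optim.Adam(actuator_net.parameters(), lr=1e-3)
for pos_err, vel, tau_measured in dataloader:              # your logged real-robot samples
    tau_pred = actuator_net(torch.stack([pos_err, vel], dim=-1)).squeeze(-1)
    loss = nn.functional.mse_loss(tau_pred, tau_measured)
    opt.zero_grad(); loss.backward(); opt.step()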
Step 6 — On-robot inference loop
# robot_runner.py
import time
import torch
from wbvla import VLAPolicy, LMOPolicy

vla = VLAPolicy.from_pretrained("./ckpts/vla_final")
lmo = LMOPolicy.from_pretrained("./ckpts/lmo_final")
robot = AgiBotX2()  # wraps the teleop SDK

instruction = "pick the red box and place it on the top shelf"
last_vla = 0.0      # timestamp of the last VLA tick

while True:
    tick_start = time.time()
    obs = robot.get_observation()  # ego image + joint state

    # VLA @ ~10 Hz: refresh arm targets and the locomotion command
    if time.time() - last_vla > 0.1:
        arm_target, loco_cmd = vla.act(obs.image, instruction, obs.state)
        last_vla = time.time()

    # LMO @ 50 Hz: turn the latest loco command into leg + waist torques
    leg_torque = lmo.act(loco_cmd, obs.state)
    robot.send(arm_target, leg_torque)

    # keep the outer loop close to 50 Hz
    time.sleep(max(0.0, 0.02 - (time.time() - tick_start)))
Measured latency:
| Block | Latency (Jetson AGX Orin) |
|---|---|
| VLM encoder (vision + language) | ~60 ms |
| LAM + decoders | ~30 ms |
| LMO policy | ~5 ms |
| Robot control loop | ~2 ms |
| Total VLA tick | ~95 ms (~10.5 Hz) |
Step 7 — Evaluating results
Headline numbers from the paper:
| Task | Modular baseline | WholeBodyVLA | Δ |
|---|---|---|---|
| Pick & place large-space | 52% | 71% | +19% |
| Push cart > 50 kg | 38% | 64% | +26% |
| Wipe table long-horizon | 41% | 60% | +19% |
| Average | 44% | 65% | +21.3% |
Known weaknesses:
- Failures cluster when the VLA tick slows beyond 200 ms (GPU thermal throttling) — the robot "freezes" mid-task. Fix: distill a smaller VLM (~1B params) or apply speculative decoding.
- Contact-rich bimanual tasks (opening jars, unscrewing lids) remain weak. Future work likely adds tactile sensing.
Pitfalls when reproducing the pipeline
- Do not run VLA and LMO at the same rate. This is the most common mistake. VLA targets intent, LMO targets stability — two objectives require two loops.
- LMO domain randomization must include "carrying payloads". Many implementations forget payload randomization and the robot loses balance the moment it picks up a 20 kg box.
- Teleop data must have synchronized timestamps across arm actions and loco commands. A 50 ms misalignment teaches the decoder bad correlations (see the alignment sketch after this list).
- Frozen pretrained VLMs beat full fine-tuning. You will not have enough robot data to fine-tune a 7B+ VLM without catastrophic forgetting.
- Always evaluate on the real robot, not just sim. The sim-to-real gap is especially wide for loco-manipulation because contact dynamics are hard to model.
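For the timestamp pitfall above, a minimal alignment sketch, assuming each stream is logged with its own wall-clock timestamps (function and variable names are mine):

# align_streams.py (sketch)
import numpy as np

def align_loco_to_arm(arm_t, loco_t, loco_cmds, tol=0.02):
    """Resample loco commands onto arm-action timestamps; flag gaps larger than tol seconds."""
    aligned = np.stack(
        [np.interp(arm_t, loco_t, loco_cmds[:, d]) for d in range(loco_cmds.shape[1])],
        axis=-1,
    )
    nearest_gap = np.min(np.abs(arm_t[:, None] - loco_t[None, :]), axis=1)
    return aligned, nearest_gap > tol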
Transferable lessons
- Separate "intent" from "execution" — this pattern carries over to navigation (planner + controller), autonomous driving (perception planner + low-level control), and even software (orchestration layer vs. execution layer).
- Action-free pretraining on video is a cheap path. If you have thousands of hours of task video without action labels, do not throw it away — LAM-style training can use it.
- A shared latent space for heterogeneous actions (arms + legs) enables multi-task generalization. Beyond robotics, this is a useful design hint: design a common interface first, decoders later.