Why a third WholeBodyVLA post?
If you have already read the WholeBodyVLA ICLR 2026 research breakdown (ideas and contributions) and the open-source architecture deep-dive (codebase walkthrough), this post zooms in on the practical road from a teleoperated demo to a humanoid that runs the task on its own: where you sit, what the robot does, how data flows, which GPUs train what, and when you finally press "Run" on an AgiBot X2.
This is a pipeline-level tutorial — not a line-by-line clone (the OpenDriveLab/WholebodyVLA repository as of May 2026 only ships the paper and resources, not the full codebase yet) — but a field map that lets a robotics engineer start building an equivalent stack today using existing open-source pieces (LeRobot, Isaac Lab, AgiBot teleop suite).
What WholeBodyVLA actually solves
Whole-body loco-manipulation is not "walking plus grasping" — it is walking while grasping, in large workspaces, with non-trivial payloads. The paper (arxiv.org/abs/2512.11047) by Jiang, Chen et al. (Fudan University, OpenDriveLab & MMLab @ HKU, AGIBOT, SII) demonstrates:
- Bimanual grasping and placing items on shelves across a large room (the robot must walk, turn and squat to reach).
- Pushing a cart > 50 kg — legs and arms must coordinate to keep balance.
- Wiping tables, vacuuming — long-horizon tasks combining navigation and manipulation.
Two prior schools both failed:
- Modular: a locomotion policy (RL or MPC) and a manipulation policy (VLA or IL) glued by a state machine. Failure mode: the two policies know nothing about each other — the robot falls when reaching high because the legs do not pre-compensate the shifted CoM.
- End-to-end monolithic: one net outputs every joint torque. Failure mode: whole-body teleop data is rare and expensive; training never converges for long-horizon tasks.
WholeBodyVLA takes a third path: a VLA learns latent actions from action-free egocentric video, paired with an LMO RL policy (Loco-Manipulation-Oriented) running at 50 Hz for execution. The VLA emits latent intent, the LMO compiles it into real joint torques.
Pipeline at a glance
┌──────────────────────────────────────────────────────────────┐
│ Input: Egocentric RGB (head cam) + Language instruction │
│ "Pick up the red box and put it on the shelf" │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ VLM Encoder (frozen pretrained, e.g. Qwen-VL/InternVL) │
│ → vision tokens + language tokens │
└────────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Latent Action Model (LAM) — pretrained on action-free │
│ video, learns inverse dynamics from frame t → t+k │
│ → latent action tokens z_t (~10 Hz) │
└────────────────────────┬─────────────────────────────────────┘
│
┌────────────┴────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────────┐
│ Dual-arm decoder │ │ Locomotion command decoder │
│ → joint targets │ │ → (v_x, v_y, ω_z, body_height) │
│ for both arms │ │ for the LMO RL policy │
└──────────┬───────────┘ └──────────────┬───────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────────┐
│ Arm low-level ctrl │ │ LMO RL Policy (PPO, 50 Hz) │
│ (PD @ 200-500 Hz) │ │ Joint torque for legs + waist │
└──────────────────────┘ └──────────────────────────────────┘
The two-rate design matters:
- VLA is slow (~10 Hz) because it runs a large transformer with vision processing on a GPU.
- LMO is fast (50 Hz) because balance and locomotion must react to disturbances (pushes, contacts, uneven floor).
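Concretely, the only thing that crosses the boundary between the two loops is a small locomotion command packet. A minimal sketch of that interface (the field names are my own, not the paper's):

# loco_command.py (sketch)
from dataclasses import dataclass

@dataclass
class LocoCommand:
    """What the VLA hands to the LMO policy every ~100 ms."""
    v_x: float          # forward velocity, m/s
    v_y: float          # lateral velocity, m/s
    omega_z: float      # yaw rate, rad/s
    body_height: float  # desired body height, m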
Headline number from the paper: +21.3% average success over modular baselines, with the biggest gains on large-space and long-horizon tasks.
Step 1 — Hardware and environment
Reference hardware
- Robot: AgiBot X2 (1.7 m humanoid, ~50 kg, 36+ DoF) — or Unitree G1, Booster T1, Fourier GR-1 if you do not have an X2. The recipe is the same.
- Training compute: 8× A100/H100 80GB (the paper uses a larger cluster for LAM pretraining).
- On-robot compute: Jetson AGX Orin 64GB or an onboard PC with an RTX 4090 Mobile — enough for VLA at 10 Hz.
Suggested software stack
conda create -n wbvla python=3.10 -y
conda activate wbvla
# Core
pip install torch==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.45 accelerate datasets
# Robot learning frameworks
pip install lerobot # teleop + dataset format
pip install isaac-lab # locomotion RL sim
pip install hydra-core wandb einops
Note: the official WholeBodyVLA training code is not fully released yet. The recipe below reproduces the idea using LeRobot + Isaac Lab + the AgiBot teleop SDK on hardware you can actually obtain.
Step 2 — Collecting teleop data
This is the most expensive part of any VLA project. WholeBodyVLA reduces the cost by leveraging action-free video for the manipulation prior, but you still need a small teleop set to bootstrap the action decoders.
2.1. AgiBot X2 teleop setup
AgiBot ships a teleop suite (VR controllers + exoskeleton arms). The standard flow:
# On the control station
git clone https://github.com/AgiBot/teleop-suite
cd teleop-suite
./scripts/calibrate_operator.sh # measure height, arm length
./scripts/start_teleop.sh --robot agibot_x2 --record
The operator wears a VR headset and holds controllers. The system maps:
- Head → robot camera angle.
- Hands → end-effector poses (online IK).
- Walking/turning/squatting → commands (v_x, v_y, ω_z, height).
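The walking/turning/squatting mapping is conceptually just a scaling of controller axes into the command vector. A toy sketch (axis layout, gains and ranges are hypothetical; the AgiBot suite implements its own mapping):

# vr_to_loco.py (sketch)
def controller_to_loco_cmd(left_stick, right_stick, squat_axis,
                           max_v=0.8, max_w=1.0, h_range=(0.55, 0.75)):
    """Map VR controller axes (each in [-1, 1]) to a (v_x, v_y, w_z, height) command."""
    v_x = max_v * left_stick[1]      # push the left stick forward -> walk forward
    v_y = max_v * left_stick[0]      # left stick sideways -> strafe
    w_z = max_w * right_stick[0]     # right stick sideways -> turn
    height = h_range[0] + (squat_axis + 1.0) / 2.0 * (h_range[1] - h_range[0])
    return (v_x, v_y, w_z, height)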
2.2. Data format
Each episode is stored in the LeRobot v2 schema:
# episode_000123.parquet
{
"observation.images.head_cam": [T, 3, 480, 640], # uint8 RGB
"observation.images.wrist_cam_l": [T, 3, 480, 640],
"observation.images.wrist_cam_r": [T, 3, 480, 640],
"observation.state": [T, 36], # joints
"action.arm_joints": [T, 14], # 7+7 DoF arms
"action.loco_cmd": [T, 4], # vx, vy, wz, h
"language_instruction": "pick the red box and place on shelf",
"task_id": "longhorizon_pick_place_001",
}
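After a recording session it is worth opening a few episodes to verify the schema. A quick inspection sketch (the exact row layout depends on the LeRobot version, so treat this as a sanity check rather than a loader):

# inspect_episode.py (sketch)
import pandas as pd

ep = pd.read_parquet("episode_000123.parquet")
print(ep.columns.tolist())          # should list the keys from the schema above
print(len(ep), "rows")
if "language_instruction" in ep.columns:
    print(ep["language_instruction"].iloc[0])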
2.3. How much data?
The paper shows that with LAM pretraining on label-free video you only need about 2-5k labeled teleop episodes (50-150 hours) to fine-tune the decoders — roughly 10× less than fully end-to-end. The reason: the LAM has already learned inverse dynamics from far richer video.
Step 3 — Pretrain the Latent Action Model (LAM)
This is the core innovation of WholeBodyVLA. The idea: learn a latent action z_t from a pair of frames (frame_t, frame_{t+k}) without action labels — much like the inverse dynamics models in Genie or LAPO.
3.1. Pretraining video sources
- Ego4D, Epic-Kitchens, RH20T: thousands of hours of first-person video.
- AgiBot World 2026 (dataset overview): large-scale humanoid teleop video.
- Any internal first-person manipulation video you have.
Goal: the encoder learns a short-horizon action representation purely from visual change, without knowing what the robot is doing in joint space.
3.2. Loss and training
# Simplified — the key idea (VQVAE and Transformer are placeholder modules)
import torch.nn as nn
import torch.nn.functional as F

class LAM(nn.Module):
    def __init__(self, vit_backbone):
        super().__init__()
        self.vision = vit_backbone                         # frozen ViT encoder
        self.action_quantizer = VQVAE(codebook_size=1024)  # discrete latent actions
        self.forward_dynamics = Transformer(depth=8)

    def forward(self, frame_t, frame_tk):
        feat_t = self.vision(frame_t)
        feat_tk = self.vision(frame_tk)
        # encode the action diff into a latent z_t
        z_t = self.action_quantizer.encode(feat_tk - feat_t)
        # predict frame_tk from frame_t + z_t
        pred_tk = self.forward_dynamics(feat_t, z_t)
        loss = F.mse_loss(pred_tk, feat_tk) + self.action_quantizer.commit_loss
        return loss
Training run:
torchrun --nproc_per_node=8 train_lam.py \
--data.video_roots ego4d,epic_kitchens,agibot_world \
--model.vision_backbone vit-l-14 \
--model.codebook_size 1024 \
--train.batch_size 256 \
--train.lr 1.5e-4 \
--train.steps 200000
About 4-6 days on 8× A100.
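At deployment time you only need the encoder path of the LAM. A minimal sketch of pulling a latent action out of two ego frames (dummy tensors stand in for real camera frames, vit_backbone is the same placeholder as above, and the checkpoint path is the one from the run above):

# lam_encode.py (sketch)
import torch

lam = LAM(vit_backbone)
lam.load_state_dict(torch.load("outputs/lam/final.pt"))
lam.eval()

frame_t  = torch.zeros(1, 3, 224, 224)   # ego frame at time t (dummy here)
frame_tk = torch.zeros(1, 3, 224, 224)   # ego frame at time t+k

with torch.no_grad():
    feat_t, feat_tk = lam.vision(frame_t), lam.vision(frame_tk)
    z_t = lam.action_quantizer.encode(feat_tk - feat_t)   # latent action tokens for the decoders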
Step 4 — Train the action decoders (arm + locomotion command)
With LAM in hand, you train two small decoders that map z_t to concrete actions, using the teleop data from Step 2.
torchrun --nproc_per_node=4 train_decoders.py \
--lam_ckpt outputs/lam/final.pt \
--data.teleop_dataset /data/agibot_x2_teleop \
--model.arm_decoder mlp_depth=4 \
--model.loco_decoder mlp_depth=3 \
--train.steps 50000
About 1-2 days on 4× A100. Output is two lightweight checkpoints (~50-100M params each).
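The decoders themselves can be small. A hedged sketch of what the two heads might look like, with action and state dimensions taken from the data schema in Step 2 (latent size and hidden widths are assumptions):

# decoders.py (sketch)
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512, depth=4):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.GELU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class ActionDecoders(nn.Module):
    """Map latent action tokens + proprioception to concrete robot actions."""
    def __init__(self, z_dim=256, state_dim=36):
        super().__init__()
        self.arm_decoder = mlp(z_dim + state_dim, 14, depth=4)   # 7+7 DoF arm joint targets
        self.loco_decoder = mlp(z_dim + state_dim, 4, depth=3)   # (v_x, v_y, w_z, body_height)

    def forward(self, z_t, state):
        x = torch.cat([z_t, state], dim=-1)
        return self.arm_decoder(x), self.loco_decoder(x)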
Step 5 — Train the LMO RL policy in simulation
LMO runs independently of the VLA. It takes loco_cmd from the VLA and turns it into leg + waist torques that keep balance under disturbances (pushes, payloads, uneven floors).
5.1. Isaac Lab task
git clone https://github.com/isaac-sim/IsaacLab
cd IsaacLab
./isaaclab.sh -p source/standalone/workflows/skrl/train.py \
--task Isaac-Humanoid-LocoManip-AgiBotX2-v0 \
--num_envs 4096 --headless
Key rewards:
- Accurate tracking of (v_x, v_y, ω_z, body_height).
- Robustness: domain randomization on mass, friction, push forces.
- Manipulation-aware: penalize torso oscillation (it would shake the arms) — this is the differentiator from generic locomotion RL.
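A hedged sketch of how these reward terms might look in an Isaac Lab-style task (weights and kernel scales are my own placeholders, not the paper's values):

# rewards.py (sketch)
import torch

def loco_manip_reward(cmd, base_lin_vel, base_ang_vel, base_height, torso_ang_vel):
    """cmd = (v_x, v_y, w_z, body_height); all tensors are batched over num_envs."""
    # command tracking: exponential kernels on velocity and height errors
    lin_err = torch.sum((cmd[:, :2] - base_lin_vel[:, :2]) ** 2, dim=-1)
    ang_err = (cmd[:, 2] - base_ang_vel[:, 2]) ** 2
    h_err = (cmd[:, 3] - base_height) ** 2
    r_track = torch.exp(-4.0 * lin_err) + 0.5 * torch.exp(-4.0 * ang_err) + 0.5 * torch.exp(-20.0 * h_err)

    # manipulation-aware: penalize torso angular velocity so the arms ride on a quiet base
    r_torso = -0.2 * torch.sum(torso_ang_vel ** 2, dim=-1)

    return r_track + r_torso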
Roughly 24-48 hours on 1× A100 with 4096 parallel envs. See Booster Gym ICRA 2026 for a comparable sim-to-real recipe.
5.2. Sim-to-real
After sim training, real-robot deployment typically needs:
- Actuator network: a small MLP that maps motor command → actual torque (collect ~1 hour of random commands on the real robot); a minimal training sketch follows this list.
- Curriculum: start with small commands, expand range over training.
- Noise injection: inject IMU/joint noise in sim to avoid jitter on real hardware.
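The actuator network is plain supervised regression on that logged hour of data. A minimal sketch, assuming you logged (position error, joint velocity, measured torque) triplets; the architecture and the dataloader are placeholders:

# actuator_net.py (sketch)
import torch
import torch.nn as nn

actuator_net = nn.Sequential(     # tiny model: (position error, velocity) -> realized torque
    nn.Linear(2, 32), nn.Softsign(),
    nn.Linear(32, 32), nn.Softsign(),
    nn.Linear(32, 1),
)

opt = torch.optim.Adam(actuator_net.parameters(), lr=1e-3)
for pos_err, vel, tau_measured in dataloader:              # your logged real-robot samples
    tau_pred = actuator_net(torch.stack([pos_err, vel], dim=-1)).squeeze(-1)
    loss = nn.functional.mse_loss(tau_pred, tau_measured)
    opt.zero_grad(); loss.backward(); opt.step()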
Step 6 — On-robot inference loop
# robot_runner.py
import time
import torch
from wbvla import VLAPolicy, LMOPolicy

vla = VLAPolicy.from_pretrained("./ckpts/vla_final")
lmo = LMOPolicy.from_pretrained("./ckpts/lmo_final")
robot = AgiBotX2()  # wraps the teleop SDK

instruction = "pick the red box and place it on the top shelf"
last_vla = 0.0      # timestamp of the last VLA tick

while True:
    tick_start = time.time()
    obs = robot.get_observation()  # ego image + joint state

    # VLA @ ~10 Hz: refresh arm targets and the locomotion command
    if time.time() - last_vla > 0.1:
        arm_target, loco_cmd = vla.act(obs.image, instruction, obs.state)
        last_vla = time.time()

    # LMO @ 50 Hz: turn the latest loco command into leg + waist torques
    leg_torque = lmo.act(loco_cmd, obs.state)
    robot.send(arm_target, leg_torque)

    # keep the outer loop close to 50 Hz
    time.sleep(max(0.0, 0.02 - (time.time() - tick_start)))
Measured latency:
| Block | Latency (Jetson AGX Orin) |
|---|---|
| VLM encoder (vision + language) | ~60 ms |
| LAM + decoders | ~30 ms |
| LMO policy | ~5 ms |
| Robot control loop | ~2 ms |
| Total VLA tick | ~95 ms (~10.5 Hz) |
Step 7 — Evaluating results
Headline numbers from the paper:
| Task | Modular baseline | WholeBodyVLA | Δ |
|---|---|---|---|
| Pick & place large-space | 52% | 71% | +19% |
| Push cart > 50 kg | 38% | 64% | +26% |
| Wipe table long-horizon | 41% | 60% | +19% |
| Average | 44% | 65% | +21.3% |
Known weaknesses:
- Failures cluster when the VLA tick slows beyond 200 ms (GPU thermal throttling) — the robot "freezes" mid-task. Fix: distill a smaller VLM (~1B params) or apply speculative decoding.
- Contact-rich bimanual tasks (opening jars, unscrewing lids) remain weak. Future work likely adds tactile sensing.
Pitfalls when reproducing the pipeline
- Do not run VLA and LMO at the same rate. This is the most common mistake. VLA targets intent, LMO targets stability — two objectives require two loops.
- LMO domain randomization must include "carrying payloads". Many implementations forget payload randomization and the robot loses balance the moment it picks up a 20 kg box.
- Teleop data must have synchronized timestamps across arm actions and loco commands. A 50 ms misalignment teaches the decoder bad correlations (see the alignment sketch after this list).
- Frozen pretrained VLMs beat full fine-tuning. You will not have enough robot data to fine-tune a 7B+ VLM without catastrophic forgetting.
- Always evaluate on the real robot, not just sim. The sim-to-real gap is especially wide for loco-manipulation because contact dynamics are hard to model.
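For the timestamp pitfall above, a minimal alignment sketch, assuming each stream is logged with its own wall-clock timestamps (function and variable names are mine):

# align_streams.py (sketch)
import numpy as np

def align_loco_to_arm(arm_t, loco_t, loco_cmds, tol=0.02):
    """Resample loco commands onto arm-action timestamps; flag gaps larger than tol seconds."""
    aligned = np.stack(
        [np.interp(arm_t, loco_t, loco_cmds[:, d]) for d in range(loco_cmds.shape[1])],
        axis=-1,
    )
    nearest_gap = np.min(np.abs(arm_t[:, None] - loco_t[None, :]), axis=1)
    return aligned, nearest_gap > tol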
Transferable lessons
- Separate "intent" from "execution" — this pattern carries over to navigation (planner + controller), autonomous driving (perception planner + low-level control), and even software (orchestration layer vs. execution layer).
- Action-free pretraining on video is a cheap path. If you have thousands of hours of task video without action labels, do not throw it away — LAM-style training can use it.
- A shared latent space for heterogeneous actions (arms + legs) enables multi-task generalization. Beyond robotics, this is a useful design hint: design a common interface first, decoders later.