If you've already read the WholeBodyVLA ICLR 2026 paper analysis and the architecture deep-dive, you understand why this system matters and how its components work internally. This tutorial goes one step further: how to build the training pipeline from scratch.
Note: As of April 2026, the OpenDriveLab team has not yet released the official code. However, the paper provides enough technical detail to understand the full process and prepare for when the code drops — or to implement it yourself following the paper's design.
The Big Picture: 3 Components, 2 Training Phases
Before diving into individual steps, internalize WholebodyVLA's structure:
[Egocentric Video Data] ──► LAM Training ──► Latent Codes
│ │
[Manipulation Data] ─────────────────────────── ┤
│ ▼
[Teleoperation Demos] ──► VLA Fine-tuning (Prismatic-7B + LoRA)
│
[Isaac Lab Simulation] ──► LMO RL Policy Training │
▼
┌──────────────────────────┐
│ Inference @ 10 / 50 Hz │
└──────────────────────────┘
The pipeline has two parallel phases:
- LAM + VLA Pretraining: Learning to "read" egocentric video and translate it into latent action codes.
- RL Policy Training: Training the LMO (Loco-Manipulation-Oriented) policy in simulation.
Both branches converge at inference, where the VLA issues commands and LMO executes them at higher frequency.
Hardware Requirements
These are the real numbers from the paper — this won't run on your laptop:
| Task | Hardware | Duration |
|---|---|---|
| LAM Training | 8× NVIDIA H100 | ~30,000 steps |
| VLA Pretraining | 8× NVIDIA H100 | ~20,000 steps |
| VLA Fine-tuning | 8× NVIDIA H100 | ~10,000 steps |
| LMO RL Training | 1× NVIDIA H100 | Several days |
| Inference (VLA) | RTX 4090 | ~10 Hz |
| Inference (RL) | NanoPi (onboard) | 50 Hz |
You don't need an H100 cluster to experiment — but to reproduce the paper's results (78% success), you'll need at least one H100 or equivalent A100 80GB for training.
Step 1: Environment Setup
# Python environment
conda create -n wholebodyvla python=3.10
conda activate wholebodyvla
# Core dependencies
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate einops timm
pip install open-clip-torch # CLIP encoder for VLA
# DINOv2 (LAM encoder)
pip install git+https://github.com/facebookresearch/dinov2.git
# Isaac Lab (LMO RL training)
# See official guide: https://isaac-sim.github.io/IsaacLab/
pip install isaaclab
# ZeroMQ (inference communication)
pip install pyzmq
pip install pyrealsense2 # Intel RealSense SDK
Step 2: Data Collection
This is the most time-intensive step and often underestimated. WholebodyVLA uses three distinct data types for three different purposes.
2.1 Egocentric Locomotion Video (for Locomotion LAM)
Goal: Collect first-person video of a human approaching manipulation targets — no robot required, no action labels needed.
Camera setup:
import pyrealsense2 as rs
import numpy as np
import cv2
pipeline = rs.pipeline()
config = rs.config()
# Intel RealSense D435i: 640x480 @ 30fps
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.accel) # IMU for locomotion tracking
config.enable_stream(rs.stream.gyro)
profile = pipeline.start(config)
8 motion primitives to collect:
| Primitive | Description | Recommended count |
|---|---|---|
| Advance | Walk straight toward target | 500 clips |
| Turn Left/Right | Rotate to face correct direction | 400 clips |
| Sidestep | Lateral step left/right | 300 clips |
| Squat Down | Crouch to reach low objects | 300 clips |
| Stand Up | Rise after grabbing object | 300 clips |
| Approach Near | Close distance to target | 400 clips |
| Back Away | Step backwards | 200 clips |
| Stop & Ready | Halt, arms ready for manipulation | 300 clips |
The paper collected ~300 hours of video — but ablation studies show you can start with 50 hours and capture 60%+ of the benefit.
def collect_primitive(primitive_name: str, duration_secs: int = 3):
"""Collect one motion primitive clip."""
timestamp = int(time.time())
frames = []
start = time.time()
while time.time() - start < duration_secs:
frameset = pipeline.wait_for_frames()
color_frame = frameset.get_color_frame()
if color_frame:
img = np.asanyarray(color_frame.get_data())
frames.append(img)
out_path = f"data/loco_videos/{primitive_name}/{timestamp}.mp4"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(out_path, fourcc, 30, (640, 480))
for f in frames:
out.write(f)
out.release()
return out_path
2.2 Manipulation Data (AgiBot World Dataset)
The paper leverages the AgiBot World dataset for in-place bimanual manipulation pretraining:
# Expected directory structure:
data/
manipulation/
task_001/
episode_000/
images/ # frame-by-frame
actions.npy # joint angles
lang.txt # task description
task_002/
...
2.3 Robot Teleoperation Data (for Fine-tuning)
Collected directly on the AgiBot X2 — you need either a real robot or a high-fidelity simulation:
- Upper body: VR headset (Meta Quest 3 or HTC Vive) mapping hand poses → joint targets
- Lower body: Joystick → locomotion commands
- Quantity: 50 demonstrations/task × 3 tasks = 150 total demos
The three core tasks: Bag Packing, Box Loading, Cart Pushing.
Step 3: Training the Latent Action Model (LAM)
The LAM is the "translator" that converts video frames into latent codes, which the VLA then learns to predict.
The paper trains two separate LAMs:
- Manipulation LAM: Static camera, moving hands
- Locomotion LAM: Moving camera following the person
VQ-VAE Architecture
import torch
import torch.nn as nn
from transformers import Dinov2Model
class LatentActionModel(nn.Module):
"""VQ-VAE with DINOv2 encoder for WholebodyVLA."""
def __init__(
self,
codebook_size: int = 8192,
latent_dim: int = 256,
commitment_cost: float = 0.25, # beta in the paper
):
super().__init__()
# DINOv2-base as feature encoder
self.encoder = Dinov2Model.from_pretrained("facebook/dinov2-base")
# Project DINOv2 features (768) → latent_dim
self.encoder_proj = nn.Linear(768 * 2, latent_dim)
# Vector Quantizer
self.codebook = nn.Embedding(codebook_size, latent_dim)
self.commitment_cost = commitment_cost
# Decoder to reconstruct future frame
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 512),
nn.ReLU(),
nn.Linear(512, 768),
)
def encode(self, obs_t: torch.Tensor, obs_t_k: torch.Tensor):
"""
Args:
obs_t: Current frame (B, C, H, W)
obs_t_k: Future frame k steps ahead (B, C, H, W)
Returns:
Quantized latent code and VQ loss
"""
feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
feat_tk = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
z = self.encoder_proj(torch.cat([feat_t, feat_tk], dim=-1))
distances = torch.cdist(z.unsqueeze(1), self.codebook.weight.unsqueeze(0))
indices = distances.squeeze(1).argmin(dim=-1)
c = self.codebook(indices)
vq_loss = torch.mean((c.detach() - z) ** 2)
commit_loss = self.commitment_cost * torch.mean((c - z.detach()) ** 2)
c_st = z + (c - z).detach() # straight-through estimator
return c_st, indices, vq_loss + commit_loss
def forward(self, obs_t, obs_t_k):
c, indices, vq_loss = self.encode(obs_t, obs_t_k)
feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
pred_feat = self.decoder(c + feat_t)
with torch.no_grad():
target_feat = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
mse_loss = nn.functional.mse_loss(pred_feat, target_feat)
return mse_loss + vq_loss, indices
Training:
# Train manipulation LAM
python train_lam.py --data-type manipulation --data-dir data/manipulation/ \
--steps 30000 --batch-size 256 --gpus 8
# Train locomotion LAM
python train_lam.py --data-type locomotion --data-dir data/loco_videos/ \
--steps 30000 --batch-size 256 --gpus 8
Step 4: Training the VLA (Prismatic-7B)
The VLA backbone uses Prismatic-7B — a 7B parameter Vision-Language Model with dual-stream image encoding.
Stage 1: Pretraining with LAM Supervision
The VLA learns to predict latent codes from both LAMs trained in Step 3:
class WholebodyVLA(nn.Module):
"""VLA backbone with dual latent action prediction heads."""
def __init__(self, vlm_path="TRI-ML/prismatic-7b", codebook_size=8192):
super().__init__()
self.vlm = AutoModelForCausalLM.from_pretrained(vlm_path)
self.tokenizer = AutoTokenizer.from_pretrained(vlm_path)
hidden_size = self.vlm.config.hidden_size # 4096 for 7B
self.mani_head = nn.Linear(hidden_size, codebook_size)
self.loco_head = nn.Linear(hidden_size, codebook_size)
def forward(self, images, language, labels_mani=None, labels_loco=None):
inputs = self.tokenizer(language, return_tensors="pt", padding=True)
outputs = self.vlm(
input_ids=inputs["input_ids"].cuda(),
pixel_values=images,
output_hidden_states=True,
)
last_hidden = outputs.hidden_states[-1][:, -1, :]
logits_mani = self.mani_head(last_hidden)
logits_loco = self.loco_head(last_hidden)
loss = 0
if labels_mani is not None:
loss += nn.functional.cross_entropy(logits_mani, labels_mani)
if labels_loco is not None:
loss += nn.functional.cross_entropy(logits_loco, labels_loco)
return loss, logits_mani, logits_loco
Pretraining config:
PRETRAIN_CONFIG = {
"steps": 20_000,
"batch_size": 1_024,
"lr": 2e-5,
"warmup_steps": 500,
"gpus": 8, # H100
}
Stage 2: LoRA Fine-tuning on Teleoperation Data
After pretraining, fine-tune using the 150 robot demonstration episodes:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
)
model = get_peft_model(wholebodyvla.vlm, lora_config)
model.print_trainable_parameters()
# Trainable params: ~13M / 7B total (~0.2%)
FINETUNE_CONFIG = {
"steps": 10_000,
"batch_size": 64,
"lr": 5e-5,
}
Step 5: Training the LMO RL Policy
This is the most technically interesting part — an RL policy learning full-body humanoid control in simulation.
Discrete Command Interface
Instead of traditional velocity tracking, LMO uses discrete flags as the command interface:
class LMOCommand:
"""
Discrete command interface for LMO RL policy.
Each axis: {-1: backward/right/ccw, 0: stop, 1: forward/left/cw}
"""
s_x: int # Forward/backward: {-1, 0, 1}
s_y: int # Lateral: {-1, 0, 1}
s_psi: int # Yaw rotation: {-1, 0, 1}
h_star: float # Target stance height (continuous): [0.7, 1.0] m
def command_to_velocity(cmd: LMOCommand, alpha: float = 2.0) -> np.ndarray:
"""
Smooth velocity ramp from discrete command.
v_k^ref(t) = v_k^goal * tanh[alpha * (s_k - s_bar_k(t))]
Prevents jerky start/stop transitions.
"""
v_goal = np.array([0.5 * cmd.s_x, 0.3 * cmd.s_y, 0.5 * cmd.s_psi])
return v_goal
The paper shows that this discrete interface achieves 78.0% success vs. 54.0% for velocity-based RL — a 24-point gap that comes entirely from better start/stop control and reduced trajectory variance.
Two-Stage Curriculum in Isaac Lab
# Stage I: Gait acquisition (basic locomotion)
STAGE1_CONFIG = {
"num_envs": 4096,
"episode_length": 500,
"goal_speed_range": [0.1, 0.5], # m/s, sampled per episode
"rewards": {
"tracking_lin_vel": 1.5,
"tracking_ang_vel": 0.75,
"lin_vel_z_penalty": -2.0,
"feet_air_time": 1.0,
"joint_torque_penalty": -0.0002,
},
"domain_randomization": {
"mass_ratio": [0.8, 1.2],
"friction_coeff": [0.3, 1.0],
"motor_kp": [0.8, 1.2],
"control_delay_steps": [0, 3],
}
}
# Stage II: Precision refinement (fixed speed + manipulation perturbations)
STAGE2_CONFIG = {
"num_envs": 2048,
"episode_length": 500,
"goal_speed": 0.5, # Fixed, not sampled
"rewards": {
"directional_accuracy": 2.0, # J_dir = |wrap(ψ_end - ψ_start)|
"stand_still_penalty": -1.0,
"tracking_lin_vel": 1.5,
},
"arm_perturbation": {
"enabled": True,
"strength_range": [0.5, 1.0], # Stronger than Stage I
},
"domain_randomization": {
"mass_ratio": [0.7, 1.3], # Wider range
"friction_coeff": [0.2, 1.2],
"ground_unevenness": [0.0, 0.02], # Uneven terrain added
}
}
# Stage I: ~2 days on 1× H100
python train_lmo.py --config configs/lmo_stage1.yaml --robot agibot_x2
# Stage II: ~1 day on 1× H100 (resuming from Stage I)
python train_lmo.py --config configs/lmo_stage2.yaml --robot agibot_x2 \
--resume checkpoints/lmo_stage1_final.pt
Step 6: Inference Deployment
This is where both training branches converge. The system runs two communicating processes via ZeroMQ:
import zmq
import torch
import numpy as np
class VLAInferenceServer:
"""
VLA server on offboard GPU (RTX 4090).
Rate: ~10 Hz (bottlenecked by VLM inference latency).
"""
def __init__(self, vla_model, lam_mani, lam_loco, port=5555):
self.vla = vla_model
self.lam_mani = lam_mani
self.lam_loco = lam_loco
ctx = zmq.Context()
self.socket = ctx.socket(zmq.REP)
self.socket.bind(f"tcp://*:{port}")
def run(self):
while True:
obs = self.socket.recv_pyobj()
with torch.no_grad():
image_t = preprocess_image(obs["egocentric_image"]).cuda()
_, logits_mani, logits_loco = self.vla(
images=image_t.unsqueeze(0),
language=[obs["task_description"]],
)
mani_code = logits_mani.argmax(dim=-1)
loco_code = logits_loco.argmax(dim=-1)
arm_actions = self.lam_mani.decode_to_joints(mani_code)
loco_cmd = self.lam_loco.decode_to_command(loco_code)
self.socket.send_pyobj({
"arm_actions": arm_actions.cpu().numpy(), # (14,) joint targets
"loco_command": loco_cmd.cpu().numpy(), # (4,) = [s_x, s_y, s_ψ, h*]
})
class LMOPolicyClient:
"""
RL policy on robot onboard computer (NanoPi M6).
Rate: 50 Hz — receives LMO commands from VLA and outputs joint torques.
"""
def __init__(self, lmo_policy, vla_server_ip, port=5555):
self.lmo = lmo_policy
ctx = zmq.Context()
self.socket = ctx.socket(zmq.REQ)
self.socket.connect(f"tcp://{vla_server_ip}:{port}")
self.socket.RCVTIMEO = 50 # 50ms timeout
self.last_loco_cmd = np.zeros(4)
def step(self, robot_obs: dict) -> np.ndarray:
"""Run one control step at 50 Hz."""
try:
self.socket.send_pyobj(robot_obs)
vla_output = self.socket.recv_pyobj()
self.last_loco_cmd = vla_output["loco_command"]
except zmq.Again:
pass # VLA not ready — hold last command
obs_vec = torch.tensor(np.concatenate([
self.last_loco_cmd, # (4,)
robot_obs["base_ang_vel"], # (3,)
robot_obs["gravity_vec"], # (3,)
robot_obs["joint_pos"], # (N,)
robot_obs["joint_vel"], # (N,)
robot_obs["prev_action"], # (N,)
]), dtype=torch.float32).cuda()
with torch.no_grad():
action = self.lmo(obs_vec.unsqueeze(0))
return action.squeeze(0).cpu().numpy()
Launching inference:
# Machine 1 — offboard GPU (RTX 4090):
python inference_server.py \
--vla-checkpoint checkpoints/wholebodyvla_final.pt \
--lam-mani checkpoints/lam_mani.pt \
--lam-loco checkpoints/lam_loco.pt
# Machine 2 — onboard robot computer:
python robot_controller.py \
--lmo-checkpoint checkpoints/lmo_stage2_final.pt \
--vla-server 192.168.1.100 \
--control-freq 50
Expected Results
If implemented correctly with sufficient data:
| Task | WholebodyVLA | Best Baseline |
|---|---|---|
| Bag Packing | ~75% | ~60% |
| Box Loading | ~80% | ~65% |
| Cart Pushing | ~79% | ~57% |
| Average | ~78% | ~64% |
The single most impactful ablation: Removing LAM pretraining (supervised learning from demos only) drops performance to ~39%. The egocentric video collection is the highest-ROI investment in this entire pipeline.
Pre-Flight Checklist
□ Hardware ready (≥1× H100 for training, RTX 4090 for VLA inference)
□ Intel RealSense D435i camera calibrated and tested
□ 50+ hours of egocentric locomotion video collected
□ AgiBot World dataset downloaded (or equivalent manipulation data)
□ Robot teleoperation data: 50+ demos/task across 3 tasks
□ Isaac Lab v2.3+ installed and tested with humanoid URDF
□ ZeroMQ network between VLA server and robot onboard stable (<50ms)
□ LAM training converged (codebook utilization >50%)
□ VLA validation loss decreasing during fine-tuning
□ LMO Stage I passing straight-line walking and turning tests
WholebodyVLA sets a new bar for humanoid loco-manipulation — but it's also the system with the most moving parts you'll encounter in robot learning. Mastering each component independently before combining them is the right strategy.
Dive deeper into individual components with VLA Models — AI Series 5 and Isaac Lab for Robotics Simulation.