From paper to code: WholebodyVLA goes open-source
If you've read the WholebodyVLA research analysis earlier, you know this is the first paper to fully address whole-body loco-manipulation for humanoid robots — arms manipulating objects while legs maintain balance and move, all governed by a unified system.
Now the OpenDriveLab/WholebodyVLA repo is public on GitHub (ICLR 2026), and this article dives deep into the technical architecture, training pipeline, and how you can build a similar system — even without a physical robot.
The core problem: Why decoupled approaches fail
Before diving into code, let's understand why WholebodyVLA needs to exist.
Most current humanoid systems split into two independent modules:
- Manipulation controller: handles arms, typically VLA or imitation learning
- Locomotion controller: handles legs, typically an RL policy
These run in parallel, each optimizing for its own objective. The problem? When the robot reaches right to pick something up, the center of mass shifts — but the locomotion controller doesn't know, because it receives no information from manipulation. Result: the robot falls.
Imagine someone controlling your hands via remote while another person controls your legs — and they can't talk to each other. You'd fall the first time you try to pick something off the floor.
WholebodyVLA solves this by sharing a common latent space between both layers — the arms "tell" the legs what they're about to do, and the legs self-adjust.
Overall architecture: Three layers, one flow
WholebodyVLA consists of 3 main components stacked in order:
┌──────────────────────────────────────────────────────────┐
│ LAYER 1: Latent Action Model (LAM) │
│ - Pre-train from egocentric video (Ego4D, YouTube) │
│ - Input: consecutive frame pairs (o_t, o_{t+1}) │
│ - Output: latent action z_t (initializes latent space) │
│ - Inverse dynamics + Forward dynamics learning │
└────────────────────────┬─────────────────────────────────┘
│ learned latent space
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 2: VLA Policy (Upper Body) │
│ - Fine-tune VLM on robot demonstration data │
│ - Input: egocentric image + language command │
│ - Output: z_t (latent manipulation intent) │
│ - Decode z_t → arm joint commands (via action decoder) │
└────────────────────────┬─────────────────────────────────┘
│ z_t passed down
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 3: LMO RL Policy (Lower Body) │
│ - Train in Isaac Lab with PPO │
│ - Input: z_t + proprioception + IMU │
│ - Output: leg joint torques │
│ - Knows z_t → knows manipulation intent → adjusts stance │
└──────────────────────────────────────────────────────────┘
The key point: z_t is the bridge. It's not a specific action ("rotate shoulder joint 15 degrees") but an abstract intent ("reaching right arm forward"). The locomotion policy receives z_t and reasons: "arm is reaching far → lower center of mass, widen stance."
Layer 1: Latent Action Model — Learning from free video
The robot data problem
Labeled action data for robots is extremely scarce and expensive. Each episode requires teleoperation — a human driving the robot through a VR controller for several minutes. Collecting 100K episodes can take months and cost hundreds of thousands of dollars.
Meanwhile, egocentric video (first-person perspective video) has millions of hours available online. The Ego4D dataset alone contains over 3,670 hours of video. The catch: these videos only have images, no action labels — nobody annotates "the hand is rotating 30 degrees right now" on YouTube videos.
Inverse Dynamics: turning video into "pseudo-actions"
The LAM idea is elegantly simple: if frame t and frame t+1 differ, some action must have occurred between them. The model doesn't need to know the specific action — it only needs to learn a latent representation of that change.
import torch
import torch.nn as nn
class LatentActionModel(nn.Module):
"""
Latent Action Model (LAM) — core component of WholebodyVLA.
Learns latent action representations from consecutive video frame pairs,
WITHOUT requiring action labels.
Training objective:
1. Inverse dynamics: (o_t, o_{t+1}) → z_t
"Frames changed like this, so the action must be this"
2. Forward dynamics: (o_t, z_t) → o_{t+1}_pred
"If the action is z_t, the next frame should look like this"
"""
def __init__(self, vision_dim=768, latent_dim=256):
super().__init__()
self.latent_dim = latent_dim
# Inverse dynamics: two frames → latent action
self.inverse_head = nn.Sequential(
nn.Linear(vision_dim * 2, 1024),
nn.LayerNorm(1024),
nn.GELU(),
nn.Linear(1024, 512),
nn.GELU(),
nn.Linear(512, latent_dim),
)
# Forward dynamics: frame + action → next frame prediction
self.forward_head = nn.Sequential(
nn.Linear(vision_dim + latent_dim, 1024),
nn.LayerNorm(1024),
nn.GELU(),
nn.Linear(1024, 512),
nn.GELU(),
nn.Linear(512, vision_dim),
)
# VQ-VAE discretization for latent space
# Makes z_t compact and easier to predict
self.codebook_size = 512
self.codebook = nn.Embedding(self.codebook_size, latent_dim)
def quantize(self, z_continuous):
"""Vector quantization — map continuous z to nearest codebook entry."""
distances = torch.cdist(
z_continuous.unsqueeze(1),
self.codebook.weight.unsqueeze(0),
).squeeze(1)
indices = distances.argmin(dim=-1)
z_quantized = self.codebook(indices)
# Straight-through estimator for gradient
z_quantized = z_continuous + (z_quantized - z_continuous).detach()
return z_quantized, indices
def encode_action(self, feat_t, feat_t1):
"""From two frame features → latent action."""
combined = torch.cat([feat_t, feat_t1], dim=-1)
z_continuous = self.inverse_head(combined)
z_quantized, indices = self.quantize(z_continuous)
return z_quantized, z_continuous, indices
def predict_next(self, feat_t, z_t):
"""From current frame + action → predict next frame."""
combined = torch.cat([feat_t, z_t], dim=-1)
return self.forward_head(combined)
    def compute_loss(self, feat_t, feat_t1):
        """Training loss = forward + codebook + commitment terms."""
        z_q, z_c, indices = self.encode_action(feat_t, feat_t1)
        feat_t1_pred = self.predict_next(feat_t, z_q)
        forward_loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        # Pull the selected codebook entries toward the encoder output —
        # without this term the codebook weights receive no gradient at all
        codebook_loss = nn.functional.mse_loss(self.codebook(indices), z_c.detach())
        # Commitment: keep the encoder output close to its chosen code
        commitment_loss = nn.functional.mse_loss(z_c, z_q.detach())
        return forward_loss + codebook_loss + 0.25 * commitment_loss
Why the additional Vector Quantization? Because a continuous latent space can be vague — every action differs only slightly. VQ-VAE forces z_t to fall into one of K "code words" — like compressing actions into one of 512 basic action types. This makes it easier for the VLA at Layer 2 to predict (predicting 1 of 512 tokens vs. regressing a continuous vector).
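The quantization step can also be seen standalone — a minimal NumPy-only sketch (no gradients, so the straight-through trick from the class above is omitted) of snapping continuous latents onto the nearest of K codebook entries:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 256                          # codebook size and latent dim, as above
codebook = rng.normal(size=(K, D))
z_continuous = rng.normal(size=(4, D))   # a batch of 4 continuous latents

# Squared Euclidean distance from every latent to every codebook entry
dists = ((z_continuous[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=-1)          # one code id per latent
z_quantized = codebook[indices]          # latents snapped onto the codebook

print(indices.shape, z_quantized.shape)  # (4,) (4, 256)
```

Each of the 4 latents collapses to one of 512 discrete codes — exactly the "1 of 512 tokens" the VLA at Layer 2 will later predict.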
Data pipeline for LAM
import random

from torch.utils.data import Dataset
from torchvision import transforms

class EgocentricVideoDataset(Dataset):
    """
    Dataset over Ego4D or similar egocentric video sources.
    Each sample is a pair of frames separated by `frame_gap`.
    """
    def __init__(self, video_paths, frame_gap=2):
        self.videos = video_paths
        self.frame_gap = frame_gap
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            ),
        ])

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        video = self.load_video(idx)  # decode frames for video `idx` (not shown)
        t = random.randint(0, len(video) - self.frame_gap - 1)
        frame_t = self.transform(video[t])
        frame_t1 = self.transform(video[t + self.frame_gap])
        return frame_t, frame_t1
An important detail: frame_gap=2 rather than 1. Consecutive frames in 30 FPS video are nearly identical — the difference is so small that the model tends to learn an identity mapping instead of actual actions. A gap of 2-4 frames gives a clearer action signal.
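To see what the gap buys you, here is a tiny self-contained sketch of the sampling logic with a dummy "video" of frame indices — at 30 FPS, a gap of 2 frames is roughly 67 ms of motion:

```python
import random

FPS = 30
frame_gap = 2
video = list(range(90))   # dummy 3-second "video": frame indices stand in for frames

def sample_pair(video, frame_gap):
    """Pick a random frame t and its partner frame_gap frames later."""
    t = random.randint(0, len(video) - frame_gap - 1)
    return video[t], video[t + frame_gap]

f_t, f_t1 = sample_pair(video, frame_gap)
gap_ms = (f_t1 - f_t) / FPS * 1000
print(f"pair separated by {gap_ms:.0f} ms")   # 67 ms at 30 FPS
```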
Layer 2: VLA Policy — From vision and language to action
After LAM initializes the latent space, Layer 2 fine-tunes a Vision-Language Model to predict latent actions from egocentric images + language instructions.
VLA Architecture
class WholebodyVLAPolicy(nn.Module):
"""
VLA Policy for upper body manipulation.
Key differences from traditional VLAs (RT-2, OpenVLA):
- Outputs latent tokens instead of explicit joint commands
- Pretrained latent space from video (not just robot data)
- Dual-arm coordination via shared latent
"""
def __init__(
self,
vlm_backbone,
lam: LatentActionModel,
num_action_tokens=8,
action_dim=14, # 7 joints × 2 arms
):
super().__init__()
self.vlm = vlm_backbone
self.lam = lam
hidden_dim = vlm_backbone.config.hidden_size
self.latent_projector = nn.Sequential(
nn.Linear(hidden_dim, 512),
nn.GELU(),
nn.Linear(512, lam.latent_dim * num_action_tokens),
)
self.action_decoder = nn.Sequential(
nn.Linear(lam.latent_dim * num_action_tokens, 256),
nn.GELU(),
nn.Linear(256, action_dim),
)
self.num_action_tokens = num_action_tokens
for param in self.lam.parameters():
param.requires_grad = False
    def forward(self, image, language_tokens):
        vlm_out = self.vlm(
            pixel_values=image,
            input_ids=language_tokens,
        ).last_hidden_state[:, -1, :]         # last-token summary of (image, text)
        z_flat = self.latent_projector(vlm_out)
        B = z_flat.shape[0]
        z_tokens = z_flat.view(B, self.num_action_tokens, -1)
        # Pool the action tokens into the single z_t handed to the LMO policy
        # (optionally snap it to the LAM codebook via self.lam.quantize)
        z_t = z_tokens.mean(dim=1)
        arm_actions = self.action_decoder(z_flat)
        return z_t, arm_actions
Dual-arm coordination
A subtle point: num_action_tokens=8 — the model predicts multiple tokens per timestep, not one. Why?
With dual-arm manipulation, each arm needs its own trajectory but must coordinate. 8 tokens allow the model to encode:
- Tokens 1-4: left arm trajectory
- Tokens 5-8: right arm trajectory
- Attention between tokens → both arms "know" what each other is doing
This corrects a point from the earlier research review: the paper does support dual-arm manipulation, although the released demos haven't yet shown full bimanual tasks.
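The token split can be illustrated with a small standalone sketch (the 4/4 left/right assignment is this article's reading of the design, not a confirmed detail of the paper):

```python
import numpy as np

B, num_action_tokens, latent_dim = 2, 8, 256
rng = np.random.default_rng(1)
z_tokens = rng.normal(size=(B, num_action_tokens, latent_dim))

# Hypothetical split: tokens 1-4 encode the left arm, tokens 5-8 the right arm
z_left = z_tokens[:, :4].mean(axis=1)    # (B, 256) pooled left-arm intent
z_right = z_tokens[:, 4:].mean(axis=1)   # (B, 256) pooled right-arm intent

# The single z_t handed to the locomotion policy pools over ALL tokens,
# so it carries both arms' intent at once
z_t = z_tokens.mean(axis=1)
assert np.allclose(z_t, (z_left + z_right) / 2)
```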
Layer 3: LMO RL Policy — Legs know what arms are doing
This is what I consider the most significant contribution of the paper, and the part where open-source code will be most valuable to the community.
What is LMO and why is it different?
Loco-Manipulation-Oriented (LMO) RL policy is a locomotion controller trained with awareness that the upper body is manipulating objects. Unlike standard locomotion policies (which only know "walk straight" or "turn left"), LMO receives z_t from VLA and adjusts:
- Stance width — widens stance for complex manipulation
- CoM compensation — shifts center of mass opposite to reaching arm
- Foot placement — positions feet optimally for stability
- Gait adaptation — switches from walking to standing when precise manipulation is needed
Observation space design
from dataclasses import dataclass
import numpy as np
@dataclass
class LMOObservation:
"""
Observation space for LMO RL policy.
Three groups: proprioception, exteroception, and manipulation context.
"""
# === Proprioception ===
joint_positions: np.ndarray # [12] — 6 joints × 2 legs
joint_velocities: np.ndarray # [12]
base_orientation: np.ndarray # [4] — quaternion
base_angular_vel: np.ndarray # [3] — gyroscope
base_linear_acc: np.ndarray # [3] — accelerometer
# === Exteroception ===
foot_contacts: np.ndarray # [4] — 2 feet × (heel, toe)
# === Manipulation context (KEY INNOVATION) ===
latent_action: np.ndarray # [256] — z_t from VLA encoder
@property
def as_vector(self) -> np.ndarray:
return np.concatenate([
self.joint_positions, # 12
self.joint_velocities, # 12
self.base_orientation, # 4
self.base_angular_vel, # 3
self.base_linear_acc, # 3
self.foot_contacts, # 4
self.latent_action, # 256
]) # Total: 294 dimensions
@property
def dim(self) -> int:
return 294
Note the dimensions: 256 out of 294 dims come from z_t. The latent action accounts for 87% of the observation space — showing just how important manipulation context is for locomotion.
Detailed reward function
Reward design is where "art" meets "science" in RL. The LMO reward has multiple balanced components:
class LMORewardFunction:
"""
Multi-component reward for LMO policy.
Design philosophy:
- Locomotion policy does NOT just optimize for movement
- It must optimize for CREATING A STABLE FOUNDATION
for manipulation to succeed
- z_t allows locomotion to "look ahead" at manipulation intent
"""
def __init__(self):
self.weights = {
'orientation': 0.25,
'angular_velocity': 0.10,
'linear_velocity': 0.10,
'foot_contact': 0.05,
'manipulation_support': 0.20,
'com_stability': 0.15,
'energy': 0.05,
'smoothness': 0.10,
}
def compute(self, state, z_t, action, prev_action):
rewards = {}
# 1. Orientation: penalize excessive tilt
roll, pitch = state['euler'][:2]
rewards['orientation'] = torch.exp(
-5.0 * (roll**2 + pitch**2)
)
# 2. Angular velocity: penalize rapid rotation
rewards['angular_velocity'] = torch.exp(
-2.0 * torch.norm(state['angular_vel'])
)
# 3. Velocity tracking
vel_error = state['linear_vel'][:2] - state['cmd_vel'][:2]
rewards['linear_velocity'] = torch.exp(
-4.0 * torch.norm(vel_error)
)
# 4. Foot contact: must always have ground contact
rewards['foot_contact'] = (
state['foot_contacts'].sum() >= 1
).float()
# 5. MANIPULATION SUPPORT (key innovation)
z_magnitude = torch.norm(z_t)
stance_width = self._compute_stance_width(state)
target_width = 0.3 + 0.15 * torch.tanh(z_magnitude / 10.0)
rewards['manipulation_support'] = torch.exp(
-3.0 * (stance_width - target_width)**2
)
# 6. CoM within support polygon
com_pos = state['com_position'][:2]
support = self._compute_support_polygon(state)
com_margin = self._point_to_polygon_distance(com_pos, support)
rewards['com_stability'] = torch.clamp(com_margin, 0, 1)
# 7. Energy efficiency
rewards['energy'] = torch.exp(
-0.01 * torch.sum(action**2)
)
# 8. Action smoothness
action_diff = action - prev_action
rewards['smoothness'] = torch.exp(
-5.0 * torch.sum(action_diff**2)
)
total = sum(
self.weights[k] * rewards[k] for k in self.weights
)
return total, rewards
Key insight from reward design
Look at the weights: manipulation_support (0.20) + com_stability (0.15) means 35% of the reward is directly tied to supporting manipulation — the same share as pure locomotion tracking (orientation 0.25 + linear_velocity 0.10 = 35%).
In other words, the LMO policy devotes as much of its objective to creating a stable foundation for the arms as to walking itself. This is exactly why it outperforms the decoupled approach by 21.3% on average — a standard locomotion policy allocates 100% of its reward to walking and 0% to manipulation support.
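A quick sanity check of those weight groupings, using the dictionary from the reward class above:

```python
weights = {
    'orientation': 0.25, 'angular_velocity': 0.10, 'linear_velocity': 0.10,
    'foot_contact': 0.05, 'manipulation_support': 0.20, 'com_stability': 0.15,
    'energy': 0.05, 'smoothness': 0.10,
}

manip_share = weights['manipulation_support'] + weights['com_stability']
loco_share = weights['orientation'] + weights['linear_velocity']

print(f"manipulation-support share: {manip_share:.2f}")   # 0.35
print(f"locomotion-tracking share:  {loco_share:.2f}")    # 0.35
assert abs(sum(weights.values()) - 1.0) < 1e-9            # weights sum to 1
```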
Training Pipeline: From video to real robot
Stage 1: Pre-train LAM (offline, no robot needed)
# LAM training script
# Data: Ego4D dataset (~3,670 hours of egocentric video)
# Hardware: 4× A100 80GB, ~3 days
python train_lam.py \
--data_path /data/ego4d/v2/ \
--vision_encoder vit-base-patch16 \
--latent_dim 256 \
--codebook_size 512 \
--frame_gap 2 \
--batch_size 256 \
--lr 1e-4 \
--epochs 50 \
--num_gpus 4
Stage 2: Collect robot demonstration data
WholebodyVLA proposes an efficient data collection pipeline:
- Teleoperation on AgiBot X2 — operator controls arms via VR controller
- Record: egocentric image + joint states + language annotation
- Augmentation: add camera noise, lighting variation
The paper collects ~500 episodes per task — far fewer than RT-2 (~130K) thanks to the pre-trained latent space.
Stage 3: Fine-tune VLA Policy
python train_vla.py \
--vlm_backbone qwen2-vl-2b \
--lam_checkpoint checkpoints/lam_ego4d.pt \
--robot_data /data/agibot_x2/demonstrations/ \
--freeze_lam \
--batch_size 32 \
--lr 5e-5 \
--epochs 20 \
--num_action_tokens 8
Stage 4: Train LMO RL Policy in Isaac Lab
from omni.isaac.lab.envs import ManagerBasedRLEnv
class LMOTrainingConfig:
"""Hyperparameters for LMO RL training."""
num_envs = 4096
episode_length = 1000 # 20 seconds @ 50Hz
sim_dt = 0.005 # 200Hz physics
control_dt = 0.02 # 50Hz policy
ppo_config = {
'learning_rate': 3e-4,
'n_steps': 24,
'batch_size': num_envs * 24,
'n_epochs': 5,
'gamma': 0.99,
'gae_lambda': 0.95,
'clip_range': 0.2,
'ent_coef': 0.01,
'vf_coef': 0.5,
'max_grad_norm': 1.0,
}
policy_net = [256, 256, 128]
activation = 'elu'
domain_rand = {
'friction_range': (0.3, 1.5),
'payload_mass_range': (0, 5.0),
'motor_strength_range': (0.8, 1.2),
'push_force_range': (0, 50),
'latent_noise_std': 0.05,
'terrain_roughness': (0, 0.05),
'com_displacement': (0, 0.05),
}
total_timesteps = 500_000_000 # ~6-8 hours on 1× A100
Critical detail: latent_noise_std. At deployment, the VLA encoder runs at ~20Hz on GPU while the RL policy runs at 50Hz on CPU. Between VLA updates, z_t is interpolated and noisy. Training with noise simulates this — a form of domain randomization for the inference pipeline, not just physics.
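That z_t noise injection can be sketched in a few lines (randomize_latent is an illustrative helper, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(42)
latent_noise_std = 0.05   # matches domain_rand['latent_noise_std'] above

def randomize_latent(z_t, std, rng):
    """Perturb the VLA latent during RL training so the policy learns to
    tolerate the stale, interpolated z_t it will see at deployment."""
    return z_t + rng.normal(scale=std, size=z_t.shape)

z_clean = rng.normal(size=(256,))
z_noisy = randomize_latent(z_clean, latent_noise_std, rng)
assert z_noisy.shape == z_clean.shape
assert not np.allclose(z_noisy, z_clean)   # perturbation was applied
```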
Deployment: Real-world inference pipeline
When deployed on real hardware, the system runs at 3 different frequencies:
Camera (30 FPS) ─── image ──→ VLA Encoder (~20Hz, GPU)
│
z_t (async update)
│
Proprioception (1kHz) ─── obs ──→ LMO Policy (50Hz, CPU)
│
joint targets
│
PD Controller (1kHz, embedded)
│
motor torques ──→ Motors
Handling frequency mismatch
import threading
import time
import numpy as np
class WholebodyVLARuntime:
"""
Runtime inference for WholebodyVLA.
Handles frequency mismatch between VLA (20Hz) and RL (50Hz).
"""
    def __init__(self, vla_model, lmo_model):
        self.vla = vla_model
        self.lmo = lmo_model
        self.running = True              # set to False to stop both loops
        self._z_current = np.zeros(256)
        self._z_prev = np.zeros(256)
        self._z_timestamp = time.time()  # avoids a huge "age" on the first tick
        self._lock = threading.Lock()
def vla_loop(self, camera, language_cmd):
"""VLA inference loop — runs on GPU thread at ~20Hz."""
while self.running:
image = camera.get_frame()
z_new, arm_actions = self.vla.predict(image, language_cmd)
with self._lock:
self._z_prev = self._z_current.copy()
self._z_current = z_new
self._z_timestamp = time.time()
self.send_arm_commands(arm_actions)
def lmo_loop(self, robot):
"""LMO RL inference loop — runs on CPU thread at 50Hz."""
dt = 0.02
while self.running:
start = time.time()
proprio = robot.get_proprioception()
with self._lock:
age = time.time() - self._z_timestamp
alpha = min(age / 0.05, 1.0)
z_interp = (1 - alpha) * self._z_prev + alpha * self._z_current
obs = np.concatenate([
proprio['joint_pos'],
proprio['joint_vel'],
proprio['base_quat'],
proprio['gyro'],
proprio['accel'],
proprio['foot_contacts'],
z_interp,
])
leg_actions = self.lmo.predict(obs)
robot.set_leg_targets(leg_actions)
elapsed = time.time() - start
if elapsed < dt:
time.sleep(dt - elapsed)
Results: Numbers speak for themselves
The paper tests on 10 tasks with AgiBot X2, 20 trials each:
| Task Category | Baseline (Decoupled) | WholebodyVLA | Δ |
|---|---|---|---|
| Static manipulation | 68% | 79% | +11% |
| Walk + carry | 38% | 65% | +27% |
| Walk + place on shelf | 45% | 72% | +27% |
| Push heavy object | 41% | 55% | +14% |
| Average | 46.5% | 67.5% | +21.3% |
Clear pattern: the more coordination required between arms and legs, the larger the improvement. Static manipulation only gains 11% (legs barely move), but walk + carry/place gains 27% — exactly where decoupled approaches fail.
Ablation study
| Component removed | Success rate | Drop |
|---|---|---|
| Full WholebodyVLA | 67.5% | — |
| Remove LMO reward (standard locomotion) | 52.1% | -15.4% |
| Remove egocentric video pre-training | 55.8% | -11.7% |
| Remove latent actions (use explicit) | 58.2% | -9.3% |
| Remove domain rand on z_t | 61.3% | -6.2% |
The LMO reward is the most critical factor — removing it drops success by 15.4 points. This confirms that the main issue isn't the VLA architecture, but whether locomotion knows what manipulation is doing.
Building a similar system: Practical roadmap
The repo hasn't released full code yet, but you can build a similar pipeline with open-source tools:
Step 1: LAM from Ego4D
- Dataset: Ego4D — free for research
- Vision encoder: ViT-Base from DINOv2 (better than ImageNet-pretrained for video)
- Framework: PyTorch + HuggingFace Transformers
- GPU: 1× RTX 4090 is sufficient for prototyping (smaller batch size)
Step 2: VLA Policy
- VLM backbone: Qwen2-VL-2B or PaliGemma 2 (small, runs on consumer GPU)
- Robot data: if no physical robot, use DROID dataset
- Or: start with LeRobot framework for data collection
Step 3: LMO RL in Isaac Lab
- Simulator: Isaac Lab — free, runs on RTX 3060+
- Robot model: Unitree H1 or G1 URDF (available in Isaac Lab)
- RL algorithm: PPO via RSL-RL or RL Games (built-in)
Step 4: Sim-to-real (if you have hardware)
- Deploy on Unitree G1 (~$30K) — most affordable humanoid for research
- Use established sim-to-real pipeline for Unitree
Known limitations
- Code not fully open-sourced — current repo is mainly reference documentation, no training scripts released yet
- Hardware requirements — real-time VLA inference needs an A100-class GPU; a consumer GPU is only sufficient for offline training
- Flat terrain only — no stairs, slopes, or outdoor environments tested
- Single language — language commands in English only, no multilingual support
Conclusion: Why this is a turning point
WholebodyVLA isn't just a good paper — it represents a paradigm shift in humanoid robotics:
- From decoupled (arms and legs separate) to unified (shared latent space)
- From robot-only data to internet-scale video data
- From explicit actions to latent actions (much more scalable)
For the robotics community, this paper opens research directions that don't require expensive hardware upfront — LAM and VLA can be trained from video, LMO RL can be trained in simulation. All you need is a GPU and time.
References
- WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control — Haoran Jiang, Jin Chen et al., ICLR 2026
- GitHub: OpenDriveLab/WholebodyVLA — Official repository
- Ego4D: Around the World in 3,000 Hours of Egocentric Video
- Isaac Lab Documentation
- AgiBot X2 Humanoid Robot — Hardware platform used in the paper
Related posts
- WholeBodyVLA: Whole-Body VLA for Humanoid Loco-Manipulation — Original research analysis
- Humanoid Loco-Manipulation: When Robots Walk and Work — Humanoid series
- VLA Models: RT-2, Octo, OpenVLA, pi0 — VLA model overview