The Core Problem: Humanoids Must Walk and Manipulate Simultaneously
Imagine carrying a hot cup of coffee across a room. Your brain simultaneously has to: (1) keep the cup stable, (2) adjust your steps to avoid spilling, and (3) avoid obstacles. Two systems — locomotion and manipulation — must coordinate tightly, not run independently in parallel.
This is exactly the loco-manipulation problem for humanoid robots, and it is where most current methods fail. Why? Because they split locomotion and manipulation into two separate modules and stitch them together — like having one person controlling the arms while another controls the legs. The result: the robot falls when trying to manipulate, or manipulation fails because the locomotion policy does not understand what manipulation needs.
WholeBodyVLA (Jiang, Chen et al., ICLR 2026) from Fudan University, OpenDriveLab, and AGIBOT solves this with a unified architecture — a VLA for manipulation combined with an RL policy for locomotion, both sharing a common latent space.
Why This Paper Matters
WholeBodyVLA achieves 3 things that no prior paper accomplished simultaneously:
- Learning manipulation from egocentric video without action labels — leveraging millions of YouTube/Ego4D videos shot from a first-person perspective
- Manipulation-aware locomotion — the robot's legs automatically adjust posture based on what the arms are doing
- Outperforming baselines by 21.3% on a real robot (AgiBot X2), not just simulation
The real-robot results on AgiBot X2 are particularly noteworthy — this is not a "nice demo in sim and nothing else" paper.
Architecture: Two Layers, One Latent Space
WholeBodyVLA consists of 2 main layers:
```
┌─────────────────────────────────────────────┐
│          VLA Encoder (Upper Body)           │
│                                             │
│  [Egocentric Image] + [Language Command]    │
│                  ↓                          │
│    Vision-Language Model (pretrained)       │
│                  ↓                          │
│         Latent Action Space z_t             │
│     (learned from action-free video)        │
└─────────────┬───────────────────────────────┘
              │ z_t (latent manipulation intent)
              ↓
┌─────────────────────────────────────────────┐
│         LMO RL Policy (Lower Body)          │
│                                             │
│     [z_t] + [Proprioception] + [IMU]        │
│                  ↓                          │
│     PPO Policy (trained in Isaac Lab)       │
│                  ↓                          │
│    Joint position targets for locomotion    │
│   (knows manipulation → adjusts stance)     │
└─────────────────────────────────────────────┘
```
The key is the latent vector z_t — it encodes both manipulation intent and spatial context, allowing the locomotion policy to "know" what the robot's arms are trying to do and adjust posture accordingly.
Layer 1: VLA Encoder — Learning from Video Without Actions
The Problem: Robot Manipulation Data is Extremely Scarce
Traditional VLA training requires (observation, action) pairs from robots — extremely expensive to collect. Bridge V2 has ~60K episodes, Open X-Embodiment has ~800K — sounds like a lot, but compared to billions of image-text pairs for VLMs, it is tiny.
WholeBodyVLA solves this with latent action learning: instead of requiring explicit action labels, the model learns latent representations from consecutive egocentric video frames.
Inverse Dynamics in Latent Space
The idea: if two consecutive video frames change, some action must have occurred. The model does not need to know the specific action — it only needs to learn a latent representation of that "change".
```python
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """
    Learn latent actions from consecutive frame pairs.
    No action labels needed — only video frames.
    """

    def __init__(self, vision_encoder, latent_dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder  # Pretrained ViT
        self.latent_dim = latent_dim

        # Inverse dynamics model: (s_t, s_{t+1}) -> z_t
        # z_t is the latent action between two frames
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size * 2, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

        # Forward dynamics: (s_t, z_t) -> s_{t+1}_pred
        # Ensures z_t encodes enough info to predict the next frame
        self.forward_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vision_encoder.hidden_size),
        )

    def encode_latent_action(self, frame_t, frame_t1):
        """Compute latent action z_t from two consecutive frames."""
        feat_t = self.vision_encoder(frame_t)    # [B, D]
        feat_t1 = self.vision_encoder(frame_t1)  # [B, D]
        combined = torch.cat([feat_t, feat_t1], dim=-1)  # [B, 2D]
        z_t = self.inverse_dynamics(combined)    # [B, latent_dim]
        return z_t, feat_t, feat_t1

    def forward(self, frame_t, frame_t1):
        z_t, feat_t, feat_t1 = self.encode_latent_action(frame_t, frame_t1)
        # Forward-dynamics prediction: (s_t, z_t) -> s_{t+1}.
        # Gradients flow back through z_t, so this single loss also
        # trains the inverse dynamics model.
        feat_t1_pred = self.forward_dynamics(
            torch.cat([feat_t, z_t], dim=-1)
        )
        # L2 loss in feature space
        loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        return z_t, loss
```
A subtle point: there is a single explicit loss (the forward-prediction error), but it trains both modules jointly:
- Inverse dynamics: gradients flowing through z_t force it to encode "what action occurred" between the two frames
- Forward dynamics: z_t must contain enough information to reconstruct the next state's features
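To make the training recipe concrete, here is a minimal, runnable sketch of the pre-training loop over unlabeled frame pairs. The DummyViT, the reduced layer sizes, and the synthetic "video" tensors are stand-ins for a pretrained ViT and real egocentric footage, not details from the paper.

```python
import torch
import torch.nn as nn


class DummyViT(nn.Module):
    """Stand-in for a pretrained ViT: flattens a tiny frame to features."""
    hidden_size = 64

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, self.hidden_size)

    def forward(self, x):               # x: [B, 3, 16, 16]
        return self.proj(x.flatten(1))  # [B, hidden_size]


class LatentActionEncoder(nn.Module):
    """Compact version of the encoder above (smaller layers for speed)."""

    def __init__(self, vision_encoder, latent_dim=32):
        super().__init__()
        self.vision_encoder = vision_encoder
        d = vision_encoder.hidden_size
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.forward_dynamics = nn.Sequential(
            nn.Linear(d + latent_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, frame_t, frame_t1):
        feat_t = self.vision_encoder(frame_t)
        feat_t1 = self.vision_encoder(frame_t1)
        z_t = self.inverse_dynamics(torch.cat([feat_t, feat_t1], dim=-1))
        pred = self.forward_dynamics(torch.cat([feat_t, z_t], dim=-1))
        return z_t, nn.functional.mse_loss(pred, feat_t1.detach())


torch.manual_seed(0)
model = LatentActionEncoder(DummyViT())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic "consecutive" frame pairs standing in for egocentric video
frames_t = torch.randn(200, 3, 16, 16)
frames_t1 = frames_t + 0.1 * torch.randn_like(frames_t)

for step in range(20):
    idx = torch.randint(0, 200, (16,))
    z_t, loss = model(frames_t[idx], frames_t1[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that no action label ever appears in the loop: the supervision signal is entirely the next frame itself.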
VLA with Latent Actions
After pre-training the latent action encoder, WholeBodyVLA fine-tunes a VLM to predict latent actions instead of explicit robot actions:
```python
class WholeBodyVLAEncoder(nn.Module):
    """
    VLA outputs latent actions instead of explicit joint commands.
    The latent space is pre-trained from egocentric video.
    """

    def __init__(self, vlm_backbone, latent_action_encoder):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., LLaVA, Qwen-VL
        self.latent_projector = nn.Linear(
            vlm_backbone.config.hidden_size,
            latent_action_encoder.latent_dim,
        )
        self.latent_action_encoder = latent_action_encoder

        # Freeze the latent encoder after pre-training
        for param in self.latent_action_encoder.parameters():
            param.requires_grad = False

    def forward(self, image, language_instruction):
        """
        Input: egocentric image + language command
        Output: latent action z_t
        """
        vlm_output = self.vlm(
            pixel_values=image,
            input_ids=language_instruction,
        )
        # Project the last hidden state to the latent action space
        hidden = vlm_output.last_hidden_state[:, -1, :]
        z_t = self.latent_projector(hidden)  # [B, latent_dim]
        return z_t
```
A subtle but critical point: the VLA does not need to know which specific robot it is controlling — it only outputs "manipulation intent" in latent space. The locomotion policy below maps z_t to motor commands for the specific robot.
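For the upper body itself, prior latent-action work typically grounds the embodiment-agnostic latent in a specific robot with a small decoder head fine-tuned on a modest amount of paired robot data. The sketch below is an assumption about how this could look; the class name, layer sizes, and the 7-joint arm are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class RobotActionDecoder(nn.Module):
    """Hypothetical head mapping a 256-dim latent action to arm joint
    targets for one specific robot. Fine-tuned on a small dataset of
    (z_t, joint_command) pairs; the VLA itself stays embodiment-agnostic."""

    def __init__(self, latent_dim=256, num_arm_joints=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_arm_joints),
        )

    def forward(self, z_t):
        # [B, latent_dim] -> [B, num_arm_joints] joint position targets
        return self.net(z_t)


decoder = RobotActionDecoder()
z_t = torch.randn(1, 256)    # latent intent from the VLA encoder
arm_targets = decoder(z_t)   # -> torch.Size([1, 7])
```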
Layer 2: LMO RL Policy — Manipulation-Aware Locomotion
This is the most important contribution — the Loco-Manipulation-Oriented (LMO) RL policy.
Why Standard Locomotion Policies Are Not Enough
Standard locomotion policies (like those running on the Unitree Go2 or Boston Dynamics robots) are trained to walk straight, maintain balance, and follow velocity commands. They have no idea what the upper body is doing.
Real-world consequences:
| Situation | Standard Locomotion | LMO Policy |
|---|---|---|
| Robot reaches arm up high | Stands normally → falls | Lowers CoM, widens stance |
| Robot leans right to pick up | No compensation → CoM shift | Steps left to balance |
| Robot pushes heavy object | No friction increase → slides | Increases ground reaction force |
LMO Reward Design
The LMO RL reward function has 3 main components:
```python
import torch

# quaternion_to_euler, compute_com_support, and compute_stance_width are
# assumed utility functions, not shown here.


def compute_lmo_reward(
    # State information
    base_orientation,   # quaternion body orientation
    base_velocity,      # linear + angular velocity
    joint_positions,    # current joint positions
    joint_torques,      # applied torques
    foot_contacts,      # foot ground contact flags
    # Manipulation context
    z_t,                # latent action from VLA encoder
    ee_position,        # end-effector position
    ee_target,          # manipulation target position
    # Config
    dt=0.02,
):
    """
    LMO reward: 3 main components
      1. Stability: keep the body stable
      2. Manipulation support: facilitate manipulation
      3. Energy efficiency: minimize torque waste
    """
    # === 1. STABILITY REWARD ===
    roll, pitch, yaw = quaternion_to_euler(base_orientation)
    r_orientation = torch.exp(-5.0 * (roll**2 + pitch**2))
    r_angular_vel = torch.exp(-2.0 * torch.norm(base_velocity[3:], dim=-1))
    r_contact = (foot_contacts.sum(dim=-1) >= 1).float()

    # === 2. MANIPULATION SUPPORT REWARD ===
    # Key innovation: locomotion is rewarded for SUPPORTING
    # manipulation, not just walking straight
    ee_error = torch.norm(ee_position - ee_target, dim=-1)
    r_manipulation = torch.exp(-3.0 * ee_error)

    # Manipulation-aware CoM adjustment
    com_to_target = compute_com_support(
        base_orientation, joint_positions, ee_target
    )
    r_com_support = torch.exp(-1.0 * com_to_target)

    # Latent-conditioned bonus: the RL policy sees z_t -> knows intent
    z_magnitude = torch.norm(z_t, dim=-1)
    stance_width = compute_stance_width(joint_positions)
    # Larger manipulation motion -> wider stance
    r_stance = torch.exp(-2.0 * torch.abs(
        stance_width - 0.3 - 0.1 * z_magnitude
    ))

    # === 3. ENERGY EFFICIENCY ===
    r_energy = torch.exp(-0.01 * torch.sum(joint_torques**2, dim=-1))

    # === COMBINE ===
    reward = (
        0.3 * r_orientation +
        0.15 * r_angular_vel +
        0.1 * r_contact +
        0.2 * r_manipulation +
        0.1 * r_com_support +
        0.1 * r_stance +
        0.05 * r_energy
    )
    return reward
```
The core differentiator: r_manipulation and r_com_support — the locomotion policy is rewarded when it helps manipulation succeed, not just when it maintains balance.
Training Pipeline in Isaac Lab
```python
import torch
from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.utils import configclass
from stable_baselines3 import PPO


@configclass
class LMOEnvCfg:
    """Config for the LMO policy training environment (schematic — a real
    Isaac Lab config subclasses ManagerBasedRLEnvCfg and defines scene,
    observation, and action managers)."""

    robot_asset = "agibot_x2.usd"

    observation_space = {
        "proprioception": 45,    # joint positions + velocities
        "imu": 6,                # linear acc + angular vel
        "latent_action": 256,    # z_t from VLA encoder
        "foot_contacts": 4,      # binary contact sensors
        "base_orientation": 4,   # quaternion
    }
    action_space = 12  # 6 joints x 2 legs

    # Domain randomization — crucial for sim-to-real
    domain_rand = {
        "friction": (0.3, 1.5),
        "payload_mass": (0, 5.0),     # kg, simulates held objects
        "motor_strength": (0.8, 1.2),
        "push_force": (0, 50),        # N, random perturbations
        "latent_noise": 0.05,         # noise on z_t
    }

    terrain = "rough"
    max_episode_length = 1000  # 20 seconds at 50 Hz
    sim_dt = 0.005             # physics at 200 Hz
    control_dt = 0.02          # policy at 50 Hz


def train_lmo_policy():
    """Train the LMO policy with PPO in Isaac Lab."""
    env = ManagerBasedRLEnv(cfg=LMOEnvCfg())
    # Note: stable-baselines3 expects a VecEnv, so the Isaac Lab env must
    # be wrapped first (e.g., with Sb3VecEnvWrapper; the module path
    # varies across Isaac Lab versions).

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        n_steps=24,
        batch_size=4096,
        n_epochs=5,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=1.0,
        policy_kwargs={
            "net_arch": [256, 256, 128],
            "activation_fn": torch.nn.ELU,
        },
        verbose=1,
    )

    # Train 500M steps — approximately 6-8 hours on 1x A100
    model.learn(total_timesteps=500_000_000)
    model.save("lmo_policy_agibot_x2")
    return model
```
Domain Randomization for Sim-to-Real
A critical detail the paper emphasizes: latent_noise in domain randomization. During deployment, the VLA encoder runs on a separate GPU with ~50ms latency, so the z_t the policy receives is delayed and jittery. Injecting noise on z_t during training makes the RL policy robust to this VLA inference jitter.
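A minimal sketch of what that randomization could look like in the training loop, assuming Gaussian perturbation plus occasionally stale latents. The function name and the staleness model are illustrative assumptions; the paper only specifies that z_t is randomized.

```python
import torch


def randomize_latent(z_t, prev_z_t, noise_std=0.05, stale_prob=0.3):
    """Perturb and sometimes delay the latent action during RL training,
    mimicking VLA inference noise and its ~50ms deployment latency."""
    noisy = z_t + noise_std * torch.randn_like(z_t)
    # With some probability, reuse the previous (stale) latent instead,
    # simulating the VLA running slower than the control loop.
    stale_mask = (torch.rand(z_t.shape[0], 1) < stale_prob).float()
    return stale_mask * prev_z_t + (1.0 - stale_mask) * noisy


z_t = torch.randn(8, 256)       # fresh latents for 8 parallel envs
prev_z_t = torch.randn(8, 256)  # latents from the previous policy step
z_used = randomize_latent(z_t, prev_z_t)  # -> shape [8, 256]
```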
End-to-End Deployment Pipeline
```
Egocentric Camera (30 FPS)
        │
        ▼
VLA Encoder (GPU, ~20 Hz)
  - Input: 224x224 image + text command
  - Output: z_t (256-dim latent vector)
        │
        ▼  z_t via shared memory
LMO RL Policy (CPU, 50 Hz)
  - Input: z_t + proprioception + IMU
  - Output: 12 joint position targets
        │
        ▼
PD Controller (1 kHz)
  - Low-level joint tracking
        │
        ▼
Robot Motors (AgiBot X2)
```
The two layers run at different frequencies: VLA at ~20Hz (limited by GPU inference), RL policy at 50Hz (interpolating z_t between VLA updates). This is very practical — GPU inference for VLMs cannot reach 50Hz, but locomotion needs fast feedback.
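Here is one way the 50Hz control loop could smooth z_t between ~20Hz VLA updates: a simple linear blend from the previous latent to the newest one. This is an illustrative sketch; the paper does not specify the exact interpolation scheme (a zero-order hold would also be reasonable).

```python
import torch


def interpolate_latent(z_prev, z_new, steps_since_update, steps_per_update):
    """Linearly blend from the old latent to the new one over one VLA
    period, then hold the new latent until the next update arrives."""
    alpha = min(steps_since_update / steps_per_update, 1.0)
    return (1.0 - alpha) * z_prev + alpha * z_new


z_prev, z_new = torch.zeros(256), torch.ones(256)
# VLA at 20 Hz, policy at 50 Hz -> 2.5 control steps per VLA update
blended = interpolate_latent(z_prev, z_new,
                             steps_since_update=1, steps_per_update=2.5)
# alpha = 1 / 2.5 = 0.4, so each element of `blended` is 0.4
```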
Experimental Results: 21.3% Improvement
The paper tests on AgiBot X2 — a bipedal humanoid robot from AGIBOT (China), ~1.5m tall, 23 DoF.
| Task | Baseline (Decoupled) | WholeBodyVLA | Improvement |
|---|---|---|---|
| Pick from table | 62% | 78% | +16% |
| Place on shelf | 45% | 72% | +27% |
| Carry while walking | 38% | 65% | +27% |
| Push heavy object | 41% | 55% | +14% |
| Average | 46.5% | 67.5% | +21.3% |
Notably, tasks requiring high coordination (carry while walking, place on shelf) show the largest improvements — exactly where decoupled approaches fail.
Ablation: How Much Does Each Component Contribute?
| Configuration | Success Rate |
|---|---|
| Full WholeBodyVLA | 67.5% |
| Remove latent action learning (use explicit actions) | 58.2% |
| Remove LMO reward (use standard locomotion) | 52.1% |
| Remove egocentric video pre-training | 55.8% |
| Remove domain randomization on z_t | 61.3% |
Removing the LMO reward hurts the most (−15.4 points), confirming that manipulation-aware locomotion is the key innovation, not the VLA architecture itself.
Comparison with Other Methods
| Method | Whole-body? | Learns from video? | Real robot? | Loco-manip? |
|---|---|---|---|---|
| RT-2 | Arms only | No | Yes | No |
| OpenVLA | Arms only | No | Yes | No |
| HumanPlus | Yes | Yes (RGB) | Yes | Limited |
| OKAMI | Arms only | Yes (teleop) | Yes | No |
| WholeBodyVLA | Yes | Yes (egocentric) | Yes | Yes |
WholeBodyVLA is the first paper to achieve all 4 criteria simultaneously.
Limitations and Future Directions
The paper acknowledges several limitations:
- Latency: VLA encoder at ~50ms is the bottleneck — model distillation or quantization needed
- Bimanual: only tested single-arm manipulation, no dual-arm coordination yet
- Terrain: only tested on flat ground, no stairs or uneven terrain
- Long-horizon: all tasks in the paper are under 30 seconds — no multi-step planning tested
Promising future directions:
- Combining with Diffusion Policy for multimodal action distributions
- Extending to bimanual loco-manipulation
- Integrating 3D spatial understanding from SpatialVLA
Key Takeaways
1. Latent Actions Are a Paradigm Shift
Instead of collecting expensive robot data, we can learn manipulation from human videos. This is particularly important for labs with limited robot hardware but abundant video data.
2. Locomotion and Manipulation MUST Be Coupled
Anyone building humanoid robots should abandon the traditional decoupled architecture. The cost of coupling (more complex training) is much smaller than the benefit (21.3% improvement).
3. Isaac Lab Is the Standard Tool for RL Robotics
The paper uses NVIDIA Isaac Lab for training — open-source, free, and runs on consumer GPUs. This is currently the best entry point for RL robotics research.
References
- WholeBodyVLA: Towards Whole-Body Loco-Manipulation through Vision-Language-Action Model — Haoran Jiang, Jin Chen et al., ICLR 2026
- GitHub: OpenDriveLab/WholebodyVLA
- RT-2: Vision-Language-Action Models — Brohan et al., 2023
- HumanPlus: Humanoid Shadowing and Imitation — Fu et al., 2024
- OKAMI: Teaching Humanoid Manipulation — Li et al., 2024
Related Posts
- VLA Models: RT-2, Octo, OpenVLA, pi0 — History and evolution of VLA
- Diffusion Policy for Robot Manipulation — Foundation for multimodal actions
- Humanoid Loco-Manipulation — Humanoid series deep dive
- VLA-RL: Scaling RL to Improve VLA Manipulation — Related paper on RL for VLA