Tags: research · vla · humanoid · loco-manipulation · iclr · rl

WholeBodyVLA: Unified VLA for Whole-Body Loco-Manipulation

ICLR 2026 — learning manipulation from egocentric video, combining VLA + RL for locomotion-aware control

Nguyễn Anh Tuấn · April 7, 2026 · 12 min read

The Core Problem: Humanoids Must Walk and Manipulate Simultaneously

Imagine carrying a hot cup of coffee across a room. Your brain simultaneously has to: (1) keep the cup stable, (2) adjust your steps to avoid spilling, and (3) avoid obstacles. Two systems — locomotion and manipulation — must coordinate tightly, not run independently in parallel.

This is exactly the loco-manipulation problem for humanoid robots, and it is where most current methods fail. Why? Because they split locomotion and manipulation into two separate modules and stitch them together — like having one person controlling the arms while another controls the legs. The result: the robot falls when trying to manipulate, or manipulation fails because the locomotion policy does not understand what manipulation needs.

WholeBodyVLA (Jiang, Chen et al., ICLR 2026) from Fudan University, OpenDriveLab, and AGIBOT solves this with a unified architecture — a VLA for manipulation combined with an RL policy for locomotion, both sharing a common latent space.

[Image: Humanoid robot performing manipulation in a real environment]

Why This Paper Matters

WholeBodyVLA achieves 3 things that no prior paper accomplished simultaneously:

  1. Learning manipulation from egocentric video without action labels — leveraging millions of YouTube/Ego4D videos shot from a first-person perspective
  2. Manipulation-aware locomotion — the robot's legs automatically adjust posture based on what the arms are doing
  3. Outperforming baselines by 21.3% on a real robot (AgiBot X2), not just simulation

The real-robot results on AgiBot X2 are particularly noteworthy — this is not a "nice demo in sim and nothing else" paper.

Architecture: Two Layers, One Latent Space

WholeBodyVLA consists of 2 main layers:

┌─────────────────────────────────────────────┐
│           VLA Encoder (Upper Body)          │
│                                             │
│  [Egocentric Image] + [Language Command]    │
│           ↓                                 │
│   Vision-Language Model (pretrained)        │
│           ↓                                 │
│   Latent Action Space z_t                   │
│   (learned from action-free video)          │
└─────────────┬───────────────────────────────┘
              │ z_t (latent manipulation intent)
              ↓
┌─────────────────────────────────────────────┐
│        LMO RL Policy (Lower Body)           │
│                                             │
│  [z_t] + [Proprioception] + [IMU]           │
│           ↓                                 │
│   PPO Policy (trained in Isaac Lab)         │
│           ↓                                 │
│   Joint torques for locomotion              │
│   (knows manipulation → adjusts stance)     │
└─────────────────────────────────────────────┘

The key is the latent vector z_t — it encodes both manipulation intent and spatial context, allowing the locomotion policy to "know" what the robot's arms are trying to do and adjust posture accordingly.

Layer 1: VLA Encoder — Learning from Video Without Actions

The Problem: Robot Manipulation Data is Extremely Scarce

Traditional VLA training requires (observation, action) pairs from robots — extremely expensive to collect. Bridge V2 has ~60K episodes, Open X-Embodiment has ~800K — sounds like a lot, but compared to billions of image-text pairs for VLMs, it is tiny.

WholeBodyVLA solves this with latent action learning: instead of requiring explicit action labels, the model learns latent representations from consecutive egocentric video frames.

Inverse Dynamics in Latent Space

The idea: if two consecutive video frames change, some action must have occurred. The model does not need to know the specific action — it only needs to learn a latent representation of that "change".

import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """
    Learn latent actions from consecutive frame pairs.
    No action labels needed — only video frames.
    """
    def __init__(self, vision_encoder, latent_dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder  # Pretrained ViT
        self.latent_dim = latent_dim
        
        # Inverse dynamics model: (s_t, s_{t+1}) -> z_t
        # z_t is the latent action between 2 frames
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size * 2, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        
        # Forward dynamics: (s_t, z_t) -> s_{t+1}_pred
        # Ensures z_t encodes enough info to predict next frame
        self.forward_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vision_encoder.hidden_size),
        )
    
    def encode_latent_action(self, frame_t, frame_t1):
        """Compute latent action z_t from 2 consecutive frames."""
        feat_t = self.vision_encoder(frame_t)    # [B, D]
        feat_t1 = self.vision_encoder(frame_t1)  # [B, D]
        
        combined = torch.cat([feat_t, feat_t1], dim=-1)  # [B, 2D]
        z_t = self.inverse_dynamics(combined)              # [B, latent_dim]
        return z_t, feat_t, feat_t1
    
    def forward(self, frame_t, frame_t1):
        z_t, feat_t, feat_t1 = self.encode_latent_action(frame_t, frame_t1)
        
        # Forward dynamics loss: z_t + s_t -> s_{t+1}
        feat_t1_pred = self.forward_dynamics(
            torch.cat([feat_t, z_t], dim=-1)
        )
        
        # L2 loss in feature space
        loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        return z_t, loss

The inverse and forward dynamics models complement each other: the inverse model compresses the frame-to-frame change into z_t, while the forward dynamics loss forces z_t to carry enough information to reconstruct that change, preventing the latent from collapsing into a trivial code.
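The pre-training loop over consecutive frame pairs can be sketched as follows. This is a minimal, self-contained toy version: `ToyVisionEncoder`, the layer sizes, and the optimizer settings are illustrative stand-ins, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained ViT; the real pipeline encodes
# egocentric video frames with a frozen or fine-tuned vision backbone.
class ToyVisionEncoder(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.hidden_size = hidden_size
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, hidden_size))

    def forward(self, x):
        return self.net(x)

def make_models(hidden_size=64, latent_dim=16):
    encoder = ToyVisionEncoder(hidden_size)
    inverse_dyn = nn.Sequential(  # (s_t, s_{t+1}) -> z_t
        nn.Linear(hidden_size * 2, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    forward_dyn = nn.Sequential(  # (s_t, z_t) -> s_{t+1}_pred
        nn.Linear(hidden_size + latent_dim, 64), nn.ReLU(), nn.Linear(64, hidden_size))
    return encoder, inverse_dyn, forward_dyn

def train_step(encoder, inverse_dyn, forward_dyn, frame_t, frame_t1, opt):
    """One optimization step on a batch of consecutive frame pairs."""
    feat_t, feat_t1 = encoder(frame_t), encoder(frame_t1)
    # Inverse dynamics: infer the latent action between the two frames
    z_t = inverse_dyn(torch.cat([feat_t, feat_t1], dim=-1))
    # Forward dynamics: z_t must suffice to predict the next frame's features
    feat_t1_pred = forward_dyn(torch.cat([feat_t, z_t], dim=-1))
    loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

No action labels appear anywhere in this loop — the only supervision signal is the next frame itself.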

VLA with Latent Actions

After pre-training the latent action encoder, WholeBodyVLA fine-tunes a VLM to predict latent actions instead of explicit robot actions:

class WholeBodyVLAEncoder(nn.Module):
    """
    VLA outputs latent actions instead of explicit joint commands.
    Latent space is pre-trained from egocentric video.
    """
    def __init__(self, vlm_backbone, latent_action_encoder):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., LLaVA, Qwen-VL
        self.latent_projector = nn.Linear(
            vlm_backbone.config.hidden_size,
            latent_action_encoder.latent_dim
        )
        self.latent_action_encoder = latent_action_encoder
        
        # Freeze latent encoder after pre-training
        for param in self.latent_action_encoder.parameters():
            param.requires_grad = False
    
    def forward(self, image, language_instruction):
        """
        Input: egocentric image + language command
        Output: latent action z_t
        """
        vlm_output = self.vlm(
            pixel_values=image,
            input_ids=language_instruction,
        )
        
        # Project to latent action space
        hidden = vlm_output.last_hidden_state[:, -1, :]
        z_t = self.latent_projector(hidden)  # [B, latent_dim]
        
        return z_t

A subtle but critical point: the VLA does not need to know which specific robot it is controlling — it only outputs "manipulation intent" in latent space. The locomotion policy below maps z_t to motor commands for the specific robot.
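To make that embodiment mapping concrete, here is one way a robot-specific decoder could look. This is hypothetical: the class name, layer sizes, and joint-target output are my illustration, not the paper's interface.

```python
import torch
import torch.nn as nn

# Hypothetical per-robot decoder: the embodiment-agnostic latent intent z_t
# is mapped to joint-position targets for one particular upper body, so the
# same VLA encoder could drive different robots via different decoders.
class LatentActionDecoder(nn.Module):
    def __init__(self, latent_dim=256, num_arm_joints=7, max_angle=3.14):
        super().__init__()
        self.max_angle = max_angle
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_arm_joints),
        )

    def forward(self, z_t):
        # Bounded joint-position targets in radians
        return torch.tanh(self.net(z_t)) * self.max_angle
```

Only this small decoder needs retraining when moving to a new robot; the VLA encoder and its video-pretrained latent space stay fixed.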

Layer 2: LMO RL Policy — Manipulation-Aware Locomotion

This is the most important contribution — the Loco-Manipulation-Oriented (LMO) RL policy.

Why Standard Locomotion Policies Are Not Enough

Standard locomotion policies (like those running on the Unitree Go2 or Boston Dynamics robots) are trained to walk straight, maintain balance, and follow velocity commands. They do not know what the upper body is doing.

Real-world consequences:

| Situation | Standard locomotion | LMO policy |
|---|---|---|
| Robot reaches arm up high | Stands normally → falls | Lowers CoM, widens stance |
| Robot leans right to pick up | No compensation → CoM shift | Steps left to balance |
| Robot pushes heavy object | No friction increase → slides | Increases ground reaction force |

[Image: Humanoid robot in a research lab]

LMO Reward Design

The LMO RL reward function has 3 main components:

import torch

def compute_lmo_reward(
    # State information
    base_orientation,     # quaternion body orientation
    base_velocity,        # linear + angular velocity
    joint_positions,      # current joint positions
    joint_torques,        # applied torques
    foot_contacts,        # foot ground contact flags
    # Manipulation context
    z_t,                  # latent action from VLA encoder
    ee_position,          # end-effector position
    ee_target,            # manipulation target position
    # Config
    dt=0.02,
):
    """
    LMO Reward: 3 main components
    1. Stability reward: keep body stable
    2. Manipulation support: facilitate manipulation
    3. Energy efficiency: minimize torque waste
    """
    
    # === 1. STABILITY REWARD ===
    # (quaternion_to_euler, compute_com_support, and compute_stance_width
    #  are assumed helper functions, not shown here)
    roll, pitch, yaw = quaternion_to_euler(base_orientation)
    r_orientation = torch.exp(-5.0 * (roll**2 + pitch**2))
    
    # Penalize base angular velocity (last 3 components of base_velocity)
    r_angular_vel = torch.exp(-2.0 * torch.norm(base_velocity[..., 3:], dim=-1))
    
    # At least one foot must be in ground contact
    r_contact = (foot_contacts.sum(dim=-1) >= 1).float()
    
    # === 2. MANIPULATION SUPPORT REWARD ===
    # Key innovation: locomotion is rewarded for SUPPORTING
    # manipulation, not just walking straight
    
    ee_error = torch.norm(ee_position - ee_target, dim=-1)
    r_manipulation = torch.exp(-3.0 * ee_error)
    
    # Manipulation-aware CoM adjustment
    com_to_target = compute_com_support(
        base_orientation, joint_positions, ee_target
    )
    r_com_support = torch.exp(-1.0 * com_to_target)
    
    # Latent-conditioned bonus: RL knows z_t -> knows intent
    z_magnitude = torch.norm(z_t, dim=-1)
    stance_width = compute_stance_width(joint_positions)
    # Larger manipulation -> wider stance
    r_stance = torch.exp(-2.0 * torch.abs(
        stance_width - 0.3 - 0.1 * z_magnitude
    ))
    
    # === 3. ENERGY EFFICIENCY ===
    r_energy = torch.exp(-0.01 * torch.sum(joint_torques**2, dim=-1))
    
    # === COMBINE ===
    reward = (
        0.3 * r_orientation +
        0.15 * r_angular_vel +
        0.1 * r_contact +
        0.2 * r_manipulation +
        0.1 * r_com_support +
        0.1 * r_stance +
        0.05 * r_energy
    )
    
    return reward

The core differentiator: r_manipulation and r_com_support — the locomotion policy is rewarded when it helps manipulation succeed, not just when it maintains balance.

Training Pipeline in Isaac Lab

import torch
from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.utils import configclass
from stable_baselines3 import PPO

@configclass
class LMOEnvCfg:
    """Config for LMO policy training environment."""
    
    robot_asset = "agibot_x2.usd"
    
    observation_space = {
        "proprioception": 45,      # joint positions + velocities
        "imu": 6,                  # linear acc + angular vel
        "latent_action": 256,      # z_t from VLA encoder
        "foot_contacts": 4,        # binary contact sensors
        "base_orientation": 4,     # quaternion
    }
    
    action_space = 12  # 6 joints x 2 legs
    
    # Domain randomization — crucial for sim-to-real
    domain_rand = {
        "friction": (0.3, 1.5),
        "payload_mass": (0, 5.0),     # kg, simulates held objects
        "motor_strength": (0.8, 1.2),
        "push_force": (0, 50),        # N, random perturbations
        "latent_noise": 0.05,         # noise on z_t
    }
    
    terrain = "rough"
    max_episode_length = 1000  # 20 seconds at 50Hz
    sim_dt = 0.005       # physics at 200Hz
    control_dt = 0.02    # policy at 50Hz

def train_lmo_policy():
    """Train LMO policy with PPO in Isaac Lab."""
    
    env = ManagerBasedRLEnv(cfg=LMOEnvCfg())
    
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        n_steps=24,
        batch_size=4096,
        n_epochs=5,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=1.0,
        policy_kwargs={
            "net_arch": [256, 256, 128],
            "activation_fn": torch.nn.ELU,
        },
        verbose=1,
    )
    
    # Train 500M steps — approximately 6-8 hours on 1x A100
    model.learn(total_timesteps=500_000_000)
    model.save("lmo_policy_agibot_x2")
    
    return model

Domain Randomization for Sim-to-Real

A critical detail the paper emphasizes: latent_noise in domain randomization. During deployment, the VLA encoder runs on a separate GPU with ~50ms latency — creating noise on z_t. Training with noise makes the RL policy robust to VLA inference jitter.
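A sketch of how this latent randomization could be implemented. Assumptions: the stale-update simulation via `update_prob` is my addition, motivated by the 20Hz/50Hz frequency gap; the paper's config only specifies Gaussian noise on z_t.

```python
import torch

# Simulates two deployment artifacts during RL training: Gaussian jitter on
# z_t, and stale updates (the ~20Hz VLA reuses the same latent across
# several 50Hz policy steps).
class LatentRandomizer:
    def __init__(self, noise_std=0.05, update_prob=0.4):
        self.noise_std = noise_std      # matches latent_noise in the config above
        self.update_prob = update_prob  # ~20Hz VLA updates vs 50Hz policy steps
        self._last = None

    def __call__(self, z_t):
        # With probability update_prob, produce a fresh (noisy) latent;
        # otherwise return the stale one from a previous step.
        if self._last is None or torch.rand(()).item() < self.update_prob:
            self._last = z_t + self.noise_std * torch.randn_like(z_t)
        return self._last
```

Called once per policy step inside the training environment, this exposes the policy to exactly the jitter and staleness it will see at deployment.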

End-to-End Deployment Pipeline

Egocentric Camera (30 FPS)
        │
        ▼
VLA Encoder (GPU, ~20Hz)
  - Input: 224x224 image + text command
  - Output: z_t (256-dim latent vector)
        │
        ▼ z_t via shared memory
LMO RL Policy (CPU, 50Hz)
  - Input: z_t + proprioception + IMU
  - Output: 12 joint position targets
        │
        ▼
PD Controller (1kHz)
  - Low-level joint tracking
        │
        ▼
Robot Motors (AgiBot X2)

The two layers run at different frequencies: VLA at ~20Hz (limited by GPU inference), RL policy at 50Hz (interpolating z_t between VLA updates). This is very practical — GPU inference for VLMs cannot reach 50Hz, but locomotion needs fast feedback.
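One simple way to interpolate z_t across the policy steps between two VLA updates is linear blending. This is a sketch; the paper does not detail its exact scheme, and zero-order hold (just reusing the last z_t) is the trivial alternative.

```python
import torch

def interpolate_latents(z_prev, z_new, policy_steps=3):
    """Blend from the previous latent to a freshly inferred one over the
    50Hz policy steps between two ~20Hz VLA updates (about 2-3 steps),
    instead of switching abruptly when a new z_t arrives."""
    alphas = torch.linspace(1.0 / policy_steps, 1.0, policy_steps)
    return [(1 - a) * z_prev + a * z_new for a in alphas]
```

The gradual blend avoids step discontinuities in the policy's observation, which would otherwise show up as twitches in the lower-body motion.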

Experimental Results: 21.3% Improvement

The paper tests on AgiBot X2 — a bipedal humanoid robot from AGIBOT (China), ~1.5m tall, 23 DoF.

| Task | Baseline (decoupled) | WholeBodyVLA | Improvement |
|---|---|---|---|
| Pick from table | 62% | 78% | +16% |
| Place on shelf | 45% | 72% | +27% |
| Carry while walking | 38% | 65% | +27% |
| Push heavy object | 41% | 55% | +14% |
| Average | 46.5% | 67.5% | +21.3% |

Notably, tasks requiring high coordination (carry while walking, place on shelf) show the largest improvements — exactly where decoupled approaches fail.

Ablation: How Much Does Each Component Contribute?

| Configuration | Success rate |
|---|---|
| Full WholeBodyVLA | 67.5% |
| Remove latent action learning (use explicit actions) | 58.2% |
| Remove LMO reward (use standard locomotion) | 52.1% |
| Remove egocentric video pre-training | 55.8% |
| Remove domain randomization on z_t | 61.3% |

The LMO reward contributes the most: removing it drops success by 15.4 points, confirming that manipulation-aware locomotion is the key innovation, not the VLA architecture itself.

[Image: AI and deep learning research for robotics]

Comparison with Other Methods

| Method | Whole-body? | Learns from video? | Real robot? | Loco-manip? |
|---|---|---|---|---|
| RT-2 | Arms only | No | Yes | No |
| OpenVLA | Arms only | No | Yes | No |
| HumanPlus | Yes | Yes (RGB) | Yes | Limited |
| OKAMI | Arms only | Yes (teleop) | Yes | No |
| WholeBodyVLA | Yes | Yes (egocentric) | Yes | Yes |

WholeBodyVLA is the first paper to achieve all 4 criteria simultaneously.

Limitations and Future Directions

The paper acknowledges several limitations:

  1. Latency: VLA encoder at ~50ms is the bottleneck — model distillation or quantization needed
  2. Bimanual: only tested single-arm manipulation, no dual-arm coordination yet
  3. Terrain: only tested on flat ground, no stairs or uneven terrain
  4. Long-horizon: all tasks in the paper are under 30 seconds — no multi-step planning tested

Each of these limitations points to a natural next step: faster on-board VLA inference, dual-arm coordination, locomotion on stairs and uneven terrain, and long-horizon multi-step planning.

Key Takeaways

1. Latent Actions Are a Paradigm Shift

Instead of collecting expensive robot data, we can learn manipulation from human videos. This is particularly important for labs with limited robot hardware but abundant video data.

2. Locomotion and Manipulation MUST Be Coupled

Anyone building humanoid robots should abandon the traditional decoupled architecture. The cost of coupling (more complex training) is much smaller than the benefit (21.3% improvement).

3. Isaac Lab Is the Standard Tool for RL Robotics

The paper uses NVIDIA Isaac Lab for training — open-source, free, and runs on consumer GPUs. This is currently the best entry point for RL robotics research.
