The Core Problem: Humanoids Must Walk and Manipulate Simultaneously
Imagine carrying a hot cup of coffee across a room. Your brain simultaneously has to: (1) keep the cup stable, (2) adjust your steps to avoid spilling, and (3) avoid obstacles. Two systems — locomotion and manipulation — must coordinate tightly, not run independently in parallel.
This is exactly the loco-manipulation problem for humanoid robots, and it is where most current methods fail. Why? Because they split locomotion and manipulation into two separate modules and stitch them together — like having one person controlling the arms while another controls the legs. The result: the robot falls when trying to manipulate, or manipulation fails because the locomotion policy does not understand what manipulation needs.
WholeBodyVLA (Jiang, Chen et al., ICLR 2026) from Fudan University, OpenDriveLab, and AGIBOT solves this with a unified architecture — a VLA for manipulation combined with an RL policy for locomotion, both sharing a common latent space.
Why This Paper Matters
WholeBodyVLA achieves 3 things that no prior paper accomplished simultaneously:
- Learning manipulation from egocentric video without action labels — leveraging millions of YouTube/Ego4D videos shot from a first-person perspective
- Manipulation-aware locomotion — the robot's legs automatically adjust posture based on what the arms are doing
- Outperforming baselines by 21.3% on a real robot (AgiBot X2), not just simulation
The real-robot results on AgiBot X2 are particularly noteworthy — this is not a "nice demo in sim and nothing else" paper.
Architecture: Two Layers, One Latent Space
WholeBodyVLA consists of 2 main layers:
```
┌─────────────────────────────────────────────┐
│          VLA Encoder (Upper Body)           │
│                                             │
│  [Egocentric Image] + [Language Command]    │
│                  ↓                          │
│    Vision-Language Model (pretrained)       │
│                  ↓                          │
│         Latent Action Space z_t             │
│     (learned from action-free video)        │
└─────────────┬───────────────────────────────┘
              │ z_t (latent manipulation intent)
              ↓
┌─────────────────────────────────────────────┐
│         LMO RL Policy (Lower Body)          │
│                                             │
│     [z_t] + [Proprioception] + [IMU]        │
│                  ↓                          │
│     PPO Policy (trained in Isaac Lab)       │
│                  ↓                          │
│    Joint position targets for locomotion    │
│   (knows manipulation → adjusts stance)     │
└─────────────────────────────────────────────┘
```
The key is the latent vector z_t — it encodes both manipulation intent and spatial context, allowing the locomotion policy to "know" what the robot's arms are trying to do and adjust posture accordingly.
Layer 1: VLA Encoder — Learning from Video Without Actions
The Problem: Robot Manipulation Data is Extremely Scarce
Traditional VLA training requires (observation, action) pairs from robots — extremely expensive to collect. Bridge V2 has ~60K episodes, Open X-Embodiment has ~800K — sounds like a lot, but compared to billions of image-text pairs for VLMs, it is tiny.
WholeBodyVLA solves this with latent action learning: instead of requiring explicit action labels, the model learns latent representations from consecutive egocentric video frames.
Inverse Dynamics in Latent Space
The idea: if two consecutive video frames change, some action must have occurred. The model does not need to know the specific action — it only needs to learn a latent representation of that "change".
```python
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """
    Learn latent actions from consecutive frame pairs.
    No action labels needed — only video frames.
    """

    def __init__(self, vision_encoder, latent_dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder  # Pretrained ViT
        self.latent_dim = latent_dim

        # Inverse dynamics model: (s_t, s_{t+1}) -> z_t
        # z_t is the latent action between two frames
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size * 2, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

        # Forward dynamics: (s_t, z_t) -> s_{t+1}_pred
        # Ensures z_t encodes enough info to predict the next frame
        self.forward_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vision_encoder.hidden_size),
        )

    def encode_latent_action(self, frame_t, frame_t1):
        """Compute latent action z_t from two consecutive frames."""
        feat_t = self.vision_encoder(frame_t)    # [B, D]
        feat_t1 = self.vision_encoder(frame_t1)  # [B, D]
        combined = torch.cat([feat_t, feat_t1], dim=-1)  # [B, 2D]
        z_t = self.inverse_dynamics(combined)    # [B, latent_dim]
        return z_t, feat_t, feat_t1

    def forward(self, frame_t, frame_t1):
        z_t, feat_t, feat_t1 = self.encode_latent_action(frame_t, frame_t1)
        # Forward-dynamics prediction: (s_t, z_t) -> s_{t+1}.
        # Gradients flow back through z_t, so this single loss also
        # trains the inverse dynamics model.
        feat_t1_pred = self.forward_dynamics(
            torch.cat([feat_t, z_t], dim=-1)
        )
        # L2 loss in feature space
        loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        return z_t, loss
```
A subtle point: there is a single explicit loss (the forward-prediction error), but it trains both modules jointly:
- Inverse dynamics: gradients flowing through z_t force it to encode "what action occurred" between the two frames
- Forward dynamics: z_t must contain enough information to reconstruct the next state's features
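To make the training recipe concrete, here is a minimal, runnable sketch of the pre-training loop over unlabeled frame pairs. The DummyViT, the reduced layer sizes, and the synthetic "video" tensors are stand-ins for a pretrained ViT and real egocentric footage, not details from the paper.

```python
import torch
import torch.nn as nn


class DummyViT(nn.Module):
    """Stand-in for a pretrained ViT: flattens a tiny frame to features."""
    hidden_size = 64

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, self.hidden_size)

    def forward(self, x):               # x: [B, 3, 16, 16]
        return self.proj(x.flatten(1))  # [B, hidden_size]


class LatentActionEncoder(nn.Module):
    """Compact version of the encoder above (smaller layers for speed)."""

    def __init__(self, vision_encoder, latent_dim=32):
        super().__init__()
        self.vision_encoder = vision_encoder
        d = vision_encoder.hidden_size
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.forward_dynamics = nn.Sequential(
            nn.Linear(d + latent_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, frame_t, frame_t1):
        feat_t = self.vision_encoder(frame_t)
        feat_t1 = self.vision_encoder(frame_t1)
        z_t = self.inverse_dynamics(torch.cat([feat_t, feat_t1], dim=-1))
        pred = self.forward_dynamics(torch.cat([feat_t, z_t], dim=-1))
        return z_t, nn.functional.mse_loss(pred, feat_t1.detach())


torch.manual_seed(0)
model = LatentActionEncoder(DummyViT())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic "consecutive" frame pairs standing in for egocentric video
frames_t = torch.randn(200, 3, 16, 16)
frames_t1 = frames_t + 0.1 * torch.randn_like(frames_t)

for step in range(20):
    idx = torch.randint(0, 200, (16,))
    z_t, loss = model(frames_t[idx], frames_t1[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that no action label ever appears in the loop: the supervision signal is entirely the next frame itself.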
VLA with Latent Actions
After pre-training the latent action encoder, WholeBodyVLA fine-tunes a VLM to predict latent actions instead of explicit robot actions:
```python
class WholeBodyVLAEncoder(nn.Module):
    """
    VLA outputs latent actions instead of explicit joint commands.
    The latent space is pre-trained from egocentric video.
    """

    def __init__(self, vlm_backbone, latent_action_encoder):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., LLaVA, Qwen-VL
        self.latent_projector = nn.Linear(
            vlm_backbone.config.hidden_size,
            latent_action_encoder.latent_dim,
        )
        self.latent_action_encoder = latent_action_encoder

        # Freeze the latent encoder after pre-training
        for param in self.latent_action_encoder.parameters():
            param.requires_grad = False

    def forward(self, image, language_instruction):
        """
        Input: egocentric image + language command
        Output: latent action z_t
        """
        vlm_output = self.vlm(
            pixel_values=image,
            input_ids=language_instruction,
        )
        # Project the last hidden state to the latent action space
        hidden = vlm_output.last_hidden_state[:, -1, :]
        z_t = self.latent_projector(hidden)  # [B, latent_dim]
        return z_t
```
A subtle but critical point: the VLA does not need to know which specific robot it is controlling — it only outputs "manipulation intent" in latent space. The locomotion policy below maps z_t to motor commands for the specific robot.
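For the upper body itself, prior latent-action work typically grounds the embodiment-agnostic latent in a specific robot with a small decoder head fine-tuned on a modest amount of paired robot data. The sketch below is an assumption about how this could look; the class name, layer sizes, and the 7-joint arm are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class RobotActionDecoder(nn.Module):
    """Hypothetical head mapping a 256-dim latent action to arm joint
    targets for one specific robot. Fine-tuned on a small dataset of
    (z_t, joint_command) pairs; the VLA itself stays embodiment-agnostic."""

    def __init__(self, latent_dim=256, num_arm_joints=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_arm_joints),
        )

    def forward(self, z_t):
        # [B, latent_dim] -> [B, num_arm_joints] joint position targets
        return self.net(z_t)


decoder = RobotActionDecoder()
z_t = torch.randn(1, 256)    # latent intent from the VLA encoder
arm_targets = decoder(z_t)   # -> torch.Size([1, 7])
```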
Layer 2: LMO RL Policy — Manipulation-Aware Locomotion
This is the most important contribution — the Loco-Manipulation-Oriented (LMO) RL policy.
Why Standard Locomotion Policies Are Not Enough
Standard locomotion policies (like those running on the Unitree Go2 or Boston Dynamics robots) are trained to walk straight, maintain balance, and follow velocity commands. They have no idea what the upper body is doing.
Real-world consequences:
| Situation | Standard Locomotion | LMO Policy |
|---|---|---|
| Robot reaches arm up high | Stands normally → falls | Lowers CoM, widens stance |
| Robot leans right to pick up | No compensation → CoM shift | Steps left to balance |
| Robot pushes heavy object | No friction increase → slides | Increases ground reaction force |
LMO Reward Design
The LMO RL reward function has 3 main components:
```python
import torch

# quaternion_to_euler, compute_com_support, and compute_stance_width are
# assumed utility functions, not shown here.


def compute_lmo_reward(
    # State information
    base_orientation,   # quaternion body orientation
    base_velocity,      # linear + angular velocity
    joint_positions,    # current joint positions
    joint_torques,      # applied torques
    foot_contacts,      # foot ground contact flags
    # Manipulation context
    z_t,                # latent action from VLA encoder
    ee_position,        # end-effector position
    ee_target,          # manipulation target position
    # Config
    dt=0.02,
):
    """
    LMO reward: 3 main components
      1. Stability: keep the body stable
      2. Manipulation support: facilitate manipulation
      3. Energy efficiency: minimize torque waste
    """
    # === 1. STABILITY REWARD ===
    roll, pitch, yaw = quaternion_to_euler(base_orientation)
    r_orientation = torch.exp(-5.0 * (roll**2 + pitch**2))
    r_angular_vel = torch.exp(-2.0 * torch.norm(base_velocity[3:], dim=-1))
    r_contact = (foot_contacts.sum(dim=-1) >= 1).float()

    # === 2. MANIPULATION SUPPORT REWARD ===
    # Key innovation: locomotion is rewarded for SUPPORTING
    # manipulation, not just walking straight
    ee_error = torch.norm(ee_position - ee_target, dim=-1)
    r_manipulation = torch.exp(-3.0 * ee_error)

    # Manipulation-aware CoM adjustment
    com_to_target = compute_com_support(
        base_orientation, joint_positions, ee_target
    )
    r_com_support = torch.exp(-1.0 * com_to_target)

    # Latent-conditioned bonus: the RL policy sees z_t -> knows intent
    z_magnitude = torch.norm(z_t, dim=-1)
    stance_width = compute_stance_width(joint_positions)
    # Larger manipulation motion -> wider stance
    r_stance = torch.exp(-2.0 * torch.abs(
        stance_width - 0.3 - 0.1 * z_magnitude
    ))

    # === 3. ENERGY EFFICIENCY ===
    r_energy = torch.exp(-0.01 * torch.sum(joint_torques**2, dim=-1))

    # === COMBINE ===
    reward = (
        0.3 * r_orientation +
        0.15 * r_angular_vel +
        0.1 * r_contact +
        0.2 * r_manipulation +
        0.1 * r_com_support +
        0.1 * r_stance +
        0.05 * r_energy
    )
    return reward
```
The core differentiator: r_manipulation and r_com_support — the locomotion policy is rewarded when it helps manipulation succeed, not just when it maintains balance.
Training Pipeline in Isaac Lab
```python
import torch
from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.utils import configclass
from stable_baselines3 import PPO


@configclass
class LMOEnvCfg:
    """Config for the LMO policy training environment (schematic — a real
    Isaac Lab config subclasses ManagerBasedRLEnvCfg and defines scene,
    observation, and action managers)."""

    robot_asset = "agibot_x2.usd"

    observation_space = {
        "proprioception": 45,    # joint positions + velocities
        "imu": 6,                # linear acc + angular vel
        "latent_action": 256,    # z_t from VLA encoder
        "foot_contacts": 4,      # binary contact sensors
        "base_orientation": 4,   # quaternion
    }
    action_space = 12  # 6 joints x 2 legs

    # Domain randomization — crucial for sim-to-real
    domain_rand = {
        "friction": (0.3, 1.5),
        "payload_mass": (0, 5.0),     # kg, simulates held objects
        "motor_strength": (0.8, 1.2),
        "push_force": (0, 50),        # N, random perturbations
        "latent_noise": 0.05,         # noise on z_t
    }

    terrain = "rough"
    max_episode_length = 1000  # 20 seconds at 50 Hz
    sim_dt = 0.005             # physics at 200 Hz
    control_dt = 0.02          # policy at 50 Hz


def train_lmo_policy():
    """Train the LMO policy with PPO in Isaac Lab."""
    env = ManagerBasedRLEnv(cfg=LMOEnvCfg())
    # Note: stable-baselines3 expects a VecEnv, so the Isaac Lab env must
    # be wrapped first (e.g., with Sb3VecEnvWrapper; the module path
    # varies across Isaac Lab versions).

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        n_steps=24,
        batch_size=4096,
        n_epochs=5,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=1.0,
        policy_kwargs={
            "net_arch": [256, 256, 128],
            "activation_fn": torch.nn.ELU,
        },
        verbose=1,
    )

    # Train 500M steps — approximately 6-8 hours on 1x A100
    model.learn(total_timesteps=500_000_000)
    model.save("lmo_policy_agibot_x2")
    return model
```
Domain Randomization for Sim-to-Real
A critical detail the paper emphasizes: latent_noise in domain randomization. During deployment, the VLA encoder runs on a separate GPU with ~50ms latency, so the z_t the policy receives is delayed and jittery. Injecting noise on z_t during training makes the RL policy robust to this VLA inference jitter.
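A minimal sketch of what that randomization could look like in the training loop, assuming Gaussian perturbation plus occasionally stale latents. The function name and the staleness model are illustrative assumptions; the paper only specifies that z_t is randomized.

```python
import torch


def randomize_latent(z_t, prev_z_t, noise_std=0.05, stale_prob=0.3):
    """Perturb and sometimes delay the latent action during RL training,
    mimicking VLA inference noise and its ~50ms deployment latency."""
    noisy = z_t + noise_std * torch.randn_like(z_t)
    # With some probability, reuse the previous (stale) latent instead,
    # simulating the VLA running slower than the control loop.
    stale_mask = (torch.rand(z_t.shape[0], 1) < stale_prob).float()
    return stale_mask * prev_z_t + (1.0 - stale_mask) * noisy


z_t = torch.randn(8, 256)       # fresh latents for 8 parallel envs
prev_z_t = torch.randn(8, 256)  # latents from the previous policy step
z_used = randomize_latent(z_t, prev_z_t)  # -> shape [8, 256]
```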
End-to-End Deployment Pipeline
```
Egocentric Camera (30 FPS)
        │
        ▼
VLA Encoder (GPU, ~20 Hz)
  - Input: 224x224 image + text command
  - Output: z_t (256-dim latent vector)
        │
        ▼  z_t via shared memory
LMO RL Policy (CPU, 50 Hz)
  - Input: z_t + proprioception + IMU
  - Output: 12 joint position targets
        │
        ▼
PD Controller (1 kHz)
  - Low-level joint tracking
        │
        ▼
Robot Motors (AgiBot X2)
```
The two layers run at different frequencies: VLA at ~20Hz (limited by GPU inference), RL policy at 50Hz (interpolating z_t between VLA updates). This is very practical — GPU inference for VLMs cannot reach 50Hz, but locomotion needs fast feedback.
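Here is one way the 50Hz control loop could smooth z_t between ~20Hz VLA updates: a simple linear blend from the previous latent to the newest one. This is an illustrative sketch; the paper does not specify the exact interpolation scheme (a zero-order hold would also be reasonable).

```python
import torch


def interpolate_latent(z_prev, z_new, steps_since_update, steps_per_update):
    """Linearly blend from the old latent to the new one over one VLA
    period, then hold the new latent until the next update arrives."""
    alpha = min(steps_since_update / steps_per_update, 1.0)
    return (1.0 - alpha) * z_prev + alpha * z_new


z_prev, z_new = torch.zeros(256), torch.ones(256)
# VLA at 20 Hz, policy at 50 Hz -> 2.5 control steps per VLA update
blended = interpolate_latent(z_prev, z_new,
                             steps_since_update=1, steps_per_update=2.5)
# alpha = 1 / 2.5 = 0.4, so each element of `blended` is 0.4
```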
Experimental Results: 21.3% Improvement
The paper tests on AgiBot X2 — a bipedal humanoid robot from AGIBOT (China), ~1.5m tall, 23 DoF.
| Task | Baseline (Decoupled) | WholeBodyVLA | Improvement |
|---|---|---|---|
| Pick from table | 62% | 78% | +16% |
| Place on shelf | 45% | 72% | +27% |
| Carry while walking | 38% | 65% | +27% |
| Push heavy object | 41% | 55% | +14% |
| Average | 46.5% | 67.5% | +21.3% |
Notably, tasks requiring high coordination (carry while walking, place on shelf) show the largest improvements — exactly where decoupled approaches fail.
Ablation: How Much Does Each Component Contribute?
| Configuration | Success Rate |
|---|---|
| Full WholeBodyVLA | 67.5% |
| Remove latent action learning (use explicit actions) | 58.2% |
| Remove LMO reward (use standard locomotion) | 52.1% |
| Remove egocentric video pre-training | 55.8% |
| Remove domain randomization on z_t | 61.3% |
Removing the LMO reward hurts the most (−15.4 points), confirming that manipulation-aware locomotion is the key innovation, not the VLA architecture itself.
Comparison with Other Methods
| Method | Whole-body? | Learns from video? | Real robot? | Loco-manip? |
|---|---|---|---|---|
| RT-2 | Arms only | No | Yes | No |
| OpenVLA | Arms only | No | Yes | No |
| HumanPlus | Yes | Yes (RGB) | Yes | Limited |
| OKAMI | Arms only | Yes (teleop) | Yes | No |
| WholeBodyVLA | Yes | Yes (egocentric) | Yes | Yes |
WholeBodyVLA is the first paper to achieve all 4 criteria simultaneously.
Limitations and Future Directions
The paper acknowledges several limitations:
- Latency: VLA encoder at ~50ms is the bottleneck — model distillation or quantization needed
- Bimanual: only tested single-arm manipulation, no dual-arm coordination yet
- Terrain: only tested on flat ground, no stairs or uneven terrain
- Long-horizon: all tasks in the paper are under 30 seconds — no multi-step planning tested
Promising future directions:
- Combining with Diffusion Policy for multimodal action distributions
- Extending to bimanual loco-manipulation
- Integrating 3D spatial understanding from SpatialVLA
Key Takeaways
1. Latent Actions Are a Paradigm Shift
Instead of collecting expensive robot data, we can learn manipulation from human videos. This is particularly important for labs with limited robot hardware but abundant video data.
2. Locomotion and Manipulation MUST Be Coupled
Anyone building humanoid robots should abandon the traditional decoupled architecture. The cost of coupling (more complex training) is much smaller than the benefit (21.3% improvement).
3. Isaac Lab Is the Standard Tool for RL Robotics
The paper uses NVIDIA Isaac Lab for training — open-source, free, and runs on consumer GPUs. This is currently the best entry point for RL robotics research.
References
- WholeBodyVLA: Towards Whole-Body Loco-Manipulation through Vision-Language-Action Model — Haoran Jiang, Jin Chen et al., ICLR 2026
- GitHub: OpenDriveLab/WholebodyVLA
- RT-2: Vision-Language-Action Models — Brohan et al., 2023
- HumanPlus: Humanoid Shadowing and Imitation — Fu et al., 2024
- OKAMI: Teaching Humanoid Manipulation — Li et al., 2024
Related Posts
- VLA Models: RT-2, Octo, OpenVLA, pi0 — History and evolution of VLA
- Diffusion Policy for Robot Manipulation — Foundation for multimodal actions
- Humanoid Loco-Manipulation — Humanoid series deep dive
- VLA-RL: Scaling RL to Improve VLA Manipulation — Related paper on RL for VLA