From paper to code: WholebodyVLA goes open-source
If you've read the WholebodyVLA research analysis earlier, you know this is the first paper to fully address whole-body loco-manipulation for humanoid robots — arms manipulating objects while legs maintain balance and move, all governed by a unified system.
Now the OpenDriveLab/WholebodyVLA repo is public on GitHub (ICLR 2026), and this article dives deep into the technical architecture, training pipeline, and how you can build a similar system — even without a physical robot.
The core problem: Why decoupled approaches fail
Before diving into code, let's understand why WholebodyVLA needs to exist.
Most current humanoid systems split into two independent modules:
- Manipulation controller: handles arms, typically VLA or imitation learning
- Locomotion controller: handles legs, typically an RL policy
These run in parallel, each optimizing for its own objective. The problem? When the robot reaches right to pick something up, the center of mass shifts — but the locomotion controller doesn't know, because it receives no information from manipulation. Result: the robot falls.
Imagine someone controlling your hands via remote while another person controls your legs — and they can't talk to each other. You'd fall the first time you try to pick something off the floor.
WholebodyVLA solves this by sharing a common latent space between both layers — the arms "tell" the legs what they're about to do, and the legs self-adjust.
Overall architecture: Three layers, one flow
WholebodyVLA consists of 3 main components stacked in order:
┌──────────────────────────────────────────────────────────┐
│ LAYER 1: Latent Action Model (LAM) │
│ - Pre-train from egocentric video (Ego4D, YouTube) │
│ - Input: consecutive frame pairs (o_t, o_{t+1}) │
│ - Output: latent action z_t (initializes latent space) │
│ - Inverse dynamics + Forward dynamics learning │
└────────────────────────┬─────────────────────────────────┘
│ learned latent space
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 2: VLA Policy (Upper Body) │
│ - Fine-tune VLM on robot demonstration data │
│ - Input: egocentric image + language command │
│ - Output: z_t (latent manipulation intent) │
│ - Decode z_t → arm joint commands (via action decoder) │
└────────────────────────┬─────────────────────────────────┘
│ z_t passed down
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 3: LMO RL Policy (Lower Body) │
│ - Train in Isaac Lab with PPO │
│ - Input: z_t + proprioception + IMU │
│ - Output: leg joint torques │
│ - Knows z_t → knows manipulation intent → adjusts stance │
└──────────────────────────────────────────────────────────┘
The key point: z_t is the bridge. It's not a specific action ("rotate shoulder joint 15 degrees") but an abstract intent ("reaching right arm forward"). The locomotion policy receives z_t and reasons: "arm is reaching far → lower center of mass, widen stance."
Layer 1: Latent Action Model — Learning from free video
The robot data problem
Labeled action data for robots is extremely scarce and expensive. Each episode requires teleoperation — a human driving the robot through a VR controller for several minutes. Collecting 100K episodes can take months and cost hundreds of thousands of dollars.
Meanwhile, egocentric video (first-person perspective video) has millions of hours available online. The Ego4D dataset alone contains over 3,670 hours of video. The catch: these videos only have images, no action labels — nobody annotates "the hand is rotating 30 degrees right now" on YouTube videos.
Inverse Dynamics: turning video into "pseudo-actions"
The LAM idea is elegantly simple: if frame t and frame t+1 differ, some action must have occurred between them. The model doesn't need to know the specific action — it only needs to learn a latent representation of that change.
import torch
import torch.nn as nn
class LatentActionModel(nn.Module):
"""
Latent Action Model (LAM) — core component of WholebodyVLA.
Learns latent action representations from consecutive video frame pairs,
WITHOUT requiring action labels.
Training objective:
1. Inverse dynamics: (o_t, o_{t+1}) → z_t
"Frames changed like this, so the action must be this"
2. Forward dynamics: (o_t, z_t) → o_{t+1}_pred
"If the action is z_t, the next frame should look like this"
"""
def __init__(self, vision_dim=768, latent_dim=256):
super().__init__()
self.latent_dim = latent_dim
# Inverse dynamics: two frames → latent action
self.inverse_head = nn.Sequential(
nn.Linear(vision_dim * 2, 1024),
nn.LayerNorm(1024),
nn.GELU(),
nn.Linear(1024, 512),
nn.GELU(),
nn.Linear(512, latent_dim),
)
# Forward dynamics: frame + action → next frame prediction
self.forward_head = nn.Sequential(
nn.Linear(vision_dim + latent_dim, 1024),
nn.LayerNorm(1024),
nn.GELU(),
nn.Linear(1024, 512),
nn.GELU(),
nn.Linear(512, vision_dim),
)
# VQ-VAE discretization for latent space
# Makes z_t compact and easier to predict
self.codebook_size = 512
self.codebook = nn.Embedding(self.codebook_size, latent_dim)
def quantize(self, z_continuous):
"""Vector quantization — map continuous z to nearest codebook entry."""
distances = torch.cdist(
z_continuous.unsqueeze(1),
self.codebook.weight.unsqueeze(0),
).squeeze(1)
indices = distances.argmin(dim=-1)
z_quantized = self.codebook(indices)
# Straight-through estimator for gradient
z_quantized = z_continuous + (z_quantized - z_continuous).detach()
return z_quantized, indices
def encode_action(self, feat_t, feat_t1):
"""From two frame features → latent action."""
combined = torch.cat([feat_t, feat_t1], dim=-1)
z_continuous = self.inverse_head(combined)
z_quantized, indices = self.quantize(z_continuous)
return z_quantized, z_continuous, indices
def predict_next(self, feat_t, z_t):
"""From current frame + action → predict next frame."""
combined = torch.cat([feat_t, z_t], dim=-1)
return self.forward_head(combined)
    def compute_loss(self, feat_t, feat_t1):
        """Training loss = forward + codebook + commitment terms."""
        z_q, z_c, indices = self.encode_action(feat_t, feat_t1)
        feat_t1_pred = self.predict_next(feat_t, z_q)
        forward_loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        # Pull the selected codebook entries toward the encoder output —
        # without this term the codebook weights receive no gradient at all
        codebook_loss = nn.functional.mse_loss(self.codebook(indices), z_c.detach())
        # Commitment: keep the encoder output close to its chosen code
        commitment_loss = nn.functional.mse_loss(z_c, z_q.detach())
        return forward_loss + codebook_loss + 0.25 * commitment_loss
Why the additional Vector Quantization? Because a continuous latent space can be vague — every action differs only slightly. VQ-VAE forces z_t to fall into one of K "code words" — like compressing actions into one of 512 basic action types. This makes it easier for the VLA at Layer 2 to predict (predicting 1 of 512 tokens vs. regressing a continuous vector).
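The quantization step can also be seen standalone — a minimal NumPy-only sketch (no gradients, so the straight-through trick from the class above is omitted) of snapping continuous latents onto the nearest of K codebook entries:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 256                          # codebook size and latent dim, as above
codebook = rng.normal(size=(K, D))
z_continuous = rng.normal(size=(4, D))   # a batch of 4 continuous latents

# Squared Euclidean distance from every latent to every codebook entry
dists = ((z_continuous[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=-1)          # one code id per latent
z_quantized = codebook[indices]          # latents snapped onto the codebook

print(indices.shape, z_quantized.shape)  # (4,) (4, 256)
```

Each of the 4 latents collapses to one of 512 discrete codes — exactly the "1 of 512 tokens" the VLA at Layer 2 will later predict.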
Data pipeline for LAM
import random

from torch.utils.data import Dataset
from torchvision import transforms

class EgocentricVideoDataset(Dataset):
    """
    Dataset over Ego4D or similar egocentric video sources.
    Each sample is a pair of frames separated by `frame_gap`.
    """
    def __init__(self, video_paths, frame_gap=2):
        self.videos = video_paths
        self.frame_gap = frame_gap
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            ),
        ])

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        video = self.load_video(idx)  # decode frames for video `idx` (not shown)
        t = random.randint(0, len(video) - self.frame_gap - 1)
        frame_t = self.transform(video[t])
        frame_t1 = self.transform(video[t + self.frame_gap])
        return frame_t, frame_t1
An important detail: frame_gap=2 rather than 1. Consecutive frames in 30 FPS video are nearly identical — the difference is so small that the model tends to learn an identity mapping instead of actual actions. A gap of 2-4 frames gives a clearer action signal.
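To see what the gap buys you, here is a tiny self-contained sketch of the sampling logic with a dummy "video" of frame indices — at 30 FPS, a gap of 2 frames is roughly 67 ms of motion:

```python
import random

FPS = 30
frame_gap = 2
video = list(range(90))   # dummy 3-second "video": frame indices stand in for frames

def sample_pair(video, frame_gap):
    """Pick a random frame t and its partner frame_gap frames later."""
    t = random.randint(0, len(video) - frame_gap - 1)
    return video[t], video[t + frame_gap]

f_t, f_t1 = sample_pair(video, frame_gap)
gap_ms = (f_t1 - f_t) / FPS * 1000
print(f"pair separated by {gap_ms:.0f} ms")   # 67 ms at 30 FPS
```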
Layer 2: VLA Policy — From vision and language to action
After LAM initializes the latent space, Layer 2 fine-tunes a Vision-Language Model to predict latent actions from egocentric images + language instructions.
VLA Architecture
class WholebodyVLAPolicy(nn.Module):
"""
VLA Policy for upper body manipulation.
Key differences from traditional VLAs (RT-2, OpenVLA):
- Outputs latent tokens instead of explicit joint commands
- Pretrained latent space from video (not just robot data)
- Dual-arm coordination via shared latent
"""
def __init__(
self,
vlm_backbone,
lam: LatentActionModel,
num_action_tokens=8,
action_dim=14, # 7 joints × 2 arms
):
super().__init__()
self.vlm = vlm_backbone
self.lam = lam
hidden_dim = vlm_backbone.config.hidden_size
self.latent_projector = nn.Sequential(
nn.Linear(hidden_dim, 512),
nn.GELU(),
nn.Linear(512, lam.latent_dim * num_action_tokens),
)
self.action_decoder = nn.Sequential(
nn.Linear(lam.latent_dim * num_action_tokens, 256),
nn.GELU(),
nn.Linear(256, action_dim),
)
self.num_action_tokens = num_action_tokens
for param in self.lam.parameters():
param.requires_grad = False
    def forward(self, image, language_tokens):
        vlm_out = self.vlm(
            pixel_values=image,
            input_ids=language_tokens,
        ).last_hidden_state[:, -1, :]         # last-token summary of (image, text)
        z_flat = self.latent_projector(vlm_out)
        B = z_flat.shape[0]
        z_tokens = z_flat.view(B, self.num_action_tokens, -1)
        # Pool the action tokens into the single z_t handed to the LMO policy
        # (optionally snap it to the LAM codebook via self.lam.quantize)
        z_t = z_tokens.mean(dim=1)
        arm_actions = self.action_decoder(z_flat)
        return z_t, arm_actions
Dual-arm coordination
A subtle point: num_action_tokens=8 — the model predicts multiple tokens per timestep, not one. Why?
With dual-arm manipulation, each arm needs its own trajectory but must coordinate. 8 tokens allow the model to encode:
- Tokens 1-4: left arm trajectory
- Tokens 5-8: right arm trajectory
- Attention between tokens → both arms "know" what each other is doing
This corrects a point from the earlier research review: the paper does support dual-arm manipulation, although the released demos haven't yet shown full bimanual tasks.
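The token split can be illustrated with a small standalone sketch (the 4/4 left/right assignment is this article's reading of the design, not a confirmed detail of the paper):

```python
import numpy as np

B, num_action_tokens, latent_dim = 2, 8, 256
rng = np.random.default_rng(1)
z_tokens = rng.normal(size=(B, num_action_tokens, latent_dim))

# Hypothetical split: tokens 1-4 encode the left arm, tokens 5-8 the right arm
z_left = z_tokens[:, :4].mean(axis=1)    # (B, 256) pooled left-arm intent
z_right = z_tokens[:, 4:].mean(axis=1)   # (B, 256) pooled right-arm intent

# The single z_t handed to the locomotion policy pools over ALL tokens,
# so it carries both arms' intent at once
z_t = z_tokens.mean(axis=1)
assert np.allclose(z_t, (z_left + z_right) / 2)
```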
Layer 3: LMO RL Policy — Legs know what arms are doing
This is what I consider the most significant contribution of the paper, and the part where open-source code will be most valuable to the community.
What is LMO and why is it different?
Loco-Manipulation-Oriented (LMO) RL policy is a locomotion controller trained with awareness that the upper body is manipulating objects. Unlike standard locomotion policies (which only know "walk straight" or "turn left"), LMO receives z_t from VLA and adjusts:
- Stance width — widens stance for complex manipulation
- CoM compensation — shifts center of mass opposite to reaching arm
- Foot placement — positions feet optimally for stability
- Gait adaptation — switches from walking to standing when precise manipulation is needed
Observation space design
from dataclasses import dataclass
import numpy as np
@dataclass
class LMOObservation:
"""
Observation space for LMO RL policy.
Three groups: proprioception, exteroception, and manipulation context.
"""
# === Proprioception ===
joint_positions: np.ndarray # [12] — 6 joints × 2 legs
joint_velocities: np.ndarray # [12]
base_orientation: np.ndarray # [4] — quaternion
base_angular_vel: np.ndarray # [3] — gyroscope
base_linear_acc: np.ndarray # [3] — accelerometer
# === Exteroception ===
foot_contacts: np.ndarray # [4] — 2 feet × (heel, toe)
# === Manipulation context (KEY INNOVATION) ===
latent_action: np.ndarray # [256] — z_t from VLA encoder
@property
def as_vector(self) -> np.ndarray:
return np.concatenate([
self.joint_positions, # 12
self.joint_velocities, # 12
self.base_orientation, # 4
self.base_angular_vel, # 3
self.base_linear_acc, # 3
self.foot_contacts, # 4
self.latent_action, # 256
]) # Total: 294 dimensions
@property
def dim(self) -> int:
return 294
Note the dimensions: 256 out of 294 dims come from z_t. The latent action accounts for 87% of the observation space — showing just how important manipulation context is for locomotion.
Detailed reward function
Reward design is where "art" meets "science" in RL. The LMO reward has multiple balanced components:
class LMORewardFunction:
"""
Multi-component reward for LMO policy.
Design philosophy:
- Locomotion policy does NOT just optimize for movement
- It must optimize for CREATING A STABLE FOUNDATION
for manipulation to succeed
- z_t allows locomotion to "look ahead" at manipulation intent
"""
def __init__(self):
self.weights = {
'orientation': 0.25,
'angular_velocity': 0.10,
'linear_velocity': 0.10,
'foot_contact': 0.05,
'manipulation_support': 0.20,
'com_stability': 0.15,
'energy': 0.05,
'smoothness': 0.10,
}
def compute(self, state, z_t, action, prev_action):
rewards = {}
# 1. Orientation: penalize excessive tilt
roll, pitch = state['euler'][:2]
rewards['orientation'] = torch.exp(
-5.0 * (roll**2 + pitch**2)
)
# 2. Angular velocity: penalize rapid rotation
rewards['angular_velocity'] = torch.exp(
-2.0 * torch.norm(state['angular_vel'])
)
# 3. Velocity tracking
vel_error = state['linear_vel'][:2] - state['cmd_vel'][:2]
rewards['linear_velocity'] = torch.exp(
-4.0 * torch.norm(vel_error)
)
# 4. Foot contact: must always have ground contact
rewards['foot_contact'] = (
state['foot_contacts'].sum() >= 1
).float()
# 5. MANIPULATION SUPPORT (key innovation)
z_magnitude = torch.norm(z_t)
stance_width = self._compute_stance_width(state)
target_width = 0.3 + 0.15 * torch.tanh(z_magnitude / 10.0)
rewards['manipulation_support'] = torch.exp(
-3.0 * (stance_width - target_width)**2
)
# 6. CoM within support polygon
com_pos = state['com_position'][:2]
support = self._compute_support_polygon(state)
com_margin = self._point_to_polygon_distance(com_pos, support)
rewards['com_stability'] = torch.clamp(com_margin, 0, 1)
# 7. Energy efficiency
rewards['energy'] = torch.exp(
-0.01 * torch.sum(action**2)
)
# 8. Action smoothness
action_diff = action - prev_action
rewards['smoothness'] = torch.exp(
-5.0 * torch.sum(action_diff**2)
)
total = sum(
self.weights[k] * rewards[k] for k in self.weights
)
return total, rewards
Key insight from reward design
Look at the weights: manipulation_support (0.20) + com_stability (0.15) means 35% of the reward is directly tied to supporting manipulation — the same share as pure locomotion tracking (orientation 0.25 + linear_velocity 0.10 = 35%).
In other words, the LMO policy devotes as much of its objective to creating a stable foundation for the arms as to walking itself. This is exactly why it outperforms the decoupled approach by 21.3% on average — a standard locomotion policy allocates 100% of its reward to walking and 0% to manipulation support.
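A quick sanity check of those weight groupings, using the dictionary from the reward class above:

```python
weights = {
    'orientation': 0.25, 'angular_velocity': 0.10, 'linear_velocity': 0.10,
    'foot_contact': 0.05, 'manipulation_support': 0.20, 'com_stability': 0.15,
    'energy': 0.05, 'smoothness': 0.10,
}

manip_share = weights['manipulation_support'] + weights['com_stability']
loco_share = weights['orientation'] + weights['linear_velocity']

print(f"manipulation-support share: {manip_share:.2f}")   # 0.35
print(f"locomotion-tracking share:  {loco_share:.2f}")    # 0.35
assert abs(sum(weights.values()) - 1.0) < 1e-9            # weights sum to 1
```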
Training Pipeline: From video to real robot
Stage 1: Pre-train LAM (offline, no robot needed)
# LAM training script
# Data: Ego4D dataset (~3,670 hours of egocentric video)
# Hardware: 4× A100 80GB, ~3 days
python train_lam.py \
--data_path /data/ego4d/v2/ \
--vision_encoder vit-base-patch16 \
--latent_dim 256 \
--codebook_size 512 \
--frame_gap 2 \
--batch_size 256 \
--lr 1e-4 \
--epochs 50 \
--num_gpus 4
Stage 2: Collect robot demonstration data
WholebodyVLA proposes an efficient data collection pipeline:
- Teleoperation on AgiBot X2 — operator controls arms via VR controller
- Record: egocentric image + joint states + language annotation
- Augmentation: add camera noise, lighting variation
The paper collects ~500 episodes per task — far fewer than RT-2 (~130K) thanks to the pre-trained latent space.
Stage 3: Fine-tune VLA Policy
python train_vla.py \
--vlm_backbone qwen2-vl-2b \
--lam_checkpoint checkpoints/lam_ego4d.pt \
--robot_data /data/agibot_x2/demonstrations/ \
--freeze_lam \
--batch_size 32 \
--lr 5e-5 \
--epochs 20 \
--num_action_tokens 8
Stage 4: Train LMO RL Policy in Isaac Lab
from omni.isaac.lab.envs import ManagerBasedRLEnv
class LMOTrainingConfig:
"""Hyperparameters for LMO RL training."""
num_envs = 4096
episode_length = 1000 # 20 seconds @ 50Hz
sim_dt = 0.005 # 200Hz physics
control_dt = 0.02 # 50Hz policy
ppo_config = {
'learning_rate': 3e-4,
'n_steps': 24,
'batch_size': num_envs * 24,
'n_epochs': 5,
'gamma': 0.99,
'gae_lambda': 0.95,
'clip_range': 0.2,
'ent_coef': 0.01,
'vf_coef': 0.5,
'max_grad_norm': 1.0,
}
policy_net = [256, 256, 128]
activation = 'elu'
domain_rand = {
'friction_range': (0.3, 1.5),
'payload_mass_range': (0, 5.0),
'motor_strength_range': (0.8, 1.2),
'push_force_range': (0, 50),
'latent_noise_std': 0.05,
'terrain_roughness': (0, 0.05),
'com_displacement': (0, 0.05),
}
total_timesteps = 500_000_000 # ~6-8 hours on 1× A100
Critical detail: latent_noise_std. At deployment, the VLA encoder runs at ~20Hz on GPU while the RL policy runs at 50Hz on CPU. Between VLA updates, z_t is interpolated and noisy. Training with noise simulates this — a form of domain randomization for the inference pipeline, not just physics.
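That z_t noise injection can be sketched in a few lines (randomize_latent is an illustrative helper, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(42)
latent_noise_std = 0.05   # matches domain_rand['latent_noise_std'] above

def randomize_latent(z_t, std, rng):
    """Perturb the VLA latent during RL training so the policy learns to
    tolerate the stale, interpolated z_t it will see at deployment."""
    return z_t + rng.normal(scale=std, size=z_t.shape)

z_clean = rng.normal(size=(256,))
z_noisy = randomize_latent(z_clean, latent_noise_std, rng)
assert z_noisy.shape == z_clean.shape
assert not np.allclose(z_noisy, z_clean)   # perturbation was applied
```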
Deployment: Real-world inference pipeline
When deployed on real hardware, the system runs at 3 different frequencies:
Camera (30 FPS) ─── image ──→ VLA Encoder (~20Hz, GPU)
│
z_t (async update)
│
Proprioception (1kHz) ─── obs ──→ LMO Policy (50Hz, CPU)
│
joint targets
│
PD Controller (1kHz, embedded)
│
motor torques ──→ Motors
Handling frequency mismatch
import threading
import time
import numpy as np
class WholebodyVLARuntime:
"""
Runtime inference for WholebodyVLA.
Handles frequency mismatch between VLA (20Hz) and RL (50Hz).
"""
    def __init__(self, vla_model, lmo_model):
        self.vla = vla_model
        self.lmo = lmo_model
        self.running = True              # set to False to stop both loops
        self._z_current = np.zeros(256)
        self._z_prev = np.zeros(256)
        self._z_timestamp = time.time()  # avoids a huge "age" on the first tick
        self._lock = threading.Lock()
def vla_loop(self, camera, language_cmd):
"""VLA inference loop — runs on GPU thread at ~20Hz."""
while self.running:
image = camera.get_frame()
z_new, arm_actions = self.vla.predict(image, language_cmd)
with self._lock:
self._z_prev = self._z_current.copy()
self._z_current = z_new
self._z_timestamp = time.time()
self.send_arm_commands(arm_actions)
def lmo_loop(self, robot):
"""LMO RL inference loop — runs on CPU thread at 50Hz."""
dt = 0.02
while self.running:
start = time.time()
proprio = robot.get_proprioception()
with self._lock:
age = time.time() - self._z_timestamp
alpha = min(age / 0.05, 1.0)
z_interp = (1 - alpha) * self._z_prev + alpha * self._z_current
obs = np.concatenate([
proprio['joint_pos'],
proprio['joint_vel'],
proprio['base_quat'],
proprio['gyro'],
proprio['accel'],
proprio['foot_contacts'],
z_interp,
])
leg_actions = self.lmo.predict(obs)
robot.set_leg_targets(leg_actions)
elapsed = time.time() - start
if elapsed < dt:
time.sleep(dt - elapsed)
Results: Numbers speak for themselves
The paper tests on 10 tasks with AgiBot X2, 20 trials each:
| Task Category | Baseline (Decoupled) | WholebodyVLA | Δ |
|---|---|---|---|
| Static manipulation | 68% | 79% | +11% |
| Walk + carry | 38% | 65% | +27% |
| Walk + place on shelf | 45% | 72% | +27% |
| Push heavy object | 41% | 55% | +14% |
| Average | 46.5% | 67.5% | +21.3% |
Clear pattern: the more coordination required between arms and legs, the larger the improvement. Static manipulation only gains 11% (legs barely move), but walk + carry/place gains 27% — exactly where decoupled approaches fail.
Ablation study
| Component removed | Success rate | Drop |
|---|---|---|
| Full WholebodyVLA | 67.5% | — |
| Remove LMO reward (standard locomotion) | 52.1% | -15.4% |
| Remove egocentric video pre-training | 55.8% | -11.7% |
| Remove latent actions (use explicit) | 58.2% | -9.3% |
| Remove domain rand on z_t | 61.3% | -6.2% |
The LMO reward is the most critical factor — removing it drops success by 15.4 points. This confirms that the main issue isn't the VLA architecture, but whether locomotion knows what manipulation is doing.
Building a similar system: Practical roadmap
The repo hasn't released full code yet, but you can build a similar pipeline with open-source tools:
Step 1: LAM from Ego4D
- Dataset: Ego4D — free for research
- Vision encoder: ViT-Base from DINOv2 (better than ImageNet-pretrained for video)
- Framework: PyTorch + HuggingFace Transformers
- GPU: 1× RTX 4090 is sufficient for prototyping (smaller batch size)
Step 2: VLA Policy
- VLM backbone: Qwen2-VL-2B or PaliGemma 2 (small, runs on consumer GPU)
- Robot data: if no physical robot, use DROID dataset
- Or: start with LeRobot framework for data collection
Step 3: LMO RL in Isaac Lab
- Simulator: Isaac Lab — free, runs on RTX 3060+
- Robot model: Unitree H1 or G1 URDF (available in Isaac Lab)
- RL algorithm: PPO via RSL-RL or RL Games (built-in)
Step 4: Sim-to-real (if you have hardware)
- Deploy on Unitree G1 (~$30K) — most affordable humanoid for research
- Use established sim-to-real pipeline for Unitree
Known limitations
- Code not fully open-sourced — current repo is mainly reference documentation, no training scripts released yet
- Hardware requirements — real-time VLA inference needs an A100-class GPU; a consumer GPU is only sufficient for offline training
- Flat terrain only — no stairs, slopes, or outdoor environments tested
- Single language — language commands in English only, no multilingual support
Conclusion: Why this is a turning point
WholebodyVLA isn't just a good paper — it represents a paradigm shift in humanoid robotics:
- From decoupled (arms and legs separate) to unified (shared latent space)
- From robot-only data to internet-scale video data
- From explicit actions to latent actions (much more scalable)
For the robotics community, this paper opens research directions that don't require expensive hardware upfront — LAM and VLA can be trained from video, LMO RL can be trained in simulation. All you need is a GPU and time.
References
- WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control — Haoran Jiang, Jin Chen et al., ICLR 2026
- GitHub: OpenDriveLab/WholebodyVLA — Official repository
- Ego4D: Around the World in 3,000 Hours of Egocentric Video
- Isaac Lab Documentation
- AgiBot X2 Humanoid Robot — Hardware platform used in the paper
Related posts
- WholeBodyVLA: Whole-Body VLA for Humanoid Loco-Manipulation — Original research analysis
- Humanoid Loco-Manipulation: When Robots Walk and Work — Humanoid series
- VLA Models: RT-2, Octo, OpenVLA, pi0 — VLA model overview