aivlahumanoidloco-manipulationiclrtrainingrlprismaticdino

WholebodyVLA: Full Training Pipeline Guide

Step-by-step guide to the complete WholebodyVLA training pipeline — egocentric data collection, LAM training, VLA fine-tuning, LMO RL policy, and inference deployment.

Nguyễn Anh Tuấn21 tháng 4, 202612 min read
WholebodyVLA: Full Training Pipeline Guide

If you've already read the WholeBodyVLA ICLR 2026 paper analysis and the architecture deep-dive, you understand why this system matters and how its components work internally. This tutorial goes one step further: how to build the training pipeline from scratch.

Note: As of April 2026, the OpenDriveLab team has not yet released the official code. However, the paper provides enough technical detail to understand the full process and prepare for when the code drops — or to implement it yourself following the paper's design.

The Big Picture: 3 Components, 2 Training Phases

Before diving into individual steps, internalize WholebodyVLA's structure:

[Egocentric Video Data] ──► LAM Training ──► Latent Codes
         │                                         │
[Manipulation Data]     ─────────────────────────── ┤
         │                                         ▼
[Teleoperation Demos]   ──► VLA Fine-tuning (Prismatic-7B + LoRA)
                                                   │
[Isaac Lab Simulation]  ──► LMO RL Policy Training  │
                                                   ▼
                         ┌──────────────────────────┐
                         │  Inference @ 10 / 50 Hz  │
                         └──────────────────────────┘

The pipeline has two parallel phases:

  1. LAM + VLA Pretraining: Learning to "read" egocentric video and translate it into latent action codes.
  2. RL Policy Training: Training the LMO (Loco-Manipulation-Oriented) policy in simulation.

Both branches converge at inference, where the VLA issues commands and LMO executes them at higher frequency.

Hardware Requirements

These are the real numbers from the paper — this won't run on your laptop:

Task Hardware Duration
LAM Training 8× NVIDIA H100 ~30,000 steps
VLA Pretraining 8× NVIDIA H100 ~20,000 steps
VLA Fine-tuning 8× NVIDIA H100 ~10,000 steps
LMO RL Training 1× NVIDIA H100 Several days
Inference (VLA) RTX 4090 ~10 Hz
Inference (RL) NanoPi (onboard) 50 Hz

You don't need an H100 cluster to experiment — but to reproduce the paper's results (78% success), you'll need at least one H100 or equivalent A100 80GB for training.

Step 1: Environment Setup

# Python environment
conda create -n wholebodyvla python=3.10
conda activate wholebodyvla

# Core dependencies
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate einops timm
pip install open-clip-torch  # CLIP encoder for VLA

# DINOv2 (LAM encoder)
pip install git+https://github.com/facebookresearch/dinov2.git

# Isaac Lab (LMO RL training)
# See official guide: https://isaac-sim.github.io/IsaacLab/
pip install isaaclab

# ZeroMQ (inference communication)
pip install pyzmq
pip install pyrealsense2  # Intel RealSense SDK

Step 2: Data Collection

This is the most time-intensive step and often underestimated. WholebodyVLA uses three distinct data types for three different purposes.

2.1 Egocentric Locomotion Video (for Locomotion LAM)

Goal: Collect first-person video of a human approaching manipulation targets — no robot required, no action labels needed.

Camera setup:

import pyrealsense2 as rs
import numpy as np
import cv2

pipeline = rs.pipeline()
config = rs.config()

# Intel RealSense D435i: 640x480 @ 30fps
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.accel)  # IMU for locomotion tracking
config.enable_stream(rs.stream.gyro)

profile = pipeline.start(config)

8 motion primitives to collect:

Primitive Description Recommended count
Advance Walk straight toward target 500 clips
Turn Left/Right Rotate to face correct direction 400 clips
Sidestep Lateral step left/right 300 clips
Squat Down Crouch to reach low objects 300 clips
Stand Up Rise after grabbing object 300 clips
Approach Near Close distance to target 400 clips
Back Away Step backwards 200 clips
Stop & Ready Halt, arms ready for manipulation 300 clips

The paper collected ~300 hours of video — but ablation studies show you can start with 50 hours and capture 60%+ of the benefit.

def collect_primitive(primitive_name: str, duration_secs: int = 3):
    """Collect one motion primitive clip."""
    timestamp = int(time.time())
    frames = []
    
    start = time.time()
    while time.time() - start < duration_secs:
        frameset = pipeline.wait_for_frames()
        color_frame = frameset.get_color_frame()
        
        if color_frame:
            img = np.asanyarray(color_frame.get_data())
            frames.append(img)
    
    out_path = f"data/loco_videos/{primitive_name}/{timestamp}.mp4"
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(out_path, fourcc, 30, (640, 480))
    for f in frames:
        out.write(f)
    out.release()
    
    return out_path

2.2 Manipulation Data (AgiBot World Dataset)

The paper leverages the AgiBot World dataset for in-place bimanual manipulation pretraining:

# Expected directory structure:
data/
  manipulation/
    task_001/
      episode_000/
        images/      # frame-by-frame
        actions.npy  # joint angles
        lang.txt     # task description
    task_002/
    ...

2.3 Robot Teleoperation Data (for Fine-tuning)

Collected directly on the AgiBot X2 — you need either a real robot or a high-fidelity simulation:

  • Upper body: VR headset (Meta Quest 3 or HTC Vive) mapping hand poses → joint targets
  • Lower body: Joystick → locomotion commands
  • Quantity: 50 demonstrations/task × 3 tasks = 150 total demos

The three core tasks: Bag Packing, Box Loading, Cart Pushing.

Step 3: Training the Latent Action Model (LAM)

The LAM is the "translator" that converts video frames into latent codes, which the VLA then learns to predict.

The paper trains two separate LAMs:

  • Manipulation LAM: Static camera, moving hands
  • Locomotion LAM: Moving camera following the person

VQ-VAE Architecture

import torch
import torch.nn as nn
from transformers import Dinov2Model

class LatentActionModel(nn.Module):
    """VQ-VAE with DINOv2 encoder for WholebodyVLA."""
    
    def __init__(
        self,
        codebook_size: int = 8192,
        latent_dim: int = 256,
        commitment_cost: float = 0.25,  # beta in the paper
    ):
        super().__init__()
        
        # DINOv2-base as feature encoder
        self.encoder = Dinov2Model.from_pretrained("facebook/dinov2-base")
        
        # Project DINOv2 features (768) → latent_dim
        self.encoder_proj = nn.Linear(768 * 2, latent_dim)
        
        # Vector Quantizer
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.commitment_cost = commitment_cost
        
        # Decoder to reconstruct future frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 768),
        )
    
    def encode(self, obs_t: torch.Tensor, obs_t_k: torch.Tensor):
        """
        Args:
            obs_t: Current frame (B, C, H, W)
            obs_t_k: Future frame k steps ahead (B, C, H, W)
        Returns:
            Quantized latent code and VQ loss
        """
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
        feat_tk = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
        
        z = self.encoder_proj(torch.cat([feat_t, feat_tk], dim=-1))
        
        distances = torch.cdist(z.unsqueeze(1), self.codebook.weight.unsqueeze(0))
        indices = distances.squeeze(1).argmin(dim=-1)
        c = self.codebook(indices)
        
        vq_loss = torch.mean((c.detach() - z) ** 2)
        commit_loss = self.commitment_cost * torch.mean((c - z.detach()) ** 2)
        c_st = z + (c - z).detach()  # straight-through estimator
        
        return c_st, indices, vq_loss + commit_loss
    
    def forward(self, obs_t, obs_t_k):
        c, indices, vq_loss = self.encode(obs_t, obs_t_k)
        
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
        pred_feat = self.decoder(c + feat_t)
        
        with torch.no_grad():
            target_feat = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
        
        mse_loss = nn.functional.mse_loss(pred_feat, target_feat)
        return mse_loss + vq_loss, indices

Training:

# Train manipulation LAM
python train_lam.py --data-type manipulation --data-dir data/manipulation/ \
  --steps 30000 --batch-size 256 --gpus 8

# Train locomotion LAM
python train_lam.py --data-type locomotion --data-dir data/loco_videos/ \
  --steps 30000 --batch-size 256 --gpus 8

Step 4: Training the VLA (Prismatic-7B)

The VLA backbone uses Prismatic-7B — a 7B parameter Vision-Language Model with dual-stream image encoding.

Humanoid robot moving through open space — loco-manipulation requires tight coordination between arms and legs

Stage 1: Pretraining with LAM Supervision

The VLA learns to predict latent codes from both LAMs trained in Step 3:

class WholebodyVLA(nn.Module):
    """VLA backbone with dual latent action prediction heads."""
    
    def __init__(self, vlm_path="TRI-ML/prismatic-7b", codebook_size=8192):
        super().__init__()
        self.vlm = AutoModelForCausalLM.from_pretrained(vlm_path)
        self.tokenizer = AutoTokenizer.from_pretrained(vlm_path)
        
        hidden_size = self.vlm.config.hidden_size  # 4096 for 7B
        self.mani_head = nn.Linear(hidden_size, codebook_size)
        self.loco_head = nn.Linear(hidden_size, codebook_size)
    
    def forward(self, images, language, labels_mani=None, labels_loco=None):
        inputs = self.tokenizer(language, return_tensors="pt", padding=True)
        outputs = self.vlm(
            input_ids=inputs["input_ids"].cuda(),
            pixel_values=images,
            output_hidden_states=True,
        )
        
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        logits_mani = self.mani_head(last_hidden)
        logits_loco = self.loco_head(last_hidden)
        
        loss = 0
        if labels_mani is not None:
            loss += nn.functional.cross_entropy(logits_mani, labels_mani)
        if labels_loco is not None:
            loss += nn.functional.cross_entropy(logits_loco, labels_loco)
        
        return loss, logits_mani, logits_loco

Pretraining config:

PRETRAIN_CONFIG = {
    "steps": 20_000,
    "batch_size": 1_024,
    "lr": 2e-5,
    "warmup_steps": 500,
    "gpus": 8,  # H100
}

Stage 2: LoRA Fine-tuning on Teleoperation Data

After pretraining, fine-tune using the 150 robot demonstration episodes:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(wholebodyvla.vlm, lora_config)
model.print_trainable_parameters()
# Trainable params: ~13M / 7B total (~0.2%)

FINETUNE_CONFIG = {
    "steps": 10_000,
    "batch_size": 64,
    "lr": 5e-5,
}

Step 5: Training the LMO RL Policy

This is the most technically interesting part — an RL policy learning full-body humanoid control in simulation.

Discrete Command Interface

Instead of traditional velocity tracking, LMO uses discrete flags as the command interface:

class LMOCommand:
    """
    Discrete command interface for LMO RL policy.
    Each axis: {-1: backward/right/ccw, 0: stop, 1: forward/left/cw}
    """
    s_x: int    # Forward/backward: {-1, 0, 1}
    s_y: int    # Lateral: {-1, 0, 1}
    s_psi: int  # Yaw rotation: {-1, 0, 1}
    h_star: float  # Target stance height (continuous): [0.7, 1.0] m

def command_to_velocity(cmd: LMOCommand, alpha: float = 2.0) -> np.ndarray:
    """
    Smooth velocity ramp from discrete command.
    v_k^ref(t) = v_k^goal * tanh[alpha * (s_k - s_bar_k(t))]
    Prevents jerky start/stop transitions.
    """
    v_goal = np.array([0.5 * cmd.s_x, 0.3 * cmd.s_y, 0.5 * cmd.s_psi])
    return v_goal

The paper shows that this discrete interface achieves 78.0% success vs. 54.0% for velocity-based RL — a 24-point gap that comes entirely from better start/stop control and reduced trajectory variance.

Two-Stage Curriculum in Isaac Lab

# Stage I: Gait acquisition (basic locomotion)
STAGE1_CONFIG = {
    "num_envs": 4096,
    "episode_length": 500,
    "goal_speed_range": [0.1, 0.5],  # m/s, sampled per episode
    "rewards": {
        "tracking_lin_vel": 1.5,
        "tracking_ang_vel": 0.75,
        "lin_vel_z_penalty": -2.0,
        "feet_air_time": 1.0,
        "joint_torque_penalty": -0.0002,
    },
    "domain_randomization": {
        "mass_ratio": [0.8, 1.2],
        "friction_coeff": [0.3, 1.0],
        "motor_kp": [0.8, 1.2],
        "control_delay_steps": [0, 3],
    }
}

# Stage II: Precision refinement (fixed speed + manipulation perturbations)
STAGE2_CONFIG = {
    "num_envs": 2048,
    "episode_length": 500,
    "goal_speed": 0.5,  # Fixed, not sampled
    "rewards": {
        "directional_accuracy": 2.0,  # J_dir = |wrap(ψ_end - ψ_start)|
        "stand_still_penalty": -1.0,
        "tracking_lin_vel": 1.5,
    },
    "arm_perturbation": {
        "enabled": True,
        "strength_range": [0.5, 1.0],  # Stronger than Stage I
    },
    "domain_randomization": {
        "mass_ratio": [0.7, 1.3],      # Wider range
        "friction_coeff": [0.2, 1.2],
        "ground_unevenness": [0.0, 0.02],  # Uneven terrain added
    }
}
# Stage I: ~2 days on 1× H100
python train_lmo.py --config configs/lmo_stage1.yaml --robot agibot_x2

# Stage II: ~1 day on 1× H100 (resuming from Stage I)
python train_lmo.py --config configs/lmo_stage2.yaml --robot agibot_x2 \
  --resume checkpoints/lmo_stage1_final.pt

Step 6: Inference Deployment

This is where both training branches converge. The system runs two communicating processes via ZeroMQ:

AI chip hardware — WholebodyVLA inference splits across offboard RTX 4090 and onboard NanoPi computer

import zmq
import torch
import numpy as np

class VLAInferenceServer:
    """
    VLA server on offboard GPU (RTX 4090).
    Rate: ~10 Hz (bottlenecked by VLM inference latency).
    """
    
    def __init__(self, vla_model, lam_mani, lam_loco, port=5555):
        self.vla = vla_model
        self.lam_mani = lam_mani
        self.lam_loco = lam_loco
        
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REP)
        self.socket.bind(f"tcp://*:{port}")
    
    def run(self):
        while True:
            obs = self.socket.recv_pyobj()
            
            with torch.no_grad():
                image_t = preprocess_image(obs["egocentric_image"]).cuda()
                _, logits_mani, logits_loco = self.vla(
                    images=image_t.unsqueeze(0),
                    language=[obs["task_description"]],
                )
                
                mani_code = logits_mani.argmax(dim=-1)
                loco_code = logits_loco.argmax(dim=-1)
                
                arm_actions = self.lam_mani.decode_to_joints(mani_code)
                loco_cmd = self.lam_loco.decode_to_command(loco_code)
            
            self.socket.send_pyobj({
                "arm_actions": arm_actions.cpu().numpy(),   # (14,) joint targets
                "loco_command": loco_cmd.cpu().numpy(),     # (4,) = [s_x, s_y, s_ψ, h*]
            })


class LMOPolicyClient:
    """
    RL policy on robot onboard computer (NanoPi M6).
    Rate: 50 Hz — receives LMO commands from VLA and outputs joint torques.
    """
    
    def __init__(self, lmo_policy, vla_server_ip, port=5555):
        self.lmo = lmo_policy
        
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REQ)
        self.socket.connect(f"tcp://{vla_server_ip}:{port}")
        self.socket.RCVTIMEO = 50  # 50ms timeout
        self.last_loco_cmd = np.zeros(4)
    
    def step(self, robot_obs: dict) -> np.ndarray:
        """Run one control step at 50 Hz."""
        try:
            self.socket.send_pyobj(robot_obs)
            vla_output = self.socket.recv_pyobj()
            self.last_loco_cmd = vla_output["loco_command"]
        except zmq.Again:
            pass  # VLA not ready — hold last command
        
        obs_vec = torch.tensor(np.concatenate([
            self.last_loco_cmd,        # (4,)
            robot_obs["base_ang_vel"], # (3,)
            robot_obs["gravity_vec"],  # (3,)
            robot_obs["joint_pos"],    # (N,)
            robot_obs["joint_vel"],    # (N,)
            robot_obs["prev_action"],  # (N,)
        ]), dtype=torch.float32).cuda()
        
        with torch.no_grad():
            action = self.lmo(obs_vec.unsqueeze(0))
        
        return action.squeeze(0).cpu().numpy()

Launching inference:

# Machine 1 — offboard GPU (RTX 4090):
python inference_server.py \
  --vla-checkpoint checkpoints/wholebodyvla_final.pt \
  --lam-mani checkpoints/lam_mani.pt \
  --lam-loco checkpoints/lam_loco.pt

# Machine 2 — onboard robot computer:
python robot_controller.py \
  --lmo-checkpoint checkpoints/lmo_stage2_final.pt \
  --vla-server 192.168.1.100 \
  --control-freq 50

Expected Results

If implemented correctly with sufficient data:

Task WholebodyVLA Best Baseline
Bag Packing ~75% ~60%
Box Loading ~80% ~65%
Cart Pushing ~79% ~57%
Average ~78% ~64%

The single most impactful ablation: Removing LAM pretraining (supervised learning from demos only) drops performance to ~39%. The egocentric video collection is the highest-ROI investment in this entire pipeline.

Pre-Flight Checklist

□ Hardware ready (≥1× H100 for training, RTX 4090 for VLA inference)
□ Intel RealSense D435i camera calibrated and tested
□ 50+ hours of egocentric locomotion video collected
□ AgiBot World dataset downloaded (or equivalent manipulation data)
□ Robot teleoperation data: 50+ demos/task across 3 tasks
□ Isaac Lab v2.3+ installed and tested with humanoid URDF
□ ZeroMQ network between VLA server and robot onboard stable (<50ms)
□ LAM training converged (codebook utilization >50%)
□ VLA validation loss decreasing during fine-tuning
□ LMO Stage I passing straight-line walking and turning tests

WholebodyVLA sets a new bar for humanoid loco-manipulation — but it's also the system with the most moving parts you'll encounter in robot learning. Mastering each component independently before combining them is the right strategy.

Dive deeper into individual components with VLA Models — AI Series 5 and Isaac Lab for Robotics Simulation.

NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Related Posts

NEWDeep Dive
WBC + VLA mới nhất cho humanoid
wbcvlawhole-body-controlhumanoidloco-manipulation

WBC + VLA mới nhất cho humanoid

Deep dive WholeBodyVLA, LMO RL và xu hướng WBC + VLA giúp humanoid vừa đi vừa thao tác trong không gian lớn.

3/6/202615 min read
NEWDeep Dive
VLA-RFT: RL Fine-Tune VLA trong World Simulator
vlavla-rftreinforcement-learningworld-modelgrpoliberoopenhelixmanipulation

VLA-RFT: RL Fine-Tune VLA trong World Simulator

VLA-RFT dùng world model làm simulator để fine-tune VLA bằng GRPO, reward kiểm chứng và code GitHub trên LIBERO.

3/6/202614 min read
NEWTutorial
Chạy Wall-OSS-0.5 với LeRobot
wall-ossvlalerobotmanipulationzero-shot

Chạy Wall-OSS-0.5 với LeRobot

Hướng dẫn chạy Wall-OSS-0.5, VLA 4B open-source zero-shot cho robot manipulation, từ paper đến LeRobot training và inference.

3/6/202613 min read