WholeBodyVLA: A Whole-Body VLA for Humanoid Loco-Manipulation

The big problem: a humanoid has to walk and manipulate at the same time

Imagine carrying a cup of hot coffee across a room. Your brain must simultaneously (1) hold the cup steady, (2) adjust your steps so nothing spills, and (3) avoid obstacles. The two systems, locomotion and manipulation, have to coordinate tightly rather than run in parallel as independent processes.

This is exactly the loco-manipulation problem for humanoid robots, and it is where most current methods fail. Why? Because they split locomotion and manipulation into two separate modules and then bolt them together, like one person controlling the arms while another controls the legs. The result: the robot falls over while trying to manipulate, or the manipulation fails because the locomotion policy has no idea what the manipulation needs.

WholeBodyVLA (Jiang, Chen et al., ICLR 2026), from Fudan University, OpenDriveLab, and AGIBOT, tackles this with a unified architecture: a VLA for manipulation combined with an RL policy for locomotion, both sharing a common latent space.

Why does this paper matter?

WholeBodyVLA achieves three things that no earlier paper managed at the same time:

Learning manipulation from egocentric video without action labels: it draws on millions of first-person videos recorded by people (YouTube, Ego4D)
Manipulation-aware locomotion: the robot's legs adjust their stance based on what the arms are doing
Outperforming the baseline by 21.3% on a real robot (AgiBot X2), not just in simulation

The real-robot results on the AgiBot X2 are especially notable: this is not a "pretty demo in sim and nothing more" paper.

Architecture: two tiers, one latent space

WholeBodyVLA consists of two main tiers:

┌─────────────────────────────────────────────┐
│           VLA Encoder (Upper Body)          │
│                                             │
│  [Egocentric Image] + [Language Command]    │
│           ↓                                 │
│   Vision-Language Model (pretrained)        │
│           ↓                                 │
│   Latent Action Space z_t                   │
│   (learned from action-free video)          │
└─────────────┬───────────────────────────────┘
              │ z_t (latent manipulation intent)
              ↓
┌─────────────────────────────────────────────┐
│      LMO RL Policy (Lower Body)             │
│                                             │
│  [z_t] + [Proprioception] + [IMU]           │
│           ↓                                 │
│   PPO Policy (trained in Isaac Lab)         │
│           ↓                                 │
│   Joint position targets for locomotion     │
│   (adapts stance to manipulation intent)    │
└─────────────────────────────────────────────┘

The key is the latent vector z_t: it encodes both manipulation intent and spatial context, so the locomotion policy "knows" what the robot's arms are trying to do and can adjust its posture accordingly.

Tier 1: VLA Encoder, learning from video without actions

The problem: robot manipulation data is extremely scarce

Conventional VLA training needs (observation, action) pairs collected on a robot, which is very expensive. BridgeData V2 has about 60K trajectories and Open X-Embodiment has roughly 1M; that sounds like a lot, but it is tiny next to the billions of image-text pairs available for VLMs.

WholeBodyVLA addresses this with latent action learning: instead of requiring explicit action labels, the model learns a latent representation from consecutive egocentric video frames.

Inverse Dynamics in Latent Space

The idea: if two consecutive video frames differ, some action must have occurred between them. The model does not need to know the specific action; it only needs to learn a latent representation of that "change".

import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """
    Learns a latent action from a pair of consecutive frames.
    No action labels are needed, only video frames.
    """
    def __init__(self, vision_encoder, latent_dim=256):
        super().__init__()
        # Assumed interface: a pretrained ViT wrapper that returns pooled
        # features [B, D] and exposes `hidden_size` (= D)
        self.vision_encoder = vision_encoder
        self.latent_dim = latent_dim
        
        # Inverse dynamics model: (s_t, s_{t+1}) → z_t
        # z_t is the latent action between the two frames
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size * 2, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        
        # Forward dynamics: (s_t, z_t) → s_{t+1}_pred
        # Ensures z_t encodes enough information to predict the next frame
        self.forward_dynamics = nn.Sequential(
            nn.Linear(vision_encoder.hidden_size + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vision_encoder.hidden_size),
        )
    
    def encode_latent_action(self, frame_t, frame_t1):
        """Compute the latent action z_t from two consecutive frames."""
        feat_t = self.vision_encoder(frame_t)    # [B, D]
        feat_t1 = self.vision_encoder(frame_t1)  # [B, D]
        
        # Concatenate the features and predict the latent action
        combined = torch.cat([feat_t, feat_t1], dim=-1)  # [B, 2D]
        z_t = self.inverse_dynamics(combined)              # [B, latent_dim]
        return z_t, feat_t, feat_t1
    
    def forward(self, frame_t, frame_t1):
        z_t, feat_t, feat_t1 = self.encode_latent_action(frame_t, frame_t1)
        
        # Forward dynamics: (s_t, z_t) → predicted s_{t+1}
        feat_t1_pred = self.forward_dynamics(
            torch.cat([feat_t, z_t], dim=-1)
        )
        
        # L2 loss in feature space (the target is detached, so no
        # gradient flows through it)
        loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        return z_t, loss

The two dynamics models complement each other:

Inverse dynamics: produces z_t from the frame pair, so z_t has to capture "the action that occurred"
Forward dynamics: supplies the training loss, forcing z_t to carry enough information to reconstruct the next state
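Written out, the objective computed by the code above looks roughly as follows. The notation is introduced here for illustration and is not taken from the paper: \phi is the vision encoder, f_inv and f_fwd are the two MLPs, D is the feature dimension, and sg(·) is the stop-gradient applied by `.detach()`. The loss is additionally averaged over the batch.

```latex
\begin{aligned}
z_t &= f_{\mathrm{inv}}\big(\phi(o_t),\ \phi(o_{t+1})\big) \\
\hat{s}_{t+1} &= f_{\mathrm{fwd}}\big(\phi(o_t),\ z_t\big) \\
\mathcal{L} &= \frac{1}{D}\,\big\lVert \hat{s}_{t+1} - \mathrm{sg}\big(\phi(o_{t+1})\big) \big\rVert_2^2
\end{aligned}
```

Only the forward-dynamics error is penalized; the inverse-dynamics network is trained through it. The sole bottleneck in this sketch is that latent_dim is smaller than D, which limits how much of the next frame z_t can simply copy.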

VLA with Latent Actions

With the latent action encoder in place, WholeBodyVLA fine-tunes the VLM to predict latent actions instead of explicit robot actions:

class WholeBodyVLAEncoder(nn.Module):
    """
    VLA that outputs latent actions instead of explicit joint commands.
    The latent space is pre-trained on egocentric video.
    """
    def __init__(self, vlm_backbone, latent_action_encoder):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., LLaVA, Qwen-VL
        self.latent_projector = nn.Linear(
            vlm_backbone.config.hidden_size,
            latent_action_encoder.latent_dim
        )
        # Not called in forward(); presumably kept to produce target
        # latents during fine-tuning
        self.latent_action_encoder = latent_action_encoder
        
        # Freeze the latent encoder after pre-training
        for param in self.latent_action_encoder.parameters():
            param.requires_grad = False
    
    def forward(self, image, language_instruction):
        """
        Input: egocentric image + tokenized language command (input_ids)
        Output: latent action z_t
        """
        # The VLM processes image + text.
        # Assumes the backbone returns `last_hidden_state`; causal-LM
        # wrappers usually need output_hidden_states=True instead.
        vlm_output = self.vlm(
            pixel_values=image,
            input_ids=language_instruction,
        )
        
        # Project into the latent action space
        # (index -1 assumes left-padded or unpadded sequences)
        hidden = vlm_output.last_hidden_state[:, -1, :]  # last token
        z_t = self.latent_projector(hidden)  # [B, latent_dim]
        
        return z_t

A subtle point: the VLA does not need to know which specific robot it is driving; it only outputs a "manipulation intent" in latent space. It is the locomotion policy in the lower tier that maps z_t to motor commands for a specific robot.

Tier 2: LMO RL Policy, manipulation-aware locomotion

This is the most important contribution: the Loco-Manipulation-Oriented (LMO) RL policy.

Why is an ordinary locomotion policy not enough?

Standard locomotion policies (such as those on the Unitree Go2 or Boston Dynamics robots) are trained to walk straight, keep balance, and follow a velocity command. They have no knowledge of what the upper body is doing.

Practical consequences:

Situation	Standard locomotion	LMO Policy
Robot raises its arm high	Stands as usual → falls	Lowers its center of mass, widens its stance
Robot leans right to pick something up	No compensation → center of mass drifts off balance	Steps left to balance
Robot pushes a heavy object	Does not increase friction → slips	Presses its feet harder into the floor

Reward Design for LMO

The LMO RL reward function has three main components:

import torch

def compute_lmo_reward(
    # State information
    base_orientation,     # base orientation quaternion
    base_velocity,        # linear + angular velocity, shape [..., 6]
    joint_positions,      # current joint positions
    joint_torques,        # applied torques
    foot_contacts,        # per-foot ground contact flags
    # Manipulation context
    z_t,                  # latent action from the VLA encoder
    ee_position,          # end-effector position 
    ee_target,            # target position for manipulation
    # Config
    dt=0.02,              # control period (unused in this sketch)
):
    """
    LMO reward, three main components:
    1. Stability: keep the torso stable
    2. Manipulation support: make manipulation easier
    3. Energy efficiency: do not waste torque

    quaternion_to_euler, compute_com_support and compute_stance_width
    are helpers assumed to be defined elsewhere.
    """
    
    # === 1. STABILITY REWARD ===
    # Penalize orientation deviation from upright
    # Roll and pitch should stay near 0; yaw follows the command
    roll, pitch, yaw = quaternion_to_euler(base_orientation)
    r_orientation = torch.exp(-5.0 * (roll**2 + pitch**2))
    
    # Penalize angular velocity (the robot should not wobble).
    # Slice the last dimension so batched inputs work.
    r_angular_vel = torch.exp(-2.0 * torch.norm(base_velocity[..., 3:], dim=-1))
    
    # Reward the foot contact pattern (at least one foot on the ground)
    r_contact = (foot_contacts.sum(dim=-1) >= 1).float()
    
    # === 2. MANIPULATION SUPPORT REWARD ===
    # This is the key innovation: locomotion is rewarded for
    # SUPPORTING manipulation, not just for walking straight
    
    # End-effector tracking: the legs must be stable for the arm to be accurate
    ee_error = torch.norm(ee_position - ee_target, dim=-1)
    r_manipulation = torch.exp(-3.0 * ee_error)
    
    # Manipulation-aware CoM: the center of mass should shift
    # to help the arm reach the target
    com_to_target = compute_com_support(
        base_orientation, joint_positions, ee_target
    )
    r_com_support = torch.exp(-1.0 * com_to_target)
    
    # Latent-conditioned bonus: the RL policy sees z_t, so it knows the intent
    # Bonus when the stance matches the manipulation intent
    z_magnitude = torch.norm(z_t, dim=-1)
    stance_width = compute_stance_width(joint_positions)
    # Larger manipulation → wider stance
    r_stance = torch.exp(-2.0 * torch.abs(
        stance_width - 0.3 - 0.1 * z_magnitude
    ))
    
    # === 3. ENERGY EFFICIENCY ===
    r_energy = torch.exp(-0.01 * torch.sum(joint_torques**2, dim=-1))
    
    # === TOTAL (weights sum to 1.0) ===
    reward = (
        0.3 * r_orientation +
        0.15 * r_angular_vel +
        0.1 * r_contact +
        0.2 * r_manipulation +
        0.1 * r_com_support +
        0.1 * r_stance +
        0.05 * r_energy
    )
    
    return reward

The core difference lies in r_manipulation and r_com_support: the locomotion policy is rewarded when it helps manipulation succeed, not only when it keeps its balance.
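To make the stance term and the weight budget concrete, here is a scalar re-implementation of r_stance together with a sum of the weights per group. It is a minimal sketch: the constants are copied from compute_lmo_reward, while the function names and the three-way grouping are introduced here for illustration, and the stance width is assumed to be in meters.

```python
import math

# Weights copied from compute_lmo_reward; the grouping is illustrative.
WEIGHTS = {
    "stability": {"orientation": 0.30, "angular_vel": 0.15, "contact": 0.10},
    "manipulation_support": {"manipulation": 0.20, "com_support": 0.10, "stance": 0.10},
    "energy": {"energy": 0.05},
}


def group_totals(weights):
    """Sum the weights inside each group."""
    return {group: sum(terms.values()) for group, terms in weights.items()}


def target_stance_width(z_magnitude):
    """Stance width at which r_stance peaks: 0.3 + 0.1 * ||z_t||."""
    return 0.3 + 0.1 * z_magnitude


def stance_reward(stance_width, z_magnitude):
    """Scalar version of r_stance = exp(-2 * |width - target|)."""
    error = abs(stance_width - target_stance_width(z_magnitude))
    return math.exp(-2.0 * error)


if __name__ == "__main__":
    print({g: round(w, 2) for g, w in group_totals(WEIGHTS).items()})
    for z_mag in (0.0, 1.0, 2.0):
        # Reward for keeping a fixed 0.3 stance as the latent grows
        print(z_mag, round(target_stance_width(z_mag), 2),
              round(stance_reward(0.3, z_mag), 3))
```

With these constants, a latent of magnitude 2.0 moves the preferred stance width from 0.3 to 0.5. The stability terms carry 55% of the total weight, manipulation support 40%, and energy 5%.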

Training Pipeline in Isaac Lab

import torch
from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.utils import configclass
from stable_baselines3 import PPO

# NOTE: illustrative sketch. A real Isaac Lab config subclasses
# ManagerBasedRLEnvCfg, and the env must be wrapped in an
# SB3-compatible VecEnv before being passed to PPO.

@configclass
class LMOEnvCfg:
    """Config for the environment used to train the LMO policy."""
    
    # Robot
    robot_asset = "agibot_x2.usd"
    
    # Observation space
    # KEY: includes z_t from the VLA encoder
    observation_space = {
        "proprioception": 45,      # joint positions + velocities
        "imu": 6,                  # linear acc + angular vel  
        "latent_action": 256,      # z_t from the VLA encoder
        "foot_contacts": 4,        # binary contact sensors
        "base_orientation": 4,     # quaternion
    }
    
    # Action: joint position targets for the lower body
    action_space = 12  # 6 joints × 2 legs
    
    # Randomization, crucial for sim-to-real
    domain_rand = {
        "friction": (0.3, 1.5),
        "payload_mass": (0, 5.0),     # kg, simulates a carried object
        "motor_strength": (0.8, 1.2),
        "push_force": (0, 50),        # N, random perturbations
        "latent_noise": 0.05,         # noise on z_t
    }
    
    # Terrain
    terrain = "rough"  # flat, rough, stairs
    
    # Episode
    max_episode_length = 1000  # 20 seconds at 50 Hz
    
    sim_dt = 0.005       # physics 200Hz
    control_dt = 0.02    # policy 50Hz

def train_lmo_policy():
    """Train the LMO policy with PPO in Isaac Lab."""
    
    env = ManagerBasedRLEnv(cfg=LMOEnvCfg())
    
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        n_steps=24,          # short horizon for locomotion
        batch_size=4096,
        n_epochs=5,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,       # exploration
        vf_coef=0.5,
        max_grad_norm=1.0,
        policy_kwargs={
            "net_arch": [256, 256, 128],
            "activation_fn": torch.nn.ELU,
        },
        verbose=1,
    )
    
    # Train for 500M steps, roughly 6-8 hours on a single A100 GPU
    model.learn(total_timesteps=500_000_000)
    model.save("lmo_policy_agibot_x2")
    
    return model
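
The timing and batch numbers in this configuration can be sanity-checked with a few lines of arithmetic. The sketch below is plain Python and does not touch Isaac Lab. Note that num_envs=4096 is a hypothetical value chosen for illustration (the config does not specify it), and the helper names are made up for this example.

```python
def timing_summary(sim_dt, control_dt, max_episode_length):
    """Derive rates and episode duration from the config's time steps."""
    return {
        "decimation": round(control_dt / sim_dt),   # physics steps per policy step
        "physics_hz": round(1.0 / sim_dt),
        "policy_hz": round(1.0 / control_dt),
        "episode_seconds": max_episode_length * control_dt,
    }


def ppo_rollout_summary(n_steps, num_envs, batch_size, total_timesteps):
    """Sizes implied by the PPO settings for a given number of parallel envs."""
    rollout_size = n_steps * num_envs               # samples collected per update
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": rollout_size // batch_size,
        "iterations": total_timesteps // rollout_size,
    }


if __name__ == "__main__":
    print(timing_summary(sim_dt=0.005, control_dt=0.02, max_episode_length=1000))
    # num_envs=4096 is a hypothetical value; the config above does not set it.
    print(ppo_rollout_summary(n_steps=24, num_envs=4096,
                              batch_size=4096, total_timesteps=500_000_000))
```

With these values the policy acts once every 4 physics steps and an episode lasts 20 seconds, matching the comment in the config.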

Domain Randomization for Sim-to-Real

One important detail the paper emphasizes is latent_noise in the domain randomization. At deployment, the VLA encoder runs on a separate GPU with a latency of about 50 ms, so the z_t that the policy sees is delayed and jittery. Training with noise on z_t makes the RL policy robust to this VLA inference jitter.
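
One simple way to implement this randomization is to add zero-mean Gaussian noise to every component of z_t before it enters the observation. The sketch below assumes that latent_noise = 0.05 is the standard deviation of that noise, which the config does not state explicitly. perturb_latent is a hypothetical helper, written in plain Python for clarity rather than as batched tensor code.

```python
import random


def perturb_latent(z_t, noise_std=0.05, rng=None):
    """Return a copy of z_t with i.i.d. Gaussian noise added per component.

    noise_std is assumed to correspond to `latent_noise` in the config.
    """
    rng = rng or random.Random()
    return [z + rng.gauss(0.0, noise_std) for z in z_t]


if __name__ == "__main__":
    rng = random.Random(0)                 # fixed seed for repeatability
    clean = [0.0] * 256                    # stand-in for a 256-dim latent
    noisy = perturb_latent(clean, noise_std=0.05, rng=rng)
    rms = (sum(v * v for v in noisy) / len(noisy)) ** 0.5
    print(f"RMS perturbation: {rms:.3f}")  # should be close to noise_std
```

In a training loop this would be applied at every policy step, so the policy never sees an exactly clean latent.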

End-to-End Deployment Pipeline

Egocentric Camera (30 FPS)
        │
        ▼
VLA Encoder (GPU, ~20Hz)
  - Input: image 224×224 + text command
  - Output: z_t (256-dim latent vector)
        │
        ▼ z_t via shared memory
LMO RL Policy (CPU, 50Hz)
  - Input: z_t + proprioception + IMU
  - Output: 12 joint position targets
        │
        ▼
PD Controller (1kHz)
  - Low-level joint tracking
        │
        ▼
Robot Motors (AgiBot X2)

The two tiers run at different rates: the VLA at about 20 Hz (limited by GPU inference) and the RL policy at 50 Hz (interpolating z_t between VLA updates). This is a practical design: GPU inference for a VLM cannot reach 50 Hz, but locomotion needs fast feedback.
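
The interpolation step can be sketched as follows. This is one possible scheme, not necessarily the one used in the paper: since the next VLA output is not available yet, the policy blends linearly from the previous latent to the most recent one over a single VLA period (50 ms at 20 Hz). LatentInterpolator and its method names are hypothetical.

```python
class LatentInterpolator:
    """Blend from the previous VLA latent to the newest one over one VLA period."""

    def __init__(self, vla_period=0.05):
        self.vla_period = vla_period  # seconds between VLA outputs (~20 Hz)
        self._prev = None
        self._latest = None
        self._t_latest = 0.0

    def update(self, z_new, t):
        """Register a new VLA output z_new that arrived at time t (seconds)."""
        z_new = list(z_new)
        # On the first update there is nothing to blend from, so start at z_new.
        self._prev = self._latest if self._latest is not None else z_new
        self._latest = z_new
        self._t_latest = t

    def query(self, t):
        """Return the latent the 50 Hz policy should use at time t."""
        if self._latest is None:
            raise RuntimeError("no VLA output received yet")
        alpha = (t - self._t_latest) / self.vla_period
        alpha = min(max(alpha, 0.0), 1.0)  # after one period, hold the newest latent
        return [(1.0 - alpha) * p + alpha * n
                for p, n in zip(self._prev, self._latest)]


if __name__ == "__main__":
    interp = LatentInterpolator(vla_period=0.05)
    interp.update([0.0, 0.0], t=0.0)
    interp.update([1.0, 2.0], t=0.05)
    for step in range(3, 6):  # 50 Hz policy ticks at 0.06, 0.08, 0.10 s
        print(interp.query(step * 0.02))
```

Blending toward the newest latent avoids step changes in the policy input at the cost of up to one VLA period of extra delay; simply holding the last value is the simpler alternative.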

Experimental results: a 21.3% improvement

The paper evaluates on the AgiBot X2, a bipedal humanoid from AGIBOT (China), about 1.5 m tall with 23 DoF.

Task	Baseline (Decoupled)	WholeBodyVLA	Improvement
Pick from table	62%	78%	+16%
Place on shelf	45%	72%	+27%
Carry while walking	38%	65%	+27%
Push heavy object	41%	55%	+14%
Average	46.5%	67.5%	+21.3%

Notably, the tasks that demand the most coordination (carry while walking, place on shelf) show the largest improvements, exactly where the decoupled approach breaks down.
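
The per-task gains can be recomputed directly from the table. The snippet below is a small arithmetic check in plain Python; the dictionary simply transcribes the four task rows, and the function names are chosen here for illustration.

```python
# (baseline %, WholeBodyVLA %) transcribed from the results table
RESULTS = {
    "Pick from table": (62, 78),
    "Place on shelf": (45, 72),
    "Carry while walking": (38, 65),
    "Push heavy object": (41, 55),
}


def gains(results):
    """Absolute gain in percentage points for each task."""
    return {task: ours - base for task, (base, ours) in results.items()}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    per_task = gains(RESULTS)
    for task, gain in sorted(per_task.items(), key=lambda kv: -kv[1]):
        print(f"{task}: +{gain} points")
    print("mean baseline:", mean(b for b, _ in RESULTS.values()))
    print("mean WholeBodyVLA:", mean(o for _, o in RESULTS.values()))
    print("mean gain:", mean(per_task.values()))
```

For these four rows the mean gain comes out to 21.0 percentage points (46.5% to 67.5%); the 21.3% headline figure is quoted here as reported.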

Ablation: how much does each component contribute?

Configuration	Success Rate
Full WholeBodyVLA	67.5%
Without latent action learning (explicit actions instead)	58.2%
Without LMO reward (standard locomotion instead)	52.1%
Without egocentric video pre-training	55.8%
Without domain randomization on z_t	61.3%

The LMO reward contributes the most (a drop of 15.4 points when it is removed), confirming that manipulation-aware locomotion, rather than the VLA architecture, is the key innovation.
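
Ranking the ablations by how far each one falls below the full model makes the comparison explicit. This is a plain-Python sketch over the numbers in the ablation table; the short component labels are shorthand introduced here.

```python
FULL_MODEL = 67.5  # success rate (%) of the full WholeBodyVLA

# Success rate (%) when the named component is removed
ABLATIONS = {
    "latent action learning": 58.2,
    "LMO reward": 52.1,
    "egocentric video pre-training": 55.8,
    "domain randomization on z_t": 61.3,
}


def ranked_drops(full, ablations):
    """Return (component, drop in points) pairs, largest drop first."""
    drops = {name: round(full - rate, 1) for name, rate in ablations.items()}
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, drop in ranked_drops(FULL_MODEL, ABLATIONS):
        print(f"without {name}: -{drop} points")
```

By this ranking, removing the LMO reward costs the most, followed by video pre-training, latent action learning, and latent noise.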

Comparison with other methods

Method	Whole-body?	Learns from video?	Real robot?	Loco-manip?
RT-2	Arms only	No	Yes	No
OpenVLA	Arms only	No	Yes	No
HumanPlus	Yes	Yes (RGB)	Yes	Limited
OKAMI	Arms only	Yes (teleoperation)	Yes	No
WholeBodyVLA	Yes	Yes (egocentric)	Yes	Yes

According to this comparison, WholeBodyVLA is the first paper to meet all four criteria at once.

Limitations and future directions

The paper acknowledges several limitations:

Latency: the VLA encoder's ~50 ms inference time is the bottleneck; model distillation or quantization is needed
Bimanual: only single-arm manipulation is tested, with no dual-arm coordination yet
Terrain: only flat ground is tested, with no stairs or uneven terrain
Long-horizon: every task in the paper lasts under 30 seconds; multi-step planning is untested

Promising directions:

Combining with Diffusion Policy for multimodal action distributions
Extending to bimanual loco-manipulation
Integrating 3D spatial understanding from SpatialVLA

Takeaways for the Vietnamese robotics community

1. Latent actions are a paradigm shift

Instead of collecting expensive robot data, we can learn manipulation from human video. This matters especially for labs in Vietnam, where robot hardware is limited but video data is plentiful.

2. Locomotion and manipulation MUST be coupled

Anyone building a humanoid robot should move away from the traditional decoupled architecture. The cost of coupling (more complex training) is far smaller than the benefit (a 21.3% improvement).

3. Isaac Lab is the standard tool for RL robotics

The paper uses NVIDIA Isaac Lab for training: it is open-source, free, and runs on consumer GPUs. It is currently the best entry point for RL robotics research.

References

WholeBodyVLA: Towards Whole-Body Loco-Manipulation through Vision-Language-Action Model. Haoran Jiang, Jin Chen et al., ICLR 2026
GitHub: OpenDriveLab/WholebodyVLA
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brohan et al., 2023
HumanPlus: Humanoid Shadowing and Imitation from Humans. Fu et al., 2024
OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation. Li et al., 2024