WholebodyVLA Open-Source: Architecture & Code Guide

From paper to code: WholebodyVLA opens up to the community

If you have read the earlier research analysis of WholebodyVLA, you know it is presented as the first paper to fully address whole-body loco-manipulation for humanoid robots: the arms manipulate objects while the legs keep balance and move, all driven by one unified system.

The OpenDriveLab/WholebodyVLA repo is now public on GitHub (ICLR 2026). This article digs into the technical architecture, the training pipeline, and how you can build a similar system yourself, even without a real robot on hand.

[Image: a humanoid robot arm in an AI research lab]

The core problem: why does the decoupled approach fail?

Before diving into the code, it helps to understand why WholebodyVLA needs to exist.

Most current humanoid systems are split into two separate modules:

Manipulation controller: drives the arms, usually a VLA or an imitation-learning policy
Locomotion controller: drives the legs, usually an RL policy

The two modules run in parallel, each optimizing its own objective. The problem: when the robot reaches to the right to pick something up, its center of mass shifts, but the locomotion controller does not know, because it receives no information from manipulation. The result: the robot falls.

Imagine one person steering your arms by remote control while another steers your legs, and the two never talk to each other. You would fall the first time you tried to pick something up off the floor.

WholebodyVLA addresses this by sharing one latent space between the two tiers: the arms "tell" the legs what they are about to do, and the legs adjust accordingly.

Overall architecture: three tiers, one flow

WholebodyVLA has 3 main components, stacked in order:

┌──────────────────────────────────────────────────────────┐
│  TIER 1: Latent Action Model (LAM)                       │
│  - Pre-trained on egocentric video (Ego4D, YouTube)      │
│  - Input: pair of consecutive frames (o_t, o_{t+1})      │
│  - Output: latent action z_t (defines the latent space)  │
│  - Inverse dynamics + forward dynamics learning          │
└────────────────────────┬─────────────────────────────────┘
                         │ learned latent space
                         ▼
┌──────────────────────────────────────────────────────────┐
│  TIER 2: VLA Policy (Upper Body)                         │
│  - Fine-tune a VLM on robot demonstration data           │
│  - Input: egocentric image + language command            │
│  - Output: z_t (latent manipulation intent)              │
│  - Decode z_t → arm joint commands (via action decoder)  │
└────────────────────────┬─────────────────────────────────┘
                         │ z_t passed down
                         ▼
┌──────────────────────────────────────────────────────────┐
│  TIER 3: LMO RL Policy (Lower Body)                      │
│  - Trained in Isaac Lab with PPO                         │
│  - Input: z_t + proprioception + IMU                     │
│  - Output: leg joint targets (tracked by a PD loop)      │
│  - Knows z_t → knows manipulation intent → adapts stance │
└──────────────────────────────────────────────────────────┘

The key point: z_t is the bridge. It is not a concrete action ("rotate the shoulder joint by 15 degrees") but an abstract intent ("reaching the right arm forward"). The locomotion policy receives z_t and infers the rest on its own: "the arm is reaching far, so lower the center of mass and widen the stance".

Tier 1: Latent Action Model, learning from free video

The robot data problem

Robot data with labeled actions is extremely scarce and expensive. Every episode needs teleoperation: a person drives the robot through a VR controller for several minutes. Collecting 100K episodes can take months and cost hundreds of thousands of USD.

Meanwhile, egocentric video (filmed from a first-person viewpoint) is available by the millions of hours online. The Ego4D dataset alone has about 3,670 hours of video. The catch: these videos contain only images, with no action labels. Nobody annotates a YouTube video with "the hand is rotating 30 degrees here".

Inverse dynamics: turning video into "pseudo-actions"

The idea behind LAM is elegant: if frame t and frame t+1 differ, some action must have happened in between. The model does not need to know the concrete action; it only needs to learn a latent representation of that change.

import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """
    Latent Action Model (LAM), a core component of WholebodyVLA.
    Learns a latent action representation from pairs of video frames,
    WITHOUT action labels.
    
    Training objective:
    1. Inverse dynamics: (o_t, o_{t+1}) → z_t 
       "The frame changed like this, so the action must have been this"
    2. Forward dynamics: (o_t, z_t) → o_{t+1}_pred
       "If the action is z_t, what does the next frame look like"
    """
    def __init__(self, vision_dim=768, latent_dim=256):
        super().__init__()
        self.latent_dim = latent_dim
        
        # Inverse dynamics: two frames → latent action
        self.inverse_head = nn.Sequential(
            nn.Linear(vision_dim * 2, 1024),
            nn.LayerNorm(1024),
            nn.GELU(),
            nn.Linear(1024, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        
        # Forward dynamics: frame + action → next frame prediction
        self.forward_head = nn.Sequential(
            nn.Linear(vision_dim + latent_dim, 1024),
            nn.LayerNorm(1024),
            nn.GELU(),
            nn.Linear(1024, 512),
            nn.GELU(),
            nn.Linear(512, vision_dim),
        )
        
        # VQ-VAE-style discretization of the latent space
        # Keeps z_t compact and easier to predict
        self.codebook_size = 512
        self.codebook = nn.Embedding(self.codebook_size, latent_dim)
    
    def quantize(self, z_continuous):
        """Vector quantization: map continuous z to its nearest codebook entry."""
        # Pairwise L2 distance from each z to every codebook entry
        distances = torch.cdist(z_continuous, self.codebook.weight)  # [B, K]
        
        # Pick the nearest neighbor
        indices = distances.argmin(dim=-1)  # [B]
        z_quantized = self.codebook(indices)  # [B, D]
        
        # Straight-through estimator: the forward pass uses the codebook vector,
        # the backward pass copies the gradient onto z_continuous
        z_quantized = z_continuous + (z_quantized - z_continuous).detach()
        return z_quantized, indices
    
    def encode_action(self, feat_t, feat_t1):
        """Features of two frames → latent action."""
        combined = torch.cat([feat_t, feat_t1], dim=-1)
        z_continuous = self.inverse_head(combined)
        z_quantized, indices = self.quantize(z_continuous)
        return z_quantized, z_continuous, indices
    
    def predict_next(self, feat_t, z_t):
        """Current frame + action → predicted next frame."""
        combined = torch.cat([feat_t, z_t], dim=-1)
        return self.forward_head(combined)
    
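    def compute_loss(self, feat_t, feat_t1):
        """Training loss = forward_loss + codebook_loss + 0.25 * commitment_loss.

        There is no separate inverse-dynamics term: without action labels the
        inverse head is trained only through the forward-dynamics loss.
        """
        z_q, z_c, indices = self.encode_action(feat_t, feat_t1)
        # Raw codebook vectors. z_q is the straight-through version, which
        # passes no gradient to the codebook, so the codebook needs its own term.
        codes = self.codebook(indices)
        
        # Forward dynamics loss
        feat_t1_pred = self.predict_next(feat_t, z_q)
        forward_loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        
        # Codebook loss (standard VQ-VAE term): pull code words toward the encoder output
        codebook_loss = nn.functional.mse_loss(codes, z_c.detach())
        
        # Commitment loss: keep the encoder output close to its code word
        commitment_loss = nn.functional.mse_loss(z_c, codes.detach())
        
        return forward_loss + codebook_loss + 0.25 * commitment_loss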

Why add vector quantization? A continuous latent space can "sprawl", with actions differing from one another only slightly. Vector quantization (as in VQ-VAE) forces z_t onto one of K code words, like compressing every action into one of 512 basic action types. That makes prediction easier for the VLA in tier 2: it picks 1 of 512 tokens instead of regressing a continuous vector.
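
To make the "1 of K code words" idea concrete, here is a minimal pure-Python sketch of the nearest-codebook lookup. The 2-D toy codebook with 4 entries and the helper name `nearest_code` are illustrative assumptions, not values from the paper or the repo.

```python
import math

# Toy codebook: 4 code words in a 2-D latent space (illustrative values only)
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def nearest_code(z, codebook=CODEBOOK):
    """Return (index, code word) of the codebook entry closest to z in L2 distance."""
    distances = [math.dist(z, code) for code in codebook]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]

# Two slightly different continuous latents collapse onto the same code word
idx_a, code_a = nearest_code((0.9, 0.1))
idx_b, code_b = nearest_code((1.1, -0.2))
print(idx_a, idx_b)  # both map to index 1
```

Once every latent is an index, a downstream model can treat action prediction as classification over K classes rather than regression.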

Data pipeline for LAM

# Pseudo-code for the data pipeline
import random

from torch.utils.data import Dataset
from torchvision import transforms

class EgocentricVideoDataset(Dataset):
    """
    Dataset built from Ego4D or similar egocentric video.
    Each sample is a pair of frames that are frame_gap frames apart.
    """
    def __init__(self, video_paths, frame_gap=2):
        self.videos = video_paths
        self.frame_gap = frame_gap  # distance between the two frames
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            ),
        ])
    
    def __len__(self):
        return len(self.videos)
    
    def load_video(self, idx):
        # Placeholder: decode self.videos[idx] into a sequence of PIL images
        raise NotImplementedError
    
    def __getitem__(self, idx):
        # Take two frames that are frame_gap apart
        video = self.load_video(idx)
        # randint is inclusive, so t + frame_gap never runs past the last frame
        t = random.randint(0, len(video) - self.frame_gap - 1)
        
        frame_t = self.transform(video[t])
        frame_t1 = self.transform(video[t + self.frame_gap])
        
        return frame_t, frame_t1

An important detail: frame_gap=2 rather than 1. Why? Consecutive frames in 30 FPS video are nearly identical; the difference is so small that the model learns an identity mapping instead of a real action. A gap of 2-4 frames makes the action much more visible.
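
The effect of frame_gap is easy to quantify. The sketch below converts a gap into elapsed time at 30 FPS and samples a valid index pair the same way the dataset does. The helper names `gap_seconds` and `sample_pair` are hypothetical and exist only for this illustration.

```python
import random

def gap_seconds(frame_gap, fps=30.0):
    """Time elapsed between the two sampled frames."""
    return frame_gap / fps

def sample_pair(num_frames, frame_gap, rng=random):
    """Pick (t, t + frame_gap) so that both indices are inside the clip."""
    if num_frames <= frame_gap:
        raise ValueError("clip is shorter than frame_gap")
    # randint is inclusive on both ends
    t = rng.randint(0, num_frames - frame_gap - 1)
    return t, t + frame_gap

for gap in (1, 2, 4):
    print(f"frame_gap={gap}: {gap_seconds(gap) * 1000:.1f} ms between frames")

t, t1 = sample_pair(num_frames=300, frame_gap=2, rng=random.Random(0))
```

At 30 FPS a gap of 1 is about 33 ms of motion, while a gap of 4 is about 133 ms, which is why the larger gaps carry a clearer action signal.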

Tier 2: VLA Policy, from images and language to actions

Once LAM has established the latent space, tier 2 fine-tunes a Vision-Language Model to predict latent actions from an egocentric image plus a language instruction.

VLA architecture

class WholebodyVLAPolicy(nn.Module):
    """
    VLA policy for upper-body manipulation.
    
    Differences from conventional VLAs (RT-2, OpenVLA):
    - Outputs latent tokens instead of explicit joint commands
    - Latent space pre-trained on video (not only robot data)
    - Dual-arm coordination through a shared latent
    """
    def __init__(
        self,
        vlm_backbone,       # pretrained VLM (e.g., Qwen2-VL)
        lam: LatentActionModel,
        num_action_tokens=8, # number of latent tokens per timestep
        action_dim=14,       # 7 joints × 2 arms
    ):
        super().__init__()
        self.vlm = vlm_backbone
        self.lam = lam
        hidden_dim = vlm_backbone.config.hidden_size
        
        # Project VLM output → latent action tokens
        self.latent_projector = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.GELU(),
            nn.Linear(512, lam.latent_dim * num_action_tokens),
        )
        
        # Action decoder: latent tokens → explicit joint commands
        # Only needed to produce actual motor commands for the upper body
        self.action_decoder = nn.Sequential(
            nn.Linear(lam.latent_dim * num_action_tokens, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),  # delta joint positions
        )
        
        self.num_action_tokens = num_action_tokens
        
        # Freeze the whole LAM (both heads and the codebook): the latent space is already stable
        for param in self.lam.parameters():
            param.requires_grad = False
    
    def forward(self, image, language_tokens):
        """
        Input: egocentric image + tokenized language instruction
        Output: 
          - z_t: latent intent vector (passed down to the LMO RL policy)
          - arm_actions: explicit joint commands for the upper body
        """
        # VLM encodes image + language. Assumes the backbone output exposes
        # last_hidden_state (a base model rather than an LM-head wrapper).
        vlm_out = self.vlm(
            pixel_values=image,
            input_ids=language_tokens,
        ).last_hidden_state[:, -1, :]  # [B, hidden_dim]
        
        # Project into the latent action space
        z_flat = self.latent_projector(vlm_out)  # [B, latent_dim * N]
        B = z_flat.shape[0]
        z_tokens = z_flat.view(B, self.num_action_tokens, -1)
        
        # Aggregate z_t, passed down to locomotion
        z_t = z_tokens.mean(dim=1)  # [B, latent_dim]
        
        # Decode into arm joint commands
        arm_actions = self.action_decoder(z_flat)  # [B, action_dim]
        
        return z_t, arm_actions

Dual-arm coordination

One subtle point: num_action_tokens=8. The model predicts several tokens per timestep, not just one. Why?

In dual-arm manipulation each arm needs its own trajectory, yet the two must stay coordinated. Eight tokens let the model encode:

Tokens 1-4: trajectory for the left arm
Tokens 5-8: trajectory for the right arm
Attention between tokens, so each arm "knows" what the other is doing

Note that the simplified sketch above produces all eight tokens from a single MLP projector and has no token-level attention layer of its own. This also differs from the earlier research review: the paper does support dual-arm control; the demos simply have not yet shown full bimanual tasks.
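
The token layout described above can be sketched without any framework. This is a toy illustration with a latent size of 2 instead of 256; the left/right split follows the description in this article, and the function names are hypothetical.

```python
def split_tokens(z_flat, num_tokens=8):
    """Reshape a flat vector into num_tokens latent tokens of equal size."""
    if len(z_flat) % num_tokens != 0:
        raise ValueError("flat vector length must be divisible by num_tokens")
    dim = len(z_flat) // num_tokens
    return [z_flat[i * dim:(i + 1) * dim] for i in range(num_tokens)]

def mean_pool(tokens):
    """Average the tokens element-wise, like z_tokens.mean(dim=1) for one sample."""
    n = len(tokens)
    return [sum(values) / n for values in zip(*tokens)]

z_flat = [float(i) for i in range(16)]        # 8 tokens x 2 dims
tokens = split_tokens(z_flat)
left_arm, right_arm = tokens[:4], tokens[4:]  # tokens 1-4 and 5-8
z_t = mean_pool(tokens)                       # single intent vector for the legs
print(len(left_arm), len(right_arm), z_t)
```

Mean pooling discards which arm a token belonged to, so the legs only see an aggregate intent; whether that is enough in practice is a design question this sketch does not answer.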

Tier 3: LMO RL Policy, legs that know what the arms are doing

This is the part I consider the paper's biggest contribution, and also where the open-source code will be most valuable to the community.

[Image: a humanoid robot moving through an open space]

What is LMO and why is it different?

The Loco-Manipulation-Oriented (LMO) RL policy is a locomotion controller trained with the awareness that the robot's upper body is manipulating objects. Unlike a conventional locomotion policy (which only knows "walk straight" or "turn left"), LMO receives z_t from the VLA and adjusts:

Stance width: widen the stance when manipulation gets complex
CoM compensation: shift the center of mass opposite to the arm
Foot placement: put the feet where stability is best
Gait adaptation: switch from walking to standing still when precise manipulation is needed

Observation space design

from dataclasses import dataclass
import numpy as np

@dataclass
class LMOObservation:
    """
    Observation space for the LMO RL policy.
    Three groups of information: proprioception, exteroception, and manipulation context.
    """
    # === Proprioception ===
    joint_positions: np.ndarray    # [12]: 6 joints × 2 legs
    joint_velocities: np.ndarray   # [12]
    base_orientation: np.ndarray   # [4]: quaternion
    base_angular_vel: np.ndarray   # [3]: gyroscope
    base_linear_acc: np.ndarray    # [3]: accelerometer
    
    # === Exteroception ===
    foot_contacts: np.ndarray      # [4]: 2 feet × (heel, toe)
    
    # === Manipulation context (KEY INNOVATION) ===
    latent_action: np.ndarray      # [256]: z_t from the VLA encoder
    # z_t encodes manipulation intent, target location, force estimate
    # LMO uses it to ANTICIPATE how much compensation is needed
    
    @property
    def as_vector(self) -> np.ndarray:
        """Flatten into a single vector for the MLP policy."""
        return np.concatenate([
            self.joint_positions,       # 12
            self.joint_velocities,      # 12
            self.base_orientation,      # 4
            self.base_angular_vel,      # 3
            self.base_linear_acc,       # 3
            self.foot_contacts,         # 4
            self.latent_action,         # 256
        ])  # Total: 294 dimensions

    @property
    def dim(self) -> int:
        return 294

Note the dimensions: 256 of the 294 dims come from z_t. The latent action takes up about 87% of the observation space, which shows how much weight manipulation context carries for locomotion.
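
The 294 and 87% figures follow directly from the field sizes in LMOObservation. A small check, with the dictionary name `OBS_DIMS` chosen only for this example:

```python
# Observation layout taken from the LMOObservation fields above
OBS_DIMS = {
    "joint_positions": 12,
    "joint_velocities": 12,
    "base_orientation": 4,
    "base_angular_vel": 3,
    "base_linear_acc": 3,
    "foot_contacts": 4,
    "latent_action": 256,
}

total_dim = sum(OBS_DIMS.values())
latent_share = OBS_DIMS["latent_action"] / total_dim

print(total_dim)              # 294
print(f"{latent_share:.1%}")  # 87.1%
```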

Reward function in detail

Reward design is where "art" meets "science" in RL. The LMO reward combines several components that balance each other:

import torch

class LMORewardFunction:
    """
    Multi-component reward for the LMO policy.
    
    Design philosophy:
    - The locomotion policy does NOT optimize for movement alone
    - It must also optimize for PROVIDING A STABLE BASE
      so that manipulation succeeds
    - z_t lets locomotion "look ahead" at the manipulation intent
    """
    def __init__(self):
        # Weight of each component (they sum to 1.0), tuned on the AgiBot X2
        self.weights = {
            'orientation': 0.25,      # keep the torso upright
            'angular_velocity': 0.10, # no wobbling
            'linear_velocity': 0.10,  # track velocity command
            'foot_contact': 0.05,     # maintain contact pattern
            'manipulation_support': 0.20,  # SUPPORT the manipulating arms
            'com_stability': 0.15,    # center of mass inside support polygon
            'energy': 0.05,           # save energy
            'smoothness': 0.10,       # smooth, jerk-free actions
        }
    
    def compute(self, state, z_t, action, prev_action):
        rewards = {}
        
        # 1. Orientation: penalize excessive tilt
        roll, pitch = state['euler'][:2]
        rewards['orientation'] = torch.exp(
            -5.0 * (roll**2 + pitch**2)
        )
        
        # 2. Angular velocity: penalize fast rotation
        rewards['angular_velocity'] = torch.exp(
            -2.0 * torch.norm(state['angular_vel'])
        )
        
        # 3. Velocity tracking: follow locomotion command
        vel_error = state['linear_vel'][:2] - state['cmd_vel'][:2]
        rewards['linear_velocity'] = torch.exp(
            -4.0 * torch.norm(vel_error)
        )
        
        # 4. Foot contact: at least one foot must stay on the ground
        rewards['foot_contact'] = (
            state['foot_contacts'].sum() >= 1
        ).float()
        
        # 5. MANIPULATION SUPPORT (key innovation)
        # Assumes the magnitude of z_t correlates with manipulation complexity
        z_magnitude = torch.norm(z_t)
        
        # Complex manipulation → wider stance (_compute_stance_width is a helper not shown here)
        stance_width = self._compute_stance_width(state)
        target_width = 0.3 + 0.15 * torch.tanh(z_magnitude / 10.0)
        rewards['manipulation_support'] = torch.exp(
            -3.0 * (stance_width - target_width)**2
        )
        
        # 6. CoM inside the support polygon (both helper methods are not shown here)
        com_pos = state['com_position'][:2]  # xy
        support = self._compute_support_polygon(state)
        com_margin = self._point_to_polygon_distance(com_pos, support)
        rewards['com_stability'] = torch.clamp(com_margin, 0, 1)
        
        # 7. Energy efficiency
        rewards['energy'] = torch.exp(
            -0.01 * torch.sum(action**2)
        )
        
        # 8. Action smoothness
        action_diff = action - prev_action
        rewards['smoothness'] = torch.exp(
            -5.0 * torch.sum(action_diff**2)
        )
        
        # Weighted sum
        total = sum(
            self.weights[k] * rewards[k] for k in self.weights
        )
        return total, rewards
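
The manipulation_support term is easier to read as a curve. The sketch below reproduces the target-width formula and the reward shape from the class above in plain Python. Treating the widths as metres is an assumption (the source gives no unit), and the function names are hypothetical.

```python
import math

def target_stance_width(z_norm, base=0.30, extra=0.15, scale=10.0):
    """Target stance width (assumed metres): base + extra * tanh(|z_t| / scale)."""
    return base + extra * math.tanh(z_norm / scale)

def stance_reward(stance_width, z_norm, sharpness=3.0):
    """Gaussian-shaped reward that peaks when the stance matches the target."""
    error = stance_width - target_stance_width(z_norm)
    return math.exp(-sharpness * error ** 2)

for z_norm in (0.0, 5.0, 10.0, 30.0):
    print(f"|z_t|={z_norm:5.1f} -> target {target_stance_width(z_norm):.3f}")
```

Because tanh saturates, the target stays between 0.30 and 0.45 no matter how large z_t gets, so an unusually large latent cannot demand an impossibly wide stance.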

Key insight from the reward design

Look at the weights: manipulation_support (0.20) + com_stability (0.15) = 35% of the reward is tied directly to supporting manipulation. That is a large share, equal to pure locomotion (orientation 0.25 + linear_velocity 0.10 = 35%).

In other words, the LMO policy puts as much reward weight on building a stable base for the arms as on walking itself, with the remaining 30% going to regularizing terms (angular velocity, foot contact, energy, smoothness). This is a plausible reason for its 21.3% gain over the decoupled approach: a conventional locomotion policy spends 100% on walking and 0% on manipulation support.
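
The shares quoted above can be checked with a few lines. The weights are copied from LMORewardFunction; the grouping into "manipulation support", "pure locomotion", and "regularizers" is this article's reading of the terms, not a split defined by the paper.

```python
WEIGHTS = {
    "orientation": 0.25, "angular_velocity": 0.10, "linear_velocity": 0.10,
    "foot_contact": 0.05, "manipulation_support": 0.20, "com_stability": 0.15,
    "energy": 0.05, "smoothness": 0.10,
}
GROUPS = {
    "manipulation support": ("manipulation_support", "com_stability"),
    "pure locomotion": ("orientation", "linear_velocity"),
}

def share(keys):
    """Fraction of the total reward weight carried by the given components."""
    return sum(WEIGHTS[k] for k in keys) / sum(WEIGHTS.values())

shares = {name: share(keys) for name, keys in GROUPS.items()}
shares["regularizers"] = 1.0 - sum(shares.values())
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
```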

Training pipeline: from video to a real robot

Stage 1: Pre-train LAM (offline, no robot needed)

# Pseudo training script for LAM
# Data: Ego4D dataset (~3,670 hours of egocentric video)
# Hardware: 4× A100 80GB, ~3 days

python train_lam.py \
    --data_path /data/ego4d/v2/ \
    --vision_encoder vit-base-patch16 \
    --latent_dim 256 \
    --codebook_size 512 \
    --frame_gap 2 \
    --batch_size 256 \
    --lr 1e-4 \
    --epochs 50 \
    --num_gpus 4

Stage 2: Collect robot demonstration data

WholebodyVLA proposes an efficient data collection pipeline:

Teleoperation on the AgiBot X2: an operator drives the arms through a VR controller
Record: egocentric image + joint states + language annotation
Augmentation: add camera noise and lighting variation

The paper collects ~500 episodes per task, far fewer than RT-2 (~130K), thanks to the pre-trained latent space. The RT-2 figure is a total across tasks, so the comparison is only indicative.

Stage 3: Fine-tune the VLA policy

# Fine-tune the VLM backbone on robot data
# The latent space from LAM stays frozen
python train_vla.py \
    --vlm_backbone qwen2-vl-2b \
    --lam_checkpoint checkpoints/lam_ego4d.pt \
    --robot_data /data/agibot_x2/demonstrations/ \
    --freeze_lam \
    --batch_size 32 \
    --lr 5e-5 \
    --epochs 20 \
    --num_action_tokens 8

Stage 4: Train the LMO RL policy in Isaac Lab

# Isaac Lab environment config for LMO training
# Trains on 4096 parallel environments

# Isaac Lab 1.x namespace; Isaac Lab 2.x renamed the package to `isaaclab`
from omni.isaac.lab.envs import ManagerBasedRLEnv

class LMOTrainingConfig:
    """Hyperparameters for LMO RL training."""
    
    # Environment
    num_envs = 4096          # parallel environments
    episode_length = 1000    # 20 seconds @ 50Hz
    
    # Physics
    sim_dt = 0.005           # 200Hz physics
    control_dt = 0.02        # 50Hz policy
    
    # PPO hyperparameters
    ppo_config = {
        'learning_rate': 3e-4,
        'n_steps': 24,
        'batch_size': num_envs * 24,
        'n_epochs': 5,
        'gamma': 0.99,
        'gae_lambda': 0.95,
        'clip_range': 0.2,
        'ent_coef': 0.01,
        'vf_coef': 0.5,
        'max_grad_norm': 1.0,
    }
    
    # Policy architecture
    policy_net = [256, 256, 128]  # MLP layers
    activation = 'elu'
    
    # Domain randomization (crucial for sim-to-real)
    domain_rand = {
        'friction_range': (0.3, 1.5),
        'payload_mass_range': (0, 5.0),    # kg
        'motor_strength_range': (0.8, 1.2),
        'push_force_range': (0, 50),       # N
        'latent_noise_std': 0.05,          # noise on z_t
        'terrain_roughness': (0, 0.05),    # m
        'com_displacement': (0, 0.05),     # m
    }
    
    # Total training
    total_timesteps = 500_000_000  # ~6-8 hours on 1× A100
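
A few derived quantities make the config above easier to sanity-check. The values are copied from LMOTrainingConfig; the derived names (`decimation`, `rollout_batch`, `num_updates`) are introduced here only for illustration.

```python
num_envs = 4096
n_steps = 24
sim_dt = 0.005          # s, physics step
control_dt = 0.02       # s, policy step
episode_length = 1000   # policy steps
total_timesteps = 500_000_000

decimation = round(control_dt / sim_dt)         # physics steps per policy step
episode_seconds = episode_length * control_dt   # simulated seconds per episode
rollout_batch = num_envs * n_steps              # samples collected per PPO update
num_updates = total_timesteps // rollout_batch  # PPO updates to reach the budget

print(decimation, episode_seconds, rollout_batch, num_updates)
```

So each policy action is held for 4 physics steps, an episode covers 20 simulated seconds, and the 500M-step budget corresponds to roughly five thousand PPO updates.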

An important detail: latent_noise_std. On the real robot, the VLA encoder runs at ~20Hz on the GPU while the RL policy runs at 50Hz on the CPU. Between VLA updates, z_t is interpolated and noisy. Training with noise simulates this: a form of domain randomization for the inference pipeline, not just for the physics.
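
A rough sketch of what this could look like during training, under two stated simplifications: the noise is plain zero-mean Gaussian with the configured std, and the rate mismatch is modelled as a zero-order hold rather than interpolation. Both helper functions are hypothetical and not taken from the repo.

```python
import random

def noisy_latent(z, noise_std=0.05, rng=random):
    """Add zero-mean Gaussian noise to every component of z_t."""
    return [value + rng.gauss(0.0, noise_std) for value in z]

def held_latents(z_updates, policy_hz=50, vla_hz=20, num_steps=10):
    """Zero-order hold: at each policy step, reuse the latest VLA output."""
    out = []
    for step in range(num_steps):
        # Index of the most recent VLA update at this policy step
        k = min(step * vla_hz // policy_hz, len(z_updates) - 1)
        out.append(z_updates[k])
    return out

rng = random.Random(0)
z_clean = [0.0, 1.0, -1.0]
z_train = noisy_latent(z_clean, rng=rng)

# 50Hz policy vs 20Hz VLA: each z_t is reused for 2 or 3 policy steps
schedule = held_latents(["z0", "z1", "z2", "z3"])
print(schedule)
```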

Deploy: the inference pipeline in practice

On the real robot the system runs at three different control rates (VLA at ~20Hz, LMO at 50Hz, PD control at 1kHz):

Camera (30 FPS) ─── image ──→ VLA Encoder (~20Hz, GPU)
                                    │
                                    z_t (async update)
                                    │
Proprioception (1kHz) ─── obs ──→ LMO Policy (50Hz, CPU)
                                    │
                                    joint targets
                                    │
                            PD Controller (1kHz, embedded)
                                    │
                                    motor torques ──→ Motors

Handling the frequency mismatch

import threading
import time
import numpy as np

class WholebodyVLARuntime:
    """
    Runtime inference for WholebodyVLA.
    Handles the frequency mismatch between VLA (20Hz) and RL (50Hz).
    """
    def __init__(self, vla_model, lmo_model):
        self.vla = vla_model
        self.lmo = lmo_model
        self.running = True  # set to False to stop both loops
        
        # Shared state, guarded by self._lock
        self._z_current = np.zeros(256)
        self._z_prev = np.zeros(256)
        self._z_timestamp = 0.0
        self._lock = threading.Lock()
    
    def vla_loop(self, camera, language_cmd):
        """
        VLA inference loop, runs on the GPU thread.
        ~20Hz; each iteration publishes a new z_t.
        """
        while self.running:
            image = camera.get_frame()
            
            # VLA inference: ~50ms on an A100
            z_new, arm_actions = self.vla.predict(image, language_cmd)
            
            with self._lock:
                self._z_prev = self._z_current.copy()
                self._z_current = np.asarray(z_new)
                self._z_timestamp = time.time()
            
            # Send arm commands directly, bypassing RL
            # (send_arm_commands is a robot-specific hook not shown here)
            self.send_arm_commands(arm_actions)
    
    def lmo_loop(self, robot):
        """
        LMO RL inference loop, runs on the CPU thread.
        50Hz; interpolates z_t between VLA updates.
        """
        dt = 0.02  # 50Hz
        
        while self.running:
            start = time.time()
            
            # Read proprioception from the robot (IMU, joint encoders)
            proprio = robot.get_proprioception()
            
            # Interpolate z_t for smooth control
            with self._lock:
                age = time.time() - self._z_timestamp
                # Linear interpolation based on the age of the latest z_t
                alpha = min(age / 0.05, 1.0)  # 50ms VLA period
                z_interp = (1 - alpha) * self._z_prev + alpha * self._z_current
            
            # RL inference: <1ms on CPU
            obs = np.concatenate([
                proprio['joint_pos'],
                proprio['joint_vel'],
                proprio['base_quat'],
                proprio['gyro'],
                proprio['accel'],
                proprio['foot_contacts'],
                z_interp,
            ])
            
            leg_actions = self.lmo.predict(obs)
            robot.set_leg_targets(leg_actions)
            
            # Maintain 50Hz
            elapsed = time.time() - start
            if elapsed < dt:
                time.sleep(dt - elapsed)
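
The interpolation step in lmo_loop can be isolated and tried on toy values. This sketch mirrors the alpha formula above in plain Python; the function names and the 2-D latent are assumptions for illustration only.

```python
def interp_alpha(age, vla_period=0.05):
    """Blend factor: 0 right after a VLA update, 1 once a full period has passed."""
    return min(max(age / vla_period, 0.0), 1.0)

def lerp(z_prev, z_current, alpha):
    """Element-wise linear interpolation between two latent vectors."""
    return [(1 - alpha) * p + alpha * c for p, c in zip(z_prev, z_current)]

z_prev, z_current = [0.0, 0.0], [1.0, -2.0]
for age in (0.0, 0.02, 0.04, 0.06):  # LMO ticks every 20 ms
    alpha = interp_alpha(age)
    print(f"age={age * 1000:.0f} ms  alpha={alpha:.1f}  z={lerp(z_prev, z_current, alpha)}")
```

One consequence of this scheme: immediately after an update the legs still see the previous latent, so the smoothing costs up to one VLA period of latency.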

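Results: the numbers speak for themselves

The paper tests on 10 tasks with the AgiBot X2, 20 trials per task:

Task category	Baseline (decoupled)	WholebodyVLA	Δ
Static manipulation (stand still, pick up)	68%	79%	+11%
Walk + carry (walk while carrying)	38%	65%	+27%
Walk + place (walk, then place on a shelf)	45%	72%	+27%
Push heavy (push a heavy object)	41%	55%	+14%
Average	46.5%	67.5%	+21.3%

The pattern is clear: the more a task needs coordination between arms and legs, the larger the improvement. Static manipulation gains only 11 points (the legs barely move), while walk + carry and walk + place gain 27 points each, exactly where the decoupled approach breaks down. Note that the Average row does not equal the plain mean of the four categories listed (48.0% and 67.75%), presumably because it is taken over all 10 tasks, and that the two averages shown differ by 21.0 points rather than the quoted 21.3.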
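
The per-category arithmetic is simple to reproduce. The numbers below are copied from the table; the script only recomputes the deltas and the unweighted mean over the four listed categories, which is not necessarily how the Average row was produced.

```python
# (baseline %, WholebodyVLA %) per task category, copied from the table above
RESULTS = {
    "Static manipulation": (68, 79),
    "Walk + carry": (38, 65),
    "Walk + place": (45, 72),
    "Push heavy": (41, 55),
}

deltas = {task: ours - base for task, (base, ours) in RESULTS.items()}
mean_base = sum(base for base, _ in RESULTS.values()) / len(RESULTS)
mean_ours = sum(ours for _, ours in RESULTS.values()) / len(RESULTS)

print(deltas)
print(mean_base, mean_ours, mean_ours - mean_base)  # 48.0 67.75 19.75
```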

Ablation study

Component removed	Success rate	Drop
Full WholebodyVLA	67.5%	n/a
No LMO reward (standard locomotion instead)	52.1%	-15.4%
No egocentric video pre-training	55.8%	-11.7%
No latent actions (explicit actions instead)	58.2%	-9.3%
No domain randomization on z_t	61.3%	-6.2%

The LMO reward matters most: removing it costs 15.4 points. This supports the view that the main problem is not the VLA architecture but whether locomotion knows what manipulation is doing.
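
The drops in the table are differences from the full system, in percentage points. A short check, with the shortened row labels chosen for this example:

```python
FULL = 67.5  # success rate of the full system, in percent
ABLATIONS = {
    "no LMO reward": 52.1,
    "no egocentric video pre-training": 55.8,
    "no latent actions": 58.2,
    "no z_t domain randomization": 61.3,
}

drops = {name: round(FULL - rate, 1) for name, rate in ABLATIONS.items()}
ranking = sorted(drops, key=drops.get, reverse=True)

for name in ranking:
    print(f"{name}: -{drops[name]} points")
```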

Building a similar system yourself: a practical roadmap

The repo has not released the full code yet, but you can build a comparable pipeline from open-source tools:

Step 1: LAM from Ego4D

Dataset: Ego4D, free for research use
Vision encoder: ViT-Base from DINOv2 (a better fit for video than ImageNet-pretrained weights)
Framework: PyTorch + HuggingFace Transformers
GPU: 1× RTX 4090 is enough for a prototype (with a smaller batch size)

Step 2: VLA Policy

VLM backbone: Qwen2-VL-2B or PaliGemma 2 (small enough to run on a consumer GPU)
Robot data: without a real robot, use the DROID dataset
Or: start with the LeRobot framework for data collection

Step 3: LMO RL in Isaac Lab

Simulator: Isaac Lab, free, runs on an RTX 3060 or better
Robot model: Unitree H1 or G1 URDF (available in Isaac Lab)
RL algorithm: PPO via RSL-RL or RL Games (both integrated)

Step 4: Sim-to-real (if you have hardware)

Deploy on a Unitree G1 (~$30K), the most affordable option for humanoid research
Use the existing sim-to-real pipeline for Unitree

Limitations to be aware of

Code not fully open-source: the repo is currently mostly reference material, with no training scripts released yet
Hardware requirement: real-time VLA inference needs an A100; a consumer GPU is only enough for offline training
Flat terrain only: no stairs, slopes, or outdoor tests yet
Single language: language commands are English only, not yet multilingual

Conclusion: why this is a turning point

WholebodyVLA is more than a good paper; it represents a paradigm shift in humanoid robotics:

From decoupled (arms and legs separate) to unified (shared latent space)
From robot-only data to internet-scale video data
From explicit actions to latent actions (far more scalable)

For the Vietnamese robotics community, the paper opens a research direction that needs no expensive hardware up front: LAM and VLA can be trained from video, and LMO RL can be trained in simulation. All you need is a GPU and time.

References

WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control. Haoran Jiang, Jin Chen et al., ICLR 2026
GitHub: OpenDriveLab/WholebodyVLA (official repository)
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Isaac Lab Documentation
AgiBot X2 Humanoid Robot (hardware platform used in the paper)

Từ paper đến code: WholebodyVLA mở cửa cho cộng đồng

Cánh tay robot humanoid trong phòng lab nghiên cứu AI

Bài toán cốt lõi: Tại sao decoupled approach thất bại?

Trước khi đào sâu vào code, hãy hiểu rõ tại sao WholebodyVLA cần tồn tại.

Hầu hết các hệ thống humanoid hiện tại chia thành hai module riêng biệt:

Manipulation controller: điều khiển tay, thường là VLA hoặc imitation learning
Locomotion controller: điều khiển chân, thường là RL policy

WholebodyVLA giải quyết bằng cách chia sẻ một latent space chung giữa hai tầng — tay "nói" cho chân biết nó đang định làm gì, và chân tự điều chỉnh.

Kiến trúc tổng thể: Ba tầng, một dòng chảy

WholebodyVLA gồm 3 thành phần chính, xếp chồng theo thứ tự:

┌──────────────────────────────────────────────────────────┐
│  TẦNG 1: Latent Action Model (LAM)                       │
│  - Pre-train từ egocentric video (Ego4D, YouTube)         │
│  - Input: cặp frame liên tiếp (o_t, o_{t+1})             │
│  - Output: latent action z_t (khởi tạo latent space)      │
│  - Inverse dynamics + Forward dynamics learning            │
└────────────────────────┬─────────────────────────────────┘
                         │ latent space đã học
                         ▼
┌──────────────────────────────────────────────────────────┐
│  TẦNG 2: VLA Policy (Upper Body)                          │
│  - Fine-tune VLM trên robot demonstration data             │
│  - Input: egocentric image + language command               │
│  - Output: z_t (latent manipulation intent)                 │
│  - Decode z_t → arm joint commands (qua action decoder)     │
└────────────────────────┬─────────────────────────────────┘
                         │ z_t truyền xuống
                         ▼
┌──────────────────────────────────────────────────────────┐
│  TẦNG 3: LMO RL Policy (Lower Body)                       │
│  - Train trong Isaac Lab với PPO                            │
│  - Input: z_t + proprioception + IMU                        │
│  - Output: leg joint torques                                │
│  - Biết z_t → biết manipulation intent → điều chỉnh stance  │
└──────────────────────────────────────────────────────────┘

Tầng 1: Latent Action Model — Học từ video miễn phí

Vấn đề data cho robot

Inverse Dynamics: biến video thành "pseudo-actions"

import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """
    Latent Action Model (LAM) — core component của WholebodyVLA.
    Học latent action representation từ cặp frame video liên tiếp,
    KHÔNG cần action labels.
    
    Training objective:
    1. Inverse dynamics: (o_t, o_{t+1}) → z_t 
       "Frame thay đổi thế này, action phải là thế này"
    2. Forward dynamics: (o_t, z_t) → o_{t+1}_pred
       "Nếu action là z_t, frame tiếp theo trông thế nào"
    """
    def __init__(self, vision_dim=768, latent_dim=256):
        super().__init__()
        self.latent_dim = latent_dim
        
        # Inverse dynamics: hai frame → latent action
        self.inverse_head = nn.Sequential(
            nn.Linear(vision_dim * 2, 1024),
            nn.LayerNorm(1024),
            nn.GELU(),
            nn.Linear(1024, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        
        # Forward dynamics: frame + action → next frame prediction
        self.forward_head = nn.Sequential(
            nn.Linear(vision_dim + latent_dim, 1024),
            nn.LayerNorm(1024),
            nn.GELU(),
            nn.Linear(1024, 512),
            nn.GELU(),
            nn.Linear(512, vision_dim),
        )
        
        # VQ-VAE discretization cho latent space
        # Giúp z_t compact và dễ predict hơn
        self.codebook_size = 512
        self.codebook = nn.Embedding(self.codebook_size, latent_dim)
    
    def quantize(self, z_continuous):
        """Vector quantization: map the continuous z to its nearest codebook entry."""
        # Distance from every z in the batch to every codebook entry
        distances = torch.cdist(z_continuous, self.codebook.weight)  # [B, K]
        
        # Pick the nearest neighbor
        indices = distances.argmin(dim=-1)  # [B]
        z_quantized = self.codebook(indices)  # [B, D]
        
        # Straight-through estimator: the forward pass uses the codebook vector,
        # the backward pass copies the gradient to z_continuous
        z_quantized = z_continuous + (z_quantized - z_continuous).detach()
        return z_quantized, indices
    
    def encode_action(self, feat_t, feat_t1):
        """Features of two frames → latent action."""
        combined = torch.cat([feat_t, feat_t1], dim=-1)
        z_continuous = self.inverse_head(combined)
        z_quantized, indices = self.quantize(z_continuous)
        return z_quantized, z_continuous, indices
    
    def predict_next(self, feat_t, z_t):
        """Current frame + action → predicted next frame."""
        combined = torch.cat([feat_t, z_t], dim=-1)
        return self.forward_head(combined)
    
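    def compute_loss(self, feat_t, feat_t1):
        """Training loss = forward_loss + codebook_loss + 0.25 * commitment_loss.

        The inverse-dynamics head has no separate term: it is trained
        end-to-end through the forward prediction.
        """
        z_q, z_c, indices = self.encode_action(feat_t, feat_t1)
        
        # Forward dynamics loss
        feat_t1_pred = self.predict_next(feat_t, z_q)
        forward_loss = nn.functional.mse_loss(feat_t1_pred, feat_t1.detach())
        
        # Codebook loss (standard VQ-VAE term): pulls the selected entries toward
        # the encoder output. Without it the codebook receives no gradient,
        # because z_q only carries the codebook value through .detach().
        codebook_loss = nn.functional.mse_loss(self.codebook(indices), z_c.detach())
        
        # VQ commitment loss (keeps the continuous z close to the codebook)
        commitment_loss = nn.functional.mse_loss(z_c, z_q.detach())
        
        return forward_loss + codebook_loss + 0.25 * commitment_loss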
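
To make the lookup inside quantize concrete, here is a minimal, dependency-free sketch of nearest-codebook assignment on plain Python lists. The function name nearest_code and the toy 2-D codebook are illustrative assumptions, not part of the WholebodyVLA code, and the straight-through gradient trick is a training-time detail this sketch does not cover.

```python
import math

def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z (Euclidean)."""
    best_idx = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
    return best_idx, codebook[best_idx]

# Toy codebook with K=3 entries of dimension D=2 (hypothetical values)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

idx, z_q = nearest_code([0.9, 0.2], codebook)
print(idx, z_q)  # 1 [1.0, 0.0]
```

The continuous vector is replaced by whichever entry is nearest, so downstream modules only ever see one of K possible actions.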

Data pipeline for LAM

# Sketch of the data pipeline (video decoding is left to the reader)
import random

from torch.utils.data import Dataset
from torchvision import transforms

class EgocentricVideoDataset(Dataset):
    """
    Dataset built from Ego4D or similar egocentric video.
    Each sample is a pair of frames `frame_gap` frames apart.
    """
    def __init__(self, video_paths, frame_gap=2):
        self.videos = video_paths
        self.frame_gap = frame_gap  # distance between the two frames
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            ),
        ])
    
    def __len__(self):
        return len(self.videos)
    
    def load_video(self, idx):
        """Return the frames of video `idx` as PIL images (decoder not shown)."""
        raise NotImplementedError
    
    def __getitem__(self, idx):
        # Take two frames `frame_gap` apart (randint is inclusive on both ends)
        video = self.load_video(idx)
        t = random.randint(0, len(video) - self.frame_gap - 1)
        
        frame_t = self.transform(video[t])
        frame_t1 = self.transform(video[t + self.frame_gap])
        
        return frame_t, frame_t1
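
The index arithmetic in __getitem__ is easy to get off by one, so here is a small standalone sketch of just that step. The helper name sample_pair_indices and the explicit length check are assumptions added for illustration; the dataset above does not guard against videos shorter than frame_gap.

```python
import random

def sample_pair_indices(num_frames, frame_gap=2, rng=random):
    """Pick (t, t + frame_gap) so that both are valid frame indices."""
    if num_frames <= frame_gap:
        raise ValueError("video is shorter than frame_gap")
    t = rng.randint(0, num_frames - frame_gap - 1)  # randint is inclusive
    return t, t + frame_gap

rng = random.Random(0)
pairs = [sample_pair_indices(10, frame_gap=2, rng=rng) for _ in range(1000)]
print(min(t for t, _ in pairs), max(t1 for _, t1 in pairs))
```

With 10 frames and a gap of 2, the second index never exceeds 9, the last valid frame.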

Tier 2: VLA Policy, from images and language to actions

Once LAM has initialized the latent space, tier 2 fine-tunes a vision-language model to predict latent actions from an egocentric image plus a language instruction.

VLA architecture

class WholebodyVLAPolicy(nn.Module):
    """
    VLA policy for upper-body manipulation.
    
    Differences from conventional VLAs (RT-2, OpenVLA):
    - Outputs latent tokens instead of explicit joint commands
    - Latent space pre-trained on video (not only robot data)
    - Dual-arm coordination through a shared latent
    """
    def __init__(
        self,
        vlm_backbone,       # pretrained VLM (e.g., Qwen2-VL)
        lam: LatentActionModel,
        num_action_tokens=8, # number of latent tokens per timestep
        action_dim=14,       # 7 joints × 2 arms
    ):
        super().__init__()
        self.vlm = vlm_backbone
        self.lam = lam
        hidden_dim = vlm_backbone.config.hidden_size
        
        # Project VLM output → latent action tokens
        self.latent_projector = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.GELU(),
            nn.Linear(512, lam.latent_dim * num_action_tokens),
        )
        
        # Action decoder: latent tokens → explicit joint commands
        # Only used when actual motor commands are needed for the upper body
        self.action_decoder = nn.Sequential(
            nn.Linear(lam.latent_dim * num_action_tokens, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),  # delta joint positions
        )
        
        self.num_action_tokens = num_action_tokens
        
        # Freeze every LAM parameter (codebook included): the latent space is already stable
        for param in self.lam.parameters():
            param.requires_grad = False
    
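    def forward(self, image, language_tokens):
        """
        Input: egocentric image + tokenized language instruction
        Output: 
          - z_t: latent intent vector (sent down to the LMO RL policy)
          - arm_actions: explicit joint commands for the upper body
        """
        # Encode image + language with the VLM. Assumes the backbone returns
        # `last_hidden_state`; a real VLM may need extra inputs or expose a different field.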
        vlm_out = self.vlm(
            pixel_values=image,
            input_ids=language_tokens,
        ).last_hidden_state[:, -1, :]  # [B, hidden_dim]
        
        # Project sang latent action space
        z_flat = self.latent_projector(vlm_out)  # [B, latent_dim * N]
        B = z_flat.shape[0]
        z_tokens = z_flat.view(B, self.num_action_tokens, -1)
        
        # Aggregate the tokens into z_t, which is sent down to locomotion
        z_t = z_tokens.mean(dim=1)  # [B, latent_dim]
        
        # Decode into arm joint commands
        arm_actions = self.action_decoder(z_flat)  # [B, action_dim]
        
        return z_t, arm_actions
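
The reshape-then-average step in forward can be traced by hand on tiny sizes. The sketch below mimics z_flat.view(B, N, -1) followed by mean(dim=1) for a single sample using plain lists; the helper name split_and_pool and the toy sizes are assumptions for illustration (the article uses 8 tokens of 256 dims).

```python
def split_and_pool(z_flat, num_tokens, latent_dim):
    """Reshape a flat vector into [num_tokens, latent_dim], then mean-pool over tokens."""
    assert len(z_flat) == num_tokens * latent_dim
    tokens = [z_flat[i * latent_dim:(i + 1) * latent_dim] for i in range(num_tokens)]
    pooled = [sum(tok[d] for tok in tokens) / num_tokens for d in range(latent_dim)]
    return tokens, pooled

# Toy example: 3 tokens × 2 dims
z_flat = [float(i) for i in range(6)]
tokens, z_t = split_and_pool(z_flat, num_tokens=3, latent_dim=2)
print(tokens)  # [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(z_t)     # [2.0, 3.0]
```

Mean pooling keeps z_t at latent_dim regardless of how many tokens the policy emits, so the locomotion policy's input size does not change if num_action_tokens does.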

Dual-arm coordination

A subtle point: with num_action_tokens=8, the model predicts several tokens per timestep rather than one. Why?

With dual-arm manipulation, each arm needs its own trajectory, yet the two must stay coordinated. Eight tokens let the model encode:

Tokens 1-4: trajectory for the left arm
Tokens 5-8: trajectory for the right arm
Attention between tokens, so each arm "knows" what the other is doing

This differs from the earlier research review: the paper does support dual-arm control; the demos simply do not yet cover fully bimanual tasks.

Tier 3: LMO RL Policy, where the legs know what the arms are doing

I consider this the paper's biggest contribution, and also the part where open-source code would be most valuable to the community.

[Image: a humanoid robot moving through an open space]

What is LMO, and why is it different?

Stance width: widen the stance when manipulation gets complex
CoM compensation: shift the center of mass opposite to the arm
Foot placement: place the feet where stability is best
Gait adaptation: switch from walking to standing still when precise manipulation is needed
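
The stance-width item can be written as a formula. The sketch below reuses the constants from the reward function shown later in this article (base 0.3, extra 0.15, scale 10, sharpness 3.0); the function names are assumptions, and the unit is presumably meters, which the article does not state.

```python
import math

def target_stance_width(z_magnitude, base=0.3, extra=0.15, scale=10.0):
    """Target stance width grows smoothly from `base` toward `base + extra`."""
    return base + extra * math.tanh(z_magnitude / scale)

def manipulation_support_reward(stance_width, z_magnitude):
    """Reward is 1.0 at the target width and decays as the stance deviates."""
    target = target_stance_width(z_magnitude)
    return math.exp(-3.0 * (stance_width - target) ** 2)

for z in (0.0, 5.0, 20.0):
    print(z, round(target_stance_width(z), 3))
```

Because tanh saturates, the target width stays below 0.45 however large the latent norm gets, which keeps the policy from being rewarded for an unbounded stance.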

Observation space design

from dataclasses import dataclass
import numpy as np

@dataclass
class LMOObservation:
    """
    Observation space for the LMO RL policy.
    Three groups of information: proprioception, exteroception, and manipulation context.
    """
    # === Proprioception ===
    joint_positions: np.ndarray    # [12]: 6 joints × 2 legs
    joint_velocities: np.ndarray   # [12]
    base_orientation: np.ndarray   # [4]: quaternion
    base_angular_vel: np.ndarray   # [3]: gyroscope
    base_linear_acc: np.ndarray    # [3]: accelerometer
    
    # === Exteroception ===
    foot_contacts: np.ndarray      # [4]: 2 feet × (heel, toe)
    
    # === Manipulation context (KEY INNOVATION) ===
    latent_action: np.ndarray      # [256]: z_t from the VLA encoder
    # z_t encodes manipulation intent, target location, and a force estimate
    # LMO uses it to ANTICIPATE what compensation will be needed
    
    @property
    def as_vector(self) -> np.ndarray:
        """Flatten into a single vector for the MLP policy."""
        return np.concatenate([
            self.joint_positions,       # 12
            self.joint_velocities,      # 12
            self.base_orientation,      # 4
            self.base_angular_vel,      # 3
            self.base_linear_acc,       # 3
            self.foot_contacts,         # 4
            self.latent_action,         # 256
        ])  # Total: 294 dimensions

    @property
    def dim(self) -> int:
        return 294

Note the dimensions: 256 of the 294 dims come from z_t. The latent action makes up about 87% of the observation vector, which shows how much weight manipulation context carries for locomotion.
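
The count is quick to verify. This snippet only re-adds the per-field sizes listed in the LMOObservation sketch; the dictionary itself is an illustrative restatement, not code from the repo.

```python
# Sizes of each observation group, copied from the LMOObservation comments
obs_dims = {
    "joint_positions": 12,
    "joint_velocities": 12,
    "base_orientation": 4,
    "base_angular_vel": 3,
    "base_linear_acc": 3,
    "foot_contacts": 4,
    "latent_action": 256,
}

total = sum(obs_dims.values())
latent_share = obs_dims["latent_action"] / total

print(total)                  # 294
print(f"{latent_share:.1%}")  # 87.1%
```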

Reward function in detail

Reward design is where "art" meets "science" in RL. The LMO reward combines several components that are balanced against one another:

import torch

class LMORewardFunction:
    """
    Multi-component reward for the LMO policy.
    
    Design philosophy:
    - The locomotion policy does NOT optimize for locomotion alone
    - It must optimize for PROVIDING A STABLE BASE
      so that manipulation succeeds
    - z_t lets locomotion "look ahead" at the manipulation intent
    
    The _compute_* and _point_to_polygon_distance helpers are not shown.
    """
    def __init__(self):
        # Weight of each component (tuned on AgiBot X2); the weights sum to 1.0
        self.weights = {
            'orientation': 0.25,      # keep the torso upright
            'angular_velocity': 0.10, # no wobbling
            'linear_velocity': 0.10,  # track the velocity command
            'foot_contact': 0.05,     # maintain the contact pattern
            'manipulation_support': 0.20,  # SUPPORT the manipulating arms
            'com_stability': 0.15,    # CoM inside the support polygon
            'energy': 0.05,           # save energy
            'smoothness': 0.10,       # smooth, non-jerky actions
        }
    
    def compute(self, state, z_t, action, prev_action):
        rewards = {}
        
        # 1. Orientation: penalize excessive tilt
        roll, pitch = state['euler'][:2]
        rewards['orientation'] = torch.exp(
            -5.0 * (roll**2 + pitch**2)
        )
        
        # 2. Angular velocity: penalize fast rotation
        rewards['angular_velocity'] = torch.exp(
            -2.0 * torch.norm(state['angular_vel'])
        )
        
        # 3. Velocity tracking: follow locomotion command
        vel_error = state['linear_vel'][:2] - state['cmd_vel'][:2]
        rewards['linear_velocity'] = torch.exp(
            -4.0 * torch.norm(vel_error)
        )
        
        # 4. Foot contact: at least one foot must always touch the ground
        rewards['foot_contact'] = (
            state['foot_contacts'].sum() >= 1
        ).float()
        
        # 5. MANIPULATION SUPPORT (key innovation)
        # The magnitude of z_t is used as a proxy for manipulation complexity
        z_magnitude = torch.norm(z_t)
        
        # More complex manipulation → wider stance
        stance_width = self._compute_stance_width(state)
        target_width = 0.3 + 0.15 * torch.tanh(z_magnitude / 10.0)
        rewards['manipulation_support'] = torch.exp(
            -3.0 * (stance_width - target_width)**2
        )
        
        # 6. CoM inside the support polygon
        com_pos = state['com_position'][:2]  # xy
        support = self._compute_support_polygon(state)
        com_margin = self._point_to_polygon_distance(com_pos, support)
        rewards['com_stability'] = torch.clamp(com_margin, 0, 1)
        
        # 7. Energy efficiency
        rewards['energy'] = torch.exp(
            -0.01 * torch.sum(action**2)
        )
        
        # 8. Action smoothness
        action_diff = action - prev_action
        rewards['smoothness'] = torch.exp(
            -5.0 * torch.sum(action_diff**2)
        )
        
        # Weighted sum
        total = sum(
            self.weights[k] * rewards[k] for k in self.weights
        )
        return total, rewards
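
The reward above calls _point_to_polygon_distance without defining it. One plausible reading, sketched here as an assumption rather than the paper's implementation, is a signed margin against a convex support polygon: positive when the CoM projection is inside, negative when it is outside. The names com_margin and com_stability_reward and the rectangular polygon are hypothetical.

```python
def com_margin(point, polygon):
    """Signed margin of `point` w.r.t. a convex polygon with counter-clockwise vertices.

    Positive: the point is inside, and the value is its distance to the nearest edge.
    Negative: the point lies outside at least one edge.
    """
    px, py = point
    margins = []
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        ex, ey = x2 - x1, y2 - y1
        length = (ex * ex + ey * ey) ** 0.5
        # Cross product > 0 means the point is left of the edge (inside for CCW order)
        cross = ex * (py - y1) - ey * (px - x1)
        margins.append(cross / length)
    return min(margins)

def com_stability_reward(point, polygon):
    """Clamp the margin to [0, 1], mirroring torch.clamp(com_margin, 0, 1)."""
    return max(0.0, min(1.0, com_margin(point, polygon)))

# Hypothetical support polygon: a 0.4 × 0.2 rectangle centered under the robot
support = [(-0.2, -0.1), (0.2, -0.1), (0.2, 0.1), (-0.2, 0.1)]
print(com_margin((0.0, 0.0), support))            # about 0.1
print(com_stability_reward((0.3, 0.0), support))  # 0.0, CoM is outside
```

Clamping at zero means the policy gets no gradient from this term once the CoM leaves the polygon, so in practice it leans on the orientation and contact terms to recover.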

Key insights from the reward design

Training Pipeline: from video to a real robot

Stage 1: Pre-train LAM (offline, no robot needed)

# Pseudo training script for LAM
# Data: Ego4D dataset (~3,670 hours of egocentric video)
# Hardware: 4× A100 80GB, ~3 days

python train_lam.py \
    --data_path /data/ego4d/v2/ \
    --vision_encoder vit-base-patch16 \
    --latent_dim 256 \
    --codebook_size 512 \
    --frame_gap 2 \
    --batch_size 256 \
    --lr 1e-4 \
    --epochs 50 \
    --num_gpus 4

Stage 2: Collect robot demonstration data

WholebodyVLA proposes an efficient data-collection pipeline:

Teleoperation on AgiBot X2: an operator controls the arms with VR controllers
Record: egocentric image + joint states + language annotation
Augmentation: add camera noise and lighting variation
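
As a rough picture of what one recorded demonstration might hold, here is a minimal container for the three recorded signals. The class and field names are hypothetical, and the 14 joint values simply follow the action_dim=14 (7 joints × 2 arms) used in the policy sketch earlier; the repo's actual data format may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DemoStep:
    image_path: str               # egocentric camera frame
    joint_positions: List[float]  # joint states at this step

@dataclass
class DemoEpisode:
    instruction: str              # language annotation for the whole episode
    steps: List[DemoStep] = field(default_factory=list)

    def add_step(self, image_path, joint_positions):
        self.steps.append(DemoStep(image_path, list(joint_positions)))

episode = DemoEpisode(instruction="pick up the box")
episode.add_step("frame_000.png", [0.0] * 14)
episode.add_step("frame_001.png", [0.01] * 14)
print(len(episode.steps), episode.instruction)  # 2 pick up the box
```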

The paper collects about 500 episodes per task, far fewer than RT-2 (~130K), thanks to the pre-trained latent space.

Stage 3: Fine-tune VLA Policy

# Fine-tune the VLM backbone on robot data
# The latent space from LAM is frozen
python train_vla.py \
    --vlm_backbone qwen2-vl-2b \
    --lam_checkpoint checkpoints/lam_ego4d.pt \
    --robot_data /data/agibot_x2/demonstrations/ \
    --freeze_lam \
    --batch_size 32 \
    --lr 5e-5 \
    --epochs 20 \
    --num_action_tokens 8

Stage 4: Train the LMO RL policy in Isaac Lab

# Isaac Lab environment config for LMO training
# Trains on 4096 parallel environments

# Isaac Lab 1.x module path (Isaac Lab 2.x renamed the package to `isaaclab`);
# the import is not used by the config class below
from omni.isaac.lab.envs import ManagerBasedRLEnv

class LMOTrainingConfig:
    """Hyperparameters for LMO RL training."""
    
    # Environment
    num_envs = 4096          # parallel environments
    episode_length = 1000    # 20 seconds @ 50Hz
    
    # Physics
    sim_dt = 0.005           # 200Hz physics
    control_dt = 0.02        # 50Hz policy
    
    # PPO hyperparameters
    ppo_config = {
        'learning_rate': 3e-4,
        'n_steps': 24,
        'batch_size': num_envs * 24,
        'n_epochs': 5,
        'gamma': 0.99,
        'gae_lambda': 0.95,
        'clip_range': 0.2,
        'ent_coef': 0.01,
        'vf_coef': 0.5,
        'max_grad_norm': 1.0,
    }
    
    # Policy architecture
    policy_net = [256, 256, 128]  # MLP layers
    activation = 'elu'
    
    # Domain randomization (crucial for sim-to-real)
    domain_rand = {
        'friction_range': (0.3, 1.5),
        'payload_mass_range': (0, 5.0),    # kg
        'motor_strength_range': (0.8, 1.2),
        'push_force_range': (0, 50),       # N
        'latent_noise_std': 0.05,          # noise on z_t
        'terrain_roughness': (0, 0.05),    # m
        'com_displacement': (0, 0.05),     # m
    }
    
    # Total training
    total_timesteps = 500_000_000  # ~6-8 hours on 1× A100
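
A few quantities follow directly from the numbers in this config. The snippet below only does the arithmetic; the variable names decimation, rollout_batch and num_updates are labels chosen here, not fields of the config.

```python
num_envs = 4096
n_steps = 24
sim_dt = 0.005
control_dt = 0.02
episode_length = 1000
total_timesteps = 500_000_000

decimation = round(control_dt / sim_dt)         # physics steps per policy step
episode_seconds = episode_length * control_dt   # simulated seconds per episode
rollout_batch = num_envs * n_steps              # transitions collected per PPO update
num_updates = total_timesteps // rollout_batch  # PPO updates needed to reach the budget

print(decimation, episode_seconds, rollout_batch, num_updates)
```

So each policy action is held for 4 physics steps, and the 500M-step budget corresponds to roughly five thousand PPO updates of 98,304 transitions each.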

Deploy: the inference pipeline in practice

When deployed on the real robot, the system runs at three different rates:

Camera (30 FPS) ─── image ──→ VLA Encoder (~20Hz, GPU)
                                    │
                                    z_t (async update)
                                    │
Proprioception (1kHz) ─── obs ──→ LMO Policy (50Hz, CPU)
                                    │
                                    joint targets
                                    │
                            PD Controller (1kHz, embedded)
                                    │
                                    motor torques ──→ Motors
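
The ratios between the three loop rates in the diagram explain why z_t cannot simply be consumed once per update. A small exact-arithmetic sketch, with the dictionary keys chosen here for illustration:

```python
from fractions import Fraction

rates_hz = {"vla": 20, "lmo": 50, "pd": 1000}

lmo_per_vla = Fraction(rates_hz["lmo"], rates_hz["vla"])  # LMO steps per VLA update
pd_per_lmo = Fraction(rates_hz["pd"], rates_hz["lmo"])    # PD ticks per LMO step

print(lmo_per_vla)  # 5/2
print(pd_per_lmo)   # 20
```

The LMO policy runs 2.5 steps per VLA update, which is not a whole number, so consecutive LMO steps see the same z_t a varying number of times unless it is interpolated.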

Handling the frequency mismatch

import threading
import time
import numpy as np

class WholebodyVLARuntime:
    """
    Runtime inference for WholebodyVLA.
    Handles the frequency mismatch between VLA (20Hz) and RL (50Hz).
    """
    def __init__(self, vla_model, lmo_model):
        self.vla = vla_model
        self.lmo = lmo_model
        self.running = True  # set to False to stop both loops
        
        # Shared state, guarded by self._lock
        self._z_current = np.zeros(256)
        self._z_prev = np.zeros(256)
        self._z_timestamp = 0.0
        self._lock = threading.Lock()
    
    def vla_loop(self, camera, language_cmd):
        """
        VLA inference loop, run on its own (GPU-bound) thread.
        ~20Hz; each iteration publishes a new z_t.
        """
        while self.running:
            image = camera.get_frame()
            
            # VLA inference: ~50ms on A100
            z_new, arm_actions = self.vla.predict(image, language_cmd)
            
            with self._lock:
                self._z_prev = self._z_current.copy()
                self._z_current = z_new
                self._z_timestamp = time.time()
            
            # Send arm commands directly, bypassing the RL policy (send_arm_commands not shown)
            self.send_arm_commands(arm_actions)
    
    def lmo_loop(self, robot):
        """
        LMO RL inference loop, run on a CPU thread.
        50Hz; interpolates z_t between VLA updates.
        """
        dt = 0.02  # 50Hz
        
        while self.running:
            start = time.time()
            
            # Read proprioception from the robot (IMU, joint encoders)
            proprio = robot.get_proprioception()
            
            # Interpolate z_t for smooth control
            with self._lock:
                age = time.time() - self._z_timestamp
                # Linear interpolation based on the age of the latest z_t
                alpha = min(age / 0.05, 1.0)  # 50ms VLA period
                z_interp = (1 - alpha) * self._z_prev + alpha * self._z_current
            
            # RL inference: <1ms on CPU
            obs = np.concatenate([
                proprio['joint_pos'],
                proprio['joint_vel'],
                proprio['base_quat'],
                proprio['gyro'],
                proprio['accel'],
                proprio['foot_contacts'],
                z_interp,
            ])
            
            leg_actions = self.lmo.predict(obs)
            robot.set_leg_targets(leg_actions)
            
            # Maintain 50Hz
            elapsed = time.time() - start
            if elapsed < dt:
                time.sleep(dt - elapsed)
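
The interpolation inside lmo_loop can be isolated and checked on its own. This is a list-based sketch of the same blend; the function name interpolate_latent is an assumption, and it adds a lower clamp on alpha that the loop above does not have.

```python
def interpolate_latent(z_prev, z_cur, age, period=0.05):
    """Blend from z_prev to z_cur as the newest latent ages from 0 to `period` seconds."""
    alpha = min(max(age / period, 0.0), 1.0)
    return [(1 - alpha) * p + alpha * c for p, c in zip(z_prev, z_cur)]

z_prev, z_cur = [0.0, 0.0], [1.0, 2.0]
print(interpolate_latent(z_prev, z_cur, age=0.0))    # [0.0, 0.0]
print(interpolate_latent(z_prev, z_cur, age=0.025))  # [0.5, 1.0]
print(interpolate_latent(z_prev, z_cur, age=0.2))    # [1.0, 2.0]
```

Note the trade-off: a freshly arrived z_t only takes full effect one VLA period later, so this scheme buys smoothness at the cost of up to 50ms of extra latency.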

Results: the numbers speak for themselves

The paper tests on 10 tasks with AgiBot X2, 20 trials per task:

Task Category	Baseline (Decoupled)	WholebodyVLA	Δ
Static manipulation (stand still, pick up)	68%	79%	+11%
Walk + carry (walk while carrying an object)	38%	65%	+27%
Walk + place (walk, then place on a shelf)	45%	72%	+27%
Push heavy (push a heavy object)	41%	55%	+14%
Average	46.5%	67.5%	+21.0%

Ablation study

Component removed	Success rate	Drop
Full WholebodyVLA	67.5%	n/a
Without LMO reward (standard locomotion instead)	52.1%	-15.4%
Without egocentric video pre-training	55.8%	-11.7%
Without latent actions (explicit actions instead)	58.2%	-9.3%
Without domain randomization on z_t	61.3%	-6.2%
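
The Drop column is the difference from the full system in percentage points. The snippet below recomputes it from the success rates in the table and also prints the drop relative to the full system's 67.5%; the dictionary labels are shortened paraphrases of the table rows.

```python
full = 67.5
ablations = {
    "no LMO reward": 52.1,
    "no egocentric video pre-training": 55.8,
    "no latent actions": 58.2,
    "no domain randomization on z_t": 61.3,
}

# Absolute drop in percentage points, rounded to one decimal like the table
drops = {name: round(full - rate, 1) for name, rate in ablations.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop} points ({drop / full:.0%} relative)")
```

Ranked this way, removing the LMO reward costs the most, followed by video pre-training, latent actions, and latent-noise randomization.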

Building a similar system yourself: a practical roadmap

The repo has not released the full code yet, but you can build a similar pipeline with open-source tools:

Step 1: LAM from Ego4D

Dataset: Ego4D, free for research
Vision encoder: ViT-Base from DINOv2 (better than ImageNet-pretrained weights for video)
Framework: PyTorch + HuggingFace Transformers
GPU: 1× RTX 4090 is enough for a prototype (with a smaller batch size)

Step 2: VLA Policy

VLM backbone: Qwen2-VL-2B or PaliGemma 2 (small enough to run on a consumer GPU)
Robot data: if you have no real robot, use the DROID dataset
Or: start with the LeRobot framework for data collection

Step 3: LMO RL in Isaac Lab

Simulator: Isaac Lab, free, runs on an RTX 3060 or better
Robot model: Unitree H1 or G1 URDF (available in Isaac Lab)
RL algorithm: PPO via RSL-RL or RL Games (both integrated)

Step 4: Sim-to-real (if you have hardware)

Deploy on a Unitree G1 (~$30K), the most affordable option for humanoid research
Use the existing sim-to-real pipeline for Unitree

Limitations to know

Code is not fully open-source: the repo is currently mostly reference material, with no training scripts released
Hardware requirement: real-time VLA inference needs an A100; a consumer GPU is only enough for offline training
Flat terrain only: no stairs, slopes, or outdoor tests yet
Single language: commands are English-only, not multilingual

Conclusion: why this is a turning point

WholebodyVLA is more than a good paper; it represents a paradigm shift in humanoid robotics:

From decoupled (arms and legs separate) to unified (a shared latent space)
From robot-only data to internet-scale video data
From explicit actions to latent actions (far more scalable)

References

WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control. Haoran Jiang, Jin Chen et al., ICLR 2026
GitHub: OpenDriveLab/WholebodyVLA (official repository)
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Isaac Lab Documentation
AgiBot X2 Humanoid Robot (hardware platform used in the paper)


Nguyễn Anh Tuấn
