WholebodyVLA: The Full Training Pipeline

If you have read the analysis of the WholebodyVLA ICLR 2026 paper and the architecture deep-dive, you already know why this system matters and how its components work. This post goes one step further: how to build the training pipeline end to end.

Note: At the time of writing (April 2026), the OpenDriveLab team has not released official code. The paper nevertheless gives enough technical detail to understand the whole workflow and be ready when the code is published, or to implement it yourself from the paper's design.

The big picture: 3 components, 2 parallel training tracks

Before going step by step, get the structure of WholebodyVLA clear:

[Egocentric Video Data]    ──► LAM Training ──► Latent Codes
         │                                              │
[Manipulation Data]           ──────────────────────────┤
         │                                              ▼
[Teleoperation Demos] ──────► VLA Fine-tuning (Prismatic-7B + LoRA)
                                                        │
[Isaac Lab Simulation] ──────► LMO RL Policy Training   │
                                                        ▼
                              ┌─────────────────────────┐
                              │   Inference @ 10/50 Hz  │
                              └─────────────────────────┘

The pipeline has two parallel tracks:

LAM + VLA pretraining: learn to "read" egocentric video and turn it into latent action codes.
RL policy training: train the LMO (Loco-Manipulation-Oriented) policy in simulation.

The two tracks meet at inference, where the VLA issues commands (about 10 Hz) and the LMO policy executes them at a higher rate (50 Hz).
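To make the two rates concrete, here is a minimal sketch of how a 50 Hz control loop can consume commands from a 10 Hz planner by holding the most recent command between updates. The helper name `hold_schedule` and the idealized timing (one new command exactly every fifth tick) are illustrative assumptions, not details from the paper.

```python
def hold_schedule(n_ticks: int, control_hz: int = 50, vla_hz: int = 10) -> list[int]:
    """Return, for each control tick, the index of the VLA command in effect.

    Hypothetical helper: assumes VLA command k arrives exactly at control tick
    k * (control_hz // vla_hz) and is held until the next command arrives.
    """
    if control_hz % vla_hz != 0:
        raise ValueError("control_hz must be a multiple of vla_hz in this sketch")
    ticks_per_cmd = control_hz // vla_hz  # 50 // 10 = 5 control ticks per command
    return [tick // ticks_per_cmd for tick in range(n_ticks)]


# Each VLA command is reused for 5 consecutive control ticks.
schedule = hold_schedule(12)
print(schedule)
```

In practice the VLA latency varies, so the controller simply keeps using whichever command arrived last.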

Hardware requirements

These are the figures reported in the paper, and they are well beyond a home laptop:

Task	Hardware	Budget / rate
LAM Training	8× NVIDIA H100	~30,000 steps
VLA Pretraining	8× NVIDIA H100	~20,000 steps
VLA Fine-tuning	8× NVIDIA H100	~10,000 steps
LMO RL Training	1× NVIDIA H100	A few days
Inference (VLA)	RTX 4090	~10 Hz
Inference (RL)	NanoPi (onboard)	50 Hz

You do not need an H100 cluster to experiment, but to reproduce the paper's result (78% success) you need at least 1× H100, or an equivalent A100 80GB, for training. Expect the LAM and VLA runs to take considerably longer on one GPU than on the 8× H100 setup in the table.

Step 1: Set up the environment

# Python environment
conda create -n wholebodyvla python=3.10
conda activate wholebodyvla

# Core dependencies
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate einops timm
pip install peft  # LoRA fine-tuning in Step 4
pip install numpy opencv-python  # used by the capture script in Step 2
pip install open-clip-torch  # CLIP encoder for the VLA

# DINOv2 (encoder for the LAM)
pip install git+https://github.com/facebookresearch/dinov2.git

# Isaac Lab (for LMO RL training)
# Follow the official guide, since the exact install steps depend on the Isaac Sim version:
# https://isaac-sim.github.io/IsaacLab/
pip install isaaclab

# ZeroMQ (inference communication)
pip install pyzmq
pip install pyrealsense2  # Intel RealSense SDK

Step 2: Collect the data

This is the most time-consuming step, and it is usually underestimated. WholebodyVLA uses 3 kinds of data for 3 different purposes.

2.1 Egocentric Locomotion Video (for the Locomotion LAM)

Goal: record first-person video of a person walking toward a manipulation target. No robot and no action labels are needed.

Camera setup:

import pyrealsense2 as rs
import numpy as np
import cv2

pipeline = rs.pipeline()
config = rs.config()

# Intel RealSense D435i: 640x480 @ 30fps
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.accel)  # IMU for locomotion tracking
config.enable_stream(rs.stream.gyro)

profile = pipeline.start(config)

The 8 motion primitives to record:

Primitive	Description	Suggested count
Advance	Walk straight to the target	500 clips
Turn Left/Right	Rotate the body to line up with the target	400 clips
Sidestep	Step sideways to the left/right	300 clips
Squat Down	Lower the body to reach low objects	300 clips
Stand Up	Stand back up after picking something up	300 clips
Approach Near	Close in on the object	400 clips
Back Away	Step backward	200 clips
Stop & Ready	Stop with the hands ready to manipulate	300 clips

The paper collects ~300 hours of video, but its ablation suggests you can start with 50 hours and still get 60% or more of the benefit.

# Script that records and saves one video clip per primitive
# (reuses `pipeline`, `np` and `cv2` from the camera setup above)
import os, time

def collect_primitive(primitive_name: str, duration_secs: int = 3):
    """Record one motion-primitive clip."""
    timestamp = int(time.time())
    frames = []

    start = time.time()
    while time.time() - start < duration_secs:
        frameset = pipeline.wait_for_frames()
        color_frame = frameset.get_color_frame()

        if color_frame:
            # Copy the pixels so the list does not hold on to the SDK's frame buffers
            img = np.asanyarray(color_frame.get_data()).copy()
            frames.append(img)

    # Save the clip
    out_path = f"data/loco_videos/{primitive_name}/{timestamp}.mp4"
    os.makedirs(os.path.dirname(out_path), exist_ok=True)

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(out_path, fourcc, 30, (640, 480))
    for f in frames:
        out.write(f)
    out.release()

    return out_path
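As a quick sanity check on the collection effort, the sketch below totals the suggested clip counts from the table and converts them to hours, assuming every clip lasts the 3-second default of `collect_primitive`. The dictionary simply restates the table; the helper names `total_hours` and `clips_needed` are made up for illustration.

```python
import math

# Suggested clip counts, copied from the primitives table.
CLIP_COUNTS = {
    "Advance": 500, "Turn Left/Right": 400, "Sidestep": 300, "Squat Down": 300,
    "Stand Up": 300, "Approach Near": 400, "Back Away": 200, "Stop & Ready": 300,
}

def total_hours(clip_counts: dict[str, int], clip_secs: float = 3.0) -> float:
    """Total footage in hours if every clip has the same duration."""
    return sum(clip_counts.values()) * clip_secs / 3600.0

def clips_needed(target_hours: float, clip_secs: float = 3.0) -> int:
    """Number of fixed-length clips required to reach a target duration."""
    return math.ceil(target_hours * 3600.0 / clip_secs)

print(sum(CLIP_COUNTS.values()), "clips =", total_hours(CLIP_COUNTS), "hours")
print("clips needed for 50 hours:", clips_needed(50))
```

Under these assumptions the table adds up to 2,700 clips, or 2.25 hours of footage, so a 50-hour target needs either far more clips or much longer recordings per clip.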

2.2 Manipulation Data (AgiBot World Dataset)

The paper uses the AgiBot World dataset for manipulation pretraining. The dataset consists of bimanual manipulation tasks recorded on real robots.

# Download AgiBot World (if publicly available)
wget https://huggingface.co/datasets/agibot-world/agibot-world-v1/...

# Expected directory layout:
data/
  manipulation/
    task_001/
      episode_000/
        images/  # frame-by-frame
        actions.npy  # joint angles
        lang.txt  # task description
    task_002/
    ...

2.3 Robot Teleoperation Data (for fine-tuning)

This data is collected directly on the AgiBot X2 robot, so you need the real robot or an equivalent simulation:

Upper body: VR headset (Meta Quest 3 or HTC Vive) mapping hand pose → joint targets
Lower body: Joystick → locomotion commands
Amount: 50 demonstrations/task × 3 tasks = 150 demos in total

The 3 main tasks: Bag Packing, Box Loading, Cart Pushing.

# Teleoperation data format
import os
import numpy as np

class DemoDataset:
    """Dataset format for WholebodyVLA fine-tuning."""

    def __init__(self, demo_dir: str):
        self.episodes = self._load_episodes(demo_dir)

    def _load_episodes(self, demo_dir):
        episodes = []
        for ep_dir in sorted(os.listdir(demo_dir)):
            ep = {
                "images": [],      # (T, H, W, 3) egocentric frames
                "arm_actions": [],  # (T, 14) = 7 DoF × 2 arms
                "loco_cmds": [],   # (T, 4) = [s_x, s_y, s_psi, h*]
                "language": "",    # task description string
            }
            # Load each frame and action from os.path.join(demo_dir, ep_dir) here...
            episodes.append(ep)
        return episodes
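Before fine-tuning, it helps to check that each recorded episode matches the shapes in the format above. The following is a small sketch of such a check on plain Python lists. The function names are hypothetical, and the allowed values (flags in {-1, 0, 1}, stance height in [0.7, 1.0] m) are taken from the command interface described later in this article.

```python
def validate_loco_cmd(cmd, h_min: float = 0.7, h_max: float = 1.0) -> bool:
    """Check one [s_x, s_y, s_psi, h*] locomotion command."""
    if len(cmd) != 4:
        return False
    *flags, h_star = cmd
    return all(f in (-1, 0, 1) for f in flags) and h_min <= h_star <= h_max

def validate_episode(arm_actions, loco_cmds) -> bool:
    """Check that arm actions and locomotion commands are aligned and well-formed."""
    same_length = len(arm_actions) == len(loco_cmds)
    arms_ok = all(len(a) == 14 for a in arm_actions)  # 7 DoF × 2 arms
    loco_ok = all(validate_loco_cmd(c) for c in loco_cmds)
    return same_length and arms_ok and loco_ok

# Two-step toy episode: walk forward, then stop at a lower stance.
arms = [[0.0] * 14, [0.1] * 14]
cmds = [[1, 0, 0, 1.0], [0, 0, 0, 0.8]]
print(validate_episode(arms, cmds))
```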

Humanoid robots move through large spaces: loco-manipulation requires the arms and legs to coordinate closely

Step 3: Train the Latent Action Model (LAM)

The LAM is the "translator" that turns pairs of video frames into latent codes. The VLA then learns to predict these codes.

The paper trains two separate LAMs:

Manipulation LAM: static camera, moving hands
Locomotion LAM: camera moving with the person

VQ-VAE architecture

import torch
import torch.nn as nn
from transformers import Dinov2Model

class LatentActionModel(nn.Module):
    """VQ-VAE with a DINOv2 encoder for WholebodyVLA."""

    def __init__(
        self,
        codebook_size: int = 8192,
        latent_dim: int = 256,
        commitment_cost: float = 0.25,  # beta in the paper
    ):
        super().__init__()

        # DINOv2-base as the feature encoder (hidden size 768)
        self.encoder = Dinov2Model.from_pretrained("facebook/dinov2-base")

        # Projection from the two concatenated DINOv2 features (2 × 768) → latent_dim
        self.encoder_proj = nn.Linear(768 * 2, latent_dim)

        # Vector quantizer
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.commitment_cost = commitment_cost

        # Projects current-frame features (768) to latent_dim so they can be added to the code
        self.context_proj = nn.Linear(768, latent_dim)

        # Decoder that reconstructs the future frame's features
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 768),  # back to DINOv2 feature space
        )

    def encode(self, obs_t: torch.Tensor, obs_t_k: torch.Tensor):
        """
        Args:
            obs_t: current frame (B, C, H, W)
            obs_t_k: frame k steps in the future (B, C, H, W)
        Returns:
            Quantized latent code (straight-through), code indices, and the VQ loss
        """
        # Extract features from both frames
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]  # [CLS]
        feat_tk = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]

        # Concatenate and project
        z = self.encoder_proj(torch.cat([feat_t, feat_tk], dim=-1))  # (B, latent_dim)

        # Vector quantization: nearest codebook entry for each z
        distances = torch.cdist(z, self.codebook.weight)  # (B, codebook_size)
        indices = distances.argmin(dim=-1)  # (B,)
        c = self.codebook(indices)  # (B, latent_dim)

        # VQ-VAE losses: the codebook term pulls codes toward the encoder output;
        # the commitment term (weighted by beta) keeps the encoder close to its code
        codebook_loss = torch.mean((c - z.detach()) ** 2)
        commit_loss = self.commitment_cost * torch.mean((z - c.detach()) ** 2)

        # Straight-through gradient
        c_st = z + (c - z).detach()

        return c_st, indices, codebook_loss + commit_loss

    def decode(self, c: torch.Tensor, obs_t: torch.Tensor):
        """Reconstruct future-frame features from the latent code + current frame."""
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
        # c is (B, latent_dim) and feat_t is (B, 768), so project feat_t before adding
        pred_feat = self.decoder(c + self.context_proj(feat_t))
        return pred_feat

    def forward(self, obs_t, obs_t_k):
        c, indices, vq_loss = self.encode(obs_t, obs_t_k)
        pred_feat = self.decode(c, obs_t)

        # Ground-truth features
        with torch.no_grad():
            target_feat = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]

        mse_loss = nn.functional.mse_loss(pred_feat, target_feat)
        total_loss = mse_loss + vq_loss

        return total_loss, indices

Training script

from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_lam(
    model: LatentActionModel,
    dataloader: DataLoader,  # the batch size (256 in this setup) is set on the DataLoader
    num_steps: int = 30_000,
    lr: float = 1e-4,
):
    model = model.cuda()
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)

    model.train()
    step = 0

    while step < num_steps:  # keep cycling through the data until enough steps are done
        for obs_t, obs_t_k in dataloader:
            obs_t = obs_t.cuda()
            obs_t_k = obs_t_k.cuda()

            loss, indices = model(obs_t, obs_t_k)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

            step += 1
            if step % 100 == 0:
                print(f"Step {step}/{num_steps}, Loss: {loss.item():.4f}")

            if step >= num_steps:
                break

    return model
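The training loop above expects a dataloader that yields (obs_t, obs_t_k) pairs. One simple way to build them, sketched here, is to enumerate index pairs (t, t + k) inside each clip and load the two frames for every pair. The function name and the example offset k = 10 are illustrative assumptions; the article does not state the value of k used in the paper.

```python
def frame_pairs(num_frames: int, k: int, stride: int = 1) -> list[tuple[int, int]]:
    """Index pairs (t, t + k) for one clip with `num_frames` frames.

    Hypothetical helper: `k` is the look-ahead in frames, `stride` thins out
    the starting positions to reduce overlap between neighbouring pairs.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # The last valid start is num_frames - k - 1, so that t + k stays in range.
    return [(t, t + k) for t in range(0, num_frames - k, stride)]

# A 3-second clip at 30 fps has 90 frames.
pairs = frame_pairs(num_frames=90, k=10)
print(len(pairs), pairs[0], pairs[-1])
```

Clips shorter than k + 1 frames produce no pairs, so they can be skipped when building the dataset.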

Run it separately for each data type:

# Train the manipulation LAM
python train_lam.py --data-type manipulation --data-dir data/manipulation/ \
  --steps 30000 --batch-size 256 --gpus 8

# Train the locomotion LAM
python train_lam.py --data-type locomotion --data-dir data/loco_videos/ \
  --steps 30000 --batch-size 256 --gpus 8

Step 4: Train the VLA (Prismatic-7B)

The VLA backbone is Prismatic-7B, a 7-billion-parameter Vision-Language Model with dual-stream image encoding.

AI chips and GPU clusters: training the VLA requires H100-grade hardware

Stage 1: Pretraining with LAM supervision

The VLA learns to predict the latent codes produced by the two LAMs trained in Step 3:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn

class WholebodyVLA(nn.Module):
    """VLA backbone with dual latent action prediction."""

    def __init__(
        self,
        # Placeholder id: verify how the Prismatic checkpoint is published and loaded
        vlm_path: str = "TRI-ML/prismatic-7b",
        lam_mani: LatentActionModel = None,
        lam_loco: LatentActionModel = None,
        codebook_size: int = 8192,
    ):
        super().__init__()

        # Prismatic-7B backbone
        self.vlm = AutoModelForCausalLM.from_pretrained(vlm_path)
        self.tokenizer = AutoTokenizer.from_pretrained(vlm_path)
        # Left-pad so that position -1 is always a real token, not padding
        self.tokenizer.padding_side = "left"
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # LAMs (frozen once their own training is done)
        self.lam_mani = lam_mani
        self.lam_loco = lam_loco

        # Prediction heads for the latent codes
        hidden_size = self.vlm.config.hidden_size  # 4096 for 7B
        self.mani_head = nn.Linear(hidden_size, codebook_size)
        self.loco_head = nn.Linear(hidden_size, codebook_size)

    def forward(
        self,
        images: torch.Tensor,   # (B, C, H, W) egocentric frame
        language: list[str],    # task descriptions
        labels_mani: torch.Tensor = None,  # LAM code indices (B,)
        labels_loco: torch.Tensor = None,
    ):
        # Encode the inputs with the VLM
        inputs = self.tokenizer(language, return_tensors="pt", padding=True)
        outputs = self.vlm(
            input_ids=inputs["input_ids"].to(images.device),
            attention_mask=inputs["attention_mask"].to(images.device),
            pixel_values=images,
            output_hidden_states=True,
        )

        # Take the final hidden state of the last token for prediction
        last_hidden = outputs.hidden_states[-1][:, -1, :]  # (B, hidden_size)

        # Predict the latent codes
        logits_mani = self.mani_head(last_hidden)  # (B, codebook_size)
        logits_loco = self.loco_head(last_hidden)

        loss = torch.zeros((), device=last_hidden.device)
        if labels_mani is not None:
            loss = loss + nn.functional.cross_entropy(logits_mani, labels_mani)
        if labels_loco is not None:
            loss = loss + nn.functional.cross_entropy(logits_loco, labels_loco)

        return loss, logits_mani, logits_loco

Pretraining hyperparameters:

PRETRAIN_CONFIG = {
    "steps": 20_000,
    "batch_size": 1_024,
    "lr": 2e-5,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "gpus": 8,  # H100
}

Stage 2: Fine-tuning with LoRA

After pretraining, fine-tune on the real teleoperation data (150 demos):

from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,               # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

# `wholebodyvla` is an instance of the WholebodyVLA class from Stage 1
model = get_peft_model(wholebodyvla.vlm, lora_config)
model.print_trainable_parameters()
# Trainable params: roughly 17M of 7B (~0.25%), assuming a 32-layer, 4096-wide Llama-style LM

FINETUNE_CONFIG = {
    "steps": 10_000,
    "batch_size": 64,
    "lr": 5e-5,
    "warmup_steps": 200,
}
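The trainable-parameter figure for a LoRA configuration can be estimated by hand: each adapted linear layer gains two low-rank matrices, A (rank × d_in) and B (d_out × rank). The sketch below works through that arithmetic for the rank and target modules used here. The backbone dimensions (32 layers, 4096 × 4096 attention projections) are an assumption based on a Llama-2-7B-style language model and should be checked against the actual Prismatic checkpoint.

```python
def lora_param_count(rank: int, d_in: int, d_out: int,
                     modules_per_layer: int, num_layers: int) -> int:
    """Number of trainable LoRA parameters for identically shaped target modules."""
    per_module = rank * (d_in + d_out)  # A: rank × d_in, B: d_out × rank
    return per_module * modules_per_layer * num_layers

# r=16 on q_proj, k_proj, v_proj, o_proj (4 modules) in each of 32 layers (assumed).
n_lora = lora_param_count(rank=16, d_in=4096, d_out=4096,
                          modules_per_layer=4, num_layers=32)
print(f"{n_lora:,} trainable parameters, {n_lora / 7e9:.2%} of 7B")
```

Comparing this estimate with the output of `print_trainable_parameters()` is a cheap way to confirm that LoRA was attached to the modules you intended.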

Step 5: Train the LMO RL Policy

This is the most technically interesting part: the RL policy learns to control the humanoid's whole body in simulation.

Discrete Command Interface

Instead of traditional velocity tracking, LMO uses discrete flags:

from dataclasses import dataclass
import numpy as np

@dataclass
class LMOCommand:
    """
    Discrete command interface for the LMO RL policy.
    Replaces continuous velocity tracking.
    """
    # Each axis takes a flag in {-1, 0, 1}; 0 means stop.
    # Signs below assume an x-forward, y-left, z-up frame; verify against the paper.
    s_x: int    # Forward (+1) / backward (-1)
    s_y: int    # Lateral: left (+1) / right (-1)
    s_psi: int  # Yaw: counter-clockwise (+1) / clockwise (-1)
    h_star: float  # Target stance height (continuous): [0.7, 1.0] m

def command_to_velocity(cmd: LMOCommand, t: float, alpha: float = 2.0) -> np.ndarray:
    """
    Smooth velocity ramp from a discrete command, to avoid abrupt
    acceleration when starting or stopping.

    The form quoted from the paper is
        v_k^ref(t) = v_k^goal * tanh[alpha * (s_k - s_bar_k(t))]
    where s_bar_k is a smoothed flag state. Its update rule is not given here,
    so this sketch ASSUMES a plain tanh ramp over t, the time in seconds since
    the command was issued. Check the paper before relying on it.
    """
    flags = np.array([cmd.s_x, cmd.s_y, cmd.s_psi], dtype=float)
    v_max = np.array([
        0.5,  # max forward: 0.5 m/s
        0.3,  # max lateral: 0.3 m/s
        0.5,  # max yaw: 0.5 rad/s
    ])

    ramp = np.tanh(alpha * max(t, 0.0))  # 0 at t = 0, approaches 1 as t grows
    v_ref = v_max * flags * ramp

    return v_ref

Observation Space

class LMOObservation:
    """
    O_t = [u_t, ω_t, g_t, q_t, q̇_t, a_{t-1}]
    Everything is proprioceptive: locomotion needs no camera.
    """
    u_t: np.ndarray     # LMO command: (4,) = [s_x, s_y, s_psi, h*]
    omega_t: np.ndarray # Base angular velocity: (3,)
    g_t: np.ndarray     # Gravity vector in body frame: (3,)
    q_t: np.ndarray     # Joint positions: (N_joints,)
    q_dot_t: np.ndarray # Joint velocities: (N_joints,)
    a_prev: np.ndarray  # Previous action: (N_joints,)

    # AgiBot X2: 6-DoF legs × 2 + 1-DoF waist = 13 joints for locomotion
    # Total obs dim: 4 + 3 + 3 + 13 + 13 + 13 = 49
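When debugging a policy it is handy to know which slice of the flat observation vector holds which quantity. The sketch below derives that layout, and the total dimension, from the component sizes listed above. The function name and the dictionary keys are illustrative choices.

```python
def lmo_obs_layout(n_joints: int, cmd_dim: int = 4) -> tuple[dict[str, slice], int]:
    """Map each observation component to its slice in the flat vector."""
    sizes = [
        ("command", cmd_dim),       # [s_x, s_y, s_psi, h*]
        ("base_ang_vel", 3),
        ("gravity", 3),
        ("joint_pos", n_joints),
        ("joint_vel", n_joints),
        ("prev_action", n_joints),
    ]
    layout, start = {}, 0
    for name, size in sizes:
        layout[name] = slice(start, start + size)
        start += size
    return layout, start

# 2 legs × 6 DoF + 1 waist joint = 13 locomotion joints.
layout, obs_dim = lmo_obs_layout(n_joints=2 * 6 + 1)
print(obs_dim, layout["joint_pos"])
```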

Two-Stage Curriculum with Isaac Lab

# configs/lmo_rl_stage1.yaml: basic gait acquisition
stage: 1
env:
  num_envs: 4096       # Parallel environments on the H100
  episode_length: 500  # Steps per episode

curriculum:
  goal_speed_range: [0.1, 0.5]  # m/s, increased gradually during training
  goal_yaw_range: [-0.3, 0.3]   # rad/s

rewards:
  tracking_lin_vel: 1.5    # Track the velocity target
  tracking_ang_vel: 0.75   # Track the yaw target
  lin_vel_z_penalty: -2.0  # Discourage bouncing
  ang_vel_xy_penalty: -0.05
  joint_torque_penalty: -0.0002
  action_rate_penalty: -0.01
  feet_air_time: 1.0       # Encourage even steps

domain_randomization:
  mass_ratio: [0.8, 1.2]
  com_offset: [-0.05, 0.05]  # m
  friction_coeff: [0.3, 1.0]
  motor_kp: [0.8, 1.2]       # ratio
  control_delay: [0, 3]      # timesteps

# configs/lmo_rl_stage2.yaml: precision refinement + manipulation perturbations
stage: 2
env:
  num_envs: 2048

curriculum:
  goal_speed: 0.5  # Fixed cruising speed

rewards:
  directional_accuracy: 2.0  # J_dir = |wrap(psi_end - psi_start)|
  stand_still_penalty: -1.0  # When command = 0 but the robot is still moving

  # Perturbations from AgiBot-World trajectories
  arm_perturbation:
    enabled: true
    trajectory_file: data/manipulation/trajectories.npy
    strength_range: [0.5, 1.0]  # Stronger than in Stage I

domain_randomization:
  # Intensity increases in Stage II
  mass_ratio: [0.7, 1.3]
  friction_coeff: [0.2, 1.2]
  ground_unevenness: [0.0, 0.02]  # Adds uneven terrain
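Domain randomization ranges like the ones above are typically turned into concrete per-environment values by sampling once at every episode reset. The sketch below does this for the Stage 1 ranges. Uniform sampling and the helper names are assumptions made for illustration; the article does not say which distribution the paper uses.

```python
import random

# Continuous Stage 1 ranges, copied from the config above.
STAGE1_RANGES = {
    "mass_ratio": (0.8, 1.2),
    "com_offset": (-0.05, 0.05),   # m
    "friction_coeff": (0.3, 1.0),
    "motor_kp": (0.8, 1.2),        # ratio
}

def sample_randomization(ranges: dict[str, tuple[float, float]],
                         rng: random.Random) -> dict[str, float]:
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def sample_control_delay(rng: random.Random, low: int = 0, high: int = 3) -> int:
    """Control delay in whole timesteps, inclusive on both ends."""
    return rng.randint(low, high)

rng = random.Random(0)  # fixed seed so runs are repeatable
params = sample_randomization(STAGE1_RANGES, rng)
delay = sample_control_delay(rng)
print(params, delay)
```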

Run the RL training:

# Stage I: ~2 days on 1× H100
python train_lmo.py --config configs/lmo_rl_stage1.yaml \
  --robot agibot_x2 --sim isaac_lab

# Stage II: ~1 day on 1× H100 (starting from the Stage I checkpoint)
python train_lmo.py --config configs/lmo_rl_stage2.yaml \
  --robot agibot_x2 --sim isaac_lab \
  --resume checkpoints/lmo_stage1_final.pt

Step 6: Inference Deployment

This is where the two training tracks meet. The system runs two processes that communicate over ZeroMQ:

Edge computing and onboard hardware: WholebodyVLA inference is split into two layers, the VLA on an RTX 4090 and the RL policy on the onboard NanoPi

import zmq
import numpy as np
import time
import torch

class WholebodyVLAInferenceServer:
    """
    VLA inference server running on an offboard GPU (RTX 4090).
    Rate: ~10 Hz (limited by VLM inference latency).
    """

    def __init__(self, vla_model, lam_mani, lam_loco, port=5555):
        self.vla = vla_model
        self.lam_mani = lam_mani
        self.lam_loco = lam_loco

        # ZeroMQ server
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REP)
        self.socket.bind(f"tcp://*:{port}")
        print(f"VLA Server running on port {port}")

    def run(self):
        while True:
            # Receive an observation from the robot.
            # recv_pyobj unpickles the payload, so use it only on a trusted network.
            obs = self.socket.recv_pyobj()

            image = obs["egocentric_image"]    # (H, W, 3)
            language = obs["task_description"] # str
            robot_state = obs["proprioception"] # joint states, etc.

            # VLA inference
            with torch.no_grad():
                # preprocess_image is a placeholder for the VLM's own image transform
                image_tensor = preprocess_image(image).cuda()
                _, logits_mani, logits_loco = self.vla(
                    images=image_tensor.unsqueeze(0),
                    language=[language],
                )

                # Decode latent codes → robot commands
                mani_code = logits_mani.argmax(dim=-1)
                loco_code = logits_loco.argmax(dim=-1)

                # decode_to_joints / decode_to_command are placeholders for the
                # action decoders; the LAM class in Step 3 does not define them
                arm_actions = self.lam_mani.decode_to_joints(mani_code)
                loco_cmd = self.lam_loco.decode_to_command(loco_code)

            # Send the action back to the robot
            response = {
                "arm_actions": arm_actions.cpu().numpy(),  # (14,) joints
                "loco_command": loco_cmd.cpu().numpy(),    # (4,) = [s_x, s_y, s_psi, h*]
            }
            self.socket.send_pyobj(response)

class LMOPolicyClient:
    """
    RL policy client running onboard the robot (NanoPi M6 or similar).
    Rate: 50 Hz. Receives LMO commands from the VLA server and outputs joint torques.
    """

    def __init__(self, lmo_policy, vla_server_ip, port=5555):
        self.lmo = lmo_policy

        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REQ)
        self.socket.connect(f"tcp://{vla_server_ip}:{port}")

        # A REQ socket must alternate send and recv, so track the pending request
        self.awaiting_reply = False
        self.last_loco_cmd = np.zeros(4)  # Hold the last command if the VLA lags

    def step(self, robot_obs: dict):
        """Run one control step @ 50 Hz."""

        # Send a new request only once the previous reply has arrived
        if not self.awaiting_reply:
            self.socket.send_pyobj(robot_obs)
            self.awaiting_reply = True

        # poll(0) returns immediately, so the 20 ms control budget is not spent waiting
        if self.socket.poll(timeout=0):
            vla_output = self.socket.recv_pyobj()
            self.last_loco_cmd = np.asarray(vla_output["loco_command"]).reshape(-1)
            self.awaiting_reply = False
        # Otherwise the VLA is not ready yet → keep using the last command

        # The RL policy runs @ 50 Hz on proprioception
        obs_vector = self._build_obs(robot_obs, self.last_loco_cmd)

        with torch.no_grad():
            action = self.lmo(obs_vector.unsqueeze(0))  # (1, N_joints)

        return action.squeeze(0).cpu().numpy()

    def _build_obs(self, robot_obs, loco_cmd):
        """Build the observation vector for the LMO policy."""
        # Kept on CPU: a NanoPi-class board has no CUDA device
        return torch.tensor(np.concatenate([
            loco_cmd,                      # (4,)
            robot_obs["base_ang_vel"],     # (3,)
            robot_obs["gravity_vec"],      # (3,)
            robot_obs["joint_pos"],        # (N,)
            robot_obs["joint_vel"],        # (N,)
            robot_obs["prev_action"],      # (N,)
        ]), dtype=torch.float32)
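Holding the last command is reasonable for a few missed VLA updates, but a robot that keeps walking after the link drops is a safety problem. The sketch below adds a staleness check: once the latest command is older than a threshold, the motion flags fall back to zero while the last stance height is kept. The class name, the 0.5 s threshold, and the default height of 1.0 m are illustrative assumptions, not values from the paper.

```python
class CommandHolder:
    """Keeps the latest [s_x, s_y, s_psi, h*] command and stops when it goes stale."""

    def __init__(self, max_age_s: float = 0.5, default_height: float = 1.0):
        self.max_age_s = max_age_s
        self._cmd = [0, 0, 0, default_height]  # start from a standing "stop" command
        self._stamp = None                     # time of the last update, if any

    def update(self, cmd, now: float) -> None:
        """Store a fresh command together with its arrival time (seconds)."""
        self._cmd = list(cmd)
        self._stamp = now

    def get(self, now: float) -> list:
        """Return the held command, or a stop command if it is too old."""
        stale = self._stamp is None or (now - self._stamp) > self.max_age_s
        if stale:
            return [0, 0, 0, self._cmd[3]]  # zero the motion flags, keep the height
        return list(self._cmd)

holder = CommandHolder(max_age_s=0.5)
holder.update([1, 0, 0, 0.8], now=1.0)
print(holder.get(now=1.2))  # fresh: command is passed through
print(holder.get(now=2.0))  # stale: motion flags are zeroed
```

In the control loop, `now` would come from a monotonic clock such as `time.monotonic()`.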

Run inference:

# Terminal 1 (GPU machine with the RTX 4090):
python inference_server.py \
  --vla-checkpoint checkpoints/wholebodyvla_final.pt \
  --lam-mani checkpoints/lam_mani.pt \
  --lam-loco checkpoints/lam_loco.pt \
  --port 5555

# Terminal 2 (onboard the robot):
python robot_controller.py \
  --lmo-checkpoint checkpoints/lmo_stage2_final.pt \
  --vla-server 192.168.1.100 \
  --port 5555 \
  --control-freq 50

Expected results

If you implement the paper faithfully and have enough data:

Task	WholebodyVLA	Best baseline
Bag Packing	~75%	~60%
Box Loading	~80%	~65%
Cart Pushing	~79%	~57%
Average	~78%	~64%

Key ablation: if you drop LAM pretraining (supervised learning from demos only), performance falls to ~39%. This is the largest gap, which makes collecting egocentric video the most worthwhile investment.

Checklist before you run

□ Enough hardware (at least 1× H100 for training + 1× RTX 4090 for inference)
□ Intel RealSense D435i camera installed and working
□ 50+ hours of egocentric locomotion video collected
□ AgiBot World dataset downloaded (or equivalent manipulation data)
□ Robot teleoperation data: 50+ demos/task
□ Isaac Lab v2.3+ installed and tested with a humanoid URDF
□ Stable ZeroMQ network between the VLA server and the onboard computer (<50ms latency)
□ LAM training has converged (check that codebook utilization is >50%)
□ VLA validation loss decreases during fine-tuning
□ LMO policy passes the Stage I tests (straight-line walking, turning)
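The codebook utilization check in the list above can be computed from the code indices the LAM emits over a validation set: it is the fraction of codebook entries that are selected at least once. A minimal sketch follows; the function name is illustrative, and tensor indices should be converted first, for example with `.tolist()`.

```python
def codebook_utilization(indices, codebook_size: int) -> float:
    """Fraction of codebook entries used at least once.

    `indices` is any iterable of integer code indices collected during evaluation.
    """
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    return len(set(indices)) / codebook_size

# Toy example: 3 distinct codes used out of a codebook of 8.
utilization = codebook_utilization([0, 1, 1, 3], codebook_size=8)
print(f"{utilization:.1%}")
```

A persistently low value usually indicates codebook collapse, where most inputs map to a handful of codes.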

WholebodyVLA sets a new bar for humanoid loco-manipulation, but it is also the system with the most moving parts you are likely to meet in robot learning. Understanding each component on its own before combining them is the right strategy.

Read more about the individual components in VLA Models: AI Series 5 and Isaac Lab for Robotics Simulation.

Bài viết liên quan

Lưu ý: Tại thời điểm viết bài (tháng 4/2026), team OpenDriveLab chưa release code chính thức. Tuy nhiên, paper cung cấp đủ chi tiết kỹ thuật để bạn hiểu toàn bộ quy trình và chuẩn bị sẵn sàng khi code được công bố — hoặc tự implement theo thiết kế của paper.

Bức tranh tổng thể: 3 component, 2 giai đoạn training

Trước khi đi vào từng bước, hãy nắm rõ cấu trúc của WholebodyVLA:

[Dữ liệu Egocentric Video] ──► LAM Training ──► Latent Codes
         │                                              │
[Manipulation Data]           ──────────────────────────┤
         │                                              ▼
[Demo Teleoperation]  ──────► VLA Fine-tuning (Prismatic-7B + LoRA)
                                                        │
[Isaac Lab Simulation] ──────► LMO RL Policy Training   │
                                                        ▼
                              ┌─────────────────────────┐
                              │   Inference @ 10/50 Hz   │
                              └─────────────────────────┘

Pipeline gồm 2 giai đoạn song song:

Pretraining LAM + VLA: Học cách "đọc" video egocentric và chuyển thành latent action codes.
RL Policy Training: Huấn luyện LMO (Loco-Manipulation-Oriented) policy trong simulation.

Hai nhánh hội tụ tại inference, nơi VLA ra lệnh và LMO thực thi ở tần số cao hơn.

Yêu cầu phần cứng

Đây là con số thực tế từ paper — không phải dùng laptop nhà bạn được:

Tác vụ	Phần cứng	Thời gian
LAM Training	8× NVIDIA H100	~30,000 steps
VLA Pretraining	8× NVIDIA H100	~20,000 steps
VLA Fine-tuning	8× NVIDIA H100	~10,000 steps
LMO RL Training	1× NVIDIA H100	Vài ngày
Inference (VLA)	RTX 4090	~10 Hz
Inference (RL)	NanoPi (onboard)	50 Hz

Bạn không cần cluster H100 để thử nghiệm — nhưng để reproduce kết quả paper (78% success), bạn cần ít nhất 1× H100 hoặc tương đương A100 80GB để training.

Bước 1: Cài đặt môi trường

# Python environment
conda create -n wholebodyvla python=3.10
conda activate wholebodyvla

# Core dependencies
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate einops timm
pip install open-clip-torch  # CLIP encoder cho VLA

# DINOv2 (encoder cho LAM)
pip install git+https://github.com/facebookresearch/dinov2.git

# Isaac Lab (cho LMO RL training)
# Theo hướng dẫn chính thức: https://isaac-sim.github.io/IsaacLab/
pip install isaaclab

# ZeroMQ (inference communication)
pip install pyzmq
pip install pyrealsense2  # Intel RealSense SDK

Bước 2: Thu thập dữ liệu

Đây là bước tốn thời gian nhất và thường bị underestimate. WholebodyVLA dùng 3 loại dữ liệu khác nhau cho 3 mục đích khác nhau.

2.1 Egocentric Locomotion Video (cho Locomotion LAM)

Mục tiêu: Thu thập video từ góc nhìn thứ nhất của người đang di chuyển đến mục tiêu thao tác, không cần robot, không cần action labels.

Setup camera:

import pyrealsense2 as rs
import numpy as np
import cv2

pipeline = rs.pipeline()
config = rs.config()

# Intel RealSense D435i: 640x480 @ 30fps
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.accel)  # IMU cho locomotion tracking
config.enable_stream(rs.stream.gyro)

profile = pipeline.start(config)

8 motion primitives cần thu thập:

Primitive	Mô tả	Số lượng đề xuất
Advance	Đi thẳng đến mục tiêu	500 clips
Turn Left/Right	Xoay người để canh hướng	400 clips
Sidestep	Bước ngang sang trái/phải	300 clips
Squat Down	Cúi xuống để lấy đồ thấp	300 clips
Stand Up	Đứng dậy sau khi lấy đồ	300 clips
Approach Near	Tiến sát vào đối tượng	400 clips
Back Away	Lùi ra sau	200 clips
Stop & Ready	Dừng, tay chuẩn bị thao tác	300 clips

Paper thu thập ~300 giờ video — nhưng thử nghiệm ablation cho thấy bạn có thể bắt đầu với 50 giờ và đạt được 60%+ benefit.

# Script thu thập và lưu video theo primitives
import os, time, json

def collect_primitive(primitive_name: str, duration_secs: int = 3):
    """Thu thập một clip motion primitive."""
    timestamp = int(time.time())
    frames = []
    
    start = time.time()
    while time.time() - start < duration_secs:
        frameset = pipeline.wait_for_frames()
        color_frame = frameset.get_color_frame()
        
        if color_frame:
            img = np.asanyarray(color_frame.get_data())
            frames.append(img)
    
    # Lưu clip
    out_path = f"data/loco_videos/{primitive_name}/{timestamp}.mp4"
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(out_path, fourcc, 30, (640, 480))
    for f in frames:
        out.write(f)
    out.release()
    
    return out_path

2.2 Manipulation Data (AgiBot World Dataset)

Paper dùng AgiBot World dataset cho manipulation pretraining — dataset gồm các task bimanual manipulation với robot thật.

# Download AgiBot World (nếu được public)
wget https://huggingface.co/datasets/agibot-world/agibot-world-v1/...

# Cấu trúc thư mục mong đợi:
data/
  manipulation/
    task_001/
      episode_000/
        images/  # frame-by-frame
        actions.npy  # joint angles
        lang.txt  # task description
    task_002/
    ...

2.3 Robot Teleoperation Data (cho Fine-tuning)

Đây là dữ liệu thu thập trực tiếp trên robot AgiBot X2 — bạn cần robot thật hoặc simulation tương đương:

Upper body: VR headset (Meta Quest 3 hoặc HTC Vive) mapping hand pose → joint targets
Lower body: Joystick → locomotion commands
Số lượng: 50 demonstrations/task × 3 tasks = 150 total demos

3 tasks chính: Bag Packing, Box Loading, Cart Pushing.

# Format dữ liệu teleoperation
import numpy as np

class DemoDataset:
    """Dataset format cho WholebodyVLA fine-tuning."""
    
    def __init__(self, demo_dir: str):
        self.episodes = self._load_episodes(demo_dir)
    
    def _load_episodes(self, demo_dir):
        episodes = []
        for ep_dir in sorted(os.listdir(demo_dir)):
            ep = {
                "images": [],      # (T, H, W, 3) egocentric frames
                "arm_actions": [],  # (T, 14) = 7 DoF × 2 arms
                "loco_cmds": [],   # (T, 4) = [s_x, s_y, s_psi, h*]
                "language": "",    # task description string
            }
            # Load từng frame và action...
            episodes.append(ep)
        return episodes

Humanoid robot di chuyển trong không gian rộng — bài toán loco-manipulation đòi hỏi cả tay lẫn chân phối hợp chặt chẽ

Bước 3: Training Latent Action Model (LAM)

LAM là "bộ dịch" chuyển đổi video frames thành latent codes, sau đó VLA học cách predict các codes này.

Paper train hai LAMs riêng biệt:

Manipulation LAM: Camera tĩnh, tay di chuyển
Locomotion LAM: Camera di chuyển theo người

Kiến trúc VQ-VAE

import torch
import torch.nn as nn
from transformers import Dinov2Model

class LatentActionModel(nn.Module):
    """VQ-VAE với DINOv2 encoder cho WholebodyVLA."""
    
    def __init__(
        self,
        codebook_size: int = 8192,
        latent_dim: int = 256,
        commitment_cost: float = 0.25,  # beta trong paper
    ):
        super().__init__()
        
        # DINOv2-base làm feature encoder
        self.encoder = Dinov2Model.from_pretrained("facebook/dinov2-base")
        
        # Projection từ DINOv2 features (768) → latent_dim
        self.encoder_proj = nn.Linear(768 * 2, latent_dim)
        
        # Vector Quantizer
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.commitment_cost = commitment_cost
        
        # Decoder để reconstruct future frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 768),  # back to DINOv2 feature space
        )
    
    def encode(self, obs_t: torch.Tensor, obs_t_k: torch.Tensor):
        """
        Args:
            obs_t: Current frame (B, C, H, W)
            obs_t_k: Frame k steps in the future (B, C, H, W)
        Returns:
            Straight-through quantized code, code indices, and the VQ loss
        """
        # Extract [CLS] features from both frames
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]
        feat_tk = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
        
        # Concatenate and project
        z = self.encoder_proj(torch.cat([feat_t, feat_tk], dim=-1))  # (B, latent_dim)
        
        # Vector quantization: pick the nearest codebook entry for each z
        distances = torch.cdist(z, self.codebook.weight)  # (B, codebook_size)
        indices = distances.argmin(dim=-1)  # (B,)
        c = self.codebook(indices)  # (B, latent_dim)
        
        # Codebook loss pulls the selected codes toward the encoder output;
        # the commitment loss (weighted by beta) pulls the encoder output toward the codes
        codebook_loss = torch.mean((c - z.detach()) ** 2)
        commit_loss = self.commitment_cost * torch.mean((z - c.detach()) ** 2)
        
        # Straight-through estimator: the forward pass uses c, gradients flow to z
        c_st = z + (c - z).detach()
        
        return c_st, indices, codebook_loss + commit_loss
    
    def decode(self, c: torch.Tensor, obs_t: torch.Tensor):
        """Predict future-frame features from the latent code and the current frame."""
        feat_t = self.encoder(pixel_values=obs_t).last_hidden_state[:, 0]  # (B, 768)
        # Residual prediction: the decoder maps c (latent_dim) to a 768-d feature delta
        pred_feat = feat_t + self.decoder(c)
        return pred_feat
    
    def forward(self, obs_t, obs_t_k):
        c, indices, vq_loss = self.encode(obs_t, obs_t_k)
        pred_feat = self.decode(c, obs_t)
        
        # Ground truth features
        with torch.no_grad():
            target_feat = self.encoder(pixel_values=obs_t_k).last_hidden_state[:, 0]
        
        mse_loss = nn.functional.mse_loss(pred_feat, target_feat)
        total_loss = mse_loss + vq_loss
        
        return total_loss, indices
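
The `indices` returned by `forward` give a cheap way to watch for codebook collapse during training. The sketch below is a minimal, standard-library-only illustration; the helper name `codebook_stats` and the use of perplexity as a second metric are assumptions added here, not details taken from the paper.

```python
import math
from collections import Counter

def codebook_stats(indices, codebook_size):
    """Return (utilization, perplexity) for a list of code indices.

    utilization: fraction of codebook entries used at least once.
    perplexity:  exp(entropy) of the empirical code distribution; it equals
                 the number of codes in use when usage is perfectly uniform.
    """
    counts = Counter(indices)
    total = sum(counts.values())
    utilization = len(counts) / codebook_size
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return utilization, math.exp(entropy)

# Toy example: 8 indices drawn from a codebook with 4 entries,
# where only codes 0 and 1 are ever selected.
util, ppl = codebook_stats([0, 1, 0, 1, 0, 1, 0, 1], codebook_size=4)
print(f"utilization={util:.2f}, perplexity={ppl:.2f}")
```

With real tensors you would pass something like `indices.flatten().tolist()` accumulated over many batches: a single batch of 256 samples can touch at most 256 of the 8192 codes, so per-batch utilization is not a meaningful number.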

Training script

from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_lam(
    model: LatentActionModel,
    dataloader: DataLoader,
    num_steps: int = 30_000,
    batch_size: int = 256,
    lr: float = 1e-4,
):
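    # batch_size is informational only: the DataLoader passed in must already use it
    model = model.cuda()  # batches are moved to the GPU below, so the model must be there too
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)
    
    model.train()
    step = 0
    
    for epoch in range(1000):  # Upper bound on epochs; returns once num_steps is reached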
        for obs_t, obs_t_k in dataloader:
            obs_t = obs_t.cuda()
            obs_t_k = obs_t_k.cuda()
            
            loss, indices = model(obs_t, obs_t_k)
            
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            
            step += 1
            if step % 100 == 0:
                print(f"Step {step}/{num_steps}, Loss: {loss.item():.4f}")
            
            if step >= num_steps:
                return model
    
    return model

Run it separately for each data type:

# Train manipulation LAM
python train_lam.py --data-type manipulation --data-dir data/manipulation/ \
  --steps 30000 --batch-size 256 --gpus 8

# Train locomotion LAM
python train_lam.py --data-type locomotion --data-dir data/loco_videos/ \
  --steps 30000 --batch-size 256 --gpus 8
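
The training loop above expects a DataLoader that yields `(obs_t, obs_t_k)` pairs. As a rough sketch of how those pairs could be enumerated from one clip, the helper below lists valid frame-index pairs. The function name `frame_pairs`, the horizon `k`, and the `stride` are hypothetical and chosen for illustration; the values the paper uses are not stated here.

```python
def frame_pairs(num_frames, k, stride=1):
    """List (t, t + k) index pairs that stay inside a clip of num_frames frames."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # The last valid start index is num_frames - k - 1
    return [(t, t + k) for t in range(0, num_frames - k, stride)]

# A 10-frame clip, looking 4 frames ahead, sampling every 2nd start frame
pairs = frame_pairs(num_frames=10, k=4, stride=2)
print(pairs)
```

A dataset class would then load the two frames for each pair and apply the same image preprocessing to both.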

Step 4: Training the VLA (Prismatic-7B)

The VLA backbone is Prismatic-7B, a Vision-Language Model with 7 billion parameters and dual-stream image encoding.

AI chip architecture and GPU cluster: training the VLA requires H100-class hardware

Stage 1: Pretraining with LAM supervision

The VLA learns to predict the latent codes produced by the two LAMs trained in Step 3:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn

class WholebodyVLA(nn.Module):
    """VLA backbone with dual latent action prediction."""
    
    def __init__(
        self,
        vlm_path: str = "TRI-ML/prismatic-7b",
        lam_mani: LatentActionModel = None,
        lam_loco: LatentActionModel = None,
        codebook_size: int = 8192,
    ):
        super().__init__()
        
        # Prismatic-7B backbone. The default vlm_path and the Auto* loaders are
        # placeholders: the actual checkpoint may need its own loading code.
        self.vlm = AutoModelForCausalLM.from_pretrained(vlm_path)
        self.tokenizer = AutoTokenizer.from_pretrained(vlm_path)
        
        # LAMs (frozen after pretraining)
        self.lam_mani = lam_mani
        self.lam_loco = lam_loco
        
        # Prediction heads for the latent codes
        hidden_size = self.vlm.config.hidden_size  # 4096 for the 7B model
        self.mani_head = nn.Linear(hidden_size, codebook_size)
        self.loco_head = nn.Linear(hidden_size, codebook_size)
    
    def forward(
        self,
        images: torch.Tensor,   # (B, C, H, W) egocentric frame
        language: list[str],    # task descriptions
        labels_mani: torch.Tensor = None,  # LAM code indices (B,)
        labels_loco: torch.Tensor = None,
    ):
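        # Encode the inputs with the VLM
        inputs = self.tokenizer(language, return_tensors="pt", padding=True)
        device = images.device  # keep the text tokens on the same device as the images
        outputs = self.vlm(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            pixel_values=images,
            output_hidden_states=True,
        )
        
        # Use the final position's last-layer hidden state for prediction
        # (assumes left padding, so position -1 is a real token)
        last_hidden = outputs.hidden_states[-1][:, -1, :]  # (B, hidden_size)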
        
        # Predict latent codes
        logits_mani = self.mani_head(last_hidden)  # (B, codebook_size)
        logits_loco = self.loco_head(last_hidden)
        
        loss = 0
        if labels_mani is not None:
            loss += nn.functional.cross_entropy(logits_mani, labels_mani)
        if labels_loco is not None:
            loss += nn.functional.cross_entropy(logits_loco, labels_loco)
        
        return loss, logits_mani, logits_loco

Pretraining hyperparameters:

PRETRAIN_CONFIG = {
    "steps": 20_000,
    "batch_size": 1_024,
    "lr": 2e-5,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "gpus": 8,  # H100
}
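
To make the `warmup_steps` entry concrete, here is a minimal sketch of a learning-rate schedule built from the values above. Linear warmup followed by cosine decay to zero is an assumption made for illustration (it mirrors the cosine scheduler used for the LAM); the function name `lr_at` is hypothetical.

```python
import math

def lr_at(step, base_lr=2e-5, warmup_steps=500, total_steps=20_000):
    """Linear warmup to base_lr, then cosine decay to zero (decay shape is assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Print the learning rate at the start, the end of warmup, the midpoint, and the end
for s in (0, 499, 500, 10_250, 20_000):
    print(s, f"{lr_at(s):.2e}")
```

The rate reaches the full 2e-5 at the end of warmup, is half of that at the midpoint of the decay phase, and reaches zero at step 20,000.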

Stage 2: Fine-tuning with LoRA

After pretraining, fine-tune on real teleoperation data (150 demos):

from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,               # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

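# `wholebodyvla` is an already-constructed WholebodyVLA instance
model = get_peft_model(wholebodyvla.vlm, lora_config)
model.print_trainable_parameters()
# For a Llama-2-7B-style backbone (32 layers, hidden size 4096), rank 16 on
# q/k/v/o gives about 16.8M trainable parameters (~0.24% of 7B)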

FINETUNE_CONFIG = {
    "steps": 10_000,
    "batch_size": 64,
    "lr": 5e-5,
    "warmup_steps": 200,
}
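
The number of trainable LoRA parameters follows directly from the rank and the shapes of the adapted matrices: each adapted weight of shape (d_out, d_in) adds rank × (d_in + d_out) parameters. The sketch below works through that arithmetic. The backbone shape (32 decoder layers, hidden size 4096, square attention projections) is an assumption based on a Llama-2-7B-style model, and `lora_param_count` is a hypothetical helper; the figure reported by `print_trainable_parameters()` on the real checkpoint is the one to trust.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Trainable LoRA parameters: each adapted (d_out, d_in) matrix adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)
    return per_layer * num_layers

# Assumed backbone shape: 32 decoder layers, hidden size 4096,
# with square q/k/v/o projections.
hidden = 4096
shapes = [(hidden, hidden)] * 4  # q_proj, k_proj, v_proj, o_proj
total = lora_param_count(rank=16, layer_shapes=shapes, num_layers=32)
print(f"{total:,} trainable LoRA parameters ({100 * total / 7e9:.2f}% of 7B)")
```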

Step 5: Training the LMO RL Policy

This is the most technically interesting part: the RL policy learns whole-body humanoid control in simulation.

Discrete Command Interface

Instead of traditional velocity tracking, LMO uses discrete flags:

from dataclasses import dataclass

import numpy as np

@dataclass
class LMOCommand:
    """
    Discrete command interface for the LMO RL policy.
    Replaces continuous velocity tracking.
    """
    # Each axis: {-1: backward/right/ccw, 0: stop, 1: forward/left/cw}
    s_x: int    # Forward/backward: {-1, 0, 1}
    s_y: int    # Lateral: {-1, 0, 1}
    s_psi: int  # Yaw rotation: {-1, 0, 1}
    h_star: float  # Target stance height (continuous): [0.7, 1.0] m

def command_to_velocity(cmd: LMOCommand, t: float, alpha: float = 2.0) -> np.ndarray:
    """
    Smooth reference velocity from a discrete command.
    Intended to prevent abrupt acceleration on start/stop.
    
    v_k^ref(t) = v_k^goal * tanh[alpha * (s_k - s_bar_k(t))]
    """
    # v_goal holds unsigned speed limits; the sign comes from the tanh term,
    # so a backward flag (-1) yields a negative velocity.
    v_goal = np.array([
        0.5,  # max forward/backward: 0.5 m/s
        0.3,  # max lateral: 0.3 m/s
        0.5,  # max yaw: 0.5 rad/s
    ])
    s = np.array([cmd.s_x, cmd.s_y, cmd.s_psi], dtype=float)
    
    # s_bar: smooth ramp state (updated by a PD controller in the full system).
    # The closed form below is an illustrative stand-in; with it, v_ref shrinks
    # as s_bar approaches s.
    s_bar = np.tanh(alpha * t) * s
    v_ref = v_goal * np.tanh(alpha * (s - s_bar))
    
    return v_ref
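
Because the command interface is so small, it is cheap to validate every command before it reaches the policy. The following is a minimal sketch of such a check. The class name `ValidatedCommand` and the choice to raise `ValueError` are illustrative assumptions; the allowed flag values and the height range are the ones listed in the comments above.

```python
from dataclasses import dataclass

VALID_FLAGS = (-1, 0, 1)
HEIGHT_RANGE = (0.7, 1.0)  # metres, from the h_star range given above

@dataclass(frozen=True)
class ValidatedCommand:
    s_x: int
    s_y: int
    s_psi: int
    h_star: float

    def __post_init__(self):
        # Reject anything outside the discrete flag set or the height range
        for name in ("s_x", "s_y", "s_psi"):
            if getattr(self, name) not in VALID_FLAGS:
                raise ValueError(f"{name} must be one of {VALID_FLAGS}")
        lo, hi = HEIGHT_RANGE
        if not lo <= self.h_star <= hi:
            raise ValueError(f"h_star must lie in [{lo}, {hi}] m")

    def as_vector(self):
        """Pack into the 4-element layout [s_x, s_y, s_psi, h*]."""
        return [float(self.s_x), float(self.s_y), float(self.s_psi), self.h_star]

walk_forward = ValidatedCommand(s_x=1, s_y=0, s_psi=0, h_star=0.9)
print(walk_forward.as_vector())
```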

Observation Space

class LMOObservation:
    """
    O_t = [u_t, ω_t, g_t, q_t, q̇_t, a_{t-1}]
    All proprioceptive: no camera is needed for locomotion!
    """
    u_t: np.ndarray     # LMO command: (4,) = [s_x, s_y, s_psi, h*]
    omega_t: np.ndarray # Base angular velocity: (3,)
    g_t: np.ndarray     # Gravity vector in body frame: (3,)
    q_t: np.ndarray     # Joint positions: (N_joints,)
    q_dot_t: np.ndarray # Joint velocities: (N_joints,)
    a_prev: np.ndarray  # Previous action: (N_joints,)
    
    # AgiBot X2: 6-DoF legs × 2 + 1-DoF waist = 13 joints for locomotion
    # Total obs dim: 4 + 3 + 3 + 13 + 13 + 13 = 49
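
The dimension count above can be checked mechanically. This short sketch assembles an observation in the order O_t = [u_t, ω_t, g_t, q_t, q̇_t, a_{t-1}] using plain lists; the helper names `lmo_obs_dim` and `build_obs` are hypothetical, and the zero-filled inputs are placeholders rather than real sensor data.

```python
def lmo_obs_dim(n_joints, cmd_dim=4):
    """Size of O_t = [u_t, omega_t, g_t, q_t, q_dot_t, a_prev]."""
    return cmd_dim + 3 + 3 + 3 * n_joints

def build_obs(u, omega, g, q, q_dot, a_prev):
    """Concatenate the six blocks in the order used above, checking sizes."""
    if not (len(q) == len(q_dot) == len(a_prev)):
        raise ValueError("q, q_dot and a_prev must have the same length")
    if len(omega) != 3 or len(g) != 3:
        raise ValueError("omega and g must be 3-vectors")
    return list(u) + list(omega) + list(g) + list(q) + list(q_dot) + list(a_prev)

n = 13  # locomotion joints, as counted above
obs = build_obs([1, 0, 0, 0.9], [0.0] * 3, [0.0, 0.0, -1.0], [0.0] * n, [0.0] * n, [0.0] * n)
print(len(obs), lmo_obs_dim(n))
```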

Two-Stage Curriculum with Isaac Lab

# configs/lmo_rl_stage1.yaml: basic gait acquisition
stage: 1
env:
  num_envs: 4096       # Parallel environments on the H100
  episode_length: 500  # Steps per episode
  
curriculum:
  goal_speed_range: [0.1, 0.5]  # m/s, increased gradually during training
  goal_yaw_range: [-0.3, 0.3]   # rad/s

rewards:
  tracking_lin_vel: 1.5    # Track the velocity target
  tracking_ang_vel: 0.75   # Track the yaw target
  lin_vel_z_penalty: -2.0  # Discourage bouncing
  ang_vel_xy_penalty: -0.05
  joint_torque_penalty: -0.0002
  action_rate_penalty: -0.01
  feet_air_time: 1.0       # Encourage an even stepping rhythm
  
domain_randomization:
  mass_ratio: [0.8, 1.2]
  com_offset: [-0.05, 0.05]  # m
  friction_coeff: [0.3, 1.0]
  motor_kp: [0.8, 1.2]       # ratio
  control_delay: [0, 3]      # timesteps
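
To illustrate how the `domain_randomization` ranges above might be consumed, here is a minimal sampler that draws one parameter set per environment. Uniform sampling and an inclusive integer range for the control delay are assumptions made for this sketch; the names `STAGE1_RANGES` and `sample_randomization` are hypothetical, and a real Isaac Lab setup would configure this through its own event/randomization mechanism instead.

```python
import random

STAGE1_RANGES = {
    "mass_ratio": (0.8, 1.2),
    "com_offset": (-0.05, 0.05),   # m
    "friction_coeff": (0.3, 1.0),
    "motor_kp": (0.8, 1.2),        # ratio
}
CONTROL_DELAY_STEPS = (0, 3)       # integer timesteps, inclusive

def sample_randomization(rng):
    """Draw one set of physics parameters for a single environment."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in STAGE1_RANGES.items()}
    params["control_delay"] = rng.randint(*CONTROL_DELAY_STEPS)
    return params

rng = random.Random(0)  # fixed seed so runs are repeatable
params = sample_randomization(rng)
for name, value in params.items():
    print(name, value)
```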

# configs/lmo_rl_stage2.yaml: precision refinement + manipulation perturbations
stage: 2
env:
  num_envs: 2048
  
curriculum:
  goal_speed: 0.5  # Fixed cruising speed
  
rewards:
  directional_accuracy: 2.0  # J_dir = |wrap(psi_end - psi_start)|
  stand_still_penalty: -1.0  # Applied when the command is 0 but the robot still moves
  
  # Perturbations from AgiBot-World trajectories
  arm_perturbation:
    enabled: true
    trajectory_file: data/manipulation/trajectories.npy
    strength_range: [0.5, 1.0]  # Stronger than in Stage I
  
domain_randomization:
  # Intensity is increased in Stage II
  mass_ratio: [0.7, 1.3]
  friction_coeff: [0.2, 1.2]
  ground_unevenness: [0.0, 0.02]  # Adds uneven terrain
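
The `wrap` in the J_dir comment matters because headings are angles: a naive subtraction across the ±π seam reports a near-full turn when the robot barely rotated. A minimal sketch of the metric follows; the function names are hypothetical, and how J_dir is turned into a reward term is not shown here.

```python
import math

def wrap_to_pi(angle):
    """Map any angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def directional_error(psi_start, psi_end):
    """J_dir = |wrap(psi_end - psi_start)|: absolute heading change over an episode."""
    return abs(wrap_to_pi(psi_end - psi_start))

# Heading goes from 3.1 rad to -3.1 rad: the robot turned about 0.083 rad
# across the +/-pi seam, not 6.2 rad.
print(round(directional_error(3.1, -3.1), 4))
```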

Run RL training:

# Stage I: ~2 days on 1× H100
python train_lmo.py --config configs/lmo_rl_stage1.yaml \
  --robot agibot_x2 --sim isaac_lab

# Stage II: ~1 day on 1× H100 (resuming from the Stage I checkpoint)
python train_lmo.py --config configs/lmo_rl_stage2.yaml \
  --robot agibot_x2 --sim isaac_lab \
  --resume checkpoints/lmo_stage1_final.pt

Step 6: Inference Deployment

This is where the two training branches converge. The system runs two processes that communicate over ZeroMQ:

Edge computing and onboard hardware: WholebodyVLA inference is split into two layers, the VLA on an RTX 4090 and the RL policy on the onboard NanoPi

import zmq
import numpy as np
import time
import torch

class WholebodyVLAInferenceServer:
    """
    VLA inference server running on an offboard GPU (RTX 4090).
    Rate: ~10 Hz (limited by VLM inference latency).
    """
    
    def __init__(self, vla_model, lam_mani, lam_loco, port=5555):
        self.vla = vla_model
        self.lam_mani = lam_mani
        self.lam_loco = lam_loco
        
        # ZeroMQ server
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REP)
        self.socket.bind(f"tcp://*:{port}")
        print(f"VLA Server running on port {port}")
    
    def run(self):
        while True:
            # Receive an observation from the robot
            obs = self.socket.recv_pyobj()
            
            image = obs["egocentric_image"]    # (H, W, 3)
            language = obs["task_description"] # str
            robot_state = obs["proprioception"] # joint states, etc.
            
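            # VLA inference. preprocess_image is a placeholder for the image
            # transform that matches the VLM's vision encoder.
            with torch.no_grad():
                image_tensor = preprocess_image(image).cuda()
                _, logits_mani, logits_loco = self.vla(
                    images=image_tensor.unsqueeze(0),
                    language=[language],
                )
                
                # Decode latent codes → robot commands
                mani_code = logits_mani.argmax(dim=-1)
                loco_code = logits_loco.argmax(dim=-1)
                
                # decode_to_joints / decode_to_command are placeholder decoders;
                # the LatentActionModel class above does not define them.
                arm_actions = self.lam_mani.decode_to_joints(mani_code)
                loco_cmd = self.lam_loco.decode_to_command(loco_code)
            
            # Send the action back to the robot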
            response = {
                "arm_actions": arm_actions.cpu().numpy(),  # (14,) joints
                "loco_command": loco_cmd.cpu().numpy(),    # (4,) = [s_x, s_y, s_psi, h*]
            }
            self.socket.send_pyobj(response)

class LMOPolicyClient:
    """
    RL policy client running onboard the robot (NanoPi M6 or equivalent).
    Rate: 50 Hz. Receives LMO commands from the VLA server and outputs joint torques.
    """
    
    def __init__(self, lmo_policy, vla_server_ip, port=5555):
        self.lmo = lmo_policy
        
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.REQ)
        self.socket.connect(f"tcp://{vla_server_ip}:{port}")
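        self.awaiting_reply = False  # REQ sockets must alternate send and recv
        
        self.last_loco_cmd = np.zeros(4)  # Hold the last command if the VLA lags
    
    def step(self, robot_obs: dict):
        """Run one control step @ 50 Hz."""
        
        # Send a new request only when no reply is outstanding, then check for
        # a reply without blocking so the 50 Hz loop never waits on the VLA.
        if not self.awaiting_reply:
            self.socket.send_pyobj(robot_obs)
            self.awaiting_reply = True
        if self.socket.poll(timeout=0):
            vla_output = self.socket.recv_pyobj()
            self.last_loco_cmd = vla_output["loco_command"]
            self.awaiting_reply = False
        # Otherwise the VLA is not ready yet: keep using the last command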
        
        # The RL policy runs @ 50 Hz on proprioception
        obs_vector = self._build_obs(robot_obs, self.last_loco_cmd)
        
        with torch.no_grad():
            action = self.lmo(obs_vector.unsqueeze(0))  # (1, N_joints)
        
        return action.squeeze(0).cpu().numpy()
    
    def _build_obs(self, robot_obs, loco_cmd):
        """Build the observation vector for the LMO policy."""
        # Stays on the CPU: the onboard NanoPi has no CUDA device
        return torch.tensor(np.concatenate([
            loco_cmd,                      # (4,)
            robot_obs["base_ang_vel"],     # (3,)
            robot_obs["gravity_vec"],      # (3,)
            robot_obs["joint_pos"],        # (N,)
            robot_obs["joint_vel"],        # (N,)
            robot_obs["prev_action"],      # (N,)
        ]), dtype=torch.float32)
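
The rate mismatch between the two processes means each VLA output is reused for several control ticks. The sketch below makes that hold pattern explicit for the nominal 10 Hz and 50 Hz rates. It is an idealized illustration: it assumes VLA outputs arrive exactly on schedule with no jitter, and `command_schedule` is a hypothetical helper, not part of the system.

```python
def command_schedule(n_ticks, control_hz=50, vla_hz=10):
    """For each control tick, return the index of the VLA output in effect.

    Assumes VLA output j arrives exactly at time j / vla_hz and that the
    controller holds the latest output until the next one arrives.
    """
    if control_hz % vla_hz != 0:
        raise ValueError("this sketch assumes an integer rate ratio")
    hold = control_hz // vla_hz  # control ticks per VLA output
    return [tick // hold for tick in range(n_ticks)]

schedule = command_schedule(n_ticks=12)
print(schedule)
```

In practice the VLA latency varies, which is why the client keeps `last_loco_cmd` and falls back to it whenever no fresh reply is available.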

Run inference:

# Terminal 1 (GPU machine with an RTX 4090):
python inference_server.py \
  --vla-checkpoint checkpoints/wholebodyvla_final.pt \
  --lam-mani checkpoints/lam_mani.pt \
  --lam-loco checkpoints/lam_loco.pt \
  --port 5555

# Terminal 2 (onboard robot):
python robot_controller.py \
  --lmo-checkpoint checkpoints/lmo_stage2_final.pt \
  --vla-server 192.168.1.100 \
  --port 5555 \
  --control-freq 50

Expected results

If you implement the pipeline faithfully to the paper and have enough data:

Task	WholebodyVLA	Best baseline
Bag Packing	~75%	~60%
Box Loading	~80%	~65%
Cart Pushing	~79%	~57%
Average	~78%	~64%

Pre-run checklist

□ Sufficient hardware (at least 1× H100, plus 1× RTX 4090 for inference)
□ Intel RealSense D435i camera installed correctly
□ 50+ hours of egocentric locomotion video collected
□ AgiBot World dataset downloaded (or equivalent manipulation data)
□ Robot teleoperation data: 50+ demos per task
□ Isaac Lab v2.3+ installed and tested with the humanoid URDF
□ Stable ZeroMQ network between the VLA server and the onboard robot (<50 ms latency)
□ LAM training has converged (check that codebook utilization is >50%)
□ VLA validation loss decreases during fine-tuning
□ LMO policy passes the Stage I tests (straight-line walking, turning)

For more on the individual components, see VLA Models (AI Series 5) and Isaac Lab for Robotics Simulation.

Nguyễn Anh Tuấn

Related articles

WholebodyVLA Open-Source: Architecture & Code Guide

WholeBodyVLA: Whole-Body VLA for Humanoid Loco-Manipulation

WholeBodyVLA Tutorial: Teleop → Train → Deploy Humanoid
