Humanoid Manipulation: Whole-body Control với LeRobot

Giới thiệu: Đỉnh cao của Manipulation

Trong suốt series này, chúng ta đã đi từ single-arm (bài 3), qua dual-arm (bài 6-7), đến mobile manipulation (bài 8). Mỗi bước đều thêm độ phức tạp — thêm DOF, thêm coordination, thêm perception.

Bài viết này đưa chúng ta đến đỉnh cao: robot hình người (humanoid). Với 20-50+ DOF, hai bàn tay khéo léo (dexterous hands), khả năng di chuyển bằng hai chân, và hình dáng tương tự con người — humanoid robot là nền tảng manipulation tối thượng và cũng là thách thức lớn nhất trong robotics.

Tại sao Humanoid cho Manipulation?

Ưu điểm	Giải thích
Human environment compatible	Thế giới được thiết kế cho người — cửa, bàn, tay vịn
Dexterous hands	5 ngón tay × 4 DOF = 20 DOF mỗi tay
Large workspace	Toàn bộ cơ thể là workspace — tầm với lớn
Whole-body dynamics	Dùng momentum, trọng lực để hỗ trợ manipulation
Transfer from human data	Có thể học từ video người thật

Thách thức	Mô tả
Extreme DOF	30-50+ joints cần điều khiển đồng thời
Balance	Phải giữ cân bằng trong khi manipulation
Contact-rich	Cả chân và tay đều tiếp xúc môi trường
Compute	Inference phải real-time trên edge devices

High-DOF Action Spaces

Anatomy of a Humanoid Action Space

import numpy as np
from dataclasses import dataclass

@dataclass
class HumanoidActionSpace:
    """Action space cho humanoid robot manipulation.
    
    Ví dụ: Unitree H1
    - Base: 6 DOF (floating base — controlled by legs)
    - Left leg: 6 DOF
    - Right leg: 6 DOF
    - Torso: 3 DOF (waist yaw, pitch, roll)
    - Left arm: 7 DOF (shoulder 3 + elbow 1 + wrist 3)
    - Right arm: 7 DOF
    - Left hand: 12 DOF (Inspire hand)
    - Right hand: 12 DOF
    Total: ~59 DOF
    """
    
    # Joint groups
    leg_dof: int = 12      # 6 per leg
    torso_dof: int = 3
    arm_dof: int = 14      # 7 per arm
    hand_dof: int = 24     # 12 per hand (optional)
    
    @property
    def total_dof(self):
        return self.leg_dof + self.torso_dof + self.arm_dof + self.hand_dof
    
    @property
    def manipulation_dof(self):
        """DOF liên quan trực tiếp đến manipulation."""
        return self.torso_dof + self.arm_dof + self.hand_dof
    
    def split_action(self, action):
        """Tách action thành các nhóm joints."""
        idx = 0
        legs = action[idx:idx + self.leg_dof]; idx += self.leg_dof
        torso = action[idx:idx + self.torso_dof]; idx += self.torso_dof
        arms = action[idx:idx + self.arm_dof]; idx += self.arm_dof
        hands = action[idx:idx + self.hand_dof]; idx += self.hand_dof
        
        return {
            "legs": legs,
            "torso": torso,
            "left_arm": arms[:7],
            "right_arm": arms[7:],
            "left_hand": hands[:12],
            "right_hand": hands[12:],
        }


# Ví dụ
action_space = HumanoidActionSpace()
print(f"Total DOF: {action_space.total_dof}")
print(f"Manipulation DOF: {action_space.manipulation_dof}")
# Output: Total DOF: 53, Manipulation DOF: 41

Giảm dimensionality cho learning

class ActionSpaceReduction:
    """Giảm action space cho humanoid để dễ train hơn.
    
    Chiến lược:
    1. Fixed legs: Đứng yên, chỉ train upper body
    2. Coupled joints: Ghép các joints liên quan
    3. Latent actions: Dùng VAE encode action space
    4. Primitive-based: Ghép từ motion primitives
    """
    
    def __init__(self, mode="fixed_legs"):
        self.mode = mode
    
    def reduce(self, full_action):
        """Giảm action space."""
        parts = HumanoidActionSpace().split_action(full_action)
        
        if self.mode == "fixed_legs":
            # Chỉ điều khiển upper body (torso + arms + hands)
            return np.concatenate([
                parts["torso"],      # 3 DOF
                parts["left_arm"],   # 7 DOF
                parts["right_arm"],  # 7 DOF
                parts["left_hand"][:6],   # Simplified hand: 6 DOF
                parts["right_hand"][:6],  # Simplified hand: 6 DOF
            ])
            # Total: 29 DOF (giảm từ 53)
        
        elif self.mode == "coupled_fingers":
            # Ghép ngón tay: open/close/spread thay vì từng joint
            left_hand_simple = np.array([
                parts["left_hand"][:4].mean(),    # Fingers curl
                parts["left_hand"][4:8].mean(),   # Fingers spread
                parts["left_hand"][8:12].mean(),  # Thumb
            ])
            right_hand_simple = np.array([
                parts["right_hand"][:4].mean(),
                parts["right_hand"][4:8].mean(),
                parts["right_hand"][8:12].mean(),
            ])
            
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                left_hand_simple,
                right_hand_simple,
            ])
            # Total: 23 DOF
        
        return full_action


reducer = ActionSpaceReduction(mode="fixed_legs")
full_action = np.random.randn(53)
reduced_action = reducer.reduce(full_action)
print(f"Reduced from {len(full_action)} to {len(reduced_action)} DOF")

Observation Design cho Humanoid

Egocentric cameras

class HumanoidObservation:
    """Thiết kế observation cho humanoid manipulation.
    
    Cameras:
    - Head camera: Bird's eye view, navigation
    - Chest camera: Mid-range manipulation view
    - Left/Right wrist cameras: Close-up manipulation
    
    Proprioception:
    - Joint positions (all joints)
    - Joint velocities
    - IMU data (body orientation)
    - Contact forces (feet + hands)
    """
    
    def __init__(self):
        self.camera_configs = {
            "head": {
                "resolution": (480, 640),
                "fov": 90,
                "position": "head_link",
                "purpose": "Navigation + overview",
            },
            "chest": {
                "resolution": (480, 640),
                "fov": 70,
                "position": "chest_link",
                "purpose": "Mid-range manipulation",
            },
            "left_wrist": {
                "resolution": (480, 640),
                "fov": 90,
                "position": "left_wrist_link",
                "purpose": "Left hand close-up",
            },
            "right_wrist": {
                "resolution": (480, 640),
                "fov": 90,
                "position": "right_wrist_link",
                "purpose": "Right hand close-up",
            },
        }
    
    def get_observation(self, robot) -> dict:
        """Thu thập observation từ humanoid robot."""
        obs = {}
        
        # Camera images
        for cam_name, config in self.camera_configs.items():
            obs[f"observation.images.{cam_name}"] = robot.get_camera_image(
                cam_name, config["resolution"]
            )
        
        # Proprioception
        obs["observation.state"] = np.concatenate([
            robot.get_joint_positions(),    # All joint positions
            robot.get_joint_velocities(),   # All joint velocities
            robot.get_imu_data(),           # Orientation (quaternion)
            robot.get_contact_forces(),     # Feet + hand contacts
        ])
        
        return obs
    
    @property
    def state_dim(self):
        """Tổng dimension của proprioceptive state."""
        return (
            53 +    # Joint positions
            53 +    # Joint velocities
            4 +     # IMU quaternion
            8       # Contact forces (4 foot corners + 2 hands)
        )
        # Total: 118D

Humanoid Foundation Models

GR00T (NVIDIA)

class GR00TInspired:
    """Architecture lấy cảm hứng từ NVIDIA GR00T N1.5.
    
    GR00T N1.5 components:
    - Vision encoder: Pretrained ViT
    - Language encoder: CLIP text encoder  
    - Action head: Diffusion-based action generation
    - Dual-system architecture:
      * System 1 (fast): Reactive policy, low latency
      * System 2 (slow): Deliberative planning
    
    Reference: https://arxiv.org/abs/2503.14734
    """
    
    def __init__(self):
        self.vision_encoder = None  # Pretrained ViT-B/16
        self.language_encoder = None  # CLIP text encoder
        self.action_head = None  # Diffusion action head
    
    def forward(self, images, language_instruction, proprioception):
        """
        Args:
            images: Dict of camera images
            language_instruction: Text instruction
            proprioception: Joint states + IMU
        
        Returns:
            action: Full-body action (53D+)
        """
        # Encode visual observations
        visual_tokens = []
        for cam_name, image in images.items():
            tokens = self.vision_encoder(image)
            visual_tokens.append(tokens)
        
        # Encode language
        lang_tokens = self.language_encoder(language_instruction)
        
        # Concatenate all tokens
        all_tokens = torch.cat(visual_tokens + [lang_tokens], dim=1)
        
        # Generate action via diffusion
        action = self.action_head.sample(
            condition=all_tokens,
            proprioception=proprioception,
        )
        
        return action

Training humanoid manipulation policy với LeRobot

from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

def create_humanoid_manip_policy():
    """Tạo policy cho humanoid manipulation.
    
    Strategy: Fixed legs, chỉ train upper body (29 DOF)
    """
    config = ACTConfig(
        input_shapes={
            "observation.images.head": [3, 480, 640],
            "observation.images.left_wrist": [3, 480, 640],
            "observation.images.right_wrist": [3, 480, 640],
            "observation.state": [70],  # Reduced proprioception
        },
        output_shapes={
            "action": [29],  # Upper body only
        },
        input_normalization_modes={
            "observation.images.head": "mean_std",
            "observation.images.left_wrist": "mean_std",
            "observation.images.right_wrist": "mean_std",
            "observation.state": "min_max",
        },
        output_normalization_modes={
            "action": "min_max",
        },
        
        # Humanoid cần model lớn hơn
        chunk_size=50,
        n_action_steps=50,
        dim_model=768,       # Lớn hơn cho high-DOF
        n_heads=12,
        n_layers=6,
        use_vae=True,
        latent_dim=64,       # Lớn hơn cho action diversity
        kl_weight=10.0,
        
        vision_backbone="resnet18",
        pretrained_backbone=True,
    )
    
    policy = ACTPolicy(config)
    print(f"Parameters: {sum(p.numel() for p in policy.parameters()):,}")
    return policy


def train_humanoid_policy(policy, dataset, num_epochs=500):
    """Training loop cho humanoid manipulation."""
    device = torch.device("cuda")
    policy.to(device)
    
    # Larger batch size nếu GPU đủ VRAM
    optimizer = torch.optim.AdamW(
        policy.parameters(), lr=1e-5, weight_decay=1e-4
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=100, T_mult=2, eta_min=1e-7
    )
    
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=4, shuffle=True, num_workers=4
    )
    
    for epoch in range(num_epochs):
        epoch_loss = 0
        n_batches = 0
        
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            output = policy.forward(batch)
            loss = output["loss"]
            
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 10.0)
            optimizer.step()
            
            epoch_loss += loss.item()
            n_batches += 1
        
        scheduler.step()
        
        if (epoch + 1) % 50 == 0:
            avg_loss = epoch_loss / n_batches
            print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")
    
    return policy

Simulation với Unitree H1/G1

import mujoco

class UnitreeH1Sim:
    """Simulation cho Unitree H1 humanoid trong MuJoCo.
    
    H1 specs:
    - Height: 1.8m
    - Weight: 47kg
    - DOF: 19 (legs 10 + waist 1 + arms 8)
    - Payload: 3kg per arm
    """
    
    def __init__(self, xml_path="unitree_h1/scene.xml"):
        self.model = mujoco.MjModel.from_xml_path(xml_path)
        self.data = mujoco.MjData(self.model)
        
        # Joint mappings
        self.leg_joints = list(range(10))
        self.waist_joint = [10]
        self.arm_joints = list(range(11, 19))
    
    def reset(self, standing=True):
        """Reset robot — đứng thẳng hoặc vị trí custom."""
        mujoco.mj_resetData(self.model, self.data)
        
        if standing:
            # Set standing pose
            self.data.qpos[0:3] = [0, 0, 1.0]  # Position (x, y, z)
            self.data.qpos[3:7] = [1, 0, 0, 0]  # Quaternion (standing)
        
        mujoco.mj_forward(self.model, self.data)
        return self._get_obs()
    
    def step(self, action):
        """Thực thi action.
        
        Args:
            action: Joint position targets [19D for full body]
        """
        # Apply PD control
        self.data.ctrl[:] = action
        
        # Step simulation
        for _ in range(4):  # 4 substeps
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        reward = self._compute_reward()
        done = self._check_done()
        
        return obs, reward, done, {}
    
    def _get_obs(self):
        return {
            "qpos": self.data.qpos.copy(),
            "qvel": self.data.qvel.copy(),
            "head_camera": self._render("head_cam"),
            "left_wrist_camera": self._render("left_wrist_cam"),
            "right_wrist_camera": self._render("right_wrist_cam"),
        }
    
    def _check_done(self):
        """Kiểm tra robot đã ngã chưa."""
        head_height = self.data.xpos[self.model.body("head").id][2]
        return head_height < 0.5  # Đầu dưới 50cm = ngã


# Ví dụ: Humanoid gắp object trên bàn
env = UnitreeH1Sim()
obs = env.reset(standing=True)

policy = create_humanoid_manip_policy()

for step in range(500):
    action = policy.predict(obs)
    obs, reward, done, info = env.step(action)
    
    if done:
        print(f"Fell down at step {step}!")
        break

Balance during Manipulation

class BalanceAwarePolicy:
    """Policy biết giữ cân bằng trong khi manipulation.
    
    Dùng ZMP (Zero Moment Point) để đảm bảo stability.
    """
    
    def __init__(self, manip_policy, balance_controller):
        self.manip_policy = manip_policy
        self.balance_controller = balance_controller
        self.stability_threshold = 0.05  # ZMP margin
    
    def act(self, observation):
        """Sinh action có cân nhắc balance."""
        # Lấy manipulation action (upper body)
        manip_action = self.manip_policy.predict(observation)
        
        # Tính ZMP dự đoán
        predicted_zmp = self.balance_controller.predict_zmp(
            observation["qpos"],
            observation["qvel"],
            manip_action,
        )
        
        # Kiểm tra stability
        support_polygon = self.balance_controller.get_support_polygon(
            observation["contact_forces"]
        )
        
        if self.balance_controller.is_stable(predicted_zmp, support_polygon):
            # Ổn định — thực hiện manipulation action
            leg_action = self.balance_controller.maintain_balance(observation)
        else:
            # Không ổn định — ưu tiên balance
            print("WARNING: Adjusting for balance!")
            manip_action *= 0.5  # Giảm cường độ manipulation
            leg_action = self.balance_controller.recover_balance(observation)
        
        # Ghép full action
        full_action = np.concatenate([leg_action, manip_action])
        return full_action

Papers tham khảo

HumanPlus: Humanoid Shadowing and Imitation from Humans — Fu et al., 2024 — Human-to-humanoid transfer learning
GR00T N1: Open Foundation Model for Humanoid Robots — NVIDIA, 2024 — Foundation model cho humanoid
Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer — 2024 — Sim2real cho humanoid locomotion

Kết luận và bước tiếp theo

Humanoid manipulation là frontier của robotics research. Key insights:

Action space reduction rất quan trọng — không train toàn bộ 50+ DOF cùng lúc
Balance + manipulation phải được phối hợp — humanoid ngã = thất bại hoàn toàn
Foundation models (GR00T, HumanPlus) là hướng đi tương lai — pretrain trên data lớn, fine-tune cho task cụ thể

Bài cuối cùng của series — Sim-to-Real Transfer: Deploy VLA Policy lên Robot thật — sẽ mang tất cả những gì chúng ta đã học vào thế giới thực.