ailerobothumanoidwhole-bodyunitree

Humanoid Manipulation: Whole-body Control with LeRobot

Humanoid robots as the ultimate manipulation platform — high-DOF control, egocentric vision, and humanoid foundation models.

Nguyễn Anh Tuấn5 tháng 4, 20267 phút đọc
Humanoid Manipulation: Whole-body Control with LeRobot

Introduction: The Pinnacle of Manipulation

Throughout this series, we've progressed from single-arm (post 3), through dual-arm (posts 6-7), to mobile manipulation (post 8). Each step added complexity — more DOF, more coordination, more perception.

This post takes us to the pinnacle: humanoid robots. With 20-50+ DOF, two dexterous hands, bipedal locomotion, and human-like form — humanoid robots are the ultimate manipulation platform and also the greatest challenge in robotics.

Humanoid robot for manipulation

Why Humanoid for Manipulation?

Advantage Explanation
Human environment compatible The world is designed for humans — doors, tables, handrails
Dexterous hands 5 fingers x 4 DOF = 20 DOF per hand
Large workspace Entire body is the workspace — large reach
Whole-body dynamics Use momentum, gravity to assist manipulation
Transfer from human data Can learn from real human video
Challenge Description
Extreme DOF 30-50+ joints to control simultaneously
Balance Must maintain balance during manipulation
Contact-rich Both feet and hands contact the environment
Compute Inference must be real-time on edge devices

High-DOF Action Spaces

Anatomy of a Humanoid Action Space

import numpy as np
from dataclasses import dataclass

@dataclass
class HumanoidActionSpace:
    """Action space for humanoid robot manipulation.
    
    Example: Unitree H1
    - Base: 6 DOF (floating base controlled by legs)
    - Left leg: 6 DOF
    - Right leg: 6 DOF
    - Torso: 3 DOF (waist yaw, pitch, roll)
    - Left arm: 7 DOF (shoulder 3 + elbow 1 + wrist 3)
    - Right arm: 7 DOF
    - Left hand: 12 DOF (Inspire hand)
    - Right hand: 12 DOF
    Total: ~59 DOF
    """
    
    leg_dof: int = 12
    torso_dof: int = 3
    arm_dof: int = 14
    hand_dof: int = 24
    
    @property
    def total_dof(self):
        return self.leg_dof + self.torso_dof + self.arm_dof + self.hand_dof
    
    @property
    def manipulation_dof(self):
        """DOF directly related to manipulation."""
        return self.torso_dof + self.arm_dof + self.hand_dof
    
    def split_action(self, action):
        """Split action into joint groups."""
        idx = 0
        legs = action[idx:idx + self.leg_dof]; idx += self.leg_dof
        torso = action[idx:idx + self.torso_dof]; idx += self.torso_dof
        arms = action[idx:idx + self.arm_dof]; idx += self.arm_dof
        hands = action[idx:idx + self.hand_dof]; idx += self.hand_dof
        
        return {
            "legs": legs,
            "torso": torso,
            "left_arm": arms[:7],
            "right_arm": arms[7:],
            "left_hand": hands[:12],
            "right_hand": hands[12:],
        }


action_space = HumanoidActionSpace()
print(f"Total DOF: {action_space.total_dof}")
print(f"Manipulation DOF: {action_space.manipulation_dof}")

Reducing Dimensionality for Learning

class ActionSpaceReduction:
    """Reduce humanoid action space for easier training.
    
    Strategies:
    1. Fixed legs: Stand still, train only upper body
    2. Coupled joints: Group related joints
    3. Latent actions: VAE encode action space
    4. Primitive-based: Compose from motion primitives
    """
    
    def __init__(self, mode="fixed_legs"):
        self.mode = mode
    
    def reduce(self, full_action):
        """Reduce action space."""
        parts = HumanoidActionSpace().split_action(full_action)
        
        if self.mode == "fixed_legs":
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                parts["left_hand"][:6],
                parts["right_hand"][:6],
            ])
            # Total: 29 DOF (reduced from 53)
        
        elif self.mode == "coupled_fingers":
            left_hand_simple = np.array([
                parts["left_hand"][:4].mean(),
                parts["left_hand"][4:8].mean(),
                parts["left_hand"][8:12].mean(),
            ])
            right_hand_simple = np.array([
                parts["right_hand"][:4].mean(),
                parts["right_hand"][4:8].mean(),
                parts["right_hand"][8:12].mean(),
            ])
            
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                left_hand_simple,
                right_hand_simple,
            ])
            # Total: 23 DOF
        
        return full_action

Observation Design for Humanoid

Egocentric Cameras

class HumanoidObservation:
    """Observation design for humanoid manipulation.
    
    Cameras:
    - Head camera: Bird's eye view, navigation
    - Chest camera: Mid-range manipulation view
    - Left/Right wrist cameras: Close-up manipulation
    
    Proprioception:
    - Joint positions (all joints)
    - Joint velocities
    - IMU data (body orientation)
    - Contact forces (feet + hands)
    """
    
    def __init__(self):
        self.camera_configs = {
            "head": {"resolution": (480, 640), "fov": 90, "purpose": "Navigation + overview"},
            "chest": {"resolution": (480, 640), "fov": 70, "purpose": "Mid-range manipulation"},
            "left_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Left hand close-up"},
            "right_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Right hand close-up"},
        }
    
    def get_observation(self, robot) -> dict:
        """Collect observation from humanoid robot."""
        obs = {}
        
        for cam_name, config in self.camera_configs.items():
            obs[f"observation.images.{cam_name}"] = robot.get_camera_image(
                cam_name, config["resolution"]
            )
        
        obs["observation.state"] = np.concatenate([
            robot.get_joint_positions(),
            robot.get_joint_velocities(),
            robot.get_imu_data(),
            robot.get_contact_forces(),
        ])
        
        return obs
    
    @property
    def state_dim(self):
        return 53 + 53 + 4 + 8  # positions + velocities + IMU + contacts = 118D

Humanoid robot egocentric view

Humanoid Foundation Models

GR00T-Inspired Architecture

class GR00TInspired:
    """Architecture inspired by NVIDIA GR00T N1.5.
    
    Components:
    - Vision encoder: Pretrained ViT
    - Language encoder: CLIP text encoder  
    - Action head: Diffusion-based action generation
    - Dual-system architecture:
      * System 1 (fast): Reactive policy, low latency
      * System 2 (slow): Deliberative planning
    
    Reference: https://arxiv.org/abs/2503.14734
    """
    
    def __init__(self):
        self.vision_encoder = None
        self.language_encoder = None
        self.action_head = None
    
    def forward(self, images, language_instruction, proprioception):
        """
        Args:
            images: Dict of camera images
            language_instruction: Text instruction
            proprioception: Joint states + IMU
        
        Returns:
            action: Full-body action (53D+)
        """
        visual_tokens = []
        for cam_name, image in images.items():
            tokens = self.vision_encoder(image)
            visual_tokens.append(tokens)
        
        lang_tokens = self.language_encoder(language_instruction)
        all_tokens = torch.cat(visual_tokens + [lang_tokens], dim=1)
        
        action = self.action_head.sample(
            condition=all_tokens,
            proprioception=proprioception,
        )
        
        return action

Training Humanoid Manipulation with LeRobot

from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

def create_humanoid_manip_policy():
    """Create policy for humanoid manipulation.
    
    Strategy: Fixed legs, train only upper body (29 DOF)
    """
    config = ACTConfig(
        input_shapes={
            "observation.images.head": [3, 480, 640],
            "observation.images.left_wrist": [3, 480, 640],
            "observation.images.right_wrist": [3, 480, 640],
            "observation.state": [70],
        },
        output_shapes={"action": [29]},
        input_normalization_modes={
            "observation.images.head": "mean_std",
            "observation.images.left_wrist": "mean_std",
            "observation.images.right_wrist": "mean_std",
            "observation.state": "min_max",
        },
        output_normalization_modes={"action": "min_max"},
        
        chunk_size=50,
        n_action_steps=50,
        dim_model=768,
        n_heads=12,
        n_layers=6,
        use_vae=True,
        latent_dim=64,
        kl_weight=10.0,
        vision_backbone="resnet18",
        pretrained_backbone=True,
    )
    
    policy = ACTPolicy(config)
    print(f"Parameters: {sum(p.numel() for p in policy.parameters()):,}")
    return policy

Simulation with Unitree H1/G1

import mujoco

class UnitreeH1Sim:
    """Simulation for Unitree H1 humanoid in MuJoCo.
    
    H1 specs:
    - Height: 1.8m
    - Weight: 47kg
    - DOF: 19 (legs 10 + waist 1 + arms 8)
    - Payload: 3kg per arm
    """
    
    def __init__(self, xml_path="unitree_h1/scene.xml"):
        self.model = mujoco.MjModel.from_xml_path(xml_path)
        self.data = mujoco.MjData(self.model)
        
        self.leg_joints = list(range(10))
        self.waist_joint = [10]
        self.arm_joints = list(range(11, 19))
    
    def reset(self, standing=True):
        """Reset robot to standing or custom position."""
        mujoco.mj_resetData(self.model, self.data)
        
        if standing:
            self.data.qpos[0:3] = [0, 0, 1.0]
            self.data.qpos[3:7] = [1, 0, 0, 0]
        
        mujoco.mj_forward(self.model, self.data)
        return self._get_obs()
    
    def step(self, action):
        """Execute action."""
        self.data.ctrl[:] = action
        
        for _ in range(4):
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        reward = self._compute_reward()
        done = self._check_done()
        
        return obs, reward, done, {}
    
    def _check_done(self):
        """Check if robot has fallen."""
        head_height = self.data.xpos[self.model.body("head").id][2]
        return head_height < 0.5


env = UnitreeH1Sim()
obs = env.reset(standing=True)
policy = create_humanoid_manip_policy()

for step in range(500):
    action = policy.predict(obs)
    obs, reward, done, info = env.step(action)
    if done:
        print(f"Fell down at step {step}!")
        break

Balance During Manipulation

class BalanceAwarePolicy:
    """Policy that maintains balance during manipulation.
    
    Uses ZMP (Zero Moment Point) to ensure stability.
    """
    
    def __init__(self, manip_policy, balance_controller):
        self.manip_policy = manip_policy
        self.balance_controller = balance_controller
        self.stability_threshold = 0.05
    
    def act(self, observation):
        """Generate action considering balance."""
        manip_action = self.manip_policy.predict(observation)
        
        predicted_zmp = self.balance_controller.predict_zmp(
            observation["qpos"],
            observation["qvel"],
            manip_action,
        )
        
        support_polygon = self.balance_controller.get_support_polygon(
            observation["contact_forces"]
        )
        
        if self.balance_controller.is_stable(predicted_zmp, support_polygon):
            leg_action = self.balance_controller.maintain_balance(observation)
        else:
            print("WARNING: Adjusting for balance!")
            manip_action *= 0.5
            leg_action = self.balance_controller.recover_balance(observation)
        
        full_action = np.concatenate([leg_action, manip_action])
        return full_action

Reference Papers

  1. HumanPlus: Humanoid Shadowing and Imitation from HumansFu et al., 2024 — Human-to-humanoid transfer learning
  2. GR00T N1: Open Foundation Model for Humanoid RobotsNVIDIA, 2024 — Foundation model for humanoid
  3. Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer2024 — Sim2real for humanoid locomotion

Conclusion and Next Steps

Humanoid manipulation is the frontier of robotics research. Key insights:

  • Action space reduction is critical — don't train all 50+ DOF at once
  • Balance + manipulation must be coordinated — humanoid falling = complete failure
  • Foundation models (GR00T, HumanPlus) are the future — pretrain on large data, fine-tune for specific tasks

The final post in this series — Sim-to-Real Transfer: Deploying VLA Policies to Real Robots — will bring everything we've learned into the real world.

NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Bài viết liên quan

NEWTutorial
WholeBodyVLA Tutorial: Teleop → Train → Deploy Humanoid
wholebodyvlavlahumanoidloco-manipulationiclr-2026agibot-x2teleoprl

WholeBodyVLA Tutorial: Teleop → Train → Deploy Humanoid

ICLR 2026 — pipeline thực chiến từ thu thập teleop, train unified latent VLA đến deploy whole-body loco-manipulation trên AgiBot X2.

11/5/202611 phút đọc
NEWTutorial
Booster Gym ICRA 2026: Train Humanoid T1 Sim-to-Real với Isaac Gym
humanoidisaac-gymreinforcement-learningsim2realbooster-t1icra-2026

Booster Gym ICRA 2026: Train Humanoid T1 Sim-to-Real với Isaac Gym

Hướng dẫn chi tiết Booster Gym — RL framework end-to-end open-source train humanoid Booster T1 walking từ teleop đến deploy thực tế.

6/5/202611 phút đọc
Tutorial
Hướng dẫn SO-101 sim-to-real với Isaac Lab & LeRobot
so-101isaac-lablerobotsim-to-realmanipulationgr00tdomain-randomization

Hướng dẫn SO-101 sim-to-real với Isaac Lab & LeRobot

Từng bước train cánh tay SO-101 trong NVIDIA Isaac Lab, thu thập dữ liệu teleop, fine-tune GR00T N1.5, và deploy policy lên robot thực tế.

29/4/202612 phút đọc