
Humanoid Manipulation: Whole-body Control with LeRobot

Humanoid robots as the ultimate manipulation platform — high-DOF control, egocentric vision, and humanoid foundation models.

Nguyễn Anh Tuấn · April 5, 2026 · 7 min read

Introduction: The Pinnacle of Manipulation

Throughout this series, we've progressed from single-arm (post 3), through dual-arm (posts 6-7), to mobile manipulation (post 8). Each step added complexity — more DOF, more coordination, more perception.

This post takes us to the pinnacle: humanoid robots. With 20-50+ DOF, two dexterous hands, bipedal locomotion, and human-like form — humanoid robots are the ultimate manipulation platform and also the greatest challenge in robotics.

Humanoid robot for manipulation

Why Humanoid for Manipulation?

Advantages:
- Human-environment compatibility: the world is built for humans (doors, tables, handrails)
- Dexterous hands: 5 fingers x 4 DOF = 20 DOF per hand
- Large workspace: the entire body extends the reachable workspace
- Whole-body dynamics: momentum and gravity can assist manipulation
- Transfer from human data: policies can learn from real human video

Challenges:
- Extreme DOF: 30-50+ joints to control simultaneously
- Balance: stability must be maintained throughout manipulation
- Contact-rich: both feet and hands make contact with the environment
- Compute: inference must run in real time on edge devices

High-DOF Action Spaces

Anatomy of a Humanoid Action Space

import numpy as np
from dataclasses import dataclass

@dataclass
class HumanoidActionSpace:
    """Action space for humanoid robot manipulation.
    
    Example: Unitree H1
    - Base: 6 DOF (floating base controlled by legs)
    - Left leg: 6 DOF
    - Right leg: 6 DOF
    - Torso: 3 DOF (waist yaw, pitch, roll)
    - Left arm: 7 DOF (shoulder 3 + elbow 1 + wrist 3)
    - Right arm: 7 DOF
    - Left hand: 12 DOF (Inspire hand)
    - Right hand: 12 DOF
    Total: ~59 DOF
    """
    
    leg_dof: int = 12
    torso_dof: int = 3
    arm_dof: int = 14
    hand_dof: int = 24
    
    @property
    def total_dof(self):
        return self.leg_dof + self.torso_dof + self.arm_dof + self.hand_dof
    
    @property
    def manipulation_dof(self):
        """DOF directly related to manipulation."""
        return self.torso_dof + self.arm_dof + self.hand_dof
    
    def split_action(self, action):
        """Split action into joint groups."""
        idx = 0
        legs = action[idx:idx + self.leg_dof]; idx += self.leg_dof
        torso = action[idx:idx + self.torso_dof]; idx += self.torso_dof
        arms = action[idx:idx + self.arm_dof]; idx += self.arm_dof
        hands = action[idx:idx + self.hand_dof]; idx += self.hand_dof
        
        return {
            "legs": legs,
            "torso": torso,
            "left_arm": arms[:7],
            "right_arm": arms[7:],
            "left_hand": hands[:12],
            "right_hand": hands[12:],
        }


action_space = HumanoidActionSpace()
print(f"Total DOF: {action_space.total_dof}")
print(f"Manipulation DOF: {action_space.manipulation_dof}")

Reducing Dimensionality for Learning

class ActionSpaceReduction:
    """Reduce humanoid action space for easier training.
    
    Strategies:
    1. Fixed legs: Stand still, train only upper body
    2. Coupled joints: Group related joints
    3. Latent actions: VAE encode action space
    4. Primitive-based: Compose from motion primitives
    """
    
    def __init__(self, mode="fixed_legs"):
        self.mode = mode
    
    def reduce(self, full_action):
        """Reduce action space."""
        parts = HumanoidActionSpace().split_action(full_action)
        
        if self.mode == "fixed_legs":
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                parts["left_hand"][:6],
                parts["right_hand"][:6],
            ])
            # Total: 29 DOF (reduced from 53)
        
        elif self.mode == "coupled_fingers":
            left_hand_simple = np.array([
                parts["left_hand"][:4].mean(),
                parts["left_hand"][4:8].mean(),
                parts["left_hand"][8:12].mean(),
            ])
            right_hand_simple = np.array([
                parts["right_hand"][:4].mean(),
                parts["right_hand"][4:8].mean(),
                parts["right_hand"][8:12].mean(),
            ])
            
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                left_hand_simple,
                right_hand_simple,
            ])
            # Total: 23 DOF
        
        return full_action
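A quick standalone check of the coupled_fingers arithmetic, re-deriving the slices with plain NumPy rather than the class above so the numbers are easy to verify:

```python
import numpy as np

# Hypothetical 53-D action laid out as legs (12) | torso (3) | arms (14) | hands (24)
full_action = np.arange(53, dtype=np.float64)
legs, torso, arms, hands = np.split(full_action, [12, 15, 29])

# Couple each group of 4 finger joints into a single averaged command
left_simple = hands[:12].reshape(3, 4).mean(axis=1)
right_simple = hands[12:].reshape(3, 4).mean(axis=1)

reduced = np.concatenate([torso, arms, left_simple, right_simple])
print(reduced.shape)  # (23,): 3 torso + 14 arms + 3 + 3 coupled fingers
```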

Observation Design for Humanoid

Egocentric Cameras

class HumanoidObservation:
    """Observation design for humanoid manipulation.
    
    Cameras:
    - Head camera: scene overview, navigation
    - Chest camera: Mid-range manipulation view
    - Left/Right wrist cameras: Close-up manipulation
    
    Proprioception:
    - Joint positions (all joints)
    - Joint velocities
    - IMU data (body orientation)
    - Contact forces (feet + hands)
    """
    
    def __init__(self):
        self.camera_configs = {
            "head": {"resolution": (480, 640), "fov": 90, "purpose": "Navigation + overview"},
            "chest": {"resolution": (480, 640), "fov": 70, "purpose": "Mid-range manipulation"},
            "left_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Left hand close-up"},
            "right_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Right hand close-up"},
        }
    
    def get_observation(self, robot) -> dict:
        """Collect observation from humanoid robot."""
        obs = {}
        
        for cam_name, config in self.camera_configs.items():
            obs[f"observation.images.{cam_name}"] = robot.get_camera_image(
                cam_name, config["resolution"]
            )
        
        obs["observation.state"] = np.concatenate([
            robot.get_joint_positions(),
            robot.get_joint_velocities(),
            robot.get_imu_data(),
            robot.get_contact_forces(),
        ])
        
        return obs
    
    @property
    def state_dim(self):
        return 53 + 53 + 4 + 8  # positions + velocities + IMU + contacts = 118D
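To sanity-check the 118-D state layout without hardware, the same concatenation can be exercised with a mock robot (the method names mirror the sketch above and are assumptions, not a real driver API):

```python
import numpy as np

class MockRobot:
    """Stand-in robot returning zeroed sensor data in the expected shapes."""
    def get_joint_positions(self):  return np.zeros(53)   # all actuated joints
    def get_joint_velocities(self): return np.zeros(53)
    def get_imu_data(self):         return np.array([1.0, 0.0, 0.0, 0.0])  # orientation quaternion
    def get_contact_forces(self):   return np.zeros(8)    # feet + hand contact sensors

robot = MockRobot()
state = np.concatenate([
    robot.get_joint_positions(),
    robot.get_joint_velocities(),
    robot.get_imu_data(),
    robot.get_contact_forces(),
])
print(state.shape)  # (118,) = 53 + 53 + 4 + 8
```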

Humanoid robot egocentric view

Humanoid Foundation Models

GR00T-Inspired Architecture

import torch

class GR00TInspired:
    """Architecture inspired by NVIDIA GR00T N1.
    
    Components:
    - Vision encoder: Pretrained ViT
    - Language encoder: CLIP text encoder  
    - Action head: Diffusion-based action generation
    - Dual-system architecture:
      * System 1 (fast): Reactive policy, low latency
      * System 2 (slow): Deliberative planning
    
    Reference: https://arxiv.org/abs/2503.14734
    """
    
    def __init__(self):
        self.vision_encoder = None
        self.language_encoder = None
        self.action_head = None
    
    def forward(self, images, language_instruction, proprioception):
        """
        Args:
            images: Dict of camera images
            language_instruction: Text instruction
            proprioception: Joint states + IMU
        
        Returns:
            action: Full-body action (53D+)
        """
        visual_tokens = []
        for cam_name, image in images.items():
            tokens = self.vision_encoder(image)
            visual_tokens.append(tokens)
        
        lang_tokens = self.language_encoder(language_instruction)
        all_tokens = torch.cat(visual_tokens + [lang_tokens], dim=1)
        
        action = self.action_head.sample(
            condition=all_tokens,
            proprioception=proprioception,
        )
        
        return action
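The dual-system idea boils down to two control loops running at different rates. A toy sketch, not GR00T's actual implementation; the planner and controller here are hypothetical stand-ins:

```python
import numpy as np

def system2_plan(step):
    """Slow deliberative planner (hypothetical): choose a 2-D reach target."""
    return np.array([np.sin(step * 0.1), np.cos(step * 0.1)])

def system1_act(ee_pos, target, gain=0.2):
    """Fast reactive controller: proportional step toward the current target."""
    return ee_pos + gain * (target - ee_pos)

ee_pos = np.zeros(2)
target = system2_plan(0)
for step in range(30):
    if step % 10 == 0:              # System 2 replans at 1/10 the control rate
        target = system2_plan(step)
    ee_pos = system1_act(ee_pos, target)  # System 1 runs every control step

print(ee_pos.shape)  # (2,)
```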

Training Humanoid Manipulation with LeRobot

from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

def create_humanoid_manip_policy():
    """Create policy for humanoid manipulation.
    
    Strategy: Fixed legs, train only upper body (29 DOF)
    """
    config = ACTConfig(
        input_shapes={
            "observation.images.head": [3, 480, 640],
            "observation.images.left_wrist": [3, 480, 640],
            "observation.images.right_wrist": [3, 480, 640],
            "observation.state": [70],
        },
        output_shapes={"action": [29]},
        input_normalization_modes={
            "observation.images.head": "mean_std",
            "observation.images.left_wrist": "mean_std",
            "observation.images.right_wrist": "mean_std",
            "observation.state": "min_max",
        },
        output_normalization_modes={"action": "min_max"},
        
        chunk_size=50,
        n_action_steps=50,
        dim_model=768,
        n_heads=12,
        n_layers=6,
        use_vae=True,
        latent_dim=64,
        kl_weight=10.0,
        vision_backbone="resnet18",
        pretrained_backbone=True,
    )
    
    policy = ACTPolicy(config)
    print(f"Parameters: {sum(p.numel() for p in policy.parameters()):,}")
    return policy
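With chunk_size=50 at a 50 Hz control rate, each forward pass predicts a one-second action window, and overlapping chunks are typically blended at deployment with exponential temporal ensembling (as in the ACT paper). A minimal sketch, not LeRobot's exact code:

```python
import numpy as np

def temporal_ensemble(preds_for_step, k=0.01):
    """Blend all chunk predictions that cover the current timestep.

    preds_for_step: list of 29-D actions, oldest chunk's prediction first.
    k: decay rate; w_i = exp(-k * i) weights the oldest prediction highest.
    """
    actions = np.stack(preds_for_step)               # (n_chunks, 29)
    weights = np.exp(-k * np.arange(len(actions)))
    weights /= weights.sum()
    return weights @ actions                         # weighted average, (29,)

# Three overlapping chunks predicted 1.0, 2.0, 3.0 for this step
blended = temporal_ensemble([np.full(29, v) for v in (1.0, 2.0, 3.0)])
print(blended.shape)  # (29,)
```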

Simulation with Unitree H1/G1

import mujoco

class UnitreeH1Sim:
    """Simulation for Unitree H1 humanoid in MuJoCo.
    
    H1 specs:
    - Height: 1.8m
    - Weight: 47kg
    - DOF: 19 (legs 10 + waist 1 + arms 8)
    - Payload: 3kg per arm
    """
    
    def __init__(self, xml_path="unitree_h1/scene.xml"):
        self.model = mujoco.MjModel.from_xml_path(xml_path)
        self.data = mujoco.MjData(self.model)
        
        self.leg_joints = list(range(10))
        self.waist_joint = [10]
        self.arm_joints = list(range(11, 19))
    
    def reset(self, standing=True):
        """Reset robot to standing or custom position."""
        mujoco.mj_resetData(self.model, self.data)
        
        if standing:
            self.data.qpos[0:3] = [0, 0, 1.0]
            self.data.qpos[3:7] = [1, 0, 0, 0]
        
        mujoco.mj_forward(self.model, self.data)
        return self._get_obs()
    
    def step(self, action):
        """Execute action."""
        self.data.ctrl[:] = action
        
        for _ in range(4):
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        reward = self._compute_reward()
        done = self._check_done()
        
        return obs, reward, done, {}
    
    def _check_done(self):
        """Check if robot has fallen."""
        head_height = self.data.xpos[self.model.body("head").id][2]
        return head_height < 0.5


env = UnitreeH1Sim()
obs = env.reset(standing=True)
policy = create_humanoid_manip_policy()

for step in range(500):
    # ACTPolicy exposes select_action(); observations must be batched torch tensors,
    # and the 29-D upper-body action must be mapped onto the H1's actuators.
    action = policy.select_action(obs)
    obs, reward, done, info = env.step(action)
    if done:
        print(f"Fell down at step {step}!")
        break

Balance During Manipulation

class BalanceAwarePolicy:
    """Policy that maintains balance during manipulation.
    
    Uses ZMP (Zero Moment Point) to ensure stability.
    """
    
    def __init__(self, manip_policy, balance_controller):
        self.manip_policy = manip_policy
        self.balance_controller = balance_controller
        self.stability_threshold = 0.05
    
    def act(self, observation):
        """Generate action considering balance."""
        manip_action = self.manip_policy.predict(observation)
        
        predicted_zmp = self.balance_controller.predict_zmp(
            observation["qpos"],
            observation["qvel"],
            manip_action,
        )
        
        support_polygon = self.balance_controller.get_support_polygon(
            observation["contact_forces"]
        )
        
        if self.balance_controller.is_stable(predicted_zmp, support_polygon):
            leg_action = self.balance_controller.maintain_balance(observation)
        else:
            print("WARNING: Adjusting for balance!")
            manip_action *= 0.5
            leg_action = self.balance_controller.recover_balance(observation)
        
        full_action = np.concatenate([leg_action, manip_action])
        return full_action
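In the simplest double-support case, the is_stable check above reduces to testing whether the ZMP lies inside a safety-shrunk bounding box of the foot contacts. A minimal sketch under that rectangular-polygon assumption:

```python
import numpy as np

def is_stable(zmp_xy, support_polygon, margin=0.02):
    """Check whether the ZMP lies inside a rectangular support polygon.

    zmp_xy: (x, y) of the predicted Zero Moment Point, in meters.
    support_polygon: (xmin, xmax, ymin, ymax) foot-contact bounding box.
    margin: safety margin (m) shrinking the polygon inward.
    """
    xmin, xmax, ymin, ymax = support_polygon
    x, y = zmp_xy
    return (xmin + margin <= x <= xmax - margin and
            ymin + margin <= y <= ymax - margin)

# Double support: both feet on the ground, ~30 cm wide stance
polygon = (-0.10, 0.15, -0.15, 0.15)
print(is_stable((0.02, 0.0), polygon))   # well inside the polygon -> True
print(is_stable((0.14, 0.0), polygon))   # within 2 cm of the edge -> False
```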

Reference Papers

  1. HumanPlus: Humanoid Shadowing and Imitation from Humans (Fu et al., 2024). Human-to-humanoid transfer learning.
  2. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots (NVIDIA, 2025). Foundation model for humanoids.
  3. Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer (2024). Sim2real for humanoid locomotion.

Conclusion and Next Steps

Humanoid manipulation is the frontier of robotics research. The key insights from this post: extreme DOF calls for action-space reduction, balance constrains every manipulation action, and foundation models like GR00T point toward generalist whole-body control.

The final post in this series — Sim-to-Real Transfer: Deploying VLA Policies to Real Robots — will bring everything we've learned into the real world.
