Introduction: The Pinnacle of Manipulation
Throughout this series, we've progressed from single-arm (post 3), through dual-arm (posts 6-7), to mobile manipulation (post 8). Each step added complexity — more DOF, more coordination, more perception.
This post takes us to that pinnacle: humanoid robots. With 20-50+ DOF, two dexterous hands, bipedal locomotion, and a human-like form, humanoids are the most capable manipulation platform and also the greatest challenge in robotics.
Why Humanoid for Manipulation?
| Advantage | Explanation |
|---|---|
| Human environment compatible | The world is designed for humans — doors, tables, handrails |
| Dexterous hands | Up to ~20 DOF per hand (5 fingers x 4 DOF); the Inspire hand used below has 12 |
| Large workspace | Entire body is the workspace — large reach |
| Whole-body dynamics | Use momentum, gravity to assist manipulation |
| Transfer from human data | Can learn from real human video |

| Challenge | Description |
|---|---|
| Extreme DOF | 30-50+ joints to control simultaneously |
| Balance | Must maintain balance during manipulation |
| Contact-rich | Both feet and hands contact the environment |
| Compute | Inference must be real-time on edge devices |
High-DOF Action Spaces
Anatomy of a Humanoid Action Space
```python
import numpy as np
from dataclasses import dataclass


@dataclass
class HumanoidActionSpace:
    """Action space for humanoid robot manipulation.

    Example: Unitree H1 with Inspire hands
    - Base: 6 DOF (floating base, controlled indirectly through the legs)
    - Left leg: 6 DOF
    - Right leg: 6 DOF
    - Torso: 3 DOF (waist yaw, pitch, roll)
    - Left arm: 7 DOF (shoulder 3 + elbow 1 + wrist 3)
    - Right arm: 7 DOF
    - Left hand: 12 DOF (Inspire hand)
    - Right hand: 12 DOF

    Total: 53 actuated DOF (59 counting the 6-DOF floating base)
    """
    leg_dof: int = 12
    torso_dof: int = 3
    arm_dof: int = 14
    hand_dof: int = 24

    @property
    def total_dof(self):
        return self.leg_dof + self.torso_dof + self.arm_dof + self.hand_dof

    @property
    def manipulation_dof(self):
        """DOF directly involved in manipulation (everything above the legs)."""
        return self.torso_dof + self.arm_dof + self.hand_dof

    def split_action(self, action):
        """Split a flat 53-D action vector into named joint groups."""
        idx = 0
        legs = action[idx:idx + self.leg_dof]; idx += self.leg_dof
        torso = action[idx:idx + self.torso_dof]; idx += self.torso_dof
        arms = action[idx:idx + self.arm_dof]; idx += self.arm_dof
        hands = action[idx:idx + self.hand_dof]; idx += self.hand_dof
        return {
            "legs": legs,
            "torso": torso,
            "left_arm": arms[:7],
            "right_arm": arms[7:],
            "left_hand": hands[:12],
            "right_hand": hands[12:],
        }


action_space = HumanoidActionSpace()
print(f"Total DOF: {action_space.total_dof}")                # 53
print(f"Manipulation DOF: {action_space.manipulation_dof}")  # 41
```
Reducing Dimensionality for Learning
```python
class ActionSpaceReduction:
    """Reduce the humanoid action space for easier policy training.

    Strategies:
    1. Fixed legs: stand still, train only the upper body
    2. Coupled joints: group related joints (e.g. fingers) into one command
    3. Latent actions: learn a compressed action space (e.g. with a VAE)
    4. Primitive-based: compose actions from motion primitives
    """

    def __init__(self, mode="fixed_legs"):
        self.mode = mode

    def reduce(self, full_action):
        """Map a full 53-D action to a reduced action vector."""
        parts = HumanoidActionSpace().split_action(full_action)
        if self.mode == "fixed_legs":
            # Drop the legs and the last 6 DOF of each hand:
            # 3 torso + 7 + 7 arms + 6 + 6 fingers = 29 DOF (down from 53)
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                parts["left_hand"][:6],
                parts["right_hand"][:6],
            ])
        elif self.mode == "coupled_fingers":
            # Collapse each hand's 12 finger joints into 3 grouped commands:
            # 3 torso + 7 + 7 arms + 3 + 3 hands = 23 DOF
            left_hand_simple = np.array([
                parts["left_hand"][:4].mean(),
                parts["left_hand"][4:8].mean(),
                parts["left_hand"][8:12].mean(),
            ])
            right_hand_simple = np.array([
                parts["right_hand"][:4].mean(),
                parts["right_hand"][4:8].mean(),
                parts["right_hand"][8:12].mean(),
            ])
            return np.concatenate([
                parts["torso"],
                parts["left_arm"],
                parts["right_arm"],
                left_hand_simple,
                right_hand_simple,
            ])
        return full_action
```
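Strategy 3 ("latent actions") is only named above. As a rough sketch of the idea, here is a linear stand-in for the VAE: fit a PCA basis over recorded full-body actions and then command the robot in the low-dimensional latent space. The `LatentActionSpace` class and the synthetic demonstration data are illustrative assumptions, not part of any real pipeline.

```python
import numpy as np


class LatentActionSpace:
    """Linear (PCA) stand-in for a learned latent action space."""

    def __init__(self, latent_dim=10):
        self.latent_dim = latent_dim
        self.mean = None
        self.basis = None  # (latent_dim, full_dim) principal directions

    def fit(self, actions):
        """actions: (N, 53) array of demonstrated full-body actions."""
        self.mean = actions.mean(axis=0)
        # SVD of the centered data gives the principal components.
        _, _, vt = np.linalg.svd(actions - self.mean, full_matrices=False)
        self.basis = vt[: self.latent_dim]

    def encode(self, action):
        return self.basis @ (action - self.mean)

    def decode(self, latent):
        return self.mean + self.basis.T @ latent


rng = np.random.default_rng(0)
# Synthetic demos: 53-DOF actions that really live on a 10-D manifold.
latent_true = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 53))
demos = latent_true @ mixing

space = LatentActionSpace(latent_dim=10)
space.fit(demos)
roundtrip = space.decode(space.encode(demos[0]))
print(np.allclose(roundtrip, demos[0], atol=1e-6))  # True: a 10-D latent suffices here
```

A real system would replace the SVD with a VAE so the latent space is nonlinear and regularized, but the encode/decode interface stays the same.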
Observation Design for Humanoid
Egocentric Cameras
```python
class HumanoidObservation:
    """Observation design for humanoid manipulation.

    Cameras:
    - Head camera: egocentric overview, navigation
    - Chest camera: mid-range manipulation view
    - Left/right wrist cameras: close-up manipulation

    Proprioception:
    - Joint positions (all 53 joints)
    - Joint velocities
    - IMU data (base orientation quaternion)
    - Contact forces (feet + hands)
    """

    def __init__(self):
        self.camera_configs = {
            "head": {"resolution": (480, 640), "fov": 90, "purpose": "Navigation + overview"},
            "chest": {"resolution": (480, 640), "fov": 70, "purpose": "Mid-range manipulation"},
            "left_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Left hand close-up"},
            "right_wrist": {"resolution": (480, 640), "fov": 90, "purpose": "Right hand close-up"},
        }

    def get_observation(self, robot) -> dict:
        """Collect one observation from the humanoid robot."""
        obs = {}
        for cam_name, config in self.camera_configs.items():
            obs[f"observation.images.{cam_name}"] = robot.get_camera_image(
                cam_name, config["resolution"]
            )
        obs["observation.state"] = np.concatenate([
            robot.get_joint_positions(),
            robot.get_joint_velocities(),
            robot.get_imu_data(),
            robot.get_contact_forces(),
        ])
        return obs

    @property
    def state_dim(self):
        # 53 positions + 53 velocities + 4-D IMU quaternion + 8 contact channels
        return 53 + 53 + 4 + 8  # = 118
```
Humanoid Foundation Models
GR00T-Inspired Architecture
```python
import torch


class GR00TInspired:
    """Architecture inspired by NVIDIA GR00T N1.

    Components:
    - Vision encoder: pretrained ViT
    - Language encoder: CLIP text encoder
    - Action head: diffusion-based action generation
    - Dual-system design:
      * System 1 (fast): reactive policy, low latency
      * System 2 (slow): deliberative planning

    Reference: https://arxiv.org/abs/2503.14734
    """

    def __init__(self):
        # Placeholders: plug in pretrained modules here.
        self.vision_encoder = None
        self.language_encoder = None
        self.action_head = None

    def forward(self, images, language_instruction, proprioception):
        """
        Args:
            images: dict of camera images
            language_instruction: text instruction
            proprioception: joint states + IMU
        Returns:
            action: full-body action (53-D+)
        """
        visual_tokens = []
        for cam_name, image in images.items():
            tokens = self.vision_encoder(image)
            visual_tokens.append(tokens)
        lang_tokens = self.language_encoder(language_instruction)
        # Condition the diffusion action head on all visual + language tokens.
        all_tokens = torch.cat(visual_tokens + [lang_tokens], dim=1)
        action = self.action_head.sample(
            condition=all_tokens,
            proprioception=proprioception,
        )
        return action
```
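The dual-system split can be prototyped independently of the model internals. A minimal sketch, assuming hypothetical `system1`/`system2` callables: the slow module replans every `slow_period` control ticks, while the fast module acts on every tick using the most recent goal.

```python
import numpy as np


class DualSystemController:
    """Sketch of a System 1 / System 2 scheduler (both callables are assumptions)."""

    def __init__(self, system1, system2, slow_period=25):
        self.system1 = system1        # fast: (obs, goal) -> action
        self.system2 = system2        # slow: obs -> goal
        self.slow_period = slow_period
        self.goal = None
        self.tick = 0
        self.slow_calls = 0

    def act(self, obs):
        if self.tick % self.slow_period == 0:
            self.goal = self.system2(obs)    # deliberative replanning
            self.slow_calls += 1
        self.tick += 1
        return self.system1(obs, self.goal)  # reactive action, every tick


# Toy modules: System 2 picks a target ahead of us, System 1 moves toward it.
system2 = lambda obs: obs + 1.0
system1 = lambda obs, goal: 0.1 * (goal - obs)

ctrl = DualSystemController(system1, system2, slow_period=25)
obs = np.zeros(3)
for _ in range(100):                  # 100 fast ticks
    obs = obs + ctrl.act(obs)
print(ctrl.slow_calls)  # 4 slow replans (ticks 0, 25, 50, 75)
```

At 50 Hz control with `slow_period=25`, System 2 would run at 2 Hz, which is roughly the regime where a large planner can afford to be slow without stalling the reactive loop.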
Training Humanoid Manipulation with LeRobot
```python
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy


def create_humanoid_manip_policy():
    """Create an ACT policy for humanoid manipulation.

    Strategy: fixed legs, train only the upper body (29 DOF).
    """
    config = ACTConfig(
        input_shapes={
            "observation.images.head": [3, 480, 640],
            "observation.images.left_wrist": [3, 480, 640],
            "observation.images.right_wrist": [3, 480, 640],
            # 29 positions + 29 velocities + 4-D IMU + 8 contacts = 70
            "observation.state": [70],
        },
        output_shapes={"action": [29]},  # upper body only
        input_normalization_modes={
            "observation.images.head": "mean_std",
            "observation.images.left_wrist": "mean_std",
            "observation.images.right_wrist": "mean_std",
            "observation.state": "min_max",
        },
        output_normalization_modes={"action": "min_max"},
        chunk_size=50,
        n_action_steps=50,
        dim_model=768,
        n_heads=12,
        n_layers=6,
        use_vae=True,
        latent_dim=64,
        kl_weight=10.0,
        vision_backbone="resnet18",
        pretrained_backbone=True,
    )
    policy = ACTPolicy(config)
    print(f"Parameters: {sum(p.numel() for p in policy.parameters()):,}")
    return policy
```
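With `chunk_size=50` and `n_action_steps=50`, the policy predicts 50 actions per query and plays the whole chunk back open-loop before re-querying. Here is a sketch of that receding-horizon loop, with `policy_fn`, `get_obs`, and `apply_action` as hypothetical stand-ins for the real policy and robot interface.

```python
import numpy as np


def run_chunked(policy_fn, get_obs, apply_action, n_steps,
                chunk_size=50, n_action_steps=50):
    """Query a chunk of `chunk_size` actions, execute the first
    `n_action_steps`, then re-query. Returns the number of policy queries."""
    queries = 0
    step = 0
    while step < n_steps:
        chunk = policy_fn(get_obs())        # (chunk_size, action_dim)
        queries += 1
        for a in chunk[:n_action_steps]:
            apply_action(a)
            step += 1
            if step >= n_steps:
                break
    return queries


# Toy setup: a dummy policy and a list that records every applied action.
applied = []
policy_fn = lambda obs: np.zeros((50, 29))   # 29-DOF upper-body actions
queries = run_chunked(policy_fn, lambda: None, applied.append, n_steps=500)
print(queries, len(applied))  # 10 500
```

Setting `n_action_steps` smaller than `chunk_size` (e.g. 25) would re-query twice as often, trading inference cost for faster reaction to disturbances, which matters on a robot that can fall over.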
Simulation with Unitree H1/G1
```python
import mujoco
import numpy as np


class UnitreeH1Sim:
    """Simulation for the Unitree H1 humanoid in MuJoCo.

    H1 specs:
    - Height: 1.8 m
    - Weight: 47 kg
    - DOF: 19 (legs 10 + waist 1 + arms 8)
    - Payload: 3 kg per arm
    """

    def __init__(self, xml_path="unitree_h1/scene.xml"):
        self.model = mujoco.MjModel.from_xml_path(xml_path)
        self.data = mujoco.MjData(self.model)
        self.leg_joints = list(range(10))
        self.waist_joint = [10]
        self.arm_joints = list(range(11, 19))

    def reset(self, standing=True):
        """Reset the robot to a standing (or custom) configuration."""
        mujoco.mj_resetData(self.model, self.data)
        if standing:
            self.data.qpos[0:3] = [0, 0, 1.0]   # base position: 1 m above the ground
            self.data.qpos[3:7] = [1, 0, 0, 0]  # base orientation: identity quaternion
        mujoco.mj_forward(self.model, self.data)
        return self._get_obs()

    def step(self, action):
        """Apply an action and advance the simulation (4 substeps per control step)."""
        self.data.ctrl[:] = action
        for _ in range(4):
            mujoco.mj_step(self.model, self.data)
        obs = self._get_obs()
        reward = self._compute_reward()
        done = self._check_done()
        return obs, reward, done, {}

    def _get_obs(self):
        # Minimal proprioceptive observation: joint positions + velocities.
        return np.concatenate([self.data.qpos, self.data.qvel])

    def _compute_reward(self):
        # Placeholder: task rewards are defined per experiment.
        return 0.0

    def _check_done(self):
        """Episode ends when the robot has fallen (head below 0.5 m)."""
        head_height = self.data.xpos[self.model.body("head").id][2]
        return head_height < 0.5


env = UnitreeH1Sim()
obs = env.reset(standing=True)
policy = create_humanoid_manip_policy()

for step in range(500):
    action = policy.predict(obs)  # assumes a predict() wrapper around the policy
    obs, reward, done, info = env.step(action)
    if done:
        print(f"Fell down at step {step}!")
        break
```
Balance During Manipulation
```python
class BalanceAwarePolicy:
    """Policy that maintains balance during manipulation.

    Uses the ZMP (Zero Moment Point): the robot is stable as long as the
    ZMP stays inside the support polygon formed by the feet.
    """

    def __init__(self, manip_policy, balance_controller):
        self.manip_policy = manip_policy
        self.balance_controller = balance_controller
        self.stability_threshold = 0.05  # margin to the polygon edge, in metres

    def act(self, observation):
        """Generate a full-body action that respects balance constraints."""
        manip_action = self.manip_policy.predict(observation)
        # Predict where the ZMP moves if we execute the manipulation action.
        predicted_zmp = self.balance_controller.predict_zmp(
            observation["qpos"],
            observation["qvel"],
            manip_action,
        )
        support_polygon = self.balance_controller.get_support_polygon(
            observation["contact_forces"]
        )
        if self.balance_controller.is_stable(predicted_zmp, support_polygon):
            leg_action = self.balance_controller.maintain_balance(observation)
        else:
            print("WARNING: Adjusting for balance!")
            manip_action *= 0.5  # scale down the arm motion
            leg_action = self.balance_controller.recover_balance(observation)
        full_action = np.concatenate([leg_action, manip_action])
        return full_action
```
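In the simplest case, an `is_stable` check like the one above reduces to testing whether the predicted ZMP lies inside the convex support polygon. A minimal 2-D point-in-convex-polygon sketch; the rectangle standing in for a double-support polygon is an assumption for illustration.

```python
import numpy as np


def in_convex_polygon(point, vertices):
    """True if `point` lies inside the convex polygon `vertices`, given in
    counter-clockwise order: the point must be on the left of every edge."""
    n = len(vertices)
    for i in range(n):
        a, b = vertices[i], vertices[(i + 1) % n]
        edge = b - a
        to_point = point - a
        # 2-D cross product; negative means the point is right of this edge.
        if edge[0] * to_point[1] - edge[1] * to_point[0] < 0:
            return False
    return True


# Double-support polygon: a rectangle spanning both feet (metres, CCW order).
support = np.array([[-0.10, -0.15], [0.10, -0.15], [0.10, 0.15], [-0.10, 0.15]])
print(in_convex_polygon(np.array([0.00, 0.0]), support))  # True: ZMP centred
print(in_convex_polygon(np.array([0.25, 0.0]), support))  # False: ZMP ahead of the feet
```

A production controller would also enforce a safety margin (like `stability_threshold` above) by shrinking the polygon before the test, rather than checking the raw boundary.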
Reference Papers
- HumanPlus: Humanoid Shadowing and Imitation from Humans — Fu et al., 2024 — Human-to-humanoid transfer learning
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots — NVIDIA, 2025 — Foundation model for humanoids
- Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer — 2024 — Sim2real for humanoid locomotion
Conclusion and Next Steps
Humanoid manipulation is the frontier of robotics research. Key insights:
- Action space reduction is critical — don't train all 50+ DOF at once
- Balance + manipulation must be coordinated — humanoid falling = complete failure
- Foundation models (GR00T, HumanPlus) are the future — pretrain on large data, fine-tune for specific tasks
The final post in this series — Sim-to-Real Transfer: Deploying VLA Policies to Real Robots — will bring everything we've learned into the real world.
Related Posts
- Mobile Manipulation — Foundation before humanoid
- Humanoid Locomotion + Manipulation — Humanoid loco-manip overview
- VLA Models for Robots — Foundation models for robots