Tags: humanoid, reinforcement-learning, sim2real, humanoid-gym

RL for Humanoid: From Humanoid-Gym to Sim2Real

Guide to training locomotion policies for humanoid robots with Humanoid-Gym — from reward design to zero-shot sim-to-real transfer.

Nguyen Anh Tuan · February 19, 2026 · 8 min read

RL for Humanoids — From Simulation to Reality

In Part 3, we covered MPC — a powerful but computationally expensive model-based approach. Reinforcement Learning (RL) offers another path: train policies in simulation, then deploy them directly to real robots.

Key RL advantages over MPC:

  1. Fast inference: a trained policy runs in ~0.1-1 ms, versus 5-50 ms to solve an MPC problem online
  2. Robustness: with domain randomization, policies tolerate pushes, model errors, and varied terrain
  3. No explicit model at runtime: terrain adaptation is learned from proprioception rather than requiring perception and online planning

The main disadvantage: sim-to-real gap — policies trained in simulation may fail on real robots. This is the core problem that Humanoid-Gym is designed to solve.

Reinforcement learning training loop for humanoid robots

Humanoid-Gym — The Standard Framework

Humanoid-Gym (arXiv:2404.05695) is an RL framework designed specifically for humanoid locomotion, developed by RobotEra and Tsinghua University.

Key Features

  1. Built on NVIDIA Isaac Gym: Leverages GPU parallelism — train 4,096 robots simultaneously
  2. Sim-to-sim verification: Transfer policy from Isaac Gym to MuJoCo before deployment
  3. Zero-shot sim-to-real: Verified on RobotEra XBot-S (1.2m) and XBot-L (1.65m)
  4. Modular design: Easy to swap robot models, reward functions, and terrain

Installation

# Clone repo
git clone https://github.com/roboterax/humanoid-gym.git
cd humanoid-gym

# Install Isaac Gym (requires NVIDIA GPU):
# download Isaac Gym Preview 4 from NVIDIA's developer site,
# then run `pip install -e .` inside its python/ directory

# Install Humanoid-Gym
pip install -e .

Project Structure

humanoid-gym/
├── humanoid/
│   ├── envs/
│   │   ├── base/               # Base environment
│   │   ├── humanoid/
│   │   │   ├── humanoid_config.py    # Config (reward, domain rand, ...)
│   │   │   └── humanoid_env.py       # Environment logic
│   │   └── custom/              # For your own robot
│   ├── scripts/
│   │   ├── train.py             # Training script
│   │   └── play.py              # Evaluation script
│   └── utils/
├── resources/
│   └── robots/                  # URDF/MJCF models
└── logs/                        # Training logs

Reward Engineering — The Most Critical Part

The reward function determines what the robot learns. Design the reward wrong and the robot learns bizarre behaviors (reward hacking).

Reward Components for Humanoid Walking

class HumanoidRewardConfig:
    class rewards:
        # === Tracking rewards (positive) ===
        # Follow desired velocity
        tracking_lin_vel = 1.5      # Track linear velocity command
        tracking_ang_vel = 0.8      # Track angular velocity command
        
        # === Regularization (negative, penalties) ===
        lin_vel_z = -2.0            # Penalize vertical oscillation (bouncing)
        ang_vel_xy = -0.05          # Penalize excessive lateral rocking
        orientation = -1.0          # Penalize non-upright torso
        base_height = -10.0         # Penalize deviation from target base height
        
        # === Smooth motion (negative) ===
        torques = -0.0001           # Penalize large torques (energy efficiency)
        dof_vel = -0.001            # Penalize high joint velocity
        dof_acc = -2.5e-7           # Penalize joint acceleration (smooth motion)
        action_rate = -0.01         # Penalize sudden action changes
        
        # === Contact rewards ===
        feet_air_time = 1.0         # Reward foot being airborne (no dragging)
        feet_stumble = -1.0         # Penalize feet hitting obstacles
        collision = -1.0            # Penalize body collisions
        
        # === Survival ===
        alive = 0.5                 # Reward each survival step
        termination = -200.0        # Penalize falling
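
As a concrete example of how a tracking term is typically implemented, here is a minimal numpy sketch of the exponential kernel used in legged-gym-style frameworks (the function name and the `tracking_sigma` value are illustrative, not taken from Humanoid-Gym):

```python
import numpy as np

def reward_tracking_lin_vel(command, base_lin_vel, tracking_sigma=0.25):
    """Return 1.0 for perfect tracking of the xy velocity command,
    decaying exponentially toward 0 as the error grows."""
    error = np.sum(np.square(command[:2] - base_lin_vel[:2]))
    return np.exp(-error / tracking_sigma)
```

The kernel shape matters: a squared-error penalty pushes the reward unboundedly negative, while this bounded kernel keeps the tracking term on the same [0, 1] scale as the alive bonus.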

Reward Shaping Tips

  1. Start simple: Only tracking_lin_vel + alive + termination. Add other rewards gradually.
  2. Normalize: Balance effective magnitudes. termination = -200 fires once per episode, while torques = -0.0001 accumulates every step — log the weighted totals per episode and adjust weights so no single term dominates.
  3. Curriculum: Increase difficulty gradually — start on flat ground, add terrain after walking is stable.
  4. Observation noise: Add noise to observations to make the policy more robust.
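Tip 3 (curriculum) can be sketched as a simple promotion/demotion rule — a hypothetical helper with illustrative thresholds, not Humanoid-Gym's actual implementation:

```python
def update_terrain_level(level, max_level, mean_tracking_reward,
                         promote_thresh=0.8, demote_thresh=0.4):
    """Promote robots that track commands well to harder terrain;
    demote those that struggle back to easier terrain."""
    if mean_tracking_reward > promote_thresh:
        return min(level + 1, max_level)
    if mean_tracking_reward < demote_thresh:
        return max(level - 1, 0)
    return level
```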

Training Pipeline

Step 1: Configuration

# humanoid/envs/custom/my_robot_config.py

from humanoid.envs.base.legged_robot_config import LeggedRobotCfg

class MyHumanoidCfg(LeggedRobotCfg):
    class env:
        num_envs = 4096          # Parallel robot count
        num_observations = 47     # Observation dimension
        num_actions = 12          # Controlled joints count
        episode_length_s = 20     # Max episode length
    
    class terrain:
        mesh_type = 'plane'       # Start with flat ground
        # mesh_type = 'trimesh'   # More complex terrain
        curriculum = True         # Gradually increase difficulty
    
    class init_state:
        pos = [0.0, 0.0, 0.95]   # Initial position (standing)
        default_joint_angles = {  # Default joint angles
            'left_hip_pitch': -0.1,
            'left_knee': 0.3,
            'left_ankle': -0.2,
            'right_hip_pitch': -0.1,
            'right_knee': 0.3,
            'right_ankle': -0.2,
            # ... add remaining joints
        }
    
    class control:
        # PD gains
        stiffness = {'hip': 80, 'knee': 80, 'ankle': 40}
        damping = {'hip': 2, 'knee': 2, 'ankle': 1}
        action_scale = 0.25       # Scale action output
        decimation = 4            # Control frequency = sim_freq / decimation
    
    class domain_rand:
        randomize_friction = True
        friction_range = [0.5, 1.5]
        randomize_base_mass = True
        added_mass_range = [-1.0, 3.0]     # kg
        push_robots = True
        push_interval_s = 7                 # Push robot every 7 seconds
        max_push_vel_xy = 1.0              # m/s
    
    class noise:
        add_noise = True
        noise_level = 1.0
        class noise_scales:
            dof_pos = 0.01
            dof_vel = 1.5
            lin_vel = 0.1
            ang_vel = 0.2
            gravity = 0.05
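
The noise config above can be applied with a small helper like this — a sketch where the group names mirror noise_scales but the function itself is illustrative:

```python
import numpy as np

def add_observation_noise(obs, noise_scales, noise_level=1.0, rng=None):
    """Perturb each observation group with uniform noise in
    [-scale * noise_level, +scale * noise_level]."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = {}
    for name, values in obs.items():
        scale = noise_scales.get(name, 0.0) * noise_level
        noisy[name] = values + rng.uniform(-scale, scale, size=values.shape)
    return noisy

obs = {"dof_pos": np.zeros(12), "ang_vel": np.zeros(3)}
scales = {"dof_pos": 0.01, "ang_vel": 0.2}
noisy = add_observation_noise(obs, scales)
```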

Step 2: Train

# Train with PPO (Proximal Policy Optimization)
python humanoid/scripts/train.py \
    --task my_humanoid \
    --num_envs 4096 \
    --max_iterations 5000 \
    --headless

Training typically takes 2-8 hours on an RTX 4090, depending on task complexity.

Step 3: Evaluate

# Watch policy in simulation
python humanoid/scripts/play.py \
    --task my_humanoid \
    --load_run <run_name> \
    --checkpoint <iteration>

Domain randomization and terrain curriculum in RL training

Sim-to-Sim Verification

Before deploying to the real robot, verify the policy in MuJoCo (a different simulator from Isaac Gym). If the policy performs well in both simulators, it is much more likely to work in the real world.

import mujoco
import numpy as np
import torch

class MuJoCoVerifier:
    def __init__(self, model_path, policy_path):
        self.model = mujoco.MjModel.from_xml_path(model_path)
        self.data = mujoco.MjData(self.model)
        
        # Load trained policy
        self.policy = torch.jit.load(policy_path)
        self.policy.eval()
    
    def get_observation(self):
        """Build observation matching Isaac Gym format."""
        obs = np.concatenate([
            self.data.qpos[7:],         # Joint positions (skip root)
            self.data.qvel[6:],         # Joint velocities (skip root)
            self.data.qvel[:3],         # Base linear velocity
            self.data.qvel[3:6],        # Base angular velocity
            self.get_projected_gravity(),
            self.command,               # Velocity command
            self.prev_action,           # Previous action
        ])
        return obs
    
    def get_projected_gravity(self):
        """Gravity vector in body frame."""
        quat = self.data.qpos[3:7]
        rot = np.zeros(9)
        mujoco.mju_quat2Mat(rot, quat)
        rot = rot.reshape(3, 3)
        return rot.T @ np.array([0, 0, -1])
    
    def run(self, command=(0.5, 0.0, 0.0), steps=1000):
        """Run policy in MuJoCo."""
        self.command = np.array(command)
        self.prev_action = np.zeros(self.model.nu)
        
        for step in range(steps):
            obs = self.get_observation()
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
            
            with torch.no_grad():
                action = self.policy(obs_tensor).squeeze().numpy()
            
            # Apply action (position targets)
            self.data.ctrl[:] = action * 0.25  # action_scale
            
            # Step MuJoCo
            for _ in range(4):  # decimation
                mujoco.mj_step(self.model, self.data)
            
            self.prev_action = action

Zero-Shot Sim-to-Real Transfer

Zero-shot means deploying directly from simulation to the real robot without fine-tuning. To achieve this:

1. Domain Randomization (DR)

Randomize everything possible:

  1. Dynamics: ground friction, base mass, external pushes (see the domain_rand config above)
  2. Sensors: observation noise on joint positions, velocities, and IMU readings
  3. Timing: observation and action latency (see Latency Compensation below)

2. Observation Design

Only use observations available on the real robot:

  1. Joint positions and velocities (motor encoders)
  2. Base angular velocity and projected gravity (IMU)
  3. The velocity command and the previous action

Do NOT use: ground truth position (no indoor GPS), contact forces (expensive sensors), terrain height maps (requires perception).

3. Action Space

Use position targets instead of torques:

target_position = default_angle + action * action_scale

The on-robot PD controller tracks the target position. This is more stable than sending torques directly.
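A minimal numpy sketch of this scheme: the policy output becomes a position target, and a PD law (matching the stiffness/damping gains in the config above) turns it into torque. Names are illustrative:

```python
import numpy as np

def pd_torque(q_target, q, qdot, kp, kd):
    """On-robot PD loop: torque = kp * (target - position) - kd * velocity."""
    return kp * (q_target - q) - kd * qdot

# Policy output -> position target, as in the formula above
default_angle = np.array([0.3])   # e.g. knee default
action = np.array([0.4])          # raw policy output
action_scale = 0.25
q_target = default_angle + action * action_scale

# At the target with zero joint velocity, the commanded torque is zero
tau = pd_torque(q_target, q=q_target, qdot=np.array([0.0]), kp=80.0, kd=2.0)
```

Because the PD loop runs at high rate on the robot, the policy only needs to emit targets at a much lower frequency (the `decimation` setting above), which makes the action space far more forgiving than raw torques.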

4. Latency Compensation

Real robots have latency (~20-50 ms) from observation to action. Add latency to simulation:

class domain_rand:
    observation_delay_range = [0, 3]  # 0-3 timesteps delay
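
One way to simulate this delay is a small FIFO buffer that returns an observation from a few timesteps ago — a sketch with illustrative names, not Humanoid-Gym's actual code:

```python
import random
from collections import deque

class DelayedObservations:
    """Keep recent observations and return one from `delay` timesteps
    ago, with the delay resampled once per episode."""
    def __init__(self, delay_range=(0, 3)):
        self.delay = random.randint(*delay_range)
        self.buffer = deque(maxlen=delay_range[1] + 1)

    def step(self, obs):
        self.buffer.append(obs)
        # Until enough history exists, fall back to the oldest observation
        idx = max(0, len(self.buffer) - 1 - self.delay)
        return self.buffer[idx]
```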

RL vs MPC Comparison for Humanoids

| Criterion          | MPC (Part 3)                | RL (Humanoid-Gym)            |
|--------------------|-----------------------------|------------------------------|
| Inference time     | 5-50 ms                     | 0.1-1 ms                     |
| Training time      | 0 (model-based)             | 2-8 hours                    |
| Robustness         | Medium                      | High (after DR)              |
| Optimality         | High (receding horizon)     | Medium                       |
| Sim-to-real        | Good (when model accurate)  | Requires careful DR          |
| Terrain adaptation | Needs perception            | Learned (proprioception)     |
| Compute hardware   | CPU (real-time)             | GPU (training), CPU (deploy) |
| Customization      | Modify cost function        | Modify reward function       |
| State of the art   | Boston Dynamics, MuJoCo MPC | Unitree, RobotEra, Agility   |

2026 trend: Most companies (Unitree, Tesla, Figure) use RL for locomotion and MPC or learned control for manipulation. RL has proven to be the most effective method for humanoid locomotion in complex environments.

Practical Tips

Training Not Converging?

  1. Check reward function: Log each reward component to identify which one dominates
  2. Reduce learning rate: From 3e-4 to 1e-4
  3. Increase entropy bonus: Encourage more exploration
  4. Simplify terrain: Train on flat ground first
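
For tip 1, a minimal per-component logger might look like this (a hypothetical helper, not part of Humanoid-Gym):

```python
from collections import defaultdict

class RewardLogger:
    """Accumulate per-component reward sums so a dominating term
    becomes visible in the episode means."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.steps = 0

    def log(self, components):
        for name, value in components.items():
            self.sums[name] += value
        self.steps += 1

    def means(self):
        return {name: total / self.steps for name, total in self.sums.items()}

logger = RewardLogger()
logger.log({"tracking_lin_vel": 0.9, "torques": -0.02})
logger.log({"tracking_lin_vel": 0.7, "torques": -0.04})
```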

Policy Showing Strange Behavior?

  1. Check reward hacking: Robot may find shortcuts — e.g., hopping instead of walking
  2. Add regularization: Increase penalties for torques, action rate
  3. Video logging: Watch the robot in simulation to spot issues early

Sim-to-Real Failing?

  1. Increase domain randomization: Add more parameter variation
  2. Check observation mismatch: Compare observations in sim vs real robot
  3. Calibrate actuator model: PD gains, motor delays, deadzones
  4. Sim-to-sim first: Verify in MuJoCo before deploying to real robot
