humanoidhumanoidrlreward-designbipedal

Reward Engineering for Bipedal Walking: The Art of Reward Design

Detailed analysis of 12+ reward terms for humanoid walking, reward hacking avoidance, and reward curriculum from standing to walking.

Nguyễn Anh Tuấn16 tháng 3, 202610 phút đọc
Reward Engineering for Bipedal Walking: The Art of Reward Design

If the observation space is the robot's "eyes," then the reward function is its "soul" — it shapes every behavior the policy learns. Designing rewards for bipedal walking is more art than science: too few reward terms and the robot learns bizarre gaits, too many and training won't converge.

In the previous post — Fundamentals & Environment Setup — we set up the environment. Now, let's dive into the hardest part: designing the reward function.

Overview of Reward Engineering

Why Is Reward Design Hard?

Bipedal walking has many simultaneous objectives:

  • Walk at the desired velocity
  • Keep the torso upright
  • Lift feet high enough (foot clearance)
  • Conserve energy
  • Move smoothly (no jerkiness)
  • Coordinate legs in alternating rhythm

Each objective requires its own reward term, and the weights between terms determine the final behavior.

Reward Hacking — Enemy Number One

Reward hacking occurs when the policy finds a way to maximize reward without actually doing what you want. Classic examples:

Bad reward design The robot will...
Only reward forward velocity Dive forward and fall
Only reward standing upright Stand still, never step
Heavily reward foot clearance Lift feet and never put them down
No energy penalty Oscillate at high frequency continuously

Robot locomotion training

12+ Reward Terms for Bipedal Walking

Below is a complete reward function with detailed explanations for each term:

import torch
import numpy as np

class BipedalRewardFunction:
    """
    Complete reward function for humanoid bipedal walking.
    Synthesized from Humanoid-Gym, Walk These Ways, and practical experience.
    """

    def __init__(self, cfg):
        # Reward weights — THIS IS WHERE THE MAGIC HAPPENS
        self.weights = {
            # === Tracking rewards (primary objectives) ===
            "linear_vel_tracking": 1.5,    # Track desired velocity
            "angular_vel_tracking": 0.8,   # Track desired yaw rate

            # === Posture rewards (balance) ===
            "upright": 0.5,                # Keep torso upright
            "base_height": 0.3,            # Maintain stable height

            # === Gait quality rewards ===
            "foot_clearance": 0.3,         # Lift feet high enough
            "contact_pattern": 0.4,        # Alternate feet properly
            "foot_slip": -0.1,             # Penalize foot slipping

            # === Smoothness rewards ===
            "action_rate": -0.01,          # Penalize sudden action changes
            "torque": -0.00005,            # Penalize large torques
            "joint_acceleration": -0.0001, # Penalize joint acceleration

            # === Safety rewards ===
            "termination": -200.0,         # Heavy penalty for falling
            "joint_limit": -1.0,           # Penalize hitting joint limits

            # === Style rewards (optional) ===
            "feet_air_time": 0.2,          # Reward natural swing time
        }

    def compute_reward(self, state, action, prev_action, command):
        """Compute total reward from all terms."""
        rewards = {}

        # 1. Linear velocity tracking
        # Most important — robot must walk at requested speed
        vel_error = torch.sum(
            torch.square(command[:, :2] - state["base_lin_vel"][:, :2]),
            dim=1
        )
        rewards["linear_vel_tracking"] = torch.exp(-vel_error / 0.25)

        # 2. Angular velocity tracking (yaw rate)
        yaw_error = torch.square(
            command[:, 2] - state["base_ang_vel"][:, 2]
        )
        rewards["angular_vel_tracking"] = torch.exp(-yaw_error / 0.25)

        # 3. Upright posture
        # projected_gravity[2] = -1 when perfectly upright
        rewards["upright"] = torch.square(state["projected_gravity"][:, 2] + 1.0)

        # 4. Base height
        # Keep height near target (0.72m for G1)
        target_height = 0.72
        height_error = torch.square(state["base_height"] - target_height)
        rewards["base_height"] = torch.exp(-height_error / 0.05)

        # 5. Foot clearance
        # Reward when swing foot lifts above 0.06m
        swing_mask = state["foot_contact"] < 0.5
        foot_height = state["foot_height"]
        clearance_reward = torch.where(
            swing_mask,
            torch.clamp(foot_height - 0.06, min=0.0),
            torch.zeros_like(foot_height)
        )
        rewards["foot_clearance"] = torch.sum(clearance_reward, dim=1)

        # 6. Contact pattern — feet must alternate
        left_contact = state["foot_contact"][:, 0]
        right_contact = state["foot_contact"][:, 1]
        both_air = (1 - left_contact) * (1 - right_contact)
        rewards["contact_pattern"] = -both_air

        # 7. Foot slip penalty
        contact_mask = state["foot_contact"] > 0.5
        foot_vel = torch.norm(state["foot_velocity"][:, :, :2], dim=2)
        slip = contact_mask * foot_vel
        rewards["foot_slip"] = -torch.sum(slip, dim=1)

        # 8. Action rate penalty
        action_diff = torch.sum(torch.square(action - prev_action), dim=1)
        rewards["action_rate"] = -action_diff

        # 9. Torque penalty
        torques = state["applied_torques"]
        rewards["torque"] = -torch.sum(torch.square(torques), dim=1)

        # 10. Joint acceleration penalty
        joint_acc = state["joint_accelerations"]
        rewards["joint_acceleration"] = -torch.sum(torch.square(joint_acc), dim=1)

        # 11. Termination penalty
        rewards["termination"] = state["is_terminated"].float()

        # 12. Joint limit penalty
        joint_pos = state["joint_positions"]
        lower = state["joint_lower_limits"]
        upper = state["joint_upper_limits"]
        margin = 0.1
        below = torch.clamp(lower + margin - joint_pos, min=0.0)
        above = torch.clamp(joint_pos - upper + margin, min=0.0)
        rewards["joint_limit"] = -torch.sum(below + above, dim=1)

        # 13. Feet air time (style reward)
        target_air_time = 0.3
        air_time_error = torch.abs(
            state["feet_air_time"] - target_air_time
        )
        rewards["feet_air_time"] = torch.sum(
            torch.exp(-air_time_error / 0.1), dim=1
        )

        # Compute weighted total
        total_reward = torch.zeros(state["base_height"].shape[0])
        for name, value in rewards.items():
            total_reward += self.weights[name] * value

        return total_reward, rewards

Reward Weighting Strategies

Method 1: Manual Tuning (Most Common)

Start with default weights, train for ~30 minutes, watch behavior videos, and adjust:

# Iteration 1: Robot dives forward
weights_v1 = {"linear_vel_tracking": 2.0, "upright": 0.1}
# -> Increase upright, decrease velocity

# Iteration 2: Robot stands still, very upright
weights_v2 = {"linear_vel_tracking": 1.0, "upright": 1.0}
# -> Better balanced, but robot doesn't lift feet

# Iteration 3: Add foot clearance + contact pattern
weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# -> Robot starts walking!

Method 2: Exponential Reward Scaling

Instead of linear rewards, use exponential to create a sharp peak near the target:

def exponential_tracking_reward(error, sigma=0.25):
    """
    exp(-error^2 / sigma) creates a bell-shaped reward.
    - Small sigma: only rewards when very close to target
    - Large sigma: rewards even when far from target
    """
    return torch.exp(-torch.square(error) / sigma)

# Velocity tracking with sigma=0.25
vel_reward = exponential_tracking_reward(vel_error, sigma=0.25)

# Height tracking with sigma=0.05 (stricter)
height_reward = exponential_tracking_reward(height_error, sigma=0.05)

Method 3: Adaptive Reward Scaling

Automatically adjust weights based on training progress:

class AdaptiveRewardScaler:
    """Auto-scale rewards to maintain equivalent magnitudes."""

    def __init__(self, reward_names, target_magnitude=1.0):
        self.ema = {name: target_magnitude for name in reward_names}
        self.alpha = 0.99

    def scale(self, rewards_dict):
        scaled = {}
        for name, value in rewards_dict.items():
            mag = torch.abs(value).mean().item()
            self.ema[name] = self.alpha * self.ema[name] + (1 - self.alpha) * mag

            if self.ema[name] > 1e-6:
                scaled[name] = value / self.ema[name]
            else:
                scaled[name] = value
        return scaled

Reward function design

Reward Curriculum: From Standing to Walking

Instead of training walking from scratch, curriculum learning helps stabilize training:

Phase 1: Standing (0-500 iterations)

curriculum_phase_1 = {
    "linear_vel_tracking": 0.0,     # Don't need to walk yet
    "upright": 2.0,                  # Priority: stand upright
    "base_height": 1.0,             # Maintain height
    "termination": -200.0,          # Penalize falling
    "torque": -0.0001,              # Energy efficiency
}
# Command velocity = [0, 0, 0]

Phase 2: Weight Shifting (500-1500 iterations)

curriculum_phase_2 = {
    "linear_vel_tracking": 0.3,     # Start tracking small velocity
    "upright": 1.5,
    "base_height": 0.8,
    "contact_pattern": 0.2,         # Start requiring foot alternation
    "foot_clearance": 0.1,
    "termination": -200.0,
    "torque": -0.0001,
}
# Command velocity = [0.2, 0, 0]  # slow

Phase 3: Walking (1500-5000 iterations)

curriculum_phase_3 = {
    "linear_vel_tracking": 1.5,     # Full tracking
    "angular_vel_tracking": 0.8,
    "upright": 0.5,
    "base_height": 0.3,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
    "foot_slip": -0.1,
    "action_rate": -0.01,
    "torque": -0.00005,
    "joint_acceleration": -0.0001,
    "termination": -200.0,
    "joint_limit": -1.0,
    "feet_air_time": 0.2,
}
# Command velocity = random[-1.0, 1.0] m/s
class RewardCurriculum:
    """Reward curriculum manager."""

    def __init__(self):
        self.phases = [
            (0, curriculum_phase_1),
            (500, curriculum_phase_2),
            (1500, curriculum_phase_3),
        ]

    def get_weights(self, iteration):
        """Return weights appropriate for current iteration."""
        current_weights = self.phases[0][1]
        for threshold, weights in self.phases:
            if iteration >= threshold:
                current_weights = weights
        return current_weights

    def get_command_range(self, iteration):
        """Gradually increase command velocity range."""
        if iteration < 500:
            return {"vx": [0, 0], "vy": [0, 0], "yaw": [0, 0]}
        elif iteration < 1500:
            return {"vx": [0, 0.3], "vy": [-0.1, 0.1], "yaw": [-0.2, 0.2]}
        else:
            progress = min((iteration - 1500) / 3500, 1.0)
            max_vx = 0.3 + progress * 0.7
            return {
                "vx": [-0.3, max_vx],
                "vy": [-0.3 * progress, 0.3 * progress],
                "yaw": [-0.5 * progress, 0.5 * progress],
            }

Ablation Study: What Does Each Reward Term Contribute?

The table below summarizes results when removing each reward term:

Reward term removed Consequence Impact level
linear_vel_tracking Robot doesn't walk Critical
upright Robot tilts and falls after 2-3 steps Critical
foot_clearance Robot drags feet (shuffling) High
contact_pattern Robot hops or stands on both feet High
action_rate Robot jerks violently, motor wear Medium
torque High energy consumption, motor overheating Medium
foot_slip Feet slide on ground Medium
base_height Robot crouches/squats Medium
joint_limit Joints hit mechanical limits Low-Medium
feet_air_time Unnatural gait timing Low
joint_acceleration Slightly jerky motion Low

Comparing Reward Formulations from Papers

Paper Reward terms Key feature
Walk These Ways (Margolis 2023) 13 Gait frequency command, versatile gaits
Humanoid-Gym (Gu 2024) 10 Focus sim-to-real, conservative rewards
Learning Humanoid Locomotion (Radosavovic 2024) 8 Transformer policy, simpler rewards
Expressive Humanoid (Cheng 2024) 15+ AMP style reward, reference motions

Pitfalls and Best Practices

Pitfall 1: Reward Magnitude Mismatch

# WRONG: vel_tracking ~1.0 but torque ~10000
rewards = vel_reward + 0.001 * torque_penalty
# Despite small weight, torque penalty still dominates gradient

# RIGHT: Normalize before weighting
rewards = vel_reward + 0.001 * (torque_penalty / torque_scale)

Pitfall 2: Sparse vs Dense Rewards

# WRONG: Only reward reaching the goal (sparse)
reward = 100.0 if reached_goal else 0.0
# Robot has no signal about correct direction until random goal reach

# RIGHT: Reward continuous progress (dense)
reward = -distance_to_goal + velocity_toward_goal

Pitfall 3: Forgetting Termination Penalty

# WRONG: No penalty for falling
# Robot will try risky behaviors since there's no penalty

# RIGHT: Heavy penalty for falling
if is_terminated:
    reward -= 200.0

To learn more about quadruped locomotion (a simpler problem with the same principles), see Quadruped Locomotion with RL.

Training visualization

Summary

Reward engineering for bipedal walking is not "set and forget" — it is an iterative process: design, train, watch videos, adjust. Key takeaways:

  1. 12+ reward terms are necessary for natural gaits
  2. Exponential tracking is better than linear for velocity/height targets
  3. Reward curriculum (standing, walking, running) stabilizes training
  4. Ablation study reveals each term's importance
  5. Reward hacking is the enemy — always watch behavior videos, not just reward curves

Next post — Unitree G1: Training a Walking Policy from Scratch — applies all this knowledge to real training on the Unitree G1 in Isaac Lab.

References

  1. Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior — Margolis & Agrawal, CoRL 2023
  2. Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition — Siekmann et al., ICRA 2021
  3. Humanoid-Gym: Reinforcement Learning for Humanoid Robot — Gu et al., 2024
  4. Learning Humanoid Locomotion with Transformers — Radosavovic et al., 2024
NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Bài viết liên quan

NEWTutorial
WholeBodyVLA Tutorial: Teleop → Train → Deploy Humanoid
wholebodyvlavlahumanoidloco-manipulationiclr-2026agibot-x2teleoprl

WholeBodyVLA Tutorial: Teleop → Train → Deploy Humanoid

ICLR 2026 — pipeline thực chiến từ thu thập teleop, train unified latent VLA đến deploy whole-body loco-manipulation trên AgiBot X2.

11/5/202611 phút đọc
NEWTutorial
Booster Gym ICRA 2026: Train Humanoid T1 Sim-to-Real với Isaac Gym
humanoidisaac-gymreinforcement-learningsim2realbooster-t1icra-2026

Booster Gym ICRA 2026: Train Humanoid T1 Sim-to-Real với Isaac Gym

Hướng dẫn chi tiết Booster Gym — RL framework end-to-end open-source train humanoid Booster T1 walking từ teleop đến deploy thực tế.

6/5/202611 phút đọc
Tutorial
LIFT humanoid ICLR 2026: pretrain SAC + world model finetune
humanoidreinforcement-learningsacworld-modelmujoco-playgroundbraxsim2simiclr-2026booster-t1unitree-g1

LIFT humanoid ICLR 2026: pretrain SAC + world model finetune

Reproduce LIFT ICLR 2026: pretrain SAC trong MuJoCo Playground rồi sim2sim finetune world model trên Brax cho Booster T1, Unitree G1.

24/4/202612 phút đọc