
Reward Engineering for Bipedal Walking: The Art of Reward Design

Detailed analysis of 12+ reward terms for humanoid walking, reward hacking avoidance, and reward curriculum from standing to walking.

Nguyễn Anh Tuấn · March 16, 2026 · 10 min read

If the observation space is the robot's "eyes," then the reward function is its "soul" — it shapes every behavior the policy learns. Designing rewards for bipedal walking is more art than science: with too few reward terms the robot learns bizarre gaits; with too many, training won't converge.

In the previous post — Fundamentals & Environment Setup — we set up the environment. Now, let's dive into the hardest part: designing the reward function.

Overview of Reward Engineering

Why Is Reward Design Hard?

Bipedal walking has many simultaneous objectives:

  - Track a commanded velocity (forward, lateral, and yaw)
  - Stay balanced and upright at a stable base height
  - Produce a natural alternating gait with enough foot clearance
  - Move smoothly and energy-efficiently
  - Avoid falls and joint-limit violations

Each objective requires its own reward term, and the weights between terms determine the final behavior.

Reward Hacking — Enemy Number One

Reward hacking occurs when the policy finds a way to maximize reward without actually doing what you want. Classic examples:

| Bad reward design | The robot will... |
| --- | --- |
| Only reward forward velocity | Dive forward and fall |
| Only reward standing upright | Stand still, never step |
| Heavily reward foot clearance | Lift feet and never put them down |
| No energy penalty | Oscillate at high frequency continuously |
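Each failure mode above is usually fixed by pairing the exploited term with a counteracting one. A minimal pure-Python sketch (hypothetical states and sigmas) showing how combining bell-shaped velocity tracking with an upright term removes the "dive forward" optimum that a raw-velocity reward creates:

```python
import math

def hacked_reward(vel_x, tilt):
    # Naive design: reward raw forward velocity only (tilt is ignored — that's the flaw)
    return vel_x

def shaped_reward(vel_x, tilt, cmd=0.5):
    # Bell-shaped tracking toward the commanded 0.5 m/s, plus an upright term
    r_vel = math.exp(-(cmd - vel_x) ** 2 / 0.25)
    r_up = math.exp(-tilt ** 2 / 0.1)
    return 1.5 * r_vel + 0.5 * r_up

# "Diving" state: fast but falling (large tilt); "walking": on-command and upright
dive, walk = (1.5, 1.0), (0.5, 0.05)
assert hacked_reward(*dive) > hacked_reward(*walk)   # naive reward prefers diving
assert shaped_reward(*walk) > shaped_reward(*dive)   # shaped reward prefers walking
```

The numbers are illustrative, but the pattern is general: every "exploitable" term needs a partner term that makes the exploit unprofitable.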

[Figure: Robot locomotion training]

12+ Reward Terms for Bipedal Walking

Below is a complete reward function with detailed explanations for each term:

import torch

class BipedalRewardFunction:
    """
    Complete reward function for humanoid bipedal walking.
    Synthesized from Humanoid-Gym, Walk These Ways, and practical experience.
    """

    def __init__(self, cfg):
        # Reward weights — THIS IS WHERE THE MAGIC HAPPENS
        self.weights = {
            # === Tracking rewards (primary objectives) ===
            "linear_vel_tracking": 1.5,    # Track desired velocity
            "angular_vel_tracking": 0.8,   # Track desired yaw rate

            # === Posture rewards (balance) ===
            "upright": 0.5,                # Keep torso upright
            "base_height": 0.3,            # Maintain stable height

            # === Gait quality rewards ===
            "foot_clearance": 0.3,         # Lift feet high enough
            "contact_pattern": 0.4,        # Alternate feet properly
            "foot_slip": -0.1,             # Penalize foot slipping

            # === Smoothness rewards ===
            "action_rate": -0.01,          # Penalize sudden action changes
            "torque": -0.00005,            # Penalize large torques
            "joint_acceleration": -0.0001, # Penalize joint acceleration

            # === Safety rewards ===
            "termination": -200.0,         # Heavy penalty for falling
            "joint_limit": -1.0,           # Penalize hitting joint limits

            # === Style rewards (optional) ===
            "feet_air_time": 0.2,          # Reward natural swing time
        }

    def compute_reward(self, state, action, prev_action, command):
        """Compute total reward from all terms."""
        rewards = {}

        # 1. Linear velocity tracking
        # Most important — robot must walk at requested speed
        vel_error = torch.sum(
            torch.square(command[:, :2] - state["base_lin_vel"][:, :2]),
            dim=1
        )
        rewards["linear_vel_tracking"] = torch.exp(-vel_error / 0.25)

        # 2. Angular velocity tracking (yaw rate)
        yaw_error = torch.square(
            command[:, 2] - state["base_ang_vel"][:, 2]
        )
        rewards["angular_vel_tracking"] = torch.exp(-yaw_error / 0.25)

        # 3. Upright posture
        # projected_gravity[2] = -1 when perfectly upright, so this error
        # is zero when upright and the exponential peaks at 1
        rewards["upright"] = torch.exp(
            -torch.square(state["projected_gravity"][:, 2] + 1.0) / 0.1
        )

        # 4. Base height
        # Keep height near target (0.72m for G1)
        target_height = 0.72
        height_error = torch.square(state["base_height"] - target_height)
        rewards["base_height"] = torch.exp(-height_error / 0.05)

        # 5. Foot clearance
        # Reward when swing foot lifts above 0.06m
        swing_mask = state["foot_contact"] < 0.5
        foot_height = state["foot_height"]
        clearance_reward = torch.where(
            swing_mask,
            # cap the bonus so ever-higher lifts aren't rewarded indefinitely
            torch.clamp(foot_height - 0.06, min=0.0, max=0.06),
            torch.zeros_like(foot_height)
        )
        rewards["foot_clearance"] = torch.sum(clearance_reward, dim=1)

        # 6. Contact pattern — penalize flight phases (both feet in the air);
        # together with feet_air_time this encourages alternating steps
        left_contact = state["foot_contact"][:, 0]
        right_contact = state["foot_contact"][:, 1]
        both_air = (1 - left_contact) * (1 - right_contact)
        rewards["contact_pattern"] = -both_air

        # 7. Foot slip penalty
        contact_mask = state["foot_contact"] > 0.5
        foot_vel = torch.norm(state["foot_velocity"][:, :, :2], dim=2)
        slip = contact_mask * foot_vel
        rewards["foot_slip"] = -torch.sum(slip, dim=1)

        # 8. Action rate penalty
        action_diff = torch.sum(torch.square(action - prev_action), dim=1)
        rewards["action_rate"] = -action_diff

        # 9. Torque penalty
        torques = state["applied_torques"]
        rewards["torque"] = -torch.sum(torch.square(torques), dim=1)

        # 10. Joint acceleration penalty
        joint_acc = state["joint_accelerations"]
        rewards["joint_acceleration"] = -torch.sum(torch.square(joint_acc), dim=1)

        # 11. Termination penalty (weight is -200, so a fall contributes -200)
        rewards["termination"] = state["is_terminated"].float()

        # 12. Joint limit penalty
        joint_pos = state["joint_positions"]
        lower = state["joint_lower_limits"]
        upper = state["joint_upper_limits"]
        margin = 0.1
        below = torch.clamp(lower + margin - joint_pos, min=0.0)
        above = torch.clamp(joint_pos - upper + margin, min=0.0)
        rewards["joint_limit"] = -torch.sum(below + above, dim=1)

        # 13. Feet air time (style reward)
        target_air_time = 0.3
        air_time_error = torch.abs(
            state["feet_air_time"] - target_air_time
        )
        rewards["feet_air_time"] = torch.sum(
            torch.exp(-air_time_error / 0.1), dim=1
        )

        # Compute weighted total (zeros_like keeps device and dtype consistent)
        total_reward = torch.zeros_like(state["base_height"])
        for name, value in rewards.items():
            total_reward += self.weights[name] * value

        return total_reward, rewards
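When debugging a reward function like this, it pays to log each term's weighted contribution per step rather than only the total — one term silently dominating is the first sign of hacking. A minimal standalone sketch of that bookkeeping (plain floats instead of batched tensors, with hypothetical values):

```python
def weighted_breakdown(rewards, weights):
    """Return each term's weighted contribution and their sum."""
    contrib = {name: weights[name] * value for name, value in rewards.items()}
    return contrib, sum(contrib.values())

weights = {"linear_vel_tracking": 1.5, "upright": 0.5, "torque": -0.00005}
rewards = {"linear_vel_tracking": 0.9, "upright": 0.95, "torque": 4000.0}

contrib, total = weighted_breakdown(rewards, weights)
# torque contributes -0.00005 * 4000 = -0.2 here; if it dwarfed the
# tracking terms instead, the weights (or normalization) need revisiting
```

Logging `contrib` to TensorBoard per iteration makes weight tuning far less blind than watching the scalar total.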

Reward Weighting Strategies

Method 1: Manual Tuning (Most Common)

Start with default weights, train for ~30 minutes, watch behavior videos, and adjust:

# Iteration 1: Robot dives forward
weights_v1 = {"linear_vel_tracking": 2.0, "upright": 0.1}
# -> Increase upright, decrease velocity

# Iteration 2: Robot stands still, very upright
weights_v2 = {"linear_vel_tracking": 1.0, "upright": 1.0}
# -> Better balanced, but robot doesn't lift feet

# Iteration 3: Add foot clearance + contact pattern
weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# -> Robot starts walking!

Method 2: Exponential Reward Scaling

Instead of linear rewards, use an exponential kernel that peaks sharply at the target:

def exponential_tracking_reward(error, sigma=0.25):
    """
    exp(-error^2 / sigma) creates a bell-shaped reward.
    - Small sigma: only rewards when very close to target
    - Large sigma: rewards even when far from target
    """
    return torch.exp(-torch.square(error) / sigma)

# Velocity tracking with sigma=0.25
vel_reward = exponential_tracking_reward(vel_error, sigma=0.25)

# Height tracking with sigma=0.05 (stricter)
height_reward = exponential_tracking_reward(height_error, sigma=0.05)
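The effect of sigma is easy to see numerically. For the same 0.2 m/s error, a tight sigma pays far less than a loose one (pure-Python equivalent of the function above):

```python
import math

def exponential_tracking_reward(error, sigma=0.25):
    """Bell-shaped reward: 1.0 at zero error, decaying with scale sigma."""
    return math.exp(-error ** 2 / sigma)

err = 0.2  # 0.2 m/s off the commanded velocity
loose = exponential_tracking_reward(err, sigma=0.25)  # ~0.85
tight = exponential_tracking_reward(err, sigma=0.05)  # ~0.45
```

A loose sigma gives gradient signal early in training when errors are large; a tight one sharpens precision once tracking is roughly working.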

Method 3: Adaptive Reward Scaling

Automatically adjust weights based on training progress:

class AdaptiveRewardScaler:
    """Auto-scale rewards to maintain equivalent magnitudes."""

    def __init__(self, reward_names, target_magnitude=1.0):
        self.ema = {name: target_magnitude for name in reward_names}
        self.alpha = 0.99

    def scale(self, rewards_dict):
        scaled = {}
        for name, value in rewards_dict.items():
            mag = torch.abs(value).mean().item()
            self.ema[name] = self.alpha * self.ema[name] + (1 - self.alpha) * mag

            if self.ema[name] > 1e-6:
                scaled[name] = value / self.ema[name]
            else:
                scaled[name] = value
        return scaled
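A quick way to see what the scaler does: feed it a term whose raw magnitude is around 1000 and watch the EMA pull the scaled value toward unit magnitude. This is the same update reduced to a single scalar stream, with hypothetical numbers:

```python
# EMA-based auto-scaling, reduced to one scalar reward stream
ema, alpha = 1.0, 0.99

def scale(value):
    """Update the running magnitude estimate, then normalize by it."""
    global ema
    ema = alpha * ema + (1 - alpha) * abs(value)
    return value / ema if ema > 1e-6 else value

scaled = [scale(1000.0) for _ in range(2000)]
# early values are large (the EMA is still near 1); late values approach 1.0
```

The caveat of adaptive scaling is that it changes the effective weights over time, which can mask deliberate weight choices — many practitioners use it only for initial calibration, then freeze the scales.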

[Figure: Reward function design]

Reward Curriculum: From Standing to Walking

Rather than learning to walk from scratch, a staged curriculum stabilizes training:

Phase 1: Standing (0-500 iterations)

curriculum_phase_1 = {
    "linear_vel_tracking": 0.0,     # Don't need to walk yet
    "upright": 2.0,                  # Priority: stand upright
    "base_height": 1.0,             # Maintain height
    "termination": -200.0,          # Penalize falling
    "torque": -0.0001,              # Energy efficiency
}
# Command velocity = [0, 0, 0]

Phase 2: Weight Shifting (500-1500 iterations)

curriculum_phase_2 = {
    "linear_vel_tracking": 0.3,     # Start tracking small velocity
    "upright": 1.5,
    "base_height": 0.8,
    "contact_pattern": 0.2,         # Start requiring foot alternation
    "foot_clearance": 0.1,
    "termination": -200.0,
    "torque": -0.0001,
}
# Command velocity = [0.2, 0, 0]  # slow

Phase 3: Walking (1500-5000 iterations)

curriculum_phase_3 = {
    "linear_vel_tracking": 1.5,     # Full tracking
    "angular_vel_tracking": 0.8,
    "upright": 0.5,
    "base_height": 0.3,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
    "foot_slip": -0.1,
    "action_rate": -0.01,
    "torque": -0.00005,
    "joint_acceleration": -0.0001,
    "termination": -200.0,
    "joint_limit": -1.0,
    "feet_air_time": 0.2,
}
# Command velocity = random[-1.0, 1.0] m/s

A small manager class switches weights based on the training iteration:

class RewardCurriculum:
    """Reward curriculum manager."""

    def __init__(self):
        self.phases = [
            (0, curriculum_phase_1),
            (500, curriculum_phase_2),
            (1500, curriculum_phase_3),
        ]

    def get_weights(self, iteration):
        """Return weights appropriate for current iteration."""
        current_weights = self.phases[0][1]
        for threshold, weights in self.phases:
            if iteration >= threshold:
                current_weights = weights
        return current_weights

    def get_command_range(self, iteration):
        """Gradually increase command velocity range."""
        if iteration < 500:
            return {"vx": [0, 0], "vy": [0, 0], "yaw": [0, 0]}
        elif iteration < 1500:
            return {"vx": [0, 0.3], "vy": [-0.1, 0.1], "yaw": [-0.2, 0.2]}
        else:
            progress = min((iteration - 1500) / 3500, 1.0)
            max_vx = 0.3 + progress * 0.7
            return {
                "vx": [-0.3, max_vx],
                "vy": [-0.3 * progress, 0.3 * progress],
                "yaw": [-0.5 * progress, 0.5 * progress],
            }
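The phase lookup is just "latest threshold not exceeding the current iteration." A compact standalone check of that logic with the thresholds above (phase weights abbreviated to labels):

```python
# Same lookup as RewardCurriculum.get_weights, with labels standing in for weight dicts
phases = [(0, "phase_1"), (500, "phase_2"), (1500, "phase_3")]

def get_phase(iteration):
    """Return the phase whose threshold was most recently passed."""
    current = phases[0][1]
    for threshold, phase in phases:
        if iteration >= threshold:
            current = phase
    return current

assert get_phase(100) == "phase_1"
assert get_phase(500) == "phase_2"    # thresholds are inclusive
assert get_phase(4000) == "phase_3"
```

Note that this relies on `phases` being sorted by threshold; if you add phases later, keep the list in ascending order.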

Ablation Study: What Does Each Reward Term Contribute?

The table below summarizes results when removing each reward term:

| Reward term removed | Consequence | Impact level |
| --- | --- | --- |
| linear_vel_tracking | Robot doesn't walk | Critical |
| upright | Robot tilts and falls after 2-3 steps | Critical |
| foot_clearance | Robot drags feet (shuffling) | High |
| contact_pattern | Robot hops or stands on both feet | High |
| action_rate | Robot jerks violently, motor wear | Medium |
| torque | High energy consumption, motor overheating | Medium |
| foot_slip | Feet slide on ground | Medium |
| base_height | Robot crouches/squats | Medium |
| joint_limit | Joints hit mechanical limits | Low-Medium |
| feet_air_time | Unnatural gait timing | Low |
| joint_acceleration | Slightly jerky motion | Low |

Comparing Reward Formulations from Papers

| Paper | Reward terms | Key feature |
| --- | --- | --- |
| Walk These Ways (Margolis 2023) | 13 | Gait frequency command, versatile gaits |
| Humanoid-Gym (Gu 2024) | 10 | Focus on sim-to-real, conservative rewards |
| Learning Humanoid Locomotion (Radosavovic 2024) | 8 | Transformer policy, simpler rewards |
| Expressive Humanoid (Cheng 2024) | 15+ | AMP-style reward, reference motions |

Pitfalls and Best Practices

Pitfall 1: Reward Magnitude Mismatch

# WRONG: vel_tracking ~1.0 but torque ~10000
rewards = vel_reward + 0.001 * torque_penalty
# Despite the small weight, the torque penalty still dominates the gradient

# RIGHT: Normalize before weighting
rewards = vel_reward + 0.001 * (torque_penalty / torque_scale)

Pitfall 2: Sparse vs Dense Rewards

# WRONG: Only reward reaching the goal (sparse)
reward = 100.0 if reached_goal else 0.0
# The robot gets no directional signal until it randomly reaches the goal

# RIGHT: Reward continuous progress (dense)
reward = -distance_to_goal + velocity_toward_goal

Pitfall 3: Forgetting Termination Penalty

# WRONG: No penalty for falling
# Robot will try risky behaviors since there's no penalty

# RIGHT: Heavy penalty for falling
if is_terminated:
    reward -= 200.0

To learn more about quadruped locomotion (a simpler problem with the same principles), see Quadruped Locomotion with RL.

[Figure: Training visualization]

Summary

Reward engineering for bipedal walking is not "set and forget" — it is an iterative process: design, train, watch videos, adjust. Key takeaways:

  1. 12+ reward terms are necessary for natural gaits
  2. Exponential tracking is better than linear for velocity/height targets
  3. A reward curriculum (standing, weight shifting, walking) stabilizes training
  4. Ablation study reveals each term's importance
  5. Reward hacking is the enemy — always watch behavior videos, not just reward curves

Next post — Unitree G1: Training a Walking Policy from Scratch — applies all this knowledge to real training on the Unitree G1 in Isaac Lab.

References

  1. Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior — Margolis & Agrawal, CoRL 2023
  2. Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition — Siekmann et al., ICRA 2021
  3. Humanoid-Gym: Reinforcement Learning for Humanoid Robot — Gu et al., 2024
  4. Learning Humanoid Locomotion with Transformers — Radosavovic et al., 2024
