
Reward Engineering for Bipedal Walking: The Art of Reward Design

Detailed analysis of 12+ reward terms for humanoid walking, reward hacking avoidance, and reward curriculum from standing to walking.

Nguyễn Anh Tuấn · March 16, 2026 · 10 min read

If the observation space is the robot's "eyes," then the reward function is its "soul" — it shapes every behavior the policy learns. Designing rewards for bipedal walking is more art than science: with too few reward terms the robot learns bizarre gaits; with too many, training won't converge.

In the previous post — Fundamentals & Environment Setup — we set up the environment. Now, let's dive into the hardest part: designing the reward function.

Overview of Reward Engineering

Why Is Reward Design Hard?

Bipedal walking has many simultaneous objectives:

  - Track a commanded velocity (forward, lateral, and yaw)
  - Stay balanced and upright at a stable base height
  - Produce a natural alternating gait with enough foot clearance
  - Move smoothly and energy-efficiently
  - Avoid falls and joint-limit violations

Each objective requires its own reward term, and the weights between terms determine the final behavior.

Reward Hacking — Enemy Number One

Reward hacking occurs when the policy finds a way to maximize reward without actually doing what you want. Classic examples:

| Bad reward design | The robot will... |
| --- | --- |
| Only reward forward velocity | Dive forward and fall |
| Only reward standing upright | Stand still, never step |
| Heavily reward foot clearance | Lift feet and never put them down |
| No energy penalty | Oscillate at high frequency continuously |
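Each failure mode above is usually fixed by pairing the exploited term with a counteracting one. A minimal pure-Python sketch (hypothetical states and sigmas) showing how combining bell-shaped velocity tracking with an upright term removes the "dive forward" optimum that a raw-velocity reward creates:

```python
import math

def hacked_reward(vel_x, tilt):
    # Naive design: reward raw forward velocity only (tilt is ignored — that's the flaw)
    return vel_x

def shaped_reward(vel_x, tilt, cmd=0.5):
    # Bell-shaped tracking toward the commanded 0.5 m/s, plus an upright term
    r_vel = math.exp(-(cmd - vel_x) ** 2 / 0.25)
    r_up = math.exp(-tilt ** 2 / 0.1)
    return 1.5 * r_vel + 0.5 * r_up

# "Diving" state: fast but falling (large tilt); "walking": on-command and upright
dive, walk = (1.5, 1.0), (0.5, 0.05)
assert hacked_reward(*dive) > hacked_reward(*walk)   # naive reward prefers diving
assert shaped_reward(*walk) > shaped_reward(*dive)   # shaped reward prefers walking
```

The numbers are illustrative, but the pattern is general: every "exploitable" term needs a partner term that makes the exploit unprofitable.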

[Figure: Robot locomotion training]

12+ Reward Terms for Bipedal Walking

Below is a complete reward function with detailed explanations for each term:

import torch

class BipedalRewardFunction:
    """
    Complete reward function for humanoid bipedal walking.
    Synthesized from Humanoid-Gym, Walk These Ways, and practical experience.
    """

    def __init__(self, cfg):
        # Reward weights — THIS IS WHERE THE MAGIC HAPPENS
        self.weights = {
            # === Tracking rewards (primary objectives) ===
            "linear_vel_tracking": 1.5,    # Track desired velocity
            "angular_vel_tracking": 0.8,   # Track desired yaw rate

            # === Posture rewards (balance) ===
            "upright": 0.5,                # Keep torso upright
            "base_height": 0.3,            # Maintain stable height

            # === Gait quality rewards ===
            "foot_clearance": 0.3,         # Lift feet high enough
            "contact_pattern": 0.4,        # Alternate feet properly
            "foot_slip": -0.1,             # Penalize foot slipping

            # === Smoothness rewards ===
            "action_rate": -0.01,          # Penalize sudden action changes
            "torque": -0.00005,            # Penalize large torques
            "joint_acceleration": -0.0001, # Penalize joint acceleration

            # === Safety rewards ===
            "termination": -200.0,         # Heavy penalty for falling
            "joint_limit": -1.0,           # Penalize hitting joint limits

            # === Style rewards (optional) ===
            "feet_air_time": 0.2,          # Reward natural swing time
        }

    def compute_reward(self, state, action, prev_action, command):
        """Compute total reward from all terms."""
        rewards = {}

        # 1. Linear velocity tracking
        # Most important — robot must walk at requested speed
        vel_error = torch.sum(
            torch.square(command[:, :2] - state["base_lin_vel"][:, :2]),
            dim=1
        )
        rewards["linear_vel_tracking"] = torch.exp(-vel_error / 0.25)

        # 2. Angular velocity tracking (yaw rate)
        yaw_error = torch.square(
            command[:, 2] - state["base_ang_vel"][:, 2]
        )
        rewards["angular_vel_tracking"] = torch.exp(-yaw_error / 0.25)

        # 3. Upright posture
        # projected_gravity[2] = -1 when perfectly upright, so this error
        # is zero when upright and the exponential peaks at 1
        rewards["upright"] = torch.exp(
            -torch.square(state["projected_gravity"][:, 2] + 1.0) / 0.1
        )

        # 4. Base height
        # Keep height near target (0.72m for G1)
        target_height = 0.72
        height_error = torch.square(state["base_height"] - target_height)
        rewards["base_height"] = torch.exp(-height_error / 0.05)

        # 5. Foot clearance
        # Reward when swing foot lifts above 0.06m
        swing_mask = state["foot_contact"] < 0.5
        foot_height = state["foot_height"]
        clearance_reward = torch.where(
            swing_mask,
            # cap the bonus so ever-higher lifts aren't rewarded indefinitely
            torch.clamp(foot_height - 0.06, min=0.0, max=0.06),
            torch.zeros_like(foot_height)
        )
        rewards["foot_clearance"] = torch.sum(clearance_reward, dim=1)

        # 6. Contact pattern — penalize flight phases (both feet in the air);
        # together with feet_air_time this encourages alternating steps
        left_contact = state["foot_contact"][:, 0]
        right_contact = state["foot_contact"][:, 1]
        both_air = (1 - left_contact) * (1 - right_contact)
        rewards["contact_pattern"] = -both_air

        # 7. Foot slip penalty
        contact_mask = state["foot_contact"] > 0.5
        foot_vel = torch.norm(state["foot_velocity"][:, :, :2], dim=2)
        slip = contact_mask * foot_vel
        rewards["foot_slip"] = -torch.sum(slip, dim=1)

        # 8. Action rate penalty
        action_diff = torch.sum(torch.square(action - prev_action), dim=1)
        rewards["action_rate"] = -action_diff

        # 9. Torque penalty
        torques = state["applied_torques"]
        rewards["torque"] = -torch.sum(torch.square(torques), dim=1)

        # 10. Joint acceleration penalty
        joint_acc = state["joint_accelerations"]
        rewards["joint_acceleration"] = -torch.sum(torch.square(joint_acc), dim=1)

        # 11. Termination penalty (weight is -200, so a fall contributes -200)
        rewards["termination"] = state["is_terminated"].float()

        # 12. Joint limit penalty
        joint_pos = state["joint_positions"]
        lower = state["joint_lower_limits"]
        upper = state["joint_upper_limits"]
        margin = 0.1
        below = torch.clamp(lower + margin - joint_pos, min=0.0)
        above = torch.clamp(joint_pos - upper + margin, min=0.0)
        rewards["joint_limit"] = -torch.sum(below + above, dim=1)

        # 13. Feet air time (style reward)
        target_air_time = 0.3
        air_time_error = torch.abs(
            state["feet_air_time"] - target_air_time
        )
        rewards["feet_air_time"] = torch.sum(
            torch.exp(-air_time_error / 0.1), dim=1
        )

        # Compute weighted total (zeros_like keeps device and dtype consistent)
        total_reward = torch.zeros_like(state["base_height"])
        for name, value in rewards.items():
            total_reward += self.weights[name] * value

        return total_reward, rewards
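When debugging a reward function like this, it pays to log each term's weighted contribution per step rather than only the total — one term silently dominating is the first sign of hacking. A minimal standalone sketch of that bookkeeping (plain floats instead of batched tensors, with hypothetical values):

```python
def weighted_breakdown(rewards, weights):
    """Return each term's weighted contribution and their sum."""
    contrib = {name: weights[name] * value for name, value in rewards.items()}
    return contrib, sum(contrib.values())

weights = {"linear_vel_tracking": 1.5, "upright": 0.5, "torque": -0.00005}
rewards = {"linear_vel_tracking": 0.9, "upright": 0.95, "torque": 4000.0}

contrib, total = weighted_breakdown(rewards, weights)
# torque contributes -0.00005 * 4000 = -0.2 here; if it dwarfed the
# tracking terms instead, the weights (or normalization) need revisiting
```

Logging `contrib` to TensorBoard per iteration makes weight tuning far less blind than watching the scalar total.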

Reward Weighting Strategies

Method 1: Manual Tuning (Most Common)

Start with default weights, train for ~30 minutes, watch behavior videos, and adjust:

# Iteration 1: Robot dives forward
weights_v1 = {"linear_vel_tracking": 2.0, "upright": 0.1}
# -> Increase upright, decrease velocity

# Iteration 2: Robot stands still, very upright
weights_v2 = {"linear_vel_tracking": 1.0, "upright": 1.0}
# -> Better balanced, but robot doesn't lift feet

# Iteration 3: Add foot clearance + contact pattern
weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# -> Robot starts walking!

Method 2: Exponential Reward Scaling

Instead of linear rewards, use an exponential kernel that peaks sharply at the target:

def exponential_tracking_reward(error, sigma=0.25):
    """
    exp(-error^2 / sigma) creates a bell-shaped reward.
    - Small sigma: only rewards when very close to target
    - Large sigma: rewards even when far from target
    """
    return torch.exp(-torch.square(error) / sigma)

# Velocity tracking with sigma=0.25
vel_reward = exponential_tracking_reward(vel_error, sigma=0.25)

# Height tracking with sigma=0.05 (stricter)
height_reward = exponential_tracking_reward(height_error, sigma=0.05)
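The effect of sigma is easy to see numerically. For the same 0.2 m/s error, a tight sigma pays far less than a loose one (pure-Python equivalent of the function above):

```python
import math

def exponential_tracking_reward(error, sigma=0.25):
    """Bell-shaped reward: 1.0 at zero error, decaying with scale sigma."""
    return math.exp(-error ** 2 / sigma)

err = 0.2  # 0.2 m/s off the commanded velocity
loose = exponential_tracking_reward(err, sigma=0.25)  # ~0.85
tight = exponential_tracking_reward(err, sigma=0.05)  # ~0.45
```

A loose sigma gives gradient signal early in training when errors are large; a tight one sharpens precision once tracking is roughly working.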

Method 3: Adaptive Reward Scaling

Automatically adjust weights based on training progress:

class AdaptiveRewardScaler:
    """Auto-scale rewards to maintain equivalent magnitudes."""

    def __init__(self, reward_names, target_magnitude=1.0):
        self.ema = {name: target_magnitude for name in reward_names}
        self.alpha = 0.99

    def scale(self, rewards_dict):
        scaled = {}
        for name, value in rewards_dict.items():
            mag = torch.abs(value).mean().item()
            self.ema[name] = self.alpha * self.ema[name] + (1 - self.alpha) * mag

            if self.ema[name] > 1e-6:
                scaled[name] = value / self.ema[name]
            else:
                scaled[name] = value
        return scaled
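A quick way to see what the scaler does: feed it a term whose raw magnitude is around 1000 and watch the EMA pull the scaled value toward unit magnitude. This is the same update reduced to a single scalar stream, with hypothetical numbers:

```python
# EMA-based auto-scaling, reduced to one scalar reward stream
ema, alpha = 1.0, 0.99

def scale(value):
    """Update the running magnitude estimate, then normalize by it."""
    global ema
    ema = alpha * ema + (1 - alpha) * abs(value)
    return value / ema if ema > 1e-6 else value

scaled = [scale(1000.0) for _ in range(2000)]
# early values are large (the EMA is still near 1); late values approach 1.0
```

The caveat of adaptive scaling is that it changes the effective weights over time, which can mask deliberate weight choices — many practitioners use it only for initial calibration, then freeze the scales.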

[Figure: Reward function design]

Reward Curriculum: From Standing to Walking

Rather than learning to walk from scratch, a staged curriculum stabilizes training:

Phase 1: Standing (0-500 iterations)

curriculum_phase_1 = {
    "linear_vel_tracking": 0.0,     # Don't need to walk yet
    "upright": 2.0,                  # Priority: stand upright
    "base_height": 1.0,             # Maintain height
    "termination": -200.0,          # Penalize falling
    "torque": -0.0001,              # Energy efficiency
}
# Command velocity = [0, 0, 0]

Phase 2: Weight Shifting (500-1500 iterations)

curriculum_phase_2 = {
    "linear_vel_tracking": 0.3,     # Start tracking small velocity
    "upright": 1.5,
    "base_height": 0.8,
    "contact_pattern": 0.2,         # Start requiring foot alternation
    "foot_clearance": 0.1,
    "termination": -200.0,
    "torque": -0.0001,
}
# Command velocity = [0.2, 0, 0]  # slow

Phase 3: Walking (1500-5000 iterations)

curriculum_phase_3 = {
    "linear_vel_tracking": 1.5,     # Full tracking
    "angular_vel_tracking": 0.8,
    "upright": 0.5,
    "base_height": 0.3,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
    "foot_slip": -0.1,
    "action_rate": -0.01,
    "torque": -0.00005,
    "joint_acceleration": -0.0001,
    "termination": -200.0,
    "joint_limit": -1.0,
    "feet_air_time": 0.2,
}
# Command velocity = random[-1.0, 1.0] m/s

A small manager class switches weights based on the training iteration:

class RewardCurriculum:
    """Reward curriculum manager."""

    def __init__(self):
        self.phases = [
            (0, curriculum_phase_1),
            (500, curriculum_phase_2),
            (1500, curriculum_phase_3),
        ]

    def get_weights(self, iteration):
        """Return weights appropriate for current iteration."""
        current_weights = self.phases[0][1]
        for threshold, weights in self.phases:
            if iteration >= threshold:
                current_weights = weights
        return current_weights

    def get_command_range(self, iteration):
        """Gradually increase command velocity range."""
        if iteration < 500:
            return {"vx": [0, 0], "vy": [0, 0], "yaw": [0, 0]}
        elif iteration < 1500:
            return {"vx": [0, 0.3], "vy": [-0.1, 0.1], "yaw": [-0.2, 0.2]}
        else:
            progress = min((iteration - 1500) / 3500, 1.0)
            max_vx = 0.3 + progress * 0.7
            return {
                "vx": [-0.3, max_vx],
                "vy": [-0.3 * progress, 0.3 * progress],
                "yaw": [-0.5 * progress, 0.5 * progress],
            }
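The phase lookup is just "latest threshold not exceeding the current iteration." A compact standalone check of that logic with the thresholds above (phase weights abbreviated to labels):

```python
# Same lookup as RewardCurriculum.get_weights, with labels standing in for weight dicts
phases = [(0, "phase_1"), (500, "phase_2"), (1500, "phase_3")]

def get_phase(iteration):
    """Return the phase whose threshold was most recently passed."""
    current = phases[0][1]
    for threshold, phase in phases:
        if iteration >= threshold:
            current = phase
    return current

assert get_phase(100) == "phase_1"
assert get_phase(500) == "phase_2"    # thresholds are inclusive
assert get_phase(4000) == "phase_3"
```

Note that this relies on `phases` being sorted by threshold; if you add phases later, keep the list in ascending order.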

Ablation Study: What Does Each Reward Term Contribute?

The table below summarizes results when removing each reward term:

| Reward term removed | Consequence | Impact level |
| --- | --- | --- |
| linear_vel_tracking | Robot doesn't walk | Critical |
| upright | Robot tilts and falls after 2-3 steps | Critical |
| foot_clearance | Robot drags feet (shuffling) | High |
| contact_pattern | Robot hops or stands on both feet | High |
| action_rate | Robot jerks violently, motor wear | Medium |
| torque | High energy consumption, motor overheating | Medium |
| foot_slip | Feet slide on ground | Medium |
| base_height | Robot crouches/squats | Medium |
| joint_limit | Joints hit mechanical limits | Low-Medium |
| feet_air_time | Unnatural gait timing | Low |
| joint_acceleration | Slightly jerky motion | Low |

Comparing Reward Formulations from Papers

| Paper | Reward terms | Key feature |
| --- | --- | --- |
| Walk These Ways (Margolis 2023) | 13 | Gait frequency command, versatile gaits |
| Humanoid-Gym (Gu 2024) | 10 | Focus on sim-to-real, conservative rewards |
| Learning Humanoid Locomotion (Radosavovic 2024) | 8 | Transformer policy, simpler rewards |
| Expressive Humanoid (Cheng 2024) | 15+ | AMP-style reward, reference motions |

Pitfalls and Best Practices

Pitfall 1: Reward Magnitude Mismatch

# WRONG: vel_tracking ~1.0 but torque ~10000
rewards = vel_reward + 0.001 * torque_penalty
# Despite the small weight, the torque penalty still dominates the gradient

# RIGHT: Normalize before weighting
rewards = vel_reward + 0.001 * (torque_penalty / torque_scale)

Pitfall 2: Sparse vs Dense Rewards

# WRONG: Only reward reaching the goal (sparse)
reward = 100.0 if reached_goal else 0.0
# The robot gets no directional signal until it randomly reaches the goal

# RIGHT: Reward continuous progress (dense)
reward = -distance_to_goal + velocity_toward_goal

Pitfall 3: Forgetting Termination Penalty

# WRONG: No penalty for falling
# Robot will try risky behaviors since there's no penalty

# RIGHT: Heavy penalty for falling
if is_terminated:
    reward -= 200.0

To learn more about quadruped locomotion (a simpler problem with the same principles), see Quadruped Locomotion with RL.

[Figure: Training visualization]

Summary

Reward engineering for bipedal walking is not "set and forget" — it is an iterative process: design, train, watch videos, adjust. Key takeaways:

  1. 12+ reward terms are necessary for natural gaits
  2. Exponential tracking is better than linear for velocity/height targets
  3. A reward curriculum (standing, weight shifting, walking) stabilizes training
  4. Ablation study reveals each term's importance
  5. Reward hacking is the enemy — always watch behavior videos, not just reward curves

Next post — Unitree G1: Training a Walking Policy from Scratch — applies all this knowledge to real training on the Unitree G1 in Isaac Lab.

References

  1. Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior — Margolis & Agrawal, CoRL 2023
  2. Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition — Siekmann et al., ICRA 2021
  3. Humanoid-Gym: Reinforcement Learning for Humanoid Robot — Gu et al., 2024
  4. Learning Humanoid Locomotion with Transformers — Radosavovic et al., 2024
