If the observation space is the robot's "eyes," then the reward function is its "soul" — it shapes every behavior the policy learns. Designing rewards for bipedal walking is more art than science: too few reward terms and the robot learns bizarre gaits, too many and training won't converge.
In the previous post — Fundamentals & Environment Setup — we built the simulation environment. Now, let's dive into the hardest part: designing the reward function.
Overview of Reward Engineering
Why Is Reward Design Hard?
Bipedal walking has many simultaneous objectives:
- Walk at the desired velocity
- Keep the torso upright
- Lift feet high enough (foot clearance)
- Conserve energy
- Move smoothly (no jerkiness)
- Coordinate legs in alternating rhythm
Each objective requires its own reward term, and the weights between terms determine the final behavior.
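As a scalar sketch (the numbers are illustrative, not tuned values), the total reward is simply a weighted sum of per-term values, which is why the weights, not the terms alone, determine the trade-offs:

```python
# Illustrative per-step term values (hypothetical numbers)
term_values = {
    "linear_vel_tracking": 0.9,
    "upright": 0.8,
    "torque": 4200.0,  # raw sum of squared torques, much larger in magnitude
}
# Weights carry the sign: positive = reward, negative = penalty
weights = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "torque": -0.00005,
}
# Total reward is the weighted sum over all terms
total = sum(weights[name] * value for name, value in term_values.items())
```

Note how the torque term's raw magnitude is thousands of times larger than the tracking terms; only its tiny weight keeps it from dominating.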
Reward Hacking — Enemy Number One
Reward hacking occurs when the policy finds a way to maximize reward without actually doing what you want. Classic examples:
| Bad reward design | The robot will... |
|---|---|
| Only reward forward velocity | Dive forward and fall |
| Only reward standing upright | Stand still, never step |
| Heavily reward foot clearance | Lift feet and never put them down |
| No energy penalty | Oscillate at high frequency continuously |
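A practical defense is to log every reward term separately and compare the reward trend against a ground-truth task metric. A hypothetical sketch (the function and its thresholds are illustrative, not from any library):

```python
def detect_possible_hacking(term_means, task_metric, history):
    """Flag when total reward improves but the task metric does not.

    term_means: dict of per-term mean rewards for this evaluation window.
    task_metric: an independent measure of success, e.g. mean forward velocity.
    history: running list of (total_reward, task_metric) pairs.
    """
    total = sum(term_means.values())
    history.append((total, task_metric))
    if len(history) < 2:
        return False
    prev_total, prev_metric = history[-2]
    # Reward jumped >5% while the task metric stagnated: suspicious.
    return total > prev_total * 1.05 and task_metric <= prev_metric
```

If this fires, watch the rollout videos: the policy has likely found a loophole in one term rather than learned to walk better.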
12+ Reward Terms for Bipedal Walking
Below is a complete reward function with detailed explanations for each term:
```python
import torch


class BipedalRewardFunction:
    """
    Complete reward function for humanoid bipedal walking.
    Synthesized from Humanoid-Gym, Walk These Ways, and practical experience.
    """

    def __init__(self, cfg):
        # Reward weights — THIS IS WHERE THE MAGIC HAPPENS.
        # Convention: penalty terms below return non-negative magnitudes,
        # so the sign lives entirely in the weight.
        self.weights = {
            # === Tracking rewards (primary objectives) ===
            "linear_vel_tracking": 1.5,     # Track desired velocity
            "angular_vel_tracking": 0.8,    # Track desired yaw rate
            # === Posture rewards (balance) ===
            "upright": 0.5,                 # Keep torso upright
            "base_height": 0.3,             # Maintain stable height
            # === Gait quality rewards ===
            "foot_clearance": 0.3,          # Lift feet high enough
            "contact_pattern": 0.4,         # Alternate feet properly
            "foot_slip": -0.1,              # Penalize foot slipping
            # === Smoothness penalties ===
            "action_rate": -0.01,           # Penalize sudden action changes
            "torque": -0.00005,             # Penalize large torques
            "joint_acceleration": -0.0001,  # Penalize joint acceleration
            # === Safety penalties ===
            "termination": -200.0,          # Heavy penalty for falling
            "joint_limit": -1.0,            # Penalize hitting joint limits
            # === Style rewards (optional) ===
            "feet_air_time": 0.2,           # Reward natural swing time
        }

    def compute_reward(self, state, action, prev_action, command):
        """Compute the total reward from all terms."""
        rewards = {}

        # 1. Linear velocity tracking
        # Most important: the robot must walk at the requested speed.
        vel_error = torch.sum(
            torch.square(command[:, :2] - state["base_lin_vel"][:, :2]),
            dim=1,
        )
        rewards["linear_vel_tracking"] = torch.exp(-vel_error / 0.25)

        # 2. Angular velocity tracking (yaw rate)
        yaw_error = torch.square(command[:, 2] - state["base_ang_vel"][:, 2])
        rewards["angular_vel_tracking"] = torch.exp(-yaw_error / 0.25)

        # 3. Upright posture
        # projected_gravity[2] = -1 when perfectly upright, so the error is
        # zero there and the exp-shaped reward peaks at 1.
        # (The 0.1 temperature is a tuning choice.)
        upright_error = torch.square(state["projected_gravity"][:, 2] + 1.0)
        rewards["upright"] = torch.exp(-upright_error / 0.1)

        # 4. Base height
        # Keep height near the target (0.72 m for the G1).
        target_height = 0.72
        height_error = torch.square(state["base_height"] - target_height)
        rewards["base_height"] = torch.exp(-height_error / 0.05)

        # 5. Foot clearance
        # Reward when a swing foot lifts above 0.06 m.
        swing_mask = state["foot_contact"] < 0.5
        foot_height = state["foot_height"]
        clearance_reward = torch.where(
            swing_mask,
            torch.clamp(foot_height - 0.06, min=0.0),
            torch.zeros_like(foot_height),
        )
        rewards["foot_clearance"] = torch.sum(clearance_reward, dim=1)

        # 6. Contact pattern
        # Reward keeping at least one foot on the ground, i.e. discourage
        # flight phases where both feet are in the air.
        left_contact = state["foot_contact"][:, 0]
        right_contact = state["foot_contact"][:, 1]
        both_air = (1.0 - left_contact) * (1.0 - right_contact)
        rewards["contact_pattern"] = 1.0 - both_air

        # 7. Foot slip penalty: horizontal foot speed while in contact
        # (positive magnitude, negative weight)
        contact_mask = (state["foot_contact"] > 0.5).float()
        foot_vel = torch.norm(state["foot_velocity"][:, :, :2], dim=2)
        rewards["foot_slip"] = torch.sum(contact_mask * foot_vel, dim=1)

        # 8. Action rate penalty
        rewards["action_rate"] = torch.sum(
            torch.square(action - prev_action), dim=1
        )

        # 9. Torque penalty
        rewards["torque"] = torch.sum(
            torch.square(state["applied_torques"]), dim=1
        )

        # 10. Joint acceleration penalty
        rewards["joint_acceleration"] = torch.sum(
            torch.square(state["joint_accelerations"]), dim=1
        )

        # 11. Termination penalty (1.0 on the step the robot falls)
        rewards["termination"] = state["is_terminated"].float()

        # 12. Joint limit penalty: distance past a 0.1 rad soft margin
        joint_pos = state["joint_positions"]
        lower = state["joint_lower_limits"]
        upper = state["joint_upper_limits"]
        margin = 0.1
        below = torch.clamp(lower + margin - joint_pos, min=0.0)
        above = torch.clamp(joint_pos - upper + margin, min=0.0)
        rewards["joint_limit"] = torch.sum(below + above, dim=1)

        # 13. Feet air time (style reward): peaks when swing lasts ~0.3 s
        target_air_time = 0.3
        air_time_error = torch.abs(state["feet_air_time"] - target_air_time)
        rewards["feet_air_time"] = torch.sum(
            torch.exp(-air_time_error / 0.1), dim=1
        )

        # Weighted total
        total_reward = torch.zeros_like(state["base_height"])
        for name, value in rewards.items():
            total_reward += self.weights[name] * value
        return total_reward, rewards
```
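When tuning, it helps to see which terms actually dominate the weighted total. A minimal standalone helper (hypothetical, operating on per-term mean values you would log during training):

```python
def summarize_reward_terms(term_means, weights):
    """Rank reward terms by the magnitude of their mean weighted contribution.

    term_means: dict mapping term name -> mean per-step value of that term.
    weights: dict mapping term name -> its weight (as in self.weights above).
    Returns a list of (name, weighted_contribution), largest magnitude first.
    """
    contrib = {name: weights[name] * value for name, value in term_means.items()}
    return sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Example: a tracking term and a raw torque magnitude
ranked = summarize_reward_terms(
    {"linear_vel_tracking": 0.8, "torque": 1000.0},
    {"linear_vel_tracking": 1.5, "torque": -0.00005},
)
```

If a smoothness penalty shows up near the top of this ranking, its weight (or its normalization) needs attention before anything else.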
Reward Weighting Strategies
Method 1: Manual Tuning (Most Common)
Start with default weights, train for ~30 minutes, watch behavior videos, and adjust:
```python
# Iteration 1: robot dives forward
weights_v1 = {"linear_vel_tracking": 2.0, "upright": 0.1}
# -> increase upright, decrease velocity

# Iteration 2: robot stands still, very upright
weights_v2 = {"linear_vel_tracking": 1.0, "upright": 1.0}
# -> better balanced, but the robot doesn't lift its feet

# Iteration 3: add foot clearance + contact pattern
weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# -> the robot starts walking!
```
Method 2: Exponential Reward Scaling
Instead of linear rewards, use an exponential kernel to create a sharp peak at the target:

```python
import torch


def exponential_tracking_reward(error, sigma=0.25):
    """
    exp(-error^2 / sigma) creates a bell-shaped reward around zero error.
    - Small sigma: only rewards when very close to the target
    - Large sigma: still rewards when far from the target
    """
    return torch.exp(-torch.square(error) / sigma)


# Velocity tracking with sigma=0.25
vel_reward = exponential_tracking_reward(vel_error, sigma=0.25)
# Height tracking with sigma=0.05 (stricter)
height_reward = exponential_tracking_reward(height_error, sigma=0.05)
```
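To build intuition for sigma, here is the same kernel evaluated on plain scalars (a standalone sketch using `math.exp` instead of torch):

```python
import math


def exp_tracking(error, sigma):
    # Scalar version of the exponential tracking kernel: exp(-error^2 / sigma)
    return math.exp(-error ** 2 / sigma)


# Zero error gives the maximum reward of 1.0 for any sigma.
# At a 0.5 m/s error, the lenient kernel still pays a substantial reward,
# while the strict kernel has already collapsed to almost nothing:
lenient = exp_tracking(0.5, sigma=0.25)  # exp(-1)  ≈ 0.368
strict = exp_tracking(0.5, sigma=0.05)   # exp(-5)  ≈ 0.0067
```

This is why a small sigma behaves almost like a sparse reward: early in training, when errors are large, it provides nearly zero gradient signal.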
Method 3: Adaptive Reward Scaling
Automatically adjust weights based on training progress:
```python
import torch


class AdaptiveRewardScaler:
    """Auto-scale rewards to maintain equivalent magnitudes."""

    def __init__(self, reward_names, target_magnitude=1.0):
        # Exponential moving average of each term's observed magnitude
        self.ema = {name: target_magnitude for name in reward_names}
        self.alpha = 0.99

    def scale(self, rewards_dict):
        scaled = {}
        for name, value in rewards_dict.items():
            mag = torch.abs(value).mean().item()
            self.ema[name] = self.alpha * self.ema[name] + (1 - self.alpha) * mag
            if self.ema[name] > 1e-6:
                scaled[name] = value / self.ema[name]
            else:
                scaled[name] = value
        return scaled
```
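The EMA update converges toward the observed magnitude, so each scaled term ends up with magnitude near 1. A scalar trace of that update rule (same arithmetic, no tensors):

```python
# Scalar version of the EMA update inside AdaptiveRewardScaler.scale:
#   ema <- alpha * ema + (1 - alpha) * |value|
alpha = 0.99
ema = 1.0             # starts at target_magnitude
for _ in range(1000):
    observed = 500.0  # e.g. a raw squared-torque magnitude, hypothetical
    ema = alpha * ema + (1 - alpha) * observed
# After ~1000 updates the EMA has nearly converged to 500, so dividing the
# raw term by it yields values of order 1.
```

With alpha = 0.99 the effective averaging window is roughly 100 steps, so the scaler adapts within a few hundred iterations without reacting to single-step noise.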
Reward Curriculum: From Standing to Walking
Rather than training walking from scratch, a curriculum that progresses through phases helps stabilize training:
Phase 1: Standing (0-500 iterations)
```python
curriculum_phase_1 = {
    "linear_vel_tracking": 0.0,  # no need to walk yet
    "upright": 2.0,              # priority: stand upright
    "base_height": 1.0,          # maintain height
    "termination": -200.0,       # penalize falling
    "torque": -0.0001,           # energy efficiency
}
# Command velocity = [0, 0, 0]
```
Phase 2: Weight Shifting (500-1500 iterations)
```python
curriculum_phase_2 = {
    "linear_vel_tracking": 0.3,  # start tracking a small velocity
    "upright": 1.5,
    "base_height": 0.8,
    "contact_pattern": 0.2,      # start requiring foot alternation
    "foot_clearance": 0.1,
    "termination": -200.0,
    "torque": -0.0001,
}
# Command velocity = [0.2, 0, 0]  # slow
```
Phase 3: Walking (1500-5000 iterations)
```python
curriculum_phase_3 = {
    "linear_vel_tracking": 1.5,  # full tracking
    "angular_vel_tracking": 0.8,
    "upright": 0.5,
    "base_height": 0.3,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
    "foot_slip": -0.1,
    "action_rate": -0.01,
    "torque": -0.00005,
    "joint_acceleration": -0.0001,
    "termination": -200.0,
    "joint_limit": -1.0,
    "feet_air_time": 0.2,
}
# Command velocity = random in [-0.3, 1.0] m/s (see get_command_range below)
```
```python
class RewardCurriculum:
    """Reward curriculum manager."""

    def __init__(self):
        self.phases = [
            (0, curriculum_phase_1),
            (500, curriculum_phase_2),
            (1500, curriculum_phase_3),
        ]

    def get_weights(self, iteration):
        """Return the weights appropriate for the current iteration."""
        current_weights = self.phases[0][1]
        for threshold, weights in self.phases:
            if iteration >= threshold:
                current_weights = weights
        return current_weights

    def get_command_range(self, iteration):
        """Gradually widen the commanded-velocity range."""
        if iteration < 500:
            return {"vx": [0, 0], "vy": [0, 0], "yaw": [0, 0]}
        elif iteration < 1500:
            return {"vx": [0, 0.3], "vy": [-0.1, 0.1], "yaw": [-0.2, 0.2]}
        else:
            progress = min((iteration - 1500) / 3500, 1.0)
            max_vx = 0.3 + progress * 0.7
            return {
                "vx": [-0.3, max_vx],
                "vy": [-0.3 * progress, 0.3 * progress],
                "yaw": [-0.5 * progress, 0.5 * progress],
            }
```
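The phase-3 ramp is easy to check by hand. At iteration 2000, for example (same arithmetic as the `else` branch above, written out as scalars):

```python
# Spot-check of the phase-3 command ramp at one iteration
iteration = 2000
progress = min((iteration - 1500) / 3500, 1.0)  # = 1/7 ≈ 0.143
max_vx = 0.3 + progress * 0.7                   # = 0.4 m/s forward
vy_bound = 0.3 * progress                       # ≈ 0.043 m/s lateral
```

So 500 iterations into phase 3 the robot is asked for at most 0.4 m/s forward, and the full ±1.0 m/s forward range is only reached at iteration 5000.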
Ablation Study: What Does Each Reward Term Contribute?
The table below summarizes results when removing each reward term:
| Reward term removed | Consequence | Impact level |
|---|---|---|
| linear_vel_tracking | Robot doesn't walk | Critical |
| upright | Robot tilts and falls after 2-3 steps | Critical |
| foot_clearance | Robot drags feet (shuffling) | High |
| contact_pattern | Robot hops or stands on both feet | High |
| action_rate | Robot jerks violently, motor wear | Medium |
| torque | High energy consumption, motor overheating | Medium |
| foot_slip | Feet slide on ground | Medium |
| base_height | Robot crouches/squats | Medium |
| joint_limit | Joints hit mechanical limits | Low-Medium |
| feet_air_time | Unnatural gait timing | Low |
| joint_acceleration | Slightly jerky motion | Low |
Comparing Reward Formulations from Papers
| Paper | Reward terms | Key feature |
|---|---|---|
| Walk These Ways (Margolis 2023) | 13 | Gait frequency command, versatile gaits |
| Humanoid-Gym (Gu 2024) | 10 | Focus sim-to-real, conservative rewards |
| Learning Humanoid Locomotion (Radosavovic 2024) | 8 | Transformer policy, simpler rewards |
| Expressive Humanoid (Cheng 2024) | 15+ | AMP style reward, reference motions |
Pitfalls and Best Practices
Pitfall 1: Reward Magnitude Mismatch
```python
# WRONG: vel_tracking ~1.0 but the raw torque penalty ~10000
rewards = vel_reward - 0.001 * torque_penalty
# Despite the small weight, the torque penalty still dominates the gradient.

# RIGHT: normalize before weighting
rewards = vel_reward - 0.001 * (torque_penalty / torque_scale)
```
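One way to pick `torque_scale` is from the typical raw magnitudes you log during early training, so the normalized penalty is O(1) before the weight is applied. A hypothetical scalar sketch (the numbers are illustrative):

```python
def normalized_penalty(raw_penalty, scale):
    # Divide by a typical magnitude so the result is of order 1
    return raw_penalty / scale


vel_reward = 0.9                # tracking reward, already in [0, 1]
torque_penalty = 12000.0        # typical per-step sum of squared torques
torque_scale = 10000.0          # chosen from logged early-training magnitudes
reward = vel_reward - 0.001 * normalized_penalty(torque_penalty, torque_scale)
```

After normalization both terms live on comparable scales, so the weight 0.001 actually means "the torque penalty matters about 1000x less than tracking," as intended.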
Pitfall 2: Sparse vs Dense Rewards
```python
# WRONG: only reward reaching the goal (sparse)
reward = 100.0 if reached_goal else 0.0
# The robot gets no signal about the correct direction until it
# randomly reaches the goal.

# RIGHT: reward continuous progress (dense)
reward = -distance_to_goal + velocity_toward_goal
```
Pitfall 3: Forgetting Termination Penalty
```python
# WRONG: no penalty for falling
# -> the robot tries risky behaviors since falling costs nothing

# RIGHT: heavy penalty for falling
if is_terminated:
    reward -= 200.0
```
To learn more about quadruped locomotion (a simpler problem with the same principles), see Quadruped Locomotion with RL.
Summary
Reward engineering for bipedal walking is not "set and forget" — it is an iterative process: design, train, watch videos, adjust. Key takeaways:
- 12+ reward terms are necessary for natural gaits
- Exponential tracking is better than linear for velocity/height targets
- Reward curriculum (standing → weight shifting → walking) stabilizes training
- Ablation study reveals each term's importance
- Reward hacking is the enemy — always watch behavior videos, not just reward curves
Next post — Unitree G1: Training a Walking Policy from Scratch — applies all this knowledge to real training on the Unitree G1 in Isaac Lab.
References
- Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior — Margolis & Agrawal, CoRL 2023
- Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition — Siekmann et al., ICRA 2021
- Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer — Gu et al., 2024
- Learning Humanoid Locomotion with Transformers — Radosavovic et al., 2024