
RL for Locomotion: PPO, Reward Shaping and Curriculum

How RL is applied to locomotion -- from reward design and observation spaces to a full training pipeline with Isaac Lab.

Nguyen Anh Tuan · February 9, 2026 · 10 min read

From Classical to RL: The Turning Point in Locomotion

In Part 1, we covered the classical methods (ZMP, CPG, IK) and why they're limited. Now it's time to dive into RL-based locomotion -- the approach that has dominated the field since around 2020.

The core idea is extremely simple: instead of hand-crafting a controller, let the robot learn how to walk through trial-and-error in simulation. But "simple" doesn't mean "easy" -- there are many important decisions to understand.

AI and machine learning for robot control

MDP for Walking: Problem Formulation

Locomotion is formulated as a Markov Decision Process (MDP). If you're not familiar with MDPs, read RL Basics for Robotics first.
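The pieces defined in the rest of this post -- observation (state), action, and reward -- instantiate the standard MDP tuple, and training searches for the policy that maximizes expected discounted return:

```latex
% Walking as an MDP: states, actions, transition dynamics, reward, discount
(\mathcal{S}, \mathcal{A}, P, r, \gamma),
\qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
```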

Observation Space -- What Does the Robot "See"?

Observation is all the information the policy receives to make decisions. For locomotion, observation typically includes:

observation_space = {
    # === Proprioception ===
    "base_angular_velocity": 3,    # gyroscope (roll, pitch, yaw rates)
    "projected_gravity": 3,        # direction of gravity in body frame
    "joint_positions": 12,         # angles of 12 joints (3 joints x 4 legs)
    "joint_velocities": 12,        # angular velocities of 12 joints
    
    # === Commands ===
    "velocity_command": 3,         # desired (vx, vy, yaw_rate)
    
    # === History ===
    "previous_actions": 12,        # previous actions
    
    # TOTAL: 45 dimensions
}

Why not use visual input? Most locomotion policies use only proprioception (joint angles, IMU) without cameras. Reasons:

  1. Proprioception is enough to walk on many terrains (robot "feels" ground through feet)
  2. Visual input increases observation dimension to thousands → slower training
  3. Cameras have latency, noise, and large sim-to-real gap

However, newer papers like "Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild" (arXiv:2201.08117) have begun combining depth cameras + proprioception for more complex terrain (stairs, bumps).

Action Space -- What Does the Robot Do?

Action space defines how the robot controls its joints:

| Action type | Description | Advantages | Disadvantages |
|---|---|---|---|
| Joint position targets | Policy outputs desired joint angles | Stable, easy sim-to-real | No direct force control |
| Joint torques | Policy outputs torques on the joints | Precise control | Hard to train, hard to transfer |
| Position offsets | Policy outputs deltas from a default standing pose | Faster training | Needs a good default pose |

Most common today: Position offsets + PD controller. Policy outputs joint angle deltas, PD controller converts to torques:

# Action -> joint torques (runs on the robot)
import numpy as np

def pd_controller(target_pos, current_pos, current_vel, kp=40.0, kd=2.0):
    """
    PD controller converting position targets to torques.
    kp, kd: proportional and derivative gains
    """
    torque = kp * (target_pos - current_pos) - kd * current_vel
    return np.clip(torque, -33.5, 33.5)  # torque limit (N*m)

# In the training loop (observation, policy, and joint states come from the env)
action_scale = 0.25  # scales raw policy output to radians; 0.25 is common in legged_gym-style configs
default_pose = np.array([0.0, 0.8, -1.6] * 4)  # standing pose: (hip, thigh, calf) x 4 legs
action = policy(observation)  # output: 12 position offsets
target_positions = default_pose + action * action_scale
torques = pd_controller(target_positions, current_joint_pos, current_joint_vel)

Reward Shaping -- The Art of Designing Rewards

The reward function is the most important part of locomotion RL. A good reward function creates smooth walking; a bad reward function creates jerky, jumping, or energy-wasting robots.

Reward Components

Reward function for locomotion is typically a weighted sum of multiple components:

import torch

def compute_reward(env):
    """
    Reward function for quadruped locomotion.
    Each component has its own weight -- tuning the weights is an art.
    """
    rewards = {}
    
    # === Rewards (encouraging) ===
    
    # 1. Track velocity command
    lin_vel_error = (env.commands[:, :2] - env.base_lin_vel[:, :2]).square().sum(dim=1)
    rewards["tracking_lin_vel"] = torch.exp(-lin_vel_error / 0.25) * 1.5
    
    ang_vel_error = (env.commands[:, 2] - env.base_ang_vel[:, 2]).square()
    rewards["tracking_ang_vel"] = torch.exp(-ang_vel_error / 0.25) * 0.5
    
    # === Penalties (discouraging) ===
    
    # 2. Energy -- penalize power consumption
    rewards["torques"] = -torch.sum(env.torques.square(), dim=1) * 0.00001
    
    # 3. Smoothness -- penalize sudden action changes
    rewards["action_rate"] = -torch.sum(
        (env.actions - env.last_actions).square(), dim=1
    ) * 0.01
    
    # 4. Air time -- encourage foot clearance (foot leaving ground)
    contact = env.foot_contacts  # True if foot touches ground
    first_contact = (env.feet_air_time > 0.) * contact
    rewards["feet_air_time"] = torch.sum(
        (env.feet_air_time - 0.5) * first_contact, dim=1
    ) * 1.0
    
    # 5. Orientation -- penalize excessive tilting
    rewards["orientation"] = -torch.sum(
        env.projected_gravity[:, :2].square(), dim=1
    ) * 0.2
    
    # 6. Joint limits -- penalize operating near joint limits
    rewards["joint_limits"] = -torch.sum(
        (env.dof_pos - env.dof_pos_limits[:, 0]).clip(max=0.).square() +
        (env.dof_pos - env.dof_pos_limits[:, 1]).clip(min=0.).square(),
        dim=1
    ) * 10.0
    
    # 7. Collision -- penalize body touching ground (not feet)
    rewards["collision"] = -torch.sum(
        (torch.norm(env.contact_forces[:, env.penalized_bodies, :], dim=-1) > 0.1),
        dim=1
    ).float() * 1.0
    
    # Total reward
    total_reward = sum(rewards.values())
    return total_reward, rewards

Reward Design Tips from Experience

| Tip | Explanation |
|---|---|
| Exponential tracking reward | Use `exp(-error/sigma)` instead of `-error`: the reward is bounded in (0, 1], so large errors can't swamp the other terms, and the gradient is strongest near the target |
| Balance reward weights | Tracking rewards should dominate (~1.5); penalties stay small (~0.001-0.1) |
| Feet air time | Critically important -- without it, the robot drags its feet along the ground |
| Action rate penalty | Implicitly smooths motion and prevents jerky behavior |
| Base height reward | Add if the robot tends to crouch -- encourages standing upright |
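To make the exponential-tracking tip concrete, here are both reward shapes for a few velocity errors (sigma = 0.25, matching the tracking terms in `compute_reward` above):

```python
import numpy as np

def exp_tracking_reward(error, sigma=0.25):
    """Bounded exponential tracking reward: exp(-error^2 / sigma), in (0, 1]."""
    return np.exp(-error**2 / sigma)

# Compare against a plain negative squared-error reward
for error in [0.0, 0.1, 0.3, 1.0]:
    print(f"error={error:.1f} m/s  exp={exp_tracking_reward(error):.3f}  linear={-error**2:.3f}")
```

At zero error the reward saturates at 1.0, and even at 1 m/s of error it stays bounded near zero (~0.02), so a badly tracking robot doesn't generate a term large enough to drown out the penalties.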

Training robot locomotion in simulation with thousands of parallel environments

Curriculum Learning -- From Easy to Hard

One of the biggest breakthroughs for locomotion RL is curriculum learning -- start training on easy terrain and gradually increase the difficulty. Without a curriculum, the policy often fails to learn anything when dropped straight onto hard terrain.

Terrain Curriculum

class TerrainCurriculum:
    """
    Game-inspired curriculum (from the legged_gym paper).
    A robot "levels up" once it masters its current terrain.
    """
    def __init__(self, num_robots, threshold_up=8.0, threshold_down=2.0):
        # thresholds are illustrative values, in meters walked per episode
        self.terrain_levels = {
            0: "flat",            # flat ground
            1: "rough_02",        # light roughness (max height 2 cm)
            2: "rough_05",        # medium roughness (max height 5 cm)
            3: "rough_10",        # heavy roughness (max height 10 cm)
            4: "slopes_10deg",    # 10-degree slopes
            5: "slopes_20deg",    # 20-degree slopes
            6: "stairs_10cm",     # 10 cm stairs
            7: "stairs_15cm",     # 15 cm stairs
            8: "stepping_stones", # stones to step across
            9: "random_mixed",    # combination of all
        }
        self.max_level = len(self.terrain_levels) - 1
        self.levels = [0] * num_robots   # every robot starts on flat ground
        self.threshold_up = threshold_up
        self.threshold_down = threshold_down

    def update_level(self, robot_id, distance_walked):
        """
        Walked far enough -> level up.
        Fell early or barely moved -> level down.
        """
        if distance_walked > self.threshold_up:
            self.levels[robot_id] = min(self.levels[robot_id] + 1, self.max_level)
        elif distance_walked < self.threshold_down:
            self.levels[robot_id] = max(self.levels[robot_id] - 1, 0)

Command Curriculum

In addition to terrain, velocity commands also follow a curriculum: training starts with slow forward walking, and the sampled command range widens as tracking improves.
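A minimal sketch of such a command curriculum (the range limits, step size, and reward threshold are illustrative, not from a specific codebase):

```python
import numpy as np

class CommandCurriculum:
    """Gradually widen the sampled velocity-command range as tracking improves.

    Starts with slow commands only; expands toward illustrative full limits
    of +-1.0 m/s linear and +-1.0 rad/s yaw rate.
    """
    def __init__(self):
        self.max_lin_vel = 0.5   # current linear-velocity limit (m/s)
        self.max_ang_vel = 0.5   # current yaw-rate limit (rad/s)

    def sample_command(self, rng):
        """Draw one (vx, vy, yaw_rate) command within the current range."""
        return np.array([
            rng.uniform(-self.max_lin_vel, self.max_lin_vel),  # vx
            rng.uniform(-self.max_lin_vel, self.max_lin_vel),  # vy
            rng.uniform(-self.max_ang_vel, self.max_ang_vel),  # yaw rate
        ])

    def update(self, mean_tracking_reward, threshold=0.8):
        # If the policy tracks current commands well, widen the range
        if mean_tracking_reward > threshold:
            self.max_lin_vel = min(self.max_lin_vel + 0.1, 1.0)
            self.max_ang_vel = min(self.max_ang_vel + 0.1, 1.0)
```

The update rule mirrors the terrain curriculum: progress is gated on performance, so the policy never sees commands it has no chance of tracking yet.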

PPO Training Pipeline with Isaac Lab

Pipeline Overview

Isaac Lab Environment
    ↓
4,096 parallel robots (GPU)
    ↓
Observations (45 dims x 4096 robots)
    ↓
PPO Policy (MLP: 512-256-128)
    ↓
Actions (12 dims x 4096 robots)
    ↓
Environment step (GPU physics)
    ↓
Rewards + Terminations
    ↓
PPO Update (on-policy)
    ↓
Repeat ~2000 iterations (~20 minutes on RTX 4090)

PPO -- Why PPO?

Proximal Policy Optimization (Schulman et al., 2017) is the default algorithm for locomotion because:

  1. On-policy: collects fresh data every iteration, a good fit for fast simulation (4,096 envs)
  2. Stable: the clipped probability ratio prevents the policy from changing too much per update, avoiding catastrophic collapse
  3. Parallel-friendly: the RSL-RL implementation is optimized for GPU rollout collection

# PPO training config (Isaac Lab / RSL-RL)
ppo_config = {
    "policy": {
        "class_name": "ActorCritic",
        "hidden_dims": [512, 256, 128],
        "activation": "elu",
        "init_noise_std": 1.0,
    },
    "algorithm": {
        "value_loss_coef": 1.0,
        "use_clipped_value_loss": True,
        "clip_param": 0.2,
        "entropy_coef": 0.01,
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "learning_rate": 1e-3,
        "schedule": "adaptive",  # decrease lr when KL divergence is small
        "gamma": 0.99,
        "lam": 0.95,
        "desired_kl": 0.01,
        "max_grad_norm": 1.0,
    },
    "runner": {
        "num_steps_per_env": 24,  # rollout length
        "max_iterations": 1500,
    }
}

Policy Architecture

Policy for locomotion is usually a simple MLP -- no need for complex transformers or CNNs:

import torch
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """
    Actor-Critic MLP for locomotion
    Simple but effective -- no need for more complexity
    """
    def __init__(self, obs_dim=45, action_dim=12):
        super().__init__()
        
        # Separate backbones for Actor and Critic
        # Better for locomotion than shared backbone
        
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, action_dim),
        )
        
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, 1),
        )
        
        self.log_std = nn.Parameter(torch.zeros(action_dim))
    
    def forward(self, obs):
        action_mean = self.actor(obs)
        value = self.critic(obs)
        return action_mean, self.log_std.exp(), value
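During rollouts, actions are sampled from a Gaussian centered on the actor's output -- the standard actor-critic setup, sketched here with stand-in linear networks that share `LocomotionPolicy`'s interface (not RSL-RL's exact API):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Stand-ins with the same shapes as LocomotionPolicy above
actor = nn.Linear(45, 12)                 # -> action mean
critic = nn.Linear(45, 1)                 # -> state value
log_std = nn.Parameter(torch.zeros(12))   # learned, state-independent std

obs = torch.randn(4096, 45)  # one observation per parallel env

action_mean = actor(obs)
value = critic(obs)

dist = Normal(action_mean, log_std.exp())
action = dist.sample()                         # stochastic action for exploration
log_prob = dist.log_prob(action).sum(dim=-1)   # summed over the 12 joints
```

PPO stores (obs, action, log_prob, value) for each step; at update time the ratio `exp(new_log_prob - old_log_prob)` is clipped to the `clip_param = 0.2` from the config above.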

Hands-on: Train with Isaac Lab

# 1. Setup Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab && ./isaaclab.sh --install

# 2. Train ANYmal locomotion
python source/standalone/workflows/rsl_rl/train.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 4096 \
    --max_iterations 1500

# 3. See results
python source/standalone/workflows/rsl_rl/play.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 64

Training takes about 20 minutes on RTX 4090 for 1500 iterations. The policy converges and the robot can walk on rough terrain, climb slopes, and follow velocity commands.

Domain Randomization for Sim-to-Real

To make a policy trained in sim work on a real robot, we need domain randomization -- randomly vary physical parameters in simulation:

domain_randomization = {
    # Dynamics randomization
    "friction_range": [0.5, 1.25],        # friction coefficient
    "restitution_range": [0.0, 0.4],       # bounce
    "added_mass_range": [-1.0, 3.0],       # added mass to base (kg)
    "com_displacement_range": [-0.15, 0.15], # COM displacement (m)
    
    # Motor randomization
    "motor_strength_range": [0.9, 1.1],    # motor variation
    "motor_offset_range": [-0.02, 0.02],   # encoder offset (rad)
    
    # Observation noise
    "joint_pos_noise": 0.01,    # rad
    "joint_vel_noise": 1.5,     # rad/s
    "gravity_noise": 0.05,      # m/s^2
    "imu_noise": 0.2,           # rad/s
    
    # Latency
    "action_delay_range": [0, 2],  # frames delay
}

With domain randomization, the policy learns to be robust to uncertainty -- doesn't rely on specific sim parameters.
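In practice, these ranges are resampled per environment at every reset, and observation noise is added at every step. A minimal sketch using the dictionary above (function names are illustrative):

```python
import numpy as np

def sample_randomization(rng, config):
    """Draw one set of physical parameters at environment reset."""
    params = {}
    for name in ("friction_range", "motor_strength_range", "added_mass_range"):
        lo, hi = config[name]
        params[name.replace("_range", "")] = rng.uniform(lo, hi)
    return params

def add_sensor_noise(joint_pos, rng, noise_std=0.01):
    """Gaussian noise on joint positions, matching joint_pos_noise above (rad)."""
    return joint_pos + rng.normal(0.0, noise_std, size=joint_pos.shape)
```

The sampled dynamics parameters are applied to the simulator before the episode starts, while the sensor noise is applied to the observation the policy sees -- never to the simulator's ground-truth state.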

Training Monitoring -- Knowing When Policy is Learning Well

| Metric | Meaning | Good value |
|---|---|---|
| Mean reward | Average total reward | Increases steadily, then converges |
| Mean episode length | How long the robot survives | Increases to the max (20 s) |
| Tracking error | Gap between commanded and actual velocity | Drops to ~0.1 m/s |
| Feet air time | Time each foot spends in the air | ~0.3-0.5 s (natural gait) |
| Power consumption | Energy usage | Decreases (gait becomes more efficient) |

If mean episode length hasn't increased after ~200 iterations, the reward function is the likely culprit -- check the weights.

Robot quadruped locomotion training results

Comparison: RL vs Classical for Locomotion

| Criteria | Classical (ZMP/CPG/IK) | RL (PPO) |
|---|---|---|
| Setup time | Months (tuning) | Hours (reward design + training) |
| Terrain generalization | Needs replanning | Train once, run on many terrains |
| Optimality | Sub-optimal | Near-optimal for the given reward |
| Interpretability | High (you know what the robot is doing) | Low (black box) |
| Safety guarantees | Stability can be proven | No formal guarantees |
| Compute requirement | CPU sufficient | Needs a strong GPU |
| Sim-to-real | No sim needed | Needs sim + domain randomization |

Next in the Series

You've now seen how to formulate locomotion as an RL problem and how the full training pipeline fits together, from reward shaping through curriculum to domain randomization. Upcoming posts in the series go hands-on with Isaac Lab and dig deeper into sim-to-real transfer.
