
Walk These Ways: Adaptive Locomotion with One Policy

Analyzing the Walk These Ways paper: training one single policy for multiple movement styles through multiplicity of behavior.

Nguyen Anh Tuan · February 17, 2026 · 9 min read

The Problem: One Gait, One Policy?

In previous parts of this series (Part 1, Part 2, Part 3), we trained locomotion policies, but each policy typically learns only one type of movement. Want the robot to trot? Train one policy. Want it to gallop? Train another. Slow walking, fast walking, sideways walking? Yet more policies.

This approach has several problems: training cost scales with the number of gaits, deployment means maintaining N separate models, and there is no smooth way to transition between behaviors at runtime.

The paper "Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior" (arXiv:2212.03238) by Gabriel Margolis and Pulkit Agrawal (MIT CSAIL, CoRL 2022) solves exactly this problem.

Multiple different movement styles from a single policy

Core Idea: Multiplicity of Behavior (MoB)

Key Insight

When training locomotion with RL, there are many ways to solve the same task. To move forward at 1 m/s, for example, a robot can trot, pace, bound, walk in a crouch, or march with exaggerated foot lifts; all of these satisfy the velocity command equally well.

RL typically converges to a single strategy (usually a trot, since it is the most stable). This paper asks: how can one policy learn MANY strategies simultaneously?

Solution: Command Conditioning

Instead of just sending velocity command (vx, vy, yaw_rate), Walk These Ways adds an extended command vector that controls how the robot moves:

# Standard locomotion command
standard_command = {
    "vx": 1.0,        # m/s, forward
    "vy": 0.0,        # m/s, lateral
    "yaw_rate": 0.0,   # rad/s, turning
}

# Walk These Ways EXTENDED command
wtw_command = {
    # === Velocity (same as standard) ===
    "vx": 1.0,
    "vy": 0.0,
    "yaw_rate": 0.0,

    # === Gait parameters (NEW) ===
    "body_height": 0.0,        # [-1, 1] low/high
    "step_frequency": 3.0,     # Hz, step frequency
    "gait": [1, 0, 0],         # one-hot: trot/pace/bound
    "swing_height": 0.08,      # m, foot lift height
    "stance_width": 0.0,       # [-1, 1] narrow/wide
    "body_pitch": 0.0,         # rad, forward/backward lean
    "body_roll": 0.0,          # rad, left/right lean

    # TOTAL: 15 dimensions in the full release (vs 3 before);
    # the fields shown here are a representative subset
}

Key idea: the policy receives the 15-dim command vector and learns to execute any combination of these parameters. One policy, a continuum of movement styles.
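As a minimal sketch (field names follow the illustrative dict above, not the exact keys used in the walk-these-ways repo), flattening such a command into the vector the policy consumes could look like:

```python
# Sketch: flatten the extended command dict into a policy input vector.
# Field names are illustrative, not the repo's actual key layout.
SCALAR_KEYS = [
    "vx", "vy", "yaw_rate",            # velocity command
    "body_height", "step_frequency",   # posture and timing
    "swing_height", "stance_width",
    "body_pitch", "body_roll",
]

def flatten_command(cmd: dict) -> list:
    """Concatenate scalar fields with the one-hot gait selector."""
    vec = [float(cmd[k]) for k in SCALAR_KEYS]
    vec += [float(g) for g in cmd["gait"]]   # one-hot: trot/pace/bound
    return vec

cmd = {
    "vx": 1.0, "vy": 0.0, "yaw_rate": 0.0,
    "body_height": 0.0, "step_frequency": 3.0, "gait": [1, 0, 0],
    "swing_height": 0.08, "stance_width": 0.0,
    "body_pitch": 0.0, "body_roll": 0.0,
}
print(len(flatten_command(cmd)))  # 12 of the ~15 dims sketched here
```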

Architecture and Training

Observation Space

observation = {
    # Proprioception (same as standard)
    "base_angular_velocity": 3,
    "projected_gravity": 3,
    "joint_positions": 12,
    "joint_velocities": 12,
    "previous_actions": 12,

    # Extended command (vs 3 dims standard)
    "extended_command": 15,

    # TOTAL: 57 dimensions
}
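Concatenated into a flat vector, these terms sum to the 57-dim policy input; a quick sanity check using the layout above:

```python
# Sanity check: the flat observation size is the sum of the per-term dims.
obs_layout = {
    "base_angular_velocity": 3,
    "projected_gravity": 3,
    "joint_positions": 12,
    "joint_velocities": 12,
    "previous_actions": 12,
    "extended_command": 15,   # vs 3 for a standard velocity command
}

obs_dim = sum(obs_layout.values())
print(obs_dim)  # 57
```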

Reward Function

Beyond standard rewards (velocity tracking, energy penalty), Walk These Ways adds gait-specific rewards:

import torch  # reward terms below operate on batched tensors

def compute_gait_reward(env):
    """
    Reward for gait pattern matching.
    Based on commanded gait, encourage correct contact pattern.
    """
    rewards = {}

    # 1. Step frequency tracking
    # Count actual steps vs. commanded frequency
    actual_freq = compute_step_frequency(env.foot_contacts)
    freq_error = (actual_freq - env.commands.step_frequency).square()
    rewards["step_freq"] = torch.exp(-freq_error / 0.25)

    # 2. Gait pattern tracking
    # Each gait has desired phase offsets between 4 legs
    gait_phases = {
        "trot":  [0.0, 0.5, 0.5, 0.0],  # FL-RR in phase, FR-RL in phase
        "pace":  [0.0, 0.5, 0.0, 0.5],  # FL-RL in phase, FR-RR in phase
        "bound": [0.0, 0.0, 0.5, 0.5],  # front in phase, rear in phase
    }
    desired_phases = get_desired_phases(env.commands.gait, gait_phases)
    phase_error = compute_phase_error(env.foot_contacts, desired_phases)
    rewards["gait_phase"] = torch.exp(-phase_error / 0.5)

    # 3. Swing height tracking
    actual_swing = env.foot_heights.max(dim=-1).values
    swing_error = (actual_swing - env.commands.swing_height).square()
    rewards["swing_height"] = torch.exp(-swing_error / 0.01)

    # 4. Body height tracking
    height_error = (env.base_height - env.commands.body_height_target).square()
    rewards["body_height"] = torch.exp(-height_error / 0.01)

    # 5. Body orientation tracking
    pitch_error = (env.base_euler[:, 1] - env.commands.body_pitch).square()
    roll_error = (env.base_euler[:, 0] - env.commands.body_roll).square()
    rewards["orientation"] = torch.exp(-(pitch_error + roll_error) / 0.1)

    return rewards
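The helper functions above are left undefined. A hypothetical `compute_phase_error` (a sketch, not the paper's exact formulation) could measure the mean circular distance between each leg's observed contact phase and the commanded offset:

```python
def compute_phase_error(foot_phases, desired_phases):
    """Mean circular distance between observed and desired leg phases.

    Both arguments are 4-element sequences of phases in [0, 1),
    ordered FL, FR, RL, RR. Phase wraps around, so the largest
    possible per-leg error is 0.5 (half a gait cycle).
    """
    total = 0.0
    for actual, desired in zip(foot_phases, desired_phases):
        diff = abs(actual - desired) % 1.0
        total += min(diff, 1.0 - diff)   # circular distance
    return total / len(foot_phases)

# A perfect trot (FL-RR in phase, FR-RL in phase) scores zero error:
print(compute_phase_error([0.0, 0.5, 0.5, 0.0], [0.0, 0.5, 0.5, 0.0]))  # 0.0
```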

Training Procedure

training_config = {
    "num_envs": 4096,
    "max_iterations": 3000,   # More than standard (1500) — task is more complex

    # Command sampling — CRITICAL
    "command_sampling": {
        "vx_range": [-1.0, 2.0],
        "vy_range": [-0.5, 0.5],
        "yaw_range": [-1.0, 1.0],
        "body_height_range": [-0.1, 0.1],
        "step_freq_range": [2.0, 4.0],
        "gait": "uniform_categorical",   # randomly select gait
        "swing_height_range": [0.04, 0.12],
        "stance_width_range": [-0.05, 0.05],
        "body_pitch_range": [-0.3, 0.3],
        "body_roll_range": [-0.2, 0.2],
    },

    # Each episode, sample RANDOM command combination
    # → policy must learn all combinations
    "command_resample_interval": 500,  # steps
}
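The per-environment resampling can be sketched like this (illustrative; the repo layers a curriculum on top of plain uniform sampling):

```python
import random

GAITS = ["trot", "pace", "bound"]

def sample_command(ranges: dict) -> dict:
    """Draw one random extended command from the configured ranges."""
    cmd = {}
    for key, value in ranges.items():
        if key == "gait":
            cmd["gait"] = random.choice(GAITS)   # uniform categorical
        else:
            lo, hi = value
            cmd[key.removesuffix("_range")] = random.uniform(lo, hi)
    return cmd

ranges = {
    "vx_range": [-1.0, 2.0],
    "step_freq_range": [2.0, 4.0],
    "swing_height_range": [0.04, 0.12],
    "gait": "uniform_categorical",
}
cmd = sample_command(ranges)
assert -1.0 <= cmd["vx"] <= 2.0 and cmd["gait"] in GAITS
```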

The brilliant insight: at each resampling interval, every environment gets a fresh random command combination. With 4096 parallel envs, each iteration has 4096 different combinations running simultaneously; after 3000 iterations, the policy has seen millions of combinations.

RL training with many gait patterns in parallel

Results and Demo

What One Policy Can Do

With Walk These Ways, one single policy can:

| Behavior | Command |
| --- | --- |
| Trot 2 m/s | vx=2.0, gait=trot, freq=3.0 |
| Slow walk | vx=0.3, gait=trot, freq=1.5 |
| Crouch walk | vx=0.5, body_height=-0.08 |
| High-step march | vx=0.5, swing_height=0.12 |
| Bound gallop | vx=1.5, gait=bound |
| Strafe left | vy=-0.5, gait=trot |
| Spin in place | vx=0, yaw_rate=1.5 |
| Lean forward | vx=0, body_pitch=0.3 |
| Dance rhythm | Oscillate swing_height and body_height |
| Brace against push | body_height=-0.05, stance_width=0.05 |

And all transitions between behaviors are smooth — it's just changing continuous command values.
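One way to script such a transition (a sketch; the blending schedule is a deployment choice, not something the paper specifies) is to linearly interpolate between two command dicts over a short ramp:

```python
def blend_commands(cmd_a: dict, cmd_b: dict, alpha: float) -> dict:
    """Linearly interpolate two command dicts; alpha in [0, 1]."""
    return {k: (1.0 - alpha) * cmd_a[k] + alpha * cmd_b[k] for k in cmd_a}

crouch = {"vx": 0.5, "body_height": -0.08, "swing_height": 0.04}
march  = {"vx": 0.5, "body_height":  0.00, "swing_height": 0.12}

# Halfway through the ramp the robot is between the two styles:
mid = blend_commands(crouch, march, 0.5)
print(mid["swing_height"])  # 0.08
```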

Comparison with Single-Task Policies

| Metric | Single-task policy | Walk These Ways |
| --- | --- | --- |
| Tracking accuracy | Higher (~5%) | Good, slightly lower |
| Gait diversity | 1 gait | Multiple gaits |
| Transition quality | None | Smooth |
| Training time | 20 min × N gaits | 60 min (once) |
| Deployment complexity | N models | 1 model |
| Novel behaviors | No | Yes (by tuning commands) |

Hardware Demo

The paper demos on a Unitree Go1 (the Go2's predecessor). The policy is deployed on the onboard Jetson Xavier, with inference at 50 Hz, and reproduces the behaviors in the table above on real hardware.

How to Replicate

Step 1: Clone Repo

git clone https://github.com/Improbable-AI/walk-these-ways.git
cd walk-these-ways
pip install -e .

Step 2: Adjust Command Ranges

Main config file:

# walk_these_ways/envs/configs/go2_config.py
class Go2WTWCfg:
    class commands:
        # Adjust ranges for your robot
        lin_vel_x_range = [-1.0, 2.0]
        lin_vel_y_range = [-0.5, 0.5]
        ang_vel_yaw_range = [-1.0, 1.0]
        body_height_range = [-0.05, 0.05]
        step_frequency_range = [2.0, 4.0]
        gait_types = ["trot", "pace", "bound"]
        swing_height_range = [0.04, 0.10]

Step 3: Train

python train.py --task go2_wtw --num_envs 4096 --max_iterations 3000

# Training takes ~60 minutes on RTX 4090
# Longer than standard because observation and reward are more complex

Step 4: Deploy

Export to ONNX and run on the Go2 just as in Part 3. The only differences: the observation carries 15 command dims instead of 3, and you need a GUI or joystick to adjust commands in real time.

# Joystick mapping for Walk These Ways
joystick_mapping = {
    "left_stick_x": "vy",
    "left_stick_y": "vx",
    "right_stick_x": "yaw_rate",
    "right_stick_y": "body_height",
    "dpad_up": "swing_height += 0.01",
    "dpad_down": "swing_height -= 0.01",
    "button_a": "gait = trot",
    "button_b": "gait = pace",
    "button_x": "gait = bound",
    "L1": "step_frequency -= 0.5",
    "R1": "step_frequency += 0.5",
}
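A hypothetical teleop update applying that mapping each control tick (the axis names, scales, and clamping ranges here are illustrative, not from the repo):

```python
def apply_joystick(cmd: dict, axes: dict, buttons: dict) -> dict:
    """Update the extended command in place from one joystick reading."""
    cmd["vy"] = axes.get("left_stick_x", 0.0) * 0.5       # up to 0.5 m/s lateral
    cmd["vx"] = axes.get("left_stick_y", 0.0) * 2.0       # up to 2.0 m/s forward
    cmd["yaw_rate"] = axes.get("right_stick_x", 0.0) * 1.0
    if buttons.get("dpad_up"):
        cmd["swing_height"] = min(cmd["swing_height"] + 0.01, 0.12)
    if buttons.get("dpad_down"):
        cmd["swing_height"] = max(cmd["swing_height"] - 0.01, 0.04)
    for button, gait in [("button_a", "trot"), ("button_b", "pace"),
                         ("button_x", "bound")]:
        if buttons.get(button):
            cmd["gait"] = gait
    return cmd

cmd = {"vx": 0.0, "vy": 0.0, "yaw_rate": 0.0,
       "swing_height": 0.08, "gait": "trot"}
apply_joystick(cmd, {"left_stick_y": 0.5}, {"button_x": True})
print(cmd["vx"], cmd["gait"])  # 1.0 bound
```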

Impact and Related Works

Walk These Ways is one of the most influential papers in locomotion RL. It showed that RL policies can be generalizable — not just across terrains, but across behaviors.

Papers Building on Walk These Ways

  1. Extreme Parkour with Legged Robots (Cheng et al., 2024) — extends the approach from flat terrain to parkour (jumping, climbing, crawling), still using command conditioning
  2. DTC: Deep Tracking Control — pairs a low-level tracking controller with a high-level policy, echoing the command-conditioned design
  3. Humanoid locomotion — teams such as Agility Robotics (Digit) and Tesla (Optimus) have applied similar ideas to bipedal robots

Comparison with Other Approaches

| Approach | Paper | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Walk These Ways | Margolis & Agrawal, 2022 | 1 policy, many gaits, open-source | Command design requires experience |
| AMP (Adversarial Motion Priors) | Peng et al., 2021 | Natural motion from mocap | Needs motion capture data |
| DribbleBot | Ji et al., 2023 | Soccer + locomotion | Task-specific |
| Parkour | Cheng et al., 2024 | Extreme terrain | Requires depth camera |

Lessons from the Paper

1. Command Space Design is the Core

Designing the command space is the most important decision. Too few dimensions and the policy is not expressive enough; too many and training becomes hard. Walk These Ways settled on a 15-dim command after extensive experimentation.

2. Reward Engineering is Still an Art

Even with RL, the reward function still needs domain knowledge. Knowing what gaits are, what phase means, what swing height is — all comes from classical locomotion knowledge (Part 1).

3. Open-Source Changes Everything

Walk These Ways was fully open-sourced — code, config, trained weights. Anyone with a GPU and Unitree A1/Go2 can replicate it. This is why the paper has such large impact.

4. Sim-to-Real Remains the Bottleneck

Even with an excellent policy in sim, sim-to-real transfer is still the hardest step. The paper uses strong domain randomization (friction, mass, motor strength) but still requires fine-tuning for new robots.

Series Summary: Locomotion from Zero to Hero

Across these four posts, we've gone from theoretical foundations to state-of-the-art papers:

  1. Part 1: ZMP, CPG, IK — classical methods and why they're being replaced
  2. Part 2: RL formulation — MDP, reward shaping, PPO, curriculum learning
  3. Part 3: Hands-on — legged_gym, Unitree Go2, sim-to-real deployment
  4. Part 4 (this post): Walk These Ways — multi-gait learning from one policy

Locomotion RL is evolving rapidly. Hot new directions building on this work include parkour-style agility, coupling locomotion with high-level vision policies, and carrying command-conditioned control over to humanoids.