
Walk These Ways: Adaptive Locomotion with One Policy

Analyzing the Walk These Ways paper: training one single policy for multiple movement styles through multiplicity of behavior.

Nguyen Anh Tuan · February 17, 2026 · 9 min read

The Problem: One Gait, One Policy?

In previous parts of this series (Part 1, Part 2, Part 3), we trained locomotion policies, but each policy typically learns only one type of movement. Want the robot to trot? Train one policy. Want it to gallop? Train another. Slow walking, fast walking, sideways walking? Yet more policies.

This approach has several problems: training cost scales with the number of gaits, deployment means maintaining N separate models, and there is no smooth way to transition between behaviors at runtime.

The paper "Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior" (arXiv:2212.03238) by Gabriel Margolis and Pulkit Agrawal (MIT CSAIL, CoRL 2022) solves exactly this problem.

Multiple different movement styles from a single policy

Core Idea: Multiplicity of Behavior (MoB)

Key Insight

When training locomotion with RL, there are many ways to solve the same task. To move forward at 1 m/s, for example, a robot can trot, pace, bound, walk in a crouch, or march with exaggerated foot lifts; all of these satisfy the velocity command equally well.

RL typically converges to a single strategy (usually a trot, since it is the most stable). This paper asks: how can one policy learn MANY strategies simultaneously?

Solution: Command Conditioning

Instead of just sending velocity command (vx, vy, yaw_rate), Walk These Ways adds an extended command vector that controls how the robot moves:

# Standard locomotion command
standard_command = {
    "vx": 1.0,        # m/s, forward
    "vy": 0.0,        # m/s, lateral
    "yaw_rate": 0.0,   # rad/s, turning
}

# Walk These Ways EXTENDED command
wtw_command = {
    # === Velocity (same as standard) ===
    "vx": 1.0,
    "vy": 0.0,
    "yaw_rate": 0.0,

    # === Gait parameters (NEW) ===
    "body_height": 0.0,        # [-1, 1] low/high
    "step_frequency": 3.0,     # Hz, step frequency
    "gait": [1, 0, 0],         # one-hot: trot/pace/bound
    "swing_height": 0.08,      # m, foot lift height
    "stance_width": 0.0,       # [-1, 1] narrow/wide
    "body_pitch": 0.0,         # rad, forward/backward lean
    "body_roll": 0.0,          # rad, left/right lean

    # TOTAL: 15 dimensions in the full release (vs 3 before);
    # the fields shown here are a representative subset
}

Key idea: the policy receives the 15-dim command vector and learns to execute any combination of these parameters. One policy, a continuum of movement styles.
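As a minimal sketch (field names follow the illustrative dict above, not the exact keys used in the walk-these-ways repo), flattening such a command into the vector the policy consumes could look like:

```python
# Sketch: flatten the extended command dict into a policy input vector.
# Field names are illustrative, not the repo's actual key layout.
SCALAR_KEYS = [
    "vx", "vy", "yaw_rate",            # velocity command
    "body_height", "step_frequency",   # posture and timing
    "swing_height", "stance_width",
    "body_pitch", "body_roll",
]

def flatten_command(cmd: dict) -> list:
    """Concatenate scalar fields with the one-hot gait selector."""
    vec = [float(cmd[k]) for k in SCALAR_KEYS]
    vec += [float(g) for g in cmd["gait"]]   # one-hot: trot/pace/bound
    return vec

cmd = {
    "vx": 1.0, "vy": 0.0, "yaw_rate": 0.0,
    "body_height": 0.0, "step_frequency": 3.0, "gait": [1, 0, 0],
    "swing_height": 0.08, "stance_width": 0.0,
    "body_pitch": 0.0, "body_roll": 0.0,
}
print(len(flatten_command(cmd)))  # 12 of the ~15 dims sketched here
```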

Architecture and Training

Observation Space

observation = {
    # Proprioception (same as standard)
    "base_angular_velocity": 3,
    "projected_gravity": 3,
    "joint_positions": 12,
    "joint_velocities": 12,
    "previous_actions": 12,

    # Extended command (vs 3 dims standard)
    "extended_command": 15,

    # TOTAL: 57 dimensions
}
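Concatenated into a flat vector, these terms sum to the 57-dim policy input; a quick sanity check using the layout above:

```python
# Sanity check: the flat observation size is the sum of the per-term dims.
obs_layout = {
    "base_angular_velocity": 3,
    "projected_gravity": 3,
    "joint_positions": 12,
    "joint_velocities": 12,
    "previous_actions": 12,
    "extended_command": 15,   # vs 3 for a standard velocity command
}

obs_dim = sum(obs_layout.values())
print(obs_dim)  # 57
```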

Reward Function

Beyond standard rewards (velocity tracking, energy penalty), Walk These Ways adds gait-specific rewards:

import torch  # reward terms below operate on batched tensors

def compute_gait_reward(env):
    """
    Reward for gait pattern matching.
    Based on commanded gait, encourage correct contact pattern.
    """
    rewards = {}

    # 1. Step frequency tracking
    # Count actual steps vs. commanded frequency
    actual_freq = compute_step_frequency(env.foot_contacts)
    freq_error = (actual_freq - env.commands.step_frequency).square()
    rewards["step_freq"] = torch.exp(-freq_error / 0.25)

    # 2. Gait pattern tracking
    # Each gait has desired phase offsets between 4 legs
    gait_phases = {
        "trot":  [0.0, 0.5, 0.5, 0.0],  # FL-RR in phase, FR-RL in phase
        "pace":  [0.0, 0.5, 0.0, 0.5],  # FL-RL in phase, FR-RR in phase
        "bound": [0.0, 0.0, 0.5, 0.5],  # front in phase, rear in phase
    }
    desired_phases = get_desired_phases(env.commands.gait, gait_phases)
    phase_error = compute_phase_error(env.foot_contacts, desired_phases)
    rewards["gait_phase"] = torch.exp(-phase_error / 0.5)

    # 3. Swing height tracking
    actual_swing = env.foot_heights.max(dim=-1).values
    swing_error = (actual_swing - env.commands.swing_height).square()
    rewards["swing_height"] = torch.exp(-swing_error / 0.01)

    # 4. Body height tracking
    height_error = (env.base_height - env.commands.body_height_target).square()
    rewards["body_height"] = torch.exp(-height_error / 0.01)

    # 5. Body orientation tracking
    pitch_error = (env.base_euler[:, 1] - env.commands.body_pitch).square()
    roll_error = (env.base_euler[:, 0] - env.commands.body_roll).square()
    rewards["orientation"] = torch.exp(-(pitch_error + roll_error) / 0.1)

    return rewards
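The helper functions above are left undefined. A hypothetical `compute_phase_error` (a sketch, not the paper's exact formulation) could measure the mean circular distance between each leg's observed contact phase and the commanded offset:

```python
def compute_phase_error(foot_phases, desired_phases):
    """Mean circular distance between observed and desired leg phases.

    Both arguments are 4-element sequences of phases in [0, 1),
    ordered FL, FR, RL, RR. Phase wraps around, so the largest
    possible per-leg error is 0.5 (half a gait cycle).
    """
    total = 0.0
    for actual, desired in zip(foot_phases, desired_phases):
        diff = abs(actual - desired) % 1.0
        total += min(diff, 1.0 - diff)   # circular distance
    return total / len(foot_phases)

# A perfect trot (FL-RR in phase, FR-RL in phase) scores zero error:
print(compute_phase_error([0.0, 0.5, 0.5, 0.0], [0.0, 0.5, 0.5, 0.0]))  # 0.0
```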

Training Procedure

training_config = {
    "num_envs": 4096,
    "max_iterations": 3000,   # More than standard (1500) — task is more complex

    # Command sampling — CRITICAL
    "command_sampling": {
        "vx_range": [-1.0, 2.0],
        "vy_range": [-0.5, 0.5],
        "yaw_range": [-1.0, 1.0],
        "body_height_range": [-0.1, 0.1],
        "step_freq_range": [2.0, 4.0],
        "gait": "uniform_categorical",   # randomly select gait
        "swing_height_range": [0.04, 0.12],
        "stance_width_range": [-0.05, 0.05],
        "body_pitch_range": [-0.3, 0.3],
        "body_roll_range": [-0.2, 0.2],
    },

    # Each episode, sample RANDOM command combination
    # → policy must learn all combinations
    "command_resample_interval": 500,  # steps
}
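The per-environment resampling can be sketched like this (illustrative; the repo layers a curriculum on top of plain uniform sampling):

```python
import random

GAITS = ["trot", "pace", "bound"]

def sample_command(ranges: dict) -> dict:
    """Draw one random extended command from the configured ranges."""
    cmd = {}
    for key, value in ranges.items():
        if key == "gait":
            cmd["gait"] = random.choice(GAITS)   # uniform categorical
        else:
            lo, hi = value
            cmd[key.removesuffix("_range")] = random.uniform(lo, hi)
    return cmd

ranges = {
    "vx_range": [-1.0, 2.0],
    "step_freq_range": [2.0, 4.0],
    "swing_height_range": [0.04, 0.12],
    "gait": "uniform_categorical",
}
cmd = sample_command(ranges)
assert -1.0 <= cmd["vx"] <= 2.0 and cmd["gait"] in GAITS
```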

The brilliant insight: at each resampling interval, every environment gets a fresh random command combination. With 4096 parallel envs, each iteration has 4096 different combinations running simultaneously; after 3000 iterations, the policy has seen millions of combinations.

RL training with many gait patterns in parallel

Results and Demo

What One Policy Can Do

With Walk These Ways, one single policy can:

| Behavior | Command |
| --- | --- |
| Trot 2 m/s | vx=2.0, gait=trot, freq=3.0 |
| Slow walk | vx=0.3, gait=trot, freq=1.5 |
| Crouch walk | vx=0.5, body_height=-0.08 |
| High-step march | vx=0.5, swing_height=0.12 |
| Bound gallop | vx=1.5, gait=bound |
| Strafe left | vy=-0.5, gait=trot |
| Spin in place | vx=0, yaw_rate=1.5 |
| Lean forward | vx=0, body_pitch=0.3 |
| Dance rhythm | Oscillate swing_height and body_height |
| Brace against push | body_height=-0.05, stance_width=0.05 |

And all transitions between behaviors are smooth — it's just changing continuous command values.
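One way to script such a transition (a sketch; the blending schedule is a deployment choice, not something the paper specifies) is to linearly interpolate between two command dicts over a short ramp:

```python
def blend_commands(cmd_a: dict, cmd_b: dict, alpha: float) -> dict:
    """Linearly interpolate two command dicts; alpha in [0, 1]."""
    return {k: (1.0 - alpha) * cmd_a[k] + alpha * cmd_b[k] for k in cmd_a}

crouch = {"vx": 0.5, "body_height": -0.08, "swing_height": 0.04}
march  = {"vx": 0.5, "body_height":  0.00, "swing_height": 0.12}

# Halfway through the ramp the robot is between the two styles:
mid = blend_commands(crouch, march, 0.5)
print(mid["swing_height"])  # 0.08
```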

Comparison with Single-Task Policies

| Metric | Single-task policy | Walk These Ways |
| --- | --- | --- |
| Tracking accuracy | Higher (~5%) | Good, slightly lower |
| Gait diversity | 1 gait | Multiple gaits |
| Transition quality | None | Smooth |
| Training time | 20 min × N gaits | 60 min (once) |
| Deployment complexity | N models | 1 model |
| Novel behaviors | No | Yes (by tuning commands) |

Hardware Demo

The paper demos on a Unitree Go1 (the Go2's predecessor). The policy is deployed on the onboard Jetson Xavier, with inference at 50 Hz, and reproduces the behaviors in the table above on real hardware.

How to Replicate

Step 1: Clone Repo

git clone https://github.com/Improbable-AI/walk-these-ways.git
cd walk-these-ways
pip install -e .

Step 2: Adjust Command Ranges

Main config file:

# walk_these_ways/envs/configs/go2_config.py
class Go2WTWCfg:
    class commands:
        # Adjust ranges for your robot
        lin_vel_x_range = [-1.0, 2.0]
        lin_vel_y_range = [-0.5, 0.5]
        ang_vel_yaw_range = [-1.0, 1.0]
        body_height_range = [-0.05, 0.05]
        step_frequency_range = [2.0, 4.0]
        gait_types = ["trot", "pace", "bound"]
        swing_height_range = [0.04, 0.10]

Step 3: Train

python train.py --task go2_wtw --num_envs 4096 --max_iterations 3000

# Training takes ~60 minutes on RTX 4090
# Longer than standard because observation and reward are more complex

Step 4: Deploy

Export to ONNX and run on the Go2 just as in Part 3. The only differences: the observation carries 15 command dims instead of 3, and you need a GUI or joystick to adjust commands in real time.

# Joystick mapping for Walk These Ways
joystick_mapping = {
    "left_stick_x": "vy",
    "left_stick_y": "vx",
    "right_stick_x": "yaw_rate",
    "right_stick_y": "body_height",
    "dpad_up": "swing_height += 0.01",
    "dpad_down": "swing_height -= 0.01",
    "button_a": "gait = trot",
    "button_b": "gait = pace",
    "button_x": "gait = bound",
    "L1": "step_frequency -= 0.5",
    "R1": "step_frequency += 0.5",
}
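A hypothetical teleop update applying that mapping each control tick (the axis names, scales, and clamping ranges here are illustrative, not from the repo):

```python
def apply_joystick(cmd: dict, axes: dict, buttons: dict) -> dict:
    """Update the extended command in place from one joystick reading."""
    cmd["vy"] = axes.get("left_stick_x", 0.0) * 0.5       # up to 0.5 m/s lateral
    cmd["vx"] = axes.get("left_stick_y", 0.0) * 2.0       # up to 2.0 m/s forward
    cmd["yaw_rate"] = axes.get("right_stick_x", 0.0) * 1.0
    if buttons.get("dpad_up"):
        cmd["swing_height"] = min(cmd["swing_height"] + 0.01, 0.12)
    if buttons.get("dpad_down"):
        cmd["swing_height"] = max(cmd["swing_height"] - 0.01, 0.04)
    for button, gait in [("button_a", "trot"), ("button_b", "pace"),
                         ("button_x", "bound")]:
        if buttons.get(button):
            cmd["gait"] = gait
    return cmd

cmd = {"vx": 0.0, "vy": 0.0, "yaw_rate": 0.0,
       "swing_height": 0.08, "gait": "trot"}
apply_joystick(cmd, {"left_stick_y": 0.5}, {"button_x": True})
print(cmd["vx"], cmd["gait"])  # 1.0 bound
```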

Impact and Related Works

Walk These Ways is one of the most influential papers in locomotion RL. It showed that RL policies can be generalizable — not just across terrains, but across behaviors.

Papers Building on Walk These Ways

  1. Extreme Parkour with Legged Robots (Cheng et al., 2024) — extends the approach from flat terrain to parkour (jumping, climbing, crawling), still using command conditioning
  2. DTC: Deep Tracking Control — pairs a low-level tracking controller with a high-level policy, echoing the command-conditioned design
  3. Humanoid locomotion — teams such as Agility Robotics (Digit) and Tesla (Optimus) have applied similar ideas to bipedal robots

Comparison with Other Approaches

| Approach | Paper | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Walk These Ways | Margolis & Agrawal, 2022 | 1 policy, many gaits, open-source | Command design requires experience |
| AMP (Adversarial Motion Priors) | Peng et al., 2021 | Natural motion from mocap | Needs motion capture data |
| DribbleBot | Ji et al., 2023 | Soccer + locomotion | Task-specific |
| Parkour | Cheng et al., 2024 | Extreme terrain | Requires depth camera |

Lessons from the Paper

1. Command Space Design is the Core

Designing the command space is the most important decision. Too few dimensions and the policy is not expressive enough; too many and training becomes hard. Walk These Ways settled on a 15-dim command after extensive experimentation.

2. Reward Engineering is Still an Art

Even with RL, the reward function still needs domain knowledge. Knowing what gaits are, what phase means, what swing height is — all comes from classical locomotion knowledge (Part 1).

3. Open-Source Changes Everything

Walk These Ways was fully open-sourced — code, config, trained weights. Anyone with a GPU and Unitree A1/Go2 can replicate it. This is why the paper has such large impact.

4. Sim-to-Real Remains the Bottleneck

Even with an excellent policy in sim, sim-to-real transfer is still the hardest step. The paper uses strong domain randomization (friction, mass, motor strength) but still requires fine-tuning for new robots.

Series Summary: Locomotion from Zero to Hero

Across these four posts, we've gone from theoretical foundations to state-of-the-art papers:

  1. Part 1: ZMP, CPG, IK — classical methods and why they're being replaced
  2. Part 2: RL formulation — MDP, reward shaping, PPO, curriculum learning
  3. Part 3: Hands-on — legged_gym, Unitree Go2, sim-to-real deployment
  4. Part 4 (this post): Walk These Ways — multi-gait learning from one policy

Locomotion RL is evolving rapidly. Hot new directions building on this work include parkour-style agility, coupling locomotion with high-level vision policies, and carrying command-conditioned control over to humanoids.