RL cho Locomotion: PPO, reward shaping và curriculum

Từ Classical đến RL: Bước ngoặt của Locomotion

Trong Part 1, chúng ta đã tìm hiểu các phương pháp classical (ZMP, CPG, IK) và tại sao chúng bị hạn chế. Bây giờ là lúc đi vào RL-based locomotion -- phương pháp đang thống trị toàn bộ ngành robot di chuyển kể từ 2020.

Ý tưởng cốt lõi cực kỳ đơn giản: thay vì hand-craft controller, để robot tự học cách đi bằng trial-and-error trong simulation. Nhưng "đơn giản" không có nghĩa là "dễ" -- có rất nhiều decisions quan trọng cần hiểu rõ.

AI và machine learning cho robot control

MDP cho Walking: Formulation bài toán

Locomotion được formulate như một Markov Decision Process (MDP). Nếu bạn chưa quen thuộc với MDP, đọc RL cơ bản cho Robotics trước.

Observation Space -- Robot "nhìn thấy" gì?

Observation là tất cả thông tin mà policy nhận được để ra quyết định. Đối với locomotion, observation thường bao gồm:

observation_space = {
    # === Proprioception (cam nhan co the) ===
    "base_angular_velocity": 3,    # gyroscope (roll, pitch, yaw rates)
    "projected_gravity": 3,        # huong trong luc trong body frame
    "joint_positions": 12,         # goc cua 12 joints (3 joints x 4 chan)
    "joint_velocities": 12,        # van toc goc cua 12 joints
    
    # === Commands (lenh dieu khien) ===
    "velocity_command": 3,         # (vx, vy, yaw_rate) mong muon
    
    # === History (lich su) ===
    "previous_actions": 12,        # action truoc do
    
    # TONG: 45 dimensions
}

Tại sao không dùng visual input? Hầu hết locomotion policies chỉ dùng proprioception (joint angles, IMU) mà không dùng camera. Lý do:

Proprioception đủ để đi trên nhiều địa hình (robot "cảm nhận" mặt đất qua chân)
Visual input tăng observation dimension lên hàng nghìn → training chậm hơn
Camera có latency, noise, và sim-to-real gap lớn

Tuy nhiên, các paper mới như "Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild" (arXiv:2201.08117) đã bắt đầu kết hợp depth camera + proprioception để xử lý địa hình phức tạp hơn (cầu thang, gập ghềnh).

Action Space -- Robot làm gì?

Action space xác định robot điều khiển joints như thế nào:

Kiểu action	Mô tả	Ưu điểm	Nhược điểm
Joint position targets	Policy output góc joints mong muốn	Ổn định, dễ sim-to-real	Không trực tiếp control lực
Joint torques	Policy output moment lực trên joints	Control chính xác	Khó train, khó transfer
Position offsets	Output delta từ default standing pose	Train nhanh hơn position tuyệt đối	Cần default pose tốt

Phổ biến nhất hiện nay: Position offsets + PD controller. Policy output delta góc joints, PD controller chuyển thành torques:

# Action -> Joint torques (chay tren robot)
def pd_controller(target_pos, current_pos, current_vel, kp=40.0, kd=2.0):
    """
    PD controller chuyen position targets thanh torques
    kp, kd: proportional va derivative gains
    """
    torque = kp * (target_pos - current_pos) - kd * current_vel
    return np.clip(torque, -33.5, 33.5)  # torque limit

# Trong training loop
default_pose = np.array([0.0, 0.8, -1.6] * 4)  # standing pose cho 4 chan
action = policy(observation)  # output: 12 position offsets
target_positions = default_pose + action * action_scale  # scale action
torques = pd_controller(target_positions, current_joint_pos, current_joint_vel)

Reward Shaping -- Nghệ thuật thiết kế reward

Reward function là phần quan trọng nhất của locomotion RL. Một reward function tốt tạo ra robot đi mượt mà; một reward function tệ tạo ra robot run rẩy, nhảy nhót, hoặc lãng biến suất năng lượng.

Reward components

Reward function cho locomotion thường là tổng có trọng của nhiều components:

def compute_reward(env):
    """
    Reward function cho quadruped locomotion
    Moi component co weight rieng -- tuning weights la nghe thuat
    """
    rewards = {}
    
    # === Rewards (khuyen khich) ===
    
    # 1. Tracking van toc lenh
    lin_vel_error = (env.commands[:, :2] - env.base_lin_vel[:, :2]).square().sum(dim=1)
    rewards["tracking_lin_vel"] = torch.exp(-lin_vel_error / 0.25) * 1.5
    
    ang_vel_error = (env.commands[:, 2] - env.base_ang_vel[:, 2]).square()
    rewards["tracking_ang_vel"] = torch.exp(-ang_vel_error / 0.25) * 0.5
    
    # === Penalties (phat) ===
    
    # 2. Energy -- phat tieu hao nang luong
    rewards["torques"] = -torch.sum(env.torques.square(), dim=1) * 0.00001
    
    # 3. Smoothness -- phat thay doi action dot ngot
    rewards["action_rate"] = -torch.sum(
        (env.actions - env.last_actions).square(), dim=1
    ) * 0.01
    
    # 4. Air time -- khuyen khich chan roi khoi mat dat (foot clearance)
    contact = env.foot_contacts  # True neu chan cham dat
    first_contact = (env.feet_air_time > 0.) * contact
    rewards["feet_air_time"] = torch.sum(
        (env.feet_air_time - 0.5) * first_contact, dim=1
    ) * 1.0
    
    # 5. Orientation -- phat nghieng qua nhieu
    rewards["orientation"] = -torch.sum(
        env.projected_gravity[:, :2].square(), dim=1
    ) * 0.2
    
    # 6. Joint limits -- phat khi joint gan limit
    rewards["joint_limits"] = -torch.sum(
        (env.dof_pos - env.dof_pos_limits[:, 0]).clip(max=0.).square() +
        (env.dof_pos - env.dof_pos_limits[:, 1]).clip(min=0.).square(),
        dim=1
    ) * 10.0
    
    # 7. Collision -- phat khi body cham mat dat (khong phai chan)
    rewards["collision"] = -torch.sum(
        (torch.norm(env.contact_forces[:, env.penalized_bodies, :], dim=-1) > 0.1),
        dim=1
    ).float() * 1.0
    
    # Tong reward
    total_reward = sum(rewards.values())
    return total_reward, rewards

Reward design tips từ kinh nghiệm

Tip	Giải thích
Exponential tracking reward	Dùng `exp(-error/sigma)` thay vì `-error`. Cho gradient tốt hơn khi error lớn
Balance reward weights	Tracking reward nên lớn nhất (~1.5), penalties nên nhỏ (~0.001-0.1)
Feet air time	Reward này cực kỳ quan trọng -- không có nó, robot sẽ kéo chân trên mặt đất
Action rate penalty	Ngầm smooth movement, tránh robot giật giật
Base height reward	Thêm nếu robot hay ngồi xổm -- khuyến khích đứng thẳng

Training robot locomotion trong simulation với hàng nghìn parallel environments

Curriculum Learning -- Từ dễ đến khó

Một trong những breakthrough lớn nhất cho locomotion RL là curriculum learning -- bắt đầu train trên địa hình dễ, rồi tăng độ khó dần. Không có curriculum, robot phải học đi trên địa hình khó từ đầu và thường thất bại.

Terrain curriculum

class TerrainCurriculum:
    """
    Game-inspired curriculum (tu legged_gym paper)
    Robot duoc "len level" khi hoan thanh terrain hien tai
    """
    def __init__(self):
        self.terrain_levels = {
            0: "flat",           # mat phang
            1: "rough_02",       # gho ghe nhe (max height 2cm)
            2: "rough_05",       # gho ghe vua (max height 5cm)
            3: "rough_10",       # gho ghe manh (max height 10cm)
            4: "slopes_10deg",   # doc 10 do
            5: "slopes_20deg",   # doc 20 do
            6: "stairs_10cm",    # bac thang 10cm
            7: "stairs_15cm",    # bac thang 15cm
            8: "stepping_stones", # buoc qua da
            9: "random_mixed",   # ket hop tat ca
        }
    
    def update_level(self, robot_id, distance_walked):
        """
        Neu robot di duoc xa → len level
        Neu robot nga nhieu → giu hoac ha level
        """
        if distance_walked > self.threshold_up:
            self.levels[robot_id] = min(
                self.levels[robot_id] + 1, 
                self.max_level
            )
        elif distance_walked < self.threshold_down:
            self.levels[robot_id] = max(
                self.levels[robot_id] - 1, 
                0
            )

Command curriculum

Ngoài terrain, velocity commands cũng được curriculum hóa:

Giai đoạn 1: Chỉ command vận tốc thấp (0-0.5 m/s), chỉ đi thẳng
Giai đoạn 2: Tăng vận tốc (0-1.5 m/s), thêm rẽ trái/phải
Giai đoạn 3: Vận tốc cao (0-3.0 m/s), thêm đi lui, đi ngang

PPO Training Pipeline với Isaac Lab

Tổng quan pipeline

Isaac Lab Environment
    ↓
4,096 parallel robots (GPU)
    ↓
Observations (45 dims x 4096 robots)
    ↓
PPO Policy (MLP: 512-256-128)
    ↓
Actions (12 dims x 4096 robots)
    ↓
Environment step (GPU physics)
    ↓
Rewards + Terminations
    ↓
PPO Update (on-policy)
    ↓
Repeat ~2000 iterations (~20 phut tren RTX 4090)

PPO -- Tại sao là PPO?

Proximal Policy Optimization (Schulman et al., 2017) là algorithm mặc định cho locomotion vì:

On-policy: Thu thập data mới mỗi iteration, phù hợp với sim nhanh (4096 envs)
Stable: Clip ratio ngăn policy thay đổi quá lớn, tránh catastrophic forgetting
Parallel-friendly: RSL-RL implementation tối ưu cho GPU rollout collection

# PPO training config (Isaac Lab / RSL-RL)
ppo_config = {
    "policy": {
        "class_name": "ActorCritic",
        "hidden_dims": [512, 256, 128],
        "activation": "elu",
        "init_noise_std": 1.0,
    },
    "algorithm": {
        "value_loss_coef": 1.0,
        "use_clipped_value_loss": True,
        "clip_param": 0.2,
        "entropy_coef": 0.01,
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "learning_rate": 1e-3,
        "schedule": "adaptive",  # giam lr khi KL divergence nho
        "gamma": 0.99,
        "lam": 0.95,
        "desired_kl": 0.01,
        "max_grad_norm": 1.0,
    },
    "runner": {
        "num_steps_per_env": 24,  # rollout length
        "max_iterations": 1500,
    }
}

Policy architecture

Policy cho locomotion thường là MLP đơn giản -- không cần transformer hay CNN phức tạp:

import torch
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """
    Actor-Critic MLP cho locomotion
    Simple nhung hieu qua -- khong can phuc tap hon
    """
    def __init__(self, obs_dim=45, action_dim=12):
        super().__init__()
        
        # Shared backbone (khong)
        # Actor va Critic TACH BIET -- tot hon cho locomotion
        
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, action_dim),
        )
        
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, 1),
        )
        
        self.log_std = nn.Parameter(torch.zeros(action_dim))
    
    def forward(self, obs):
        action_mean = self.actor(obs)
        value = self.critic(obs)
        return action_mean, self.log_std.exp(), value

Hands-on: Train với Isaac Lab

# 1. Setup Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab && ./isaaclab.sh --install

# 2. Train ANYmal locomotion
python source/standalone/workflows/rsl_rl/train.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 4096 \
    --max_iterations 1500

# 3. Xem ket qua
python source/standalone/workflows/rsl_rl/play.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 64

Training mất khoảng 20 phút trên RTX 4090 cho 1500 iterations. Policy hội tụ và robot có thể đi trên rough terrain, leo dốc, và theo dõi velocity commands.

Domain Randomization cho Sim-to-Real

Để policy train trong sim hoạt động trên robot thật, cần domain randomization -- thay đổi ngẫu nhiên các tham số vật lý trong simulation:

domain_randomization = {
    # Dynamics randomization
    "friction_range": [0.5, 1.25],        # he so ma sat
    "restitution_range": [0.0, 0.4],       # do nay
    "added_mass_range": [-1.0, 3.0],       # kg them vao base
    "com_displacement_range": [-0.15, 0.15], # dich chuyen COM (m)
    
    # Motor randomization
    "motor_strength_range": [0.9, 1.1],    # sai so motor
    "motor_offset_range": [-0.02, 0.02],   # offset encoder (rad)
    
    # Observation noise
    "joint_pos_noise": 0.01,    # rad
    "joint_vel_noise": 1.5,     # rad/s
    "gravity_noise": 0.05,      # m/s^2
    "imu_noise": 0.2,           # rad/s
    
    # Latency
    "action_delay_range": [0, 2],  # frames delay
}

Với domain randomization, policy học cách robust với uncertainty -- không rely vào bất kỳ tham số cụ thể nào của sim.

Training monitoring -- Biết khi nào policy đang học tốt

Metric	Ý nghĩa	Giá trị tốt
Mean reward	Tổng reward trung bình	Tăng đều, hội tụ
Mean episode length	Robot sống được bao lâu	Tăng lên max (20s)
Tracking error	Sai số giữa command và actual velocity	Giảm về ~0.1 m/s
Feet air time	Thời gian chân ở trên không	~0.3-0.5s (gait tự nhiên)
Power consumption	Tiêu hao năng lượng	Giảm dần (hiệu quả hơn)

Nếu mean episode length không tăng sau 200 iterations, có thể reward function có vấn đề -- kiểm tra lại weights.

Robot quadruped locomotion training results

So sánh: RL vs Classical cho Locomotion

Tiêu chí	Classical (ZMP/CPG/IK)	RL (PPO)
Setup time	Hàng tháng (tuning)	Hàng giờ (reward + train)
Terrain generalization	Cần re-plan	Train 1 lần, chạy nhiều terrain
Optimality	Sub-optimal	Gần optimal cho reward function
Interpretability	Cao (biết robot đang làm gì)	Thấp (black box)
Safety guarantees	Có thể chứng minh stability	Không có formal guarantees
Compute requirement	CPU đủ	Cần GPU mạnh
Sim-to-real	Không cần sim	Cần sim + domain randomization

Tiếp theo trong series

Bạn đã hiểu cách formulate locomotion như RL problem và toàn bộ training pipeline. Trong các bài tiếp theo:

Part 3: Quadruped Locomotion: legged_gym đến Unitree Go2 -- Hands-on train và deploy lên robot thật
Part 4: Walk These Ways: Adaptive locomotion một policy -- Phân tích paper về multi-gait learning

Từ Classical đến RL: Bước ngoặt của Locomotion

AI và machine learning cho robot control

MDP cho Walking: Formulation bài toán

Locomotion được formulate như một Markov Decision Process (MDP). Nếu bạn chưa quen thuộc với MDP, đọc RL cơ bản cho Robotics trước.

Observation Space -- Robot "nhìn thấy" gì?

Observation là tất cả thông tin mà policy nhận được để ra quyết định. Đối với locomotion, observation thường bao gồm:

observation_space = {
    # === Proprioception (cam nhan co the) ===
    "base_angular_velocity": 3,    # gyroscope (roll, pitch, yaw rates)
    "projected_gravity": 3,        # huong trong luc trong body frame
    "joint_positions": 12,         # goc cua 12 joints (3 joints x 4 chan)
    "joint_velocities": 12,        # van toc goc cua 12 joints
    
    # === Commands (lenh dieu khien) ===
    "velocity_command": 3,         # (vx, vy, yaw_rate) mong muon
    
    # === History (lich su) ===
    "previous_actions": 12,        # action truoc do
    
    # TONG: 45 dimensions
}

Tại sao không dùng visual input? Hầu hết locomotion policies chỉ dùng proprioception (joint angles, IMU) mà không dùng camera. Lý do:

Proprioception đủ để đi trên nhiều địa hình (robot "cảm nhận" mặt đất qua chân)
Visual input tăng observation dimension lên hàng nghìn → training chậm hơn
Camera có latency, noise, và sim-to-real gap lớn

Action Space -- Robot làm gì?

Action space xác định robot điều khiển joints như thế nào:

Kiểu action	Mô tả	Ưu điểm	Nhược điểm
Joint position targets	Policy output góc joints mong muốn	Ổn định, dễ sim-to-real	Không trực tiếp control lực
Joint torques	Policy output moment lực trên joints	Control chính xác	Khó train, khó transfer
Position offsets	Output delta từ default standing pose	Train nhanh hơn position tuyệt đối	Cần default pose tốt

Phổ biến nhất hiện nay: Position offsets + PD controller. Policy output delta góc joints, PD controller chuyển thành torques:

# Action -> Joint torques (chay tren robot)
def pd_controller(target_pos, current_pos, current_vel, kp=40.0, kd=2.0):
    """
    PD controller chuyen position targets thanh torques
    kp, kd: proportional va derivative gains
    """
    torque = kp * (target_pos - current_pos) - kd * current_vel
    return np.clip(torque, -33.5, 33.5)  # torque limit

# Trong training loop
default_pose = np.array([0.0, 0.8, -1.6] * 4)  # standing pose cho 4 chan
action = policy(observation)  # output: 12 position offsets
target_positions = default_pose + action * action_scale  # scale action
torques = pd_controller(target_positions, current_joint_pos, current_joint_vel)

Reward Shaping -- Nghệ thuật thiết kế reward

Reward components

Reward function cho locomotion thường là tổng có trọng của nhiều components:

def compute_reward(env):
    """
    Reward function cho quadruped locomotion
    Moi component co weight rieng -- tuning weights la nghe thuat
    """
    rewards = {}
    
    # === Rewards (khuyen khich) ===
    
    # 1. Tracking van toc lenh
    lin_vel_error = (env.commands[:, :2] - env.base_lin_vel[:, :2]).square().sum(dim=1)
    rewards["tracking_lin_vel"] = torch.exp(-lin_vel_error / 0.25) * 1.5
    
    ang_vel_error = (env.commands[:, 2] - env.base_ang_vel[:, 2]).square()
    rewards["tracking_ang_vel"] = torch.exp(-ang_vel_error / 0.25) * 0.5
    
    # === Penalties (phat) ===
    
    # 2. Energy -- phat tieu hao nang luong
    rewards["torques"] = -torch.sum(env.torques.square(), dim=1) * 0.00001
    
    # 3. Smoothness -- phat thay doi action dot ngot
    rewards["action_rate"] = -torch.sum(
        (env.actions - env.last_actions).square(), dim=1
    ) * 0.01
    
    # 4. Air time -- khuyen khich chan roi khoi mat dat (foot clearance)
    contact = env.foot_contacts  # True neu chan cham dat
    first_contact = (env.feet_air_time > 0.) * contact
    rewards["feet_air_time"] = torch.sum(
        (env.feet_air_time - 0.5) * first_contact, dim=1
    ) * 1.0
    
    # 5. Orientation -- phat nghieng qua nhieu
    rewards["orientation"] = -torch.sum(
        env.projected_gravity[:, :2].square(), dim=1
    ) * 0.2
    
    # 6. Joint limits -- phat khi joint gan limit
    rewards["joint_limits"] = -torch.sum(
        (env.dof_pos - env.dof_pos_limits[:, 0]).clip(max=0.).square() +
        (env.dof_pos - env.dof_pos_limits[:, 1]).clip(min=0.).square(),
        dim=1
    ) * 10.0
    
    # 7. Collision -- phat khi body cham mat dat (khong phai chan)
    rewards["collision"] = -torch.sum(
        (torch.norm(env.contact_forces[:, env.penalized_bodies, :], dim=-1) > 0.1),
        dim=1
    ).float() * 1.0
    
    # Tong reward
    total_reward = sum(rewards.values())
    return total_reward, rewards

Reward design tips từ kinh nghiệm

Tip	Giải thích
Exponential tracking reward	Dùng `exp(-error/sigma)` thay vì `-error`. Cho gradient tốt hơn khi error lớn
Balance reward weights	Tracking reward nên lớn nhất (~1.5), penalties nên nhỏ (~0.001-0.1)
Feet air time	Reward này cực kỳ quan trọng -- không có nó, robot sẽ kéo chân trên mặt đất
Action rate penalty	Ngầm smooth movement, tránh robot giật giật
Base height reward	Thêm nếu robot hay ngồi xổm -- khuyến khích đứng thẳng

Training robot locomotion trong simulation với hàng nghìn parallel environments

Curriculum Learning -- Từ dễ đến khó

Terrain curriculum

class TerrainCurriculum:
    """
    Game-inspired curriculum (tu legged_gym paper)
    Robot duoc "len level" khi hoan thanh terrain hien tai
    """
    def __init__(self):
        self.terrain_levels = {
            0: "flat",           # mat phang
            1: "rough_02",       # gho ghe nhe (max height 2cm)
            2: "rough_05",       # gho ghe vua (max height 5cm)
            3: "rough_10",       # gho ghe manh (max height 10cm)
            4: "slopes_10deg",   # doc 10 do
            5: "slopes_20deg",   # doc 20 do
            6: "stairs_10cm",    # bac thang 10cm
            7: "stairs_15cm",    # bac thang 15cm
            8: "stepping_stones", # buoc qua da
            9: "random_mixed",   # ket hop tat ca
        }
    
    def update_level(self, robot_id, distance_walked):
        """
        Neu robot di duoc xa → len level
        Neu robot nga nhieu → giu hoac ha level
        """
        if distance_walked > self.threshold_up:
            self.levels[robot_id] = min(
                self.levels[robot_id] + 1, 
                self.max_level
            )
        elif distance_walked < self.threshold_down:
            self.levels[robot_id] = max(
                self.levels[robot_id] - 1, 
                0
            )

Command curriculum

Ngoài terrain, velocity commands cũng được curriculum hóa:

Giai đoạn 1: Chỉ command vận tốc thấp (0-0.5 m/s), chỉ đi thẳng
Giai đoạn 2: Tăng vận tốc (0-1.5 m/s), thêm rẽ trái/phải
Giai đoạn 3: Vận tốc cao (0-3.0 m/s), thêm đi lui, đi ngang

PPO Training Pipeline với Isaac Lab

Tổng quan pipeline

Isaac Lab Environment
    ↓
4,096 parallel robots (GPU)
    ↓
Observations (45 dims x 4096 robots)
    ↓
PPO Policy (MLP: 512-256-128)
    ↓
Actions (12 dims x 4096 robots)
    ↓
Environment step (GPU physics)
    ↓
Rewards + Terminations
    ↓
PPO Update (on-policy)
    ↓
Repeat ~2000 iterations (~20 phut tren RTX 4090)

PPO -- Tại sao là PPO?

Proximal Policy Optimization (Schulman et al., 2017) là algorithm mặc định cho locomotion vì:

On-policy: Thu thập data mới mỗi iteration, phù hợp với sim nhanh (4096 envs)
Stable: Clip ratio ngăn policy thay đổi quá lớn, tránh catastrophic forgetting
Parallel-friendly: RSL-RL implementation tối ưu cho GPU rollout collection

# PPO training config (Isaac Lab / RSL-RL)
ppo_config = {
    "policy": {
        "class_name": "ActorCritic",
        "hidden_dims": [512, 256, 128],
        "activation": "elu",
        "init_noise_std": 1.0,
    },
    "algorithm": {
        "value_loss_coef": 1.0,
        "use_clipped_value_loss": True,
        "clip_param": 0.2,
        "entropy_coef": 0.01,
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "learning_rate": 1e-3,
        "schedule": "adaptive",  # giam lr khi KL divergence nho
        "gamma": 0.99,
        "lam": 0.95,
        "desired_kl": 0.01,
        "max_grad_norm": 1.0,
    },
    "runner": {
        "num_steps_per_env": 24,  # rollout length
        "max_iterations": 1500,
    }
}

Policy architecture

Policy cho locomotion thường là MLP đơn giản -- không cần transformer hay CNN phức tạp:

import torch
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """
    Actor-Critic MLP cho locomotion
    Simple nhung hieu qua -- khong can phuc tap hon
    """
    def __init__(self, obs_dim=45, action_dim=12):
        super().__init__()
        
        # Shared backbone (khong)
        # Actor va Critic TACH BIET -- tot hon cho locomotion
        
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, action_dim),
        )
        
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 512),
            nn.ELU(),
            nn.Linear(512, 256),
            nn.ELU(),
            nn.Linear(256, 128),
            nn.ELU(),
            nn.Linear(128, 1),
        )
        
        self.log_std = nn.Parameter(torch.zeros(action_dim))
    
    def forward(self, obs):
        action_mean = self.actor(obs)
        value = self.critic(obs)
        return action_mean, self.log_std.exp(), value

Hands-on: Train với Isaac Lab

# 1. Setup Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab && ./isaaclab.sh --install

# 2. Train ANYmal locomotion
python source/standalone/workflows/rsl_rl/train.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 4096 \
    --max_iterations 1500

# 3. Xem ket qua
python source/standalone/workflows/rsl_rl/play.py \
    --task Isaac-Velocity-Rough-Anymal-D-v0 \
    --num_envs 64

Training mất khoảng 20 phút trên RTX 4090 cho 1500 iterations. Policy hội tụ và robot có thể đi trên rough terrain, leo dốc, và theo dõi velocity commands.

Domain Randomization cho Sim-to-Real

Để policy train trong sim hoạt động trên robot thật, cần domain randomization -- thay đổi ngẫu nhiên các tham số vật lý trong simulation:

domain_randomization = {
    # Dynamics randomization
    "friction_range": [0.5, 1.25],        # he so ma sat
    "restitution_range": [0.0, 0.4],       # do nay
    "added_mass_range": [-1.0, 3.0],       # kg them vao base
    "com_displacement_range": [-0.15, 0.15], # dich chuyen COM (m)
    
    # Motor randomization
    "motor_strength_range": [0.9, 1.1],    # sai so motor
    "motor_offset_range": [-0.02, 0.02],   # offset encoder (rad)
    
    # Observation noise
    "joint_pos_noise": 0.01,    # rad
    "joint_vel_noise": 1.5,     # rad/s
    "gravity_noise": 0.05,      # m/s^2
    "imu_noise": 0.2,           # rad/s
    
    # Latency
    "action_delay_range": [0, 2],  # frames delay
}

Với domain randomization, policy học cách robust với uncertainty -- không rely vào bất kỳ tham số cụ thể nào của sim.

Training monitoring -- Biết khi nào policy đang học tốt

Metric	Ý nghĩa	Giá trị tốt
Mean reward	Tổng reward trung bình	Tăng đều, hội tụ
Mean episode length	Robot sống được bao lâu	Tăng lên max (20s)
Tracking error	Sai số giữa command và actual velocity	Giảm về ~0.1 m/s
Feet air time	Thời gian chân ở trên không	~0.3-0.5s (gait tự nhiên)
Power consumption	Tiêu hao năng lượng	Giảm dần (hiệu quả hơn)

Nếu mean episode length không tăng sau 200 iterations, có thể reward function có vấn đề -- kiểm tra lại weights.

Robot quadruped locomotion training results

So sánh: RL vs Classical cho Locomotion

Tiêu chí	Classical (ZMP/CPG/IK)	RL (PPO)
Setup time	Hàng tháng (tuning)	Hàng giờ (reward + train)
Terrain generalization	Cần re-plan	Train 1 lần, chạy nhiều terrain
Optimality	Sub-optimal	Gần optimal cho reward function
Interpretability	Cao (biết robot đang làm gì)	Thấp (black box)
Safety guarantees	Có thể chứng minh stability	Không có formal guarantees
Compute requirement	CPU đủ	Cần GPU mạnh
Sim-to-real	Không cần sim	Cần sim + domain randomization

Tiếp theo trong series

Bạn đã hiểu cách formulate locomotion như RL problem và toàn bộ training pipeline. Trong các bài tiếp theo:

Part 3: Quadruped Locomotion: legged_gym đến Unitree Go2 -- Hands-on train và deploy lên robot thật
Part 4: Walk These Ways: Adaptive locomotion một policy -- Phân tích paper về multi-gait learning

Từ Classical đến RL: Bước ngoặt của Locomotion

MDP cho Walking: Formulation bài toán

Observation Space -- Robot "nhìn thấy" gì?

Action Space -- Robot làm gì?

Reward Shaping -- Nghệ thuật thiết kế reward

Reward components

Reward design tips từ kinh nghiệm

Curriculum Learning -- Từ dễ đến khó

Terrain curriculum

Command curriculum

PPO Training Pipeline với Isaac Lab

Tổng quan pipeline

PPO -- Tại sao là PPO?

Policy architecture

Hands-on: Train với Isaac Lab

Domain Randomization cho Sim-to-Real

Training monitoring -- Biết khi nào policy đang học tốt

So sánh: RL vs Classical cho Locomotion

Tiếp theo trong series

Bài viết liên quan

Nguyễn Anh Tuấn

Bài viết liên quan

Sim-to-Real cho Locomotion: Thực tế và kinh nghiệm

Robot Parkour: Nhảy, leo cầu thang bằng RL

Walk These Ways: Adaptive locomotion một policy

Từ Classical đến RL: Bước ngoặt của Locomotion

MDP cho Walking: Formulation bài toán

Observation Space -- Robot "nhìn thấy" gì?

Action Space -- Robot làm gì?

Reward Shaping -- Nghệ thuật thiết kế reward

Reward components

Reward design tips từ kinh nghiệm

Curriculum Learning -- Từ dễ đến khó

Terrain curriculum

Command curriculum

PPO Training Pipeline với Isaac Lab

Tổng quan pipeline

PPO -- Tại sao là PPO?

Policy architecture

Hands-on: Train với Isaac Lab

Domain Randomization cho Sim-to-Real

Training monitoring -- Biết khi nào policy đang học tốt

So sánh: RL vs Classical cho Locomotion

Tiếp theo trong series

Bài viết liên quan

Nguyễn Anh Tuấn

Bài viết liên quan

Sim-to-Real cho Locomotion: Thực tế và kinh nghiệm

Robot Parkour: Nhảy, leo cầu thang bằng RL

Walk These Ways: Adaptive locomotion một policy