Reward Engineering for Bipedal Walking: The Art of Reward Design

If the observation space is the robot's "eyes", the reward function is its "soul": it shapes every behavior the policy learns. Designing rewards for bipedal walking is more art than science. With too few reward terms the robot learns a bizarre gait; with too many, training fails to converge.

In the previous post, Basics and Environment Setup, we set up the environment. Now let's dig into the hardest part: designing the reward function.

Reward Engineering Overview

Why is reward design hard?

Bipedal walking has several objectives that must hold at the same time:

Walk at the commanded speed
Keep the torso upright
Lift the feet high enough (foot clearance)
Save energy
Move smoothly (no jerking)
Coordinate the two legs rhythmically

Each objective needs its own reward term, and the relative weights of those terms determine the final behavior.
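
As a minimal sketch of how weights arbitrate between objectives, the snippet below scores two made-up behaviors under two weight sets. The helper `weighted_total` and all term values are illustrative assumptions, not measurements from a trained policy.

```python
def weighted_total(weights, terms):
    """Weighted sum of reward terms; a term without a weight counts as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())

# Hypothetical per-step term values in [0, 1] for two behaviors.
lunging = {"linear_vel_tracking": 0.9, "upright": 0.2}
steady_walk = {"linear_vel_tracking": 0.7, "upright": 0.9}

# Two candidate weight sets.
velocity_heavy = {"linear_vel_tracking": 2.0, "upright": 0.1}
balanced = {"linear_vel_tracking": 1.5, "upright": 0.5}

for label, weights in [("velocity_heavy", velocity_heavy), ("balanced", balanced)]:
    scores = {
        "lunging": weighted_total(weights, lunging),
        "steady_walk": weighted_total(weights, steady_walk),
    }
    # The behavior with the highest score is the one the optimizer is pushed toward.
    print(label, "prefers", max(scores, key=scores.get), scores)
```

With these numbers the velocity-heavy weights score lunging higher, while the balanced weights score the steady walk higher. The term values are identical in both cases; only the weights changed.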

Reward Hacking: Enemy Number One

Reward hacking happens when the policy finds a way to maximize reward without actually doing what you want. Classic examples:

Flawed reward design	The robot will...
Reward forward velocity only	Lunge forward head-first, then fall
Reward standing upright only	Stand still and never step
Reward foot clearance too heavily	Lift a foot and never put it down
No energy penalty	Jitter constantly (high-frequency oscillation)
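
One cheap guard against these failure modes is to compare the return with a task metric that the reward does not directly optimize. The sketch below assumes you log a total return and a forward distance per episode; the helper `flag_suspicious_episodes`, the field names, and the thresholds are all hypothetical.

```python
def flag_suspicious_episodes(episodes, min_return, min_distance_m):
    """Return indices of episodes with a high return but little forward progress."""
    return [
        i for i, ep in enumerate(episodes)
        if ep["return"] >= min_return and ep["distance_m"] < min_distance_m
    ]

# Hypothetical logged episodes: total return and forward distance travelled.
episodes = [
    {"return": 410.0, "distance_m": 9.5},  # walks as intended
    {"return": 395.0, "distance_m": 0.2},  # scores well while staying in place
    {"return": 120.0, "distance_m": 1.1},  # simply a poor episode
]

suspicious = flag_suspicious_episodes(episodes, min_return=300.0, min_distance_m=2.0)
print(suspicious)  # indices worth reviewing on video
```

Only the second episode is flagged: its return is high but the robot barely moved. A check like this narrows down which rollouts to watch; it does not replace watching them.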

12+ Reward Terms for Bipedal Walking

Below is a complete reward function, with each term explained in detail:

import torch
import numpy as np

class BipedalRewardFunction:
    """
    Complete reward function for humanoid bipedal walking.
    Assembled from Humanoid-Gym, Walk These Ways, and hands-on experience.
    """

    def __init__(self, cfg):
        # Reward weights: THIS IS WHERE THE MAGIC HAPPENS
        self.weights = {
            # === Tracking rewards (primary objectives) ===
            "linear_vel_tracking": 1.5,    # Track the commanded velocity
            "angular_vel_tracking": 0.8,   # Track the commanded yaw rate

            # === Posture rewards (balance) ===
            "upright": 0.5,                # Keep the torso upright
            "base_height": 0.3,            # Keep the base height stable

            # === Gait quality rewards ===
            "foot_clearance": 0.3,         # Lift the feet high enough
            "contact_pattern": 0.4,        # The two feet must alternate
            "foot_slip": -0.1,             # Penalize foot slip

            # === Smoothness rewards ===
            "action_rate": -0.01,          # Penalize abrupt action changes
            "torque": -0.00005,            # Penalize large torques
            "joint_acceleration": -0.0001, # Penalize joint accelerations

            # === Safety rewards ===
            "termination": -200.0,         # Heavy penalty for falling
            "joint_limit": -1.0,           # Penalize approaching joint limits

            # === Style rewards (optional) ===
            "feet_air_time": 0.2,          # Reward time spent with a foot in the air
        }

    def compute_reward(self, state, action, prev_action, command):
        """Compute the total reward from all terms."""
        rewards = {}

        # 1. Linear velocity tracking
        # The most important term: the robot must move at the commanded speed
        vel_error = torch.sum(
            torch.square(command[:, :2] - state["base_lin_vel"][:, :2]),
            dim=1
        )
        rewards["linear_vel_tracking"] = torch.exp(-vel_error / 0.25)

        # 2. Angular velocity tracking (yaw rate)
        yaw_error = torch.square(
            command[:, 2] - state["base_ang_vel"][:, 2]
        )
        rewards["angular_vel_tracking"] = torch.exp(-yaw_error / 0.25)

        # 3. Upright posture
        # projected_gravity[2] = -1 when the torso is perfectly upright, so the
        # squared deviation is 0 when upright and grows with tilt. It is negated
        # because "upright" carries a positive weight: tilting must lower the reward.
        rewards["upright"] = -torch.square(state["projected_gravity"][:, 2] + 1.0)

        # 4. Base height
        # Keep the height close to the target (0.72 m for the G1)
        target_height = 0.72
        height_error = torch.square(state["base_height"] - target_height)
        rewards["base_height"] = torch.exp(-height_error / 0.05)

        # 5. Foot clearance
        # Reward swing feet for lifting higher than 0.06 m
        swing_mask = state["foot_contact"] < 0.5  # foot is not touching the ground
        foot_height = state["foot_height"]
        clearance_reward = torch.where(
            swing_mask,
            torch.clamp(foot_height - 0.06, min=0.0),
            torch.zeros_like(foot_height)
        )
        rewards["foot_clearance"] = torch.sum(clearance_reward, dim=1)

        # 6. Contact pattern: the two feet should alternate
        # foot_contact is expected to be a float tensor (1.0 = foot on the ground)
        left_contact = state["foot_contact"][:, 0]
        right_contact = state["foot_contact"][:, 1]
        # Penalize flight phases, when both feet are in the air at once.
        # Note: this term does not penalize prolonged double support.
        both_air = (1 - left_contact) * (1 - right_contact)
        rewards["contact_pattern"] = -both_air

        # Terms 7-12 are penalties. Each returns a non-negative magnitude and
        # takes its sign from the negative weight in self.weights; negating the
        # value here as well would turn the penalty into a bonus.

        # 7. Foot slip penalty
        # Penalize horizontal foot velocity while the foot is in contact
        contact_mask = state["foot_contact"] > 0.5
        foot_vel = torch.norm(state["foot_velocity"][:, :, :2], dim=2)
        slip = contact_mask * foot_vel
        rewards["foot_slip"] = torch.sum(slip, dim=1)

        # 8. Action rate penalty
        # Penalize the change in action between two consecutive timesteps
        action_diff = torch.sum(torch.square(action - prev_action), dim=1)
        rewards["action_rate"] = action_diff

        # 9. Torque penalty
        torques = state["applied_torques"]
        rewards["torque"] = torch.sum(torch.square(torques), dim=1)

        # 10. Joint acceleration penalty
        joint_acc = state["joint_accelerations"]
        rewards["joint_acceleration"] = torch.sum(torch.square(joint_acc), dim=1)

        # 11. Termination penalty (1.0 on the step where the episode terminates)
        rewards["termination"] = state["is_terminated"].float()

        # 12. Joint limit penalty
        joint_pos = state["joint_positions"]
        lower = state["joint_lower_limits"]
        upper = state["joint_upper_limits"]
        margin = 0.1  # start penalizing 0.1 rad before the hard limit
        below = torch.clamp(lower + margin - joint_pos, min=0.0)
        above = torch.clamp(joint_pos - upper + margin, min=0.0)
        rewards["joint_limit"] = torch.sum(below + above, dim=1)

        # 13. Feet air time (style reward)
        # Reward each foot for swinging about 0.3 s (a natural-looking gait)
        target_air_time = 0.3
        air_time_error = torch.abs(
            state["feet_air_time"] - target_air_time
        )
        rewards["feet_air_time"] = torch.sum(
            torch.exp(-air_time_error / 0.1), dim=1
        )

        # Weighted sum of all terms. zeros_like keeps the accumulator on the
        # same device and dtype as the per-term rewards (e.g. on the GPU).
        total_reward = torch.zeros_like(rewards["linear_vel_tracking"])
        for name, value in rewards.items():
            total_reward += self.weights[name] * value

        return total_reward, rewards
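
Because both the weights and the term values carry a sign, it is easy to negate a penalty twice and end up rewarding the behavior you meant to punish. The following is a minimal sanity check, written with plain floats rather than tensors; `signed_contributions` and `find_sign_errors` are hypothetical helpers and the numbers are made up.

```python
def signed_contributions(weights, term_values):
    """Map each term name to weight * value, the amount it adds to the total."""
    return {name: weights[name] * value for name, value in term_values.items()}

def find_sign_errors(weights, term_values, penalty_names):
    """List penalty terms whose contribution is positive, i.e. acts as a bonus."""
    contributions = signed_contributions(weights, term_values)
    return sorted(name for name in penalty_names if contributions[name] > 0.0)

weights = {"linear_vel_tracking": 1.5, "torque": -0.00005, "foot_slip": -0.1}
penalties = ["torque", "foot_slip"]

# Penalty values stored as non-negative magnitudes: the sign comes from the weight.
good_terms = {"linear_vel_tracking": 0.8, "torque": 4000.0, "foot_slip": 0.3}
# Same magnitudes but already negated: negative weight * negative value > 0.
bad_terms = {"linear_vel_tracking": 0.8, "torque": -4000.0, "foot_slip": -0.3}

print(find_sign_errors(weights, good_terms, penalties))
print(find_sign_errors(weights, bad_terms, penalties))
```

The first call reports no problems, while the second reports both penalty terms. Running a check like this on one batch of the `rewards` dict returned by `compute_reward` is a quick way to catch the mistake before a long training run.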

Reward Weighting Strategies

Method 1: Manual Tuning (most common)

Start from the default weights, train for about 30 minutes, watch a video of the resulting behavior, then adjust:

# Iteration 1: the robot lunges forward head-first
weights_v1 = {"linear_vel_tracking": 2.0, "upright": 0.1}
# → Increase upright, decrease velocity tracking

# Iteration 2: the robot stands still, very upright
weights_v2 = {"linear_vel_tracking": 1.0, "upright": 1.0}
# → Better balanced, but the robot does not lift its feet

# Iteration 3: add foot clearance + contact pattern
weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# → The robot starts to show a gait!
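
When tuning by hand, it helps to know which terms actually dominate the total before changing a weight. The sketch below computes each term's share of the total absolute weighted reward. `contribution_shares` is a hypothetical helper and the mean term values are invented for illustration; in practice they would come from the per-term `rewards` dict averaged over a rollout.

```python
def contribution_shares(weights, mean_terms):
    """Fraction of the total absolute weighted reward attributable to each term."""
    magnitudes = {name: abs(weights[name] * value) for name, value in mean_terms.items()}
    total = sum(magnitudes.values())
    if total == 0.0:
        return {name: 0.0 for name in magnitudes}
    return {name: m / total for name, m in magnitudes.items()}

weights_v3 = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
}
# Hypothetical mean per-step value of each term over one evaluation rollout.
mean_terms = {
    "linear_vel_tracking": 0.6,
    "upright": 0.9,
    "foot_clearance": 0.02,
    "contact_pattern": -0.1,
}

shares = contribution_shares(weights_v3, mean_terms)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {share:6.1%}")
```

In this made-up rollout the foot clearance term contributes well under one percent of the total, which would explain a policy that ignores it: the term, not only its weight, may need rescaling.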

Method 2: Exponential Reward Scaling

Instead of a linear reward, use an exponential kernel to create a sharp peak near the target:

def exponential_tracking_reward(error, sigma=0.25):
    """
    exp(-error^2 / sigma) gives a bell-shaped reward.
    `error` is the raw (unsquared) tracking error; it is squared here.
    - small sigma: reward only when very close to the target
    - large sigma: reward even when far from the target
    """
    return torch.exp(-torch.square(error) / sigma)

# Velocity tracking with sigma=0.25 (raw_vel_error = command - measured velocity)
vel_reward = exponential_tracking_reward(raw_vel_error, sigma=0.25)

# Height tracking with sigma=0.05 (tighter); raw_height_error = height - target
height_reward = exponential_tracking_reward(raw_height_error, sigma=0.05)
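
To make the effect of sigma concrete, here is a small worked example using scalar math instead of tensors. The helper name `exp_kernel` and the sample errors are chosen only for illustration.

```python
import math

def exp_kernel(error, sigma):
    """Bell-shaped tracking reward exp(-error^2 / sigma) for a raw error."""
    return math.exp(-(error ** 2) / sigma)

errors = (0.0, 0.1, 0.25, 0.5)
for sigma in (0.05, 0.25):
    row = [round(exp_kernel(e, sigma), 3) for e in errors]
    print(f"sigma={sigma}: {row}")
```

An error of 0.5 still earns exp(-1), roughly 0.37 of the full reward, with sigma=0.25, but less than 0.01 with sigma=0.05. A tight sigma gives a precise target but almost no learning signal far from it, which is one reason to widen sigma early in training.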

Method 3: Adaptive Reward Scaling

Automatically rescale each term from its running magnitude during training, so that all terms stay on a comparable scale:

class AdaptiveRewardScaler:
    """Rescale each reward term so its running magnitude stays near a target."""

    def __init__(self, reward_names, target_magnitude=1.0):
        self.target_magnitude = target_magnitude
        # EMA of each term's mean absolute value, seeded at the target
        self.ema = {name: target_magnitude for name in reward_names}
        self.alpha = 0.99

    def scale(self, rewards_dict):
        scaled = {}
        for name, value in rewards_dict.items():
            # Update the exponential moving average of the term's magnitude
            mag = torch.abs(value).mean().item()
            self.ema[name] = self.alpha * self.ema[name] + (1 - self.alpha) * mag

            # Rescale so the running magnitude matches target_magnitude
            if self.ema[name] > 1e-6:
                scaled[name] = value * (self.target_magnitude / self.ema[name])
            else:
                scaled[name] = value  # near-zero term: leave it unscaled
        return scaled
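
The decay `alpha = 0.99` controls how quickly the scaler reacts when a term's magnitude changes. As a rough worked example, the sketch below estimates how many updates the moving average needs to close 90% of a sudden change, first in closed form and then by simulating the same update rule. The helper `steps_to_adapt` and the magnitudes 1.0 and 5.0 are illustrative assumptions.

```python
import math

def steps_to_adapt(alpha, fraction):
    """Updates needed for an EMA with decay `alpha` to close `fraction` of a step change."""
    # After n updates the remaining gap is alpha**n, so solve alpha**n = 1 - fraction.
    return math.ceil(math.log(1.0 - fraction) / math.log(alpha))

print(steps_to_adapt(alpha=0.99, fraction=0.9))

# Cross-check by simulating ema = alpha * ema + (1 - alpha) * mag directly.
start, target = 1.0, 5.0
ema, steps = start, 0
while abs(target - ema) > 0.1 * abs(target - start):
    ema = 0.99 * ema + 0.01 * target
    steps += 1
print(steps)
```

Both approaches give about 230 updates. If the scaler is updated once per training iteration, it lags behind a curriculum switch by a few hundred iterations, so a smaller alpha may be preferable when phases are short.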

Reward Curriculum: From Standing to Walking

Rather than training walking from the very start, curriculum learning makes training more stable:

Phase 1: Standing (0-500 iterations)

curriculum_phase_1 = {
    "linear_vel_tracking": 0.0,     # No walking yet
    "upright": 2.0,                  # Prioritize standing upright
    "base_height": 1.0,             # Hold the base height
    "termination": -200.0,          # Penalize falling
    "torque": -0.0001,              # Save energy
}
# Command velocity = [0, 0, 0]

Phase 2: Weight Shifting (500-1500 iterations)

curriculum_phase_2 = {
    "linear_vel_tracking": 0.3,     # Start tracking small velocities
    "upright": 1.5,
    "base_height": 0.8,
    "contact_pattern": 0.2,         # Start requiring the feet to alternate
    "foot_clearance": 0.1,
    "termination": -200.0,
    "torque": -0.0001,
}
# Command velocity = [0.2, 0, 0]  # slow

Phase 3: Walking (1500-5000 iterations)

curriculum_phase_3 = {
    "linear_vel_tracking": 1.5,     # Full tracking
    "angular_vel_tracking": 0.8,
    "upright": 0.5,
    "base_height": 0.3,
    "foot_clearance": 0.3,
    "contact_pattern": 0.4,
    "foot_slip": -0.1,
    "action_rate": -0.01,
    "torque": -0.00005,
    "joint_acceleration": -0.0001,
    "termination": -200.0,
    "joint_limit": -1.0,
    "feet_air_time": 0.2,
}
# Command velocity = sampled at random; the forward range widens to
# vx in [-0.3, 1.0] m/s by the end of the phase (see get_command_range)

class RewardCurriculum:
    """Reward curriculum manager."""

    def __init__(self):
        # (first iteration of the phase, weights used from then on)
        self.phases = [
            (0, curriculum_phase_1),
            (500, curriculum_phase_2),
            (1500, curriculum_phase_3),
        ]

    def get_weights(self, iteration):
        """Return the weights that apply at the current iteration."""
        current_weights = self.phases[0][1]
        for threshold, weights in self.phases:
            if iteration >= threshold:
                current_weights = weights
        return current_weights

    def get_command_range(self, iteration):
        """Gradually widen the command velocity range."""
        if iteration < 500:
            return {"vx": [0, 0], "vy": [0, 0], "yaw": [0, 0]}
        elif iteration < 1500:
            return {"vx": [0, 0.3], "vy": [-0.1, 0.1], "yaw": [-0.2, 0.2]}
        else:
            progress = min((iteration - 1500) / 3500, 1.0)
            max_vx = 0.3 + progress * 0.7  # 0.3 → 1.0 m/s
            return {
                "vx": [-0.3, max_vx],
                "vy": [-0.3 * progress, 0.3 * progress],
                "yaw": [-0.5 * progress, 0.5 * progress],
            }
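
Note that `get_weights` switches weight sets abruptly at each threshold, which can cause a sudden jump in the reward scale. One possible alternative, offered here as an assumption rather than something the curriculum above does, is to interpolate linearly between two phases over a short ramp. `blend_weights`, the 100-iteration ramp, and the shortened phase dicts are hypothetical.

```python
def blend_weights(old, new, progress):
    """Linearly interpolate two weight dicts; a missing key counts as 0.0."""
    t = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    names = sorted(set(old) | set(new))
    return {n: (1.0 - t) * old.get(n, 0.0) + t * new.get(n, 0.0) for n in names}

# Shortened versions of the phase 1 and phase 2 weights, for illustration.
phase_1 = {"linear_vel_tracking": 0.0, "upright": 2.0, "base_height": 1.0}
phase_2 = {"linear_vel_tracking": 0.3, "upright": 1.5, "base_height": 0.8,
           "contact_pattern": 0.2}

# Hypothetical 100-iteration ramp that starts at the phase boundary.
ramp_start, ramp_length = 500, 100
for iteration in (500, 550, 600):
    progress = (iteration - ramp_start) / ramp_length
    print(iteration, blend_weights(phase_1, phase_2, progress))
```

Terms that only exist in the later phase, such as `contact_pattern` here, fade in from zero instead of appearing at full strength.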

Ablation Study: What Does Each Reward Term Contribute?

The table below summarizes what happens when each reward term is removed:

Removed reward term	Consequence	Impact
linear_vel_tracking	Robot does not walk	Critical
upright	Robot leans and falls after 2-3 steps	Critical
foot_clearance	Robot drags its feet (shuffling)	High
contact_pattern	Robot hops or stands on both feet	High
action_rate	Robot jerks violently, motor wear	Medium
torque	Wastes energy, motors overheat	Medium
foot_slip	Feet slide on the ground	Medium
base_height	Robot crouches or squats	Medium
joint_limit	Joints hit their limits	Low-Medium
feet_air_time	Unnatural gait	Low
joint_acceleration	Slightly jerky motion	Low
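
An ablation like this amounts to one training run per removed term. As a minimal sketch of how the configurations could be generated, the hypothetical helper `ablation_configs` below copies a base weight dict and zeroes exactly one term at a time; the shortened `base` dict is for illustration only, and launching the runs is left out.

```python
def ablation_configs(base_weights):
    """Yield (removed_term, weights) pairs, each with exactly one term zeroed."""
    for removed in base_weights:
        weights = dict(base_weights)  # copy, so the base config stays untouched
        weights[removed] = 0.0
        yield removed, weights

base = {
    "linear_vel_tracking": 1.5,
    "upright": 0.5,
    "foot_clearance": 0.3,
    "torque": -0.00005,
}

for removed, weights in ablation_configs(base):
    print(f"ablate {removed}: {weights}")
```

Zeroing the weight rather than deleting the key keeps the term in the logs, so you can still see how the unrewarded quantity drifts during training.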

Comparing Reward Formulations Across Papers

Paper	Number of reward terms	Key characteristics
Walk These Ways (Margolis 2023)	13	Gait frequency command, versatile gaits
Humanoid-Gym (Gu 2024)	10	Focus sim-to-real, conservative rewards
Learning Humanoid Locomotion (Radosavovic 2024)	8	Transformer policy, simpler rewards
Expressive Humanoid (Cheng 2024)	15+	AMP style reward, reference motions

Pitfalls and Best Practices

Pitfall 1: Reward magnitude mismatch

# WRONG: vel_tracking is ~1.0 but the raw torque penalty is ~10000
rewards = vel_reward - 0.001 * torque_penalty
# Even with a small weight, 0.001 * 10000 = 10, so the torque penalty still dominates the gradient

# RIGHT: normalize before weighting (torque_penalty is a non-negative magnitude)
rewards = vel_reward - 0.001 * (torque_penalty / torque_scale)
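
The arithmetic behind this pitfall is worth spelling out. The sketch below uses the magnitudes quoted in the comments; `torque_scale = 10_000.0` is an assumed typical magnitude, which in practice you would estimate from logged rollouts.

```python
vel_reward = 1.0            # typical magnitude of the tracking term
torque_penalty = 10_000.0   # typical raw sum of squared torques
weight = 0.001
torque_scale = 10_000.0     # assumed normalization constant

unnormalized = weight * torque_penalty
normalized = weight * (torque_penalty / torque_scale)

# Ratio of the weighted torque term to the tracking term in each case.
print(f"unnormalized ratio: {unnormalized / vel_reward:.3f}")
print(f"normalized ratio:   {normalized / vel_reward:.3f}")
```

Without normalization the weighted torque term is about ten times larger than the tracking reward. After normalization it is about a thousandth of it, so the weight finally expresses the intended priority.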

Pitfall 2: Sparse vs Dense rewards

# WRONG: reward only on reaching the goal (sparse)
reward = 100.0 if reached_goal else 0.0
# The robot gets no signal about direction until it reaches the goal by chance

# RIGHT: reward continuous progress (dense)
reward = -distance_to_goal + velocity_toward_goal
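
For concreteness, here is one way the two dense quantities could be computed in 2D. This is a sketch under the assumption that `velocity_toward_goal` means the velocity component along the unit vector pointing at the goal; the function name and the sample numbers are illustrative.

```python
import math

def dense_goal_reward(position, velocity, goal):
    """-distance_to_goal + velocity component pointing at the goal (2D)."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    distance = math.hypot(dx, dy)
    if distance == 0.0:
        return 0.0  # already at the goal: no distance penalty, no direction
    # Project the velocity onto the unit vector from the robot to the goal.
    velocity_toward_goal = (velocity[0] * dx + velocity[1] * dy) / distance
    return -distance + velocity_toward_goal

goal = (3.0, 4.0)
toward = dense_goal_reward((0.0, 0.0), (0.6, 0.8), goal)    # moving at the goal
away = dense_goal_reward((0.0, 0.0), (-0.6, -0.8), goal)    # moving away from it
print(toward, away)
```

From the same position, moving toward the goal at 1 m/s scores about -4 and moving away scores about -6, so the policy gets a directional signal on every step instead of only on arrival.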

Pitfall 3: Forgetting the termination penalty

# WRONG: no penalty for falling
# The robot will try risky behaviors because falling costs nothing

# RIGHT: penalize falling heavily
if is_terminated:
    reward -= 200.0

To learn more about quadruped locomotion (a simpler problem that follows the same principles), see the post Quadruped Locomotion with RL.

Summary

Reward engineering for bipedal walking is not "set and forget". It is an iterative loop: design → train → watch the video → adjust. Key takeaways:

A natural gait needs 12+ reward terms
Exponential tracking works better than linear tracking for velocity and height targets
A reward curriculum (standing → weight shifting → walking) makes training more stable
The ablation study shows how much each term matters
Reward hacking is the enemy: always watch videos of the behavior, not just the reward curve

The next post, Unitree G1: Train Walking Policy from Scratch, applies all of this to actual training on the Unitree G1 in Isaac Lab.

References

Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior. Margolis & Agrawal, CoRL 2022 (PMLR proceedings published 2023)
Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition. Siekmann et al., ICRA 2021
Humanoid-Gym: Reinforcement Learning for Humanoid Robot. Gu et al., 2024
Learning Humanoid Locomotion with Transformers. Radosavovic et al., 2024