Unitree G1: Train Walking Policy từ Scratch trong Isaac Lab

Sau hai bài lý thuyết về MDP formulation và reward engineering, đã đến lúc bắt tay vào training thực tế. Trong bài này, chúng ta sẽ train một walking policy hoàn chỉnh cho Unitree G1 — từ zero đến robot đi được — sử dụng PPO trong Isaac Lab.

Unitree G1 — Thông số kỹ thuật

Unitree G1 là humanoid nhỏ gọn, phù hợp cho nghiên cứu và ứng dụng thực tế:

Thông số	Giá trị
Chiều cao	1.27 m
Cân nặng	35 kg
Tổng DOF	23 (chân 12, tay 8, torso 3)
DOF chân (mỗi bên)	6 (hip roll/pitch/yaw, knee, ankle roll/pitch)
Tốc độ tối đa	~2 m/s
Thời gian hoạt động	~2 giờ
Onboard compute	Jetson Orin NX
Giá	~$16,000 USD

So sánh với H1 (1.8m, 47kg, 19 DOF chân), G1 nhỏ hơn và dễ thực nghiệm hơn. Nhưng trọng tâm thấp hơn cũng có nghĩa là dynamics khác biệt — tương tự sự khác biệt giữa trẻ em và người lớn khi đi bộ.

Bước 1: Load G1 URDF vào Isaac Lab

Chuẩn bị robot model

"""
Bước 1: Cấu hình Unitree G1 trong Isaac Lab.
File: g1_env_cfg.py
"""
import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.assets import ArticulationCfg
from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.lab.managers import RewardTermCfg, TerminationTermCfg
from omni.isaac.lab.scene import InteractiveSceneCfg
from omni.isaac.lab.terrains import TerrainImporterCfg
from omni.isaac.lab.utils import configclass

@configclass
class G1RobotCfg:
    """Cấu hình robot Unitree G1."""

    robot = ArticulationCfg(
        prim_path="/World/envs/env_.*/Robot",
        spawn=sim_utils.UsdFileCfg(
            usd_path="datasets/robots/unitree/g1/g1.usd",
            activate_contact_sensors=True,
        ),
        init_state=ArticulationCfg.InitialStateCfg(
            pos=(0.0, 0.0, 0.75),
            joint_pos={
                # === Left leg ===
                "left_hip_yaw": 0.0,
                "left_hip_roll": 0.0,
                "left_hip_pitch": -0.15,
                "left_knee": 0.35,
                "left_ankle_pitch": -0.20,
                "left_ankle_roll": 0.0,
                # === Right leg ===
                "right_hip_yaw": 0.0,
                "right_hip_roll": 0.0,
                "right_hip_pitch": -0.15,
                "right_knee": 0.35,
                "right_ankle_pitch": -0.20,
                "right_ankle_roll": 0.0,
                # Arms: slightly bent
                ".*shoulder.*": 0.0,
                ".*elbow.*": 0.3,
            },
        ),
        actuators={
            "legs": sim_utils.DCMotorCfg(
                joint_names_expr=[
                    ".*hip.*", ".*knee.*", ".*ankle.*"
                ],
                stiffness={
                    ".*hip.*": 150.0,
                    ".*knee.*": 200.0,
                    ".*ankle.*": 40.0,
                },
                damping={
                    ".*hip.*": 5.0,
                    ".*knee.*": 8.0,
                    ".*ankle.*": 2.0,
                },
                effort_limit=100.0,
            ),
            "arms": sim_utils.DCMotorCfg(
                joint_names_expr=[".*shoulder.*", ".*elbow.*"],
                stiffness=80.0,
                damping=4.0,
                effort_limit=50.0,
            ),
        },
    )

Cấu hình môi trường đầy đủ

@configclass
class G1FlatWalkEnvCfg(ManagerBasedRLEnvCfg):
    """Environment hoàn chỉnh cho G1 flat-ground walking."""

    # === Simulation ===
    sim = sim_utils.SimulationCfg(
        dt=0.005,                    # 200 Hz physics
        render_interval=4,
        gravity=(0.0, 0.0, -9.81),
    )

    # === Scene ===
    scene = InteractiveSceneCfg(
        num_envs=4096,               # 4096 environments song song
        env_spacing=2.5,
    )

    # === Terrain (flat ground) ===
    terrain = TerrainImporterCfg(
        prim_path="/World/ground",
        terrain_type="plane",
        physics_material=sim_utils.RigidBodyMaterialCfg(
            static_friction=1.0,
            dynamic_friction=1.0,
            restitution=0.0,
        ),
    )

    # === Robot ===
    robot = G1RobotCfg.robot

    # === Observations ===
    observations = ObservationGroupCfg(
        policy=ObservationGroupCfg(
            concatenate_terms=True,
            terms={
                "base_ang_vel": ObservationTermCfg(
                    func="base_ang_vel", noise={"mean": 0, "std": 0.05}
                ),
                "projected_gravity": ObservationTermCfg(
                    func="projected_gravity",
                ),
                "velocity_commands": ObservationTermCfg(
                    func="generated_commands",
                    params={"command_name": "base_velocity"},
                ),
                "joint_pos": ObservationTermCfg(
                    func="joint_pos_rel",
                    noise={"mean": 0, "std": 0.01},
                ),
                "joint_vel": ObservationTermCfg(
                    func="joint_vel_rel",
                    noise={"mean": 0, "std": 0.1},
                    scale=0.05,  # Scale down vì magnitude lớn
                ),
                "actions": ObservationTermCfg(
                    func="last_action",
                ),
            },
        ),
    )

    # === Control ===
    decimation = 4                   # Policy ở 50 Hz
    episode_length_s = 20.0

    # === Commands ===
    commands = {
        "base_velocity": {
            "ranges": {
                "lin_vel_x": (-0.5, 1.0),    # Forward/backward
                "lin_vel_y": (-0.3, 0.3),     # Lateral
                "ang_vel_z": (-0.5, 0.5),     # Yaw rate
            },
            "resampling_time": 10.0,          # Đổi command mỗi 10s
        },
    }

Bước 2: Cấu hình PPO Hyperparameters

PPO (Proximal Policy Optimization) là thuật toán được dùng nhiều nhất cho locomotion vì ổn định và hiệu quả:

"""
Bước 2: PPO training configuration.
File: g1_ppo_cfg.py
"""

class G1PPORunnerCfg:
    """Cấu hình PPO cho G1 walking."""

    # === Algorithm ===
    algorithm = {
        "class_name": "PPO",
        "clip_param": 0.2,           # PPO clip range
        "entropy_coef": 0.01,        # Khuyến khích exploration
        "gamma": 0.99,               # Discount factor
        "lam": 0.95,                 # GAE lambda
        "learning_rate": 1e-3,       # Initial LR
        "lr_schedule": "adaptive",   # Giảm LR khi KL divergence cao
        "desired_kl": 0.01,
        "max_grad_norm": 1.0,
        "num_learning_epochs": 5,    # Epochs per update
        "num_mini_batches": 4,
        "value_loss_coef": 1.0,
    }

    # === Policy network ===
    policy = {
        "class_name": "ActorCritic",
        "activation": "elu",
        "actor_hidden_dims": [512, 256, 128],
        "critic_hidden_dims": [512, 256, 128],
        "init_noise_std": 1.0,
    }

    # === Runner ===
    runner = {
        "num_steps_per_env": 24,     # Steps per env per update
        "max_iterations": 5000,      # Tổng iterations
        "save_interval": 500,
        "log_interval": 10,
    }

Giải thích các hyperparameters quan trọng

Parameter	Giá trị	Tại sao?
`num_envs`	4096	Đủ data cho stable gradients. Tăng lên 8192 nếu có VRAM
`clip_param`	0.2	Chuẩn PPO. Giảm xuống 0.1 nếu training bất ổn
`entropy_coef`	0.01	Quá cao → random actions. Quá thấp → local optima
`learning_rate`	1e-3	Adaptive schedule tự giảm khi cần
`actor_hidden_dims`	[512,256,128]	Đủ capacity cho 23 DOF. Network quá nhỏ → underfitting
`num_steps_per_env`	24	~0.48s real-time mỗi update. Tăng lên 48 cho long-horizon tasks
`init_noise_std`	1.0	Exploration ban đầu cao, giảm dần tự động

Bước 3: Reward Function cho G1

Áp dụng kiến thức từ bài reward engineering:

"""
Bước 3: Reward terms cho G1 walking.
File: g1_rewards.py
"""
import torch
from omni.isaac.lab.managers import RewardTermCfg

class G1RewardsCfg:
    """Reward configuration cho G1 flat walking."""

    # === Primary tracking ===
    lin_vel_xy_tracking = RewardTermCfg(
        func="track_lin_vel_xy_exp",
        weight=1.5,
        params={"std": 0.25, "command_name": "base_velocity"},
    )
    ang_vel_z_tracking = RewardTermCfg(
        func="track_ang_vel_z_exp",
        weight=0.8,
        params={"std": 0.25, "command_name": "base_velocity"},
    )

    # === Posture ===
    flat_orientation = RewardTermCfg(
        func="flat_orientation_l2", weight=-1.0,
    )
    base_height = RewardTermCfg(
        func="base_height_l2",
        weight=-0.5,
        params={"target_height": 0.72},
    )

    # === Regularization ===
    action_rate = RewardTermCfg(
        func="action_rate_l2", weight=-0.01,
    )
    joint_torques = RewardTermCfg(
        func="joint_torques_l2", weight=-5e-5,
    )
    joint_accel = RewardTermCfg(
        func="joint_accel_l2", weight=-1e-4,
    )

    # === Gait ===
    feet_air_time = RewardTermCfg(
        func="feet_air_time",
        weight=0.2,
        params={"threshold": 0.3, "sensor_cfg": "contact_forces"},
    )
    undesired_contacts = RewardTermCfg(
        func="undesired_contacts",
        weight=-1.0,
        params={
            "sensor_cfg": "contact_forces",
            "threshold": 1.0,
            # Phạt khi đầu gối hoặc torso chạm đất
            "body_names": [".*knee.*", ".*torso.*", ".*hip.*"],
        },
    )

    # === Termination ===
    termination_penalty = RewardTermCfg(
        func="is_terminated", weight=-200.0,
    )

    # === Velocity penalty (penalize unwanted velocities) ===
    lin_vel_z = RewardTermCfg(
        func="lin_vel_z_l2", weight=-0.5,  # Phạt bouncing
    )
    ang_vel_xy = RewardTermCfg(
        func="ang_vel_xy_l2", weight=-0.05,  # Phạt lắc ngang
    )

Bước 4: Chạy Training

Lệnh training

# Training trên RTX 4090 — khoảng 45-60 phút
cd IsaacLab
python source/standalone/workflows/rsl_rl/train.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=4096 \
    --max_iterations=5000 \
    --headless

# Với logging đến WandB
python source/standalone/workflows/rsl_rl/train.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=4096 \
    --max_iterations=5000 \
    --headless \
    --logger wandb \
    --wandb_project g1-locomotion

Training schedule điển hình

Iteration 0-200:    Robot ngã liên tục, reward ~-100
Iteration 200-500:  Bắt đầu đứng được, reward ~-20 → 0
Iteration 500-1500: Bước đầu tiên! reward tăng dần → 5-10
Iteration 1500-3000: Walking ổn định, reward ~10-15
Iteration 3000-5000: Fine-tuning, reward ~15-20 (plateau)

Monitor training

"""
Script theo dõi training progress.
File: monitor_training.py
"""
import matplotlib.pyplot as plt
import json
import glob
import numpy as np

def plot_training_curves(log_dir):
    """Plot các metrics quan trọng trong quá trình training."""

    # Load logs
    log_files = sorted(glob.glob(f"{log_dir}/events.out.tfevents.*"))
    # ... hoặc parse từ JSON logs

    # Metrics cần theo dõi
    metrics = {
        "reward/total": [],
        "reward/lin_vel_tracking": [],
        "reward/termination": [],
        "loss/policy": [],
        "loss/value": [],
        "diagnostics/mean_episode_length": [],
    }

    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Plot 1: Total reward
    axes[0, 0].set_title("Total Reward")
    axes[0, 0].set_xlabel("Iteration")
    axes[0, 0].set_ylabel("Reward")

    # Plot 2: Velocity tracking
    axes[0, 1].set_title("Velocity Tracking Reward")

    # Plot 3: Episode length (dài = robot không ngã)
    axes[0, 2].set_title("Episode Length (s)")

    # Plot 4: Policy loss
    axes[1, 0].set_title("Policy Loss")

    # Plot 5: Value loss
    axes[1, 1].set_title("Value Loss")

    # Plot 6: Termination rate
    axes[1, 2].set_title("Termination Rate")

    plt.tight_layout()
    plt.savefig("training_curves.png", dpi=150)
    print("Saved training_curves.png")

if __name__ == "__main__":
    plot_training_curves("logs/g1_flat_walk")

Bước 5: Evaluate Policy

Chạy trained policy

# Evaluate với rendering
python source/standalone/workflows/rsl_rl/play.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=16 \
    --checkpoint=logs/g1_flat_walk/model_5000.pt

# Record video
python source/standalone/workflows/rsl_rl/play.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=1 \
    --checkpoint=logs/g1_flat_walk/model_5000.pt \
    --video \
    --video_length=300

Đánh giá định lượng

"""
Đánh giá chất lượng walking policy.
File: evaluate_g1.py
"""
import torch
import numpy as np

class G1PolicyEvaluator:
    """Đánh giá walking policy trên nhiều tiêu chí."""

    def __init__(self, env, policy, num_eval_steps=1000):
        self.env = env
        self.policy = policy
        self.num_eval_steps = num_eval_steps

    def evaluate(self):
        """Chạy evaluation đầy đủ."""
        metrics = {
            "velocity_tracking_error": [],
            "energy_per_meter": [],
            "survival_rate": [],
            "gait_symmetry": [],
            "smoothness": [],
        }

        obs = self.env.reset()
        for step in range(self.num_eval_steps):
            action = self.policy(obs)
            obs, reward, done, info = self.env.step(action)

            # 1. Velocity tracking error
            cmd_vel = info["velocity_command"][:, 0]  # vx command
            actual_vel = info["base_lin_vel"][:, 0]    # vx actual
            vel_error = torch.abs(cmd_vel - actual_vel).mean().item()
            metrics["velocity_tracking_error"].append(vel_error)

            # 2. Energy consumption (Cost of Transport)
            torques = info["applied_torques"]
            joint_vel = info["joint_velocities"]
            power = torch.sum(torch.abs(torques * joint_vel), dim=1)
            velocity = torch.norm(info["base_lin_vel"][:, :2], dim=1)
            cot = power / (35.0 * 9.81 * torch.clamp(velocity, min=0.1))
            metrics["energy_per_meter"].append(cot.mean().item())

            # 3. Gait symmetry
            left_phase = info["left_foot_contact_time"]
            right_phase = info["right_foot_contact_time"]
            symmetry = 1.0 - torch.abs(
                left_phase - right_phase
            ).mean().item()
            metrics["gait_symmetry"].append(symmetry)

        # Tổng hợp kết quả
        results = {}
        for key, values in metrics.items():
            results[key] = {
                "mean": np.mean(values),
                "std": np.std(values),
            }

        # In báo cáo
        print("=" * 50)
        print("G1 Walking Policy Evaluation")
        print("=" * 50)
        print(f"Vel tracking error: {results['velocity_tracking_error']['mean']:.3f} "
              f"± {results['velocity_tracking_error']['std']:.3f} m/s")
        print(f"Cost of Transport:  {results['energy_per_meter']['mean']:.2f} "
              f"± {results['energy_per_meter']['std']:.2f}")
        print(f"Gait symmetry:      {results['gait_symmetry']['mean']:.3f}")
        print(f"(Human CoT ≈ 0.2, Good robot CoT < 1.0)")

        return results

Tips cho Stable Training

1. Observation normalization (BẮT BUỘC)

# Trong config
normalize_observations = True
clip_observations = 100.0

2. Reward scaling

# Clip reward để tránh outliers
clip_rewards = True
max_reward = 50.0
min_reward = -200.0

3. Action smoothing penalty

# Nếu robot giật → tăng action_rate weight
action_rate_weight = -0.01  # Mặc định
# Nếu vẫn giật:
action_rate_weight = -0.05  # Tăng 5x

4. Gradient clipping

max_grad_norm = 1.0  # Clip gradient L2 norm
# Tránh gradient explosion khi reward có spike

5. Early termination conditions

class G1TerminationsCfg:
    """Điều kiện kết thúc episode."""

    # Ngã (base height quá thấp)
    base_contact = TerminationTermCfg(
        func="illegal_contact",
        params={
            "sensor_cfg": "contact_forces",
            "threshold": 1.0,
            "body_names": ["torso", ".*head.*"],
        },
    )

    # Nghiêng quá nhiều (> 60 độ)
    bad_orientation = TerminationTermCfg(
        func="bad_orientation",
        params={"limit_angle": 1.05},  # ~60 degrees
    )

    # Timeout
    time_out = TerminationTermCfg(
        func="time_out",
        time_out=True,  # Không phạt termination cho timeout
    )

Để hiểu thêm về Isaac Lab, xem bài Isaac Lab: GPU-Accelerated Simulation. Về kiểm soát humanoid truyền thống, xem Humanoid Control Methods.

Tổng kết

Trong bài này, chúng ta đã hoàn thành full training pipeline cho Unitree G1:

Load G1 URDF với joint configuration và PD gains phù hợp
Cấu hình PPO với 4096 envs, adaptive learning rate
Thiết kế 13 reward terms cân bằng giữa tracking, posture, smoothness
Train ~1 giờ trên RTX 4090, đạt walking ổn định
Evaluate velocity tracking, energy efficiency, gait symmetry

Kết quả típ: velocity tracking error < 0.15 m/s, Cost of Transport < 0.8, gait symmetry > 0.9.

Bài tiếp theo — Unitree G1: Terrain Adaptation — sẽ mở rộng policy sang terrain phức tạp: cầu thang, dốc, và mặt đất gồ ghề.

Tài liệu tham khảo

Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim-to-Real Transfer — Gu et al., 2024
Proximal Policy Optimization Algorithms — Schulman et al., 2017
Isaac Lab: A Unified Framework for Robot Learning — Mittal et al., 2025
Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo et al., Science Robotics, 2019