RL Locomotion cho Humanoid: Cơ bản và Environment Setup

Bạn đã bao giờ tự hỏi tại sao con người có thể đi bộ một cách tự nhiên mà không cần suy nghĩ, trong khi robot humanoid lại vật lộn để giữ thăng bằng? Câu trả lời nằm ở sự phức tạp đáng kinh ngạc của bài toán bipedal locomotion — và Reinforcement Learning (RL) đang trở thành công cụ mạnh mẽ nhất để giải quyết nó.

Trong bài viết đầu tiên của series RL Humanoid Locomotion, chúng ta sẽ đặt nền móng vững chắc: hiểu tại sao humanoid locomotion khó, formulate bài toán dưới dạng MDP, và setup môi trường huấn luyện với Isaac Lab.

Tại sao Humanoid Locomotion là bài toán cực khó?

Underactuation — Ít cơ cấu chấp hành hơn bậc tự do

Robot humanoid là hệ thống underactuated — nghĩa là số actuator ít hơn số bậc tự do (DOF) cần điều khiển. Khi robot đứng trên một chân (single support phase), chân đó phải kiểm soát toàn bộ cơ thể. Nhưng tiếp xúc chân-đất là unilateral — robot chỉ có thể đẩy mặt đất, không thể kéo. Điều này tạo ra ràng buộc friction cone mà controller phải tôn trọng.

Bậc tự do cao (High DOF)

Một humanoid như Unitree G1 có 23 DOF, H1 có 19 DOF chân. So với quadruped (12 DOF) hay wheeled robot (2-4 DOF), không gian trạng thái và hành động lớn hơn nhiều lần:

Robot	DOF	Observation dim	Action dim
Wheeled robot	2	~10	2
Quadruped (Go2)	12	~48	12
Humanoid G1	23	~70+	23
Humanoid H1	19	~60+	19

Balance và ZMP

Zero Moment Point (ZMP) là khái niệm cốt lõi: robot chỉ ổn định khi ZMP nằm trong support polygon (vùng tiếp xúc chân). Với bipedal walking, support polygon thay đổi liên tục và có giai đoạn double support (2 chân) xen kẽ single support (1 chân). Trong giai đoạn single support, support polygon rất nhỏ — khoảng 25x15 cm — khiến robot dễ ngã.

Tại sao RL vượt trội hơn control truyền thống?

Phương pháp truyền thống (ZMP-based, model predictive control) yêu cầu mô hình động lực học chính xác và giải optimization online. Nhưng:

Mô hình không bao giờ hoàn hảo (friction, backlash, flexibility)
Giải optimization real-time tốn tính toán
Khó tổng quát hóa cho địa hình khác nhau

RL tiếp cận khác: học trực tiếp policy từ tương tác với simulator, không cần mô hình chính xác. Policy học cách xử lý uncertainty một cách tự nhiên thông qua domain randomization.

Nếu bạn chưa quen với RL, hãy đọc bài RL cơ bản cho Robotics trước khi tiếp tục.

MDP Formulation cho Bipedal Walking

Markov Decision Process

Bài toán locomotion được formulate thành MDP tuple $(S, A, T, R, \gamma)$:

State $S$: Trạng thái đầy đủ của robot và môi trường
Action $A$: Lệnh điều khiển gửi đến actuator
Transition $T$: Dynamics của simulator (MuJoCo/Isaac Sim)
Reward $R$: Hàm thưởng đánh giá chất lượng locomotion
Discount $\gamma$: Thường 0.99 cho locomotion tasks

Observation Space — Robot "nhìn" thấy gì?

Observation là thông tin policy nhận được mỗi timestep. Thiết kế observation space tốt là quyết định thành bại:

import torch
import numpy as np

class HumanoidObservation:
    """Observation space cho humanoid locomotion."""

    def __init__(self, num_joints=23):
        self.num_joints = num_joints

    def compute_observation(self, robot_state, command):
        """
        Tính observation vector từ trạng thái robot.

        Returns:
            obs: tensor shape (obs_dim,)
        """
        obs_components = []

        # 1. Base angular velocity (3D) - từ IMU gyroscope
        # Robot cần biết nó đang xoay thế nào
        base_ang_vel = robot_state["base_angular_velocity"]  # (3,)
        obs_components.append(base_ang_vel)

        # 2. Projected gravity (3D) - hướng trọng lực trong frame robot
        # Quan trọng nhất để giữ thăng bằng!
        projected_gravity = robot_state["projected_gravity"]  # (3,)
        obs_components.append(projected_gravity)

        # 3. Velocity command (3D) - vận tốc mong muốn [vx, vy, yaw_rate]
        velocity_command = command["velocity"]  # (3,)
        obs_components.append(velocity_command)

        # 4. Joint positions relative to default (num_joints,)
        # Sai lệch so với tư thế đứng mặc định
        joint_pos_error = (
            robot_state["joint_positions"]
            - robot_state["default_joint_positions"]
        )
        obs_components.append(joint_pos_error)

        # 5. Joint velocities (num_joints,)
        joint_vel = robot_state["joint_velocities"]
        obs_components.append(joint_vel)

        # 6. Previous actions (num_joints,) - action trước đó
        # Giúp policy smooth hơn, tránh giật
        prev_actions = robot_state["previous_actions"]
        obs_components.append(prev_actions)

        # Concatenate tất cả
        obs = np.concatenate(obs_components)
        # Tổng dim: 3 + 3 + 3 + 23 + 23 + 23 = 78 cho G1
        return obs

Tại sao projected gravity quan trọng nhất? Vì nó cho policy biết robot đang nghiêng thế nào so với phương thẳng đứng. Không có thông tin này, robot không thể giữ thăng bằng.

Action Space — Robot điều khiển thế nào?

Action space thường là joint position targets — vị trí góc mong muốn cho mỗi khớp. PD controller ở mức thấp sẽ tạo torque để tracking:

class HumanoidActionSpace:
    """Action space: joint position targets + PD control."""

    def __init__(self, num_joints=23):
        self.num_joints = num_joints
        # PD gains cho mỗi khớp
        self.kp = np.array([
            # Hip (3 DOF mỗi bên): cao hơn vì chịu tải lớn
            200, 200, 200,  # left hip
            200, 200, 200,  # right hip
            # Knee (1 DOF mỗi bên)
            250, 250,
            # Ankle (2 DOF mỗi bên)
            40, 40, 40, 40,
            # Torso
            300,
            # Shoulder (3 DOF mỗi bên)
            100, 100, 100,
            100, 100, 100,
            # Elbow
            50, 50,
            # Wrist
            20, 20,
        ])
        self.kd = self.kp * 0.05  # Kd = 5% của Kp

    def apply_action(self, action, current_pos, current_vel):
        """
        Chuyển action (position targets) thành torque.

        action: (num_joints,) — scaled position offsets
        """
        # Scale action từ [-1, 1] thành radian offset
        action_scale = 0.25  # ±0.25 rad max
        target_pos = current_pos + action * action_scale

        # PD control
        torque = self.kp * (target_pos - current_pos) - self.kd * current_vel

        # Clip torque theo giới hạn motor
        torque = np.clip(torque, -100, 100)  # Nm
        return torque

Chọn Simulator: Isaac Lab vs MuJoCo

Lựa chọn simulator ảnh hưởng trực tiếp đến tốc độ training và chất lượng sim-to-real transfer.

Tiêu chí	Isaac Lab (NVIDIA)	MuJoCo (DeepMind)
GPU parallelization	Hàng nghìn envs trên GPU	CPU-based (mới có MJX cho GPU)
Tốc độ train	Rất nhanh (~1h cho walking)	Chậm hơn 10-100x
Contact physics	Tốt, GPU-accelerated	Xuất sắc, chính xác nhất
Sim-to-real gap	Nhỏ với domain rand	Rất nhỏ
Cộng đồng	Đang phát triển nhanh	Trưởng thành, nhiều paper
Hardware yêu cầu	RTX 3090+ (CUDA)	Bất kỳ CPU/GPU
Phù hợp cho	Training nhanh, massively parallel	Research, prototyping

Khuyến nghị: Dùng Isaac Lab cho training (nhanh), MuJoCo cho debug và visualization. Nhiều paper hàng đầu dùng cả hai.

Chi tiết về Isaac Lab, xem bài Isaac Lab: GPU-Accelerated Robot Simulation.

Setup Isaac Lab Environment cho Humanoid

Cài đặt Isaac Lab

# Clone Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab

# Tạo conda environment
conda create -n isaaclab python=3.10 -y
conda activate isaaclab

# Cài Isaac Sim (yêu cầu NVIDIA GPU)
pip install isaacsim-rl isaacsim-replicator isaacsim-extscache-physics \
    isaacsim-extscache-kit-sdk isaacsim-extscache-kit \
    isaacsim-app --extra-index-url https://pypi.nvidia.com

# Cài Isaac Lab
pip install -e .

Cài đặt Humanoid-Gym

Humanoid-Gym là framework chuyên biệt cho humanoid locomotion, xây dựng trên Isaac Lab:

# Clone Humanoid-Gym
git clone https://github.com/roboterax/humanoid-gym.git
cd humanoid-gym

# Cài dependencies
pip install -e .

Load Unitree G1 trong MuJoCo

Bắt đầu với MuJoCo để hiểu robot model trước khi chuyển sang Isaac Lab:

import mujoco
import mujoco.viewer
import numpy as np
import time

def load_and_visualize_g1():
    """
    Load Unitree G1 MJCF model và visualize trong MuJoCo.
    """
    # Download G1 model từ mujoco_menagerie
    # git clone https://github.com/google-deepmind/mujoco_menagerie.git
    model_path = "mujoco_menagerie/unitree_g1/g1.xml"

    # Load model
    model = mujoco.MjModel.from_xml_path(model_path)
    data = mujoco.MjData(model)

    print(f"=== Unitree G1 Model Info ===")
    print(f"Số bodies: {model.nbody}")
    print(f"Số joints: {model.njnt}")
    print(f"Số actuators: {model.nu}")
    print(f"DOF tổng: {model.nv}")
    print(f"Timestep: {model.opt.timestep}s")

    # In danh sách joints
    print(f"\n=== Joints ===")
    for i in range(model.njnt):
        joint_name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_JOINT, i)
        joint_range = model.jnt_range[i]
        print(f"  [{i}] {joint_name}: range [{joint_range[0]:.2f}, {joint_range[1]:.2f}] rad")

    # In danh sách actuators
    print(f"\n=== Actuators ===")
    for i in range(model.nu):
        act_name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_ACTUATOR, i)
        ctrl_range = model.actuator_ctrlrange[i]
        print(f"  [{i}] {act_name}: range [{ctrl_range[0]:.1f}, {ctrl_range[1]:.1f}]")

    # Reset về vị trí đứng
    mujoco.mj_resetData(model, data)

    # Set initial height (G1 cao ~1.27m)
    data.qpos[2] = 0.75  # base height

    # Chạy simulation một vài bước để ổn định
    for _ in range(1000):
        mujoco.mj_step(model, data)

    # Mở viewer để xem robot
    print("\nMở viewer... (nhấn ESC để thoát)")
    with mujoco.viewer.launch_passive(model, data) as viewer:
        while viewer.is_running():
            mujoco.mj_step(model, data)
            viewer.sync()
            time.sleep(model.opt.timestep)

if __name__ == "__main__":
    load_and_visualize_g1()

Cấu hình Isaac Lab Environment

from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.scene import InteractiveSceneCfg
from omni.isaac.lab.assets import ArticulationCfg
import omni.isaac.lab.sim as sim_utils

class G1FlatEnvCfg(ManagerBasedRLEnvCfg):
    """Cấu hình environment cho G1 walking trên mặt phẳng."""

    # Simulation
    sim = sim_utils.SimulationCfg(
        dt=0.005,                    # 200Hz physics
        render_interval=4,           # 50Hz rendering
        gravity=(0.0, 0.0, -9.81),
        physics_material=sim_utils.RigidBodyMaterialCfg(
            friction_combine_mode="multiply",
            restitution_combine_mode="multiply",
            static_friction=1.0,
            dynamic_friction=1.0,
        ),
    )

    # Scene
    scene = InteractiveSceneCfg(
        num_envs=4096,              # Số environments song song
        env_spacing=2.5,
    )

    # Robot
    robot = ArticulationCfg(
        prim_path="/World/envs/env_.*/Robot",
        spawn=sim_utils.UsdFileCfg(
            usd_path="datasets/robots/unitree/g1/g1.usd",
        ),
        init_state=ArticulationCfg.InitialStateCfg(
            pos=(0.0, 0.0, 0.75),    # Chiều cao ban đầu
            joint_pos={
                # Default standing pose
                ".*hip_pitch.*": -0.1,
                ".*knee.*": 0.3,
                ".*ankle_pitch.*": -0.2,
            },
        ),
        actuators={
            "legs": sim_utils.DCMotorCfg(
                joint_names_expr=[".*hip.*", ".*knee.*", ".*ankle.*"],
                stiffness=200.0,
                damping=10.0,
                effort_limit=100.0,
            ),
            "arms": sim_utils.DCMotorCfg(
                joint_names_expr=[".*shoulder.*", ".*elbow.*", ".*wrist.*"],
                stiffness=80.0,
                damping=4.0,
                effort_limit=50.0,
            ),
        },
    )

    # Decimation: policy chạy ở 50Hz (200Hz / 4)
    decimation = 4

    # Episode length
    episode_length_s = 20.0

Observation Normalization — Chi tiết quan trọng

Một kỹ thuật không thể bỏ qua là normalize observations. Nếu joint velocity có range [-10, 10] rad/s mà gravity vector có range [-1, 1], policy sẽ bias về features có magnitude lớn:

class ObservationNormalizer:
    """Running mean/std normalization cho observations."""

    def __init__(self, obs_dim, clip_range=5.0):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = 1e-4
        self.clip_range = clip_range

    def update(self, obs_batch):
        """Cập nhật running statistics."""
        batch_mean = np.mean(obs_batch, axis=0)
        batch_var = np.var(obs_batch, axis=0)
        batch_count = obs_batch.shape[0]

        delta = batch_mean - self.mean
        total_count = self.count + batch_count

        self.mean = self.mean + delta * batch_count / total_count
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + np.square(delta) * self.count * batch_count / total_count
        self.var = m2 / total_count
        self.count = total_count

    def normalize(self, obs):
        """Normalize observation."""
        normalized = (obs - self.mean) / np.sqrt(self.var + 1e-8)
        return np.clip(normalized, -self.clip_range, self.clip_range)

Tổng kết và tiếp theo

Trong bài này, chúng ta đã:

Hiểu tại sao humanoid locomotion khó: underactuation, high DOF, dynamic balance
Formulate MDP: observation space (~78 dims cho G1), action space (joint position targets), PD control
So sánh simulators: Isaac Lab (nhanh, GPU) vs MuJoCo (chính xác, dễ debug)
Setup environment: cài Isaac Lab, load G1 model, cấu hình training environment
Observation normalization: kỹ thuật quan trọng cho training ổn định

Bài tiếp theo — Reward Engineering cho Bipedal Walking — sẽ đi sâu vào nghệ thuật thiết kế reward function, yếu tố quyết định policy có học được dáng đi tự nhiên hay không.

Để hiểu thêm về simulation nền tảng, xem bài Simulation cho Robotics: Tổng quan và Isaac Lab chi tiết.

Tài liệu tham khảo

Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim-to-Real Transfer — Gu et al., 2024
Isaac Lab: A Unified Framework for Robot Learning — Mittal et al., 2025
Learning Humanoid Locomotion with Transformers — Radosavovic et al., 2024
MuJoCo: A physics engine for model-based control — DeepMind