FAME: RL Thích Ứng Lực cho Humanoid Manipulation

Vấn đề cốt lõi: Manipulation phá vỡ Balance

Hầu hết các nghiên cứu về humanoid locomotion đều giả định một điều: phần thân trên của robot không làm gì cả. Tay buông thõng, trọng tâm ổn định, và policy chỉ cần lo giữ thăng bằng cho đôi chân. Nhưng khi humanoid bắt đầu manipulation — nâng vật nặng, đẩy đồ vật, hoặc thao tác hai tay — mọi thứ thay đổi hoàn toàn.

Lực từ phần thân trên truyền xuống chân, tạo ra moment lực không dự đoán được. Một humanoid đang đứng vững có thể ngã ngay khi nâng một vật 5kg bằng hai tay vì trọng tâm dịch chuyển và lực phản tác dụng thay đổi liên tục.

Paper FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion (Pudasaini et al., CU Boulder, March 2026) giải quyết bài toán này bằng cách condition standing policy trên latent context của cấu hình thân trên và lực bimanual — không cần force/torque sensor trên cổ tay.

Humanoid robot thực hiện manipulation task với hai tay

Tại sao các phương pháp hiện tại thất bại?

Base Policy: Không biết lực nào đang tác động

Các locomotion policy tiêu chuẩn (như của Unitree SDK) chỉ nhận observation từ IMU và joint encoders. Khi phần thân trên di chuyển hoặc chịu lực, policy không có thông tin gì về sự thay đổi này — dẫn đến phản ứng chậm hoặc sai.

Kết quả trong paper: base policy chỉ đạt 29.44% success rate khi robot phải đứng trong điều kiện có lực ngoại lai.

Curriculum-Only: Cải thiện nhưng chưa đủ

Approach phổ biến hơn là dùng curriculum learning — tăng dần độ khó của perturbation trong quá trình training. Nhưng curriculum chỉ giúp robot robust hơn một cách "mù quáng" — nó không hiểu lực nào đang tác động.

Kết quả: curriculum-only đạt 51.40% — tốt hơn base nhưng vẫn fail gần nửa số trường hợp.

FAME: Hiểu lực → thích ứng

FAME đạt 73.84% success rate bằng cách cho policy biết context về lực và cấu hình thân trên thông qua một latent encoder. Sự khác biệt không phải ở việc train nhiều hơn, mà ở việc cung cấp đúng thông tin cho policy.

Kiến trúc FAME chi tiết

1. Force Sampling Strategy: Phân bố cầu (Spherical Distribution)

Điểm sáng tạo đầu tiên: thay vì random forces theo uniform distribution, FAME sample lực theo phân bố cầu 3D trên mỗi bàn tay. Điều này đảm bảo policy thấy lực từ mọi hướng với xác suất đều nhau.

import numpy as np
import torch

def sample_spherical_forces(
    num_envs: int,
    force_range: tuple[float, float] = (0.0, 50.0),
    device: str = "cuda"
) -> torch.Tensor:
    """
    Sample 3D forces theo phân bố cầu cho mỗi tay.
    Returns: (num_envs, 6) — 3D force cho tay trái + tay phải
    """
    forces = torch.zeros(num_envs, 6, device=device)

    for hand_idx in range(2):
        # Sample direction đều trên mặt cầu (uniform spherical)
        # Marsaglia method: tránh bias ở poles
        phi = torch.rand(num_envs, device=device) * 2 * np.pi
        cos_theta = 2 * torch.rand(num_envs, device=device) - 1
        sin_theta = torch.sqrt(1 - cos_theta**2)

        # Direction vector (unit sphere)
        dx = sin_theta * torch.cos(phi)
        dy = sin_theta * torch.sin(phi)
        dz = cos_theta

        # Sample magnitude từ uniform range
        magnitude = (
            torch.rand(num_envs, device=device)
            * (force_range[1] - force_range[0])
            + force_range[0]
        )

        # Combine direction * magnitude
        offset = hand_idx * 3
        forces[:, offset + 0] = dx * magnitude
        forces[:, offset + 1] = dy * magnitude
        forces[:, offset + 2] = dz * magnitude

    return forces

Tại sao spherical sampling quan trọng? Nếu dùng uniform sampling trong Cartesian space, phần lớn samples sẽ tập trung ở góc của cube, tạo bias. Spherical sampling đảm bảo mỗi hướng trong không gian 3D được đại diện đều.

2. Upper-Body Pose Curriculum

FAME không chỉ randomize lực — nó còn thay đổi cấu hình thân trên (vị trí tay, vai, khuỷu tay) trong quá trình training. Curriculum tăng dần biên độ di chuyển:

import torch.nn as nn

class UpperBodyCurriculum:
    """
    Curriculum cho upper-body pose randomization.
    Bắt đầu từ pose mặc định, tăng dần biên độ di chuyển.
    """

    # Joints thân trên của H1-2: vai (3), khuỷu (1), cổ tay (1) × 2 tay = 10
    UPPER_BODY_JOINTS = [
        "left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw",
        "left_elbow", "left_wrist",
        "right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw",
        "right_elbow", "right_wrist",
    ]

    # Joint limits (rad) cho H1-2
    JOINT_RANGES = {
        "shoulder_pitch": (-2.87, 2.87),
        "shoulder_roll": (-1.34, 2.53),
        "shoulder_yaw": (-2.87, 2.87),
        "elbow": (-1.92, 2.09),
        "wrist": (-1.57, 1.57),
    }

    def __init__(
        self,
        num_envs: int,
        num_upper_joints: int = 10,
        max_curriculum_level: int = 10,
        device: str = "cuda",
    ):
        self.num_envs = num_envs
        self.num_upper_joints = num_upper_joints
        self.max_level = max_curriculum_level
        self.device = device
        self.current_level = 0  # 0 = default pose only, 10 = full range

    def get_scale(self) -> float:
        """Tỷ lệ biên độ dựa trên curriculum level."""
        return self.current_level / self.max_level

    def sample_poses(self, default_pose: torch.Tensor) -> torch.Tensor:
        """
        Sample upper-body poses quanh default pose.
        default_pose: (num_upper_joints,) — pose mặc định
        Returns: (num_envs, num_upper_joints)
        """
        scale = self.get_scale()
        noise = torch.randn(
            self.num_envs, self.num_upper_joints, device=self.device
        ) * scale * 0.5  # 0.5 rad max deviation at full scale

        poses = default_pose.unsqueeze(0) + noise
        return poses

    def update(self, mean_reward: float, threshold: float = 0.7):
        """
        Tăng curriculum nếu policy đã stable (reward > threshold).
        """
        if mean_reward > threshold and self.current_level < self.max_level:
            self.current_level += 1
            print(
                f"[Curriculum] Level up: {self.current_level}/{self.max_level}"
            )

3. Latent Context Encoder — Trái tim của FAME

Đây là component quan trọng nhất. Encoder nhận joint positions thân trên và lực ước lượng, encode thành latent vector mà policy sử dụng để điều chỉnh hành vi:

import torch
import torch.nn as nn

class ForceContextEncoder(nn.Module):
    """
    Latent context encoder cho FAME.
    Input: upper-body joint config + estimated bimanual forces
    Output: latent context vector z
    """

    def __init__(
        self,
        num_upper_joints: int = 10,  # 5 joints × 2 tay
        force_dim: int = 6,          # 3D force × 2 tay
        hidden_dim: int = 128,
        latent_dim: int = 32,
    ):
        super().__init__()
        input_dim = num_upper_joints + force_dim  # 16

        # MLP encoder: joint config + forces → latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

        # Temporal smoothing: dùng GRU để ổn định latent context
        self.gru = nn.GRU(
            input_size=latent_dim,
            hidden_size=latent_dim,
            batch_first=True,
        )

        self.latent_dim = latent_dim

    def forward(
        self,
        upper_body_joints: torch.Tensor,
        estimated_forces: torch.Tensor,
        hidden_state: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            upper_body_joints: (batch, num_upper_joints)
            estimated_forces: (batch, 6) — 3D force per hand
            hidden_state: (1, batch, latent_dim) — GRU hidden

        Returns:
            latent_z: (batch, latent_dim)
            new_hidden: (1, batch, latent_dim)
        """
        # Concatenate inputs
        x = torch.cat([upper_body_joints, estimated_forces], dim=-1)

        # Encode to latent
        z = self.encoder(x)

        # Temporal smoothing qua GRU
        z_seq = z.unsqueeze(1)  # (batch, 1, latent_dim)
        if hidden_state is None:
            hidden_state = torch.zeros(
                1, x.shape[0], self.latent_dim, device=x.device
            )
        z_smooth, new_hidden = self.gru(z_seq, hidden_state)
        z_smooth = z_smooth.squeeze(1)  # (batch, latent_dim)

        return z_smooth, new_hidden

Kiến trúc mạng neural cho robot control với reinforcement learning

4. Force Estimation từ Robot Dynamics (Không cần F/T Sensor)

Đây là insight thiết thực nhất của paper: ước lượng lực từ dynamics thay vì đo trực tiếp bằng sensor. Force/torque sensor ở cổ tay humanoid rất đắt và dễ hỏng. FAME dùng phương trình động lực học:

$$\tau_{ext} = \tau_{measured} - M(q)\ddot{q} - C(q, \dot{q})\dot{q} - g(q)$$

import torch

class ForceEstimator:
    """
    Ước lượng external forces từ robot dynamics.
    Không cần F/T sensor — chỉ cần joint torque readings và model dynamics.
    """

    def __init__(self, robot_model, dt: float = 0.02):
        self.robot = robot_model
        self.dt = dt
        self.prev_qvel = None

    def estimate(
        self,
        joint_pos: torch.Tensor,
        joint_vel: torch.Tensor,
        joint_torque: torch.Tensor,
    ) -> torch.Tensor:
        """
        Ước lượng external forces/torques trên end-effectors.

        Args:
            joint_pos: (batch, num_joints) — vị trí khớp
            joint_vel: (batch, num_joints) — vận tốc khớp
            joint_torque: (batch, num_joints) — torque đo được

        Returns:
            estimated_forces: (batch, 6) — 3D force per hand
        """
        # Tính gia tốc từ finite difference
        if self.prev_qvel is None:
            self.prev_qvel = torch.zeros_like(joint_vel)
        joint_acc = (joint_vel - self.prev_qvel) / self.dt
        self.prev_qvel = joint_vel.clone()

        # Inverse dynamics: tính torque cần thiết cho motion hiện tại
        # tau_id = M(q) * qddot + C(q, qdot) * qdot + g(q)
        mass_matrix = self.robot.compute_mass_matrix(joint_pos)
        coriolis = self.robot.compute_coriolis(joint_pos, joint_vel)
        gravity = self.robot.compute_gravity(joint_pos)

        tau_expected = (
            torch.bmm(mass_matrix, joint_acc.unsqueeze(-1)).squeeze(-1)
            + coriolis
            + gravity
        )

        # External torque = measured - expected
        tau_external = joint_torque - tau_expected

        # Map joint torques → Cartesian forces qua Jacobian transpose
        jacobian_left = self.robot.compute_jacobian(
            joint_pos, end_effector="left_hand"
        )
        jacobian_right = self.robot.compute_jacobian(
            joint_pos, end_effector="right_hand"
        )

        # F = (J^T)^{-1} * tau_ext (least squares solution)
        force_left = torch.linalg.lstsq(
            jacobian_left.transpose(-1, -2), tau_external[:, :5].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        force_right = torch.linalg.lstsq(
            jacobian_right.transpose(-1, -2), tau_external[:, 5:10].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        return torch.cat([force_left, force_right], dim=-1)

5. PPO Training với Force-Conditioned Observations

Kết hợp tất cả lại — FAME policy nhận observation gốc cộng thêm latent context:

import torch
import torch.nn as nn
from torch.distributions import Normal

class FAMEPolicy(nn.Module):
    """
    FAME Standing Policy — PPO actor-critic conditioned on force context.
    """

    def __init__(
        self,
        base_obs_dim: int = 45,    # proprio: joint pos/vel + IMU + projected gravity
        latent_dim: int = 32,      # từ ForceContextEncoder
        action_dim: int = 10,      # lower-body joint targets (5 per leg)
        hidden_dim: int = 256,
    ):
        super().__init__()
        total_obs = base_obs_dim + latent_dim  # 77

        # Actor (policy)
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

        # Critic (value function)
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        base_obs: torch.Tensor,
        latent_z: torch.Tensor,
    ) -> tuple[Normal, torch.Tensor]:
        """
        Args:
            base_obs: (batch, base_obs_dim) — proprio observations
            latent_z: (batch, latent_dim) — force context

        Returns:
            action_dist: Normal distribution over actions
            value: (batch, 1) — estimated value
        """
        obs = torch.cat([base_obs, latent_z], dim=-1)

        # Actor output
        mean = self.actor(obs)
        std = torch.exp(self.log_std.clamp(-5, 2))
        action_dist = Normal(mean, std)

        # Critic output
        value = self.critic(obs)

        return action_dist, value

# ===== Reward Function =====

def compute_fame_reward(
    base_pos: torch.Tensor,
    base_quat: torch.Tensor,
    joint_vel: torch.Tensor,
    joint_torque: torch.Tensor,
    target_pos: torch.Tensor,
    target_quat: torch.Tensor,
    dt: float = 0.02,
) -> dict[str, torch.Tensor]:
    """
    FAME reward function cho standing task.
    Cân bằng giữa stability, energy efficiency, và tracking.
    """
    batch = base_pos.shape[0]

    # 1. Position tracking (giữ vị trí đứng)
    pos_error = torch.norm(base_pos[:, :2] - target_pos[:, :2], dim=-1)
    r_pos = torch.exp(-5.0 * pos_error)  # ∈ [0, 1]

    # 2. Orientation tracking (giữ thẳng)
    # Tính góc giữa quat hiện tại và target
    quat_diff = _quat_diff(base_quat, target_quat)
    angle_error = 2.0 * torch.acos(torch.clamp(quat_diff[:, 0].abs(), 0, 1))
    r_orient = torch.exp(-3.0 * angle_error)

    # 3. Height tracking (không ngồi xuống)
    height_error = torch.abs(base_pos[:, 2] - target_pos[:, 2])
    r_height = torch.exp(-10.0 * height_error)

    # 4. Energy penalty (torque nhỏ = tiết kiệm năng lượng)
    torque_penalty = torch.sum(joint_torque**2, dim=-1) * 1e-5
    r_energy = -torque_penalty

    # 5. Smoothness penalty (joint velocity nhỏ = chuyển động mượt)
    vel_penalty = torch.sum(joint_vel**2, dim=-1) * 1e-4
    r_smooth = -vel_penalty

    # 6. Alive bonus (không ngã = +1 mỗi step)
    alive = (base_pos[:, 2] > 0.5).float()  # z > 0.5m = đang đứng
    r_alive = alive * 1.0

    # Tổng hợp reward
    reward = (
        0.3 * r_pos
        + 0.2 * r_orient
        + 0.2 * r_height
        + 0.1 * r_energy
        + 0.1 * r_smooth
        + 0.1 * r_alive
    )

    return {
        "total": reward,
        "pos": r_pos,
        "orient": r_orient,
        "height": r_height,
        "energy": r_energy,
        "smooth": r_smooth,
        "alive": r_alive,
    }

def _quat_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Quaternion difference: q_diff = q1* × q2"""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1*w2 + x1*x2 + y1*y2 + z1*z2,
        -w1*x2 + x1*w2 - y1*z2 + z1*y2,
        -w1*y2 + x1*z2 + y1*w2 - z1*x2,
        -w1*z2 - x1*y2 + y1*x2 + z1*w2,
    ], dim=-1)

6. Full Training Config

# ===== Training Configuration =====

FAME_CONFIG = {
    # Environment
    "env": {
        "num_envs": 4096,
        "episode_length": 1000,     # 20 seconds at 50Hz
        "dt": 0.02,                 # 50Hz control
        "sim_dt": 0.005,            # 200Hz physics
        "robot": "unitree_h1_2",
        "terrain": "flat",
    },

    # Force sampling
    "force": {
        "min_magnitude": 0.0,       # N
        "max_magnitude": 50.0,      # N — tăng dần qua curriculum
        "sample_interval": 100,     # Resample mỗi 100 steps (2s)
        "distribution": "spherical",
    },

    # Upper-body curriculum
    "curriculum": {
        "max_level": 10,
        "reward_threshold": 0.7,    # Level up khi mean_reward > 0.7
        "initial_force_max": 5.0,   # Bắt đầu với lực nhỏ
        "final_force_max": 50.0,    # Kết thúc với lực lớn
    },

    # Context encoder
    "encoder": {
        "num_upper_joints": 10,
        "force_dim": 6,
        "hidden_dim": 128,
        "latent_dim": 32,
    },

    # PPO hyperparameters
    "ppo": {
        "learning_rate": 3e-4,
        "num_epochs": 5,
        "num_minibatches": 4,
        "gamma": 0.99,
        "lam": 0.95,              # GAE lambda
        "clip_range": 0.2,
        "entropy_coef": 0.01,
        "value_coef": 0.5,
        "max_grad_norm": 1.0,
        "target_kl": 0.01,
    },

    # Training
    "training": {
        "total_timesteps": 100_000_000,  # 100M steps
        "log_interval": 10,
        "save_interval": 500,
        "eval_interval": 100,
    },

    # Domain randomization
    "domain_rand": {
        "friction_range": [0.3, 1.5],
        "mass_offset_range": [-1.0, 1.0],   # kg
        "com_offset_range": [-0.05, 0.05],   # m
        "kp_factor_range": [0.8, 1.2],
        "kd_factor_range": [0.8, 1.2],
        "push_interval": 200,               # Random push mỗi 4s
        "push_magnitude": [0, 30],           # N
    },
}

7. PPO Training Loop

import torch
import torch.optim as optim
from collections import defaultdict

def train_fame(config: dict):
    """
    Full training loop cho FAME policy.
    Dùng Isaac Lab/Gym cho simulation.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    num_envs = config["env"]["num_envs"]

    # ===== Initialize components =====
    # 1. Environment (Isaac Lab)
    env = create_humanoid_env(config["env"])  # Isaac Lab vectorized env

    # 2. Policy + Context Encoder
    encoder = ForceContextEncoder(**config["encoder"]).to(device)
    policy = FAMEPolicy(
        base_obs_dim=env.obs_dim,
        latent_dim=config["encoder"]["latent_dim"],
        action_dim=env.action_dim,
    ).to(device)

    # 3. Force estimator + Curriculum
    force_estimator = ForceEstimator(env.robot_model, dt=config["env"]["dt"])
    curriculum = UpperBodyCurriculum(
        num_envs=num_envs,
        max_curriculum_level=config["curriculum"]["max_level"],
        device=device,
    )

    # 4. Optimizer (single optimizer cho cả encoder + policy)
    all_params = list(encoder.parameters()) + list(policy.parameters())
    optimizer = optim.Adam(all_params, lr=config["ppo"]["learning_rate"])

    # ===== Training Loop =====
    total_steps = 0
    episode_length = config["env"]["episode_length"]
    ppo_cfg = config["ppo"]

    obs = env.reset()
    hidden_state = None  # GRU hidden state cho encoder

    while total_steps < config["training"]["total_timesteps"]:
        # ----- Collect rollout -----
        rollout = defaultdict(list)

        for step in range(episode_length):
            with torch.no_grad():
                # Sample forces (resample mỗi N steps)
                if step % config["force"]["sample_interval"] == 0:
                    ext_forces = sample_spherical_forces(
                        num_envs,
                        force_range=(0, curriculum.get_scale() * config["force"]["max_magnitude"]),
                        device=device,
                    )
                    # Apply forces vào simulation
                    env.apply_external_forces(ext_forces)

                    # Sample upper-body poses
                    upper_poses = curriculum.sample_poses(env.default_upper_pose)
                    env.set_upper_body_targets(upper_poses)

                # Estimate forces từ dynamics (không cần F/T sensor)
                estimated_forces = force_estimator.estimate(
                    env.joint_pos, env.joint_vel, env.joint_torque
                )

                # Encode context
                latent_z, hidden_state = encoder(
                    env.upper_body_joints, estimated_forces, hidden_state
                )

                # Get action
                action_dist, value = policy(obs, latent_z)
                action = action_dist.sample()
                log_prob = action_dist.log_prob(action).sum(-1)

            # Step environment
            next_obs, reward_dict, done, info = env.step(action)
            reward = reward_dict["total"]

            # Store transition
            rollout["obs"].append(obs)
            rollout["latent_z"].append(latent_z)
            rollout["actions"].append(action)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value.squeeze(-1))
            rollout["rewards"].append(reward)
            rollout["dones"].append(done.float())

            obs = next_obs
            total_steps += num_envs

        # ----- Compute GAE advantages -----
        with torch.no_grad():
            last_latent, hidden_state = encoder(
                env.upper_body_joints, estimated_forces, hidden_state
            )
            _, last_value = policy(obs, last_latent)
            last_value = last_value.squeeze(-1)

        advantages = compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            last_value,
            gamma=ppo_cfg["gamma"],
            lam=ppo_cfg["lam"],
        )
        returns = [adv + val for adv, val in zip(advantages, rollout["values"])]

        # Stack rollout
        all_obs = torch.stack(rollout["obs"])
        all_z = torch.stack(rollout["latent_z"])
        all_actions = torch.stack(rollout["actions"])
        all_log_probs = torch.stack(rollout["log_probs"])
        all_returns = torch.stack(returns)
        all_advantages = torch.stack(advantages)

        # Normalize advantages
        all_advantages = (all_advantages - all_advantages.mean()) / (
            all_advantages.std() + 1e-8
        )

        # ----- PPO Update -----
        for epoch in range(ppo_cfg["num_epochs"]):
            # Mini-batch indices
            batch_size = episode_length * num_envs
            minibatch_size = batch_size // ppo_cfg["num_minibatches"]
            indices = torch.randperm(batch_size, device=device)

            for start in range(0, batch_size, minibatch_size):
                mb_idx = indices[start : start + minibatch_size]

                # Flatten temporal dimension
                flat_obs = all_obs.view(-1, all_obs.shape[-1])[mb_idx]
                flat_z = all_z.view(-1, all_z.shape[-1])[mb_idx]
                flat_actions = all_actions.view(-1, all_actions.shape[-1])[mb_idx]
                flat_old_lp = all_log_probs.view(-1)[mb_idx]
                flat_returns = all_returns.view(-1)[mb_idx]
                flat_adv = all_advantages.view(-1)[mb_idx]

                # Forward pass
                dist, values = policy(flat_obs, flat_z)
                new_log_probs = dist.log_prob(flat_actions).sum(-1)
                entropy = dist.entropy().sum(-1).mean()

                # PPO clipped objective
                ratio = torch.exp(new_log_probs - flat_old_lp)
                surr1 = ratio * flat_adv
                surr2 = torch.clamp(
                    ratio,
                    1.0 - ppo_cfg["clip_range"],
                    1.0 + ppo_cfg["clip_range"],
                ) * flat_adv
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                value_loss = 0.5 * (flat_returns - values.squeeze(-1)).pow(2).mean()

                # Total loss
                loss = (
                    policy_loss
                    + ppo_cfg["value_coef"] * value_loss
                    - ppo_cfg["entropy_coef"] * entropy
                )

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(all_params, ppo_cfg["max_grad_norm"])
                optimizer.step()

        # ----- Curriculum update -----
        mean_reward = torch.stack(rollout["rewards"]).mean().item()
        curriculum.update(mean_reward, threshold=ppo_cfg.get("target_kl", 0.7))

        # ----- Logging -----
        if total_steps % (config["training"]["log_interval"] * num_envs * episode_length) == 0:
            print(
                f"Steps: {total_steps:>10d} | "
                f"Reward: {mean_reward:.3f} | "
                f"Curriculum: {curriculum.current_level}/{curriculum.max_level} | "
                f"Policy Loss: {policy_loss.item():.4f}"
            )

        # ----- Save checkpoint -----
        if total_steps % (config["training"]["save_interval"] * num_envs * episode_length) == 0:
            torch.save({
                "encoder": encoder.state_dict(),
                "policy": policy.state_dict(),
                "optimizer": optimizer.state_dict(),
                "total_steps": total_steps,
                "curriculum_level": curriculum.current_level,
            }, f"checkpoints/fame_step_{total_steps}.pt")

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation."""
    advantages = []
    gae = torch.zeros_like(rewards[0])
    next_value = last_value

    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
        next_value = values[t]

    return advantages

8. Sim-to-Real Transfer cho H1-2 / G1

Quy trình deploy lên robot thật:

class FAMERealDeployment:
    """
    Deploy FAME policy lên Unitree H1-2 / G1.
    Quy trình: Load checkpoint → Run inference at 50Hz → Send joint targets.
    """

    def __init__(
        self,
        checkpoint_path: str,
        robot_type: str = "h1_2",  # "h1_2" hoặc "g1"
        control_freq: float = 50.0,
    ):
        self.device = torch.device("cpu")  # Inference trên CPU cho latency thấp
        self.dt = 1.0 / control_freq

        # Load trained models
        ckpt = torch.load(checkpoint_path, map_location=self.device)
        self.encoder = ForceContextEncoder(
            num_upper_joints=10, latent_dim=32
        ).to(self.device)
        self.encoder.load_state_dict(ckpt["encoder"])
        self.encoder.eval()

        self.policy = FAMEPolicy(
            base_obs_dim=45, latent_dim=32, action_dim=10
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.force_estimator = ForceEstimator(
            robot_model=load_robot_model(robot_type), dt=self.dt
        )
        self.hidden_state = None

        # Action scaling cho hardware
        self.action_scale = torch.tensor([
            0.25, 0.25, 0.25, 0.25, 0.25,  # left leg
            0.25, 0.25, 0.25, 0.25, 0.25,  # right leg
        ])

    @torch.no_grad()
    def step(self, robot_state: dict) -> torch.Tensor:
        """
        Một step inference. Gọi ở 50Hz.

        Args:
            robot_state: dict với keys:
                - joint_pos: (num_joints,)
                - joint_vel: (num_joints,)
                - joint_torque: (num_joints,)
                - imu_quat: (4,)
                - imu_angular_vel: (3,)
                - projected_gravity: (3,)
        """
        # Build observation vector
        obs = self._build_observation(robot_state)

        # Estimate forces
        estimated_forces = self.force_estimator.estimate(
            robot_state["joint_pos"].unsqueeze(0),
            robot_state["joint_vel"].unsqueeze(0),
            robot_state["joint_torque"].unsqueeze(0),
        )

        # Encode context
        upper_joints = robot_state["joint_pos"][:10].unsqueeze(0)
        latent_z, self.hidden_state = self.encoder(
            upper_joints, estimated_forces, self.hidden_state
        )

        # Get action (deterministic — dùng mean, không sample)
        action_dist, _ = self.policy(obs.unsqueeze(0), latent_z)
        action = action_dist.mean.squeeze(0)

        # Scale và clamp
        action = action * self.action_scale
        action = action.clamp(-1.0, 1.0)

        return action

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation vector từ robot state."""
        return torch.cat([
            state["joint_pos"][:10],       # lower-body joints (10D)
            state["joint_vel"][:10],       # lower-body velocities (10D)
            state["joint_pos"][10:20],     # upper-body joints (10D)
            state["imu_quat"],             # orientation (4D)
            state["imu_angular_vel"],      # angular velocity (3D)
            state["projected_gravity"],    # gravity in body frame (3D)
            # Thêm command velocity nếu cần (5D)
            torch.zeros(5),
        ])

Kết quả và Phân tích

Method	Standing Success (%)	Recovery Time (s)
Base Policy	29.44	N/A
Curriculum Only	51.40	2.8
FAME (full)	73.84	1.2

Điều đáng chú ý là FAME không chỉ đứng vững hơn mà còn phục hồi nhanh hơn — 1.2 giây so với 2.8 giây. Policy "hiểu" lực nào đang tác động nên có thể phản ứng chủ động thay vì chờ mất thăng bằng rồi mới sửa.

Robot humanoid trong môi trường testing thực tế

Bài học rút ra cho người làm Robotics RL

1. Observation design quan trọng hơn algorithm

FAME dùng PPO — algorithm cơ bản nhất. Sự khác biệt nằm ở observation space: thêm latent context về lực và cấu hình thân trên. Đây là bài học áp dụng rộng: trước khi thử algorithm phức tạp hơn, hãy cải thiện observation trước.

2. Sensor-free estimation có thể đủ tốt

Force/torque sensor 6-axis cho humanoid rất đắt (hàng nghìn USD mỗi cái). FAME chứng minh rằng ước lượng từ dynamics model + joint torque readings cho kết quả đủ tốt cho policy training. Điều này giảm cost đáng kể khi deploy.

3. Curriculum + conditioning > curriculum alone

Curriculum learning giúp robot robust hơn, nhưng kết hợp curriculum với explicit conditioning (latent context) cho kết quả tốt hơn nhiều (73.84% vs 51.40%). Policy cần biết điều kiện hiện tại, không chỉ "quen" với nhiều điều kiện.

Liên kết với các nghiên cứu khác

FAME nằm trong xu hướng lớn hơn của loco-manipulation — kết hợp locomotion và manipulation cho humanoid. Nếu bạn quan tâm, hãy đọc thêm:

RL cho Robotics: PPO, SAC và cách chọn algorithm — nền tảng RL cần thiết
Humanoid Locomotion: Từ MPC đến RL — so sánh control approaches
Loco-Manipulation cho Humanoid Robot — overview lĩnh vực
Sim-to-Real RL cho Dexterous Manipulation trên Humanoid — paper liên quan về manipulation

Tài liệu tham khảo

FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion — Pudasaini, N., Zhang, Y., Lavering, J., Roncone, A., Correll, N., CU Boulder, arXiv 2603.08961, March 2026
Proximal Policy Optimization Algorithms — Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., OpenAI, 2017
Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M., Science Robotics, 2019

Vấn đề cốt lõi: Manipulation phá vỡ Balance

Humanoid robot thực hiện manipulation task với hai tay

Tại sao các phương pháp hiện tại thất bại?

Base Policy: Không biết lực nào đang tác động

Kết quả trong paper: base policy chỉ đạt 29.44% success rate khi robot phải đứng trong điều kiện có lực ngoại lai.

Curriculum-Only: Cải thiện nhưng chưa đủ

Kết quả: curriculum-only đạt 51.40% — tốt hơn base nhưng vẫn fail gần nửa số trường hợp.

FAME: Hiểu lực → thích ứng

Kiến trúc FAME chi tiết

1. Force Sampling Strategy: Phân bố cầu (Spherical Distribution)

import numpy as np
import torch

def sample_spherical_forces(
    num_envs: int,
    force_range: tuple[float, float] = (0.0, 50.0),
    device: str = "cuda"
) -> torch.Tensor:
    """
    Sample 3D forces theo phân bố cầu cho mỗi tay.
    Returns: (num_envs, 6) — 3D force cho tay trái + tay phải
    """
    forces = torch.zeros(num_envs, 6, device=device)

    for hand_idx in range(2):
        # Sample direction đều trên mặt cầu (uniform spherical)
        # Marsaglia method: tránh bias ở poles
        phi = torch.rand(num_envs, device=device) * 2 * np.pi
        cos_theta = 2 * torch.rand(num_envs, device=device) - 1
        sin_theta = torch.sqrt(1 - cos_theta**2)

        # Direction vector (unit sphere)
        dx = sin_theta * torch.cos(phi)
        dy = sin_theta * torch.sin(phi)
        dz = cos_theta

        # Sample magnitude từ uniform range
        magnitude = (
            torch.rand(num_envs, device=device)
            * (force_range[1] - force_range[0])
            + force_range[0]
        )

        # Combine direction * magnitude
        offset = hand_idx * 3
        forces[:, offset + 0] = dx * magnitude
        forces[:, offset + 1] = dy * magnitude
        forces[:, offset + 2] = dz * magnitude

    return forces

2. Upper-Body Pose Curriculum

FAME không chỉ randomize lực — nó còn thay đổi cấu hình thân trên (vị trí tay, vai, khuỷu tay) trong quá trình training. Curriculum tăng dần biên độ di chuyển:

import torch.nn as nn

class UpperBodyCurriculum:
    """
    Curriculum cho upper-body pose randomization.
    Bắt đầu từ pose mặc định, tăng dần biên độ di chuyển.
    """

    # Joints thân trên của H1-2: vai (3), khuỷu (1), cổ tay (1) × 2 tay = 10
    UPPER_BODY_JOINTS = [
        "left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw",
        "left_elbow", "left_wrist",
        "right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw",
        "right_elbow", "right_wrist",
    ]

    # Joint limits (rad) cho H1-2
    JOINT_RANGES = {
        "shoulder_pitch": (-2.87, 2.87),
        "shoulder_roll": (-1.34, 2.53),
        "shoulder_yaw": (-2.87, 2.87),
        "elbow": (-1.92, 2.09),
        "wrist": (-1.57, 1.57),
    }

    def __init__(
        self,
        num_envs: int,
        num_upper_joints: int = 10,
        max_curriculum_level: int = 10,
        device: str = "cuda",
    ):
        self.num_envs = num_envs
        self.num_upper_joints = num_upper_joints
        self.max_level = max_curriculum_level
        self.device = device
        self.current_level = 0  # 0 = default pose only, 10 = full range

    def get_scale(self) -> float:
        """Tỷ lệ biên độ dựa trên curriculum level."""
        return self.current_level / self.max_level

    def sample_poses(self, default_pose: torch.Tensor) -> torch.Tensor:
        """
        Sample upper-body poses quanh default pose.
        default_pose: (num_upper_joints,) — pose mặc định
        Returns: (num_envs, num_upper_joints)
        """
        scale = self.get_scale()
        noise = torch.randn(
            self.num_envs, self.num_upper_joints, device=self.device
        ) * scale * 0.5  # 0.5 rad max deviation at full scale

        poses = default_pose.unsqueeze(0) + noise
        return poses

    def update(self, mean_reward: float, threshold: float = 0.7):
        """
        Tăng curriculum nếu policy đã stable (reward > threshold).
        """
        if mean_reward > threshold and self.current_level < self.max_level:
            self.current_level += 1
            print(
                f"[Curriculum] Level up: {self.current_level}/{self.max_level}"
            )

3. Latent Context Encoder — Trái tim của FAME

Đây là component quan trọng nhất. Encoder nhận joint positions thân trên và lực ước lượng, encode thành latent vector mà policy sử dụng để điều chỉnh hành vi:

import torch
import torch.nn as nn

class ForceContextEncoder(nn.Module):
    """
    Latent context encoder cho FAME.
    Input: upper-body joint config + estimated bimanual forces
    Output: latent context vector z
    """

    def __init__(
        self,
        num_upper_joints: int = 10,  # 5 joints × 2 tay
        force_dim: int = 6,          # 3D force × 2 tay
        hidden_dim: int = 128,
        latent_dim: int = 32,
    ):
        super().__init__()
        input_dim = num_upper_joints + force_dim  # 16

        # MLP encoder: joint config + forces → latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

        # Temporal smoothing: dùng GRU để ổn định latent context
        self.gru = nn.GRU(
            input_size=latent_dim,
            hidden_size=latent_dim,
            batch_first=True,
        )

        self.latent_dim = latent_dim

    def forward(
        self,
        upper_body_joints: torch.Tensor,
        estimated_forces: torch.Tensor,
        hidden_state: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            upper_body_joints: (batch, num_upper_joints)
            estimated_forces: (batch, 6) — 3D force per hand
            hidden_state: (1, batch, latent_dim) — GRU hidden

        Returns:
            latent_z: (batch, latent_dim)
            new_hidden: (1, batch, latent_dim)
        """
        # Concatenate inputs
        x = torch.cat([upper_body_joints, estimated_forces], dim=-1)

        # Encode to latent
        z = self.encoder(x)

        # Temporal smoothing qua GRU
        z_seq = z.unsqueeze(1)  # (batch, 1, latent_dim)
        if hidden_state is None:
            hidden_state = torch.zeros(
                1, x.shape[0], self.latent_dim, device=x.device
            )
        z_smooth, new_hidden = self.gru(z_seq, hidden_state)
        z_smooth = z_smooth.squeeze(1)  # (batch, latent_dim)

        return z_smooth, new_hidden

Kiến trúc mạng neural cho robot control với reinforcement learning

4. Force Estimation từ Robot Dynamics (Không cần F/T Sensor)

$$\tau_{ext} = \tau_{measured} - M(q)\ddot{q} - C(q, \dot{q})\dot{q} - g(q)$$

import torch

class ForceEstimator:
    """
    Ước lượng external forces từ robot dynamics.
    Không cần F/T sensor — chỉ cần joint torque readings và model dynamics.
    """

    def __init__(self, robot_model, dt: float = 0.02):
        self.robot = robot_model
        self.dt = dt
        self.prev_qvel = None

    def estimate(
        self,
        joint_pos: torch.Tensor,
        joint_vel: torch.Tensor,
        joint_torque: torch.Tensor,
    ) -> torch.Tensor:
        """
        Ước lượng external forces/torques trên end-effectors.

        Args:
            joint_pos: (batch, num_joints) — vị trí khớp
            joint_vel: (batch, num_joints) — vận tốc khớp
            joint_torque: (batch, num_joints) — torque đo được

        Returns:
            estimated_forces: (batch, 6) — 3D force per hand
        """
        # Tính gia tốc từ finite difference
        if self.prev_qvel is None:
            self.prev_qvel = torch.zeros_like(joint_vel)
        joint_acc = (joint_vel - self.prev_qvel) / self.dt
        self.prev_qvel = joint_vel.clone()

        # Inverse dynamics: tính torque cần thiết cho motion hiện tại
        # tau_id = M(q) * qddot + C(q, qdot) * qdot + g(q)
        mass_matrix = self.robot.compute_mass_matrix(joint_pos)
        coriolis = self.robot.compute_coriolis(joint_pos, joint_vel)
        gravity = self.robot.compute_gravity(joint_pos)

        tau_expected = (
            torch.bmm(mass_matrix, joint_acc.unsqueeze(-1)).squeeze(-1)
            + coriolis
            + gravity
        )

        # External torque = measured - expected
        tau_external = joint_torque - tau_expected

        # Map joint torques → Cartesian forces qua Jacobian transpose
        jacobian_left = self.robot.compute_jacobian(
            joint_pos, end_effector="left_hand"
        )
        jacobian_right = self.robot.compute_jacobian(
            joint_pos, end_effector="right_hand"
        )

        # F = (J^T)^{-1} * tau_ext (least squares solution)
        force_left = torch.linalg.lstsq(
            jacobian_left.transpose(-1, -2), tau_external[:, :5].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        force_right = torch.linalg.lstsq(
            jacobian_right.transpose(-1, -2), tau_external[:, 5:10].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        return torch.cat([force_left, force_right], dim=-1)

5. PPO Training với Force-Conditioned Observations

Kết hợp tất cả lại — FAME policy nhận observation gốc cộng thêm latent context:

import torch
import torch.nn as nn
from torch.distributions import Normal

class FAMEPolicy(nn.Module):
    """
    FAME Standing Policy — PPO actor-critic conditioned on force context.
    """

    def __init__(
        self,
        base_obs_dim: int = 45,    # proprio: joint pos/vel + IMU + projected gravity
        latent_dim: int = 32,      # từ ForceContextEncoder
        action_dim: int = 10,      # lower-body joint targets (5 per leg)
        hidden_dim: int = 256,
    ):
        super().__init__()
        total_obs = base_obs_dim + latent_dim  # 77

        # Actor (policy)
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

        # Critic (value function)
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        base_obs: torch.Tensor,
        latent_z: torch.Tensor,
    ) -> tuple[Normal, torch.Tensor]:
        """
        Args:
            base_obs: (batch, base_obs_dim) — proprio observations
            latent_z: (batch, latent_dim) — force context

        Returns:
            action_dist: Normal distribution over actions
            value: (batch, 1) — estimated value
        """
        obs = torch.cat([base_obs, latent_z], dim=-1)

        # Actor output
        mean = self.actor(obs)
        std = torch.exp(self.log_std.clamp(-5, 2))
        action_dist = Normal(mean, std)

        # Critic output
        value = self.critic(obs)

        return action_dist, value

# ===== Reward Function =====

def compute_fame_reward(
    base_pos: torch.Tensor,
    base_quat: torch.Tensor,
    joint_vel: torch.Tensor,
    joint_torque: torch.Tensor,
    target_pos: torch.Tensor,
    target_quat: torch.Tensor,
    dt: float = 0.02,
) -> dict[str, torch.Tensor]:
    """
    FAME reward function cho standing task.
    Cân bằng giữa stability, energy efficiency, và tracking.
    """
    batch = base_pos.shape[0]

    # 1. Position tracking (giữ vị trí đứng)
    pos_error = torch.norm(base_pos[:, :2] - target_pos[:, :2], dim=-1)
    r_pos = torch.exp(-5.0 * pos_error)  # ∈ [0, 1]

    # 2. Orientation tracking (giữ thẳng)
    # Tính góc giữa quat hiện tại và target
    quat_diff = _quat_diff(base_quat, target_quat)
    angle_error = 2.0 * torch.acos(torch.clamp(quat_diff[:, 0].abs(), 0, 1))
    r_orient = torch.exp(-3.0 * angle_error)

    # 3. Height tracking (không ngồi xuống)
    height_error = torch.abs(base_pos[:, 2] - target_pos[:, 2])
    r_height = torch.exp(-10.0 * height_error)

    # 4. Energy penalty (torque nhỏ = tiết kiệm năng lượng)
    torque_penalty = torch.sum(joint_torque**2, dim=-1) * 1e-5
    r_energy = -torque_penalty

    # 5. Smoothness penalty (joint velocity nhỏ = chuyển động mượt)
    vel_penalty = torch.sum(joint_vel**2, dim=-1) * 1e-4
    r_smooth = -vel_penalty

    # 6. Alive bonus (không ngã = +1 mỗi step)
    alive = (base_pos[:, 2] > 0.5).float()  # z > 0.5m = đang đứng
    r_alive = alive * 1.0

    # Tổng hợp reward
    reward = (
        0.3 * r_pos
        + 0.2 * r_orient
        + 0.2 * r_height
        + 0.1 * r_energy
        + 0.1 * r_smooth
        + 0.1 * r_alive
    )

    return {
        "total": reward,
        "pos": r_pos,
        "orient": r_orient,
        "height": r_height,
        "energy": r_energy,
        "smooth": r_smooth,
        "alive": r_alive,
    }

def _quat_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Quaternion difference: q_diff = q1* × q2"""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1*w2 + x1*x2 + y1*y2 + z1*z2,
        -w1*x2 + x1*w2 - y1*z2 + z1*y2,
        -w1*y2 + x1*z2 + y1*w2 - z1*x2,
        -w1*z2 - x1*y2 + y1*x2 + z1*w2,
    ], dim=-1)

6. Full Training Config

# ===== Training Configuration =====

FAME_CONFIG = {
    # Environment
    "env": {
        "num_envs": 4096,
        "episode_length": 1000,     # 20 seconds at 50Hz
        "dt": 0.02,                 # 50Hz control
        "sim_dt": 0.005,            # 200Hz physics
        "robot": "unitree_h1_2",
        "terrain": "flat",
    },

    # Force sampling
    "force": {
        "min_magnitude": 0.0,       # N
        "max_magnitude": 50.0,      # N — tăng dần qua curriculum
        "sample_interval": 100,     # Resample mỗi 100 steps (2s)
        "distribution": "spherical",
    },

    # Upper-body curriculum
    "curriculum": {
        "max_level": 10,
        "reward_threshold": 0.7,    # Level up khi mean_reward > 0.7
        "initial_force_max": 5.0,   # Bắt đầu với lực nhỏ
        "final_force_max": 50.0,    # Kết thúc với lực lớn
    },

    # Context encoder
    "encoder": {
        "num_upper_joints": 10,
        "force_dim": 6,
        "hidden_dim": 128,
        "latent_dim": 32,
    },

    # PPO hyperparameters
    "ppo": {
        "learning_rate": 3e-4,
        "num_epochs": 5,
        "num_minibatches": 4,
        "gamma": 0.99,
        "lam": 0.95,              # GAE lambda
        "clip_range": 0.2,
        "entropy_coef": 0.01,
        "value_coef": 0.5,
        "max_grad_norm": 1.0,
        "target_kl": 0.01,
    },

    # Training
    "training": {
        "total_timesteps": 100_000_000,  # 100M steps
        "log_interval": 10,
        "save_interval": 500,
        "eval_interval": 100,
    },

    # Domain randomization
    "domain_rand": {
        "friction_range": [0.3, 1.5],
        "mass_offset_range": [-1.0, 1.0],   # kg
        "com_offset_range": [-0.05, 0.05],   # m
        "kp_factor_range": [0.8, 1.2],
        "kd_factor_range": [0.8, 1.2],
        "push_interval": 200,               # Random push mỗi 4s
        "push_magnitude": [0, 30],           # N
    },
}

7. PPO Training Loop

import torch
import torch.optim as optim
from collections import defaultdict

def train_fame(config: dict):
    """
    Full training loop cho FAME policy.
    Dùng Isaac Lab/Gym cho simulation.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    num_envs = config["env"]["num_envs"]

    # ===== Initialize components =====
    # 1. Environment (Isaac Lab)
    env = create_humanoid_env(config["env"])  # Isaac Lab vectorized env

    # 2. Policy + Context Encoder
    encoder = ForceContextEncoder(**config["encoder"]).to(device)
    policy = FAMEPolicy(
        base_obs_dim=env.obs_dim,
        latent_dim=config["encoder"]["latent_dim"],
        action_dim=env.action_dim,
    ).to(device)

    # 3. Force estimator + Curriculum
    force_estimator = ForceEstimator(env.robot_model, dt=config["env"]["dt"])
    curriculum = UpperBodyCurriculum(
        num_envs=num_envs,
        max_curriculum_level=config["curriculum"]["max_level"],
        device=device,
    )

    # 4. Optimizer (single optimizer cho cả encoder + policy)
    all_params = list(encoder.parameters()) + list(policy.parameters())
    optimizer = optim.Adam(all_params, lr=config["ppo"]["learning_rate"])

    # ===== Training Loop =====
    total_steps = 0
    episode_length = config["env"]["episode_length"]
    ppo_cfg = config["ppo"]

    obs = env.reset()
    hidden_state = None  # GRU hidden state cho encoder

    while total_steps < config["training"]["total_timesteps"]:
        # ----- Collect rollout -----
        rollout = defaultdict(list)

        for step in range(episode_length):
            with torch.no_grad():
                # Sample forces (resample mỗi N steps)
                if step % config["force"]["sample_interval"] == 0:
                    ext_forces = sample_spherical_forces(
                        num_envs,
                        force_range=(0, curriculum.get_scale() * config["force"]["max_magnitude"]),
                        device=device,
                    )
                    # Apply forces vào simulation
                    env.apply_external_forces(ext_forces)

                    # Sample upper-body poses
                    upper_poses = curriculum.sample_poses(env.default_upper_pose)
                    env.set_upper_body_targets(upper_poses)

                # Estimate forces từ dynamics (không cần F/T sensor)
                estimated_forces = force_estimator.estimate(
                    env.joint_pos, env.joint_vel, env.joint_torque
                )

                # Encode context
                latent_z, hidden_state = encoder(
                    env.upper_body_joints, estimated_forces, hidden_state
                )

                # Get action
                action_dist, value = policy(obs, latent_z)
                action = action_dist.sample()
                log_prob = action_dist.log_prob(action).sum(-1)

            # Step environment
            next_obs, reward_dict, done, info = env.step(action)
            reward = reward_dict["total"]

            # Store transition
            rollout["obs"].append(obs)
            rollout["latent_z"].append(latent_z)
            rollout["actions"].append(action)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value.squeeze(-1))
            rollout["rewards"].append(reward)
            rollout["dones"].append(done.float())

            obs = next_obs
            total_steps += num_envs

        # ----- Compute GAE advantages -----
        with torch.no_grad():
            last_latent, hidden_state = encoder(
                env.upper_body_joints, estimated_forces, hidden_state
            )
            _, last_value = policy(obs, last_latent)
            last_value = last_value.squeeze(-1)

        advantages = compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            last_value,
            gamma=ppo_cfg["gamma"],
            lam=ppo_cfg["lam"],
        )
        returns = [adv + val for adv, val in zip(advantages, rollout["values"])]

        # Stack rollout
        all_obs = torch.stack(rollout["obs"])
        all_z = torch.stack(rollout["latent_z"])
        all_actions = torch.stack(rollout["actions"])
        all_log_probs = torch.stack(rollout["log_probs"])
        all_returns = torch.stack(returns)
        all_advantages = torch.stack(advantages)

        # Normalize advantages
        all_advantages = (all_advantages - all_advantages.mean()) / (
            all_advantages.std() + 1e-8
        )

        # ----- PPO Update -----
        for epoch in range(ppo_cfg["num_epochs"]):
            # Mini-batch indices
            batch_size = episode_length * num_envs
            minibatch_size = batch_size // ppo_cfg["num_minibatches"]
            indices = torch.randperm(batch_size, device=device)

            for start in range(0, batch_size, minibatch_size):
                mb_idx = indices[start : start + minibatch_size]

                # Flatten temporal dimension
                flat_obs = all_obs.view(-1, all_obs.shape[-1])[mb_idx]
                flat_z = all_z.view(-1, all_z.shape[-1])[mb_idx]
                flat_actions = all_actions.view(-1, all_actions.shape[-1])[mb_idx]
                flat_old_lp = all_log_probs.view(-1)[mb_idx]
                flat_returns = all_returns.view(-1)[mb_idx]
                flat_adv = all_advantages.view(-1)[mb_idx]

                # Forward pass
                dist, values = policy(flat_obs, flat_z)
                new_log_probs = dist.log_prob(flat_actions).sum(-1)
                entropy = dist.entropy().sum(-1).mean()

                # PPO clipped objective
                ratio = torch.exp(new_log_probs - flat_old_lp)
                surr1 = ratio * flat_adv
                surr2 = torch.clamp(
                    ratio,
                    1.0 - ppo_cfg["clip_range"],
                    1.0 + ppo_cfg["clip_range"],
                ) * flat_adv
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                value_loss = 0.5 * (flat_returns - values.squeeze(-1)).pow(2).mean()

                # Total loss
                loss = (
                    policy_loss
                    + ppo_cfg["value_coef"] * value_loss
                    - ppo_cfg["entropy_coef"] * entropy
                )

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(all_params, ppo_cfg["max_grad_norm"])
                optimizer.step()

        # ----- Curriculum update -----
        mean_reward = torch.stack(rollout["rewards"]).mean().item()
        curriculum.update(mean_reward, threshold=ppo_cfg.get("target_kl", 0.7))

        # ----- Logging -----
        if total_steps % (config["training"]["log_interval"] * num_envs * episode_length) == 0:
            print(
                f"Steps: {total_steps:>10d} | "
                f"Reward: {mean_reward:.3f} | "
                f"Curriculum: {curriculum.current_level}/{curriculum.max_level} | "
                f"Policy Loss: {policy_loss.item():.4f}"
            )

        # ----- Save checkpoint -----
        if total_steps % (config["training"]["save_interval"] * num_envs * episode_length) == 0:
            torch.save({
                "encoder": encoder.state_dict(),
                "policy": policy.state_dict(),
                "optimizer": optimizer.state_dict(),
                "total_steps": total_steps,
                "curriculum_level": curriculum.current_level,
            }, f"checkpoints/fame_step_{total_steps}.pt")

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation."""
    advantages = []
    gae = torch.zeros_like(rewards[0])
    next_value = last_value

    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
        next_value = values[t]

    return advantages

8. Sim-to-Real Transfer cho H1-2 / G1

Quy trình deploy lên robot thật:

class FAMERealDeployment:
    """
    Deploy FAME policy lên Unitree H1-2 / G1.
    Quy trình: Load checkpoint → Run inference at 50Hz → Send joint targets.
    """

    def __init__(
        self,
        checkpoint_path: str,
        robot_type: str = "h1_2",  # "h1_2" hoặc "g1"
        control_freq: float = 50.0,
    ):
        self.device = torch.device("cpu")  # Inference trên CPU cho latency thấp
        self.dt = 1.0 / control_freq

        # Load trained models
        ckpt = torch.load(checkpoint_path, map_location=self.device)
        self.encoder = ForceContextEncoder(
            num_upper_joints=10, latent_dim=32
        ).to(self.device)
        self.encoder.load_state_dict(ckpt["encoder"])
        self.encoder.eval()

        self.policy = FAMEPolicy(
            base_obs_dim=45, latent_dim=32, action_dim=10
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.force_estimator = ForceEstimator(
            robot_model=load_robot_model(robot_type), dt=self.dt
        )
        self.hidden_state = None

        # Action scaling cho hardware
        self.action_scale = torch.tensor([
            0.25, 0.25, 0.25, 0.25, 0.25,  # left leg
            0.25, 0.25, 0.25, 0.25, 0.25,  # right leg
        ])

    @torch.no_grad()
    def step(self, robot_state: dict) -> torch.Tensor:
        """
        Một step inference. Gọi ở 50Hz.

        Args:
            robot_state: dict với keys:
                - joint_pos: (num_joints,)
                - joint_vel: (num_joints,)
                - joint_torque: (num_joints,)
                - imu_quat: (4,)
                - imu_angular_vel: (3,)
                - projected_gravity: (3,)
        """
        # Build observation vector
        obs = self._build_observation(robot_state)

        # Estimate forces
        estimated_forces = self.force_estimator.estimate(
            robot_state["joint_pos"].unsqueeze(0),
            robot_state["joint_vel"].unsqueeze(0),
            robot_state["joint_torque"].unsqueeze(0),
        )

        # Encode context
        upper_joints = robot_state["joint_pos"][:10].unsqueeze(0)
        latent_z, self.hidden_state = self.encoder(
            upper_joints, estimated_forces, self.hidden_state
        )

        # Get action (deterministic — dùng mean, không sample)
        action_dist, _ = self.policy(obs.unsqueeze(0), latent_z)
        action = action_dist.mean.squeeze(0)

        # Scale và clamp
        action = action * self.action_scale
        action = action.clamp(-1.0, 1.0)

        return action

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation vector từ robot state."""
        return torch.cat([
            state["joint_pos"][:10],       # lower-body joints (10D)
            state["joint_vel"][:10],       # lower-body velocities (10D)
            state["joint_pos"][10:20],     # upper-body joints (10D)
            state["imu_quat"],             # orientation (4D)
            state["imu_angular_vel"],      # angular velocity (3D)
            state["projected_gravity"],    # gravity in body frame (3D)
            # Thêm command velocity nếu cần (5D)
            torch.zeros(5),
        ])

Kết quả và Phân tích

Method	Standing Success (%)	Recovery Time (s)
Base Policy	29.44	N/A
Curriculum Only	51.40	2.8
FAME (full)	73.84	1.2

Robot humanoid trong môi trường testing thực tế

Bài học rút ra cho người làm Robotics RL

1. Observation design quan trọng hơn algorithm

2. Sensor-free estimation có thể đủ tốt

3. Curriculum + conditioning > curriculum alone

Liên kết với các nghiên cứu khác

FAME nằm trong xu hướng lớn hơn của loco-manipulation — kết hợp locomotion và manipulation cho humanoid. Nếu bạn quan tâm, hãy đọc thêm:

RL cho Robotics: PPO, SAC và cách chọn algorithm — nền tảng RL cần thiết
Humanoid Locomotion: Từ MPC đến RL — so sánh control approaches
Loco-Manipulation cho Humanoid Robot — overview lĩnh vực
Sim-to-Real RL cho Dexterous Manipulation trên Humanoid — paper liên quan về manipulation

Tài liệu tham khảo

FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion — Pudasaini, N., Zhang, Y., Lavering, J., Roncone, A., Correll, N., CU Boulder, arXiv 2603.08961, March 2026
Proximal Policy Optimization Algorithms — Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., OpenAI, 2017
Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M., Science Robotics, 2019

Vấn đề cốt lõi: Manipulation phá vỡ Balance

Tại sao các phương pháp hiện tại thất bại?

Base Policy: Không biết lực nào đang tác động

Curriculum-Only: Cải thiện nhưng chưa đủ

FAME: Hiểu lực → thích ứng

Kiến trúc FAME chi tiết

1. Force Sampling Strategy: Phân bố cầu (Spherical Distribution)

2. Upper-Body Pose Curriculum

3. Latent Context Encoder — Trái tim của FAME

4. Force Estimation từ Robot Dynamics (Không cần F/T Sensor)

5. PPO Training với Force-Conditioned Observations

6. Full Training Config

7. PPO Training Loop

8. Sim-to-Real Transfer cho H1-2 / G1

Kết quả và Phân tích

Bài học rút ra cho người làm Robotics RL

1. Observation design quan trọng hơn algorithm

2. Sensor-free estimation có thể đủ tốt

3. Curriculum + conditioning > curriculum alone

Liên kết với các nghiên cứu khác

Bài viết liên quan

Tài liệu tham khảo

Nguyễn Anh Tuấn

Bài viết liên quan

Sim-to-Real RL cho Dexterous Manipulation trên Humanoid

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation

Vấn đề cốt lõi: Manipulation phá vỡ Balance

Tại sao các phương pháp hiện tại thất bại?

Base Policy: Không biết lực nào đang tác động

Curriculum-Only: Cải thiện nhưng chưa đủ

FAME: Hiểu lực → thích ứng

Kiến trúc FAME chi tiết

1. Force Sampling Strategy: Phân bố cầu (Spherical Distribution)

2. Upper-Body Pose Curriculum

3. Latent Context Encoder — Trái tim của FAME

4. Force Estimation từ Robot Dynamics (Không cần F/T Sensor)

5. PPO Training với Force-Conditioned Observations

6. Full Training Config

7. PPO Training Loop

8. Sim-to-Real Transfer cho H1-2 / G1

Kết quả và Phân tích

Bài học rút ra cho người làm Robotics RL

1. Observation design quan trọng hơn algorithm

2. Sensor-free estimation có thể đủ tốt

3. Curriculum + conditioning > curriculum alone

Liên kết với các nghiên cứu khác

Bài viết liên quan

Tài liệu tham khảo

Nguyễn Anh Tuấn

Bài viết liên quan

Sim-to-Real RL cho Dexterous Manipulation trên Humanoid

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation