FAME: Force-Adaptive RL for Humanoid Manipulation

The Core Problem: Manipulation Breaks Balance

Most humanoid locomotion research assumes one thing: the upper body does nothing. Arms hang loosely, the center of mass stays stable, and the policy only needs to keep the legs balanced. But when a humanoid starts manipulating — lifting heavy objects, pushing things, or performing bimanual tasks — everything changes.

Forces from the upper body propagate down to the legs, creating unpredictable moments. A humanoid standing perfectly stable can fall immediately when lifting a 5kg object with both hands because the center of mass shifts and reaction forces change continuously.

The paper FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion (Pudasaini et al., CU Boulder, March 2026) solves this by conditioning the standing policy on a latent context of upper-body configuration and bimanual forces — without needing wrist force/torque sensors.

Humanoid robot performing bimanual manipulation tasks

Why Current Methods Fail

Standard locomotion policies (like Unitree SDK defaults) only receive observations from IMU and joint encoders. When the upper body moves or experiences forces, the policy has zero information about these changes — leading to slow or incorrect reactions.

Result from the paper: base policy achieves only 29.44% success rate when the robot must stand under external force disturbances.

Curriculum-Only: Better but Not Enough

A more common approach uses curriculum learning — gradually increasing perturbation difficulty during training. But curriculum only makes the robot more robust "blindly" — it doesn't understand which forces are acting.

Result: curriculum-only achieves 51.40% — better than base but still fails nearly half the time.

FAME: Understanding Forces Enables Adaptation

FAME achieves 73.84% success rate by providing the policy with context about forces and upper-body configuration through a latent encoder. The difference isn't training longer — it's providing the right information to the policy.

FAME Architecture in Detail

1. Force Sampling Strategy: Spherical Distribution

The first key innovation: instead of random forces from a uniform distribution, FAME samples forces using a 3D spherical distribution on each hand. This ensures the policy sees forces from all directions with equal probability.

import numpy as np
import torch

def sample_spherical_forces(
    num_envs: int,
    force_range: tuple[float, float] = (0.0, 50.0),
    device: str = "cuda"
) -> torch.Tensor:
    """
    Sample 3D forces using spherical distribution for each hand.
    Returns: (num_envs, 6) — 3D force for left + right hand
    """
    forces = torch.zeros(num_envs, 6, device=device)

    for hand_idx in range(2):
        # Uniform sampling on unit sphere (Marsaglia method to avoid pole bias)
        phi = torch.rand(num_envs, device=device) * 2 * np.pi
        cos_theta = 2 * torch.rand(num_envs, device=device) - 1
        sin_theta = torch.sqrt(1 - cos_theta**2)

        # Direction vector (unit sphere)
        dx = sin_theta * torch.cos(phi)
        dy = sin_theta * torch.sin(phi)
        dz = cos_theta

        # Sample magnitude from uniform range
        magnitude = (
            torch.rand(num_envs, device=device)
            * (force_range[1] - force_range[0])
            + force_range[0]
        )

        # Combine direction * magnitude
        offset = hand_idx * 3
        forces[:, offset + 0] = dx * magnitude
        forces[:, offset + 1] = dy * magnitude
        forces[:, offset + 2] = dz * magnitude

    return forces

Why does spherical sampling matter? If you sample uniformly in Cartesian space, most samples concentrate at the corners of the cube, creating directional bias. Spherical sampling ensures every direction in 3D space is equally represented.

2. Upper-Body Pose Curriculum

FAME doesn't just randomize forces — it also varies the upper-body configuration (hand, shoulder, elbow positions) during training. The curriculum gradually increases the range of motion:

import torch.nn as nn

class UpperBodyCurriculum:
    """
    Curriculum for upper-body pose randomization.
    Starts from default pose, gradually increases movement amplitude.
    """

    # H1-2 upper-body joints: shoulder (3), elbow (1), wrist (1) x 2 arms = 10
    UPPER_BODY_JOINTS = [
        "left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw",
        "left_elbow", "left_wrist",
        "right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw",
        "right_elbow", "right_wrist",
    ]

    # Joint limits (rad) for H1-2
    JOINT_RANGES = {
        "shoulder_pitch": (-2.87, 2.87),
        "shoulder_roll": (-1.34, 2.53),
        "shoulder_yaw": (-2.87, 2.87),
        "elbow": (-1.92, 2.09),
        "wrist": (-1.57, 1.57),
    }

    def __init__(
        self,
        num_envs: int,
        num_upper_joints: int = 10,
        max_curriculum_level: int = 10,
        device: str = "cuda",
    ):
        self.num_envs = num_envs
        self.num_upper_joints = num_upper_joints
        self.max_level = max_curriculum_level
        self.device = device
        self.current_level = 0  # 0 = default pose only, 10 = full range

    def get_scale(self) -> float:
        """Scale factor based on curriculum level."""
        return self.current_level / self.max_level

    def sample_poses(self, default_pose: torch.Tensor) -> torch.Tensor:
        """
        Sample upper-body poses around default pose.
        default_pose: (num_upper_joints,) — default configuration
        Returns: (num_envs, num_upper_joints)
        """
        scale = self.get_scale()
        noise = torch.randn(
            self.num_envs, self.num_upper_joints, device=self.device
        ) * scale * 0.5  # 0.5 rad max deviation at full scale

        poses = default_pose.unsqueeze(0) + noise
        return poses

    def update(self, mean_reward: float, threshold: float = 0.7):
        """
        Level up when policy is stable (reward > threshold).
        """
        if mean_reward > threshold and self.current_level < self.max_level:
            self.current_level += 1
            print(
                f"[Curriculum] Level up: {self.current_level}/{self.max_level}"
            )

3. Latent Context Encoder — The Heart of FAME

This is the most critical component. The encoder takes upper-body joint positions and estimated forces, encoding them into a latent vector that the policy uses to adjust its behavior:

import torch
import torch.nn as nn

class ForceContextEncoder(nn.Module):
    """
    Latent context encoder for FAME.
    Input: upper-body joint config + estimated bimanual forces
    Output: latent context vector z
    """

    def __init__(
        self,
        num_upper_joints: int = 10,  # 5 joints x 2 arms
        force_dim: int = 6,          # 3D force x 2 hands
        hidden_dim: int = 128,
        latent_dim: int = 32,
    ):
        super().__init__()
        input_dim = num_upper_joints + force_dim  # 16

        # MLP encoder: joint config + forces -> latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

        # Temporal smoothing via GRU for stable latent context
        self.gru = nn.GRU(
            input_size=latent_dim,
            hidden_size=latent_dim,
            batch_first=True,
        )

        self.latent_dim = latent_dim

    def forward(
        self,
        upper_body_joints: torch.Tensor,
        estimated_forces: torch.Tensor,
        hidden_state: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            upper_body_joints: (batch, num_upper_joints)
            estimated_forces: (batch, 6) — 3D force per hand
            hidden_state: (1, batch, latent_dim) — GRU hidden

        Returns:
            latent_z: (batch, latent_dim)
            new_hidden: (1, batch, latent_dim)
        """
        # Concatenate inputs
        x = torch.cat([upper_body_joints, estimated_forces], dim=-1)

        # Encode to latent
        z = self.encoder(x)

        # Temporal smoothing through GRU
        z_seq = z.unsqueeze(1)  # (batch, 1, latent_dim)
        if hidden_state is None:
            hidden_state = torch.zeros(
                1, x.shape[0], self.latent_dim, device=x.device
            )
        z_smooth, new_hidden = self.gru(z_seq, hidden_state)
        z_smooth = z_smooth.squeeze(1)  # (batch, latent_dim)

        return z_smooth, new_hidden

Neural network architecture for robot control with reinforcement learning

4. Force Estimation from Robot Dynamics (No F/T Sensor Needed)

This is the most practical insight from the paper: estimate forces from dynamics instead of measuring directly with sensors. Wrist force/torque sensors on humanoids are expensive and fragile. FAME uses the equations of motion:

$$\tau_{ext} = \tau_{measured} - M(q)\ddot{q} - C(q, \dot{q})\dot{q} - g(q)$$

import torch

class ForceEstimator:
    """
    Estimate external forces from robot dynamics.
    No F/T sensor needed — only joint torque readings and model dynamics.
    """

    def __init__(self, robot_model, dt: float = 0.02):
        self.robot = robot_model
        self.dt = dt
        self.prev_qvel = None

    def estimate(
        self,
        joint_pos: torch.Tensor,
        joint_vel: torch.Tensor,
        joint_torque: torch.Tensor,
    ) -> torch.Tensor:
        """
        Estimate external forces/torques on end-effectors.

        Args:
            joint_pos: (batch, num_joints) — joint positions
            joint_vel: (batch, num_joints) — joint velocities
            joint_torque: (batch, num_joints) — measured torques

        Returns:
            estimated_forces: (batch, 6) — 3D force per hand
        """
        # Compute acceleration from finite difference
        if self.prev_qvel is None:
            self.prev_qvel = torch.zeros_like(joint_vel)
        joint_acc = (joint_vel - self.prev_qvel) / self.dt
        self.prev_qvel = joint_vel.clone()

        # Inverse dynamics: compute expected torque for current motion
        # tau_id = M(q) * qddot + C(q, qdot) * qdot + g(q)
        mass_matrix = self.robot.compute_mass_matrix(joint_pos)
        coriolis = self.robot.compute_coriolis(joint_pos, joint_vel)
        gravity = self.robot.compute_gravity(joint_pos)

        tau_expected = (
            torch.bmm(mass_matrix, joint_acc.unsqueeze(-1)).squeeze(-1)
            + coriolis
            + gravity
        )

        # External torque = measured - expected
        tau_external = joint_torque - tau_expected

        # Map joint torques to Cartesian forces via Jacobian transpose
        jacobian_left = self.robot.compute_jacobian(
            joint_pos, end_effector="left_hand"
        )
        jacobian_right = self.robot.compute_jacobian(
            joint_pos, end_effector="right_hand"
        )

        # F = (J^T)^{-1} * tau_ext (least squares solution)
        force_left = torch.linalg.lstsq(
            jacobian_left.transpose(-1, -2), tau_external[:, :5].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        force_right = torch.linalg.lstsq(
            jacobian_right.transpose(-1, -2), tau_external[:, 5:10].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        return torch.cat([force_left, force_right], dim=-1)

5. PPO Training with Force-Conditioned Observations

Putting it all together — the FAME policy receives the base observation plus the latent context:

import torch
import torch.nn as nn
from torch.distributions import Normal

class FAMEPolicy(nn.Module):
    """
    FAME Standing Policy — PPO actor-critic conditioned on force context.
    """

    def __init__(
        self,
        base_obs_dim: int = 45,    # proprio: joint pos/vel + IMU + projected gravity
        latent_dim: int = 32,      # from ForceContextEncoder
        action_dim: int = 10,      # lower-body joint targets (5 per leg)
        hidden_dim: int = 256,
    ):
        super().__init__()
        total_obs = base_obs_dim + latent_dim  # 77

        # Actor (policy)
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

        # Critic (value function)
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        base_obs: torch.Tensor,
        latent_z: torch.Tensor,
    ) -> tuple[Normal, torch.Tensor]:
        """
        Args:
            base_obs: (batch, base_obs_dim) — proprioceptive observations
            latent_z: (batch, latent_dim) — force context

        Returns:
            action_dist: Normal distribution over actions
            value: (batch, 1) — estimated value
        """
        obs = torch.cat([base_obs, latent_z], dim=-1)

        # Actor output
        mean = self.actor(obs)
        std = torch.exp(self.log_std.clamp(-5, 2))
        action_dist = Normal(mean, std)

        # Critic output
        value = self.critic(obs)

        return action_dist, value

# ===== Reward Function =====

def compute_fame_reward(
    base_pos: torch.Tensor,
    base_quat: torch.Tensor,
    joint_vel: torch.Tensor,
    joint_torque: torch.Tensor,
    target_pos: torch.Tensor,
    target_quat: torch.Tensor,
    dt: float = 0.02,
) -> dict[str, torch.Tensor]:
    """
    FAME reward function for the standing task.
    Balances stability, energy efficiency, and tracking.
    """
    batch = base_pos.shape[0]

    # 1. Position tracking (maintain standing position)
    pos_error = torch.norm(base_pos[:, :2] - target_pos[:, :2], dim=-1)
    r_pos = torch.exp(-5.0 * pos_error)  # in [0, 1]

    # 2. Orientation tracking (stay upright)
    quat_diff = _quat_diff(base_quat, target_quat)
    angle_error = 2.0 * torch.acos(torch.clamp(quat_diff[:, 0].abs(), 0, 1))
    r_orient = torch.exp(-3.0 * angle_error)

    # 3. Height tracking (don't squat down)
    height_error = torch.abs(base_pos[:, 2] - target_pos[:, 2])
    r_height = torch.exp(-10.0 * height_error)

    # 4. Energy penalty (smaller torque = more efficient)
    torque_penalty = torch.sum(joint_torque**2, dim=-1) * 1e-5
    r_energy = -torque_penalty

    # 5. Smoothness penalty (smaller joint velocity = smoother motion)
    vel_penalty = torch.sum(joint_vel**2, dim=-1) * 1e-4
    r_smooth = -vel_penalty

    # 6. Alive bonus (+1 per step for not falling)
    alive = (base_pos[:, 2] > 0.5).float()  # z > 0.5m = standing
    r_alive = alive * 1.0

    # Combined reward
    reward = (
        0.3 * r_pos
        + 0.2 * r_orient
        + 0.2 * r_height
        + 0.1 * r_energy
        + 0.1 * r_smooth
        + 0.1 * r_alive
    )

    return {
        "total": reward,
        "pos": r_pos,
        "orient": r_orient,
        "height": r_height,
        "energy": r_energy,
        "smooth": r_smooth,
        "alive": r_alive,
    }

def _quat_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Quaternion difference: q_diff = q1* x q2"""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1*w2 + x1*x2 + y1*y2 + z1*z2,
        -w1*x2 + x1*w2 - y1*z2 + z1*y2,
        -w1*y2 + x1*z2 + y1*w2 - z1*x2,
        -w1*z2 - x1*y2 + y1*x2 + z1*w2,
    ], dim=-1)

6. Full Training Configuration

# ===== Training Configuration =====

FAME_CONFIG = {
    # Environment
    "env": {
        "num_envs": 4096,
        "episode_length": 1000,     # 20 seconds at 50Hz
        "dt": 0.02,                 # 50Hz control
        "sim_dt": 0.005,            # 200Hz physics
        "robot": "unitree_h1_2",
        "terrain": "flat",
    },

    # Force sampling
    "force": {
        "min_magnitude": 0.0,       # N
        "max_magnitude": 50.0,      # N — scales with curriculum
        "sample_interval": 100,     # Resample every 100 steps (2s)
        "distribution": "spherical",
    },

    # Upper-body curriculum
    "curriculum": {
        "max_level": 10,
        "reward_threshold": 0.7,    # Level up when mean_reward > 0.7
        "initial_force_max": 5.0,   # Start with small forces
        "final_force_max": 50.0,    # End with large forces
    },

    # Context encoder
    "encoder": {
        "num_upper_joints": 10,
        "force_dim": 6,
        "hidden_dim": 128,
        "latent_dim": 32,
    },

    # PPO hyperparameters
    "ppo": {
        "learning_rate": 3e-4,
        "num_epochs": 5,
        "num_minibatches": 4,
        "gamma": 0.99,
        "lam": 0.95,              # GAE lambda
        "clip_range": 0.2,
        "entropy_coef": 0.01,
        "value_coef": 0.5,
        "max_grad_norm": 1.0,
        "target_kl": 0.01,
    },

    # Training
    "training": {
        "total_timesteps": 100_000_000,  # 100M steps
        "log_interval": 10,
        "save_interval": 500,
        "eval_interval": 100,
    },

    # Domain randomization
    "domain_rand": {
        "friction_range": [0.3, 1.5],
        "mass_offset_range": [-1.0, 1.0],   # kg
        "com_offset_range": [-0.05, 0.05],   # m
        "kp_factor_range": [0.8, 1.2],
        "kd_factor_range": [0.8, 1.2],
        "push_interval": 200,               # Random push every 4s
        "push_magnitude": [0, 30],           # N
    },
}

7. PPO Training Loop

import torch
import torch.optim as optim
from collections import defaultdict

def train_fame(config: dict):
    """
    Full training loop for FAME policy.
    Uses Isaac Lab/Gym for simulation.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    num_envs = config["env"]["num_envs"]

    # ===== Initialize components =====
    env = create_humanoid_env(config["env"])  # Isaac Lab vectorized env

    encoder = ForceContextEncoder(**config["encoder"]).to(device)
    policy = FAMEPolicy(
        base_obs_dim=env.obs_dim,
        latent_dim=config["encoder"]["latent_dim"],
        action_dim=env.action_dim,
    ).to(device)

    force_estimator = ForceEstimator(env.robot_model, dt=config["env"]["dt"])
    curriculum = UpperBodyCurriculum(
        num_envs=num_envs,
        max_curriculum_level=config["curriculum"]["max_level"],
        device=device,
    )

    all_params = list(encoder.parameters()) + list(policy.parameters())
    optimizer = optim.Adam(all_params, lr=config["ppo"]["learning_rate"])

    # ===== Training Loop =====
    total_steps = 0
    episode_length = config["env"]["episode_length"]
    ppo_cfg = config["ppo"]

    obs = env.reset()
    hidden_state = None

    while total_steps < config["training"]["total_timesteps"]:
        # ----- Collect rollout -----
        rollout = defaultdict(list)

        for step in range(episode_length):
            with torch.no_grad():
                # Sample forces (resample every N steps)
                if step % config["force"]["sample_interval"] == 0:
                    ext_forces = sample_spherical_forces(
                        num_envs,
                        force_range=(0, curriculum.get_scale() * config["force"]["max_magnitude"]),
                        device=device,
                    )
                    env.apply_external_forces(ext_forces)

                    upper_poses = curriculum.sample_poses(env.default_upper_pose)
                    env.set_upper_body_targets(upper_poses)

                # Estimate forces from dynamics (no F/T sensor)
                estimated_forces = force_estimator.estimate(
                    env.joint_pos, env.joint_vel, env.joint_torque
                )

                # Encode context
                latent_z, hidden_state = encoder(
                    env.upper_body_joints, estimated_forces, hidden_state
                )

                # Get action
                action_dist, value = policy(obs, latent_z)
                action = action_dist.sample()
                log_prob = action_dist.log_prob(action).sum(-1)

            # Step environment
            next_obs, reward_dict, done, info = env.step(action)
            reward = reward_dict["total"]

            rollout["obs"].append(obs)
            rollout["latent_z"].append(latent_z)
            rollout["actions"].append(action)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value.squeeze(-1))
            rollout["rewards"].append(reward)
            rollout["dones"].append(done.float())

            obs = next_obs
            total_steps += num_envs

        # ----- Compute GAE advantages -----
        with torch.no_grad():
            last_latent, hidden_state = encoder(
                env.upper_body_joints, estimated_forces, hidden_state
            )
            _, last_value = policy(obs, last_latent)
            last_value = last_value.squeeze(-1)

        advantages = compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            last_value,
            gamma=ppo_cfg["gamma"],
            lam=ppo_cfg["lam"],
        )
        returns = [adv + val for adv, val in zip(advantages, rollout["values"])]

        # Stack rollout
        all_obs = torch.stack(rollout["obs"])
        all_z = torch.stack(rollout["latent_z"])
        all_actions = torch.stack(rollout["actions"])
        all_log_probs = torch.stack(rollout["log_probs"])
        all_returns = torch.stack(returns)
        all_advantages = torch.stack(advantages)

        # Normalize advantages
        all_advantages = (all_advantages - all_advantages.mean()) / (
            all_advantages.std() + 1e-8
        )

        # ----- PPO Update -----
        for epoch in range(ppo_cfg["num_epochs"]):
            batch_size = episode_length * num_envs
            minibatch_size = batch_size // ppo_cfg["num_minibatches"]
            indices = torch.randperm(batch_size, device=device)

            for start in range(0, batch_size, minibatch_size):
                mb_idx = indices[start : start + minibatch_size]

                flat_obs = all_obs.view(-1, all_obs.shape[-1])[mb_idx]
                flat_z = all_z.view(-1, all_z.shape[-1])[mb_idx]
                flat_actions = all_actions.view(-1, all_actions.shape[-1])[mb_idx]
                flat_old_lp = all_log_probs.view(-1)[mb_idx]
                flat_returns = all_returns.view(-1)[mb_idx]
                flat_adv = all_advantages.view(-1)[mb_idx]

                dist, values = policy(flat_obs, flat_z)
                new_log_probs = dist.log_prob(flat_actions).sum(-1)
                entropy = dist.entropy().sum(-1).mean()

                # PPO clipped objective
                ratio = torch.exp(new_log_probs - flat_old_lp)
                surr1 = ratio * flat_adv
                surr2 = torch.clamp(
                    ratio,
                    1.0 - ppo_cfg["clip_range"],
                    1.0 + ppo_cfg["clip_range"],
                ) * flat_adv
                policy_loss = -torch.min(surr1, surr2).mean()

                value_loss = 0.5 * (flat_returns - values.squeeze(-1)).pow(2).mean()

                loss = (
                    policy_loss
                    + ppo_cfg["value_coef"] * value_loss
                    - ppo_cfg["entropy_coef"] * entropy
                )

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(all_params, ppo_cfg["max_grad_norm"])
                optimizer.step()

        # ----- Curriculum update -----
        mean_reward = torch.stack(rollout["rewards"]).mean().item()
        curriculum.update(mean_reward, threshold=ppo_cfg.get("target_kl", 0.7))

        # ----- Logging -----
        if total_steps % (config["training"]["log_interval"] * num_envs * episode_length) == 0:
            print(
                f"Steps: {total_steps:>10d} | "
                f"Reward: {mean_reward:.3f} | "
                f"Curriculum: {curriculum.current_level}/{curriculum.max_level} | "
                f"Policy Loss: {policy_loss.item():.4f}"
            )

        # ----- Save checkpoint -----
        if total_steps % (config["training"]["save_interval"] * num_envs * episode_length) == 0:
            torch.save({
                "encoder": encoder.state_dict(),
                "policy": policy.state_dict(),
                "optimizer": optimizer.state_dict(),
                "total_steps": total_steps,
                "curriculum_level": curriculum.current_level,
            }, f"checkpoints/fame_step_{total_steps}.pt")

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation."""
    advantages = []
    gae = torch.zeros_like(rewards[0])
    next_value = last_value

    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
        next_value = values[t]

    return advantages

8. Sim-to-Real Transfer for H1-2 / G1

The deployment procedure for real hardware:

class FAMERealDeployment:
    """
    Deploy FAME policy on Unitree H1-2 / G1.
    Procedure: Load checkpoint -> Run inference at 50Hz -> Send joint targets.
    """

    def __init__(
        self,
        checkpoint_path: str,
        robot_type: str = "h1_2",  # "h1_2" or "g1"
        control_freq: float = 50.0,
    ):
        self.device = torch.device("cpu")  # CPU inference for low latency
        self.dt = 1.0 / control_freq

        # Load trained models
        ckpt = torch.load(checkpoint_path, map_location=self.device)
        self.encoder = ForceContextEncoder(
            num_upper_joints=10, latent_dim=32
        ).to(self.device)
        self.encoder.load_state_dict(ckpt["encoder"])
        self.encoder.eval()

        self.policy = FAMEPolicy(
            base_obs_dim=45, latent_dim=32, action_dim=10
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.force_estimator = ForceEstimator(
            robot_model=load_robot_model(robot_type), dt=self.dt
        )
        self.hidden_state = None

        # Action scaling for hardware
        self.action_scale = torch.tensor([
            0.25, 0.25, 0.25, 0.25, 0.25,  # left leg
            0.25, 0.25, 0.25, 0.25, 0.25,  # right leg
        ])

    @torch.no_grad()
    def step(self, robot_state: dict) -> torch.Tensor:
        """
        One inference step. Called at 50Hz.

        Args:
            robot_state: dict with keys:
                - joint_pos: (num_joints,)
                - joint_vel: (num_joints,)
                - joint_torque: (num_joints,)
                - imu_quat: (4,)
                - imu_angular_vel: (3,)
                - projected_gravity: (3,)
        """
        obs = self._build_observation(robot_state)

        estimated_forces = self.force_estimator.estimate(
            robot_state["joint_pos"].unsqueeze(0),
            robot_state["joint_vel"].unsqueeze(0),
            robot_state["joint_torque"].unsqueeze(0),
        )

        upper_joints = robot_state["joint_pos"][:10].unsqueeze(0)
        latent_z, self.hidden_state = self.encoder(
            upper_joints, estimated_forces, self.hidden_state
        )

        # Deterministic action (use mean, no sampling)
        action_dist, _ = self.policy(obs.unsqueeze(0), latent_z)
        action = action_dist.mean.squeeze(0)

        action = action * self.action_scale
        action = action.clamp(-1.0, 1.0)

        return action

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation vector from robot state."""
        return torch.cat([
            state["joint_pos"][:10],       # lower-body joints (10D)
            state["joint_vel"][:10],       # lower-body velocities (10D)
            state["joint_pos"][10:20],     # upper-body joints (10D)
            state["imu_quat"],             # orientation (4D)
            state["imu_angular_vel"],      # angular velocity (3D)
            state["projected_gravity"],    # gravity in body frame (3D)
            torch.zeros(5),                # command velocity placeholder
        ])

Results and Analysis

Method	Standing Success (%)	Recovery Time (s)
Base Policy	29.44	N/A
Curriculum Only	51.40	2.8
FAME (full)	73.84	1.2

Notably, FAME doesn't just maintain balance better — it also recovers faster: 1.2 seconds vs 2.8 seconds. The policy "understands" what forces are acting, enabling proactive compensation rather than waiting for instability and then correcting.

Humanoid robot in a real-world testing environment

Key Takeaways for Robotics RL Practitioners

1. Observation Design Matters More Than Algorithm Choice

FAME uses PPO — the most basic RL algorithm. The difference lies in the observation space: adding latent context about forces and upper-body configuration. This is a broadly applicable lesson: before trying fancier algorithms, improve your observations first.

2. Sensor-Free Estimation Can Be Good Enough

Six-axis force/torque sensors for humanoids are expensive (thousands of USD each). FAME demonstrates that estimation from the dynamics model plus joint torque readings yields results good enough for policy training. This significantly reduces deployment cost.

3. Curriculum + Conditioning > Curriculum Alone

Curriculum learning makes robots more robust, but combining curriculum with explicit conditioning (latent context) yields much better results (73.84% vs 51.40%). The policy needs to know the current conditions, not just be "accustomed" to diverse conditions.

FAME sits within the broader trend of loco-manipulation — combining locomotion and manipulation for humanoids. For further reading:

RL for Robotics: PPO, SAC and Algorithm Selection — essential RL foundations
Humanoid Locomotion: From MPC to RL — comparison of control approaches
Loco-Manipulation for Humanoid Robots — field overview
Sim-to-Real RL for Dexterous Manipulation on Humanoids — related manipulation paper

References

FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion — Pudasaini, N., Zhang, Y., Lavering, J., Roncone, A., Correll, N., CU Boulder, arXiv 2603.08961, March 2026
Proximal Policy Optimization Algorithms — Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., OpenAI, 2017
Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M., Science Robotics, 2019

The Core Problem: Manipulation Breaks Balance

Humanoid robot performing bimanual manipulation tasks

Why Current Methods Fail

Result from the paper: base policy achieves only 29.44% success rate when the robot must stand under external force disturbances.

Curriculum-Only: Better but Not Enough

Result: curriculum-only achieves 51.40% — better than base but still fails nearly half the time.

FAME: Understanding Forces Enables Adaptation

FAME Architecture in Detail

1. Force Sampling Strategy: Spherical Distribution

import numpy as np
import torch

def sample_spherical_forces(
    num_envs: int,
    force_range: tuple[float, float] = (0.0, 50.0),
    device: str = "cuda"
) -> torch.Tensor:
    """
    Sample 3D forces using spherical distribution for each hand.
    Returns: (num_envs, 6) — 3D force for left + right hand
    """
    forces = torch.zeros(num_envs, 6, device=device)

    for hand_idx in range(2):
        # Uniform sampling on unit sphere (Marsaglia method to avoid pole bias)
        phi = torch.rand(num_envs, device=device) * 2 * np.pi
        cos_theta = 2 * torch.rand(num_envs, device=device) - 1
        sin_theta = torch.sqrt(1 - cos_theta**2)

        # Direction vector (unit sphere)
        dx = sin_theta * torch.cos(phi)
        dy = sin_theta * torch.sin(phi)
        dz = cos_theta

        # Sample magnitude from uniform range
        magnitude = (
            torch.rand(num_envs, device=device)
            * (force_range[1] - force_range[0])
            + force_range[0]
        )

        # Combine direction * magnitude
        offset = hand_idx * 3
        forces[:, offset + 0] = dx * magnitude
        forces[:, offset + 1] = dy * magnitude
        forces[:, offset + 2] = dz * magnitude

    return forces

2. Upper-Body Pose Curriculum

FAME doesn't just randomize forces — it also varies the upper-body configuration (hand, shoulder, elbow positions) during training. The curriculum gradually increases the range of motion:

import torch.nn as nn

class UpperBodyCurriculum:
    """
    Curriculum for upper-body pose randomization.
    Starts from default pose, gradually increases movement amplitude.
    """

    # H1-2 upper-body joints: shoulder (3), elbow (1), wrist (1) x 2 arms = 10
    UPPER_BODY_JOINTS = [
        "left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw",
        "left_elbow", "left_wrist",
        "right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw",
        "right_elbow", "right_wrist",
    ]

    # Joint limits (rad) for H1-2
    JOINT_RANGES = {
        "shoulder_pitch": (-2.87, 2.87),
        "shoulder_roll": (-1.34, 2.53),
        "shoulder_yaw": (-2.87, 2.87),
        "elbow": (-1.92, 2.09),
        "wrist": (-1.57, 1.57),
    }

    def __init__(
        self,
        num_envs: int,
        num_upper_joints: int = 10,
        max_curriculum_level: int = 10,
        device: str = "cuda",
    ):
        self.num_envs = num_envs
        self.num_upper_joints = num_upper_joints
        self.max_level = max_curriculum_level
        self.device = device
        self.current_level = 0  # 0 = default pose only, 10 = full range

    def get_scale(self) -> float:
        """Scale factor based on curriculum level."""
        return self.current_level / self.max_level

    def sample_poses(self, default_pose: torch.Tensor) -> torch.Tensor:
        """
        Sample upper-body poses around default pose.
        default_pose: (num_upper_joints,) — default configuration
        Returns: (num_envs, num_upper_joints)
        """
        scale = self.get_scale()
        noise = torch.randn(
            self.num_envs, self.num_upper_joints, device=self.device
        ) * scale * 0.5  # 0.5 rad max deviation at full scale

        poses = default_pose.unsqueeze(0) + noise
        return poses

    def update(self, mean_reward: float, threshold: float = 0.7):
        """
        Level up when policy is stable (reward > threshold).
        """
        if mean_reward > threshold and self.current_level < self.max_level:
            self.current_level += 1
            print(
                f"[Curriculum] Level up: {self.current_level}/{self.max_level}"
            )

3. Latent Context Encoder — The Heart of FAME

This is the most critical component. The encoder takes upper-body joint positions and estimated forces, encoding them into a latent vector that the policy uses to adjust its behavior:

import torch
import torch.nn as nn

class ForceContextEncoder(nn.Module):
    """
    Latent context encoder for FAME.
    Input: upper-body joint config + estimated bimanual forces
    Output: latent context vector z
    """

    def __init__(
        self,
        num_upper_joints: int = 10,  # 5 joints x 2 arms
        force_dim: int = 6,          # 3D force x 2 hands
        hidden_dim: int = 128,
        latent_dim: int = 32,
    ):
        super().__init__()
        input_dim = num_upper_joints + force_dim  # 16

        # MLP encoder: joint config + forces -> latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

        # Temporal smoothing via GRU for stable latent context
        self.gru = nn.GRU(
            input_size=latent_dim,
            hidden_size=latent_dim,
            batch_first=True,
        )

        self.latent_dim = latent_dim

    def forward(
        self,
        upper_body_joints: torch.Tensor,
        estimated_forces: torch.Tensor,
        hidden_state: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            upper_body_joints: (batch, num_upper_joints)
            estimated_forces: (batch, 6) — 3D force per hand
            hidden_state: (1, batch, latent_dim) — GRU hidden

        Returns:
            latent_z: (batch, latent_dim)
            new_hidden: (1, batch, latent_dim)
        """
        # Concatenate inputs
        x = torch.cat([upper_body_joints, estimated_forces], dim=-1)

        # Encode to latent
        z = self.encoder(x)

        # Temporal smoothing through GRU
        z_seq = z.unsqueeze(1)  # (batch, 1, latent_dim)
        if hidden_state is None:
            hidden_state = torch.zeros(
                1, x.shape[0], self.latent_dim, device=x.device
            )
        z_smooth, new_hidden = self.gru(z_seq, hidden_state)
        z_smooth = z_smooth.squeeze(1)  # (batch, latent_dim)

        return z_smooth, new_hidden

Neural network architecture for robot control with reinforcement learning

4. Force Estimation from Robot Dynamics (No F/T Sensor Needed)

$$\tau_{ext} = \tau_{measured} - M(q)\ddot{q} - C(q, \dot{q})\dot{q} - g(q)$$

import torch

class ForceEstimator:
    """
    Estimate external forces from robot dynamics.
    No F/T sensor needed — only joint torque readings and model dynamics.
    """

    def __init__(self, robot_model, dt: float = 0.02):
        self.robot = robot_model
        self.dt = dt
        self.prev_qvel = None

    def estimate(
        self,
        joint_pos: torch.Tensor,
        joint_vel: torch.Tensor,
        joint_torque: torch.Tensor,
    ) -> torch.Tensor:
        """
        Estimate external forces/torques on end-effectors.

        Args:
            joint_pos: (batch, num_joints) — joint positions
            joint_vel: (batch, num_joints) — joint velocities
            joint_torque: (batch, num_joints) — measured torques

        Returns:
            estimated_forces: (batch, 6) — 3D force per hand
        """
        # Compute acceleration from finite difference
        if self.prev_qvel is None:
            self.prev_qvel = torch.zeros_like(joint_vel)
        joint_acc = (joint_vel - self.prev_qvel) / self.dt
        self.prev_qvel = joint_vel.clone()

        # Inverse dynamics: compute expected torque for current motion
        # tau_id = M(q) * qddot + C(q, qdot) * qdot + g(q)
        mass_matrix = self.robot.compute_mass_matrix(joint_pos)
        coriolis = self.robot.compute_coriolis(joint_pos, joint_vel)
        gravity = self.robot.compute_gravity(joint_pos)

        tau_expected = (
            torch.bmm(mass_matrix, joint_acc.unsqueeze(-1)).squeeze(-1)
            + coriolis
            + gravity
        )

        # External torque = measured - expected
        tau_external = joint_torque - tau_expected

        # Map joint torques to Cartesian forces via Jacobian transpose
        jacobian_left = self.robot.compute_jacobian(
            joint_pos, end_effector="left_hand"
        )
        jacobian_right = self.robot.compute_jacobian(
            joint_pos, end_effector="right_hand"
        )

        # F = (J^T)^{-1} * tau_ext (least squares solution)
        force_left = torch.linalg.lstsq(
            jacobian_left.transpose(-1, -2), tau_external[:, :5].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        force_right = torch.linalg.lstsq(
            jacobian_right.transpose(-1, -2), tau_external[:, 5:10].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]

        return torch.cat([force_left, force_right], dim=-1)

5. PPO Training with Force-Conditioned Observations

Putting it all together — the FAME policy receives the base observation plus the latent context:

import torch
import torch.nn as nn
from torch.distributions import Normal

class FAMEPolicy(nn.Module):
    """
    FAME Standing Policy — PPO actor-critic conditioned on force context.
    """

    def __init__(
        self,
        base_obs_dim: int = 45,    # proprio: joint pos/vel + IMU + projected gravity
        latent_dim: int = 32,      # from ForceContextEncoder
        action_dim: int = 10,      # lower-body joint targets (5 per leg)
        hidden_dim: int = 256,
    ):
        super().__init__()
        total_obs = base_obs_dim + latent_dim  # 77

        # Actor (policy)
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

        # Critic (value function)
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        base_obs: torch.Tensor,
        latent_z: torch.Tensor,
    ) -> tuple[Normal, torch.Tensor]:
        """
        Args:
            base_obs: (batch, base_obs_dim) — proprioceptive observations
            latent_z: (batch, latent_dim) — force context

        Returns:
            action_dist: Normal distribution over actions
            value: (batch, 1) — estimated value
        """
        obs = torch.cat([base_obs, latent_z], dim=-1)

        # Actor output
        mean = self.actor(obs)
        std = torch.exp(self.log_std.clamp(-5, 2))
        action_dist = Normal(mean, std)

        # Critic output
        value = self.critic(obs)

        return action_dist, value

# ===== Reward Function =====

def compute_fame_reward(
    base_pos: torch.Tensor,
    base_quat: torch.Tensor,
    joint_vel: torch.Tensor,
    joint_torque: torch.Tensor,
    target_pos: torch.Tensor,
    target_quat: torch.Tensor,
    dt: float = 0.02,
) -> dict[str, torch.Tensor]:
    """
    FAME reward function for the standing task.
    Balances stability, energy efficiency, and tracking.
    """
    batch = base_pos.shape[0]

    # 1. Position tracking (maintain standing position)
    pos_error = torch.norm(base_pos[:, :2] - target_pos[:, :2], dim=-1)
    r_pos = torch.exp(-5.0 * pos_error)  # in [0, 1]

    # 2. Orientation tracking (stay upright)
    quat_diff = _quat_diff(base_quat, target_quat)
    angle_error = 2.0 * torch.acos(torch.clamp(quat_diff[:, 0].abs(), 0, 1))
    r_orient = torch.exp(-3.0 * angle_error)

    # 3. Height tracking (don't squat down)
    height_error = torch.abs(base_pos[:, 2] - target_pos[:, 2])
    r_height = torch.exp(-10.0 * height_error)

    # 4. Energy penalty (smaller torque = more efficient)
    torque_penalty = torch.sum(joint_torque**2, dim=-1) * 1e-5
    r_energy = -torque_penalty

    # 5. Smoothness penalty (smaller joint velocity = smoother motion)
    vel_penalty = torch.sum(joint_vel**2, dim=-1) * 1e-4
    r_smooth = -vel_penalty

    # 6. Alive bonus (+1 per step for not falling)
    alive = (base_pos[:, 2] > 0.5).float()  # z > 0.5m = standing
    r_alive = alive * 1.0

    # Combined reward
    reward = (
        0.3 * r_pos
        + 0.2 * r_orient
        + 0.2 * r_height
        + 0.1 * r_energy
        + 0.1 * r_smooth
        + 0.1 * r_alive
    )

    return {
        "total": reward,
        "pos": r_pos,
        "orient": r_orient,
        "height": r_height,
        "energy": r_energy,
        "smooth": r_smooth,
        "alive": r_alive,
    }

def _quat_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Quaternion difference: q_diff = q1* x q2"""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1*w2 + x1*x2 + y1*y2 + z1*z2,
        -w1*x2 + x1*w2 - y1*z2 + z1*y2,
        -w1*y2 + x1*z2 + y1*w2 - z1*x2,
        -w1*z2 - x1*y2 + y1*x2 + z1*w2,
    ], dim=-1)

6. Full Training Configuration

# ===== Training Configuration =====

FAME_CONFIG = {
    # Environment
    "env": {
        "num_envs": 4096,
        "episode_length": 1000,     # 20 seconds at 50Hz
        "dt": 0.02,                 # 50Hz control
        "sim_dt": 0.005,            # 200Hz physics
        "robot": "unitree_h1_2",
        "terrain": "flat",
    },

    # Force sampling
    "force": {
        "min_magnitude": 0.0,       # N
        "max_magnitude": 50.0,      # N — scales with curriculum
        "sample_interval": 100,     # Resample every 100 steps (2s)
        "distribution": "spherical",
    },

    # Upper-body curriculum
    "curriculum": {
        "max_level": 10,
        "reward_threshold": 0.7,    # Level up when mean_reward > 0.7
        "initial_force_max": 5.0,   # Start with small forces
        "final_force_max": 50.0,    # End with large forces
    },

    # Context encoder
    "encoder": {
        "num_upper_joints": 10,
        "force_dim": 6,
        "hidden_dim": 128,
        "latent_dim": 32,
    },

    # PPO hyperparameters
    "ppo": {
        "learning_rate": 3e-4,
        "num_epochs": 5,
        "num_minibatches": 4,
        "gamma": 0.99,
        "lam": 0.95,              # GAE lambda
        "clip_range": 0.2,
        "entropy_coef": 0.01,
        "value_coef": 0.5,
        "max_grad_norm": 1.0,
        "target_kl": 0.01,
    },

    # Training
    "training": {
        "total_timesteps": 100_000_000,  # 100M steps
        "log_interval": 10,
        "save_interval": 500,
        "eval_interval": 100,
    },

    # Domain randomization
    "domain_rand": {
        "friction_range": [0.3, 1.5],
        "mass_offset_range": [-1.0, 1.0],   # kg
        "com_offset_range": [-0.05, 0.05],   # m
        "kp_factor_range": [0.8, 1.2],
        "kd_factor_range": [0.8, 1.2],
        "push_interval": 200,               # Random push every 4s
        "push_magnitude": [0, 30],           # N
    },
}

7. PPO Training Loop

import torch
import torch.optim as optim
from collections import defaultdict

def train_fame(config: dict):
    """
    Full training loop for FAME policy.
    Uses Isaac Lab/Gym for simulation.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    num_envs = config["env"]["num_envs"]

    # ===== Initialize components =====
    env = create_humanoid_env(config["env"])  # Isaac Lab vectorized env

    encoder = ForceContextEncoder(**config["encoder"]).to(device)
    policy = FAMEPolicy(
        base_obs_dim=env.obs_dim,
        latent_dim=config["encoder"]["latent_dim"],
        action_dim=env.action_dim,
    ).to(device)

    force_estimator = ForceEstimator(env.robot_model, dt=config["env"]["dt"])
    curriculum = UpperBodyCurriculum(
        num_envs=num_envs,
        max_curriculum_level=config["curriculum"]["max_level"],
        device=device,
    )

    all_params = list(encoder.parameters()) + list(policy.parameters())
    optimizer = optim.Adam(all_params, lr=config["ppo"]["learning_rate"])

    # ===== Training Loop =====
    total_steps = 0
    episode_length = config["env"]["episode_length"]
    ppo_cfg = config["ppo"]

    obs = env.reset()
    hidden_state = None

    while total_steps < config["training"]["total_timesteps"]:
        # ----- Collect rollout -----
        rollout = defaultdict(list)

        for step in range(episode_length):
            with torch.no_grad():
                # Sample forces (resample every N steps)
                if step % config["force"]["sample_interval"] == 0:
                    ext_forces = sample_spherical_forces(
                        num_envs,
                        force_range=(0, curriculum.get_scale() * config["force"]["max_magnitude"]),
                        device=device,
                    )
                    env.apply_external_forces(ext_forces)

                    upper_poses = curriculum.sample_poses(env.default_upper_pose)
                    env.set_upper_body_targets(upper_poses)

                # Estimate forces from dynamics (no F/T sensor)
                estimated_forces = force_estimator.estimate(
                    env.joint_pos, env.joint_vel, env.joint_torque
                )

                # Encode context
                latent_z, hidden_state = encoder(
                    env.upper_body_joints, estimated_forces, hidden_state
                )

                # Get action
                action_dist, value = policy(obs, latent_z)
                action = action_dist.sample()
                log_prob = action_dist.log_prob(action).sum(-1)

            # Step environment
            next_obs, reward_dict, done, info = env.step(action)
            reward = reward_dict["total"]

            rollout["obs"].append(obs)
            rollout["latent_z"].append(latent_z)
            rollout["actions"].append(action)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value.squeeze(-1))
            rollout["rewards"].append(reward)
            rollout["dones"].append(done.float())

            obs = next_obs
            total_steps += num_envs

        # ----- Compute GAE advantages -----
        with torch.no_grad():
            last_latent, hidden_state = encoder(
                env.upper_body_joints, estimated_forces, hidden_state
            )
            _, last_value = policy(obs, last_latent)
            last_value = last_value.squeeze(-1)

        advantages = compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            last_value,
            gamma=ppo_cfg["gamma"],
            lam=ppo_cfg["lam"],
        )
        returns = [adv + val for adv, val in zip(advantages, rollout["values"])]

        # Stack rollout
        all_obs = torch.stack(rollout["obs"])
        all_z = torch.stack(rollout["latent_z"])
        all_actions = torch.stack(rollout["actions"])
        all_log_probs = torch.stack(rollout["log_probs"])
        all_returns = torch.stack(returns)
        all_advantages = torch.stack(advantages)

        # Normalize advantages
        all_advantages = (all_advantages - all_advantages.mean()) / (
            all_advantages.std() + 1e-8
        )

        # ----- PPO Update -----
        for epoch in range(ppo_cfg["num_epochs"]):
            batch_size = episode_length * num_envs
            minibatch_size = batch_size // ppo_cfg["num_minibatches"]
            indices = torch.randperm(batch_size, device=device)

            for start in range(0, batch_size, minibatch_size):
                mb_idx = indices[start : start + minibatch_size]

                flat_obs = all_obs.view(-1, all_obs.shape[-1])[mb_idx]
                flat_z = all_z.view(-1, all_z.shape[-1])[mb_idx]
                flat_actions = all_actions.view(-1, all_actions.shape[-1])[mb_idx]
                flat_old_lp = all_log_probs.view(-1)[mb_idx]
                flat_returns = all_returns.view(-1)[mb_idx]
                flat_adv = all_advantages.view(-1)[mb_idx]

                dist, values = policy(flat_obs, flat_z)
                new_log_probs = dist.log_prob(flat_actions).sum(-1)
                entropy = dist.entropy().sum(-1).mean()

                # PPO clipped objective
                ratio = torch.exp(new_log_probs - flat_old_lp)
                surr1 = ratio * flat_adv
                surr2 = torch.clamp(
                    ratio,
                    1.0 - ppo_cfg["clip_range"],
                    1.0 + ppo_cfg["clip_range"],
                ) * flat_adv
                policy_loss = -torch.min(surr1, surr2).mean()

                value_loss = 0.5 * (flat_returns - values.squeeze(-1)).pow(2).mean()

                loss = (
                    policy_loss
                    + ppo_cfg["value_coef"] * value_loss
                    - ppo_cfg["entropy_coef"] * entropy
                )

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(all_params, ppo_cfg["max_grad_norm"])
                optimizer.step()

        # ----- Curriculum update -----
        mean_reward = torch.stack(rollout["rewards"]).mean().item()
        curriculum.update(mean_reward, threshold=ppo_cfg.get("target_kl", 0.7))

        # ----- Logging -----
        if total_steps % (config["training"]["log_interval"] * num_envs * episode_length) == 0:
            print(
                f"Steps: {total_steps:>10d} | "
                f"Reward: {mean_reward:.3f} | "
                f"Curriculum: {curriculum.current_level}/{curriculum.max_level} | "
                f"Policy Loss: {policy_loss.item():.4f}"
            )

        # ----- Save checkpoint -----
        if total_steps % (config["training"]["save_interval"] * num_envs * episode_length) == 0:
            torch.save({
                "encoder": encoder.state_dict(),
                "policy": policy.state_dict(),
                "optimizer": optimizer.state_dict(),
                "total_steps": total_steps,
                "curriculum_level": curriculum.current_level,
            }, f"checkpoints/fame_step_{total_steps}.pt")

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation."""
    advantages = []
    gae = torch.zeros_like(rewards[0])
    next_value = last_value

    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
        next_value = values[t]

    return advantages

8. Sim-to-Real Transfer for H1-2 / G1

The deployment procedure for real hardware:

class FAMERealDeployment:
    """
    Deploy FAME policy on Unitree H1-2 / G1.
    Procedure: Load checkpoint -> Run inference at 50Hz -> Send joint targets.
    """

    def __init__(
        self,
        checkpoint_path: str,
        robot_type: str = "h1_2",  # "h1_2" or "g1"
        control_freq: float = 50.0,
    ):
        self.device = torch.device("cpu")  # CPU inference for low latency
        self.dt = 1.0 / control_freq

        # Load trained models
        ckpt = torch.load(checkpoint_path, map_location=self.device)
        self.encoder = ForceContextEncoder(
            num_upper_joints=10, latent_dim=32
        ).to(self.device)
        self.encoder.load_state_dict(ckpt["encoder"])
        self.encoder.eval()

        self.policy = FAMEPolicy(
            base_obs_dim=45, latent_dim=32, action_dim=10
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.force_estimator = ForceEstimator(
            robot_model=load_robot_model(robot_type), dt=self.dt
        )
        self.hidden_state = None

        # Action scaling for hardware
        self.action_scale = torch.tensor([
            0.25, 0.25, 0.25, 0.25, 0.25,  # left leg
            0.25, 0.25, 0.25, 0.25, 0.25,  # right leg
        ])

    @torch.no_grad()
    def step(self, robot_state: dict) -> torch.Tensor:
        """
        One inference step. Called at 50Hz.

        Args:
            robot_state: dict with keys:
                - joint_pos: (num_joints,)
                - joint_vel: (num_joints,)
                - joint_torque: (num_joints,)
                - imu_quat: (4,)
                - imu_angular_vel: (3,)
                - projected_gravity: (3,)
        """
        obs = self._build_observation(robot_state)

        estimated_forces = self.force_estimator.estimate(
            robot_state["joint_pos"].unsqueeze(0),
            robot_state["joint_vel"].unsqueeze(0),
            robot_state["joint_torque"].unsqueeze(0),
        )

        upper_joints = robot_state["joint_pos"][:10].unsqueeze(0)
        latent_z, self.hidden_state = self.encoder(
            upper_joints, estimated_forces, self.hidden_state
        )

        # Deterministic action (use mean, no sampling)
        action_dist, _ = self.policy(obs.unsqueeze(0), latent_z)
        action = action_dist.mean.squeeze(0)

        action = action * self.action_scale
        action = action.clamp(-1.0, 1.0)

        return action

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation vector from robot state."""
        return torch.cat([
            state["joint_pos"][:10],       # lower-body joints (10D)
            state["joint_vel"][:10],       # lower-body velocities (10D)
            state["joint_pos"][10:20],     # upper-body joints (10D)
            state["imu_quat"],             # orientation (4D)
            state["imu_angular_vel"],      # angular velocity (3D)
            state["projected_gravity"],    # gravity in body frame (3D)
            torch.zeros(5),                # command velocity placeholder
        ])

Results and Analysis

Method	Standing Success (%)	Recovery Time (s)
Base Policy	29.44	N/A
Curriculum Only	51.40	2.8
FAME (full)	73.84	1.2

Humanoid robot in a real-world testing environment

Key Takeaways for Robotics RL Practitioners

1. Observation Design Matters More Than Algorithm Choice

2. Sensor-Free Estimation Can Be Good Enough

3. Curriculum + Conditioning > Curriculum Alone

FAME sits within the broader trend of loco-manipulation — combining locomotion and manipulation for humanoids. For further reading:

RL for Robotics: PPO, SAC and Algorithm Selection — essential RL foundations
Humanoid Locomotion: From MPC to RL — comparison of control approaches
Loco-Manipulation for Humanoid Robots — field overview
Sim-to-Real RL for Dexterous Manipulation on Humanoids — related manipulation paper

References

FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion — Pudasaini, N., Zhang, Y., Lavering, J., Roncone, A., Correll, N., CU Boulder, arXiv 2603.08961, March 2026
Proximal Policy Optimization Algorithms — Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., OpenAI, 2017
Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M., Science Robotics, 2019

The Core Problem: Manipulation Breaks Balance

Why Current Methods Fail

Base Policy: Blind to External Forces

Curriculum-Only: Better but Not Enough

FAME: Understanding Forces Enables Adaptation

FAME Architecture in Detail

1. Force Sampling Strategy: Spherical Distribution

2. Upper-Body Pose Curriculum

3. Latent Context Encoder — The Heart of FAME

4. Force Estimation from Robot Dynamics (No F/T Sensor Needed)

5. PPO Training with Force-Conditioned Observations

6. Full Training Configuration

7. PPO Training Loop

8. Sim-to-Real Transfer for H1-2 / G1

Results and Analysis

Key Takeaways for Robotics RL Practitioners

1. Observation Design Matters More Than Algorithm Choice

2. Sensor-Free Estimation Can Be Good Enough

3. Curriculum + Conditioning > Curriculum Alone

Connections to Related Research

Related Posts

References

Nguyễn Anh Tuấn

Related Posts

Sim-to-Real RL cho Dexterous Manipulation trên Humanoid

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation

The Core Problem: Manipulation Breaks Balance

Why Current Methods Fail

Base Policy: Blind to External Forces

Curriculum-Only: Better but Not Enough

FAME: Understanding Forces Enables Adaptation

FAME Architecture in Detail

1. Force Sampling Strategy: Spherical Distribution

2. Upper-Body Pose Curriculum

3. Latent Context Encoder — The Heart of FAME

4. Force Estimation from Robot Dynamics (No F/T Sensor Needed)

5. PPO Training with Force-Conditioned Observations

6. Full Training Configuration

7. PPO Training Loop

8. Sim-to-Real Transfer for H1-2 / G1

Results and Analysis

Key Takeaways for Robotics RL Practitioners

1. Observation Design Matters More Than Algorithm Choice

2. Sensor-Free Estimation Can Be Good Enough

3. Curriculum + Conditioning > Curriculum Alone

Connections to Related Research

Related Posts

References

Nguyễn Anh Tuấn

Related Posts

Sim-to-Real RL cho Dexterous Manipulation trên Humanoid

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation