The Core Problem: Manipulation Breaks Balance
Most humanoid locomotion research assumes one thing: the upper body does nothing. Arms hang loosely, the center of mass stays stable, and the policy only needs to keep the legs balanced. But when a humanoid starts manipulating — lifting heavy objects, pushing things, or performing bimanual tasks — everything changes.
Forces from the upper body propagate down to the legs, creating large, rapidly changing moments at the hips and ankles. A humanoid that stands perfectly stable can fall the moment it lifts a 5 kg object with both hands, because the center of mass shifts and the reaction forces change continuously.
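A back-of-the-envelope estimate shows the scale of the disturbance (the 5 kg payload matches the example above; the 0.35 m lever arm is an assumed, illustrative value):

```python
# Illustrative numbers, not from the paper: a 5 kg payload held 0.35 m
# in front of the ankle axis adds a constant pitch moment that the
# standing policy must cancel on top of its normal balance control.
m_payload = 5.0      # kg
g = 9.81             # m/s^2
lever_arm = 0.35     # m, horizontal offset of payload from ankle axis

ankle_moment = m_payload * g * lever_arm  # N*m
print(f"extra ankle pitch moment: {ankle_moment:.1f} N*m")  # ~17.2 N*m
```

That is a sizeable, sustained bias torque; a policy that has never observed anything about the payload has to discover it purely from IMU drift after the fact.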
The paper FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion (Pudasaini et al., CU Boulder, March 2026) solves this by conditioning the standing policy on a latent context of upper-body configuration and bimanual forces — without needing wrist force/torque sensors.
Why Current Methods Fail
Base Policy: Blind to External Forces
Standard locomotion policies (like Unitree SDK defaults) only receive observations from IMU and joint encoders. When the upper body moves or experiences forces, the policy has zero information about these changes — leading to slow or incorrect reactions.
Result from the paper: base policy achieves only 29.44% success rate when the robot must stand under external force disturbances.
Curriculum-Only: Better but Not Enough
A common improvement is curriculum learning: gradually increasing perturbation difficulty during training. But a curriculum only makes the robot robust "blindly"; it gives the policy no information about which forces are currently acting.
Result: curriculum-only achieves 51.40% — better than base but still fails nearly half the time.
FAME: Understanding Forces Enables Adaptation
FAME achieves 73.84% success rate by providing the policy with context about forces and upper-body configuration through a latent encoder. The difference isn't training longer — it's providing the right information to the policy.
FAME Architecture in Detail
1. Force Sampling Strategy: Spherical Distribution
The first key innovation: instead of random forces from a uniform distribution, FAME samples forces using a 3D spherical distribution on each hand. This ensures the policy sees forces from all directions with equal probability.
import numpy as np
import torch
def sample_spherical_forces(
num_envs: int,
force_range: tuple[float, float] = (0.0, 50.0),
device: str = "cuda"
) -> torch.Tensor:
"""
Sample 3D forces using spherical distribution for each hand.
Returns: (num_envs, 6) — 3D force for left + right hand
"""
forces = torch.zeros(num_envs, 6, device=device)
for hand_idx in range(2):
        # Uniform sampling on the unit sphere: uniform azimuth plus
        # uniform cos(theta) avoids the pole bias of naive angle sampling
phi = torch.rand(num_envs, device=device) * 2 * np.pi
cos_theta = 2 * torch.rand(num_envs, device=device) - 1
sin_theta = torch.sqrt(1 - cos_theta**2)
# Direction vector (unit sphere)
dx = sin_theta * torch.cos(phi)
dy = sin_theta * torch.sin(phi)
dz = cos_theta
# Sample magnitude from uniform range
magnitude = (
torch.rand(num_envs, device=device)
* (force_range[1] - force_range[0])
+ force_range[0]
)
# Combine direction * magnitude
offset = hand_idx * 3
forces[:, offset + 0] = dx * magnitude
forces[:, offset + 1] = dy * magnitude
forces[:, offset + 2] = dz * magnitude
return forces
Why does spherical sampling matter? If you sample each force component uniformly, you draw from a cube in Cartesian space, and directions toward the cube's corners are over-represented: the angular density along a corner diagonal is about five times that along a face normal. Spherical sampling represents every direction in 3D space equally.
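The bias is easy to verify empirically. This standalone sketch (not from the paper) counts how often sampled directions fall within 10 degrees of a face normal versus a corner diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
cos_cone = np.cos(np.deg2rad(10.0))

face = np.array([1.0, 0.0, 0.0])                 # +x face normal
corner = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)  # corner diagonal

def cone_ratio(dirs):
    """Fraction of directions near the corner divided by fraction near the face."""
    near_corner = (dirs @ corner > cos_cone).mean()
    near_face = (dirs @ face > cos_cone).mean()
    return near_corner / near_face

# Cartesian sampling: uniform in the cube [-1, 1]^3, then normalized
cube = rng.uniform(-1, 1, size=(n, 3))
cube /= np.linalg.norm(cube, axis=1, keepdims=True)

# Spherical sampling: normalized Gaussians are uniform on the sphere
sphere = rng.normal(size=(n, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)

print(f"cube corner/face ratio:   {cone_ratio(cube):.2f}")    # ~4-5x bias
print(f"sphere corner/face ratio: {cone_ratio(sphere):.2f}")  # ~1.0
```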
2. Upper-Body Pose Curriculum
FAME doesn't just randomize forces — it also varies the upper-body configuration (hand, shoulder, elbow positions) during training. The curriculum gradually increases the range of motion:
import torch
class UpperBodyCurriculum:
"""
Curriculum for upper-body pose randomization.
Starts from default pose, gradually increases movement amplitude.
"""
# H1-2 upper-body joints: shoulder (3), elbow (1), wrist (1) x 2 arms = 10
UPPER_BODY_JOINTS = [
"left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw",
"left_elbow", "left_wrist",
"right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw",
"right_elbow", "right_wrist",
]
# Joint limits (rad) for H1-2
JOINT_RANGES = {
"shoulder_pitch": (-2.87, 2.87),
"shoulder_roll": (-1.34, 2.53),
"shoulder_yaw": (-2.87, 2.87),
"elbow": (-1.92, 2.09),
"wrist": (-1.57, 1.57),
}
def __init__(
self,
num_envs: int,
num_upper_joints: int = 10,
max_curriculum_level: int = 10,
device: str = "cuda",
):
self.num_envs = num_envs
self.num_upper_joints = num_upper_joints
self.max_level = max_curriculum_level
self.device = device
self.current_level = 0 # 0 = default pose only, 10 = full range
def get_scale(self) -> float:
"""Scale factor based on curriculum level."""
return self.current_level / self.max_level
def sample_poses(self, default_pose: torch.Tensor) -> torch.Tensor:
"""
Sample upper-body poses around default pose.
default_pose: (num_upper_joints,) — default configuration
Returns: (num_envs, num_upper_joints)
"""
scale = self.get_scale()
noise = torch.randn(
self.num_envs, self.num_upper_joints, device=self.device
        ) * scale * 0.5  # Gaussian noise, std up to 0.5 rad at full scale
poses = default_pose.unsqueeze(0) + noise
return poses
def update(self, mean_reward: float, threshold: float = 0.7):
"""
Level up when policy is stable (reward > threshold).
"""
if mean_reward > threshold and self.current_level < self.max_level:
self.current_level += 1
print(
f"[Curriculum] Level up: {self.current_level}/{self.max_level}"
)
3. Latent Context Encoder — The Heart of FAME
This is the most critical component. The encoder takes upper-body joint positions and estimated forces, encoding them into a latent vector that the policy uses to adjust its behavior:
import torch
import torch.nn as nn
class ForceContextEncoder(nn.Module):
"""
Latent context encoder for FAME.
Input: upper-body joint config + estimated bimanual forces
Output: latent context vector z
"""
def __init__(
self,
num_upper_joints: int = 10, # 5 joints x 2 arms
force_dim: int = 6, # 3D force x 2 hands
hidden_dim: int = 128,
latent_dim: int = 32,
):
super().__init__()
input_dim = num_upper_joints + force_dim # 16
# MLP encoder: joint config + forces -> latent z
self.encoder = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, latent_dim),
)
# Temporal smoothing via GRU for stable latent context
self.gru = nn.GRU(
input_size=latent_dim,
hidden_size=latent_dim,
batch_first=True,
)
self.latent_dim = latent_dim
def forward(
self,
upper_body_joints: torch.Tensor,
estimated_forces: torch.Tensor,
hidden_state: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Args:
upper_body_joints: (batch, num_upper_joints)
estimated_forces: (batch, 6) — 3D force per hand
hidden_state: (1, batch, latent_dim) — GRU hidden
Returns:
latent_z: (batch, latent_dim)
new_hidden: (1, batch, latent_dim)
"""
# Concatenate inputs
x = torch.cat([upper_body_joints, estimated_forces], dim=-1)
# Encode to latent
z = self.encoder(x)
# Temporal smoothing through GRU
z_seq = z.unsqueeze(1) # (batch, 1, latent_dim)
if hidden_state is None:
hidden_state = torch.zeros(
1, x.shape[0], self.latent_dim, device=x.device
)
z_smooth, new_hidden = self.gru(z_seq, hidden_state)
z_smooth = z_smooth.squeeze(1) # (batch, latent_dim)
return z_smooth, new_hidden
4. Force Estimation from Robot Dynamics (No F/T Sensor Needed)
This is the most practical insight from the paper: estimate forces from dynamics instead of measuring directly with sensors. Wrist force/torque sensors on humanoids are expensive and fragile. FAME uses the equations of motion:
$$\tau_{ext} = \tau_{measured} - M(q)\ddot{q} - C(q, \dot{q})\dot{q} - g(q)$$
import torch
class ForceEstimator:
"""
Estimate external forces from robot dynamics.
No F/T sensor needed — only joint torque readings and model dynamics.
"""
def __init__(self, robot_model, dt: float = 0.02):
self.robot = robot_model
self.dt = dt
self.prev_qvel = None
def estimate(
self,
joint_pos: torch.Tensor,
joint_vel: torch.Tensor,
joint_torque: torch.Tensor,
) -> torch.Tensor:
"""
Estimate external forces/torques on end-effectors.
Args:
joint_pos: (batch, num_joints) — joint positions
joint_vel: (batch, num_joints) — joint velocities
joint_torque: (batch, num_joints) — measured torques
Returns:
estimated_forces: (batch, 6) — 3D force per hand
"""
# Compute acceleration from finite difference
if self.prev_qvel is None:
self.prev_qvel = torch.zeros_like(joint_vel)
joint_acc = (joint_vel - self.prev_qvel) / self.dt
self.prev_qvel = joint_vel.clone()
# Inverse dynamics: compute expected torque for current motion
# tau_id = M(q) * qddot + C(q, qdot) * qdot + g(q)
mass_matrix = self.robot.compute_mass_matrix(joint_pos)
coriolis = self.robot.compute_coriolis(joint_pos, joint_vel)
gravity = self.robot.compute_gravity(joint_pos)
tau_expected = (
torch.bmm(mass_matrix, joint_acc.unsqueeze(-1)).squeeze(-1)
+ coriolis
+ gravity
)
# External torque = measured - expected
tau_external = joint_torque - tau_expected
# Map joint torques to Cartesian forces via Jacobian transpose
jacobian_left = self.robot.compute_jacobian(
joint_pos, end_effector="left_hand"
)
jacobian_right = self.robot.compute_jacobian(
joint_pos, end_effector="right_hand"
)
        # Solve J^T F = tau_ext for each arm (least-squares / pseudoinverse)
        # Assumed joint ordering: 0-9 lower body, 10-14 left arm, 15-19 right arm
        force_left = torch.linalg.lstsq(
            jacobian_left.transpose(-1, -2), tau_external[:, 10:15].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]
        force_right = torch.linalg.lstsq(
            jacobian_right.transpose(-1, -2), tau_external[:, 15:20].unsqueeze(-1)
        ).solution.squeeze(-1)[:, :3]
return torch.cat([force_left, force_right], dim=-1)
5. PPO Training with Force-Conditioned Observations
Putting it all together — the FAME policy receives the base observation plus the latent context:
import torch
import torch.nn as nn
from torch.distributions import Normal
class FAMEPolicy(nn.Module):
"""
FAME Standing Policy — PPO actor-critic conditioned on force context.
"""
def __init__(
self,
base_obs_dim: int = 45, # proprio: joint pos/vel + IMU + projected gravity
latent_dim: int = 32, # from ForceContextEncoder
action_dim: int = 10, # lower-body joint targets (5 per leg)
hidden_dim: int = 256,
):
super().__init__()
total_obs = base_obs_dim + latent_dim # 77
# Actor (policy)
self.actor = nn.Sequential(
nn.Linear(total_obs, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, action_dim),
)
self.log_std = nn.Parameter(torch.zeros(action_dim))
# Critic (value function)
self.critic = nn.Sequential(
nn.Linear(total_obs, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, 1),
)
def forward(
self,
base_obs: torch.Tensor,
latent_z: torch.Tensor,
) -> tuple[Normal, torch.Tensor]:
"""
Args:
base_obs: (batch, base_obs_dim) — proprioceptive observations
latent_z: (batch, latent_dim) — force context
Returns:
action_dist: Normal distribution over actions
value: (batch, 1) — estimated value
"""
obs = torch.cat([base_obs, latent_z], dim=-1)
# Actor output
mean = self.actor(obs)
std = torch.exp(self.log_std.clamp(-5, 2))
action_dist = Normal(mean, std)
# Critic output
value = self.critic(obs)
return action_dist, value
# ===== Reward Function =====
def compute_fame_reward(
base_pos: torch.Tensor,
base_quat: torch.Tensor,
joint_vel: torch.Tensor,
joint_torque: torch.Tensor,
target_pos: torch.Tensor,
target_quat: torch.Tensor,
dt: float = 0.02,
) -> dict[str, torch.Tensor]:
"""
FAME reward function for the standing task.
Balances stability, energy efficiency, and tracking.
"""
batch = base_pos.shape[0]
# 1. Position tracking (maintain standing position)
pos_error = torch.norm(base_pos[:, :2] - target_pos[:, :2], dim=-1)
r_pos = torch.exp(-5.0 * pos_error) # in [0, 1]
# 2. Orientation tracking (stay upright)
quat_diff = _quat_diff(base_quat, target_quat)
angle_error = 2.0 * torch.acos(torch.clamp(quat_diff[:, 0].abs(), 0, 1))
r_orient = torch.exp(-3.0 * angle_error)
# 3. Height tracking (don't squat down)
height_error = torch.abs(base_pos[:, 2] - target_pos[:, 2])
r_height = torch.exp(-10.0 * height_error)
# 4. Energy penalty (smaller torque = more efficient)
torque_penalty = torch.sum(joint_torque**2, dim=-1) * 1e-5
r_energy = -torque_penalty
# 5. Smoothness penalty (smaller joint velocity = smoother motion)
vel_penalty = torch.sum(joint_vel**2, dim=-1) * 1e-4
r_smooth = -vel_penalty
# 6. Alive bonus (+1 per step for not falling)
alive = (base_pos[:, 2] > 0.5).float() # z > 0.5m = standing
r_alive = alive * 1.0
# Combined reward
reward = (
0.3 * r_pos
+ 0.2 * r_orient
+ 0.2 * r_height
+ 0.1 * r_energy
+ 0.1 * r_smooth
+ 0.1 * r_alive
)
return {
"total": reward,
"pos": r_pos,
"orient": r_orient,
"height": r_height,
"energy": r_energy,
"smooth": r_smooth,
"alive": r_alive,
}
def _quat_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Quaternion difference q_diff = q1^{-1} x q2, (w, x, y, z) convention."""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1*w2 + x1*x2 + y1*y2 + z1*z2,
        w1*x2 - x1*w2 - y1*z2 + z1*y2,
        w1*y2 - y1*w2 - z1*x2 + x1*z2,
        w1*z2 - z1*w2 - x1*y2 + y1*x2,
    ], dim=-1)
6. Full Training Configuration
# ===== Training Configuration =====
FAME_CONFIG = {
# Environment
"env": {
"num_envs": 4096,
"episode_length": 1000, # 20 seconds at 50Hz
"dt": 0.02, # 50Hz control
"sim_dt": 0.005, # 200Hz physics
"robot": "unitree_h1_2",
"terrain": "flat",
},
# Force sampling
"force": {
"min_magnitude": 0.0, # N
"max_magnitude": 50.0, # N — scales with curriculum
"sample_interval": 100, # Resample every 100 steps (2s)
"distribution": "spherical",
},
# Upper-body curriculum
"curriculum": {
"max_level": 10,
"reward_threshold": 0.7, # Level up when mean_reward > 0.7
"initial_force_max": 5.0, # Start with small forces
"final_force_max": 50.0, # End with large forces
},
# Context encoder
"encoder": {
"num_upper_joints": 10,
"force_dim": 6,
"hidden_dim": 128,
"latent_dim": 32,
},
# PPO hyperparameters
"ppo": {
"learning_rate": 3e-4,
"num_epochs": 5,
"num_minibatches": 4,
"gamma": 0.99,
"lam": 0.95, # GAE lambda
"clip_range": 0.2,
"entropy_coef": 0.01,
"value_coef": 0.5,
"max_grad_norm": 1.0,
"target_kl": 0.01,
},
# Training
"training": {
"total_timesteps": 100_000_000, # 100M steps
"log_interval": 10,
"save_interval": 500,
"eval_interval": 100,
},
# Domain randomization
"domain_rand": {
"friction_range": [0.3, 1.5],
"mass_offset_range": [-1.0, 1.0], # kg
"com_offset_range": [-0.05, 0.05], # m
"kp_factor_range": [0.8, 1.2],
"kd_factor_range": [0.8, 1.2],
"push_interval": 200, # Random push every 4s
"push_magnitude": [0, 30], # N
},
}
7. PPO Training Loop
import torch
import torch.optim as optim
from collections import defaultdict
def train_fame(config: dict):
"""
Full training loop for FAME policy.
Uses Isaac Lab/Gym for simulation.
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = config["env"]["num_envs"]
# ===== Initialize components =====
env = create_humanoid_env(config["env"]) # Isaac Lab vectorized env
encoder = ForceContextEncoder(**config["encoder"]).to(device)
policy = FAMEPolicy(
base_obs_dim=env.obs_dim,
latent_dim=config["encoder"]["latent_dim"],
action_dim=env.action_dim,
).to(device)
force_estimator = ForceEstimator(env.robot_model, dt=config["env"]["dt"])
curriculum = UpperBodyCurriculum(
num_envs=num_envs,
max_curriculum_level=config["curriculum"]["max_level"],
device=device,
)
all_params = list(encoder.parameters()) + list(policy.parameters())
optimizer = optim.Adam(all_params, lr=config["ppo"]["learning_rate"])
# ===== Training Loop =====
total_steps = 0
episode_length = config["env"]["episode_length"]
ppo_cfg = config["ppo"]
obs = env.reset()
hidden_state = None
while total_steps < config["training"]["total_timesteps"]:
# ----- Collect rollout -----
rollout = defaultdict(list)
for step in range(episode_length):
with torch.no_grad():
# Sample forces (resample every N steps)
if step % config["force"]["sample_interval"] == 0:
ext_forces = sample_spherical_forces(
num_envs,
force_range=(0, curriculum.get_scale() * config["force"]["max_magnitude"]),
device=device,
)
env.apply_external_forces(ext_forces)
upper_poses = curriculum.sample_poses(env.default_upper_pose)
env.set_upper_body_targets(upper_poses)
# Estimate forces from dynamics (no F/T sensor)
estimated_forces = force_estimator.estimate(
env.joint_pos, env.joint_vel, env.joint_torque
)
# Encode context
latent_z, hidden_state = encoder(
env.upper_body_joints, estimated_forces, hidden_state
)
# Get action
action_dist, value = policy(obs, latent_z)
action = action_dist.sample()
log_prob = action_dist.log_prob(action).sum(-1)
# Step environment
next_obs, reward_dict, done, info = env.step(action)
reward = reward_dict["total"]
rollout["obs"].append(obs)
rollout["latent_z"].append(latent_z)
rollout["actions"].append(action)
rollout["log_probs"].append(log_prob)
rollout["values"].append(value.squeeze(-1))
rollout["rewards"].append(reward)
rollout["dones"].append(done.float())
obs = next_obs
total_steps += num_envs
# ----- Compute GAE advantages -----
with torch.no_grad():
last_latent, hidden_state = encoder(
env.upper_body_joints, estimated_forces, hidden_state
)
_, last_value = policy(obs, last_latent)
last_value = last_value.squeeze(-1)
advantages = compute_gae(
rollout["rewards"],
rollout["values"],
rollout["dones"],
last_value,
gamma=ppo_cfg["gamma"],
lam=ppo_cfg["lam"],
)
returns = [adv + val for adv, val in zip(advantages, rollout["values"])]
# Stack rollout
all_obs = torch.stack(rollout["obs"])
all_z = torch.stack(rollout["latent_z"])
all_actions = torch.stack(rollout["actions"])
all_log_probs = torch.stack(rollout["log_probs"])
all_returns = torch.stack(returns)
all_advantages = torch.stack(advantages)
# Normalize advantages
all_advantages = (all_advantages - all_advantages.mean()) / (
all_advantages.std() + 1e-8
)
# ----- PPO Update -----
for epoch in range(ppo_cfg["num_epochs"]):
batch_size = episode_length * num_envs
minibatch_size = batch_size // ppo_cfg["num_minibatches"]
indices = torch.randperm(batch_size, device=device)
for start in range(0, batch_size, minibatch_size):
mb_idx = indices[start : start + minibatch_size]
flat_obs = all_obs.view(-1, all_obs.shape[-1])[mb_idx]
flat_z = all_z.view(-1, all_z.shape[-1])[mb_idx]
flat_actions = all_actions.view(-1, all_actions.shape[-1])[mb_idx]
flat_old_lp = all_log_probs.view(-1)[mb_idx]
flat_returns = all_returns.view(-1)[mb_idx]
flat_adv = all_advantages.view(-1)[mb_idx]
dist, values = policy(flat_obs, flat_z)
new_log_probs = dist.log_prob(flat_actions).sum(-1)
entropy = dist.entropy().sum(-1).mean()
# PPO clipped objective
ratio = torch.exp(new_log_probs - flat_old_lp)
surr1 = ratio * flat_adv
surr2 = torch.clamp(
ratio,
1.0 - ppo_cfg["clip_range"],
1.0 + ppo_cfg["clip_range"],
) * flat_adv
policy_loss = -torch.min(surr1, surr2).mean()
value_loss = 0.5 * (flat_returns - values.squeeze(-1)).pow(2).mean()
loss = (
policy_loss
+ ppo_cfg["value_coef"] * value_loss
- ppo_cfg["entropy_coef"] * entropy
)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(all_params, ppo_cfg["max_grad_norm"])
optimizer.step()
# ----- Curriculum update -----
mean_reward = torch.stack(rollout["rewards"]).mean().item()
        curriculum.update(
            mean_reward, threshold=config["curriculum"]["reward_threshold"]
        )
# ----- Logging -----
if total_steps % (config["training"]["log_interval"] * num_envs * episode_length) == 0:
print(
f"Steps: {total_steps:>10d} | "
f"Reward: {mean_reward:.3f} | "
f"Curriculum: {curriculum.current_level}/{curriculum.max_level} | "
f"Policy Loss: {policy_loss.item():.4f}"
)
# ----- Save checkpoint -----
if total_steps % (config["training"]["save_interval"] * num_envs * episode_length) == 0:
torch.save({
"encoder": encoder.state_dict(),
"policy": policy.state_dict(),
"optimizer": optimizer.state_dict(),
"total_steps": total_steps,
"curriculum_level": curriculum.current_level,
}, f"checkpoints/fame_step_{total_steps}.pt")
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
"""Generalized Advantage Estimation."""
advantages = []
gae = torch.zeros_like(rewards[0])
next_value = last_value
for t in reversed(range(len(rewards))):
delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
gae = delta + gamma * lam * (1 - dones[t]) * gae
advantages.insert(0, gae)
next_value = values[t]
return advantages
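The recursion is easy to hand-check. Here is a scalar re-implementation mirroring `compute_gae` above (dones omitted for brevity), run on a two-step toy episode:

```python
def gae_scalar(rewards, values, last_value, gamma, lam):
    """Scalar GAE: same backward recursion as compute_gae, without done masks."""
    advantages = []
    gae, next_value = 0.0, last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v   # TD residual
        gae = delta + gamma * lam * gae      # discounted sum of residuals
        advantages.insert(0, gae)
        next_value = v
    return advantages

# With gamma = lam = 0.5, zero values and unit rewards:
# t=1: delta = 1,  A_1 = 1
# t=0: delta = 1,  A_0 = 1 + (0.5 * 0.5) * 1 = 1.25
adv = gae_scalar([1.0, 1.0], [0.0, 0.0], 0.0, gamma=0.5, lam=0.5)
print(adv)  # [1.25, 1.0]
```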
8. Sim-to-Real Transfer for H1-2 / G1
The deployment procedure for real hardware:
class FAMERealDeployment:
"""
Deploy FAME policy on Unitree H1-2 / G1.
Procedure: Load checkpoint -> Run inference at 50Hz -> Send joint targets.
"""
def __init__(
self,
checkpoint_path: str,
robot_type: str = "h1_2", # "h1_2" or "g1"
control_freq: float = 50.0,
):
self.device = torch.device("cpu") # CPU inference for low latency
self.dt = 1.0 / control_freq
# Load trained models
ckpt = torch.load(checkpoint_path, map_location=self.device)
self.encoder = ForceContextEncoder(
num_upper_joints=10, latent_dim=32
).to(self.device)
self.encoder.load_state_dict(ckpt["encoder"])
self.encoder.eval()
self.policy = FAMEPolicy(
base_obs_dim=45, latent_dim=32, action_dim=10
).to(self.device)
self.policy.load_state_dict(ckpt["policy"])
self.policy.eval()
self.force_estimator = ForceEstimator(
robot_model=load_robot_model(robot_type), dt=self.dt
)
self.hidden_state = None
# Action scaling for hardware
self.action_scale = torch.tensor([
0.25, 0.25, 0.25, 0.25, 0.25, # left leg
0.25, 0.25, 0.25, 0.25, 0.25, # right leg
])
@torch.no_grad()
def step(self, robot_state: dict) -> torch.Tensor:
"""
One inference step. Called at 50Hz.
Args:
robot_state: dict with keys:
- joint_pos: (num_joints,)
- joint_vel: (num_joints,)
- joint_torque: (num_joints,)
- imu_quat: (4,)
- imu_angular_vel: (3,)
- projected_gravity: (3,)
"""
obs = self._build_observation(robot_state)
estimated_forces = self.force_estimator.estimate(
robot_state["joint_pos"].unsqueeze(0),
robot_state["joint_vel"].unsqueeze(0),
robot_state["joint_torque"].unsqueeze(0),
)
        upper_joints = robot_state["joint_pos"][10:20].unsqueeze(0)  # upper-body joints
latent_z, self.hidden_state = self.encoder(
upper_joints, estimated_forces, self.hidden_state
)
# Deterministic action (use mean, no sampling)
action_dist, _ = self.policy(obs.unsqueeze(0), latent_z)
        action = action_dist.mean.squeeze(0)
        action = action.clamp(-1.0, 1.0)     # clamp the raw action first
        action = action * self.action_scale  # then scale to joint-target offsets
return action
def _build_observation(self, state: dict) -> torch.Tensor:
"""Build observation vector from robot state."""
return torch.cat([
state["joint_pos"][:10], # lower-body joints (10D)
state["joint_vel"][:10], # lower-body velocities (10D)
state["joint_pos"][10:20], # upper-body joints (10D)
state["imu_quat"], # orientation (4D)
state["imu_angular_vel"], # angular velocity (3D)
state["projected_gravity"], # gravity in body frame (3D)
torch.zeros(5), # command velocity placeholder
])
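Calling `step` "at 50Hz" requires a real-time loop on the host. A minimal deadline-based scheduler looks like this (generic sketch; the `policy_step` callable is a stand-in for one inference-and-send cycle):

```python
import time

def run_control_loop(policy_step, control_freq=50.0, num_steps=5):
    """Call policy_step at a fixed rate using absolute-deadline scheduling.

    Sleeping until an absolute deadline, rather than sleep(dt) each cycle,
    prevents inference-time jitter from accumulating into rate drift.
    """
    dt = 1.0 / control_freq
    next_deadline = time.monotonic() + dt
    for _ in range(num_steps):
        policy_step()                          # read state, infer, send targets
        sleep_time = next_deadline - time.monotonic()
        if sleep_time > 0:
            time.sleep(sleep_time)             # wait out the rest of the cycle
        next_deadline += dt                    # advance the absolute deadline

start = time.monotonic()
run_control_loop(lambda: None, control_freq=50.0, num_steps=5)
elapsed = time.monotonic() - start
print(f"5 cycles in {elapsed * 1000:.0f} ms")  # ~100 ms at 50 Hz
```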
Results and Analysis
| Method | Standing Success (%) | Recovery Time (s) |
|---|---|---|
| Base Policy | 29.44 | N/A |
| Curriculum Only | 51.40 | 2.8 |
| FAME (full) | 73.84 | 1.2 |
Notably, FAME doesn't just maintain balance better — it also recovers faster: 1.2 seconds vs 2.8 seconds. The policy "understands" what forces are acting, enabling proactive compensation rather than waiting for instability and then correcting.
Key Takeaways for Robotics RL Practitioners
1. Observation Design Matters More Than Algorithm Choice
FAME uses plain PPO with no algorithmic modifications. The difference lies entirely in the observation space: adding latent context about forces and upper-body configuration. This is a broadly applicable lesson: before reaching for a fancier algorithm, improve your observations first.
2. Sensor-Free Estimation Can Be Good Enough
Six-axis force/torque sensors for humanoids are expensive (thousands of USD each). FAME demonstrates that estimation from the dynamics model plus joint torque readings yields results good enough for policy training. This significantly reduces deployment cost.
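The idea behind sensor-free estimation is easiest to see in one dimension. For a single horizontal link held static, the inertial and Coriolis terms vanish, so any torque residual over the gravity model maps directly to a tip force through the Jacobian (illustrative numbers; a static 1-DOF reduction of the estimator above):

```python
import math

# One revolute joint, link length l, center of mass at l/2, held static
# at angle q: M(q)*qddot and Coriolis terms are zero, only gravity remains.
m_link, l, q, g = 2.0, 0.5, 0.0, 9.81

tau_gravity = m_link * g * (l / 2) * math.cos(q)  # model-predicted torque
tau_measured = 9.905                              # e.g. from the joint driver

# Attribute the residual to an external vertical force at the tip.
tau_external = tau_measured - tau_gravity
# Tip height z = l*sin(q)  =>  Jacobian dz/dq = l*cos(q);  tau = J^T * F_z
f_z = tau_external / (l * math.cos(q))

print(f"estimated external force: {f_z:.1f} N")  # 10.0 N
```

The full-robot version in `ForceEstimator` is the same idea with the mass matrix and Coriolis terms restored and the Jacobian transposed per arm.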
3. Curriculum + Conditioning > Curriculum Alone
Curriculum learning makes robots more robust, but combining curriculum with explicit conditioning (latent context) yields much better results (73.84% vs 51.40%). The policy needs to know the current conditions, not just be "accustomed" to diverse conditions.
Connections to Related Research
FAME sits within the broader trend of loco-manipulation — combining locomotion and manipulation for humanoids. For further reading:
- RL for Robotics: PPO, SAC and Algorithm Selection — essential RL foundations
- Humanoid Locomotion: From MPC to RL — comparison of control approaches
- Loco-Manipulation for Humanoid Robots — field overview
- Sim-to-Real RL for Dexterous Manipulation on Humanoids — related manipulation paper
References
- FAME: Force-Adaptive Policy with Implicit Estimation for Versatile Humanoid Locomotion — Pudasaini, N., Zhang, Y., Lavering, J., Roncone, A., Correll, N., CU Boulder, arXiv 2603.08961, March 2026
- Proximal Policy Optimization Algorithms — Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., OpenAI, 2017
- Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M., Science Robotics, 2019