From Simulation to Real Robot: The Hardest Problem in RL
Sim-to-real transfer has always been the bottleneck of robot learning. You can train a seemingly perfect policy in simulation (99% success rate, smooth motions, grasping everything), yet the moment it runs on a real robot it fails catastrophically. Why? Because simulation never matches the real world exactly: friction differs, inertia differs, delays differ, sensor noise differs.
The paper Sim-to-Real Reinforcement Learning for Humanoid Dexterous Manipulation (Feb 2025) provides a practical recipe for sim-to-real humanoid manipulation, solving 3 specific tasks: grasp-and-reach, box lift, and bimanual handover.
What makes it special: the paper goes beyond standard domain randomization by adding an automated real-to-sim tuning module that adjusts simulation parameters to match the real robot.
Pipeline Overview
The pipeline consists of 4 main stages:
- Real-to-Sim Tuning: Automatically adjust simulation parameters to match the real robot
- Specialist Training: Train specialized policies for each task (grasp, lift, handover)
- Policy Distillation: Combine specialists into a single generalist policy
- Sim-to-Real Deployment: Domain randomization + transfer to hardware
Why Divide-and-Conquer?
Directly training a single policy for all tasks is difficult because:
- Reward conflict: grasp reward can conflict with handover reward
- Exploration difficulty: the policy must discover strategies for all 3 tasks simultaneously
- Training instability: gradients from multiple tasks can cancel each other
Solution: train a specialist for each task separately, then distill into one generalist. Specialists are easier to train; the generalist inherits knowledge from all specialists.
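The gradient-cancellation problem can be seen in a toy scalar example (illustrative only, not from the paper): when two tasks prefer opposite values of a shared parameter, their summed gradient vanishes at the compromise point even though neither task is solved.

```python
def grad_task(theta: float, target: float) -> float:
    """Gradient of a quadratic per-task loss (theta - target) ** 2."""
    return 2.0 * (theta - target)

# Task A prefers theta = -1, task B prefers theta = +1. At theta = 0 the
# multi-task gradient is exactly zero, so joint training stalls, while each
# task alone still has a strong gradient toward its own optimum.
theta = 0.0
combined = grad_task(theta, -1.0) + grad_task(theta, 1.0)
print(combined)  # 0.0
```

Specialists avoid this trap entirely; the distillation step then merges their behaviors without ever mixing their gradients.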
1. Real-to-Sim Parameter Identification
The automated tuning module compares real trajectories with simulated ones, optimizing physics parameters:
import torch
import numpy as np
from scipy.optimize import minimize
class RealToSimTuner:
"""
Automated real-to-sim parameter identification.
Collect trajectories from real robot -> optimize simulation parameters
so that simulation reproduces those trajectories.
"""
# Parameters to tune
PARAM_NAMES = [
"joint_friction", # Joint friction (Nm)
"joint_damping", # Damping (Nm*s/rad)
"joint_armature", # Rotor inertia reflected (kg*m^2)
"contact_friction", # Contact friction
"contact_restitution", # Bounce coefficient
"actuator_kp", # PD gains
"actuator_kd",
"com_offset_x", # Center of mass offset (m)
"com_offset_y",
"com_offset_z",
"mass_scale", # Real/model mass ratio
]
def __init__(
self,
sim_env,
real_trajectories: list[dict],
num_iterations: int = 200,
):
"""
Args:
sim_env: Isaac Lab environment instance
real_trajectories: list of dicts with keys:
- joint_pos: (T, num_joints) — measured positions
- joint_vel: (T, num_joints) — measured velocities
- joint_torque: (T, num_joints) — commanded torques
- timestamps: (T,) — time stamps
"""
self.sim_env = sim_env
self.real_trajs = real_trajectories
self.num_iterations = num_iterations
# Initial parameter guess (from URDF defaults)
self.initial_params = self._get_default_params()
def _get_default_params(self) -> np.ndarray:
"""Load default parameters from URDF/simulation."""
return np.array([
0.05, # joint_friction (Nm)
0.5, # joint_damping (Nm*s/rad)
0.01, # joint_armature (kg*m^2)
0.8, # contact_friction
0.0, # contact_restitution
200.0, # actuator_kp
20.0, # actuator_kd
0.0, # com_offset_x
0.0, # com_offset_y
0.0, # com_offset_z
1.0, # mass_scale
])
def _simulate_trajectory(
self, params: np.ndarray, real_traj: dict
) -> dict:
"""
Replay torque commands in simulation with given params.
Returns simulated trajectory.
"""
# Apply parameters to simulation
self.sim_env.set_physics_params({
"joint_friction": params[0],
"joint_damping": params[1],
"joint_armature": params[2],
"contact_friction": params[3],
"contact_restitution": params[4],
"actuator_kp": params[5],
"actuator_kd": params[6],
"com_offset": params[7:10],
"mass_scale": params[10],
})
# Reset simulation to initial state from real trajectory
self.sim_env.reset()
self.sim_env.set_joint_state(
real_traj["joint_pos"][0],
real_traj["joint_vel"][0],
)
# Replay torque commands
sim_positions = []
sim_velocities = []
for t in range(len(real_traj["joint_torque"])):
self.sim_env.apply_torque(real_traj["joint_torque"][t])
self.sim_env.step()
sim_positions.append(self.sim_env.get_joint_pos())
sim_velocities.append(self.sim_env.get_joint_vel())
return {
"joint_pos": np.stack(sim_positions),
"joint_vel": np.stack(sim_velocities),
}
def _objective(self, params: np.ndarray) -> float:
"""
Objective function: total MSE between sim and real trajectories.
"""
total_error = 0.0
for real_traj in self.real_trajs:
sim_traj = self._simulate_trajectory(params, real_traj)
# Position error (weighted higher)
pos_error = np.mean(
(sim_traj["joint_pos"] - real_traj["joint_pos"]) ** 2
)
# Velocity error
vel_error = np.mean(
(sim_traj["joint_vel"] - real_traj["joint_vel"]) ** 2
)
total_error += 10.0 * pos_error + 1.0 * vel_error
return total_error / len(self.real_trajs)
def tune(self) -> dict:
"""
Run optimization to find best simulation parameters.
Returns: dict of tuned parameters.
"""
# Parameter bounds (physical constraints)
bounds = [
(0.001, 1.0), # joint_friction
(0.01, 5.0), # joint_damping
(0.001, 0.1), # joint_armature
(0.1, 2.0), # contact_friction
(0.0, 0.5), # contact_restitution
(50.0, 500.0), # actuator_kp
(5.0, 100.0), # actuator_kd
(-0.05, 0.05), # com_offset_x
(-0.05, 0.05), # com_offset_y
(-0.05, 0.05), # com_offset_z
(0.8, 1.2), # mass_scale
]
        # Note: L-BFGS-B estimates gradients by finite differences, which costs
        # extra simulation rollouts per parameter per step; for noisy simulators
        # a gradient-free optimizer (e.g. CMA-ES) is often more robust.
        result = minimize(
self._objective,
self.initial_params,
method="L-BFGS-B",
bounds=bounds,
options={"maxiter": self.num_iterations, "disp": True},
)
tuned_params = result.x
print(f"[Real-to-Sim] Final error: {result.fun:.6f}")
print(f"[Real-to-Sim] Tuned params:")
for name, val in zip(self.PARAM_NAMES, tuned_params):
print(f" {name}: {val:.4f}")
return dict(zip(self.PARAM_NAMES, tuned_params))
2. Reward Function Design for 3 Tasks
Each task has its own reward function, following a generalized formulation based on contact and object goals:
import torch
class ManipulationRewards:
"""
Generalized reward formulation for dexterous manipulation.
Based on contact goals and object goals.
"""
@staticmethod
def grasp_and_reach(
hand_pos: torch.Tensor,
object_pos: torch.Tensor,
target_pos: torch.Tensor,
contact_force: torch.Tensor,
finger_positions: torch.Tensor,
object_in_hand: torch.Tensor,
) -> dict[str, torch.Tensor]:
"""
Reward for grasp-and-reach task.
Phase 1: Approach object -> Phase 2: Grasp -> Phase 3: Move to target.
"""
# Phase 1: Approach — hand moves toward object
hand_to_obj = torch.norm(hand_pos - object_pos, dim=-1)
r_approach = torch.exp(-5.0 * hand_to_obj)
# Phase 2: Grasp — sufficient contact force + fingers wrap object
contact_magnitude = torch.norm(contact_force, dim=-1)
r_contact = torch.clamp(contact_magnitude / 10.0, 0, 1)
# Finger closure metric: fingers close together = tighter grasp
finger_spread = torch.std(finger_positions, dim=-1)
r_closure = torch.exp(-3.0 * finger_spread)
# Phase 3: Reach — object reaches target
obj_to_target = torch.norm(object_pos - target_pos, dim=-1)
r_reach = torch.exp(-3.0 * obj_to_target) * object_in_hand
# Bonus for task completion (object at target)
success = (obj_to_target < 0.05).float() * object_in_hand
r_success = success * 10.0
# Penalties
r_energy = -0.001 * contact_magnitude # Avoid excessive squeezing
total = (
0.2 * r_approach
+ 0.2 * r_contact
+ 0.1 * r_closure
+ 0.3 * r_reach
+ r_success
+ r_energy
)
return {
"total": total,
"approach": r_approach,
"contact": r_contact,
"closure": r_closure,
"reach": r_reach,
"success": success,
}
@staticmethod
def box_lift(
left_hand_pos: torch.Tensor,
right_hand_pos: torch.Tensor,
box_pos: torch.Tensor,
box_quat: torch.Tensor,
target_height: float,
left_contact: torch.Tensor,
right_contact: torch.Tensor,
) -> dict[str, torch.Tensor]:
"""
Reward for bimanual box lift.
Both hands must coordinate to lift the box to target height.
"""
# Approach: both hands near box
left_dist = torch.norm(left_hand_pos - box_pos, dim=-1)
right_dist = torch.norm(right_hand_pos - box_pos, dim=-1)
r_approach = torch.exp(-3.0 * (left_dist + right_dist))
# Contact: both hands touching box
left_contact_mag = torch.norm(left_contact, dim=-1)
right_contact_mag = torch.norm(right_contact, dim=-1)
r_bimanual_contact = (
torch.clamp(left_contact_mag / 5.0, 0, 1)
* torch.clamp(right_contact_mag / 5.0, 0, 1)
)
# Lift: box height approaches target
height_error = torch.abs(box_pos[:, 2] - target_height)
r_lift = torch.exp(-5.0 * height_error)
# Orientation: keep box upright (no tilting)
r_orient = torch.abs(box_quat[:, 0]) # w component near 1 = upright
# Symmetry: balanced forces between hands
force_diff = torch.abs(left_contact_mag - right_contact_mag)
r_symmetry = torch.exp(-2.0 * force_diff)
# Success
success = (height_error < 0.03).float() * (r_bimanual_contact > 0.5).float()
r_success = success * 10.0
total = (
0.15 * r_approach
+ 0.2 * r_bimanual_contact
+ 0.3 * r_lift
+ 0.1 * r_orient
+ 0.1 * r_symmetry
+ r_success
)
return {
"total": total,
"approach": r_approach,
"contact": r_bimanual_contact,
"lift": r_lift,
"orient": r_orient,
"symmetry": r_symmetry,
"success": success,
}
@staticmethod
def bimanual_handover(
giver_hand_pos: torch.Tensor,
receiver_hand_pos: torch.Tensor,
object_pos: torch.Tensor,
giver_contact: torch.Tensor,
receiver_contact: torch.Tensor,
phase: torch.Tensor, # 0=giver_grasp, 1=transfer, 2=receiver_grasp
) -> dict[str, torch.Tensor]:
"""
Reward for bimanual handover.
Hand A grasps object -> passes to hand B -> hand B receives.
"""
# Phase 0: Giver grasps object
giver_to_obj = torch.norm(giver_hand_pos - object_pos, dim=-1)
giver_contact_mag = torch.norm(giver_contact, dim=-1)
r_giver_grasp = (
torch.exp(-5.0 * giver_to_obj)
* torch.clamp(giver_contact_mag / 5.0, 0, 1)
)
# Phase 1: Move object to transfer zone (midpoint between hands)
midpoint = (giver_hand_pos + receiver_hand_pos) / 2
obj_to_mid = torch.norm(object_pos - midpoint, dim=-1)
r_transfer = torch.exp(-3.0 * obj_to_mid)
# Phase 2: Receiver grasps (contact) + Giver releases (no contact)
receiver_contact_mag = torch.norm(receiver_contact, dim=-1)
r_receiver_grasp = torch.clamp(receiver_contact_mag / 5.0, 0, 1)
r_giver_release = torch.exp(-2.0 * giver_contact_mag)
# Phase-weighted rewards
is_phase0 = (phase == 0).float()
is_phase1 = (phase == 1).float()
is_phase2 = (phase == 2).float()
total = (
is_phase0 * r_giver_grasp
+ is_phase1 * (0.5 * r_transfer + 0.3 * r_giver_grasp)
+ is_phase2 * (0.5 * r_receiver_grasp + 0.3 * r_giver_release)
)
# Success: receiver has object and giver released
success = (
(receiver_contact_mag > 3.0).float()
* (giver_contact_mag < 0.5).float()
)
total = total + success * 10.0
return {
"total": total,
"giver_grasp": r_giver_grasp,
"transfer": r_transfer,
"receiver_grasp": r_receiver_grasp,
"success": success,
}
3. Divide-and-Conquer: Specialist Training and Distillation
Specialist Training
Each specialist is a separate policy, trained with its corresponding reward function:
import torch
import torch.nn as nn
class DexterousManipulationPolicy(nn.Module):
"""
Policy for dexterous manipulation.
Observation space includes: proprioception + object state + tactile.
"""
def __init__(
self,
proprio_dim: int = 60, # joint pos (30) + joint vel (30)
object_dim: int = 13, # pos (3) + quat (4) + vel (3) + angular_vel (3)
tactile_dim: int = 24, # 12 fingertip sensors x 2 hands
action_dim: int = 30, # all upper-body + hand joints
hidden_dim: int = 512,
):
super().__init__()
total_obs = proprio_dim + object_dim + tactile_dim # 97
# Actor
self.actor = nn.Sequential(
nn.Linear(total_obs, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ELU(),
nn.Linear(hidden_dim // 2, action_dim),
nn.Tanh(), # Actions in [-1, 1]
)
# Critic
self.critic = nn.Sequential(
nn.Linear(total_obs, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, 1),
)
def forward(self, obs: torch.Tensor):
return self.actor(obs), self.critic(obs)
def get_action(self, obs: torch.Tensor, deterministic: bool = False):
mean = self.actor(obs)
if deterministic:
return mean
noise = torch.randn_like(mean) * 0.1
return (mean + noise).clamp(-1, 1)
def train_specialist(
task_name: str,
env_config: dict,
reward_fn,
num_timesteps: int = 50_000_000,
):
"""
Train specialist policy for a single task.
"""
device = torch.device("cuda")
env = create_manipulation_env(task_name, env_config)
policy = DexterousManipulationPolicy(
proprio_dim=env.proprio_dim,
object_dim=env.object_dim,
tactile_dim=env.tactile_dim,
action_dim=env.action_dim,
).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
total_steps = 0
obs = env.reset()
while total_steps < num_timesteps:
actions = policy.get_action(obs)
next_obs, info = env.step(actions)
reward_dict = reward_fn(**info)
reward = reward_dict["total"]
        _, values = policy(obs)
        with torch.no_grad():
            _, next_values = policy(next_obs)
        # One-step TD advantage (simplified; a full implementation would use PPO/GAE)
        advantage = reward + 0.99 * next_values.squeeze(-1) - values.squeeze(-1)
        # Policy gradient weights the log-probability of the action actually
        # taken, not the action values themselves
        dist = torch.distributions.Normal(policy.actor(obs), 0.1)
        log_prob = dist.log_prob(actions).sum(dim=-1)
        policy_loss = -(advantage.detach() * log_prob).mean()
        value_loss = advantage.pow(2).mean()
        loss = policy_loss + 0.5 * value_loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
optimizer.step()
obs = next_obs
total_steps += env.num_envs
        # total_steps grows by num_envs per iteration, so exact multiples of 1M
        # may be skipped; check for a boundary crossing instead
        if total_steps % 1_000_000 < env.num_envs:
success_rate = reward_dict["success"].mean().item()
print(
f"[{task_name}] Steps: {total_steps:>10d} | "
f"Success: {success_rate:.1%} | "
f"Reward: {reward.mean().item():.3f}"
)
return policy
# Train 3 specialists
grasp_specialist = train_specialist(
"grasp_and_reach",
env_config={"object_type": "diverse", "randomize": True},
reward_fn=ManipulationRewards.grasp_and_reach,
)
lift_specialist = train_specialist(
"box_lift",
env_config={"box_sizes": [(0.1, 0.1, 0.1), (0.15, 0.15, 0.15)], "randomize": True},
reward_fn=ManipulationRewards.box_lift,
)
handover_specialist = train_specialist(
"bimanual_handover",
env_config={"object_type": "diverse", "randomize": True},
reward_fn=ManipulationRewards.bimanual_handover,
)
Policy Distillation: Specialists to Generalist
import torch
import torch.nn as nn
import torch.nn.functional as F
class GeneralistPolicy(nn.Module):
"""
Generalist policy — distilled from multiple specialists.
Uses task embedding to distinguish tasks.
"""
def __init__(
self,
obs_dim: int = 97,
action_dim: int = 30,
num_tasks: int = 3,
task_embed_dim: int = 16,
hidden_dim: int = 512,
):
super().__init__()
self.task_embedding = nn.Embedding(num_tasks, task_embed_dim)
total_input = obs_dim + task_embed_dim
self.actor = nn.Sequential(
nn.Linear(total_input, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ELU(),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ELU(),
nn.Linear(hidden_dim // 2, action_dim),
nn.Tanh(),
)
def forward(self, obs: torch.Tensor, task_id: torch.Tensor):
task_embed = self.task_embedding(task_id)
x = torch.cat([obs, task_embed], dim=-1)
return self.actor(x)
def distill_specialists(
specialists: list[nn.Module],
task_envs: list,
num_distill_steps: int = 10_000_000,
temperature: float = 1.0,
):
"""
Distill multiple specialist policies into a single generalist.
Args:
specialists: list of trained specialist policies
task_envs: list of environments (one per task)
temperature: distillation temperature
"""
device = torch.device("cuda")
num_tasks = len(specialists)
# Freeze specialists
for spec in specialists:
spec.eval()
for p in spec.parameters():
p.requires_grad = False
generalist = GeneralistPolicy(
obs_dim=task_envs[0].obs_dim,
action_dim=task_envs[0].action_dim,
num_tasks=num_tasks,
).to(device)
optimizer = torch.optim.Adam(generalist.parameters(), lr=1e-4)
total_steps = 0
envs_obs = [env.reset() for env in task_envs]
while total_steps < num_distill_steps:
total_loss = 0.0
for task_id, (specialist, env, obs) in enumerate(
zip(specialists, task_envs, envs_obs)
):
# Teacher action (from specialist)
with torch.no_grad():
teacher_action = specialist.get_action(obs, deterministic=True)
# Student action (from generalist)
task_ids = torch.full(
(obs.shape[0],), task_id, dtype=torch.long, device=device
)
student_action = generalist(obs, task_ids)
# Distillation loss
distill_loss = F.mse_loss(student_action, teacher_action)
next_obs, _ = env.step(teacher_action)
envs_obs[task_id] = next_obs
total_loss += distill_loss
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(generalist.parameters(), 1.0)
optimizer.step()
total_steps += sum(env.num_envs for env in task_envs)
        # steps grow by the summed env count, so check for a boundary crossing
        if total_steps % 500_000 < sum(env.num_envs for env in task_envs):
print(
f"[Distill] Steps: {total_steps:>10d} | "
f"Loss: {total_loss.item():.6f}"
)
return generalist
4. Domain Randomization Configuration
DOMAIN_RANDOMIZATION = {
# Physics parameters
"physics": {
"friction": {
"range": [0.3, 2.0],
"operation": "scaling", # Multiply with default value
},
"restitution": {
"range": [0.0, 0.3],
"operation": "absolute",
},
"mass": {
"range": [0.85, 1.15],
"operation": "scaling",
},
"com_displacement": {
"range": [-0.03, 0.03], # meters
"operation": "additive",
"per_axis": True,
},
},
# Actuator parameters
"actuator": {
"kp": {"range": [0.7, 1.3], "operation": "scaling"},
"kd": {"range": [0.7, 1.3], "operation": "scaling"},
"torque_limit": {"range": [0.85, 1.0], "operation": "scaling"},
"action_delay": {"range": [0, 3], "operation": "absolute", "unit": "steps"},
},
# Sensor noise
"sensor": {
"joint_pos_noise": {"std": 0.01, "unit": "rad"},
"joint_vel_noise": {"std": 0.1, "unit": "rad/s"},
"imu_quat_noise": {"std": 0.005},
"imu_gyro_noise": {"std": 0.02, "unit": "rad/s"},
},
# Object properties (for manipulation)
"object": {
"mass": {"range": [0.1, 2.0], "unit": "kg"},
"size_scale": {"range": [0.8, 1.2]},
"friction": {"range": [0.3, 1.5]},
"initial_pos_noise": {"std": 0.02, "unit": "m"},
"initial_rot_noise": {"std": 0.1, "unit": "rad"},
},
# Visual randomization (for camera-based policies)
"visual": {
"light_intensity": {"range": [0.5, 2.0]},
"light_direction": {"range": [-30, 30], "unit": "deg"},
"texture_randomize": True,
"camera_fov_noise": {"std": 2.0, "unit": "deg"},
},
}
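The `operation` field in each entry determines how a sampled value combines with the nominal one. A minimal sampler for the three operations (the function is my own sketch of how such a config is typically consumed, not code from the paper):

```python
import random

def apply_randomization(nominal: float, spec: dict) -> float:
    """Apply one randomization entry: scale, offset, or replace the nominal value."""
    lo, hi = spec["range"]
    sample = random.uniform(lo, hi)
    op = spec["operation"]
    if op == "scaling":      # multiply the nominal value
        return nominal * sample
    if op == "additive":     # offset the nominal value
        return nominal + sample
    if op == "absolute":     # replace the nominal value outright
        return sample
    raise ValueError(f"unknown operation: {op}")

# e.g. contact friction with nominal 0.8 and scaling range [0.3, 2.0]
friction = apply_randomization(0.8, {"range": [0.3, 2.0], "operation": "scaling"})
assert 0.8 * 0.3 <= friction <= 0.8 * 2.0
```

In practice one such sample is drawn per environment per episode, so a batch of thousands of parallel environments covers the whole range simultaneously.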
5. Hybrid Object Representation
The paper uses a hybrid representation combining multiple modalities for object state:
import torch
import torch.nn as nn
class HybridObjectEncoder(nn.Module):
"""
Hybrid object representation with modality-specific augmentation.
Combines: proprioceptive + tactile + visual features.
"""
def __init__(
self,
proprio_dim: int = 13, # object pos (3) + quat (4) + vel (6)
tactile_dim: int = 24, # fingertip sensors
visual_dim: int = 64, # CNN features (if using camera)
output_dim: int = 32,
use_visual: bool = False,
):
super().__init__()
self.use_visual = use_visual
# Proprioceptive encoder
self.proprio_encoder = nn.Sequential(
nn.Linear(proprio_dim, 64),
nn.LayerNorm(64),
nn.ELU(),
nn.Linear(64, 32),
)
# Tactile encoder
self.tactile_encoder = nn.Sequential(
nn.Linear(tactile_dim, 64),
nn.LayerNorm(64),
nn.ELU(),
nn.Linear(64, 32),
)
# Visual encoder (optional)
if use_visual:
self.visual_encoder = nn.Sequential(
nn.Linear(visual_dim, 64),
nn.LayerNorm(64),
nn.ELU(),
nn.Linear(64, 32),
)
fusion_dim = 32 * 3
else:
fusion_dim = 32 * 2
# Fusion layer
self.fusion = nn.Sequential(
nn.Linear(fusion_dim, 64),
nn.ELU(),
nn.Linear(64, output_dim),
)
# Augmentation parameters
self.proprio_noise_std = 0.01
self.tactile_noise_std = 0.05
self.dropout = nn.Dropout(0.1)
def forward(
self,
proprio: torch.Tensor,
tactile: torch.Tensor,
visual: torch.Tensor | None = None,
augment: bool = False,
) -> torch.Tensor:
"""
Encode object state from multiple modalities.
Args:
proprio: (batch, proprio_dim) — object pose + velocity
tactile: (batch, tactile_dim) — fingertip sensor readings
visual: (batch, visual_dim) — CNN features (optional)
augment: whether to apply modality-specific augmentation
"""
if augment and self.training:
proprio = proprio + torch.randn_like(proprio) * self.proprio_noise_std
tactile = tactile + torch.randn_like(tactile) * self.tactile_noise_std
if torch.rand(1).item() < 0.1:
tactile = torch.zeros_like(tactile)
proprio_feat = self.proprio_encoder(proprio)
tactile_feat = self.tactile_encoder(tactile)
if self.use_visual and visual is not None:
visual_feat = self.visual_encoder(visual)
fused = torch.cat([proprio_feat, tactile_feat, visual_feat], dim=-1)
else:
fused = torch.cat([proprio_feat, tactile_feat], dim=-1)
fused = self.dropout(fused) if augment else fused
output = self.fusion(fused)
return output
6. Complete Deployment Script
import torch
import time
import numpy as np
class Sim2RealDeployer:
"""
Deploy trained policy on real robot hardware.
Handles: observation processing, action filtering, safety checks.
"""
def __init__(
self,
policy_path: str,
robot_interface,
control_freq: float = 30.0,
action_smoothing: float = 0.3, # EMA smoothing factor
):
self.device = torch.device("cpu")
self.dt = 1.0 / control_freq
self.alpha = action_smoothing
ckpt = torch.load(policy_path, map_location=self.device)
self.policy = GeneralistPolicy(
obs_dim=ckpt["obs_dim"],
action_dim=ckpt["action_dim"],
).to(self.device)
self.policy.load_state_dict(ckpt["policy"])
self.policy.eval()
self.robot = robot_interface
self.prev_action = None
# Safety limits
self.max_joint_delta = 0.1 # rad per step
self.torque_limit = 50.0 # Nm
self.obs_mean = ckpt.get("obs_mean", torch.zeros(ckpt["obs_dim"]))
self.obs_std = ckpt.get("obs_std", torch.ones(ckpt["obs_dim"]))
def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
"""Normalize observation using training statistics."""
return (obs - self.obs_mean) / (self.obs_std + 1e-8)
def _smooth_action(self, action: torch.Tensor) -> torch.Tensor:
"""EMA smoothing to prevent jerky motion."""
if self.prev_action is None:
self.prev_action = action
return action
smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
self.prev_action = smoothed
return smoothed
def _safety_check(self, action: torch.Tensor, current_pos: torch.Tensor) -> torch.Tensor:
"""Clamp action to safety limits."""
delta = action - current_pos
delta = delta.clamp(-self.max_joint_delta, self.max_joint_delta)
return current_pos + delta
@torch.no_grad()
def run(self, task_id: int, duration: float = 30.0):
"""
Run policy on real robot.
Args:
task_id: 0=grasp, 1=lift, 2=handover
duration: seconds to run
"""
print(f"[Deploy] Starting task {task_id} for {duration}s")
print("[Deploy] Press Ctrl+C to stop")
task_tensor = torch.tensor([task_id], dtype=torch.long)
start_time = time.time()
step_count = 0
try:
while time.time() - start_time < duration:
loop_start = time.time()
state = self.robot.get_state()
obs = self._build_observation(state)
obs = self._normalize_obs(obs)
action = self.policy(obs.unsqueeze(0), task_tensor).squeeze(0)
action = self._smooth_action(action)
action = self._safety_check(action, state["joint_pos"])
self.robot.set_joint_targets(action.numpy())
step_count += 1
if step_count % 100 == 0:
elapsed = time.time() - start_time
print(
f"[Deploy] Step {step_count} | "
f"Time: {elapsed:.1f}s | "
f"Action norm: {action.norm():.3f}"
)
elapsed_step = time.time() - loop_start
if elapsed_step < self.dt:
time.sleep(self.dt - elapsed_step)
except KeyboardInterrupt:
print("[Deploy] Stopped by user")
finally:
print("[Deploy] Moving to safe pose...")
self.robot.move_to_default(duration=3.0)
print("[Deploy] Done")
def _build_observation(self, state: dict) -> torch.Tensor:
"""Build observation tensor from robot state dict."""
return torch.cat([
state["joint_pos"],
state["joint_vel"],
state["object_pos"],
state["object_quat"],
state["object_vel"],
state["object_angular_vel"],
state["tactile_left"],
state["tactile_right"],
])
Experimental Results
| Task | Sim Success (%) | Real Success (%) | Sim-to-Real Gap |
|---|---|---|---|
| Grasp-and-Reach | 92.3 | 78.5 | 13.8% |
| Box Lift | 87.1 | 71.2 | 15.9% |
| Bimanual Handover | 83.6 | 64.8 | 18.8% |
Key observations:
- The sim-to-real gap is roughly 14-19 percentage points, much smaller than for methods without real-to-sim tuning, where gaps of 30-50 points are typical
- Bimanual handover is the hardest task because it requires precisely timed coordination between the two hands
- Real-to-sim tuning reduces the gap by roughly 10 percentage points compared to domain randomization alone
Key Lessons
1. Real-to-Sim > Domain Randomization Alone
Domain randomization is necessary but not sufficient. If simulation is too different from reality, no amount of randomization will produce good transfer. Real-to-sim tuning moves the simulation closer to reality first, then randomizes around that more accurate baseline.
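The interaction of the two techniques can be made concrete: randomization ranges are re-centered on the tuned values rather than the URDF defaults. A toy sketch (the tuned value and range width below are illustrative, not from the paper):

```python
import random

def randomize_around(tuned_value: float, rel_width: float = 0.2) -> float:
    """Sample uniformly in a band of +/- rel_width around the tuned value."""
    lo = tuned_value * (1.0 - rel_width)
    hi = tuned_value * (1.0 + rel_width)
    return random.uniform(lo, hi)

# Suppose real-to-sim tuning found joint friction 0.12 Nm (hypothetical)
# instead of the URDF default 0.05 Nm. Randomizing +/-20% around 0.12 keeps
# every training sample near reality; randomizing around 0.05 would not.
friction_sample = randomize_around(0.12)
assert 0.12 * 0.8 <= friction_sample <= 0.12 * 1.2
```

The policy still sees variation (so it stays robust), but the variation is now centered on the real robot's physics.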
2. Divide-and-Conquer Reduces Training Time 3-5x
Training a single generalist directly takes 200M+ steps and still tends to underperform. Training 3 specialists (50M steps each) plus distillation (10M) totals 160M steps, and because the specialists are independent they can train in parallel, so the critical path is only about 60M steps versus 200M+ (roughly a 3x wall-clock reduction on three machines). Each specialist also optimizes a single clear objective, which improves final success rates.
3. Action Smoothing Is Mandatory for Hardware
In simulation, actions can change abruptly every step. On real hardware, this creates jerk that damages motors and reduces lifespan. EMA smoothing (alpha=0.3) is a simple but critical trick for deployment.
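The effect of the EMA filter used in the deployer is easy to see in isolation: with alpha = 0.3, a sudden unit step in the commanded action reaches the hardware only gradually.

```python
def ema_smooth(actions, alpha=0.3):
    """Exponential moving average over an action sequence (as in _smooth_action)."""
    smoothed, prev = [], None
    for a in actions:
        # First action passes through unchanged; afterwards blend with history
        prev = a if prev is None else alpha * a + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

# A raw step command 0 -> 1 becomes a gradual ramp on the hardware side.
raw = [0.0, 0.0, 1.0, 1.0, 1.0]
print([round(x, 3) for x in ema_smooth(raw)])  # [0.0, 0.0, 0.3, 0.51, 0.657]
```

Lower alpha means smoother but laggier motion; 0.3 trades a few control steps of lag for a bounded per-step change in the joint targets.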
Further Reading
- RL for Robotics: PPO, SAC and Algorithm Selection — RL foundations
- FAME: Force-Adaptive RL for Humanoid Manipulation — force-adaptive locomotion
- Imitation Learning for Robots — alternative approach
- VLA Models for Robots — vision-language-action models
Related Posts
- FAME: Force-Adaptive RL for Humanoid Manipulation
- RL for Robotics: PPO, SAC and Algorithm Selection
- Loco-Manipulation for Humanoid Robots
References
- Sim-to-Real Reinforcement Learning for Humanoid Dexterous Manipulation — arXiv 2502.20396, Feb 2025
- Sim-to-Real: Learning Agile Locomotion For Quadruped Robots — Tan, J., Zhang, T., Coumans, E., et al., RSS 2018
- Learning Dexterous In-Hand Manipulation — OpenAI, Andrychowicz, M., et al., IJRR 2020
- DexMV: Imitation Learning for Dexterous Manipulation from Human Videos — Qin, Y., et al., ECCV 2022