Sim-to-Real RL for Dexterous Manipulation on Humanoids

From Simulation to Real Robot: The Hardest Problem in RL

Sim-to-real transfer has always been the bottleneck of robot learning. You can train a perfect policy in simulation — 99% success rate, smooth motions, grasping everything — but when deployed on a real robot, it fails catastrophically. Why? Because simulation is never 100% accurate compared to the real world: friction differs, inertia differs, delays differ, sensor noise differs.

The paper Sim-to-Real Reinforcement Learning for Humanoid Dexterous Manipulation (Feb 2025) provides a practical recipe for sim-to-real humanoid manipulation, solving 3 specific tasks: grasp-and-reach, box lift, and bimanual handover.

What makes it special: the paper goes beyond "standard" domain randomization — it includes an automated real-to-sim tuning module that automatically adjusts simulation parameters to match the real robot.

Simulation environment for robot manipulation training

Pipeline Overview

The pipeline consists of 4 main stages:

Real-to-Sim Tuning: Automatically adjust simulation parameters to match the real robot
Specialist Training: Train specialized policies for each task (grasp, lift, handover)
Policy Distillation: Combine specialists into a single generalist policy
Sim-to-Real Deployment: Domain randomization + transfer to hardware

Why Divide-and-Conquer?

Directly training a single policy for all tasks is difficult because:

Reward conflict: grasp reward can conflict with handover reward
Exploration difficulty: the policy must discover strategies for all 3 tasks simultaneously
Training instability: gradients from multiple tasks can cancel each other

Solution: train a specialist for each task separately, then distill into one generalist. Specialists are easier to train; the generalist inherits knowledge from all specialists.

1. Real-to-Sim Parameter Identification

The automated tuning module compares real trajectories with simulated ones, optimizing physics parameters:

import torch
import numpy as np
from scipy.optimize import minimize

class RealToSimTuner:
    """
    Automated real-to-sim parameter identification.
    Collect trajectories from real robot -> optimize simulation parameters
    so that simulation reproduces those trajectories.
    """

    # Parameters to tune
    PARAM_NAMES = [
        "joint_friction",      # Joint friction (Nm)
        "joint_damping",       # Damping (Nm*s/rad)
        "joint_armature",      # Rotor inertia reflected (kg*m^2)
        "contact_friction",    # Contact friction
        "contact_restitution", # Bounce coefficient
        "actuator_kp",         # PD gains
        "actuator_kd",
        "com_offset_x",        # Center of mass offset (m)
        "com_offset_y",
        "com_offset_z",
        "mass_scale",          # Real/model mass ratio
    ]

    def __init__(
        self,
        sim_env,
        real_trajectories: list[dict],
        num_iterations: int = 200,
    ):
        """
        Args:
            sim_env: Isaac Lab environment instance
            real_trajectories: list of dicts with keys:
                - joint_pos: (T, num_joints) — measured positions
                - joint_vel: (T, num_joints) — measured velocities
                - joint_torque: (T, num_joints) — commanded torques
                - timestamps: (T,) — time stamps
        """
        self.sim_env = sim_env
        self.real_trajs = real_trajectories
        self.num_iterations = num_iterations

        # Initial parameter guess (from URDF defaults)
        self.initial_params = self._get_default_params()

    def _get_default_params(self) -> np.ndarray:
        """Load default parameters from URDF/simulation."""
        return np.array([
            0.05,   # joint_friction (Nm)
            0.5,    # joint_damping (Nm*s/rad)
            0.01,   # joint_armature (kg*m^2)
            0.8,    # contact_friction
            0.0,    # contact_restitution
            200.0,  # actuator_kp
            20.0,   # actuator_kd
            0.0,    # com_offset_x
            0.0,    # com_offset_y
            0.0,    # com_offset_z
            1.0,    # mass_scale
        ])

    def _simulate_trajectory(
        self, params: np.ndarray, real_traj: dict
    ) -> dict:
        """
        Replay torque commands in simulation with given params.
        Returns simulated trajectory.
        """
        # Apply parameters to simulation
        self.sim_env.set_physics_params({
            "joint_friction": params[0],
            "joint_damping": params[1],
            "joint_armature": params[2],
            "contact_friction": params[3],
            "contact_restitution": params[4],
            "actuator_kp": params[5],
            "actuator_kd": params[6],
            "com_offset": params[7:10],
            "mass_scale": params[10],
        })

        # Reset simulation to initial state from real trajectory
        self.sim_env.reset()
        self.sim_env.set_joint_state(
            real_traj["joint_pos"][0],
            real_traj["joint_vel"][0],
        )

        # Replay torque commands
        sim_positions = []
        sim_velocities = []

        for t in range(len(real_traj["joint_torque"])):
            self.sim_env.apply_torque(real_traj["joint_torque"][t])
            self.sim_env.step()
            sim_positions.append(self.sim_env.get_joint_pos())
            sim_velocities.append(self.sim_env.get_joint_vel())

        return {
            "joint_pos": np.stack(sim_positions),
            "joint_vel": np.stack(sim_velocities),
        }

    def _objective(self, params: np.ndarray) -> float:
        """
        Objective function: total MSE between sim and real trajectories.
        """
        total_error = 0.0

        for real_traj in self.real_trajs:
            sim_traj = self._simulate_trajectory(params, real_traj)

            # Position error (weighted higher)
            pos_error = np.mean(
                (sim_traj["joint_pos"] - real_traj["joint_pos"]) ** 2
            )

            # Velocity error
            vel_error = np.mean(
                (sim_traj["joint_vel"] - real_traj["joint_vel"]) ** 2
            )

            total_error += 10.0 * pos_error + 1.0 * vel_error

        return total_error / len(self.real_trajs)

    def tune(self) -> dict:
        """
        Run optimization to find best simulation parameters.
        Returns: dict of tuned parameters.
        """
        # Parameter bounds (physical constraints)
        bounds = [
            (0.001, 1.0),    # joint_friction
            (0.01, 5.0),     # joint_damping
            (0.001, 0.1),    # joint_armature
            (0.1, 2.0),      # contact_friction
            (0.0, 0.5),      # contact_restitution
            (50.0, 500.0),   # actuator_kp
            (5.0, 100.0),    # actuator_kd
            (-0.05, 0.05),   # com_offset_x
            (-0.05, 0.05),   # com_offset_y
            (-0.05, 0.05),   # com_offset_z
            (0.8, 1.2),      # mass_scale
        ]

        result = minimize(
            self._objective,
            self.initial_params,
            method="L-BFGS-B",
            bounds=bounds,
            options={"maxiter": self.num_iterations, "disp": True},
        )

        tuned_params = result.x
        print(f"[Real-to-Sim] Final error: {result.fun:.6f}")
        print(f"[Real-to-Sim] Tuned params:")
        for name, val in zip(self.PARAM_NAMES, tuned_params):
            print(f"  {name}: {val:.4f}")

        return dict(zip(self.PARAM_NAMES, tuned_params))

2. Reward Function Design for 3 Tasks

Each task has its own reward function, following a generalized formulation based on contact and object goals:

import torch

class ManipulationRewards:
    """
    Generalized reward formulation for dexterous manipulation.
    Based on contact goals and object goals.
    """

    @staticmethod
    def grasp_and_reach(
        hand_pos: torch.Tensor,
        object_pos: torch.Tensor,
        target_pos: torch.Tensor,
        contact_force: torch.Tensor,
        finger_positions: torch.Tensor,
        object_in_hand: torch.Tensor,
    ) -> dict[str, torch.Tensor]:
        """
        Reward for grasp-and-reach task.
        Phase 1: Approach object -> Phase 2: Grasp -> Phase 3: Move to target.
        """
        batch = hand_pos.shape[0]

        # Phase 1: Approach — hand moves toward object
        hand_to_obj = torch.norm(hand_pos - object_pos, dim=-1)
        r_approach = torch.exp(-5.0 * hand_to_obj)

        # Phase 2: Grasp — sufficient contact force + fingers wrap object
        contact_magnitude = torch.norm(contact_force, dim=-1)
        r_contact = torch.clamp(contact_magnitude / 10.0, 0, 1)

        # Finger closure metric: fingers close together = tighter grasp
        finger_spread = torch.std(finger_positions, dim=-1)
        r_closure = torch.exp(-3.0 * finger_spread)

        # Phase 3: Reach — object reaches target
        obj_to_target = torch.norm(object_pos - target_pos, dim=-1)
        r_reach = torch.exp(-3.0 * obj_to_target) * object_in_hand

        # Bonus for task completion (object at target)
        success = (obj_to_target < 0.05).float() * object_in_hand
        r_success = success * 10.0

        # Penalties
        r_energy = -0.001 * contact_magnitude  # Avoid excessive squeezing

        total = (
            0.2 * r_approach
            + 0.2 * r_contact
            + 0.1 * r_closure
            + 0.3 * r_reach
            + r_success
            + r_energy
        )

        return {
            "total": total,
            "approach": r_approach,
            "contact": r_contact,
            "closure": r_closure,
            "reach": r_reach,
            "success": success,
        }

    @staticmethod
    def box_lift(
        left_hand_pos: torch.Tensor,
        right_hand_pos: torch.Tensor,
        box_pos: torch.Tensor,
        box_quat: torch.Tensor,
        target_height: float,
        left_contact: torch.Tensor,
        right_contact: torch.Tensor,
    ) -> dict[str, torch.Tensor]:
        """
        Reward for bimanual box lift.
        Both hands must coordinate to lift the box to target height.
        """
        # Approach: both hands near box
        left_dist = torch.norm(left_hand_pos - box_pos, dim=-1)
        right_dist = torch.norm(right_hand_pos - box_pos, dim=-1)
        r_approach = torch.exp(-3.0 * (left_dist + right_dist))

        # Contact: both hands touching box
        left_contact_mag = torch.norm(left_contact, dim=-1)
        right_contact_mag = torch.norm(right_contact, dim=-1)
        r_bimanual_contact = (
            torch.clamp(left_contact_mag / 5.0, 0, 1)
            * torch.clamp(right_contact_mag / 5.0, 0, 1)
        )

        # Lift: box height approaches target
        height_error = torch.abs(box_pos[:, 2] - target_height)
        r_lift = torch.exp(-5.0 * height_error)

        # Orientation: keep box upright (no tilting)
        r_orient = torch.abs(box_quat[:, 0])  # w component near 1 = upright

        # Symmetry: balanced forces between hands
        force_diff = torch.abs(left_contact_mag - right_contact_mag)
        r_symmetry = torch.exp(-2.0 * force_diff)

        # Success
        success = (height_error < 0.03).float() * (r_bimanual_contact > 0.5).float()
        r_success = success * 10.0

        total = (
            0.15 * r_approach
            + 0.2 * r_bimanual_contact
            + 0.3 * r_lift
            + 0.1 * r_orient
            + 0.1 * r_symmetry
            + r_success
        )

        return {
            "total": total,
            "approach": r_approach,
            "contact": r_bimanual_contact,
            "lift": r_lift,
            "orient": r_orient,
            "symmetry": r_symmetry,
            "success": success,
        }

    @staticmethod
    def bimanual_handover(
        giver_hand_pos: torch.Tensor,
        receiver_hand_pos: torch.Tensor,
        object_pos: torch.Tensor,
        giver_contact: torch.Tensor,
        receiver_contact: torch.Tensor,
        phase: torch.Tensor,  # 0=giver_grasp, 1=transfer, 2=receiver_grasp
    ) -> dict[str, torch.Tensor]:
        """
        Reward for bimanual handover.
        Hand A grasps object -> passes to hand B -> hand B receives.
        """
        batch = object_pos.shape[0]

        # Phase 0: Giver grasps object
        giver_to_obj = torch.norm(giver_hand_pos - object_pos, dim=-1)
        giver_contact_mag = torch.norm(giver_contact, dim=-1)
        r_giver_grasp = (
            torch.exp(-5.0 * giver_to_obj)
            * torch.clamp(giver_contact_mag / 5.0, 0, 1)
        )

        # Phase 1: Move object to transfer zone (midpoint between hands)
        midpoint = (giver_hand_pos + receiver_hand_pos) / 2
        obj_to_mid = torch.norm(object_pos - midpoint, dim=-1)
        r_transfer = torch.exp(-3.0 * obj_to_mid)

        # Phase 2: Receiver grasps (contact) + Giver releases (no contact)
        receiver_contact_mag = torch.norm(receiver_contact, dim=-1)
        r_receiver_grasp = torch.clamp(receiver_contact_mag / 5.0, 0, 1)
        r_giver_release = torch.exp(-2.0 * giver_contact_mag)

        # Phase-weighted rewards
        is_phase0 = (phase == 0).float()
        is_phase1 = (phase == 1).float()
        is_phase2 = (phase == 2).float()

        total = (
            is_phase0 * r_giver_grasp
            + is_phase1 * (0.5 * r_transfer + 0.3 * r_giver_grasp)
            + is_phase2 * (0.5 * r_receiver_grasp + 0.3 * r_giver_release)
        )

        # Success: receiver has object and giver released
        success = (
            (receiver_contact_mag > 3.0).float()
            * (giver_contact_mag < 0.5).float()
        )
        total = total + success * 10.0

        return {
            "total": total,
            "giver_grasp": r_giver_grasp,
            "transfer": r_transfer,
            "receiver_grasp": r_receiver_grasp,
            "success": success,
        }

3. Divide-and-Conquer: Specialist Training and Distillation

Specialist Training

Each specialist is a separate policy, trained with its corresponding reward function:

import torch
import torch.nn as nn

class DexterousManipulationPolicy(nn.Module):
    """
    Policy for dexterous manipulation.
    Observation space includes: proprioception + object state + tactile.
    """

    def __init__(
        self,
        proprio_dim: int = 60,     # joint pos (30) + joint vel (30)
        object_dim: int = 13,      # pos (3) + quat (4) + vel (3) + angular_vel (3)
        tactile_dim: int = 24,     # 12 fingertip sensors x 2 hands
        action_dim: int = 30,      # all upper-body + hand joints
        hidden_dim: int = 512,
    ):
        super().__init__()
        total_obs = proprio_dim + object_dim + tactile_dim  # 97

        # Actor
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ELU(),
            nn.Linear(hidden_dim // 2, action_dim),
            nn.Tanh(),  # Actions in [-1, 1]
        )

        # Critic
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor):
        return self.actor(obs), self.critic(obs)

    def get_action(self, obs: torch.Tensor, deterministic: bool = False):
        mean = self.actor(obs)
        if deterministic:
            return mean
        noise = torch.randn_like(mean) * 0.1
        return (mean + noise).clamp(-1, 1)

def train_specialist(
    task_name: str,
    env_config: dict,
    reward_fn,
    num_timesteps: int = 50_000_000,
):
    """
    Train specialist policy for a single task.
    """
    device = torch.device("cuda")

    env = create_manipulation_env(task_name, env_config)

    policy = DexterousManipulationPolicy(
        proprio_dim=env.proprio_dim,
        object_dim=env.object_dim,
        tactile_dim=env.tactile_dim,
        action_dim=env.action_dim,
    ).to(device)

    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    total_steps = 0
    obs = env.reset()

    while total_steps < num_timesteps:
        actions = policy.get_action(obs)
        next_obs, info = env.step(actions)

        reward_dict = reward_fn(**info)
        reward = reward_dict["total"]

        _, values = policy(obs)
        _, next_values = policy(next_obs)

        advantage = reward + 0.99 * next_values.squeeze() - values.squeeze()
        policy_loss = -(advantage.detach() * policy.get_action(obs)).mean()
        value_loss = advantage.pow(2).mean()

        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
        optimizer.step()

        obs = next_obs
        total_steps += env.num_envs

        if total_steps % 1_000_000 == 0:
            success_rate = reward_dict["success"].mean().item()
            print(
                f"[{task_name}] Steps: {total_steps:>10d} | "
                f"Success: {success_rate:.1%} | "
                f"Reward: {reward.mean().item():.3f}"
            )

    return policy

# Train 3 specialists
grasp_specialist = train_specialist(
    "grasp_and_reach",
    env_config={"object_type": "diverse", "randomize": True},
    reward_fn=ManipulationRewards.grasp_and_reach,
)

lift_specialist = train_specialist(
    "box_lift",
    env_config={"box_sizes": [(0.1, 0.1, 0.1), (0.15, 0.15, 0.15)], "randomize": True},
    reward_fn=ManipulationRewards.box_lift,
)

handover_specialist = train_specialist(
    "bimanual_handover",
    env_config={"object_type": "diverse", "randomize": True},
    reward_fn=ManipulationRewards.bimanual_handover,
)

Robot manipulation training process in simulation

Policy Distillation: Specialists to Generalist

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralistPolicy(nn.Module):
    """
    Generalist policy — distilled from multiple specialists.
    Uses task embedding to distinguish tasks.
    """

    def __init__(
        self,
        obs_dim: int = 97,
        action_dim: int = 30,
        num_tasks: int = 3,
        task_embed_dim: int = 16,
        hidden_dim: int = 512,
    ):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, task_embed_dim)
        total_input = obs_dim + task_embed_dim

        self.actor = nn.Sequential(
            nn.Linear(total_input, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ELU(),
            nn.Linear(hidden_dim // 2, action_dim),
            nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, task_id: torch.Tensor):
        task_embed = self.task_embedding(task_id)
        x = torch.cat([obs, task_embed], dim=-1)
        return self.actor(x)

def distill_specialists(
    specialists: list[nn.Module],
    task_envs: list,
    num_distill_steps: int = 10_000_000,
    temperature: float = 1.0,
):
    """
    Distill multiple specialist policies into a single generalist.

    Args:
        specialists: list of trained specialist policies
        task_envs: list of environments (one per task)
        temperature: distillation temperature
    """
    device = torch.device("cuda")
    num_tasks = len(specialists)

    # Freeze specialists
    for spec in specialists:
        spec.eval()
        for p in spec.parameters():
            p.requires_grad = False

    generalist = GeneralistPolicy(
        obs_dim=task_envs[0].obs_dim,
        action_dim=task_envs[0].action_dim,
        num_tasks=num_tasks,
    ).to(device)

    optimizer = torch.optim.Adam(generalist.parameters(), lr=1e-4)

    total_steps = 0
    envs_obs = [env.reset() for env in task_envs]

    while total_steps < num_distill_steps:
        total_loss = 0.0

        for task_id, (specialist, env, obs) in enumerate(
            zip(specialists, task_envs, envs_obs)
        ):
            # Teacher action (from specialist)
            with torch.no_grad():
                teacher_action = specialist.get_action(obs, deterministic=True)

            # Student action (from generalist)
            task_ids = torch.full(
                (obs.shape[0],), task_id, dtype=torch.long, device=device
            )
            student_action = generalist(obs, task_ids)

            # Distillation loss
            distill_loss = F.mse_loss(student_action, teacher_action)

            next_obs, _ = env.step(teacher_action)
            envs_obs[task_id] = next_obs

            total_loss += distill_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(generalist.parameters(), 1.0)
        optimizer.step()

        total_steps += sum(env.num_envs for env in task_envs)

        if total_steps % 500_000 == 0:
            print(
                f"[Distill] Steps: {total_steps:>10d} | "
                f"Loss: {total_loss.item():.6f}"
            )

    return generalist

4. Domain Randomization Configuration

DOMAIN_RANDOMIZATION = {
    # Physics parameters
    "physics": {
        "friction": {
            "range": [0.3, 2.0],
            "operation": "scaling",  # Multiply with default value
        },
        "restitution": {
            "range": [0.0, 0.3],
            "operation": "absolute",
        },
        "mass": {
            "range": [0.85, 1.15],
            "operation": "scaling",
        },
        "com_displacement": {
            "range": [-0.03, 0.03],  # meters
            "operation": "additive",
            "per_axis": True,
        },
    },

    # Actuator parameters
    "actuator": {
        "kp": {"range": [0.7, 1.3], "operation": "scaling"},
        "kd": {"range": [0.7, 1.3], "operation": "scaling"},
        "torque_limit": {"range": [0.85, 1.0], "operation": "scaling"},
        "action_delay": {"range": [0, 3], "operation": "absolute", "unit": "steps"},
    },

    # Sensor noise
    "sensor": {
        "joint_pos_noise": {"std": 0.01, "unit": "rad"},
        "joint_vel_noise": {"std": 0.1, "unit": "rad/s"},
        "imu_quat_noise": {"std": 0.005},
        "imu_gyro_noise": {"std": 0.02, "unit": "rad/s"},
    },

    # Object properties (for manipulation)
    "object": {
        "mass": {"range": [0.1, 2.0], "unit": "kg"},
        "size_scale": {"range": [0.8, 1.2]},
        "friction": {"range": [0.3, 1.5]},
        "initial_pos_noise": {"std": 0.02, "unit": "m"},
        "initial_rot_noise": {"std": 0.1, "unit": "rad"},
    },

    # Visual randomization (for camera-based policies)
    "visual": {
        "light_intensity": {"range": [0.5, 2.0]},
        "light_direction": {"range": [-30, 30], "unit": "deg"},
        "texture_randomize": True,
        "camera_fov_noise": {"std": 2.0, "unit": "deg"},
    },
}

5. Hybrid Object Representation

The paper uses a hybrid representation combining multiple modalities for object state:

import torch
import torch.nn as nn

class HybridObjectEncoder(nn.Module):
    """
    Hybrid object representation with modality-specific augmentation.
    Combines: proprioceptive + tactile + visual features.
    """

    def __init__(
        self,
        proprio_dim: int = 13,   # object pos (3) + quat (4) + vel (6)
        tactile_dim: int = 24,   # fingertip sensors
        visual_dim: int = 64,    # CNN features (if using camera)
        output_dim: int = 32,
        use_visual: bool = False,
    ):
        super().__init__()
        self.use_visual = use_visual

        # Proprioceptive encoder
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, 64),
            nn.LayerNorm(64),
            nn.ELU(),
            nn.Linear(64, 32),
        )

        # Tactile encoder
        self.tactile_encoder = nn.Sequential(
            nn.Linear(tactile_dim, 64),
            nn.LayerNorm(64),
            nn.ELU(),
            nn.Linear(64, 32),
        )

        # Visual encoder (optional)
        if use_visual:
            self.visual_encoder = nn.Sequential(
                nn.Linear(visual_dim, 64),
                nn.LayerNorm(64),
                nn.ELU(),
                nn.Linear(64, 32),
            )
            fusion_dim = 32 * 3
        else:
            fusion_dim = 32 * 2

        # Fusion layer
        self.fusion = nn.Sequential(
            nn.Linear(fusion_dim, 64),
            nn.ELU(),
            nn.Linear(64, output_dim),
        )

        # Augmentation parameters
        self.proprio_noise_std = 0.01
        self.tactile_noise_std = 0.05
        self.dropout = nn.Dropout(0.1)

    def forward(
        self,
        proprio: torch.Tensor,
        tactile: torch.Tensor,
        visual: torch.Tensor | None = None,
        augment: bool = False,
    ) -> torch.Tensor:
        """
        Encode object state from multiple modalities.

        Args:
            proprio: (batch, proprio_dim) — object pose + velocity
            tactile: (batch, tactile_dim) — fingertip sensor readings
            visual: (batch, visual_dim) — CNN features (optional)
            augment: whether to apply modality-specific augmentation
        """
        if augment and self.training:
            proprio = proprio + torch.randn_like(proprio) * self.proprio_noise_std
            tactile = tactile + torch.randn_like(tactile) * self.tactile_noise_std
            if torch.rand(1).item() < 0.1:
                tactile = torch.zeros_like(tactile)

        proprio_feat = self.proprio_encoder(proprio)
        tactile_feat = self.tactile_encoder(tactile)

        if self.use_visual and visual is not None:
            visual_feat = self.visual_encoder(visual)
            fused = torch.cat([proprio_feat, tactile_feat, visual_feat], dim=-1)
        else:
            fused = torch.cat([proprio_feat, tactile_feat], dim=-1)

        fused = self.dropout(fused) if augment else fused
        output = self.fusion(fused)

        return output

6. Complete Deployment Script

import torch
import time
import numpy as np

class Sim2RealDeployer:
    """
    Deploy trained policy on real robot hardware.
    Handles: observation processing, action filtering, safety checks.
    """

    def __init__(
        self,
        policy_path: str,
        robot_interface,
        control_freq: float = 30.0,
        action_smoothing: float = 0.3,  # EMA smoothing factor
    ):
        self.device = torch.device("cpu")
        self.dt = 1.0 / control_freq
        self.alpha = action_smoothing

        ckpt = torch.load(policy_path, map_location=self.device)
        self.policy = GeneralistPolicy(
            obs_dim=ckpt["obs_dim"],
            action_dim=ckpt["action_dim"],
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.robot = robot_interface
        self.prev_action = None

        # Safety limits
        self.max_joint_delta = 0.1  # rad per step
        self.torque_limit = 50.0    # Nm

        self.obs_mean = ckpt.get("obs_mean", torch.zeros(ckpt["obs_dim"]))
        self.obs_std = ckpt.get("obs_std", torch.ones(ckpt["obs_dim"]))

    def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
        """Normalize observation using training statistics."""
        return (obs - self.obs_mean) / (self.obs_std + 1e-8)

    def _smooth_action(self, action: torch.Tensor) -> torch.Tensor:
        """EMA smoothing to prevent jerky motion."""
        if self.prev_action is None:
            self.prev_action = action
            return action
        smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed
        return smoothed

    def _safety_check(self, action: torch.Tensor, current_pos: torch.Tensor) -> torch.Tensor:
        """Clamp action to safety limits."""
        delta = action - current_pos
        delta = delta.clamp(-self.max_joint_delta, self.max_joint_delta)
        return current_pos + delta

    @torch.no_grad()
    def run(self, task_id: int, duration: float = 30.0):
        """
        Run policy on real robot.

        Args:
            task_id: 0=grasp, 1=lift, 2=handover
            duration: seconds to run
        """
        print(f"[Deploy] Starting task {task_id} for {duration}s")
        print("[Deploy] Press Ctrl+C to stop")

        task_tensor = torch.tensor([task_id], dtype=torch.long)
        start_time = time.time()
        step_count = 0

        try:
            while time.time() - start_time < duration:
                loop_start = time.time()

                state = self.robot.get_state()
                obs = self._build_observation(state)
                obs = self._normalize_obs(obs)

                action = self.policy(obs.unsqueeze(0), task_tensor).squeeze(0)
                action = self._smooth_action(action)
                action = self._safety_check(action, state["joint_pos"])

                self.robot.set_joint_targets(action.numpy())

                step_count += 1
                if step_count % 100 == 0:
                    elapsed = time.time() - start_time
                    print(
                        f"[Deploy] Step {step_count} | "
                        f"Time: {elapsed:.1f}s | "
                        f"Action norm: {action.norm():.3f}"
                    )

                elapsed_step = time.time() - loop_start
                if elapsed_step < self.dt:
                    time.sleep(self.dt - elapsed_step)

        except KeyboardInterrupt:
            print("[Deploy] Stopped by user")
        finally:
            print("[Deploy] Moving to safe pose...")
            self.robot.move_to_default(duration=3.0)
            print("[Deploy] Done")

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation tensor from robot state dict."""
        return torch.cat([
            state["joint_pos"],
            state["joint_vel"],
            state["object_pos"],
            state["object_quat"],
            state["object_vel"],
            state["object_angular_vel"],
            state["tactile_left"],
            state["tactile_right"],
        ])

Experimental Results

Task	Sim Success (%)	Real Success (%)	Sim-to-Real Gap
Grasp-and-Reach	92.3	78.5	13.8%
Box Lift	87.1	71.2	15.9%
Bimanual Handover	83.6	64.8	18.8%

Key observations:

Sim-to-real gap is ~14-19% — much smaller than methods without real-to-sim tuning (typical gap is 30-50%)
Bimanual handover is hardest because it requires precise coordination timing between two hands
Real-to-sim tuning reduces the gap by ~10% compared to domain randomization alone

Key Lessons

1. Real-to-Sim > Domain Randomization Alone

Domain randomization is necessary but not sufficient. If simulation is too different from reality, no amount of randomization will produce good transfer. Real-to-sim tuning moves the simulation closer to reality first, then randomizes around that more accurate baseline.

2. Divide-and-Conquer Reduces Training Time 3-5x

Training a single generalist directly takes 200M+ steps. Training 3 specialists (50M each) + distillation (10M) = 160M total, with better results because each specialist optimizes for a single clear objective.

3. Action Smoothing Is Mandatory for Hardware

In simulation, actions can change abruptly every step. On real hardware, this creates jerk that damages motors and reduces lifespan. EMA smoothing (alpha=0.3) is a simple but critical trick for deployment.

References

Sim-to-Real Reinforcement Learning for Humanoid Dexterous Manipulation — arXiv 2502.20396, Feb 2025
Sim-to-Real: Learning Agile Locomotion For Quadruped Robots — Tan, J., Zhang, T., Coumans, E., et al., RSS 2018
Learning Dexterous In-Hand Manipulation — OpenAI, Andrychowicz, M., et al., IJRR 2020
DexMV: Imitation Learning for Dexterous Manipulation from Human Videos — Qin, Y., et al., ECCV 2022

From Simulation to Real Robot: The Hardest Problem in RL

Simulation environment for robot manipulation training

Pipeline Overview

The pipeline consists of 4 main stages:

Real-to-Sim Tuning: Automatically adjust simulation parameters to match the real robot
Specialist Training: Train specialized policies for each task (grasp, lift, handover)
Policy Distillation: Combine specialists into a single generalist policy
Sim-to-Real Deployment: Domain randomization + transfer to hardware

Why Divide-and-Conquer?

Directly training a single policy for all tasks is difficult because:

Reward conflict: grasp reward can conflict with handover reward
Exploration difficulty: the policy must discover strategies for all 3 tasks simultaneously
Training instability: gradients from multiple tasks can cancel each other

Solution: train a specialist for each task separately, then distill into one generalist. Specialists are easier to train; the generalist inherits knowledge from all specialists.

1. Real-to-Sim Parameter Identification

The automated tuning module compares real trajectories with simulated ones, optimizing physics parameters:

import torch
import numpy as np
from scipy.optimize import minimize

class RealToSimTuner:
    """
    Automated real-to-sim parameter identification.
    Collect trajectories from real robot -> optimize simulation parameters
    so that simulation reproduces those trajectories.
    """

    # Parameters to tune
    PARAM_NAMES = [
        "joint_friction",      # Joint friction (Nm)
        "joint_damping",       # Damping (Nm*s/rad)
        "joint_armature",      # Rotor inertia reflected (kg*m^2)
        "contact_friction",    # Contact friction
        "contact_restitution", # Bounce coefficient
        "actuator_kp",         # PD gains
        "actuator_kd",
        "com_offset_x",        # Center of mass offset (m)
        "com_offset_y",
        "com_offset_z",
        "mass_scale",          # Real/model mass ratio
    ]

    def __init__(
        self,
        sim_env,
        real_trajectories: list[dict],
        num_iterations: int = 200,
    ):
        """
        Args:
            sim_env: Isaac Lab environment instance
            real_trajectories: list of dicts with keys:
                - joint_pos: (T, num_joints) — measured positions
                - joint_vel: (T, num_joints) — measured velocities
                - joint_torque: (T, num_joints) — commanded torques
                - timestamps: (T,) — time stamps
        """
        self.sim_env = sim_env
        self.real_trajs = real_trajectories
        self.num_iterations = num_iterations

        # Initial parameter guess (from URDF defaults)
        self.initial_params = self._get_default_params()

    def _get_default_params(self) -> np.ndarray:
        """Load default parameters from URDF/simulation."""
        return np.array([
            0.05,   # joint_friction (Nm)
            0.5,    # joint_damping (Nm*s/rad)
            0.01,   # joint_armature (kg*m^2)
            0.8,    # contact_friction
            0.0,    # contact_restitution
            200.0,  # actuator_kp
            20.0,   # actuator_kd
            0.0,    # com_offset_x
            0.0,    # com_offset_y
            0.0,    # com_offset_z
            1.0,    # mass_scale
        ])

    def _simulate_trajectory(
        self, params: np.ndarray, real_traj: dict
    ) -> dict:
        """
        Replay torque commands in simulation with given params.
        Returns simulated trajectory.
        """
        # Apply parameters to simulation
        self.sim_env.set_physics_params({
            "joint_friction": params[0],
            "joint_damping": params[1],
            "joint_armature": params[2],
            "contact_friction": params[3],
            "contact_restitution": params[4],
            "actuator_kp": params[5],
            "actuator_kd": params[6],
            "com_offset": params[7:10],
            "mass_scale": params[10],
        })

        # Reset simulation to initial state from real trajectory
        self.sim_env.reset()
        self.sim_env.set_joint_state(
            real_traj["joint_pos"][0],
            real_traj["joint_vel"][0],
        )

        # Replay torque commands
        sim_positions = []
        sim_velocities = []

        for t in range(len(real_traj["joint_torque"])):
            self.sim_env.apply_torque(real_traj["joint_torque"][t])
            self.sim_env.step()
            sim_positions.append(self.sim_env.get_joint_pos())
            sim_velocities.append(self.sim_env.get_joint_vel())

        return {
            "joint_pos": np.stack(sim_positions),
            "joint_vel": np.stack(sim_velocities),
        }

    def _objective(self, params: np.ndarray) -> float:
        """
        Objective function: total MSE between sim and real trajectories.
        """
        total_error = 0.0

        for real_traj in self.real_trajs:
            sim_traj = self._simulate_trajectory(params, real_traj)

            # Position error (weighted higher)
            pos_error = np.mean(
                (sim_traj["joint_pos"] - real_traj["joint_pos"]) ** 2
            )

            # Velocity error
            vel_error = np.mean(
                (sim_traj["joint_vel"] - real_traj["joint_vel"]) ** 2
            )

            total_error += 10.0 * pos_error + 1.0 * vel_error

        return total_error / len(self.real_trajs)

    def tune(self) -> dict:
        """
        Run optimization to find best simulation parameters.
        Returns: dict of tuned parameters.
        """
        # Parameter bounds (physical constraints)
        bounds = [
            (0.001, 1.0),    # joint_friction
            (0.01, 5.0),     # joint_damping
            (0.001, 0.1),    # joint_armature
            (0.1, 2.0),      # contact_friction
            (0.0, 0.5),      # contact_restitution
            (50.0, 500.0),   # actuator_kp
            (5.0, 100.0),    # actuator_kd
            (-0.05, 0.05),   # com_offset_x
            (-0.05, 0.05),   # com_offset_y
            (-0.05, 0.05),   # com_offset_z
            (0.8, 1.2),      # mass_scale
        ]

        result = minimize(
            self._objective,
            self.initial_params,
            method="L-BFGS-B",
            bounds=bounds,
            options={"maxiter": self.num_iterations, "disp": True},
        )

        tuned_params = result.x
        print(f"[Real-to-Sim] Final error: {result.fun:.6f}")
        print(f"[Real-to-Sim] Tuned params:")
        for name, val in zip(self.PARAM_NAMES, tuned_params):
            print(f"  {name}: {val:.4f}")

        return dict(zip(self.PARAM_NAMES, tuned_params))

2. Reward Function Design for 3 Tasks

Each task has its own reward function, following a generalized formulation based on contact and object goals:

import torch

class ManipulationRewards:
    """
    Generalized reward formulation for dexterous manipulation.
    Based on contact goals and object goals.
    """

    @staticmethod
    def grasp_and_reach(
        hand_pos: torch.Tensor,
        object_pos: torch.Tensor,
        target_pos: torch.Tensor,
        contact_force: torch.Tensor,
        finger_positions: torch.Tensor,
        object_in_hand: torch.Tensor,
    ) -> dict[str, torch.Tensor]:
        """
        Reward for grasp-and-reach task.
        Phase 1: Approach object -> Phase 2: Grasp -> Phase 3: Move to target.
        """
        batch = hand_pos.shape[0]

        # Phase 1: Approach — hand moves toward object
        hand_to_obj = torch.norm(hand_pos - object_pos, dim=-1)
        r_approach = torch.exp(-5.0 * hand_to_obj)

        # Phase 2: Grasp — sufficient contact force + fingers wrap object
        contact_magnitude = torch.norm(contact_force, dim=-1)
        r_contact = torch.clamp(contact_magnitude / 10.0, 0, 1)

        # Finger closure metric: fingers close together = tighter grasp
        finger_spread = torch.std(finger_positions, dim=-1)
        r_closure = torch.exp(-3.0 * finger_spread)

        # Phase 3: Reach — object reaches target
        obj_to_target = torch.norm(object_pos - target_pos, dim=-1)
        r_reach = torch.exp(-3.0 * obj_to_target) * object_in_hand

        # Bonus for task completion (object at target)
        success = (obj_to_target < 0.05).float() * object_in_hand
        r_success = success * 10.0

        # Penalties
        r_energy = -0.001 * contact_magnitude  # Avoid excessive squeezing

        total = (
            0.2 * r_approach
            + 0.2 * r_contact
            + 0.1 * r_closure
            + 0.3 * r_reach
            + r_success
            + r_energy
        )

        return {
            "total": total,
            "approach": r_approach,
            "contact": r_contact,
            "closure": r_closure,
            "reach": r_reach,
            "success": success,
        }

    @staticmethod
    def box_lift(
        left_hand_pos: torch.Tensor,
        right_hand_pos: torch.Tensor,
        box_pos: torch.Tensor,
        box_quat: torch.Tensor,
        target_height: float,
        left_contact: torch.Tensor,
        right_contact: torch.Tensor,
    ) -> dict[str, torch.Tensor]:
        """
        Reward for bimanual box lift.
        Both hands must coordinate to lift the box to target height.
        """
        # Approach: both hands near box
        left_dist = torch.norm(left_hand_pos - box_pos, dim=-1)
        right_dist = torch.norm(right_hand_pos - box_pos, dim=-1)
        r_approach = torch.exp(-3.0 * (left_dist + right_dist))

        # Contact: both hands touching box
        left_contact_mag = torch.norm(left_contact, dim=-1)
        right_contact_mag = torch.norm(right_contact, dim=-1)
        r_bimanual_contact = (
            torch.clamp(left_contact_mag / 5.0, 0, 1)
            * torch.clamp(right_contact_mag / 5.0, 0, 1)
        )

        # Lift: box height approaches target
        height_error = torch.abs(box_pos[:, 2] - target_height)
        r_lift = torch.exp(-5.0 * height_error)

        # Orientation: keep box upright (no tilting)
        r_orient = torch.abs(box_quat[:, 0])  # w component near 1 = upright

        # Symmetry: balanced forces between hands
        force_diff = torch.abs(left_contact_mag - right_contact_mag)
        r_symmetry = torch.exp(-2.0 * force_diff)

        # Success
        success = (height_error < 0.03).float() * (r_bimanual_contact > 0.5).float()
        r_success = success * 10.0

        total = (
            0.15 * r_approach
            + 0.2 * r_bimanual_contact
            + 0.3 * r_lift
            + 0.1 * r_orient
            + 0.1 * r_symmetry
            + r_success
        )

        return {
            "total": total,
            "approach": r_approach,
            "contact": r_bimanual_contact,
            "lift": r_lift,
            "orient": r_orient,
            "symmetry": r_symmetry,
            "success": success,
        }

    @staticmethod
    def bimanual_handover(
        giver_hand_pos: torch.Tensor,
        receiver_hand_pos: torch.Tensor,
        object_pos: torch.Tensor,
        giver_contact: torch.Tensor,
        receiver_contact: torch.Tensor,
        phase: torch.Tensor,  # 0=giver_grasp, 1=transfer, 2=receiver_grasp
    ) -> dict[str, torch.Tensor]:
        """
        Reward for bimanual handover.
        Hand A grasps object -> passes to hand B -> hand B receives.
        """
        batch = object_pos.shape[0]

        # Phase 0: Giver grasps object
        giver_to_obj = torch.norm(giver_hand_pos - object_pos, dim=-1)
        giver_contact_mag = torch.norm(giver_contact, dim=-1)
        r_giver_grasp = (
            torch.exp(-5.0 * giver_to_obj)
            * torch.clamp(giver_contact_mag / 5.0, 0, 1)
        )

        # Phase 1: Move object to transfer zone (midpoint between hands)
        midpoint = (giver_hand_pos + receiver_hand_pos) / 2
        obj_to_mid = torch.norm(object_pos - midpoint, dim=-1)
        r_transfer = torch.exp(-3.0 * obj_to_mid)

        # Phase 2: Receiver grasps (contact) + Giver releases (no contact)
        receiver_contact_mag = torch.norm(receiver_contact, dim=-1)
        r_receiver_grasp = torch.clamp(receiver_contact_mag / 5.0, 0, 1)
        r_giver_release = torch.exp(-2.0 * giver_contact_mag)

        # Phase-weighted rewards
        is_phase0 = (phase == 0).float()
        is_phase1 = (phase == 1).float()
        is_phase2 = (phase == 2).float()

        total = (
            is_phase0 * r_giver_grasp
            + is_phase1 * (0.5 * r_transfer + 0.3 * r_giver_grasp)
            + is_phase2 * (0.5 * r_receiver_grasp + 0.3 * r_giver_release)
        )

        # Success: receiver has object and giver released
        success = (
            (receiver_contact_mag > 3.0).float()
            * (giver_contact_mag < 0.5).float()
        )
        total = total + success * 10.0

        return {
            "total": total,
            "giver_grasp": r_giver_grasp,
            "transfer": r_transfer,
            "receiver_grasp": r_receiver_grasp,
            "success": success,
        }

3. Divide-and-Conquer: Specialist Training and Distillation

Specialist Training

Each specialist is a separate policy, trained with its corresponding reward function:

import torch
import torch.nn as nn

class DexterousManipulationPolicy(nn.Module):
    """
    Policy for dexterous manipulation.
    Observation space includes: proprioception + object state + tactile.
    """

    def __init__(
        self,
        proprio_dim: int = 60,     # joint pos (30) + joint vel (30)
        object_dim: int = 13,      # pos (3) + quat (4) + vel (3) + angular_vel (3)
        tactile_dim: int = 24,     # 12 fingertip sensors x 2 hands
        action_dim: int = 30,      # all upper-body + hand joints
        hidden_dim: int = 512,
    ):
        super().__init__()
        total_obs = proprio_dim + object_dim + tactile_dim  # 97

        # Actor
        self.actor = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ELU(),
            nn.Linear(hidden_dim // 2, action_dim),
            nn.Tanh(),  # Actions in [-1, 1]
        )

        # Critic
        self.critic = nn.Sequential(
            nn.Linear(total_obs, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor):
        return self.actor(obs), self.critic(obs)

    def get_action(self, obs: torch.Tensor, deterministic: bool = False):
        mean = self.actor(obs)
        if deterministic:
            return mean
        noise = torch.randn_like(mean) * 0.1
        return (mean + noise).clamp(-1, 1)

def train_specialist(
    task_name: str,
    env_config: dict,
    reward_fn,
    num_timesteps: int = 50_000_000,
):
    """
    Train specialist policy for a single task.
    """
    device = torch.device("cuda")

    env = create_manipulation_env(task_name, env_config)

    policy = DexterousManipulationPolicy(
        proprio_dim=env.proprio_dim,
        object_dim=env.object_dim,
        tactile_dim=env.tactile_dim,
        action_dim=env.action_dim,
    ).to(device)

    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    total_steps = 0
    obs = env.reset()

    while total_steps < num_timesteps:
        actions = policy.get_action(obs)
        next_obs, info = env.step(actions)

        reward_dict = reward_fn(**info)
        reward = reward_dict["total"]

        _, values = policy(obs)
        _, next_values = policy(next_obs)

        advantage = reward + 0.99 * next_values.squeeze() - values.squeeze()
        policy_loss = -(advantage.detach() * policy.get_action(obs)).mean()
        value_loss = advantage.pow(2).mean()

        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
        optimizer.step()

        obs = next_obs
        total_steps += env.num_envs

        if total_steps % 1_000_000 == 0:
            success_rate = reward_dict["success"].mean().item()
            print(
                f"[{task_name}] Steps: {total_steps:>10d} | "
                f"Success: {success_rate:.1%} | "
                f"Reward: {reward.mean().item():.3f}"
            )

    return policy

# Train 3 specialists
grasp_specialist = train_specialist(
    "grasp_and_reach",
    env_config={"object_type": "diverse", "randomize": True},
    reward_fn=ManipulationRewards.grasp_and_reach,
)

lift_specialist = train_specialist(
    "box_lift",
    env_config={"box_sizes": [(0.1, 0.1, 0.1), (0.15, 0.15, 0.15)], "randomize": True},
    reward_fn=ManipulationRewards.box_lift,
)

handover_specialist = train_specialist(
    "bimanual_handover",
    env_config={"object_type": "diverse", "randomize": True},
    reward_fn=ManipulationRewards.bimanual_handover,
)

Robot manipulation training process in simulation

Policy Distillation: Specialists to Generalist

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralistPolicy(nn.Module):
    """
    Generalist policy — distilled from multiple specialists.
    Uses task embedding to distinguish tasks.
    """

    def __init__(
        self,
        obs_dim: int = 97,
        action_dim: int = 30,
        num_tasks: int = 3,
        task_embed_dim: int = 16,
        hidden_dim: int = 512,
    ):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, task_embed_dim)
        total_input = obs_dim + task_embed_dim

        self.actor = nn.Sequential(
            nn.Linear(total_input, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ELU(),
            nn.Linear(hidden_dim // 2, action_dim),
            nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, task_id: torch.Tensor):
        task_embed = self.task_embedding(task_id)
        x = torch.cat([obs, task_embed], dim=-1)
        return self.actor(x)

def distill_specialists(
    specialists: list[nn.Module],
    task_envs: list,
    num_distill_steps: int = 10_000_000,
    temperature: float = 1.0,
):
    """
    Distill multiple specialist policies into a single generalist.

    Args:
        specialists: list of trained specialist policies
        task_envs: list of environments (one per task)
        temperature: distillation temperature
    """
    device = torch.device("cuda")
    num_tasks = len(specialists)

    # Freeze specialists
    for spec in specialists:
        spec.eval()
        for p in spec.parameters():
            p.requires_grad = False

    generalist = GeneralistPolicy(
        obs_dim=task_envs[0].obs_dim,
        action_dim=task_envs[0].action_dim,
        num_tasks=num_tasks,
    ).to(device)

    optimizer = torch.optim.Adam(generalist.parameters(), lr=1e-4)

    total_steps = 0
    envs_obs = [env.reset() for env in task_envs]

    while total_steps < num_distill_steps:
        total_loss = 0.0

        for task_id, (specialist, env, obs) in enumerate(
            zip(specialists, task_envs, envs_obs)
        ):
            # Teacher action (from specialist)
            with torch.no_grad():
                teacher_action = specialist.get_action(obs, deterministic=True)

            # Student action (from generalist)
            task_ids = torch.full(
                (obs.shape[0],), task_id, dtype=torch.long, device=device
            )
            student_action = generalist(obs, task_ids)

            # Distillation loss
            distill_loss = F.mse_loss(student_action, teacher_action)

            next_obs, _ = env.step(teacher_action)
            envs_obs[task_id] = next_obs

            total_loss += distill_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(generalist.parameters(), 1.0)
        optimizer.step()

        total_steps += sum(env.num_envs for env in task_envs)

        if total_steps % 500_000 == 0:
            print(
                f"[Distill] Steps: {total_steps:>10d} | "
                f"Loss: {total_loss.item():.6f}"
            )

    return generalist

4. Domain Randomization Configuration

DOMAIN_RANDOMIZATION = {
    # Physics parameters
    "physics": {
        "friction": {
            "range": [0.3, 2.0],
            "operation": "scaling",  # Multiply with default value
        },
        "restitution": {
            "range": [0.0, 0.3],
            "operation": "absolute",
        },
        "mass": {
            "range": [0.85, 1.15],
            "operation": "scaling",
        },
        "com_displacement": {
            "range": [-0.03, 0.03],  # meters
            "operation": "additive",
            "per_axis": True,
        },
    },

    # Actuator parameters
    "actuator": {
        "kp": {"range": [0.7, 1.3], "operation": "scaling"},
        "kd": {"range": [0.7, 1.3], "operation": "scaling"},
        "torque_limit": {"range": [0.85, 1.0], "operation": "scaling"},
        "action_delay": {"range": [0, 3], "operation": "absolute", "unit": "steps"},
    },

    # Sensor noise
    "sensor": {
        "joint_pos_noise": {"std": 0.01, "unit": "rad"},
        "joint_vel_noise": {"std": 0.1, "unit": "rad/s"},
        "imu_quat_noise": {"std": 0.005},
        "imu_gyro_noise": {"std": 0.02, "unit": "rad/s"},
    },

    # Object properties (for manipulation)
    "object": {
        "mass": {"range": [0.1, 2.0], "unit": "kg"},
        "size_scale": {"range": [0.8, 1.2]},
        "friction": {"range": [0.3, 1.5]},
        "initial_pos_noise": {"std": 0.02, "unit": "m"},
        "initial_rot_noise": {"std": 0.1, "unit": "rad"},
    },

    # Visual randomization (for camera-based policies)
    "visual": {
        "light_intensity": {"range": [0.5, 2.0]},
        "light_direction": {"range": [-30, 30], "unit": "deg"},
        "texture_randomize": True,
        "camera_fov_noise": {"std": 2.0, "unit": "deg"},
    },
}

5. Hybrid Object Representation

The paper uses a hybrid representation combining multiple modalities for object state:

import torch
import torch.nn as nn

class HybridObjectEncoder(nn.Module):
    """
    Hybrid object representation with modality-specific augmentation.
    Combines: proprioceptive + tactile + visual features.
    """

    def __init__(
        self,
        proprio_dim: int = 13,   # object pos (3) + quat (4) + vel (6)
        tactile_dim: int = 24,   # fingertip sensors
        visual_dim: int = 64,    # CNN features (if using camera)
        output_dim: int = 32,
        use_visual: bool = False,
    ):
        super().__init__()
        self.use_visual = use_visual

        # Proprioceptive encoder
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, 64),
            nn.LayerNorm(64),
            nn.ELU(),
            nn.Linear(64, 32),
        )

        # Tactile encoder
        self.tactile_encoder = nn.Sequential(
            nn.Linear(tactile_dim, 64),
            nn.LayerNorm(64),
            nn.ELU(),
            nn.Linear(64, 32),
        )

        # Visual encoder (optional)
        if use_visual:
            self.visual_encoder = nn.Sequential(
                nn.Linear(visual_dim, 64),
                nn.LayerNorm(64),
                nn.ELU(),
                nn.Linear(64, 32),
            )
            fusion_dim = 32 * 3
        else:
            fusion_dim = 32 * 2

        # Fusion layer
        self.fusion = nn.Sequential(
            nn.Linear(fusion_dim, 64),
            nn.ELU(),
            nn.Linear(64, output_dim),
        )

        # Augmentation parameters
        self.proprio_noise_std = 0.01
        self.tactile_noise_std = 0.05
        self.dropout = nn.Dropout(0.1)

    def forward(
        self,
        proprio: torch.Tensor,
        tactile: torch.Tensor,
        visual: torch.Tensor | None = None,
        augment: bool = False,
    ) -> torch.Tensor:
        """
        Encode object state from multiple modalities.

        Args:
            proprio: (batch, proprio_dim) — object pose + velocity
            tactile: (batch, tactile_dim) — fingertip sensor readings
            visual: (batch, visual_dim) — CNN features (optional)
            augment: whether to apply modality-specific augmentation
        """
        if augment and self.training:
            proprio = proprio + torch.randn_like(proprio) * self.proprio_noise_std
            tactile = tactile + torch.randn_like(tactile) * self.tactile_noise_std
            if torch.rand(1).item() < 0.1:
                tactile = torch.zeros_like(tactile)

        proprio_feat = self.proprio_encoder(proprio)
        tactile_feat = self.tactile_encoder(tactile)

        if self.use_visual and visual is not None:
            visual_feat = self.visual_encoder(visual)
            fused = torch.cat([proprio_feat, tactile_feat, visual_feat], dim=-1)
        else:
            fused = torch.cat([proprio_feat, tactile_feat], dim=-1)

        fused = self.dropout(fused) if augment else fused
        output = self.fusion(fused)

        return output

6. Complete Deployment Script

import torch
import time
import numpy as np

class Sim2RealDeployer:
    """
    Deploy trained policy on real robot hardware.
    Handles: observation processing, action filtering, safety checks.
    """

    def __init__(
        self,
        policy_path: str,
        robot_interface,
        control_freq: float = 30.0,
        action_smoothing: float = 0.3,  # EMA smoothing factor
    ):
        self.device = torch.device("cpu")
        self.dt = 1.0 / control_freq
        self.alpha = action_smoothing

        ckpt = torch.load(policy_path, map_location=self.device)
        self.policy = GeneralistPolicy(
            obs_dim=ckpt["obs_dim"],
            action_dim=ckpt["action_dim"],
        ).to(self.device)
        self.policy.load_state_dict(ckpt["policy"])
        self.policy.eval()

        self.robot = robot_interface
        self.prev_action = None

        # Safety limits
        self.max_joint_delta = 0.1  # rad per step
        self.torque_limit = 50.0    # Nm

        self.obs_mean = ckpt.get("obs_mean", torch.zeros(ckpt["obs_dim"]))
        self.obs_std = ckpt.get("obs_std", torch.ones(ckpt["obs_dim"]))

    def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
        """Normalize observation using training statistics."""
        return (obs - self.obs_mean) / (self.obs_std + 1e-8)

    def _smooth_action(self, action: torch.Tensor) -> torch.Tensor:
        """EMA smoothing to prevent jerky motion."""
        if self.prev_action is None:
            self.prev_action = action
            return action
        smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed
        return smoothed

    def _safety_check(self, action: torch.Tensor, current_pos: torch.Tensor) -> torch.Tensor:
        """Clamp action to safety limits."""
        delta = action - current_pos
        delta = delta.clamp(-self.max_joint_delta, self.max_joint_delta)
        return current_pos + delta

    @torch.no_grad()
    def run(self, task_id: int, duration: float = 30.0):
        """
        Run policy on real robot.

        Args:
            task_id: 0=grasp, 1=lift, 2=handover
            duration: seconds to run
        """
        print(f"[Deploy] Starting task {task_id} for {duration}s")
        print("[Deploy] Press Ctrl+C to stop")

        task_tensor = torch.tensor([task_id], dtype=torch.long)
        start_time = time.time()
        step_count = 0

        try:
            while time.time() - start_time < duration:
                loop_start = time.time()

                state = self.robot.get_state()
                obs = self._build_observation(state)
                obs = self._normalize_obs(obs)

                action = self.policy(obs.unsqueeze(0), task_tensor).squeeze(0)
                action = self._smooth_action(action)
                action = self._safety_check(action, state["joint_pos"])

                self.robot.set_joint_targets(action.numpy())

                step_count += 1
                if step_count % 100 == 0:
                    elapsed = time.time() - start_time
                    print(
                        f"[Deploy] Step {step_count} | "
                        f"Time: {elapsed:.1f}s | "
                        f"Action norm: {action.norm():.3f}"
                    )

                elapsed_step = time.time() - loop_start
                if elapsed_step < self.dt:
                    time.sleep(self.dt - elapsed_step)

        except KeyboardInterrupt:
            print("[Deploy] Stopped by user")
        finally:
            print("[Deploy] Moving to safe pose...")
            self.robot.move_to_default(duration=3.0)
            print("[Deploy] Done")

    def _build_observation(self, state: dict) -> torch.Tensor:
        """Build observation tensor from robot state dict."""
        return torch.cat([
            state["joint_pos"],
            state["joint_vel"],
            state["object_pos"],
            state["object_quat"],
            state["object_vel"],
            state["object_angular_vel"],
            state["tactile_left"],
            state["tactile_right"],
        ])

Experimental Results

Task	Sim Success (%)	Real Success (%)	Sim-to-Real Gap
Grasp-and-Reach	92.3	78.5	13.8%
Box Lift	87.1	71.2	15.9%
Bimanual Handover	83.6	64.8	18.8%

Key observations:

Sim-to-real gap is ~14-19% — much smaller than methods without real-to-sim tuning (typical gap is 30-50%)
Bimanual handover is hardest because it requires precise coordination timing between two hands
Real-to-sim tuning reduces the gap by ~10% compared to domain randomization alone

Key Lessons

1. Real-to-Sim > Domain Randomization Alone

2. Divide-and-Conquer Reduces Training Time 3-5x

3. Action Smoothing Is Mandatory for Hardware

References

Sim-to-Real Reinforcement Learning for Humanoid Dexterous Manipulation — arXiv 2502.20396, Feb 2025
Sim-to-Real: Learning Agile Locomotion For Quadruped Robots — Tan, J., Zhang, T., Coumans, E., et al., RSS 2018
Learning Dexterous In-Hand Manipulation — OpenAI, Andrychowicz, M., et al., IJRR 2020
DexMV: Imitation Learning for Dexterous Manipulation from Human Videos — Qin, Y., et al., ECCV 2022

From Simulation to Real Robot: The Hardest Problem in RL

Pipeline Overview

Why Divide-and-Conquer?

1. Real-to-Sim Parameter Identification

2. Reward Function Design for 3 Tasks

3. Divide-and-Conquer: Specialist Training and Distillation

Specialist Training

Policy Distillation: Specialists to Generalist

4. Domain Randomization Configuration

5. Hybrid Object Representation

6. Complete Deployment Script

Experimental Results

Key Lessons

1. Real-to-Sim > Domain Randomization Alone

2. Divide-and-Conquer Reduces Training Time 3-5x

3. Action Smoothing Is Mandatory for Hardware

Further Reading

Related Posts

References

Nguyễn Anh Tuấn

Related Posts

FAME: RL Thích Ứng Lực cho Humanoid Manipulation

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation

From Simulation to Real Robot: The Hardest Problem in RL

Pipeline Overview

Why Divide-and-Conquer?

1. Real-to-Sim Parameter Identification

2. Reward Function Design for 3 Tasks

3. Divide-and-Conquer: Specialist Training and Distillation

Specialist Training

Policy Distillation: Specialists to Generalist

4. Domain Randomization Configuration

5. Hybrid Object Representation

6. Complete Deployment Script

Experimental Results

Key Lessons

1. Real-to-Sim > Domain Randomization Alone

2. Divide-and-Conquer Reduces Training Time 3-5x

3. Action Smoothing Is Mandatory for Hardware

Further Reading

Related Posts

References

Nguyễn Anh Tuấn

Related Posts

FAME: RL Thích Ứng Lực cho Humanoid Manipulation

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

WholeBodyVLA: video egocentric + RL loco-manipulation