Sim-to-Real Transfer: Train simulation, chạy thực tế

Sim-to-Real Transfer — Vì sao train simulation lại quan trọng?

Sim-to-real transfer là kỹ thuật train mô hình AI trong simulation rồi deploy trực tiếp lên robot thật. Đây là một trong những chủ đề nóng nhất trong robotics research hiện nay, vì lý do đơn giản: thu thập data trên robot thật quá đắt và quá chậm.

Trong simulation, bạn có thể chạy 10,000 robot song song, mỗi robot thử 1,000 episodes/giờ. Trên robot thật? 1 robot, 10 episodes/giờ, và mỗi lần fail có thể hỏng hardware. Simulation nhanh hơn 1 triệu lần và an toàn hơn vô hạn.

Nhưng có một vấn đề lớn: reality gap — mô hình train trong simulation không hoạt động ngoài đời thật.

The Reality Gap — Vấn đề cốt lõi

Reality gap xuất phát từ sự khác biệt giữa simulation và thực tế:

Visual Gap

Simulation	Thực tế
Perfect lighting	Shadows, reflections, glare
Clean textures	Scratches, dust, wear
Exact object models	Shape variations, deformation
No motion blur	Camera blur, rolling shutter
Perfect camera model	Lens distortion, noise

Dynamics Gap

Simulation	Thực tế
Rigid body physics	Soft contacts, deformation
Perfect friction model	Variable friction, surface conditions
No actuator delay	Motor latency, communication delay
Exact mass/inertia	Manufacturing tolerances
Deterministic	Stochastic, non-repeatable

Một policy chạy perfect trong simulation có thể fail hoàn toàn trên robot thật chỉ vì friction coefficient sai 10% hoặc camera exposure khác.

Kỹ thuật 1: Domain Randomization

Domain randomization (DR) là kỹ thuật đầu tiên và vẫn phổ biến nhất để bridge reality gap. Ý tưởng: nếu model đã thấy đủ nhiều variations trong simulation, thực tế chỉ là thêm một variation nữa.

Paper gốc: Tobin et al., 2017 — Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

Visual Domain Randomization

Randomize mọi thứ liên quan đến rendering:

"""
Visual Domain Randomization cho robot manipulation
Sử dụng Isaac Lab / Isaac Sim
"""
import omni.isaac.lab as lab
from omni.isaac.lab.envs import ManagerBasedRLEnv
import numpy as np


class VisualRandomizationConfig:
    """Config cho visual domain randomization."""

    def __init__(self):
        # Lighting randomization
        self.light_intensity_range = (300, 3000)      # lux
        self.light_color_range = (0.8, 1.2)            # RGB multiplier
        self.num_random_lights = (1, 5)                 # Số đèn random
        self.light_position_range = (-2.0, 2.0)        # meters

        # Camera randomization
        self.camera_fov_range = (55, 75)                # degrees
        self.camera_position_noise = 0.02               # meters
        self.camera_rotation_noise = 3.0                # degrees

        # Texture randomization
        self.table_color_range = (0.2, 0.9)             # RGB
        self.object_color_range = (0.1, 1.0)
        self.randomize_background = True

        # Distractor objects
        self.num_distractors = (0, 5)
        self.distractor_scale_range = (0.5, 2.0)


def apply_visual_randomization(env, config: VisualRandomizationConfig):
    """Apply visual randomization mỗi episode reset."""

    # Random lighting
    num_lights = np.random.randint(*config.num_random_lights)
    for i in range(num_lights):
        intensity = np.random.uniform(*config.light_intensity_range)
        color = np.random.uniform(*config.light_color_range, size=3)
        position = np.random.uniform(
            config.light_position_range,
            config.light_position_range,
            size=3,
        )
        env.scene.add_light(
            position=position,
            intensity=intensity,
            color=color,
        )

    # Random camera pose (small perturbation)
    cam_pos_noise = np.random.normal(0, config.camera_position_noise, size=3)
    cam_rot_noise = np.random.normal(0, config.camera_rotation_noise, size=3)
    env.scene.camera.set_pose(
        position=env.scene.camera.default_position + cam_pos_noise,
        rotation=env.scene.camera.default_rotation + cam_rot_noise,
    )

    # Random table texture
    table_color = np.random.uniform(*config.table_color_range, size=3)
    env.scene.table.set_color(table_color)

    return env

Dynamics Domain Randomization

Randomize physical parameters — quan trọng hơn visual cho control tasks:

class DynamicsRandomizationConfig:
    """Config cho dynamics domain randomization."""

    def __init__(self):
        # Object properties
        self.mass_scale_range = (0.5, 2.0)          # Multiplier trên nominal mass
        self.friction_range = (0.3, 1.5)             # Coefficient of friction
        self.restitution_range = (0.0, 0.5)          # Bounciness

        # Robot properties
        self.joint_damping_scale = (0.8, 1.2)
        self.joint_friction_scale = (0.5, 2.0)
        self.actuator_strength_scale = (0.85, 1.15)  # +-15%

        # Control properties
        self.action_delay_range = (0, 3)              # Frames of delay
        self.action_noise_std = 0.02                  # Gaussian noise on actions
        self.observation_noise_std = 0.01             # Noise on sensor readings

        # Gravity
        self.gravity_noise = 0.05                     # m/s^2


def apply_dynamics_randomization(env, config: DynamicsRandomizationConfig):
    """Apply dynamics randomization mỗi episode."""

    # Randomize object mass
    mass_scale = np.random.uniform(*config.mass_scale_range)
    env.scene.target_object.set_mass(
        env.scene.target_object.default_mass * mass_scale
    )

    # Randomize friction
    friction = np.random.uniform(*config.friction_range)
    env.scene.target_object.set_friction(friction)
    env.scene.table.set_friction(friction * np.random.uniform(0.8, 1.2))

    # Randomize actuator strength
    for joint in env.robot.joints:
        strength_scale = np.random.uniform(*config.actuator_strength_scale)
        joint.set_max_force(joint.default_max_force * strength_scale)

    # Random action delay
    env.action_delay = np.random.randint(*config.action_delay_range)

    return env

Automatic Domain Randomization (ADR)

OpenAI đã nâng DR lên level mới với ADR — tự động mở rộng randomization range khi policy đã master range hiện tại:

Ban đầu:  friction ∈ [0.9, 1.1]  (gần realistic)
Policy master → mở rộng: friction ∈ [0.7, 1.3]
Policy master → mở rộng: friction ∈ [0.3, 1.5]
...
Cuối: friction ∈ [0.1, 2.0]  (cực kỳ diverse)

Kết quả: policy robust đến mức thực tế chỉ là "easy mode" so với training distribution.

Kỹ thuật 2: System Identification (SysID)

Thay vì randomize mọi thứ, SysID đo chính xác physical parameters của robot thật rồi set simulation cho match:

"""
System Identification workflow
Đo robot thật → calibrate simulation
"""
import numpy as np
from scipy.optimize import minimize


def measure_joint_dynamics(robot, joint_id: int, num_trials: int = 20):
    """
    Đo response của 1 joint trên robot thật.
    Gửi step command, record trajectory.
    """
    trajectories = []
    for _ in range(num_trials):
        # Gửi step command
        robot.set_joint_position(joint_id, target=1.0)

        # Record response
        traj = []
        for t in range(100):  # 100 timesteps
            pos = robot.get_joint_position(joint_id)
            vel = robot.get_joint_velocity(joint_id)
            torque = robot.get_joint_torque(joint_id)
            traj.append([pos, vel, torque])

        trajectories.append(np.array(traj))

    return np.array(trajectories)  # shape: (trials, timesteps, 3)


def optimize_sim_parameters(real_trajectories, sim_env):
    """
    Optimize simulation parameters để match robot thật.
    """
    def cost_function(params):
        damping, friction, stiffness = params

        # Set simulation parameters
        sim_env.set_joint_damping(damping)
        sim_env.set_joint_friction(friction)
        sim_env.set_joint_stiffness(stiffness)

        # Run same trajectory in sim
        sim_traj = sim_env.step_response(target=1.0, timesteps=100)

        # MSE giữa sim và real
        error = np.mean((real_trajectories.mean(axis=0) - sim_traj) ** 2)
        return error

    # Optimize
    result = minimize(
        cost_function,
        x0=[1.0, 0.1, 100.0],  # Initial guess
        method="Nelder-Mead",
        options={"maxiter": 1000},
    )

    print(f"Optimized params: damping={result.x[0]:.3f}, "
          f"friction={result.x[1]:.3f}, stiffness={result.x[2]:.1f}")
    return result.x

SysID vs Domain Randomization

	Domain Randomization	System Identification
Approach	Train trên nhiều variations	Match sim chính xác với real
Effort	Low (chỉ cần set ranges)	High (cần đo robot thật)
Robustness	Cao (diverse training)	Thấp hơn (overfits to 1 config)
Best for	Vision + diverse environments	Precise control + known robot
Combine?	Có — SysID center + DR around it

Best practice: Dùng SysID để xác định nominal parameters, rồi apply DR quanh nominal values đó. Ví dụ: SysID đo friction = 0.8, thì DR range = [0.5, 1.1] thay vì [0.1, 2.0].

Kỹ thuật 3: Curriculum Learning

Thay vì ném policy vào randomized environment ngay từ đầu, tăng dần độ khó:

"""
Curriculum Learning cho sim-to-real transfer
Tăng dần domain randomization range theo training progress
"""

class CurriculumScheduler:
    def __init__(self, total_steps: int = 1_000_000):
        self.total_steps = total_steps

    def get_randomization_scale(self, current_step: int) -> float:
        """
        Return scale factor 0→1 cho randomization ranges.
        0 = no randomization (easy), 1 = full randomization (hard)
        """
        progress = current_step / self.total_steps
        # Smooth ramp-up
        return min(1.0, progress * 2)  # Full randomization at 50% training

    def get_task_difficulty(self, current_step: int) -> dict:
        scale = self.get_randomization_scale(current_step)

        return {
            # Visual randomization
            "light_intensity_range": (
                1000 - 700 * scale,
                1000 + 2000 * scale,
            ),
            "camera_noise": 0.02 * scale,
            "num_distractors": int(5 * scale),

            # Dynamics randomization
            "friction_range": (
                0.8 - 0.5 * scale,
                0.8 + 0.7 * scale,
            ),
            "mass_scale_range": (
                1.0 - 0.5 * scale,
                1.0 + 1.0 * scale,
            ),
            "action_delay": int(3 * scale),
            "action_noise": 0.02 * scale,

            # Task difficulty
            "object_position_range": 0.1 + 0.3 * scale,  # meters
            "goal_tolerance": 0.05 - 0.03 * scale,        # tighter goal
        }


# Sử dụng trong training loop
scheduler = CurriculumScheduler(total_steps=2_000_000)

for step in range(2_000_000):
    difficulty = scheduler.get_task_difficulty(step)
    env.set_randomization(difficulty)

    obs = env.reset()
    action = policy(obs)
    next_obs, reward, done, info = env.step(action)
    # ... update policy

So sánh Simulation Tools

	Isaac Lab	MuJoCo	Gazebo
Engine	PhysX 5 (GPU)	MuJoCo (CPU/GPU)	ODE/Bullet/DART
GPU parallel	10,000+ envs	1,000+ (MJX)	Không
Rendering	RTX ray-tracing	Basic OpenGL	OGRE/Ignition
ROS integration	Isaac ROS	ros2_mujoco	Native
Domain Rand	Built-in, extensive	Manual	Plugin-based
License	Free (NVIDIA)	Apache 2.0	Apache 2.0
Best for	Large-scale RL, visual sim-to-real	Contact-rich manipulation	ROS 2 integration, multi-robot
Learning curve	Steep (Omniverse)	Medium	Medium
Typical FPS	100K+ steps/s	50K+ steps/s	1K steps/s

Khi nào dùng tool nào?

Isaac Lab: Bạn cần massive parallelism (RL training), visual sim-to-real, hoặc NVIDIA GPU hardware
MuJoCo: Contact-rich manipulation tasks, research benchmark, hoặc khi cần accurate contact physics
Gazebo: Prototype nhanh với ROS 2, multi-robot simulation, hoặc khi không cần RL training

Step-by-step: Train manipulation policy trong Isaac Lab

Ví dụ end-to-end: train robot arm pick-and-place trong Isaac Lab, transfer sang robot thật.

Bước 1: Setup environment

"""
Isaac Lab environment cho pick-and-place
GPU-accelerated, 4096 parallel envs
"""
import omni.isaac.lab as lab
from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.assets import ArticulationCfg, RigidObjectCfg
from omni.isaac.lab.managers import (
    ObservationGroupCfg,
    RewardTermCfg,
    EventTermCfg,
)


class PickPlaceEnvCfg(ManagerBasedRLEnvCfg):
    """Config cho pick-and-place environment."""

    # Scene
    num_envs = 4096
    env_spacing = 2.0

    # Robot: Franka Panda
    robot = ArticulationCfg(
        prim_path="/World/envs/env_.*/Robot",
        spawn=lab.sim.schemas.FrankaCfg(),
        actuators={
            "panda_shoulder": lab.actuators.ImplicitActuatorCfg(
                joint_names_expr=["panda_joint[1-4]"],
                stiffness=80.0,
                damping=4.0,
            ),
            "panda_forearm": lab.actuators.ImplicitActuatorCfg(
                joint_names_expr=["panda_joint[5-7]"],
                stiffness=40.0,
                damping=2.0,
            ),
        },
    )

    # Target object
    object = RigidObjectCfg(
        prim_path="/World/envs/env_.*/Object",
        spawn=lab.sim.schemas.CuboidCfg(
            size=(0.04, 0.04, 0.04),
            mass=0.1,
        ),
    )

    # Observations
    observations = ObservationGroupCfg(
        policy=ObservationGroupCfg(
            joint_pos=True,
            joint_vel=True,
            object_pos=True,
            object_rot=True,
            goal_pos=True,
        )
    )

    # Rewards
    rewards = {
        "reaching": RewardTermCfg(
            func=reaching_reward,
            weight=1.0,
        ),
        "grasping": RewardTermCfg(
            func=grasping_reward,
            weight=5.0,
        ),
        "placing": RewardTermCfg(
            func=placing_reward,
            weight=10.0,
        ),
    }

    # Domain Randomization events
    events = {
        "reset_object_position": EventTermCfg(
            func=randomize_object_position,
            mode="reset",
            params={"range": 0.15},
        ),
        "randomize_mass": EventTermCfg(
            func=randomize_object_mass,
            mode="reset",
            params={"scale_range": (0.5, 2.0)},
        ),
        "randomize_friction": EventTermCfg(
            func=randomize_friction,
            mode="reset",
            params={"range": (0.4, 1.2)},
        ),
    }

Bước 2: Train với PPO

# Train 4096 parallel envs, 50M timesteps
# Trên RTX 4090: ~2 giờ
python -m omni.isaac.lab_tasks.train \
    --task Isaac-Pick-Place-v0 \
    --num_envs 4096 \
    --max_iterations 5000 \
    --algo ppo \
    --headless

# Evaluate trong sim
python -m omni.isaac.lab_tasks.play \
    --task Isaac-Pick-Place-v0 \
    --checkpoint logs/ppo/model_5000.pt

Bước 3: Transfer sang robot thật

"""
Deploy trained policy lên Franka Panda thật
Sử dụng Isaac ROS + Franka ROS 2 driver
"""
import rclpy
from rclpy.node import Node
import torch
import numpy as np
from sensor_msgs.msg import JointState
from geometry_msgs.msg import PoseStamped


class PolicyDeployNode(Node):
    def __init__(self):
        super().__init__("policy_deploy")

        # Load trained policy
        self.policy = torch.jit.load("/models/pick_place_policy.pt")
        self.policy.eval()

        # Observation buffer
        self.joint_pos = np.zeros(7)
        self.joint_vel = np.zeros(7)
        self.object_pos = np.zeros(3)

        # Subscribers
        self.create_subscription(
            JointState, "/franka/joint_states", self.joint_callback, 10
        )
        self.create_subscription(
            PoseStamped, "/object/pose", self.object_callback, 10
        )

        # Publisher
        self.cmd_pub = self.create_publisher(
            JointState, "/franka/joint_commands", 10
        )

        # Control loop at 20 Hz
        self.create_timer(0.05, self.control_loop)

    def joint_callback(self, msg):
        self.joint_pos = np.array(msg.position[:7])
        self.joint_vel = np.array(msg.velocity[:7])

    def object_callback(self, msg):
        self.object_pos = np.array([
            msg.pose.position.x,
            msg.pose.position.y,
            msg.pose.position.z,
        ])

    def control_loop(self):
        # Build observation (same format as simulation)
        obs = np.concatenate([
            self.joint_pos,      # 7
            self.joint_vel,      # 7
            self.object_pos,     # 3
            self.goal_pos,       # 3 (set beforehand)
        ])

        # Inference
        with torch.no_grad():
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
            action = self.policy(obs_tensor).squeeze().numpy()

        # Safety clamp
        action = np.clip(action, -0.05, 0.05)  # Max delta per step

        # Publish command
        cmd = JointState()
        cmd.position = (self.joint_pos + action[:7]).tolist()
        self.cmd_pub.publish(cmd)

Success Stories

OpenAI Rubik's Cube (2019)

Solving Rubik's Cube with a Robot Hand — train hoàn toàn trong simulation với Automatic Domain Randomization, deploy zero-shot lên Shadow Dexterous Hand thật. Đây là demo sim-to-real ấn tượng nhất từ trước đến nay cho dexterous manipulation.

Key insights:

ADR randomize hàng trăm parameters cùng lúc
Policy develop emergent meta-learning — tự adapt khi gặp dynamics mới
Train trong 1 năm trên cluster, nhưng kết quả unprecedented

Agility Robotics Digit Locomotion

Real-World Humanoid Locomotion with Reinforcement Learning — train causal transformer policy trong simulation, deploy zero-shot lên Digit humanoid robot. Robot đi được trên nhiều terrain khác nhau mà chưa bao giờ thấy trong training.