The Final Step: From Simulation to the Real World
Throughout the previous 9 posts, we built everything in simulation: data collection via teleop, single-arm training, multi-object manipulation, dual-arm tasks, mobile manipulation, and humanoid control. But simulation is only a stepping stone — the ultimate goal has always been real robots operating in the real world.
Sim-to-real transfer is where many projects fail. A policy achieving 95% success rate in sim can score 0% on a real robot. This post will help you avoid those pitfalls with a proven pipeline.
Why is Sim-to-Real Hard?
The Reality Gap
| Factor | Simulation | Real World |
|---|---|---|
| Physics | Perfect rigid body, known friction | Deformable, unknown parameters |
| Rendering | Clean synthetic images | Lighting variation, reflections, shadows |
| Latency | ~0ms | 10-50ms control loop, 30-100ms vision |
| Sensor noise | Optional, Gaussian | Complex, correlated, drifting |
| Actuator | Perfect torque/position | Backlash, friction, thermal drift |
| Objects | Exact CAD model | Manufacturing tolerance, wear |
Research from Tobin et al. (2017) demonstrated that domain randomization is key to bridging the reality gap, but it's only one piece of the puzzle.
Domain Randomization for Visual Policies
Visual Randomization
VLA policies rely heavily on camera input. Visual domain randomization is critical:
```python
# LeRobot visual domain randomization config
visual_randomization = {
    # Camera randomization
    "camera_position_noise": 0.02,      # ±2 cm position noise
    "camera_rotation_noise": 3.0,       # ±3 degrees rotation
    "camera_fov_range": [55, 75],       # field of view range

    # Lighting randomization
    "light_intensity_range": [0.3, 1.5],
    "light_color_range": [0.8, 1.2],    # per-channel multiplier
    "light_position_noise": 0.5,        # ±50 cm
    "num_random_lights": [1, 4],

    # Texture randomization
    "table_texture_randomize": True,
    "object_color_randomize": True,
    "background_randomize": True,

    # Post-processing augmentation
    "brightness_range": [0.7, 1.3],
    "contrast_range": [0.8, 1.2],
    "saturation_range": [0.8, 1.2],
    "noise_std_range": [0.0, 0.02],
}
```
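The post-processing ranges above map directly onto standard image augmentations. A minimal NumPy sketch of how they might be applied per frame (the `augment` helper is illustrative, not a LeRobot API):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, cfg):
    """Apply brightness/contrast/saturation jitter plus Gaussian noise.

    image: HxWx3 float array in [0, 1]. cfg: the randomization ranges above.
    """
    b = rng.uniform(*cfg["brightness_range"])
    c = rng.uniform(*cfg["contrast_range"])
    s = rng.uniform(*cfg["saturation_range"])
    sigma = rng.uniform(*cfg["noise_std_range"])

    image = image * b                                     # brightness: global scale
    image = (image - image.mean()) * c + image.mean()     # contrast: scale around mean
    gray = image.mean(axis=2, keepdims=True)
    image = gray + (image - gray) * s                     # saturation: blend with gray
    image = image + rng.normal(0.0, sigma, image.shape)   # sensor noise
    return np.clip(image, 0.0, 1.0)

cfg = {"brightness_range": [0.7, 1.3], "contrast_range": [0.8, 1.2],
       "saturation_range": [0.8, 1.2], "noise_std_range": [0.0, 0.02]}
frame = rng.uniform(size=(224, 224, 3))
aug = augment(frame, cfg)
```

Each training frame draws fresh jitter values, so the policy never sees the same lighting twice.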
Physics Randomization
```python
# Physics domain randomization
physics_randomization = {
    # Object properties
    "object_mass_range": [0.8, 1.2],        # ±20% mass
    "object_friction_range": [0.3, 1.5],
    "object_size_range": [0.9, 1.1],        # ±10% size

    # Robot properties
    "joint_friction_range": [0.5, 2.0],
    "joint_damping_range": [0.8, 1.2],
    "actuator_strength_range": [0.9, 1.1],

    # Control simulation
    "action_delay_range": [0, 3],           # 0-3 frames delay
    "action_noise_std": 0.01,
    "observation_delay_range": [0, 2],

    # Environment
    "gravity_noise": 0.05,                  # ±5% gravity
    "table_height_range": [-0.02, 0.02],    # ±2 cm
}
```
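In practice, each episode reset draws one concrete value from every range. A sketch of that sampling step (the `sample_physics` helper is an assumption for illustration, not a LeRobot API):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_physics(cfg):
    """Draw one concrete parameter set from the randomization ranges (per episode reset).

    Integer-valued parameters such as frame delays would be rounded before use.
    """
    return {key.replace("_range", ""): float(rng.uniform(lo, hi))
            for key, (lo, hi) in cfg.items() if key.endswith("_range")}

physics_randomization = {
    "object_mass_range": [0.8, 1.2],
    "object_friction_range": [0.3, 1.5],
    "joint_damping_range": [0.8, 1.2],
    "action_delay_range": [0, 3],
}
params = sample_physics(physics_randomization)
```

Resampling per episode (rather than per step) keeps each rollout physically consistent while still covering the whole range over a training run.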
The paper Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (Tobin et al., 2017) showed that with sufficient randomization breadth, policies learn features that are invariant across sim and real.
Camera Calibration and Real-World Setup
Intrinsic Calibration
A mismatch between the sim and real camera setup is the most common deployment error. Start by recovering the real camera's intrinsics:
```python
import glob

import cv2
import numpy as np

def calibrate_camera(images_dir, checkerboard_size=(9, 6)):
    """Camera calibration using a checkerboard pattern."""
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

    # 3D object points of the checkerboard corners (z = 0 plane)
    objp = np.zeros((checkerboard_size[0] * checkerboard_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:checkerboard_size[0],
                           0:checkerboard_size[1]].T.reshape(-1, 2)

    objpoints, imgpoints = [], []
    for fname in glob.glob(f"{images_dir}/*.png"):
        img = cv2.imread(fname)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        ret, corners = cv2.findChessboardCorners(gray, checkerboard_size, None)
        if ret:
            objpoints.append(objp)
            corners2 = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
            imgpoints.append(corners2)

    ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
        objpoints, imgpoints, gray.shape[::-1], None, None
    )
    print(f"Reprojection error: {ret:.4f}")  # RMS error in pixels
    print(f"Camera matrix:\n{mtx}")
    print(f"Distortion coefficients: {dist}")
    return mtx, dist

mtx, dist = calibrate_camera("calibration_images/")
```
Hand-Eye Calibration (Camera-to-Robot)
```python
def hand_eye_calibration(robot_poses, camera_poses):
    """AX = XB hand-eye calibration.

    robot_poses: list of 4x4 homogeneous transforms (base to end-effector)
    camera_poses: list of 4x4 transforms (camera to calibration target)
    """
    R_gripper2base = [pose[:3, :3] for pose in robot_poses]
    t_gripper2base = [pose[:3, 3] for pose in robot_poses]
    R_target2cam = [pose[:3, :3] for pose in camera_poses]
    t_target2cam = [pose[:3, 3] for pose in camera_poses]

    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI
    )

    T_cam2gripper = np.eye(4)
    T_cam2gripper[:3, :3] = R_cam2gripper
    T_cam2gripper[:3, 3] = t_cam2gripper.flatten()
    return T_cam2gripper
```
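Once `T_cam2gripper` is known, anything detected in the camera frame can be mapped into the robot base frame by chaining it with the arm's forward kinematics. A toy sketch for an eye-in-hand camera (the transforms here are made-up values for illustration):

```python
import numpy as np

def point_cam_to_base(p_cam, T_cam2gripper, T_gripper2base):
    """Map a 3D point from the camera frame to the robot base frame.

    p_cam: (3,) point in camera coordinates.
    T_cam2gripper: 4x4 hand-eye calibration result (camera -> gripper).
    T_gripper2base: 4x4 current forward kinematics (gripper -> base).
    """
    p_h = np.append(p_cam, 1.0)  # homogeneous coordinates
    return (T_gripper2base @ T_cam2gripper @ p_h)[:3]

# Toy example: camera 5 cm above the gripper, gripper 30 cm above the base origin
T_cam2gripper = np.eye(4); T_cam2gripper[2, 3] = 0.05
T_gripper2base = np.eye(4); T_gripper2base[2, 3] = 0.30
p_base = point_cam_to_base(np.array([0.0, 0.0, 0.1]), T_cam2gripper, T_gripper2base)
# p_base -> [0, 0, 0.45]
```

This chaining is exactly where a 5 mm hand-eye error turns into a 5 mm grasp error, which is why the calibration tolerance matters.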
Inference Optimization
Quantization for Real-Time Inference
VLA models (7B+ params) need optimization for real-time control:
```python
import time

import numpy as np
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b-finetuned",
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Benchmark inference speed
# (obs / dummy_obs: observations preprocessed into the model's input format)
for _ in range(5):  # warm up
    with torch.no_grad():
        _ = model.predict_action(dummy_obs)

times = []
for _ in range(100):
    start = time.perf_counter()
    with torch.no_grad():
        action = model.predict_action(obs)
    times.append(time.perf_counter() - start)

print(f"Inference: {np.mean(times)*1000:.1f}ms ± {np.std(times)*1000:.1f}ms")
print(f"Control frequency: {1/np.mean(times):.1f} Hz")
```
Performance Comparison
| Method | Model Size | Latency | GPU Memory | Success Rate |
|---|---|---|---|---|
| FP32 | 28 GB | 450ms | 32 GB | 100% (baseline) |
| FP16 | 14 GB | 180ms | 16 GB | 99.5% |
| INT8 | 7 GB | 95ms | 10 GB | 98.2% |
| INT4 (GPTQ) | 3.5 GB | 65ms | 6 GB | 94.1% |
| TensorRT FP16 | 14 GB | 45ms | 16 GB | 99.5% |
For typical robot arms, a control frequency of 5-10 Hz is sufficient, so INT8 (~95ms) works well. For faster control, TensorRT is the best option but requires more complex conversion.
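Whether a given quantization level is "fast enough" reduces to a budget check: inference plus vision and control-loop latency must fit inside one control period. A back-of-the-envelope sketch using the table's latencies, assuming roughly 30 ms vision and 10 ms control-loop latency from the reality-gap table:

```python
def max_control_freq(inference_ms, vision_ms=30.0, control_ms=10.0):
    """Upper bound on control frequency (Hz) given per-step latencies in ms."""
    total_ms = inference_ms + vision_ms + control_ms
    return 1000.0 / total_ms

for name, latency_ms in [("FP16", 180), ("INT8", 95), ("TensorRT FP16", 45)]:
    print(f"{name}: up to {max_control_freq(latency_ms):.1f} Hz")
```

With these assumptions INT8 supports roughly 7 Hz, comfortably inside the 5-10 Hz band, while plain FP16 falls below 5 Hz.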
ROS 2 Deployment Node
Complete Policy Server
```python
#!/usr/bin/env python3
"""ROS 2 node for VLA policy deployment."""
import numpy as np
import rclpy
import torch
from cv_bridge import CvBridge
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from trajectory_msgs.msg import JointTrajectoryPoint

from lerobot.common.policies.factory import make_policy


class VLAPolicyNode(Node):
    def __init__(self):
        super().__init__('vla_policy_node')
        # Parameters
        self.declare_parameter('policy_path', '')
        self.declare_parameter('control_freq', 10.0)
        self.declare_parameter('device', 'cuda')
        policy_path = self.get_parameter('policy_path').value
        self.device = self.get_parameter('device').value
        control_freq = self.get_parameter('control_freq').value

        # Load policy
        self.get_logger().info(f'Loading policy from {policy_path}...')
        self.policy = make_policy(
            cfg=None,
            pretrained_policy_name_or_path=policy_path,
        )
        self.policy.to(self.device)
        self.policy.eval()
        self.get_logger().info('Policy loaded successfully!')

        # State
        self.bridge = CvBridge()
        self.latest_image = None
        self.latest_joints = None

        # Subscribers
        self.image_sub = self.create_subscription(
            Image, '/camera/color/image_raw', self.image_callback, 10
        )
        self.joint_sub = self.create_subscription(
            JointState, '/joint_states', self.joint_callback, 10
        )

        # Publisher
        self.action_pub = self.create_publisher(
            JointTrajectoryPoint, '/policy_action', 10
        )

        # Control loop timer
        period = 1.0 / control_freq
        self.timer = self.create_timer(period, self.control_loop)
        self.get_logger().info(f'Control loop running at {control_freq} Hz')

    def image_callback(self, msg):
        self.latest_image = self.bridge.imgmsg_to_cv2(msg, 'rgb8')

    def joint_callback(self, msg):
        self.latest_joints = np.array(msg.position)

    def control_loop(self):
        if self.latest_image is None or self.latest_joints is None:
            return  # wait until both observation streams have arrived

        obs = {
            'observation.images.top': torch.from_numpy(
                self.latest_image
            ).permute(2, 0, 1).unsqueeze(0).float().to(self.device) / 255.0,
            'observation.state': torch.from_numpy(
                self.latest_joints
            ).unsqueeze(0).float().to(self.device),
        }
        with torch.no_grad():
            action = self.policy.select_action(obs)
        action_np = action.squeeze().cpu().numpy()

        msg = JointTrajectoryPoint()
        msg.positions = action_np.tolist()
        msg.time_from_start.sec = 0
        msg.time_from_start.nanosec = int(1e8)  # 100 ms
        self.action_pub.publish(msg)


def main():
    rclpy.init()
    node = VLAPolicyNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
Debugging Common Failures
Failure Taxonomy
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Frozen robot | Robot doesn't move | Observation mismatch (wrong camera, wrong joint order) | Verify obs matches training format |
| Wild oscillation | Robot shakes violently | Action scale mismatch | Check action normalization stats |
| Drift | Robot gradually drifts from target | Latency accumulation | Reduce inference time, add action smoothing |
| Collision | Robot hits obstacles | Missing safety limits | Add workspace bounds, torque limits |
| Grasp fail | Can't grasp objects | Visual domain gap | More visual randomization, finetune on real images |
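The "workspace bounds" and "torque limits" fixes from the table can be implemented as a clamp that sits between the policy and the robot driver. A minimal sketch with hypothetical joint limits (tune `lower`, `upper`, and `max_step` to your robot):

```python
import numpy as np

class SafetyClamp:
    """Clamp policy actions to joint limits and bound the per-step change."""

    def __init__(self, lower, upper, max_step):
        self.lower = np.asarray(lower)       # joint position lower bounds
        self.upper = np.asarray(upper)       # joint position upper bounds
        self.max_step = np.asarray(max_step) # max change per control step (velocity limit)

    def apply(self, action, current_position):
        action = np.clip(action, self.lower, self.upper)  # position limits
        delta = np.clip(action - current_position, -self.max_step, self.max_step)
        return current_position + delta                    # rate limit

# Hypothetical limits for a 3-DoF example (radians)
clamp = SafetyClamp(lower=[-1.5] * 3, upper=[1.5] * 3, max_step=[0.05] * 3)
safe = clamp.apply(np.array([2.0, 0.0, -0.3]), current_position=np.zeros(3))
# safe -> [0.05, 0.0, -0.05]: out-of-range and too-fast commands are both capped
```

A clamp like this turns a "wild oscillation" failure into a slow, recoverable drift, which is far safer to debug on hardware.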
Action Smoothing Filter
```python
import numpy as np

class ActionSmoother:
    """Exponential moving average for smooth actions."""

    def __init__(self, alpha=0.7, action_dim=7):
        self.alpha = alpha  # weight on the newest action; lower = smoother but laggier
        self.prev_action = np.zeros(action_dim)
        self.initialized = False

    def smooth(self, raw_action):
        if not self.initialized:
            self.prev_action = raw_action.copy()
            self.initialized = True
            return raw_action
        smoothed = self.alpha * raw_action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed.copy()
        return smoothed

# raw_policy_action: the latest action from policy.select_action()
smoother = ActionSmoother(alpha=0.7)
smoothed_action = smoother.smooth(raw_policy_action)
```
Complete Deployment Checklist
Pre-deployment
- Camera intrinsic calibration (reprojection error < 0.5px)
- Hand-eye calibration (position error < 5mm)
- Joint order matches training (verify with manual test)
- Action space bounds verified against robot limits
- Emergency stop (e-stop) tested and accessible
- Workspace boundaries configured in software
- Inference latency measured (< 200ms for 5Hz control)
- GPU memory sufficient (check with `nvidia-smi`)
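Several of the checklist items above (joint order, observation format) can be automated before the robot ever moves. A sketch that compares the live setup against metadata saved at training time (the metadata keys here are assumptions, not a LeRobot format):

```python
import numpy as np

def check_observation(image, joints, expected):
    """Return a list of mismatches between live observations and training metadata."""
    problems = []
    if image.shape != tuple(expected["image_shape"]):
        problems.append(f"image shape {image.shape} != {expected['image_shape']}")
    if image.dtype != np.uint8:
        problems.append(f"image dtype {image.dtype}, expected uint8")
    if list(joints["names"]) != list(expected["joint_names"]):
        problems.append(f"joint order {joints['names']} != {expected['joint_names']}")
    return problems

expected = {"image_shape": [480, 640, 3],
            "joint_names": ["shoulder", "elbow", "wrist"]}
live_joints = {"names": ["elbow", "shoulder", "wrist"]}  # wrong order on purpose
issues = check_observation(np.zeros((480, 640, 3), np.uint8), live_joints, expected)
# issues -> one joint-order mismatch
```

Running a check like this on the first received messages catches the "frozen robot" failure mode from the taxonomy before the control loop starts.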
During deployment
- Start with slow speed (10% of max velocity)
- Run 5 episodes with soft objects first
- Monitor joint torques — spikes indicate collision
- Log all observations and actions for post-analysis
- Gradually increase speed over 20+ successful episodes
Post-deployment
- Calculate success rate over 50+ episodes
- Analyze failure modes (categorize using table above)
- Collect real-world failure cases for fine-tuning
- Compare sim vs real performance gap
Real-World Fine-tuning with LeRobot
If the sim-trained policy achieves < 80% on the real robot, fine-tuning on real data is the next step:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.scripts.train import train

# 1. Collect 50-100 real demonstrations
#    (using teleop as in post 2, but on the real robot)

# 2. Load the sim-pretrained policy
policy = DiffusionPolicy.from_pretrained("my-sim-trained-policy")

# 3. Load the real dataset
real_dataset = LeRobotDataset("my-real-demos")

# 4. Fine-tune with a lower learning rate
#    (the exact train() signature varies between LeRobot versions)
train(
    policy=policy,
    dataset=real_dataset,
    training_steps=10000,   # far fewer steps than sim training
    learning_rate=1e-5,     # ~10x lower than the sim run
    batch_size=32,
    eval_freq=1000,
)
```
This closes the sim-to-real loop (cf. Chebotar et al., 2019): train in sim, fine-tune on real data, and then feed the real-world rollouts back into the simulator to improve its fidelity.
Series Conclusion
Across 10 posts, we've built a complete pipeline:
1. Framework deep dive: understanding LeRobot inside out
2. Data collection: quality demonstration recording
3. Single-arm training: ACT vs Diffusion Policy
4. Multi-object: scaling to multiple objects
5. Long-horizon: multi-step task planning
6. Dual-arm setup: bimanual calibration
7. Dual-arm tasks: coordination training
8. Mobile manipulation: base + arms
9. Humanoid: whole-body control
10. Sim-to-Real (this post): deployment pipeline
LeRobot is evolving rapidly — follow the GitHub repo for updates. Good luck with your deployment!
References
- Domain Randomization for Transferring Deep Neural Networks — Tobin et al., 2017
- Sim-to-Real Robot Learning from Pixels with Progressive Nets — Rusu et al., 2017
- Closing the Sim-to-Real Loop — Chebotar et al., 2019
- LeRobot: State-of-the-art Machine Learning for Real-World Robotics — Hugging Face, 2024
Related Posts
- Sim-to-Real Pipeline: From Training to Real Robots — Simulation series post 5
- Hands-on: Fine-tune OpenVLA with LeRobot — AI series post 7
- Domain Randomization: The Key to Sim-to-Real — Simulation series post 4