
Sim-to-Real Transfer: Deploying VLA Policies to Real Robots

Complete pipeline from simulation to real robot — domain randomization, camera calibration, inference optimization, and ROS 2 deployment.

Nguyễn Anh Tuấn · April 8, 2026 · 9 min read

The Final Step: From Simulation to the Real World

Throughout the previous 9 posts, we built everything in simulation: data collection via teleop, single-arm training, multi-object manipulation, dual-arm tasks, mobile manipulation, and humanoid control. But simulation is only a stepping stone — the ultimate goal has always been real robots operating in the real world.

Sim-to-real transfer is where many projects fail. A policy achieving 95% success rate in sim can score 0% on a real robot. This post will help you avoid those pitfalls with a proven pipeline.

Real robot arm in a lab environment

Why is Sim-to-Real Hard?

The Reality Gap

| Factor | Simulation | Real World |
| --- | --- | --- |
| Physics | Perfect rigid bodies, known friction | Deformable objects, unknown parameters |
| Rendering | Clean synthetic images | Lighting variation, reflections, shadows |
| Latency | ~0 ms | 10-50 ms control loop, 30-100 ms vision |
| Sensor noise | Optional, Gaussian | Complex, correlated, drifting |
| Actuators | Perfect torque/position tracking | Backlash, friction, thermal drift |
| Objects | Exact CAD model | Manufacturing tolerance, wear |

Research from Tobin et al. (2017) demonstrated that domain randomization is key to bridging the reality gap, but it's only one piece of the puzzle.

Domain Randomization for Visual Policies

Visual Randomization

VLA policies rely heavily on camera input. Visual domain randomization is critical:

# LeRobot visual domain randomization config
visual_randomization = {
    # Camera randomization
    "camera_position_noise": 0.02,       # ±2cm position noise
    "camera_rotation_noise": 3.0,        # ±3 degrees rotation
    "camera_fov_range": [55, 75],        # Field of view range
    
    # Lighting randomization
    "light_intensity_range": [0.3, 1.5],
    "light_color_range": [0.8, 1.2],     # Per-channel multiplier
    "light_position_noise": 0.5,         # ±50cm
    "num_random_lights": [1, 4],
    
    # Texture randomization
    "table_texture_randomize": True,
    "object_color_randomize": True,
    "background_randomize": True,
    
    # Post-processing augmentation
    "brightness_range": [0.7, 1.3],
    "contrast_range": [0.8, 1.2],
    "saturation_range": [0.8, 1.2],
    "noise_std_range": [0.0, 0.02],
}
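The post-processing ranges above can also be applied as a standalone augmentation step outside the simulator. A minimal numpy sketch of that idea (the `augment_image` helper and its uniform-sampling scheme are illustrative, not LeRobot API):

```python
import numpy as np

def augment_image(img, rng,
                  brightness_range=(0.7, 1.3),
                  contrast_range=(0.8, 1.2),
                  noise_std_range=(0.0, 0.02)):
    """Randomized brightness, contrast, and Gaussian noise for a float image in [0, 1]."""
    img = img.astype(np.float32)

    # Brightness: multiply every pixel by one random factor
    img = img * rng.uniform(*brightness_range)

    # Contrast: scale deviation from the mean intensity
    mean = img.mean()
    img = (img - mean) * rng.uniform(*contrast_range) + mean

    # Additive Gaussian pixel noise with a randomly drawn std
    img = img + rng.normal(0.0, rng.uniform(*noise_std_range), size=img.shape)

    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
augmented = augment_image(np.full((64, 64, 3), 0.5), rng)
```

Applying the same ranges both in-sim and as a training-time augmentation doubles the coverage you get from one config.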

Physics Randomization

# Physics domain randomization
physics_randomization = {
    # Object properties
    "object_mass_range": [0.8, 1.2],      # ±20% mass
    "object_friction_range": [0.3, 1.5],
    "object_size_range": [0.9, 1.1],      # ±10% size
    
    # Robot properties
    "joint_friction_range": [0.5, 2.0],
    "joint_damping_range": [0.8, 1.2],
    "actuator_strength_range": [0.9, 1.1],
    
    # Control simulation
    "action_delay_range": [0, 3],         # 0-3 frames delay
    "action_noise_std": 0.01,
    "observation_delay_range": [0, 2],
    
    # Environment
    "gravity_noise": 0.05,                # ±5% gravity
    "table_height_range": [-0.02, 0.02],  # ±2cm
}
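These ranges are typically resampled once at the start of every episode, so each rollout sees a different physical world. A hedged sketch of such an episode sampler over a config dict like the one above (the `sample_episode_params` helper is illustrative, not LeRobot API):

```python
import numpy as np

physics_randomization = {
    "object_mass_range": [0.8, 1.2],
    "object_friction_range": [0.3, 1.5],
    "joint_friction_range": [0.5, 2.0],
    "action_delay_range": [0, 3],
}

def sample_episode_params(config, rng):
    """Draw one concrete value per *_range entry for a new episode."""
    params = {}
    for key, value in config.items():
        if key.endswith("_range"):
            lo, hi = value
            if isinstance(lo, int) and isinstance(hi, int):
                # Integer ranges (e.g. frame delays) are sampled inclusively
                params[key[:-6]] = int(rng.integers(lo, hi + 1))
            else:
                params[key[:-6]] = float(rng.uniform(lo, hi))
        else:
            params[key] = value
    return params

params = sample_episode_params(physics_randomization, np.random.default_rng(42))
```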

The paper Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (Tobin et al., 2017) proved that with sufficient randomization breadth, policies learn features that are invariant between sim and real.

Camera Calibration and Real-World Setup

Intrinsic Calibration

Camera placement must match between sim and real — this is the most common error:

import glob

import numpy as np
import cv2

def calibrate_camera(images_dir, checkerboard_size=(9, 6)):
    """Camera calibration using a checkerboard pattern."""
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
    
    # Checkerboard corner positions in the board's own coordinate frame
    objp = np.zeros((checkerboard_size[0] * checkerboard_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:checkerboard_size[0], 
                           0:checkerboard_size[1]].T.reshape(-1, 2)
    
    objpoints, imgpoints = [], []
    
    for fname in glob.glob(f"{images_dir}/*.png"):
        img = cv2.imread(fname)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        ret, corners = cv2.findChessboardCorners(gray, checkerboard_size, None)
        
        if ret:
            objpoints.append(objp)
            corners2 = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
            imgpoints.append(corners2)
    
    if not objpoints:
        raise RuntimeError(f"No checkerboard detected in any image in {images_dir}")
    
    ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
        objpoints, imgpoints, gray.shape[::-1], None, None
    )
    
    print(f"Reprojection error: {ret:.4f}")
    print(f"Camera matrix:\n{mtx}")
    print(f"Distortion coefficients: {dist}")
    
    return mtx, dist

mtx, dist = calibrate_camera("calibration_images/")
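A quick sanity check on the calibrated intrinsics is to project a known 3D point through the pinhole model and confirm the pixel coordinates look right. A pure-numpy sketch (the example intrinsics below are made-up values, not your calibration output):

```python
import numpy as np

def project_point(point_cam, K):
    """Project a 3D point (camera frame, meters) to pixel coordinates, pinhole model."""
    x, y, z = point_cam
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])

# Example intrinsics: fx = fy = 600 px, principal point at (320, 240)
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A point on the optical axis projects exactly to the principal point
print(project_point([0.0, 0.0, 1.0], K))  # → [320. 240.]
```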

Hand-Eye Calibration (Camera-to-Robot)

def hand_eye_calibration(robot_poses, camera_poses):
    """
    AX = XB hand-eye calibration.
    robot_poses: list of 4x4 homogeneous transforms (base to end-effector)
    camera_poses: list of 4x4 transforms (camera to calibration target)
    """
    R_gripper2base = [pose[:3, :3] for pose in robot_poses]
    t_gripper2base = [pose[:3, 3] for pose in robot_poses]
    R_target2cam = [pose[:3, :3] for pose in camera_poses]
    t_target2cam = [pose[:3, 3] for pose in camera_poses]
    
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI
    )
    
    T_cam2gripper = np.eye(4)
    T_cam2gripper[:3, :3] = R_cam2gripper
    T_cam2gripper[:3, 3] = t_cam2gripper.flatten()
    
    return T_cam2gripper
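With `T_cam2gripper` in hand, a camera-frame detection can be mapped into the robot base frame by composing it with the arm's forward kinematics (eye-in-hand configuration). A small sketch under those assumptions; the helper name is illustrative:

```python
import numpy as np

def cam_point_to_base(p_cam, T_gripper2base, T_cam2gripper):
    """Map a 3D point from camera frame to robot base frame via homogeneous transforms."""
    p_h = np.append(p_cam, 1.0)  # homogeneous coordinates
    return (T_gripper2base @ T_cam2gripper @ p_h)[:3]

# With identity hand-eye calibration and a gripper translated to (0.5, 0, 0.3),
# a camera-frame point is simply shifted by that translation
T_cam2gripper = np.eye(4)
T_gripper2base = np.eye(4)
T_gripper2base[:3, 3] = [0.5, 0.0, 0.3]

print(cam_point_to_base(np.array([0.0, 0.0, 0.2]), T_gripper2base, T_cam2gripper))
# → [0.5 0.  0.5]
```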

Camera calibration setup in robotics lab

Inference Optimization

Quantization for Real-Time Inference

VLA models (7B+ params) need optimization for real-time control:

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b-finetuned",
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Benchmark inference speed
import time

# Warm up
for _ in range(5):
    with torch.no_grad():
        _ = model.predict_action(dummy_obs)

# Measure
times = []
for _ in range(100):
    start = time.perf_counter()
    with torch.no_grad():
        action = model.predict_action(obs)
    times.append(time.perf_counter() - start)

print(f"Inference: {np.mean(times)*1000:.1f}ms ± {np.std(times)*1000:.1f}ms")
print(f"Control frequency: {1/np.mean(times):.1f} Hz")

Performance Comparison

| Method | Model Size | Latency | GPU Memory | Success Rate |
| --- | --- | --- | --- | --- |
| FP32 | 28 GB | 450 ms | 32 GB | 100% (baseline) |
| FP16 | 14 GB | 180 ms | 16 GB | 99.5% |
| INT8 | 7 GB | 95 ms | 10 GB | 98.2% |
| INT4 (GPTQ) | 3.5 GB | 65 ms | 6 GB | 94.1% |
| TensorRT FP16 | 14 GB | 45 ms | 16 GB | 99.5% |

For typical robot arms, a control frequency of 5-10 Hz is sufficient, so INT8 (~95ms) works well. For faster control, TensorRT is the best option but requires more complex conversion.
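To check a row of that table against your own control loop: inference latency plus sensing and transport overhead must fit inside one control period. A small sketch of that budget arithmetic (the 30 ms overhead figure is an assumption for illustration, not a measurement):

```python
def fits_control_budget(inference_ms, control_hz, overhead_ms=30.0):
    """Return (fits, slack_ms): does inference + fixed overhead fit one control period?"""
    period_ms = 1000.0 / control_hz
    slack = period_ms - (inference_ms + overhead_ms)
    return slack >= 0.0, slack

# INT8 at ~95 ms comfortably fits a 5 Hz loop (200 ms period)...
print(fits_control_budget(95.0, 5.0))    # → (True, 75.0)
# ...but with 30 ms of overhead it no longer fits a 10 Hz loop (100 ms period)
print(fits_control_budget(95.0, 10.0))   # → (False, -25.0)
```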

ROS 2 Deployment Node

Complete Policy Server

#!/usr/bin/env python3
"""ROS 2 node for VLA policy deployment."""

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from trajectory_msgs.msg import JointTrajectoryPoint
from cv_bridge import CvBridge
import numpy as np
import torch
from lerobot.common.policies.factory import make_policy

class VLAPolicyNode(Node):
    def __init__(self):
        super().__init__('vla_policy_node')
        
        # Parameters
        self.declare_parameter('policy_path', '')
        self.declare_parameter('control_freq', 10.0)
        self.declare_parameter('device', 'cuda')
        
        policy_path = self.get_parameter('policy_path').value
        self.device = self.get_parameter('device').value
        control_freq = self.get_parameter('control_freq').value
        
        # Load policy
        self.get_logger().info(f'Loading policy from {policy_path}...')
        self.policy = make_policy(
            cfg=None,
            pretrained_policy_name_or_path=policy_path,
        )
        self.policy.to(self.device)
        self.policy.eval()
        self.get_logger().info('Policy loaded successfully!')
        
        # State
        self.bridge = CvBridge()
        self.latest_image = None
        self.latest_joints = None
        
        # Subscribers
        self.image_sub = self.create_subscription(
            Image, '/camera/color/image_raw', self.image_callback, 10
        )
        self.joint_sub = self.create_subscription(
            JointState, '/joint_states', self.joint_callback, 10
        )
        
        # Publisher
        self.action_pub = self.create_publisher(
            JointTrajectoryPoint, '/policy_action', 10
        )
        
        # Control loop timer
        period = 1.0 / control_freq
        self.timer = self.create_timer(period, self.control_loop)
        self.get_logger().info(f'Control loop running at {control_freq} Hz')
    
    def image_callback(self, msg):
        self.latest_image = self.bridge.imgmsg_to_cv2(msg, 'rgb8')
    
    def joint_callback(self, msg):
        self.latest_joints = np.array(msg.position)
    
    def control_loop(self):
        if self.latest_image is None or self.latest_joints is None:
            return
        
        obs = {
            'observation.images.top': torch.from_numpy(
                self.latest_image
            ).permute(2, 0, 1).unsqueeze(0).float().to(self.device) / 255.0,
            'observation.state': torch.from_numpy(
                self.latest_joints
            ).unsqueeze(0).float().to(self.device),
        }
        
        with torch.no_grad():
            action = self.policy.select_action(obs)
        
        action_np = action.squeeze().cpu().numpy()
        
        msg = JointTrajectoryPoint()
        msg.positions = action_np.tolist()
        msg.time_from_start.sec = 0
        msg.time_from_start.nanosec = int(1e8)  # 100ms
        self.action_pub.publish(msg)

def main():
    rclpy.init()
    node = VLAPolicyNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Debugging Common Failures

Failure Taxonomy

| Failure Mode | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Frozen robot | Robot doesn't move | Observation mismatch (wrong camera, wrong joint order) | Verify obs matches training format |
| Wild oscillation | Robot shakes violently | Action scale mismatch | Check action normalization stats |
| Drift | Robot gradually drifts from target | Latency accumulation | Reduce inference time, add action smoothing |
| Collision | Robot hits obstacles | Missing safety limits | Add workspace bounds, torque limits |
| Grasp failure | Can't grasp objects | Visual domain gap | More visual randomization, fine-tune on real images |

Action Smoothing Filter

class ActionSmoother:
    """Exponential moving average for smooth actions."""
    
    def __init__(self, alpha=0.7, action_dim=7):
        self.alpha = alpha
        self.prev_action = np.zeros(action_dim)
        self.initialized = False
    
    def smooth(self, raw_action):
        if not self.initialized:
            self.prev_action = raw_action.copy()
            self.initialized = True
            return raw_action
        
        smoothed = self.alpha * raw_action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed.copy()
        return smoothed

smoother = ActionSmoother(alpha=0.7)
smoothed_action = smoother.smooth(raw_policy_action)
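Smoothing alone does not guarantee safety; the workspace-bound and joint-limit fixes from the failure taxonomy deserve an explicit final clamp before any action is published. A hedged sketch (the limits below are illustrative values, not any real robot's spec):

```python
import numpy as np

class SafetyClamp:
    """Clamp actions to joint limits and bound the per-step change in position."""

    def __init__(self, joint_low, joint_high, max_delta):
        self.low = np.asarray(joint_low, dtype=np.float64)
        self.high = np.asarray(joint_high, dtype=np.float64)
        self.max_delta = max_delta
        self.prev = None

    def clamp(self, action):
        # Hard joint limits first
        action = np.clip(action, self.low, self.high)
        if self.prev is not None:
            # Then limit the per-step jump to avoid violent motions
            delta = np.clip(action - self.prev, -self.max_delta, self.max_delta)
            action = self.prev + delta
        self.prev = action.copy()
        return action

clamp = SafetyClamp(joint_low=[-3.14] * 7, joint_high=[3.14] * 7, max_delta=0.1)
safe_action = clamp.clamp(np.zeros(7))
```

Running the smoother and the clamp in sequence (policy output → smoother → clamp → publisher) covers both the oscillation and the collision rows of the taxonomy.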

Complete Deployment Checklist

Pre-deployment

- Camera intrinsics and hand-eye calibration verified; camera pose matches the sim setup
- Observation format (image resolution, joint order, normalization stats) matches training
- Inference latency benchmarked and within the target control period
- Workspace bounds, joint limits, and torque limits configured
- Emergency stop tested

During deployment

- Start at reduced speed and watch for oscillation or drift
- Keep a hand on the e-stop for the first runs
- Log observations and actions for later debugging

Post-deployment

- Categorize failures using the taxonomy above
- Collect real demonstrations for fine-tuning if the success rate falls short

Real-World Fine-tuning with LeRobot

If the sim-trained policy achieves less than 80% success on the real robot, fine-tuning on real data is the next step:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

# 1. Collect 50-100 real demonstrations
# (using teleop as in post 2, but on the real robot)

# 2. Load sim-pretrained policy
policy = DiffusionPolicy.from_pretrained("my-sim-trained-policy")

# 3. Load real dataset
real_dataset = LeRobotDataset("my-real-demos")

# 4. Fine-tune with lower learning rate
from lerobot.scripts.train import train
train(
    policy=policy,
    dataset=real_dataset,
    training_steps=10000,        # Less than sim training
    learning_rate=1e-5,          # 10x lower
    batch_size=32,
    eval_freq=1000,
)

This strategy is called sim-to-real-to-sim — train in sim, fine-tune on real, then use real data to improve sim fidelity.

Series Conclusion

Across 10 posts, we've built a complete pipeline:

  1. Framework deep dive — understanding LeRobot inside out
  2. Data collection — quality demonstration recording
  3. Single-arm training — ACT vs Diffusion Policy
  4. Multi-object — scaling to multiple objects
  5. Long-horizon — multi-step task planning
  6. Dual-arm setup — bimanual calibration
  7. Dual-arm tasks — coordination training
  8. Mobile manipulation — base + arms
  9. Humanoid — whole-body control
  10. Sim-to-Real (this post) — deployment pipeline

LeRobot is evolving rapidly — follow the GitHub repo for updates. Good luck with your deployment!

References

  - Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017.
