
SimpleVLA-RL (11): Sim-to-Real Transfer for OpenArm

Deploy a SimpleVLA-RL model from simulation to real OpenArm — camera setup, action mapping, and tips for reducing the sim-to-real gap.

Nguyễn Anh Tuấn · April 11, 2026 · 16 min read

Sim-to-Real Transfer for OpenArm with SimpleVLA-RL

You now have a VLA policy trained with SimpleVLA-RL in Isaac Lab — SFT created the baseline, GRPO pushed the success rate higher. Now comes the most important and hardest step in the entire pipeline: transferring the model from simulation to the real robot. This is where every assumption about physics, visuals, and dynamics gets tested by reality.

This article provides a detailed walkthrough from setting up OpenArm hardware, building an inference server, mapping actions from sim to real, to techniques for reducing the sim-to-real gap. We will also review reference results from the SimpleVLA-RL paper on the Piper robot and analyze how they apply to OpenArm.

The Sim-to-Real Gap: Three Main Enemies

Before writing any code, let us understand why policies trained in simulation usually do not work perfectly on real robots. There are three main types of gap:

1. Visual Gap — Sim Eyes Differ from Real Eyes

Rendering in Isaac Lab, despite being quite good, still differs from real cameras. Lighting, shadows, textures, reflections — all different. A cardboard box in sim has a perfect surface, while a real box has wrinkles, labels, and uneven light reflections.

2. Dynamics Gap — Sim Physics Differs from Real Physics

Friction between the gripper and cardboard box in sim is modeled with simple mathematical formulas. In reality, friction depends on surface material, humidity, grip force, contact angle — countless factors that simulation cannot fully capture.
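To make this gap concrete, here is a quick worked example with the standard Coulomb friction model: the grip force a two-finger gripper needs to hold a box. The mass and friction coefficient below are illustrative values, not measured numbers for OpenArm.

```python
# Minimal Coulomb-friction check for a two-finger grasp.
# Illustrative numbers: the real friction coefficient must be measured, not assumed.
def min_grip_force(mass_kg, mu, safety=2.0, g=9.81):
    """Normal force needed to hold a box against gravity.

    Two contact surfaces each contribute mu * F_n of friction,
    so we need 2 * mu * F_n >= m * g, plus a safety factor.
    """
    return safety * mass_kg * g / (2.0 * mu)

# A 0.3 kg carton box, dry cardboard vs rubber fingertip (mu ~ 0.4):
force = min_grip_force(0.3, 0.4)
print(f"Required grip force: {force:.1f} N")  # ~7.4 N with a 2x safety margin
```

In sim the solver knows mu exactly; on the real robot it varies with humidity, wear, and contact angle, which is exactly why grip-force margins matter.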

3. Action Gap — Sim Control Differs from Real Control

In sim, when you send joint_position = 1.57, the robot joint jumps to that position instantly (or follows a simple dynamic model). On a real robot, motors have inertia, backlash (mechanical play), and CAN bus communication delay — all creating differences between sim actions and real actions.

Sim-to-real transfer is the biggest challenge in robot learning

Step 1: Set Up OpenArm Hardware

CAN Bus Connection

OpenArm uses the CAN bus protocol to communicate with servo motors. There are two ways to connect:

Option 1: Use openarm_can library (no LeRobot dependency)

# Install openarm_can — direct communication with OpenArm
pip install openarm-can

# Or clone from source
git clone https://github.com/OpenArm-org/openarm_can.git
cd openarm_can
pip install -e .

Then test the connection from Python:

from openarm_can import OpenArmController

controller = OpenArmController(
    can_interface="can0",      # CAN bus interface
    baudrate=1000000,          # 1 Mbps
)

# Check all servos respond
status = controller.ping_all()
print(f"Found {len(status)} servos: {status}")
# Expected: Found 8 servos: [1, 2, 3, 4, 5, 6, 7, 8]

Option 2: Use lerobot CLI for hardware setup only

# ONLY use for CAN setup — training is still SimpleVLA-RL
pip install lerobot

# Setup CAN interface
sudo ip link set can0 type can bitrate 1000000
sudo ip link set can0 up

# Verify
candump can0  # Should see CAN frames

Note: we use the LeRobot CLI only for CAN hardware setup because it has convenient ready-made scripts. All training and inference remains 100% SimpleVLA-RL pipeline.

Calibrate Joints

Calibration ensures the zero position of each joint in code corresponds to the real zero position on the robot:

from openarm_can import OpenArmController, Calibrator

controller = OpenArmController(can_interface="can0")
calibrator = Calibrator(controller)

# Step 1: Move robot to home position (arm straight up)
# Do this manually or with teach pendant
input("Move robot to home position then press Enter...")

# Step 2: Record offset for each joint
offsets = calibrator.calibrate()
print(f"Joint offsets: {offsets}")

# Step 3: Save calibration
calibrator.save("openarm_calibration.json")

Mount Camera

Camera position is an extremely important factor for sim-to-real. In sim, you placed the camera at a specific position — the real camera must be in a similar position:

# Camera parameters in Isaac Lab (from Part 9)
SIM_CAMERA = {
    "position": [0.5, 0.0, 1.2],   # x, y, z relative to base
    "target": [0.3, 0.0, 0.75],    # Look-at point (table center)
    "fov": 69,                      # Field of view (degrees)
    "resolution": [640, 480],       # Raw resolution
}

# On real robot: mount USB webcam at the same position
# Use a measuring tape to place camera:
# - 50cm in front of robot base
# - 120cm high (shoulder level of robot)
# - Pointing down at table, FOV ~69 degrees
# - Logitech C920 has 78 degree FOV -> zoom in slightly or crop
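Since the C920's 78-degree FOV is wider than the 69 degrees used in sim, one option mentioned above is cropping. A minimal sketch of the crop math, assuming a simple pinhole camera model (real lens distortion is ignored):

```python
# Sketch: crop a wider real-camera image so its field of view matches sim.
# Assumes an ideal pinhole camera; lens distortion is ignored.
import math

def crop_size_for_fov(width_px, real_fov_deg, target_fov_deg):
    """Pixel width of a centered crop that spans target_fov_deg."""
    ratio = math.tan(math.radians(target_fov_deg) / 2) / \
            math.tan(math.radians(real_fov_deg) / 2)
    return int(round(width_px * ratio))

# Logitech C920 (78 deg horizontal) -> Isaac Lab camera (69 deg):
w = crop_size_for_fov(640, 78, 69)
print(w)  # 543
```

Crop the center of the frame to this width (and proportionally in height) before the final resize to 224x224.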

Tips for mounting camera:

- Use a rigid mount: a camera that drifts even 1-2cm invalidates the pose you measured.
- Lock autofocus and auto-exposure so image statistics stay stable across episodes.
- Capture one frame and compare it side by side with a sim render before collecting any data.

Step 2: Create the Inference Server

Instead of running the VLA model directly on the robot controller PC (typically a low-power machine), we run an inference server on a machine with a powerful GPU:

Architecture

+-----------------------------------------------------------+
|                    GPU Server                              |
|  +-----------------------------------------------------+  |
|  |  SimpleVLA-RL Inference                              |  |
|  |  (OpenVLA-OFT model loaded in GPU memory)            |  |
|  |  POST /predict -> {image} -> {action_8dof}           |  |
|  +-----------------------------------------------------+  |
+----------------------------+------------------------------+
                             | HTTP (LAN)
+----------------------------+------------------------------+
|                Robot Controller PC                         |
|  +--------------+    +----------------+    +------------+  |
|  | USB Camera   |--->| Control Loop   |--->| OpenArm    |  |
|  | (Logitech)   |    | (30 Hz)        |    | CAN Bus    |  |
|  +--------------+    +----------------+    +------------+  |
+-----------------------------------------------------------+

Inference Server Code

# inference_server.py — Run on GPU server
import torch
import numpy as np
from flask import Flask, request, jsonify
from PIL import Image
import io
import time

# Load SimpleVLA-RL checkpoint
from prismatic.models import load_vla_model

app = Flask(__name__)

# Global model — loaded once at server start
MODEL = None
DEVICE = "cuda:0"

def load_model(checkpoint_path):
    """Load trained SimpleVLA-RL model"""
    global MODEL
    print(f"Loading model from {checkpoint_path}...")
    MODEL = load_vla_model(
        checkpoint_path,
        device=DEVICE,
        use_flash_attn=True,
    )
    MODEL.eval()
    print("Model loaded successfully!")

@app.route("/predict", methods=["POST"])
def predict():
    """
    Receive image -> return 8-DoF action
    Input: multipart form with 'image' field (JPEG)
    Output: JSON {"action": [8 floats], "inference_ms": float}
    """
    start = time.time()
    
    # Read image from request
    image_file = request.files["image"]
    image = Image.open(io.BytesIO(image_file.read())).convert("RGB")
    image = image.resize((224, 224))  # Resize for VLA input
    
    # Task description — fixed for box grasping
    task = "Pick up the carton box from the table"
    
    # Inference
    with torch.no_grad():
        action = MODEL.predict_action(
            image=image,
            instruction=task,
            unnorm_key="openarm_box_grasp",  # Denormalize key
        )
    
    # action shape: (8,) — 7 joint positions + 1 gripper
    action_list = action.cpu().numpy().tolist()
    
    inference_ms = (time.time() - start) * 1000
    
    return jsonify({
        "action": action_list,
        "inference_ms": inference_ms,
    })

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok", "model_loaded": MODEL is not None})

if __name__ == "__main__":
    import sys
    checkpoint = sys.argv[1] if len(sys.argv) > 1 else "./checkpoints/openarm-rl-v1/best"
    load_model(checkpoint)
    app.run(host="0.0.0.0", port=5000, threaded=False)

Run the Server

# On GPU server
conda activate simplevla
python inference_server.py ./checkpoints/openarm-rl-v1/best

# Test
curl -X POST http://gpu-server:5000/predict \
    -F "image=@test_image.jpg"
# Expected: {"action": [0.12, -0.34, ...], "inference_ms": 45.2}

Inference time on a single A100 GPU is roughly 30-60ms — fast enough for a 10-30 Hz control loop.
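One practical detail worth handling on the client side: if a reply arrives later than the control budget, acting on it means acting on a stale observation. A small watchdog sketch (a hypothetical helper, not part of SimpleVLA-RL) that repeats the last action instead:

```python
# Hypothetical client-side watchdog: if an inference reply arrives late
# (network hiccup, GPU contention), repeat the previous action rather
# than act on a stale observation.
import numpy as np

class ActionWatchdog:
    def __init__(self, budget_ms=100.0, action_dim=8):
        self.budget_ms = budget_ms
        self.last_action = np.zeros(action_dim)

    def select(self, new_action, latency_ms):
        """Accept the new action only if it arrived within budget."""
        if latency_ms <= self.budget_ms:
            self.last_action = np.asarray(new_action, dtype=float)
        return self.last_action

wd = ActionWatchdog(budget_ms=100.0, action_dim=2)
print(wd.select([0.1, 0.2], latency_ms=45))   # on-time reply: use it
print(wd.select([9.9, 9.9], latency_ms=250))  # late reply: repeat the previous action
```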

Step 3: Action Mapping Sim to Real

This is the step many people skip and then wonder why the robot behaves strangely in the real world.

Joint Limit Clipping

Joint ranges in sim may differ from the actual physical limits of OpenArm:

# action_mapper.py

import numpy as np

# REAL joint limits for OpenArm (from datasheet)
REAL_JOINT_LIMITS = {
    0: (-2.96, 2.96),    # Joint 1 (base) — narrower than sim
    1: (-1.48, 1.48),    # Joint 2 (shoulder)
    2: (-1.48, 1.48),    # Joint 3 (elbow)
    3: (-2.96, 2.96),    # Joint 4 (wrist 1)
    4: (-1.48, 1.48),    # Joint 5 (wrist 2)
    5: (-2.96, 2.96),    # Joint 6 (wrist 3)
    6: (-1.48, 1.48),    # Joint 7 (wrist 4)
    7: (0.0, 1.0),       # Gripper
}

def clip_to_real_limits(action):
    """Clip action values to real joint limits"""
    clipped = np.copy(action)
    for i, (low, high) in REAL_JOINT_LIMITS.items():
        clipped[i] = np.clip(action[i], low, high)
    return clipped

Exponential Moving Average (EMA) Smoothing

VLA model output can jump between steps — in sim this is fine because the simulator accepts anything. On a real robot, sudden jumps cause vibration and motor wear:

class ActionSmoother:
    """Smooth actions with EMA to prevent jerk on real robot"""
    
    def __init__(self, alpha=0.7, action_dim=8):
        """
        alpha: EMA weight (0.5-0.9)
            - 0.5: very smooth but slow response
            - 0.9: less smooth but fast response
            - 0.7: recommended for manipulation
        """
        self.alpha = alpha
        self.prev_action = np.zeros(action_dim)
        self.initialized = False
    
    def smooth(self, action):
        if not self.initialized:
            self.prev_action = action.copy()
            self.initialized = True
            return action
        
        smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed.copy()
        return smoothed
    
    def reset(self):
        self.initialized = False
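To see what alpha=0.7 does numerically, here is a standalone trace of the same EMA update on a step change from 0 to 1:

```python
# Standalone trace of the EMA update on a step change from 0 to 1.
alpha = 0.7
prev = 0.0
steps = []
for raw in [1.0, 1.0, 1.0, 1.0]:
    new = alpha * raw + (1 - alpha) * prev
    steps.append(round(new - prev, 4))  # per-tick change sent to the motor
    prev = new
print(steps)  # [0.7, 0.21, 0.063, 0.0189]: the jump is spread over several ticks
```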

Control Frequency Matching

The control frequency must match the frequency used in simulation:

import time

class ControlLoop:
    """Main control loop — runs on robot controller PC"""
    
    def __init__(self, server_url, controller, target_hz=20):
        self.server_url = server_url
        self.controller = controller  # OpenArmController
        self.target_hz = target_hz
        self.dt = 1.0 / target_hz
        self.smoother = ActionSmoother(alpha=0.7)
        
    def run_episode(self, max_steps=400):
        """Run one episode on the real robot"""
        import requests
        import cv2
        
        camera = cv2.VideoCapture(0)  # USB camera
        camera.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
        camera.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
        
        self.smoother.reset()
        
        for step in range(max_steps):
            step_start = time.time()
            
            # 1. Capture image
            ret, frame = camera.read()
            if not ret:
                print("Camera error!")
                break
            
            # 2. Encode image
            _, buffer = cv2.imencode('.jpg', frame, 
                                     [cv2.IMWRITE_JPEG_QUALITY, 85])
            
            # 3. Send to inference server
            response = requests.post(
                f"{self.server_url}/predict",
                files={"image": ("frame.jpg", buffer.tobytes(), "image/jpeg")},
                timeout=1.0,  # fail fast rather than stalling the control loop
            )
            action = np.array(response.json()["action"])
            
            # 4. Post-process action
            action = clip_to_real_limits(action)
            action = self.smoother.smooth(action)
            
            # 5. Send to robot
            self.controller.set_joint_positions(action[:7])  # 7 joints
            self.controller.set_gripper(action[7])           # gripper
            
            # 6. Maintain control frequency
            elapsed = time.time() - step_start
            if elapsed < self.dt:
                time.sleep(self.dt - elapsed)
            
            # Log
            if step % 20 == 0:
                hz = 1.0 / max(elapsed, 0.001)
                print(f"Step {step}/{max_steps} | "
                      f"Hz: {hz:.1f} | "
                      f"Action: [{', '.join(f'{a:.2f}' for a in action[:3])}...]")
        
        camera.release()

Step 4: Reducing the Sim-to-Real Gap

4.1 Domain Randomization (in sim)

If you already used domain randomization when collecting demos in Isaac Lab (as discussed in Part 9), the policy will be more robust. If not, go back and retrain with randomization:

# Domain randomization parameters (Isaac Lab)
randomization = {
    "lighting": {
        "intensity": (0.5, 1.5),     # 50-150% intensity
        "color_temp": (4000, 7000),  # Kelvin
        "direction": "random",
    },
    "camera": {
        "position_noise": 0.02,      # +/-2cm random offset
        "fov_noise": 2,              # +/-2 degree FOV variation
    },
    "box": {
        "position_noise": 0.05,      # +/-5cm position
        "size_scale": (0.8, 1.2),    # 80-120% size
        "texture": "random",         # Random texture each episode
    },
    "table": {
        "friction": (0.3, 0.8),      # Random friction
        "color": "random",
    },
}

4.2 Camera Matching

# Compare sim vs real images
# 1. Capture 1 frame from sim (Isaac Lab)
# 2. Capture 1 frame from real camera
# 3. Compare side by side

# Camera matching checklist:
# [x] Same resolution (640x480 or 224x224 after resize)
# [x] Same FOV (or crop real camera to match)
# [x] Same position relative to robot base
# [x] Same viewing direction (pitch, yaw angles)
# [ ] DO NOT need to match lighting exactly — domain randomization handles this
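A cheap way to run this checklist is a 50/50 overlay of a sim render and a real frame; misaligned table edges or camera tilt show up immediately. A minimal sketch (hypothetical helper, plain NumPy):

```python
# Blend a sim render and a real frame to spot camera misalignment.
import numpy as np

def overlay(sim_img, real_img, alpha=0.5):
    """Both images as uint8 arrays of the same shape (H, W, 3)."""
    assert sim_img.shape == real_img.shape, "resize/crop to match first"
    blend = alpha * sim_img.astype(float) + (1 - alpha) * real_img.astype(float)
    return blend.astype(np.uint8)
```

Save the result with PIL or OpenCV and check that the table edges and the robot base line up.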

4.3 Background Simplification

A simple but effective trick: clean up the background. The VLA model sees the entire 224x224 frame — if the background is cluttered (furniture, people walking, screens), the model will be confused.

Good backgrounds for sim-to-real:
[+] White/gray solid-color wall behind
[+] Clean table with only the target carton box
[+] Even lighting without strong shadows
[+] No moving objects in frame

Bad backgrounds:
[-] Monitor/TV behind (bright pixels, scrolling text)
[-] Many objects on table besides target
[-] Natural light that changes (windows)
[-] People walking in frame

Robotics environments need careful control for sim-to-real transfer

4.4 Fine-tune with Real Demos (Optional but Very Effective)

If the real-world success rate is much lower than sim (which will almost certainly happen), collect 10-20 real demonstrations and fine-tune the model:

# Real fine-tuning procedure:
# 1. Teleoperate OpenArm manually for 10-20 successful box grasps
# 2. Record image + joint positions at each step
# 3. Format into dataset matching sim demo format
# 4. Run additional SFT for 500-1000 steps with low LR (1e-6)

# Why only 10-20 demos?
# Because the model already understands manipulation from sim training.
# Real demos only need to teach it to "recalibrate" visual + dynamics.
# Too many real demos (>50) can cause overfitting.
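Step 3 above depends entirely on the demo format used in the earlier parts. As a hypothetical sketch, one episode could be packed like this (the key names `images`, `actions`, and `instruction` are assumptions; match them to your sim dataset):

```python
# Hypothetical packing of one teleoperated episode into an .npz file.
# The key names are assumptions; they must match your sim demo format.
import numpy as np

def pack_episode(frames, joint_positions, instruction, out_path):
    """frames: list of (224, 224, 3) uint8; joint_positions: list of (8,) float."""
    np.savez_compressed(
        out_path,
        images=np.stack(frames),            # (T, 224, 224, 3)
        actions=np.stack(joint_positions),  # (T, 8) next-step joint targets
        instruction=np.array(instruction),  # fixed task string
    )
```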

Step 5: Evaluate on the Real Robot

Evaluation Protocol

# eval_real.py
import json
import time
from datetime import datetime

class RealEvaluator:
    def __init__(self, control_loop, num_trials=10):
        self.loop = control_loop
        self.num_trials = num_trials
        self.results = []
    
    def run_evaluation(self):
        for trial in range(self.num_trials):
            print(f"\n{'='*50}")
            print(f"Trial {trial+1}/{self.num_trials}")
            print(f"{'='*50}")
            
            # 1. Reset: place carton box on table at random position
            input("Place carton box on table at new position, press Enter...")
            
            # 2. Move robot to home position
            self.loop.controller.go_home()
            time.sleep(2)
            
            # 3. Run episode
            self.loop.run_episode(max_steps=400)
            
            # 4. Record result (manual)
            success = input("Grasp successful? (y/n): ").lower() == 'y'
            failure_mode = ""
            if not success:
                failure_mode = input(
                    "Failure mode (miss/position/timeout/collision): "
                )
            
            self.results.append({
                "trial": trial + 1,
                "success": success,
                "failure_mode": failure_mode,
                "timestamp": datetime.now().isoformat(),
            })
            
            # 5. Open gripper, go home
            self.loop.controller.open_gripper()
            self.loop.controller.go_home()
        
        # Summary
        successes = sum(1 for r in self.results if r["success"])
        rate = successes / self.num_trials * 100
        print(f"\n{'='*50}")
        print(f"RESULTS: {successes}/{self.num_trials} = {rate:.1f}%")
        print(f"{'='*50}")
        
        # Save results
        with open("real_eval_results.json", "w") as f:
            json.dump(self.results, f, indent=2)
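Once `real_eval_results.json` exists, a few lines turn it into the numbers that drive the next iteration. A small summary sketch (hypothetical helper):

```python
# Summarize evaluation results: success rate plus a failure-mode breakdown,
# which tells you where to focus the next iteration.
from collections import Counter

def summarize(results):
    n = len(results)
    successes = sum(r["success"] for r in results)
    modes = Counter(r["failure_mode"] for r in results if not r["success"])
    return {"success_rate": successes / n, "failure_modes": dict(modes)}

# Example with hand-written trial records:
trials = [
    {"success": True,  "failure_mode": ""},
    {"success": False, "failure_mode": "miss"},
    {"success": False, "failure_mode": "miss"},
    {"success": False, "failure_mode": "position"},
]
print(summarize(trials))
# {'success_rate': 0.25, 'failure_modes': {'miss': 2, 'position': 1}}
```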

Reference Results from Paper (Piper Robot)

The SimpleVLA-RL paper reports real-world results on the Piper robot (7-DoF, different from OpenArm but same class):

+------------------------------------------------------+
|  SimpleVLA-RL Real-World Results (Piper Robot)        |
+---------------------+----------+----------+----------+
| Task                | SFT Only | SFT + RL | Change   |
+---------------------+----------+----------+----------+
| Average (4 tasks)   | 17.5%    | 38.5%    | +120%    |
| Stack Bowls         | 38%      | 70%      | +84%     |
| Click Bell          | 30%      | 60%      | +100%    |
| Fold Towel          | 0%       | 18%      | N/A      |
| Pour Water          | 2%       | 6%       | +200%    |
+---------------------+----------+----------+----------+
| * OpenArm box grasping is expected to match or exceed |
|   Stack Bowls since the task is simpler               |
+------------------------------------------------------+

Based on these results, for OpenArm cardboard box grasping (a simpler task than Stack Bowls), we expect a real-world success rate at least in the Stack Bowls range, roughly 50-70% after RL training, with further gains from fine-tuning on real demos.

Failure Mode Analysis on Real Robot

From practical experience, the most common failure modes when deploying VLA on real robots:

1. Grasp Failure (30-40% of failures): the gripper reaches the box but it slips out or is never held firmly, usually a friction-gap symptom or a gripper opening mismatched to the box size.

2. Positioning Error (25-35% of failures): the end-effector stops a few centimeters off target, typically caused by camera pose mismatch or uncalibrated joint offsets.

3. Visual Confusion (15-25% of failures): the policy reaches toward the wrong spot or hesitates, triggered by cluttered backgrounds or lighting outside the randomization range.

4. Motor Overshoot (5-10% of failures): real motors overshoot or oscillate around the commanded position, a symptom of the action gap; EMA smoothing and lower control gains help.

Tips for Box Grasping Specifically

Grasping cardboard boxes has some specifics to be aware of:

Gripper Opening Must Match Box Size

# Calculate optimal gripper opening
BOX_WIDTH = 0.12  # 12cm box width
GRIPPER_MAX_OPEN = 0.08  # 8cm max opening (OpenArm gripper)

# If box is wider than gripper max -> must grasp the longer edge
# If box fits -> grasping the shorter edge is safer

# In reward function, add condition:
# success = box_lifted AND gripper_force > threshold
# Don't just check position — also check grip force
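The edge-selection rule above can be written as a small helper. This is a sketch, and the 1cm clearance value is an assumption:

```python
# Sketch: decide which box dimension to close the gripper across.
# The 1cm clearance is an assumed value; returns None when nothing fits
# and the task setup (or gripper) must change.
def choose_grasp_span(box_dims_m, gripper_max_open=0.08, clearance=0.01):
    """Smallest box dimension that fits in the gripper with some clearance."""
    fitting = [d for d in sorted(box_dims_m) if d + clearance <= gripper_max_open]
    return fitting[0] if fitting else None

print(choose_grasp_span((0.12, 0.06)))  # 0.06: grasp across the 6cm side
print(choose_grasp_span((0.12, 0.10)))  # None: box too big for this gripper
```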

Top-Down Approach Works Best

For cardboard boxes placed on a table, approaching from above (vertical) yields the highest success rate because:

- The top of the box is unobstructed, so the approach path avoids collisions with the table.
- Small x/y positioning errors still land the fingers on the box sides, making the grasp forgiving.
- A camera looking down at the table estimates the box's x/y position more reliably than its depth.

Domain Randomization for Boxes

# In Isaac Lab, randomize:
box_randomization = {
    "position_x": (0.2, 0.5),    # Range on table
    "position_y": (-0.15, 0.15),  # Left-right
    "rotation_z": (-45, 45),      # Rotate +/-45 degrees
    "size_scale": (0.8, 1.3),     # Size 80-130%
    "texture": ["cardboard_plain", "cardboard_printed", 
                "cardboard_worn", "cardboard_wet"],
    "mass": (0.1, 0.5),           # 100g - 500g
}
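For intuition, here is how one episode's box configuration could be sampled from these ranges (in practice Isaac Lab's randomization events do this internally; `sample_box_config` is a hypothetical illustration):

```python
# Hypothetical sketch of drawing one randomized episode config from the
# ranges above; Isaac Lab's event terms do this internally.
import numpy as np

def sample_box_config(cfg, rng):
    return {
        "position_x": rng.uniform(*cfg["position_x"]),
        "position_y": rng.uniform(*cfg["position_y"]),
        "rotation_z": rng.uniform(*cfg["rotation_z"]),
        "size_scale": rng.uniform(*cfg["size_scale"]),
        "texture": rng.choice(cfg["texture"]),
        "mass": rng.uniform(*cfg["mass"]),
    }

cfg = {
    "position_x": (0.2, 0.5), "position_y": (-0.15, 0.15),
    "rotation_z": (-45, 45), "size_scale": (0.8, 1.3),
    "texture": ["cardboard_plain", "cardboard_printed"],
    "mass": (0.1, 0.5),
}
rng = np.random.default_rng(seed=42)
print(sample_box_config(cfg, rng))  # a different box every episode
```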

Complete Pipeline Diagram

+-----------------------------------------------------+
|                COMPLETE PIPELINE                      |
|                                                       |
|  Isaac Lab Sim                                        |
|       |                                               |
|       v                                               |
|  RL Expert Demos (500-1000 episodes)                 |
|       |                                               |
|       v                                               |
|  SFT Training (OpenVLA-OFT) ---- ~20% success       |
|       |                                               |
|       v                                               |
|  RL Training (GRPO) ------------ ~50% success        |
|       |                                               |
|       v                                               |
|  Sim Evaluation (100 episodes)                       |
|       |                                               |
|       v                                               |
|  Sim-to-Real Transfer <-- YOU ARE HERE               |
|       |                                               |
|       v                                               |
|  Real Evaluation (10 trials)                         |
|       |                                               |
|       v                                               |
|  Real Fine-tune (10-20 demos) -- ~65% success        |
|       |                                               |
|       v                                               |
|  Iterate & Improve                                   |
+-----------------------------------------------------+

Robot manipulation research requires patience and continuous iteration

What Comes Next

After getting the first real-world evaluation results, the improvement loop begins:

  1. Analyze failure modes — Understand where the robot fails
  2. Collect more real demos focusing on failure scenarios
  3. Fine-tune the model on real data
  4. Re-evaluate — Repeat until reaching target success rate

SimpleVLA-RL gives us a major advantage: instead of just collecting more demos (like pure Imitation Learning), we can go back to sim, run more RL epochs focused on failure scenarios, then transfer back to real. This is closed-loop improvement that pure Imitation Learning cannot provide.

The path from 0% to an autonomous box-grasping robot is not easy — but with SimpleVLA-RL, each iteration brings measurable improvement. Patience and a systematic approach are the keys to success.


Related Posts


SimpleVLA-RL (10): SFT & RL Training for OpenArm

A detailed guide to SFT fine-tuning and RL training with SimpleVLA-RL for OpenArm — from environment config to running GRPO.

April 11, 2026 · 16 min read

SimpleVLA-RL (6): OpenArm — Roadmap Analysis

A detailed analysis of the approach to training the 7-DoF OpenArm to grasp carton boxes — comparing two paths: native LeRobot vs SimpleVLA-RL.

April 11, 2026 · 13 min read

SimpleVLA-RL (7): Collecting Data for OpenArm

A step-by-step guide to setting up OpenArm, calibrating, teleoperating, and collecting 50 carton-box grasping episodes with LeRobot.

April 11, 2026 · 16 min read