Sim-to-Real Transfer for OpenArm with SimpleVLA-RL
You now have a VLA policy trained with SimpleVLA-RL in Isaac Lab — SFT created the baseline, GRPO pushed the success rate higher. Now comes the most important and hardest step in the entire pipeline: transferring the model from simulation to the real robot. This is where every assumption about physics, visuals, and dynamics gets tested by reality.
This article walks through the full process: setting up the OpenArm hardware, building an inference server, mapping actions from sim to real, and techniques for closing the sim-to-real gap. We will also review reference results from the SimpleVLA-RL paper on the Piper robot and analyze how they apply to OpenArm.
The Sim-to-Real Gap: Three Main Enemies
Before writing any code, let us understand why policies trained in simulation usually do not work perfectly on real robots. There are three main types of gap:
1. Visual Gap — Sim Eyes Differ from Real Eyes
Rendering in Isaac Lab, despite being quite good, still differs from real cameras. Lighting, shadows, textures, reflections — all different. A cardboard box in sim has a perfect surface, while a real box has wrinkles, labels, and uneven light reflections.
2. Dynamics Gap — Sim Physics Differs from Real Physics
Friction between the gripper and cardboard box in sim is modeled with simple mathematical formulas. In reality, friction depends on surface material, humidity, grip force, contact angle — countless factors that simulation cannot fully capture.
3. Action Gap — Sim Control Differs from Real Control
In sim, when you send joint_position = 1.57, the robot joint jumps to that position instantly (or follows a simple dynamic model). On a real robot, motors have inertia, backlash (mechanical play), and CAN bus communication delay — all creating differences between sim actions and real actions.
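To build some intuition for the action gap, the toy model below treats a real joint as a first-order lag and compares it against sim's instant jump. This is a rough assumption for illustration only (real motors also have backlash and communication delay), and the time constant is not a measured OpenArm value:

```python
import numpy as np

def simulate_tracking(commands, tau=0.15, dt=0.05):
    """First-order lag: a real motor approaches the target exponentially
    instead of jumping there instantly like a kinematic sim.
    tau and dt are illustrative, not measured OpenArm values."""
    alpha = 1.0 - np.exp(-dt / tau)
    pos, trace = 0.0, []
    for cmd in commands:
        pos += alpha * (cmd - pos)   # move a fraction of the way each tick
        trace.append(pos)
    return np.array(trace)

commands = np.full(20, 1.57)         # step command: sim would be at 1.57 instantly
real = simulate_tracking(commands)
gap = np.abs(commands - real)        # per-step action gap, in radians
print(f"gap after 1 step:   {gap[0]:.3f} rad")
print(f"gap after 20 steps: {gap[-1]:.3f} rad")
```

The first few control ticks see a large discrepancy between what the policy commanded and where the joint actually is, which is exactly the regime where sim-trained policies misbehave.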
Step 1: Setup OpenArm Hardware
CAN Bus Connection
OpenArm uses the CAN bus protocol to communicate with servo motors. There are two ways to connect:
Option 1: Use openarm_can library (no LeRobot dependency)
# Install openarm_can — direct communication with OpenArm
pip install openarm-can
# Or clone from source
git clone https://github.com/OpenArm-org/openarm_can.git
cd openarm_can
pip install -e .
# Test connection
from openarm_can import OpenArmController
controller = OpenArmController(
can_interface="can0", # CAN bus interface
baudrate=1000000, # 1Mbps
)
# Check all servos respond
status = controller.ping_all()
print(f"Found {len(status)} servos: {status}")
# Expected: Found 8 servos: [1, 2, 3, 4, 5, 6, 7, 8]
Option 2: Use lerobot CLI for hardware setup only
# ONLY use for CAN setup — training is still SimpleVLA-RL
pip install lerobot
# Setup CAN interface
sudo ip link set can0 type can bitrate 1000000
sudo ip link set can0 up
# Verify
candump can0 # Should see CAN frames
Note: we use the LeRobot CLI only for CAN hardware setup because it has convenient ready-made scripts. All training and inference remains 100% SimpleVLA-RL pipeline.
Calibrate Joints
Calibration ensures the zero position of each joint in code corresponds to the real zero position on the robot:
from openarm_can import OpenArmController, Calibrator
controller = OpenArmController(can_interface="can0")
calibrator = Calibrator(controller)
# Step 1: Move robot to home position (arm straight up)
# Do this manually or with teach pendant
input("Move robot to home position then press Enter...")
# Step 2: Record offset for each joint
offsets = calibrator.calibrate()
print(f"Joint offsets: {offsets}")
# Step 3: Save calibration
calibrator.save("openarm_calibration.json")
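At runtime, the saved offsets get subtracted from raw encoder readings. The JSON layout below (`{"offsets": [...]}`) is an assumption; check what `Calibrator.save` actually writes before relying on it:

```python
import json
import numpy as np

def apply_calibration(raw_positions, offsets):
    """Subtract per-joint offsets so the calibrated home pose reads as zero."""
    return np.asarray(raw_positions) - np.asarray(offsets)

def load_offsets(path="openarm_calibration.json"):
    # Assumed file layout: {"offsets": [8 floats, radians]} -- verify against
    # the actual output of Calibrator.save before using in production.
    with open(path) as f:
        return json.load(f)["offsets"]

# Sanity check: at the home pose, calibrated readings should be ~zero
offsets  = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.00, 0.00]
home_raw = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.00, 0.00]
calibrated = apply_calibration(home_raw, offsets)
```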
Mount Camera
Camera position is an extremely important factor for sim-to-real. In sim, you placed the camera at a specific position; the real camera must be mounted to match that position as closely as possible:
# Camera parameters in Isaac Lab (from Part 9)
SIM_CAMERA = {
"position": [0.5, 0.0, 1.2], # x, y, z relative to base
"target": [0.3, 0.0, 0.75], # Look-at point (table center)
"fov": 69, # Field of view (degrees)
"resolution": [640, 480], # Raw resolution
}
# On real robot: mount USB webcam at the same position
# Use a measuring tape to place camera:
# - 50cm in front of robot base
# - 120cm high (shoulder level of robot)
# - Pointing down at table, FOV ~69 degrees
# - Logitech C920 has 78 degree FOV -> zoom in slightly or crop
Tips for mounting camera:
- Use a tripod or fixed mount — camera shake will kill the policy
- Mark camera position on the table for reproducible setup next time
- Take screenshots from sim and real side by side to compare viewing angle
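The FOV mismatch mentioned above (a 78-degree webcam vs. the 69-degree sim camera) can be corrected with a center crop. This sketch assumes a simple pinhole model and that both FOV values are measured along the same axis:

```python
import math

def crop_to_match_fov(width, height, real_fov_deg, sim_fov_deg):
    """Compute the center-crop size that makes a wider real camera FOV
    match the narrower sim FOV. Crop the frame to (w, h) around the
    center, then resize back to the model's input resolution."""
    scale = math.tan(math.radians(sim_fov_deg) / 2) / \
            math.tan(math.radians(real_fov_deg) / 2)
    return round(width * scale), round(height * scale)

# Logitech C920 (78 deg) cropped to match the 69-degree sim camera
w, h = crop_to_match_fov(640, 480, real_fov_deg=78, sim_fov_deg=69)
print(w, h)  # roughly 543 x 407
```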
Step 2: Create the Inference Server
Instead of running the VLA model directly on the robot controller PC (usually an underpowered machine without a discrete GPU), we create an inference server on a machine with a powerful GPU:
Architecture
+-----------------------------------------------------------+
| GPU Server |
| +-----------------------------------------------------+ |
| | SimpleVLA-RL Inference | |
| | (OpenVLA-OFT model loaded in GPU memory) | |
| | POST /predict -> {image} -> {action_8dof} | |
| +-----------------------------------------------------+ |
+----------------------------+------------------------------+
| HTTP (LAN)
+----------------------------+------------------------------+
| Robot Controller PC |
| +--------------+ +----------------+ +------------+ |
| | USB Camera |--->| Control Loop |--->| OpenArm | |
| | (Logitech) | | (30 Hz) | | CAN Bus | |
| +--------------+ +----------------+ +------------+ |
+-----------------------------------------------------------+
Inference Server Code
# inference_server.py — Run on GPU server
import torch
import numpy as np
from flask import Flask, request, jsonify
from PIL import Image
import io
import time
# Load SimpleVLA-RL checkpoint
from prismatic.models import load_vla_model
app = Flask(__name__)
# Global model — loaded once at server start
MODEL = None
DEVICE = "cuda:0"
def load_model(checkpoint_path):
"""Load trained SimpleVLA-RL model"""
global MODEL
print(f"Loading model from {checkpoint_path}...")
MODEL = load_vla_model(
checkpoint_path,
device=DEVICE,
use_flash_attn=True,
)
MODEL.eval()
print("Model loaded successfully!")
@app.route("/predict", methods=["POST"])
def predict():
"""
Receive image -> return 8-DoF action
Input: multipart form with 'image' field (JPEG)
Output: JSON {"action": [8 floats], "inference_ms": float}
"""
start = time.time()
# Read image from request
image_file = request.files["image"]
image = Image.open(io.BytesIO(image_file.read())).convert("RGB")
image = image.resize((224, 224)) # Resize for VLA input
# Task description — fixed for box grasping
task = "Pick up the carton box from the table"
# Inference
with torch.no_grad():
action = MODEL.predict_action(
image=image,
instruction=task,
unnorm_key="openarm_box_grasp", # Denormalize key
)
# action shape: (8,) — 7 joint positions + 1 gripper
action_list = action.cpu().numpy().tolist()
inference_ms = (time.time() - start) * 1000
return jsonify({
"action": action_list,
"inference_ms": inference_ms,
})
@app.route("/health", methods=["GET"])
def health():
return jsonify({"status": "ok", "model_loaded": MODEL is not None})
if __name__ == "__main__":
import sys
checkpoint = sys.argv[1] if len(sys.argv) > 1 else "./checkpoints/openarm-rl-v1/best"
load_model(checkpoint)
app.run(host="0.0.0.0", port=5000, threaded=False)
Run the Server
# On GPU server
conda activate simplevla
python inference_server.py ./checkpoints/openarm-rl-v1/best
# Test
curl -X POST http://gpu-server:5000/predict \
-F "image=@test_image.jpg"
# Expected: {"action": [0.12, -0.34, ...], "inference_ms": 45.2}
Inference time on a single A100 GPU is roughly 30-60ms. At the slow end that is only about 16 Hz of compute, so budget for a 10-20 Hz control loop rather than anything faster.
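On the controller side, a thin client wrapper helps handle the failure cases the curl test does not exercise: network errors and latency spikes. This is a sketch; the injectable `fetch` callable is a testing convenience and would wrap `requests.post` in production:

```python
import numpy as np

class PolicyClient:
    """Thin client for the /predict endpoint. If a request fails or
    exceeds the latency budget, repeat the last action instead of
    sending a stale or missing command to the motors."""
    def __init__(self, fetch, latency_budget_ms=100.0, action_dim=8):
        self.fetch = fetch            # callable: jpeg bytes -> response dict
        self.budget = latency_budget_ms
        self.last_action = np.zeros(action_dim)

    def get_action(self, jpeg_bytes):
        try:
            payload = self.fetch(jpeg_bytes)   # {"action": [...], "inference_ms": ...}
            if payload["inference_ms"] <= self.budget:
                self.last_action = np.array(payload["action"])
            # else: too slow -> keep holding the previous action
        except Exception:
            pass  # network error -> hold last action
        return self.last_action

# Example with stubbed fetch functions:
ok   = lambda _: {"action": [0.1] * 8, "inference_ms": 45.0}
slow = lambda _: {"action": [9.9] * 8, "inference_ms": 500.0}
client = PolicyClient(ok)
a1 = client.get_action(b"")    # fresh action from the server
client.fetch = slow
a2 = client.get_action(b"")    # latency spike -> held at previous action
```

Holding the last action on failure is a conservative default; a stricter policy would trigger an emergency stop after repeated failures.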
Step 3: Action Mapping Sim to Real
This is the step many people skip and then wonder why the robot behaves strangely in the real world.
Joint Limit Clipping
Joint ranges in sim may differ from the actual physical limits of OpenArm:
# action_mapper.py
import numpy as np
# REAL joint limits for OpenArm (from datasheet)
REAL_JOINT_LIMITS = {
0: (-2.96, 2.96), # Joint 1 (base) — narrower than sim
1: (-1.48, 1.48), # Joint 2 (shoulder)
2: (-1.48, 1.48), # Joint 3 (elbow)
3: (-2.96, 2.96), # Joint 4 (wrist 1)
4: (-1.48, 1.48), # Joint 5 (wrist 2)
5: (-2.96, 2.96), # Joint 6 (wrist 3)
6: (-1.48, 1.48), # Joint 7 (wrist 4)
7: (0.0, 1.0), # Gripper
}
def clip_to_real_limits(action):
"""Clip action values to real joint limits"""
clipped = np.copy(action)
for i, (low, high) in REAL_JOINT_LIMITS.items():
clipped[i] = np.clip(action[i], low, high)
return clipped
Exponential Moving Average (EMA) Smoothing
VLA model output can jump between steps — in sim this is fine because the simulator accepts anything. On a real robot, sudden jumps cause vibration and motor wear:
class ActionSmoother:
"""Smooth actions with EMA to prevent jerk on real robot"""
def __init__(self, alpha=0.7, action_dim=8):
"""
alpha: EMA weight (0.5-0.9)
- 0.5: very smooth but slow response
- 0.9: less smooth but fast response
- 0.7: recommended for manipulation
"""
self.alpha = alpha
self.prev_action = np.zeros(action_dim)
self.initialized = False
def smooth(self, action):
if not self.initialized:
self.prev_action = action.copy()
self.initialized = True
return action
smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
self.prev_action = smoothed.copy()
return smoothed
def reset(self):
self.initialized = False
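A quick numeric check of what the smoother buys you: compare the largest step-to-step jump before and after smoothing on synthetic noisy actions. The class is re-declared compactly here so the snippet runs standalone; its behavior matches the `ActionSmoother` above:

```python
import numpy as np

class ActionSmoother:
    """Compact copy of the EMA smoother above (first call passes through)."""
    def __init__(self, alpha=0.7):
        self.alpha = alpha
        self.prev = None
    def smooth(self, a):
        self.prev = a if self.prev is None else \
            self.alpha * a + (1 - self.alpha) * self.prev
        return self.prev

rng = np.random.default_rng(0)
raw = rng.normal(0.0, 0.2, size=(100, 8))       # synthetic noisy VLA outputs
sm = ActionSmoother(alpha=0.7)
smoothed = np.stack([sm.smooth(a) for a in raw])

max_jerk_raw = np.abs(np.diff(raw, axis=0)).max()
max_jerk_smooth = np.abs(np.diff(smoothed, axis=0)).max()
print(f"max step delta: raw {max_jerk_raw:.3f} vs smoothed {max_jerk_smooth:.3f}")
```

The smoothed sequence has noticeably smaller step-to-step deltas, which is precisely the jerk the motors would otherwise have to absorb.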
Control Frequency Matching
The control frequency must match the frequency used in simulation:
import time
class ControlLoop:
"""Main control loop — runs on robot controller PC"""
def __init__(self, server_url, controller, target_hz=20):
self.server_url = server_url
self.controller = controller # OpenArmController
self.target_hz = target_hz
self.dt = 1.0 / target_hz
self.smoother = ActionSmoother(alpha=0.7)
def run_episode(self, max_steps=400):
"""Run one episode on the real robot"""
import requests
import cv2
camera = cv2.VideoCapture(0) # USB camera
camera.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
camera.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
self.smoother.reset()
for step in range(max_steps):
step_start = time.time()
# 1. Capture image
ret, frame = camera.read()
if not ret:
print("Camera error!")
break
# 2. Encode image
_, buffer = cv2.imencode('.jpg', frame,
[cv2.IMWRITE_JPEG_QUALITY, 85])
# 3. Send to inference server
            try:
                response = requests.post(
                    f"{self.server_url}/predict",
                    files={"image": ("frame.jpg", buffer.tobytes(), "image/jpeg")},
                    timeout=0.5,  # fail fast instead of stalling the control loop
                )
                action = np.array(response.json()["action"])
            except requests.RequestException:
                print("Inference request failed, skipping step")
                continue  # motors hold their last commanded position
# 4. Post-process action
action = clip_to_real_limits(action)
action = self.smoother.smooth(action)
# 5. Send to robot
self.controller.set_joint_positions(action[:7]) # 7 joints
self.controller.set_gripper(action[7]) # gripper
# 6. Maintain control frequency
elapsed = time.time() - step_start
if elapsed < self.dt:
time.sleep(self.dt - elapsed)
# Log
if step % 20 == 0:
                hz = 1.0 / max(time.time() - step_start, 1e-3)  # actual loop rate, including sleep
print(f"Step {step}/{max_steps} | "
f"Hz: {hz:.1f} | "
f"Action: [{', '.join(f'{a:.2f}' for a in action[:3])}...]")
camera.release()
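One thing the loop above does not handle is safety. On real hardware you want an automatic stop when the system degrades: repeated inference failures, stale frames, or implausibly large action jumps. A minimal watchdog sketch (class name and thresholds are illustrative, not from any OpenArm API):

```python
import numpy as np

class SafetyWatchdog:
    """Trip an emergency stop after too many consecutive bad control
    steps: a failed inference/camera read, or an action jump larger
    than max_delta radians on any joint."""
    def __init__(self, max_consecutive_faults=5, max_delta=0.5):
        self.max_faults = max_consecutive_faults
        self.max_delta = max_delta
        self.faults = 0
        self.prev_action = None
        self.tripped = False

    def check(self, action, step_ok=True):
        """Call once per control step. Returns False once tripped;
        the caller should then brake and send the arm home."""
        action = np.asarray(action)
        jump = (self.prev_action is not None and
                np.abs(action - self.prev_action).max() > self.max_delta)
        self.faults = 0 if (step_ok and not jump) else self.faults + 1
        self.prev_action = action
        if self.faults >= self.max_faults:
            self.tripped = True
        return not self.tripped

# Example: a healthy step keeps the fault counter at zero
wd = SafetyWatchdog(max_consecutive_faults=3)
wd.check(np.zeros(8))
```

In the control loop, `step_ok=False` would be passed whenever the camera read or the inference request fails.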
Step 4: Reducing the Sim-to-Real Gap
4.1 Domain Randomization (in sim)
If you already used domain randomization when collecting demos in Isaac Lab (as discussed in Part 9), the policy will be more robust. If not, go back and retrain with randomization:
# Domain randomization parameters (Isaac Lab)
randomization = {
"lighting": {
"intensity": (0.5, 1.5), # 50-150% intensity
"color_temp": (4000, 7000), # Kelvin
"direction": "random",
},
"camera": {
"position_noise": 0.02, # +/-2cm random offset
"fov_noise": 2, # +/-2 degree FOV variation
},
"box": {
"position_noise": 0.05, # +/-5cm position
"size_scale": (0.8, 1.2), # 80-120% size
"texture": "random", # Random texture each episode
},
"table": {
"friction": (0.3, 0.8), # Random friction
"color": "random",
},
}
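A spec like the one above can be turned into a concrete per-episode configuration with a small sampler. This is a sketch assuming the convention that `(low, high)` tuples mean uniform ranges and lists mean discrete choices; the actual Isaac Lab randomization API differs:

```python
import random

def sample_randomization(spec, rng=None):
    """Draw one concrete episode config from a randomization spec:
    (low, high) tuples -> uniform sample, lists -> random choice,
    anything else passes through unchanged."""
    rng = rng or random.Random()
    out = {}
    for group, params in spec.items():
        out[group] = {}
        for name, value in params.items():
            if isinstance(value, tuple) and len(value) == 2:
                out[group][name] = rng.uniform(*value)
            elif isinstance(value, list):
                out[group][name] = rng.choice(value)
            else:
                out[group][name] = value
    return out

spec = {
    "lighting": {"intensity": (0.5, 1.5)},
    "box": {"texture": ["cardboard_plain", "cardboard_printed"]},
    "camera": {"position_noise": 0.02},
}
cfg = sample_randomization(spec, rng=random.Random(42))
```

Sampling a fresh config at every episode reset is what forces the policy to stop relying on any single visual or physical detail.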
4.2 Camera Matching
# Compare sim vs real images
# 1. Capture 1 frame from sim (Isaac Lab)
# 2. Capture 1 frame from real camera
# 3. Compare side by side
# Camera matching checklist:
# [x] Same resolution (640x480 or 224x224 after resize)
# [x] Same FOV (or crop real camera to match)
# [x] Same position relative to robot base
# [x] Same viewing direction (pitch, yaw angles)
# [ ] DO NOT need to match lighting exactly — domain randomization handles this
4.3 Background Simplification
A simple but effective trick: clean up the background. The VLA model sees the entire 224x224 frame — if the background is cluttered (furniture, people walking, screens), the model will be confused.
Good backgrounds for sim-to-real:
[+] White/gray solid-color wall behind
[+] Clean table with only the target carton box
[+] Even lighting without strong shadows
[+] No moving objects in frame
Bad backgrounds:
[-] Monitor/TV behind (bright pixels, scrolling text)
[-] Many objects on table besides target
[-] Natural light that changes (windows)
[-] People walking in frame
4.4 Fine-tune with Real Demos (Optional but Very Effective)
If the real-world success rate is much lower than sim (which will almost certainly happen), collect 10-20 real demonstrations and fine-tune the model:
# Real fine-tuning procedure:
# 1. Teleoperate OpenArm manually for 10-20 successful box grasps
# 2. Record image + joint positions at each step
# 3. Format into dataset matching sim demo format
# 4. Run additional SFT for 500-1000 steps with low LR (1e-6)
# Why only 10-20 demos?
# Because the model already understands manipulation from sim training.
# Real demos only need to teach it to "recalibrate" visual + dynamics.
# Too many real demos (>50) can cause overfitting.
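Step 3 of the procedure, formatting the recordings, might look like the sketch below. The JSON layout here is hypothetical; it must be adapted to whatever format your sim demos actually use:

```python
import json
import numpy as np

def episode_to_dataset(frames, actions, instruction, out_path):
    """Pack one teleoperated episode into a per-step JSON record
    (assumed layout: task instruction + image path + 8-DoF action
    per step). Adjust to match the sim demo format exactly."""
    assert len(frames) == len(actions), "one action per frame"
    episode = {
        "instruction": instruction,
        "steps": [
            {"image": img, "action": np.asarray(a).tolist()}
            for img, a in zip(frames, actions)
        ],
    }
    with open(out_path, "w") as f:
        json.dump(episode, f)
    return episode
```

Matching the sim format exactly (same keys, same action normalization) matters more than the container format itself, since SFT reuses the same data loader.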
Step 5: Evaluate on the Real Robot
Evaluation Protocol
# eval_real.py
import json
from datetime import datetime
class RealEvaluator:
def __init__(self, control_loop, num_trials=10):
self.loop = control_loop
self.num_trials = num_trials
self.results = []
def run_evaluation(self):
for trial in range(self.num_trials):
print(f"\n{'='*50}")
print(f"Trial {trial+1}/{self.num_trials}")
print(f"{'='*50}")
# 1. Reset: place carton box on table at random position
input(f"Place carton box on table at new position, press Enter...")
# 2. Move robot to home position
self.loop.controller.go_home()
time.sleep(2)
# 3. Run episode
self.loop.run_episode(max_steps=400)
# 4. Record result (manual)
success = input("Grasp successful? (y/n): ").lower() == 'y'
failure_mode = ""
if not success:
failure_mode = input(
"Failure mode (miss/position/timeout/collision): "
)
self.results.append({
"trial": trial + 1,
"success": success,
"failure_mode": failure_mode,
"timestamp": datetime.now().isoformat(),
})
# 5. Open gripper, go home
self.loop.controller.open_gripper()
self.loop.controller.go_home()
# Summary
successes = sum(1 for r in self.results if r["success"])
rate = successes / self.num_trials * 100
print(f"\n{'='*50}")
print(f"RESULTS: {successes}/{self.num_trials} = {rate:.1f}%")
print(f"{'='*50}")
# Save results
with open("real_eval_results.json", "w") as f:
json.dump(self.results, f, indent=2)
Reference Results from Paper (Piper Robot)
The SimpleVLA-RL paper reports real-world results on the Piper robot (a 7-DoF arm in the same class as OpenArm, though different hardware):
+------------------------------------------------------+
| SimpleVLA-RL Real-World Results (Piper Robot) |
+---------------------+----------+----------+----------+
| Task | SFT Only | SFT + RL | Change |
+---------------------+----------+----------+----------+
| Average (4 tasks) | 17.5% | 38.5% | +120% |
| Stack Bowls | 38% | 70% | +84% |
| Click Bell | 30% | 60% | +100% |
| Fold Towel | 0% | 18% | N/A |
| Pour Water | 2% | 6% | +200% |
+---------------------+----------+----------+----------+
| * OpenArm box grasping is expected to match or exceed |
| Stack Bowls since the task is simpler |
+------------------------------------------------------+
Based on these results, for OpenArm cardboard box grasping (a simpler task than Stack Bowls), we expect:
- SFT only: 20-35% success rate on real robot
- SFT + RL: 45-70% success rate on real robot
- SFT + RL + real fine-tune (10 demos): 60-80%
Failure Mode Analysis on Real Robot
From practical experience, the most common failure modes when deploying VLA on real robots:
1. Grasp Failure (30-40% of failures)
- Gripper closes too early or too late
- Cause: timing in sim differs from real due to CAN bus latency
- Fix: Increase EMA alpha (smoother actions) or reduce control frequency
2. Positioning Error (25-35% of failures)
- Robot approaches box from wrong angle
- Cause: camera position slightly different from sim
- Fix: Adjust camera mount or fine-tune with real demos
3. Visual Confusion (15-25% of failures)
- Robot "sees" incorrectly because background differs from sim
- Cause: insufficient domain randomization during training
- Fix: Clean up background or retrain with more randomization
4. Motor Overshoot (5-10% of failures)
- Robot overshoots target position then oscillates
- Cause: PID gains on motors do not match sim
- Fix: Tune PID on OpenArm or reduce action magnitude
Tips for Box Grasping Specifically
Grasping cardboard boxes comes with a few task-specific considerations:
Gripper Opening Must Match Box Size
# Calculate optimal gripper opening
BOX_WIDTH = 0.12 # 12cm box width
GRIPPER_MAX_OPEN = 0.08 # 8cm max opening (OpenArm gripper)
# If box is wider than gripper max -> must grasp the longer edge
# If box fits -> grasping the shorter edge is safer
# In reward function, add condition:
# success = box_lifted AND gripper_force > threshold
# Don't just check position — also check grip force
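The edge-selection logic above can be made explicit by reasoning about which box dimension the jaws must span. This is a sketch; the function name, margin value, and return convention are illustrative:

```python
def choose_grasp_span(box_w, box_l, gripper_max=0.08, margin=0.01):
    """Return which box dimension ('width' or 'length') the gripper
    should span. The jaws must open wider than the spanned dimension
    plus a safety margin; prefer the smallest dimension that fits,
    and return None if the box is too large for this gripper."""
    spans = sorted([("width", box_w), ("length", box_l)], key=lambda t: t[1])
    for name, dim in spans:
        if dim + margin <= gripper_max:
            return name
    return None  # neither dimension fits -> replan or use a bigger gripper

# A 6x12cm box: span the 6cm width with the 8cm gripper
print(choose_grasp_span(0.06, 0.12))   # width
# A 12x20cm box does not fit this gripper at all
print(choose_grasp_span(0.12, 0.20))   # None
```

Running this check before an episode catches the "box physically cannot be grasped" case early, instead of counting it as a policy failure.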
Top-Down Approach Works Best
For cardboard boxes placed on a table, approaching from above (vertical) yields the highest success rate because:
- Gripper alignment is simple — only need to align X, Y
- Gravity assists the grip — box does not slide sideways
- Less collision with table compared to lateral approach
Domain Randomization for Boxes
# In Isaac Lab, randomize:
box_randomization = {
"position_x": (0.2, 0.5), # Range on table
"position_y": (-0.15, 0.15), # Left-right
"rotation_z": (-45, 45), # Rotate +/-45 degrees
"size_scale": (0.8, 1.3), # Size 80-130%
"texture": ["cardboard_plain", "cardboard_printed",
"cardboard_worn", "cardboard_wet"],
"mass": (0.1, 0.5), # 100g - 500g
}
Complete Pipeline Diagram
+-----------------------------------------------------+
| COMPLETE PIPELINE |
| |
| Isaac Lab Sim |
| | |
| v |
| RL Expert Demos (500-1000 episodes) |
| | |
| v |
| SFT Training (OpenVLA-OFT) ---- ~20% success |
| | |
| v |
| RL Training (GRPO) ------------ ~50% success |
| | |
| v |
| Sim Evaluation (100 episodes) |
| | |
| v |
| Sim-to-Real Transfer <-- YOU ARE HERE |
| | |
| v |
| Real Evaluation (10 trials) |
| | |
| v |
| Real Fine-tune (10-20 demos) -- ~65% success |
| | |
| v |
| Iterate & Improve |
+-----------------------------------------------------+
What Comes Next
After getting the first real-world evaluation results, the improvement loop begins:
- Analyze failure modes — Understand where the robot fails
- Collect more real demos focusing on failure scenarios
- Fine-tune the model on real data
- Re-evaluate — Repeat until reaching target success rate
SimpleVLA-RL gives us a major advantage: instead of just collecting more demos (like pure Imitation Learning), we can go back to sim, run more RL epochs focused on failure scenarios, then transfer back to real. This is closed-loop improvement that pure Imitation Learning cannot provide.
The path from 0% to an autonomous box-grasping robot is not easy — but with SimpleVLA-RL, each iteration brings measurable improvement. Patience and a systematic approach are the keys to success.
Related Posts
- SimpleVLA-RL (10): SFT & RL Training for OpenArm — Train a VLA model for OpenArm with SFT and GRPO
- SimpleVLA-RL (4): Experimental Results — Analysis of original paper results on LIBERO and Piper
- SimpleVLA-RL (2): Detailed Architecture — Deep dive into OpenVLA-OFT and GRPO algorithm