
SO-101 Sim-to-Real Training: Isaac Lab & LeRobot Guide

Step-by-step guide to train SO-101 robot arm in NVIDIA Isaac Lab, collect teleop data, fine-tune GR00T N1.5, and deploy the policy on real hardware.

Nguyen Anh Tuan · April 29, 2026 · 11 min read

You just unboxed an SO-101 robot arm — 6 degrees of freedom, Feetech STS3215 servos, under $100. It's the most affordable robot arm in HuggingFace's LeRobot community. But the next question is the hard one: how do you teach it to pick and place objects reliably?

The answer is sim-to-real transfer — train a policy in simulation, then deploy it on the real robot. NVIDIA has published a comprehensive learning path combining Isaac Lab (physics-accurate simulation), LeRobot (data collection + training framework), and GR00T N1.5 (a 3B-parameter vision-language-action foundation model). This article walks through the entire pipeline from scratch.

Why Sim-to-Real Instead of Just Real Teleoperation?

Teleoperation directly on the physical robot has two major problems:

  1. Data bottleneck: Collecting 100 demo episodes on a real robot takes hours, is fatiguing, and produces data that lacks diversity.
  2. Risk: Small errors in demonstration data → policy learns wrong behavior → robot crashes.

With simulation, you can collect thousands of demos in minutes, apply domain randomization (randomize lighting, textures, physics) to make the policy robust, and safely test before running on real hardware. For a deeper look at Isaac Lab fundamentals, see Isaac Lab from Scratch: Simulation for Robot Learning.

Required Hardware

SO-101 consists of a pair of arms: a Leader (you move by hand for demonstrations) and a Follower (runs the learned policy). Both use Feetech STS3215 servos.

| Component | Specification |
| --- | --- |
| Degrees of freedom | 6 DOF (base, shoulder, elbow, wrist pitch, wrist roll, gripper) |
| Follower motors | 6× Feetech STS3215, gear ratio 1/345 |
| Leader motors | Mixed gear ratios per joint (1/191 to 1/345) for easy hand-held movement |
| Cameras | 2× USB webcam, 640×480 @ 30 fps (wrist + front) |
| Controller board | Waveshare Bus Servo Adapter |
| Training/inference PC | GPU with ≥25 GB VRAM (RTX 4080 or better) |

Most SO-101 structural parts are 3D-printed. You can buy a kit from Hiwonder, or source the BOM from TheRobotStudio's GitHub and print the parts in PLA or PETG.

[Figure: SO-101 robot arm training workflow, from teleop data collection to sim-to-real deployment]

Technical Stack Overview

Isaac Sim (NVIDIA Omniverse)
   └── Physically-accurate 3D environment for SO-101
       │
Isaac Lab
   └── Training framework: RL/IL, domain randomization, scripted policies
       │
LeRobot (HuggingFace)
   └── Data collection, Hub upload, policy training, robot control
       │
GR00T N1.5 (3B params)
   └── VLA foundation model: vision + language → action sequences
       │
SO-101 Follower Arm
   └── Real hardware deployment via inference server

Isaac Lab provides a physically-accurate environment for the SO-101 with the "vial-to-rack" task (pick up a centrifuge vial and place it in a rack). LeRobot is the "glue layer" — data collection, HuggingFace Hub upload, policy training, and robot control. GR00T N1.5 takes camera images and a language instruction, then outputs action sequences for each joint. For a deep introduction to the LeRobot framework, read LeRobot Framework: Imitation Learning for Real Robots.
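End to end, the stack reduces to one signature: camera frames plus a language instruction in, a chunk of joint-space actions out. A minimal sketch of that interface (`Observation` and `predict_action_chunk` are illustrative names, not the Isaac-GR00T API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One control tick of policy input (illustrative layout)."""
    wrist_image: np.ndarray      # (480, 640, 3) uint8 wrist camera frame
    front_image: np.ndarray      # (480, 640, 3) uint8 front camera frame
    joint_positions: np.ndarray  # (6,) current joint state
    instruction: str             # natural-language task description

def predict_action_chunk(obs: Observation, horizon: int = 16) -> np.ndarray:
    """Stand-in for GR00T N1.5 inference: a (horizon, 6) chunk of target
    joint positions. A trained policy replaces this stub."""
    return np.tile(obs.joint_positions, (horizon, 1))  # placeholder: hold pose

obs = Observation(
    wrist_image=np.zeros((480, 640, 3), np.uint8),
    front_image=np.zeros((480, 640, 3), np.uint8),
    joint_positions=np.zeros(6, np.float32),
    instruction="Pick up the vial and place it in the yellow rack.",
)
print(predict_action_chunk(obs).shape)  # (16, 6)
```

Predicting a chunk rather than a single action is what lets the client execute several steps per network round-trip.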

Step 1: Environment Setup

System Requirements

  • Ubuntu 22.04 (recommended)
  • Python 3.10
  • CUDA 12.x
  • GPU with ≥25GB VRAM

Install Isaac GR00T

git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
conda create -n gr00t python=3.10
conda activate gr00t
pip install --upgrade setuptools
pip install -e ".[base]"
# Flash Attention speeds up training by ~2×
pip install --no-build-isolation flash-attn==2.7.1.post4

Install LeRobot with Feetech SDK

pip install lerobot
# Feetech SDK required to communicate with STS3215 servos
pip install 'lerobot[feetech]'

Download the GR00T N1.5 Model

huggingface-cli download nvidia/GR00T-N1.5-3B

Step 2: Assemble and Configure the SO-101

Find USB Ports

Connect each arm to your computer and run:

lerobot-find-port
# When prompted: unplug the USB cable of the arm you're configuring
# Example output: /dev/ttyACM0 (follower), /dev/ttyACM1 (leader)

On Linux, grant USB access permissions:

sudo chmod 666 /dev/ttyACM0
sudo chmod 666 /dev/ttyACM1
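If you prefer to see the candidates without the interactive unplug step, a quick stdlib sketch (Linux device naming assumed) lists likely servo-adapter ports:

```python
import glob

def candidate_servo_ports() -> list:
    """Serial devices where the Waveshare bus servo adapter typically
    appears on Linux: /dev/ttyACM*, sometimes /dev/ttyUSB*."""
    return sorted(glob.glob("/dev/ttyACM*") + glob.glob("/dev/ttyUSB*"))

print(candidate_servo_ports())  # e.g. ['/dev/ttyACM0', '/dev/ttyACM1']
```

Unplugging one arm and re-running shows which path belongs to which arm, the same diffing idea `lerobot-find-port` automates.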

Set Motor IDs and Baudrate

Each motor needs a unique ID from 1–6. Connect one motor at a time to the controller board and run:

# Configure follower arm
lerobot-setup-motors \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_follower_arm

The script guides you through connecting each motor in sequence, automatically assigning IDs: 1 (shoulder pan) → 2 (shoulder lift) → 3 (elbow flex) → 4 (wrist flex) → 5 (wrist roll) → 6 (gripper).

# Configure leader arm
lerobot-setup-motors \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_leader_arm

Calibration

Calibration ensures leader and follower report the same position values when in the same physical pose. This is mandatory if you want to transfer policies between robots — a neural network trained on one robot needs to know position offsets to run correctly on another.

# Calibrate follower
lerobot-calibrate \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_follower_arm

# Calibrate leader
lerobot-calibrate \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_leader_arm

During calibration, you'll move each arm to reference poses (neutral position, joint limits). LeRobot saves offsets to ~/.cache/lerobot/calibration/.
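Conceptually, the saved offsets let a reading from either arm be expressed in one shared joint convention. A simplified sketch, ignoring the leader/follower gear-ratio difference and using made-up raw tick values (LeRobot's real calibration files live under ~/.cache/lerobot/calibration/ and store more than this):

```python
import numpy as np

# Illustrative raw servo readings captured at the neutral reference pose.
leader_neutral = np.array([2011, 1987, 2050, 2033, 1999, 2048])    # raw ticks
follower_neutral = np.array([2048, 2048, 2048, 2048, 2048, 2048])  # raw ticks

def leader_to_follower(raw_leader: np.ndarray) -> np.ndarray:
    """Map a leader reading into the follower's frame by removing each arm's
    neutral-pose offset, so the same physical pose yields the same values."""
    return raw_leader - leader_neutral + follower_neutral

# At the neutral pose, both arms now report identical positions:
print(leader_to_follower(leader_neutral))  # [2048 2048 2048 2048 2048 2048]
```

This shared convention is exactly what lets a policy trained against one arm's readings drive another arm correctly.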

Step 3: Collect Data in Isaac Lab

Instead of direct real-robot teleoperation (time-consuming and tiring), teleop inside Isaac Sim to collect demos faster and with more variety. For more on teleoperation data collection techniques, see LeRobot Teleop and Real-World Data Collection.

# Launch teleoperation in Isaac Lab with Domain Randomization enabled
lerobot_agent \
    --task Lerobot-So101-Teleop-Vials-To-Rack-DR \
    --repo_id ${HF_USER}/so101_teleop_vials \
    --repo_root $(pwd)/datasets/so101_teleop_vials

Recording controls:

| Key | Function |
| --- | --- |
| S | Start/stop recording an episode |
| C | Cancel the episode (discard without saving) |
| R | Reset the environment with new randomization parameters |

Domain Randomization (DR) is applied on each reset, randomizing:

  • Lighting: exposure −4 to +3 stops, color temperature 2500K–9500K, random HDRI selection
  • Camera pose: ±0.02m position offset, ±0.05 rad rotation offset
  • Objects: random vial and rack positions on the table, 33% chance a vial is pre-placed in a rack slot
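Per-reset sampling of those ranges can be sketched as follows (the helper name and dict layout are illustrative; Isaac Lab applies the real randomization internally):

```python
import random

def sample_randomization(rng: random.Random) -> dict:
    """Draw one reset's domain-randomization parameters using the ranges
    listed above."""
    return {
        "exposure_stops": rng.uniform(-4.0, 3.0),
        "color_temp_K": rng.uniform(2500.0, 9500.0),
        "cam_pos_offset_m": [rng.uniform(-0.02, 0.02) for _ in range(3)],
        "cam_rot_offset_rad": [rng.uniform(-0.05, 0.05) for _ in range(3)],
        "vial_preplaced": rng.random() < 0.33,  # 33% chance of pre-placed vial
    }

params = sample_randomization(random.Random(0))
print(params)
```

Because a fresh draw happens on every R-key reset, even a modest number of episodes covers a wide slice of visual and physical conditions.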

Target: collect at least 70 episodes for quality policy training results.

Step 4: Upload Dataset to HuggingFace Hub

huggingface-cli login  # enter your HF token
lerobot-upload \
    --repo_id ${HF_USER}/so101_teleop_vials \
    --repo_root $(pwd)/datasets/so101_teleop_vials

The dataset is saved in LeRobot v2 format: JSON metadata, video observations (640×480 @30fps), joint positions, and gripper state. Each episode is indexed and can be browsed directly on the HuggingFace Hub.
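A quick way to sanity-check such a dataset is to read its top-level metadata. The sketch below assumes the v2 convention of a meta/info.json at the dataset root; verify the key names against your own dataset before relying on them:

```python
import json
import tempfile
from pathlib import Path

def summarize_dataset(root) -> dict:
    """Read meta/info.json from a LeRobot v2 dataset root and report the
    fields most worth eyeballing before training."""
    info = json.loads((Path(root) / "meta" / "info.json").read_text())
    return {
        "episodes": info.get("total_episodes"),
        "fps": info.get("fps"),
        "features": sorted(info.get("features", {})),
    }

# Demo on a minimal fake dataset (a real one comes from the teleop recording):
root = Path(tempfile.mkdtemp())
(root / "meta").mkdir()
(root / "meta" / "info.json").write_text(json.dumps({
    "total_episodes": 70, "fps": 30,
    "features": {"observation.images.wrist": {},
                 "observation.images.front": {}, "action": {}},
}))
print(summarize_dataset(root))
```

Seeing the expected episode count, 30 fps, and both camera feature keys here catches most upload mistakes before the fine-tuning step.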

Step 5: Fine-Tune GR00T N1.5

Prepare the Modality Config

SO-101 is a new embodiment (not in GR00T's pre-training corpus). GR00T uses EmbodimentTag = new_embodiment for unseen robots. Copy the appropriate modality config for your camera setup:

# Dual-camera setup (wrist + front camera)
cp getting_started/examples/so100_dualcam__modality.json \
   ./demo_data/so101-vials/meta/modality.json

# For single-camera setups:
# cp getting_started/examples/so100__modality.json \
#    ./demo_data/so101-vials/meta/modality.json

Verify Dataset Loading

python scripts/load_dataset.py \
    --dataset-path ./demo_data/so101-vials \
    --plot-state-action \
    --video-backend torchvision_av

If successful, you'll see state/action plots and video preview for each episode.

Training Command

python scripts/gr00t_finetune.py \
   --dataset-path ./demo_data/so101-vials/ \
   --num-gpus 1 \
   --output-dir ./checkpoints/so101-policy \
   --max-steps 10000 \
   --data-config so100_dualcam \
   --video-backend torchvision_av \
   --no-tune_diffusion_model \
   --batch-size 16 \
   --lora-rank 16 \
   --dataloader-num-workers 16

Key training flags:

| Flag | Purpose | When to use |
| --- | --- | --- |
| `--no-tune_diffusion_model` | Skip fine-tuning the diffusion head; train only LoRA | GPU VRAM < 40 GB |
| `--max-steps 10000` | 10K steps for simple tasks | Basic pick-and-place |
| `--max-steps 20000` | 20K steps for complex tasks | Multi-step manipulation |
| `--lora-rank 16` | LoRA rank for parameter-efficient fine-tuning | Balances quality vs. compute |
| `--batch-size 16` | Batch size | Fits an RTX 4080 16 GB with `--no-tune_diffusion_model` |

Training takes approximately 2–4 hours on an RTX 4080 with 10K steps and 70 episodes.

Monitoring Convergence

Average MSE should reach 50–60 on action prediction around step 5000. If loss doesn't decrease after 2000 steps, check:

  1. Dataset format: must be LeRobot v2, not v3 (use PR #2109 to convert if needed)
  2. modality.json camera keys match your setup (default: wrist and front)
  3. --data-config matches your number of cameras
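Point 2 can be automated by comparing the camera keys in modality.json against what your dataset records. The sketch below assumes a top-level "video" section keyed by camera name, as in the so100 example configs; adjust if your config differs:

```python
import json
import tempfile
from pathlib import Path

def mismatched_camera_keys(modality_path, expected: set) -> set:
    """Return camera keys that appear in only one of: the modality.json
    'video' section, or the set your dataset actually records."""
    cfg = json.loads(Path(modality_path).read_text())
    return set(cfg.get("video", {})) ^ expected  # symmetric difference

# Demo with a fake dual-camera config:
path = Path(tempfile.mkdtemp()) / "modality.json"
path.write_text(json.dumps({"video": {"wrist": {}, "front": {}}}))
print(mismatched_camera_keys(path, {"wrist", "front"}))  # set() -> keys agree
print(mismatched_camera_keys(path, {"wrist", "top"}))    # both odd keys flagged
```

An empty result means the config and dataset agree; anything else is the likely cause of a silently non-converging run.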

Step 6: Sim Evaluation (Open-Loop)

Before deploying on real hardware, run open-loop evaluation to verify action prediction quality:

python scripts/eval_policy.py --plot \
   --embodiment_tag new_embodiment \
   --model_path ./checkpoints/so101-policy \
   --data_config so100_dualcam \
   --dataset_path ./demo_data/so101-vials/ \
   --video_backend torchvision_av \
   --modality_keys single_arm gripper

The output shows action trajectory plots comparing ground truth vs predicted actions. Good open-loop performance is a necessary but not sufficient condition — policies can still fail in closed-loop execution due to error accumulation.
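The metric behind those plots is just the mean squared error between the two trajectories, averaged over time steps and joints:

```python
import numpy as np

def open_loop_mse(gt: np.ndarray, pred: np.ndarray) -> float:
    """MSE between ground-truth and predicted action trajectories,
    both shaped (T, num_joints), as compared by eval_policy.py's plots."""
    return float(np.mean((gt - pred) ** 2))

gt = np.zeros((100, 6))
pred = np.full((100, 6), 0.1)    # constant 0.1 offset on every joint
print(open_loop_mse(gt, pred))   # ~0.01
```

Note the caveat in the next paragraph: a low open-loop MSE is necessary but not sufficient, since it says nothing about error accumulation once the policy's own actions feed back into its observations.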

[Figure: GR00T N1.5 server-client inference architecture for SO-101 real-robot deployment]

Step 7: Deploy on Real Hardware

Deploy using a server-client split architecture. The server runs model inference on the GPU; the client reads cameras and commands the servos.
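The split can be sketched as a plain control loop: the client gathers observations, asks the server for an action chunk, and executes part of it before re-querying. A minimal skeleton with stubbed I/O (the callables and chunk sizes are illustrative, not the inference_service.py protocol):

```python
import time
import numpy as np

def control_loop(get_obs, query_server, send_joints, n_chunks: int,
                 steps_per_chunk: int = 8, hz: float = 30.0) -> int:
    """Client-side skeleton: observe, query, execute. The three callables
    stand in for camera capture, the network round-trip to the GPU server,
    and the serial bus to the STS3215 servos. Returns commands sent."""
    sent = 0
    for _ in range(n_chunks):
        obs = get_obs()                    # images + joint state + instruction
        chunk = query_server(obs)          # e.g. a (16, 6) action chunk
        for action in chunk[:steps_per_chunk]:
            send_joints(action)            # one joint-position command
            time.sleep(1.0 / hz)           # hold the control rate
            sent += 1
    return sent

# Dry run with stubs (hz raised so the demo finishes instantly):
log = []
sent = control_loop(
    get_obs=lambda: {"instruction": "pick up the vial"},
    query_server=lambda obs: np.zeros((16, 6)),
    send_joints=log.append,
    n_chunks=2, hz=1000.0)
print(sent)  # 16
```

Executing only part of each predicted chunk before re-querying keeps the policy closed-loop while amortizing the inference cost over several servo commands.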

Terminal 1 — Inference Server

python scripts/inference_service.py --server \
    --model_path ./checkpoints/so101-policy \
    --embodiment-tag new_embodiment \
    --data-config so100_dualcam \
    --denoising-steps 4

If the robot moves jerkily, increase to --denoising-steps 16 for smoother trajectories (tradeoff: inference is ~4× slower).

Terminal 2 — Robot Client

python getting_started/examples/eval_lerobot.py \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_follower_arm \
    --robot.cameras="{ \
        wrist: {type: opencv, index_or_path: 9, width: 640, height: 480, fps: 30}, \
        front: {type: opencv, index_or_path: 15, width: 640, height: 480, fps: 30}}" \
    --policy_host=127.0.0.1 \
    --lang_instruction="Pick up the vial and place it in the yellow rack."

Finding the right camera index:

v4l2-ctl --list-devices
# Or enumerate: ls /dev/video*
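A stdlib sketch of the same enumeration (many webcams expose an extra metadata node that delivers no frames, so cross-check the indices against the v4l2-ctl output):

```python
import glob
import re

def indices_from_paths(paths) -> list:
    """Extract the numeric index from each /dev/videoN path, sorted."""
    return sorted(int(re.search(r"(\d+)$", p).group(1)) for p in paths)

def video_device_indices() -> list:
    """Enumerate V4L2 device indices to try as index_or_path values."""
    return indices_from_paths(glob.glob("/dev/video*"))

print(video_device_indices())
```

The indices 9 and 15 in the client command above are machine-specific; substitute whichever of your enumerated indices actually delivers frames for the wrist and front cameras.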

Two Core Sim-to-Real Strategies

Strategy 1: Domain Randomization (DR) — Sim-Only

Train entirely in simulation with aggressive randomization so the policy generalizes to real-world conditions. No real demonstrations needed.

Strengths:

  • No real-world data collection required
  • Scales well across many randomization parameters
  • Simple to implement

Weaknesses:

  • Policies tend to be conservative (slow, "defensive" movements)
  • Requires expertise to tune randomization ranges appropriately
  • Visual accuracy lower than co-training approaches

Recommended when: You don't yet have a physical robot, or want to quickly validate the training pipeline.

Strategy 2: Co-Training (Sim + Real)

Combine a small amount of real teleoperation data with a large sim dataset. Just 5 real episodes combined with 70–100 sim episodes is typically sufficient.

# Collect real teleoperation data
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --dataset.repo_id=${HF_USER}/so101_real_5eps \
    --dataset.num_episodes=5

Strengths:

  • Higher accuracy than DR-only because real data anchors the policy to visual reality
  • Much less real data required than real-only training

Weaknesses:

  • Still requires some physical robot teleoperation
  • Slightly more complex pipeline setup

Pre-trained co-training checkpoint available at: aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left_sim_and_real/checkpoint-10000
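The sim/real mix can be sketched as a weighted episode sampler, so the handful of real episodes is not swamped by the much larger sim set (the 25% real fraction here is an illustrative choice, not NVIDIA's recipe):

```python
import random

def cotrain_sampler(sim_ids, real_ids, real_fraction: float,
                    rng: random.Random):
    """Yield episode ids, drawing from the real pool with a fixed
    probability regardless of its (much smaller) size."""
    while True:
        pool = real_ids if rng.random() < real_fraction else sim_ids
        yield rng.choice(pool)

# 70 sim episodes (ids 0-69) plus 5 real ones (ids 100-104):
sampler = cotrain_sampler(list(range(70)), [100, 101, 102, 103, 104],
                          real_fraction=0.25, rng=random.Random(0))
batch = [next(sampler) for _ in range(1000)]
real_share = sum(e >= 100 for e in batch) / len(batch)
print(round(real_share, 2))  # roughly 0.25
```

Without such weighting, uniform sampling over 75 episodes would show the model real imagery only ~7% of the time, weakening the visual anchoring that motivates co-training.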

Common Errors and Solutions

| Error | Cause | Fix |
| --- | --- | --- |
| `ModuleNotFoundError: service` | Wrong Isaac-GR00T branch | Check out the `n1.5-release` tag |
| Model ignores language instruction | Overfitting on visual patterns | Reduce `--max-steps`; add language variation in demos |
| Jerky robot movement | `--denoising-steps` too low | Increase to 16 |
| v3 dataset incompatible | Dataset format newer than GR00T supports | Use LeRobot v2 or convert via PR #2109 |
| CUDA out of memory | Insufficient VRAM | Add the `--no-tune_diffusion_model` flag |
| Camera not recognized | Wrong index | Use `v4l2-ctl --list-devices` to find the correct index |

Next Steps

Once you've mastered the basic SO-101 + Isaac Lab + GR00T N1.5 pipeline, consider exploring:

  1. Cosmos Augmentation — Use NVIDIA Cosmos to generate synthetic variations of real camera images (change textures, lighting), increasing visual diversity without recording more demos.
  2. SAGE + GapONet — Quantitatively measure the sim-to-real gap by comparing actuator dynamics between sim and real, then compensate with a learned residual model.
  3. SmolVLA as a lightweight alternative — A smaller (~0.5B parameter) model suitable for deployment on lower-spec GPUs. See LeRobot SmolVLA: Lightweight Training for Real Robots.

The key insight from this entire pipeline: diverse data matters more than abundant data. Seventy episodes with strong domain randomization consistently outperform 500 episodes collected under fixed conditions — because the policy needs to generalize to the real world, not just overfit to simulation conditions.

For a deeper dive into sim-to-real transfer fundamentals, see Complete Sim-to-Real Pipeline: From Isaac Lab to Real Hardware.

