You just unboxed an SO-101 robot arm — 6 degrees of freedom, Feetech STS3215 servos, under $100. It's the most affordable robot arm in HuggingFace's LeRobot community. But the next question is the hard one: how do you teach it to pick and place objects reliably?
The answer is sim-to-real transfer — train a policy in simulation, then deploy it on the real robot. NVIDIA has published a comprehensive learning path combining Isaac Lab (physics-accurate simulation), LeRobot (data collection + training framework), and GR00T N1.5 (a 3B-parameter vision-language-action foundation model). This article walks through the entire pipeline from scratch.
Why Sim-to-Real Instead of Just Real Teleoperation?
Teleoperation directly on the physical robot has two major problems:
- Data bottleneck: Collecting 100 demo episodes on a real robot takes hours, is fatiguing, and produces data that lacks diversity.
- Risk: Small errors in demonstration data → policy learns wrong behavior → robot crashes.
With simulation, you can collect thousands of demos in minutes, apply domain randomization (randomize lighting, textures, physics) to make the policy robust, and safely test before running on real hardware. For a deeper look at Isaac Lab fundamentals, see Isaac Lab from Scratch: Simulation for Robot Learning.
Required Hardware
SO-101 consists of a pair of arms: a Leader (you move by hand for demonstrations) and a Follower (runs the learned policy). Both use Feetech STS3215 servos.
| Component | Specification |
|---|---|
| Degrees of freedom | 6 DOF (Base, Shoulder, Elbow, Wrist Pitch, Wrist Roll, Gripper) |
| Follower motors | 6× Feetech STS3215, gear ratio 1/345 |
| Leader motors | Mixed gear ratios per joint (1/191 to 1/345) for easy hand-held movement |
| Cameras | 2× USB webcam 640×480 @30fps (wrist + front) |
| Controller board | Waveshare Bus Servo Adapter |
| Training/inference PC | GPU with ≥25GB VRAM for full fine-tuning; 16GB (e.g. RTX 4080) suffices with LoRA and `--no-tune_diffusion_model` |
Most SO-101 structural parts are 3D-printed. You can buy a kit from Hiwonder or order your own BOM from TheRobotStudio's GitHub and print in PLA or PETG.
Technical Stack Overview
Isaac Sim (NVIDIA Omniverse)
└── Physically-accurate 3D environment for SO-101
│
Isaac Lab
└── Training framework: RL/IL, domain randomization, scripted policies
│
LeRobot (HuggingFace)
└── Data collection, Hub upload, policy training, robot control
│
GR00T N1.5 (3B params)
└── VLA foundation model: vision + language → action sequences
│
SO-101 Follower Arm
└── Real hardware deployment via inference server
Isaac Lab provides a physically-accurate environment for the SO-101 with the "vial-to-rack" task (pick up a centrifuge vial and place it in a rack). LeRobot is the "glue layer" — data collection, HuggingFace Hub upload, policy training, and robot control. GR00T N1.5 takes camera images and a language instruction, then outputs action sequences for each joint. For a deep introduction to the LeRobot framework, read LeRobot Framework: Imitation Learning for Real Robots.
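Conceptually, the VLA policy is just a function from observations plus a language instruction to a short chunk of joint actions. A minimal Python sketch of that interface (the names, shapes, and horizon here are illustrative stand-ins, not GR00T's actual API):

```python
import random

JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
          "wrist_flex", "wrist_roll", "gripper"]

def predict_action_chunk(wrist_img, front_img, joint_state, instruction, horizon=16):
    """Toy stand-in for a VLA policy: returns `horizon` steps of
    6-DOF joint targets. A real model conditions on the images and
    the instruction; here we just hold the current pose with noise."""
    assert len(joint_state) == len(JOINTS)
    return [[q + random.uniform(-0.01, 0.01) for q in joint_state]
            for _ in range(horizon)]

chunk = predict_action_chunk(None, None, [0.0] * 6,
                             "Pick up the vial and place it in the rack.")
print(len(chunk), len(chunk[0]))  # 16 steps x 6 joints
```

The key point is the chunked output: the model predicts a short action sequence per inference call rather than a single step, which is why inference latency matters less than you might expect.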
Step 1: Environment Setup
System Requirements
- Ubuntu 22.04 (recommended)
- Python 3.10
- CUDA 12.x
- GPU with ≥25GB VRAM for full fine-tuning (16GB works with LoRA and `--no-tune_diffusion_model`)
Install Isaac GR00T
git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
conda create -n gr00t python=3.10
conda activate gr00t
pip install --upgrade setuptools
pip install -e ".[base]"
# Flash Attention speeds up training by ~2×
pip install --no-build-isolation flash-attn==2.7.1.post4
Install LeRobot with Feetech SDK
# Feetech SDK required to communicate with STS3215 servos
pip install 'lerobot[feetech]'
Download the GR00T N1.5 Model
huggingface-cli download nvidia/GR00T-N1.5-3B
Step 2: Assemble and Configure the SO-101
Find USB Ports
Connect each arm to your computer and run:
lerobot-find-port
# When prompted: unplug the USB cable of the arm you're configuring
# Example output: /dev/ttyACM0 (follower), /dev/ttyACM1 (leader)
On Linux, grant USB access permissions:
sudo chmod 666 /dev/ttyACM0
sudo chmod 666 /dev/ttyACM1
Set Motor IDs and Baudrate
Each motor needs a unique ID from 1–6. Connect one motor at a time to the controller board and run:
# Configure follower arm
lerobot-setup-motors \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_follower_arm
The script guides you through connecting each motor in sequence, automatically assigning IDs: 1 (shoulder pan) → 2 (shoulder lift) → 3 (elbow flex) → 4 (wrist flex) → 5 (wrist roll) → 6 (gripper).
# Configure leader arm
lerobot-setup-motors \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--teleop.id=my_leader_arm
Calibration
Calibration ensures leader and follower report the same position values when in the same physical pose. This is mandatory if you want to transfer policies between robots — a neural network trained on one robot needs to know position offsets to run correctly on another.
# Calibrate follower
lerobot-calibrate \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_follower_arm
# Calibrate leader
lerobot-calibrate \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--teleop.id=my_leader_arm
During calibration, you'll move each arm to reference poses (neutral position, joint limits). LeRobot saves offsets to ~/.cache/lerobot/calibration/.
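The effect of calibration can be sketched as subtracting a per-joint homing offset so both arms report the same value in the same physical pose (a conceptual illustration with made-up tick values, not LeRobot's actual calibration code):

```python
# Raw servo readings (ticks) for the same physical pose on two arms.
leader_raw   = {"shoulder_pan": 2105, "elbow_flex": 1498}
follower_raw = {"shoulder_pan": 1987, "elbow_flex": 1523}

# Offsets recorded during calibration at the shared reference pose.
leader_offset   = {"shoulder_pan": 105, "elbow_flex": -2}
follower_offset = {"shoulder_pan": -13, "elbow_flex": 23}

def calibrated(raw, offset):
    """Map raw ticks into a common frame by removing the homing offset."""
    return {j: raw[j] - offset[j] for j in raw}

print(calibrated(leader_raw, leader_offset))
print(calibrated(follower_raw, follower_offset))
# Both arms now report 2000 and 1500 for these joints.
```

This shared frame is what lets a policy trained against one arm's positions command another arm correctly.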
Step 3: Collect Data in Isaac Lab
Instead of direct real-robot teleoperation (time-consuming and tiring), teleop inside Isaac Sim to collect demos faster and with more variety. For more on teleoperation data collection techniques, see LeRobot Teleop and Real-World Data Collection.
# Launch teleoperation in Isaac Lab with Domain Randomization enabled
lerobot_agent \
--task Lerobot-So101-Teleop-Vials-To-Rack-DR \
--repo_id ${HF_USER}/so101_teleop_vials \
--repo_root $(pwd)/datasets/so101_teleop_vials
Recording controls:
| Key | Function |
|---|---|
| `S` | Start/Stop recording an episode |
| `C` | Cancel episode (discard without saving) |
| `R` | Reset environment with new randomization parameters |
Domain Randomization (DR) is applied on each reset, randomizing:
- Lighting: exposure −4 to +3 stops, color temperature 2500K–9500K, random HDRI selection
- Camera pose: ±0.02m position offset, ±0.05 rad rotation offset
- Objects: random vial and rack positions on the table, 33% chance a vial is pre-placed in a rack slot
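The per-reset randomization above can be sketched as a simple sampler over the listed ranges (illustrative only; not the Isaac Lab DR API):

```python
import random

def sample_domain_randomization(rng=random):
    """Draw one set of per-reset randomization parameters,
    using the ranges from the list above."""
    return {
        "exposure_stops": rng.uniform(-4.0, 3.0),
        "color_temp_k": rng.uniform(2500, 9500),
        "camera_pos_offset_m": [rng.uniform(-0.02, 0.02) for _ in range(3)],
        "camera_rot_offset_rad": [rng.uniform(-0.05, 0.05) for _ in range(3)],
        "vial_preplaced": rng.random() < 0.33,  # 33% chance
    }

params = sample_domain_randomization()
print(params)
```

Every `R` reset draws a fresh set of these parameters, so no two recorded episodes share the same visual conditions.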
Target: collect at least 70 episodes for quality policy training results.
Step 4: Upload Dataset to HuggingFace Hub
huggingface-cli login # enter your HF token
lerobot-upload \
--repo_id ${HF_USER}/so101_teleop_vials \
--repo_root $(pwd)/datasets/so101_teleop_vials
The dataset is saved in LeRobot v2 format: JSON metadata, video observations (640×480 @30fps), joint positions, and gripper state. Each episode is indexed and can be browsed directly on the HuggingFace Hub.
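The episode indexing can be illustrated with a small sketch of v2-style metadata that maps each episode to its global frame range (field names simplified here, not the exact on-disk schema):

```python
# Episode lengths in frames, as recorded.
episode_lengths = [412, 388, 455]

# Build cumulative (from, to) frame ranges, one entry per episode.
index, start = [], 0
for ep, n in enumerate(episode_lengths):
    index.append({"episode_index": ep, "from": start, "to": start + n})
    start += n

print(index)

# Look up which episode a global frame belongs to.
frame = 600
ep = next(e for e in index if e["from"] <= frame < e["to"])
print(ep["episode_index"])  # frame 600 falls in episode 1 (frames 412..799)
```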
Step 5: Fine-Tune GR00T N1.5
Prepare the Modality Config
SO-101 is a new embodiment (not in GR00T's pre-training corpus). GR00T uses EmbodimentTag = new_embodiment for unseen robots. Copy the appropriate modality config for your camera setup:
# Dual-camera setup (wrist + front camera)
cp getting_started/examples/so100_dualcam__modality.json \
./demo_data/so101-vials/meta/modality.json
# For single-camera setups:
# cp getting_started/examples/so100__modality.json \
# ./demo_data/so101-vials/meta/modality.json
Verify Dataset Loading
python scripts/load_dataset.py \
--dataset-path ./demo_data/so101-vials \
--plot-state-action \
--video-backend torchvision_av
If successful, you'll see state/action plots and video preview for each episode.
Training Command
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/so101-vials/ \
--num-gpus 1 \
--output-dir ./checkpoints/so101-policy \
--max-steps 10000 \
--data-config so100_dualcam \
--video-backend torchvision_av \
--no-tune_diffusion_model \
--batch-size 16 \
--lora-rank 16 \
--dataloader-num-workers 16
Key training flags:
| Flag | Purpose | When to Use |
|---|---|---|
| `--no-tune_diffusion_model` | Skip fine-tuning the diffusion head; only train LoRA | GPU VRAM < 40GB |
| `--max-steps 10000` | 10K steps for simple tasks | Basic pick-and-place |
| `--max-steps 20000` | 20K steps for complex tasks | Multi-step manipulation |
| `--lora-rank 16` | LoRA rank for parameter-efficient fine-tuning | Balances quality vs compute |
| `--batch-size 16` | Batch size | Fits RTX 4080 16GB with `--no-tune_diffusion_model` |
Training takes approximately 2–4 hours on an RTX 4080 with 10K steps and 70 episodes.
Monitoring Convergence
Average MSE on action prediction should drop to roughly 50–60 by around step 5000. If the loss doesn't decrease after 2000 steps, check that:
- The dataset format is LeRobot v2, not v3 (use PR #2109 to convert if needed)
- The `modality.json` camera keys match your setup (default: `wrist` and `front`)
- `--data-config` matches your number of cameras
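The MSE metric is just the mean squared difference between predicted and ground-truth action chunks, averaged over steps and joints. A minimal sketch of the computation (the units depend on the action space; this is not the training script's exact logging code):

```python
def action_mse(pred, target):
    """Mean squared error over all steps and joints of an action chunk."""
    assert len(pred) == len(target)
    n = err = 0
    for p_step, t_step in zip(pred, target):
        for p, t in zip(p_step, t_step):
            err += (p - t) ** 2
            n += 1
    return err / n

# Toy 2-step, 2-joint chunk.
gt   = [[0.0, 1.0], [2.0, 3.0]]
pred = [[0.5, 1.0], [2.0, 2.0]]
print(action_mse(pred, gt))  # (0.25 + 0 + 0 + 1.0) / 4 = 0.3125
```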
Step 6: Sim Evaluation (Open-Loop)
Before deploying on real hardware, run open-loop evaluation to verify action prediction quality:
python scripts/eval_policy.py --plot \
--embodiment_tag new_embodiment \
--model_path ./checkpoints/so101-policy \
--data_config so100_dualcam \
--dataset_path ./demo_data/so101-vials/ \
--video_backend torchvision_av \
--modality_keys single_arm gripper
The output shows action trajectory plots comparing ground truth vs predicted actions. Good open-loop performance is a necessary but not sufficient condition — policies can still fail in closed-loop execution due to error accumulation.
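The caveat can be made concrete: open-loop evaluation scores each predicted step against the logged trajectory, while closed-loop execution feeds the policy's own output back, so small per-step errors compound. A toy illustration with a 1-D trajectory and a constant prediction bias:

```python
# Ground truth advances by 0.1 per step; the "policy" has a +0.01 bias.
bias = 0.01
ground_truth = [0.1 * t for t in range(50)]

# Open-loop: predict each step from the *logged* previous state.
open_loop_err = [abs((gt_prev + 0.1 + bias) - gt)
                 for gt_prev, gt in zip(ground_truth, ground_truth[1:])]

# Closed-loop: predict from the policy's *own* previous output.
state, closed_loop_err = ground_truth[0], []
for gt in ground_truth[1:]:
    state = state + 0.1 + bias      # the bias is carried forward
    closed_loop_err.append(abs(state - gt))

print(open_loop_err[-1])    # ~0.01: stays at the per-step bias
print(closed_loop_err[-1])  # ~0.49: bias accumulated over 49 steps
```

This is why a policy with good open-loop plots can still drift off-task on the real robot.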
Step 7: Deploy on Real Hardware
Deploy using a server-client split architecture. The server runs model inference on the GPU; the client reads cameras and commands the servos.
Terminal 1 — Inference Server
python scripts/inference_service.py --server \
--model_path ./checkpoints/so101-policy \
--embodiment-tag new_embodiment \
--data-config so100_dualcam \
--denoising-steps 4
If the robot moves jerkily, increase to --denoising-steps 16 for smoother trajectories (tradeoff: inference is ~4× slower).
Terminal 2 — Robot Client
python getting_started/examples/eval_lerobot.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_follower_arm \
--robot.cameras="{ \
wrist: {type: opencv, index_or_path: 9, width: 640, height: 480, fps: 30}, \
front: {type: opencv, index_or_path: 15, width: 640, height: 480, fps: 30}}" \
--policy_host=127.0.0.1 \
--lang_instruction="Pick up the vial and place it in the yellow rack."
Finding the right camera index:
v4l2-ctl --list-devices
# Or enumerate: ls /dev/video*
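Under the hood, the client runs a simple loop: read observations, request an action chunk from the server, execute each step on the servos, repeat. A structural sketch (illustrative stand-in functions only; `eval_lerobot.py` handles all of this for you):

```python
import random

def get_observation():
    """Stand-in for camera capture + joint-state read."""
    return {"wrist": None, "front": None, "state": [0.0] * 6}

def request_chunk(obs, instruction, horizon=8):
    """Stand-in for the RPC to the GPU inference server."""
    return [[random.uniform(-0.01, 0.01) for _ in range(6)]
            for _ in range(horizon)]

executed = 0
for _ in range(3):  # control loop, truncated to 3 chunks for the sketch
    obs = get_observation()
    chunk = request_chunk(obs, "Pick up the vial and place it in the yellow rack.")
    for action in chunk:  # each step would be sent to the servos
        executed += 1

print(executed)  # 3 chunks x 8 steps = 24
```

Because the policy returns a chunk of actions per request, the robot keeps moving smoothly between (relatively slow) inference calls.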
Two Core Sim-to-Real Strategies
Strategy 1: Domain Randomization (DR) — Sim-Only
Train entirely in simulation with aggressive randomization so the policy generalizes to real-world conditions. No real demonstrations needed.
Strengths:
- No real-world data collection required
- Scales well across many randomization parameters
- Simple to implement
Weaknesses:
- Policies tend to be conservative (slow, "defensive" movements)
- Requires expertise to tune randomization ranges appropriately
- Visual accuracy lower than co-training approaches
Recommended when: You don't yet have a physical robot, or want to quickly validate the training pipeline.
Strategy 2: Co-Training (Sim + Real)
Combine a small amount of real teleoperation data with a large sim dataset. Just 5 real episodes combined with 70–100 sim episodes is typically sufficient.
# Collect real teleoperation data
lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--dataset.repo_id=${HF_USER}/so101_real_5eps \
--dataset.num_episodes=5
Strengths:
- Higher accuracy than DR-only because real data anchors the policy to visual reality
- Much less real data required than real-only training
Weaknesses:
- Still requires some physical robot teleoperation
- Slightly more complex pipeline setup
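The sim/real mix can be sketched as weighted sampling during training, so the handful of real episodes appears often enough to anchor the policy to real-world visuals (illustrative only; the 30% ratio is an assumption, and this is not LeRobot's actual co-training configuration):

```python
import random

random.seed(0)
sim_episodes  = [f"sim_{i}" for i in range(70)]
real_episodes = [f"real_{i}" for i in range(5)]

# Upweight real data so it forms ~30% of sampled batches
# despite being under 7% of the episode count.
population = sim_episodes + real_episodes
weights = ([0.7 / len(sim_episodes)] * len(sim_episodes) +
           [0.3 / len(real_episodes)] * len(real_episodes))

batch = random.choices(population, weights=weights, k=1000)
real_frac = sum(e.startswith("real") for e in batch) / len(batch)
print(round(real_frac, 2))  # ~0.3
```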
Pre-trained co-training checkpoint available at: aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left_sim_and_real/checkpoint-10000
Common Errors and Solutions
| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: service` | Wrong Isaac-GR00T branch | Checkout tag `n1.5-release` |
| Model ignores language instruction | Overfitting on visual patterns | Reduce `--max-steps`, add language variation in demos |
| Jerky robot movement | `denoising-steps` too low | Increase to 16 |
| v3 dataset incompatible | Dataset format newer than GR00T supports | Use LeRobot v2 or convert via PR #2109 |
| CUDA out of memory | Insufficient VRAM | Add `--no-tune_diffusion_model` flag |
| Camera not recognized | Wrong index | Use `v4l2-ctl --list-devices` to find correct index |
Next Steps
Once you've mastered the basic SO-101 + Isaac Lab + GR00T N1.5 pipeline, consider exploring:
- Cosmos Augmentation — Use NVIDIA Cosmos to generate synthetic variations of real camera images (change textures, lighting), increasing visual diversity without recording more demos.
- SAGE + GapONet — Quantitatively measure the sim-to-real gap by comparing actuator dynamics between sim and real, then compensate with a learned residual model.
- SmolVLA as a lightweight alternative — A smaller (~0.5B parameter) model suitable for deployment on lower-spec GPUs. See LeRobot SmolVLA: Lightweight Training for Real Robots.
The key insight from this entire pipeline: diverse data matters more than abundant data. Seventy episodes with strong domain randomization consistently outperform 500 episodes collected under fixed conditions — because the policy needs to generalize to the real world, not just overfit to simulation conditions.
For a deeper dive into sim-to-real transfer fundamentals, see Complete Sim-to-Real Pipeline: From Isaac Lab to Real Hardware.