Teaching a robot to pick up a shoe, place a can into a box, or hand an object from one gripper to another might seem straightforward — but it demands enormous amounts of demonstration data across diverse environments. RoboTwin 2.0 was built to solve exactly that problem.
The paper RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation (arXiv:2506.18088, accepted at ICLR 2026) proposes a complete framework for automatically generating large-scale expert data for dual-arm robots, with strong domain randomization to bridge the sim-to-real gap.
Why Is Dual-Arm Manipulation So Hard?
Single-arm manipulation is already complex. Dual-arm manipulation multiplies that complexity: the state space doubles, both arms must coordinate in real time, and many tasks require simultaneous contact from both grippers. To open a box lid, for instance, one arm holds the box while the other turns the lid — two entirely different subtasks that must be perfectly synchronized.
Collecting such data through teleoperation is extremely labor-intensive. RoboTwin 2.0's answer: generate data automatically in simulation, then apply domain randomization so trained policies generalize to the real world.
Framework Architecture
The framework has three main layers:
1. RoboTwin-OD — 731-Instance Object Library
This is the foundation of the entire system. RoboTwin-OD comprises:
- 731 object instances across 147 categories
- 534 objects generated via RGB-to-3D reconstruction using the Rodin platform
- 153 objects from Objaverse (open-source 3D library)
- 44 articulated objects (boxes, bottles, laptops) from PartNet-Mobility
Each object is annotated with: placement points, functional points, grasping points, grasp axis directions, and 15 language descriptions covering shape, texture, functionality, and part structure. These annotations are what enable the MLLM code generation pipeline to work correctly.
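The annotation schema itself is not shown here, so a conceptual sketch helps make the idea concrete. All field names below are illustrative assumptions, not the actual RoboTwin-OD format:

```python
# Hypothetical sketch of a RoboTwin-OD object annotation.
# Field names are assumptions; the real schema ships in the assets/ directory.
annotation = {
    "category": "can",
    "instance_id": "can_042",
    "placement_points": [[0.0, 0.0, 0.0]],     # stable resting poses (xyz)
    "functional_points": [[0.0, 0.0, 0.06]],   # where the object acts or is acted on
    "grasp_points": [[0.0, 0.03, 0.04]],       # candidate gripper targets
    "grasp_axes": [[0.0, 0.0, 1.0]],           # approach direction per grasp point
    "descriptions": [                          # 15 language variants in the real data
        "a red aluminum soda can",
        "a small cylindrical drink container",
    ],
}

# A code-generation agent can then resolve "grasp the can" to concrete geometry:
grasp_target = annotation["grasp_points"][0]
```

Structured annotations like these are what let generated code reference geometry symbolically instead of hard-coding coordinates per object.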
2. MLLM Code Generation — Automated Robot Control Code
Instead of manually writing control code for each task, RoboTwin 2.0 uses a Multimodal Large Language Model (MLLM) to automatically synthesize Python programs that control the robot.
The process is a closed-loop system:
[Task Description] → [Code Agent] → [Python Program]
↓
[Simulate × 10 trials]
↓
[VLM Observer analyzes results]
↓
[Failure diagnosis → feedback]
↓
[Code Agent refines code]
(up to 5 iterations)
The Code Agent uses DeepSeek-V3 or moonshot-v1-32k-vision to generate Python programs from task descriptions and SAPIEN simulator API specifications.
The VLM Observer watches all 10 simulation trials, detects failures (robot missed the object? both arms collided?), and returns language feedback to the Code Agent.
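The closed loop above can be sketched in a few lines. The `code_agent`, `run_trials`, and `vlm_observer` callables below are placeholders standing in for the actual MLLM calls and SAPIEN rollouts; only the control flow follows the paper's description:

```python
# Minimal sketch of the generate -> simulate -> observe -> refine loop.
def refine_task_code(task_desc, code_agent, run_trials, vlm_observer,
                     max_iters=5, n_trials=10, success_threshold=0.5):
    feedback = None
    for _ in range(max_iters):
        program = code_agent(task_desc, feedback)       # MLLM emits control code
        results = run_trials(program, n=n_trials)       # 10 simulated rollouts
        rate = sum(r["success"] for r in results) / n_trials
        if rate >= success_threshold:
            return program, rate
        feedback = vlm_observer(results)                # language failure diagnosis
    return program, rate

# Demo with stub components: the first program fails, the refined one succeeds.
prog, rate = refine_task_code(
    "place the can into the box",
    code_agent=lambda task, fb: "v2" if fb else "v1",
    run_trials=lambda p, n: [{"success": p == "v2"}] * n,
    vlm_observer=lambda res: "gripper missed the can; lower the approach height",
)
```

The key design choice is that feedback flows back as natural language, so the same loop works regardless of which MLLM backs the Code Agent.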
Result: code generation success rate jumps from 47.4% to 71.3%, roughly a 50% relative improvement from the multimodal feedback loop alone.
3. Domain Randomization — 5 Systematic Axes
This is what makes synthetic data actually useful for real-world deployment.
| Randomization Axis | Implementation |
|---|---|
| Scene Clutter | Random placement of task-irrelevant objects with collision-aware positioning |
| Textures | 12,000 human-verified textures from Stable Diffusion, applied to surfaces randomly |
| Lighting | Randomized color, type, intensity, and position of light sources within physical bounds |
| Tabletop Height | Uniform sampling ±3cm from standard height — simulates real tables of varying height |
| Language Instructions | Sentence templates combined with object descriptions → diverse instruction phrasings per trajectory |
Every trajectory in the dataset is randomized across all 5 axes, creating a massive variation space that forces the policy to learn genuinely generalizable features.
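A per-trajectory randomization draw might look like the following sketch. The ranges mirror the table above where stated (12,000 textures, ±3 cm table height, 15 descriptions); all other ranges and field names are assumptions:

```python
import random

# Illustrative sampler covering the five randomization axes.
def sample_randomization(rng=random):
    return {
        "clutter_objects": rng.randint(0, 5),               # task-irrelevant distractors
        "texture_id": rng.randrange(12_000),                # from the texture library
        "light": {
            "intensity": rng.uniform(0.5, 1.5),             # assumed physical bounds
            "color": [rng.uniform(0.8, 1.0) for _ in range(3)],
        },
        "table_height_offset_m": rng.uniform(-0.03, 0.03),  # ±3 cm from standard height
        "instruction_template": rng.randrange(15),          # 1 of 15 object descriptions
    }

cfg = sample_randomization()
```

Sampling all five axes independently per trajectory is what produces the combinatorial variation space the policy must cover.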
Benchmark: 50 Tasks, 5 Robots
RoboTwin 2.0 standardizes evaluation across 50 bimanual manipulation tasks and 5 robot embodiments:
Supported robots:
- Franka (7-DoF, research standard)
- UR5 (6-DoF, industrial standard)
- Piper (6-DoF, AgileX)
- ARX-X5 (6-DoF)
- Aloha-AgileX (dual 6-DoF arms; AgileX hardware based on the Stanford ALOHA design)
For each (task, embodiment) pair, the framework generates:
- 100 clean trajectories (no domain randomization)
- 400 randomized trajectories (full 5-axis randomization)
Over 100,000 expert trajectories are pre-collected and available for download from HuggingFace.
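The headline figure squares with the per-pair counts above:

```python
# 50 tasks x 5 embodiments x (100 clean + 400 randomized) trajectories
tasks, embodiments = 50, 5
per_pair = 100 + 400
total = tasks * embodiments * per_pair  # 125,000 at full coverage
```

125,000 at full coverage; "over 100,000" leaves room for (task, embodiment) pairs that are unsupported or filtered out.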
Installation
Hardware Requirements
- OS: Linux (Ubuntu 20.04/22.04 recommended)
- GPU: NVIDIA RTX (required for ray tracing and denoising)
- CUDA: 12.1
- NVIDIA Driver: ≥ 520
Step 1: Create Conda Environment
conda create -n RoboTwin python=3.10 -y
conda activate RoboTwin
Step 2: Install Vulkan Dependencies
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
vulkaninfo # verify installation
Step 3: Clone Repo and Install Dependencies
git clone https://github.com/RoboTwin-Platform/RoboTwin.git
cd RoboTwin
bash script/_install.sh
This script automatically installs SAPIEN (physics simulator), CuRobo (GPU-accelerated motion planner), and remaining dependencies.
If the automated installer fails, try the manual path:
pip install -r requirements.txt
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
# Install CuRobo in the envs/ directory
# Then fix mplib line 807: remove the "or collide" condition
Step 4: Download Assets
bash script/_download_assets.sh
This downloads all 3D models (731 objects), the texture library (12,000 textures), and embodiment configurations into the assets/ directory.
If you hit a config path error after installation:
python script/update_embodiment_config_path.py
Data Generation
Collecting Expert Trajectories for a Task
Once installed, you can run the code generation pipeline for any task:
# Generate code for a new task using the MLLM pipeline
python scripts/generate_task_code.py \
--task "move_can_to_box" \
--robot "franka" \
--iterations 5
# Collect trajectories once code is verified
python scripts/collect_data.py \
--task "move_can_to_box" \
--robot "franka" \
--num_clean 100 \
--num_random 400
Output Data Structure
Each trajectory is saved as HDF5 containing:
- Joint positions for both arms at each timestep
- RGB images from multiple camera viewpoints
- Depth images
- Language instruction (randomly sampled from 15 annotations)
- Domain randomization metadata
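Reading a trajectory back with h5py might look like this. The dataset keys below are assumptions about the layout, not the actual RoboTwin schema; inspect a real file with `f.visit(print)` to see the true keys:

```python
import h5py
import numpy as np

# Hypothetical loader; dataset names are assumed, not RoboTwin's actual layout.
def load_trajectory(path):
    with h5py.File(path, "r") as f:
        return {
            "qpos": f["joint_positions"][:],      # (T, 14) for two 7-DoF arms
            "rgb": f["cameras/head/rgb"][:],      # (T, H, W, 3) uint8
            "depth": f["cameras/head/depth"][:],  # (T, H, W)
            "instruction": f.attrs["instruction"],
        }

# Write a tiny synthetic file so the loader can be exercised end to end.
with h5py.File("demo_traj.hdf5", "w") as f:
    f.create_dataset("joint_positions", data=np.zeros((5, 14)))
    f.create_dataset("cameras/head/rgb", data=np.zeros((5, 8, 8, 3), dtype=np.uint8))
    f.create_dataset("cameras/head/depth", data=np.zeros((5, 8, 8)))
    f.attrs["instruction"] = "move the can into the box"

traj = load_trajectory("demo_traj.hdf5")
```

HDF5's hierarchical groups map naturally onto per-camera streams, which is why it is the de facto container for robot demonstration data.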
Policy Training
RoboTwin 2.0 supports the most popular policy architectures in robot learning:
ACT (Action Chunking with Transformers)
python train.py \
--policy act \
--task move_can_to_box \
--robot franka \
--data_path data/move_can_to_box_franka/ \
--num_epochs 100 \
--chunk_size 100
Diffusion Policy (DP/DP3)
python train.py \
--policy dp3 \
--task move_can_to_box \
--robot franka \
--data_path data/move_can_to_box_franka/
VLA Models (RDT, Pi0)
For VLA (Vision-Language-Action) models, you can fine-tune a pretrained model on RoboTwin data:
# Fine-tune RDT with synthetic pretraining + real demos
python train_vla.py \
--model rdt \
--pretrained_path checkpoints/rdt-base \
--task move_can_to_box \
--synthetic_data data/move_can_to_box_franka/randomized/ \
--real_demos data/real/10_demos/
This is the configuration that produced the paper's most impressive results.
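The synthetic-plus-real recipe boils down to mixing two datasets of very different sizes during fine-tuning. A minimal co-training sampler, with an assumed (not the paper's) 90/10 mixing ratio, could look like:

```python
import random

# Sketch of a co-training batch sampler: a large synthetic set plus a few
# real demos. The real_fraction value is an illustrative choice.
def mixed_batch(synthetic, real, batch_size=8, real_fraction=0.1, rng=random):
    n_real = max(1, int(batch_size * real_fraction))     # guarantee >= 1 real demo
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch

synthetic = [f"sim_{i}" for i in range(1000)]
real = [f"real_{i}" for i in range(10)]
batch = mixed_batch(synthetic, real)
```

Oversampling the small real set keeps its gradient signal from being drowned out by the 100x larger synthetic pool.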
Experimental Results
Results on unseen scenes (environments not seen during training):
| Method | Success Rate | vs. Baseline |
|---|---|---|
| Baseline (10 real demos only) | 9.0% | — |
| Synthetic pretraining + 10 real demos | 42.0% | +367% |
| Zero-shot (synthetic only, no real demos) | 29.6% | +228% |
These numbers are striking. In particular, zero-shot — training entirely on synthetic data without any real demonstrations — still achieves a 228% relative improvement over the real-demo baseline. This opens up the possibility of deploying manipulation robots without an expensive teleoperation phase.
Results on real hardware (4 bimanual tasks):
- Augmenting 10 real demos with 1,000 synthetic trajectories → +13.5% to +33.0% success rate
- Consistent improvement across task difficulty levels, suggesting effective learning of task-relevant features
Embodiment-specific analysis:
- Piper (6-DoF): +22.7% (largest gain)
- Aloha-AgileX (6-DoF): +13.7%
- Franka/UR5: smaller gains, which the authors attribute to the kinematic flexibility these arms already have
Acknowledged Limitations
The authors are honest about what the framework cannot yet handle:
- Articulated objects remain challenging: "Open Laptop" and "Shake Bottle Horizontally" both achieve 0% success. Dynamic contacts and force control are beyond what the current position-control pipeline handles.
- Precise pose constraints: MLLM code generation struggles with tasks requiring sub-millimeter placement accuracy (e.g., inserting an object into a tight slot).
- Randomization performance drop: success rates still drop 28–40% when transitioning from clean to fully randomized environments — the domain gap is narrowed but not eliminated.
- Platform gap: 6-DoF robots benefit more than 7-DoF robots because of kinematic redundancy; a useful signal when selecting hardware for a specific task.
Integration Scenarios
Scenario 1: You have a real robot and want to augment your data → Collect 10–20 real demos → generate 500–1,000 synthetic trajectories with RoboTwin → fine-tune VLA model → deploy
Scenario 2: You want to prototype before committing to hardware → Use pre-collected 100K trajectories from HuggingFace → train policy in simulation → evaluate on RoboTwin benchmark → decide whether to invest in hardware
Scenario 3: You have a task not in the 50 pre-built ones → Use the MLLM code generation pipeline → write a task description → the system generates and refines control code → collect data automatically
Resources
- Paper: arXiv:2506.18088 — Tianxing Chen et al., ICLR 2026
- GitHub: RoboTwin-Platform/RoboTwin
- Documentation: robotwin-platform.github.io/doc
- Dataset: HuggingFace — RoboTwin2.0 (100K+ trajectories)
- OpenReview: ICLR 2026 submission