
RoboTwin 2.0: Complete Guide to Dual-Arm Manipulation Data

A comprehensive guide to RoboTwin 2.0 — the scalable synthetic data framework for bimanual robots, featuring MLLM code generation and strong domain randomization.

Nguyễn Anh Tuấn · April 21, 2026 · 8 min read

Teaching a robot to pick up a shoe, place a can into a box, or hand an object from one gripper to another might seem straightforward — but it demands enormous amounts of demonstration data across diverse environments. RoboTwin 2.0 was built to solve exactly that problem.

The paper RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation (arXiv:2506.18088, accepted at ICLR 2026) proposes a complete framework for automatically generating large-scale expert data for dual-arm robots, with strong domain randomization to bridge the sim-to-real gap.

Why Is Dual-Arm Manipulation So Hard?

Single-arm manipulation is already complex. Dual-arm manipulation multiplies that complexity: the state space doubles, both arms must coordinate in real time, and many tasks require simultaneous contact from both grippers. To open a box lid, for instance, one arm holds the box while the other turns the lid — two entirely different subtasks that must be perfectly synchronized.

Collecting such data through teleoperation is extremely labor-intensive. RoboTwin 2.0's answer: generate data automatically in simulation, then apply domain randomization so trained policies generalize to the real world.

Framework Architecture

RoboTwin 2.0 framework overview — from object dataset to expert data generation and policy training

The framework has three main layers:

1. RoboTwin-OD — 731-Instance Object Library

This is the foundation of the entire system. RoboTwin-OD comprises:

  • 731 object instances across 147 categories
  • 534 objects generated via RGB-to-3D reconstruction using the Rodin platform
  • 153 objects from Objaverse (open-source 3D library)
  • 44 articulated objects (boxes, bottles, laptops) from PartNet-Mobility

Each object is annotated with: placement points, functional points, grasping points, grasp axis directions, and 15 language descriptions covering shape, texture, functionality, and part structure. These annotations are what enable the MLLM code generation pipeline to work correctly.
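To make the annotation scheme concrete, here is a hypothetical sketch of what one object entry might contain. The field names and values are illustrative only, not RoboTwin-OD's actual schema.

# Hypothetical annotation entry for a "can" instance (illustrative field names)
can_annotation = {
    "category": "can",
    "placement_points": [[0.0, 0.0, 0.0]],      # stable poses for placing the object on a surface
    "functional_points": [[0.0, 0.0, 0.06]],    # task-relevant contact points (e.g., the opening)
    "grasp_points": [[0.0, 0.03, 0.04]],        # candidate gripper contact locations
    "grasp_axes": [[0.0, 0.0, 1.0]],            # approach/closing directions for the gripper
    "descriptions": [                           # a subset of the 15 language descriptions
        "a short aluminum soda can",
        "a cylindrical container with a pull tab on top",
    ],
}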

2. MLLM Code Generation — Automated Robot Control Code

Instead of manually writing control code for each task, RoboTwin 2.0 uses a Multimodal Large Language Model (MLLM) to automatically synthesize Python programs that control the robot.

The process is a closed-loop system:

[Task Description] → [Code Agent] → [Python Program]
                                           ↓
                              [Simulate × 10 trials]
                                           ↓
                              [VLM Observer analyzes results]
                                           ↓
                              [Failure diagnosis → feedback]
                                           ↓
                              [Code Agent refines code]
                               (up to 5 iterations)

The Code Agent uses DeepSeek-V3 or moonshot-v1-32k-vision to generate Python programs from task descriptions and SAPIEN simulator API specifications.

The VLM Observer watches all 10 simulation trials, detects failures (robot missed the object? both arms collided?), and returns language feedback to the Code Agent.
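A minimal sketch of this loop in Python, assuming hypothetical helpers code_agent, vlm_observer, and run_trials (illustrative stand-ins, not RoboTwin's actual API):

# Sketch of the generate -> simulate -> observe -> refine loop (hypothetical helpers)
def generate_task_code(task_description, code_agent, vlm_observer, run_trials,
                       max_iterations=5, trials=10):
    program = code_agent.write_program(task_description)  # MLLM drafts a Python program
    for _ in range(max_iterations):
        results = run_trials(program, n=trials)           # execute the trials in the SAPIEN simulator
        if all(r.success for r in results):
            return program                                # verified: ready for data collection
        feedback = vlm_observer.diagnose(results)         # language description of observed failures
        program = code_agent.refine(program, feedback)    # MLLM revises the program
    return program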

Result: the code generation success rate jumps from 47.4% to 71.3%, a roughly 50% relative improvement from the multimodal feedback loop alone.

3. Domain Randomization — 5 Systematic Axes

This is what makes synthetic data actually useful for real-world deployment.

The five randomization axes and how each is implemented:

  • Scene clutter: random placement of task-irrelevant objects with collision-aware positioning
  • Textures: 12,000 human-verified textures generated with Stable Diffusion, applied randomly to surfaces
  • Lighting: randomized color, type, intensity, and position of light sources within physical bounds
  • Tabletop height: uniform sampling within ±3 cm of the standard height, simulating real tables of varying heights
  • Language instructions: sentence templates combined with object descriptions, yielding diverse instruction phrasings per trajectory

Each randomized trajectory is varied across all five axes simultaneously, creating a massive variation space that forces the policy to learn genuinely generalizable features.
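As a rough illustration of per-trajectory sampling over these axes, here is a sketch; the function name, keys, and value ranges are assumptions, not the framework's actual configuration schema.

import random

# Illustrative per-trajectory randomization over the five axes (hypothetical schema)
def sample_randomization(textures, distractor_objects):
    return {
        "clutter": random.sample(distractor_objects,
                                 k=random.randint(0, min(5, len(distractor_objects)))),
        "table_texture": random.choice(textures),
        "lighting": {
            "color": [random.uniform(0.8, 1.0) for _ in range(3)],
            "intensity": random.uniform(0.5, 2.0),
            "position": [random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(1, 2)],
        },
        "table_height_offset_m": random.uniform(-0.03, 0.03),  # +/- 3 cm around standard height
        "instruction_template": random.choice([
            "move the {obj} into the {target}",
            "pick up the {obj} and place it in the {target}",
        ]),
    }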

Benchmark: 50 Tasks, 5 Robots

RoboTwin 2.0 standardizes evaluation across 50 bimanual manipulation tasks and 5 robot embodiments:

Supported robots:

  • Franka (7-DoF, research standard)
  • UR5 (7-DoF, industrial standard)
  • Piper (6-DoF, Agilex)
  • ARX-X5 (6-DoF)
  • Aloha-AgileX (6-DoF, ALOHA-style dual-arm platform from AgileX)

For each (task, embodiment) pair, the framework generates:

  • 100 clean trajectories (no domain randomization)
  • 400 randomized trajectories (full 5-axis randomization)

Over 100,000 expert trajectories are pre-collected and available for download from HuggingFace.
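If you want to mirror the released trajectories programmatically, the huggingface_hub client can download a full dataset repository. The repo id below is a placeholder; check the RoboTwin project page or repository for the actual dataset name.

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the actual RoboTwin 2.0 dataset repository
snapshot_download(
    repo_id="<robotwin-2.0-dataset-repo>",
    repo_type="dataset",
    local_dir="data/robotwin2_trajectories",
)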

Installation

System Requirements

  • OS: Linux (Ubuntu 20.04/22.04 recommended)
  • GPU: NVIDIA RTX (required for ray tracing and denoising)
  • CUDA: 12.1
  • NVIDIA Driver: ≥ 520

Step 1: Create Conda Environment

conda create -n RoboTwin python=3.10 -y
conda activate RoboTwin

Step 2: Install Vulkan Dependencies

sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
vulkaninfo  # verify installation

Step 3: Clone Repo and Install Dependencies

git clone https://github.com/RoboTwin-Platform/RoboTwin.git
cd RoboTwin
bash script/_install.sh

This script automatically installs SAPIEN (physics simulator), CuRobo (GPU-accelerated motion planner), and remaining dependencies.

If the automated installer fails, try the manual path:

pip install -r requirements.txt
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
# Install CuRobo in the envs/ directory
# Then fix mplib line 807: remove the "or collide" condition

Step 4: Download Assets

bash script/_download_assets.sh

This downloads all 3D models (731 objects), the texture library (12,000 textures), and embodiment configurations into the assets/ directory.

If you hit a config path error after installation:

python script/update_embodiment_config_path.py

Data Generation

Collecting Expert Trajectories for a Task

Once installed, you can run the code generation pipeline for any task:

# Generate code for a new task using the MLLM pipeline
python scripts/generate_task_code.py \
    --task "move_can_to_box" \
    --robot "franka" \
    --iterations 5

# Collect trajectories once code is verified
python scripts/collect_data.py \
    --task "move_can_to_box" \
    --robot "franka" \
    --num_clean 100 \
    --num_random 400

Output Data Structure

Each trajectory is saved as an HDF5 file containing the fields below (a quick inspection sketch follows the list):

  • Joint positions for both arms at each timestep
  • RGB images from multiple camera viewpoints
  • Depth images
  • Language instruction (randomly sampled from 15 annotations)
  • Domain randomization metadata
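A quick way to inspect one of these files is with h5py. The file path and the dataset keys mentioned in the comments are assumptions about naming; the actual layout may differ.

import h5py

# Print every group/dataset actually stored in one trajectory file
with h5py.File("data/move_can_to_box_franka/episode_0.hdf5", "r") as f:
    f.visit(print)
    # Fields you would expect from the list above (names hypothetical):
    # qpos = f["observations/qpos"][:]           # joint positions for both arms per timestep
    # rgb  = f["observations/images/head"][:]    # RGB frames from one camera viewpoint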

Policy Training

RoboTwin 2.0 supports the most popular policy architectures in robot learning:

ACT (Action Chunking with Transformers)

python train.py \
    --policy act \
    --task move_can_to_box \
    --robot franka \
    --data_path data/move_can_to_box_franka/ \
    --num_epochs 100 \
    --chunk_size 100

Diffusion Policy (DP/DP3)

python train.py \
    --policy dp3 \
    --task move_can_to_box \
    --robot franka \
    --data_path data/move_can_to_box_franka/

VLA Models (RDT, Pi0)

For VLA (Vision-Language-Action) models, you can fine-tune a pretrained model on RoboTwin data:

# Fine-tune RDT with synthetic pretraining + real demos
python train_vla.py \
    --model rdt \
    --pretrained_path checkpoints/rdt-base \
    --task move_can_to_box \
    --synthetic_data data/move_can_to_box_franka/randomized/ \
    --real_demos data/real/10_demos/

This is the configuration that produced the paper's most impressive results.

Experimental Results

Performance comparison across policy types on RoboTwin 2.0 benchmark — clean data, randomized data, and zero-shot synthetic-only

Results on unseen scenes (environments not seen during training):

  • Baseline (10 real demos only): 9.0% success
  • Synthetic pretraining + 10 real demos: 42.0% success (+367% relative to baseline)
  • Zero-shot (synthetic only, no real demos): 29.6% success (+228% relative to baseline)

These numbers are striking. In particular, zero-shot — training entirely on synthetic data without any real demonstrations — still achieves a 228% relative improvement over the real-demo baseline. This opens up the possibility of deploying manipulation robots without an expensive teleoperation phase.

Results on real hardware (4 bimanual tasks):

  • Augmenting 10 real demos with 1,000 synthetic trajectories → +13.5% to +33.0% success rate
  • Consistent improvement across task difficulty levels, suggesting effective learning of task-relevant features

Embodiment-specific analysis:

  • Piper (6-DoF): +22.7% (largest gain)
  • Aloha-AgileX (6-DoF): +13.7%
  • Franka/UR5 (7-DoF): smaller gains due to existing kinematic flexibility from the extra DoF

Acknowledged Limitations

The authors are honest about what the framework cannot yet handle:

  1. Articulated objects remain challenging: "Open Laptop" and "Shake Bottle Horizontally" both achieve 0% success. Dynamic contacts and force control are beyond what the current position-control pipeline handles.

  2. Precise pose constraints: MLLM code generation struggles with tasks requiring sub-millimeter placement accuracy (e.g., inserting an object into a tight slot).

  3. Randomization performance drop: Success rate still drops 28–40% when transitioning from clean to fully randomized environments — the domain gap is narrowed but not eliminated.

  4. Platform gap: 6-DoF robots benefit more than 7-DoF robots because of kinematic redundancy. A useful signal when selecting hardware for a specific task.

Integration Scenarios

Scenario 1: You have a real robot and want to augment your data → Collect 10–20 real demos → generate 500–1,000 synthetic trajectories with RoboTwin → fine-tune VLA model → deploy

Scenario 2: You want to prototype before committing to hardware → Use pre-collected 100K trajectories from HuggingFace → train policy in simulation → evaluate on RoboTwin benchmark → decide whether to invest in hardware

Scenario 3: You have a task not in the 50 pre-built ones → Use the MLLM code generation pipeline → write a task description → the system generates and refines control code → collect data automatically

Resources

  • Paper: RoboTwin 2.0 (arXiv:2506.18088)
  • Code: https://github.com/RoboTwin-Platform/RoboTwin
  • Data: 100,000+ pre-collected expert trajectories, available on HuggingFace
