Why Two Arms?
Many everyday tasks require two arms: opening jars, pouring water, picking up food, packing boxes. Humans coordinate both arms naturally: one holds while the other manipulates, or both perform the same motion together.
Bimanual manipulation brings the same idea to robots: two arms working simultaneously with precise coordination. But the complexity multiplies: 14 DOF (2 x (6-DOF arm + 1-DOF gripper)) instead of 7, a doubled action space, and the need to keep the arms from colliding with each other.
Previous posts covered grasping, imitation learning, diffusion policy, VLA, and dexterous hands. This post focuses on bimanual — hardware, data collection, and training methods.
ALOHA: Hardware Platform
ALOHA Original (2023)
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford (Tony Zhao, Chelsea Finn) transformed bimanual manipulation research:
Design:
- 4 robot arms: 2 leaders (moved by the human) + 2 followers (which mirror the leaders; their trajectories are recorded as training data)
- Dynamixel servos: XM430-W350 and XM540-W270
- 6 DOF per arm + 1 DOF gripper = 7 DOF x 2 = 14 DOF total
- 4 cameras: 2 top-down + 2 wrist-mounted
- Price: ~20,000 USD (roughly 10x cheaper than a commercial bimanual setup)
Leader-follower teleoperation: the human moves the 2 leader arms and the 2 follower arms copy the motion exactly. Natural, fast, high-quality data.
Why Did ALOHA Succeed?
- Low-cost: enables many labs to do bimanual research
- High-quality data: leader-follower more natural than joystick
- Open-source: CAD files, firmware, software all public
- ACT integration: train policy directly from ALOHA data with ACT
Mobile ALOHA (2024)
Mobile ALOHA (Fu et al., 2024) adds mobile base (AgileX Tracer) to ALOHA:
- Whole-body teleoperation: the operator drives the base and both arms at the same time
- Price: ~32,000 USD (includes mobile base + compute)
- New tasks: cooking (stir-fry shrimp, wash pan), opening cabinets, taking elevator
- Co-training: combining data from static ALOHA (immobile) and Mobile ALOHA pushes success rates up to 90%
Mobile ALOHA architecture:
```
Mobile base (AgileX Tracer)
├── Left arm (6-DOF + gripper)
├── Right arm (6-DOF + gripper)
├── Top camera (global view)
├── Left wrist camera
├── Right wrist camera
└── Onboard compute (laptop)
```
Action space: [left_arm(7), right_arm(7), base_vel(2)] = 16 DOF
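The 16-D action vector above can be sketched as slices of a single array. A minimal illustration (the slice names are my own, not LeRobot's actual feature keys):

```python
import numpy as np

# Illustrative layout of Mobile ALOHA's 16-D action vector
# (slice names are assumptions, not LeRobot's actual keys)
action = np.zeros(16)
left_arm = action[0:7]     # 6 joint targets + 1 gripper command
right_arm = action[7:14]   # 6 joint targets + 1 gripper command
base_vel = action[14:16]   # linear + angular base velocity

assert left_arm.size + right_arm.size + base_vel.size == 16
```

Because all 16 dimensions come out of one policy head, arm-arm and arm-base coordination is learned jointly rather than stitched together from separate controllers.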
ACT for Bimanual Tasks
Why Is ACT a Good Fit for Bimanual?
ACT (from Part 2) is especially suited for bimanual because:
- Action chunking: bimanual tasks need precise coordination of both arms at the same time. Predicting whole chunks keeps the two arms synchronized over the chunk horizon.
- CVAE: when there are multiple valid ways to coordinate the arms (left holds while right rotates, or vice versa), the CVAE captures this diversity.
- Data efficiency: roughly 50 demos per bimanual task can suffice, which matters because collecting bimanual data takes more effort than single-arm.
Training Pipeline
```bash
# Train ACT for a bimanual task with LeRobot
python -m lerobot.scripts.train \
  --policy.type=act \
  --env.type=aloha \
  --env.task=AlohaInsertion-v0 \
  --dataset.repo_id=lerobot/aloha_sim_insertion_human \
  --training.num_epochs=2000 \
  --training.batch_size=8 \
  --policy.chunk_size=100 \
  --policy.kl_weight=10 \
  --policy.temporal_agg=true
```
Critical Hyperparameters for Bimanual
```yaml
policy:
  chunk_size: 100        # Larger than single-arm (50-100 vs 20-50);
                         # bimanual tasks usually run longer
  kl_weight: 10          # Higher than the default (10 vs 1) so the CVAE
                         # learns diverse coordination modes
  temporal_agg: true     # Mandatory for smooth bimanual coordination
  dim_feedforward: 3200  # Larger (3200 vs 2048) since the action space is bigger
  n_heads: 8             # More heads to capture cross-arm correlations
```
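What `temporal_agg` does: at each timestep, every previously predicted chunk that covers the current step votes, and the votes are blended with exponential weights w_i = exp(-m·i) (m = 0.01 in the ACT paper, with the oldest prediction weighted highest). A minimal numpy sketch of that blending step:

```python
import numpy as np

# Temporal aggregation as in the ACT paper: overlapping chunk predictions
# for one timestep are averaged with weights w_i = exp(-m * i), where
# i = 0 is the oldest prediction (so older predictions weigh slightly more).
def temporal_ensemble(predictions, m=0.01):
    preds = np.asarray(predictions, dtype=float)  # (n, action_dim), oldest first
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()
    return (w[:, None] * preds).sum(axis=0)

# Three chunks predicted this timestep's 14-D action:
preds = np.stack([np.full(14, v) for v in (0.0, 1.0, 2.0)])
blended = temporal_ensemble(preds)
assert blended.shape == (14,)
```

This smoothing is what keeps both arms from jerking at chunk boundaries, which would break a coordinated lift.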
Data Collection for Bimanual
Setup
Camera placement for bimanual:
```
          [Top camera] — looking down at workspace
               |
[Left wrist cam]     [Right wrist cam]
       |                    |
  [Left arm]           [Right arm]
         \                /
          [Workspace]
```
Minimum 3 cameras: 1 top-down (global context) + 2 wrist-mounted (per-arm detail). Budget permitting, add a front-facing camera.
Tips for Collecting Bimanual Data
- Start simple: do a handover task (left hand passes an object to the right) before complex tasks. Reach 80% success on handover first.
- Consistency is critical: when collecting 50 bimanual demos, you MUST be consistent:
  - Always use the same arm first
  - Follow the same sequence of steps
  - Keep the same speed
  Inconsistency confuses the policy.
- Pause = failure: never pause mid-episode. If you make a mistake, restart. ALOHA's software usually has a reset button.
- Vary initial conditions: change object positions between demos, but don't change the manipulation sequence.
- 50 demos is enough with ACT: more doesn't guarantee better (risk of overfitting to noise). Quality > quantity.
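The "pause = failure" rule can also be checked automatically after recording. A minimal sketch that flags long near-still runs in a joint-position trajectory (the thresholds are assumptions to tune for your fps and servo noise, not LeRobot functionality):

```python
import numpy as np

def find_pauses(joint_traj, vel_eps=1e-3, min_len=25):
    """Flag runs where all joints are nearly still for >= min_len steps.

    joint_traj: (T, dof) recorded joint positions at a fixed fps.
    Returns a list of (start, end) index pairs for suspected pauses.
    """
    speed = np.abs(np.diff(joint_traj, axis=0)).max(axis=1)  # (T-1,)
    still = speed < vel_eps
    pauses, start = [], None
    for i, s in enumerate(still):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_len:
                pauses.append((start, i))
            start = None
    if start is not None and len(still) - start >= min_len:
        pauses.append((start, len(still)))
    return pauses

# A 100-step episode that stalls between steps 40 and 80:
traj = np.cumsum(np.ones((100, 14)) * 0.01, axis=0)
traj[40:80] = traj[40]
assert find_pauses(traj) == [(40, 79)]
```

Episodes with detected pauses can then be re-recorded instead of silently polluting the dataset.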
LeRobot SO-100 Dual Arm
Low-Cost Bimanual for Everyone
If ALOHA (~$20K) is too expensive, the LeRobot SO-100 from Hugging Face is an alternative:
- Price: ~600 USD for a dual-arm setup (2 x SO-100)
- 5 DOF per arm + 1 DOF gripper = 12 DOF total
- Feetech STS3215 servos: cheap but accurate enough
- Leader-follower: same principle as ALOHA, at a smaller scale
- LeRobot integration: plug-and-play with ACT and Diffusion Policy
Setup SO-100 Dual Arm
```bash
# 1. Assemble 4 arms (2 leaders + 2 followers)
#    per the instructions at: https://github.com/huggingface/lerobot

# 2. Calibrate
python -m lerobot.scripts.calibrate \
  --robot.type=so100 \
  --robot.arms='["left_leader", "left_follower", "right_leader", "right_follower"]'

# 3. Teleoperate and record
python -m lerobot.scripts.record \
  --robot.type=so100 \
  --fps=50 \
  --repo-id=my_bimanual_dataset \
  --num-episodes=50 \
  --task="bimanual_handover"

# 4. Train ACT
python -m lerobot.scripts.train \
  --policy.type=act \
  --dataset.repo_id=my_bimanual_dataset \
  --training.num_epochs=2000
```
SO-100 Dual Limitations
- 5 DOF per arm (one fewer than ALOHA's 6): limits reachable wrist orientations and workspace
- Low torque: can't pick heavy objects (>500g)
- No wrist camera mount (need 3D-printed adapter)
- Small workspace: good for tabletop, not mobile
Diffusion Policy vs ACT for Bimanual
| Criterion | ACT | Diffusion Policy |
|---|---|---|
| Bimanual coordination | Good (CVAE captures modes) | Excellent (full distribution) |
| Data needed | 50 demos | 50-100 demos |
| Training time | 2-4h | 6-12h |
| Inference speed | ~5ms (fast enough) | ~15ms (still OK) |
| Long-horizon bimanual | Good | Better |
| Implementation | LeRobot built-in | LeRobot built-in |
| Recommendation | Default for bimanual | When ACT struggles |
Choose ACT first: it is more data-efficient, trains faster, and was designed for bimanual tasks (the ALOHA paper). Switch to Diffusion Policy only if ACT's performance plateaus.
Advanced: Co-Training
Idea
Co-training is Mobile ALOHA's power move: train together on data from many tasks/setups:
```
Dataset = Static ALOHA data (tasks A, B, C)
        + Mobile ALOHA data (task D)
        + SO-100 data (task E)
Policy  = ACT trained on all of it
```
Result: positive transfer — policy learns from many tasks, generalizes better than task-specific policy. Mobile ALOHA achieved 90% success via co-training vs 50% training separately.
Implement Co-Training
```python
# Co-training with LeRobot (simplified; the MultiLeRobotDataset API may
# differ between LeRobot versions)
from lerobot.common.datasets.lerobot_dataset import MultiLeRobotDataset

# Load several datasets as one; batches are drawn across all of them
dataset = MultiLeRobotDataset([
    "lerobot/aloha_sim_transfer_cube_human",
    "lerobot/aloha_sim_insertion_human",
    "my_custom_bimanual_data",
])
```
LeRobot supports multi-dataset training natively; check your version's docs for how to pass multiple repo IDs on the CLI (simply repeating `--dataset.repo_id` may keep only the last value).
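One practical wrinkle: when the datasets differ greatly in size, uniform sampling lets the largest one dominate. A common mitigation (illustrative; not necessarily what Mobile ALOHA did, which mixed static and mobile data roughly evenly) is to sample datasets with probability proportional to size^alpha:

```python
import numpy as np

# Sketch: sampling datasets with probability proportional to size**alpha.
# alpha < 1 flattens the mix so small datasets are not drowned out.
# The alpha value and sizes below are illustrative.
def dataset_mix(sizes, alpha=0.5):
    p = np.asarray(sizes, dtype=float) ** alpha
    return p / p.sum()

mix = dataset_mix([5000, 500, 50])  # e.g. static ALOHA, Mobile ALOHA, SO-100
assert abs(mix.sum() - 1.0) < 1e-9
```

With alpha = 0.5 the smallest dataset's share rises from under 1% (uniform over episodes) to about 7% of sampled batches.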
Bimanual Manipulation Challenges
1. Collision Avoidance Between Arms
Two arms share one workspace, creating collision risk. Current approaches:
- Implicit avoidance: policy learns from data (no collisions in demos, policy avoids too)
- Explicit constraints: add penalty when arms too close during training
- Workspace partitioning: divide workspace to left/right regions
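The "explicit constraints" option can be as simple as a hinge penalty on end-effector distance, added to the imitation loss. A sketch (the `d_min` and `weight` values are illustrative, and this term is not part of stock ACT):

```python
import numpy as np

def proximity_penalty(left_ee, right_ee, d_min=0.05, weight=10.0):
    """Hinge penalty that grows quadratically as the two end-effectors
    come closer than d_min meters. Zero when they are far apart."""
    d = np.linalg.norm(np.asarray(left_ee) - np.asarray(right_ee))
    return weight * max(0.0, d_min - d) ** 2

# Far apart (0.4 m): no penalty. Too close (0.02 m): positive penalty.
assert proximity_penalty([0.0, 0.2, 0.1], [0.0, -0.2, 0.1]) == 0.0
assert proximity_penalty([0.0, 0.01, 0.1], [0.0, -0.01, 0.1]) > 0.0
```

In practice this requires forward kinematics to get end-effector positions from predicted joint targets, and the penalty weight must be tuned so it does not override task-relevant close-proximity motions like handovers.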
2. Asymmetric Roles
Many tasks have asymmetric roles: the left arm holds (passive) while the right arm manipulates (active). The policy must learn this role assignment; it emerges naturally from the data when the demonstrator always uses the same arm for the same role, which is another reason demos must be consistent.
3. Temporal Coordination
Some actions need tight synchronization: two arms lifting an object together must lift at the same time, or the object drops. ACT's action chunking helps because each chunk predicts both arms' actions jointly.
4. Scale Up
14 DOF (ALOHA) is already hard; two arms with dexterous hands (e.g., 2 x Shadow Hand, well over 40 DOF combined) are nightmare territory. There is currently no robust solution for bimanual dexterous manipulation; it remains an open research problem.
Next in Series
- Part 7: Building Manipulation Systems with LeRobot — End-to-end: setup, record, train, deploy
Related Articles
- Dexterous Manipulation: Teaching Robot Hands — Part 5 of this series
- Imitation Learning for Manipulation: BC, DAgger, ACT — ACT fundamentals
- ACT: Action Chunking with Transformers Deep Dive — Detailed architecture
- Building Manipulation Systems with LeRobot — Part 7 of this series