
Bimanual Manipulation: Teaching Robots to Use Both Arms

ALOHA hardware, Mobile ALOHA, ACT for bimanual tasks, data collection tips, and the LeRobot SO-100 dual arm — a complete guide to bimanual manipulation.

Nguyen Anh Tuan · March 22, 2026 · 7 min read

Why Two Arms?

Many everyday tasks require two arms: opening jars, pouring water, picking up food, packing boxes. Humans coordinate both arms — one holds while the other manipulates, or both perform the same action together.

Bimanual manipulation for robots is similar: 2 robot arms working simultaneously with precise coordination. But the complexity multiplies — 14 DOF (2 x 6-DOF arms + 2 grippers) instead of 7, the action space doubles, and the arms must be kept from colliding.

Previous posts covered grasping, imitation learning, diffusion policy, VLA, and dexterous hands. This post focuses on bimanual manipulation — hardware, data collection, and training methods.

Bimanual robot manipulation — 2 arms coordinated for complex tasks

ALOHA: Hardware Platform

ALOHA Original (2023)

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford (Tony Zhao, Chelsea Finn) transformed bimanual manipulation research:

Design:

  • 4 robot arms: 2 leader arms (held by the human) + 2 follower arms (execute and record)
  • Dynamixel servos: XM430-W350 and XM540-W270
  • 6 DOF per arm + 1 DOF gripper = 7 DOF x 2 = 14 DOF total
  • 4 cameras: 2 top-down + 2 wrist-mounted
  • Price: ~20,000 USD (10x cheaper than commercial bimanual setup)

Leader-follower teleoperation: the human holds the 2 leader arms, and the 2 follower arms copy the movements exactly. Natural, fast, high-quality data.

Why Did ALOHA Succeed?

  1. Low-cost: enables many labs to do bimanual research
  2. High-quality data: leader-follower more natural than joystick
  3. Open-source: CAD files, firmware, software all public
  4. ACT integration: train policy directly from ALOHA data with ACT

Mobile ALOHA (2024)

Mobile ALOHA (Fu et al., 2024) adds mobile base (AgileX Tracer) to ALOHA:

  • Whole-body teleoperation: human moves + controls both arms simultaneously
  • Price: ~32,000 USD (includes mobile base + compute)
  • New tasks: cooking (stir-fry shrimp, wash pan), opening cabinets, taking elevator
  • Co-training: data from static ALOHA (immobile) + Mobile ALOHA -> success rates up to 90%

Mobile ALOHA architecture:
  Mobile base (AgileX Tracer)
    ├── Left arm (6-DOF + gripper)
    ├── Right arm (6-DOF + gripper)
    ├── Top camera (global view)
    ├── Left wrist camera
    ├── Right wrist camera
    └── Onboard compute (laptop)

Action space: [left_arm(7), right_arm(7), base_vel(2)] = 16 DOF
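The flat 16-D action can be split into named sub-actions by index; a minimal sketch, assuming the layout above (the slice names are illustrative, not part of the LeRobot API):

```python
import numpy as np

# Hypothetical index layout for the 16-D Mobile ALOHA action vector
LEFT_ARM = slice(0, 7)    # 6 joints + 1 gripper
RIGHT_ARM = slice(7, 14)  # 6 joints + 1 gripper
BASE_VEL = slice(14, 16)  # linear + angular base velocity

def split_action(action: np.ndarray) -> dict:
    """Split a flat 16-D action into named sub-actions."""
    assert action.shape == (16,)
    return {
        "left_arm": action[LEFT_ARM],
        "right_arm": action[RIGHT_ARM],
        "base_vel": action[BASE_VEL],
    }

action = np.arange(16, dtype=np.float32)
parts = split_action(action)
print(parts["left_arm"].shape, parts["right_arm"].shape, parts["base_vel"].shape)
# (7,) (7,) (2,)
```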

ACT for Bimanual Tasks

Why Is ACT Perfect for Bimanual?

ACT (from Part 2) is especially suited for bimanual because:

  1. Action chunking: bimanual tasks need precise coordination of the 2 arms at the same time. Predicting chunks keeps both arms synchronized.

  2. CVAE: when there are multiple ways to coordinate the arms (left holds + right rotates, or vice versa), the CVAE captures this diversity.

  3. Data-efficient: only ~50 demos per bimanual task — important, since collecting bimanual data takes more effort than single-arm.
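To make point 1 concrete, here is a simplified sketch of ACT-style temporal aggregation: each control step blends the overlapping predictions that successive chunks made for the current timestep, with exponentially decaying weights (a loose sketch of the scheme in the ACT paper; `m` is the decay hyperparameter):

```python
import numpy as np

def temporal_ensemble(chunk_preds: list, m: float = 0.01) -> np.ndarray:
    """Blend overlapping chunk predictions for the current timestep.

    chunk_preds[0] is the prediction from the oldest chunk (highest
    weight), matching the exp(-m * i) weighting in the ACT paper.
    """
    preds = np.stack(chunk_preds)                      # (k, action_dim)
    weights = np.exp(-m * np.arange(len(chunk_preds)))
    weights /= weights.sum()                           # normalize to sum to 1
    return weights @ preds                             # weighted average

# Three overlapping predictions of the same 14-D bimanual action
preds = [np.full(14, 1.0), np.full(14, 0.9), np.full(14, 0.8)]
smoothed = temporal_ensemble(preds)
print(smoothed.shape)  # (14,)
```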

Training Pipeline

# Train ACT for bimanual task with LeRobot
python -m lerobot.scripts.train \
    --policy.type=act \
    --env.type=aloha \
    --env.task=AlohaInsertion-v0 \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --training.num_epochs=2000 \
    --training.batch_size=8 \
    --policy.chunk_size=100 \
    --policy.kl_weight=10 \
    --policy.temporal_agg=true

Critical Hyperparameters for Bimanual

policy:
  chunk_size: 100        # Larger than for single arm (50-100 vs 20-50);
                         # bimanual tasks are usually longer
  kl_weight: 10          # Higher than default (10 vs 1)
                          # So CVAE learns diverse modes better
  temporal_agg: true     # Mandatory for smooth bimanual coordination
  dim_feedforward: 3200  # Larger (3200 vs 2048) since action space bigger
  n_heads: 8             # More heads to capture cross-arm correlations

Data Collection for Bimanual

Setup

Camera placement for bimanual:
  [Top camera] — looking down at workspace
        |
  [Left wrist cam] [Right wrist cam]
        |                |
   [Left arm]       [Right arm]
        \              /
         [Workspace]

Minimum 3 cameras: 1 top-down (global context) + 2 wrist cameras (detail for each arm). Budget permitting, add a front-facing camera.
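Whichever rig is used, the three streams should be timestamp-aligned before training. A quick sanity check on per-camera timestamps (synthetic data for illustration):

```python
import numpy as np

# Per-camera frame timestamps in seconds (synthetic example data)
timestamps = {
    "top": np.array([0.000, 0.020, 0.040]),
    "left_wrist": np.array([0.001, 0.021, 0.041]),
    "right_wrist": np.array([0.002, 0.019, 0.042]),
}

def max_desync(ts: dict) -> float:
    """Largest spread between cameras at any frame index."""
    stacked = np.stack(list(ts.values()))            # (n_cams, n_frames)
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())

print(max_desync(timestamps))  # worst-case spread, here ~2 ms
```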

Tips for Collecting Bimanual Data

  1. Start simple: handover task (left hand passes to right) before complex tasks. Achieve 80% success on handover first.

  2. Consistency is critical: when collecting 50 bimanual demos, you MUST be consistent:

    • Always use the same arm first
    • Same sequence of steps
    • Same speed

    Inconsistency confuses the policy.
  3. Pause = failure: never pause mid-episode. If you make a mistake, restart. ALOHA software usually has a reset button.

  4. Vary initial conditions: change object positions between demos, but don't change manipulation sequence.

  5. 50 demos is enough with ACT: more demos don't guarantee better results (risk of overfitting to noise). Quality > quantity.
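Tip 2 ("always use the same arm first") can be checked automatically. A sketch that detects which arm moves first in each recorded joint trajectory, assuming a (T, 14) array with columns 0-6 for the left arm and 7-13 for the right:

```python
import numpy as np

def first_move_time(deltas: np.ndarray, thresh: float) -> float:
    """Index of the first step where any joint's change exceeds thresh."""
    idx = np.flatnonzero(deltas.max(axis=1) > thresh)
    return float(idx[0]) if idx.size else np.inf

def first_moving_arm(qpos: np.ndarray, thresh: float = 0.01) -> str:
    """Which arm moves first in a (T, 14) joint trajectory?

    Columns 0-6 = left arm, 7-13 = right arm (illustrative layout).
    """
    delta = np.abs(np.diff(qpos, axis=0))            # per-step joint change
    left_t = first_move_time(delta[:, :7], thresh)
    right_t = first_move_time(delta[:, 7:], thresh)
    return "left" if left_t < right_t else "right"

# Synthetic demo: the right arm starts moving before the left
qpos = np.zeros((10, 14))
qpos[3:, 8] = 0.5   # a right-arm joint moves at step 3
qpos[6:, 1] = 0.5   # a left-arm joint moves at step 6
print(first_moving_arm(qpos))  # right
```

Running this over all 50 demos and asserting they agree is a cheap way to catch inconsistent episodes before training.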

Data collection with bimanual teleoperation — leader-follower setup

LeRobot SO-100 Dual Arm

Low-Cost Bimanual for Everyone

If ALOHA (~$20K) is too expensive, the LeRobot SO-100 from Hugging Face is an alternative:

  • Price: ~600 USD for dual arm (2 x SO-100)
  • 5 DOF per arm + 1 DOF gripper = 12 DOF total
  • Dynamixel STS3215 servos: cheap but accurate enough
  • Leader-follower: same as ALOHA, at a smaller scale
  • LeRobot integration: plug-and-play with ACT and Diffusion Policy

Setup SO-100 Dual Arm

# 1. Assemble 4 arms (2 leader + 2 follower)
# Per instructions at: https://github.com/huggingface/lerobot

# 2. Calibrate
python -m lerobot.scripts.calibrate \
    --robot.type=so100 \
    --robot.arms='["left_leader", "left_follower", "right_leader", "right_follower"]'

# 3. Teleoperate and record
python -m lerobot.scripts.record \
    --robot.type=so100 \
    --fps=50 \
    --repo-id=my_bimanual_dataset \
    --num-episodes=50 \
    --task="bimanual_handover"

# 4. Train ACT
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=my_bimanual_dataset \
    --training.num_epochs=2000

SO-100 Dual Limitations

  • 5 DOF (one fewer than ALOHA's 6) — limited workspace
  • Low torque: can't pick heavy objects (>500g)
  • No wrist camera mount (need 3D-printed adapter)
  • Small workspace: good for tabletop, not mobile

Diffusion Policy vs ACT for Bimanual

Criterion ACT Diffusion Policy
Bimanual coordination Good (CVAE captures modes) Excellent (full distribution)
Data needed 50 demos 50-100 demos
Training time 2-4h 6-12h
Inference speed ~5ms (fast enough) ~15ms (still OK)
Long-horizon bimanual Good Better
Implementation LeRobot built-in LeRobot built-in
Recommendation Default for bimanual When ACT struggles

Choose ACT first because it is more data-efficient, trains faster, and was designed for bimanual tasks (the ALOHA paper). Switch to Diffusion Policy only if ACT performance plateaus.

Advanced: Co-Training

Idea

Co-training is Mobile ALOHA's power move: train together on data from many tasks/setups:

Dataset = Static ALOHA data (task A, B, C)
        + Mobile ALOHA data (task D)
        + SO-100 data (task E)

Policy = ACT trained on all data

Result: positive transfer — the policy learns from many tasks and generalizes better than a task-specific policy. Mobile ALOHA achieved 90% success via co-training vs 50% when training separately.

Implement Co-Training

# Co-training with LeRobot (simplified)
from lerobot.common.datasets.lerobot_dataset import MultiLeRobotDataset

# Load several datasets as one merged training set
dataset = MultiLeRobotDataset([
    "lerobot/aloha_sim_transfer_cube_human",
    "lerobot/aloha_sim_insertion_human",
    "my_custom_bimanual_data",
])

# Train ACT on the merged data (the exact multi-dataset CLI syntax
# depends on your LeRobot version; repeating --dataset.repo_id
# typically keeps only the last value)
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=lerobot/aloha_sim_transfer_cube_human \
    --training.num_epochs=3000

Bimanual Manipulation Challenges

1. Collision Avoidance Between Arms

The 2 arms share a workspace -> collision risk. Current solutions:

  • Implicit avoidance: the policy learns from data (no collisions in the demos, so the policy avoids them too)
  • Explicit constraints: add penalty when arms too close during training
  • Workspace partitioning: divide workspace to left/right regions
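A minimal sketch of the "explicit constraints" idea: a hinge penalty on pairwise distances between key points on the two arms (the point-based geometry and the 5 cm margin are simplifying assumptions; a real setup would use proper collision meshes):

```python
import numpy as np

def collision_penalty(left_pts: np.ndarray, right_pts: np.ndarray,
                      margin: float = 0.05) -> float:
    """Hinge penalty when any left-arm point comes within `margin`
    meters of any right-arm point (points = e.g. link positions
    from forward kinematics)."""
    # Pairwise distances between the two arms' key points
    d = np.linalg.norm(left_pts[:, None, :] - right_pts[None, :, :], axis=-1)
    # Penalize only pairs closer than the safety margin
    return float(np.clip(margin - d, 0.0, None).sum())

left = np.array([[0.0, 0.10, 0.3], [0.0, 0.20, 0.3]])
right = np.array([[0.0, 0.11, 0.3], [0.5, 0.20, 0.3]])
print(collision_penalty(left, right) > 0)  # True: one pair is 1 cm apart
```

During training, this term would be added (with a weight) to the imitation loss.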

2. Asymmetric Roles

Many tasks have asymmetric roles: the left arm holds (passive) while the right arm manipulates (active). The policy must learn this role assignment — it emerges naturally from the data (if humans always use the same arm), but it requires consistent demos.

3. Temporal Coordination

Some actions need tight synchronization: 2 arms lifting an object together must lift at the same time, or it drops. ACT's action chunking helps because it predicts both arms' actions simultaneously.

4. Scale Up

14 DOF (ALOHA) is already hard; 32 DOF (2 x Shadow Hand) is nightmare territory. There is currently no robust solution for bimanual dexterous manipulation — an open research problem.

Next in Series



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
