Tags: manipulation, bimanual-manipulation, ALOHA, mobile-aloha

Bimanual Manipulation: Teaching Robots to Use Both Arms

ALOHA hardware, Mobile ALOHA, ACT for bimanual tasks, data collection tips and LeRobot SO-100 dual arm — complete guide to bimanual manipulation.

Nguyen Anh Tuan · March 22, 2026 · 7 min read

Why Two Arms?

Many everyday tasks require two arms: opening jars, pouring water, picking up food, packing boxes. Humans coordinate their two arms naturally: one holds while the other manipulates, or both perform the same motion together.

Bimanual manipulation for robots is similar: two robot arms working simultaneously with precise coordination. But the complexity multiplies: 14 DOF (2 x 6-DOF arms + 2 grippers) instead of 7, the action space doubles, and the arms must be kept from colliding with each other.
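To make the 14-DOF action space concrete, here is a small sketch of how a bimanual action vector might be laid out. The layout and function names are illustrative assumptions, not an official ALOHA or LeRobot API:

```python
import numpy as np

# Illustrative layout: 6 joint targets + 1 gripper command per arm,
# 14 dims total. Not an official ALOHA/LeRobot API.
ARM_DOF = 6

def pack_bimanual_action(left_joints, left_grip, right_joints, right_grip):
    """Concatenate per-arm commands into one 14-dim action vector."""
    return np.concatenate([left_joints, [left_grip], right_joints, [right_grip]])

action = pack_bimanual_action(np.zeros(ARM_DOF), 0.0, np.zeros(ARM_DOF), 1.0)
print(action.shape)  # (14,)
```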

Previous posts covered grasping, imitation learning, diffusion policy, VLA, and dexterous hands. This post focuses on bimanual — hardware, data collection, and training methods.

Bimanual robot manipulation — 2 arms coordinated for complex tasks

ALOHA: Hardware Platform

ALOHA Original (2023)

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford (Tony Zhao, Chelsea Finn) transformed bimanual manipulation research:

Design:

Leader-follower teleoperation: a human moves the 2 leader arms and the 2 follower arms copy the movements exactly. This yields natural, fast, high-quality data.
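The leader-follower loop is conceptually tiny; the sketch below shows the control flow with placeholder read/write functions (real ALOHA/SO-100 drivers expose their own motor APIs):

```python
import time

# Minimal leader-follower teleop sketch. The read/write functions are
# placeholders, not a real servo driver API.
def read_leader_joints():
    return [0.1] * 7              # stand-in for reading leader servo positions

def write_follower_joints(q):
    write_follower_joints.last = list(q)  # stand-in for a position command

def teleop_step():
    q = read_leader_joints()      # human moves the leader arm
    write_follower_joints(q)      # follower mirrors the motion
    return q

for _ in range(3):                # real systems run this loop at ~50 Hz
    teleop_step()
    time.sleep(0.001)
```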

Why Did ALOHA Succeed?

  1. Low-cost: enables many labs to do bimanual research
  2. High-quality data: leader-follower more natural than joystick
  3. Open-source: CAD files, firmware, software all public
  4. ACT integration: train policy directly from ALOHA data with ACT

Mobile ALOHA (2024)

Mobile ALOHA (Fu et al., 2024) adds mobile base (AgileX Tracer) to ALOHA:

Mobile ALOHA architecture:
  Mobile base (AgileX Tracer)
    ├── Left arm (6-DOF + gripper)
    ├── Right arm (6-DOF + gripper)
    ├── Top camera (global view)
    ├── Left wrist camera
    ├── Right wrist camera
    └── Onboard compute (laptop)

Action space: [left_arm(7), right_arm(7), base_vel(2)] = 16 DOF
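Slicing that 16-dim vector back into its components looks like this; the layout follows the line above, but the index boundaries are my reading of it:

```python
import numpy as np

# Hypothetical Mobile ALOHA action layout:
# [left_arm(7), right_arm(7), base_vel(2)] = 16 dims.
action = np.arange(16, dtype=float)
left, right, base = action[:7], action[7:14], action[14:]
# base holds (linear, angular) velocity for the mobile base
print(left.shape, right.shape, base.shape)  # (7,) (7,) (2,)
```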

ACT for Bimanual Tasks

Why ACT Is Well Suited to Bimanual Tasks

ACT (from Part 2) is especially suited for bimanual because:

  1. Action chunking: bimanual tasks need precise coordination of both arms at the same time. Predicting whole chunks keeps the two arms synchronized.

  2. CVAE: when there are multiple ways to coordinate the arms (left holds + right rotates, or vice versa), the CVAE captures this diversity.

  3. Data efficient: ACT needs only ~50 demos per bimanual task — important, since collecting bimanual data takes more effort than single-arm data.
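The synchronization argument in point 1 can be sketched as an open-loop chunk-execution loop: one forward pass yields chunk_size actions for both arms, so their commands come from the same prediction. The `policy` stub below is a stand-in, not a real model:

```python
import numpy as np

# Open-loop chunk execution sketch. One forward pass returns
# (chunk_size, action_dim) actions covering BOTH arms, so left and
# right commands are inherently synchronized. `policy` is a stub.
def policy(obs):
    return np.zeros((100, 14))       # (chunk_size, action_dim)

def run_episode(n_steps=300):
    executed = 0
    while executed < n_steps:
        chunk = policy(obs=None)
        for action in chunk[: n_steps - executed]:
            executed += 1            # here: send `action` to both arms
    return executed
```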

Training Pipeline

# Train ACT for bimanual task with LeRobot
python -m lerobot.scripts.train \
    --policy.type=act \
    --env.type=aloha \
    --env.task=AlohaInsertion-v0 \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --training.num_epochs=2000 \
    --training.batch_size=8 \
    --policy.chunk_size=100 \
    --policy.kl_weight=10 \
    --policy.temporal_agg=true

Critical Hyperparameters for Bimanual

policy:
  chunk_size: 100        # Larger than single arm (50-100 vs 20-50)
                          # Bimanual tasks usually longer
  kl_weight: 10          # Higher than default (10 vs 1)
                          # So CVAE learns diverse modes better
  temporal_agg: true     # Mandatory for smooth bimanual coordination
  dim_feedforward: 3200  # Larger (3200 vs 2048) since action space bigger
  n_heads: 8             # More heads to capture cross-arm correlations

Data Collection for Bimanual

Setup

Camera placement for bimanual:
  [Top camera] — looking down at workspace
        |
  [Left wrist cam] [Right wrist cam]
        |                |
   [Left arm]       [Right arm]
        \              /
         [Workspace]

Use a minimum of 3 cameras: 1 top-down (global context) + 2 wrist cameras (per-arm detail). Budget permitting, add a front-facing camera.
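A single observation frame from this setup might look like the dictionary below. The key names mimic LeRobot's dataset conventions but are illustrative assumptions here:

```python
import numpy as np

# Sketch of one observation frame for a 3-camera bimanual setup.
# Key names are illustrative, LeRobot-style assumptions.
obs = {
    "observation.images.top": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.images.left_wrist": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.images.right_wrist": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.state": np.zeros(14, dtype=np.float32),  # both arms + grippers
}
n_cams = sum(k.startswith("observation.images") for k in obs)
print(n_cams)  # 3
```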

Tips for Collecting Bimanual Data

  1. Start simple: do a handover task (left hand passes to right) before complex tasks. Achieve 80% success on handover first.

  2. Consistency is critical: when collecting 50 bimanual demos, they MUST be consistent:

    • Always use the same arm first
    • Same sequence of steps
    • Same speed

    Inconsistency confuses the policy.

  3. Pause = failure: never pause mid-episode. If you make a mistake, restart. ALOHA software usually has a reset button.

  4. Vary initial conditions: change object positions between demos, but keep the manipulation sequence fixed.

  5. 50 demos is enough with ACT: more doesn't guarantee better (risk of overfitting to noise). Quality > quantity.
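Tip 3 can even be checked automatically after recording. The sketch below flags episodes where the arms barely move for too many consecutive steps; the thresholds are made-up values, not tuned recommendations:

```python
import numpy as np

# Illustrative post-hoc check for mid-demo pauses: flag an episode if
# the arms barely move for too many consecutive steps. Thresholds are
# made-up values, not tuned recommendations.
def has_pause(joint_traj, vel_eps=1e-3, max_still_steps=50):
    motion = np.abs(np.diff(joint_traj, axis=0)).max(axis=1)  # per-step motion
    still = 0
    for m in motion:
        still = still + 1 if m < vel_eps else 0
        if still >= max_still_steps:
            return True
    return False

frozen = np.zeros((200, 14))                       # arm never moves
moving = np.cumsum(np.full((200, 14), 0.01), 0)    # steady motion
print(has_pause(frozen), has_pause(moving))  # True False
```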

Data collection with bimanual teleoperation — leader-follower setup

LeRobot SO-100 Dual Arm

Low-Cost Bimanual for Everyone

If ALOHA (~$20K) is too expensive, the LeRobot SO-100 from Hugging Face is an alternative:

Setup SO-100 Dual Arm

# 1. Assemble 4 arms (2 leader + 2 follower)
# Per instructions at: https://github.com/huggingface/lerobot

# 2. Calibrate
python -m lerobot.scripts.calibrate \
    --robot.type=so100 \
    --robot.arms='["left_leader", "left_follower", "right_leader", "right_follower"]'

# 3. Teleoperate and record
python -m lerobot.scripts.record \
    --robot.type=so100 \
    --fps=50 \
    --repo-id=my_bimanual_dataset \
    --num-episodes=50 \
    --task="bimanual_handover"

# 4. Train ACT
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=my_bimanual_dataset \
    --training.num_epochs=2000

SO-100 Dual Limitations

Diffusion Policy vs ACT for Bimanual

Criterion              | ACT                        | Diffusion Policy
-----------------------|----------------------------|------------------------------
Bimanual coordination  | Good (CVAE captures modes) | Excellent (full distribution)
Data needed            | 50 demos                   | 50-100 demos
Training time          | 2-4 h                      | 6-12 h
Inference speed        | ~5 ms (fast enough)        | ~15 ms (still OK)
Long-horizon bimanual  | Good                       | Better
Implementation         | LeRobot built-in           | LeRobot built-in
Recommendation         | Default for bimanual       | When ACT struggles

Choose ACT first: it is more data-efficient, trains faster, and was designed for bimanual tasks (the ALOHA paper). Switch to Diffusion Policy only if ACT's performance plateaus.

Advanced: Co-Training

Idea

Co-training is Mobile ALOHA's power move: train one policy jointly on data from many tasks and setups:

Dataset = Static ALOHA data (task A, B, C)
        + Mobile ALOHA data (task D)
        + SO-100 data (task E)

Policy = ACT trained on all data

Result: positive transfer — the policy learns from many tasks and generalizes better than a task-specific policy. Mobile ALOHA achieved 90% success with co-training vs 50% when trained on each task separately.

Implement Co-Training

# Co-training with LeRobot (simplified)
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load multiple datasets
datasets = [
    LeRobotDataset("lerobot/aloha_sim_transfer_cube_human"),
    LeRobotDataset("lerobot/aloha_sim_insertion_human"),
    LeRobotDataset("my_custom_bimanual_data"),
]

# Merge and train from the CLI. LeRobot supports multi-dataset training;
# the exact flag syntax depends on the installed version:
#
#   python -m lerobot.scripts.train \
#       --policy.type=act \
#       --dataset.repo_id=lerobot/aloha_sim_transfer_cube_human \
#       --dataset.repo_id=lerobot/aloha_sim_insertion_human \
#       --training.num_epochs=3000

Bimanual Manipulation Challenges

1. Collision Avoidance Between Arms

The 2 arms share one workspace, so there is a constant risk of collision between them.

2. Asymmetric Roles

Many tasks have asymmetric roles: the left arm holds (passive) while the right arm manipulates (active). The policy must learn this role assignment — it emerges naturally from the data when demonstrators consistently use the same arm for the same role, which again requires consistency in the demos.

3. Temporal Coordination

Some actions need tight synchronization: two arms lifting an object together must lift at the same time, or the object drops. ACT's action chunking helps because it predicts both arms' actions in the same forward pass.
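A toy way to quantify that synchronization requirement: during a joint lift, the two end-effector height commands should track each other within a tolerance. The trajectories and tolerance below are illustrative:

```python
import numpy as np

# Toy synchronization check: during a joint lift, the two end-effector
# z-commands should stay within a tolerance of each other. Values and
# tolerance are illustrative.
def lift_synchronized(left_z, right_z, tol=0.01):
    gap = np.max(np.abs(np.asarray(left_z) - np.asarray(right_z)))
    return bool(gap < tol)

z = np.linspace(0.0, 0.3, 50)                # a 30 cm lift over 50 steps
print(lift_synchronized(z, z + 0.002))   # True: arms rise together
print(lift_synchronized(z, z * 0.5))     # False: right arm lags, object tips
```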

4. Scale Up

14 DOF (ALOHA) is already hard; a dual dexterous-hand setup (e.g., 2 x Shadow Hand) pushes past 32 DOF and is nightmare territory. There is currently no robust solution for bimanual dexterous manipulation — it remains an open research problem.
