
Bimanual Manipulation: Teaching Robots to Use Both Arms

ALOHA hardware, Mobile ALOHA, ACT for bimanual tasks, data collection tips, and the LeRobot SO-100 dual arm — a complete guide to bimanual manipulation.

Nguyen Anh Tuan · March 22, 2026 · 7 min read

Why Two Arms?

Many everyday tasks require two arms: opening jars, pouring water, picking up food, packing boxes. Humans coordinate both arms — one holds while the other manipulates, or both perform the same action together.

Bimanual manipulation for robots is similar: 2 robot arms working simultaneously with precise coordination. But the complexity multiplies — 14 DOF (2 x 6-DOF arms + 2 grippers) instead of 7, the action space doubles, and the arms must be kept from colliding.

Previous posts covered grasping, imitation learning, diffusion policy, VLA, and dexterous hands. This post focuses on bimanual manipulation — hardware, data collection, and training methods.

Bimanual robot manipulation — 2 arms coordinated for complex tasks

ALOHA: Hardware Platform

ALOHA Original (2023)

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford (Tony Zhao, Chelsea Finn) transformed bimanual manipulation research:

Design:

  • 4 robot arms: 2 leader arms (held by the human) + 2 follower arms (execute and record)
  • Dynamixel servos: XM430-W350 and XM540-W270
  • 6 DOF per arm + 1 DOF gripper = 7 DOF x 2 = 14 DOF total
  • 4 cameras: 2 top-down + 2 wrist-mounted
  • Price: ~20,000 USD (10x cheaper than commercial bimanual setup)

Leader-follower teleoperation: the human holds the 2 leader arms, and the 2 follower arms copy the movements exactly. Natural, fast, high-quality data.

Why Did ALOHA Succeed?

  1. Low-cost: enables many labs to do bimanual research
  2. High-quality data: leader-follower more natural than joystick
  3. Open-source: CAD files, firmware, software all public
  4. ACT integration: train policy directly from ALOHA data with ACT

Mobile ALOHA (2024)

Mobile ALOHA (Fu et al., 2024) adds mobile base (AgileX Tracer) to ALOHA:

  • Whole-body teleoperation: human moves + controls both arms simultaneously
  • Price: ~32,000 USD (includes mobile base + compute)
  • New tasks: cooking (stir-fry shrimp, wash pan), opening cabinets, taking elevator
  • Co-training: data from static ALOHA (immobile) + Mobile ALOHA -> success rates up to 90%

Mobile ALOHA architecture:
  Mobile base (AgileX Tracer)
    ├── Left arm (6-DOF + gripper)
    ├── Right arm (6-DOF + gripper)
    ├── Top camera (global view)
    ├── Left wrist camera
    ├── Right wrist camera
    └── Onboard compute (laptop)

Action space: [left_arm(7), right_arm(7), base_vel(2)] = 16 DOF
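The flat 16-D action can be split into named sub-actions by index; a minimal sketch, assuming the layout above (the slice names are illustrative, not part of the LeRobot API):

```python
import numpy as np

# Hypothetical index layout for the 16-D Mobile ALOHA action vector
LEFT_ARM = slice(0, 7)    # 6 joints + 1 gripper
RIGHT_ARM = slice(7, 14)  # 6 joints + 1 gripper
BASE_VEL = slice(14, 16)  # linear + angular base velocity

def split_action(action: np.ndarray) -> dict:
    """Split a flat 16-D action into named sub-actions."""
    assert action.shape == (16,)
    return {
        "left_arm": action[LEFT_ARM],
        "right_arm": action[RIGHT_ARM],
        "base_vel": action[BASE_VEL],
    }

action = np.arange(16, dtype=np.float32)
parts = split_action(action)
print(parts["left_arm"].shape, parts["right_arm"].shape, parts["base_vel"].shape)
# (7,) (7,) (2,)
```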

ACT for Bimanual Tasks

Why Is ACT Perfect for Bimanual?

ACT (from Part 2) is especially suited for bimanual because:

  1. Action chunking: bimanual tasks need precise coordination of the 2 arms at the same time. Predicting chunks keeps both arms synchronized.

  2. CVAE: when there are multiple ways to coordinate the arms (left holds + right rotates, or vice versa), the CVAE captures this diversity.

  3. Data-efficient: only ~50 demos per bimanual task — important, since collecting bimanual data takes more effort than single-arm.
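To make point 1 concrete, here is a simplified sketch of ACT-style temporal aggregation: each control step blends the overlapping predictions that successive chunks made for the current timestep, with exponentially decaying weights (a loose sketch of the scheme in the ACT paper; `m` is the decay hyperparameter):

```python
import numpy as np

def temporal_ensemble(chunk_preds: list, m: float = 0.01) -> np.ndarray:
    """Blend overlapping chunk predictions for the current timestep.

    chunk_preds[0] is the prediction from the oldest chunk (highest
    weight), matching the exp(-m * i) weighting in the ACT paper.
    """
    preds = np.stack(chunk_preds)                      # (k, action_dim)
    weights = np.exp(-m * np.arange(len(chunk_preds)))
    weights /= weights.sum()                           # normalize to sum to 1
    return weights @ preds                             # weighted average

# Three overlapping predictions of the same 14-D bimanual action
preds = [np.full(14, 1.0), np.full(14, 0.9), np.full(14, 0.8)]
smoothed = temporal_ensemble(preds)
print(smoothed.shape)  # (14,)
```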

Training Pipeline

# Train ACT for bimanual task with LeRobot
python -m lerobot.scripts.train \
    --policy.type=act \
    --env.type=aloha \
    --env.task=AlohaInsertion-v0 \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --training.num_epochs=2000 \
    --training.batch_size=8 \
    --policy.chunk_size=100 \
    --policy.kl_weight=10 \
    --policy.temporal_agg=true

Critical Hyperparameters for Bimanual

policy:
  chunk_size: 100        # Larger than for single arm (50-100 vs 20-50);
                         # bimanual tasks are usually longer
  kl_weight: 10          # Higher than default (10 vs 1)
                          # So CVAE learns diverse modes better
  temporal_agg: true     # Mandatory for smooth bimanual coordination
  dim_feedforward: 3200  # Larger (3200 vs 2048) since action space bigger
  n_heads: 8             # More heads to capture cross-arm correlations

Data Collection for Bimanual

Setup

Camera placement for bimanual:
  [Top camera] — looking down at workspace
        |
  [Left wrist cam] [Right wrist cam]
        |                |
   [Left arm]       [Right arm]
        \              /
         [Workspace]

Minimum 3 cameras: 1 top-down (global context) + 2 wrist cameras (detail for each arm). Budget permitting, add a front-facing camera.
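Whichever rig is used, the three streams should be timestamp-aligned before training. A quick sanity check on per-camera timestamps (synthetic data for illustration):

```python
import numpy as np

# Per-camera frame timestamps in seconds (synthetic example data)
timestamps = {
    "top": np.array([0.000, 0.020, 0.040]),
    "left_wrist": np.array([0.001, 0.021, 0.041]),
    "right_wrist": np.array([0.002, 0.019, 0.042]),
}

def max_desync(ts: dict) -> float:
    """Largest spread between cameras at any frame index."""
    stacked = np.stack(list(ts.values()))            # (n_cams, n_frames)
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())

print(max_desync(timestamps))  # worst-case spread, here ~2 ms
```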

Tips for Collecting Bimanual Data

  1. Start simple: handover task (left hand passes to right) before complex tasks. Achieve 80% success on handover first.

  2. Consistency is critical: when collecting 50 bimanual demos, you MUST be consistent:

    • Always use the same arm first
    • Same sequence of steps
    • Same speed

    Inconsistency confuses the policy.
  3. Pause = failure: never pause mid-episode. If you make a mistake, restart. ALOHA software usually has a reset button.

  4. Vary initial conditions: change object positions between demos, but don't change manipulation sequence.

  5. 50 demos is enough with ACT: more demos don't guarantee better results (risk of overfitting to noise). Quality > quantity.
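Tip 2 ("always use the same arm first") can be checked automatically. A sketch that detects which arm moves first in each recorded joint trajectory, assuming a (T, 14) array with columns 0-6 for the left arm and 7-13 for the right:

```python
import numpy as np

def first_move_time(deltas: np.ndarray, thresh: float) -> float:
    """Index of the first step where any joint's change exceeds thresh."""
    idx = np.flatnonzero(deltas.max(axis=1) > thresh)
    return float(idx[0]) if idx.size else np.inf

def first_moving_arm(qpos: np.ndarray, thresh: float = 0.01) -> str:
    """Which arm moves first in a (T, 14) joint trajectory?

    Columns 0-6 = left arm, 7-13 = right arm (illustrative layout).
    """
    delta = np.abs(np.diff(qpos, axis=0))            # per-step joint change
    left_t = first_move_time(delta[:, :7], thresh)
    right_t = first_move_time(delta[:, 7:], thresh)
    return "left" if left_t < right_t else "right"

# Synthetic demo: the right arm starts moving before the left
qpos = np.zeros((10, 14))
qpos[3:, 8] = 0.5   # a right-arm joint moves at step 3
qpos[6:, 1] = 0.5   # a left-arm joint moves at step 6
print(first_moving_arm(qpos))  # right
```

Running this over all 50 demos and asserting they agree is a cheap way to catch inconsistent episodes before training.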

Data collection with bimanual teleoperation — leader-follower setup

LeRobot SO-100 Dual Arm

Low-Cost Bimanual for Everyone

If ALOHA (~$20K) is too expensive, the LeRobot SO-100 from Hugging Face is an alternative:

  • Price: ~600 USD for dual arm (2 x SO-100)
  • 5 DOF per arm + 1 DOF gripper = 12 DOF total
  • Dynamixel STS3215 servos: cheap but accurate enough
  • Leader-follower: same as ALOHA, at a smaller scale
  • LeRobot integration: plug-and-play with ACT and Diffusion Policy

Setup SO-100 Dual Arm

# 1. Assemble 4 arms (2 leader + 2 follower)
# Per instructions at: https://github.com/huggingface/lerobot

# 2. Calibrate
python -m lerobot.scripts.calibrate \
    --robot.type=so100 \
    --robot.arms='["left_leader", "left_follower", "right_leader", "right_follower"]'

# 3. Teleoperate and record
python -m lerobot.scripts.record \
    --robot.type=so100 \
    --fps=50 \
    --repo-id=my_bimanual_dataset \
    --num-episodes=50 \
    --task="bimanual_handover"

# 4. Train ACT
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=my_bimanual_dataset \
    --training.num_epochs=2000

SO-100 Dual Limitations

  • 5 DOF (one fewer than ALOHA's 6) — limited workspace
  • Low torque: can't pick heavy objects (>500g)
  • No wrist camera mount (need 3D-printed adapter)
  • Small workspace: good for tabletop, not mobile

Diffusion Policy vs ACT for Bimanual

Criterion ACT Diffusion Policy
Bimanual coordination Good (CVAE captures modes) Excellent (full distribution)
Data needed 50 demos 50-100 demos
Training time 2-4h 6-12h
Inference speed ~5ms (fast enough) ~15ms (still OK)
Long-horizon bimanual Good Better
Implementation LeRobot built-in LeRobot built-in
Recommendation Default for bimanual When ACT struggles

Choose ACT first because it is more data-efficient, trains faster, and was designed for bimanual tasks (the ALOHA paper). Switch to Diffusion Policy only if ACT performance plateaus.

Advanced: Co-Training

Idea

Co-training is Mobile ALOHA's power move: train together on data from many tasks/setups:

Dataset = Static ALOHA data (task A, B, C)
        + Mobile ALOHA data (task D)
        + SO-100 data (task E)

Policy = ACT trained on all data

Result: positive transfer — the policy learns from many tasks and generalizes better than a task-specific policy. Mobile ALOHA achieved 90% success via co-training vs 50% when training separately.

Implement Co-Training

# Co-training with LeRobot (simplified)
from lerobot.common.datasets.lerobot_dataset import MultiLeRobotDataset

# Load several datasets as one merged training set
dataset = MultiLeRobotDataset([
    "lerobot/aloha_sim_transfer_cube_human",
    "lerobot/aloha_sim_insertion_human",
    "my_custom_bimanual_data",
])

# Train ACT on the merged data (the exact multi-dataset CLI syntax
# depends on your LeRobot version; repeating --dataset.repo_id
# typically keeps only the last value)
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=lerobot/aloha_sim_transfer_cube_human \
    --training.num_epochs=3000

Bimanual Manipulation Challenges

1. Collision Avoidance Between Arms

The 2 arms share a workspace -> collision risk. Current solutions:

  • Implicit avoidance: the policy learns from data (no collisions in the demos, so the policy avoids them too)
  • Explicit constraints: add penalty when arms too close during training
  • Workspace partitioning: divide workspace to left/right regions
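A minimal sketch of the "explicit constraints" idea: a hinge penalty on pairwise distances between key points on the two arms (the point-based geometry and the 5 cm margin are simplifying assumptions; a real setup would use proper collision meshes):

```python
import numpy as np

def collision_penalty(left_pts: np.ndarray, right_pts: np.ndarray,
                      margin: float = 0.05) -> float:
    """Hinge penalty when any left-arm point comes within `margin`
    meters of any right-arm point (points = e.g. link positions
    from forward kinematics)."""
    # Pairwise distances between the two arms' key points
    d = np.linalg.norm(left_pts[:, None, :] - right_pts[None, :, :], axis=-1)
    # Penalize only pairs closer than the safety margin
    return float(np.clip(margin - d, 0.0, None).sum())

left = np.array([[0.0, 0.10, 0.3], [0.0, 0.20, 0.3]])
right = np.array([[0.0, 0.11, 0.3], [0.5, 0.20, 0.3]])
print(collision_penalty(left, right) > 0)  # True: one pair is 1 cm apart
```

During training, this term would be added (with a weight) to the imitation loss.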

2. Asymmetric Roles

Many tasks have asymmetric roles: the left arm holds (passive) while the right arm manipulates (active). The policy must learn this role assignment — it emerges naturally from the data (if humans always use the same arm), but it requires consistent demos.

3. Temporal Coordination

Some actions need tight synchronization: 2 arms lifting an object together must lift at the same time, or it drops. ACT's action chunking helps because it predicts both arms' actions simultaneously.

4. Scale Up

14 DOF (ALOHA) is already hard; 32 DOF (2 x Shadow Hand) is nightmare territory. There is currently no robust solution for bimanual dexterous manipulation — an open research problem.

Next in Series



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
