Bimanual Manipulation: Teaching Robots to Use Two Hands

Why two hands?

Many everyday tasks cannot be done with one hand: opening a jar, pouring water, picking up food, packing a box. Humans coordinate two hands: one holds while the other manipulates, or both perform a single action together.

Bimanual manipulation for robots works the same way: two robot arms operate simultaneously with precise coordination. The complexity grows sharply, though: 14 DoF (2 x 6-DoF arm + 2 grippers) instead of 7, an action space twice as large, and the two arms must avoid colliding with each other.

This series has already covered grasping, imitation learning, diffusion policy, VLA, and dexterous hands. This post focuses on bimanual manipulation: hardware, data collection, and training methods.

ALOHA: Hardware Platform

The original ALOHA (2023)

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford (Tony Zhao, Chelsea Finn) is the platform that changed the landscape of bimanual manipulation research:

Design:

4 robot arms: 2 leaders (moved by the operator) + 2 followers (execute the task while data is recorded)
Dynamixel servos: XM430-W350 and XM540-W270
6 DoF per arm + 1 DoF gripper = 7 DoF x 2 = 14 DoF total
4 cameras: 1 top + 1 front + 2 wrist-mounted
Price: ~20,000 USD (roughly 10x cheaper than a commercial bimanual setup)

Leader-follower teleoperation: the operator holds the 2 leader arms and the 2 follower arms mirror their joint motion. It is natural and fast, and it yields high-quality data.
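The leader-follower loop can be sketched in a few lines. This is a minimal illustration, not ALOHA's actual software: it assumes the recorded action is the leader's joint positions and the observation is the follower's joint positions (camera images are omitted), and the per-tick rate limit `max_step` is a hypothetical stand-in for the servo's own position controller.

```python
from dataclasses import dataclass, field

NUM_JOINTS = 14  # 2 arms x (6 joints + 1 gripper)


@dataclass
class Episode:
    observations: list = field(default_factory=list)  # follower joint positions
    actions: list = field(default_factory=list)       # leader joint positions


def teleop_step(leader_pos, follower_pos, max_step=0.05):
    """Move each follower joint toward its leader joint, at most max_step rad per tick."""
    new_pos = []
    for target, current in zip(leader_pos, follower_pos):
        delta = max(-max_step, min(max_step, target - current))
        new_pos.append(current + delta)
    return new_pos


def record_episode(leader_trajectory, follower_start):
    """Replay a leader trajectory and log (observation, action) pairs per tick."""
    episode = Episode()
    follower = list(follower_start)
    for leader_pos in leader_trajectory:
        episode.observations.append(list(follower))  # state before moving
        episode.actions.append(list(leader_pos))     # what the operator commanded
        follower = teleop_step(leader_pos, follower)
    return episode, follower
```

Because both arms are logged in one 14-dimensional vector per tick, a policy trained on these pairs sees the coordination between the hands directly.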

Why did ALOHA succeed?

Low-cost: makes bimanual research accessible to many labs
High-quality data: leader-follower teleoperation is more natural than a joystick
Open-source: CAD files, firmware, and software are all public
ACT integration: policies can be trained directly on ALOHA data with ACT

Mobile ALOHA (2024)

Mobile ALOHA (Fu et al., 2024) adds a mobile base (AgileX Tracer) to ALOHA:

Whole-body teleoperation: the operator drives the base and manipulates with both arms at the same time
Price: ~32,000 USD (including the mobile base and compute)
New tasks: cooking (sautéing shrimp, rinsing a pan), opening cabinets, entering an elevator
Co-training: combining static ALOHA data with Mobile ALOHA data raises success rates by up to 90%

Mobile ALOHA architecture:
  Mobile base (AgileX Tracer)
    ├── Left arm (6-DoF + gripper)
    ├── Right arm (6-DoF + gripper)
    ├── Top camera (global view)
    ├── Left wrist camera
    ├── Right wrist camera
    └── Onboard compute (laptop)

Action space: [left_arm(7), right_arm(7), base_vel(2)] = 16 dimensions
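To make the layout concrete, here is a small sketch that splits a 16-dimensional action into its three parts. The ordering follows the line above; the names `MobileAlohaAction` and `split_action` are hypothetical, and the two base values are assumed to be the base's linear and angular velocity.

```python
from typing import NamedTuple, Sequence

ARM_DIM = 7   # 6 joints + 1 gripper
BASE_DIM = 2  # assumed: linear and angular velocity of the base


class MobileAlohaAction(NamedTuple):
    left_arm: list
    right_arm: list
    base_vel: list


def split_action(action: Sequence[float]) -> MobileAlohaAction:
    """Split a flat action vector into left arm, right arm, and base commands."""
    expected = 2 * ARM_DIM + BASE_DIM
    if len(action) != expected:
        raise ValueError(f"expected {expected} values, got {len(action)}")
    return MobileAlohaAction(
        left_arm=list(action[:ARM_DIM]),
        right_arm=list(action[ARM_DIM:2 * ARM_DIM]),
        base_vel=list(action[2 * ARM_DIM:]),
    )
```

A static ALOHA action (14 values) fails the length check, which is a cheap way to catch a mismatched dataset before training.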

ACT for Bimanual Tasks

Why does ACT suit bimanual tasks?

ACT (from Part 2) is a particularly good fit for bimanual work because:

Action chunking: bimanual tasks need precise coordination between the two arms at every timestep. Each predicted chunk contains the actions of both arms, so their motions stay synchronized.
CVAE: when there are several ways to coordinate the hands (left holds + right twists, or the reverse), the CVAE captures that variety.
Data efficient: about 50 demos are enough for one bimanual task, which matters because bimanual data takes more effort to collect than single-arm data.
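Action chunking can be illustrated with a toy rollout loop. This is a sketch under simple assumptions: `dummy_policy` is a hypothetical stand-in for a trained ACT model, the observation is a placeholder, and the chunk is executed open-loop without temporal ensembling.

```python
ACTION_DIM = 14  # both arms in one vector: 2 x (6 joints + 1 gripper)


def dummy_policy(observation, chunk_size):
    """Hypothetical stand-in for ACT: returns chunk_size actions of ACTION_DIM each."""
    return [[0.0] * ACTION_DIM for _ in range(chunk_size)]


def run_chunked(policy, horizon, chunk_size):
    """Query the policy once per chunk, then execute that chunk open-loop."""
    executed = []
    policy_calls = 0
    while len(executed) < horizon:
        observation = {"t": len(executed)}  # placeholder observation
        chunk = policy(observation, chunk_size)
        policy_calls += 1
        remaining = horizon - len(executed)
        executed.extend(chunk[:remaining])  # never run past the horizon
    return executed, policy_calls
```

With a horizon of 250 steps and a chunk size of 100, the policy is queried 3 times instead of 250, and every executed action carries the commands for both arms at that timestep.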

Training pipeline

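# Train ACT on a bimanual task with LeRobot.
# Flag names follow LeRobot's draccus-based CLI and change between releases;
# confirm them with `python -m lerobot.scripts.train --help`.
# LeRobot counts optimizer steps (--steps), not epochs.
# Temporal ensembling requires n_action_steps=1.
python -m lerobot.scripts.train \
    --policy.type=act \
    --env.type=aloha \
    --env.task=AlohaInsertion-v0 \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --steps=100000 \
    --batch_size=8 \
    --policy.chunk_size=100 \
    --policy.kl_weight=10 \
    --policy.n_action_steps=1 \
    --policy.temporal_ensemble_coeff=0.01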

Key hyperparameters for bimanual tasks

policy:
  chunk_size: 100        # Longer than typical single-arm settings (50-100 vs 20-50)
                          # because bimanual tasks usually run longer; 100 is also ACT's default
  kl_weight: 10          # ACT's default; weights the KL term of the CVAE loss
                          # against the reconstruction term
  temporal_agg: true     # Temporal ensembling; optional, smooths bimanual coordination
  dim_feedforward: 3200  # ACT's default feedforward width
  n_heads: 8             # ACT's default number of attention heads
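The `temporal_agg` option refers to temporal ensembling: the policy is queried at every step, so several overlapping chunks each contain a prediction for the current timestep, and those predictions are averaged with exponential weights w_i = exp(-m * i), where i = 0 is the oldest prediction. The sketch below assumes m = 0.01; `ChunkBuffer` is a hypothetical helper, not a LeRobot class.

```python
import math


def temporal_ensemble(predictions, m=0.01):
    """Weighted average of action vectors for one timestep, oldest prediction first."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]


class ChunkBuffer:
    """Stores overlapping chunks and ensembles their predictions per timestep."""

    def __init__(self, m=0.01):
        self.m = m
        self.chunks = []  # (start_timestep, chunk), in chronological order

    def add(self, start_t, chunk):
        self.chunks.append((start_t, chunk))
        # Drop chunks that no longer cover the current or future timesteps.
        self.chunks = [(s, c) for s, c in self.chunks if s + len(c) > start_t]

    def action_at(self, t):
        preds = [c[t - s] for s, c in self.chunks if s <= t < s + len(c)]
        return temporal_ensemble(preds, self.m)
```

A small m gives nearly uniform averaging and smoother motion; a larger m leans on the oldest prediction and reacts more slowly to new observations.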

Data Collection for Bimanual Tasks

Setup

Camera placement for bimanual:
  [Top camera] -- looks down at the workspace
        |
  [Left wrist cam] [Right wrist cam]
        |                |
   [Left arm]       [Right arm]
        \              /
         [Workspace]

3 cameras is the minimum: 1 top-down (global context) + 2 wrist (detail for each hand). With extra budget, add 1 camera in front (front view).

Tips for collecting bimanual data

Start with a simple task: handover (the left hand passes an object to the right) before anything complex. Reach 80% success on handover before moving to another task.
Consistency is critical: when recording 50 demos for a bimanual task, perform them the SAME way:
- Always start with the same hand
- Follow the same sequence of steps
- Move at the same speed
Inconsistency will confuse the policy.
Pause = failure: do not stop in the middle of an episode. If you make a mistake, start over. ALOHA software usually provides a reset button.
Cover initial conditions: vary object positions between demos, but keep the sequence of actions the same.
50 demos is enough with ACT: more is not necessarily better (risk of overfitting to noise). Quality > quantity.
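The "pause = failure" rule can be checked automatically after recording. The sketch below flags an episode when every joint stays nearly still for too long; the function name and the thresholds (`min_motion`, `max_pause_s`) are hypothetical and should be tuned to your robot and task.

```python
def has_pause(joint_trajectory, fps=50, min_motion=1e-3, max_pause_s=0.5):
    """Return True if all joints stay nearly still for longer than max_pause_s.

    joint_trajectory: list of joint-position vectors, one per recorded frame.
    """
    max_still_frames = int(max_pause_s * fps)
    still = 0
    for prev, curr in zip(joint_trajectory, joint_trajectory[1:]):
        # Largest single-joint change between two consecutive frames.
        moved = max(abs(c - p) for p, c in zip(prev, curr))
        still = still + 1 if moved < min_motion else 0
        if still > max_still_frames:
            return True
    return False
```

Episodes that trip this check are candidates for re-recording rather than for training. Tasks that legitimately require holding still (for example, waiting while pouring) need a longer `max_pause_s`.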

LeRobot SO-100 Dual Arm

Low-cost bimanual for everyone

If ALOHA (20K USD) is still too expensive, the SO-100 arm supported by Hugging Face's LeRobot is an alternative:

Price: ~600 USD for a dual-arm setup (2 x SO-100)
5 DoF per arm + 1 DoF gripper = 12 DoF total
Feetech STS3215 servos: cheap but accurate enough
Leader-follower: like ALOHA, at a smaller scale
LeRobot integration: plug-and-play with ACT and Diffusion Policy

Setting up a dual-arm SO-100

# Schematic workflow: script names and flags differ between LeRobot releases,
# so check the repository docs for the version you install.

# 1. Assemble 4 arms (2 leader + 2 follower)
# Follow the guide at: https://github.com/huggingface/lerobot

# 2. Calibrate
python -m lerobot.scripts.calibrate \
    --robot.type=so100 \
    --robot.arms='["left_leader", "left_follower", "right_leader", "right_follower"]'

# 3. Teleoperate and record
python -m lerobot.scripts.record \
    --robot.type=so100 \
    --fps=50 \
    --repo-id=my_bimanual_dataset \
    --num-episodes=50 \
    --task="bimanual_handover"

# 4. Train ACT (LeRobot counts optimizer steps, not epochs)
python -m lerobot.scripts.train \
    --policy.type=act \
    --dataset.repo_id=my_bimanual_dataset \
    --steps=100000

Limitations of the dual SO-100

5 DoF per arm (1 fewer than ALOHA's 6), which limits the reachable end-effector poses
Low torque: cannot lift heavy objects (>500 g)
No wrist camera mount (requires a 3D-printed adapter)
Small workspace: suited to tabletop tasks, not mobile manipulation

Diffusion Policy vs ACT for Bimanual Tasks

Criterion	ACT	Diffusion Policy
Bimanual coordination	Good (CVAE captures modes)	Very good (full distribution)
Data needed	50 demos	50-100 demos
Training time	2-4 h	6-12 h
Inference speed	~5 ms (fast enough)	~15 ms (still OK)
Long-horizon bimanual	Good	Better
Implementation	LeRobot built-in	LeRobot built-in
Recommendation	Default for bimanual	When ACT struggles

Start with ACT: it is more data efficient, trains faster, and was designed for bimanual manipulation (it was introduced in the ALOHA paper). Switch to Diffusion Policy only when ACT's performance plateaus.

Advanced: Co-training

The idea

Co-training is a powerful technique from Mobile ALOHA: train jointly on data from multiple tasks and setups:

Dataset = Static ALOHA data (task A, B, C)
        + Mobile ALOHA data (task D)
        + SO-100 data (task E)

Policy = ACT trained on all the data

Result: positive transfer. A policy trained jointly on many tasks generalizes better than one trained on each task separately. The Mobile ALOHA paper reports that co-training with static ALOHA data raises success rates by up to 90%.
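One detail matters when the datasets differ greatly in size: if samples are drawn uniformly from the merged pool, the larger dataset dominates every batch. A simple remedy is to pick the source dataset first and then a sample within it. The sketch below assumes a 50/50 mix between static and mobile data; `cotraining_batches` and `p_static` are hypothetical names, and the ratio is a tunable assumption.

```python
import random


def cotraining_batches(static_data, mobile_data, batch_size, num_batches,
                       p_static=0.5, seed=0):
    """Yield batches where each sample comes from static_data with probability p_static."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # Choose the source first so dataset size does not skew the mix.
            source = static_data if rng.random() < p_static else mobile_data
            batch.append(rng.choice(source))
        yield batch
```

With 1,000 static samples and 50 mobile samples, about half of each batch still comes from the mobile task, so the small target dataset is not drowned out.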

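Implementing co-training

# Co-training with LeRobot (simplified sketch).
# Assumes all datasets share the same feature keys and shapes
# (e.g. 14-dim state and action, same camera names).
from torch.utils.data import ConcatDataset, DataLoader

# Import path may differ in newer LeRobot releases.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load several datasets
# (ACT also needs action chunks; LeRobotDataset's delta_timestamps
# argument provides them.)
datasets = [
    LeRobotDataset("lerobot/aloha_sim_transfer_cube_human"),
    LeRobotDataset("lerobot/aloha_sim_insertion_human"),
    LeRobotDataset("my_custom_bimanual_data"),
]

# Merge them so every batch mixes samples from all tasks
merged = ConcatDataset(datasets)
loader = DataLoader(merged, batch_size=8, shuffle=True)

# Feed `loader` to a custom ACT training loop. The stock CLI
# (python -m lerobot.scripts.train) takes a single --dataset.repo_id,
# so passing the flag twice is not a way to merge datasets.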

Challenges of Bimanual Manipulation

1. Collision avoidance between the 2 arms

The 2 arms share a workspace, so they risk colliding. Current approaches:

Implicit avoidance: the policy learns from data (the demos contain no collisions, so the policy tends to avoid them too)
Explicit constraints: add a training penalty when the 2 arms get too close
Workspace partitioning: split the workspace into left/right regions
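The explicit-constraint and partitioning ideas can be sketched as two small checks on end-effector positions. Everything here is an assumption for illustration: the positions would come from forward kinematics on the predicted joint actions, `min_dist` and `margin` are hypothetical values in meters, and the y-axis is assumed to point to the robot's left with y = 0 dividing the workspace.

```python
import math


def proximity_penalty(left_xyz, right_xyz, min_dist=0.10):
    """Squared hinge penalty: zero when the end-effectors are at least min_dist apart."""
    dist = math.dist(left_xyz, right_xyz)
    return max(0.0, min_dist - dist) ** 2


def violates_partition(left_xyz, right_xyz, margin=0.02):
    """True if either end-effector leaves its half of the workspace.

    Assumes y > 0 is the left region and y < 0 is the right region.
    """
    return left_xyz[1] < margin or right_xyz[1] > -margin
```

The penalty can be added to the training loss, while the partition check works as a runtime safety filter. Hard partitioning rules out tasks such as handover, where the hands must meet, so it only fits tasks whose hands stay on their own sides.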

2. Asymmetric roles

Many tasks have asymmetric roles: the left hand holds (passive) while the right hand manipulates (active). The policy has to learn this role assignment. It emerges naturally from the data when the demonstrator always uses the same hand for each role, which is why consistency across demos matters.

3. Temporal coordination

Some actions need precise synchronization: when both hands lift one object, they must lift together, and if one hand lags, the object drops. ACT's action chunking helps because each prediction contains the actions for both arms.

4. Scale up

14 DoF (ALOHA) is already hard; around 40 actuated DoF (2 x Shadow Hand, hands alone) is a nightmare. There is currently no robust solution for bimanual dexterous manipulation, and it remains an open research problem.

Next in the series

Part 7: Building a manipulation system with LeRobot. End-to-end: setup, record, train, deploy