Why Imitation Learning for Manipulation?
In Part 1 of this series, we discussed grasping -- the problem of picking up a single object. But real-world manipulation is far more complex: packing food into a container, inserting bolts into holes, opening bottle caps... These tasks involve long horizons and multiple steps, and are difficult to hard-code.
Imitation Learning (IL) takes a different approach: instead of coding each step, let the robot watch a person do the task and learn from that. A human teleoperates the robot through the task 50-100 times to collect data, then a policy is trained on that data via supervised learning.
It sounds simple, but challenges like distribution shift, multimodal actions, and compounding errors make IL more nuanced. This post covers the journey from basics (Behavioral Cloning) to state-of-the-art (ACT), with practical code and real-world tips.
If you haven't read the IL overview, see Imitation Learning 101 in the AI for Robotics series.
Data Collection: Teleoperation
Why Data Quality is Everything
In IL, data quality is most of the battle: a policy trained on 50 high-quality demonstrations routinely outperforms one trained on 500 poor ones. "Good" means:
- Consistent: for the same task, the human performs it similarly across episodes
- Diverse enough: covers varied initial conditions (object position, angle, ...)
- Smooth: no jerkiness, no pauses between steps, even speed
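Smoothness, at least, can be screened automatically before training. A small sketch (the metric and function name are ours, for illustration, not a standard tool) that flags jerky episodes via the third derivative of joint positions:

```python
import numpy as np

def max_jerk(qpos, dt=0.02):
    # peak third derivative of joint positions;
    # qpos is a (T, dof) array sampled every dt seconds
    jerk = np.diff(qpos, n=3, axis=0) / dt**3
    return float(np.abs(jerk).max())

# smooth sine trajectory vs. the same trajectory with a sudden jump
t = np.linspace(0, 2, 100)[:, None]
smooth = np.sin(t)
jumpy = smooth.copy()
jumpy[50] += 0.3
assert max_jerk(jumpy) > max_jerk(smooth)
```

Episodes whose peak jerk sits far above the batch median are good candidates for review or removal.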
Teleoperation Methods
| Method | Cost | Data Quality | Setup Difficulty |
|---|---|---|---|
| Keyboard/joystick | Low | Low (jerky, slow) | Easy |
| VR controller (Quest 3) | ~$500 | Medium | Medium |
| Leader-follower (ALOHA-style) | ~$5,000-32,000 | High (most natural) | Hard |
| Kinesthetic teaching | 0 (just cobot) | High | Easy (but tiring) |
Leader-follower is today's gold standard: you move a leader robot arm by hand, and the follower arm mirrors the motion exactly. This is how ALOHA and Mobile ALOHA collect data -- natural, accurate, and scalable.
If you don't have ALOHA hardware, LeRobot SO-100 from Hugging Face (~$300) supports leader-follower with 2 cheap robot arms.
Data Format
Each demonstration episode contains:
# Each timestep t in an episode
{
    "observation": {
        "images": {
            "cam_high": np.ndarray,   # (480, 640, 3) RGB, top camera
            "cam_wrist": np.ndarray,  # (480, 640, 3) RGB, wrist camera
        },
        "qpos": np.ndarray,    # (6,) joint positions (6-DoF arm)
        "qvel": np.ndarray,    # (6,) joint velocities
        "gripper": float,      # gripper opening (0-1)
    },
    "action": np.ndarray,      # (7,) target joint positions + gripper
}
Note: action space can be joint positions, velocities, or end-effector pose (Cartesian). Joint positions are most common for stability and reproducibility.
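At training time the images go through a CNN encoder, while the low-dimensional fields are typically flattened into one vector. A hypothetical helper for the schema above (the function name and the choice to drop images are ours, for illustration):

```python
import numpy as np

def flatten_obs(obs):
    # concatenate the non-image parts of one observation into a
    # single vector suitable for an MLP policy
    return np.concatenate([
        obs["qpos"],
        obs["qvel"],
        np.array([obs["gripper"]]),
    ])

obs = {"qpos": np.zeros(6), "qvel": np.zeros(6), "gripper": 0.5}
print(flatten_obs(obs).shape)  # (13,)
```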
Behavioral Cloning (BC): Basic Supervised Learning
Concept
BC is the simplest approach: treat IL as supervised learning -- input is observation, output is action, loss is MSE between predicted and expert action.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# Training loop (dataloader yields (obs, action) batches)
policy = BCPolicy(obs_dim=18, action_dim=7)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(100):
    for obs, action in dataloader:
        pred_action = policy(obs)
        loss = criterion(pred_action, action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
The Distribution Shift Problem
BC has a critical issue called distribution shift (or compounding error):
- During training, the policy only ever sees states visited by the expert
- During deployment, a slightly wrong action puts the robot in a state absent from the training data
- In that novel state, the policy predicts an even worse action → larger deviation
- After a few steps, the robot is in a state completely unlike anything in the expert data → the task fails
Small errors at step 1 accumulate into large errors at step T. For a 100-step task, a mere 1% error per step compounds to 0.99^100 ≈ 0.37, i.e. only about a 37% success rate.
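That back-of-the-envelope number is easy to verify, under the simplifying assumption that each step fails independently with the same probability:

```python
def survival_rate(eps: float, horizon: int) -> float:
    # probability that all `horizon` steps succeed when each step
    # independently fails with probability eps
    return (1.0 - eps) ** horizon

print(f"{survival_rate(0.01, 100):.3f}")  # 0.366
```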
DAgger: Fixing Distribution Shift
Core Idea
DAgger (Dataset Aggregation, Ross et al., 2011) fixes distribution shift by collecting data from states the policy encounters during deployment:
- Train policy on initial expert data (like BC)
- Run policy on robot → observe states it reaches
- Expert labels actions for those new states
- Merge new data into dataset, retrain policy
- Repeat from step 2
DAgger iteration:
D0 = expert demonstrations
pi_0 = BC(D0)
for i = 1, 2, ..., N:
    run pi_{i-1} on robot → collect states S_i
    expert labels actions for S_i → dataset D_i
    D = D0 ∪ D1 ∪ ... ∪ D_i
    pi_i = BC(D)
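The loop above can be exercised end to end on a toy problem. Everything in this sketch is invented for illustration: a 1-D state, a scalar "expert" that damps the state toward zero, and least-squares in place of a neural BC step -- but the aggregate-and-retrain structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # the "human expert": damp the state toward zero
    return -0.5 * s

def rollout(w, s0, steps=20, noise=0.05):
    # run the linear policy a = w * s (with execution noise) and
    # record every state it visits
    states, s = [], s0
    for _ in range(steps):
        states.append(s)
        s = s + w * s + rng.normal(0.0, noise)
    return states

# D0: states visited while the expert itself drives
states = [s for s0 in rng.uniform(-1, 1, 10)
          for s in rollout(-0.5, s0, noise=0.0)]
labels = [expert_action(s) for s in states]

for i in range(5):
    # BC step: least-squares fit of a = w * s on the aggregated data
    X, y = np.array(states), np.array(labels)
    w = float(X @ y / (X @ X))
    # run the learned policy, then have the "expert" label the
    # states it actually reached
    new_states = rollout(w, s0=float(rng.uniform(-1, 1)))
    states += new_states
    labels += [expert_action(s) for s in new_states]

print(round(w, 3))  # recovers the expert gain -0.5
```

The key line is the labeling of `new_states`: the dataset grows with states the *policy* reaches, not just states the expert reaches.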
Practical Limitations
- Requires expert present continuously for labeling -- time-consuming
- Robot running poor policy can be dangerous (collision, breaking things)
- Each iteration needs a real-robot deployment → slow
Variants like HG-DAgger and LazyDAgger reduce expert interventions, but still need human in the loop.
ACT: Action Chunking with Transformers
ACT's Breakthrough
ACT (Zhao et al., 2023) from Stanford is a major breakthrough for IL in manipulation. Two key ideas:
1. Action Chunking: Instead of predicting 1 action per timestep, predict a sequence of k actions (e.g., k=100, ~2 seconds). This reduces effective horizon from T to T/k, reducing compounding error proportionally.
2. CVAE (Conditional Variational Autoencoder): Handles multimodal actions -- when same observation can have multiple valid action sequences (e.g., grasp cup from left or right). CVAE encodes style variable z to capture this diversity.
Architecture
Input:
- Images: [cam_high, cam_wrist] -> ResNet18 -> visual tokens
- Joint positions: qpos -> MLP -> proprioception token
- Style variable: z ~ N(0, I) (at inference)
Encoder (training only):
- [action_sequence, obs_tokens] -> Transformer Encoder -> z (mean, var)
Decoder:
- [z, obs_tokens] -> Transformer Decoder -> action_chunk [a_t, a_{t+1}, ..., a_{t+k}]
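Training minimizes chunk reconstruction error plus a KL term pulling the encoder's z toward N(0, I), as in a standard CVAE. A minimal sketch of that objective (the L1 reconstruction follows the ACT paper; the β weight and tensor shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def act_loss(pred_chunk, gt_chunk, mu, logvar, beta=10.0):
    # L1 reconstruction over the whole predicted action chunk
    recon = F.l1_loss(pred_chunk, gt_chunk)
    # KL divergence between the encoder posterior N(mu, sigma^2)
    # and the standard normal prior N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# illustrative shapes: chunks (batch, k, action_dim), latents (batch, z_dim)
loss = act_loss(torch.randn(8, 100, 7), torch.randn(8, 100, 7),
                torch.randn(8, 32), torch.randn(8, 32))
```

At inference the encoder is dropped entirely and z is simply set to the prior mean (zeros), which selects a deterministic "average style".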
Temporal Ensembling
When executing action chunks, at each timestep t, the robot has multiple predicted actions from previous chunks (chunks starting at t-1, t-2, ...). ACT uses temporal ensembling -- weighted average of predictions with exponential decay:
# Temporal ensembling
import numpy as np

def temporal_ensemble(all_predictions, current_step, decay=0.01):
    """
    all_predictions: dict {start_step: action_chunk}, where each chunk
        is an array of shape (k, action_dim)
    Returns the ensembled action for current_step.
    """
    weights, actions = [], []
    for start_step, chunk in all_predictions.items():
        idx = current_step - start_step  # how far into this chunk we are
        if 0 <= idx < len(chunk):
            weights.append(np.exp(-decay * idx))
            actions.append(chunk[idx])
    weights = np.array(weights) / np.sum(weights)
    return sum(w * a for w, a in zip(weights, actions))
Results
ACT achieves 80-90% success rate on 6 difficult manipulation tasks (opening bottle, inserting batteries, picking food) with only 10 minutes of demonstrations (~50 episodes). This contrasts sharply with BC (~30-50% on same tasks).
BC vs DAgger vs ACT Comparison
| Criterion | BC | DAgger | ACT |
|---|---|---|---|
| Distribution shift | Serious | Mitigated (needs expert) | Mitigated (action chunking) |
| Multimodal actions | Unhandled | Unhandled | Handled (CVAE) |
| Demos needed | 100-500+ | 50-100 + iterations | 50 (10 minutes) |
| Needs expert in loop | No | Yes (each iteration) | No |
| Architecture | MLP/CNN | MLP/CNN | Transformer + CVAE |
| Long-horizon tasks | Poor | OK | Good |
| Implementation difficulty | Easy | Medium | Medium |
| Real robot risk | Low | High (poor policy) | Low |
Hands-on: Training ACT with LeRobot
LeRobot from Hugging Face ships with ACT built-in. Here's the fastest path:
# 1. Install LeRobot
pip install lerobot
# 2. Download sample dataset
python -m lerobot.scripts.download_dataset \
--repo-id lerobot/aloha_sim_transfer_cube_human
# 3. Train ACT policy
python -m lerobot.scripts.train \
--policy.type=act \
--env.type=aloha \
--env.task=AlohaTransferCube-v0 \
--dataset.repo_id=lerobot/aloha_sim_transfer_cube_human \
--training.num_epochs=2000 \
--training.batch_size=8
# 4. Evaluate
python -m lerobot.scripts.eval \
--policy.path=outputs/train/act_aloha_transfer_cube/checkpoints/last/pretrained_model \
--env.type=aloha \
--env.task=AlohaTransferCube-v0 \
--eval.n_episodes=50
Tips for Collecting Good Data
- Go slow and smooth: Teleoperate at 50-70% max speed, avoid jerky movements
- Vary object positions: Cover variation in initial conditions
- 50 demos are often enough: for a simple task trained with ACT
- Review data before training: Replay episodes, remove poor ones
- Camera angle matters: Position camera to clearly see contact area
Next in Series
This is Part 2 of Robot Manipulation Masterclass. Coming next:
- Part 3: Diffusion Policy in Practice: From Theory to Code -- when ACT isn't enough, Diffusion Policy is the next step
- Part 4: VLA for Manipulation: RT-2, Octo, pi0 -- Foundation models for manipulation
Related Posts
- Robot Grasping 101: Analytical to Learning-Based -- Part 1 of series
- Imitation Learning 101: BC, IRL, and What You Need to Know -- Broader IL overview
- ACT: Action Chunking with Transformers Deep Dive -- Detailed ACT architecture analysis
- Diffusion Policy in Practice -- Part 3 of series
- Building a Manipulation System with LeRobot -- End-to-end deployment