
Imitation Learning: BC, DAgger and DAPG for Robots

Why imitation learning is more important than RL for many manipulation problems — how to collect data and train policies.

Nguyen Anh Tuan · March 8, 2026 · 11 min read

Imitation Learning — Teaching Robot by Example

In Part 1, I introduced RL algorithms for robotics. But in reality, many manipulation tasks don't need RL: just show the robot how a human does the task, then have it copy. That's Imitation Learning (IL).

Why does IL matter? Because RL for manipulation is extremely hard:

  1. Reward design: specifying a dense reward for contact-rich tasks is notoriously difficult
  2. Sample inefficiency: model-free RL often needs millions of environment steps
  3. Unsafe exploration: random exploration can damage a real robot
  4. Sim-to-real gap: policies trained in simulation frequently fail on hardware

Imitation learning solves all of this by skipping rewards entirely — learning directly from expert demonstrations (human teleoperation).

Person controlling robot via teleoperation to collect demonstration data

Behavioral Cloning (BC) — Simplest and Fastest

Behavioral Cloning is the simplest IL method: pure supervised learning on (observation, action) pairs from expert demonstrations.

How It Works

Collect data:
  Expert controls robot → record (state, action) each timestep
  Dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}

Training:
  Minimize L = Σ ||π_θ(s_i) - a_i||²
  → Supervised regression: given state, predict action

Deployment:
  Robot runs learned policy π_θ autonomously

Simply put: BC = regression. Give the model (input, output) pairs and train it to predict output from input. No environment needed, no reward, no simulation.

Advantages

  1. Dead simple: plain supervised learning, implementable in an afternoon
  2. No reward function, simulator, or environment interaction needed
  3. Very sample-efficient and fast to train; works entirely offline from logged demonstrations

Disadvantages — Distribution Shift

BC has one fundamental problem: distribution shift (also called covariate shift).

Training:  Robot sees states from expert trajectory
           s_expert → a_expert (correct path)

Deployment: Robot makes small mistake → deviates → sees new state
            s_new (never seen) → a_??? (doesn't know what to do)
            → errors accumulate → complete failure

Expert always takes the correct path, so training data only contains "good" states. When the robot runs autonomously and makes a small mistake, it ends up in states it's never seen — and doesn't know how to recover. Small errors accumulate into large failures over time.

Solutions for Distribution Shift

  1. Collect more diverse data: Multiple demonstrations — not just one way to do the task
  2. Data augmentation: Add noise to observations during training
  3. Action chunking: Predict many actions at once instead of step-by-step (see Part 3)
  4. DAgger: The next method, specifically designed to fix this
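Augmentation (2) is only a few lines in practice: perturb each (normalized) observation batch with small Gaussian noise before the forward pass, leaving the action labels untouched. The noise scale below is a placeholder you would tune per task:

```python
import torch

def augment_obs(obs: torch.Tensor, noise_std: float = 0.01) -> torch.Tensor:
    """Add Gaussian noise to a batch of (normalized) observations.

    Small perturbations simulate the slightly off-distribution states
    the robot will encounter when it runs on its own.
    """
    return obs + noise_std * torch.randn_like(obs)

# In the BC training loop, perturb only the inputs, never the action labels:
# pred = policy(augment_obs(obs))
```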

DAgger — Iterative, Fixes Distribution Shift

DAgger (Dataset Aggregation) (Ross et al., 2011) solves distribution shift iteratively: let the robot run its current policy, have an expert fix mistakes, add new data to the dataset, retrain.

Algorithm

Algorithm DAgger:
  1. Collect initial dataset D₀ from expert demonstrations
  2. Train policy π₁ on D₀ (BC)
  3. For i = 1, 2, ..., N:
     a. Robot runs π_i → collect states {s₁, s₂, ...}
     b. Expert labels actions for those states: {(s₁, a*₁), (s₂, a*₂), ...}
     c. Aggregate: D_i = D_{i-1} ∪ {new labeled data}
     d. Train π_{i+1} on D_i
  4. Return π_N
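The loop above is short enough to write out directly. Here `expert_label`, `run_policy`, and `train_policy` are hypothetical hooks for your expert interface, rollout code, and BC training step; this is a sketch of the aggregation logic, not the paper's exact bookkeeping:

```python
import numpy as np

def dagger(expert_label, run_policy, train_policy, init_dataset, n_iters=5):
    """Dataset Aggregation (Ross et al., 2011), schematically.

    expert_label(states)  -> expert actions for those states (step 3b)
    run_policy(policy)    -> states visited when running `policy` (step 3a)
    train_policy(dataset) -> policy fit on (state, action) pairs (BC)
    """
    states, actions = init_dataset             # step 1: D0 from expert demos
    policy = train_policy((states, actions))   # step 2: BC on D0
    for _ in range(n_iters):                   # step 3
        visited = run_policy(policy)           # 3a: roll out current policy
        labels = expert_label(visited)         # 3b: expert labels those states
        states = np.concatenate([states, visited])    # 3c: aggregate dataset
        actions = np.concatenate([actions, labels])
        policy = train_policy((states, actions))      # 3d: retrain
    return policy                              # step 4
```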

Why It Works

The DAgger loop ensures training data covers states the robot actually encounters, not just states from the expert's trajectory. After a few iterations, distribution shift is greatly reduced.

Pros and Cons

Pros:

  1. Directly fixes distribution shift: training data covers states the policy actually visits
  2. Strong theoretical guarantees (errors grow linearly with horizon, not quadratically as in BC)
  3. Needs relatively few new labels per iteration

Cons:

  1. Requires an expert available at every iteration, which is expensive
  2. Labeling actions for states the expert did not generate is unnatural and error-prone for humans
  3. Early policies may visit unsafe states on a real robot

Practical Variants

In practice, vanilla DAgger is rarely used because it requires a constantly available expert. More common variants:

  1. HG-DAgger (human-gated): the human supervisor intervenes, and labels, only when the policy is about to fail
  2. SafeDAgger: a learned safety policy decides when to hand control back to the expert
  3. EnsembleDAgger: query the expert only in states where an ensemble of policies disagrees (high uncertainty)
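As an illustration of the human-gated idea, here is a rollout where the expert is queried only when an intervention condition fires. `get_obs`, `step`, `is_unsafe`, and `expert_action` are hypothetical interfaces standing in for your robot and supervisor:

```python
def gated_rollout(policy, expert_action, is_unsafe, get_obs, step, horizon):
    """HG-DAgger-style rollout: the expert takes over only when needed.

    is_unsafe(obs) -> True when the supervisor grabs control; only those
    (obs, expert action) pairs are added to the DAgger dataset.
    """
    new_data = []
    for _ in range(horizon):
        obs = get_obs()
        if is_unsafe(obs):
            act = expert_action(obs)     # expert corrects and labels
            new_data.append((obs, act))
        else:
            act = policy(obs)            # robot keeps control
        step(act)
    return new_data
```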

DAPG — Combine Demos with RL Fine-tuning

DAPG (Demonstration Augmented Policy Gradient) (Rajeswaran et al., 2018) is best of both worlds: start from demonstrations, then fine-tune with RL.

Two Stages

Stage 1: Pre-training (BC)
  → Train initial policy from demonstrations
  → Policy knows "roughly" how to do the task

Stage 2: RL Fine-tuning with Demo-Augmented Objective
  → Use RL (PPO/NPG) to optimize policy
  → Add auxiliary imitation loss penalizing ||π(s) - π_demo(s)||
    → Keeps policy near demonstrations, prevents drift
  → Gradually decay the auxiliary term over training
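Schematically, stage 2 can be written as an extra loss whose weight shrinks each iteration. This is a simplification (the DAPG paper weights per-sample gradient terms on the demonstration data), but it captures the decay schedule; `lam0` and `lam1` are illustrative values:

```python
import torch
import torch.nn.functional as F

def dapg_loss(policy, rl_loss, demo_obs, demo_acts, iteration,
              lam0=0.1, lam1=0.95):
    """RL objective plus a decaying imitation term (DAPG-style sketch).

    The imitation weight lam0 * lam1**iteration shrinks every iteration,
    so the policy leans on the demonstrations early and on the reward late.
    """
    bc_loss = F.mse_loss(policy(demo_obs), demo_acts)
    weight = lam0 * (lam1 ** iteration)
    return rl_loss + weight * bc_loss
```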

Why Better Than BC or RL Alone?

| Method | Problem | DAPG Solution |
|---|---|---|
| BC alone | Distribution shift, not optimal | RL fine-tuning improves performance |
| RL alone | Sample-inefficient, sparse reward | Demo pre-training provides a good starting point |
| BC → RL (naive) | RL destroys the BC policy | Auxiliary loss keeps policy close to demos |

Impressive Results

DAPG is especially powerful for dexterous manipulation — extremely hard with a 24-DOF robot hand:

| Task | RL alone | BC alone | DAPG |
|---|---|---|---|
| Door opening | 0% (1M steps) | 45% | 95% |
| Tool use | 0% (1M steps) | 30% | 88% |
| Object relocation | 12% (1M steps) | 52% | 92% |

RL alone fails to solve these even after 1 million steps, but DAPG with just 25 demonstrations + 200K RL steps achieves >90% success rate.

Robot hand performing dexterous manipulation task

Collecting Demonstration Data

Data collection is the most important part — a policy can only be as good as its training data. There are 3 main methods:

1. Teleoperation

A person controls the robot remotely via controller, VR headset, or a master device.

| Method | Setup Cost | Quality | Speed |
|---|---|---|---|
| Keyboard/Joystick | Low | Low (6-DoF hard to control) | Slow |
| VR Controllers | Medium | High | Medium |
| Master-slave (ALOHA) | High | Very high | Fast |
| SpaceMouse | Low | Medium | Medium |

ALOHA (system from the ACT paper) uses a master robot — person controls the master robot, slave robot copies movements exactly. Best demonstration quality for manipulation.

2. Kinesthetic Teaching

Person physically guides the robot through the desired trajectory. Robot is in gravity compensation mode — human moves it, robot records joint positions.

3. VR-Based Collection

Use VR headset + controllers to control robot in simulation or real-time.
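Whichever collection method you use, it helps to fix the on-disk episode format early. Below is a minimal recorder sketch that writes one `.npz` per episode in the layout the BC code later in this post expects; `get_obs`, `get_action`, and `step` stand in for your robot and teleop interfaces:

```python
import numpy as np

def record_episode(get_obs, get_action, step, horizon, path):
    """Record one demonstration and save it as a single .npz file.

    get_obs()    -> current observation, shape (obs_dim,)
    get_action() -> expert/teleop action for that observation, shape (act_dim,)
    step(action) -> apply the action on the robot
    """
    observations, actions = [], []
    for _ in range(horizon):
        obs = get_obs()
        act = get_action()
        observations.append(obs)
        actions.append(act)
        step(act)  # move the robot, then observe again on the next loop
    np.savez(
        path,
        observations=np.asarray(observations, dtype=np.float32),
        actions=np.asarray(actions, dtype=np.float32),
    )
```

Calling this once per episode with paths like `demos/pick_and_place/demo_000.npz` produces exactly the layout the `RobotDemoDataset` below globs for.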

Comprehensive Comparison Table

| Criterion | BC | DAgger | DAPG |
|---|---|---|---|
| Expert online? | No (offline) | Yes (each iteration) | No (offline demos) |
| Reward needed? | No | No | Yes (RL phase) |
| Simulation needed? | No | Optional (can be real) | Usually (RL phase) |
| Distribution shift | Yes (major problem) | Solved | Solved (via RL) |
| Sample efficiency | Very high | High | Medium |
| Performance ceiling | Medium | High | Very high |
| Long-horizon tasks | Poor | Good | Good |
| Implementation | Simple | Medium | Complex |
| Demos needed | 50-200 | 20-50 + iterations | 20-50 + RL |
| Best for | Quick start | Safety-critical | Best performance |

Full BC Implementation with PyTorch

Here is a simple but complete implementation of BC — you can use this directly for a robot project:

"""
Behavioral Cloning for robot manipulation
Requires: pip install torch numpy
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np


class RobotDemoDataset(Dataset):
    """Dataset from demonstration files."""

    def __init__(self, demo_dir: str):
        """
        demo_dir: directory containing .npz files, one per demonstration
        Each .npz contains: observations (T, obs_dim), actions (T, act_dim)
        """
        import glob
        self.observations = []
        self.actions = []

        for f in sorted(glob.glob(f"{demo_dir}/*.npz")):
            data = np.load(f)
            self.observations.append(data["observations"])
            self.actions.append(data["actions"])

        self.observations = np.concatenate(self.observations, axis=0)
        self.actions = np.concatenate(self.actions, axis=0)

        # Normalize observations — critical for training stability
        self.obs_mean = self.observations.mean(axis=0)
        self.obs_std = self.observations.std(axis=0) + 1e-8
        self.observations = (self.observations - self.obs_mean) / self.obs_std

        # Standardize actions (zero mean, unit variance)
        self.act_mean = self.actions.mean(axis=0)
        self.act_std = self.actions.std(axis=0) + 1e-8
        self.actions = (self.actions - self.act_mean) / self.act_std

        print(f"Loaded {len(self)} samples from {demo_dir}")
        print(f"  Obs dim: {self.observations.shape[1]}")
        print(f"  Act dim: {self.actions.shape[1]}")

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.observations[idx]),
            torch.FloatTensor(self.actions[idx]),
        )


class BCPolicy(nn.Module):
    """MLP policy for Behavioral Cloning."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
            # No output activation: standardized actions are unbounded,
            # so a Tanh here would clip targets outside [-1, 1]
        )

    def forward(self, obs):
        return self.network(obs)


def train_bc(
    demo_dir: str,
    obs_dim: int = 12,
    act_dim: int = 7,
    epochs: int = 200,
    batch_size: int = 256,
    lr: float = 1e-3,
    save_path: str = "bc_policy.pt",
):
    """Train BC policy from demonstrations."""

    # 1. Load data with train/val split
    dataset = RobotDemoDataset(demo_dir)
    train_size = int(0.9 * len(dataset))
    val_size = len(dataset) - train_size
    train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    # 2. Model + optimizer
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    policy = BCPolicy(obs_dim, act_dim).to(device)
    optimizer = optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    # 3. Training loop
    best_val_loss = float("inf")

    for epoch in range(epochs):
        # Train
        policy.train()
        train_loss = 0
        for obs, actions in train_loader:
            obs, actions = obs.to(device), actions.to(device)
            pred = policy(obs)
            loss = nn.functional.mse_loss(pred, actions)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validate
        policy.eval()
        val_loss = 0
        with torch.no_grad():
            for obs, actions in val_loader:
                obs, actions = obs.to(device), actions.to(device)
                pred = policy(obs)
                val_loss += nn.functional.mse_loss(pred, actions).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        scheduler.step()

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                "model": policy.state_dict(),
                "obs_mean": dataset.obs_mean,
                "obs_std": dataset.obs_std,
                "act_mean": dataset.act_mean,
                "act_std": dataset.act_std,
            }, save_path)

        if epoch % 20 == 0:
            print(f"Epoch {epoch}: train={train_loss:.6f}, val={val_loss:.6f}")

    print(f"Training complete! Best val_loss: {best_val_loss:.6f}")


# Deploy policy on real robot
def deploy_bc_policy(policy_path: str, robot):
    """Load and run BC policy on real robot."""
    # weights_only=False: the checkpoint also stores numpy normalization stats
    checkpoint = torch.load(policy_path, map_location="cpu", weights_only=False)
    policy = BCPolicy(obs_dim=12, act_dim=7)
    policy.load_state_dict(checkpoint["model"])
    policy.eval()

    obs_mean = checkpoint["obs_mean"]
    obs_std = checkpoint["obs_std"]
    act_mean = checkpoint["act_mean"]
    act_std = checkpoint["act_std"]

    while True:
        # Read sensor data from robot
        obs = robot.get_observation()  # np.array shape (12,)

        # Normalize observation (same as during training)
        obs_norm = (obs - obs_mean) / obs_std
        obs_tensor = torch.FloatTensor(obs_norm).unsqueeze(0)

        # Predict action
        with torch.no_grad():
            action_norm = policy(obs_tensor).squeeze(0).numpy()

        # Denormalize action
        action = action_norm * act_std + act_mean

        # Send action to robot
        robot.execute_action(action)


# Run training
if __name__ == "__main__":
    train_bc(
        demo_dir="./demos/pick_and_place/",
        obs_dim=12,   # 6 joints + 3 gripper pos + 3 object pos
        act_dim=7,    # 6 joint velocities + 1 gripper
        epochs=200,
        batch_size=256,
    )

When to Use Which Method?

  1. BC: the fastest path to a working policy. Use it when you have enough demos, the task is short-horizon, and occasional failures are acceptable.
  2. DAgger: use it when an expert can stay in the loop and you need robustness to the robot's own mistakes, e.g. safety-critical settings.
  3. DAPG: use it when you have a simulator and a reward function and want the highest performance ceiling.

Next Steps

BC and DAgger are solid foundations, but single-step prediction struggles with the temporal correlations in robot actions. The next part covers Action Chunking Transformers (ACT), which predict a sequence of actions at once and achieve state-of-the-art results for bimanual manipulation.

