
Imitation Learning: BC, DAgger and DAPG for Robots

Why imitation learning is more important than RL for many manipulation problems — how to collect data and train policies.

Nguyen Anh Tuan · March 8, 2026 · 11 min read

Imitation Learning — Teaching Robot by Example

In Part 1, I introduced RL algorithms for robotics. But in reality, many manipulation tasks don't need RL at all — just show the robot how a human does the task, then have it copy. That's Imitation Learning (IL).

Why does IL matter? Because RL for manipulation is extremely hard:

  • Sparse rewards: Robot gets +1 only on success — millions of steps with no signal before that
  • Safety: RL exploration might break the robot or the environment
  • Sample complexity: SAC needs ~100K steps for simple grasping — that's many hours on a real robot
  • Reward engineering: Designing a good reward function is hard, and a wrong reward means the robot hacks it

Imitation learning solves all of this by skipping rewards entirely — learning directly from expert demonstrations (human teleoperation).

Person controlling robot via teleoperation to collect demonstration data

Behavioral Cloning (BC) — Simplest and Fastest

Behavioral Cloning is the simplest IL method: pure supervised learning on (observation, action) pairs from expert demonstrations.

How It Works

Collect data:
  Expert controls robot → record (state, action) each timestep
  Dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}

Training:
  Minimize L = Σ ||π_θ(s_i) - a_i||²
  → Supervised regression: given state, predict action

Deployment:
  Robot runs learned policy π_θ autonomously

Simply put: BC = regression. Give the model (input, output) pairs and train it to predict output from input. No environment needed, no reward, no simulation.
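To make "BC = regression" concrete, here is a minimal sketch using a made-up linear expert (the 12-D state and 7-D action sizes are arbitrary, and `np.linalg.lstsq` stands in for the neural-network fit used later in this post):

```python
import numpy as np

# Toy illustration: BC is plain regression on (state, action) pairs.
# Assume a synthetic linear "expert": a = W_true @ s + small noise.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(7, 12))              # act_dim=7, obs_dim=12 (arbitrary)
S = rng.normal(size=(1000, 12))                # states visited by the expert
A = S @ W_true.T + 0.01 * rng.normal(size=(1000, 7))  # expert actions

# "Training": least-squares fit of a linear policy pi(s) = W_hat.T @ s
W_hat, *_ = np.linalg.lstsq(S, A, rcond=None)  # shape (12, 7)

# "Deployment": predict an action for a new state
s_new = rng.normal(size=12)
a_pred = s_new @ W_hat                         # close to the expert's action
```

No environment, no reward: the fit recovers the expert's mapping directly from logged pairs. A neural network replaces the linear fit when the mapping is nonlinear.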

Advantages

  • Extremely simple: Just PyTorch + dataset, train in minutes
  • Stable training: Supervised learning, no RL instability
  • Little data: 50-200 demonstrations often enough for a single task
  • Fast: Train in 30 minutes vs hours of RL

Disadvantages — Distribution Shift

BC has one fundamental problem: distribution shift (also called covariate shift).

Training:  Robot sees states from expert trajectory
           s_expert → a_expert (correct path)

Deployment: Robot makes small mistake → deviates → sees new state
            s_new (never seen) → a_??? (doesn't know what to do)
            → errors accumulate → complete failure

Expert always takes the correct path, so training data only contains "good" states. When the robot runs autonomously and makes a small mistake, it ends up in states it's never seen — and doesn't know how to recover. Small errors accumulate into large failures over time.

Solutions for Distribution Shift

  1. Collect more diverse data: Multiple demonstrations — not just one way to do the task
  2. Data augmentation: Add noise to observations during training
  3. Action chunking: Predict many actions at once instead of step-by-step (see Part 3)
  4. DAgger: The next method, specifically designed to fix this
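Mitigation 2 (data augmentation) is easy to add before training. A minimal sketch, assuming your demonstrations are already stacked into numpy arrays (`demo_obs` and `demo_act` here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
demo_obs = rng.normal(size=(500, 12))   # placeholder for recorded observations
demo_act = rng.normal(size=(500, 7))    # placeholder for recorded actions

def augment(obs, act, copies=4, noise_std=0.02):
    """Replicate each (obs, action) pair with small Gaussian noise on the
    observation only — the expert action is still correct for nearby states,
    so the policy learns to map slightly off-path states back to it."""
    obs_rep = np.repeat(obs, copies, axis=0)
    act_rep = np.repeat(act, copies, axis=0)
    obs_rep = obs_rep + noise_std * rng.normal(size=obs_rep.shape)
    return obs_rep, act_rep

obs_aug, act_aug = augment(demo_obs, demo_act)
```

The noise scale should be small relative to the (normalized) observation scale; too much noise and the paired action is no longer a good label.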

DAgger — Iterative, Fixes Distribution Shift

DAgger (Dataset Aggregation) (Ross et al., 2011) solves distribution shift iteratively: let the robot run its current policy, have an expert fix mistakes, add new data to the dataset, retrain.

Algorithm

Algorithm DAgger:
  1. Collect initial dataset D₀ from expert demonstrations
  2. Train policy π₁ on D₀ (BC)
  3. For i = 1, 2, ..., N:
     a. Robot runs π_i → collect states {s₁, s₂, ...}
     b. Expert labels actions for those states: {(s₁, a*₁), (s₂, a*₂), ...}
     c. Aggregate: D_i = D_{i-1} ∪ {new labeled data}
     d. Train π_{i+1} on D_i
  4. Return π_N

Why It Works

The DAgger loop ensures training data covers states the robot actually encounters, not just states from the expert's trajectory. After a few iterations, distribution shift is greatly reduced.
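The loop above can be sketched end-to-end in a toy setting. Everything here is illustrative: a linear expert, a linear policy refit by least squares, and made-up 4-D state / 2-D action dynamics standing in for the robot:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, ACT = 4, 2
W_expert = rng.normal(size=(ACT, OBS))

def expert_label(states):
    """Expert provides the correct action for any queried state (step b)."""
    return states @ W_expert.T

def rollout(W_pi, steps=50):
    """Run the current policy in toy dynamics; return the visited states."""
    s = rng.normal(size=OBS)
    visited = []
    for _ in range(steps):
        a = W_pi @ s
        s = 0.7 * s + 0.1 * np.concatenate([a, a]) + 0.1 * rng.normal(size=OBS)
        visited.append(s.copy())
    return np.array(visited)

# 1. Initial dataset from (noisy) expert demonstrations
S = rng.normal(size=(20, OBS))
A = expert_label(S) + 0.1 * rng.normal(size=(20, ACT))

for i in range(5):
    W_pi = np.linalg.lstsq(S, A, rcond=None)[0].T  # 2/3d. (re)train policy
    S_new = rollout(W_pi)                          # 3a. robot runs its policy
    A_new = expert_label(S_new)                    # 3b. expert labels those states
    S = np.vstack([S, S_new])                      # 3c. aggregate datasets
    A = np.vstack([A, A_new])

W_pi = np.linalg.lstsq(S, A, rcond=None)[0].T      # 4. final policy
```

The key line is 3a: states come from the robot's own rollouts, so the aggregated dataset covers exactly the distribution the deployed policy will see.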

Pros and Cons

Pros:

  • Solves distribution shift — proven theoretical convergence guarantee
  • Much better than BC for long-horizon tasks
  • Iterative improvement — policy gets better each round

Cons:

  • Needs expert online: Expert must be available to label data each iteration — time consuming
  • Safety risk: Robot runs a weak policy in early stages — may collide
  • Many iterations: Typically needs 5-10 rounds to converge

Practical Variants

In practice, vanilla DAgger is rarely used because it requires a constantly available expert. More common variants:

  • SafeDAgger: Only queries expert when policy is uncertain (high uncertainty)
  • HG-DAgger: Expert intervenes directly when robot is about to make a mistake
  • ThriftyDAgger: Optimizes the number of expert queries

DAPG — Combine Demos with RL Fine-tuning

DAPG (Demonstration Augmented Policy Gradient) (Rajeswaran et al., 2018) combines the best of both worlds: start from demonstrations, then fine-tune with RL.

Two Stages

Stage 1: Pre-training (BC)
  → Train initial policy from demonstrations
  → Policy knows "roughly" how to do the task

Stage 2: RL Fine-tuning with Demo-Augmented Objective
  → Use RL (NPG/PPO) to optimize the policy
  → Add an auxiliary imitation term ||π_θ(s) - a_demo||²
    → Keeps the policy near the demonstrations, prevents drift
  → Gradually decay the auxiliary weight over training
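The mixing in Stage 2 reduces to one line per update. This is a simplified sketch, not the paper's exact estimator: DAPG weights the demonstration gradient with a decaying λ₀·λ₁ᵏ schedule, which is what `dapg_gradient` reproduces here:

```python
import numpy as np

def dapg_gradient(grad_rl, grad_demo, k, lam0=0.1, lam1=0.95):
    """Mix the on-policy policy gradient with a decaying demonstration
    (BC) gradient.  k is the RL iteration — the demo term fades over
    training, so early updates stay near the demos and late updates
    are almost pure RL.  lam0/lam1 values are illustrative."""
    return grad_rl + lam0 * (lam1 ** k) * grad_demo

g_early = dapg_gradient(np.ones(3), np.ones(3), k=0)    # demo term active
g_late = dapg_gradient(np.ones(3), np.ones(3), k=100)   # mostly pure RL
```

This is exactly why naive BC → RL fails and DAPG works: the demo term never disappears abruptly, it just fades as the RL signal becomes reliable.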

Why Better Than BC or RL Alone?

Method            Problem                              DAPG Solution
BC alone          Distribution shift, not optimal      RL fine-tuning improves performance
RL alone          Sample-inefficient, sparse reward    Demo pre-training provides a good starting point
BC → RL (naive)   RL destroys the BC policy            Auxiliary term keeps policy close to demos

Impressive Results

DAPG is especially powerful for dexterous manipulation — extremely hard with a 24-DOF robot hand:

Task                RL alone        BC alone   DAPG
Door opening        0% (1M steps)   45%        95%
Tool use            0% (1M steps)   30%        88%
Object relocation   12% (1M steps)  52%        92%

RL alone fails to solve these even after 1 million steps, but DAPG with just 25 demonstrations + 200K RL steps achieves >90% success rate.

Robot hand performing dexterous manipulation task

Collecting Demonstration Data

Data collection is the most important part — a policy can only be as good as its training data. There are three main methods:

1. Teleoperation

A person controls the robot remotely via controller, VR headset, or a master device.

Method                 Setup Cost   Quality                       Speed
Keyboard/Joystick      Low          Low (6-DoF hard to control)   Slow
VR Controllers         Medium       High                          Medium
Master-slave (ALOHA)   High         Very high                     Fast
SpaceMouse             Low          Medium                        Medium

ALOHA (the system from the ACT paper) uses a master-slave setup: the person moves the master arms and the slave arms copy the motion exactly. This gives the best demonstration quality for manipulation.

2. Kinesthetic Teaching

Person physically guides the robot through the desired trajectory. Robot is in gravity compensation mode — human moves it, robot records joint positions.

  • Pros: Intuitive, no extra equipment needed
  • Cons: Only works with collaborative robots (cobots), hard for bimanual tasks

3. VR-Based Collection

Use VR headset + controllers to control robot in simulation or real-time.

  • Pros: Immersive, natural
  • Cons: Latency issues, requires careful calibration

Comprehensive Comparison Table

Criterion              BC                    DAgger                   DAPG
Expert online?         No (offline)          Yes (each iteration)     No (offline demos)
Reward needed?         No                    No                       Yes (RL phase)
Simulation needed?     No                    Optional (can be real)   Usually (RL phase)
Distribution shift     Yes (major problem)   Solved                   Solved (via RL)
Sample efficiency      Very high             High                     Medium
Performance ceiling    Medium                High                     Very high
Long-horizon tasks     Poor                  Good                     Good
Implementation         Simple                Medium                   Complex
Demos needed           50-200                20-50 + iterations       20-50 + RL
Best for               Quick start           Safety-critical          Best performance

Full BC Implementation with PyTorch

Here is a simple but complete implementation of BC — you can use this directly for a robot project:

"""
Behavioral Cloning for robot manipulation
Requires: pip install torch numpy
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np


class RobotDemoDataset(Dataset):
    """Dataset from demonstration files."""

    def __init__(self, demo_dir: str):
        """
        demo_dir: directory containing .npz files, one per demonstration
        Each .npz contains: observations (T, obs_dim), actions (T, act_dim)
        """
        import glob
        self.observations = []
        self.actions = []

        for f in sorted(glob.glob(f"{demo_dir}/*.npz")):
            data = np.load(f)
            self.observations.append(data["observations"])
            self.actions.append(data["actions"])

        self.observations = np.concatenate(self.observations, axis=0)
        self.actions = np.concatenate(self.actions, axis=0)

        # Normalize observations — critical for training stability
        self.obs_mean = self.observations.mean(axis=0)
        self.obs_std = self.observations.std(axis=0) + 1e-8
        self.observations = (self.observations - self.obs_mean) / self.obs_std

        # Normalize actions to [-1, 1]
        self.act_mean = self.actions.mean(axis=0)
        self.act_std = self.actions.std(axis=0) + 1e-8
        self.actions = (self.actions - self.act_mean) / self.act_std

        print(f"Loaded {len(self)} samples from {demo_dir}")
        print(f"  Obs dim: {self.observations.shape[1]}")
        print(f"  Act dim: {self.actions.shape[1]}")

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.observations[idx]),
            torch.FloatTensor(self.actions[idx]),
        )


class BCPolicy(nn.Module):
    """MLP policy for Behavioral Cloning."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
            nn.Tanh(),  # Output in [-1, 1] → denormalize at deploy time
        )

    def forward(self, obs):
        return self.network(obs)


def train_bc(
    demo_dir: str,
    obs_dim: int = 12,
    act_dim: int = 7,
    epochs: int = 200,
    batch_size: int = 256,
    lr: float = 1e-3,
    save_path: str = "bc_policy.pt",
):
    """Train BC policy from demonstrations."""

    # 1. Load data with train/val split
    dataset = RobotDemoDataset(demo_dir)
    train_size = int(0.9 * len(dataset))
    val_size = len(dataset) - train_size
    train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    # 2. Model + optimizer
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    policy = BCPolicy(obs_dim, act_dim).to(device)
    optimizer = optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    # 3. Training loop
    best_val_loss = float("inf")

    for epoch in range(epochs):
        # Train
        policy.train()
        train_loss = 0
        for obs, actions in train_loader:
            obs, actions = obs.to(device), actions.to(device)
            pred = policy(obs)
            loss = nn.functional.mse_loss(pred, actions)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validate
        policy.eval()
        val_loss = 0
        with torch.no_grad():
            for obs, actions in val_loader:
                obs, actions = obs.to(device), actions.to(device)
                pred = policy(obs)
                val_loss += nn.functional.mse_loss(pred, actions).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        scheduler.step()

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                "model": policy.state_dict(),
                "obs_mean": dataset.obs_mean,
                "obs_std": dataset.obs_std,
                "act_mean": dataset.act_mean,
                "act_std": dataset.act_std,
            }, save_path)

        if epoch % 20 == 0:
            print(f"Epoch {epoch}: train={train_loss:.6f}, val={val_loss:.6f}")

    print(f"Training complete! Best val_loss: {best_val_loss:.6f}")


# Deploy policy on real robot
def deploy_bc_policy(policy_path: str, robot):
    """Load and run BC policy on real robot."""
    checkpoint = torch.load(policy_path, map_location="cpu")
    policy = BCPolicy(obs_dim=12, act_dim=7)
    policy.load_state_dict(checkpoint["model"])
    policy.eval()

    obs_mean = checkpoint["obs_mean"]
    obs_std = checkpoint["obs_std"]
    act_mean = checkpoint["act_mean"]
    act_std = checkpoint["act_std"]

    while True:
        # Read sensor data from robot
        obs = robot.get_observation()  # np.array shape (12,)

        # Normalize observation (same as during training)
        obs_norm = (obs - obs_mean) / obs_std
        obs_tensor = torch.FloatTensor(obs_norm).unsqueeze(0)

        # Predict action
        with torch.no_grad():
            action_norm = policy(obs_tensor).squeeze(0).numpy()

        # Denormalize action
        action = action_norm * act_std + act_mean

        # Send action to robot
        robot.execute_action(action)


# Run training
if __name__ == "__main__":
    train_bc(
        demo_dir="./demos/pick_and_place/",
        obs_dim=12,   # 6 joints + 3 gripper pos + 3 object pos
        act_dim=7,    # 6 joint velocities + 1 gripper
        epochs=200,
        batch_size=256,
    )

When to Use Which Method?

  • Little data, need it fast → BC (train in 30 min, deploy immediately)
  • Long-horizon tasks, need accuracy → DAgger (iterative improvement)
  • Need the best possible performance → DAPG (BC + RL fine-tuning)
  • Vision-based observations → see Part 3: Action Chunking Transformers (ACT), the state of the art for vision-based manipulation

Next Steps

BC and DAgger are solid foundations, but single-step prediction struggles with the temporal correlations in robot actions. The next part covers Action Chunking Transformers (ACT), which predict multiple actions at once and achieve state-of-the-art results for bimanual manipulation.



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
