VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam
VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
  1. Home
  2. Blog
  3. Imitation Learning: BC, DAgger and DAPG for Robots
aiai-perceptionimitation-learningprogramming

Imitation Learning: BC, DAgger and DAPG for Robots

Why imitation learning is more important than RL for many manipulation problems — how to collect data and train policies.

Nguyen Anh TuanMarch 8, 202611 min readUpdated: Jun 16, 2026

Imitation Learning — Teaching Robot by Example

In Part 1, I introduced RL algorithms for robotics. But reality: many manipulation tasks don't need RL — just show the robot how humans do it, then have it copy. That's Imitation Learning (IL).

Why does IL matter? Because RL for manipulation is extremely hard:

  • Sparse rewards: Robot gets +1 only on success — millions of steps with no signal before that
  • Safety: RL exploration might break the robot or the environment
  • Sample complexity: SAC needs ~100K steps for simple grasping — that's many hours on a real robot
  • Reward engineering: Designing a good reward function is hard, and a wrong reward means the robot hacks it

Imitation learning solves all of this by skipping rewards entirely — learning directly from expert demonstrations (human teleoperation).

Person controlling robot via teleoperation to collect demonstration data
Person controlling robot via teleoperation to collect demonstration data

Behavioral Cloning (BC) — Simplest and Fastest

Behavioral Cloning is the simplest IL method: pure supervised learning on (observation, action) pairs from expert demonstrations.

How It Works

Collect data:
  Expert controls robot → record (state, action) each timestep
  Dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}

Training:
  Minimize L = Σ ||π_θ(s_i) - a_i||²
  → Supervised regression: given state, predict action

Deployment:
  Robot runs learned policy π_θ autonomously

Simply put: BC = regression. Give the model (input, output) pairs and train it to predict output from input. No environment needed, no reward, no simulation.

Advantages

  • Extremely simple: Just PyTorch + dataset, train in minutes
  • Stable training: Supervised learning, no RL instability
  • Little data: 50-200 demonstrations often enough for a single task
  • Fast: Train in 30 minutes vs hours of RL

Disadvantages — Distribution Shift

BC has one fundamental problem: distribution shift (also called covariate shift).

Training:  Robot sees states from expert trajectory
           s_expert → a_expert (correct path)

Deployment: Robot makes small mistake → deviates → sees new state
            s_new (never seen) → a_??? (doesn't know what to do)
            → errors accumulate → complete failure

Expert always takes the correct path, so training data only contains "good" states. When the robot runs autonomously and makes a small mistake, it ends up in states it's never seen — and doesn't know how to recover. Small errors accumulate into large failures over time.

Solutions for Distribution Shift

  1. Collect more diverse data: Multiple demonstrations — not just one way to do the task
  2. Data augmentation: Add noise to observations during training
  3. Action chunking: Predict many actions at once instead of step-by-step (see Part 3)
  4. DAgger: The next method, specifically designed to fix this

DAgger — Iterative, Fixes Distribution Shift

DAgger (Dataset Aggregation) (Ross et al., 2011) solves distribution shift iteratively: let the robot run its current policy, have an expert fix mistakes, add new data to the dataset, retrain.

Algorithm

Algorithm DAgger:
  1. Collect initial dataset D₀ from expert demonstrations
  2. Train policy π₁ on D₀ (BC)
  3. For i = 1, 2, ..., N:
     a. Robot runs π_i → collect states {s₁, s₂, ...}
     b. Expert labels actions for those states: {(s₁, a*₁), (s₂, a*₂), ...}
     c. Aggregate: D_i = D_{i-1} ∪ {new labeled data}
     d. Train π_{i+1} on D_i
  4. Return π_N

Why It Works

The DAgger loop ensures training data covers states the robot actually encounters, not just states from the expert's trajectory. After a few iterations, distribution shift is greatly reduced.

Pros and Cons

Pros:

  • Solves distribution shift — proven theoretical convergence guarantee
  • Much better than BC for long-horizon tasks
  • Iterative improvement — policy gets better each round

Cons:

  • Needs expert online: Expert must be available to label data each iteration — time consuming
  • Safety risk: Robot runs a weak policy in early stages — may collide
  • Many iterations: Typically needs 5-10 rounds to converge

Practical Variants

In practice, vanilla DAgger is rarely used because it requires a constantly available expert. More common variants:

  • SafeDAgger: Only queries expert when policy is uncertain (high uncertainty)
  • HG-DAgger: Expert intervenes directly when robot is about to make a mistake
  • ThriftyDAgger: Optimizes the number of expert queries

DAPG — Combine Demos with RL Fine-tuning

DAPG (Demonstration Augmented Policy Gradient) (Rajeswaran et al., 2018) is best of both worlds: start from demonstrations, then fine-tune with RL.

Two Stages

Stage 1: Pre-training (BC)
  → Train initial policy from demonstrations
  → Policy knows "roughly" how to do the task

Stage 2: RL Fine-tuning with Demo Augmented Reward
  → Use RL (PPO/NPG) to optimize policy
  → Add auxiliary reward: ||π(s) - π_demo(s)||
    → Keeps policy near demonstrations, prevents drift
  → Gradually reduce auxiliary reward over time

Why Better Than BC or RL Alone?

Method Problem DAPG Solution
BC alone Distribution shift, not optimal RL fine-tuning improves performance
RL alone Sample inefficient, sparse reward Demo pre-training provides good starting point
BC → RL (naive) RL destroys BC policy Auxiliary reward keeps policy close to demos

Impressive Results

DAPG is especially powerful for dexterous manipulation — extremely hard with a 24-DOF robot hand:

Task RL alone BC alone DAPG
Door opening 0% (1M steps) 45% 95%
Tool use 0% (1M steps) 30% 88%
Object relocation 12% (1M steps) 52% 92%

RL alone fails to solve these even after 1 million steps, but DAPG with just 25 demonstrations + 200K RL steps achieves >90% success rate.

Robot hand performing dexterous manipulation task
Robot hand performing dexterous manipulation task

Collecting Demonstration Data

Data collection is the most important part — a policy can only be as good as its training data. There are 3 main methods:

1. Teleoperation

A person controls the robot remotely via controller, VR headset, or a master device.

Method Setup Cost Quality Speed
Keyboard/Joystick Low Low (6-DoF hard to control) Slow
VR Controllers Medium High Medium
Master-slave (ALOHA) High Very high Fast
SpaceMouse Low Medium Medium

ALOHA (system from the ACT paper) uses a master robot — person controls the master robot, slave robot copies movements exactly. Best demonstration quality for manipulation.

2. Kinesthetic Teaching

Person physically guides the robot through the desired trajectory. Robot is in gravity compensation mode — human moves it, robot records joint positions.

  • Pros: Intuitive, no extra equipment needed
  • Cons: Only works with collaborative robots (cobots), hard for bimanual tasks

3. VR-Based Collection

Use VR headset + controllers to control robot in simulation or real-time.

  • Pros: Immersive, natural
  • Cons: Latency issues, requires careful calibration

Comprehensive Comparison Table

Criterion BC DAgger DAPG
Expert online? No (offline) Yes (each iteration) No (offline demos)
Reward needed? No No Yes (RL phase)
Simulation needed? No Optional (can be real) Usually (RL phase)
Distribution shift Yes (major problem) Solved Solved (via RL)
Sample efficiency Very high High Medium
Performance ceiling Medium High Very high
Long-horizon tasks Poor Good Good
Implementation Simple Medium Complex
Demos needed 50-200 20-50 + iterations 20-50 + RL
Best for Quick start Safety-critical Best performance

Full BC Implementation with PyTorch

Here is a simple but complete implementation of BC — you can use this directly for a robot project:

"""
Behavioral Cloning for robot manipulation
Requires: pip install torch numpy
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np


class RobotDemoDataset(Dataset):
    """Dataset from demonstration files."""

    def __init__(self, demo_dir: str):
        """
        demo_dir: directory containing .npz files, one per demonstration
        Each .npz contains: observations (T, obs_dim), actions (T, act_dim)
        """
        import glob
        self.observations = []
        self.actions = []

        for f in sorted(glob.glob(f"{demo_dir}/*.npz")):
            data = np.load(f)
            self.observations.append(data["observations"])
            self.actions.append(data["actions"])

        self.observations = np.concatenate(self.observations, axis=0)
        self.actions = np.concatenate(self.actions, axis=0)

        # Normalize observations — critical for training stability
        self.obs_mean = self.observations.mean(axis=0)
        self.obs_std = self.observations.std(axis=0) + 1e-8
        self.observations = (self.observations - self.obs_mean) / self.obs_std

        # Normalize actions to [-1, 1]
        self.act_mean = self.actions.mean(axis=0)
        self.act_std = self.actions.std(axis=0) + 1e-8
        self.actions = (self.actions - self.act_mean) / self.act_std

        print(f"Loaded {len(self)} samples from {demo_dir}")
        print(f"  Obs dim: {self.observations.shape[1]}")
        print(f"  Act dim: {self.actions.shape[1]}")

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.observations[idx]),
            torch.FloatTensor(self.actions[idx]),
        )


class BCPolicy(nn.Module):
    """MLP policy for Behavioral Cloning."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
            nn.Tanh(),  # Output in [-1, 1] → denormalize at deploy time
        )

    def forward(self, obs):
        return self.network(obs)


def train_bc(
    demo_dir: str,
    obs_dim: int = 12,
    act_dim: int = 7,
    epochs: int = 200,
    batch_size: int = 256,
    lr: float = 1e-3,
    save_path: str = "bc_policy.pt",
):
    """Train BC policy from demonstrations."""

    # 1. Load data with train/val split
    dataset = RobotDemoDataset(demo_dir)
    train_size = int(0.9 * len(dataset))
    val_size = len(dataset) - train_size
    train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    # 2. Model + optimizer
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    policy = BCPolicy(obs_dim, act_dim).to(device)
    optimizer = optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    # 3. Training loop
    best_val_loss = float("inf")

    for epoch in range(epochs):
        # Train
        policy.train()
        train_loss = 0
        for obs, actions in train_loader:
            obs, actions = obs.to(device), actions.to(device)
            pred = policy(obs)
            loss = nn.functional.mse_loss(pred, actions)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validate
        policy.eval()
        val_loss = 0
        with torch.no_grad():
            for obs, actions in val_loader:
                obs, actions = obs.to(device), actions.to(device)
                pred = policy(obs)
                val_loss += nn.functional.mse_loss(pred, actions).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        scheduler.step()

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                "model": policy.state_dict(),
                "obs_mean": dataset.obs_mean,
                "obs_std": dataset.obs_std,
                "act_mean": dataset.act_mean,
                "act_std": dataset.act_std,
            }, save_path)

        if epoch % 20 == 0:
            print(f"Epoch {epoch}: train={train_loss:.6f}, val={val_loss:.6f}")

    print(f"Training complete! Best val_loss: {best_val_loss:.6f}")


# Deploy policy on real robot
def deploy_bc_policy(policy_path: str, robot):
    """Load and run BC policy on real robot."""
    checkpoint = torch.load(policy_path, map_location="cpu")
    policy = BCPolicy(obs_dim=12, act_dim=7)
    policy.load_state_dict(checkpoint["model"])
    policy.eval()

    obs_mean = checkpoint["obs_mean"]
    obs_std = checkpoint["obs_std"]
    act_mean = checkpoint["act_mean"]
    act_std = checkpoint["act_std"]

    while True:
        # Read sensor data from robot
        obs = robot.get_observation()  # np.array shape (12,)

        # Normalize observation (same as during training)
        obs_norm = (obs - obs_mean) / obs_std
        obs_tensor = torch.FloatTensor(obs_norm).unsqueeze(0)

        # Predict action
        with torch.no_grad():
            action_norm = policy(obs_tensor).squeeze(0).numpy()

        # Denormalize action
        action = action_norm * act_std + act_mean

        # Send action to robot
        robot.execute_action(action)


# Run training
if __name__ == "__main__":
    train_bc(
        demo_dir="./demos/pick_and_place/",
        obs_dim=12,   # 6 joints + 3 gripper pos + 3 object pos
        act_dim=7,    # 6 joint velocities + 1 gripper
        epochs=200,
        batch_size=256,
    )

When to Use Which Method?

  • Little data, need it fast → BC (train in 30 min, deploy immediately)
  • Long-horizon tasks, need accuracy → DAgger (iterative improvement)
  • Need best possible performance → DAPG (BC + RL fine-tuning)
  • Vision-based observations → See Part 3: Action Chunking Transformers (ACT) (state-of-the-art for vision-based manipulation)

Next Steps

BC and DAgger are solid foundations, but single-step prediction has serious problems with temporal correlations in robot actions. In the next part, Action Chunking Transformers (ACT) — predicting multiple actions at once, achieving state-of-the-art results for bimanual manipulation.


Tool recommendations

VLA train/deploy stack

Train on cloud/workstation, then deploy optimized models to Jetson or the robot computer.

Cloud GPU for VLA / policy training Use for imitation learning, diffusion policies, RL, and robotics model fine-tuning. View cloud GPU → NVIDIA Jetson Orin NX / Orin Nano Edge deployment hardware for perception, logging, and optimized inference. View Jetson → Hugging Face / robotics dataset hosting Host datasets, checkpoints, and model cards for cleaner LeRobot/VLA workflows. View platform →

Related Articles

  • RL for Robotics: PPO, SAC and Choosing Algorithm — Part 1: RL foundation
  • Action Chunking Transformers (ACT): Architecture Deep Dive — Part 3: State-of-the-art imitation learning
  • Foundation Models for Robot: RT-2, Octo, OpenVLA — Foundation models combining IL and pre-training
  • Sim-to-Real Transfer: Train Simulation, Run Real Robot — Transfer policy from sim to real
NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Fleet MonitoringROS 2 IntegrationAMR Solutions
ai-robotics — Phần 2/8
← RL for Robotics: PPO, SAC and How to Choose Your AlgorithmAction Chunking Transformers (ACT): Architecture in Detail →

Related Posts

Tutorial
Robot AI
ai-perceptionrlprogrammingPart 1
ai

RL cho Robotics: PPO, SAC và cách chọn algorithm

Tổng quan RL algorithms cho robotics — PPO, SAC, TD3 và hướng dẫn chọn algorithm phù hợp cho từng bài toán robot.

3/5/202610 min read
NT
Tutorial
Hands-on: Fine-tune OpenVLA với LeRobot
ai-perceptionvlatutorialPart 7
ai

Hands-on: Fine-tune OpenVLA với LeRobot

Tutorial thực hành — fine-tune OpenVLA trên custom data, LoRA, quantization và deploy trên robot thật.

3/23/202613 min read
NT
Research
SpatialVLA: 3D understanding cho robot manipulation
ai-perceptionvlaresearchPart 6
ai

SpatialVLA: 3D understanding cho robot manipulation

Tại sao 3D spatial awareness quan trọng cho VLA — Ego3D Position Encoding và Adaptive Action Grids.

3/20/202610 min read
NT
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam