Imitation Learning — Teaching Robots by Example
In Part 1, I introduced RL algorithms for robotics. But in practice, many manipulation tasks don't need RL at all: just show the robot how a human does the task, then have it copy. That's Imitation Learning (IL).
Why does IL matter? Because RL for manipulation is extremely hard:
- Sparse rewards: Robot gets +1 only on success — millions of steps with no signal before that
- Safety: RL exploration might break the robot or the environment
- Sample complexity: SAC needs ~100K steps for simple grasping — that's many hours on a real robot
- Reward engineering: Designing a good reward function is hard, and a wrong reward means the robot hacks it
Imitation learning sidesteps all of this by skipping rewards entirely: it learns directly from expert demonstrations (e.g., human teleoperation).
Behavioral Cloning (BC) — Simplest and Fastest
Behavioral Cloning is the simplest IL method: pure supervised learning on (observation, action) pairs from expert demonstrations.
How It Works
Collect data:
Expert controls robot → record (state, action) each timestep
Dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}
Training:
Minimize L = Σ ||π_θ(s_i) - a_i||²
→ Supervised regression: given state, predict action
Deployment:
Robot runs learned policy π_θ autonomously
Simply put: BC = regression. Give the model (input, output) pairs and train it to predict output from input. No environment needed, no reward, no simulation.
Advantages
- Extremely simple: Just PyTorch + dataset, train in minutes
- Stable training: Supervised learning, no RL instability
- Little data: 50-200 demonstrations are often enough for a single task
- Fast: Train in 30 minutes vs hours of RL
Disadvantages — Distribution Shift
BC has one fundamental problem: distribution shift (also called covariate shift).
Training: Robot sees states from expert trajectory
s_expert → a_expert (correct path)
Deployment: Robot makes small mistake → deviates → sees new state
s_new (never seen) → a_??? (doesn't know what to do)
→ errors accumulate → complete failure
Expert always takes the correct path, so training data only contains "good" states. When the robot runs autonomously and makes a small mistake, it ends up in states it's never seen — and doesn't know how to recover. Small errors accumulate into large failures over time.
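The compounding effect is easy to see in a toy setting. The sketch below is purely illustrative (the numbers are made up, not from any experiment): a policy with a small per-step bias, whose error also grows with how far off-distribution it already is, drifts away from the expert path at a compounding rate.

```python
# Toy 1D illustration of compounding error (illustrative numbers, not from
# any experiment). The expert keeps the state at 0; the learned policy has a
# small constant bias, and once off-distribution its per-step error grows
# with how far it has already drifted.
def rollout_error(bias=0.01, sensitivity=0.1, steps=100):
    """Return |state| after `steps`, where each step adds `bias` plus an
    extra error proportional to the current distance from the expert path."""
    s = 0.0
    for _ in range(steps):
        s += bias + sensitivity * abs(s)   # the error feeds back on itself
    return abs(s)

print(f"after 10 steps:  {rollout_error(steps=10):.2f}")
print(f"after 100 steps: {rollout_error(steps=100):.2f}")
# The 100-step error is thousands of times the 10-step error, not 10x:
# small mistakes compound instead of averaging out.
```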
Solutions for Distribution Shift
- Collect more diverse data: Multiple demonstrations — not just one way to do the task
- Data augmentation: Add noise to observations during training
- Action chunking: Predict many actions at once instead of step-by-step (see Part 3)
- DAgger: The next method, specifically designed to fix this
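Of these, observation-noise augmentation is trivial to bolt onto a BC pipeline. A minimal sketch (the function name and the `noise_std` default are my own illustrative choices):

```python
import numpy as np

def augment_observations(observations, noise_std=0.01, copies=4, seed=0):
    """Replicate each observation `copies` times with small Gaussian noise,
    so the policy also sees states slightly off the expert path.
    Keep the original action labels paired with the noisy copies: near an
    expert state, the expert's action is still (approximately) correct.
    noise_std=0.01 is a hypothetical default; tune it to your observation scale."""
    rng = np.random.default_rng(seed)
    noisy = [observations]
    for _ in range(copies):
        noisy.append(observations + rng.normal(0.0, noise_std, observations.shape))
    return np.concatenate(noisy, axis=0)

obs = np.zeros((100, 12))        # 100 timesteps of 12-dim observations
aug = augment_observations(obs)
print(aug.shape)                 # (500, 12): the original plus 4 noisy copies
```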
DAgger — Iterative, Fixes Distribution Shift
DAgger (Dataset Aggregation) (Ross et al., 2011) solves distribution shift iteratively: let the robot run its current policy, have an expert fix mistakes, add new data to the dataset, retrain.
Algorithm
Algorithm DAgger:
1. Collect initial dataset D₀ from expert demonstrations
2. Train policy π₁ on D₀ (BC)
3. For i = 1, 2, ..., N:
a. Robot runs π_i → collect states {s₁, s₂, ...}
b. Expert labels actions for those states: {(s₁, a*₁), (s₂, a*₂), ...}
c. Aggregate: D_i = D_{i-1} ∪ {new labeled data}
d. Train π_{i+1} on D_i
4. Return π_N
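The loop above can be sketched in a few lines of Python. The environment, expert, and training routine are stand-ins here (in practice the expert labels come from a human and `fit` is your BC trainer):

```python
import numpy as np

def dagger(expert_policy, rollout, fit, n_iters=5, n_init=50):
    """DAgger sketch. `rollout(policy)` returns the states the policy visits,
    `expert_policy(s)` returns the expert action for a state, and
    `fit(states, actions)` trains a policy on the aggregated data.
    All three are placeholders for your environment, human expert, and learner."""
    # Steps 1-2: initial dataset from the expert, then behavioral cloning
    states = [np.random.randn(12) for _ in range(n_init)]   # stand-in expert states
    actions = [expert_policy(s) for s in states]
    policy = fit(states, actions)
    # Step 3: run the learner, label its states with the expert, aggregate, retrain
    for _ in range(n_iters):
        visited = rollout(policy)                 # states the *learner* reaches
        states.extend(visited)
        actions.extend(expert_policy(s) for s in visited)   # expert labels them
        policy = fit(states, actions)
    return policy, len(states)

# Smoke test with a linear "expert" and a dummy mean-predicting learner
expert = lambda s: -0.5 * s
roll = lambda pi: [np.random.randn(12) for _ in range(10)]
fit = lambda S, A: (lambda s: np.mean(A, axis=0))
policy, n = dagger(expert, roll, fit)
print(n)   # 50 initial + 5 iterations * 10 states = 100
```

The key line is the `extend` on states the *learner* visited: that is what pulls the training distribution toward the states the robot actually encounters.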
Why It Works
The DAgger loop ensures training data covers states the robot actually encounters, not just states from the expert's trajectory. After a few iterations, distribution shift is greatly reduced.
Pros and Cons
Pros:
- Addresses distribution shift, backed by a no-regret theoretical guarantee (Ross et al., 2011)
- Much better than BC for long-horizon tasks
- Iterative improvement — policy gets better each round
Cons:
- Needs expert online: An expert must be available to label data each iteration, which is time-consuming
- Safety risk: Robot runs a weak policy in early stages — may collide
- Many iterations: Typically needs 5-10 rounds to converge
Practical Variants
In practice, vanilla DAgger is rarely used because it requires a constantly available expert. More common variants:
- SafeDAgger: Queries the expert only when the policy's uncertainty is high
- HG-DAgger: Expert intervenes directly when robot is about to make a mistake
- ThriftyDAgger: Optimizes the number of expert queries
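The common thread in these variants is deciding *when* to ask the expert. One standard uncertainty signal is disagreement within an ensemble of policies; the sketch below uses that idea (the ensemble size, disagreement metric, and threshold are illustrative choices, not taken from the original papers):

```python
import numpy as np

def should_query_expert(ensemble, state, threshold=0.1):
    """Query the expert only when an ensemble of policies disagrees.
    `ensemble` is a list of policy callables; disagreement is measured as
    the mean per-dimension std of their predicted actions.
    The 0.1 threshold is illustrative; calibrate it on held-out data."""
    preds = np.stack([pi(state) for pi in ensemble])   # (n_policies, act_dim)
    disagreement = preds.std(axis=0).mean()
    return disagreement > threshold

# Three dummy policies that agree near the data and diverge far from it
mk = lambda w: (lambda s: w * s)
ensemble = [mk(0.9), mk(1.0), mk(1.1)]
near = np.full(7, 0.1)     # small-magnitude state: predictions nearly agree
far = np.full(7, 10.0)     # large-magnitude state: predictions diverge
print(should_query_expert(ensemble, near))   # False
print(should_query_expert(ensemble, far))    # True
```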
DAPG — Combine Demos with RL Fine-tuning
DAPG (Demonstration Augmented Policy Gradient) (Rajeswaran et al., 2018) aims for the best of both worlds: start from demonstrations, then fine-tune with RL.
Two Stages
Stage 1: Pre-training (BC)
→ Train initial policy from demonstrations
→ Policy knows "roughly" how to do the task
Stage 2: RL Fine-tuning with Demo Augmented Reward
→ Use RL (PPO/NPG) to optimize policy
→ Add an auxiliary imitation term that penalizes deviation from demonstrated actions: ||π(s) - a_demo||
→ Keeps the policy near the demonstrations, prevents drift
→ Gradually reduce the weight of this term over time
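The "gradually reduce" step is typically an exponentially decaying weight on the imitation term. A minimal sketch of that schedule (the `lam0`/`decay` constants are hypothetical; the actual DAPG gradient uses a decaying-weight log-likelihood term on demo actions rather than an MSE):

```python
import numpy as np

def dapg_style_loss(policy_pred, demo_actions, rl_loss, step, lam0=0.1, decay=0.99):
    """Combine an RL objective with a decaying imitation term.
    `rl_loss` stands in for whatever PPO/NPG computes; the imitation term
    anchors the policy to the demos early in training, then fades out.
    lam0 and decay are illustrative hyperparameters, not from the paper."""
    lam = lam0 * decay ** step                 # imitation weight shrinks over training
    imitation = np.mean((policy_pred - demo_actions) ** 2)
    return rl_loss + lam * imitation, lam

pred = np.zeros((32, 7))                       # dummy policy outputs on demo states
demo = np.ones((32, 7))                        # dummy demonstrated actions
_, lam_early = dapg_style_loss(pred, demo, rl_loss=1.0, step=0)
_, lam_late = dapg_style_loss(pred, demo, rl_loss=1.0, step=500)
print(lam_early, lam_late)   # 0.1 at step 0, ~0.00066 after 500 steps
```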
Why Better Than BC or RL Alone?
| Method | Problem | DAPG Solution |
|---|---|---|
| BC alone | Distribution shift, not optimal | RL fine-tuning improves performance |
| RL alone | Sample inefficient, sparse reward | Demo pre-training provides good starting point |
| BC → RL (naive) | RL destroys BC policy | Auxiliary reward keeps policy close to demos |
Impressive Results
DAPG is especially powerful for dexterous manipulation — extremely hard with a 24-DOF robot hand:
| Task | RL alone | BC alone | DAPG |
|---|---|---|---|
| Door opening | 0% (1M steps) | 45% | 95% |
| Tool use | 0% (1M steps) | 30% | 88% |
| Object relocation | 12% (1M steps) | 52% | 92% |
RL alone fails to solve these even after 1 million steps, but DAPG with just 25 demonstrations + 200K RL steps achieves >90% success rate.
Collecting Demonstration Data
Data collection is the most important part: a policy can only be as good as its training data. There are three main methods:
1. Teleoperation
A person controls the robot remotely via controller, VR headset, or a master device.
| Method | Setup Cost | Quality | Speed |
|---|---|---|---|
| Keyboard/Joystick | Low | Low (6-DoF hard to control) | Slow |
| VR Controllers | Medium | High | Medium |
| Master-slave (ALOHA) | High | Very high | Fast |
| SpaceMouse | Low | Medium | Medium |
ALOHA (the system from the ACT paper) uses a master-slave setup: the person moves the master arms and the slave robot copies the movements exactly. It gives the best demonstration quality for manipulation.
2. Kinesthetic Teaching
Person physically guides the robot through the desired trajectory. Robot is in gravity compensation mode — human moves it, robot records joint positions.
- Pros: Intuitive, no extra equipment needed
- Cons: Only works with collaborative robots (cobots), hard for bimanual tasks
3. VR-Based Collection
Use VR headset + controllers to control robot in simulation or real-time.
- Pros: Immersive, natural
- Cons: Latency issues, requires careful calibration
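Whichever interface you use, the recording loop is the same: log an (observation, action) pair every control step and save one file per demonstration. A sketch that produces the .npz format the RobotDemoDataset loader in the full BC implementation below expects (`get_obs` and `get_teleop_action` are stand-ins for your robot and teleop interface):

```python
import numpy as np

def record_demo(get_obs, get_teleop_action, steps, path):
    """Record one demonstration as arrays of shape (T, obs_dim)/(T, act_dim)
    and save it as a .npz with keys 'observations' and 'actions', the
    format RobotDemoDataset loads. `get_obs`/`get_teleop_action` stand in
    for your sensor readout and teleoperation device."""
    observations, actions = [], []
    for _ in range(steps):
        obs = get_obs()                 # read sensors
        act = get_teleop_action()       # read the human's command
        observations.append(obs)
        actions.append(act)
        # on a real robot: also execute `act`, then sleep to hold the control rate
    np.savez(path,
             observations=np.asarray(observations, dtype=np.float32),
             actions=np.asarray(actions, dtype=np.float32))

# Dummy interfaces standing in for real sensors/teleop
record_demo(lambda: np.zeros(12), lambda: np.zeros(7), steps=50, path="demo_000.npz")
data = np.load("demo_000.npz")
print(data["observations"].shape, data["actions"].shape)   # (50, 12) (50, 7)
```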
Comprehensive Comparison Table
| Criterion | BC | DAgger | DAPG |
|---|---|---|---|
| Expert online? | No (offline) | Yes (each iteration) | No (offline demos) |
| Reward needed? | No | No | Yes (RL phase) |
| Simulation needed? | No | Optional (can be real) | Usually (RL phase) |
| Distribution shift | Yes (major problem) | Solved | Solved (via RL) |
| Sample efficiency | Very high | High | Medium |
| Performance ceiling | Medium | High | Very high |
| Long-horizon tasks | Poor | Good | Good |
| Implementation | Simple | Medium | Complex |
| Demos needed | 50-200 | 20-50 + iterations | 20-50 + RL |
| Best for | Quick start | Safety-critical | Best performance |
Full BC Implementation with PyTorch
Here is a simple but complete implementation of BC — you can use this directly for a robot project:
"""
Behavioral Cloning for robot manipulation
Requires: pip install torch numpy
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
class RobotDemoDataset(Dataset):
"""Dataset from demonstration files."""
def __init__(self, demo_dir: str):
"""
demo_dir: directory containing .npz files, one per demonstration
Each .npz contains: observations (T, obs_dim), actions (T, act_dim)
"""
import glob
self.observations = []
self.actions = []
for f in sorted(glob.glob(f"{demo_dir}/*.npz")):
data = np.load(f)
self.observations.append(data["observations"])
self.actions.append(data["actions"])
self.observations = np.concatenate(self.observations, axis=0)
self.actions = np.concatenate(self.actions, axis=0)
# Normalize observations — critical for training stability
self.obs_mean = self.observations.mean(axis=0)
self.obs_std = self.observations.std(axis=0) + 1e-8
self.observations = (self.observations - self.obs_mean) / self.obs_std
# Standardize actions (zero mean, unit std)
self.act_mean = self.actions.mean(axis=0)
self.act_std = self.actions.std(axis=0) + 1e-8
self.actions = (self.actions - self.act_mean) / self.act_std
print(f"Loaded {len(self)} samples from {demo_dir}")
print(f" Obs dim: {self.observations.shape[1]}")
print(f" Act dim: {self.actions.shape[1]}")
def __len__(self):
return len(self.observations)
def __getitem__(self, idx):
return (
torch.FloatTensor(self.observations[idx]),
torch.FloatTensor(self.actions[idx]),
)
class BCPolicy(nn.Module):
"""MLP policy for Behavioral Cloning."""
def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
super().__init__()
self.network = nn.Sequential(
nn.Linear(obs_dim, hidden_dim),
nn.ReLU(),
nn.LayerNorm(hidden_dim),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.LayerNorm(hidden_dim),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, act_dim),
# No Tanh: actions are standardized (zero mean, unit std), not bounded to [-1, 1]
)
def forward(self, obs):
return self.network(obs)
def train_bc(
demo_dir: str,
obs_dim: int = 12,
act_dim: int = 7,
epochs: int = 200,
batch_size: int = 256,
lr: float = 1e-3,
save_path: str = "bc_policy.pt",
):
"""Train BC policy from demonstrations."""
# 1. Load data with train/val split
dataset = RobotDemoDataset(demo_dir)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size)
# 2. Model + optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = BCPolicy(obs_dim, act_dim).to(device)
optimizer = optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# 3. Training loop
best_val_loss = float("inf")
for epoch in range(epochs):
# Train
policy.train()
train_loss = 0
for obs, actions in train_loader:
obs, actions = obs.to(device), actions.to(device)
pred = policy(obs)
loss = nn.functional.mse_loss(pred, actions)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
optimizer.step()
train_loss += loss.item()
# Validate
policy.eval()
val_loss = 0
with torch.no_grad():
for obs, actions in val_loader:
obs, actions = obs.to(device), actions.to(device)
pred = policy(obs)
val_loss += nn.functional.mse_loss(pred, actions).item()
train_loss /= len(train_loader)
val_loss /= len(val_loader)
scheduler.step()
# Save best model
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save({
"model": policy.state_dict(),
"obs_mean": dataset.obs_mean,
"obs_std": dataset.obs_std,
"act_mean": dataset.act_mean,
"act_std": dataset.act_std,
}, save_path)
if epoch % 20 == 0:
print(f"Epoch {epoch}: train={train_loss:.6f}, val={val_loss:.6f}")
print(f"Training complete! Best val_loss: {best_val_loss:.6f}")
# Deploy policy on real robot
def deploy_bc_policy(policy_path: str, robot):
"""Load and run BC policy on real robot."""
checkpoint = torch.load(policy_path, map_location="cpu", weights_only=False)  # checkpoint stores numpy arrays
policy = BCPolicy(obs_dim=12, act_dim=7)
policy.load_state_dict(checkpoint["model"])
policy.eval()
obs_mean = checkpoint["obs_mean"]
obs_std = checkpoint["obs_std"]
act_mean = checkpoint["act_mean"]
act_std = checkpoint["act_std"]
while True:
# Read sensor data from robot
obs = robot.get_observation() # np.array shape (12,)
# Normalize observation (same as during training)
obs_norm = (obs - obs_mean) / obs_std
obs_tensor = torch.FloatTensor(obs_norm).unsqueeze(0)
# Predict action
with torch.no_grad():
action_norm = policy(obs_tensor).squeeze(0).numpy()
# Denormalize action
action = action_norm * act_std + act_mean
# Send action to robot
robot.execute_action(action)
# Run training
if __name__ == "__main__":
train_bc(
demo_dir="./demos/pick_and_place/",
obs_dim=12, # 6 joints + 3 gripper pos + 3 object pos
act_dim=7, # 6 joint velocities + 1 gripper
epochs=200,
batch_size=256,
)
When to Use Which Method?
- Little data, need it fast → BC (train in 30 min, deploy immediately)
- Long-horizon tasks, need accuracy → DAgger (iterative improvement)
- Need best possible performance → DAPG (BC + RL fine-tuning)
- Vision-based observations → See Part 3: Action Chunking Transformers (ACT) (state-of-the-art for vision-based manipulation)
Next Steps
BC and DAgger are solid foundations, but single-step action prediction struggles with the temporal correlations in robot actions. The next part covers Action Chunking Transformers (ACT), which predict multiple actions at once and achieve state-of-the-art results for bimanual manipulation.
Related Articles
- RL for Robotics: PPO, SAC and Choosing Algorithm — Part 1: RL foundation
- Action Chunking Transformers (ACT): Architecture Deep Dive — Part 3: State-of-the-art imitation learning
- Foundation Models for Robot: RT-2, Octo, OpenVLA — Foundation models combining IL and pre-training
- Sim-to-Real Transfer: Train Simulation, Run Real Robot — Transfer policy from sim to real