Imitation Learning — Teaching Robots by Example
In Part 1, I introduced RL algorithms for robotics. But in practice, many manipulation tasks don't need RL at all: just show the robot how a human does the task, then have it copy. That's Imitation Learning (IL).
Why does IL matter? Because RL for manipulation is extremely hard:
- Sparse rewards: Robot gets +1 only on success — millions of steps with no signal before that
- Safety: RL exploration might break the robot or the environment
- Sample complexity: SAC needs ~100K steps for simple grasping — that's many hours on a real robot
- Reward engineering: Designing a good reward function is hard, and a wrong reward means the robot hacks it
Imitation learning sidesteps all of this by skipping rewards entirely: it learns directly from expert demonstrations (e.g., human teleoperation).
Behavioral Cloning (BC) — Simplest and Fastest
Behavioral Cloning is the simplest IL method: pure supervised learning on (observation, action) pairs from expert demonstrations.
How It Works
Collect data:
Expert controls robot → record (state, action) each timestep
Dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}
Training:
Minimize L = Σ ||π_θ(s_i) - a_i||²
→ Supervised regression: given state, predict action
Deployment:
Robot runs learned policy π_θ autonomously
Simply put: BC = regression. Give the model (input, output) pairs and train it to predict output from input. No environment needed, no reward, no simulation.
Advantages
- Extremely simple: Just PyTorch + dataset, train in minutes
- Stable training: Supervised learning, no RL instability
- Little data: 50-200 demonstrations are often enough for a single task
- Fast: Train in 30 minutes vs hours of RL
Disadvantages — Distribution Shift
BC has one fundamental problem: distribution shift (also called covariate shift).
Training: Robot sees states from expert trajectory
s_expert → a_expert (correct path)
Deployment: Robot makes small mistake → deviates → sees new state
s_new (never seen) → a_??? (doesn't know what to do)
→ errors accumulate → complete failure
Expert always takes the correct path, so training data only contains "good" states. When the robot runs autonomously and makes a small mistake, it ends up in states it's never seen — and doesn't know how to recover. Small errors accumulate into large failures over time.
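The compounding effect is easy to see in a toy setting. The sketch below is purely illustrative (the numbers are made up, not from any experiment): a policy with a small per-step bias, whose error also grows with how far off-distribution it already is, drifts away from the expert path at a compounding rate.

```python
# Toy 1D illustration of compounding error (illustrative numbers, not from
# any experiment). The expert keeps the state at 0; the learned policy has a
# small constant bias, and once off-distribution its per-step error grows
# with how far it has already drifted.
def rollout_error(bias=0.01, sensitivity=0.1, steps=100):
    """Return |state| after `steps`, where each step adds `bias` plus an
    extra error proportional to the current distance from the expert path."""
    s = 0.0
    for _ in range(steps):
        s += bias + sensitivity * abs(s)   # the error feeds back on itself
    return abs(s)

print(f"after 10 steps:  {rollout_error(steps=10):.2f}")
print(f"after 100 steps: {rollout_error(steps=100):.2f}")
# The 100-step error is thousands of times the 10-step error, not 10x:
# small mistakes compound instead of averaging out.
```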
Solutions for Distribution Shift
- Collect more diverse data: Multiple demonstrations — not just one way to do the task
- Data augmentation: Add noise to observations during training
- Action chunking: Predict many actions at once instead of step-by-step (see Part 3)
- DAgger: The next method, specifically designed to fix this
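Of these, observation-noise augmentation is trivial to bolt onto a BC pipeline. A minimal sketch (the function name and the `noise_std` default are my own illustrative choices):

```python
import numpy as np

def augment_observations(observations, noise_std=0.01, copies=4, seed=0):
    """Replicate each observation `copies` times with small Gaussian noise,
    so the policy also sees states slightly off the expert path.
    Keep the original action labels paired with the noisy copies: near an
    expert state, the expert's action is still (approximately) correct.
    noise_std=0.01 is a hypothetical default; tune it to your observation scale."""
    rng = np.random.default_rng(seed)
    noisy = [observations]
    for _ in range(copies):
        noisy.append(observations + rng.normal(0.0, noise_std, observations.shape))
    return np.concatenate(noisy, axis=0)

obs = np.zeros((100, 12))        # 100 timesteps of 12-dim observations
aug = augment_observations(obs)
print(aug.shape)                 # (500, 12): the original plus 4 noisy copies
```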
DAgger — Iterative, Fixes Distribution Shift
DAgger (Dataset Aggregation) (Ross et al., 2011) solves distribution shift iteratively: let the robot run its current policy, have an expert fix mistakes, add new data to the dataset, retrain.
Algorithm
Algorithm DAgger:
1. Collect initial dataset D₀ from expert demonstrations
2. Train policy π₁ on D₀ (BC)
3. For i = 1, 2, ..., N:
a. Robot runs π_i → collect states {s₁, s₂, ...}
b. Expert labels actions for those states: {(s₁, a*₁), (s₂, a*₂), ...}
c. Aggregate: D_i = D_{i-1} ∪ {new labeled data}
d. Train π_{i+1} on D_i
4. Return π_N
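The loop above can be sketched in a few lines of Python. The environment, expert, and training routine are stand-ins here (in practice the expert labels come from a human and `fit` is your BC trainer):

```python
import numpy as np

def dagger(expert_policy, rollout, fit, n_iters=5, n_init=50):
    """DAgger sketch. `rollout(policy)` returns the states the policy visits,
    `expert_policy(s)` returns the expert action for a state, and
    `fit(states, actions)` trains a policy on the aggregated data.
    All three are placeholders for your environment, human expert, and learner."""
    # Steps 1-2: initial dataset from the expert, then behavioral cloning
    states = [np.random.randn(12) for _ in range(n_init)]   # stand-in expert states
    actions = [expert_policy(s) for s in states]
    policy = fit(states, actions)
    # Step 3: run the learner, label its states with the expert, aggregate, retrain
    for _ in range(n_iters):
        visited = rollout(policy)                 # states the *learner* reaches
        states.extend(visited)
        actions.extend(expert_policy(s) for s in visited)   # expert labels them
        policy = fit(states, actions)
    return policy, len(states)

# Smoke test with a linear "expert" and a dummy mean-predicting learner
expert = lambda s: -0.5 * s
roll = lambda pi: [np.random.randn(12) for _ in range(10)]
fit = lambda S, A: (lambda s: np.mean(A, axis=0))
policy, n = dagger(expert, roll, fit)
print(n)   # 50 initial + 5 iterations * 10 states = 100
```

The key line is the `extend` on states the *learner* visited: that is what pulls the training distribution toward the states the robot actually encounters.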
Why It Works
The DAgger loop ensures training data covers states the robot actually encounters, not just states from the expert's trajectory. After a few iterations, distribution shift is greatly reduced.
Pros and Cons
Pros:
- Addresses distribution shift, backed by a no-regret theoretical guarantee (Ross et al., 2011)
- Much better than BC for long-horizon tasks
- Iterative improvement — policy gets better each round
Cons:
- Needs expert online: An expert must be available to label data each iteration, which is time-consuming
- Safety risk: Robot runs a weak policy in early stages — may collide
- Many iterations: Typically needs 5-10 rounds to converge
Practical Variants
In practice, vanilla DAgger is rarely used because it requires a constantly available expert. More common variants:
- SafeDAgger: Queries the expert only when the policy's uncertainty is high
- HG-DAgger: Expert intervenes directly when robot is about to make a mistake
- ThriftyDAgger: Optimizes the number of expert queries
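The common thread in these variants is deciding *when* to ask the expert. One standard uncertainty signal is disagreement within an ensemble of policies; the sketch below uses that idea (the ensemble size, disagreement metric, and threshold are illustrative choices, not taken from the original papers):

```python
import numpy as np

def should_query_expert(ensemble, state, threshold=0.1):
    """Query the expert only when an ensemble of policies disagrees.
    `ensemble` is a list of policy callables; disagreement is measured as
    the mean per-dimension std of their predicted actions.
    The 0.1 threshold is illustrative; calibrate it on held-out data."""
    preds = np.stack([pi(state) for pi in ensemble])   # (n_policies, act_dim)
    disagreement = preds.std(axis=0).mean()
    return disagreement > threshold

# Three dummy policies that agree near the data and diverge far from it
mk = lambda w: (lambda s: w * s)
ensemble = [mk(0.9), mk(1.0), mk(1.1)]
near = np.full(7, 0.1)     # small-magnitude state: predictions nearly agree
far = np.full(7, 10.0)     # large-magnitude state: predictions diverge
print(should_query_expert(ensemble, near))   # False
print(should_query_expert(ensemble, far))    # True
```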
DAPG — Combine Demos with RL Fine-tuning
DAPG (Demonstration Augmented Policy Gradient) (Rajeswaran et al., 2018) aims for the best of both worlds: start from demonstrations, then fine-tune with RL.
Two Stages
Stage 1: Pre-training (BC)
→ Train initial policy from demonstrations
→ Policy knows "roughly" how to do the task
Stage 2: RL Fine-tuning with Demo Augmented Reward
→ Use RL (PPO/NPG) to optimize policy
→ Add an auxiliary imitation term that penalizes deviation from demonstrated actions: ||π(s) - a_demo||
→ Keeps the policy near the demonstrations, prevents drift
→ Gradually reduce the weight of this term over time
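The "gradually reduce" step is typically an exponentially decaying weight on the imitation term. A minimal sketch of that schedule (the `lam0`/`decay` constants are hypothetical; the actual DAPG gradient uses a decaying-weight log-likelihood term on demo actions rather than an MSE):

```python
import numpy as np

def dapg_style_loss(policy_pred, demo_actions, rl_loss, step, lam0=0.1, decay=0.99):
    """Combine an RL objective with a decaying imitation term.
    `rl_loss` stands in for whatever PPO/NPG computes; the imitation term
    anchors the policy to the demos early in training, then fades out.
    lam0 and decay are illustrative hyperparameters, not from the paper."""
    lam = lam0 * decay ** step                 # imitation weight shrinks over training
    imitation = np.mean((policy_pred - demo_actions) ** 2)
    return rl_loss + lam * imitation, lam

pred = np.zeros((32, 7))                       # dummy policy outputs on demo states
demo = np.ones((32, 7))                        # dummy demonstrated actions
_, lam_early = dapg_style_loss(pred, demo, rl_loss=1.0, step=0)
_, lam_late = dapg_style_loss(pred, demo, rl_loss=1.0, step=500)
print(lam_early, lam_late)   # 0.1 at step 0, ~0.00066 after 500 steps
```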
Why Better Than BC or RL Alone?
| Method | Problem | DAPG Solution |
|---|---|---|
| BC alone | Distribution shift, not optimal | RL fine-tuning improves performance |
| RL alone | Sample inefficient, sparse reward | Demo pre-training provides good starting point |
| BC → RL (naive) | RL destroys BC policy | Auxiliary reward keeps policy close to demos |
Impressive Results
DAPG is especially powerful for dexterous manipulation — extremely hard with a 24-DOF robot hand:
| Task | RL alone | BC alone | DAPG |
|---|---|---|---|
| Door opening | 0% (1M steps) | 45% | 95% |
| Tool use | 0% (1M steps) | 30% | 88% |
| Object relocation | 12% (1M steps) | 52% | 92% |
RL alone fails to solve these even after 1 million steps, but DAPG with just 25 demonstrations + 200K RL steps achieves >90% success rate.
Collecting Demonstration Data
Data collection is the most important part: a policy can only be as good as its training data. There are three main methods:
1. Teleoperation
A person controls the robot remotely via controller, VR headset, or a master device.
| Method | Setup Cost | Quality | Speed |
|---|---|---|---|
| Keyboard/Joystick | Low | Low (6-DoF hard to control) | Slow |
| VR Controllers | Medium | High | Medium |
| Master-slave (ALOHA) | High | Very high | Fast |
| SpaceMouse | Low | Medium | Medium |
ALOHA (the system from the ACT paper) uses a master-slave setup: the person moves the master arms and the slave robot copies the movements exactly. It gives the best demonstration quality for manipulation.
2. Kinesthetic Teaching
Person physically guides the robot through the desired trajectory. Robot is in gravity compensation mode — human moves it, robot records joint positions.
- Pros: Intuitive, no extra equipment needed
- Cons: Only works with collaborative robots (cobots), hard for bimanual tasks
3. VR-Based Collection
Use VR headset + controllers to control robot in simulation or real-time.
- Pros: Immersive, natural
- Cons: Latency issues, requires careful calibration
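Whichever interface you use, the recording loop is the same: log an (observation, action) pair every control step and save one file per demonstration. A sketch that produces the .npz format the RobotDemoDataset loader in the full BC implementation below expects (`get_obs` and `get_teleop_action` are stand-ins for your robot and teleop interface):

```python
import numpy as np

def record_demo(get_obs, get_teleop_action, steps, path):
    """Record one demonstration as arrays of shape (T, obs_dim)/(T, act_dim)
    and save it as a .npz with keys 'observations' and 'actions', the
    format RobotDemoDataset loads. `get_obs`/`get_teleop_action` stand in
    for your sensor readout and teleoperation device."""
    observations, actions = [], []
    for _ in range(steps):
        obs = get_obs()                 # read sensors
        act = get_teleop_action()       # read the human's command
        observations.append(obs)
        actions.append(act)
        # on a real robot: also execute `act`, then sleep to hold the control rate
    np.savez(path,
             observations=np.asarray(observations, dtype=np.float32),
             actions=np.asarray(actions, dtype=np.float32))

# Dummy interfaces standing in for real sensors/teleop
record_demo(lambda: np.zeros(12), lambda: np.zeros(7), steps=50, path="demo_000.npz")
data = np.load("demo_000.npz")
print(data["observations"].shape, data["actions"].shape)   # (50, 12) (50, 7)
```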
Comprehensive Comparison Table
| Criterion | BC | DAgger | DAPG |
|---|---|---|---|
| Expert online? | No (offline) | Yes (each iteration) | No (offline demos) |
| Reward needed? | No | No | Yes (RL phase) |
| Simulation needed? | No | Optional (can be real) | Usually (RL phase) |
| Distribution shift | Yes (major problem) | Solved | Solved (via RL) |
| Sample efficiency | Very high | High | Medium |
| Performance ceiling | Medium | High | Very high |
| Long-horizon tasks | Poor | Good | Good |
| Implementation | Simple | Medium | Complex |
| Demos needed | 50-200 | 20-50 + iterations | 20-50 + RL |
| Best for | Quick start | Safety-critical | Best performance |
Full BC Implementation with PyTorch
Here is a simple but complete implementation of BC — you can use this directly for a robot project:
"""
Behavioral Cloning for robot manipulation
Requires: pip install torch numpy
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
class RobotDemoDataset(Dataset):
"""Dataset from demonstration files."""
def __init__(self, demo_dir: str):
"""
demo_dir: directory containing .npz files, one per demonstration
Each .npz contains: observations (T, obs_dim), actions (T, act_dim)
"""
import glob
self.observations = []
self.actions = []
for f in sorted(glob.glob(f"{demo_dir}/*.npz")):
data = np.load(f)
self.observations.append(data["observations"])
self.actions.append(data["actions"])
self.observations = np.concatenate(self.observations, axis=0)
self.actions = np.concatenate(self.actions, axis=0)
# Normalize observations — critical for training stability
self.obs_mean = self.observations.mean(axis=0)
self.obs_std = self.observations.std(axis=0) + 1e-8
self.observations = (self.observations - self.obs_mean) / self.obs_std
# Standardize actions (zero mean, unit std)
self.act_mean = self.actions.mean(axis=0)
self.act_std = self.actions.std(axis=0) + 1e-8
self.actions = (self.actions - self.act_mean) / self.act_std
print(f"Loaded {len(self)} samples from {demo_dir}")
print(f" Obs dim: {self.observations.shape[1]}")
print(f" Act dim: {self.actions.shape[1]}")
def __len__(self):
return len(self.observations)
def __getitem__(self, idx):
return (
torch.FloatTensor(self.observations[idx]),
torch.FloatTensor(self.actions[idx]),
)
class BCPolicy(nn.Module):
"""MLP policy for Behavioral Cloning."""
def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
super().__init__()
self.network = nn.Sequential(
nn.Linear(obs_dim, hidden_dim),
nn.ReLU(),
nn.LayerNorm(hidden_dim),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.LayerNorm(hidden_dim),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, act_dim),
# No Tanh: actions are standardized (zero mean, unit std), not bounded to [-1, 1]
)
def forward(self, obs):
return self.network(obs)
def train_bc(
demo_dir: str,
obs_dim: int = 12,
act_dim: int = 7,
epochs: int = 200,
batch_size: int = 256,
lr: float = 1e-3,
save_path: str = "bc_policy.pt",
):
"""Train BC policy from demonstrations."""
# 1. Load data with train/val split
dataset = RobotDemoDataset(demo_dir)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size)
# 2. Model + optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = BCPolicy(obs_dim, act_dim).to(device)
optimizer = optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# 3. Training loop
best_val_loss = float("inf")
for epoch in range(epochs):
# Train
policy.train()
train_loss = 0
for obs, actions in train_loader:
obs, actions = obs.to(device), actions.to(device)
pred = policy(obs)
loss = nn.functional.mse_loss(pred, actions)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
optimizer.step()
train_loss += loss.item()
# Validate
policy.eval()
val_loss = 0
with torch.no_grad():
for obs, actions in val_loader:
obs, actions = obs.to(device), actions.to(device)
pred = policy(obs)
val_loss += nn.functional.mse_loss(pred, actions).item()
train_loss /= len(train_loader)
val_loss /= len(val_loader)
scheduler.step()
# Save best model
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save({
"model": policy.state_dict(),
"obs_mean": dataset.obs_mean,
"obs_std": dataset.obs_std,
"act_mean": dataset.act_mean,
"act_std": dataset.act_std,
}, save_path)
if epoch % 20 == 0:
print(f"Epoch {epoch}: train={train_loss:.6f}, val={val_loss:.6f}")
print(f"Training complete! Best val_loss: {best_val_loss:.6f}")
# Deploy policy on real robot
def deploy_bc_policy(policy_path: str, robot):
"""Load and run BC policy on real robot."""
checkpoint = torch.load(policy_path, map_location="cpu", weights_only=False)  # checkpoint stores numpy arrays
policy = BCPolicy(obs_dim=12, act_dim=7)
policy.load_state_dict(checkpoint["model"])
policy.eval()
obs_mean = checkpoint["obs_mean"]
obs_std = checkpoint["obs_std"]
act_mean = checkpoint["act_mean"]
act_std = checkpoint["act_std"]
while True:
# Read sensor data from robot
obs = robot.get_observation() # np.array shape (12,)
# Normalize observation (same as during training)
obs_norm = (obs - obs_mean) / obs_std
obs_tensor = torch.FloatTensor(obs_norm).unsqueeze(0)
# Predict action
with torch.no_grad():
action_norm = policy(obs_tensor).squeeze(0).numpy()
# Denormalize action
action = action_norm * act_std + act_mean
# Send action to robot
robot.execute_action(action)
# Run training
if __name__ == "__main__":
train_bc(
demo_dir="./demos/pick_and_place/",
obs_dim=12, # 6 joints + 3 gripper pos + 3 object pos
act_dim=7, # 6 joint velocities + 1 gripper
epochs=200,
batch_size=256,
)
When to Use Which Method?
- Little data, need it fast → BC (train in 30 min, deploy immediately)
- Long-horizon tasks, need accuracy → DAgger (iterative improvement)
- Need best possible performance → DAPG (BC + RL fine-tuning)
- Vision-based observations → See Part 3: Action Chunking Transformers (ACT) (state-of-the-art for vision-based manipulation)
Next Steps
BC and DAgger are solid foundations, but single-step action prediction struggles with the temporal correlations in robot actions. The next part covers Action Chunking Transformers (ACT), which predict multiple actions at once and achieve state-of-the-art results for bimanual manipulation.
Related Articles
- RL for Robotics: PPO, SAC and Choosing Algorithm — Part 1: RL foundation
- Action Chunking Transformers (ACT): Architecture Deep Dive — Part 3: State-of-the-art imitation learning
- Foundation Models for Robot: RT-2, Octo, OpenVLA — Foundation models combining IL and pre-training
- Sim-to-Real Transfer: Train Simulation, Run Real Robot — Transfer policy from sim to real