Introduction: From Data to Actions
In the previous post, we collected a demonstration dataset for a pick-and-place task. Now it's time to turn that data into a policy — a model capable of autonomously controlling the robot to complete tasks without human intervention.
This post will guide you through training the two most popular policies in LeRobot: ACT (Action Chunking with Transformers) and Diffusion Policy. We'll directly compare the performance of both methods on the same dataset, learn how to tune hyperparameters, and visualize results.
ACT: Action Chunking with Transformers
Core Idea
ACT, introduced by Zhao et al. (RSS 2023), tackles a critical problem: compounding errors over time. Instead of predicting one action per step (where small errors accumulate into drift), ACT predicts a chunk of multiple actions simultaneously.
The ACT architecture consists of:
- CVAE Encoder: Compresses a sequence of actions into a latent vector z
- Transformer Decoder: From observation + z, generates a chunk of actions
- Action chunking: Instead of 1 action/step, predicts k actions at once
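The effect of chunking on the control loop can be sketched in a few lines. `predict_chunk` here is a stand-in for a real policy, not LeRobot's API; the point is that with chunk size k the model is queried only once every k steps:

```python
import numpy as np

def predict_chunk(observation, k):
    """Stand-in for the policy: returns k actions at once."""
    return np.zeros((k, 2))  # k actions, 2-DoF action space

def run_chunked(num_steps=300, chunk_size=100):
    """Execute a whole chunk before re-querying the model."""
    queries = 0
    buffer = []
    for t in range(num_steps):
        if not buffer:
            # Buffer empty: query the model for the next chunk
            buffer = list(predict_chunk(observation=None, k=chunk_size))
            queries += 1
        action = buffer.pop(0)  # execute the next action in the chunk
    return queries

print(run_chunked())  # 3 model queries instead of 300
```

With per-step prediction the model would be queried 300 times; with a chunk size of 100 it is queried 3 times, which is also why chunking helps inference latency in the benchmarks later in this post.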
ACT Configuration and Training
```python
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import torch
from torch.utils.data import DataLoader

# Load dataset
dataset = LeRobotDataset("lerobot/pusht")

# ACT configuration — key hyperparameters
act_config = ACTConfig(
    input_shapes={
        "observation.image": [3, 96, 96],
        "observation.state": [2],
    },
    output_shapes={
        "action": [2],
    },
    input_normalization_modes={
        "observation.image": "mean_std",
        "observation.state": "min_max",
    },
    output_normalization_modes={
        "action": "min_max",
    },
    # === Key hyperparameters ===
    chunk_size=100,       # Number of actions predicted at once
    n_action_steps=100,   # Actions to execute before re-predicting
    # Transformer architecture
    dim_model=512,        # Hidden dimension
    n_heads=8,            # Attention heads
    n_layers=1,           # Transformer layers (fewer = faster)
    # CVAE settings
    latent_dim=32,        # Latent dimension for CVAE
    use_vae=True,         # Enable CVAE (important!)
    kl_weight=10.0,       # KL loss weight
    # Vision encoder
    vision_backbone="resnet18",  # resnet18 or resnet50
    pretrained_backbone=True,    # Use pretrained ImageNet weights
)

# Create policy and optimizer
policy = ACTPolicy(act_config)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy.to(device)

optimizer = torch.optim.AdamW(
    policy.parameters(),
    lr=1e-5,
    weight_decay=1e-4,
)

# Dataloader
dataloader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

# Training loop
num_epochs = 100
policy.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    num_batches = 0
    for batch in dataloader:
        # Move tensors to device (skip non-tensor metadata entries)
        batch = {k: v.to(device) for k, v in batch.items() if isinstance(v, torch.Tensor)}
        # Forward pass — ACT computes loss internally
        output = policy.forward(batch)
        loss = output["loss"]
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)
        optimizer.step()
        epoch_loss += loss.item()
        num_batches += 1
    avg_loss = epoch_loss / num_batches
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

# Save checkpoint
torch.save(policy.state_dict(), "act_checkpoint.pt")
```
Hyperparameter Tuning for ACT
| Parameter | Default | Range to Try | Impact |
|---|---|---|---|
| `chunk_size` | 100 | 20-200 | Larger = smoother but less responsive |
| `n_action_steps` | 100 | = chunk_size | Usually equals chunk_size |
| `kl_weight` | 10.0 | 1.0-100.0 | High = less diverse, low = multimodal |
| `dim_model` | 512 | 256-1024 | Larger = more expressive, slower |
| `lr` | 1e-5 | 1e-6 to 1e-4 | Start low, increase gradually |
| `vision_backbone` | resnet18 | resnet18/50 | resnet50 better but slower |
Diffusion Policy
Core Idea
Diffusion Policy (Chi et al., RSS 2023) applies diffusion models to robot learning. Instead of predicting actions directly, it generates actions from noise through iterative denoising — similar to how Stable Diffusion generates images.
Its biggest advantage is the ability to handle multi-modal action distributions. When the same observation admits several valid actions (e.g., grasping an object from the left or from the right), Diffusion Policy captures these modes better than ACT.
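The generation process can be sketched with a toy DDPM sampling loop. `predict_noise` below is a stand-in for the trained UNet, and a linear beta schedule replaces the `squaredcos_cap_v2` schedule used later in this post, just to keep the sketch short:

```python
import torch

def predict_noise(noisy_actions, t):
    """Stand-in for the trained UNet: predicts the noise added at step t."""
    return torch.zeros_like(noisy_actions)

# Linear noise schedule (a real setup would use squaredcos_cap_v2)
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Start from pure Gaussian noise and denoise step by step
actions = torch.randn(1, 16, 2)  # (batch, horizon, action_dim)
for t in reversed(range(T)):
    eps = predict_noise(actions, t)
    # DDPM posterior mean for step t
    actions = (actions - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:
        # Add scaled noise on all but the final step
        actions = actions + torch.sqrt(betas[t]) * torch.randn_like(actions)

print(actions.shape)  # torch.Size([1, 16, 2])
```

The `num_inference_steps` hyperparameter below is exactly the length of this loop, which is why reducing it (or switching to DDIM, which skips steps) speeds up inference.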
Diffusion Policy Configuration and Training
```python
from lerobot.common.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

# Diffusion Policy configuration
diff_config = DiffusionConfig(
    input_shapes={
        "observation.image": [3, 96, 96],
        "observation.state": [2],
    },
    output_shapes={
        "action": [2],
    },
    input_normalization_modes={
        "observation.image": "mean_std",
        "observation.state": "min_max",
    },
    output_normalization_modes={
        "action": "min_max",
    },
    # === Diffusion-specific hyperparameters ===
    num_inference_steps=100,      # Denoising steps during inference
    # UNet architecture
    down_dims=[256, 512, 1024],   # Channel dims for each level
    # Observation horizons
    n_obs_steps=2,                # Number of observation frames as input
    horizon=16,                   # Planning horizon (action sequence length)
    n_action_steps=8,             # Number of actions to execute
    # Noise scheduler
    noise_scheduler_type="DDPM",  # DDPM or DDIM
    beta_schedule="squaredcos_cap_v2",
    # Vision encoder
    vision_backbone="resnet18",
    crop_shape=[84, 84],          # Random crop for augmentation
)

# Create policy
diff_policy = DiffusionPolicy(diff_config)
diff_policy.to(device)

# Optimizer — Diffusion typically needs a higher lr than ACT
optimizer = torch.optim.AdamW(
    diff_policy.parameters(),
    lr=1e-4,
    weight_decay=1e-6,
    betas=(0.95, 0.999),
)

# Learning rate scheduler — Diffusion typically needs more epochs than ACT
num_epochs = 200
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs,
    eta_min=1e-6,
)

# Training loop
diff_policy.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    num_batches = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items() if isinstance(v, torch.Tensor)}
        output = diff_policy.forward(batch)
        loss = output["loss"]
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(diff_policy.parameters(), max_norm=10.0)
        optimizer.step()
        epoch_loss += loss.item()
        num_batches += 1
    lr_scheduler.step()
    avg_loss = epoch_loss / num_batches
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f} | "
              f"LR: {lr_scheduler.get_last_lr()[0]:.2e}")

torch.save(diff_policy.state_dict(), "diffusion_checkpoint.pt")
```
Optimizing Inference Speed for Diffusion Policy
A drawback of Diffusion Policy is slow inference due to multiple denoising steps. Here are optimization strategies:
```python
import time
import numpy as np

# 1. Use DDIM instead of DDPM — fewer steps, similar results
diff_config_fast = DiffusionConfig(
    num_inference_steps=10,       # Reduced from 100 to 10
    noise_scheduler_type="DDIM",  # DDIM faster than DDPM
    # ... keep other params
)

# 2. Benchmark inference time
diff_policy.eval()
obs = {k: v[:1].to(device) for k, v in next(iter(dataloader)).items()
       if k.startswith("observation")}

# Warm up
for _ in range(5):
    with torch.no_grad():
        _ = diff_policy.select_action(obs)

# Benchmark
times = []
for _ in range(50):
    start = time.perf_counter()
    with torch.no_grad():
        action = diff_policy.select_action(obs)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for GPU work before stopping the timer
    times.append(time.perf_counter() - start)

print(f"Inference time: {np.mean(times)*1000:.1f} +/- {np.std(times)*1000:.1f} ms")
# ACT: ~5-10ms | Diffusion (100 steps): ~50-100ms | Diffusion DDIM (10 steps): ~10-20ms
```
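Whether a given latency is acceptable depends on your control frequency, and chunking changes the math: one slow inference is amortized over `n_action_steps` executed actions. A quick sanity check with a hypothetical helper (assuming a simple blocking control loop with no asynchronous inference):

```python
def max_control_rate_hz(inference_ms, n_action_steps=1):
    """Highest control rate a blocking loop can sustain when one
    inference call produces n_action_steps executable actions."""
    return 1000.0 * n_action_steps / inference_ms

# ACT: ~8 ms for a chunk covering 100 steps — inference is never the bottleneck
print(max_control_rate_hz(8, n_action_steps=100))   # 12500.0
# Diffusion, DDPM 100 steps: ~80 ms for 8 executable actions
print(max_control_rate_hz(80, n_action_steps=8))    # 100.0
# Diffusion, DDIM 10 steps: ~15 ms for 8 executable actions
print(max_control_rate_hz(15, n_action_steps=8))    # ~533
```

So even DDPM's 80 ms can sustain a 100 Hz loop as long as you execute 8 actions per inference; the cost shows up as reduced responsiveness, not as a hard rate limit.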
ACT vs Diffusion Policy Comparison
Evaluation on the Same Dataset
```python
import gymnasium as gym
import numpy as np
import torch

def evaluate_policy(policy, env_name, n_episodes=50, max_steps=500):
    """Evaluate policy in environment."""
    env = gym.make(env_name)
    policy.eval()
    results = {
        "success_rate": 0,
        "avg_steps": 0,
        "avg_reward": 0,
        "path_lengths": [],
    }
    successes = 0
    total_steps = 0
    total_reward = 0
    for ep in range(n_episodes):
        obs, info = env.reset()
        ep_reward = 0
        ep_steps = 0
        positions = []
        for step in range(max_steps):
            obs_tensor = {
                k: torch.tensor(v).unsqueeze(0).to(device)
                for k, v in obs.items()
            }
            with torch.no_grad():
                action = policy.select_action(obs_tensor)
            obs, reward, terminated, truncated, info = env.step(
                action.squeeze(0).cpu().numpy()
            )
            ep_reward += reward
            ep_steps += 1
            positions.append(obs.get("achieved_goal", np.zeros(3)))
            if terminated or truncated:
                break
        if info.get("is_success", False):
            successes += 1
        total_steps += ep_steps
        total_reward += ep_reward
        # Calculate path length
        positions = np.array(positions)
        path_length = np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))
        results["path_lengths"].append(path_length)
    results["success_rate"] = successes / n_episodes
    results["avg_steps"] = total_steps / n_episodes
    results["avg_reward"] = total_reward / n_episodes
    results["path_efficiency"] = np.mean(results["path_lengths"])
    env.close()
    return results

# Evaluate both policies (requires the gym_pusht package, which registers the PushT env)
act_results = evaluate_policy(policy, "gym_pusht/PushT-v0", n_episodes=50)
diff_results = evaluate_policy(diff_policy, "gym_pusht/PushT-v0", n_episodes=50)

# Print comparison
print(f"\n{'Metric':<20} {'ACT':>10} {'Diffusion':>10}")
print(f"{'='*42}")
print(f"{'Success Rate':<20} {act_results['success_rate']:>9.1%} {diff_results['success_rate']:>9.1%}")
print(f"{'Avg Steps':<20} {act_results['avg_steps']:>10.1f} {diff_results['avg_steps']:>10.1f}")
print(f"{'Avg Reward':<20} {act_results['avg_reward']:>10.2f} {diff_results['avg_reward']:>10.2f}")
print(f"{'Path Efficiency':<20} {act_results['path_efficiency']:>10.3f} {diff_results['path_efficiency']:>10.3f}")
```
Summary Comparison Table
| Criteria | ACT | Diffusion Policy |
|---|---|---|
| Inference speed | ~5-10ms | ~50-100ms (DDPM), ~10-20ms (DDIM) |
| Training epochs | 100-200 | 200-500 |
| Success rate | High with consistent demos | Higher with diverse demos |
| Multi-modal support | Limited (CVAE helps somewhat) | Good — handles multiple modes |
| Memory | Lighter | Heavier (UNet + scheduler) |
| Ease of tuning | Easier — fewer hyperparams | Harder — more hyperparams |
| Best for | Clear tasks, real-time | Complex tasks, multi-modal |
Visualizing Learned Policies
```python
import matplotlib.pyplot as plt
import numpy as np

def visualize_action_predictions(policy, dataset, n_samples=5, save_path="policy_comparison.png"):
    """Visualize action predictions vs ground truth."""
    policy.eval()
    fig, axes = plt.subplots(n_samples, 2, figsize=(12, 3*n_samples))
    for i in range(n_samples):
        idx = np.random.randint(len(dataset))
        sample = {k: v.unsqueeze(0).to(device) for k, v in dataset[idx].items()
                  if isinstance(v, torch.Tensor)}
        # Ground truth action
        gt_action = sample["action"].cpu().numpy().flatten()
        # Predicted action
        with torch.no_grad():
            pred_action = policy.select_action(
                {k: v for k, v in sample.items() if k.startswith("observation")}
            ).cpu().numpy().flatten()
        # Plot
        axes[i, 0].bar(range(len(gt_action)), gt_action, alpha=0.7, label="Ground Truth")
        axes[i, 0].bar(range(len(pred_action)), pred_action, alpha=0.5, label="Predicted")
        axes[i, 0].set_title(f"Sample {i+1}: Action Comparison")
        axes[i, 0].legend()
        # Error
        error = np.abs(gt_action - pred_action[:len(gt_action)])
        axes[i, 1].bar(range(len(error)), error, color='red', alpha=0.7)
        axes[i, 1].set_title(f"Sample {i+1}: Absolute Error")
    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    plt.show()

# Use a separate filename per policy so the plots don't overwrite each other
visualize_action_predictions(policy, dataset, save_path="act_predictions.png")
visualize_action_predictions(diff_policy, dataset, save_path="diffusion_predictions.png")
```
Training via CLI (Fastest Approach)
If you want to get started quickly without writing code, LeRobot provides a CLI:
```bash
# Train ACT on PushT
python lerobot/scripts/train.py \
    --policy.type=act \
    --dataset.repo_id=lerobot/pusht \
    --training.num_epochs=100 \
    --training.batch_size=8 \
    --training.lr=1e-5 \
    --policy.chunk_size=100 \
    --policy.use_vae=true \
    --output_dir=outputs/act_pusht

# Train Diffusion Policy on the same dataset
python lerobot/scripts/train.py \
    --policy.type=diffusion \
    --dataset.repo_id=lerobot/pusht \
    --training.num_epochs=200 \
    --training.batch_size=64 \
    --training.lr=1e-4 \
    --policy.num_inference_steps=100 \
    --output_dir=outputs/diffusion_pusht

# Evaluate
python lerobot/scripts/eval.py \
    --policy.path=outputs/act_pusht/checkpoints/last \
    --env.type=pusht \
    --eval.n_episodes=50
```
Reference Papers
- ACT: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware — Zhao et al., RSS 2023 — Original ACT paper
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion — Chi et al., RSS 2023 — Original Diffusion Policy paper
- Consistency Policy — Prasad et al., 2024 — Speeding up Diffusion Policy via consistency distillation
Conclusion and Next Steps
ACT and Diffusion Policy represent two different philosophies in robot learning. ACT excels when you need fast inference and consistent demonstrations. Diffusion Policy performs better when tasks are complex and require handling multi-modal behaviors.
In practice, start with ACT (simpler, faster), and switch to Diffusion Policy if ACT fails to capture diverse behaviors.
The next post — Multi-Object Manipulation — will expand from single objects to multiple objects, requiring policies to understand context and task ordering.
Related Posts
- Data Collection via Teleoperation — How to collect high-quality datasets
- ACT: Action Chunking with Transformers — Deep dive into ACT architecture
- Diffusion Policy: From Theory to Practice — Detailed explanation of Diffusion Policy