ai · lerobot · act · diffusion-policy · single-arm

Training Single-Arm Policies: ACT & Diffusion Policy

Guide to training ACT and Diffusion Policy with LeRobot, comparing performance, hyperparameter tuning, and visualizing results.

Nguyễn Anh Tuấn · March 18, 2026 · 9 min read

Introduction: From Data to Actions

In the previous post, we collected a demonstration dataset for a pick-and-place task. Now it's time to turn that data into a policy — a model capable of autonomously controlling the robot to complete tasks without human intervention.

This post will guide you through training the two most popular policies in LeRobot: ACT (Action Chunking with Transformers) and Diffusion Policy. We'll directly compare the performance of both methods on the same dataset, learn how to tune hyperparameters, and visualize results.

Training robot policies

ACT: Action Chunking with Transformers

Core Idea

ACT, introduced by Zhao et al. (RSS 2023), tackles a core problem in behavior cloning: compounding errors over time. Instead of predicting one action at a time (where small errors accumulate into drift), ACT predicts a chunk of multiple consecutive actions at once.

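To make the chunking idea concrete, here is a toy, self-contained sketch of the receding-horizon execution loop. predict_chunk is a hypothetical stand-in for the trained policy and simply returns random actions so the snippet runs on its own.

import numpy as np

chunk_size = 100       # actions predicted per forward pass
n_action_steps = 100   # actions executed before re-predicting
action_dim = 2

def predict_chunk(obs):
    # Placeholder for policy inference; ACT would return (chunk_size, action_dim)
    return np.random.randn(chunk_size, action_dim)

obs = np.zeros(2)          # placeholder observation
executed = []
for _ in range(3):         # three prediction cycles
    chunk = predict_chunk(obs)
    for action in chunk[:n_action_steps]:
        executed.append(action)   # on a real robot: send the action, read a new obs

print(f"Executed {len(executed)} actions with only 3 policy calls")
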
The ACT architecture consists of:

- A vision encoder (a ResNet backbone) that turns camera images into feature tokens
- A transformer encoder-decoder that fuses image features with the robot's proprioceptive state and decodes a chunk of future actions
- A CVAE (conditional variational autoencoder) whose encoder is used only during training, capturing the "style" of each demonstration in a latent variable

ACT Configuration and Training

from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import torch
from torch.utils.data import DataLoader

# Load dataset
dataset = LeRobotDataset("lerobot/pusht")

# ACT configuration — key hyperparameters
act_config = ACTConfig(
    input_shapes={
        "observation.image": [3, 96, 96],
        "observation.state": [2],
    },
    output_shapes={
        "action": [2],
    },
    input_normalization_modes={
        "observation.image": "mean_std",
        "observation.state": "min_max",
    },
    output_normalization_modes={
        "action": "min_max",
    },
    
    # === Key hyperparameters ===
    chunk_size=100,        # Number of actions predicted at once
    n_action_steps=100,    # Actions to execute before re-predicting
    
    # Transformer architecture
    dim_model=512,         # Hidden dimension
    n_heads=8,             # Attention heads
    n_encoder_layers=4,    # Transformer encoder layers
    n_decoder_layers=1,    # Transformer decoder layers (fewer = faster)
    
    # CVAE settings
    latent_dim=32,         # Latent dimension for CVAE
    use_vae=True,          # Enable CVAE (important!)
    kl_weight=10.0,        # KL loss weight
    
    # Vision encoder
    vision_backbone="resnet18",  # resnet18 or resnet50
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",  # pretrained ImageNet weights
)

# Create policy and optimizer
policy = ACTPolicy(act_config)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy.to(device)

optimizer = torch.optim.AdamW(
    policy.parameters(),
    lr=1e-5,
    weight_decay=1e-4,
)

# Dataloader
dataloader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

# Training loop
num_epochs = 100
policy.train()

for epoch in range(num_epochs):
    epoch_loss = 0
    num_batches = 0
    
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass — ACT computes loss internally
        output = policy.forward(batch)
        loss = output["loss"]
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)
        optimizer.step()
        
        epoch_loss += loss.item()
        num_batches += 1
    
    avg_loss = epoch_loss / num_batches
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

# Save checkpoint
torch.save(policy.state_dict(), "act_checkpoint.pt")
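
To sanity-check the result, you can reload the checkpoint and query the policy on a dataset sample. A minimal sketch, assuming the same act_config, dataset, and device defined above:

# Reload the checkpoint and run a quick inference sanity check
policy = ACTPolicy(act_config)
policy.load_state_dict(torch.load("act_checkpoint.pt", map_location=device))
policy.to(device)
policy.eval()
policy.reset()  # clears the internal action queue before a new rollout

sample = dataset[0]
obs = {
    k: v.unsqueeze(0).to(device)
    for k, v in sample.items()
    if k.startswith("observation")
}
with torch.no_grad():
    action = policy.select_action(obs)
print("Predicted action:", action.squeeze(0).cpu().numpy())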

Hyperparameter Tuning for ACT

| Parameter | Default | Range to try | Impact |
|---|---|---|---|
| chunk_size | 100 | 20-200 | Larger = smoother but less responsive |
| n_action_steps | 100 | = chunk_size | Usually kept equal to chunk_size |
| kl_weight | 10.0 | 1.0-100.0 | Higher = less diverse; lower = more multimodal |
| dim_model | 512 | 256-1024 | Larger = more expressive, but slower |
| lr | 1e-5 | 1e-6 to 1e-4 | Start low, increase gradually |
| vision_backbone | resnet18 | resnet18 / resnet50 | resnet50 is stronger but slower |
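
If you want to explore these ranges systematically, a small grid sweep over the two highest-impact knobs is usually enough. A minimal sketch; in practice you would carry over the remaining fields (normalization modes, vision backbone, etc.) from the full config above:

import itertools

chunk_sizes = [20, 50, 100]
kl_weights = [1.0, 10.0, 100.0]

sweep_configs = []
for chunk_size, kl_weight in itertools.product(chunk_sizes, kl_weights):
    cfg = ACTConfig(
        input_shapes={"observation.image": [3, 96, 96], "observation.state": [2]},
        output_shapes={"action": [2]},
        chunk_size=chunk_size,
        n_action_steps=chunk_size,   # keep equal to chunk_size, as in the table
        kl_weight=kl_weight,
        # normalization modes, vision backbone, etc. carried over in practice
    )
    sweep_configs.append(cfg)

print(f"{len(sweep_configs)} configurations to train and compare")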

Diffusion Policy

Core Idea

Diffusion Policy (Chi et al., RSS 2023) applies diffusion models to robot learning. Instead of predicting actions directly, it generates actions from noise through iterative denoising — similar to how Stable Diffusion generates images.

The biggest advantage: ability to handle multi-modal action distributions. When the same observation can lead to different valid actions (e.g., grasping from left or right side), Diffusion Policy handles this better than ACT.

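To make the denoising idea concrete, here is a toy DDPM-style sampling loop over a single scalar "action". This is not LeRobot's implementation: toy_noise_model is a placeholder for the trained, observation-conditioned UNet, and the snippet only illustrates how an action emerges from noise through iterative denoising.

import torch

num_steps = 100
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def toy_noise_model(x_t, t):
    # Placeholder: the real model predicts noise conditioned on observations
    return 0.1 * x_t

x = torch.randn(1)                      # start from pure Gaussian noise
for t in reversed(range(num_steps)):
    eps = toy_noise_model(x, t)         # predicted noise at this step
    # Standard DDPM posterior mean update
    x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # sampling noise

print("Denoised action sample:", x.item())
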
Diffusion Policy Configuration and Training

from lerobot.common.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

# Diffusion Policy configuration
diff_config = DiffusionConfig(
    input_shapes={
        "observation.image": [3, 96, 96],
        "observation.state": [2],
    },
    output_shapes={
        "action": [2],
    },
    input_normalization_modes={
        "observation.image": "mean_std",
        "observation.state": "min_max",
    },
    output_normalization_modes={
        "action": "min_max",
    },
    
    # === Diffusion-specific hyperparameters ===
    num_inference_steps=100,    # Denoising steps during inference
    
    # UNet architecture
    down_dims=[256, 512, 1024],  # Channel dims for each level
    
    # Observation horizons
    n_obs_steps=2,              # Number of observation frames as input
    horizon=16,                 # Planning horizon (action sequence length)
    n_action_steps=8,           # Number of actions to execute
    
    # Noise scheduler
    noise_scheduler_type="DDPM",  # DDPM or DDIM
    beta_schedule="squaredcos_cap_v2",
    
    # Vision encoder
    vision_backbone="resnet18",
    crop_shape=[84, 84],        # Random crop for augmentation
)

# Create policy
diff_policy = DiffusionPolicy(diff_config)
diff_policy.to(device)

# Optimizer — Diffusion typically needs higher lr than ACT
optimizer = torch.optim.AdamW(
    diff_policy.parameters(),
    lr=1e-4,
    weight_decay=1e-6,
    betas=(0.95, 0.999),
)

# Learning rate scheduler
num_epochs = 200  # Diffusion typically needs more epochs than ACT
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs,
    eta_min=1e-6,
)

# Training loop
diff_policy.train()

for epoch in range(num_epochs):
    epoch_loss = 0
    num_batches = 0
    
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        
        output = diff_policy.forward(batch)
        loss = output["loss"]
        
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(diff_policy.parameters(), max_norm=10.0)
        optimizer.step()
        
        epoch_loss += loss.item()
        num_batches += 1
    
    lr_scheduler.step()
    avg_loss = epoch_loss / num_batches
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/200 | Loss: {avg_loss:.4f} | "
              f"LR: {lr_scheduler.get_last_lr()[0]:.2e}")

torch.save(diff_policy.state_dict(), "diffusion_checkpoint.pt")

Diffusion process for action generation

Optimizing Inference Speed for Diffusion Policy

A drawback of Diffusion Policy is slow inference due to multiple denoising steps. Here are optimization strategies:

# 1. Use DDIM instead of DDPM — fewer steps, similar results
diff_config_fast = DiffusionConfig(
    num_inference_steps=10,              # Reduced from 100 to 10
    noise_scheduler_type="DDIM",         # DDIM faster than DDPM
    # ... keep other params
)

# 2. Benchmark inference time (point `policy` at whichever policy you want to time)
import time
import numpy as np

policy.eval()
obs = {k: v[:1].to(device) for k, v in next(iter(dataloader)).items()
       if k.startswith("observation")}

# Warm up
for _ in range(5):
    with torch.no_grad():
        _ = policy.select_action(obs)

# Benchmark
times = []
for _ in range(50):
    start = time.perf_counter()
    with torch.no_grad():
        action = policy.select_action(obs)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    times.append(time.perf_counter() - start)

print(f"Inference time: {np.mean(times)*1000:.1f} +/- {np.std(times)*1000:.1f} ms")
# ACT: ~5-10ms | Diffusion (100 steps): ~50-100ms | Diffusion DDIM (10 steps): ~10-20ms

ACT vs Diffusion Policy Comparison

Evaluation on the Same Dataset

import gymnasium as gym
import numpy as np
import torch

def evaluate_policy(policy, env_name, n_episodes=50, max_steps=500):
    """Evaluate policy in environment."""
    env = gym.make(env_name)
    policy.eval()
    
    results = {
        "success_rate": 0,
        "avg_steps": 0,
        "avg_reward": 0,
        "path_lengths": [],
    }
    
    successes = 0
    total_steps = 0
    total_reward = 0
    
    for ep in range(n_episodes):
        obs, info = env.reset()
        ep_reward = 0
        ep_steps = 0
        positions = []
        
        for step in range(max_steps):
            # Depending on the env, raw observation keys (e.g. "pixels", "agent_pos")
            # may need remapping to the policy's "observation.image" / "observation.state" keys
            obs_tensor = {
                k: torch.tensor(v).unsqueeze(0).to(device)
                for k, v in obs.items()
            }
            
            with torch.no_grad():
                action = policy.select_action(obs_tensor)
            
            obs, reward, terminated, truncated, info = env.step(
                action.squeeze(0).cpu().numpy()
            )
            
            ep_reward += reward
            ep_steps += 1
            positions.append(obs.get("achieved_goal", np.zeros(3)))
            
            if terminated or truncated:
                break
        
        if info.get("is_success", False):
            successes += 1
        total_steps += ep_steps
        total_reward += ep_reward
        
        # Calculate path length
        positions = np.array(positions)
        path_length = np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))
        results["path_lengths"].append(path_length)
    
    results["success_rate"] = successes / n_episodes
    results["avg_steps"] = total_steps / n_episodes
    results["avg_reward"] = total_reward / n_episodes
    results["path_efficiency"] = np.mean(results["path_lengths"])
    
    env.close()
    return results


# Evaluate both policies (assumes the gym_pusht environment package is installed)
act_policy = policy  # alias for the ACT policy trained earlier
act_results = evaluate_policy(act_policy, "gym_pusht/PushT-v0", n_episodes=50)
diff_results = evaluate_policy(diff_policy, "gym_pusht/PushT-v0", n_episodes=50)

# Print comparison
print(f"\n{'Metric':<20} {'ACT':>10} {'Diffusion':>10}")
print(f"{'='*42}")
print(f"{'Success Rate':<20} {act_results['success_rate']:>9.1%} {diff_results['success_rate']:>9.1%}")
print(f"{'Avg Steps':<20} {act_results['avg_steps']:>10.1f} {diff_results['avg_steps']:>10.1f}")
print(f"{'Avg Reward':<20} {act_results['avg_reward']:>10.2f} {diff_results['avg_reward']:>10.2f}")
print(f"{'Path Efficiency':<20} {act_results['path_efficiency']:>10.3f} {diff_results['path_efficiency']:>10.3f}")

Summary Comparison Table

| Criteria | ACT | Diffusion Policy |
|---|---|---|
| Inference speed | ~5-10 ms | ~50-100 ms (DDPM), ~10-20 ms (DDIM) |
| Training epochs | 100-200 | 200-500 |
| Success rate | High with consistent demos | Higher with diverse demos |
| Multi-modal support | Limited (the CVAE helps somewhat) | Good; handles multiple modes |
| Memory | Lighter | Heavier (UNet + noise scheduler) |
| Ease of tuning | Easier; fewer hyperparameters | Harder; more hyperparameters |
| Best for | Well-defined tasks, real-time control | Complex, multi-modal tasks |

Visualizing Learned Policies

import matplotlib.pyplot as plt
import numpy as np

def visualize_action_predictions(policy, dataset, n_samples=5):
    """Visualize action predictions vs ground truth."""
    policy.eval()
    
    fig, axes = plt.subplots(n_samples, 2, figsize=(12, 3*n_samples))
    
    for i in range(n_samples):
        idx = np.random.randint(len(dataset))
        sample = {k: v.unsqueeze(0).to(device) for k, v in dataset[idx].items()}
        policy.reset()  # clear any queued actions left over from the previous sample
        
        # Ground truth action
        gt_action = sample["action"].cpu().numpy().flatten()
        
        # Predicted action
        with torch.no_grad():
            pred_action = policy.select_action(
                {k: v for k, v in sample.items() if k.startswith("observation")}
            ).cpu().numpy().flatten()
        
        # Plot
        axes[i, 0].bar(range(len(gt_action)), gt_action, alpha=0.7, label="Ground Truth")
        axes[i, 0].bar(range(len(pred_action)), pred_action, alpha=0.5, label="Predicted")
        axes[i, 0].set_title(f"Sample {i+1}: Action Comparison")
        axes[i, 0].legend()
        
        # Error
        error = np.abs(gt_action - pred_action[:len(gt_action)])
        axes[i, 1].bar(range(len(error)), error, color='red', alpha=0.7)
        axes[i, 1].set_title(f"Sample {i+1}: Absolute Error")
    
    plt.tight_layout()
    plt.savefig("policy_comparison.png", dpi=150)
    plt.show()

visualize_action_predictions(act_policy, dataset)
visualize_action_predictions(diff_policy, dataset)
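
Besides comparing individual predictions, plotting the training loss curves of both policies side by side is a quick way to spot underfitting or instability. This sketch assumes you collected one avg_loss value per epoch into act_losses and diff_losses during the training loops above; here they are filled with placeholder values so the snippet runs standalone.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder curves; replace with the per-epoch avg_loss values you logged
act_losses = np.exp(-np.linspace(0, 4, 100)) + 0.05 * np.random.rand(100)
diff_losses = np.exp(-np.linspace(0, 3, 200)) + 0.05 * np.random.rand(200)

plt.figure(figsize=(8, 4))
plt.plot(act_losses, label="ACT")
plt.plot(diff_losses, label="Diffusion Policy")
plt.xlabel("Epoch")
plt.ylabel("Average training loss")
plt.yscale("log")
plt.legend()
plt.tight_layout()
plt.savefig("training_curves.png", dpi=150)
plt.show()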

Training via CLI (Fastest Approach)

If you want to get started quickly without writing code, LeRobot provides a CLI:

# Train ACT on PushT
python lerobot/scripts/train.py \
    --policy.type=act \
    --dataset.repo_id=lerobot/pusht \
    --training.num_epochs=100 \
    --training.batch_size=8 \
    --training.lr=1e-5 \
    --policy.chunk_size=100 \
    --policy.use_vae=true \
    --output_dir=outputs/act_pusht

# Train Diffusion Policy on the same dataset
python lerobot/scripts/train.py \
    --policy.type=diffusion \
    --dataset.repo_id=lerobot/pusht \
    --training.num_epochs=200 \
    --training.batch_size=64 \
    --training.lr=1e-4 \
    --policy.num_inference_steps=100 \
    --output_dir=outputs/diffusion_pusht

# Evaluate
python lerobot/scripts/eval.py \
    --policy.path=outputs/act_pusht/checkpoints/last \
    --env.type=pusht \
    --eval.n_episodes=50

Reference Papers

  1. ACT: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., RSS 2023) — the original ACT paper
  2. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al., RSS 2023) — the original Diffusion Policy paper
  3. Consistency Policy (Prasad et al., 2024) — speeds up Diffusion Policy via consistency distillation

Conclusion and Next Steps

ACT and Diffusion Policy represent two different philosophies in robot learning. ACT excels when you need fast inference and consistent demonstrations. Diffusion Policy performs better when tasks are complex and require handling multi-modal behaviors.

In practice, start with ACT (simpler, faster), and switch to Diffusion Policy if ACT fails to capture diverse behaviors.

The next post — Multi-Object Manipulation — will expand from single objects to multiple objects, requiring policies to understand context and task ordering.
