Tags: manipulation, diffusion-policy, DDPM, LeRobot

Diffusion Policy in Practice: From Theory to Code

DDPM recap, Diffusion Policy architecture (CNN vs Transformer), implementation with LeRobot, and benchmark results — complete practical guide.

Nguyen Anh Tuan · March 16, 2026 · 7 min read

From Image Generation to Robot Control

Diffusion models revolutionized image generation (Stable Diffusion, DALL-E 3). Core idea: start from noise, gradually denoise to produce structured output. But why use it for robots?

In Part 2, I discussed the multimodal-action problem — when the same observation admits multiple correct solutions. Behavioral Cloning with MSE loss averages all modes, producing a meaningless "average" action. ACT uses a CVAE to handle this, but Diffusion Policy (Chi et al., 2023) goes further: it models the entire action distribution with a diffusion process.

The result: it outperforms all baselines on 12 tasks across 4 benchmarks, with an average improvement of 46.9%. This is state-of-the-art for manipulation policy learning.

If you haven't read the Diffusion Policy overview, see Diffusion Policy deep dive in the AI for Robotics series.

Diffusion process — from noise to structured action sequences

DDPM Recap: Denoising Diffusion Probabilistic Models

Forward Process (Adding Noise)

Starting from data x_0, gradually add Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; sqrt(1-beta_t) * x_{t-1}, beta_t * I)

After T steps, x_T ~ N(0, I) — pure noise. Here beta_t is the noise schedule, increasing gradually from ~0.0001 to ~0.02.
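These schedule quantities can be precomputed once and reused for both training and sampling. A minimal sketch with a linear schedule (variable names are my own, not LeRobot's):

```python
import torch

T = 100  # number of diffusion steps

# Linear noise schedule: beta_t grows from ~1e-4 to ~0.02
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_cumprod = torch.cumprod(alphas, dim=0)

# Closed-form forward process: q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
sqrt_alpha_cumprod = torch.sqrt(alpha_cumprod)
sqrt_one_minus_alpha_cumprod = torch.sqrt(1.0 - alpha_cumprod)
```

The two square-root buffers are exactly the coefficients that appear in the training step's noising line.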

Reverse Process (Denoising)

Learn a neural network epsilon_theta to predict noise at each step:

p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)

Training objective: predict the noise that was added:

# DDPM training step (simplified)
import torch
import torch.nn.functional as F

def train_step(model, x_0, sqrt_alpha_cumprod, sqrt_one_minus_alpha_cumprod, T):
    # 1. Sample a random timestep per batch element
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)

    # 2. Sample Gaussian noise
    epsilon = torch.randn_like(x_0)

    # 3. Add noise per the schedule (reshape for broadcasting over (B, H, action_dim))
    scale = sqrt_alpha_cumprod[t].view(-1, 1, 1)
    sigma = sqrt_one_minus_alpha_cumprod[t].view(-1, 1, 1)
    x_t = scale * x_0 + sigma * epsilon

    # 4. Predict the noise that was added
    epsilon_pred = model(x_t, t)

    # 5. MSE loss between true and predicted noise
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
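For completeness, here is a minimal sketch of the corresponding reverse (sampling) loop, assuming the same simplified linear schedule; `model` is the trained noise predictor:

```python
import torch

@torch.no_grad()
def sample(model, shape, T=100):
    # Reverse process: start from x_T ~ N(0, I), denoise step by step
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_pred = model(x, torch.full((shape[0],), t))
        # Posterior mean: remove the predicted noise component, then rescale
        coef = betas[t] / torch.sqrt(1.0 - alpha_cumprod[t])
        x = (x - coef * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn(shape)  # sigma_t^2 = beta_t
    return x
```

This uses the common choice sigma_t^2 = beta_t from the reverse-process equation above.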

From Images to Actions

In image generation, x_0 is pixel values. In Diffusion Policy, x_0 is an action sequence [a_t, a_{t+1}, ..., a_{t+H}] with H as prediction horizon. Conditioning is observations (images + joint positions) instead of text prompts.

Diffusion Policy Architecture

CNN-based (Diffusion Policy - C)

Original architecture uses 1D temporal CNN (like WaveNet) to process action sequences:

Input: noisy action chunk x_t (B, H, action_dim) + timestep t
Condition: observation features (B, obs_dim) from ResNet18

Architecture:
  1. FiLM conditioning: inject obs features into each conv layer
  2. 1D Conv blocks: [Conv1D -> GroupNorm -> Mish -> Conv1D] x N
  3. Residual connections
  4. Output: predicted noise epsilon (B, H, action_dim)

Advantages: fast (inference ~10ms), lightweight (~5M parameters), easy to train.
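To make FiLM conditioning concrete, here is an illustrative sketch of one conditioned residual block — class and variable names are mine, not LeRobot's actual implementation:

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """1D conv block whose features are modulated by observation features."""
    def __init__(self, channels, obs_dim, kernel_size=5):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.GroupNorm(8, channels)
        self.act = nn.Mish()
        # FiLM: predict a per-channel scale and shift from the observation features
        self.film = nn.Linear(obs_dim, 2 * channels)

    def forward(self, x, obs):  # x: (B, C, H), obs: (B, obs_dim)
        h = self.act(self.norm(self.conv1(x)))
        scale, shift = self.film(obs).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        return x + self.conv2(h)  # residual connection
```

The key idea is that the observation never enters as an input token; it only rescales and shifts the intermediate features of every block.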

Transformer-based (Diffusion Policy - T)

Replace CNN with Transformer decoder:

Input tokens:
  - Noisy action tokens: x_t -> Linear -> (H, d_model)
  - Observation tokens: obs -> ResNet18 -> (N_obs, d_model)
  - Timestep token: t -> sinusoidal embedding -> (1, d_model)

Transformer Decoder:
  - Self-attention on action tokens
  - Cross-attention from action to observation tokens
  - L layers, d_model=256, 4 heads

Output: predicted noise epsilon (H, action_dim)

Advantages: captures long-range dependencies better, scales well with data.
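An illustrative sketch of this decoder using PyTorch's built-in `nn.TransformerDecoder` (the timestep token and positional embeddings are omitted for brevity; names are mine, not LeRobot's):

```python
import torch
import torch.nn as nn

class DiffusionTransformer(nn.Module):
    """Noise predictor: action tokens self-attend and cross-attend to observations."""
    def __init__(self, action_dim, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, obs_tokens):
        # noisy_actions: (B, H, action_dim), obs_tokens: (B, N_obs, d_model)
        tokens = self.action_proj(noisy_actions)
        out = self.decoder(tgt=tokens, memory=obs_tokens)  # self- + cross-attention
        return self.head(out)  # predicted noise, (B, H, action_dim)
```

Passing observations as the decoder `memory` is what implements "cross-attention from action to observation tokens" above.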

CNN vs Transformer: When to Use

Criterion       | Diffusion Policy - C     | Diffusion Policy - T
Inference speed | ~10ms                    | ~50ms
Parameters      | ~5M                      | ~20M
Data efficiency | Better with < 100 demos  | Needs > 200 demos
Long-horizon    | Limited                  | Better
Training time   | Fast (2-4h on 1 GPU)     | Slower (6-12h)
Recommendation  | Default choice           | Complex tasks, plenty of data

Advice: Start with CNN variant. Switch to Transformer only when CNN plateaus and you have enough data.

Diffusion Policy in LeRobot: Hands-on

Setup

# Install LeRobot
pip install lerobot

# Check GPU
python -c "import torch; print(torch.cuda.is_available())"

Train Diffusion Policy on PushT (2D Benchmark)

PushT is a classic benchmark: the robot pushes a T-shaped block to a target pose. It is simple, yet it clearly demonstrates Diffusion Policy's strength on multimodal actions.

# Download PushT dataset
python -m lerobot.scripts.download_dataset \
    --repo-id lerobot/pusht

# Train Diffusion Policy (CNN variant)
python -m lerobot.scripts.train \
    --policy.type=diffusion \
    --env.type=pusht \
    --dataset.repo_id=lerobot/pusht \
    --training.num_epochs=5000 \
    --training.batch_size=64 \
    --policy.n_action_steps=8 \
    --policy.n_obs_steps=2 \
    --policy.num_inference_steps=100

Important Hyperparameters

# Config explained
policy:
  n_obs_steps: 2          # Observation frames as input (2 = current + previous)
  n_action_steps: 8       # Actions to execute before replanning (action chunking)
  horizon: 16             # Prediction horizon (predict 16, execute 8)
  num_inference_steps: 100 # DDPM denoising steps (more = accurate, slower)

  # Noise schedule
  noise_scheduler:
    type: "ddpm"           # Or "ddim" (faster, fewer steps)
    beta_start: 0.0001
    beta_end: 0.02
    beta_schedule: "squaredcos_cap_v2"  # Cosine schedule (better than linear)
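The "squaredcos_cap_v2" name refers to the cosine schedule of Nichol & Dhariwal (2021). A sketch of how such betas can be computed — the library's actual implementation may differ in details:

```python
import math
import torch

def cosine_betas(T, max_beta=0.999):
    # alpha_bar(t) = cos^2(((t + 0.008) / 1.008) * pi / 2)
    # Betas are derived from consecutive alpha_bar ratios, capped at max_beta.
    def alpha_bar(t):
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
    betas = [min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), max_beta)
             for i in range(T)]
    return torch.tensor(betas)
```

Compared to the linear schedule, the cosine schedule destroys information more gradually at early timesteps, which tends to help with low-dimensional signals like action sequences.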

DDIM: Speedup Inference

DDPM needs ~100 denoising steps — too slow for real-time robot control. DDIM (Denoising Diffusion Implicit Models) reduces this to 10-20 steps while maintaining quality:

# Switch from DDPM to DDIM in LeRobot config
policy:
  noise_scheduler:
    type: "ddim"
  num_inference_steps: 10  # Reduce from 100 to 10

In practice, DDIM with 10 steps gives ~15ms inference — fast enough for 50Hz control loops.
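To show why fewer steps suffice, here is a minimal sketch of the deterministic DDIM update (eta = 0), stepping through a strided subset of timesteps — illustrative only; in practice this is handled by a scheduler library:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, T=100, num_steps=10):
    # Deterministic DDIM (eta = 0): denoise along a strided subset of timesteps
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_cumprod = torch.cumprod(1.0 - betas, dim=0)
    timesteps = torch.linspace(T - 1, 0, num_steps).long()  # e.g. 99, 88, ..., 0

    x = torch.randn(shape)  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_cumprod[t]
        a_prev = alpha_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, torch.full((shape[0],), int(t)))
        # Predict the clean sample, then jump straight to the previous timestep
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
    return x
```

Because each update predicts x_0 directly and re-noises to the target timestep, DDIM can skip most of the 100 training timesteps without retraining the model.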

Diffusion Policy training loop — observation conditioning + action denoising

Benchmark Results

PushT (2D)

Method                   | Success Rate
BC (MLP)                 | 58.7%
BC (Transformer)         | 63.2%
IBC (Implicit BC)        | 62.4%
BeT (Behavior Transformer) | 65.8%
Diffusion Policy - C     | 88.7%
Diffusion Policy - T     | 85.3%

Robomimic (Simulated Manipulation)

Task      | BC-RNN | IBC | Diffusion-C | Diffusion-T
Lift      | 100%   | 96% | 100%        | 100%
Can       | 92%    | 84% | 96%         | 95%
Square    | 82%    | 68% | 92%         | 92%
Transport | 46%    | 32% | 78%         | 74%
Tool Hang | 18%    | 12% | 56%         | 52%

Diffusion Policy dominates, especially on hard tasks (Transport, Tool Hang) where multimodal actions matter.

Why is Diffusion Policy Powerful?

1. Handles Multimodal Actions Naturally

When there are many ways to complete a task (e.g., push the T from the left or the right), BC averages them into a meaningless action. A diffusion model learns the full distribution, samples one specific mode, and executes it consistently.

2. High Expressiveness

Diffusion process can model any distribution, not limited by parametric assumptions (like Gaussians in VAEs). Important for complex manipulation trajectories.

3. Training Stability

No mode collapse like GANs, no posterior collapse like VAEs. Training objective (predict noise) is simple and stable.

4. Action Chunking Built-in

Diffusion Policy predicts entire action chunks at once, which reduces compounding error just as ACT does, but with a more expressive generative model.
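A sketch of how receding-horizon chunked execution looks at inference time — the `policy`/`env` interfaces here are hypothetical, not LeRobot's actual API:

```python
# Receding-horizon execution of action chunks: predict `horizon` actions,
# execute only the first `n_action_steps`, then replan from the new observation.
# `policy` and `env` are hypothetical interfaces, not LeRobot's actual API.
def control_loop(policy, env, horizon=16, n_action_steps=8, max_steps=200):
    obs = env.reset()
    for _ in range(max_steps // n_action_steps):
        actions = policy.predict(obs, horizon=horizon)  # full predicted chunk
        for a in actions[:n_action_steps]:              # execute 8 of the 16
            obs = env.step(a)
    return obs
```

This matches the config above: `horizon: 16` actions are predicted per denoising pass, but only `n_action_steps: 8` are executed before the next pass.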

Practical Tips

Training

Real Robot Inference

Debugging

Diffusion Policy vs ACT: Which to Choose?

Criterion       | ACT                    | Diffusion Policy
Multimodal      | CVAE (limited modes)   | Full distribution
Speed           | Fast (~5ms)            | Slower (~15ms with DDIM)
Data efficiency | Excellent (50 demos)   | Good (50-100 demos)
Long-horizon    | Good                   | Better
Implementation  | Complex (CVAE)         | Medium
Best for        | Bimanual, limited data | Complex tasks, more data

Choose ACT when: data is limited (<50 demos), you need the fastest inference, or the task is bimanual. Choose Diffusion Policy when: the task is complex and multimodal, you have enough data, and 50Hz control is sufficient.

Next in Series

