
Diffusion Policy: A Revolution in Robot Manipulation

Why diffusion models are a breakthrough for robotics — multimodal distributions, high-dimensional actions, and stability.

Nguyen Anh Tuan · March 14, 2026 · 9 min read

Why Traditional Behavior Cloning Fails

If you've tried teaching a robot to imitate human actions (imitation learning), you know Behavioral Cloning (BC) — the simplest approach: collect demonstrations, train with supervised learning, and predict an action from each observation. It is simple and easy to implement, but it fails badly in many real-world scenarios.

Problem: Multimodal Actions

Imagine you teach a robot to go around an obstacle. Demo 1 goes left, Demo 2 goes right — both are correct. What does BC with an MSE loss learn? The average of the two directions — the robot drives straight into the obstacle.

Demo 1:  ← (go left)    ✓ Correct
Demo 2:  → (go right)   ✓ Correct
BC output: ↑ (average)  ✗ Hits obstacle!

This is a multimodal action distribution — the same observation admits multiple correct actions. BC with Gaussian regression outputs only one mode (the mean) and loses all the diversity in the data.
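A toy example makes the mode-averaging failure concrete. This sketch (plain NumPy, purely illustrative) shows that the MSE-optimal prediction for a 50/50 left-right dataset is exactly the one action that never appears in the demonstrations:

```python
import numpy as np

# Toy dataset: one repeated observation, two equally valid actions
# (left = -1, right = +1), as in the obstacle example above.
actions = np.array([-1.0, +1.0] * 50).reshape(-1, 1)

# Under MSE loss, the optimal constant prediction is the mean of the targets,
# so any BC regressor trained to convergence drifts toward this value.
mse_optimal_action = float(actions.mean())
print(mse_optimal_action)  # 0.0 — "go straight", an action no demo ever took
```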

Compounding Errors

The second problem: each prediction is slightly wrong, and the errors accumulate over time. After 50 steps, the robot has drifted completely off the demonstrated trajectory. This is why BC often works in simulation but fails on a real robot — the real world always has noise and perturbations.
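The drift can be simulated in a few lines. This illustrative NumPy sketch treats the per-step prediction error as zero-mean noise that feeds back through the state; even unbiased errors compound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-step prediction error: small, zero-mean, unbiased.
step_error_std = 0.05
n_steps = 50
errors = rng.normal(0.0, step_error_std, size=(1000, n_steps))

# Deviation from the demo trajectory is the running sum of the errors,
# because each state depends on the previous (slightly wrong) action.
drift = np.abs(errors.cumsum(axis=1)).mean(axis=0)

# Even unbiased noise compounds: expected |drift| grows like sqrt(n_steps),
# roughly 7x larger after 50 steps than after 1.
print(drift[0], drift[-1])
```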

Robot arm performing manipulation task with AI policy

DDPM Basics: Forward Noise and Reverse Denoising

Before understanding Diffusion Policy, we need Denoising Diffusion Probabilistic Models (DDPM) — the foundation of all diffusion models.

Forward process: Add noise gradually

Given a data point x₀ (e.g., an action trajectory), the forward process adds Gaussian noise over T steps:

x₀ (clean action) → x₁ (little noise) → x₂ → ... → x_T (pure noise)

Each step: x_t = √(1-β_t) · x_{t-1} + √(β_t) · ε
Where: ε ~ N(0, I), β_t is noise schedule

After T steps (usually T = 100 or 1000), x_T is almost pure Gaussian noise — no information about x₀ remains.
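The per-step recursion has a convenient closed form: with α_t = 1 − β_t and ᾱ_t = ∏_{s≤t} α_s, we get x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, so any x_t can be sampled in one jump. A minimal PyTorch sketch (linear schedule, T = 1000):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule β_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏_{s≤t} (1 - β_s)

def q_sample(x0, t, noise):
    """Sample x_t directly from x0 (closed form of the step-by-step recursion)."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(16, 7)                     # e.g. a chunk of 16 7-DoF actions
x_T = q_sample(x0, T - 1, torch.randn_like(x0))

# At t = T-1 almost no signal from x0 survives:
print(alpha_bars[-1])  # ~4e-5, so x_T is essentially pure Gaussian noise
```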

Reverse process: Denoise to generate data

The reverse process learns to go backwards — creating clean data from noise:

x_T (noise) → x_{T-1} → ... → x₁ → x₀ (clean action)

Each step: x_{t-1} = μ_θ(x_t, t) + σ_t · z
Where: μ_θ is the mean computed from a neural network's noise prediction, z ~ N(0, I)

A neural network ε_θ(x_t, t) is trained to predict the noise ε that was added at step t. The loss function is simple:

L = E[‖ε - ε_θ(x_t, t)‖²]

The clever idea: the model doesn't predict the action directly; it predicts the noise to remove. Through many denoising steps, the action "emerges" from the noise — like a sculptor removing excess stone.
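A training step can be sketched in a few lines. TinyEpsNet below is a hypothetical stand-in for the real denoising network (U-Net or Transformer); the rest is the standard DDPM recipe — sample t, sample ε, regress ε_θ onto ε:

```python
import torch
import torch.nn as nn

# TinyEpsNet is a hypothetical stand-in for the real denoising network;
# it maps (noisy action, timestep) -> predicted noise.
class TinyEpsNet(nn.Module):
    def __init__(self, action_dim=7, T=1000):
        super().__init__()
        self.t_embed = nn.Embedding(T, 32)
        self.net = nn.Sequential(
            nn.Linear(action_dim + 32, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, self.t_embed(t)], dim=-1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """L = E[‖ε − ε_θ(x_t, t)‖²], with x_t from the closed-form forward process."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t), eps)

model = TinyEpsNet()
loss = ddpm_loss(model, torch.randn(8, 7))  # batch of 8 clean actions
loss.backward()                             # ordinary supervised training from here
```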

Why do diffusion models capture multimodal distributions?

Unlike regression (which outputs a single value), diffusion models learn the entire distribution of the data. Each sample starts from different random noise and leads to a different output — naturally capturing multiple modes.
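The reverse loop makes this concrete. Below is standard DDPM ancestral sampling in PyTorch; the stub ε-predictor is a placeholder for a trained network, and the point is only the loop structure — each call starts from different noise and injects fresh z at every step, so repeated calls yield different samples:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(eps_model, shape):
    """DDPM ancestral sampling: eps_model(x_t, t) predicts the added noise."""
    x = torch.randn(shape)                          # start from pure noise x_T
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        # Posterior mean μ_θ computed from the predicted noise
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # + σ_t·z
        else:
            x = mean
    return x

# Untrained stub predictor — the samples are meaningless, but the structure is
# the point: different starting noise (and injected z) give different outputs.
stub = lambda x, t: torch.zeros_like(x)
sample_a = ddpm_sample(stub, (1, 7))
sample_b = ddpm_sample(stub, (1, 7))
```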

Back to obstacle avoidance: some samples denoise into "go left", others into "go right" — and none of them average into the obstacle.

Diffusion Policy: Apply Diffusion to Action Space

Diffusion Policy (Chi et al., RSS 2023) is the first successful application of diffusion models to robot visuomotor policy learning. The core idea: instead of using diffusion to generate images, use it to generate action trajectories.

Architecture Overview

Observation (image + proprio)
        ↓
   Visual Encoder (ResNet / ViT)
        ↓
   Observation embedding (conditioning)
        ↓
   Conditional Denoising Network
   Input: noisy action chunk A_t^k + timestep k + obs embedding
   Output: predicted noise ε_θ
        ↓
   K denoising steps (DDPM or DDIM)
        ↓
   Clean action chunk A_t^0 = [a_t, a_{t+1}, ..., a_{t+H}]

Important: the model predicts an entire chunk of actions (e.g., 16 actions in a row) instead of a single action. This is called action chunking — it reduces compounding errors because the robot executes a whole trajectory smoothly instead of re-predicting at every step.

Two Architecture Variants

The paper proposes two architectures for the denoising network:

1. CNN-based (1D temporal convolution):

# Simplified CNN-based Diffusion Policy
# (ResBlock1D and SinusoidalPosEmb are helper modules, omitted for brevity)
import torch.nn as nn

class ConvDiffusionUnet(nn.Module):
    """
    1D U-Net that processes the action sequence along the temporal dimension.
    Similar to an image-diffusion U-Net, but 1D instead of 2D.
    """
    def __init__(self, action_dim=7, obs_dim=512, horizon=16):
        super().__init__()
        # Observation conditioning via FiLM
        self.obs_encoder = nn.Linear(obs_dim, 256)
        # 1D U-Net with residual blocks
        self.down_blocks = nn.ModuleList([
            ResBlock1D(action_dim, 64),
            ResBlock1D(64, 128),
            ResBlock1D(128, 256),
        ])
        self.up_blocks = nn.ModuleList([
            ResBlock1D(256, 128),
            ResBlock1D(128, 64),
            ResBlock1D(64, action_dim),
        ])
        # Timestep embedding
        self.time_embed = SinusoidalPosEmb(256)

2. Transformer-based (Time-series Diffusion Transformer):

# Simplified Transformer-based Diffusion Policy
import torch.nn as nn

class DiffusionTransformer(nn.Module):
    """
    Transformer that processes action tokens + observation tokens.
    The attention mechanism captures long-range dependencies.
    """
    def __init__(self, action_dim=7, obs_dim=512, horizon=16):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, 256)
        self.obs_embed = nn.Linear(obs_dim, 256)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=256, nhead=8),
            num_layers=6,
        )
        self.action_head = nn.Linear(256, action_dim)

Deep learning model training on GPU cluster

Results: Why Diffusion Policy Beats BC by 46.9%

The paper benchmarks on 12 tasks across 4 robot manipulation benchmarks. The results are impressive:

Task                  BC (LSTM)  BC (Transformer)  IBC    BET    Diffusion Policy
Push-T                49.1%      51.2%             45.3%  60.1%  88.5%
Robomimic (Can)       78.3%      82.1%             71.6%  75.4%  96.2%
Robomimic (Square)    52.1%      55.4%             44.2%  48.7%  84.8%
Real Robot (Push-T)   35.0%      38.2%             30.1%  42.3%  72.1%

Average improvement over the best baseline: +46.9%

Why such a huge improvement?

  1. Multimodal capture: Diffusion Policy naturally handles multimodal actions — when the data contains multiple solutions, the model samples from the distribution instead of averaging.

  2. Action chunking + receding horizon: predict a whole chunk of 16 actions, execute 8, re-plan. This reduces compounding errors and produces smooth trajectories.

  3. High-dimensional stability: diffusion is more stable than direct regression, especially with large action spaces (bimanual: 14D+).

  4. Training stability: the loss landscape is smoother when predicting noise instead of the action directly, so no careful hyperparameter tuning is needed, unlike IBC and other EBM methods.

Detailed Training: Action Horizon and Observation Conditioning

Action Horizon and Chunking

Diffusion Policy uses three important hyperparameters:

Observation horizon (T_o): number of observation frames = 2
Action horizon (T_a):      number of actions to predict = 16
Action execution (T_e):    number of actions actually executed = 8

Timeline:
  t-1  t  |  t+1  t+2  ...  t+8  |  t+9  ...  t+16
  [obs ]  |  [execute these]      |  [predicted but discarded]
           ↑ re-plan at t+8

The policy predicts 16 actions but executes only 8, then re-plans with a new observation. The "predicted but discarded" part helps the model plan ahead — it knows where the trajectory should go after 8 steps, so the first 8 steps come out smoother.
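The receding-horizon loop above can be sketched as follows. `env` and `policy_sample` are hypothetical stand-ins (an environment with reset/step, and a function returning a T_a-length action chunk):

```python
from collections import deque

T_o, T_a, T_e = 2, 16, 8   # obs horizon, prediction horizon, execution horizon

def run_episode(env, policy_sample, max_steps=64):
    obs_history = deque(maxlen=T_o)
    obs = env.reset()
    obs_history.append(obs)
    steps = 0
    while steps < max_steps:
        while len(obs_history) < T_o:             # pad history at episode start
            obs_history.appendleft(obs_history[0])
        chunk = policy_sample(list(obs_history))  # predict T_a = 16 actions
        for action in chunk[:T_e]:                # execute only the first T_e = 8
            obs = env.step(action)
            obs_history.append(obs)
            steps += 1
        # the remaining 8 predicted actions are discarded; re-plan with fresh obs
    return steps

# Tiny smoke test with stand-in env/policy
class _CountingEnv:
    def reset(self): self.t = 0; return 0.0
    def step(self, a): self.t += 1; return float(self.t)

n_plans = []
steps = run_episode(_CountingEnv(), lambda h: n_plans.append(1) or [0.0] * T_a,
                    max_steps=16)
print(steps, len(n_plans))  # 16 steps executed, 2 planning calls
```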

Observation Conditioning

Visual observations are encoded by a pre-trained ResNet or ViT, then fed into the denoising network via FiLM (Feature-wise Linear Modulation) conditioning:

# FiLM conditioning: obs influences denoising at each layer
gamma = self.gamma_layer(obs_embedding)   # Scale
beta = self.beta_layer(obs_embedding)     # Shift
features = gamma * features + beta        # Modulate

This is more effective than concatenation because the obs embedding influences every layer of the network, not just the input layer.
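As an illustration, a FiLM-conditioned 1D residual block might look like this (the names are illustrative, not LeRobot's actual API):

```python
import torch
import torch.nn as nn

# Hypothetical FiLM-conditioned 1D residual block, in the spirit of the
# conditioning fragment above.
class FiLMResBlock1D(nn.Module):
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
        # One linear layer emits both scale (gamma) and shift (beta)
        self.film = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        # x: (B, C, horizon) action features, cond: (B, cond_dim) obs embedding
        h = torch.relu(self.conv1(x))
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)  # FiLM modulation
        return x + self.conv2(h)                          # residual connection

block = FiLMResBlock1D(channels=64, cond_dim=512)
out = block(torch.randn(4, 64, 16), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 64, 16])
```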

Code Example with LeRobot

LeRobot from Hugging Face provides a ready-made Diffusion Policy implementation. Here's how to train and evaluate:

# Install LeRobot
pip install lerobot

# Train Diffusion Policy on PushT task
python lerobot/scripts/train.py \
    --output_dir=outputs/train/diffusion_pusht \
    --policy.type=diffusion \
    --dataset.repo_id=lerobot/pusht \
    --seed=100000 \
    --env.type=pusht \
    --batch_size=64 \
    --steps=200000 \
    --eval_freq=25000 \
    --save_freq=25000

# Evaluate pretrained model
python lerobot/scripts/eval.py \
    --policy.path=lerobot/diffusion_pusht \
    --env.type=pusht \
    --eval.batch_size=10 \
    --eval.n_episodes=10

For a custom dataset (collected from a real robot):

"""
Train Diffusion Policy on custom robot data with LeRobot
Requires: 1x RTX 3090/4090, dataset in LeRobot format
"""
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# 1. Load dataset (RLDS or LeRobot HDF5 format)
dataset = LeRobotDataset("lerobot/aloha_mobile_cabinet")
print(f"Dataset size: {len(dataset)} frames")
print(f"Action dim: {dataset[0]['action'].shape}")
print(f"Observation keys: {dataset[0]['observation'].keys()}")

# 2. Train with config
# Config file: lerobot/configs/policy/diffusion.yaml
# Key hyperparameters:
#   noise_scheduler: DDPMScheduler (100 diffusion steps)
#   action_horizon: 16 (predict 16 future actions)
#   obs_horizon: 2 (use 2 observation frames)
#   n_action_steps: 8 (execute 8 actions)
#   vision_backbone: "resnet18" (visual encoder)
#   lr: 1e-4

# 3. Deploy on robot
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
import torch

policy = DiffusionPolicy.from_pretrained("outputs/train/diffusion_pusht/checkpoints/last/pretrained_model")
policy.eval()

# Inference loop (get_robot_observation and robot are placeholders for your
# own camera/proprioception reader and robot driver)
with torch.no_grad():
    obs = get_robot_observation()  # Camera image + proprioception
    action_chunk = policy.select_action(obs)  # [16, action_dim]
    # Execute the first 8 actions, then re-plan
    for i in range(8):
        robot.execute(action_chunk[i])

Diffusion Policy vs ACT

ACT (Action Chunking with Transformers) (Zhao et al., RSS 2023) is another prominent approach from the same period. A comparison:

                     Diffusion Policy                     ACT
Generation method    Iterative denoising (K steps)        Single forward pass (CVAE)
Multimodal actions   Native (diffusion distribution)      Via latent variable z
Inference speed      Slower (~10-50 denoising steps)      Faster (1 forward pass)
Training complexity  Noise scheduling, many hyperparams   Simpler (VAE loss + reconstruction)
Precision            Higher via iterative refinement      Good, especially for bimanual
Best use case        Complex manipulation, multi-modal    Bimanual, fine-grained tasks
Compounding error    Low (chunking + receding horizon)    Low (action chunking)

When to use Diffusion Policy? When your demonstrations are genuinely multi-modal, the task is complex manipulation, or you need the precision of iterative refinement.

When to use ACT? When inference speed matters (real-time control), for bimanual fine-grained tasks, or when you want a simpler training setup.

In practice, both are state-of-the-art and often give comparable results. Diffusion Policy has the edge on complex multi-modal tasks; ACT has the edge on speed and simplicity.

Impact and Follow-up Work

Diffusion Policy opened the door to many new research directions.

Diffusion Policy is not just a paper — it changed how the community thinks about robot learning: from "predict one action" to "generate an action distribution", from deterministic to stochastic, from single-step to trajectory-level.

