Why Single-Step Prediction Fails
In Part 2, I introduced Behavioral Cloning — training a policy to predict one action per observation. This simple method works for many tasks, but fails catastrophically on complex manipulation. Why?
Problem 1: Temporal Correlation
Robot actions are not independent — the action at timestep t strongly depends on the actions at t-1, t-2, t-3, and so on. When each action is predicted independently, the policy loses temporal coherence:
Single-step prediction:

```
t=0: move left   (correct)
t=1: move right  (noise → wrong direction)
t=2: move left   (correct)
→ Robot shakes, not smooth
```

Action chunking:

```
t=0: predict [left, left, left, left]  (entire chunk)
t=4: predict [down, down, down, down]
→ Robot moves smoothly
```
Problem 2: Multimodality
Given the same observation, an expert can perform the task in multiple valid ways. Example: grasping an object from the left or from the right are both valid. Single-step BC with MSE loss averages the two modes → the robot's hand goes straight down the middle, matching neither mode.
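A quick numeric sketch of this mode-averaging failure (illustrative only — scalar "actions" and two equally valid expert modes are my assumptions here, not the paper's setup):

```python
import numpy as np

# Two expert demonstrations for the same observation:
# grasp from the left (-1.0) or from the right (+1.0) — both valid.
expert_actions = np.array([-1.0, 1.0])

# The MSE-optimal single prediction is the mean of the targets:
# it lands exactly between the two modes, matching neither.
mse_optimal = expert_actions.mean()
```

Minimizing squared error to multimodal targets always collapses to the mean — the geometry of the loss, not a bug in the network.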
Action Chunking with Transformers (ACT) solves both problems. Original paper: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., RSS 2023).
ACT Architecture: Overview
ACT consists of 2 main components:
Training time:

```
Observations (images + proprio) ──┐
Action sequence (ground truth) ───┤
                                  ▼
                            CVAE Encoder
                                  │
                          style variable z
                                  │
                                  ▼
Observations ──→ Transformer Decoder ──→ Action chunk (k actions)
                 (z as style conditioning)
```

Inference time:

```
Observations ──→ Transformer Decoder ──→ Action chunk
                         ▲
                 z = 0 (mean of prior)
```
Why this architecture?
- Action chunking: Predict k actions simultaneously (typically k=100) instead of 1 → solves temporal correlation
- CVAE encoder: Captures multimodality — style variable z encodes "how to do it" (left vs right)
- Transformer decoder: Powerful sequence model, attention between observations and action tokens
CVAE Encoder: Capturing Style
The CVAE (Conditional Variational Autoencoder) handles multimodality. The idea: there are many valid ways to perform a task, and each way is a "style". The CVAE encodes the style into a latent variable z.
Training phase
Input:
- Observation: camera images + joint positions
- Action sequence: ground truth actions (k steps)
CVAE Encoder:
1. Concatenate [CLS token, action tokens] (k+1 tokens)
2. Cross-attend to observation features
3. CLS token output → MLP → μ, σ (Gaussian parameters)
4. Sample z ~ N(μ, σ²)
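Step 4 uses the standard VAE reparameterization trick so the sampling step stays differentiable. A minimal numpy sketch (the batch size and random stand-ins for the CLS-token MLP outputs are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, latent_dim = 8, 32

# Stand-ins for the mu, log-variance outputs of the CLS-token MLP
mu = rng.normal(size=(batch, latent_dim))
logvar = rng.normal(size=(batch, latent_dim))

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
# Gradients flow through mu and logvar; randomness is isolated in eps.
eps = rng.normal(size=(batch, latent_dim))
z = mu + np.exp(0.5 * logvar) * eps
```

Without this trick, backpropagation could not pass through the sampling operation into the encoder.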
Loss:

```
L = L_reconstruction + β × KL(q(z|o,a) || p(z))
L_reconstruction = ||predicted_actions - ground_truth_actions||²
```

KL regularization: pushes the z distribution toward N(0, I)
β = 10 in the original paper — a fairly strong regularization that keeps the posterior close to the prior, so z behaves predictably at inference
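The KL term between a diagonal Gaussian posterior and the standard normal prior has a closed form — the same expression used in the loss code later in this post. A quick numpy sanity check (function name is mine):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

# Posterior identical to the prior → KL is exactly zero
kl_at_prior = kl_to_standard_normal(np.zeros(32), np.zeros(32))

# Any deviation from the prior is penalized with a positive KL
kl_shifted = kl_to_standard_normal(np.ones(32), np.zeros(32))
```

This is why z = 0 is a sensible choice at inference: the training objective actively pulls the posterior toward N(0, I).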
Inference phase
At deployment, no ground truth actions → don't run encoder. Instead, use z = 0 (mean of prior distribution N(0, I)). This works because:
- KL regularization pushes posterior near prior
- z = 0 gives "average style" — most common behavior in demos
- To get diverse behavior, can sample z ~ N(0, I)
Why not just use Gaussian Mixture Model?
GMM also captures multimodality, but:
- Must choose number of modes K beforehand — don't know how many ways exist
- Doesn't scale to high action dimensions (7D × 100 steps = 700D)
- CVAE learns continuous latent space — smooth interpolation between styles
Transformer Decoder: Generating Action Chunks
Transformer decoder takes observation features and style variable z, generates k sequential actions.
Architecture details
```python
# Simplified ACT decoder architecture (sketch — ResNet18 stands in for a
# torchvision backbone with the classification head removed)
import torch.nn as nn

class ACTDecoder(nn.Module):
    def __init__(
        self,
        action_dim=14,   # 7 per arm × 2 arms (bimanual)
        chunk_size=100,  # Predict 100 future actions
        hidden_dim=512,
        n_heads=8,
        n_layers=4,
        latent_dim=32,   # CVAE latent dimension
    ):
        super().__init__()
        # Observation processing
        self.image_encoder = ResNet18()  # Visual features (placeholder backbone)
        self.proprio_proj = nn.Linear(14, hidden_dim)
        # Style conditioning
        self.style_proj = nn.Linear(latent_dim, hidden_dim)
        # Learnable action queries (like DETR object queries)
        self.action_queries = nn.Embedding(chunk_size, hidden_dim)
        # Transformer decoder layers
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(
                d_model=hidden_dim,
                nhead=n_heads,
                dim_feedforward=2048,
                batch_first=True,
            ),
            num_layers=n_layers,
        )
        # Action prediction head
        self.action_head = nn.Linear(hidden_dim, action_dim)
```
Forward pass
1. Process observations:
- Images → ResNet18 → flatten spatial features → visual tokens (49 tokens for a 224×224 input, 1 token per spatial location)
- Joint positions → linear projection → 1 proprio token
- Style z → linear projection → 1 style token
2. Memory = [visual_tokens, proprio_token, style_token] (51 tokens)
3. Queries = learnable action_queries (100 tokens)
4. Transformer Decoder:
- Self-attention: action queries attend to each other
- Cross-attention: action queries attend to memory (observations + style)
- 4 layers
5. Output: 100 action predictions, each 14D vector
(7D per arm: x, y, z, rx, ry, rz, gripper)
Important point: action queries are learnable — model learns position encoding for each timestep within chunk. Query 0 "knows" it predicts immediate action, query 99 "knows" it predicts far future action.
Temporal Ensembling: Smooth Execution
Action chunking solves temporal correlation, but it creates a new problem: chunk transitions. When one chunk finishes and the next begins, the handoff can produce a "glitch" (a discontinuity in motion).
How it works
Instead of executing the whole chunk and then predicting a new one, ACT predicts a new chunk at every timestep and blends the overlapping predictions with exponential weighting:
Timestep t:

```
Chunk predicted at t-2: [_, _, a_t^(t-2), a_{t+1}^(t-2), ...]
Chunk predicted at t-1: [_,    a_t^(t-1), a_{t+1}^(t-1), ...]
Chunk predicted at t:   [      a_t^(t),   a_{t+1}^(t),   ...]

Action to execute = weighted_average(a_t^(t-2), a_t^(t-1), a_t^(t))
Weights: w_i = exp(-m × i), with i = 0 for the most recent chunk and m > 0
→ Most recent chunk has the highest weight
→ Older chunks gradually fade in influence
```
Why exponential weighting?
- Most recent chunk has newest observations → most accurate information → high weight
- Older chunks based on old observations → less relevant but already committed → keep small portion for smoothness
- m (temporal weight) is hyperparameter: large m = reactive (only newest chunk), small m = smooth (blend many chunks)
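The weighting above takes only a few lines of numpy (index 0 = most recent chunk, following this post's convention; the function name is mine):

```python
import numpy as np

def ensemble_weights(n_chunks, m=0.01):
    """Normalized exponential weights over overlapping chunk predictions.
    Index 0 is the most recent chunk, so it receives the largest weight."""
    w = np.exp(-m * np.arange(n_chunks))
    return w / w.sum()

# Three overlapping chunks: newest > middle > oldest, weights sum to 1
w = ensemble_weights(3)
```

With m = 0.01 the decay is very gentle — three overlapping chunks receive nearly equal weights — which is why increasing m makes the policy noticeably more reactive.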
In the original paper's ablations, temporal ensembling noticeably increases success rates on hard tasks — it is especially important for contact-rich manipulation.
Training Pipeline
Data format (ALOHA)
ACT designed for ALOHA system — 2 robot arms, each 6-DoF + 1 gripper:
```python
# Each episode in the dataset (T = episode length; np.zeros used to show shapes —
# np.array((T, ...)) would create an array *containing* the tuple, not this shape)
episode = {
    "observations": {
        "images": {
            "cam_high":        np.zeros((T, 480, 640, 3), dtype=np.uint8),  # Top camera
            "cam_left_wrist":  np.zeros((T, 480, 640, 3), dtype=np.uint8),  # Left wrist camera
            "cam_right_wrist": np.zeros((T, 480, 640, 3), dtype=np.uint8),  # Right wrist camera
        },
        "qpos": np.zeros((T, 14)),  # Joint positions: 7 left + 7 right
        "qvel": np.zeros((T, 14)),  # Joint velocities
    },
    "actions": np.zeros((T, 14)),  # Target joint positions
}
```
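During training, each sample pairs the observation at a random timestep t with the next k actions. Near the end of an episode the chunk must be padded — repeating the last action is a common choice, and an assumption of this sketch (function name is mine):

```python
import numpy as np

def sample_action_chunk(actions, t, k=100):
    """Slice actions[t : t+k]; pad by repeating the final action
    if fewer than k steps remain in the episode."""
    chunk = actions[t : t + k]
    if len(chunk) < k:
        pad = np.repeat(chunk[-1:], k - len(chunk), axis=0)
        chunk = np.concatenate([chunk, pad], axis=0)
    return chunk  # shape (k, action_dim)

T = 300
actions = np.zeros((T, 14))
chunk = sample_action_chunk(actions, t=250)  # only 50 real steps left → 50 padded
```

Frameworks like LeRobot handle this slicing for you, but it is useful to know what a training pair actually looks like.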
Training configuration
```python
# Hyperparameters from the paper
config = {
    "chunk_size": 100,      # Predict 100 future actions
    "hidden_dim": 512,
    "n_heads": 8,
    "n_encoder_layers": 4,  # CVAE encoder
    "n_decoder_layers": 7,  # Transformer decoder
    "latent_dim": 32,       # CVAE latent dimension
    "kl_weight": 10,        # β for KL loss
    "lr": 1e-5,
    "batch_size": 8,
    "epochs": 2000,
    "backbone": "resnet18",
    "temporal_agg": True,   # Temporal ensembling
    "temporal_agg_m": 0.01, # Exponential decay factor
}
```
Loss function
```python
import torch
import torch.nn.functional as F

def compute_loss(model, batch, kl_weight=10):
    """ACT training loss = reconstruction + KL divergence."""
    observations = batch["observations"]
    gt_actions = batch["actions"]  # (B, chunk_size, action_dim)

    # Forward pass — encoder receives both observations and gt_actions
    pred_actions, mu, logvar = model(observations, gt_actions)

    # Reconstruction loss: L1 instead of MSE (robust to outliers)
    l1_loss = F.l1_loss(pred_actions, gt_actions)

    # KL divergence: push posterior near prior N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    kl_loss = kl_loss / (mu.shape[0] * mu.shape[1])  # normalize by batch × latent dim

    total_loss = l1_loss + kl_weight * kl_loss
    return total_loss, {
        "l1_loss": l1_loss.item(),
        "kl_loss": kl_loss.item(),
        "total_loss": total_loss.item(),
    }
```
Practical: Train ACT with LeRobot
LeRobot (Hugging Face) is an excellent open-source framework for training and deploying ACT, wrapping the entire pipeline into simple CLI commands.
Installation
```shell
pip install lerobot

# Or from source for latest features
git clone https://github.com/huggingface/lerobot.git
cd lerobot && pip install -e ".[all]"
```
Train ACT on ALOHA dataset
```shell
# Download the dataset and train ACT — a single command
lerobot-train \
    --policy.type=act \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --training.num_epochs=2000 \
    --training.batch_size=8 \
    --policy.chunk_size=100 \
    --policy.n_action_steps=100 \
    --policy.kl_weight=10 \
    --output_dir=outputs/act_insertion
```
Train on custom dataset
"""
Train ACT on self-collected data with LeRobot.
Requirements: pip install lerobot torch
"""
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.modeling_act import ACTPolicy
from lerobot.common.policies.act.configuration_act import ACTConfig
# 1. Load dataset (LeRobot format on HuggingFace Hub)
dataset = LeRobotDataset(
repo_id="your-username/your-robot-dataset",
split="train",
)
print(f"Dataset: {len(dataset)} frames, {dataset.num_episodes} episodes")
# 2. Configure ACT policy
config = ACTConfig(
input_shapes={
"observation.images.top": [3, 480, 640],
"observation.state": [14], # Joint positions
},
output_shapes={
"action": [14], # Target joint positions
},
input_normalization_modes={
"observation.images.top": "mean_std",
"observation.state": "mean_std",
},
output_normalization_modes={
"action": "mean_std",
},
chunk_size=100,
n_action_steps=100,
dim_model=512,
n_heads=8,
n_encoder_layers=4,
n_decoder_layers=7,
dim_feedforward=2048,
latent_dim=32,
use_vae=True, # Enable CVAE encoder
kl_weight=10.0,
temporal_ensemble_coeff=0.01, # Temporal ensembling
)
# 3. Create policy
policy = ACTPolicy(config=config, dataset_stats=dataset.stats)
# 4. Training loop
import torch
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = policy.to(device)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5, weight_decay=1e-4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
for epoch in range(2000):
epoch_loss = 0
for batch in dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
output = policy.forward(batch)
loss = output["loss"]
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 10.0)
optimizer.step()
epoch_loss += loss.item()
if epoch % 100 == 0:
avg_loss = epoch_loss / len(dataloader)
print(f"Epoch {epoch}: loss={avg_loss:.6f}")
# 5. Save policy
policy.save_pretrained("outputs/act_custom/")
print("Training complete!")
Evaluate and deploy
"""Run ACT policy on robot or simulation."""
from lerobot.common.policies.act.modeling_act import ACTPolicy
# Load trained policy
policy = ACTPolicy.from_pretrained("outputs/act_custom/")
policy.eval()
# Inference loop with temporal ensembling
action_queue = [] # Buffer for temporal ensembling
for step in range(max_steps):
observation = robot.get_observation()
# Policy predict chunk of actions
with torch.no_grad():
action_chunk = policy.select_action(observation)
# action_chunk shape: (chunk_size, action_dim)
# Temporal ensembling
action_queue.append(action_chunk)
if len(action_queue) > max_queue_len:
action_queue.pop(0)
# Weighted average
weights = [
np.exp(-0.01 * i)
for i in range(len(action_queue) - 1, -1, -1)
]
weights = np.array(weights) / sum(weights)
action = sum(w * q[len(action_queue) - 1 - i]
for i, (w, q) in enumerate(zip(weights, action_queue)))
robot.execute_action(action)
Results and Comparison
ACT achieves impressive results on ALOHA bimanual tasks — tasks that standard BC barely solves:
| Task | BC (single-step) | ACT (no temporal agg) | ACT (full) |
|---|---|---|---|
| Slot insertion | 10% | 80% | 96% |
| Transfer cube | 2% | 72% | 90% |
| Thread zip tie | 0% | 40% | 52% |
| Open cup | 0% | 55% | 68% |
Notes:
- Single-step BC fails almost completely on bimanual tasks — the distribution shift is too severe
- Action chunking alone (without temporal ensembling) already boosts performance significantly
- Temporal ensembling adds another 12-18% success rate on top, depending on the task
Comparison with other methods
| Method | Demos needed | Train time | Bimanual | Fine-grained |
|---|---|---|---|---|
| BC (MLP) | 50 | 10 min | Poor | Poor |
| BC (Transformer) | 50 | 30 min | Average | Average |
| Diffusion Policy | 50 | 2 hours | Good | Good |
| ACT | 50 | 1 hour | Very good | Very good |
| ACT + Diffusion | 50 | 3 hours | Very good | Very good |
ACT balances performance and training cost well. Diffusion Policy may be better for some tasks, but it is slower both to train and to run at inference time.
Key Takeaways
- Action chunking solves temporal correlation — predict k actions together maintains coherence
- CVAE encoder captures multimodality — style variable z encodes "how to do it"
- Temporal ensembling for smooth execution — blend multiple chunks with exponential weighting
- 50 demonstrations sufficient for many manipulation tasks — don't need thousands of demos
- LeRobot is fastest way to start — download dataset, train, deploy in one day
Next Steps
ACT is state-of-the-art for imitation learning from demonstrations. But if you want a robot to understand language instructions ("pick up the red cup"), you need foundation models — see Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice to learn how to combine vision, language, and action in one model.
Related Posts
- RL for Robotics: PPO, SAC and How to Choose Algorithm — Part 1: RL foundation for robots
- Imitation Learning: BC, DAgger and DAPG for Robot — Part 2: Imitation learning basics
- Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice — Next step: vision-language-action models
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robot — Transfer ACT policy from sim to real