
Action Chunking Transformers (ACT): Architecture in Detail

Analyzing ACT — why predicting multiple actions together is better, CVAE encoders, and temporal ensembling.

Nguyen Anh Tuan · March 11, 2026 · 11 min read

Why Single-Step Prediction Fails

In Part 2, I introduced Behavioral Cloning — training a policy to predict one action per observation. This simple method works for many tasks but fails catastrophically on complex manipulation tasks. Why?

Problem 1: Temporal Correlation

Robot actions are not independent — the action at timestep t depends strongly on the actions at t-1, t-2, t-3, ... When each action is predicted in isolation, the policy loses temporal coherence:

Single-step prediction:
  t=0: move left (correct)
  t=1: move right (noise → wrong direction)
  t=2: move left (correct)
  → Robot shakes, not smooth

Action chunking:
  t=0: predict [left, left, left, left] (entire chunk)
  t=4: predict [down, down, down, down]
  → Robot moves smoothly

Problem 2: Multimodality

Given the same observation, an expert may act in several valid ways. Example: grasping an object from the left or from the right are both valid. Single-step BC with MSE loss averages the two modes, so the robot's hand aims straight down the middle — matching neither valid mode.
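This failure is easy to see numerically. In the toy sketch below (values illustrative), demonstrations for one observation are split between a grasp-from-left action (-1) and a grasp-from-right action (+1); the constant prediction minimizing MSE is their mean, 0, which matches neither mode:

```python
import numpy as np

# Expert actions for the same observation: two grasp from the left (-1),
# two grasp from the right (+1)
actions = np.array([-1.0, -1.0, 1.0, 1.0])

def mse(pred):
    return np.mean((actions - pred) ** 2)

# Scan constant predictions; the MSE-optimal one is the mean of the targets
candidates = np.linspace(-1.5, 1.5, 301)
best = candidates[np.argmin([mse(c) for c in candidates])]
print(best)  # 0.0 — the hand aims between the two valid grasps
```

Here MSE(c) = 1 + c² for this dataset, so the unique minimum sits exactly between the modes.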

Action Chunking with Transformers (ACT) solves both problems. Original paper: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., RSS 2023).

Bimanual robot manipulation using ALOHA system

ACT Architecture: Overview

ACT consists of 2 main components:

Training time:
  Observations (images + proprio) ──┐
  Action sequence (ground truth) ───┤
                                    ▼
                              CVAE Encoder
                                    │
                              style variable z
                                    │
                                    ▼
  Observations ─────────────→ Transformer Decoder ──→ Action chunk (k actions)
                                    ▲
                              z (style conditioning)

Inference time:
  Observations ─────────────→ Transformer Decoder ──→ Action chunk
                                    ▲
                              z = 0 (mean of prior)

Why this architecture?

  1. Action chunking: Predict k actions simultaneously (typically k=100) instead of 1 → solves temporal correlation
  2. CVAE encoder: Captures multimodality — style variable z encodes "how to do it" (left vs right)
  3. Transformer decoder: Powerful sequence model, attention between observations and action tokens

CVAE Encoder: Capturing Style

The CVAE (Conditional Variational Autoencoder) handles multimodality. The idea: there are many ways to perform a task, and each way is a "style". The CVAE encodes the style into a latent variable z.

Training phase

Input:
  - Observation: camera images + joint positions
  - Action sequence: ground truth actions (k steps)

CVAE Encoder:
  1. Concatenate [CLS token, action tokens] (k+1 tokens)
  2. Cross-attend to observation features
  3. CLS token output → MLP → μ, σ (Gaussian parameters)
  4. Sample z ~ N(μ, σ²)

Loss:
  L = L_reconstruction + β × KL(q(z|o,a) || p(z))

  L_reconstruction = ||predicted_actions - ground_truth_actions||²
  KL regularization: push z distribution near N(0, I)
  β = 10 (in original paper, quite large → force z meaningful)
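The sampling in step 4 uses the standard reparameterization trick so the encoder stays differentiable; a minimal sketch of steps 3-4 and the KL term (function names and shapes are illustrative):

```python
import torch

def sample_z(mu, logvar):
    """Step 4: reparameterization trick, z = mu + sigma * eps with eps ~ N(0, I).

    Sampling this way keeps gradients flowing through mu and logvar.
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()

# With mu = 0 and sigma = 1 the posterior equals the prior, so KL = 0
mu = torch.zeros(8, 32)      # batch of 8, latent_dim = 32
logvar = torch.zeros(8, 32)
print(kl_to_standard_normal(mu, logvar).item())  # 0.0
```

The β weight then scales this KL term against the reconstruction loss.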

Inference phase

At deployment there are no ground-truth actions, so the encoder is not run. Instead, use z = 0 (the mean of the prior distribution N(0, I)). This works because the KL term pushed the posterior toward N(0, I) during training, so z = 0 represents the "average" style and yields a deterministic, representative action sequence.

Why not just use Gaussian Mixture Model?

A GMM also captures multimodality, but the number of modes must be fixed in advance, and fitting a mixture over long, high-dimensional action sequences scales poorly. A CVAE instead learns a continuous latent "style" space and trains end-to-end with a simple reconstruction + KL objective.

Transformer Decoder: Generating Action Chunks

The transformer decoder takes the observation features and the style variable z and generates k sequential actions.

Architecture details

# Simplified ACT decoder architecture
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ACTDecoder(nn.Module):
    def __init__(
        self,
        action_dim=14,        # 7 per arm × 2 arms (bimanual)
        chunk_size=100,       # Predict 100 future actions
        hidden_dim=512,
        n_heads=8,
        n_layers=4,
        latent_dim=32,        # CVAE latent dimension
    ):
        super().__init__()

        # Observation processing: ResNet18 backbone without avgpool/fc,
        # so it outputs a spatial feature map
        backbone = resnet18()
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proprio_proj = nn.Linear(14, hidden_dim)

        # Style conditioning
        self.style_proj = nn.Linear(latent_dim, hidden_dim)

        # Learnable action queries (like DETR object queries)
        self.action_queries = nn.Embedding(chunk_size, hidden_dim)

        # Transformer decoder layers
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(
                d_model=hidden_dim,
                nhead=n_heads,
                dim_feedforward=2048,
                batch_first=True,
            ),
            num_layers=n_layers,
        )

        # Action prediction head
        self.action_head = nn.Linear(hidden_dim, action_dim)

Forward pass

1. Process observations:
   - Images → ResNet18 → flatten feature map → visual tokens (49 tokens here, one per spatial position of the final feature map)
   - Joint positions → linear projection → 1 proprio token
   - Style z → linear projection → 1 style token

2. Memory = [visual_tokens, proprio_token, style_token]  (51 tokens)

3. Queries = learnable action_queries  (100 tokens)

4. Transformer Decoder:
   - Self-attention: action queries attend to each other
   - Cross-attention: action queries attend to memory (observations + style)
   - 4 layers

5. Output: 100 action predictions, each a 14-D vector
   (7 per arm: 6 joint positions + 1 gripper command)

Important point: the action queries are learnable — the model learns a positional encoding for each timestep within the chunk. Query 0 "knows" it predicts the immediate action; query 99 "knows" it predicts the far future.
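Putting steps 1-5 together, here is a sketch of the decoder forward pass for an ACTDecoder-like module (the function name, input shapes, and stand-alone style are illustrative; the real implementation adds details such as positional embeddings):

```python
import torch

def act_decoder_forward(model, images, qpos, z):
    """Sketch of the ACT decoder forward pass.

    model:  an ACTDecoder-like module (attributes as defined above)
    images: (B, 3, H, W), qpos: (B, 14), z: (B, latent_dim)
    """
    B = qpos.shape[0]

    # 1. Observation tokens
    feat = model.image_encoder(images)               # (B, 512, h, w) feature map
    visual = feat.flatten(2).transpose(1, 2)         # (B, h*w, hidden_dim)
    proprio = model.proprio_proj(qpos).unsqueeze(1)  # (B, 1, hidden_dim)
    style = model.style_proj(z).unsqueeze(1)         # (B, 1, hidden_dim)

    # 2. Memory = [visual_tokens, proprio_token, style_token]
    memory = torch.cat([visual, proprio, style], dim=1)

    # 3. Queries = learnable action queries, one per chunk timestep
    queries = model.action_queries.weight.unsqueeze(0).expand(B, -1, -1)

    # 4. Self-attention among queries, cross-attention to memory
    hidden = model.decoder(tgt=queries, memory=memory)  # (B, chunk_size, hidden_dim)

    # 5. Project each query to an action
    return model.action_head(hidden)                    # (B, chunk_size, action_dim)
```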

Temporal Ensembling: Smooth Execution

Action chunking solves temporal correlation but creates a new problem: chunk transitions. When one chunk ends and the next begins, the motion can "glitch" (a discontinuity).

How it works

Instead of executing a whole chunk and only then predicting the next one, ACT predicts a new chunk at every timestep and blends the overlapping predictions with exponential weighting:

Timestep t:
  Chunk predicted at t-2:  [_, _, a_t^(t-2), a_{t+1}^(t-2), ...]
  Chunk predicted at t-1:  [_, a_t^(t-1), a_{t+1}^(t-1), ...]
  Chunk predicted at t:    [a_t^(t), a_{t+1}^(t), ...]

  Action to execute = weighted_average(a_t^(t-2), a_t^(t-1), a_t^(t))

  Weights: w_i = exp(-m × i), where i = 0 indexes the oldest prediction and m > 0
  → Older predictions carry slightly more weight, keeping motion smooth
  → A smaller m incorporates new observations faster
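A minimal NumPy sketch of the blend at one timestep, following the paper's convention w_i = exp(-m·i) with i = 0 indexing the oldest prediction (the helper name and shapes are illustrative):

```python
import numpy as np

def ensemble_action(chunks, m=0.01):
    """Blend each stored chunk's prediction for the *current* timestep.

    chunks is ordered oldest -> newest; chunks[i] was predicted
    (len(chunks) - 1 - i) steps ago, so its prediction for "now" sits
    at offset (len(chunks) - 1 - i). Per the ACT paper, w_i = exp(-m * i)
    with i = 0 the oldest chunk.
    """
    n = len(chunks)
    weights = np.exp(-m * np.arange(n))
    weights /= weights.sum()
    current = np.stack([chunks[i][n - 1 - i] for i in range(n)])
    return weights @ current

# Three overlapping chunks, each predicting 4 future 1-D actions
chunks = [np.full((4, 1), v) for v in (0.0, 0.5, 1.0)]  # oldest -> newest
print(ensemble_action(chunks))  # just under 0.5 — older chunks weigh slightly more
```

With m = 0.01 the weights are nearly uniform, so the blend stays close to the mean while still favoring the committed trajectory.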

Why exponential weighting?

Exponential weights decay smoothly and expose a single hyperparameter m, giving a direct trade-off between smoothness (trusting earlier, committed predictions) and reactivity (trusting chunks computed from fresher observations). In the original paper, temporal ensembling improves success rates by 5-10 percentage points on hard tasks — especially important for contact-rich manipulation.

Temporal ensembling for smooth robot control

Training Pipeline

Data format (ALOHA)

ACT was designed for the ALOHA system — two robot arms, each with 6 DoF plus a gripper:

# Each episode in the dataset (T = number of timesteps)
episode = {
    "observations": {
        "images": {
            "cam_high": np.zeros((T, 480, 640, 3)),        # Top camera
            "cam_left_wrist": np.zeros((T, 480, 640, 3)),  # Left wrist camera
            "cam_right_wrist": np.zeros((T, 480, 640, 3)), # Right wrist camera
        },
        "qpos": np.zeros((T, 14)),  # Joint positions: 7 left + 7 right
        "qvel": np.zeros((T, 14)),  # Joint velocities
    },
    "actions": np.zeros((T, 14)),   # Target joint positions
}
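During training, each episode is sliced into (observation at t, actions t..t+k) pairs. Below is a sketch of the target extraction, using a simple repeat-last-action padding near the episode end (the function name and padding convention are illustrative; the original implementation pads and tracks an is_pad mask instead):

```python
import numpy as np

def make_chunk_targets(actions, t, chunk_size=100):
    """Extract the k-step action target starting at timestep t.

    Near the end of an episode the chunk is padded by repeating the
    last action (one simple convention; real loaders often mask padding).
    """
    chunk = actions[t : t + chunk_size]
    if chunk.shape[0] < chunk_size:
        pad = np.repeat(chunk[-1:], chunk_size - chunk.shape[0], axis=0)
        chunk = np.concatenate([chunk, pad], axis=0)
    return chunk

actions = np.arange(10).reshape(10, 1).astype(np.float32)  # T=10, action_dim=1
print(make_chunk_targets(actions, t=7, chunk_size=5).ravel())  # [7. 8. 9. 9. 9.]
```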

Training configuration

# Hyperparameters from paper
config = {
    "chunk_size": 100,        # Predict 100 future actions
    "hidden_dim": 512,
    "n_heads": 8,
    "n_encoder_layers": 4,    # CVAE encoder
    "n_decoder_layers": 7,    # Transformer decoder
    "latent_dim": 32,         # CVAE latent dimension
    "kl_weight": 10,          # β for KL loss
    "lr": 1e-5,
    "batch_size": 8,
    "epochs": 2000,
    "backbone": "resnet18",
    "temporal_agg": True,     # Temporal ensembling
    "temporal_agg_m": 0.01,   # Exponential decay factor
}

Loss function

import torch
import torch.nn.functional as F

def compute_loss(model, batch, kl_weight=10):
    """ACT training loss = reconstruction + KL divergence."""
    observations = batch["observations"]
    gt_actions = batch["actions"]  # (B, chunk_size, action_dim)

    # Forward pass — encoder receives both observations and gt_actions
    pred_actions, mu, logvar = model(observations, gt_actions)

    # Reconstruction loss: L1 instead of MSE (robust to outliers)
    l1_loss = F.l1_loss(pred_actions, gt_actions)

    # KL divergence: push posterior near prior N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    kl_loss = kl_loss / (mu.shape[0] * mu.shape[1])  # normalize

    total_loss = l1_loss + kl_weight * kl_loss

    return total_loss, {
        "l1_loss": l1_loss.item(),
        "kl_loss": kl_loss.item(),
        "total_loss": total_loss.item(),
    }

Practical: Train ACT with LeRobot

LeRobot (Hugging Face) is the best open-source framework for training and deploying ACT, wrapping the entire pipeline into simple CLI commands.

Installation

pip install lerobot
# Or from source for latest features
git clone https://github.com/huggingface/lerobot.git
cd lerobot && pip install -e ".[all]"

Train ACT on ALOHA dataset

# Download dataset and train ACT — 1 command only
lerobot-train \
    --policy.type=act \
    --dataset.repo_id=lerobot/aloha_sim_insertion_human \
    --training.num_epochs=2000 \
    --training.batch_size=8 \
    --policy.chunk_size=100 \
    --policy.n_action_steps=100 \
    --policy.kl_weight=10 \
    --output_dir=outputs/act_insertion

Train on custom dataset

"""
Train ACT on self-collected data with LeRobot.
Requirements: pip install lerobot torch
"""
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.modeling_act import ACTPolicy
from lerobot.common.policies.act.configuration_act import ACTConfig

# 1. Load dataset (LeRobot format on HuggingFace Hub)
dataset = LeRobotDataset(
    repo_id="your-username/your-robot-dataset",
    split="train",
)
print(f"Dataset: {len(dataset)} frames, {dataset.num_episodes} episodes")

# 2. Configure ACT policy
config = ACTConfig(
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.state": [14],  # Joint positions
    },
    output_shapes={
        "action": [14],  # Target joint positions
    },
    input_normalization_modes={
        "observation.images.top": "mean_std",
        "observation.state": "mean_std",
    },
    output_normalization_modes={
        "action": "mean_std",
    },
    chunk_size=100,
    n_action_steps=100,
    dim_model=512,
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=7,
    dim_feedforward=2048,
    latent_dim=32,
    use_vae=True,           # Enable CVAE encoder
    kl_weight=10.0,
    temporal_ensemble_coeff=0.01,  # Temporal ensembling
)

# 3. Create policy
policy = ACTPolicy(config=config, dataset_stats=dataset.stats)

# 4. Training loop
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = policy.to(device)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5, weight_decay=1e-4)

dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

for epoch in range(2000):
    epoch_loss = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        output = policy.forward(batch)
        loss = output["loss"]

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 10.0)
        optimizer.step()

        epoch_loss += loss.item()

    if epoch % 100 == 0:
        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch}: loss={avg_loss:.6f}")

# 5. Save policy
policy.save_pretrained("outputs/act_custom/")
print("Training complete!")

Evaluate and deploy

"""Run ACT policy on robot or simulation."""
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load trained policy
policy = ACTPolicy.from_pretrained("outputs/act_custom/")
policy.eval()

# Inference loop with temporal ensembling
action_queue = []  # Buffer for temporal ensembling

for step in range(max_steps):
    observation = robot.get_observation()

    # Policy predict chunk of actions
    with torch.no_grad():
        action_chunk = policy.select_action(observation)
        # action_chunk shape: (chunk_size, action_dim)

    # Temporal ensembling
    action_queue.append(action_chunk)
    if len(action_queue) > max_queue_len:
        action_queue.pop(0)

    # Weighted average
    weights = [
        np.exp(-0.01 * i)
        for i in range(len(action_queue) - 1, -1, -1)
    ]
    weights = np.array(weights) / sum(weights)

    action = sum(w * q[len(action_queue) - 1 - i]
                 for i, (w, q) in enumerate(zip(weights, action_queue)))

    robot.execute_action(action)

Results and Comparison

ACT achieves impressive results on ALOHA bimanual tasks — tasks that standard BC barely solves:

| Task           | BC (single-step) | ACT (no temporal agg) | ACT (full) |
|----------------|------------------|-----------------------|------------|
| Slot insertion | 10%              | 80%                   | 96%        |
| Transfer cube  | 2%               | 72%                   | 90%        |
| Thread zip tie | 0%               | 40%                   | 52%        |
| Open cup       | 0%               | 55%                   | 68%        |

Notes: temporal ensembling adds a further 12-18 percentage points on top of chunking alone, while plain single-step BC is nearly unusable on these tasks.

Comparison with other methods

| Method           | Demos needed | Train time | Bimanual  | Fine-grained |
|------------------|--------------|------------|-----------|--------------|
| BC (MLP)         | 50           | 10 min     | Poor      | Poor         |
| BC (Transformer) | 50           | 30 min     | Average   | Average      |
| Diffusion Policy | 50           | 2 hours    | Good      | Good         |
| ACT              | 50           | 1 hour     | Very good | Very good    |
| ACT + Diffusion  | 50           | 3 hours    | Very good | Very good    |

ACT balances performance and training cost well. Diffusion Policy may be better for some tasks, but trains and infers slower.

Key Takeaways

  1. Action chunking solves temporal correlation — predicting k actions together maintains coherence
  2. CVAE encoder captures multimodality — style variable z encodes "how to do it"
  3. Temporal ensembling for smooth execution — blend multiple chunks with exponential weighting
  4. 50 demonstrations sufficient for many manipulation tasks — don't need thousands of demos
  5. LeRobot is the fastest way to start — download a dataset, train, and deploy in one day

Next Steps

ACT is state-of-the-art for imitation learning from demonstrations. But if you want a robot to follow language instructions ("pick up the red cup"), you need foundation models — see Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice to understand how to combine vision, language, and action in one model.

