
Pi0-FAST: 5x Faster Autoregressive VLA

Guide to training Pi0-FAST in LeRobot — from training the FAST tokenizer, fine-tuning the model, to inference with KV-caching.

Nguyễn Anh Tuấn · April 10, 2026 · 12 min read

Pi0-FAST: When Autoregressive Beats Diffusion

In the previous post, we explored SmolVLA — a compact VLA that runs on consumer GPUs. Now, let's switch to an entirely different approach: Pi0-FAST, a model that combines the PaliGemma backbone with the FAST action tokenizer to achieve inference speeds 5 times faster than the diffusion-based original Pi0.

Why does this matter? Because in robotics, inference speed determines reactivity. A model running at 5 Hz (5 predictions/second) responds significantly slower than one running at 25 Hz. In tasks requiring high precision — assembling, pouring, inserting — this difference separates success from failure.
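To see what those rates mean for reactivity, it helps to convert them into control periods — plain arithmetic, no LeRobot code involved:

```python
def control_period_ms(hz: float) -> float:
    """Time between consecutive action predictions, in milliseconds."""
    return 1000.0 / hz

print(control_period_ms(5))   # 200.0 — a 5 Hz policy reacts every 200 ms
print(control_period_ms(25))  # 40.0 — a 25 Hz policy reacts every 40 ms
```

At 5 Hz the robot commits to each prediction for a fifth of a second; at 25 Hz it can correct course five times in that same window.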

This article will guide you step by step: understanding why the FAST tokenizer solves problems that standard binning cannot, how to train a custom tokenizer, fine-tune Pi0-FAST, and deploy with KV-caching.


What is Pi0-FAST?

The Problem with Original Pi0

Pi0 (Physical Intelligence's model) uses flow matching — a type of diffusion process — to generate robot actions. Flow matching works by starting from random noise and iteratively denoising it into an action trajectory. This process requires multiple denoising steps (typically 10-50 steps), each requiring a forward pass through the neural network.

Result: the original Pi0 runs at about 5 Hz — sufficient for many tasks, but too slow for high-precision, highly reactive manipulation such as assembling, pouring, or inserting.

The Solution: Autoregressive + FAST Tokenizer

Pi0-FAST replaces flow matching with autoregressive decoding — similar to how language models (GPT, LLaMA) generate text token-by-token. But there's a problem: robot actions are continuous values (joint angles, positions), not discrete tokens.

This is where the FAST tokenizer comes in.

FAST Tokenizer: Turning Continuous Actions into Tokens

FAST (Frequency-space Action Sequence Tokenization) solves the tokenization problem for robot actions with a 5-step pipeline:

Step 1 — Normalize: Standardize action values to range [-1, 1]

raw_actions = [0.15, -0.32, 1.47, 0.003, ...]
normalized = [-0.7, 0.2, 0.95, -0.99, ...]

Step 2 — DCT (Discrete Cosine Transform): Convert the action sequence from time domain to frequency domain. Similar to how JPEG compresses images using DCT, FAST compresses action trajectories.

# Action chunk [10 timesteps x 7 joints] = 70 values
# After DCT, keep only top-K frequency components
# Example: K=20 → compression ratio 3.5x

DCT works because robot actions are typically smooth — no sudden jumps between consecutive timesteps. High-frequency components (containing noise, jitter) can be discarded without losing important information.
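The DCT round-trip behind this step can be sketched in plain NumPy. This is an illustrative toy, not LeRobot's actual implementation (which also scales and quantizes the coefficients); the trajectory, shapes, and coefficient counts are made up for the demo:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: rows are cosine basis vectors."""
    k = np.arange(n)[:, None]      # frequency index
    t = np.arange(n)[None, :]      # time index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    d[0] /= np.sqrt(2.0)           # DC row rescaled for orthonormality
    return d

def dct_compress(actions: np.ndarray, keep: int) -> np.ndarray:
    """Round-trip a [T x J] action chunk through the DCT, keeping only
    the `keep` lowest-frequency coefficients per joint."""
    d = dct_matrix(actions.shape[0])
    coeffs = d @ actions           # time -> frequency, per joint column
    coeffs[keep:] = 0.0            # discard high-frequency components
    return d.T @ coeffs            # frequency -> time (inverse = transpose)

# A smooth 10-step, 7-joint synthetic trajectory: mostly low-frequency content.
T, J = 10, 7
t = np.linspace(0.0, 1.0, T)[:, None]
actions = np.sin(np.pi * t * (np.arange(J) + 1) / 3.0)

for keep in (3, 6, 10):
    mse = float(np.mean((actions - dct_compress(actions, keep)) ** 2))
    print(f"keep={keep:2d}: reconstruction MSE = {mse:.6f}")
```

Because the trajectory is smooth, most of its energy sits in the first few coefficients, so the reconstruction error shrinks quickly as `keep` grows — exactly the property FAST exploits.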

Step 3 — Quantize: Convert continuous DCT coefficients to discrete integers

dct_coefficients = [0.742, -0.321, 0.055, ...]
quantized = [222, 87, 135, ...]  # e.g. round((x + 1) / 2 * 255) → integers in [0, 255]

Step 4 — Flatten: Concatenate all quantized values into one sequence

flattened = [189, 87, 114, 201, 55, ...]  # Flat sequence of integers

Step 5 — BPE (Byte Pair Encoding): Apply BPE tokenization (like text tokenization) to compress the sequence further

# BPE finds repeating patterns and merges them into single tokens
# Example: [189, 87] appears frequently → token_543
# Result: sequence of 70 → compressed to ~7-10 tokens
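The BPE step can be illustrated with a toy merge loop over integer sequences. The real tokenizer uses a standard BPE implementation trained on far more data, so treat this purely as a sketch of the idea; the sample sequence is invented:

```python
from collections import Counter

def merge_pair(seq, pair, new_token):
    """Replace every adjacent occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_train(seq, num_merges, first_new_token=256):
    """Learn merge rules by repeatedly fusing the most frequent adjacent
    pair. Quantized values occupy 0-255, so merged tokens start at 256."""
    merges = []
    for new_token in range(first_new_token, first_new_token + num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                  # nothing repeats any more
        merges.append((pair, new_token))
        seq = merge_pair(seq, pair, new_token)
    return merges

def bpe_encode(seq, merges):
    for pair, token in merges:
        seq = merge_pair(seq, pair, token)
    return seq

flat = [189, 87, 114, 189, 87, 55, 189, 87]
merges = bpe_train(flat, num_merges=10)
print(bpe_encode(flat, merges))  # → [256, 114, 256, 55, 256]
```

The frequent pair `[189, 87]` collapses into a single new token, shrinking the sequence from 8 to 5 elements — the same mechanism that compresses a 70-value chunk down to a handful of tokens.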

Result: An action chunk of 10 timesteps x 7 joints = 70 values is compressed to just ~7-10 tokens. This is a ~10x compression ratio, allowing the autoregressive model to generate actions much faster.

Why Standard Binning Fails

The naive approach: divide range [-1, 1] into N bins (e.g., 256 bins) and map each action value to the nearest bin. Problems:

  1. Curse of dimensionality: With 7 joints x 10 timesteps = 70 dimensions, autoregressive decoding needs to generate 70 tokens — too slow
  2. Lost correlation: Standard binning tokenizes each value independently, ignoring temporal correlation between timesteps
  3. No compression: 70 values become 70 tokens, zero compression

FAST solves all three: DCT exploits temporal correlation, BPE finds patterns, and the result is 70 values compressed to ~10 tokens.

Comparison: Pi0 vs Pi0-FAST

| Feature | Pi0 (Flow Matching) | Pi0-FAST (Autoregressive) |
|---|---|---|
| Action generation | Iterative denoising (10-50 steps) | Token-by-token (~10 tokens) |
| Inference speed | ~5 Hz | ~25 Hz |
| KV-caching | Not applicable | Yes, significantly boosts speed |
| Training complexity | Moderate | Higher (requires tokenizer training) |
| LIBERO benchmark | ~85% | ~82.5% |
| Real-world dexterous tasks | Excellent | Very good |
| Backbone | PaliGemma 3B | PaliGemma 3B |

Pi0-FAST sacrifices ~2.5% benchmark accuracy to achieve 5x inference speed — an excellent tradeoff for real-world deployment.

Step 1: Installation

System Requirements

Install LeRobot with Pi Dependencies

# Create environment
python3.12 -m venv pi0fast-env
source pi0fast-env/bin/activate

# Clone and install
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[pi]"

# Verify
python -c "from lerobot.policies import Pi0FASTPolicy; print('Pi0-FAST OK')"

The [pi] extra installs additional dependencies for the PaliGemma backbone and the FAST tokenizer.

Step 2: Train Custom FAST Tokenizer

Why Train a Custom Tokenizer?

The pretrained FAST tokenizer (lerobot/fast-action-tokenizer) was trained on Open-X Embodiment data — a collection of many different robot types. It works well in many cases, but if your robot has a significantly different action space (different number of joints, different range), training a custom tokenizer will yield better results.

When to Use Pretrained vs Custom

Tokenizer Training Command

lerobot-train-tokenizer \
  --repo_id=YOUR_USERNAME/my_dataset \
  --action_horizon=10 \
  --encoded_dims=20 \
  --vocab_size=1024 \
  --scale=1.0 \
  --normalization_mode=bounds \
  --output_dir=outputs/tokenizer/my_tokenizer

Tokenizer Parameter Explanation

--action_horizon (default: 10): Number of timesteps per action chunk. Must match chunk_size when training Pi0-FAST.

action_horizon=5  → Fewer steps, faster reactions, less smooth
action_horizon=10 → Balanced (recommended)
action_horizon=20 → Smoother, but slower reactions

--encoded_dims (default: 20): Number of DCT coefficients to keep. Higher = more detail preserved, but longer token sequences.

encoded_dims=10 → High compression, loses fine details
encoded_dims=20 → Balanced (recommended)
encoded_dims=40 → Preserves most details, longer tokens

--vocab_size (default: 1024): BPE vocabulary size. Larger = better compression but needs more data.

vocab_size=256  → Small vocabulary, easy to learn, low compression
vocab_size=1024 → Balanced (recommended)
vocab_size=4096 → Large vocabulary, needs lots of data

--scale: Multiplier for DCT coefficients before quantization. Usually kept at 1.0.

--normalization_mode: How actions are normalized before tokenization. bounds scales each action dimension by its observed min/max; mean_std standardizes by per-dimension mean and standard deviation, which is more robust when the data contains outliers.

Check Tokenizer Quality

After training, check reconstruction error:

lerobot-check-tokenizer \
  --tokenizer_path=outputs/tokenizer/my_tokenizer \
  --repo_id=YOUR_USERNAME/my_dataset \
  --num_samples=100

Output shows reconstruction MSE — target: < 0.01. If higher, try increasing encoded_dims or vocab_size.

Step 3: Fine-tune Pi0-FAST

Using the Pretrained Model

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000 \
  --training.batch_size=32 \
  --training.lr=1e-5 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true \
  --policy.chunk_size=10 \
  --policy.n_action_steps=10

Training Parameter Explanation

--training.dtype=bfloat16: Use bfloat16 mixed precision. Reduces VRAM ~40% and speeds up training ~30% compared to float32. Mandatory on RTX 4090 — float32 will OOM.

--training.gradient_checkpointing=true: Trades speed for memory. Reduces VRAM ~30% by recomputing activations instead of storing them. Training is ~20% slower, but enables training on smaller GPUs.

--policy.chunk_size=10: Actions per chunk. Must match the tokenizer's action_horizon.

--policy.n_action_steps=10: Number of action steps to execute before querying the model again. Set equal to chunk_size for standard chunking, or smaller for Real-Time Chunking.
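One consequence of these two parameters: together with the control rate, n_action_steps determines how often the model is actually queried. A tiny sketch of that arithmetic (illustrative only, not a LeRobot API):

```python
def model_calls_per_second(fps: float, n_action_steps: int) -> float:
    """How often the policy must be queried at a given control rate when
    n_action_steps actions are executed from each predicted chunk."""
    return fps / n_action_steps

print(model_calls_per_second(25, 10))  # 2.5 — standard chunking
print(model_calls_per_second(25, 3))   # ~8.3 — replanning every 3 actions
```

Smaller n_action_steps means fresher plans but proportionally more forward passes, which is why fast inference is a prerequisite for frequent replanning.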

Using Custom Tokenizer

If you trained a custom tokenizer in Step 2:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.tokenizer_path=outputs/tokenizer/my_tokenizer \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true

Estimated Training Times

| GPU | Batch Size | 20k Steps | Notes |
|---|---|---|---|
| A100 80GB | 64 | ~3 hours | No gradient_checkpointing needed |
| RTX 4090 24GB | 32 | ~6 hours | Needs bfloat16 + gradient_checkpointing |
| RTX 3090 24GB | 16 | ~10 hours | Needs bfloat16 + gradient_checkpointing |

Monitoring and Early Stopping

# With Weights & Biases
lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=50000 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true \
  --training.wandb.enable=true \
  --training.wandb.project=pi0fast-finetune \
  --training.save_freq=5000 \
  --training.eval_freq=5000

Monitor token_accuracy on the W&B dashboard — this is the most important metric. Target: >85% token accuracy on the validation set.
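Token accuracy here is the standard exact-match fraction over predicted action tokens. The exact aggregation LeRobot logs may differ (e.g. per batch), but the definition itself is simply:

```python
def token_accuracy(pred, target):
    """Fraction of predicted action tokens that exactly match the targets."""
    assert len(pred) == len(target)
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)

# One wrong token out of five (values are made up for illustration):
print(token_accuracy([543, 12, 87, 901, 55], [543, 12, 90, 901, 55]))  # 0.8
```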


Step 4: Evaluate on LIBERO and Real Robot

LIBERO Benchmark

LIBERO is the standard benchmark for robot manipulation. It consists of 4 suites of increasing difficulty: LIBERO-Object, LIBERO-Goal, LIBERO-Spatial, and LIBERO-Long (long-horizon, multi-step tasks).

# Evaluate on LIBERO-Object (easiest)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_object \
  --eval.num_episodes=50

# Evaluate on LIBERO-Goal
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_goal \
  --eval.num_episodes=50

# Evaluate on LIBERO-Spatial (harder)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_spatial \
  --eval.num_episodes=50

# Evaluate on LIBERO-Long (hardest, multi-step)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_long \
  --eval.num_episodes=50

Expected Results (fine-tuned Pi0-FAST):

| Suite | Success Rate |
|---|---|
| LIBERO-Object | ~92% |
| LIBERO-Goal | ~85% |
| LIBERO-Spatial | ~80% |
| LIBERO-Long | ~73% |
| Average | ~82.5% |

Deploy on Real Robot

# Run policy on real robot
lerobot-record \
  --robot.type=so100 \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --repo_id=YOUR_USERNAME/pi0fast_eval \
  --num_episodes=20 \
  --fps=25

Note: Pi0-FAST can run at 25 Hz on RTX 4090 — set --fps=25 to fully utilize its speed.

KV-Caching: Why Pi0-FAST is Fast

What is KV-Caching?

In autoregressive decoding, each new token needs to attend to all previous tokens. Without caching, the model must recompute key-value pairs for the entire sequence at every step — extremely wasteful.

KV-caching stores previously computed key-value pairs in GPU memory, and each step only needs to compute KV for the newest token. For Pi0-FAST, this means:

Without KV-cache:
  Token 1: compute attention for [image_tokens, instruction_tokens, token_1] → 1000 tokens
  Token 2: compute attention for [image_tokens, instruction_tokens, token_1, token_2] → 1001 tokens
  ...
  Token 10: compute attention for 1009 tokens
  Total: ~10,000 token computations

With KV-cache:
  Token 1: compute full attention (1000 tokens), cache KV
  Token 2: compute attention only for token_2, reuse cached KV
  ...
  Token 10: compute attention only for token_10
  Total: ~1,000 + 9 = ~1,009 token computations
  → ~10x less computation!
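The token counts above can be checked with a few lines, using the same numbers as the example (999 prefix tokens for image and instruction, 10 generated action tokens):

```python
def attention_computations(prefix_len, num_generated, kv_cache):
    """Total key/value positions processed while decoding num_generated
    tokens after a prefix of prefix_len tokens."""
    if kv_cache:
        # Process the prefix once, then exactly one new position per step.
        return prefix_len + num_generated
    # Without a cache, reprocess the whole growing sequence every step.
    return sum(prefix_len + i for i in range(1, num_generated + 1))

print(attention_computations(999, 10, kv_cache=False))  # 10045
print(attention_computations(999, 10, kv_cache=True))   # 1009
```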

KV-Caching in Pi0-FAST

KV-caching is enabled by default in Pi0-FAST. You don't need to configure anything. However, it uses additional GPU memory — about 1-2GB per batch element. If you encounter OOM during inference, reduce batch size or disable it:

# Disable KV-cache (slower, but saves memory)
lerobot-eval \
  --policy.path=YOUR/model \
  --policy.use_kv_cache=false

Real-Time Chunking with Pi0-FAST

The Perfect Combination

Pi0-FAST + Real-Time Chunking is the most powerful combination in LeRobot v0.5. Thanks to 25 Hz inference speed, Pi0-FAST can continuously replan — every 2-3 actions, the model predicts a new chunk, blending with the old one:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.rtc_config.enabled=true \
  --policy.rtc_config.n_steps_warmup=5 \
  --policy.rtc_config.n_steps_between_replan=3 \
  --policy.rtc_config.blend_alpha=0.7 \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000
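How the blend itself might look: the sketch below assumes a simple linear interpolation in which blend_alpha weights the freshly predicted chunk. That is a plausible reading of the parameter, not LeRobot's verified implementation, and the action values are invented:

```python
import numpy as np

def blend_chunks(old_tail, new_head, blend_alpha=0.7):
    """Hypothetical linear blend over the overlap between the old chunk's
    remaining actions and the new chunk's first actions."""
    old_tail = np.asarray(old_tail)
    new_head = np.asarray(new_head)
    return blend_alpha * new_head + (1.0 - blend_alpha) * old_tail

old = np.array([[0.10, 0.20], [0.12, 0.22]])  # leftover actions, 2 joints
new = np.array([[0.14, 0.18], [0.16, 0.20]])  # start of the replanned chunk
print(blend_chunks(old, new, blend_alpha=0.7))
```

With blend_alpha=0.7 the executed trajectory leans toward the new plan while retaining enough of the old one to avoid discontinuous jumps.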

RTC Parameters

When to Use RTC?

PEFT/LoRA for Pi0-FAST

If you want to fine-tune Pi0-FAST but lack VRAM for full fine-tuning, use LoRA:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.peft_config.use_peft=true \
  --policy.peft_config.lora_r=16 \
  --policy.peft_config.lora_alpha=32 \
  --policy.peft_config.target_modules=["q_proj","v_proj"] \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=15000 \
  --training.dtype=bfloat16

With LoRA, VRAM drops from 40GB to 16-20GB, comfortably fitting on an RTX 3090.
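Where that saving comes from: LoRA freezes each targeted projection and trains only two small low-rank matrices beside it. A back-of-envelope count, assuming (hypothetically) square 2048-dimensional q_proj/v_proj layers:

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters LoRA adds to one frozen linear layer: a (d_in x r)
    down-projection plus an (r x d_out) up-projection."""
    return d_in * r + r * d_out

# Illustrative only: the true PaliGemma layer dimensions may differ.
d, r = 2048, 16
full = d * d                          # parameters in one full projection
lora = lora_trainable_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Training well under 2% of each projection's parameters is what lets the optimizer states and gradients fit in a 24GB card.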

If you're unfamiliar with PEFT/LoRA, read more in our v0.5 overview post for the theoretical background.

Troubleshooting

High Tokenizer Reconstruction Error

If lerobot-check-tokenizer reports MSE > 0.05:

  1. Increase encoded_dims (try 30 or 40)
  2. Increase vocab_size (try 2048)
  3. Check data: are there action outliers? Use normalization_mode=mean_std

Low Token Accuracy During Training

If token accuracy < 70% after 10k steps:

  1. Check tokenizer quality (MSE < 0.01)
  2. Reduce learning rate (try 5e-6)
  3. Increase training steps (try 50k)
  4. Check dataset size: Pi0-FAST needs at least 50 episodes

Inference Too Slow

If inference < 15 Hz on RTX 4090:

  1. Ensure KV-cache is enabled (default)
  2. Use dtype=bfloat16 for inference
  3. Reduce num_visual_tokens if applicable
  4. Check GPU utilization: nvidia-smi — if GPU < 80%, the bottleneck is CPU/IO

Conclusion

Pi0-FAST represents a new direction in robot AI: autoregressive models for robot control. By combining the FAST tokenizer (10x action sequence compression) with KV-caching, Pi0-FAST achieves 5x inference speed over diffusion-based approaches while losing only ~2.5% accuracy.

When combined with Real-Time Chunking, Pi0-FAST enables robots to continuously update their plans at 25 Hz — near real-time responsiveness. This is a capability that previously only existed in classical control systems, now available for learned policies.

The choice between SmolVLA and Pi0-FAST is straightforward: if you're GPU-constrained, use SmolVLA. If you have an RTX 4090+ and need fast inference, use Pi0-FAST. For a deeper understanding of VLA theory, start with our VLA models overview.

