Pi0-FAST: When Autoregressive Beats Diffusion
In the previous post, we explored SmolVLA — a compact VLA that runs on consumer GPUs. Now, let's switch to an entirely different approach: Pi0-FAST, a model that combines the PaliGemma backbone with the FAST action tokenizer to achieve inference speeds 5 times faster than the diffusion-based original Pi0.
Why does this matter? Because in robotics, inference speed determines reactivity. A model running at 5 Hz (5 predictions/second) responds significantly slower than one running at 25 Hz. In tasks requiring high precision — assembling, pouring, inserting — this difference separates success from failure.
This article will guide you step by step: understanding why the FAST tokenizer solves problems that standard binning cannot, training a custom tokenizer, fine-tuning Pi0-FAST, and deploying with KV-caching.
What is Pi0-FAST?
The Problem with Original Pi0
Pi0 (Physical Intelligence's model) uses flow matching — a type of diffusion process — to generate robot actions. Flow matching works by starting from random noise and iteratively denoising it into an action trajectory. This process requires multiple denoising steps (typically 10-50 steps), each requiring a forward pass through the neural network.
Result: the original Pi0 runs at about 5 Hz — sufficient for many tasks but too slow for:
- Dexterous manipulation (needs >15 Hz)
- Dynamic tasks (catching, pouring)
- Tasks requiring fast replanning (Real-Time Chunking)
The Solution: Autoregressive + FAST Tokenizer
Pi0-FAST replaces flow matching with autoregressive decoding — similar to how language models (GPT, LLaMA) generate text token-by-token. But there's a problem: robot actions are continuous values (joint angles, positions), not discrete tokens.
This is where the FAST tokenizer comes in.
FAST Tokenizer: Turning Continuous Actions into Tokens
FAST (Frequency-space Action Sequence Tokenization) solves the tokenization problem for robot actions with a 5-step pipeline:
Step 1 — Normalize: Standardize action values to range [-1, 1]
raw_actions = [0.15, -0.32, 1.47, 0.003, ...]
normalized = [-0.7, 0.2, 0.95, -0.99, ...]
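As a concrete sketch of bounds-style normalization (the joint limits and raw values below are illustrative, not from a real dataset):

```python
import numpy as np

def normalize_bounds(actions, low, high):
    # Map each action dimension from [low, high] to [-1, 1]
    return 2.0 * (actions - low) / (high - low) - 1.0

# Illustrative per-joint bounds and a raw action (2 joints)
low, high = np.array([-1.0, -0.5]), np.array([1.0, 2.0])
raw = np.array([0.0, 2.0])
print(normalize_bounds(raw, low, high))  # [0. 1.]
```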
Step 2 — DCT (Discrete Cosine Transform): Convert the action sequence from time domain to frequency domain. Similar to how JPEG compresses images using DCT, FAST compresses action trajectories.
# Action chunk [10 timesteps x 7 joints] = 70 values
# After DCT, keep only top-K frequency components
# Example: K=20 → compression ratio 3.5x
DCT works because robot actions are typically smooth — no sudden jumps between consecutive timesteps. High-frequency components (containing noise, jitter) can be discarded without losing important information.
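To see why truncating DCT coefficients is nearly lossless for smooth trajectories, here is a self-contained numpy sketch (the trajectory is synthetic, and a real FAST implementation applies the transform per joint across the chunk):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are cosine basis vectors
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

n, K = 70, 20                    # 70 values, keep 20 low-frequency coefficients
time = np.linspace(0.0, 1.0, n)
traj = 0.5 * np.sin(2 * np.pi * time) + 0.1 * time  # smooth synthetic trajectory

M = dct_matrix(n)
coeffs = M @ traj
coeffs[K:] = 0.0                 # discard high-frequency components
recon = M.T @ coeffs             # orthonormal basis: inverse DCT is the transpose

print(np.mean((traj - recon) ** 2))  # tiny reconstruction error
```

Most of the trajectory's energy lives in the first few coefficients, so dropping the remaining 50 barely changes the reconstruction.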
Step 3 — Quantize: Convert continuous DCT coefficients to discrete integers
dct_coefficients = [0.742, -0.321, 0.055, ...]
quantized = [189, 87, 114, ...] # Integers in range [0, 255]
Step 4 — Flatten: Concatenate all quantized values into one sequence
flattened = [189, 87, 114, 201, 55, ...] # Flat sequence of integers
Step 5 — BPE (Byte Pair Encoding): Apply BPE tokenization (like text tokenization) to compress the sequence further
# BPE finds repeating patterns and merges them into single tokens
# Example: [189, 87] appears frequently → token_543
# Result: sequence of 70 → compressed to ~7-10 tokens
Result: An action chunk of 10 timesteps x 7 joints = 70 values is compressed to just ~7-10 tokens. This is a ~10x compression ratio, allowing the autoregressive model to generate actions much faster.
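Steps 3 and 4 can be sketched as a uniform quantizer plus a flatten; the round-trip error is bounded by half a bin width, and BPE in step 5 is lossless so it adds no further error. This is an illustrative sketch, not the exact scheme FAST uses:

```python
import numpy as np

def quantize(coeffs, levels=256):
    # Map continuous values in [-1, 1] to integer bins [0, levels-1], then flatten
    clipped = np.clip(coeffs, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (levels - 1)).astype(np.int64).ravel()

def dequantize(tokens, levels=256):
    # Invert the mapping; the error is at most half a bin width (1/255 here)
    return tokens.astype(np.float64) / (levels - 1) * 2.0 - 1.0

rng = np.random.default_rng(0)
coeffs = rng.uniform(-1.0, 1.0, size=(10, 7))  # e.g. 10 timesteps x 7 joints
flat = quantize(coeffs)                        # flat sequence of 70 integers
recon = dequantize(flat).reshape(10, 7)
print(np.max(np.abs(coeffs - recon)))          # bounded by 1/255, about 0.004
```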
Why Standard Binning Fails
The naive approach: divide range [-1, 1] into N bins (e.g., 256 bins) and map each action value to the nearest bin. Problems:
- Curse of dimensionality: With 7 joints x 10 timesteps = 70 dimensions, autoregressive decoding needs to generate 70 tokens — too slow
- Lost correlation: Standard binning tokenizes each value independently, ignoring temporal correlation between timesteps
- No compression: 70 values become 70 tokens, zero compression
FAST solves all three: DCT exploits temporal correlation, BPE finds patterns, and the result is 70 values compressed to ~10 tokens.
Comparison: Pi0 vs Pi0-FAST
| Feature | Pi0 (Flow Matching) | Pi0-FAST (Autoregressive) |
|---|---|---|
| Action generation | Iterative denoising (10-50 steps) | Token-by-token (~10 tokens) |
| Inference speed | ~5 Hz | ~25 Hz |
| KV-caching | Not applicable | Yes, significantly boosts speed |
| Training complexity | Moderate | Higher (need to train tokenizer) |
| LIBERO benchmark | ~85% | ~82.5% |
| Real-world dexterous | Excellent | Very good |
| Backbone | PaliGemma 3B | PaliGemma 3B |
Pi0-FAST sacrifices ~2.5% benchmark accuracy to achieve 5x inference speed — an excellent tradeoff for real-world deployment.
Step 1: Installation
System Requirements
- Python 3.12+
- GPU: RTX 4090 (24GB) or better for training, RTX 3090 for inference
- CUDA 12.1+
- RAM: 32GB+ recommended
Install LeRobot with Pi Dependencies
# Create environment
python3.12 -m venv pi0fast-env
source pi0fast-env/bin/activate
# Clone and install
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[pi]"
# Verify
python -c "from lerobot.policies import Pi0FASTPolicy; print('Pi0-FAST OK')"
The [pi] package installs additional dependencies for the PaliGemma backbone and FAST tokenizer.
Step 2: Train Custom FAST Tokenizer
Why Train a Custom Tokenizer?
The pretrained FAST tokenizer (lerobot/fast-action-tokenizer) was trained on Open-X Embodiment data — a collection of many different robot types. It works well in many cases, but if your robot has a significantly different action space (different number of joints, different range), training a custom tokenizer will yield better results.
When to Use Pretrained vs Custom
- Use pretrained: Standard robots (6-7 DOF arm, SO-100, ALOHA), common tasks
- Train custom: Specialized robots (humanoid, mobile manipulator), large action spaces (>10 DOF), large datasets (>500 episodes)
Tokenizer Training Command
lerobot-train-tokenizer \
--repo_id=YOUR_USERNAME/my_dataset \
--action_horizon=10 \
--encoded_dims=20 \
--vocab_size=1024 \
--scale=1.0 \
--normalization_mode=bounds \
--output_dir=outputs/tokenizer/my_tokenizer
Tokenizer Parameter Explanation
--action_horizon (default: 10): Number of timesteps per action chunk. Must match chunk_size when training Pi0-FAST.
action_horizon=5 → Fewer steps, faster reactions, less smooth
action_horizon=10 → Balanced (recommended)
action_horizon=20 → Smoother, but slower reactions
--encoded_dims (default: 20): Number of DCT coefficients to keep. Higher = more detail preserved, but longer token sequences.
encoded_dims=10 → High compression, loses fine details
encoded_dims=20 → Balanced (recommended)
encoded_dims=40 → Preserves most details, longer tokens
--vocab_size (default: 1024): BPE vocabulary size. Larger = better compression but needs more data.
vocab_size=256 → Small vocabulary, easy to learn, low compression
vocab_size=1024 → Balanced (recommended)
vocab_size=4096 → Large vocabulary, needs lots of data
--scale: Multiplier for DCT coefficients before quantization. Usually kept at 1.0.
--normalization_mode: How to normalize actions:
- bounds: Use min/max from the dataset (recommended for most cases)
- mean_std: Use mean/standard deviation normalization
- none: No normalization (only if actions are pre-normalized)
Check Tokenizer Quality
After training, check reconstruction error:
lerobot-check-tokenizer \
--tokenizer_path=outputs/tokenizer/my_tokenizer \
--repo_id=YOUR_USERNAME/my_dataset \
--num_samples=100
Output shows reconstruction MSE — target: < 0.01. If higher, try increasing encoded_dims or vocab_size.
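If you prefer to measure reconstruction error in Python rather than via the CLI, a generic round-trip check looks like this (encode and decode stand in for whatever encode/decode API your tokenizer object exposes; here they are identity stubs just to show the shape of the check):

```python
import numpy as np

def reconstruction_mse(chunks, encode, decode):
    # Mean squared error of a tokenize -> detokenize round trip, averaged over chunks
    errors = [np.mean((chunk - decode(encode(chunk))) ** 2) for chunk in chunks]
    return float(np.mean(errors))

# Identity stubs: a real tokenizer's encode/decode would replace these
encode = lambda chunk: chunk
decode = lambda tokens: tokens

chunks = [np.random.default_rng(i).uniform(-1, 1, size=(10, 7)) for i in range(5)]
print(reconstruction_mse(chunks, encode, decode))  # 0.0 for the identity stubs
```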
Step 3: Fine-tune Pi0-FAST
Using the Pretrained Model
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000 \
--training.batch_size=32 \
--training.lr=1e-5 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10
Training Parameter Explanation
--training.dtype=bfloat16: Use bfloat16 mixed precision. Reduces VRAM ~40% and speeds up training ~30% compared to float32. Mandatory on RTX 4090 — float32 will OOM.
--training.gradient_checkpointing=true: Trades speed for memory. Reduces VRAM ~30% by recomputing activations instead of storing them. Training is ~20% slower, but enables training on smaller GPUs.
--policy.chunk_size=10: Actions per chunk. Must match the tokenizer's action_horizon.
--policy.n_action_steps=10: Number of action steps to execute before querying the model again. Set equal to chunk_size for standard chunking, or smaller for Real-Time Chunking.
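The interplay of chunk_size and n_action_steps amounts to this control loop (a minimal sketch with stand-in functions; predict_chunk, read_observation, and send_action are placeholders, not LeRobot APIs):

```python
import numpy as np

def chunked_control_loop(predict_chunk, read_observation, send_action,
                         n_action_steps=10, total_steps=100):
    # Query the policy for a chunk, execute n_action_steps of it, then re-query
    executed = 0
    queries = 0
    while executed < total_steps:
        chunk = predict_chunk(read_observation())  # shape [chunk_size, n_joints]
        queries += 1
        for action in chunk[:n_action_steps]:
            send_action(action)
            executed += 1
            if executed >= total_steps:
                break
    return queries

# Stand-ins: a constant policy and no-op I/O
queries = chunked_control_loop(
    predict_chunk=lambda obs: np.zeros((10, 7)),
    read_observation=lambda: None,
    send_action=lambda a: None,
)
print(queries)  # 100 steps / 10 actions per query = 10 model queries
```

Halving n_action_steps doubles the number of model queries, which is the cost of the extra reactivity.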
Using Custom Tokenizer
If you trained a custom tokenizer in Step 2:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.tokenizer_path=outputs/tokenizer/my_tokenizer \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true
Estimated Training Times
| GPU | Batch Size | 20k Steps | Notes |
|---|---|---|---|
| A100 80GB | 64 | ~3 hours | No gradient_checkpointing needed |
| RTX 4090 24GB | 32 | ~6 hours | Needs bfloat16 + gradient_checkpointing |
| RTX 3090 24GB | 16 | ~10 hours | Needs bfloat16 + gradient_checkpointing |
Monitoring and Early Stopping
# With Weights & Biases
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=50000 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true \
--training.wandb.enable=true \
--training.wandb.project=pi0fast-finetune \
--training.save_freq=5000 \
--training.eval_freq=5000
Monitor token_accuracy on the W&B dashboard — this is the most important metric. Target: >85% token accuracy on the validation set.
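Token accuracy here is exact-match accuracy over the predicted action tokens; as a quick sketch of the metric itself (the token ids are made up):

```python
import numpy as np

def token_accuracy(predicted, target):
    # Fraction of positions where the predicted token id matches the target
    predicted, target = np.asarray(predicted), np.asarray(target)
    return float((predicted == target).mean())

print(token_accuracy([543, 87, 114, 201], [543, 87, 99, 201]))  # 0.75
```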
Step 4: Evaluate on LIBERO and Real Robot
LIBERO Benchmark
LIBERO is a standard benchmark for language-conditioned robot manipulation. It consists of 4 task suites of increasing difficulty:
# Evaluate on LIBERO-Object (easiest)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_object \
--eval.num_episodes=50
# Evaluate on LIBERO-Goal
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_goal \
--eval.num_episodes=50
# Evaluate on LIBERO-Spatial (harder)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_spatial \
--eval.num_episodes=50
# Evaluate on LIBERO-Long (hardest, multi-step)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_long \
--eval.num_episodes=50
Expected Results (fine-tuned Pi0-FAST):
| Suite | Success Rate |
|---|---|
| LIBERO-Object | ~92% |
| LIBERO-Goal | ~85% |
| LIBERO-Spatial | ~80% |
| LIBERO-Long | ~73% |
| Average | ~82.5% |
Deploy on Real Robot
# Run policy on real robot
lerobot-record \
--robot.type=so100 \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--repo_id=YOUR_USERNAME/pi0fast_eval \
--num_episodes=20 \
--fps=25
Note: Pi0-FAST can run at 25 Hz on RTX 4090 — set --fps=25 to fully utilize its speed.
KV-Caching: Why Pi0-FAST is Fast
What is KV-Caching?
In autoregressive decoding, each new token needs to attend to all previous tokens. Without caching, the model must recompute key-value pairs for the entire sequence at every step — extremely wasteful.
KV-caching stores previously computed key-value pairs in GPU memory, and each step only needs to compute KV for the newest token. For Pi0-FAST, this means:
Without KV-cache:
Token 1: compute attention for [image_tokens, instruction_tokens, token_1] → 1000 tokens
Token 2: compute attention for [image_tokens, instruction_tokens, token_1, token_2] → 1001 tokens
...
Token 10: compute attention for 1009 tokens
Total: ~10,000 token computations
With KV-cache:
Token 1: compute full attention (1000 tokens), cache KV
Token 2: compute attention only for token_2, reuse cached KV
...
Token 10: compute attention only for token_10
Total: ~1,000 + 9 = ~1,009 token computations
→ ~10x less computation!
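The counting above can be written down directly (prefix_len is the number of image + instruction tokens; 999 here reproduces the 1000-token first step from the trace):

```python
def attention_kv_cost(prefix_len, n_new_tokens, use_kv_cache):
    # Number of positions whose keys/values must be computed during decoding
    if use_kv_cache:
        # One full pass over the prefix plus the first token, then one position per step
        return (prefix_len + 1) + (n_new_tokens - 1)
    # Without a cache, the whole growing sequence is recomputed at every step
    return sum(prefix_len + i for i in range(1, n_new_tokens + 1))

print(attention_kv_cost(999, 10, use_kv_cache=False))  # 10045
print(attention_kv_cost(999, 10, use_kv_cache=True))   # 1009
```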
KV-Caching in Pi0-FAST
KV-caching is enabled by default in Pi0-FAST. You don't need to configure anything. However, it uses additional GPU memory — about 1-2GB per batch element. If you encounter OOM during inference, reduce batch size or disable it:
# Disable KV-cache (slower, but saves memory)
lerobot-eval \
--policy.path=YOUR/model \
--policy.use_kv_cache=false
Real-Time Chunking with Pi0-FAST
The Perfect Combination
Pi0-FAST + Real-Time Chunking is the most powerful combination in LeRobot v0.5. Thanks to 25 Hz inference speed, Pi0-FAST can continuously replan — every 2-3 actions, the model predicts a new chunk, blending with the old one:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.rtc_config.enabled=true \
--policy.rtc_config.n_steps_warmup=5 \
--policy.rtc_config.n_steps_between_replan=3 \
--policy.rtc_config.blend_alpha=0.7 \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000
RTC Parameters
- n_steps_warmup: Steps to execute before replanning starts (lets the model "stabilize" first)
- n_steps_between_replan: Steps to execute between each replan. Smaller = more reactive but more compute
- blend_alpha: Blending weight between new and old chunks. 1.0 = fully use the new chunk, 0.5 = average both
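The blending step can be sketched as a convex combination of the overlapping portions of the two chunks (an illustrative sketch; the actual RTC blending in LeRobot may differ):

```python
import numpy as np

def blend_chunks(old_chunk, new_chunk, blend_alpha=0.7):
    # Convex blend over the overlap: alpha=1.0 fully trusts the new chunk,
    # alpha=0.5 averages old and new plans
    return blend_alpha * new_chunk + (1.0 - blend_alpha) * old_chunk

old = np.zeros((7, 7))  # remaining actions from the previous chunk
new = np.ones((7, 7))   # overlapping portion of the freshly predicted chunk
print(blend_chunks(old, new, blend_alpha=0.7)[0, 0])  # 0.7
```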
When to Use RTC?
- Use it: Dynamic tasks, tasks requiring high precision, uncertain environments
- Skip it: Static pick-and-place, simple tasks where standard chunking is sufficient
PEFT/LoRA for Pi0-FAST
If you want to fine-tune Pi0-FAST but lack VRAM for full fine-tuning, use LoRA:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.peft_config.use_peft=true \
--policy.peft_config.lora_r=16 \
--policy.peft_config.lora_alpha=32 \
--policy.peft_config.target_modules='["q_proj","v_proj"]' \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=15000 \
--training.dtype=bfloat16
With LoRA, VRAM drops from 40GB to 16-20GB, comfortably fitting on an RTX 3090.
If you're unfamiliar with PEFT/LoRA, read more in our v0.5 overview post for the theoretical background.
Troubleshooting
High Tokenizer Reconstruction Error
If lerobot-check-tokenizer reports MSE > 0.05:
- Increase encoded_dims (try 30 or 40)
- Increase vocab_size (try 2048)
- Check the data: are there action outliers? Try normalization_mode=mean_std
Low Token Accuracy During Training
If token accuracy < 70% after 10k steps:
- Check tokenizer quality (MSE < 0.01)
- Reduce the learning rate (try 5e-6)
- Increase training steps (try 50k)
- Check dataset size: Pi0-FAST needs at least 50 episodes
Inference Too Slow
If inference < 15 Hz on RTX 4090:
- Ensure KV-cache is enabled (default)
- Use dtype=bfloat16 for inference
- Reduce num_visual_tokens if applicable
- Check GPU utilization with nvidia-smi: if GPU usage is below 80%, the bottleneck is CPU/IO
Conclusion
Pi0-FAST represents a new direction in robot AI: autoregressive models for robot control. By combining the FAST tokenizer (10x action sequence compression) with KV-caching, Pi0-FAST achieves 5x inference speed over diffusion-based approaches while losing only ~2.5% accuracy.
When combined with Real-Time Chunking, Pi0-FAST enables robots to continuously update their plans at 25 Hz — near real-time responsiveness. This is a capability that previously only existed in classical control systems, now available for learned policies.
The choice between SmolVLA and Pi0-FAST is straightforward: if you're GPU-constrained, use SmolVLA. If you have an RTX 4090+ and need fast inference, use Pi0-FAST. For a deeper understanding of VLA theory, start with our VLA models overview.
Related Posts
- SmolVLA: Train a 450M VLA on Consumer GPU — Compact VLA for commodity hardware
- Diffusion Policy: Theory and Practice — Understand diffusion before comparing with autoregressive
- LeRobot Framework: Introduction and Architecture — LeRobot foundations for beginners