Pi0-FAST: When Autoregressive Beats Diffusion
In the previous post, we explored SmolVLA — a compact VLA that runs on consumer GPUs. Now, let's switch to an entirely different approach: Pi0-FAST, a model that combines the PaliGemma backbone with the FAST action tokenizer to achieve inference speeds 5 times faster than the diffusion-based original Pi0.
Why does this matter? Because in robotics, inference speed determines reactivity. A model running at 5 Hz (5 predictions/second) responds significantly slower than one running at 25 Hz. In tasks requiring high precision — assembling, pouring, inserting — this difference separates success from failure.
This article will guide you step by step: understanding why the FAST tokenizer solves problems that standard binning cannot, training a custom tokenizer, fine-tuning Pi0-FAST, and deploying with KV-caching.
What is Pi0-FAST?
The Problem with Original Pi0
Pi0 (Physical Intelligence's model) uses flow matching — a type of diffusion process — to generate robot actions. Flow matching works by starting from random noise and iteratively denoising it into an action trajectory. This process requires multiple denoising steps (typically 10-50 steps), each requiring a forward pass through the neural network.
Result: the original Pi0 runs at about 5 Hz — sufficient for many tasks but too slow for:
- Dexterous manipulation (needs >15 Hz)
- Dynamic tasks (catching, pouring)
- Tasks requiring fast replanning (Real-Time Chunking)
The Solution: Autoregressive + FAST Tokenizer
Pi0-FAST replaces flow matching with autoregressive decoding — similar to how language models (GPT, LLaMA) generate text token-by-token. But there's a problem: robot actions are continuous values (joint angles, positions), not discrete tokens.
This is where the FAST tokenizer comes in.
FAST Tokenizer: Turning Continuous Actions into Tokens
FAST (Frequency-space Action Sequence Tokenization) solves the tokenization problem for robot actions with a 5-step pipeline:
Step 1 — Normalize: Standardize action values to range [-1, 1]
raw_actions = [0.15, -0.32, 1.47, 0.003, ...]
normalized = [-0.7, 0.2, 0.95, -0.99, ...]
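As a concrete sketch of bounds-style normalization (the joint limits and raw values below are illustrative, not from a real dataset):

```python
import numpy as np

def normalize_bounds(actions, low, high):
    # Map each action dimension from [low, high] to [-1, 1]
    return 2.0 * (actions - low) / (high - low) - 1.0

# Illustrative per-joint bounds and a raw action (2 joints)
low, high = np.array([-1.0, -0.5]), np.array([1.0, 2.0])
raw = np.array([0.0, 2.0])
print(normalize_bounds(raw, low, high))  # [0. 1.]
```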
Step 2 — DCT (Discrete Cosine Transform): Convert the action sequence from time domain to frequency domain. Similar to how JPEG compresses images using DCT, FAST compresses action trajectories.
# Action chunk [10 timesteps x 7 joints] = 70 values
# After DCT, keep only top-K frequency components
# Example: K=20 → compression ratio 3.5x
DCT works because robot actions are typically smooth — no sudden jumps between consecutive timesteps. High-frequency components (containing noise, jitter) can be discarded without losing important information.
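To see why truncating DCT coefficients is nearly lossless for smooth trajectories, here is a self-contained numpy sketch (the trajectory is synthetic, and a real FAST implementation applies the transform per joint across the chunk):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are cosine basis vectors
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

n, K = 70, 20                    # 70 values, keep 20 low-frequency coefficients
time = np.linspace(0.0, 1.0, n)
traj = 0.5 * np.sin(2 * np.pi * time) + 0.1 * time  # smooth synthetic trajectory

M = dct_matrix(n)
coeffs = M @ traj
coeffs[K:] = 0.0                 # discard high-frequency components
recon = M.T @ coeffs             # orthonormal basis: inverse DCT is the transpose

print(np.mean((traj - recon) ** 2))  # tiny reconstruction error
```

Most of the trajectory's energy lives in the first few coefficients, so dropping the remaining 50 barely changes the reconstruction.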
Step 3 — Quantize: Convert continuous DCT coefficients to discrete integers
dct_coefficients = [0.742, -0.321, 0.055, ...]
quantized = [189, 87, 114, ...] # Integers in range [0, 255]
Step 4 — Flatten: Concatenate all quantized values into one sequence
flattened = [189, 87, 114, 201, 55, ...] # Flat sequence of integers
Step 5 — BPE (Byte Pair Encoding): Apply BPE tokenization (like text tokenization) to compress the sequence further
# BPE finds repeating patterns and merges them into single tokens
# Example: [189, 87] appears frequently → token_543
# Result: sequence of 70 → compressed to ~7-10 tokens
Result: An action chunk of 10 timesteps x 7 joints = 70 values is compressed to just ~7-10 tokens. This is a ~10x compression ratio, allowing the autoregressive model to generate actions much faster.
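Steps 3 and 4 can be sketched as a uniform quantizer plus a flatten; the round-trip error is bounded by half a bin width, and BPE in step 5 is lossless so it adds no further error. This is an illustrative sketch, not the exact scheme FAST uses:

```python
import numpy as np

def quantize(coeffs, levels=256):
    # Map continuous values in [-1, 1] to integer bins [0, levels-1], then flatten
    clipped = np.clip(coeffs, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (levels - 1)).astype(np.int64).ravel()

def dequantize(tokens, levels=256):
    # Invert the mapping; the error is at most half a bin width (1/255 here)
    return tokens.astype(np.float64) / (levels - 1) * 2.0 - 1.0

rng = np.random.default_rng(0)
coeffs = rng.uniform(-1.0, 1.0, size=(10, 7))  # e.g. 10 timesteps x 7 joints
flat = quantize(coeffs)                        # flat sequence of 70 integers
recon = dequantize(flat).reshape(10, 7)
print(np.max(np.abs(coeffs - recon)))          # bounded by 1/255, about 0.004
```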
Why Standard Binning Fails
The naive approach: divide range [-1, 1] into N bins (e.g., 256 bins) and map each action value to the nearest bin. Problems:
- Curse of dimensionality: With 7 joints x 10 timesteps = 70 dimensions, autoregressive decoding needs to generate 70 tokens — too slow
- Lost correlation: Standard binning tokenizes each value independently, ignoring temporal correlation between timesteps
- No compression: 70 values become 70 tokens, zero compression
FAST solves all three: DCT exploits temporal correlation, BPE finds patterns, and the result is 70 values compressed to ~10 tokens.
Comparison: Pi0 vs Pi0-FAST
| Feature | Pi0 (Flow Matching) | Pi0-FAST (Autoregressive) |
|---|---|---|
| Action generation | Iterative denoising (10-50 steps) | Token-by-token (~10 tokens) |
| Inference speed | ~5 Hz | ~25 Hz |
| KV-caching | Not applicable | Yes, significantly boosts speed |
| Training complexity | Moderate | Higher (need to train tokenizer) |
| LIBERO benchmark | ~85% | ~82.5% |
| Real-world dexterous | Excellent | Very good |
| Backbone | PaliGemma 3B | PaliGemma 3B |
Pi0-FAST sacrifices ~2.5% benchmark accuracy to achieve 5x inference speed — an excellent tradeoff for real-world deployment.
Step 1: Installation
System Requirements
- Python 3.12+
- GPU: RTX 4090 (24GB) or better for training, RTX 3090 for inference
- CUDA 12.1+
- RAM: 32GB+ recommended
Install LeRobot with Pi Dependencies
# Create environment
python3.12 -m venv pi0fast-env
source pi0fast-env/bin/activate
# Clone and install
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[pi]"
# Verify
python -c "from lerobot.policies import Pi0FASTPolicy; print('Pi0-FAST OK')"
The [pi] package installs additional dependencies for the PaliGemma backbone and FAST tokenizer.
Step 2: Train Custom FAST Tokenizer
Why Train a Custom Tokenizer?
The pretrained FAST tokenizer (lerobot/fast-action-tokenizer) was trained on Open-X Embodiment data — a collection of many different robot types. It works well in many cases, but if your robot has a significantly different action space (different number of joints, different range), training a custom tokenizer will yield better results.
When to Use Pretrained vs Custom
- Use pretrained: Standard robots (6-7 DOF arm, SO-100, ALOHA), common tasks
- Train custom: Specialized robots (humanoid, mobile manipulator), large action spaces (>10 DOF), large datasets (>500 episodes)
Tokenizer Training Command
lerobot-train-tokenizer \
--repo_id=YOUR_USERNAME/my_dataset \
--action_horizon=10 \
--encoded_dims=20 \
--vocab_size=1024 \
--scale=1.0 \
--normalization_mode=bounds \
--output_dir=outputs/tokenizer/my_tokenizer
Tokenizer Parameter Explanation
--action_horizon (default: 10): Number of timesteps per action chunk. Must match chunk_size when training Pi0-FAST.
action_horizon=5 → Fewer steps, faster reactions, less smooth
action_horizon=10 → Balanced (recommended)
action_horizon=20 → Smoother, but slower reactions
--encoded_dims (default: 20): Number of DCT coefficients to keep. Higher = more detail preserved, but longer token sequences.
encoded_dims=10 → High compression, loses fine details
encoded_dims=20 → Balanced (recommended)
encoded_dims=40 → Preserves most details, longer tokens
--vocab_size (default: 1024): BPE vocabulary size. Larger = better compression but needs more data.
vocab_size=256 → Small vocabulary, easy to learn, low compression
vocab_size=1024 → Balanced (recommended)
vocab_size=4096 → Large vocabulary, needs lots of data
--scale: Multiplier for DCT coefficients before quantization. Usually kept at 1.0.
--normalization_mode: How to normalize actions:
- bounds: Use min/max from the dataset (recommended for most cases)
- mean_std: Use mean/standard deviation normalization
- none: No normalization (only if actions are pre-normalized)
Check Tokenizer Quality
After training, check reconstruction error:
lerobot-check-tokenizer \
--tokenizer_path=outputs/tokenizer/my_tokenizer \
--repo_id=YOUR_USERNAME/my_dataset \
--num_samples=100
Output shows reconstruction MSE — target: < 0.01. If higher, try increasing encoded_dims or vocab_size.
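If you prefer to measure reconstruction error in Python rather than via the CLI, a generic round-trip check looks like this (encode and decode stand in for whatever encode/decode API your tokenizer object exposes; here they are identity stubs just to show the shape of the check):

```python
import numpy as np

def reconstruction_mse(chunks, encode, decode):
    # Mean squared error of a tokenize -> detokenize round trip, averaged over chunks
    errors = [np.mean((chunk - decode(encode(chunk))) ** 2) for chunk in chunks]
    return float(np.mean(errors))

# Identity stubs: a real tokenizer's encode/decode would replace these
encode = lambda chunk: chunk
decode = lambda tokens: tokens

chunks = [np.random.default_rng(i).uniform(-1, 1, size=(10, 7)) for i in range(5)]
print(reconstruction_mse(chunks, encode, decode))  # 0.0 for the identity stubs
```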
Step 3: Fine-tune Pi0-FAST
Using the Pretrained Model
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000 \
--training.batch_size=32 \
--training.lr=1e-5 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10
Training Parameter Explanation
--training.dtype=bfloat16: Use bfloat16 mixed precision. Reduces VRAM ~40% and speeds up training ~30% compared to float32. Mandatory on RTX 4090 — float32 will OOM.
--training.gradient_checkpointing=true: Trades speed for memory. Reduces VRAM ~30% by recomputing activations instead of storing them. Training is ~20% slower, but enables training on smaller GPUs.
--policy.chunk_size=10: Actions per chunk. Must match the tokenizer's action_horizon.
--policy.n_action_steps=10: Number of action steps to execute before querying the model again. Set equal to chunk_size for standard chunking, or smaller for Real-Time Chunking.
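The interplay of chunk_size and n_action_steps amounts to this control loop (a minimal sketch with stand-in functions; predict_chunk, read_observation, and send_action are placeholders, not LeRobot APIs):

```python
import numpy as np

def chunked_control_loop(predict_chunk, read_observation, send_action,
                         n_action_steps=10, total_steps=100):
    # Query the policy for a chunk, execute n_action_steps of it, then re-query
    executed = 0
    queries = 0
    while executed < total_steps:
        chunk = predict_chunk(read_observation())  # shape [chunk_size, n_joints]
        queries += 1
        for action in chunk[:n_action_steps]:
            send_action(action)
            executed += 1
            if executed >= total_steps:
                break
    return queries

# Stand-ins: a constant policy and no-op I/O
queries = chunked_control_loop(
    predict_chunk=lambda obs: np.zeros((10, 7)),
    read_observation=lambda: None,
    send_action=lambda a: None,
)
print(queries)  # 100 steps / 10 actions per query = 10 model queries
```

Halving n_action_steps doubles the number of model queries, which is the cost of the extra reactivity.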
Using Custom Tokenizer
If you trained a custom tokenizer in Step 2:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.tokenizer_path=outputs/tokenizer/my_tokenizer \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true
Estimated Training Times
| GPU | Batch Size | 20k Steps | Notes |
|---|---|---|---|
| A100 80GB | 64 | ~3 hours | No gradient_checkpointing needed |
| RTX 4090 24GB | 32 | ~6 hours | Needs bfloat16 + gradient_checkpointing |
| RTX 3090 24GB | 16 | ~10 hours | Needs bfloat16 + gradient_checkpointing |
Monitoring and Early Stopping
# With Weights & Biases
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=50000 \
--training.dtype=bfloat16 \
--training.gradient_checkpointing=true \
--training.wandb.enable=true \
--training.wandb.project=pi0fast-finetune \
--training.save_freq=5000 \
--training.eval_freq=5000
Monitor token_accuracy on the W&B dashboard — this is the most important metric. Target: >85% token accuracy on the validation set.
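Token accuracy here is exact-match accuracy over the predicted action tokens; as a quick sketch of the metric itself (the token ids are made up):

```python
import numpy as np

def token_accuracy(predicted, target):
    # Fraction of positions where the predicted token id matches the target
    predicted, target = np.asarray(predicted), np.asarray(target)
    return float((predicted == target).mean())

print(token_accuracy([543, 87, 114, 201], [543, 87, 99, 201]))  # 0.75
```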
Step 4: Evaluate on LIBERO and Real Robot
LIBERO Benchmark
LIBERO is a standard benchmark for language-conditioned robot manipulation. It consists of 4 task suites of increasing difficulty:
# Evaluate on LIBERO-Object (easiest)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_object \
--eval.num_episodes=50
# Evaluate on LIBERO-Goal
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_goal \
--eval.num_episodes=50
# Evaluate on LIBERO-Spatial (harder)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_spatial \
--eval.num_episodes=50
# Evaluate on LIBERO-Long (hardest, multi-step)
lerobot-eval \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_long \
--eval.num_episodes=50
Expected Results (fine-tuned Pi0-FAST):
| Suite | Success Rate |
|---|---|
| LIBERO-Object | ~92% |
| LIBERO-Goal | ~85% |
| LIBERO-Spatial | ~80% |
| LIBERO-Long | ~73% |
| Average | ~82.5% |
Deploy on Real Robot
# Run policy on real robot
lerobot-record \
--robot.type=so100 \
--policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
--repo_id=YOUR_USERNAME/pi0fast_eval \
--num_episodes=20 \
--fps=25
Note: Pi0-FAST can run at 25 Hz on RTX 4090 — set --fps=25 to fully utilize its speed.
KV-Caching: Why Pi0-FAST is Fast
What is KV-Caching?
In autoregressive decoding, each new token needs to attend to all previous tokens. Without caching, the model must recompute key-value pairs for the entire sequence at every step — extremely wasteful.
KV-caching stores previously computed key-value pairs in GPU memory, and each step only needs to compute KV for the newest token. For Pi0-FAST, this means:
Without KV-cache:
Token 1: compute attention for [image_tokens, instruction_tokens, token_1] → 1000 tokens
Token 2: compute attention for [image_tokens, instruction_tokens, token_1, token_2] → 1001 tokens
...
Token 10: compute attention for 1009 tokens
Total: ~10,000 token computations
With KV-cache:
Token 1: compute full attention (1000 tokens), cache KV
Token 2: compute attention only for token_2, reuse cached KV
...
Token 10: compute attention only for token_10
Total: ~1,000 + 9 = ~1,009 token computations
→ ~10x less computation!
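The counting above can be written down directly (prefix_len is the number of image + instruction tokens; 999 here reproduces the 1000-token first step from the trace):

```python
def attention_kv_cost(prefix_len, n_new_tokens, use_kv_cache):
    # Number of positions whose keys/values must be computed during decoding
    if use_kv_cache:
        # One full pass over the prefix plus the first token, then one position per step
        return (prefix_len + 1) + (n_new_tokens - 1)
    # Without a cache, the whole growing sequence is recomputed at every step
    return sum(prefix_len + i for i in range(1, n_new_tokens + 1))

print(attention_kv_cost(999, 10, use_kv_cache=False))  # 10045
print(attention_kv_cost(999, 10, use_kv_cache=True))   # 1009
```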
KV-Caching in Pi0-FAST
KV-caching is enabled by default in Pi0-FAST. You don't need to configure anything. However, it uses additional GPU memory — about 1-2GB per batch element. If you encounter OOM during inference, reduce batch size or disable it:
# Disable KV-cache (slower, but saves memory)
lerobot-eval \
--policy.path=YOUR/model \
--policy.use_kv_cache=false
Real-Time Chunking with Pi0-FAST
The Perfect Combination
Pi0-FAST + Real-Time Chunking is the most powerful combination in LeRobot v0.5. Thanks to 25 Hz inference speed, Pi0-FAST can continuously replan — every 2-3 actions, the model predicts a new chunk, blending with the old one:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.rtc_config.enabled=true \
--policy.rtc_config.n_steps_warmup=5 \
--policy.rtc_config.n_steps_between_replan=3 \
--policy.rtc_config.blend_alpha=0.7 \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=20000
RTC Parameters
- n_steps_warmup: Steps to execute before replanning starts (lets the model "stabilize" first)
- n_steps_between_replan: Steps to execute between each replan. Smaller = more reactive but more compute
- blend_alpha: Blending weight between new and old chunks. 1.0 = fully use the new chunk, 0.5 = average both
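The blending step can be sketched as a convex combination of the overlapping portions of the two chunks (an illustrative sketch; the actual RTC blending in LeRobot may differ):

```python
import numpy as np

def blend_chunks(old_chunk, new_chunk, blend_alpha=0.7):
    # Convex blend over the overlap: alpha=1.0 fully trusts the new chunk,
    # alpha=0.5 averages old and new plans
    return blend_alpha * new_chunk + (1.0 - blend_alpha) * old_chunk

old = np.zeros((7, 7))  # remaining actions from the previous chunk
new = np.ones((7, 7))   # overlapping portion of the freshly predicted chunk
print(blend_chunks(old, new, blend_alpha=0.7)[0, 0])  # 0.7
```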
When to Use RTC?
- Use it: Dynamic tasks, tasks requiring high precision, uncertain environments
- Skip it: Static pick-and-place, simple tasks where standard chunking is sufficient
PEFT/LoRA for Pi0-FAST
If you want to fine-tune Pi0-FAST but lack VRAM for full fine-tuning, use LoRA:
lerobot-train \
--policy.type=pi0_fast \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.peft_config.use_peft=true \
--policy.peft_config.lora_r=16 \
--policy.peft_config.lora_alpha=32 \
--policy.peft_config.target_modules='["q_proj","v_proj"]' \
--dataset.repo_id=YOUR_USERNAME/my_dataset \
--training.steps=15000 \
--training.dtype=bfloat16
With LoRA, VRAM drops from 40GB to 16-20GB, comfortably fitting on an RTX 3090.
If you're unfamiliar with PEFT/LoRA, read more in our v0.5 overview post for the theoretical background.
Troubleshooting
High Tokenizer Reconstruction Error
If lerobot-check-tokenizer reports MSE > 0.05:
- Increase encoded_dims (try 30 or 40)
- Increase vocab_size (try 2048)
- Check the data: are there action outliers? Try normalization_mode=mean_std
Low Token Accuracy During Training
If token accuracy < 70% after 10k steps:
- Check tokenizer quality (MSE < 0.01)
- Reduce the learning rate (try 5e-6)
- Increase training steps (try 50k)
- Check dataset size: Pi0-FAST needs at least 50 episodes
Inference Too Slow
If inference < 15 Hz on RTX 4090:
- Ensure KV-cache is enabled (default)
- Use dtype=bfloat16 for inference
- Reduce num_visual_tokens if applicable
- Check GPU utilization with nvidia-smi: if GPU usage is below 80%, the bottleneck is CPU/IO
Conclusion
Pi0-FAST represents a new direction in robot AI: autoregressive models for robot control. By combining the FAST tokenizer (10x action sequence compression) with KV-caching, Pi0-FAST achieves 5x inference speed over diffusion-based approaches while losing only ~2.5% accuracy.
When combined with Real-Time Chunking, Pi0-FAST enables robots to continuously update their plans at 25 Hz — near real-time responsiveness. This is a capability that previously only existed in classical control systems, now available for learned policies.
The choice between SmolVLA and Pi0-FAST is straightforward: if you're GPU-constrained, use SmolVLA. If you have an RTX 4090+ and need fast inference, use Pi0-FAST. For a deeper understanding of VLA theory, start with our VLA models overview.
Related Posts
- SmolVLA: Train a 450M VLA on Consumer GPU — Compact VLA for commodity hardware
- Diffusion Policy: Theory and Practice — Understand diffusion before comparing with autoregressive
- LeRobot Framework: Introduction and Architecture — LeRobot foundations for beginners