Pi0-FAST: An Autoregressive VLA with 5x Faster Inference

Pi0-FAST: When Autoregressive Beats Diffusion

In the previous post, we explored SmolVLA, a compact VLA that runs on consumer GPUs. Now we turn to a very different approach: Pi0-FAST, a model that combines the PaliGemma backbone with the FAST action tokenizer to reach inference speeds about 5x faster than the original diffusion-based Pi0.

Why does this matter? In robotics, inference speed determines how quickly the robot can react. A model running at 5 Hz (5 predictions per second) issues a new prediction every 200 ms, while a model running at 25 Hz does so every 40 ms. In tasks that demand high precision, such as assembling, pouring, and inserting, that gap can be the difference between success and failure.

This post walks through each step: why the FAST tokenizer solves a problem that standard binning cannot, how to train a custom tokenizer, how to fine-tune Pi0-FAST, and how to deploy it with KV-caching.

What is Pi0-FAST?

The problem with the original Pi0

Pi0 (Physical Intelligence's model) uses flow matching, a generative process closely related to diffusion, to produce robot actions. Flow matching starts from random noise and iteratively denoises it into an action trajectory. This takes multiple denoising steps (typically 10-50), and every step requires a forward pass through the neural network.
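To make the cost of iterative denoising concrete, here is a minimal sketch of an Euler-style sampling loop. It is not Pi0's actual sampler: toy_velocity is a hypothetical stand-in for the policy network, and the 7-dimensional target action is made up. The only point is that the number of network calls equals the number of denoising steps.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the network: a velocity that moves x toward target."""
    return (target - x) / (1.0 - t)

def sample_with_euler(target, num_steps, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    dt = 1.0 / num_steps
    forward_passes = 0
    for k in range(num_steps):
        t = k * dt
        x = x + dt * toy_velocity(x, t, target)  # one "network" call per step
        forward_passes += 1
    return x, forward_passes

target = np.linspace(-0.5, 0.5, 7)  # made-up 7-joint action
action, calls = sample_with_euler(target, num_steps=10)
print(f"network calls: {calls}")
```

With a real multi-billion-parameter backbone, each of those calls is a full forward pass, which is why the step count dominates latency.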

Result: the original Pi0 runs at about 5 Hz, which is enough for many tasks but too slow for:

Dexterous manipulation (needs >15 Hz)
Dynamic tasks (catching, pouring)
Tasks that need fast replanning (Real-Time Chunking)

The solution: Autoregressive + FAST Tokenizer

Pi0-FAST replaces flow matching with autoregressive decoding, the same way language models (GPT, LLaMA) generate text token by token. There is one catch: robot actions are continuous values (joint angles, positions), not discrete tokens.

This is where the FAST tokenizer comes in.

FAST Tokenizer: Turning continuous actions into tokens

FAST (Frequency-space Action Sequence Tokenization) solves the tokenization problem for robot actions with a 5-step pipeline:

Step 1 (Normalize): Scale action values to the range [-1, 1]

raw_actions = [0.15, -0.32, 1.47, 0.003, ...]
normalized = [-0.7, 0.2, 0.95, -0.99, ...]
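As a rough illustration of bounds-based normalization, the sketch below maps each joint to [-1, 1] using its own min and max. The function names are hypothetical, and for brevity the bounds come from a single toy chunk; in practice they would come from statistics over the whole dataset.

```python
import numpy as np

def normalize_bounds(actions, low, high):
    """Map each action dimension linearly from [low, high] to [-1, 1]."""
    return 2.0 * (actions - low) / (high - low) - 1.0

def denormalize_bounds(normalized, low, high):
    """Inverse mapping, from [-1, 1] back to the original units."""
    return (normalized + 1.0) / 2.0 * (high - low) + low

# Toy chunk: 3 timesteps x 2 joints (made-up values).
chunk = np.array([[0.0, -1.0], [0.5, 0.0], [1.0, 3.0]])
low, high = chunk.min(axis=0), chunk.max(axis=0)  # assumption: per-joint bounds

normalized = normalize_bounds(chunk, low, high)
restored = denormalize_bounds(normalized, low, high)
print(normalized)
```

Because each joint has its own bounds, the same raw value can map to different normalized values in different columns.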

Step 2 (DCT, Discrete Cosine Transform): Convert the action sequence from the time domain to the frequency domain. Just as JPEG compresses images with the DCT, FAST compresses action trajectories.

# Action chunk [10 timesteps x 7 joints] = 70 values
# After the DCT, keep only the top-K frequency components
# Example: K=20 → compression ratio 3.5x

The DCT works because robot actions are usually smooth, with no sudden jumps between consecutive timesteps. The high-frequency components (which mostly carry noise and jitter) can be dropped without losing important information.
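The effect is easy to try on toy data. The sketch below builds an orthonormal DCT-II matrix with NumPy, transforms a made-up smooth 10 x 7 chunk, keeps the K=20 largest coefficients, and reconstructs. The trajectory and the "largest magnitude" selection rule are assumptions for illustration, not the tokenizer's actual implementation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k is the k-th cosine basis function."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

# Toy smooth chunk: 10 timesteps x 7 joints (made-up data).
steps = np.linspace(0.0, 1.0, 10)
chunk = np.stack([np.sin(np.pi * steps * (j + 1) / 4) for j in range(7)], axis=1)

D = dct_matrix(10)
coeffs = D @ chunk  # DCT along the time axis, one column per joint

# Keep the K largest-magnitude coefficients and zero out the rest.
K = 20
flat = coeffs.ravel().copy()
flat[np.argsort(np.abs(flat))[:-K]] = 0.0
kept = flat.reshape(coeffs.shape)

reconstructed = D.T @ kept  # inverse DCT, valid because D is orthonormal
mse = float(np.mean((reconstructed - chunk) ** 2))
print(f"kept {np.count_nonzero(kept)}/{chunk.size} coefficients, MSE = {mse:.2e}")
```

Rerunning with a jittery trajectory (for example, adding random noise to chunk) shows the reconstruction error rising for the same K, which is the smoothness assumption at work.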

Step 3 (Quantize): Convert the continuous DCT coefficients into discrete integers

dct_coefficients = [0.742, -0.321, 0.055, ...]
quantized = [189, 87, 114, ...]  # Integers in the range [0, 255] (illustrative values)
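One simple way to do this step is uniform quantization. The sketch below assumes the coefficients have been scaled to [-1, 1] and maps them linearly onto 256 levels; the real tokenizer's scaling and rounding rule may differ, so treat the mapping and the function names as assumptions.

```python
import numpy as np

def quantize(coeffs, levels=256):
    """Uniformly map values in [-1, 1] to integer codes 0 .. levels-1."""
    clipped = np.clip(coeffs, -1.0, 1.0)
    return np.rint((clipped + 1.0) / 2.0 * (levels - 1)).astype(np.int64)

def dequantize(codes, levels=256):
    """Map integer codes back to approximate values in [-1, 1]."""
    return codes / (levels - 1) * 2.0 - 1.0

coeffs = np.array([0.5, -1.0, 0.0, 1.0])  # made-up coefficients
codes = quantize(coeffs)
approx = dequantize(codes)
max_error = float(np.max(np.abs(approx - coeffs)))
print(codes, max_error)
```

The round trip is lossy, but the error is bounded by half a quantization step, here 1/255.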

Step 4 (Flatten): Concatenate all quantized values into a single sequence

flattened = [189, 87, 114, 201, 55, ...]  # Flat sequence of integers

Step 5 (BPE, Byte Pair Encoding): Apply BPE tokenization (as used for text) to compress the sequence further

# BPE finds recurring patterns and merges them into single tokens
# Example: [189, 87] appears frequently → token_543
# Result: a sequence of length 70 → compressed to ~7-10 tokens

Result: an action chunk of 10 timesteps x 7 joints = 70 values becomes only ~7-10 tokens. That is a compression ratio of roughly 7-10x, which lets an autoregressive model generate actions much faster.
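The BPE step can be sketched in a few lines of plain Python. This toy version learns merges from a single made-up sequence, whereas a real tokenizer learns them over a whole dataset; the function names and the starting id 256 for new tokens are assumptions for illustration.

```python
from collections import Counter

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of pair with new_token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(seq, num_merges, first_new_token=256):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = {}
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging no longer helps
            break
        new_token = first_new_token + step
        merges[pair] = new_token
        seq = merge_pair(seq, pair, new_token)
    return seq, merges

codes = [189, 87, 114, 189, 87, 201, 189, 87, 114, 55]  # made-up quantized values
compressed, merges = train_bpe(codes, num_merges=2)
print(len(codes), "->", len(compressed), compressed)
```

Two merges already shrink this sequence from 10 symbols to 5; with a larger vocabulary and repetitive data, the savings compound.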

Why does standard binning fail?

The naive approach: split the range [-1, 1] into N bins (for example, 256 bins) and map each action value to the nearest bin. The problems:

Sequence length: With 7 joints x 10 timesteps = 70 dimensions, autoregressive decoding has to generate 70 tokens, which is too slow
Lost correlation: Standard binning tokenizes each value independently and ignores the temporal correlation between timesteps
No compression: 70 values → 70 tokens, with no compression at all

FAST addresses all 3 problems: the DCT exploits temporal correlation, BPE finds recurring patterns, and the result is 70 values → ~7-10 tokens.
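For comparison, naive binning fits in a few lines with NumPy. The sketch below uses a made-up 10 x 7 chunk and a hypothetical bin_tokenize helper; the thing to notice is that the token count always equals the number of action values.

```python
import numpy as np

def bin_tokenize(chunk, num_bins=256):
    """Map each value in [-1, 1] to one of num_bins equal-width bins."""
    edges = np.linspace(-1.0, 1.0, num_bins + 1)
    # Use only interior edges so that -1 maps to bin 0 and +1 to bin num_bins-1.
    return np.digitize(chunk, edges[1:-1]).ravel()

chunk = np.linspace(-1.0, 1.0, 70).reshape(10, 7)  # made-up normalized actions
tokens = bin_tokenize(chunk)
print(tokens.size, tokens.min(), tokens.max())
```

Each value becomes its own token regardless of how predictable it is from its neighbors, so a 70-value chunk always costs 70 decoding steps.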

Comparison: Pi0 vs Pi0-FAST

Feature	Pi0 (Flow Matching)	Pi0-FAST (Autoregressive)
Action generation	Iterative denoising (10-50 steps)	Token-by-token (~7-10 tokens)
Inference speed	~5 Hz	~25 Hz
KV-caching	Not applicable	Yes, significant speedup
Training complexity	Moderate	Higher (needs a trained tokenizer)
LIBERO benchmark	~85%	~82.5%
Real-world dexterous	Excellent	Very good
Backbone	PaliGemma 3B	PaliGemma 3B

Pi0-FAST gives up about 2.5 percentage points of benchmark success rate in exchange for 5x inference speed, a very good trade-off for real-world deployment.

Step 1: Installation

System requirements

Python 3.12+
GPU: RTX 4090 (24GB) or better for training, RTX 3090 for inference
CUDA 12.1+
RAM: 32GB+ recommended

Install LeRobot with the Pi dependencies

# Create the environment
python3.12 -m venv pi0fast-env
source pi0fast-env/bin/activate

# Clone and install
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[pi]"

# Verify
python -c "from lerobot.policies import Pi0FASTPolicy; print('Pi0-FAST OK')"

The [pi] extra installs the additional dependencies for the PaliGemma backbone and the FAST tokenizer.

Step 2: Train a custom FAST tokenizer

Why train a custom tokenizer?

The pretrained FAST tokenizer (lerobot/fast-action-tokenizer) was trained on Open-X Embodiment data, a collection covering many different robots. It works well in many cases, but if your robot's action space differs significantly (a different number of joints, different ranges), training a custom tokenizer usually gives better results.

When to use pretrained vs custom

Use pretrained: Standard robots (6-7 DOF arm, SO-100, ALOHA), common tasks
Train custom: Unusual robots (humanoid, mobile manipulator), large action spaces (>10 DOF), large datasets (>500 episodes)

Tokenizer training command

lerobot-train-tokenizer \
  --repo_id=YOUR_USERNAME/my_dataset \
  --action_horizon=10 \
  --encoded_dims=20 \
  --vocab_size=1024 \
  --scale=1.0 \
  --normalization_mode=bounds \
  --output_dir=outputs/tokenizer/my_tokenizer

Tokenizer parameters explained

--action_horizon (default: 10): Number of timesteps in each action chunk. Must match chunk_size when training Pi0-FAST.

action_horizon=5  → Shorter chunks, faster reaction, but less smooth
action_horizon=10 → Balanced (recommended)
action_horizon=20 → Smoother, but slower to react

--encoded_dims (default: 20): Number of DCT coefficients to keep. Higher = more detail preserved, but longer token sequences.

encoded_dims=10 → High compression, loses fine detail
encoded_dims=20 → Balanced (recommended)
encoded_dims=40 → Keeps most detail, longer token sequences

--vocab_size (default: 1024): Size of the BPE vocabulary. Larger = better compression but requires more data.

vocab_size=256  → Small vocabulary, easy to learn, low compression
vocab_size=1024 → Balanced (recommended)
vocab_size=4096 → Large vocabulary, requires more data

--scale: Multiplier applied to the DCT coefficients before quantization. Usually left at 1.0.

--normalization_mode: How actions are normalized:

bounds: Uses min/max from the dataset (recommended for most cases)
mean_std: Uses mean/std normalization
none: No normalization (only when actions are already normalized)

Checking tokenizer quality

After training finishes, check the reconstruction error:

lerobot-check-tokenizer \
  --tokenizer_path=outputs/tokenizer/my_tokenizer \
  --repo_id=YOUR_USERNAME/my_dataset \
  --num_samples=100

The output reports the reconstruction MSE; the target is < 0.01. If it is higher, try increasing encoded_dims or vocab_size.
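If you want to see what a reconstruction check computes, the sketch below averages the encode/decode round-trip error over a set of chunks. It does not call the real tokenizer: encode and decode are hypothetical stand-ins (coarse rounding), and the chunks are random data rather than a real dataset.

```python
import numpy as np

def reconstruction_mse(chunks, encode, decode):
    """Mean squared error between each chunk and its encode/decode round trip."""
    errors = [np.mean((decode(encode(c)) - c) ** 2) for c in chunks]
    return float(np.mean(errors))

def encode(chunk):
    """Stand-in codec: snap values to a 0.1 grid and store them as integers."""
    return np.rint(chunk * 10).astype(np.int64)

def decode(codes):
    return codes / 10.0

rng = np.random.default_rng(0)
chunks = rng.uniform(-1.0, 1.0, size=(100, 10, 7))  # 100 made-up action chunks
mse = reconstruction_mse(chunks, encode, decode)
print(f"reconstruction MSE: {mse:.5f} (target < 0.01)")
```

Swapping in your own encode and decode callables gives a quick sanity check that is independent of the CLI.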

Step 3: Fine-tune Pi0-FAST

Using the pretrained model

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000 \
  --training.batch_size=32 \
  --training.lr=1e-5 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true \
  --policy.chunk_size=10 \
  --policy.n_action_steps=10

Training parameters explained

--training.dtype=bfloat16: Uses bfloat16 mixed precision. Cuts VRAM by ~40% and speeds up training by ~30% compared to float32. Required on the RTX 4090, where float32 will run out of memory (OOM).

--training.gradient_checkpointing=true: Trades speed for memory. Cuts VRAM by ~30% by recomputing activations instead of storing them. Training is ~20% slower, but it fits on smaller GPUs.

--policy.chunk_size=10: Number of actions per chunk. Must match the tokenizer's action_horizon.

--policy.n_action_steps=10: Number of action steps to execute before querying the model again. Set it equal to chunk_size for standard chunking, or lower if you use Real-Time Chunking.
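The interaction between chunk_size and n_action_steps is easiest to see in a control loop. The sketch below is a simplified model of action chunking, not LeRobot's implementation: run_episode and dummy_policy are hypothetical names, and the policy just returns zeros.

```python
def run_episode(total_steps, chunk_size, n_action_steps, predict_chunk):
    """Execute n_action_steps from each predicted chunk, then query the policy again."""
    if n_action_steps > chunk_size:
        raise ValueError("n_action_steps cannot exceed chunk_size")
    queue, executed, queries = [], [], 0
    for _ in range(total_steps):
        if not queue:
            chunk = predict_chunk(chunk_size)  # one model inference
            queries += 1
            queue = list(chunk[:n_action_steps])  # the rest of the chunk is discarded
        executed.append(queue.pop(0))
    return executed, queries

def dummy_policy(chunk_size):
    """Stand-in policy that returns a chunk of zero actions."""
    return [0.0] * chunk_size

_, queries_standard = run_episode(30, 10, 10, dummy_policy)
_, queries_replan = run_episode(30, 10, 3, dummy_policy)
print(queries_standard, queries_replan)
```

Over 30 control steps, executing full chunks needs 3 inferences, while executing only 3 steps per chunk needs 10. Lowering n_action_steps buys reactivity at the cost of more model calls, which is why it pairs well with a fast policy.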

Using a custom tokenizer

If you trained a custom tokenizer in Step 2:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.tokenizer_path=outputs/tokenizer/my_tokenizer \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true

Estimated training time

GPU	Batch Size	20k Steps	Notes
A100 80GB	64	~3 hours	gradient_checkpointing not needed
RTX 4090 24GB	32	~6 hours	Requires bfloat16 + gradient_checkpointing
RTX 3090 24GB	16	~10 hours	Requires bfloat16 + gradient_checkpointing

Monitoring and Early Stopping

# With Weights & Biases
lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=50000 \
  --training.dtype=bfloat16 \
  --training.gradient_checkpointing=true \
  --training.wandb.enable=true \
  --training.wandb.project=pi0fast-finetune \
  --training.save_freq=5000 \
  --training.eval_freq=5000

Track token_accuracy on the W&B dashboard; it is the most important metric. Target: >85% token accuracy on the validation set.
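For intuition about what a token accuracy metric measures, here is a minimal sketch: the fraction of non-padding positions where the highest-scoring token matches the target. How LeRobot computes its logged metric is not shown in this post, so the function name, the pad_id value, and the toy logits are all assumptions.

```python
import numpy as np

def token_accuracy(logits, targets, pad_id=-100):
    """Fraction of non-padding positions where argmax(logits) equals the target."""
    preds = logits.argmax(axis=-1)
    mask = targets != pad_id
    return float((preds[mask] == targets[mask]).mean())

# Toy batch: 1 sequence, 4 positions, vocabulary of 3 tokens.
logits = np.array([[[2.0, 0.1, 0.1],
                    [0.1, 0.2, 3.0],
                    [1.5, 1.0, 0.2],
                    [0.0, 0.0, 9.0]]])
targets = np.array([[0, 2, 1, -100]])  # last position is padding

accuracy = token_accuracy(logits, targets)
print(f"token accuracy: {accuracy:.2%}")
```

Here two of the three scored positions are correct. Keep in mind that a single wrong token can change several decoded action values, so token accuracy and task success do not map one to one.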

Step 4: Evaluate on LIBERO and on a real robot

LIBERO Benchmark

LIBERO is a standard benchmark for robot manipulation. It consists of 4 suites of increasing difficulty:

# Evaluate on LIBERO-Object (easiest)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_object \
  --eval.num_episodes=50

# Evaluate on LIBERO-Goal
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_goal \
  --eval.num_episodes=50

# Evaluate on LIBERO-Spatial (harder)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_spatial \
  --eval.num_episodes=50

# Evaluate on LIBERO-Long (hardest, multi-step)
lerobot-eval \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_long \
  --eval.num_episodes=50

Expected results (fine-tuned Pi0-FAST):

Suite	Success Rate
LIBERO-Object	~92%
LIBERO-Goal	~85%
LIBERO-Spatial	~80%
LIBERO-Long	~73%
Average	~82.5%

Deploy on a real robot

# Run the policy on a real robot
lerobot-record \
  --robot.type=so100 \
  --policy.path=outputs/train/pi0_fast/checkpoints/last/pretrained_model \
  --repo_id=YOUR_USERNAME/pi0fast_eval \
  --num_episodes=20 \
  --fps=25

Note: Pi0-FAST can run at 25 Hz on an RTX 4090, so set --fps=25 to take full advantage of that speed.

KV-Caching: Why Pi0-FAST is fast

What is KV-caching?

In autoregressive decoding, each new token has to attend to all previous tokens. Without a cache, the model recomputes the key-value pairs for the entire sequence at every step, which is extremely wasteful.

KV-caching stores the already computed key-value pairs in GPU memory, so each step only computes the KV for the newest token. For Pi0-FAST, this means:

Without KV-cache:
  Token 1: compute attention for [image_tokens, instruction_tokens, token_1] → 1000 tokens
  Token 2: compute attention for [image_tokens, instruction_tokens, token_1, token_2] → 1001 tokens
  ...
  Token 10: compute attention for 1009 tokens
  Total: ~10,000 token computations

With KV-cache:
  Token 1: compute full attention (1000 tokens), cache KV
  Token 2: compute attention only for token_2, reuse cached KV
  ...
  Token 10: compute attention only for token_10
  Total: ~1,000 + 9 = ~1,009 token computations
  → ~10x less computation!
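The counting above can be reproduced with a toy single-head attention layer in NumPy. This is a sketch, not Pi0-FAST's code: the weights are random, the hidden size of 8 is arbitrary, and the 10 "generated" tokens are given up front instead of being sampled, so only the bookkeeping is realistic.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head attention of one query over all keys and values."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_without_cache(prefix, new_tokens):
    seq, outputs, projected = list(prefix), [], 0
    for tok in new_tokens:
        seq.append(tok)
        X = np.stack(seq)
        K, V = X @ Wk, X @ Wv  # recompute K/V for the whole sequence
        projected += len(seq)
        outputs.append(attend(tok @ Wq, K, V))
    return np.stack(outputs), projected

def decode_with_cache(prefix, new_tokens):
    K, V = prefix @ Wk, prefix @ Wv  # prefill once, then keep K/V around
    outputs, projected = [], len(prefix)
    for tok in new_tokens:
        K = np.vstack([K, tok @ Wk])  # project only the newest token
        V = np.vstack([V, tok @ Wv])
        projected += 1
        outputs.append(attend(tok @ Wq, K, V))
    return np.stack(outputs), projected

prefix = rng.standard_normal((999, d))     # stands in for image + instruction tokens
new_tokens = rng.standard_normal((10, d))  # stands in for 10 action tokens

out_slow, n_slow = decode_without_cache(prefix, new_tokens)
out_fast, n_fast = decode_with_cache(prefix, new_tokens)
print(n_slow, n_fast, np.allclose(out_slow, out_fast))
```

Both paths produce the same outputs, but the uncached loop projects 10,045 token vectors while the cached loop projects 1,009, matching the rough 10x figure above.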

KV-caching in Pi0-FAST

KV-caching is enabled by default in Pi0-FAST, so no extra configuration is needed. It does use additional GPU memory, roughly 1-2GB per batch element. If you hit OOM during inference, reduce the batch size or turn it off:

# Disable the KV-cache (slower, but saves memory)
lerobot-eval \
  --policy.path=YOUR/model \
  --policy.use_kv_cache=false

Real-Time Chunking with Pi0-FAST

A natural fit

Pi0-FAST + Real-Time Chunking is the strongest combination in LeRobot v0.5. With 25 Hz inference, Pi0-FAST can replan continuously: every 2-3 actions, the model predicts a new chunk and blends it with the old one:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.rtc_config.enabled=true \
  --policy.rtc_config.n_steps_warmup=5 \
  --policy.rtc_config.n_steps_between_replan=3 \
  --policy.rtc_config.blend_alpha=0.7 \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=20000

RTC parameters

n_steps_warmup: Number of steps to execute before replanning starts (lets the model "settle" first)
n_steps_between_replan: How many steps to execute between replans. Lower = more reactive but more compute
blend_alpha: Blending weight between the new chunk and the old chunk. 1.0 = use the new chunk only, 0.5 = average of both
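Read literally, the blend_alpha description suggests a linear mix over the timesteps where the old and new chunks overlap. The sketch below implements that reading; it is an assumption based on the parameter description, blend_chunks is a hypothetical helper, and the actual RTC implementation may work differently.

```python
import numpy as np

def blend_chunks(old_remaining, new_chunk, blend_alpha):
    """Linearly blend the unexecuted tail of the old chunk into the new chunk."""
    overlap = min(len(old_remaining), len(new_chunk))
    blended = np.array(new_chunk, dtype=float)  # copy, so the input stays untouched
    blended[:overlap] = (blend_alpha * new_chunk[:overlap]
                         + (1.0 - blend_alpha) * old_remaining[:overlap])
    return blended

old_chunk = np.zeros((10, 2))  # made-up old plan: 10 steps x 2 joints
new_chunk = np.ones((10, 2))   # made-up new plan
executed = 3                   # steps of the old chunk already executed

blended = blend_chunks(old_chunk[executed:], new_chunk, blend_alpha=0.7)
print(blended[:, 0])
```

The first 7 steps mix both plans (0.7 here), and the remaining 3 steps come from the new chunk alone because the old chunk has run out.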

When to use RTC?

Use it for: Dynamic tasks, tasks that need high precision, environments with uncertainty
Not needed for: Static pick-and-place, simple tasks where standard chunking is already good enough

PEFT/LoRA for Pi0-FAST

If you want to fine-tune Pi0-FAST but do not have enough VRAM for full fine-tuning, use LoRA:

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --policy.peft_config.use_peft=true \
  --policy.peft_config.lora_r=16 \
  --policy.peft_config.lora_alpha=32 \
  --policy.peft_config.target_modules=["q_proj","v_proj"] \
  --dataset.repo_id=YOUR_USERNAME/my_dataset \
  --training.steps=15000 \
  --training.dtype=bfloat16

With LoRA, VRAM drops from 40GB to 16-20GB, which comfortably fits on an RTX 3090.
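To see where the savings come from, here is a minimal NumPy sketch of a LoRA-adapted linear layer with r=16 and alpha=32. The 512 x 512 weight is an arbitrary toy size, not a real PaliGemma dimension, and LoRALinear is a hypothetical class written for this post rather than the PEFT library's API.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / r."""

    def __init__(self, W, r, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                      # frozen, never updated
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, r))                   # trainable, zero at init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 512))  # toy projection matrix
layer = LoRALinear(W, r=16, alpha=32)
x = rng.standard_normal((4, 512))

ratio = layer.trainable_params() / W.size
print(f"trainable: {layer.trainable_params()} of {W.size} ({ratio:.2%})")
```

Only A and B receive gradients, so optimizer state and gradient memory shrink with the rank. Because B starts at zero, the adapted layer initially behaves exactly like the frozen one.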

If you are new to PEFT/LoRA, read the v0.5 overview post for the theory.

Troubleshooting

High tokenizer reconstruction error

If lerobot-check-tokenizer reports MSE > 0.05:

Increase encoded_dims (try 30 or 40)
Increase vocab_size (try 2048)
Check the data: are there outliers in the actions? If so, use normalization_mode=mean_std

Low token accuracy during training

If token accuracy is < 70% after 10k steps:

Check tokenizer quality (MSE < 0.01)
Lower the learning rate (try 5e-6)
Train for more steps (try 50k)
Check dataset size: Pi0-FAST needs at least 50 episodes

Inference too slow

If inference runs at < 15 Hz on an RTX 4090:

Make sure the KV-cache is enabled (the default)
Use dtype=bfloat16 for inference
Reduce num_visual_tokens if available
Check GPU utilization with nvidia-smi: if the GPU is below 80%, the bottleneck is CPU/IO
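Before tuning anything, it helps to measure the actual control rate. The sketch below times a callable and converts the result to Hz; measure_hz and fake_policy_step are hypothetical names, and the 2 ms sleep stands in for a real policy call.

```python
import time

def measure_hz(step_fn, warmup=3, iters=20):
    """Average call rate of step_fn in Hz, after a few warmup calls."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed

def budget_ms(target_hz):
    """Time available per inference, in milliseconds, to sustain target_hz."""
    return 1000.0 / target_hz

def fake_policy_step():
    time.sleep(0.002)  # stand-in for one policy inference

hz = measure_hz(fake_policy_step)
print(f"measured: {hz:.1f} Hz, budget at 25 Hz: {budget_ms(25):.0f} ms per step")
```

When timing a GPU model, synchronize the device before reading the clock (for example with torch.cuda.synchronize() in PyTorch); otherwise asynchronous kernel launches make inference look faster than it is.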

Conclusion

Pi0-FAST represents a new direction in robot AI: autoregressive models for robot control. By combining the FAST tokenizer (up to ~10x compression of action sequences) with KV-caching, Pi0-FAST reaches 5x the inference speed of diffusion-based approaches while giving up only about 2.5 percentage points of success rate.

Combined with Real-Time Chunking, Pi0-FAST lets a robot keep updating its plan at 25 Hz, which is close to real-time responsiveness. This capability used to be limited to classical control systems and can now be applied to learned policies.

Choosing between SmolVLA and Pi0-FAST is straightforward: if you are limited by your GPU, use SmolVLA. If you have an RTX 4090 or better and need fast inference, use Pi0-FAST. If you want a deeper understanding of VLA theory, start with the VLA models overview post.