When Humanoid Meets VLA: Pi0-FAST on Unitree G1
Picture this: a humanoid robot — the Unitree G1 with 29 degrees of freedom, two articulated arms, and legs strong enough to walk — tasked not just to stand and pick up objects, but to navigate a room, approach a target, and execute high-precision manipulation. This is loco-manipulation, and LeRobot v0.5.0 is the first open-source framework to make this accessible.
Released in March 2026, LeRobot v0.5.0 marks a historic milestone: for the first time, a humanoid robot is fully integrated into an open-source training pipeline — from data collection through VLA policy training to real-time inference with Real-Time Chunking. The policy chosen to inaugurate the Unitree G1? Pi0-FAST — an autoregressive VLA 5× faster than the original Pi0.
This guide walks you through the complete pipeline: understanding G1's architecture in LeRobot, setting up the environment, teleoperating with the Homunculus Exoskeleton, collecting whole-body datasets, training Pi0-FAST, and deploying with RTC. If you've already read the Pi0-FAST training guide and the v0.5 overview, this is your next step — bringing everything onto real hardware.
Unitree G1 in LeRobot v0.5.0: The First Humanoid
Why G1 is LeRobot's "first humanoid"
Before v0.5.0, LeRobot supported robot arms (SO-100, Koch, Panda, WidowX) and mobile manipulators, but no humanoids. G1 broke that barrier for three reasons:
- More open than Boston Dynamics — G1 has a full SDK with no commercial license required for custom software integration
- Accessible price point — G1 sits around $16,000, far below other research-grade humanoids
- Large community — Unitree has an established ecosystem from Go2, Go1, and H1; a wealth of community code already exists
Two variants: 23-DoF vs 29-DoF
LeRobot v0.5.0 supports both variants:
| Spec | G1 23-DoF | G1 29-DoF |
|---|---|---|
| Legs (per side) | 6 joints | 6 joints |
| Arms (per side) | 4 joints | 7 joints |
| Hands | No | 3 DoF/hand |
| Walking speed | 2 m/s | 2 m/s |
| Arm payload | 3 kg | 1.5 kg |
| Use case | Heavy locomotion | Dexterous manipulation |
For whole-body loco-manipulation (walking + using hands), the 29-DoF variant is preferred for its arm flexibility and the ability to grasp objects with the dexterous hands.
Two locomotion controllers: Holosoma vs GR00T
This is the most distinctive aspect of the G1 integration. LeRobot doesn't write its own locomotion controller — instead, it integrates with two external controllers:
HolosomaLocomotionController (open-source):
- Fully open-source, code is inspectable and modifiable
- Controls locomotion (walk, stop, turn) via velocity commands
- Better for research and experimentation
- More complex installation due to dependency chain
GR00T-WholeBodyControl (NVIDIA):
- Uses NVIDIA GR00T foundation model as the locomotion backbone
- Higher performance, better balance recovery
- Requires NVIDIA GPU on the robot (Jetson AGX Orin or equivalent)
- Better suited for production deployment
Which to choose? If you're doing research and want deep understanding → Holosoma. If you want the best results out of the box → GR00T.
Pi0-FAST: Why It's the Ideal Policy for G1
The problem with diffusion on humanoids
The original Pi0 uses flow matching, a diffusion-style iterative generation process. Each inference call runs 10–50 denoising iterations, each one a full forward pass through a 2B-parameter network. The result: inference latency of 300–600ms on typical GPUs.
For a static robot arm, 300ms latency is acceptable — you wait an extra 0.3 seconds before the robot acts. For a humanoid that's actively walking, 300ms latency is catastrophic — the robot can lose balance in that window.
Pi0-FAST solves this by switching from diffusion to autoregressive generation. Instead of iterative denoising, Pi0-FAST predicts action tokens in a single pass — like an LLM generating text, but generating robot joint angles instead.
Pi0-FAST architecture in detail
Input:
- Camera images (RGB, 224×224 or 448×448)
- Text instruction ("Pick up the cup and walk to the table")
- Robot state (joint positions, velocities)
↓
SigLIP Vision Tower (ViT-So400M)
- Encodes images into visual tokens
- Patch size: 14×14, 256 tokens per 224×224 image (1,024 at 448×448)
↓
PaliGemma Backbone (Gemma 2B)
- Multimodal attention between visual + text + state tokens
- KV-cache for optimized inference
↓
FAST Action Tokenizer
- Normalize action chunk (quantile normalization)
- DCT per dimension (same algorithm as JPEG, but for time series)
- Quantize + prune insignificant DCT coefficients
- Output: ~20 discrete tokens instead of 50×29 = 1,450 floats
↓
Autoregressive decoding (Gemma action head)
- Generate FAST tokens one by one
- Decode back to continuous joint angles
Output: Action chunk (50 steps × 29 DoF = 1,450 values)
FAST tokenization is the key innovation. By applying a Discrete Cosine Transform (DCT), the same algorithm JPEG uses to compress images, to joint-angle time series, FAST compresses 1,450 floats into ~20 discrete tokens, roughly a 70× reduction in sequence length compared with one token per value. This doesn't just speed up inference; it also improves learning efficiency by drastically shortening the sequence the model must generate.
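The core trick is easy to demo. Below is a minimal sketch of DCT-based chunk compression in the spirit of FAST; it is not LeRobot's tokenizer (the real one adds quantile normalization and maps quantized coefficients to a learned discrete vocabulary), and `compress_chunk`/`energy_kept` are names invented for this example:

```python
import numpy as np
from scipy.fft import dct, idct

def compress_chunk(chunk, energy_kept=0.95):
    """DCT-compress a (timesteps, dofs) action chunk along the time axis.

    Keeps the smallest set of coefficients whose squared magnitudes account
    for `energy_kept` of the total energy; zeros out the rest.
    """
    coeffs = dct(chunk, axis=0, norm="ortho")
    flat = np.abs(coeffs).ravel()
    order = np.argsort(flat)[::-1]                 # largest coefficients first
    energy = np.cumsum(flat[order] ** 2)
    n_keep = int(np.searchsorted(energy, energy_kept * energy[-1])) + 1
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:n_keep]] = True
    pruned = np.where(mask.reshape(coeffs.shape), coeffs, 0.0)
    return pruned, n_keep

def decompress_chunk(pruned):
    """Invert the DCT to recover a continuous action chunk."""
    return idct(pruned, axis=0, norm="ortho")

# A smooth fake chunk: 50 timesteps x 29 DoF of slow sinusoidal motion.
t = np.linspace(0, 1, 50)[:, None]
chunk = np.sin(2 * np.pi * 0.05 * (1 + np.arange(29)) * t)
pruned, n_keep = compress_chunk(chunk)
recon = decompress_chunk(pruned)
print(f"kept {n_keep}/{chunk.size} coefficients, "
      f"max error {np.abs(chunk - recon).max():.4f}")
```

On smooth trajectories only a handful of low-frequency coefficients survive, which is exactly why autoregressive decoding over FAST tokens stays short and fast.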
Inference speed comparison
| Policy | Latency on RTX 4090 | Control frequency | Suitable for WBC? |
|---|---|---|---|
| Pi0 (diffusion) | 300–600ms | 1.5–3 Hz | No |
| Pi0-FAST (autoregressive) | 60–120ms | 8–16 Hz | Yes |
| SmolVLA | 40–80ms | 12–25 Hz | Yes |
| ACT | 20–40ms | 25–50 Hz | Yes |
Pi0-FAST at 8–16 Hz is responsive enough for loco-manipulation when combined with Real-Time Chunking.
Real-Time Chunking: Solving High-Latency Inference
Even at 8 Hz, you have 125ms of latency, during which the low-level locomotion controller has already executed 6–7 of its own control steps. This is where Real-Time Chunking (RTC) comes in.
How RTC works
Instead of waiting for inference to finish before executing actions, RTC lets the robot continue executing the current chunk while the next chunk is being inferred in the background. When the new chunk is ready, RTC doesn't hard-cut to it but smoothly blends between them:
Timeline (without RTC):
t=0 [Inference A: 120ms] → [Execute chunk A: 500ms] → [Inference B: 120ms] → ...
Timeline (with RTC):
t=0 [Inference A: 120ms]
t=120 [Execute A] + [Inference B: running in background]
t=240 [Blend A→B smoothly]
t=360 [Execute B] + [Inference C: running in background]
...
RTC adds a guidance term during chunk generation: it "pulls" the new chunk toward the steps already executed from the old chunk to ensure continuity. Without RTC you see jerky motion: every time the policy replans, the robot stutters. With RTC, motion is as smooth as if the policy ran at 25+ Hz.
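For intuition, here is a toy post-hoc blend between two chunks. The actual RTC implementation applies its guidance during chunk generation rather than after the fact, and `blend_chunks` and its arguments are illustrative names, not LeRobot API:

```python
import numpy as np

def blend_chunks(old_chunk, new_chunk, steps_executed, execution_horizon=10):
    """Smoothly hand over from the tail of `old_chunk` to `new_chunk`.

    Both chunks are (timesteps, dofs). `steps_executed` is how far into the
    old chunk the robot already is when the new chunk arrives. Over the next
    `execution_horizon` steps the output is a weighted mix that starts on
    the old chunk and ends fully on the new one.
    """
    remaining_old = old_chunk[steps_executed:]
    horizon = min(execution_horizon, len(remaining_old), len(new_chunk))
    w = np.linspace(0.0, 1.0, horizon)[:, None]    # 0 -> old, 1 -> new
    blended = (1 - w) * remaining_old[:horizon] + w * new_chunk[:horizon]
    return np.concatenate([blended, new_chunk[horizon:]], axis=0)

# Old chunk drifts up, new chunk replans downward: the blend avoids a jump.
old = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 29))
new = np.linspace(0.8, -0.2, 50)[:, None] * np.ones((1, 29))
actions = blend_chunks(old, new, steps_executed=12)
print(actions[:3, 0])   # starts near the old trajectory, no discontinuity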
For G1 whole-body control, RTC is especially critical for the locomotion component — losing balance control for even a few frames can cause a fall.
Environment Setup
System requirements
- GPU: RTX 4080/4090 or A100 (for real-time Pi0-FAST inference)
- RAM: 32 GB+ (G1 state observations are large)
- OS: Ubuntu 22.04 (tested) or Ubuntu 20.04
- Python: 3.12 (required by LeRobot v0.5.0)
- CUDA: 12.1+
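Before installing anything, a quick sanity check (assuming PyTorch is already available) that the GPU and its VRAM are visible:

```python
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
```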
Installing LeRobot v0.5.0 with G1 support
# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create Python 3.12 virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install LeRobot with G1 dependencies
pip install -e ".[unitree_g1]"
# If using HolosomaLocomotionController
pip install -e ".[unitree_g1,holoso]"
# If using GR00T
pip install -e ".[unitree_g1,groot]"
Install Unitree SDK
# Unitree SDK2 Python bindings
pip install unitree-sdk2py
# Verify robot connection (robot must be on same LAN)
python -c "from unitree_sdk2py.core.channel import ChannelFactory; print('SDK OK')"
Download pretrained Pi0-FAST checkpoint
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='lerobot/pi0fast-base',
local_dir='./checkpoints/pi0fast-base'
)
"
Teleoperation with Homunculus Exoskeleton
Homunculus is an open-source 7-DoF exoskeleton developed by the LeRobot community for G1 whole-body teleoperation. It tracks the operator's arm and hand movements and maps them to G1's arms and hands.
Setting up teleoperation
python lerobot/scripts/control_robot.py \
--robot.type=unitree_g1 \
--robot.variant=29dof \
--robot.controller=HolosomaLocomotionController \
--control.type=teleoperate \
--control.fps=30 \
--control.display_cameras=true
On first run, the script prompts you to calibrate the Homunculus — hold arms in neutral position to establish the reference frame. This takes about 2 minutes.
Collecting a whole-body dataset
Target task: teach G1 to "walk to the table, pick up a cup, carry it to the target location."
python lerobot/scripts/control_robot.py \
--robot.type=unitree_g1 \
--robot.variant=29dof \
--robot.controller=HolosomaLocomotionController \
--control.type=record \
--control.fps=30 \
--control.single_task="Walk to table, pick up cup, carry to target" \
--control.repo_id="your_username/g1-cup-transport" \
--control.num_episodes=100 \
--control.push_to_hub=true
Data collection tips for WBC:
- Episode length: Whole-body tasks run 30–60 seconds vs. 5–15 seconds for arm-only. Ensure your buffer is large enough.
- Camera placement: Mount at least 2 cameras: 1 chest camera (egocentric view) and 1 external overhead camera.
- Locomotion diversity: Vary locomotion style — straight walking, left/right turns, stop-and-wait — so the policy learns to generalize.
- Minimum 50 episodes for single tasks, 100+ recommended for WBC due to the much larger action space.
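Before training, it's worth sanity-checking what you recorded. Here's a sketch using LeRobot's dataset class; the import path and attribute names may vary between versions:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("your_username/g1-cup-transport")
print(f"episodes: {ds.num_episodes}, frames: {ds.num_frames}, fps: {ds.fps}")

# Spot-check one frame: WBC observations should include both camera views
# and the full 29-DoF state vector.
frame = ds[0]
for key, value in frame.items():
    print(key, getattr(value, "shape", None))
```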
Training Pi0-FAST for Unitree G1
Training configuration
Pi0-FAST requires training the FAST tokenizer first, then fine-tuning the main model. This is different from ACT or Diffusion Policy.
Step 1: Train FAST tokenizer on your dataset
python lerobot/scripts/train_fast_tokenizer.py \
--dataset.repo_id="your_username/g1-cup-transport" \
--tokenizer.vocab_size=512 \
--tokenizer.chunk_size=50 \
--tokenizer.dct_components=0.95 \
--output_dir="./checkpoints/fast-tokenizer-g1"
Key parameters:
- `vocab_size=512`: number of discrete tokens. Increase to 1024 for complex action spaces (29-DoF WBC needs more capacity than arm-only)
- `chunk_size=50`: steps per action chunk; 50 ≈ 1.67 seconds at 30 Hz
- `dct_components=0.95`: retain DCT coefficients until 95% of the variance is explained
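To see what `dct_components=0.95` means in practice, here is a toy per-joint cutoff based on cumulative explained variance; this mirrors the idea, not the tokenizer's exact rule:

```python
import numpy as np
from scipy.fft import dct

# Fake 50-step trajectory for one joint: mostly low-frequency motion.
traj = np.sin(np.linspace(0, 2 * np.pi, 50)) + 0.02 * np.random.randn(50)
c = dct(traj, norm="ortho")
var = np.cumsum(np.sort(c**2)[::-1]) / np.sum(c**2)
n_components = int(np.searchsorted(var, 0.95)) + 1
print(f"{n_components} of 50 DCT components explain 95% of the variance")
```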
Step 2: Fine-tune Pi0-FAST
python lerobot/scripts/train.py \
--policy.type=pi0fast \
--policy.pretrained="./checkpoints/pi0fast-base" \
--dataset.repo_id="your_username/g1-cup-transport" \
--policy.fast_tokenizer_path="./checkpoints/fast-tokenizer-g1" \
--training.batch_size=32 \
--training.num_epochs=100 \
--training.learning_rate=1e-4 \
--training.use_amp=true \
--output_dir="./checkpoints/pi0fast-g1-cup" \
--wandb.enable=true
Training hardware requirements
| Config | GPU | VRAM | Time/epoch (100 eps) |
|---|---|---|---|
| Pi0-FAST full | A100 80GB | 60 GB | ~15 min |
| Pi0-FAST + PEFT/LoRA | RTX 4090 | 20 GB | ~25 min |
| Pi0-FAST + PEFT/LoRA | RTX 3090 | 18 GB | ~40 min |
No A100? Use PEFT/LoRA fine-tuning — train only adapter layers while keeping the backbone frozen:
python lerobot/scripts/train.py \
--policy.type=pi0fast \
--policy.pretrained="./checkpoints/pi0fast-base" \
--training.peft.enabled=true \
--training.peft.method=lora \
--training.peft.rank=16 \
--training.peft.alpha=32 \
# ... other parameters as above
Inference with Real-Time Chunking
Deploying the policy on G1
python lerobot/scripts/control_robot.py \
--robot.type=unitree_g1 \
--robot.variant=29dof \
--robot.controller=HolosomaLocomotionController \
--control.type=inference \
--policy.path="./checkpoints/pi0fast-g1-cup" \
--control.fps=30 \
--control.rtc.enabled=true \
--control.rtc.execution_horizon=10 \
--control.rtc.max_guidance_weight=2.0 \
--control.task="Walk to table, pick up cup, carry to target"
Key RTC parameters
- `execution_horizon=10`: blend 10 steps from the new chunk with the old chunk. Increase if motion is jerky.
- `max_guidance_weight=2.0`: strength of the "pull" toward the old chunk. Higher = smoother but less reactive.
- `prefix_attention_schedule=exponential`: how the guidance weight decays over time.
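For intuition about the schedule, here is a toy sketch of how a guidance weight might decay across the execution horizon; the decay constant and names are illustrative, not LeRobot's exact formula:

```python
import numpy as np

def guidance_weights(horizon=10, max_weight=2.0, schedule="exponential"):
    """Per-step pull toward the already-executed chunk: strong at the start
    of the horizon, decaying toward zero by the end."""
    steps = np.arange(horizon)
    if schedule == "exponential":
        return max_weight * np.exp(-3.0 * steps / horizon)
    return max_weight * (1.0 - steps / horizon)   # linear fallback

print(np.round(guidance_weights(), 3))
```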
Results and Benchmarks
Based on results from the LeRobot team and community experiments on G1:
| Task | Policy | Success Rate | Avg Completion Time |
|---|---|---|---|
| Pick & place (standing) | Pi0-FAST + RTC | 87% | 8.2s |
| Pick & place (standing) | ACT | 82% | 6.1s |
| Loco-manip: cup transport | Pi0-FAST + RTC | 71% | 24.3s |
| Loco-manip: cup transport | Pi0-FAST (no RTC) | 43% | 31.1s |
| Door opening (walking) | Pi0-FAST + RTC | 58% | 18.7s |
Key takeaways:
- RTC boosts loco-manipulation success from 43% → 71% — the largest impact is on tasks combining locomotion with manipulation
- ACT is faster for simple pick-and-place, but Pi0-FAST generalizes better to unseen tasks
- 100 episodes is the sweet spot for WBC — more data still helps but with diminishing returns
Common Pitfalls
1. Robot loses balance during policy replanning
- Cause: Latency too high, or RTC not enabled
- Fix: Enable `--control.rtc.enabled=true` and increase `execution_horizon`
2. Training loss doesn't decrease
- Cause: FAST tokenizer not trained on your dataset — using default arm-only tokenizer
- Fix: Train a custom tokenizer first with `train_fast_tokenizer.py`
3. Locomotion controller won't connect
- Cause: Unitree SDK can't discover robot IP
- Fix: Set `export ROBOT_IP=192.168.123.1` and verify LAN connectivity
4. OOM when training full Pi0-FAST
- Fix: Add `--training.gradient_checkpointing=true` and `--training.peft.enabled=true`
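For pitfall 3, a quick reachability check from the workstation needs no SDK at all (192.168.123.1 is the default address used above):

```python
import os
import subprocess

robot_ip = os.environ.get("ROBOT_IP", "192.168.123.1")
result = subprocess.run(["ping", "-c", "3", robot_ip],
                        capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(f"Cannot reach {robot_ip}: check the Ethernet link and your subnet.")
```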
Conclusion
LeRobot v0.5.0 + Pi0-FAST + Unitree G1 is the most capable open-source stack available today for whole-body loco-manipulation. The pipeline isn't without challenges — complex setup, high GPU requirements, and the need for at least 100 demonstration episodes — but this is the first time the entire pipeline, from teleoperation to deployment, is available without paying for a proprietary framework.
The natural next step is tackling more complex tasks: carrying objects up stairs, opening a refrigerator, or interacting with humans in dynamic environments. The framework is ready — what remains is data collection and experimentation.