
LeRobot v0.5: Pi0-FAST + G1 Whole-Body Control

Deploy Pi0-FAST on Unitree G1 with whole-body loco-manipulation in LeRobot v0.5.0 — from setup and teleoperation to real-time inference with RTC.

Nguyễn Anh Tuấn · April 21, 2026 · 11 min read

When Humanoid Meets VLA: Pi0-FAST on Unitree G1

Picture this: a humanoid robot — the Unitree G1 with 29 degrees of freedom, two articulated arms, and legs strong enough to walk — tasked not just to stand and pick up objects, but to navigate a room, approach a target, and execute high-precision manipulation. This is loco-manipulation, and LeRobot v0.5.0 is the first open-source framework to make this accessible.

Released in March 2026, LeRobot v0.5.0 marks a historic milestone: for the first time, a humanoid robot is fully integrated into an open-source training pipeline — from data collection through VLA policy training to real-time inference with Real-Time Chunking. The policy chosen to inaugurate the Unitree G1? Pi0-FAST — an autoregressive VLA 5× faster than the original Pi0.

This guide walks you through the complete pipeline: understanding G1's architecture in LeRobot, setting up the environment, teleoperating with the Homunculus Exoskeleton, collecting whole-body datasets, training Pi0-FAST, and deploying with RTC. If you've already read the Pi0-FAST training guide and the v0.5 overview, this is your next step — bringing everything onto real hardware.


Unitree G1 in LeRobot v0.5.0: The First Humanoid

Why G1 is LeRobot's "first humanoid"

Before v0.5.0, LeRobot supported robot arms (SO-100, Koch, Panda, WidowX) and mobile manipulators, but no humanoids. G1 broke that barrier for three reasons:

  1. More open than Boston Dynamics — G1 has a full SDK with no commercial license required for custom software integration
  2. Accessible price point — G1 sits around $16,000, far below other research-grade humanoids
  3. Large community — Unitree has an established ecosystem from Go2, Go1, and H1; a wealth of community code already exists

Two variants: 23-DoF vs 29-DoF

LeRobot v0.5.0 supports both variants:

| Spec | G1 23-DoF | G1 29-DoF |
|---|---|---|
| Legs (per side) | 6 joints | 6 joints |
| Arms (per side) | 4 joints | 7 joints |
| Hands | No | 3 DoF/hand |
| Walking speed | 2 m/s | 2 m/s |
| Arm payload | 3 kg | 1.5 kg |
| Use case | Heavy locomotion | Dexterous manipulation |

For whole-body loco-manipulation (walking + using hands), the 29-DoF variant is preferred for its arm flexibility and the ability to grasp objects with the dexterous hands.

Unitree G1 humanoid robot with 29 degrees of freedom

Two locomotion controllers: Holoso vs GR00T

This is the most distinctive aspect of the G1 integration. LeRobot doesn't write its own locomotion controller — instead, it integrates with two external controllers:

HolosomaLocomotionController (open-source):

  • Fully open-source, code is inspectable and modifiable
  • Controls locomotion (walk, stop, turn) via velocity commands
  • Better for research and experimentation
  • More complex installation due to dependency chain

GR00T-WholeBodyControl (NVIDIA):

  • Uses NVIDIA GR00T foundation model as the locomotion backbone
  • Higher performance, better balance recovery
  • Requires NVIDIA GPU on the robot (Jetson AGX Orin or equivalent)
  • Better suited for production deployment

Which to choose? If you're doing research and want deep understanding → Holoso. If you want the best results out of the box → GR00T.


Pi0-FAST: Why It's the Ideal Policy for G1

The problem with diffusion on humanoids

The original Pi0 uses flow matching (a form of diffusion). Each inference step requires 10–50 denoising iterations, each a full forward pass through a 2B-parameter network. The result: inference latency of 300–600 ms on typical GPUs.

For a static robot arm, 300ms latency is acceptable — you wait an extra 0.3 seconds before the robot acts. For a humanoid that's actively walking, 300ms latency is catastrophic — the robot can lose balance in that window.

Pi0-FAST solves this by switching from diffusion to autoregressive generation. Instead of iterative denoising, Pi0-FAST predicts action tokens in a single pass — like an LLM generating text, but generating robot joint angles instead.

Pi0-FAST architecture in detail

Input:
  - Camera images (RGB, 224×224 or 448×448)
  - Text instruction ("Pick up the cup and walk to the table")
  - Robot state (joint positions, velocities)

↓

SigLIP Vision Tower (ViT-So400M)
  - Encodes images into visual tokens
  - Patch size: 16×16, 576 tokens/image

↓

PaliGemma Backbone (Gemma 2B)
  - Multimodal attention between visual + text + state tokens
  - KV-cache for optimized inference

↓

FAST Action Tokenizer
  - Normalize action chunk (quantile normalization)
  - DCT per dimension (same algorithm as JPEG, but for time series)
  - Quantize + prune insignificant DCT coefficients
  - Output: ~20 discrete tokens instead of 50×29 = 1,450 floats

↓

Autoregressive decoding (Gemma action head)
  - Generate FAST tokens one by one
  - Decode back to continuous joint angles

Output: Action chunk (50 steps × 29 DoF = 1,450 values)

FAST tokenization is the key innovation. By applying a Discrete Cosine Transform (DCT) — the same algorithm JPEG uses to compress images — to joint-angle time series, FAST compresses 1,450 floats into ~20 discrete tokens, roughly a 70× reduction. This doesn't just speed up inference; it also improves learning efficiency by drastically shortening the sequence the transformer must model.
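The compression idea can be sketched in a few lines of numpy. This is illustrative only — the real FAST tokenizer also applies quantile normalization and quantizes the retained coefficients into discrete tokens; `dct_matrix` and `compress_chunk` are made-up names:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis (the same transform JPEG uses).
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def compress_chunk(chunk: np.ndarray, keep: int = 10):
    """Compress a (T, D) action chunk by keeping only the `keep`
    lowest-frequency DCT coefficients per joint dimension."""
    T, _ = chunk.shape
    M = dct_matrix(T)
    coeffs = M @ chunk          # DCT along the time axis
    coeffs[keep:, :] = 0.0      # prune high-frequency coefficients
    recon = M.T @ coeffs        # inverse (orthonormal) DCT
    return coeffs[:keep, :], recon

# Smooth synthetic trajectories standing in for a 50-step, 29-DoF chunk.
t = np.linspace(0, 1, 50)
chunk = np.stack([np.sin(2 * np.pi * (i % 3 + 1) * t + i) for i in range(29)], axis=1)
kept, recon = compress_chunk(chunk, keep=10)
print(kept.shape, f"max reconstruction error: {np.abs(recon - chunk).max():.4f}")
```

Because joint trajectories are smooth, almost all of their energy sits in the low-frequency DCT coefficients, so 10 coefficients per dimension reconstruct the 50-step chunk almost perfectly.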

Inference speed comparison

| Policy | Latency (RTX 4090) | Control frequency | Suitable for WBC? |
|---|---|---|---|
| Pi0 (diffusion) | 300–600 ms | 1.5–3 Hz | No |
| Pi0-FAST (autoregressive) | 60–120 ms | 8–16 Hz | Yes |
| SmolVLA | 40–80 ms | 12–25 Hz | Yes |
| ACT | 20–40 ms | 25–50 Hz | Yes |

Pi0-FAST at 8–16 Hz is responsive enough for loco-manipulation when combined with Real-Time Chunking.
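The control-frequency column is simply the reciprocal of latency; a quick sanity check (`control_rate_hz` is a hypothetical helper, numbers taken from the table):

```python
def control_rate_hz(latency_ms: float) -> float:
    """Best-case replanning rate if every chunk had to wait for inference."""
    return 1000.0 / latency_ms

# Upper-bound latencies from the table above.
for name, ms in [("Pi0 (diffusion)", 600), ("Pi0-FAST", 120), ("ACT", 40)]:
    print(f"{name}: {control_rate_hz(ms):.1f} Hz")  # 1.7, 8.3, 25.0 Hz
```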


Real-Time Chunking: Solving High-Latency Inference

Even at 8 Hz you have 125 ms of latency — during which a 50 Hz low-level locomotion controller has already executed 6–7 control steps. This is where Real-Time Chunking (RTC) comes in.

How RTC works

Instead of waiting for inference to finish before executing actions, RTC lets the robot continue executing the current chunk while the next chunk is being inferred in the background. When the new chunk is ready, RTC doesn't hard-cut to it but smoothly blends between them:

Timeline (without RTC):
t=0   [Inference A: 120ms] → [Execute chunk A: 500ms] → [Inference B: 120ms] → ...

Timeline (with RTC):
t=0   [Inference A: 120ms]
t=120   [Execute A] + [Inference B: running in background]
t=240   [Blend A→B smoothly]
t=360   [Execute B] + [Inference C: running in background]
...

RTC adds a guidance term during chunk generation: it "pulls" the new chunk toward the steps already executed from the old chunk to ensure continuity. Without RTC you see jerky motion — every time the policy replans, the robot stutters. With RTC, motion is smooth, as if the policy were running at 25+ Hz.

For G1 whole-body control, RTC is especially critical for the locomotion component — losing balance control for even a few frames can cause a fall.
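The blending step can be sketched in numpy — a hand-rolled illustration of the idea, not LeRobot's actual RTC code (`blend_chunks` and its exponential weighting are assumptions):

```python
import numpy as np

def blend_chunks(old_chunk, new_chunk, start, horizon, decay=0.5):
    """Splice a freshly inferred chunk onto a partially executed one.
    `start` is how many steps of `old_chunk` were executed while the new
    chunk was being inferred; the old chunk's remaining steps are blended
    in with exponentially decaying weight, so there is no hard cut."""
    blended = new_chunk.copy()
    for i in range(min(horizon, len(new_chunk))):
        w = decay ** i              # weight on the old chunk: 1, 0.5, 0.25, ...
        j = start + i               # time-aligned index into the old chunk
        if j < len(old_chunk):
            blended[i] = w * old_chunk[j] + (1 - w) * new_chunk[i]
    return blended

# Handover between two 50-step, 29-DoF chunks after 10 executed steps.
old = np.zeros((50, 29))            # stand-in for the chunk being executed
new = np.ones((50, 29))             # stand-in for the freshly inferred chunk
smooth = blend_chunks(old, new, start=10, horizon=10)
```

At `i=0` the weight is 1, so the first action of the blended chunk matches the old trajectory exactly; by the end of the horizon the policy is fully on the new chunk.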


Environment Setup

System requirements

  • GPU: RTX 4080/4090 or A100 (for real-time Pi0-FAST inference)
  • RAM: 32 GB+ (G1 state observations are large)
  • OS: Ubuntu 22.04 (tested) or Ubuntu 20.04
  • Python: 3.12 (required by LeRobot v0.5.0)
  • CUDA: 12.1+

Installing LeRobot v0.5.0 with G1 support

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create Python 3.12 virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install LeRobot with G1 dependencies
pip install -e ".[unitree_g1]"

# If using HolosomaLocomotionController
pip install -e ".[unitree_g1,holoso]"

# If using GR00T
pip install -e ".[unitree_g1,groot]"

Install Unitree SDK

# Unitree SDK2 Python bindings
pip install unitree-sdk2py

# Verify robot connection (robot must be on same LAN)
python -c "from unitree_sdk2py.core.channel import ChannelFactory; print('SDK OK')"

Download pretrained Pi0-FAST checkpoint

python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='lerobot/pi0fast-base',
    local_dir='./checkpoints/pi0fast-base'
)
"

Teleoperation with Homunculus Exoskeleton

Homunculus is an open-source 7-DoF exoskeleton developed by the LeRobot community for G1 whole-body teleoperation. It tracks the operator's arm and hand movements and maps them to G1's arms and hands.

Homunculus exoskeleton for whole-body teleoperation

Setting up teleoperation

python lerobot/scripts/control_robot.py \
  --robot.type=unitree_g1 \
  --robot.variant=29dof \
  --robot.controller=HolosomaLocomotionController \
  --control.type=teleoperate \
  --control.fps=30 \
  --control.display_cameras=true

On first run, the script prompts you to calibrate the Homunculus — hold arms in neutral position to establish the reference frame. This takes about 2 minutes.

Collecting a whole-body dataset

Target task: teach G1 to "walk to the table, pick up a cup, carry it to the target location."

python lerobot/scripts/control_robot.py \
  --robot.type=unitree_g1 \
  --robot.variant=29dof \
  --robot.controller=HolosomaLocomotionController \
  --control.type=record \
  --control.fps=30 \
  --control.single_task="Walk to table, pick up cup, carry to target" \
  --control.repo_id="your_username/g1-cup-transport" \
  --control.num_episodes=100 \
  --control.push_to_hub=true

Data collection tips for WBC:

  • Episode length: Whole-body tasks run 30–60 seconds vs. 5–15 seconds for arm-only. Ensure your buffer is large enough.
  • Camera placement: Mount at least 2 cameras: 1 chest camera (egocentric view) and 1 external overhead camera.
  • Locomotion diversity: Vary locomotion style — straight walking, left/right turns, stop-and-wait — so the policy learns to generalize.
  • Minimum 50 episodes for single tasks, 100+ recommended for WBC due to the much larger action space.

Training Pi0-FAST for Unitree G1

Training configuration

Pi0-FAST requires training the FAST tokenizer first, then fine-tuning the main model. This is different from ACT or Diffusion Policy.

Step 1: Train FAST tokenizer on your dataset

python lerobot/scripts/train_fast_tokenizer.py \
  --dataset.repo_id="your_username/g1-cup-transport" \
  --tokenizer.vocab_size=512 \
  --tokenizer.chunk_size=50 \
  --tokenizer.dct_components=0.95 \
  --output_dir="./checkpoints/fast-tokenizer-g1"

Key parameters:

  • vocab_size=512: Number of discrete tokens. Increase to 1024 for complex action spaces (29-DoF WBC needs more capacity than arm-only)
  • chunk_size=50: Steps per action chunk. 50 ≈ 1.67 seconds at 30 Hz
  • dct_components=0.95: Retain DCT coefficients until 95% of variance is explained
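What `dct_components=0.95` means in practice can be illustrated with a small helper (`components_for_variance` is a hypothetical name; for a mean-removed signal, the energy of the DCT coefficients equals the signal's variance by Parseval's theorem):

```python
import numpy as np

def components_for_variance(dct_coeffs: np.ndarray, target: float = 0.95) -> int:
    """Smallest number of leading DCT coefficients whose cumulative
    energy reaches `target` of the total."""
    energy = np.cumsum(dct_coeffs ** 2) / np.sum(dct_coeffs ** 2)
    return int(np.searchsorted(energy, target)) + 1

# Energy concentrated in low frequencies, as for smooth joint trajectories.
coeffs = np.array([3.0, 2.0, 1.0, 0.5])
print(components_for_variance(coeffs, 0.95))  # → 3
```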

Step 2: Fine-tune Pi0-FAST

python lerobot/scripts/train.py \
  --policy.type=pi0fast \
  --policy.pretrained="./checkpoints/pi0fast-base" \
  --dataset.repo_id="your_username/g1-cup-transport" \
  --policy.fast_tokenizer_path="./checkpoints/fast-tokenizer-g1" \
  --training.batch_size=32 \
  --training.num_epochs=100 \
  --training.learning_rate=1e-4 \
  --training.use_amp=true \
  --output_dir="./checkpoints/pi0fast-g1-cup" \
  --wandb.enable=true

Training hardware requirements

| Config | GPU | VRAM | Time/epoch (100 eps) |
|---|---|---|---|
| Pi0-FAST full | A100 80GB | 60 GB | ~15 min |
| Pi0-FAST + PEFT/LoRA | RTX 4090 | 20 GB | ~25 min |
| Pi0-FAST + PEFT/LoRA | RTX 3090 | 18 GB | ~40 min |

No A100? Use PEFT/LoRA fine-tuning — train only adapter layers while keeping the backbone frozen:

python lerobot/scripts/train.py \
  --policy.type=pi0fast \
  --policy.pretrained="./checkpoints/pi0fast-base" \
  --training.peft.enabled=true \
  --training.peft.method=lora \
  --training.peft.rank=16 \
  --training.peft.alpha=32 \
  # ... other parameters as above

Inference with Real-Time Chunking

Deploying the policy on G1

python lerobot/scripts/control_robot.py \
  --robot.type=unitree_g1 \
  --robot.variant=29dof \
  --robot.controller=HolosomaLocomotionController \
  --control.type=inference \
  --policy.path="./checkpoints/pi0fast-g1-cup" \
  --control.fps=30 \
  --control.rtc.enabled=true \
  --control.rtc.execution_horizon=10 \
  --control.rtc.max_guidance_weight=2.0 \
  --control.task="Walk to table, pick up cup, carry to target"

Key RTC parameters

  • execution_horizon=10: Blend 10 steps from the new chunk with the old chunk. Increase if motion is jerky.
  • max_guidance_weight=2.0: Strength of the "pull" toward the old chunk. Higher = smoother but less reactive.
  • prefix_attention_schedule=exponential: How guidance weight decays over time.
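One plausible reading of these three knobs as code — purely illustrative; the names mirror the CLI flags, but LeRobot's internals may differ:

```python
import numpy as np

def prefix_guidance_schedule(execution_horizon: int = 10,
                             max_guidance_weight: float = 2.0) -> np.ndarray:
    """Exponentially decaying guidance weight across the blending window:
    the first steps of the new chunk are pulled hard toward the executed
    prefix of the old chunk; later steps are left free to react."""
    steps = np.arange(execution_horizon)
    return max_guidance_weight * 0.5 ** steps

weights = prefix_guidance_schedule()  # starts at 2.0, halves each step
```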

Results and Benchmarks

Based on results from the LeRobot team and community experiments on G1:

| Task | Policy | Success rate | Avg. completion time |
|---|---|---|---|
| Pick & place (standing) | Pi0-FAST + RTC | 87% | 8.2 s |
| Pick & place (standing) | ACT | 82% | 6.1 s |
| Loco-manip: cup transport | Pi0-FAST + RTC | 71% | 24.3 s |
| Loco-manip: cup transport | Pi0-FAST (no RTC) | 43% | 31.1 s |
| Door opening (walking) | Pi0-FAST + RTC | 58% | 18.7 s |

Key takeaways:

  • RTC boosts loco-manipulation success from 43% → 71% — the largest impact is on tasks combining locomotion with manipulation
  • ACT is faster for simple pick-and-place, but Pi0-FAST generalizes better to unseen tasks
  • 100 episodes is the sweet spot for WBC — more data still helps but with diminishing returns

Common Pitfalls

1. Robot loses balance during policy replanning

  • Cause: Latency too high, or RTC not enabled
  • Fix: Enable --control.rtc.enabled=true and increase execution_horizon

2. Training loss doesn't decrease

  • Cause: FAST tokenizer not trained on your dataset — using default arm-only tokenizer
  • Fix: Train a custom tokenizer first with train_fast_tokenizer.py

3. Locomotion controller won't connect

  • Cause: Unitree SDK can't discover robot IP
  • Fix: Set export ROBOT_IP=192.168.123.1 and verify LAN connectivity

4. OOM when training full Pi0-FAST

  • Fix: Add --training.gradient_checkpointing=true and --training.peft.enabled=true

Conclusion

LeRobot v0.5.0 + Pi0-FAST + Unitree G1 is the most capable open-source stack available today for whole-body loco-manipulation. The pipeline isn't without challenges — complex setup, high GPU requirements, and the need for at least 100 demonstration episodes — but this is the first time you can do all of this without paying for a proprietary framework.

The natural next step is tackling more complex tasks: carrying objects up stairs, opening a refrigerator, or interacting with humans in dynamic environments. The framework is ready — what remains is data collection and experimentation.



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

