
GR00T N1 + G1 (Post 3): fine-tuning GR00T N1 — GPU, config, training script

Step-by-step guide to fine-tune GR00T N1 (2B params) with a LeRobot dataset from G1 — GPU requirements, training config, how to avoid overfitting, and evaluating checkpoints before deployment.

Nguyễn Anh Tuấn · June 4, 2026 · 5 min read

This is post 3 of the GR00T N1 + Unitree G1 series. The previous post produced a LeRobot dataset; this one fine-tunes the 2B-parameter GR00T N1 model on that dataset.

GR00T N1 is a pretrained foundation model — it already knows many general tasks. Fine-tuning only teaches it your specific task on your specific robot. With 50-100 demos, training typically takes 2-4 hours on an RTX 4090.

GPU requirements in practice

| Setup | VRAM | Batch size | Training time (50 demos) |
|---|---|---|---|
| RTX 4090 | 24GB | 8 | ~3 hours |
| A100 40GB | 40GB | 16 | ~1.5 hours |
| A100 80GB | 80GB | 32 | ~45 minutes |
| 2× RTX 4090 | 48GB | 16 | ~1.5 hours (DDP) |

Minimum: RTX 4090 (24GB). With less than 24GB of VRAM, the default config will run out of memory (OOM). Gradient checkpointing reduces VRAM usage, but training will be ~40% slower.
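As a quick way to read the table above, here is a minimal sketch that maps per-GPU VRAM to the batch size listed for that tier. The function name `suggest_batch_size` is made up for illustration and is not part of the GR00T repo; the numbers are this post's figures for the default config, not measured limits.

```python
from typing import Optional

# (minimum VRAM in GB, batch size) pairs copied from the table above,
# ordered from the largest tier down. Values are per GPU.
BATCH_SIZE_BY_VRAM_GB = [(80, 32), (40, 16), (24, 8)]


def suggest_batch_size(vram_gb: float) -> Optional[int]:
    """Return the table's batch size for the largest tier that fits.

    Returns None below 24 GB, where the default config is expected to OOM
    unless gradient checkpointing or a smaller batch is used.
    """
    for min_vram_gb, batch_size in BATCH_SIZE_BY_VRAM_GB:
        if vram_gb >= min_vram_gb:
            return batch_size
    return None


if __name__ == "__main__":
    for vram in (16, 24, 40, 80):
        print(vram, "GB ->", suggest_batch_size(vram))
```

To feed it a real value, `torch.cuda.get_device_properties(0).total_memory / 1024**3` gives the card's VRAM in GiB. The reported total can come out slightly below the nominal size, so round it to the nearest GB first.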

Training environment setup

git clone https://github.com/NVIDIA/Isaac-GR00T.git
cd Isaac-GR00T

# Create conda environment
conda create -n groot python=3.10
conda activate groot

# Install dependencies
pip install -e ".[train]"

# Verify CUDA + torch
python -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.version.cuda)"

# Download GR00T N1 pretrained checkpoint
python scripts/download_model.py --model GR00T-N1-2B
# checkpoint will be at: ./checkpoints/GR00T-N1-2B/

Config file

GR00T uses YAML config. Create a config file for your task:

# configs/finetune_g1_pickplace.yaml

# Model
model:
  name: "GR00T-N1-2B"
  checkpoint_path: "./checkpoints/GR00T-N1-2B"
  freeze_vision_encoder: false    # true for faster training, false for better accuracy
  freeze_language_model: true     # ALWAYS freeze LM to save VRAM

# Dataset
dataset:
  path: "./data/g1_pickplace_lerobot"
  robot: "g1"
  cameras:
    - "observation.images.left_wrist"
    - "observation.images.right_wrist"
    - "observation.images.head"      # omit if no head camera
  action_keys:
    - "action/left_ee_pose"     # 7-dim: xyz + quat
    - "action/right_ee_pose"    # 7-dim
    - "action/gripper"          # 2-dim: left, right
  
  # Train/val split
  train_split: 0.9
  val_split: 0.1

# Training hyperparams
training:
  batch_size: 8                 # reduce to 4 if OOM
  learning_rate: 1.0e-4
  lr_scheduler: "cosine"
  warmup_steps: 100
  num_epochs: 100
  gradient_clip: 1.0
  
  # Gradient checkpointing — enable if VRAM < 24GB
  gradient_checkpointing: false
  
  # Mixed precision
  bf16: true                    # supported on A100 and RTX 4090 (Ampere or newer); use fp16 on older GPUs
  fp16: false

# Action head — flow matching
action_head:
  num_timesteps: 100           # denoising steps (10 for fast inference)
  action_chunk_size: 16        # predict 16 steps ahead
  
# Logging
output_dir: "./runs/g1_pickplace"
log_every: 10                  # log loss every 10 steps
eval_every: 500                # evaluate on val set every 500 steps
save_every: 500                # save checkpoint every 500 steps
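Before launching a multi-hour run, it can help to sanity-check the config for contradictions. The sketch below is a hypothetical helper, not something shipped with the GR00T repo; it works on a plain dict that mirrors the YAML above and only checks relationships visible in this post (the splits sum to 1, only one mixed-precision mode is on, the batch size is positive).

```python
def check_finetune_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    dataset = cfg.get("dataset", {})
    training = cfg.get("training", {})

    # train_split and val_split should cover the whole dataset.
    split_sum = dataset.get("train_split", 0) + dataset.get("val_split", 0)
    if abs(split_sum - 1.0) > 1e-9:
        problems.append(f"train_split + val_split = {split_sum}, expected 1.0")

    # bf16 and fp16 are alternatives; enabling both is contradictory.
    if training.get("bf16") and training.get("fp16"):
        problems.append("bf16 and fp16 are both enabled; pick one")

    if training.get("batch_size", 0) < 1:
        problems.append("batch_size must be at least 1")

    return problems


if __name__ == "__main__":
    cfg = {
        "dataset": {"train_split": 0.9, "val_split": 0.1},
        "training": {"batch_size": 8, "bf16": True, "fp16": False},
    }
    print(check_finetune_config(cfg))
```

If PyYAML is installed, the dict can be obtained with `yaml.safe_load(open("configs/finetune_g1_pickplace.yaml"))`.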

Running training

# Single GPU
python scripts/finetune.py \
  --config configs/finetune_g1_pickplace.yaml

# Multi-GPU (2× RTX 4090)
torchrun --nproc_per_node=2 scripts/finetune.py \
  --config configs/finetune_g1_pickplace.yaml

# View training logs
tensorboard --logdir ./runs/g1_pickplace/tensorboard

Terminal output:

Epoch 1/100 | Step 50/500 | Loss: 0.342 | Action loss: 0.298 | LR: 8.3e-5
Epoch 1/100 | Step 100/500 | Loss: 0.241 | Action loss: 0.198 | LR: 1.0e-4
...
[Eval] Step 500 | Val loss: 0.187 | Saved checkpoint: ./runs/g1_pickplace/checkpoint_500/
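The LR column reflects the `warmup_steps` and `lr_scheduler: "cosine"` settings from the config. The sketch below shows one common form of that schedule, under two assumptions: the warmup is linear, and the total step count is 500 steps per epoch × 100 epochs, as suggested by the sample log. The post does not state how the training script actually shapes its warmup, so the numbers need not match the log above.

```python
import math


def lr_at_step(step, peak_lr=1.0e-4, warmup_steps=100, total_steps=50_000):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: progress runs from 0 at the end of warmup to 1 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 50, 100, 25_000, 50_000):
        print(step, lr_at_step(step))
```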

Monitoring training — when to stop?

Normal loss curve:

Train loss:   0.35 → 0.20 → 0.12 → 0.09 → 0.07 (steady decrease)
Val loss:     0.33 → 0.19 → 0.13 → 0.10 → 0.08 (tracks train closely)

Overfitting signs — stop early:

Train loss:   0.35 → 0.15 → 0.08 → 0.04 → 0.02 (drops very fast)
Val loss:     0.33 → 0.18 → 0.16 → 0.18 → 0.22 (starts going back up)

When val loss increases for 3 consecutive evals, stop training and use the checkpoint saved just before the increase began (early stopping).
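The stopping rule can be sketched in a few lines. This is an illustrative helper with a made-up name (`early_stop_index`), not the training script's own logic: it scans the val losses in eval order, stops after `patience` consecutive rises, and returns the index of the lowest val loss seen up to that point.

```python
def early_stop_index(val_losses, patience=3):
    """Return the index of the eval to roll back to, or None to keep training."""
    rises = 0
    for i in range(1, len(val_losses)):
        # Count consecutive evals where val loss went up; reset on any drop.
        if val_losses[i] > val_losses[i - 1]:
            rises += 1
        else:
            rises = 0
        if rises >= patience:
            # Roll back to the best eval seen so far.
            return min(range(i + 1), key=lambda j: val_losses[j])
    return None


if __name__ == "__main__":
    # The overfitting curve above, extended by one more rising eval.
    print(early_stop_index([0.33, 0.18, 0.16, 0.18, 0.22, 0.25]))
```

With `eval_every: 500` and the first eval at step 500, index `k` corresponds to the checkpoint saved at step `(k + 1) * 500`.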

Adapting config for other robots

Change robot: "g1" and action_keys to match your robot:

# Example for GR1 (Fourier Intelligence)
dataset:
  robot: "gr1"
  cameras:
    - "observation.images.left_wrist"
    - "observation.images.right_wrist"
  action_keys:
    - "action/left_ee_pose"
    - "action/right_ee_pose"
    - "action/gripper"

# Example for single robot arm (not a humanoid)
dataset:
  robot: "franka"
  cameras:
    - "observation.images.wrist"
    - "observation.images.overhead"
  action_keys:
    - "action/ee_pose"     # 7-dim
    - "action/gripper"     # 1-dim

GR00T N1 supports variable action dimensions — just declare the correct action_keys.
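To make the dimension bookkeeping concrete, the sketch below sums the per-key dimensions given in the config comments of this post. The dictionaries are hand-written for illustration; how GR00T itself reads these dimensions from the dataset is not covered here.

```python
# Per-key action dimensions, taken from the config comments in this post.
G1_ACTION_DIMS = {
    "action/left_ee_pose": 7,   # xyz + quat
    "action/right_ee_pose": 7,
    "action/gripper": 2,        # left, right
}

FRANKA_ACTION_DIMS = {
    "action/ee_pose": 7,
    "action/gripper": 1,
}


def total_action_dim(action_dims: dict) -> int:
    """Sum the dimensions of all declared action keys."""
    return sum(action_dims.values())


if __name__ == "__main__":
    print("G1:", total_action_dim(G1_ACTION_DIMS))
    print("Franka:", total_action_dim(FRANKA_ACTION_DIMS))
```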

Evaluating checkpoint in sim

Before deploying to a real robot, always evaluate in sim:

# Rollout checkpoint in Isaac Lab
python scripts/evaluate_checkpoint.py \
  --checkpoint ./runs/g1_pickplace/checkpoint_best/ \
  --robot g1 \
  --task PickPlace \
  --num_episodes 20 \
  --render         # show Isaac Sim GUI

# Output:
# Episode 0: SUCCESS (12.3s)
# Episode 1: SUCCESS (11.8s)
# Episode 2: FAIL — gripper missed object at step 45
# ...
# Success rate: 17/20 = 85%

Threshold before deploying to a real robot: ≥ 80% success rate in sim for simple tasks (pick-place). Complex tasks can be lower but must show an improving trend.
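If you save the evaluation output to a file, the threshold check is easy to automate. The sketch below assumes the log lines look like the sample output above (`Episode N: SUCCESS` or `Episode N: FAIL`); the function names are hypothetical and the regex may need adjusting to the script's real output.

```python
import re

# Matches lines such as "Episode 2: FAIL" or "# Episode 0: SUCCESS (12.3s)".
EPISODE_RE = re.compile(r"Episode\s+(\d+):\s+(SUCCESS|FAIL)")


def success_rate(log_text: str) -> float:
    """Fraction of episodes marked SUCCESS in the evaluation log."""
    results = [m.group(2) == "SUCCESS" for m in EPISODE_RE.finditer(log_text)]
    if not results:
        raise ValueError("no episode lines found in log")
    return sum(results) / len(results)


def ready_for_real_robot(log_text: str, threshold: float = 0.80) -> bool:
    """Apply the >= 80% rule of thumb for simple tasks."""
    return success_rate(log_text) >= threshold


if __name__ == "__main__":
    sample = "\n".join(
        [f"Episode {i}: SUCCESS (12.0s)" for i in range(17)]
        + [f"Episode {i}: FAIL" for i in range(17, 20)]
    )
    print(success_rate(sample), ready_for_real_robot(sample))
```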

Common troubleshooting

OOM (Out of Memory):

# Reduce batch size
training:
  batch_size: 4   # from 8 down to 4

# Or enable gradient checkpointing
  gradient_checkpointing: true

# Or freeze vision encoder
model:
  freeze_vision_encoder: true

Loss not decreasing after 20 epochs:

# Try increasing learning rate
training:
  learning_rate: 3.0e-4   # from 1e-4 to 3e-4

# Or unfreeze vision encoder
model:
  freeze_vision_encoder: false

Good val loss but poor sim performance:

This points to a dataset quality issue. Review the demos: Are there failed demos in the dataset? Is the object position varied enough? Is the camera calibration correct?

checkpoint_best vs checkpoint_last

The script automatically saves checkpoint_best (lowest val loss) and checkpoint_last (final epoch).

ls ./runs/g1_pickplace/
# checkpoint_500/   checkpoint_1000/  checkpoint_best/  checkpoint_last/

# Always use checkpoint_best for deploy
python scripts/evaluate_checkpoint.py \
  --checkpoint ./runs/g1_pickplace/checkpoint_best/

Always use checkpoint_best — not checkpoint_last. More training epochs don't always mean better.


Next: Deploy GR00T-WBC on real G1 — GEAR + SONIC.

