
Fine-tuning NVIDIA GR00T N1 Guide

Step-by-step guide to fine-tune the GR00T N1 VLA model for humanoid robots using Isaac Lab and AGIBOT World data — from setup to inference.

Nguyễn Anh Tuấn · April 12, 2026 · 10 min read

If you've been following the humanoid robotics space, you've likely heard of GR00T N1 — NVIDIA's first open foundation model purpose-built for humanoid robots. This isn't just another research paper — NVIDIA has fully open-sourced everything from model weights to the fine-tuning pipeline, making it possible for anyone to adapt and deploy on real robots.

In this guide, I'll walk you through every step of fine-tuning GR00T N1 using Isaac Lab and AGIBOT World data — from understanding the architecture, setting up the environment, preparing data, to training and inference.

What Is GR00T N1?

GR00T N1 (Generalist Robot 00 Technology) is a Vision-Language-Action (VLA) model announced by NVIDIA at GTC 2025. Unlike previous VLA models designed for specific robots, GR00T N1 is cross-embodiment — a single model that works across different robot types, from single-arm manipulators to full humanoid robots.

Original paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots — NVIDIA Research, 2025.

Why Is GR00T N1 Special?

  • Cross-embodiment: One model runs on WidowX, Google Robot, Fourier GR-1, Unitree G1, and more
  • Fully open: Model weights, training code, fine-tuning pipeline — all open-source
  • Data-efficient: With only 10% of training data, GR00T N1 nearly matches Diffusion Policy using 100% data
  • Real-time inference: 22-27 Hz on RTX 4090, fast enough for real robot control

If you're familiar with other VLA models like RT-2 or Octo, GR00T N1 represents a major leap in generalization capability.

Dual-System Architecture

The most innovative aspect of GR00T N1 is its dual-system architecture, inspired by Daniel Kahneman's cognitive theory (Thinking, Fast and Slow):

GR00T N1 dual-system architecture — System 2 processes language and vision, System 1 generates fast actions

System 2 — Vision-Language Module (Slow, Deliberative)

  • Backbone: Eagle-2 VLM = SigLIP-2 (vision encoder) + SmolLM2 (language model)
  • Parameters: 1.34B
  • Speed: ~10 Hz
  • Role: Takes camera images + language instructions → understands context and task goals
  • Key insight: Features are extracted from the 12th (middle) layer rather than the final layer — this is both faster and produces better downstream performance (verified via ablation study)

System 1 — Diffusion Transformer Action Head (Fast, Reactive)

  • Architecture: DiT (Diffusion Transformer) with action flow-matching
  • Speed: ~120 Hz (internal), outputs chunks of 16 timesteps
  • Denoising steps: 4 steps using Forward Euler integration
  • Cross-attention: Connects to System 2's output for context understanding
  • Embodiment MLP: Each robot type has its own MLP for encoding/decoding state and actions — this is the key to cross-embodiment support

Total model: GR00T-N1-2B has 2.2B parameters (1.34B VLM + rest for DiT and embodiment MLPs).

If you've read about Diffusion Policy, System 1 in GR00T N1 is an upgraded version — instead of U-Net, it uses a Transformer with more flexible attention mechanisms.
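To make System 1's denoising loop concrete, here is a minimal sketch of 4-step forward Euler integration under flow matching. The velocity field below is a toy stand-in for the DiT (the real network is conditioned on System 2's features and the robot state); only the integration scheme follows the description above.

```python
import numpy as np

def toy_velocity_field(a, t, target):
    """Toy stand-in for the DiT: for a linear probability path, the
    ground-truth velocity points from the current sample at the target."""
    return (target - a) / (1.0 - t)

def euler_denoise(target, num_steps=4, seed=0):
    """Integrate Gaussian noise toward an action chunk with forward
    Euler, mirroring System 1's 4-step denoising schedule."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(target.shape)  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so 1 - t never hits zero
        a = a + dt * toy_velocity_field(a, t, target)
    return a

# Denoise a 16-timestep, 7-DoF action chunk (the chunk size System 1 uses).
chunk = euler_denoise(np.full((16, 7), 0.5))
```

For this linear toy field the final Euler step lands exactly on the target, which gives some intuition for why so few denoising steps can suffice.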

Training Data — The 4-Layer Pyramid

GR00T N1 is trained on four types of data, arranged in a pyramid structure:

Layer 1: Real Robot Data (Highest Quality)

  • GR00T N1 Humanoid dataset: 88 hours of teleoperation on Fourier GR-1 using VIVE trackers + Xsens gloves
  • Open X-Embodiment: RT-1, Bridge-v2, DROID, RoboSet, and more
  • AgiBot-Alpha: 140,000 trajectories from 100 robots

Layer 2: Human Videos

  • Ego4D, EPIC-KITCHENS, Assembly-101 — videos of humans manipulating objects
  • No motor commands available → uses VQ-VAE to learn a latent action space

Layer 3: Simulation Data

  • DexMimicGen: 780,000 trajectories (equivalent to 6,500 hours) generated in just 11 hours on Isaac Sim

Layer 4: Neural Trajectories (AI-Generated)

  • From 88 hours of real data → generated 827 hours of video using image-to-video models
  • Result: 40% performance boost compared to using real data alone

AGIBOT World Dataset

AGIBOT World is one of the largest robot learning datasets available today, created by AgiBot (China):

  • AGIBOT World Beta: 1M+ trajectories, 2,976 hours, 217 tasks, 87 skills
  • 3,000+ objects, 100+ real-world scenarios
  • 5 domains: manipulation, tool use, multi-robot collaboration, and more
  • Finalist for IROS 2025 Best Paper Award

GR00T N1 uses AgiBot-Alpha (the earlier version, 140K trajectories) as one of its primary training data sources.

GitHub: OpenDriveLab/AgiBot-World
HuggingFace: agibot-world/AgiBotWorld-Alpha

Environment Setup

Hardware Requirements

| Purpose | Minimum GPU |
| --- | --- |
| Fine-tuning | 1x RTX A6000 or RTX 4090 (24GB VRAM) |
| Inference | RTX 4090 (44ms, ~23 Hz) or Jetson AGX Orin |
| Pre-training | 1024x H100 (not feasible for individuals) |

Step 1: Clone the Repository

git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T

Step 2: Install Dependencies

GR00T uses uv — an extremely fast Python package manager:

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies for desktop GPU
bash scripts/deployment/dgpu/install_deps.sh

# Sync Python environment
uv sync && uv pip install -e .

Step 3: Download Model Weights

Model weights are available on HuggingFace:

# Install huggingface-cli if needed
pip install huggingface_hub

# Download model (~8GB)
huggingface-cli download nvidia/GR00T-N1.6-3B

Available models:

| Model | Parameters | Description |
| --- | --- | --- |
| GR00T-N1.6-3B | 3B | Latest base model |
| GR00T-N1.6-bridge | 3B | Pre-finetuned for WidowX |
| GR00T-N1.6-G1 | 3B | Pre-finetuned for Unitree G1 |
| GR00T-N1.6-BEHAVIOR1k | 3B | Pre-finetuned for Galaxea R1 Pro |
| GR00T-N1.6-DROID | 3B | Pre-finetuned on DROID dataset |

Data Preparation

GR00T N1 uses the LeRobot v2 data format. If you've worked with the LeRobot framework before, this process will look familiar.

Data Format

Each trajectory requires:

{
    "observation": {
        "image": np.array,          # (H, W, 3) RGB image
        "state": np.array,          # Robot state (joints, gripper, etc.)
    },
    "action": np.array,             # Target action
    "language_instruction": str,    # Task description in natural language
}
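Before converting a full dataset, it can help to assemble a single frame in this schema and check the shapes. The helper below is a hypothetical sketch, not part of the Isaac-GR00T API; the 7-joint-plus-gripper dimensions match the modality config used later in this guide.

```python
import numpy as np

def make_frame(image, joint_state, gripper_state, action, instruction):
    """Assemble one trajectory frame in the schema shown above.
    All values here are dummies; on a real robot they would come
    from the camera driver and joint encoders."""
    state = np.concatenate([joint_state, gripper_state]).astype(np.float32)
    return {
        "observation": {
            "image": image,   # (H, W, 3) uint8 RGB
            "state": state,   # joints + gripper, float32
        },
        "action": np.asarray(action, dtype=np.float32),
        "language_instruction": instruction,
    }

frame = make_frame(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    joint_state=np.zeros(7),
    gripper_state=np.zeros(1),
    action=np.zeros(8),
    instruction="pick up the red cup",
)
```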

Using AGIBOT World

# Download AGIBOT World Alpha dataset
huggingface-cli download agibot-world/AgiBotWorld-Alpha \
    --local-dir ./data/agibot-alpha \
    --repo-type dataset

# Convert to LeRobot v2 format (script included in repo)
python scripts/data/convert_agibot_to_lerobot.py \
    --input-dir ./data/agibot-alpha \
    --output-dir ./data/agibot-lerobot

Creating a Modality Config

Each robot type needs a modality config describing its state/action structure:

# config/my_robot_modality.yaml
state:
  joint_positions:
    dim: 7          # Degrees of freedom
    normalize: true
  gripper:
    dim: 1
    normalize: true

action:
  joint_positions:
    dim: 7
    normalize: true
  gripper:
    dim: 1
    normalize: true

video:
  cameras:
    - name: front_camera
      resolution: [224, 224]

Fine-Tuning

This is the most important part. Fine-tuning GR00T N1 can be done on a single GPU.

Fine-tuning a VLA model on GPU — adapting the model for a specific robot

Fine-Tuning Strategy

During fine-tuning:

  • Frozen: Language component of VLM backbone
  • Fine-tuned: Vision encoder, state/action encoders, DiT action head
  • Batch size: Up to 200 (adapter-only tuning), up to 16 (vision encoder tuning)

Fine-Tuning Command

CUDA_VISIBLE_DEVICES=0 uv run python gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path ./data/agibot-lerobot \
    --embodiment-tag new_humanoid \
    --modality-config-path config/my_robot_modality.yaml \
    --num-gpus 1 \
    --max-steps 2000 \
    --global-batch-size 32 \
    --learning-rate 1e-4 \
    --output-dir ./checkpoints/my_finetune

Key Hyperparameters

| Parameter | Recommended Value | Notes |
| --- | --- | --- |
| max-steps | 2000-5000 | Start with 2000, increase if needed |
| global-batch-size | 16-32 | Depends on VRAM |
| learning-rate | 1e-4 | For adapter-only; 1e-5 if tuning the vision encoder |
| warmup-steps | 100-200 | Learning-rate warmup |
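For intuition, the warmup-steps setting corresponds to a schedule shaped like the sketch below: a linear ramp up to the base learning rate over the warmup window. The constant tail is an assumption for illustration; the actual scheduler in Isaac-GR00T may decay after warmup.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=200):
    """Linear learning-rate warmup, then hold at base_lr.
    The post-warmup behavior is illustrative; the real scheduler
    may decay (e.g. cosine) instead of staying constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Halfway through warmup the LR is half the base value.
mid_lr = lr_at_step(99)
```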

Fine-Tuning Tips

  1. Start small: Fine-tune on 100-200 demos first, check results, then scale up
  2. Monitor loss: Loss should decrease steadily in the first 500 steps. If not → reduce learning rate
  3. Overfitting: GR00T N1.6 has 32 DiT layers (double N1.5), making it more prone to overfitting → ensure good regularization
  4. Mixed precision: Defaults to bf16, keep it if your GPU supports it
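Tip 2 can be turned into a small automatic check on your logged losses. This helper is a hypothetical sketch; it simply compares the average loss at the start and end of the log.

```python
def loss_trending_down(losses, window=50):
    """Return True if the mean of the last `window` losses is below
    the mean of the first `window`. If this is still False after
    ~500 steps, try lowering the learning rate."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last < first

# Synthetic example: a noisy but steadily decreasing loss curve.
fake_losses = [1.0 - 0.001 * i + 0.01 * ((-1) ** i) for i in range(500)]
```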

Inference — Deploying the Model

GR00T N1 uses a server-client architecture for inference:

Starting the Inference Server

# Launch server with fine-tuned model
uv run python gr00t/policy/serve_policy.py \
    --model-path ./checkpoints/my_finetune \
    --port 5555

Client Code (On the Robot)

from gr00t.policy.server_client import PolicyClient
import numpy as np

# Connect to server
policy = PolicyClient(host="localhost", port=5555)

# Control loop
while True:
    # Get observation from robot
    obs = {
        "image": camera.get_frame(),         # (224, 224, 3)
        "state": robot.get_joint_positions(), # (7,)
        "language_instruction": "pick up the red cup"
    }

    # Predict action
    action, info = policy.get_action(obs)

    # Send action to robot
    robot.execute(action)
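Because System 1 outputs 16-timestep action chunks while a server round-trip takes ~44 ms on an RTX 4090, a common deployment pattern is receding-horizon execution: run the first part of each chunk, then request a fresh one. The sketch below uses stub functions in place of PolicyClient and the robot driver, and the 8-step split is an illustrative choice, not a value from the repo.

```python
from collections import deque

def run_chunked_control(get_chunk, execute, num_replans=3,
                        chunk_len=16, steps_per_chunk=8):
    """Receding-horizon control: fetch an action chunk, execute its
    first `steps_per_chunk` actions, then replan with fresh context.
    `get_chunk` and `execute` stand in for the policy server and
    the robot driver."""
    executed = []
    for _ in range(num_replans):
        chunk = deque(get_chunk())
        assert len(chunk) == chunk_len
        for _ in range(steps_per_chunk):
            executed.append(chunk.popleft())
            execute(executed[-1])
    return executed

# Stub policy: the i-th chunk is the value i repeated 16 times.
calls = {"i": 0}
def fake_policy():
    calls["i"] += 1
    return [calls["i"]] * 16

history = run_chunked_control(fake_policy, execute=lambda a: None)
```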

Inference Performance

| GPU | Latency | Frequency |
| --- | --- | --- |
| RTX 5090 | 37ms | 27.3 Hz |
| H100 | 38ms | 26.3 Hz |
| RTX 4090 | 44ms | 22.8 Hz |
| DGX Spark | 89ms | 11.2 Hz |
| Jetson Thor | 105ms | 9.5 Hz |

Benchmark Results

GR00T N1 significantly outperforms baselines across most benchmarks:

Simulation (100 demos per task)

| Benchmark | BC Transformer | Diffusion Policy | GR00T N1 |
| --- | --- | --- | --- |
| RoboCasa | 26.3% | 25.6% | 32.1% |
| DexMimicGen | 53.9% | 56.1% | 66.5% |
| GR-1 Humanoid | 16.1% | 32.7% | 50.0% |
| Average | 26.4% | 33.4% | 45.0% |

Real-World on Fourier GR-1

| Task | Diffusion Policy (100% data) | GR00T N1 (100% data) |
| --- | --- | --- |
| Pick-and-Place | 36.0% | 82.0% |
| Articulated Objects | 38.6% | 70.9% |
| Industrial Tasks | 61.0% | 70.0% |
| Bimanual | 62.5% | 82.5% |
| Average | 46.4% | 76.8% |

Key finding: GR00T N1 with only 10% of data (42.6%) nearly matches Diffusion Policy using 100% of data (46.4%). This demonstrates the power of cross-embodiment pre-training.

Isaac Lab — Its Role in the Pipeline

Isaac Lab is NVIDIA's robot learning framework built on Isaac Sim (Omniverse). In the GR00T N1 pipeline, Isaac Lab serves as:

  1. Simulation environment: Creates simulation environments for policy evaluation before deploying on real hardware
  2. Data generation: DexMimicGen (built on Isaac Sim) generated 780K trajectories for training
  3. Benchmarking: RoboCasa, DexMimicGen tasks, and GR-1 benchmarks all run on Isaac Lab
  4. Sim-to-real: Complete pipeline from training → sim evaluation → hardware deployment

If you're interested in simulation for robotics, Isaac Lab is an essential tool for the GR00T workflow.

Evolution: N1 → N1.5 → N1.6

Since announcing N1 (March 2025), NVIDIA has rapidly iterated:

N1.5:

  • VLM upgraded to Eagle 2.5 with better grounding capabilities
  • Added FLARE — aligns the model with target future embeddings
  • Language following: 46.6% → 93.3% (roughly a 2× improvement)

N1.6 (Latest version):

  • VLM switched to NVIDIA Cosmos-2B with flexible resolution
  • DiT doubled: 32 layers (vs. 16 in N1.5)
  • Faster convergence, smoother actions
  • Requires more careful fine-tuning (easier to overfit due to larger model)

Complete Workflow: Zero to Inference

Here's the complete 6-step pipeline:

  1. Setup: Clone Isaac-GR00T, install dependencies, download model weights
  2. Data collection: Teleoperation or download AGIBOT World
  3. Data preparation: Convert to LeRobot v2 format, create modality config
  4. Fine-tune: Run launch_finetune.py on 1x RTX 4090
  5. Evaluate: Test in Isaac Lab simulation
  6. Deploy: Run server-client inference on real robot

The entire fine-tuning process (2000 steps, 200 demos) takes approximately 2-4 hours on an RTX 4090 — entirely feasible for individual researchers.

Conclusion

GR00T N1 marks a turning point in humanoid robotics: the first time a powerful, cross-embodiment foundation model has been fully opened to the community. Combined with AGIBOT World data and the Isaac Lab environment, anyone with an RTX 4090 can start fine-tuning VLA models for their robots.

If you're building robot systems and want to leverage the power of foundation models, GR00T N1 is the best starting point available today.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
