If you've been following the humanoid robotics space, you've likely heard of GR00T N1 — NVIDIA's first open foundation model purpose-built for humanoid robots. This isn't just another research paper — NVIDIA has fully open-sourced everything from model weights to the fine-tuning pipeline, making it possible for anyone to adapt and deploy on real robots.
In this guide, I'll walk you through every step of fine-tuning GR00T N1 using Isaac Lab and AGIBOT World data — from understanding the architecture, setting up the environment, preparing data, to training and inference.
What Is GR00T N1?
GR00T N1 (Generalist Robot 00 Technology) is a Vision-Language-Action (VLA) model announced by NVIDIA at GTC 2025. Unlike previous VLA models designed for specific robots, GR00T N1 is cross-embodiment — a single model that works across different robot types, from single-arm manipulators to full humanoid robots.
Original paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots — NVIDIA Research, 2025.
Why Is GR00T N1 Special?
- Cross-embodiment: One model runs on WidowX, Google Robot, Fourier GR-1, Unitree G1, and more
- Fully open: Model weights, training code, fine-tuning pipeline — all open-source
- Data-efficient: With only 10% of training data, GR00T N1 nearly matches Diffusion Policy using 100% data
- Real-time inference: ~23 Hz on an RTX 4090 (22-27 Hz across high-end GPUs), fast enough for real robot control
If you're familiar with other VLA models like RT-2 or Octo, GR00T N1 represents a major leap in generalization capability.
Dual-System Architecture
The most innovative aspect of GR00T N1 is its dual-system architecture, inspired by Daniel Kahneman's cognitive theory (Thinking, Fast and Slow):
System 2 — Vision-Language Module (Slow, Deliberative)
- Backbone: Eagle-2 VLM = SigLIP-2 (vision encoder) + SmolLM2 (language model)
- Parameters: 1.34B
- Speed: ~10 Hz
- Role: Takes camera images + language instructions → understands context and task goals
- Key insight: Features are extracted from the 12th (middle) layer rather than the final layer — this is both faster and produces better downstream performance (verified via ablation study)
System 1 — Diffusion Transformer Action Head (Fast, Reactive)
- Architecture: DiT (Diffusion Transformer) with action flow-matching
- Speed: ~120 Hz (internal), outputs chunks of 16 timesteps
- Denoising steps: 4 steps using Forward Euler integration
- Cross-attention: Connects to System 2's output for context understanding
- Embodiment MLP: Each robot type has its own MLP for encoding/decoding state and actions — this is the key to cross-embodiment support
Total model: GR00T-N1-2B has 2.2B parameters (1.34B VLM + rest for DiT and embodiment MLPs).
If you've read about Diffusion Policy, System 1 in GR00T N1 is an upgraded version — instead of U-Net, it uses a Transformer with more flexible attention mechanisms.
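To make the dual-system flow concrete, here is a minimal NumPy sketch of System 1's denoising loop: flow matching integrated from noise to an action chunk with 4 forward Euler steps. All names and shapes are illustrative, not the actual Isaac-GR00T API; the `velocity_field` stub stands in for the real DiT, which conditions on System 2's features via cross-attention.

```python
import numpy as np

# Illustrative shapes only -- not the actual Isaac-GR00T API.
ACTION_DIM, CHUNK = 7, 16   # 7-DoF action, 16-timestep chunk
NUM_STEPS = 4               # denoising steps (forward Euler)

def velocity_field(actions, t, vlm_features):
    """Stand-in for the DiT: predicts the flow-matching velocity toward
    the data distribution, conditioned (via cross-attention in the real
    model) on System 2's vision-language features."""
    # A real DiT is a transformer; here we just nudge the chunk toward
    # a fixed target derived from the conditioning features.
    target = np.tanh(vlm_features[:ACTION_DIM * CHUNK]).reshape(CHUNK, ACTION_DIM)
    return target - actions

def denoise_action_chunk(vlm_features, rng):
    """Flow matching: integrate from noise (t=0) toward actions (t=1)
    with 4 forward Euler steps, as described for System 1."""
    actions = rng.standard_normal((CHUNK, ACTION_DIM))
    dt = 1.0 / NUM_STEPS
    for k in range(NUM_STEPS):
        t = k * dt
        actions = actions + dt * velocity_field(actions, t, vlm_features)
    return actions

rng = np.random.default_rng(0)
features = rng.standard_normal(2048)   # pretend System 2 output
chunk = denoise_action_chunk(features, rng)
print(chunk.shape)  # (16, 7)
```

The key point the sketch preserves: System 2 runs once per chunk (slow, ~10 Hz), while the cheap Euler integration produces 16 timesteps of actions at once, which is what lets System 1 stay fast.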
Training Data — The 4-Layer Pyramid
GR00T N1 is trained on four types of data, arranged in a pyramid structure:
Layer 1: Real Robot Data (Highest Quality)
- GR00T N1 Humanoid dataset: 88 hours of teleoperation on Fourier GR-1 using VIVE trackers + Xsens gloves
- Open X-Embodiment: RT-1, Bridge-v2, DROID, RoboSet, and more
- AgiBot-Alpha: 140,000 trajectories from 100 robots
Layer 2: Human Videos
- Ego4D, EPIC-KITCHENS, Assembly-101 — videos of humans manipulating objects
- No motor commands available → uses VQ-VAE to learn a latent action space
Layer 3: Simulation Data
- DexMimicGen: 780,000 trajectories (equivalent to 6,500 hours) generated in just 11 hours on Isaac Sim
Layer 4: Neural Trajectories (AI-Generated)
- From 88 hours of real data → generated 827 hours of video using image-to-video models
- Result: 40% performance boost compared to using real data alone
AGIBOT World Dataset
AGIBOT World is one of the largest robot learning datasets available today, created by AgiBot (China):
- AGIBOT World Beta: 1M+ trajectories, 2,976 hours, 217 tasks, 87 skills
- 3,000+ objects, 100+ real-world scenarios
- 5 domains: manipulation, tool use, multi-robot collaboration, and more
- Finalist for IROS 2025 Best Paper Award
GR00T N1 uses AgiBot-Alpha (the earlier version, 140K trajectories) as one of its primary training data sources.
- GitHub: OpenDriveLab/AgiBot-World
- HuggingFace: agibot-world/AgiBotWorld-Alpha
Environment Setup
Hardware Requirements
| Purpose | Minimum GPU |
|---|---|
| Fine-tuning | 1x RTX A6000 or RTX 4090 (24GB VRAM) |
| Inference | RTX 4090 (44ms, ~23 Hz) or Jetson AGX Orin |
| Pre-training | 1024x H100 (not feasible for individuals) |
Step 1: Clone the Repository
```bash
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
```
Step 2: Install Dependencies
GR00T uses uv — an extremely fast Python package manager:
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies for desktop GPU
bash scripts/deployment/dgpu/install_deps.sh

# Sync Python environment
uv sync && uv pip install -e .
```
Step 3: Download Model Weights
Model weights are available on HuggingFace:
```bash
# Install huggingface-cli if needed
pip install huggingface_hub

# Download model (~8GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
```
Available models:
| Model | Parameters | Description |
|---|---|---|
| GR00T-N1.6-3B | 3B | Latest base model |
| GR00T-N1.6-bridge | 3B | Pre-finetuned for WidowX |
| GR00T-N1.6-G1 | 3B | Pre-finetuned for Unitree G1 |
| GR00T-N1.6-BEHAVIOR1k | 3B | Pre-finetuned for Galaxea R1 Pro |
| GR00T-N1.6-DROID | 3B | Pre-finetuned on DROID dataset |
Data Preparation
GR00T N1 uses the LeRobot v2 data format. If you're familiar with the LeRobot framework, this process should feel familiar.
Data Format
Each trajectory requires:
```python
{
    "observation": {
        "image": np.array,   # (H, W, 3) RGB image
        "state": np.array,   # Robot state (joints, gripper, etc.)
    },
    "action": np.array,            # Target action
    "language_instruction": str,   # Task description in natural language
}
```
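Before converting a full dataset, it can save time to sanity-check individual steps against this schema. The helper below is an illustrative sketch, not part of the Isaac-GR00T repo:

```python
import numpy as np

def check_trajectory_step(step):
    """Sanity-check one trajectory step against the schema above.
    Illustrative helper -- not part of the Isaac-GR00T repo."""
    obs = step["observation"]
    assert obs["image"].ndim == 3 and obs["image"].shape[-1] == 3, "image must be (H, W, 3)"
    assert obs["state"].ndim == 1, "state must be a flat vector"
    assert step["action"].ndim == 1, "action must be a flat vector"
    assert isinstance(step["language_instruction"], str)
    return True

step = {
    "observation": {
        "image": np.zeros((224, 224, 3), dtype=np.uint8),
        "state": np.zeros(8, dtype=np.float32),   # e.g. 7 joints + gripper
    },
    "action": np.zeros(8, dtype=np.float32),
    "language_instruction": "pick up the red cup",
}
print(check_trajectory_step(step))  # True
```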
Using AGIBOT World
```bash
# Download AGIBOT World Alpha dataset
huggingface-cli download agibot-world/AgiBotWorld-Alpha \
  --local-dir ./data/agibot-alpha \
  --repo-type dataset

# Convert to LeRobot v2 format (script included in repo)
python scripts/data/convert_agibot_to_lerobot.py \
  --input-dir ./data/agibot-alpha \
  --output-dir ./data/agibot-lerobot
```
Creating a Modality Config
Each robot type needs a modality config describing its state/action structure:
```yaml
# config/my_robot_modality.yaml
state:
  joint_positions:
    dim: 7        # Degrees of freedom
    normalize: true
  gripper:
    dim: 1
    normalize: true
action:
  joint_positions:
    dim: 7
    normalize: true
  gripper:
    dim: 1
    normalize: true
video:
  cameras:
    - name: front_camera
      resolution: [224, 224]
```
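The `normalize: true` flags typically mean per-dimension statistics are computed over the dataset and applied to each state/action vector. A minimal sketch of that idea, assuming mean/std normalization (the exact scheme GR00T's loader uses may differ, e.g. min/max):

```python
import numpy as np

def fit_normalizer(states):
    """Per-dimension mean/std over a dataset of state vectors.
    Sketch of what `normalize: true` typically implies; the actual
    statistics used by GR00T's data loader may differ."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std

def normalize(state, mean, std):
    return (state - mean) / std

states = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
mean, std = fit_normalizer(states)
z = normalize(states, mean, std)
print(np.allclose(z.mean(axis=0), 0.0))  # True
```

Whatever scheme is used, the same statistics must be applied at inference time, which is why they live alongside the modality config rather than being recomputed per batch.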
Fine-Tuning
This is the most important part. Fine-tuning GR00T N1 can be done on a single GPU.
Fine-Tuning Strategy
During fine-tuning:
- Frozen: Language component of VLM backbone
- Fine-tuned: Vision encoder, state/action encoders, DiT action head
- Batch size: Up to 200 (adapter-only tuning), up to 16 (vision encoder tuning)
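The freeze/tune split above amounts to a predicate over parameter names. Here is a pure-Python sketch; the prefixes are illustrative (actual module names in Isaac-GR00T differ), and in a real PyTorch run you would set `p.requires_grad` accordingly:

```python
# Illustrative prefix -- actual module names in Isaac-GR00T differ.
FROZEN_PREFIXES = ("backbone.language_model.",)

def is_trainable(param_name):
    """Freeze the VLM's language component; tune everything else
    (vision encoder, state/action encoders, DiT action head).
    With PyTorch: p.requires_grad = is_trainable(name)."""
    return not param_name.startswith(FROZEN_PREFIXES)

params = [
    "backbone.language_model.layers.0.attn.q_proj.weight",
    "backbone.vision_encoder.patch_embed.weight",
    "action_head.dit.blocks.0.mlp.fc1.weight",
    "embodiment.new_humanoid.state_encoder.weight",
]
trainable = [p for p in params if is_trainable(p)]
print(len(trainable))  # 3
```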
Fine-Tuning Command
```bash
CUDA_VISIBLE_DEVICES=0 uv run python gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.6-3B \
  --dataset-path ./data/agibot-lerobot \
  --embodiment-tag new_humanoid \
  --modality-config-path config/my_robot_modality.yaml \
  --num-gpus 1 \
  --max-steps 2000 \
  --global-batch-size 32 \
  --learning-rate 1e-4 \
  --output-dir ./checkpoints/my_finetune
```
Key Hyperparameters
| Parameter | Recommended Value | Notes |
|---|---|---|
| `max-steps` | 2000-5000 | Start with 2000, increase if needed |
| `global-batch-size` | 16-32 | Depends on VRAM |
| `learning-rate` | 1e-4 | For adapter-only; 1e-5 if tuning vision |
| `warmup-steps` | 100-200 | Learning rate warmup |
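What `warmup-steps` does can be sketched in a few lines: the learning rate ramps linearly from near zero to `learning-rate` over the warmup window. The constant-after-warmup tail is an assumption here; the actual trainer may decay afterwards (e.g. cosine):

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=100):
    """Linear warmup to base_lr, then constant.
    The actual Isaac-GR00T trainer may decay the rate afterwards;
    this only illustrates the warmup-steps knob."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at_step(0))    # 1e-06
print(lr_at_step(49))   # 5e-05
print(lr_at_step(500))  # 0.0001
```

Warmup matters most when fine-tuning a large pre-trained model on a small dataset: the first gradient steps on new embodiment MLPs are noisy, and a full-size learning rate at step 0 can disturb the pre-trained weights.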
Fine-Tuning Tips
- Start small: Fine-tune on 100-200 demos first, check results, then scale up
- Monitor loss: Loss should decrease steadily in the first 500 steps. If not → reduce learning rate
- Overfitting: GR00T N1.6 has 32 DiT layers (double N1.5), making it more prone to overfitting → ensure good regularization
- Mixed precision: Defaults to bf16, keep it if your GPU supports it
Inference — Deploying the Model
GR00T N1 uses a server-client architecture for inference:
Starting the Inference Server
```bash
# Launch server with fine-tuned model
uv run python gr00t/policy/serve_policy.py \
  --model-path ./checkpoints/my_finetune \
  --port 5555
```
Client Code (On the Robot)
```python
from gr00t.policy.server_client import PolicyClient
import numpy as np

# Connect to server
policy = PolicyClient(host="localhost", port=5555)

# Control loop
while True:
    # Get observation from robot
    obs = {
        "image": camera.get_frame(),           # (224, 224, 3)
        "state": robot.get_joint_positions(),  # (7,)
        "language_instruction": "pick up the red cup",
    }

    # Predict action
    action, info = policy.get_action(obs)

    # Send action to robot
    robot.execute(action)
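Because System 1 outputs 16-step action chunks, deployments typically execute only the first few actions of each chunk before re-querying, so the policy can react to new observations (receding-horizon execution). A sketch with a stubbed policy; the real `PolicyClient` return format may differ:

```python
def run_chunked(policy_fn, get_obs, execute, horizon=8, total_steps=32):
    """Receding-horizon execution: the model returns a 16-step action
    chunk, but we execute only the first `horizon` actions before
    re-querying. Sketch only -- not the actual PolicyClient interface."""
    executed = 0
    while executed < total_steps:
        chunk = policy_fn(get_obs())          # list of 16 actions
        for action in chunk[:horizon]:
            execute(action)
            executed += 1
            if executed >= total_steps:
                break
    return executed

# Stub policy returning a fresh 16-step chunk each call
calls = []
def stub_policy(obs):
    calls.append(obs)
    return list(range(16))

log = []
n = run_chunked(stub_policy, lambda: {}, log.append, horizon=8, total_steps=32)
print(n, len(calls))  # 32 4
```

The `horizon` knob trades reactivity against inference load: executing all 16 steps halves the query rate, while re-querying every step gives the fastest reaction at the cost of running the model far more often.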
Inference Performance
| GPU | Latency | Frequency |
|---|---|---|
| RTX 5090 | 37ms | 27.3 Hz |
| H100 | 38ms | 26.3 Hz |
| RTX 4090 | 44ms | 22.8 Hz |
| DGX Spark | 89ms | 11.2 Hz |
| Jetson Thor | 105ms | 9.5 Hz |
Benchmark Results
GR00T N1 significantly outperforms baselines across most benchmarks:
Simulation (100 demos per task)
| Benchmark | BC Transformer | Diffusion Policy | GR00T N1 |
|---|---|---|---|
| RoboCasa | 26.3% | 25.6% | 32.1% |
| DexMimicGen | 53.9% | 56.1% | 66.5% |
| GR-1 Humanoid | 16.1% | 32.7% | 50.0% |
| Average | 26.4% | 33.4% | 45.0% |
Real-World on Fourier GR-1
| Task | Diffusion Policy (100% data) | GR00T N1 (100% data) |
|---|---|---|
| Pick-and-Place | 36.0% | 82.0% |
| Articulated Objects | 38.6% | 70.9% |
| Industrial Tasks | 61.0% | 70.0% |
| Bimanual | 62.5% | 82.5% |
| Average | 46.4% | 76.8% |
Key finding: GR00T N1 with only 10% of data (42.6%) nearly matches Diffusion Policy using 100% of data (46.4%). This demonstrates the power of cross-embodiment pre-training.
Isaac Lab — Its Role in the Pipeline
Isaac Lab is NVIDIA's robot learning framework built on Isaac Sim (Omniverse). In the GR00T N1 pipeline, Isaac Lab serves as:
- Simulation environment: Creates simulation environments for policy evaluation before deploying on real hardware
- Data generation: DexMimicGen (built on Isaac Sim) generated 780K trajectories for training
- Benchmarking: RoboCasa, DexMimicGen tasks, and GR-1 benchmarks all run on Isaac Lab
- Sim-to-real: Complete pipeline from training → sim evaluation → hardware deployment
If you're interested in simulation for robotics, Isaac Lab is an essential tool for the GR00T workflow.
Evolution: N1 → N1.5 → N1.6
Since announcing N1 (March 2025), NVIDIA has rapidly iterated:
N1.5:
- VLM upgraded to Eagle 2.5 with better grounding capabilities
- Added FLARE — aligns the model with target future embeddings
- Language following: 46.6% → 93.3% (roughly doubled)
N1.6 (Latest version):
- VLM switched to NVIDIA Cosmos-2B with flexible resolution
- DiT doubled: 32 layers (vs. 16 in N1.5)
- Faster convergence, smoother actions
- Requires more careful fine-tuning (easier to overfit due to larger model)
Complete Workflow: Zero to Inference
Here's the complete 6-step pipeline:
- Setup: Clone Isaac-GR00T, install dependencies, download model weights
- Data collection: Teleoperation or download AGIBOT World
- Data preparation: Convert to LeRobot v2 format, create modality config
- Fine-tune: Run `launch_finetune.py` on 1x RTX 4090
- Evaluate: Test in Isaac Lab simulation
- Deploy: Run server-client inference on real robot
The entire fine-tuning process (2000 steps, 200 demos) takes approximately 2-4 hours on an RTX 4090 — entirely feasible for individual researchers.
Conclusion
GR00T N1 marks a turning point in humanoid robotics: the first time a powerful, cross-embodiment foundation model has been fully opened to the community. Combined with AGIBOT World data and the Isaac Lab environment, anyone with an RTX 4090 can start fine-tuning VLA models for their robots.
If you're building robot systems and want to leverage the power of foundation models, GR00T N1 is the best starting point available today.
Resources:
- Paper: GR00T N1 (arXiv 2503.14734)
- GitHub: NVIDIA/Isaac-GR00T
- HuggingFace: nvidia/GR00T-N1.6-3B
- AGIBOT World Dataset