
Fine-tuning NVIDIA GR00T N1 Guide

Step-by-step guide to fine-tune the GR00T N1 VLA model for humanoid robots using Isaac Lab and AGIBOT World data — from setup to inference.

Nguyễn Anh Tuấn · April 12, 2026 · 10 min read

If you've been following the humanoid robotics space, you've likely heard of GR00T N1 — NVIDIA's first open foundation model purpose-built for humanoid robots. This isn't just another research paper — NVIDIA has fully open-sourced everything from model weights to the fine-tuning pipeline, making it possible for anyone to adapt and deploy on real robots.

In this guide, I'll walk you through every step of fine-tuning GR00T N1 using Isaac Lab and AGIBOT World data — from understanding the architecture, setting up the environment, preparing data, to training and inference.

What Is GR00T N1?

GR00T N1 (Generalist Robot 00 Technology) is a Vision-Language-Action (VLA) model announced by NVIDIA at GTC 2025. Unlike previous VLA models designed for specific robots, GR00T N1 is cross-embodiment — a single model that works across different robot types, from single-arm manipulators to full humanoid robots.

Original paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots — NVIDIA Research, 2025.

Why Is GR00T N1 Special?

  • Cross-embodiment: One model runs on WidowX, Google Robot, Fourier GR-1, Unitree G1, and more
  • Fully open: Model weights, training code, fine-tuning pipeline — all open-source
  • Data-efficient: With only 10% of training data, GR00T N1 nearly matches Diffusion Policy using 100% data
  • Real-time inference: 22-27 Hz on RTX 4090, fast enough for real robot control

If you're familiar with other VLA models like RT-2 or Octo, GR00T N1 represents a major leap in generalization capability.

Dual-System Architecture

The most innovative aspect of GR00T N1 is its dual-system architecture, inspired by Daniel Kahneman's cognitive theory (Thinking, Fast and Slow):

GR00T N1 dual-system architecture — System 2 processes language and vision, System 1 generates fast actions

System 2 — Vision-Language Module (Slow, Deliberative)

  • Backbone: Eagle-2 VLM = SigLIP-2 (vision encoder) + SmolLM2 (language model)
  • Parameters: 1.34B
  • Speed: ~10 Hz
  • Role: Takes camera images + language instructions → understands context and task goals
  • Key insight: Features are extracted from the 12th (middle) layer rather than the final layer — this is both faster and produces better downstream performance (verified via ablation study)

System 1 — Diffusion Transformer Action Head (Fast, Reactive)

  • Architecture: DiT (Diffusion Transformer) with action flow-matching
  • Speed: ~120 Hz (internal), outputs chunks of 16 timesteps
  • Denoising steps: 4 steps using Forward Euler integration
  • Cross-attention: Connects to System 2's output for context understanding
  • Embodiment MLP: Each robot type has its own MLP for encoding/decoding state and actions — this is the key to cross-embodiment support

Total model: GR00T-N1-2B has 2.2B parameters (1.34B VLM + rest for DiT and embodiment MLPs).

If you've read about Diffusion Policy, System 1 in GR00T N1 is an upgraded version — instead of U-Net, it uses a Transformer with more flexible attention mechanisms.
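To make System 1's denoising loop concrete, here is a minimal sketch of 4-step forward Euler integration under flow matching. The velocity field below is a toy stand-in for the DiT (the real network is conditioned on System 2's features and the robot state); only the integration scheme follows the description above.

```python
import numpy as np

def toy_velocity_field(a, t, target):
    """Toy stand-in for the DiT: for a linear probability path, the
    ground-truth velocity points from the current sample at the target."""
    return (target - a) / (1.0 - t)

def euler_denoise(target, num_steps=4, seed=0):
    """Integrate Gaussian noise toward an action chunk with forward
    Euler, mirroring System 1's 4-step denoising schedule."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(target.shape)  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so 1 - t never hits zero
        a = a + dt * toy_velocity_field(a, t, target)
    return a

# Denoise a 16-timestep, 7-DoF action chunk (the chunk size System 1 uses).
chunk = euler_denoise(np.full((16, 7), 0.5))
```

For this linear toy field the final Euler step lands exactly on the target, which gives some intuition for why so few denoising steps can suffice.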

Training Data — The 4-Layer Pyramid

GR00T N1 is trained on four types of data, arranged in a pyramid structure:

Layer 1: Real Robot Data (Highest Quality)

  • GR00T N1 Humanoid dataset: 88 hours of teleoperation on Fourier GR-1 using VIVE trackers + Xsens gloves
  • Open X-Embodiment: RT-1, Bridge-v2, DROID, RoboSet, and more
  • AgiBot-Alpha: 140,000 trajectories from 100 robots

Layer 2: Human Videos

  • Ego4D, EPIC-KITCHENS, Assembly-101 — videos of humans manipulating objects
  • No motor commands available → uses VQ-VAE to learn a latent action space

Layer 3: Simulation Data

  • DexMimicGen: 780,000 trajectories (equivalent to 6,500 hours) generated in just 11 hours on Isaac Sim

Layer 4: Neural Trajectories (AI-Generated)

  • From 88 hours of real data → generated 827 hours of video using image-to-video models
  • Result: 40% performance boost compared to using real data alone

AGIBOT World Dataset

AGIBOT World is one of the largest robot learning datasets available today, created by AgiBot (China):

  • AGIBOT World Beta: 1M+ trajectories, 2,976 hours, 217 tasks, 87 skills
  • 3,000+ objects, 100+ real-world scenarios
  • 5 domains: manipulation, tool use, multi-robot collaboration, and more
  • Finalist for IROS 2025 Best Paper Award

GR00T N1 uses AgiBot-Alpha (the earlier version, 140K trajectories) as one of its primary training data sources.

GitHub: OpenDriveLab/AgiBot-World
HuggingFace: agibot-world/AgiBotWorld-Alpha

Environment Setup

Hardware Requirements

| Purpose | Minimum GPU |
| --- | --- |
| Fine-tuning | 1x RTX A6000 or RTX 4090 (24GB VRAM) |
| Inference | RTX 4090 (44ms, ~23 Hz) or Jetson AGX Orin |
| Pre-training | 1024x H100 (not feasible for individuals) |

Step 1: Clone the Repository

git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T

Step 2: Install Dependencies

GR00T uses uv — an extremely fast Python package manager:

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies for desktop GPU
bash scripts/deployment/dgpu/install_deps.sh

# Sync Python environment
uv sync && uv pip install -e .

Step 3: Download Model Weights

Model weights are available on HuggingFace:

# Install huggingface-cli if needed
pip install huggingface_hub

# Download model (~8GB)
huggingface-cli download nvidia/GR00T-N1.6-3B

Available models:

| Model | Parameters | Description |
| --- | --- | --- |
| GR00T-N1.6-3B | 3B | Latest base model |
| GR00T-N1.6-bridge | 3B | Pre-finetuned for WidowX |
| GR00T-N1.6-G1 | 3B | Pre-finetuned for Unitree G1 |
| GR00T-N1.6-BEHAVIOR1k | 3B | Pre-finetuned for Galaxea R1 Pro |
| GR00T-N1.6-DROID | 3B | Pre-finetuned on DROID dataset |

Data Preparation

GR00T N1 uses the LeRobot v2 data format. If you've worked with the LeRobot framework before, this process will look familiar.

Data Format

Each trajectory requires:

{
    "observation": {
        "image": np.array,          # (H, W, 3) RGB image
        "state": np.array,          # Robot state (joints, gripper, etc.)
    },
    "action": np.array,             # Target action
    "language_instruction": str,    # Task description in natural language
}
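Before converting a full dataset, it can help to assemble a single frame in this schema and check the shapes. The helper below is a hypothetical sketch, not part of the Isaac-GR00T API; the 7-joint-plus-gripper dimensions match the modality config used later in this guide.

```python
import numpy as np

def make_frame(image, joint_state, gripper_state, action, instruction):
    """Assemble one trajectory frame in the schema shown above.
    All values here are dummies; on a real robot they would come
    from the camera driver and joint encoders."""
    state = np.concatenate([joint_state, gripper_state]).astype(np.float32)
    return {
        "observation": {
            "image": image,   # (H, W, 3) uint8 RGB
            "state": state,   # joints + gripper, float32
        },
        "action": np.asarray(action, dtype=np.float32),
        "language_instruction": instruction,
    }

frame = make_frame(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    joint_state=np.zeros(7),
    gripper_state=np.zeros(1),
    action=np.zeros(8),
    instruction="pick up the red cup",
)
```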

Using AGIBOT World

# Download AGIBOT World Alpha dataset
huggingface-cli download agibot-world/AgiBotWorld-Alpha \
    --local-dir ./data/agibot-alpha \
    --repo-type dataset

# Convert to LeRobot v2 format (script included in repo)
python scripts/data/convert_agibot_to_lerobot.py \
    --input-dir ./data/agibot-alpha \
    --output-dir ./data/agibot-lerobot

Creating a Modality Config

Each robot type needs a modality config describing its state/action structure:

# config/my_robot_modality.yaml
state:
  joint_positions:
    dim: 7          # Degrees of freedom
    normalize: true
  gripper:
    dim: 1
    normalize: true

action:
  joint_positions:
    dim: 7
    normalize: true
  gripper:
    dim: 1
    normalize: true

video:
  cameras:
    - name: front_camera
      resolution: [224, 224]

Fine-Tuning

This is the most important part. Fine-tuning GR00T N1 can be done on a single GPU.

Fine-tuning a VLA model on GPU — adapting the model for a specific robot

Fine-Tuning Strategy

During fine-tuning:

  • Frozen: Language component of VLM backbone
  • Fine-tuned: Vision encoder, state/action encoders, DiT action head
  • Batch size: Up to 200 (adapter-only tuning), up to 16 (vision encoder tuning)

Fine-Tuning Command

CUDA_VISIBLE_DEVICES=0 uv run python gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path ./data/agibot-lerobot \
    --embodiment-tag new_humanoid \
    --modality-config-path config/my_robot_modality.yaml \
    --num-gpus 1 \
    --max-steps 2000 \
    --global-batch-size 32 \
    --learning-rate 1e-4 \
    --output-dir ./checkpoints/my_finetune

Key Hyperparameters

| Parameter | Recommended Value | Notes |
| --- | --- | --- |
| max-steps | 2000-5000 | Start with 2000, increase if needed |
| global-batch-size | 16-32 | Depends on VRAM |
| learning-rate | 1e-4 | For adapter-only; 1e-5 if tuning the vision encoder |
| warmup-steps | 100-200 | Learning-rate warmup |
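For intuition, the warmup-steps setting corresponds to a schedule shaped like the sketch below: a linear ramp up to the base learning rate over the warmup window. The constant tail is an assumption for illustration; the actual scheduler in Isaac-GR00T may decay after warmup.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=200):
    """Linear learning-rate warmup, then hold at base_lr.
    The post-warmup behavior is illustrative; the real scheduler
    may decay (e.g. cosine) instead of staying constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Halfway through warmup the LR is half the base value.
mid_lr = lr_at_step(99)
```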

Fine-Tuning Tips

  1. Start small: Fine-tune on 100-200 demos first, check results, then scale up
  2. Monitor loss: Loss should decrease steadily in the first 500 steps. If not → reduce learning rate
  3. Overfitting: GR00T N1.6 has 32 DiT layers (double N1.5), making it more prone to overfitting → ensure good regularization
  4. Mixed precision: Defaults to bf16, keep it if your GPU supports it
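Tip 2 can be turned into a small automatic check on your logged losses. This helper is a hypothetical sketch; it simply compares the average loss at the start and end of the log.

```python
def loss_trending_down(losses, window=50):
    """Return True if the mean of the last `window` losses is below
    the mean of the first `window`. If this is still False after
    ~500 steps, try lowering the learning rate."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last < first

# Synthetic example: a noisy but steadily decreasing loss curve.
fake_losses = [1.0 - 0.001 * i + 0.01 * ((-1) ** i) for i in range(500)]
```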

Inference — Deploying the Model

GR00T N1 uses a server-client architecture for inference:

Starting the Inference Server

# Launch server with fine-tuned model
uv run python gr00t/policy/serve_policy.py \
    --model-path ./checkpoints/my_finetune \
    --port 5555

Client Code (On the Robot)

from gr00t.policy.server_client import PolicyClient
import numpy as np

# Connect to server
policy = PolicyClient(host="localhost", port=5555)

# Control loop
while True:
    # Get observation from robot
    obs = {
        "image": camera.get_frame(),         # (224, 224, 3)
        "state": robot.get_joint_positions(), # (7,)
        "language_instruction": "pick up the red cup"
    }

    # Predict action
    action, info = policy.get_action(obs)

    # Send action to robot
    robot.execute(action)
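Because System 1 outputs 16-timestep action chunks while a server round-trip takes ~44 ms on an RTX 4090, a common deployment pattern is receding-horizon execution: run the first part of each chunk, then request a fresh one. The sketch below uses stub functions in place of PolicyClient and the robot driver, and the 8-step split is an illustrative choice, not a value from the repo.

```python
from collections import deque

def run_chunked_control(get_chunk, execute, num_replans=3,
                        chunk_len=16, steps_per_chunk=8):
    """Receding-horizon control: fetch an action chunk, execute its
    first `steps_per_chunk` actions, then replan with fresh context.
    `get_chunk` and `execute` stand in for the policy server and
    the robot driver."""
    executed = []
    for _ in range(num_replans):
        chunk = deque(get_chunk())
        assert len(chunk) == chunk_len
        for _ in range(steps_per_chunk):
            executed.append(chunk.popleft())
            execute(executed[-1])
    return executed

# Stub policy: the i-th chunk is the value i repeated 16 times.
calls = {"i": 0}
def fake_policy():
    calls["i"] += 1
    return [calls["i"]] * 16

history = run_chunked_control(fake_policy, execute=lambda a: None)
```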

Inference Performance

| GPU | Latency | Frequency |
| --- | --- | --- |
| RTX 5090 | 37ms | 27.3 Hz |
| H100 | 38ms | 26.3 Hz |
| RTX 4090 | 44ms | 22.8 Hz |
| DGX Spark | 89ms | 11.2 Hz |
| Jetson Thor | 105ms | 9.5 Hz |

Benchmark Results

GR00T N1 significantly outperforms baselines across most benchmarks:

Simulation (100 demos per task)

| Benchmark | BC Transformer | Diffusion Policy | GR00T N1 |
| --- | --- | --- | --- |
| RoboCasa | 26.3% | 25.6% | 32.1% |
| DexMimicGen | 53.9% | 56.1% | 66.5% |
| GR-1 Humanoid | 16.1% | 32.7% | 50.0% |
| Average | 26.4% | 33.4% | 45.0% |

Real-World on Fourier GR-1

| Task | Diffusion Policy (100% data) | GR00T N1 (100% data) |
| --- | --- | --- |
| Pick-and-Place | 36.0% | 82.0% |
| Articulated Objects | 38.6% | 70.9% |
| Industrial Tasks | 61.0% | 70.0% |
| Bimanual | 62.5% | 82.5% |
| Average | 46.4% | 76.8% |

Key finding: GR00T N1 with only 10% of data (42.6%) nearly matches Diffusion Policy using 100% of data (46.4%). This demonstrates the power of cross-embodiment pre-training.

Isaac Lab — Its Role in the Pipeline

Isaac Lab is NVIDIA's robot learning framework built on Isaac Sim (Omniverse). In the GR00T N1 pipeline, Isaac Lab serves as:

  1. Simulation environment: Creates simulation environments for policy evaluation before deploying on real hardware
  2. Data generation: DexMimicGen (built on Isaac Sim) generated 780K trajectories for training
  3. Benchmarking: RoboCasa, DexMimicGen tasks, and GR-1 benchmarks all run on Isaac Lab
  4. Sim-to-real: Complete pipeline from training → sim evaluation → hardware deployment

If you're interested in simulation for robotics, Isaac Lab is an essential tool for the GR00T workflow.

Evolution: N1 → N1.5 → N1.6

Since announcing N1 (March 2025), NVIDIA has rapidly iterated:

N1.5:

  • VLM upgraded to Eagle 2.5 with better grounding capabilities
  • Added FLARE — aligns the model with target future embeddings
  • Language following: 46.6% → 93.3% (roughly a 2× improvement)

N1.6 (Latest version):

  • VLM switched to NVIDIA Cosmos-2B with flexible resolution
  • DiT doubled: 32 layers (vs. 16 in N1.5)
  • Faster convergence, smoother actions
  • Requires more careful fine-tuning (easier to overfit due to larger model)

Complete Workflow: Zero to Inference

Here's the complete 6-step pipeline:

  1. Setup: Clone Isaac-GR00T, install dependencies, download model weights
  2. Data collection: Teleoperation or download AGIBOT World
  3. Data preparation: Convert to LeRobot v2 format, create modality config
  4. Fine-tune: Run launch_finetune.py on 1x RTX 4090
  5. Evaluate: Test in Isaac Lab simulation
  6. Deploy: Run server-client inference on real robot

The entire fine-tuning process (2000 steps, 200 demos) takes approximately 2-4 hours on an RTX 4090 — entirely feasible for individual researchers.

Conclusion

GR00T N1 marks a turning point in humanoid robotics: the first time a powerful, cross-embodiment foundation model has been fully opened to the community. Combined with AGIBOT World data and the Isaac Lab environment, anyone with an RTX 4090 can start fine-tuning VLA models for their robots.

If you're building robot systems and want to leverage the power of foundation models, GR00T N1 is the best starting point available today.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
