Tags: ai, vla, world-model, reinforcement-learning, gigabrain, robotics, manipulation

GigaBrain-0 Guide: VLA + World Model + RL

Hands-on guide to training VLA with World Models and Reinforcement Learning using the RAMP framework from GigaBrain — open-source, 3.5B params.

Nguyễn Anh Tuấn · April 12, 2026 · 10 min read

You're familiar with Imitation Learning — collect human demos, then teach the robot to copy them. It works well, but it has a fundamental ceiling: the robot can only be as good as the demonstrations it has seen. If the demos miss a scenario, the robot freezes.

GigaBrain-0 — an open-source VLA model family from GigaAI — solves this with a breakthrough idea: teach the robot to "imagine" the future before acting, then use Reinforcement Learning to optimize based on those imagined futures. In this article, I'll walk you through the core ideas, architecture, installation, training, and inference with fully open-source code.

GigaBrain-0 combines World Models and RL to train VLA more effectively than pure Imitation Learning

Overview: What is GigaBrain-0?

GigaBrain-0 is actually a model family with multiple versions:

Version           Description                                        Date
GigaBrain-0       Foundation VLA model, Mixture-of-Transformers      10/2025
GigaBrain-0.1     Upgraded version, more data, #1 on RoboChallenge   02/2026
GigaBrain-0.5     VLA backbone, 3.5B params, with Embodied CoT       02/2026
GigaBrain-0.5M*   Full version: VLA + World Model + RL (RAMP)        02/2026

This article focuses on GigaBrain-0.5M* — the most complete version, combining all three components: VLA backbone, World Model (GigaWorld), and the RAMP framework for Reinforcement Learning.

Paper: GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning — GigaBrain Team, 02/2026

GitHub: open-gigaai/giga-brain-0 (Apache 2.0)

Core Idea: The RAMP Framework

The Problem with Pure Imitation Learning

Most current VLA models (RT-2, Octo, π₀) rely on Imitation Learning — the robot learns to copy actions from demo data. This creates two fundamental limitations:

  1. Performance ceiling is bounded by demo quality — the robot can't outperform its teachers
  2. Poor generalization — out-of-distribution scenarios cause complete failure

The Solution: Teach Robots to Dream, Then Learn from Dreams

RAMP (Reinforcement leArning via world Model-conditioned Policy) adds two key components:

  1. World Model (GigaWorld): A generative video model that predicts "what happens next" — like the robot "imagining" the future based on its current actions
  2. RL fine-tuning: Uses advantage functions from the World Model to optimize the policy, rather than just imitating

The core mathematical formulation:

π*(a|S) ∝ π_ref(a|S) · exp(A(S,a)/β)

Where the state is augmented: S = (o, z, l) with:

  • o = current observation (RGB-D camera images)
  • z = latent predictions from the World Model (imagined futures)
  • l = language instruction
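To make the formula concrete, here is a minimal NumPy sketch of the reweighting for a toy discrete action set (the reference probabilities, advantages, and β below are illustrative, not values from the paper):

```python
import numpy as np

def ramp_reweight(ref_probs, advantages, beta=1.0):
    """Reweight a reference policy by exponentiated advantages:
    pi*(a|S) ∝ pi_ref(a|S) * exp(A(S,a) / beta), then renormalize."""
    w = ref_probs * np.exp(advantages / beta)
    return w / w.sum()

# Toy example: three discrete actions.
ref = np.array([0.5, 0.3, 0.2])   # pretrained reference policy pi_ref
adv = np.array([0.0, 1.0, -1.0])  # advantages A(S,a) from the World Model
pi = ramp_reweight(ref, adv, beta=1.0)
```

Actions with positive advantage are upweighted relative to the reference; β controls how far the new policy may drift from π_ref.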

RAMP vs RECAP: Why RAMP Wins

RECAP (from Physical Intelligence, used in π₀.5) also uses RL, but only with binary signals (success/failure). The paper proves that RECAP is actually a special case of RAMP when you marginalize out the World Model information:

H(a|o,z,l) ≤ H(a|o,l)

In plain terms: when the robot can "see the future" (via z), it has less uncertainty than when it only sees the present. The result: RAMP improves by roughly 30 percentage points (absolute) over RECAP on hard tasks.

Detailed Architecture

GigaBrain-0.5M* consists of 3 main modules:

1. VLA Backbone (GigaBrain-0.5) — 3.5B params

  • Vision-Language Encoder: PaliGemma-2 (Google) — processes RGB-D images + text instructions
  • Action Head: Diffusion Transformer (DiT) with flow matching — generates action chunks (50 consecutive steps)
  • Embodied Chain-of-Thought: Generates subgoal language + discrete action tokens + 2D manipulation trajectories via GRU decoder

Key feature: Knowledge Insulation prevents action prediction optimization from interfering with CoT generation — the two branches are gradient-isolated.
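A toy PyTorch sketch of the gradient-isolation idea (the module names are stand-ins, not GigaBrain-0's actual layers): the CoT branch reads the shared features through .detach(), so its loss can never perturb the action-prediction pathway.

```python
import torch
import torch.nn as nn

# Stand-ins for the three pieces (hypothetical shapes):
shared = nn.Linear(8, 8)       # shared VLM trunk features
action_head = nn.Linear(8, 4)  # action-prediction branch
cot_head = nn.Linear(8, 6)     # Chain-of-Thought branch

x = torch.randn(2, 8)
feats = shared(x)

# Action loss backpropagates into the trunk as usual;
# the CoT branch sees detached features, so its gradients stop there.
action_loss = action_head(feats).pow(2).mean()
cot_loss = cot_head(feats.detach()).pow(2).mean()
(action_loss + cot_loss).backward()
```

Both heads still train, but only the action loss shapes the shared trunk.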

2. World Model (GigaWorld)

  • Architecture: Wan 2.2 (spatiotemporal DiT with self-attention)
  • Training: Flow matching with optimal transport path interpolation
  • Dual outputs: Jointly predicts future visual states AND value estimates
  • Prediction horizons: 12, 24, 36, 48 frames ahead

GigaWorld acts as the "dreaming brain" — it takes the current observation and action, then "imagines" what the next sequence of frames will look like. This information is encoded into a latent vector z and fed into the VLA backbone.
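The flow-matching objective with the straight (optimal-transport) interpolation path can be sketched as follows — a toy NumPy illustration, not GigaWorld's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x0, x1, t):
    """Conditional flow matching with the optimal-transport path:
    interpolate x_t = (1 - t) * x0 + t * x1 and regress the model's
    predicted velocity onto the constant target v = x1 - x0."""
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)

# Toy check: an oracle that already outputs the target velocity has zero loss.
x0 = rng.normal(size=(4, 16))  # noise sample
x1 = rng.normal(size=(4, 16))  # "future frame" latent (illustrative)
t = rng.uniform(size=(4, 1))
oracle = lambda x_t, t: x1 - x0
loss = flow_matching_loss(oracle, x0, x1, t)
```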

3. RAMP — The Glue

RAMP connects the World Model into the policy training loop:

  1. World Model generates z (latent predictions) for each observation
  2. Policy receives (o, z, l) instead of just (o, l)
  3. Advantage function A(S,a) computed from World Model's value predictions
  4. KL-regularized RL update keeps policy close to pretrained reference

Stochastic attention masking (p=0.2): During training, 20% of the time the World Model is "turned off" (masked). This prevents the policy from over-relying on the World Model — at inference, if the World Model is slow, the policy still works.
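The masking itself is simple. A sketch of the idea — here the latent z is dropped by zeroing it out, whereas the real implementation masks attention:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_mask_world_model(z, p=0.2, rng=rng):
    """With probability p, drop the World Model latent entirely,
    so the policy also learns to act from (o, l) alone."""
    if rng.random() < p:
        return np.zeros_like(z)
    return z

# Over many samples, roughly 20% of latents should be masked.
z = np.ones(8)
masked = sum(maybe_mask_world_model(z).sum() == 0 for _ in range(10_000))
rate = masked / 10_000
```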

Environment Setup

Hardware Requirements

  • GPU: NVIDIA A100/A800 (80GB VRAM) for training. RTX 4090 (24GB) for inference
  • RAM: 64GB+ for training, 32GB for inference
  • Storage: ~200GB for datasets + checkpoints
  • CUDA: 12.1+

Step 1: Create Environment

# Create conda environment
conda create -n giga_brain_0 python=3.11.10 -y
conda activate giga_brain_0

# Install main dependencies
pip3 install giga-train giga-datasets lerobot==0.3.2 matplotlib numpydantic

Step 2: Clone and Install giga-models

# Clone giga-models (model definitions)
git clone https://github.com/open-gigaai/giga-models.git
cd giga-models && pip3 install -e .
cd ..

# Clone giga-brain-0 (training + inference code)
git clone https://github.com/open-gigaai/giga-brain-0.git
cd giga-brain-0

Step 3: Download Pretrained Weights

Weights are hosted on HuggingFace (org: open-gigaai):

# Download VLA backbone (3.5B params, ~7GB)
huggingface-cli download open-gigaai/GigaBrain-0.1-3.5B-Base --local-dir checkpoints/gigabrain-0.1

# Download World Model
huggingface-cli download open-gigaai/GigaWorld-0-Video-GR1-2b --local-dir checkpoints/gigaworld

# Version without depth camera (easier to deploy)
huggingface-cli download open-gigaai/GigaBrain-0-3.5B-Base --local-dir checkpoints/gigabrain-0-nodepth

Data Preparation

GigaBrain-0 uses LeRobot format. If you have HDF5 data, convert as follows:

Convert HDF5 to LeRobot Format

python scripts/convert_from_hdf5.py \
  --data-path /path/to/raw_hdf5_data_path \
  --out-dir /path/to/lerobot_dataset \
  --task "Pick up the red block and place it in the bin"

Compute Normalization Statistics

python scripts/compute_norm_stats.py \
  --data-paths /path/to/dataset1 /path/to/dataset2 \
  --output-path /path/to/norm_stats.json \
  --embodiment-id 0 \
  --delta-mask True,True,True,True,True,True,False,True,True,True,True,True,True,False \
  --sample-rate 1.0 \
  --action-chunk 50 \
  --action-dim 32

Understanding delta-mask: Each True/False corresponds to an action dimension. True = use delta (change from previous step), False = use absolute value. Typically gripper uses absolute (open/close), joints use delta.
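A sketch of what the mask does to a trajectory (a hypothetical helper for illustration, not the repo's actual conversion code):

```python
import numpy as np

def apply_delta_mask(actions, delta_mask):
    """Convert a trajectory of absolute actions to training targets:
    dims with delta_mask=True become per-step deltas, dims with
    False (e.g. the gripper's open/close value) stay absolute."""
    actions = np.asarray(actions, dtype=float)
    out = actions.copy()
    out[1:, delta_mask] = np.diff(actions[:, delta_mask], axis=0)
    out[0, delta_mask] = 0.0  # first step has no previous reference
    return out

# Toy 2-DoF arm plus gripper: joints use delta, gripper stays absolute.
mask = np.array([True, True, False])
traj = np.array([[0.0, 1.0, 1.0],
                 [0.1, 1.2, 1.0],
                 [0.3, 1.5, 0.0]])
targets = apply_delta_mask(traj, mask)
```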

Training Pipeline — 4 Stages

GigaBrain-0 training pipeline with 4 iterative stages: pretrain World Model → fine-tune Policy → collect HILR data → joint training

Stage 1: World Model Pre-training

GigaWorld is trained on 10,931 hours of visual experience:

  • 61% (6,653 hours) — data synthesized by the World Model itself (self-play)
  • 39% (4,278 hours) — real robot data from multiple platforms (UR5, Franka, ARX5, ALOHA, Agibot G1)

Reward function uses sparse signals:

  • 0 on task success
  • -C_fail on failure
  • -1 per timestep (encourages faster completion)
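As a sketch (C_fail here is illustrative — the paper's constant may differ):

```python
def sparse_reward(done, success, c_fail=10.0):
    """Sparse reward as described above: 0 on success, -c_fail on
    failure, and -1 per (non-terminal) timestep to encourage
    faster completion."""
    if done:
        return 0.0 if success else -c_fail
    return -1.0

# An episode that succeeds at step 3 accumulates -1 - 1 + 0 = -2.
rewards = [sparse_reward(done=(t == 2), success=True) for t in range(3)]
total = sum(rewards)
```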

Stage 2: Policy Fine-tuning with World Model

This is the most critical step — fine-tuning the VLA backbone with World Model information:

# Fine-tune for AgileX Cobot Magic
python scripts/train.py --config configs.giga_brain_0_agilex_finetune.config

# Fine-tune for Agibot G1 humanoid
python scripts/train.py --config configs.giga_brain_0_agibot_finetune.config

# Train from scratch (if desired)
python scripts/train.py --config configs.giga_brain_0_from_scratch.config

Key hyperparameters:

  • Batch size: 256
  • Training steps: 20,000
  • Stochastic masking: p=0.2
  • Single denoising step (for efficiency)
  • n-step temporal difference for advantage computation
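The n-step advantage can be sketched as follows, with the World Model supplying the value estimates (toy numbers, γ = 1):

```python
def n_step_advantage(rewards, values, n, gamma=1.0):
    """n-step temporal-difference advantage at t=0:
    A = r_0 + γ·r_1 + ... + γ^(n-1)·r_{n-1} + γ^n·V(s_n) - V(s_0).
    Here V comes from the World Model's value predictions."""
    ret = sum(gamma**k * rewards[k] for k in range(n))
    return ret + gamma**n * values[n] - values[0]

# Toy example with the -1-per-step reward; value estimates are illustrative.
rewards = [-1.0, -1.0, -1.0, -1.0]
values = [-4.0, -3.0, -2.0, -1.0, 0.0]  # V(s_0) .. V(s_4)
adv = n_step_advantage(rewards, values, n=2)
```

With a perfectly consistent value function, as in this toy trajectory, the advantage comes out to zero — any deviation signals an action better or worse than the value baseline expects.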

Stage 3: Human-in-the-Loop Rollout (HILR)

After the policy is reasonably good, collect additional data by:

  1. Robot runs the policy autonomously
  2. Human operator intervenes when the robot is about to fail
  3. Automatic detection and removal of temporal discontinuities at intervention points

HILR data has a distribution closer to the actual policy than pure teleoperation, reducing distribution shift.
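A crude sketch of discontinuity detection — flagging large jumps between consecutive actions (the repo's actual heuristic may differ):

```python
import numpy as np

def find_discontinuities(actions, threshold=0.5):
    """Flag timesteps where consecutive actions jump by more than
    `threshold` (L2 norm) — a rough proxy for the temporal breaks
    that appear when a human operator takes over mid-rollout."""
    deltas = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    return np.where(deltas > threshold)[0] + 1

# Smooth trajectory with one abrupt takeover at step 3.
traj = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                 [1.5, 1.5], [1.6, 1.5]])
breaks = find_discontinuities(traj)
```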

Stage 4: Continual Joint Training

World Model AND Policy are trained simultaneously on new HILR data. This creates a self-improvement loop:

Better Policy → Better HILR Data → More Accurate World Model → Better Policy → ...

Inference and Deployment

Offline Inference (Testing on Dataset)

python scripts/inference.py \
  --model-path checkpoints/gigabrain-0.1 \
  --data-path /path/to/lerobot_dataset \
  --norm-stats-path /path/to/norm_stats.json \
  --output-path /tmp/vis_path \
  --delta-mask True,True,True,True,True,True,False,True,True,True,True,True,True,False \
  --embodiment-id 0 \
  --action-chunk 50 \
  --original-action-dim 14 \
  --tokenizer-model-path google/paligemma2-3b-pt-224 \
  --fast-tokenizer-path physical-intelligence/fast \
  --device cuda

Server-Client Deployment (For Real Robots)

On the GPU machine (server):

python scripts/inference_server.py \
  --model-path checkpoints/gigabrain-0.1 \
  --tokenizer-model-path google/paligemma2-3b-pt-224 \
  --fast-tokenizer-path physical-intelligence/fast \
  --delta-mask True,True,True,True,True,True,False,True,True,True,True,True,True,False \
  --embodiment-id 0 \
  --norm-stats-path /path/to/norm_stats.json \
  --original-action-dim 14

On the robot machine (client):

# Client sends sensor data, receives action predictions
python scripts/inference_client.py

# Or dedicated client for AgileX robots
python scripts/inference_agilex_client.py
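The wire format is defined by the repo's client scripts; purely as an illustration, an observation payload might be packed like this (the schema below is hypothetical):

```python
import base64
import json

import numpy as np

def pack_observation(rgb, joint_positions, instruction):
    """Hypothetical JSON payload for a GigaBrain-style inference
    server: base64-encode the image bytes, ship joint state and the
    language instruction alongside."""
    return json.dumps({
        "rgb": base64.b64encode(np.ascontiguousarray(rgb).tobytes()).decode(),
        "rgb_shape": list(rgb.shape),
        "joint_positions": list(map(float, joint_positions)),
        "instruction": instruction,
    })

obs = pack_observation(np.zeros((4, 4, 3), dtype=np.uint8),
                       [0.0] * 14,
                       "Pick up the red block and place it in the bin")
payload = json.loads(obs)
```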

Two inference modes:

  • Efficient Mode: Bypasses World Model (fastest, uses attention masking), suitable for simple tasks
  • Standard Mode: World Model active, for complex tasks requiring long-horizon planning

Edge Deployment

GigaBrain-0-Small — a smaller variant optimized for NVIDIA Jetson AGX Orin, suitable for running directly on the robot without a separate GPU server.

Benchmark Results

RoboChallenge Leaderboard (02/2026)

Model                          Average Success Rate   Rank
GigaBrain-0.1                  51.67%                 #1
π₀.5 (Physical Intelligence)   42.67%                 #2

Evaluated across 30 manipulation tasks on 20 different robots (UR5, Franka, ARX5, ALOHA).

RAMP vs RECAP — Direct Comparison

Task                   RAMP   RECAP   Improvement
Box Packing            ~95%   ~65%    +30%
Espresso Preparation   ~95%   ~65%    +30%
Laundry Folding        ~90%   ~60%    +30%

The largest improvements are on long-horizon, complex tasks — where "imagining the future" provides the greatest advantage. For simple tasks (pick-and-place), the gap is smaller.

Value Prediction Quality

Method           Inference Time   MAE      Kendall τ
VLM-based        0.32 s           0.0683   0.7972
WM value-only    0.11 s           0.0838   0.7288
WM state+value   0.25 s           0.0621   0.8018

World Model predicting state+value jointly gives the best results, with acceptable latency (0.25s on A800).

Practical Tips

1. Start from Pretrained Weights

Don't train from scratch unless you have a large GPU cluster. Use GigaBrain-0.1-3.5B-Base or GigaBrain-0-3.5B-Base (no depth camera needed) as your starting checkpoint.

2. Get the Delta-Mask Right

A wrong delta-mask will cause the robot to "run away" — joints increasing to infinity. Rules:

  • Joints (angles/positions): True (delta)
  • Gripper: False (absolute — open/close)

3. Action Chunk Size

Default is 50 steps. If your task has a higher control frequency (>30Hz), reduce the chunk size. If the task is slow (making espresso), 50 is reasonable.
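The arithmetic behind this rule:

```python
def chunk_horizon_seconds(chunk_size, control_hz):
    """How far into the future one action chunk reaches: at 10 Hz a
    50-step chunk covers 5 s, but at 50 Hz it covers only 1 s, so
    high-frequency controllers may want a smaller chunk."""
    return chunk_size / control_hz

slow = chunk_horizon_seconds(50, 10)  # slow task, e.g. making espresso
fast = chunk_horizon_seconds(50, 50)  # high-frequency controller
```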

4. Stochastic Masking at Deploy Time

Keep p=0.2 masking even during inference. This creates dropout-like regularization, making the policy more robust to World Model noise.

Conclusion

GigaBrain-0.5M* marks a major advancement: VLA that doesn't just imitate, but can "dream" and learn from its dreams. The RAMP framework enables systematic integration of world models into RL training, and experimental results show significant improvements over prior methods.

With open-source code (Apache 2.0), pretrained weights on HuggingFace, and support for multiple robot platforms, this is one of the most accessible VLA frameworks available today for the robotics community.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
