
Learning-based Navigation: GNM, ViNT and NoMaD

Explore foundation models for robot navigation -- GNM, ViNT, NoMaD from Berkeley and how they change how robots move.

Nguyen Anh Tuan · February 12, 2026 · 8 min read

From Classic Navigation to Learning-Based

In Part 1 and Part 2 of this series, we explored SLAM and Nav2 -- classical navigation methods based on geometric reasoning and hand-crafted planners. They work well in structured environments (factories, warehouses) but struggle when:

  • The environment has no map and cannot be mapped in advance
  • The terrain is complex and changing (outdoors, off-road)
  • The policy must generalize to new robots and environments without retraining

Learning-based navigation takes a different approach: instead of hand-coding rules, it learns from data. Just as NLP has foundation models (GPT, BERT), robot navigation is getting foundation models of its own.

In this post, I'll analyze three landmark papers from BAIR (Berkeley AI Research), the group led by Sergey Levine: GNM, ViNT, and NoMaD -- a sequence of works that is shaping learning-based navigation.

Deep learning models for robot navigation from real-world data

GNM -- General Navigation Model (2022)

Paper: GNM: A General Navigation Model to Drive Any Robot (Shah et al., ICRA 2023)

Problem GNM Solves

Before GNM, each robot needed its own navigation policy. A policy trained on a TurtleBot didn't work on a Jackal, and vice versa. GNM asks: can we train one model that works across many different robots?

Approach

GNM is a goal-conditioned navigation policy trained on data from multiple robot types. Core ideas:

  1. Data aggregation: collect navigation data from 6 different robot types (Jackal, TurtleBot, Spot, drone, etc.), ~60 hours in total
  2. Goal representation: use a goal image -- a picture of the destination the robot should reach
  3. Temporal context: instead of just the current frame, GNM uses a sequence of images (observation history) to understand motion
  4. Normalized action space: normalize actions (linear and angular velocity) across robots with different sizes and kinematics
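
Core idea 4 can be sketched with a simple normalization helper. This is a minimal sketch, not the paper's implementation, and the max-velocity values below are illustrative:

```python
def normalize_action(v, omega, robot_config):
    """Map robot-specific velocities into a shared [-1, 1] action space."""
    return (v / robot_config["max_linear_vel"],
            omega / robot_config["max_angular_vel"])

def denormalize_action(v_norm, omega_norm, robot_config):
    """Map a shared-space action back to one robot's command range."""
    return (v_norm * robot_config["max_linear_vel"],
            omega_norm * robot_config["max_angular_vel"])

# Illustrative limits -- real values depend on the platform
turtlebot = {"max_linear_vel": 0.26, "max_angular_vel": 1.82}
jackal = {"max_linear_vel": 2.0, "max_angular_vel": 4.0}
```

Training in the shared space and denormalizing per robot is what lets one policy drive platforms with very different speed envelopes.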

Architecture

Observation images (t-k, ..., t)  →  CNN Encoder  ─┐
                                                   ├→  MLP  ─┬→  (v, ω) actions
Goal image                        →  CNN Encoder  ─┘         └→  temporal distance

GNM has two outputs:

  • Action: linear and angular velocity (v, ω) -- what the robot should do next
  • Temporal distance: an estimate of how many steps remain until the goal is reached -- used for planning
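
A minimal PyTorch sketch of this two-head design. The encoders below are stand-ins (the real GNM uses MobileNet-style CNNs), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GNMSketch(nn.Module):
    """Sketch of GNM's shared-context, two-head design (dims illustrative)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-ins for the CNN encoders of the observation stack and goal image
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.goal_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.action_head = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                         nn.Linear(128, 2))  # (v, ω)
        self.dist_head = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1))    # temporal distance

    def forward(self, obs_stack, goal_img):
        # Fuse observation history and goal, then branch into the two heads
        z = torch.cat([self.obs_encoder(obs_stack),
                       self.goal_encoder(goal_img)], dim=-1)
        return self.action_head(z), self.dist_head(z)
```

The key design point is the shared fused embedding: both the action and the distance head read the same observation-plus-goal context.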

Key Results

  • Cross-robot transfer: a model trained on 6 robots deploys on unseen robots (even a quadrotor!) without fine-tuning
  • Positive transfer: the model trained on diverse data outperforms single-robot models
  • Robustness: GNM is robust to sensor degradation (blurry camera, vibration) thanks to the diverse training data

Limitations

  • Outputs only a single action (deterministic) -- cannot model multiple feasible paths
  • No exploration ability -- can only reach known goals
  • Not optimized for long-range navigation

ViNT -- Visual Navigation Transformer (2023)

Paper: ViNT: A Foundation Model for Visual Navigation (Shah et al., CoRL 2023)

From GNM to ViNT

ViNT is an evolution of GNM with three major improvements:

  1. Transformer architecture: replace the CNN with EfficientNet + Transformer, letting the model learn long-range dependencies in the observation history
  2. Diffusion-based subgoal proposals: add exploration ability by generating subgoal images
  3. Massive dataset: train on a much larger dataset -- hundreds of hours from many robots

ViNT Architecture

Observations (t-k, ..., t)
    │
    ▼
EfficientNet Encoder (per frame)
    │
    ▼
Transformer (cross-attention between frames)
    │
    ├──→ Action Head  →  (v, ω) normalized actions
    └──→ Distance Head →  temporal distance to goal

Goal image  →  EfficientNet  →  Goal Token (inject into Transformer)

Difference from GNM:

  • The Transformer lets the model attend to the important frames in its history (e.g., frames containing obstacles)
  • The goal token is injected into the Transformer like a prompt -- similar to prompt-tuning in NLP

Exploration with Diffusion Subgoals

This is ViNT's breakthrough feature. When no goal image is available (the robot needs to explore), ViNT uses a diffusion model to generate subgoal images:

  1. Sample subgoal images from the diffusion model (conditioned on the current observation)
  2. Score each subgoal with ViNT's distance head (choose the most "feasible" one)
  3. Navigate to the selected subgoal
  4. Repeat -- this yields frontier-style exploration behavior
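
The loop above reduces to a simple sample-score-select step. In this sketch, `diffusion_model` and `distance_head` are stand-ins for the trained components -- any callables with these signatures work:

```python
def explore_step(observation, diffusion_model, distance_head, num_samples=8):
    """One ViNT-style exploration step: sample subgoals, score, pick the best.

    `diffusion_model(obs)` proposes a candidate subgoal image;
    `distance_head(obs, subgoal)` predicts temporal distance to it.
    """
    subgoals = [diffusion_model(observation) for _ in range(num_samples)]
    # A lower predicted temporal distance means a more reachable subgoal
    scores = [distance_head(observation, g) for g in subgoals]
    best = min(range(num_samples), key=lambda i: scores[i])
    return subgoals[best]
```

Calling `explore_step` repeatedly, each time navigating to the returned subgoal, produces the frontier-style exploration behavior described above.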

This lets ViNT explore novel environments without a pre-built map -- something classical Nav2 cannot do.

Adaptation with Prompt-Tuning

ViNT can adapt to new tasks without full retraining:

  • GPS waypoints: replace the goal image with a GPS encoding
  • Routing commands: "turn left", "go straight" -- encoded as the goal token
  • Only a small new goal encoder needs training; the backbone stays frozen
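
A sketch of this adaptation recipe in PyTorch. It assumes a 2-D GPS waypoint replaces the goal image; the function name and dimensions are hypothetical, not from the ViNT codebase:

```python
import torch
import torch.nn as nn

def make_adapter(backbone: nn.Module, token_dim: int, gps_dim: int = 2) -> nn.Module:
    """Freeze the pretrained backbone, attach a small trainable goal encoder."""
    for p in backbone.parameters():
        p.requires_grad = False          # backbone stays frozen
    goal_encoder = nn.Sequential(        # only this small module is trained
        nn.Linear(gps_dim, 64), nn.ReLU(), nn.Linear(64, token_dim))
    return goal_encoder
```

During fine-tuning, only `goal_encoder`'s parameters go to the optimizer; its output replaces the goal token fed to the frozen Transformer.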

Results

  • Outperforms GNM on all benchmarks
  • Navigates at kilometer scale via subgoal chaining
  • Zero-shot transfer to 4 new robots (no fine-tuning)
  • Exploration behavior emerges from the diffusion subgoals

NoMaD -- Goal Masked Diffusion Policies (2023)

Paper: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration (Sridhar et al., ICRA 2024)

Problem NoMaD Solves

ViNT still has a limitation: its action output is deterministic (a single action). In reality, at a fork in the road the robot could turn left or right -- both are valid. A deterministic policy outputs the average of the two directions, driving the robot straight into the wall!

NoMaD solves this by generating actions with a diffusion model, which can represent multi-modal action distributions.

NoMaD Architecture

Observations (t-k, ..., t)
    │
    ▼
ViT Encoder (Vision Transformer)
    │
    ▼
Observation Token
    │
    ├──→ Goal Masking Layer  ←  Goal image (or masked)
    │
    ▼
Diffusion Decoder
    │
    ▼
Action trajectory (sequence of future actions)

Goal Masking -- Unifying Navigation and Exploration

Core insight of NoMaD: goal masking. During training:

  • 50% of samples: provide the goal image (goal-conditioned navigation)
  • 50% of samples: mask the goal image (goal-agnostic exploration)

A single model learns both behaviors:

  • With goal: navigate to goal
  • Without goal: explore environment (visit new places)

# NoMaD forward pass (simplified pseudocode)
def nomad_forward(observations, goal_image=None):
    obs_token = vit_encoder(observations)

    if goal_image is not None:
        # Goal provided -- goal-conditioned navigation
        goal_token = vit_encoder(goal_image)
        context = concat(obs_token, goal_token)
    else:
        # Goal masked -- exploration mode
        context = concat(obs_token, mask_token)

    # Diffusion decoder samples a multi-modal action trajectory
    action_trajectory = diffusion_decoder.sample(context)
    return action_trajectory

Diffusion for Action Generation

Instead of a single action, NoMaD generates a trajectory (a sequence of future actions) via diffusion:

  1. Start from Gaussian noise
  2. Iteratively denoise, conditioned on the observation + goal context
  3. Output: a trajectory of several steps (e.g., 8 future waypoints)

Advantages of diffusion:

  • Multi-modal: can generate multiple feasible trajectories
  • Smooth: trajectories are naturally smooth, not jerky
  • Flexible: easy to add constraints
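
The sampling procedure in steps 1-3 can be sketched abstractly. Here `denoise_fn` stands in for the trained noise-prediction network, and the step count is illustrative:

```python
import numpy as np

def sample_trajectory(denoise_fn, context, horizon=8, action_dim=2,
                      steps=10, seed=None):
    """Diffusion-style action sampling sketch: start from Gaussian noise,
    iteratively refine toward a clean action trajectory.

    `denoise_fn(traj, context, t)` should return a less-noisy trajectory;
    here it is an arbitrary callable, not the real NoMaD decoder.
    """
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((horizon, action_dim))  # step 1: pure noise
    for t in reversed(range(steps)):                   # step 2: denoising loop
        traj = denoise_fn(traj, context, t)
    return traj                                        # step 3: (horizon, 2) waypoints
```

Because each call starts from fresh noise, repeated sampling can land in different modes -- e.g., the left branch on one call and the right branch on another -- which is exactly what a deterministic head cannot do.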

Results

  • Navigation: outperforms ViNT and GNM in real-world tests
  • Exploration: explores efficiently, with fewer collisions than ViNT
  • Compact: 70M parameters -- smaller than ViNT yet more effective
  • Real-time: runs onboard an NVIDIA Jetson Orin

Comparing 3 Models

| Criterion     | GNM           | ViNT                       | NoMaD                    |
|---------------|---------------|----------------------------|--------------------------|
| Year          | 2022          | 2023                       | 2023                     |
| Architecture  | CNN + MLP     | EfficientNet + Transformer | ViT + Diffusion          |
| Action output | Single (v, ω) | Single (v, ω)              | Trajectory (multi-modal) |
| Exploration   | No            | Yes (diffusion subgoals)   | Yes (goal masking)       |
| Cross-robot   | 6 robots      | More                       | More                     |
| Long-range    | Limited       | Km-scale                   | Km-scale                 |
| Real-time     | Yes           | Yes                        | Yes (Jetson Orin)        |
| Training data | ~60 h         | Hundreds of hours          | Hundreds of hours        |

Evolution of Ideas

GNM (2022)           ViNT (2023)              NoMaD (2023)
─────────           ──────────              ──────────
CNN backbone   →   Transformer backbone   →  ViT backbone
Single action  →   Single action          →  Diffusion trajectory
No exploration →   Diffusion subgoals     →  Goal masking
Basic dataset  →   Massive dataset        →  Same massive dataset

Comparing navigation methods from classic to learning-based

Real-World Applications and Limitations

When to Use Learning-Based Navigation?

Use it when:

  • Environment is unstructured, cannot be pre-mapped (forest, outdoors)
  • Need to generalize quickly to new robots
  • Environment has many dynamic obstacles (people walking)
  • Need exploration in unknown environments

Not yet ready when:

  • The environment is structured and a static map is available (factory) -- Nav2 is still better
  • You need absolute safety guarantees (certified safety) -- learning-based methods don't provide them
  • Hardware is constrained -- these models need a GPU (at least a Jetson Orin)

Deploy on Real Robot

# Clone official codebase
git clone https://github.com/robodhruv/visualnav-transformer.git
cd visualnav-transformer

# Install
pip install -r requirements.txt

# Download pretrained checkpoint
# GNM, ViNT, NoMaD checkpoints available

# Run on robot
python deployment/deploy_nomad.py \
  --model nomad \
  --checkpoint checkpoints/nomad.pth \
  --robot locobot  # or jackal, turtlebot, custom

Hardware Requirements

  • Minimum: NVIDIA Jetson Orin Nano (NoMaD runs ~10 Hz)
  • Recommended: Jetson AGX Orin (NoMaD runs ~20 Hz)
  • Camera: any RGB camera (RealSense, USB webcam)

Larger Foundation Models

Latest research scaling up navigation models:

  • Train on more data: YouTube videos, driving datasets, indoor datasets
  • Larger models: from 70M (NoMaD) to 300M+ parameters
  • Multi-task: not just navigation but also manipulation, exploration

Combining with VLMs (Vision-Language Models)

Use natural language instead of goal image to direct robot: "go to kitchen" -- this is Vision-Language Navigation (VLN), topic of Part 4 in this series.

Sim-to-Real for Navigation

Train navigation policy in simulation then transfer to real robot -- combine GNM/ViNT backbone with simulated diverse environments.

Up Next in Series

This is Part 3 of Modern Navigation series:



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
