
Learning-based Navigation: GNM, ViNT and NoMaD

Exploring foundation models for robot navigation -- GNM, ViNT, and NoMaD from Berkeley -- and how they are changing the way robots move.

Nguyen Anh Tuan · February 12, 2026 · 8 min read

From Classic Navigation to Learning-Based

In Part 1 and Part 2 of this series, we explored SLAM and Nav2 -- classical navigation methods built on geometric reasoning and hand-crafted planners. They work well in structured environments (factories, warehouses) but struggle in dynamic, unstructured, or previously unseen environments.

Learning-based navigation takes a different approach: instead of hand-coding rules, it learns from data. Just as NLP has foundation models (GPT, BERT), robot navigation is getting foundation models of its own.

In this post, I'll analyze three landmark papers from BAIR (Berkeley AI Research), from Sergey Levine's group: GNM, ViNT, and NoMaD -- a sequence of works that is shaping learning-based navigation.

Deep learning models for robot navigation from real-world data

GNM -- General Navigation Model (2022)

Paper: GNM: A General Navigation Model to Drive Any Robot (Shah et al., ICRA 2023)

Problem GNM Solves

Before GNM, each robot needed its own navigation policy: a policy trained on a TurtleBot didn't work on a Jackal, and vice versa. GNM asks: can we train a single model that works across multiple different robots?

Approach

GNM is a goal-conditioned navigation policy trained on data from multiple robot types. Core ideas:

  1. Data aggregation: collect navigation data from 6 different robot types (Jackal, TurtleBot, Spot, drone, etc.), ~60 hours in total
  2. Goal representation: use a goal image -- a picture of the destination the robot should reach
  3. Temporal context: instead of only the current frame, GNM uses a sequence of images (observation history) to infer motion
  4. Normalized action space: normalize the actions (linear and angular velocity) across robots with different sizes and kinematics
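The normalized action space (idea 4) can be sketched as a simple scaling step. This is an illustrative sketch, not code from the paper; the robot names and velocity limits here are hypothetical:

```python
# Sketch of GNM-style action normalization (illustrative limits, not from the paper).
# Each robot's (v, ω) commands are scaled by that robot's own limits, so the
# shared policy works in a robot-agnostic action space in [-1, 1].

ROBOT_LIMITS = {                 # hypothetical (max linear m/s, max angular rad/s)
    "turtlebot": (0.5, 1.0),
    "jackal":    (2.0, 2.0),
}

def normalize_action(robot, v, omega):
    """Raw robot command -> normalized action the policy is trained on."""
    v_max, w_max = ROBOT_LIMITS[robot]
    return (v / v_max, omega / w_max)

def denormalize_action(robot, v_norm, w_norm):
    """Normalized policy output -> executable command for this robot."""
    v_max, w_max = ROBOT_LIMITS[robot]
    return (v_norm * v_max, w_norm * w_max)
```

The same normalized output then maps to different raw velocities on each platform, which is what lets one policy drive robots with different kinematics.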

Architecture

Observation images (t-k, ..., t)  →  CNN Encoder  →  ┐
                                                       ├→  MLP  →  (v, ω) actions
Goal image                        →  CNN Encoder  →  ┘
                                                       └→  MLP  →  temporal distance

GNM has two outputs: the normalized action (v, ω) and the temporal distance to the goal.
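The two-headed structure in the diagram can be sketched in a few lines of numpy. The encoder and weight matrices below are stand-ins (the real model uses a trained CNN), so this only illustrates the data flow, not the actual network:

```python
import numpy as np

def encode(img):
    # Stand-in for the CNN encoder: flatten to a fixed-size feature vector
    return img.reshape(-1)[:64]

def gnm_forward(obs_images, goal_image, W_act, w_dist):
    # Concatenate the encoded observation history with the goal embedding
    feats = np.concatenate([encode(o) for o in obs_images] + [encode(goal_image)])
    action = np.tanh(W_act @ feats)             # head 1: (v, ω) in [-1, 1]
    distance = np.maximum(w_dist @ feats, 0.0)  # head 2: temporal distance ≥ 0
    return action, distance
```

The distance head is what later enables topological navigation: it tells the robot how many steps away a goal image is, without any metric map.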

Key Results

Limitations

ViNT -- Visual Navigation Transformer (2023)

Paper: ViNT: A Foundation Model for Visual Navigation (Shah et al., CoRL 2023)

From GNM to ViNT

ViNT is an evolution of GNM with three major improvements:

  1. Transformer architecture: replace the CNN with EfficientNet + a Transformer, letting the model learn long-range dependencies in the observation history
  2. Diffusion-based subgoal proposals: add exploration capability by generating subgoal images
  3. Massive dataset: train on a much larger dataset -- hundreds of hours from many robots

ViNT Architecture

Observations (t-k, ..., t)
    │
    ▼
EfficientNet Encoder (per frame)
    │
    ▼
Transformer (cross-attention between frames)
    │
    ├──→ Action Head  →  (v, ω) normalized actions
    └──→ Distance Head →  temporal distance to goal

Goal image  →  EfficientNet  →  Goal Token (inject into Transformer)

Differences from GNM: the CNN backbone is replaced by EfficientNet + Transformer, and the goal image is injected as a token into the Transformer rather than concatenated with the observation features.

Exploration with Diffusion Subgoals

This is the breakthrough feature of ViNT. When no goal image is available (the robot needs to explore), ViNT uses a diffusion model to generate subgoal images:

  1. Sample subgoal images from the diffusion model (conditioned on the current observation)
  2. Score each subgoal with ViNT's distance head (choose the most "feasible" subgoal)
  3. Navigate to the selected subgoal
  4. Repeat -- this produces frontier-exploration behavior
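The loop above can be sketched as follows. Both `sample_subgoals` and `temporal_distance` are toy stand-ins for the diffusion model and ViNT's distance head; only the select-best-and-repeat structure reflects the method:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subgoals(obs, n=8):
    # Stand-in for the diffusion model: propose n candidate subgoal "images"
    return [obs + rng.normal(scale=0.1, size=obs.shape) for _ in range(n)]

def temporal_distance(obs, subgoal):
    # Stand-in for ViNT's distance head: score how reachable a subgoal looks
    return float(np.linalg.norm(subgoal - obs))

def explore_step(obs):
    candidates = sample_subgoals(obs)
    # Pick the subgoal the distance head considers most feasible (lowest distance)
    best = min(candidates, key=lambda g: temporal_distance(obs, g))
    return best  # the robot would navigate_to(best) here, then repeat
```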

This lets ViNT explore novel environments without a pre-built map -- something the classical Nav2 stack cannot do.

Adaptation with Prompt-Tuning

ViNT can adapt to new tasks without full retraining:

Results

NoMaD -- Goal Masked Diffusion Policies (2023)

Paper: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration (Sridhar et al., ICRA 2024)

Problem NoMaD Solves

ViNT still has a limitation: its action output is deterministic (a single action only). In reality, at a fork in the road, the robot could turn left or right -- both are valid. A deterministic policy outputs the average of the two directions, sending the robot straight into the wall between them!

NoMaD solves this by using a diffusion model to generate actions, which can represent multi-modal action distributions.
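The mode-averaging failure is easy to demonstrate with two numbers. A regressor trained with mean-squared error on two equally valid steering commands converges to their mean; a multi-modal policy instead samples one of the modes (the values below are illustrative):

```python
import numpy as np

# Two equally valid expert actions at a fork: steer left or steer right.
expert_omegas = np.array([-1.0, +1.0])    # angular velocities (illustrative)

# A deterministic regressor trained with MSE converges to the mean:
mse_optimal = expert_omegas.mean()        # 0.0 -> drive straight into the wall

# A multi-modal policy (e.g. diffusion) instead commits to one mode:
rng = np.random.default_rng(0)
sampled = rng.choice(expert_omegas)       # either -1.0 or +1.0, never 0.0
```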

NoMaD Architecture

Observations (t-k, ..., t)
    │
    ▼
ViT Encoder (Vision Transformer)
    │
    ▼
Observation Token
    │
    ├──→ Goal Masking Layer  ←  Goal image (or masked)
    │
    ▼
Diffusion Decoder
    │
    ▼
Action trajectory (sequence of future actions)

Goal Masking -- Unifying Navigation and Exploration

The core insight of NoMaD is goal masking. During training, the goal token is randomly masked with some probability, so the model sees both goal-conditioned and goal-free examples.

A single model thus learns both behaviors: goal-reaching navigation (goal visible) and undirected exploration (goal masked):

# NoMaD pseudocode
def nomad_forward(observations, goal_image=None):
    obs_token = vit_encoder(observations)
    
    if goal_image is not None:
        goal_token = vit_encoder(goal_image)
        context = concat(obs_token, goal_token)
    else:
        # Mask goal -- exploration mode
        context = concat(obs_token, mask_token)
    
    # Diffusion generates multi-modal actions
    action_trajectory = diffusion_decoder.sample(context)
    return action_trajectory

Diffusion for Action Generation

Instead of a single action, NoMaD generates a trajectory (a sequence of future actions) via diffusion:

  1. Start from Gaussian noise
  2. Iteratively denoise, conditioned on the observation + goal context
  3. Output: a trajectory of multiple steps (e.g., 8 future waypoints)
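The three steps above can be sketched as a denoising loop. `predict_noise` here is a toy stand-in for the trained diffusion decoder (the real sampler also uses a noise schedule and timestep embeddings); only the noise-to-trajectory structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 8, 2, 10   # horizon (waypoints), action dim (v, ω), denoising steps

def predict_noise(traj, t, context):
    # Stand-in for the trained diffusion decoder ε_θ(x_t, t, context);
    # a toy model that just shrinks the trajectory toward zero
    return 0.1 * traj

def sample_trajectory(context):
    traj = rng.normal(size=(H, D))        # 1. start from Gaussian noise
    for t in reversed(range(T)):          # 2. iteratively denoise
        traj = traj - predict_noise(traj, t, context)
    return traj                           # 3. an (H, D) action trajectory
```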

Advantages of diffusion: it can represent multi-modal action distributions, so at a fork the policy commits to one valid direction instead of averaging them.

Results

Comparing 3 Models

Criterion        GNM             ViNT                        NoMaD
Year             2022            2023                        2023
Architecture     CNN + MLP       EfficientNet + Transformer  ViT + Diffusion
Action output    Single (v, ω)   Single (v, ω)               Trajectory (multi-modal)
Exploration      No              Yes (diffusion subgoals)    Yes (goal masking)
Cross-robot      6 robots        More                        More
Long-range       Limited         Km-scale                    Km-scale
Real-time        Yes             Yes                         Yes (Jetson Orin)
Training data    ~60 h           Hundreds of hours           Hundreds of hours

Evolution of Ideas

GNM (2022)           ViNT (2023)              NoMaD (2023)
─────────           ──────────              ──────────
CNN backbone   →   Transformer backbone   →  ViT backbone
Single action  →   Single action          →  Diffusion trajectory
No exploration →   Diffusion subgoals     →  Goal masking
Basic dataset  →   Massive dataset        →  Same massive dataset

Comparing navigation methods from classic to learning-based

Real-World Applications and Limitations

When to Use Learning-Based Navigation?

Use it when:

It is not yet ready when:

Deploy on Real Robot

# Clone official codebase
git clone https://github.com/robodhruv/visualnav-transformer.git
cd visualnav-transformer

# Install
pip install -r requirements.txt

# Download pretrained checkpoint
# GNM, ViNT, NoMaD checkpoints available

# Run on robot
python deployment/deploy_nomad.py \
  --model nomad \
  --checkpoint checkpoints/nomad.pth \
  --robot locobot  # or jackal, turtlebot, custom

Hardware Requirements

Future Trends

Larger Foundation Models

The latest research is scaling up navigation models:

Combining with VLMs (Vision-Language Models)

Using natural language instead of a goal image to direct the robot -- "go to the kitchen" -- is Vision-Language Navigation (VLN), the topic of Part 4 in this series.

Sim-to-Real for Navigation

Train the navigation policy in simulation, then transfer it to a real robot -- combining the GNM/ViNT backbone with diverse simulated environments.

Up Next in Series

This is Part 3 of the Modern Navigation series:

  - Part 4: Vision-Language Navigation -- robots following natural-language instructions
  - Part 5: Outdoor Navigation and Multi-Robot Coordination

