
Learning-based Navigation: GNM, ViNT and NoMaD

Explore foundation models for robot navigation -- GNM, ViNT, NoMaD from Berkeley and how they change how robots move.

Nguyen Anh Tuan · February 12, 2026 · 8 min read

From Classic Navigation to Learning-Based

In Part 1 and Part 2 of this series, we explored SLAM and Nav2 -- classical navigation methods based on geometric reasoning and hand-crafted planners. They work well in structured environments (factories, warehouses) but struggle when:

  • The environment has no map and cannot be mapped in advance
  • The terrain is complex and changing (outdoors, off-road)
  • The policy must generalize to new robots and environments without retraining

Learning-based navigation takes a different approach: instead of hand-coding rules, it learns from data. Just as NLP has foundation models (GPT, BERT), robot navigation is getting foundation models of its own.

In this post, I'll analyze three landmark papers from BAIR (Berkeley AI Research), the group led by Sergey Levine: GNM, ViNT, and NoMaD -- a sequence of works that is shaping learning-based navigation.

Deep learning models for robot navigation from real-world data

GNM -- General Navigation Model (2022)

Paper: GNM: A General Navigation Model to Drive Any Robot (Shah et al., ICRA 2023)

Problem GNM Solves

Before GNM, each robot needed its own navigation policy. A policy trained on a TurtleBot didn't work on a Jackal, and vice versa. GNM asks: can we train one model that works across many different robots?

Approach

GNM is a goal-conditioned navigation policy trained on data from multiple robot types. Core ideas:

  1. Data aggregation: collect navigation data from 6 different robot types (Jackal, TurtleBot, Spot, drone, etc.), ~60 hours in total
  2. Goal representation: use a goal image -- a picture of the destination the robot should reach
  3. Temporal context: instead of just the current frame, GNM uses a sequence of images (observation history) to understand motion
  4. Normalized action space: normalize actions (linear and angular velocity) across robots with different sizes and kinematics
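
Core idea 4 can be sketched with a simple normalization helper. This is a minimal sketch, not the paper's implementation, and the max-velocity values below are illustrative:

```python
def normalize_action(v, omega, robot_config):
    """Map robot-specific velocities into a shared [-1, 1] action space."""
    return (v / robot_config["max_linear_vel"],
            omega / robot_config["max_angular_vel"])

def denormalize_action(v_norm, omega_norm, robot_config):
    """Map a shared-space action back to one robot's command range."""
    return (v_norm * robot_config["max_linear_vel"],
            omega_norm * robot_config["max_angular_vel"])

# Illustrative limits -- real values depend on the platform
turtlebot = {"max_linear_vel": 0.26, "max_angular_vel": 1.82}
jackal = {"max_linear_vel": 2.0, "max_angular_vel": 4.0}
```

Training in the shared space and denormalizing per robot is what lets one policy drive platforms with very different speed envelopes.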

Architecture

Observation images (t-k, ..., t)  →  CNN Encoder  ─┐
                                                   ├→  MLP  ─┬→  (v, ω) actions
Goal image                        →  CNN Encoder  ─┘         └→  temporal distance

GNM has two outputs:

  • Action: linear and angular velocity (v, ω) -- what the robot should do next
  • Temporal distance: an estimate of how many steps remain until the goal is reached -- used for planning
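
A minimal PyTorch sketch of this two-head design. The encoders below are stand-ins (the real GNM uses MobileNet-style CNNs), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GNMSketch(nn.Module):
    """Sketch of GNM's shared-context, two-head design (dims illustrative)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-ins for the CNN encoders of the observation stack and goal image
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.goal_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.action_head = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                         nn.Linear(128, 2))  # (v, ω)
        self.dist_head = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1))    # temporal distance

    def forward(self, obs_stack, goal_img):
        # Fuse observation history and goal, then branch into the two heads
        z = torch.cat([self.obs_encoder(obs_stack),
                       self.goal_encoder(goal_img)], dim=-1)
        return self.action_head(z), self.dist_head(z)
```

The key design point is the shared fused embedding: both the action and the distance head read the same observation-plus-goal context.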

Key Results

  • Cross-robot transfer: a model trained on 6 robots deploys on unseen robots (even a quadrotor!) without fine-tuning
  • Positive transfer: the model trained on diverse data outperforms single-robot models
  • Robustness: GNM is robust to sensor degradation (blurry camera, vibration) thanks to the diverse training data

Limitations

  • Outputs only a single action (deterministic) -- cannot model multiple feasible paths
  • No exploration ability -- can only reach known goals
  • Not optimized for long-range navigation

ViNT -- Visual Navigation Transformer (2023)

Paper: ViNT: A Foundation Model for Visual Navigation (Shah et al., CoRL 2023)

From GNM to ViNT

ViNT is an evolution of GNM with three major improvements:

  1. Transformer architecture: replace the CNN with EfficientNet + Transformer, letting the model learn long-range dependencies in the observation history
  2. Diffusion-based subgoal proposals: add exploration ability by generating subgoal images
  3. Massive dataset: train on a much larger dataset -- hundreds of hours from many robots

ViNT Architecture

Observations (t-k, ..., t)
    │
    ▼
EfficientNet Encoder (per frame)
    │
    ▼
Transformer (cross-attention between frames)
    │
    ├──→ Action Head  →  (v, ω) normalized actions
    └──→ Distance Head →  temporal distance to goal

Goal image  →  EfficientNet  →  Goal Token (inject into Transformer)

Difference from GNM:

  • The Transformer lets the model attend to the important frames in its history (e.g., frames containing obstacles)
  • The goal token is injected into the Transformer like a prompt -- similar to prompt-tuning in NLP

Exploration with Diffusion Subgoals

This is ViNT's breakthrough feature. When no goal image is available (the robot needs to explore), ViNT uses a diffusion model to generate subgoal images:

  1. Sample subgoal images from the diffusion model (conditioned on the current observation)
  2. Score each subgoal with ViNT's distance head (choose the most "feasible" one)
  3. Navigate to the selected subgoal
  4. Repeat -- this yields frontier-style exploration behavior
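
The loop above reduces to a simple sample-score-select step. In this sketch, `diffusion_model` and `distance_head` are stand-ins for the trained components -- any callables with these signatures work:

```python
def explore_step(observation, diffusion_model, distance_head, num_samples=8):
    """One ViNT-style exploration step: sample subgoals, score, pick the best.

    `diffusion_model(obs)` proposes a candidate subgoal image;
    `distance_head(obs, subgoal)` predicts temporal distance to it.
    """
    subgoals = [diffusion_model(observation) for _ in range(num_samples)]
    # A lower predicted temporal distance means a more reachable subgoal
    scores = [distance_head(observation, g) for g in subgoals]
    best = min(range(num_samples), key=lambda i: scores[i])
    return subgoals[best]
```

Calling `explore_step` repeatedly, each time navigating to the returned subgoal, produces the frontier-style exploration behavior described above.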

This lets ViNT explore novel environments without a pre-built map -- something classical Nav2 cannot do.

Adaptation with Prompt-Tuning

ViNT can adapt to new tasks without full retraining:

  • GPS waypoints: replace the goal image with a GPS encoding
  • Routing commands: "turn left", "go straight" -- encoded as the goal token
  • Only a small new goal encoder needs training; the backbone stays frozen
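
A sketch of this adaptation recipe in PyTorch. It assumes a 2-D GPS waypoint replaces the goal image; the function name and dimensions are hypothetical, not from the ViNT codebase:

```python
import torch
import torch.nn as nn

def make_adapter(backbone: nn.Module, token_dim: int, gps_dim: int = 2) -> nn.Module:
    """Freeze the pretrained backbone, attach a small trainable goal encoder."""
    for p in backbone.parameters():
        p.requires_grad = False          # backbone stays frozen
    goal_encoder = nn.Sequential(        # only this small module is trained
        nn.Linear(gps_dim, 64), nn.ReLU(), nn.Linear(64, token_dim))
    return goal_encoder
```

During fine-tuning, only `goal_encoder`'s parameters go to the optimizer; its output replaces the goal token fed to the frozen Transformer.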

Results

  • Outperforms GNM on all benchmarks
  • Navigates at kilometer scale via subgoal chaining
  • Zero-shot transfer to 4 new robots (no fine-tuning)
  • Exploration behavior emerges from the diffusion subgoals

NoMaD -- Goal Masked Diffusion Policies (2023)

Paper: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration (Sridhar et al., ICRA 2024)

Problem NoMaD Solves

ViNT still has a limitation: its action output is deterministic (a single action). In reality, at a fork in the road the robot could turn left or right -- both are valid. A deterministic policy outputs the average of the two directions, driving the robot straight into the wall!

NoMaD solves this by generating actions with a diffusion model, which can represent multi-modal action distributions.

NoMaD Architecture

Observations (t-k, ..., t)
    │
    ▼
ViT Encoder (Vision Transformer)
    │
    ▼
Observation Token
    │
    ├──→ Goal Masking Layer  ←  Goal image (or masked)
    │
    ▼
Diffusion Decoder
    │
    ▼
Action trajectory (sequence of future actions)

Goal Masking -- Unifying Navigation and Exploration

Core insight of NoMaD: goal masking. During training:

  • 50% of samples: provide the goal image (goal-conditioned navigation)
  • 50% of samples: mask the goal image (goal-agnostic exploration)

A single model learns both behaviors:

  • With goal: navigate to goal
  • Without goal: explore environment (visit new places)

# NoMaD forward pass (simplified pseudocode)
def nomad_forward(observations, goal_image=None):
    obs_token = vit_encoder(observations)

    if goal_image is not None:
        # Goal provided -- goal-conditioned navigation
        goal_token = vit_encoder(goal_image)
        context = concat(obs_token, goal_token)
    else:
        # Goal masked -- exploration mode
        context = concat(obs_token, mask_token)

    # Diffusion decoder samples a multi-modal action trajectory
    action_trajectory = diffusion_decoder.sample(context)
    return action_trajectory

Diffusion for Action Generation

Instead of a single action, NoMaD generates a trajectory (a sequence of future actions) via diffusion:

  1. Start from Gaussian noise
  2. Iteratively denoise, conditioned on the observation + goal context
  3. Output: a trajectory of several steps (e.g., 8 future waypoints)

Advantages of diffusion:

  • Multi-modal: can generate multiple feasible trajectories
  • Smooth: trajectories are naturally smooth, not jerky
  • Flexible: easy to add constraints
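
The sampling procedure in steps 1-3 can be sketched abstractly. Here `denoise_fn` stands in for the trained noise-prediction network, and the step count is illustrative:

```python
import numpy as np

def sample_trajectory(denoise_fn, context, horizon=8, action_dim=2,
                      steps=10, seed=None):
    """Diffusion-style action sampling sketch: start from Gaussian noise,
    iteratively refine toward a clean action trajectory.

    `denoise_fn(traj, context, t)` should return a less-noisy trajectory;
    here it is an arbitrary callable, not the real NoMaD decoder.
    """
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((horizon, action_dim))  # step 1: pure noise
    for t in reversed(range(steps)):                   # step 2: denoising loop
        traj = denoise_fn(traj, context, t)
    return traj                                        # step 3: (horizon, 2) waypoints
```

Because each call starts from fresh noise, repeated sampling can land in different modes -- e.g., the left branch on one call and the right branch on another -- which is exactly what a deterministic head cannot do.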

Results

  • Navigation: outperforms ViNT and GNM in real-world tests
  • Exploration: explores efficiently, with fewer collisions than ViNT
  • Compact: 70M parameters -- smaller than ViNT yet more effective
  • Real-time: runs onboard an NVIDIA Jetson Orin

Comparing 3 Models

| Criterion     | GNM           | ViNT                       | NoMaD                    |
|---------------|---------------|----------------------------|--------------------------|
| Year          | 2022          | 2023                       | 2023                     |
| Architecture  | CNN + MLP     | EfficientNet + Transformer | ViT + Diffusion          |
| Action output | Single (v, ω) | Single (v, ω)              | Trajectory (multi-modal) |
| Exploration   | No            | Yes (diffusion subgoals)   | Yes (goal masking)       |
| Cross-robot   | 6 robots      | More                       | More                     |
| Long-range    | Limited       | Km-scale                   | Km-scale                 |
| Real-time     | Yes           | Yes                        | Yes (Jetson Orin)        |
| Training data | ~60 h         | Hundreds of hours          | Hundreds of hours        |

Evolution of Ideas

GNM (2022)           ViNT (2023)              NoMaD (2023)
─────────           ──────────              ──────────
CNN backbone   →   Transformer backbone   →  ViT backbone
Single action  →   Single action          →  Diffusion trajectory
No exploration →   Diffusion subgoals     →  Goal masking
Basic dataset  →   Massive dataset        →  Same massive dataset

Comparing navigation methods from classic to learning-based

Real-World Applications and Limitations

When to Use Learning-Based Navigation?

Use it when:

  • Environment is unstructured, cannot be pre-mapped (forest, outdoors)
  • Need to generalize quickly to new robots
  • Environment has many dynamic obstacles (people walking)
  • Need exploration in unknown environments

Not yet ready when:

  • The environment is structured and a static map is available (factory) -- Nav2 is still better
  • You need absolute safety guarantees (certified safety) -- learning-based methods don't provide them
  • Hardware is constrained -- these models need a GPU (at least a Jetson Orin)

Deploy on Real Robot

# Clone official codebase
git clone https://github.com/robodhruv/visualnav-transformer.git
cd visualnav-transformer

# Install
pip install -r requirements.txt

# Download pretrained checkpoint
# GNM, ViNT, NoMaD checkpoints available

# Run on robot
python deployment/deploy_nomad.py \
  --model nomad \
  --checkpoint checkpoints/nomad.pth \
  --robot locobot  # or jackal, turtlebot, custom

Hardware Requirements

  • Minimum: NVIDIA Jetson Orin Nano (NoMaD runs ~10 Hz)
  • Recommended: Jetson AGX Orin (NoMaD runs ~20 Hz)
  • Camera: any RGB camera (RealSense, USB webcam)

Larger Foundation Models

Latest research scaling up navigation models:

  • Train on more data: YouTube videos, driving datasets, indoor datasets
  • Larger models: from 70M (NoMaD) to 300M+ parameters
  • Multi-task: not just navigation but also manipulation, exploration

Combining with VLMs (Vision-Language Models)

Use natural language instead of goal image to direct robot: "go to kitchen" -- this is Vision-Language Navigation (VLN), topic of Part 4 in this series.

Sim-to-Real for Navigation

Train navigation policy in simulation then transfer to real robot -- combine GNM/ViNT backbone with simulated diverse environments.

Up Next in Series

This is Part 3 of Modern Navigation series:



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
