From Goal Image to Natural Language
In Part 3, we saw how GNM, ViNT, and NoMaD use a goal image to direct a robot: "go to the place that looks like this". But in reality, humans don't communicate with images -- we say: "go to the kitchen", "turn left at the intersection, then go straight to the end of the hallway".
Vision-and-Language Navigation (VLN) is the problem of a robot understanding and executing natural language instructions in 3D environments. It's one of the hardest problems at the intersection of NLP, computer vision, and robotics.
Why is it hard? Because the robot needs to:
- Understand language: parse complex instructions and resolve references ("the table next to the window")
- Look and recognize: match language with what it sees (grounding)
- Make decisions: choose a direction based on its language and visual understanding
- Handle uncertainty: language is ambiguous and the environment is unfamiliar
Room-to-Room (R2R) -- Foundational Benchmark
R2R Dataset
R2R (Anderson et al., CVPR 2018) is the first and most influential VLN benchmark. Paper: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments.
Setup:
- Environments: 90 buildings from Matterport3D -- photorealistic 3D scans of real houses
- Instructions: 21,567 human-written instructions, averaging 29 words
- Task: the agent starts at a location, reads an instruction, and navigates to the destination
Example instruction:
"Walk out of the bathroom. Turn left and walk down the hall. Turn left and wait in the doorway of the bedroom."
Metrics:
- Success Rate (SR): proportion of episodes where the agent stops within 3 m of the correct destination
- SPL (Success weighted by Path Length): success x shortest_path / max(actual_path, shortest_path), averaged over episodes -- rewards path efficiency
- nDTW: normalized Dynamic Time Warping -- measures how closely the agent's path follows the reference path
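To make the SPL definition concrete, here is a minimal sketch of how it is computed over a batch of episodes (variable names are mine, not from the benchmark code):

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 flags (did the agent stop within 3 m of the goal?).
    shortest_lengths: geodesic shortest-path length per episode.
    actual_lengths: length of the path the agent actually took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)  # efficiency term, clipped at 1
    return total / len(successes)

# A perfect episode scores 1; succeeding via a path twice the optimal
# length scores 0.5; a failed episode scores 0 regardless of path length.
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0]))  # 0.5
```

Note why SPL matters: an agent that wanders the whole building before stumbling onto the goal still "succeeds" under SR, but its SPL is heavily penalized.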
Navigation Graph
R2R uses a navigation graph -- a set of viewpoints (nodes) connected by edges. At each step, the agent:
- Looks at the 360-degree panorama at the current node
- Selects the next node to move to (from the current node's neighbors)
- Repeats until it decides to stop
This is discrete navigation -- the agent teleports between nodes, with no low-level control. Newer benchmarks (VLN-CE) switch to continuous environments with low-level actions.
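The discrete loop above can be sketched in a few lines. This is a toy illustration with an invented graph and a stand-in policy (the real policy is a learned model; here one BFS step toward the goal plays its role):

```python
from collections import deque

# Toy navigation graph: viewpoint id -> reachable neighbor viewpoints.
GRAPH = {
    "bathroom": ["hall_1"],
    "hall_1": ["bathroom", "hall_2"],
    "hall_2": ["hall_1", "bedroom"],
    "bedroom": ["hall_2"],
}

def policy(node, goal):
    """Stand-in for a learned policy: returns one BFS step toward goal."""
    prev = {node: None}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            while prev[cur] != node:  # walk back to the first step taken
                cur = prev[cur]
            return cur
        for nxt in GRAPH[cur]:
            if nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return "STOP"  # goal unreachable: stop

# The VLN loop: look, pick a neighbor, repeat until stopping.
node, path = "bathroom", ["bathroom"]
while node != "bedroom":
    node = policy(node, "bedroom")
    path.append(node)
print(path)  # ['bathroom', 'hall_1', 'hall_2', 'bedroom']
```

The key property of this setting is that moving to a neighbor always succeeds -- there is no drift, collision, or localization error, which is exactly what VLN-CE later removes.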
Stages of VLN Development
Stage 1: Sequence-to-Sequence (2018-2020)
The first models treated VLN as seq2seq: encode the instruction into a vector, then decode a sequence of actions.
Speaker-Follower (Fried et al., NeurIPS 2018):
- Follower: reads the instruction, views the panorama, and chooses an action
- Speaker: watches a trajectory and generates an instruction for it -- used for data augmentation
- The speaker generates new instructions for paths without annotations, expanding the training data roughly tenfold
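The augmentation flow is simple to sketch. In the real system the speaker is a trained seq2seq model; here a template stub stands in for it, just to show how unannotated paths become extra (path, instruction) training pairs:

```python
def speaker(trajectory):
    """Stub speaker: generate a synthetic instruction for an unannotated path.

    The real Speaker-Follower speaker is a neural seq2seq model; this
    template version only illustrates the data-augmentation pipeline.
    """
    return "walk " + ", then ".join(trajectory)

# One human-annotated pair plus two paths with no annotations.
annotated = [(["hall", "kitchen"], "go down the hall into the kitchen")]
unannotated_paths = [["hall", "bedroom"], ["kitchen", "patio"]]

# Augment: every unannotated path gets a synthetic instruction.
augmented = annotated + [(p, speaker(p)) for p in unannotated_paths]
print(len(augmented))  # 3: 1 human pair + 2 synthetic pairs
```

The follower is then trained on the mixed dataset; the synthetic instructions are noisier than human ones, but the extra coverage of the environment outweighs the noise.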
Stage 2: Transformer-Based (2020-2023)
PREVALENT and HAMT (History Aware Multimodal Transformer) brought Transformers to VLN:
- Cross-attention between instruction tokens and visual features
- History encoding -- the agent remembers what it has seen before
- Pre-training on image-text-action triplets
HAMT achieved state-of-the-art results on R2R with ~65% SR (2022), using hierarchical history encoding for long trajectories.
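The cross-attention at the heart of these models can be shown in a few lines of NumPy. This is a single-head sketch (real models use multi-head attention with learned projections; shapes and names here are illustrative):

```python
import numpy as np

def cross_attention(instr_tokens, visual_feats):
    """Single-head cross-attention: instruction tokens query visual features.

    instr_tokens: (T, d) embeddings of instruction words (the queries).
    visual_feats: (V, d) features of panorama views (keys and values).
    Returns (T, d): each token re-expressed as a weighted mix of the
    views it attends to -- the grounding step.
    """
    d = instr_tokens.shape[-1]
    scores = instr_tokens @ visual_feats.T / np.sqrt(d)        # (T, V)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over views
    return weights @ visual_feats                              # (T, d)

rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(5, 8)), rng.normal(size=(12, 8)))
print(out.shape)  # (5, 8): one grounded vector per instruction token
```

Intuitively, a token like "door" ends up attending mostly to the panorama view that contains a door, which is how the model links words to pixels.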
Stage 3: LLM-Based Planning (2023-now)
The explosion of LLMs (GPT-4, LLaMA) opened a new direction: using an LLM as the navigation planner.
LLM-Based Navigation Planning
Core Idea
Instead of training an end-to-end model, use an LLM as the reasoning "brain":
```
Instruction: "Go to kitchen, get cup from table"
        │
        ▼
LLM (GPT-4V / LLaMA)
        │
        ├── Understand: [go to kitchen] → [get cup] → [on table]
        ├── See current: "I'm in hallway, door ahead"
        ├── Reason: "Kitchen usually on 1st floor, has fridge, stove -- not visible yet → continue"
        │
        ▼
Action: "Go forward, through door ahead"
```
NavGPT and LLM-Based Works
NavGPT (Zhou et al., 2023) is one of the first systems to use GPT-4 for VLN:
- Perception module: describes the current scene in text ("I see a hallway with a door on the left")
- LLM reasoning: GPT-4 reads the instruction and the scene description and reasons about the next action
- Action execution: converts the LLM's output into a navigation action
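The three modules can be sketched as a loop. Everything here is illustrative: the LLM call is stubbed out with a keyword check, whereas the real system sends the prompt to the GPT-4 API:

```python
def describe_scene(observation):
    """Perception module: turn a raw observation into text for the LLM.

    In NavGPT this is a captioning/detection pipeline; here it is a
    template over a toy observation dict.
    """
    return f"I see a {observation['scene']} with a {observation['salient']}."

def llm_decide(instruction, scene_text, history):
    """Stub for the LLM reasoning step; returns one discrete action.

    A real implementation would build a prompt from (instruction,
    scene_text, history) and parse the model's reply.
    """
    if "kitchen" in scene_text:
        return "stop"          # goal visible: declare arrival
    return "forward"           # otherwise keep exploring

history = []
obs = {"scene": "hallway", "salient": "door on the left"}
action = llm_decide("Go to the kitchen", describe_scene(obs), history)
history.append(action)
print(action)  # 'forward' -- no kitchen visible yet, keep going
```

The loop structure also makes the latency problem obvious: one LLM round-trip per action means a few seconds per step, which is why later systems plan at a coarser granularity.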
Strengths:
- Zero-shot: no training needed, just prompt engineering
- Transparent reasoning: you can read the LLM's chain of thought (unlike a black-box neural network)
- Common sense: the LLM knows things like "kitchens are usually on the 1st floor" and "toilets are near bedrooms"
Weaknesses:
- Slow: each step needs one LLM call (~1-2 seconds with GPT-4)
- Hallucination: the LLM can "imagine" things that aren't there
- Cost: API calls are expensive for real-time navigation
SayNav and Hierarchical Planning
SayNav uses a hierarchical approach: the LLM creates a high-level plan, and a classical planner executes it:
```
LLM: "To reach kitchen, I need to:
      1. Exit current room
      2. Walk down hallway
      3. Turn right at intersection
      4. Kitchen at end of left hallway"
        │
        ▼
Classical Planner (Nav2): execute each step with obstacle avoidance
```
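The hierarchical split can be sketched as follows. Both layers are stubs (the real LLM returns a plan via an API call, and the real executor would be a planner like Nav2); the point is the control flow between them:

```python
def llm_plan(instruction):
    """Stub for the LLM planner: decompose an instruction into subgoals.

    A real system would prompt an LLM with the instruction plus a scene
    description and parse the numbered plan out of its reply.
    """
    return [
        "exit current room",
        "walk down hallway",
        "turn right at intersection",
        "enter kitchen",
    ]

def execute_subgoal(subgoal):
    """Stub for the low-level planner (e.g. Nav2 with obstacle avoidance)."""
    return True  # pretend every subgoal succeeds in this toy run

completed = []
for step in llm_plan("Go to the kitchen"):
    if not execute_subgoal(step):
        break  # a real system would ask the LLM to replan here
    completed.append(step)
print(len(completed))  # 4
```

Because the LLM is only consulted once per plan (and on failures) rather than once per step, this design sidesteps the per-step latency and cost problems of NavGPT-style loops.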
NaVILA -- Vision-Language-Action for Legged Robots
Paper: NaVILA: Legged Robot Vision-Language-Action Model for Navigation (Cheng et al., RSS 2025)
NaVILA is a recent work that combines a VLA (Vision-Language-Action) model with locomotion skills for legged robots.
Two-Level Architecture
```
Level 1: VLA Model (low frequency ~2 Hz)
  Input:  camera image + language instruction
  Output: mid-level command ("move forward 75cm")
        │
        ▼
Level 2: Locomotion Policy (high frequency ~50 Hz)
  Input:  mid-level command + proprioception
  Output: joint torques
```
Why two levels?
- The VLA runs slowly (~0.5 s per inference) but understands language well
- The locomotion policy runs fast and handles real-time obstacle avoidance
- The separation lets each level run at its appropriate frequency
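A toy timing sketch makes the frequency separation concrete: the VLA replans every 25 control ticks (2 Hz against a 50 Hz control loop), while the locomotion policy runs every tick. Both policies are stubs; only the scheduling is the point:

```python
# Simulate one second of the two-level loop at a 50 Hz control rate.
CONTROL_HZ = 50
VLA_HZ = 2
TICKS_PER_VLA = CONTROL_HZ // VLA_HZ   # VLA fires every 25 ticks

vla_calls, loco_calls = 0, 0
command = None
for tick in range(CONTROL_HZ):          # one simulated second
    if tick % TICKS_PER_VLA == 0:       # 2 Hz: VLA issues a mid-level command
        command = "move forward 75cm"   # placeholder command
        vla_calls += 1
    # 50 Hz: locomotion policy tracks the latest command every tick
    # (in NaVILA this would output joint torques from proprioception).
    loco_calls += 1

print(vla_calls, loco_calls)  # 2 50
```

Between VLA updates the robot is never "blind": the locomotion policy keeps stabilizing and avoiding obstacles under the last command, so slow language reasoning never blocks fast control.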
Results
- 88% success rate on 25 real-world instructions
- 75% success on complex, multi-step instructions
- A Unitree Go2 robot navigates cluttered environments
- Understands instructions like "go to kitchen and find red cup on counter"
Compared to NoMaD
| Criterion | NoMaD | NaVILA |
|---|---|---|
| Input | Goal image | Language instruction |
| Robot | Wheeled | Legged (quadruped) |
| Architecture | ViT + Diffusion | VLA + RL locomotion |
| Speed | Fast (~20 Hz) | 2 Hz (VLA) + 50 Hz (locomotion) |
| Terrain | Flat | Rough terrain |
| Interaction | Image goal | Natural language |
VLN in Continuous Environments
VLN-CE (Continuous Environments)
R2R uses a discrete navigation graph. VLN-CE (Krantz et al., 2020) moves to continuous environments -- the robot controls itself with low-level actions (linear + angular velocity), must avoid obstacles, and can get lost.
It is much harder because:
- No teleportation -- the robot must actually navigate
- Cumulative error -- small errors accumulate over time
- Larger action space -- from a handful of discrete choices to continuous control
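The cumulative-error point is easy to demonstrate numerically: dead-reckoning a 2D pose from velocity commands, a tiny per-step heading bias grows into a large position error over a trajectory. All quantities here are made up for illustration:

```python
import math

def integrate(vel_cmds, heading_bias=0.0):
    """Dead-reckon a 2D pose from (linear, angular) velocity commands.

    heading_bias models a small systematic actuation error per step --
    the kind of imperfection a graph-based agent never experiences.
    """
    x, y, theta = 0.0, 0.0, 0.0
    dt = 0.1  # 10 Hz control
    for v, w in vel_cmds:
        theta += (w + heading_bias) * dt
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
    return x, y

cmds = [(1.0, 0.0)] * 100                   # "drive straight" for 10 s
true_pose = integrate(cmds)                 # ends 10 m ahead
drifted_pose = integrate(cmds, heading_bias=0.02)
error = math.dist(true_pose, drifted_pose)
print(f"drift after 10 m of travel: {error:.2f} m")
```

A discrete R2R agent with the same policy would arrive exactly at its target node; in VLN-CE this drift compounds with every instruction step, which is why success rates drop sharply on the continuous benchmark.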
Embodied VLN
EmbodiedGPT and LEO (Large Embodied Model) are recent works combining:
- 3D scene understanding: understand 3D space from depth/point cloud
- LLM reasoning: reason from language instructions
- Continuous control: output velocity commands directly
Remaining Challenges
1. Grounding -- Connecting Language to Visual
"The table next to the window" -- robot must understand "next to" is spatial relation, "window" is object, and match with what's seen. This is visual grounding, still not fully solved.
2. Ambiguity in Language
"Go to the room" -- which room? "Turn at the intersection" -- which intersection? Natural language is inherently ambiguous. Robot needs to learn to ask for clarification or use common sense to reason.
3. Dynamic Environments
The instruction says "walk down the hallway", but the hallway is full of people and carts. The robot must adapt in real time -- combining VLN with reactive obstacle avoidance.
4. Long-Horizon Tasks
A long instruction ("go to kitchen, get cup, pour water, bring to table") requires the robot to remember what it has done and plan what remains. This needs memory and planning -- areas where LLMs can help.
Practice: Getting Started with VLN
Option 1: R2R with Habitat Simulator
```bash
# Setup Habitat
pip install habitat-sim habitat-lab

# Clone the Matterport3D simulator (hosts the R2R environments)
git clone https://github.com/peteanderson80/Matterport3DSimulator.git

# Download R2R data
python -c "from habitat.datasets.vln import download_r2r; download_r2r()"
```
Option 2: VLN-CE with Habitat 3.0
```bash
# Habitat 3.0 (continuous environments)
pip install habitat-sim==0.3.0 --extra-index-url https://aihabitat.org/pip

# VLN-CE dataset
python -m habitat.datasets.vln_ce.download
```
Option 3: Real Robot with NaVILA
The NaVILA codebase allows deployment on a Unitree Go2 or similar robot. Requirements:
- NVIDIA Jetson AGX Orin (or GPU server)
- RGB camera (front-facing)
- Robot with locomotion controller
Future of VLN
Multimodal Foundation Models
GPT-4o, Gemini 2.0, and other multimodal models blur the line between VLN and general AI. Future robots will be able to:
- Ask for clarification when instruction unclear
- Explain why choosing this path
- Learn from feedback: "not that room, the one next door"
From Navigation to Manipulation
Current VLN only covers movement. The next step is movement + action: "go to kitchen and make coffee" -- which requires combining VLN with manipulation skills.
Up Next in Series
This is Part 4 of Modern Navigation series:
- Part 1: SLAM A to Z -- SLAM Foundation
- Part 2: ROS 2 Nav2 -- Classical Navigation Stack
- Part 3: Learning-based Navigation: GNM, ViNT, NoMaD -- Foundation Models
- Part 5: Outdoor Navigation and Multi-Robot -- GPS-denied Nav, MAPF
Related Posts
- Foundation Models for Robots: RT-2, Octo, OpenVLA -- VLA models for manipulation
- AI Series Part 5: VLA Models -- Vision-Language-Action models overview
- Learning-based Navigation: GNM, ViNT, NoMaD -- Foundation models for navigation
- Humanoid Robotics Guide -- Humanoid robots using VLN