From Goal Image to Natural Language
In Part 3, we saw how GNM, ViNT, and NoMaD use a goal image to direct a robot: "go to the place that looks like this". But in reality, humans don't communicate with images -- we say: "go to the kitchen", "turn left at the intersection, then go straight to the end of the hallway".
Vision-and-Language Navigation (VLN) is the problem of a robot understanding and executing natural language instructions in 3D environments. It's one of the hardest problems at the intersection of NLP, computer vision, and robotics.
Why is it hard? Because the robot needs to:
- Understand language: parse complex instructions and resolve references ("the table next to the window")
- Look and recognize: match language with what it sees (grounding)
- Make decisions: choose a direction based on its language and visual understanding
- Handle uncertainty: language is ambiguous and the environment is unfamiliar
Room-to-Room (R2R) -- Foundational Benchmark
R2R Dataset
R2R (Anderson et al., CVPR 2018) is the first and most influential VLN benchmark. Paper: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments.
Setup:
- Environments: 90 buildings from Matterport3D -- photorealistic 3D scans of real houses
- Instructions: 21,567 human-written instructions, averaging 29 words
- Task: the agent starts at a location, reads an instruction, and navigates to the destination
Example instruction:
"Walk out of the bathroom. Turn left and walk down the hall. Turn left and wait in the doorway of the bedroom."
Metrics:
- Success Rate (SR): proportion of episodes where the agent stops within 3 m of the correct destination
- SPL (Success weighted by Path Length): success x shortest_path / max(actual_path, shortest_path), averaged over episodes -- rewards path efficiency
- nDTW: normalized Dynamic Time Warping -- measures how closely the agent's path follows the reference path
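To make the SPL definition concrete, here is a minimal sketch of how it is computed over a batch of episodes (variable names are mine, not from the benchmark code):

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 flags (did the agent stop within 3 m of the goal?).
    shortest_lengths: geodesic shortest-path length per episode.
    actual_lengths: length of the path the agent actually took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)  # efficiency term, clipped at 1
    return total / len(successes)

# A perfect episode scores 1; succeeding via a path twice the optimal
# length scores 0.5; a failed episode scores 0 regardless of path length.
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0]))  # 0.5
```

Note why SPL matters: an agent that wanders the whole building before stumbling onto the goal still "succeeds" under SR, but its SPL is heavily penalized.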
Navigation Graph
R2R uses a navigation graph -- a set of viewpoints (nodes) connected by edges. At each step, the agent:
- Looks at the 360-degree panorama at the current node
- Selects the next node to move to (from the current node's neighbors)
- Repeats until it decides to stop
This is discrete navigation -- the agent teleports between nodes, with no low-level control. Newer benchmarks (VLN-CE) switch to continuous environments with low-level actions.
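The discrete loop above can be sketched in a few lines. This is a toy illustration with an invented graph and a stand-in policy (the real policy is a learned model; here one BFS step toward the goal plays its role):

```python
from collections import deque

# Toy navigation graph: viewpoint id -> reachable neighbor viewpoints.
GRAPH = {
    "bathroom": ["hall_1"],
    "hall_1": ["bathroom", "hall_2"],
    "hall_2": ["hall_1", "bedroom"],
    "bedroom": ["hall_2"],
}

def policy(node, goal):
    """Stand-in for a learned policy: returns one BFS step toward goal."""
    prev = {node: None}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            while prev[cur] != node:  # walk back to the first step taken
                cur = prev[cur]
            return cur
        for nxt in GRAPH[cur]:
            if nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return "STOP"  # goal unreachable: stop

# The VLN loop: look, pick a neighbor, repeat until stopping.
node, path = "bathroom", ["bathroom"]
while node != "bedroom":
    node = policy(node, "bedroom")
    path.append(node)
print(path)  # ['bathroom', 'hall_1', 'hall_2', 'bedroom']
```

The key property of this setting is that moving to a neighbor always succeeds -- there is no drift, collision, or localization error, which is exactly what VLN-CE later removes.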
Stages of VLN Development
Stage 1: Sequence-to-Sequence (2018-2020)
The first models treated VLN as seq2seq: encode the instruction into a vector, then decode a sequence of actions.
Speaker-Follower (Fried et al., NeurIPS 2018):
- Follower: reads the instruction, views the panorama, and chooses an action
- Speaker: watches a trajectory and generates an instruction for it -- used for data augmentation
- The speaker generates new instructions for paths without annotations, expanding the training data roughly tenfold
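The augmentation flow is simple to sketch. In the real system the speaker is a trained seq2seq model; here a template stub stands in for it, just to show how unannotated paths become extra (path, instruction) training pairs:

```python
def speaker(trajectory):
    """Stub speaker: generate a synthetic instruction for an unannotated path.

    The real Speaker-Follower speaker is a neural seq2seq model; this
    template version only illustrates the data-augmentation pipeline.
    """
    return "walk " + ", then ".join(trajectory)

# One human-annotated pair plus two paths with no annotations.
annotated = [(["hall", "kitchen"], "go down the hall into the kitchen")]
unannotated_paths = [["hall", "bedroom"], ["kitchen", "patio"]]

# Augment: every unannotated path gets a synthetic instruction.
augmented = annotated + [(p, speaker(p)) for p in unannotated_paths]
print(len(augmented))  # 3: 1 human pair + 2 synthetic pairs
```

The follower is then trained on the mixed dataset; the synthetic instructions are noisier than human ones, but the extra coverage of the environment outweighs the noise.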
Stage 2: Transformer-Based (2020-2023)
PREVALENT and HAMT (History Aware Multimodal Transformer) brought Transformers to VLN:
- Cross-attention between instruction tokens and visual features
- History encoding -- the agent remembers what it has seen before
- Pre-training on image-text-action triplets
HAMT achieved state-of-the-art results on R2R with ~65% SR (2022), using hierarchical history encoding for long trajectories.
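The cross-attention at the heart of these models can be shown in a few lines of NumPy. This is a single-head sketch (real models use multi-head attention with learned projections; shapes and names here are illustrative):

```python
import numpy as np

def cross_attention(instr_tokens, visual_feats):
    """Single-head cross-attention: instruction tokens query visual features.

    instr_tokens: (T, d) embeddings of instruction words (the queries).
    visual_feats: (V, d) features of panorama views (keys and values).
    Returns (T, d): each token re-expressed as a weighted mix of the
    views it attends to -- the grounding step.
    """
    d = instr_tokens.shape[-1]
    scores = instr_tokens @ visual_feats.T / np.sqrt(d)        # (T, V)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over views
    return weights @ visual_feats                              # (T, d)

rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(5, 8)), rng.normal(size=(12, 8)))
print(out.shape)  # (5, 8): one grounded vector per instruction token
```

Intuitively, a token like "door" ends up attending mostly to the panorama view that contains a door, which is how the model links words to pixels.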
Stage 3: LLM-Based Planning (2023-now)
The explosion of LLMs (GPT-4, LLaMA) opened a new direction: using an LLM as the navigation planner.
LLM-Based Navigation Planning
Core Idea
Instead of training an end-to-end model, use an LLM as the reasoning "brain":
```
Instruction: "Go to kitchen, get cup from table"
        │
        ▼
LLM (GPT-4V / LLaMA)
        │
        ├── Understand: [go to kitchen] → [get cup] → [on table]
        ├── See current: "I'm in hallway, door ahead"
        ├── Reason: "Kitchen usually on 1st floor, has fridge, stove -- not visible yet → continue"
        │
        ▼
Action: "Go forward, through door ahead"
```
NavGPT and LLM-Based Works
NavGPT (Zhou et al., 2023) is one of the first systems to use GPT-4 for VLN:
- Perception module: describes the current scene in text ("I see a hallway with a door on the left")
- LLM reasoning: GPT-4 reads the instruction and the scene description and reasons about the next action
- Action execution: converts the LLM's output into a navigation action
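The three modules can be sketched as a loop. Everything here is illustrative: the LLM call is stubbed out with a keyword check, whereas the real system sends the prompt to the GPT-4 API:

```python
def describe_scene(observation):
    """Perception module: turn a raw observation into text for the LLM.

    In NavGPT this is a captioning/detection pipeline; here it is a
    template over a toy observation dict.
    """
    return f"I see a {observation['scene']} with a {observation['salient']}."

def llm_decide(instruction, scene_text, history):
    """Stub for the LLM reasoning step; returns one discrete action.

    A real implementation would build a prompt from (instruction,
    scene_text, history) and parse the model's reply.
    """
    if "kitchen" in scene_text:
        return "stop"          # goal visible: declare arrival
    return "forward"           # otherwise keep exploring

history = []
obs = {"scene": "hallway", "salient": "door on the left"}
action = llm_decide("Go to the kitchen", describe_scene(obs), history)
history.append(action)
print(action)  # 'forward' -- no kitchen visible yet, keep going
```

The loop structure also makes the latency problem obvious: one LLM round-trip per action means a few seconds per step, which is why later systems plan at a coarser granularity.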
Strengths:
- Zero-shot: no training needed, just prompt engineering
- Transparent reasoning: you can read the LLM's chain of thought (unlike a black-box neural network)
- Common sense: the LLM knows things like "kitchens are usually on the 1st floor" and "toilets are near bedrooms"
Weaknesses:
- Slow: each step needs one LLM call (~1-2 seconds with GPT-4)
- Hallucination: the LLM can "imagine" things that aren't there
- Cost: API calls are expensive for real-time navigation
SayNav and Hierarchical Planning
SayNav uses a hierarchical approach: the LLM creates a high-level plan, and a classical planner executes it:
```
LLM: "To reach kitchen, I need to:
      1. Exit current room
      2. Walk down hallway
      3. Turn right at intersection
      4. Kitchen at end of left hallway"
        │
        ▼
Classical Planner (Nav2): execute each step with obstacle avoidance
```
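The hierarchical split can be sketched as follows. Both layers are stubs (the real LLM returns a plan via an API call, and the real executor would be a planner like Nav2); the point is the control flow between them:

```python
def llm_plan(instruction):
    """Stub for the LLM planner: decompose an instruction into subgoals.

    A real system would prompt an LLM with the instruction plus a scene
    description and parse the numbered plan out of its reply.
    """
    return [
        "exit current room",
        "walk down hallway",
        "turn right at intersection",
        "enter kitchen",
    ]

def execute_subgoal(subgoal):
    """Stub for the low-level planner (e.g. Nav2 with obstacle avoidance)."""
    return True  # pretend every subgoal succeeds in this toy run

completed = []
for step in llm_plan("Go to the kitchen"):
    if not execute_subgoal(step):
        break  # a real system would ask the LLM to replan here
    completed.append(step)
print(len(completed))  # 4
```

Because the LLM is only consulted once per plan (and on failures) rather than once per step, this design sidesteps the per-step latency and cost problems of NavGPT-style loops.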
NaVILA -- Vision-Language-Action for Legged Robots
Paper: NaVILA: Legged Robot Vision-Language-Action Model for Navigation (Cheng et al., RSS 2025)
NaVILA is a recent work that combines a VLA (Vision-Language-Action) model with locomotion skills for legged robots.
Two-Level Architecture
```
Level 1: VLA Model (low frequency ~2 Hz)
  Input:  camera image + language instruction
  Output: mid-level command ("move forward 75cm")
        │
        ▼
Level 2: Locomotion Policy (high frequency ~50 Hz)
  Input:  mid-level command + proprioception
  Output: joint torques
```
Why two levels?
- The VLA runs slowly (~0.5 s per inference) but understands language well
- The locomotion policy runs fast and handles real-time obstacle avoidance
- The separation lets each level run at its appropriate frequency
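A toy timing sketch makes the frequency separation concrete: the VLA replans every 25 control ticks (2 Hz against a 50 Hz control loop), while the locomotion policy runs every tick. Both policies are stubs; only the scheduling is the point:

```python
# Simulate one second of the two-level loop at a 50 Hz control rate.
CONTROL_HZ = 50
VLA_HZ = 2
TICKS_PER_VLA = CONTROL_HZ // VLA_HZ   # VLA fires every 25 ticks

vla_calls, loco_calls = 0, 0
command = None
for tick in range(CONTROL_HZ):          # one simulated second
    if tick % TICKS_PER_VLA == 0:       # 2 Hz: VLA issues a mid-level command
        command = "move forward 75cm"   # placeholder command
        vla_calls += 1
    # 50 Hz: locomotion policy tracks the latest command every tick
    # (in NaVILA this would output joint torques from proprioception).
    loco_calls += 1

print(vla_calls, loco_calls)  # 2 50
```

Between VLA updates the robot is never "blind": the locomotion policy keeps stabilizing and avoiding obstacles under the last command, so slow language reasoning never blocks fast control.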
Results
- 88% success rate on 25 real-world instructions
- 75% success on complex, multi-step instructions
- A Unitree Go2 robot navigates cluttered environments
- Understands instructions like "go to kitchen and find red cup on counter"
Compared to NoMaD
| Criterion | NoMaD | NaVILA |
|---|---|---|
| Input | Goal image | Language instruction |
| Robot | Wheeled | Legged (quadruped) |
| Architecture | ViT + Diffusion | VLA + RL locomotion |
| Speed | Fast (~20 Hz) | 2 Hz (VLA) + 50 Hz (locomotion) |
| Terrain | Flat | Rough terrain |
| Interaction | Image goal | Natural language |
VLN in Continuous Environments
VLN-CE (Continuous Environments)
R2R uses a discrete navigation graph. VLN-CE (Krantz et al., 2020) moves to continuous environments -- the robot controls itself with low-level actions (linear + angular velocity), must avoid obstacles, and can get lost.
It is much harder because:
- No teleportation -- the robot must actually navigate
- Cumulative error -- small errors accumulate over time
- Larger action space -- from a handful of discrete choices to continuous control
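The cumulative-error point is easy to demonstrate numerically: dead-reckoning a 2D pose from velocity commands, a tiny per-step heading bias grows into a large position error over a trajectory. All quantities here are made up for illustration:

```python
import math

def integrate(vel_cmds, heading_bias=0.0):
    """Dead-reckon a 2D pose from (linear, angular) velocity commands.

    heading_bias models a small systematic actuation error per step --
    the kind of imperfection a graph-based agent never experiences.
    """
    x, y, theta = 0.0, 0.0, 0.0
    dt = 0.1  # 10 Hz control
    for v, w in vel_cmds:
        theta += (w + heading_bias) * dt
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
    return x, y

cmds = [(1.0, 0.0)] * 100                   # "drive straight" for 10 s
true_pose = integrate(cmds)                 # ends 10 m ahead
drifted_pose = integrate(cmds, heading_bias=0.02)
error = math.dist(true_pose, drifted_pose)
print(f"drift after 10 m of travel: {error:.2f} m")
```

A discrete R2R agent with the same policy would arrive exactly at its target node; in VLN-CE this drift compounds with every instruction step, which is why success rates drop sharply on the continuous benchmark.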
Embodied VLN
EmbodiedGPT and LEO (Large Embodied Model) are recent works combining:
- 3D scene understanding: understand 3D space from depth/point cloud
- LLM reasoning: reason from language instructions
- Continuous control: output velocity commands directly
Remaining Challenges
1. Grounding -- Connecting Language to Visual
"The table next to the window" -- robot must understand "next to" is spatial relation, "window" is object, and match with what's seen. This is visual grounding, still not fully solved.
2. Ambiguity in Language
"Go to the room" -- which room? "Turn at the intersection" -- which intersection? Natural language is inherently ambiguous. Robot needs to learn to ask for clarification or use common sense to reason.
3. Dynamic Environments
The instruction says "walk down the hallway", but the hallway is full of people and carts. The robot must adapt in real time -- combining VLN with reactive obstacle avoidance.
4. Long-Horizon Tasks
A long instruction ("go to kitchen, get cup, pour water, bring to table") requires the robot to remember what it has done and plan what remains. This needs memory and planning -- areas where LLMs can help.
Practice: Getting Started with VLN
Option 1: R2R with Habitat Simulator
```bash
# Setup Habitat
pip install habitat-sim habitat-lab

# Clone the Matterport3D simulator (hosts the R2R environments)
git clone https://github.com/peteanderson80/Matterport3DSimulator.git

# Download R2R data
python -c "from habitat.datasets.vln import download_r2r; download_r2r()"
```
Option 2: VLN-CE with Habitat 3.0
```bash
# Habitat 3.0 (continuous environments)
pip install habitat-sim==0.3.0 --extra-index-url https://aihabitat.org/pip

# VLN-CE dataset
python -m habitat.datasets.vln_ce.download
```
Option 3: Real Robot with NaVILA
The NaVILA codebase allows deployment on a Unitree Go2 or similar robot. Requirements:
- NVIDIA Jetson AGX Orin (or GPU server)
- RGB camera (front-facing)
- Robot with locomotion controller
Future of VLN
Multimodal Foundation Models
GPT-4o, Gemini 2.0, and other multimodal models blur the line between VLN and general AI. Future robots will be able to:
- Ask for clarification when instruction unclear
- Explain why choosing this path
- Learn from feedback: "not that room, the one next door"
From Navigation to Manipulation
Current VLN only covers movement. The next step is movement + action: "go to kitchen and make coffee" -- which requires combining VLN with manipulation skills.
Up Next in Series
This is Part 4 of Modern Navigation series:
- Part 1: SLAM A to Z -- SLAM Foundation
- Part 2: ROS 2 Nav2 -- Classical Navigation Stack
- Part 3: Learning-based Navigation: GNM, ViNT, NoMaD -- Foundation Models
- Part 5: Outdoor Navigation and Multi-Robot -- GPS-denied Nav, MAPF
Related Posts
- Foundation Models for Robots: RT-2, Octo, OpenVLA -- VLA models for manipulation
- AI Series Part 5: VLA Models -- Vision-Language-Action models overview
- Learning-based Navigation: GNM, ViNT, NoMaD -- Foundation models for navigation
- Humanoid Robotics Guide -- Humanoid robots using VLN