
Vision-Language Navigation: Robot Following Instructions

Explore VLN -- how robots understand and execute natural language instructions, from R2R benchmark to NaVILA and LLM-based planning.

Nguyen Anh Tuan · February 16, 2026 · 8 min read

From Goal Image to Natural Language

In Part 3, we saw how GNM, ViNT, and NoMaD use a goal image to direct the robot: "go to the place that looks like this". But in reality, humans don't communicate with images -- we say: "go to the kitchen", "turn left at the intersection, then go straight to the end of the hallway".

Vision-Language Navigation (VLN) is the problem of a robot understanding and executing natural language instructions in 3D environments. It's one of the hardest problems at the intersection of NLP, Computer Vision, and Robotics.

Why is it hard? Because the robot needs to:

  1. Understand language: parse complex instructions, understand references ("the table next to the window")
  2. Look and recognize: match language with what's seen (grounding)
  3. Make decisions: choose direction based on language and visual understanding
  4. Handle uncertainty: language is ambiguous, environment unfamiliar

Robot receiving language instruction and moving in real environment

Room-to-Room (R2R) -- Foundational Benchmark

R2R Dataset

R2R (Anderson et al., CVPR 2018) is the first and most influential VLN benchmark. Paper: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments.

Setup:

  - Environment: 90 real buildings from the Matterport3D scans
  - 7,189 trajectories, each annotated with 3 human-written instructions (~21.5k instructions total)
  - Average instruction length: ~29 words

Example instruction:

"Walk out of the bathroom. Turn left and walk down the hall. Turn left and wait in the doorway of the bedroom."

Metrics:

  - Success Rate (SR): fraction of episodes that end within 3 m of the goal
  - Success weighted by Path Length (SPL): SR discounted by how efficient the agent's path was
  - Navigation Error (NE): distance from the stopping point to the goal
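Of these, SPL is the least obvious. A minimal computation, directly from the formula in Anderson et al. (2018) -- the function name and sample values here are illustrative:

```python
# SPL (Success weighted by Path Length), as defined for R2R:
#   SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# S_i: success (0/1), l_i: shortest-path length, p_i: length actually walked.
def spl(successes, shortest, walked):
    total = sum(s * l / max(p, l)
                for s, l, p in zip(successes, shortest, walked))
    return total / len(successes)

# Two episodes: an inefficient success (10 m optimal, 12.5 m walked)
# and a failure. SPL = (0.8 + 0.0) / 2 = 0.4
print(spl([1, 0], [10.0, 8.0], [12.5, 20.0]))  # → 0.4
```

Note how SPL punishes wandering: a success that took a long detour counts for less than an efficient one.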

Navigation Graph

R2R uses a navigation graph -- a set of viewpoints (nodes) connected by edges. At each step, the agent:

  1. Looks at the 360-degree panorama at the current node
  2. Selects the next node to move to (from its neighbors)
  3. Repeats until it decides to stop
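The loop above can be sketched in a few lines. The toy graph, the room names, and the keyword-matching "policy" are illustrative stand-ins, not the actual benchmark code:

```python
# Toy sketch of R2R-style discrete navigation: hop between viewpoints
# until the policy decides to stop.
NAV_GRAPH = {
    "bathroom": ["hallway"],
    "hallway": ["bathroom", "bedroom", "kitchen"],
    "bedroom": ["hallway"],
    "kitchen": ["hallway"],
}

def choose_next(neighbors, instruction, visited):
    # Stand-in policy: move to an unvisited neighbor named in the instruction.
    for n in neighbors:
        if n in instruction and n not in visited:
            return n
    return None  # nothing left to match → stop

def navigate(start, instruction, max_steps=10):
    path = [start]
    for _ in range(max_steps):
        nxt = choose_next(NAV_GRAPH[path[-1]], instruction, set(path))
        if nxt is None:
            break
        path.append(nxt)
    return path

print(navigate("bathroom", "walk down the hallway and wait in the bedroom"))
# → ['bathroom', 'hallway', 'bedroom']
```

A real agent replaces `choose_next` with a learned policy over panorama features, but the control flow is exactly this simple.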

This is discrete navigation -- the agent teleports between nodes, with no low-level control. Newer benchmarks (VLN-CE) switch to continuous environments with low-level actions.

Stages of VLN Development

Stage 1: Sequence-to-Sequence (2018-2020)

The first models treated VLN as a seq2seq problem: encode the instruction into a vector, then decode an action sequence.

Speaker-Follower (Fried et al., NeurIPS 2018):

  - Follower: a seq2seq policy that maps instruction + views to actions
  - Speaker: an inverse model that generates instructions from trajectories, used for data augmentation and for pragmatically reranking candidate routes

Stage 2: Transformer-Based (2020-2023)

PREVALENT and HAMT (History Aware Multimodal Transformer) brought Transformers to VLN:

  - Large-scale pretraining on image-text-action triplets (PREVALENT)
  - Cross-modal attention between instruction tokens and panoramic views
  - Explicit encoding of the full observation-action history (HAMT)

HAMT reached state of the art on R2R with about 65% SR (2022), using hierarchical history encoding for long trajectories.

Stage 3: LLM-Based Planning (2023-now)

The explosion of LLMs (GPT-4, LLaMA) opened a new direction: use an LLM as the navigation planner.

LLM-Based Navigation Planning

Core Idea

Instead of training an end-to-end model, use the LLM as a reasoning "brain":

Instruction: "Go to kitchen, get cup from table"
    │
    ▼
LLM (GPT-4V / LLaMA)
    │
    ├── Understand: [go to kitchen] → [get cup] → [on table]
    ├── See current: "I'm in hallway, door ahead"
    ├── Reason: "Kitchen usually on 1st floor, has fridge, stove -- not visible yet → continue"
    │
    ▼
Action: "Go forward, through door ahead"

NavGPT and LLM-Based Works

NavGPT (Zhou et al., 2023) was one of the first systems to use GPT-4 for VLN:

  1. Perception module: describes the current scene in text ("I see a hallway with a door on the left")
  2. LLM reasoning: GPT-4 reads the instruction + scene description and reasons about the next action
  3. Action execution: converts the LLM output into a navigation action
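A minimal sketch of this three-module loop, with the perception module and the LLM replaced by trivial stubs -- every function name and the "ACTION:" reply format here are assumptions for illustration, not NavGPT's actual prompt:

```python
# NavGPT-style loop: scene → text, text + instruction → "LLM", reply → action.
def perceive(state):
    # Stand-in for the perception module (captioning / detection → text).
    return f"I am in the {state['room']}; I see a door ahead."

def llm_reason(instruction, scene_text):
    # Stand-in for a GPT-4 call: a real system sends instruction + scene
    # description in a prompt and receives free-form reasoning + an action.
    if "kitchen" in instruction and "kitchen" not in scene_text:
        return "Kitchen not visible yet. ACTION: forward"
    return "Goal reached. ACTION: stop"

def parse_action(reply):
    # Extract the discrete action from the LLM's reply.
    return reply.split("ACTION:")[-1].strip()

state = {"room": "hallway"}
action = parse_action(llm_reason("Go to the kitchen", perceive(state)))
print(action)  # → forward
```

The scene-to-text step is both the strength (the LLM needs no visual training) and the bottleneck (detail is lost in translation) of this design.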

Strengths:

  - Zero-shot: no VLN-specific training needed
  - Interpretable: the reasoning chain can be read and debugged

Weaknesses:

  - Success rate well below trained specialist models on R2R
  - Slow and expensive: one LLM call per navigation step
  - The scene-to-text bottleneck loses visual detail

SayNav and Hierarchical Planning

SayNav takes a hierarchical approach: the LLM creates a high-level plan, and a classical planner executes it:

LLM: "To reach kitchen, I need to: 
      1. Exit current room
      2. Walk down hallway
      3. Turn right at intersection
      4. Kitchen at end of left hallway"
         │
         ▼
Classical Planner (Nav2): execute each step with obstacle avoidance
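The same split in code, with both the LLM and the planner stubbed out -- `llm_plan` and `classical_go` are hypothetical stand-ins, not SayNav's API:

```python
# SayNav-style hierarchy: the LLM emits a step list once; a classical
# planner executes each step with its own local obstacle avoidance.
def llm_plan(instruction):
    # Stand-in for one LLM call returning a high-level plan.
    return [
        "exit current room",
        "walk down hallway",
        "turn right at intersection",
        "enter kitchen at end of left hallway",
    ]

def classical_go(step):
    # Stand-in for e.g. a Nav2 goal; returns True once the step succeeds.
    print(f"executing: {step}")
    return True

def run(instruction):
    for step in llm_plan(instruction):
        if not classical_go(step):
            return False  # a real system would replan or re-query the LLM
    return True

print(run("go to the kitchen"))  # → True, after printing 4 steps
```

The key design choice: the expensive LLM call happens once per plan, not once per motion step, while collision handling stays with the fast classical planner.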

NaVILA -- Vision-Language-Action for Legged Robots

Paper: NaVILA: Legged Robot Vision-Language-Action Model for Navigation (Cheng et al., RSS 2025)

NaVILA is a recent work that combines a VLA (Vision-Language-Action) model with locomotion skills for legged robots.

Two-Level Architecture

Level 1: VLA Model (low frequency ~2 Hz)
    Input: camera image + language instruction
    Output: mid-level command ("move forward 75cm")
         │
         ▼
Level 2: Locomotion Policy (high frequency ~50 Hz)
    Input: mid-level command + proprioception
    Output: joint torques
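The two-rate structure can be sketched as a single loop in which the VLA refreshes the mid-level command every 25 locomotion ticks. Both controllers here are stubs; only the 2 Hz / 50 Hz timing and the command style come from the description above:

```python
# Two-rate control loop: slow VLA updates, fast locomotion steps.
VLA_HZ, LOCO_HZ = 2, 50
TICKS_PER_VLA = LOCO_HZ // VLA_HZ  # 25 locomotion ticks per VLA update

def vla_step(image, instruction):
    return {"cmd": "forward", "distance_cm": 75}  # mid-level command

def locomotion_step(command, proprioception):
    return [0.0] * 12  # stand-in for 12 joint torques on a quadruped

command, vla_updates, torque_log = None, 0, []
for tick in range(100):  # 2 seconds of control at 50 Hz
    if tick % TICKS_PER_VLA == 0:
        command = vla_step(image=None, instruction="go to the kitchen")
        vla_updates += 1
    torque_log.append(locomotion_step(command, proprioception=None))

print(vla_updates, len(torque_log))  # → 4 100
```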

Why two levels?

  - VLA inference is slow (~2 Hz), but joint control on a legged robot needs ~50 Hz
  - Decoupling lets the VLA reason about language and scenes while the RL locomotion policy handles balance and rough terrain
  - The locomotion policy can be trained and reused independently of the navigation model

Results

Compared to NoMaD

| Criterion | NoMaD | NaVILA |
|---|---|---|
| Input | Goal image | Language instruction |
| Robot | Wheeled | Legged (quadruped) |
| Architecture | ViT + Diffusion | VLA + RL locomotion |
| Speed | Fast (~20 Hz) | 2 Hz (VLA) + 50 Hz (locomotion) |
| Terrain | Flat | Rough terrain |
| Interaction | Image goal | Natural language |

Legged robot moving with language instruction in complex environment

VLN in Continuous Environments

VLN-CE (Continuous Environments)

R2R uses a navigation graph (discrete). VLN-CE (Krantz et al., 2020) moves to continuous environments -- the robot controls its own motion (linear + angular velocity), must avoid obstacles, and can get lost.

Much harder because:

  - No predefined viewpoints: the agent must output low-level actions itself
  - Collisions, drift, and getting stuck are now possible
  - Success rates drop sharply compared to graph-based R2R
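To see concretely what "continuous" means, here is a minimal unicycle-style pose integration: the agent emits linear and angular velocity and its pose evolves over time, so it can drift off the route. The model and numbers are illustrative, not VLN-CE's actual dynamics:

```python
import math

# Continuous control: instead of picking a graph node, the agent outputs
# (v, w) velocities and its pose (x, y, heading) is integrated each tick.
def step(pose, v, w, dt=0.1):
    x, y, theta = pose
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)

pose = (0.0, 0.0, 0.0)
for _ in range(10):              # 1 s of "drive forward" at 0.5 m/s
    pose = step(pose, v=0.5, w=0.0)
print(pose)  # ≈ (0.5, 0.0, 0.0)
```

Any error in the velocity commands accumulates in the pose -- which is exactly why VLN-CE agents can end up somewhere the instruction never mentioned.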

Embodied VLN

EmbodiedGPT and LEO (Large Embodied Model) are recent works combining:

  - 3D perception, language, and action in a single embodied foundation model
  - Navigation with broader embodied tasks such as question answering and planning

Remaining Challenges

1. Grounding -- Connecting Language to Visual

"The table next to the window" -- the robot must understand that "next to" is a spatial relation and "window" is an object, and match both against what it sees. This is visual grounding, and it is still not fully solved.

2. Ambiguity in Language

"Go to the room" -- which room? "Turn at the intersection" -- which intersection? Natural language is inherently ambiguous. The robot needs to learn to ask for clarification or use common sense to resolve the ambiguity.

3. Dynamic Environments

The instruction says "walk down the hallway", but the hallway has people and carts in it. The robot must adapt in real time -- combining VLN with reactive obstacle avoidance.

4. Long-Horizon Tasks

A long instruction ("go to the kitchen, get the cup, pour water, bring it to the table") requires the robot to remember what it has done and plan what remains. This needs memory and planning -- an area where LLMs can help.

Practice: Getting Started with VLN

Option 1: R2R with Habitat Simulator

# Habitat: habitat-sim is distributed via conda; habitat-lab is installed
# from its repo (check the habitat-lab README for matching versions)
conda install habitat-sim -c conda-forge -c aihabitat
git clone https://github.com/facebookresearch/habitat-lab.git
pip install -e habitat-lab

# Matterport3D simulator + the original R2R benchmark code
git clone --recursive https://github.com/peteanderson80/Matterport3DSimulator.git

# R2R instruction data (download script ships in the repo)
bash Matterport3DSimulator/tasks/R2R/data/download.sh

Option 2: VLN-CE with Habitat 3.0

# VLN-CE: continuous-environment VLN built on Habitat
git clone https://github.com/jacobkrantz/VLN-CE.git
cd VLN-CE && pip install -r requirements.txt

# Scene and episode data: follow the download links in the repo's README

Option 3: Real Robot with NaVILA

The NaVILA codebase supports deployment on a Unitree Go2 or similar legged robot. It requires:

  - A GPU for VLA inference (onboard or streamed from a workstation)
  - An RGB camera stream plus the robot's proprioceptive state
  - The trained low-level locomotion policy for the target platform

Future of VLN

Multimodal Foundation Models

GPT-4o, Gemini 2.0, and other multimodal models are blurring the line between VLN and general AI. Future robots could:

  - Follow open-ended, conversational instructions rather than fixed templates
  - Answer questions about what they see while navigating
  - Generalize zero-shot to unseen environments and tasks

From Navigation to Manipulation

Current VLN only covers movement. The next step is movement + action: "go to the kitchen and make coffee" -- this requires combining VLN with manipulation skills.

Up Next in Series

This is Part 4 of the Modern Navigation series.


Related Posts


Outdoor Navigation and Multi-Robot Coordination (Part 5)
GPS-denied navigation, terrain classification, multi-robot traffic management with VDA5050, and MAPF algorithms for robot fleets.
2/20/2026 · 11 min read

Learning-based Navigation: GNM, ViNT and NoMaD (Part 3)
Explore foundation models for robot navigation -- GNM, ViNT, NoMaD from Berkeley and how they are changing the way robots move.
2/12/2026 · 10 min read

ROS 2 from A to Z (Part 3): Nav2 -- Your First Autonomous Robot
Configure the Nav2 stack so a robot can map with SLAM and navigate autonomously -- from simulation to the real world.
2/11/2026 · 11 min read