Why VLA Models Need to Understand 3D
In a previous post on Foundation Models for Robots, we analyzed RT-2, Octo, and OpenVLA — powerful VLA models that take a camera image and a language instruction and output robot actions. However, all of them share one limitation: they process the world as flat 2D images.
Imagine telling a robot to "place the cup on the top shelf". The robot sees the cup and the shelf through its camera — but from a 2D image alone, it doesn't know how far away the shelf is, how heavy the cup is relative to the distance it needs to travel, or how much the gripper needs to tilt to avoid hitting the shelf below.
The core problem is depth ambiguity: a small object close to the camera looks identical to a large object far away in a 2D image. Without depth information, the robot must guess — and in manipulation, guessing wrong by 2 cm fails the task completely.
Experimental Evidence
Benchmarks show that when the spatial layout changes (an object moved to a new position, a shelf height adjusted), the success rate of OpenVLA and other policies without depth information drops below 50% — while humans are barely affected. This suggests 2D VLA models are missing a crucial component: spatial reasoning.
Situations where 2D VLA models struggle:
| Situation | 2D Problem | Needs 3D |
|---|---|---|
| Stack objects | Don't know exact heights | Depth for each object |
| Place on shelf | Don't know distance to shelf | 3D target position |
| Avoid obstacles | Don't know if obstacle ahead or behind | Full spatial context |
| Precise insertion | mm-level error from 2D projection | 3D alignment |
SpatialVLA: Architecture and Innovation
SpatialVLA (Qu et al., 2025, accepted at RSS 2025) addresses this with two key innovations: Ego3D Position Encoding and Adaptive Action Grids. It was trained on 1.1 million real-world robot episodes — significantly more than the Open X-Embodiment mixture (970K episodes) that OpenVLA uses.
Ego3D Position Encoding — Viewing World Through Depth
The core idea: instead of using only 2D positional encodings for image patches (as in a traditional ViT), SpatialVLA adds 3D position information estimated from monocular depth.
Pipeline:
```
RGB Image (from robot camera)
        ↓
Monocular Depth Estimator (e.g., DPT / Depth Anything)
        ↓
Depth Map → 3D Point Cloud (in egocentric frame)
        ↓
Ego3D Position Encoding
        ↓
Fused with Visual Tokens in VLM backbone
```
Why "egocentric"? Instead of using a world coordinate frame (which requires precise camera-robot calibration), SpatialVLA represents everything in the camera frame. Two major benefits:
- No camera-robot calibration needed — deploying on a new robot requires no accurate extrinsic parameters
- Universally applicable — works with any robot embodiment that has a camera
Specifically, an image patch at pixel (u, v) with depth d maps to the 3D coordinate:

```
X = (u - cx) * d / fx
Y = (v - cy) * d / fy
Z = d
Ego3D_pos = MLP([X, Y, Z]) → positional embedding
```

where (fx, fy, cx, cy) are the camera intrinsics. This embedding is added to the visual tokens before the transformer backbone, letting the model "see" depth without changing the architecture.
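The back-projection step can be sketched in a few lines of NumPy. The intrinsics below are hypothetical placeholder values, and the tiny two-layer MLP merely stands in for whatever projection head SpatialVLA actually uses:

```python
import numpy as np

# Hypothetical camera intrinsics (placeholder values, not SpatialVLA's)
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

def backproject(u, v, d):
    """Map a pixel (u, v) with depth d to a 3D point in the camera frame."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.array([x, y, d])

def ego3d_embedding(point, w1, b1, w2, b2):
    """Toy 2-layer MLP standing in for the Ego3D positional encoder."""
    h = np.maximum(point @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2                    # embedding added to visual tokens

rng = np.random.default_rng(0)
dim = 8  # toy embedding size; a real model uses the token dimension
w1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, dim)), np.zeros(dim)

p = backproject(u=400, v=300, d=1.5)   # pixel right/below center, 1.5 m away
emb = ego3d_embedding(p, w1, b1, w2, b2)
print(p)  # X=0.2, Y=0.15, Z=1.5
```

Note how depth scales the lateral offsets: the same pixel at twice the depth maps to a point twice as far off-axis, which is exactly the cue a 2D encoding cannot provide.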
Adaptive Action Grids — Unifying Action Space
The second problem: each robot has a different action space — Franka uses 7D joint positions, WidowX uses 6D end-effector velocities, ALOHA uses 14D bimanual actions. OpenVLA handles this by tokenizing each dimension into 256 uniform bins — but this wastes resolution on rarely-used ranges.
Adaptive Action Grids solve this by:
- Analyzing the statistical distribution of actions across the entire dataset, per robot
- Discretizing actions into adaptive grids — more bins in frequently-used ranges, fewer bins in rare ones
- Aligning action tokens with the 3D spatial structure, so the action grids reflect spatial meaning
```python
# Simplified example: adaptive binning
import numpy as np

def adaptive_grid(actions, n_bins=256):
    """
    Create adaptive bins based on the actual action distribution
    instead of uniform bins.
    """
    # actions: shape (N, 7) — N samples, 7 action dimensions
    bins_per_dim = []
    for dim in range(actions.shape[1]):
        # Use quantiles instead of linspace
        percentiles = np.linspace(0, 100, n_bins + 1)
        bins = np.percentile(actions[:, dim], percentiles)
        bins_per_dim.append(bins)
    return bins_per_dim

# Uniform:  bins evenly spaced from -1 to 1
# Adaptive: bins dense in the action-heavy region (near workspace center),
#           sparse in rare regions (workspace boundary)
```
The result: higher action resolution where it matters, without increasing the total token count.
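A quick sanity check of that claim: encode the same actions with uniform and with quantile-based bins and compare the quantization error. The data below is synthetic and purely illustrative, not the real action distribution:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 1-D actions concentrated near the workspace center (illustrative only)
actions = rng.normal(loc=0.0, scale=0.1, size=10_000).clip(-1, 1)

n_bins = 32
uniform_edges = np.linspace(-1, 1, n_bins + 1)
adaptive_edges = np.percentile(actions, np.linspace(0, 100, n_bins + 1))

def quantize_error(a, edges):
    """Snap actions to bin centers and return the mean reconstruction error."""
    idx = np.clip(np.digitize(a, edges) - 1, 0, len(edges) - 2)
    centers = (edges[:-1] + edges[1:]) / 2
    return np.abs(centers[idx] - a).mean()

err_uniform = quantize_error(actions, uniform_edges)
err_adaptive = quantize_error(actions, adaptive_edges)
print(err_uniform, err_adaptive)  # adaptive error is markedly smaller
```

Because the quantile bins pack most of their resolution where the actions actually live, the adaptive encoding loses far less precision for the same token budget.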
Results: SpatialVLA vs OpenVLA
SpatialVLA was evaluated across 7 robot learning scenarios, 16 real-robot tasks, and 48 simulation setups. The results show a clear advantage, especially on spatial reasoning tasks.
Real-world Franka Manipulation
| Task | OpenVLA | Octo | SpatialVLA |
|---|---|---|---|
| Pick spatial prompt (left/right/front/back) | < 50% | < 50% | 73% |
| Stack with height awareness | ~40% | ~35% | 68% |
| Place at specific location | ~45% | ~42% | 71% |
Zero-shot WidowX (out-of-distribution)
This is the harshest test — SpatialVLA never saw this specific WidowX robot layout:
- OpenVLA and Octo: consistently below 50% when the spatial layout changes
- SpatialVLA: maintains a 60-70% success rate thanks to its 3D understanding
Inference Speed
SpatialVLA also improves inference speed via its compact action representation:
| Model | Tokens per action | Inference (Hz) |
|---|---|---|
| OpenVLA | 7 tokens (7D) | ~6 Hz |
| Octo-Base | Diffusion steps | ~10 Hz |
| SpatialVLA | Fewer spatial tokens | ~8 Hz |
3D Representations in Robotics: The Big Picture
SpatialVLA uses monocular depth — the simplest way to get 3D information. But it is just one point in a broader landscape of 3D representations for robot perception.
Point Clouds
How it works: A LiDAR or stereo camera produces a set of 3D points (x, y, z) sampled from the surfaces in the scene.
Advantages: Precise, explicit geometry; fast processing with PointNet/PointNet++.
Disadvantages: Sparse (no continuous surface information), no appearance/texture.
Applications: Grasping (Contact-GraspNet), navigation, industrial bin picking.
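Raw point clouds are usually far too dense to feed a network directly, so a standard first step is voxel downsampling: keep one averaged point per occupied voxel. A minimal NumPy sketch, not tied to any particular library:

```python
import numpy as np

def voxel_downsample(points, voxel_size=0.05):
    """Average all points that fall into the same voxel (points: (N, 3) array)."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel key, then average each group
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.ravel()  # guard against shape differences across NumPy versions
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)   # scatter-add points into their voxel slot
    return sums / counts[:, None]

rng = np.random.default_rng(0)
cloud = rng.uniform(0, 1, size=(5000, 3))     # synthetic 1 m^3 scene
down = voxel_downsample(cloud, voxel_size=0.1)
print(cloud.shape, "->", down.shape)          # at most 10*10*10 = 1000 voxels remain
```

Libraries like Open3D provide an optimized version of this operation; the point here is only the grouping-and-averaging idea.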
Neural Radiance Fields (NeRF)
How it works: Train a neural network that represents the scene as a continuous volumetric function — for any point (x, y, z) and viewing direction, it outputs color and density.
Advantages: Photo-realistic novel view synthesis, smooth surfaces, compact representation.
Disadvantages: Slow training (minutes to hours), slow inference (seconds per render), hard to update in real time.
Applications in robotics: Scene understanding and sim-to-real transfer — use NeRF to create photorealistic training environments.
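The core of NeRF rendering is alpha compositing the sampled densities and colors along each camera ray: each sample contributes its color weighted by its opacity and by the transmittance (how much light survives the samples in front of it). A minimal NumPy version with made-up samples:

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Volume-render one ray: sigmas (S,), colors (S, 3), deltas (S,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)       # opacity of each sample
    trans = np.cumprod(1.0 - alphas)              # light surviving past each sample
    trans = np.concatenate([[1.0], trans[:-1]])   # T_0 = 1 at the ray origin
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)  # composited RGB

# Made-up ray: two empty samples, then a dense red surface
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0, 0], [0.0, 0, 0], [1.0, 0, 0], [1.0, 0, 0]])
deltas = np.full(4, 0.1)
print(composite(sigmas, colors, deltas))  # close to [1, 0, 0]: the red surface dominates
```

The "slow inference" drawback follows directly from this structure: every pixel requires querying the network at many samples per ray before compositing.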
3D Gaussian Splatting (3DGS)
How it works: Represent the scene as a collection of 3D Gaussians, each with a position, covariance, opacity, and spherical-harmonics coefficients for color. Rendering uses differentiable rasterization instead of ray marching.
Advantages: Real-time rendering (100+ FPS), much faster training than NeRF (minutes vs. hours), and an explicit representation that is easy to manipulate.
Disadvantages: Memory-intensive (millions of Gaussians), surfaces not as smooth as NeRF's.
Latest applications in robotics:
- RoboSplat (RSS 2025): Use 3DGS to generate novel demos for policy learning — create diverse training data from few demonstrations
- POGS (ICRA 2025): Track and manipulate irregular objects using Persistent Object Gaussian Splat
- SplatSim: Zero-shot sim-to-real transfer for RGB manipulation policies
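The rasterization step hinges on projecting each 3D Gaussian's covariance into screen space via the Jacobian of the perspective projection (the EWA splatting approximation). A simplified sketch that assumes the Gaussian is already in the camera frame, so the view rotation is omitted, with placeholder intrinsics:

```python
import numpy as np

def project_covariance(mean, cov3d, fx=600.0, fy=600.0):
    """Screen-space 2x2 covariance of a camera-frame 3D Gaussian (EWA approximation)."""
    x, y, z = mean
    # Jacobian of the perspective map (x, y, z) -> (fx*x/z, fy*y/z)
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    return J @ cov3d @ J.T  # Sigma_2D = J Sigma_3D J^T

mean = np.array([0.1, 0.0, 2.0])      # Gaussian 2 m in front of the camera
cov3d = np.diag([0.01, 0.01, 0.01])   # isotropic ~10 cm-scale Gaussian
cov2d = project_covariance(mean, cov3d)
print(cov2d)  # symmetric 2x2 covariance, ready for 2D rasterization
```

Since J scales with 1/z, a Gaussian twice as far away gets a roughly four times smaller screen covariance — perspective shrinking falls out of the projection automatically.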
Comprehensive Comparison
| Representation | Training | Rendering | Explicit? | Real-time? | Robot-friendly? |
|---|---|---|---|---|---|
| Point Cloud | N/A | N/A | Yes | Yes | High |
| NeRF | Slow (hours) | Slow (sec) | No | No | Medium |
| 3DGS | Fast (min) | Real-time | Yes | Yes | High |
| Monocular Depth | Pre-trained | Real-time | Yes | Yes | Highest |
SpatialVLA chose monocular depth because it needs no extra hardware (LiDAR, stereo camera) and runs in real time — the best fit for a general-purpose VLA model.
Future: VLA + 3D Scene Understanding
SpatialVLA opens a research direction that combines VLA with deeper 3D understanding:
Short-term (2026-2027)
- Multi-view VLA: Instead of a single camera, use multiple cameras (or wrist + head cameras) to build a richer 3D representation. Some labs are already experimenting with stereo VLA.
- Better depth foundation models: Depth Anything V2 and metric depth models are improving fast — SpatialVLA benefits directly.
- Tactile + 3D: Combine touch sensing with 3D vision for fine manipulation (assembly, soft-object handling).
Medium-term (2027-2028)
- VLA + 3DGS in real time: Use Gaussian Splatting as a world model for the VLA — the robot "imagines" the outcome of an action before executing it.
- 4D understanding: Not just static 3D, but dynamics — how an object moves when pushed, how liquid flows.
Long-term Vision
- Embodied 3D foundation model: A single model that deeply understands the 3D world — able to plan, manipulate, navigate, and reason about physics — not just map pixels to actions.
This is among the most exciting research directions in robot learning right now. SpatialVLA is an important first step — it shows that 3D spatial information measurably improves VLA models on real robots.
Takeaway for Practitioners
If you are building a robot manipulation system:
- Using OpenVLA/Octo: Consider adding depth estimation to your input pipeline — even simple monocular depth helps.
- Your task needs spatial precision: SpatialVLA or a similar model should be your first choice; fine-tuning on your own robot data gives the best results.
- Hardware: No LiDAR or stereo camera needed — monocular depth from a pre-trained model is sufficient for most manipulation.
- Next step: Watch the SpatialVLA GitHub repo for weights and fine-tuning code when they are released.
Next article: a hands-on tutorial on fine-tuning OpenVLA with LeRobot — from recording data to deploying on a real robot.
Related Posts
- Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice — VLA models overview
- Hands-on: Fine-tune OpenVLA with LeRobot — Tutorial: fine-tune VLA model
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robot — Transfer models from sim to real
- Robotics Research Trends 2025 — Research landscape overview