What is VLA and Why Does It Matter?
Vision-Language-Action (VLA) models are the latest generation in robot learning, combining the ability to see (vision), understand language, and make action decisions in a single unified model. If an LLM is the "brain" for text, a VLA is the "brain" for a robot.
LLM: Text → Text (ChatGPT, Claude)
VLM: Image + Text → Text (GPT-4V, LLaVA)
VLA: Image + Text → Robot Actions (RT-2, OpenVLA, π0)
From 2023 to 2025, VLA underwent a remarkable evolution: from a closed-source 55B-parameter model that ran on a single robot to open-source models that run across many robots with open-world generalization. This article traces that entire evolution.
Timeline: VLA Evolution 2023-2026
2023 Jul ──── RT-2 (Google DeepMind)
55B params, closed-source, emergent reasoning
↓
2024 May ──── Octo (UC Berkeley)
93M params, open-source, cross-embodiment, diffusion head
↓
2024 Jun ──── OpenVLA (Stanford)
7B params, open-source, beats RT-2-X
↓
2024 Oct ──── π0 (Physical Intelligence)
Flow matching, 50Hz, dexterous manipulation
↓
2025 Apr ──── π0.5 (Physical Intelligence)
Open-world generalization, clean kitchens & bedrooms
↓
2025-26 ──── DexVLA, ChatVLA-2, π0-FAST...
Community explosion, specialization
Each model in the timeline solves a specific problem the previous one left behind. Let's dive into each.
RT-2: First Model to Prove VLA is Feasible
Context
Before RT-2, Google had RT-1 (2022), a powerful robot transformer policy that could only understand tasks it had seen in its training data. Ask RT-1 to "pick up something to clean a spill" and it fails: it doesn't know that a paper towel is for wiping.
RT-2 (Brohan et al., 2023) solved this by leveraging web-scale knowledge from a pre-trained VLM.
Architecture: VLM + Tokenized Actions
The core idea is extraordinarily elegant: convert robot actions into text tokens, then fine-tune a VLM to output actions exactly as it outputs text.
Input:
[Image] Camera observation
[Text] "Pick up the red cup and place it on the plate"
PaLI-X 55B (Pre-trained Vision-Language Model)
- Seen billions of image-text pairs on Internet
- Understands objects, spatial relationships, common sense
↓
Output tokens: "1 128 91 241 1 128 91"
↓
De-tokenize: [x=0.12, y=0.50, z=0.36, rx=0.94, ry=0.01, rz=0.50, gripper=0.36]
Each action dimension is discretized into 256 bins (0-255) and represented as an integer in the text output. The model processes actions exactly like text tokens, so no architecture or training-pipeline changes are needed. The genius is reusing the entire VLM infrastructure.
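The round trip can be sketched in a few lines. The 256-bin scheme is from the paper; the symmetric [-1, 1] action range and the helper names are illustrative assumptions:

```python
import numpy as np

# Sketch of RT-2-style action discretization. The 256-bin count is from the
# paper; the [-1, 1] range and function names are illustrative assumptions.
LOW, HIGH, BINS = -1.0, 1.0, 256

def tokenize_action(action):
    """Map continuous values in [LOW, HIGH] to integer bins 0..255."""
    norm = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)  # -> [0, 1]
    return np.minimum((norm * BINS).astype(int), BINS - 1)

def detokenize_action(tokens):
    """Map bin indices back to bin-center values in [LOW, HIGH]."""
    return LOW + (np.asarray(tokens) + 0.5) / BINS * (HIGH - LOW)

action = np.array([0.12, 0.50, 0.36])      # e.g. x, y, z deltas
tokens = tokenize_action(action)            # integers the VLM emits as text
recovered = detokenize_action(tokens)       # quantized reconstruction
```

Quantization error is at most half a bin width (about 0.004 on this range), well below typical robot repeatability.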
Emergent Reasoning — The Breakthrough Point
The most surprising result wasn't performance on seen tasks (on par with RT-1), but the emergent capabilities:
| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Semantic reasoning | 0% | 48% |
"Semantic reasoning" means the model understands "pick up something you can use to clean a spill" and selects the paper towel, even though it never saw this instruction in its robot training data. Knowledge from web pre-training (a paper towel is for wiping) transfers to robot control.
Serious Limitations
But RT-2 has five major problems:
- 55B parameters: can't deploy on edge hardware, needs a GPU cluster
- Closed-source: nobody outside Google can reproduce it
- Single robot: only tested on Google's robots, no cross-embodiment
- Slow inference (~3 Hz): too slow for reactive manipulation
- Deterministic actions: tokenized output is single-mode and doesn't capture multimodal action distributions
These limitations created the opportunity for the models that followed.
Octo: First Open-Source Generalist Policy
Problems Octo Solves
Octo (Ghosh et al., 2024) from UC Berkeley solves the three biggest RT-2 problems:
- Closed-source → Fully open-source (weights + code + data)
- Single robot → Cross-embodiment (22 robot platforms)
- Deterministic → Multi-modal action output (diffusion head)
Architecture: Custom Transformer + Diffusion Head
Instead of building on a massive VLM, Octo uses a robotics-specific transformer architecture:
Input tokens:
[Language] "Pick up the blue block" → Language encoder
[Image] Observation history (t-2, t-1, t) → ViT patches
[Proprio] Joint positions / EE pose → Linear projection
↓
Transformer Backbone (with readout tokens)
- Readout tokens: learnable tokens attend to all inputs
- Capture cross-modal information
↓
Diffusion Action Head
- DDPM-based head instead of simple regression
- Output: multi-modal action distribution
↓
Action: [a_t, a_{t+1}, ..., a_{t+H}] (action chunk)
The diffusion action head is the key innovation, inspired by Diffusion Policy (Chi et al., RSS 2023). Instead of predicting a mean action (a single mode), Octo samples from a learned distribution and naturally captures multimodal actions.
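A toy illustration (entirely my own construction, not Octo's code) of why sampling beats regression when the data has two valid action modes, say "grasp from the left" vs "grasp from the right":

```python
import numpy as np

rng = np.random.default_rng(0)
MODES = np.array([-0.8, 0.8])  # two equally valid expert actions

def toy_denoiser(a_noisy):
    # Stand-in for the learned denoising network: each step pulls the
    # sample 30% of the way toward the nearest valid action mode.
    target = MODES[np.argmin(np.abs(MODES - a_noisy))]
    return a_noisy + 0.3 * (target - a_noisy)

def sample_action(steps=20):
    a = rng.normal()           # start from pure Gaussian noise
    for _ in range(steps):
        a = toy_denoiser(a)    # iterative denoising
    return a

samples = [sample_action() for _ in range(50)]
# Samples land near -0.8 or +0.8. A regression head would predict their
# useless mean near 0, an action that grasps from neither side.
```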
Training Data: Open X-Embodiment
Octo is trained on the Open X-Embodiment dataset: 800K+ robot episodes from 22 robot platforms:
| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| Total | 22 robots | Diverse | 800K+ |
This diversity lets Octo learn embodiment-agnostic features: it understands that "pick up" means the same thing whether the robot has 6 or 7 joints.
Two Versions
| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Fine-tune time | ~2 hours (1x RTX 3090) | ~4 hours (1x A100) |
| Zero-shot cross-embodiment | Weak | Moderate |
| After fine-tuning (50 demos) | Good | Strong |
Trade-offs
Advantages: open-source, lightweight (93M params), cross-embodiment, diffusion head for multimodal actions, easy to fine-tune on a consumer GPU.
Disadvantages: no web-scale VLM knowledge (lacks common-sense reasoning), semantic understanding weaker than RT-2, needs fine-tuning for each new task.
OpenVLA: Best of Both Worlds
Problems OpenVLA Solves
OpenVLA (Kim et al., 2024) from Stanford combines the strengths of RT-2 and Octo:
- From RT-2: Build on pre-trained VLM → has web-scale knowledge
- From Octo: Open-source, cross-embodiment training
- Improvement: 7B params (vs RT-2's 55B), so it runs on a consumer GPU
Architecture: Prismatic VLM + Robot Fine-tuning
Visual Encoder (dual):
SigLIP → vision-language alignment features
DINOv2 → spatial/geometric features
↓
Projector (MLP) — fuse visual features
↓
Llama 2 7B backbone
- Pre-trained language model
- Fine-tuned on 970K robot demos
↓
Output: Tokenized actions (256 bins per dimension)
The dual visual encoder is the key insight: SigLIP brings semantic understanding (it knows what a "cup" is), DINOv2 brings spatial understanding (it knows where the cup is in 3D space). Combining both gives the robot semantics and geometry at once.
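At the shape level, the fusion is channel-wise concatenation followed by a projector into the LLM embedding space. A sketch with a single linear layer standing in for the MLP (all dimensions below are illustrative assumptions, not the real checkpoint's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dimensions for illustration only
num_patches, d_siglip, d_dino, d_llm = 256, 1152, 1024, 4096

siglip_feats = rng.normal(size=(num_patches, d_siglip))  # semantic features
dino_feats = rng.normal(size=(num_patches, d_dino))      # spatial features
fused = np.concatenate([siglip_feats, dino_feats], axis=-1)

# Single linear layer standing in for the MLP projector
W = rng.normal(size=(d_siglip + d_dino, d_llm)) * 0.01
visual_tokens = fused @ W   # one embedding per patch, fed to the LLM backbone
```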
Results: Beats RT-2-X with 7x Fewer Parameters
| Benchmark (29 tasks) | RT-2-X (55B) | Octo-Base (93M) | OpenVLA (7B) |
|---|---|---|---|
| Average success rate | Baseline | -2.1% | +16.5% |
| BridgeData V2 tasks | 43.2% | 41.8% | 55.7% |
| RT-1 tasks | 72.1% | 70.5% | 78.3% |
OpenVLA achieves a +16.5% absolute improvement over the RT-2-X baseline while using only 7B parameters, 7x fewer than RT-2's 55B. The lesson: VLM backbone quality plus good robot fine-tuning beats raw parameter count.
Fine-tune with LoRA on Consumer GPU
"""
Fine-tune OpenVLA 7B with LoRA
Requires: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "openvla/openvla-7b"
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA: fine-tune only 0.17% of parameters
lora_config = LoraConfig(
    r=32, lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
# trainable: 13.1M / 7.6B = 0.17%
Trade-offs
Advantages: Open-source, strong VLM backbone, cross-embodiment, LoRA fine-tune on consumer GPU, beats RT-2-X.
Disadvantages: still deterministic output (tokenized, single-mode), ~6 Hz inference (faster than RT-2 but not real-time), and 7B is still large for edge devices.
pi0: Flow Matching for Dexterous Manipulation
Problems pi0 Solves
pi0 (Black et al., 2024) from Physical Intelligence solves a problem every previous VLA faces: dexterous manipulation needs smooth, high-frequency actions.
RT-2 and OpenVLA output tokenized (discrete) actions at 3-6 Hz, which is fine for pick-and-place but completely inadequate for:
- Folding clothes (needs smooth bimanual coordination)
- Pouring water (needs precise force control)
- Wiping table (needs adaptive trajectories)
Architecture: VLM + Flow Matching Action Expert
pi0's key innovation is using flow matching instead of DDPM diffusion:
Block 1 (VLM Expert):
[Image tokens] + [Language tokens]
→ Pre-trained 3B VLM (PaliGemma-based)
→ Semantic understanding + task specification
↓
Block 2 (Action Expert):
[Proprioception] + [Noisy action chunk]
→ Flow matching network
→ Smooth denoising in continuous space
↓
Cross-attention: Action expert attends to VLM features
↓
Output: 50-action chunk (1 second at 50Hz)
Flow Matching vs DDPM:
| | DDPM (Diffusion Policy) | Flow Matching (pi0) |
|---|---|---|
| Transport | Stochastic SDE | Deterministic ODE |
| Inference steps | 50-100 | 10-20 |
| Action quality | High | Comparable/higher |
| Speed | Slower | 2-5x faster |
| Training | Noise scheduling complex | Simpler objective |
Flow matching learns straight-line paths from noise to data instead of DDPM's curved stochastic paths. The result: fewer inference steps, faster sampling, smoother actions.
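A scalar toy (my own construction, not pi0's code) shows why the straight-line field needs only a handful of Euler steps. Here the learned velocity network is replaced by the analytic field for a single known target:

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 0.6   # stand-in for the expert action the network was trained on

def velocity(x, t):
    # For linear paths x_t = (1 - t) * x0 + t * x1, the velocity field
    # toward a single target x1 is (x1 - x) / (1 - t). A real model
    # learns this field from data instead of using the analytic form.
    return (TARGET - x) / max(1.0 - t, 1e-3)

def sample(steps=10):
    x = rng.normal()                                    # start from pure noise
    for i in range(steps):
        x = x + (1.0 / steps) * velocity(x, i / steps)  # Euler ODE step
    return x

samples = [sample() for _ in range(20)]
# Every trajectory lands on TARGET in 10 deterministic steps, regardless
# of the noise seed: no stochastic wandering as in DDPM sampling.
```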
Training Data: 10,000 Hours of Robot Experience
pi0 trained on massive dataset:
- 903 million timesteps from Physical Intelligence's robot fleet
- 90 million timesteps from open-source (OXE, BridgeData V2, DROID)
- Multiple robots: Single-arm, bimanual, dexterous hands
- Diverse tasks: Folding laundry, packing boxes, clearing tables
Inference: 73ms for 50-action Chunk
On a GeForce RTX GPU, pi0 inference takes 73 ms to generate 50 actions, i.e., 1 second of trajectory at 50 Hz. This was a breakthrough: the first time a VLA model ran fast enough for smooth dexterous manipulation.
Inference: 73ms → 50 actions at 50Hz
= 1 second robot trajectory per inference
= Smooth enough for folding, pouring, wiping
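The arithmetic above is easy to check (numbers restated from the text):

```python
chunk_size = 50       # actions per inference call
control_hz = 50       # robot control rate
inference_ms = 73     # one forward pass on an RTX-class GPU

chunk_duration_ms = chunk_size / control_hz * 1000  # 1000 ms of trajectory
# The next chunk can be generated ~13x faster than the current one is
# executed, so the controller never starves between inference calls.
assert inference_ms < chunk_duration_ms
headroom = chunk_duration_ms / inference_ms
```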
Trade-offs
Advantages: Smooth continuous actions (flow matching), 50Hz control, dexterous manipulation, massive training data.
Disadvantages: closed-source (full model weights not public), training requires large compute, dataset not open.
pi0.5: Open-World Generalization
Next Step Forward
pi0.5 (Physical Intelligence, 2025) extends pi0 with open-world generalization: the robot works in environments it has never seen.
Knowledge Insulation
The key innovation is co-training on heterogeneous data sources without interference:
Data sources:
1. Robot manipulation data (Physical Intelligence fleet)
2. Navigation data (mobile bases)
3. Web data (image-text pairs)
4. High-level semantic prediction data
↓
Knowledge insulation training
- Each data source trains separate "expert" modules
- Shared backbone learns general representations
- No catastrophic forgetting
↓
Result: Model understands both low-level manipulation + high-level planning
Results: Clean Kitchens in Unseen Homes
Physical Intelligence tested pi0.5 in three rental homes in San Francisco: completely new environments with different layouts and furniture. The robot (a mobile manipulator) performed:
- Clearing kitchen: Gather dishes, put in dishwasher, wipe countertop
- Clearing bedroom: Fold blanket, arrange pillows, pick up items off floor
This was the first time an end-to-end learned robotic system completed long-horizon manipulation tasks in unseen homes: no hard-coded behaviors, no per-home fine-tuning.
Open-source: openpi
Physical Intelligence released openpi, a framework for fine-tuning pi0 models. Although the full pi0.5 weights are not public yet, the community can fine-tune smaller versions for custom tasks.
Comprehensive Comparison Table
| | RT-2 | Octo | OpenVLA | pi0 | pi0.5 |
|---|---|---|---|---|---|
| Year | 2023 | 2024 | 2024 | 2024 | 2025 |
| Organization | Google DeepMind | UC Berkeley | Stanford | Physical Intelligence | Physical Intelligence |
| Parameters | 55B | 27M / 93M | 7B | ~3B+ | ~3B+ |
| Open-source | Closed | Full open | Full open | Partial (openpi) | Partial (openpi) |
| Action output | Discrete tokens | Diffusion (continuous) | Discrete tokens | Flow matching (continuous) | Flow matching (continuous) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz | ~14 Hz (50-action chunks) | ~14 Hz |
| Cross-embodiment | 1 robot | 22 robots | Multi-robot | Multi-robot | Multi-robot + mobile |
| Fine-tune cost | N/A | 1x RTX 3090 | 1x RTX 3090 (LoRA) | Large GPU cluster | Large GPU cluster |
| Semantic reasoning | Strong (55B VLM) | Weak | Good (7B VLM) | Good (3B VLM) | Strong (+ web data) |
| Dexterous manip. | Limited | Limited | Limited | Strong | Strong |
| Open-world | Partial | No | Partial | No | Yes |
| arXiv | 2307.15818 | 2405.12213 | 2406.09246 | 2410.24164 | 2504.16054 |
Which Model for Your Project?
Decision Tree
What do you need?
│
├── Research / Learning → OpenVLA or Octo
│ ├── Have RTX 3090+ → OpenVLA (LoRA fine-tune)
│ └── Weaker GPU → Octo-Small (27M, lightweight)
│
├── Production manipulation → Task-dependent
│ ├── Simple pick-place → Octo fine-tuned
│ ├── Dexterous tasks → pi0 (if budget allows)
│ └── Multi-modal actions → Diffusion Policy (see Part 4)
│
├── Cross-embodiment (many robot types) → Octo
│
└── Open-world / unseen environments → pi0.5 (SOTA)
Upcoming Trends
- Smaller, faster VLAs: Distillation and quantization for edge (Jetson, Raspberry Pi)
- Diffusion + VLA fusion: DexVLA, Diffusion Transformer Policy — combine continuous action benefits with VLM reasoning
- World models: VLA not just predict actions but predict future states — better planning
- Sim-to-real pre-training: Massive simulation data + real data fine-tuning
- Multi-modal inputs: Add tactile, force/torque, audio beyond vision
VLA models are at a stage equivalent to GPT-3 for NLP: powerful enough to show the potential, but not yet reliable enough for every use case. The evolution from RT-2 to pi0.5 shows an incredible pace of development, with a major leap every six months.
Starting out? OpenVLA + LoRA fine-tuning is the best entry point: open-source, good documentation, runs on a consumer GPU, strong community support.
Related Posts
- Diffusion Policy: A Revolution in Robot Manipulation — Deep-dive into diffusion models for robot actions
- Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice — Detailed fine-tuning guide
- Robotics Research Trends 2025 — Research landscape overview
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robot — Sim-to-real techniques for VLA models
- AI and Robotics 2025: Trends and Real-World Applications — Real-world industry applications