
VLA Models: RT-2 → Octo → OpenVLA → π0

History and evolution of Vision-Language-Action models — what each model solves and the trade-offs.

Nguyen Anh Tuan · March 17, 2026 · 12 min read

What is VLA and Why Does It Matter?

Vision-Language-Action (VLA) models are the latest generation in robot learning — combining the ability to see (vision), understand language, and decide on actions in a single unified model. If an LLM is a "brain" for text, a VLA is a "brain" for a robot.

LLM:  Text → Text         (ChatGPT, Claude)
VLM:  Image + Text → Text (GPT-4V, LLaVA)
VLA:  Image + Text → Robot Actions  (RT-2, OpenVLA, π0)

From 2023 to 2025, VLA models underwent a remarkable evolution — from a closed-source 55B-parameter model running on a single robot type, to open-source models controlling many robot platforms with open-world generalization. This article traces that entire evolution.

[Image: AI research lab with robots and computing infrastructure]

Timeline: VLA Evolution 2023-2026

2023 Jul ──── RT-2 (Google DeepMind)
              55B params, closed-source, emergent reasoning
              ↓
2024 May ──── Octo (UC Berkeley)
              93M params, open-source, cross-embodiment, diffusion head
              ↓
2024 Jun ──── OpenVLA (Stanford)
              7B params, open-source, beats RT-2-X
              ↓
2024 Oct ──── π0 (Physical Intelligence)
              Flow matching, 50Hz, dexterous manipulation
              ↓
2025 Apr ──── π0.5 (Physical Intelligence)
              Open-world generalization, clean kitchens & bedrooms
              ↓
2025-26  ──── DexVLA, ChatVLA-2, π0-FAST...
              Community explosion, specialization

Each model in the timeline solves a specific problem the previous one left behind. Let's dive into each.

RT-2: First Model to Prove VLA is Feasible

Context

Before RT-2, Google had RT-1 (2022) — a capable robot transformer policy, but one that could only follow tasks it had seen in its training data. Ask RT-1 to "pick up something to clean a spill" and it fails — it doesn't know a paper towel is for wiping.

RT-2 (Brohan et al., 2023) solved this by leveraging web-scale knowledge from a pre-trained VLM.

Architecture: VLM + Tokenized Actions

The core idea is extraordinarily elegant: convert robot actions into text tokens, then fine-tune a VLM to output actions exactly as it outputs text.

Input:
  [Image] Camera observation
  [Text]  "Pick up the red cup and place it on the plate"

PaLI-X 55B (Pre-trained Vision-Language Model)
  - Seen billions of image-text pairs on Internet
  - Understands objects, spatial relationships, common sense
        ↓
Output tokens: "1 128 91 241 1 128 91"
        ↓
De-tokenize: [x=0.12, y=0.50, z=0.36, rx=0.94, ry=0.01, rz=0.50, gripper=0.36]

Each action dimension is discretized into 256 bins (0-255) and represented as integer text. The model processes actions exactly like text tokens — no changes to the architecture or training pipeline are needed. The genius: reuse the entire VLM infrastructure.
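The round trip can be sketched in a few lines (illustrative Python with an assumed action range of [-1, 1], not RT-2's actual bin boundaries):

```python
# Illustrative sketch (not RT-2's actual code): discretize each
# continuous action dimension into 256 bins and map back.
NUM_BINS = 256

def tokenize(action, low=-1.0, high=1.0):
    """Map continuous values in [low, high] to integer bins 0..255."""
    tokens = []
    for a in action:
        frac = (a - low) / (high - low)               # normalize to [0, 1]
        b = round(frac * (NUM_BINS - 1))              # nearest bin index
        tokens.append(max(0, min(NUM_BINS - 1, b)))   # clamp to valid range
    return tokens

def detokenize(tokens, low=-1.0, high=1.0):
    """Map integer bins back to continuous values."""
    return [low + (t / (NUM_BINS - 1)) * (high - low) for t in tokens]

action = [0.12, 0.50, -0.36, 0.94]
tokens = tokenize(action)          # [143, 191, 82, 247]
recovered = detokenize(tokens)
# Rounding to the nearest bin keeps quantization error within half a bin.
```

With 256 bins over a [-1, 1] range, the worst-case quantization error is about 0.004 per dimension — negligible for gross motion, one reason discrete tokens work for pick-and-place but struggle with fine dexterous control.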

Emergent Reasoning — The Breakthrough Point

The most surprising result wasn't performance on seen tasks (on par with RT-1), but the emergent capabilities:

| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Semantic reasoning | 0% | 48% |

"Semantic reasoning" means the model understands "pick up something you can use to clean a spill" → selects the paper towel, even though it never saw this instruction in its robot training data. Knowledge from web pre-training (knowing a paper towel is for wiping) transfers to robot control.

Serious Limitations

But RT-2 has five major problems:

  1. 55B parameters — can't deploy on edge devices; needs a GPU cluster
  2. Closed-source — nobody outside Google can reproduce it
  3. Single robot — only tested on Google's robots, no cross-embodiment
  4. Slow inference (~3 Hz) — too slow for reactive manipulation
  5. Deterministic actions — tokenized output is single-mode and doesn't capture multimodal action distributions

These limitations created the opportunity for the next models.

Octo: The First Open-Source Generalist Policy

Problems Octo Solves

Octo (Ghosh et al., 2024) from UC Berkeley solves three of RT-2's biggest problems:

  1. Closed-source → Fully open-source (weights + code + data)
  2. Single robot → Cross-embodiment (22 robot platforms)
  3. Deterministic → Multi-modal action output (diffusion head)

Architecture: Custom Transformer + Diffusion Head

Instead of building on a massive VLM, Octo uses a robotics-specific transformer architecture:

Input tokens:
  [Language]  "Pick up the blue block"        → Language encoder
  [Image]     Observation history (t-2, t-1, t) → ViT patches
  [Proprio]   Joint positions / EE pose       → Linear projection
        ↓
  Transformer Backbone (with readout tokens)
  - Readout tokens: learnable tokens attend to all inputs
  - Capture cross-modal information
        ↓
  Diffusion Action Head
  - DDPM-based head instead of simple regression
  - Output: multi-modal action distribution
        ↓
  Action: [a_t, a_{t+1}, ..., a_{t+H}]  (action chunk)

The diffusion action head is the key innovation — inspired by Diffusion Policy (Chi et al., RSS 2023). Instead of predicting a mean action (a single mode), Octo samples from a learned distribution, naturally capturing multimodal actions.
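To see why a regression head fails here, consider a toy bimodal case: a demonstrator avoids an obstacle by steering either left (-1) or right (+1). The MSE-optimal prediction is the mean (0), which drives straight into the obstacle, while sampling from a learned distribution stays on a valid mode. A minimal illustration (toy numbers, not Octo code):

```python
import random

# Demonstrations: avoid the obstacle by steering left (-1) or right (+1).
demos = [-1.0, +1.0] * 50

# A regression head trained with MSE converges to the mean action...
mse_optimal = sum(demos) / len(demos)    # 0.0: drives into the obstacle

# ...while a generative head (e.g. a diffusion model) learns the full
# distribution; sampling returns one of the valid modes.
def sample_action():
    return random.choice([-1.0, +1.0])   # stand-in for diffusion sampling

print(mse_optimal)       # 0.0, an invalid "average" action
print(sample_action())   # -1.0 or +1.0, a valid mode
```

This mode-averaging failure is exactly what single-mode tokenized outputs (RT-2, OpenVLA) and mean-regression heads share, and what diffusion and flow-matching heads avoid.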

Training Data: Open X-Embodiment

Octo was trained on the Open X-Embodiment dataset — 800K+ robot episodes from 22 robot platforms:

| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| Total | 22 robots | Diverse | 800K+ |

This diversity lets Octo learn embodiment-agnostic features — it understands that "pick up" means the same thing regardless of whether the robot has 6 or 7 joints.

Two Versions

| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Fine-tune time | ~2 hours (1x RTX 3090) | ~4 hours (1x A100) |
| Zero-shot cross-embodiment | Weak | Moderate |
| After fine-tuning (50 demos) | Good | Strong |

Trade-offs

Advantages: open-source, lightweight (93M params), cross-embodiment, diffusion head for multimodal actions, easy fine-tuning on a consumer GPU.

Disadvantages: no web-scale VLM knowledge (lacks common-sense reasoning), semantic understanding weaker than RT-2, needs fine-tuning for new tasks.

OpenVLA: Best of Both Worlds

Problems OpenVLA Solves

OpenVLA (Kim et al., 2024) from Stanford combines the strengths of both RT-2 and Octo.

Architecture: Prismatic VLM + Robot Fine-tuning

Visual Encoder (dual):
  SigLIP → vision-language alignment features
  DINOv2 → spatial/geometric features
        ↓
  Projector (MLP) — fuse visual features
        ↓
  Llama 2 7B backbone
  - Pre-trained language model
  - Fine-tuned on 970K robot demos
        ↓
  Output: Tokenized actions (256 bins per dimension)

The dual visual encoder is the key insight: SigLIP brings semantic understanding (knows what a "cup" is), DINOv2 brings spatial and geometric understanding (knows where the cup is in 3D space). Combining both gives the robot both understanding and seeing.

Results: Beats RT-2-X with 7x Fewer Parameters

| Benchmark (29 tasks) | RT-2-X (55B) | Octo-Base (93M) | OpenVLA (7B) |
|---|---|---|---|
| Average success rate | Baseline | -2.1% | +16.5% |
| BridgeData V2 tasks | 43.2% | 41.8% | 55.7% |
| RT-1 tasks | 72.1% | 70.5% | 78.3% |

OpenVLA achieves a +16.5% improvement over the RT-2-X baseline while using only 7B parameters (7x smaller than RT-2's 55B). The lesson: VLM backbone quality plus good robot fine-tuning beats raw parameter count.

Fine-tune with LoRA on Consumer GPU

"""
Fine-tune OpenVLA 7B with LoRA
Requires: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "openvla/openvla-7b"
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA: fine-tune only 0.17% parameters
lora_config = LoraConfig(
    r=32, lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
# trainable: 13.1M / 7.6B = 0.17%

Trade-offs

Advantages: Open-source, strong VLM backbone, cross-embodiment, LoRA fine-tune on consumer GPU, beats RT-2-X.

Disadvantages: still deterministic output (tokenized, single-mode), ~6 Hz inference (faster than RT-2 but not real-time), and 7B is still large for edge devices.

[Image: Robot manipulation pipeline with vision-language-action model]

π0: Flow Matching for Dexterous Manipulation

Problems π0 Solves

π0 (Black et al., 2024) from Physical Intelligence solves a problem every previous VLA faces: dexterous manipulation needs smooth, high-frequency actions.

RT-2 and OpenVLA output tokenized (discrete) actions at 3-6 Hz — fine for pick-and-place but completely inadequate for tasks like folding laundry, pouring liquids, or wiping surfaces, which demand smooth, continuous trajectories.

Architecture: VLM + Flow Matching Action Expert

π0 uses flow matching instead of DDPM diffusion — the key innovation:

Block 1 (VLM Expert):
  [Image tokens] + [Language tokens]
  → Pre-trained 3B VLM (PaliGemma-based)
  → Semantic understanding + task specification
        ↓
Block 2 (Action Expert):
  [Proprioception] + [Noisy action chunk]
  → Flow matching network
  → Smooth denoising in continuous space
        ↓
Cross-attention: Action expert attends to VLM features
        ↓
Output: 50-action chunk (1 second at 50Hz)

Flow Matching vs DDPM:

| | DDPM (Diffusion Policy) | Flow Matching (π0) |
|---|---|---|
| Transport | Stochastic SDE | Deterministic ODE |
| Inference steps | 50-100 | 10-20 |
| Action quality | High | Comparable/higher |
| Speed | Slower | 2-5x faster |
| Training | Complex noise scheduling | Simpler objective |

Flow matching learns straight-line paths from noise to data, instead of the curved stochastic paths of DDPM. The result: fewer inference steps, faster, smoother.
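The training objective can be sketched in a few lines. Given a data point x1 and a noise sample x0, the model regresses the constant velocity x1 - x0 along the straight-line interpolation; at inference, a few Euler ODE steps carry noise to an action. A toy 1-D sketch (illustrative only, no neural network; the true velocity field is plugged in where π0 would use its learned action expert):

```python
import random

def interpolate(x0, x1, t):
    """Straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Flow-matching regression target: constant velocity along the path."""
    return x1 - x0

def sample(velocity_fn, x0, steps=10):
    """Inference: integrate the velocity field with a few Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)   # Euler step along the ODE
    return x

# Toy check: with the *true* velocity field for data point x1 = 0.7,
# integration transports any noise sample exactly to the data.
x1 = 0.7
x0 = random.gauss(0.0, 1.0)
result = sample(lambda x, t: target_velocity(x0, x1), x0, steps=10)
# result equals x1 up to float error, regardless of the noise draw
```

Because the learned paths are nearly straight, 10 steps suffice where DDPM's curved stochastic paths need 50-100 — the source of the 2-5x inference speedup in the table above.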

Training Data: 10,000 Hours of Robot Experience

π0 was trained on a massive proprietary dataset: roughly 10,000 hours of robot experience spanning multiple robot platforms and dexterous manipulation tasks.

Inference: 73ms for 50-action Chunk

On a GeForce RTX GPU, π0 inference takes 73 ms to generate 50 actions, i.e. 1 second of motion at 50 Hz. A breakthrough — the first time a VLA model runs fast enough for smooth dexterous manipulation.

Inference: 73ms → 50 actions at 50Hz
  = 1 second robot trajectory per inference
  = Smooth enough for folding, pouring, wiping
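These numbers also explain the ~14 Hz figure in the comparison table below: if the robot replans as soon as each 73 ms inference finishes (executing only the first actions of every chunk, receding-horizon style), the replanning rate is about 1/0.073 ≈ 13.7 Hz, while each chunk covers a full second of motion as a buffer. A quick back-of-envelope check:

```python
INFERENCE_MS = 73     # one forward pass of the action expert
CHUNK_LEN = 50        # actions per predicted chunk
CONTROL_HZ = 50       # actions consumed by the robot per second

chunk_duration_s = CHUNK_LEN / CONTROL_HZ       # 1.0 s of motion per chunk
replan_hz = 1000 / INFERENCE_MS                 # ~13.7 Hz replanning rate

print(round(replan_hz, 1))   # 13.7, matching the ~14 Hz in the table
```

Each inference buys ~13.7x more trajectory than it costs in latency, which is why control stays smooth even if an occasional inference runs long.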

Trade-offs

Advantages: Smooth continuous actions (flow matching), 50Hz control, dexterous manipulation, massive training data.

Disadvantages: closed-source (full model weights not public), requires large compute for training, dataset not open.

π0.5: Open-World Generalization

The Next Step Forward

π0.5 (Physical Intelligence, 2025) extends π0 with open-world generalization — the robot works in environments it has never seen.

Knowledge Insulation

The key innovation is co-training on heterogeneous data sources without interference:

Data sources:
  1. Robot manipulation data (Physical Intelligence fleet)
  2. Navigation data (mobile bases)
  3. Web data (image-text pairs)
  4. High-level semantic prediction data
        ↓
  Knowledge insulation training
  - Each data source trains separate "expert" modules
  - Shared backbone learns general representations
  - No catastrophic forgetting
        ↓
  Result: Model understands both low-level manipulation + high-level planning

Results: Clean Kitchens in Unseen Homes

Physical Intelligence tested π0.5 in three rental homes in San Francisco — completely new environments with different layouts and different furniture. The robot (a mobile manipulator) performed long-horizon tasks such as cleaning kitchens and tidying bedrooms.

This was the first time an end-to-end learned robotic system completed long-horizon manipulation tasks in unseen homes — no hard-coded behaviors, no per-home fine-tuning.

Open-source: openpi

Physical Intelligence released openpi — a framework for fine-tuning π0 models. Although the full π0.5 weights are not yet public, the community can fine-tune smaller versions for custom tasks.

Comprehensive Comparison Table

| | RT-2 | Octo | OpenVLA | π0 | π0.5 |
|---|---|---|---|---|---|
| Year | 2023 | 2024 | 2024 | 2024 | 2025 |
| Organization | Google | UC Berkeley | Stanford | Physical Intelligence | Physical Intelligence |
| Parameters | 55B | 27M / 93M | 7B | ~3B+ | ~3B+ |
| Open-source | Closed | Full open | Full open | Partial (openpi) | Partial (openpi) |
| Action output | Discrete tokens | Diffusion (continuous) | Discrete tokens | Flow matching (continuous) | Flow matching (continuous) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz | ~14 Hz (50-action chunks) | ~14 Hz |
| Cross-embodiment | 1 robot | 22 robots | Multi-robot | Multi-robot | Multi-robot + mobile |
| Fine-tune cost | N/A | 1x RTX 3090 | 1x RTX 3090 (LoRA) | Large GPU cluster | Large GPU cluster |
| Semantic reasoning | Strong (55B VLM) | Weak | Good (7B VLM) | Good (3B VLM) | Strong (+ web data) |
| Dexterous manip. | Limited | Limited | Limited | Strong | Strong |
| Open-world | Partial | No | Partial | No | Yes |
| arXiv | 2307.15818 | 2405.12213 | 2406.09246 | 2410.24164 | 2504.16054 |

Which Model for Your Project?

Decision Tree

What do you need?
│
├── Research / Learning → OpenVLA or Octo
│   ├── Have RTX 3090+ → OpenVLA (LoRA fine-tune)
│   └── Weaker GPU → Octo-Small (27M, lightweight)
│
├── Production manipulation → Task-dependent
│   ├── Simple pick-place → Octo fine-tuned
│   ├── Dexterous tasks → π0 (if budget allows)
│   └── Multi-modal actions → Diffusion Policy (see Part 4)
│
├── Cross-embodiment (many robot types) → Octo
│
└── Open-world / unseen environments → π0.5 (SOTA)

Upcoming Trends

  1. Smaller, faster VLAs: Distillation and quantization for edge (Jetson, Raspberry Pi)
  2. Diffusion + VLA fusion: DexVLA, Diffusion Transformer Policy — combine continuous action benefits with VLM reasoning
  3. World models: VLAs that not only predict actions but also predict future states — better planning
  4. Sim-to-real pre-training: Massive simulation data + real data fine-tuning
  5. Multi-modal inputs: Add tactile, force/torque, audio beyond vision

VLA models are at a stage equivalent to GPT-3 for NLP — powerful enough to show their potential, but not yet reliable enough for every use case. The evolution from RT-2 to π0.5 shows incredible development speed: a major leap every six months.

Starting out? OpenVLA + LoRA fine-tuning is the best entry point — open-source, good documentation, runs on a consumer GPU, strong community support.
