What is VLA and Why Does It Matter?
Vision-Language-Action (VLA) models are the latest generation in robot learning, combining the ability to see (vision), understand language, and make action decisions in a single unified model. If an LLM is the "brain" for text, a VLA is the "brain" for a robot.
LLM: Text → Text (ChatGPT, Claude)
VLM: Image + Text → Text (GPT-4V, LLaVA)
VLA: Image + Text → Robot Actions (RT-2, OpenVLA, π0)
From 2023 to 2025, VLA underwent a remarkable evolution: from a closed-source 55B-parameter model that ran on a single robot to open-source models that run across many robots with open-world generalization. This article traces that entire evolution.
Timeline: VLA Evolution 2023-2026
2023 Jul ──── RT-2 (Google DeepMind)
55B params, closed-source, emergent reasoning
↓
2024 May ──── Octo (UC Berkeley)
93M params, open-source, cross-embodiment, diffusion head
↓
2024 Jun ──── OpenVLA (Stanford)
7B params, open-source, beats RT-2-X
↓
2024 Oct ──── π0 (Physical Intelligence)
Flow matching, 50Hz, dexterous manipulation
↓
2025 Apr ──── π0.5 (Physical Intelligence)
Open-world generalization, clean kitchens & bedrooms
↓
2025-26 ──── DexVLA, ChatVLA-2, π0-FAST...
Community explosion, specialization
Each model in the timeline solves a specific problem the previous one left behind. Let's dive into each.
RT-2: First Model to Prove VLA is Feasible
Context
Before RT-2, Google had RT-1 (2022), a powerful robot transformer policy that could only understand tasks it had seen in its training data. Ask RT-1 to "pick up something to clean a spill" and it fails: it doesn't know that a paper towel is for wiping.
RT-2 (Brohan et al., 2023) solved this by leveraging web-scale knowledge from a pre-trained VLM.
Architecture: VLM + Tokenized Actions
The core idea is extraordinarily elegant: convert robot actions into text tokens, then fine-tune a VLM to output actions exactly as it outputs text.
Input:
[Image] Camera observation
[Text] "Pick up the red cup and place it on the plate"
PaLI-X 55B (Pre-trained Vision-Language Model)
- Seen billions of image-text pairs on Internet
- Understands objects, spatial relationships, common sense
↓
Output tokens: "1 128 91 241 1 128 91"
↓
De-tokenize: [x=0.12, y=0.50, z=0.36, rx=0.94, ry=0.01, rz=0.50, gripper=0.36]
Each action dimension is discretized into 256 bins (0-255) and represented as an integer in the text output. The model processes actions exactly like text tokens, so no architecture or training-pipeline changes are needed. The genius is reusing the entire VLM infrastructure.
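The round trip can be sketched in a few lines. The 256-bin scheme is from the paper; the symmetric [-1, 1] action range and the helper names are illustrative assumptions:

```python
import numpy as np

# Sketch of RT-2-style action discretization. The 256-bin count is from the
# paper; the [-1, 1] range and function names are illustrative assumptions.
LOW, HIGH, BINS = -1.0, 1.0, 256

def tokenize_action(action):
    """Map continuous values in [LOW, HIGH] to integer bins 0..255."""
    norm = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)  # -> [0, 1]
    return np.minimum((norm * BINS).astype(int), BINS - 1)

def detokenize_action(tokens):
    """Map bin indices back to bin-center values in [LOW, HIGH]."""
    return LOW + (np.asarray(tokens) + 0.5) / BINS * (HIGH - LOW)

action = np.array([0.12, 0.50, 0.36])      # e.g. x, y, z deltas
tokens = tokenize_action(action)            # integers the VLM emits as text
recovered = detokenize_action(tokens)       # quantized reconstruction
```

Quantization error is at most half a bin width (about 0.004 on this range), well below typical robot repeatability.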
Emergent Reasoning — The Breakthrough Point
The most surprising result wasn't performance on seen tasks (on par with RT-1), but the emergent capabilities:
| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Semantic reasoning | 0% | 48% |
"Semantic reasoning" means the model understands "pick up something you can use to clean a spill" and selects the paper towel, even though it never saw this instruction in its robot training data. Knowledge from web pre-training (a paper towel is for wiping) transfers to robot control.
Serious Limitations
But RT-2 has five major problems:
- 55B parameters: can't deploy on edge hardware, needs a GPU cluster
- Closed-source: nobody outside Google can reproduce it
- Single robot: only tested on Google's robots, no cross-embodiment
- Slow inference (~3 Hz): too slow for reactive manipulation
- Deterministic actions: tokenized output is single-mode and doesn't capture multimodal action distributions
These limitations created the opportunity for the models that followed.
Octo: First Open-Source Generalist Policy
Problems Octo Solves
Octo (Ghosh et al., 2024) from UC Berkeley solves the three biggest RT-2 problems:
- Closed-source → Fully open-source (weights + code + data)
- Single robot → Cross-embodiment (22 robot platforms)
- Deterministic → Multi-modal action output (diffusion head)
Architecture: Custom Transformer + Diffusion Head
Instead of building on a massive VLM, Octo uses a robotics-specific transformer architecture:
Input tokens:
[Language] "Pick up the blue block" → Language encoder
[Image] Observation history (t-2, t-1, t) → ViT patches
[Proprio] Joint positions / EE pose → Linear projection
↓
Transformer Backbone (with readout tokens)
- Readout tokens: learnable tokens attend to all inputs
- Capture cross-modal information
↓
Diffusion Action Head
- DDPM-based head instead of simple regression
- Output: multi-modal action distribution
↓
Action: [a_t, a_{t+1}, ..., a_{t+H}] (action chunk)
The diffusion action head is the key innovation, inspired by Diffusion Policy (Chi et al., RSS 2023). Instead of predicting a mean action (a single mode), Octo samples from a learned distribution and naturally captures multimodal actions.
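A toy illustration (entirely my own construction, not Octo's code) of why sampling beats regression when the data has two valid action modes, say "grasp from the left" vs "grasp from the right":

```python
import numpy as np

rng = np.random.default_rng(0)
MODES = np.array([-0.8, 0.8])  # two equally valid expert actions

def toy_denoiser(a_noisy):
    # Stand-in for the learned denoising network: each step pulls the
    # sample 30% of the way toward the nearest valid action mode.
    target = MODES[np.argmin(np.abs(MODES - a_noisy))]
    return a_noisy + 0.3 * (target - a_noisy)

def sample_action(steps=20):
    a = rng.normal()           # start from pure Gaussian noise
    for _ in range(steps):
        a = toy_denoiser(a)    # iterative denoising
    return a

samples = [sample_action() for _ in range(50)]
# Samples land near -0.8 or +0.8. A regression head would predict their
# useless mean near 0, an action that grasps from neither side.
```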
Training Data: Open X-Embodiment
Octo is trained on the Open X-Embodiment dataset: 800K+ robot episodes from 22 robot platforms:
| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| Total | 22 robots | Diverse | 800K+ |
This diversity lets Octo learn embodiment-agnostic features: it understands that "pick up" means the same thing whether the robot has 6 or 7 joints.
Two Versions
| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Fine-tune time | ~2 hours (1x RTX 3090) | ~4 hours (1x A100) |
| Zero-shot cross-embodiment | Weak | Moderate |
| After fine-tuning (50 demos) | Good | Strong |
Trade-offs
Advantages: open-source, lightweight (93M params), cross-embodiment, diffusion head for multimodal actions, easy to fine-tune on a consumer GPU.
Disadvantages: no web-scale VLM knowledge (lacks common-sense reasoning), semantic understanding weaker than RT-2, needs fine-tuning for each new task.
OpenVLA: Best of Both Worlds
Problems OpenVLA Solves
OpenVLA (Kim et al., 2024) from Stanford combines the strengths of RT-2 and Octo:
- From RT-2: Build on pre-trained VLM → has web-scale knowledge
- From Octo: Open-source, cross-embodiment training
- Improvement: 7B params (vs RT-2's 55B), so it runs on a consumer GPU
Architecture: Prismatic VLM + Robot Fine-tuning
Visual Encoder (dual):
SigLIP → vision-language alignment features
DINOv2 → spatial/geometric features
↓
Projector (MLP) — fuse visual features
↓
Llama 2 7B backbone
- Pre-trained language model
- Fine-tuned on 970K robot demos
↓
Output: Tokenized actions (256 bins per dimension)
The dual visual encoder is the key insight: SigLIP brings semantic understanding (it knows what a "cup" is), DINOv2 brings spatial understanding (it knows where the cup is in 3D space). Combining both gives the robot semantics and geometry at once.
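At the shape level, the fusion is channel-wise concatenation followed by a projector into the LLM embedding space. A sketch with a single linear layer standing in for the MLP (all dimensions below are illustrative assumptions, not the real checkpoint's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dimensions for illustration only
num_patches, d_siglip, d_dino, d_llm = 256, 1152, 1024, 4096

siglip_feats = rng.normal(size=(num_patches, d_siglip))  # semantic features
dino_feats = rng.normal(size=(num_patches, d_dino))      # spatial features
fused = np.concatenate([siglip_feats, dino_feats], axis=-1)

# Single linear layer standing in for the MLP projector
W = rng.normal(size=(d_siglip + d_dino, d_llm)) * 0.01
visual_tokens = fused @ W   # one embedding per patch, fed to the LLM backbone
```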
Results: Beats RT-2-X with 7x Fewer Parameters
| Benchmark (29 tasks) | RT-2-X (55B) | Octo-Base (93M) | OpenVLA (7B) |
|---|---|---|---|
| Average success rate | Baseline | -2.1% | +16.5% |
| BridgeData V2 tasks | 43.2% | 41.8% | 55.7% |
| RT-1 tasks | 72.1% | 70.5% | 78.3% |
OpenVLA achieves a +16.5% absolute improvement over the RT-2-X baseline while using only 7B parameters, 7x fewer than RT-2's 55B. The lesson: VLM backbone quality plus good robot fine-tuning beats raw parameter count.
Fine-tune with LoRA on Consumer GPU
"""
Fine-tune OpenVLA 7B with LoRA
Requires: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "openvla/openvla-7b"
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA: fine-tune only 0.17% of parameters
lora_config = LoraConfig(
    r=32, lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
# trainable: 13.1M / 7.6B = 0.17%
Trade-offs
Advantages: Open-source, strong VLM backbone, cross-embodiment, LoRA fine-tune on consumer GPU, beats RT-2-X.
Disadvantages: still deterministic output (tokenized, single-mode), ~6 Hz inference (faster than RT-2 but not real-time), and 7B is still large for edge devices.
pi0: Flow Matching for Dexterous Manipulation
Problems pi0 Solves
pi0 (Black et al., 2024) from Physical Intelligence solves a problem every previous VLA faces: dexterous manipulation needs smooth, high-frequency actions.
RT-2 and OpenVLA output tokenized (discrete) actions at 3-6 Hz, which is fine for pick-and-place but completely inadequate for:
- Folding clothes (needs smooth bimanual coordination)
- Pouring water (needs precise force control)
- Wiping table (needs adaptive trajectories)
Architecture: VLM + Flow Matching Action Expert
pi0's key innovation is using flow matching instead of DDPM diffusion:
Block 1 (VLM Expert):
[Image tokens] + [Language tokens]
→ Pre-trained 3B VLM (PaliGemma-based)
→ Semantic understanding + task specification
↓
Block 2 (Action Expert):
[Proprioception] + [Noisy action chunk]
→ Flow matching network
→ Smooth denoising in continuous space
↓
Cross-attention: Action expert attends to VLM features
↓
Output: 50-action chunk (1 second at 50Hz)
Flow Matching vs DDPM:
| | DDPM (Diffusion Policy) | Flow Matching (pi0) |
|---|---|---|
| Transport | Stochastic SDE | Deterministic ODE |
| Inference steps | 50-100 | 10-20 |
| Action quality | High | Comparable/higher |
| Speed | Slower | 2-5x faster |
| Training | Noise scheduling complex | Simpler objective |
Flow matching learns straight-line paths from noise to data instead of DDPM's curved stochastic paths. The result: fewer inference steps, faster sampling, smoother actions.
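A scalar toy (my own construction, not pi0's code) shows why the straight-line field needs only a handful of Euler steps. Here the learned velocity network is replaced by the analytic field for a single known target:

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 0.6   # stand-in for the expert action the network was trained on

def velocity(x, t):
    # For linear paths x_t = (1 - t) * x0 + t * x1, the velocity field
    # toward a single target x1 is (x1 - x) / (1 - t). A real model
    # learns this field from data instead of using the analytic form.
    return (TARGET - x) / max(1.0 - t, 1e-3)

def sample(steps=10):
    x = rng.normal()                                    # start from pure noise
    for i in range(steps):
        x = x + (1.0 / steps) * velocity(x, i / steps)  # Euler ODE step
    return x

samples = [sample() for _ in range(20)]
# Every trajectory lands on TARGET in 10 deterministic steps, regardless
# of the noise seed: no stochastic wandering as in DDPM sampling.
```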
Training Data: 10,000 Hours of Robot Experience
pi0 trained on massive dataset:
- 903 million timesteps from Physical Intelligence's robot fleet
- 90 million timesteps from open-source (OXE, BridgeData V2, DROID)
- Multiple robots: Single-arm, bimanual, dexterous hands
- Diverse tasks: Folding laundry, packing boxes, clearing tables
Inference: 73ms for 50-action Chunk
On a GeForce RTX GPU, pi0 inference takes 73 ms to generate 50 actions, i.e., 1 second of trajectory at 50 Hz. This was a breakthrough: the first time a VLA model ran fast enough for smooth dexterous manipulation.
Inference: 73ms → 50 actions at 50Hz
= 1 second robot trajectory per inference
= Smooth enough for folding, pouring, wiping
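The arithmetic above is easy to check (numbers restated from the text):

```python
chunk_size = 50       # actions per inference call
control_hz = 50       # robot control rate
inference_ms = 73     # one forward pass on an RTX-class GPU

chunk_duration_ms = chunk_size / control_hz * 1000  # 1000 ms of trajectory
# The next chunk can be generated ~13x faster than the current one is
# executed, so the controller never starves between inference calls.
assert inference_ms < chunk_duration_ms
headroom = chunk_duration_ms / inference_ms
```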
Trade-offs
Advantages: Smooth continuous actions (flow matching), 50Hz control, dexterous manipulation, massive training data.
Disadvantages: closed-source (full model weights not public), training requires large compute, dataset not open.
pi0.5: Open-World Generalization
Next Step Forward
pi0.5 (Physical Intelligence, 2025) extends pi0 with open-world generalization: the robot works in environments it has never seen.
Knowledge Insulation
The key innovation is co-training on heterogeneous data sources without interference:
Data sources:
1. Robot manipulation data (Physical Intelligence fleet)
2. Navigation data (mobile bases)
3. Web data (image-text pairs)
4. High-level semantic prediction data
↓
Knowledge insulation training
- Each data source trains separate "expert" modules
- Shared backbone learns general representations
- No catastrophic forgetting
↓
Result: Model understands both low-level manipulation + high-level planning
Results: Clean Kitchens in Unseen Homes
Physical Intelligence tested pi0.5 in three rental homes in San Francisco: completely new environments with different layouts and furniture. The robot (a mobile manipulator) performed:
- Clearing kitchen: Gather dishes, put in dishwasher, wipe countertop
- Clearing bedroom: Fold blanket, arrange pillows, pick up items off floor
This was the first time an end-to-end learned robotic system completed long-horizon manipulation tasks in unseen homes: no hard-coded behaviors, no per-home fine-tuning.
Open-source: openpi
Physical Intelligence released openpi, a framework for fine-tuning pi0 models. Although the full pi0.5 weights are not public yet, the community can fine-tune smaller versions for custom tasks.
Comprehensive Comparison Table
| | RT-2 | Octo | OpenVLA | pi0 | pi0.5 |
|---|---|---|---|---|---|
| Year | 2023 | 2024 | 2024 | 2024 | 2025 |
| Organization | Google DeepMind | UC Berkeley | Stanford | Physical Intelligence | Physical Intelligence |
| Parameters | 55B | 27M / 93M | 7B | ~3B+ | ~3B+ |
| Open-source | Closed | Full open | Full open | Partial (openpi) | Partial (openpi) |
| Action output | Discrete tokens | Diffusion (continuous) | Discrete tokens | Flow matching (continuous) | Flow matching (continuous) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz | ~14 Hz (50-action chunks) | ~14 Hz |
| Cross-embodiment | 1 robot | 22 robots | Multi-robot | Multi-robot | Multi-robot + mobile |
| Fine-tune cost | N/A | 1x RTX 3090 | 1x RTX 3090 (LoRA) | Large GPU cluster | Large GPU cluster |
| Semantic reasoning | Strong (55B VLM) | Weak | Good (7B VLM) | Good (3B VLM) | Strong (+ web data) |
| Dexterous manip. | Limited | Limited | Limited | Strong | Strong |
| Open-world | Partial | No | Partial | No | Yes |
| arXiv | 2307.15818 | 2405.12213 | 2406.09246 | 2410.24164 | 2504.16054 |
Which Model for Your Project?
Decision Tree
What do you need?
│
├── Research / Learning → OpenVLA or Octo
│ ├── Have RTX 3090+ → OpenVLA (LoRA fine-tune)
│ └── Weaker GPU → Octo-Small (27M, lightweight)
│
├── Production manipulation → Task-dependent
│ ├── Simple pick-place → Octo fine-tuned
│ ├── Dexterous tasks → pi0 (if budget allows)
│ └── Multi-modal actions → Diffusion Policy (see Part 4)
│
├── Cross-embodiment (many robot types) → Octo
│
└── Open-world / unseen environments → pi0.5 (SOTA)
Upcoming Trends
- Smaller, faster VLAs: Distillation and quantization for edge (Jetson, Raspberry Pi)
- Diffusion + VLA fusion: DexVLA, Diffusion Transformer Policy — combine continuous action benefits with VLM reasoning
- World models: VLA not just predict actions but predict future states — better planning
- Sim-to-real pre-training: Massive simulation data + real data fine-tuning
- Multi-modal inputs: Add tactile, force/torque, audio beyond vision
VLA models are at a stage equivalent to GPT-3 for NLP: powerful enough to show the potential, but not yet reliable enough for every use case. The evolution from RT-2 to pi0.5 shows an incredible pace of development, with a major leap every six months.
Starting out? OpenVLA + LoRA fine-tuning is the best entry point: open-source, good documentation, runs on a consumer GPU, strong community support.
Related Posts
- Diffusion Policy: A Revolution in Robot Manipulation — Deep-dive into diffusion models for robot actions
- Foundation Models for Robot: RT-2, Octo, OpenVLA in Practice — Detailed fine-tuning guide
- Robotics Research Trends 2025 — Research landscape overview
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robot — Sim-to-real techniques for VLA models
- AI and Robotics 2025: Trends and Real-World Applications — Real-world industry applications