
VLA for Manipulation: RT-2, Octo, pi0 in Practice

Vision-Language-Action models are changing manipulation — from Google's RT-2 to pi0 by Physical Intelligence, with fine-tuning workflows and real-world limitations.

Nguyen Anh Tuan · March 6, 2026 · 8 min read

VLA: The Jump from Task-Specific to General-Purpose

In Part 2 and Part 3, I covered ACT and Diffusion Policy — methods that train one policy per task. Want a new task? Collect new data, retrain from scratch.

Vision-Language-Action (VLA) models change this paradigm: train one model on data from many robots and tasks, then fine-tune or zero-shot for specific tasks. Like GPT-4 understanding many languages, VLA models understand many robot tasks.

This post analyzes the three most important VLA models: RT-2 (Google DeepMind), Octo (UC Berkeley), and pi0 (Physical Intelligence) — with a practical perspective on when to use each, and when not to.

See VLA Models deep dive for detailed theory.

Foundation models for robotics — from language understanding to physical manipulation

RT-2: Vision-Language-Action from Google DeepMind

Idea

RT-2 (Brohan et al., 2023) is the first VLA model proving that web-scale knowledge transfers to robot control. The idea is simple but powerful: take a Vision-Language Model (VLM) pre-trained on billions of internet images and captions, then co-fine-tune it with robot trajectory data.

Architecture

Input:
  - Image: camera observation -> Vision encoder (ViT)
  - Language: task instruction "Pick up the red cup"
  - History: previous observations + actions

VLM backbone: PaLM-E (12B) or PaLI-X (55B)

Output:
  - Action tokens: generated like text tokens
  - Decode: token -> [dx, dy, dz, droll, dpitch, dyaw, gripper]

The key point: RT-2 doesn't change the VLM architecture. It just adds robot action tokens to the vocabulary and co-fine-tunes. The VLM still understands language and images as before, but can now also output robot actions.
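To make the token-decode step concrete, here is a minimal sketch of RT-2-style uniform action discretization (256 bins per dimension, as in the paper; the normalized action range and the helper names are assumptions for illustration):

```python
import numpy as np

NUM_BINS = 256                       # RT-2 uses 256 bins per action dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # normalized action range (assumption)

def action_to_tokens(action):
    """Map continuous action dimensions to integer token ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def tokens_to_action(tokens):
    """Decode token ids back to bin-center continuous values."""
    centers = (tokens + 0.5) / NUM_BINS
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.03, -0.5, 0.0312])   # e.g. [dx, dy, dz]
tokens = action_to_tokens(action)
decoded = tokens_to_action(tokens)
# Round-trip error is bounded by half a bin width: (2.0 / 256) / 2 ≈ 0.004
```

This bounded round-trip error is exactly the quantization loss that pi0's continuous action generation (discussed below) avoids.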

Emergent Capabilities

The most impressive aspect of RT-2 is its emergent capabilities — abilities never explicitly trained:

  • Reasoning about objects: "Pick up the object that is NOT a fruit" -> robot picks the water bottle, not the apple
  • Symbol grounding: "Move to number 3" -> robot understands text on the table
  • Zero-shot generalization: handles objects unseen in robot data but seen in web data

Results (6,000 evaluation trials):

  • Seen objects: 73% (baseline RT-1: 75% — equivalent)
  • Unseen objects: 62% (RT-1: 32% — 2x better)
  • Semantic reasoning: 36% (RT-1: 0% — impossible before)

Limitations

  • Huge model: 12B-55B parameters; needs a TPU cluster for training and inference
  • Slow: ~1-3 Hz action frequency (robot arms need 10-50 Hz)
  • Closed-source: Google has not released the weights

Octo: Open-Source Generalist Policy

Why Octo Matters

Octo (Ghosh et al., 2024) from UC Berkeley solves RT-2's biggest problem: accessibility. Octo is open-source, trained on the Open X-Embodiment dataset (800K trajectories from 22 robot platforms), and can be fine-tuned on a consumer GPU in hours.

Architecture

Input tokens:
  - Task: language instruction OR goal image -> tokenizer
  - Observations: images + proprio -> patchify + linear projection
  - Readout tokens: learnable tokens to decode actions

Transformer backbone:
  - 27M (Octo-Small) or 93M (Octo-Base) parameters
  - Block-wise attention: obs tokens attend to task tokens,
    readout tokens attend to all

Action head:
  - Diffusion head (default) or MSE head
  - Output: action chunk [a_t, ..., a_{t+H}]
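The block-wise attention rule above can be sketched as a boolean attention mask (token counts here are made up for illustration; the real model also handles multiple timesteps and causal masking):

```python
import numpy as np

# Octo-style block-wise attention mask. mask[i, j] = True means
# token i may attend to token j. Token counts are illustrative.
n_task, n_obs, n_readout = 2, 4, 1
n = n_task + n_obs + n_readout
mask = np.zeros((n, n), dtype=bool)

task = slice(0, n_task)
obs = slice(n_task, n_task + n_obs)
readout = slice(n_task + n_obs, n)

mask[task, task] = True   # task tokens attend among themselves
mask[obs, task] = True    # obs tokens attend to task tokens...
mask[obs, obs] = True     # ...and to each other
mask[readout, :] = True   # readout tokens attend to everything
# Nothing attends back to readout tokens, so removing or adding a
# readout head never changes the other tokens' representations.
```

This one-way design is why new action heads can be attached to readout tokens during fine-tuning without disturbing the pre-trained transformer features.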

Fine-Tuning Workflow

This is Octo's real power — fine-tune for your robot with minimal data:

# Fine-tune Octo for a custom robot (simplified sketch; see the
# Octo repo's finetune script for the full workflow)
from octo.model.octo_model import OctoModel
from octo.data.dataset import make_single_dataset

# 1. Load pre-trained Octo
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")

# 2. Load custom dataset (need only 50-100 demos)
dataset = make_single_dataset(
    dataset_kwargs={
        "name": "my_robot_data",
        "data_dir": "/path/to/my/data",
    },
    train=True,
)

# 3. Fine-tune (2-4 hours on RTX 4090)
model.finetune(
    dataset,
    steps=50000,
    batch_size=128,
    learning_rate=3e-5,
    # Freeze vision encoder, only train action head + readout
    frozen_keys=["octo_transformer/BlockTransformer_0/*"],
)

# 4. Save and deploy
model.save_pretrained("my_finetuned_octo")
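After deployment, the action chunks Octo outputs are usually consumed with receding-horizon execution: run only the first few actions of each predicted chunk, then re-predict. A minimal sketch, with a stand-in policy function in place of the real model (chunk length, action dimension, and the environment update are assumptions):

```python
import numpy as np

H, K, DIM = 4, 2, 7   # chunk length, actions executed per chunk, action dim

def fake_policy(obs):
    """Stand-in for model.sample_actions: returns an (H, DIM) action chunk."""
    return np.tile(obs, (H, 1))  # placeholder prediction

obs = np.zeros(DIM)
executed = []
for cycle in range(3):           # three replanning cycles
    chunk = fake_policy(obs)     # predict H actions
    for a in chunk[:K]:          # execute only the first K actions
        executed.append(a)
        obs = obs + 0.0 * a      # placeholder environment update
# 3 cycles * K actions executed in total
```

Executing fewer actions per chunk (smaller K) reacts faster to disturbances but calls the model more often; K is a latency/robustness trade-off knob.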

Fine-Tuning Results

On 9 robot platforms, Octo fine-tuned with 50-100 demos:

  • Beats BC from scratch on 7/9 platforms
  • Matches task-specific training on most tasks
  • Fine-tuning takes only 2-4 hours on 1 GPU (vs days for RT-2)

pi0: Flow Matching for General Robot Control

Breakthrough from Physical Intelligence

pi0 (Black et al., 2024) from Physical Intelligence is the newest and arguably most powerful VLA model. Instead of autoregressive token generation (like RT-2), pi0 uses flow matching — a generalization of diffusion models — to generate actions.

Architecture

Pre-trained VLM backbone: PaliGemma (3B vision-language model)

Flow matching action expert:
  - Separate action generation module
  - Trained with flow matching objective
  - Output: continuous action trajectories (no discretization)

Training:
  - Pre-train on diverse multi-robot dataset (7 robot platforms)
  - Fine-tune for specific tasks with 50-100 demos

Why Flow Matching?

RT-2 discretizes actions into tokens -> loses precision. pi0 uses flow matching to generate continuous actions directly, preserving accuracy for fine-grained manipulation.

Comparison:

  • RT-2: output "token 47" -> decode to dx=0.03 (quantization error)
  • pi0: output dx=0.0312 directly (continuous, more precise)
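The flow matching objective behind pi0's continuous output can be sketched in a few lines (toy 1-D data, no neural network: a dummy zero predictor stands in for the action expert, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(loc=2.0, size=(1024, 1))  # "expert actions" (toy data)
x0 = rng.normal(size=(1024, 1))           # noise samples
t = rng.uniform(size=(1024, 1))           # random interpolation times

x_t = (1 - t) * x0 + t * x1               # linear probability path
v_target = x1 - x0                        # velocity the model must predict

# A real action expert regresses v_theta(x_t, t) onto v_target:
#   loss = mean || v_theta(x_t, t) - v_target ||^2
# At inference, actions are generated by integrating dx/dt = v_theta
# from x0 ~ N(0, I) at t=0 to t=1 — no token decoding, so the output
# stays continuous.
loss = np.mean((0.0 - v_target) ** 2)     # loss of a dummy zero predictor
```

Because the sample path is a deterministic straight line rather than a long stochastic denoising chain, flow matching typically needs fewer integration steps at inference than standard diffusion, which helps pi0's control frequency.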

Results

pi0 achieves impressive results on dexterous manipulation tasks:

  • Laundry folding: 80% success (human-level difficulty)
  • Table bussing: 85% success
  • Box assembly: 70% success
  • Zero-shot transfer across different robot platforms

VLA models enable manipulation across many platforms with single policy

Comparison of 3 VLA Models

| Criterion | RT-2 | Octo | pi0 |
|---|---|---|---|
| Team | Google DeepMind | UC Berkeley | Physical Intelligence |
| Size | 12B-55B | 27M-93M | ~3B |
| Open-source | No | Yes | No (weights closed) |
| Training data | Google internal | Open X-Embodiment (800K) | Multi-robot (proprietary) |
| Action generation | Autoregressive tokens | Diffusion head | Flow matching |
| Action precision | Low (discrete) | Medium | High (continuous) |
| Inference speed | 1-3 Hz | 5-10 Hz | 5-15 Hz |
| Fine-tune cost | TPU days | GPU hours | GPU hours |
| Zero-shot | Good (web knowledge) | Limited | Good |
| Dexterous tasks | Medium | Medium | Best |
| Best for | Semantic reasoning | Open-source research | Production deployment |

Fine-Tuning VLA for Custom Tasks

When to Fine-Tune VLA?

Do fine-tune when:

  • You have new robot setup (different cameras, action space)
  • Task needs language conditioning (multiple tasks, instructions)
  • Want to leverage pre-trained representations vs training from scratch
  • Have 50-100 demos and want fast results

DON'T use VLA when:

  • Just one simple task (use ACT or Diffusion Policy — simpler, faster)
  • Need real-time (<5ms inference) — VLA too slow
  • No GPU (Octo-Base needs at least RTX 3080)
  • Task doesn't need language understanding

Fine-Tuning Best Practices

  1. Freeze the vision encoder: only train the action head and readout tokens. The vision encoder is already well trained from pre-training; fine-tuning it on a small dataset causes overfitting.

  2. Low learning rate: 3e-5 for Octo, lower for pi0. Pre-trained VLA weights are valuable; you don't want to erase them.

  3. Data diversity > quantity: 50 diverse demos (different initial conditions) beat 200 identical demos.

  4. Evaluate frequently: every 5,000 steps, run 20 eval episodes. VLA overfits fast on small datasets.

  5. Gradient checkpointing: saves VRAM and allows fine-tuning a 3B model on a 24GB GPU.
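Practice 1 is typically implemented by matching parameter names against glob patterns, as with the frozen_keys argument in the Octo example earlier. A small sketch of that selection logic (the parameter names here are made up for illustration):

```python
import fnmatch

# Hypothetical parameter names in an Octo-like parameter tree.
params = [
    "octo_transformer/BlockTransformer_0/layer_0/kernel",
    "octo_transformer/BlockTransformer_0/layer_1/kernel",
    "heads_action/diffusion/dense_0/kernel",
    "octo_transformer/readout_action/embedding",
]
frozen_patterns = ["octo_transformer/BlockTransformer_0/*"]

def is_frozen(name):
    """A parameter is frozen if any glob pattern matches its name."""
    return any(fnmatch.fnmatch(name, p) for p in frozen_patterns)

trainable = [p for p in params if not is_frozen(p)]
# Only the action head and readout parameters remain trainable.
```

Printing the resulting trainable list before launching a long fine-tuning run is a cheap way to catch a pattern that accidentally freezes (or unfreezes) the wrong block.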

VLA Limitations (2026)

1. Speed Still a Bottleneck

5-15 Hz is insufficient for many manipulation tasks (contact-rich, force-sensitive). Groups like Stanford are researching asynchronous VLA — a high-level VLA outputs subgoals while a low-level policy executes them fast.
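The asynchronous split can be sketched as two nested loops at different rates. Everything here is illustrative: the rates, the proportional controller, and the stand-in slow_vla function are assumptions, not any published system's design.

```python
import numpy as np

HIGH_LEVEL_PERIOD = 25   # low-level steps per VLA call (~2 Hz vs ~50 Hz)
state = np.zeros(3)      # e.g. Cartesian end-effector position
subgoal = np.zeros(3)

def slow_vla(state):
    """Stand-in for the slow VLA: returns the next Cartesian subgoal."""
    return state + np.array([0.1, 0.0, 0.0])

for step in range(100):                      # fast low-level control loop
    if step % HIGH_LEVEL_PERIOD == 0:
        subgoal = slow_vla(state)            # slow, expensive VLA call
    state = state + 0.2 * (subgoal - state)  # fast proportional tracking
```

The fast inner loop keeps the arm responsive between VLA calls; the cost is that the low-level policy must be safe to run against a stale subgoal for up to one high-level period.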

2. Sim-to-Real Gap

VLA models train mostly on real data, but real data is expensive and slow to collect. Integrating sim data into VLA pre-training remains an open challenge.

3. Safety

VLA is a black box — there are no guarantees about its behavior. In industry, this is a deal-breaker for safety-critical applications. Separate safety mechanisms are needed (force limits, workspace bounds, human detection).

4. Data Ownership

RT-2 trains on Google's proprietary data. pi0 trains on Physical Intelligence's data. Only Octo uses a public dataset. When fine-tuning, your data may leak into the model weights — an IP concern.

Future: VLA + Manipulation

pi0.5 and Beyond

pi0.5 (Physical Intelligence, 2025) extends pi0 with open-world generalization — the robot performs tasks never seen in training, from a language instruction alone. It is the closest yet to a "general-purpose robot."

Open-Source Catching Up

The Octo team is working on newer versions with larger datasets and better fine-tuning. The Hugging Face LeRobot community is integrating VLA models. The gap between open-source and proprietary is shrinking.

VLA + Diffusion Policy

The strongest combination: VLA for high-level understanding (interpreting the task from language), Diffusion Policy for low-level execution (smooth, precise trajectories). pi0 does this with flow matching; other labs are following.

Next in Series


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
