
Foundation Models for Robots: RT-2, Octo, OpenVLA in Practice

A comprehensive overview of foundation models for robotics: how RT-2, Octo, and OpenVLA are changing the way robots learn manipulation and navigation.

Nguyen Anh Tuan · March 14, 2026 · 10 min read

Foundation Models for Robots — A Revolution Underway

Foundation models for robotics are completely changing how we program robots. Instead of writing thousands of lines of code for each specific task, robots can now learn from demonstrations and understand language instructions — similar to how ChatGPT understands text, but with physical actions as output.

In this post, I'll deep-dive into the three most important models: RT-2 (Google DeepMind), Octo (UC Berkeley), and OpenVLA (Stanford) — analyzing their architectures, comparing performance, and walking through fine-tuning on your own robot.

[Figure: Robot learning manipulation with AI foundation models]

What Are Robot Foundation Models?

If you're familiar with Large Language Models (LLMs) like GPT or Claude, robot foundation models work on similar principles but for the physical world:

Robot foundation models are Vision-Language-Action (VLA) models — they take camera images plus a language instruction as input and directly output robot actions (gripper pose, joint angles, velocities, etc.).

Why Do We Need Foundation Models?

Previously, each robot task required:

  1. Collecting task-specific data (thousands of demonstrations)
  2. Training a dedicated model
  3. Deploying it for that specific robot
  4. Repeating for the next task

Foundation models solve this through pre-training on massive datasets from many different robots, then quick fine-tuning for specific tasks. Like how you fine-tune GPT for your domain instead of training from scratch.

RT-2: Vision-Language-Action Model (Google DeepMind)

Architecture

RT-2 (Brohan et al., 2023) was the first VLA model achieving impressive results. The core idea is elegant: convert robot actions into text tokens and co-train with vision-language data.

Input:  Camera image + "Pick up the red cup"
   ↓
PaLI-X (55B) or PaLM-E (12B)  [Pre-trained VLM]
   ↓
Output: "1 128 91 241 1 128 91"  [Tokenized actions]
   ↓
De-tokenize → [x, y, z, rx, ry, rz, gripper]

RT-2 tokenizes each action dimension into one of 256 bins (0-255) and represents it as text. This way, the model handles actions just like text tokens — no architecture changes needed.
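The discretization scheme can be sketched in a few lines of NumPy. Note this is an illustration of the binning idea, not RT-2's exact implementation: RT-2's normalization statistics are per-dimension and derived from its training data, while the action bounds below are made up.

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    normalized = (action - low) / (high - low)                 # scale to [0, 1]
    return np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)

def detokenize_action(bins, low, high, n_bins=256):
    """Invert tokenization: bin index -> bin-center continuous value."""
    normalized = (bins + 0.5) / n_bins
    return low + normalized * (high - low)

# Illustrative bounds for a 7D end-effector action
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])

action = np.array([0.05, -0.02, 0.0, 0.1, 0.0, -0.3, 1.0])
tokens = tokenize_action(action, low, high)        # 7 integers in [0, 255]
recovered = detokenize_action(tokens, low, high)   # round-trip error < one bin width
```

The round-trip is lossy by at most half a bin width per dimension, which is why 256 bins are enough for centimeter-scale manipulation.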

Highlight Results

| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Reasoning ("pick smallest") | 0% | 48% |

Key breakthrough: RT-2 shows emergent reasoning — understanding "pick up something you can use to clean a spill" (selecting a paper towel) without ever seeing this instruction in robot training data. Knowledge from web pre-training transfers to robot control.

Limitations

RT-2 is closed-source, and at 55B parameters inference is slow (~3 Hz) and too heavy to run on the robot itself. It was also trained on data from a single robot platform, which limits cross-embodiment transfer.

Octo: Open-Source Generalist Robot Policy (UC Berkeley)

Architecture

Octo (Ghosh et al., 2024) is the open-source answer to RT-2. Instead of using a giant VLM, Octo designs a transformer architecture tailored for robotics:

Input tokens:
  [Language] "Pick up the blue block"
  [Image]    Observation history (t-2, t-1, t)
  [Action]   Previous action history
       ↓
  Transformer Backbone (readout tokens)
       ↓
  Diffusion Action Head
       ↓
  Output: Action distribution (multi-modal)

Key innovation: Octo uses a diffusion head instead of a regression head — allowing the model to predict multi-modal action distributions. For example: when there are multiple valid paths (left or right), the diffusion head captures both modes, while a regression head would average them out.
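To see why, here's a toy example (the numbers are made up): with demonstrations split between "go left" and "go right", the MSE-optimal regression output is the mean, an action in neither mode.

```python
import numpy as np

# Two valid expert behaviors: steer left (-1.0) or steer right (+1.0)
demonstrations = np.array([-1.0] * 50 + [1.0] * 50)

# A regression head trained with MSE converges to the mean action:
regression_output = demonstrations.mean()   # 0.0 -> drives straight into the obstacle

# A trained diffusion head instead samples from the full action distribution,
# so its samples land near the demonstrated modes. We mimic that here by
# resampling the empirical distribution:
rng = np.random.default_rng(0)
diffusion_samples = rng.choice(demonstrations, size=10)   # each sample is -1.0 or +1.0
```

This is exactly the failure mode the Octo authors cite for regression heads on multi-modal demonstration data.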

Training Data

Octo trains on the Open X-Embodiment dataset — 800K+ robot episodes from 22 different robot platforms:

| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 Robot | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| ... | ... | ... | ... |
| Total | 22 robots | Diverse | 800K+ |

Two Versions

| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Performance | Baseline | +15% success rate |
| Fine-tune time | ~2 hours | ~4 hours |
| GPU requirement | 1x RTX 3090 | 1x A100 40GB |

Fine-tune Octo on Your Robot

Here's the practical part — fine-tune Octo-Small on custom robot data with just one consumer GPU:

"""
Fine-tune Octo-Small on custom robot dataset
Requirements: 1x RTX 3090/4090, 50-200 demonstrations
"""
import jax
import jax.numpy as jnp
from octo.model.octo_model import OctoModel
from octo.data.dataset import make_single_dataset
from octo.utils.train_utils import (
    create_optimizer,
    create_train_state,
)
import optax

# 1. Load pretrained Octo-Small
model = OctoModel.load_pretrained("hf://octo-models/octo-small-1.5")
print(f"Loaded Octo-Small: {sum(x.size for x in jax.tree.leaves(model.params))/1e6:.1f}M params")

# 2. Prepare dataset (RLDS format)
# Dataset must be in RLDS format (https://github.com/google-research/rlds)
dataset_config = {
    "name": "my_robot_dataset",
    "data_dir": "/data/my_robot/",
    "image_obs_keys": {"primary": "image"},
    "language_key": "language_instruction",
    "action_proprio_normalization_type": "normal",
}

train_dataset = make_single_dataset(
    dataset_kwargs=dataset_config,
    traj_transform_kwargs={
        "window_size": 2,       # Observation history length
        "action_horizon": 4,    # Predict 4 future actions (action chunking)
    },
    frame_transform_kwargs={
        "resize_size": {"primary": (256, 256)},
    },
    train=True,
)

# 3. Setup optimizer
# Low learning rate for fine-tuning — don't want to destroy pre-trained weights
optimizer = optax.adamw(
    learning_rate=3e-5,    # Much lower than training from scratch
    weight_decay=0.01,
    b1=0.9,
    b2=0.95,
)

train_state = create_train_state(
    rng=jax.random.PRNGKey(42),
    model=model,
    optimizer=optimizer,
)

# 4. Training loop
NUM_STEPS = 5000  # 50 demos × ~100 steps = ~5K frames
BATCH_SIZE = 64

for step in range(NUM_STEPS):
    batch = next(train_dataset.iterator(batch_size=BATCH_SIZE))

    # Forward + backward pass
    train_state, metrics = train_state.apply_gradients(batch)

    if step % 500 == 0:
        print(f"Step {step}: loss={metrics['loss']:.4f}")

# 5. Save fine-tuned model
train_state.model.save_pretrained("/models/octo-my-robot/")
print("Fine-tuning complete!")

Data Collection Tips

# Collect data via teleoperation.
# At least ~50 demonstrations for a single task, 200+ for multi-task.

import numpy as np

# Each timestep of a demonstration needs:
demo = {
    "observation": {
        "image": np.array(...),       # Camera image (256x256 RGB)
        "proprio": np.array(...),     # Joint positions / end-effector pose
    },
    "action": np.array(...),          # 7D: [x, y, z, rx, ry, rz, gripper]
    "language_instruction": "pick up the red block",
}
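A quick sanity check before converting demos to RLDS saves debugging time later. This small validator (the keys match the dict above; the expected shapes are the ones this post assumes) catches the most common mistakes:

```python
import numpy as np

def validate_demo(demo, image_size=(256, 256), action_dim=7):
    """Check one demonstration timestep for the shapes an Octo-style pipeline expects."""
    img = demo["observation"]["image"]
    assert img.shape[:2] == image_size and img.shape[-1] == 3, \
        f"expected {image_size} RGB image, got {img.shape}"
    assert demo["action"].shape == (action_dim,), \
        f"expected {action_dim}D action, got {demo['action'].shape}"
    instr = demo["language_instruction"]
    assert isinstance(instr, str) and instr, "missing language instruction"
    return True

demo = {
    "observation": {
        "image": np.zeros((256, 256, 3), dtype=np.uint8),
        "proprio": np.zeros(7),
    },
    "action": np.zeros(7),
    "language_instruction": "pick up the red block",
}
validate_demo(demo)
```

Run it over every timestep of every episode before training; a single wrongly-shaped frame can crash a multi-hour fine-tuning run.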

[Figure: AI foundation model training for robot manipulation]

OpenVLA: Open-Source VLA Model (Stanford)

Architecture

OpenVLA (Kim et al., 2024) takes a different approach: instead of designing new architecture, build VLA on top of a powerful pre-trained VLM:

Visual Encoder:
  SigLIP (vision-language) + DINOv2 (spatial features)
       ↓
  Projector (MLP)
       ↓
  Llama 2 7B backbone
       ↓
  Output: Tokenized actions (256 bins per dimension)

OpenVLA = Prismatic VLM + robot action fine-tuning. This approach maximizes knowledge transfer from web-scale VLMs.

Comparing All Three Models

| | RT-2 | Octo | OpenVLA |
|---|---|---|---|
| Parameters | 55B (PaLI-X) | 27M / 93M | 7B |
| Open-source | No | Yes | Yes |
| Architecture | VLM + tokenized actions | Custom transformer + diffusion | VLM + tokenized actions |
| Training data | RT-1 + web | Open X-Embodiment (800K) | Open X-Embodiment (970K) |
| Action output | Deterministic | Multi-modal (diffusion) | Deterministic |
| Fine-tune cost | N/A (closed) | 1x RTX 3090 (~2h) | 1x RTX 3090 (~4h with LoRA) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz |
| Success rate (29 tasks) | RT-2-X baseline | Competitive | +16.5% vs RT-2-X |
| Cross-embodiment | Limited (1 robot) | Strong (22 robots) | Strong (970K demos, multi-robot) |

Fine-tune OpenVLA with LoRA

"""
Fine-tune OpenVLA 7B with LoRA on consumer GPU
Requirements: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

# 1. Load OpenVLA from HuggingFace
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# 2. Apply LoRA — fine-tune only a small fraction of the parameters
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts (well under 1% of the 7B is trainable)

# 3. Prepare data
# OpenVLA expects: image + language instruction → action tokens
def prepare_sample(image, instruction, action):
    """
    image: PIL Image (256x256)
    instruction: str, e.g., "pick up the red block"
    action: np.array shape (7,) — [x, y, z, rx, ry, rz, gripper]
    """
    inputs = processor(
        images=image,
        text=f"In: What action should the robot take to {instruction}?\n",
        return_tensors="pt",
    )
    # Discretize the action into 256 bins per dimension
    # (tokenize_action is a placeholder for OpenVLA's mapping of bins
    #  onto reserved tokens in the Llama vocabulary)
    inputs["labels"] = tokenize_action(action, n_bins=256)
    return inputs

# 4. Training loop (simplified: no LR schedule or gradient accumulation;
#    dataloader is assumed to batch the outputs of prepare_sample)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
model.train()

for epoch in range(10):
    for batch in dataloader:
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Epoch {epoch}: loss={loss.item():.4f}")

# 5. Save LoRA weights
model.save_pretrained("/models/openvla-my-robot-lora/")

Cross-Embodiment Transfer — Train One Robot, Run Another

This is a game-changing capability: a model trained on robot A can transfer to a robot B with completely different kinematics.

Why It Works

Foundation models learn semantic understanding (what "pick up" means) separately from low-level control (how to move specific joints). The semantic part transfers across all robots.

Pre-trained knowledge (shared):
  "pick up" → move gripper above object → lower → close gripper → lift

Fine-tuned knowledge (robot-specific):
  Franka: 7 joints, position control, 1m reach
  WidowX: 6 joints, velocity control, 0.5m reach
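In practice, the robot-specific layer often reduces to per-robot action normalization: the shared policy emits actions in a normalized space, and a thin adapter maps them into each platform's units and limits. A minimal sketch (the workspace bounds below are illustrative, not taken from the Octo paper):

```python
import numpy as np

class ActionAdapter:
    """Map normalized policy actions in [-1, 1] to a specific robot's ranges."""
    def __init__(self, low, high):
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)

    def to_robot(self, normalized_action):
        a = np.clip(normalized_action, -1.0, 1.0)
        return self.low + (a + 1.0) / 2.0 * (self.high - self.low)

# Illustrative XYZ workspaces: Franka (~1 m reach) vs WidowX (~0.5 m reach)
franka = ActionAdapter(low=[-1.0] * 3, high=[1.0] * 3)
widowx = ActionAdapter(low=[-0.5] * 3, high=[0.5] * 3)

policy_output = np.array([0.5, -0.5, 0.0])    # shared, normalized action
franka_cmd = franka.to_robot(policy_output)   # scaled to the Franka workspace
widowx_cmd = widowx.to_robot(policy_output)   # same intent, half the reach
```

The semantic "move halfway right" stays shared; only the scaling (and, on real systems, the control mode and safety limits) changes per robot.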

Real Results from Octo Paper

| Transfer | Task | Zero-shot | Fine-tuned (50 demos) |
|---|---|---|---|
| WidowX → Franka | Pick/Place | 15% | 72% |
| Multi-robot → ALOHA | Bimanual | 8% | 65% |
| Multi-robot → UR5 | Assembly | 5% | 58% |

Zero-shot transfer (no fine-tuning) is still weak, but fine-tuning on just 50 demonstrations achieves good results — roughly 90% less effort than training from scratch, which needs 500+ demos.

When NOT to Use Foundation Models

Foundation models are powerful, but not a silver bullet:

  1. Real-time requirement (<10ms): VLA models run at 3-10 Hz inference, too slow for reactive control (obstacle avoidance, force control). Use classical controllers or small RL policies instead.

  2. High-precision tasks (<0.5mm): Assembly, soldering — need system identification + model-based control, not learned policies.

  3. Safety-critical applications: Foundation models are black boxes with no formal guarantees. Surgical robots, autonomous driving — need verifiable controllers.

  4. Limited compute: Octo-Small (27M) runs on Jetson, but OpenVLA (7B) needs a beefy GPU. If your robot only has a Raspberry Pi, use classical methods or small specialized models.

  5. Abundant training data for specific task: If you already have 10K demos for one specific task, simple behavior cloning might outperform fine-tuned foundation models.
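On point 1, action chunking partly bridges the rate gap: when the policy predicts a chunk of H future actions per inference call, the effective command rate is H times the inference rate. A back-of-the-envelope check (the controller-rate requirements are illustrative):

```python
def effective_command_rate(inference_hz, action_horizon):
    """Commands per second when each inference call emits a chunk of actions."""
    return inference_hz * action_horizon

def meets_rate(inference_hz, action_horizon, required_hz):
    return effective_command_rate(inference_hz, action_horizon) >= required_hz

# OpenVLA at ~6 Hz with a chunk of 4 actions gives 24 commands/s:
# enough for a 10 Hz position controller...
print(meets_rate(6, 4, required_hz=10))
# ...but chunking doesn't add reactivity: the policy still only *observes*
# at 6 Hz, so a 1 kHz force loop remains out of reach.
print(meets_rate(6, 4, required_hz=1000))
```

This is why the usual pattern is a VLA policy on top issuing chunked setpoints, with a classical high-rate controller underneath handling reactivity.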

The Future

Foundation models for robotics are at a stage comparable to GPT-2 in NLP: promising, but not yet production-ready for every use case.

If you're starting out, my recommendation: try Octo-Small first — open-source, lightweight, quick fine-tuning, good community support.

