Foundation Models for Robots — A Revolution Underway
Foundation models for robotics are completely changing how we program robots. Instead of writing thousands of lines of code for each specific task, robots can now learn from demonstrations and understand language instructions — similar to how ChatGPT understands text, but with physical actions as output.
In this post, I'll take a deep dive into three of the most important models: RT-2 (Google DeepMind), Octo (UC Berkeley), and OpenVLA (Stanford) — analyzing their architectures, comparing performance, and providing a guide to fine-tuning on your own robot.
What Are Robot Foundation Models?
If you're familiar with Large Language Models (LLMs) like GPT or Claude, robot foundation models work on similar principles but for the physical world:
- LLM: Text in → Text out
- VLM (Vision-Language Model): Image + Text in → Text out
- VLA (Vision-Language-Action): Image + Text in → Robot action out
Robot foundation models are VLA models — they take camera images + language instructions and output robot actions directly (gripper position, joint angles, velocity, etc.).
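To make the interface concrete, here is a toy sketch (entirely illustrative; the function name and shapes are my assumptions, not any real library's API):

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Dummy VLA: camera image + language instruction in, 7-DoF action out.
    A real model runs a neural network here; this stub just returns zeros."""
    assert image.shape == (256, 256, 3), "expects one RGB camera frame"
    return np.zeros(7)  # [x, y, z, rx, ry, rz, gripper]

action = vla_policy(np.zeros((256, 256, 3), dtype=np.uint8), "pick up the red cup")
```

Compare with an LLM (`str -> str`) or a VLM (`image, str -> str`): only the output type changes, but that change is what makes the model directly usable as a robot controller.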
Why Do We Need Foundation Models?
Previously, each robot task required:
- Collecting task-specific data (thousands of demonstrations)
- Training a dedicated model
- Deploying it for that specific robot
- Repeating for the next task
Foundation models solve this through pre-training on massive datasets from many different robots, then quick fine-tuning for specific tasks. Like how you fine-tune GPT for your domain instead of training from scratch.
RT-2: Vision-Language-Action Model (Google DeepMind)
Architecture
RT-2 (Brohan et al., 2023) was the first VLA model to achieve impressive results. The core idea is elegant: convert robot actions into text tokens and co-train with vision-language data.
Input: Camera image + "Pick up the red cup"
↓
PaLI-X (55B) or PaLM-E (12B) [Pre-trained VLM]
↓
Output: "1 128 91 241 1 128 91" [Tokenized actions]
↓
De-tokenize → [x, y, z, rx, ry, rz, gripper]
RT-2 tokenizes each action dimension into one of 256 bins (0-255) and represents it as text. This way, the model handles actions just like text tokens — no architecture changes needed.
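A minimal sketch of this binning scheme (my own illustration; RT-2's exact tokenizer and action ranges differ): clip each dimension to a normalized range, map it to one of 256 bins, and invert the mapping at inference time.

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action):
    """Map each continuous action dimension to a bin index in [0, 255]."""
    clipped = np.clip(action, LOW, HIGH)
    bins = (clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)
    return np.round(bins).astype(int)

def detokenize(tokens):
    """Map bin indices back to continuous values."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.1, -0.5, 0.9, 0.0, 0.0, 0.0, 1.0])  # [x, y, z, rx, ry, rz, gripper]
tokens = tokenize(action)
recovered = detokenize(tokens)
```

The round-trip quantization error is bounded by half a bin width, (HIGH - LOW) / (2 × 255) ≈ 0.004 in normalized units — fine-grained enough for most manipulation tasks.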
Highlight Results
| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Reasoning ("pick smallest") | 0% | 48% |
Key breakthrough: RT-2 shows emergent reasoning — understanding "pick up something you can use to clean a spill" (selecting a paper towel) without ever seeing this instruction in robot training data. Knowledge from web pre-training transfers to robot control.
Limitations
- 55B parameters — cannot run on edge devices
- Closed-source, weights not public
- Only tested on one robot type (Google's RT robot)
- Slow inference (~3 Hz)
Octo: Open-Source Generalist Robot Policy (UC Berkeley)
Architecture
Octo (Ghosh et al., 2024) is the open-source answer to RT-2. Instead of using a giant VLM, Octo uses a transformer architecture designed specifically for robotics:
Input tokens:
[Language] "Pick up the blue block"
[Image] Observation history (t-2, t-1, t)
[Action] Previous action history
↓
Transformer Backbone (readout tokens)
↓
Diffusion Action Head
↓
Output: Action distribution (multi-modal)
Key innovation: Octo uses a diffusion head instead of a regression head — allowing the model to predict multi-modal action distributions. For example: when there are multiple valid paths (left or right), the diffusion head captures both modes, while a regression head would average them out.
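A toy numeric illustration of the mode-averaging problem (my own example, not from the Octo paper): if steering left (x = -1) and steering right (x = +1) are both valid, an MSE-trained regression head converges to their mean, which is the one action that fails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations split evenly between two valid actions: steer left or steer right
demos = np.array([-1.0] * 50 + [1.0] * 50)

# An MSE-trained regression head converges to the mean of the data:
# 0.0, the invalid "straight into the obstacle" action
regression_pred = demos.mean()

# A distribution-matching head (a diffusion model in Octo; here simply
# sampling the empirical distribution) returns one of the valid modes
sampled_pred = rng.choice(demos)
```

This is exactly why Octo's diffusion head matters for manipulation data, where demonstrators often solve the same task in several distinct ways.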
Training Data
Octo trains on the Open X-Embodiment dataset — 800K+ robot episodes from 22 different robot platforms:
| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 Robot | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| ... | ... | ... | ... |
| Total | 22 robots | Diverse | 800K+ |
Two Versions
| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Performance | Baseline | +15% success rate |
| Fine-tune time | ~2 hours | ~4 hours |
| GPU requirement | 1x RTX 3090 | 1x A100 40GB |
Fine-tune Octo on Your Robot
Here's the practical part — fine-tune Octo-Small on custom robot data with just one consumer GPU:
"""
Fine-tune Octo-Small on custom robot dataset
Requirements: 1x RTX 3090/4090, 50-200 demonstrations
"""
import jax
import jax.numpy as jnp
from octo.model.octo_model import OctoModel
from octo.data.dataset import make_single_dataset
from octo.utils.train_utils import create_train_state
import optax
# 1. Load pretrained Octo-Small
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
print(f"Loaded Octo-Small: {sum(x.size for x in jax.tree.leaves(model.params))/1e6:.1f}M params")
# 2. Prepare dataset (RLDS format)
# Dataset must be in RLDS format (https://github.com/google-research/rlds)
dataset_config = {
    "name": "my_robot_dataset",
    "data_dir": "/data/my_robot/",
    "image_obs_keys": {"primary": "image"},
    "language_key": "language_instruction",
    "action_proprio_normalization_type": "normal",
}
train_dataset = make_single_dataset(
    dataset_kwargs=dataset_config,
    traj_transform_kwargs={
        "window_size": 2,     # Observation history length
        "action_horizon": 4,  # Predict 4 future actions (action chunking)
    },
    frame_transform_kwargs={
        "resize_size": {"primary": (256, 256)},
    },
    train=True,
)
# 3. Setup optimizer
# Low learning rate for fine-tuning — don't want to destroy pre-trained weights
optimizer = optax.adamw(
    learning_rate=3e-5,  # Much lower than training from scratch
    weight_decay=0.01,
    b1=0.9,
    b2=0.95,
)
train_state = create_train_state(
    rng=jax.random.PRNGKey(42),
    model=model,
    optimizer=optimizer,
)
# 4. Training loop
NUM_STEPS = 5000  # suits a ~50-demo dataset (each demo is roughly 100 frames)
BATCH_SIZE = 64
for step in range(NUM_STEPS):
    batch = next(train_dataset.iterator(batch_size=BATCH_SIZE))
    # Forward + backward pass (simplified; the real Octo finetune script
    # computes the diffusion loss and gradients explicitly)
    train_state, metrics = train_state.apply_gradients(batch)
    if step % 500 == 0:
        print(f"Step {step}: loss={metrics['loss']:.4f}")
# 5. Save fine-tuned model
train_state.model.save_pretrained("/models/octo-my-robot/")
print("Fine-tuning complete!")
Data Collection Tips
# Collect data via teleoperation
# Need at least 50 demonstrations for single task
# 200+ demos for multi-task
# Each demonstration needs:
import numpy as np

demo = {
    "observation": {
        "image": np.array(...),    # Camera image (256x256 RGB)
        "proprio": np.array(...),  # Joint positions / EE pose
    },
    "action": np.array(...),       # 7D: [x, y, z, rx, ry, rz, gripper]
    "language_instruction": "pick up the red block",
}
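One preprocessing step worth understanding is action normalization: the `"action_proprio_normalization_type": "normal"` option in the dataset config above corresponds to z-scoring each dimension with dataset statistics. A minimal sketch of what that does (my own code, with `normalize_actions` as a hypothetical helper, not part of the Octo API):

```python
import numpy as np

def normalize_actions(actions, eps=1e-8):
    """Z-score each action dimension using statistics over the whole dataset.
    actions: (N, 7) array of raw actions collected from all demos."""
    mean = actions.mean(axis=0)
    std = actions.std(axis=0)
    normed = (actions - mean) / (std + eps)
    # Keep mean/std around: predicted actions must be de-normalized at deployment
    return normed, mean, std

# Fake dataset standing in for your collected demos
raw_actions = np.random.default_rng(0).normal(loc=0.3, scale=0.1, size=(500, 7))
normed, mean, std = normalize_actions(raw_actions)
```

Store the mean and std alongside the checkpoint; forgetting to de-normalize predicted actions at deployment is a classic source of a robot that barely moves or slams into its limits.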
OpenVLA: Open-Source VLA Model (Stanford)
Architecture
OpenVLA (Kim et al., 2024) takes a different approach: instead of designing new architecture, build VLA on top of a powerful pre-trained VLM:
Visual Encoder:
SigLIP (vision-language) + DINOv2 (spatial features)
↓
Projector (MLP)
↓
Llama 2 7B backbone
↓
Output: Tokenized actions (256 bins per dimension)
OpenVLA = Prismatic VLM + robot action fine-tuning. This approach maximizes knowledge transfer from web-scale VLMs.
Comparing All 3 Models
| | RT-2 | Octo | OpenVLA |
|---|---|---|---|
| Parameters | 55B (PaLI-X) | 27M / 93M | 7B |
| Open-source | No | Yes | Yes |
| Architecture | VLM + tokenized actions | Custom transformer + diffusion | VLM + tokenized actions |
| Training data | RT-1 + web | Open X-Embodiment (800K) | Open X-Embodiment (970K) |
| Action output | Deterministic | Multi-modal (diffusion) | Deterministic |
| Fine-tune cost | N/A (closed) | 1x RTX 3090 (~2h) | 1x RTX 3090 (~4h with LoRA) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz |
| Success rate (29 tasks) | RT-2-X baseline | Competitive | +16.5% vs RT-2-X |
| Cross-embodiment | Limited (1 robot) | Strong (22 robots) | Strong (970K demos, multi-robot) |
Fine-tune OpenVLA with LoRA
"""
Fine-tune OpenVLA 7B with LoRA on consumer GPU
Requirements: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model
# 1. Load OpenVLA from HuggingFace
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# 2. Apply LoRA — fine-tune only a small fraction of the parameters
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13.1M || all params: 7.6B || trainable%: 0.17%
# 3. Prepare data
# OpenVLA expects: image + language instruction → action tokens
def prepare_sample(image, instruction, action):
    """
    image: PIL Image (256x256)
    instruction: str, e.g., "pick up the red block"
    action: np.array shape (7,) — [x, y, z, rx, ry, rz, gripper]
    """
    inputs = processor(
        images=image,
        text=f"In: What action should the robot take to {instruction}?\nOut:",
        return_tensors="pt",
    )
    # Tokenize action (256 bins per dimension);
    # tokenize_action is a helper you implement (see OpenVLA's action tokenizer)
    action_tokens = tokenize_action(action, n_bins=256)
    inputs["labels"] = action_tokens
    return inputs
# 4. Training loop (simplified)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
for epoch in range(10):
    for batch in dataloader:  # dataloader: build one from prepare_sample outputs
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
# 5. Save LoRA weights
model.save_pretrained("/models/openvla-my-robot-lora/")
Cross-Embodiment Transfer — Train One Robot, Run Another
This is a game-changing capability: a model trained on robot A can transfer to robot B with completely different kinematics.
Why It Works
Foundation models learn semantic understanding (what "pick up" means) separately from low-level control (how to move specific joints). The semantic part transfers across all robots.
Pre-trained knowledge (shared):
"pick up" → move gripper above object → lower → close gripper → lift
Fine-tuned knowledge (robot-specific):
Franka: 7 joints, position control, 1m reach
WidowX: 6 joints, velocity control, 0.5m reach
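The split above can be sketched schematically (entirely illustrative; real VLAs learn this decomposition implicitly inside the network rather than through explicit adapters like these):

```python
import numpy as np

def pick_up_waypoints(object_pos):
    """Shared semantic knowledge: approach above, lower, grasp, lift.
    Returns (task-space position, gripper command) pairs; 1.0 = open, 0.0 = closed."""
    above = object_pos + np.array([0.0, 0.0, 0.10])
    return [
        (above, 1.0),       # move gripper above object
        (object_pos, 1.0),  # lower
        (object_pos, 0.0),  # close gripper
        (above, 0.0),       # lift
    ]

def franka_adapter(waypoints):
    """Robot-specific knowledge: this Franka takes absolute position targets."""
    return [{"pos_target": p, "gripper": g} for p, g in waypoints]

def widowx_adapter(waypoints, dt=0.1, start=None):
    """Robot-specific knowledge: this WidowX takes velocity commands."""
    current = np.zeros(3) if start is None else start
    cmds = []
    for p, g in waypoints:
        cmds.append({"velocity": (p - current) / dt, "gripper": g})
        current = p
    return cmds

wps = pick_up_waypoints(np.array([0.4, 0.0, 0.05]))
franka_cmds = franka_adapter(wps)
widowx_cmds = widowx_adapter(wps)
```

The semantic layer (`pick_up_waypoints`) never changes across robots; only the thin adapter does — which is why 50 demonstrations are enough to fine-tune the robot-specific part.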
Real Results from Octo Paper
| Transfer | Task | Zero-shot | Fine-tuned (50 demos) |
|---|---|---|---|
| WidowX → Franka | Pick/Place | 15% | 72% |
| Multi-robot → ALOHA | Bimanual | 8% | 65% |
| Multi-robot → UR5 | Assembly | 5% | 58% |
Zero-shot performance (no fine-tuning) is still weak, but fine-tuning with just 50 demonstrations achieves good results — saving roughly 90% of the data-collection effort compared to training from scratch (which typically needs 500+ demos).
When NOT to Use Foundation Models
Foundation models are powerful, but not a silver bullet:
- Real-time requirements (<10ms): VLA models run at 3-10 Hz inference, too slow for reactive control (obstacle avoidance, force control). Use classical controllers or small RL policies instead.
- High-precision tasks (<0.5mm): Assembly, soldering — these need system identification + model-based control, not learned policies.
- Safety-critical applications: Foundation models are black boxes with no formal guarantees. Surgical robots, autonomous driving — these need verifiable controllers.
- Limited compute: Octo-Small (27M) runs on a Jetson, but OpenVLA (7B) needs a beefy GPU. If your robot only has a Raspberry Pi, use classical methods or small specialized models.
- Abundant training data for a specific task: If you already have 10K demos for one specific task, simple behavior cloning might outperform a fine-tuned foundation model.
The Future
Foundation models for robotics are at a stage equivalent to GPT-2 for NLP — promising but not yet production-ready for all use cases. Expected trends:
- Larger datasets: Open X-Embodiment v2 collecting more data
- Faster inference: Distillation and quantization for edge deployment
- Multi-modal: Adding tactile, force/torque sensing beyond vision
- Sim-to-real pre-training: Combining simulation with real data
If you're starting out, my recommendation: try Octo-Small first — open-source, lightweight, quick fine-tuning, good community support.
Related Articles
- Robotics Research Trends 2025 — Overview of the research landscape
- AI and Robotics 2025: Trends and Real-World Applications — AI applications in industrial robots
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robots — How to transfer models from simulation to actual hardware