Foundation Models for Robots — A Revolution Underway
Foundation models for robotics are completely changing how we program robots. Instead of writing thousands of lines of code for each specific task, robots can now learn from demonstrations and understand language instructions — similar to how ChatGPT understands text, but with physical actions as output.
In this post, I'll take a deep dive into three of the most important models: RT-2 (Google DeepMind), Octo (UC Berkeley), and OpenVLA (Stanford) — analyzing their architectures, comparing performance, and providing a guide to fine-tuning on your own robot.
What Are Robot Foundation Models?
If you're familiar with Large Language Models (LLMs) like GPT or Claude, robot foundation models work on similar principles but for the physical world:
- LLM: Text in → Text out
- VLM (Vision-Language Model): Image + Text in → Text out
- VLA (Vision-Language-Action): Image + Text in → Robot action out
Robot foundation models are VLA models — they take camera images + language instructions and output robot actions directly (gripper position, joint angles, velocity, etc.).
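To make the interface concrete, here is a toy sketch (entirely illustrative; the function name and shapes are my assumptions, not any real library's API):

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Dummy VLA: camera image + language instruction in, 7-DoF action out.
    A real model runs a neural network here; this stub just returns zeros."""
    assert image.shape == (256, 256, 3), "expects one RGB camera frame"
    return np.zeros(7)  # [x, y, z, rx, ry, rz, gripper]

action = vla_policy(np.zeros((256, 256, 3), dtype=np.uint8), "pick up the red cup")
```

Compare with an LLM (`str -> str`) or a VLM (`image, str -> str`): only the output type changes, but that change is what makes the model directly usable as a robot controller.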
Why Do We Need Foundation Models?
Previously, each robot task required:
- Collecting task-specific data (thousands of demonstrations)
- Training a dedicated model
- Deploying it for that specific robot
- Repeating for the next task
Foundation models solve this through pre-training on massive datasets from many different robots, then quick fine-tuning for specific tasks. Like how you fine-tune GPT for your domain instead of training from scratch.
RT-2: Vision-Language-Action Model (Google DeepMind)
Architecture
RT-2 (Brohan et al., 2023) was the first VLA model to achieve impressive results. The core idea is elegant: convert robot actions into text tokens and co-train with vision-language data.
Input: Camera image + "Pick up the red cup"
↓
PaLI-X (55B) or PaLM-E (12B) [Pre-trained VLM]
↓
Output: "1 128 91 241 1 128 91" [Tokenized actions]
↓
De-tokenize → [x, y, z, rx, ry, rz, gripper]
RT-2 tokenizes each action dimension into one of 256 bins (0-255) and represents it as text. This way, the model handles actions just like text tokens — no architecture changes needed.
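A minimal sketch of this binning scheme (my own illustration; RT-2's exact tokenizer and action ranges differ): clip each dimension to a normalized range, map it to one of 256 bins, and invert the mapping at inference time.

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action):
    """Map each continuous action dimension to a bin index in [0, 255]."""
    clipped = np.clip(action, LOW, HIGH)
    bins = (clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)
    return np.round(bins).astype(int)

def detokenize(tokens):
    """Map bin indices back to continuous values."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.1, -0.5, 0.9, 0.0, 0.0, 0.0, 1.0])  # [x, y, z, rx, ry, rz, gripper]
tokens = tokenize(action)
recovered = detokenize(tokens)
```

The round-trip quantization error is bounded by half a bin width, (HIGH - LOW) / (2 × 255) ≈ 0.004 in normalized units — fine-grained enough for most manipulation tasks.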
Highlight Results
| Capability | RT-1 | RT-2 (PaLI-X 55B) |
|---|---|---|
| Seen tasks | 95% | 95% |
| Unseen objects | 32% | 62% |
| Unseen backgrounds | 36% | 52% |
| Reasoning ("pick smallest") | 0% | 48% |
Key breakthrough: RT-2 shows emergent reasoning — understanding "pick up something you can use to clean a spill" (selecting a paper towel) without ever seeing this instruction in robot training data. Knowledge from web pre-training transfers to robot control.
Limitations
- 55B parameters — cannot run on edge devices
- Closed-source, weights not public
- Only tested on one robot type (Google's RT robot)
- Slow inference (~3 Hz)
Octo: Open-Source Generalist Robot Policy (UC Berkeley)
Architecture
Octo (Ghosh et al., 2024) is the open-source answer to RT-2. Instead of using a giant VLM, Octo uses a transformer architecture designed specifically for robotics:
Input tokens:
[Language] "Pick up the blue block"
[Image] Observation history (t-2, t-1, t)
[Action] Previous action history
↓
Transformer Backbone (readout tokens)
↓
Diffusion Action Head
↓
Output: Action distribution (multi-modal)
Key innovation: Octo uses a diffusion head instead of a regression head — allowing the model to predict multi-modal action distributions. For example: when there are multiple valid paths (left or right), the diffusion head captures both modes, while a regression head would average them out.
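A toy numeric illustration of the mode-averaging problem (my own example, not from the Octo paper): if steering left (x = -1) and steering right (x = +1) are both valid, an MSE-trained regression head converges to their mean, which is the one action that fails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations split evenly between two valid actions: steer left or steer right
demos = np.array([-1.0] * 50 + [1.0] * 50)

# An MSE-trained regression head converges to the mean of the data:
# 0.0, the invalid "straight into the obstacle" action
regression_pred = demos.mean()

# A distribution-matching head (a diffusion model in Octo; here simply
# sampling the empirical distribution) returns one of the valid modes
sampled_pred = rng.choice(demos)
```

This is exactly why Octo's diffusion head matters for manipulation data, where demonstrators often solve the same task in several distinct ways.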
Training Data
Octo trains on the Open X-Embodiment dataset — 800K+ robot episodes from 22 different robot platforms:
| Dataset | Robot | Tasks | Episodes |
|---|---|---|---|
| Bridge V2 | WidowX | Manipulation | 60K |
| RT-1 Robot | Google RT | Pick/Place | 130K |
| Taco Play | Franka | Language-conditioned | 6K |
| Kuka | Kuka iiwa | Stacking, insertion | 516K |
| ... | ... | ... | ... |
| Total | 22 robots | Diverse | 800K+ |
Two Versions
| | Octo-Small | Octo-Base |
|---|---|---|
| Parameters | 27M | 93M |
| Performance | Baseline | +15% success rate |
| Fine-tune time | ~2 hours | ~4 hours |
| GPU requirement | 1x RTX 3090 | 1x A100 40GB |
Fine-tune Octo on Your Robot
Here's the practical part — fine-tune Octo-Small on custom robot data with just one consumer GPU:
"""
Fine-tune Octo-Small on custom robot dataset
Requirements: 1x RTX 3090/4090, 50-200 demonstrations
"""
import jax
import jax.numpy as jnp
from octo.model.octo_model import OctoModel
from octo.data.dataset import make_single_dataset
from octo.utils.train_utils import create_train_state
import optax
# 1. Load pretrained Octo-Small
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
print(f"Loaded Octo-Small: {sum(x.size for x in jax.tree.leaves(model.params))/1e6:.1f}M params")
# 2. Prepare dataset (RLDS format)
# Dataset must be in RLDS format (https://github.com/google-research/rlds)
dataset_config = {
    "name": "my_robot_dataset",
    "data_dir": "/data/my_robot/",
    "image_obs_keys": {"primary": "image"},
    "language_key": "language_instruction",
    "action_proprio_normalization_type": "normal",
}
train_dataset = make_single_dataset(
    dataset_kwargs=dataset_config,
    traj_transform_kwargs={
        "window_size": 2,     # Observation history length
        "action_horizon": 4,  # Predict 4 future actions (action chunking)
    },
    frame_transform_kwargs={
        "resize_size": {"primary": (256, 256)},
    },
    train=True,
)
# 3. Setup optimizer
# Low learning rate for fine-tuning — don't want to destroy pre-trained weights
optimizer = optax.adamw(
    learning_rate=3e-5,  # Much lower than training from scratch
    weight_decay=0.01,
    b1=0.9,
    b2=0.95,
)
train_state = create_train_state(
    rng=jax.random.PRNGKey(42),
    model=model,
    optimizer=optimizer,
)
# 4. Training loop
NUM_STEPS = 5000  # suits a ~50-demo dataset (each demo is roughly 100 frames)
BATCH_SIZE = 64
for step in range(NUM_STEPS):
    batch = next(train_dataset.iterator(batch_size=BATCH_SIZE))
    # Forward + backward pass (simplified; the real Octo finetune script
    # computes the diffusion loss and gradients explicitly)
    train_state, metrics = train_state.apply_gradients(batch)
    if step % 500 == 0:
        print(f"Step {step}: loss={metrics['loss']:.4f}")
# 5. Save fine-tuned model
train_state.model.save_pretrained("/models/octo-my-robot/")
print("Fine-tuning complete!")
Data Collection Tips
# Collect data via teleoperation
# Need at least 50 demonstrations for single task
# 200+ demos for multi-task
# Each demonstration needs:
import numpy as np

demo = {
    "observation": {
        "image": np.array(...),    # Camera image (256x256 RGB)
        "proprio": np.array(...),  # Joint positions / EE pose
    },
    "action": np.array(...),       # 7D: [x, y, z, rx, ry, rz, gripper]
    "language_instruction": "pick up the red block",
}
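One preprocessing step worth understanding is action normalization: the `"action_proprio_normalization_type": "normal"` option in the dataset config above corresponds to z-scoring each dimension with dataset statistics. A minimal sketch of what that does (my own code, with `normalize_actions` as a hypothetical helper, not part of the Octo API):

```python
import numpy as np

def normalize_actions(actions, eps=1e-8):
    """Z-score each action dimension using statistics over the whole dataset.
    actions: (N, 7) array of raw actions collected from all demos."""
    mean = actions.mean(axis=0)
    std = actions.std(axis=0)
    normed = (actions - mean) / (std + eps)
    # Keep mean/std around: predicted actions must be de-normalized at deployment
    return normed, mean, std

# Fake dataset standing in for your collected demos
raw_actions = np.random.default_rng(0).normal(loc=0.3, scale=0.1, size=(500, 7))
normed, mean, std = normalize_actions(raw_actions)
```

Store the mean and std alongside the checkpoint; forgetting to de-normalize predicted actions at deployment is a classic source of a robot that barely moves or slams into its limits.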
OpenVLA: Open-Source VLA Model (Stanford)
Architecture
OpenVLA (Kim et al., 2024) takes a different approach: instead of designing new architecture, build VLA on top of a powerful pre-trained VLM:
Visual Encoder:
SigLIP (vision-language) + DINOv2 (spatial features)
↓
Projector (MLP)
↓
Llama 2 7B backbone
↓
Output: Tokenized actions (256 bins per dimension)
OpenVLA = Prismatic VLM + robot action fine-tuning. This approach maximizes knowledge transfer from web-scale VLMs.
Comparing All 3 Models
| | RT-2 | Octo | OpenVLA |
|---|---|---|---|
| Parameters | 55B (PaLI-X) | 27M / 93M | 7B |
| Open-source | No | Yes | Yes |
| Architecture | VLM + tokenized actions | Custom transformer + diffusion | VLM + tokenized actions |
| Training data | RT-1 + web | Open X-Embodiment (800K) | Open X-Embodiment (970K) |
| Action output | Deterministic | Multi-modal (diffusion) | Deterministic |
| Fine-tune cost | N/A (closed) | 1x RTX 3090 (~2h) | 1x RTX 3090 (~4h with LoRA) |
| Inference speed | ~3 Hz | ~10 Hz | ~6 Hz |
| Success rate (29 tasks) | RT-2-X baseline | Competitive | +16.5% vs RT-2-X |
| Cross-embodiment | Limited (1 robot) | Strong (22 robots) | Strong (970K demos, multi-robot) |
Fine-tune OpenVLA with LoRA
"""
Fine-tune OpenVLA 7B with LoRA on consumer GPU
Requirements: 1x RTX 3090/4090 (24GB VRAM)
"""
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model
# 1. Load OpenVLA from HuggingFace
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# 2. Apply LoRA — fine-tune only a small fraction of the parameters
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13.1M || all params: 7.6B || trainable%: 0.17%
# 3. Prepare data
# OpenVLA expects: image + language instruction → action tokens
def prepare_sample(image, instruction, action):
    """
    image: PIL Image (256x256)
    instruction: str, e.g., "pick up the red block"
    action: np.array shape (7,) — [x, y, z, rx, ry, rz, gripper]
    """
    inputs = processor(
        images=image,
        text=f"In: What action should the robot take to {instruction}?\nOut:",
        return_tensors="pt",
    )
    # Tokenize action (256 bins per dimension);
    # tokenize_action is a helper you implement (see OpenVLA's action tokenizer)
    action_tokens = tokenize_action(action, n_bins=256)
    inputs["labels"] = action_tokens
    return inputs
# 4. Training loop (simplified)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
for epoch in range(10):
    for batch in dataloader:  # dataloader: build one from prepare_sample outputs
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
# 5. Save LoRA weights
model.save_pretrained("/models/openvla-my-robot-lora/")
Cross-Embodiment Transfer — Train One Robot, Run Another
This is a game-changing capability: a model trained on robot A can transfer to robot B with completely different kinematics.
Why It Works
Foundation models learn semantic understanding (what "pick up" means) separately from low-level control (how to move specific joints). The semantic part transfers across all robots.
Pre-trained knowledge (shared):
"pick up" → move gripper above object → lower → close gripper → lift
Fine-tuned knowledge (robot-specific):
Franka: 7 joints, position control, 1m reach
WidowX: 6 joints, velocity control, 0.5m reach
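The split above can be sketched schematically (entirely illustrative; real VLAs learn this decomposition implicitly inside the network rather than through explicit adapters like these):

```python
import numpy as np

def pick_up_waypoints(object_pos):
    """Shared semantic knowledge: approach above, lower, grasp, lift.
    Returns (task-space position, gripper command) pairs; 1.0 = open, 0.0 = closed."""
    above = object_pos + np.array([0.0, 0.0, 0.10])
    return [
        (above, 1.0),       # move gripper above object
        (object_pos, 1.0),  # lower
        (object_pos, 0.0),  # close gripper
        (above, 0.0),       # lift
    ]

def franka_adapter(waypoints):
    """Robot-specific knowledge: this Franka takes absolute position targets."""
    return [{"pos_target": p, "gripper": g} for p, g in waypoints]

def widowx_adapter(waypoints, dt=0.1, start=None):
    """Robot-specific knowledge: this WidowX takes velocity commands."""
    current = np.zeros(3) if start is None else start
    cmds = []
    for p, g in waypoints:
        cmds.append({"velocity": (p - current) / dt, "gripper": g})
        current = p
    return cmds

wps = pick_up_waypoints(np.array([0.4, 0.0, 0.05]))
franka_cmds = franka_adapter(wps)
widowx_cmds = widowx_adapter(wps)
```

The semantic layer (`pick_up_waypoints`) never changes across robots; only the thin adapter does — which is why 50 demonstrations are enough to fine-tune the robot-specific part.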
Real Results from Octo Paper
| Transfer | Task | Zero-shot | Fine-tuned (50 demos) |
|---|---|---|---|
| WidowX → Franka | Pick/Place | 15% | 72% |
| Multi-robot → ALOHA | Bimanual | 8% | 65% |
| Multi-robot → UR5 | Assembly | 5% | 58% |
Zero-shot performance (no fine-tuning) is still weak, but fine-tuning with just 50 demonstrations achieves good results — saving roughly 90% of the data-collection effort compared to training from scratch (which typically needs 500+ demos).
When NOT to Use Foundation Models
Foundation models are powerful, but not a silver bullet:
- Real-time requirements (<10ms): VLA models run at 3-10 Hz inference, too slow for reactive control (obstacle avoidance, force control). Use classical controllers or small RL policies instead.
- High-precision tasks (<0.5mm): Assembly, soldering — these need system identification + model-based control, not learned policies.
- Safety-critical applications: Foundation models are black boxes with no formal guarantees. Surgical robots, autonomous driving — these need verifiable controllers.
- Limited compute: Octo-Small (27M) runs on a Jetson, but OpenVLA (7B) needs a beefy GPU. If your robot only has a Raspberry Pi, use classical methods or small specialized models.
- Abundant training data for a specific task: If you already have 10K demos for one specific task, simple behavior cloning might outperform a fine-tuned foundation model.
The Future
Foundation models for robotics are at a stage equivalent to GPT-2 for NLP — promising but not yet production-ready for all use cases. Expected trends:
- Larger datasets: Open X-Embodiment v2 collecting more data
- Faster inference: Distillation and quantization for edge deployment
- Multi-modal: Adding tactile, force/torque sensing beyond vision
- Sim-to-real pre-training: Combining simulation with real data
If you're starting out, my recommendation: try Octo-Small first — open-source, lightweight, quick fine-tuning, good community support.
Related Articles
- Robotics Research Trends 2025 — Overview of the research landscape
- AI and Robotics 2025: Trends and Real-World Applications — AI applications in industrial robots
- Sim-to-Real Transfer: Train in Simulation, Run on Real Robots — How to transfer models from simulation to actual hardware