Foundation Models for Robots: RT-2, Octo, and OpenVLA in Practice

Foundation Models for Robots: The Revolution Underway

Foundation models for robots are changing how we program them. Instead of writing thousands of lines of code for each specific task, a robot can now learn from demonstrations and follow language instructions, much as ChatGPT understands text, except that the output is a physical action.

In this post I take a deep dive into three of the most important models: RT-2 (Google DeepMind), Octo (UC Berkeley), and OpenVLA (Stanford). I cover their architectures, compare performance, and walk through fine-tuning on your own robot.

Figure: Robot manipulation learning with AI foundation models

What Are Robot Foundation Models?

If you are familiar with LLMs (Large Language Models) such as GPT or Claude, robot foundation models work on a similar principle, but for the physical world:

LLM: Text in → Text out
VLM (Vision-Language Model): Image + Text in → Text out
VLA (Vision-Language-Action): Image + Text in → Robot action out

Robot foundation models are VLA models: they take a camera image plus a language instruction and directly output robot actions (gripper position, joint angles, velocities, and so on).

Why do we need foundation models?

Previously, every robot task required you to:

Collect task-specific data (thousands of demonstrations)
Train a dedicated model
Deploy it on that exact robot
Repeat for the next task

Foundation models address this by pre-training on a massive dataset drawn from many different robots, then fine-tuning quickly for a specific task. It is the same idea as fine-tuning GPT for your own domain instead of training from scratch.

RT-2: Vision-Language-Action Model (Google DeepMind)

Architecture

RT-2 (Brohan et al., 2023) was the first VLA model to achieve impressive results. The core idea is simple: represent robot actions as text tokens and co-train on them together with vision-language data.

Input:  Camera image + "Pick up the red cup"
   ↓
PaLI-X (55B) or PaLM-E (12B)  [Pre-trained VLM]
   ↓
Output: "1 128 91 241 1 128 91"  [Tokenized actions]
   ↓
De-tokenize → [x, y, z, rx, ry, rz, gripper]

RT-2 discretizes each action dimension into one of 256 bins (0-255) and writes the bin indices as text. The model can therefore treat actions exactly like text tokens, with no change to the architecture.
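The binning step can be sketched as follows. This is a minimal illustration rather than RT-2's actual code: the normalized range [-1, 1] and the helper names `tokenize_action` and `detokenize_action` are assumptions made for the example.

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range (hypothetical)

def tokenize_action(action, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # rescale to [0, 1]
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return " ".join(str(b) for b in bins)

def detokenize_action(text, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map bin indices back to the center value of each bin."""
    bins = np.array([int(t) for t in text.split()], dtype=np.float64)
    return low + (bins + 0.5) / n_bins * (high - low)

# 7-D action: [x, y, z, rx, ry, rz, gripper]
action = np.array([0.0, -1.0, 1.0, 0.25, -0.5, 0.1, 1.0])
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
print(tokens)
print(np.round(recovered, 3))
```

The round trip is lossy: each dimension is recovered only to within half a bin width, which is the price of treating actions as a small discrete vocabulary.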

Key results

Capability	RT-1	RT-2 (PaLI-X 55B)
Seen tasks	95%	95%
Unseen objects	32%	62%
Unseen backgrounds	36%	52%
Reasoning ("pick smallest")	0%	48%

The breakthrough: RT-2 shows emergent reasoning. It can handle "pick up something you can use to clean a spill" (it picks the paper towel) even though that instruction never appeared in the robot training data. Knowledge from web pre-training transfers to robot control.

Limitations

A 55B-parameter model cannot run on an edge device
Closed-source, with no public weights
Evaluated on only one robot type (Google's RT robot)
Slow inference (~3 Hz)

Octo: Open-Source Generalist Robot Policy (UC Berkeley)

Architecture

Octo (Ghosh et al., 2024) is the open-source answer to RT-2. Rather than relying on a giant VLM, Octo uses a transformer architecture designed specifically for robotics:

Input tokens:
  [Language] "Pick up the blue block"
  [Image]    Observation history (t-2, t-1, t)
  [Action]   Previous action history
       ↓
  Transformer Backbone (readout tokens)
       ↓
  Diffusion Action Head
       ↓
  Output: Action distribution (multi-modal)

What sets Octo apart is its diffusion head in place of a regression head, which lets the model predict multi-modal action distributions. For example, when an obstacle can be passed on the left or on the right, a diffusion head captures both modes, whereas a regression head outputs their average (driving straight into the obstacle).
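The averaging problem is easy to reproduce numerically. The sketch below is an illustration under simple assumptions, not Octo's diffusion head: the demonstrations are synthetic steering values, and resampling from them stands in for a generative head that has learned the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: half steer left (-1.0), half steer right (+1.0).
demos = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# A regression head trained with MSE converges to the mean of its targets,
# which here lies between the two modes (i.e. straight ahead).
regression_prediction = demos.mean()

# A generative head samples from the learned distribution instead.
# Resampling the demonstrations is a stand-in for that sampling step.
sampled_actions = rng.choice(demos, size=5)

print(f"regression head: {regression_prediction:+.2f}")
print("sampled actions:", np.round(sampled_actions, 2))
```

Every sampled action commits to one side, while the regression output sits near zero, a value that no demonstration ever contained.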

Training Data

Octo is trained on the Open X-Embodiment dataset, with 800K+ robot episodes from 22 different robot platforms:

Dataset	Robot	Tasks	Episodes
Bridge V2	WidowX	Manipulation	60K
RT-1 Robot	Google RT	Pick/Place	130K
Taco Play	Franka	Language-conditioned	6K
Kuka	Kuka iiwa	Stacking, insertion	516K
...	...	...	...
Total	22 robots	Diverse	800K+

Two versions

	Octo-Small	Octo-Base
Parameters	27M	93M
Performance	Baseline	+15% success rate
Fine-tune time	~2 hours	~4 hours
GPU requirement	1x RTX 3090	1x A100 40GB

Fine-tuning Octo on your robot

This is the hands-on part: fine-tuning Octo-Small on custom robot data with a single consumer GPU:

"""
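Fine-tune Octo-Small on a custom robot dataset.
Requirements: 1x RTX 3090/4090, 50-200 demonstrations.
The structure follows the fine-tuning example in the Octo repository;
check the names against the Octo version you have installed.
"""
import jax
import optax
from octo.data.dataset import make_single_dataset
from octo.model.octo_model import OctoModel
from octo.utils.train_utils import TrainState, process_text

# 1. Load pretrained Octo-Small
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
n_params = sum(x.size for x in jax.tree_util.tree_leaves(model.params))
print(f"Loaded Octo-Small: {n_params / 1e6:.1f}M params")

# 2. Prepare the dataset
# The dataset must be in RLDS format (https://github.com/google-research/rlds)
dataset_config = {
    "name": "my_robot_dataset",
    "data_dir": "/data/my_robot/",
    "image_obs_keys": {"primary": "image"},
    "language_key": "language_instruction",
    "action_proprio_normalization_type": "normal",
}

train_dataset = make_single_dataset(
    dataset_kwargs=dataset_config,
    traj_transform_kwargs={
        "window_size": 2,       # Observation history length
        "action_horizon": 4,    # Predict 4 future actions (action chunking)
    },
    frame_transform_kwargs={
        "resize_size": {"primary": (256, 256)},
    },
    train=True,
)

BATCH_SIZE = 64
# Build the iterator once; creating it inside the loop would restart it every step.
train_iter = (
    train_dataset.repeat().unflatten().shuffle(10000).batch(BATCH_SIZE).iterator()
)

def process_batch(batch):
    # Tokenize language instructions with the pretrained text processor.
    batch = process_text(batch, model.text_processor)
    batch.pop("dataset_name", None)  # string fields cannot be passed through jit
    return batch

train_iter = map(process_batch, train_iter)

# 3. Set up the optimizer
# Low learning rate for fine-tuning so the pre-trained weights are not destroyed
optimizer = optax.adamw(learning_rate=3e-5, weight_decay=0.01, b1=0.9, b2=0.95)
train_state = TrainState.create(rng=jax.random.PRNGKey(42), model=model, tx=optimizer)

# 4. Training loop
def loss_fn(params, batch, rng, train=True):
    bound = model.module.bind({"params": params}, rngs={"dropout": rng})
    embeddings = bound.octo_transformer(
        batch["observation"],
        batch["task"],
        batch["observation"]["timestep_pad_mask"],
        train=train,
    )
    return bound.heads["action"].loss(
        embeddings,
        batch["action"],
        batch["observation"]["timestep_pad_mask"],
        batch["action_pad_mask"],
        train=train,
    )

@jax.jit
def train_step(state, batch):
    rng, dropout_rng = jax.random.split(state.rng)
    (loss, _info), grads = jax.value_and_grad(loss_fn, has_aux=True)(
        state.model.params, batch, dropout_rng
    )
    return state.apply_gradients(grads=grads, rng=rng), loss

# 5000 gradient steps. With ~5K frames (50 demos x ~100 steps) and batch size 64,
# that is roughly 64 passes over the data.
NUM_STEPS = 5000

for step in range(NUM_STEPS):
    batch = next(train_iter)
    train_state, loss = train_step(train_state, batch)

    if step % 500 == 0:
        print(f"Step {step}: loss={float(loss):.4f}")

# 5. Save the fine-tuned model
train_state.model.save_pretrained(step=NUM_STEPS, checkpoint_path="/models/octo-my-robot/")
print("Fine-tuning complete!")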
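To make `window_size` and `action_horizon` concrete, here is a small NumPy sketch of how a trajectory might be cut into training samples. The helper `make_chunks` is hypothetical and only illustrates the idea; Octo's data pipeline does this internally and also tracks padding masks, which are omitted here.

```python
import numpy as np

def make_chunks(observations, actions, window_size=2, action_horizon=4):
    """Pair each timestep's observation window with the next few actions."""
    num_steps = len(actions)
    samples = []
    for t in range(num_steps):
        # Observation history ending at t, clamped at the trajectory start.
        obs_idx = np.clip(np.arange(t - window_size + 1, t + 1), 0, num_steps - 1)
        # Future actions starting at t, clamped at the trajectory end.
        act_idx = np.clip(np.arange(t, t + action_horizon), 0, num_steps - 1)
        samples.append((observations[obs_idx], actions[act_idx]))
    return samples

T = 6
observations = np.arange(T)                # stand-in for a sequence of images
actions = np.arange(T * 7).reshape(T, 7)   # one 7-D action per timestep

samples = make_chunks(observations, actions)
obs_window, action_chunk = samples[3]
print(obs_window)          # observations at t-1 and t
print(action_chunk.shape)  # (action_horizon, action_dim)
```

Predicting a short chunk of future actions per observation, rather than a single action, is what the `action_horizon` comment in the config refers to as action chunking.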

Data collection tips

import numpy as np

# Collect data via teleoperation
# At least 50 demonstrations for a single task
# 200+ demos for multi-task

# Each timestep of a demonstration contains:
demo = {
    "observation": {
        "image": np.zeros((256, 256, 3), dtype=np.uint8),  # Camera image (256x256 RGB)
        "proprio": np.zeros(7, dtype=np.float32),          # Joint positions / EE pose (size 7 is a placeholder)
    },
    "action": np.zeros(7, dtype=np.float32),               # 7D: [x, y, z, rx, ry, rz, gripper]
    "language_instruction": "pick up the red block",
}
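Before converting recordings to RLDS, it can help to sanity-check each timestep against the layout above. The checker below is a hypothetical helper, not part of Octo; the expected image shape and action dimension are taken from the comments in the dictionary above.

```python
import numpy as np

def validate_step(step, image_shape=(256, 256, 3), action_dim=7):
    """Return a list of problems found in one timestep (empty list means OK)."""
    problems = []
    image = step["observation"]["image"]
    if image.shape != image_shape or image.dtype != np.uint8:
        problems.append(
            f"image should be uint8 {image_shape}, got {image.dtype} {image.shape}"
        )
    action = step["action"]
    if action.shape != (action_dim,):
        problems.append(f"action should have shape ({action_dim},), got {action.shape}")
    elif not np.all(np.isfinite(action)):
        problems.append("action contains NaN or inf")
    if not step["language_instruction"].strip():
        problems.append("language_instruction is empty")
    return problems

good_step = {
    "observation": {"image": np.zeros((256, 256, 3), dtype=np.uint8)},
    "action": np.zeros(7, dtype=np.float32),
    "language_instruction": "pick up the red block",
}
bad_step = {
    "observation": {"image": np.zeros((480, 640, 3), dtype=np.uint8)},
    "action": np.full(7, np.nan, dtype=np.float32),
    "language_instruction": " ",
}

print(validate_step(good_step))
for problem in validate_step(bad_step):
    print("-", problem)
```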

Figure: AI foundation model training for robot manipulation

OpenVLA: Open-Source VLA Model (Stanford)

Architecture

OpenVLA (Kim et al., 2024) takes a different route: instead of designing a new architecture, it builds a VLA on top of a strong pre-trained VLM:

Visual Encoder:
  SigLIP (vision-language) + DINOv2 (spatial features)
       ↓
  Projector (MLP)
       ↓
  Llama 2 7B backbone
       ↓
  Output: Tokenized actions (256 bins per dimension)

OpenVLA = Prismatic VLM + robot action fine-tuning. This approach makes the most of the pre-trained knowledge in a web-scale VLM.
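At the level of tensor shapes, the visual pathway in the diagram can be sketched as below. This is an illustration with random weights, not the OpenVLA implementation: the patch count and encoder widths are assumed values, and the two-layer MLP is a simplification of whatever projector the real model uses.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_PATCHES = 256   # assumed: 224x224 input split into 14x14-pixel patches
SIGLIP_DIM = 1152   # assumed SigLIP feature width
DINOV2_DIM = 1024   # assumed DINOv2 feature width
LLM_DIM = 4096      # Llama 2 7B hidden size

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(features, w1, w2):
    """Two-layer MLP mapping fused patch features into the LLM embedding space."""
    return gelu(features @ w1) @ w2

# Stand-ins for the per-patch outputs of the two vision encoders.
siglip_features = rng.standard_normal((NUM_PATCHES, SIGLIP_DIM), dtype=np.float32)
dinov2_features = rng.standard_normal((NUM_PATCHES, DINOV2_DIM), dtype=np.float32)

# Fuse by concatenating along the feature axis, one vector per image patch.
fused = np.concatenate([siglip_features, dinov2_features], axis=-1)

w1 = 0.02 * rng.standard_normal((fused.shape[-1], LLM_DIM), dtype=np.float32)
w2 = 0.02 * rng.standard_normal((LLM_DIM, LLM_DIM), dtype=np.float32)
visual_tokens = project(fused, w1, w2)

print(fused.shape)          # one fused feature vector per patch
print(visual_tokens.shape)  # one LLM-sized embedding per patch
```

The projected patch embeddings are what the language model backbone consumes alongside the embedded instruction tokens.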

Comparing the three models

	RT-2	Octo	OpenVLA
Parameters	55B (PaLI-X)	27M / 93M	7B
Open-source	No	Yes	Yes
Architecture	VLM + tokenized actions	Custom transformer + diffusion	VLM + tokenized actions
Training data	RT-1 + web	Open X-Embodiment (800K)	Open X-Embodiment (970K)
Action output	Deterministic	Multi-modal (diffusion)	Deterministic
Fine-tune cost	N/A (closed)	1x RTX 3090 (~2h)	1x RTX 3090 (~4h with LoRA)
Inference speed	~3 Hz	~10 Hz	~6 Hz
Success rate (29 tasks)	RT-2-X baseline	Competitive	+16.5 percentage points vs RT-2-X
Cross-embodiment	Limited (1 robot)	Strong (22 robots)	Strong (970K demos, multi-robot)

Fine-tuning OpenVLA with LoRA

"""
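Fine-tune OpenVLA 7B with LoRA on a consumer GPU (simplified sketch).
Stated requirement: 1x RTX 3090/4090 (24 GB VRAM). Memory is tight at this
size, so the loop below uses batch size 1.
"""
import numpy as np
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load OpenVLA from Hugging Face
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# 2. Apply LoRA: only a small fraction of the parameters is trained
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# 3. Prepare data
# OpenVLA expects: image + language instruction -> action tokens
def tokenize_action(action, n_bins=256):
    """
    Simplified stand-in for OpenVLA's action tokenizer. Assumes actions are
    already normalized to [-1, 1]; each dimension maps to one of the last
    n_bins token ids of the vocabulary.
    """
    bins = np.linspace(-1.0, 1.0, n_bins)
    discretized = np.digitize(np.clip(action, -1.0, 1.0), bins)
    return torch.tensor(processor.tokenizer.vocab_size - discretized, dtype=torch.long)

def prepare_sample(image, instruction, action):
    """
    image: PIL Image (256x256)
    instruction: str, e.g., "pick up the red block"
    action: np.array shape (7,), [x, y, z, rx, ry, rz, gripper]
    """
    prompt = f"In: What action should the robot take to {instruction}?\nOut:"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_ids = inputs["input_ids"]
    action_ids = tokenize_action(action).unsqueeze(0)
    input_ids = torch.cat([prompt_ids, action_ids], dim=1)
    # Only the action tokens contribute to the loss; -100 masks out the prompt.
    labels = torch.cat([torch.full_like(prompt_ids, -100), action_ids], dim=1)
    return {
        "input_ids": input_ids.to(device),
        "attention_mask": torch.ones_like(input_ids).to(device),
        "pixel_values": inputs["pixel_values"].to(device, dtype=torch.bfloat16),
        "labels": labels.to(device),
    }

# 4. Training loop (simplified, batch size 1)
# `samples` is a placeholder: fill it with your own (image, instruction, action) tuples.
samples = []
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-5, weight_decay=0.01)

model.train()
for epoch in range(10):
    last_loss = None
    for image, instruction, action in samples:
        outputs = model(**prepare_sample(image, instruction, action))
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        last_loss = loss.item()

    if last_loss is not None:
        print(f"Epoch {epoch}: loss={last_loss:.4f}")

# 5. Save LoRA weights
model.save_pretrained("/models/openvla-my-robot-lora/")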
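As a rough cross-check of how small the LoRA update is, the adapter size can be estimated by hand. The sketch assumes the four targeted projections match only inside the Llama 2 7B backbone (hidden size 4096, 32 layers); if modules with the same names exist elsewhere in the model, the real count will be higher.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)

HIDDEN = 4096   # Llama 2 7B hidden size
LAYERS = 32     # Llama 2 7B decoder layers
RANK = 32       # matches r=32 in the LoRA config
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]  # each 4096 x 4096 in Llama 2 7B

per_layer = sum(lora_param_count(HIDDEN, HIDDEN, RANK) for _ in TARGETS)
total = per_layer * LAYERS

print(f"{per_layer:,} LoRA parameters per layer")
print(f"{total / 1e6:.1f}M LoRA parameters in total")
```

Under these assumptions the adapters come to about 33.6M parameters, well under 1% of a model with roughly 7B weights, which is why LoRA fine-tuning needs far less optimizer memory than full fine-tuning.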

Cross-Embodiment Transfer: Train on One Robot, Run on Another

This is the game-changing capability: a model trained on robot A can transfer to robot B with completely different kinematics.

Why does it work?

Foundation models learn semantic understanding (what "pick up" means) separately from low-level control (how to move specific joints). The semantic part can be shared across robots.

Pre-trained knowledge (shared):
  "pick up" → move gripper above object → lower → close gripper → lift

Fine-tuned knowledge (robot-specific):
  Franka: 7 joints, position control, 1m reach
  WidowX: 6 joints, velocity control, 0.5m reach

Real-world results from the Octo paper

Transfer (source → target)	Task	Zero-shot	Fine-tuned (50 demos)
WidowX → Franka	Pick/Place	15%	72%
Multi-robot → ALOHA	Bimanual	8%	65%
Multi-robot → UR5	Assembly	5%	58%

Zero-shot transfer (no fine-tuning) is still weak, but fine-tuning on only 50 demonstrations gives good results. That is about 90% fewer demonstrations than the 500+ needed to train from scratch.

When should you NOT use foundation models?

Foundation models are powerful, but they are not a silver bullet:

Real-time requirements (<10 ms): VLA models run inference at 3-10 Hz, roughly 100-330 ms per action, which is too slow for reactive control (obstacle avoidance, force control). Use classical controllers or small RL policies.
High-precision tasks (<0.5 mm): Assembly and soldering call for system identification plus model-based control rather than a learned policy.
Safety-critical applications: Foundation models are black boxes with no formal guarantees. Surgical robots and autonomous driving need verifiable controllers.
Limited compute: Octo-Small (27M) can run on a Jetson, but OpenVLA (7B) needs a powerful GPU. If the robot only has a Raspberry Pi, use classical methods or small specialized models.
Abundant training data for a specific task: If you already have 10K demos for one task, plain behavior cloning may outperform a fine-tuned foundation model.
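The real-time point is simple arithmetic, sketched below with the approximate inference rates quoted in the comparison table. The 10 ms budget is the reactive-control threshold from the list above; the rates are the article's rough figures, not measurements.

```python
def period_ms(rate_hz):
    """Time between consecutive actions, in milliseconds."""
    return 1000.0 / rate_hz

BUDGET_MS = 10.0  # reactive-control budget from the list above
policy_rates_hz = {"RT-2": 3.0, "OpenVLA": 6.0, "Octo": 10.0}

for name, hz in policy_rates_hz.items():
    ms = period_ms(hz)
    print(f"{name}: {ms:.0f} ms per action, {ms / BUDGET_MS:.0f}x over a {BUDGET_MS:.0f} ms budget")
```

Even the fastest of the three is an order of magnitude over that budget.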

The future

Foundation models for robots are at a stage comparable to GPT-2 for NLP: promising, but not yet production-ready for every use case. Trends to watch:

Larger datasets: Open X-Embodiment v2 is collecting more data
Faster inference: Distillation and quantization for edge deployment
Multi-modal: Adding tactile and force/torque sensing alongside vision
Sim-to-real pre-training: Combining simulation data with real data

If you are just getting started, my recommendation is to try Octo-Small first: it is open-source, lightweight, quick to fine-tune, and has good community support.

Nguyễn Anh Tuấn

Related articles

Sim-to-Real Transfer: Train in Simulation, Run in the Real World

Top Robotics Research 2024-2025: Papers Worth Reading from ICRA, CoRL, and RSS

AI Trends in Robotics in 2025: From LLMs to Embodied AI
