VLAs Are Powerful, But Not Powerful Enough
If you have been following our VLA Models series, you already know: Vision-Language-Action models are the hottest direction in robot learning — from RT-2 to OpenVLA, from pi0 to pi0-FAST. However, there is a problem the entire VLA community knows about but has not fully solved:
VLAs are trained with imitation learning (IL) — meaning they are only as good as the data they were trained on. If the demonstration data has bias, the VLA inherits that bias. If demonstrations do not include recovery behaviors (when objects drop, when the gripper misaligns), the VLA cannot handle these situations.
This is exactly the problem LLMs faced before RLHF — models generated fluent text but were not aligned with human intent. RLHF turned GPT into ChatGPT. Can RL turn OpenVLA into "ChatVLA"?
VLA-RL answers: yes, and it also discovers the first inference scaling law in robotics.
What Does the Paper Propose?
VLA-RL (2025) introduces a complete framework for applying online RL to pre-trained VLA models. Three main contributions:
- Trajectory-level RL formulation: models manipulation trajectories as multi-modal multi-turn conversations — each timestep is a "turn" in the dialogue between robot and environment
- Process Reward Model (PRM): a VLM fine-tuned to evaluate each segment of a trajectory — solving the sparse reward problem
- Inference scaling law: the first proof that giving a VLA "more time to think" at inference time improves performance following a power law — similar to LLM scaling laws
Result: after RL training, OpenVLA-7B surpasses the strongest baseline, pi0-FAST, by 4.5 percentage points on the LIBERO benchmark (40 manipulation tasks).
Core Insight: Manipulation Is a Conversation
Auto-regressive VLA = Multi-turn Conversation
This is the paper's most important insight. Auto-regressive VLAs (like OpenVLA) generate actions token-by-token, just as LLMs generate text. The difference is what fills each "turn":
Turn t:
User message: [Image observation o_t]
Assistant reply: [Action tokens a_t = (a_t^1, a_t^2, ..., a_t^K)]
Turn t+1:
User message: [Image observation o_{t+1}] (result of a_t)
Assistant reply: [Action tokens a_{t+1}]
...
Turn T (final):
User message: [Image observation o_T]
Assistant reply: [Action tokens a_T]
-> Task success/failure
A manipulation trajectory of 50 timesteps = a 50-turn "conversation". The final reward (success/failure) is like a verdict for the entire conversation.
Why does this perspective matter? Because it allows reusing the entire RLHF infrastructure developed for LLMs. Same PPO algorithm, same reward model paradigm, same KL regularization — the only difference is "text tokens" become "action tokens" and "human feedback" becomes "task success".
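In code, the mapping is almost mechanical. A minimal sketch of the trajectory-as-conversation idea (the message schema and the `encode_action` tokenizer are illustrative placeholders, not the paper's actual interfaces):

```python
# Sketch: one manipulation episode rendered as a multi-turn conversation.
# `encode_action` stands in for a VLA action tokenizer (hypothetical here).
def trajectory_to_conversation(trajectory, task, encode_action=str):
    """trajectory: list of (observation, action) pairs for one episode."""
    messages = [{"role": "system", "content": f"Task: {task}"}]
    for obs, action in trajectory:
        # Each timestep t is one user/assistant exchange:
        # the environment "says" o_t, the policy "replies" with a_t.
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": encode_action(action)})
    return messages

convo = trajectory_to_conversation(
    [("o_0", [0.1, 0.2]), ("o_1", [0.0, -0.1])], "pick up the red cup"
)
# A T-step trajectory becomes a conversation with 2T + 1 messages.
```

Once a trajectory is in this shape, any RLHF tooling that scores multi-turn conversations can, in principle, score robot episodes.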
Mathematical Formulation
import copy

import torch
import torch.nn as nn

class VLARLFormulation:
    """
    Trajectory-level RL for auto-regressive VLA.
    Each trajectory = multi-turn conversation.
    """
    def __init__(self, vla_model, reward_model, kl_coeff=0.01):
        self.vla = vla_model  # OpenVLA-7B
        self.vla_ref = copy.deepcopy(vla_model)  # frozen reference for KL
        for p in self.vla_ref.parameters():
            p.requires_grad_(False)
        self.reward_model = reward_model  # Process Reward Model
        self.kl_coeff = kl_coeff
def compute_trajectory_reward(self, trajectory):
"""
Trajectory = [(o_0, a_0), (o_1, a_1), ..., (o_T, a_T)]
Reward at each timestep t:
r_t = PRM(o_0:t, a_0:t) (process reward)
R_total = sum(r_t) + R_final (sparse task reward)
"""
observations = [step[0] for step in trajectory]
actions = [step[1] for step in trajectory]
# Process rewards for each segment
process_rewards = []
for t in range(len(trajectory)):
r_t = self.reward_model.evaluate(
observations[:t+1], actions[:t+1]
)
process_rewards.append(r_t)
return torch.stack(process_rewards)
def compute_ppo_loss(self, trajectory, old_log_probs):
"""
PPO loss with trajectory-level rewards.
KL regularization prevents catastrophic forgetting.
"""
observations = [step[0] for step in trajectory]
actions = [step[1] for step in trajectory]
# Current policy log probs
log_probs = []
for t in range(len(trajectory)):
action_log_prob = self.vla.compute_log_prob(
observations[t], actions[t]
)
log_probs.append(action_log_prob)
log_probs = torch.stack(log_probs)
# Advantages from process rewards
rewards = self.compute_trajectory_reward(trajectory)
advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# PPO clipped objective
ratio = torch.exp(log_probs - old_log_probs)
clip_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
policy_loss = -torch.min(
ratio * advantages,
clip_ratio * advantages
).mean()
# KL penalty vs reference model
ref_log_probs = []
for t in range(len(trajectory)):
ref_lp = self.vla_ref.compute_log_prob(
observations[t], actions[t]
)
ref_log_probs.append(ref_lp)
ref_log_probs = torch.stack(ref_log_probs)
kl_loss = (log_probs - ref_log_probs).mean()
total_loss = policy_loss + self.kl_coeff * kl_loss
return total_loss
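Both loss functions above call a `compute_gae` helper that the listing never defines. A standard Generalized Advantage Estimation sketch matching the call signature used here; since no critic is passed in, value estimates default to zero (our assumption — a full implementation would pass value-function predictions):

```python
import torch

def compute_gae(rewards, gamma=0.99, lam=0.95, values=None):
    """Generalized Advantage Estimation over one trajectory.

    With values=None (as called above), V is taken as zero everywhere,
    so GAE reduces to a lambda-discounted return.
    """
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    T = rewards.shape[0]
    if values is None:
        values = torch.zeros(T + 1)  # V(s_0..s_T), terminal included
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```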
Process Reward Model: Solving Sparse Rewards
The Problem: Manipulation Rewards Are Too Sparse
In manipulation, rewards typically only appear at the end of an episode: task success = 1, failure = 0. With a 50-step trajectory, RL must figure out which step mattered — like finding a needle in a haystack.
LLMs faced a similar problem, and Process Reward Models (PRM) were developed for math reasoning, evaluating each solution step. VLA-RL applies this idea to robot manipulation.
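A toy calculation makes the credit-assignment problem concrete (the numbers here are illustrative, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

T = 50
sparse = np.zeros(T)
sparse[-1] = 1.0                       # success signal only at episode end
dense = np.linspace(0.02, 1.0, T) / T  # toy PRM-style per-step progress

g_sparse = discounted_returns(sparse)
g_dense = discounted_returns(dense)
# Under the sparse reward, adjacent early timesteps differ only by a
# factor of 1/gamma, so the gradient barely distinguishes them.
print(g_sparse[0], g_sparse[1])  # ~0.611 vs ~0.617
```

Per-step process rewards replace that nearly flat signal with one that tracks actual task progress at each timestep.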
PRM Architecture
import torch
import torch.nn as nn
class ManipulationPRM(nn.Module):
"""
Process Reward Model for robot manipulation.
Fine-tuned from VLM, evaluates progress at each timestep.
Training: create pseudo reward labels from task segments.
"""
def __init__(self, vlm_backbone):
super().__init__()
self.vlm = vlm_backbone # e.g., Qwen-VL-7B
self.reward_head = nn.Sequential(
nn.Linear(vlm_backbone.config.hidden_size, 256),
nn.ReLU(),
nn.Linear(256, 1),
nn.Sigmoid(), # output [0, 1] reward
)
def forward(self, image_sequence, action_sequence, task_description):
"""
Evaluate progress based on visual history.
Args:
image_sequence: [o_0, ..., o_t] — observations so far
action_sequence: [a_0, ..., a_{t-1}] — actions taken
task_description: "pick up the red cup"
Returns:
reward: float [0, 1] — progress score
"""
prompt = self._build_prompt(
image_sequence, action_sequence, task_description
)
vlm_output = self.vlm(prompt)
hidden = vlm_output.last_hidden_state[:, -1, :]
reward = self.reward_head(hidden)
return reward
def _build_prompt(self, images, actions, task):
"""Build multi-image prompt for VLM."""
prompt_parts = [f"Task: {task}\n\nProgress so far:\n"]
for t, (img, act) in enumerate(zip(images, actions)):
prompt_parts.append(f"Step {t}: [IMAGE] -> Action: {act}\n")
prompt_parts.append(
f"Current observation: [IMAGE]\n"
f"Rate the progress toward completing the task (0-1):"
)
return "".join(prompt_parts)
Creating Pseudo Reward Labels
A clever trick: instead of manually annotating rewards (infeasible), the paper creates pseudo labels automatically:
def create_pseudo_reward_labels(
successful_trajectories,
failed_trajectories,
num_segments=5,
):
"""
Create pseudo reward labels for PRM training.
Idea:
- Successful trajectory: progress increases 0 -> 1
    - Failed trajectory: progress rises early, then plateaus well below 1
- Final segment determines success/failure
"""
labeled_data = []
for traj in successful_trajectories:
T = len(traj)
segment_size = T // num_segments
for seg_idx in range(num_segments):
start = seg_idx * segment_size
end = min((seg_idx + 1) * segment_size, T)
# Linear progress for successful trajectory
progress = (seg_idx + 1) / num_segments
labeled_data.append({
"images": traj[start:end],
"reward": progress,
"label": "positive",
})
for traj in failed_trajectories:
T = len(traj)
segment_size = T // num_segments
for seg_idx in range(num_segments):
start = seg_idx * segment_size
end = min((seg_idx + 1) * segment_size, T)
# Failed: progress increases then plateaus/drops
if seg_idx < num_segments // 2:
progress = (seg_idx + 1) / num_segments * 0.5
else:
progress = 0.25
labeled_data.append({
"images": traj[start:end],
"reward": progress,
"label": "negative",
})
return labeled_data
Training Loop: PPO for VLA
GPU-Balanced Vectorized Environments
Training RL for a 7B parameter VLA requires careful GPU memory management. VLA-RL uses a separated actor-learner architecture:
import torch
class VLARLTrainer:
"""
Training loop for VLA-RL.
Actor (data collection) and Learner (gradient updates)
run on separate GPU pools.
"""
def __init__(
self,
vla_model, # OpenVLA-7B
prm_model, # Process Reward Model
env_suite, # LIBERO environments
num_envs=64, # vectorized environments
actor_gpus=[0, 1], # GPUs for inference
learner_gpus=[2, 3], # GPUs for training
lr=1e-5, # small LR for fine-tuning
):
self.vla = vla_model
self.prm = prm_model
self.envs = env_suite
self.num_envs = num_envs
self.optimizer = torch.optim.AdamW(
self.vla.parameters(),
lr=lr,
weight_decay=0.01,
)
# Curriculum: start from easy tasks
self.curriculum = TaskCurriculum(env_suite)
def collect_trajectories(self, batch_size=16):
"""
Collect trajectories from vectorized environments.
Each env runs 1 episode in parallel.
"""
trajectories = []
tasks = self.curriculum.sample_tasks(batch_size)
observations = self.envs.reset(tasks)
episode_data = [[] for _ in range(batch_size)]
dones = [False] * batch_size
for step in range(200): # max 200 steps
if all(dones):
break
with torch.no_grad():
actions = self.vla.predict(observations, tasks)
next_obs, rewards, new_dones, infos = self.envs.step(actions)
for i in range(batch_size):
if not dones[i]:
episode_data[i].append({
"obs": observations[i],
"action": actions[i],
"reward": rewards[i],
"done": new_dones[i],
})
if new_dones[i]:
dones[i] = True
observations = next_obs
# Compute process rewards for each trajectory
for traj in episode_data:
process_rewards = self.prm.evaluate_trajectory(traj)
for t, step_data in enumerate(traj):
step_data["process_reward"] = process_rewards[t]
trajectories.append(traj)
return trajectories
def train_step(self, trajectories):
"""PPO update step on learner GPUs."""
all_obs = []
all_actions = []
all_advantages = []
all_old_log_probs = []
for traj in trajectories:
rewards = [s["process_reward"] for s in traj]
advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
for t, step_data in enumerate(traj):
all_obs.append(step_data["obs"])
all_actions.append(step_data["action"])
all_advantages.append(advantages[t])
with torch.no_grad():
old_lp = self.vla.compute_log_prob(
step_data["obs"], step_data["action"]
)
all_old_log_probs.append(old_lp)
# Mini-batch PPO updates
dataset_size = len(all_obs)
indices = torch.randperm(dataset_size)
mini_batch_size = 32
total_loss = 0
for start in range(0, dataset_size, mini_batch_size):
end = min(start + mini_batch_size, dataset_size)
mb_indices = indices[start:end]
mb_obs = [all_obs[i] for i in mb_indices]
mb_actions = [all_actions[i] for i in mb_indices]
mb_advantages = torch.stack(
[all_advantages[i] for i in mb_indices]
)
mb_old_lp = torch.stack(
[all_old_log_probs[i] for i in mb_indices]
)
new_log_probs = self.vla.compute_log_probs_batch(
mb_obs, mb_actions
)
ratio = torch.exp(new_log_probs - mb_old_lp)
clip_ratio = torch.clamp(ratio, 0.8, 1.2)
loss = -torch.min(
ratio * mb_advantages,
clip_ratio * mb_advantages,
).mean()
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.vla.parameters(), max_norm=1.0
)
self.optimizer.step()
total_loss += loss.item()
        num_batches = max(1, (dataset_size + mini_batch_size - 1) // mini_batch_size)
        return total_loss / num_batches
def train(self, num_iterations=1000):
"""Main training loop."""
for iteration in range(num_iterations):
trajectories = self.collect_trajectories(batch_size=64)
successes = sum(
1 for t in trajectories if t[-1]["reward"] > 0.5
)
success_rate = successes / len(trajectories)
self.curriculum.update(success_rate)
loss = self.train_step(trajectories)
print(
f"Iter {iteration}: loss={loss:.4f}, "
f"success={success_rate:.2%}, "
f"curriculum_level={self.curriculum.level}"
)
Curriculum Selection
The paper uses automatic curriculum learning — starting from easy tasks and gradually increasing difficulty:
class TaskCurriculum:
"""
Automatic curriculum selection.
Tasks grouped into levels by difficulty.
Level advances when success rate exceeds threshold.
"""
def __init__(self, env_suite, success_threshold=0.7):
self.env_suite = env_suite
self.success_threshold = success_threshold
self.level = 0
# LIBERO tasks grouped by difficulty
self.task_levels = {
0: ["pick_up_object", "push_to_target"], # Easy
1: ["stack_two_blocks", "open_drawer"], # Medium
2: ["sort_objects", "pour_liquid"], # Hard
3: ["assemble_parts", "tool_use"], # Expert
}
self.success_history = []
def sample_tasks(self, batch_size):
"""Sample tasks from current and lower levels."""
available_tasks = []
for l in range(self.level + 1):
available_tasks.extend(self.task_levels.get(l, []))
current_tasks = self.task_levels.get(self.level, [])
tasks = []
for _ in range(batch_size):
if torch.rand(1).item() < 0.7 and current_tasks:
tasks.append(
current_tasks[
torch.randint(len(current_tasks), (1,)).item()
]
)
else:
tasks.append(
available_tasks[
torch.randint(len(available_tasks), (1,)).item()
]
)
return tasks
def update(self, success_rate):
"""Advance level when success rate is high enough."""
self.success_history.append(success_rate)
if len(self.success_history) >= 10:
recent_avg = sum(self.success_history[-10:]) / 10
if recent_avg > self.success_threshold:
max_level = max(self.task_levels.keys())
if self.level < max_level:
self.level += 1
self.success_history = []
print(f"Curriculum advanced to level {self.level}")
Inference Scaling Law: The Breakthrough Discovery
This is the most surprising result: VLA-RL discovers the first inference scaling law in robotics.
What Is an Inference Scaling Law?
In LLMs, the inference scaling law states that giving a model "more time to think" at inference time (generating more tokens, using chain-of-thought, best-of-N sampling) improves performance following a power law. OpenAI's o1/o3 are practical applications of this insight.
VLA-RL proves the same holds for robots:
Best-of-N sampling:
N=1: success 72%
N=4: success 78%
N=8: success 82%
N=16: success 84%
N=32: success 85.5%
Log-linear relationship: performance proportional to log(N)
This means: having the VLA generate multiple action trajectories and selecting the one with the highest PRM score significantly improves performance. Each doubling of samples buys a few more points of success, with diminishing returns at large N (going from 16 to 32 samples adds only 1.5 points).
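We can sanity-check the claimed log-linear trend by fitting the five reported (N, success) pairs. The data points are from the paper; the fit itself is ours:

```python
import numpy as np

# Best-of-N success rates reported above
n = np.array([1, 4, 8, 16, 32])
success = np.array([72.0, 78.0, 82.0, 84.0, 85.5])

# Fit: success ~ a * log2(N) + b
a, b = np.polyfit(np.log2(n), success, deg=1)
print(f"slope: {a:.2f} success points per doubling of samples")
```

The fitted slope comes out just under 3 points per doubling on average, although the marginal gain clearly shrinks at large N.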
Test-Time Optimization
class VLATestTimeOptimization:
"""
Inference scaling: generate N trajectories,
select the best according to PRM score.
"""
def __init__(self, vla_model, prm_model, n_samples=8):
self.vla = vla_model
self.prm = prm_model
self.n_samples = n_samples
def predict_with_scaling(self, observation, task):
"""
Best-of-N sampling for robot actions.
Trade compute for performance:
- N=1: fastest, baseline performance
- N=8: 8x compute, +10% success
- N=32: 32x compute, +13.5% success
"""
candidates = []
for _ in range(self.n_samples):
trajectory = self.vla.generate(
observation,
task,
temperature=0.8, # diversity
max_steps=50,
)
prm_score = self.prm.evaluate_trajectory(trajectory)
candidates.append({
"trajectory": trajectory,
"score": prm_score.mean().item(),
})
best = max(candidates, key=lambda x: x["score"])
return best["trajectory"][0] # return first action
def predict_with_refinement(self, observation, task):
"""
Alternative: iterative refinement instead of parallel sampling.
Similar to "self-reflection" in LLMs.
"""
trajectory = self.vla.generate(
observation, task, temperature=0.3
)
for refine_step in range(3):
scores = self.prm.evaluate_trajectory(trajectory)
# Find timestep with lowest score
worst_t = scores.argmin().item()
# Re-generate from worst_t onwards
refined = self.vla.generate(
observation,
task,
prefix=trajectory[:worst_t],
temperature=0.5,
)
trajectory = trajectory[:worst_t] + refined[worst_t:]
return trajectory[0]
Results: OpenVLA-7B vs pi0-FAST
LIBERO Benchmark (40 manipulation tasks)
| Model | Params | Success Rate | Training Data |
|---|---|---|---|
| OpenVLA (baseline) | 7B | 68.2% | 970K demos |
| OpenVLA + SFT (more data) | 7B | 71.5% | 970K + 50K |
| Octo | 93M | 52.3% | 800K |
| pi0-FAST | 3B | 77.8% | 10K (in-domain) |
| VLA-RL (OpenVLA + RL) | 7B | 82.3% | 970K + RL |
Note: VLA-RL surpasses pi0-FAST by 4.5 percentage points despite pi0-FAST being trained on high-quality in-domain demonstrations. Here, RL training proved more effective than collecting additional demonstrations.
Ablation: Each Component's Contribution
| Configuration | Success Rate | Delta |
|---|---|---|
| OpenVLA baseline | 68.2% | -- |
| + Process Reward Model only | 72.1% | +3.9% |
| + Trajectory-level PPO only | 75.8% | +7.6% |
| + Curriculum selection | 79.4% | +11.2% |
| + Test-time optimization (N=8) | 82.3% | +14.1% |
Each component contributes measurably. Notably, test-time optimization adds 2.9 points on top of the training-time improvements, using only extra inference compute.
Comparison with WholeBodyVLA
These two papers complement each other:
| Aspect | VLA-RL | WholeBodyVLA |
|---|---|---|
| Focus | Improving manipulation via RL | Whole-body loco-manipulation |
| Robot type | Fixed-base arms | Humanoid (AgiBot X2) |
| VLA backbone | OpenVLA-7B | Custom VLA |
| RL role | Fine-tune VLA directly | Train locomotion policy |
| Key insight | Inference scaling law | Manipulation-aware locomotion |
| Benchmark | LIBERO (sim) | Real robot tasks |
VLA-RL shows RL improves manipulation quality. WholeBodyVLA shows RL helps locomotion support manipulation. Combining both would yield humanoids that manipulate better AND move more stably.
Implications for the Robotics Community
1. RL > More Data
Adding 50K demonstrations improved success by only 3.3 points (68.2 to 71.5). RL training improved it by 14.1 points (68.2 to 82.3), roughly 4x the gain, without collecting a single new demonstration. Great news for small labs that do not have teleoperation teams.
2. PRM Is the Missing Piece
Sparse rewards (0/1 success) are insufficient for efficient RL training. Process Reward Models solve this by evaluating each step — like a teacher grading each step of a math solution instead of only the final answer.
3. Inference Scaling = Compute for Performance
For the first time, robotics has an inference scaling law. This opens a new paradigm: instead of training larger models, let existing models "think longer". For robots, 8x inference compute buys roughly 10 points of success, a reasonable trade-off for high-stakes tasks.
4. RLHF Paradigm Transfer Succeeds
The entire RLHF stack (PPO, reward model, KL regularization) transfers nearly intact from LLMs to VLAs. This strongly validates that the robotics community should invest in RL infrastructure, not just data collection.
Limitations
- Sim-only results: benchmarked on LIBERO (simulation), no real robot validation yet
- Compute cost: training a 7B VLA with RL requires 8x A100 for ~48 hours
- PRM quality: pseudo labels are noisy — the learned PRM is not perfect
- Single-task RL: each task group needs separate RL fine-tuning, no universal RL yet
References
- VLA-RL: Towards Masterful Robot Manipulation via Scalable Reinforcement Learning — 2025
- OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al., 2024
- pi0-FAST: Fast Action Sequence Tokenization for VLA — Physical Intelligence, 2025
- LIBERO: Lifelong Robot Learning Benchmark — Liu et al., 2023
Related Posts
- VLA Models: RT-2, Octo, OpenVLA, pi0 — Foundation: VLA history
- Reinforcement Learning Basics for Robotics — RL fundamentals
- WholeBodyVLA: Unified VLA for Humanoid — Related paper on VLA for humanoids
- Imitation Learning: Teaching Robots by Demonstration — IL that VLA-RL improves