VLAs Are Powerful, But Not Powerful Enough
If you have been following our VLA Models series, you already know: Vision-Language-Action models are the hottest direction in robot learning — from RT-2 to OpenVLA, from pi0 to pi0-FAST. However, there is a problem the entire VLA community knows about but has not fully solved:
VLAs are trained with imitation learning (IL) — meaning they are only as good as the data they were trained on. If the demonstration data has bias, the VLA inherits that bias. If demonstrations do not include recovery behaviors (when objects drop, when the gripper misaligns), the VLA cannot handle these situations.
This is exactly the problem LLMs faced before RLHF — models generated fluent text but were not aligned with human intent. RLHF turned GPT into ChatGPT. Can RL turn OpenVLA into "ChatVLA"?
VLA-RL answers: yes, and it also discovers the first inference scaling law in robotics.
What Does the Paper Propose?
VLA-RL (2025) introduces a complete framework for applying online RL to pre-trained VLA models. Three main contributions:
- Trajectory-level RL formulation: models manipulation trajectories as multi-modal multi-turn conversations — each timestep is a "turn" in the dialogue between robot and environment
- Process Reward Model (PRM): a VLM fine-tuned to evaluate each segment of a trajectory — solving the sparse reward problem
- Inference scaling law: the first proof that giving a VLA "more time to think" at inference time improves performance following a power law — similar to LLM scaling laws
Result: after RL training, OpenVLA-7B surpasses the strongest baseline, pi0-FAST, by 4.5 percentage points on the LIBERO benchmark (40 manipulation tasks).
Core Insight: Manipulation Is a Conversation
Auto-regressive VLA = Multi-turn Conversation
This is the paper's most important insight. Auto-regressive VLAs (like OpenVLA) generate actions token-by-token, just as LLMs generate text. The difference is what fills each "turn":
Turn t:
User message: [Image observation o_t]
Assistant reply: [Action tokens a_t = (a_t^1, a_t^2, ..., a_t^K)]
Turn t+1:
User message: [Image observation o_{t+1}] (result of a_t)
Assistant reply: [Action tokens a_{t+1}]
...
Turn T (final):
User message: [Image observation o_T]
Assistant reply: [Action tokens a_T]
-> Task success/failure
A manipulation trajectory of 50 timesteps = a 50-turn "conversation". The final reward (success/failure) is like a verdict for the entire conversation.
Why does this perspective matter? Because it allows reusing the entire RLHF infrastructure developed for LLMs. Same PPO algorithm, same reward model paradigm, same KL regularization — the only difference is "text tokens" become "action tokens" and "human feedback" becomes "task success".
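In code, the mapping is almost mechanical. A minimal sketch of the trajectory-as-conversation idea (the message schema and the `encode_action` tokenizer are illustrative placeholders, not the paper's actual interfaces):

```python
# Sketch: one manipulation episode rendered as a multi-turn conversation.
# `encode_action` stands in for a VLA action tokenizer (hypothetical here).
def trajectory_to_conversation(trajectory, task, encode_action=str):
    """trajectory: list of (observation, action) pairs for one episode."""
    messages = [{"role": "system", "content": f"Task: {task}"}]
    for obs, action in trajectory:
        # Each timestep t is one user/assistant exchange:
        # the environment "says" o_t, the policy "replies" with a_t.
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": encode_action(action)})
    return messages

convo = trajectory_to_conversation(
    [("o_0", [0.1, 0.2]), ("o_1", [0.0, -0.1])], "pick up the red cup"
)
# A T-step trajectory becomes a conversation with 2T + 1 messages.
```

Once a trajectory is in this shape, any RLHF tooling that scores multi-turn conversations can, in principle, score robot episodes.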
Mathematical Formulation
import copy

import torch
import torch.nn as nn

class VLARLFormulation:
    """
    Trajectory-level RL for auto-regressive VLA.
    Each trajectory = multi-turn conversation.
    """
    def __init__(self, vla_model, reward_model, kl_coeff=0.01):
        self.vla = vla_model  # OpenVLA-7B
        self.vla_ref = copy.deepcopy(vla_model)  # frozen reference for KL
        for p in self.vla_ref.parameters():
            p.requires_grad_(False)
        self.reward_model = reward_model  # Process Reward Model
        self.kl_coeff = kl_coeff
def compute_trajectory_reward(self, trajectory):
"""
Trajectory = [(o_0, a_0), (o_1, a_1), ..., (o_T, a_T)]
Reward at each timestep t:
r_t = PRM(o_0:t, a_0:t) (process reward)
R_total = sum(r_t) + R_final (sparse task reward)
"""
observations = [step[0] for step in trajectory]
actions = [step[1] for step in trajectory]
# Process rewards for each segment
process_rewards = []
for t in range(len(trajectory)):
r_t = self.reward_model.evaluate(
observations[:t+1], actions[:t+1]
)
process_rewards.append(r_t)
return torch.stack(process_rewards)
def compute_ppo_loss(self, trajectory, old_log_probs):
"""
PPO loss with trajectory-level rewards.
KL regularization prevents catastrophic forgetting.
"""
observations = [step[0] for step in trajectory]
actions = [step[1] for step in trajectory]
# Current policy log probs
log_probs = []
for t in range(len(trajectory)):
action_log_prob = self.vla.compute_log_prob(
observations[t], actions[t]
)
log_probs.append(action_log_prob)
log_probs = torch.stack(log_probs)
# Advantages from process rewards
rewards = self.compute_trajectory_reward(trajectory)
advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# PPO clipped objective
ratio = torch.exp(log_probs - old_log_probs)
clip_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
policy_loss = -torch.min(
ratio * advantages,
clip_ratio * advantages
).mean()
# KL penalty vs reference model
ref_log_probs = []
for t in range(len(trajectory)):
ref_lp = self.vla_ref.compute_log_prob(
observations[t], actions[t]
)
ref_log_probs.append(ref_lp)
ref_log_probs = torch.stack(ref_log_probs)
kl_loss = (log_probs - ref_log_probs).mean()
total_loss = policy_loss + self.kl_coeff * kl_loss
return total_loss
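Both loss functions above call a `compute_gae` helper that the listing never defines. A standard Generalized Advantage Estimation sketch matching the call signature used here; since no critic is passed in, value estimates default to zero (our assumption — a full implementation would pass value-function predictions):

```python
import torch

def compute_gae(rewards, gamma=0.99, lam=0.95, values=None):
    """Generalized Advantage Estimation over one trajectory.

    With values=None (as called above), V is taken as zero everywhere,
    so GAE reduces to a lambda-discounted return.
    """
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    T = rewards.shape[0]
    if values is None:
        values = torch.zeros(T + 1)  # V(s_0..s_T), terminal included
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```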
Process Reward Model: Solving Sparse Rewards
The Problem: Manipulation Rewards Are Too Sparse
In manipulation, rewards typically only appear at the end of an episode: task success = 1, failure = 0. With a 50-step trajectory, RL must figure out which step mattered — like finding a needle in a haystack.
LLMs faced a similar problem, and Process Reward Models (PRM) were developed for math reasoning, evaluating each solution step. VLA-RL applies this idea to robot manipulation.
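A toy calculation makes the credit-assignment problem concrete (the numbers here are illustrative, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

T = 50
sparse = np.zeros(T)
sparse[-1] = 1.0                       # success signal only at episode end
dense = np.linspace(0.02, 1.0, T) / T  # toy PRM-style per-step progress

g_sparse = discounted_returns(sparse)
g_dense = discounted_returns(dense)
# Under the sparse reward, adjacent early timesteps differ only by a
# factor of 1/gamma, so the gradient barely distinguishes them.
print(g_sparse[0], g_sparse[1])  # ~0.611 vs ~0.617
```

Per-step process rewards replace that nearly flat signal with one that tracks actual task progress at each timestep.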
PRM Architecture
import torch
import torch.nn as nn
class ManipulationPRM(nn.Module):
"""
Process Reward Model for robot manipulation.
Fine-tuned from VLM, evaluates progress at each timestep.
Training: create pseudo reward labels from task segments.
"""
def __init__(self, vlm_backbone):
super().__init__()
self.vlm = vlm_backbone # e.g., Qwen-VL-7B
self.reward_head = nn.Sequential(
nn.Linear(vlm_backbone.config.hidden_size, 256),
nn.ReLU(),
nn.Linear(256, 1),
nn.Sigmoid(), # output [0, 1] reward
)
def forward(self, image_sequence, action_sequence, task_description):
"""
Evaluate progress based on visual history.
Args:
image_sequence: [o_0, ..., o_t] — observations so far
action_sequence: [a_0, ..., a_{t-1}] — actions taken
task_description: "pick up the red cup"
Returns:
reward: float [0, 1] — progress score
"""
prompt = self._build_prompt(
image_sequence, action_sequence, task_description
)
vlm_output = self.vlm(prompt)
hidden = vlm_output.last_hidden_state[:, -1, :]
reward = self.reward_head(hidden)
return reward
def _build_prompt(self, images, actions, task):
"""Build multi-image prompt for VLM."""
prompt_parts = [f"Task: {task}\n\nProgress so far:\n"]
for t, (img, act) in enumerate(zip(images, actions)):
prompt_parts.append(f"Step {t}: [IMAGE] -> Action: {act}\n")
prompt_parts.append(
f"Current observation: [IMAGE]\n"
f"Rate the progress toward completing the task (0-1):"
)
return "".join(prompt_parts)
Creating Pseudo Reward Labels
A clever trick: instead of manually annotating rewards (infeasible), the paper creates pseudo labels automatically:
def create_pseudo_reward_labels(
successful_trajectories,
failed_trajectories,
num_segments=5,
):
"""
Create pseudo reward labels for PRM training.
Idea:
- Successful trajectory: progress increases 0 -> 1
    - Failed trajectory: progress rises early, then plateaus well below 1
- Final segment determines success/failure
"""
labeled_data = []
for traj in successful_trajectories:
T = len(traj)
segment_size = T // num_segments
for seg_idx in range(num_segments):
start = seg_idx * segment_size
end = min((seg_idx + 1) * segment_size, T)
# Linear progress for successful trajectory
progress = (seg_idx + 1) / num_segments
labeled_data.append({
"images": traj[start:end],
"reward": progress,
"label": "positive",
})
for traj in failed_trajectories:
T = len(traj)
segment_size = T // num_segments
for seg_idx in range(num_segments):
start = seg_idx * segment_size
end = min((seg_idx + 1) * segment_size, T)
# Failed: progress increases then plateaus/drops
if seg_idx < num_segments // 2:
progress = (seg_idx + 1) / num_segments * 0.5
else:
progress = 0.25
labeled_data.append({
"images": traj[start:end],
"reward": progress,
"label": "negative",
})
return labeled_data
Training Loop: PPO for VLA
GPU-Balanced Vectorized Environments
Training RL for a 7B parameter VLA requires careful GPU memory management. VLA-RL uses a separated actor-learner architecture:
import torch
class VLARLTrainer:
"""
Training loop for VLA-RL.
Actor (data collection) and Learner (gradient updates)
run on separate GPU pools.
"""
def __init__(
self,
vla_model, # OpenVLA-7B
prm_model, # Process Reward Model
env_suite, # LIBERO environments
num_envs=64, # vectorized environments
actor_gpus=[0, 1], # GPUs for inference
learner_gpus=[2, 3], # GPUs for training
lr=1e-5, # small LR for fine-tuning
):
self.vla = vla_model
self.prm = prm_model
self.envs = env_suite
self.num_envs = num_envs
self.optimizer = torch.optim.AdamW(
self.vla.parameters(),
lr=lr,
weight_decay=0.01,
)
# Curriculum: start from easy tasks
self.curriculum = TaskCurriculum(env_suite)
def collect_trajectories(self, batch_size=16):
"""
Collect trajectories from vectorized environments.
Each env runs 1 episode in parallel.
"""
trajectories = []
tasks = self.curriculum.sample_tasks(batch_size)
observations = self.envs.reset(tasks)
episode_data = [[] for _ in range(batch_size)]
dones = [False] * batch_size
for step in range(200): # max 200 steps
if all(dones):
break
with torch.no_grad():
actions = self.vla.predict(observations, tasks)
next_obs, rewards, new_dones, infos = self.envs.step(actions)
for i in range(batch_size):
if not dones[i]:
episode_data[i].append({
"obs": observations[i],
"action": actions[i],
"reward": rewards[i],
"done": new_dones[i],
})
if new_dones[i]:
dones[i] = True
observations = next_obs
# Compute process rewards for each trajectory
for traj in episode_data:
process_rewards = self.prm.evaluate_trajectory(traj)
for t, step_data in enumerate(traj):
step_data["process_reward"] = process_rewards[t]
trajectories.append(traj)
return trajectories
def train_step(self, trajectories):
"""PPO update step on learner GPUs."""
all_obs = []
all_actions = []
all_advantages = []
all_old_log_probs = []
for traj in trajectories:
rewards = [s["process_reward"] for s in traj]
advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
for t, step_data in enumerate(traj):
all_obs.append(step_data["obs"])
all_actions.append(step_data["action"])
all_advantages.append(advantages[t])
with torch.no_grad():
old_lp = self.vla.compute_log_prob(
step_data["obs"], step_data["action"]
)
all_old_log_probs.append(old_lp)
# Mini-batch PPO updates
dataset_size = len(all_obs)
indices = torch.randperm(dataset_size)
mini_batch_size = 32
total_loss = 0
for start in range(0, dataset_size, mini_batch_size):
end = min(start + mini_batch_size, dataset_size)
mb_indices = indices[start:end]
mb_obs = [all_obs[i] for i in mb_indices]
mb_actions = [all_actions[i] for i in mb_indices]
mb_advantages = torch.stack(
[all_advantages[i] for i in mb_indices]
)
mb_old_lp = torch.stack(
[all_old_log_probs[i] for i in mb_indices]
)
new_log_probs = self.vla.compute_log_probs_batch(
mb_obs, mb_actions
)
ratio = torch.exp(new_log_probs - mb_old_lp)
clip_ratio = torch.clamp(ratio, 0.8, 1.2)
loss = -torch.min(
ratio * mb_advantages,
clip_ratio * mb_advantages,
).mean()
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.vla.parameters(), max_norm=1.0
)
self.optimizer.step()
total_loss += loss.item()
        num_batches = max(1, (dataset_size + mini_batch_size - 1) // mini_batch_size)
        return total_loss / num_batches
def train(self, num_iterations=1000):
"""Main training loop."""
for iteration in range(num_iterations):
trajectories = self.collect_trajectories(batch_size=64)
successes = sum(
1 for t in trajectories if t[-1]["reward"] > 0.5
)
success_rate = successes / len(trajectories)
self.curriculum.update(success_rate)
loss = self.train_step(trajectories)
print(
f"Iter {iteration}: loss={loss:.4f}, "
f"success={success_rate:.2%}, "
f"curriculum_level={self.curriculum.level}"
)
Curriculum Selection
The paper uses automatic curriculum learning — starting from easy tasks and gradually increasing difficulty:
class TaskCurriculum:
"""
Automatic curriculum selection.
Tasks grouped into levels by difficulty.
Level advances when success rate exceeds threshold.
"""
def __init__(self, env_suite, success_threshold=0.7):
self.env_suite = env_suite
self.success_threshold = success_threshold
self.level = 0
# LIBERO tasks grouped by difficulty
self.task_levels = {
0: ["pick_up_object", "push_to_target"], # Easy
1: ["stack_two_blocks", "open_drawer"], # Medium
2: ["sort_objects", "pour_liquid"], # Hard
3: ["assemble_parts", "tool_use"], # Expert
}
self.success_history = []
def sample_tasks(self, batch_size):
"""Sample tasks from current and lower levels."""
available_tasks = []
for l in range(self.level + 1):
available_tasks.extend(self.task_levels.get(l, []))
current_tasks = self.task_levels.get(self.level, [])
tasks = []
for _ in range(batch_size):
if torch.rand(1).item() < 0.7 and current_tasks:
tasks.append(
current_tasks[
torch.randint(len(current_tasks), (1,)).item()
]
)
else:
tasks.append(
available_tasks[
torch.randint(len(available_tasks), (1,)).item()
]
)
return tasks
def update(self, success_rate):
"""Advance level when success rate is high enough."""
self.success_history.append(success_rate)
if len(self.success_history) >= 10:
recent_avg = sum(self.success_history[-10:]) / 10
if recent_avg > self.success_threshold:
max_level = max(self.task_levels.keys())
if self.level < max_level:
self.level += 1
self.success_history = []
print(f"Curriculum advanced to level {self.level}")
Inference Scaling Law: The Breakthrough Discovery
This is the most surprising result: VLA-RL discovers the first inference scaling law in robotics.
What Is an Inference Scaling Law?
In LLMs, the inference scaling law states that giving a model "more time to think" at inference time (generating more tokens, using chain-of-thought, best-of-N sampling) improves performance following a power law. OpenAI's o1/o3 are practical applications of this insight.
VLA-RL proves the same holds for robots:
Best-of-N sampling:
N=1: success 72%
N=4: success 78%
N=8: success 82%
N=16: success 84%
N=32: success 85.5%
Log-linear relationship: performance proportional to log(N)
This means: having the VLA generate multiple action trajectories and selecting the one with the highest PRM score significantly improves performance. Each doubling of samples buys a few more points of success, with diminishing returns at large N (going from 16 to 32 samples adds only 1.5 points).
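We can sanity-check the claimed log-linear trend by fitting the five reported (N, success) pairs. The data points are from the paper; the fit itself is ours:

```python
import numpy as np

# Best-of-N success rates reported above
n = np.array([1, 4, 8, 16, 32])
success = np.array([72.0, 78.0, 82.0, 84.0, 85.5])

# Fit: success ~ a * log2(N) + b
a, b = np.polyfit(np.log2(n), success, deg=1)
print(f"slope: {a:.2f} success points per doubling of samples")
```

The fitted slope comes out just under 3 points per doubling on average, although the marginal gain clearly shrinks at large N.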
Test-Time Optimization
class VLATestTimeOptimization:
"""
Inference scaling: generate N trajectories,
select the best according to PRM score.
"""
def __init__(self, vla_model, prm_model, n_samples=8):
self.vla = vla_model
self.prm = prm_model
self.n_samples = n_samples
def predict_with_scaling(self, observation, task):
"""
Best-of-N sampling for robot actions.
Trade compute for performance:
- N=1: fastest, baseline performance
- N=8: 8x compute, +10% success
- N=32: 32x compute, +13.5% success
"""
candidates = []
for _ in range(self.n_samples):
trajectory = self.vla.generate(
observation,
task,
temperature=0.8, # diversity
max_steps=50,
)
prm_score = self.prm.evaluate_trajectory(trajectory)
candidates.append({
"trajectory": trajectory,
"score": prm_score.mean().item(),
})
best = max(candidates, key=lambda x: x["score"])
return best["trajectory"][0] # return first action
def predict_with_refinement(self, observation, task):
"""
Alternative: iterative refinement instead of parallel sampling.
Similar to "self-reflection" in LLMs.
"""
trajectory = self.vla.generate(
observation, task, temperature=0.3
)
for refine_step in range(3):
scores = self.prm.evaluate_trajectory(trajectory)
# Find timestep with lowest score
worst_t = scores.argmin().item()
# Re-generate from worst_t onwards
refined = self.vla.generate(
observation,
task,
prefix=trajectory[:worst_t],
temperature=0.5,
)
trajectory = trajectory[:worst_t] + refined[worst_t:]
return trajectory[0]
Results: OpenVLA-7B vs pi0-FAST
LIBERO Benchmark (40 manipulation tasks)
| Model | Params | Success Rate | Training Data |
|---|---|---|---|
| OpenVLA (baseline) | 7B | 68.2% | 970K demos |
| OpenVLA + SFT (more data) | 7B | 71.5% | 970K + 50K |
| Octo | 93M | 52.3% | 800K |
| pi0-FAST | 3B | 77.8% | 10K (in-domain) |
| VLA-RL (OpenVLA + RL) | 7B | 82.3% | 970K + RL |
Note: VLA-RL surpasses pi0-FAST by 4.5 percentage points despite pi0-FAST being trained on high-quality in-domain demonstrations. Here, RL training proved more effective than collecting additional demonstrations.
Ablation: Each Component's Contribution
| Configuration | Success Rate | Delta |
|---|---|---|
| OpenVLA baseline | 68.2% | -- |
| + Process Reward Model only | 72.1% | +3.9% |
| + Trajectory-level PPO only | 75.8% | +7.6% |
| + Curriculum selection | 79.4% | +11.2% |
| + Test-time optimization (N=8) | 82.3% | +14.1% |
Each component contributes measurably. Notably, test-time optimization adds 2.9 points on top of the training-time improvements, using only extra inference compute.
Comparison with WholeBodyVLA
These two papers complement each other:
| Aspect | VLA-RL | WholeBodyVLA |
|---|---|---|
| Focus | Improving manipulation via RL | Whole-body loco-manipulation |
| Robot type | Fixed-base arms | Humanoid (AgiBot X2) |
| VLA backbone | OpenVLA-7B | Custom VLA |
| RL role | Fine-tune VLA directly | Train locomotion policy |
| Key insight | Inference scaling law | Manipulation-aware locomotion |
| Benchmark | LIBERO (sim) | Real robot tasks |
VLA-RL shows RL improves manipulation quality. WholeBodyVLA shows RL helps locomotion support manipulation. Combining both would yield humanoids that manipulate better AND move more stably.
Implications for the Robotics Community
1. RL > More Data
Adding 50K demonstrations improved success by only 3.3 points (68.2 to 71.5). RL training improved it by 14.1 points (68.2 to 82.3), roughly 4x the gain, without collecting a single new demonstration. Great news for small labs that do not have teleoperation teams.
2. PRM Is the Missing Piece
Sparse rewards (0/1 success) are insufficient for efficient RL training. Process Reward Models solve this by evaluating each step — like a teacher grading each step of a math solution instead of only the final answer.
3. Inference Scaling = Compute for Performance
For the first time, robotics has an inference scaling law. This opens a new paradigm: instead of training larger models, let existing models "think longer". For robots, 8x inference compute buys roughly 10 points of success, a reasonable trade-off for high-stakes tasks.
4. RLHF Paradigm Transfer Succeeds
The entire RLHF stack (PPO, reward model, KL regularization) transfers nearly intact from LLMs to VLAs. This strongly validates that the robotics community should invest in RL infrastructure, not just data collection.
Limitations
- Sim-only results: benchmarked on LIBERO (simulation), no real robot validation yet
- Compute cost: training a 7B VLA with RL requires 8x A100 for ~48 hours
- PRM quality: pseudo labels are noisy — the learned PRM is not perfect
- Single-task RL: each task group needs separate RL fine-tuning, no universal RL yet
References
- VLA-RL: Towards Masterful Robot Manipulation via Scalable Reinforcement Learning — 2025
- OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al., 2024
- pi0-FAST: Fast Action Sequence Tokenization for VLA — Physical Intelligence, 2025
- LIBERO: Lifelong Robot Learning Benchmark — Liu et al., 2023
Related Posts
- VLA Models: RT-2, Octo, OpenVLA, pi0 — Foundation: VLA history
- Reinforcement Learning Basics for Robotics — RL fundamentals
- WholeBodyVLA: Unified VLA for Humanoid — Related paper on VLA for humanoids
- Imitation Learning: Teaching Robots by Demonstration — IL that VLA-RL improves