VLA-RL: Online RL to Upgrade VLA Manipulation

VLAs Are Strong, but Not Strong Enough

If you have been following our VLA Models series, you already know that Vision-Language-Action models are among the hottest directions in robot learning, from RT-2 to OpenVLA and from pi0 to pi0-FAST. There is, however, a problem that the whole VLA community knows about but nobody has fully solved:

VLAs are trained with imitation learning (IL), which means they are only as good as the data they were trained on. If the demonstration data is biased, the VLA inherits that bias. If the demonstrations contain no recovery behaviors (an object is dropped, a grasp lands in the wrong place), the VLA does not know how to recover.

This closely mirrors the problem LLMs had before RLHF: the model produced fluent text but was not aligned with the goal. RLHF turned GPT into ChatGPT. Can RL turn OpenVLA into a "ChatVLA"?

VLA-RL answers yes, and it also reports the first inference scaling law in robotics.


What Does the Paper Say?

VLA-RL (2025) proposes a complete framework for applying online RL to pre-trained VLA models. It makes three main contributions:

Trajectory-level RL formulation: models a manipulation trajectory as a multi-modal, multi-turn conversation, where each timestep is one "turn" in the exchange between robot and environment
Process Reward Model (PRM): a fine-tuned VLM that scores each segment of a trajectory, addressing the sparse reward problem
Inference scaling law: shows for the first time that letting a VLA "think longer" at inference time raises performance, with success growing roughly in proportion to the logarithm of the sampling budget, echoing inference scaling in LLMs

Results: after RL training, OpenVLA-7B beats the strongest baseline by 4.5 percentage points on the LIBERO benchmark (40 manipulation tasks), performing at least on par with pi0-FAST.

Core Insight: Manipulation Is a Conversation

Auto-regressive VLA = Multi-turn Conversation

This is the paper's most important insight. An auto-regressive VLA (such as OpenVLA) generates actions token by token, just as an LLM generates text. Unlike an LLM chat, though, each "turn" for a VLA looks like this:

Turn t:
  User message:    [Image observation o_t]
  Assistant reply: [Action tokens a_t = (a_t^1, a_t^2, ..., a_t^K)]

Turn t+1:
  User message:    [Image observation o_{t+1}]  (result of a_t)
  Assistant reply: [Action tokens a_{t+1}]

...

Turn T (final):
  User message:    [Image observation o_T]
  Assistant reply: [Action tokens a_T]
  → Task success/failure

A 50-timestep manipulation trajectory is a 50-turn "conversation". The final reward (success/failure) acts as a verdict on the whole conversation.

Why does this view matter? Because it lets VLA training reuse the RLHF infrastructure already built for LLMs: the same PPO algorithm, the same reward-model paradigm, the same KL regularization. Only the "text tokens" become "action tokens" and the "human feedback" becomes "task success".
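
To make the analogy concrete, here is the objective written out. It is a sketch that matches the code in this post, not necessarily the paper's exact notation. The symbols $\rho_t$ (probability ratio), $\hat{A}_t$ (advantage estimate), $\epsilon$ (clip range, 0.2 in the code), and $\beta$ (KL coefficient, 0.01 in the code) are labels chosen here for illustration.

```latex
% Log-probability of one action = sum over its K action tokens
\log \pi_\theta(a_t \mid o_t) = \sum_{k=1}^{K} \log \pi_\theta\!\left(a_t^{k} \mid o_t, a_t^{<k}\right)

% Probability ratio between the current and the data-collecting policy
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)}

% Clipped PPO loss plus a KL penalty toward the frozen reference policy
\mathcal{L}(\theta) =
  -\,\mathbb{E}_t\!\left[\min\!\left(\rho_t \hat{A}_t,\;
      \operatorname{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]
  + \beta\,\mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid o_t) - \log \pi_{\text{ref}}(a_t \mid o_t)\right]
```

The second expectation is the simple sample-based estimate of the KL divergence between the current policy and the reference, valid when the actions were sampled from the current policy.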

Mathematical Formulation

import copy

import torch


class VLARLFormulation:
    """
    Trajectory-level RL for an auto-regressive VLA.
    Each trajectory is treated as a multi-turn conversation.
    """
    
    def __init__(self, vla_model, reward_model, kl_coeff=0.01):
        self.vla = vla_model          # OpenVLA-7B
        # nn.Module has no .copy(): deep-copy, then freeze the reference policy
        self.vla_ref = copy.deepcopy(vla_model).eval()
        for p in self.vla_ref.parameters():
            p.requires_grad_(False)
        self.reward_model = reward_model  # Process Reward Model
        self.kl_coeff = kl_coeff
    
    def compute_trajectory_reward(self, trajectory):
        """
        Trajectory = [(o_0, a_0), (o_1, a_1), ..., (o_T, a_T)]
        
        Reward at each timestep t:
          r_t = PRM(o_0:t, a_0:t)  (process reward)
          R_total = sum(r_t) + R_final  (sparse task reward)
        
        This method returns only the per-step process rewards r_t;
        the sparse R_final is not added in this sketch.
        """
        observations = [step[0] for step in trajectory]
        actions = [step[1] for step in trajectory]
        
        # Process rewards for each trajectory prefix
        process_rewards = []
        for t in range(len(trajectory)):
            # The PRM scores progress up to timestep t
            r_t = self.reward_model.evaluate(
                observations[:t+1], actions[:t+1]
            )
            process_rewards.append(r_t)
        
        return torch.stack(process_rewards)
    
    def compute_ppo_loss(self, trajectory, old_log_probs):
        """
        PPO loss with trajectory-level rewards.
        KL regularization guards against catastrophic forgetting.
        """
        observations = [step[0] for step in trajectory]
        actions = [step[1] for step in trajectory]
        
        # Current policy log probs
        log_probs = []
        for t in range(len(trajectory)):
            # Each action a_t consists of K tokens
            # Log prob = sum(log prob of each token)
            action_log_prob = self.vla.compute_log_prob(
                observations[t], actions[t]
            )
            log_probs.append(action_log_prob)
        
        log_probs = torch.stack(log_probs)
        
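        # Advantages from process rewards.
        # NOTE: compute_gae is not defined in this snippet; it is assumed to
        # be a helper that returns a tensor of per-step advantages.
        rewards = self.compute_trajectory_reward(trajectory)
        advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)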
        
        # PPO clipped objective
        ratio = torch.exp(log_probs - old_log_probs)
        clip_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
        policy_loss = -torch.min(
            ratio * advantages, 
            clip_ratio * advantages
        ).mean()
        
        # KL penalty vs reference model
        ref_log_probs = []
        for t in range(len(trajectory)):
            ref_lp = self.vla_ref.compute_log_prob(
                observations[t], actions[t]
            )
            ref_log_probs.append(ref_lp)
        ref_log_probs = torch.stack(ref_log_probs)
        
        kl_loss = (log_probs - ref_log_probs).mean()
        
        total_loss = policy_loss + self.kl_coeff * kl_loss
        return total_loss
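
The snippet above calls compute_gae without defining it. A minimal plain-Python sketch of Generalized Advantage Estimation could look like the following. It is an illustration, not the paper's implementation. The optional values argument (critic estimates) is an assumption added here, and when it is omitted the values are treated as zero. The function returns a list, so it would need wrapping in a tensor before being used with the tensor operations above.

```python
def compute_gae(rewards, values=None, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite episode.

    rewards: list of per-step rewards r_0 .. r_{T-1}
    values:  optional list of critic estimates V(s_0) .. V(s_{T-1});
             treated as all zeros when omitted (an assumption of this sketch)
    """
    T = len(rewards)
    if values is None:
        values = [0.0] * T
    advantages = [0.0] * T
    next_value = 0.0   # the episode ends after step T-1, so V(s_T) = 0
    running = 0.0      # accumulates (gamma * lam)-discounted TD errors
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


print(compute_gae([1.0, 1.0, 1.0], gamma=1.0, lam=1.0))  # [3.0, 2.0, 1.0]
```

With no critic and gamma = lam = 1, the advantage at each step reduces to the sum of remaining rewards, which is what the printed example shows.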

Process Reward Model: Tackling Sparse Rewards

Problem: Manipulation Rewards Are Too Sparse

In manipulation, the reward usually arrives only at the end of an episode: 1 if the task succeeds, 0 if it fails. Over a 50-step trajectory, RL has to work out which steps actually mattered (the credit assignment problem), which is like looking for a needle in a haystack.

LLMs ran into a similar problem, which led to Process Reward Models (PRMs) for math reasoning that score each step of a solution. VLA-RL applies the same idea to robot manipulation.
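
A small worked example shows why this matters. The sketch below compares the discounted returns of one failed 5-step episode under a sparse 0/1 reward and under per-step process rewards. The process reward values are hypothetical PRM scores chosen for illustration, not numbers from the paper.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_k gamma^k * r_{t+k} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The same failed episode under two reward schemes.
sparse_fail = [0.0, 0.0, 0.0, 0.0, 0.0]       # 0/1 task reward only
process_fail = [0.1, 0.2, 0.25, 0.25, 0.25]   # hypothetical PRM scores

g_sparse_fail = discounted_returns(sparse_fail)
g_process_fail = discounted_returns(process_fail)
print("sparse :", g_sparse_fail)
print("process:", [round(g, 3) for g in g_process_fail])
```

Under the sparse scheme every return is zero, so a failed episode that got halfway looks the same as one that made no progress. With process rewards the returns differ, so each step carries some learning signal.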

PRM Architecture

import torch
import torch.nn as nn

class ManipulationPRM(nn.Module):
    """
    Process Reward Model for robot manipulation.
    Fine-tuned from a VLM; scores progress at each timestep.
    
    Training: pseudo reward labels are built from task segments.
    """
    
    def __init__(self, vlm_backbone):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., Qwen-VL-7B
        self.reward_head = nn.Sequential(
            nn.Linear(vlm_backbone.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # output [0, 1] reward
        )
    
    def forward(self, image_sequence, action_sequence, task_description):
        """
        Score task progress from the visual history.
        
        Args:
            image_sequence: [o_0, ..., o_t], observations up to now
            action_sequence: [a_0, ..., a_{t-1}], actions taken so far
            task_description: "pick up the red cup"
        
        Returns:
            reward: float in [0, 1], the progress score
        """
        # Build the VLM prompt:
        # "Given these observations and actions, how much progress 
        #  has been made toward: {task}?"
        prompt = self._build_prompt(
            image_sequence, action_sequence, task_description
        )
        
        # VLM encode (schematic: a real VLM call takes tokenized text
        # plus image tensors, not a raw prompt string)
        vlm_output = self.vlm(prompt)
        hidden = vlm_output.last_hidden_state[:, -1, :]
        
        # Reward prediction
        reward = self.reward_head(hidden)
        return reward
    
    def _build_prompt(self, images, actions, task):
        """Build a multi-image prompt for the VLM."""
        prompt_parts = [f"Task: {task}\n\nProgress so far:\n"]
        
        for t, (img, act) in enumerate(zip(images, actions)):
            prompt_parts.append(f"Step {t}: [IMAGE] → Action: {act}\n")
        
        prompt_parts.append(
            f"Current observation: [IMAGE]\n"
            f"Rate the progress toward completing the task (0-1):"
        )
        
        return "".join(prompt_parts)

Creating Pseudo Reward Labels

A neat trick: instead of annotating rewards by hand (not feasible at scale), the paper generates pseudo labels automatically:

def create_pseudo_reward_labels(
    successful_trajectories,
    failed_trajectories,
    num_segments=5,
):
    """
    Build pseudo reward labels for PRM training.
    
    Idea:
    - Successful trajectory: progress rises steadily toward 1
    - Failed trajectory: progress rises, then plateaus at a low value
    - The final segment decides success/failure
    """
    labeled_data = []
    
    for traj in successful_trajectories:
        T = len(traj)
        segment_size = T // num_segments
        
        for seg_idx in range(num_segments):
            start = seg_idx * segment_size
            # The last segment absorbs any remainder so no trailing steps are dropped
            end = T if seg_idx == num_segments - 1 else start + segment_size
            
            # Progress grows linearly for a successful trajectory
            progress = (seg_idx + 1) / num_segments
            
            labeled_data.append({
                "images": traj[start:end],
                "reward": progress,
                "label": "positive",
            })
    
    for traj in failed_trajectories:
        T = len(traj)
        segment_size = T // num_segments
        
        for seg_idx in range(num_segments):
            start = seg_idx * segment_size
            # The last segment absorbs any remainder so no trailing steps are dropped
            end = T if seg_idx == num_segments - 1 else start + segment_size
            
            # Failure: progress rises early, then plateaus
            if seg_idx < num_segments // 2:
                progress = (seg_idx + 1) / num_segments * 0.5
            else:
                # Flat for the second half
                progress = 0.25
            
            labeled_data.append({
                "images": traj[start:end],
                "reward": progress,
                "label": "negative",
            })
    
    return labeled_data
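
To see what labels this rule actually produces, the sketch below reproduces only the progress schedule from the function above, without any trajectory data. The helper name pseudo_progress_schedule is made up for this illustration.

```python
def pseudo_progress_schedule(num_segments=5, success=True):
    """Per-segment progress labels, following the rule in the function above."""
    labels = []
    for seg_idx in range(num_segments):
        if success:
            # Linear ramp up to 1.0
            progress = (seg_idx + 1) / num_segments
        elif seg_idx < num_segments // 2:
            # Failed episode, first half: half-speed ramp
            progress = (seg_idx + 1) / num_segments * 0.5
        else:
            # Failed episode, second half: flat
            progress = 0.25
        labels.append(round(progress, 3))
    return labels


print(pseudo_progress_schedule(success=True))   # [0.2, 0.4, 0.6, 0.8, 1.0]
print(pseudo_progress_schedule(success=False))  # [0.1, 0.2, 0.25, 0.25, 0.25]
```

Note that with the default five segments the failed schedule never decreases: it climbs to 0.25 and stays there, so the PRM learns "stalled" rather than "regressed" for failures.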

Training Loop: PPO for VLA

GPU-Balanced Vectorized Environments

RL training of a 7B-parameter VLA has to contend with GPU memory limits. VLA-RL uses a separated actor-learner architecture:

import torch


class VLARLTrainer:
    """
    Training loop for VLA-RL.
    The actor (data collection) and the learner (gradient updates)
    run on separate GPU pools. Device placement itself is not shown
    in this sketch.
    """
    
    def __init__(
        self,
        vla_model,              # OpenVLA-7B
        prm_model,              # Process Reward Model
        env_suite,              # LIBERO environments
        num_envs=64,            # vectorized environments
        actor_gpus=(0, 1),      # GPUs for inference (tuple avoids a mutable default)
        learner_gpus=(2, 3),    # GPUs for training
        lr=1e-5,                # small learning rate for fine-tuning
    ):
        self.vla = vla_model
        self.prm = prm_model
        self.envs = env_suite
        self.num_envs = num_envs
        
        # AdamW optimizer (gradient accumulation is not shown in this sketch)
        self.optimizer = torch.optim.AdamW(
            self.vla.parameters(),
            lr=lr,
            weight_decay=0.01,
        )
        
        # Curriculum: start with easy tasks
        self.curriculum = TaskCurriculum(env_suite)
    
    def collect_trajectories(self, batch_size=16):
        """
        Collect trajectories from the vectorized environments.
        Each environment runs one episode in parallel.
        """
        trajectories = []
        
        # Reset environments with curriculum-selected tasks
        tasks = self.curriculum.sample_tasks(batch_size)
        observations = self.envs.reset(tasks)
        
        episode_data = [[] for _ in range(batch_size)]
        dones = [False] * batch_size
        
        for step in range(200):  # max 200 steps
            if all(dones):
                break
            
            # VLA inference (batched on the actor GPUs)
            with torch.no_grad():
                actions = self.vla.predict(observations, tasks)
            
            # Environment step
            next_obs, rewards, new_dones, infos = self.envs.step(actions)
            
            # Store transitions
            for i in range(batch_size):
                if not dones[i]:
                    episode_data[i].append({
                        "obs": observations[i],
                        "action": actions[i],
                        "reward": rewards[i],
                        "done": new_dones[i],
                    })
                    if new_dones[i]:
                        dones[i] = True
            
            observations = next_obs
        
        # Compute process rewards for each trajectory
        for traj in episode_data:
            process_rewards = self.prm.evaluate_trajectory(traj)
            for t, step in enumerate(traj):
                step["process_reward"] = process_rewards[t]
            trajectories.append(traj)
        
        return trajectories
    
    def train_step(self, trajectories):
        """
        PPO update step on the learner GPUs.
        """
        # Flatten trajectories into transitions
        all_obs = []
        all_actions = []
        all_advantages = []
        all_old_log_probs = []
        
        for traj in trajectories:
            rewards = [s["process_reward"] for s in traj]
            
            # GAE advantages
            advantages = compute_gae(
                rewards, gamma=0.99, lam=0.95
            )
            
            for t, step in enumerate(traj):
                all_obs.append(step["obs"])
                all_actions.append(step["action"])
                all_advantages.append(advantages[t])
                
                with torch.no_grad():
                    old_lp = self.vla.compute_log_prob(
                        step["obs"], step["action"]
                    )
                    all_old_log_probs.append(old_lp)
        
        # Mini-batch PPO updates
        dataset_size = len(all_obs)
        indices = torch.randperm(dataset_size)
        mini_batch_size = 32
        
        total_loss = 0
        for start in range(0, dataset_size, mini_batch_size):
            end = min(start + mini_batch_size, dataset_size)
            mb_indices = indices[start:end]
            
            mb_obs = [all_obs[i] for i in mb_indices]
            mb_actions = [all_actions[i] for i in mb_indices]
            mb_advantages = torch.stack(
                [all_advantages[i] for i in mb_indices]
            )
            mb_old_lp = torch.stack(
                [all_old_log_probs[i] for i in mb_indices]
            )
            
            # Forward pass
            new_log_probs = self.vla.compute_log_probs_batch(
                mb_obs, mb_actions
            )
            
            # PPO loss
            ratio = torch.exp(new_log_probs - mb_old_lp)
            clip_ratio = torch.clamp(ratio, 0.8, 1.2)
            
            loss = -torch.min(
                ratio * mb_advantages,
                clip_ratio * mb_advantages,
            ).mean()
            
            # Gradient step
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                self.vla.parameters(), max_norm=1.0
            )
            self.optimizer.step()
            
            total_loss += loss.item()
        
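        # Average over the number of mini-batches actually processed (ceil division)
        num_batches = (dataset_size + mini_batch_size - 1) // mini_batch_size
        return total_loss / max(num_batches, 1)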
    
    def train(self, num_iterations=1000):
        """Main training loop."""
        for iteration in range(num_iterations):
            # Collect data
            trajectories = self.collect_trajectories(batch_size=64)
            
            # Compute success rate
            successes = sum(
                1 for t in trajectories if t[-1]["reward"] > 0.5
            )
            success_rate = successes / len(trajectories)
            
            # Update the curriculum based on success rate
            self.curriculum.update(success_rate)
            
            # PPO update
            loss = self.train_step(trajectories)
            
            print(
                f"Iter {iteration}: loss={loss:.4f}, "
                f"success={success_rate:.2%}, "
                f"curriculum_level={self.curriculum.level}"
            )


Curriculum Selection

The paper uses automatic curriculum learning, starting with easy tasks and gradually raising the difficulty:

class TaskCurriculum:
    """
    Automatic curriculum selection.
    Tasks are grouped into levels by difficulty.
    The level increases when the success rate exceeds the threshold.
    """
    
    def __init__(self, env_suite, success_threshold=0.7):
        self.env_suite = env_suite
        self.success_threshold = success_threshold
        self.level = 0
        
        # Tasks grouped by difficulty. The names below appear to be
        # illustrative placeholders rather than actual LIBERO task identifiers.
        self.task_levels = {
            0: ["pick_up_object", "push_to_target"],     # Easy
            1: ["stack_two_blocks", "open_drawer"],        # Medium
            2: ["sort_objects", "pour_liquid"],             # Hard
            3: ["assemble_parts", "tool_use"],             # Expert
        }
        
        self.success_history = []
    
    def sample_tasks(self, batch_size):
        """Sample tasks from the current level and all lower levels."""
        available_tasks = []
        for l in range(self.level + 1):
            available_tasks.extend(self.task_levels.get(l, []))
        
        # 70% from the current level; the remaining 30% are drawn from all
        # unlocked levels (current level included), which replays easier tasks
        current_tasks = self.task_levels.get(self.level, [])
        
        tasks = []
        for _ in range(batch_size):
            if torch.rand(1).item() < 0.7 and current_tasks:
                tasks.append(
                    current_tasks[
                        torch.randint(len(current_tasks), (1,)).item()
                    ]
                )
            else:
                tasks.append(
                    available_tasks[
                        torch.randint(len(available_tasks), (1,)).item()
                    ]
                )
        
        return tasks
    
    def update(self, success_rate):
        """Advance one level once the success rate is high enough."""
        self.success_history.append(success_rate)
        
        # Check the 10 most recent iterations
        if len(self.success_history) >= 10:
            recent_avg = sum(self.success_history[-10:]) / 10
            if recent_avg > self.success_threshold:
                max_level = max(self.task_levels.keys())
                if self.level < max_level:
                    self.level += 1
                    self.success_history = []
                    print(f"Curriculum advanced to level {self.level}")
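
One detail of sample_tasks is easy to miss: the 30% fallback draws from every unlocked level, including the current one, so the effective share of current-level tasks is higher than 70%. The sketch below computes that share. The function name current_level_share is made up here, and the level sizes are taken from the task_levels dictionary above.

```python
def current_level_share(level, tasks_per_level, p_current=0.7):
    """Probability that one sampled task comes from the current level.

    tasks_per_level: number of tasks at each level, indexed by level
    """
    n_current = tasks_per_level[level]
    n_available = sum(tasks_per_level[: level + 1])
    # Direct draw from the current level, plus the chance that the
    # fallback draw over all unlocked tasks also lands on the current level.
    return p_current + (1 - p_current) * n_current / n_available


sizes = [2, 2, 2, 2]  # two tasks per level, as in task_levels above
for level in range(len(sizes)):
    print(level, round(current_level_share(level, sizes), 3))
```

At level 1, for example, the current level ends up with 85% of samples rather than 70%. If a strict 70/30 split is intended, the fallback would have to sample from lower levels only.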

Inference Scaling Law: A Breakthrough Finding

This is the most surprising result: VLA-RL reports the first inference scaling law in robotics.

What Is an Inference Scaling Law?

For LLMs, inference scaling means that letting the model "think" more at inference time (generating more tokens, using chain-of-thought, best-of-N sampling) improves performance in a predictable way as compute grows. OpenAI's o1/o3 models are a practical application of this insight.

VLA-RL shows something similar for robots:

Best-of-N sampling:
  N=1:   success 72%
  N=4:   success 78%
  N=8:   success 82%
  N=16:  success 84%
  N=32:  success 85.5%

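Approximately log-linear relationship: success rate grows roughly in proportion to log(N), with diminishing gains at larger N

In other words, having the VLA generate several candidate action trajectories and keeping the one with the highest PRM score raises performance noticeably. The compute-performance trade-off is logarithmic: across N=1 to N=32, each doubling of compute adds about 2.7 points on average (13.5 points over five doublings), and the final doubling (N=16 to N=32) adds only 1.5 points.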
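
The log-linear claim can be checked against the five data points quoted above. The sketch below fits success ≈ intercept + slope · log2(N) by ordinary least squares. It uses only the numbers from this post and says nothing about how the paper itself fitted the curve.

```python
import math

# Best-of-N results as quoted in this post: N -> success rate (%)
results = {1: 72.0, 4: 78.0, 8: 82.0, 16: 84.0, 32: 85.5}

xs = [math.log2(n) for n in results]
ys = list(results.values())
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Ordinary least squares for: success = intercept + slope * log2(N)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

print(f"gain per doubling of N: {slope:.2f} points")   # 2.79
print(f"fitted success at N=1:  {intercept:.2f}%")     # 72.49
for n, y in results.items():
    print(n, y, round(intercept + slope * math.log2(n), 2))
```

The fitted line stays within about 1.2 points of every reported value. The residuals show the curve bending: the fit underestimates N=8 and overestimates N=32, which is the saturation noted above.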

Test-Time Optimization

class VLATestTimeOptimization:
    """
    Inference scaling: generate N trajectories,
    then pick the best one by PRM score.
    """
    
    def __init__(self, vla_model, prm_model, n_samples=8):
        self.vla = vla_model
        self.prm = prm_model
        self.n_samples = n_samples
    
    def predict_with_scaling(self, observation, task):
        """
        Best-of-N sampling for robot actions.
        
        Trade compute for performance:
        - N=1: fastest, baseline performance
        - N=8: 8x compute, +10 points success
        - N=32: 32x compute, +13.5 points success
        
        Schematic: assumes self.vla.generate can roll out a whole
        candidate trajectory from a single observation.
        """
        candidates = []
        
        for _ in range(self.n_samples):
            # Sample an action trajectory with temperature > 0
            trajectory = self.vla.generate(
                observation, 
                task,
                temperature=0.8,    # diversity
                max_steps=50,
            )
            
            # Evaluate with the PRM
            prm_score = self.prm.evaluate_trajectory(trajectory)
            
            candidates.append({
                "trajectory": trajectory,
                "score": prm_score.mean().item(),
            })
        
        # Pick the trajectory with the highest PRM score
        best = max(candidates, key=lambda x: x["score"])
        
        return best["trajectory"][0]  # return first action
    
    def predict_with_refinement(self, observation, task):
        """
        Alternative: iterative refinement instead of parallel sampling.
        Similar to "self-reflection" in LLMs.
        """
        # Initial prediction
        trajectory = self.vla.generate(
            observation, task, temperature=0.3
        )
        
        for refine_step in range(3):
            # PRM evaluate
            scores = self.prm.evaluate_trajectory(trajectory)
            
            # Find the timestep with the lowest score
            worst_t = scores.argmin().item()
            
            # Re-generate from worst_t onward,
            # using trajectory[:worst_t] as context
            refined = self.vla.generate(
                observation,
                task,
                prefix=trajectory[:worst_t],
                temperature=0.5,
            )
            
            trajectory = trajectory[:worst_t] + refined[worst_t:]
        
        return trajectory[0]
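
The selection step itself does not depend on a VLA or a PRM, so it can be tried out in isolation. Below is a minimal, dependency-free sketch of best-of-N selection. The toy sampler and scorer (sample_action, score_action) are hypothetical stand-ins for the policy and the PRM, chosen only to make the block runnable.

```python
import random


def best_of_n(sample_fn, score_fn, n):
    """Draw n candidates and return (best_candidate, best_score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = sample_fn()
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (hypothetical): a "policy" that proposes a scalar action
# and a "PRM" that prefers actions close to 0.3.
rng = random.Random(0)


def sample_action():
    return rng.uniform(-1.0, 1.0)


def score_action(action):
    return -abs(action - 0.3)


for n in (1, 4, 16):
    _, score = best_of_n(sample_action, score_action, n)
    print(n, round(score, 3))
```

The cost is n sampler calls plus n scorer calls per decision, which is where the "N times compute" in the numbers above comes from. The quality of the result is also capped by the scorer: best-of-N can only pick what the PRM ranks highest, not what is truly best.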

Results: OpenVLA-7B vs pi0-FAST

LIBERO Benchmark (40 manipulation tasks)

Model	Params	Success Rate	Training Data
OpenVLA (baseline)	7B	68.2%	970K demos
OpenVLA + SFT (more data)	7B	71.5%	970K + 50K
Octo	93M	52.3%	800K
pi0-FAST	3B	77.8%	10K (in-domain)
VLA-RL (OpenVLA + RL)	7B	82.3%	970K + RL

Note: VLA-RL beats pi0-FAST by 4.5 percentage points (82.3% vs. 77.8%) even though pi0-FAST was trained on high-quality in-domain demonstrations. In this comparison, RL training is more effective than collecting additional demonstrations.

Ablation: Contribution of Each Component

Configuration	Success Rate	Delta
OpenVLA baseline	68.2%	—
+ Process Reward Model only	72.1%	+3.9%
+ Trajectory-level PPO only	75.8%	+7.6%
+ Curriculum selection	79.4%	+11.2%
+ Test-time optimization (N=8)	82.3%	+14.1%

Each component contributes clearly. Test-time optimization in particular adds 2.9 points (79.4% → 82.3%), an improvement bought purely with inference compute, with no further training.

Comparison with WholeBodyVLA

The two papers complement each other:

Aspect	VLA-RL	WholeBodyVLA
Focus	Upgrading manipulation via RL	Whole-body loco-manipulation
Robot type	Fixed-base arms	Humanoid (AgiBot X2)
VLA backbone	OpenVLA-7B	Custom VLA
RL role	Fine-tunes the VLA directly	Trains the locomotion policy
Key insight	Inference scaling law	Manipulation-aware locomotion
Benchmark	LIBERO (sim)	Real robot tasks

VLA-RL shows that RL improves manipulation quality. WholeBodyVLA shows that RL helps locomotion support manipulation. Combining the two points toward a humanoid that both manipulates better and moves more stably.

Implications for the Robotics Community

1. RL > More Data

Adding 50K demonstrations improves success by only 3.3 points (68.2 → 71.5), while RL training improves it by 14.1 points (68.2 → 82.3, including test-time optimization). Once the VLA is large enough, RL is roughly 4 times as effective as collecting more data. That is good news for small labs: no team of teleoperators required.

2. PRM Is the Missing Piece

A sparse reward (0/1 success) is not enough for efficient RL training. The Process Reward Model addresses this by scoring each step, like a teacher grading every step of a math solution rather than only the final answer.

3. Inference Scaling = Compute for Performance

For the first time, robotics has an inference scaling law. This opens a new paradigm: instead of training a bigger model, let the current model "think" longer. For robots, 8x compute buys +10 points of success, a reasonable trade-off for high-stakes tasks.

4. The RLHF Paradigm Transfers Successfully

The entire RLHF stack (PPO, reward model, KL regularization) transfers almost unchanged from LLMs to VLAs. This is strong validation that the robotics community should invest in RL infrastructure, not only in data collection.

Limitations

Sim-only results: the benchmark is LIBERO (simulation); there is no real-robot validation yet
Compute cost: RL training of the 7B VLA takes about 48 hours on 8x A100 GPUs
PRM quality: the pseudo labels are noisy, so the learned PRM is imperfect
Single-task RL: each task group needs its own RL fine-tuning; there is no universal RL policy yet

References

VLA-RL: Towards Masterful Robot Manipulation via Scalable Reinforcement Learning — 2025
OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al., 2024
pi0-FAST: Fast Action Sequence Tokenization for VLA — Physical Intelligence, 2025
LIBERO: Lifelong Robot Learning Benchmark — Liu et al., 2023

VLA mạnh, nhưng chưa đủ mạnh

VLA-RL trả lời: có, và còn phát hiện inference scaling law đầu tiên trong robotics.

Nghiên cứu AI và robot manipulation

Paper nói gì?

VLA-RL (2025) đề xuất framework hoàn chỉnh để apply online RL lên pre-trained VLA models. 3 đóng góp chính:

Trajectory-level RL formulation: mô hình hóa manipulation trajectory như multi-modal multi-turn conversation — mỗi timestep là một "turn" trong cuộc hội thoại giữa robot và environment
Process Reward Model (PRM): VLM fine-tuned để đánh giá từng segment của trajectory — giải quyết sparse reward problem
Inference scaling law: lần đầu tiên chứng minh rằng cho VLA "suy nghĩ lâu hơn" tại inference time → performance tăng theo power law — giống scaling law của LLM

Kết quả: OpenVLA-7B sau RL training vượt baseline mạnh nhất 4.5% trên LIBERO benchmark (40 manipulation tasks), và match performance của pi0-FAST — model lớn hơn nhiều.

Insight cốt lõi: Manipulation là cuộc hội thoại

Auto-regressive VLA = Multi-turn Conversation

Đây là insight quan trọng nhất của paper. VLA auto-regressive (như OpenVLA) generate actions token-by-token, giống LLM generate text. Nhưng khác LLM, mỗi "turn" trong VLA là:

Turn t:
  User message:    [Image observation o_t]
  Assistant reply: [Action tokens a_t = (a_t^1, a_t^2, ..., a_t^K)]

Turn t+1:
  User message:    [Image observation o_{t+1}]  (result of a_t)
  Assistant reply: [Action tokens a_{t+1}]

...

Turn T (final):
  User message:    [Image observation o_T]
  Assistant reply: [Action tokens a_T]
  → Task success/failure

Một trajectory manipulation dài 50 timesteps = một "conversation" 50 turns. Reward cuối cùng (success/failure) giống verdict cho toàn bộ cuộc hội thoại.

Formulation toán học

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class VLARLFormulation:
    """
    Trajectory-level RL cho auto-regressive VLA.
    Mỗi trajectory = multi-turn conversation.
    """
    
    def __init__(self, vla_model, reward_model, kl_coeff=0.01):
        self.vla = vla_model          # OpenVLA-7B
        self.vla_ref = vla_model.copy()  # Frozen reference for KL
        self.reward_model = reward_model  # Process Reward Model
        self.kl_coeff = kl_coeff
    
    def compute_trajectory_reward(self, trajectory):
        """
        Trajectory = [(o_0, a_0), (o_1, a_1), ..., (o_T, a_T)]
        
        Reward tại mỗi timestep t:
          r_t = PRM(o_0:t, a_0:t)  (process reward)
          R_total = sum(r_t) + R_final  (sparse task reward)
        """
        observations = [step[0] for step in trajectory]
        actions = [step[1] for step in trajectory]
        
        # Process rewards cho từng segment
        process_rewards = []
        for t in range(len(trajectory)):
            # PRM đánh giá progress tại timestep t
            r_t = self.reward_model.evaluate(
                observations[:t+1], actions[:t+1]
            )
            process_rewards.append(r_t)
        
        return torch.stack(process_rewards)
    
    def compute_ppo_loss(self, trajectory, old_log_probs):
        """
        PPO loss với trajectory-level rewards.
        KL regularization chống catastrophic forgetting.
        """
        observations = [step[0] for step in trajectory]
        actions = [step[1] for step in trajectory]
        
        # Current policy log probs
        log_probs = []
        for t in range(len(trajectory)):
            # Mỗi action a_t gồm K tokens
            # Log prob = sum(log prob of each token)
            action_log_prob = self.vla.compute_log_prob(
                observations[t], actions[t]
            )
            log_probs.append(action_log_prob)
        
        log_probs = torch.stack(log_probs)
        
        # Advantages từ process rewards
        rewards = self.compute_trajectory_reward(trajectory)
        advantages = compute_gae(rewards, gamma=0.99, lam=0.95)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # PPO clipped objective
        ratio = torch.exp(log_probs - old_log_probs)
        clip_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
        policy_loss = -torch.min(
            ratio * advantages, 
            clip_ratio * advantages
        ).mean()
        
        # KL penalty vs reference model
        ref_log_probs = []
        for t in range(len(trajectory)):
            ref_lp = self.vla_ref.compute_log_prob(
                observations[t], actions[t]
            )
            ref_log_probs.append(ref_lp)
        ref_log_probs = torch.stack(ref_log_probs)
        
        kl_loss = (log_probs - ref_log_probs).mean()
        
        total_loss = policy_loss + self.kl_coeff * kl_loss
        return total_loss

Process Reward Model: Giải quyết Sparse Reward

Vấn đề: Manipulation reward quá sparse

PRM Architecture

import torch
import torch.nn as nn

class ManipulationPRM(nn.Module):
    """
    Process Reward Model cho robot manipulation.
    Fine-tuned từ VLM, đánh giá progress tại mỗi timestep.
    
    Training: tạo pseudo reward labels từ task segments.
    """
    
    def __init__(self, vlm_backbone):
        super().__init__()
        self.vlm = vlm_backbone  # e.g., Qwen-VL-7B
        self.reward_head = nn.Sequential(
            nn.Linear(vlm_backbone.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # output [0, 1] reward
        )
    
    def forward(self, image_sequence, action_sequence, task_description):
        """
        Đánh giá progress dựa trên visual history.
        
        Args:
            image_sequence: [o_0, ..., o_t] — observations đến hiện tại
            action_sequence: [a_0, ..., a_{t-1}] — actions đã thực hiện  
            task_description: "pick up the red cup"
        
        Returns:
            reward: float [0, 1] — progress score
        """
        # Construct prompt cho VLM
        # "Given these observations and actions, how much progress 
        #  has been made toward: {task}?"
        prompt = self._build_prompt(
            image_sequence, action_sequence, task_description
        )
        
        # VLM encode
        vlm_output = self.vlm(prompt)
        hidden = vlm_output.last_hidden_state[:, -1, :]
        
        # Reward prediction
        reward = self.reward_head(hidden)
        return reward
    
    def _build_prompt(self, images, actions, task):
        """Tạo multi-image prompt cho VLM."""
        prompt_parts = [f"Task: {task}\n\nProgress so far:\n"]
        
        for t, (img, act) in enumerate(zip(images, actions)):
            prompt_parts.append(f"Step {t}: [IMAGE] → Action: {act}\n")
        
        prompt_parts.append(
            f"Current observation: [IMAGE]\n"
            f"Rate the progress toward completing the task (0-1):"
        )
        
        return "".join(prompt_parts)

Tạo Pseudo Reward Labels

Trick thông minh: thay vì annotate reward bằng tay (không khả thi), paper tạo pseudo labels tự động:

def create_pseudo_reward_labels(
    successful_trajectories,
    failed_trajectories,
    num_segments=5,
):
    """
    Build pseudo reward labels for PRM training.
    
    Idea:
    - Successful trajectory: progress rises steadily from 0 to 1
    - Failed trajectory: progress rises a little, then plateaus at a low value
    - The final segment determines success or failure
    """
    labeled_data = []
    
    for traj in successful_trajectories:
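        T = len(traj)
        segment_size = max(1, T // num_segments)
        
        for seg_idx in range(num_segments):
            start = seg_idx * segment_size
            if seg_idx == num_segments - 1:
                end = T  # last segment absorbs any remainder frames
            else:
                end = min((seg_idx + 1) * segment_size, T)
            if start >= end:
                continue  # trajectory shorter than num_segments
            
            # Progress increases linearly along a successful trajectory
            progress = (seg_idx + 1) / num_segments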
            
            labeled_data.append({
                "images": traj[start:end],
                "reward": progress,
                "label": "positive",
            })
    
    for traj in failed_trajectories:
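        T = len(traj)
        segment_size = max(1, T // num_segments)
        
        for seg_idx in range(num_segments):
            start = seg_idx * segment_size
            if seg_idx == num_segments - 1:
                end = T  # last segment absorbs any remainder frames
            else:
                end = min((seg_idx + 1) * segment_size, T)
            if start >= end:
                continue  # trajectory shorter than num_segments
            
            # Failure: progress rises at half the normal rate, then plateaus
            if seg_idx < num_segments // 2:
                progress = (seg_idx + 1) / num_segments * 0.5
            else:
                # Flat, low label for the second half
                progress = 0.25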
            
            labeled_data.append({
                "images": traj[start:end],
                "reward": progress,
                "label": "negative",
            })
    
    return labeled_data
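
It helps to look at the actual label values this scheme produces. The small sketch below reproduces only the progress schedule from the function above (no images involved); `progress_labels` is a helper name introduced here for illustration.

```python
def progress_labels(num_segments=5, success=True):
    """Per-segment progress targets, mirroring the labeling rule above."""
    labels = []
    for seg_idx in range(num_segments):
        if success:
            # Linear ramp up to 1.0
            labels.append((seg_idx + 1) / num_segments)
        elif seg_idx < num_segments // 2:
            # Failed episode: half-rate ramp in the first half
            labels.append((seg_idx + 1) / num_segments * 0.5)
        else:
            # Failed episode: flat label afterwards
            labels.append(0.25)
    return labels


success_labels = progress_labels(success=True)
failure_labels = progress_labels(success=False)
print("success:", success_labels)
print("failure:", failure_labels)
```

With five segments, the failure labels climb to 0.25 and stay there. They never decrease, so the PRM learns "stalled progress" rather than "regression" from failed episodes.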

Training Loop: PPO for VLA

GPU-Balanced Vectorized Environments

RL training for a 7B-parameter VLA has to work within GPU memory limits. VLA-RL uses a decoupled actor-learner architecture:

import torch

class VLARLTrainer:
    """
    Training loop for VLA-RL.
    The actor (data collection) and the learner (gradient updates)
    run on separate GPU pools.
    """
    
    def __init__(
        self,
        vla_model,              # OpenVLA-7B
        prm_model,              # Process Reward Model
        env_suite,              # LIBERO environments
        num_envs=64,            # vectorized environments
        actor_gpus=(0, 1),      # GPUs for inference
        learner_gpus=(2, 3),    # GPUs for training
        lr=1e-5,                # small learning rate for fine-tuning
    ):
        self.vla = vla_model
        self.prm = prm_model
        self.envs = env_suite
        self.num_envs = num_envs
        
        # AdamW optimizer over all VLA parameters
        self.optimizer = torch.optim.AdamW(
            self.vla.parameters(),
            lr=lr,
            weight_decay=0.01,
        )
        
        # Curriculum: start from easy tasks
        self.curriculum = TaskCurriculum(env_suite)
    
    def collect_trajectories(self, batch_size=16):
        """
        Collect trajectories from the vectorized environments.
        Each environment runs one episode in parallel.
        """
        trajectories = []
        
        # Reset environments with curriculum-selected tasks
        tasks = self.curriculum.sample_tasks(batch_size)
        observations = self.envs.reset(tasks)
        
        episode_data = [[] for _ in range(batch_size)]
        dones = [False] * batch_size
        
        for step in range(200):  # max 200 steps
            if all(dones):
                break
            
            # VLA inference (batched on the actor GPUs)
            with torch.no_grad():
                actions = self.vla.predict(observations, tasks)
            
            # Environment step
            next_obs, rewards, new_dones, infos = self.envs.step(actions)
            
            # Store transitions
            for i in range(batch_size):
                if not dones[i]:
                    episode_data[i].append({
                        "obs": observations[i],
                        "action": actions[i],
                        "reward": rewards[i],
                        "done": new_dones[i],
                    })
                    if new_dones[i]:
                        dones[i] = True
            
            observations = next_obs
        
        # Compute process rewards for each trajectory
        for traj in episode_data:
            process_rewards = self.prm.evaluate_trajectory(traj)
            for t, step in enumerate(traj):
                step["process_reward"] = process_rewards[t]
            trajectories.append(traj)
        
        return trajectories
    
    def train_step(self, trajectories):
        """
        PPO update step on the learner GPUs.
        """
        # Flatten trajectories into transitions
        all_obs = []
        all_actions = []
        all_advantages = []
        all_old_log_probs = []
        
        for traj in trajectories:
            rewards = [s["process_reward"] for s in traj]
            
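            # GAE advantages. compute_gae is assumed to be defined elsewhere;
            # note that it receives only rewards here, no value estimates.
            advantages = compute_gae(
                rewards, gamma=0.99, lam=0.95
            )
            
            for t, step in enumerate(traj):
                all_obs.append(step["obs"])
                all_actions.append(step["action"])
                # as_tensor lets torch.stack below accept plain floats too
                all_advantages.append(
                    torch.as_tensor(advantages[t], dtype=torch.float32)
                )
                
                # Log-probs under the pre-update policy, used as the
                # PPO reference ("old") policy.
                with torch.no_grad():
                    old_lp = self.vla.compute_log_prob(
                        step["obs"], step["action"]
                    )
                    all_old_log_probs.append(old_lp)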
        
        # Mini-batch PPO updates
        dataset_size = len(all_obs)
        # .tolist() yields plain ints for indexing the Python lists below
        indices = torch.randperm(dataset_size).tolist()
        mini_batch_size = 32
        
        total_loss = 0
        for start in range(0, dataset_size, mini_batch_size):
            end = min(start + mini_batch_size, dataset_size)
            mb_indices = indices[start:end]
            
            mb_obs = [all_obs[i] for i in mb_indices]
            mb_actions = [all_actions[i] for i in mb_indices]
            mb_advantages = torch.stack(
                [all_advantages[i] for i in mb_indices]
            )
            mb_old_lp = torch.stack(
                [all_old_log_probs[i] for i in mb_indices]
            )
            
            # Forward pass
            new_log_probs = self.vla.compute_log_probs_batch(
                mb_obs, mb_actions
            )
            
            # PPO loss
            ratio = torch.exp(new_log_probs - mb_old_lp)
            clip_ratio = torch.clamp(ratio, 0.8, 1.2)
            
            loss = -torch.min(
                ratio * mb_advantages,
                clip_ratio * mb_advantages,
            ).mean()
            
            # Gradient step
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                self.vla.parameters(), max_norm=1.0
            )
            self.optimizer.step()
            
            total_loss += loss.item()
        
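        # Average over the real number of mini-batches (ceil division);
        # the guard avoids dividing by zero on an empty dataset.
        num_batches = (dataset_size + mini_batch_size - 1) // mini_batch_size
        return total_loss / max(num_batches, 1)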
    
    def train(self, num_iterations=1000):
        """Main training loop."""
        for iteration in range(num_iterations):
            # Collect data
            trajectories = self.collect_trajectories(batch_size=64)
            
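            # Success rate: an episode counts as a success if its final
            # environment reward exceeds 0.5. Empty episodes are skipped.
            successes = sum(
                1 for t in trajectories if t and t[-1]["reward"] > 0.5
            )
            success_rate = successes / len(trajectories)
            
            # Update the curriculum based on the success rate
            self.curriculum.update(success_rate)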
            
            # PPO update
            loss = self.train_step(trajectories)
            
            print(
                f"Iter {iteration}: loss={loss:.4f}, "
                f"success={success_rate:.2%}, "
                f"curriculum_level={self.curriculum.level}"
            )
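
The trainer calls `compute_gae` but the listing never defines it. Below is a minimal sketch of what such a helper could look like, plus the clipped PPO objective for a single sample. Both are my own illustrations, not the paper's code: the `values` argument and its all-zeros default are assumptions, and the results are plain Python floats that would still need converting to tensors before stacking.

```python
def compute_gae(rewards, values=None, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    values: optional per-step value estimates V(s_t). Assumption: when
    omitted they default to zero, so advantages reduce to
    (gamma * lam)-discounted sums of future rewards.
    """
    T = len(rewards)
    if values is None:
        values = [0.0] * T
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # No bootstrap past the end of the episode
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


# A sparse reward that only arrives at the last step
advantages = compute_gae([0.0, 0.0, 1.0])
print([round(a, 4) for a in advantages])

# A large policy shift on a good action is capped at 1 + eps
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With eps=0.2 the clip range is [0.8, 1.2], matching the `torch.clamp(ratio, 0.8, 1.2)` call in the trainer.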

[Image: robot arm in a research environment]

Curriculum Selection

The paper uses automatic curriculum learning: start with easy tasks and gradually increase the difficulty:

class TaskCurriculum:
    """
    Automatic curriculum selection.
    Tasks are split into levels by difficulty.
    The level increases when the success rate exceeds the threshold.
    """
    
    def __init__(self, env_suite, success_threshold=0.7):
        self.env_suite = env_suite
        self.success_threshold = success_threshold
        self.level = 0
        
        # LIBERO tasks grouped by difficulty
        self.task_levels = {
            0: ["pick_up_object", "push_to_target"],     # Easy
            1: ["stack_two_blocks", "open_drawer"],        # Medium
            2: ["sort_objects", "pour_liquid"],             # Hard
            3: ["assemble_parts", "tool_use"],             # Expert
        }
        
        self.success_history = []
    
    def sample_tasks(self, batch_size):
        """Sample tasks from the current and lower levels."""
        available_tasks = []
        for lvl in range(self.level + 1):
            available_tasks.extend(self.task_levels.get(lvl, []))
        
        # With probability 0.7 draw from the current level; otherwise
        # draw uniformly from all unlocked levels (current included),
        # which replays earlier tasks.
        current_tasks = self.task_levels.get(self.level, [])
        
        tasks = []
        for _ in range(batch_size):
            if torch.rand(1).item() < 0.7 and current_tasks:
                tasks.append(
                    current_tasks[
                        torch.randint(len(current_tasks), (1,)).item()
                    ]
                )
            else:
                tasks.append(
                    available_tasks[
                        torch.randint(len(available_tasks), (1,)).item()
                    ]
                )
        
        return tasks
    
    def update(self, success_rate):
        """Advance the level once the success rate is high enough."""
        self.success_history.append(success_rate)
        
        # Check the 10 most recent iterations
        if len(self.success_history) >= 10:
            recent_avg = sum(self.success_history[-10:]) / 10
            if recent_avg > self.success_threshold:
                max_level = max(self.task_levels.keys())
                if self.level < max_level:
                    self.level += 1
                    self.success_history = []
                    print(f"Curriculum advanced to level {self.level}")
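
To see how the advancement rule behaves over time, here is a small simulation of the same logic (10-iteration window, threshold 0.7, history reset on advance). The success-rate sequence is invented for illustration, and `simulate_curriculum` is a helper name introduced here.

```python
def simulate_curriculum(success_rates, threshold=0.7, window=10, max_level=3):
    """Replay the level-up rule on a sequence of per-iteration success rates."""
    level = 0
    history = []
    levels = []
    for rate in success_rates:
        history.append(rate)
        if len(history) >= window:
            recent_avg = sum(history[-window:]) / window
            if recent_avg > threshold and level < max_level:
                level += 1
                history = []  # start a fresh window at the new level
        levels.append(level)
    return levels


# Made-up training curve: weak start, then strong performance
rates = [0.3] * 5 + [0.9] * 10 + [0.8] * 10
levels = simulate_curriculum(rates)
for i, (rate, level) in enumerate(zip(rates, levels)):
    print(f"iter {i:2d}: success={rate:.1f} level={level}")
```

Because the history is cleared on every advance, the policy spends at least 10 iterations at each level, even when it is already good at the next one.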

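Inference Scaling Law: A Breakthrough Finding

This is the most surprising result: VLA-RL identifies the first inference scaling law in robotics.

What Is an Inference Scaling Law?

For LLMs, spending more compute at inference time ("thinking longer") is known to improve performance. VLA-RL shows the same effect for robots: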

Best-of-N sampling:
  N=1:   success 72%
  N=4:   success 78%
  N=8:   success 82%
  N=16:  success 84%
  N=32:  success 85.5%

Roughly log-linear relationship: success rate grows approximately with log(N), with diminishing gains at large N
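
As a quick sanity check on the log-linear claim, the sketch below fits a least-squares line to the five Best-of-N numbers listed above, using log2(N) as the x-axis. The fit is my own back-of-the-envelope check and does not come from the paper.

```python
import math

# Best-of-N success rates (percent), copied from the list above
best_of_n = {1: 72.0, 4: 78.0, 8: 82.0, 16: 84.0, 32: 85.5}

xs = [math.log2(n) for n in best_of_n]
ys = list(best_of_n.values())

# Ordinary least squares for y = intercept + slope * log2(N)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"success ≈ {intercept:.2f} + {slope:.2f} * log2(N)")
for n, r in zip(best_of_n, residuals):
    print(f"N={n:2d}: residual {r:+.2f} points")
```

The fit gives about 2.8 points per doubling of N, with every residual within roughly 1.2 points. The per-doubling gain shrinks from 4 points (N=4 to 8) to 1.5 points (N=16 to 32), so the line is an approximation, not an exact law.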

Test-Time Optimization

class VLATestTimeOptimization:
    """
    Inference scaling: generate N trajectories and
    pick the best one by PRM score.
    """
    
    def __init__(self, vla_model, prm_model, n_samples=8):
        self.vla = vla_model
        self.prm = prm_model
        self.n_samples = n_samples
    
    def predict_with_scaling(self, observation, task):
        """
        Best-of-N sampling for robot actions.
        
        Trade compute for performance:
        - N=1: fastest, baseline performance
        - N=8: 8x compute, +10 points success rate
        - N=32: 32x compute, +13.5 points success rate
        """
        candidates = []
        
        for _ in range(self.n_samples):
            # Sample an action trajectory with temperature > 0
            trajectory = self.vla.generate(
                observation, 
                task,
                temperature=0.8,    # diversity
                max_steps=50,
            )
            
            # Evaluate với PRM
            prm_score = self.prm.evaluate_trajectory(trajectory)
            
            candidates.append({
                "trajectory": trajectory,
                "score": prm_score.mean().item(),
            })
        
        # Pick the trajectory with the highest PRM score
        best = max(candidates, key=lambda x: x["score"])
        
        return best["trajectory"][0]  # return first action
    
    def predict_with_refinement(self, observation, task):
        """
        Alternative: iterative refinement instead of parallel sampling.
        Similar to "self-reflection" in LLMs.
        """
        # Initial prediction
        trajectory = self.vla.generate(
            observation, task, temperature=0.3
        )
        
        for refine_step in range(3):
            # PRM evaluate
            scores = self.prm.evaluate_trajectory(trajectory)
            
            # Find the timestep with the lowest score
            worst_t = scores.argmin().item()
            
            # Re-generate from worst_t onward,
            # using trajectory[:worst_t] as context
            refined = self.vla.generate(
                observation,
                task,
                prefix=trajectory[:worst_t],
                temperature=0.5,
            )
            
            trajectory = trajectory[:worst_t] + refined[worst_t:]
        
        return trajectory[0]
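
Stripped of the VLA and PRM, Best-of-N is just "sample N candidates, keep the highest-scoring one". The toy sketch below shows that core with a random-number sampler and a hand-written scorer standing in for the policy and the PRM. All names here are illustrative.

```python
import random


def best_of_n(sample_fn, score_fn, n):
    """Draw n candidates and return the one with the highest score."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)


def toy_score(action):
    # Stand-in for a PRM: prefers actions close to a target value of 1.0
    return -abs(action - 1.0)


best_scores = {}
for n in (1, 4, 16):
    # Same seed each time, so a smaller run is a prefix of a larger one
    rng = random.Random(0)
    best = best_of_n(lambda: rng.gauss(0.0, 1.0), toy_score, n)
    best_scores[n] = toy_score(best)
    print(f"N={n:2d}  best action={best:+.3f}  score={best_scores[n]:.3f}")
```

With a shared seed the best score can only stay the same or improve as N grows, because larger runs contain every candidate of the smaller ones. On a real robot the gain also depends on how well the scorer ranks candidates, which is why PRM quality matters for test-time optimization.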

Results: OpenVLA-7B vs pi0-FAST

LIBERO Benchmark (40 manipulation tasks)

Model	Params	Success Rate	Training Data
OpenVLA (baseline)	7B	68.2%	970K demos
OpenVLA + SFT (additional data)	7B	71.5%	970K + 50K
Octo	93M	52.3%	800K
pi0-FAST	3B	77.8%	10K (in-domain)
VLA-RL (OpenVLA + RL)	7B	82.3%	970K + RL

Note: VLA-RL beats pi0-FAST by 4.5 percentage points (82.3% vs 77.8%) even though pi0-FAST was trained on high-quality in-domain demonstrations. On this benchmark, RL training was more effective than collecting more demonstrations (71.5% with the extra SFT data).

Ablation: Contribution of Each Component

Configuration	Success Rate	Delta
OpenVLA baseline	68.2%	—
+ Process Reward Model only	72.1%	+3.9%
+ Trajectory-level PPO only	75.8%	+7.6%
+ Curriculum selection	79.4%	+11.2%
+ Test-time optimization (N=8)	82.3%	+14.1%

Each component makes a clear contribution. In particular, test-time optimization adds 2.9 points (79.4% to 82.3%), an improvement that costs only extra inference compute and no additional training.
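
The deltas in the ablation table can be recomputed in a few lines. The sketch below reads the rows as cumulative additions, which is an assumption on my part: the "only" labels on the second and third rows make that reading uncertain.

```python
# Success rates (percent), copied from the ablation table
ablation = [
    ("OpenVLA baseline", 68.2),
    ("+ Process Reward Model only", 72.1),
    ("+ Trajectory-level PPO only", 75.8),
    ("+ Curriculum selection", 79.4),
    ("+ Test-time optimization (N=8)", 82.3),
]

baseline = ablation[0][1]
rows = []
prev = baseline
for name, rate in ablation:
    total_gain = round(rate - baseline, 1)  # gain over the baseline
    step_gain = round(rate - prev, 1)       # gain over the previous row
    rows.append((name, total_gain, step_gain))
    prev = rate

for name, total_gain, step_gain in rows:
    print(f"{name:32s} total +{total_gain:4.1f}  step +{step_gain:3.1f}")
```

Under that reading the step gains are 3.9, 3.7, 3.6 and 2.9 points, so no single component dominates the 14.1-point total.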

Comparison with WholeBodyVLA

The two papers complement each other:

Aspect	VLA-RL	WholeBodyVLA
Focus	Improving manipulation through RL	Whole-body loco-manipulation
Robot type	Fixed-base arms	Humanoid (AgiBot X2)
VLA backbone	OpenVLA-7B	Custom VLA
RL role	Fine-tunes the VLA directly	Trains the locomotion policy
Key insight	Inference scaling law	Manipulation-aware locomotion
Benchmark	LIBERO (sim)	Real robot tasks

Implications for the Robotics Community

1. RL > More Data

2. PRM is the missing piece

3. Inference scaling = compute for performance

4. The RLHF paradigm transfers successfully

Limitations

Sim-only results: benchmarked on LIBERO (simulation); no real-robot validation yet
Compute cost: RL training of the 7B VLA needs 8x A100 for about 48 hours
PRM quality: the pseudo labels are noisy, so the learned PRM is imperfect
Single-task RL: each task group needs its own RL fine-tuning; there is no universal RL recipe yet

References

VLA-RL: Towards Masterful Robot Manipulation via Scalable Reinforcement Learning — 2025
OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al., 2024
pi0-FAST: Fast Action Sequence Tokenization for VLA — Physical Intelligence, 2025
LIBERO: Lifelong Robot Learning Benchmark — Liu et al., 2023

Nguyễn Anh Tuấn

Related posts

FORCE: A 79% increase in success rate from fine-tuning VLAs with RL

InternVLA-A1 guide: VLA + World Model via Mixture-of-Transformers

Qwen-VLA: Alibaba's generalist VLA model
