Tags: ai, ai-perception, vla, reinforcement-learning, research

SimpleVLA-RL: Improving VLA with RL

SimpleVLA-RL uses reinforcement learning with simple 0/1 rewards to boost VLA from 17 to 92 points — no complex reward engineering needed.

Nguyễn Anh Tuấn · April 10, 2026 · 8 min read

The Problem: VLA Models Hit a Ceiling After SFT

Vision-Language-Action (VLA) models are the dominant approach for robot manipulation today. Models like OpenVLA, RT-2, and Pi0 have demonstrated that combining vision, language, and action in a single foundation model can produce powerful robot policies.

However, nearly all current VLAs are trained using Supervised Fine-Tuning (SFT) -- learning to imitate actions from human demonstration data. This approach has a fundamental limitation: the model can only be as good as its training data, never better.

Think of it like learning to drive by watching videos of other people driving. You can pick up the basic maneuvers, but when you encounter a novel situation -- an unfamiliar road, an unexpected obstacle -- you won't know what to do. You need to actually drive and receive feedback to truly improve.

That is exactly what SimpleVLA-RL addresses.

Reinforcement learning enables robots to discover novel behaviors beyond training data

What is SimpleVLA-RL?

SimpleVLA-RL is a framework presented at ICLR 2026 that improves VLA models through online reinforcement learning using an extremely simple reward signal: 0 or 1 (failure or success). No complex reward function design, no reward shaping, no dense rewards.

The core idea:

  1. Start from a VLA that has already been SFT-trained (e.g., OpenVLA-OFT)
  2. Let the robot explore through trial and error in simulation with binary reward
  3. Update the policy using RL (specifically a PPO variant)
  4. Result: The VLA improves far beyond SFT limits
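
The four steps above can be sketched with a toy stand-in for the VLA: a single-parameter policy trained by REINFORCE on a strictly binary reward. This is purely illustrative (the real system updates a full VLA with a PPO variant over simulator rollouts), but it shows how a 0/1 outcome signal alone is enough to shift a policy toward the successful action:

```python
import math
import random

random.seed(0)

# Toy stand-in for an SFT-initialized VLA: one logit choosing between
# two actions. Action 1 succeeds, action 0 fails.
logit = 0.0   # "SFT-initialized" parameter
lr = 0.5      # learning rate

def prob_action1(logit):
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

for step in range(200):
    p = prob_action1(logit)
    action = 1 if random.random() < p else 0
    reward = 1.0 if action == 1 else 0.0   # binary 0/1 reward, no shaping
    # REINFORCE: grad of log pi(action) times reward (baseline omitted)
    grad_logp = (1 - p) if action == 1 else -p
    logit += lr * grad_logp * reward

# The policy has learned to prefer the successful action
assert prob_action1(logit) > 0.9
```

Only successful rollouts produce a gradient here, which is exactly why the SFT warm start matters: the initial policy must succeed often enough for the binary signal to bite.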

Why is Binary Reward Sufficient?

In traditional RL for robotics, designing reward functions is a dark art (and a nightmare). You typically need to:

  - craft dense shaping terms (distance-to-goal, grasp bonuses, smoothness penalties)
  - hand-tune the weights of those terms for every task
  - debug reward hacking, where the policy exploits loopholes in the shaped reward

SimpleVLA-RL proves that none of this is necessary if you start from a VLA that already has foundational knowledge from SFT. The model already knows how to grasp, how to move -- RL just needs to tell it whether it succeeded or not so it can self-optimize.
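
Concretely, the entire reward "design" collapses to a single outcome check. The contrast below with a typical hand-shaped reward is illustrative only (both functions are sketches, not taken from the codebase):

```python
# The whole reward function under SimpleVLA-RL's scheme: outcome only.
def binary_reward(task_succeeded: bool) -> float:
    return 1.0 if task_succeeded else 0.0

# What traditional pipelines often require instead (illustrative):
# dense terms, each with a weight that must be tuned per task.
def shaped_reward(dist_to_goal: float, grasp_ok: float, time_step: int) -> float:
    return -0.1 * dist_to_goal + 0.5 * grasp_ok - 0.01 * time_step

assert binary_reward(True) == 1.0
assert binary_reward(False) == 0.0
```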

It's like coaching a skilled chef: you don't need to instruct every knife cut and every stir. You just taste the dish and say "good" or "not yet" -- the chef knows what to adjust.

The "Pushcut" Phenomenon: RL Discovers Novel Actions

One of the most fascinating findings in SimpleVLA-RL is the "pushcut" phenomenon -- RL autonomously discovers entirely new actions that do not exist in any demonstration.

Specifically, in vegetable cutting tasks, humans demonstrated conventional knife-cutting motions. But after RL training, the robot discovered it could push the knife through the object (push + cut = pushcut) -- a technique no human demonstrator used, yet one that proved more effective for the robot given its specific gripper configuration.

This is powerful evidence that RL can liberate VLAs from the constraints of human data. The robot doesn't just imitate better -- it invents new approaches suited to its own physical capabilities.

Architecture: veRL + OpenVLA-OFT

SimpleVLA-RL is built on two core components:

OpenVLA-OFT (Policy Base)

OpenVLA-OFT is OpenVLA fine-tuned with the Optimized Fine-Tuning (OFT) recipe -- parallel decoding, continuous actions, and an L1 regression objective -- which substantially improves both inference speed and task success. This serves as the starting point for RL training.

veRL (RL Framework)

veRL is a high-performance RL framework originally designed for Large Language Model training (RLHF). SimpleVLA-RL extends veRL to support:

  - VLA policies whose rollouts are trajectories in robot simulators rather than text generations
  - parallel rollout workers across many simulated environments
  - binary outcome rewards computed from task success

The training pipeline:

OpenVLA-OFT (SFT policy)
    │
    ▼
veRL RL Training Loop
    ├── Rollout Workers (simulation environments)
    ├── Reward: binary 0/1 (task success/fail)
    ├── Policy Gradient (PPO-based)
    └── KL Divergence constraint (prevent catastrophic forgetting)
    │
    ▼
SimpleVLA-RL (improved policy)

A critical element is the KL divergence constraint -- it keeps the new policy from drifting too far from the original SFT policy. This prevents RL from causing the model to "forget" what it learned from SFT, a common problem known as catastrophic forgetting.
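
One common way to apply such a constraint is to subtract a scaled KL term from the reward before the policy update. The sketch below shows the mechanism for discrete action distributions (illustrative, not SimpleVLA-RL's exact formulation):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kl_coeff = 0.01  # same role as kl_coeff in the training config

def penalized_reward(env_reward, pi_new, pi_sft):
    # Penalize drift of the RL policy away from the frozen SFT policy.
    return env_reward - kl_coeff * kl_divergence(pi_new, pi_sft)

# Identical policies incur no penalty; drifted policies pay a cost.
assert penalized_reward(1.0, [0.5, 0.5], [0.5, 0.5]) == 1.0
assert penalized_reward(1.0, [0.9, 0.1], [0.5, 0.5]) < 1.0
```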

AI system architecture combining multiple components for robot learning

Results: The Numbers Speak for Themselves

LIBERO-Long Benchmark

| Method | Success Rate (%) |
| --- | --- |
| OpenVLA-OFT (SFT only) | 85.4 |
| SimpleVLA-RL | 97.6 |
| Improvement | +12.2 points |

A 97.6% success rate on LIBERO-Long is state-of-the-art, significantly surpassing all SFT-only methods.

Cold-Start: The Miracle from 1 Trajectory

The most impressive result is the cold-start experiment: with only 1 trajectory per task (instead of hundreds of demonstrations), SimpleVLA-RL achieves:

| Setup | Success Rate (%) |
| --- | --- |
| 1 demo + SFT only | 17.3 |
| 1 demo + SFT + RL | 91.7 |
| Improvement | +430% (relative) |

From 17.3 to 91.7 -- a 430% relative improvement -- with just one demonstration. The practical implications are enormous: you don't need to collect thousands of expensive demonstrations. One demo is enough to warm-start the policy, and RL self-improves from there.
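
A quick sanity check of that relative-improvement figure:

```python
before, after = 17.3, 91.7

gain_points = after - before               # absolute gain in points
gain_relative = (after - before) / before  # relative improvement

assert abs(gain_points - 74.4) < 1e-9
assert round(gain_relative * 100) == 430   # ~430% relative gain
```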

Real-World Results

On real-world dexterous manipulation tasks (not simulation), SimpleVLA-RL achieves approximately a 300% relative improvement over the SFT baseline.

The sim-to-real gap is significantly narrowed thanks to the more robust policy produced by RL training.

Comparison with Other Approaches

| Method | Strengths | Weaknesses |
| --- | --- | --- |
| Pure SFT | Simple, stable | Limited by data quality |
| DAgger | Iterative, expert feedback | Requires continuous expert access |
| Offline RL | No environment needed | Hard to exceed data distribution |
| Online RL from scratch | No demos needed | Sample-inefficient, needs reward engineering |
| SimpleVLA-RL | Binary reward, exceeds demos | Requires simulation environment |

SimpleVLA-RL occupies a "sweet spot": it leverages SFT knowledge without being constrained by it, while avoiding the complex reward engineering of traditional RL.

Installation and Training Guide

Hardware Requirements

Installation

# Clone repository
git clone https://github.com/PRIME-RL/SimpleVLA-RL.git
cd SimpleVLA-RL

# Create conda environment
conda create -n simplevla-rl python=3.10
conda activate simplevla-rl

# Install dependencies
pip install -e .

# Install veRL (RL framework)
pip install verl

# Install simulation environment (LIBERO)
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO

Training

# Train on LIBERO benchmark
bash examples/run_openvla_oft_rl_libero.sh

This script will:

  1. Load the OpenVLA-OFT pretrained checkpoint
  2. Initialize LIBERO simulation environments
  3. Run the RL training loop with binary reward
  4. Save checkpoints periodically

Key Configuration

In the config file, the main hyperparameters are:

# Number of parallel environments
num_envs: 64

# Binary reward
reward_type: "binary"  # 0 or 1

# KL constraint (keep policy close to SFT)
kl_coeff: 0.01

# Training steps
total_steps: 50000
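
For reference, those fields could be mirrored in a small typed config object on the Python side (field names are assumed from the snippet above, not taken from the actual codebase):

```python
from dataclasses import dataclass

@dataclass
class RLConfig:
    # Mirrors the YAML hyperparameters shown above (names assumed)
    num_envs: int = 64           # parallel simulation environments
    reward_type: str = "binary"  # 0/1 task outcome, no shaping
    kl_coeff: float = 0.01       # weight of the KL constraint vs. SFT policy
    total_steps: int = 50_000    # RL training steps

cfg = RLConfig()
assert cfg.reward_type == "binary"
```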

Supported Benchmarks

  - LIBERO (Spatial, Object, Goal, and Long suites)
  - RoboTwin 1.0 and 2.0

Significance and Future Directions

Why SimpleVLA-RL Matters

  1. Breaks the SFT ceiling: First clear demonstration that online RL can push VLA far beyond demonstration data limits.

  2. Democratizes robot learning: Binary reward = no reward engineering expertise needed. Anyone with a simulation environment can use it.

  3. Data efficiency: Cold-starting from 1 demo fundamentally changes the data collection equation. Collecting one demo takes minutes; collecting thousands takes months.

  4. Emergent behaviors: The pushcut phenomenon shows RL can generate novel behaviors -- robots don't just imitate, they create.

Future Development

The authors are working on extending SimpleVLA-RL to:

Personal Assessment

SimpleVLA-RL is one of the most important papers at ICLR 2026 for robot learning. It addresses exactly the problem the community has been struggling with: how to improve VLA after SFT has been exhausted.

What I appreciate most is the simplicity -- binary reward, no complex tricks, no magic hyperparameters. This is the hallmark of genuinely good research: a simple solution to a hard problem.

One caveat: this approach still requires high-quality simulation environments. If the sim-to-real gap is large, RL results in simulation may not transfer well to reality. But with the rapid advancement of simulation platforms like Isaac Sim and MuJoCo, this is becoming less of a concern.

Paper: SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning -- ICLR 2026

GitHub: PRIME-RL/SimpleVLA-RL

The future of robot learning: combining SFT and RL to break data limitations

