Introduction: From Idea to Implementation
In Part 1, we understood why SimpleVLA-RL matters: it breaks through the SFT performance ceiling by letting VLA models practice with binary rewards. Now it's time to answer how — what architecture, what algorithm, and what engineering decisions make this system work.
This post dives deep into three technical pillars: (1) the OpenVLA-OFT backbone and its token-based action representation, (2) the GRPO algorithm with asymmetric clipping, and (3) dynamic sampling — a technique that seems simple but determines the success or failure of the entire system.
Backbone: OpenVLA-OFT
Architecture Overview
SimpleVLA-RL is built on OpenVLA-OFT (Open Vision-Language-Action with Optimized Fine-Tuning), an improved variant of the original OpenVLA. The architecture consists of:
- Language Model: LLaMA2-7B as the central "brain," processing both language and actions
- Vision Encoders: Two parallel encoders:
- SigLIP for semantic features (understanding "this is a cup")
- DINOv2 for spatial features (understanding "the cup is at coordinates (x, y, z)")
- Action Head: Converts LLaMA's output into 7-DoF robot actions
The workflow: camera images pass through both vision encoders, features are concatenated and projected into LLaMA2's embedding space. Combined with a natural language task instruction (e.g., "pick up the red cup"), LLaMA2 generates a sequence of action tokens.
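The fusion step can be sketched at the shape level. The dimensions below are illustrative placeholders, not the exact OpenVLA-OFT configuration, and the random weights stand in for the trained projector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the exact OpenVLA-OFT config)
n_patches, d_siglip, d_dino, d_llama = 256, 1152, 1024, 4096

# Patch features from the two parallel vision encoders
siglip_feats = rng.standard_normal((n_patches, d_siglip))  # semantic features
dino_feats = rng.standard_normal((n_patches, d_dino))      # spatial features

# Concatenate per patch, then project into LLaMA2's embedding space
fused = np.concatenate([siglip_feats, dino_feats], axis=-1)   # (256, 2176)
W_proj = rng.standard_normal((d_siglip + d_dino, d_llama)) * 0.01
visual_tokens = fused @ W_proj  # (256, 4096): prepended to the text tokens

print(visual_tokens.shape)
```

The projected visual tokens and the embedded instruction tokens then form one sequence that LLaMA2 processes autoregressively.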
Token-based Actions: Why Not Regression?
This is the critical design decision, and it determines the feasibility of applying RL.
Traditional approach (regression): The action head directly outputs 7 continuous values (6-DoF pose + gripper). Uses loss functions like L1 or MSE. The original OpenVLA-OFT uses an MLP head with L1 loss.
SimpleVLA-RL's approach (token-based): Each action dimension is discretized into bins, then represented as tokens — exactly like how LLMs generate text. Each action step produces 256 tokens. Action chunking groups multiple steps into one sequence:
- LIBERO: chunk size = 8 steps, yielding 8 x 256 = 2048 tokens
- RoboTwin: chunk size = 25 steps, yielding 25 x 256 = 6400 tokens
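A minimal sketch of OpenVLA-style per-dimension discretization, assuming 256 bins and a normalized action range — the exact bin count and ranges in SimpleVLA-RL may differ:

```python
import numpy as np

N_BINS = 256  # assumed bins per action dimension

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous 7-DoF action to discrete bin indices (token ids)."""
    action = np.clip(np.asarray(action), low, high)
    # Scale to [0, 1], then to a bin index in [0, N_BINS - 1]
    return ((action - low) / (high - low) * (N_BINS - 1)).round().astype(int)

def tokens_to_action(idx, low=-1.0, high=1.0):
    """Invert the discretization (up to quantization error)."""
    return low + idx / (N_BINS - 1) * (high - low)

a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])  # 6-DoF pose + gripper
tokens = action_to_tokens(a)
recovered = tokens_to_action(tokens)
print(np.max(np.abs(recovered - a)))  # small quantization error
```

The round trip loses at most half a bin width per dimension, which is why discretization is accurate enough for manipulation while still giving the model a proper categorical distribution per token.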
Why is token-based critical for RL? Three reasons:
1. Natural probability distributions: Each token has a probability distribution over the vocabulary, enabling log-probability computation — the core ingredient of policy gradients. With a regression output, you must assume a distribution (e.g., Gaussian), and that assumption is often wrong.
2. GRPO compatibility: GRPO (and PPO, REINFORCE, etc.) needs to compute the ratio pi(a|s)/pi_old(a|s). With token-based actions, this ratio comes directly from the log-probabilities of autoregressive generation. With regression, you need to parameterize a separate distribution.
3. Natural exploration: Raising the sampling temperature automatically increases exploration — no separate noise injection needed. At temperature 1.6, the model is "creative" enough to try novel actions.
Comparison with Original OpenVLA-OFT
| Feature | Original OpenVLA-OFT | SimpleVLA-RL Version |
|---|---|---|
| Action representation | MLP head, continuous | Token-based, discrete |
| Loss function | L1 regression | Cross-entropy (SFT), GRPO (RL) |
| Camera views | Multi-view | Single-view (reduced complexity) |
| RL compatible? | Difficult (needs distribution head) | Yes, naturally |
The switch to single-view is also noteworthy. Multi-view provides better 3D information, but single-view dramatically reduces computational cost for RL rollouts — and with sufficient training data, single-view achieves comparable performance.
GRPO: The Optimization Algorithm
What is Group Relative Policy Optimization?
GRPO (Group Relative Policy Optimization) was originally developed for RLHF in LLMs (by the DeepSeek team), and SimpleVLA-RL is one of the first works to apply it to robot manipulation.
The core idea: instead of using a separate critic network to estimate the value function (like PPO), GRPO computes advantages relative to the group — comparing trajectories against each other.
How GRPO Works, Step by Step
Step 1: Sampling. For each query (initial state + task instruction), the model generates G = 8 different trajectories (enabled by high temperature). Each trajectory is a sequence of action tokens.
Step 2: Evaluate. Execute all 8 trajectories in simulation, assigning binary rewards:
# Pseudocode
rewards = []
for trajectory in trajectories:
    success = env.evaluate(trajectory)
    rewards.append(1.0 if success else 0.0)
# Example: rewards = [1, 0, 1, 1, 0, 0, 1, 0]
Step 3: Compute group-relative advantage. Normalize rewards within the group:
mean_r = mean(rewards) # = 0.5
std_r = std(rewards) # approx 0.535
advantages = [(r - mean_r) / std_r for r in rewards]
# Successful trajectories get positive advantage, failures get negative
Step 4: Policy update. Maximize the objective:
L = E[ min(ratio * A, clip(ratio, 1-eps_low, 1+eps_high) * A) ]
Where:
- ratio = pi_new(a|s) / pi_old(a|s) — probability ratio between new and old policy
- A — advantage from step 3
- eps_low = 0.2, eps_high = 0.28 — asymmetric clipping bounds
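The objective can be sketched directly in NumPy. The log-probabilities and advantage below are toy values chosen to show the clipping behavior, not outputs of the actual model:

```python
import numpy as np

def grpo_objective(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with asymmetric bounds (a sketch of Step 4).

    logp_new / logp_old: per-token log-probabilities; advantages: per-token A.
    """
    ratio = np.exp(logp_new - logp_old)              # pi_new / pi_old
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    # Pessimistic (min) surrogate, averaged over tokens
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# A rare-but-successful action: ratio 1.5 gets capped at 1.28 (not PPO's 1.2)
logp_old = np.log(np.array([0.02]))
logp_new = np.log(np.array([0.03]))
A = np.array([0.935])  # positive advantage for a successful trajectory
print(grpo_objective(logp_new, logp_old, A))
```

In training this quantity is maximized (or its negative minimized); the `min` makes the update pessimistic, so the policy never profits from pushing the ratio outside the clip range.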
Asymmetric Clipping: A Subtle but Powerful Technique
This is one of the most important technical contributions. In standard PPO, clipping is symmetric: [1-eps, 1+eps] with eps = 0.2, giving [0.8, 1.2]. SimpleVLA-RL uses asymmetric clipping: [0.8, 1.28].
Why? Consider a concrete scenario:
- Rare but successful actions: An action that the old policy assigned very low probability (e.g., pushing instead of grasping) but that led to success. The ratio is large (pi_new >> pi_old). With symmetric clipping, the ratio is capped at 1.2 — the policy can only "gently" increase the probability. With asymmetric clipping (capped at 1.28), the policy can raise the probability more aggressively for good actions it hadn't tried before.
- Common but failed actions: The ratio is small (pi_new < pi_old) and is clipped at 0.8 — the probability decreases, but not so fast that diversity is lost.
In simple terms: eps_high > eps_low means "increase faster than decrease" — encouraging stronger exploration, which is especially important when binary reward creates weak gradient signals.
No KL Divergence: Radical Simplification
Traditional PPO often adds a KL penalty to prevent the new policy from diverging too far from the old one:
L = L_clip - beta * KL(pi_new || pi_old)
SimpleVLA-RL completely removes the KL term (beta = 0). The reasoning:
- GPU memory savings: No need to store the reference model (7B parameters), saving approximately 14GB VRAM per GPU.
- SFT provides a good base: After SFT, the VLA model already has a "reasonable" policy — it knows how to approach objects and move the gripper. RL only needs fine-tuning, not drastic changes. The clipping bound is sufficient to prevent policy divergence.
- Empirical evidence: Experiments show that adding KL doesn't improve results and can actually decrease performance by limiting exploration.
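The ~14GB figure follows from simple weight arithmetic, assuming the reference model is held in half precision (bf16/fp16):

```python
# Back-of-envelope: a second 7B model kept in memory just for the KL term
params = 7e9           # reference model parameters
bytes_per_param = 2    # bf16/fp16 storage
print(params * bytes_per_param / 1e9, "GB")
```

Dropping the KL term frees exactly this much VRAM per GPU, since the frozen reference copy of the policy no longer needs to be loaded at all.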
Dynamic Sampling: Simple Yet Decisive
The Problem: Vanishing Gradients
Recall Step 3 — computing advantages. If all 8 trajectories succeed (rewards = [1,1,1,1,1,1,1,1]), then:
mean_r = 1.0
std_r = 0.0 # All identical!
advantages = [0, 0, 0, 0, 0, 0, 0, 0] # No signal!
The same happens if all trajectories fail. When advantage = 0 for every trajectory, gradient = 0, and the model learns nothing.
This is especially severe with binary reward. With dense reward (0.1, 0.3, 0.7, 0.9), even if all trajectories succeed, advantages are still non-zero because rewards differ. But binary reward has only 2 values, so the probability of a uniform batch is very high.
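How likely is a uniform batch? Assuming trajectories within a group succeed independently with probability p, a two-line calculation shows the problem is worst at the start and end of training:

```python
def p_uniform_batch(p_success, group_size=8):
    """Probability that all G trajectories receive the same binary reward."""
    return p_success ** group_size + (1 - p_success) ** group_size

# Early training (low success) and late training (high success) both
# produce mostly uniform batches; mid-training produces the fewest.
for p in [0.05, 0.5, 0.95]:
    print(p, round(p_uniform_batch(p), 4))
```

At a 95% per-trajectory success rate, roughly two thirds of batches would carry zero gradient signal if kept — which is exactly why the filtering step below is not optional.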
The Solution: Discard Uniform Batches
SimpleVLA-RL applies dynamic sampling: for each batch, if all trajectories have the same reward (all-success or all-fail), discard the batch and resample.
# Pseudocode
while True:
    trajectories = model.generate(query, num_samples=8, temperature=1.6)
    rewards = [env.evaluate(t) for t in trajectories]
    if len(set(rewards)) > 1:  # Mix of successes and failures
        break  # This batch is useful
    # Otherwise, resample
This technique sounds simple, but it is essential: the paper's ablation shows that removing dynamic sampling drops performance by 15-20%.
The Intuition
Think of dynamic sampling as selecting exercises at the right difficulty level. If you always solve problems that are too easy (all-success), you don't learn anything new. If you always face problems that are too hard (all-fail), you don't learn either. You learn the most when problems are at the boundary between possible and impossible — and that's exactly what dynamic sampling ensures.
Temperature: Balancing Exploration and Exploitation
SimpleVLA-RL uses temperature = 1.6 when generating trajectories (rollout), and temperature = 0 (greedy) during inference.
Why 1.6?
Temperature 1.0 is the "default" — the probability distribution stays as-is from the model. Temperature > 1.0 flattens the distribution, increasing the probability of less common tokens. At temperature 1.6:
- The most common token (e.g., "move downward") still has the highest probability
- But less common tokens (e.g., "push left") have significantly higher probability compared to temperature 1.0
- Result: the model tries many different strategies instead of repeating the same approach
Temperature 1.6 is the sweet spot — high enough to explore, but not so high that actions become completely random (temperature > 2.0 is usually too noisy).
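A toy softmax makes the flattening effect concrete. The logits below are made up, but the mechanism — dividing logits by the temperature before the softmax — is the standard one:

```python
import numpy as np

def token_probs(logits, temperature):
    """Softmax over logits scaled by temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy logits: one dominant action token and two rarer alternatives
logits = [3.0, 1.0, 0.5]
p1 = token_probs(logits, 1.0)
p16 = token_probs(logits, 1.6)
print(p1.round(3))   # sharply peaked on the dominant token
print(p16.round(3))  # flatter: rare tokens get sampled far more often
```

Note that the dominant token keeps the highest probability at temperature 1.6 — exploration is boosted without making the policy random.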
Two Operating Modes
| Phase | Temperature | Purpose |
|---|---|---|
| Training (rollout) | 1.6 | Maximum exploration, collecting diverse trajectories |
| Inference (deploy) | 0 (greedy) | Maximum exploitation, selecting the best action |
This separation matters: training and inference have different objectives. Training needs diversity to discover new strategies. Inference needs consistency to achieve the highest success rate.
Binary Reward: Simple but Sufficient
Reward Applied to the Entire Trajectory
An important detail: the reward R = 1 or R = 0 is assigned to all tokens in the trajectory, not just the last token. This means:
# If trajectory succeeds
token_rewards = [1, 1, 1, 1, ..., 1] # All tokens receive R=1
# If trajectory fails
token_rewards = [0, 0, 0, 0, ..., 0] # All tokens receive R=0
Why not assign reward to individual tokens? Because we don't know which token matters. In a trajectory of 2048 tokens, token #500 (the decision to push instead of grasp) might be the determining factor, but we have no way of knowing. By assigning the same reward to all tokens, GRPO will automatically figure out which tokens' probabilities to increase or decrease through group-relative advantages.
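A REINFORCE-style sketch shows how a single scalar advantage weights every token's log-probability. The shapes and values are illustrative, and the advantage values are the normalized ones from Step 3:

```python
import numpy as np

def trajectory_loss(token_logps, advantage):
    """One scalar advantage weights every token's log-probability.

    Minimizing this pushes all token probabilities up (A > 0) or
    down (A < 0); group-relative comparison across the 8 trajectories
    is what sorts out which tokens actually mattered.
    """
    return -advantage * np.sum(token_logps)

rng = np.random.default_rng(0)
logps = np.log(rng.uniform(0.1, 1.0, size=2048))  # 2048 tokens (LIBERO)
loss_success = trajectory_loss(logps, advantage=+0.935)
loss_fail = trajectory_loss(logps, advantage=-0.935)
print(loss_success, loss_fail)  # opposite signs: reinforce vs suppress
```

Tokens that only appear in successful trajectories get consistently reinforced across batches, while tokens shared by successes and failures receive conflicting updates that average out — that is the mechanism behind "GRPO figures it out."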
Comparison with Dense Reward
| Feature | Binary Reward | Dense Reward |
|---|---|---|
| Design effort | Automatic (task success detector) | Requires expert design per task |
| Reward hacking | Very difficult to exploit | Easily exploitable |
| Gradient signal | Weak (needs dynamic sampling) | Strong |
| Generalization | High (task-agnostic) | Low (task-specific) |
| Scalability | Excellent | Poor (must redesign per task) |
Binary reward trades off weaker gradient signal for scalability and robustness — and combined with dynamic sampling, this drawback is effectively mitigated.
Data Flow: End to End
Let's trace a complete training loop:
1. Query Generation: Select a task (e.g., "pick up the red cup") and a random initial state in simulation.
2. Rollout: VLA model generates 8 trajectories in parallel, each consisting of:
- Input: camera image + task instruction
- Output: action token sequence (2048 for LIBERO)
- Temperature: 1.6
3. Execution: Each trajectory is executed in the simulation environment (LIBERO/RoboTwin).
4. Reward Assignment: Check task completion, assign R = 1 or 0.
5. Dynamic Sampling Check: If all rewards are identical, discard the batch and return to step 1. If there's a mix of 0s and 1s, continue.
6. Advantage Computation: Normalize rewards within the group to get advantages.
7. Policy Update: GRPO update with asymmetric clipping.
8. Repeat: Return to step 1 with the updated policy.
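The eight steps can be condensed into one function. The `generate`, `evaluate`, and `update` callables below are hypothetical stand-ins for the model rollout, the simulator check, and the GRPO optimizer step:

```python
import numpy as np

def grpo_training_step(generate, evaluate, update, group_size=8):
    """One iteration of steps 1-8 (interfaces are illustrative stand-ins)."""
    # Steps 1-5: resample until the dynamic-sampling check passes
    while True:
        trajs = [generate() for _ in range(group_size)]           # 1-2. rollout
        rewards = np.array([evaluate(t) for t in trajs], float)   # 3-4. execute, reward
        if rewards.min() != rewards.max():                        # 5. not uniform
            break
    advantages = (rewards - rewards.mean()) / rewards.std()       # 6. group-relative
    update(trajs, advantages)                                     # 7. asymmetric-clip update
    return rewards, advantages                                    # 8. caller repeats

# Toy usage: fake token sequences, ~50% success, no-op update
rng = np.random.default_rng(1)
rewards, advs = grpo_training_step(
    generate=lambda: rng.integers(0, 256, size=16),  # fake action tokens
    evaluate=lambda t: float(t[0] % 2),              # fake binary success check
    update=lambda trajs, advantages: None,
)
print(rewards, advs.round(3))
```

After normalization the advantages always have zero mean and unit standard deviation within the group, which is what keeps the update scale stable across batches with different success rates.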
The entire process runs on 8 NVIDIA A800 80GB GPUs, using the veRL framework (v0.2) to parallelize rollout and training.
GRPO vs PPO Comparison
| Feature | PPO | GRPO |
|---|---|---|
| Critic network | Required (extra parameters, training) | Not needed |
| KL regularization | Commonly used | Not needed |
| Advantage estimation | GAE (needs value function) | Group-relative (compare within batch) |
| Memory footprint | High (policy + critic + ref model) | Low (policy only) |
| Hyperparameters | Many (GAE lambda, critic LR, KL beta, ...) | Few (eps_low, eps_high, temperature) |
| Implementation complexity | High | Medium |
GRPO trades accuracy of advantage estimation (PPO's learned critic is more precise) for simplicity and efficiency. With a 7B parameter VLA model, removing the critic network saves approximately 14GB VRAM — enough to increase batch size or reduce the number of GPUs needed.
Key Hyperparameters
Here's a summary of SimpleVLA-RL's key hyperparameters:
| Parameter | Value | Meaning |
|---|---|---|
| Learning rate | 5e-6 | Low to prevent catastrophic forgetting |
| Batch size | 64 | Queries per iteration |
| Samples per query | 8 | Parallel trajectories |
| eps_low (clip) | 0.2 | Lower bound for probability decrease |
| eps_high (clip) | 0.28 | Upper bound for probability increase |
| Temperature (rollout) | 1.6 | Exploration level |
| Temperature (inference) | 0.0 | Greedy, no exploration |
| KL coefficient | 0.0 | KL not used |
| Action tokens per step | 256 | Discretized 7-DoF action |
| Chunk size (LIBERO) | 8 | Action steps per prediction |
| Chunk size (RoboTwin) | 25 | More steps for dual-arm |
| GPUs | 8x A800 80GB | Training hardware |
The learning rate of 5e-6 is noteworthy — much lower than typical SFT (1e-4). The reason: RL training can easily cause catastrophic forgetting — the model forgets SFT knowledge if updates are too aggressive. A low learning rate ensures the model improves gradually without destroying its foundation.
Setup: How to Reproduce
If you want to reproduce SimpleVLA-RL results, here's the basic setup:
# Environment
conda create -n simplevla python=3.10
conda activate simplevla
# Core dependencies
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn==2.5.8
pip install verl==0.2
# Clone repos
git clone https://github.com/PRIME-RL/SimpleVLA-RL
git clone https://github.com/moojink/openvla-oft
# Install
cd SimpleVLA-RL && pip install -e .
Minimum hardware requirements: 8x A800/H100 80GB GPUs. This is the biggest barrier — not every lab has access to an 8-GPU high-end cluster. However, the community is working on LoRA + gradient checkpointing versions to run on 2-4 GPUs.
To learn more about setting up LeRobot for similar experiments, check out our AI for Robotics series.
Architectural Lessons
Looking at SimpleVLA-RL's design holistically, several important lessons emerge:
1. Token-based > Regression for RL
Switching from regression to token-based actions isn't just a technical detail — it unlocks the ability to apply RL. This is a textbook example of representation determines algorithm: choosing the right data representation matters more than choosing the right algorithm.
2. Simple > Complex
No KL, no critic, no dense reward — each omission has a clear rationale, and ablation studies confirm that removing them doesn't hurt performance. The lesson: always start with the simplest solution, only adding complexity when there's evidence it's needed.
3. RL Tailored for VLA
Unlike RL for games (dense rewards, short episodes), RL for robot manipulation has unique characteristics: sparse rewards, long trajectories, continuous state spaces. SimpleVLA-RL shows that rather than forcing traditional RL frameworks onto robotics, adapting the framework to fit (binary reward + dynamic sampling + asymmetric clipping) produces better results.
Limitations and Future Directions
Despite its impressive results, SimpleVLA-RL has limitations worth acknowledging:
- Large hardware requirements: 8x A800 GPUs aren't available in every lab. More research is needed on parameter-efficient RL (LoRA, QLoRA).
- Simulation dependency: RL requires thousands of rollouts, which is only feasible in simulation. Real-world RL remains challenging due to the high cost per rollout.
- Sim-to-real gap: The real-world improvement (120%) is smaller than the simulation improvement (430%), indicating that sim-to-real transfer remains a bottleneck.
- Task-specific training: Currently, each task needs separate RL training. Multi-task RL for VLA remains an open question — can a single RL training run improve multiple tasks simultaneously?
Conclusion
SimpleVLA-RL is a testament to the "less is more" principle in machine learning. By removing unnecessary components (KL, critic, dense reward) and adding simple but effective techniques (dynamic sampling, asymmetric clipping), the authors created a system that outperforms far more complex methods.
If you're researching VLA models or want to apply RL to robot manipulation, SimpleVLA-RL is an ideal starting point: open-source code, clear methodology, and reproducible results. Combined with models like pi-zero (pi0) and the broader Embodied AI 2026 landscape, we're witnessing the transition from "robots learning from humans" to "robots learning from experience" — and SimpleVLA-RL is a crucial step on that journey.