Have you ever fine-tuned a VLA model with RL only to watch its success rate crater in the first few epochs, never fully recovering? Or found yourself manually intervening hundreds of times to steer the policy away from dead ends?
That's precisely the problem that Shuyi Zhang and co-authors target in FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation (arXiv, June 2026). Their three-stage framework delivers a 79% absolute success rate improvement, beats prior RL methods by 10%, and cuts the required training samples by 32.5% — all without a single human intervention.
The Problem: The Imitation Ceiling and Why RL Fine-Tuning Is Hard
VLA models trained via imitation learning (behavior cloning from demonstrations) are inherently bounded by an imitation ceiling: when your training data is suboptimal, the model can never outperform what that data represents. Reinforcement learning is the natural escape hatch — let the policy explore and learn from environment feedback.
But in practice, RL fine-tuning of VLA models hits two painful failure modes:
Failure 1 (Catastrophic Initial Unlearning): When RL begins, the Q-function (which evaluates action quality) is poorly initialized. It systematically overestimates Q-values for out-of-distribution actions that the policy has never actually taken. The policy updates toward these phantom high-value regions, forgetting what it learned during supervised fine-tuning. Result: a sharp early success rate drop that often never recovers.
Failure 2 (Low-Quality Exploration Data): Weak policy → poor rollout data → updates from poor data → policy stays weak. This vicious cycle makes RL fine-tuning extremely sample inefficient. Many systems rely on human-in-the-loop (HIL) intervention, where humans physically correct the robot mid-rollout, which is prohibitively expensive to scale.
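The overestimation behind Failure 1 can be reproduced without any robot. The toy sketch below is an illustration, not FORCE's code: it assumes every action has a true Q-value of 0 and that the critic's estimates carry independent Gaussian noise. Picking the highest-scoring action then reports a value well above 0, and the bias grows with the number of actions being compared. The function names and noise model are made up for this example.

```python
import random


def max_q_estimate(num_actions, noise_std, rng):
    """Largest of `num_actions` noisy estimates of a true Q-value of 0."""
    return max(rng.gauss(0.0, noise_std) for _ in range(num_actions))


def average_overestimate(num_actions=20, noise_std=1.0, trials=2000, seed=0):
    """Average value reported by the greedy max when every true Q-value is 0."""
    rng = random.Random(seed)
    total = sum(max_q_estimate(num_actions, noise_std, rng) for _ in range(trials))
    return total / trials


# With one action the estimate is unbiased; with many, the max is biased upward.
print(f"1 action:   {average_overestimate(num_actions=1):+.2f}")
print(f"20 actions: {average_overestimate(num_actions=20):+.2f}")
```

A policy that follows this maximum chases noise rather than real value, which is the mechanism the conservative and calibration terms in Stage 1 are meant to counter.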
What Is FORCE?
FORCE stands for Fine-tuning with vAlue-calibRated warm-up and sElf-distillation. It is a three-stage framework that addresses both failures sequentially:
- Stabilize the Q-function offline using Cal-QL
- Warm up the Q-function with on-policy data to eliminate distributional shift
- Fine-tune the policy online using Q-function-filtered self-distillation
Source: Zhang et al., arXiv 2606.26006
Architecture Deep Dive: The 3 Stages of FORCE
Stage 1: Offline Cal-QL Pretraining
Before any rollout data is collected, FORCE trains the Q-function exclusively on expert demonstration data using Calibrated Q-Learning (Cal-QL).
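Cal-QL extends Conservative Q-Learning (CQL). CQL's conservative penalty prevents Q-value overestimation for actions outside the demonstration distribution, and Cal-QL adds a calibration term that keeps those conservative estimates from drifting too far below the true returns. The training objective: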
L_CalQL = L_TD + α × L_CQL + β × L_calibration
Where:
- L_TD = Bellman temporal difference error (standard Q-learning)
- L_CQL = conservative penalty (suppresses Q-values for out-of-distribution actions)
- L_calibration = additional term to align estimated and true Q-values
The result is a Q-function that is appropriately skeptical about unexplored state-action regions and won't be fooled by the policy venturing into them.
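To make the role of calibration concrete, here is a toy version of the calibrated conservative penalty in the form used by the original Cal-QL formulation, where out-of-distribution Q-values are clipped from below by a reference value before being penalized. The function name, the numbers, and the use of a single scalar reference value are illustrative assumptions, not taken from the FORCE paper.

```python
def calibrated_penalty(q_ood, q_data, v_ref):
    """Mean of reference-clipped OOD Q-values minus mean Q-value on dataset actions."""
    clipped = [max(q, v_ref) for q in q_ood]
    return sum(clipped) / len(clipped) - sum(q_data) / len(q_data)


q_data = [5.0, 6.0]  # Q-values of actions that appear in the demonstrations
v_ref = 4.0          # reference value, e.g. the return observed in the demonstrations

# OOD estimates above the reference are penalized: lowering them lowers the loss.
print(calibrated_penalty([9.0, 8.0], q_data, v_ref))      # 3.0
print(calibrated_penalty([7.0, 6.0], q_data, v_ref))      # 1.0

# Below the reference, pushing OOD estimates lower changes nothing, so the
# critic gains nothing by becoming arbitrarily pessimistic.
print(calibrated_penalty([3.0, 2.0], q_data, v_ref))      # -1.5
print(calibrated_penalty([-50.0, -90.0], q_data, v_ref))  # -1.5
```

The flat region below `v_ref` is what separates calibrated from plain conservative training: the penalty stops pulling once the estimates are already as low as the reference.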
Stage 1 hyperparameters:
offline_lr: 3.0e-4 # Adam optimizer
batch_size: 256
gamma: 0.99 # discount factor
tau: 0.005 # EMA coefficient for target network
cql_alpha: 0.1 # CQL regularization strength
critic_policy_ratio: 2 # 2 critic updates per policy update
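Two of these settings are easier to read as code. The sketch below shows what `tau: 0.005` means for the target network and one plausible reading of `critic_policy_ratio: 2`. The helper names are hypothetical, and the exact update schedule is an assumption rather than something the paper spells out here.

```python
import math


def soft_update(target, online, tau=0.005):
    """EMA update: move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def steps_to_halve_gap(tau=0.005):
    """Updates needed for the target to close half its gap to a fixed online network."""
    return math.log(0.5) / math.log(1.0 - tau)


def is_policy_step(critic_step, critic_policy_ratio=2):
    """Assumed schedule: update the policy after every `critic_policy_ratio` critic updates."""
    return critic_step % critic_policy_ratio == 0


print(soft_update([0.0, 0.0], [1.0, -2.0]))             # one step moves 0.5% of the way
print(round(steps_to_halve_gap()))                      # on the order of 140 updates
print([s for s in range(1, 7) if is_policy_step(s)])    # [2, 4, 6]
```

A small `tau` keeps the bootstrapping target slow-moving, which trades learning speed for stability.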
Stage 2: Value-Calibrated Warm-Up
This is FORCE's core insight. Even a well-trained offline Q-function suffers from distributional shift: it was trained on the behavior policy's (expert's) data distribution, but the RL policy will visit a different distribution.
FORCE collects on-policy rollouts — runs the current policy in the environment and logs the trajectories — then merges this data with the offline demonstrations to create a mixed dataset.
# Warm-up: collect on-policy experience
on_policy_data = []
for _ in range(warm_up_episodes):
    trajectory = run_episode(policy=vla_policy, env=robot_env)
    on_policy_data.extend(trajectory)

# Merge with offline demonstrations
mixed_dataset = concat([demo_dataset, on_policy_data])

# Continue Q-function training on the expanded dataset
for batch in mixed_dataset:
    q_loss = cal_ql_loss(q_function, batch)
    q_function.update(q_loss)
Exposing the Q-function to what the policy actually does in the environment lets it correctly evaluate the state-action regions the policy will visit during RL. That closes the distributional blind spots responsible for catastrophic early unlearning.
Stage 3: Online Fine-Tuning with VGPD
Now the policy itself is fine-tuned via RL. This is where Value-Guided Policy Self-Distillation (VGPD) takes over.
How VGPD works: Instead of learning from all rollout data (including low-quality trajectories), VGPD uses the calibrated Q-function as an intelligent filter:
import torch

def vgpd_update(policy, q_function, state, K=20, tau=0.1):
    # Step 1: Sample K action candidates from the policy
    candidates = [policy.sample(state) for _ in range(K)]
    # Step 2: Compute Q-values for all candidates (no gradient flows into the critic;
    # q_function is assumed to return a single-element tensor for one state-action pair)
    with torch.no_grad():
        q_values = torch.stack([q_function(state, a).reshape(()) for a in candidates])
    # Step 3: Dynamic baseline = mean Q-value across candidates
    v_ref = q_values.mean()
    # Step 4: Filter, keeping only actions strictly above the baseline
    good_mask = q_values > v_ref
    if not good_mask.any():
        return None  # all candidates scored the same, so there is nothing to distill
    actions = [a for a, keep in zip(candidates, good_mask.tolist()) if keep]
    # Step 5: Exponential weighting by Q-value (tau is the softmax temperature here,
    # not the target-network EMA coefficient from Stage 1)
    weights = torch.softmax(q_values[good_mask] / tau, dim=0)
    # Step 6: Weighted imitation learning on the policy's own best actions
    loss = sum(-w * policy.log_prob(state, a) for a, w in zip(actions, weights))
    return loss
This creates an automatic curriculum: when the policy is weak, the baseline V_ref is low, many candidates pass the filter, and the policy learns from many examples (close to behavioral cloning). As the policy improves, the baseline rises, the filter becomes stricter, and only the best actions guide further updates.
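A worked example with concrete numbers helps show what the filter and the temperature do. The sketch below applies the mean baseline and softmax weighting to five made-up Q-values using only the standard library. The helper name and the Q-values are illustrative assumptions.

```python
import math


def vgpd_weights(q_values, tau):
    """Return (kept_indices, weights) for candidates whose Q-value beats the mean baseline."""
    v_ref = sum(q_values) / len(q_values)
    kept = [i for i, q in enumerate(q_values) if q > v_ref]
    if not kept:
        return [], []
    # Softmax over the survivors; subtracting the max keeps exp() numerically stable.
    top = max(q_values[i] for i in kept)
    exps = [math.exp((q_values[i] - top) / tau) for i in kept]
    total = sum(exps)
    return kept, [e / total for e in exps]


q_values = [0.0, 1.0, 2.0, 3.0, 4.0]      # baseline is the mean, 2.0
print(vgpd_weights(q_values, tau=1.0))    # candidates 3 and 4 survive, weights about 0.27 / 0.73
print(vgpd_weights(q_values, tau=0.1))    # same survivors, almost all weight on candidate 4
```

With a low temperature such as 0.1, the update is dominated by the single best surviving candidate, so the temperature controls how greedy the distillation target is.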
Expert buffer + Policy buffer: FORCE maintains two separate replay buffers during Stage 3:
- Expert Buffer: the original expert demonstrations
- Policy Buffer: rollouts collected from the current policy
These are sampled 50/50, and VGPD filtering is applied to both. This prevents catastrophic forgetting of demonstration knowledge while still incorporating newly explored behaviors.
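The 50/50 sampling can be sketched in a few lines. This is a minimal illustration that assumes both buffers are plain Python lists of transitions and that sampling is done with replacement, so a small policy buffer early in training still fills its half of the batch. `sample_mixed_batch` is a hypothetical helper, not an API from the paper.

```python
import random


def sample_mixed_batch(expert_buffer, policy_buffer, batch_size, rng=random):
    """Draw half of the batch from each buffer, sampling with replacement."""
    half = batch_size // 2
    expert_part = rng.choices(expert_buffer, k=half)
    policy_part = rng.choices(policy_buffer, k=batch_size - half)
    batch = expert_part + policy_part
    rng.shuffle(batch)  # avoid a fixed expert-then-policy ordering inside the batch
    return batch


expert_buffer = [("expert", i) for i in range(100)]
policy_buffer = [("policy", i) for i in range(10)]  # still small early in training
batch = sample_mixed_batch(expert_buffer, policy_buffer, batch_size=256, rng=random.Random(0))
n_expert = sum(1 for source, _ in batch if source == "expert")
print(n_expert, "expert transitions out of", len(batch))
```

Fixing the ratio per batch, rather than concatenating the buffers, keeps the demonstrations from being diluted as the policy buffer grows.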

Practical Setup (Implementing from the Paper)
As of July 2026, the paper does not include a public GitHub repository. Here is how to set up the environment and implement FORCE following the paper's algorithm specification.
System Requirements
# Create conda environment
conda create -n force_vla python=3.10
conda activate force_vla
# Core dependencies
pip install torch==2.3.0 torchvision
pip install "transformers>=4.40.0"  # quoted so the shell does not treat >= as a redirect
pip install gymnasium mujoco
# ManiSkill (used in paper benchmarks)
pip install mani-skill2
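# Or LIBERO for tabletop manipulation (installed from source)
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
(cd LIBERO && pip install -r requirements.txt && pip install -e .)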
GPU requirements:
- Minimum: A40 (48GB VRAM) for VGPD with K=20 candidates
- Recommended: A100 80GB or H100 80GB
- Training time: ~18k environment steps for StackCube (a few hours on A100)
Base VLA Models
FORCE was tested with three models:
- Octo — open-source generalist robot policy, easiest to download and fine-tune
- π₀ (Pi Zero) — flow-based VLA from Physical Intelligence
- π₀.₅ — improved variant of π₀
For Octo (most accessible):
# Clone and install Octo
git clone https://github.com/octo-models/octo.git
cd octo && pip install -e .
# Download pretrained checkpoint
python -c "
from octo.model.octo_model import OctoModel
model = OctoModel.load_pretrained('hf://rail-berkeley/octo-base')
print('Model loaded successfully')
"
Stage 1: Implement Cal-QL
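import torch
import torch.nn as nn
import torch.nn.functional as F


class CalQLCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs, action):
        # Returns Q(s, a) without the trailing singleton dimension
        return self.net(torch.cat([obs, action], dim=-1)).squeeze(-1)


class CalQL:
    def __init__(self, critic, target_critic, gamma=0.99, alpha=0.1, tau=0.005, num_ood=10):
        self.critic = critic
        self.target = target_critic
        self.gamma = gamma
        self.alpha = alpha
        self.tau = tau
        self.num_ood = num_ood  # out-of-distribution actions sampled per state

    def compute_loss(self, obs, actions, rewards, next_obs, next_actions, dones, mc_returns):
        # rewards, dones, mc_returns have shape (batch,).
        # next_actions: actions proposed by the current policy at next_obs.
        # mc_returns: discounted returns of the dataset trajectories, used as
        # reference values for calibration.
        with torch.no_grad():
            # TD target using the target network, evaluated at the NEXT action
            target_q = rewards + (1 - dones) * self.gamma * self.target(next_obs, next_actions)
        # In-distribution Q-values
        current_q = self.critic(obs, actions)
        td_loss = F.mse_loss(current_q, target_q)
        # CQL conservative regularization over num_ood random actions per state.
        # Simplification: uniform samples, assuming actions are normalized to [-1, 1].
        batch = obs.shape[0]
        obs_rep = obs.unsqueeze(1).expand(-1, self.num_ood, -1)
        ood_actions = torch.rand(batch, self.num_ood, actions.shape[-1], device=actions.device) * 2 - 1
        ood_q = self.critic(obs_rep, ood_actions)  # (batch, num_ood)
        # Calibration: never push OOD Q-values below the reference value
        ood_q = torch.maximum(ood_q, mc_returns.unsqueeze(1))
        cql_loss = (torch.logsumexp(ood_q, dim=1) - current_q).mean()
        return td_loss + self.alpha * cql_loss

    def update_target(self):
        # EMA (Polyak) update of the target network
        with torch.no_grad():
            for p, tp in zip(self.critic.parameters(), self.target.parameters()):
                tp.copy_(self.tau * p + (1 - self.tau) * tp)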
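Cal-QL's calibration compares Q-values against a reference value for each state. In the original Cal-QL formulation that reference is the Monte Carlo return of the dataset trajectory passing through the state. The sketch below computes those returns for one episode. The function name is hypothetical, and the sparse success reward in the example is an assumption about how demonstrations are labeled.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Sparse-reward episode: success signalled only on the final step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

These per-step returns would be stored alongside each demonstration transition so that they are available whenever a batch is sampled for the critic update.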
Monitoring Training Progress
Track these metrics to know if training is healthy:
training_metrics = {
    'q_value_mean': [],         # Should increase steadily
    'q_value_std': [],          # Should decrease after warm-up
    'success_rate': [],         # Target: > 80%
    'episode_length': [],       # Should decrease (more efficient behavior)
    'vgpd_pass_rate': [],       # % actions that pass VGPD filter (should be 30-70%)
    'policy_buffer_ratio': [],  # Ratio of policy vs expert data sampled
}
A healthy FORCE training run shows:
- Q-values stabilizing during Stage 1 (no divergence)
- Q-value std dropping during Stage 2 (distributional shift resolved)
- Success rate climbing steadily in Stage 3 (no early catastrophic drop)
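These checks are simple enough to automate. The sketch below flags an early success-rate collapse and verifies that the VGPD pass rate stays in the 30-70% band mentioned above. The function names, the three-evaluation window, and the 10-point tolerance are assumptions chosen for illustration, as are the example curves.

```python
def early_drop(success_rates, window=3, tolerance=0.10):
    """True if any of the first `window` evaluations after the start falls more
    than `tolerance` below the starting success rate."""
    if not success_rates:
        return False
    start = success_rates[0]
    return any(start - rate > tolerance for rate in success_rates[1:window + 1])


def pass_rate_in_band(pass_rates, low=0.30, high=0.70):
    """True if every logged VGPD pass rate lies inside [low, high]."""
    return all(low <= rate <= high for rate in pass_rates)


healthy = [0.45, 0.47, 0.52, 0.60, 0.71]      # made-up curve that climbs steadily
collapsed = [0.45, 0.20, 0.15, 0.25, 0.30]    # made-up curve with an early crash

print(early_drop(healthy), early_drop(collapsed))   # False True
print(pass_rate_in_band([0.55, 0.48, 0.41]))        # True
```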
Experimental Results
ManiSkill Simulation — Comparison with Baselines
| Method | Backbone | Success Rate | Needs Human? |
|---|---|---|---|
| BC (Behavior Cloning) | Octo | 45% | No |
| SFT | π₀ | ~60% | No |
| ConRFT (no HIL) | π₀ | ~73% | No |
| ConRFT (with HIL) | π₀ | ~76% | Yes |
| FORCE | Octo | 82.3% | No |
| FORCE | π₀ | 86.9% | No |
In this table, FORCE with the π₀ backbone reaches 86.9%, well above the 45% BC baseline, and outperforms the best HIL-free RL competitor (ConRFT without HIL, ~73%) by over 10 percentage points.
Real-World Franka Robot Tasks

| Method | Success Rate | Avg. Execution Steps | Relative Efficiency |
|---|---|---|---|
| BC baseline | 45% | 112.8 | 1× |
| FORCE | 98.3% | 38.9 | ~3× |
FORCE doesn't just succeed more often; it also completes tasks in roughly a third of the steps (38.9 vs. 112.8). The policy has learned to act decisively rather than tentatively.
Ablation: How Much Do Warm-Up and VGPD Each Contribute?
Metric: Steps@80% (environment steps needed to reach 80% success rate):
| Task | Full FORCE | No Warm-Up | No VGPD |
|---|---|---|---|
| StackCube | 18k | 28k | ~24k |
| PickCube | 12k | ~20k | 20k |
| PushCube | 4k | 10k | ~8k |
Both components contribute meaningfully. Removing warm-up raises the sample cost by roughly 56–150% (about 1.6× to 2.5×); removing VGPD raises it by roughly 33–100%. In every task, dropping either component costs more samples than the full method, so the two are complementary.
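The relative overheads can be recomputed directly from the table. The snippet below does the arithmetic, treating the approximate entries (those marked with ~) as exact values, which is an assumption.

```python
steps_at_80 = {
    # task: (full FORCE, no warm-up, no VGPD), in thousands of environment steps
    "StackCube": (18, 28, 24),
    "PickCube": (12, 20, 20),
    "PushCube": (4, 10, 8),
}


def extra_cost(full, ablated):
    """Percentage increase in steps relative to the full method."""
    return 100.0 * (ablated - full) / full


for task, (full, no_warmup, no_vgpd) in steps_at_80.items():
    print(f"{task}: no warm-up +{extra_cost(full, no_warmup):.0f}%, "
          f"no VGPD +{extra_cost(full, no_vgpd):.0f}%")
```

The gap is largest on PushCube, the task that the full method solves fastest, so the relative penalty of removing a component is not uniform across tasks.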
Comparison with Other VLA RL Fine-Tuning Methods
| Method | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| SimpleVLA-RL | GRPO online RL | Simple to implement | High sample count |
| EXPO-FT π₀.₅ | Online RL in 19 min | Extremely fast | Narrow task scope |
| ProcVLM | Dense reward shaping | Rich reward signal | Requires reward engineering |
| FORCE | Cal-QL + VGPD | Sample efficient, no HIL | 3-stage complexity |
FORCE's main distinguishing factor is the combination of Q-function calibration before online training begins with a self-distillation mechanism that avoids learning from the policy's own mistakes. Most competing methods pick one or the other.
Strengths and Limitations
Strengths
- No human intervention: Essential for scaling to fleets of robots in production
- Architecture-agnostic: Tested with Octo and π₀, which suggests the recipe is not tied to a single VLA backbone
- Automatic curriculum via VGPD: The filtering threshold rises as the policy improves, avoiding both under- and over-challenge
- Sample efficiency: 32.5% fewer training samples than the best competing method (ConRFT)
Limitations
- Three-stage complexity: More hyperparameters to tune than simple SFT or single-stage RL
- Q-function overhead: Must train and maintain a separate Q-function network in parallel with the policy
- No public code: Paper does not include a GitHub repo as of July 2026 — implementation requires effort
- K=20 forward passes per VGPD step: Memory-intensive; may require gradient checkpointing on smaller GPUs
- Real-world evaluation covers a limited set of Franka manipulation tasks — generalization to more complex, contact-rich scenarios is not yet demonstrated
Conclusion
FORCE represents a principled answer to one of the field's most persistent problems: how do you apply RL to VLA models without either destroying what the model already knows or getting trapped in a loop of bad data generating bad updates?
The answer is to calibrate first, then distill from your own best actions, and it is both elegant and effective in the reported experiments. The headline results are a 79% absolute success-rate gain and roughly 3× fewer execution steps on the real robot. Crucially, eliminating human intervention makes the method plausible to apply at scale.
The three-stage pipeline adds implementation complexity, and the absence of a public codebase raises the bar for adoption. But for teams working on production VLA deployment where sample cost and human labor are primary constraints, FORCE sets a new benchmark for what RL fine-tuning can achieve.
Paper: FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation — Zhang et al., arXiv 2606.26006, June 2026



