FORCE: 79% Success Rate Boost for VLA RL Fine-Tuning

Have you ever fine-tuned a VLA model with RL only to watch its success rate crater in the first few epochs, never fully recovering? Or found yourself manually intervening hundreds of times to steer the policy away from dead ends?

That's precisely the problem that Shuyi Zhang and co-authors target in FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation (arXiv, June 2026). Their three-stage framework reports a 79% absolute success rate improvement, beats prior RL methods by more than 10 percentage points, and cuts the required training samples by 32.5%, all without a single human intervention.

The Problem: The Imitation Ceiling and Why RL Fine-Tuning Is Hard

VLA models trained via imitation learning (behavior cloning from demonstrations) are inherently bounded by an imitation ceiling: when your training data is suboptimal, the model can never outperform what that data represents. Reinforcement learning is the natural escape hatch — let the policy explore and learn from environment feedback.

But RL fine-tuning of VLA models hits two painful failure modes in practice:

Failure 1: Catastrophic Initial Unlearning. When RL begins, the Q-function (which evaluates action quality) is poorly initialized. It systematically overestimates Q-values for out-of-distribution actions that the policy has never actually taken. The policy updates toward these phantom high-value regions and forgets what it learned during supervised fine-tuning. The result is a sharp early drop in success rate from which training often never recovers.

Failure 2: Low-Quality Exploration Data. Weak policy → poor rollout data → updates from poor data → policy stays weak. This vicious cycle makes RL fine-tuning extremely sample inefficient. Many systems rely on human-in-the-loop (HIL) intervention, where humans physically correct the robot mid-rollout, which is prohibitively expensive to scale.
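The overestimation behind Failure 1 can be illustrated with a toy experiment. The sketch below is not from the paper: it assumes every action has the same true value and that the Q-function's error is zero-mean Gaussian noise, then measures what a greedy policy "sees" when it picks the highest estimate. The function name and all numbers are illustrative.

```python
import random

def max_of_noisy_estimates(true_q, noise_std, num_actions, trials, seed=0):
    """Average of max_a Q_hat(a) when every action has the same true value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value plus independent zero-mean noise
        estimates = [true_q + rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials

true_q = 1.0
avg_max = max_of_noisy_estimates(true_q, noise_std=0.5, num_actions=20, trials=2000)
print(f"true value: {true_q:.2f}, average greedy estimate: {avg_max:.2f}")
```

Even though the noise is unbiased for each individual action, taking the maximum selects the largest error, so the greedy estimate sits well above the true value. A policy that chases those estimates is moving toward actions that only look good.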

What Is FORCE?

FORCE stands for Fine-tuning with vAlue-calibRated warm-up and sElf-distillation. It is a three-stage framework that addresses both failures sequentially:

Stabilize the Q-function offline using Cal-QL
Warm up the Q-function with on-policy data to eliminate distributional shift
Fine-tune the policy online using Q-function-filtered self-distillation

Source: Zhang et al., arXiv 2606.26006

Architecture Deep Dive: The 3 Stages of FORCE

Stage 1: Offline Cal-QL Pretraining

Before any rollout data is collected, FORCE trains the Q-function exclusively on expert demonstration data using Calibrated Q-Learning (Cal-QL).

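Cal-QL extends Conservative Q-Learning (CQL). CQL's conservative penalty suppresses Q-values for actions outside the demonstration distribution, which prevents overestimation. Cal-QL adds calibration on top: it keeps those conservative estimates from falling below the value of a reference policy, so the learned Q-values are on a sensible scale when online fine-tuning starts. The training objective: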

L_CalQL = L_TD + α × L_CQL + β × L_calibration

Where:

L_TD = Bellman temporal difference error (standard Q-learning)
L_CQL = conservative penalty (suppresses Q-values for out-of-distribution actions)
L_calibration = additional term to align estimated and true Q-values

The result is a Q-function that is appropriately skeptical about unexplored state-action regions and won't be fooled by the policy venturing into them.
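For reference, the original Cal-QL paper (Nakamoto et al., 2023) does not add calibration as a separate loss term. It clips the Q-values inside the conservative penalty at a reference value V^μ(s), the value of a reference policy μ such as the one that produced the demonstrations. A sketch of that objective follows, with π the current policy, D the dataset, Q̄ the target network, and B^π the Bellman backup. Whether FORCE uses exactly this form is an assumption here.

```latex
\mathcal{L}_{\text{Cal-QL}}(\theta)
  = \alpha \Big(
      \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}
        \big[ \max\big( Q_\theta(s,a),\, V^{\mu}(s) \big) \big]
      - \mathbb{E}_{(s,a) \sim \mathcal{D}} \big[ Q_\theta(s,a) \big]
    \Big)
  + \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
      \Big[ \big( Q_\theta(s,a) - \mathcal{B}^{\pi} \bar{Q}(s,a) \big)^{2} \Big]
```

The max means the penalty stops pushing a Q-value down once it reaches V^μ(s), which is what keeps the critic conservative without making it pessimistic beyond the reference policy's value.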

Stage 1 hyperparameters:

offline_lr: 3.0e-4        # Adam optimizer
batch_size: 256
gamma: 0.99               # discount factor
tau: 0.005                # EMA coefficient for target network
cql_alpha: 0.1            # CQL regularization strength
critic_policy_ratio: 2    # 2 critic updates per policy update
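Two of these values are easier to interpret after a little arithmetic. The sketch below uses only the standard formulas for a discount factor and an exponential moving average; the variable names are illustrative.

```python
import math

gamma = 0.99   # discount factor from the list above
tau = 0.005    # target-network EMA coefficient from the list above

# Effective horizon implied by the discount factor: 1 / (1 - gamma)
effective_horizon = 1.0 / (1.0 - gamma)

# Number of target updates after which an old target weight has decayed to half
ema_half_life = math.log(0.5) / math.log(1.0 - tau)

print(f"effective horizon: {effective_horizon:.0f} steps")    # 100 steps
print(f"target EMA half-life: {ema_half_life:.0f} updates")   # 138 updates
```

So rewards more than roughly 100 steps ahead contribute little to the Q-value, and the target network lags the critic by a bit over a hundred updates, which is what keeps the TD targets stable.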

Stage 2: Value-Calibrated Warm-Up

This is FORCE's core insight. Even a well-trained offline Q-function suffers from distributional shift: it was trained on the behavior policy's (expert's) data distribution, but the RL policy will visit a different distribution.

FORCE collects on-policy rollouts — runs the current policy in the environment and logs the trajectories — then merges this data with the offline demonstrations to create a mixed dataset.

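import random

# Pseudocode-level sketch: run_episode, cal_ql_loss, warm_up_updates and
# batch_size are placeholders, not names from the paper.

# Warm-up: collect on-policy experience with the current policy.
# run_episode is assumed to return a list of
# (obs, action, reward, next_obs, done) transitions.
on_policy_data = []
for _ in range(warm_up_episodes):
    trajectory = run_episode(policy=vla_policy, env=robot_env)
    on_policy_data.extend(trajectory)

# Merge with offline demonstrations
mixed_dataset = list(demo_dataset) + on_policy_data

# Continue Q-function training on the expanded dataset.
# Only the Q-function is updated here; the policy stays fixed.
for _ in range(warm_up_updates):
    batch = random.sample(mixed_dataset, batch_size)
    q_loss = cal_ql_loss(q_function, batch)
    q_function.update(q_loss)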

By exposing the Q-function to what the policy actually does in the environment, it learns to correctly evaluate the state-action regions the policy will visit during RL. No more distributional blind spots, no more catastrophic early unlearning.
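Cal-QL-style calibration needs a reference value for each state, and a Monte Carlo return-to-go computed from logged rollouts is one common choice. The sketch below shows that computation for a single finished episode. Whether FORCE derives its reference values exactly this way is an assumption, and the function name is illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse-reward episode: success is signalled only on the final step
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Each logged transition can then carry its return-to-go alongside the usual (obs, action, reward, next_obs, done) fields.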

Stage 3: Online Fine-Tuning with VGPD

Now the policy itself is fine-tuned via RL. This is where Value-Guided Policy Self-Distillation (VGPD) takes over.

How VGPD works: Instead of learning from all rollout data (including low-quality trajectories), VGPD uses the calibrated Q-function as an intelligent filter:

import torch

def vgpd_update(policy, q_function, state, K=20, tau=0.1):
    # policy.sample(state) and policy.log_prob(state, action) are assumed
    # interfaces; tau is the softmax temperature, not the target-network tau.
    with torch.no_grad():
        # Step 1: Sample K action candidates from the policy
        candidates = torch.stack([policy.sample(state) for _ in range(K)])

        # Step 2: Compute Q-values for all candidates, shape (K,)
        q_values = torch.stack(
            [q_function(state, a).reshape(()) for a in candidates]
        )

    # Step 3: Dynamic baseline = mean Q-value across candidates
    v_ref = q_values.mean()

    # Step 4: Filter, keeping only actions strictly above the baseline
    good_mask = q_values > v_ref
    if not good_mask.any():
        return None  # no update, e.g. when all candidates have the same Q-value

    # Step 5: Exponential weighting by Q-value (softmax with temperature tau)
    weights = torch.softmax(q_values[good_mask] / tau, dim=0)

    # Step 6: Weighted imitation learning on the policy's own best actions
    log_probs = torch.stack(
        [policy.log_prob(state, a).reshape(()) for a in candidates[good_mask]]
    )
    return -(weights * log_probs).sum()

This creates an automatic curriculum. The baseline V_ref is the mean over the policy's own candidates, so it tracks the policy: when the policy is weak, V_ref is low and even mediocre actions clear the bar, which keeps the update close to behavior cloning on the policy's own samples. As the policy improves, V_ref rises with it, the bar becomes stricter in absolute terms, and only the best actions guide further updates.
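A small numeric example makes the filter and the weighting concrete. This is a sketch in plain Python with made-up Q-values, not the paper's code; vgpd_weights is a hypothetical helper that returns one weight per candidate.

```python
import math

def vgpd_weights(q_values, tau=0.1):
    """Weights over candidates: zero at or below the mean, softmax(Q / tau) above it."""
    v_ref = sum(q_values) / len(q_values)
    kept = [i for i, q in enumerate(q_values) if q > v_ref]
    if not kept:
        return [0.0] * len(q_values)
    q_max = max(q_values[i] for i in kept)  # subtract the max for numerical stability
    exps = {i: math.exp((q_values[i] - q_max) / tau) for i in kept}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(q_values))]

weak = vgpd_weights([0.2, 0.4, 0.6, 0.8])
strong = vgpd_weights([1.2, 1.4, 1.6, 1.8])  # same spread, higher level
print([round(w, 3) for w in weak])
print([round(w, 3) for w in strong])
```

In both cases the two below-average candidates get zero weight and the remaining two split the weight roughly 0.12 to 0.88. The weights are identical because the baseline is relative to the candidates: raising every Q-value by the same amount raises the bar by that amount too.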

Expert buffer + Policy buffer: FORCE maintains two separate replay buffers during Stage 3:

Expert Buffer: the original expert demonstrations
Policy Buffer: rollouts collected from the current policy

These are sampled 50/50, and VGPD filtering is applied to both. This prevents catastrophic forgetting of demonstration knowledge while still incorporating newly explored behaviors.
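The 50/50 sampling can be sketched as follows. The buffer representation (plain lists of transitions) and the function name are assumptions for illustration; sampling is with replacement so a small policy buffer early in Stage 3 does not cause errors.

```python
import random

def sample_mixed_batch(expert_buffer, policy_buffer, batch_size, rng=random):
    """Draw half of the batch from each buffer (with replacement)."""
    half = batch_size // 2
    expert_part = rng.choices(expert_buffer, k=half)
    policy_part = rng.choices(policy_buffer, k=batch_size - half)
    batch = expert_part + policy_part
    rng.shuffle(batch)  # avoid a fixed expert-then-policy ordering
    return batch

expert_buffer = [("expert", i) for i in range(100)]
policy_buffer = [("policy", i) for i in range(10)]
batch = sample_mixed_batch(expert_buffer, policy_buffer, batch_size=256, rng=random.Random(0))
num_expert = sum(1 for source, _ in batch if source == "expert")
print(num_expert, "of", len(batch), "from expert buffer")  # 128 of 256
```

Fixing the ratio per batch, rather than pooling both buffers, keeps the demonstrations from being drowned out as the policy buffer grows.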

Figure: VGPD module, filtering K action candidates via the Q-function and exponential weighting. Source: Zhang et al., arXiv 2606.26006

Practical Setup (Implementing from the Paper)

As of July 2026, the paper does not include a public GitHub repository. Here is how to set up the environment and implement FORCE following the paper's algorithm specification.

System Requirements

# Create conda environment
conda create -n force_vla python=3.10
conda activate force_vla

# Core dependencies
pip install torch==2.3.0 torchvision
pip install "transformers>=4.40.0"   # quoted so the shell does not treat > as a redirect
pip install gymnasium mujoco

# ManiSkill (used in paper benchmarks)
pip install mani-skill2

# Or LIBERO for tabletop manipulation
pip install libero

GPU requirements:

Minimum: A40 (48GB VRAM) for VGPD with K=20 candidates
Recommended: A100 80GB or H100 80GB
Training time: ~18k environment steps for StackCube (a few hours on A100)

Base VLA Models

FORCE was tested with three models:

Octo — open-source generalist robot policy, easiest to download and fine-tune
π₀ (Pi Zero) — flow-based VLA from Physical Intelligence
π₀.₅ — improved variant of π₀

For Octo (most accessible):

# Clone and install Octo
git clone https://github.com/octo-models/octo.git
cd octo && pip install -e .

# Download pretrained checkpoint
python -c "
from octo.model.octo_model import OctoModel
model = OctoModel.load_pretrained('hf://rail-berkeley/octo-base')
print('Model loaded successfully')
"

Stage 1: Implement Cal-QL

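import torch
import torch.nn as nn
import torch.nn.functional as F

class CalQLCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs, action):
        # Squeeze the last dim so Q-values have the same shape as rewards and dones
        return self.net(torch.cat([obs, action], dim=-1)).squeeze(-1)

class CalQL:
    # Simplified sketch: the conservative term uses only uniform random actions;
    # the full method also evaluates actions sampled from the current policy.
    def __init__(self, critic, target_critic, gamma=0.99, alpha=0.1, tau=0.005, num_random=10):
        self.critic = critic
        self.target = target_critic
        self.gamma = gamma
        self.alpha = alpha
        self.tau = tau
        self.num_random = num_random  # random actions sampled per state

    def compute_loss(self, obs, actions, rewards, next_obs, next_actions, dones, mc_returns):
        # TD target using the target network. next_actions are the policy's
        # actions at next_obs, not the dataset actions taken at obs.
        with torch.no_grad():
            target_q = rewards + (1 - dones) * self.gamma * self.target(next_obs, next_actions)

        # In-distribution Q-values
        current_q = self.critic(obs, actions)
        td_loss = F.mse_loss(current_q, target_q)

        # CQL conservative regularization: sample uniform actions in [-1, 1]
        # for each state and push their Q-values down relative to dataset actions
        batch, action_dim = actions.shape
        random_actions = torch.rand(batch, self.num_random, action_dim, device=actions.device) * 2 - 1
        obs_rep = obs.unsqueeze(1).expand(-1, self.num_random, -1)
        random_q = self.critic(obs_rep, random_actions)  # (batch, num_random)

        # Cal-QL calibration: do not push Q-values below the Monte Carlo
        # return of the data-collecting policy (mc_returns, shape (batch,))
        random_q = torch.maximum(random_q, mc_returns.unsqueeze(1))
        cql_loss = (torch.logsumexp(random_q, dim=1) - current_q).mean()

        return td_loss + self.alpha * cql_loss

    @torch.no_grad()
    def update_target(self):
        # Polyak (EMA) update of the target network
        for p, tp in zip(self.critic.parameters(), self.target.parameters()):
            tp.mul_(1 - self.tau).add_(self.tau * p)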

Monitoring Training Progress

Track these metrics to know if training is healthy:

training_metrics = {
    'q_value_mean': [],        # Should increase steadily
    'q_value_std': [],         # Should decrease after warm-up
    'success_rate': [],        # Target: > 80%
    'episode_length': [],      # Should decrease (more efficient behavior)
    'vgpd_pass_rate': [],      # % actions that pass VGPD filter (should be 30-70%)
    'policy_buffer_ratio': [], # Ratio of policy vs expert data sampled
}

A healthy FORCE training run shows:

Q-values stabilizing during Stage 1 (no divergence)
Q-value std dropping during Stage 2 (distributional shift resolved)
Success rate climbing steadily in Stage 3 (no early catastrophic drop)
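The third check, no early catastrophic drop, is easy to automate. The helper below is a hypothetical sketch: the window and tolerance values are arbitrary defaults rather than values from the paper, and the two success-rate curves are made up for illustration.

```python
def has_early_drop(success_rates, window=5, tolerance=0.05):
    """True if success falls more than `tolerance` below its starting value
    within the first `window` evaluations."""
    if not success_rates:
        return False
    start = success_rates[0]
    return any(rate < start - tolerance for rate in success_rates[1:window])

healthy = [0.45, 0.47, 0.52, 0.60, 0.71]
unstable = [0.45, 0.20, 0.15, 0.30, 0.41]
print(has_early_drop(healthy), has_early_drop(unstable))  # False True
```

A check like this can run after each evaluation round and stop a Stage 3 run early if the Q-function was not calibrated well enough in Stages 1 and 2.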

Experimental Results

ManiSkill Simulation — Comparison with Baselines

Method	Backbone	Success Rate	Needs Human?
BC (Behavior Cloning)	Octo	45%	No
SFT	π₀	~60%	No
ConRFT (no HIL)	π₀	~73%	No
ConRFT (with HIL)	π₀	~76%	Yes
FORCE	Octo	82.3%	No
FORCE	π₀	86.9%	No

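FORCE with the π₀ backbone outperforms the best HIL-free RL competitor (ConRFT without HIL, ~73%) by over 10 percentage points, and it also beats ConRFT with HIL (~76%). Against the 45% BC baseline in this table the gain is 41.9 percentage points (86.9% vs 45%), so the headline 79% absolute improvement must refer to a different comparison than the rows shown here.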

Real-World Franka Robot Tasks

Figure: FORCE on real-world Franka robot tasks, 98.3% success rate vs 45% for the BC baseline. Source: Zhang et al., arXiv 2606.26006

Method	Success Rate	Avg. Execution Steps	Relative Efficiency
BC baseline	45%	112.8	1×
FORCE	98.3%	38.9	~3×

FORCE doesn't just succeed more often — it also completes tasks in roughly a third of the steps. The policy has learned to act decisively rather than exploratorily.

Ablation: How Much Do Warm-Up and VGPD Each Contribute?

Metric: Steps@80% (environment steps needed to reach 80% success rate):

Task	Full FORCE	No Warm-Up	No VGPD
StackCube	18k	28k	~24k
PickCube	12k	~20k	20k
PushCube	4k	10k	~8k

Both components contribute meaningfully. Going by the table, removing warm-up raises the sample cost by roughly 56% to 150% (1.6× to 2.5×), and removing VGPD raises it by roughly 33% to 100%. Neither alone is sufficient; the two are complementary.
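The relative overheads can be recomputed directly from the table. The sketch below treats the approximate entries (those marked with ~) as exact, so the percentages are only as precise as the table itself.

```python
steps_to_80 = {
    # task: (full FORCE, no warm-up, no VGPD), in thousands of environment steps
    "StackCube": (18, 28, 24),
    "PickCube": (12, 20, 20),
    "PushCube": (4, 10, 8),
}

def overhead_pct(ablated, full):
    """Extra steps needed by the ablated variant, as a percentage of full FORCE."""
    return 100.0 * (ablated - full) / full

for task, (full, no_warmup, no_vgpd) in steps_to_80.items():
    print(f"{task}: no warm-up +{overhead_pct(no_warmup, full):.0f}%, "
          f"no VGPD +{overhead_pct(no_vgpd, full):.0f}%")
```

The penalty for removing either component is largest on PushCube, the task that full FORCE solves fastest.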

Comparison with Other VLA RL Fine-Tuning Methods

Method	Core Idea	Strengths	Weaknesses
SimpleVLA-RL	GRPO online RL	Simple to implement	High sample count
EXPO-FT π₀.₅	Online RL in 19 min	Extremely fast	Narrow task scope
ProcVLM	Dense reward shaping	Rich reward signal	Requires reward engineering
FORCE	Cal-QL + VGPD	Sample efficient, no HIL	3-stage complexity

FORCE's main distinguishing factor is the combination of Q-function calibration before online training begins with a self-distillation mechanism that avoids learning from the policy's own mistakes. Most competing methods pick one or the other.

Strengths and Limitations

Strengths

No human intervention: Essential for scaling to fleets of robots in production
Architecture-agnostic: Tested with Octo and π₀, which suggests it can carry over to other VLA backbones
Automatic curriculum via VGPD: The filtering threshold rises as the policy improves, avoiding both under- and over-challenge
32.5% sample efficiency gain over the best competing method (ConRFT)

Limitations

Three-stage complexity: More hyperparameters to tune than simple SFT or single-stage RL
Q-function overhead: Must train and maintain a separate Q-function network in parallel with the policy
No public code: Paper does not include a GitHub repo as of July 2026 — implementation requires effort
K=20 forward passes per VGPD step: Memory-intensive; may require gradient checkpointing on smaller GPUs
Real-world evaluation covers a limited set of Franka manipulation tasks — generalization to more complex, contact-rich scenarios is not yet demonstrated

Conclusion

FORCE represents a principled answer to one of the field's most persistent problems: how do you apply RL to VLA models without either destroying what the model already knows or getting trapped in a loop of bad data generating bad updates?

The answer, calibrate first and then distill from your own best actions, is elegant and, on the reported results, effective. The reported 79% absolute gain and the roughly 3× execution efficiency improvement are substantial, and crucially, the elimination of human intervention makes it plausible to apply at scale.

The three-stage pipeline adds implementation complexity, and the absence of a public codebase raises the bar for adoption. But for teams working on production VLA deployment where sample cost and human labor are primary constraints, FORCE sets a new benchmark for what RL fine-tuning can achieve.

Paper: FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation — Zhang et al., arXiv 2606.26006, June 2026

Nguyễn Anh Tuấn
