Why ProcVLM Matters
If you have ever fine-tuned a manipulation policy with reinforcement learning, you know the hardest part is often not the optimizer. It is the reward. For a task like "pick up the red cup and put it into the drawer", a binary 0/1 reward is too sparse: the robot can complete 80% of the task and still receive 0 because the final step is unfinished, or fail a small intermediate step without knowing where the failure happened. Hand-designed rewards are not much better at scale. Distance from gripper to object, object-to-goal distance, collision penalties, and shaped bonuses quickly become task-specific engineering.
ProcVLM, from the paper "ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation", proposes a more scalable idea: train a vision-language model to read robot videos, understand the task instruction, reason about the remaining atomic actions, and predict a continuous task progress score. That score can be used as dense reward for reward-guided fine-tuning of VLA policies. The project page is procvlm.github.io, the code is on GitHub, and the released resources include ProcVLM-2B, ProcCorpus-60M, and ProcVQA-20M.
This guide explains ProcVLM from an engineer's perspective: what problem it solves, how the architecture works, how to install it, how to run progress inference on a video, and how to use the resulting rewards for VLA manipulation fine-tuning. If you are new to RL, start with RL basics for robotics. If you need the VLA background first, read Vision-Language-Action models.
Core Idea: Reward Should Follow Procedure, Not Time
Many robot reward models turn a successful trajectory into progress labels by interpolating over time: the first frame is 0%, the final frame is 100%, and the middle frames are evenly spaced. This is simple, but it is wrong for long-horizon manipulation. A robot can pause to adjust the gripper, retry after a failed grasp, complete one subtask and then move through visually similar states, or recover from a mistake. A later frame is not always closer to success.
ProcVLM defines progress through procedure:
- The task is decomposed into atomic subtasks such as "grasp bowl", "move bowl above target", and "place bowl into target".
- Each subtask has start/end boundaries and complete/incomplete state.
- Overall progress is budgeted across the subtask structure.
- Within each subtask, progress is assigned using visual change, not just frame index.
- The model first reasons about the remaining atomic actions, then predicts a completion percentage.
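The difference between time-based and procedure-based labels can be illustrated with a minimal sketch. It assumes each subtask gets a fixed share of total progress (`budgets`) and that some per-frame `visual_change` magnitude is available, for example a mean pixel difference. Both names, and the cumulative-change rule inside a subtask, are illustrative assumptions rather than the paper's exact labeling procedure.

```python
from typing import List, Sequence, Tuple


def time_interpolated_progress(num_frames: int) -> List[float]:
    """Baseline: progress grows linearly with frame index."""
    if num_frames == 1:
        return [1.0]
    return [i / (num_frames - 1) for i in range(num_frames)]


def procedure_progress(
    segments: Sequence[Tuple[int, int]],  # (start, end) frames per subtask, end exclusive
    budgets: Sequence[float],             # share of total progress per subtask, sums to 1
    visual_change: Sequence[float],       # non-negative change magnitude per frame
) -> List[float]:
    """Spend each subtask's budget in proportion to accumulated visual change."""
    progress = [0.0] * len(visual_change)  # frames outside every segment stay at 0.0
    done = 0.0                             # budget of subtasks already completed
    for (start, end), budget in zip(segments, budgets):
        total = sum(visual_change[start:end])
        acc = 0.0
        for t in range(start, end):
            acc += visual_change[t]
            # Fall back to frame index only if the segment shows no change at all.
            frac = acc / total if total > 0 else (t - start + 1) / (end - start)
            progress[t] = done + budget * frac
        done += budget
    return progress


# Two subtasks of four frames each; the robot pauses in the middle of the clip.
curve = procedure_progress(
    segments=[(0, 4), (4, 8)],
    budgets=[0.5, 0.5],
    visual_change=[1, 1, 0, 0, 0, 0, 1, 1],
)
```

In this toy clip the paused frames keep a flat progress value, while `time_interpolated_progress(8)` would keep rising through the pause.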
A simplified pipeline:
    Robot video + task instruction
            |
            v
    Large VLM annotator
            |
            +-- subtask plan
            +-- temporal boundaries
            +-- frame-level reasoning
            +-- remaining actions
            |
            v
    ProcCorpus-60M
            |
            v
    ProcVQA tasks
            - action segmentation
            - future planning
            - progress prediction
            |
            v
    ProcVLM-2B
            |
            +-- text reasoning
            +-- scalar progress value
            |
            v
    dense reward for policy fine-tuning
The important distinction is that ProcVLM does not only ask "does this frame look like the goal?" It asks: given this instruction, which execution stage is the robot in, which actions are already done, which actions remain, and what completion value should this state receive? This makes the reward task-conditioned. The project page shows a useful reward-editing example: the same apple-to-basket video receives a different progress curve when the instruction is edited to add a second step, "move the basket to the upper corner".
ProcVLM Architecture
ProcVLM is initialized from Qwen3-VL-2B-Instruct, a compact vision-language backbone. The model accepts task instructions plus visual observations and produces both textual reasoning and a continuous progress value.
| Component | Role | Output |
|---|---|---|
| VLM backbone | Processes observation window and task instruction | hidden states and text context |
| Language modeling head | Generates procedure-aware response | remaining actions, stage explanation |
| Progress value head | Regresses continuous completion | scalar progress |
| Semantic gating | Enables value head only for progress samples | clean multi-task training |
For progress prediction, the supervised response is structured like this:
Remaining actions:
1. Move the bowl above the target bowl.
2. Place the bowl into the target bowl.
<progress>62.5%</progress>
The paper trains ProcVLM with a joint objective:
L = L_LM + lambda * I_progress * L_value
L_LM is the standard autoregressive language modeling loss for all VQA tasks. L_value is the regression loss for the scalar progress head and is applied only when the sample contains progress supervision. This matters because token-only number prediction tends to collapse to coarse anchors such as 10%, 50%, or 90%. The value head gives ProcVLM a continuous path for progress estimation while keeping the reasoning grounded in the shared VLM representation.
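The gated objective can be written out in a few lines of plain Python. This is only a sketch of the formula above: per-sample language modeling losses are taken as given numbers, a missing progress label (`None`) plays the role of the indicator I_progress, and squared error stands in for L_value, since the exact regression loss is not stated here.

```python
from typing import Optional, Sequence


def value_loss(pred: float, target: float) -> float:
    """Squared error, used here as an assumed stand-in for L_value."""
    return (pred - target) ** 2


def joint_loss(
    lm_losses: Sequence[float],                # L_LM per sample
    value_preds: Sequence[float],              # value-head output per sample
    value_targets: Sequence[Optional[float]],  # None when there is no progress label
    lam: float = 1.0,                          # lambda in the formula above
) -> float:
    """Batch mean of L_LM + lambda * I_progress * L_value."""
    total = 0.0
    for lm, pred, target in zip(lm_losses, value_preds, value_targets):
        gate = 0.0 if target is None else 1.0  # I_progress
        reg = 0.0 if target is None else value_loss(pred, target)
        total += lm + lam * gate * reg
    return total / len(lm_losses)


# One progress sample and one action-segmentation sample without a progress label.
batch_loss = joint_loss(
    lm_losses=[2.0, 1.0],
    value_preds=[0.5, 0.9],
    value_targets=[0.7, None],
    lam=10.0,
)
```

The second sample contributes only its language modeling loss, which is the effect the semantic gating row in the table describes.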
How ProcCorpus-60M and ProcVQA-20M Are Built
The authors do not manually annotate every robot video. They build a synthetic supervision pipeline using large VLMs as annotators. According to the paper, ProcCorpus-60M is constructed from about 400K trajectories across 30 embodied datasets, mixing real-robot and simulation sources such as DROID, BridgeData V2, Fractal, RH20T, Table30, selected OXE subsets, LIBERO, RoboTwin 2.0, and GR00T-Teleop-Sim. The result is more than 60M annotated frames.
The annotation pipeline has four asynchronous modules:
| Module | Purpose |
|---|---|
| Data reader | Loads episodes, task instructions, camera keys, and frame indices |
| CPU preprocessing | Resizes images, builds prompts, prepares frame/video windows |
| GPU VLM inference | Runs plan generation, subtask localization, and frame reasoning |
| Post-processing | Parses JSONL, validates formats, expands segments into frame labels |
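The four-stage layout in the table can be sketched with `asyncio` queues. Everything below is a toy under stated assumptions: the episode fields, the `fake_vlm_infer` stub, and the JSON schema are hypothetical, and a real pipeline would run the GPU stage in a separate worker or inference server rather than as a blocking call inside the event loop.

```python
import asyncio
import json
from typing import Any, Dict, List

_DONE = object()  # sentinel passed downstream when a stage has no more work


def preprocess(episode: Dict[str, Any]) -> Dict[str, Any]:
    """CPU step: build the prompt (image resizing is omitted in this sketch)."""
    prompt = f"Split the task '{episode['instruction']}' into atomic subtasks."
    return {**episode, "prompt": prompt}


def fake_vlm_infer(item: Dict[str, Any]) -> str:
    """Stand-in for GPU inference: returns JSON with two segments."""
    mid = item["num_frames"] // 2
    segments = [
        {"subtask": "subtask_1", "start": 0, "end": mid},
        {"subtask": "subtask_2", "start": mid, "end": item["num_frames"]},
    ]
    return json.dumps({"episode": item["id"], "segments": segments})


def postprocess(raw: str) -> Dict[str, Any]:
    """Parse JSON, validate boundaries, expand segments into per-frame labels."""
    record = json.loads(raw)
    labels: List[str] = []
    for seg in record["segments"]:
        # Segments must be non-empty and tile the episode without gaps.
        if seg["start"] != len(labels) or seg["end"] <= seg["start"]:
            raise ValueError(f"bad segment boundaries: {seg}")
        labels.extend([seg["subtask"]] * (seg["end"] - seg["start"]))
    return {"episode": record["episode"], "frame_labels": labels}


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)
            return
        await outbox.put(fn(item))


async def run_pipeline(episodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Bounded queues give back-pressure between stages.
    q_read, q_prep, q_infer, q_post = (asyncio.Queue(maxsize=4) for _ in range(4))

    async def reader() -> None:
        for episode in episodes:
            await q_read.put(episode)
        await q_read.put(_DONE)

    async def collector() -> List[Dict[str, Any]]:
        results = []
        while True:
            item = await q_post.get()
            if item is _DONE:
                return results
            results.append(item)

    *_, results = await asyncio.gather(
        reader(),
        stage(preprocess, q_read, q_prep),
        stage(fake_vlm_infer, q_prep, q_infer),
        stage(postprocess, q_infer, q_post),
        collector(),
    )
    return results
```

Calling `asyncio.run(run_pipeline(episodes))` with a list of episode dicts returns one record of frame labels per episode.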
The paper uses Qwen3-VL-235B-A22B-Instruct for video-level planning and temporal localization because those steps require long-context video reasoning. It uses InternVL3.5-38B for frame-level reasoning and grounding, which is more efficient for single-frame analysis.
ProcCorpus is then converted into ProcVQA with three task families:
| ProcVQA Task | What the model learns | Why it helps reward modeling |
|---|---|---|
| Action segmentation | Split video into atomic actions | Understand procedure boundaries |
| Future planning | Predict remaining actions | Detect unfinished states |
| Progress prediction | Estimate completion percentage | Produce dense reward |
Training has two stages. Stage 1 uses the full ProcVQA corpus at about 20B tokens, giving the model broad procedural coverage across robots, viewpoints, and tasks. Stage 2 refines the model on a curated subset of about 15K trajectories and 2.8B tokens, selected for cleaner subtask alignment.
Installation
The official repository uses uv, Python 3.10, vLLM by default, and LMDeploy for parts of the annotation pipeline. In practice, it helps to separate two use cases:
- Reward inference: you only need the model checkpoint and videos.
- Full annotation/training pipeline: you need local inference engines and more GPU capacity.
Minimal setup:
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
# Install uv if needed
wget -qO- https://astral.sh/uv/install.sh | sh
# Create a Python 3.10 environment
uv sync --python 3.10
source .venv/bin/activate
# Flash Attention is usually installed separately
uv pip install flash-attn --no-build-isolation
The README notes that the project uses vLLM v0.18 and Transformers v4.57 by default. If CUDA errors appear, verify PyTorch/CUDA compatibility before changing project code. On smaller GPUs, start with small batches or use Transformers inference before attempting high-throughput vLLM serving.
For local annotation and reasoning pipelines, the repo recommends setting up LMDeploy in a separate Conda environment:
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
pip install -r envs/others_pip.txt
Inference: From Video to Progress JSONL
The simplest use case is: you have a rollout video and a task instruction, and you want a completion score along the trajectory.
python evqa/inference.py \
--model_path /path/to/procvlm-checkpoint \
--video_path tmp/fold_cloth/R1_Lite_fold_clothes.mp4 \
--output_path tmp/fold_cloth/progress.jsonl \
--task "fold the red T-shirt" \
--window_size 8
window_size=8 means the model sees the current frame plus the recent visual context. The output JSONL contains sampled frame indices and predicted progress:
{"frame_index": 0, "progress": 0.02}
{"frame_index": 12, "progress": 0.11}
{"frame_index": 24, "progress": 0.28}
{"frame_index": 36, "progress": 0.51}
{"frame_index": 48, "progress": 0.73}
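A small loader makes the JSONL easier to work with downstream. The sketch below assumes the two field names shown in the example records and assumes progress is reported on a 0 to 1 scale; adjust the range check if your checkpoint reports percentages.

```python
import json
from typing import Iterable, List, Tuple


def load_progress(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Parse progress JSONL into (frame_index, progress) pairs sorted by frame."""
    records = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        rec = json.loads(raw)
        value = float(rec["progress"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"progress out of range: {rec}")
        records.append((int(rec["frame_index"]), value))
    records.sort(key=lambda pair: pair[0])
    return records


# The first three records from the example output above.
sample_lines = [
    '{"frame_index": 0, "progress": 0.02}',
    '{"frame_index": 12, "progress": 0.11}',
    '{"frame_index": 24, "progress": 0.28}',
]
curve = load_progress(sample_lines)
```

For a real run, pass an open file handle instead of the sample list, since iterating a text file yields one line at a time.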
You can also visualize the progress curve on the video:
python evqa/eval/visualize_progress_video \
--model_path /path/to/procvlm-checkpoint \
--video_path tmp/fold_cloth/R1_Lite_fold_clothes.mp4 \
--output_path tmp/fold_cloth/progress_vis.mp4 \
--task "fold the red T-shirt" \
--window_size 8
In Python, the integration pattern is straightforward:
from evqa.inference import infer_progress_from_video
# Run progress inference on one rollout video
progress = infer_progress_from_video(
    model_path="/models/procvlm",
    video_path="rollouts/episode_003.mp4",
    task="stack the blue bowl into the target bowl",
    window_size=8,
)
# Simple dense reward: progress delta between consecutive sampled frames
rewards = []
prev = progress[0]["progress"]
for item in progress[1:]:
    cur = item["progress"]
    rewards.append(cur - prev)
    prev = cur
For real training, smooth and clip the reward. Raw VLM progress is useful, but it is still a learned signal.
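def procvlm_reward(progress_t, progress_prev, success_bonus=1.0):
    # Progress gained since the previous sampled frame
    delta = progress_t - progress_prev
    # Clip gains to +0.10 and drops to -0.05 so one noisy prediction cannot dominate
    dense = max(min(delta, 0.10), -0.05)
    # Bonus near completion; as written it is paid on every step above 0.95
    terminal = success_bonus if progress_t > 0.95 else 0.0
    return dense + terminal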
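Smoothing can be added in front of the clipping step. The sketch below uses an exponential moving average, which is one common choice and not something the paper prescribes; `alpha` and the clip bounds are illustrative defaults.

```python
from typing import List, Sequence


def ema_smooth(values: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average; alpha=1.0 leaves the curve unchanged."""
    smoothed: List[float] = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def clipped_delta_rewards(
    progress: Sequence[float],
    alpha: float = 0.3,
    max_gain: float = 0.10,
    max_drop: float = -0.05,
) -> List[float]:
    """Smooth the progress curve, then return clipped step-to-step deltas."""
    s = ema_smooth(progress, alpha)
    return [max(min(b - a, max_gain), max_drop) for a, b in zip(s, s[1:])]


# A spiky curve: the raw deltas would be +1.0 and -1.0.
rewards = clipped_delta_rewards([0.0, 1.0, 0.0], alpha=1.0)
```

A lower `alpha` suppresses more noise but also delays the reward for a real jump in progress, so it is worth checking the smoothed curve against the video.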
Using ProcVLM Rewards to Fine-Tune a VLA Policy
The paper evaluates ProcVLM inside a reward fine-tuning setup based on SJTU Evo-RL. The base policy is pi0.5. SFT and RFT start from the same policy initialization and use the same training data. The difference is that RFT uses ProcVLM to assign dense progress scores to trajectories, then Evo-RL estimates advantages over a 50-step horizon.
The workflow:
1. Collect trajectories through teleoperation or policy rollout
2. Save video + action sequence + task instruction
3. Run ProcVLM to obtain progress p_t for sampled frames
4. Convert p_t into dense reward or advantage
5. Mark the top 30% advantage samples as positive
6. Mark the remaining samples as negative
7. Train the policy with advantage-conditioned signals
8. Evaluate in simulation or on the real robot
System-level pseudo-code:
for episode in dataset:
    progress = procvlm.infer(
        video_path=episode.video,
        task=episode.instruction,
        window_size=8,
    )
    episode.rewards = compute_delta_rewards(progress)
    episode.advantages = estimate_advantage(
        rewards=episode.rewards,
        horizon=50,
    )
positive = select_top_percent(dataset, key="advantages", percent=30)
negative = select_rest(dataset, positive)
train_vla_policy(
    base_policy="pi0.5",
    positive_samples=positive,
    negative_samples=negative,
    objective="advantage_conditioned_rft",
)
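The advantage and labeling steps can be made concrete with a runnable toy. How Evo-RL actually estimates advantages is not described here, so the sketch uses a simple stand-in: the progress gained over the next `horizon` samples, which equals the undiscounted sum of delta rewards over that window. The function names are hypothetical.

```python
from typing import List, Sequence


def horizon_advantage(progress: Sequence[float], horizon: int = 50) -> List[float]:
    """Assumed stand-in: progress gained over the next `horizon` samples."""
    last = len(progress) - 1
    return [progress[min(t + horizon, last)] - progress[t] for t in range(len(progress))]


def label_top_fraction(advantages: Sequence[float], fraction: float = 0.30) -> List[bool]:
    """Mark the highest-advantage samples as positive, the rest as negative."""
    n = len(advantages)
    k = max(1, round(n * fraction)) if n else 0
    order = sorted(range(n), key=lambda i: advantages[i], reverse=True)
    positive = set(order[:k])
    return [i in positive for i in range(n)]


adv = horizon_advantage([0.0, 0.5, 1.0], horizon=1)
labels = label_top_fraction([0.1, 0.9, 0.5, 0.2, 0.0, 0.3, 0.8, 0.4, 0.7, 0.6])
```

With ten samples and a 30% cut, the three highest advantages are labeled positive. Whether the cut is applied per episode or across the whole dataset is a design choice this sketch leaves open.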
If you are using LeRobot or a custom VLA stack, you do not need to implement the entire Evo-RL pipeline on day one. Start by using ProcVLM to rerank trajectories, remove segments with low or stagnant progress, and SFT the policy again on cleaner data. Once that works, move to advantage-conditioned fine-tuning. The HIL-SERL real-robot RL guide is also useful if you want to combine human intervention with a learned reward model.
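The reranking and filtering step can be prototyped without any training code. The sketch below flags spans where progress barely moves and ranks trajectories by final progress, breaking ties by how many steps were spent stalled. The thresholds and the ranking rule are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def stagnant_spans(
    progress: Sequence[float], min_gain: float = 0.02, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) step spans, end exclusive, where each step gains < min_gain."""
    deltas = [b - a for a, b in zip(progress, progress[1:])]
    spans: List[Tuple[int, int]] = []
    start = None
    # The trailing infinity closes a run that reaches the end of the curve.
    for i, d in enumerate(deltas + [float("inf")]):
        if d < min_gain:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


def rank_trajectories(curves: Dict[str, Sequence[float]]) -> List[str]:
    """Sort by final progress (high first), then by fewer stagnant steps."""
    def score(item):
        _, curve = item
        stalled = sum(end - start for start, end in stagnant_spans(curve))
        return (-curve[-1], stalled)
    return [name for name, _ in sorted(curves.items(), key=score)]


spans = stagnant_spans([0.0, 0.2, 0.2, 0.2, 0.2, 0.6, 1.0])
ranking = rank_trajectories({
    "stalled": [0.0, 0.2, 0.2, 0.2, 0.2, 0.6, 1.0],
    "clean": [0.0, 0.3, 0.6, 1.0],
    "failed": [0.0, 0.2, 0.4],
})
```

Step `i` here means the transition from sample `i` to sample `i + 1`, so a span of `(1, 4)` covers the three transitions during which the curve stayed flat.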
Results: Strong Procedural Reasoning, Useful Real-Robot Gains
On ProcVQA, ProcVLM-2B outperforms larger VLMs on several primary metrics. Key reported numbers:
| Benchmark | Metric | ProcVLM-2B | Notable baseline |
|---|---|---|---|
| ProcVQA ID | BF1@5 | 0.6924 | GPT-5.4: 0.5221 |
| ProcVQA ID | Future planning success | 0.8103 | Qwen3.5: 0.7931 |
| ProcVQA ID | VOC | 0.8058 | Qwen3.5: 0.5475 |
| ProcVQA OOD | Future planning success | 0.8448 | Qwen3.5: 0.7758 |
| ProcVQA OOD | VOC | 0.7282 | GPT-5.4: 0.6553 |
In the zero-shot reward model comparison on ProcVQA-OOD, ProcVLM-2B reaches VOC 0.7282, above Robometer-4B at 0.5296 and RoboDopamine-4B at 0.7156. Robometer reports higher EPR@50, but its lower VOC indicates weaker trajectory-internal ordering under the shuffled local-window setup.
On RoboFAC real-robot one-shot adaptation, ProcVLM improves rapidly:
| Setting | VOCsucc | MAEfail | MCC | Latency |
|---|---|---|---|---|
| ProcVLM zero-shot | 0.4920 | 0.2001 | 0.6665 | 50s |
| ProcVLM 1-shot success | 0.9137 | 0.1241 | 0.7918 | 81s |
| ProcVLM 1-shot success + fail | 0.9301 | 0.1187 | 0.8053 | 80s |
For policy learning, the most relevant experiment is reward fine-tuning. On LIBERO-10 simulation, ProcVLM-guided RFT gives modest early gains over SFT: 73.2 (SFT) vs. 73.6 (RFT) at 1000 steps, and 72.8 vs. 74.0 at 2000 steps. On the real JAKA stack-bowls task, the gain is larger: at 5k steps, SFT reaches 37.5 while RFT reaches 62.5; at 10k steps, SFT reaches 70.8 while RFT reaches 83.3. This matches the intuition that real teleoperation data contains noisy local behavior such as grasp retries, and a progress-aware reward can downweight low-value segments.
Beginner Deployment Checklist
If you want to try ProcVLM without building a full robot learning stack, use this sequence:
- Choose a clear manipulation task: pick-place, stack bowl, close drawer.
- Collect 10-30 successful videos and 10-30 failed videos if possible.
- Use consistent instructions, for example "place the blue bowl into the target bowl".
- Run ProcVLM inference with window_size=8.
- Plot progress against frame index.
- Manually inspect at least five videos with their reward curves.
- If curves are noisy, add smoothing and delta clipping.
- Use the reward to filter or rerank SFT data first.
- Move to advantage-conditioned RFT or RL only after the reward curve looks sane.
A practical rule: do not trust the reward model blindly. Always visualize video plus reward curve. If the score rises during useless retry behavior, the instruction may be ambiguous or the camera may miss the target object. If the score falls after success, the model may need more context frames or a clearer final-state instruction.
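Both failure patterns can be turned into simple automated checks that run alongside manual review. In the sketch below, `retry_spans` are sample index ranges that a human marked as useless retries while watching the video; the function name, thresholds, and message format are hypothetical.

```python
from typing import List, Sequence, Tuple


def check_curve(
    progress: Sequence[float],
    retry_spans: Sequence[Tuple[int, int]] = (),  # (start, end) sample indices, inclusive
    success_level: float = 0.95,
    tolerance: float = 0.05,
) -> List[str]:
    """Return human-readable warnings for suspicious progress curves."""
    warnings: List[str] = []
    # 1) Progress should not rise inside spans marked as useless retries.
    for start, end in retry_spans:
        gain = progress[end] - progress[start]
        if gain > tolerance:
            warnings.append(
                f"progress rises by {gain:.2f} during retry span {start}-{end}"
            )
    # 2) Once the curve reaches the success level, it should not fall back.
    peak = 0.0
    for i, p in enumerate(progress):
        if peak >= success_level and p < peak - tolerance:
            warnings.append(
                f"progress falls to {p:.2f} at sample {i} after reaching {peak:.2f}"
            )
            break
        peak = max(peak, p)
    return warnings


issues = check_curve([0.0, 0.5, 0.97, 0.6])
```

These checks only prioritize which videos to look at first. They do not replace watching the rollout next to its reward curve.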
Limitations
ProcVLM is not a replacement for real evaluation. It learns progress from subtask decomposition and temporal boundary localization, so annotation errors can become reward errors. It also does not directly verify contact forces, hardware safety, or kinematic constraints. For force-control tasks, tight insertions, or tactile manipulation, combine ProcVLM with a task-specific success detector, safety constraints, and human review.
Inference cost is another practical constraint. ProcVLM-2B is much lighter than 8B or 27B reward models, but video-window inference is still not free. If you need rewards at control frequency, sample frames sparsely or label trajectories offline. The strongest current use cases are offline reward labeling, trajectory filtering, one-shot reward adaptation, and reward-guided fine-tuning.
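When frames are sampled sparsely and labeled offline, the sparse predictions still need to be mapped back to every control step. One simple option, sketched below, is linear interpolation between neighboring samples. It needs the later sample, so it only works offline, and it assumes the gaps are short enough that progress changes smoothly inside them.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def densify(samples: Sequence[Tuple[int, float]], num_frames: int) -> List[float]:
    """Interpolate sparse (frame_index, progress) samples to every frame.

    Samples must be sorted by frame index with no duplicate indices.
    Frames before the first or after the last sample reuse the nearest value.
    """
    idx = [frame for frame, _ in samples]
    val = [value for _, value in samples]
    dense: List[float] = []
    for t in range(num_frames):
        j = bisect_right(idx, t)  # number of samples at or before frame t
        if j == 0:
            dense.append(val[0])
        elif j == len(idx):
            dense.append(val[-1])
        else:
            f0, f1 = idx[j - 1], idx[j]
            w = (t - f0) / (f1 - f0)
            dense.append(val[j - 1] + w * (val[j] - val[j - 1]))
    return dense


# Two labeled frames stretched over a six-frame clip.
dense_curve = densify([(0, 0.0), (4, 1.0)], num_frames=6)
```

For online use, holding the last predicted value until the next prediction arrives is the causal alternative.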
Conclusion
ProcVLM is an important step toward robot reward models that understand task procedure. Instead of writing hand-shaped rewards or relying only on terminal success, it reads robot videos and instructions to produce dense progress rewards. For VLA manipulation, especially when fine-tuning from noisy demonstrations, this gives the policy a much richer signal about which parts of a trajectory actually move the task forward.
Treat ProcVLM as a progress critic, not as the robot controller itself. Combined with SFT, LeRobot/HIL-SERL, Evo-RL, or your own VLA training loop, it can turn robot videos into a more useful training signal than end-of-episode success or failure.