ProcVLM: Dense Video Rewards for VLA

Why ProcVLM Matters

If you have ever fine-tuned a manipulation policy with reinforcement learning, you know the hardest part is often not the optimizer. It is the reward. For a task like "pick up the red cup and put it into the drawer", a binary 0/1 reward is too sparse: the robot can complete 80% of the task and still receive 0 because the cup never ends up in the drawer, and when it fails a small intermediate step the reward does not say where the failure happened. Hand-designed rewards are not much better at scale. Gripper-to-object distance, object-to-goal distance, collision penalties, and shaped bonuses quickly become task-specific engineering.

ProcVLM, from the paper "ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation", proposes a more scalable idea: train a vision-language model to read robot videos, understand the task instruction, reason about the remaining atomic actions, and predict a continuous task progress score. That score can be used as dense reward for reward-guided fine-tuning of VLA policies. The project page is procvlm.github.io, the code is on GitHub, and the released resources include ProcVLM-2B, ProcCorpus-60M, and ProcVQA-20M.

This guide explains ProcVLM from an engineer's perspective: what problem it solves, how the architecture works, how to install it, how to run progress inference on a video, and how to use the resulting rewards for VLA manipulation fine-tuning. If you are new to RL, start with RL basics for robotics. If you need the VLA background first, read Vision-Language-Action models.

Core Idea: Reward Should Follow Procedure, Not Time

Many robot reward models turn a successful trajectory into progress labels by interpolating over time: the first frame is 0%, the final frame is 100%, and the middle frames are evenly spaced. This is simple, but it is wrong for long-horizon manipulation. A robot can pause to adjust the gripper, retry after a failed grasp, complete one subtask and then move through visually similar states, or recover from a mistake. A later frame is not always closer to success.

ProcVLM defines progress through procedure:

The task is decomposed into atomic subtasks such as "grasp bowl", "move bowl above target", and "place bowl into target".
Each subtask has start/end boundaries and complete/incomplete state.
Overall progress is budgeted across the subtask structure.
Within each subtask, progress is assigned using visual change, not just frame index.
The model first reasons about the remaining atomic actions, then predicts a completion percentage.
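To make the contrast with time interpolation concrete, here is a minimal sketch. It assumes an equal progress budget per subtask and a precomputed per-frame visual-change score; both are illustrative assumptions, and `linear_progress` and `procedure_progress` are hypothetical helper names, not functions from the ProcVLM codebase.

```python
from typing import List, Tuple


def linear_progress(num_frames: int) -> List[float]:
    """Time interpolation: first frame 0.0, last frame 1.0, evenly spaced."""
    if num_frames == 1:
        return [1.0]
    return [i / (num_frames - 1) for i in range(num_frames)]


def procedure_progress(
    segments: List[Tuple[int, int]], change: List[float]
) -> List[float]:
    """Procedure-based labels (illustrative, not the paper's exact rule).

    segments: inclusive (start, end) frame ranges, one per subtask,
              contiguous and covering every frame.
    change:   change[i] is a non-negative visual-change score between
              frame i-1 and frame i (change[0] is ignored).
    """
    budget = 1.0 / len(segments)  # assumption: equal budget per subtask
    labels = [0.0] * len(change)
    for k, (start, end) in enumerate(segments):
        total = sum(change[start + 1 : end + 1])
        cumulative = 0.0
        for i in range(start, end + 1):
            if i > start:
                cumulative += change[i]
            if total > 0:
                fraction = cumulative / total
            else:
                # No visual change in this subtask: fall back to frame index.
                fraction = (i - start) / max(end - start, 1)
            labels[i] = k * budget + budget * fraction
    return labels
```

With two subtasks over eight frames and no visual change between frames 1 and 2, the procedure-based labels stay flat across that pause while the linear labels keep rising.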

A simplified pipeline:

Robot video + task instruction
        |
        v
Large VLM annotator
        |
        +-- subtask plan
        +-- temporal boundaries
        +-- frame-level reasoning
        +-- remaining actions
        |
        v
ProcCorpus-60M
        |
        v
ProcVQA tasks
  - action segmentation
  - future planning
  - progress prediction
        |
        v
ProcVLM-2B
        |
        +-- text reasoning
        +-- scalar progress value
        |
        v
dense reward for policy fine-tuning

The important distinction is that ProcVLM does not only ask "does this frame look like the goal?". It asks: given this instruction, which execution stage is the robot in, which actions are already done, which actions remain, and what completion value should this state receive? This makes the reward task-conditioned. The project page shows a useful reward-editing example: the same apple-to-basket video receives a different progress curve when the instruction is edited to add a second step, "move the basket to the upper corner".

ProcVLM Architecture

ProcVLM is initialized from Qwen3-VL-2B-Instruct, a compact vision-language backbone. The model accepts task instructions plus visual observations and produces both textual reasoning and a continuous progress value.

Component	Role	Output
VLM backbone	Processes observation window and task instruction	hidden states and text context
Language modeling head	Generates procedure-aware response	remaining actions, stage explanation
Progress value head	Regresses continuous completion	scalar progress
Semantic gating	Enables value head only for progress samples	clean multi-task training

For progress prediction, the supervised response is structured like this:

Remaining actions:
1. Move the bowl above the target bowl.
2. Place the bowl into the target bowl.

<progress>62.5%</progress>
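If you consume the text response directly, the numbered actions and the progress tag can be pulled out with two regular expressions. This is a small sketch based only on the response format shown above; `parse_response` is a hypothetical helper, and the real model output may need more tolerant parsing.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "<progress>62.5%</progress>"
_PROGRESS_RE = re.compile(
    r"<progress>[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*%[ \t]*</progress>"
)
# Matches numbered list lines such as "1. Move the bowl ..."
_ACTION_RE = re.compile(r"^[ \t]*\d+\.[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_response(text: str) -> Tuple[List[str], Optional[float]]:
    """Return (remaining_actions, progress in [0, 1] or None if missing)."""
    actions = _ACTION_RE.findall(text)
    match = _PROGRESS_RE.search(text)
    progress = float(match.group(1)) / 100.0 if match else None
    return actions, progress
```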

The paper trains ProcVLM with a joint objective:

L = L_LM + lambda * I_progress * L_value

L_LM is the standard autoregressive language modeling loss for all VQA tasks. L_value is the regression loss for the scalar progress head and is applied only when the sample contains progress supervision. This matters because token-only number prediction tends to collapse to coarse anchors such as 10%, 50%, or 90%. The value head gives ProcVLM a continuous path for progress estimation while keeping the reasoning grounded in the shared VLM representation.
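The gating in this objective can be sketched in a few lines of plain Python. The sketch assumes a mean-squared-error value loss and per-sample language modeling losses that are already computed; the post does not state which regression loss the paper uses, so treat `joint_loss` and its arguments as illustrative.

```python
from typing import List


def joint_loss(
    lm_losses: List[float],
    value_preds: List[float],
    value_targets: List[float],
    has_progress: List[bool],
    lam: float = 1.0,
) -> float:
    """L = L_LM + lam * I_progress * L_value for one batch.

    has_progress[i] plays the role of the indicator I_progress: the value
    loss is only computed on samples that carry progress supervision.
    """
    l_lm = sum(lm_losses) / len(lm_losses)
    squared_errors = [
        (pred - target) ** 2
        for pred, target, gate in zip(value_preds, value_targets, has_progress)
        if gate
    ]
    # Assumption: MSE averaged over the gated samples; 0.0 if there are none.
    l_value = sum(squared_errors) / len(squared_errors) if squared_errors else 0.0
    return l_lm + lam * l_value
```

In this sketch a batch made only of segmentation or planning samples contributes no value loss, so the value head receives no gradient from it.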

How ProcCorpus-60M and ProcVQA-20M Are Built

The authors do not manually annotate every robot video. They build a synthetic supervision pipeline using large VLMs as annotators. According to the paper, ProcCorpus-60M is constructed from about 400K trajectories across 30 embodied datasets, mixing real-robot and simulation sources such as DROID, BridgeData V2, Fractal, RH20T, Table30, selected OXE subsets, LIBERO, RoboTwin 2.0, and GR00T-Teleop-Sim. The result is more than 60M annotated frames.

The annotation pipeline has four asynchronous modules:

Module	Purpose
Data reader	Loads episodes, task instructions, camera keys, and frame indices
CPU preprocessing	Resizes images, builds prompts, prepares frame/video windows
GPU VLM inference	Runs plan generation, subtask localization, and frame reasoning
Post-processing	Parses JSONL, validates formats, expands segments into frame labels
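The post-processing step can be pictured as follows: each JSONL record describes one subtask segment, and the segment is expanded into one label per frame. The field names `subtask`, `start`, and `end` are assumptions made for this sketch and are not taken from the released data format.

```python
import json
from typing import List, Optional


def expand_segments(jsonl_text: str, num_frames: int) -> List[Optional[str]]:
    """Parse segment records and return one subtask label per frame.

    Frames not covered by any segment keep the label None.
    """
    labels: List[Optional[str]] = [None] * num_frames
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        start, end = int(record["start"]), int(record["end"])
        # Basic format validation before expanding the segment.
        if not 0 <= start <= end < num_frames:
            raise ValueError(f"segment out of range: {record}")
        for frame in range(start, end + 1):
            labels[frame] = record["subtask"]
    return labels
```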

The paper uses Qwen3-VL-235B-A22B-Instruct for video-level planning and temporal localization because those steps require long-context video reasoning. It uses InternVL3.5-38B for frame-level reasoning and grounding, which is more efficient for single-frame analysis.

ProcCorpus is then converted into ProcVQA with three task families:

ProcVQA Task	What the model learns	Why it helps reward modeling
Action segmentation	Split video into atomic actions	Understand procedure boundaries
Future planning	Predict remaining actions	Detect unfinished states
Progress prediction	Estimate completion percentage	Produce dense reward

Training has two stages. Stage 1 uses the full ProcVQA corpus at about 20B tokens, giving the model broad procedural coverage across robots, viewpoints, and tasks. Stage 2 refines the model on a curated subset of about 15K trajectories and 2.8B tokens, selected for cleaner subtask alignment.

Installation

The official repository uses uv, Python 3.10, vLLM by default, and LMDeploy for parts of the annotation pipeline. In practice, it helps to separate two use cases:

Reward inference: you only need the model checkpoint and videos.
Full annotation/training pipeline: you need local inference engines and more GPU capacity.

Minimal setup:

git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM

# Install uv if needed
wget -qO- https://astral.sh/uv/install.sh | sh

# Create a Python 3.10 environment
uv sync --python 3.10
source .venv/bin/activate

# Flash Attention is usually installed separately
uv pip install flash-attn --no-build-isolation

The README notes that the project uses vLLM v0.18 and Transformers v4.57 by default. If CUDA errors appear, verify PyTorch/CUDA compatibility before changing project code. On smaller GPUs, start with small batches or use Transformers inference before attempting high-throughput vLLM serving.

For local annotation and reasoning pipelines, the repo recommends setting up LMDeploy in a separate Conda environment:

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
pip install -r envs/others_pip.txt

Inference: From Video to Progress JSONL

The simplest use case is: you have a rollout video and a task instruction, and you want a completion score along the trajectory.

python evqa/inference.py \
  --model_path /path/to/procvlm-checkpoint \
  --video_path tmp/fold_cloth/R1_Lite_fold_clothes.mp4 \
  --output_path tmp/fold_cloth/progress.jsonl \
  --task "fold the red T-shirt" \
  --window_size 8

window_size=8 means the model sees the current frame plus the recent visual context. The output JSONL contains sampled frame indices and predicted progress:

{"frame_index": 0, "progress": 0.02}
{"frame_index": 12, "progress": 0.11}
{"frame_index": 24, "progress": 0.28}
{"frame_index": 36, "progress": 0.51}
{"frame_index": 48, "progress": 0.73}
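Loading that file back into Python takes only the standard library. The sketch below relies on the `frame_index` and `progress` keys shown in the sample output; `load_progress` is a hypothetical helper name.

```python
import json
from typing import Iterable, List, Tuple


def load_progress(lines: Iterable[str]) -> Tuple[List[int], List[float]]:
    """Read progress JSONL lines into parallel frame-index and value lists."""
    frames: List[int] = []
    values: List[float] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        frames.append(int(record["frame_index"]))
        values.append(float(record["progress"]))
    return frames, values
```

With a file on disk, pass the open handle directly, for example `with open("tmp/fold_cloth/progress.jsonl") as f: frames, values = load_progress(f)`.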

You can also visualize the progress curve on the video:

python evqa/eval/visualize_progress_video \
  --model_path /path/to/procvlm-checkpoint \
  --video_path tmp/fold_cloth/R1_Lite_fold_clothes.mp4 \
  --output_path tmp/fold_cloth/progress_vis.mp4 \
  --task "fold the red T-shirt" \
  --window_size 8

In Python, the integration pattern is straightforward:

from evqa.inference import infer_progress_from_video

progress = infer_progress_from_video(
    model_path="/models/procvlm",
    video_path="rollouts/episode_003.mp4",
    task="stack the blue bowl into the target bowl",
    window_size=8,
)

# Simple dense reward: progress delta
rewards = []
prev = progress[0]["progress"]
for item in progress[1:]:
    cur = item["progress"]
    rewards.append(cur - prev)
    prev = cur

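For real training, clip the reward, and smooth the progress curve first if it is noisy. Raw VLM progress is useful, but it is still a learned signal.

def procvlm_reward(progress_t, progress_prev, success_bonus=1.0):
    # Clip the per-step progress change so one noisy prediction cannot dominate.
    delta = progress_t - progress_prev
    dense = max(min(delta, 0.10), -0.05)
    # Pay the bonus only on the step that crosses the threshold, not on every
    # later step that stays above it.
    crossed = progress_prev <= 0.95 < progress_t
    terminal = success_bonus if crossed else 0.0
    return dense + terminal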

Using ProcVLM Rewards to Fine-Tune a VLA Policy

The paper evaluates ProcVLM inside a reward fine-tuning setup based on SJTU Evo-RL. The base policy is pi0.5. SFT and RFT start from the same policy initialization and use the same training data. The difference is that RFT uses ProcVLM to assign dense progress scores to trajectories, then Evo-RL estimates advantages over a 50-step horizon.

The workflow:

1. Collect trajectories through teleoperation or policy rollout
2. Save video + action sequence + task instruction
3. Run ProcVLM to obtain progress p_t for sampled frames
4. Convert p_t into dense reward or advantage
5. Mark the top 30% advantage samples as positive
6. Mark the remaining samples as negative
7. Train the policy with advantage-conditioned signals
8. Evaluate in simulation or on the real robot

System-level pseudo-code:

for episode in dataset:
    progress = procvlm.infer(
        video_path=episode.video,
        task=episode.instruction,
        window_size=8,
    )
    episode.rewards = compute_delta_rewards(progress)
    episode.advantages = estimate_advantage(
        rewards=episode.rewards,
        horizon=50,
    )

positive = select_top_percent(dataset, key="advantages", percent=30)
negative = select_rest(dataset, positive)

train_vla_policy(
    base_policy="pi0.5",
    positive_samples=positive,
    negative_samples=negative,
    objective="advantage_conditioned_rft",
)
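The helper functions in the pseudo-code are left undefined, so here is one possible way to fill them in. It uses the undiscounted sum of the next `horizon` rewards as a stand-in for the advantage; the estimator Evo-RL actually uses is not described in this post, and all function names below are hypothetical.

```python
from typing import List


def compute_delta_rewards(progress: List[float]) -> List[float]:
    """Reward at step t is the progress change from t to t+1."""
    return [cur - prev for prev, cur in zip(progress, progress[1:])]


def horizon_return(rewards: List[float], horizon: int = 50) -> List[float]:
    """Sum of rewards over the next `horizon` steps (truncated at the end).

    Stand-in for an advantage estimate; a real estimator would typically
    subtract a baseline or use a learned value function.
    """
    prefix = [0.0]
    for reward in rewards:
        prefix.append(prefix[-1] + reward)
    n = len(rewards)
    return [prefix[min(t + horizon, n)] - prefix[t] for t in range(n)]


def split_top_percent(scores: List[float], percent: float = 30.0) -> List[bool]:
    """Return a flag per sample: True for the top `percent` by score."""
    if not scores:
        return []
    n_positive = max(1, int(len(scores) * percent / 100))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positive = set(order[:n_positive])
    return [i in positive for i in range(len(scores))]
```

Because the rewards are progress deltas, this horizon sum telescopes to `progress[t + horizon] - progress[t]`, which makes the labels easy to sanity-check by hand.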

If you are using LeRobot or a custom VLA stack, you do not need to implement the entire Evo-RL pipeline on day one. Start by using ProcVLM to rerank trajectories, remove segments with low or stagnant progress, and SFT the policy again on cleaner data. Once that works, move to advantage-conditioned fine-tuning. The HIL-SERL real-robot RL guide is also useful if you want to combine human intervention with a learned reward model.
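The reranking and filtering step needs very little code. The sketch below ranks episodes by their final predicted progress and flags window starts where progress barely moves; the thresholds (`min_final`, `window`, `min_gain`) are placeholder values to tune on your own data, and both function names are hypothetical.

```python
from typing import Dict, List


def rank_episodes(
    progress_by_episode: Dict[str, List[float]], min_final: float = 0.9
) -> List[str]:
    """Episode ids with final progress >= min_final, best first."""
    kept = [
        (values[-1], episode_id)
        for episode_id, values in progress_by_episode.items()
        if values and values[-1] >= min_final
    ]
    kept.sort(reverse=True)
    return [episode_id for _, episode_id in kept]


def stagnant_starts(
    values: List[float], window: int = 10, min_gain: float = 0.02
) -> List[int]:
    """Indices t where progress gained less than min_gain over the next
    `window` samples. The last `window` samples are not evaluated."""
    return [
        t
        for t in range(len(values) - window)
        if values[t + window] - values[t] < min_gain
    ]
```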

Results: Strong Procedural Reasoning, Useful Real-Robot Gains

On ProcVQA, ProcVLM-2B outperforms larger VLMs on several primary metrics. Key reported numbers:

Benchmark	Metric	ProcVLM-2B	Notable baseline
ProcVQA ID	BF1@5	0.6924	GPT-5.4: 0.5221
ProcVQA ID	Future planning success	0.8103	Qwen3.5: 0.7931
ProcVQA ID	VOC	0.8058	Qwen3.5: 0.5475
ProcVQA OOD	Future planning success	0.8448	Qwen3.5: 0.7758
ProcVQA OOD	VOC	0.7282	GPT-5.4: 0.6553

In the zero-shot reward model comparison on ProcVQA-OOD, ProcVLM-2B reaches VOC 0.7282, above Robometer-4B at 0.5296 and RoboDopamine-4B at 0.7156. Robometer reports higher EPR@50, but its lower VOC indicates weaker trajectory-internal ordering under the shuffled local-window setup.
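VOC is not defined in this post. Assuming it follows the value-order correlation used in earlier progress-estimation work, that is, the Spearman rank correlation between predicted values and the true chronological frame order, it can be computed as in the sketch below. The benchmark's exact protocol (including the shuffled local windows) is not reproduced here.

```python
import math
from typing import List


def average_ranks(xs: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a: List[float], b: List[float]) -> float:
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_a * var_b)


def value_order_correlation(values_in_time_order: List[float]) -> float:
    """Spearman correlation between predicted values and frame order."""
    time_ranks = [float(t) for t in range(1, len(values_in_time_order) + 1)]
    return pearson(average_ranks(values_in_time_order), time_ranks)
```

A curve that rises monotonically with time scores 1.0 under this definition, and a curve that falls monotonically scores -1.0.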

On RoboFAC real-robot one-shot adaptation, ProcVLM improves rapidly:

Setting	VOCsucc	MAEfail	MCC	Latency
ProcVLM zero-shot	0.4920	0.2001	0.6665	50s
ProcVLM 1-shot success	0.9137	0.1241	0.7918	81s
ProcVLM 1-shot success + fail	0.9301	0.1187	0.8053	80s

For policy learning, the most relevant experiment is reward fine-tuning. On LIBERO-10 simulation, ProcVLM-guided RFT gives modest early gains over SFT: 73.2 (SFT) vs. 73.6 (RFT) at 1000 steps and 72.8 vs. 74.0 at 2000 steps. On the real JAKA stack-bowls task, the gain is larger: at 5k steps, SFT reaches 37.5 while RFT reaches 62.5; at 10k steps, SFT reaches 70.8 while RFT reaches 83.3. This matches the intuition that real teleoperation data contains noisy local behavior such as grasp retries, and a progress-aware reward can downweight low-value segments.

Beginner Deployment Checklist

If you want to try ProcVLM without building a full robot learning stack, use this sequence:

Choose a clear manipulation task: pick-place, stack bowl, close drawer.
Collect 10-30 successful videos and 10-30 failed videos if possible.
Use consistent instructions, for example "place the blue bowl into the target bowl".
Run ProcVLM inference with window_size=8.
Plot progress against frame index.
Manually inspect at least five videos with their reward curves.
If curves are noisy, add smoothing and delta clipping.
Use the reward to filter or rerank SFT data first.
Move to advantage-conditioned RFT or RL only after the reward curve looks sane.

A practical rule: do not trust the reward model blindly. Always visualize video plus reward curve. If the score rises during useless retry behavior, the instruction may be ambiguous or the camera may miss the target object. If the score falls after success, the model may need more context frames or a clearer final-state instruction.
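Part of this inspection can be automated. The sketch below flags samples where the score falls well below the success threshold after having reached it; the threshold and tolerance are assumed values, and the check complements watching the video rather than replacing it.

```python
from typing import List


def drops_after_success(
    values: List[float],
    success_threshold: float = 0.95,
    tolerance: float = 0.05,
) -> List[int]:
    """Indices after the first 'success' sample where progress falls below
    success_threshold - tolerance. Empty if success is never reached."""
    first_success = next(
        (i for i, value in enumerate(values) if value >= success_threshold),
        None,
    )
    if first_success is None:
        return []
    floor = success_threshold - tolerance
    return [
        i
        for i in range(first_success + 1, len(values))
        if values[i] < floor
    ]
```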

Limitations

ProcVLM is not a replacement for real evaluation. It learns progress from subtask decomposition and temporal boundary localization, so annotation errors can become reward errors. It also does not directly verify contact forces, hardware safety, or kinematic constraints. For force-control tasks, tight insertions, or tactile manipulation, combine ProcVLM with a task-specific success detector, safety constraints, and human review.

Inference cost is another practical constraint. ProcVLM-2B is much lighter than 8B or 27B reward models, but video-window inference is still not free. If you need rewards at control frequency, sample frames sparsely or label trajectories offline. The strongest current use cases are offline reward labeling, trajectory filtering, one-shot reward adaptation, and reward-guided fine-tuning.
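If you label sparsely but still need one value per frame, linear interpolation between the sampled frames is a simple option. This sketch assumes strictly increasing frame indices and holds the first and last values outside the sampled range; `interpolate_progress` is a hypothetical helper.

```python
from typing import List


def interpolate_progress(
    frame_indices: List[int], values: List[float], num_frames: int
) -> List[float]:
    """Expand sparse (frame_index, progress) samples to every frame."""
    if not frame_indices or len(frame_indices) != len(values):
        raise ValueError("need matching, non-empty indices and values")
    dense: List[float] = []
    j = 0  # index of the last sample at or before the current frame
    for frame in range(num_frames):
        while j + 1 < len(frame_indices) and frame_indices[j + 1] <= frame:
            j += 1
        if frame <= frame_indices[0]:
            dense.append(values[0])  # hold the first value
        elif j + 1 == len(frame_indices):
            dense.append(values[-1])  # hold the last value
        else:
            f0, f1 = frame_indices[j], frame_indices[j + 1]
            weight = (frame - f0) / (f1 - f0)
            dense.append(values[j] + weight * (values[j + 1] - values[j]))
    return dense
```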

Conclusion

ProcVLM is an important step toward robot reward models that understand task procedure. Instead of writing hand-shaped rewards or relying only on terminal success, it reads robot videos and instructions to produce dense progress rewards. For VLA manipulation, especially when fine-tuning from noisy demonstrations, this gives the policy a much richer signal about which parts of a trajectory actually move the task forward.

Treat ProcVLM as a progress critic, not as the robot controller itself. Combined with SFT, LeRobot/HIL-SERL, Evo-RL, or your own VLA training loop, it can turn robot videos into a more useful training signal than end-of-episode success or failure.

Nguyễn Anh Tuấn
