wholebody-vlasarmlerobotvlareward-modelra-bc

SARM in LeRobot: Reward Models for VLA

A practical guide to using SARM in LeRobot to learn task progress from video, score rollouts, and improve VLA policies with RA-BC.

Nguyễn Anh TuấnJune 11, 202615 min read
SARM in LeRobot: Reward Models for VLA

SARM targets a real VLA training problem

The standard LeRobot VLA workflow is straightforward: collect demonstrations, save them as a LeRobotDataset, then fine-tune a policy such as ACT, SmolVLA, Pi0, or Pi0.5. This works surprisingly well, but it carries a hidden assumption: every frame in the dataset is equally useful. Real robot data violates that assumption all the time. A single episode may include hesitation, a bad grasp, a teleoperator correction, back-and-forth motion, or a rollout that almost succeeds and then ruins the object state. If behavior cloning gives every frame the same weight, the policy learns both progress-making behavior and noisy behavior.

SARM, short for Stage-Aware Reward Modeling, is a video-based reward model for long-horizon robot manipulation. The original paper is SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation, with a project page at qianzhong-chen.github.io/sarm.github.io, open-source code at xdofai/opensarm, and official integration notes in the LeRobot SARM documentation.

The short version: instead of asking “what frame index is this?”, SARM asks “which task stage is the robot in, and how far has it progressed inside that stage?”. Given a short video window, task text, and optional robot state, SARM predicts task progress from 0 to 1. That progress signal can score rollouts, filter poor data, or train a VLA policy through Reward-Aligned Behavior Cloning (RA-BC).

If you are new to this stack, start with VLA & LeRobot framework and training SmolVLA on a consumer GPU. This tutorial focuses on the reward-model layer: how to turn robot videos into a progress signal strong enough to improve a VLA from rollout data.

Why frame-index rewards are brittle

The easiest progress label is frame_index / episode_length. The first frame is 0, the final frame is 1. For a very simple task, such as pushing a cube rightward with clean demonstrations, that can be a usable baseline. For long-horizon manipulation, especially deformable-object tasks such as folding cloth, it breaks in three ways.

First, the same physical state can occur at very different timestamps. Operator A may fold a shirt in 45 seconds. Operator B may need 90 seconds. If progress is normalized time, the same “shirt is flattened” state might be labeled 0.35 in one episode and 0.70 in another. A reward model trained on those labels learns inconsistent supervision.

Second, long tasks have stages. A folding task may require grasping the shirt, moving it to the center, flattening it, folding it, and placing it in the corner. If the model only compares the current frame to the final goal, the early stages are ambiguous. A shirt that has just been flattened is not a failure. It is simply in an earlier stage.

Third, policy rollouts often contain out-of-distribution failure modes: misgrasps, recovery attempts, repeated motions, or a near-complete state that gets destroyed at the end. A reward model must distinguish real progress from motion that merely looks active. The SARM project page shows rollout examples where the model keeps progress low during early misgrasps and reflects progress loss when a failed rollout crumples the shirt again.

SARM replaces raw time labels with stage + within-stage progress:

episode video + task text
        |
        v
subtask boundaries: stage 0, stage 1, stage 2, ...
        |
        v
target at each frame = k + tau

k   = stage index
tau = progress inside stage k, in [0, 1]

At inference time, LeRobot converts k + tau into normalized progress:

progress_t = P_(k-1) + alpha_k * tau_t

P_(k-1) = cumulative temporal proportion of previous stages
alpha_k = average temporal proportion of stage k in the dataset
tau_t   = within-stage progress at time t

The key detail is that alpha_k is computed at the dataset level. This helps the model assign a consistent progress range to states such as “flattening is complete,” even when episodes have different speeds and lengths.

SARM architecture: estimate stage, then progress

SARM does not predict one scalar reward from a single frame. It uses a short observation window with RGB frames, task text, and robot state. In LeRobot, the SARM processor encodes images and text using CLIP ViT-B/32, pads or truncates robot state to a fixed size, and builds targets as sparse_targets and, when needed, dense_targets.

SARM method overview
SARM method overview

The data flow is easier to remember as a two-head estimator:

RGB frames          task text             robot state
    |                  |                       |
    v                  v                       v
CLIP vision       CLIP text              state projection
    |                  |                       |
    +------------------+-----------------------+
                       |
                       v
              temporal estimator
                       |
         +-------------+-------------+
         |                           |
         v                           v
 stage estimator              subtask estimator
 predicts stage k             predicts tau in stage
         |                           |
         +-------------+-------------+
                       |
                       v
              normalized progress

The paper describes two main parts: a stage estimator that predicts the current task stage and a subtask estimator that predicts progress within that stage. The stage prediction is passed into the subtask estimator, so the model knows which part of the task it is scoring. This is the major difference from reward models that only compute similarity between the current image and a goal text.

LeRobot exposes three annotation modes:

Mode Annotation required? Heads Best use
single_stage No Sparse only Short tasks, fast experiments, no VLM
dense_only Dense VLM annotations Auto sparse + dense Detailed tracking without manually defining coarse stages
dual Sparse + dense VLM annotations Dual head Long tasks with meaningful coarse and fine stages

For beginners, single_stage is the fastest way to prove the pipeline works. For bimanual folding, cable routing, assembly, or humanoid manipulation with many phases, dual is closer to the full paper setup.

Prepare a LeRobotDataset

SARM does not replace LeRobotDataset; it sits on top of it. At minimum, the dataset needs:

  • task: a text description such as "fold the towel" or "pick up the red cube"
  • an image key such as observation.images.base
  • a state key, usually observation.state
  • episode videos or synchronized image frames with state and action

A typical LeRobot-style dataset looks like this:

my-dataset/
  data/
    chunk-000/
      episode_000000.parquet
      episode_000001.parquet
  meta/
    episodes.jsonl
    info.json
    stats.json
    tasks.jsonl
  videos/
    chunk-000/
      observation.images.base/
        episode_000000.mp4
        episode_000001.mp4

In the original opensarm repository, the reward column stores scalar progress at each timestep rather than a traditional reinforcement-learning reward. In native LeRobot SARM, progress is commonly precomputed into sarm_progress.parquet, which RA-BC reads during policy training.

Before training the reward model, check the dataset with practical questions:

Check Good signal Failure mode
Camera placement Object and end-effector are visible SARM may learn background artifacts
Task text consistency Same task uses the same wording Text embeddings become noisy
Complete episodes Beginning and ending are not cut off Progress labels become distorted
Moderate diversity Speed varies, but episodes are understandable Model cannot separate progress from noise
State shape observation.state is stable Processor must pad or truncate too much

If you are running real-robot VLA training, you can collect teleop data in LeRobot, then use SARM to score both demonstrations and policy rollouts. For human-in-the-loop real-robot RL, see HIL-SERL in LeRobot. SARM does not replace safety procedures, but it can provide a dense progress signal instead of a sparse success/failure reward.

Install LeRobot with SARM

The official LeRobot docs recommend installing the sarm extra:

git clone https://github.com/huggingface/lerobot
cd lerobot
pip install -e ".[sarm]"

Check that PyTorch can see your GPU:

python - <<'PY'
import torch
print("cuda:", torch.cuda.is_available())
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
PY

The original opensarm repository uses uv:

git clone https://github.com/xdofai/opensarm
cd opensarm
pip install uv
uv sync
source .venv/bin/activate

This tutorial uses native LeRobot commands because they integrate directly with lerobot-train and RA-BC sample weighting. The opensarm repo is still useful for reading the paper-style configuration: n_obs_steps: 8, frame_gap: 30, CLIP checkpoint openai/clip-vit-base-patch32, batch size 32, learning rate 5e-5, and the stage annotations used for T-shirt folding.

Step 1: create or verify subtask annotations

For a quick baseline:

# No annotation is required.
# The whole episode becomes one stage named "task".
# Use this for simple tasks or a first pipeline test.

For long tasks, use a VLM to generate dense annotations. Example for towel folding:

python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --dense-only \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub

For dual mode, provide both coarse and fine stages:

python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --sparse-subtasks "Bring arms up from starting position,Fold the towel" \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub

After this step, the dataset contains temporal proportions and subtask boundary columns in its episode metadata or parquet files. Do not skip visualization:

python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --visualize-only \
  --visualize-type both \
  --num-visualizations 5 \
  --video-key observation.images.base \
  --output-dir ./subtask_viz

If the boundaries are off, make the subtask names more specific. “Fold” is too vague. “Grab near side and fold toward center” is better. Annotation quality strongly controls reward-model quality.

Step 2: train the SARM reward model

Fast baseline with single_stage:

lerobot-train \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=single_stage \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_single \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-sarm-model

Training with dense annotations:

lerobot-train \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dense_only \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dense \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-sarm-model

Training with sparse and dense annotations:

lerobot-train \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dual \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dual \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-sarm-model

Key arguments:

Argument Meaning Practical note
policy.n_obs_steps Observation history; total frames are n_obs_steps + 1 Default 8
policy.frame_gap Gap between sampled frames 30 frames is about 1 second at 30 fps
policy.image_key Camera used for reward training Choose the camera that sees the task clearly
policy.state_key Robot state vector Usually observation.state
batch_size Training batch size 32 if VRAM allows

For multiple GPUs:

accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dual \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dual \
  --batch_size=32 \
  --steps=5000

Step 3: inference, visualization, and rollout scoring

Once you have a model, do not immediately train a policy. Visualize predictions first:

python -m lerobot.rewards.sarm.compute_rabc_weights \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/your-sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode sparse \
  --output-dir ./sarm_viz

The visualization should show three things: predicted progress, stage probabilities over time, and key frames from the episode. A good reward model does not need to increase perfectly linearly. It should increase when the robot actually makes progress, stay flat during hesitation, and stall or drop when a rollout damages the task state.

When the model looks credible, compute progress for the full dataset:

python -m lerobot.rewards.sarm.compute_rabc_weights \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/your-sarm-model \
  --head-mode sparse \
  --num-visualizations 5 \
  --push-to-hub

The default output is sarm_progress.parquet:

Column Meaning
index Global frame index
episode_index Episode number
frame_index Local frame in the episode
progress_sparse Progress from the sparse head
progress_dense Progress from the dense head, if computed

For new rollouts, save the rollout as a LeRobotDataset and run the same progress computation. This is how SARM helps a VLA improve from rollouts: the policy generates more trajectories, SARM estimates which segments made progress, and RA-BC gives higher weight to useful segments while reducing weight for stagnant or regressive segments. It is not fully autonomous online RL by itself. You still need safety controls, resets, and reward-model validation.

Step 4: train a VLA policy with RA-BC

RA-BC uses progress delta rather than absolute progress. For each training sample, LeRobot compares progress at the current observation and progress after one action chunk:

r_i = phi(o_(t + Delta)) - phi(o_t)

phi = SARM progress prediction
Delta = policy chunk_size

If r_i is positive, that segment moves the task forward. If it is negative, the robot made the state worse. If it is near zero, the robot may be stalled or moving without useful progress.

Train Pi0 with RA-BC:

lerobot-train \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --sample_weighting.type=rabc \
  --sample_weighting.head_mode=sparse \
  --sample_weighting.kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000

According to the current LeRobot docs, RA-BC supports Pi0, Pi0.5, and SmolVLA. For an edge-friendly VLA experiment, use SmolVLA with the same sample weighting:

lerobot-train \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=smolvla \
  --sample_weighting.type=rabc \
  --sample_weighting.head_mode=sparse \
  --sample_weighting.kappa=0.01 \
  --output_dir=outputs/train/smolvla_rabc \
  --batch_size=32 \
  --steps=40000

The weighting rules are simple:

Condition Weight
delta > kappa 1.0
0 <= delta <= kappa Soft weight from running mean/std
delta < 0 0.0

Monitor these metrics:

Metric Healthy range Problem signal
sample_weight_mean_weight 0.3-0.8 Near 1.0 means kappa is too low
sample_weighting/delta_mean > 0 Negative values suggest dataset or reward issues
sample_weighting/delta_std > 0 Too low means reward cannot separate good and bad data

The default kappa=0.01 was tuned for a roughly 90-second T-shirt folding task at 30 fps. For your own dataset, inspect delta_mean and delta_std. If almost every sample gets full weight, RA-BC becomes close to vanilla BC. If almost no sample gets weight, training lacks useful supervision.

Results from the paper

The SARM paper evaluates T-shirt folding, a hard long-horizon task because cloth is deformable. Vanilla BC achieves 8% success from the flattened state and 0% from the crumpled state. With SARM + RA-BC, the policy reaches 83% success on the medium task and 67% on the hard task. In the appendix rollout table, at 40K training steps, RA-BC with SARM reaches 10/12 on medium and 8/12 on hard, outperforming RA-BC with a ReWiND reward model.

The important lesson is not just the headline success rate. The paper also shows that reward-model quality is central. If the reward model misjudges progress, RA-BC assigns bad sample weights, the policy learns the wrong behavior, and the filtering benefit disappears. A reliable workflow is:

1. Train SARM
2. Visualize predictions on demonstrations
3. Score a few policy rollouts
4. Compare scores against human judgment
5. Train RA-BC only when the reward looks reasonable

For your own robot, build a small benchmark:

Run Policy Dataset Reward weighting Success
A SmolVLA clean demos none baseline
B SmolVLA demos + rollouts none tests whether rollout data hurts
C SmolVLA demos + rollouts SARM RA-BC expected improvement
D Pi0/Pi0.5 demos + rollouts SARM RA-BC larger model comparison

If C does not beat B, do not immediately conclude that SARM is useless. Check annotation boundaries, camera view, head_mode, kappa, and rollout distribution. Reward modeling only helps when it can distinguish real progress from noisy motion.

Beginner implementation checklist

Start small:

Day 1:
  - Collect 30-50 demonstrations for one clear task
  - Train SARM single_stage for 5k steps
  - Visualize 5 episodes

Day 2:
  - If the task has multiple stages, add dense_only annotation
  - Retrain SARM
  - Generate sarm_progress.parquet

Day 3:
  - Train SmolVLA/Pi0 with vanilla BC
  - Train the same policy with RA-BC
  - Run 20 real or simulated rollouts

Common failure modes:

Failure Symptom Fix
Progress rises like a clock Model learned time, not task state Add speed diversity and stage annotations
Progress jumps randomly Camera or annotation is poor Change camera or visualize boundaries
RA-BC matches BC Mean weight is near 1.0 Increase kappa
RA-BC becomes unstable Too few samples get weight Lower kappa or add good demos
Failed rollouts score high Reward model has not seen failures Add failure rollouts to validation and retrain

On a real robot, treat SARM as a progress meter, not a safety system. Workspace limits, emergency stop, human intervention, and staged testing remain mandatory.

Conclusion

SARM fills a missing layer in VLA training: estimating task progress from video instead of treating every frame equally. In LeRobot, the workflow is now practical: install the SARM extra, create subtask annotations when needed, train the reward model, visualize predictions, precompute progress, and train Pi0, Pi0.5, or SmolVLA with RA-BC.

The strongest idea is stage awareness. Long-horizon tasks are not one smooth line from start to finish; they are sequences of meaningful stages. SARM makes that structure explicit, which is why it fits folding, assembly, bimanual manipulation, and whole-body VLA tasks with multiple phases. Its weakness is that reward quality depends heavily on camera placement, annotations, and rollout distribution. Used carefully, it is one of the most practical open-source ways to convert messy rollouts into useful learning signal for VLA policies.

NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Related Posts

LeRobot v0.5: Pi0-FAST + G1 Whole-Body Control
wholebody-vla

LeRobot v0.5: Pi0-FAST + G1 Whole-Body Control

4/21/202613 min read
NT
PEFT/LoRA Fine-tune & Deploy VLA
wholebody-vla

PEFT/LoRA Fine-tune & Deploy VLA

4/11/202612 min read
NT
Pi0-FAST: VLA Autoregressive 5x nhanh hơn
wholebody-vla

Pi0-FAST: VLA Autoregressive 5x nhanh hơn

4/10/202613 min read
NT