WEAVER: World Model for π0.5 VLA

What problem does WEAVER solve?

WEAVER, short for World Estimation Across Views for Embodied Reasoning, is the arXiv 2606.13672 paper WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation. The project page is arnavkj1995.github.io/WEAVER and the official code is arnavkj1995/WEAVER. The key point is that WEAVER is not another replacement for π0.5. It is a learned simulator around a VLA policy: it predicts future robot states in latent space, scores action chunks with reward and value heads, then uses those scores for policy evaluation, policy improvement, or test-time steering.

If you have read our posts on VLA models in robotics, Pi0-FAST training, or RISE world models for VLA-RL, the motivation will feel familiar. Modern VLA policies can perform real manipulation tasks, but they remain brittle when objects shift out of distribution, the gripper occludes the scene, or the task requires many consistent steps. π0.5 is a strong robot foundation policy, but the policy itself does not know in advance which sampled action chunk will lead to failure. WEAVER adds an imagination layer before execution: take the current observation, task language, and multiple candidate action chunks, imagine the outcomes, then choose or distill the better behavior.

The paper defines three requirements for a useful robotics world model. First, fidelity: simulated rollouts should correlate with real outcomes, not merely look plausible. Second, consistency: predictions must stay coherent over long horizons, especially when the wrist camera moves, objects are occluded, towels and bags deform, or beans spill during pouring. Third, efficiency: inference must be fast enough for online planning. WEAVER reports ρ = 0.870 correlation with real-world success rate for policy evaluation after finetuning, a 38% real-world success-rate improvement when π0.5 is finetuned with real plus synthetic data, and a 14% test-time planning gain with a 5-10x speedup over prior world models.

WEAVER overview for policy evaluation, improvement, and planning — source: WEAVER project page

The paper idea in plain language

Imagine you have a π0.5 policy controlling a Franka Panda. When the policy receives camera images and an instruction such as "put the towel into the basket", it outputs an action chunk. The usual baseline is to execute that chunk immediately on the real robot. If the chunk is poor, the robot damages the current rollout and you only discover the failure after paying the physical cost.

WEAVER inserts the following loop:

multi-view observation + proprioception + instruction
        |
        v
π0.5 samples several candidate action chunks
        |
        v
WEAVER imagines future latents for each chunk
        |
        v
reward head + critic estimate advantage
        |
        v
choose best chunk, label policy data, or evaluate rollout

The world model does not need to decode every latent into an image at every step. That design choice matters. If every candidate required video generation followed by a slow VLM judge, latency would make planning impractical. WEAVER keeps scoring in latent space: the reward head estimates task progress, while the critic estimates return beyond the imagined horizon. The decoder is used when you need visualization, debugging, or side-by-side rollout videos.

For a beginner, the simplest mental model is "a simulator learned from robot data." It is more flexible than a classical simulator in three ways. It does not need exact meshes or physical parameters for every object. It learns directly from DROID and real rollouts. It can score actions relative to a language task, not only predict pixels. The tradeoff is also important: it is not a perfect physics engine. If contacts, forces, or occluded object geometry are not visible from the cameras, the model can still guess incorrectly.

WEAVER architecture

WEAVER is a 928M-parameter multi-view latent world model. Its inputs are camera views, proprioceptive state, a language instruction, and a candidate action plan. In the paper setup, the robot is a Franka Emika Panda with two external Zed 2i cameras and one wrist-mounted Zed Mini camera, although WEAVER and π0.5 primarily use the right external view and the wrist view. Data is downsampled to 5 Hz for imagination.

The main components are:

Component	Role	Implementation note
SD3 VAE encoder/decoder	Compress camera frames into latents and decode them when visualization is needed	Each view becomes patch tokens
Proprioception token	Adds joint and gripper state to the same token dimension	Important for contact-rich manipulation
Sparse memory	Stores every few frames from the distant past	Keeps scene context under occlusion
Short-term history	Stores the last two frames	Captures recent motion and action consequences
32-layer dynamics Transformer	Predicts future latents conditioned on action tokens	Uses spatial attention and causal temporal attention
Reward head	Scores progress directly in latent space	Distilled from RoboMeter progress rewards
Critic head	Estimates return beyond the imagined horizon	Enables advantage-based action selection

For the training objective, WEAVER uses flow matching to learn a vector field from noise to the true future latent. Instead of generating frames with a heavy pixel-space diffusion process, the model predicts velocity in latent space. The paper also uses Diffusion Forcing, where future timesteps receive independently sampled noise levels, improving long-horizon coherence. For speed, WEAVER uses SPRINT blocks to drop patch tokens, KV caching for memory and history tokens during denoising, cosine or power noise schedules, and ReFlow post-training to distill a multi-step teacher into a faster student.

FID/FVD versus H100 inference time for WEAVER and Ctrl-World — source: WEAVER project page

Installing the repository

WEAVER is a heavy research codebase. You can inspect the code and run inference or evaluation from released checkpoints, but full training is not a laptop workload. The repository states that the training launchers use four H100 GPUs with distributed data parallelism. If your goal is to learn the pipeline, start with checkpoints and a small dataset slice.

Basic environment setup:

git clone --recurse-submodules https://github.com/arnavkj1995/WEAVER.git
cd WEAVER
uv venv --python 3.11
source .venv/bin/activate
uv sync

For logging or development tools:

uv sync --extra logging --extra dev

Download the model weights:

hf download arnavkj1995/WEAVER --local-dir checkpoints

This downloads the WEAVER, WEAVER-FT, and WEAVER-ReFlow folders. Each folder contains checkpoint.pt, config.yaml, and norm_stats_relabel.json. To try the OOD evaluation data used by the paper:

git lfs install
git clone https://huggingface.co/datasets/yilin-wu/droid_ood_data

If you only want annotations and metadata to understand the layout:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yilin-wu/droid_ood_data",
    repo_type="dataset",
    allow_patterns=["annotations/**", "annotation_rewards/**", "norm_stats*.json"],
)

Preparing data

WEAVER expects preprocessed DROID-style trajectories containing actions, robot states, language features, rewards, normalization statistics, and either view videos or SD3 latents. For raw DROID:

python datasets/preprocess_droid.py \
  --data_root /path/to/raw_droid \
  --output_root /path/to/preprocessed_droid

The important output layout is:

weaver_droid/
├── annotations/<split>/<trajectory_id>.json
├── videos/<split>/<trajectory_id>.mp4
├── latents/<split>/<trajectory_id>.npz
└── done/<split>/<trajectory_id>

After preprocessing, compute normalization statistics:

python datasets/compute_norm_stats.py \
  --data_root /path/to/weaver_droid

For rollouts you collect yourself, the repo provides third_party/openpi/examples/droid/panda_log.py to run an OpenPI policy server, log video plus annotation JSON, and convert the data into WEAVER format:

python -m datasets.preprocess_droid_ood \
  --input_roots /path/to/your/task_folder \
  --output_root /path/to/weaver_format_data \
  --data_type train \
  --tasks none

One beginner detail is easy to miss: reward labels do not appear automatically. WEAVER uses RoboMeter to produce reward_progress and reward_success, then the dataloader reads annotation_rewards/<split>/. If you train on custom data without reward labels, the reward head and critic will not receive meaningful supervision.

Training WEAVER

There are three main modes. Pretraining learns the world model from a large dataset:

DATASET_PATH=/path/to/preprocessed_droid \
SCRATCH_DIR=/path/to/output/model_dir \
sbatch scripts/pretrain.sh

Finetuning starts from a checkpoint and adapts it to your task or robot setup:

PRETRAINED_DIR=/path/to/pretrained/logs/chkpts \
DATASET_PATH=/path/to/finetune_data \
EXP_NAME=weaver_finetune \
FINETUNE_SUFFIX=finetune \
sbatch scripts/finetune.sh

ReFlow post-training reduces the denoising budget so planning becomes faster:

PRETRAINED_DIR=/path/to/teacher/logs/chkpts \
PRETRAINED_CKPT_NAME=checkpoint.pt \
DATASET_PATH=/path/to/preprocessed_droid \
EXP_NAME=weaver_reflow \
FINETUNE_SUFFIX=reflow \
sbatch scripts/reflow.sh

In the paper, the model is pretrained on DROID for 1M steps with batch size 32, batch length 8, 6 memory frames, memory stride 5, AdamW, and EMA 0.9999. The Transformer has 32 layers, 16 heads, embedding dimension 1536, head dimension 96, and SPRINT probability 0.5. Finetuning on task data runs for 16k steps with a lower learning rate. That is why WEAVER should be viewed as lab or cluster infrastructure, not a lightweight notebook demo.

Reward prediction and advantage filtering in WEAVER — source: WEAVER project page

Inference and evaluation

To generate rollout views:

python -m weaver.generate_views \
  --checkpoint /path/to/logs/chkpts \
  --output-dir /path/to/eval_output \
  --split val \
  --use-real-history \
  --overrides \
    dataset.path=/path/to/eval_dataset \
    model.val_steps=27 \
    eval_horizon=5 \
    eval_bootstrap=5 \
    inference.pyramid_stagger_width=1 \
    inference.pyramid_schedule=cosine

The key knobs are:

Parameter	Meaning
`model.val_steps`	Number of flow denoising steps
`eval_horizon`	Number of future frames generated per chunk
`eval_bootstrap`	Number of generated frames fed back into the next chunk
`inference.pyramid_schedule`	`linear`, `cosine`, `power`, or `sigmoid`
`inference.pyramid_stagger_width`	Schedule offset across future frames

The effective NFE is:

NFE = val_steps + eval_horizon * pyramid_stagger_width

To compute FID, FVD, and LPIPS:

EVAL_DIR=/path/to/eval_output \
sbatch scripts/compute_eval_metrics.sh

In Table 1 of the paper, on DROID validation, WEAVER at NFE=16 reaches exterior FID 10.20 and FVD 27.83 in 4.78 seconds, while Ctrl-World at NFE=16 reports FID 26.09 and FVD 78.73 in 14.65 seconds. On OOD task data, WEAVER at NFE=16 is also better on exterior FID/FVD and faster. The practical warning is that wrist-camera prediction remains harder than external-view prediction because the viewpoint changes continuously and occlusion is severe.

Improving π0.5 with synthetic data

Policy improvement is the part that makes WEAVER especially relevant for VLA builders. The README pipeline is:

Run an OpenPI server with π0.5 on a GPU machine.
Let π0.5 sample multiple candidate action chunks.
Roll out each candidate inside WEAVER latent space.
Use the reward head and critic to compute advantage.
Keep the best segment if its advantage passes a threshold.
Convert synthetic data to DROID layout, then to LeRobot.
Finetune π0.5 with real plus synthetic data.

Command skeleton:

uv run scripts/serve_policy.py --env DROID --num-samples 5

synth_options=(
  CHECKPOINT=/path/to/chkpts
  DATASET_PATH=/path/to/dataset
  OUTPUT_DIR=/path/to/output
  NUM_TRAJECTORIES=1000
  NUM_SAMPLES=5
  NUM_CHUNKS=4
  OPEN_LOOP_HORIZON=9
  SELECTION_CRITERION=advantage
  PI_HOST=<server-ip>
  PI_PORT=8000
)

env "${synth_options[@]}" bash scripts/synth_data_gen.sh

Then convert:

python third_party/openpi/examples/droid/convert_synthetic_data_to_droid.py \
  --input-dirs /path/to/synthetic_data_folder /path/to/real_data_folder \
  --output-dir ../data/data_mixed_droid_all \
  --tasks cup_task marker_task

python third_party/openpi/examples/droid/convert_synthetic_droid_data_to_lerobot.py \
  --data_dir ../data/data_mixed_droid_all \
  --repo_name your-hf-username/your-dataset-name

Finally finetune from third_party/openpi:

uv run scripts/train.py pi05_droid_finetune_real_syn_adv \
  --exp-name=droid-20k-real-syn-finetune \
  --overwrite

The paper reports that finetuning π0.5 with real plus WEAVER-generated synthetic data gives the strongest success rate on tasks such as Stack Bowls, PnP Bag, PnP Marker, PnP Towel, and Pour Beans. On average, WEAVER reports a 38% real-world success-rate improvement on top of π0.5. The qualitative appendix also notes that the finetuned policy produces smoother motion, fewer overly large steps, and more precise grasping and placement.

Policy finetuning success rates with WEAVER synthetic data — source: WEAVER project page

Test-time steering

If you do not want to retrain π0.5, WEAVER also supports test-time planning. For every current observation, the policy samples N action chunks, WEAVER imagines each chunk, the reward and critic heads estimate advantage, and the robot executes the best chunk. The README uses:

planning_options=(
  CHECKPOINT=/path/to/chkpts
  OUTPUT_DIR=/path/to/output
  TASK="stack the red block"
  NUM_SAMPLES=8
  OPEN_LOOP_HORIZON=9
  SELECTION_CRITERION=advantage
  MAX_STEPS=700
  PI_HOST=<server-ip>
  PI_PORT=8000
)

env "${planning_options[@]}" bash scripts/steer_pi_policy.sh

This is single-chunk best-of-N search, not full long-horizon tree search. Because latency is still the bottleneck, WEAVER only looks a short distance ahead, but it looks ahead quickly enough to improve action selection. The paper reports a 14% average success-rate improvement over the base policy, with larger gains when the base policy is weaker. The limitation is clear: if a task requires multi-stage recovery or delayed consequences, one chunk of imagination may not be enough.

When should you use WEAVER?

Use WEAVER when you already have a base VLA policy, DROID-style rollouts or similar data, and a real need to reduce physical trials for evaluation, finetuning, or planning. It is especially relevant for teams using OpenPI/π0.5, Franka or DROID-like setups, and a serious GPU server. If you are still learning the basics, start with the LeRobot ecosystem guide and train a smaller policy first; WEAVER becomes much easier once you understand data layout, action chunks, and policy servers.

Do not expect WEAVER to replace a physics simulator in every case. It still suffers from partial observability, lacks direct force or tactile sensing, struggles with granular and deformable dynamics, and depends heavily on data coverage. But the main lesson is clear: VLA manipulation does not only need a stronger policy; it also needs a fast enough world model so the policy can think before it acts.

References

Paper: WEAVER, Better, Faster, Longer
Project page: arnavkj1995.github.io/WEAVER
Code: github.com/arnavkj1995/WEAVER
Model weights: huggingface.co/arnavkj1995/WEAVER

What problem does WEAVER solve?

WEAVER overview for policy evaluation, improvement, and planning — source: WEAVER project page

The paper idea in plain language

WEAVER inserts the following loop:

multi-view observation + proprioception + instruction
        |
        v
π0.5 samples several candidate action chunks
        |
        v
WEAVER imagines future latents for each chunk
        |
        v
reward head + critic estimate advantage
        |
        v
choose best chunk, label policy data, or evaluate rollout

WEAVER architecture

The main components are:

Component	Role	Implementation note
SD3 VAE encoder/decoder	Compress camera frames into latents and decode them when visualization is needed	Each view becomes patch tokens
Proprioception token	Adds joint and gripper state to the same token dimension	Important for contact-rich manipulation
Sparse memory	Stores every few frames from the distant past	Keeps scene context under occlusion
Short-term history	Stores the last two frames	Captures recent motion and action consequences
32-layer dynamics Transformer	Predicts future latents conditioned on action tokens	Uses spatial attention and causal temporal attention
Reward head	Scores progress directly in latent space	Distilled from RoboMeter progress rewards
Critic head	Estimates return beyond the imagined horizon	Enables advantage-based action selection

FID/FVD versus H100 inference time for WEAVER and Ctrl-World — source: WEAVER project page

Installing the repository

Basic environment setup:

git clone --recurse-submodules https://github.com/arnavkj1995/WEAVER.git
cd WEAVER
uv venv --python 3.11
source .venv/bin/activate
uv sync

For logging or development tools:

uv sync --extra logging --extra dev

Download the model weights:

hf download arnavkj1995/WEAVER --local-dir checkpoints

git lfs install
git clone https://huggingface.co/datasets/yilin-wu/droid_ood_data

If you only want annotations and metadata to understand the layout:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yilin-wu/droid_ood_data",
    repo_type="dataset",
    allow_patterns=["annotations/**", "annotation_rewards/**", "norm_stats*.json"],
)

Preparing data

WEAVER expects preprocessed DROID-style trajectories containing actions, robot states, language features, rewards, normalization statistics, and either view videos or SD3 latents. For raw DROID:

python datasets/preprocess_droid.py \
  --data_root /path/to/raw_droid \
  --output_root /path/to/preprocessed_droid

The important output layout is:

weaver_droid/
├── annotations/<split>/<trajectory_id>.json
├── videos/<split>/<trajectory_id>.mp4
├── latents/<split>/<trajectory_id>.npz
└── done/<split>/<trajectory_id>

After preprocessing, compute normalization statistics:

python datasets/compute_norm_stats.py \
  --data_root /path/to/weaver_droid

python -m datasets.preprocess_droid_ood \
  --input_roots /path/to/your/task_folder \
  --output_root /path/to/weaver_format_data \
  --data_type train \
  --tasks none

Training WEAVER

There are three main modes. Pretraining learns the world model from a large dataset:

DATASET_PATH=/path/to/preprocessed_droid \
SCRATCH_DIR=/path/to/output/model_dir \
sbatch scripts/pretrain.sh

Finetuning starts from a checkpoint and adapts it to your task or robot setup:

PRETRAINED_DIR=/path/to/pretrained/logs/chkpts \
DATASET_PATH=/path/to/finetune_data \
EXP_NAME=weaver_finetune \
FINETUNE_SUFFIX=finetune \
sbatch scripts/finetune.sh

ReFlow post-training reduces the denoising budget so planning becomes faster:

PRETRAINED_DIR=/path/to/teacher/logs/chkpts \
PRETRAINED_CKPT_NAME=checkpoint.pt \
DATASET_PATH=/path/to/preprocessed_droid \
EXP_NAME=weaver_reflow \
FINETUNE_SUFFIX=reflow \
sbatch scripts/reflow.sh

Reward prediction and advantage filtering in WEAVER — source: WEAVER project page

Inference and evaluation

To generate rollout views:

python -m weaver.generate_views \
  --checkpoint /path/to/logs/chkpts \
  --output-dir /path/to/eval_output \
  --split val \
  --use-real-history \
  --overrides \
    dataset.path=/path/to/eval_dataset \
    model.val_steps=27 \
    eval_horizon=5 \
    eval_bootstrap=5 \
    inference.pyramid_stagger_width=1 \
    inference.pyramid_schedule=cosine

The key knobs are:

Parameter	Meaning
`model.val_steps`	Number of flow denoising steps
`eval_horizon`	Number of future frames generated per chunk
`eval_bootstrap`	Number of generated frames fed back into the next chunk
`inference.pyramid_schedule`	`linear`, `cosine`, `power`, or `sigmoid`
`inference.pyramid_stagger_width`	Schedule offset across future frames

The effective NFE is:

NFE = val_steps + eval_horizon * pyramid_stagger_width

To compute FID, FVD, and LPIPS:

EVAL_DIR=/path/to/eval_output \
sbatch scripts/compute_eval_metrics.sh

Improving π0.5 with synthetic data

Policy improvement is the part that makes WEAVER especially relevant for VLA builders. The README pipeline is:

Run an OpenPI server with π0.5 on a GPU machine.
Let π0.5 sample multiple candidate action chunks.
Roll out each candidate inside WEAVER latent space.
Use the reward head and critic to compute advantage.
Keep the best segment if its advantage passes a threshold.
Convert synthetic data to DROID layout, then to LeRobot.
Finetune π0.5 with real plus synthetic data.

Command skeleton:

uv run scripts/serve_policy.py --env DROID --num-samples 5

synth_options=(
  CHECKPOINT=/path/to/chkpts
  DATASET_PATH=/path/to/dataset
  OUTPUT_DIR=/path/to/output
  NUM_TRAJECTORIES=1000
  NUM_SAMPLES=5
  NUM_CHUNKS=4
  OPEN_LOOP_HORIZON=9
  SELECTION_CRITERION=advantage
  PI_HOST=<server-ip>
  PI_PORT=8000
)

env "${synth_options[@]}" bash scripts/synth_data_gen.sh

Then convert:

python third_party/openpi/examples/droid/convert_synthetic_data_to_droid.py \
  --input-dirs /path/to/synthetic_data_folder /path/to/real_data_folder \
  --output-dir ../data/data_mixed_droid_all \
  --tasks cup_task marker_task

python third_party/openpi/examples/droid/convert_synthetic_droid_data_to_lerobot.py \
  --data_dir ../data/data_mixed_droid_all \
  --repo_name your-hf-username/your-dataset-name

Finally finetune from third_party/openpi:

uv run scripts/train.py pi05_droid_finetune_real_syn_adv \
  --exp-name=droid-20k-real-syn-finetune \
  --overwrite

Policy finetuning success rates with WEAVER synthetic data — source: WEAVER project page

Test-time steering

planning_options=(
  CHECKPOINT=/path/to/chkpts
  OUTPUT_DIR=/path/to/output
  TASK="stack the red block"
  NUM_SAMPLES=8
  OPEN_LOOP_HORIZON=9
  SELECTION_CRITERION=advantage
  MAX_STEPS=700
  PI_HOST=<server-ip>
  PI_PORT=8000
)

env "${planning_options[@]}" bash scripts/steer_pi_policy.sh

When should you use WEAVER?

References

Paper: WEAVER, Better, Faster, Longer
Project page: arnavkj1995.github.io/WEAVER
Code: github.com/arnavkj1995/WEAVER
Model weights: huggingface.co/arnavkj1995/WEAVER

WEAVER: World Model for π0.5 VLA

What problem does WEAVER solve?

The paper idea in plain language

WEAVER architecture

Installing the repository

Preparing data

Training WEAVER

Inference and evaluation

Improving π0.5 with synthetic data

Test-time steering

When should you use WEAVER?

References

Nguyễn Anh Tuấn

Related Posts

AXIS: teleop browser và fine-tune π0.5

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

WEAVER: World Model for π0.5 VLA

What problem does WEAVER solve?

The paper idea in plain language

WEAVER architecture

Installing the repository

Preparing data

Training WEAVER

Inference and evaluation

Improving π0.5 with synthetic data

Test-time steering

When should you use WEAVER?

References

Nguyễn Anh Tuấn

Related Posts

AXIS: teleop browser và fine-tune π0.5

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2