StressDream for robot policy testing

What problem does StressDream solve?

Robot manipulation policies often look good until they meet the tail cases. A trajectory may succeed in the logged demonstration, but the same action can fail if the gripper releases a few centimeters too high, the object is closer to the table edge, the bag tilts slightly, or a stacked object starts sliding. Before deploying a policy on real hardware, teams need to ask a practical question: does this action have a plausible future where the robot spills, drops, collides, or otherwise fails?

StressDream turns that question into an inference-time test with a video world model. The original paper is StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement, by Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, and Andrea Bajcsy. It was posted to arXiv on May 29, 2026. The project page is junwon.me/StressDream, and the official code is available at CMU-IntentLab/StressDream.

The idea is simple but powerful: instead of sampling a few "normal" rollouts from a video world model, StressDream optimizes the initial noise of a diffusion video world model so the model imagines a high-impact outcome that is still plausible under the learned model distribution. For manipulation, the target event can be written as a text prompt:

Is the coffee bean spilled? Respond with a single word: Yes or No.

If the action sequence really has a plausible spilling failure, the steered world model should find a future video where that failure appears. If the scene does not support that failure, for example a closed bag or sticky candy that should not spill, the method should not invent an absurd failure. That distinction is the core of the paper: StressDream is not trying to generate bad videos at any cost. It is trying to find the worst plausible future for a candidate action.

If you have read our guides on Diffusion Policy, VLA models, or RISE with world-model imagination, StressDream fills a very practical gap. It does not replace the main policy. It does not require thousands of additional real-world trials. It adds a stress-test stage between policy training and physical deployment.

The paper in five minutes

A video world model takes an observation history and a future action sequence, then predicts future observations. In robot manipulation, observations usually include multiple RGB camera views, sometimes including a wrist camera, plus proprioceptive state. Actions may be joint positions, end-effector poses, or action chunks. In the manipulation experiments, StressDream uses Ctrl-World, a controllable generative world model for robot manipulation trained in a DROID-style setup. According to the paper, Ctrl-World predicts 5 future frames from 3 camera views at 192 x 320 resolution, conditioned on joint-position actions, with a noise dimension of 57,600.

The problem with nominal rollout is optimism. A diffusion world model may represent many possible futures, but one random noise sample only gives you one future. If the failure is rare but plausible, random sampling can easily miss it. Sampling many videos helps, but each video generation is expensive.

StressDream treats the initial Gaussian noise as a control variable:

observation history + action sequence + initial noise
        |
        v
diffusion video world model
        |
        v
future video rollout

With fixed observation and action conditioning, the generated video is determined by the initial noise. StressDream therefore optimizes that noise with gradients so the generated video moves toward the target event described by an inference-time text prompt.

The full loop looks like this:

Initial image(s) + action trajectory
        |
        v
Sample initial Gaussian noise epsilon
        |
        v
Video world model generates a multi-view rollout
        |
        v
VLM reads rollout + failure prompt
        |
        v
Semantic score: log P(Yes) - log P(No)
        |
        v
Plausibility regularizer keeps epsilon in the Gaussian typical set
        |
        v
Update epsilon and repeat
        |
        v
Steered plausible failure video, if such a failure is supported

Technical architecture

StressDream has three main pieces.

Component	Role	In the manipulation demo
Video world model	Generate future video conditioned on actions	Ctrl-World
Semantic objective	Score whether the target event appears	Qwen3-VL-4B-Instruct
Plausibility objective	Keep optimized noise close to a valid Gaussian sample	norm, isotropy, spectral whiteness

1. The video world model

The world model can be written as:

o_future ~ f_theta(. | o_history, a_future)

Here o_history is the observation history, a_future is the action sequence to test, and o_future is the generated rollout. In a diffusion world model, generation starts from noise epsilon ~ N(0, I) and gradually denoises into a video latent. Once o_history and a_future are fixed, epsilon controls which future is generated.

The difference between nominal generation and StressDream is:

Nominal:
epsilon = random Gaussian noise
video = world_model(epsilon, observation, action)

StressDream:
epsilon = random Gaussian noise
for i in optimization_steps:
    video = world_model(epsilon, observation, action)
    reward = VLM(video, failure_prompt) + plausibility(epsilon)
    epsilon = epsilon + lr * grad(reward)

The official repository contains three modules: dubins/ for a synthetic Dubins car, vista/ for autonomous driving, and ctrl_world/ for robot manipulation. This tutorial focuses on ctrl_world/.

2. Semantic objective with a VLM

The paper uses a Vision-Language Model to score complex failures in generated video. Instead of training a separate classifier for every task, you describe the failure in text. The VLM is prompted to answer with a single token, Yes or No. The semantic score is:

C_sem(video, prompt) = log P(Yes | video, prompt) - log P(No | video, prompt)

For manipulation, a prompt can be:

Is the coffee bean spilled? Respond with a single word: Yes or No.

If the video increasingly looks like spilled coffee beans, P(Yes) rises. That score provides a gradient signal for the noise optimizer. In the ctrl_world implementation, Qwen3-VL reads a multi-view clip and returns a reward based on the margin between the Yes and No token probabilities.

3. Plausibility objective

If we only optimize the VLM reward, the noise can leave the region where the diffusion model was trained. The generated videos may become artifact-heavy, physically strange, or simply exploit the VLM. The paper frames this as leaving the typical set of the high-dimensional Gaussian prior.

StressDream uses regularizers that keep the optimized noise statistically similar to real Gaussian noise:

Regularizer	Intuition
Norm concentration	The noise norm should stay close to `sqrt(D)`
Isotropy	Random blocks of noise should not show unusual correlations
Spectral whiteness	The power spectrum should not contain structured frequency artifacts

In ctrl_world/steering_config.yaml, the relevant coefficients include kl_coeff, kl_coeff_spherical, std_coeff, spectral_coeff, and std_permutation_coeff. This is what makes StressDream different from plain adversarial generation. It is not just trying to fool a VLM. It is trying to locate a bad outcome that remains inside the world model's supported distribution.

Installing StressDream for manipulation

The official repo gives each domain its own environment. For manipulation:

conda env create -f ctrl_world/environment.yml
conda activate stressdream-ctrlworld

The ctrl_world README lists Python 3.11, PyTorch 2.9.1 with CUDA 12.8, Diffusers 0.36, Transformers 4.57, and an NVIDIA driver version of at least 520. Full optimization with the Qwen3-VL reward and noise optimization needs roughly 20 GB of VRAM. If your GPU is smaller, reduce interact_num, reduce iters, or first run the nominal generator to debug the data path.

You need three main checkpoints:

Checkpoint	Role	Relative size
`ckpts/Ctrl-World/coffee_bag.pt`	Ctrl-World checkpoint for the StressDream demo	smaller than SVD
`ckpts/stable-video-diffusion-img2vid/`	pretrained Stable Video Diffusion backbone	several GB
`ckpts/clip-vit-base-patch32/`	CLIP encoder	around 600M

The README uses the current hf CLI:

pip install -U huggingface_hub

mkdir -p ckpts/Ctrl-World
hf download junwon-seo/StressDream \
    coffee_bag.pt --local-dir ckpts/Ctrl-World

hf download stabilityai/stable-video-diffusion-img2vid \
    --local-dir ckpts/stable-video-diffusion-img2vid

hf download openai/clip-vit-base-patch32 \
    --local-dir ckpts/clip-vit-base-patch32

Qwen3-VL-4B-Instruct is fetched automatically by transformers on the first run, so leave enough disk space and download time.

Running the first demo

The repo includes a sample trajectory:

ctrl_world/example_data/traj_0001.hdf5

It contains a candy-coffee style task with real trajectory data, end-effector states, and three camera views. To run steered imagination:

python ctrl_world/run_steering.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5

To run the nominal baseline:

python ctrl_world/generate_nominal.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5

The steered output is written under:

outputs/ctrl_world_steering/<timestamp>/
├── step_0001_best.mp4
├── step_0002_best.mp4
├── ...
├── full_steered.mp4
├── history.json
└── optimized_noise.pt

The nominal output is written under:

outputs/ctrl_world_nominal/<timestamp>/
├── step_0001_nominal.mp4
├── ...
├── full_nominal.mp4
└── history.json

Open full_nominal.mp4 and full_steered.mp4 side by side. Nominal answers: "what does a typical rollout look like?" Steered answers: "can the world model find a plausible failure for this same action sequence?"

Customizing prompts for your task

You can override the instruction and Qwen prompt from the CLI:

python ctrl_world/run_steering.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5 \
    --instruction "put the coffee bag into the container without spilling" \
    --qwen_prompt "Is the coffee bean spilled? Respond with a single word: Yes or No."

A good prompt has three properties:

Do	Avoid
Ask for one concrete failure	Ask a vague question like "did the task fail?"
Force a Yes/No answer	Ask for a long explanation
Name the object as it appears in the scene	Use ambiguous references that make the VLM guess

Example prompts:

Is the knife dropped off the table edge? Respond with a single word: Yes or No.

Are the coffee beans outside the bag? Respond with a single word: Yes or No.

Does the top utensil fall while the bottom utensil is pulled? Respond with a single word: Yes or No.

The paper studies six contact-rich manipulation tasks: block stack, knife put, stacked utensil pick, coffee bean pour, open coffee bag, and open candy bag. The failure descriptions include dropping, spilling, sliding, releasing too high, and acting too close to an edge.

Reading `steering_config.yaml`

The default ctrl_world config looks like this:

task:
  instruction: "put the coffee bag into the container without spilling"
  qwen_prompt: "Is the coffee bean spilled? Respond with a single word: Yes or No."
  target_success: true

model:
  task_type: replay
  num_frames: 5

optim:
  iters: 5
  interact_num: 20
  lr: 1.0
  grad_scale: 100.0
  noise_norm_threshold: 3.0
  detach_latents: true
  seed: 33

The beginner-friendly interpretation:

Parameter	What it controls	Practical note
`num_frames`	Frames generated per world-model window	Keep default at first
`interact_num`	Number of autoregressive rollout windows	Reduce for faster debugging
`iters`	Noise optimization steps per window	Increase if steering is weak, but it costs time
`lr`	Learning rate for noise	Too high can cause artifacts
`grad_scale`	Scale for VLM reward gradients	Too high can encourage reward hacking
`noise_norm_threshold`	Hard renormalization threshold	Helps prevent excessive noise drift

In run_steering.py, each interaction step loads the next action chunk from the HDF5 file, normalizes it with dataset statistics, builds a three-view conditioning latent, samples fresh noise, optimizes that noise for several iterations, saves the best video, then uses the last generated latent frame as conditioning for the next step. That is how the rollout can continue beyond one short prediction window.

Training vs. inference

StressDream is not primarily a recipe for training a world model from scratch. The paper uses existing or fine-tuned world models:

Domain	World model	Data/setup
Dubins car	small diffusion WM	4,000 random trajectories
Driving	Vista	PAI-AV and Nexar Collision Prediction Dataset
Robot manipulation	Ctrl-World	DROID-style setup, about 150 teleoperation trajectories per task, including successes and failures

For a deployment team, there are three levels of use:

Level 1: run the included demo
  Use the provided checkpoint and traj_0001.hdf5 to understand outputs.

Level 2: stress-test internal trajectories
  Convert your robot logs into an HDF5 format the loader can read:
  multi-view RGB, eef_states or joint actions, and gripper state.

Level 3: fine-tune the world model
  Collect success and failure data for your task, fine-tune Ctrl-World,
  then run StressDream for policy/action evaluation.

Inference in StressDream is the steering process. You provide observation, action, and a failure prompt, and you receive a steered video. Policy improvement is a later step: use the steering result to reweight demonstrations. In the paper, the authors fine-tune π0.5-DROID with 40 successful demonstrations per task. Trajectories that remain successful under failure steering receive weight 1.0; trajectories that fail in steered imagination receive weight 0.1. This encourages the policy to imitate robust actions rather than actions that happened to succeed once.

Main results from the paper

The headline numbers are:

Experiment	Result
Robust policy evaluation	failure detection recall improves from 54% to 94%
Policy improvement	π0.5 success rate improves from 39% to 71%
Plausibility	StressDream does not force a closed bag or sticky candy into implausible spilling
Baseline	Best-of-N random sampling still misses many rare failures

The practical interpretation is narrow but valuable: StressDream does not prove that a policy is safe in the real world. It shows that if the world model has learned a plausible failure mode, inference-time steering can find that failure more efficiently than nominal rollouts. That makes it a useful pre-deployment test layer.

A safer deployment checklist

Use this checklist before making StressDream part of your internal workflow:

[ ] The world model has seen scenes, objects, and camera views close to deployment
[ ] Failure data exists, or the target failure is at least supported by the model
[ ] Failure prompt is concrete and uses a Yes/No answer
[ ] Nominal and steered rollouts are generated side by side
[ ] Humans review a subset of videos for VLM reward hacking
[ ] history.json is logged: reward, p_yes, p_no, noise_norm, regularizer
[ ] StressDream is not treated as the only safety proof
[ ] Real-world supervised validation still happens before deployment

The limitation is important. In this paper, "plausible" means plausible under the world model, not guaranteed physically plausible in the real world. If the world model lacks data for liquids, deformable objects, subtle contact, occlusion, or gripper slip, StressDream cannot reliably discover risks that the model never learned.

Conclusion

StressDream is a practical way to turn a video world model into a policy stress-testing tool. Instead of only asking "what does the policy do in a typical rollout?", it asks "does this same action have any plausible future where something important fails?" That second question is much closer to real deployment thinking, because production failures often live in the tail of the outcome distribution.

For robot manipulation, a reasonable workflow is: train or select a policy, collect representative action trajectories, run nominal world-model rollouts, run StressDream with task-specific failure prompts, review videos and history.json, then downweight or remove risky actions before real-world deployment. It does not replace real robot validation, but it can reduce the number of obviously risky actions that reach the hardware.

What problem does StressDream solve?

Is the coffee bean spilled? Respond with a single word: Yes or No.

The paper in five minutes

StressDream treats the initial Gaussian noise as a control variable:

observation history + action sequence + initial noise
        |
        v
diffusion video world model
        |
        v
future video rollout

The full loop looks like this:

Initial image(s) + action trajectory
        |
        v
Sample initial Gaussian noise epsilon
        |
        v
Video world model generates a multi-view rollout
        |
        v
VLM reads rollout + failure prompt
        |
        v
Semantic score: log P(Yes) - log P(No)
        |
        v
Plausibility regularizer keeps epsilon in the Gaussian typical set
        |
        v
Update epsilon and repeat
        |
        v
Steered plausible failure video, if such a failure is supported

Technical architecture

StressDream has three main pieces.

Component	Role	In the manipulation demo
Video world model	Generate future video conditioned on actions	Ctrl-World
Semantic objective	Score whether the target event appears	Qwen3-VL-4B-Instruct
Plausibility objective	Keep optimized noise close to a valid Gaussian sample	norm, isotropy, spectral whiteness

1. The video world model

The world model can be written as:

o_future ~ f_theta(. | o_history, a_future)

The difference between nominal generation and StressDream is:

Nominal:
epsilon = random Gaussian noise
video = world_model(epsilon, observation, action)

StressDream:
epsilon = random Gaussian noise
for i in optimization_steps:
    video = world_model(epsilon, observation, action)
    reward = VLM(video, failure_prompt) + plausibility(epsilon)
    epsilon = epsilon + lr * grad(reward)

2. Semantic objective with a VLM

C_sem(video, prompt) = log P(Yes | video, prompt) - log P(No | video, prompt)

For manipulation, a prompt can be:

Is the coffee bean spilled? Respond with a single word: Yes or No.

3. Plausibility objective

StressDream uses regularizers that keep the optimized noise statistically similar to real Gaussian noise:

Regularizer	Intuition
Norm concentration	The noise norm should stay close to `sqrt(D)`
Isotropy	Random blocks of noise should not show unusual correlations
Spectral whiteness	The power spectrum should not contain structured frequency artifacts

Installing StressDream for manipulation

The official repo gives each domain its own environment. For manipulation:

conda env create -f ctrl_world/environment.yml
conda activate stressdream-ctrlworld

You need three main checkpoints:

Checkpoint	Role	Relative size
`ckpts/Ctrl-World/coffee_bag.pt`	Ctrl-World checkpoint for the StressDream demo	smaller than SVD
`ckpts/stable-video-diffusion-img2vid/`	pretrained Stable Video Diffusion backbone	several GB
`ckpts/clip-vit-base-patch32/`	CLIP encoder	around 600M

The README uses the current hf CLI:

pip install -U huggingface_hub

mkdir -p ckpts/Ctrl-World
hf download junwon-seo/StressDream \
    coffee_bag.pt --local-dir ckpts/Ctrl-World

hf download stabilityai/stable-video-diffusion-img2vid \
    --local-dir ckpts/stable-video-diffusion-img2vid

hf download openai/clip-vit-base-patch32 \
    --local-dir ckpts/clip-vit-base-patch32

Qwen3-VL-4B-Instruct is fetched automatically by transformers on the first run, so leave enough disk space and download time.

Running the first demo

The repo includes a sample trajectory:

ctrl_world/example_data/traj_0001.hdf5

It contains a candy-coffee style task with real trajectory data, end-effector states, and three camera views. To run steered imagination:

python ctrl_world/run_steering.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5

To run the nominal baseline:

python ctrl_world/generate_nominal.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5

The steered output is written under:

outputs/ctrl_world_steering/<timestamp>/
├── step_0001_best.mp4
├── step_0002_best.mp4
├── ...
├── full_steered.mp4
├── history.json
└── optimized_noise.pt

The nominal output is written under:

outputs/ctrl_world_nominal/<timestamp>/
├── step_0001_nominal.mp4
├── ...
├── full_nominal.mp4
└── history.json

Customizing prompts for your task

You can override the instruction and Qwen prompt from the CLI:

python ctrl_world/run_steering.py \
    --hdf5_path ctrl_world/example_data/traj_0001.hdf5 \
    --instruction "put the coffee bag into the container without spilling" \
    --qwen_prompt "Is the coffee bean spilled? Respond with a single word: Yes or No."

A good prompt has three properties:

Do	Avoid
Ask for one concrete failure	Ask a vague question like "did the task fail?"
Force a Yes/No answer	Ask for a long explanation
Name the object as it appears in the scene	Use ambiguous references that make the VLM guess

Example prompts:

Is the knife dropped off the table edge? Respond with a single word: Yes or No.

Are the coffee beans outside the bag? Respond with a single word: Yes or No.

Does the top utensil fall while the bottom utensil is pulled? Respond with a single word: Yes or No.

Reading `steering_config.yaml`

The default ctrl_world config looks like this:

task:
  instruction: "put the coffee bag into the container without spilling"
  qwen_prompt: "Is the coffee bean spilled? Respond with a single word: Yes or No."
  target_success: true

model:
  task_type: replay
  num_frames: 5

optim:
  iters: 5
  interact_num: 20
  lr: 1.0
  grad_scale: 100.0
  noise_norm_threshold: 3.0
  detach_latents: true
  seed: 33

The beginner-friendly interpretation:

Parameter	What it controls	Practical note
`num_frames`	Frames generated per world-model window	Keep default at first
`interact_num`	Number of autoregressive rollout windows	Reduce for faster debugging
`iters`	Noise optimization steps per window	Increase if steering is weak, but it costs time
`lr`	Learning rate for noise	Too high can cause artifacts
`grad_scale`	Scale for VLM reward gradients	Too high can encourage reward hacking
`noise_norm_threshold`	Hard renormalization threshold	Helps prevent excessive noise drift

Training vs. inference

StressDream is not primarily a recipe for training a world model from scratch. The paper uses existing or fine-tuned world models:

Domain	World model	Data/setup
Dubins car	small diffusion WM	4,000 random trajectories
Driving	Vista	PAI-AV and Nexar Collision Prediction Dataset
Robot manipulation	Ctrl-World	DROID-style setup, about 150 teleoperation trajectories per task, including successes and failures

For a deployment team, there are three levels of use:

Level 1: run the included demo
  Use the provided checkpoint and traj_0001.hdf5 to understand outputs.

Level 2: stress-test internal trajectories
  Convert your robot logs into an HDF5 format the loader can read:
  multi-view RGB, eef_states or joint actions, and gripper state.

Level 3: fine-tune the world model
  Collect success and failure data for your task, fine-tune Ctrl-World,
  then run StressDream for policy/action evaluation.

Main results from the paper

The headline numbers are:

Experiment	Result
Robust policy evaluation	failure detection recall improves from 54% to 94%
Policy improvement	π0.5 success rate improves from 39% to 71%
Plausibility	StressDream does not force a closed bag or sticky candy into implausible spilling
Baseline	Best-of-N random sampling still misses many rare failures

A safer deployment checklist

Use this checklist before making StressDream part of your internal workflow:

[ ] The world model has seen scenes, objects, and camera views close to deployment
[ ] Failure data exists, or the target failure is at least supported by the model
[ ] Failure prompt is concrete and uses a Yes/No answer
[ ] Nominal and steered rollouts are generated side by side
[ ] Humans review a subset of videos for VLM reward hacking
[ ] history.json is logged: reward, p_yes, p_no, noise_norm, regularizer
[ ] StressDream is not treated as the only safety proof
[ ] Real-world supervised validation still happens before deployment

StressDream for robot policy testing

What problem does StressDream solve?

The paper in five minutes

Technical architecture

1. The video world model

2. Semantic objective with a VLM

3. Plausibility objective

Installing StressDream for manipulation

Running the first demo

Customizing prompts for your task

Reading `steering_config.yaml`

Training vs. inference

Main results from the paper

A safer deployment checklist

Conclusion

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

WEAVER: world model cải thiện π0.5 VLA

StressDream for robot policy testing

What problem does StressDream solve?

The paper in five minutes

Technical architecture

1. The video world model

2. Semantic objective with a VLM

3. Plausibility objective

Installing StressDream for manipulation

Running the first demo

Customizing prompts for your task

Reading `steering_config.yaml`

Training vs. inference

Main results from the paper

A safer deployment checklist

Conclusion

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

WEAVER: world model cải thiện π0.5 VLA

What problem does StressDream solve?

The paper in five minutes

Technical architecture

1. The video world model

2. Semantic objective with a VLM

3. Plausibility objective

Installing StressDream for manipulation

Running the first demo

Customizing prompts for your task

Reading steering_config.yaml

Training vs. inference

Main results from the paper

A safer deployment checklist

Conclusion

Related Posts

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

WEAVER: world model cải thiện π0.5 VLA

What problem does StressDream solve?

The paper in five minutes

Technical architecture

1. The video world model

2. Semantic objective with a VLM

3. Plausibility objective

Installing StressDream for manipulation

Running the first demo

Customizing prompts for your task

Reading steering_config.yaml

Training vs. inference

Main results from the paper

A safer deployment checklist

Conclusion

Related Posts

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn InternVLA-A1: VLA + World Model qua Mixture-of-Transformers

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

WEAVER: world model cải thiện π0.5 VLA

Reading `steering_config.yaml`

Reading `steering_config.yaml`