
Run World Pilot for VLA Robots

A practical World Pilot guide for using World-Action Priors to improve zero-shot OOD robustness in VLA robots.

Nguyễn Anh Tuấn · June 11, 2026 · 15 min read

World Pilot is a new framework for steering Vision-Language-Action (VLA) models with World-Action Priors. Instead of asking a VLA policy to infer all physical dynamics from the current image and a language instruction, World Pilot adds two signals from a World-Action Model (WAM): a latent representation of how the scene is expected to evolve, and a coarse trajectory prior suggesting the shape of useful motion. These signals do not replace the policy. They act as a pilot that gives the action generator a better short-horizon sense of the physical future.

This guide explains the paper and the public repository from an implementation perspective. The goal is not to repeat the abstract, but to make the system usable: what problem World Pilot solves, how Latent Steering and Action Steering work, how to install the repo, how training is organized, how inference/evaluation is served, and how to read the zero-shot OOD results. If you are new to VLA work, think of World Pilot as a clean example of the current direction in robot learning: a VLM provides semantic grounding, while a world model provides predicted dynamics.

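Primary sources used for this guide: the World Pilot paper and the public WorldPilot repository (https://github.com/ZefuLin/WorldPilot).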

The Core Idea

Most modern VLA policies follow a pipeline like this:

camera images + language instruction + proprioception
        |
        v
Vision-Language Model backbone
        |
        v
action head / diffusion head / flow-matching head
        |
        v
robot action chunk

This design is strong at semantic grounding. A robot can understand that "stack orange blocks" refers to orange blocks and a stacking task rather than a pushing task. However, semantic grounding from image-text pretraining is mostly learned from static image-caption pairs. Real manipulation is continuous, contact-rich, partially occluded, and sensitive to geometry, timing, and accumulated error.

The gap becomes obvious under zero-shot out-of-distribution shifts:

| Shift | Example | Why a plain VLA may fail |
|---|---|---|
| Camera | a viewpoint not seen during fine-tuning | hidden states recognize objects but do not predict motion well |
| Geometry | taller blocks or a shifted lid pose | action tolerance changes, not just object identity |
| Deformable state | a towel rotated into a new shape | soft-object state evolves continuously |
| Appearance | different background, lighting, or noise | the VLM may see the object, while the action head drifts |
| Layout | object and target move relative to each other | the policy needs a trajectory-level hint |

World Pilot augments the policy with a WAM. The WAM consumes the same observation, instruction, and optional proprioception, then produces:

  1. scene-evolution latent: a compact latent that represents near-future scene changes, including object motion, contact outcomes, and local state transitions.
  2. anticipated action trajectory: a rough trajectory hypothesis, used as a motion prior rather than as the final action.

A simplified architecture:

Inputs
  multiview images + instruction + proprioception
        |
        +----------------------------+
        |                            |
        v                            v
  VLM semantic path             World-Action Model
        |                       /              \
        |                      v                v
        |              scene-evolution      anticipated
        |                  latent           trajectory
        |                      |                |
        |                      v                v
        |              Latent Steering    Action Steering
        |                      |                |
        +----------------------+----------------+
                               |
                               v
                    flow-matching action head
                               |
                               v
                         executable action

The important point is that World Pilot does not turn the WAM into the controller. The VLA remains the main policy and is still trained against expert action chunks. The world priors are injected into the decision chain at two carefully chosen points.


Architecture: Two Complementary Steering Paths

The paper defines two steering pathways: Latent Steering at the perception layer and Action Steering at the action-generation layer. They can be ablated independently, but using both gives the strongest result.

Latent Steering

Latent Steering asks: how should the predicted future from a world model enter VLM hidden states without being polluted by pixel-level artifacts?

One naive approach would be to decode a future image from the WAM, then feed that future image into the VLM as another visual input. The paper tests this direction and finds it weaker than latent injection. The reason is practical: decoded future images contain texture, lighting, background details, and generation artifacts that are mostly irrelevant for control. The WAM latent keeps a more compact representation of physical dynamics.

The Latent Steering flow looks like this:

WAM future latent Z_w
        |
        v
dynamics encoder f_dyn
        |
        + temporal future embedding
        |
        v
future-scene tokens D_w
        |
        v
cross-attention: VLM hidden states attend to D_w
        |
        v
dynamics-aware VLM hidden states

Each VLM hidden token can selectively attend to future-scene tokens that are relevant to its spatial region or object. The update is residual, so the original token sequence and hidden-state interface are preserved. That makes the design easier to attach to an existing VLA action head.
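
The residual update can be illustrated with a minimal sketch in plain Python. This is not the repository's implementation: it assumes a single attention head, no learned query/key/value projections, and future-scene tokens that already share the hidden dimension. The function name `latent_steering` mirrors the pseudo-code in this guide, while `softmax` and `dot` are helpers introduced only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def latent_steering(hidden, future_tokens):
    """Residual cross-attention: each hidden token attends to the future tokens.

    Assumption: no learned projections, one head. The output keeps the same
    number of tokens and the same width as `hidden`.
    """
    dim = len(hidden[0])
    scale = 1.0 / math.sqrt(dim)
    steered = []
    for h in hidden:
        weights = softmax([dot(h, d) * scale for d in future_tokens])
        context = [sum(w * d[i] for w, d in zip(weights, future_tokens))
                   for i in range(dim)]
        # Residual connection: original token plus attended future context.
        steered.append([hi + ci for hi, ci in zip(h, context)])
    return steered

hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three toy VLM hidden tokens
future_tokens = [[2.0, 0.0], [0.0, 2.0]]         # two toy future-scene tokens
steered = latent_steering(hidden, future_tokens)
print(len(steered), len(steered[0]))             # same token count and width as `hidden`
```

Because the update is added to the original token, an action head that expects the original hidden-state shape can consume the steered sequence without interface changes.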

Action Steering

Action Steering asks a different question: if the WAM predicts a rough action trajectory, how should that trajectory condition the action generator?

World Pilot does not execute the WAM trajectory directly, and it does not force each output action step to match a corresponding WAM step. Instead, the trajectory is resampled to the VLA horizon K and encoded into a single prior token, which is inserted as a prefix for the flow-matching action generator.

WAM anticipated trajectory A_w
        |
        v
align to VLA horizon K
        |
        v
action encoder f_act
        |
        v
single trajectory prior token s_w
        |
        v
self-attention context for flow-matching action head

The single-token design is small but important. Per-step trajectory tokens can over-constrain the generator and propagate WAM noise step by step. Flow initialization from the WAM trajectory ties the final output too strongly to WAM action quality. A single encoded token keeps the WAM as soft guidance: enough to reveal the coarse motion shape, but not enough to prevent the VLA action head from correcting the final chunk.
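
As a rough illustration of the "align, then encode into one token" step, the sketch below resamples a trajectory to K steps and collapses it into a single vector. Both choices are assumptions made for this example: the guide does not say how the resampling is done (linear interpolation is used here), and `encode_prior_token` simply flattens the chunk as a stand-in for the learned action encoder f_act.

```python
def align_to_horizon(trajectory, horizon):
    """Linearly resample a list of action vectors to `horizon` steps.

    Assumption: linear interpolation over the step index; the actual
    resampling rule used by World Pilot may differ.
    """
    if horizon < 1 or not trajectory:
        raise ValueError("need a non-empty trajectory and horizon >= 1")
    n = len(trajectory)
    if n == 1 or horizon == 1:
        return [list(trajectory[0]) for _ in range(horizon)]
    aligned = []
    for k in range(horizon):
        pos = k * (n - 1) / (horizon - 1)   # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        aligned.append([(1 - frac) * a + frac * b
                        for a, b in zip(trajectory[lo], trajectory[hi])])
    return aligned

def encode_prior_token(aligned):
    """Placeholder for f_act: flatten the chunk into one vector (one token)."""
    return [value for step in aligned for value in step]

wam_trajectory = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]   # 3 steps, 2-D actions
aligned = align_to_horizon(wam_trajectory, horizon=5)
prior_token = encode_prior_token(aligned)
print(aligned)
print(len(prior_token))   # one token of width K * action_dim
```

Whatever the encoder looks like in practice, the design point is the same: the action head sees one summary token of the coarse motion rather than K step-aligned targets.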

Training Objective

World Pilot trains like a flow-matching VLA policy with additional conditioning. Each training sample contains observations, a language instruction, optional proprioception, and an expert action chunk A*. The WAM is frozen. Gradients update only the VLA-side modules:

  • VLM backbone and adapters, depending on the configuration
  • dynamics encoder for the latent prior
  • Latent Steering cross-attention
  • action encoder for the trajectory prior
  • flow-matching action generator

The paper uses a clean-action parameterization. In pseudo-code:

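import torch

# Pseudo-code: wam, vlm_encode, latent_steering, align_to_horizon, action_encoder,
# action_head, sample_flow_time, weight, and mse stand in for the real modules.
# observation: images, language, optional proprioception
# A_star: clean expert action chunk; K: VLA action horizon

with torch.no_grad():
    Z_w, A_w = wam(observation)          # frozen WAM: future latent and coarse trajectory

H = vlm_encode(observation)              # VLM hidden states
H_bar = latent_steering(H, Z_w)          # residual cross-attention to future-scene tokens

s_w = action_encoder(align_to_horizon(A_w, K))   # single trajectory prior token

eps = torch.randn_like(A_star)           # Gaussian noise with the shape of the action chunk
tau = sample_flow_time()                 # flow time
X_tau = tau * A_star + (1 - tau) * eps   # tau = 1 recovers the clean action chunk

# The head predicts the clean action chunk directly (clean-action parameterization).
A_hat = action_head(
    noisy_action=X_tau,
    flow_time=tau,
    state_token=proprioception,
    prior_token=s_w,
    cross_attention_context=H_bar,
)

loss = weight(tau) * mse(A_hat, A_star)
loss.backward()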
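
Written as equations, the pseudo-code above corresponds to the following objective. The symbols follow the pseudo-code: \pi_\theta is the action head, q the proprioceptive state token, s_w the trajectory prior token, \bar{H} the steered hidden states, and w(\tau) the time-dependent weight. The sampling distribution of \tau and the form of w(\tau) are not specified in this guide, and the squared norm matches `mse` only up to a normalization constant, so treat this as a sketch.

```latex
\begin{aligned}
X_\tau &= \tau A^{*} + (1 - \tau)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
\hat{A} &= \pi_\theta\!\left(X_\tau,\ \tau,\ q,\ s_w,\ \bar{H}\right) \\
\mathcal{L} &= \mathbb{E}_{\tau,\,\epsilon}\!\left[\, w(\tau)\, \lVert \hat{A} - A^{*} \rVert_2^2 \,\right]
\end{aligned}
```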

During training, the WAM forward pass can be precomputed and cached, keeping the heavy world model out of the inner loop. During inference and evaluation, the WAM runs online at every decision step so priors are generated from the live observation.

Installing the Public Repo

The public WorldPilot repository currently uses multiple local environments rather than a single Docker-first workflow. This is the part beginners are most likely to trip over. There are four environments:

| Environment | Role |
|---|---|
| WorldPilot | policy training, model server, main code |
| cosmos-policy | Cosmos Policy serving and cache precompute |
| libero | baseline LIBERO tooling if needed |
| libero-plus | public zero-shot OOD evaluation |

The commands below are a practical skeleton. Read the repository docs before a full run because checkpoint paths, dataset roots, and GPU IDs must be adapted to your machine.

sudo apt-get update
sudo apt-get install -y \
  build-essential cmake git git-lfs curl wget ffmpeg \
  libgl1 libegl1-mesa-dev libgl1-mesa-dri libglib2.0-0 \
  libexpat1 libfontconfig1-dev libpython3-stdlib libmagickwand-dev

Create the main environment:

mamba create -n WorldPilot python=3.10 -y
mamba activate WorldPilot

git clone https://github.com/ZefuLin/WorldPilot.git
cd WorldPilot

pip install torch torchvision --index-url <your-pytorch-cuda-wheel-index>
pip install -r requirements.txt
pip install -e .

Install VGGT if your LIBERO configuration requires it:

git clone https://github.com/facebookresearch/vggt.git /path/to/vggt
pip install -e /path/to/vggt

Install Cosmos Policy:

mamba create -n cosmos-policy python=3.10 -y
mamba activate cosmos-policy

git clone https://github.com/NVlabs/cosmos-policy.git /path/to/cosmos-policy
cd /path/to/cosmos-policy
pip install -e ".[cu128]"
pip install -r cosmos_policy/experiments/robot/libero/libero_requirements.txt
pip install websockets msgpack

Install LIBERO-Plus for evaluation:

mamba create -n libero-plus python=3.8 -y
mamba activate libero-plus

git clone https://github.com/sylvestf/LIBERO-plus.git /path/to/LIBERO-plus
cd /path/to/LIBERO-plus
pip install -r requirements.txt
pip install -r extra_requirements.txt
pip install -e .

cd /path/to/WorldPilot
pip install -r examples/LIBERO-plus/eval_files/libero_plus_requirements.txt

Create the local LIBERO-Plus config:

mkdir -p ~/.libero-plus
cat > ~/.libero-plus/config.yaml <<'EOF'
assets: /path/to/LIBERO-plus/libero/libero/assets
bddl_files: /path/to/LIBERO-plus/libero/libero/bddl_files
benchmark_root: /path/to/LIBERO-plus/libero/libero
datasets: /path/to/libero_plus_datasets
init_states: /path/to/LIBERO-plus/libero/libero/init_files
EOF

export LIBERO_CONFIG_PATH=~/.libero-plus
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
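
Path typos in this config are a common source of confusing simulator errors, so it can help to validate the file before launching anything. The sketch below is a hypothetical helper, not part of LIBERO-Plus or WorldPilot. It assumes the flat `key: value` layout shown above (so it avoids a YAML dependency) and that every entry should point to an existing directory.

```python
from pathlib import Path

REQUIRED_KEYS = ("assets", "bddl_files", "benchmark_root", "datasets", "init_states")

def read_flat_config(path):
    """Parse simple `key: value` lines; enough for the flat config above."""
    entries = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        entries[key.strip()] = value.strip()
    return entries

def find_config_problems(path):
    """Return human-readable problems; an empty list means the config looks sane."""
    entries = read_flat_config(path)
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entries:
            problems.append(f"missing key: {key}")
        elif not Path(entries[key]).is_dir():
            problems.append(f"{key} is not a directory: {entries[key]}")
    return problems

if __name__ == "__main__":
    config_path = Path.home() / ".libero-plus" / "config.yaml"
    if config_path.exists():
        for problem in find_config_problems(config_path):
            print(problem)
    else:
        print(f"config not found: {config_path}")
```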

Checkpoints and Cache Artifacts

The public release depends on several pretrained assets:

| Asset | Purpose |
|---|---|
| nvidia/Cosmos-Policy-LIBERO-Predict2-2B | WAM/Cosmos Policy priors |
| facebook/VGGT-1B | visual geometry component for the current config |
| StarVLA/Qwen3-VL-4B-Instruct-Action | action-ready Qwen3-VL checkpoint |
| amap_cvlab/ABot-M0-Pretrain | ABot-M0 pretraining checkpoint |
| Chedan86/WorldPilot-LIBERO | released WorldPilot checkpoint |
| Chedan86/WorldPilot-LIBERO-precompute | precomputed LIBERO Cosmos cache |

If your goal is evaluation, the fastest path is to download the released checkpoint and precomputed cache. If your goal is to run training on LIBERO yourself, you need to configure examples/LIBERO/train_files/WorldPilot.yaml and a cache directory.

The expected cache layout is roughly:

/path/to/cosmos_cache/
  libero_10_no_noops_1.0.0_lerobot/
  libero_goal_no_noops_1.0.0_lerobot/
  libero_object_no_noops_1.0.0_lerobot/
  libero_spatial_no_noops_1.0.0_lerobot/
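
A quick way to confirm that a downloaded or precomputed cache matches this layout is to check for the four split directories. The helper below is a hypothetical sketch: it only verifies that the directories exist, not that their contents are complete or valid.

```python
from pathlib import Path

EXPECTED_SPLITS = (
    "libero_10_no_noops_1.0.0_lerobot",
    "libero_goal_no_noops_1.0.0_lerobot",
    "libero_object_no_noops_1.0.0_lerobot",
    "libero_spatial_no_noops_1.0.0_lerobot",
)

def missing_cache_splits(cache_root):
    """Return the expected split directories that are absent under cache_root."""
    root = Path(cache_root)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_dir()]

if __name__ == "__main__":
    # Replace with your own cache directory.
    print(missing_cache_splits("/path/to/cosmos_cache"))
```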

If you do not use the published cache, run precompute:

cd /path/to/WorldPilot
# edit the variables in cosmos_bridge/run_precompute.sh first
bash cosmos_bridge/run_precompute.sh

The script starts Cosmos servers for dataset splits and writes cache files to your configured output directory. For a beginner, the published cache is usually the better first step because it lets you debug the policy path before debugging WAM precompute.

Training World Pilot

The public training path has five steps:

  1. Prepare the WorldPilot and cosmos-policy environments.
  2. Download the pretrained weights.
  3. Download or precompute the Cosmos cache.
  4. Edit the training YAML and launch script.
  5. Run training.

At minimum, edit these fields in examples/LIBERO/train_files/WorldPilot.yaml:

run_root_dir: /path/to/runs
run_id: worldpilot_libero_debug

framework:
  vggt_path: /path/to/VGGT-1B
  qwenvl:
    base_vlm: /path/to/Qwen3-VL-4B-Instruct-Action

datasets:
  vla_data:
    data_root_dir: /path/to/libero_lerobot_data
    cosmos_cache_dir: /path/to/cosmos_cache

trainer:
  pretrained_checkpoint: /path/to/ABot-M0-Pretrain

Then edit the top block of examples/LIBERO/train_files/run_libero_train.sh:

GPU_IDS=0,1,2,3
CONFIG_YAML=examples/LIBERO/train_files/WorldPilot.yaml

Launch:

cd /path/to/WorldPilot
mamba activate WorldPilot
bash examples/LIBERO/train_files/run_libero_train.sh

The paper fine-tunes World Pilot on 8 RTX PRO 6000 GPUs and uses WAM-condition dropout with rate 0.3 so the policy does not over-rely on priors. If you only have one GPU, do not expect to reproduce the full benchmark number. Start with a short run that verifies the data loader, cache path, checkpoint loading, and loss behavior.
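
To make the WAM-condition dropout concrete, here is a minimal sketch of the idea at rate 0.3. How the dropout is actually implemented is not described in this guide, so two assumptions are made here: both priors are dropped together for a given sample, and a dropped prior is replaced by zeros. A real implementation might drop them independently or use a learned null embedding instead.

```python
import random

def drop_wam_conditions(latent, trajectory_token, rate=0.3, rng=random):
    """With probability `rate`, replace both WAM priors with zeros.

    Returns (latent, trajectory_token, dropped). Zero-masking and joint
    dropping are assumptions made for this sketch.
    """
    if rng.random() < rate:
        return [0.0] * len(latent), [0.0] * len(trajectory_token), True
    return list(latent), list(trajectory_token), False

rng = random.Random(0)   # seeded so the example is repeatable
trials = 10_000
dropped_count = 0
for _ in range(trials):
    _, _, dropped = drop_wam_conditions([0.2, -0.1], [0.5, 0.5, 0.5], rng=rng)
    dropped_count += dropped
print(f"fraction of samples trained without priors: {dropped_count / trials:.3f}")
```

The effect is that roughly three in ten training samples force the policy to act from its own perception, which is what keeps it usable when the priors are noisy.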

A practical smoke-test checklist:

[ ] WorldPilot package imports correctly
[ ] Qwen3-VL action checkpoint loads
[ ] ABot-M0 pretrain loads
[ ] LIBERO LeRobot dataset is readable
[ ] cosmos_cache_dir is readable
[ ] first batch has images, language, proprioception, and actions
[ ] forward pass has no NaNs
[ ] loss decreases during the first few hundred steps
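
The last two checklist items can be automated with a few lines of plain Python over logged values. This is a generic sketch, not code from the repository: `all_finite` and `loss_is_decreasing` are hypothetical helper names, and comparing the mean of the first and last 50 logged losses is just one reasonable heuristic.

```python
import math

def all_finite(values):
    """True if no value is NaN or infinite."""
    return all(math.isfinite(v) for v in values)

def loss_is_decreasing(losses, window=50):
    """Compare the mean of the first and last `window` logged losses."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

# Synthetic loss curve standing in for a real training log.
logged_losses = [1.0 / (1.0 + 0.01 * step) for step in range(300)]
print(all_finite(logged_losses), loss_is_decreasing(logged_losses))
```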

Inference and Evaluation

The public evaluation target is WorldPilot on LIBERO-Plus. The repository provides two entrypoints:

  • examples/LIBERO-plus/eval_files/eval_libero_single.sh
  • examples/LIBERO-plus/eval_files/eval_libero_batch.sh

Before evaluation, make sure you have:

| File | Role |
|---|---|
| WorldPilot checkpoint | policy weights |
| Cosmos checkpoint | online WAM server |
| Cosmos dataset statistics JSON | normalization/statistics |
| Cosmos T5 embedding pickle | language instruction embeddings |
| ~/.libero-plus/config.yaml | paths to assets, datasets, and init states |

Single-suite evaluation:

cd /path/to/WorldPilot
mamba activate libero-plus

# edit required paths at the top of the script first
bash examples/LIBERO-plus/eval_files/eval_libero_single.sh

Batch evaluation:

cd /path/to/WorldPilot
mamba activate libero-plus

# edit checkpoint paths, Cosmos files, SUITES, GPUS, and ports
bash examples/LIBERO-plus/eval_files/eval_libero_batch.sh

The public evaluation scripts use a server-style setup:

Cosmos server
  cosmos_bridge/run_cosmos_server.sh
        |
        v
policy server
  deployment/model_server/server_policy.py
        |
        v
LIBERO-Plus simulator workers
  socket requests -> policy actions

When debugging inference, separate the problem into three layers: whether the Cosmos server returns priors, whether the policy server loads the WorldPilot checkpoint, and whether simulator workers send observations with the expected shape. Debugging all three at once is slow and misleading.
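
For the third layer, a small shape check on the observation payload catches most mismatches before they reach the policy. The sketch below is illustrative only: the keys (`image`, `wrist_image`, `state`) and the shapes are hypothetical, so substitute whatever your simulator workers actually send.

```python
def shape_of(value):
    """Shape of an array-like (via .shape) or of a nested list."""
    if hasattr(value, "shape"):
        return tuple(value.shape)
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)

def check_observation(observation, expected_shapes):
    """Return a list of mismatches between an observation and expected shapes."""
    problems = []
    for key, expected in expected_shapes.items():
        if key not in observation:
            problems.append(f"missing key: {key}")
            continue
        actual = shape_of(observation[key])
        if actual != tuple(expected):
            problems.append(f"{key}: expected {tuple(expected)}, got {actual}")
    return problems

# Hypothetical keys and tiny shapes, for illustration only.
expected = {"image": (4, 4, 3), "wrist_image": (4, 4, 3), "state": (8,)}
observation = {
    "image": [[[0] * 3 for _ in range(4)] for _ in range(4)],
    "wrist_image": [[[0] * 3 for _ in range(4)] for _ in range(4)],
    "state": [0.0] * 7,   # deliberately one element short
}
print(check_observation(observation, expected))
```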

Paper Results

World Pilot is evaluated on LIBERO, LIBERO-Plus, RoboCasa, and real robots. The headline number is 84.7% Total success rate on the LIBERO-Plus zero-shot OOD benchmark, compared with 80.5% for ABot-M0 and 79.7% for Cosmos Policy in the paper table.

Selected simulation results:

| Method | LIBERO (%) | LIBERO-Plus Total (%) | RoboCasa (%) |
|---|---|---|---|
| ABot-M0 | 98.6 | 80.5 | 54.0 |
| Cosmos Policy | 98.5 | 79.7 | 67.1 |
| World Pilot | 98.5 | 84.7 | 65.5 |

World Pilot does not win every single axis. Its strongest gains appear on camera, light, background, and noise perturbations. On the camera axis, the paper reports 82.8%, a 13.2-point gain over the next strongest baseline. This matches the design intuition: a video-pretrained WAM has broader camera-pose experience, and the scene-evolution latent carries that coverage into the policy.

The real-robot setup includes four tasks:

| Task | ID setting | OOD variants |
|---|---|---|
| Stack Blocks | training-like layout | color, height |
| Fold Towel | training-like towel | direction, novel towel |
| Fruit-to-Plate | training-like fruit/layout | novel fruit, layout |
| Container-Lid Alignment | training-like object/lid | novel object, lid pose |

For each task, the authors collect 100 in-distribution teleoperated demonstrations, fine-tune for 10,000 steps, and run 20 trials per setting and method. World Pilot achieves the highest success rate in every setting. Examples: Stack Blocks reaches 70% ID, 55% OOD color, and 50% OOD height; Fold Towel reaches 85% ID, 75% OOD direction, and 70% novel towel; Fruit-to-Plate reaches 90% ID, 75% novel fruit, and 70% layout; Container-Lid Alignment reaches 80% ID, 70% novel object, and 65% lid pose.

The more important signal is the ID-to-OOD drop. World Pilot typically drops by 10-20 absolute points, while baselines often drop by 25-50 points. For manipulation tasks with tight contact tolerances, that difference matters. Failures often come from a slightly wrong pose or contact trajectory, not from misunderstanding the language instruction.
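
The drop for World Pilot can be recomputed directly from the real-robot success rates quoted above. The snippet only uses those numbers; baseline numbers are not listed in this guide, so they are not included.

```python
# World Pilot real-robot success rates (%) as quoted in this guide.
results = {
    "Stack Blocks": {"ID": 70, "OOD color": 55, "OOD height": 50},
    "Fold Towel": {"ID": 85, "OOD direction": 75, "novel towel": 70},
    "Fruit-to-Plate": {"ID": 90, "novel fruit": 75, "layout": 70},
    "Container-Lid Alignment": {"ID": 80, "novel object": 70, "lid pose": 65},
}

def id_to_ood_drops(task_results):
    """Absolute drop (in points) from the ID setting to each OOD variant."""
    id_rate = task_results["ID"]
    return {name: id_rate - rate
            for name, rate in task_results.items() if name != "ID"}

drops = {task: id_to_ood_drops(rates) for task, rates in results.items()}
all_drops = [d for per_task in drops.values() for d in per_task.values()]

for task, per_task in drops.items():
    print(task, per_task)
print("min drop:", min(all_drops), "max drop:", max(all_drops))  # 10 and 20
```

With 20 trials per setting, each trial is worth 5 points, so a 10-point drop corresponds to two additional failures and a 20-point drop to four.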

Ablations: Why This Design Matters

The paper includes four useful ablations.

Each Pathway Helps

| Variant | LIBERO-Plus success (%) |
|---|---|
| ABot-M0 baseline | 80.5 |
| Latent Steering only | 83.7 |
| Action Steering only | 83.1 |
| Full World Pilot | 84.7 |

This shows that scene-evolution priors and trajectory priors are not redundant. Latent Steering improves how the perception stack represents the future scene, while Action Steering gives the action head a soft motion hint.
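
The gains over the baseline follow from simple arithmetic on the ablation numbers above:

```python
baseline = 80.5   # ABot-M0 on LIBERO-Plus (%)
variants = {
    "Latent Steering only": 83.7,
    "Action Steering only": 83.1,
    "Full World Pilot": 84.7,
}

# Gain of each variant over the baseline, in absolute points.
gains = {name: round(score - baseline, 1) for name, score in variants.items()}
sum_of_single = round(
    gains["Latent Steering only"] + gains["Action Steering only"], 1
)

print(gains)           # 3.2, 2.6, and 4.2 points respectively
print(sum_of_single)   # 5.8
```

The two single-path gains sum to 5.8 points while the full model gains 4.2, so the contributions are not purely additive. One plausible reading is that the two priors partly overlap in the information they carry, yet each still adds something the other lacks, since the full model beats both single-path variants.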

World Priors Help Even Before Action Fine-Tuning

When Cosmos Policy is replaced by Cosmos-Predict, which only provides future-scene latents, Latent Steering still improves ABot-M0 on LIBERO-Plus, RoboCasa, and RoboTwin 2.0 (clean). This suggests that large-scale video pretraining already contains useful dynamics priors, even before action-side post-training sharpens them.

Latents Beat Decoded Future Images

Future latents at 1, 3, and 5 denoising steps all land around 84.5-84.7%. A decoded future image reaches 83.5%. This is a good robot-learning lesson: a visually plausible future image is not necessarily the best control representation. A latent can be less interpretable to humans while being cleaner for policy conditioning.

A Single Trajectory Token Beats Stepwise Forcing

The best action-prior form is the single encoded token at 84.7%. Per-step encoded tokens reach 83.6%, flow initialization reaches 84.1%, and raw trajectory conditioning reaches 83.0%. This supports the idea of using the WAM as soft guidance rather than as a hard trajectory oracle.

When Should You Try World Pilot?

World Pilot is worth trying if you already have:

  • a working VLA backbone, especially with a diffusion or flow-matching action head;
  • manipulation data with clear language labels and action chunks;
  • zero-shot OOD failures under camera, appearance, geometry, or pose shifts;
  • enough GPU capacity to run an additional WAM online or precompute WAM cache;
  • a benchmark such as LIBERO/LIBERO-Plus before moving to a physical robot.

World Pilot is not a universal fix. The paper notes that the policy still inherits the WAM's coverage. If your test scenes fall outside the WAM's video pretraining distribution, the priors can degrade. Each decision step also adds a WAM forward pass, which can be a problem for high-frequency reactive control. The WAM and VLA are connected modularly through the action loss, so the current design does not yet explore deep joint prior-policy co-adaptation.

A Practical Lab Roadmap

If you are in a small robotics lab or startup, do not start by trying to reproduce 84.7%. Use a staged path:

  1. Run the released checkpoint on one small LIBERO-Plus suite.
  2. Check that Cosmos server, policy server, and simulator communicate correctly.
  3. Fine-tune a short run from ABot-M0 pretrain with the published cache.
  4. Compare ABot-M0 and World Pilot on one or two OOD axes.
  5. Only then consider a physical robot or a custom dataset.

For a custom robot, the hardest part is usually not the steering code. It is data hygiene: camera synchronization, proprioception, action chunk formatting, gripper state, language labels, reset conditions, and evaluation protocol. World Pilot can provide dynamics priors, but it cannot rescue a noisy dataset.
