Run LIBERO-Occ VIM for Occluded VLA

LIBERO-Occ is a new benchmark for a very practical question: can a VLA manipulation policy stay reliable when the object or target region is occluded? Most robot manipulation benchmarks quietly assume that the task-relevant object, receptacle, and contact region are visible. That assumption is convenient in simulation and clean lab demos. It often breaks on real workbenches, where a bowl can hide behind a box, a drawer can hide the placement region, and the robot arm can block its own camera.

This guide explains how to read and run the work from the paper "LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination", posted on arXiv on 2026-06-09. The official repository is litsh/Libero-Occ. The paper contributes two things: the LIBERO-Occ benchmark for measuring occlusion robustness, and Viewpoint Imagination (VIM), a method that generates a complementary imagined view from an occluded primary observation and conditions action prediction on both observed and imagined evidence.

Keep these sources open while you work:

Paper: LIBERO-Occ arXiv 2606.10862
HTML paper: arxiv.org/html/2606.10862v1
GitHub: litsh/Libero-Occ
Upstream benchmark: LIBERO

What problem does LIBERO-Occ expose?

A common VLA policy pipeline looks like this:

RGB image(s) + language instruction + optional proprioception
        |
        v
vision-language backbone
        |
        v
action tokenizer / diffusion head / flow head
        |
        v
robot action chunk

This pipeline works best when the image contains enough task-relevant evidence. If the instruction is "put the bowl into the drawer", the policy usually needs to see the bowl, the drawer, the handle, and enough geometry to plan the approach and placement. If the bowl is 60% hidden behind another object, or the drawer target region is blocked from the main camera, the input no longer contains complete geometric information. The task becomes a partially observable decision problem, not just a noisy image-recognition problem.

The paper separates scene-induced occlusion from common visual perturbations:

Change type	Example	Why it differs from occlusion
Lighting	darker image or strong shadow	the object remains visible
Background shift	different table texture	task evidence is still present
Image noise	blur, compression, mild crop	evidence is degraded but not physically missing
Scene-induced occlusion	object or receptacle blocked by another object	the action-relevant evidence is absent from the primary view

LIBERO-Occ is useful because the occluder is not a 2D mask painted on top of the image. The benchmark inserts physical occluding objects into the 3D LIBERO scene, checks collision validity, renders the result, and replays the original demonstration to ensure the task is still executable. If a policy fails, the failure is therefore more likely to reflect partial observability than an invalid task.

How the benchmark is built

LIBERO-Occ extends the four standard LIBERO suites:

Suite	What it tests
`libero_spatial_occluded`	generalization over spatial layouts
`libero_goal_occluded`	generalization over goals and receptacles
`libero_object_occluded`	generalization over objects
`libero_10_occluded`	long-horizon LIBERO-10 tasks

The paper describes a three-step generation pipeline:

1. Occlusion target identification: parse the BDDL task specification to find task-relevant objects, receptacles, or goal regions.
2. View-aware occluder placement: sample a physical occluder in 3D along the camera-to-target ray, so the target is blocked by scene geometry rather than an image-space trick.
3. Occlusion validity verification: render the scene, measure visibility, reject collisions, and replay the original demonstration to keep only executable task variants.
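The view-aware placement step can be pictured with a small geometric sketch. This is not the benchmark's generator, which the repository does not release. It only illustrates, assuming a point-like camera and target, how a candidate occluder position could be sampled on the segment between them. The names `sample_on_ray`, `t_min`, and `t_max`, and all coordinates, are hypothetical.

```python
import random
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sample_on_ray(camera: Vec3, target: Vec3, rng: random.Random,
                  t_min: float = 0.2, t_max: float = 0.8) -> Vec3:
    """Return a point on the camera-to-target segment.

    t = 0 is the camera position and t = 1 is the target position.
    The default range is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 <= t_min <= t_max <= 1.0:
        raise ValueError("expected 0 <= t_min <= t_max <= 1")
    t = rng.uniform(t_min, t_max)
    x, y, z = (c + t * (g - c) for c, g in zip(camera, target))
    return (x, y, z)


rng = random.Random(0)
camera = (1.0, 0.0, 1.5)  # hypothetical camera position
target = (0.0, 0.0, 0.9)  # hypothetical bowl position
candidate = sample_on_ray(camera, target, rng)
print("candidate occluder position:", candidate)
```

A real generator would still have to reject candidates that collide with scene objects and re-render the scene to confirm that the target is actually hidden.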

The high-level flow:

Original LIBERO task
        |
        v
parse BDDL -> find task-relevant targets
        |
        v
sample occluder along camera-to-target ray
        |
        v
render + visibility check + collision check
        |
        v
replay original demonstration
        |
        v
keep only executable occluded task
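The visibility check in this flow can be approximated by comparing the target's segmentation mask with and without the occluder. The following is a minimal sketch assuming binary masks stored as nested lists. The paper's exact visibility metric is not reproduced here, and the cut-offs in `severity` are hypothetical placeholders rather than the benchmark's thresholds.

```python
from typing import Sequence

Mask = Sequence[Sequence[bool]]


def count_pixels(mask: Mask) -> int:
    """Count the pixels that belong to the target."""
    return sum(1 for row in mask for px in row if px)


def occlusion_ratio(free_mask: Mask, visible_mask: Mask) -> float:
    """Fraction of the target's pixels hidden by the occluder.

    free_mask: target pixels rendered without the occluder.
    visible_mask: target pixels still visible with the occluder in place.
    """
    total = count_pixels(free_mask)
    if total == 0:
        raise ValueError("target is not visible even without the occluder")
    return 1.0 - count_pixels(visible_mask) / total


def severity(ratio: float, light_max: float = 0.3, medium_max: float = 0.6) -> str:
    """Bucket an occlusion ratio; the default cut-offs are illustrative only."""
    if ratio <= light_max:
        return "light"
    if ratio <= medium_max:
        return "medium"
    return "heavy"


free_mask = [[True] * 5, [True] * 5]             # 10 target pixels
visible_mask = [[True, True, True, False, False],
                [False] * 5]                     # 3 pixels remain visible
ratio = occlusion_ratio(free_mask, visible_mask)
print(f"occlusion ratio: {ratio:.2f}, severity: {severity(ratio)}")
```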

The benchmark labels occlusion by target type:

Occlusion type	Hidden target	What the policy must infer
Manipulated object	object to grasp or move	object identity, pose, grasp point
Receptacle	drawer, bowl, bin, placement region	target pose, affordance, contact area
Dual	both object and receptacle	source and destination are both partially missing

The paper reports 2,000 occluded task instances across the four suites, with 500 instances per suite. Severity is grouped into light, medium, and heavy. The public repository releases the final benchmark assets as BDDL and init files. The benchmark-generation code is not included, so a practical user should treat the repo as a way to install released occluded assets and run VIM training/evaluation scripts, not as a complete recipe for recreating every occlusion-generation step.

VIM in one sentence

VIM does not require an extra camera at deployment time. Instead, it learns from paired views during training: a primary third-person view and a complementary view, usually the wrist or gripper camera. During inference, the model receives only the occluded primary view, imagines the complementary view, then predicts actions using both the observed and imagined evidence.

occluded primary image + instruction
        |
        v
world-model / UniVLA-derived generator
        |
        +------------------------+
        |                        |
        v                        v
imagined gripper view       observed primary view
        |                        |
        +-----------+------------+
                    |
                    v
          unified action prediction
                    |
                    v
             action tokens / robot actions

The imagined view does not need to be a photorealistic demo image. It needs to provide useful action evidence. If the primary view barely sees the bowl, an imagined wrist view can recover relative position cues. If a target receptacle is partially hidden, the imagined view can help the policy preserve a plausible estimate of the placement region.
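The two-step inference flow can be written down as a small interface sketch. Everything here is hypothetical scaffolding: `VimPolicySketch`, `imagine_view`, and `predict_actions` are illustrative names, not the repository's API, and the stand-in functions only make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]
Action = List[float]


@dataclass
class VimPolicySketch:
    """Two-step inference: imagine the complementary view, then act on both."""
    imagine_view: Callable[[Tokens, str], Tokens]
    predict_actions: Callable[[Tokens, Tokens, str], Action]

    def act(self, primary_tokens: Tokens, instruction: str) -> Action:
        # Step 1: generate complementary-view tokens from the occluded primary view.
        imagined = self.imagine_view(primary_tokens, instruction)
        # Step 2: condition action prediction on observed and imagined evidence.
        return self.predict_actions(primary_tokens, imagined, instruction)


# Stand-in callables so the sketch runs; a real system would call the trained model.
def fake_imagine(primary: Tokens, instruction: str) -> Tokens:
    return [t + 1000 for t in primary]


def fake_predict(primary: Tokens, imagined: Tokens, instruction: str) -> Action:
    return [0.0] * 7  # placeholder action vector; the length is arbitrary here


policy = VimPolicySketch(fake_imagine, fake_predict)
print(policy.act([1, 2, 3], "put the bowl into the drawer"))
```

The point of the structure is that no second camera appears anywhere in `act`: the only visual input is the primary view.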

The paper trains VIM in two stages:

Stage	Objective	Main loss	Why it matters
Stage 1	generate the complementary view	visual content/special loss	builds a stable imagination interface
Stage 2	jointly train view generation and action prediction	visual loss + action loss	aligns imagined tokens with control

The ablations are important. Removing the Stage-2 view loss collapses performance, and skipping Stage-1 view training also hurts. That means generated visual evidence is not just a side artifact. The view-generation and action-generation sequence must be trained together so the policy can actually exploit imagined visual tokens.

Environment setup

Treat this as a research setup, not a lightweight package install. The paper trains with 8 H100 GPUs and global batch size 192. If you only want to understand the benchmark, start by installing the assets and running a tiny evaluation/debug pass. On a local machine without serious GPU capacity, you can still inspect BDDL files, init states, suite names, and data schema.

You need these pieces:

Component	Role
Upstream LIBERO	simulator, BDDL loader, task environments
LIBERO-Occ repo	occluded BDDL/init assets, train/eval scripts
MuJoCo / OSMesa	headless rendering
CUDA PyTorch	VIM training and evaluation
Emu3 vision tokenizer and VQ checkpoints	image tokenization for VIM
FAST action tokenizer	action-token training pipeline
UniVLA-derived code	world-model/action generation backbone bundled in the repo

Install upstream LIBERO first:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
conda create -n libero-occ python=3.10
conda activate libero-occ
pip install -e .

Then install LIBERO-Occ. The current README includes a placeholder clone URL in one snippet, so use the real repository URL:

cd /path/to/workspace
git clone https://github.com/litsh/Libero-Occ.git
cd Libero-Occ

conda activate libero-occ

# Install the CUDA-matched PyTorch build first, for example:
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt

Install the benchmark assets into your LIBERO checkout:

export LIBERO_ROOT=/path/to/LIBERO
bash scripts/setup/install_libero_occ_assets.sh

The installer copies:

benchmark_assets/bddl_files/libero_*_occluded
  -> $LIBERO_ROOT/libero/libero/bddl_files/

benchmark_assets/init_files/libero_*_occluded
  -> $LIBERO_ROOT/libero/libero/init_files/

After installation, LIBERO should resolve these suite names:

libero_spatial_occluded
libero_goal_occluded
libero_object_occluded
libero_10_occluded

For headless servers, set:

export MUJOCO_GL=osmesa
export NUMBA_DISABLE_JIT=1

Verify assets before training

Before spending GPU time, check that the files landed in the expected locations:

ls "$LIBERO_ROOT/libero/libero/bddl_files/libero_goal_occluded" | head
ls "$LIBERO_ROOT/libero/libero/init_files/libero_goal_occluded" | head

Check that Python can import LIBERO:

python - <<'PY'
import libero
print("LIBERO import OK:", libero.__file__)
PY

A common mistake is pointing LIBERO_ROOT to the wrong directory level. It should point to the upstream LIBERO checkout that contains libero/libero/bddl_files. If assets are copied into the wrong place, evaluation will fail with missing suite or missing BDDL errors.
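The directory-level mistake described above is easy to catch with a short script. This sketch checks only the layout stated in this guide (`libero/libero/bddl_files` and `libero/libero/init_files` under `LIBERO_ROOT`); the helper name `missing_asset_dirs` is hypothetical, and the script does not verify that the files themselves are valid.

```python
import os
from pathlib import Path
from typing import List

SUITES = [
    "libero_spatial_occluded",
    "libero_goal_occluded",
    "libero_object_occluded",
    "libero_10_occluded",
]


def missing_asset_dirs(libero_root: str) -> List[str]:
    """Return suite directories that are absent or empty under libero_root."""
    base = Path(libero_root) / "libero" / "libero"
    missing = []
    for kind in ("bddl_files", "init_files"):
        for suite in SUITES:
            path = base / kind / suite
            # An existing but empty directory is also treated as a failed install.
            if not path.is_dir() or not any(path.iterdir()):
                missing.append(str(path))
    return missing


root = os.environ.get("LIBERO_ROOT")
if root:
    problems = missing_asset_dirs(root)
    for path in problems:
        print("missing or empty:", path)
    if not problems:
        print("all occluded suite directories found")
else:
    print("LIBERO_ROOT is not set")
```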

Prepare Stage 1 and Stage 2 data

The public scripts expect two metadata files:

STAGE1_DATA_PATH=/path/to/stage1_multiview_meta.pkl
STAGE2_DATA_PATH=/path/to/stage2_multiview_meta.pkl

Conceptually, Stage 1 samples need a primary image and a complementary image for view generation. Stage 2 samples also need expert action chunks or action tokens for joint control training. If you already use UniVLA or RoboVLMs, export metadata in the format expected by the bundled univla/ code. If you are new to the codebase, inspect the data loader before building a large dataset:

rg -n "STAGE1_DATA_PATH|stage1_multiview|PERSPECTIVE_IMAGE_KEY|data_path" univla scripts

The conceptual schema is:

sample:
  instruction: "put the bowl into the drawer"
  primary_image: third-person RGB frame
  gripper_image: wrist/gripper RGB frame
  proprio: robot state, if used by your pipeline
  actions: expert action chunk, used in Stage 2

The default complementary image key is gripper_image:

PERSPECTIVE_IMAGE_KEY=gripper_image
PERSPECTIVE_VIEW_NAME=gripper

If your dataset uses a different key such as robot0_eye_in_hand_image or wrist_rgb, pass an environment variable instead of editing the script:

export PERSPECTIVE_IMAGE_KEY=robot0_eye_in_hand_image
export PERSPECTIVE_VIEW_NAME=gripper
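The conceptual schema and the key override can be combined into a quick pre-flight check on a single sample. This is a sketch under assumptions: the field names `instruction`, `primary_image`, and `actions` come from the conceptual schema above and may not match the real loader's keys, and `missing_fields` is a hypothetical helper.

```python
import os
from typing import Any, List, Mapping


def complementary_key(env: Mapping[str, str] = os.environ) -> str:
    """Resolve the complementary image key, falling back to the documented default."""
    return env.get("PERSPECTIVE_IMAGE_KEY", "gripper_image")


def missing_fields(sample: Mapping[str, Any], stage: int, image_key: str) -> List[str]:
    """List the fields a sample lacks for the given training stage (1 or 2)."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    required = ["instruction", "primary_image", image_key]
    if stage == 2:
        # Stage 2 additionally needs expert actions for joint control training.
        required.append("actions")
    return [k for k in required if sample.get(k) is None]


sample = {
    "instruction": "put the bowl into the drawer",
    "primary_image": "frame_0000_agentview.png",  # placeholder value
    "gripper_image": "frame_0000_wrist.png",      # placeholder value
}
key = complementary_key({})  # empty mapping: use the default key
print("stage 1 missing:", missing_fields(sample, 1, key))
print("stage 2 missing:", missing_fields(sample, 2, key))
```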

Train VIM Stage 1

Stage 1 starts from a UniVLA/VIM world-model checkpoint and trains viewpoint imagination only. The public script defaults to:

Parameter	Default
`MAX_STEPS`	`4000`
`LEARNING_RATE`	`8e-5`
`GLOBAL_BATCH_SIZE`	`192`
`PER_GPU_BATCH_SIZE`	`3`
`NGPUS`	`8`
visual loss	enabled
action loss	disabled

Run:

cd /path/to/Libero-Occ
conda activate libero-occ

export WORLD_MODEL_CKPT=/path/to/WORLD_MODEL_POSTTRAIN
export STAGE1_DATA_PATH=/path/to/stage1_multiview_meta.pkl
export ACTION_TOKENIZER_PATH=/path/to/fast

bash scripts/train/train_vim_stage1.sh

For a small debug pass:

export NGPUS=1
export GLOBAL_BATCH_SIZE=3
export PER_GPU_BATCH_SIZE=1
export MAX_STEPS=20
export EXP_NAME=debug_vim_stage1
bash scripts/train/train_vim_stage1.sh

The debug goal is not to match the paper. It is to verify the data loader, tokenizer path, checkpoint path, CUDA setup, and output directory. Once the pipeline runs, restore sensible GPU and batch settings.
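The batch settings are linked by simple arithmetic, which helps when scaling down. The sketch below assumes the usual convention that the global batch equals per-GPU batch times GPU count times gradient-accumulation steps. Whether the public script derives accumulation exactly this way is an assumption, and `grad_accum_steps` is a hypothetical helper.

```python
def grad_accum_steps(global_batch: int, per_gpu_batch: int, ngpus: int) -> int:
    """Accumulation steps needed so that one optimizer step sees global_batch samples."""
    per_step = per_gpu_batch * ngpus
    if per_step <= 0:
        raise ValueError("per_gpu_batch and ngpus must be positive")
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be a multiple of per_gpu_batch * ngpus")
    return global_batch // per_step


# Default settings: 192 / (3 * 8)
print("default:", grad_accum_steps(192, 3, 8))
# Debug settings: 3 / (1 * 1)
print("debug:", grad_accum_steps(3, 1, 1))
```

Under this convention the default configuration accumulates over 8 micro-batches and the debug configuration over 3, so a combination that does not divide evenly is worth catching before launch.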

Train VIM Stage 2

Stage 2 loads the Stage-1 checkpoint and jointly trains viewpoint imagination plus action prediction. The default settings include:

Parameter	Default
`MAX_STEPS`	`6000`
`LEARNING_RATE`	`4e-5`
`VISUAL_CONTENT_LOSS_WEIGHT`	`0.5`
`ACTION_CONTENT_LOSS_WEIGHT`	`1.0`
`ACTION_SPECIAL_LOSS_WEIGHT`	`0.2`
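To make the three weights concrete, here is a sketch that assumes they combine the loss terms as a plain weighted sum. That combination rule is an assumption about the training code, not something confirmed by the paper or repository, and `stage2_loss` is a hypothetical name.

```python
def stage2_loss(visual_content: float, action_content: float, action_special: float,
                w_visual: float = 0.5, w_action: float = 1.0,
                w_special: float = 0.2) -> float:
    """Weighted sum of loss terms, using the documented Stage-2 default weights."""
    return (w_visual * visual_content
            + w_action * action_content
            + w_special * action_special)


# Illustrative per-term loss values, not measurements from any run.
total = stage2_loss(visual_content=2.0, action_content=1.0, action_special=0.5)
print(f"total loss: {total:.3f}")
```

Read this way, the defaults give action-content tokens twice the weight of visual-content tokens while still keeping the view-generation signal active during joint training.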

Run:

export STAGE1_CKPT=/path/to/UniVLA/logs/vim_stage1_gripper/checkpoint-4000
export STAGE2_DATA_PATH=/path/to/stage2_multiview_meta.pkl
export ACTION_TOKENIZER_PATH=/path/to/fast

bash scripts/train/train_vim_stage2.sh

Troubleshooting checklist:

Symptom	Common cause	What to check
visual loss does not decrease	wrong complementary image key	print one batch from the dataset loader
action loss becomes NaN	tokenizer or action normalization mismatch	inspect action ranges and token ids
checkpoint load fails	wrong Stage-1 path	verify the folder contains real model files
out of memory	batch too large	lower `PER_GPU_BATCH_SIZE`; keep gradient checkpointing
evaluation is very low	train/eval camera key mismatch	compare `PERSPECTIVE_IMAGE_KEY` and `PERSPECTIVE_OBS_KEY`
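For the NaN row in the checklist, a quick range report on raw actions often shows the problem before tokenization. This is a generic sketch that assumes actions are plain nested lists of floats; `action_report` is a hypothetical helper, not part of the repository.

```python
import math
from typing import Dict, List, Sequence, Union


def action_report(actions: Sequence[Sequence[float]]) -> Dict[str, Union[List[float], int]]:
    """Per-dimension min/max over finite values, plus a count of NaN/inf entries."""
    if not actions:
        raise ValueError("no actions given")
    dims = len(actions[0])
    mins = [math.inf] * dims
    maxs = [-math.inf] * dims
    non_finite = 0
    for step in actions:
        if len(step) != dims:
            raise ValueError("inconsistent action dimension")
        for i, value in enumerate(step):
            if not math.isfinite(value):
                non_finite += 1  # NaN or inf: excluded from the min/max
                continue
            mins[i] = min(mins[i], value)
            maxs[i] = max(maxs[i], value)
    return {"min": mins, "max": maxs, "non_finite": non_finite}


chunk = [[0.0, 1.0], [0.5, float("nan")], [-0.25, 0.75]]  # made-up values
print(action_report(chunk))
```

Values far outside the range the action tokenizer was fitted on, or any non-finite entries, are the first things to rule out.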

Inference and evaluation on LIBERO-Occ

The evaluation script launches RoboVLMs LIBERO evaluation with perspective generation enabled. Required variables:

export LIBERO_ROOT=/path/to/LIBERO
export VIM_CKPT=/path/to/vim/checkpoint
export VISION_HUB=/path/to/Emu3-VisionTokenizer
export VQ_HUB=/path/to/Emu3-Stage1
export ACTION_TOKENIZER_PATH=/path/to/fast

Run all four suites:

TASK_SUITE_NAME=libero_spatial_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_goal_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_object_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_10_occluded bash scripts/eval/eval_vim_libero_occ.sh
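The same four runs can be driven from Python, which makes it easier to log results per suite. This sketch only sets `TASK_SUITE_NAME` and calls the evaluation script named above; the helper names are hypothetical, and the default `dry_run=True` means nothing is launched until you opt in.

```python
import os
import subprocess
from typing import Dict, List

SUITES = [
    "libero_spatial_occluded",
    "libero_goal_occluded",
    "libero_object_occluded",
    "libero_10_occluded",
]
EVAL_SCRIPT = "scripts/eval/eval_vim_libero_occ.sh"


def eval_env(suite: str, base: Dict[str, str]) -> Dict[str, str]:
    """Copy the base environment and select one occluded suite."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite}")
    env = dict(base)
    env["TASK_SUITE_NAME"] = suite
    return env


def run_all(dry_run: bool = True) -> List[str]:
    """Run (or just list) one evaluation per suite, stopping on the first failure."""
    done = []
    for suite in SUITES:
        env = eval_env(suite, dict(os.environ))
        if not dry_run:
            # Run from the Libero-Occ checkout so the relative script path resolves.
            subprocess.run(["bash", EVAL_SCRIPT], env=env, check=True)
        done.append(suite)
    return done


print(run_all(dry_run=True))
```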

Useful variables:

Variable	Default	Meaning
`GPUS_PER_NODE`	`8`	GPUs used by torchrun evaluation
`NUM_TRIALS_PER_TASK`	`1`	rollouts per task instance
`CAMERA_RESOLUTION`	`200`	LIBERO camera resolution
`PERSPECTIVE_OBS_KEY`	`robot0_eye_in_hand_image`	complementary view key for reference/debug
`CACHE_ROOT`	logs path	output and cache directory

Minimal debug run:

export GPUS_PER_NODE=1
export NUM_TRIALS_PER_TASK=1
export CAMERA_RESOLUTION=200
export TASK_SUITE_NAME=libero_goal_occluded

bash scripts/eval/eval_vim_libero_occ.sh

At deployment time, VIM is meant to use only the primary observation as input and to generate the complementary view internally. The PERSPECTIVE_OBS_KEY setting exists in evaluation because the simulated environment still provides a real complementary view, which the code path can use as a reference when managing or debugging perspective generation.

Main results from the paper

The paper compares UniVLA, OpenVLA, OpenVLA-OFT, π0, π0.5, and VIM on original LIBERO and LIBERO-Occ. The headline table:

Method	Original LIBERO avg	LIBERO-Occ avg	Drop
UniVLA	88.25	57.10	31.15
OpenVLA	92.65	40.65	52.00
OpenVLA-OFT	95.75	47.95	47.80
π0	89.25	49.30	39.95
π0.5	90.00	40.55	49.45
VIM	92.00	65.05	26.95

The important point is not only that VIM beats the strongest baseline on LIBERO-Occ (UniVLA) by 7.95 points. Every baseline loses between roughly 31 and 52 points when moving from original LIBERO to LIBERO-Occ, which shows that occlusion is a serious hidden failure mode for modern VLA systems; VIM has the smallest drop at 26.95 points. A policy can look strong on standard LIBERO and still be brittle when the task object or target region is physically hidden.
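The drop column and the 7.95-point margin can be re-derived from the two average columns of the headline table, which is a useful sanity check when you add your own policy to the comparison. The numbers below are copied from that table; only the variable names are new.

```python
# (original LIBERO avg, LIBERO-Occ avg), copied from the headline table
RESULTS = {
    "UniVLA": (88.25, 57.10),
    "OpenVLA": (92.65, 40.65),
    "OpenVLA-OFT": (95.75, 47.95),
    "π0": (89.25, 49.30),
    "π0.5": (90.00, 40.55),
    "VIM": (92.00, 65.05),
}

# Drop = original average minus occluded average, rounded to table precision.
drops = {name: round(orig - occ, 2) for name, (orig, occ) in RESULTS.items()}

baselines = {name: occ for name, (_, occ) in RESULTS.items() if name != "VIM"}
best_baseline = max(baselines, key=baselines.get)
margin = round(RESULTS["VIM"][1] - baselines[best_baseline], 2)

for name, value in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name:12s} drop = {value:.2f}")
print(f"VIM margin over {best_baseline} on LIBERO-Occ: {margin:.2f}")
```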

Grouped by occlusion target:

Method	Manipulated object	Receptacle	Dual	Overall
UniVLA	47.78	81.87	28.00	57.10
OpenVLA	28.89	67.47	13.43	40.65
OpenVLA-OFT	35.22	77.87	16.57	47.95
π0	40.78	70.80	25.14	49.30
π0.5	29.78	67.87	9.71	40.55
VIM	54.67	91.33	35.43	65.05

Dual occlusion remains hard. VIM improves the result, but it does not solve occlusion completely. If both the manipulated object and receptacle are missing from the primary view, the imagined view depends heavily on the learned prior. The paper also notes several limitations: the benchmark is still simulated, VIM needs paired complementary-view data during training, and imagined views can be inaccurate when the primary view contains too little evidence.

When should you use LIBERO-Occ?

Use LIBERO-Occ if your work involves:

Goal	Why the benchmark helps
fine-tuning OpenVLA/UniVLA/π0-style policies	measure whether success collapses when objects are hidden
designing multi-view policies	compare true complementary views with imagined views
manipulation in cluttered scenes	benchmark object-blocking-object failures
world models for control	test whether generated visual evidence improves action prediction

Do not use LIBERO-Occ as your only production evidence. It does not replace real-robot tests with sensor noise, calibration drift, latency, object variation, and contact dynamics. It is best used as a controlled stress test that tells you whether a policy only works when everything important is visible.

A beginner-friendly path

If you are new to this stack, do not start with full training. Follow this order:

1. Install upstream LIBERO and run one original LIBERO task.
2. Clone LIBERO-Occ and install the occluded assets.
3. Verify that the four occluded suite names load.
4. Run a tiny evaluation/debug pass with a checkpoint you control.
5. Read the metadata loader and create stage1_multiview_meta.pkl from a few episodes.
6. Run Stage 1 for 20 debug steps to verify viewpoint generation.
7. Run Stage 2 for 20 debug steps to verify action-token training.
8. Scale only after the small path is clean.

Experiment log template:

experiment: vim_goal_occ_debug
suite: libero_goal_occluded
primary_view: agentview / third-person
imagined_view_target: gripper_image
checkpoint: vim_stage2_gripper_weighted
trials_per_task: 1
camera_resolution: 200
metrics:
  success_rate:
  failure_examples:
    - object hidden by occluder
    - receptacle boundary ambiguous
    - gripper self-occlusion
notes:
  compare with same policy on original LIBERO goal suite

Takeaways

LIBERO-Occ is a useful reminder that VLA manipulation is not only language grounding and object recognition. A robot must act when part of the scene is missing from its camera. The benchmark turns occlusion into a controlled variable: target type is known, severity is measured, and the task remains executable. VIM is a practical response because it does not require an extra deployment camera; it learns to convert generative visual priors into complementary evidence for action prediction.

For real robot teams, the lesson is straightforward. Before deploying a VLA policy to a cluttered workbench, logistics cell, or educational manipulation setup, test occlusion explicitly. If the success rate drops sharply, collect cluttered demonstrations, add wrist-view supervision, or try a VIM-style viewpoint imagination pipeline before concluding that the model cannot understand the task.

Nguyễn Anh Tuấn
