
Run LIBERO-Occ VIM for Occluded VLA

A practical guide to LIBERO-Occ and Viewpoint Imagination for evaluating, training, and running VLA policies under occlusion.

Nguyễn Anh Tuấn | June 12, 2026 | 14 min read

LIBERO-Occ is a new benchmark for a very practical question: can a VLA manipulation policy stay reliable when the object or target region is occluded? Most robot manipulation benchmarks quietly assume that the task-relevant object, receptacle, and contact region are visible. That assumption is convenient in simulation and clean lab demos. It often breaks on real workbenches, where a bowl can hide behind a box, a drawer can hide the placement region, and the robot arm can block its own camera.

This guide explains how to read and run the paper "LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination", posted on arXiv on 2026-06-09. The official repository is litsh/Libero-Occ. The paper contributes two things: the LIBERO-Occ benchmark for measuring occlusion robustness, and Viewpoint Imagination (VIM), a method that generates a complementary imagined view from an occluded primary observation and conditions action prediction on both observed and imagined evidence.

Keep these sources open while you work: the paper on arXiv and the official repository at https://github.com/litsh/Libero-Occ.

What problem does LIBERO-Occ expose?

A common VLA policy pipeline looks like this:

RGB image(s) + language instruction + optional proprioception
        |
        v
vision-language backbone
        |
        v
action tokenizer / diffusion head / flow head
        |
        v
robot action chunk

This pipeline works best when the image contains enough task-relevant evidence. If the instruction is "put the bowl into the drawer", the policy usually needs to see the bowl, the drawer, the handle, and enough geometry to plan the approach and placement. If the bowl is 60% hidden behind another object, or the drawer target region is blocked from the main camera, the input no longer contains complete geometric information. The task becomes a partially observable decision problem, not just a noisy image-recognition problem.

The paper separates scene-induced occlusion from common visual perturbations:

| Change type | Example | Why it differs from occlusion |
| --- | --- | --- |
| Lighting | darker image or strong shadow | the object remains visible |
| Background shift | different table texture | task evidence is still present |
| Image noise | blur, compression, mild crop | evidence is degraded but not physically missing |
| Scene-induced occlusion | object or receptacle blocked by another object | the action-relevant evidence is absent from the primary view |

LIBERO-Occ is useful because the occluder is not a 2D mask painted on top of the image. The benchmark inserts physical occluding objects into the 3D LIBERO scene, checks collision validity, renders the result, and replays the original demonstration to ensure the task is still executable. If a policy fails, the failure is therefore more likely to reflect partial observability than an invalid task.

How the benchmark is built

LIBERO-Occ extends the four standard LIBERO suites:

| Suite | What it tests |
| --- | --- |
| libero_spatial_occluded | generalization over spatial layouts |
| libero_goal_occluded | generalization over goals and receptacles |
| libero_object_occluded | generalization over objects |
| libero_10_occluded | long-horizon LIBERO-10 tasks |

The paper describes a three-step generation pipeline:

  1. Occlusion target identification: parse the BDDL task specification to find task-relevant objects, receptacles, or goal regions.
  2. View-aware occluder placement: sample a physical occluder in 3D along the camera-to-target ray, so the target is blocked by scene geometry rather than an image-space trick.
  3. Occlusion validity verification: render the scene, measure visibility, reject collisions, and replay the original demonstration to keep only executable task variants.

The high-level flow:

Original LIBERO task
        |
        v
parse BDDL -> find task-relevant targets
        |
        v
sample occluder along camera-to-target ray
        |
        v
render + visibility check + collision check
        |
        v
replay original demonstration
        |
        v
keep only executable occluded task
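Since the benchmark-generation code is not part of the public release, the following is only a geometric sketch of the placement step, not the authors' implementation. It picks a candidate occluder centre on the segment between the camera and the target. The function names, the `lo`/`hi` bounds, and the example coordinates are all illustrative assumptions.

```python
import random

def point_on_ray(camera_pos, target_pos, fraction):
    """Return the point at `fraction` of the way from camera to target."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return tuple(c + fraction * (t - c) for c, t in zip(camera_pos, target_pos))

def sample_occluder_position(camera_pos, target_pos, lo=0.5, hi=0.9, rng=random):
    """Sample a candidate occluder centre between the camera and the target.

    lo/hi bound how far along the ray the occluder may sit. They are
    illustrative defaults, not values from the paper.
    """
    return point_on_ray(camera_pos, target_pos, rng.uniform(lo, hi))

camera = (1.0, 0.0, 1.5)   # hypothetical camera position
target = (0.0, 0.0, 0.9)   # hypothetical target object position
candidate = sample_occluder_position(camera, target, rng=random.Random(0))
print(candidate)
```

A candidate like this would still have to pass the render, visibility, collision, and replay checks described above before it counts as a valid occluded task.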

The benchmark labels occlusion by target type:

| Occlusion type | Hidden target | What the policy must infer |
| --- | --- | --- |
| Manipulated object | object to grasp or move | object identity, pose, grasp point |
| Receptacle | drawer, bowl, bin, placement region | target pose, affordance, contact area |
| Dual | both object and receptacle | source and destination are both partially missing |

The paper reports 2,000 occluded task instances across the four suites, with 500 instances per suite. Severity is grouped into light, medium, and heavy. The public repository releases the final benchmark assets as BDDL and init files. The benchmark-generation code is not included, so a practical user should treat the repo as a way to install released occluded assets and run VIM training/evaluation scripts, not as a complete recipe for recreating every occlusion-generation step.

VIM in one sentence

VIM does not require an extra camera at deployment time. Instead, it learns from paired views during training: a primary third-person view and a complementary view, usually the wrist or gripper camera. During inference, the model receives only the occluded primary view, imagines the complementary view, then predicts actions using both the observed and imagined evidence.

occluded primary image + instruction
        |
        v
world-model / UniVLA-derived generator
        |
        +------------------------+
        |                        |
        v                        v
imagined gripper view       observed primary view
        |                        |
        +-----------+------------+
                    |
                    v
          unified action prediction
                    |
                    v
             action tokens / robot actions

The imagined view does not need to be a photorealistic demo image. It needs to provide useful action evidence. If the primary view barely sees the bowl, an imagined wrist view can recover relative position cues. If a target receptacle is partially hidden, the imagined view can help the policy preserve a plausible estimate of the placement region.
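To make the data flow concrete, here is a minimal sketch of one inference step. The two callables, `imagine_view` and `predict_actions`, are hypothetical stand-ins for the model's generation and action heads and do not correspond to functions in the repository. The point is only that the step takes no wrist-camera input.

```python
from typing import Callable, Sequence

def vim_step(
    primary_image,
    instruction: str,
    imagine_view: Callable,      # hypothetical: (image, text) -> imagined view
    predict_actions: Callable,   # hypothetical: (image, imagined, text) -> actions
) -> Sequence[float]:
    """One control step: imagine the complementary view, then act on both."""
    imagined = imagine_view(primary_image, instruction)
    return predict_actions(primary_image, imagined, instruction)

# Stand-in callables so the sketch runs without a model.
def fake_imagine(image, text):
    return {"source": "imagined", "from": image}

def fake_policy(image, imagined, text):
    assert imagined["source"] == "imagined"
    return [0.0] * 7   # placeholder action vector

actions = vim_step("primary_frame", "put the bowl into the drawer",
                   fake_imagine, fake_policy)
print(len(actions))
```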

The paper trains VIM in two stages:

| Stage | Objective | Main loss | Why it matters |
| --- | --- | --- | --- |
| Stage 1 | generate the complementary view | visual content/special loss | builds a stable imagination interface |
| Stage 2 | jointly train view generation and action prediction | visual loss + action loss | aligns imagined tokens with control |

The ablations are important. Removing the Stage-2 view loss collapses performance, and skipping Stage-1 view training also hurts. That means generated visual evidence is not just a side artifact. The view-generation and action-generation sequence must be trained together so the policy can actually exploit imagined visual tokens.


Environment setup

Treat this as a research setup, not a lightweight package install. The paper trains with 8 H100 GPUs and global batch size 192. If you only want to understand the benchmark, start by installing the assets and running a tiny evaluation/debug pass. On a local machine without serious GPU capacity, you can still inspect BDDL files, init states, suite names, and data schema.

You need these pieces:

| Component | Role |
| --- | --- |
| Upstream LIBERO | simulator, BDDL loader, task environments |
| LIBERO-Occ repo | occluded BDDL/init assets, train/eval scripts |
| MuJoCo / OSMesa | headless rendering |
| CUDA PyTorch | VIM training and evaluation |
| Emu3 vision tokenizer and VQ checkpoints | image tokenization for VIM |
| FAST action tokenizer | action-token training pipeline |
| UniVLA-derived code | world-model/action generation backbone bundled in the repo |

Install upstream LIBERO first:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
conda create -n libero-occ python=3.10
conda activate libero-occ
pip install -e .

Then install LIBERO-Occ. The current README includes a placeholder clone URL in one snippet, so use the real repository URL:

cd /path/to/workspace
git clone https://github.com/litsh/Libero-Occ.git
cd Libero-Occ

conda activate libero-occ

# Install the CUDA-matched PyTorch build first, for example:
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt

Install the benchmark assets into your LIBERO checkout:

export LIBERO_ROOT=/path/to/LIBERO
bash scripts/setup/install_libero_occ_assets.sh

The installer copies:

benchmark_assets/bddl_files/libero_*_occluded
  -> $LIBERO_ROOT/libero/libero/bddl_files/

benchmark_assets/init_files/libero_*_occluded
  -> $LIBERO_ROOT/libero/libero/init_files/

After installation, LIBERO should resolve these suite names:

libero_spatial_occluded
libero_goal_occluded
libero_object_occluded
libero_10_occluded

For headless servers, set:

export MUJOCO_GL=osmesa
export NUMBA_DISABLE_JIT=1

Verify assets before training

Before spending GPU time, check that the files landed in the expected locations:

ls $LIBERO_ROOT/libero/libero/bddl_files/libero_goal_occluded | head
ls $LIBERO_ROOT/libero/libero/init_files/libero_goal_occluded | head

Check that Python can import LIBERO:

python - <<'PY'
import libero
print("LIBERO import OK:", libero.__file__)
PY

A common mistake is pointing LIBERO_ROOT to the wrong directory level. It should point to the upstream LIBERO checkout that contains libero/libero/bddl_files. If assets are copied into the wrong place, evaluation will fail with missing suite or missing BDDL errors.
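A small script can catch the wrong-directory-level mistake before any evaluation starts. This is a sketch that only checks the directory layout described above. `missing_asset_dirs` is a helper written for this guide, not part of LIBERO or the LIBERO-Occ repository.

```python
import os
from pathlib import Path

SUITES = [
    "libero_spatial_occluded",
    "libero_goal_occluded",
    "libero_object_occluded",
    "libero_10_occluded",
]

def missing_asset_dirs(libero_root):
    """Return the expected suite directories that do not exist under libero_root."""
    base = Path(libero_root) / "libero" / "libero"
    expected = [base / kind / suite
                for kind in ("bddl_files", "init_files")
                for suite in SUITES]
    return [str(p) for p in expected if not p.is_dir()]

root = os.environ.get("LIBERO_ROOT", "")
missing = missing_asset_dirs(root) if root else ["LIBERO_ROOT is not set"]
print("OK" if not missing else "Missing:\n" + "\n".join(missing))
```

If the script lists all eight directories as missing, LIBERO_ROOT most likely points one level too high or too low.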

Prepare Stage 1 and Stage 2 data

The public scripts expect two metadata files:

STAGE1_DATA_PATH=/path/to/stage1_multiview_meta.pkl
STAGE2_DATA_PATH=/path/to/stage2_multiview_meta.pkl

Conceptually, Stage 1 samples need a primary image and a complementary image for view generation. Stage 2 samples also need expert action chunks or action tokens for joint control training. If you already use UniVLA or RoboVLMs, export metadata in the format expected by the bundled univla/ code. If you are new to the codebase, inspect the data loader before building a large dataset:

rg -n "STAGE1_DATA_PATH|stage1_multiview|PERSPECTIVE_IMAGE_KEY|data_path" univla scripts
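If you already have a metadata pickle from another project, it also helps to look at its shape before comparing it with the loader. The sketch below prints a shallow summary of whatever object the file contains. It makes no assumption about the real metadata layout, and `describe` and `describe_file` are helper names invented for this guide.

```python
import pickle

def describe(obj, depth=0, max_depth=3):
    """Return a short, nested description of a loaded metadata object."""
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}{type(obj).__name__}"]
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} keys"]
        for key in list(obj)[:10]:          # show at most ten keys per level
            lines.append(f"{pad}  {key!r}:")
            lines.extend(describe(obj[key], depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj:                             # describe only the first element
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]

def describe_file(path):
    """Load a pickle and describe it. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return describe(pickle.load(fh))

# In-memory demo with a made-up structure:
print("\n".join(describe({"episodes": [{"instruction": "demo"}]})))
```

To inspect a real file, call `print("\n".join(describe_file("/path/to/stage1_multiview_meta.pkl")))`.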

The conceptual schema is:

sample:
  instruction: "put the bowl into the drawer"
  primary_image: third-person RGB frame
  gripper_image: wrist/gripper RGB frame
  proprio: robot state, if used by your pipeline
  actions: expert action chunk, used in Stage 2

The default complementary image key is gripper_image:

PERSPECTIVE_IMAGE_KEY=gripper_image
PERSPECTIVE_VIEW_NAME=gripper

If your dataset uses a different key such as robot0_eye_in_hand_image or wrist_rgb, pass an environment variable instead of editing the script:

export PERSPECTIVE_IMAGE_KEY=robot0_eye_in_hand_image
export PERSPECTIVE_VIEW_NAME=gripper

Train VIM Stage 1

Stage 1 starts from a UniVLA/VIM world-model checkpoint and trains viewpoint imagination only. The public script defaults to:

| Parameter | Default |
| --- | --- |
| MAX_STEPS | 4000 |
| LEARNING_RATE | 8e-5 |
| GLOBAL_BATCH_SIZE | 192 |
| PER_GPU_BATCH_SIZE | 3 |
| NGPUS | 8 |
| visual loss | enabled |
| action loss | disabled |
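The three batch settings are linked. Assuming the script reaches the global batch size through gradient accumulation (a common pattern, but not something this guide has confirmed in the script), the number of accumulation steps follows from simple arithmetic. `grad_accum_steps` is a helper written for illustration.

```python
def grad_accum_steps(global_batch, per_gpu_batch, ngpus):
    """Accumulation steps needed so per-step samples add up to the global batch."""
    per_step = per_gpu_batch * ngpus
    if global_batch % per_step:
        raise ValueError(f"{global_batch} is not a multiple of {per_step}")
    return global_batch // per_step

# Defaults from the table: 192 / (3 * 8)
print(grad_accum_steps(192, 3, 8))  # 8
```

When you change NGPUS or PER_GPU_BATCH_SIZE, keep GLOBAL_BATCH_SIZE a multiple of their product so this division stays exact.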

Run:

cd /path/to/Libero-Occ
conda activate libero-occ

export WORLD_MODEL_CKPT=/path/to/WORLD_MODEL_POSTTRAIN
export STAGE1_DATA_PATH=/path/to/stage1_multiview_meta.pkl
export ACTION_TOKENIZER_PATH=/path/to/fast

bash scripts/train/train_vim_stage1.sh

For a small debug pass:

export NGPUS=1
export GLOBAL_BATCH_SIZE=3
export PER_GPU_BATCH_SIZE=1
export MAX_STEPS=20
export EXP_NAME=debug_vim_stage1
bash scripts/train/train_vim_stage1.sh

The debug goal is not to match the paper. It is to verify the data loader, tokenizer path, checkpoint path, CUDA setup, and output directory. Once the pipeline runs, restore sensible GPU and batch settings.

Train VIM Stage 2

Stage 2 loads the Stage-1 checkpoint and jointly trains viewpoint imagination plus action prediction. The default settings include:

| Parameter | Default |
| --- | --- |
| MAX_STEPS | 6000 |
| LEARNING_RATE | 4e-5 |
| VISUAL_CONTENT_LOSS_WEIGHT | 0.5 |
| ACTION_CONTENT_LOSS_WEIGHT | 1.0 |
| ACTION_SPECIAL_LOSS_WEIGHT | 0.2 |
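One way to read the three weights is as coefficients of a weighted sum. The sketch below assumes the loss terms combine linearly, which is an assumption of this guide. The training code may include further terms or a different reduction, so check it before relying on this.

```python
STAGE2_WEIGHTS = {
    "visual_content": 0.5,   # VISUAL_CONTENT_LOSS_WEIGHT
    "action_content": 1.0,   # ACTION_CONTENT_LOSS_WEIGHT
    "action_special": 0.2,   # ACTION_SPECIAL_LOSS_WEIGHT
}

def combined_loss(losses, weights=STAGE2_WEIGHTS):
    """Weighted sum of per-term losses. Assumes the terms combine linearly."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)

# Toy loss values, chosen only to show the arithmetic.
toy = {"visual_content": 2.0, "action_content": 1.0, "action_special": 0.5}
print(combined_loss(toy))
```

Under this reading, the action-content term carries the largest weight while the visual term is down-weighted rather than removed, which is consistent with the ablation finding that dropping the Stage-2 view loss hurts.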

Run:

export STAGE1_CKPT=/path/to/UniVLA/logs/vim_stage1_gripper/checkpoint-4000
export STAGE2_DATA_PATH=/path/to/stage2_multiview_meta.pkl
export ACTION_TOKENIZER_PATH=/path/to/fast

bash scripts/train/train_vim_stage2.sh

Troubleshooting checklist:

| Symptom | Common cause | What to check |
| --- | --- | --- |
| visual loss does not decrease | wrong complementary image key | print one batch from the dataset loader |
| action loss becomes NaN | tokenizer or action normalization mismatch | inspect action ranges and token ids |
| checkpoint load fails | wrong Stage-1 path | verify the folder contains real model files |
| out of memory | batch too large | lower PER_GPU_BATCH_SIZE; keep gradient checkpointing |
| evaluation is very low | train/eval camera key mismatch | compare PERSPECTIVE_IMAGE_KEY and PERSPECTIVE_OBS_KEY |
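For the "inspect action ranges" check, a few lines of plain Python are enough to spot unnormalized dimensions. The sketch assumes actions are expected in [-1, 1], a common convention that may not match your tokenizer, so pass different bounds if needed. `out_of_range_dims` is a helper invented for this guide.

```python
def out_of_range_dims(action_chunks, low=-1.0, high=1.0):
    """Return {dim: (min, max)} for action dimensions that leave [low, high].

    action_chunks is an iterable of equal-length action vectors. The default
    bounds assume actions normalised to [-1, 1]; adjust them to your setup.
    """
    mins, maxs = None, None
    for action in action_chunks:
        if mins is None:
            mins, maxs = list(action), list(action)
            continue
        mins = [min(m, a) for m, a in zip(mins, action)]
        maxs = [max(m, a) for m, a in zip(maxs, action)]
    if mins is None:
        return {}
    return {i: (lo, hi)
            for i, (lo, hi) in enumerate(zip(mins, maxs))
            if lo < low or hi > high}

chunks = [[0.1, -0.5, 2.0], [0.3, -0.9, 0.4]]   # toy data; dimension 2 is out of range
print(out_of_range_dims(chunks))  # {2: (0.4, 2.0)}
```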

Inference and evaluation on LIBERO-Occ

The evaluation script launches RoboVLMs LIBERO evaluation with perspective generation enabled. Required variables:

export LIBERO_ROOT=/path/to/LIBERO
export VIM_CKPT=/path/to/vim/checkpoint
export VISION_HUB=/path/to/Emu3-VisionTokenizer
export VQ_HUB=/path/to/Emu3-Stage1
export ACTION_TOKENIZER_PATH=/path/to/fast

Run all four suites:

TASK_SUITE_NAME=libero_spatial_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_goal_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_object_occluded bash scripts/eval/eval_vim_libero_occ.sh
TASK_SUITE_NAME=libero_10_occluded bash scripts/eval/eval_vim_libero_occ.sh
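If you prefer to drive the four runs from Python, for example to log results per suite, a thin wrapper is enough. This sketch only sets TASK_SUITE_NAME and calls the same script. It defaults to a dry run so nothing is launched by accident, and the helper names are illustrative.

```python
import os
import subprocess

SUITES = [
    "libero_spatial_occluded",
    "libero_goal_occluded",
    "libero_object_occluded",
    "libero_10_occluded",
]
SCRIPT = "scripts/eval/eval_vim_libero_occ.sh"

def eval_jobs(base_env=None):
    """Build one (command, environment) pair per occluded suite."""
    env = dict(os.environ if base_env is None else base_env)
    return [(["bash", SCRIPT], {**env, "TASK_SUITE_NAME": suite})
            for suite in SUITES]

def run_all(dry_run=True):
    for cmd, env in eval_jobs():
        print("TASK_SUITE_NAME=" + env["TASK_SUITE_NAME"], " ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing suite
            subprocess.run(cmd, env=env, check=True)

run_all(dry_run=True)   # prints the four commands without launching anything
```

Run it from the Libero-Occ checkout with the required variables exported, and switch to `run_all(dry_run=False)` once the printed commands look right.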

Useful variables:

| Variable | Default | Meaning |
| --- | --- | --- |
| GPUS_PER_NODE | 8 | GPUs used by torchrun evaluation |
| NUM_TRIALS_PER_TASK | 1 | rollouts per task instance |
| CAMERA_RESOLUTION | 200 | LIBERO camera resolution |
| PERSPECTIVE_OBS_KEY | robot0_eye_in_hand_image | complementary view key for reference/debug |
| CACHE_ROOT | logs path | output and cache directory |

Minimal debug run:

export GPUS_PER_NODE=1
export NUM_TRIALS_PER_TASK=1
export CAMERA_RESOLUTION=200
export TASK_SUITE_NAME=libero_goal_occluded

bash scripts/eval/eval_vim_libero_occ.sh

At deployment time, VIM is meant to use only the primary observation as input and generate the complementary view internally. The PERSPECTIVE_OBS_KEY setting in evaluation exists because the benchmark environment has reference views and the code path needs to manage perspective generation/debugging.

Main results from the paper

The paper compares UniVLA, OpenVLA, OpenVLA-OFT, π0, π0.5, and VIM on original LIBERO and LIBERO-Occ. The headline table:

| Method | Original LIBERO avg | LIBERO-Occ avg | Drop |
| --- | --- | --- | --- |
| UniVLA | 88.25 | 57.10 | 31.15 |
| OpenVLA | 92.65 | 40.65 | 52.00 |
| OpenVLA-OFT | 95.75 | 47.95 | 47.80 |
| π0 | 89.25 | 49.30 | 39.95 |
| π0.5 | 90.00 | 40.55 | 49.45 |
| VIM | 92.00 | 65.05 | 26.95 |

The important point is not only that VIM beats the strongest baseline on LIBERO-Occ, UniVLA, by 7.95 points (65.05 vs. 57.10). Every method, including VIM, loses at least 26.95 points when moving from original LIBERO to LIBERO-Occ, which shows that occlusion is a serious hidden failure mode for modern VLA systems. A policy can look strong on standard LIBERO and still be brittle when the task object or target region is physically hidden.
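The drop column and the 7.95-point margin can be recomputed directly from the two averages, which is a useful sanity check when you add your own policy to the comparison. The numbers below are copied from the headline table. The helper names are illustrative.

```python
RESULTS = {  # method: (original LIBERO avg, LIBERO-Occ avg)
    "UniVLA": (88.25, 57.10),
    "OpenVLA": (92.65, 40.65),
    "OpenVLA-OFT": (95.75, 47.95),
    "π0": (89.25, 49.30),
    "π0.5": (90.00, 40.55),
    "VIM": (92.00, 65.05),
}

def drop(method):
    """Absolute drop in points from original LIBERO to LIBERO-Occ."""
    original, occluded = RESULTS[method]
    return round(original - occluded, 2)

# Strongest baseline = highest LIBERO-Occ average among non-VIM methods.
best_baseline = max((m for m in RESULTS if m != "VIM"),
                    key=lambda m: RESULTS[m][1])
margin = round(RESULTS["VIM"][1] - RESULTS[best_baseline][1], 2)

print(best_baseline, margin)
print({m: drop(m) for m in RESULTS})
```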

Grouped by occlusion target:

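| Method | Manipulated object | Receptacle | Dual | Overall |
| --- | --- | --- | --- | --- |
| UniVLA | 47.78 | 81.87 | 28.00 | 57.10 |
| OpenVLA | 28.89 | 67.47 | 13.43 | 40.65 |
| OpenVLA-OFT | 35.22 | 77.87 | 16.57 | 47.95 |
| π0 | 40.78 | 70.80 | 25.14 | 49.30 |
| π0.5 | 29.78 | 67.87 | 9.71 | 40.55 |
| VIM | 54.67 | 91.33 | 35.43 | 65.05 |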

Dual occlusion remains hard. VIM improves the result, but it does not solve occlusion completely. If both the manipulated object and receptacle are missing from the primary view, the imagined view depends heavily on the learned prior. The paper also notes several limitations: the benchmark is still simulated, VIM needs paired complementary-view data during training, and imagined views can be inaccurate when the primary view contains too little evidence.

When should you use LIBERO-Occ?

Use LIBERO-Occ if your work involves:

| Goal | Why the benchmark helps |
| --- | --- |
| fine-tuning OpenVLA/UniVLA/π0-style policies | measure whether success collapses when objects are hidden |
| designing multi-view policies | compare true complementary views with imagined views |
| manipulation in cluttered scenes | benchmark object-blocking-object failures |
| world models for control | test whether generated visual evidence improves action prediction |

Do not use LIBERO-Occ as your only production evidence. It does not replace real-robot tests with sensor noise, calibration drift, latency, object variation, and contact dynamics. It is best used as a controlled stress test that tells you whether a policy only works when everything important is visible.

A beginner-friendly path

If you are new to this stack, do not start with full training. Follow this order:

  1. Install upstream LIBERO and run one original LIBERO task.
  2. Clone LIBERO-Occ and install the occluded assets.
  3. Verify that the four occluded suite names load.
  4. Run a tiny evaluation/debug pass with a checkpoint you control.
  5. Read the metadata loader and create stage1_multiview_meta.pkl from a few episodes.
  6. Run Stage 1 for 20 debug steps to verify viewpoint generation.
  7. Run Stage 2 for 20 debug steps to verify action-token training.
  8. Scale only after the small path is clean.

Experiment log template:

experiment: vim_goal_occ_debug
suite: libero_goal_occluded
primary_view: agentview / third-person
imagined_view_target: gripper_image
checkpoint: vim_stage2_gripper_weighted
trials_per_task: 1
camera_resolution: 200
metrics:
  success_rate:
  failure_examples:
    - object hidden by occluder
    - receptacle boundary ambiguous
    - gripper self-occlusion
notes:
  compare with same policy on original LIBERO goal suite
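To fill in the success_rate field and the comparison note, you only need per-rollout success flags from both runs. A minimal sketch, with toy outcomes rather than results from the paper:

```python
def success_rate(outcomes):
    """Percentage of successful rollouts; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("no rollouts recorded")
    return 100.0 * sum(outcomes) / len(outcomes)

def compare(original, occluded):
    """Return (original %, occluded %, drop in points) for the same policy."""
    a, b = success_rate(original), success_rate(occluded)
    return a, b, a - b

# Toy outcomes for the same policy on the original and the occluded goal suite.
print(compare([True, True, True, False], [True, False, False, False]))
# (75.0, 25.0, 50.0)
```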

Takeaways

LIBERO-Occ is a useful reminder that VLA manipulation is not only language grounding and object recognition. A robot must act when part of the scene is missing from its camera. The benchmark turns occlusion into a controlled variable: target type is known, severity is measured, and the task remains executable. VIM is a practical response because it does not require an extra deployment camera; it learns to convert generative visual priors into complementary evidence for action prediction.

For real robot teams, the lesson is straightforward. Before deploying a VLA policy to a cluttered workbench, logistics cell, or educational manipulation setup, test occlusion explicitly. If the success rate drops sharply, collect cluttered demonstrations, add wrist-view supervision, or try a VIM-style viewpoint imagination pipeline before concluding that the model cannot understand the task.
