VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

VimaBench: 17 Tasks and 4-Level Generalization Protocol

Install VimaBench, run the 200M checkpoint demo, and analyze all 4 eval partitions — from placement to novel task generalization. Which level matters most for humanoid deployment?

Nguyễn Anh Tuấn · June 16, 2026 · 15 min read

Imagine you have just finished training a robot manipulation model. Now the real question begins: does this model genuinely understand the tasks, or is it just memorizing patterns from training data? You need an evaluation framework strict enough to tell these two cases apart — one that does not simply re-test on the training distribution but measures what kind of generalization the model has actually achieved.

VimaBench is that framework. Built alongside VIMA (ICML 2023), VimaBench provides 17 task templates — each instantiable into thousands of individual episodes by combining diverse object types and textures — and defines a 4-level evaluation protocol that measures precisely which kind of generalization a model has mastered.

This post walks you through the full pipeline: installing VimaBench, running the 200M checkpoint demo, understanding the 17 tasks across 6 functional groups, and analyzing each of the 4 evaluation levels — from placement_generalization (easiest) to novel_task_generalization (hardest). By the end, you will understand why Level 4 is the only number worth caring about when the goal is deploying VIMA on a real humanoid robot.

Series Roadmap

This is post 2 of 5 in the VIMA: Multimodal Prompts for Humanoid Robot Manipulation series:

| Post | Topic |
| --- | --- |
| Post 1: Cross-Attention Architecture | XAttn GPT + T5 Encoder: why cross-attention wins |
| Post 2 (you are here) | VimaBench: 17 Tasks and the 4-Level Protocol |
| Post 3: Object Tokenizer | From raw pixels to object tokens via Mask R-CNN + ViT |
| Post 4: Dataset 650K | Multi-task data collection at scale |
| Post 5: Humanoid Adaptation | Scaling VIMA to high-DoF humanoid hands |

Installing VimaBench

System Requirements

  • Python ≥ 3.9
  • Git
  • ≥ 4 GB RAM (for PyBullet simulation environment)
  • GPU optional for running single-episode demos

Installation

# Step 1: Clone the VimaBench repo
git clone https://github.com/vimalabs/VimaBench
cd VimaBench

# Step 2: Install in editable mode (required for `import vima_bench`)
pip install -e .

Verify the installation in Python:

from vima_bench import make, PARTITION_TO_SPECS

# Check: list all available evaluation partitions
partitions = list(PARTITION_TO_SPECS["test"].keys())
print(partitions)
# Output: ['placement_generalization', 'combinatorial_generalization',
#          'novel_object_generalization', 'novel_task_generalization']

A clean import with no errors means you are ready to go.

Downloading Checkpoints

VimaBench does not bundle model checkpoints — they live in the main VIMA algorithm repository:

# Clone the VIMA algorithm repo
git clone https://github.com/vimalabs/VIMA
cd VIMA

# Create a checkpoint directory
mkdir -p ckpts

# Download the 200M checkpoint (best performance)
# See the VIMA README for the current download link
# wget <checkpoint_url> -O ckpts/200M.ckpt

VIMA ships 7 pretrained checkpoints ranging from 2M to 200M parameters. This post uses 200M.ckpt for all examples. If RAM is limited, 9M.ckpt or 4M.ckpt work with identical command syntax — just change the path.


Running Your First Demo

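The VIMA repository cloned above ships scripts/example.py, which rolls out a checkpoint on a VimaBench task. Run it from the VIMA directory so that the ckpts/ path resolves: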

python scripts/example.py \
  --ckpt=ckpts/200M.ckpt \
  --partition=placement_generalization \
  --task=follow_order

The three key arguments:

| Argument | Meaning | Example value |
| --- | --- | --- |
| --ckpt | Path to checkpoint file | ckpts/200M.ckpt |
| --partition | Evaluation level (1 of 4) | placement_generalization |
| --task | Specific task (1 of 17) | follow_order |

When launched, PyBullet opens a simulator window showing a tabletop with randomly placed objects. The robot receives a multimodal prompt — a mix of descriptive text and reference images — executes an action sequence, and you watch in real time whether the model succeeds.

[Figure: Task follow_order. The robot watches a multi-frame video demo and replicates the action sequence in the correct order. Source: vimalabs/VimaBench repo]

Stress-testing with a harder partition

Change the --partition value to watch the model on harder levels. The two commands below cover Level 2 (unseen adjective-noun combinations) and Level 4 (a task template the model never saw during training):

# Level 2: unseen adjective-noun combinations
python scripts/example.py \
  --ckpt=ckpts/200M.ckpt \
  --partition=combinatorial_generalization \
  --task=same_texture

# Level 4: entirely new task template
python scripts/example.py \
  --ckpt=ckpts/200M.ckpt \
  --partition=novel_task_generalization \
  --task=novel_adj_noun

Each command runs a demo, so a single run is only an anecdote. The number that answers "does this model generalize?" is the gap in success rate between partitions, measured over many episodes, and it typically spans 20–40 percentage points.
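When several partition and task pairs need to be run back to back, it can help to generate the commands rather than retype them. The sketch below is a minimal helper, not part of VIMA or VimaBench: `build_demo_command` and `runs` are hypothetical names, and only the three flags already shown in this post are used. It prints the commands instead of launching them.

```python
import shlex

def build_demo_command(ckpt: str, partition: str, task: str) -> list[str]:
    """Assemble an example.py call from the three flags this post uses."""
    return [
        "python",
        "scripts/example.py",
        f"--ckpt={ckpt}",
        f"--partition={partition}",
        f"--task={task}",
    ]

# (partition, task) pairs taken from the commands in this post
runs = [
    ("placement_generalization", "follow_order"),
    ("novel_task_generalization", "novel_adj_noun"),
]

commands = [build_demo_command("ckpts/200M.ckpt", p, t) for p, t in runs]
for cmd in commands:
    # Printed only; pass `cmd` to subprocess.run(cmd, check=True) to launch it
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell-quoting problems if a checkpoint path ever contains spaces.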


17 Tasks — 6 Functional Groups

VimaBench's 17 tasks were not chosen arbitrarily. They are designed to cover different cognitive capabilities that a manipulation system needs, from simple pick-and-place to multi-step reasoning. Every task belongs to one of six groups:

Group 1: Simple Object Manipulation

| Task | Description |
| --- | --- |
| visual_manipulation | Pick object A → place at location B, specified by reference images |
| scene_understanding | Analyze the scene and select the correct object from a visual description |

The "hello world" of robot manipulation — with a critical twist: information about which object and where to place it is conveyed through images in the prompt, not plain text. The model must ground multimodal references into 3D physical space.

[Figure: Task visual_manipulation. The robot analyzes the prompt image to identify the target object and destination; live execution in the PyBullet tabletop environment. Source: vimalabs/VimaBench repo]

Group 2: Visual Goal Reaching

| Task | Description |
| --- | --- |
| rotate | Rotate an object to the orientation shown in a goal image |
| rearrange | Rearrange objects to match a target configuration shown in a goal image |
| rearrange_then_restore | Rearrange to match the goal, then restore to the original state |

Harder than Group 1: the model must understand relative spatial relationships between objects, not just absolute positions. rearrange_then_restore is especially interesting — it requires maintaining a memory of the initial state throughout execution.

Group 3: Novel Concept Grounding

| Task | Description |
| --- | --- |
| novel_adj | Learn a new adjective (unseen color/texture) from 1–2 in-prompt examples |
| novel_noun | Learn a new noun (unseen object type) from 1–2 in-prompt examples |
| novel_adj_noun | Combine both a new adjective and a new noun never seen before |
| twist | Apply the concept of "twisting" in a novel context |

These are few-shot grounding tasks: the prompt provides 1–2 examples of a new concept (e.g. "glassy" or "mug"), and the model must generalize to unseen instances. This group directly tests the ability to learn from context rather than training data.

Group 4: One-Shot Video Imitation

| Task | Description |
| --- | --- |
| follow_motion | Replicate a motion trajectory demonstrated in a video clip |
| follow_order | Execute actions in the exact order they appear in a demo video |

The prompt is a video clip — a sequence of frames showing a demonstration — not a static image. The model must extract temporal structure from the video: which objects appear in what order, and which actions are performed.

Group 5: Visual Constraint Satisfaction

| Task | Description |
| --- | --- |
| sweep_without_exceeding | Sweep objects into an area without exceeding a quantity threshold |
| sweep_without_touching | Sweep objects without touching a designated object |

The model must complete a task while satisfying a negative constraint — do not do X while doing Y. This is critical for real-world safety: a welding robot must not touch adjacent wires; a pick-and-place robot must not overload a shelf.

Group 6: Visual Reasoning

| Task | Description |
| --- | --- |
| same_texture | Pick the object sharing the same texture as a reference object in the prompt |
| same_shape | Pick the object sharing the same shape as a reference object |
| manipulate_old_neighbor | Interact with the object that was previously adjacent to another (spatial memory) |
| pick_in_order_then_restore | Pick objects in a specified order, then place them back in reverse order |

The hardest group: requires multi-step reasoning, memory of prior state, and visual attribute comparison. manipulate_old_neighbor is particularly challenging — the model must remember the spatial relationship from the start of the episode to identify who the "neighbor" was.

[Figure: Task pick_in_order_then_restore. The robot picks each object in the specified sequence, then places them back in reverse order; multi-step memory required. Source: vimalabs/VimaBench repo]


The 4-Level Protocol — VimaBench's Core Innovation

This is what separates VimaBench from a typical benchmark. Instead of testing only on distributions similar to training, VimaBench defines 4 partitions of increasing difficulty, each measuring a different type of generalization.

Think of the 4 levels as four progressively harder questions asked of the model:


Level 1: placement_generalization

Question: Did the model overfit to specific positions in the training data?

Setup: Training uses a restricted set of object positions; evaluation randomizes object placement across the entire tabletop.

This is the minimum baseline sanity check. If a model fails here, it has completely overfit to fixed spatial patterns — useless in practice because objects in the real world never sit in the exact positions seen during training.

from vima_bench import make, PARTITION_TO_SPECS

task = "follow_order"
partition = "placement_generalization"

env = make(
    task,
    task_kwargs=PARTITION_TO_SPECS["test"][partition][task],
    render_prompt=True,
    display_debug_window=True,
)

obs = env.reset()
# After reset, the prompt and its assets are read from the env
# (the same access pattern scripts/example.py in the VIMA repo uses)
prompt, prompt_assets = env.prompt, env.prompt_assets
# prompt: the instruction text, with placeholders where reference images belong
# prompt_assets: dict with reference images for the image placeholders in the prompt
print(f"Prompt: {prompt}")
print(f"Prompt assets: {list(prompt_assets.keys())}")

Level 1 covers the task templates the model was trained on, which makes it the mandatory starting point when evaluating any model.
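Once prepared for the policy, a multimodal prompt becomes an interleaved sequence of text tokens and image tokens, marked 0 and 1 respectively. The sketch below illustrates that interleaving on a template string. It is a simplification under stated assumptions: image slots are written as `{placeholder}` names, every whitespace-separated word counts as one text token (a real tokenizer would split differently), and `prompt_token_types` is a hypothetical helper, not a VimaBench function.

```python
import re

TEXT, IMAGE = 0, 1

def prompt_token_types(template: str) -> list[int]:
    """Mark each word as text (0) and each {placeholder} as image (1)."""
    types = []
    # The capture group keeps the placeholders in the split result
    for piece in re.split(r"(\{[^{}]+\})", template):
        if re.fullmatch(r"\{[^{}]+\}", piece):
            types.append(IMAGE)
        else:
            # One text token per whitespace-separated word (simplification)
            types.extend(TEXT for _ in piece.split())
    return types

print(prompt_token_types("Put the {dragged_obj} into the {base_obj}"))
# → [0, 0, 1, 0, 0, 1]
```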


Level 2: combinatorial_generalization

Question: Has the model learned attributes as independent concepts, or only specific fixed combinations?

Setup: Training sees "red cube" and "blue sphere"; evaluation presents "red sphere" and "blue cube" — novel combinations, but every individual word is familiar.

This is the classic systematic compositionality test in AI. Many deep learning models fail here because they learn "red → cube" as a unit rather than treating "red" and "cube" as two independently composable attributes.

partition = "combinatorial_generalization"
env = make(
    task,
    task_kwargs=PARTITION_TO_SPECS["test"][partition][task],
)

# Inspect the config to see which combinations are tested
config = PARTITION_TO_SPECS["test"][partition][task]
print(f"Config keys: {list(config.keys())}")
# Includes: possible_dragged_obj, possible_dragged_obj_texture,
#           possible_base_obj, possible_base_obj_texture, ...

Why this matters for real-world robotics: Factory workers instruct robots with new attribute combinations every day — "move this (new color) to that position (new shelf)". A model that fails Level 2 requires re-training every time a new combination appears on the floor.
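The "red cube / blue sphere" setup can be reproduced as a toy split to make the idea concrete. The sketch below holds out adjective-noun pairs in a checkerboard pattern so that every individual word still appears in training. This is only an illustration of a compositional hold-out; it is not how VimaBench builds its partition, and `compositional_split` is a hypothetical helper.

```python
from itertools import product

def compositional_split(adjectives, nouns):
    """Split all adjective-noun pairs into train and held-out sets.

    Pairs whose index sum is odd are held out. With at least two
    adjectives and two nouns, every word still occurs in training.
    """
    train, held_out = [], []
    for (i, adj), (j, noun) in product(enumerate(adjectives), enumerate(nouns)):
        (held_out if (i + j) % 2 else train).append((adj, noun))
    return train, held_out

train, held_out = compositional_split(["red", "blue"], ["cube", "sphere"])
print(train)     # [('red', 'cube'), ('blue', 'sphere')]
print(held_out)  # [('red', 'sphere'), ('blue', 'cube')]

# Every word in a held-out pair was seen in some training pair
train_words = {word for pair in train for word in pair}
assert all(word in train_words for pair in held_out for word in pair)
```

A model that treats "red" and "cube" as independent attributes should handle the held-out pairs; one that memorized whole pairs will not.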


Level 3: novel_object_generalization

Question: Can the model handle entirely new objects (never seen during training)?

Setup: Training uses a fixed set of objects and colors; evaluation introduces a NEW adjective (unseen color/texture) or a NEW noun (unseen object type).

This is true zero-shot generalization over objects. The model cannot rely on visual memory of specific objects — it must understand the task instruction well enough to execute with any new object pointed to in the prompt.

partition = "novel_object_generalization"
env = make(
    task,
    task_kwargs=PARTITION_TO_SPECS["test"][partition][task],
)

obs = env.reset()
prompt, prompt_assets = env.prompt, env.prompt_assets
# prompt_assets will contain images of NEW objects never seen in training
# The model must use cross-attention to "read" the image in the prompt
# rather than recalling from training memory

Why this is the practical boundary: Real environments always have new objects — new product launches, seasonal colors, customer-specific items. A model that only passes L1/L2 will fail immediately on anything outside its training vocabulary.


Level 4: novel_task_generalization

Question: Can the model execute a completely new task template it has never seen during training?

Setup: Training covers only a subset of the 17 task templates (13 in the VIMA paper); the remaining held-out templates are completely excluded from all training. The prompt uses familiar words and familiar objects, but the task structure itself is brand new.

partition = "novel_task_generalization"

# Example: test on novel_adj_noun — both the adjective and noun are new
task = "novel_adj_noun"
env = make(
    task,
    task_kwargs=PARTITION_TO_SPECS["test"][partition][task],
)

obs = env.reset()
prompt, prompt_assets = env.prompt, env.prompt_assets
# The prompt header contains few-shot examples of the new concept
# The model must reason about task structure from those examples
# No pattern-matching with training memory is possible

This is the hardest and most scientifically interesting level. The model cannot pattern-match against anything it has seen — it must read the multimodal prompt and reason about intent, much like a human understanding a new instruction from context alone.


PARTITION_TO_SPECS: What Is Inside the Config

PARTITION_TO_SPECS is a dictionary holding the exact configuration for each eval level and each task — controlling precisely what is allowed to appear when instantiating an episode:

from vima_bench import PARTITION_TO_SPECS

# Structure: test → partition → task → config dict
config = PARTITION_TO_SPECS["test"]["novel_object_generalization"]["same_texture"]
print(config)
# Example output:
# {
#   'num_dragged_obj': 1,
#   'num_distractor_obj': 3,
#   'possible_dragged_obj_texture': ['new_texture_1', 'new_texture_2', ...],
#   'possible_base_obj_texture': ['familiar_texture_1', ...],
#   'possible_dragged_obj': ['cube', 'sphere', ...],
#   ...
# }

This config enforces:

  • L1: familiar object pool and textures; only placement positions are randomized
  • L2: familiar objects and textures, but adjective-noun combinations are held out
  • L3: at least one adjective or noun in possible_... is completely new to the model
  • L4: the task template itself is held out of training, so the structure of the instruction is new even when the words and objects are familiar

This is what makes VimaBench a rigorous benchmark — each partition actually tests the type of generalization it claims, with no data leakage between levels.
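One way to see what a partition actually changes is to diff the config dicts of the same task at two levels. The sketch below is a generic dictionary diff; `diff_configs` is a hypothetical helper, and the two toy dicts are made-up stand-ins that reuse the placeholder names from the example output above, not real VimaBench values.

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing on one side is reported with None on that side.
    """
    missing = object()
    changed = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, missing), b.get(key, missing)
        if va != vb:
            changed[key] = (
                None if va is missing else va,
                None if vb is missing else vb,
            )
    return changed

# Toy stand-ins for two levels of the same task (illustrative values only)
level_1 = {"num_dragged_obj": 1, "possible_dragged_obj_texture": ["familiar_texture_1"]}
level_3 = {"num_dragged_obj": 1, "possible_dragged_obj_texture": ["new_texture_1"]}

print(diff_configs(level_1, level_3))
# → {'possible_dragged_obj_texture': (['familiar_texture_1'], ['new_texture_1'])}
```

With VimaBench installed, the same function can be pointed at two entries of PARTITION_TO_SPECS["test"] for a task that both partitions list.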


Which Level Matters Most for Humanoid Deployment?

Short answer: Level 4 is the only number worth reporting when the goal is real-world humanoid deployment.

Here is why each level maps to a different deployment requirement:

L1 is the minimum bar, not an achievement. If a model does not pass placement generalization, it is unusable. Passing L1 merely says "this model is not fundamentally broken" — the same way a car must start before being called a car.

L2 is a prerequisite for production. A humanoid on a factory floor encounters hundreds of new attribute combinations daily — new products, new colors, new sizes on the same assembly line. A model that fails L2 needs retraining constantly, turning deployment into an operational nightmare.

L3 separates "lab demo" from "actual robot". For a humanoid working in an open-world environment with new objects appearing continuously, L3 marks the boundary between "usable" and "requires 24/7 human supervision to handle exceptions".

L4 is a test of architecture, not data. A model that passes L4 has demonstrated that its prompt conditioning mechanism actually works — the model reads and understands task intent from the prompt rather than recalling from training memory. This is the core capability required for a household or office humanoid assistant, where users issue novel instructions every day that were never seen in any training set.

The Pattern in VIMA's Results

The VIMA paper (ICML 2023) shows a consistent pattern: VIMA with dense cross-attention maintains significantly better performance than baselines as the level climbs from L1 to L4 — especially at L3 and L4, where GPT-style prefix conditioning degrades most sharply.

The mechanism was analyzed in Post 1: GPT-style prefix reads the prompt once and relies on self-attention to propagate that signal — the prompt context dilutes across layers. When the task is entirely new (L4), the model needs to "re-read the prompt" at every layer to avoid drifting from task intent. Dense cross-attention at every one of the 12 layers does exactly this.

Scaling also follows an interesting pattern: smaller models (9M) degrade more steeply climbing from L1 to L4 compared to larger models (200M) — but even VIMA-9M degrades less than a GPT-style prefix model at 200M parameters on the higher levels. Architecture matters more than size when measuring generalization.


Running a Full Systematic Evaluation

To evaluate a policy across all tasks in a given partition programmatically:

from vima_bench import make, PARTITION_TO_SPECS
import numpy as np

def evaluate_partition(policy, partition: str, n_episodes: int = 50):
    """Evaluate a policy across all tasks in one partition."""
    results = {}
    task_names = list(PARTITION_TO_SPECS["test"][partition].keys())

    for task in task_names:
        env = make(
            task,
            task_kwargs=PARTITION_TO_SPECS["test"][partition][task],
        )

        success_count = 0
        for _ in range(n_episodes):
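            obs = env.reset()
            # Read the prompt from the env after reset, as scripts/example.py does
            prompt, prompt_assets = env.prompt, env.prompt_assets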
            done = False

            while not done:
                # Replace with your actual policy inference
                action = policy.predict(obs, prompt, prompt_assets)
                obs, reward, done, info = env.step(action)

            if info["success"]:
                success_count += 1

        results[task] = success_count / n_episodes
        print(f"  {task}: {results[task]:.1%}")

    mean_success = np.mean(list(results.values()))
    print(f"\n→ Mean {partition}: {mean_success:.1%}")
    return results

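# Run through all 4 levels
# `your_policy` is a placeholder: any object exposing
# predict(obs, prompt, prompt_assets) that returns an action for env.step
for partition in PARTITION_TO_SPECS["test"]:
    print(f"\n=== {partition} ===")
    evaluate_partition(your_policy, partition)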
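With 50 episodes per task, a success rate carries noticeable sampling noise, so small gaps between partitions may not be meaningful. A minimal sketch of one way to quantify this is the Wilson score interval for a binomial proportion; `wilson_interval` is a hypothetical helper added here for illustration, and the choice of a 95% level (z = 1.96) is an assumption.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate of `successes` out of `n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Example: 35 successes in 50 episodes (illustrative numbers, not a VIMA result)
low, high = wilson_interval(35, 50)
print(f"70.0% observed, 95% interval: {low:.1%} to {high:.1%}")
```

At this episode count the interval is roughly 25 percentage points wide, which is a reason to raise n_episodes before reading much into a difference of a few points.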

Comparing Checkpoint Sizes

To compare two model sizes head-to-head on the hardest partition:

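# novel_adj_noun is the Level 4 task used earlier in this post; follow_order
# was run as a trained (Level 1) task, so it cannot also be a held-out one
for size in 9M 200M; do
  python scripts/example.py \
    --ckpt=ckpts/${size}.ckpt \
    --partition=novel_task_generalization \
    --task=novel_adj_noun
done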

The performance difference directly shows the cost of running a smaller model in deployment — useful for choosing the right checkpoint size given the edge hardware constraints of your humanoid platform.
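To turn such a comparison into a single number, the per-task success rates of two runs can be compared on the tasks they have in common. The sketch below assumes each run is a dict of task name to success rate, the shape a per-partition evaluation loop would return; `gap_on_shared_tasks` is a hypothetical helper, and the numbers are placeholders, not measured results.

```python
from statistics import mean

def gap_on_shared_tasks(baseline: dict, candidate: dict) -> float:
    """Mean of (baseline - candidate) success rate over tasks present in both."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("the two runs have no task in common")
    return mean(baseline[task] - candidate[task] for task in shared)

# Placeholder numbers for illustration only
large_model = {"task_a": 0.8, "task_b": 0.6}
small_model = {"task_a": 0.5, "task_c": 0.1}

# Only task_a is shared, so the gap is 0.8 - 0.5
print(f"Gap on shared tasks: {gap_on_shared_tasks(large_model, small_model):.1%}")
```

Restricting the comparison to shared tasks matters because partitions do not necessarily list the same tasks, and averaging over different task sets would mix task difficulty into the gap.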


Conclusion

VimaBench is not just a collection of 17 tasks — it is a language for speaking precisely about generalization. When someone says "my model hits 70% on VimaBench," the correct follow-up question is: 70% on which partition? L1 or L4? That difference can be the gap between a compelling lab demo and a production-ready system.

For humanoid robots, the standard must be Level 4. A robot in a home, factory, or hospital receives hundreds of novel instructions every day from users — and there is no way to pre-train on all of them. A model must demonstrate Level 4 performance to confirm that it genuinely reads and understands prompts rather than just recalling patterns.

In Post 3: Object Tokenizer, we will go deep into how VIMA converts raw pixels into object tokens — the most critical part of the pipeline that enables the model to handle new objects at Level 3 without having seen them in training.


Related Posts

  • Post 1: Cross-Attention Architecture — Why VIMA Uses XAttn GPT
  • Post 3: Object Tokenizer — From Raw Pixels to Object Tokens
  • Dexora VLA: Open-Source Bimanual Dexterous Manipulation

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
