What This Article Is For
The first six articles moved from the pipeline map to egocentric video, retargeting, teleoperation, LMO RL, and sim-to-real checks. Article 7 closes the series with the most practical question: once the policy runs, how do we evaluate whether a WholeBodyVLA-style system is actually better than a baseline?
If you only report one task-level success rate, you will miss the most important part of humanoid loco-manipulation. A robot may grasp a bag but fail when squatting. It may grab a cart handle but drift after pushing for two meters. It may fail the same task for three different reasons: perception selected the wrong target, the VLA produced a poor latent action, or the LMO controller stumbled during the approach. Whole-body VLA evaluation must therefore happen at the subgoal level, not only at the task level.
This article uses the WholeBodyVLA paper and the OpenDriveLab project page as the template. The paper evaluates an Agibot X2 prototype on three main tasks: Bag Packing, Box Loading, and Cart Pushing. Each task is split into two subgoals: Grasp Bags and Move & Squat for Bag Packing, Squat & Grasp and Rise & Turn for Box Loading, and Grab Handle and Push Ahead for Cart Pushing. The reported averages are 78.0% for WholeBodyVLA, compared with 64.0% for Modular Design, 42.0% for GR00T w/ LMO, 56.7% for OpenVLA-OFT w/ LMO, and 54.0% for a velocity-based RL variant.
If you have not read the earlier articles, revisit Mapping the WholeBodyVLA Pipeline, Egocentric Video and LAMs, RL and LMO, and Sim-to-Real Before Unitree G1. Outside the series, useful background includes our WholeBodyVLA ICLR 2026 analysis and GR00T N1 data collection on G1.
Why One Success Rate Is Not Enough
For a tabletop single-arm robot, a single "pick success" number can sometimes be enough to compare two models. For a full humanoid, one task contains several different physical phases. Bag Packing is not only grasping a bag. It also requires lateral stepping, squatting, placing the bag into a carton, and maintaining balance while the arms move an object downward. Box Loading is not only grasping a box. It requires squatting, standing, turning, holding the box stable, and placing it onto a cart. Cart Pushing is not only grasping a handle. It requires sustained forward walking under load while maintaining heading.
The useful feature of the WholeBodyVLA table is that it does not collapse everything into one vague number. Each task is split into two subgoal columns. If a method performs well on the first subgoal but poorly on the second, you immediately know that the failure is concentrated in the longer loco-manipulation phase. For example, the velocity-based RL variant scores 22/25 on Grasp Bags but only 1/25 on Move & Squat. If you only look at the 54.0% average, you know it is worse. If you inspect the subgoals, you know where it is worse: foot control, squatting, heading, and transition after manipulation.
This is the first lesson for teams starting humanoid evaluation: a benchmark should help you debug, not only help you advertise. A good scorecard should answer four questions:
| Question | Why it matters |
|---|---|
| Which subgoal failed? | Separates perception, manipulation, locomotion, and transition problems |
| Does the failure repeat as a pattern? | Distinguishes random noise from a systematic weakness |
| Which ablation drops the most? | Tells you whether to invest in LAMs, LMO, data, or the decoder |
| Is the failure dangerous for hardware? | Prioritizes stumble, collision, excessive force, and drift under load |
How To Read the WholeBodyVLA Table
The paper reports each subgoal over 25 trials. The table below rewrites the main results in a spreadsheet-friendly format. These are not example numbers; they are the results reported in the WholeBodyVLA paper.
| Method | Grasp Bags | Move & Squat | Squat & Grasp | Rise & Turn | Grab Handle | Push Ahead | Avg. Score |
|---|---|---|---|---|---|---|---|
| Modular Design | 22/25 | 12/25 | 9/25 | 9/25 | 22/25 | 22/25 | 64.0% |
| GR00T w/ LMO | 20/25 | 10/25 | 6/25 | 4/25 | 12/25 | 11/25 | 42.0% |
| OpenVLA-OFT w/ LMO | 19/25 | 6/25 | 12/25 | 12/25 | 22/25 | 14/25 | 56.7% |
| WholeBodyVLA | 23/25 | 13/25 | 19/25 | 17/25 | 23/25 | 22/25 | 78.0% |
| WholeBodyVLA w/ velocity-based RL | 22/25 | 1/25 | 16/25 | 3/25 | 24/25 | 15/25 | 54.0% |
| WholeBodyVLA w/o LAM | 15/25 | 4/25 | 8/25 | 6/25 | 16/25 | 10/25 | 39.3% |
| WholeBodyVLA w/ manipulation-only LAM | 24/25 | 7/25 | 17/25 | 11/25 | 20/25 | 14/25 | 63.3% |
| WholeBodyVLA w/ shared LAM | 18/25 | 11/25 | 16/25 | 16/25 | 20/25 | 18/25 | 66.0% |
Three observations matter.
First, the first subgoal of each task is usually easier than the second. Grasp Bags, Squat & Grasp, and Grab Handle are still difficult, but they are closer to local manipulation. Move & Squat, Rise & Turn, and Push Ahead contain more locomotion: lateral stepping, standing, turning, advancing, maintaining heading, and handling load. These columns stress the lower-body controller and locomotion latents much more strongly.
Second, LMO matters because it makes VLA decisions executable. The full WholeBodyVLA system reaches 78.0%, while the velocity-based RL variant reaches 54.0%. The paper notes that most of this gap comes from the second subgoal of each task, where locomotion dominates. This matches practical failure modes: a velocity controller may produce inconsistent gait, path deviation, stumble, or turning while advancing when the intended command is an in-place turn.
Third, LAMs are what make action-free video useful. Removing LAMs drops the score to 39.3%. Using only a manipulation LAM reaches 63.3%, which is much better but still weaker than the full system, especially on subgoals that require locomotion. A shared LAM reaches 66.0%, showing that one mixed latent model can learn useful structure, but separate manipulation and locomotion LAMs work better for this design.
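The second observation can be checked with simple arithmetic on the table rows. The sketch below uses the pass counts from the table above for the full system and the velocity-based RL variant; the dictionary keys and helper names are illustrative and do not come from the paper.

```python
# Pass counts per subgoal (out of 25 trials each), copied from the results table.
# Column order: Grasp Bags, Move & Squat, Squat & Grasp, Rise & Turn,
# Grab Handle, Push Ahead. Keys and function names are illustrative.
TRIALS_PER_SUBGOAL = 25

PASSES = {
    "wholebodyvla": [23, 13, 19, 17, 23, 22],
    "velocity_rl": [22, 1, 16, 3, 24, 15],
}


def average_score(counts):
    """Average success rate over all subgoals of one method."""
    return sum(counts) / (TRIALS_PER_SUBGOAL * len(counts))


def gap_by_phase(full, other):
    """Split the pass-count gap into first and second subgoals of each task."""
    first = sum(f - o for f, o in zip(full[0::2], other[0::2]))
    second = sum(f - o for f, o in zip(full[1::2], other[1::2]))
    return first, second


full, vel = PASSES["wholebodyvla"], PASSES["velocity_rl"]
print(f"{average_score(full):.1%} vs {average_score(vel):.1%}")  # 78.0% vs 54.0%
print(gap_by_phase(full, vel))  # (3, 33)
```

Of the 36 passes that separate the two methods, 33 come from the second, locomotion-heavy subgoal of each task.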
Designing Your Own Subgoal Scorecard
If you are building a benchmark for G1, X2, or another humanoid, you should not copy the tasks blindly when your hardware and lab are different. Copy the evaluation structure instead: long tasks are split into subgoals with clear pass criteria. A minimal scorecard should include these columns:
| Column | Example | Note |
|---|---|---|
| trial_id | bagpack_013 | Unique key for finding video and logs |
| task | bag_packing | User-level task name |
| subgoal | move_and_squat | Main scoring unit |
| instruction | put the paper bag into the carton | Language sent to the VLA |
| start_pose_bin | near_left, far_right, rotated_30deg | Helps analyze generalization |
| object_variant | brown_bag, white_bag, heavy_box | Records distribution shift |
| payload_kg | 0, 5, 50 | Critical for cart pushing |
| pass | true or false | Subgoal pass/fail |
| failure_mode | path_deviation | Filled only when failed |
| notes | stopped 30 cm too early | Short note, not an essay |
Start with CSV if that is what your team can maintain:
trial_id,task,subgoal,instruction,start_pose_bin,object_variant,payload_kg,pass,failure_mode,notes
bagpack_001,bag_packing,grasp_bags,"grasp the bags",center,brown_bag,0,true,,
bagpack_001,bag_packing,move_and_squat,"place the bags into the carton",center,brown_bag,0,false,early_stop,"stopped before carton; arms could not reach"
boxload_004,box_loading,rise_and_turn,"put the box onto the cart",rotated_30deg,plastic_box,6,false,wrong_orientation,"turned only halfway"
cart_011,cart_pushing,push_ahead,"push the cart forward",center,carton_load,50,false,path_deviation,"drifted right after 1.5 m"
Do not wait for a complex dashboard. A well-maintained CSV plus synchronized video is enough to find useful failures during the first week. Build dashboards only after the taxonomy stabilizes.
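A CSV only stays useful if its rows are consistent. As a minimal sketch, assuming the column names from the scorecard above, the following loads the file with the standard library and flags rows that break the basic rules (for example, a failed subgoal without a failure_mode). The function names validate_row and load_scorecard are hypothetical.

```python
import csv
import io

# Sample rows in the scorecard format above. In practice, read the real file
# instead of an in-memory string.
SAMPLE = '''trial_id,task,subgoal,instruction,start_pose_bin,object_variant,payload_kg,pass,failure_mode,notes
bagpack_001,bag_packing,grasp_bags,"grasp the bags",center,brown_bag,0,true,,
bagpack_001,bag_packing,move_and_squat,"place the bags into the carton",center,brown_bag,0,false,early_stop,"stopped before carton; arms could not reach"
'''


def validate_row(row):
    """Return a list of problems found in one scorecard row (empty if clean)."""
    problems = []
    if row["pass"] not in ("true", "false"):
        problems.append("pass must be 'true' or 'false'")
    if row["pass"] == "false" and not row["failure_mode"]:
        problems.append("failed subgoal needs a failure_mode")
    if row["pass"] == "true" and row["failure_mode"]:
        problems.append("failure_mode must be empty when the subgoal passed")
    try:
        float(row["payload_kg"])
    except ValueError:
        problems.append("payload_kg must be a number")
    return problems


def load_scorecard(text):
    """Parse CSV text and return (rows, errors)."""
    rows, errors = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        problems = validate_row(row)
        if problems:
            errors.append((line_no, row["trial_id"], problems))
        rows.append(row)
    return rows, errors


rows, errors = load_scorecard(SAMPLE)
print(len(rows), "rows,", len(errors), "rows with problems")
```

Running a check like this at the end of each session catches missing labels while the operator still remembers the trial.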
Defining Pass and Fail Criteria
Beginners often make one major evaluation mistake: they define pass criteria after watching the videos. That makes the score biased. Write the criteria before running trials.
| Subgoal | Pass when | Fail when |
|---|---|---|
| Grasp Bags | Both grippers hold the bag stably enough to start moving | The bag is missed, dropped before movement, or the wrong object is grasped |
| Move & Squat | The robot reaches the carton region, squats low enough, and places the bag inside | It stops too far away, drifts, stumbles, squats to the wrong height, or drops the bag outside |
| Squat & Grasp | The robot squats and holds the box securely with both hands | It cannot reach the box, the gripper is misaligned, the box slips, or balance is lost |
| Rise & Turn | The robot stands, turns toward the cart, and keeps holding the box | The box is dropped, the turn is too short/long, the base advances during turn, or it hits the cart |
| Grab Handle | The required hand or both hands grasp the cart handle | The robot cannot reach, grasps off-center, or loses the handle before pushing |
| Push Ahead | The cart moves along the target direction for the required distance and the robot remains stable | It deviates, stops too early/late, stumbles, loses grip, or pushes the load sideways |
Each subgoal should also have measurable tolerances. For example:
subgoals:
  move_and_squat:
    target_region_radius_m: 0.20
    final_base_yaw_error_deg: 15
    min_squat_depth_m: 0.18
    object_must_be_inside_container: true
  push_ahead:
    min_distance_m: 2.0
    max_lateral_drift_m: 0.25
    max_heading_error_deg: 12
    handle_contact_required: true
These thresholds are examples, not universal rules. In a small lab with a compact robot, min_distance_m may be 1 meter. For a heavy cart, you may need a rule such as "handle contact must not be lost for more than 0.5 seconds." The important part is that everyone on the team uses the same definition.
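Written tolerances are easiest to enforce when the pass/fail decision is a function rather than a judgment call. The sketch below applies the example push_ahead thresholds to one trial. The measurement keys (distance_m, lateral_drift_m, heading_error_deg, handle_contact_kept) are assumptions about what your logging provides, not fields defined by the paper.

```python
# Example tolerances for push_ahead, mirroring the example config above.
PUSH_AHEAD_TOL = {
    "min_distance_m": 2.0,
    "max_lateral_drift_m": 0.25,
    "max_heading_error_deg": 12,
    "handle_contact_required": True,
}


def judge_push_ahead(measured, tol=PUSH_AHEAD_TOL):
    """Return (passed, reasons) for one push_ahead trial.

    `measured` is a hypothetical dict of per-trial measurements, for example
    derived from odometry or motion-capture logs.
    """
    reasons = []
    if measured["distance_m"] < tol["min_distance_m"]:
        reasons.append("distance below minimum")
    if abs(measured["lateral_drift_m"]) > tol["max_lateral_drift_m"]:
        reasons.append("lateral drift too large")
    if abs(measured["heading_error_deg"]) > tol["max_heading_error_deg"]:
        reasons.append("heading error too large")
    if tol["handle_contact_required"] and not measured["handle_contact_kept"]:
        reasons.append("handle contact lost")
    return (not reasons, reasons)


# Made-up measurements: far enough, but drifted 31 cm sideways.
trial = {
    "distance_m": 2.1,
    "lateral_drift_m": 0.31,
    "heading_error_deg": -4.0,
    "handle_contact_kept": True,
}
print(judge_push_ahead(trial))  # (False, ['lateral drift too large'])
```

Returning the reasons alongside the verdict gives the reviewer a starting point for the failure_mode label.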
Ablations: Ask What Each Component Contributes
Ablation turns a benchmark into an engineering decision. WholeBodyVLA's ablations make each component's role clear:
| Ablation | Technical question | What the result suggests |
|---|---|---|
| w/o LAM | What happens if we skip latent pretraining from videos and only finetune on teleop? | Score drops to 39.3%, so action-free video and latent supervision are important |
| manipulation-only LAM | What if we learn latents only from manipulation, without locomotion-aware video? | 63.3%, much better than no LAM but weak on locomotion-heavy subgoals |
| shared LAM | What if one LAM is trained on mixed manipulation and locomotion data? | 66.0%, useful but weaker than separate LAMs |
| velocity-based RL | What if the low-level controller is a conventional velocity-tracking policy? | 54.0%, especially weak on Move & Squat and Rise & Turn |
When running internal ablations, keep three things fixed: the same tasks, the same number of trials, and the same pass/fail criteria. If the full model runs 25 trials and the baseline runs only 10, the baseline's success rate is a much noisier estimate, and a gap between the two may be sampling error. If today's object is light and tomorrow's object is heavy, the difference may come from the setup rather than the model.
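To see why trial counts matter, it helps to put an uncertainty interval around each success rate. The sketch below uses the Wilson score interval, one common choice for pass/fail proportions. It is an illustration added here, not a method taken from the paper, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 80% success rate, different trial counts.
for passes, trials in [(20, 25), (8, 10)]:
    lo, hi = wilson_interval(passes, trials)
    print(f"{passes}/{trials}: {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")
```

With the same observed rate, the 10-trial interval is wider than the 25-trial one, which is why mismatched trial counts make a comparison hard to trust.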
A small config is enough:
benchmark:
  date: 2026-06-10
  trials_per_subgoal: 25
  tasks:
    - bag_packing
    - box_loading
    - cart_pushing
  methods:
    - name: wholebodyvla_full
      manipulation_lam: separate
      locomotion_lam: separate
      low_level_controller: lmo
    - name: no_lam
      manipulation_lam: none
      locomotion_lam: none
      low_level_controller: lmo
    - name: manip_only_lam
      manipulation_lam: separate
      locomotion_lam: none
      low_level_controller: lmo
    - name: velocity_rl
      manipulation_lam: separate
      locomotion_lam: separate
      low_level_controller: velocity_tracking
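A config like this is only a promise until the logs are checked against it. As a rough sketch, assuming each logged row carries a method and a subgoal field, the helper below reports every (method, subgoal) cell whose trial count differs from trials_per_subgoal. The function name and the row format are hypothetical.

```python
from collections import Counter


def check_trial_counts(rows, methods, subgoals, trials_per_subgoal):
    """Return {(method, subgoal): count} for every cell with the wrong count."""
    counts = Counter((r["method"], r["subgoal"]) for r in rows)
    return {
        (m, s): counts[(m, s)]
        for m in methods
        for s in subgoals
        if counts[(m, s)] != trials_per_subgoal
    }


# Tiny made-up log with 3 trials expected per cell: the baseline is one short.
rows = (
    [{"method": "wholebodyvla_full", "subgoal": "push_ahead"}] * 3
    + [{"method": "velocity_rl", "subgoal": "push_ahead"}] * 2
)
mismatches = check_trial_counts(
    rows, ["wholebodyvla_full", "velocity_rl"], ["push_ahead"], 3
)
print(mismatches)  # {('velocity_rl', 'push_ahead'): 2}
```

An empty result means every method was evaluated on the same number of trials per subgoal, so the scores are comparable on that axis.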
Failure Taxonomy: Log What You Want To Fix
The WholeBodyVLA project page shows baseline failure cases such as stumble to stop, loss of balance, large deviation from the intended direction, and stopping too late. The paper's appendix also separates failures into locomotion failures and pick/place failures, then further labels causes such as object unreachable, basket unreachable, wrong orientation, early stop, overshoot, collision, stumble, poor grasp pose, and misaligned placement.
For a beginner team, start with a short taxonomy that is still useful:
| Failure mode | Group | Description |
|---|---|---|
| early_stop | locomotion | The robot stops before the manipulation region, so the arms cannot reach |
| overshoot | locomotion | The robot passes the intended stopping point |
| path_deviation | locomotion | The robot drifts away from the desired path or heading |
| turn_with_advance | locomotion | The robot advances while the intended command is an in-place turn |
| stumble | locomotion | The foot catches, the torso shakes strongly, or safety stop is needed |
| wrong_orientation | locomotion | The base faces the wrong direction before manipulation |
| bad_grasp_pose | manipulation | The gripper contacts the wrong point or cannot close properly |
| object_slip | manipulation | The object is grasped but slips during motion |
| misaligned_place | manipulation | The object is placed outside the container or cart target |
| collision | safety | The robot hits the table, carton, cart, or operator |
Do not assign too many labels. A failed trial can have a chain of causes, but you should choose one primary_failure_mode and add a secondary_failure_mode only when it is truly useful.
trial_id,primary_failure_mode,secondary_failure_mode,stop_reason,video_start_s,video_end_s
bagpack_001,early_stop,bad_grasp_pose,operator_estop,18.4,31.2
boxload_004,wrong_orientation,collision,timeout,22.0,36.5
cart_011,path_deviation,stumble,operator_estop,10.7,18.9
Annotate right after the benchmark session, while the operator and reviewer still remember the setup. Three days later, the video may still be available, but you may no longer remember whether the floor was slippery, whether the payload was exactly 50 kg, or whether the camera mount had shifted.
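Once primary labels exist, counting them is what turns annotations into priorities. The sketch below maps each failure mode to its group from the taxonomy table above and reports the most frequent modes. The label list is made up for illustration, and the summarize helper is hypothetical.

```python
from collections import Counter

# Failure mode -> group, following the taxonomy table above.
GROUP = {
    "early_stop": "locomotion",
    "overshoot": "locomotion",
    "path_deviation": "locomotion",
    "turn_with_advance": "locomotion",
    "stumble": "locomotion",
    "wrong_orientation": "locomotion",
    "bad_grasp_pose": "manipulation",
    "object_slip": "manipulation",
    "misaligned_place": "manipulation",
    "collision": "safety",
}


def summarize(primary_modes, top_n=3):
    """Return (most common modes, failure counts per group)."""
    unknown = sorted(set(primary_modes) - set(GROUP))
    if unknown:
        # Reject labels outside the taxonomy so typos do not create new modes.
        raise ValueError(f"labels outside the taxonomy: {unknown}")
    by_mode = Counter(primary_modes)
    by_group = Counter(GROUP[m] for m in primary_modes)
    return by_mode.most_common(top_n), dict(by_group)


# Made-up primary labels from one session, for illustration only.
modes = ["early_stop", "path_deviation", "early_stop",
         "object_slip", "early_stop", "path_deviation"]
top, groups = summarize(modes)
print(top)     # [('early_stop', 3), ('path_deviation', 2), ('object_slip', 1)]
print(groups)  # {'locomotion': 5, 'manipulation': 1}
```

Rejecting unknown labels keeps the taxonomy short: a new failure mode has to be added to the table deliberately, not by accident.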
Computing and Interpreting Scores
For a 25-trial-per-subgoal table, the scoring code is simple:
from collections import defaultdict
# One row per (trial, subgoal); "pass" is the pass/fail verdict for that subgoal.
rows = [
    {"method": "wholebodyvla", "task": "bag_packing", "subgoal": "grasp_bags", "pass": True},
    {"method": "wholebodyvla", "task": "bag_packing", "subgoal": "move_and_squat", "pass": False},
]
# Count passes and totals per (method, task, subgoal).
score = defaultdict(lambda: {"pass": 0, "total": 0})
for row in rows:
    key = (row["method"], row["task"], row["subgoal"])
    score[key]["pass"] += int(row["pass"])
    score[key]["total"] += 1
# Print each cell as "passes/total" plus the success rate.
for key, value in score.items():
    rate = value["pass"] / value["total"]
    print(key, f"{value['pass']}/{value['total']}", f"{rate:.1%}")
Interpreting the score is harder than computing it. If Grasp Bags is high and Move & Squat is low, do not immediately retrain manipulation. First inspect locomotion video, base trajectory, yaw, squat depth, and latency between the 10 Hz VLA and the 50 Hz controller. If Grab Handle is high but Push Ahead is low, the issue may be heading control under load, handle contact, or insufficient payload randomization. If Squat & Grasp is low, the issue may be box perception, reach trajectory, or insufficient squat depth.
Read scores in pairs:
| Pattern | First diagnosis |
|---|---|
| First manipulation subgoal high, second locomotion-heavy subgoal low | Prioritize LMO, command interface, stopping accuracy, squat/turn precision |
| Both subgoals low | Check perception, instruction, camera calibration, dataset mismatch |
| Full model much better than no-LAM | Latent pretraining is valuable; expand action-free video |
| Manip-only LAM close to full model | Current benchmark may not stress locomotion enough |
| Velocity RL weak at transitions | Controller likely needs discrete intent, directional accuracy reward, or better curriculum |
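The first two rows of the table can be turned into a quick triage rule. This is only a sketch: the 0.7 and 0.5 cutoffs for "high" and "low" are assumptions chosen for illustration, and the diagnose function is hypothetical. Tune both to your own benchmark.

```python
def diagnose(first_rate, second_rate, high=0.7, low=0.5):
    """Map a pair of subgoal success rates to a first diagnosis.

    `high` and `low` are illustrative cutoffs, not values from the paper.
    """
    if first_rate >= high and second_rate < low:
        return ("locomotion-heavy phase is the bottleneck: inspect LMO, "
                "command interface, stopping accuracy, squat/turn precision")
    if first_rate < low and second_rate < low:
        return ("both phases weak: check perception, instruction, "
                "camera calibration, dataset mismatch")
    return "no clear pattern: review videos and failure labels"


# Velocity-based RL variant on Bag Packing: 22/25 then 1/25.
print(diagnose(22 / 25, 1 / 25))
# No-LAM ablation on Box Loading: 8/25 then 6/25.
print(diagnose(8 / 25, 6 / 25))
```

A rule like this does not replace watching the videos. It only decides which videos to watch first.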
Checklist for a Benchmark Session
Before running:
[ ] Pin the commit and model checkpoint for every method
[ ] Reset scene layout and measure table, carton, and cart positions
[ ] Record object variant, payload, camera mount, and lighting
[ ] Run standing, squatting, turning, and emergency-stop smoke tests
[ ] Randomize method order to reduce battery, heat, or operator bias
[ ] Record egocentric video, third-person video, proprioception, VLA output, and LMO command
During the run:
[ ] Assign trial_id before the robot starts
[ ] Do not change the prompt inside a trial
[ ] If emergency stop happens, write stop_reason immediately
[ ] Use predefined criteria instead of subjective judgment
[ ] Do not delete bad trials unless the setup was clearly invalid and documented
After the run:
[ ] Review video by subgoal
[ ] Annotate primary failure mode
[ ] Compute each subgoal score and the average
[ ] Compare ablations with the same trial count
[ ] Pick the top 3 repeated failures for the next engineering sprint
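Two checklist items, randomizing method order and assigning trial_id before the robot starts, can be prepared ahead of the session. The sketch below builds a seeded schedule in which every method runs once per round in shuffled order. The trial_id pattern and the build_schedule helper are assumptions, not a convention from the paper.

```python
import random


def build_schedule(methods, tasks, trials_per_task, seed=0):
    """Return a list of (trial_id, task, method), shuffling methods per round."""
    rng = random.Random(seed)  # fixed seed so the schedule can be reproduced
    schedule = []
    for task in tasks:
        for round_idx in range(trials_per_task):
            order = list(methods)
            rng.shuffle(order)  # each method runs once per round, random order
            for method in order:
                trial_id = f"{task}_{method}_{round_idx + 1:03d}"
                schedule.append((trial_id, task, method))
    return schedule


schedule = build_schedule(
    ["wholebodyvla_full", "velocity_rl"], ["bag_packing"], trials_per_task=3
)
for entry in schedule:
    print(entry)
```

Shuffling within each round spreads battery drain, motor heat, and operator fatigue across methods instead of letting one method always run last.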
Conclusion
The key lesson from WholeBodyVLA is not only the 78.0% number. The deeper lesson is how the authors made the benchmark diagnostic: long tasks are split into subgoals, each subgoal is evaluated over 25 trials, baselines and ablations are shown side by side, and failures are analyzed as locomotion versus pick/place problems.
If you are building a whole-body VLA, start with a small but strict scorecard. Three tasks, two subgoals per task, and 25 trials per subgoal are enough to reveal many serious failures. Once the scorecard is stable, add generalization: new objects, new start poses, new payloads, and new terrain. Do not let the benchmark become only a good-looking video. Make it a tool that tells you what data to collect next, which LAM to improve, how to revise the LMO, and what to test before the next deployment.