Evaluating Whole-Body VLAs

What This Article Is For

The first six articles moved from the pipeline map to egocentric video, retargeting, teleoperation, LMO RL, and sim-to-real checks. Article 7 closes the series with the most practical question: once the policy runs, how do we evaluate whether a WholeBodyVLA-style system is actually better than a baseline?

If you only report one task-level success rate, you will miss the most important part of humanoid loco-manipulation. A robot may grasp a bag but fail when squatting. It may grab a cart handle but drift after pushing for two meters. It may fail the same task for three different reasons: perception selected the wrong target, the VLA produced a poor latent action, or the LMO controller stumbled during the approach. Whole-body VLA evaluation must therefore happen at the subgoal level, not only at the task level.

This article uses the WholeBodyVLA paper and the OpenDriveLab project page as the template. The paper evaluates an Agibot X2 prototype on three main tasks, each split into two subgoals: Bag Packing (Grasp Bags, then Move & Squat), Box Loading (Squat & Grasp, then Rise & Turn), and Cart Pushing (Grab Handle, then Push Ahead). The reported averages are 78.0% for WholeBodyVLA, compared with 64.0% for Modular Design, 42.0% for GR00T w/ LMO, 56.7% for OpenVLA-OFT w/ LMO, and 54.0% for a velocity-based RL variant.

If you have not read the earlier articles, revisit Mapping the WholeBodyVLA Pipeline, Egocentric Video and LAMs, RL and LMO, and Sim-to-Real Before Unitree G1. Outside the series, useful background includes our WholeBodyVLA ICLR 2026 analysis and GR00T N1 data collection on G1.

Why One Success Rate Is Not Enough

For a tabletop single-arm robot, a single "pick success" number can sometimes be enough to compare two models. For a full humanoid, one task contains several different physical phases. Bag Packing is not only grasping a bag. It also requires lateral stepping, squatting, placing the bag into a carton, and maintaining balance while the arms move an object downward. Box Loading is not only grasping a box. It requires squatting, standing, turning, holding the box stable, and placing it onto a cart. Cart Pushing is not only grasping a handle. It requires sustained forward walking under load while maintaining heading.

The useful feature of the WholeBodyVLA table is that it does not collapse everything into one vague number. Each task is split into two subgoal columns. If a method performs well on the first subgoal but poorly on the second, you immediately know that the failure is concentrated in the longer loco-manipulation phase. For example, the velocity-based RL variant scores 22/25 on Grasp Bags but only 1/25 on Move & Squat. If you only look at the 54.0% average, you know it is worse. If you inspect the subgoals, you know where it is worse: foot control, squatting, heading, and transition after manipulation.

This is the first lesson for teams starting humanoid evaluation: a benchmark should help you debug, not only help you advertise. A good scorecard should answer four questions:

Question	Why it matters
Which subgoal failed?	Separates perception, manipulation, locomotion, and transition problems
Does the failure repeat as a pattern?	Distinguishes random noise from a systematic weakness
Which ablation drops the most?	Tells you whether to invest in LAMs, LMO, data, or the decoder
Is the failure dangerous for hardware?	Prioritizes stumble, collision, excessive force, and drift under load

How To Read the WholeBodyVLA Table

The paper reports each subgoal over 25 trials. The table below rewrites the main results in a spreadsheet-friendly format. These are not example numbers; they are the results reported in the WholeBodyVLA paper.

Method	Grasp Bags	Move & Squat	Squat & Grasp	Rise & Turn	Grab Handle	Push Ahead	Avg. Score
Modular Design	22/25	12/25	9/25	9/25	22/25	22/25	64.0%
GR00T w/ LMO	20/25	10/25	6/25	4/25	12/25	11/25	42.0%
OpenVLA-OFT w/ LMO	19/25	6/25	12/25	12/25	22/25	14/25	56.7%
WholeBodyVLA	23/25	13/25	19/25	17/25	23/25	22/25	78.0%
WholeBodyVLA w/ velocity-based RL	22/25	1/25	16/25	3/25	24/25	15/25	54.0%
WholeBodyVLA w/o LAM	15/25	4/25	8/25	6/25	16/25	10/25	39.3%
WholeBodyVLA w/ manipulation-only LAM	24/25	7/25	17/25	11/25	20/25	14/25	63.3%
WholeBodyVLA w/ shared LAM	18/25	11/25	16/25	16/25	20/25	18/25	66.0%

Three observations matter.

First, the first subgoal of each task is usually easier than the second. Grasp Bags, Squat & Grasp, and Grab Handle are still difficult, but they are closer to local manipulation. Move & Squat, Rise & Turn, and Push Ahead contain more locomotion: lateral stepping, standing, turning, advancing, maintaining heading, and handling load. These columns stress the lower-body controller and locomotion latents much more strongly.

Second, LMO matters because it makes VLA decisions executable. The full WholeBodyVLA system reaches 78.0%, while the velocity-based RL variant reaches 54.0%. The paper notes that most of this gap comes from the second subgoal of each task, where locomotion dominates. This matches practical failure modes: a velocity controller may produce inconsistent gait, path deviation, stumble, or turning while advancing when the intended command is an in-place turn.

Third, LAMs are what make action-free video useful. Removing LAMs drops the score to 39.3%. Using only a manipulation LAM reaches 63.3%, which is much better but still weaker than the full system, especially on subgoals that require locomotion. A shared LAM reaches 66.0%, showing that one mixed latent model can learn useful structure, but separate manipulation and locomotion LAMs work better for this design.
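The second observation can be checked directly from the table counts by pooling the first and second subgoals separately. The sketch below assumes that, with 25 trials in every column, the reported average equals the pooled pass rate; the dictionary layout and the helper name `pooled_rate` are illustrative and not taken from the paper.

```python
# Pass counts out of 25, copied from the table above, in column order:
# Grasp Bags, Move & Squat, Squat & Grasp, Rise & Turn, Grab Handle, Push Ahead.
TRIALS_PER_SUBGOAL = 25
counts = {
    "WholeBodyVLA": [23, 13, 19, 17, 23, 22],
    "WholeBodyVLA w/ velocity-based RL": [22, 1, 16, 3, 24, 15],
}


def pooled_rate(passes):
    """Pooled pass rate over a list of per-subgoal pass counts."""
    return sum(passes) / (TRIALS_PER_SUBGOAL * len(passes))


summary = {}
for method, passes in counts.items():
    summary[method] = {
        "overall": pooled_rate(passes),
        "first_subgoals": pooled_rate(passes[0::2]),   # columns 1, 3, 5
        "second_subgoals": pooled_rate(passes[1::2]),  # columns 2, 4, 6
    }

for method, rates in summary.items():
    print(method, {name: f"{rate:.1%}" for name, rate in rates.items()})
```

With these counts the first-subgoal rates differ by about 4 points (86.7% versus 82.7%), while the second-subgoal rates differ by about 44 points (69.3% versus 25.3%), which is where the 24-point gap in the average comes from.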

Designing Your Own Subgoal Scorecard

If you are building a benchmark for G1, X2, or another humanoid, you should not copy the tasks blindly when your hardware and lab are different. Copy the evaluation structure instead: long tasks are split into subgoals with clear pass criteria. A minimal scorecard should include these columns:

Column	Example	Note
`trial_id`	`bagpack_013`	Unique key for finding video and logs
`task`	`bag_packing`	User-level task name
`subgoal`	`move_and_squat`	Main scoring unit
`instruction`	`put the paper bag into the carton`	Language sent to the VLA
`start_pose_bin`	`near_left`, `far_right`, `rotated_30deg`	Helps analyze generalization
`object_variant`	`brown_bag`, `white_bag`, `heavy_box`	Records distribution shift
`payload_kg`	`0`, `5`, `50`	Critical for cart pushing
`pass`	`true` or `false`	Subgoal pass/fail
`failure_mode`	`path_deviation`	Filled only when failed
`notes`	`stopped 30 cm too early`	Short note, not an essay

Start with CSV if that is what your team can maintain:

trial_id,task,subgoal,instruction,start_pose_bin,object_variant,payload_kg,pass,failure_mode,notes
bagpack_001,bag_packing,grasp_bags,"grasp the bags",center,brown_bag,0,true,,
bagpack_001,bag_packing,move_and_squat,"place the bags into the carton",center,brown_bag,0,false,early_stop,"stopped before carton; arms could not reach"
boxload_004,box_loading,rise_and_turn,"put the box onto the cart",rotated_30deg,plastic_box,6,false,wrong_orientation,"turned only halfway"
cart_011,cart_pushing,push_ahead,"push the cart forward",center,carton_load,50,false,path_deviation,"drifted right after 1.5 m"

Do not wait for a complex dashboard. A well-maintained CSV plus synchronized video is enough to find useful failures during the first week. Build dashboards only after the taxonomy stabilizes.
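Even a plain CSV benefits from a small loader that enforces the column rules before anyone computes a score. The following is a minimal sketch using only the standard library; the function name `load_scorecard` and the specific checks are assumptions for illustration, and the embedded text stands in for a real file.

```python
import csv
import io

# Two rows in the scorecard format shown above; in practice, open the CSV file instead.
SCORECARD_CSV = '''trial_id,task,subgoal,instruction,start_pose_bin,object_variant,payload_kg,pass,failure_mode,notes
bagpack_001,bag_packing,grasp_bags,"grasp the bags",center,brown_bag,0,true,,
bagpack_001,bag_packing,move_and_squat,"place the bags into the carton",center,brown_bag,0,false,early_stop,"stopped before carton; arms could not reach"
'''


def load_scorecard(text):
    """Parse scorecard rows and reject rows that break the column rules."""
    rows = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row["pass"] not in ("true", "false"):
            raise ValueError(f"line {line_no}: pass must be true or false")
        row["pass"] = row["pass"] == "true"
        row["payload_kg"] = float(row["payload_kg"])
        # failure_mode is filled only when the subgoal failed.
        if row["pass"] == bool(row["failure_mode"]):
            raise ValueError(
                f"line {line_no}: failure_mode must be set exactly when pass is false"
            )
        rows.append(row)
    return rows


rows = load_scorecard(SCORECARD_CSV)
print(len(rows), "rows loaded")
```

Rejecting inconsistent rows at load time keeps a passed trial from silently carrying a failure label, which would otherwise distort the failure counts later.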

Defining Pass and Fail Criteria

Beginners often make one major evaluation mistake: they define pass criteria after watching the videos. That makes the score biased. Write the criteria before running trials.

Subgoal	Pass when	Fail when
`Grasp Bags`	Both grippers hold the bag stably enough to start moving	The bag is missed, dropped before movement, or the wrong object is grasped
`Move & Squat`	The robot reaches the carton region, squats low enough, and places the bag inside	It stops too far away, drifts, stumbles, squats to the wrong height, or drops the bag outside
`Squat & Grasp`	The robot squats and holds the box securely with both hands	It cannot reach the box, the gripper is misaligned, the box slips, or balance is lost
`Rise & Turn`	The robot stands, turns toward the cart, and keeps holding the box	The box is dropped, the turn is too short/long, the base advances during turn, or it hits the cart
`Grab Handle`	The required hand or both hands grasp the cart handle	The robot cannot reach, grasps off-center, or loses the handle before pushing
`Push Ahead`	The cart moves along the target direction for the required distance and the robot remains stable	It deviates, stops too early/late, stumbles, loses grip, or pushes the load sideways

Each subgoal should also have measurable tolerances. For example:

subgoals:
  move_and_squat:
    target_region_radius_m: 0.20
    final_base_yaw_error_deg: 15
    min_squat_depth_m: 0.18
    object_must_be_inside_container: true
  push_ahead:
    min_distance_m: 2.0
    max_lateral_drift_m: 0.25
    max_heading_error_deg: 12
    handle_contact_required: true

These thresholds are examples, not universal rules. In a small lab with a compact robot, `min_distance_m` may be 1 meter. For a heavy cart, you may need a rule such as "handle contact must not be lost for more than 0.5 seconds." The important part is that everyone on the team uses the same definition.
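One way to make sure everyone applies the same definition is to turn the tolerances into a function that returns both the verdict and the reasons. This is a sketch under assumptions: the threshold values are the example ones from the config, while the measurement field names (`distance_m`, `lateral_drift_m`, `heading_error_deg`, `handle_contact_kept`) and the violation labels are hypothetical.

```python
# Thresholds copied from the example config above; measurement names are assumed.
PUSH_AHEAD_LIMITS = {
    "min_distance_m": 2.0,
    "max_lateral_drift_m": 0.25,
    "max_heading_error_deg": 12,
    "handle_contact_required": True,
}


def check_push_ahead(measured, limits=PUSH_AHEAD_LIMITS):
    """Return (passed, violations) for one Push Ahead attempt."""
    violations = []
    if measured["distance_m"] < limits["min_distance_m"]:
        violations.append("distance_too_short")
    if abs(measured["lateral_drift_m"]) > limits["max_lateral_drift_m"]:
        violations.append("lateral_drift")
    if abs(measured["heading_error_deg"]) > limits["max_heading_error_deg"]:
        violations.append("heading_error")
    if limits["handle_contact_required"] and not measured["handle_contact_kept"]:
        violations.append("handle_contact_lost")
    return (not violations, violations)


# Hypothetical measurement: the cart drifted 0.31 m sideways over 2.1 m.
attempt = {
    "distance_m": 2.1,
    "lateral_drift_m": 0.31,
    "heading_error_deg": 8.0,
    "handle_contact_kept": True,
}
print(check_push_ahead(attempt))  # (False, ['lateral_drift'])
```

Returning the list of violations rather than a bare boolean gives the reviewer a starting point when choosing a failure label.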

Ablations: Ask What Each Component Contributes

Ablation turns a benchmark into an engineering decision. WholeBodyVLA's ablations make each component's role clear:

Ablation	Technical question	What the result suggests
`w/o LAM`	What happens if we skip latent pretraining from videos and only finetune on teleop?	Score drops to 39.3%, so action-free video and latent supervision are important
`manipulation-only LAM`	What if we learn latents only from manipulation, without locomotion-aware video?	63.3%, much better than no LAM but weak on locomotion-heavy subgoals
`shared LAM`	What if one LAM is trained on mixed manipulation and locomotion data?	66.0%, useful but weaker than separate LAMs
`velocity-based RL`	What if the low-level controller is a conventional velocity-tracking policy?	54.0%, especially weak on `Move & Squat` and `Rise & Turn`

When running internal ablations, keep three things fixed: the same tasks, the same number of trials, and the same pass/fail criteria. If the full model runs 25 trials and the baseline runs 10, the comparison is noisy. If today's object is light and tomorrow's object is heavy, the difference may come from the setup rather than the model.
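To see how much the trial count matters, it helps to put an interval around each pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is not something the paper reports, and the example counts are hypothetical.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Same observed 60% pass rate, different trial counts.
for passes, trials in [(6, 10), (15, 25)]:
    low, high = wilson_interval(passes, trials)
    print(f"{passes}/{trials}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

At the same observed pass rate, the interval from 10 trials is noticeably wider than the one from 25 trials, and even 25 trials leave a wide band, so small differences between methods should be read with care.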

A small config is enough:

benchmark:
  date: 2026-06-10
  trials_per_subgoal: 25
  tasks:
    - bag_packing
    - box_loading
    - cart_pushing
methods:
  - name: wholebodyvla_full
    manipulation_lam: separate
    locomotion_lam: separate
    low_level_controller: lmo
  - name: no_lam
    manipulation_lam: none
    locomotion_lam: none
    low_level_controller: lmo
  - name: manip_only_lam
    manipulation_lam: separate
    locomotion_lam: none
    low_level_controller: lmo
  - name: velocity_rl
    manipulation_lam: separate
    locomotion_lam: separate
    low_level_controller: velocity_tracking
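Before a session, it can be worth confirming which settings each ablation actually changes relative to the full system. The sketch below mirrors the method entries from the config as plain Python dictionaries to avoid a YAML dependency; the helper name `diff_from_reference` is illustrative.

```python
# The method entries from the config above, written as plain dicts.
methods = [
    {"name": "wholebodyvla_full", "manipulation_lam": "separate",
     "locomotion_lam": "separate", "low_level_controller": "lmo"},
    {"name": "no_lam", "manipulation_lam": "none",
     "locomotion_lam": "none", "low_level_controller": "lmo"},
    {"name": "manip_only_lam", "manipulation_lam": "separate",
     "locomotion_lam": "none", "low_level_controller": "lmo"},
    {"name": "velocity_rl", "manipulation_lam": "separate",
     "locomotion_lam": "separate", "low_level_controller": "velocity_tracking"},
]


def diff_from_reference(method, reference):
    """List the settings where an ablation departs from the reference method."""
    return {
        key: (reference[key], value)
        for key, value in method.items()
        if key != "name" and value != reference[key]
    }


reference = methods[0]
changes = {m["name"]: diff_from_reference(m, reference) for m in methods[1:]}
for name, changed in changes.items():
    print(name, changed)
```

Here `no_lam` changes both LAM settings, while `manip_only_lam` and `velocity_rl` each change exactly one. Any setting that shows up unexpectedly in a diff means the ablation no longer isolates the component it is named after.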

Failure Taxonomy: Log What You Want To Fix

The WholeBodyVLA project page shows baseline failure cases such as stumble to stop, loss of balance, large deviation from the intended direction, and stopping too late. The paper's appendix also separates failures into locomotion failures and pick/place failures, then further labels causes such as object unreachable, basket unreachable, wrong orientation, early stop, overshoot, collision, stumble, poor grasp pose, and misaligned placement.

For a beginner team, start with a short taxonomy that is still useful:

Failure mode	Group	Description
`early_stop`	locomotion	The robot stops before the manipulation region, so the arms cannot reach
`overshoot`	locomotion	The robot passes the intended stopping point
`path_deviation`	locomotion	The robot drifts away from the desired path or heading
`turn_with_advance`	locomotion	The robot advances while the intended command is an in-place turn
`stumble`	locomotion	The foot catches, the torso shakes strongly, or safety stop is needed
`wrong_orientation`	locomotion	The base faces the wrong direction before manipulation
`bad_grasp_pose`	manipulation	The gripper contacts the wrong point or cannot close properly
`object_slip`	manipulation	The object is grasped but slips during motion
`misaligned_place`	manipulation	The object is placed outside the container or cart target
`collision`	safety	The robot hits the table, carton, cart, or operator

Do not assign too many labels. A failed trial can have a chain of causes, but you should choose one `primary_failure_mode` and add a `secondary_failure_mode` only when it is truly useful.

trial_id,primary_failure_mode,secondary_failure_mode,stop_reason,video_start_s,video_end_s
bagpack_001,early_stop,bad_grasp_pose,operator_estop,18.4,31.2
boxload_004,wrong_orientation,turn_with_advance,timeout,22.0,36.5
cart_011,path_deviation,stumble,operator_estop,10.7,18.9

Annotate right after the benchmark session, while the operator and reviewer still remember the setup. Three days later, the video may still be available, but you may no longer remember whether the floor was slippery, whether the payload was exactly 50 kg, or whether the camera mount had shifted.
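Once the labels are in place, a simple tally shows which failures repeat. The sketch below assumes the taxonomy table above as the source of the mode-to-group mapping; the list of labels is hypothetical session data, not results from the paper.

```python
from collections import Counter

# Mode-to-group mapping taken from the taxonomy table above.
FAILURE_GROUP = {
    "early_stop": "locomotion", "overshoot": "locomotion",
    "path_deviation": "locomotion", "turn_with_advance": "locomotion",
    "stumble": "locomotion", "wrong_orientation": "locomotion",
    "bad_grasp_pose": "manipulation", "object_slip": "manipulation",
    "misaligned_place": "manipulation", "collision": "safety",
}

# Hypothetical primary labels from one annotated session.
primary_modes = [
    "early_stop", "path_deviation", "early_stop", "object_slip",
    "stumble", "early_stop", "path_deviation", "collision",
]

# Catch typos and ad hoc labels before they fragment the counts.
unknown = [mode for mode in primary_modes if mode not in FAILURE_GROUP]
if unknown:
    raise ValueError(f"labels outside the taxonomy: {unknown}")

by_mode = Counter(primary_modes)
by_group = Counter(FAILURE_GROUP[mode] for mode in primary_modes)

print(by_mode.most_common(3))
print(dict(by_group))
```

Counting only the primary label keeps the totals equal to the number of failed trials, so the shares are easy to interpret.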

Computing and Interpreting Scores

For a 25-trial-per-subgoal table, the scoring code is simple:

from collections import defaultdict

# One row per (trial, subgoal) result; in practice, load these from the scorecard CSV.
rows = [
    {"method": "wholebodyvla", "task": "bag_packing", "subgoal": "grasp_bags", "pass": True},
    {"method": "wholebodyvla", "task": "bag_packing", "subgoal": "move_and_squat", "pass": False},
]

# Pass and total counts keyed by (method, task, subgoal).
score = defaultdict(lambda: {"pass": 0, "total": 0})

for row in rows:
    key = (row["method"], row["task"], row["subgoal"])
    score[key]["pass"] += int(row["pass"])  # True counts as 1, False as 0
    score[key]["total"] += 1

# Report each subgoal as a count and a percentage, for example 23/25 and 92.0%.
for key, value in score.items():
    rate = value["pass"] / value["total"]
    print(key, f"{value['pass']}/{value['total']}", f"{rate:.1%}")

Interpreting the score is harder than computing it. If Grasp Bags is high and Move & Squat is low, do not immediately retrain manipulation. Inspect locomotion video, base trajectory, yaw, squat depth, and latency between the 10 Hz VLA and the 50 Hz controller. If Grab Handle is high but Push Ahead is low, the issue may be heading control under load, handle contact, or insufficient payload randomization. If Squat & Grasp is low, the issue may be box perception, reach trajectory, or insufficient squat depth.

Read scores in pairs:

Pattern	First diagnosis
First manipulation subgoal high, second locomotion-heavy subgoal low	Prioritize LMO, command interface, stopping accuracy, squat/turn precision
Both subgoals low	Check perception, instruction, camera calibration, dataset mismatch
Full model much better than no-LAM	Latent pretraining is valuable; expand action-free video
Manip-only LAM close to full model	Current benchmark may not stress locomotion enough
Velocity RL weak at transitions	Controller likely needs discrete intent, directional accuracy reward, or better curriculum
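The first two patterns in the table can be turned into a small triage helper. This is only a sketch: the cutoffs `HIGH` and `LOW` are illustrative assumptions rather than values from the paper, and the ablation patterns in the remaining rows need comparisons across methods, which this helper does not attempt.

```python
# Cutoffs are illustrative assumptions; tune them to your own benchmark.
HIGH = 0.70
LOW = 0.50


def first_diagnosis(first_rate, second_rate):
    """Map a (first, second) subgoal pass-rate pair to a starting hypothesis."""
    if first_rate >= HIGH and second_rate < LOW:
        return "locomotion-heavy phase: check LMO, command interface, stopping accuracy"
    if first_rate < LOW and second_rate < LOW:
        return "both low: check perception, instruction, calibration, dataset mismatch"
    return "no clear pattern: review failure labels and video"


# Bag Packing counts for the velocity-based RL variant from the results table.
print(first_diagnosis(22 / 25, 1 / 25))
```

The returned string is a first hypothesis to test against video and logs, not a verdict.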

Checklist for a Benchmark Session

Before running:

[ ] Pin the commit and model checkpoint for every method
[ ] Reset scene layout and measure table, carton, and cart positions
[ ] Record object variant, payload, camera mount, and lighting
[ ] Run standing, squatting, turning, and emergency-stop smoke tests
[ ] Randomize method order to reduce battery, heat, or operator bias
[ ] Record egocentric video, third-person video, proprioception, VLA output, and LMO command

During the run:

[ ] Assign `trial_id` before the robot starts
[ ] Do not change the prompt inside a trial
[ ] If emergency stop happens, write `stop_reason` immediately
[ ] Use predefined criteria instead of subjective judgment
[ ] Do not delete bad trials unless the setup was clearly invalid and documented

After the run:

[ ] Review video by subgoal
[ ] Annotate primary failure mode
[ ] Compute each subgoal score and the average
[ ] Compare ablations with the same trial count
[ ] Pick the top 3 repeated failures for the next engineering sprint
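Two checklist items, randomizing method order and assigning `trial_id` before the robot starts, can be handled together by generating the schedule ahead of the session. The sketch below is one possible approach, assuming a per-round shuffle with a fixed seed; the `trial_id` format and the function name `build_schedule` are hypothetical, and a lab where switching checkpoints is slow may prefer shuffling larger blocks.

```python
import random

methods = ["wholebodyvla_full", "no_lam", "manip_only_lam", "velocity_rl"]
TRIALS_PER_METHOD = 25


def build_schedule(task, methods, trials, seed):
    """Interleave methods in shuffled rounds so no method always runs first or last."""
    rng = random.Random(seed)  # fixed seed so the schedule can be reproduced
    schedule = []
    for round_index in range(1, trials + 1):
        order = list(methods)
        rng.shuffle(order)
        for method in order:
            trial_id = f"{task}_{method}_{round_index:03d}"
            schedule.append((trial_id, method))
    return schedule


schedule = build_schedule("bag_packing", methods, TRIALS_PER_METHOD, seed=20260610)
print(len(schedule), "trials; first round:", schedule[:4])
```

Because every round contains each method once, battery level, motor heat, and operator fatigue are spread across methods instead of accumulating on whichever one runs last.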

Conclusion

The key lesson from WholeBodyVLA is not only the 78.0% number. The deeper lesson is how the authors made the benchmark diagnostic: long tasks are split into subgoals, each subgoal is evaluated over 25 trials, baselines and ablations are shown side by side, and failures are analyzed as locomotion versus pick/place problems.

If you are building a whole-body VLA, start with a small but strict scorecard. Three tasks, two subgoals per task, and 25 trials per subgoal are enough to reveal many serious failures. Once the scorecard is stable, add generalization: new objects, new start poses, new payloads, and new terrain. Do not let the benchmark become only a good-looking video. Make it a tool that tells you what data to collect next, which LAM to improve, how to revise the LMO, and what to test before the next deployment.

Nguyễn Anh Tuấn
