wholebody-vlahumanoid-vlawholebody-vlaevaluationoperator-scaleagibot-x2lamlmogr00t-wbc

Scaling to 20 Operators and VLA Evaluation

How to evaluate WholeBodyVLA, 50 Hz LMO, and WBC ONNX controllers while scaling humanoid data collection to 20 operators.

Nguyễn Anh TuấnJune 10, 202615 min read
Scaling to 20 Operators and VLA Evaluation

What this article gives you

At the end of this series, the question is no longer "how do we record one episode?" The practical question is harder: if you have 20 operators, a humanoid whole-body VLA, tasks longer than 30 seconds, and a realtime leg controller, how do you know the system is improving instead of producing more noisy data?

In part 1, we started with a two-person pilot to measure throughput and obvious failures. Part 3 standardized raw logs with ROS 2 and MCAP so episodes can be replayed and audited. Part 5 added synthetic data and QA before training. Part 6 turns those pieces into an operating evaluation table for a 20-operator team.

We use WholeBodyVLA as the reference architecture because the paper and project page match the problem directly: an AgiBot X2 humanoid performs large-space loco-manipulation, a Latent Action Model (LAM) learns latent action tokens from action-free egocentric videos, the VLM decodes at roughly 10 Hz, and a Loco-Manipulation-Oriented RL policy (LMO) runs at 50 Hz on proprioception. Read these primary sources first:

Outside this series, Training Trackers and LMO RL and VLA/WBC Repository Landscape provide useful background. If you are building an internal dashboard, treat this article as a first metric spec.

humanoid evaluation lab
humanoid evaluation lab

The architecture you need to evaluate

A whole-body VLA is not the same as a robot-arm policy that maps images to wrist poses. In a humanoid, high-level decisions and body stability run at different frequencies. WholeBodyVLA describes the runtime as follows: egocentric images and language instructions are encoded into latent action tokens; the action decoder emits dual-arm joint actions and locomotion commands at roughly 10 Hz; the LMO RL policy consumes proprioceptive state and runs at 50 Hz to execute locomotion commands. The paper also describes a split deployment: the VLA runs on an RTX 4090 workstation, the RL policy runs on an onboard computer, and communication happens through ZeroMQ over Ethernet.

For beginners, model the system as three loops:

camera + language
      |
      v
VLM / LAM action decoder (~10 Hz)
      |        \
      |         -> dual-arm joint actions
      v
locomotion command: forward, sidestep, turn, squat
      |
      v
LMO RL policy on proprioception (50 Hz)
      |
      v
whole-body robot + WBC / PD / safety layer

The evaluation table must separate failures from these loops. If the task fails because the robot misunderstood the instruction, it is a VLA or perception problem. If it understood the task but sidestepped to the wrong placement position, the locomotion command or LMO layer is suspect. If the legs are stable until the arms push the box and disturb the torso, the failure is in whole-body coupling. If the episode is rejected because camera timestamps drifted, the problem is data infrastructure, not the model.

Many teams also want to compare or replace their low-level stack with GR00T WholeBodyControl. The GR00T WBC documentation describes a humanoid controller platform, including decoupled WBC used in Isaac GR00T / GR00T N1.x and workflows for teleoperation, VLA data collection, and inference. For Unitree G1, the repository exposes two notable ONNX policies:

GR00T-WholeBodyControl-Balance.onnx
GR00T-WholeBodyControl-Walk.onnx

These files do not make AgiBot X2 equivalent to G1, and they do not directly replace WholeBodyVLA's LMO if the morphology is different. They do show a practical deployment pattern: the low-level controller should be a stable artifact, loadable at runtime, measurable for latency, and split into clear balance/walk modes so safety testing is manageable.

Why evaluation must come before 20-operator scale

With two operators, you can still watch episodes manually. With 20 operators, bad data scales quickly: new operators move too fast, prompts drift, scenes are reset inconsistently, the robot heats up, camera mounts shift, and some operators accumulate high discard rates without anyone noticing during the day.

The goal of evaluation is not to rank people. The goal is to keep the system learning from better data each week. A useful dashboard answers six questions:

Question Main metric Who uses it
Did the robot complete the task? Task success Research lead, model owner
Did the robot fall or become unstable? Balance failures Safety owner, WBC owner
How often did humans intervene? Intervention rate Ops lead, trainer
Was the control loop late? Latency Infra, realtime owner
Is the episode clean enough for training? Data acceptance Data QA, training owner
Which benchmark is failing? Task breakdown Product, research, ops

If one metric is missing, scale becomes misleading. Task success may increase while intervention rate also increases, which means operators are rescuing the robot more often. Data acceptance may look high while latency jitter worsens, meaning the dataset is saved but temporally inconsistent for training.

KPI table for 20 operators

The table below is a good first baseline when moving from a pilot to a 20-operator team. Each operator runs three tasks: bag packing, box loading, and cart pushing. During evaluation, start with 10 trials per operator per task per day before increasing production data collection.

Metric Definition Suggested formula Week 1 threshold
Task success Final task criteria met without a safety violation success_trials / total_trials Track by task, not only globally
Subgoal success Steps such as grasp, sidestep, squat, place subgoal_pass / total_trials At least two subgoals per task
Balance failure Fall, emergency stop from tilt, center-of-mass or IMU violation balance_fail_trials / total_trials As close to zero as possible
Intervention rate Pause, reset, manual override, or human rescue interventions / minute or trials_with_intervention / trials Should fall each week
Command latency From camera/state frame to command received by robot p50, p95, p99 Separate 10 Hz and 50 Hz loops
Loop miss rate VLA or LMO missed its deadline missed_ticks / expected_ticks Stricter for LMO than VLA
Data acceptance Episode approved for training accepted_episodes / recorded_episodes Do not optimize blindly
Discard reason Why the episode was rejected Fixed taxonomy Required
Operator throughput Accepted episodes per hour accepted / active_hours Compare with team median
Recovery time Time from failure to next trial median, p90 Used for ops improvement

A minimal CSV schema:

date,operator_id,robot_id,task,trial_id,success,subgoal_1,subgoal_2,balance_failure,interventions,accepted,discard_reason,vla_p95_ms,lmo_p95_ms,total_seconds
2026-06-10,op_03,x2_01,bag_packing,trial_004,true,true,true,false,0,true,,118,8,42.6
2026-06-10,op_03,x2_01,box_loading,trial_005,false,true,false,false,1,false,place_miss_after_turn,126,9,51.2
2026-06-10,op_11,x2_01,cart_pushing,trial_002,false,true,false,true,2,false,balance_push_load,135,11,28.4

Do not start with an elaborate dashboard. Start with clean event rows. If every row has task, operator, trial, success, failure mode, and latency, you already have enough information to decide whether next week should focus on operator training, controller fixes, or more data collection.

Defining success for the three benchmarks

WholeBodyVLA uses three main task suites: Bag Packing, Box Loading, and Cart Pushing. The project page describes sequences involving bimanual grasping, sidestepping, squatting, grasping and lifting a box, turning toward a cart, grasping a cart handle, pushing forward, and pushing a load above 50 kg. The appendix also discusses object and load variations: replacing a bag with a different bag, replacing box contents with plastic containers, and replacing the carton on the cart with 60 kg barbell plates.

You do not need to copy the entire benchmark on day one, but you should preserve the key idea: every task must force the robot to coordinate legs, arms, torso, and balance.

Task Subgoal 1 Subgoal 2 Final success Failure modes to log
Bag Packing Grasp bag with two hands or the configured main hand Sidestep and squat toward carton Bag is inside the carton and robot remains stable miss grasp, bag drop, squat drift, place outside
Box Loading Squat and grasp box Rise and turn toward cart Box is placed on the cart target region box tilt, turn overshoot, hand slip, torso lean
Cart Pushing Grasp handle Push ahead along path Cart moves the target distance without severe robot drift handle miss, lateral drift, foot slip, balance fail

For beginners, always separate task_success from subgoal_success. A box-loading trial may fail at placement while still proving that squat and grasp improved. If you only store final success, the model owner cannot tell whether to fix perception, action decoding, or LMO.

A readable benchmark YAML:

benchmark:
  name: humanoid_wholebody_vla_eval_v1
  robot: agibot_x2
  trials_per_operator_per_task: 10
  operators: 20
  tasks:
    bag_packing:
      subgoals: [grasp_bag, move_and_squat, place_in_carton]
      success:
        object_in_target: true
        no_balance_failure: true
        max_interventions: 0
    box_loading:
      subgoals: [squat_and_grasp, rise_and_turn, place_on_cart]
      success:
        box_on_cart: true
        no_balance_failure: true
        max_interventions: 0
    cart_pushing:
      subgoals: [grab_handle, push_ahead]
      success:
        cart_distance_m: 2.0
        lateral_drift_m_max: 0.35
        no_balance_failure: true

Latency: measure 10 Hz and 50 Hz separately

WholeBodyVLA gives us an important systems lesson: the VLA and LMO do not run at the same frequency. The VLA at roughly 10 Hz can tolerate slower inference because it handles perception, reasoning, and high-level command generation. The 50 Hz LMO loop cannot tolerate much jitter because it maintains locomotion and balance on proprioception.

Your latency table should therefore have at least two layers:

Layer Frequency What to measure Why
VLA decode About 10 Hz camera timestamp -> latent/action decoded -> command sent High delay makes commands stale
LMO/WBC loop 50 Hz proprio read -> policy inference -> action applied Jitter damages gait and balance
Network bridge Stack-dependent command sent -> command received ZeroMQ, ROS 2, or Ethernet can spike
Data recorder Camera-dependent frame saved -> metadata indexed Recorder lag can corrupt data even if control works

Minimal instrumentation:

from time import monotonic_ns

def now_ms():
    return monotonic_ns() / 1_000_000

event = {
    "trial_id": trial_id,
    "camera_frame_ts_ms": frame.ts_ms,
    "vla_decode_start_ms": now_ms(),
}

latent_action = vla.decode(frame, instruction)
event["vla_decode_end_ms"] = now_ms()

command = action_decoder.to_robot_command(latent_action)
event["command_sent_ms"] = now_ms()
command_bus.send(command)

# On the low-level side:
low_event = {
    "command_received_ms": now_ms(),
    "proprio_ts_ms": proprio.ts_ms,
    "lmo_start_ms": now_ms(),
}
action = lmo_policy(proprio, command)
low_event["lmo_end_ms"] = now_ms()
robot.apply(action)
low_event["action_applied_ms"] = now_ms()

Compute p50, p95, and p99 by task and by operator. If VLA p95 rises on cart pushing, the visual scene may be more complex or the prompt may trigger longer decoding. If LMO p95 rises, debug onboard compute, thread scheduling, memory allocation, logging, or network receive. Do not compress both loops into one average number.

Balance failures and intervention rate

Balance failure must be defined strictly. If every operator has a different definition of "almost fell", the dashboard becomes useless. Start with four labels:

Label Condition
fall Robot touches the ground or needs physical reset
estop_balance Emergency stop caused by tilt, slip, or unstable motion
operator_catch Human intervenes to prevent a fall or collision
controller_abort Safety layer stops because state exceeds a threshold

Interventions also need a taxonomy:

intervention_types:
  pause: operator pauses policy but keeps trial alive
  manual_override: operator takes over motion
  scene_fix: human fixes object/scene during trial
  safety_stop: emergency stop or safety abort
  prompt_correction: instruction changed after trial start

For training reports, a trial with intervention should usually not count as clean success. Store two columns:

Column Meaning
task_success_raw Final task outcome succeeded, even with intervention
task_success_clean Final task outcome succeeded without forbidden intervention

When scaling to 20 operators, task_success_clean is the model metric. task_success_raw is useful for operations because it shows whether operators can rescue the task, but it can mislead checkpoint selection.

Data acceptance: which episodes train the model?

Data acceptance is not identical to task success. A failed episode can still be useful for failure analysis if it has clean labels. But the main behavior cloning or VLA dataset should prioritize successful episodes with correct timestamps, correct prompts, no intervention, and no sensor drops.

Acceptance checklist:

Check Accept if Reject if
Prompt Matches task and objects Changed mid-trial or too vague
Camera Enough frames, no severe exposure loss Long drop, heavy blur during critical action
State/action Correct shape, monotonic timestamps NaN, wrong joint order, unexplained action spikes
Latency Inside budget p99 exceeds budget during critical phase
Safety No fall or emergency stop Fall or human catch
Outcome Success or clearly labeled failure Failure reason unknown

A simple rule engine:

def accept_episode(ep):
    if ep.prompt_changed:
        return False, "prompt_changed"
    if ep.camera_drop_ms_max > 250:
        return False, "camera_drop"
    if ep.state_has_nan or ep.action_has_nan:
        return False, "bad_state_action"
    if ep.balance_failure:
        return False, "balance_failure"
    if ep.interventions > 0:
        return False, "intervention"
    if ep.vla_latency_p99_ms > 250:
        return False, "vla_latency"
    if ep.lmo_latency_p99_ms > 20:
        return False, "lmo_latency"
    if not ep.task_success:
        return False, "task_fail"
    return True, ""

These thresholds are starting points, not universal constants. In a real stack, derive budgets from your pilot distribution and hardware. The essential rule is that every rejection has a fixed reason, so the weekly review can tell whether 38% of rejects came from operator behavior, cameras, the model, or the controller.

Operating schedule for 20 operators

Do not let 20 people collect production data on day one. Split the rollout into three rounds:

Round People Goal Gate to pass
Calibration 4 Compare new operators with the two-person pilot Correct prompts and intervention labels
Eval batch 20 Everyone runs the same small benchmark Success, balance, latency, and data acceptance table exists
Production 20 Collect data for priority tasks Only approved operators and tasks are opened

Suggested gates:

operator_gate:
  min_trials: 30
  prompt_error_rate_max: 0.02
  unsafe_intervention_rate_max: 0.05
  accepted_episode_rate_min: 0.70
  median_reset_seconds_max: 90

robot_gate:
  balance_failure_rate_max: 0.02
  lmo_loop_miss_rate_max: 0.01
  no_unexplained_estop: true

model_gate:
  clean_success_rate_min_by_task:
    bag_packing: 0.60
    box_loading: 0.45
    cart_pushing: 0.50

These numbers are not academic benchmarks. They are operating guardrails. If week one does not pass, do not collect more of the same data. Fix the largest failure mode first. Scaling bad data only increases training, QA, and debugging cost.

What the first dashboard should show

A good dashboard does not have to be beautiful, but it must support decisions. I usually start with four tables:

Table Required columns Decision supported
Operator scorecard operator, trials, clean success, interventions, accepted rate, median reset Who needs retraining and who can enter production
Task scorecard task, subgoal success, balance fail, latency p95, top discard reason Which task is bottlenecked
Latency health VLA p50/p95/p99, LMO p50/p95/p99, loop miss Whether infra or realtime work is needed
Data QA accepted, rejected, reject reason, camera drop, prompt error Whether the dataset is trainable

Operator scorecard SQL:

select
  operator_id,
  count(*) as trials,
  avg(case when task_success_clean then 1 else 0 end) as clean_success_rate,
  avg(interventions) as interventions_per_trial,
  avg(case when accepted then 1 else 0 end) as accepted_rate,
  percentile_cont(0.5) within group (order by reset_seconds) as median_reset_seconds
from eval_trials
where date = '2026-06-10'
group by operator_id
order by accepted_rate desc;

Failure reason SQL:

select
  task,
  discard_reason,
  count(*) as n
from eval_trials
where accepted = false
group by task, discard_reason
order by n desc;

If cart_pushing has many balance_push_load rejects, prioritize WBC/LMO and force interaction. If bag_packing has many place_outside rejects, inspect perception, hand calibration, or the action decoder. If every task has camera_drop, do not fine-tune the model yet; fix the recorder.

Conclusion

This final article moves the series from "we can collect data" to "we can tell whether the robot and data are good." WholeBodyVLA gives a clear technical blueprint: LAM learns latent action tokens, the VLA decodes around 10 Hz, LMO runs at 50 Hz, and benchmarks must force the robot to walk, turn, squat, grasp, place, and push. GR00T WBC gives another deployment blueprint: low-level controllers need explicit artifacts such as ONNX balance/walk policies, measurable runtime behavior, and safe test modes.

When scaling to 20 operators, do not count episodes alone. Track clean task success, balance failures, intervention rate, latency, data acceptance, and breakdowns across bag packing, box loading, and cart pushing. If those six metrics are stable, you can increase data collection with less risk. If they disagree, trust the failure table more than the feeling that the robot "seems to work."

NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Related Posts

Pilot 2 người cho dữ liệu humanoid VLA
wholebody-vla

Pilot 2 người cho dữ liệu humanoid VLA

6/10/202615 min read
NT
Chọn teleoperation stack cho humanoid
wholebody-vla

Chọn teleoperation stack cho humanoid

6/10/202616 min read
NT
Đánh giá VLA toàn thân
wholebody-vla

Đánh giá VLA toàn thân

6/10/202615 min read
NT