Scaling to 20 Operators and VLA Evaluation

What this article gives you

At the end of this series, the question is no longer "how do we record one episode?" The practical question is harder: if you have 20 operators, a humanoid whole-body VLA, tasks longer than 30 seconds, and a realtime leg controller, how do you know the system is improving instead of producing more noisy data?

In part 1, we started with a two-person pilot to measure throughput and obvious failures. Part 3 standardized raw logs with ROS 2 and MCAP so episodes can be replayed and audited. Part 5 added synthetic data and QA before training. Part 6 turns those pieces into an operating evaluation table for a 20-operator team.

We use WholeBodyVLA as the reference architecture because the paper and project page match the problem directly: an AgiBot X2 humanoid performs large-space loco-manipulation, a Latent Action Model (LAM) learns latent action tokens from action-free egocentric videos, the VLM decodes at roughly 10 Hz, and a Loco-Manipulation-Oriented RL policy (LMO) runs at 50 Hz on proprioception. Read these primary sources first:

Outside this series, Training Trackers and LMO RL and VLA/WBC Repository Landscape provide useful background. If you are building an internal dashboard, treat this article as a first metric spec.

humanoid evaluation lab

The architecture you need to evaluate

A whole-body VLA is not the same as a robot-arm policy that maps images to wrist poses. In a humanoid, high-level decisions and body stability run at different frequencies. WholeBodyVLA describes the runtime as follows: egocentric images and language instructions are encoded into latent action tokens; the action decoder emits dual-arm joint actions and locomotion commands at roughly 10 Hz; the LMO RL policy consumes proprioceptive state and runs at 50 Hz to execute locomotion commands. The paper also describes a split deployment: the VLA runs on an RTX 4090 workstation, the RL policy runs on an onboard computer, and communication happens through ZeroMQ over Ethernet.

For beginners, model the system as three loops:

camera + language
      |
      v
VLM / LAM action decoder (~10 Hz)
      |        \
      |         -> dual-arm joint actions
      v
locomotion command: forward, sidestep, turn, squat
      |
      v
LMO RL policy on proprioception (50 Hz)
      |
      v
whole-body robot + WBC / PD / safety layer

The evaluation table must separate failures from these loops. If the task fails because the robot misunderstood the instruction, it is a VLA or perception problem. If it understood the task but sidestepped to the wrong placement position, the locomotion command or LMO layer is suspect. If the legs are stable until the arms push the box and disturb the torso, the failure is in whole-body coupling. If the episode is rejected because camera timestamps drifted, the problem is data infrastructure, not the model.

Many teams also want to compare or replace their low-level stack with GR00T WholeBodyControl. The GR00T WBC documentation describes a humanoid controller platform, including decoupled WBC used in Isaac GR00T / GR00T N1.x and workflows for teleoperation, VLA data collection, and inference. For Unitree G1, the repository exposes two notable ONNX policies:

GR00T-WholeBodyControl-Balance.onnx
GR00T-WholeBodyControl-Walk.onnx

These files do not make AgiBot X2 equivalent to G1, and they do not directly replace WholeBodyVLA's LMO if the morphology is different. They do show a practical deployment pattern: the low-level controller should be a stable artifact, loadable at runtime, measurable for latency, and split into clear balance/walk modes so safety testing is manageable.

Why evaluation must come before 20-operator scale

With two operators, you can still watch episodes manually. With 20 operators, bad data scales quickly: new operators move too fast, prompts drift, scenes are reset inconsistently, the robot heats up, camera mounts shift, and some operators accumulate high discard rates without anyone noticing during the day.

The goal of evaluation is not to rank people. The goal is to keep the system learning from better data each week. A useful dashboard answers six questions:

Question	Main metric	Who uses it
Did the robot complete the task?	Task success	Research lead, model owner
Did the robot fall or become unstable?	Balance failures	Safety owner, WBC owner
How often did humans intervene?	Intervention rate	Ops lead, trainer
Was the control loop late?	Latency	Infra, realtime owner
Is the episode clean enough for training?	Data acceptance	Data QA, training owner
Which benchmark is failing?	Task breakdown	Product, research, ops

If one metric is missing, scale becomes misleading. Task success may increase while intervention rate also increases, which means operators are rescuing the robot more often. Data acceptance may look high while latency jitter worsens, meaning the dataset is saved but temporally inconsistent for training.

KPI table for 20 operators

The table below is a good first baseline when moving from a pilot to a 20-operator team. Each operator runs three tasks: bag packing, box loading, and cart pushing. During evaluation, start with 10 trials per operator per task per day before increasing production data collection.

Metric	Definition	Suggested formula	Week 1 threshold
Task success	Final task criteria met without a safety violation	`success_trials / total_trials`	Track by task, not only globally
Subgoal success	Steps such as grasp, sidestep, squat, place	`subgoal_pass / total_trials`	At least two subgoals per task
Balance failure	Fall, emergency stop from tilt, center-of-mass or IMU violation	`balance_fail_trials / total_trials`	As close to zero as possible
Intervention rate	Pause, reset, manual override, or human rescue	`interventions / minute` or `trials_with_intervention / trials`	Should fall each week
Command latency	From camera/state frame to command received by robot	p50, p95, p99	Separate 10 Hz and 50 Hz loops
Loop miss rate	VLA or LMO missed its deadline	`missed_ticks / expected_ticks`	Stricter for LMO than VLA
Data acceptance	Episode approved for training	`accepted_episodes / recorded_episodes`	Do not optimize blindly
Discard reason	Why the episode was rejected	Fixed taxonomy	Required
Operator throughput	Accepted episodes per hour	`accepted / active_hours`	Compare with team median
Recovery time	Time from failure to next trial	median, p90	Used for ops improvement

A minimal CSV schema:

date,operator_id,robot_id,task,trial_id,success,subgoal_1,subgoal_2,balance_failure,interventions,accepted,discard_reason,vla_p95_ms,lmo_p95_ms,total_seconds
2026-06-10,op_03,x2_01,bag_packing,trial_004,true,true,true,false,0,true,,118,8,42.6
2026-06-10,op_03,x2_01,box_loading,trial_005,false,true,false,false,1,false,place_miss_after_turn,126,9,51.2
2026-06-10,op_11,x2_01,cart_pushing,trial_002,false,true,false,true,2,false,balance_push_load,135,11,28.4

Do not start with an elaborate dashboard. Start with clean event rows. If every row has task, operator, trial, success, failure mode, and latency, you already have enough information to decide whether next week should focus on operator training, controller fixes, or more data collection.

Defining success for the three benchmarks

WholeBodyVLA uses three main task suites: Bag Packing, Box Loading, and Cart Pushing. The project page describes sequences involving bimanual grasping, sidestepping, squatting, grasping and lifting a box, turning toward a cart, grasping a cart handle, pushing forward, and pushing a load above 50 kg. The appendix also discusses object and load variations: replacing a bag with a different bag, replacing box contents with plastic containers, and replacing the carton on the cart with 60 kg barbell plates.

You do not need to copy the entire benchmark on day one, but you should preserve the key idea: every task must force the robot to coordinate legs, arms, torso, and balance.

Task	Subgoal 1	Subgoal 2	Final success	Failure modes to log
Bag Packing	Grasp bag with two hands or the configured main hand	Sidestep and squat toward carton	Bag is inside the carton and robot remains stable	miss grasp, bag drop, squat drift, place outside
Box Loading	Squat and grasp box	Rise and turn toward cart	Box is placed on the cart target region	box tilt, turn overshoot, hand slip, torso lean
Cart Pushing	Grasp handle	Push ahead along path	Cart moves the target distance without severe robot drift	handle miss, lateral drift, foot slip, balance fail

For beginners, always separate task_success from subgoal_success. A box-loading trial may fail at placement while still proving that squat and grasp improved. If you only store final success, the model owner cannot tell whether to fix perception, action decoding, or LMO.

A readable benchmark YAML:

benchmark:
  name: humanoid_wholebody_vla_eval_v1
  robot: agibot_x2
  trials_per_operator_per_task: 10
  operators: 20
  tasks:
    bag_packing:
      subgoals: [grasp_bag, move_and_squat, place_in_carton]
      success:
        object_in_target: true
        no_balance_failure: true
        max_interventions: 0
    box_loading:
      subgoals: [squat_and_grasp, rise_and_turn, place_on_cart]
      success:
        box_on_cart: true
        no_balance_failure: true
        max_interventions: 0
    cart_pushing:
      subgoals: [grab_handle, push_ahead]
      success:
        cart_distance_m: 2.0
        lateral_drift_m_max: 0.35
        no_balance_failure: true

Latency: measure 10 Hz and 50 Hz separately

WholeBodyVLA gives us an important systems lesson: the VLA and LMO do not run at the same frequency. The VLA at roughly 10 Hz can tolerate slower inference because it handles perception, reasoning, and high-level command generation. The 50 Hz LMO loop cannot tolerate much jitter because it maintains locomotion and balance on proprioception.

Your latency table should therefore have at least two layers:

Layer	Frequency	What to measure	Why
VLA decode	About 10 Hz	camera timestamp -> latent/action decoded -> command sent	High delay makes commands stale
LMO/WBC loop	50 Hz	proprio read -> policy inference -> action applied	Jitter damages gait and balance
Network bridge	Stack-dependent	command sent -> command received	ZeroMQ, ROS 2, or Ethernet can spike
Data recorder	Camera-dependent	frame saved -> metadata indexed	Recorder lag can corrupt data even if control works

Minimal instrumentation:

from time import monotonic_ns

def now_ms():
    return monotonic_ns() / 1_000_000

event = {
    "trial_id": trial_id,
    "camera_frame_ts_ms": frame.ts_ms,
    "vla_decode_start_ms": now_ms(),
}

latent_action = vla.decode(frame, instruction)
event["vla_decode_end_ms"] = now_ms()

command = action_decoder.to_robot_command(latent_action)
event["command_sent_ms"] = now_ms()
command_bus.send(command)

# On the low-level side:
low_event = {
    "command_received_ms": now_ms(),
    "proprio_ts_ms": proprio.ts_ms,
    "lmo_start_ms": now_ms(),
}
action = lmo_policy(proprio, command)
low_event["lmo_end_ms"] = now_ms()
robot.apply(action)
low_event["action_applied_ms"] = now_ms()

Compute p50, p95, and p99 by task and by operator. If VLA p95 rises on cart pushing, the visual scene may be more complex or the prompt may trigger longer decoding. If LMO p95 rises, debug onboard compute, thread scheduling, memory allocation, logging, or network receive. Do not compress both loops into one average number.

Balance failures and intervention rate

Balance failure must be defined strictly. If every operator has a different definition of "almost fell", the dashboard becomes useless. Start with four labels:

Label	Condition
`fall`	Robot touches the ground or needs physical reset
`estop_balance`	Emergency stop caused by tilt, slip, or unstable motion
`operator_catch`	Human intervenes to prevent a fall or collision
`controller_abort`	Safety layer stops because state exceeds a threshold

Interventions also need a taxonomy:

intervention_types:
  pause: operator pauses policy but keeps trial alive
  manual_override: operator takes over motion
  scene_fix: human fixes object/scene during trial
  safety_stop: emergency stop or safety abort
  prompt_correction: instruction changed after trial start

For training reports, a trial with intervention should usually not count as clean success. Store two columns:

Column	Meaning
`task_success_raw`	Final task outcome succeeded, even with intervention
`task_success_clean`	Final task outcome succeeded without forbidden intervention

When scaling to 20 operators, task_success_clean is the model metric. task_success_raw is useful for operations because it shows whether operators can rescue the task, but it can mislead checkpoint selection.

Data acceptance: which episodes train the model?

Data acceptance is not identical to task success. A failed episode can still be useful for failure analysis if it has clean labels. But the main behavior cloning or VLA dataset should prioritize successful episodes with correct timestamps, correct prompts, no intervention, and no sensor drops.

Acceptance checklist:

Check	Accept if	Reject if
Prompt	Matches task and objects	Changed mid-trial or too vague
Camera	Enough frames, no severe exposure loss	Long drop, heavy blur during critical action
State/action	Correct shape, monotonic timestamps	NaN, wrong joint order, unexplained action spikes
Latency	Inside budget	p99 exceeds budget during critical phase
Safety	No fall or emergency stop	Fall or human catch
Outcome	Success or clearly labeled failure	Failure reason unknown

A simple rule engine:

def accept_episode(ep):
    if ep.prompt_changed:
        return False, "prompt_changed"
    if ep.camera_drop_ms_max > 250:
        return False, "camera_drop"
    if ep.state_has_nan or ep.action_has_nan:
        return False, "bad_state_action"
    if ep.balance_failure:
        return False, "balance_failure"
    if ep.interventions > 0:
        return False, "intervention"
    if ep.vla_latency_p99_ms > 250:
        return False, "vla_latency"
    if ep.lmo_latency_p99_ms > 20:
        return False, "lmo_latency"
    if not ep.task_success:
        return False, "task_fail"
    return True, ""

These thresholds are starting points, not universal constants. In a real stack, derive budgets from your pilot distribution and hardware. The essential rule is that every rejection has a fixed reason, so the weekly review can tell whether 38% of rejects came from operator behavior, cameras, the model, or the controller.

Operating schedule for 20 operators

Do not let 20 people collect production data on day one. Split the rollout into three rounds:

Round	People	Goal	Gate to pass
Calibration	4	Compare new operators with the two-person pilot	Correct prompts and intervention labels
Eval batch	20	Everyone runs the same small benchmark	Success, balance, latency, and data acceptance table exists
Production	20	Collect data for priority tasks	Only approved operators and tasks are opened

Suggested gates:

operator_gate:
  min_trials: 30
  prompt_error_rate_max: 0.02
  unsafe_intervention_rate_max: 0.05
  accepted_episode_rate_min: 0.70
  median_reset_seconds_max: 90

robot_gate:
  balance_failure_rate_max: 0.02
  lmo_loop_miss_rate_max: 0.01
  no_unexplained_estop: true

model_gate:
  clean_success_rate_min_by_task:
    bag_packing: 0.60
    box_loading: 0.45
    cart_pushing: 0.50

These numbers are not academic benchmarks. They are operating guardrails. If week one does not pass, do not collect more of the same data. Fix the largest failure mode first. Scaling bad data only increases training, QA, and debugging cost.

What the first dashboard should show

A good dashboard does not have to be beautiful, but it must support decisions. I usually start with four tables:

Table	Required columns	Decision supported
Operator scorecard	operator, trials, clean success, interventions, accepted rate, median reset	Who needs retraining and who can enter production
Task scorecard	task, subgoal success, balance fail, latency p95, top discard reason	Which task is bottlenecked
Latency health	VLA p50/p95/p99, LMO p50/p95/p99, loop miss	Whether infra or realtime work is needed
Data QA	accepted, rejected, reject reason, camera drop, prompt error	Whether the dataset is trainable

Operator scorecard SQL:

select
  operator_id,
  count(*) as trials,
  avg(case when task_success_clean then 1 else 0 end) as clean_success_rate,
  avg(interventions) as interventions_per_trial,
  avg(case when accepted then 1 else 0 end) as accepted_rate,
  percentile_cont(0.5) within group (order by reset_seconds) as median_reset_seconds
from eval_trials
where date = '2026-06-10'
group by operator_id
order by accepted_rate desc;

Failure reason SQL:

select
  task,
  discard_reason,
  count(*) as n
from eval_trials
where accepted = false
group by task, discard_reason
order by n desc;

If cart_pushing has many balance_push_load rejects, prioritize WBC/LMO and force interaction. If bag_packing has many place_outside rejects, inspect perception, hand calibration, or the action decoder. If every task has camera_drop, do not fine-tune the model yet; fix the recorder.

Conclusion

This final article moves the series from "we can collect data" to "we can tell whether the robot and data are good." WholeBodyVLA gives a clear technical blueprint: LAM learns latent action tokens, the VLA decodes around 10 Hz, LMO runs at 50 Hz, and benchmarks must force the robot to walk, turn, squat, grasp, place, and push. GR00T WBC gives another deployment blueprint: low-level controllers need explicit artifacts such as ONNX balance/walk policies, measurable runtime behavior, and safe test modes.

When scaling to 20 operators, do not count episodes alone. Track clean task success, balance failures, intervention rate, latency, data acceptance, and breakdowns across bag packing, box loading, and cart pushing. If those six metrics are stable, you can increase data collection with less risk. If they disagree, trust the failure table more than the feeling that the robot "seems to work."

What this article gives you

Outside this series, Training Trackers and LMO RL and VLA/WBC Repository Landscape provide useful background. If you are building an internal dashboard, treat this article as a first metric spec.

humanoid evaluation lab

The architecture you need to evaluate

For beginners, model the system as three loops:

camera + language
      |
      v
VLM / LAM action decoder (~10 Hz)
      |        \
      |         -> dual-arm joint actions
      v
locomotion command: forward, sidestep, turn, squat
      |
      v
LMO RL policy on proprioception (50 Hz)
      |
      v
whole-body robot + WBC / PD / safety layer

GR00T-WholeBodyControl-Balance.onnx
GR00T-WholeBodyControl-Walk.onnx

Why evaluation must come before 20-operator scale

The goal of evaluation is not to rank people. The goal is to keep the system learning from better data each week. A useful dashboard answers six questions:

Question	Main metric	Who uses it
Did the robot complete the task?	Task success	Research lead, model owner
Did the robot fall or become unstable?	Balance failures	Safety owner, WBC owner
How often did humans intervene?	Intervention rate	Ops lead, trainer
Was the control loop late?	Latency	Infra, realtime owner
Is the episode clean enough for training?	Data acceptance	Data QA, training owner
Which benchmark is failing?	Task breakdown	Product, research, ops

KPI table for 20 operators

Metric	Definition	Suggested formula	Week 1 threshold
Task success	Final task criteria met without a safety violation	`success_trials / total_trials`	Track by task, not only globally
Subgoal success	Steps such as grasp, sidestep, squat, place	`subgoal_pass / total_trials`	At least two subgoals per task
Balance failure	Fall, emergency stop from tilt, center-of-mass or IMU violation	`balance_fail_trials / total_trials`	As close to zero as possible
Intervention rate	Pause, reset, manual override, or human rescue	`interventions / minute` or `trials_with_intervention / trials`	Should fall each week
Command latency	From camera/state frame to command received by robot	p50, p95, p99	Separate 10 Hz and 50 Hz loops
Loop miss rate	VLA or LMO missed its deadline	`missed_ticks / expected_ticks`	Stricter for LMO than VLA
Data acceptance	Episode approved for training	`accepted_episodes / recorded_episodes`	Do not optimize blindly
Discard reason	Why the episode was rejected	Fixed taxonomy	Required
Operator throughput	Accepted episodes per hour	`accepted / active_hours`	Compare with team median
Recovery time	Time from failure to next trial	median, p90	Used for ops improvement

A minimal CSV schema:

date,operator_id,robot_id,task,trial_id,success,subgoal_1,subgoal_2,balance_failure,interventions,accepted,discard_reason,vla_p95_ms,lmo_p95_ms,total_seconds
2026-06-10,op_03,x2_01,bag_packing,trial_004,true,true,true,false,0,true,,118,8,42.6
2026-06-10,op_03,x2_01,box_loading,trial_005,false,true,false,false,1,false,place_miss_after_turn,126,9,51.2
2026-06-10,op_11,x2_01,cart_pushing,trial_002,false,true,false,true,2,false,balance_push_load,135,11,28.4

Defining success for the three benchmarks

You do not need to copy the entire benchmark on day one, but you should preserve the key idea: every task must force the robot to coordinate legs, arms, torso, and balance.

Task	Subgoal 1	Subgoal 2	Final success	Failure modes to log
Bag Packing	Grasp bag with two hands or the configured main hand	Sidestep and squat toward carton	Bag is inside the carton and robot remains stable	miss grasp, bag drop, squat drift, place outside
Box Loading	Squat and grasp box	Rise and turn toward cart	Box is placed on the cart target region	box tilt, turn overshoot, hand slip, torso lean
Cart Pushing	Grasp handle	Push ahead along path	Cart moves the target distance without severe robot drift	handle miss, lateral drift, foot slip, balance fail

A readable benchmark YAML:

benchmark:
  name: humanoid_wholebody_vla_eval_v1
  robot: agibot_x2
  trials_per_operator_per_task: 10
  operators: 20
  tasks:
    bag_packing:
      subgoals: [grasp_bag, move_and_squat, place_in_carton]
      success:
        object_in_target: true
        no_balance_failure: true
        max_interventions: 0
    box_loading:
      subgoals: [squat_and_grasp, rise_and_turn, place_on_cart]
      success:
        box_on_cart: true
        no_balance_failure: true
        max_interventions: 0
    cart_pushing:
      subgoals: [grab_handle, push_ahead]
      success:
        cart_distance_m: 2.0
        lateral_drift_m_max: 0.35
        no_balance_failure: true

Latency: measure 10 Hz and 50 Hz separately

Your latency table should therefore have at least two layers:

Layer	Frequency	What to measure	Why
VLA decode	About 10 Hz	camera timestamp -> latent/action decoded -> command sent	High delay makes commands stale
LMO/WBC loop	50 Hz	proprio read -> policy inference -> action applied	Jitter damages gait and balance
Network bridge	Stack-dependent	command sent -> command received	ZeroMQ, ROS 2, or Ethernet can spike
Data recorder	Camera-dependent	frame saved -> metadata indexed	Recorder lag can corrupt data even if control works

Minimal instrumentation:

from time import monotonic_ns

def now_ms():
    return monotonic_ns() / 1_000_000

event = {
    "trial_id": trial_id,
    "camera_frame_ts_ms": frame.ts_ms,
    "vla_decode_start_ms": now_ms(),
}

latent_action = vla.decode(frame, instruction)
event["vla_decode_end_ms"] = now_ms()

command = action_decoder.to_robot_command(latent_action)
event["command_sent_ms"] = now_ms()
command_bus.send(command)

# On the low-level side:
low_event = {
    "command_received_ms": now_ms(),
    "proprio_ts_ms": proprio.ts_ms,
    "lmo_start_ms": now_ms(),
}
action = lmo_policy(proprio, command)
low_event["lmo_end_ms"] = now_ms()
robot.apply(action)
low_event["action_applied_ms"] = now_ms()

Balance failures and intervention rate

Balance failure must be defined strictly. If every operator has a different definition of "almost fell", the dashboard becomes useless. Start with four labels:

Label	Condition
`fall`	Robot touches the ground or needs physical reset
`estop_balance`	Emergency stop caused by tilt, slip, or unstable motion
`operator_catch`	Human intervenes to prevent a fall or collision
`controller_abort`	Safety layer stops because state exceeds a threshold

Interventions also need a taxonomy:

intervention_types:
  pause: operator pauses policy but keeps trial alive
  manual_override: operator takes over motion
  scene_fix: human fixes object/scene during trial
  safety_stop: emergency stop or safety abort
  prompt_correction: instruction changed after trial start

For training reports, a trial with intervention should usually not count as clean success. Store two columns:

Column	Meaning
`task_success_raw`	Final task outcome succeeded, even with intervention
`task_success_clean`	Final task outcome succeeded without forbidden intervention

Data acceptance: which episodes train the model?

Acceptance checklist:

Check	Accept if	Reject if
Prompt	Matches task and objects	Changed mid-trial or too vague
Camera	Enough frames, no severe exposure loss	Long drop, heavy blur during critical action
State/action	Correct shape, monotonic timestamps	NaN, wrong joint order, unexplained action spikes
Latency	Inside budget	p99 exceeds budget during critical phase
Safety	No fall or emergency stop	Fall or human catch
Outcome	Success or clearly labeled failure	Failure reason unknown

A simple rule engine:

def accept_episode(ep):
    if ep.prompt_changed:
        return False, "prompt_changed"
    if ep.camera_drop_ms_max > 250:
        return False, "camera_drop"
    if ep.state_has_nan or ep.action_has_nan:
        return False, "bad_state_action"
    if ep.balance_failure:
        return False, "balance_failure"
    if ep.interventions > 0:
        return False, "intervention"
    if ep.vla_latency_p99_ms > 250:
        return False, "vla_latency"
    if ep.lmo_latency_p99_ms > 20:
        return False, "lmo_latency"
    if not ep.task_success:
        return False, "task_fail"
    return True, ""

Operating schedule for 20 operators

Do not let 20 people collect production data on day one. Split the rollout into three rounds:

Round	People	Goal	Gate to pass
Calibration	4	Compare new operators with the two-person pilot	Correct prompts and intervention labels
Eval batch	20	Everyone runs the same small benchmark	Success, balance, latency, and data acceptance table exists
Production	20	Collect data for priority tasks	Only approved operators and tasks are opened

Suggested gates:

operator_gate:
  min_trials: 30
  prompt_error_rate_max: 0.02
  unsafe_intervention_rate_max: 0.05
  accepted_episode_rate_min: 0.70
  median_reset_seconds_max: 90

robot_gate:
  balance_failure_rate_max: 0.02
  lmo_loop_miss_rate_max: 0.01
  no_unexplained_estop: true

model_gate:
  clean_success_rate_min_by_task:
    bag_packing: 0.60
    box_loading: 0.45
    cart_pushing: 0.50

What the first dashboard should show

A good dashboard does not have to be beautiful, but it must support decisions. I usually start with four tables:

Table	Required columns	Decision supported
Operator scorecard	operator, trials, clean success, interventions, accepted rate, median reset	Who needs retraining and who can enter production
Task scorecard	task, subgoal success, balance fail, latency p95, top discard reason	Which task is bottlenecked
Latency health	VLA p50/p95/p99, LMO p50/p95/p99, loop miss	Whether infra or realtime work is needed
Data QA	accepted, rejected, reject reason, camera drop, prompt error	Whether the dataset is trainable

Operator scorecard SQL:

select
  operator_id,
  count(*) as trials,
  avg(case when task_success_clean then 1 else 0 end) as clean_success_rate,
  avg(interventions) as interventions_per_trial,
  avg(case when accepted then 1 else 0 end) as accepted_rate,
  percentile_cont(0.5) within group (order by reset_seconds) as median_reset_seconds
from eval_trials
where date = '2026-06-10'
group by operator_id
order by accepted_rate desc;

Failure reason SQL:

select
  task,
  discard_reason,
  count(*) as n
from eval_trials
where accepted = false
group by task, discard_reason
order by n desc;

Scaling to 20 Operators and VLA Evaluation

What this article gives you

The architecture you need to evaluate

Why evaluation must come before 20-operator scale

KPI table for 20 operators

Defining success for the three benchmarks

Latency: measure 10 Hz and 50 Hz separately

Balance failures and intervention rate

Data acceptance: which episodes train the model?

Operating schedule for 20 operators

What the first dashboard should show

Conclusion

Nguyễn Anh Tuấn

Related Posts

Đánh giá VLA toàn thân

Chọn teleoperation stack cho humanoid

Pilot 2 người cho dữ liệu humanoid VLA

Scaling to 20 Operators and VLA Evaluation

What this article gives you

The architecture you need to evaluate

Why evaluation must come before 20-operator scale

KPI table for 20 operators

Defining success for the three benchmarks

Latency: measure 10 Hz and 50 Hz separately

Balance failures and intervention rate

Data acceptance: which episodes train the model?

Operating schedule for 20 operators

What the first dashboard should show

Conclusion

Nguyễn Anh Tuấn

Related Posts

Đánh giá VLA toàn thân

Chọn teleoperation stack cho humanoid

Pilot 2 người cho dữ liệu humanoid VLA