Why Part 5 is about scaling laws
The first four posts in this series covered the ownership map, teleoperation data collection, human video mining, and synthetic pipelines from simulation to reality. If you are joining here, start with Part 1: the humanoid data war landscape, Part 2: teleoperation data, and Part 4: synthetic data pipelines. Part 5 answers a very practical question: when does collecting more demonstrations make the robot better, and when are you just burning money?
In language models, scaling laws are usually discussed in terms of tokens, parameters, and compute. Robotics is different. A text token is cheap, easy to copy, and never breaks a robot arm. A robot demonstration requires an operator, hardware time, cameras, safety checks, scene resets, logging, labels, and often re-collection because the gripper slipped or the timestamp was wrong. A humanoid team cannot casually say "collect another million demos" in the same way a web-scale NLP team says "crawl another trillion tokens."
The paper Data Scaling Laws in Imitation Learning for Robotic Manipulation by Lin et al. is one of the most useful papers for this exact question. The authors did not merely train in simulation and plot a clean curve. They collected more than 40,000 demonstrations, executed more than 15,000 real-world robot rollouts, and studied three data axes: number of environments, number of objects, and number of demonstrations. The core conclusion is simple and important:
Policy generalization improves approximately as a power law when environment and object diversity increase, but there is no clear power law when you only add more demonstrations in the same setup.
In plain terms: a robot learns more from seeing new tables, lighting conditions, backgrounds, objects, poses, clutter, and manipulation variations. It learns far less from seeing the same table, same cup, same camera angle, and same operator another 500 times.
What is a robot scaling law?
A scaling law is a relationship that roughly looks like this:
error ≈ A * N^(-alpha) + irreducible_error
Here, N may mean the number of objects, environments, environment-object pairs, tasks, trajectories, or action tokens. A larger alpha means additional data is more useful. irreducible_error is the part that will not disappear just because you collect more of the same data: bad perception, missing force sensing, weak control, sensor latency, a poor action space, or an evaluation task that requires feedback not present in the dataset.
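To make the formula concrete, here is a minimal sketch that evaluates it for a few values of N. The constants `A`, `ALPHA`, and `FLOOR` are hypothetical and chosen only to show the shape of the curve; they are not fitted to any paper.

```python
def predicted_error(n: float, a: float, alpha: float, floor: float) -> float:
    """Evaluate error ≈ A * N^(-alpha) + irreducible_error."""
    if n <= 0:
        raise ValueError("n must be positive")
    return a * n ** (-alpha) + floor


# Hypothetical constants, for illustration only.
A, ALPHA, FLOOR = 0.8, 0.5, 0.05

for n in (4, 8, 16, 32):
    print(n, round(predicted_error(n, A, ALPHA, FLOOR), 3))

# Doubling N multiplies the reducible part by 2 ** (-alpha), whatever N is.
reducible_8 = predicted_error(8, A, ALPHA, FLOOR) - FLOOR
reducible_16 = predicted_error(16, A, ALPHA, FLOOR) - FLOOR
shrink_factor = reducible_16 / reducible_8
```

The useful habit is to separate the reducible term from the floor: if the floor dominates, more data along that axis will not help, no matter how large alpha is.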
For beginners, keep this mental model:
Add a new setup:
    new table, new lighting, new object, slightly shifted camera
    -> the model learns invariance and robust behavior
Add repeated demos in the old setup:
    same table, same object, same camera, same operator
    -> the model reduces noise at first, then quickly plateaus
In Lin et al., the key metric is not a pretty validation loss. The policies are evaluated on unseen real environments and unseen real objects. This matters because imitation learning can be misleading if you only look at action MSE. A predicted action can be close to a demonstration yet still fail after contact, slippage, delay, or accumulated error. The paper also notes that validation MSE can help debugging, but it is not always reliable as a proxy for real rollout success.
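Because rollout success is the metric that matters, it helps to remember how coarse a success rate is when it comes from a few dozen trials. The sketch below computes a Wilson score interval for a success rate. The Wilson interval is my choice of statistic here, not something taken from the paper, and the 34-of-40 example is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical evaluation: 34 successful rollouts out of 40.
low, high = wilson_interval(34, 40)
print(f"success = {34 / 40:.3f}, interval = ({low:.3f}, {high:.3f})")
```

If two policies have overlapping intervals, a difference of a few percentage points in rollout success is probably not worth a data-collection decision yet.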
Lin et al. 2024: experiment design
The paper studies single-task manipulation, not a generalist VLA trained over thousands of tasks. The authors use two main tasks to fit the scaling curves:
| Task | What the robot must do | Why it is useful for scaling |
|---|---|---|
| Pour Water | Grasp a bottle, pour water, and place it back | Requires grasping, pose control, tilting, and sensitivity to object and scene variation |
| Mouse Arrangement | Arrange a computer mouse into a target position | Requires spatial alignment, light contact, and accurate placement |
They then validate the resulting data collection strategy on two additional tasks:
| Validation task | What it tests |
|---|---|
| Fold Towels | Softer manipulation, grasping an edge and folding |
| Unplug Charger | Correct grasping and fast pulling from a power strip |
Data is collected with UMI hand-held grippers, policies are trained with Diffusion Policy, and evaluation is zero-shot in 8 unseen environments with unseen objects. The experimental axes are clean:
| Scaling axis | Collection setup | Question |
|---|---|---|
| Object generalization | 32 objects in one environment | Does adding objects help with unseen objects? |
| Environment generalization | 32 environments with one object | Does adding environments help with unseen places? |
| Environment + object | 32 environment-object pairs, one unique object per environment | What happens when both scene and object change? |
| Demo count | Diversity fixed, demonstrations increased | Does collecting more demos in the same setup still help? |
The key result: as the number of objects or environments increases, performance on unseen objects/environments rises approximately as a power law. When only the number of demonstrations increases under fixed diversity, performance improves early and then plateaus.
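One common way to check for a power-law trend in your own data is a straight-line fit in log-log space. The sketch below does that with plain least squares. The failure rates are hypothetical, not numbers from the paper, and the fit ignores the irreducible-error floor, so treat it as a first look rather than the authors' exact fitting procedure.

```python
import math


def fit_power_law(ns, errors):
    """Fit log(error) = log(a) - alpha * log(n) by least squares; return (a, alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical failure rates (1 - success) on unseen environments.
num_environments = [1, 2, 4, 8, 16, 32]
failure_rate = [0.90, 0.72, 0.55, 0.41, 0.30, 0.22]

a, alpha = fit_power_law(num_environments, failure_rate)
print(f"error ≈ {a:.2f} * N^(-{alpha:.2f})")
```

If the points on the demo-count axis bend away from a straight line in log-log space while the diversity axes stay straight, you are seeing the same pattern the paper reports.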
The number to remember: 32 pairs, 50 demos each
One of the most practical contributions of the paper is its recommended data collection strategy. For tasks of similar difficulty to Pour Water and Mouse Arrangement, the authors recommend:
Environment-object pairs: about 32
Demonstrations per pair: about 50
Total demonstrations: about 1,600
Evaluation: 8 unseen environments, 2 unseen objects per environment
Expected result: roughly 85-92.5% success depending on the task in the paper
Their reported average success rates:
| Task | Average success rate |
|---|---|
| Pour Water | 85.0% |
| Mouse Arrangement | 92.5% |
| Fold Towels | 87.5% |
| Unplug Charger | 90.0% |
This does not mean 1,600 demos is a magic number for every humanoid task. It means that for tasks of similar complexity, with similar sensors, actions, and policy class, 32 diverse setups with 50 demos per setup is far more valuable than 1 setup with 1,600 demos.
The diminishing-returns threshold is also clear. In their demonstration-count study, performance plateaued around 800 total demos in one maximum-data setting. When analyzed by the number of environment-object pairs, the saturation points were:
| Environment-object pairs | Total demos near plateau | Approx. demos per pair |
|---|---|---|
| 8 | 400 | 50 |
| 16 | 800 | 50 |
| 32 | 1,600 | 50 |
This is a budget-changing result. If you already have 50 good demonstrations for the same table, same object, and same camera, collecting 200 more demos in that exact setup may only make the spreadsheet look better. The same operator time is usually better spent opening new environments or new objects.
Why raw demo count is misleading
"We have 100,000 demonstrations" sounds impressive. For robot learning, the better questions are:
- How many objects do those demos cover?
- How many rooms, tables, factory cells, shelves, and lighting conditions?
- How many camera poses?
- How many operator styles?
- How many initial states?
- How many recovery cases?
- How many task families?
- How many robot embodiments?
If 100,000 demos are all one operator doing pick-and-place in one lab, they can be less useful than 10,000 demos collected across 200 real scenes, with many objects, clutter patterns, lighting conditions, and reset states. A policy is not just learning a hand trajectory. It is learning a mapping from image + language + robot state to action. If the visual distribution is too narrow, the model can learn shortcuts: the cup is always on the left, the table is always white, the camera is always fixed, and the gripper always approaches from the same direction.
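Answering those questions is mostly a counting exercise over episode metadata. Here is a minimal sketch, assuming each episode carries a small metadata record. The field names (`environment`, `object`, `operator`) and the sample episodes are hypothetical, not a standard schema.

```python
from collections import Counter

# Hypothetical episode metadata; field names are assumptions.
episodes = [
    {"environment": "lab_table", "object": "plastic_cup", "operator": "op_a"},
    {"environment": "lab_table", "object": "plastic_cup", "operator": "op_a"},
    {"environment": "lab_table", "object": "paper_cup", "operator": "op_a"},
    {"environment": "office_desk", "object": "steel_cup", "operator": "op_b"},
]


def coverage_report(episodes, axes):
    """For each axis, count unique values and the share held by the most common one."""
    report = {}
    for axis in axes:
        counts = Counter(ep[axis] for ep in episodes)
        top_value, top_count = counts.most_common(1)[0]
        report[axis] = {
            "unique": len(counts),
            "top_value": top_value,
            "top_share": top_count / len(episodes),
        }
    return report


report = coverage_report(episodes, ["environment", "object", "operator"])
for axis, stats in report.items():
    print(axis, stats)
```

A high `top_share` on any axis is the warning sign: most of the episode count is repetition of one value.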
A simple way to audit a dataset is to score effective diversity rather than episode count:
effective_dataset_score = (
    0.30 * unique_environments +
    0.25 * unique_objects +
    0.15 * unique_initial_states +
    0.10 * camera_pose_bins +
    0.10 * operator_style_bins +
    0.10 * recovery_or_failure_modes
)
This is not a formula from the paper. It is an operational heuristic. The goal is to force the team to look at meaningful variation. If the score barely increases while episode count grows rapidly, you are collecting repetition, not useful coverage.
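As a runnable version of the heuristic above, the sketch below computes the score from a dictionary of counts and compares two ways of spending the same effort. All counts are hypothetical.

```python
# Weights copied from the heuristic above.
WEIGHTS = {
    "unique_environments": 0.30,
    "unique_objects": 0.25,
    "unique_initial_states": 0.15,
    "camera_pose_bins": 0.10,
    "operator_style_bins": 0.10,
    "recovery_or_failure_modes": 0.10,
}


def effective_dataset_score(counts: dict) -> float:
    """Weighted sum of diversity counts; raises if an axis is missing."""
    missing = set(WEIGHTS) - set(counts)
    if missing:
        raise KeyError(f"missing axes: {sorted(missing)}")
    return sum(WEIGHTS[axis] * counts[axis] for axis in WEIGHTS)


# Hypothetical dataset before any new collection.
before = {
    "unique_environments": 4,
    "unique_objects": 6,
    "unique_initial_states": 20,
    "camera_pose_bins": 2,
    "operator_style_bins": 2,
    "recovery_or_failure_modes": 3,
}
# 1,000 more episodes in the same setups: no diversity count changes.
after_repetition = dict(before)
# The same effort spent on 8 new environments and 8 new objects.
after_new_setups = {**before, "unique_environments": 12, "unique_objects": 14}

score_before = effective_dataset_score(before)
score_repetition = effective_dataset_score(after_repetition)
score_new_setups = effective_dataset_score(after_new_setups)
print(score_before, score_repetition, score_new_setups)
```

Episode count never appears in the score, which is the point: repetition leaves it flat.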
OpenVLA: diversity beats raw scale
OpenVLA shows the same lesson at foundation-model scale. It is a 7B-parameter VLA fine-tuned from a VLM backbone and trained on about 970,000 real robot demonstrations from Open X-Embodiment. What matters is not only the 970k count but also that the data spans many embodiments, tasks, and scenes.
Open X-Embodiment was built from more than 20 robot embodiments and many labs. It is not a perfectly uniform dataset: action spaces differ, camera views differ, skills are uneven, and the mixture must be curated. But that heterogeneity is exactly where much of the generalization signal comes from. The OpenVLA paper explains that the training mixture is filtered and balanced to keep the input/output space coherent while avoiding domination by large but less diverse datasets.
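To see what "avoiding domination" can look like in practice, here is a sketch of one generic heuristic: sample each dataset in proportion to its size raised to an exponent below 1. This is not the OpenVLA mixture recipe, and the dataset names and sizes are hypothetical.

```python
def mixture_weights(episode_counts: dict, exponent: float = 0.5) -> dict:
    """Sampling weights proportional to size ** exponent; exponent=1.0 is size-proportional."""
    if not 0 < exponent <= 1:
        raise ValueError("exponent must be in (0, 1]")
    scaled = {name: count ** exponent for name, count in episode_counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical sub-datasets: one large but narrow, two smaller but more varied.
episode_counts = {
    "single_lab_pick_place": 400_000,
    "multi_kitchen": 40_000,
    "mobile_manipulation": 10_000,
}

proportional = mixture_weights(episode_counts, exponent=1.0)
flattened = mixture_weights(episode_counts, exponent=0.5)
for name in episode_counts:
    print(name, round(proportional[name], 3), round(flattened[name], 3))
```

The exponent is a knob, not a law: lower values give small diverse datasets more say, at the cost of repeating their episodes more often.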
The connection to Lin et al. is direct:
| Lin et al. single-task scaling | OpenVLA generalist scaling |
|---|---|
| More objects/environments produce power-law gains | More embodiments/tasks/scenes improve generalist behavior |
| Repeated demos in the same setup quickly plateau | Large but narrow datasets may need down-weighting or filtering |
| 50 demos per pair is a practical threshold for similar tasks | VLA fine-tuning needs target diversity, not just many episodes |
For a small team using OpenVLA or a similar VLA, the lesson is not "train your own 7B model from scratch." The lesson is: when fine-tuning, you are adding target-domain evidence to a model that already has a large prior. Two hundred diverse demos can be worth more than two thousand nearly identical demos. If your task is "pick electronic parts from a tray," vary the parts, trays, lighting, clutter, camera jitter, starting poses, and operators. Do not simply pick the same component from the same tray slot 1,000 more times.
π0: broad pre-training, clean post-training
π0 from Physical Intelligence pushes scaling in another direction: VLM pre-training, flow matching for action chunks, and multi-embodiment robot data. The paper describes pre-training on more than 10,000 hours of robot data, including 7 robot configurations and 68 tasks, followed by post-training on smaller downstream datasets.
The key point is that π0 separates two roles for data:
| Stage | What the data should provide | Goal |
|---|---|---|
| Pre-training | Broad, multi-task, multi-embodiment, including imperfect behavior | Learn physical priors, affordances, recovery, and semantic grounding |
| Post-training | Clean, consistent, task-specific, fluent | Teach the desired downstream behavior |
This is another expression of diversity scaling. Pre-training needs breadth so the model learns that the physical world has many situations. Post-training needs quality so the model learns the execution style you want. If you only post-train on clean demos, the model can become fluent but brittle. If you only train on broad but messy data, the model may know many things but fail to execute with enough precision.
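The two roles translate into two different filters over the same episode store. The sketch below is one possible way to express that split. The `Episode` fields, the thresholds, and the sample episodes are hypothetical and are not taken from the π0 paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    task: str
    success: bool
    had_intervention: bool
    duration_s: float


def split_pools(episodes, target_task: str, max_duration_s: float):
    """Pre-training keeps everything; post-training keeps clean, fluent target-task demos."""
    pretrain = list(episodes)  # breadth: imperfect behavior and recoveries stay in
    posttrain = [
        ep for ep in episodes
        if ep.task == target_task
        and ep.success
        and not ep.had_intervention
        and ep.duration_s <= max_duration_s
    ]
    return pretrain, posttrain


# Hypothetical episodes.
episodes = [
    Episode("fold_towel", True, False, 22.0),   # clean and fluent
    Episode("fold_towel", True, True, 41.0),    # recovered after an intervention
    Episode("fold_towel", False, False, 35.0),  # failed attempt
    Episode("open_drawer", True, False, 12.0),  # different task
]

pretrain, posttrain = split_pools(episodes, target_task="fold_towel", max_duration_s=30.0)
print(len(pretrain), len(posttrain))
```

The intervention and failure episodes are excluded from post-training but kept for pre-training, which is where recovery behavior is meant to be learned.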
For humanoids, this split is especially important. A two-arm robot doing household work must recover: towels shift, bowls rotate, lids stick, the left hand blocks the camera, or a person walks through the scene. Recovery rarely appears in "perfect" demonstrations because the operator tries to avoid mistakes. Broad pre-training data, interventions, and failure cases therefore become valuable assets. But deployment still needs clean post-training so the final behavior is efficient and stable.
GR00T N1: scaling with real, human, and synthetic data
GR00T N1 from NVIDIA is the clearest humanoid example of the idea that no single data source is enough. N1 is a VLA foundation model for humanoid robots with a dual-system architecture: a vision-language module interprets the scene and instruction, while a diffusion transformer generates motor actions. Its training data is a heterogeneous mixture of real-robot trajectories, human videos, and synthetic datasets.
One of the most important details in the GR00T N1 paper is its synthetic scaling pipeline. Using DexMimicGen, the authors scale a limited set of human demonstrations into 780,000 simulation trajectories, equivalent to about 6,500 hours of human demonstration data, in 11 hours. This is not proof that simulation replaces real data. It is proof that once you have a task schema, simulator, object-centric transformation, and success filter, synthetic data can cover variations that real teleoperation cannot economically cover.
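A quick back-of-the-envelope check shows what those three numbers imply. The derived quantities below are simple ratios of the reported figures, and reading "equivalent hours" as total trajectory duration is my assumption.

```python
# Figures as quoted above.
trajectories = 780_000
equivalent_demo_hours = 6_500
generation_hours = 11

# Derived quantities (ratios only; interpretation is an assumption).
trajectories_per_hour = trajectories / generation_hours
speedup_vs_real_time = equivalent_demo_hours / generation_hours
avg_trajectory_seconds = equivalent_demo_hours * 3600 / trajectories

print(f"{trajectories_per_hour:,.0f} trajectories generated per hour")
print(f"about {speedup_vs_real_time:.0f}x faster than collecting the same hours by hand")
print(f"about {avg_trajectory_seconds:.0f} s per trajectory on average")
```

The throughput is what makes broad coverage of object poses and initial states affordable; it says nothing about how well those trajectories transfer to real contact.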
GR00T N1 still uses real data. The paper combines real robot data, Open X-Embodiment data, human videos, latent actions, synthetic trajectories, and embodiment-specific post-training. In practice, a humanoid data strategy looks like a portfolio:
| Data source | Cost | Real actions? | Scaling value |
|---|---|---|---|
| Real robot teleoperation | Expensive | Yes | Ground truth for embodiment, contact, latency, and sensing |
| Human video | Cheaper at large scale | Not directly | Affordances, task semantics, motion priors |
| Synthetic trajectories | Cheap after setup | Yes, in simulation | Coverage of environments, objects, and initial states |
| Target-robot post-training | Expensive but smaller | Yes | Aligns the final policy to the embodiment and task |
Through the Lin et al. lens, GR00T N1 is applying the same lesson at a larger scale: it is not merely increasing episode count; it is increasing diversity axes. Human video expands activity and object coverage. Synthetic data expands scene and initial-state coverage. Real robot data anchors the model to physical reality. Post-training keeps the final behavior precise.
When is more data a waste of money?
Be suspicious of any "collect more data" plan when you see these symptoms:
| Symptom | What it means | Better action |
|---|---|---|
| High training success, low success in another lab | Environment overfit | Collect new environments, not more of the same lab |
| Failures cluster on unseen objects | Object diversity is missing | Add shape, material, and size variation |
| Failures appear when light or camera changes | Visual distribution is narrow | Randomize lighting, camera pose, and background |
| Failures happen after small gripper slips | Recovery data is missing | Collect interventions, failed attempts, and corrections |
| Validation loss drops but rollout success does not improve | Metric mismatch | Evaluate real rollouts or task scores |
| You already have 50-100 demos per setup and the curve is flat | Diminishing returns | Open a new setup |
A practical rule of thumb:
If you do not yet have 20-30 environment/object pairs:
    prioritize diversity.
If each pair already has 50 clean demos:
    do not automatically collect more for the same pair.
If the policy fails because actions are not smooth:
    inspect data quality, controller, latency, and action representation.
If the policy fails because the scene changed:
    collect more scenes before over-tuning the model.
If the policy fails because the task is new:
    you are missing task-level diversity, not just object-level diversity.
The 20-30 range is not a law of physics. It is a useful operating point when you do not yet have your own scaling curve. Lin et al. showed that 32 pairs worked well for the four tasks in their paper. More dexterous, deformable, force-heavy, or whole-body humanoid tasks will need more. The principle remains: increase diversity before repetition.
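The rule of thumb can be written as a small decision function. This is a sketch: the order of the checks, the failure labels, and the thresholds of 20 pairs and 50 demos are my encoding of the text above, not a validated procedure.

```python
from typing import Optional


def next_data_action(num_pairs: int, min_clean_demos_per_pair: int,
                     dominant_failure: Optional[str] = None) -> str:
    """dominant_failure is one of 'jerky_actions', 'scene_change', 'new_task', or None."""
    # Diagnosed failure causes take priority over count-based rules.
    if dominant_failure == "jerky_actions":
        return "inspect data quality, controller, latency, and action representation"
    if dominant_failure == "scene_change":
        return "collect more scenes before over-tuning the model"
    if dominant_failure == "new_task":
        return "add task-level diversity"
    if num_pairs < 20:
        return "prioritize diversity: open new environment/object pairs"
    if min_clean_demos_per_pair < 50:
        return "top up the thinnest pairs to about 50 clean demos"
    return "do not add repetitions: open a new setup or diversity axis"


print(next_data_action(num_pairs=6, min_clean_demos_per_pair=200))
print(next_data_action(num_pairs=32, min_clean_demos_per_pair=50))
print(next_data_action(num_pairs=32, min_clean_demos_per_pair=50,
                       dominant_failure="scene_change"))
```

Replace the thresholds with values from your own scaling curve as soon as you have one.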
Data budget planner for a small team
Suppose your team has a robot arm or humanoid upper body and wants to train "pick up a cup and place it in a tray" in office-like environments. You have 80 operator hours. A weak plan would be:
1 environment
5 cups
1 tray
1 camera pose
8,000 demonstrations
A better plan:
32 environment-object pairs
50 demonstrations per pair
1,600 total demonstrations
Remaining time:
- evaluate on 8 unseen environments
- collect 200 recovery demos around failure modes
- audit labels and remove bad demos
- add objects if failures correlate with object shape
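To check that the better plan actually fits in 80 operator hours, here is a rough budget sketch. The per-demo time is derived by assuming the weak plan's 8,000 demos would fill the full 80 hours. The 30 minutes of setup per new pair is a hypothetical figure that you should replace with your own measurement.

```python
OPERATOR_HOURS = 80
WEAK_PLAN_DEMOS = 8_000

# Implied by the weak plan if it uses the whole budget: 36 seconds per demo.
seconds_per_demo = OPERATOR_HOURS * 3600 / WEAK_PLAN_DEMOS

PAIRS = 32
DEMOS_PER_PAIR = 50
RECOVERY_DEMOS = 200
SETUP_HOURS_PER_PAIR = 0.5  # hypothetical overhead for moving to a new scene

collection_hours = PAIRS * DEMOS_PER_PAIR * seconds_per_demo / 3600
setup_hours = PAIRS * SETUP_HOURS_PER_PAIR
recovery_hours = RECOVERY_DEMOS * seconds_per_demo / 3600
remaining_hours = OPERATOR_HOURS - collection_hours - setup_hours - recovery_hours

print(f"collection: {collection_hours:.1f} h, setup: {setup_hours:.1f} h, "
      f"recovery: {recovery_hours:.1f} h, left for evaluation and audits: {remaining_hours:.1f} h")
```

Even with generous setup overhead, the diverse plan leaves a large share of the budget for evaluation and cleanup, which the 8,000-demo plan does not.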
You can plan data collection with a table:
| Pair | Environment | Object | Lighting | Clutter | Demo target | Failure notes |
|---|---|---|---|---|---|---|
| 01 | White lab table | Small plastic cup | Bright | Low | 50 | Baseline |
| 02 | Office wood desk | Paper cup | Warm light | Medium | 50 | Glare risk |
| 03 | Metal shelf | Steel cup | Shadows | High | 50 | Reflections |
| 04 | Utility cart | Short cup | Side light | Medium | 50 | Moving surface |
Train and test after every 8-pair batch. Do not wait until all 1,600 demos are collected before discovering that the camera is miscalibrated or gripper commands were logged incorrectly.
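# Pseudocode: every function and flag below is a placeholder for your own tooling.
for num_pairs in [8, 16, 24, 32]:
    policy = train_policy(dataset_pairs=num_pairs)
    score = evaluate_unseen(policy, envs=4, objects=2, trials=5)
    log(num_pairs, score)
    # Stop repeating old setups once unseen-set gains flatten.
    if score_plateaus and failures_are_same_setup:
        stop_collecting_repetitions()
        add_new_diversity_axis()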
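The `score_plateaus` check in the loop above needs a concrete definition. One simple option, sketched here, is to call it a plateau when the last few batch-to-batch gains all fall below a threshold. The window of 2, the 0.03 threshold, and the score history are hypothetical choices.

```python
def score_plateaus(scores, window: int = 2, min_gain: float = 0.03) -> bool:
    """True if each of the last `window` batch-to-batch gains is below `min_gain`."""
    if len(scores) < window + 1:
        return False  # not enough history to judge
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


# Hypothetical unseen-set success rates after 8, 16, 24, and 32 pairs.
history = [0.55, 0.72, 0.74, 0.75]

print(score_plateaus(history[:3]))  # after 24 pairs
print(score_plateaus(history))      # after 32 pairs
```

With only 40 rollouts per evaluation, a gain of one or two successes is within noise, so keep the threshold well above a single rollout's worth of change.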
Choose the next diversity axis based on real failures, not intuition. If the policy fails due to object pose, randomize placement. If it fails on transparent objects, add transparent and glossy objects. If it fails when people pass behind the table, add dynamic backgrounds or masks. If it fails with heavier objects, add mass and friction variation plus matching real demonstrations.
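If failures are logged with a taxonomy label, picking the next axis becomes a lookup on the most frequent label. The sketch below assumes such labels exist. The label names and the sample log are hypothetical, and the suggested actions are the ones listed in the paragraph above.

```python
from collections import Counter

# Failure label -> data action, following the paragraph above.
AXIS_FOR_FAILURE = {
    "object_pose": "randomize placement",
    "transparent_object": "add transparent and glossy objects",
    "dynamic_background": "add dynamic backgrounds or masks",
    "heavy_object": "add mass and friction variation plus matching real demonstrations",
}


def next_diversity_axis(failure_labels):
    """Return (label, action) for the most frequent known failure, or None."""
    known = [label for label in failure_labels if label in AXIS_FOR_FAILURE]
    if not known:
        return None
    label, _count = Counter(known).most_common(1)[0]
    return label, AXIS_FOR_FAILURE[label]


# Hypothetical failure log from one evaluation round.
failure_log = [
    "transparent_object", "object_pose", "transparent_object",
    "unlabeled", "transparent_object", "heavy_object",
]
print(next_diversity_axis(failure_log))
```

Labels that fall outside the taxonomy are ignored here; in practice a growing pile of unlabeled failures is itself a signal that the taxonomy needs another category.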
How VLA scaling differs from classic imitation learning
Lin et al. use Diffusion Policy for single-task manipulation. VLAs such as OpenVLA, π0, and GR00T N1 add language and pre-trained vision-language priors. That changes how we interpret scaling laws, but it does not invalidate them.
With a policy trained from scratch, each demonstration must teach perception, semantics, and action. With a VLA, the model already knows many concepts from Internet-scale vision-language data: cup, drawer, towel, box, left/right, on/inside/near. Target robot data does not need to teach all of those concepts from zero. It must teach grounding for a specific embodiment: camera geometry, gripper behavior, action scale, latency, joint limits, and real contact.
That means VLA fine-tuning needs:
| Coverage type | Why it matters |
|---|---|
| Language coverage | The same task can be instructed in many ways |
| Visual coverage | VLM priors are strong, but robot cameras are unusual |
| Object coverage | Knowing "cup" is not enough to choose a grasp pose |
| Action coverage | Embodiment-specific behavior does not come from web images |
| Recovery coverage | Real robots drift away from perfect demonstrations |
The danger is that a VLA can look intelligent in a short demo video while failing under strict split evaluation. Lin et al.'s unseen environment/object protocol should be treated as a minimum bar: test in places where you did not collect data, with objects that were not in the training set, and preferably with blinded policy order during evaluation.
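Both parts of that protocol are easy to enforce in code. The sketch below checks that no evaluation environment or object appears in training, and builds a shuffled trial schedule so the person scoring rollouts does not know which policy is running. The metadata fields and policy names are hypothetical.

```python
import random


def check_unseen_split(train_episodes, eval_episodes) -> None:
    """Raise if any evaluation environment or object also appears in training."""
    for key in ("environment", "object"):
        overlap = {ep[key] for ep in train_episodes} & {ep[key] for ep in eval_episodes}
        if overlap:
            raise ValueError(f"{key} leak between train and eval: {sorted(overlap)}")


def blinded_order(policy_names, trials_per_policy: int, seed: int):
    """Shuffled schedule of policy names, reproducible from the seed."""
    schedule = [name for name in policy_names for _ in range(trials_per_policy)]
    random.Random(seed).shuffle(schedule)
    return schedule


# Hypothetical metadata.
train = [{"environment": "lab_table", "object": "plastic_cup"}]
evaluation = [{"environment": "cafe_counter", "object": "ceramic_mug"}]

check_unseen_split(train, evaluation)  # passes silently when the split is clean
schedule = blinded_order(["policy_a", "policy_b"], trials_per_policy=5, seed=0)
print(schedule)
```

Keep the mapping from schedule position to policy name away from whoever judges success until the round is finished.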
Checklist: where should you invest diversity?
Before collecting another 1,000 demonstrations, answer these questions:
| Question | If the answer is no |
|---|---|
| Do you have at least 20 environment/object pairs? | Open new pairs first |
| Does each pair have 30-50 clean demos? | Add just enough, but avoid over-collection |
| Do you have a fixed unseen evaluation set? | Stop and create a benchmark |
| Do you log failure modes by taxonomy? | You do not know what to scale |
| Does the object split vary shape, material, and size? | Your split is too easy |
| Does the environment split vary light, background, and clutter? | Your split is too easy |
| Do you have recovery or intervention data? | The policy may be brittle |
| Have you checked dirty demos, bad resets, and timestamp drift? | Quality may be the bottleneck |
| Have you compared real-only, synthetic-only, and mixed training? | You do not know which source helps |
| Do you stop when the curve plateaus? | Budget will leak into repetition |
The data strategy comes down to one reframing:
Do not ask "how many demonstrations do we need?"
Ask "how many different situations do we need, and how many clean demos per situation are enough?"
Conclusion: the data war is a diversity war
If Part 1 mapped who owns the datasets, Part 5 clarifies what makes those datasets valuable. Value does not come from raw episode count. It comes from coverage of the physical world: environments, objects, embodiments, tasks, language, failures, recovery, and quality control.
Lin et al. give robotics a rare operational result: an empirical curve concrete enough to guide data budgets. For tasks similar to the paper, 32 environment-object pairs and about 50 demonstrations per pair are a strong starting point. OpenVLA shows that foundation models need diverse mixtures, not just large datasets. π0 shows that broad pre-training and clean post-training play different roles. GR00T N1 shows that humanoid data strategy is a portfolio of real robot data, human video, synthetic trajectories, and embodiment-specific post-training.
The next post in the series turns this into an execution plan: if you are a small team, what should you collect first, what should you buy, what should you synthesize, and how do you avoid getting trapped in the raw demo-count race?
Sources
- Data Scaling Laws in Imitation Learning for Robotic Manipulation
- Project page: Data Scaling Laws
- OpenVLA: An Open-Source Vision-Language-Action Model
- Open X-Embodiment dataset
- π0: A Vision-Language-Action Flow Model for General Robot Control
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots