VLA Data Scaling Laws for Robot Learning

Why Part 5 is about scaling laws

The first four posts in this series covered the ownership map, teleoperation data collection, human video mining, and synthetic pipelines from simulation to reality. If you are joining here, start with Part 1: the humanoid data war landscape, Part 2: teleoperation data, and Part 4: synthetic data pipelines. Part 5 answers a very practical question: when does collecting more demonstrations make the robot better, and when are you just burning money?

In language models, scaling laws are usually discussed in terms of tokens, parameters, and compute. Robotics is different. A text token is cheap, easy to copy, and never breaks a robot arm. A robot demonstration requires an operator, hardware time, cameras, safety checks, scene resets, logging, labels, and often re-collection because the gripper slipped or the timestamp was wrong. A humanoid team cannot casually say "collect another million demos" in the same way a web-scale NLP team says "crawl another trillion tokens."

The paper Data Scaling Laws in Imitation Learning for Robotic Manipulation by Lin et al. is one of the most useful papers for this exact question. The authors did not merely train in simulation and plot a clean curve. They collected more than 40,000 demonstrations, executed more than 15,000 real-world robot rollouts, and studied three data axes: number of environments, number of objects, and number of demonstrations. The core conclusion is simple and important:

Policy generalization improves approximately as a power law when environment and object diversity increase, but there is no clear power law when you only add more demonstrations in the same setup.

In plain terms: a robot learns more from seeing new tables, lighting conditions, backgrounds, objects, poses, clutter, and manipulation variations. It learns far less from seeing the same table, same cup, same camera angle, and same operator another 500 times.

What is a robot scaling law?

A scaling law is a relationship that roughly looks like this:

error ≈ A * N^(-alpha) + irreducible_error

Here, N may mean the number of objects, environments, environment-object pairs, tasks, trajectories, or action tokens. A is a task-dependent constant that sets the starting error, and alpha is the exponent: a larger alpha means error falls faster as N grows, so each additional unit of data is more useful. irreducible_error is the part that will not disappear just because you collect more of the same data: bad perception, missing force sensing, weak control, sensor latency, a poor action space, or an evaluation task that requires feedback not present in the dataset.
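
As a rough illustration of the formula, the sketch below evaluates the power law and recovers alpha from a handful of (N, error) points with a straight-line fit in log-log space. The constants (A = 0.8, alpha = 0.5, a floor of 0.05) are made-up values for illustration, not numbers from any paper, and the fit assumes the irreducible error is already known.

```python
import math


def predicted_error(n, a, alpha, floor):
    """error ≈ A * N^(-alpha) + irreducible_error."""
    return a * n ** (-alpha) + floor


def fit_power_law(ns, errors, floor=0.0):
    """Least-squares line through (log N, log(error - floor)).

    Returns (A, alpha). Assumes every error is strictly above the floor.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e - floor) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Illustrative constants only (not measured values).
A, ALPHA, FLOOR = 0.8, 0.5, 0.05
ns = [1, 2, 4, 8, 16, 32]
errors = [predicted_error(n, A, ALPHA, FLOOR) for n in ns]
a_fit, alpha_fit = fit_power_law(ns, errors, floor=FLOOR)

# Each doubling of N removes a smaller absolute amount of error.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
```

On real rollout data the floor is not known in advance, so a fit like this is only a first look; treat the recovered alpha as a way to compare diversity axes, not as a precise constant.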

For beginners, keep this mental model:

Add a new setup:
  new table, new lighting, new object, slightly shifted camera
  -> the model learns invariance and robust behavior

Add repeated demos in the old setup:
  same table, same object, same camera, same operator
  -> the model reduces noise at first, then quickly plateaus

In Lin et al., the key metric is not a pretty validation loss. The policies are evaluated on unseen real environments and unseen real objects. This matters because imitation learning can be misleading if you only look at action MSE. A predicted action can be close to a demonstration yet still fail after contact, slippage, delay, or accumulated error. The paper also notes that validation MSE can help debugging, but it is not always reliable as a proxy for real rollout success.

Lin et al. 2024: experiment design

The paper studies single-task manipulation, not a generalist VLA trained over thousands of tasks. The authors use two main tasks to fit the scaling curves:

Task	What the robot must do	Why it is useful for scaling
Pour Water	Grasp a bottle, pour water, and place it back	Requires grasping, pose control, tilting, and sensitivity to object and scene variation
Mouse Arrangement	Arrange a computer mouse into a target position	Requires spatial alignment, light contact, and accurate placement

They then validate the resulting data collection strategy on two additional tasks:

Validation task	What it tests
Fold Towels	Softer manipulation, grasping an edge and folding
Unplug Charger	Correct grasping and fast pulling from a power strip

The data is collected with UMI hand-held grippers, modeled with Diffusion Policy, and evaluated zero-shot in 8 unseen environments with unseen objects. The experimental axes are clean:

Scaling axis	Collection setup	Question
Object generalization	32 objects in one environment	Does adding objects help with unseen objects?
Environment generalization	32 environments with one object	Does adding environments help with unseen places?
Environment + object	32 environment-object pairs, one unique object per environment	What happens when both scene and object change?
Demo count	Diversity fixed, demonstrations increased	Does collecting more demos in the same setup still help?

The key result: as the number of objects or environments increases, performance on unseen objects/environments rises approximately as a power law. When only the number of demonstrations increases under fixed diversity, performance improves early and then plateaus.

The number to remember: 32 pairs, 50 demos each

One of the most practical contributions of the paper is its recommended data collection strategy. For tasks of similar difficulty to Pour Water and Mouse Arrangement, the authors recommend:

Environment-object pairs: about 32
Demonstrations per pair: about 50
Total demonstrations: about 1,600
Evaluation: 8 unseen environments, 2 unseen objects per environment
Expected result: roughly 85-92.5% success depending on the task in the paper

Their reported average success rates:

Task	Average success rate
Pour Water	85.0%
Mouse Arrangement	92.5%
Fold Towels	87.5%
Unplug Charger	90.0%

This does not mean 1,600 demos is a magic number for every humanoid task. It means that for tasks of similar complexity, with similar sensors, actions, and policy class, 32 diverse setups with 50 demos per setup is far more valuable than 1 setup with 1,600 demos.

The diminishing-returns threshold is also clear. In their demonstration-count study, performance plateaued around 800 total demos in one maximum-data setting. When analyzed by the number of environment-object pairs, the saturation points were:

Environment-object pairs	Total demos near plateau	Approx. demos per pair
8	400	50
16	800	50
32	1,600	50

This is a budget-changing result. If you already have 50 good demonstrations for the same table, same object, and same camera, collecting 200 more demos in that exact setup may only make the spreadsheet look better. The same operator time is usually better spent opening new environments or new objects.
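
The arithmetic behind the recommendation is simple enough to check in a few lines. The sketch below recomputes demos per pair from the plateau table and sizes the evaluation set; the 5 trials per object is an assumed value borrowed from the evaluation loop in the budget planner section of this post, not a figure quoted in this section.

```python
# (environment-object pairs, total demos near plateau) from the table above.
plateau_rows = [(8, 400), (16, 800), (32, 1600)]

demos_per_pair = [total // pairs for pairs, total in plateau_rows]


def total_demos(pairs, per_pair):
    """Demonstrations needed for a diversity-first collection plan."""
    return pairs * per_pair


def eval_rollouts(envs, objects_per_env, trials_per_object):
    """Real-robot rollouts needed to score one policy."""
    return envs * objects_per_env * trials_per_object


recommended_total = total_demos(pairs=32, per_pair=50)
# trials_per_object=5 is an assumed value for illustration.
rollouts = eval_rollouts(envs=8, objects_per_env=2, trials_per_object=5)

print(demos_per_pair, recommended_total, rollouts)
```

Under these assumptions every row of the table works out to 50 demos per pair, and scoring a single policy costs 80 real rollouts, which is worth budgeting for alongside collection time.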

Why raw demo count is misleading

"We have 100,000 demonstrations" sounds impressive. For robot learning, the better questions are:

How many objects do those demos cover?
How many rooms, tables, factory cells, shelves, and lighting conditions?
How many camera poses?
How many operator styles?
How many initial states?
How many recovery cases?
How many task families?
How many robot embodiments?

If 100,000 demos are all one operator doing pick-and-place in one lab, they can be less useful than 10,000 demos collected across 200 real scenes, with many objects, clutter patterns, lighting conditions, and reset states. A policy is not just learning a hand trajectory. It is learning a mapping from image + language + robot state to action. If the visual distribution is too narrow, the model can learn shortcuts: the cup is always on the left, the table is always white, the camera is always fixed, and the gripper always approaches from the same direction.

A simple way to audit a dataset is to score effective diversity rather than episode count:

effective_dataset_score = (
    0.30 * unique_environments +
    0.25 * unique_objects +
    0.15 * unique_initial_states +
    0.10 * camera_pose_bins +
    0.10 * operator_style_bins +
    0.10 * recovery_or_failure_modes
)

This is not a formula from the paper. It is an operational heuristic. The goal is to force the team to look at meaningful variation. If the score barely increases while episode count grows rapidly, you are collecting repetition, not useful coverage.
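
A minimal runnable version of this heuristic might look like the following. The `Episode` fields are hypothetical metadata names, not a standard schema, and the weights are simply the ones from the heuristic above.

```python
from dataclasses import dataclass

# Weights copied from the heuristic above; they are a judgment call.
WEIGHTS = {
    "environment": 0.30,
    "object_name": 0.25,
    "initial_state": 0.15,
    "camera_pose_bin": 0.10,
    "operator": 0.10,
    "failure_mode": 0.10,
}


@dataclass(frozen=True)
class Episode:
    # Hypothetical metadata fields; rename to match your own logs.
    environment: str
    object_name: str
    initial_state: str
    camera_pose_bin: str
    operator: str
    failure_mode: str


def effective_dataset_score(episodes):
    """Weighted sum of the number of unique values on each diversity axis."""
    return sum(
        weight * len({getattr(ep, axis) for ep in episodes})
        for axis, weight in WEIGHTS.items()
    )


base = [
    Episode("lab_table", "plastic_cup", "left", "cam_a", "op1", "none"),
    Episode("wood_desk", "paper_cup", "right", "cam_b", "op2", "slip"),
]
# 500 more demos of an existing setup: episode count grows, score does not.
repeated = base + [base[0]] * 500
# One demo in a genuinely new setup: score grows.
diverse = base + [
    Episode("metal_shelf", "steel_cup", "center", "cam_c", "op3", "drop")
]
```

Because the score counts unique values only, `repeated` scores the same as `base` despite being 250 times larger, while `diverse` scores higher with a single extra episode.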

OpenVLA: diversity beats raw scale

OpenVLA shows the same lesson at foundation-model scale. It is a 7B-parameter VLA fine-tuned from a VLM backbone and trained on about 970,000 real robot demonstrations from Open X-Embodiment. What matters is not only the 970k count but also that the data spans many embodiments, tasks, and scenes.

Open X-Embodiment was built from more than 20 robot embodiments and many labs. It is not a perfectly uniform dataset: action spaces differ, camera views differ, skills are uneven, and the mixture must be curated. But that heterogeneity is exactly where much of the generalization signal comes from. The OpenVLA paper explains that the training mixture is filtered and balanced to keep the input/output space coherent while avoiding domination by large but less diverse datasets.
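
One common way to keep a large but narrow dataset from dominating a mixture is to sample each dataset in proportion to its size raised to a power below 1. The sketch below shows that general idea only; it is not the weighting scheme used by OpenVLA, and the dataset names and sizes are invented for illustration.

```python
def mixture_weights(sizes, temperature=0.5):
    """Sampling weight per dataset, proportional to size ** temperature.

    temperature=1.0 reproduces size-proportional sampling;
    smaller values flatten the mixture toward uniform.
    """
    scaled = {name: n ** temperature for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Invented episode counts, for illustration only.
sizes = {
    "single_lab_pick_place": 90_000,
    "kitchen_multi_scene": 9_000,
    "mobile_manip": 1_000,
}

proportional = mixture_weights(sizes, temperature=1.0)
flattened = mixture_weights(sizes, temperature=0.5)
```

With these made-up sizes, the single-lab dataset drops from 90% of samples to roughly 70%, and the two smaller but more varied datasets are seen correspondingly more often.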

The connection to Lin et al. is direct:

Lin et al. single-task scaling	OpenVLA generalist scaling
More objects/environments produce power-law gains	More embodiments/tasks/scenes improve generalist behavior
Repeated demos in the same setup quickly plateau	Large but narrow datasets may need down-weighting or filtering
50 demos per pair is a practical threshold for similar tasks	VLA fine-tuning needs target diversity, not just many episodes

For a small team using OpenVLA or a similar VLA, the lesson is not "train your own 7B model from scratch." The lesson is: when fine-tuning, you are adding target-domain evidence to a model that already has a large prior. Two hundred diverse demos can be worth more than two thousand nearly identical demos. If your task is "pick electronic parts from a tray," vary the parts, trays, lighting, clutter, camera jitter, starting poses, and operators. Do not simply pick the same component from the same tray slot 1,000 more times.

π0: broad pre-training, clean post-training

π0 from Physical Intelligence pushes scaling in another direction: VLM pre-training, flow matching for action chunks, and multi-embodiment robot data. The paper describes pre-training on more than 10,000 hours of robot data, including 7 robot configurations and 68 tasks, followed by post-training on smaller downstream datasets.

The key point is that π0 separates two roles for data:

Stage	What the data should provide	Goal
Pre-training	Broad, multi-task, multi-embodiment, including imperfect behavior	Learn physical priors, affordances, recovery, and semantic grounding
Post-training	Clean, consistent, task-specific, fluent	Teach the desired downstream behavior

This is another expression of diversity scaling. Pre-training needs breadth so the model learns that the physical world has many situations. Post-training needs quality so the model learns the execution style you want. If you only post-train on clean demos, the model can become fluent but brittle. If you only train on broad but messy data, the model may know many things but fail to execute with enough precision.

For humanoids, this split is especially important. A two-arm robot doing household work must recover from disturbances: towels shift, bowls rotate, lids stick, the left hand blocks the camera, or a person walks through the scene. Recovery rarely appears in "perfect" demonstrations because the operator tries to avoid mistakes. Broad pre-training data, interventions, and failure cases therefore become valuable assets. But deployment still needs clean post-training so the final behavior is efficient and stable.

GR00T N1: scaling with real, human, and synthetic data

GR00T N1 from NVIDIA is the clearest humanoid example of the idea that no single data source is enough. N1 is a VLA foundation model for humanoid robots with a dual-system architecture: a vision-language module interprets the scene and instruction, while a diffusion transformer generates motor actions. Its training data is a heterogeneous mixture of real-robot trajectories, human videos, and synthetic datasets.

One of the most important details in the GR00T N1 paper is its synthetic scaling pipeline. Using DexMimicGen, the authors scale a limited set of human demonstrations into 780,000 simulation trajectories, equivalent to about 6,500 hours of human demonstration data, in 11 hours. This does not show that simulation replaces real data. It shows that once you have a task schema, simulator, object-centric transformation, and success filter, synthetic data can cover variations that real teleoperation cannot economically cover.
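
To get a feel for those figures, the short calculation below derives two quantities they imply. It uses only the three numbers quoted above; the 11 hours is wall-clock time on whatever compute the authors used, so the ratio is a throughput figure, not a per-machine speed.

```python
trajectories = 780_000
equivalent_human_hours = 6_500
generation_hours = 11

# Average length of one demonstration implied by the two figures.
seconds_per_trajectory = equivalent_human_hours * 3600 / trajectories

# Demonstration-hours produced per wall-clock hour of generation.
throughput_ratio = equivalent_human_hours / generation_hours

print(f"{seconds_per_trajectory:.0f} s per trajectory")
print(f"about {throughput_ratio:.0f} demonstration-hours per generation hour")
```

The implied average of 30 seconds per trajectory and roughly 590 demonstration-hours per hour of generation are what make this kind of coverage uneconomical to match with teleoperation alone.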

GR00T N1 still uses real data. The paper combines real robot data, Open X-Embodiment data, human videos, latent actions, synthetic trajectories, and embodiment-specific post-training. In practice, a humanoid data strategy looks like a portfolio:

Data source	Cost	Real actions?	Scaling value
Real robot teleoperation	Expensive	Yes	Ground truth for embodiment, contact, latency, and sensing
Human video	Cheaper at large scale	Not directly	Affordances, task semantics, motion priors
Synthetic trajectories	Cheap after setup	Yes, in simulation	Coverage of environments, objects, and initial states
Target-robot post-training	Expensive but smaller	Yes	Aligns the final policy to the embodiment and task

Through the Lin et al. lens, GR00T N1 is applying the same lesson at a larger scale: it is not merely increasing episode count; it is increasing diversity axes. Human video expands activity and object coverage. Synthetic data expands scene and initial-state coverage. Real robot data anchors the model to physical reality. Post-training keeps the final behavior precise.

When is more data a waste of money?

Be suspicious of any "collect more data" plan when you see these symptoms:

Symptom	What it means	Better action
High training success, low success in another lab	Environment overfit	Collect new environments, not more of the same lab
Failures cluster on unseen objects	Object diversity is missing	Add shape, material, and size variation
Failures appear when light or camera changes	Visual distribution is narrow	Randomize lighting, camera pose, and background
Failures happen after small gripper slips	Recovery data is missing	Collect interventions, failed attempts, and corrections
Validation loss drops but rollout success does not improve	Metric mismatch	Evaluate real rollouts or task scores
You already have 50-100 demos per setup and the curve is flat	Diminishing returns	Open a new setup

A practical rule of thumb:

If you do not yet have 20-30 environment/object pairs:
  prioritize diversity.

If each pair already has 50 clean demos:
  do not automatically collect more for the same pair.

If the policy fails because actions are not smooth:
  inspect data quality, controller, latency, and action representation.

If the policy fails because the scene changed:
  collect more scenes before over-tuning the model.

If the policy fails because the task is new:
  you are missing task-level diversity, not just object-level diversity.

The 20-30 range is not a law of physics. It is a useful operating point when you do not yet have your own scaling curve. Lin et al. showed that 32 pairs worked well for the four tasks in their paper. More dexterous, deformable, force-heavy, or whole-body humanoid tasks will likely need more. The principle remains: increase diversity before repetition.
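
The rules of thumb above can be written down as a small decision helper so the team applies them the same way every week. This is only a sketch: the failure labels are hypothetical names for entries in your own failure log, the thresholds of 20 pairs and 50 demos come from this post, and the order in which the rules are checked is a judgment call.

```python
def next_collection_step(num_pairs, clean_demos_per_pair, failure_cause=None):
    """Map the rules of thumb to a suggested next step.

    failure_cause is a hypothetical label from your own failure log:
    "jerky_actions", "scene_changed", "new_task", or None.
    """
    if failure_cause == "jerky_actions":
        return "inspect data quality, controller, latency, and action representation"
    if failure_cause == "new_task":
        return "add task-level diversity"
    if failure_cause == "scene_changed" or num_pairs < 20:
        return "collect new environment-object pairs"
    if clean_demos_per_pair < 50:
        return "top up existing pairs toward 50 clean demos"
    return "hold collection and re-evaluate on the unseen set"


print(next_collection_step(num_pairs=8, clean_demos_per_pair=50))
print(next_collection_step(num_pairs=32, clean_demos_per_pair=50))
```

The first call suggests opening new pairs because diversity is still low; the second suggests pausing collection, since both thresholds are already met and no failure cause points elsewhere.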

Data budget planner for a small team

Suppose your team has a robot arm or humanoid upper body and wants to train "pick up a cup and place it in a tray" in office-like environments. You have 80 operator hours. A weak plan would be:

1 environment
5 cups
1 tray
1 camera pose
8,000 demonstrations

A better plan:

32 environment-object pairs
50 demonstrations per pair
1,600 total demonstrations
Remaining time:
  - evaluate on 8 unseen environments
  - collect 200 recovery demos around failure modes
  - audit labels and remove bad demos
  - add objects if failures correlate with object shape
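
The time budget behind the two plans can be sketched as follows. The per-demo pace is inferred from the weak plan, assuming it uses the full 80 hours; the 30 minutes of setup per pair is a hypothetical overhead that you should replace with your own measurement.

```python
OPERATOR_HOURS = 80

# The weak plan fits 8,000 demos into 80 hours, which implies this pace.
seconds_per_demo = OPERATOR_HOURS * 3600 / 8_000

# Assumed overhead for moving to and setting up each new pair (hypothetical).
SETUP_MINUTES_PER_PAIR = 30

pairs, demos_per_pair = 32, 50
demo_hours = pairs * demos_per_pair * seconds_per_demo / 3600
setup_hours = pairs * SETUP_MINUTES_PER_PAIR / 60
remaining_hours = OPERATOR_HOURS - demo_hours - setup_hours

print(seconds_per_demo, demo_hours, setup_hours, remaining_hours)
```

Under these assumptions the 1,600 demos take 16 hours, setup takes another 16, and 48 hours remain for evaluation, recovery demos, and audits. Even if setup cost per pair were doubled, the diverse plan would still fit inside the same budget.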

You can plan data collection with a table:

Pair	Environment	Object	Lighting	Clutter	Demo target	Failure notes
01	White lab table	Small plastic cup	Bright	Low	50	Baseline
02	Office wood desk	Paper cup	Warm light	Medium	50	Glare risk
03	Metal shelf	Steel cup	Shadows	High	50	Reflections
04	Utility cart	Short cup	Side light	Medium	50	Moving surface

Train and test after every 8-pair batch. Do not wait until all 1,600 demos are collected before discovering that the camera is miscalibrated or gripper commands were logged incorrectly.

for batch in [8, 16, 24, 32]:
    train_policy(dataset_pairs=batch)
    score = evaluate_unseen(envs=4, objects=2, trials=5)
    log(score)

    if score_plateaus and failures_are_same_setup:
        stop_collecting_repetitions()
        add_new_diversity_axis()
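
The plateau test in the loop above is left abstract. One simple way to make it concrete is to check whether the last few evaluations each improved by less than a small margin. The sketch below does that; the `min_gain` and `window` defaults and the example success rates are illustrative, not tuned or measured values.

```python
def has_plateaued(scores, min_gain=0.02, window=2):
    """True if each of the last `window` evaluations improved by less than min_gain.

    scores: success rates in [0, 1], one per evaluation, oldest first.
    """
    if len(scores) < window + 1:
        return False  # not enough history to judge
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


# Hypothetical success rates after training on 8, 16, 24, and 32 pairs.
still_improving = [0.40, 0.58, 0.71, 0.80]
flat = [0.40, 0.62, 0.63, 0.63]

print(has_plateaued(still_improving), has_plateaued(flat))
```

With only a few dozen rollouts per evaluation, a difference of one or two percentage points is within noise, so a margin of this kind is safer than testing for strict equality.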

Choose the next diversity axis based on real failures, not intuition. If the policy fails due to object pose, randomize placement. If it fails on transparent objects, add transparent and glossy objects. If it fails when people pass behind the table, add dynamic backgrounds or masks. If it fails with heavier objects, add mass and friction variation plus matching real demonstrations.

How VLA scaling differs from classic imitation learning

Lin et al. use Diffusion Policy for single-task manipulation. VLAs such as OpenVLA, π0, and GR00T N1 add language and pre-trained vision-language priors. That changes how we interpret scaling laws, but it does not invalidate them.

With a policy trained from scratch, each demonstration must teach perception, semantics, and action. With a VLA, the model already knows many concepts from Internet-scale vision-language data: cup, drawer, towel, box, left/right, on/inside/near. Target robot data does not need to teach all of those concepts from zero. It must teach grounding for a specific embodiment: camera geometry, gripper behavior, action scale, latency, joint limits, and real contact.

That means VLA fine-tuning needs:

Coverage type	Why it matters
Language coverage	The same task can be instructed in many ways
Visual coverage	VLM priors are strong, but robot cameras are unusual
Object coverage	Knowing "cup" is not enough to choose a grasp pose
Action coverage	Embodiment-specific behavior does not come from web images
Recovery coverage	Real robots drift away from perfect demonstrations

The danger is that a VLA can look intelligent in a short demo video while failing under strict split evaluation. Lin et al.'s unseen environment/object protocol should be treated as a minimum bar: test in places where you did not collect data, with objects that were not in the training set, and preferably with blinded policy order during evaluation.
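
Two parts of that protocol are easy to automate: checking that no evaluation setup reuses a training environment or object, and hiding policy identity from the person scoring rollouts. The sketch below assumes episodes and setups are dicts with hypothetical "env" and "obj" keys; the policy names are placeholders.

```python
import random


def leaked_setups(train_episodes, eval_setups):
    """Return eval setups whose environment or object also appears in training."""
    train_envs = {ep["env"] for ep in train_episodes}
    train_objs = {ep["obj"] for ep in train_episodes}
    return [
        setup for setup in eval_setups
        if setup["env"] in train_envs or setup["obj"] in train_objs
    ]


def blinded_order(policy_names, seed):
    """Shuffle policies and hide their names behind neutral labels."""
    rng = random.Random(seed)
    shuffled = list(policy_names)
    rng.shuffle(shuffled)
    return {f"policy_{i}": name for i, name in enumerate(shuffled, start=1)}


train = [
    {"env": "lab_table", "obj": "plastic_cup"},
    {"env": "wood_desk", "obj": "paper_cup"},
]
evaluation = [
    {"env": "kitchen_counter", "obj": "glass_mug"},   # fully unseen
    {"env": "kitchen_counter", "obj": "paper_cup"},   # object leak
]

leaks = leaked_setups(train, evaluation)
key = blinded_order(["baseline", "finetuned_v2"], seed=0)
```

Here `leaks` contains the second evaluation setup, which reuses a training object. The `key` mapping stays with whoever runs the robot, and the scorer sees only the neutral labels until all rollouts are graded.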

Checklist: where should you invest diversity?

Before collecting another 1,000 demonstrations, answer these questions:

Question	If the answer is no
Do you have at least 20 environment/object pairs?	Open new pairs first
Does each pair have 30-50 clean demos?	Add just enough, but avoid over-collection
Do you have a fixed unseen evaluation set?	Stop and create a benchmark
Do you log failure modes by taxonomy?	You do not know what to scale
Does the object split vary shape, material, and size?	Your split is too easy
Does the environment split vary light, background, and clutter?	Your split is too easy
Do you have recovery or intervention data?	The policy may be brittle
Have you checked dirty demos, bad resets, and timestamp drift?	Quality may be the bottleneck
Have you compared real-only, synthetic-only, and mixed training?	You do not know which source helps
Do you stop when the curve plateaus?	Budget will leak into repetition

The key data strategy sentence is:

Do not ask "how many demonstrations do we need?"
Ask "how many different situations do we need, and how many clean demos per situation are enough?"

Conclusion: the data war is a diversity war

If Part 1 mapped who owns the datasets, Part 5 clarifies what makes those datasets valuable. Value does not come from raw episode count. It comes from coverage of the physical world: environments, objects, embodiments, tasks, language, failures, recovery, and quality control.

Lin et al. give robotics a rare operational result: an empirical curve concrete enough to guide data budgets. For tasks similar to the paper, 32 environment-object pairs and about 50 demonstrations per pair are a strong starting point. OpenVLA shows that foundation models need diverse mixtures, not just large datasets. π0 shows that broad pre-training and clean post-training play different roles. GR00T N1 shows that humanoid data strategy is a portfolio of real robot data, human video, synthetic trajectories, and embodiment-specific post-training.

The next post in the series turns this into an execution plan: if you are a small team, what should you collect first, what should you buy, what should you synthesize, and how do you avoid getting trapped in the raw demo-count race?

Sources

Nguyễn Anh Tuấn
