Data Strategy: What Should Small Teams Collect?

Why Part 6 is for small teams

The first five posts in this series covered the ownership map, teleoperation, human video mining, synthetic pipelines, and scaling laws for VLA data. If you are joining here, start with Part 1: the humanoid data war landscape, Part 2: teleoperation, and Part 5: VLA data scaling. This post answers a more operational question: if you are a small robotics team, without a fleet of 100 robots, what should you collect first?

The short answer: do not try to win the raw episode-count race. A small team should build a mini data flywheel: choose one narrow but real task, collect a small amount of clean data with inexpensive hardware, standardize it as a LeRobotDataset, fine-tune from a pretrained checkpoint, evaluate real failures, and collect the next batch only where the robot actually fails. That loop matters more than claiming "10,000 demos" from a single lab setup.

The good news is that the open robotics ecosystem has lowered the entry barrier. LeRobot provides tooling for robot control, datasets, training, and inference. SO-101 is a low-cost arm with six servos (five arm joints plus a gripper) that can be built or bought as a kit and used with leader-follower teleoperation. SmolVLA is designed for fine-tuning on LeRobot datasets. OpenPI/π0 and NVIDIA Isaac GR00T push the same practical idea: start from a foundation model, adapt it to your embodiment and task, and avoid training from scratch unless you have a very strong reason.

Figure: Robot arm data collection desk.

The right mindset: raw footage is not a data asset

For a startup, robot data becomes an asset only when it helps answer one of three questions:

Question	Example
Can our robot perform the customer workflow?	Pick parts from a tray, place boxes on a conveyor, open a drawer
Does the policy generalize across real variation?	Lighting, camera pose, objects, position, clutter
Can the dataset be reused for future models?	Standard schema, clear metadata, clean license, synchronized video/state/action

A folder of videos showing an operator driving a robot is not yet a data asset. It becomes one when it has timestamps, observations, actions, states, task labels, episode boundaries, train/eval splits, robot metadata, and a way to load everything back in code. This is why LeRobotDataset matters. You are buying optionality: today you may train ACT or SmolVLA, six months from now you may try π0 or GR00T, and a year from now you may merge your dataset with community data.

A small-team data strategy should start with a concrete sentence:

In the next 30 days, what task should the robot learn, on what setup,
with what success criterion, and what decision will this dataset support?

If you cannot answer that, do not start collecting data yet. Define the task first.

Minimum setup: LeRobot + SO-101 in the $100-300 range

SO-101 is the next-generation version of SO-100 developed by The Robot Studio and Hugging Face. The official docs describe a follower arm with 6 STS3215 motors, a leader arm with different gear ratios so the operator can move it smoothly, and a LeRobot workflow for recording datasets. The actual cost depends on whether you print the parts yourself, buy a kit, already own power supplies, and what camera you use. The "$100" number usually refers to the basic arm hardware if you source parts well. For a data collection setup, budget more realistically:

Item	Purpose	Rough budget
SO-101 follower arm	Robot that executes actions	$100-180 if well sourced, higher as a kit
SO-101 leader arm	Consistent teleoperation	$100-180
USB camera or webcam	Main visual observation	$20-60
Table, clamps, power, cables, objects	Reliable resets and fewer mechanical failures	$30-80
CUDA PC or cloud GPU	Fine-tuning experiments	Depends on usage
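
Adding up the rough ranges in the table is a quick sanity check on the headline figure. The sketch below only sums the table's own numbers; the dictionary keys and the `total_range` helper are illustrative names, and the GPU row is left out because the table gives no figure for it.

```python
# Rough budget ranges (low, high) in USD, copied from the table above.
BUDGET = {
    "follower_arm": (100, 180),
    "leader_arm": (100, 180),
    "camera": (20, 60),
    "table_clamps_power_objects": (30, 80),
}

def total_range(items):
    """Sum the low and high ends of the selected line items."""
    low = sum(BUDGET[name][0] for name in items)
    high = sum(BUDGET[name][1] for name in items)
    return low, high

leader_follower = total_range(BUDGET)  # iterating the dict yields all four rows
follower_only = total_range(
    ["follower_arm", "camera", "table_clamps_power_objects"]
)

print(f"Leader-follower setup: ${leader_follower[0]}-{leader_follower[1]}")
print(f"Follower-only setup:   ${follower_only[0]}-{follower_only[1]}")
```

With these numbers, a follower-only setup sits close to the $100-300 range in the heading, while a full leader-follower setup lands at roughly $250-500 before compute.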

If you are extremely budget constrained, you can start with a follower arm and simpler teleoperation scripts. But leader-follower collection usually produces cleaner data. The operator moves a physical arm that mirrors the follower, demonstrations are more natural, and the action logs are easier to interpret.

A useful mini setup should reserve space for three camera zones even if you initially use only one or two:

Camera	Required?	When it matters
Front/global camera	Yes	Almost all pick-place, sorting, and simple insertion tasks
Wrist/near-gripper camera	Recommended	Small grasps, occlusion, objects with similar colors
Side camera	Optional	Debugging contact, height, and collisions

Do not add cameras before your logging is stable. One synchronized camera is better than three misaligned streams.

Choose the first task: narrow, repeatable, commercially relevant

The first task should not be "a humanoid cleans a house." With SO-101, choose a one-arm manipulation task that still points toward your target domain. If you are building for electronics factories, collect on jumper wires, PCB spacers, small plastic bins, or tray sorting. If you are building logistics workflows, collect on parcel-like objects, labels, and simple sorting. If you are building education products, collect LEGO or block pick-place with bilingual instructions.

A good first-loop task has four properties:

Criterion	Why it matters
Reset takes under 15 seconds	You can collect 50-200 episodes without exhausting the operator
Success is easy to judge	You do not need a complex annotation pipeline
It has 3-5 natural variations	Position, color, object, lighting, instruction
It connects to the product	The data remains useful after the demo

Good starter tasks:

Task	Variations to collect	Should you collect your own?
Pick a cube into a bowl	5 cube positions, 2 colors, 2 bowls	Yes, because your embodiment and camera differ
Sort small parts into trays	Part type, tray layout, clutter, lighting	Yes, if this is your product domain
Open a mini drawer	Handle pose, pulling force, camera angle	Yes, if contact behavior matters
Stack blocks by color	Colors, language prompts, positions	Use public data first, then collect a small target set
Wipe a table	Cloth, surface, force, coverage	Not yet, unless you can measure coverage

A common mistake is choosing a task because it will make a nice video, not because it supports a product decision. A robot picking candy may look good, but if your customer needs inspection or part handling, that dataset is mostly a tooling exercise, not a moat.

LeRobotDataset v2: store data so other models can read it

LeRobot is moving toward dataset v3, which reduces file count, improves streaming, and modernizes metadata. But many important stacks still use or support v2/v2.1. NVIDIA GR00T currently uses a LeRobot v2 flavor and adds meta/modality.json; its data preparation guide explicitly says v3 datasets should be converted to v2 for the current workflow. For a small team, the safest strategy is:

Record with current LeRobot tooling.
Keep a canonical dataset that can be converted.
If the target is GR00T or another v2 stack, export a clean v2.1 dataset.
If the target is newer LeRobot/SmolVLA streaming, also keep a v3 copy when useful.

A typical v2.1 layout:

my_dataset/
  data/
    chunk-000/
      episode_000000.parquet
      episode_000001.parquet
  videos/
    chunk-000/
      observation.images.front/
        episode_000000.mp4
        episode_000001.mp4
      observation.images.wrist/
        episode_000000.mp4
        episode_000001.mp4
  meta/
    info.json
    stats.json              (v2.0 only; v2.1 keeps per-episode statistics in episodes_stats.jsonl)
    episodes.jsonl
    episodes_stats.jsonl
    tasks.jsonl

Minimum fields to check in each Parquet episode:

Field	Meaning	Common failure
`observation.state`	Current robot state	Wrong joint order or inconsistent scale
`action`	Next command or target delta	Mixing absolute and delta actions
`timestamp`	Frame time	Camera and action streams drift
`frame_index`	Index inside episode	Reset does not start from 0
`episode_index`	Episode ID	Duplicate IDs after merging datasets
`task_index`	Mapping to task/prompt	Task labels are wrong or too generic
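
The checks in this table can be automated before any training run. The following is a minimal sketch, assuming each episode has already been loaded as a list of per-frame dicts (for example via `pandas.read_parquet(path).to_dict("records")`). The function name `validate_episode` and the `tolerance` parameter are hypothetical, not part of LeRobot.

```python
REQUIRED_FIELDS = (
    "observation.state", "action", "timestamp",
    "frame_index", "episode_index", "task_index",
)

def validate_episode(rows, fps, tolerance=0.5):
    """Return a list of human-readable problems found in one episode.

    rows: list of dicts, one per frame, keyed by the field names above.
    fps: nominal recording rate, used to check timestamp spacing.
    tolerance: allowed deviation from 1/fps, as a fraction of the frame period.
    """
    if not rows:
        return ["episode is empty"]
    missing = [f for f in REQUIRED_FIELDS if f not in rows[0]]
    if missing:
        return [f"missing fields: {missing}"]

    problems = []
    start = rows[0]["frame_index"]
    if start != 0:
        problems.append("frame_index does not start at 0")
    expected = list(range(start, start + len(rows)))
    if [r["frame_index"] for r in rows] != expected:
        problems.append("frame_index is not contiguous")
    if len({r["episode_index"] for r in rows}) != 1:
        problems.append("more than one episode_index in the file")

    # State and action vectors should keep the same length in every frame.
    state_dim = len(rows[0]["observation.state"])
    action_dim = len(rows[0]["action"])
    if any(len(r["observation.state"]) != state_dim
           or len(r["action"]) != action_dim for r in rows):
        problems.append("state or action dimension changes inside the episode")

    # Consecutive timestamps should be roughly one frame period apart.
    period = 1.0 / fps
    for prev, cur in zip(rows, rows[1:]):
        dt = cur["timestamp"] - prev["timestamp"]
        if abs(dt - period) > tolerance * period:
            problems.append(
                f"irregular timestamp step {dt:.3f}s at frame {cur['frame_index']}"
            )
            break
    return problems

# Synthetic frames for illustration only.
good = [
    {"observation.state": [0.0] * 6, "action": [0.0] * 6, "timestamp": i / 30,
     "frame_index": i, "episode_index": 0, "task_index": 0}
    for i in range(5)
]
bad = [dict(r, frame_index=r["frame_index"] + 1) for r in good]

print(validate_episode(good, fps=30))
print(validate_episode(bad, fps=30))
```

This catches structural problems only. Wrong joint order or mixed absolute and delta actions still need a human to look at the values.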

If you use GR00T, add modality metadata to describe state/action/video names. The idea is simple: the model needs to know what the action vector means, which camera is the ego view, which camera is the front view, and where language annotations live.

A good tasks.jsonl is concrete:

{"task_index": 0, "task": "Pick the red cube and place it in the white bowl"}
{"task_index": 1, "task": "Pick the blue cube and place it in the white bowl"}
{"task_index": 2, "task": "Move the small spacer from the left tray to the right tray"}

If you eventually want Vietnamese instructions, keep bilingual metadata, but do not break the main field expected by your training script. A safe pattern is to keep English prompts for training and add task_vi as separate metadata or in the dataset card.
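
A small loader can enforce these rules before a dataset is shared. This is a sketch under assumptions: the four-word minimum is an arbitrary heuristic for "too generic", and the `task_vi` sidecar dictionary is one hypothetical way to keep Vietnamese prompts outside the main `task` field.

```python
import json

def load_tasks(lines):
    """Parse tasks.jsonl lines; reject duplicate indices and very short prompts."""
    tasks = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        index = record["task_index"]
        prompt = record["task"].strip()
        if index in tasks:
            raise ValueError(f"line {line_no}: duplicate task_index {index}")
        if len(prompt.split()) < 4:  # arbitrary threshold, tune for your prompts
            raise ValueError(f"line {line_no}: prompt too generic: {prompt!r}")
        tasks[index] = prompt
    return tasks

example = [
    '{"task_index": 0, "task": "Pick the red cube and place it in the white bowl"}',
    '{"task_index": 1, "task": "Pick the blue cube and place it in the white bowl"}',
]
tasks = load_tasks(example)

# Hypothetical sidecar for Vietnamese prompts, kept out of tasks.jsonl so the
# main "task" field stays exactly what the training script expects.
task_vi = {0: "Nhặt khối lập phương màu đỏ và đặt vào bát trắng"}
orphaned = set(task_vi) - set(tasks)

print(sorted(tasks))
print("translations without a matching task:", sorted(orphaned))
```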

Recording the first 50 episodes

The SmolVLA docs recommend recording about 50 episodes as a starting point. Their pick-place reference uses 50 episodes across 5 cube positions, with 10 episodes per position. This is a useful starting number for small teams. It is enough to test the loop, but small enough that a bad camera setup does not waste a week.

A beginner-friendly plan:

Batch	Content	Goal
0	5 test episodes	Check video, state, action, resets
1	25 clean episodes	Training smoke test
2	25 more episodes with clear variation	First fine-tuned policy
3	20 held-out eval episodes	Never train on these

For a "pick cube into bowl" task:

5 cube positions: left, right, near, far, center
2 cube colors: red, blue
1 fixed white bowl
10 episodes for each main position
20 eval episodes: slight position shifts and different lighting, excluded from training
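
One way to turn this plan into a recording checklist is to generate the episode order up front. The sketch below assumes the two colors are balanced within each position (5 episodes per position and color). The names `build_schedule` and `EPISODES_PER_POSITION` are illustrative.

```python
import itertools
import random

POSITIONS = ["left", "right", "near", "far", "center"]
COLORS = ["red", "blue"]
EPISODES_PER_POSITION = 10

def build_schedule(seed=0):
    """Return a shuffled list of (position, color, prompt) for 50 training episodes."""
    per_cell = EPISODES_PER_POSITION // len(COLORS)  # 5 per position/color pair
    schedule = []
    for position, color in itertools.product(POSITIONS, COLORS):
        prompt = f"Pick the {color} cube and place it in the white bowl"
        schedule.extend([(position, color, prompt)] * per_cell)
    random.Random(seed).shuffle(schedule)  # fixed seed keeps the order reproducible
    return schedule

schedule = build_schedule()
print(len(schedule), "episodes planned")
for position, color, _ in schedule[:3]:
    print(position, color)
```

Shuffling is a design choice rather than a requirement. It keeps slow changes such as operator fatigue or daylight from lining up with a single cube position.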

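The exact record command changes as LeRobot evolves, but the workflow looks like this. Recent versions expect one task prompt per recording session (--dataset.single_task), so if your plan uses several prompts, record one session per prompt and adjust --dataset.num_episodes accordingly:

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=vnrobo_so101_follower_01 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=vnrobo_so101_leader_01 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --dataset.num_episodes=50 \
  --dataset.single_task="Pick the red cube and place it in the white bowl" \
  --display_data=true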

After recording, manually inspect at least 10% of episodes. Do not rely only on training loss. Open the videos and ask:

Check	Question
Frame	Can the camera see the gripper and object at the decisive moment?
Timing	Are actions aligned with frames, or delayed by half a second?
Reset	Does each episode start from a consistent state?
Success	Did the demonstration actually complete the task?
Variation	Are the 50 episodes meaningfully different?
Safety	Did the arm collide, stall, or move unpredictably?

A small clean dataset beats a larger dataset where every episode requires guessing what happened.
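
To avoid always reviewing the same easy episodes, the 10% sample can be drawn at random with a fixed seed. This is a minimal sketch: `sample_for_review` is a hypothetical helper, and the printed paths assume the v2.1 layout and `front` camera key used earlier in this post.

```python
import math
import random

def sample_for_review(num_episodes, fraction=0.10, seed=0):
    """Pick a reproducible random subset of episode indices to watch by hand."""
    count = max(1, math.ceil(num_episodes * fraction))
    return sorted(random.Random(seed).sample(range(num_episodes), count))

to_review = sample_for_review(50)
for index in to_review:
    print(f"videos/chunk-000/observation.images.front/episode_{index:06d}.mp4")
```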

Fine-tune from pretrained checkpoints, do not train from scratch

For a small team, training from scratch is almost always the wrong first move. You do not have enough data to teach vision, language, action priors, and contact dynamics from zero. Start with a pretrained checkpoint, fine-tune for a modest number of steps, measure failures, and then decide whether to collect more data or change the model.

Three practical options:

Model	When to use it	What to remember
SmolVLA	SO-101, small tasks, moderate hardware	Lightweight, LeRobot-native, easy to try
π0/OpenPI	Stronger VLA experiments, good GPU, custom platform	Base model is intended for small-to-medium fine-tuning
GR00T N1/N1.x	Humanoid or dual-arm workflows, NVIDIA Isaac/Jetson alignment	Requires GR00T LeRobot v2 schema and modality config

Example SmolVLA fine-tuning:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/so101_cube_bowl_smolvla \
  --job_name=so101_cube_bowl_smolvla \
  --policy.device=cuda \
  --wandb.enable=true

Example π0 through LeRobot:

lerobot-train \
  --policy.path=lerobot/pi0 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --output_dir=outputs/train/so101_cube_bowl_pi0

Example GR00T workflow:

# 1. Convert or validate the dataset as LeRobot v2
# 2. Add meta/modality.json using the GR00T schema
# 3. Fine-tune with a matching embodiment/action config
python scripts/launch_finetune.py \
  --dataset-path /data/so101_cube_bowl_v1 \
  --modality-config configs/so101_modality.json \
  --output-dir outputs/groot_so101_cube_bowl

Treat these as workflow templates, not immutable commands. LeRobot and GR00T CLI details change quickly. The durable principles are: standard dataset, pretrained policy, explicit embodiment config, and evaluation data separated from training data.

A 7-day mini data flywheel

Do not wait a month before training. The flywheel should run weekly:

Day	Work
1	Assemble, calibrate, record 5 test episodes
2	Record 25 episodes, validate the dataset, run a smoke test
3	Record 25 more episodes with clear variation, fine-tune a checkpoint
4	Run 20 unseen eval trials, log failure taxonomy
5	Collect 30-50 episodes targeting the top failure mode
6	Fine-tune again and compare against the baseline
7	Decide whether to scale the task, change cameras, change objects, or stop

Keep the failure taxonomy simple:

PERCEPTION: cannot see object, wrong color, glare, occlusion
LOCALIZATION: sees object but reaches the wrong pose
GRASP: approaches correctly but slips
CONTACT: hits table, pushes object away, gripper jams
LANGUAGE: misunderstands instruction
RECOVERY: drifts away from demonstration and cannot recover
HARDWARE: weak servo, backlash, vibration, bad calibration

After each evaluation, do not ask "how many more episodes do we need?" Ask:

What is the biggest failure class?
Does it need new data, hardware repair, camera changes, or another model?

If the failure is glare, 500 more demonstrations under the same lighting will not solve it. You need different lighting, glossy objects in training, or a camera change. If the gripper slips because of mechanical backlash, data cannot fix the mechanism.
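
Tallying eval outcomes by failure class makes the "biggest failure class" question mechanical. In the sketch below the trial results are made up, and the `FIRST_REMEDY` mapping is an illustrative starting point rather than a rule from any library or benchmark.

```python
from collections import Counter

# Hypothetical first remedy to consider for each class in the taxonomy above.
FIRST_REMEDY = {
    "PERCEPTION": "change lighting or camera, add visually varied data",
    "LOCALIZATION": "collect more position variation",
    "GRASP": "collect grasp-focused demos and inspect the gripper",
    "CONTACT": "collect contact-focused demos and check the workspace",
    "LANGUAGE": "make prompts more concrete",
    "RECOVERY": "collect demonstrations that start from off-nominal states",
    "HARDWARE": "repair or recalibrate before collecting more data",
}

def summarize_failures(trials):
    """trials: list of (success, failure_class or None). Returns (rate, Counter)."""
    failures = Counter(label for ok, label in trials if not ok)
    unknown = set(failures) - set(FIRST_REMEDY)
    if unknown:
        raise ValueError(f"unknown failure classes: {sorted(unknown)}")
    success_rate = sum(1 for ok, _ in trials if ok) / len(trials)
    return success_rate, failures

# Made-up results for 20 eval trials.
trials = (
    [(True, None)] * 12
    + [(False, "PERCEPTION")] * 5
    + [(False, "GRASP")] * 2
    + [(False, "HARDWARE")]
)
rate, failures = summarize_failures(trials)
top_class, top_count = failures.most_common(1)[0]

print(f"success {rate:.0%}, top failure {top_class} ({top_count}/{len(trials)})")
print("first remedy to consider:", FIRST_REMEDY[top_class])
```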

When should you use public data?

Hugging Face Hub already hosts many LeRobot community datasets. SmolVLA was pretrained on LeRobot community datasets, and the lerobot/svla_so101_pickplace reference has 50 episodes, 11,939 frames, 480x640 cameras, and a 6-dimensional action space (five arm joints plus the gripper). Public data is useful for three things:

Purpose	How to use public data
Learn the toolchain	Load, train, and deploy before recording your own data
Add pretraining/fine-tuning signal	Mix similar datasets if robot, camera, and action spaces are compatible
Benchmark your setup	Compare simple tasks to detect hardware or logging problems

But public data does not automatically solve your embodiment. Different camera placement, gripper geometry, table height, object set, or action normalization can break a policy. For small teams, the better rule is:

Use public data to learn priors and debug tooling.
Use private target data to teach your camera, robot, objects, and workflow.
Contribute public data when it is safe, so the community can inspect and build on it.

Read dataset cards as if they were contracts:

Check	Why it matters
License	Can you use it commercially?
Robot type	SO-100, SO-101, ALOHA, Franka, or custom?
Action representation	Joint target, EEF delta, binary gripper, continuous gripper?
Camera keys	Do `front`, `wrist`, and `side` match your model config?
FPS	5 Hz, 10 Hz, and 30 Hz change action chunking
Task label	Are prompts concrete or vague?
Quality	Were failed demonstrations filtered?
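
Several of these checks can be run against the `meta/info.json` of a public dataset and your own. The sketch below assumes both files have been loaded into dicts with `robot_type`, `fps`, and a `features` mapping whose entries carry a `shape`. The example values are invented, and `compare_info` is a hypothetical helper.

```python
def camera_keys(features):
    """Collect feature names that look like camera streams."""
    return {k for k in features if k.startswith("observation.images.")}

def compare_info(public, ours):
    """List mismatches between two info.json-style dicts that usually matter."""
    issues = []
    for key in ("robot_type", "fps"):
        if public.get(key) != ours.get(key):
            issues.append(f"{key}: public={public.get(key)!r}, ours={ours.get(key)!r}")
    for key in ("action", "observation.state"):
        pub_shape = list(public["features"][key]["shape"])
        our_shape = list(ours["features"][key]["shape"])
        if pub_shape != our_shape:
            issues.append(f"{key} shape: public={pub_shape}, ours={our_shape}")
    pub_cams = camera_keys(public["features"])
    our_cams = camera_keys(ours["features"])
    if pub_cams != our_cams:
        issues.append(
            f"camera keys: only public={sorted(pub_cams - our_cams)}, "
            f"only ours={sorted(our_cams - pub_cams)}"
        )
    return issues

# Invented metadata for illustration.
public_info = {
    "robot_type": "so100", "fps": 30,
    "features": {
        "action": {"shape": [6]},
        "observation.state": {"shape": [6]},
        "observation.images.front": {"shape": [480, 640, 3]},
        "observation.images.wrist": {"shape": [480, 640, 3]},
    },
}
our_info = {
    "robot_type": "so101", "fps": 30,
    "features": {
        "action": {"shape": [6]},
        "observation.state": {"shape": [6]},
        "observation.images.front": {"shape": [480, 640, 3]},
    },
}

issues = compare_info(public_info, our_info)
for issue in issues:
    print(issue)
```

Matching shapes do not prove matching semantics. Two 6-dimensional action vectors can still differ in joint order, units, or absolute versus delta encoding, so the dataset card remains the source of truth.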

If you contribute a dataset to Hugging Face, write a useful dataset card. Include robot, cameras, FPS, action space, episode count, frame count, tasks, splits, known failures, license, and reproduction instructions. This is how a small 50-episode dataset becomes a community artifact instead of an anonymous zip file.

Decision checklist: collect your own or use public data?

Use this checklist per task, not per company.

Question	If yes	Decision
Is the task directly tied to customer workflow or product IP?	Yes	Collect target data
Is your camera/robot/gripper different from public setups?	Yes	Collect at least 50-200 target episodes
Are the objects specific to your factory/domain?	Yes	Collect your own data
Does the task involve force, insertion, slipping, or jamming?	Yes	Collect and log failures
Is there a public dataset with the same robot, camera, and task?	Yes	Use it as a baseline first
Is the task only for learning the toolchain?	Yes	Use public data, avoid over-collection
Does the task lack a clear success metric?	Yes	Do not collect yet; define the metric
Is the hardware still unstable?	Yes	Stop large collection; fix calibration
Are you planning to train from scratch "for cleanliness"?	Yes	Do not; fine-tune a checkpoint first
Can the dataset be published without exposing secrets?	Yes	Consider contributing it

The practical rule:

Use public data for common skills.
Collect your own data for embodiment, domain, failures, and workflows only you have.

Examples:

Task	Public first?	Collect later?
Cube pick-place	Yes	50-100 episodes on your setup
Sort blocks by color	Yes	When camera, lighting, or prompts differ
Pick electronics components	Only as warm start	Yes, because objects and domain are specific
Open real product packaging	Maybe as reference	Yes, because contact and packaging vary
Fold towels	Use VLA/humanoid data as reference	Yes, if this is a product task

How much data is enough for the first loop?

Do not treat 50 episodes as a universal law. Treat it as a starting point. From Part 5, the important lesson is that diversity usually beats repetition. For a small team:

Stage	Training episodes	Held-out eval	Goal
Smoke test	10-25	5	Pipeline runs and data loads
First policy	50	20	Robot can perform a simple task
Robust v1	100-200	50	Add 3-5 real variations
Product pilot	300-800	100+	Split by object/environment/operator
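
The held-out column matters because a success rate measured on a few trials is a coarse estimate. As a rough illustration, the sketch below computes a Wilson score interval, a standard statistical tool that is not part of this series or of LeRobot, for the same 70% observed success rate at different eval sizes.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Same observed success rate (70%), different numbers of eval trials.
for successes, trials in [(14, 20), (35, 50), (70, 100)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} successes: plausible range {low:.2f} to {high:.2f}")
```

The interval narrows as the eval set grows, which is one reason to compare two checkpoints on the same trials rather than trusting a small difference in headline success rate.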

The split matters more than the episode count:

Train: known variations
Validation: same distribution, used to catch overfitting
Test unseen: object, pose, light, or operator not used in training
Holdout customer-like: closest to the real use case, never touched during tuning

If you repeatedly inspect the test set and collect exactly those failures into training, the test set has become part of training. Keep a small but serious holdout.
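
A split by condition, rather than a random split, is what keeps the "test unseen" set honest. This is a minimal sketch assuming each episode has metadata such as lighting or operator recorded alongside it. The field names and the `split_by_condition` helper are hypothetical.

```python
def split_by_condition(episodes, holdout_key, holdout_values):
    """Send every episode whose metadata matches a held-out value to the test split."""
    train, test = [], []
    for episode in episodes:
        target = test if episode[holdout_key] in holdout_values else train
        target.append(episode)
    return train, test

# Made-up episode metadata for illustration.
conditions = [
    ("lamp", "A"), ("lamp", "B"), ("window", "A"),
    ("window", "B"), ("lamp", "A"), ("dim", "B"),
]
episodes = [
    {"episode_index": i, "lighting": lighting, "operator": operator}
    for i, (lighting, operator) in enumerate(conditions)
]

train, test = split_by_condition(episodes, "lighting", {"window", "dim"})
train_lighting = {episode["lighting"] for episode in train}
test_lighting = {episode["lighting"] for episode in test}

print(len(train), "train episodes under", sorted(train_lighting))
print(len(test), "test episodes under", sorted(test_lighting))
```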

A dataset card template for small teams

A good dataset card can be short, but it must be informative:

# vnrobo/so101_cube_bowl_v1

## Summary
SO-101 leader-follower demonstrations for cube-to-bowl pick-place.

## Robot
- Follower: SO-101, 6 STS3215 servos (5 arm joints + gripper)
- Teleop: SO-101 leader
- Cameras: front USB camera, 640x480, 30 FPS
- Action: 6-dimensional joint position targets (5 arm joints + gripper)

## Tasks
- Pick the red cube and place it in the white bowl
- Pick the blue cube and place it in the white bowl

## Dataset
- Train episodes: 100
- Eval episodes: 30
- FPS: 30
- Format: LeRobotDataset v2.1
- Known failures: glare on glossy cube, occasional gripper slip

## License
Apache-2.0 or CC-BY-4.0, depending on your business policy.

Do not publish data containing faces, customer information, production-line secrets, internal logos, or objects under NDA. If you want to share, recreate an equivalent task with neutral objects.

Conclusion: the small-team moat is the loop

Small teams do not win through raw data volume. They win by choosing the right task, recording with the right schema, fine-tuning the right pretrained checkpoint, evaluating honestly, and collecting the next batch against the real failure mode. LeRobot + SO-101 makes the first loop inexpensive. LeRobotDataset v2/v3 keeps the data portable. SmolVLA, π0, and GR00T let you reuse foundation models instead of training from zero. Hugging Face community datasets help you learn quickly and contribute back.

A good data strategy for a robotics startup should be this explicit:

We are not collecting "robot data" in general.
We are collecting evidence that a pretrained policy can perform a real workflow,
on our real embodiment, under real variations the customer will see.

The final post in the series will zoom out to strategy: in the humanoid data war, should small teams bet on open datasets, closed proprietary data, or a hybrid model?



Nguyễn Anh Tuấn
