Data Strategy: What Should Small Teams Collect?

A practical playbook for small robotics teams: SO-101 data, LeRobotDataset v2, pretrained VLA fine-tuning, and HF datasets.

Nguyễn Anh Tuấn · June 12, 2026 · 18 min read

Why Part 6 is for small teams

The first five posts in this series covered the ownership map, teleoperation, human video mining, synthetic pipelines, and scaling laws for VLA data. If you are joining here, start with Part 1: the humanoid data war landscape, Part 2: teleoperation, and Part 5: VLA data scaling. This post answers a more operational question: if you are a small robotics team, without a fleet of 100 robots, what should you collect first?

The short answer: do not try to win the raw episode-count race. A small team should build a mini data flywheel: choose one narrow but real task, collect a small amount of clean data with inexpensive hardware, standardize it as a LeRobotDataset, fine-tune from a pretrained checkpoint, evaluate real failures, and collect the next batch only where the robot actually fails. That loop matters more than claiming "10,000 demos" from a single lab setup.

The good news is that the open robotics ecosystem has lowered the entry barrier. LeRobot provides tooling for robot control, datasets, training, and inference. SO-101 is a low-cost 6-DOF arm that can be built or bought as a kit and used with leader-follower teleoperation. SmolVLA is designed for fine-tuning on LeRobot datasets. OpenPI/π0 and NVIDIA Isaac GR00T push the same practical idea: start from a foundation model, adapt it to your embodiment and task, and avoid training from scratch unless you have a very strong reason.

Figure: Robot arm data collection desk

The right mindset: raw footage is not a data asset

For a startup, robot data becomes an asset only when it helps answer one of three questions:

| Question | Example |
| --- | --- |
| Can our robot perform the customer workflow? | Pick parts from a tray, place boxes on a conveyor, open a drawer |
| Does the policy generalize across real variation? | Lighting, camera pose, objects, position, clutter |
| Can the dataset be reused for future models? | Standard schema, clear metadata, clean license, synchronized video/state/action |

A folder of videos showing an operator driving a robot is not yet a data asset. It becomes one when it has timestamps, observations, actions, states, task labels, episode boundaries, train/eval splits, robot metadata, and a way to load everything back in code. This is why LeRobotDataset matters. You are buying optionality: today you may train ACT or SmolVLA, six months from now you may try π0 or GR00T, and a year from now you may merge your dataset with community data.

A small-team data strategy should start with a concrete sentence:

In the next 30 days, what task should the robot learn, on what setup,
with what success criterion, and what decision will this dataset support?

If you cannot answer that, do not start collecting data yet. Define the task first.

Minimum setup: LeRobot + SO-101 in the $100-300 range

SO-101 is the next-generation version of SO-100 developed by The Robot Studio and Hugging Face. The official docs describe a follower arm with 6 STS3215 motors, a leader arm with different gear ratios so the operator can move it smoothly, and a LeRobot workflow for recording datasets. The actual cost depends on whether you print the parts yourself, buy a kit, already own power supplies, and what camera you use. The "$100" number usually refers to the basic arm hardware if you source parts well. For a data collection setup, budget more realistically:

| Item | Purpose | Rough budget |
| --- | --- | --- |
| SO-101 follower arm | Robot that executes actions | $100-180 if well sourced, higher as a kit |
| SO-101 leader arm | Consistent teleoperation | $100-180 |
| USB camera or webcam | Main visual observation | $20-60 |
| Table, clamps, power, cables, objects | Reliable resets and fewer mechanical failures | $30-80 |
| CUDA PC or cloud GPU | Fine-tuning experiments | Depends on usage |

If you are extremely budget constrained, you can start with a follower arm and simpler teleoperation scripts. But leader-follower collection usually produces cleaner data. The operator moves a physical arm that mirrors the follower, demonstrations are more natural, and the action logs are easier to interpret.
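To make the budget concrete, the rows of the table above can be summed for both options. This is only arithmetic on the rough ranges quoted in this post (compute excluded); the dictionary keys are illustrative names, not part of any tool.

```python
# Rough budget ranges in USD, copied from the table above (compute excluded).
BUDGET = {
    "follower_arm": (100, 180),
    "leader_arm": (100, 180),
    "camera": (20, 60),
    "table_clamps_power_cables_objects": (30, 80),
}


def total(items):
    """Sum the low and high ends of the selected budget rows."""
    low = sum(BUDGET[name][0] for name in items)
    high = sum(BUDGET[name][1] for name in items)
    return low, high


full_setup = total(BUDGET)  # leader-follower teleoperation
follower_only = total([n for n in BUDGET if n != "leader_arm"])

print(f"Leader-follower setup: ${full_setup[0]}-{full_setup[1]}")
print(f"Follower-only setup:   ${follower_only[0]}-{follower_only[1]}")
```

With these ranges, a follower-only desk lands at roughly $150-320 and a full leader-follower desk at roughly $250-500, before any GPU cost.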

A useful mini setup should reserve space for three camera zones even if you initially use only one or two:

| Camera | Required? | When it matters |
| --- | --- | --- |
| Front/global camera | Yes | Almost all pick-place, sorting, and simple insertion tasks |
| Wrist/near-gripper camera | Recommended | Small grasps, occlusion, objects with similar colors |
| Side camera | Optional | Debugging contact, height, and collisions |

Do not add cameras before your logging is stable. One synchronized camera is better than three misaligned streams.
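One way to check whether a stream is "synchronized" is to measure how far each action timestamp sits from its nearest camera frame. The sketch below is a minimal, standard-library illustration. The function name and the half-frame tolerance are assumptions made for this example, not a LeRobot API or an official threshold.

```python
from bisect import bisect_left


def max_sync_gap(camera_ts, action_ts):
    """Return the worst gap (seconds) between each action timestamp
    and its nearest camera frame timestamp. camera_ts must be sorted."""
    if not camera_ts:
        raise ValueError("camera_ts must not be empty")
    worst = 0.0
    for t in action_ts:
        i = bisect_left(camera_ts, t)
        # Candidates: the frame just before and the frame just after t.
        neighbors = camera_ts[max(i - 1, 0): i + 1]
        gap = min(abs(t - c) for c in neighbors)
        worst = max(worst, gap)
    return worst


fps = 30
camera_ts = [k / fps for k in range(90)]          # ideal 30 FPS camera
action_ts = [k / fps + 0.004 for k in range(90)]  # actions lag by 4 ms
gap = max_sync_gap(camera_ts, action_ts)
tolerance = 0.5 / fps  # assumed rule of thumb: half a frame period
print(f"worst gap {gap * 1000:.1f} ms, within tolerance: {gap <= tolerance}")
```

Running the same check per camera before adding a second or third stream gives a simple pass/fail signal for the logging setup.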

Choose the first task: narrow, repeatable, commercially relevant

The first task should not be "a humanoid cleans a house." With SO-101, choose a one-arm manipulation task that still points toward your target domain. If you are building for electronics factories, collect on jumper wires, PCB spacers, small plastic bins, or tray sorting. If you are building logistics workflows, collect on parcel-like objects, labels, and simple sorting. If you are building education products, collect LEGO or block pick-place with bilingual instructions.

A good first-loop task has four properties:

| Criterion | Why it matters |
| --- | --- |
| Reset takes under 15 seconds | You can collect 50-200 episodes without exhausting the operator |
| Success is easy to judge | You do not need a complex annotation pipeline |
| It has 3-5 natural variations | Position, color, object, lighting, instruction |
| It connects to the product | The data remains useful after the demo |
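The reset criterion is easy to underestimate, so here is a small worked estimate of operator time per session. The 10-second demonstration length and the 60-second "slow reset" case are hypothetical numbers chosen only for comparison.

```python
def session_minutes(episodes, episode_s, reset_s):
    """Total operator time in minutes for one recording session."""
    return episodes * (episode_s + reset_s) / 60


# Hypothetical 10 s demonstrations; only the reset time varies.
for reset_s in (15, 60):
    for episodes in (50, 200):
        minutes = session_minutes(episodes, episode_s=10, reset_s=reset_s)
        print(f"{episodes:>3} episodes, {reset_s:>2} s reset: {minutes:.0f} min")
```

Under these assumptions, 200 episodes take roughly 83 minutes with a 15-second reset and roughly 233 minutes with a 60-second reset, which is the difference between one sitting and most of a working day.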

Good starter tasks:

| Task | Variations to collect | Should you collect your own? |
| --- | --- | --- |
| Pick a cube into a bowl | 5 cube positions, 2 colors, 2 bowls | Yes, because your embodiment and camera differ |
| Sort small parts into trays | Part type, tray layout, clutter, lighting | Yes, if this is your product domain |
| Open a mini drawer | Handle pose, pulling force, camera angle | Yes, if contact behavior matters |
| Stack blocks by color | Colors, language prompts, positions | Use public data first, then collect a small target set |
| Wipe a table | Cloth, surface, force, coverage | Not yet, unless you can measure coverage |

A common mistake is choosing a task because it will make a nice video, not because it supports a product decision. A robot picking candy may look good, but if your customer needs inspection or part handling, that dataset is mostly a tooling exercise, not a moat.

LeRobotDataset v2: store data so other models can read it

LeRobot is moving toward dataset v3, which reduces file count, improves streaming, and modernizes metadata. But many important stacks still use or support v2/v2.1. NVIDIA GR00T currently uses a LeRobot v2 flavor and adds meta/modality.json; its data preparation guide explicitly says v3 datasets should be converted to v2 for the current workflow. For a small team, the safest strategy is:

Record with current LeRobot tooling.
Keep a canonical dataset that can be converted.
If the target is GR00T or another v2 stack, export a clean v2.1 dataset.
If the target is newer LeRobot/SmolVLA streaming, also keep a v3 copy when useful.

A typical v2.1 layout:

my_dataset/
  data/
    chunk-000/
      episode_000000.parquet
      episode_000001.parquet
  videos/
    chunk-000/
      observation.images.front/
        episode_000000.mp4
        episode_000001.mp4
      observation.images.wrist/
        episode_000000.mp4
        episode_000001.mp4
  meta/
    info.json
    stats.json              (global stats from v2.0; v2.1 stores per-episode stats in episodes_stats.jsonl)
    episodes.jsonl
    episodes_stats.jsonl
    tasks.jsonl

Minimum fields to check in each Parquet episode:

| Field | Meaning | Common failure |
| --- | --- | --- |
| observation.state | Current robot state | Wrong joint order or inconsistent scale |
| action | Next command or target delta | Mixing absolute and delta actions |
| timestamp | Frame time | Camera and action streams drift |
| frame_index | Index inside episode | Reset does not start from 0 |
| episode_index | Episode ID | Duplicate IDs after merging datasets |
| task_index | Mapping to task/prompt | Task labels are wrong or too generic |
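Several of the failures in this table can be caught with a few lines of code before any training run. The sketch below uses only the standard library and works on a list of per-frame dicts. The function name and the default dimension of 6 are assumptions for a six-motor SO-101, not part of LeRobot. If pandas is installed, the rows for one episode could be obtained with `pandas.read_parquet(path).to_dict("records")`.

```python
REQUIRED = ("observation.state", "action", "timestamp",
            "frame_index", "episode_index", "task_index")


def check_episode(rows, state_dim=6, action_dim=6):
    """Return a list of problems found in one episode.

    `rows` is a list of dicts, one per frame, keyed by column name.
    The default dimensions are an assumption for a 6-motor SO-101.
    """
    if not rows:
        return ["episode is empty"]
    missing = [k for k in REQUIRED if k not in rows[0]]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if [r["frame_index"] for r in rows] != list(range(len(rows))):
        problems.append("frame_index does not count up from 0")
    if len({r["episode_index"] for r in rows}) != 1:
        problems.append("more than one episode_index in the file")
    ts = [r["timestamp"] for r in rows]
    if any(b <= a for a, b in zip(ts, ts[1:])):
        problems.append("timestamps are not strictly increasing")
    if any(len(r["observation.state"]) != state_dim for r in rows):
        problems.append("observation.state has an unexpected length")
    if any(len(r["action"]) != action_dim for r in rows):
        problems.append("action has an unexpected length")
    return problems


# Synthetic three-frame episode, used only to exercise the checks.
good = [{"observation.state": [0.0] * 6, "action": [0.0] * 6,
         "timestamp": i / 30, "frame_index": i,
         "episode_index": 0, "task_index": 0} for i in range(3)]
bad = [dict(row, frame_index=row["frame_index"] + 1) for row in good]

print(check_episode(good))
print(check_episode(bad))
```

These checks do not detect wrong joint order or mixed absolute and delta actions; those still need a visual replay or a plot of the action values against the video.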

If you use GR00T, add modality metadata to describe state/action/video names. The idea is simple: the model needs to know what the action vector means, which camera is the ego view, which camera is the front view, and where language annotations live.

A good tasks.jsonl is concrete:

{"task_index": 0, "task": "Pick the red cube and place it in the white bowl"}
{"task_index": 1, "task": "Pick the blue cube and place it in the white bowl"}
{"task_index": 2, "task": "Move the small spacer from the left tray to the right tray"}

If you eventually want Vietnamese instructions, keep bilingual metadata, but do not break the main field expected by your training script. A safe pattern is to keep English prompts for training and add task_vi as separate metadata or in the dataset card.
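A minimal sketch of that pattern: keep one source list with both languages, emit the training file with only the fields the training script expects, and write the Vietnamese prompts to a sidecar file. The sidecar name `tasks_vi.jsonl` and the Vietnamese strings are illustrative assumptions, not a LeRobot convention.

```python
import json

# Illustrative entries; the Vietnamese strings are example translations.
TASKS = [
    {"task_index": 0,
     "task": "Pick the red cube and place it in the white bowl",
     "task_vi": "Gắp khối lập phương màu đỏ và đặt vào bát màu trắng"},
    {"task_index": 1,
     "task": "Pick the blue cube and place it in the white bowl",
     "task_vi": "Gắp khối lập phương màu xanh dương và đặt vào bát màu trắng"},
]


def training_lines(tasks):
    """JSONL lines holding only the fields the training script expects."""
    return [json.dumps({"task_index": t["task_index"], "task": t["task"]})
            for t in tasks]


def sidecar_lines(tasks):
    """JSONL lines for a separate file that carries the Vietnamese prompts."""
    return [json.dumps({"task_index": t["task_index"], "task_vi": t["task_vi"]},
                       ensure_ascii=False)
            for t in tasks]


tasks_jsonl = "\n".join(training_lines(TASKS)) + "\n"        # meta/tasks.jsonl
tasks_vi_jsonl = "\n".join(sidecar_lines(TASKS)) + "\n"      # hypothetical tasks_vi.jsonl
print(tasks_jsonl, end="")
```

Because both files share `task_index`, the Vietnamese prompt can be joined back at any time without touching the schema the trainer reads.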

Recording the first 50 episodes

The SmolVLA docs recommend recording about 50 episodes as a starting point. Their pick-place reference uses 50 episodes across 5 cube positions, with 10 episodes per position. This is a useful starting number for small teams. It is enough to test the loop, but small enough that a bad camera setup does not waste a week.

A beginner-friendly plan:

| Batch | Content | Goal |
| --- | --- | --- |
| 0 | 5 test episodes | Check video, state, action, resets |
| 1 | 25 clean episodes | Training smoke test |
| 2 | 25 more episodes with clear variation | First fine-tuned policy |
| 3 | 20 held-out eval episodes | Never train on these |

For a "pick cube into bowl" task:

5 cube positions: left, right, near, far, center
2 cube colors: red, blue
1 fixed white bowl
10 episodes for each main position
20 eval episodes: slight position shifts and different lighting, excluded from training
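The training part of this plan can be written down as an explicit schedule so the operator does not have to keep count. The sketch below is one possible way to do it; splitting each position evenly across the two colors and shuffling the order are design choices of this example, not something the SmolVLA docs prescribe.

```python
import random
from collections import Counter
from itertools import product

POSITIONS = ["left", "right", "near", "far", "center"]
COLORS = ["red", "blue"]
REPEATS = 5  # 5 positions x 2 colors x 5 repeats = 50 episodes


def build_schedule(seed=0):
    """Return a shuffled list of (position, color) recording conditions."""
    schedule = [cond for cond in product(POSITIONS, COLORS)
                for _ in range(REPEATS)]
    # Shuffle so that conditions are not recorded in long identical blocks.
    random.Random(seed).shuffle(schedule)
    return schedule


schedule = build_schedule()
per_position = Counter(position for position, _ in schedule)
print(len(schedule), dict(per_position))
```

Shuffling costs some extra reset effort. If resets are slow, recording in small blocks per position is a reasonable compromise, as long as every cell of the grid is still covered.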

The exact record command changes as LeRobot evolves, but the workflow looks like this:

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=vnrobo_so101_follower_01 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=vnrobo_so101_leader_01 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --dataset.single_task="Pick the red cube and place it in the white bowl" \
  --dataset.num_episodes=50 \
  --display_data=true

After recording, manually inspect at least 10% of episodes. Do not rely only on training loss. Open the videos and ask:

| Check | Question |
| --- | --- |
| Frame | Can the camera see the gripper and object at the decisive moment? |
| Timing | Are actions aligned with frames, or delayed by half a second? |
| Reset | Does each episode start from a consistent state? |
| Success | Did the demonstration actually complete the task? |
| Variation | Are the 50 episodes meaningfully different? |
| Safety | Did the arm collide, stall, or move unpredictably? |
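To keep the manual inspection honest, the episodes to watch can be drawn at random rather than picked by hand. This is a small standard-library sketch; the function name and the fixed seed are assumptions of the example.

```python
import math
import random


def review_sample(num_episodes, fraction=0.10, seed=0):
    """Pick a reproducible random subset of episode indices to watch."""
    k = max(1, math.ceil(num_episodes * fraction))
    return sorted(random.Random(seed).sample(range(num_episodes), k))


to_review = review_sample(50)
print(f"watch {len(to_review)} of 50 episodes: {to_review}")
```

A fixed seed means two people reviewing the same dataset look at the same episodes and can compare notes against the checklist above.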

A small clean dataset beats a larger dataset where every episode requires guessing what happened.

Fine-tune from pretrained checkpoints, do not train from scratch

For a small team, training from scratch is almost always the wrong first move. You do not have enough data to teach vision, language, action priors, and contact dynamics from zero. Start with a pretrained checkpoint, fine-tune for a modest number of steps, measure failures, and then decide whether to collect more data or change the model.

Three practical options:

| Model | When to use it | What to remember |
| --- | --- | --- |
| SmolVLA | SO-101, small tasks, moderate hardware | Lightweight, LeRobot-native, easy to try |
| π0/OpenPI | Stronger VLA experiments, good GPU, custom platform | Base model is intended for small-to-medium fine-tuning |
| GR00T N1/N1.x | Humanoid or dual-arm workflows, NVIDIA Isaac/Jetson alignment | Requires GR00T LeRobot v2 schema and modality config |

Example SmolVLA fine-tuning:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/so101_cube_bowl_smolvla \
  --job_name=so101_cube_bowl_smolvla \
  --policy.device=cuda \
  --wandb.enable=true

Example π0 through LeRobot:

lerobot-train \
  --policy.path=lerobot/pi0 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --output_dir=outputs/train/so101_cube_bowl_pi0

Example GR00T workflow:

# 1. Convert or validate the dataset as LeRobot v2
# 2. Add meta/modality.json using the GR00T schema
# 3. Fine-tune with a matching embodiment/action config
python scripts/launch_finetune.py \
  --dataset-path /data/so101_cube_bowl_v1 \
  --modality-config configs/so101_modality.json \
  --output-dir outputs/groot_so101_cube_bowl

Treat these as workflow templates, not immutable commands. LeRobot and GR00T CLI details change quickly. The durable principles are: standard dataset, pretrained policy, explicit embodiment config, and evaluation data separated from training data.

A 7-day mini data flywheel

Do not wait a month before training. The flywheel should run weekly:

| Day | Work |
| --- | --- |
| 1 | Assemble, calibrate, record 5 test episodes |
| 2 | Record 25 episodes, validate the dataset, run a smoke test |
| 3 | Record 25 more episodes with clear variation, fine-tune a checkpoint |
| 4 | Run 20 unseen eval trials, log failure taxonomy |
| 5 | Collect 30-50 episodes targeting the top failure mode |
| 6 | Fine-tune again and compare against the baseline |
| 7 | Decide whether to scale the task, change cameras, change objects, or stop |

Keep the failure taxonomy simple:

PERCEPTION: cannot see object, wrong color, glare, occlusion
LOCALIZATION: sees object but reaches the wrong pose
GRASP: approaches correctly but slips
CONTACT: hits table, pushes object away, gripper jams
LANGUAGE: misunderstands instruction
RECOVERY: drifts away from demonstration and cannot recover
HARDWARE: weak servo, backlash, vibration, bad calibration

After each evaluation, do not ask "how many more episodes do we need?" Ask:

What is the biggest failure class?
Does it need new data, hardware repair, camera changes, or another model?

If the failure is glare, 500 more demonstrations under the same lighting will not solve it. You need different lighting, glossy objects in training, or a camera change. If the gripper slips because of mechanical backlash, data cannot fix the mechanism.
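The taxonomy is small enough to encode directly, which makes "what is the biggest failure class?" a one-line query after each evaluation. The sketch below is a minimal illustration; the trial log is hypothetical data, not a result from a real run.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    PERCEPTION = "perception"
    LOCALIZATION = "localization"
    GRASP = "grasp"
    CONTACT = "contact"
    LANGUAGE = "language"
    RECOVERY = "recovery"
    HARDWARE = "hardware"


def summarize(trials):
    """trials: one entry per eval trial, None for success or a Failure.

    Returns (success_rate, failures sorted from most to least frequent).
    """
    failures = Counter(t for t in trials if t is not None)
    success_rate = (len(trials) - sum(failures.values())) / len(trials)
    return success_rate, failures.most_common()


# Hypothetical log of 20 eval trials.
trials = ([None] * 12 + [Failure.GRASP] * 5
          + [Failure.PERCEPTION] * 2 + [Failure.HARDWARE])
rate, ranked = summarize(trials)
print(f"success rate: {rate:.0%}")
print("biggest failure class:", ranked[0][0].name)
```

The ranking only says where to look. Whether the fix is new data, a hardware repair, a camera change, or another model is still a human decision.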

When should you use public data?

Hugging Face Hub already hosts many LeRobot community datasets. SmolVLA was released as a model that benefits from LeRobot community datasets, and the lerobot/svla_so101_pickplace reference has 50 episodes, 11,939 frames, 480x640 cameras, and a 6-DOF action space. Public data is extremely useful for three things:

| Purpose | How to use public data |
| --- | --- |
| Learn the toolchain | Load, train, and deploy before recording your own data |
| Add pretraining/fine-tuning signal | Mix similar datasets if robot, camera, and action spaces are compatible |
| Benchmark your setup | Compare simple tasks to detect hardware or logging problems |

But public data does not automatically solve your embodiment. Different camera placement, gripper geometry, table height, object set, or action normalization can break a policy. For small teams, the better rule is:

Use public data to learn priors and debug tooling.
Use private target data to teach your camera, robot, objects, and workflow.
Contribute public data when it is safe, so the community can inspect and build on it.

Read dataset cards as if they were contracts:

| Check | Why it matters |
| --- | --- |
| License | Can you use it commercially? |
| Robot type | SO-100, SO-101, ALOHA, Franka, or custom? |
| Action representation | Joint target, EEF delta, binary gripper, continuous gripper? |
| Camera keys | Do front, wrist, and side match your model config? |
| FPS | 5 Hz, 10 Hz, and 30 Hz change action chunking |
| Task label | Are prompts concrete or vague? |
| Quality | Were failed demonstrations filtered? |
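Part of this checklist can be automated by comparing the metadata of your dataset with that of a public one. The sketch below assumes both are available as dicts shaped like `meta/info.json` in a v2.x dataset, with `robot_type`, `fps`, and a `features` mapping that carries a `shape` per feature. The example values, including the `robot_type` string, are illustrative. License, action representation, and data quality still have to be read from the dataset card.

```python
def camera_keys(features):
    """Names of the image streams in a `features` mapping."""
    return {k for k in features if k.startswith("observation.images.")}


def compatibility_issues(mine, public):
    """Compare two info-style dicts and list the mismatches as strings."""
    issues = []
    for field in ("robot_type", "fps"):
        if mine.get(field) != public.get(field):
            issues.append(f"{field}: {mine.get(field)} vs {public.get(field)}")
    for key in ("action", "observation.state"):
        a = list(mine["features"][key]["shape"])
        b = list(public["features"][key]["shape"])
        if a != b:
            issues.append(f"{key} shape: {a} vs {b}")
    cams_a = camera_keys(mine["features"])
    cams_b = camera_keys(public["features"])
    if cams_a != cams_b:
        issues.append(f"camera keys differ: {sorted(cams_a ^ cams_b)}")
    return issues


# Illustrative metadata, not copied from any real dataset.
mine = {"robot_type": "so101_follower", "fps": 30, "features": {
    "action": {"shape": [6]},
    "observation.state": {"shape": [6]},
    "observation.images.front": {"shape": [480, 640, 3]},
}}
public = {"robot_type": "so101_follower", "fps": 10, "features": {
    "action": {"shape": [6]},
    "observation.state": {"shape": [6]},
    "observation.images.front": {"shape": [480, 640, 3]},
    "observation.images.wrist": {"shape": [480, 640, 3]},
}}

for issue in compatibility_issues(mine, public):
    print(issue)
```

An empty result does not prove the datasets can be mixed, since matching shapes can still hide different joint orders or normalization, but a non-empty result is a cheap early warning.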

If you contribute a dataset to Hugging Face, write a useful dataset card. Include robot, cameras, FPS, action space, episode count, frame count, tasks, splits, known failures, license, and reproduction instructions. This is how a small 50-episode dataset becomes a community artifact instead of an anonymous zip file.

Decision checklist: collect your own or use public data?

Use this checklist per task, not per company.

| Question | If yes | Decision |
| --- | --- | --- |
| Is the task directly tied to customer workflow or product IP? | Yes | Collect target data |
| Is your camera/robot/gripper different from public setups? | Yes | Collect at least 50-200 target episodes |
| Are the objects specific to your factory/domain? | Yes | Collect your own data |
| Does the task involve force, insertion, slipping, or jamming? | Yes | Collect and log failures |
| Is there a public dataset with the same robot, camera, and task? | Yes | Use it as a baseline first |
| Is the task only for learning the toolchain? | Yes | Use public data, avoid over-collection |
| Does the task lack a clear success metric? | Yes | Do not collect yet; define the metric |
| Is the hardware still unstable? | Yes | Stop large collection; fix calibration |
| Are you planning to train from scratch "for cleanliness"? | Yes | Do not; fine-tune a checkpoint first |
| Can the dataset be published without exposing secrets? | Yes | Consider contributing it |

The practical rule:

Use public data for common skills.
Collect your own data for embodiment, domain, failures, and workflows only you have.

Examples:

| Task | Public first? | Collect later? |
| --- | --- | --- |
| Cube pick-place | Yes | 50-100 episodes on your setup |
| Sort blocks by color | Yes | When camera, lighting, or prompts differ |
| Pick electronics components | Only as warm start | Yes, because objects and domain are specific |
| Open real product packaging | Maybe as reference | Yes, because contact and packaging vary |
| Fold towels | Use VLA/humanoid data as reference | Yes, if this is a product task |

How much data is enough for the first loop?

Do not treat 50 episodes as a universal law. Treat it as a starting threshold for the first loop. The important lesson from Part 5 is that diversity usually beats repetition. For a small team:

| Stage | Training episodes | Held-out eval | Goal |
| --- | --- | --- | --- |
| Smoke test | 10-25 | 5 | Pipeline runs and data loads |
| First policy | 50 | 20 | Robot can perform a simple task |
| Robust v1 | 100-200 | 50 | Add 3-5 real variations |
| Product pilot | 300-800 | 100+ | Split by object/environment/operator |

The split matters more than the episode count:

Train: known variations
Validation: same distribution, used to catch overfitting
Test unseen: object, pose, light, or operator not used in training
Holdout customer-like: closest to the real use case, never touched during tuning

If you repeatedly inspect the test set and collect exactly those failures into training, the test set has become part of training. Keep a small but serious holdout.
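A split like this is easier to keep clean when it is computed from episode metadata instead of being chosen by hand. The sketch below holds out every episode recorded under a chosen condition as the unseen test set and divides the rest into train and validation. The metadata fields (`lighting`) and the one-in-five validation ratio are hypothetical choices for this example. The customer-like holdout should live in a separate dataset that this code never sees.

```python
def split_by_condition(episodes, unseen_key, unseen_values, val_every=5):
    """Split episode metadata into train / validation / test_unseen.

    Episodes whose `unseen_key` value is in `unseen_values` go only to
    test_unseen, so that condition never appears in training.
    """
    splits = {"train": [], "validation": [], "test_unseen": []}
    seen_count = 0
    for ep in episodes:
        if ep[unseen_key] in unseen_values:
            splits["test_unseen"].append(ep["episode_index"])
        else:
            seen_count += 1
            name = "validation" if seen_count % val_every == 0 else "train"
            splits[name].append(ep["episode_index"])
    return splits


# Hypothetical metadata: 30 episodes cycling through three lighting setups.
LIGHTING = ["daylight", "desk_lamp", "dim"]
episodes = [{"episode_index": i, "lighting": LIGHTING[i % 3]} for i in range(30)]

splits = split_by_condition(episodes, "lighting", {"dim"})
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by condition rather than by random episode is what makes the unseen test meaningful: the policy is scored on lighting it has never trained on, not on near-duplicates of training episodes.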

A dataset card template for small teams

A good dataset card can be short, but it must be informative:

# vnrobo/so101_cube_bowl_v1

## Summary
SO-101 leader-follower demonstrations for cube-to-bowl pick-place.

## Robot
- Follower: SO-101, 6x STS3215 servos
- Teleop: SO-101 leader
- Cameras: front USB camera, 640x480, 30 FPS
- Action: 6-dimensional joint targets (5 arm joints + gripper)

## Tasks
- Pick the red cube and place it in the white bowl
- Pick the blue cube and place it in the white bowl

## Dataset
- Train episodes: 100
- Eval episodes: 30
- FPS: 30
- Format: LeRobotDataset v2.1
- Known failures: glare on glossy cube, occasional gripper slip

## License
Apache-2.0 or CC-BY-4.0, depending on your business policy.

Do not publish data containing faces, customer information, production-line secrets, internal logos, or objects under NDA. If you want to share, recreate an equivalent task with neutral objects.

Conclusion: the small-team moat is the loop

Small teams do not win through raw data volume. They win by choosing the right task, recording with the right schema, fine-tuning the right pretrained checkpoint, evaluating honestly, and collecting the next batch against the real failure mode. LeRobot + SO-101 makes the first loop inexpensive. LeRobotDataset v2/v3 keeps the data portable. SmolVLA, π0, and GR00T let you reuse foundation models instead of training from zero. Hugging Face community datasets help you learn quickly and contribute back.

A good data strategy for a robotics startup should be this explicit:

We are not collecting "robot data" in general.
We are collecting evidence that a pretrained policy can perform a real workflow,
on our real embodiment, under real variations the customer will see.

The final post in the series will zoom out to strategy: in the humanoid data war, should small teams bet on open datasets, closed proprietary data, or a hybrid model?

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.