Why Part 6 is for small teams
The first five posts in this series covered the ownership map, teleoperation, human video mining, synthetic pipelines, and scaling laws for VLA data. If you are joining here, start with Part 1: the humanoid data war landscape, Part 2: teleoperation, and Part 5: VLA data scaling. This post answers a more operational question: if you are a small robotics team, without a fleet of 100 robots, what should you collect first?
The short answer: do not try to win the raw episode-count race. A small team should build a mini data flywheel: choose one narrow but real task, collect a small amount of clean data with inexpensive hardware, standardize it as a LeRobotDataset, fine-tune from a pretrained checkpoint, evaluate real failures, and collect the next batch only where the robot actually fails. That loop matters more than claiming "10,000 demos" from a single lab setup.
The good news is that the open robotics ecosystem has lowered the entry barrier. LeRobot provides tooling for robot control, datasets, training, and inference. SO-101 is a low-cost 6-DOF arm that can be built or bought as a kit and used with leader-follower teleoperation. SmolVLA is designed for fine-tuning on LeRobot datasets. OpenPI/π0 and NVIDIA Isaac GR00T push the same practical idea: start from a foundation model, adapt it to your embodiment and task, and avoid training from scratch unless you have a very strong reason.
The right mindset: raw footage is not a data asset
For a startup, robot data becomes an asset only when it helps answer one of three questions:
| Question | Example |
|---|---|
| Can our robot perform the customer workflow? | Pick parts from a tray, place boxes on a conveyor, open a drawer |
| Does the policy generalize across real variation? | Lighting, camera pose, objects, position, clutter |
| Can the dataset be reused for future models? | Standard schema, clear metadata, clean license, synchronized video/state/action |
A folder of videos showing an operator driving a robot is not yet a data asset. It becomes one when it has timestamps, observations, actions, states, task labels, episode boundaries, train/eval splits, robot metadata, and a way to load everything back in code. This is why LeRobotDataset matters. You are buying optionality: today you may train ACT or SmolVLA, six months from now you may try π0 or GR00T, and a year from now you may merge your dataset with community data.
A small-team data strategy should start with a concrete sentence:
In the next 30 days, what task should the robot learn, on what setup,
with what success criterion, and what decision will this dataset support?
If you cannot answer that, do not start collecting data yet. Define the task first.
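One way to make that sentence enforceable is to write it down as a small record and refuse to start recording while any part is blank. This is a minimal sketch; `TaskSpec` and its field names are illustrative and not part of LeRobot.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskSpec:
    """The four answers from the planning sentence (hypothetical structure)."""
    task: str
    setup: str
    success_criterion: str
    decision_supported: str
    deadline_days: int = 30

    def missing(self):
        """Return the names of text fields that were left blank."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


spec = TaskSpec(
    task="Pick a cube into a bowl",
    setup="SO-101 leader-follower, one front camera",
    success_criterion="Cube rests inside the bowl at episode end",
    decision_supported="",
)
print(spec.missing())
```

An empty list from `missing()` means the task is defined well enough to start collecting.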
Minimum setup: LeRobot + SO-101 in the $100-300 range
SO-101 is the next-generation version of SO-100 developed by The Robot Studio and Hugging Face. The official docs describe a follower arm with 6 STS3215 motors, a leader arm with different gear ratios so the operator can move it smoothly, and a LeRobot workflow for recording datasets. The actual cost depends on whether you print the parts yourself, buy a kit, already own power supplies, and what camera you use. The "$100" number usually refers to the basic arm hardware if you source parts well. For a data collection setup, budget more realistically:
| Item | Purpose | Rough budget |
|---|---|---|
| SO-101 follower arm | Robot that executes actions | $100-180 if well sourced, higher as a kit |
| SO-101 leader arm | Consistent teleoperation | $100-180 |
| USB camera or webcam | Main visual observation | $20-60 |
| Table, clamps, power, cables, objects | Reliable resets and fewer mechanical failures | $30-80 |
| CUDA PC or cloud GPU | Fine-tuning experiments | Depends on usage |
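To see what the rows add up to, the sketch below totals the low and high figures from the table. It excludes compute and kit pricing, and the dictionary keys are illustrative labels for the table rows.

```python
# Low/high USD figures copied from the budget table (compute excluded).
BUDGET = {
    "follower_arm": (100, 180),
    "leader_arm": (100, 180),
    "camera": (20, 60),
    "table_clamps_power_cables_objects": (30, 80),
}


def total_range(items, budget=BUDGET):
    """Sum the low and high estimates for the selected items."""
    low = sum(budget[name][0] for name in items)
    high = sum(budget[name][1] for name in items)
    return low, high


full_setup = total_range(BUDGET)  # leader-follower collection
follower_only = total_range(
    ["follower_arm", "camera", "table_clamps_power_cables_objects"]
)
print(f"Leader-follower setup: ${full_setup[0]}-{full_setup[1]}")
print(f"Follower-only setup:   ${follower_only[0]}-{follower_only[1]}")
```

With these figures, a follower-only configuration comes to about $150-320, and a full leader-follower setup to about $250-500 before compute.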
If you are extremely budget constrained, you can start with a follower arm and simpler teleoperation scripts. But leader-follower collection usually produces cleaner data. The operator moves a physical leader arm and the follower mirrors it, so demonstrations are more natural and the action logs are easier to interpret.
A useful mini setup should reserve space for three camera zones even if you initially use only one or two:
| Camera | Required? | When it matters |
|---|---|---|
| Front/global camera | Yes | Almost all pick-place, sorting, and simple insertion tasks |
| Wrist/near-gripper camera | Recommended | Small grasps, occlusion, objects with similar colors |
| Side camera | Optional | Debugging contact, height, and collisions |
Do not add cameras before your logging is stable. One synchronized camera is better than three misaligned streams.
Choose the first task: narrow, repeatable, commercially relevant
The first task should not be "a humanoid cleans a house." With SO-101, choose a one-arm manipulation task that still points toward your target domain. If you are building for electronics factories, collect on jumper wires, PCB spacers, small plastic bins, or tray sorting. If you are building logistics workflows, collect on parcel-like objects, labels, and simple sorting. If you are building education products, collect LEGO or block pick-place with bilingual instructions.
A good first-loop task has four properties:
| Criterion | Why it matters |
|---|---|
| Reset takes under 15 seconds | You can collect 50-200 episodes without exhausting the operator |
| Success is easy to judge | You do not need a complex annotation pipeline |
| It has 3-5 natural variations | Position, color, object, lighting, instruction |
| It connects to the product | The data remains useful after the demo |
Good starter tasks:
| Task | Variations to collect | Should you collect your own? |
|---|---|---|
| Pick a cube into a bowl | 5 cube positions, 2 colors, 2 bowls | Yes, because your embodiment and camera differ |
| Sort small parts into trays | Part type, tray layout, clutter, lighting | Yes, if this is your product domain |
| Open a mini drawer | Handle pose, pulling force, camera angle | Yes, if contact behavior matters |
| Stack blocks by color | Colors, language prompts, positions | Use public data first, then collect a small target set |
| Wipe a table | Cloth, surface, force, coverage | Not yet, unless you can measure coverage |
A common mistake is choosing a task because it will make a nice video, not because it supports a product decision. A robot picking candy may look good, but if your customer needs inspection or part handling, that dataset is mostly a tooling exercise, not a moat.
LeRobotDataset v2: store data so other models can read it
LeRobot is moving toward dataset v3, which reduces file count, improves streaming, and modernizes metadata. But many important stacks still use or support v2/v2.1. NVIDIA GR00T currently uses a LeRobot v2 flavor and adds meta/modality.json; its data preparation guide explicitly says v3 datasets should be converted to v2 for the current workflow. For a small team, the safest strategy is:
- Record with current LeRobot tooling.
- Keep a canonical dataset that can be converted.
- If the target is GR00T or another v2 stack, export a clean v2.1 dataset.
- If the target is newer LeRobot/SmolVLA streaming, also keep a v3 copy when useful.
A typical v2.1 layout:
my_dataset/
  data/
    chunk-000/
      episode_000000.parquet
      episode_000001.parquet
  videos/
    chunk-000/
      observation.images.front/
        episode_000000.mp4
        episode_000001.mp4
      observation.images.wrist/
        episode_000000.mp4
        episode_000001.mp4
  meta/
    info.json
    stats.json
    episodes.jsonl
    episodes_stats.jsonl
    tasks.jsonl
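A quick way to catch a broken export is to list the files this layout implies and report any that are missing. The sketch below uses only the standard library and mirrors the tree shown here. The `chunk_size` default of 1000 episodes per chunk folder is an assumption; read the real value from your own `meta/info.json`. The function names are illustrative.

```python
from pathlib import Path

META_FILES = [
    "info.json",
    "stats.json",
    "episodes.jsonl",
    "episodes_stats.jsonl",
    "tasks.jsonl",
]


def expected_files(num_episodes, camera_keys, chunk_size=1000):
    """List the relative paths a v2.1-style dataset should contain.

    chunk_size (episodes per chunk-XXX folder) is an assumed default.
    """
    paths = [Path("meta") / name for name in META_FILES]
    for ep in range(num_episodes):
        chunk = f"chunk-{ep // chunk_size:03d}"
        paths.append(Path("data") / chunk / f"episode_{ep:06d}.parquet")
        for key in camera_keys:
            paths.append(Path("videos") / chunk / key / f"episode_{ep:06d}.mp4")
    return paths


def missing_files(root, num_episodes, camera_keys, chunk_size=1000):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [
        p
        for p in expected_files(num_episodes, camera_keys, chunk_size)
        if not (root / p).exists()
    ]


cameras = ["observation.images.front", "observation.images.wrist"]
for path in missing_files("my_dataset", 2, cameras):
    print("missing:", path)
```

Running it right after recording, before any training job, tends to surface dropped video files or half-written episodes early.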
Minimum fields to check in each Parquet episode:
| Field | Meaning | Common failure |
|---|---|---|
| observation.state | Current robot state | Wrong joint order or inconsistent scale |
| action | Next command or target delta | Mixing absolute and delta actions |
| timestamp | Frame time | Camera and action streams drift |
| frame_index | Index inside episode | Reset does not start from 0 |
| episode_index | Episode ID | Duplicate IDs after merging datasets |
| task_index | Mapping to task/prompt | Task labels are wrong or too generic |
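Several of these failures can be detected mechanically. The sketch below checks one episode given as a list of per-frame dicts, for example from `DataFrame.to_dict("records")` after reading the Parquet file. The function name, the expected dimensions, and the one-task-per-episode rule are assumptions for illustration, not LeRobot requirements.

```python
def check_episode(rows, expected_episode_index, state_dim, action_dim):
    """Return a list of human-readable problems found in one episode.

    rows: list of dicts, one per frame, keyed by the field names in the table.
    """
    if not rows:
        return ["episode has no frames"]
    problems = []
    # frame_index should count 0, 1, 2, ... inside the episode.
    for position, row in enumerate(rows):
        if row["frame_index"] != position:
            problems.append(
                f"frame_index {row['frame_index']} at position {position}"
            )
            break
    if any(r["episode_index"] != expected_episode_index for r in rows):
        problems.append("episode_index does not match the file")
    # Assumption: each episode carries a single task prompt.
    if len({r["task_index"] for r in rows}) != 1:
        problems.append("task_index changes inside the episode")
    times = [r["timestamp"] for r in rows]
    if any(later <= earlier for earlier, later in zip(times, times[1:])):
        problems.append("timestamps are not strictly increasing")
    if any(len(r["observation.state"]) != state_dim for r in rows):
        problems.append("observation.state has the wrong length")
    if any(len(r["action"]) != action_dim for r in rows):
        problems.append("action has the wrong length")
    return problems


demo_rows = [
    {
        "frame_index": i,
        "episode_index": 0,
        "task_index": 0,
        "timestamp": i / 30,
        "observation.state": [0.0] * 6,
        "action": [0.0] * 6,
    }
    for i in range(3)
]
print(check_episode(demo_rows, expected_episode_index=0, state_dim=6, action_dim=6))
```

Checks like these do not catch wrong joint order or mixed absolute and delta actions; those still need a plot of state against action for a few episodes.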
If you use GR00T, add modality metadata to describe state/action/video names. The idea is simple: the model needs to know what the action vector means, which camera is the ego view, which camera is the front view, and where language annotations live.
A good tasks.jsonl is concrete:
{"task_index": 0, "task": "Pick the red cube and place it in the white bowl"}
{"task_index": 1, "task": "Pick the blue cube and place it in the white bowl"}
{"task_index": 2, "task": "Move the small spacer from the left tray to the right tray"}
If you eventually want Vietnamese instructions, keep bilingual metadata, but do not break the main field expected by your training script. A safe pattern is to keep English prompts for training and add task_vi as separate metadata or in the dataset card.
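Because every frame points at a `task_index`, it is worth validating the task file before training. The sketch below parses JSONL text with the standard library and rejects duplicates and blank prompts. The gap-free 0..N-1 rule is an assumption that matches the example above; `load_tasks` is an illustrative helper, not a LeRobot function.

```python
import json

SAMPLE_TASKS = """\
{"task_index": 0, "task": "Pick the red cube and place it in the white bowl"}
{"task_index": 1, "task": "Pick the blue cube and place it in the white bowl"}
{"task_index": 2, "task": "Move the small spacer from the left tray to the right tray"}
"""


def load_tasks(jsonl_text):
    """Parse tasks.jsonl content into {task_index: task} and validate it."""
    tasks = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        index = record["task_index"]
        prompt = record["task"].strip()
        if index in tasks:
            raise ValueError(f"line {line_no}: duplicate task_index {index}")
        if not prompt:
            raise ValueError(f"line {line_no}: empty task prompt")
        tasks[index] = prompt
    # Assumption: indices are expected to run 0..N-1 without gaps.
    if sorted(tasks) != list(range(len(tasks))):
        raise ValueError("task_index values should be 0..N-1 without gaps")
    return tasks


for index, prompt in load_tasks(SAMPLE_TASKS).items():
    print(index, prompt)
```

A `task_vi` translation can live in a separate mapping keyed by the same `task_index`, so the `task` field stays exactly as the training script expects.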
Recording the first 50 episodes
The SmolVLA docs recommend recording about 50 episodes as a starting point. Their pick-place reference uses 50 episodes across 5 cube positions, with 10 episodes per position. This is a useful starting number for small teams. It is enough to test the loop, but small enough that a bad camera setup does not waste a week.
A beginner-friendly plan:
| Batch | Content | Goal |
|---|---|---|
| 0 | 5 test episodes | Check video, state, action, resets |
| 1 | 25 clean episodes | Training smoke test |
| 2 | 25 more episodes with clear variation | First fine-tuned policy |
| 3 | 20 held-out eval episodes | Never train on these |
For a "pick cube into bowl" task:
- 5 cube positions: left, right, near, far, center
- 2 cube colors: red, blue
- 1 fixed white bowl
- 10 episodes for each main position
- 20 eval episodes: slight position shifts and different lighting, excluded from training
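The training part of this plan can be generated instead of tracked by hand. The sketch below assumes the 10 episodes per position are split evenly between the two colors, and that positions are shuffled inside each color block so the operator does not record one position ten times in a row. Both choices are assumptions, and the names are illustrative.

```python
import random
from collections import Counter

POSITIONS = ["left", "right", "near", "far", "center"]
COLORS = ["red", "blue"]
EPISODES_PER_POSITION = 10


def build_plan(seed=0):
    """Return a list of (color, position) pairs, one per training episode.

    Episodes are grouped by color (one prompt per recording session) and
    positions are shuffled inside each color block.
    """
    rng = random.Random(seed)
    per_cell = EPISODES_PER_POSITION // len(COLORS)  # 5 per color and position
    plan = []
    for color in COLORS:
        block = [(color, pos) for pos in POSITIONS for _ in range(per_cell)]
        rng.shuffle(block)
        plan.extend(block)
    return plan


plan = build_plan()
print(len(plan), "training episodes")
print(Counter(position for _, position in plan))
```

Printing the plan and ticking off rows during recording also gives you per-episode variation metadata for the dataset card.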
The exact record command changes as LeRobot evolves, but the workflow looks like this:
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=vnrobo_so101_follower_01 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=vnrobo_so101_leader_01 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --dataset.num_episodes=50 \
  --dataset.single_task="Pick the red cube and place it in the white bowl" \
  --display_data=true
After recording, manually inspect at least 10% of episodes. Do not rely only on training loss. Open the videos and ask:
| Check | Question |
|---|---|
| Frame | Can the camera see the gripper and object at the decisive moment? |
| Timing | Are actions aligned with frames, or delayed by half a second? |
| Reset | Does each episode start from a consistent state? |
| Success | Did the demonstration actually complete the task? |
| Variation | Are the 50 episodes meaningfully different? |
| Safety | Did the arm collide, stall, or move unpredictably? |
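To keep the review honest, pick the episodes to watch at random rather than choosing the ones you remember going well. The sketch below draws a reproducible sample of at least 10% and records a pass/fail per check; the helper names and the review dict format are illustrative.

```python
import math
import random

CHECKS = ["frame", "timing", "reset", "success", "variation", "safety"]


def episodes_to_review(num_episodes, fraction=0.10, seed=0):
    """Pick a reproducible random sample of episode indices to watch."""
    count = max(1, math.ceil(num_episodes * fraction))
    return sorted(random.Random(seed).sample(range(num_episodes), count))


def failed_checks(review):
    """review maps check name -> bool; unanswered checks count as failed."""
    return [name for name in CHECKS if not review.get(name, False)]


sample = episodes_to_review(50)
print("Watch episodes:", sample)

# Example review of one episode: everything passed except timing.
review = {name: True for name in CHECKS}
review["timing"] = False
print("Failed checks:", failed_checks(review))
```

If more than a couple of sampled episodes fail the same check, it is usually cheaper to fix the setup and re-record than to train around the problem.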
A small clean dataset beats a larger dataset where every episode requires guessing what happened.
Fine-tune from pretrained checkpoints, do not train from scratch
For a small team, training from scratch is almost always the wrong first move. You do not have enough data to teach vision, language, action priors, and contact dynamics from zero. Start with a pretrained checkpoint, fine-tune for a modest number of steps, measure failures, and then decide whether to collect more data or change the model.
Three practical options:
| Model | When to use it | What to remember |
|---|---|---|
| SmolVLA | SO-101, small tasks, moderate hardware | Lightweight, LeRobot-native, easy to try |
| π0/OpenPI | Stronger VLA experiments, good GPU, custom platform | Base model is intended for small-to-medium fine-tuning |
| GR00T N1/N1.x | Humanoid or dual-arm workflows, NVIDIA Isaac/Jetson alignment | Requires GR00T LeRobot v2 schema and modality config |
Example SmolVLA fine-tuning:
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/so101_cube_bowl_smolvla \
  --job_name=so101_cube_bowl_smolvla \
  --policy.device=cuda \
  --wandb.enable=true
Example π0 through LeRobot:
lerobot-train \
  --policy.path=lerobot/pi0 \
  --dataset.repo_id=vnrobo/so101_cube_bowl_v1 \
  --output_dir=outputs/train/so101_cube_bowl_pi0
Example GR00T workflow:
# 1. Convert or validate the dataset as LeRobot v2
# 2. Add meta/modality.json using the GR00T schema
# 3. Fine-tune with a matching embodiment/action config
python scripts/launch_finetune.py \
  --dataset-path /data/so101_cube_bowl_v1 \
  --modality-config configs/so101_modality.json \
  --output-dir outputs/groot_so101_cube_bowl
Treat these as workflow templates, not immutable commands. LeRobot and GR00T CLI details change quickly. The durable principles are: standard dataset, pretrained policy, explicit embodiment config, and evaluation data separated from training data.
A 7-day mini data flywheel
Do not wait a month before training. The flywheel should run weekly:
| Day | Work |
|---|---|
| 1 | Assemble, calibrate, record 5 test episodes |
| 2 | Record 25 episodes, validate the dataset, run a smoke test |
| 3 | Record 25 more episodes with clear variation, fine-tune a checkpoint |
| 4 | Run 20 unseen eval trials, log failure taxonomy |
| 5 | Collect 30-50 episodes targeting the top failure mode |
| 6 | Fine-tune again and compare against the baseline |
| 7 | Decide whether to scale the task, change cameras, change objects, or stop |
Keep the failure taxonomy simple:
PERCEPTION: cannot see object, wrong color, glare, occlusion
LOCALIZATION: sees object but reaches the wrong pose
GRASP: approaches correctly but slips
CONTACT: hits table, pushes object away, gripper jams
LANGUAGE: misunderstands instruction
RECOVERY: drifts away from demonstration and cannot recover
HARDWARE: weak servo, backlash, vibration, bad calibration
After each evaluation, do not ask "how many more episodes do we need?" Ask:
What is the biggest failure class?
Does it need new data, hardware repair, camera changes, or another model?
If the failure is glare, 500 more demonstrations under the same lighting will not solve it. You need different lighting, glossy objects in training, or a camera change. If the gripper slips because of mechanical backlash, data cannot fix the mechanism.
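The two questions can be answered from a plain list of per-trial labels. The sketch below tallies the taxonomy and looks up a suggested next step for the largest failure class. The `REMEDY` mapping is an illustrative assumption that extends the glare and backlash examples; adjust it to your own setup.

```python
from collections import Counter

# Suggested next step per failure class (illustrative, not a fixed rule).
REMEDY = {
    "PERCEPTION": "change lighting or camera, add visual variation",
    "LOCALIZATION": "collect more pose variation",
    "GRASP": "collect targeted grasp demos and check the gripper",
    "CONTACT": "collect targeted contact demos",
    "LANGUAGE": "fix or diversify task prompts",
    "RECOVERY": "collect recovery demonstrations",
    "HARDWARE": "repair or recalibrate; more data will not help",
}


def top_failure(trial_labels):
    """Summarize eval trials labeled 'SUCCESS' or with a failure class.

    Returns (success_rate, top_failure_class, suggested_remedy).
    """
    if not trial_labels:
        raise ValueError("no eval trials recorded")
    success_rate = trial_labels.count("SUCCESS") / len(trial_labels)
    failures = Counter(label for label in trial_labels if label != "SUCCESS")
    if not failures:
        return success_rate, None, None
    label, _count = failures.most_common(1)[0]
    return success_rate, label, REMEDY[label]


trials = ["SUCCESS"] * 12 + ["PERCEPTION"] * 5 + ["GRASP"] * 3
rate, label, remedy = top_failure(trials)
print(f"success rate {rate:.0%}, top failure {label}: {remedy}")
```

Keeping the labels in a spreadsheet or JSONL file next to each checkpoint makes the Day 6 comparison against the baseline a matter of re-running the same tally.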
When should you use public data?
Hugging Face Hub already hosts many LeRobot community datasets. SmolVLA was pretrained on LeRobot community datasets, and the lerobot/svla_so101_pickplace reference has 50 episodes, 11,939 frames, 480x640 (height x width) camera frames, and a 6-dimensional action space. Public data is especially useful for three things:
| Purpose | How to use public data |
|---|---|
| Learn the toolchain | Load, train, and deploy before recording your own data |
| Add pretraining/fine-tuning signal | Mix similar datasets if robot, camera, and action spaces are compatible |
| Benchmark your setup | Compare simple tasks to detect hardware or logging problems |
But public data does not automatically solve your embodiment. Different camera placement, gripper geometry, table height, object set, or action normalization can break a policy. For small teams, the better rule is:
Use public data to learn priors and debug tooling.
Use private target data to teach your camera, robot, objects, and workflow.
Contribute public data when it is safe, so the community can inspect and build on it.
Read dataset cards as if they were contracts:
| Check | Why it matters |
|---|---|
| License | Can you use it commercially? |
| Robot type | SO-100, SO-101, ALOHA, Franka, or custom? |
| Action representation | Joint target, EEF delta, binary gripper, continuous gripper? |
| Camera keys | Do front, wrist, and side match your model config? |
| FPS | The same action chunk covers a different time horizon at 5 Hz, 10 Hz, or 30 Hz |
| Task label | Are prompts concrete or vague? |
| Quality | Were failed demonstrations filtered? |
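Several rows of this checklist reduce to comparing a few facts from the dataset card with your own configuration. The sketch below does that for hand-filled dicts and shows why FPS matters for chunked policies. The key names and the chunk size of 50 are hypothetical; check your policy config for the real value.

```python
def compatibility_issues(public, mine):
    """Compare dataset-card facts for a public dataset against your setup.

    Both arguments are plain dicts filled in by hand; key names are illustrative.
    """
    issues = []
    for key in ("robot_type", "action_representation", "action_dim", "fps"):
        if public.get(key) != mine.get(key):
            issues.append(
                f"{key}: public={public.get(key)!r}, mine={mine.get(key)!r}"
            )
    missing = set(mine.get("camera_keys", [])) - set(public.get("camera_keys", []))
    if missing:
        issues.append(f"camera keys missing from public data: {sorted(missing)}")
    return issues


def chunk_horizon_seconds(chunk_size, fps):
    """Time span covered by one action chunk at a given frame rate."""
    return chunk_size / fps


public_card = {
    "robot_type": "so100",
    "action_representation": "joint_target",
    "action_dim": 6,
    "fps": 30,
    "camera_keys": ["front"],
}
my_setup = dict(public_card, robot_type="so101", camera_keys=["front", "wrist"])

for issue in compatibility_issues(public_card, my_setup):
    print(issue)
for fps in (5, 10, 30):
    print(f"{fps} Hz: 50-action chunk spans {chunk_horizon_seconds(50, fps):.2f} s")
```

An empty issue list does not guarantee transfer, but a non-empty one tells you which part of the config or data needs remapping before mixing datasets.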
If you contribute a dataset to Hugging Face, write a useful dataset card. Include robot, cameras, FPS, action space, episode count, frame count, tasks, splits, known failures, license, and reproduction instructions. This is how a small 50-episode dataset becomes a community artifact instead of an anonymous zip file.
Decision checklist: collect your own or use public data?
Use this checklist per task, not per company.
| Question | Decision if yes |
|---|---|
| Is the task directly tied to customer workflow or product IP? | Collect target data |
| Is your camera/robot/gripper different from public setups? | Collect at least 50-200 target episodes |
| Are the objects specific to your factory/domain? | Collect your own data |
| Does the task involve force, insertion, slipping, or jamming? | Collect and log failures |
| Is there a public dataset with the same robot, camera, and task? | Use it as a baseline first |
| Is the task only for learning the toolchain? | Use public data, avoid over-collection |
| Does the task lack a clear success metric? | Do not collect yet; define the metric |
| Is the hardware still unstable? | Stop large collection; fix calibration |
| Are you planning to train from scratch "for cleanliness"? | Do not; fine-tune a checkpoint first |
| Can the dataset be published without exposing secrets? | Consider contributing it |
The practical rule:
Use public data for common skills.
Collect your own data for embodiment, domain, failures, and workflows only you have.
Examples:
| Task | Public first? | Collect later? |
|---|---|---|
| Cube pick-place | Yes | 50-100 episodes on your setup |
| Sort blocks by color | Yes | When camera, lighting, or prompts differ |
| Pick electronics components | Only as warm start | Yes, because objects and domain are specific |
| Open real product packaging | Maybe as reference | Yes, because contact and packaging vary |
| Fold towels | Use VLA/humanoid data as reference | Yes, if this is a product task |
How much data is enough for the first loop?
Do not treat 50 episodes as a universal law. Treat it as a starting threshold. From Part 5, the important lesson is that diversity usually beats repetition. For a small team:
| Stage | Training episodes | Held-out eval | Goal |
|---|---|---|---|
| Smoke test | 10-25 | 5 | Pipeline runs and data loads |
| First policy | 50 | 20 | Robot can perform a simple task |
| Robust v1 | 100-200 | 50 | Add 3-5 real variations |
| Product pilot | 300-800 | 100+ | Split by object/environment/operator |
The split matters more than the episode count:
Train: known variations
Validation: same distribution, used to catch overfitting
Test unseen: object, pose, light, or operator not used in training
Holdout customer-like: closest to the real use case, never touched during tuning
If you repeatedly inspect the test set and collect exactly those failures into training, the test set has become part of training. Keep a small but serious holdout.
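A split by variation is easy to get wrong with a random shuffle, because the same object or pose ends up on both sides. The sketch below assigns whole variation values to the unseen test split and carves a validation slice out of the remaining episodes. The metadata format, the `key` field, and the every-fifth-episode validation rule are assumptions for illustration.

```python
def split_episodes(episodes, unseen_values, key="object", val_every=5):
    """Split episode metadata so chosen variation values never reach training.

    episodes: list of dicts with 'episode_index' and the `key` field.
    unseen_values: values of `key` reserved for the unseen test split.
    Returns (train, validation, test_unseen) lists of episode indices.
    """
    unseen = set(unseen_values)
    seen = [e["episode_index"] for e in episodes if e[key] not in unseen]
    test_unseen = [e["episode_index"] for e in episodes if e[key] in unseen]
    # Validation shares the training distribution: every val_every-th episode.
    validation = seen[::val_every]
    held = set(validation)
    train = [index for index in seen if index not in held]
    return train, validation, test_unseen


objects = ["red"] * 6 + ["blue"] * 2 + ["green"] * 2
episodes = [{"episode_index": i, "object": obj} for i, obj in enumerate(objects)]
train, validation, test_unseen = split_episodes(episodes, unseen_values=["green"])
print("train:", train)
print("validation:", validation)
print("test unseen:", test_unseen)
```

The customer-like holdout is best kept in a separate dataset or repository entirely, so that no script can load it into training by accident.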
A dataset card template for small teams
A good dataset card can be short, but it must be informative:
# vnrobo/so101_cube_bowl_v1
## Summary
SO-101 leader-follower demonstrations for cube-to-bowl pick-place.
## Robot
- Follower: SO-101, 6 STS3215 servo motors
- Teleop: SO-101 leader
- Cameras: front USB camera, 640x480, 30 FPS
- Action: 6-dimensional joint targets (5 arm joints + gripper)
## Tasks
- Pick the red cube and place it in the white bowl
- Pick the blue cube and place it in the white bowl
## Dataset
- Train episodes: 100
- Eval episodes: 30
- FPS: 30
- Format: LeRobotDataset v2.1
- Known failures: glare on glossy cube, occasional gripper slip
## License
Apache-2.0 or CC-BY-4.0, depending on your business policy.
Do not publish data containing faces, customer information, production-line secrets, internal logos, or objects under NDA. If you want to share, recreate an equivalent task with neutral objects.
Conclusion: the small-team moat is the loop
Small teams do not win through raw data volume. They win by choosing the right task, recording with the right schema, fine-tuning the right pretrained checkpoint, evaluating honestly, and collecting the next batch against the real failure mode. LeRobot + SO-101 makes the first loop inexpensive. LeRobotDataset v2/v3 keeps the data portable. SmolVLA, π0, and GR00T let you reuse foundation models instead of training from zero. Hugging Face community datasets help you learn quickly and contribute back.
A good data strategy for a robotics startup should be this explicit:
We are not collecting "robot data" in general.
We are collecting evidence that a pretrained policy can perform a real workflow,
on our real embodiment, under real variations the customer will see.
The final post in the series will zoom out to strategy: in the humanoid data war, should small teams bet on open datasets, closed proprietary data, or a hybrid model?
Sources
- Hugging Face LeRobot GitHub
- SO-101 official LeRobot documentation
- LeRobotDataset v3 documentation and migration notes
- GR00T LeRobot v2 data format
- SmolVLA official LeRobot documentation
- SmolVLA project page
- π0 and π0-FAST in LeRobot
- OpenPI release blog