Part 5 ended on a blunt note: data is the bottleneck. You can have the prettiest VLA architecture and the best 3D perception, but without enough quality data to learn from, the policy stays dumb. And for whole-body manipulation, collecting data is especially hard — because you need both fine hand motion and whole-body coordination.
This article compares the three main data-collection strategies (as of June 2026) and helps you pick the right one for your budget.

The three trade-off axes of any data strategy
Before each approach, grasp the three axes every strategy must balance:
- Action label accuracy. A behavior-cloning policy needs to know "in this state, the robot does what." The more accurate the action label — true to robot kinematics — the easier the policy learns.
- Cost & throughput. How much money and time to collect 1,000 demos? Do you need a real robot? Do you need skilled operators?
- Domain gap. Does the collected data match the target robot (e.g. Unitree G1), or does it require a conversion step (retargeting) that loses information?
No strategy wins all three axes. That is why you must choose, not look for "the best one."
Strategy 1: Robot teleoperation (accurate, expensive)
An operator controls a real robot through a device — a VR controller (Meta Quest), an exoskeleton, or a master-slave rig. Each time the human acts, the robot mirrors it; we record observations + joint commands.
Pros:
- Absolutely accurate action labels. Since the robot itself executes, every joint command is a real number, true to robot kinematics — no embodiment domain gap.
- High-quality demos, usable directly for behavior cloning.
Cons:
- Needs a real robot (a G1 is ~tens of thousands of USD) for each collection station.
- Low throughput. One person, one robot, sequentially. Collecting 1,000 demos takes many days.
- Operator fatigue; quality degrades over time.
This is what earlier blog series used (data collection for GR00T N1). Best on quality, worst on scale.
Strategy 2: Robot-free demonstrations — HuMI
This is a clever way to break the cost bottleneck. HuMI (Humanoid Manipulation Interface / Humanoid Whole-Body Manipulation from Robot-Free Demonstrations, arXiv 2602.06643, 2026) lets you collect whole-body manipulation demos without a robot.
The idea: a human wears a data-collection interface (handheld grippers + camera + whole-body pose tracking) and performs the task themselves. The system records hand trajectory, gripper state, and whole-body pose — enough information to map onto a robot later, but without a robot present during collection.
Pros:
- Much cheaper than teleop. No robot at each station; just the interface. Many people can collect in parallel.
- High throughput. Humans manipulate naturally, quickly, anywhere (a real kitchen, a real warehouse).
- Captures whole-body pose — crucial for whole-body, not just hands (unlike "manipulation interfaces" that capture only the wrist, like UMI).
Cons:
- Morphology domain gap. A human hand ≠ a robot hand; careful retargeting to G1 kinematics is needed.
- Action labels are inferred from tracking, not direct robot commands → slightly less accurate than teleop.
HuMI is the sweet spot between "accurate but expensive" (teleop) and "cheap but vague" (raw video). It is a very strong option for anyone who wants to scale whole-body data on a limited budget.
Strategy 3: Action-free egocentric video (cheapest, vaguest)
Introduced in Part 5: use first-person video (Ego4D, YouTube, self-recorded) — with no action labels at all.
Pros:
- Cheapest, massive scale. Thousands of hours of video are freely available. No robot, no special interface.
- Diverse tasks, environments, objects — great for learning general representations.
Cons:
- No action labels. You must use latent-action learning (like WholeBodyVLA) or inverse dynamics to "guess" actions — never as accurate as real labels.
- The largest domain gap. Human ≠ robot in both morphology and camera.
Action-free video is great for pretraining (learning representations, intent), but almost always needs fine-tuning with teleop or robot-free demos to anchor into real actions.
Consolidated comparison
| Criterion | Teleop | Robot-free (HuMI) | Action-free video |
|---|---|---|---|
| Needs a real robot | Yes | No | No |
| Action label accuracy | Highest | Medium (inferred) | None (must guess) |
| Cost per demo | Very high | Low–medium | Very low |
| Throughput | Low | High | Very high |
| Domain gap | None (same robot) | Medium (human→robot) | Large |
| Feasible scale | Small (hundreds–thousands) | Medium–large | Massive |
| Whole-body pose | Yes | Yes | Partial (inferred) |
| Best used for | Quality fine-tuning | Scaling whole-body data | Representation pretraining |
A combined strategy: which one, when?
The best practice isn't picking one, but layering by budget:
Action-free video (cheap, massive)
↓ representation + latent-action pretraining
Robot-free HuMI demos (moderately cheap, whole-body pose)
↓ main whole-body manipulation learning
Teleop (expensive, few, accurate)
↓ final fine-tune anchored to real robot kinematics
Policy deployed on Unitree G1
Budget-based recommendations:
- Low budget, no robot yet: start with action-free video + a HuMI interface. Pretraining + robot-free demos give you a policy that "knows the task" without a robot.
- One robot, want high quality: use teleop for a small batch of quality demos, combine HuMI to scale, video to pretrain.
- Many robots, building a product: teleop is the backbone, but still pretrain with video to reduce the number of teleop demos needed.
Notes for collecting data for the Unitree G1
- Frame synchronization. Every data source must ultimately be brought into G1's robot frame. Calibration and retargeting are the hardest engineering parts — don't underestimate them.
- Quality > quantity. 200 clean, on-task, position-diverse demos usually beat 2,000 noisy ones. Especially with DP3 (Part 3), which is sample-efficient.
- Record enough modalities. For this series' 3D pipeline, capture RGB (for Robo3R), depth/point cloud if available, robot/human state, and gripper state.
Conclusion: data is strategy, not luck
The big lesson: choosing a data-collection strategy is an architectural decision, on par with choosing a model. Teleop, robot-free HuMI, and action-free video aren't rivals — they are three tiers of one data pyramid. Understanding the trade-off axes (label accuracy / cost / domain gap) lets you invest in the right place instead of burning money on teleop when video pretraining would have sufficed.
Now you have all the pieces: 3D perception (Parts 2–4), policy (Part 3, 5), and data (this part). Part 7 ties them all into a concrete deployment roadmap for the Unitree G1.
Related Posts
- Part 5: WholeBodyVLA — Egocentric Video + RL — Action-free video data in practice
- Part 7: A 3D Manipulation Roadmap for the Unitree G1 — Combining data + perception + policy
- GR00T N1 + Unitree G1: Data Collection — A detailed teleop pipeline


