Data Collection: Teleop vs Robot-Free vs Egocentric Video

Part 5 ended on a blunt note: data is the bottleneck. You can have the prettiest VLA architecture and the best 3D perception, but without enough high-quality data to learn from, the policy will not learn the task. For whole-body manipulation, data collection is especially hard because every demo has to capture both fine hand motion and whole-body coordination.

This article compares the three main data-collection strategies (as of June 2026) and helps you pick the right one for your budget.

[Figure: HuMI (Humanoid Manipulation Interface), collecting whole-body manipulation demos without a robot. Source: arXiv 2602.06643]

The three trade-off axes of any data strategy

Before looking at each approach, it helps to understand the three axes every strategy must balance:

Action label accuracy. A behavior-cloning policy needs to know, for each state, what the robot should do. The closer the action labels are to the robot's real kinematics, the more easily the policy learns.
Cost & throughput. How much money and time to collect 1,000 demos? Do you need a real robot? Do you need skilled operators?
Domain gap. Does the collected data match the target robot (e.g. Unitree G1), or does it require a conversion step (retargeting) that loses information?

No strategy wins all three axes. That is why you must choose, not look for "the best one."

Strategy 1: Robot teleoperation (accurate, expensive)

An operator controls a real robot through a device — a VR controller (Meta Quest), an exoskeleton, or a master-slave rig. Each time the human acts, the robot mirrors it; we record observations + joint commands.
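The recording side of teleop can be sketched as a small episode log. This is a minimal illustration, not the format of any particular teleop stack: the class names, field names, and the two-joint toy values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TeleopStep:
    timestamp_s: float
    observation: Dict[str, List[float]]  # hypothetical: e.g. joint positions, image features
    joint_command: List[float]           # the command actually sent to the robot


@dataclass
class TeleopEpisode:
    task: str
    steps: List[TeleopStep] = field(default_factory=list)

    def record(self, timestamp_s: float, observation: Dict[str, List[float]],
               joint_command: List[float]) -> None:
        # Copy the command so later mutation by the caller cannot change the log.
        self.steps.append(TeleopStep(timestamp_s, observation, list(joint_command)))

    def to_bc_pairs(self) -> List[Tuple[Dict[str, List[float]], List[float]]]:
        # Behavior cloning trains on (observation, action) pairs.
        return [(s.observation, s.joint_command) for s in self.steps]


episode = TeleopEpisode(task="pick_cup")
episode.record(0.00, {"joint_pos": [0.0, 0.0]}, [0.1, -0.1])
episode.record(0.02, {"joint_pos": [0.1, -0.1]}, [0.2, -0.2])
pairs = episode.to_bc_pairs()
```

Because the action in each pair is the robot's own joint command, no conversion step sits between the log and the training set.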

Pros:

Most accurate action labels. Because the robot itself executes the motion, every recorded joint command is one the robot actually received and is consistent with its kinematics, so there is no embodiment domain gap.
High-quality demos, usable directly for behavior cloning.

Cons:

Needs a real robot (a G1 is ~tens of thousands of USD) for each collection station.
Low throughput. One person, one robot, sequentially. Collecting 1,000 demos takes many days.
Operator fatigue; quality degrades over time.
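To see why 1,000 demos takes many days, a back-of-the-envelope calculation helps. The numbers below (3 minutes per attempt including reset, 6 productive hours per day, 80 of every 100 attempts kept) are illustrative assumptions, not measurements.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def collection_days(target_demos: int,
                    minutes_per_attempt: int,
                    productive_minutes_per_day: int,
                    kept_per_100_attempts: int = 100,
                    stations: int = 1) -> int:
    """Days needed to reach target_demos kept demos (all inputs are assumptions)."""
    attempts_needed = ceil_div(target_demos * 100, kept_per_100_attempts)
    attempts_per_day = (productive_minutes_per_day // minutes_per_attempt) * stations
    return ceil_div(attempts_needed, attempts_per_day)


# One operator, one robot: 1,250 attempts at 120 attempts per day.
days_one_station = collection_days(1000, 3, 360, kept_per_100_attempts=80)

# Four stations in parallel, which for teleop also means four robots.
days_four_stations = collection_days(1000, 3, 360, kept_per_100_attempts=80, stations=4)
```

Under these assumptions a single station needs 11 working days and four stations need 3. The only way to speed teleop up is to add stations, and each station needs its own robot.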

This is the approach an earlier series on this blog used (data collection for GR00T N1). It is the best on quality and the worst on scale.

Strategy 2: Robot-free demonstrations — HuMI

This is a clever way to break the cost bottleneck. HuMI (Humanoid Manipulation Interface / Humanoid Whole-Body Manipulation from Robot-Free Demonstrations, arXiv 2602.06643, 2026) lets you collect whole-body manipulation demos without a robot.

The idea: a human wears a data-collection interface (handheld grippers + camera + whole-body pose tracking) and performs the task themselves. The system records hand trajectory, gripper state, and whole-body pose — enough information to map onto a robot later, but without a robot present during collection.
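As a rough picture of what one recorded frame might contain, here is a hypothetical record type. The field names, units, and keypoint names are assumptions for illustration only and are not taken from the HuMI paper or its data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class RobotFreeFrame:
    timestamp_s: float
    hand_position_m: Dict[str, Vec3]   # "left" / "right" -> position in meters
    gripper_opening: Dict[str, float]  # 0.0 = closed, 1.0 = fully open
    body_keypoints_m: Dict[str, Vec3]  # named whole-body keypoints, in meters


def hand_trajectory(frames: List[RobotFreeFrame], side: str) -> List[Vec3]:
    """Return one hand's positions in time order, ready for later retargeting."""
    ordered = sorted(frames, key=lambda f: f.timestamp_s)
    return [f.hand_position_m[side] for f in ordered]


# Two frames, deliberately stored out of order.
frames = [
    RobotFreeFrame(0.1, {"right": (0.31, 0.0, 1.0)}, {"right": 0.5},
                   {"pelvis": (0.0, 0.0, 0.9)}),
    RobotFreeFrame(0.0, {"right": (0.30, 0.0, 1.0)}, {"right": 1.0},
                   {"pelvis": (0.0, 0.0, 0.9)}),
]
right_path = hand_trajectory(frames, "right")
```

Nothing in the record refers to a robot joint. The mapping to robot actions happens later, in retargeting.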

Pros:

Much cheaper than teleop. No robot at each station; just the interface. Many people can collect in parallel.
High throughput. Humans manipulate naturally, quickly, anywhere (a real kitchen, a real warehouse).
Captures whole-body pose, which is crucial for whole-body manipulation and not just the hands (unlike "manipulation interfaces" such as UMI that capture only the wrist).

Cons:

Morphology domain gap. A human hand ≠ a robot hand; careful retargeting to G1 kinematics is needed.
Action labels are inferred from tracking, not direct robot commands → slightly less accurate than teleop.
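One very simple form of retargeting is to scale the human shoulder-to-wrist vector by the ratio of arm lengths and clamp it to the robot's reach. The sketch below assumes the human and robot shoulder frames share the same axes, and the arm lengths are made-up values rather than G1 specifications. Real retargeting also handles orientation, joint limits, and whole-body balance.

```python
import math
from typing import List, Sequence


def retarget_wrist(human_wrist: Sequence[float],
                   human_shoulder: Sequence[float],
                   human_arm_length_m: float,
                   robot_arm_length_m: float) -> List[float]:
    """Map a human wrist position to a target in the robot's shoulder frame."""
    # Wrist position relative to the human shoulder.
    offset = [w - s for w, s in zip(human_wrist, human_shoulder)]
    # Scale by the arm-length ratio so relative reach is preserved.
    scale = robot_arm_length_m / human_arm_length_m
    scaled = [scale * c for c in offset]
    # Clamp to the robot's maximum reach if tracking noise pushed the target outside it.
    reach = math.sqrt(sum(c * c for c in scaled))
    if reach > robot_arm_length_m:
        scaled = [c * robot_arm_length_m / reach for c in scaled]
    return scaled


# Illustrative numbers only: 0.60 m human arm, 0.45 m robot arm.
target = retarget_wrist((0.3, 0.0, 0.0), (0.0, 0.0, 0.0), 0.60, 0.45)
clamped = retarget_wrist((0.9, 0.0, 0.0), (0.0, 0.0, 0.0), 0.60, 0.45)
```

The clamp is where information gets lost: two different human poses can map to the same robot target. That is one concrete source of the domain gap described above.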

HuMI is the sweet spot between "accurate but expensive" (teleop) and "cheap but vague" (raw video). It is a very strong option for anyone who wants to scale whole-body data on a limited budget.

Strategy 3: Action-free egocentric video (cheapest, vaguest)

Introduced in Part 5: use first-person video (Ego4D, YouTube, self-recorded) — with no action labels at all.

Pros:

Cheapest, massive scale. Thousands of hours of video are freely available. No robot, no special interface.
Diverse tasks, environments, objects — great for learning general representations.

Cons:

No action labels. You must use latent-action learning (like WholeBodyVLA) or inverse dynamics to "guess" actions — never as accurate as real labels.
The largest domain gap. Human ≠ robot in both morphology and camera.

Action-free video is great for pretraining (learning representations, intent), but almost always needs fine-tuning with teleop or robot-free demos to anchor into real actions.
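The inverse-dynamics idea can be shown on a toy problem: fit a model that predicts the action from a state change using a small labeled set, then use it to pseudo-label transitions that have no actions. The scalar linear model below is a deliberate simplification chosen for readability. Real systems use learned networks over images, and this is not how WholeBodyVLA derives its latent actions.

```python
from typing import List, Tuple


def fit_inverse_dynamics(transitions: List[Tuple[float, float, float]]) -> float:
    """Least-squares fit of action ~= k * (next_state - state) on labeled data."""
    num = sum((s1 - s0) * a for s0, s1, a in transitions)
    den = sum((s1 - s0) ** 2 for s0, s1, _ in transitions)
    return num / den


def pseudo_label(k: float, state_pairs: List[Tuple[float, float]]) -> List[float]:
    """Guess an action for each unlabeled (state, next_state) pair."""
    return [k * (s1 - s0) for s0, s1 in state_pairs]


# Small labeled set (e.g. from teleop): (state, next_state, action), toy values.
labeled = [(0.0, 0.1, 0.2), (0.1, 0.3, 0.4), (0.3, 0.2, -0.2)]
k = fit_inverse_dynamics(labeled)

# Unlabeled transitions standing in for states extracted from video.
guessed_actions = pseudo_label(k, [(0.0, 0.5), (0.5, 0.25)])
```

The guessed actions are only as good as the fitted model, which is why the labeled set from teleop or robot-free demos still matters.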

Consolidated comparison

Criterion	Teleop	Robot-free (HuMI)	Action-free video
Needs a real robot	Yes	No	No
Action label accuracy	Highest	Medium (inferred)	None (must guess)
Cost per demo	Very high	Low–medium	Very low
Throughput	Low	High	Very high
Domain gap	None (same robot)	Medium (human→robot)	Large
Feasible scale	Small (hundreds–thousands)	Medium–large	Massive
Whole-body pose	Yes	Yes	Partial (inferred)
Best used for	Quality fine-tuning	Scaling whole-body data	Representation pretraining
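The table can also be encoded as data and filtered by hard constraints. The ordinal scores below (0 to 3) are my own rough encoding of the table's qualitative labels, not measured quantities.

```python
from typing import Dict, List

# Ordinal encoding of the comparison table (assumed scale: higher = more).
STRATEGIES: Dict[str, Dict[str, object]] = {
    "teleop": {"needs_robot": True, "label_accuracy": 3, "cost": 3, "throughput": 1},
    "robot_free": {"needs_robot": False, "label_accuracy": 2, "cost": 2, "throughput": 2},
    "action_free_video": {"needs_robot": False, "label_accuracy": 0, "cost": 1, "throughput": 3},
}


def candidates(have_robot: bool, min_label_accuracy: int) -> List[str]:
    """Strategies that are feasible and meet a minimum label-accuracy level."""
    return [
        name
        for name, props in STRATEGIES.items()
        if (have_robot or not props["needs_robot"])
        and props["label_accuracy"] >= min_label_accuracy
    ]


no_robot_with_labels = candidates(have_robot=False, min_label_accuracy=1)
needs_best_labels = candidates(have_robot=True, min_label_accuracy=3)
```

With no robot and a requirement for some action labels, only the robot-free option remains. Requiring the highest label accuracy leaves only teleop.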

A combined strategy: which one, when?

The best practice isn't picking one, but layering by budget:

Action-free video (cheap, massive)
        ↓ representation + latent-action pretraining
Robot-free HuMI demos (moderately cheap, whole-body pose)
        ↓ main whole-body manipulation learning
Teleop (expensive, few, accurate)
        ↓ final fine-tune anchored to real robot kinematics
Policy deployed on Unitree G1

Budget-based recommendations:

Low budget, no robot yet: start with action-free video + a HuMI interface. Pretraining + robot-free demos give you a policy that "knows the task" without a robot.
One robot, want high quality: use teleop for a small batch of quality demos, add HuMI demos to scale, and use video to pretrain.
Many robots, building a product: teleop is the backbone, but still pretrain with video to reduce the number of teleop demos needed.
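The three recommendations above can be written as a small planning function. The stage names and the rule of branching only on robot count are simplifying assumptions, since a real decision would also weigh operator time and task difficulty.

```python
from typing import List


def plan_stages(num_robots: int) -> List[str]:
    """Return training stages in order, following the budget-based recommendations."""
    if num_robots < 0:
        raise ValueError("num_robots must be non-negative")
    # Video pretraining appears in every recommendation.
    stages = ["pretrain on action-free video"]
    if num_robots <= 1:
        # With zero or one robot, robot-free demos provide the scale.
        stages.append("train on robot-free demos")
    if num_robots == 1:
        stages.append("fine-tune on a small teleop batch")
    elif num_robots > 1:
        # With many robots, teleop becomes the backbone.
        stages.append("train on teleop demos as the main dataset")
    return stages


no_robot_plan = plan_stages(0)
one_robot_plan = plan_stages(1)
fleet_plan = plan_stages(4)
```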

Notes for collecting data for the Unitree G1

Frame alignment. Every data source must ultimately be expressed in the G1's robot frame. Calibration and retargeting are the hardest engineering parts, so don't underestimate them.
Quality > quantity. 200 clean, on-task, position-diverse demos usually beat 2,000 noisy ones, especially with DP3 (Part 3), which is sample-efficient.
Record enough modalities. For this series' 3D pipeline, capture RGB (for Robo3R), depth/point cloud if available, robot/human state, and gripper state.
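Bringing a measurement into the robot frame comes down to applying a calibrated rigid transform. The sketch below uses a yaw-only rotation plus a translation to keep the math visible. The transform values are invented for the example, and a real calibration would estimate a full 3D rotation.

```python
import math
from typing import List, Sequence, Tuple

Matrix4 = List[List[float]]


def make_transform(yaw_rad: float, translation: Sequence[float]) -> Matrix4:
    """Homogeneous transform: rotate about z by yaw_rad, then translate."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T: Matrix4, point: Sequence[float]) -> Tuple[float, ...]:
    """Apply a 4x4 homogeneous transform to a 3D point."""
    h = (point[0], point[1], point[2], 1.0)
    return tuple(sum(T[r][c] * h[c] for c in range(4)) for r in range(3))


# Hypothetical calibration result: tracker frame expressed in the robot frame.
T_robot_from_tracker = make_transform(math.pi / 2, (0.5, 0.0, 1.0))

# A point 1 m along the tracker's x-axis, expressed in the robot frame.
p_robot = transform_point(T_robot_from_tracker, (1.0, 0.0, 0.0))
```

Here the point lands at roughly (0.5, 1.0, 1.0) in the robot frame. Any error in the calibrated transform shifts every action label in the dataset by the same amount, which is why calibration deserves the attention.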

Conclusion: data is strategy, not luck

The big lesson: choosing a data-collection strategy is an architectural decision, on par with choosing a model. Teleop, robot-free HuMI, and action-free video aren't rivals; they are three tiers of one data pyramid. Understanding the trade-off axes (label accuracy, cost, domain gap) lets you invest in the right place instead of spending teleop budget on what video pretraining or robot-free demos could have covered.

Now you have all the pieces: 3D perception (Parts 2–4), policy (Parts 3 and 5), and data (this part). Part 7 ties them all into a concrete deployment roadmap for the Unitree G1.

Part 5: WholeBodyVLA — Egocentric Video + RL — Action-free video data in practice
Part 7: A 3D Manipulation Roadmap for the Unitree G1 — Combining data + perception + policy
GR00T N1 + Unitree G1: Data Collection — A detailed teleop pipeline

Nguyễn Anh Tuấn

Related Posts

WholeBodyVLA: egocentric video + RL loco-manipulation

3D perception for humanoids: Omni-Manip & spatial reasoning

Why 2D VLAs are not enough for manipulation
