
Data Collection: Teleop vs Robot-Free vs Egocentric Video

Three strategies for whole-body manipulation data — teleop, robot-free demos (HuMI), and action-free egocentric video. A cost/quality/throughput comparison and how to choose by budget.

Nguyễn Anh Tuấn · June 14, 2026 · 6 min read

Part 5 ended on a blunt note: data is the bottleneck. You can have the prettiest VLA architecture and the best 3D perception, but without enough quality data to learn from, the policy stays dumb. And for whole-body manipulation, collecting data is especially hard — because you need both fine hand motion and whole-body coordination.

This article compares the three main data-collection strategies (as of June 2026) and helps you pick the right one for your budget.

Source: HuMI — Humanoid Manipulation Interface (arXiv 2602.06643), collecting whole-body demos without a robot.

The three trade-off axes of any data strategy

Before looking at each approach, it helps to understand the three axes every strategy must balance:

  1. Action label accuracy. A behavior-cloning policy needs to know, for each state, what the robot should do. The more faithful the action label is to the robot's kinematics, the easier the policy is to learn.
  2. Cost & throughput. How much money and time does it take to collect 1,000 demos? Do you need a real robot? Do you need skilled operators?
  3. Domain gap. Does the collected data match the target robot (e.g. Unitree G1), or does it require a conversion step (retargeting) that loses information?

No strategy wins all three axes. That is why you must choose, not look for "the best one."
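To make "no strategy wins all three axes" concrete, here is a minimal sketch that encodes each strategy as ordinal scores and checks for Pareto dominance. The numeric scores are an assumption: they are only a rough translation of this article's qualitative ratings, not measured values, and the `Strategy` class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    label_accuracy: int  # 0 = no labels, 3 = direct robot commands (assumed scale)
    affordability: int   # higher = cheaper per demo and higher throughput
    domain_match: int    # higher = smaller gap to the target robot


# Assumed ordinal scores, loosely derived from the article's qualitative ratings.
STRATEGIES = [
    Strategy("teleop", 3, 1, 3),
    Strategy("robot_free", 2, 2, 2),
    Strategy("egocentric_video", 0, 3, 1),
]


def dominates(a, b):
    """True if a is at least as good as b on every axis and differs on at least one."""
    av = (a.label_accuracy, a.affordability, a.domain_match)
    bv = (b.label_accuracy, b.affordability, b.domain_match)
    return all(x >= y for x, y in zip(av, bv)) and av != bv


def pareto_front(strategies):
    """Names of the strategies that no other strategy dominates."""
    return [s.name for s in strategies
            if not any(dominates(other, s) for other in strategies)]


print(pareto_front(STRATEGIES))
```

Under these assumed scores all three strategies stay on the Pareto front, which is another way of saying that each one gives up something the others keep.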

Strategy 1: Robot teleoperation (accurate, expensive)

An operator controls a real robot through a device — a VR controller (Meta Quest), an exoskeleton, or a master-slave rig. Each time the human acts, the robot mirrors it; we record observations + joint commands.
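What "record observations + joint commands" might look like in code: the sketch below is a hypothetical in-memory episode log, not the format of any particular teleop stack. The field names (`observation`, `joint_command`) and the dictionary layout are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TeleopStep:
    t: float              # seconds since episode start
    observation: dict     # e.g. joint positions, camera frame ids (assumed layout)
    joint_command: list   # command sent to the robot at this step


@dataclass
class TeleopEpisode:
    task: str
    steps: list = field(default_factory=list)

    def record(self, t, observation, joint_command):
        """Append one synchronized (observation, command) sample."""
        self.steps.append(TeleopStep(t, observation, list(joint_command)))

    def bc_pairs(self):
        """Return (observation, action) pairs, the form behavior cloning trains on."""
        return [(s.observation, s.joint_command) for s in self.steps]


episode = TeleopEpisode(task="pick_up_cup")
episode.record(0.00, {"q": [0.0, 0.0]}, [0.1, 0.0])
episode.record(0.02, {"q": [0.1, 0.0]}, [0.2, 0.1])
print(len(episode.bc_pairs()))  # 2
```

The point of the structure is that the action in each pair is the command the robot actually received, so no conversion step is needed before training.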

Pros:

  • The most accurate action labels. Because the robot itself executes the motion, every recorded joint command is one the robot actually received, so the labels are consistent with the robot's kinematics and there is no embodiment gap.
  • High-quality demos, usable directly for behavior cloning.

Cons:

  • Needs a real robot at each collection station (a G1 costs roughly tens of thousands of USD).
  • Low throughput. One person, one robot, one demo at a time. Collecting 1,000 demos takes many days.
  • Operator fatigue; quality degrades over time.
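A back-of-the-envelope estimate shows where "many days" comes from. The numbers below (3 minutes per demo including reset, 5 productive hours per operator per day) are assumptions for illustration, not measurements from any project, and `collection_days` is a hypothetical helper.

```python
import math


def collection_days(num_demos, minutes_per_demo, stations, productive_hours_per_day):
    """Working days needed to collect num_demos across parallel stations."""
    if min(num_demos, minutes_per_demo, stations, productive_hours_per_day) <= 0:
        raise ValueError("all arguments must be positive")
    total_minutes = num_demos * minutes_per_demo
    minutes_per_day = stations * productive_hours_per_day * 60
    return math.ceil(total_minutes / minutes_per_day)


# Assumed: one teleop station, 3 min per demo, 5 productive hours per day.
print(collection_days(1000, 3.0, 1, 5.0))   # 10
# Same assumptions, but ten stations running in parallel.
print(collection_days(1000, 3.0, 10, 5.0))  # 1
```

With a single robot the station count is fixed at one, so the only ways to shorten collection are faster demos or longer days, and both work against operator fatigue.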

An earlier series on this blog took this approach (data collection for GR00T N1). It is the best option for quality and the worst for scale.

Strategy 2: Robot-free demonstrations — HuMI

This is a clever way to break the cost bottleneck. HuMI (Humanoid Manipulation Interface / Humanoid Whole-Body Manipulation from Robot-Free Demonstrations, arXiv 2602.06643, 2026) lets you collect whole-body manipulation demos without a robot.

The idea: a human wears a data-collection interface (handheld grippers + camera + whole-body pose tracking) and performs the task themselves. The system records hand trajectory, gripper state, and whole-body pose — enough information to map onto a robot later, but without a robot present during collection.

Pros:

  • Much cheaper than teleop. No robot at each station; just the interface. Many people can collect in parallel.
  • High throughput. Humans manipulate naturally, quickly, anywhere (a real kitchen, a real warehouse).
  • Captures whole-body pose, which is crucial for whole-body manipulation rather than hand-only tasks (unlike manipulation interfaces such as UMI, which capture only the wrist).

Cons:

  • Morphology domain gap. A human hand is not a robot hand, so careful retargeting to G1 kinematics is needed.
  • Action labels are inferred from tracking rather than recorded from direct robot commands, so they are slightly less accurate than teleop labels.
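To give a feel for what retargeting involves, here is a deliberately simple sketch: it rescales the human shoulder-to-wrist vector by the ratio of arm lengths and clamps the result to the robot's reach. This is not HuMI's method, which is not described here. The function name, the arm lengths, and the shoulder position are all hypothetical, and a real pipeline would also handle orientation, joint limits, and inverse kinematics.

```python
import math


def retarget_wrist(human_wrist, human_shoulder, human_arm_len,
                   robot_arm_len, robot_shoulder):
    """Map a tracked human wrist position to a robot wrist target (positions in meters)."""
    # Wrist position relative to the human shoulder.
    rel = [w - s for w, s in zip(human_wrist, human_shoulder)]
    # Scale by the arm-length ratio so a full human reach maps to a full robot reach.
    scale = robot_arm_len / human_arm_len
    scaled = [scale * r for r in rel]
    # Clamp to the robot's reach in case tracking noise overshoots.
    dist = math.sqrt(sum(c * c for c in scaled))
    if dist > robot_arm_len:
        scaled = [c * robot_arm_len / dist for c in scaled]
    # Express the target relative to the robot's shoulder.
    return [s + c for s, c in zip(robot_shoulder, scaled)]


# Hypothetical numbers: 0.60 m human arm, 0.45 m robot arm.
target = retarget_wrist(
    human_wrist=(0.3, 0.0, 0.0),
    human_shoulder=(0.0, 0.0, 0.0),
    human_arm_len=0.60,
    robot_arm_len=0.45,
    robot_shoulder=(0.0, 0.1, 0.3),
)
print(target)
```

Even this toy version shows where information is lost: anything the human does that the robot cannot reach is clamped or distorted, and that distortion ends up in the action label.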

HuMI is the sweet spot between "accurate but expensive" (teleop) and "cheap but vague" (raw video). It is a very strong option for anyone who wants to scale whole-body data on a limited budget.

Strategy 3: Action-free egocentric video (cheapest, vaguest)

Introduced in Part 5: use first-person video (Ego4D, YouTube, self-recorded) — with no action labels at all.

Pros:

  • Cheapest, massive scale. Thousands of hours of video are freely available. No robot, no special interface.
  • Diverse tasks, environments, objects — great for learning general representations.

Cons:

  • No action labels. You must use latent-action learning (as in WholeBodyVLA) or an inverse dynamics model to infer actions, and inferred actions are generally less accurate than real labels.
  • The largest domain gap. Humans differ from the robot in both morphology and camera viewpoint.
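The inverse-dynamics route can be illustrated with a toy model: fit a mapping from (state, next state) to action on a small labeled set, then use it to pseudo-label transitions that have no actions. The sketch below assumes the simplest possible form, action ≈ gain × (next state − state) with a single scalar gain fitted by least squares. Real systems learn this mapping with neural networks over image features, and the function names here are hypothetical.

```python
def fit_inverse_dynamics(labeled):
    """Least-squares fit of a scalar gain k such that action ~= k * (next_state - state).

    labeled: list of (state, next_state, action) triples of equal-length vectors.
    """
    num = 0.0
    den = 0.0
    for state, next_state, action in labeled:
        for s, n, a in zip(state, next_state, action):
            delta = n - s
            num += delta * a
            den += delta * delta
    if den == 0.0:
        raise ValueError("labeled transitions contain no state change")
    return num / den


def pseudo_label(gain, transitions):
    """Infer actions for (state, next_state) pairs that have no action labels."""
    return [[gain * (n - s) for s, n in zip(state, next_state)]
            for state, next_state in transitions]


# Small labeled set (e.g. from teleop), built here so that action = 2 * delta.
labeled = [
    ([0.0, 0.0], [1.0, 2.0], [2.0, 4.0]),
    ([1.0, 1.0], [1.5, 0.0], [1.0, -2.0]),
]
gain = fit_inverse_dynamics(labeled)

# Action-free transitions (e.g. states estimated from video).
print(pseudo_label(gain, [([0.0, 0.0], [0.5, 0.25])]))  # [[1.0, 0.5]]
```

Note that the fit still needs some labeled transitions, which is one reason action-free video rarely stands alone.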

Action-free video is great for pretraining (learning representations and intent), but it almost always needs fine-tuning on teleop or robot-free demos to ground the policy in real robot actions.

Consolidated comparison

| Criterion | Teleop | Robot-free (HuMI) | Action-free video |
|---|---|---|---|
| Needs a real robot | Yes | No | No |
| Action label accuracy | Highest | Medium (inferred) | None (must guess) |
| Cost per demo | Very high | Low–medium | Very low |
| Throughput | Low | High | Very high |
| Domain gap | None (same robot) | Medium (human→robot) | Large |
| Feasible scale | Small (hundreds–thousands) | Medium–large | Massive |
| Whole-body pose | Yes | Yes | Partial (inferred) |
| Best used for | Quality fine-tuning | Scaling whole-body data | Representation pretraining |

A combined strategy: which one, when?

The best practice is not to pick one strategy but to layer them according to budget:

Action-free video (cheap, massive)
        ↓ representation + latent-action pretraining
Robot-free HuMI demos (moderately cheap, whole-body pose)
        ↓ main whole-body manipulation learning
Teleop (expensive, few, accurate)
        ↓ final fine-tune anchored to real robot kinematics
Policy deployed on Unitree G1

Budget-based recommendations:

  • Low budget, no robot yet: start with action-free video + a HuMI interface. Pretraining + robot-free demos give you a policy that "knows the task" without a robot.
  • One robot, want high quality: use teleop for a small batch of high-quality demos, add HuMI to scale up, and use video to pretrain.
  • Many robots, building a product: teleop is the backbone, but still pretrain with video to reduce the number of teleop demos needed.
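The recommendations above can be written down as a small decision function. This is only a sketch of the article's rule of thumb: the function name and tier labels are hypothetical, and treating "many robots" as two or more is an assumption made to keep the example short.

```python
def recommend_plan(num_robots):
    """Return an ordered list of (data tier, role) pairs for a given robot count."""
    if num_robots < 0:
        raise ValueError("num_robots must be non-negative")
    # Every budget level starts with cheap, large-scale video pretraining.
    plan = [("egocentric_video", "representation and latent-action pretraining")]
    if num_robots == 0:
        plan.append(("robot_free_demos", "main whole-body manipulation learning"))
    elif num_robots == 1:
        plan.append(("robot_free_demos", "scale up whole-body data"))
        plan.append(("teleop", "small, high-quality fine-tuning batch"))
    else:
        # Assumption: two or more robots counts as "many".
        plan.append(("teleop", "backbone dataset"))
    return plan


for robots in (0, 1, 4):
    print(robots, [tier for tier, _ in recommend_plan(robots)])
```

In every branch the cheapest tier comes first and the most accurate tier comes last, which mirrors the layering in the diagram.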

Notes for collecting data for the Unitree G1

  • Frame alignment. Every data source must ultimately be expressed in the G1's robot frame. Calibration and retargeting are the hardest engineering parts, so don't underestimate them.
  • Quality > quantity. 200 clean, on-task, position-diverse demos usually beat 2,000 noisy ones, especially with a sample-efficient policy such as DP3 (Part 3).
  • Record enough modalities. For this series' 3D pipeline, capture RGB (for Robo3R), depth/point cloud if available, robot/human state, and gripper state.
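Bringing data into the robot frame comes down to applying calibrated rigid transforms. The sketch below uses plain 4×4 homogeneous matrices and shows how a point measured in a tracker frame would be expressed in the robot frame. The transform values are hypothetical, not a real G1 calibration, and the frame names are assumptions.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def compose(A, B):
    """Return A @ B, the transform that applies B first and then A."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Hypothetical calibration: tracker frame is rotated 90 degrees about z
# and offset by (0.5, 0.0, 0.2) m relative to the robot frame.
T_robot_from_tracker = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.2],
    [0.0,  0.0, 0.0, 1.0],
]

wrist_in_tracker = (1.0, 0.0, 0.0)
print(transform_point(T_robot_from_tracker, wrist_in_tracker))  # (0.5, 1.0, 0.2)
```

If a camera is calibrated against the tracker rather than the robot, `compose(T_robot_from_tracker, T_tracker_from_camera)` chains the two calibrations, and any error in either one carries through to every recorded action.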

Conclusion: data is strategy, not luck

The big lesson: choosing a data-collection strategy is an architectural decision, on par with choosing a model. Teleop, robot-free HuMI, and action-free video aren't rivals — they are three tiers of one data pyramid. Understanding the trade-off axes (label accuracy / cost / domain gap) lets you invest in the right place instead of burning money on teleop when video pretraining would have sufficed.

Now you have all the pieces: 3D perception (Parts 2–4), policy (Parts 3 and 5), and data (this part). Part 7 ties them all into a concrete deployment roadmap for the Unitree G1.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
