
Teleoperation: Real-World Robot Data Collection

Compare AgiBot, Unitree XR, ALOHA 2, Mobile ALOHA, and GELLO to choose a real-world teleoperation data stack by budget.

Nguyễn Anh Tuấn | June 12, 2026 | 15 min read

In part 1 of this series, we mapped the humanoid robot data war: who has data, who opens data, and who keeps data as a moat. Part 2 moves down to the practical layer: how is that data actually collected?

The short answer is teleoperation. A human controls a real robot while the system records camera streams, joint states, action commands, and force or tactile signals if available. Each session is then turned into trajectories for imitation-learning methods such as ACT and Diffusion Policy, or for VLA training. This is how AgiBot built AgiBot World, how the ALOHA community created bimanual manipulation datasets, and why repositories like Unitree xr_teleoperate matter: they turn expensive robots into data collection machines.

This article assumes no prior teleoperation experience. The goal is simple: by the end, you should know which stack fits your budget.

  • If you want a real humanoid: look at Unitree G1/H1 plus XR teleoperation.
  • If you want high-quality bimanual manipulation in a lab: use ALOHA 2 or Mobile ALOHA.
  • If you need the cheapest credible starting point: use GELLO or a LeRobot-compatible arm.
  • If you want to scale like a large company: study AgiBot, but understand that the real difficulty is operator operations, QA, and data standardization.

Series roadmap

| Part | Title | Status |
|------|-------|--------|
| 1 | The Data War: Who Owns Humanoid Robot Data? | Published |
| 2 | Teleoperation: Real-World Robot Data Collection | This article |
| 3 | Human Video Mining: Learning from Humans | Next |
| 4 | Synthetic Data Pipelines: From Sim to Real | Coming |
| 5 | VLA Data Scaling Laws | Coming |
| 6 | Data Strategy: What Should You Collect? | Coming |
| 7 | Open vs Closed: Licenses, Data Moats & What's Next | Coming |

What teleoperation means in robot learning

Teleoperation means controlling a robot with a human input device: leader arms, XR headsets, control cabins, joysticks, haptic devices, or a mechanical copy of the target arm. In robot learning, teleoperation has a specific purpose: record demonstrations clean enough for a policy to learn from.

A minimal trajectory usually contains:

episode_000123
├── observations
│   ├── images/cam_high        # RGB or RGB-D
│   ├── images/cam_left_wrist  # wrist camera
│   ├── state                  # joint position, velocity, gripper state
│   └── optional: force, tactile, IMU, base pose
├── actions
│   ├── joint targets          # next-step action
│   └── optional: base action, hand target, force target
├── metadata
│   ├── task: "fold towel"
│   ├── success: true
│   ├── operator_id
│   └── reset notes
└── language
    └── "fold the towel and place it on the tray"

For beginners, the key rule is: do not record only video. Video shows what happened. A robot policy also needs to know what command the robot received at each time step. Real robot datasets must synchronize three streams: vision, robot state, and action.
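
One common way to meet the synchronization requirement is nearest-timestamp matching: for every camera frame, find the state and action samples recorded closest in time, and drop the frame if nothing is close enough. The sketch below is a minimal illustration, assuming each stream carries its own timestamps in seconds. The function names, the 20 ms tolerance, and the sample values are hypothetical and do not come from any of the stacks discussed here.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the value in sorted `timestamps` closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(image_ts, state_ts, action_ts, tolerance_s=0.02):
    """Pair every image frame with the nearest state and action sample.

    Frames whose nearest state or action is further away than
    `tolerance_s` are dropped rather than paired with stale data.
    Returns a list of (image_idx, state_idx, action_idx) tuples.
    """
    aligned = []
    for img_idx, t in enumerate(image_ts):
        s = nearest_index(state_ts, t)
        a = nearest_index(action_ts, t)
        state_ok = abs(state_ts[s] - t) <= tolerance_s
        action_ok = abs(action_ts[a] - t) <= tolerance_s
        if state_ok and action_ok:
            aligned.append((img_idx, s, a))
    return aligned


# Hypothetical timestamps: three camera frames, faster state/action streams.
image_ts = [0.000, 0.033, 0.500]
state_ts = [0.001, 0.011, 0.021, 0.031, 0.041]
action_ts = [0.000, 0.010, 0.020, 0.030, 0.040]
triples = align_streams(image_ts, state_ts, action_ts)
print(triples)
```

The third frame is dropped because no state or action sample falls within the tolerance. Dropping is the conservative choice here; interpolating state between samples is a reasonable alternative when the state stream is smooth.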

AgiBot World: scaling through robots, operators, and QA

AgiBot World Colosseo is the clearest public example of industrial-scale teleoperation. The technical report describes more than 1 million trajectories across 217 tasks, collected by over 100 homogeneous robots in real-world scenarios. The OpenDriveLab/AgiBot-World repository states that the Beta release contains 1,003,672 trajectories and is built on LeRobot v2.1. The paper also emphasizes human-in-the-loop verification: humans do not only operate robots; they also check data quality before the data enters training.

The impressive part is not only volume. AgiBot turns teleoperation into a production line:

task design
  -> operator teleoperates AgiBot G1
  -> robot records multimodal streams
  -> success/failure is checked
  -> bad trajectories are filtered
  -> task, skill, and keyframe annotations are added
  -> data is converted into training format
  -> GO-1 or internal policies are trained
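
The check-and-filter steps in the middle of that pipeline can be pictured as a split over episode metadata. This is only a sketch under assumptions: the record fields and the `qa_flags` values are hypothetical and do not reflect AgiBot's actual tooling. Episodes that fail the check are held back rather than deleted, so they remain available for later review.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeRecord:
    """Hypothetical per-episode metadata used for filtering."""
    episode_id: str
    task: str
    success: bool
    qa_flags: list = field(default_factory=list)  # e.g. ["camera_shifted"]


def split_for_training(records):
    """Split records into (clean, held_back).

    An episode is 'clean' only if the task succeeded and the human
    reviewer raised no QA flags.
    """
    clean, held_back = [], []
    for record in records:
        if record.success and not record.qa_flags:
            clean.append(record)
        else:
            held_back.append(record)
    return clean, held_back


records = [
    EpisodeRecord("episode_000001", "fold towel", True),
    EpisodeRecord("episode_000002", "fold towel", False),
    EpisodeRecord("episode_000003", "fold towel", True, ["camera_shifted"]),
]
clean, held_back = split_for_training(records)
print([r.episode_id for r in clean])
```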

One public-source detail needs careful reading. The original AgiBot World Colosseo material identifies AgiBot G1 as the collection platform: a dual-arm humanoid with dexterous hands and visuo-tactile sensing. Some newer AGIBOT WORLD 2026 pages mention G2 or expansion to G2. For this article, treat G1 as the Colosseo collection stack and G2 as a later industrial platform within the same ecosystem.

Who should use this pattern? Companies with robot fleets, a data collection facility, trained operators, and a need to train generalist policies. For a small lab, the lesson is not "buy 100 robots." The lesson is to design the pipeline for scale from day one: one schema, one success checklist, one reset protocol, one format, and one QA process.

Unitree xr_teleoperate: humanoid teleop with Quest 3/PICO/AVP

If AgiBot is the fleet-scale reference, Unitree is the more accessible path for teams that want to work with a real humanoid. The unitreerobotics/xr_teleoperate repository supports Unitree G1, H1, H1_2, H2, Dex1-1 grippers, Dex3-1 hands, Inspire hands, and BrainCo hands. The supported XR devices include Apple Vision Pro, PICO 4 Ultra Enterprise, and Meta Quest 3.

A typical launch command looks like this:

python teleop_hand_and_arm.py \
  --xr-mode=hand \
  --arm=G1_29 \
  --ee=dex3 \
  --record

The operator wears a headset, opens the Vuer/WebRTC interface, sees the robot's first-person camera view, presses r to begin teleoperation, and presses s to start or stop recording an episode. The repository documents a default --frequency of 30 Hz and stores recorded data under xr_teleoperate/teleop/utils/data, with follow-up instructions for Unitree imitation learning and LeRobot usage.
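
A quick sanity check on a recorded episode is to compare the spacing of frame timestamps against the expected period, which is 1/30 s at the default 30 Hz. The helper below is a hypothetical sketch, not part of xr_teleoperate; it assumes you can extract a list of per-frame timestamps in seconds from the recording, and the 1.5x slack factor is an arbitrary illustrative threshold.

```python
def find_frame_gaps(timestamps, frequency_hz=30.0, slack=1.5):
    """Return (index, gap_seconds) for each interval longer than slack * period.

    `index` is the position of the frame that arrived late.
    """
    period = 1.0 / frequency_hz
    gaps = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt > slack * period:
            gaps.append((i, dt))
    return gaps


# Hypothetical recording at 30 Hz in which frames 3 and 4 never arrived.
frame_ids = [0, 1, 2, 5, 6]
timestamps = [k / 30.0 for k in frame_ids]
gaps = find_frame_gaps(timestamps)
print(gaps)
```

Here the only reported gap is the roughly 0.1 s jump before the fourth recorded frame. Episodes with many such gaps are candidates for rejection during QA.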

Strengths:

  • It feels natural for humanoid work: the operator uses head and hands to control view, arms, and hands.
  • There is a path from simulation to physical deployment.
  • It uses commercial hardware. Unitree lists G1 from around 13.5K USD publicly, while H1 is much more expensive.
  • It fits tasks that need body context: opening doors, picking objects from shelves, and manipulating in rooms.

Weaknesses:

  • Latency depends on Wi-Fi, WebRTC, headset browser behavior, image service, IK, and DDS. Unstable networking produces bad trajectories.
  • XR hand tracking is not always precise enough for small contact-rich manipulation.
  • Safety is real. Humanoids have mass, inertia, and fall risk.

Unitree XR is practical if you want humanoid data at a much lower budget than building a custom humanoid. But it does not automatically make you AgiBot. You still need task protocols, success/failure labels, reset discipline, camera standardization, and conversion into a training-ready format.

ALOHA 2: the lab standard for bimanual manipulation

ALOHA and ALOHA 2 are the best-known stacks for bimanual manipulation data. Instead of XR, ALOHA uses leader-follower arms. The operator moves two leader arms, and the follower arms reproduce the motion in the workspace. The mapping is mechanical and intuitive: left hand controls the left arm, right hand controls the right arm, and the leader grippers control the follower grippers.

ALOHA 2 improves the original ALOHA design: lower-friction grippers, better ergonomics, a simpler frame, smaller Intel RealSense D405 cameras, and a more accurate MuJoCo model. The ALOHA 2 page explicitly frames these changes around scaling data collection: more robots, more hours per robot, more task diversity, and less downtime.

The traditional ALOHA data format is HDF5. According to the Trossen ALOHA documentation, a stationary episode looks like this:

episode_000001.hdf5
├── observations/images/cam_high         uint8  (480, 640, 3)
├── observations/images/cam_low          uint8  (480, 640, 3)
├── observations/images/cam_left_wrist   uint8  (480, 640, 3)
├── observations/images/cam_right_wrist  uint8  (480, 640, 3)
├── observations/qpos                    float64 (14,)
├── observations/qvel                    float64 (14,)
└── action                               float64 (14,)

For Mobile ALOHA, the schema adds base_action:

action       float64 (14,)  # two arms plus grippers
base_action  float64 (2,)   # linear/angular or equivalent base command
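
Before training, it helps to confirm that every episode matches the expected schema. The sketch below checks names and per-step shapes taken from the listings above. It assumes that stored arrays carry a leading time axis, so a 400-step episode would store qpos as (400, 14); that is an assumption worth verifying against your own files. The function and constant names are hypothetical.

```python
# Per-step shapes copied from the stationary listing above.
STATIONARY_SCHEMA = {
    "observations/images/cam_high": (480, 640, 3),
    "observations/images/cam_low": (480, 640, 3),
    "observations/images/cam_left_wrist": (480, 640, 3),
    "observations/images/cam_right_wrist": (480, 640, 3),
    "observations/qpos": (14,),
    "observations/qvel": (14,),
    "action": (14,),
}
# Mobile episodes add a 2-D base command.
MOBILE_SCHEMA = {**STATIONARY_SCHEMA, "base_action": (2,)}


def check_episode(shapes, schema):
    """Return a list of problems; an empty list means the episode matches.

    `shapes` maps dataset name -> full stored shape, including the
    leading time axis (assumed).
    """
    problems = []
    lengths = set()
    for name, per_step in schema.items():
        if name not in shapes:
            problems.append(f"missing: {name}")
            continue
        shape = tuple(shapes[name])
        if not shape or shape[1:] != per_step:
            problems.append(f"bad shape for {name}: {shape}")
            continue
        lengths.add(shape[0])
    if len(lengths) > 1:
        problems.append(f"inconsistent episode lengths: {sorted(lengths)}")
    return problems


# Hypothetical 400-step stationary episode.
good = {name: (400,) + step for name, step in STATIONARY_SCHEMA.items()}
print(check_episode(good, STATIONARY_SCHEMA))
print(check_episode(good, MOBILE_SCHEMA))
```

With h5py, the `shapes` mapping could be built by opening the episode file read-only and reading the `.shape` attribute of each dataset, which keeps this check independent of how the file is loaded.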

ALOHA 2 is a strong fit when you need precise two-handed manipulation: opening boxes, folding towels, plugging cables, picking small objects, or assembly. It is not a full-body humanoid, but its bimanual data is extremely valuable because contact-rich manipulation is one of the hardest parts of robotics.

Mobile ALOHA: adding a base for whole-body manipulation

Mobile ALOHA extends ALOHA with a mobile base and whole-body teleoperation. The paper reports that with about 50 demonstrations per task, co-training with existing static ALOHA datasets can substantially improve success rates on mobile manipulation tasks such as opening cabinets, cooking, using an elevator, or rinsing a pan.

From a data collection perspective, Mobile ALOHA addresses a major gap: many real tasks do not fit on a tabletop. The robot must move to the right location, rotate its base, put both arms into a useful workspace, and only then manipulate. Stationary ALOHA teaches "what the hands do." Mobile ALOHA also teaches "where the body brings the hands."

Current commercial prices for ALOHA-style systems are not identical to the original research bill of materials. Trossen's 2026 pages list Stationary AI around 23,995.95 USD, Mobile AI around 33,695.95 USD without a laptop, and 37,845.95 USD with a laptop. These are not mandatory prices if you self-build open-source hardware, but they are realistic planning numbers for teams that want a supported system.

GELLO: the cheapest serious way to start

GELLO is a practical idea: if the target robot arm is Franka, UR5, or xArm, build a controller with the same kinematic structure using 3D-printed parts and inexpensive motors. The OpenReview summary describes GELLO as a teleoperation device under 300 USD that is easy to build and intuitive to use.

GELLO is not a humanoid. It is not a complete bimanual system if you only build one controller. But for beginners, it is a very good way to learn the entire data loop:

build controller
  -> map joint positions to the robot arm
  -> record image + qpos + action
  -> train ACT or Diffusion Policy
  -> deploy policy
  -> collect failure recovery data

GELLO's strength is low latency and direct mapping. Because the controller shares the target arm's kinematic structure, the operator does not need to think about which joystick axis maps to which robot axis. The human moves the controller, and the robot follows. Compared with VR controllers or 3D mice, demonstrations are often smoother and more consistent.
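
Because leader and follower share a kinematic structure, the mapping can be as simple as a per-joint calibration. The sketch below assumes each leader joint needs only a zero offset and a direction sign, followed by clamping to the follower's joint limits. The function name and all numeric values are illustrative and are not taken from the GELLO codebase.

```python
import math


def leader_to_follower(leader_q, offsets, signs, lower, upper):
    """Map raw leader joint readings (rad) to clamped follower targets.

    target_i = sign_i * (leader_i - offset_i), clamped to [lower_i, upper_i].
    """
    targets = []
    for q, off, sgn, lo, hi in zip(leader_q, offsets, signs, lower, upper):
        target = sgn * (q - off)
        targets.append(min(max(target, lo), hi))
    return targets


# Hypothetical 3-joint example.
leader_q = [math.pi, 0.5, -2.0]   # raw encoder angles from the controller
offsets = [math.pi, 0.0, 0.0]     # leader reading when the follower joint is at zero
signs = [1, -1, 1]                # second motor is mounted in the opposite direction
lower = [-1.5, -1.5, -1.5]        # follower joint limits (illustrative)
upper = [1.5, 1.5, 1.5]

targets = leader_to_follower(leader_q, offsets, signs, lower, upper)
print(targets)
```

The third joint is clamped at the lower limit, which is the behavior you want when an operator pushes the controller past what the follower arm can reach.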

The limitation is scope. If you need locomotion, dual-arm full-body control, or complex dexterous hands, GELLO is only one component. On a low budget, though, it is easy to underestimate: a small, clean GELLO dataset can be more useful than a thousand shaky XR episodes with delay, dropped frames, and weak annotation.

LeRobotDataset or HDF5?

The two formats you will see most often are ALOHA/ACT-style HDF5 and Hugging Face LeRobotDataset.

HDF5 is strong for local research. Each episode is a file, easy to open with h5py, and easy to train with the original ACT codebase. The weakness appears when you scale to many cameras and thousands of episodes: file management, metadata, and sharing become painful.

LeRobotDataset solves the standardization problem. The LeRobotDataset v3.0 documentation describes a format for multimodal time-series data, sensorimotor signals, multi-camera video, and metadata. The huggingface/lerobot repository stores vision as MP4 or images and state/action as Parquet. The advantages are PyTorch loading, Hugging Face Hub integration, streaming large datasets, and mixing data from different robots.

A practical rule:

| If you are... | Choose |
|---------------|--------|
| Reproducing the original ACT/ALOHA papers | HDF5 |
| Prototyping in a lab with fewer than a few hundred episodes | HDF5 is fine |
| Sharing data, training with LeRobot, or mixing robots | LeRobotDataset |
| Scaling to thousands or millions of trajectories | LeRobotDataset or an internal equivalent with serious metadata |

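A minimal LeRobot-style layout, following the v2.x conventions (v3.0 reorganizes the files, so check which version your tools expect), looks like: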

my_robot_dataset/
├── meta/
│   ├── info.json
│   ├── episodes.jsonl
│   └── tasks.jsonl
├── data/
│   └── chunk-000/*.parquet
└── videos/
    └── chunk-000/observation.images.cam_high/*.mp4
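
A small script can confirm that a dataset folder has the pieces shown in that layout before you try to load it. This is a presence check only, written as a sketch against the directory tree above; it is not an official LeRobot validator, and the episode file names in the demo are placeholders.

```python
from pathlib import Path
import tempfile

# Metadata files named in the layout above.
EXPECTED_META = ["meta/info.json", "meta/episodes.jsonl", "meta/tasks.jsonl"]


def summarize_layout(root):
    """Report missing metadata files and count data/video files under root."""
    root = Path(root)
    missing = [rel for rel in EXPECTED_META if not (root / rel).is_file()]
    n_parquet = len(list(root.glob("data/chunk-*/*.parquet")))
    n_video = len(list(root.glob("videos/chunk-*/*/*.mp4")))
    return {
        "missing_meta": missing,
        "parquet_files": n_parquet,
        "video_files": n_video,
    }


# Demo on a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_robot_dataset"
    placeholders = EXPECTED_META + [
        "data/chunk-000/episode_000000.parquet",
        "videos/chunk-000/observation.images.cam_high/episode_000000.mp4",
    ]
    for rel in placeholders:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    # Simulate a forgotten metadata file.
    (root / "meta/tasks.jsonl").unlink()
    summary = summarize_layout(root)

print(summary)
```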

Cost, throughput, and latency comparison

No public source reports throughput and latency under the same standard for every stack, so the table below separates public facts from operational estimates. With short trajectories of 20-60 seconds and resets of 30-90 seconds, even a skilled operator records far less than 60 minutes of usable data per hour. Labeling, quality checks, reset time, and fatigue all matter.

| Stack | Practical hardware cost | Estimated throughput/operator | Latency target | Natural output format | Best fit |
|-------|-------------------------|-------------------------------|----------------|-----------------------|----------|
| AgiBot-style fleet | Not public; requires robot fleet, facility, and QA | 20-60 trajectories/hour/operator, scaled by robots/operators | Low enough for contact; stability matters more than demo appeal | LeRobot v2.1/Parquet + video, rich metadata | Funded company training generalist humanoid policies |
| Unitree G1/H1 + XR | G1 from about 13.5K USD, H1 much higher; plus Quest/PICO/AVP, PC, cameras | 10-35 trajectories/hour/operator depending on task and reset | Repository default is 30 Hz; keep network delay and jitter low | xr_teleoperate recording, then Unitree IL/LeRobot conversion | Teams wanting real humanoid data at moderate budget |
| ALOHA 2 / Stationary AI | Self-build can be cheaper; supported kit around 24K USD | 20-50 trajectories/hour/operator for tabletop tasks | Low due to leader-follower control; ALOHA 2 improves gripper feel | HDF5, convertible to LeRobot | Lab bimanual manipulation |
| Mobile ALOHA / Mobile AI | Supported kit around 34K-38K USD; self-build varies | 8-25 trajectories/hour/operator because environment resets take longer | Base and arm must be smooth; high latency makes path control hard | HDF5 with base_action, or LeRobot after conversion | Mobile manipulation tasks |
| GELLO + arm | Controller under 300 USD, excluding robot arm and cameras | 20-60 trajectories/hour/operator for single-arm tasks | Very good if joint mapping is stable | Whatever you record: HDF5 or LeRobot | Low-budget pipeline learning and single-arm manipulation |
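
The throughput estimates follow from simple cycle-time arithmetic: one cycle is an episode plus a reset, and only the episode part becomes data. The sketch below computes an upper bound from the 20-60 second episode and 30-90 second reset ranges quoted earlier. The function name and the `overhead_fraction` parameter (time lost to labeling, QA, and breaks) are assumptions added for illustration.

```python
def hourly_throughput(episode_s, reset_s, overhead_fraction=0.0):
    """Return (trajectories_per_hour, recorded_minutes_per_hour).

    `overhead_fraction` is the share of each hour lost to labeling,
    QA, and breaks (an assumed knob, not a measured value).
    """
    productive_s = 3600.0 * (1.0 - overhead_fraction)
    cycle_s = episode_s + reset_s
    trajectories = productive_s / cycle_s
    recorded_min = trajectories * episode_s / 60.0
    return trajectories, recorded_min


fast = hourly_throughput(20, 30)        # short episodes, quick resets
slow = hourly_throughput(60, 90)        # long episodes, slow resets
tired = hourly_throughput(60, 90, 0.25) # same, with 25% of the hour lost

print(fast)   # (72.0, 24.0)
print(slow)   # (24.0, 24.0)
print(tired)  # (18.0, 18.0)
```

At both ends of the range, the ceiling is 24 recorded minutes per hour before any overhead, which is why per-operator estimates in the table sit well below what the robot's uptime alone would suggest.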

Choosing a stack by budget

If your budget is under 5,000 USD, do not start with a humanoid. Use a small arm, a fixed camera, and a GELLO-style controller or low-cost leader arm. Your goal is to learn the pipeline: timestamp synchronization, correct action logging, first policy training, and failure recovery.

If your budget is 10,000-25,000 USD, there are two paths. Choose Unitree G1 if your primary goal is humanoid embodiment, locomotion context, and XR teleoperation. Choose stationary ALOHA/Trossen AI if your goal is high-quality two-handed manipulation. For the same budget, ALOHA usually gives cleaner manipulation data; Unitree gives a closer humanoid embodiment.

If your budget is 30,000-50,000 USD, Mobile ALOHA or Mobile AI becomes compelling. You pay more for a base, frame, cameras, and a more stable workflow for real-world tasks. This is a reasonable range for labs that want mobile manipulation without operating a humanoid fleet.

If your budget is above 100,000 USD, the question is no longer which robot to buy. The question is how to run a data factory: how many operators, how many tasks per day, QA process, format, versioning, privacy, license, and how data enters the training loop. That is the topic of part 6 on data strategy.
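
The budget bands above can be written down as a small lookup, which also makes the gaps between bands explicit. This is only a restatement of the guidance in this section: the function name and the `goal` parameter are hypothetical, and budgets that fall between the stated bands are deliberately left undecided.

```python
def suggest_stack(budget_usd, goal="manipulation"):
    """Return the stack this section suggests for a budget in USD.

    `goal` is "humanoid" or "manipulation" and only matters in the
    10,000-25,000 USD band.
    """
    if budget_usd < 5_000:
        return "Small arm + fixed camera + GELLO-style controller"
    if 10_000 <= budget_usd <= 25_000:
        if goal == "humanoid":
            return "Unitree G1 + XR teleoperation"
        return "Stationary ALOHA / Trossen AI"
    if 30_000 <= budget_usd <= 50_000:
        return "Mobile ALOHA / Mobile AI"
    if budget_usd > 100_000:
        return "Design a data factory: operators, QA, format, versioning"
    # 5K-10K, 25K-30K, and 50K-100K are not covered by the bands above.
    return "Between bands: decide by task"


print(suggest_stack(3_000))
print(suggest_stack(20_000, goal="humanoid"))
print(suggest_stack(40_000))
print(suggest_stack(7_500))
```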

Data collection shift checklist

Before recording the first episode, write a checklist. A good teleoperation stack does not start with a model; it starts with operational discipline.

1. Define the task
   - Short, clear task name
   - Initial condition
   - Success condition
   - Failure condition

2. Lock the setup
   - Camera poses
   - Lighting
   - Object positions
   - Firmware/control-code version

3. Record data
   - Synchronized timestamps
   - No severe image frame drops
   - Correct qpos/qvel/action dimensions
   - Metadata includes task, operator, success

4. QA after each batch
   - Open 10 random episodes
   - Visualize action/state
   - Remove unsafe collisions, bad resets, and shifted cameras
   - Log failures for recovery-data collection
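
Two of the checklist items are easy to automate: picking random episodes for review and confirming that required metadata is present. The sketch below is a minimal illustration with hypothetical function names; the required keys mirror the metadata item in step 3, and the fixed seed only makes the sample reproducible for a given batch.

```python
import random


def sample_for_review(episode_ids, k=10, seed=0):
    """Pick up to k episode IDs at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(episode_ids, min(k, len(episode_ids))))


def missing_metadata(meta, required=("task", "operator_id", "success")):
    """Return the required metadata keys that are absent from `meta`."""
    return [key for key in required if key not in meta]


# Hypothetical batch of 200 episodes.
batch = [f"episode_{i:06d}" for i in range(200)]
to_review = sample_for_review(batch)
print(to_review)

# Hypothetical metadata record with the operator left out.
print(missing_metadata({"task": "fold towel", "success": True}))
```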

The beginner mistake is ignoring failure data. If you only save successful demonstrations, the policy learns the clean path but not how to recover when it drifts. Large pipelines treat verification, cleaning, and recovery as part of the data engine, not a side task.

Conclusion: teleoperation is the first data moat

In humanoid robotics, model architectures change quickly, but real-world data remains expensive and slow. Teleoperation turns money, operator time, and robot uptime into a training asset. AgiBot wins on scale. Unitree opens a path for smaller teams to collect real humanoid data. ALOHA 2 and Mobile ALOHA win on bimanual data quality. GELLO wins on starting cost.

The practical advice: choose the stack by task, not by how impressive the robot looks. If your task is plugging cables, opening boxes, or folding towels on a table, ALOHA or GELLO may beat a humanoid. If the task requires moving through space, reaching shelves, and coordinating body and arms, Unitree or Mobile ALOHA makes more sense. Once you have real data, part 3 asks the next question: can we reduce teleoperation cost by mining human video from the internet?
