Why start from concrete dataset surfaces?
When people ask "who owns humanoid robot data?", the discussion often jumps straight to legal ownership: the robot owner, the teleoperator, the company hosting the files, or the team training the model. That framing is too abstract for beginners because humanoid data is not one object. It is a chain of data surfaces: raw recordings from robots or humans, synchronized camera files, training formats such as LeRobot, simulated demonstrations from Isaac Lab, model checkpoints, evaluation logs, and sometimes videos of real robots running in a cloud benchmark.
This first article in the series takes a more practical route: look at the files first, then discuss control over value. If you open a dataset folder and see episode_0.hdf5, episode_0.svo2, data/chunk-000/file-000.parquet, videos/observation.images.front/file-000.mp4, or generated_dataset_gr1_nut_pouring.hdf5, you should be able to identify where that file sits in the value chain.
This is not legal advice. The goal is technical classification. By the end, you should be able to reason about which layer is controlled by the hardware owner, which layer depends on the teleoperator, which layer is governed by the dataset host, which layer creates leverage for the model trainer, and which layer belongs to the cloud evaluation platform.
If you have already read LeRobot v0.5 with G1 whole-body control or GR00T-VisualSim2Real for G1 in Isaac Lab, treat this article as the data map underneath those tutorials.
Roadmap series
- Humanoid Data Map 2026 — this article, starting from files and mapping value control by layer.
- VR teleoperation and operator data — why PICO, Apple Vision Pro, and hand tracking turn human work into robot data assets.
- View alignment and action alignment — the conversion layer that makes human video usable for humanoids.
- Simulation and synthetic demonstrations — Isaac Lab, Mimic, domain randomization, and ownership of generated demos.
- Human video and robot-free data — when videos of people become pretraining data for VLA policies.
- The VLA stack and downstream control — from datasets to checkpoints, inference APIs, benchmarks, and product moats.
Four concrete examples that define the map
Instead of speaking in generalities, start with four dataset surfaces that matter in humanoid research in 2026.
| Data surface | Example file or scale | What it contains | Who usually controls it directly? |
|---|---|---|---|
| Raw egocentric humanoid data | episode_*.hdf5 + episode_*.svo2 in EgoHumanoid | Skeletons, hand pose, timestamps, raw ZED video, later merged into image-bearing HDF5 | The collecting lab, hardware owner, and data pipeline team |
| LeRobotDataset | Parquet + MP4, sometimes image files for debugging/export | State, action, and timestamp in Parquet; camera data in MP4; metadata, tasks, stats | Dataset host, schema owner, Hub uploader |
| Isaac Lab synthetic dataset | generated_dataset_gr1_nut_pouring.hdf5 | 1000 demonstrations of a GR1 humanoid for nut pouring/placing, generated with Isaac Lab Mimic | Scene owner, asset owner, generation script owner, compute owner |
| Humanoid Everyday | 10.3k trajectories, over 3M frames, 260 tasks | RGB, depth, LiDAR, tactile, state/action, language annotations, Apple Vision Pro teleop streams | Dataset authors, G1/H1 robot owners, cloud evaluation operator |
The key point is that all four rows are "humanoid data", but their economics are different. episode_0.svo2 is a raw sensor recording. A LeRobot Parquet file is a normalized training table. An Isaac Lab HDF5 file is a demonstration dataset generated in simulation. Humanoid Everyday is not just a folder; it is an ecosystem of dataset, loader, benchmark, and cloud evaluation. Each layer has a different gatekeeper.
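As a rough illustration of "look at the files first", the sketch below maps the example filenames from this article onto the surfaces in the table. The rule list and the labels are assumptions made for this example, not a convention used by any of the projects; an .mp4, for instance, could just as well be a rollout video from an evaluation platform.

```python
import re

# Hypothetical filename rules derived only from the examples in this article.
# Order matters: the more specific HDF5 patterns come before generic ones.
RULES = [
    (r"\.svo2$", "raw capture"),
    (r"^generated_dataset_.*\.hdf5$", "synthetic demonstrations"),
    (r"^episode_\d+\.hdf5$", "raw capture"),
    (r"\.parquet$", "processed dataset"),
    (r"\.mp4$", "processed dataset"),
]


def classify_surface(path: str) -> str:
    """Return a coarse surface label for a file path, or 'unknown'."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    for pattern, label in RULES:
        if re.search(pattern, name):
            return label
    return "unknown"


examples = [
    "episode_0.hdf5",
    "episode_0.svo2",
    "data/chunk-000/file-000.parquet",
    "generated_dataset_gr1_nut_pouring.hdf5",
]
for path in examples:
    print(f"{path} -> {classify_surface(path)}")
```

A filename is only a first hint. The metadata and the documentation of the dataset remain the real evidence of where a file sits in the chain.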
Layer 1: Raw files from humans or robots
EgoHumanoid is a useful starting point because it makes the problem explicit: use egocentric human demonstrations to improve humanoid loco-manipulation, then co-train a VLA policy with a smaller amount of robot data. The official repository describes a pipeline that collects data from Unitree G1, PICO VR, and ZED Mini, performs view and action alignment, fine-tunes a π₀.₅-based VLA model, and deploys policies back to a real robot. Sources: OpenDriveLab/EgoHumanoid and arXiv 2602.10106.
At the file level, EgoHumanoid's human data pipeline expects raw data to be organized in dated batch folders:
raw_data/
  2025-01-15_batch1/
    episode_0.hdf5
    episode_0.svo2
    episode_1.hdf5
    episode_1.svo2
  2025-01-15_batch2/
    ...
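Before any processing, it can help to confirm that every episode in a batch folder has both of its files. The sketch below is a hypothetical helper written for this article, not part of the EgoHumanoid repository; it assumes only the folder layout shown above and builds a throwaway directory to demonstrate the check.

```python
from pathlib import Path
import tempfile


def find_unpaired(raw_root: Path) -> list:
    """List 'batch/episode' stems that lack either the .hdf5 or the .svo2 file."""
    problems = []
    for batch in sorted(p for p in raw_root.iterdir() if p.is_dir()):
        hdf5 = {p.stem for p in batch.glob("episode_*.hdf5")}
        svo2 = {p.stem for p in batch.glob("episode_*.svo2")}
        # Symmetric difference: stems present in one set but not the other.
        for stem in sorted(hdf5 ^ svo2):
            problems.append(f"{batch.name}/{stem}")
    return problems


# Demo on a temporary folder where episode_1 is missing its .svo2 file.
with tempfile.TemporaryDirectory() as tmp:
    batch = Path(tmp) / "raw_data" / "2025-01-15_batch1"
    batch.mkdir(parents=True)
    for name in ("episode_0.hdf5", "episode_0.svo2", "episode_1.hdf5"):
        (batch / name).touch()
    unpaired = find_unpaired(Path(tmp) / "raw_data")

print(unpaired)
```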
The episode_*.hdf5 files contain structured streams such as body_pose, left_hand_pose, right_hand_pose, and local_timestamps_ns. The episode_*.svo2 files are ZED camera recordings. The pipeline then reorders episodes, computes navigation commands from body pose, downsamples streams, reads frames from SVO2, aligns timestamps, compresses left/right camera images as JPEG inside the final HDF5, and computes binary hand_status for left and right hands. Technical source: EgoHumanoid Human Data Processing Pipeline.
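To make the timestamp-alignment step concrete, here is a minimal nearest-neighbor sketch in plain Python. It is an illustration under stated assumptions, not the EgoHumanoid implementation: the function names, the max_gap_ns threshold, and the sample timestamps are all invented for this example.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t (ties go to the earlier one)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i if (after - t) < (t - before) else i - 1


def align(frame_ts_ns, pose_ts_ns, max_gap_ns):
    """Pair each camera frame with its nearest pose sample; drop frames with no close match."""
    pairs = []
    for frame_idx, t in enumerate(frame_ts_ns):
        pose_idx = nearest_index(pose_ts_ns, t)
        if abs(pose_ts_ns[pose_idx] - t) <= max_gap_ns:
            pairs.append((frame_idx, pose_idx))
    return pairs


# Illustrative values only: pose samples every 10 ms, three camera frames.
pose_ts = [0, 10_000_000, 20_000_000, 30_000_000, 40_000_000]
frame_ts = [1_000_000, 34_000_000, 90_000_000]
pairs = align(frame_ts, pose_ts, max_gap_ns=5_000_000)
print(pairs)
```

The last frame has no pose sample within 5 ms, so it is dropped rather than paired with stale data. Whether to drop, interpolate, or hold the last value is a design choice that a real pipeline has to document.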
For beginners, the rule is simple: raw files are the layer closest to the real world. If a company owns the robot, camera, VR headset, workstation, and workspace, it usually controls whether those raw files can be produced. If the teleoperator is an employee or contractor, labor contracts and data policies determine their rights. If the teleoperator is an end user, privacy and consent become much harder.
A minimal chain looks like this:
Human/robot activity
-> sensor recording
-> episode_0.hdf5 # pose, hand, timestamp, command
-> episode_0.svo2 # camera stream from ZED
-> processed_episode.hdf5
-> LeRobot format
-> model training
The value at this layer comes from rarity: egocentric viewpoint, real objects, real lighting, real failures, human hand behavior, and annoying edge cases that simulation has not captured. But raw data is also the riskiest layer. It may contain faces, private rooms, factory layouts, computer screens, license plates, voices, or personally identifying work habits.
Layer 2: Training formats such as LeRobotDataset
Raw files are not necessarily good training files. Model trainers need fast sampling, clear metadata, stable episode segmentation, consistent task labels, and portable loaders. That is why LeRobotDataset has become an important value layer.
According to Hugging Face's LeRobotDataset v3.0 documentation, the format decouples efficient storage from the user-facing API. Low-dimensional, high-frequency streams such as states, actions, and timestamps are stored in Apache Parquet. Camera frames are concatenated and encoded into MP4 files, sharded by camera to keep file counts manageable. Metadata under meta/ describes schema, FPS, statistics, tasks, and episode offsets. Sources: LeRobotDataset v3.0 documentation and the LeRobot datasets v3 blog.
A typical LeRobotDataset v3 layout looks like this:
my-humanoid-dataset/
  meta/
    info.json
    stats.json
    tasks.jsonl
    episodes/
      chunk-000/file-000.parquet
  data/
    chunk-000/file-000.parquet
  videos/
    observation.images.front/
      chunk-000/file-000.mp4
    observation.images.wrist/
      chunk-000/file-000.mp4
This creates a new control layer. The party that owns the raw files may not be the party that controls the standardized dataset. Once data is uploaded to the Hugging Face Hub or an internal object store, the dataset host controls access tokens, licenses, versions, visibility, takedowns, mirrors, and metadata. A model trainer may never see episode_0.svo2, yet still train a policy if the Parquet, MP4, and metadata are good enough.
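A quick structural check of a downloaded dataset folder can be done with the standard library alone. The sketch below assumes the layout shown above; the EXPECTED list and both helper functions are hypothetical and do not use or replace the LeRobot loader.

```python
from pathlib import Path
import tempfile

# Paths taken from the layout shown in this article; adjust for your dataset version.
EXPECTED = ["meta/info.json", "meta/stats.json", "data", "videos"]


def missing_entries(root: Path) -> list:
    """Return expected relative paths that do not exist under root."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]


def camera_keys(root: Path) -> list:
    """Return the camera folder names found under videos/."""
    videos = root / "videos"
    if not videos.is_dir():
        return []
    return sorted(p.name for p in videos.iterdir() if p.is_dir())


# Demo on a temporary folder that deliberately lacks meta/stats.json.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-humanoid-dataset"
    (root / "meta").mkdir(parents=True)
    (root / "meta" / "info.json").write_text("{}")
    (root / "data").mkdir()
    for cam in ("observation.images.front", "observation.images.wrist"):
        (root / "videos" / cam).mkdir(parents=True)
    missing = missing_entries(root)
    cams = camera_keys(root)

print("missing:", missing)
print("cameras:", cams)
```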
If you are building a humanoid startup, do not treat "convert to LeRobot" as a minor cleanup step. This is where you decide:
| Schema decision | Technical effect | Control effect |
|---|---|---|
| Camera name: observation.images.front or head_rgb | Whether model configs can read it | Whoever defines the convention lowers their own integration cost |
| State/action dimension | Whether G1, H1, GR1, or another robot can reuse it | The action-space owner controls downstream portability |
| FPS and downsampling | I/O, latency, action smoothness | Fine teleoperator behavior may be erased |
| Task text | Whether VLA models understand instructions | Annotators add semantic value |
| Normalization stats | More stable training | Stats can reveal internal data distributions |
| License and access | Sharing or closed training | Dataset hosts control downstream use |
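The "FPS and downsampling" row can be shown with a toy example. The action trace below is made up, and naive decimation is only one possible downsampling strategy; real pipelines may filter or average instead. The point is that a brief correction lasting two samples at 30 Hz can vanish entirely at 10 Hz.

```python
SOURCE_FPS = 30
TARGET_FPS = 10
stride = SOURCE_FPS // TARGET_FPS  # keep every third sample

# Made-up 1-D action trace: a short corrective motion at samples 4 and 5.
actions = [0.0] * 12
actions[4] = 0.3
actions[5] = 0.3

# Naive decimation keeps samples 0, 3, 6, 9 and misses the correction.
decimated = actions[::stride]


def block_mean(values, size):
    """Average consecutive blocks of `size` samples (a simple alternative to decimation)."""
    blocks = [values[i:i + size] for i in range(0, len(values), size)]
    return [sum(block) / len(block) for block in blocks]


averaged = block_mean(actions, stride)

print("peak at 30 Hz:       ", max(actions))
print("peak after decimation:", max(decimated))
print("peak after averaging: ", round(max(averaged), 3))
```

Averaging keeps a trace of the correction but flattens its amplitude, so neither choice is neutral. Whoever picks the downsampling rule decides how much operator skill survives into the training set.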
A quick inspection script:
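from lerobot.datasets.lerobot_dataset import LeRobotDataset

# "org/humanoid-demo" is a placeholder repo id; substitute a dataset you can access.
dataset = LeRobotDataset("org/humanoid-demo")

# Index a single frame; each sample is a dict keyed by feature name.
sample = dataset[100]
print(sample.keys())

# The keys below depend on the dataset's schema and may differ for your data.
print(sample["observation.state"].shape)
print(sample["action"].shape)
print(sample["observation.images.front"].shape)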
Once this works, the value is no longer just in the raw recording. The value is in turning messy episodes into a stable training API. That is why organizations without a large robot fleet can still create leverage by standardizing, validating, indexing, and distributing robotics datasets.
Layer 3: Synthetic demonstrations in Isaac Lab
Humanoid data does not come only from real robots. Isaac Lab Mimic can generate demonstrations in simulation, which can then be used to train policies with Robomimic or to post-train VLA models. The Isaac Lab documentation gives a concrete example: download generated_dataset_gr1_nut_pouring.hdf5 (about 12 GB) and place it under IsaacLab/datasets/. The file holds 1000 demonstrations of a GR1 humanoid performing a pouring/placing task, generated with Isaac Lab Mimic for the Isaac-NutPour-GR1T2-Pink-IK-Abs-Mimic-v0 task. Source: Isaac Lab teleoperation and imitation learning documentation.
The task is not just "pick up an object." The robot first picks up the red beaker, pours its contents into the yellow bowl, drops the beaker into the blue bin, and places the yellow bowl on the white scale. The success criteria are also multi-condition: the beaker must be in the bin, the nut must be in the bowl, and the bowl must be on the scale. That makes the dataset valuable for visuomotor learning because it combines perception, manipulation, and long-horizon sequencing.
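The multi-condition success rule described above is a conjunction, which a few lines of Python can make explicit. This is a sketch for illustration only: the class and field names are hypothetical and are not taken from the Isaac Lab task code.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NutPourOutcome:
    # Hypothetical field names mirroring the three conditions in the text.
    beaker_in_bin: bool
    nut_in_bowl: bool
    bowl_on_scale: bool


def unmet_conditions(outcome: NutPourOutcome) -> list:
    """Names of the conditions that are still false."""
    return [f.name for f in fields(outcome) if not getattr(outcome, f.name)]


def is_success(outcome: NutPourOutcome) -> bool:
    """Success requires every condition to hold at the same time."""
    return not unmet_conditions(outcome)


# A rollout that poured correctly and binned the beaker but never placed the bowl.
partial = NutPourOutcome(beaker_in_bin=True, nut_in_bowl=True, bowl_on_scale=False)
print(is_success(partial), unmet_conditions(partial))
```

Recording which condition failed, not just the final boolean, is what makes a long-horizon rollout useful for debugging.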
A shortened Isaac Lab command:
./isaaclab.sh -p scripts/imitation_learning/isaaclab_mimic/generate_dataset.py \
    --device cpu \
    --headless \
    --enable_pinocchio \
    --enable_cameras \
    --rendering_mode balanced \
    --task Isaac-NutPour-GR1T2-Pink-IK-Abs-Mimic-v0 \
    --generation_num_trials 1000 \
    --num_envs 5 \
    --input_file ./datasets/dataset_annotated_gr1_nut_pouring.hdf5 \
    --output_file ./datasets/generated_dataset_gr1_nut_pouring.hdf5
Synthetic data complicates ownership. There is no direct teleoperator in every final frame, but many contributors still shape the dataset: the person who built the USD scene, the team that modeled the GR1 robot, the person who wrote the task, the annotator who marked subtasks, the engineer who ran generation, the owner of the object assets, the group paying for compute, and whoever licenses the simulator and toolchain.
A practical classification:
| Synthetic data component | Value created | Common controller |
|---|---|---|
| Robot asset and controller | Valid action limits, dynamics, joint behavior | Robot vendor, lab, simulator vendor |
| Scene/object asset | Environment and object diversity | Asset owner, simulation team |
| Task definition | Rewards, success criteria, subtask graph | Research team or benchmark owner |
| Demonstration generator | Rollout volume, quality filtering | Isaac Lab Mimic operator |
| Output HDF5 | Direct training dataset | Storage and access owner |
| Conversion to LeRobot | Compatibility with VLA stacks | Dataset engineer or model team |
In short: synthetic datasets are not "ownership-free" just because they do not record real people. They trade privacy risk for license risk, asset provenance risk, and benchmark leakage risk.
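One way to keep these risks visible is to store a small provenance record next to each generated file. The sketch below is a hypothetical structure, not an Isaac Lab feature. The file, task, and generator values come from the example in this article, while every field left as None is an open question that the article does not answer.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SyntheticProvenance:
    # Known from the generation command.
    output_file: str
    task: str
    generator: str
    num_demos: int
    # Hypothetical provenance fields; None means "not yet documented".
    robot_asset_source: Optional[str] = None
    scene_asset_source: Optional[str] = None
    annotated_input_file: Optional[str] = None
    compute_owner: Optional[str] = None
    toolchain_license: Optional[str] = None


def open_questions(record: SyntheticProvenance) -> list:
    """Return the names of provenance fields that are still undocumented."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


record = SyntheticProvenance(
    output_file="generated_dataset_gr1_nut_pouring.hdf5",
    task="Isaac-NutPour-GR1T2-Pink-IK-Abs-Mimic-v0",
    generator="Isaac Lab Mimic",
    num_demos=1000,
    annotated_input_file="dataset_annotated_gr1_nut_pouring.hdf5",
)
print(open_questions(record))
```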
Layer 4: Ecosystem datasets such as Humanoid Everyday
Humanoid Everyday shows a different direction: the dataset is not merely a folder, but an ecosystem of data, code, policy analysis, and cloud evaluation. The arXiv abstract describes 10.3k trajectories, over 3 million frames, 260 tasks across 7 categories, including RGB, depth, LiDAR, tactile inputs, language annotations, and a human-supervised teleoperation pipeline. The paper also introduces a cloud-based evaluation platform that lets researchers deploy policies in a controlled setting and receive performance feedback. Source: Humanoid Everyday arXiv 2510.08807.
The Humanoid Everyday repository adds practical details: demonstrations are recorded on Unitree G1 and H1 at 30Hz, across loco-manipulation, basic manipulation, tool use, deformable objects, articulated objects, and human-robot interaction. Low-dimensional modalities include joint states, IMU, odometry/kinematics, G1 hand pressure sensors, teleoperator hands/head actions from Apple Vision Pro, and inverse-kinematics data. High-dimensional modalities include 480x640 RGB, 480x640 depth, and LiDAR point clouds of roughly 6k points per step. The repository also provides a dataloader and a conversion script to LeRobot. Source: physical-superintelligence-lab/Humanoid-Everyday.
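The figures quoted above allow a back-of-the-envelope estimate of episode length and raw data rate. The calculation below rests on several assumptions that are not stated in the sources: 3 bytes per RGB pixel, 16-bit depth, three float32 coordinates per LiDAR point, every modality sampled at the 30 Hz recording rate, and no compression. Because the paper says "over 3 million frames", the per-trajectory numbers are lower bounds.

```python
# Figures quoted in the article.
trajectories = 10_300
frames_lower_bound = 3_000_000
fps = 30
height, width = 480, 640
lidar_points = 6_000  # "roughly 6k points per step"

# Average trajectory length (lower bound).
frames_per_traj = frames_lower_bound / trajectories
seconds_per_traj = frames_per_traj / fps

# Uncompressed bytes per step under the assumptions listed above.
rgb_bytes = height * width * 3      # assumed 8-bit RGB
depth_bytes = height * width * 2    # assumed 16-bit depth
lidar_bytes = lidar_points * 3 * 4  # assumed float32 x, y, z
per_step_bytes = rgb_bytes + depth_bytes + lidar_bytes
per_second_mb = per_step_bytes * fps / 1e6

print(f"frames per trajectory: about {frames_per_traj:.0f}")
print(f"seconds per trajectory: about {seconds_per_traj:.1f}")
print(f"bytes per step (uncompressed): {per_step_bytes}")
print(f"MB per second (uncompressed): {per_second_mb:.2f}")
```

Under these assumptions an average episode is only about ten seconds long, while the uncompressed sensor stream runs at tens of megabytes per second. That gap is one reason storage format and hosting become a control layer of their own.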
The important shift is cloud evaluation. When a group owns real robots and lets the community send policies to a benchmark, it is no longer just a dataset host. It becomes the evaluation platform owner. It can control:
| Evaluation layer | Generated data | Value |
|---|---|---|
| Input stream from the real robot | Current RGB/depth/state | Lets policies run without owning the robot |
| Policy action stream | Commands, latency, failure modes | Reveals behavior style and control quality |
| Rollout video | Evidence of success or failure | Powers leaderboards, papers, demos |
| Metrics and protocol | Success rate, time, safety reset | Determines who is considered best |
| Operations logs | Crashes, timeouts, interventions | Valuable debugging and productization data |
In humanoids, evaluation data can be nearly as valuable as training data. Offline training loss is not enough: a real humanoid must cope with balance, contact, latency, motor heat, safety interventions, and objects that are never perfectly placed. Whoever owns the real benchmark can see the failure distribution of many models without necessarily publishing all logs.
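To see why the platform's view of failures is valuable, consider how little code it takes to turn rollout logs into a per-policy failure distribution. The records below are invented for illustration, and the field names are hypothetical; they do not reflect the Humanoid Everyday platform's actual log format.

```python
from collections import Counter, defaultdict

# Made-up rollout records; a real platform would log far more detail.
rollouts = [
    {"policy": "policy_a", "success": True, "failure": None},
    {"policy": "policy_a", "success": False, "failure": "timeout"},
    {"policy": "policy_a", "success": False, "failure": "safety_intervention"},
    {"policy": "policy_b", "success": True, "failure": None},
    {"policy": "policy_b", "success": False, "failure": "timeout"},
]


def summarize(records):
    """Compute success rate and failure counts for each policy."""
    totals = Counter()
    wins = Counter()
    failures = defaultdict(Counter)
    for record in records:
        policy = record["policy"]
        totals[policy] += 1
        if record["success"]:
            wins[policy] += 1
        else:
            failures[policy][record["failure"]] += 1
    return {
        policy: {
            "success_rate": wins[policy] / totals[policy],
            "failures": dict(failures[policy]),
        }
        for policy in totals
    }


summary = summarize(rollouts)
for policy, stats in summary.items():
    print(policy, round(stats["success_rate"], 2), stats["failures"])
```

A leaderboard typically publishes only the success rate. The failure counts, aggregated across every submitted policy, stay with the operator unless it chooses to release them.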
A five-layer map of value control
From the four examples above, we can define a general map:
Layer 0: Physical world
  people, robot, room, object, lighting, safety rig
Layer 1: Raw capture
  HDF5, SVO2, ROS bag, camera stream, VR hand/head tracking
Layer 2: Processed dataset
  synchronized HDF5, LeRobot Parquet + MP4/images, metadata, stats
Layer 3: Model training
  sampling code, normalization, checkpoints, VLA adapters, finetune recipes
Layer 4: Evaluation and deployment
  cloud benchmark, rollout videos, success metrics, inference API, logs
And here is the "who controls value?" table:
| Actor | What they control | Value they hold | Risk if ignored |
|---|---|---|---|
| Hardware owner | Robot, camera, VR headset, workspace, safety rig | Ability to create exclusive raw data | Dataset cannot be reproduced if robot access disappears |
| Teleoperator | Skill, correction strategy, speed, operating style | Demonstration quality and long-tail behavior | Ambiguous contracts/consent, untracked operator bias |
| Dataset host | Storage, schema, license, version, access | Distribution and standardization power | Downstream users may not know provenance or license constraints |
| Model trainer | Recipe, compute, checkpoint, normalization, model card | Turning data into capability | Model may absorb sensitive data or violate data terms |
| Cloud evaluation platform | Real robot benchmark, protocol, logs, leaderboard | Real-world measurement and failure visibility | Benchmark can become an opaque gatekeeper |
This map helps you read dataset announcements more carefully. When a paper says "we release data", ask: raw or processed? Does it include source videos or only state/action? Is commercial use allowed? Can you download everything or must you use a cloud API? Are evaluation logs included? Is there a task spreadsheet but no asset provenance? Is there a conversion script? Are normalization statistics published?
Beginner checklist for humanoid datasets
Use this checklist before training any policy:
1. Is the dataset from real robots, humans, simulation, or a mixture?
2. What are the source files: HDF5, SVO2, ROS bag, Parquet, MP4, PNG, PCD?
3. Are camera streams and state/action streams timestamp-aligned?
4. Is task text consistent with the episode?
5. Is the action space joint targets, end-effector poses, base velocity, or mixed action?
6. Are FPS, robot type, camera intrinsics, and normalization stats documented?
7. Does the license permit research, commercial use, redistribution, or closed-model fine-tuning?
8. Does the data include people, faces, voices, private spaces, or factory information?
9. Is there a separate benchmark or cloud evaluation platform?
10. Can the model trainer keep checkpoints and rollout logs, or must they share them back?
If questions 2, 3, 5, and 7 are unclear, you do not really know what you are using. With a fixed robot arm, a mismatch may only reduce performance. With a humanoid, a wrong action space or timestamp alignment can cause falls, collisions, or invalid evaluation.
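The rule about questions 2, 3, 5, and 7 can be turned into a small gate that runs before any training job. The sketch below is one hypothetical way to encode it; the answer format (a dict keyed by question number, with empty or missing entries treated as unclear) is an assumption of this example.

```python
# Question numbers from the checklist that must be answered before training.
CRITICAL = (2, 3, 5, 7)


def blocking_questions(answers):
    """Return the critical question numbers whose answers are missing or blank."""
    return [q for q in CRITICAL if not (answers.get(q) or "").strip()]


# Illustrative answers for an imaginary dataset.
answers = {
    1: "real robots",
    2: "Parquet + MP4",
    3: "",  # timestamp alignment is not documented
    5: "joint targets",
    # question 7 (license) was never answered
}

blockers = blocking_questions(answers)
if blockers:
    print("Do not train yet; unclear questions:", blockers)
else:
    print("Critical questions answered.")
```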
Conclusion: humanoid data is a chain of control
In 2026, the advantage in humanoid robotics is not only the number of robots you own. It is the ability to move from trustworthy raw files, through standard training formats, into strong checkpoints, and finally prove capability through real-robot evaluation. Each step adds value and shifts control.
EgoHumanoid reminds us that human egocentric data can scale environmental diversity beyond lab teleoperation, but only with alignment. LeRobotDataset reminds us that schema and metadata can turn scattered recordings into trainable assets. Isaac Lab reminds us that synthetic demonstrations have their own provenance. Humanoid Everyday reminds us that a large dataset plus cloud evaluation can become benchmark infrastructure, not just a download link.
The next article will focus on VR teleoperation and operator data: when a person wears a headset to control G1/H1, what counts as robot data, what counts as labor data, and why that distinction directly affects the value of a VLA stack.
Technical sources
| Topic | Source |
|---|---|
| EgoHumanoid framework | OpenDriveLab/EgoHumanoid, arXiv 2602.10106 |
| EgoHumanoid raw episode_*.hdf5 + episode_*.svo2 | Human Data Processing Pipeline |
| LeRobotDataset v3 | Hugging Face docs, LeRobot v3 blog |
| Isaac Lab GR1 nut-pouring dataset | Isaac Lab imitation learning docs |
| Humanoid Everyday | arXiv 2510.08807, GitHub repo |