VR Teleop: PICO/ZED to HDF5

A practical EgoHumanoid walkthrough: PICO/ZED capture, raw HDF5/SVO2, five processing stages, and consent/licensing audit points.

Nguyễn Anh Tuấn · June 10, 2026 · 18 min read
Why this article starts with VR teleop

In part 1 of this series, we mapped humanoid data as a chain of surfaces: raw files, standardized datasets, simulation data, model checkpoints, and evaluation logs. Part 2 zooms into one concrete path: how PICO VR and ZED Mini recordings in EgoHumanoid become HDF5 files that can be used for humanoid VLA training.

The important detail is that EgoHumanoid is not only classic robot teleoperation. The OpenDriveLab paper describes a robot-free setup where a human demonstrator performs tasks in the real world while wearing a VR headset, trackers, and an egocentric camera. The collected data is then aligned so it can be co-trained with a smaller amount of robot data. The project page emphasizes that egocentric human data improves generalization, especially in environments where a robot did not directly collect demonstrations. The main references are the EgoHumanoid paper, the project page, and the OpenDriveLab/EgoHumanoid repository.

This article does not begin with abstract legal ownership. It opens the dataset folder and asks:

| Audit question | Why it matters |
| --- | --- |
| What exactly is captured in a VR demo? | You can separate personal/behavioral data from technical sensor streams |
| Where are velocity commands created? | You avoid mistaking navigation_command for raw joystick input |
| Where does binary hand open/close status come from? | You know hand_status is derived from hand pose, not raw hand pose itself |
| How are ZED camera frames merged into HDF5? | You can audit image consent and timestamp synchronization |
| Which files need consent and licensing metadata? | You avoid building a dataset that can train models but cannot be shared or commercialized |

If you are building humanoid VLA data infrastructure, read this alongside GR00T whole-body real data and LeRobot v0.5 with G1 whole-body control, because those topics sit downstream of the HDF5 packaging step.

Series roadmap

  1. Humanoid Data Map 2026
  2. VR Teleop: PICO/ZED to HDF5
  3. View alignment and action alignment
  4. Simulation and synthetic demonstrations
  5. Human video and robot-free data
  6. The VLA stack and downstream control

Mental model: from headset wearer to trainable episode

EgoHumanoid has two data branches: robot teleoperation and human robot-free demonstration. This article focuses on the human branch. According to the data_collection/human_data README, the system uses PICO VR, ZED Mini, and MeshCat to record synchronized full-body tracking, hand tracking, controller poses, and video. The README also lists a default collection interval of 0.01 seconds, which corresponds to roughly 100 Hz for the tracking stream.
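
One quick way to check the README's 0.01-second interval against a recorded episode is to estimate the effective rate from local_timestamps_ns. The sketch below assumes the timestamps have already been loaded into a plain list of integers. The function name estimate_rate_hz is illustrative and not part of the EgoHumanoid repository.

```python
from statistics import median

def estimate_rate_hz(timestamps_ns):
    """Estimate the sampling rate from nanosecond timestamps using the median interval."""
    if len(timestamps_ns) < 2:
        raise ValueError("need at least two timestamps")
    intervals = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    if any(dt <= 0 for dt in intervals):
        raise ValueError("timestamps must be strictly increasing")
    # The median is robust to a few dropped frames, unlike the mean.
    return 1e9 / median(intervals)

# Synthetic example: 0.01 s spacing with one dropped frame in the middle.
ts = [i * 10_000_000 for i in range(10)]
ts.pop(5)
rate = estimate_rate_hz(ts)
print(f"{rate:.1f} Hz")
```

If the estimate drifts far from the configured rate, that is worth noting in the episode metadata before any downstream stage assumes a fixed frame interval.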

A minimal chain looks like this:

Human demonstrator
  -> PICO full-body + hand tracking
  -> ZED Mini SVO2 video with depth
  -> episode_N.hdf5      # pose, controller, hand, timestamps
  -> episode_N.svo2      # binocular camera recording
  -> processed/hdf5 + processed/svo2
  -> final HDF5 with navigation, images, hand_status
  -> optional LeRobot conversion
  -> VLA training

For beginners, the easy trap is the word "teleop." In robot teleoperation, an operator controls a real robot, so the raw data often already contains robot state and robot action. In EgoHumanoid's robot-free VR demos, the human is not necessarily controlling a physical robot during recording. They are creating an egocentric human demonstration. The pipeline must therefore derive robot-compatible actions: lower-body commands, end-effector movement, and hand open/close states.

The EgoHumanoid project page describes action alignment in three parts: upper body becomes 6-DoF delta end-effector commands, lower body becomes discrete velocity commands, and dexterous hands become binary open/close labels. This article audits the file-level path for two of the easiest pieces to inspect: lower-body commands and hand status.

[Image: Engineer wearing a virtual reality headset in a lab]

Stage 0: collection in data_collection/human_data

The relevant collection directory is:

data_collection/human_data/
  README.md
  requirements.txt
  scripts/
    human_data_collection.py
    svo2_to_mp4.py

The README lists the minimum hardware as a PICO VR headset with full-body tracking support, a ZED Mini depth camera, and a Linux PC running Ubuntu 22.04 or 24.04. Collection starts with:

cd data_collection/human_data

python scripts/human_data_collection.py --name <dataset_name>

python scripts/human_data_collection.py \
  --data-dir <save_dir> \
  --name <dataset_name> \
  --visualize-zed

The operating workflow is direct:

  1. The program initializes the PICO SDK, ZED Mini, and MeshCat.
  2. The operator opens http://localhost:7000/static/ to view the 3D skeleton.
  3. The operator enters an episode index.
  4. The human performs a demonstration.
  5. Pressing Space ends the episode.
  6. HDF5 and SVO2 files are saved automatically.

A session output usually looks like this:

data_collection/
  body_data/
    episode_0.hdf5
    episode_0.svo2
    episode_1.hdf5
    episode_1.svo2

The raw HDF5 schema is described in the README:

| Raw HDF5 dataset | Typical shape | Audit meaning |
| --- | --- | --- |
| body_pose | (frames, 24, 7) | 24 body joint poses, each with position and quaternion |
| left_controller_pose | (frames, 7) | Left controller pose |
| right_controller_pose | (frames, 7) | Right controller pose |
| left_hand_pose | (frames, 26, 7) | 26 left-hand joint poses |
| right_hand_pose | (frames, 26, 7) | 26 right-hand joint poses |
| left_hand_active | (frames,) | Whether left-hand tracking is active |
| right_hand_active | (frames,) | Whether right-hand tracking is active |
| local_timestamps_ns | (frames,) | Local PC timestamps in nanoseconds |
| episode_N.svo2 | separate file | ZED Mini video, including binocular stream and depth depending on configuration |
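
The schema above can be turned into a small pre-flight check. The sketch below is a minimal version that works on a plain dictionary of dataset names and shapes, so it does not depend on any HDF5 library. EXPECTED_RAW_SHAPES and check_raw_shapes are hypothetical names introduced for illustration.

```python
# Trailing dimensions per dataset, taken from the README schema table.
EXPECTED_RAW_SHAPES = {
    "body_pose": (24, 7),
    "left_controller_pose": (7,),
    "right_controller_pose": (7,),
    "left_hand_pose": (26, 7),
    "right_hand_pose": (26, 7),
    "left_hand_active": (),
    "right_hand_active": (),
    "local_timestamps_ns": (),
}

def check_raw_shapes(shapes):
    """Return a list of problems; `shapes` maps dataset name to its shape tuple."""
    problems = []
    frame_counts = set()
    for name, tail in EXPECTED_RAW_SHAPES.items():
        if name not in shapes:
            problems.append(f"missing dataset: {name}")
            continue
        shape = tuple(shapes[name])
        if len(shape) == 0 or shape[1:] != tail:
            problems.append(f"{name}: expected trailing shape {tail}, got {shape}")
            continue
        frame_counts.add(shape[0])
    # Every per-frame dataset should agree on the number of frames.
    if len(frame_counts) > 1:
        problems.append(f"inconsistent frame counts: {sorted(frame_counts)}")
    return problems

good = {name: (500, *tail) for name, tail in EXPECTED_RAW_SHAPES.items()}
bad = dict(good, left_hand_pose=(500, 21, 7), body_pose=(480, 24, 7))
print(check_raw_shapes(good))
print(check_raw_shapes(bad))
```

With h5py, the input dictionary could be built from an open file as `{k: f[k].shape for k in f}`, assuming all datasets sit at the top level of the file.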

From a data ownership perspective, this is the most sensitive layer. body_pose and hand_pose are not face images, but they are still behavioral traces: gait, arm range, manipulation style, reaction speed, and operator habits. episode_N.svo2 is more sensitive because it can contain the real environment, surrounding people, screens, signs, documents, customer items, or objects with their own IP restrictions. If the dataset will be shared outside the collecting team, consent should not merely say "participated in an experiment." It should explicitly cover model training, format conversion, image extraction, video preview, publication of samples, partner sharing, and commercial use.

A minimum manifest should live next to the raw data:

dataset_id: pillow_placement_home_2026_06_10
collector: team_a
operator_id: op_014
consent_form_version: v3_robot_learning_2026
consent_scope:
  train_internal_models: true
  publish_examples: false
  share_with_partners: false
  commercial_use: true
environment:
  location_type: home_mockup
  bystanders_present: false
  sensitive_displays_visible: false
hardware:
  headset: PICO
  camera: ZED Mini
  recording_format: hdf5_svo2
license:
  raw_hdf5: internal_restricted
  raw_svo2: internal_restricted
  processed_hdf5: internal_restricted

Do not wait until LeRobot conversion to add metadata. Once episodes are downsampled, merged, and sharded, reconstructing operator consent or usage scope becomes much harder.
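
Once the manifest exists, later stages can refuse to run when the intended use is not covered. The following is a minimal sketch that assumes the manifest has already been parsed into a Python dictionary (for example by a YAML parser). The helper name use_is_permitted is hypothetical.

```python
def use_is_permitted(manifest, intended_use):
    """Return True only if the manifest explicitly grants the intended use."""
    scope = manifest.get("consent_scope", {})
    # Missing keys are treated as "not granted" rather than guessed.
    return scope.get(intended_use) is True

manifest = {
    "operator_id": "op_014",
    "consent_scope": {
        "train_internal_models": True,
        "publish_examples": False,
        "share_with_partners": False,
        "commercial_use": True,
    },
}

for use in ("train_internal_models", "publish_examples", "public_benchmark"):
    print(use, use_is_permitted(manifest, use))
```

Treating an absent key as a refusal is a deliberate design choice here: a use that was never discussed with the operator should not default to allowed.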

Stage 1: Reorder Episodes

Processing happens under:

data_alignment/human_data_process/
  run_human_data_pipeline.sh
  scripts/reorder_episodes_for_raw.py
  process_navigation_pipeline.py
  downsample_episode.py
  merge_camera_only.py
  add_hand_status.py

The human_data_process README expects raw data to be organized in date/batch folders:

input_dir/
  2025-01-15_batch1/
    episode_0.hdf5
    episode_0.svo2
    episode_1.hdf5
    episode_1.svo2
  2025-01-15_batch2/
    ...

The full pipeline command is:

cd data_alignment/human_data_process

./run_human_data_pipeline.sh \
  --input_dir /path/to/raw_data \
  --output_dir /path/to/intermediate \
  --final-output-dir /path/to/final \
  --file all

Reorder Episodes scans {date}_{batch} subfolders, sorts episodes chronologically, and copies them into:

processed/
  hdf5/
    episode_0.hdf5
    episode_1.hdf5
  svo2/
    episode_0.svo2
    episode_1.svo2

Technically, this stage does not create a new learning signal. It standardizes order and naming so the later stages do not need to understand the original collection history. From a governance perspective, it is still critical because it can destroy context. If the raw folder was named 2026-06-10_factory_a_operator_014_batch2, and the processed file is only episode_17.hdf5, you have lost operator, location, and batch context unless a manifest or mapping file preserves it.

Stage 1 audit checklist:

| Check | Practical reason |
| --- | --- |
| Is there a raw path to processed episode mapping? | Needed for takedown if an operator withdraws consent |
| Are file hashes recorded before and after copy? | Needed to verify files were not modified outside the pipeline |
| Do HDF5 and SVO2 indices stay paired? | A mismatch can merge the wrong camera stream later |
| Are date, batch, operator, and task stored in metadata? | Sequential filenames are not enough for governance |
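
Two of these checks are easy to automate with the standard library. The sketch below hashes a file in chunks and reports episodes whose HDF5 and SVO2 files are not paired. Both helper names are hypothetical, and the demo runs against throwaway files in a temporary directory rather than real recordings.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large SVO2 recordings need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def unpaired_episodes(folder):
    """Return episode stems that have an .hdf5 without a matching .svo2, or vice versa."""
    folder = Path(folder)
    hdf5 = {p.stem for p in folder.glob("episode_*.hdf5")}
    svo2 = {p.stem for p in folder.glob("episode_*.svo2")}
    return sorted(hdf5 ^ svo2)

# Demo with placeholder files: episode_1 is missing its SVO2 recording.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("episode_0.hdf5", "episode_0.svo2", "episode_1.hdf5"):
        (Path(tmp) / name).write_bytes(name.encode())
    missing = unpaired_episodes(tmp)
    digest = sha256_of(Path(tmp) / "episode_0.hdf5")

print(missing)
print(digest[:16])
```

Recording the hash before the copy and comparing it afterwards gives a cheap integrity check that does not require opening either file format.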

A simple mapping file:

processed_episode,raw_hdf5,raw_svo2,operator_id,task,consent_id
episode_000017,2026-06-10_batch2/episode_3.hdf5,2026-06-10_batch2/episode_3.svo2,op_014,pillow_placement,consent_2026_v3
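
A mapping file like this makes a takedown request a lookup rather than an investigation. The sketch below assumes the CSV layout shown above. The second row and the function name episodes_for_operator are made up for illustration.

```python
import csv
import io

# First data row is the example from the mapping file; the second is hypothetical.
MAPPING_CSV = """processed_episode,raw_hdf5,raw_svo2,operator_id,task,consent_id
episode_000017,2026-06-10_batch2/episode_3.hdf5,2026-06-10_batch2/episode_3.svo2,op_014,pillow_placement,consent_2026_v3
episode_000018,2026-06-10_batch2/episode_4.hdf5,2026-06-10_batch2/episode_4.svo2,op_021,pillow_placement,consent_2026_v3
"""

def episodes_for_operator(mapping_text, operator_id):
    """Return every mapping row (as a dict) that belongs to the given operator."""
    reader = csv.DictReader(io.StringIO(mapping_text))
    return [row for row in reader if row["operator_id"] == operator_id]

hits = episodes_for_operator(MAPPING_CSV, "op_014")
for row in hits:
    # Both the processed file and its raw sources must be handled on withdrawal.
    print(row["processed_episode"], row["raw_hdf5"], row["raw_svo2"])
```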

Stage 2: Navigation Pipeline

This is the most important stage if you want to know where velocity commands come from. The README says Navigation Pipeline reads body_pose, uses skeleton keypoints such as pelvis and hip landmarks, applies coordinate transforms, smooths the trajectory with a Savitzky-Golay filter, estimates tangent direction, and generates velocity commands [vx, vy, yaw_rate] in the local body frame. The pipeline can also produce PNG comparison plots for validation.

Run this stage alone:

python process_navigation_pipeline.py \
  --dataset-dir /data/processed \
  --baseline-sec 15 \
  --tangent-lag 5 \
  --overwrite \
  --no-png

Beginner mental model:

body_pose over time
  -> pelvis/hip trajectory
  -> smoothed path
  -> local tangent direction
  -> frame-to-frame velocity
  -> navigation_command = [vx, vy, yaw_rate]

navigation_command is not raw joystick input. It is derived from human body motion. This is an easy place to mislabel a dataset. If a dataset card says "humanoid actions," a reader may assume the field came from a robot controller. In the human robot-free branch, this command is the result of action alignment from human pose into a representation that a humanoid policy can consume.
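
To make the derivation concrete, here is a simplified sketch of the last step only: turning a world-frame trajectory and heading into body-frame [vx, vy, yaw_rate] with finite differences. It is not the repository's implementation. It skips the Savitzky-Golay smoothing and tangent estimation, takes the heading as a given input, and uses a hypothetical function name.

```python
import numpy as np

def body_frame_velocity(xy, yaw, dt):
    """Finite-difference [vx, vy, yaw_rate] expressed in the body frame.

    xy:  (N, 2) world-frame pelvis positions in meters
    yaw: (N,) heading in radians
    dt:  sampling interval in seconds
    """
    xy = np.asarray(xy, dtype=float)
    yaw = np.unwrap(np.asarray(yaw, dtype=float))  # avoid jumps at +/- pi
    world_vel = np.diff(xy, axis=0) / dt           # (N-1, 2)
    yaw_rate = np.diff(yaw) / dt                   # (N-1,)
    c, s = np.cos(yaw[:-1]), np.sin(yaw[:-1])
    # Rotate world-frame velocity by -yaw to express it in the body frame.
    vx = c * world_vel[:, 0] + s * world_vel[:, 1]
    vy = -s * world_vel[:, 0] + c * world_vel[:, 1]
    return np.stack([vx, vy, yaw_rate], axis=1)

# Walking along world +y at 1 m/s while facing +y, sampled at 100 Hz.
xy = np.stack([np.zeros(50), 0.01 * np.arange(50)], axis=1)
yaw = np.full(50, np.pi / 2)
cmd = body_frame_velocity(xy, yaw, dt=0.01)
print(cmd[0].round(3))
```

In the example the person moves sideways in world coordinates but straight ahead relative to their own heading, so the command is pure forward velocity. That is the sense in which the field describes body motion, not controller input.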

Wrapper parameters worth recording:

| Parameter | README default | Effect |
| --- | --- | --- |
| --baseline-sec | 15 | Trajectory smoothing window |
| --tangent-lag | 5 | Frames used to estimate tangent direction |
| --with-png | off | Produces validation plots when enabled |
| --skip-navigation | off | Skips this stage if commands are already injected |

This is also a subtle ownership boundary. Raw pose comes from the demonstration process, but velocity commands are generated by the processing pipeline. If team A collects raw data and team B writes the navigation pipeline, who owns the derived commands? The answer depends on contract and license, but metadata should at least record the derivation:

derived_fields:
  navigation_command:
    source_fields: [body_pose, local_timestamps_ns]
    method: EgoHumanoid process_navigation_pipeline.py
    baseline_sec: 15
    tangent_lag: 5
    generated_by: data_team_b
    generated_at: 2026-06-10T10:20:00Z

Stage 3: Downsample

Raw tracking can run at a higher frequency than the intended training dataset. Downsample reduces the stream frequency with a sliding window. The default factor is 5, so a tracking stream recorded at roughly 100 Hz becomes roughly 20 Hz. The stage averages navigation commands, creates discrete teleop_navigate_command values by thresholding, and computes delta_height between frames.

Run it alone:

python downsample_episode.py \
  --dataset-dir /data/processed \
  --downsample-rate 5 \
  --overwrite

Important outputs:

| Field | Raw or derived? | Meaning |
| --- | --- | --- |
| navigation_command | derived continuous field | [vx, vy, yaw_rate] after processing |
| teleop_navigate_command | derived discrete field | Thresholded command suitable for a discrete action space |
| delta_height | derived field | Height change between frames, useful when a person crouches, reaches, or changes posture |

Downsampling is not just an I/O optimization. It changes what the model can learn. A quick hand correction or a short body turn can be smoothed away. A continuous velocity may become a discrete class. If a trained policy later fails at tasks requiring fast correction, the audit question is not only "do we have enough data?" It is also "did the downsample stage erase the useful detail?"
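
The effect is easier to see on a toy signal. The sketch below averages commands over non-overlapping windows of `rate` frames, which is one simple reading of the sliding window, then thresholds them into -1/0/+1 classes and computes a height delta. The threshold value, the class encoding, and the function name are assumptions for illustration and are not taken from downsample_episode.py.

```python
import numpy as np

def downsample_commands(nav, height, rate=5, threshold=0.1):
    """Average per window, discretize by threshold, and compute height deltas.

    nav:    (N, 3) continuous [vx, vy, yaw_rate]
    height: (N,) pelvis height in meters
    """
    nav = np.asarray(nav, dtype=float)
    height = np.asarray(height, dtype=float)
    n = (len(nav) // rate) * rate                  # drop the incomplete tail window
    nav_ds = nav[:n].reshape(-1, rate, nav.shape[1]).mean(axis=1)
    height_ds = height[:n].reshape(-1, rate).mean(axis=1)
    # Discrete command per axis: -1, 0, or +1 depending on the threshold.
    discrete = (np.sign(nav_ds) * (np.abs(nav_ds) > threshold)).astype(int)
    # Height change between consecutive downsampled frames; the first is 0.
    delta_height = np.diff(height_ds, prepend=height_ds[0])
    return nav_ds, discrete, delta_height

# Ten raw frames: brisk forward walk, then a slow creep while crouching 10 cm.
nav = np.array([[0.5, 0.0, 0.0]] * 5 + [[0.02, 0.0, 0.0]] * 5)
height = np.array([1.0] * 5 + [0.9] * 5)
nav_ds, discrete, delta_height = downsample_commands(nav, height)
print(discrete.tolist())
print(delta_height.round(3).tolist())
```

The slow creep in the second window falls below the threshold and becomes a zero command, which is exactly the kind of detail loss discussed above.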

Minimum metadata:

processing:
  downsample:
    rate: 5
    command_aggregation: sliding_window_average
    discrete_command_method: thresholding
    preserves_raw: true

For beginners, keep raw HDF5 and final HDF5 separate. Do not overwrite raw episodes with downsampled files. Raw data is evidence; downsampled data is a training artifact.

Stage 4: Merge Camera

Merge Camera joins ZED video with the downsampled HDF5 data. According to the README, the script reads binocular frames from SVO2 files, matches them to downsampled HDF5 timestamps using binary search, compresses images as JPEG at quality 95, and writes left/right images into the final HDF5.

Run it alone:

python merge_camera_only.py \
  --dataset-dir /data/processed \
  --output-dir /data/final \
  --num-workers 32

The final HDF5 gains these fields:

| Field | Audit meaning |
| --- | --- |
| observation_image_left | JPEG-compressed left camera frames |
| observation_image_right | JPEG-compressed right camera frames |
| camera_timestamp | Matched camera timestamp |
| timestamp_diff_ms | Synchronization error between camera and data timestamps |
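
The timestamp matching can be sketched with the standard bisect module. The version below finds the nearest camera frame for each data timestamp and reports a signed difference in milliseconds. Whether the real script stores a signed or absolute difference is not stated in the README, so treat that detail and the function name as assumptions.

```python
from bisect import bisect_left

def match_nearest(data_ts_ns, camera_ts_ns):
    """For each data timestamp, return (camera_index, diff_ms) of the nearest frame.

    camera_ts_ns must be sorted in ascending order.
    """
    if not camera_ts_ns:
        raise ValueError("no camera timestamps")
    matches = []
    for t in data_ts_ns:
        i = bisect_left(camera_ts_ns, t)
        # Candidates are the neighbors on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(camera_ts_ns)]
        best = min(candidates, key=lambda j: abs(camera_ts_ns[j] - t))
        matches.append((best, (camera_ts_ns[best] - t) / 1e6))
    return matches

camera = [0, 33_000_000, 66_000_000, 100_000_000]
data = [0, 50_000_000, 90_000_000]
result = match_nearest(data, camera)
print(result)
```

Nearest-frame matching always returns some frame, even when the closest one is far away. The difference field is therefore the only evidence of a bad match, and it deserves a threshold in the audit.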

This is where privacy risk increases sharply. Before Stage 4, a processed HDF5 may mostly contain pose and command fields. After Stage 4, the final HDF5 includes real images. If you publish a sample HDF5, upload it to a dataset hub, or show it to a customer, you are sharing a visual environment, not just motion vectors.

Checklist before merging or publishing:

| Question | Answer before training or sharing |
| --- | --- |
| Do non-operator people appear in the frame? | If yes, you need consent or a redaction policy |
| Are screens, papers, product labels, or license plates visible? | If yes, mark the episode as sensitive |
| Is timestamp_diff_ms too large? | Poor sync means the model can learn mismatched image/action pairs |
| Are JPEG quality and resize settings recorded? | Needed for reproducible training |
| Does raw SVO2 have a different license than processed HDF5? | You may allow internal training but forbid image redistribution |

A practical policy is to license by artifact layer:

license_by_artifact:
  raw_svo2:
    access: restricted
    reason: contains unredacted environment video
  final_hdf5_with_images:
    access: internal_training_only
    redistribution: prohibited
  derived_command_only_hdf5:
    access: partner_shareable
    redistribution: case_by_case

Stage 5: Hand Status

The final stage creates hand_status. The README describes it as binary hand open/close status computed with a square wave approximation and written into the final HDF5 as [left, right], where 1 means closed and 0 means open.

Run it alone:

python add_hand_status.py \
  --raw /data/processed/hdf5 \
  --mid /data/final \
  --target /data/final \
  --downsample 5 \
  --num_workers 32

Data flow:

left_hand_pose + right_hand_pose
  -> downsample alignment
  -> open/close approximation
  -> hand_status[:, 0] = left hand
  -> hand_status[:, 1] = right hand

hand_status is tiny but important. A humanoid dexterous hand policy does not always need all 26 joints from each human hand. Early-stage policies often only need to know whether the hand should be open or closed, especially when the downstream robot hand or gripper is controlled with a binary abstraction. This stage reduces dimensionality and improves portability between human embodiment and robot embodiment.

But dimensionality reduction also destroys information. If a demonstration involves light contact, two-finger pinching, or fingertip rotation, binary open/close may be too crude. When manipulation fails, audit whether the task is truly compatible with a binary hand label.
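
To see how much a binary label compresses, consider a sketch that turns a one-dimensional hand aperture signal into open/close states using two thresholds (hysteresis). This is not the repository's square wave approximation. The aperture signal (for example a thumb-to-index fingertip distance), the threshold values, and the function name are all assumptions.

```python
def binarize_aperture(aperture, close_below=0.04, open_above=0.07):
    """Convert a hand aperture signal (meters) into 0 = open, 1 = closed.

    Two thresholds give hysteresis so noise near a single threshold
    does not make the label flicker.
    """
    status = []
    closed = False  # assume the hand starts open
    for value in aperture:
        if closed and value > open_above:
            closed = False
        elif not closed and value < close_below:
            closed = True
        status.append(1 if closed else 0)
    return status

aperture = [0.10, 0.08, 0.05, 0.03, 0.05, 0.06, 0.08, 0.10]
labels = binarize_aperture(aperture)
print(labels)
```

Every aperture between the two thresholds collapses into whichever state came before it, so a partial pinch and a firm grasp can receive the same label.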

Metadata to record:

derived_fields:
  hand_status:
    source_fields: [left_hand_pose, right_hand_pose]
    representation: binary_open_close
    closed_value: 1
    open_value: 0
    downsample_rate: 5
    known_limitations:
      - loses finger-level contact detail
      - unsuitable for fine dexterous manipulation without raw hand poses

Final HDF5: audit every field

The processing README lists the main final HDF5 fields. A useful audit classification is:

| Final HDF5 field | Source | Raw/derived | Consent or licensing metadata needed |
| --- | --- | --- | --- |
| body_pose | raw HDF5 | raw but downsampled | operator consent, behavioral data scope |
| navigation_command | body pose trajectory | derived | pipeline version, smoothing parameters, derived-rights |
| teleop_navigate_command | navigation command | derived | threshold method, action-space definition |
| delta_height | body pose | derived | pipeline version |
| observation_image_left | SVO2 | compressed raw visual data | visual consent, location release, redistribution scope |
| observation_image_right | SVO2 | compressed raw visual data | same as left camera |
| camera_timestamp | SVO2/HDF5 sync | derived metadata | synchronization method |
| timestamp_diff_ms | sync calculation | derived metadata | quality threshold |
| hand_status | hand pose | derived | source pose consent, labeling method |

A beginner-friendly inspection script:

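import h5py
import numpy as np

path = "final/episode_0.hdf5"

with h5py.File(path, "r") as f:
    # List every top-level entry with its shape and dtype (groups print None).
    for key in f.keys():
        obj = f[key]
        shape = getattr(obj, "shape", None)
        dtype = getattr(obj, "dtype", None)
        print(key, shape, dtype)

    # Camera/data sync error; take absolute values in case the field is signed.
    if "timestamp_diff_ms" in f:
        diff = np.abs(f["timestamp_diff_ms"][:])
        print("max sync error ms:", diff.max())
        print("mean sync error ms:", diff.mean())

    # Fraction of frames labeled closed (1) per hand; columns are [left, right].
    if "hand_status" in f:
        status = f["hand_status"][:]
        print("left closed ratio:", status[:, 0].mean())
        print("right closed ratio:", status[:, 1].mean())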

If timestamp_diff_ms has large outliers, that episode may not be safe to train on. If hand_status is always 0 or always 1, hand tracking may have failed, the task may not involve grasping, or the open/close approximation may not fit the episode.

Running the pipeline with control

The run_human_data_pipeline.sh wrapper supports dry runs, skipped stages, and validation plots. For a new team, I recommend three passes:

# Pass 1: preview without writing
./run_human_data_pipeline.sh \
  --input_dir /data/raw \
  --output_dir /data/processed \
  --final-output-dir /data/final \
  --dry-run

# Pass 2: full pipeline with trajectory plots
./run_human_data_pipeline.sh \
  --input_dir /data/raw \
  --output_dir /data/processed \
  --final-output-dir /data/final \
  --with-png

# Pass 3: rerun only hand status if the labeling logic changes
./run_human_data_pipeline.sh \
  --input_dir /data/raw \
  --output_dir /data/processed \
  --final-output-dir /data/final \
  --skip-reorder \
  --skip-navigation \
  --skip-downsample \
  --skip-merge

Useful options:

| Option | When to use it |
| --- | --- |
| --file hdf5/svo2/all | Reorder only one file type |
| --workers | Speed up copying and camera merge |
| --baseline-sec | Tune navigation smoothing |
| --tangent-lag | Tune velocity direction estimation |
| --downsample-rate | Match the target training FPS |
| --with-png | Review trajectories before training |
| --dry-run | Audit commands before processing a large dataset |
| --skip-* | Rerun one stage without disturbing the others |

File-level governance is the part robotics teams often skip because they are busy training. In a series about who owns humanoid robot data in 2026, it is the core issue. The table below maps each artifact to its main risk and the metadata it should carry.

| Artifact | What it can contain | Risk | Metadata you should require |
| --- | --- | --- | --- |
| Raw episode_*.hdf5 | body pose, hand pose, controller pose, timestamps | behavioral biometric traces, operator style | operator consent, task, location type, retention policy |
| Raw episode_*.svo2 | binocular/depth video | bystanders, environment, documents, screens | visual consent, redaction status, redistribution limit |
| Reordered HDF5/SVO2 | renamed copies | provenance loss | raw path mapping, file hash, consent id |
| Navigation-injected HDF5 | derived velocity commands | mistaken as real robot action | pipeline version, parameters, generated_by |
| Downsampled HDF5 | lower-frequency/discrete action fields | lost motion detail | downsample rate, threshold method |
| Final HDF5 with images | JPEG images + commands + hand status | privacy plus action labels | field-level license, publish policy, takedown path |
| Converted LeRobot dataset | Parquet/MP4/schema | easy to distribute, hard to control downstream | dataset card, license, allowed use, source provenance |

The EgoHumanoid repository currently lists Apache 2.0 for the project code and says the OpenPI models/code are also provided under Apache 2.0, but that does not automatically solve the rights for your own collected data. Code license, model license, raw video consent, and dataset distribution license are four different layers.

A minimum DATASET_CARD.md:

# Dataset Card

## Collection
- Hardware: PICO VR, ZED Mini
- Collection path: data_collection/human_data
- Tasks: pillow placement, trash disposal
- Operators: anonymized IDs only

## Consent
- Consent form: v3_robot_learning_2026
- Allowed use: internal model training, commercial deployment
- Not allowed: public raw video release
- Withdrawal path: contact <withdrawal_contact_email>

## Processing
- Pipeline: EgoHumanoid data_alignment/human_data_process
- Stages: reorder, navigation, downsample, merge camera, hand status
- Downsample rate: 5
- Navigation baseline-sec: 15
- Tangent lag: 5

## Artifacts
- raw HDF5: restricted
- raw SVO2: restricted
- final HDF5: internal training only
- derived command-only export: partner review required

Conclusion

EgoHumanoid's PICO/ZED pipeline is useful because it forces us to look at humanoid data at the right grain size: episode_0.hdf5, episode_0.svo2, navigation_command, teleop_navigate_command, observation_image_left, timestamp_diff_ms, and hand_status. Once you know which fields are raw and which are derived, the question "who owns the data?" becomes less vague.

Short summary:

| Layer | What to remember |
| --- | --- |
| Collection | PICO/ZED records pose, hand, controller, timestamp, and video per episode |
| Reorder | Renames and sorts files, so provenance mapping is required |
| Navigation | Creates [vx, vy, yaw_rate] from body pose, not from raw joystick input |
| Downsample | Reduces frequency and discretizes commands, which can remove detail |
| Merge Camera | Adds ZED images into final HDF5 and increases privacy risk |
| Hand Status | Creates binary open/close labels from hand pose, useful but coarse |
| Governance | Consent and license must follow each artifact, not just the code repository |

The next article goes deeper into view alignment and action alignment: the point where human recordings start becoming truly humanoid-compatible data.
