EgoHumanoid: Human Demos to G1 VLA

Why EgoHumanoid is the third stack

In Part 1 on OpenWBT, we started with debuggable whole-body teleoperation: lower-body policy, upper-body IK, simulator first, real robot later. In Part 2 on TWIST2, the focus moved to direct robot data collection with PICO teleoperation, a Redis bus, a low-level controller, and a sim-to-real path. Both stacks place the robot at the center: good demonstrations require the robot to run and the operator to drive it, and every latency or tracking issue directly affects the collected data.

EgoHumanoid asks a different question: if a lab cannot collect thousands of clean robot episodes yet, can it use egocentric human demonstrations? The OpenDriveLab/EgoHumanoid repository describes a framework that co-trains a VLA policy from human first-person demonstrations plus a limited amount of robot data. The EgoHumanoid project page highlights two alignment modules: view alignment reduces the camera-view gap between humans and robots, while action alignment maps human motion into a robot-compatible action space. This is the key difference from TWIST2. TWIST2 tries to make robot teleoperation better; EgoHumanoid tries to make human data more robot-ready.

The repository exposes a four-step path: collect robot/human data, run data_alignment/human_data_process/run_human_data_pipeline.sh, convert HDF5 to LeRobot with data_alignment/convert_to_lerobot.py, then train and deploy with scripts/train.py, scripts/serve_policy.py, and scripts/deploy.py. This post splits that path into six steps so that view alignment and action alignment each get their own section. The most important work is not the final deployment command. It is the alignment layer in the middle: view_alignment/viewport_transform_batch_h5.py with cache_3d.py for images, and process_navigation_pipeline.py with add_hand_status.py for actions.

Series roadmap

OpenWBT: G1 Teleop in MuJoCo/Isaac: build the environment, verify ONNX policies, and understand lower-body joystick plus upper-body IK.
TWIST2: PICO Teleop and G1 Sim2Real: use PICO teleop, Redis, and a low-level controller to collect direct robot data.
EgoHumanoid: Human Demos to G1 VLA: turn egocentric human demonstrations into HDF5/LeRobot data with images, navigation commands, and hand status.
VIRAL: retargeting and skill validation: use external motion sources, retarget them to the target humanoid, and evaluate failures.
FRoM-W1: moving skills onto real hardware: move from simulation policy to real hardware while handling latency, contact, and actuator limits.
CLONE: closed-loop whole-body teleop: treat closed-loop teleoperation as a long-horizon data stack for loco-manipulation.

For broader context, also read GR00T N1 + G1 data collection and the WholeBodyVLA open-source guide. They place EgoHumanoid inside the larger data, VLA, and whole-body control landscape.

Technical references to keep open

Source	Why it matters	Detail to remember
EgoHumanoid README	Understand the full collect, process, train, deploy pipeline	The repository is organized around data collection, data processing, model training, and deployment
Human data pipeline README	Understand `run_human_data_pipeline.sh`	The script internally runs reorder, navigation, downsample, camera merge, and hand-status stages
View alignment README	Run `viewport_transform_batch_h5.py`	The pipeline uses MoGe depth, Cache3D point-cloud warping, and Stable Diffusion inpainting
EgoHumanoid paper	Understand why alignment is needed	Human demos use PICO + ZED; robot demos use Unitree G1 + Dex3 + ZED; actions are represented as EEF deltas, velocity commands, and hand open/close

[Image: humanoid robot in a lab environment]

Mental model: EgoHumanoid is not just dataset conversion

Beginners often read EgoHumanoid as a script that converts HDF5 files to LeRobot. The real pipeline is longer, and it has three gaps to close along the way:

Human demonstrator
  PICO VR + 5 trackers + ZED
  body pose, hand pose, egocentric RGB
        |
        | human_data_process + view/action alignment
        v
Robot-ready HDF5
  teleop_navigate_command
  observation_image_left/right
  action_eef, action_delta_eef
  delta_height, hand_status
        |
        | convert_to_lerobot.py
        v
LeRobot dataset
  observation.images.left
  action_eef, action_delta_eef
  teleop_navigate, delta_height, hand_status
        |
        | scripts/train.py
        v
π0.5-style G1 VLA policy
        |
        | serve_policy.py + deploy.py
        v
Unitree G1 runtime

The first gap is the visual domain gap. A person wears a camera at a different height, pose, and heading from the camera mounted on a G1 head. During the same toy-pickup task, a human image may look down from a higher point or be offset relative to the robot view. View alignment uses depth, point-cloud warping, and inpainting to produce images that better approximate the robot viewpoint.

The second gap is the action domain gap. A human body differs from a humanoid robot in limb lengths, hand proportions, gait, and reachability. EgoHumanoid does not try to export human joints and retarget every joint to G1. The paper describes a unified action space: the upper body uses 6-DoF delta end-effector poses, the lower body uses discrete velocity commands, and the hand uses binary open/close status. This is practical because the policy learns task-level movement and manipulation signals, while the low-level controller and IK layer handle embodiment.

The third gap is the dataset format gap. The internal HDF5 files are useful for alignment and episode inspection. The LeRobot format is better suited to policy training because it standardizes image storage, metadata, FPS, and task instructions. For that reason, convert_to_lerobot.py is not an administrative step. It is where the training schema becomes fixed.

Step 1: collect robot data and human data

EgoHumanoid supports two data branches. Robot data is collected on a Unitree G1 with Dex3 hands, an Ubuntu workstation, a ZED Mini mounted on the robot head, and PICO VR for teleoperation. The robot-data README shows a practical loop with a control process, a teleoperation process, and a ZED exporter:

# Terminal 1: G1 control loop, real interface, with hands
python decoupled_wbc/control/main/teleop/run_g1_control_loop.py \
  --interface real \
  --control-frequency 50 \
  --with_hands

# Terminal 2: PICO teleoperation
python decoupled_wbc/control/main/teleop/run_teleop_policy_loop.py \
  --body-control-device pico \
  --hand_control_device pico \
  --enable_real_device

# Terminal 3: ZED + robot data exporter
python decoupled_wbc/control/main/teleop/zed_mini_run_g1_data_exporter.py \
  --dataset-name pick_toy_v1 \
  --visualize

Human data uses a PICO VR headset, five PICO Motion Trackers for full-body tracking, a ZED Mini mounted on the headset, and a workstation with USB 3.0. The paper mentions ZED X Mini, while the repository README notes that the released setup uses ZED Mini for easier access. The raw data should contain at least these fields:

Raw field	Reference shape	Used for
`body_pose`	`(N, num_joints, 7)`	Pelvis trajectory, navigation commands, and end-effector pose
`left_hand_pose`	`(N, 26, >=3)`	Infer left-hand open/close state
`right_hand_pose`	`(N, 26, >=3)`	Infer right-hand open/close state
`local_timestamps_ns`	`(N,)`	Synchronize motion with ZED frames
`episode_*.svo2`	ZED video	Extract left/right camera frames
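
Before running the pipeline, it can help to confirm that a raw episode matches the reference shapes in the table. The sketch below is a minimal check, assuming the fields have already been loaded into a mapping from name to NumPy array (an open h5py file supports the same `in` and `.shape` access). The helper name `check_raw_fields` is hypothetical and not part of the repository.

```python
import numpy as np

def check_raw_fields(fields):
    """Return a list of problems; an empty list means the shapes look plausible."""
    required = ("body_pose", "left_hand_pose", "right_hand_pose", "local_timestamps_ns")
    problems = [f"missing {key}" for key in required if key not in fields]
    if problems:
        return problems

    n = fields["local_timestamps_ns"].shape[0]
    body = fields["body_pose"].shape
    if len(body) != 3 or body[0] != n or body[2] != 7:
        problems.append(f"body_pose has shape {body}, expected ({n}, num_joints, 7)")
    for key in ("left_hand_pose", "right_hand_pose"):
        shape = fields[key].shape
        if len(shape) != 3 or shape[0] != n or shape[1] != 26 or shape[2] < 3:
            problems.append(f"{key} has shape {shape}, expected ({n}, 26, >=3)")
    # Timestamps must increase, otherwise camera synchronization cannot work.
    ts = np.asarray(fields["local_timestamps_ns"])
    if ts.ndim != 1 or np.any(np.diff(ts) <= 0):
        problems.append("local_timestamps_ns is not strictly increasing")
    return problems
```

Running this on one episode per batch is usually enough to catch a tracker that dropped out or a recording with a wrong joint count.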

The default script expects raw data organized into dated batch folders. Do not flatten every episode into one directory if you want to use it unchanged:

raw_human/
  2026-06-11_batch1/
    episode_0.hdf5
    episode_0.svo2
    episode_1.hdf5
    episode_1.svo2
  2026-06-11_batch2/
    episode_0.hdf5
    episode_0.svo2
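
Every `episode_N.hdf5` in this layout needs a matching `episode_N.svo2`. A small standard-library sketch can list unpaired files before the pipeline runs; the function name `find_unpaired_episodes` is hypothetical, and the layout is assumed to be exactly the one shown above.

```python
from pathlib import Path

def find_unpaired_episodes(raw_root):
    """List episode files that lack their .hdf5/.svo2 partner in each batch folder."""
    unpaired = []
    for batch in sorted(p for p in Path(raw_root).iterdir() if p.is_dir()):
        hdf5 = {p.stem for p in batch.glob("episode_*.hdf5")}
        svo2 = {p.stem for p in batch.glob("episode_*.svo2")}
        # The symmetric difference holds stems present in only one of the two sets.
        for stem in sorted(hdf5 ^ svo2):
            ext = ".hdf5" if stem in hdf5 else ".svo2"
            unpaired.append(str(Path(batch.name) / f"{stem}{ext}"))
    return unpaired
```

An empty result means every motion file has a video partner; anything else is worth fixing before the reorder stage copies files around.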

For a first lab run, collect less data but keep it clean. A task such as "put the pillow on the bed" or "grasp the toy and put it on the table" is better than ten mixed tasks. Each episode should begin and end in a recognizable state, the camera should not be covered, the hand should stay visible for most of the interaction, and the operator should repeat the same instruction.

Step 2: run the human data pipeline

The shortest command is:

cd data_alignment/human_data_process

./run_human_data_pipeline.sh \
  --input_dir /data/ego/raw_human \
  --output_dir /data/ego/intermediate \
  --final-output-dir /data/ego/final \
  --file all

Although the repository treats this as a single step of its four-step pipeline, the shell script runs five smaller stages:

Stage	Related script	Result
Reorder	`scripts/reorder_episodes_for_raw.py`	Sort episodes by time and copy them as `episode_N.hdf5` and `episode_N.svo2`
Navigation	`process_navigation_pipeline.py`	Convert `body_pose` into a trajectory and velocity `[vx, vy, yaw_rate]`
Downsample	`downsample_episode.py`	Reduce frequency, default factor 5, and create `teleop_navigate_command`
Merge camera	`merge_camera_only.py`	Match timestamps and write ZED left/right images into HDF5
Hand status	`add_hand_status.py`	Write `hand_status` with shape `(M, 2)` for left/right hands
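
The merge-camera stage depends on matching each motion sample to the closest camera frame. The sketch below shows one common way to do that with `np.searchsorted`; it is an illustration under stated assumptions, not the logic of `merge_camera_only.py`. Both timestamp arrays are assumed to be sorted and in nanoseconds, and the function name `match_nearest_frames` is hypothetical.

```python
import numpy as np

def match_nearest_frames(motion_ns, camera_ns):
    """For each motion timestamp, return the nearest camera index and the error in ms."""
    motion_ns = np.asarray(motion_ns, dtype=np.int64)
    camera_ns = np.asarray(camera_ns, dtype=np.int64)
    if camera_ns.size < 2:
        raise ValueError("need at least two camera frames")
    # Index of the first camera frame at or after each motion sample.
    right = np.clip(np.searchsorted(camera_ns, motion_ns), 1, camera_ns.size - 1)
    left = right - 1
    pick_right = (camera_ns[right] - motion_ns) < (motion_ns - camera_ns[left])
    idx = np.where(pick_right, right, left)
    diff_ms = np.abs(camera_ns[idx] - motion_ns) / 1e6
    return idx, diff_ms
```

The second return value plays the same role as the `timestamp_diff_ms` field: large values mean a motion sample was paired with a stale or future image.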

To debug one stage at a time, run the scripts independently:

# 1. Reorder
python scripts/reorder_episodes_for_raw.py \
  --input_dir /data/ego/raw_human \
  --output_dir /data/ego/intermediate \
  --file all \
  --workers 32

# 2. Navigation: pelvis trajectory -> velocity command
python process_navigation_pipeline.py \
  --dataset-dir /data/ego/intermediate \
  --baseline-sec 15 \
  --tangent-lag 5 \
  --overwrite \
  --no-png

# 3. Downsample: create teleop_navigate_command
python downsample_episode.py \
  --dataset-dir /data/ego/intermediate \
  --downsample-rate 5 \
  --overwrite

# 4. Merge camera: SVO2 -> HDF5 images
python merge_camera_only.py \
  --dataset-dir /data/ego/intermediate \
  --output-dir /data/ego/final \
  --num-workers 32

# 5. Hand status: hand pose -> open/close
python add_hand_status.py \
  --raw /data/ego/intermediate/hdf5 \
  --mid /data/ego/final \
  --target /data/ego/final \
  --downsample 5 \
  --num_workers 32

After this step, the final HDF5 files should contain the important fields:

body_pose
navigation_command
teleop_navigate_command
delta_height
observation_image_left
observation_image_right
camera_timestamp
timestamp_diff_ms
hand_status

teleop_navigate_command is where lower-body alignment becomes concrete. The navigation script reads pelvis and hip landmarks, smooths the trajectory with a Savitzky-Golay filter, estimates tangent directions, and converts human movement into velocity commands in the local body frame. After downsampling, the command is discretized through thresholds so it resembles robot teleoperation primitives. It is not perfect ground truth, but it gives the policy a consistent signal for moving forward, backward, sideways, turning, or standing.
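
To make the body-frame conversion and thresholding concrete, here is a minimal NumPy sketch. It assumes a planar pelvis trajectory and a heading angle are already available, skips the Savitzky-Golay smoothing step, and uses hypothetical threshold values; the repository's `process_navigation_pipeline.py` and `downsample_episode.py` may differ in all of these details.

```python
import numpy as np

def discretize_navigation(xy, yaw, dt, lin_thresh=0.1, yaw_thresh=0.2):
    """Map a planar trajectory to [vx, vy, yaw_rate] commands in {-1, 0, 1}.

    xy: (N, 2) world-frame positions; yaw: (N,) heading in radians;
    dt: sample period in seconds. Thresholds are hypothetical (m/s, rad/s).
    """
    xy = np.asarray(xy, dtype=float)
    yaw = np.unwrap(np.asarray(yaw, dtype=float))
    vel_world = np.gradient(xy, dt, axis=0)
    yaw_rate = np.gradient(yaw, dt)
    # Rotate world-frame velocity into the local body frame.
    c, s = np.cos(yaw), np.sin(yaw)
    vx = c * vel_world[:, 0] + s * vel_world[:, 1]
    vy = -s * vel_world[:, 0] + c * vel_world[:, 1]
    velocity = np.stack([vx, vy, yaw_rate], axis=1)
    thresholds = np.array([lin_thresh, lin_thresh, yaw_thresh])
    # Keep only the sign of components that exceed their threshold.
    command = np.sign(velocity) * (np.abs(velocity) > thresholds)
    return velocity, command.astype(np.int8)
```

Walking forward while facing any direction yields `[1, 0, 0]`, which is the point of working in the body frame: the command describes what the legs should do, not where the person is in the room.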

hand_status is where dexterous hands become a simpler learning target. add_hand_status.py reads hand pose, computes a metric based on fingertip distance and finger curvature, then fits a square-wave approximation to infer 0/1 states. The README defines 1 = closed, 0 = open, with shape (M, 2) in [left, right] order. For G1 + Dex3, this is coarser than joint-level hand pose, but it is easier to learn and less sensitive to tracker noise.
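
The idea of turning a noisy continuous hand metric into a stable binary signal can be illustrated with a two-threshold (hysteresis) rule. This is a simpler stand-in for the square-wave fit described above, not the method in `add_hand_status.py`; the `openness` metric, the threshold values, and the function name are all hypothetical.

```python
import numpy as np

def binarize_hand_metric(openness, close_below, open_above):
    """Convert a per-frame openness metric into 0 = open, 1 = closed with hysteresis."""
    if close_below >= open_above:
        raise ValueError("close_below must be smaller than open_above")
    status = np.zeros(len(openness), dtype=np.int8)
    closed = False
    for i, value in enumerate(openness):
        # The state only changes when the metric crosses the far threshold,
        # so small oscillations between the two thresholds are ignored.
        if not closed and value < close_below:
            closed = True
        elif closed and value > open_above:
            closed = False
        status[i] = int(closed)
    return status
```

Applying it once per hand and stacking the results in `[left, right]` order gives an array with the same `(M, 2)` layout as `hand_status`.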

Step 3: view alignment with `viewport_transform_batch_h5.py`

View alignment is easy to skip because it costs GPU time and looks like image augmentation. In EgoHumanoid, it is central. The paper explains that the height and camera-pose differences between humans and humanoids create a clear visual gap. The view-alignment pipeline uses MoGe to infer per-pixel 3D points and depth, uses Cache3D to warp the point cloud to a new viewpoint, then uses Stable Diffusion inpainting to fill disoccluded regions.
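
The geometry behind the warp step can be sketched in plain NumPy. The example below assumes a depth map and pinhole intrinsics are already known, uses the OpenCV camera convention, and treats the shift as metres; all of these are assumptions for illustration. It does not use MoGe, Cache3D, or Stable Diffusion, and it only marks the holes that an inpainting model would later fill.

```python
import numpy as np

def warp_to_shifted_camera(image, depth, fx, fy, cx, cy, shift):
    """Forward-warp `image` to a camera translated by shift = (dx, dy, dz).

    Convention: x right, y down, z forward, so a positive dy lowers the camera.
    Returns the warped image and a boolean mask of pixels with no source point.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    # Unproject every pixel to a 3D point in the source camera frame.
    z = depth.astype(float)
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    # Express the points in the shifted camera frame (pure translation).
    x2, y2, z2 = x - shift[0], y - shift[1], z - shift[2]
    valid = (z > 0) & (z2 > 1e-6)
    safe_z2 = np.where(valid, z2, 1.0)
    u2 = np.rint(fx * x2 / safe_z2 + cx).astype(int)
    v2 = np.rint(fy * y2 / safe_z2 + cy).astype(int)
    valid &= (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    # Z-buffer: for each target pixel keep only the nearest source point.
    src = np.flatnonzero(valid.ravel())
    target = (v2.ravel() * w + u2.ravel())[src]
    nearest_first = np.argsort(z2.ravel()[src], kind="stable")
    uniq_target, first = np.unique(target[nearest_first], return_index=True)
    winner = src[nearest_first[first]]
    flat_img = image.reshape(h * w, -1)
    warped = np.zeros_like(flat_img)
    warped[uniq_target] = flat_img[winner]
    hole = np.ones(h * w, dtype=bool)
    hole[uniq_target] = False
    return warped.reshape(image.shape), hole.reshape(h, w)
```

With a flat scene one metre away and `fy = 100`, lowering the camera by 0.07 shifts the content up by 7 pixels and leaves a 7-pixel band of holes at the bottom. Real scenes produce irregular holes around object edges, which is why inpainting quality matters.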

Run one HDF5 file:

cd data_alignment/view_alignment

python viewport_transform_batch_h5.py \
  --h5_file /data/ego/final/episode_0.hdf5 \
  --image_key "observation_image_left" \
  --trajectory "down" \
  --movement_distance 0.07 \
  --output_dir /data/ego/view_aligned/episode_0

Run a directory with multiple GPUs:

python viewport_transform_batch_h5.py \
  --h5_dir /data/ego/final \
  --batch_size 32 \
  --trajectory "down" \
  --movement_distance 0.07 \
  --num_gpus 4 \
  --output_dir /data/ego/view_aligned

Understand these arguments before changing them:

Argument	Meaning	Beginner guidance
`--image_key`	HDF5 image field	Start with `observation_image_left`; handle right images later if needed
`--trajectory`	Camera shift direction: `left`, `right`, `up`, `down`, `forward`, `backward`	If the human viewpoint is higher than the robot, try `down` first
`--movement_distance`	Camera movement distance	The README example uses `0.07`; inspect images before increasing it
`--movement_distance_noise`	Per-sample pose noise	Useful for making the policy robust to mounting errors
`--sd_model`	Inpainting model	Default is `stabilityai/stable-diffusion-2-inpainting`
`--save_h5`	Save output back as H5	Use this when the transformed images should become part of the training dataset

After running the transform, do not only look at training loss. Open frames before and after alignment. If the warp distorts the hand or the manipulated object, the policy may learn the wrong visual signal. If inpainting removes a small object, reduce the movement distance or inspect the depth validity mask. Real robot data usually does not need this human-to-robot view transform; human data often does.

Step 4: action alignment and robot-ready HDF5

Action alignment in EgoHumanoid has three parts. The upper body uses action_eef and action_delta_eef; the lower body uses teleop_navigate_command; the hand uses hand_status. The repository's convert_to_lerobot.py expects HDF5 fields like these:

action_eef               # (N, 14), end-effector pose
action_delta_eef         # (N, 12), delta end-effector action
teleop_navigate_command  # (N, 3), discrete navigation command
delta_height             # (N,), base/pelvis height change
hand_status              # (N, 2), left/right open-close
observation_image_left   # RGB/JPEG image from ZED

If your file already has navigation and hand status but not EEF actions, inspect process_human_eef_pipeline.py in the human-data process folder. The README describes it as the utility that computes 7D hand end-effector poses and 6D deltas from skeleton data. For a beginner tutorial, the best first check is to print HDF5 keys and shapes with h5py:

python - <<'PY'
import h5py

path = "/data/ego/final/episode_0.hdf5"
with h5py.File(path, "r") as f:
    for key in [
        "action_eef",
        "action_delta_eef",
        "teleop_navigate_command",
        "delta_height",
        "hand_status",
        "observation_image_left",
        "observation_image_right",
    ]:
        if key in f:
            print(key, f[key].shape, f[key].dtype)
        else:
            print("MISSING", key)
PY
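
The relationship between a 7D pose and a 6D delta can be sketched as follows. This is a minimal illustration, assuming each pose is laid out as `[x, y, z, qw, qx, qy, qz]`, the translation delta is taken in the world frame, and the rotation delta is an axis-angle vector of the relative rotation. The conventions in `process_human_eef_pipeline.py` may differ, and all function names here are hypothetical.

```python
import numpy as np

def quat_multiply(a, b):
    """Hamilton product of two (w, x, y, z) quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_to_rotvec(q):
    """Convert a unit (w, x, y, z) quaternion to an axis-angle rotation vector."""
    q = np.asarray(q, dtype=float)
    if q[0] < 0:  # pick the shorter of the two equivalent rotations
        q = -q
    sin_half = np.linalg.norm(q[1:])
    if sin_half < 1e-12:
        return np.zeros(3)
    angle = 2.0 * np.arctan2(sin_half, q[0])
    return q[1:] / sin_half * angle

def pose_delta_6d(prev_pose, next_pose):
    """6D delta between two 7D poses: position difference plus relative rotation."""
    prev_pose = np.asarray(prev_pose, dtype=float)
    next_pose = np.asarray(next_pose, dtype=float)
    d_pos = next_pose[:3] - prev_pose[:3]
    prev_inv = prev_pose[3:] * np.array([1.0, -1.0, -1.0, -1.0])  # conjugate
    d_rot = quat_to_rotvec(quat_multiply(prev_inv, next_pose[3:]))
    return np.concatenate([d_pos, d_rot])
```

Two hands with 7 values each account for the 14 columns of `action_eef`, and two 6D deltas account for the 12 columns of `action_delta_eef`.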

A good episode for conversion satisfies three basic conditions. First, the frame counts of actions and images must match, or the loader must handle the mismatch explicitly. Second, teleop_navigate_command should not be all zeros if the task includes walking. Third, hand_status should change around grasp and release, not flip every few frames. If hand status jitters, rerun add_hand_status.py with per-hand wave settings or manual transition shifts instead of asking the policy to learn noise.
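
These three conditions are easy to automate. The sketch below assumes the episode fields are available as a mapping from key to array (NumPy arrays or h5py datasets both support `len`), and the `max_hand_flips` limit is a hypothetical value to tune per task; `audit_episode` is not a repository function.

```python
import numpy as np

def audit_episode(fields, max_hand_flips=8):
    """Check frame counts, navigation activity, and hand-status stability."""
    issues = []
    n = len(fields["observation_image_left"])
    for key in ("action_delta_eef", "teleop_navigate_command", "hand_status"):
        if len(fields[key]) != n:
            issues.append(f"{key} has {len(fields[key])} frames, images have {n}")
    if not np.any(np.asarray(fields["teleop_navigate_command"])):
        issues.append("teleop_navigate_command is all zeros")
    # Count 0->1 and 1->0 transitions per hand; many flips suggest jitter.
    hand = np.asarray(fields["hand_status"], dtype=int)
    flips = np.abs(np.diff(hand, axis=0)).sum(axis=0)
    for side, count in zip(("left", "right"), flips):
        if count > max_hand_flips:
            issues.append(f"{side} hand_status flips {int(count)} times")
    return issues
```

For a manipulation-only task with no walking, the all-zeros navigation warning is expected and can be ignored.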

Step 5: convert to LeRobot

Once the HDF5 files are clean, convert them to LeRobot:

cd data_alignment

python convert_to_lerobot.py \
  --src-path /data/ego/final \
  --output-path /data/ego/lerobot \
  --repo-id egohumanoid_pick_toy_v1 \
  --fps 20 \
  --task "grasp the toy and put it on the table"

For a larger dataset, use parallel workers:

python convert_to_lerobot.py \
  --src-path /data/ego/final \
  --output-path /data/ego/lerobot \
  --repo-id egohumanoid_pick_toy_v1 \
  --num-workers 16 \
  --fps 20 \
  --task "grasp the toy and put it on the table"

The converter's main LeRobot features include observation.images.left, action_eef, action_delta_eef, teleop_navigate, delta_height, and hand_status. Notice that the current converter centers on the left image feature. If you want stereo training with both left and right images, extend the feature config and loader explicitly instead of assuming the right stream is used automatically. For a first run, train a left-only dataset to reduce moving parts.
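
The naming relationship between the HDF5 fields and the LeRobot features can be written down as a simple mapping. This sketch only illustrates the renaming described in this post; it does not call the LeRobot API, it assumes images are already decoded into arrays, and the helper `build_frame` is hypothetical rather than part of `convert_to_lerobot.py`.

```python
import numpy as np

# HDF5 key -> LeRobot feature name, following the names used in this post.
HDF5_TO_LEROBOT = {
    "observation_image_left": "observation.images.left",
    "action_eef": "action_eef",
    "action_delta_eef": "action_delta_eef",
    "teleop_navigate_command": "teleop_navigate",
    "delta_height": "delta_height",
    "hand_status": "hand_status",
}

def build_frame(fields, index, task):
    """Assemble one training frame from per-episode arrays."""
    frame = {"task": task}
    for h5_key, feature in HDF5_TO_LEROBOT.items():
        value = np.asarray(fields[h5_key][index])
        # Scalars such as delta_height become shape (1,) so every feature is an array.
        frame[feature] = value.reshape(1) if value.ndim == 0 else value
    return frame
```

Note that `observation_image_right` is absent from the mapping, which mirrors the left-only behavior described above.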

The task instruction matters. Do not use a vague instruction such as "do task". If the episode is toy pickup, say so. If the episode is trash disposal, use a different repo-id or at least a different task label. A VLA learns from images, actions, and language; vague task labels make the dataset harder to audit.

Step 6: train, serve, and deploy

Before training, the EgoHumanoid README asks you to compute normalization statistics:

uv run python scripts/compute_norm_states_ultra_fast.py \
  --config-name=norm_compute
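
As a rough picture of what normalization statistics are, the sketch below computes per-dimension mean, standard deviation, and 1st/99th percentiles over an action array, then applies z-score normalization. It is a generic illustration, not the contents of the repository script, and the function names are hypothetical.

```python
import numpy as np

def compute_norm_stats(actions):
    """Per-dimension statistics over an (N, D) array of actions."""
    actions = np.asarray(actions, dtype=np.float64)
    return {
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0),
        "q01": np.quantile(actions, 0.01, axis=0),
        "q99": np.quantile(actions, 0.99, axis=0),
    }

def normalize(actions, stats, eps=1e-6):
    """Z-score normalization; eps guards dimensions that never change."""
    return (np.asarray(actions, dtype=np.float64) - stats["mean"]) / (stats["std"] + eps)
```

A dimension with zero standard deviation, such as a navigation channel in a dataset with no walking, is a useful warning sign that the data does not cover that behavior.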

Then run training:

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 \
uv run scripts/train.py pi05_g1_custom \
  --exp_name=egohumanoid_pick_toy_v1

For multi-GPU training:

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 \
uv run scripts/train.py pi05_g1_custom \
  --exp_name=egohumanoid_pick_toy_v1 \
  --fsdp-devices 4

Checkpoints are saved under:

checkpoints/<config_name>/<exp_name>/<iteration>

After you have a checkpoint, start the policy server:

uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=pi05_g1_custom \
  --policy.dir=checkpoints/pi05_g1_custom/egohumanoid_pick_toy_v1/10000

The server listens on port 8000 by default. On the robot/client side, run:

cd /root/Projects/openpi

python scripts/deploy.py \
  --host <server_ip> \
  --port 8000

This is where the safety lessons from Parts 1 and 2 matter again. Do not deploy to a real G1 just because the server returns actions. Check the camera mount, network, emergency stop, low-level locomotion policy, clear workspace, and task speed. EgoHumanoid helps a policy learn from human data, but it does not remove the risks of a real whole-body robot.
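
One part of that checklist, the network, can be tested before the robot is enabled. The sketch below only checks that a TCP connection to the policy server can be opened; it says nothing about whether the policy output is safe. The function name is hypothetical, and port 8000 follows the default mentioned above.

```python
import socket

def policy_server_reachable(host, port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Running this from the robot-side machine before `deploy.py` separates network problems from policy problems, which are much harder to debug with a robot standing in the room.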

EgoHumanoid or TWIST2: direct robot data or human demos?

Question	TWIST2	EgoHumanoid
Primary data	Direct robot teleoperation	Human egocentric demos plus limited robot data
Strength	Actions and states are close to the real robot	Faster scaling, more scene diversity, no robot needed for every demo
Weakness	Consumes robot time, tires operators, and is harder to scale across scenes	Requires view/action alignment and fails easily if HDF5 data is dirty
Camera	Robot-centric active vision during teleop	Human headset/ZED views aligned toward the robot viewpoint
Lower-body signal	Commands from teleop/retargeting directly	Pelvis trajectory -> `teleop_navigate_command`
Hand signal	Robot hand command/tracking	Human hand pose -> `hand_status`
Best fit	The lab has a stable G1 and needs precise imitation data	The lab has limited robot time and wants diverse human demonstrations

A practical decision rule is simple. If the task depends on precise contact, force, or robot morphology, collect direct robot data first. Examples include pulling a heavy cart, opening a difficult handle, or operating near the edge of the G1 workspace. If the task mostly requires scene understanding, moving toward an object, and approximate grasp/release across many environments, EgoHumanoid is attractive because human demonstrations are cheaper than robot demonstrations.

A strong lab does not need to choose one forever. Use TWIST2 to build a clean robot dataset for a few core tasks. Use EgoHumanoid to expand diversity across rooms, object placements, and approach styles. Then co-train so the policy has both real robot grounding and broad visual diversity.

Debug checklist before long training

Check	Command or method	Expected result
HDF5 has required keys	Print keys and shapes with `h5py`	No missing `teleop_navigate_command`, image, `hand_status`, or EEF action fields
FPS is reasonable	Inspect metadata and frame counts	Human data is usually downsampled to 20 Hz
Camera sync is acceptable	Inspect `timestamp_diff_ms`	No long runs of large spikes
Navigation is not all zero	Plot or print unique commands	Walking tasks should show command changes
Hand status is meaningful	Plot both left/right columns	State changes around grasp and release
View alignment preserves objects	Open frames before and after alignment	The object and hand remain recognizable
LeRobot loads	Run a small smoke training job	Dataloader has no image/action shape errors

If one of these checks fails, do not train overnight. A VLA can reduce loss on dirty data, but a real robot will expose the mistake quickly. In EgoHumanoid, alignment quality determines whether human demonstrations are valuable.

Conclusion

EgoHumanoid is worth studying because it moves humanoid VLA work beyond the question of "how many robot teleoperation hours do we have?" Instead of treating human videos as loose visual references, it turns them into data with a robot-like schema: egocentric images aligned toward the robot viewpoint, teleop_navigate_command for the lower body, action_delta_eef for the arms, and hand_status for grasping. After conversion to LeRobot and training with scripts/train.py, you have a clear path from human demonstration to a policy that can be served and deployed on G1.

That advantage only matters if the data is audited carefully. run_human_data_pipeline.sh, viewport_transform_batch_h5.py, process_navigation_pipeline.py, add_hand_status.py, and convert_to_lerobot.py are not paperwork. They are where the embodiment gap is reduced. If you treat them as black boxes, EgoHumanoid may give you a nicely named dataset that is hard to use. If you inspect each HDF5 field, it becomes a practical way to scale whole-body VLA data in 2026.

Nguyễn Anh Tuấn

Related Posts

TWIST2: PICO teleop and G1 sim2real

VIRAL: RGB sim2real for G1 loco-manip

FRoM-W1: text → motion → G1 policy
