Tags: humanoid, human-video, pi05, phantom, vla, egocentric-video, data-ownership, robot-learning

Human Video: Phantom and pi0.5

Compare Phantom and pi0.5 to decide when human videos become training data, derived robot data, or benchmark evidence.

Nguyễn Anh Tuấn · June 10, 2026 · 14 min read

Why does part 5 focus on human video?

The first four parts of this series moved from a broad ownership map to teleoperation, alignment, and synthetic data. Part 1 framed the question as "who creates value at each data layer?" Part 3 covered view alignment and action alignment, the layer that turns human observations into something robots can learn from. Part 4 moved into synthetic data, where human demonstrations are amplified by simulation.

Part 5 returns to a source that is easy to underestimate: ordinary human video. If an engineer records a person sweeping a table, sorting eggs, opening a drawer, or arranging a spice rack with an RGBD camera, is that just reference media, or is it already robot training data? If a model extracts hand pose, removes the human arm, and renders a robot arm into the frame, who owns the new dataset? If a VLA such as pi0.5 co-finetunes on egocentric human videos and related robot data, do those human videos become robot data?

The two best case studies are Phantom and Physical Intelligence's pi0.5 human-to-robot transfer. Phantom, described in "Training Robots Without Robots Using Only Human Videos", explicitly converts RGBD human demonstrations into robot observation-action pairs: estimate hand pose to recover actions, inpaint the human arm, then render the robot arm into the observation. Physical Intelligence, in "Emergence of Human to Robot Transfer in Vision-Language-Action Models", takes a nearly opposite route: instead of designing a special transfer mechanism, it co-finetunes pi0.5 with human video treated like another embodiment, with actions represented by 3D hand positions, and lets alignment emerge from diverse robot pretraining. The original pi0.5 paper, "π0.5: a Vision-Language-Action Model with Open-World Generalization", explains why heterogeneous co-training is central to the model.

For more practical context around pi0.5 workflows, read "EXPO-FT: Online RL for VLA π0.5". For the open-source VLA tooling angle, read "LeRobot and pi0-FAST training".

For a humanoid robotics startup, the practical question is not whether human video is useful. It is. The better question is: at which step does human video change state from raw media into training data, derived robot data, or benchmark evidence?

The two pipelines in one table

| Comparison point | Phantom | pi0.5 human-to-robot |
| --- | --- | --- |
| Main input | Third-person RGBD videos of humans doing tasks | Egocentric human videos plus relevant robot data |
| How actions are created | Estimate hand pose with HaMeR, refine with depth/SAM2/ICP, then map to robot end-effector pose | Treat human data as another embodiment, with actions given by 3D hand positions |
| How observations are handled | Remove the human arm with segmentation and inpainting, then render the target robot arm into the image | No rendered robot arm is required; alignment is learned by the pretrained VLA |
| Robot demos for the target task | Not required | Related robot data is used; human video supplies missing scenarios or task concepts |
| Policy type | Imitation learning policy; the paper uses Diffusion Policy and zero-shot deployment | pi0.5 VLA co-finetuned on a human + robot mixture |
| Ownership meaning | Produces derived robot observation-action pairs from human video | Human video remains a distinct source, but its value increases through robot pretraining |
| Main risk | Hand-pose error, occlusion, embodiment mismatch, derived-data rights | Benchmark leakage, egocentric-video consent, ownership of mixtures and learned representations |

A beginner-friendly reading is this: Phantom tries to convert the data. It turns "a human in a video" into "a robot-looking observation with a robot action." pi0.5 tries to absorb the data. It keeps human video as human video, but trains a large enough model to find the bridge between human and robot behavior.

Phantom: turning human video into robot demonstrations

Phantom starts from a clean assumption: we have a dataset D_human of human video demonstrations. Each demonstration is a sequence of RGBD frames from a third-person camera, and the person performs a manipulation task with a pinch grasp using the thumb and index finger. There are no robot action labels. The goal is to create D_robot, where each human frame I_h,t becomes a pair (I_r,t, a_r,t): an observation that looks like the target robot, plus the corresponding robot action.

The technical pipeline is:

RGBD human video
  -> hand pose estimation with HaMeR
  -> SAM2 hand mask + depth point cloud
  -> ICP refinement for 3D hand pose
  -> convert hand pose to robot end-effector action
  -> segment and inpaint human arm
  -> render target robot arm from known camera extrinsics
  -> overlay rendered robot with depth-aware occlusion
  -> train imitation policy on edited robot observation-action pairs

The important detail is that Phantom does not merely use human video for visual pretraining. It creates action labels. The paper reports that HaMeR predicts 21 hand keypoints and a dense hand mesh with 778 vertices. Because monocular hand pose can be inaccurate in absolute 3D, the method uses SAM2 to segment the hand, depth data to extract a partial hand point cloud, and ICP to align the predicted mesh to the point cloud. The resulting pose is converted into the robot frame using known camera extrinsics. The action includes position, orientation, and gripper state.
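
The frame change and gripper mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Phantom's implementation: it assumes the thumb and index fingertips have already been refined in the camera frame, takes the end-effector position as their midpoint, and uses a hypothetical `close_threshold_m` to decide the gripper state. Orientation is omitted for brevity, and the function names and the example extrinsic matrix are invented for the example.

```python
import math


def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists, row-major) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def hand_to_gripper_action(T_robot_from_cam, thumb_tip_cam, index_tip_cam,
                           close_threshold_m=0.03):
    """Map two fingertip positions (camera frame, meters) to a simple gripper action.

    Assumptions (not from the paper): position is the fingertip midpoint, and the
    gripper counts as closed when the fingertip distance drops below the threshold.
    """
    thumb = transform_point(T_robot_from_cam, thumb_tip_cam)
    index = transform_point(T_robot_from_cam, index_tip_cam)
    position = tuple((a + b) / 2.0 for a, b in zip(thumb, index))
    aperture = math.dist(thumb, index)
    return {
        "position": position,
        "aperture_m": aperture,
        "gripper_closed": aperture < close_threshold_m,
    }


# Hypothetical extrinsics: 90 degree rotation about z plus a translation.
T_ROBOT_FROM_CAM = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]

action = hand_to_gripper_action(
    T_ROBOT_FROM_CAM,
    thumb_tip_cam=(0.10, 0.00, 0.30),
    index_tip_cam=(0.10, 0.02, 0.30),
)
print(action)
```

With fingertips 2 cm apart, the sketch reports a closed gripper located midway between the two transformed points. A real pipeline would also need orientation and some temporal smoothing of the gripper signal.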

The observation is also heavily edited. Phantom uses SAM2 to segment the human arm, inpaints the removed region with E2FGVI or a simpler inpainting method, then renders the target robot from the correct viewpoint and overlays it into the image. At test time, it also overlays a rendered robot arm on real robot observations to reduce train-test domain shift. The output dataset is no longer raw human video. It is derived robot data: it looks like robot data and contains robot actions, but its behavioral source is still a person.
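
The depth-aware overlay step can be illustrated with a tiny per-pixel compositing rule. The sketch below is an assumption about how such an overlay could work, not code from the paper: a rendered robot pixel replaces the inpainted scene pixel only when the renderer produced a pixel there and the robot is closer to the camera than the scene. Images are plain nested lists so the example runs without extra libraries; `composite_with_depth` and the sample values are hypothetical.

```python
def composite_with_depth(scene_rgb, scene_depth, robot_rgb, robot_depth):
    """Overlay rendered robot pixels onto an inpainted scene using depth ordering.

    robot_depth holds None wherever the renderer produced no robot pixel.
    A robot pixel wins only if it is strictly closer to the camera than the scene.
    """
    out = []
    for s_row, sd_row, r_row, rd_row in zip(scene_rgb, scene_depth, robot_rgb, robot_depth):
        out_row = []
        for s_px, s_d, r_px, r_d in zip(s_row, sd_row, r_row, rd_row):
            robot_visible = r_d is not None and r_d < s_d
            out_row.append(r_px if robot_visible else s_px)
        out.append(out_row)
    return out


BG = (200, 200, 200)   # inpainted background color (illustrative)
ARM = (30, 30, 30)     # rendered robot arm color (illustrative)

scene_rgb = [[BG, BG, BG]]
scene_depth = [[1.0, 0.4, 1.0]]      # middle pixel: an object 0.4 m from the camera
robot_rgb = [[ARM, ARM, ARM]]
robot_depth = [[0.6, 0.6, None]]     # robot rendered at 0.6 m; last pixel has no robot

result = composite_with_depth(scene_rgb, scene_depth, robot_rgb, robot_depth)
print(result)
```

In this toy row, the first pixel shows the robot, the second keeps the scene because the object sits in front of the robot, and the third keeps the scene because nothing was rendered there.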

The results matter for ownership because they show this is not just a visualization trick. The paper reports success rates up to 92% on several tasks and finds that the Hand Inpaint and Hand Mask editing variants work far better than the Red Line or Vanilla baselines. The paper also notes that this simple data-editing approach creates robot observation-action pairs that can be integrated into datasets for generalist policies. In other words, after the pipeline runs, human video has been packaged into a form that robot policies and VLAs can directly consume.

pi0.5: transfer emerging from co-training

Physical Intelligence asks a different question: if a VLA is large enough and pretrained on sufficiently diverse robot data, can it learn to use human video without a special transfer method? Their human-to-robot work focuses on egocentric human videos, which are easy to collect with wearable cameras. This data is cheap and natural, but the domain gap is severe: humans and robots have different bodies, camera viewpoints, motion constraints, and kinematics.

Their recipe is simple on the surface:

pretrained pi0.5
  + relevant robot data
  + egocentric human videos
  + actions represented as 3D hand positions
  -> human-robot co-finetuning
  -> evaluate on scenarios shown only in human demonstrations
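
One way to picture "human video as another embodiment" is a mixed sampler that draws from robot and human episodes during co-finetuning. The sketch below is illustrative only: the `Episode` fields, the `human_fraction` parameter, and the sampling scheme are assumptions, since the mixing ratio and data loader design are not given here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    embodiment: str     # e.g. "robot_arm" or "human_egocentric"
    action_space: str   # e.g. "end_effector_pose" or "hand_position_3d"
    episode_id: str


def sample_cofinetune_batch(robot_eps, human_eps, human_fraction, batch_size, seed=0):
    """Draw a batch where each slot is a human episode with probability human_fraction.

    human_fraction is a hypothetical knob; the real mixture weights are not stated.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = human_eps if rng.random() < human_fraction else robot_eps
        batch.append(rng.choice(pool))
    return batch


robot_eps = [Episode("robot_arm", "end_effector_pose", f"robot_{i}") for i in range(4)]
human_eps = [Episode("human_egocentric", "hand_position_3d", f"human_{i}") for i in range(2)]

batch = sample_cofinetune_batch(robot_eps, human_eps, human_fraction=0.3, batch_size=8)
for ep in batch:
    print(ep.embodiment, ep.action_space, ep.episode_id)
```

The point of the embodiment tag is provenance: every sample in the batch still records whether its behavior came from a robot or from a person, even though both feed the same model.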

In the egg sorting task, robot data covers the physical skill of placing eggs into cartons, while human data supplies the new rule: which colored eggs go into which carton. In the dresser task, robot data covers diverse bedroom scenes, while human data shows how to arrange the target dresser in a specific scene, such as putting jewelry into a jewelry box and hair ties into an organizer. In the spice rack task, the robot must understand the rack and layout in a previously unseen kitchen. The human video is not merely a trajectory source. It carries task semantics.

The critical study varies pretraining diversity across 0%, 25%, 50%, 75%, 100%, and 100% + Xemb. A beginner can read those checkpoints this way:

| Pretraining mix | Practical meaning | Expected effect of adding human video |
| --- | --- | --- |
| 0% | Mostly base VLM initialization, without strong robot pretraining | Human video is hard to use because human and robot representations remain separate |
| 25% | Some robot diversity, still limited | Transfer may appear, but weakly |
| 50% | More robot pretraining, but not yet consistently aligned | Some tasks begin to show signal |
| 75% | More diverse robot representations | Human video starts producing clear gains |
| 100% | Full robot pretraining diversity in the ablation setting | The model absorbs human data more effectively |
| 100% + Xemb | Full pi0.5 mixture with cross-embodiment data | The strongest transfer among the reported settings |

The headline numbers are concrete. Physical Intelligence reports that co-training with human video improves Spice from 32% to 71%, Dresser from 25% to 50%, Bussing from 53% to 63%, and egg sorting from 57% to 78%. Egg sorting is especially useful for the ownership question. The robot already had the physical ability to pick and place eggs, but it did not have the sorting rule. Human video supplied the missing concept. When the final policy sorts better, the value comes from the human demonstrator, the robot pretraining mixture, and the co-training recipe together.
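
To compare the four reported results on one scale, it helps to compute both the absolute gain in percentage points and the relative improvement. The short script below uses only the success rates quoted above; the dictionary layout and the `gains` helper are illustrative.

```python
# Success rates (percent) quoted above: (without human video, with human video).
reported = {
    "Spice": (32, 71),
    "Dresser": (25, 50),
    "Bussing": (53, 63),
    "Eggs": (57, 78),
}


def gains(before, after):
    """Return the absolute gain in percentage points and the relative ratio."""
    return {"points": after - before, "relative": after / before}


summary = {task: gains(before, after) for task, (before, after) in reported.items()}

for task, g in summary.items():
    print(f"{task}: +{g['points']} points, {g['relative']:.2f}x")
```

Read this way, Spice shows the largest gain (39 points, roughly 2.2x) and Bussing the smallest (10 points, roughly 1.2x), which fits the reading that human video helps most where it supplies a scene or rule the robot data lacks.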

When is human video training data?

Human video becomes training data when it enters the training objective or fine-tuning loop. It does not need to be a conventional robot demonstration. In Phantom, the boundary is obvious: once hand pose is converted into robot action, each frame has (observation, action) and can train an imitation policy. In pi0.5, the boundary is softer: egocentric videos are included in co-finetuning as another embodiment, with 3D hand positions as actions. The video is not rendered into a robot view, but it changes the model weights. That makes it training data.

A compact checklist:

human_video_is_training_data_if:
  used_in_loss_function: true
  contributes_actions_or_pseudo_actions: true
  influences_policy_weights: true
  retained_for_retraining_or_ablation: true
  shown_only_as_paper_demo: false

If a video is only used as a blog illustration or qualitative comparison, it may be media evidence. But if it is sampled by a dataloader, used to create action labels, used to create embedding targets, included in supervised fine-tuning, used in reward learning, or placed in an evaluation split, treat it as training or evaluation data and manage consent accordingly.
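
The checklist and the paragraph above can be turned into a small classifier that a data team runs over its video inventory. This is a sketch under assumptions: the flag names and the returned labels are invented for illustration and simply mirror the usage categories listed in the text.

```python
from dataclasses import dataclass


@dataclass
class VideoUsage:
    sampled_by_dataloader: bool = False
    produces_action_labels: bool = False
    produces_embedding_targets: bool = False
    used_in_fine_tuning: bool = False
    used_in_reward_learning: bool = False
    in_evaluation_split: bool = False


def classify_video(usage):
    """Label a video by how it is used, following the rules of thumb in the text."""
    trains = any([
        usage.sampled_by_dataloader,
        usage.produces_action_labels,
        usage.produces_embedding_targets,
        usage.used_in_fine_tuning,
        usage.used_in_reward_learning,
    ])
    if trains and usage.in_evaluation_split:
        return "training_and_evaluation_data"
    if trains:
        return "training_data"
    if usage.in_evaluation_split:
        return "evaluation_data"
    return "media_evidence"


print(classify_video(VideoUsage()))                               # blog illustration only
print(classify_video(VideoUsage(produces_action_labels=True)))    # Phantom-style labels
print(classify_video(VideoUsage(used_in_fine_tuning=True, in_evaluation_split=True)))
```

The combined label is deliberate: a video that both trains the model and appears in evaluation is worth flagging separately, because it needs consent for both uses and an explicit note in any benchmark report.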

When does it become derived robot data?

Human video becomes derived robot data when the pipeline creates a new artifact that describes a robot performing the task, even if the real robot never performed it. Phantom is the cleanest example:

| Artifact | Data state | Why |
| --- | --- | --- |
| Raw RGBD video | Human media / human demonstration | Contains human behavior and may show hands, objects, and the environment |
| Hand pose + action labels | Pseudo-action data | Actions are not recorded from a robot, but are mapped into the robot frame |
| Inpainted image | Edited observation | The human arm is removed and background is inferred |
| Rendered robot overlay | Robot-like observation | The image now depicts the target robot in the scene |
| Final (I_r,t, a_r,t) | Derived robot training data | It can train a policy like a robot demonstration |
| Policy checkpoint | Model artifact | Not a dataset, but it absorbs information from the derived data |

The legal and technical point is that derived robot data does not erase the rights or provenance of the source video. If a company records workers, runs Phantom, and sells an edited robot dataset, the workers remain the behavioral source. The robot asset owner, camera setup designer, hand-pose pipeline author, and task annotator also contribute value. A dataset card should preserve this provenance.

An internal manifest can be simple:

dataset: phantom_spice_rack_robot_pairs_v1
source_media: rgbd_human_video
camera: third_person_rgbd
hand_pose: HaMeR + SAM2 + ICP
observation_editing: SAM2 mask + inpainting + rendered_robot_overlay
robot_asset: target_arm_model
human_consent: required
allowed_use:
  - internal_policy_training
  - aggregate_benchmark_reporting
not_allowed:
  - identity_recognition
  - resale_without_derived_data_review
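
A manifest like this is only useful if something checks requests against it. The sketch below mirrors a subset of the fields above in a Python dict and applies a default-deny rule; the `check_use` function, its return format, and the `consent_on_file` flag are assumptions made for illustration.

```python
MANIFEST = {
    "dataset": "phantom_spice_rack_robot_pairs_v1",
    "source_media": "rgbd_human_video",
    "human_consent": "required",
    "allowed_use": ["internal_policy_training", "aggregate_benchmark_reporting"],
    "not_allowed": ["identity_recognition", "resale_without_derived_data_review"],
}


def check_use(manifest, requested_use, consent_on_file):
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    if manifest.get("human_consent") == "required" and not consent_on_file:
        return (False, "missing human consent")
    if requested_use in manifest.get("not_allowed", []):
        return (False, "explicitly prohibited")
    if requested_use in manifest.get("allowed_use", []):
        return (True, "allowed")
    return (False, "not listed; needs review")


for use in ["internal_policy_training", "identity_recognition", "external_licensing"]:
    print(use, check_use(MANIFEST, use, consent_on_file=True))
```

Default-deny matters for derived data: a new use such as external licensing is not in either list, so it is routed to review instead of silently inheriting permissions from the original recording.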

When is it benchmark evidence?

Human video can also function as benchmark evidence: proof that a scenario exists, or a specification of what the robot should do. In pi0.5, human videos describe scenarios that robot data does not fully cover: arranging a particular dresser, sorting eggs by color, or placing items on a spice rack in a new kitchen. The co-finetuned policy is evaluated on those settings. When a paper reports "Spice 32% -> 71%" or "Eggs 57% -> 78%", the human video is both a training input and part of the scenario definition. This is the region where teams often confuse themselves.

Separate the three roles:

| Role | Question to ask | Example |
| --- | --- | --- |
| Training input | Was the video used to update model weights? | pi0.5 co-finetuning with human videos |
| Task specification | Did the video define a rule or test scene? | Egg color sorting rule shown in human demos |
| Benchmark evidence | Was the video or rollout used to support a performance claim? | Success-rate tables and rollout videos |

If the same video both trains the model and defines the test scenario, watch for benchmark leakage. This is not automatically wrong. Physical Intelligence clearly states that evaluation happens on settings illustrated in the human demonstrations. But when using the result in a product comparison or a "generalization" claim, the exposure level must be explicit.
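
Making the exposure level explicit can be as simple as comparing scenario identifiers between the training videos and the evaluation set. The following sketch assumes each video and each evaluation episode carries a scenario ID; the IDs and the `exposure_report` helper are hypothetical.

```python
def exposure_report(train_scenarios, eval_scenarios):
    """Report which evaluation scenarios also appeared in training videos."""
    train = set(train_scenarios)
    evaluated = set(eval_scenarios)
    seen = sorted(evaluated & train)
    unseen = sorted(evaluated - train)
    fraction = len(seen) / len(evaluated) if evaluated else 0.0
    return {
        "seen_in_training_video": seen,
        "unseen": unseen,
        "exposed_fraction": fraction,
    }


# Hypothetical scenario IDs for illustration only.
report = exposure_report(
    train_scenarios=["eggs_color_rule", "dresser_scene_a"],
    eval_scenarios=[
        "eggs_color_rule",
        "dresser_scene_a",
        "spice_rack_kitchen_b",
        "bussing_table_c",
    ],
)
print(report)
```

A report like this does not decide whether the overlap is acceptable. It only ensures that a success-rate table can state how many evaluated scenarios were shown in human video during training.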

When should you choose Phantom or pi0.5-style co-training?

Choose Phantom when you have reliable RGBD capture, a task that can be mapped from human pinch grasp to a parallel gripper, and a need to create imitation-learning data without collecting robot demonstrations. It fits teams that do not yet have many robots but can record diverse human demonstrations across environments. It also gives you a clear audit trail: raw video, action extraction, edited observation, robot overlay, policy training.

Choose pi0.5-style co-training when you already have a strong VLA or foundation policy, a broad base of robot data, and human video mainly supplies semantics or rare scenarios. The robot may already know how to pick and place objects, but not the sorting rule, the layout of a new home, or the context-specific way to organize items. In that case, human video does not need to be rendered into robot form. It needs to be understood inside a shared representation.

| Situation | Prefer |
| --- | --- |
| No robot demonstrations for the target task | Phantom |
| Good third-person RGBD data and camera extrinsics | Phantom |
| Pinch-grasp or quasi-static manipulation | Phantom |
| Large robot dataset across tasks and embodiments | pi0.5-style co-training |
| Human video carries rules or semantics more than precise trajectories | pi0.5-style co-training |
| Need to audit every artifact before commercializing a dataset | Phantom, with a manifest |
| Need to exploit the scale of a VLA representation | pi0.5-style co-training |

The ownership lesson for humanoid data

Human video is no longer a side channel in 2026 robotics. With Phantom, it can be converted into robot observation-action pairs and used to train a zero-shot robot policy without target robot demonstrations. With pi0.5, it can become a new knowledge source for a VLA once robot pretraining is diverse enough. The two approaches differ, but they lead to the same ownership lesson: rights do not live only in the final file. They live across the transformation chain.

Practical rules for robotics teams:

  1. Label each video as raw media, training data, derived robot data, or benchmark evidence.
  2. Separate rights to human behavior, robot assets, editing pipelines, and model checkpoints.
  3. For human video, store consent, allowed use, retention policy, and removal procedures for future training.
  4. When reporting benchmarks, state whether human video was used for training, scenario definition, or demonstration only.
  5. If a dataset has gone through rendering, inpainting, or action extraction, do not call it "non-human data." Call it what it is: robot data derived from human video.

The final part of the series, Part 6, will connect these layers into the full VLA stack: raw video, teleoperation, simulation data, cross-embodiment data, model checkpoints, and product telemetry inside a commercial pipeline.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.