Psi0 Hands-On (3): Data Recipe & Pipeline
In the previous two posts of this series, we explored the overall architecture and the three System-0/1/2 subsystems of Psi0. This post dives deep into the most critical part that many people overlook: data. Why does Psi0 only need ~860 hours of data yet still outperform models that use over 10,000 hours? The answer lies in the data recipe — how data is selected, processed, and used in the right order.
Overview of the Three Data Sources
Psi0 uses three fundamentally different data sources, each serving a distinct purpose in the training pipeline:
| Data Source | Duration | Space | Stage | Purpose |
|---|---|---|---|---|
| EgoDex | 829 hours | Task-space (48-DoF) | Stage 1: Pre-train | Learn manipulation primitives from human video |
| Humanoid Everyday (HE) | 31 hours | Joint-space (36-DoF) | Stage 1 + 2 | Learn humanoid body control |
| Task-specific demos | ~80 demos/task | Joint-space (36-DoF) | Stage 3: Fine-tune | Adapt to specific tasks |
The total comes to only about 860 hours for Stage 1, plus 80 demos per task at Stage 3. Compare this with Pi0 from Physical Intelligence needing over 10,000 hours, and you can see this represents a major leap in data efficiency.
The Cooking Analogy
Imagine you want to learn how to cook authentic Vietnamese pho:
Step 1 — Watching YouTube (EgoDex): You watch 829 hours of cooking videos from a first-person perspective. You don't need to know the exact stove temperature or the precise angle of every cut — you just need to understand the general patterns: left hand holds the meat, right hand cuts, place meat in the pot, stir evenly... This is task-space knowledge — knowing what needs to be done, without knowing exactly how the body moves.
Step 2 — Practicing in the kitchen (Humanoid Everyday): After watching enough videos, you enter a real kitchen. 31 hours of hands-on practice helps you transfer knowledge from eyes to hands — learning how your body coordinates to execute those movements. This is joint-space knowledge — knowing exactly how to move each joint.
Step 3 — Mastering the pho recipe (Task-specific): Finally, you only need ~80 cooking sessions on the specific recipe to fine-tune your skills. No need for 10,000 attempts — because you already have a solid foundation.
EgoDex: 829 Hours of Egocentric Video
EgoDex is the largest and most important dataset in Psi0. It is created from egocentric (first-person) videos of humans performing everyday hand manipulations.
Why Egocentric?
Here is the key insight: the camera on top of a humanoid robot looking down at its hands produces images that look nearly identical to a camera mounted on a human's head looking down at human hands. When you use third-person video, the model has to learn an additional viewpoint transformation step — which is difficult and data-hungry. Egocentric video eliminates this problem entirely.
EgoDex Data Structure
EgoDex is stored in 48-DoF task-space format:
- Left hand 3D position (x, y, z): 3 values
- Right hand 3D position (x, y, z): 3 values
- Finger states: 42 values (21 DoF per hand — 3 DoF for each of 7 finger groups)
Note: this is task-space, not joint-space. EgoDex does not encode the joint angles of any particular robot — it only captures where the hands and fingers are in 3D space. This means EgoDex can transfer to any robot with hands, not just the Unitree G1.
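Under the stated layout, reading one frame is just array slicing. The exact dimension ordering below is an assumption for illustration; the dataset's own metadata is authoritative.

```python
import numpy as np

# One 48-DoF task-space frame, using a hypothetical ordering:
#   [0:3]  left hand xyz   [3:6]  right hand xyz   [6:48] finger states
frame = np.zeros(48)
frame[0:3] = [0.10, -0.25, 0.40]   # left hand position (meters, camera frame)
frame[3:6] = [0.12, 0.30, 0.38]    # right hand position

left_pos, right_pos, fingers = frame[:3], frame[3:6], frame[6:]
assert left_pos.shape == (3,) and right_pos.shape == (3,)
assert fingers.shape == (42,)      # 21 DoF per hand
```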
EgoDex Processing Pipeline
The workflow from raw video to training data:
Egocentric video -> H-RDT (Hand Reconstruction) -> 3D hand positions
-> Transform to camera frame -> Upsample 3x -> Normalize -> FAST tokenize
Step 1: Hand Reconstruction with H-RDT
H-RDT (Hand Reconstruction from Dense Tracking) analyzes each video frame to extract 3D positions of hand joints. It outputs 48 values per frame.
Step 2: Transform to Camera Frame
Since egocentric video involves a moving camera, all 3D coordinates are transformed into the camera frame — a coordinate system with the origin at the camera. This ensures data consistency regardless of whether the person wearing the camera is standing or sitting.
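As a sketch: if the camera's pose in the world frame is a rotation R and position t, a world-frame point maps into the camera frame as p_cam = Rᵀ(p_world − t). The pose values below are made up for illustration.

```python
import numpy as np

def world_to_camera(p_world, R_cam, t_cam):
    """Express a world-frame point in the camera frame.

    R_cam (3x3) and t_cam (3,) are the camera's orientation and
    position in the world frame.
    """
    return R_cam.T @ (np.asarray(p_world) - np.asarray(t_cam))

# Identity pose at the origin: camera frame equals world frame
p = world_to_camera([0.2, 0.1, 0.5], np.eye(3), np.zeros(3))
assert np.allclose(p, [0.2, 0.1, 0.5])

# Camera 1.5 m up: the same point sits 1 m below the camera origin
p = world_to_camera([0.0, 0.0, 0.5], np.eye(3), [0.0, 0.0, 1.5])
assert np.allclose(p, [0.0, 0.0, -1.0])
```

Because every episode is expressed relative to its own camera, the standing/sitting height of the person wearing the headset drops out of the data.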
Step 3: Upsample 3x
The original video is typically at 10-30 FPS. The pipeline upsamples by 3x to achieve higher temporal resolution, helping the model learn smoother motions.
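The interpolation scheme isn't specified, but a plain linear-interpolation sketch conveys the idea: two synthetic frames are inserted between each pair of originals.

```python
import numpy as np

def upsample_3x(traj):
    """Linearly interpolate a (T, D) trajectory to 3x temporal resolution.

    Inserts two interpolated frames between each consecutive pair of
    originals, yielding 3*(T-1)+1 frames.
    """
    T, D = traj.shape
    t_old = np.arange(T)
    t_new = np.linspace(0, T - 1, 3 * (T - 1) + 1)
    return np.stack(
        [np.interp(t_new, t_old, traj[:, d]) for d in range(D)], axis=1
    )

traj = np.random.randn(10, 48)     # 10 frames of 48-DoF task-space actions
up = upsample_3x(traj)
assert up.shape == (28, 48)        # 3*(10-1)+1 = 28 frames
assert np.allclose(up[::3], traj)  # the original frames are preserved
```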
Step 4: Normalize
Each dimension is normalized based on statistics computed from the entire dataset. The stats.json file stores the mean and standard deviation for each dimension.
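A minimal sketch of what a stats-computation script presumably does — the actual `calc_modality_stats.py` implementation and key names may differ:

```python
import numpy as np

def compute_stats(episodes):
    """Per-dimension mean/std/min/max over a list of (T_i, D) episode
    arrays, in the shape a stats.json entry expects."""
    frames = np.concatenate(episodes, axis=0)  # (sum of T_i, D)
    return {
        "mean": frames.mean(axis=0).tolist(),
        "std": frames.std(axis=0).tolist(),
        "min": frames.min(axis=0).tolist(),
        "max": frames.max(axis=0).tolist(),
    }

# Five fake episodes of 100 frames each, 48 dimensions
episodes = [np.random.randn(100, 48) for _ in range(5)]
stats = {"observation.state": compute_stats(episodes)}
assert len(stats["observation.state"]["mean"]) == 48
assert all(s > 0 for s in stats["observation.state"]["std"])
```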
Step 5: FAST Tokenize
This is the most critical step — converting 48 continuous values into ~20 discrete tokens.
FAST Tokenizer: Smart Action Compression
FAST (Frequency-space Action Sequence Tokenization) is one of Psi0's key technical building blocks. Instead of having the model directly predict 48 continuous values (very difficult for autoregressive models), FAST compresses them into a short sequence of discrete tokens.
How FAST Works
Input: 48 continuous values (e.g., [0.23, -0.15, 0.87, ...])
  ↓ Discrete Cosine Transform (DCT)
  ↓ Keep the 20 most important coefficients
  ↓ Quantize into a codebook of 2048 bins
Output: ~20 discrete tokens (e.g., [1204, 87, 1956, ...])
Codebook size: 2048 bins, i.e. each kept DCT coefficient is discretized into one of 2048 levels. This provides sufficient precision for robot control (error below 0.5 mm in a typical workspace).
Reconstruction loss (L1): 0.005 — when decoding back from tokens to continuous values, the average error is only 0.005. This means the compression is nearly lossless.
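To make the DCT-then-quantize idea concrete, here is a toy reimplementation — not the actual FAST code. It keeps the lowest-frequency coefficients as a stand-in for the "most important" ones, and the [-4, 4] coefficient range is an arbitrary choice for the demo.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (its inverse is its transpose)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def tokenize(action, n_keep=20, n_bins=2048, lo=-4.0, hi=4.0):
    """DCT -> keep n_keep low-frequency coefficients -> quantize to bins."""
    coeffs = dct_matrix(len(action)) @ action
    kept = np.clip(coeffs[:n_keep], lo, hi)
    return np.round((kept - lo) / (hi - lo) * (n_bins - 1)).astype(int)

def detokenize(tokens, n_dims=48, n_bins=2048, lo=-4.0, hi=4.0):
    """Dequantize, zero-pad the dropped coefficients, inverse DCT."""
    coeffs = np.zeros(n_dims)
    coeffs[: len(tokens)] = tokens / (n_bins - 1) * (hi - lo) + lo
    return dct_matrix(n_dims).T @ coeffs

action = 0.5 * np.sin(np.linspace(0, np.pi, 48))  # a smooth 48-DoF frame
tokens = tokenize(action)
recon = detokenize(tokens)
assert tokens.shape == (20,) and 0 <= tokens.min() and tokens.max() < 2048
assert np.abs(recon - action).max() < 0.05        # near-lossless for smooth motion
```

The compression works precisely because robot motions are smooth: most of the signal energy lives in the low-frequency DCT coefficients, so dropping the rest costs almost nothing.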
Why Not Use Continuous Actions Directly?
Three main reasons:
- Autoregressive generation: Qwen3-VL (System-2) is a language model, and language models natively generate discrete tokens; regressing raw continuous values autoregressively is far harder
- Vocabulary sharing: Action tokens can share the embedding space with text tokens, enabling the model to learn cross-modal patterns
- Compression: 48 values -> ~20 tokens = 2.4x compression, significantly reducing sequence length
LeRobot Data Format
All data in Psi0 is stored in LeRobot format, HuggingFace's standard data format for robot learning. If you already know the LeRobot framework, this layout will be immediately familiar.
Directory Structure
dataset/
├── meta/
│ ├── info.json # Metadata: fps, features, robot type
│ ├── stats.json # Mean, std, min, max for each modality
│ ├── episodes.jsonl # List of episodes with timestamps
│ └── tasks.jsonl # Task descriptions for each episode
├── data/
│ ├── chunk-000/
│ │ ├── episode_000000.parquet # Actions + states
│ │ ├── episode_000001.parquet
│ │ └── ...
│ └── chunk-001/
│ └── ...
└── videos/
├── chunk-000/
│ ├── observation.images.cam_high/
│ │ ├── episode_000000.mp4
│ │ └── ...
│ └── observation.images.cam_wrist/
│ └── ...
└── chunk-001/
└── ...
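As a sketch of how the JSONL metadata reads: one JSON object per line. The field names below mirror the tree above, but the exact schema depends on the LeRobot version.

```python
import io
import json

# A tiny stand-in for meta/episodes.jsonl (hypothetical field names)
episodes_jsonl = io.StringIO(
    '{"episode_index": 0, "length": 412, "tasks": ["fold the towel"]}\n'
    '{"episode_index": 1, "length": 389, "tasks": ["fold the towel"]}\n'
)
episodes = [json.loads(line) for line in episodes_jsonl]

total_frames = sum(ep["length"] for ep in episodes)
assert len(episodes) == 2
assert total_frames == 801
```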
Observation Format
Each observation frame includes:
- Images: 320x240 pixels, typically from 2 cameras (head cam + wrist cam)
- State: 28-DoF proprioception — positions and velocities of arm + hand joints
- Timestamp: Precise timing of each frame
Action Format
- Joint-space (HE + Task data): 36-DoF — includes both upper body and hand joints
- Task-space (EgoDex): 48-DoF — 3D hand positions + finger states
stats.json
This file is critically important for normalization/denormalization:
{
"observation.state": {
"mean": [0.12, -0.05, ...],
"std": [0.34, 0.28, ...],
"min": [-1.57, -1.57, ...],
"max": [1.57, 1.57, ...]
},
"action": {
"mean": [...],
"std": [...],
"min": [...],
"max": [...]
}
}
During training, all values are normalized: x_norm = (x - mean) / std. During inference, they are denormalized back: x = x_norm * std + mean. If the stats are wrong, the robot's movements will be completely off.
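The round trip described above is two lines of numpy:

```python
import numpy as np

def normalize(x, stats):
    """Z-score normalization used during training."""
    return (np.asarray(x) - stats["mean"]) / stats["std"]

def denormalize(x_norm, stats):
    """Inverse transform applied to model outputs at inference time."""
    return np.asarray(x_norm) * stats["std"] + stats["mean"]

# Stats for a toy 2-D modality (values echo the stats.json example above)
stats = {"mean": np.array([0.12, -0.05]), "std": np.array([0.34, 0.28])}
x = np.array([0.5, -0.3])
assert np.allclose(denormalize(normalize(x, stats), stats), x)  # exact round trip
```

If stats.json were computed on one dataset and applied to another, `denormalize` would scale and shift every action incorrectly — which is exactly why the robot's movements go "completely off" with wrong stats.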
Humanoid Everyday (HE): 31 Hours of Robot Data
While EgoDex provides manipulation knowledge from human videos, Humanoid Everyday supplies real data from a Unitree G1 robot performing daily tasks.
HE Characteristics
- 31 hours of teleoperation data
- Joint-space: 36-DoF (not task-space like EgoDex)
- Tasks: Everyday activities — folding clothes, clearing tables, pouring water, opening cabinets...
- Robot: Unitree G1 + Dex3-1 hands (43-DoF total, 36-DoF upper body)
Role in Training
HE appears in both Stage 1 and Stage 2:
- Stage 1 (Pre-train): HE data is mixed with EgoDex. The model learns both task-space (from EgoDex) and joint-space (from HE) simultaneously. The mixing ratio is tuned to balance the two data sources.
- Stage 2 (Post-train): Only HE data is used. The Flow Matching action expert learns to generate precise joint-space actions.
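The Stage 1 mixing can be as simple as weighted sampling between the two sources each time a batch element is drawn. The 9:1 ratio below is a placeholder for illustration, not the ratio Psi0 actually uses.

```python
import numpy as np

def sample_source(rng, p_egodex=0.9):
    """Pick which dataset the next training sample is drawn from."""
    return "egodex" if rng.random() < p_egodex else "he"

rng = np.random.default_rng(0)
draws = [sample_source(rng) for _ in range(10_000)]
frac = draws.count("egodex") / len(draws)
assert 0.88 < frac < 0.92  # empirical mix is close to the requested 90/10
```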
Task-specific Demos: 80 Is Enough
The most impressive aspect of Psi0 is that it only needs 80 demonstrations per new task. For comparison:
| Model | Demos Required | Collection Time |
|---|---|---|
| ACT | 50-200 | 1-2 hours |
| Diffusion Policy | 100-500 | 2-5 hours |
| Pi0 | 500-1000+ | 5-10 hours |
| Psi0 | ~80 | ~40 minutes |
Forty minutes of teleoperation to teach a humanoid robot a new task is practical enough for deployment in factories or homes.
Data Processing Commands
Download Simulation Data
# Download simulation data from HuggingFace
python scripts/data/download_datasets.py \
--repo_id physical-superintelligence/psi0-sim-data \
--local_dir data/sim
# Download real-world data
python scripts/data/download_datasets.py \
--repo_id physical-superintelligence/psi0-real-data \
--local_dir data/real
Convert Raw Data to LeRobot Format
# Convert raw teleoperation data to LeRobot format
python scripts/data/raw_to_lerobot.py \
--input_dir data/raw/my_task \
--output_dir data/lerobot/my_task \
--fps 30 \
--video_codec h264
# Compute statistics for the dataset
python scripts/data/calc_modality_stats.py \
--dataset_dir data/lerobot/my_task \
--output_path data/lerobot/my_task/meta/stats.json
# Patch metadata if needed
python scripts/data/patch_lerobot_meta.py \
--dataset_dir data/lerobot/my_task \
--num_episodes 80 \
--fps 30
Upload to HuggingFace
# Requires HF_TOKEN
export HF_TOKEN=hf_xxxxx
# Push dataset
python scripts/data/push_to_hub.py \
--dataset_dir data/lerobot/my_task \
--repo_id your-username/psi0-custom-task
Why This Data Recipe Works
1. Staged Usage — Right Data at the Right Time
Not all data is used at once. Each stage has its own data source with a specific purpose:
- Stage 1: Learning vocabulary — EgoDex teaches the model the "vocabulary" of manipulation (grasp, place, rotate, press...)
- Stage 2: Learning grammar — HE data teaches the model how to combine that "vocabulary" into complete "sentences" on a real robot
- Stage 3: Learning to write essays — 80 demos teach the model to write the correct "essay" for a specific task
2. Egocentric Match — Consistent Viewpoint
The camera on the Unitree G1's head produces images very similar to an egocentric camera on a human head. No domain adaptation needed, no viewpoint transformation required. Data from humans naturally transfers to the robot.
3. Quality Over Quantity
The 829 hours of EgoDex were carefully curated — only videos with good quality, clear viewpoints, and visible hands were kept. The 31 hours of HE data also consist of high-quality teleoperation from experienced operators. Compared to crawling thousands of hours of mixed-quality YouTube videos, this approach is far more effective.
Ablation: What Happens Without EgoDex?
The research team ran ablation studies to verify the value of each component:
Ablation Results
| Configuration | Success Rate | vs. Full Model |
|---|---|---|
| Full (EgoDex + HE + 80 demos) | 78.5% | baseline |
| Without EgoDex | 31.2% | -47.3% |
| 10% EgoDex (83 hours) | 52.8% | -25.7% |
| 50% EgoDex (415 hours) | 68.1% | -10.4% |
| Without HE in Stage 1 | 61.3% | -17.2% |
| 40 demos (instead of 80) | 69.7% | -8.8% |
Key findings:
- Removing EgoDex is catastrophic: the success rate drops by 47 percentage points. The model loses its ability to generalize, merely replicating the exact 80 demos it has seen; it cannot adapt when objects are in different positions or have different shapes.
- 10% EgoDex is still significantly worse: even 83 hours of egocentric video is not enough. The model needs to see diverse manipulation patterns, and 83 hours does not provide sufficient coverage.
- HE data in Stage 1 matters: mixing HE into Stage 1 (instead of using it only in Stage 2) helps the model develop embodiment awareness early, improving performance by 17 points.
- 80 demos is the sweet spot: 40 demos performs about 9 points worse, while increasing to 120 demos only adds another 2-3 points. Eighty represents the practical balance between collection effort and performance.
Summary and Takeaways
Psi0's data recipe teaches us an important lesson: more data is not always better. How you organize, process, and use data matters more than the total volume. With its tiered architecture (System-0/1/2) combined with a 3-stage data recipe, Psi0 has demonstrated that 860 hours of the right data, used the right way, can beat 10,000 hours of raw data.
In the next post, we will get hands-on with setting up the environment and running Psi0's training pipeline from start to finish.
Related Posts
- AI for Robotics (7): LeRobot Hands-On — Data Collection and Training — Practical guide to the LeRobot framework, the foundation of Psi0's data format
- VLA Models: When Language Controls Robots — Overview of Vision-Language-Action models, the architectural foundation for System-2
- Diffusion Policy: Generating Actions for Robots — Understanding Flow Matching and generative models for robot actions