Psi0 Hands-On (3): Data Recipe & Pipeline
In the previous two posts of this series, we explored the overall architecture and the three System-0/1/2 subsystems of Psi0. This post dives deep into the most critical part that many people overlook: data. Why does Psi0 only need ~860 hours of data yet still outperform models that use over 10,000 hours? The answer lies in the data recipe — how data is selected, processed, and used in the right order.
Overview of the Three Data Sources
Psi0 uses three fundamentally different data sources, each serving a distinct purpose in the training pipeline:
| Data Source | Duration | Space | Stage | Purpose |
|---|---|---|---|---|
| EgoDex | 829 hours | Task-space (48-DoF) | Stage 1: Pre-train | Learn manipulation primitives from human video |
| Humanoid Everyday (HE) | 31 hours | Joint-space (36-DoF) | Stage 1 + 2 | Learn humanoid body control |
| Task-specific demos | ~80 demos/task | Joint-space (36-DoF) | Stage 3: Fine-tune | Adapt to specific tasks |
The total comes to only about 860 hours for Stage 1, plus 80 demos per task at Stage 3. Compare this with Pi0 from Physical Intelligence needing over 10,000 hours, and you can see this represents a major leap in data efficiency.
The Cooking Analogy
Imagine you want to learn how to cook authentic Vietnamese pho:
Step 1 — Watching YouTube (EgoDex): You watch 829 hours of cooking videos from a first-person perspective. You don't need to know the exact stove temperature or the precise angle of every cut — you just need to understand the general patterns: left hand holds the meat, right hand cuts, place meat in the pot, stir evenly... This is task-space knowledge — knowing what needs to be done, without knowing exactly how the body moves.
Step 2 — Practicing in the kitchen (Humanoid Everyday): After watching enough videos, you enter a real kitchen. 31 hours of hands-on practice helps you transfer knowledge from eyes to hands — learning how your body coordinates to execute those movements. This is joint-space knowledge — knowing exactly how to move each joint.
Step 3 — Mastering the pho recipe (Task-specific): Finally, you only need ~80 cooking sessions on the specific recipe to fine-tune your skills. No need for 10,000 attempts — because you already have a solid foundation.
EgoDex: 829 Hours of Egocentric Video
EgoDex is the largest and most important dataset in Psi0. It is created from egocentric (first-person) videos of humans performing everyday hand manipulations.
Why Egocentric?
Here is the key insight: the camera on top of a humanoid robot looking down at its hands produces images that look nearly identical to a camera mounted on a human's head looking down at human hands. When you use third-person video, the model has to learn an additional viewpoint transformation step — which is difficult and data-hungry. Egocentric video eliminates this problem entirely.
EgoDex Data Structure
EgoDex is stored in 48-DoF task-space format:
- Left hand 3D position (x, y, z): 3 values
- Right hand 3D position (x, y, z): 3 values
- Finger states: 42 values (21 DoF per hand — 3 DoF for each of 7 finger groups)
Note: this is task-space, not joint-space. EgoDex does not encode the joint angles of any particular robot — it only captures where the hands and fingers are in 3D space. This means EgoDex can transfer to any robot with hands, not just the Unitree G1.
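Under the stated layout, reading one frame is just array slicing. The exact dimension ordering below is an assumption for illustration; the dataset's own metadata is authoritative.

```python
import numpy as np

# One 48-DoF task-space frame, using a hypothetical ordering:
#   [0:3]  left hand xyz   [3:6]  right hand xyz   [6:48] finger states
frame = np.zeros(48)
frame[0:3] = [0.10, -0.25, 0.40]   # left hand position (meters, camera frame)
frame[3:6] = [0.12, 0.30, 0.38]    # right hand position

left_pos, right_pos, fingers = frame[:3], frame[3:6], frame[6:]
assert left_pos.shape == (3,) and right_pos.shape == (3,)
assert fingers.shape == (42,)      # 21 DoF per hand
```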
EgoDex Processing Pipeline
The workflow from raw video to training data:
Egocentric video -> H-RDT (Hand Reconstruction) -> 3D hand positions
-> Transform to camera frame -> Upsample 3x -> Normalize -> FAST tokenize
Step 1: Hand Reconstruction with H-RDT
H-RDT (Hand Reconstruction from Dense Tracking) analyzes each video frame to extract 3D positions of hand joints. It outputs 48 values per frame.
Step 2: Transform to Camera Frame
Since egocentric video involves a moving camera, all 3D coordinates are transformed into the camera frame — a coordinate system with the origin at the camera. This ensures data consistency regardless of whether the person wearing the camera is standing or sitting.
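As a sketch: if the camera's pose in the world frame is a rotation R and position t, a world-frame point maps into the camera frame as p_cam = Rᵀ(p_world − t). The pose values below are made up for illustration.

```python
import numpy as np

def world_to_camera(p_world, R_cam, t_cam):
    """Express a world-frame point in the camera frame.

    R_cam (3x3) and t_cam (3,) are the camera's orientation and
    position in the world frame.
    """
    return R_cam.T @ (np.asarray(p_world) - np.asarray(t_cam))

# Identity pose at the origin: camera frame equals world frame
p = world_to_camera([0.2, 0.1, 0.5], np.eye(3), np.zeros(3))
assert np.allclose(p, [0.2, 0.1, 0.5])

# Camera 1.5 m up: the same point sits 1 m below the camera origin
p = world_to_camera([0.0, 0.0, 0.5], np.eye(3), [0.0, 0.0, 1.5])
assert np.allclose(p, [0.0, 0.0, -1.0])
```

Because every episode is expressed relative to its own camera, the standing/sitting height of the person wearing the headset drops out of the data.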
Step 3: Upsample 3x
The original video is typically at 10-30 FPS. The pipeline upsamples by 3x to achieve higher temporal resolution, helping the model learn smoother motions.
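The interpolation scheme isn't specified, but a plain linear-interpolation sketch conveys the idea: two synthetic frames are inserted between each pair of originals.

```python
import numpy as np

def upsample_3x(traj):
    """Linearly interpolate a (T, D) trajectory to 3x temporal resolution.

    Inserts two interpolated frames between each consecutive pair of
    originals, yielding 3*(T-1)+1 frames.
    """
    T, D = traj.shape
    t_old = np.arange(T)
    t_new = np.linspace(0, T - 1, 3 * (T - 1) + 1)
    return np.stack(
        [np.interp(t_new, t_old, traj[:, d]) for d in range(D)], axis=1
    )

traj = np.random.randn(10, 48)     # 10 frames of 48-DoF task-space actions
up = upsample_3x(traj)
assert up.shape == (28, 48)        # 3*(10-1)+1 = 28 frames
assert np.allclose(up[::3], traj)  # the original frames are preserved
```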
Step 4: Normalize
Each dimension is normalized based on statistics computed from the entire dataset. The stats.json file stores the mean and standard deviation for each dimension.
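A minimal sketch of what a stats-computation script presumably does — the actual `calc_modality_stats.py` implementation and key names may differ:

```python
import numpy as np

def compute_stats(episodes):
    """Per-dimension mean/std/min/max over a list of (T_i, D) episode
    arrays, in the shape a stats.json entry expects."""
    frames = np.concatenate(episodes, axis=0)  # (sum of T_i, D)
    return {
        "mean": frames.mean(axis=0).tolist(),
        "std": frames.std(axis=0).tolist(),
        "min": frames.min(axis=0).tolist(),
        "max": frames.max(axis=0).tolist(),
    }

# Five fake episodes of 100 frames each, 48 dimensions
episodes = [np.random.randn(100, 48) for _ in range(5)]
stats = {"observation.state": compute_stats(episodes)}
assert len(stats["observation.state"]["mean"]) == 48
assert all(s > 0 for s in stats["observation.state"]["std"])
```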
Step 5: FAST Tokenize
This is the most critical step — converting 48 continuous values into ~20 discrete tokens.
FAST Tokenizer: Smart Action Compression
FAST (Frequency-space Action Sequence Tokenization) is one of Psi0's key technical building blocks. Instead of having the model directly predict 48 continuous values (very difficult for autoregressive models), FAST compresses them into a short sequence of discrete tokens.
How FAST Works
Input: 48 continuous values (e.g., [0.23, -0.15, 0.87, ...])
  ↓ Discrete Cosine Transform (DCT)
  ↓ Keep the 20 most important coefficients
  ↓ Quantize into a codebook of 2048 bins
Output: ~20 discrete tokens (e.g., [1204, 87, 1956, ...])
Codebook size: 2048 bins, i.e. each kept DCT coefficient is discretized into one of 2048 levels. This provides sufficient precision for robot control (error below 0.5 mm in a typical workspace).
Reconstruction loss (L1): 0.005 — when decoding back from tokens to continuous values, the average error is only 0.005. This means the compression is nearly lossless.
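To make the DCT-then-quantize idea concrete, here is a toy reimplementation — not the actual FAST code. It keeps the lowest-frequency coefficients as a stand-in for the "most important" ones, and the [-4, 4] coefficient range is an arbitrary choice for the demo.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (its inverse is its transpose)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def tokenize(action, n_keep=20, n_bins=2048, lo=-4.0, hi=4.0):
    """DCT -> keep n_keep low-frequency coefficients -> quantize to bins."""
    coeffs = dct_matrix(len(action)) @ action
    kept = np.clip(coeffs[:n_keep], lo, hi)
    return np.round((kept - lo) / (hi - lo) * (n_bins - 1)).astype(int)

def detokenize(tokens, n_dims=48, n_bins=2048, lo=-4.0, hi=4.0):
    """Dequantize, zero-pad the dropped coefficients, inverse DCT."""
    coeffs = np.zeros(n_dims)
    coeffs[: len(tokens)] = tokens / (n_bins - 1) * (hi - lo) + lo
    return dct_matrix(n_dims).T @ coeffs

action = 0.5 * np.sin(np.linspace(0, np.pi, 48))  # a smooth 48-DoF frame
tokens = tokenize(action)
recon = detokenize(tokens)
assert tokens.shape == (20,) and 0 <= tokens.min() and tokens.max() < 2048
assert np.abs(recon - action).max() < 0.05        # near-lossless for smooth motion
```

The compression works precisely because robot motions are smooth: most of the signal energy lives in the low-frequency DCT coefficients, so dropping the rest costs almost nothing.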
Why Not Use Continuous Actions Directly?
Three main reasons:
- Autoregressive generation: Qwen3-VL (System-2) is a language model, and language models natively generate discrete tokens; regressing raw continuous values autoregressively is far harder
- Vocabulary sharing: Action tokens can share the embedding space with text tokens, enabling the model to learn cross-modal patterns
- Compression: 48 values -> ~20 tokens = 2.4x compression, significantly reducing sequence length
LeRobot Data Format
All data in Psi0 is stored in LeRobot format, HuggingFace's standard data format for robot learning. If you already know the LeRobot framework, this layout will be immediately familiar.
Directory Structure
dataset/
├── meta/
│ ├── info.json # Metadata: fps, features, robot type
│ ├── stats.json # Mean, std, min, max for each modality
│ ├── episodes.jsonl # List of episodes with timestamps
│ └── tasks.jsonl # Task descriptions for each episode
├── data/
│ ├── chunk-000/
│ │ ├── episode_000000.parquet # Actions + states
│ │ ├── episode_000001.parquet
│ │ └── ...
│ └── chunk-001/
│ └── ...
└── videos/
├── chunk-000/
│ ├── observation.images.cam_high/
│ │ ├── episode_000000.mp4
│ │ └── ...
│ └── observation.images.cam_wrist/
│ └── ...
└── chunk-001/
└── ...
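As a sketch of how the JSONL metadata reads: one JSON object per line. The field names below mirror the tree above, but the exact schema depends on the LeRobot version.

```python
import io
import json

# A tiny stand-in for meta/episodes.jsonl (hypothetical field names)
episodes_jsonl = io.StringIO(
    '{"episode_index": 0, "length": 412, "tasks": ["fold the towel"]}\n'
    '{"episode_index": 1, "length": 389, "tasks": ["fold the towel"]}\n'
)
episodes = [json.loads(line) for line in episodes_jsonl]

total_frames = sum(ep["length"] for ep in episodes)
assert len(episodes) == 2
assert total_frames == 801
```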
Observation Format
Each observation frame includes:
- Images: 320x240 pixels, typically from 2 cameras (head cam + wrist cam)
- State: 28-DoF proprioception — positions and velocities of arm + hand joints
- Timestamp: Precise timing of each frame
Action Format
- Joint-space (HE + Task data): 36-DoF — includes both upper body and hand joints
- Task-space (EgoDex): 48-DoF — 3D hand positions + finger states
stats.json
This file is critically important for normalization/denormalization:
{
"observation.state": {
"mean": [0.12, -0.05, ...],
"std": [0.34, 0.28, ...],
"min": [-1.57, -1.57, ...],
"max": [1.57, 1.57, ...]
},
"action": {
"mean": [...],
"std": [...],
"min": [...],
"max": [...]
}
}
During training, all values are normalized: x_norm = (x - mean) / std. During inference, they are denormalized back: x = x_norm * std + mean. If the stats are wrong, the robot's movements will be completely off.
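The round trip described above is two lines of numpy:

```python
import numpy as np

def normalize(x, stats):
    """Z-score normalization used during training."""
    return (np.asarray(x) - stats["mean"]) / stats["std"]

def denormalize(x_norm, stats):
    """Inverse transform applied to model outputs at inference time."""
    return np.asarray(x_norm) * stats["std"] + stats["mean"]

# Stats for a toy 2-D modality (values echo the stats.json example above)
stats = {"mean": np.array([0.12, -0.05]), "std": np.array([0.34, 0.28])}
x = np.array([0.5, -0.3])
assert np.allclose(denormalize(normalize(x, stats), stats), x)  # exact round trip
```

If stats.json were computed on one dataset and applied to another, `denormalize` would scale and shift every action incorrectly — which is exactly why the robot's movements go "completely off" with wrong stats.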
Humanoid Everyday (HE): 31 Hours of Robot Data
While EgoDex provides manipulation knowledge from human videos, Humanoid Everyday supplies real data from a Unitree G1 robot performing daily tasks.
HE Characteristics
- 31 hours of teleoperation data
- Joint-space: 36-DoF (not task-space like EgoDex)
- Tasks: Everyday activities — folding clothes, clearing tables, pouring water, opening cabinets...
- Robot: Unitree G1 + Dex3-1 hands (43-DoF total, 36-DoF upper body)
Role in Training
HE appears in both Stage 1 and Stage 2:
- Stage 1 (Pre-train): HE data is mixed with EgoDex. The model learns both task-space (from EgoDex) and joint-space (from HE) simultaneously. The mixing ratio is tuned to balance the two data sources.
- Stage 2 (Post-train): Only HE data is used. The Flow Matching action expert learns to generate precise joint-space actions.
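The Stage 1 mixing can be as simple as weighted sampling between the two sources each time a batch element is drawn. The 9:1 ratio below is a placeholder for illustration, not the ratio Psi0 actually uses.

```python
import numpy as np

def sample_source(rng, p_egodex=0.9):
    """Pick which dataset the next training sample is drawn from."""
    return "egodex" if rng.random() < p_egodex else "he"

rng = np.random.default_rng(0)
draws = [sample_source(rng) for _ in range(10_000)]
frac = draws.count("egodex") / len(draws)
assert 0.88 < frac < 0.92  # empirical mix is close to the requested 90/10
```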
Task-specific Demos: 80 Is Enough
The most impressive aspect of Psi0 is that it only needs 80 demonstrations per new task. For comparison:
| Model | Demos Required | Collection Time |
|---|---|---|
| ACT | 50-200 | 1-2 hours |
| Diffusion Policy | 100-500 | 2-5 hours |
| Pi0 | 500-1000+ | 5-10 hours |
| Psi0 | ~80 | ~40 minutes |
Forty minutes of teleoperation to teach a humanoid robot a new task is practical enough for deployment in factories or homes.
Data Processing Commands
Download Simulation Data
# Download simulation data from HuggingFace
python scripts/data/download_datasets.py \
--repo_id physical-superintelligence/psi0-sim-data \
--local_dir data/sim
# Download real-world data
python scripts/data/download_datasets.py \
--repo_id physical-superintelligence/psi0-real-data \
--local_dir data/real
Convert Raw Data to LeRobot Format
# Convert raw teleoperation data to LeRobot format
python scripts/data/raw_to_lerobot.py \
--input_dir data/raw/my_task \
--output_dir data/lerobot/my_task \
--fps 30 \
--video_codec h264
# Compute statistics for the dataset
python scripts/data/calc_modality_stats.py \
--dataset_dir data/lerobot/my_task \
--output_path data/lerobot/my_task/meta/stats.json
# Patch metadata if needed
python scripts/data/patch_lerobot_meta.py \
--dataset_dir data/lerobot/my_task \
--num_episodes 80 \
--fps 30
Upload to HuggingFace
# Requires HF_TOKEN
export HF_TOKEN=hf_xxxxx
# Push dataset
python scripts/data/push_to_hub.py \
--dataset_dir data/lerobot/my_task \
--repo_id your-username/psi0-custom-task
Why This Data Recipe Works
1. Staged Usage — Right Data at the Right Time
Not all data is used at once. Each stage has its own data source with a specific purpose:
- Stage 1: Learning vocabulary — EgoDex teaches the model the "vocabulary" of manipulation (grasp, place, rotate, press...)
- Stage 2: Learning grammar — HE data teaches the model how to combine that "vocabulary" into complete "sentences" on a real robot
- Stage 3: Learning to write essays — 80 demos teach the model to write the correct "essay" for a specific task
2. Egocentric Match — Consistent Viewpoint
The camera on the Unitree G1's head produces images very similar to an egocentric camera on a human head. No domain adaptation needed, no viewpoint transformation required. Data from humans naturally transfers to the robot.
3. Quality Over Quantity
The 829 hours of EgoDex were carefully curated — only videos with good quality, clear viewpoints, and visible hands were kept. The 31 hours of HE data also consist of high-quality teleoperation from experienced operators. Compared to crawling thousands of hours of mixed-quality YouTube videos, this approach is far more effective.
Ablation: What Happens Without EgoDex?
The research team ran ablation studies to verify the value of each component:
Ablation Results
| Configuration | Success Rate | vs. Full Model |
|---|---|---|
| Full (EgoDex + HE + 80 demos) | 78.5% | baseline |
| Without EgoDex | 31.2% | -47.3% |
| 10% EgoDex (83 hours) | 52.8% | -25.7% |
| 50% EgoDex (415 hours) | 68.1% | -10.4% |
| Without HE in Stage 1 | 61.3% | -17.2% |
| 40 demos (instead of 80) | 69.7% | -8.8% |
Key findings:
- Removing EgoDex is catastrophic: the success rate drops by 47 percentage points. The model loses its ability to generalize, merely replicating the exact 80 demos it has seen; it cannot adapt when objects are in different positions or have different shapes.
- 10% EgoDex is still significantly worse: even 83 hours of egocentric video is not enough. The model needs to see diverse manipulation patterns, and 83 hours does not provide sufficient coverage.
- HE data in Stage 1 matters: mixing HE into Stage 1 (instead of using it only in Stage 2) helps the model develop embodiment awareness early, improving performance by 17 points.
- 80 demos is the sweet spot: 40 demos performs about 9 points worse, while increasing to 120 demos only adds another 2-3 points. Eighty represents the practical balance between collection effort and performance.
Summary and Takeaways
Psi0's data recipe teaches us an important lesson: more data is not always better. How you organize, process, and use data matters more than the total volume. With its tiered architecture (System-0/1/2) combined with a 3-stage data recipe, Psi0 has demonstrated that 860 hours of the right data, used the right way, can beat 10,000 hours of raw data.
In the next post, we will get hands-on with setting up the environment and running Psi0's training pipeline from start to finish.
Related Posts
- AI for Robotics (7): LeRobot Hands-On — Data Collection and Training — Practical guide to the LeRobot framework, the foundation of Psi0's data format
- VLA Models: When Language Controls Robots — Overview of Vision-Language-Action models, the architectural foundation for System-2
- Diffusion Policy: Generating Actions for Robots — Understanding Flow Matching and generative models for robot actions