
Ψ₀ Hands-On (1): Overview & Key Ideas

Introducing Ψ₀ — the first open-source foundation model that teaches humanoid robots to walk and manipulate objects simultaneously.

Nguyễn Anh Tuấn · March 28, 2026 · 12 min read

Ψ₀ Hands-On (1): Overview & Key Ideas Behind the Foundation Model for Humanoids

Imagine you are standing in your kitchen. You walk back and forth between the fridge and the counter, grab a knife to chop vegetables, and keep your balance as you reach up to a high shelf for spices. For humans, this is so mundane that we do not even think about it. But for robots, this is one of the hardest unsolved problems in robotics: loco-manipulation — moving and manipulating objects at the same time.

And that is precisely the problem that Ψ₀ (pronounced "Psi-Zero") tackles. It is the first open-source foundation model that enables humanoid robots to perform loco-manipulation smoothly, developed by the USC Physical Superintelligence Lab (PSI Lab) in collaboration with NVIDIA.

In this Ψ₀ Hands-On series, we will go from understanding the core ideas to hands-on implementation, step by step. This first article gives you the big picture before we dive into the code.

Why Does Ψ₀ Matter?

Before we talk about architecture or algorithms, let us answer the most important question: why should you care about Ψ₀?


1. Superior Performance with Less Data

Ψ₀ outperforms the strongest current baselines — including NVIDIA's GR00T N1, Physical Intelligence's Pi0, and ACT — by a margin of over 40%, while using 10 times less robot data. Read that again: 10x less data, yet 40% better results. This runs counter to the "more data = better" intuition the AI field typically takes for granted.

2. Solving a Genuinely Hard Problem

Loco-manipulation is not just adding locomotion and manipulation together. When a robot walks while holding an object, its center of gravity shifts continuously, reaction forces from the hands affect the legs, and every action must be coordinated within milliseconds. If you have read about whole-body control for humanoids, you know this is an extremely complex problem.

3. Fully Open-Source

Unlike many foundation models that publish a paper but keep the code private, Ψ₀ opens everything: model weights, training code, inference code, datasets, and the entire data processing pipeline. This means you — yes, you reading this article — can download, run, and fine-tune it for your own use case.

The Problem: Why Co-Training Fails

To understand what makes Ψ₀ special, we first need to understand why previous approaches struggled.

The Old Idea: Mixing Human and Robot Data

Many prior works attempted to train a single model on both human video data (abundantly available on the internet) and robot data (scarce and expensive). The idea sounds perfectly reasonable: humans and robots both manipulate objects, so human data should help robots learn faster.

But reality is far harsher. A problem called kinematic disparity gets in the way: a human body and a robot body differ mechanically, in joint counts, joint limits, limb proportions, and hand morphology, so motions that are natural for one map poorly onto the other.

When you force a model to learn simultaneously from two data sources with such different structures, it is like forcing someone to learn how to drive a car and fly a plane at the same time — both are "piloting," but the fundamental skills are so different that they interfere with each other.

Ψ₀'s New Idea: Divide and Conquer + Data Recipe

Ψ₀ does not try to throw everything into one pot. Instead, the research team realized that how you organize data and train in stages matters more than the amount of data you have. This is the core insight:

Staged training + data recipe > massive data

Specifically, Ψ₀ separates the problem into three distinct systems, each trained with the type of data best suited for it. And the magic lies in how they connect to each other.

Three Systems: Brain, Hands, and Legs

The easiest way to understand the Ψ₀ architecture is to think about how humans function. When you see a glass of water and decide to pick it up, three systems in your body coordinate:

  1. Brain (System-2: Deliberation) — Your eyes see the glass, your brain processes the image, identifies the object, and makes the decision "pick up the glass." This is a slow process that requires thinking.

  2. Hands (System-1: Action) — Once the brain decides, your hand automatically executes a sequence of actions: reach out, open the hand, grasp the glass, lift it up. This process is fast, nearly reflexive, requiring no step-by-step thinking.

  3. Legs (System-0: Balance) — Throughout the process, your legs automatically adjust to maintain balance. You never think about your legs when picking up a glass — they operate entirely on autopilot.


Ψ₀ mirrors this structure precisely:

| System | Name | Model | Parameters | Role |
|---|---|---|---|---|
| System-2 | VLM (Vision-Language Model) | Qwen3-VL-2B | 2 billion | See + understand language |
| System-1 | Action Expert (MM-DiT) | Multi-Modal DiT | ~500 million | Generate actions for hands + upper body |
| System-0 | Locomotion Controller | RL Policy (AMO) | Small | Control legs + maintain balance |

The crucial point is that these three systems are trained separately, each with the data type that is optimal for it. This is the "divide and conquer" approach — rather than forcing a single massive model to learn everything, Ψ₀ splits the problem into three specialized parts.
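To make the division of labor concrete, here is a minimal, hypothetical sketch of the three-tier control stack in Python. All function names, the `LatentGoal` type, and the zero-valued outputs are placeholders of my own; only the interface dimensions (28 upper-body DoF, an 8-dimensional locomotion command, 15 lower-body DoF) come from the article.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three-tier stack described above.
# Names and values are illustrative, not the actual Ψ₀ API.

@dataclass
class LatentGoal:
    """Task representation produced by System-2 (the VLM)."""
    embedding: list = field(default_factory=lambda: [0.0] * 8)

def system2_vlm(image, instruction):
    # Slow deliberation: parse the scene and the language command.
    # (Stand-in for Qwen3-VL-2B; runs at a low frequency.)
    return LatentGoal()

def system1_action_expert(goal: LatentGoal):
    # Faster action generation (stand-in for the MM-DiT action expert):
    # 28 upper-body joint targets plus an 8-dim command for System-0.
    upper_body = [0.0] * 28
    loco_cmd = [0.0] * 8
    return upper_body, loco_cmd

def system0_locomotion(loco_cmd):
    # Reflexive balance layer (stand-in for the AMO RL policy):
    # maps the 8-dim command to 15 lower-body joint targets.
    assert len(loco_cmd) == 8
    return [0.0] * 15

def control_step(image, instruction):
    goal = system2_vlm(image, instruction)
    upper, cmd = system1_action_expert(goal)
    lower = system0_locomotion(cmd)
    return upper + lower  # full 43-DoF action vector

action = control_step(image=None, instruction="pick up the can")
assert len(action) == 43
```

Note how the only coupling between the tiers is a narrow interface (a goal embedding downward, an 8-dim command downward): this is what lets each system be trained separately on its own kind of data.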

If you want to learn more about VLA models in general, you can read our introduction to VLA models in the AI for Robots series.

Three Training Stages: From Watching YouTube to Professional Chef

To explain the 3-stage training pipeline of Ψ₀, let us use an analogy everyone can relate to: learning to cook.

Stage 1: Watching YouTube (Pre-training on egocentric video)

Before entering the kitchen, you watch hundreds of cooking videos on YouTube. You cannot cook yet, but you learn the general patterns: what each step looks like, how hands and tools move, and the order in which things happen.

In Ψ₀, this stage uses EgoDex — a dataset of 829 hours of egocentric (first-person) video drawn from public datasets such as Ego4D, EPIC-KITCHENS, and HOI4D. The model learns to understand images and predict hand actions, but in a general form — not tied to any specific robot.

The key point: egocentric data is chosen deliberately because the viewpoint resembles the camera on a robot's head. This is why Ψ₀ does not use exocentric video (filmed from the outside) — the viewpoint difference would introduce noise.

Result: The model develops "intuition" about object manipulation, much like how watching enough YouTube gives you a rough idea of how to cook, even if you have never set foot in a kitchen.
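The Stage-1 objective (predicting future human hand motion from an egocentric frame) can be sketched as a training sample. Field names and shapes here, including the 21 hand keypoints and the 16-step horizon, are illustrative assumptions, not the real EgoDex schema.

```python
import random

# Hypothetical shape of one Stage-1 pretraining example:
# an egocentric frame paired with the future trajectory of
# human hand keypoints as the prediction target.

def make_egocentric_sample(horizon=16, keypoints=21):
    """One sample: frame -> future 3-D hand-keypoint trajectory."""
    frame = [[0.0] * 224 for _ in range(224)]  # stand-in for an RGB frame
    # (horizon, keypoints, xyz) trajectory for one hand
    hand_traj = [[[random.random() for _ in range(3)]
                  for _ in range(keypoints)]
                 for _ in range(horizon)]
    return {"frame": frame, "target_hand_traj": hand_traj}

sample = make_egocentric_sample()
assert len(sample["target_hand_traj"]) == 16      # horizon steps
assert len(sample["target_hand_traj"][0]) == 21   # keypoints per step
```

The target is human hand motion, not robot joint angles, which is exactly why this stage transfers across embodiments: nothing in the label space is tied to a particular robot.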

Stage 2: Practicing in the Kitchen (Post-training on robot data)

Now you step into a real kitchen. You have 31 hours of practice (equivalent to 31 hours of teleoperation data on the Unitree G1 robot). You apply the knowledge from YouTube to reality, but must adjust, because your own kitchen, tools, and hands never behave quite like the ones in the videos.

In Ψ₀, this stage fine-tunes the action expert on 31 hours of robot data — a surprisingly small amount. But because the model already has foundational knowledge from Stage 1, it can learn much faster than starting from scratch.

Result: The model knows how to control a specific robot (Unitree G1), but is not yet proficient at any specific task.

Stage 3: Mastering a Signature Dish (Fine-tuning with specific demos)

Finally, you want to cook one dish perfectly — say, pho. You need someone to cook it a few times while you watch (equivalent to 80 demonstrations per task). Since you already know how to cook in general, watching just a few times is enough to get it down.

In Ψ₀, this stage fine-tunes the model on just 80 demos per specific task (e.g., picking up a can, opening a drawer, wiping a table). The number 80 is remarkably small compared to the thousands of demos that other methods require.

Result: The model performs specific tasks with a high success rate — 82% on average, compared to 50% for GR00T N1 and 30% for Pi0.
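The three stages can be read as a data recipe. The sketch below simply encodes the quantities quoted in this article; the structure and field names are my own, not Ψ₀'s configuration format.

```python
# Data recipe for the three training stages, using the quantities
# quoted in the article. Structure and field names are illustrative.

STAGES = [
    {"name": "pretrain",  "data": "EgoDex egocentric video",  "amount_hours": 829,
     "trains": "robot-agnostic manipulation prior"},
    {"name": "posttrain", "data": "Unitree G1 teleoperation",  "amount_hours": 31,
     "trains": "robot-specific control"},
    {"name": "finetune",  "data": "per-task demonstrations",   "demos_per_task": 80,
     "trains": "task-specific skill"},
]

# The data-efficiency story in one number: cheap human video
# outweighs expensive robot data by roughly 27 to 1.
ratio = STAGES[0]["amount_hours"] / STAGES[1]["amount_hours"]
assert ratio > 26  # 829 / 31 ≈ 26.7
```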


Target Robot: Unitree G1 + Dex3-1

Ψ₀ was designed and validated on the Unitree G1 — a compact humanoid robot from China — paired with the dexterous Dex3-1 hand. The robot has a total of 43 degrees of freedom (DoF): 28 in the upper body (including both Dex3-1 hands) and 15 in the lower body.

This division mirrors the three-system architecture exactly: System-1 generates commands for the 28 upper-body DoF, while System-0 receives 8 input commands and controls the 15 lower-body DoF.

What Will This Series Teach You?

Here is the roadmap for the entire Ψ₀ Hands-On series:

Part 1 (this article): Overview & Key Ideas

You are here. Understanding the problem, the ideas, and the architecture at the highest level.

Part 2: The Three-Tier Architecture in Detail

A deep-dive into each system: System-2 (VLM), System-1 (MM-DiT with Flow Matching), System-0 (RL Controller). You will understand exactly how data flows from camera to motor.

Part 3: EgoDex & the Data Pipeline

How the EgoDex dataset is built, how egocentric video is processed, and why the data recipe matters. You will process video data yourself.

Part 4: Pre-training the Action Expert

Training Stage 1 — from egocentric video to a model with "intuition." You will run actual pre-training code.

Part 5: Post-training & Fine-tuning

From a general model to a specialized one — training on robot data and fine-tuning for specific tasks.

Part 6: Deployment & Real-Time Chunking

Deploying the model onto a real robot — handling 160ms latency, Real-Time Chunking, and practical tricks for running on real hardware.
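The chunking idea behind Part 6 is easy to sketch. In the toy loop below, the policy emits a chunk of future actions, and while the next chunk is being computed, the robot keeps executing the tail of the current one. Only the 160 ms latency figure is from the article; the 50 Hz control rate and 16-step chunk length are made-up numbers for illustration.

```python
from collections import deque

CONTROL_DT_MS = 20       # assumed 50 Hz control loop
INFER_LATENCY_MS = 160   # model latency quoted in the article
CHUNK_LEN = 16           # assumed actions per chunk

def infer_chunk(start_step):
    """Stand-in for model inference: a chunk of future actions."""
    return [f"a{start_step + i}" for i in range(CHUNK_LEN)]

def run(total_steps=40):
    executed = []
    buffer = deque(infer_chunk(0))
    steps_per_infer = INFER_LATENCY_MS // CONTROL_DT_MS  # 8 control steps
    step = 0
    while step < total_steps:
        # Kick off inference early enough that the new chunk is
        # ready the moment the current buffer runs dry.
        if len(buffer) == steps_per_infer:
            pending = infer_chunk(step + steps_per_infer)
        executed.append(buffer.popleft())
        step += 1
        if not buffer:
            buffer.extend(pending)
    return executed

acts = run()
assert acts == [f"a{i}" for i in range(40)]  # no gaps despite latency
```

The design choice to study: inference is triggered while the buffer still holds exactly as many actions as one inference takes to compute, so the control loop never stalls.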

What Do You Need to Prepare?

To follow along with this series, you will need:

Foundational knowledge: Python and PyTorch, plus basic familiarity with VLMs and reinforcement learning.

Hardware: a CUDA-capable GPU for the training and inference parts; a real Unitree G1 only for the deployment article.

Official resources: the Ψ₀ paper and the open-source release (code, model weights, and data).

Quick Comparison: Ψ₀ vs. Other Methods

To give you an overview of where Ψ₀ stands in the current research landscape:

| | Ψ₀ | GR00T N1 | Pi0 | ACT |
|---|---|---|---|---|
| Loco-manipulation | Yes | No | No | No |
| Robot data needed | 31h + 80 demos/task | ~300h+ | ~10,000h | ~50 demos/task |
| Pre-training data | 829h human video | In-house | In-house | None |
| Open-source | Yes | Partial | No | Yes |
| Avg. success rate | 82% | 50% | 30% | 45% |
| Latency | 160ms | ~200ms | ~100ms | ~50ms |

Note: Numbers are from the Ψ₀ paper on their benchmark. Results may differ on other benchmarks.

What stands out is that Ψ₀ is the only model in this table that truly tackles loco-manipulation — the others only handle manipulation on fixed robot arms or locomotion separately. This is an entirely new playing field where Ψ₀ is leading.

If you are interested in the broader landscape of AI for robotics, you can read our Embodied AI 2026 overview to see where Ψ₀ fits in the ecosystem.

Summary

Ψ₀ represents a significant leap in robotics for three reasons:

  1. A novel approach: It decomposes the complex loco-manipulation problem into three specialized systems, each optimized independently.

  2. Data efficiency: It demonstrates that how you organize data (staged training + egocentric video) matters more than sheer data volume.

  3. Open-source: It releases all code, models, and data — enabling the community to build on this foundation.

In the next article, we will take a deep dive into the three-tier architecture — understanding exactly how each component works, from Qwen3-VL-2B (the brain) to MM-DiT Flow Matching (the hands) to the AMO RL Controller (the legs). You will see why each design decision was made and the tradeoffs behind them.

