
Ψ₀ Hands-On (2): The Three-Tier Architecture

Deep-dive into the Ψ₀ architecture: System-2 (VLM), System-1 (MM-DiT), System-0 (RL Controller) — the brain, hands, and legs of the robot.

Nguyễn Anh Tuấn · April 1, 2026 · 15 min read

Ψ₀ Hands-On (2): The Three-Tier Architecture — Brain, Hands, and Legs of the Robot

In the previous article, we understood the big picture of Ψ₀ and why it matters. Now it is time to pop the hood and see what is inside. This article takes a deep dive into the three-tier architecture — how each component works, how data flows through the system, and why each design decision was made.

Fair warning: this article is technically dense. But I will do my best to keep everything as intuitive as possible.

The Big Picture: Data Flow

Before diving into each system, let us look at the entire pipeline from end to end. When the robot receives a command ("pick up the glass of water"), here is what happens:

Camera (320x240 RGB)
    │
    ▼
System-2: Qwen3-VL-2B (VLM)
    │ → Visual features + Text features
    ▼
System-1: MM-DiT (~500M params)
    │ → Action chunks (28 DoF upper body + 8 locomotion commands)
    ├── 28 DoF → Directly to arm + hand motors
    ▼
System-0: AMO RL Controller
    │ ← 8 DoF commands (velocity, heading, height...)
    ▼
15 DoF legs → Leg motors

The entire pipeline runs with a latency of approximately 160ms — fast enough for real-time control. Now let us zoom into each system.
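The flow above can be sketched as glue code for one control tick. All component interfaces here (`camera`, `system2`, `system1`, `system0`) are illustrative stand-ins of my own, not the actual Ψ₀ API:

```python
# Hypothetical glue code for one control tick (names are illustrative).
def control_step(camera, text_cmd, system2, system1, system0):
    image = camera.capture()              # 320x240 RGB frame
    features = system2(image, text_cmd)   # frozen VLM -> visual-language features
    chunk = system1(features)             # MM-DiT -> [H, 36] action chunk
    upper_body = chunk[0][:28]            # 28 DoF -> directly to arm/hand motors
    loco_cmd = chunk[0][28:]              # 8 commands -> System-0
    leg_targets = system0(loco_cmd)       # AMO RL -> 15 DoF leg joint targets
    return upper_body, leg_targets
```

The key structural point is visible in the slicing: only the last 8 dimensions of the action chunk pass through System-0; the other 28 bypass it entirely.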

System-2: The Brain That Sees and Understands — Qwen3-VL-2B

Why Qwen3-VL-2B?

System-2 is a Vision-Language Model (VLM) whose job is to take images from the camera and text commands, then extract meaningful features for System-1 to use. The research team chose Qwen3-VL-2B — the 2-billion parameter version of Qwen3-VL from Alibaba — for three reasons:

  1. Right-sized: 2B parameters is powerful enough to understand complex images, yet small enough to run in real-time on an edge GPU. Larger models like LLaVA-13B or Qwen-VL-7B would be too slow to meet the ~160ms latency budget for robot control.

  2. Strong vision performance: Qwen3-VL is one of the best small VLMs for visual understanding, particularly in recognizing objects and understanding 3D spatial relationships from 2D images.

  3. Pre-trained out of the box: The model has already been trained on billions of image-text pairs, so it comes with built-in ability to recognize everyday objects (cups, bottles, boxes, door handles...) that robots need to interact with.

AI architecture and neural networks — the building blocks of modern deep learning

How It Works

System-2 takes two inputs: the camera image (320x240 RGB) and the natural-language command. It outputs visual-language features — a tensor containing semantic information about both the image and the text — which is passed into System-1.

# Simplified pseudo-code
image = camera.capture()  # 320x240x3 RGB
text = "pick up the red cup"

# VLM processing
visual_tokens = qwen3_vision_encoder(image)   # [N, D]
text_tokens = qwen3_text_encoder(text)        # [M, D]
features = qwen3_cross_attention(visual_tokens, text_tokens)  # [N+M, D]

One important detail: System-2 is completely frozen during Ψ₀ training. This means the research team does not fine-tune the VLM — they use the pre-trained model as-is. This has two implications: training is cheaper and the VLM's general visual knowledge is preserved, but the visual features are never adapted specifically to the robot's tasks.

If you want a deeper understanding of how VLA models generally use VLMs, read our article on VLA models in the AI for Robots series.

System-1: The Acting Hands — MM-DiT Action Expert

This is the heart of Ψ₀. System-1 is an action expert with approximately 500 million parameters, responsible for generating sequences of actions (action chunks) from System-2's features. It is built on three key technologies: MM-DiT, Flow Matching, and Action Chunking.

MM-DiT: Why Not Use a Standard DiT?

DiT (Diffusion Transformer) is a transformer architecture designed for diffusion models. But Ψ₀ uses a variant called MM-DiT (Multi-Modal DiT), inspired by the architecture in Stable Diffusion 3. The difference lies in two critical aspects:

1. Separate Modulation

In a standard DiT (Naive DiT), all modalities (image, text, action) pass through the same set of modulation parameters. This means the model uses a shared way to "adjust" processing for every type of input.

MM-DiT changes this: each modality has its own modulation parameters. Specifically, the visual tokens receive one set of scale/shift parameters and the action tokens receive a separate set, each conditioned on the diffusion timestep.

Why does this matter? Because visual features and action tokens are fundamentally different in nature. Visual features represent static spatial information (object positions, shapes). Action tokens represent dynamic temporal sequences (robot joint positions over time). Forcing them through the same "filter" loses information.

2. Joint Attention

Despite separate modulation, MM-DiT still allows joint attention across modalities. This means action tokens can "look at" visual features during attention computation, and vice versa.

# Simplified MM-DiT block (pseudo-code)
class MMDiTBlock:
    def forward(self, visual_tokens, action_tokens, timestep):
        # Separate modulation: each modality gets its own scale/shift,
        # conditioned on the diffusion timestep
        v_scale, v_shift = self.visual_modulation(timestep)
        a_scale, a_shift = self.action_modulation(timestep)

        # Per-modality normalization before modulation
        visual_tokens = v_scale * self.visual_norm(visual_tokens) + v_shift
        action_tokens = a_scale * self.action_norm(action_tokens) + a_shift

        # Joint attention: both modalities attend to each other
        all_tokens = concat(visual_tokens, action_tokens)
        all_tokens = self.joint_attention(all_tokens)

        # Split back into per-modality streams
        visual_tokens, action_tokens = split(all_tokens)
        return visual_tokens, action_tokens

The combination of separate modulation + joint attention creates the best balance: each modality is processed in a way that suits its nature, while still being able to exchange information with the other modality.

Flow Matching: From Noise to Action

This is the most important mathematical component in Ψ₀. But do not worry — I will explain it intuitively first, then introduce the formulas.

The Intuitive Idea

Imagine you are standing in the middle of a foggy field (noise). You want to reach a specific house (the correct action). Flow Matching gives you a compass — at every point in the field, the compass tells you which direction to go. If you follow the compass long enough, you will reach the house.

Mathematically, this "compass" is a velocity field v(x, tau) — at every point x in action space and every time step tau, it tells you which direction to move.

The Core Formula

Flow Matching defines a straight-line path from noise to action:

x(tau) = (1 - tau) * epsilon + tau * a

Where epsilon is a sample of pure Gaussian noise, a is the target action (or action chunk), and tau in [0, 1] is the interpolation time: tau = 0 is pure noise, tau = 1 is the clean action.

The corresponding velocity field is:

v*(x, tau) = a - epsilon

Surprisingly simple — the velocity field is just the direction from noise to action. The model is trained to predict this velocity field:

Loss = || v_theta(x(tau), tau, c) - (a - epsilon) ||^2

Where c is the conditioning (visual features from System-2), and v_theta is the MM-DiT model with parameters theta.
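The training objective above fits in a few lines. This is a minimal sketch: `velocity_model` is a stand-in for the real MM-DiT (which would also take the conditioning c), and the function name is my own.

```python
import numpy as np

def flow_matching_training_step(a, velocity_model, rng):
    """One flow-matching training step for a target action chunk `a`."""
    eps = rng.standard_normal(a.shape)       # pure Gaussian noise
    tau = rng.uniform()                      # random time in [0, 1]
    x_tau = (1 - tau) * eps + tau * a        # point on the straight path
    v_star = a - eps                         # target velocity: noise -> action
    v_pred = velocity_model(x_tau, tau)      # model's predicted velocity
    return np.mean((v_pred - v_star) ** 2)   # squared-error loss
```

In a real training loop this loss would be backpropagated through the MM-DiT parameters; the sketch just makes the path, target velocity, and loss concrete.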

Robots and automation — AI applied in the physical world

Flow Matching vs. DDPM Diffusion

If you have read about Diffusion Policy, you will notice that Flow Matching differs from DDPM in several key ways:

|  | Flow Matching | DDPM Diffusion |
|---|---|---|
| Path | Straight (linear interpolation) | Curved (Markov chain) |
| Inference steps | Few (1-10) | Many (50-1000) |
| Speed | Much faster | Slow |
| Training stability | More stable | Can be unstable |
| Quality | Comparable or better | Good |

Ψ₀ uses 10 denoising steps during inference. At each step, the model takes the current state x(tau), predicts the velocity v, and updates: x(tau + delta_tau) = x(tau) + delta_tau * v. After 10 steps, pure noise becomes a complete action.
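The 10-step denoising loop is plain Euler integration. A minimal sketch, with `velocity_fn` standing in for the trained MM-DiT:

```python
import numpy as np

def flow_matching_sample(velocity_fn, shape, steps=10, seed=0):
    """Integrate the learned velocity field from noise (tau=0) to action (tau=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)   # start from pure Gaussian noise
    dtau = 1.0 / steps
    for k in range(steps):
        tau = k * dtau
        v = velocity_fn(x, tau)      # model's predicted velocity at (x, tau)
        x = x + dtau * v             # Euler update: x(tau + dtau) = x(tau) + dtau * v
    return x                         # x(1): the denoised action chunk
```

A useful sanity check: for a single fixed target a, the ideal velocity field is v(x, tau) = (a - x) / (1 - tau), and this loop lands exactly on a from any starting noise.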

Action Chunking: Predicting the Future

Instead of predicting one action at a time, System-1 predicts a chunk of H consecutive actions simultaneously. In Ψ₀, H is typically 16-32 timesteps (equivalent to 0.8-1.6 seconds at 20Hz).

Why is chunking important?

  1. Temporal coherence: When predicting individual actions, consecutive actions can "jitter" — one timestep tells the arm to go right, the next tells it to go left. Predicting an entire chunk ensures a smooth action sequence.

  2. Handling latency: The robot needs approximately 160ms for one inference pass. If it only predicted 1 action, the robot would "freeze" for 160ms between each action. With chunking, the robot can execute previously predicted actions while the model computes the next batch.

  3. Long-horizon understanding: Predicting 16 steps into the future requires the model to grasp a longer-term "plan," not just immediate reflexes.
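The timing argument above is easy to sanity-check with the numbers from this article (20Hz control, ~160ms inference, H = 16):

```python
CONTROL_HZ = 20
TIMESTEP_S = 1.0 / CONTROL_HZ        # 0.05 s per control cycle
INFERENCE_LATENCY_S = 0.160          # one model forward pass

# Actions consumed from the current chunk while the next one is computed
steps_per_inference = INFERENCE_LATENCY_S / TIMESTEP_S   # ~3.2 timesteps

# How far ahead one chunk of 16 actions reaches
chunk_len = 16
chunk_duration_s = chunk_len * TIMESTEP_S                # 0.8 s
```

So each inference "costs" about 3-4 of the chunk's 16 actions, leaving plenty of margin before the robot runs out of things to do.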

FAST Tokenizer: Compressing Actions into Tokens

An important technical detail: before feeding actions into the MM-DiT, Ψ₀ uses the FAST tokenizer to encode continuous actions into discrete tokens. FAST (Frequency-space Action Sequence Tokenization) compresses action sequences into a compact representation by applying a discrete cosine transform (DCT) along the time axis, quantizing the resulting coefficients, and merging frequent patterns with byte-pair encoding (BPE).

This allows the MM-DiT to process actions like "language" — each action token is equivalent to a "word" describing a movement. After inference, the output tokens are decoded back into continuous joint angle values.
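Here is a minimal numpy sketch of the frequency-domain compression idea. It illustrates only the DCT-truncation step, not the full FAST pipeline, and all function names are my own:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are frequency components)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def compress_actions(actions, keep):
    """DCT along time; keep only the `keep` lowest-frequency coefficients."""
    M = dct_matrix(actions.shape[0])
    return (M @ actions)[:keep]          # [keep, D] instead of [T, D]

def decompress_actions(coeffs, horizon):
    """Zero-pad high frequencies and apply the inverse (transpose) DCT."""
    M = dct_matrix(horizon)
    full = np.zeros((horizon, coeffs.shape[1]))
    full[: coeffs.shape[0]] = coeffs
    return M.T @ full                    # orthonormal -> inverse is transpose
```

Smooth action trajectories concentrate their energy in the low frequencies, which is why a handful of coefficients reconstructs them almost perfectly.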

System-0: The Autonomous Legs — AMO RL Controller

System-0 is the "balance keeper" of Ψ₀. It is a policy trained with Reinforcement Learning, called AMO (Adaptive Motion Optimization), responsible for controlling the robot's legs.

Input/Output

System-0 receives the 8 locomotion commands from System-1 (velocity, heading, torso height...) together with the robot's proprioceptive state, and outputs target angles for the 15 leg DoF at high frequency.
Why a Separate RL Policy Instead of Learning Locomotion End-to-End?

Bipedal locomotion is a problem that RL has already solved very well. Training an RL policy for walking and balancing in simulation (Isaac Gym, MuJoCo) and then performing sim-to-real transfer has become an industry standard — Unitree is one of the companies doing this best.

Since RL policies for locomotion are already very mature, there is no reason to retrain them end-to-end alongside manipulation. Keeping System-0 separate has several advantages:

  1. Robustness: The RL policy has been tested over millions of steps in simulation and is very stable.
  2. Safety: If System-1 generates unusual actions for the upper body, System-0 still prevents the robot from falling.
  3. Modularity: You can swap System-0 with a different policy without affecting System-1.

If you are not yet clear on how RL is used in humanoid locomotion, check out our article on RL for humanoids in the Humanoid Robot Engineering series.

Action Space: 43 DoF and How It Changes Across Stages

A subtle but important point: Ψ₀'s action space changes between training stages. This is not an oversight — it is a deliberate design choice.

Pre-training: 48-DoF Task Space

During the pre-training stage on egocentric video (EgoDex), the model predicts actions in task space — that is, the position and orientation of the hands in 3D space — for a total of 48 dimensions.
Why use task space for pre-training? Because human video data only tells us where the hand is (via hand tracking), not the specific robot joint angles. Task space is a shared representation between humans and robots.

Post-training & Fine-tuning: 36-DoF Joint Space

When switching to real robot data (Stages 2 and 3), the model transitions to joint space — directly predicting the target angle of each joint (28 upper-body DoF plus 8 locomotion commands, 36 dimensions in total).

The transition from task space to joint space is a crucial step — it allows the model to "translate" general knowledge from human video into specific control commands for the Unitree G1 robot.

The Full 43-DoF

The total 43 DoF across the robot = 28 upper-body DoF (directly from System-1) + 15 leg DoF (from System-0, receiving 8 commands from System-1):

System-1 Output:
├── 7-DoF left arm      ─────────── Direct → Motor
├── 7-DoF right arm     ─────────── Direct → Motor  
├── 7-DoF left hand     ─────────── Direct → Motor
├── 7-DoF right hand    ─────────── Direct → Motor
└── 8-DoF locomotion cmd   ──┐
                              │
System-0 (AMO RL):            │
├── Input: 8-DoF commands  ◄──┘
└── Output: 15-DoF legs    ─────────── Direct → Motor
                              
Total: 28 + 15 = 43 DoF

Real-Time Chunking: Solving the Latency Problem

A practical issue: model inference takes approximately 160ms, but the robot control loop needs to run at 20Hz (50ms per cycle). How do you handle this?

Ψ₀ uses a technique called Real-Time Chunking:

  1. At time t, the model predicts a chunk of H actions: [a_t, a_{t+1}, ..., a_{t+H-1}]
  2. The robot starts executing a_t immediately
  3. While the robot executes a_t, a_{t+1}, a_{t+2}... the model is computing the next chunk
  4. When the model finishes (after approximately 160ms = approximately 3 timesteps at 20Hz), the robot has a new chunk
  5. The new chunk is blended with the remaining portion of the old chunk to avoid jitter
Time:       t=0   t=1   t=2   t=3   t=4   t=5   t=6
Chunk 1:   [a0,   a1,   a2,   a3,   a4,   a5,   ...]
                              ^ model finishes chunk 2
Chunk 2:                     [a3',  a4',  a5',  a6',  ...]
Executed:  [a0,   a1,   a2,  blend(a3,a3'), ...]

The simplest blending technique is a weighted average: executed_action = alpha * new_chunk_action + (1 - alpha) * old_chunk_action, with alpha increasing gradually from 0 to 1 so control hands over smoothly to the new chunk.
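That blending rule can be sketched in a few lines. Names are illustrative, and the linear alpha ramp is my assumption — the article only specifies that alpha increases gradually:

```python
import numpy as np

def blend_overlap(old_actions, new_actions):
    """Crossfade the overlapping part of two chunks: alpha ramps 0 -> 1."""
    n = min(len(old_actions), len(new_actions))
    alphas = np.linspace(0.0, 1.0, n)[:, None]   # increasing weight on the new chunk
    return alphas * new_actions[:n] + (1 - alphas) * old_actions[:n]
```

At the start of the overlap the robot still follows the old chunk (alpha = 0); by the end it follows the new chunk entirely (alpha = 1), so there is no discontinuity in the commanded joint targets.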

Why This Architecture Works: The "Divide and Conquer" Philosophy

After walking through each component, let us step back and ask: why does this three-tier architecture outperform a monolithic model?

1. Each System Uses the Right Type of Data

System-2 learns from web-scale image-text pairs, System-1 learns from egocentric human video (EgoDex) and robot demonstrations, and System-0 learns from millions of steps in physics simulation. No single data source could train all three capabilities at once.

2. Each System Has the Right Architecture

A transformer VLM suits semantic understanding, a diffusion transformer (MM-DiT) suits generating smooth continuous action sequences, and a small RL policy suits high-frequency, reactive balance control.

3. Fault Isolation

If System-1 generates unusual actions (out-of-distribution), System-0 still keeps the robot standing. If System-2 misidentifies an object, it only affects high-level decisions without causing the robot to fall. Each system acts as a safety boundary.

4. Easy Per-Component Upgrades

When a better VLM comes along (say Qwen4-VL), you can swap System-2 without retraining the entire model. When a more stable RL locomotion policy emerges, swap System-0. This is the classic advantage of modular architecture over end-to-end monolithic models.

Comparing this with other robot AI architectures — such as those discussed in our Embodied AI 2026 overview — you will see a clear trend: the field is shifting from monolithic to modular.

Architecture Summary

| Component | Model | Params | Input | Output | Training |
|---|---|---|---|---|---|
| System-2 | Qwen3-VL-2B | 2B | Image 320x240 + Text | Visual-language features | Frozen (pre-trained) |
| System-1 | MM-DiT | ~500M | Features + Noisy actions | Action chunks (28+8 DoF) | 3-stage training |
| System-0 | AMO RL | Small | 8 commands | 15 DoF leg joints | RL in simulation |

Key Technical Highlights:

  1. Separate modulation + joint attention in MM-DiT — per-modality processing with cross-modal information exchange
  2. Flow Matching with 10 denoising steps — fast, stable action generation
  3. Action chunking (H = 16-32) — smooth, temporally coherent motion
  4. Real-Time Chunking — hides the ~160ms inference latency at the 20Hz control rate
  5. Frozen VLM + modular RL locomotion — each tier trained with the data and method that suit it

In the next article, we will explore EgoDex and the data pipeline — how Ψ₀ built its 829-hour egocentric video dataset, how hand tracking is processed, and why the "data recipe" matters more than "data volume." You will download and process EgoDex data yourself.

