
Ψ₀ Hands-On (2): The Three-Tier Architecture

Deep-dive into the Ψ₀ architecture: System-2 (VLM), System-1 (MM-DiT), System-0 (RL Controller) — the brain, hands, and legs of the robot.

Nguyễn Anh Tuấn · April 1, 2026 · 15 min read

Ψ₀ Hands-On (2): The Three-Tier Architecture — Brain, Hands, and Legs of the Robot

In the previous article, we understood the big picture of Ψ₀ and why it matters. Now it is time to pop the hood and see what is inside. This article takes a deep dive into the three-tier architecture — how each component works, how data flows through the system, and why each design decision was made.

Fair warning: this article is technically dense. But I will do my best to keep everything as intuitive as possible.

The Big Picture: Data Flow

Before diving into each system, let us look at the entire pipeline from end to end. When the robot receives a command ("pick up the glass of water"), here is what happens:

Camera (320x240 RGB)
    │
    ▼
System-2: Qwen3-VL-2B (VLM)
    │ → Visual features + Text features
    ▼
System-1: MM-DiT (~500M params)
    │ → Action chunks (28 DoF upper body + 8 locomotion commands)
    ├── 28 DoF → Directly to arm + hand motors
    ▼
System-0: AMO RL Controller
    │ ← 8 DoF commands (velocity, heading, height...)
    ▼
15 DoF legs → Leg motors

The entire pipeline runs with a latency of approximately 160ms — fast enough for real-time control. Now let us zoom into each system.
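The flow above can be sketched as glue code for one control tick. All component interfaces here (`camera`, `system2`, `system1`, `system0`) are illustrative stand-ins of my own, not the actual Ψ₀ API:

```python
# Hypothetical glue code for one control tick (names are illustrative).
def control_step(camera, text_cmd, system2, system1, system0):
    image = camera.capture()              # 320x240 RGB frame
    features = system2(image, text_cmd)   # frozen VLM -> visual-language features
    chunk = system1(features)             # MM-DiT -> [H, 36] action chunk
    upper_body = chunk[0][:28]            # 28 DoF -> directly to arm/hand motors
    loco_cmd = chunk[0][28:]              # 8 commands -> System-0
    leg_targets = system0(loco_cmd)       # AMO RL -> 15 DoF leg joint targets
    return upper_body, leg_targets
```

The key structural point is visible in the slicing: only the last 8 dimensions of the action chunk pass through System-0; the other 28 bypass it entirely.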

System-2: The Brain That Sees and Understands — Qwen3-VL-2B

Why Qwen3-VL-2B?

System-2 is a Vision-Language Model (VLM) whose job is to take images from the camera and text commands, then extract meaningful features for System-1 to use. The research team chose Qwen3-VL-2B — the 2-billion parameter version of Qwen3-VL from Alibaba — for three reasons:

  1. Right-sized: 2B parameters is powerful enough to understand complex images, yet small enough to run in real-time on an edge GPU. Larger models like LLaVA-13B or Qwen-VL-7B would be too slow to meet the ~160ms latency budget for robot control.

  2. Strong vision performance: Qwen3-VL is one of the best small VLMs for visual understanding, particularly in recognizing objects and understanding 3D spatial relationships from 2D images.

  3. Pre-trained out of the box: The model has already been trained on billions of image-text pairs, so it comes with built-in ability to recognize everyday objects (cups, bottles, boxes, door handles...) that robots need to interact with.

AI architecture and neural networks — the building blocks of modern deep learning

How It Works

System-2 takes two inputs: the camera image (320x240 RGB) and the natural-language command. It outputs visual-language features — a tensor containing semantic information about both the image and the text — which is passed into System-1.

# Simplified pseudo-code
image = camera.capture()  # 320x240x3 RGB
text = "pick up the red cup"

# VLM processing
visual_tokens = qwen3_vision_encoder(image)   # [N, D]
text_tokens = qwen3_text_encoder(text)        # [M, D]
features = qwen3_cross_attention(visual_tokens, text_tokens)  # [N+M, D]

One important detail: System-2 is completely frozen during Ψ₀ training. This means the research team does not fine-tune the VLM — they use the pre-trained model as-is. This has two implications: training is cheaper and the VLM's general visual knowledge is preserved, but the visual features are never adapted specifically to the robot's tasks.

If you want a deeper understanding of how VLA models generally use VLMs, read our article on VLA models in the AI for Robots series.

System-1: The Acting Hands — MM-DiT Action Expert

This is the heart of Ψ₀. System-1 is an action expert with approximately 500 million parameters, responsible for generating sequences of actions (action chunks) from System-2's features. It is built on three key technologies: MM-DiT, Flow Matching, and Action Chunking.

MM-DiT: Why Not Use a Standard DiT?

DiT (Diffusion Transformer) is a transformer architecture designed for diffusion models. But Ψ₀ uses a variant called MM-DiT (Multi-Modal DiT), inspired by the architecture in Stable Diffusion 3. The difference lies in two critical aspects:

1. Separate Modulation

In a standard DiT (Naive DiT), all modalities (image, text, action) pass through the same set of modulation parameters. This means the model uses a shared way to "adjust" processing for every type of input.

MM-DiT changes this: each modality has its own modulation parameters. Specifically, the visual tokens receive one set of scale/shift parameters and the action tokens receive a separate set, each conditioned on the diffusion timestep.

Why does this matter? Because visual features and action tokens are fundamentally different in nature. Visual features represent static spatial information (object positions, shapes). Action tokens represent dynamic temporal sequences (robot joint positions over time). Forcing them through the same "filter" loses information.

2. Joint Attention

Despite separate modulation, MM-DiT still allows joint attention across modalities. This means action tokens can "look at" visual features during attention computation, and vice versa.

# Simplified MM-DiT block (pseudo-code)
class MMDiTBlock:
    def forward(self, visual_tokens, action_tokens, timestep):
        # Separate modulation: each modality gets its own scale/shift,
        # conditioned on the diffusion timestep
        v_scale, v_shift = self.visual_modulation(timestep)
        a_scale, a_shift = self.action_modulation(timestep)

        # Per-modality normalization before modulation
        visual_tokens = v_scale * self.visual_norm(visual_tokens) + v_shift
        action_tokens = a_scale * self.action_norm(action_tokens) + a_shift

        # Joint attention: both modalities attend to each other
        all_tokens = concat(visual_tokens, action_tokens)
        all_tokens = self.joint_attention(all_tokens)

        # Split back into per-modality streams
        visual_tokens, action_tokens = split(all_tokens)
        return visual_tokens, action_tokens

The combination of separate modulation + joint attention creates the best balance: each modality is processed in a way that suits its nature, while still being able to exchange information with the other modality.

Flow Matching: From Noise to Action

This is the most important mathematical component in Ψ₀. But do not worry — I will explain it intuitively first, then introduce the formulas.

The Intuitive Idea

Imagine you are standing in the middle of a foggy field (noise). You want to reach a specific house (the correct action). Flow Matching gives you a compass — at every point in the field, the compass tells you which direction to go. If you follow the compass long enough, you will reach the house.

Mathematically, this "compass" is a velocity field v(x, tau) — at every point x in action space and every time step tau, it tells you which direction to move.

The Core Formula

Flow Matching defines a straight-line path from noise to action:

x(tau) = (1 - tau) * epsilon + tau * a

Where epsilon is a sample of pure Gaussian noise, a is the target action (or action chunk), and tau in [0, 1] is the interpolation time: tau = 0 is pure noise, tau = 1 is the clean action.

The corresponding velocity field is:

v*(x, tau) = a - epsilon

Surprisingly simple — the velocity field is just the direction from noise to action. The model is trained to predict this velocity field:

Loss = || v_theta(x(tau), tau, c) - (a - epsilon) ||^2

Where c is the conditioning (visual features from System-2), and v_theta is the MM-DiT model with parameters theta.
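The training objective above fits in a few lines. This is a minimal sketch: `velocity_model` is a stand-in for the real MM-DiT (which would also take the conditioning c), and the function name is my own.

```python
import numpy as np

def flow_matching_training_step(a, velocity_model, rng):
    """One flow-matching training step for a target action chunk `a`."""
    eps = rng.standard_normal(a.shape)       # pure Gaussian noise
    tau = rng.uniform()                      # random time in [0, 1]
    x_tau = (1 - tau) * eps + tau * a        # point on the straight path
    v_star = a - eps                         # target velocity: noise -> action
    v_pred = velocity_model(x_tau, tau)      # model's predicted velocity
    return np.mean((v_pred - v_star) ** 2)   # squared-error loss
```

In a real training loop this loss would be backpropagated through the MM-DiT parameters; the sketch just makes the path, target velocity, and loss concrete.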

Robots and automation — AI applied in the physical world

Flow Matching vs. DDPM Diffusion

If you have read about Diffusion Policy, you will notice that Flow Matching differs from DDPM in several key ways:

|  | Flow Matching | DDPM Diffusion |
|---|---|---|
| Path | Straight (linear interpolation) | Curved (Markov chain) |
| Inference steps | Few (1-10) | Many (50-1000) |
| Speed | Much faster | Slow |
| Training stability | More stable | Can be unstable |
| Quality | Comparable or better | Good |

Ψ₀ uses 10 denoising steps during inference. At each step, the model takes the current state x(tau), predicts the velocity v, and updates: x(tau + delta_tau) = x(tau) + delta_tau * v. After 10 steps, pure noise becomes a complete action.
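The 10-step denoising loop is plain Euler integration. A minimal sketch, with `velocity_fn` standing in for the trained MM-DiT:

```python
import numpy as np

def flow_matching_sample(velocity_fn, shape, steps=10, seed=0):
    """Integrate the learned velocity field from noise (tau=0) to action (tau=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)   # start from pure Gaussian noise
    dtau = 1.0 / steps
    for k in range(steps):
        tau = k * dtau
        v = velocity_fn(x, tau)      # model's predicted velocity at (x, tau)
        x = x + dtau * v             # Euler update: x(tau + dtau) = x(tau) + dtau * v
    return x                         # x(1): the denoised action chunk
```

A useful sanity check: for a single fixed target a, the ideal velocity field is v(x, tau) = (a - x) / (1 - tau), and this loop lands exactly on a from any starting noise.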

Action Chunking: Predicting the Future

Instead of predicting one action at a time, System-1 predicts a chunk of H consecutive actions simultaneously. In Ψ₀, H is typically 16-32 timesteps (equivalent to 0.8-1.6 seconds at 20Hz).

Why is chunking important?

  1. Temporal coherence: When predicting individual actions, consecutive actions can "jitter" — one timestep tells the arm to go right, the next tells it to go left. Predicting an entire chunk ensures a smooth action sequence.

  2. Handling latency: The robot needs approximately 160ms for one inference pass. If it only predicted 1 action, the robot would "freeze" for 160ms between each action. With chunking, the robot can execute previously predicted actions while the model computes the next batch.

  3. Long-horizon understanding: Predicting 16 steps into the future requires the model to grasp a longer-term "plan," not just immediate reflexes.
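The timing argument above is easy to sanity-check with the numbers from this article (20Hz control, ~160ms inference, H = 16):

```python
CONTROL_HZ = 20
TIMESTEP_S = 1.0 / CONTROL_HZ        # 0.05 s per control cycle
INFERENCE_LATENCY_S = 0.160          # one model forward pass

# Actions consumed from the current chunk while the next one is computed
steps_per_inference = INFERENCE_LATENCY_S / TIMESTEP_S   # ~3.2 timesteps

# How far ahead one chunk of 16 actions reaches
chunk_len = 16
chunk_duration_s = chunk_len * TIMESTEP_S                # 0.8 s
```

So each inference "costs" about 3-4 of the chunk's 16 actions, leaving plenty of margin before the robot runs out of things to do.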

FAST Tokenizer: Compressing Actions into Tokens

An important technical detail: before feeding actions into the MM-DiT, Ψ₀ uses the FAST tokenizer to encode continuous actions into discrete tokens. FAST (Frequency-space Action Sequence Tokenization) compresses action sequences into a compact representation by applying a discrete cosine transform (DCT) along the time axis, quantizing the resulting coefficients, and merging frequent patterns with byte-pair encoding (BPE).

This allows the MM-DiT to process actions like "language" — each action token is equivalent to a "word" describing a movement. After inference, the output tokens are decoded back into continuous joint angle values.
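Here is a minimal numpy sketch of the frequency-domain compression idea. It illustrates only the DCT-truncation step, not the full FAST pipeline, and all function names are my own:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are frequency components)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def compress_actions(actions, keep):
    """DCT along time; keep only the `keep` lowest-frequency coefficients."""
    M = dct_matrix(actions.shape[0])
    return (M @ actions)[:keep]          # [keep, D] instead of [T, D]

def decompress_actions(coeffs, horizon):
    """Zero-pad high frequencies and apply the inverse (transpose) DCT."""
    M = dct_matrix(horizon)
    full = np.zeros((horizon, coeffs.shape[1]))
    full[: coeffs.shape[0]] = coeffs
    return M.T @ full                    # orthonormal -> inverse is transpose
```

Smooth action trajectories concentrate their energy in the low frequencies, which is why a handful of coefficients reconstructs them almost perfectly.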

System-0: The Autonomous Legs — AMO RL Controller

System-0 is the "balance keeper" of Ψ₀. It is a policy trained with Reinforcement Learning, called AMO (Adaptive Motion Optimization), responsible for controlling the robot's legs.

Input/Output

System-0 receives the 8 locomotion commands from System-1 (velocity, heading, torso height...) together with the robot's proprioceptive state, and outputs target angles for the 15 leg DoF at high frequency.
Why a Separate RL Policy Instead of Learning Locomotion End-to-End?

Bipedal locomotion is a problem that RL has already solved very well. Training an RL policy for walking and balancing in simulation (Isaac Gym, MuJoCo) and then performing sim-to-real transfer has become an industry standard — Unitree is one of the companies doing this best.

Since RL policies for locomotion are already very mature, there is no reason to retrain them end-to-end alongside manipulation. Keeping System-0 separate has several advantages:

  1. Robustness: The RL policy has been tested over millions of steps in simulation and is very stable.
  2. Safety: If System-1 generates unusual actions for the upper body, System-0 still prevents the robot from falling.
  3. Modularity: You can swap System-0 with a different policy without affecting System-1.

If you are not yet clear on how RL is used in humanoid locomotion, check out our article on RL for humanoids in the Humanoid Robot Engineering series.

Action Space: 43 DoF and How It Changes Across Stages

A subtle but important point: Ψ₀'s action space changes between training stages. This is not an oversight — it is a deliberate design choice.

Pre-training: 48-DoF Task Space

During the pre-training stage on egocentric video (EgoDex), the model predicts actions in task space — that is, the position and orientation of the hands in 3D space — for a total of 48 dimensions.
Why use task space for pre-training? Because human video data only tells us where the hand is (via hand tracking), not the specific robot joint angles. Task space is a shared representation between humans and robots.

Post-training & Fine-tuning: 36-DoF Joint Space

When switching to real robot data (Stages 2 and 3), the model transitions to joint space — directly predicting the target angle of each joint (28 upper-body DoF plus 8 locomotion commands, 36 dimensions in total).

The transition from task space to joint space is a crucial step — it allows the model to "translate" general knowledge from human video into specific control commands for the Unitree G1 robot.

The Full 43-DoF

The total 43 DoF across the robot = 28 upper-body DoF (directly from System-1) + 15 leg DoF (from System-0, receiving 8 commands from System-1):

System-1 Output:
├── 7-DoF left arm      ─────────── Direct → Motor
├── 7-DoF right arm     ─────────── Direct → Motor  
├── 7-DoF left hand     ─────────── Direct → Motor
├── 7-DoF right hand    ─────────── Direct → Motor
└── 8-DoF locomotion cmd   ──┐
                              │
System-0 (AMO RL):            │
├── Input: 8-DoF commands  ◄──┘
└── Output: 15-DoF legs    ─────────── Direct → Motor
                              
Total: 28 + 15 = 43 DoF

Real-Time Chunking: Solving the Latency Problem

A practical issue: model inference takes approximately 160ms, but the robot control loop needs to run at 20Hz (50ms per cycle). How do you handle this?

Ψ₀ uses a technique called Real-Time Chunking:

  1. At time t, the model predicts a chunk of H actions: [a_t, a_{t+1}, ..., a_{t+H-1}]
  2. The robot starts executing a_t immediately
  3. While the robot executes a_t, a_{t+1}, a_{t+2}... the model is computing the next chunk
  4. When the model finishes (after approximately 160ms = approximately 3 timesteps at 20Hz), the robot has a new chunk
  5. The new chunk is blended with the remaining portion of the old chunk to avoid jitter
Time:       t=0   t=1   t=2   t=3   t=4   t=5   t=6
Chunk 1:   [a0,   a1,   a2,   a3,   a4,   a5,   ...]
                              ^ model finishes chunk 2
Chunk 2:                     [a3',  a4',  a5',  a6',  ...]
Executed:  [a0,   a1,   a2,  blend(a3,a3'), ...]

The simplest blending technique is a weighted average: executed_action = alpha * new_chunk_action + (1 - alpha) * old_chunk_action, with alpha increasing gradually from 0 to 1 so control hands over smoothly to the new chunk.
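That blending rule can be sketched in a few lines. Names are illustrative, and the linear alpha ramp is my assumption — the article only specifies that alpha increases gradually:

```python
import numpy as np

def blend_overlap(old_actions, new_actions):
    """Crossfade the overlapping part of two chunks: alpha ramps 0 -> 1."""
    n = min(len(old_actions), len(new_actions))
    alphas = np.linspace(0.0, 1.0, n)[:, None]   # increasing weight on the new chunk
    return alphas * new_actions[:n] + (1 - alphas) * old_actions[:n]
```

At the start of the overlap the robot still follows the old chunk (alpha = 0); by the end it follows the new chunk entirely (alpha = 1), so there is no discontinuity in the commanded joint targets.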

Why This Architecture Works: The "Divide and Conquer" Philosophy

After walking through each component, let us step back and ask: why does this three-tier architecture outperform a monolithic model?

1. Each System Uses the Right Type of Data

System-2 learns from web-scale image-text pairs, System-1 learns from egocentric human video (EgoDex) and robot demonstrations, and System-0 learns from millions of steps in physics simulation. No single data source could train all three capabilities at once.

2. Each System Has the Right Architecture

A transformer VLM suits semantic understanding, a diffusion transformer (MM-DiT) suits generating smooth continuous action sequences, and a small RL policy suits high-frequency, reactive balance control.

3. Fault Isolation

If System-1 generates unusual actions (out-of-distribution), System-0 still keeps the robot standing. If System-2 misidentifies an object, it only affects high-level decisions without causing the robot to fall. Each system acts as a safety boundary.

4. Easy Per-Component Upgrades

When a better VLM comes along (say Qwen4-VL), you can swap System-2 without retraining the entire model. When a more stable RL locomotion policy emerges, swap System-0. This is the classic advantage of modular architecture over end-to-end monolithic models.

Comparing this with other robot AI architectures — such as those discussed in our Embodied AI 2026 overview — you will see a clear trend: the field is shifting from monolithic to modular.

Architecture Summary

| Component | Model | Params | Input | Output | Training |
|---|---|---|---|---|---|
| System-2 | Qwen3-VL-2B | 2B | Image 320x240 + Text | Visual-language features | Frozen (pre-trained) |
| System-1 | MM-DiT | ~500M | Features + Noisy actions | Action chunks (28+8 DoF) | 3-stage training |
| System-0 | AMO RL | Small | 8 commands | 15 DoF leg joints | RL in simulation |

Key Technical Highlights:

  1. Separate modulation + joint attention in MM-DiT — per-modality processing with cross-modal information exchange
  2. Flow Matching with 10 denoising steps — fast, stable action generation
  3. Action chunking (H = 16-32) — smooth, temporally coherent motion
  4. Real-Time Chunking — hides the ~160ms inference latency at the 20Hz control rate
  5. Frozen VLM + modular RL locomotion — each tier trained with the data and method that suit it

In the next article, we will explore EgoDex and the data pipeline — how Ψ₀ built its 829-hour egocentric video dataset, how hand tracking is processed, and why the "data recipe" matters more than "data volume." You will download and process EgoDex data yourself.

