VnRobo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

Why Multi-Agent Pipelines Beat Single VLA Models | AI Manipulation Agents #1

ManiAgent hits 86.8% on SimplerEnv vs pi0 55.7% and CogACT 51.3% — with zero robot fine-tuning. A deep-dive into the 3-agent architecture and why decomposition beats end-to-end.

Nguyễn Anh Tuấn · June 15, 2026 · 11 min read

In 2025, Physical Intelligence launched π₀ (pi0) — one of the most capable Vision-Language-Action models ever built, trained on millions of real robot steps across diverse tasks. Its SimplerEnv benchmark score: 55.7% average success rate. Impressive? Yes. Production-ready? Not quite.

Around the same time, a research group published ManiAgent (arXiv:2510.11660) — a fundamentally different approach. Not one massive model, but three small specialized agents working together. The result: 86.8% success rate. That's a 31-percentage-point gap over pi0, achieved with zero fine-tuning on any robot demonstration data.

This article breaks down exactly why multi-agent decomposition outperforms monolithic VLA models in robot manipulation — and by the end, you'll be able to sketch the ManiAgent architecture from memory.


Series Roadmap: AI Agent Pipeline for Robot Manipulation

This is a 5-part series, each article building on the previous:

| # | Article | What You'll Learn |
|---|---------|-------------------|
| 1 | Why Multi-Agent Beats VLA? ← you are here | Benchmark comparison, ManiAgent architecture, information flow |
| 2 | Perception Agent & Grasp Planning | Building the vision layer: detection, depth, 3D coordinates |
| 3 | ALRM, CAP vs TAP | Three action-planning paradigms compared head-to-head |
| 4 | SAP Verifier: Self-Verification | Automated action checking and error recovery |
| 5 | Sim-to-Real Deploy | Deploying the agent pipeline on a real robot |

What VLA Models Are — And Where They Hit a Wall

A Vision-Language-Action model takes images and a natural language instruction as input and outputs robot commands directly — joint angles, end-effector waypoints, or gripper actions. The appeal is clear: one unified brain that sees, understands, and acts.

Major VLA models today include:

  • π₀ (pi0) — Physical Intelligence: flow-matching architecture, broad multi-task training
  • CogACT — cognitive VLA integrating reasoning into action generation
  • OpenVLA, RoboFlamingo, SpatialVLA — different approaches to the same problem

The fundamental problem: manipulation requires three completely distinct capabilities, and optimizing all three simultaneously inside one model is extraordinarily difficult:

  1. Perception — understanding 3D space, localizing objects to millimeter precision, handling occlusion
  2. Reasoning — multi-step planning, conditional logic, replanning when sub-tasks fail
  3. Action execution — converting a plan into a trajectory that respects robot kinematics and workspace limits

Think of it like hiring one person to simultaneously act as surgeon, anesthesiologist, and scrub nurse during an operation. Specialization wins every time.


The Numbers Don't Lie

SimplerEnv is a widely used simulation benchmark for robot manipulation. The 4 tasks compared here come from its WidowX (BridgeData V2) setup:

  1. Stack blocks: precise stacking requiring accurate spatial reasoning
  2. Place carrot on plate: pick-and-place onto a small, rigid target
  3. Put spoon on towel: object placement onto a flat, deformable target
  4. Move eggplant to basket: cluttered scene with occlusion challenges

Full results (average success rate across all trials):

| Model | Type | Stack | Carrot | Spoon | Eggplant | Avg |
|-------|------|-------|--------|-------|----------|-----|
| CogACT | Single VLA | 15.0% | 50.8% | 71.7% | 67.5% | 51.3% |
| π₀ (pi0) | Single VLA | 21.3% | 58.8% | 63.3% | 79.2% | 55.7% |
| ManiAgent-GPT-4o | 3-agent | 76.4% | 95.8% | 77.8% | 47.2% | 74.3% |
| ManiAgent-Claude Sonnet 4 | 3-agent | 77.8% | 98.6% | 80.6% | 62.5% | 79.9% |
| ManiAgent-GPT-5 | 3-agent | 87.5% | 95.8% | 91.7% | 72.2% | 86.8% |

SimplerEnv benchmark: ManiAgent (GPT-5) achieves 86.8%, far ahead of pi0 at 55.7% and CogACT at 51.3% — source: arXiv 2510.11660

Key observations from the data:

  • Stack blocks is the hardest task for the VLA baselines: pi0 manages only 21.3%, while ManiAgent-GPT-5 hits 87.5%, a 66-point gap. Stacking demands accurate 3D perception, sequenced reasoning, and collision-free trajectories, and end-to-end VLA models struggle where all three intersect.
  • Move eggplant is where multi-agent shows weakness: ManiAgent-GPT-5 scores 72.2% while pi0 scores 79.2%. Heavy occlusion hurts detection-based pipelines more than trained VLA models.
  • Even ManiAgent-GPT-4o (74.3%) beats both VLA baselines on average, which suggests the advantage comes from the architecture and not only from raw model power. It still trails both baselines on Move eggplant (47.2%).
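
The averages and gaps quoted above can be rechecked directly from the table. The following is a minimal sketch in Python with the per-task numbers copied from the table; the names `RESULTS`, `average`, and `gap` are illustrative only. The recomputed averages should agree with the Avg column up to rounding in the last digit.

```python
# Per-task success rates (%) copied from the SimplerEnv table above.
# Column order: stack, carrot, spoon, eggplant.
TASKS = ["stack", "carrot", "spoon", "eggplant"]
RESULTS = {
    "CogACT": [15.0, 50.8, 71.7, 67.5],
    "pi0": [21.3, 58.8, 63.3, 79.2],
    "ManiAgent-GPT-4o": [76.4, 95.8, 77.8, 47.2],
    "ManiAgent-Claude Sonnet 4": [77.8, 98.6, 80.6, 62.5],
    "ManiAgent-GPT-5": [87.5, 95.8, 91.7, 72.2],
}


def average(model: str) -> float:
    """Unweighted mean over the four tasks, in percent."""
    scores = RESULTS[model]
    return sum(scores) / len(scores)


def gap(model_a: str, model_b: str, task: str) -> float:
    """Percentage-point difference (model_a minus model_b) on one task."""
    i = TASKS.index(task)
    return RESULTS[model_a][i] - RESULTS[model_b][i]


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: average {average(name):.2f}%")
    best, baseline = "ManiAgent-GPT-5", "pi0"
    print(f"average gap: {average(best) - average(baseline):.2f} points")
    for task in TASKS:
        print(f"{task} gap: {gap(best, baseline, task):+.1f} points")
```

A negative gap on the eggplant task is expected here: it is the one column where pi0 is ahead of ManiAgent-GPT-5.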

The ManiAgent Architecture: Three Specialists, One Pipeline

ManiAgent (Yi Yang et al., 2025) decomposes manipulation into three agents, each using the best tool for its specific job:

[Scene Images + Depth Maps + Camera Calibration + Task Description]
                            │
                            ▼
               ┌────────────────────────┐
               │    PERCEPTION AGENT    │  ← VLM + Florence-v2 detector
               │  • Object detection    │
               │  • 2D → 3D projection  │
               │  • Grasp pose gen      │
               └───────────┬────────────┘
                           │  textual scene description
                           │  3D positions + grasp poses
                           ▼
               ┌────────────────────────┐
               │    REASONING AGENT     │  ← LLM (GPT-5 / Claude)
               │  • Sub-task decomp     │
               │  • State evaluation    │
               │  • History tracking    │
               └───────────┬────────────┘
                           │  next sub-task + target object list
                           ▼
               ┌────────────────────────┐
               │  ACTION-EXEC AGENT     │  ← LLM + action cache
               │  • Keypoint generation │
               │  • Trajectory planning │
               │  • Action caching      │
               └───────────┬────────────┘
                           │
                           ▼
                    [Robot Commands]

Perception Agent — The Measuring Eye

Inputs: RGB images, depth maps, camera calibration parameters, task description

The Perception Agent performs four steps:

  1. Scene understanding: Uses a VLM to write a natural-language description of the scene — "there is a white plate on the left, an orange carrot near center, gripper positioned upper-right"
  2. Object detection: Uses Florence-v2 (Microsoft) to detect individual objects with bounding boxes. Trick: prepend "every" to queries ("every carrot", "every cube") to reduce missed detections when multiple instances exist
  3. 2D → 3D projection: Combines depth map + camera calibration matrix to convert pixel coordinates (u, v) into real-world 3D coordinates (x, y, z) in meters
  4. Grasp pose generation: Computes the approach angle and position for the gripper to safely grasp each detected object

Output sent to the Reasoning Agent — plain text:

Red cube: position=[0.23, -0.15, 0.82], grasp_pose=[0.0, 0.0, 0.0, 1.0]
Blue plate: position=[0.18, 0.22, 0.78], flat_surface=True
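
Step 3 above is the standard pinhole back-projection. Here is a minimal sketch, assuming a pinhole camera with intrinsics fx, fy, cx, cy in pixels and a depth reading in meters along the optical axis. The intrinsic values, the bounding box, and the helper names (`Intrinsics`, `backproject`, `bbox_center`, `format_object`) are illustrative assumptions, not taken from the paper. The result is in the camera frame; the extrinsic transform into the robot base frame is omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (example values are hypothetical)."""
    fx: float
    fy: float
    cx: float
    cy: float


def bbox_center(x_min: float, y_min: float, x_max: float, y_max: float) -> Tuple[float, float]:
    """Pixel at the center of a detector bounding box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Convert pixel (u, v) plus depth in meters into camera-frame (x, y, z)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive; 0 usually marks a missing reading")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def format_object(label: str, xyz: Tuple[float, float, float]) -> str:
    """Render one object as a text line in the style of the output above."""
    coords = ", ".join(f"{c:.2f}" for c in xyz)
    return f"{label}: position=[{coords}]"


if __name__ == "__main__":
    k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)  # assumed calibration
    u, v = bbox_center(400, 200, 440, 240)                  # assumed detection box
    print(format_object("Red cube", backproject(u, v, 0.82, k)))
```

Sampling depth at a single center pixel is the simplest choice; a real pipeline would more likely take a median over the box or a segmentation mask to reduce the effect of depth noise.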

Reasoning Agent — The Strategic Brain

Inputs: Scene description, task instruction, history of completed sub-tasks

The Reasoning Agent works entirely in text and performs no geometry of its own: any coordinates in its input are passed through from the Perception Agent's scene description. Its responsibilities:

  1. Evaluate current state: what has been accomplished, what remains, did the last sub-task succeed?
  2. Decompose the overall task into the next concrete sub-task
  3. Send detection requests back to Perception Agent: "next cycle, focus on detecting [target objects]"
  4. Use history to avoid repeating failed sub-tasks — preventing infinite loops when the robot gets stuck

Example reasoning chain:

Task: "Place carrot on plate"
State: "Carrot at (0.3, -0.1, 0.85). Plate at (0.1, 0.2, 0.80)."
History: []

Plan:
  Step 1: Pick up carrot → approach carrot at (0.3, -0.1, 0.85)
  Step 2: Move above plate → target plate center (0.1, 0.2, 0.80)
  Step 3: Release carrot → verify success

Current: Execute Step 1
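
The history mechanism can be illustrated with a small deterministic stand-in. In ManiAgent the LLM makes this decision from the prompt; the sketch below only shows the bookkeeping, and the `History` class, `next_subtask` function, and the retry limit of 2 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class History:
    """Record of attempted sub-tasks and whether each attempt succeeded."""
    attempts: List[Tuple[str, bool]] = field(default_factory=list)

    def record(self, subtask: str, succeeded: bool) -> None:
        self.attempts.append((subtask, succeeded))

    def done(self, subtask: str) -> bool:
        return any(name == subtask and ok for name, ok in self.attempts)

    def failures(self, subtask: str) -> int:
        return sum(1 for name, ok in self.attempts if name == subtask and not ok)


def next_subtask(plan: List[str], history: History, max_retries: int = 2) -> Optional[str]:
    """Return the first unfinished step, or None when the plan is complete.

    Raises RuntimeError once a step has failed more than max_retries times,
    so the caller can replan instead of repeating the same step forever.
    """
    for step in plan:
        if history.done(step):
            continue
        if history.failures(step) > max_retries:
            raise RuntimeError(f"giving up on {step!r}: replanning needed")
        return step
    return None


if __name__ == "__main__":
    plan = ["pick up carrot", "move above plate", "release carrot"]
    history = History()
    print(next_subtask(plan, history))
    history.record("pick up carrot", True)
    print(next_subtask(plan, history))
```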

Action-Execution Agent — The Hands

Inputs: Sub-task description, 3D coordinates, grasp poses

This agent converts a concrete sub-task into actual robot trajectories:

  1. Maps target 3D coordinates to Cartesian keypoints — a sequence of end-effector waypoints
  2. Checks the action cache, which stores one trajectory template per skill type. If a similar action is already cached, the agent reuses it instead of regenerating it
  3. On cache miss → uses the LLM to generate an action sequence from scratch

The action cache is an underappreciated detail. Reusing templates dramatically reduces latency and ensures consistent execution — the same "pick" skill always looks the same, reducing gripper variance.
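
One way to picture the cache is as a dictionary from skill name to a template of waypoint offsets that gets re-anchored at each new target. This is a minimal sketch under that assumption; the paper's actual cache format is not described here, and `ActionCache`, `fake_llm_generate`, and the offset values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
# A template is a list of offsets in meters, relative to the target position.
Template = List[Point]


class ActionCache:
    """One trajectory template per skill type, reused across targets."""

    def __init__(self, generate: Callable[[str], Template]) -> None:
        self._generate = generate  # slow path, e.g. an LLM call
        self._templates: Dict[str, Template] = {}
        self.misses = 0

    def waypoints(self, skill: str, target: Point) -> List[Point]:
        """Return end-effector waypoints for `skill` anchored at `target`."""
        if skill not in self._templates:
            self.misses += 1
            self._templates[skill] = self._generate(skill)
        tx, ty, tz = target
        return [(tx + dx, ty + dy, tz + dz) for dx, dy, dz in self._templates[skill]]


def fake_llm_generate(skill: str) -> Template:
    """Stand-in for the LLM: hard-coded offsets, purely illustrative."""
    if skill == "pick":
        return [(0.0, 0.0, 0.10), (0.0, 0.0, 0.0), (0.0, 0.0, 0.10)]
    if skill == "place":
        return [(0.0, 0.0, 0.10), (0.0, 0.0, 0.02), (0.0, 0.0, 0.10)]
    raise ValueError(f"unknown skill: {skill}")


if __name__ == "__main__":
    cache = ActionCache(fake_llm_generate)
    print(cache.waypoints("pick", (0.3, -0.1, 0.85)))
    print(cache.waypoints("pick", (0.1, 0.2, 0.80)))
    print("generator calls:", cache.misses)
```

The second "pick" call reuses the stored template, so the generator runs once even though the targets differ.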


Why Text Is the Right "Common Language"

ManiAgent's most distinctive design choice: all inter-agent communication happens through plain text — not shared embedding spaces, not latent vectors, not function calls with typed schemas.

Perception → Reasoning:
  "Blue cup at (0.25, 0.10, 0.92), upright, grasp_pose=[...]"
  "Red plate at (0.15, -0.05, 0.81), flat surface"

Reasoning → Perception:
  "Detect: blue cup, red plate"  ← target list for next detection pass

Perception → Action-Exec:
  "object_0 = blue cup: position=[0.25, 0.10, 0.92], grasp_pose=[...]"

Action-Exec → Robot:
  waypoints = [(0.25, 0.10, 0.95), (0.25, 0.10, 0.92), ...]

Why text? Three reasons:

  • Universal interface: Every LLM/VLM understands text — swap any component without rewriting connectors
  • Full debuggability: Read every message between agents; see exactly what each agent "thinks"
  • Flexible metadata: Add confidence scores, failure reasons, or uncertainty estimates just by appending to the string

Spatial consistency is maintained through fixed object indices: once Perception Agent assigns object_0 to a specific object, that index is stable throughout the pipeline. The Action Agent requesting "pick object_0" always refers to the same physical thing.
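
A text interface means the receiving side has to parse messages back into structured values. The sketch below shows one way to do that while preserving the fixed object indices. The line format follows the example messages above, with the grasp pose left out because those examples elide it; the regular expression and the names `SceneObject` and `parse_scene` are assumptions, not the paper's code.

```python
import re
from typing import Dict, List, NamedTuple


class SceneObject(NamedTuple):
    index: int
    label: str
    position: List[float]


# Matches lines such as: object_0 = blue cup: position=[0.25, 0.10, 0.92]
LINE_RE = re.compile(
    r"object_(?P<index>\d+)\s*=\s*(?P<label>[^:]+):\s*position=\[(?P<pos>[^\]]*)\]"
)


def parse_scene(message: str) -> Dict[int, SceneObject]:
    """Parse a multi-line text message into objects keyed by their fixed index."""
    objects: Dict[int, SceneObject] = {}
    for line in message.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # tolerate free-form commentary between object lines
        index = int(match.group("index"))
        position = [float(x) for x in match.group("pos").split(",")]
        if len(position) != 3:
            raise ValueError(f"expected 3 coordinates in: {line!r}")
        objects[index] = SceneObject(index, match.group("label").strip(), position)
    return objects


if __name__ == "__main__":
    message = (
        "object_0 = blue cup: position=[0.25, 0.10, 0.92]\n"
        "note: gripper is open\n"
        "object_1 = red plate: position=[0.15, -0.05, 0.81]"
    )
    for index, obj in sorted(parse_scene(message).items()):
        print(index, obj.label, obj.position)
```

Skipping lines that do not match keeps the parser tolerant of extra metadata appended to a message, which is the flexibility the text interface is meant to provide.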


Four Reasons Decomposition Wins

1. Each component optimized for its own task

Training a VLA end-to-end forces one network to master perception, planning, and motor control at the same time. Gradients from the action loss flow back through all three capabilities, which can cause interference: improving one skill may degrade another.

With decomposition, the Perception Agent uses Florence-v2 — a dedicated detector trained specifically for object detection, far more accurate than the vision encoder inside a generic VLA.

2. Metric 3D spatial grounding

VLA models represent spatial information through vision tokens compressed into a latent space. Precise metric distances get lost in this compression. ManiAgent's Perception Agent computes real metric 3D coordinates via depth + calibration, then passes exact numbers: [0.25, 0.10, 0.92] rather than "object is slightly left of center." No information lost through compression.

3. Zero-shot generalization — no robot data needed

ManiAgent requires no fine-tuning on robot demonstrations. Pi0 and CogACT need tens of thousands of robot trajectories. ManiAgent only needs a better LLM — which is why ManiAgent-GPT-5 (86.8%) > ManiAgent-GPT-4o (74.3%): upgrading one component immediately upgrades the whole pipeline.

4. Explicit failure recovery

When the Reasoning Agent receives a post-execution scene description and sees the state hasn't changed (sub-task failed), it can immediately replan. Single VLA models have no explicit state tracking — they just "generate the next action" based on current observation, with no mechanism to recognize "I just failed and need a different approach."
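
The "state hasn't changed" test can be approximated numerically by comparing object positions before and after execution. In ManiAgent this judgment is made by the LLM reading the new scene description; the sketch below is a simplified stand-in, and the 1 cm tolerance and the function names are assumptions.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float, float]


def moved(before: Position, after: Position, tol_m: float = 0.01) -> bool:
    """True if the object moved more than tol_m meters between observations."""
    return math.dist(before, after) > tol_m


def subtask_succeeded(
    target: str,
    scene_before: Dict[str, Position],
    scene_after: Dict[str, Position],
    tol_m: float = 0.01,
) -> bool:
    """Crude success check: the target must still be visible and must have moved."""
    if target not in scene_before or target not in scene_after:
        return False  # treat a lost detection as failure and re-observe
    return moved(scene_before[target], scene_after[target], tol_m)


if __name__ == "__main__":
    before = {"carrot": (0.3, -0.1, 0.85), "plate": (0.1, 0.2, 0.80)}
    unchanged = dict(before)
    placed = {"carrot": (0.1, 0.2, 0.82), "plate": (0.1, 0.2, 0.80)}
    print("unchanged scene:", subtask_succeeded("carrot", before, unchanged))
    print("carrot on plate:", subtask_succeeded("carrot", before, placed))
```

A check like this only detects that something moved, not that it ended up in the right place, so a fuller version would also compare the final position against the intended target.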


Real-World Limitations

ManiAgent isn't a perfect solution. Documented failure cases from the paper:

| Issue | Example | Root Cause |
|-------|---------|------------|
| Occlusion | Eggplant partially hidden by sink rim → detector grabs wrong coordinates | Florence-v2 struggles with partial occlusion |
| Height ambiguity | Stacked blocks → grasp point computed at wrong height | Depth estimation error on vertical surfaces |
| Object confusion | Green pepper vs red pepper → Reasoning Agent targets wrong one | Visual similarity confuses LLM descriptions |
| Depth-RGB misalignment | SimplerEnv intermittent sync bug → underestimates performance | Simulator sensor synchronization issue |
| IK failures | Arm places object outside reachable workspace | No workspace constraints in trajectory planning |

In the paper's data collection experiment (551 trajectories), there were 15 manual interventions — actual success rate 81.5%, not 100%.

Notably, Move eggplant remains the hardest task for ManiAgent (72.2%) — lower than pi0 (79.2%) precisely because occlusion degrades the detection-based pipeline more than it affects a VLA's holistic visual understanding.


ManiAgent as a Data Generator

A less-discussed but highly practical application: ManiAgent can automatically generate training data for other VLA models.

The authors used ManiAgent to collect 551 trajectories for "Place carrot on plate" — 450 valid (81.5% success). Total time: 19.5 hours (~2 minutes per trajectory). That dataset then trained a CogACT VLA to match performance of models trained on human-teleoperated data.
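
The throughput implied by these figures is a quick calculation. The sketch below uses only the trajectory counts and total time quoted above; the variable names are illustrative.

```python
# Figures quoted above for the "Place carrot on plate" collection run.
ATTEMPTED = 551  # trajectories attempted
VALID = 450      # trajectories kept as valid
HOURS = 19.5     # total collection time in hours

minutes_per_attempt = HOURS * 60 / ATTEMPTED
minutes_per_valid = HOURS * 60 / VALID
valid_per_hour = VALID / HOURS

print(f"{minutes_per_attempt:.2f} min per attempted trajectory")
print(f"{minutes_per_valid:.2f} min per valid trajectory")
print(f"{valid_per_hour:.1f} valid trajectories per hour")
```

Counting only the valid trajectories, the cost per usable demonstration is somewhat higher than the roughly 2 minutes per attempt.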

Implication: instead of hundreds of hours of human teleop, a small team can deploy ManiAgent to autonomously generate specialized VLA training data at scale.


What This Means for Robotics Practitioners

The core lesson from ManiAgent: you don't need a superhuman model to achieve strong manipulation performance. You need:

  1. Right tool for each job: specialized detector for perception, powerful LLM for reasoning, cached primitives for execution
  2. Clean interfaces: text-based inter-agent communication makes each component replaceable and debuggable
  3. Explicit state tracking: know what's done, what remains → natural failure recovery
  4. Zero-shot generalization: decouple from robot-specific training data dependency

This means a small team can build a strong manipulation system by combining off-the-shelf components — no million-step dataset, no GPU cluster for VLA training.

In the next article, we'll go deep on the Perception Agent: how Florence-v2 achieves zero-shot object detection, how to compute 3D coordinates from depth maps and camera calibration, and how to generate reliable grasp poses for real grippers.


Related Posts

  • Part 2: Perception Agent & Grasp Planning — Building the vision layer for manipulation
  • Part 3: ALRM, CAP vs TAP — Three Action Planning Paradigms Compared
  • Part 5: Sim-to-Real Deploy — Taking the agent pipeline to a real robot
  • VLA Models 2025: Overview of Vision-Language-Action Models
  • AI for Robotics 2025: Landscape and Trends

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
