VnRobo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with ♥ in Vietnam

ALRM: Code-as-Policy vs Tool-as-Policy in ReAct | AI Manipulation Agents #3

Comparing ALRM's two execution modes: CaP generates Python to call robot APIs in one pass, TaP uses ReAct to loop per tool call. Benchmark across 56 tasks and 10 LLMs to help you pick the right mode.

Nguyễn Anh Tuấn | June 15, 2026 | 11 min read

Imagine you are a robot standing in front of a kitchen counter. A user says: "Grab the spoon and put it in the basket." The question is: should you write a Python script and run it all at once — or should you ask one small question, act, wait for feedback, then repeat?

This is not a philosophical question. It is the core design choice behind ALRM: Agentic LLM for Robotic Manipulation (arXiv 2601.19510), a paper from the Technology Innovation Institute (TII UAE) that formally defines and compares two execution modes across 56 tasks and 10 LLMs — from Claude-4.1-Opus down to Falcon-H1-7B — with results that may surprise you.

By the end of this post, you will understand why both modes exist, when to choose each one, and what the real benchmark numbers look like.


Series Roadmap

| # | Post | Content |
|---|------|---------|
| 1 | Why Multi-Agent Beats VLA? | ManiAgent 86.8% vs pi0 55.7% |
| 2 | Perception Agent & Grasp Planning | Florence-v2, AnyGrasp, 3D coordinates |
| 3 | ALRM: CaP vs TaP in ReAct ← you are here | Two execution modes, benchmark 10 LLMs |
| 4 | SAP Verifier: Self-Check Before Execute | Preventing execution errors with verifier agent |
| 5 | Sim-to-Real Deploy Pipeline | From Gazebo to a real robot |

What Is ALRM and What Problem Does It Solve?

Before ALRM, most LLM-for-robotics systems shared a fundamental flaw: no closed-loop mechanism. The LLM receives a command, generates an action, the robot executes it — done. If the robot picks up the wrong object, the system has no awareness and cannot self-correct. Human intervention is required.

ALRM addresses this with a three-layer architecture:

Layer 1 — Task Planner Agent: Receives natural language instructions (e.g. "clear the breakfast table"), uses the ReAct framework to decompose them into executable subtasks (e.g. pick spoon → place bowl → move_to home). The Planner continuously receives observations from the Executor to revise the plan when needed.

Layer 2 — Task Executor Agent: Receives each subtask from the Planner and executes it using one of two modes: Code-as-Policy (CaP) or Tool-as-Policy (TaP). This is ALRM's core differentiator.

Layer 3 — REST API Server: The bridge between the LLM and the physical robot. The server exposes 8 standardized actions through two internal modules:

  • wx250sRobot: robot arm control via MoveIt/ROS
  • SimPerception: object detection via Gazebo
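
As a rough sketch of how the three layers hand work to each other, the Python stubs below replace the LLM planner with a fixed decomposition and the robot with a call log. The class names, the Observation type, and the "action target" subtask format are illustrative assumptions, not ALRM's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """What the executor reports back to the planner (hypothetical shape)."""
    subtask: str
    success: bool
    detail: str = ""


class RestApiServer:
    """Layer 3 stand-in: records calls instead of driving a robot."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def call(self, action: str, *args: str) -> str:
        self.log.append(f"{action}({', '.join(args)})")
        return "ok"


class TaskExecutor:
    """Layer 2 stand-in: maps one subtask to one API call (CaP/TaP logic omitted)."""

    def __init__(self, api: RestApiServer) -> None:
        self.api = api

    def run(self, subtask: str) -> Observation:
        # Assumed subtask format: "<action> <target>", e.g. "pick spoon".
        action, _, target = subtask.partition(" ")
        result = self.api.call(action, *([target] if target else []))
        return Observation(subtask, result == "ok", result)


class TaskPlanner:
    """Layer 1 stand-in: a fixed decomposition replaces the LLM-driven ReAct planner."""

    def __init__(self, executor: TaskExecutor,
                 decompose: Callable[[str], List[str]]) -> None:
        self.executor = executor
        self.decompose = decompose

    def handle(self, instruction: str) -> List[Observation]:
        observations: List[Observation] = []
        for subtask in self.decompose(instruction):
            obs = self.executor.run(subtask)
            observations.append(obs)
            if not obs.success:
                break  # a real planner would revise the plan from this observation
        return observations
```

The point of the sketch is the direction of data flow: instructions go down through the layers, and observations come back up to the planner after every subtask.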

REST API — The Common Language Between LLM and Robot

Before diving into CaP and TaP, you need to understand the vocabulary. ALRM defines 8 primitive actions split into 3 groups:

| Group | Action | Description |
|-------|--------|-------------|
| Control | pick(object_name) | Grasp an object by name |
| Control | place(location) | Place at specified location |
| Control | move_to(pose) | Move to specific coordinates |
| Control | move_to_home_pos() | Return to safe home position |
| Perception | get_objects() | Return list of objects in the scene |
| Perception | get_reference_names() | Get names of reference points (basket, table, etc.) |
| Pose | compute_grasp(object) | Compute optimal grasp pose for an object |
| Pose | get_pose(object) | Get current pose of an object |

These 8 actions form an immutable interface — regardless of whether you use CaP or TaP, regardless of whether the LLM is GPT-5 or Falcon-H1-7B, every operation ultimately calls through these 8 primitives. This design cleanly separates the reasoning layer (LLM) from the execution layer (robot).
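
To make the interface concrete, here is a minimal in-memory stand-in for the 8 primitives, useful for trying CaP or TaP logic without Gazebo. Only the method names come from the table above. The return types, the position-only Pose tuple, the home pose, and the 5 cm grasp offset are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Pose = Tuple[float, float, float]  # assumption: position only, orientation omitted


class InMemoryRobotAPI:
    """Toy implementation of the 8 actions; it tracks state instead of moving a robot."""

    def __init__(self, objects: Dict[str, Pose], references: Dict[str, Pose]) -> None:
        self.objects = dict(objects)
        self.references = dict(references)
        self.home: Pose = (0.0, 0.0, 0.3)  # hypothetical home pose
        self.gripper_pose: Pose = self.home
        self.held: Optional[str] = None

    # Perception
    def get_objects(self) -> List[str]:
        return sorted(self.objects)

    def get_reference_names(self) -> List[str]:
        return sorted(self.references)

    # Pose
    def get_pose(self, name: str) -> Pose:
        if name in self.objects:
            return self.objects[name]
        return self.references[name]

    def compute_grasp(self, name: str) -> Pose:
        x, y, z = self.objects[name]
        return (x, y, z + 0.05)  # arbitrary offset above the object

    # Control
    def move_to(self, pose: Pose) -> bool:
        self.gripper_pose = pose
        return True

    def move_to_home_pos(self) -> bool:
        return self.move_to(self.home)

    def pick(self, object_name: str) -> bool:
        if self.held is not None or object_name not in self.objects:
            return False
        self.held = object_name
        return True

    def place(self, location: str) -> bool:
        if self.held is None or location not in self.references:
            return False
        self.objects[self.held] = self.references[location]
        self.held = None
        return True
```

Because both execution modes go through the same 8 calls, a fake like this can be swapped for the real server in tests without touching the reasoning layer.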


Code-as-Policy (CaP) — Write Script, Run Once

How it works: The LLM receives a subtask description plus full Python function definitions for all 8 actions (signatures, docstrings, and a one-shot pick-and-place example), then generates a complete Python snippet to handle the entire subtask in a single pass.

For the subtask "pick the spoon and place it in the basket", CaP outputs code similar to:

# CaP output — all logic in one LLM call
objects = get_objects()
if "spoon" in objects:
    grasp_pose = compute_grasp("spoon")
    move_to(grasp_pose)
    pick("spoon")
    basket_pose = get_pose("basket")
    move_to(basket_pose)
    place("basket")
    move_to_home_pos()

This code is sent to the REST API server, executed from start to finish, and the result is returned.
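
A minimal sketch of the "run the whole snippet once" step might look like the following. It assumes the server executes the generated code with Python's exec against a dictionary of action functions; how ALRM's server actually runs the code is not described in this post, and run_cap_snippet and the "<cap_snippet>" filename are hypothetical names. Emptying the builtins hides functions such as open, but it is not a security sandbox.

```python
import traceback
from typing import Any, Callable, Dict, Optional


def run_cap_snippet(code: str, actions: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run an LLM-generated snippet once and report the failing line, if any."""
    # Only the action functions are visible to the snippet.
    namespace: Dict[str, Any] = {"__builtins__": {}, **actions}
    try:
        exec(compile(code, "<cap_snippet>", "exec"), namespace)
    except SyntaxError as exc:
        # The snippet never started running; the parser reports the line.
        return {"ok": False, "error": repr(exc), "line": exc.lineno}
    except Exception as exc:
        # Find the deepest traceback frame that belongs to the snippet itself.
        frames = traceback.extract_tb(exc.__traceback__)
        line: Optional[int] = next(
            (f.lineno for f in reversed(frames) if f.filename == "<cap_snippet>"),
            None,
        )
        return {"ok": False, "error": repr(exc), "line": line}
    return {"ok": True, "error": None, "line": None}
```

The returned line number is what makes CaP failures easy to localize: the whole plan is one artifact, so an error maps back to a specific statement.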

CaP advantages:

  • Fast: Only 1 LLM call for the entire subtask → low latency
  • Deterministic at run time: Once generated, the code is fixed, so it can be read and validated before execution
  • Easy to debug: If something goes wrong, the specific line of code is visible

CaP disadvantages:

  • Not adaptive: If get_objects() returns a scene different from what the LLM assumed, the already-generated code cannot self-correct
  • Requires capable LLMs: Weaker models generate code with syntax errors or logic bugs
  • Struggles with complex conditionals: Tasks requiring multi-round reasoning are difficult to encode in one snippet
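
One way to act on the "validate before execution" advantage, and to catch the syntax errors that weaker models produce, is a static check with Python's ast module. This is a sketch under the assumption that a valid snippet may only call the 8 actions by name and may not import anything. The function name and the message wording are hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast
from typing import List

ALLOWED_ACTIONS = {
    "pick", "place", "move_to", "move_to_home_pos",
    "get_objects", "get_reference_names", "compute_grasp", "get_pose",
}


def validate_cap_snippet(code: str) -> List[str]:
    """Return a list of problems found in a generated snippet (empty means it passed)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append(f"line {node.lineno}: imports are not allowed")
        elif isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in ALLOWED_ACTIONS):
                name = func.id if isinstance(func, ast.Name) else ast.unparse(func)
                problems.append(
                    f"line {node.lineno}: call to '{name}' is not one of the 8 actions"
                )
    return problems
```

A check like this only confirms that the snippet stays inside the API surface. It says nothing about whether the plan is correct for the current scene, which is the adaptivity gap described above.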

Tool-as-Policy (TaP) — Ask, Act, Receive Feedback, Repeat

How it works: The LLM is given tool definitions (JSON schema of all 8 actions), but instead of generating code, it calls one tool per step. Each tool call result is appended to the conversation history and sent back to the LLM to decide the next action.

For the same subtask, TaP operates as a ReAct loop:

Step 1: LLM → tool_call: get_objects()
         Obs: ["spoon", "spatula", "basket", "coke_can"]

Step 2: LLM → tool_call: compute_grasp("spoon")
         Obs: {position: [0.3, 0.1, 0.05], orientation: [0, 0, 0.7, 0.7]}

Step 3: LLM → tool_call: move_to({position: [0.3, 0.1, 0.05], ...})
         Obs: "Move successful"

Step 4: LLM → tool_call: pick("spoon")
         Obs: "Pick successful"

Step 5: LLM → tool_call: get_pose("basket")
         Obs: {position: [0.5, -0.2, 0.1], ...}

Step 6: LLM → tool_call: place("basket")
         Obs: "Place successful"

Step 7: LLM → tool_call: move_to_home_pos()
         Obs: "Home position reached"
         → DONE

At each step, the LLM sees the complete history — it knows where the robot is, what has been done, and what the responses were. If step 4 (pick) fails because the object is occluded, the LLM can adjust its strategy at step 5 instead of blindly continuing.
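
The loop above can be sketched in a few lines of Python. Here a scripted policy function stands in for the LLM so the example runs offline. The history record format, the policy signature, and the retry-after-failure rule are assumptions for illustration, not ALRM's implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, tuple]          # (tool name, positional arguments)
History = List[Dict[str, Any]]        # one record per executed tool call
Policy = Callable[[History], Optional[ToolCall]]


def run_tap_loop(policy: Policy, tools: Dict[str, Callable[..., Any]],
                 max_steps: int = 10) -> History:
    """Ask the policy for one tool call at a time, feeding back every observation."""
    history: History = []
    for _ in range(max_steps):
        call = policy(history)        # the policy sees the full history each step
        if call is None:              # the policy signals that the subtask is done
            break
        name, args = call
        try:
            observation = tools[name](*args)
        except Exception as exc:      # errors become observations, not crashes
            observation = f"error: {exc!r}"
        history.append({"tool": name, "args": args, "observation": observation})
    return history


def scripted_policy(history: History) -> Optional[ToolCall]:
    """Stand-in for the LLM: re-observes the scene whenever a pick fails."""
    if not history:
        return ("get_objects", ())
    last = history[-1]
    if last["tool"] == "get_objects":
        return ("pick", ("spoon",))
    if last["tool"] == "pick" and last["observation"] != "Pick successful":
        return ("get_objects", ())
    if last["tool"] == "pick":
        return ("place", ("basket",))
    return None
```

With a pick tool that fails on its first attempt, this policy produces the sequence get_objects, pick, get_objects, pick, place, which is the kind of recovery a single pre-generated script cannot perform. The max_steps cap keeps a policy that never succeeds from looping forever.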

TaP advantages:

  • Adaptive: Real-time feedback from the robot directly influences subsequent decisions
  • Handles complex tasks well: Multi-step conditional logic and error recovery happen naturally
  • Large models fully utilized: GPT-5, Claude-4.1-Opus are more effective with full context at each step

TaP disadvantages:

  • Significantly slower: Each tool call requires a full LLM round-trip → latency accumulates quickly
  • Small models often fail: Requires the LLM to precisely understand tool call schemas — 7B models are rarely reliable enough
  • Higher token cost: Conversation history grows with every step
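
The token-cost point can be made concrete with a small back-of-the-envelope model. It assumes a fixed base prompt, a constant number of tokens added to the history per step, and no prompt caching. The function name and the example numbers are hypothetical, not measurements from the benchmark.

```python
def tap_prompt_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total prompt tokens sent across a TaP episode.

    Step k (counting from 0) resends the base prompt plus the k earlier
    call/observation pairs, so the total grows quadratically with the step count.
    """
    return sum(base_tokens + k * tokens_per_step for k in range(steps))
```

For example, with a 1000-token base prompt and 100 tokens added per step, a 7-step episode sends 7 × 1000 + 100 × (0 + 1 + ... + 6) = 9100 prompt tokens in total, compared with roughly 1000 for a single CaP call under the same assumptions.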

Benchmark — 56 Tasks, 3 Environments, 10 LLMs

To make a fair comparison, ALRM constructed a carefully designed benchmark:

3 Gazebo simulation environments:

  • Kitchen Utensils: spoon, spatula, coke can, basket
  • Boxes: cardboard box, wooden box, metal box, container
  • Fruits: strawberry, plum, lemon, peach, bowl, trash bin

56 tasks — each with 6 linguistic variations (from direct "pick the red apple" to indirect "take the round sweet fruit and put it away") to test genuine language understanding rather than pattern matching.

Scoring — 3 judge models (majority voting):

  • Score 2: All subtasks completed correctly with accurate parameters
  • Score 1: At least one subtask completed, but with errors or omissions
  • Score 0: No subtask completed correctly

The three judges are GPT-4.1, Claude-Sonnet-4, and Gemini-2.5-Flash — majority voting eliminates single-model bias.
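
A minimal sketch of the majority vote over the three judges' 0/1/2 scores follows. The post does not say how a three-way split (one judge each for 0, 1, and 2) is resolved, so returning None in that case is an assumption, as is the function name.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_score(scores: Sequence[int]) -> Optional[int]:
    """Return the score given by more than half of the judges, or None if there is none."""
    if not scores:
        raise ValueError("at least one judge score is required")
    top_score, votes = Counter(scores).most_common(1)[0]
    if votes > len(scores) / 2:
        return top_score
    return None  # assumption: no strict majority means the item needs manual review
```

With three judges, two agreeing scores always decide the outcome, so a single judge's outlier cannot change the result on its own.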


Results — Who Wins, Who Loses?

Figure: Success rate of 10 LLMs across 56 manipulation tasks, CaP vs TaP (source: tiiuae.github.io/ALRM)

Large-scale models

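| Model | CaP Success | CaP Latency | TaP Success | TaP Latency |
|-------|-------------|-------------|-------------|-------------|
| Claude-4.1-Opus | 92.6% | 33.4s | 93.5% | 82.6s |
| GPT-5 | 90.7% | 145.6s | 85.2% | 113.9s |
| DeepSeek-V3.1 | 84.3% | 69.8s | 85.2% | 161.7s |
| Gemini-2.5-Pro | 73.1% | 52.6s | 87.0% | 117.4s |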

Notable observation: Claude-4.1-Opus leads the large models in CaP mode on both accuracy and latency (33.4s vs GPT-5's 145.6s, about 4.4x faster). In TaP, Opus accuracy improves slightly (93.5% vs 92.6%), but latency increases about 2.5x (82.6s vs 33.4s).

Gemini-2.5-Pro shows the opposite behavior: CaP 73.1% but TaP 87.0% — this model has significantly stronger iterative reasoning than it does one-shot code synthesis.
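
The trade-off for each large model can be recomputed directly from the table. The numbers below are copied from it; the dictionary layout and the function name are just one convenient way to arrange them.

```python
from typing import Dict, Tuple

# (success rate in %, latency in seconds), copied from the large-model table
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Claude-4.1-Opus": {"cap": (92.6, 33.4), "tap": (93.5, 82.6)},
    "GPT-5": {"cap": (90.7, 145.6), "tap": (85.2, 113.9)},
    "DeepSeek-V3.1": {"cap": (84.3, 69.8), "tap": (85.2, 161.7)},
    "Gemini-2.5-Pro": {"cap": (73.1, 52.6), "tap": (87.0, 117.4)},
}


def tap_vs_cap(model: str) -> Dict[str, float]:
    """Accuracy gained (percentage points) and latency multiplier when switching CaP -> TaP."""
    cap_acc, cap_lat = RESULTS[model]["cap"]
    tap_acc, tap_lat = RESULTS[model]["tap"]
    return {
        "accuracy_gain_points": round(tap_acc - cap_acc, 1),
        "latency_ratio": round(tap_lat / cap_lat, 2),
    }
```

Running tap_vs_cap over the four models gives +0.9 points at 2.47x latency for Claude-4.1-Opus, +0.9 at 2.32x for DeepSeek-V3.1, and +13.9 at 2.23x for Gemini-2.5-Pro. GPT-5 is the exception: TaP costs it 5.5 points of accuracy while running at 0.78x the CaP latency.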

Small-scale (open-source) models

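| Model | CaP Success | CaP Latency | TaP Success |
|-------|-------------|-------------|-------------|
| Falcon-H1-7B | 84.3% | 24.9s | Unreliable |
| Llama-3.1-8B | 68.5% | 20.1s | Failed |
| Qwen3-8B | 64.8% | 343.3s | 53.7% |
| Granite-3.3-8B | 53.7% | 24.5s | Failed |
| DeepSeek-R1-7B | 21.3% | 53.6s | Failed |
| Mistral-7B | 8.3% | 19.0s | Failed |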

Figure: Latency benchmark. TaP is slower than CaP but more flexible for multi-step tasks (source: tiiuae.github.io/ALRM)

Falcon-H1-7B is the benchmark's biggest surprise. At 84.3% CaP success with only 24.9s latency, it matches the CaP success rate of DeepSeek-V3.1 (a model many times larger) and beats the next-best small model, Llama-3.1-8B, by 15.8 points. A plausible contributor is the hybrid SSM-Transformer architecture of TII's Falcon-H1 family, although the benchmark does not isolate the cause, and the same model's TaP results were unreliable.

Qwen3-8B is the curious outlier: its CaP latency of 343.3s is the slowest in the entire benchmark, yet it is the only small open-source model with functional TaP (53.7%). This suggests that Qwen3-8B handles structured tool calls better than its peers but takes far longer to produce a complete code snippet.


When to Use CaP vs TaP

This is the most practical question of all. Based on benchmark results, clear principles emerge:

Choose CaP when:

  • The task has clear structure with little conditional branching (pick → place without intermediate checks)
  • Low latency is a requirement (high-throughput warehouse robots, many operations per minute)
  • You are using an open-source 7-8B model (CaP is the only practical choice: in the benchmark, TaP failed or was unreliable for every small model except Qwen3-8B, which reached only 53.7%)
  • You want code that can be inspected and validated before running on a real robot

Choose TaP when:

  • The task is complex and requires real-time error recovery (e.g. if object is occluded, find another approach)
  • You are using a large model and accuracy matters more than speed (TaP helped Claude-4.1-Opus, DeepSeek-V3.1, and especially Gemini-2.5-Pro; GPT-5 is the exception, scoring 90.7% in CaP vs 85.2% in TaP)
  • The robot environment is dynamic and the scene may change during execution
  • The task requires multiple perception-action cycles (observe → decide → act → observe → decide again)

Quick decision table:

| Condition | Recommendation |
|-----------|----------------|
| Open-source 7-8B model | CaP (TaP not reliable) |
| Large model + simple task | CaP (2.2-2.5x faster, except GPT-5) |
| Large model + complex task | TaP (higher accuracy, except GPT-5) |
| Latency < 30s is a hard requirement | CaP with a small model such as Falcon-H1-7B (24.9s); Claude-4.1-Opus + CaP averages 33.4s |
| Real robot, dynamic environment | TaP (more adaptive) |
| Simulation testing/development | Start with CaP, upgrade to TaP if needed |
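
The decision table can also be written as a small helper. This is one possible encoding, not a rule from the paper: the function name, the parameter names, and the order in which the conditions are checked (model size first, then the latency budget, then task complexity) are assumptions.

```python
from typing import Optional


def choose_mode(small_model: bool,
                complex_task: bool = False,
                dynamic_scene: bool = False,
                latency_budget_s: Optional[float] = None) -> str:
    """Pick "CaP" or "TaP" following the quick decision table."""
    if small_model:
        return "CaP"  # TaP was unreliable for 7-8B models in the benchmark
    if latency_budget_s is not None and latency_budget_s < 30:
        return "CaP"  # a hard latency limit rules out per-step LLM round-trips
    if complex_task or dynamic_scene:
        return "TaP"  # feedback after every action allows recovery
    return "CaP"      # simple, static tasks: take the faster mode
```

Checking the latency budget before task complexity reflects a judgment that a hard real-time limit outranks adaptivity. Swap the two conditions if your application weighs them the other way.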

Why Does Claude-4.1-Opus Lead Both Modes?

This is the most interesting takeaway from the benchmark. Usually, a model strong at code generation (CaP) is not necessarily strong at iterative tool-calling (TaP), and vice versa. Yet Opus tops both.

The paper attributes this to Opus's extremely precise instruction following — when generating Python code, it adheres correctly to API contracts (correct signatures, correct types, correct edge case handling). When using TaP, it also parses tool call schemas accurately without hallucinating non-existent parameters. Both modes demand high precision — and that is Opus's core strength.

This also explains why Mistral-7B achieves only 8.3% in CaP: the model lacks the ability to generate syntactically valid Python that respects the strict API constraints.


Connecting to the Broader Pipeline

The previous post on Perception Agent showed how Florence-v2 + AnyGrasp generates 3D coordinates for objects in the scene. Those coordinates are the same kind of data that compute_grasp() and get_pose() return in the ALRM REST API, where the caller passes only an object name.

The next post on SAP Verifier will explain how to add a verification layer before the Executor acts — particularly important in CaP mode where generated code must be validated before running on a real robot.


Summary

  • ALRM uses a 3-layer architecture: Task Planner (ReAct) → Task Executor (CaP/TaP) → REST API Server
  • CaP generates complete Python in 1 LLM call → fast, deterministic, works with small models
  • TaP calls one tool at a time, receives observations, decides next step → slower, adaptive, requires large models
  • Claude-4.1-Opus leads both modes: 92.6% CaP (33.4s) and 93.5% TaP
  • Falcon-H1-7B is the open-source champion: 84.3% CaP at 24.9s latency — matching DeepSeek-V3.1
  • Mode selection rule: Open-source/latency-critical → CaP. Large model/complex task → TaP.

Related Posts

  • Why Multi-Agent Beats VLA? — AI Manipulation Agents #1
  • Perception Agent & Grasp Planning: Florence-v2 + AnyGrasp — AI Manipulation Agents #2
  • SAP Verifier: Self-Check Before Execution — AI Manipulation Agents #4

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
