ALRM: Code-as-Policy vs Tool-as-Policy in ReAct | AI Manipulation Agents #3

Imagine you are a robot standing in front of a kitchen counter. A user says: "Grab the spoon and put it in the basket." The question is: should you write a Python script and run it all at once — or should you ask one small question, act, wait for feedback, then repeat?

This is not a philosophical question. It is the core design choice behind ALRM: Agentic LLM for Robotic Manipulation (arXiv 2601.19510), a paper from the Technology Innovation Institute (TII UAE) that formally defines and compares two execution modes across 56 tasks and 10 LLMs — from Claude-4.1-Opus down to Falcon-H1-7B — with results that may surprise you.

By the end of this post, you will understand why both modes exist, when to choose each one, and what the real benchmark numbers look like.

Series Roadmap

#	Post	Content
1	Why Multi-Agent Beats VLA?	ManiAgent 86.8% vs pi0 55.7%
2	Perception Agent & Grasp Planning	Florence-v2, AnyGrasp, 3D coordinates
3	ALRM: CaP vs TaP in ReAct ← you are here	Two execution modes, benchmark 10 LLMs
4	SAP Verifier: Self-Check Before Execute	Preventing execution errors with verifier agent
5	Sim-to-Real Deploy Pipeline	From Gazebo to a real robot

What Is ALRM and What Problem Does It Solve?

Before ALRM, most LLM-for-robotics systems shared a fundamental flaw: no closed-loop mechanism. The LLM receives a command, generates an action, the robot executes it — done. If the robot picks up the wrong object, the system has no awareness and cannot self-correct. Human intervention is required.

ALRM addresses this with a three-layer architecture:

Layer 1 — Task Planner Agent: Receives natural language instructions (e.g. "clear the breakfast table"), uses the ReAct framework to decompose them into executable subtasks (e.g. pick spoon → place bowl → move_to home). The Planner continuously receives observations from the Executor to revise the plan when needed.

Layer 2 — Task Executor Agent: Receives each subtask from the Planner and executes it using one of two modes: Code-as-Policy (CaP) or Tool-as-Policy (TaP). This is ALRM's core differentiator.

Layer 3 — REST API Server: The bridge between the LLM and the physical robot. The server exposes 8 standardized actions through two internal modules:

wx250sRobot: robot arm control via MoveIt/ROS
SimPerception: object detection via Gazebo

REST API — The Common Language Between LLM and Robot

Before diving into CaP and TaP, you need to understand the vocabulary. ALRM defines 8 primitive actions split into 3 groups:

Group	Action	Description
Control	`pick(object_name)`	Grasp an object by name
Control	`place(location)`	Place at specified location
Control	`move_to(pose)`	Move to specific coordinates
Control	`move_to_home_pos()`	Return to safe home position
Perception	`get_objects()`	Return list of objects in the scene
Perception	`get_reference_names()`	Get names of reference points (basket, table, etc.)
Pose	`compute_grasp(object)`	Compute optimal grasp pose for an object
Pose	`get_pose(object)`	Get current pose of an object

These 8 actions form an immutable interface — regardless of whether you use CaP or TaP, regardless of whether the LLM is GPT-5 or Falcon-H1-7B, every operation ultimately calls through these 8 primitives. This design cleanly separates the reasoning layer (LLM) from the execution layer (robot).
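One way to picture this interface is as a typed protocol that both executor modes program against. The sketch below is illustrative only: the `RobotAPI` protocol, the `FakeRobot` class, the `pick_and_place` helper, the return strings, and the pose dictionary layout are all assumptions, and only the eight action names come from the table above.

```python
from typing import Optional, Protocol

# Hypothetical pose layout: {"position": [x, y, z], "orientation": [x, y, z, w]}
Pose = dict


class RobotAPI(Protocol):
    """The eight primitives. Names come from the table; signatures are assumed."""

    def pick(self, object_name: str) -> str: ...
    def place(self, location: str) -> str: ...
    def move_to(self, pose: Pose) -> str: ...
    def move_to_home_pos(self) -> str: ...
    def get_objects(self) -> list: ...
    def get_reference_names(self) -> list: ...
    def compute_grasp(self, object: str) -> Pose: ...
    def get_pose(self, object: str) -> Pose: ...


class FakeRobot:
    """In-memory stand-in for the REST server, useful for testing reasoning code."""

    def __init__(self) -> None:
        self.positions = {"spoon": [0.3, 0.1, 0.05], "basket": [0.5, -0.2, 0.1]}
        self.holding: Optional[str] = None

    def _pose(self, name: str) -> Pose:
        return {"position": self.positions[name], "orientation": [0.0, 0.0, 0.0, 1.0]}

    def get_objects(self) -> list:
        return sorted(self.positions)

    def get_reference_names(self) -> list:
        return ["basket"]

    def compute_grasp(self, object: str) -> Pose:
        return self._pose(object)

    def get_pose(self, object: str) -> Pose:
        return self._pose(object)

    def move_to(self, pose: Pose) -> str:
        return "Move successful"

    def move_to_home_pos(self) -> str:
        return "Home position reached"

    def pick(self, object_name: str) -> str:
        if object_name not in self.positions or self.holding is not None:
            return "Pick failed"
        self.holding = object_name
        return "Pick successful"

    def place(self, location: str) -> str:
        if self.holding is None:
            return "Place failed"
        self.holding = None
        return "Place successful"


def pick_and_place(robot: RobotAPI, item: str, target: str) -> list:
    """Works with any object that implements the eight primitives."""
    return [
        robot.move_to(robot.compute_grasp(item)),
        robot.pick(item),
        robot.move_to(robot.get_pose(target)),
        robot.place(target),
        robot.move_to_home_pos(),
    ]
```

Because `pick_and_place` depends only on the protocol, the same reasoning code could be pointed at a simulator-backed client or at the fake without changes, which is the separation the paragraph above describes.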

Code-as-Policy (CaP) — Write Script, Run Once

How it works: The LLM receives a subtask description plus full Python function definitions for all 8 actions (signatures, docstrings, and a one-shot pick-and-place example), then generates a complete Python snippet to handle the entire subtask in a single pass.

For the subtask "pick the spoon and place it in the basket", CaP outputs code similar to:

# CaP output — all logic in one LLM call
objects = get_objects()
if "spoon" in objects:
    grasp_pose = compute_grasp("spoon")
    move_to(grasp_pose)
    pick("spoon")
    basket_pose = get_pose("basket")
    move_to(basket_pose)
    place("basket")
    move_to_home_pos()

This code is sent to the REST API server, executed from start to finish, and the result is returned.
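The "execute from start to finish" step can be sketched as follows. This is a minimal illustration, not ALRM's server code: `run_cap_snippet`, `make_stub`, and the stubbed return values are hypothetical, and the stubs stand in for the real primitives so the control flow can be seen without a robot.

```python
def run_cap_snippet(code: str, primitives: dict) -> dict:
    """Run a generated snippet with only the given primitives in scope."""
    # Emptying __builtins__ narrows the namespace but is NOT a security sandbox.
    namespace = {"__builtins__": {}, **primitives}
    try:
        exec(code, namespace)
    except Exception as exc:  # includes SyntaxError and NameError
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "error": None}


calls = []  # records which primitives the snippet invoked, in order


def make_stub(name, result):
    """Build a fake primitive that logs its name and returns a canned result."""
    def stub(*args):
        calls.append(name)
        return result
    return stub


pose = {"position": [0.3, 0.1, 0.05]}
primitives = {
    "get_objects": make_stub("get_objects", ["spoon", "basket"]),
    "compute_grasp": make_stub("compute_grasp", pose),
    "get_pose": make_stub("get_pose", pose),
    "move_to": make_stub("move_to", "Move successful"),
    "pick": make_stub("pick", "Pick successful"),
    "place": make_stub("place", "Place successful"),
    "move_to_home_pos": make_stub("move_to_home_pos", "Home position reached"),
}

generated = '''
objects = get_objects()
if "spoon" in objects:
    move_to(compute_grasp("spoon"))
    pick("spoon")
    move_to(get_pose("basket"))
    place("basket")
    move_to_home_pos()
'''

result = run_cap_snippet(generated, primitives)
broken = run_cap_snippet("fly_to('moon')", primitives)  # unknown primitive
```

Note that the whole snippet succeeds or fails as a unit: a call to an unknown function surfaces as a single error for the snippet, with no chance for the model to react mid-run.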

CaP advantages:

Fast: Only 1 LLM call for the entire subtask → low latency
Deterministic: The generated code can be read and validated before execution
Easy to debug: If something goes wrong, the specific line of code is visible
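The "validated before execution" point can be made concrete with a static check. The sketch below is one possible approach, not something the paper is known to ship: it uses Python's standard `ast` module to reject any snippet that imports modules or calls anything outside the eight primitives. The name `find_violations` is hypothetical.

```python
import ast

ALLOWED_CALLS = {
    "pick", "place", "move_to", "move_to_home_pos",
    "get_objects", "get_reference_names", "compute_grasp", "get_pose",
}


def find_violations(code: str) -> list:
    """Return sorted names of disallowed calls; 'import' marks any import statement."""
    tree = ast.parse(code)  # raises SyntaxError if the snippet is not valid Python
    violations = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.add("import")
        elif isinstance(node, ast.Call):
            # Plain names like pick(...) are checked directly; anything else
            # (attribute calls such as os.system) is rendered back to source text.
            if isinstance(node.func, ast.Name):
                name = node.func.id
            else:
                name = ast.unparse(node.func)
            if name not in ALLOWED_CALLS:
                violations.add(name)
    return sorted(violations)


safe = 'pose = compute_grasp("spoon")\nmove_to(pose)\npick("spoon")'
unsafe = 'import os\nos.system("reboot")\nfly_to("moon")'
print(find_violations(safe))
print(find_violations(unsafe))
```

A check like this catches hallucinated function names before any motion command reaches the robot, but it does not verify arguments or logic, so it complements rather than replaces human review.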

CaP disadvantages:

Not adaptive: If get_objects() returns a scene different from what the LLM assumed, the already-generated code cannot self-correct
Requires capable LLMs: Weaker models generate code with syntax errors or logic bugs
Struggles with complex conditionals: Tasks requiring multi-round reasoning are difficult to encode in one snippet

Tool-as-Policy (TaP) — Ask, Act, Receive Feedback, Repeat

How it works: The LLM is given tool definitions (JSON schema of all 8 actions), but instead of generating code, it calls one tool per step. Each tool call result is appended to the conversation history and sent back to the LLM to decide the next action.
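As a rough idea of what such tool definitions can look like, the sketch below builds one JSON description per action in the function-calling layout used by several chat-completion APIs. The exact schema ALRM uses is not shown in this post, so the layout, the `to_tool_schema` helper, and the parameter types are assumptions; the action names, parameter names, and descriptions follow the table of primitives.

```python
import json

# (name, description, parameters), where parameters maps name -> JSON Schema type.
ACTIONS = [
    ("pick", "Grasp an object by name", {"object_name": "string"}),
    ("place", "Place at specified location", {"location": "string"}),
    ("move_to", "Move to specific coordinates", {"pose": "object"}),
    ("move_to_home_pos", "Return to safe home position", {}),
    ("get_objects", "Return list of objects in the scene", {}),
    ("get_reference_names", "Get names of reference points", {}),
    ("compute_grasp", "Compute optimal grasp pose for an object", {"object": "string"}),
    ("get_pose", "Get current pose of an object", {"object": "string"}),
]


def to_tool_schema(name, description, params):
    """Wrap one action in an assumed function-calling tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in params.items()},
                "required": list(params),
            },
        },
    }


TOOLS = [to_tool_schema(*action) for action in ACTIONS]
print(json.dumps(TOOLS[0], indent=2))
```

The model never sees Python here, only these declarative descriptions, which is why schema-following ability matters so much in this mode.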

For the same subtask, TaP operates as a ReAct loop:

Step 1: LLM → tool_call: get_objects()
         Obs: ["spoon", "spatula", "basket", "coke_can"]

Step 2: LLM → tool_call: compute_grasp("spoon")
         Obs: {position: [0.3, 0.1, 0.05], orientation: [0, 0, 0.7, 0.7]}

Step 3: LLM → tool_call: move_to({position: [0.3, 0.1, 0.05], ...})
         Obs: "Move successful"

Step 4: LLM → tool_call: pick("spoon")
         Obs: "Pick successful"

Step 5: LLM → tool_call: get_pose("basket")
         Obs: {position: [0.5, -0.2, 0.1], ...}

Step 6: LLM → tool_call: place("basket")
         Obs: "Place successful"

Step 7: LLM → tool_call: move_to_home_pos()
         Obs: "Home position reached"
         → DONE

At each step, the LLM sees the complete history — it knows where the robot is, what has been done, and what the responses were. If step 4 (pick) fails because the object is occluded, the LLM can adjust its strategy at step 5 instead of blindly continuing.
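The recovery behavior described above can be sketched with a toy loop. Everything here is hypothetical: `scripted_policy` is a hand-written stand-in for the LLM, the tools are fakes whose first `pick` fails on purpose, and the sequence is shortened (it skips the `get_pose` step) to keep the example small.

```python
SEQUENCE = ["get_objects", "compute_grasp", "move_to", "pick", "place", "move_to_home_pos"]
FIXED_ARGS = {
    "get_objects": (),
    "compute_grasp": ("spoon",),
    "pick": ("spoon",),
    "place": ("basket",),
    "move_to_home_pos": (),
}


def make_tools():
    """Fake tools; the first pick fails to simulate an occluded object."""
    attempts = {"pick": 0}

    def pick(name):
        attempts["pick"] += 1
        if attempts["pick"] == 1:
            return "Pick failed: object occluded"
        return "Pick successful"

    return {
        "get_objects": lambda: ["spoon", "basket"],
        "compute_grasp": lambda name: {"position": [0.3, 0.1, 0.05]},
        "move_to": lambda pose: "Move successful",
        "pick": pick,
        "place": lambda location: "Place successful",
        "move_to_home_pos": lambda: "Home position reached",
    }


def scripted_policy(history):
    """Stand-in for the LLM: chooses the next call from the history so far."""
    if not history:
        return "get_objects", ()
    last_tool, last_obs = history[-1]
    if "failed" in str(last_obs):
        return "compute_grasp", ("spoon",)  # recover: recompute the grasp, then retry
    next_index = SEQUENCE.index(last_tool) + 1
    if next_index == len(SEQUENCE):
        return None  # subtask complete
    tool = SEQUENCE[next_index]
    # move_to always follows compute_grasp here, so it reuses that observation.
    args = (last_obs,) if tool == "move_to" else FIXED_ARGS[tool]
    return tool, args


def run_tap(policy, tools, max_steps=20):
    """One tool call per iteration; each observation is appended to the history."""
    history = []  # list of (tool_name, observation) pairs
    for _ in range(max_steps):
        call = policy(history)
        if call is None:
            break
        tool, args = call
        history.append((tool, tools[tool](*args)))
    return history


history = run_tap(scripted_policy, make_tools())
for step, (tool, obs) in enumerate(history, start=1):
    print(f"Step {step}: {tool} -> {obs}")
```

The failed pick costs three extra steps (recompute, move, retry) instead of failing the subtask, which is the trade TaP makes: more round-trips in exchange for the chance to recover.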

TaP advantages:

Adaptive: Real-time feedback from the robot directly influences subsequent decisions
Handles complex tasks well: Multi-step conditional logic and error recovery happen naturally
Plays to large models' strengths: Claude-4.1-Opus, DeepSeek-V3.1, and Gemini-2.5-Pro all score higher with TaP than with CaP in the benchmark (GPT-5 is the exception)

TaP disadvantages:

Significantly slower: Each tool call requires a full LLM round-trip → latency accumulates quickly
Small models often fail: Requires the LLM to precisely understand tool call schemas — 7B models are rarely reliable enough
Higher token cost: Conversation history grows with every step
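To see why the token cost grows faster than the step count, consider a back-of-the-envelope model in which the full history is resent on every call. The function name and the token figures below are made-up illustrations, not measurements from the paper, and the model ignores prompt caching.

```python
def tap_input_tokens(base_prompt: int, per_step: int, steps: int) -> int:
    """Total input tokens over a TaP episode when the full history is resent each step."""
    # Step k (0-based) sends the base prompt plus k earlier call/observation pairs.
    return sum(base_prompt + k * per_step for k in range(steps))


def closed_form(base_prompt: int, per_step: int, steps: int) -> int:
    """Same quantity: steps * base + per_step * (0 + 1 + ... + (steps - 1))."""
    return steps * base_prompt + per_step * steps * (steps - 1) // 2


# Hypothetical numbers: 1000-token prompt, 200 tokens per call/observation pair, 7 steps.
tap_total = tap_input_tokens(1000, 200, 7)
cap_total = 1000  # a single CaP call sends the prompt once
print(tap_total, cap_total)
```

Under these assumed numbers the seven-step episode sends 11,200 input tokens against 1,000 for one CaP call; the quadratic term comes from the history being re-read at every step.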

Benchmark — 56 Tasks, 3 Environments, 10 LLMs

To make a fair comparison, ALRM constructed a carefully designed benchmark:

3 Gazebo simulation environments:

Kitchen Utensils: spoon, spatula, coke can, basket
Boxes: cardboard box, wooden box, metal box, container
Fruits: strawberry, plum, lemon, peach, bowl, trash bin

56 tasks — each with 6 linguistic variations (from direct "pick the red apple" to indirect "take the round sweet fruit and put it away") to test genuine language understanding rather than pattern matching.

Scoring — 3 judge models (majority voting):

Score 2: All subtasks completed correctly with accurate parameters
Score 1: At least one subtask completed, but with errors or omissions
Score 0: No subtask completed correctly

The three judges are GPT-4.1, Claude-Sonnet-4, and Gemini-2.5-Flash; majority voting reduces the influence of any single judge model's bias.
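The aggregation step can be sketched in a few lines. The post does not say how a three-way disagreement (one judge each voting 0, 1, and 2) is resolved, so the median fallback below is an assumption, as is the function name `majority_score`.

```python
from collections import Counter


def majority_score(votes: list) -> int:
    """Return the score chosen by a strict majority of judges.

    If no score has a strict majority, fall back to the median vote
    (an assumed tie-break, not taken from the paper).
    """
    score, count = Counter(votes).most_common(1)[0]
    if count > len(votes) // 2:
        return score
    return sorted(votes)[len(votes) // 2]


print(majority_score([2, 2, 1]))  # two of three judges agree on 2
print(majority_score([0, 1, 2]))  # no majority, median fallback
```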

Results — Who Wins, Who Loses?

Success rate of 10 LLMs across 56 manipulation tasks — CaP vs TaP comparison — source: tiiuae.github.io/ALRM

Large-scale models

Model	CaP Success	CaP Latency	TaP Success	TaP Latency
Claude-4.1-Opus	92.6%	33.4s	93.5%	82.6s
GPT-5	90.7%	145.6s	85.2%	113.9s
DeepSeek-V3.1	84.3%	69.8s	85.2%	161.7s
Gemini-2.5-Pro	73.1%	52.6s	87.0%	117.4s

Notable observation: in CaP mode, Claude-4.1-Opus leads the large models in both accuracy and latency (33.4s vs GPT-5's 145.6s, about 4.4x faster). In TaP, Opus accuracy improves slightly (93.5% vs 92.6%), but latency rises about 2.5x (82.6s vs 33.4s).

Gemini-2.5-Pro shows the opposite behavior: CaP 73.1% but TaP 87.0% — this model has significantly stronger iterative reasoning than it does one-shot code synthesis.
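The trade-offs in the large-model table are easier to compare as ratios. The snippet below only recomputes figures from that table; the dictionary layout and the `compare` helper are conveniences introduced here.

```python
# (CaP success %, CaP latency s, TaP success %, TaP latency s), copied from the table.
LARGE_MODELS = {
    "Claude-4.1-Opus": (92.6, 33.4, 93.5, 82.6),
    "GPT-5": (90.7, 145.6, 85.2, 113.9),
    "DeepSeek-V3.1": (84.3, 69.8, 85.2, 161.7),
    "Gemini-2.5-Pro": (73.1, 52.6, 87.0, 117.4),
}


def compare(models: dict) -> dict:
    """Per model: TaP accuracy gain in points and TaP/CaP latency ratio."""
    rows = {}
    for name, (cap_acc, cap_lat, tap_acc, tap_lat) in models.items():
        rows[name] = {
            "tap_accuracy_gain": round(tap_acc - cap_acc, 1),
            "tap_latency_ratio": round(tap_lat / cap_lat, 2),
        }
    return rows


summary = compare(LARGE_MODELS)
for name, row in summary.items():
    gain, ratio = row["tap_accuracy_gain"], row["tap_latency_ratio"]
    print(f"{name:16s} gain {gain:+5.1f} pts, TaP latency x{ratio:.2f}")
```

Three of the four models pay roughly 2.2x to 2.5x in latency for TaP; GPT-5 is the one case where TaP is both faster (ratio 0.78) and less accurate (5.5 points lower).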

Small-scale (open-source) models

Model	CaP Success	CaP Latency	TaP Success
Falcon-H1-7B	84.3%	24.9s	Unreliable
Llama-3.1-8B	68.5%	20.1s	Failed
Qwen3-8B	64.8%	343.3s	53.7%
Granite-3.3-8B	53.7%	24.5s	Failed
DeepSeek-R1-7B	21.3%	53.6s	Failed
Mistral-7B	8.3%	19.0s	Failed

Latency benchmark: TaP is slower than CaP but more flexible for multi-step tasks — source: tiiuae.github.io/ALRM

Falcon-H1-7B is the benchmark's biggest surprise. At 84.3% CaP success with only 24.9s latency, it matches DeepSeek-V3.1 (a model many times larger) on CaP accuracy and significantly outperforms every other 7-8B open-source model. A plausible contributor is the hybrid SSM-Transformer architecture of TII's Falcon-H1 family, although the benchmark numbers alone do not isolate the cause.

Qwen3-8B is the curious outlier: its CaP latency of 343.3s is the slowest in the entire benchmark, yet it is the only small model with functional TaP (53.7%). This suggests Qwen3-8B handles structured tool calls comparatively well but is much slower at producing a complete code snippet.

When to Use CaP vs TaP

This is the most practical question of all. Based on benchmark results, clear principles emerge:

Choose CaP when:

The task has clear structure with little conditional branching (pick → place without intermediate checks)
Low latency is a requirement (high-throughput warehouse robots, many operations per minute)
You are using an open-source 7-8B model (CaP is effectively the only viable choice: five of the six small models failed or were unreliable in TaP, and Qwen3-8B reached only 53.7%)
You want code that can be inspected and validated before running on a real robot

Choose TaP when:

The task is complex and requires real-time error recovery (e.g. if object is occluded, find another approach)
You are using a large model and accuracy matters more than speed (Claude-4.1-Opus, DeepSeek-V3.1, and Gemini-2.5-Pro all score higher in TaP; GPT-5 scores higher in CaP)
The robot environment is dynamic and the scene may change during execution
The task requires multiple perception-action cycles (observe → decide → act → observe → decide again)

Quick decision table:

Condition	Recommendation
Open-source 7-8B model	CaP (TaP not reliable)
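Large model + simple task	CaP (roughly 2.2-2.5x faster for Claude-4.1-Opus, DeepSeek-V3.1, and Gemini-2.5-Pro; GPT-5 is the exception)
Large model + complex task	TaP (more adaptive; accuracy gains range from +0.9 to +13.9 points, though GPT-5 drops 5.5)
Latency < 30s is a hard requirement	CaP with a small model (Falcon-H1-7B averages 24.9s; Claude-4.1-Opus + CaP averages 33.4s)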
Real robot, dynamic environment	TaP (more adaptive)
Simulation testing/development	Start with CaP, upgrade to TaP if needed
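The decision table can also be written as a small function, which makes the rule order explicit. This is a sketch of the table's logic rather than code from the paper: the function name, the parameter names, and the rule priority (model size first, then latency budget, then task and environment) are assumptions.

```python
from typing import Optional


def choose_mode(
    small_model: bool,
    complex_task: bool = False,
    dynamic_env: bool = False,
    latency_budget_s: Optional[float] = None,
) -> str:
    """Return "CaP" or "TaP" following the decision table, checked top to bottom."""
    if small_model:
        return "CaP"  # TaP was unreliable for 7-8B models in the benchmark
    if latency_budget_s is not None and latency_budget_s < 30:
        return "CaP"  # hard latency limit rules out multi-round tool calling
    if complex_task or dynamic_env:
        return "TaP"  # feedback at every step allows recovery
    return "CaP"  # simple task: one call is enough and faster


print(choose_mode(small_model=True, complex_task=True))
print(choose_mode(small_model=False, dynamic_env=True))
print(choose_mode(small_model=False, complex_task=True, latency_budget_s=25))
```

Putting the model-size check first reflects the benchmark's strongest signal: for small models the choice is made before task complexity even enters the picture.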

Why Does Claude-4.1-Opus Lead Both Modes?

This is the most interesting takeaway from the benchmark. Usually, a model strong at code generation (CaP) is not necessarily strong at iterative tool-calling (TaP), and vice versa. Yet Opus tops both.

The paper attributes this to Opus's extremely precise instruction following — when generating Python code, it adheres correctly to API contracts (correct signatures, correct types, correct edge case handling). When using TaP, it also parses tool call schemas accurately without hallucinating non-existent parameters. Both modes demand high precision — and that is Opus's core strength.

This also explains why Mistral-7B achieves only 8.3% in CaP: the model lacks the ability to generate syntactically valid Python that respects the strict API constraints.

Connecting to the Broader Pipeline

The previous post on Perception Agent showed how Florence-v2 + AnyGrasp generates 3D coordinates for objects in the scene. Those coordinates are precisely the input to compute_grasp() and get_pose() in the ALRM REST API.

The next post on SAP Verifier will explain how to add a verification layer before the Executor acts — particularly important in CaP mode where generated code must be validated before running on a real robot.

Summary

ALRM uses a 3-layer architecture: Task Planner (ReAct) → Task Executor (CaP/TaP) → REST API Server
CaP generates complete Python in 1 LLM call → fast, deterministic, works with small models
TaP calls one tool at a time, receives observations, decides next step → slower, adaptive, requires large models
Claude-4.1-Opus leads both modes: 92.6% CaP (33.4s) and 93.5% TaP
Falcon-H1-7B is the open-source champion: 84.3% CaP at 24.9s latency — matching DeepSeek-V3.1
Mode selection rule: Open-source/latency-critical → CaP. Large model/complex task → TaP.

Nguyễn Anh Tuấn
