Perception Agent: Florence-v2 + AnyGrasp | AI Manipulation Agents #2

A robot looks at a table and sees a coffee cup. Simple question: "where is that cup in 3D space, and how do I grip it?" But answering this accurately enough for the robot to actually pick it up requires solving two completely different hard problems simultaneously: natural language understanding and 3D spatial geometry.

This is precisely why the previous article (Why Multi-Agent Beats Stand-Alone VLA?) showed that isolating perception into a dedicated agent is the right call. This article digs inside the Perception Agent of ManiAgent — the architecture that achieved 86.8% on SimplerEnv — to see what it does, how it works, and how you can build a similar module from scratch.

Series Roadmap: AI Agent Pipeline for Robot Manipulation

#	Article	Status
1	Why Multi-Agent Beats VLA?	✅ Published
2	Perception Agent: Florence-v2 + AnyGrasp ← you are here	📍
3	ALRM, CAP vs TAP: Three Planning Paradigms	🔜
4	SAP Verifier: Self-Verification in the Pipeline	🔜
5	Sim-to-Real Deployment	🔜

What Does the Perception Agent Do?

Think of the Perception Agent as a warehouse receiving clerk: it takes a request ("get me the red box"), locates the exact item, determines where it sits on the shelf, and reports back with the location and best approach angle. Only then can the delivery team (Reasoning Agent + Action Agent) know where to go.

Technically, ManiAgent's Perception Agent receives:

RGB image from camera (RealSense D435 in the original paper)
Depth image (per-pixel distance in meters)
List of target object names (from Reasoning Agent)

And returns a spatial_info dictionary containing:

object_label — the detected object name
position — 3D center coordinates in the robot base frame [x, y, z] (meters)
grasp_pose — 6-DoF grasp: 3×3 rotation matrix + 3D translation + quality score

To accomplish this, the pipeline uses two models in sequence: Florence-v2 for the language→image part, AnyGrasp for the geometry→pose part.

Perception Agent pipeline — source: ManiAgent paper (arXiv:2510.11660)

Florence-v2: Eyes That Understand Language

The Problem with Traditional Object Detection

YOLO, Faster R-CNN, Detectron2 — all require a fixed list of classes defined at training time. Want to detect "a white coffee cup with a handle"? You need a dataset with that label, a training loop, then a model. In real-world robot manipulation, you simply cannot train a model for every object a user might request.

Open-vocabulary detection solves this: the model takes an image + any object name → localizes it. No fine-tuning, no new dataset required.

Florence-2 Architecture

Florence-2 (arXiv:2311.06242, Microsoft Research, CVPR 2024) uses a sequence-to-sequence architecture:

Backbone: DaViT (Dual-attention Vision Transformer) — encodes images into visual token embeddings
Encoder-decoder: standard transformer processing both visual embeddings and text prompts
Training: FLD-5B dataset — 5.4 billion annotations across 126 million images, covering bounding boxes, segmentation masks, captions, and grounding

The key insight: Florence-2 doesn't learn to "recognize classes" — it learns to translate language prompts into spatial coordinates. When you send <OPEN_VOCABULARY_DETECTION>coffee cup, the model translates that into a bounding box. Similar to how a person understands "point me to the cup" without needing a pre-defined "cup" class.

Florence-2 Task Prompts

Florence-2 uses a task prompt system to select output type:

Prompt	Task
`<OD>`	Detect all objects in the image (auto-naming)
`<OPEN_VOCABULARY_DETECTION>`	Detect objects by any name you specify
`<CAPTION_TO_PHRASE_GROUNDING>`	Link phrases in a caption to image regions
`<DENSE_REGION_CAPTION>`	Describe each region in the image in detail

For the Perception Agent, we use <OPEN_VOCABULARY_DETECTION> since it takes an object name from the Reasoning Agent and finds exactly that object.

Code: Load and Run Florence-v2

import torch
from PIL import Image
from transformers import AutoProcessor, Florence2ForConditionalGeneration

# Load model (once at startup)
MODEL_ID = "microsoft/Florence-2-large"

model = Florence2ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cuda"
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def detect_object(rgb_image: Image.Image, object_name: str) -> dict:
    """
    Detect an object by name, return bounding box and label.
    
    Args:
        rgb_image: PIL Image (RGB)
        object_name: target object name, e.g. "coffee cup"
    
    Returns:
        {'bboxes': [[x1, y1, x2, y2]], 'bboxes_labels': ['coffee cup']}
    """
    task_prompt = "<OPEN_VOCABULARY_DETECTION>"

    inputs = processor(
        text=task_prompt,
        images=rgb_image,
        text_input=object_name,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False
        )

    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]

    parsed = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=rgb_image.size  # (width, height)
    )
    return parsed.get(task_prompt, {})

Output format:

{
    'bboxes': [[156.2, 203.4, 289.1, 341.7]],  # [x1, y1, x2, y2] pixels
    'bboxes_labels': ['coffee cup']
}

From Pixel to 3D Space

Florence-v2 gives us a bounding box in pixel coordinates — but the robot needs the object's location in real 3D space. This requires back-projection using the camera model.

Pinhole Camera Model

A camera is like a pinhole — light from a 3D point (X, Y, Z) passes through the pinhole and hits the sensor at pixel (u, v):

u = fx * (X / Z) + cx
v = fy * (Y / Z) + cy

Where:

fx, fy: focal length in pixels (typically 600–900 for RealSense D435)
cx, cy: optical center (roughly at image center)
Z: depth (distance from camera to object)

Inverting to find 3D from pixel + depth:

X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy
Z = depth[v, u]   # from depth image

Code: Pixel → 3D in Robot Base Frame

import numpy as np

def pixel_to_3d(
    u: float, v: float,
    depth_image: np.ndarray,
    camera_intrinsics: dict,
    T_base_cam: np.ndarray
) -> np.ndarray:
    """
    Convert pixel coordinates to 3D in robot base frame.

    Args:
        u, v: pixel coordinates (can be float, e.g. bounding box center)
        depth_image: depth image (H, W), in meters
        camera_intrinsics: {'fx': ..., 'fy': ..., 'cx': ..., 'cy': ...}
        T_base_cam: 4x4 transform matrix camera → robot base

    Returns:
        3D coordinates [x, y, z] in robot base frame (meters)
    """
    fx = camera_intrinsics['fx']
    fy = camera_intrinsics['fy']
    cx = camera_intrinsics['cx']
    cy = camera_intrinsics['cy']

    # Get depth at (u, v) — use 5x5 patch median to reduce noise
    v_int, u_int = int(round(v)), int(round(u))
    v0, v1 = max(0, v_int - 2), min(depth_image.shape[0], v_int + 3)
    u0, u1 = max(0, u_int - 2), min(depth_image.shape[1], u_int + 3)
    patch = depth_image[v0:v1, u0:u1]
    valid = patch[patch > 0.1]
    if len(valid) == 0:
        raise ValueError(f"Invalid depth at pixel ({u}, {v})")
    Z = float(np.median(valid))

    # Back-project to camera frame
    X_cam = (u - cx) * Z / fx
    Y_cam = (v - cy) * Z / fy
    point_cam = np.array([X_cam, Y_cam, Z, 1.0])

    # Transform to robot base frame
    point_base = T_base_cam @ point_cam
    return point_base[:3]  # [x, y, z]

AnyGrasp: Intelligent Gripping from Point Clouds

What Is a 6-DoF Grasp?

When you pick up a coffee cup, you don't just know "place hand here" — you also know the approach angle, finger orientation, and how wide to open your hand. That's 6 degrees of freedom:

3 DoF position: x, y, z — grasp center
3 DoF orientation: pitch, yaw, roll — approach direction

For a 2-finger gripper (like Robotiq 2F-85), a grasp pose is represented as:

Rotation matrix R (3×3): gripper axes in space
Translation vector t (3D): center point between fingers
Width: gripper opening distance (meters)
Score: quality metric (higher is better)

AnyGrasp: Dense Prediction Over Sampling

Traditional grasp detection (GPD, GQ-CNN) uses sampling — propose many random grasp candidates, score each one. This is slow and misses many good grasps.

AnyGrasp (Fang et al., IEEE T-RO 2023) uses dense prediction: for every point in the point cloud, the model directly predicts the probability it could be a good grasp point (the graspness score). For points with high graspness, the model simultaneously generates all 6-DoF grasp parameters.

Think of it this way: AnyGrasp scans the entire scene at once (dense), rather than checking one corner at a time (sampling).

Result: AnyGrasp can process an entire scene with multiple objects, propose hundreds of grasp candidates, and you pick the one that matches your target object.

Code: Calling AnyGrasp SDK

import numpy as np
from gsnet import AnyGrasp  # installed from anygrasp_sdk

# Initialize model (once at startup)
cfgs = {
    'checkpoint_path': 'log/checkpoint_detection.tar',
    'max_gripper_width': 0.1,    # meters — matches Robotiq 2F-85
    'gripper_height': 0.03,
    'top_down_grasp': False,     # allow grasps from any direction
    'debug': False
}
anygrasp = AnyGrasp(cfgs)
anygrasp.load_net()

def detect_grasps(
    rgb_image: np.ndarray,    # (H, W, 3) uint8
    depth_image: np.ndarray,  # (H, W) float32, meters
    camera_intrinsics: dict,
    workspace_lims: list      # [xmin, xmax, ymin, ymax, zmin, zmax] camera frame
) -> object:
    """
    Generate grasp candidates from RGB-D point cloud.

    Returns:
        GraspGroup after NMS and sorted by score (best first)
    """
    fx = camera_intrinsics['fx']
    fy = camera_intrinsics['fy']
    cx = camera_intrinsics['cx']
    cy = camera_intrinsics['cy']

    # Build point cloud from depth image
    H, W = depth_image.shape
    u_grid, v_grid = np.meshgrid(np.arange(W), np.arange(H))

    Z = depth_image.flatten()
    X = ((u_grid.flatten() - cx) * Z / fx)
    Y = ((v_grid.flatten() - cy) * Z / fy)

    # Filter invalid depth
    valid_mask = (Z > 0.1) & (Z < 2.0) & np.isfinite(Z)
    points = np.stack([X, Y, Z], axis=-1)[valid_mask].astype(np.float32)
    colors = (rgb_image.reshape(-1, 3)[valid_mask] / 255.0).astype(np.float32)

    # Run AnyGrasp
    gg, cloud = anygrasp.get_grasp(
        points, colors,
        lims=workspace_lims,
        apply_object_mask=True,    # filter background
        dense_grasp=False,         # True for more candidates (slower)
        collision_detection=True   # avoid grasps that collide with table/shelf
    )

    if gg is None or len(gg) == 0:
        return None

    # Non-Maximum Suppression + sort by score descending
    gg = gg.nms().sort_by_score()
    return gg

def select_best_grasp_near_object(
    grasp_group,
    object_position_3d: np.ndarray,  # [x, y, z] in camera frame
    max_distance: float = 0.05        # 5cm tolerance
):
    """
    Select the highest-scoring grasp within proximity of the target object.
    """
    if grasp_group is None:
        return None

    best_grasp = None
    best_score = -1.0

    for grasp in grasp_group:
        dist = np.linalg.norm(grasp.translation - object_position_3d)
        if dist <= max_distance and grasp.score > best_score:
            best_score = grasp.score
            best_grasp = grasp

    return best_grasp

Packaging the Output: spatial_info Dict

ManiAgent overall architecture: three agents working together — source: ManiAgent paper (arXiv:2510.11660)

Framework architecture: Perception Agent → Reasoning Agent → Action Agent

After collecting all information from Florence-v2 and AnyGrasp, the Perception Agent packages everything into spatial_info and passes it to the Reasoning Agent:

import numpy as np
from PIL import Image

def run_perception_agent(
    rgb_image: np.ndarray,
    depth_image: np.ndarray,
    camera_intrinsics: dict,
    T_base_cam: np.ndarray,
    object_name: str,
    workspace_lims: list
) -> dict:
    """
    Full Perception Agent: RGB-D + object name → spatial_info.

    Args:
        rgb_image: numpy array (H, W, 3) uint8
        depth_image: numpy array (H, W) float32, meters
        camera_intrinsics: dict with 'fx', 'fy', 'cx', 'cy'
        T_base_cam: 4x4 float64 matrix (camera → robot base)
        object_name: target object name, e.g. "coffee cup"
        workspace_lims: [xmin, xmax, ymin, ymax, zmin, zmax] camera frame (m)

    Returns:
        spatial_info dict or None if object/grasp not found
    """
    # --- Step 1: Florence-v2 detects the object ---
    pil_rgb = Image.fromarray(rgb_image)
    detection = detect_object(pil_rgb, object_name)

    if not detection.get('bboxes'):
        print(f"[Perception] '{object_name}' not found in image")
        return None

    # --- Step 2: Convert bounding box center to 3D ---
    bbox = detection['bboxes'][0]
    x1, y1, x2, y2 = bbox
    u_center = (x1 + x2) / 2.0
    v_center = (y1 + y2) / 2.0

    # 3D in camera frame (for AnyGrasp proximity lookup)
    obj_pos_cam = pixel_to_3d(
        u_center, v_center, depth_image, camera_intrinsics,
        T_base_cam=np.eye(4)  # identity → camera frame
    )
    # 3D in robot base frame (for Reasoning Agent)
    obj_pos_base = pixel_to_3d(
        u_center, v_center, depth_image, camera_intrinsics, T_base_cam
    )

    # --- Step 3: AnyGrasp generates grasp candidates ---
    grasp_group = detect_grasps(
        rgb_image, depth_image, camera_intrinsics, workspace_lims
    )

    # --- Step 4: Select best grasp near the detected object ---
    best_grasp = select_best_grasp_near_object(
        grasp_group, obj_pos_cam, max_distance=0.05
    )

    if best_grasp is None:
        print(f"[Perception] No suitable grasp found for '{object_name}'")
        return None

    # Transform grasp from camera frame to robot base frame
    R_cam = best_grasp.rotation_matrix   # 3x3
    t_cam = best_grasp.translation        # 3D
    t_base = (T_base_cam @ np.append(t_cam, 1.0))[:3]
    R_base = T_base_cam[:3, :3] @ R_cam

    # --- Step 5: Package spatial_info ---
    spatial_info = {
        "object_label": detection['bboxes_labels'][0],
        "position": obj_pos_base.tolist(),          # [x, y, z] meters, base frame
        "grasp_pose": {
            "rotation": R_base.tolist(),            # 3x3 list of lists
            "translation": t_base.tolist(),         # [x, y, z] meters, base frame
            "score": float(best_grasp.score),       # 0.0–1.0
            "width": float(best_grasp.width)        # meters
        }
    }

    print(f"[Perception] Found '{object_name}' at {obj_pos_base.round(3)}")
    print(f"[Perception] Grasp score: {best_grasp.score:.3f}, width: {best_grasp.width:.3f}m")

    return spatial_info

Example real-world output:

{
    "object_label": "coffee cup",
    "position": [0.312, -0.087, 0.142],    # 31.2cm forward, 8.7cm left, 14.2cm up
    "grasp_pose": {
        "rotation": [
            [0.998, -0.043, 0.042],
            [0.041,  0.999, 0.024],
            [-0.043, -0.022, 0.999]
        ],
        "translation": [0.308, -0.085, 0.178],   # midpoint between fingers
        "score": 0.847,
        "width": 0.072                            # 7.2cm — fits a standard mug
    }
}

System Requirements

Dependencies

# Florence-2
pip install transformers>=4.41.0 torch torchvision Pillow

# AnyGrasp SDK (requires license key from graspnet.net)
# See: https://github.com/graspnet/anygrasp_sdk
pip install MinkowskiEngine  # build from source for your CUDA version
# Then clone anygrasp_sdk and install gsnet wheel

# Point cloud + visualization
pip install open3d numpy

Hardware Requirements

Component	Minimum	Recommended
GPU VRAM	8 GB (Florence-2-base)	16 GB (Florence-2-large)
CUDA	11.x	12.x
RAM	16 GB	32 GB
Camera	Any RGB-D	Intel RealSense D435

Note: Florence-2 and AnyGrasp can share one GPU if loaded/unloaded sequentially. In production, two separate GPUs reduce latency significantly.

Latency Benchmarks (RTX 4090)

Step	Time
Florence-2-large inference	~180ms
Pixel → 3D conversion	<5ms
Build point cloud	~20ms
AnyGrasp inference	~150ms
Total	~360ms

360ms per perception cycle is well within the 3–5 Hz control frequency of most manipulation tasks.

Common Pitfalls

1. Depth Noise at Object Edges

The RealSense D435 (and most depth cameras) produces flying pixels at object edges — unstable depth values where the sensor straddles two surfaces at different depths. The bounding box center sometimes falls on these.

Fix: Use the median of a 5×5 patch around the bounding box center instead of a single pixel.

2. AnyGrasp Returns No Grasps

Happens when the object is too small, transparent (glass), or workspace_lims is too tight. Debug with:

print(f"Total grasps found: {len(gg) if gg else 0}")
# If 0: widen workspace_lims or disable collision_detection

3. Florence-v2 Picks the Wrong Object (Multiple Similar Items)

If the table has three cups, Florence-v2 may return multiple bounding boxes. Use the largest box (closest to camera), or add spatial context: "red coffee cup on the left".

4. Wrong Extrinsic Matrix

The most common deployment error: 3D coordinates are relatively correct but have a fixed absolute offset. Validate by placing an object at a known position:

# Place object exactly 30cm in front of robot base
print(f"Measured: {obj_pos_base}, Expected: [0.30, 0.00, ...]")

Summary

The Perception Agent in ManiAgent is an elegant combination of two powerful models:

Florence-v2 solves "name → location" with open-vocabulary detection — no fine-tuning needed for new object types
AnyGrasp solves "point cloud → how to grip" with dense prediction across the entire scene
Camera back-projection (pixel + depth → 3D) bridges the two worlds

The spatial_info dictionary is the common language that the Perception Agent uses to communicate with the Reasoning and Action Agents. This standardized format is what makes the pipeline modular: you can swap Florence-v2 for Grounding DINO, or swap AnyGrasp for Contact-GraspNet — as long as the output is still a spatial_info dict with the same schema.

In the next article, we'll see how the Reasoning Agent receives spatial_info and plans what to do — comparing three paradigms: ALRM, CAP, and TAP, and which one works best for complex manipulation tasks.

For deeper background on VLA foundation models that ManiAgent benchmarks against, see VLA Models: From RT-2 to OpenVLA and Diffusion Policy: Learning from Demonstrations.

Series Roadmap: AI Agent Pipeline for Robot Manipulation

#	Article	Status
1	Why Multi-Agent Beats VLA?	✅ Published
2	Perception Agent: Florence-v2 + AnyGrasp ← you are here	📍
3	ALRM, CAP vs TAP: Three Planning Paradigms	🔜
4	SAP Verifier: Self-Verification in the Pipeline	🔜
5	Sim-to-Real Deployment	🔜

What Does the Perception Agent Do?

Technically, ManiAgent's Perception Agent receives:

RGB image from camera (RealSense D435 in the original paper)
Depth image (per-pixel distance in meters)
List of target object names (from Reasoning Agent)

And returns a spatial_info dictionary containing:

object_label — the detected object name
position — 3D center coordinates in the robot base frame [x, y, z] (meters)
grasp_pose — 6-DoF grasp: 3×3 rotation matrix + 3D translation + quality score

To accomplish this, the pipeline uses two models in sequence: Florence-v2 for the language→image part, AnyGrasp for the geometry→pose part.

Perception Agent pipeline — source: ManiAgent paper (arXiv:2510.11660)

Florence-v2: Eyes That Understand Language

The Problem with Traditional Object Detection

Open-vocabulary detection solves this: the model takes an image + any object name → localizes it. No fine-tuning, no new dataset required.

Florence-2 Architecture

Florence-2 (arXiv:2311.06242, Microsoft Research, CVPR 2024) uses a sequence-to-sequence architecture:

Backbone: DaViT (Dual-attention Vision Transformer) — encodes images into visual token embeddings
Encoder-decoder: standard transformer processing both visual embeddings and text prompts
Training: FLD-5B dataset — 5.4 billion annotations across 126 million images, covering bounding boxes, segmentation masks, captions, and grounding

Florence-2 Task Prompts

Florence-2 uses a task prompt system to select output type:

Prompt	Task
`<OD>`	Detect all objects in the image (auto-naming)
`<OPEN_VOCABULARY_DETECTION>`	Detect objects by any name you specify
`<CAPTION_TO_PHRASE_GROUNDING>`	Link phrases in a caption to image regions
`<DENSE_REGION_CAPTION>`	Describe each region in the image in detail

For the Perception Agent, we use <OPEN_VOCABULARY_DETECTION> since it takes an object name from the Reasoning Agent and finds exactly that object.

Code: Load and Run Florence-v2

import torch
from PIL import Image
from transformers import AutoProcessor, Florence2ForConditionalGeneration

# Load model (once at startup)
MODEL_ID = "microsoft/Florence-2-large"

model = Florence2ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cuda"
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def detect_object(rgb_image: Image.Image, object_name: str) -> dict:
    """
    Detect an object by name, return bounding box and label.
    
    Args:
        rgb_image: PIL Image (RGB)
        object_name: target object name, e.g. "coffee cup"
    
    Returns:
        {'bboxes': [[x1, y1, x2, y2]], 'bboxes_labels': ['coffee cup']}
    """
    task_prompt = "<OPEN_VOCABULARY_DETECTION>"

    inputs = processor(
        text=task_prompt,
        images=rgb_image,
        text_input=object_name,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False
        )

    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]

    parsed = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=rgb_image.size  # (width, height)
    )
    return parsed.get(task_prompt, {})

Output format:

{
    'bboxes': [[156.2, 203.4, 289.1, 341.7]],  # [x1, y1, x2, y2] pixels
    'bboxes_labels': ['coffee cup']
}

From Pixel to 3D Space

Florence-v2 gives us a bounding box in pixel coordinates — but the robot needs the object's location in real 3D space. This requires back-projection using the camera model.

Pinhole Camera Model

A camera is like a pinhole — light from a 3D point (X, Y, Z) passes through the pinhole and hits the sensor at pixel (u, v):

u = fx * (X / Z) + cx
v = fy * (Y / Z) + cy

Where:

fx, fy: focal length in pixels (typically 600–900 for RealSense D435)
cx, cy: optical center (roughly at image center)
Z: depth (distance from camera to object)

Inverting to find 3D from pixel + depth:

X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy
Z = depth[v, u]   # from depth image

Code: Pixel → 3D in Robot Base Frame

import numpy as np

def pixel_to_3d(
    u: float, v: float,
    depth_image: np.ndarray,
    camera_intrinsics: dict,
    T_base_cam: np.ndarray
) -> np.ndarray:
    """
    Convert pixel coordinates to 3D in robot base frame.

    Args:
        u, v: pixel coordinates (can be float, e.g. bounding box center)
        depth_image: depth image (H, W), in meters
        camera_intrinsics: {'fx': ..., 'fy': ..., 'cx': ..., 'cy': ...}
        T_base_cam: 4x4 transform matrix camera → robot base

    Returns:
        3D coordinates [x, y, z] in robot base frame (meters)
    """
    fx = camera_intrinsics['fx']
    fy = camera_intrinsics['fy']
    cx = camera_intrinsics['cx']
    cy = camera_intrinsics['cy']

    # Get depth at (u, v) — use 5x5 patch median to reduce noise
    v_int, u_int = int(round(v)), int(round(u))
    v0, v1 = max(0, v_int - 2), min(depth_image.shape[0], v_int + 3)
    u0, u1 = max(0, u_int - 2), min(depth_image.shape[1], u_int + 3)
    patch = depth_image[v0:v1, u0:u1]
    valid = patch[patch > 0.1]
    if len(valid) == 0:
        raise ValueError(f"Invalid depth at pixel ({u}, {v})")
    Z = float(np.median(valid))

    # Back-project to camera frame
    X_cam = (u - cx) * Z / fx
    Y_cam = (v - cy) * Z / fy
    point_cam = np.array([X_cam, Y_cam, Z, 1.0])

    # Transform to robot base frame
    point_base = T_base_cam @ point_cam
    return point_base[:3]  # [x, y, z]

AnyGrasp: Intelligent Gripping from Point Clouds

What Is a 6-DoF Grasp?

When you pick up a coffee cup, you don't just know "place hand here" — you also know the approach angle, finger orientation, and how wide to open your hand. That's 6 degrees of freedom:

3 DoF position: x, y, z — grasp center
3 DoF orientation: pitch, yaw, roll — approach direction

For a 2-finger gripper (like Robotiq 2F-85), a grasp pose is represented as:

Rotation matrix R (3×3): gripper axes in space
Translation vector t (3D): center point between fingers
Width: gripper opening distance (meters)
Score: quality metric (higher is better)

AnyGrasp: Dense Prediction Over Sampling

Traditional grasp detection (GPD, GQ-CNN) uses sampling — propose many random grasp candidates, score each one. This is slow and misses many good grasps.

Think of it this way: AnyGrasp scans the entire scene at once (dense), rather than checking one corner at a time (sampling).

Result: AnyGrasp can process an entire scene with multiple objects, propose hundreds of grasp candidates, and you pick the one that matches your target object.

Code: Calling AnyGrasp SDK

import numpy as np
from gsnet import AnyGrasp  # installed from anygrasp_sdk

# Initialize model (once at startup)
cfgs = {
    'checkpoint_path': 'log/checkpoint_detection.tar',
    'max_gripper_width': 0.1,    # meters — matches Robotiq 2F-85
    'gripper_height': 0.03,
    'top_down_grasp': False,     # allow grasps from any direction
    'debug': False
}
anygrasp = AnyGrasp(cfgs)
anygrasp.load_net()

def detect_grasps(
    rgb_image: np.ndarray,    # (H, W, 3) uint8
    depth_image: np.ndarray,  # (H, W) float32, meters
    camera_intrinsics: dict,
    workspace_lims: list      # [xmin, xmax, ymin, ymax, zmin, zmax] camera frame
) -> object:
    """
    Generate grasp candidates from RGB-D point cloud.

    Returns:
        GraspGroup after NMS and sorted by score (best first)
    """
    fx = camera_intrinsics['fx']
    fy = camera_intrinsics['fy']
    cx = camera_intrinsics['cx']
    cy = camera_intrinsics['cy']

    # Build point cloud from depth image
    H, W = depth_image.shape
    u_grid, v_grid = np.meshgrid(np.arange(W), np.arange(H))

    Z = depth_image.flatten()
    X = ((u_grid.flatten() - cx) * Z / fx)
    Y = ((v_grid.flatten() - cy) * Z / fy)

    # Filter invalid depth
    valid_mask = (Z > 0.1) & (Z < 2.0) & np.isfinite(Z)
    points = np.stack([X, Y, Z], axis=-1)[valid_mask].astype(np.float32)
    colors = (rgb_image.reshape(-1, 3)[valid_mask] / 255.0).astype(np.float32)

    # Run AnyGrasp
    gg, cloud = anygrasp.get_grasp(
        points, colors,
        lims=workspace_lims,
        apply_object_mask=True,    # filter background
        dense_grasp=False,         # True for more candidates (slower)
        collision_detection=True   # avoid grasps that collide with table/shelf
    )

    if gg is None or len(gg) == 0:
        return None

    # Non-Maximum Suppression + sort by score descending
    gg = gg.nms().sort_by_score()
    return gg

def select_best_grasp_near_object(
    grasp_group,
    object_position_3d: np.ndarray,  # [x, y, z] in camera frame
    max_distance: float = 0.05        # 5cm tolerance
):
    """
    Select the highest-scoring grasp within proximity of the target object.
    """
    if grasp_group is None:
        return None

    best_grasp = None
    best_score = -1.0

    for grasp in grasp_group:
        dist = np.linalg.norm(grasp.translation - object_position_3d)
        if dist <= max_distance and grasp.score > best_score:
            best_score = grasp.score
            best_grasp = grasp

    return best_grasp

Packaging the Output: spatial_info Dict

ManiAgent overall architecture: three agents working together — source: ManiAgent paper (arXiv:2510.11660)

Framework architecture: Perception Agent → Reasoning Agent → Action Agent

After collecting all information from Florence-v2 and AnyGrasp, the Perception Agent packages everything into spatial_info and passes it to the Reasoning Agent:

import numpy as np
from PIL import Image

def run_perception_agent(
    rgb_image: np.ndarray,
    depth_image: np.ndarray,
    camera_intrinsics: dict,
    T_base_cam: np.ndarray,
    object_name: str,
    workspace_lims: list
) -> dict:
    """
    Full Perception Agent: RGB-D + object name → spatial_info.

    Args:
        rgb_image: numpy array (H, W, 3) uint8
        depth_image: numpy array (H, W) float32, meters
        camera_intrinsics: dict with 'fx', 'fy', 'cx', 'cy'
        T_base_cam: 4x4 float64 matrix (camera → robot base)
        object_name: target object name, e.g. "coffee cup"
        workspace_lims: [xmin, xmax, ymin, ymax, zmin, zmax] camera frame (m)

    Returns:
        spatial_info dict or None if object/grasp not found
    """
    # --- Step 1: Florence-v2 detects the object ---
    pil_rgb = Image.fromarray(rgb_image)
    detection = detect_object(pil_rgb, object_name)

    if not detection.get('bboxes'):
        print(f"[Perception] '{object_name}' not found in image")
        return None

    # --- Step 2: Convert bounding box center to 3D ---
    bbox = detection['bboxes'][0]
    x1, y1, x2, y2 = bbox
    u_center = (x1 + x2) / 2.0
    v_center = (y1 + y2) / 2.0

    # 3D in camera frame (for AnyGrasp proximity lookup)
    obj_pos_cam = pixel_to_3d(
        u_center, v_center, depth_image, camera_intrinsics,
        T_base_cam=np.eye(4)  # identity → camera frame
    )
    # 3D in robot base frame (for Reasoning Agent)
    obj_pos_base = pixel_to_3d(
        u_center, v_center, depth_image, camera_intrinsics, T_base_cam
    )

    # --- Step 3: AnyGrasp generates grasp candidates ---
    grasp_group = detect_grasps(
        rgb_image, depth_image, camera_intrinsics, workspace_lims
    )

    # --- Step 4: Select best grasp near the detected object ---
    best_grasp = select_best_grasp_near_object(
        grasp_group, obj_pos_cam, max_distance=0.05
    )

    if best_grasp is None:
        print(f"[Perception] No suitable grasp found for '{object_name}'")
        return None

    # Transform grasp from camera frame to robot base frame
    R_cam = best_grasp.rotation_matrix   # 3x3
    t_cam = best_grasp.translation        # 3D
    t_base = (T_base_cam @ np.append(t_cam, 1.0))[:3]
    R_base = T_base_cam[:3, :3] @ R_cam

    # --- Step 5: Package spatial_info ---
    spatial_info = {
        "object_label": detection['bboxes_labels'][0],
        "position": obj_pos_base.tolist(),          # [x, y, z] meters, base frame
        "grasp_pose": {
            "rotation": R_base.tolist(),            # 3x3 list of lists
            "translation": t_base.tolist(),         # [x, y, z] meters, base frame
            "score": float(best_grasp.score),       # 0.0–1.0
            "width": float(best_grasp.width)        # meters
        }
    }

    print(f"[Perception] Found '{object_name}' at {obj_pos_base.round(3)}")
    print(f"[Perception] Grasp score: {best_grasp.score:.3f}, width: {best_grasp.width:.3f}m")

    return spatial_info

Example real-world output:

{
    "object_label": "coffee cup",
    "position": [0.312, -0.087, 0.142],    # 31.2cm forward, 8.7cm left, 14.2cm up
    "grasp_pose": {
        "rotation": [
            [0.998, -0.043, 0.042],
            [0.041,  0.999, 0.024],
            [-0.043, -0.022, 0.999]
        ],
        "translation": [0.308, -0.085, 0.178],   # midpoint between fingers
        "score": 0.847,
        "width": 0.072                            # 7.2cm — fits a standard mug
    }
}

System Requirements

Dependencies

# Florence-2
pip install transformers>=4.41.0 torch torchvision Pillow

# AnyGrasp SDK (requires license key from graspnet.net)
# See: https://github.com/graspnet/anygrasp_sdk
pip install MinkowskiEngine  # build from source for your CUDA version
# Then clone anygrasp_sdk and install gsnet wheel

# Point cloud + visualization
pip install open3d numpy

Hardware Requirements

Component	Minimum	Recommended
GPU VRAM	8 GB (Florence-2-base)	16 GB (Florence-2-large)
CUDA	11.x	12.x
RAM	16 GB	32 GB
Camera	Any RGB-D	Intel RealSense D435

Note: Florence-2 and AnyGrasp can share one GPU if loaded/unloaded sequentially. In production, two separate GPUs reduce latency significantly.

Latency Benchmarks (RTX 4090)

Step	Time
Florence-2-large inference	~180ms
Pixel → 3D conversion	<5ms
Build point cloud	~20ms
AnyGrasp inference	~150ms
Total	~360ms

360ms per perception cycle is well within the 3–5 Hz control frequency of most manipulation tasks.

Common Pitfalls

1. Depth Noise at Object Edges

Fix: Use the median of a 5×5 patch around the bounding box center instead of a single pixel.

2. AnyGrasp Returns No Grasps

Happens when the object is too small, transparent (glass), or workspace_lims is too tight. Debug with:

print(f"Total grasps found: {len(gg) if gg else 0}")
# If 0: widen workspace_lims or disable collision_detection

3. Florence-v2 Picks the Wrong Object (Multiple Similar Items)

If the table has three cups, Florence-v2 may return multiple bounding boxes. Use the largest box (closest to camera), or add spatial context: "red coffee cup on the left".

4. Wrong Extrinsic Matrix

The most common deployment error: 3D coordinates are relatively correct but have a fixed absolute offset. Validate by placing an object at a known position:

# Place object exactly 30cm in front of robot base
print(f"Measured: {obj_pos_base}, Expected: [0.30, 0.00, ...]")

Summary

The Perception Agent in ManiAgent is an elegant combination of two powerful models:

Florence-v2 solves "name → location" with open-vocabulary detection — no fine-tuning needed for new object types
AnyGrasp solves "point cloud → how to grip" with dense prediction across the entire scene
Camera back-projection (pixel + depth → 3D) bridges the two worlds

For deeper background on VLA foundation models that ManiAgent benchmarks against, see VLA Models: From RT-2 to OpenVLA and Diffusion Policy: Learning from Demonstrations.

Series Roadmap: AI Agent Pipeline for Robot Manipulation

What Does the Perception Agent Do?

Florence-v2: Eyes That Understand Language

The Problem with Traditional Object Detection

Florence-2 Architecture

Florence-2 Task Prompts

Code: Load and Run Florence-v2

From Pixel to 3D Space

Pinhole Camera Model

Code: Pixel → 3D in Robot Base Frame

AnyGrasp: Intelligent Gripping from Point Clouds

What Is a 6-DoF Grasp?

AnyGrasp: Dense Prediction Over Sampling

Code: Calling AnyGrasp SDK

Packaging the Output: spatial_info Dict

System Requirements

Dependencies

Hardware Requirements

Latency Benchmarks (RTX 4090)

Common Pitfalls

1. Depth Noise at Object Edges

2. AnyGrasp Returns No Grasps

3. Florence-v2 Picks the Wrong Object (Multiple Similar Items)

4. Wrong Extrinsic Matrix

Summary

Related Posts

Nguyễn Anh Tuấn

Related Posts

Tại sao Multi-Agent đánh bại VLA đơn thuần? | AI Manipulation Agents #1

Agentic Robot: SAP Protocol + Temporal Verifier

Sim-to-Real Deploy: Đưa SAP Pipeline từ LIBERO ra Robot Thật | AI Manipulation Agents #5

Series Roadmap: AI Agent Pipeline for Robot Manipulation

What Does the Perception Agent Do?

Florence-v2: Eyes That Understand Language

The Problem with Traditional Object Detection

Florence-2 Architecture

Florence-2 Task Prompts

Code: Load and Run Florence-v2

From Pixel to 3D Space

Pinhole Camera Model

Code: Pixel → 3D in Robot Base Frame

AnyGrasp: Intelligent Gripping from Point Clouds

What Is a 6-DoF Grasp?

AnyGrasp: Dense Prediction Over Sampling

Code: Calling AnyGrasp SDK

Packaging the Output: spatial_info Dict

System Requirements

Dependencies

Hardware Requirements

Latency Benchmarks (RTX 4090)

Common Pitfalls

1. Depth Noise at Object Edges

2. AnyGrasp Returns No Grasps

3. Florence-v2 Picks the Wrong Object (Multiple Similar Items)

4. Wrong Extrinsic Matrix

Summary

Related Posts

Nguyễn Anh Tuấn

Related Posts

Tại sao Multi-Agent đánh bại VLA đơn thuần? | AI Manipulation Agents #1

Agentic Robot: SAP Protocol + Temporal Verifier

Sim-to-Real Deploy: Đưa SAP Pipeline từ LIBERO ra Robot Thật | AI Manipulation Agents #5