3D Perception for Whole-Body Humanoids: Beyond-FOV

In Part 3, DP3 gave us a strong point-cloud policy — but it is single-arm, with a small workspace and a camera pointed straight at a table. When you move to a humanoid like the Unitree G1, AgiBot X2, or Fourier GR, the perception problem changes qualitatively.

A humanoid doesn't stand still grabbing objects off one table. It walks to a shelf, bends down to pick something off the floor, reaches behind its back, turns to place an object on a conveyor. Its workspace is an entire room, and most of what it needs to manipulate is outside the field-of-view (FOV) of its head camera. This article analyzes two 2026 humanoid papers that solve exactly that problem: Omni-Manip and Active Spatial Reasoning.

Omni-Manip: large-workspace manipulation beyond a fixed camera FOV — source: arXiv 2603.05355

Source: Omni-Manip (arXiv 2603.05355) — omnidirectional perception for large-workspace humanoids.

Why humanoid perception is fundamentally different

For a fixed arm, everything is tidy: a single RGB-D camera looking at a ~60×60cm workspace is enough. Every object to manipulate is in the frame, depth is stable, calibration is one-and-done.

A humanoid breaks all three assumptions:

The workspace is many times larger than the FOV. A G1 arm reaches ~70cm, but it moves through a room several meters wide. The head camera only sees a narrow cone in front.
Target objects are often out of view. When reaching down to pick something off the floor, the object disappears from the head camera the moment the hand approaches — exactly when you most need to see it (the occlusion problem from Part 1).
The whole body must avoid collisions. Legs, hips, elbows can all hit a shelf, a table, or the robot itself. You need the 3D geometry of the entire space around the robot, not just the region in front.

In other words: a robot arm needs to "see one point clearly," a humanoid needs to "understand the whole space around it." That is a shift from frame perception to panoramic perception.

Omni-Manip: beyond-FOV perception for large workspaces

Omni-Manip (Beyond-FOV Large-Workspace Humanoid Manipulation, arXiv 2603.05355, 2026) attacks the "object out of FOV" problem head-on. The core idea: instead of relying on a single forward-looking head camera, build an omnidirectional 3D representation of the environment around the robot — essentially geometric "360° vision."

Main components:

Panoramic / multi-sensor 3D perception. Combine multiple sources — head camera, wrist camera, and a wide-scanning LiDAR/RGB-D — into a point cloud covering a region larger than any single camera's FOV.
Persistent scene memory. When the robot turns or bends down, previously-seen parts of the scene are kept in geometric memory rather than vanishing when they leave the current frame. This is the "3D model in your head" humans have — and that 2D policies lack.
Large-workspace manipulation policy. On top of that omnidirectional 3D, the policy can plan grasp/place for objects anywhere in the extended workspace, even when the object is not in the camera frame at command time.

Why this matters for whole-body: with the 3D geometry of the whole region around the robot, the planner can check collisions for the whole body (legs + torso + arms) while reaching far — exactly what Part 1 showed is impossible with a single 2D image.

Active Spatial Reasoning: look actively, reason spatially

The second paper, Humanoid Whole-Body Manipulation via Active Spatial Reasoning (arXiv 2605.21133, May 2026), adds another piece: active perception and spatial reasoning.

Active Spatial Reasoning for whole-body manipulation — source: arXiv 2605.21133

Source: Active Spatial Reasoning (arXiv 2605.21133) — reasoning about spatial relations with little real-robot data.

The core difference from passive perception:

Active perception. Rather than only processing images from a fixed camera, the robot actively adjusts its viewpoint — turns its head, tilts its torso, moves — to "see" the part of space the task requires. Like a human tilting their head to look at an occluded object.
Spatial reasoning about relations. The policy reasons over relations like "the cup is to the left of the box," "grab the object behind the bottle," "place it on the higher shelf." This is 3D reasoning, not just 2D object detection.
Sample efficiency. The most valuable point: the paper achieves 3D spatial-aware whole-body manipulation with less real-robot data — a good spatial representation reduces the burden of learning everything from demonstrations.

Why this is a step forward: a 2D VLA must learn its "spatial intuition" implicitly from millions of frames. Inject explicit spatial reasoning into the pipeline and the model needs fewer demos for the same capability — a huge advantage when real-robot data is expensive (the topic of Part 6).

Comparing the two approaches

Criterion	Omni-Manip	Active Spatial Reasoning
Main problem solved	Out-of-FOV objects, large workspace	Spatial relation reasoning, occlusion
Perception mechanism	Omnidirectional 3D (passive, multi-sensor)	Active — actively changes viewpoint
Scene memory	Persistent geometric memory	Task-driven reasoning
Strength	Wide spatial coverage	Sample-efficient, less robot data
Best fit	Warehouse/factory large workspace	Tasks complex in spatial relations

These two directions are complementary, not mutually exclusive: Omni-Manip gives you a panoramic 3D map, while Active Spatial Reasoning gives you a way to use that map actively and data-efficiently. An ideal humanoid system uses both.

Fitting into the series' 3D pipeline

Looking at the big picture, humanoid perception is a layer that sits on top of what we've learned:

Robo3R (Part 2)       → metric 3D from RGB, per camera
        +
Omni-Manip            → fuse multiple views into omnidirectional 3D
        +
Active Spatial Reason → actively look + reason about spatial relations
        ↓
Panoramic 3D scene in robot frame
        ↓
Policy (DP3 / WholeBodyVLA) → actions for both hands + locomotion

DP3 from Part 3 is still a good policy for each hand manipulation, but it needs a clean point cloud of the right region. Omni-Manip and Active Spatial Reasoning are the perception layer that supplies that point cloud for a humanoid operating in a large space.

Notes for applying to the Unitree G1

Sensor placement. One head camera (RGB-D) is not enough for whole-body. In practice you want a wrist camera (close-up during grasp) and a wide-scanning sensor (LiDAR or wide-angle RGB-D) to build the out-of-FOV region.
Multi-sensor calibration. Every point cloud must be brought into the same robot base frame to be fused — this is the hardest engineering part, requiring precise extrinsic calibration between cameras and joints.
Active perception needs a coordinated controller. "Turn the head to look" means perception and motion control must talk to each other — they can't be decoupled as on a fixed arm.

Conclusion: from "see one point" to "understand the whole space"

The big lesson of this part: humanoid manipulation is a panoramic-perception problem, not a frame-perception problem. Omni-Manip solves "objects out of FOV" with omnidirectional 3D; Active Spatial Reasoning solves "spatial reasoning with little data" with active perception. Both build on the 3D geometry this whole series has pursued since Part 1.

With panoramic perception in hand, the next question is: how do we turn it into whole-body action — walking and manipulating at once? That is the topic of Part 5: WholeBodyVLA, learning loco-manipulation from egocentric video combined with RL.

Part 3: DP3 — Diffusion Policy with Point Clouds — A point-cloud policy fed by this perception layer
Part 5: WholeBodyVLA — Egocentric Video + RL — Turning perception into whole-body action
Part 1: Why 2D VLA Is Not Enough for Manipulation — Why occlusion & FOV break 2D images

Omni-Manip: large-workspace manipulation beyond a fixed camera FOV — source: arXiv 2603.05355

Source: Omni-Manip (arXiv 2603.05355) — omnidirectional perception for large-workspace humanoids.

Why humanoid perception is fundamentally different

For a fixed arm, everything is tidy: a single RGB-D camera looking at a ~60×60cm workspace is enough. Every object to manipulate is in the frame, depth is stable, calibration is one-and-done.

A humanoid breaks all three assumptions:

The workspace is many times larger than the FOV. A G1 arm reaches ~70cm, but it moves through a room several meters wide. The head camera only sees a narrow cone in front.
Target objects are often out of view. When reaching down to pick something off the floor, the object disappears from the head camera the moment the hand approaches — exactly when you most need to see it (the occlusion problem from Part 1).
The whole body must avoid collisions. Legs, hips, elbows can all hit a shelf, a table, or the robot itself. You need the 3D geometry of the entire space around the robot, not just the region in front.

In other words: a robot arm needs to "see one point clearly," a humanoid needs to "understand the whole space around it." That is a shift from frame perception to panoramic perception.

Omni-Manip: beyond-FOV perception for large workspaces

Main components:

Panoramic / multi-sensor 3D perception. Combine multiple sources — head camera, wrist camera, and a wide-scanning LiDAR/RGB-D — into a point cloud covering a region larger than any single camera's FOV.
Persistent scene memory. When the robot turns or bends down, previously-seen parts of the scene are kept in geometric memory rather than vanishing when they leave the current frame. This is the "3D model in your head" humans have — and that 2D policies lack.
Large-workspace manipulation policy. On top of that omnidirectional 3D, the policy can plan grasp/place for objects anywhere in the extended workspace, even when the object is not in the camera frame at command time.

Active Spatial Reasoning: look actively, reason spatially

The second paper, Humanoid Whole-Body Manipulation via Active Spatial Reasoning (arXiv 2605.21133, May 2026), adds another piece: active perception and spatial reasoning.

Active Spatial Reasoning for whole-body manipulation — source: arXiv 2605.21133

Source: Active Spatial Reasoning (arXiv 2605.21133) — reasoning about spatial relations with little real-robot data.

The core difference from passive perception:

Active perception. Rather than only processing images from a fixed camera, the robot actively adjusts its viewpoint — turns its head, tilts its torso, moves — to "see" the part of space the task requires. Like a human tilting their head to look at an occluded object.
Spatial reasoning about relations. The policy reasons over relations like "the cup is to the left of the box," "grab the object behind the bottle," "place it on the higher shelf." This is 3D reasoning, not just 2D object detection.
Sample efficiency. The most valuable point: the paper achieves 3D spatial-aware whole-body manipulation with less real-robot data — a good spatial representation reduces the burden of learning everything from demonstrations.

Comparing the two approaches

Criterion	Omni-Manip	Active Spatial Reasoning
Main problem solved	Out-of-FOV objects, large workspace	Spatial relation reasoning, occlusion
Perception mechanism	Omnidirectional 3D (passive, multi-sensor)	Active — actively changes viewpoint
Scene memory	Persistent geometric memory	Task-driven reasoning
Strength	Wide spatial coverage	Sample-efficient, less robot data
Best fit	Warehouse/factory large workspace	Tasks complex in spatial relations

Fitting into the series' 3D pipeline

Looking at the big picture, humanoid perception is a layer that sits on top of what we've learned:

Robo3R (Part 2)       → metric 3D from RGB, per camera
        +
Omni-Manip            → fuse multiple views into omnidirectional 3D
        +
Active Spatial Reason → actively look + reason about spatial relations
        ↓
Panoramic 3D scene in robot frame
        ↓
Policy (DP3 / WholeBodyVLA) → actions for both hands + locomotion

Notes for applying to the Unitree G1

Sensor placement. One head camera (RGB-D) is not enough for whole-body. In practice you want a wrist camera (close-up during grasp) and a wide-scanning sensor (LiDAR or wide-angle RGB-D) to build the out-of-FOV region.
Multi-sensor calibration. Every point cloud must be brought into the same robot base frame to be fused — this is the hardest engineering part, requiring precise extrinsic calibration between cameras and joints.
Active perception needs a coordinated controller. "Turn the head to look" means perception and motion control must talk to each other — they can't be decoupled as on a fixed arm.

Conclusion: from "see one point" to "understand the whole space"

Part 3: DP3 — Diffusion Policy with Point Clouds — A point-cloud policy fed by this perception layer
Part 5: WholeBodyVLA — Egocentric Video + RL — Turning perception into whole-body action
Part 1: Why 2D VLA Is Not Enough for Manipulation — Why occlusion & FOV break 2D images

3D Perception for Whole-Body Humanoids: Beyond-FOV

Why humanoid perception is fundamentally different

Omni-Manip: beyond-FOV perception for large workspaces

Active Spatial Reasoning: look actively, reason spatially

Comparing the two approaches

Fitting into the series' 3D pipeline

Notes for applying to the Unitree G1

Conclusion: from "see one point" to "understand the whole space"

Nguyễn Anh Tuấn

Related Posts

WholeBodyVLA: video egocentric + RL loco-manipulation

Vì sao VLA 2D chưa đủ cho manipulation

Thu data manipulation: teleop vs robot-free vs video

3D Perception for Whole-Body Humanoids: Beyond-FOV

Why humanoid perception is fundamentally different

Omni-Manip: beyond-FOV perception for large workspaces

Active Spatial Reasoning: look actively, reason spatially

Comparing the two approaches

Fitting into the series' 3D pipeline

Notes for applying to the Unitree G1

Conclusion: from "see one point" to "understand the whole space"

Nguyễn Anh Tuấn

Related Posts

WholeBodyVLA: video egocentric + RL loco-manipulation

Vì sao VLA 2D chưa đủ cho manipulation

Thu data manipulation: teleop vs robot-free vs video