In Part 3, DP3 gave us a strong point-cloud policy — but it is single-arm, with a small workspace and a camera pointed straight at a table. When you move to a humanoid like the Unitree G1, AgiBot X2, or Fourier GR, the perception problem changes qualitatively.
A humanoid doesn't stand still grabbing objects off one table. It walks to a shelf, bends down to pick something off the floor, reaches behind its back, turns to place an object on a conveyor. Its workspace is an entire room, and most of what it needs to manipulate is outside the field-of-view (FOV) of its head camera. This article analyzes two 2026 humanoid papers that solve exactly that problem: Omni-Manip and Active Spatial Reasoning.

Why humanoid perception is fundamentally different
For a fixed arm, everything is tidy: a single RGB-D camera looking at a ~60×60cm workspace is enough. Every object to manipulate is in the frame, depth is stable, calibration is one-and-done.
A humanoid breaks all three assumptions:
- The workspace is many times larger than the FOV. A G1 arm reaches ~70cm, but it moves through a room several meters wide. The head camera only sees a narrow cone in front.
- Target objects are often out of view. When reaching down to pick something off the floor, the object disappears from the head camera the moment the hand approaches — exactly when you most need to see it (the occlusion problem from Part 1).
- The whole body must avoid collisions. Legs, hips, elbows can all hit a shelf, a table, or the robot itself. You need the 3D geometry of the entire space around the robot, not just the region in front.
In other words: a robot arm needs to "see one point clearly," a humanoid needs to "understand the whole space around it." That is a shift from frame perception to panoramic perception.
Omni-Manip: beyond-FOV perception for large workspaces
Omni-Manip (Beyond-FOV Large-Workspace Humanoid Manipulation, arXiv 2603.05355, 2026) attacks the "object out of FOV" problem head-on. The core idea: instead of relying on a single forward-looking head camera, build an omnidirectional 3D representation of the environment around the robot — essentially geometric "360° vision."
Main components:
- Panoramic / multi-sensor 3D perception. Combine multiple sources — head camera, wrist camera, and a wide-scanning LiDAR/RGB-D — into a point cloud covering a region larger than any single camera's FOV.
- Persistent scene memory. When the robot turns or bends down, previously-seen parts of the scene are kept in geometric memory rather than vanishing when they leave the current frame. This is the "3D model in your head" humans have — and that 2D policies lack.
- Large-workspace manipulation policy. On top of that omnidirectional 3D, the policy can plan grasp/place for objects anywhere in the extended workspace, even when the object is not in the camera frame at command time.
Why this matters for whole-body: with the 3D geometry of the whole region around the robot, the planner can check collisions for the whole body (legs + torso + arms) while reaching far — exactly what Part 1 showed is impossible with a single 2D image.
Active Spatial Reasoning: look actively, reason spatially
The second paper, Humanoid Whole-Body Manipulation via Active Spatial Reasoning (arXiv 2605.21133, May 2026), adds another piece: active perception and spatial reasoning.

The core difference from passive perception:
- Active perception. Rather than only processing images from a fixed camera, the robot actively adjusts its viewpoint — turns its head, tilts its torso, moves — to "see" the part of space the task requires. Like a human tilting their head to look at an occluded object.
- Spatial reasoning about relations. The policy reasons over relations like "the cup is to the left of the box," "grab the object behind the bottle," "place it on the higher shelf." This is 3D reasoning, not just 2D object detection.
- Sample efficiency. The most valuable point: the paper achieves 3D spatial-aware whole-body manipulation with less real-robot data — a good spatial representation reduces the burden of learning everything from demonstrations.
Why this is a step forward: a 2D VLA must learn its "spatial intuition" implicitly from millions of frames. Inject explicit spatial reasoning into the pipeline and the model needs fewer demos for the same capability — a huge advantage when real-robot data is expensive (the topic of Part 6).
Comparing the two approaches
| Criterion | Omni-Manip | Active Spatial Reasoning |
|---|---|---|
| Main problem solved | Out-of-FOV objects, large workspace | Spatial relation reasoning, occlusion |
| Perception mechanism | Omnidirectional 3D (passive, multi-sensor) | Active — actively changes viewpoint |
| Scene memory | Persistent geometric memory | Task-driven reasoning |
| Strength | Wide spatial coverage | Sample-efficient, less robot data |
| Best fit | Warehouse/factory large workspace | Tasks complex in spatial relations |
These two directions are complementary, not mutually exclusive: Omni-Manip gives you a panoramic 3D map, while Active Spatial Reasoning gives you a way to use that map actively and data-efficiently. An ideal humanoid system uses both.
Fitting into the series' 3D pipeline
Looking at the big picture, humanoid perception is a layer that sits on top of what we've learned:
Robo3R (Part 2) → metric 3D from RGB, per camera
+
Omni-Manip → fuse multiple views into omnidirectional 3D
+
Active Spatial Reason → actively look + reason about spatial relations
↓
Panoramic 3D scene in robot frame
↓
Policy (DP3 / WholeBodyVLA) → actions for both hands + locomotion
DP3 from Part 3 is still a good policy for each hand manipulation, but it needs a clean point cloud of the right region. Omni-Manip and Active Spatial Reasoning are the perception layer that supplies that point cloud for a humanoid operating in a large space.
Notes for applying to the Unitree G1
- Sensor placement. One head camera (RGB-D) is not enough for whole-body. In practice you want a wrist camera (close-up during grasp) and a wide-scanning sensor (LiDAR or wide-angle RGB-D) to build the out-of-FOV region.
- Multi-sensor calibration. Every point cloud must be brought into the same robot base frame to be fused — this is the hardest engineering part, requiring precise extrinsic calibration between cameras and joints.
- Active perception needs a coordinated controller. "Turn the head to look" means perception and motion control must talk to each other — they can't be decoupled as on a fixed arm.
Conclusion: from "see one point" to "understand the whole space"
The big lesson of this part: humanoid manipulation is a panoramic-perception problem, not a frame-perception problem. Omni-Manip solves "objects out of FOV" with omnidirectional 3D; Active Spatial Reasoning solves "spatial reasoning with little data" with active perception. Both build on the 3D geometry this whole series has pursued since Part 1.
With panoramic perception in hand, the next question is: how do we turn it into whole-body action — walking and manipulating at once? That is the topic of Part 5: WholeBodyVLA, learning loco-manipulation from egocentric video combined with RL.
Related Posts
- Part 3: DP3 — Diffusion Policy with Point Clouds — A point-cloud policy fed by this perception layer
- Part 5: WholeBodyVLA — Egocentric Video + RL — Turning perception into whole-body action
- Part 1: Why 2D VLA Is Not Enough for Manipulation — Why occlusion & FOV break 2D images


