A 3D Manipulation Roadmap for the Unitree G1 Humanoid

The previous six parts gave you each piece: why you need 3D (Part 1), building 3D from RGB (Part 2), a point-cloud policy (Part 3), panoramic perception (Part 4), a whole-body VLA (Part 5), and a data strategy (Part 6). This final part ties them all into a concrete deployment roadmap for a humanoid like the Unitree G1.

The goal: go from a blank G1 to a working 3D-aware whole-body manipulation system. (Note: examples use the Unitree G1 / AgiBot X2 / Fourier — it applies to any humanoid with an RL loco controller.)

DP3 running in simulation with a 3D point cloud — source: 3d-diffusion-policy.github.io

Overall architecture: from pixels to whole-body action

Before each step, here is the full picture of the pipeline:

flowchart TD
    A[Top RGB-D camera] --> P
    B[Wrist camera] --> P
    C[LiDAR / wide RGB-D] --> P
    P[Robo3R + Omni-Manip:<br/>canonical robot-frame 3D scene] --> Q
    Q[Active Spatial Reasoning:<br/>select region, reason relations] --> R
    R{Policy} --> S[DP3: hand manipulation]
    R --> T[WholeBodyVLA: loco-manipulation]
    S --> U[RL low-level controller 500 Hz]
    T --> U
    U --> V[Unitree G1]

Three clear tiers: 3D perception (build the scene) → policy (emit actions) → controller (balance + execute). Each tier is one part of the series.

Step 1: Sensor placement

This is the physical foundation. A whole-body humanoid needs more than one head camera (the reason was analyzed in Part 4):

Sensor	Location	Role
RGB-D camera	Head	Main forward scene, build 3D scene
RGB / RGB-D	Wrist	Close-up during grasp, reduce occlusion
LiDAR or wide RGB-D	Torso/shoulder	Out-of-FOV region, whole-body collision

Golden rule: every sensor must be extrinsically calibrated into the same robot base frame of the G1. Without this step, point clouds from different cameras can't be fused — and the entire 3D pipeline collapses. Invest time in accurate calibration from the start.

Step 2: Choose the perception stack (build the 3D scene)

The goal of this tier: from camera images → a metric point cloud in the canonical robot frame.

Simple start: use depth directly from RGB-D (RealSense D435i) → project into a point cloud. Enough to run DP3 immediately, but weak on transparent/reflective objects.
Upgrade: add Robo3R (Part 2) to build metric 3D from RGB at 43 Hz — drops the depth-sensor dependency and handles hard objects.
Whole-body: fuse multiple views using the Omni-Manip idea (Part 4) for an omnidirectional scene + scene memory for the out-of-FOV region.

Standard output of this step: a point cloud (N, 6) = xyz + rgb, cropped to the task-relevant region, in the robot frame.

Step 3: Choose the policy (emit actions)

Depending on task complexity:

Single/bimanual manipulation, standing still: use DP3 (Part 3). Sample-efficient, runs well from a point cloud, Simple DP3 ~25 Hz fits a Jetson Orin. A solid starting choice.
Whole-body loco-manipulation (walk and work): use WholeBodyVLA (Part 5) — a latent VLA that coordinates the whole body, learning from video + a little teleop.

In practice, combine them: WholeBodyVLA handles whole-body coordination and walking to position, DP3 handles fine hand manipulation once stabilized. Don't force one policy to do both from day one.

Step 4: Choose the data strategy

Per the framework in Part 6, layer by budget:

Pretrain with action-free egocentric video (cheap, massive).
Scale with robot-free demos (HuMI) — captures whole-body pose, no robot per station.
Final fine-tune with teleop on the G1 itself — anchor the policy to real robot kinematics.

Remember: with sample-efficient DP3, 200 clean demos are usually enough for one task; don't burn teleop money before you need to.

Step 5: RL low-level controller — the part you can't skip

This is the commonly underestimated point. The high-level policy emits intent, but if the low-level controller can't hold balance when the G1 reaches far, everything falls. You need an RL loco-manipulation controller running at ~500 Hz, trained in simulation.

Train in Isaac Lab or MuJoCo — the robot falls millions of times to learn to stay upright as the center of mass shifts during reaching.
Mandatory domain randomization + an actuator-delay model to cross sim2real (detailed techniques in ASAP/Unitree G1).

Step 6: Sim → real-robot checklist

Don't jump straight to the real robot. A safe path:

Sim phase:

Build the G1 in Isaac Lab/MuJoCo, verify URDF + kinematics match the real robot.
Train the RL loco controller to balance while reaching (no manipulation policy yet).
Integrate 3D perception (point cloud from sim) → DP3/WholeBodyVLA.
Measure success rate on a standard task set (pick-place, place on shelf, pick off floor).

Sim2real phase:

Domain randomization: lighting, texture, friction, mass, actuator delay.
Verify camera↔robot-frame extrinsic calibration on the real robot.
Test the low-level controller alone on the real G1 (safely, with a safety harness) before attaching the policy.
Enable the policy at slow speed, gradually expand the workspace.

Logging & evaluation:

Log all modalities (RGB, point cloud, state, action, task outcome) for debugging and collecting more data.
Measure success rate + time + collision count per task type.
Use failure cases as fine-tuning data for the next round (a data flywheel).

Summary table: which part solves which roadmap piece

Roadmap piece	Solution	Part
Why you need 3D	Motivation, 2D limits	Part 1
Build 3D from RGB	Robo3R feed-forward	Part 2
Manipulation policy	DP3 point-cloud	Part 3
Panoramic perception	Omni-Manip + spatial reasoning	Part 4
Whole-body coordination	WholeBodyVLA + RL	Part 5
Data collection	Teleop / HuMI / video	Part 6
Combine & deploy	This roadmap	Part 7

Conclusion: from paper to real robot

This series began with a simple question — why do image-only 2D VLAs fail at manipulation — and ends with a complete roadmap to build a 3D-aware whole-body manipulation system for a humanoid. The throughline: manipulation is a geometry problem, and geometry lives in the 3D metric space around the robot.

Don't try to do everything at once. Start small: one G1, a depth camera, DP3 for a single standing pick-place task. Once it works, add Robo3R to drop the depth sensor, add cameras to widen the workspace, then move up to WholeBodyVLA for loco-manipulation. Each step is one part of this series — now you have the full map.

Happy building. If you get stuck at any step, return to the corresponding part in the table above.

Part 1: Why 2D VLA Is Not Enough for Manipulation — The start of the whole journey
Part 5: WholeBodyVLA — Egocentric Video + RL — Whole-body coordination
Part 6: Data Collection — Teleop vs Robot-Free vs Video — Choosing a data strategy

The goal: go from a blank G1 to a working 3D-aware whole-body manipulation system. (Note: examples use the Unitree G1 / AgiBot X2 / Fourier — it applies to any humanoid with an RL loco controller.)

DP3 running in simulation with a 3D point cloud — source: 3d-diffusion-policy.github.io

Overall architecture: from pixels to whole-body action

Before each step, here is the full picture of the pipeline:

flowchart TD
    A[Top RGB-D camera] --> P
    B[Wrist camera] --> P
    C[LiDAR / wide RGB-D] --> P
    P[Robo3R + Omni-Manip:<br/>canonical robot-frame 3D scene] --> Q
    Q[Active Spatial Reasoning:<br/>select region, reason relations] --> R
    R{Policy} --> S[DP3: hand manipulation]
    R --> T[WholeBodyVLA: loco-manipulation]
    S --> U[RL low-level controller 500 Hz]
    T --> U
    U --> V[Unitree G1]

Three clear tiers: 3D perception (build the scene) → policy (emit actions) → controller (balance + execute). Each tier is one part of the series.

Step 1: Sensor placement

This is the physical foundation. A whole-body humanoid needs more than one head camera (the reason was analyzed in Part 4):

Sensor	Location	Role
RGB-D camera	Head	Main forward scene, build 3D scene
RGB / RGB-D	Wrist	Close-up during grasp, reduce occlusion
LiDAR or wide RGB-D	Torso/shoulder	Out-of-FOV region, whole-body collision

Step 2: Choose the perception stack (build the 3D scene)

The goal of this tier: from camera images → a metric point cloud in the canonical robot frame.

Simple start: use depth directly from RGB-D (RealSense D435i) → project into a point cloud. Enough to run DP3 immediately, but weak on transparent/reflective objects.
Upgrade: add Robo3R (Part 2) to build metric 3D from RGB at 43 Hz — drops the depth-sensor dependency and handles hard objects.
Whole-body: fuse multiple views using the Omni-Manip idea (Part 4) for an omnidirectional scene + scene memory for the out-of-FOV region.

Standard output of this step: a point cloud (N, 6) = xyz + rgb, cropped to the task-relevant region, in the robot frame.

Step 3: Choose the policy (emit actions)

Depending on task complexity:

Single/bimanual manipulation, standing still: use DP3 (Part 3). Sample-efficient, runs well from a point cloud, Simple DP3 ~25 Hz fits a Jetson Orin. A solid starting choice.
Whole-body loco-manipulation (walk and work): use WholeBodyVLA (Part 5) — a latent VLA that coordinates the whole body, learning from video + a little teleop.

Step 4: Choose the data strategy

Per the framework in Part 6, layer by budget:

Pretrain with action-free egocentric video (cheap, massive).
Scale with robot-free demos (HuMI) — captures whole-body pose, no robot per station.
Final fine-tune with teleop on the G1 itself — anchor the policy to real robot kinematics.

Remember: with sample-efficient DP3, 200 clean demos are usually enough for one task; don't burn teleop money before you need to.

Step 5: RL low-level controller — the part you can't skip

Train in Isaac Lab or MuJoCo — the robot falls millions of times to learn to stay upright as the center of mass shifts during reaching.
Mandatory domain randomization + an actuator-delay model to cross sim2real (detailed techniques in ASAP/Unitree G1).

Step 6: Sim → real-robot checklist

Don't jump straight to the real robot. A safe path:

Sim phase:

Build the G1 in Isaac Lab/MuJoCo, verify URDF + kinematics match the real robot.
Train the RL loco controller to balance while reaching (no manipulation policy yet).
Integrate 3D perception (point cloud from sim) → DP3/WholeBodyVLA.
Measure success rate on a standard task set (pick-place, place on shelf, pick off floor).

Sim2real phase:

Domain randomization: lighting, texture, friction, mass, actuator delay.
Verify camera↔robot-frame extrinsic calibration on the real robot.
Test the low-level controller alone on the real G1 (safely, with a safety harness) before attaching the policy.
Enable the policy at slow speed, gradually expand the workspace.

Logging & evaluation:

Log all modalities (RGB, point cloud, state, action, task outcome) for debugging and collecting more data.
Measure success rate + time + collision count per task type.
Use failure cases as fine-tuning data for the next round (a data flywheel).

Summary table: which part solves which roadmap piece

Roadmap piece	Solution	Part
Why you need 3D	Motivation, 2D limits	Part 1
Build 3D from RGB	Robo3R feed-forward	Part 2
Manipulation policy	DP3 point-cloud	Part 3
Panoramic perception	Omni-Manip + spatial reasoning	Part 4
Whole-body coordination	WholeBodyVLA + RL	Part 5
Data collection	Teleop / HuMI / video	Part 6
Combine & deploy	This roadmap	Part 7

Conclusion: from paper to real robot

Happy building. If you get stuck at any step, return to the corresponding part in the table above.

Part 1: Why 2D VLA Is Not Enough for Manipulation — The start of the whole journey
Part 5: WholeBodyVLA — Egocentric Video + RL — Whole-body coordination
Part 6: Data Collection — Teleop vs Robot-Free vs Video — Choosing a data strategy

A 3D Manipulation Roadmap for the Unitree G1 Humanoid

Overall architecture: from pixels to whole-body action

Step 1: Sensor placement

Step 2: Choose the perception stack (build the 3D scene)

Step 3: Choose the policy (emit actions)

Step 4: Choose the data strategy

Step 5: RL low-level controller — the part you can't skip

Step 6: Sim → real-robot checklist

Summary table: which part solves which roadmap piece

Conclusion: from paper to real robot

Nguyễn Anh Tuấn

Related Posts

DP3: 3D Diffusion Policy với point cloud (hands-on)

WholeBodyVLA: video egocentric + RL loco-manipulation

Vì sao VLA 2D chưa đủ cho manipulation

A 3D Manipulation Roadmap for the Unitree G1 Humanoid

Overall architecture: from pixels to whole-body action

Step 1: Sensor placement

Step 2: Choose the perception stack (build the 3D scene)

Step 3: Choose the policy (emit actions)

Step 4: Choose the data strategy

Step 5: RL low-level controller — the part you can't skip

Step 6: Sim → real-robot checklist

Summary table: which part solves which roadmap piece

Conclusion: from paper to real robot

Nguyễn Anh Tuấn

Related Posts

DP3: 3D Diffusion Policy với point cloud (hands-on)

WholeBodyVLA: video egocentric + RL loco-manipulation

Vì sao VLA 2D chưa đủ cho manipulation