
Upgrading to D405: when to replace GoPro in UMI and how

Practical guide for deciding when to upgrade from GoPro UMI to RealSense D405: what D405 brings, what you must build yourself, and the right architecture to avoid losing data quality.

Nguyễn Anh Tuấn · June 6, 2026 · 6 min read

This is Part 6 in the UMI + VLA series. This post is for people who have a working UMI pipeline (Parts 2–5) and are considering using Intel RealSense D405 instead of GoPro.

TL;DR: D405 adds RGB-D near the gripper, useful for contact estimation and object segmentation. But it is not a drop-in replacement — you have to build custom tracking, a custom recorder, and a custom converter. If your GoPro pipeline is working, don't upgrade unless you have a specific reason.

Why GoPro and D405 are fundamentally different

In the original UMI, GoPro does 3 things simultaneously:

GoPro in UMI:
  1. Camera observation (fisheye 155° wrist view)
  2. IMU source (accelerometer + gyroscope for SLAM)
  3. Visual odometry base (feature-rich fisheye for ORB-SLAM3)

RealSense D405 has no built-in IMU (unlike D435i). Its FOV is much narrower. The SLAM pipeline scripts in the repo (01_extract_gopro_imu.py, 03_batch_slam.py) are written specifically for GoPro format.

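| Item             | GoPro (original UMI) | RealSense D405               |
|------------------|----------------------|------------------------------|
| FOV              | Fisheye ~155°        | 87° × 58° (RGB)              |
| IMU              | Built-in             | None                         |
| Depth            | None                 | Yes (short-range, ~0.1–0.5m) |
| UMI SLAM scripts | Work directly ✓      | Need custom adapter ✗        |
| Upstream support | Full ✓               | None in upstream ✗           |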

When D405 is actually worth it

Only upgrade to D405 if you specifically need at least one of these:

  1. Contact estimation — you need to know exactly when the fingertip touches an object. D405 depth gives a point cloud near the gripper (~0.1–0.5m range), far better than RGB-only for grasping.

  2. Object segmentation — your task has many similar objects; depth helps segment 3D object bounds more accurately.

  3. Partial occlusion — objects are partially hidden in RGB; depth helps estimate pose from the visible surface.

  4. Sensor redundancy — you want both RGB (observation) and depth (feature) as independent information sources.
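
For the contact-estimation and segmentation cases, the depth image is usually turned into a point cloud by back-projecting each pixel through the pinhole intrinsics. The following is a minimal sketch of that step on synthetic data, not part of the UMI repo. The function name `depth_to_points`, the intrinsics values, and the depth scale of 1e-4 m per raw unit are assumptions for illustration; on real hardware, read the intrinsics and depth scale from the device.

```python
import numpy as np

def depth_to_points(depth_raw, fx, fy, cx, cy, depth_scale, z_min=0.1, z_max=0.5):
    """Back-project a raw depth image to an (N, 3) array of points in metres.

    Pixels outside [z_min, z_max] (including holes, which read as 0) are dropped.
    """
    z = depth_raw.astype(np.float64) * depth_scale   # raw units -> metres
    v, u = np.indices(z.shape)                       # v = pixel row, u = pixel column
    valid = (z >= z_min) & (z <= z_max)
    x = (u - cx) * z / fx                            # pinhole model, camera frame
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# Synthetic example: a flat surface 0.2 m away with one hole (value 0).
depth = np.full((4, 4), 2000, dtype=np.uint16)       # 2000 * 1e-4 = 0.2 m (assumed scale)
depth[0, 0] = 0
points = depth_to_points(depth, fx=400.0, fy=400.0, cx=2.0, cy=2.0, depth_scale=1e-4)
print(points.shape)
```

The default range of 0.1 to 0.5 m mirrors the working range quoted in this post; widen or narrow it to match your mount.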

Don't upgrade if:

  • Your GoPro pipeline works and the policy is good enough
  • You want to "improve" without knowing what specific problem D405 solves
  • You don't have a working 6DoF tracker yet (see below)

What you MUST build yourself when using D405

This is the part most people overlook. The D405 needs custom infrastructure, not just a camera swap.

1. External 6DoF tracking

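The D405 cannot feed the UMI SLAM pipeline the way the GoPro does (no IMU, much narrower FOV), so you need an external source of 6DoF poses: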

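| Option                  | Accuracy           | Cost                   | Setup complexity |
|-------------------------|--------------------|------------------------|------------------|
| Mocap (OptiTrack/Vicon) | Very high (<0.5mm) | High ($5k–$50k)        | High             |
| SteamVR tracker         | High (~1–3mm)      | Medium (~$150/tracker) | Medium           |
| AprilTag/ArUco rig      | Medium (~5–10mm)   | Low                    | Low              |
| Custom RGB-D SLAM       | Low–medium (drift) | Low                    | Very high        |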

Recommendation: SteamVR tracker or mocap for lab setups. An AprilTag rig works for budget setups, but watch for pose jumps and tracking loss when the tags are occluded.

If you choose custom RGB-D SLAM (Open3D or ORB-SLAM3 RGB-D mode): this is a research project, not an engineering task. You must validate drift against ground truth, handle texture-poor surfaces, and implement tracking loss recovery. Not suitable for beginners.

2. Custom recorder

UMI repo has no recorder for D405. You must write your own sync script:

# Skeleton: a starting point only, NOT production code
# You must add: error handling, latency measurement, tracker/gripper sync
import pyrealsense2 as rs
import numpy as np

pipe = rs.pipeline()
cfg = rs.config()
cfg.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
cfg.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)

align = rs.align(rs.stream.color)  # reproject depth into the color frame
profile = pipe.start(cfg)

frames_rgb, frames_depth, timestamps = [], [], []
try:
    # Save intrinsics and depth scale once for calibration
    color_intr = profile.get_stream(rs.stream.color).as_video_stream_profile().get_intrinsics()
    depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()  # metres per raw unit
    print("D405 intrinsics:", color_intr, "depth scale:", depth_scale)

    for i in range(90):  # 3 seconds at 30fps
        frames = pipe.wait_for_frames()
        aligned = align.process(frames)
        color = aligned.get_color_frame()
        depth = aligned.get_depth_frame()
        if not color or not depth:
            continue  # skip incomplete framesets
        ts = frames.get_timestamp() / 1000.0  # ms → s

        # Copy the pixels: get_data() points into librealsense's frame buffers,
        # and holding many frames without copying can stall the pipeline
        frames_rgb.append(np.asanyarray(color.get_data()).copy())
        frames_depth.append(np.asanyarray(depth.get_data()).copy())
        timestamps.append(ts)
finally:
    pipe.stop()  # release the device even if capture fails

The real recorder must sync with tracker poses and gripper width in the same loop.
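
One way to approach that sync, if the tracker runs at a higher rate than the camera, is to match each camera frame to the nearest tracker sample after recording and reject frames with no sample close enough. The sketch below assumes both timestamp streams are already in the same clock (in seconds), sorted ascending, with at least two tracker samples. The names `match_nearest` and `max_gap_s` and the 20 ms threshold are hypothetical.

```python
import numpy as np

def match_nearest(frame_ts, sample_ts, max_gap_s=0.02):
    """For each frame timestamp, return the index of the nearest sample
    and a boolean mask that is False where the gap exceeds max_gap_s."""
    frame_ts = np.asarray(frame_ts, dtype=np.float64)
    sample_ts = np.asarray(sample_ts, dtype=np.float64)
    right = np.searchsorted(sample_ts, frame_ts)        # first sample at or after the frame
    right = np.clip(right, 1, len(sample_ts) - 1)       # keep both neighbours in bounds
    left = right - 1
    pick_left = (frame_ts - sample_ts[left]) <= (sample_ts[right] - frame_ts)
    idx = np.where(pick_left, left, right)
    ok = np.abs(sample_ts[idx] - frame_ts) <= max_gap_s
    return idx, ok

# Camera at ~30 Hz, tracker at ~100 Hz; the last frame arrives after the tracker dropped out.
frame_ts = [0.000, 0.033, 0.067, 0.500]
tracker_ts = [0.001, 0.011, 0.021, 0.031, 0.041, 0.051, 0.061, 0.071]
idx, ok = match_nearest(frame_ts, tracker_ts)
print(idx, ok)
```

Frames where `ok` is False are best dropped rather than paired with a stale pose. Scalar signals such as gripper width can instead be linearly interpolated onto the frame timestamps with `np.interp`; rotations should not be interpolated component-wise.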

3. Custom data converter

No script in the UMI repo converts D405 data to replay_buffer.zarr.zip or LeRobot format. You must write a converter mapping:

D405 color frames + D405 depth frames
+ External tracker poses
+ Gripper width measurements
    ↓ (custom converter)
UMI replay buffer keys:
  robot0_eef_pos, robot0_eef_rot_axis_angle
  robot0_gripper_width
  camera0_rgb, camera0_depth (if your model supports depth)
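
A minimal sketch of the in-memory half of such a converter is shown below: it turns per-frame 4×4 gripper poses, widths, and images into arrays under the keys listed above. The array shapes and dtypes are assumptions, so compare them against a replay buffer produced by your working GoPro pipeline before training. Writing the result to zarr or LeRobot format is left out. `rotmat_to_axis_angle` and `build_episode` are hypothetical helper names.

```python
import numpy as np

def rotmat_to_axis_angle(R):
    """Convert a 3x3 rotation matrix to an axis-angle rotation vector (length 3)."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    if angle < 1e-8:
        return np.zeros(3)  # no rotation
    # This closed form is ill-conditioned near 180 degrees; prefer a library
    # routine (e.g. scipy's Rotation.from_matrix(...).as_rotvec()) in real use.
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return axis / (2.0 * np.sin(angle)) * angle

def build_episode(T_world_gripper, gripper_width, rgb, depth=None):
    """Pack one episode into a dict keyed like the UMI replay buffer (shapes assumed)."""
    T = np.asarray(T_world_gripper, dtype=np.float64)   # (N, 4, 4) gripper poses
    n = len(T)
    episode = {
        "robot0_eef_pos": T[:, :3, 3].astype(np.float32),
        "robot0_eef_rot_axis_angle": np.stack(
            [rotmat_to_axis_angle(Ti[:3, :3]) for Ti in T]
        ).astype(np.float32),
        "robot0_gripper_width": np.asarray(gripper_width, dtype=np.float32).reshape(-1, 1),
        "camera0_rgb": np.asarray(rgb, dtype=np.uint8),
    }
    if depth is not None:
        episode["camera0_depth"] = np.asarray(depth)
    for key, value in episode.items():
        if len(value) != n:
            raise ValueError(f"{key} has {len(value)} steps, expected {n}")
    return episode

# Two-step example: identity pose, then a 90 degree turn about z with a small offset.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T1 = np.eye(4)
T1[:3, :3] = Rz90
T1[:3, 3] = [0.1, 0.2, 0.3]
poses = np.stack([np.eye(4), T1])
episode = build_episode(poses, [0.08, 0.04], np.zeros((2, 4, 4, 3), dtype=np.uint8))
print({k: v.shape for k, v in episode.items()})
```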

Correct architecture for UMI-D405

Hardware:
  D405 (wrist RGB-D observation)
  + External tracker rigidly mounted to gripper body
  + Gripper width sensor (ArUco tag or encoder)

Software:
  Custom sync recorder
      → saves: color.mp4, depth.zarr, poses.csv, width.csv, timestamps.csv
  Calibration:
      → D405 intrinsics (camera_matrix, dist_coeffs)
      → T_gripper_camera (rigid transform from gripper body → camera)
      → T_tracker_gripper (if using mocap/VR)
  Custom converter:
      → Combine RGB-D + pose + width → replay buffer or LeRobot format

Most critical calibration: T_gripper_camera — the rigid transform from the gripper body to the D405 optical frame. If wrong, policy actions will be misaligned with camera observations. Use a ChArUco board and hand-eye calibration.
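
To see how the calibrated transforms chain together, here is a small numeric sketch with made-up values. The convention assumed here is that `T_a_b` maps points expressed in frame `b` into frame `a`; check which direction your calibration tool outputs, because an inverted transform is a common source of misalignment. `make_transform` and `invert_transform` are hypothetical helpers, and all numbers are placeholders.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert_transform(T):
    """Invert a rigid transform without a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    return make_transform(R.T, -R.T @ t)

Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

T_world_tracker = make_transform(Rz90, [1.0, 0.0, 0.0])           # from the tracker, per frame
T_tracker_gripper = make_transform(np.eye(3), [0.0, 0.0, -0.05])  # calibrated once
T_gripper_camera = make_transform(np.eye(3), [0.0, 0.03, 0.02])   # calibrated once (hand-eye)

T_world_gripper = T_world_tracker @ T_tracker_gripper   # pose stored as the robot state
T_world_camera = T_world_gripper @ T_gripper_camera     # where the D405 is in the world

# A point 0.2 m in front of the camera, in homogeneous coordinates.
p_camera = np.array([0.0, 0.0, 0.2, 1.0])
p_gripper = T_gripper_camera @ p_camera
p_world = T_world_camera @ p_camera
print(p_gripper[:3], p_world[:3])
```

Any error in `T_gripper_camera` shifts every observed point relative to the recorded gripper pose by the same amount, which is why that calibration matters most.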

Can depth be used with VLAs?

GR00T/GR00T-LeRobot (NVIDIA): Depends on version and config. Check whether your branch's modality.json supports depth keys before investing in a depth pipeline.

UMI Diffusion Policy baseline: Default config is RGB-only. To add depth, you need a custom encoder (e.g., concatenate depth as an extra channel, or separate depth encoder). Not in the official configs.
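
As a rough illustration of the "extra channel" option, the sketch below stacks a normalized depth map onto an RGB frame to get a 4-channel image. This is not from the official configs. The function name `make_rgbd`, the depth scale of 1e-4 m per raw unit, and the 0.5 m clipping range are assumptions, and the policy's image encoder still has to be modified to accept four input channels.

```python
import numpy as np

def make_rgbd(rgb, depth_raw, depth_scale, z_max=0.5):
    """Stack RGB (H, W, 3, uint8) and raw depth (H, W) into a float32 (H, W, 4) image in [0, 1].

    Depth is converted to metres, clipped to [0, z_max], and divided by z_max.
    Holes (raw value 0) stay at 0.
    """
    rgb_f = rgb.astype(np.float32) / 255.0
    z = depth_raw.astype(np.float32) * depth_scale
    z_norm = np.clip(z, 0.0, z_max) / z_max
    return np.concatenate([rgb_f, z_norm[..., None]], axis=-1).astype(np.float32)

rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
depth_raw = np.array([[0, 2500], [5000, 10000]], dtype=np.uint16)  # 0, 0.25, 0.5, 1.0 m
rgbd = make_rgbd(rgb, depth_raw, depth_scale=1e-4)
print(rgbd.shape, rgbd.dtype)
```

Clipping to the sensor's useful range keeps far-background values from dominating the normalized channel.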

Practical recommendation:

  1. Train an RGB-only baseline with the D405 color stream first
  2. If baseline works, add depth as a supplementary feature
  3. Test: does the model improve with depth? If not → stick with RGB-only

Real D405 depth problems

| Problem                                 | Cause                                            | Fix                                       |
|-----------------------------------------|--------------------------------------------------|-------------------------------------------|
| Shiny/transparent objects have no depth | Stereo depth failure                             | Use matte props during data collection    |
| Depth misaligned with RGB               | Depth and color not registered to the same frame | Use rs.align(), save intrinsics/extrinsics |
| Noisy depth at edges                    | Stereo disparity noise                           | librealsense spatial/temporal filter      |
| Depth can't see <10cm                   | D405 minimum range limit                         | Adjust camera position                    |
| Model doesn't improve with depth        | Depth features not used                          | Verify dataloader reads depth correctly   |
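
Several of these problems show up as a low fraction of usable depth pixels, so a per-frame sanity check on recorded data (or on what the dataloader actually returns) can catch them early. The sketch below is one possible check. The function name `depth_frame_stats`, the 1e-4 m depth scale, and the range limits are assumptions.

```python
import numpy as np

def depth_frame_stats(depth_raw, depth_scale, z_min=0.1, z_max=0.5):
    """Return the fraction of pixels that are holes, too close, too far, or usable."""
    z = depth_raw.astype(np.float64) * depth_scale   # raw units -> metres
    total = z.size
    holes = z == 0                                   # no stereo match (shiny, transparent, occluded)
    too_close = (z > 0) & (z < z_min)
    too_far = z > z_max
    usable = (z >= z_min) & (z <= z_max)
    return {
        "holes": holes.sum() / total,
        "too_close": too_close.sum() / total,
        "too_far": too_far.sum() / total,
        "usable": usable.sum() / total,
    }

# Synthetic frame: one hole, one pixel at 0.05 m, one at 0.2 m, one at 0.8 m.
depth_raw = np.array([[0, 500], [2000, 8000]], dtype=np.uint16)
stats = depth_frame_stats(depth_raw, depth_scale=1e-4)
print(stats)
```

If every frame coming out of the dataloader reports all holes or a constant value, the depth stream is probably not being read or decoded correctly.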

Decision checklist before upgrading

Answer these questions before buying a D405:

[ ] Is your current GoPro pipeline working well?
    → If not, fix the GoPro pipeline first.

[ ] Do you have a specific problem that D405 solves?
    → If not sure, keep GoPro.

[ ] Have you chosen an external 6DoF tracking solution?
    → If not, D405 won't have pose data.

[ ] Are you prepared to write a custom recorder + converter?
    → Estimate 2–4 weeks for an experienced engineer.

[ ] Does your VLA support depth input?
    → Check before investing in a depth pipeline.

[ ] Have you tested an RGB-only D405 baseline?
    → Test that first, then decide on adding depth.
