4D HOI Reconstruction: GEM, SAM2, MoGe

In part 1, we treated 3D assets and terrain as metric input data. In part 2, Blender rendered the conditioning frame, camera parameters, and depth, then Kling generated a 2D human-object interaction video. Part 3 covers the hardest stage of the manipulation branch: reconstructing 4D HOI from that video.

"4D" simply means 3D over time. The output is not one human mesh in one frame. It is a sequence of human pose, hand pose, object pose, and contact information per frame. The GRAIL reconstruction docs describe this stage as recovering full human-object interaction: SMPL-X body pose, MANO hand pose, and a 6-DoF object trajectory. In robotics terms, the pipeline has to answer the questions that downstream training needs: where is the person, where do the hands touch, how does the object translate and rotate, when does contact begin, and is the result clean enough to retarget to a Unitree G1?

GRAIL has an advantage over unconstrained in-the-wild reconstruction pipelines because it does not start from an unknown internet video. NVIDIA's project page emphasizes that GRAIL starts from a fully specified 3D configuration: object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation. recon_4dhoi uses that privileged context to reduce ambiguity in scale, camera geometry, and object shape.
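To make the reconstruction target concrete before walking through the stages, here is a minimal sketch of what "3D over time" looks like as data. It is illustrative only: the class and field names are hypothetical and do not describe GRAIL's on-disk format. The array sizes assume the standard SMPL-X body pose (21 joints in axis-angle) and MANO-style hand pose (15 joints per hand).

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HOISequence:
    """Hypothetical container for one 4D HOI clip with T frames."""

    body_pose: np.ndarray   # (T, 63) axis-angle for 21 SMPL-X body joints
    hand_pose: np.ndarray   # (T, 2, 45) left/right MANO-style hand pose
    root_trans: np.ndarray  # (T, 3) global root translation in metres
    obj_pose: np.ndarray    # (T, 4, 4) object-to-world rigid transform
    contact: np.ndarray     # (T,) bool, True while a hand touches the object

    def num_frames(self) -> int:
        """Return T, or raise if the arrays disagree on the time dimension."""
        arrays = (self.body_pose, self.hand_pose, self.root_trans,
                  self.obj_pose, self.contact)
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            raise ValueError(f"inconsistent frame counts: {sorted(lengths)}")
        return lengths.pop()

    def contact_onset(self):
        """Index of the first contact frame, or None if there is no contact."""
        idx = np.flatnonzero(self.contact)
        return int(idx[0]) if idx.size else None


T = 5
seq = HOISequence(
    body_pose=np.zeros((T, 63)),
    hand_pose=np.zeros((T, 2, 45)),
    root_trans=np.zeros((T, 3)),
    obj_pose=np.tile(np.eye(4), (T, 1, 1)),
    contact=np.array([False, False, False, True, True]),
)
print(seq.num_frames(), seq.contact_onset())
```

The two helper methods mirror two of the questions above: whether every signal covers the same frames, and when contact begins.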

Series Roadmap

1. 3D Assets and Terrain for GRAIL: asset generation, object prompts, sharding, and downstream file contracts.
2. 2D HOI Videos with Blender and Kling: conditioning renders, camera/depth output, and video foundation model generation.
3. 4D HOI Reconstruction: GEM, SAM2, MoGe: human pose, object tracking, optimization, filtering, and visualization.
4. Static Terrain Locomotion: curb, slope, and stairs for scene-centric motion.
5. Retargeting Trajectories to Unitree G1: converting human/object trajectories into robot targets.
6. Training and Data Export: packaging demonstrations, training trackers/policies, and preparing sim-to-real data.

For broader context on whole-body policies, see NVIDIA GR00T/SONIC for whole-body VLA and LeRobot v0.5 with humanoid G1. Those articles explain why clean motion data with objects and contact matters more than a visually impressive video.

What You Will Learn

By the end of this article, you should understand the six-stage grail.pipelines.recon_4dhoi stack, know the smoke-test command and where its outputs land, and know how to inspect hoi_data/hoi_data.pkl, recon_result.mp4, and recon_comparison.mp4. You should also know how to tune the important fields in configs/recon_4dhoi/manip_smplx.yaml: optimization.opt_stage_specs.init_opt, optimization.opt_stage_specs.contact_opt, filtering.filter_object_motion (dynamic_only for manipulation), and pipeline.is_static_obj.

The basic manipulation command is:

python -m grail.pipelines.recon_4dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --results_dir results

To run one specific video:

python -m grail.pipelines.recon_4dhoi \
  --video_id ComAsset/cordless_drill/<video_name> \
  --results_dir results

If human pose, mask/depth preprocessing, and object pose are already done, and you only want to rerun optimization after changing the config:

python -m grail.pipelines.recon_4dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --results_dir results \
  --skip_step1 \
  --skip_step2 \
  --skip_step3

Run these commands from the root of the GRAIL repository, not from the blog project. You also need the environment and checkpoints. The official README uses Docker, installs Blender and local extras, then runs bash scripts/setup/download_checkpoints.sh to download GEM-SMPL/GEM-SOMA/FoundationPose weights.
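If you launch these runs from a script rather than a shell, a small wrapper keeps the flags consistent. The sketch below only assembles the argument list from the flags shown above; the helper name build_recon_cmd is hypothetical, and nothing here is part of GRAIL itself.

```python
import shlex


def build_recon_cmd(dataset, category, results_dir="results", skip_steps=()):
    """Assemble the recon_4dhoi command line as a list of arguments.

    skip_steps may contain 1, 2, or 3, matching the --skip_step1..3 flags
    used in this article. Hypothetical helper, not part of GRAIL.
    """
    cmd = [
        "python", "-m", "grail.pipelines.recon_4dhoi",
        "--dataset", dataset,
        "--category", category,
        "--results_dir", results_dir,
    ]
    for step in skip_steps:
        if step not in (1, 2, 3):
            raise ValueError(f"only steps 1-3 can be skipped here, got {step}")
        cmd.append(f"--skip_step{step}")
    return cmd


cmd = build_recon_cmd("ComAsset", "cordless_drill", skip_steps=(1, 2, 3))
print(shlex.join(cmd))
```

To actually execute it, pass the list to subprocess.run with cwd set to the GRAIL repository root and check=True, so a failed stage stops the script instead of passing silently.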

The Six-Stage Map

recon_4dhoi is a stack of multiple models and post-processing components. A common beginner mistake is to treat it as one end-to-end model. In practice, each stage produces cache files for the next stage:

#	Stage	Main tool	Input	Downstream output
1	Human pose	GEM-SMPL or GEM-SOMA, WiLoR	RGB video	Human pose `.npz`, body and hand keypoints
2	Preprocess	SAM2, MoGe, camera/depth cache	Video, first-frame masks, camera	Per-frame human/object masks, depth/point cloud
3	Object pose	FoundationPose	RGB, object mask, OBJ mesh	Per-frame 6-DoF object pose
4	HOI optimization	`HOIOptimizer`	Human pose, object pose, mesh, depth, masks	Raw `hoi_data.pkl`
5	Filtering and post-processing	Thresholds, contact/motion checks	Raw `hoi_data.pkl`	`_valid` folder, mesh data, comparison videos
6	Visualization	PyTorch3D, ScenePic	Filtered HOI	Overlay MP4, top view, interactive HTML

The GRAIL docs give useful timing references on an L40S: human pose takes about 45 seconds per video, SAM2 plus MoGe preprocessing about 36 seconds, FoundationPose about 40 seconds, and HOI optimization is the heavy part at roughly 9-10 minutes per video. These are not guarantees, but they tell you where caching and sharding matter most.
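A quick back-of-the-envelope calculation with those reference numbers shows why the optimizer dominates. The figures are the ones quoted above; the variable names are only for illustration.

```python
# Per-video reference timings on an L40S, in seconds (from the figures above).
stage_seconds = {"human_pose": 45, "preprocess": 36, "object_pose": 40}
opt_seconds = (9 * 60, 10 * 60)  # HOI optimization: roughly 9 to 10 minutes

cacheable = sum(stage_seconds.values())            # stages 1-3 combined
totals = tuple(cacheable + s for s in opt_seconds)  # full run per video
shares = tuple(s / t for s, t in zip(opt_seconds, totals))

print(f"stages 1-3: {cacheable} s")
print(f"full run: {totals[0]}-{totals[1]} s")
print(f"optimization share: {shares[0]:.0%}-{shares[1]:.0%}")
```

Stages 1 to 3 add up to about two minutes, while optimization accounts for more than four fifths of the total. Caching stages 1 to 3 therefore saves some time per iteration, but the larger win is not repeating stage 4 unnecessarily.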

Stage 1: GEM-SMPL and WiLoR for Human Pose

Stage 1 calls run_human_pose_est. With the default manip_smplx.yaml, human_model.body_model is g1_smplx. This tells the pipeline that the output should use the G1-aligned SMPL-X body model rather than a generic body model.

GEM is NVIDIA's human motion model for monocular video. The GEM-X model card describes GEM as estimating 3D body pose from RGB video, including global motion trajectory and local body kinematics. In GRAIL, the reconstruction docs state that this stage uses GEM-SMPL for the body and WiLoR for the hands, fused per frame. WiLoR is an in-the-wild 3D hand localization and reconstruction model; its paper and repo emphasize real-time hand localization and MANO-based 3D reconstruction. For manipulation, hands are not a cosmetic detail. If the wrist, fingers, or hand keypoints drift, the optimizer may force contact in the wrong place.

The main stage-1 outputs usually live under:

results/generation/hmr_smplx/
results/generation/hmr_smplx_cache/

The code saves both motion_global and motion_incam. A simple mental model is:

motion_incam: pose in camera coordinates, useful for reprojection and keypoint tracking.
motion_global: pose with a global trajectory, required for 4D motion over time.
keypoint cache: useful for debugging whether hands and feet follow the video.
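The relationship between the two representations is a per-frame rigid transform. The sketch below assumes you have a camera-to-world rotation and translation for each frame; the function and variable names are hypothetical, and GRAIL's own conventions (axis order, world origin) may differ.

```python
import numpy as np


def cam_to_world(points_cam, R_c2w, t_c2w):
    """Map (T, J, 3) camera-frame points to the world frame.

    R_c2w: (T, 3, 3) camera-to-world rotations, t_c2w: (T, 3) camera positions.
    Applies X_world = R @ X_cam + t for every frame and joint.
    """
    rotated = np.einsum("tij,tkj->tki", R_c2w, points_cam)
    return rotated + t_c2w[:, None, :]


# A joint that sits 2 m in front of the camera in every frame...
T = 3
joints_cam = np.tile(np.array([[[0.0, 0.0, 2.0]]]), (T, 1, 1))  # (T, 1, 3)
# ...while the camera slides 10 cm per frame along world x without rotating.
R = np.tile(np.eye(3), (T, 1, 1))
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])

joints_world = cam_to_world(joints_cam, R, t)
print(joints_world[:, 0])
```

The joint is constant in camera coordinates but moves in world coordinates, which is why a global trajectory is needed for motion over time while camera coordinates are the natural space for reprojection.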

A common failure mode is starting from a Kling video where the person is cropped or the hands are hidden, then expecting reconstruction to recover everything. If stage 1 does not produce stable keypoints, later stages cannot magically repair it. If recon_result.mp4 shows the skeleton drifting away from the person from the beginning, go back to the 2D video or the conditioning render from part 2.

Stage 2: SAM2 Mask Tracking and MoGe Depth

Stage 2 does two things: mask tracking and depth estimation. In the code, preprocess_masks starts from the first-frame object and human masks, then tracks them across the video. SAM2 is a good fit because Meta introduced it as a unified segmentation model for images and videos that can segment objects and follow them through frames. In GRAIL, SAM2 turns RGB video into two practical signals: object pixels and human pixels.

After masks, the pipeline estimates depth. GRAIL uses MoGe for monocular geometry estimation when depth needs to be inferred from images or video. MoGe is Microsoft's open-domain monocular geometry project. In this pipeline, depth is not used alone; it is combined with camera intrinsics, masks, and object meshes to provide geometric constraints for the optimizer.
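Combining depth with intrinsics and a mask amounts to back-projecting the masked pixels into a camera-frame point cloud. The sketch below uses the standard pinhole model; the function name and toy values are illustrative, and it is not taken from GRAIL's code.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Turn masked depth pixels into an (N, 3) camera-frame point cloud.

    depth: (H, W) metric depth, mask: (H, W) bool.
    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 4x4 frame: everything is 2 m away, and the object mask covers one pixel.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 3] = True

points = backproject(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(points)
```

A point cloud of this kind is what lets a depth-based loss compare observed geometry against the posed object mesh, instead of relying on a 2D silhouette alone.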

Important cache files:

results/generation/4dhoi_recon_cache/masks/<video_id>.npz
results/generation/4dhoi_recon_cache/depth/<video_id>.pt

With --verbose, the pipeline can export mask and depth debug artifacts. Turn it on when adding a new object, when the object is heavily occluded, or when the generated video lets the hand cover the object for many frames. Mask errors are usually visible immediately: the object mask jumps to the hand, disappears for a few frames, or includes background. Depth errors can be harder to see, but they usually pull the object or human along the z axis.

Use this checklist:

Symptom	Common cause	Action
Object mask disappears after contact	Occlusion, object too small	Render closer, increase object size, choose a clearer video
Human mask absorbs the object	Similar color or too much body-object overlap	Check first-frame masks, reduce occlusion in video prompt
Object floats or sinks in depth	Monocular depth lacks a strong anchor	Use GT depth if available, inspect camera/depth from part 2
Cache ignores your edited video	`--skip_done` reused old cache	Delete the relevant cache or disable `--skip_done`
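The first two symptoms in the checklist can be screened automatically before you open any debug video. The sketch below assumes you have already loaded the object masks as a boolean array of shape (T, H, W); the array layout, the function name, and the area-ratio threshold are assumptions, since the article does not specify the keys inside the .npz cache.

```python
import numpy as np


def find_mask_problems(masks, min_pixels=1, max_area_ratio=3.0):
    """Flag frames where a tracked mask vanishes or changes size abruptly.

    masks: (T, H, W) bool. Returns (empty_frames, jump_frames).
    A jump is a frame whose mask area differs from the previous frame's
    area by more than max_area_ratio (in either direction).
    """
    area = masks.reshape(masks.shape[0], -1).sum(axis=1)
    empty = np.flatnonzero(area < min_pixels).tolist()
    jumps = []
    for t in range(1, len(area)):
        prev, cur = int(area[t - 1]), int(area[t])
        if prev >= min_pixels and cur >= min_pixels:
            if max(prev, cur) / min(prev, cur) > max_area_ratio:
                jumps.append(t)
    return empty, jumps


# Synthetic example: the mask disappears at frame 2 and balloons at frame 4.
masks = np.zeros((5, 8, 8), dtype=bool)
for t in (0, 1, 3):
    masks[t, 2:4, 2:4] = True   # 4 pixels
masks[4, 0:4, 0:5] = True       # 20 pixels, e.g. the mask jumped to the hand

empty_frames, jump_frames = find_mask_problems(masks)
print("empty:", empty_frames, "jumps:", jump_frames)
```

A sudden area jump is only a hint: the object may simply have moved closer to the camera. Treat flagged frames as the ones to look at first, not as automatic failures.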

Stage 3: FoundationPose for 6-DoF Object Tracking

Stage 3 runs FoundationPose to track the object pose. FoundationPose is NVIDIA's model for 6D pose estimation and tracking of novel objects, including a model-based setup when a CAD mesh is available. GRAIL has object meshes from part 1, so object tracking does not need to infer category shape from scratch.

6-DoF means six degrees of freedom: three for translation and three for rotation. For robot manipulation, this signal is central. If the drill is lifted but the tracker keeps it on the table, downstream policies learn the wrong object motion. If rotation around the object's long axis flips, grasp and contact may stop matching the visual interaction.
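Both failure modes show up as numbers if you look at frame-to-frame pose changes. The sketch below assumes the object trajectory is available as a stack of 4x4 rigid transforms; the function names are hypothetical, and the layout of poses_in_cam.pkl may differ.

```python
import numpy as np


def pose_deltas(poses):
    """Frame-to-frame translation (same units as poses) and rotation (degrees).

    poses: (T, 4, 4) rigid transforms. The rotation step is the angle of
    R_t^T @ R_{t+1}, recovered from its trace.
    """
    R, t = poses[:, :3, :3], poses[:, :3, 3]
    trans_step = np.linalg.norm(np.diff(t, axis=0), axis=1)
    R_rel = np.einsum("tji,tjk->tik", R[:-1], R[1:])
    cos = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    return trans_step, np.degrees(np.arccos(cos))


def make_pose(yaw_deg, translation):
    """Build a 4x4 transform from a rotation about z and a translation."""
    a = np.radians(yaw_deg)
    pose = np.eye(4)
    pose[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a), np.cos(a), 0.0],
                    [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


poses = np.stack([
    make_pose(0.0, [0.0, 0.0, 0.00]),
    make_pose(10.0, [0.0, 0.0, 0.05]),   # lifted 5 cm, turned 10 degrees
    make_pose(190.0, [0.0, 0.0, 0.05]),  # sudden 180-degree flip
])
trans_step, rot_step = pose_deltas(poses)
print(trans_step, rot_step)
```

A translation step that stays near zero while the video shows a lift, or a rotation step close to 180 degrees between adjacent frames, are both worth inspecting in the FoundationPose debug output.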

Stage 3 reads the mesh from:

results/generation/mesh/<dataset>/<category>/*.obj

and writes poses to:

results/generation/foundation_pose_output/<video_id>/pose_estimation_output/poses_in_cam.pkl

In manip_smplx.yaml:

obj_pose_tracking:
  foundation_pose_debug: 2
  crop_image: false
  interpolation_factor: 1

foundation_pose_debug: 2 keeps a useful debug level. crop_image: false means the pipeline does not crop the image before tracking; keeping the full frame is reasonable when the GRAIL render is clean. interpolation_factor: 1 means no additional frame interpolation.

The field you must understand is static versus dynamic:

pipeline:
  is_static_obj: false

For manipulation and pickup, the object is expected to move, so is_static_obj: false lets FoundationPose track it. For terrain or sitting, the object or scene is often static; locomotion configs use static-object mode and may bypass FoundationPose. If you set this incorrectly, stage 5 will either reject the result or keep the wrong kind of result.

Stage 4: Multi-Stage `HOIOptimizer`

This is the reconstruction core. The first three stages provide separate estimates: human pose, masks/depth, and object pose. HOIOptimizer turns them into a more consistent HOI trajectory by optimizing several variables together.

Here is a shortened version of the config:

optimization:
  opt_stage_specs:
    init_opt:
      opt_vars:
        human_trans_global: { lr: 0.003 }
        human_trans_res: { lr: 0.003 }
        human_pose_res: { lr: 0.0001 }
        obj_t_res: { lr: 0.003 }
        obj_R_res: { lr: 0.0001 }
      niter: 400
      loss_cfg:
        keypoint_tracking: { weight: 0.3, beta: 0.1 }
        human_smoothness: { weight: 30.0, beta: 0.1 }
        human_pose_reg: { weight: 100.0 }
        human_foot_contact: { weight: 10.0 }
        obj_smoothness: { weight: 50.0 }
        depth_pointcloud:
          weight: 100.0
          num_gt_samples: 3000
          trim_pct: 0.2
          depth_tol: 0.02
    contact_opt:
      niter: 200
      loss_cfg:
        contact:
          weight: 30000.0
          depth_only: true
          max_contact_dist: 0.15

init_opt pulls the human, object, and depth signals into a plausible shared coordinate system. It should not be too aggressive with pose residuals, because if the human pose is bent too much just to fit the object, the motion becomes unnatural. The main losses are:

Loss	Meaning
`keypoint_tracking`	Keeps the rendered pose aligned with 2D video keypoints
`human_smoothness`	Reduces frame-to-frame human jitter
`human_pose_reg`	Keeps pose close to the prior
`human_foot_contact`	Encourages plausible foot-ground behavior
`verts_tracking`	Uses vertex/mask alignment to fit the image
`obj_smoothness`	Reduces object trajectory jumps
`depth_pointcloud`	Uses depth/point cloud constraints to reduce z-scale errors
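To see how a smoothness term and its weight interact, here is a minimal sketch. It assumes a smoothness penalty defined as the mean squared second difference of a trajectory and a total objective that is a plain weighted sum; GRAIL's actual loss definitions (including how beta is used) are not shown in this article, so treat this as an illustration of the idea only.

```python
import numpy as np


def smoothness(traj):
    """Mean squared discrete acceleration of a (T, D) trajectory."""
    acc = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]
    return float(np.mean(np.sum(acc ** 2, axis=1)))


def total_loss(terms, weights):
    """Weighted sum of named loss terms (assumed form of the objective)."""
    return sum(weights[name] * value for name, value in terms.items())


# A constant-velocity object path, and the same path with 1 cm jitter in x.
steps = np.arange(6) * 0.5
line = np.stack([steps, np.zeros(6), np.zeros(6)], axis=1)
jitter = line.copy()
jitter[:, 0] += 0.01 * np.array([1, -1, 1, -1, 1, -1])

weights = {"obj_smoothness": 50.0, "human_smoothness": 30.0}  # from the config
terms = {"obj_smoothness": smoothness(jitter), "human_smoothness": smoothness(line)}

print("smooth path:", smoothness(line))
print("jittery path:", smoothness(jitter))
print("weighted total:", total_loss(terms, weights))
```

Constant-velocity motion costs nothing under this penalty, while alternating jitter is punished. That is also why an excessive smoothness weight can flatten genuine fast motion such as a quick lift.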

contact_opt runs after contact labels are available. The GRAIL docs note that stage 4 uses OpenAI vision calls inside grail/core/contact_label.py to detect contact joints by interval, with gpt-4o as the default model. The point is not to let an LLM control a robot. It helps label when and where contact likely occurs, and the optimizer then uses a contact loss to bring the relevant body or hand points close to the object during plausible intervals.
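The interplay between contact labels and max_contact_dist can be sketched as follows. This is a simplified, hypothetical loss: it assumes one hand point per frame, penalizes the squared distance to the nearest object point during labeled contact frames, and ignores pairs farther apart than max_contact_dist. The real contact term in HOIOptimizer may be defined differently.

```python
import numpy as np


def contact_loss(hand_pts, obj_pts, in_contact, max_contact_dist=0.15):
    """Sum of squared hand-to-object distances over active contact frames.

    hand_pts: (T, 3), obj_pts: (T, N, 3), in_contact: (T,) bool.
    A frame is active if it is labeled as contact AND the nearest object
    point is within max_contact_dist (assumed gating rule).
    """
    dist = np.linalg.norm(obj_pts - hand_pts[:, None, :], axis=2).min(axis=1)
    active = in_contact & (dist <= max_contact_dist)
    return float(np.sum(dist[active] ** 2)), dist


# One object point at the origin; the hand approaches along x over 4 frames.
obj_pts = np.zeros((4, 1, 3))
hand_pts = np.array([[0.50, 0, 0], [0.30, 0, 0], [0.10, 0, 0], [0.05, 0, 0]])
in_contact = np.array([False, True, True, True])  # e.g. from contact labeling

loss, dist = contact_loss(hand_pts, obj_pts, in_contact)
print("distances:", dist, "loss:", loss)
```

Frame 1 is labeled as contact but sits 30 cm away, so under this gating it contributes nothing. That is the kind of case where raising max_contact_dist changes the result, and also where a wrong contact label would start pulling the hand toward the object too early.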

When should you tune init_opt?

Symptom	Tuning hint
Human drifts from the video but object is stable	Slightly increase `keypoint_tracking`, but inspect stage 1 first
Human motion jitters	Increase `human_smoothness`, but not excessively
Object jumps frame to frame	Increase `obj_smoothness`, inspect FoundationPose
Object is wrong in depth	Inspect depth first; increase `depth_pointcloud` only if masks/depth are reliable

When should you tune contact_opt?

Symptom	Tuning hint
Hand approaches the object but never touches	Slightly increase contact weight or `max_contact_dist`
Hand is pulled through the object	Reduce contact weight, inspect penetration/filtering
Contact starts at the wrong time	Rerun contact labeling or inspect the video prompt
Object follows the hand before contact	Check `obj_precontact_reg` and contact intervals

Beginner rule: do not change every weight at once. Pick one video, rerun stage 4, compare recon_comparison.mp4, then adjust again.

Stage 5: Filtering and Post-Processing

Optimization can produce a trajectory that looks superficially plausible but is not useful for training. Stage 5 filters results. The GRAIL docs list the main quality checks: human-position error, mask alignment, keypoint tracking, contact penalty, penetration, and motion magnitude.

In manip_smplx.yaml:

filtering:
  camera_trans_thr: 0.1
  object_mask_tol: 0.5
  total_mask_tol: 0.3
  human_static_thr: 0.01
  min_frames: null
  filter_object_motion: "dynamic_only"
  object_static_thr: 0.02

filter_object_motion: dynamic_only is important. This article covers manipulation and pickup, so the pipeline should reject reconstructions where the object barely moves. If the video only shows a person standing near a drill, or FoundationPose fails and the object trajectory stays static, that result is not useful for a manipulation policy. In contrast, for terrain locomotion in part 4, a static object or static terrain is correct; locomotion configs use static_only.
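The dynamic_only and static_only modes can be pictured as a threshold test on object displacement. The sketch below assumes the metric is the maximum displacement from the first frame and that object_static_thr is in metres; both are assumptions, because the article does not spell out how the pipeline measures motion.

```python
import numpy as np


def object_is_dynamic(obj_trans, static_thr=0.02):
    """True if the object moves more than static_thr from its first position.

    obj_trans: (T, 3) object translations, assumed to be in metres.
    """
    disp = np.linalg.norm(obj_trans - obj_trans[0], axis=1)
    return bool(disp.max() > static_thr)


def passes_motion_filter(obj_trans, mode="dynamic_only", static_thr=0.02):
    """Apply the assumed meaning of filter_object_motion."""
    dynamic = object_is_dynamic(obj_trans, static_thr)
    if mode == "dynamic_only":
        return dynamic
    if mode == "static_only":
        return not dynamic
    raise ValueError(f"unknown mode: {mode}")


lifted = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.05], [0.0, 0.0, 0.20]])
resting = np.array([[0.0, 0.0, 0.0], [0.001, 0.0, 0.0], [0.0, 0.002, 0.0]])

print("lifted drill:", passes_motion_filter(lifted))
print("resting drill:", passes_motion_filter(resting))
print("resting, static_only:", passes_motion_filter(resting, mode="static_only"))
```

Under this reading, millimetre-level tracker noise on a resting object stays below the threshold, so a failed track that never leaves the table is rejected for manipulation but would be accepted by a static_only locomotion config.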

When a result passes filtering, the pipeline packages it as:

results/generation/4dhoi_recon_smplx_valid/<dataset>/<category>/<video_id>/
  hoi_data/
    hoi_data.pkl
  mesh_data/
    model.obj
    model.mtl
    ...
  result_vis/
    input.mp4
    recon_result.mp4
    recon_result_top_view.mp4
    recon_comparison.mp4
    recon_result.html

If a result fails, you will usually see an invalid marker in the raw output folder. Do not simply delete the marker to force data into training. Filtering failures usually have a cause: poor masks, wrong human pose, static object motion, physically implausible contact, or a too-short interaction.

Stage 6: ScenePic and PyTorch3D Visualization

Visualization is not decorative. It is the fastest way to decide whether a trajectory is worth retargeting. The GRAIL docs state that visualization uses PyTorch3D top-down and side-by-side renders, plus a ScenePic HTML viewer. PyTorch3D provides 3D rendering and differentiable rendering components in the PyTorch/CUDA ecosystem. ScenePic is a lightweight 3D visualization library that exports HTML, so you can inspect the result in a browser without opening Blender.

Open these three files after each run:

File	What to inspect
`result_vis/recon_result.mp4`	Whether the human/object meshes follow the input video
`result_vis/recon_comparison.mp4`	Side-by-side input versus reconstruction
`result_vis/recon_result.html`	3D contact, penetration, and depth errors from free camera views

recon_result_top_view.mp4 is especially helpful for errors hidden by the camera view. A reconstruction can look correct in 2D while being wrong in depth: from the original camera it looks fine, but from the top view the hand may be 30 cm away from the object. This is why GRAIL uses camera/depth/mesh context instead of fitting only a 2D silhouette.
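The depth ambiguity is easy to reproduce with a pinhole camera. In the sketch below, the intrinsics and the two 3D points are made-up values chosen so that both points lie on the same viewing ray.

```python
import numpy as np


def project(point, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return np.array([fx * x / z + cx, fy * y / z + cy])


obj = np.array([0.10, 0.00, 2.00])   # object 2.0 m in front of the camera
hand = obj * (2.30 / 2.00)           # same viewing ray, 30 cm deeper

print("object pixel:", project(obj))
print("hand pixel:  ", project(hand))
print("3D gap (m):  ", np.linalg.norm(hand - obj))
print("top view (x, z):", obj[[0, 2]], hand[[0, 2]])
```

Both points land on the same pixel, so an overlay from the original camera cannot distinguish them, while the (x, z) coordinates used by a top-down view immediately reveal the gap of roughly 30 cm.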

Inspecting `hoi_data/hoi_data.pkl`

hoi_data.pkl is the main artifact. It is a Python pickle file, so do not open it in a text editor. Use a small script:

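import pickle
from pathlib import Path

# Replace <video_id> with the name of a reconstructed video before running.
path = Path("results/generation/4dhoi_recon_smplx_valid/ComAsset/cordless_drill/<video_id>/hoi_data/hoi_data.pkl")

# pickle.load can execute arbitrary code, so only load files you produced yourself.
with path.open("rb") as f:
    hoi = pickle.load(f)

# List the top-level groups first; key names may differ between code versions.
print(hoi.keys())
print(hoi["human_data"].keys())
print(hoi["object_data"].keys())
print(hoi["meta"].keys())

# The first axis of both arrays is expected to be time (one entry per frame).
poses = hoi["human_data"]["poses"]
obj_pose = hoi["object_data"]["object_poses"]
print("num frames:", poses.shape[0])
print("human poses:", poses.shape)
print("object poses:", obj_pose.shape)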

Exact key names may change with code versions, but the usual structure is:

human_data: pose, translation, body parameters, keypoints, or joints.
object_data: object pose or trajectory per frame.
meta: paths to mesh, raw object pose file, mask cache, and render config.
contact: contact labels or post-processed contact points, when available.

You are looking for three things:

The frame count is reasonable, non-empty, and not unexpectedly short.
Human and object arrays share the same time dimension.
meta.obj_path, masks_cache_file, and render_config_file point to the correct artifacts for this video.
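The three checks above can be turned into a small function. This sketch reuses the key names from the script earlier in this section, which may not match your code version, and min_frames=10 is an arbitrary illustrative threshold rather than a GRAIL default.

```python
from pathlib import Path

import numpy as np


def check_hoi(hoi, min_frames=10):
    """Return a list of human-readable problems found in a loaded hoi dict."""
    problems = []
    poses = np.asarray(hoi["human_data"]["poses"])
    obj = np.asarray(hoi["object_data"]["object_poses"])

    if poses.shape[0] < min_frames:
        problems.append(f"only {poses.shape[0]} frames (expected >= {min_frames})")
    if poses.shape[0] != obj.shape[0]:
        problems.append(
            f"time dimension mismatch: human {poses.shape[0]} vs object {obj.shape[0]}"
        )
    for key in ("obj_path", "masks_cache_file", "render_config_file"):
        value = hoi["meta"].get(key)
        if value is None or not Path(value).exists():
            problems.append(f"meta[{key!r}] missing or not on disk: {value}")
    return problems


# Synthetic example with a frame mismatch and paths that do not exist.
fake_hoi = {
    "human_data": {"poses": np.zeros((30, 63))},
    "object_data": {"object_poses": np.zeros((28, 4, 4))},
    "meta": {
        "obj_path": "/nonexistent/model.obj",
        "masks_cache_file": "/nonexistent/masks.npz",
        "render_config_file": "/nonexistent/render.json",
    },
}
problems = check_hoi(fake_hoi)
for p in problems:
    print("-", p)
```

An empty list means the pickle is structurally sound, which is a precondition for a usable result rather than proof of one.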

If hoi_data.pkl passes these checks but visualization looks wrong, trust the visualization. A good dataset is inspectable and explainable, not merely a pickle with the expected keys.

Tuning `manip_smplx.yaml`

New users should remember four fields:

Field	Default role	When to change it
`optimization.opt_stage_specs.init_opt`	Initial alignment optimization	When caches are good but alignment is still off
`optimization.opt_stage_specs.contact_opt`	Contact-aware optimization	When hands and object are close but contact is wrong
`filtering.filter_object_motion`	`dynamic_only`	Keeps manipulation data focused on moving objects
`pipeline.is_static_obj`	`false`	Set `true` only for terrain or static-object scenes

For normal dynamic manipulation:

filtering:
  filter_object_motion: "dynamic_only"
  object_static_thr: 0.02

pipeline:
  is_static_obj: false

For a deliberate static-object debug run:

python -m grail.pipelines.recon_4dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --results_dir results \
  --is_static_obj

Do not use static mode for real pickup data. It can help debug masks and depth, but it will not produce the dynamic object trajectory needed for manipulation.

A Practical Debug Order

Do not start by raising loss weights. Debug artifacts in order:

1. Open the input video under results/generation/videos_kling.
2. Inspect stage-1 body/keypoint caches if the human drifts.
3. Inspect mask .npz or debug masks if the object disappears.
4. Inspect poses_in_cam.pkl through FoundationPose visualization if object pose is wrong.
5. Rerun stage 4 after changing init_opt or contact_opt.
6. Loosen filters only after you understand why filtering failed.
7. Open recon_comparison.mp4 and recon_result.html before sending data to retargeting.
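Before inspecting anything by eye, it helps to know which artifact is missing first. The sketch below walks the cache locations mentioned in this article in stage order; the helper name is hypothetical, and the exact file layout inside each folder may differ in your checkout.

```python
import tempfile
from pathlib import Path


def first_missing_artifact(results_dir, video_id):
    """Return (label, path) for the earliest missing artifact, or None."""
    gen = Path(results_dir) / "generation"
    checks = [
        ("input video folder", gen / "videos_kling"),
        ("stage 1 human pose", gen / "hmr_smplx"),
        ("stage 2 masks", gen / "4dhoi_recon_cache" / "masks" / f"{video_id}.npz"),
        ("stage 2 depth", gen / "4dhoi_recon_cache" / "depth" / f"{video_id}.pt"),
        ("stage 3 object pose",
         gen / "foundation_pose_output" / video_id
         / "pose_estimation_output" / "poses_in_cam.pkl"),
    ]
    for label, path in checks:
        if not path.exists():
            return label, path
    return None


# Demo on a temporary tree where only the video and stage-1 folders exist.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "generation" / "videos_kling").mkdir(parents=True)
    (Path(tmp) / "generation" / "hmr_smplx").mkdir(parents=True)
    missing = first_missing_artifact(tmp, "ComAsset/cordless_drill/demo_clip")
    print(missing[0])
```

Existence is only the first gate. A cache file that exists but was produced from an older video is the stale-cache case from the stage 2 checklist, and it still needs a manual check.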

For larger batches, shard the run:

python -m grail.pipelines.recon_4dhoi \
  --dataset ComAsset \
  --results_dir results \
  --job_chunk_idx 0 \
  --num_job_chunks 8

The GRAIL docs state that a typical 8-chunk run can cover about 24 videos in roughly 35 minutes of wall-clock time. The important idea is caching: once stages 1-3 are stable, you can iterate on stages 4-5 much faster than rerunning the full pipeline.
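That figure is consistent with the per-stage timings quoted earlier, as a short calculation shows. The round-robin split below is a hypothetical stand-in for however the pipeline assigns videos to chunks, and the estimate assumes the 8 chunks run in parallel while videos within a chunk run one after another.

```python
def chunk(items, idx, num_chunks):
    """Hypothetical round-robin split: chunk idx takes every num_chunks-th item."""
    return items[idx::num_chunks]


videos = [f"video_{i:02d}" for i in range(24)]
num_chunks = 8
per_chunk = [len(chunk(videos, i, num_chunks)) for i in range(num_chunks)]

# Per-video time: stages 1-3 (45 + 36 + 40 s) plus 9 to 10 minutes of optimization.
per_video_min = tuple((45 + 36 + 40 + opt * 60) / 60 for opt in (9, 10))
wall_clock_min = tuple(max(per_chunk) * m for m in per_video_min)

print("videos per chunk:", per_chunk)
print(f"estimated wall clock: {wall_clock_min[0]:.0f}-{wall_clock_min[1]:.0f} min")
```

Three videos per chunk at roughly 11 to 12 minutes each gives about 33 to 36 minutes, in line with the documented figure.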

Conclusion

grail.pipelines.recon_4dhoi does more than reconstruct a video. It bridges video foundation model output and robot-usable motion data. GEM-SMPL/WiLoR provide the human and hand priors. SAM2/MoGe turn RGB into masks and depth. FoundationPose tracks the known object mesh in 6-DoF. HOIOptimizer combines these signals into a contact-aware trajectory. Filtering rejects weak results. ScenePic/PyTorch3D lets you inspect the reconstruction before retargeting.

The practical lesson is simple: trust artifacts more than logs. hoi_data.pkl contains the data, but recon_result.mp4, recon_comparison.mp4, and recon_result.html tell you whether that data is worth using. Once reconstruction is stable, part 5 will convert the human/object trajectory to Unitree G1, and part 6 will package the motion for training and export.

Nguyễn Anh Tuấn
