2D HOI Videos with Blender and Kling

Run gen_2dhoi for cordless_drill: Bullet settling, scale/prompt refinement, scene rendering, and Kling videos.

Nguyễn Anh Tuấn · June 7, 2026 · 15 min read

In part 1, we treated 3D objects and terrain as the input layer of GRAIL. Part 2 moves to the next stage: generating 2D human-object interaction (2D HOI) videos from a known 3D scene. This is the bridge between static assets and the metric 4D reconstruction covered in part 3: Blender renders the conditioning frame, camera parameters, and depth; Kling generates a short human-object interaction video; the render metadata is kept so downstream stages do not need to infer scale and camera geometry from an unconstrained internet video.

The NVIDIA GRAIL project page describes the core idea clearly: instead of starting from arbitrary in-the-wild videos, GRAIL starts from a fully specified 3D configuration where object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation. The official 2D HOI documentation describes this stage as a chain of physics simulation, multi-view Blender rendering, and a video foundation model, with Kling AI as the default video backend. In practical terms, the 2D video is not the final dataset. It is a controlled visual prior for 4D HOI recovery.

What You Will Run

By the end of this article, you should be able to:

  • run a ComAsset/cordless_drill smoke test;
  • understand the important fields in configs/gen_2dhoi/manipulation.yaml;
  • know what skip_step1 and skip_step2 control;
  • inspect the output folders initial_states, asset_renders, cameras, depth_maps, and videos_kling;
  • switch to terrain_stairs.yaml, sitting.yaml, pickup_table.yaml, or pickup_ground.yaml when the task changes.

The central command is:

python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --results_dir results \
  --video_model_api kling-ai

Run it from the root of the GRAIL repository. The official docs explicitly call this a package entrypoint; there is no project-root wrapper script for this stage. In the standard Docker workflow, you pull the GRAIL image, mount the repository, install Blender and checkpoints inside the container, activate the grail Conda environment, and source .env so the process can see OPENAI_API_KEY, KLING_ACCESS_KEY, KLING_SECRET_KEY, and any other required tokens.
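Before launching, it can help to confirm that the credentials named above are actually visible to the process. The following is a minimal sketch, not part of GRAIL: the helper name missing_credentials is hypothetical, and only the three variable names come from the text above.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the article; your setup may require more tokens.
REQUIRED_KEYS = ("OPENAI_API_KEY", "KLING_ACCESS_KEY", "KLING_SECRET_KEY")


def missing_credentials(
    required: Iterable[str] = REQUIRED_KEYS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling missing_credentials() inside the container should return an empty list. If it does not, note that a plain source .env only sets shell variables; they reach the Python process only when the file uses export or is sourced after set -a.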

Reading manipulation.yaml

configs/gen_2dhoi/manipulation.yaml is the default config for tabletop or handheld manipulation. The top of the file sets the character, asset paths, output root, and object config:

character: kid
character_dir: data/characters
texture_dir: data/characters
results_dir: results_collection/manipulation001
object_config: configs/objects/comasset.yaml
verbose: false
skip_step1: false
skip_step2: true
skip_step3: false
skip_step4: false
skip_done: false

For beginners, the most important default is skip_step2: true. That does not mean scale is unimportant. It means most objects in configs/objects/comasset.yaml already ship with hand-tuned object scales. For example, cordless_drill uses obj_scale: [2.5, 2.5, 2.5] and scene: indoor2-manipulation. The smoke test therefore does not need to spend time and OpenAI API calls on scale search.

In contrast, skip_step1: false means the pipeline does run Blender Bullet settling by default. A generated or catalog object may be exported in an arbitrary pose. If you render it directly, it can float, clip through the table, or rest on an unnatural edge. Step 1 drops the object from a small height, lets Bullet settle the rigid body, and saves the final stable orientation for later rendering.

The key parameters are:

| Field | Default in manipulation.yaml | Practical meaning |
| --- | --- | --- |
| skip_step1 | false | Whether to skip Bullet settling for stable object orientation |
| skip_step2 | true | Whether to skip render plus OpenAI vision scale optimization |
| rendering.num_rand_scenes | 3 | Number of random camera/lighting variants per object |
| rendering.samples | 32 | Blender render samples, trading quality against speed |
| rendering.width / height | 1280 / 720 | Conditioning frame resolution |
| video.kling_model_name | kling-v2-5-turbo | Kling model passed to the image-to-video adapter |
| video.duration | "5" | Video segment length in seconds |
| video.kling_mode | pro | Kling generation mode |
| video.video_max_retries | 100 | Maximum retries for a valid video response |
| video.video_retry_wait | 30 | Seconds to sleep between retries |

If you only want to render images and inspect the scene without calling Kling, skip step 4:

python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --skip_step1 \
  --skip_step2 \
  --skip_step4 \
  --results_dir results

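This is the right mode when you do not have Kling credentials yet or when you are debugging camera, scale, and placement. Step 3 can still produce rendered images, camera files, and depth maps. Note that the command also passes --skip_step1, which assumes a cached initial state from an earlier run; on a first run, drop that flag so Bullet settling executes.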
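To reason about which steps a given flag combination leaves active, and which credentials those steps need, a small planning helper can be sketched. This is an illustration only: plan_run is a hypothetical name, and the key requirements encode what this article states (step 2 uses OpenAI; step 4 uses Kling, plus OpenAI unless prompt refinement is skipped), not a reading of the GRAIL source.

```python
from typing import Dict, List, Set, Tuple


def plan_run(flags: Dict[str, bool]) -> Tuple[List[int], Set[str]]:
    """Return (steps that will run, credential names those steps need)."""
    # A step runs unless its skip_stepN flag is set to True.
    steps = [n for n in (1, 2, 3, 4) if not flags.get(f"skip_step{n}", False)]
    keys: Set[str] = set()
    if 2 in steps:
        keys.add("OPENAI_API_KEY")  # vision-based scale evaluation
    if 4 in steps:
        keys.update({"KLING_ACCESS_KEY", "KLING_SECRET_KEY"})
        if not flags.get("skip_prompt_refinement", False):
            keys.add("OPENAI_API_KEY")  # prompt refinement
    return steps, keys
```

With the render-only flags above, plan_run({"skip_step1": True, "skip_step2": True, "skip_step4": True}) leaves only step 3 and an empty credential set, which matches the claim that this mode needs no external API.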

Step 1: Blender Bullet Settling

Step 1 calls a Blender simulation script to generate the initial object state. The default simulation block is:

simulation:
  drop_height: 0.1
  settling_time: 5.0
  initial_rotation_perturbation: 2.0
  seed: 42
  save_usd: false
  use_initial_state: true

drop_height: 0.1 means the object is dropped from slightly above its placement target, not from far above the scene. settling_time: 5.0 gives Bullet five seconds to let the object fall, collide, rotate, and stop. initial_rotation_perturbation: 2.0 adds controlled variation to the initial orientation, avoiding a perfectly artificial pose while keeping the simulation manageable.
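The role of seed together with initial_rotation_perturbation can be illustrated with a seeded sampler. This is a sketch under assumptions, not the GRAIL simulation script: the function name is hypothetical, the uniform per-axis sampling is a guess at one reasonable scheme, and the unit of the perturbation is not stated in this article.

```python
import random
from typing import Tuple


def perturbed_rotation(
    seed: int = 42, max_perturbation: float = 2.0
) -> Tuple[float, ...]:
    """Sample a small per-axis rotation offset, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    # One offset per axis (x, y, z), each within +/- max_perturbation.
    return tuple(rng.uniform(-max_perturbation, max_perturbation) for _ in range(3))
```

The point of the fixed seed is that two runs with the same config start from the same perturbed pose, so a cached settled state stays comparable across reruns.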

The output lives under:

results/generation/initial_states/

Think of this folder as the cache for "how this object rests stably." Once it exists and you trust it, you can skip step 1 on later runs:

python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --skip_step1 \
  --skip_step2 \
  --results_dir results

Only use --skip_step1 when the cached initial state exists and still matches your object mesh and scene. If you regenerate the mesh, change the object rotation, alter contact surfaces, or delete the cache, run step 1 again. A common failure mode is a video where the object starts in a physically strange orientation because the pipeline reused a stale or missing initial state.
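One cheap guard against a stale cache is to compare modification times before passing --skip_step1. The sketch below is an assumption-heavy illustration: the function name is hypothetical, and you must supply the actual state and mesh paths for your setup, since this article does not specify the file layout inside initial_states.

```python
from pathlib import Path


def initial_state_is_usable(state_file: Path, mesh_file: Path) -> bool:
    """True if the cached state exists and is at least as new as the mesh."""
    if not state_file.is_file() or not mesh_file.is_file():
        return False
    # A mesh modified after the state was written means the cache is stale.
    return state_file.stat().st_mtime >= mesh_file.stat().st_mtime
```

This only catches a regenerated mesh or a deleted cache. Changes to the scene, the contact surfaces, or the configured object rotation do not touch the mesh file, so they still require a manual decision to rerun step 1.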

Step 2: OpenAI Scale Evaluation

Step 2 determines object scale using iterative rendering plus chat-vision evaluation. The pipeline renders a temporary scene, asks the model whether the object looks small, big, or correct relative to the person and room, then updates the scale using exponential expansion and binary search. The default search budget is:

scale:
  max_iterations: 5

In manipulation.yaml, this step is skipped because ComAsset objects already have scales. The relevant object config contains:

objects:
  cordless_drill:
    obj_scale: [2.5, 2.5, 2.5]
    scene: indoor2-manipulation

When should you run step 2? Do it when you add a new object without a reliable scale, when a Hunyuan3D mesh looks clearly too small or too large next to the character, or when the object category differs from the existing ComAsset assumptions. Step 2 needs OPENAI_API_KEY; without it, the vision evaluation cannot happen.

A practical beginner rule is:

| Situation | Recommended action |
| --- | --- |
| Running the cordless_drill smoke test | Keep skip_step2: true |
| Changing render/camera settings but keeping the same object config | Keep skip_step2: true |
| Adding a new object with unknown scale | Run step 2 once and inspect obj_scales |
| The object is visibly wrong but OpenAI returns "correct" | Manually fix configs/objects/*.yaml and render again |
| Running without external API calls | Use known scales and keep skip_step2 enabled |

Step 2 writes to:

results/generation/obj_scales/

This folder is not one of the five output folders this article focuses on, but it explains why skip_step2 is safe for known objects and risky for new ones.

Step 3: Render Scene, Camera, and Depth

Step 3 is the real Blender render pass for the video model and downstream reconstruction:

rendering:
  samples: 32
  width: 1280
  height: 720
  num_rand_scenes: 3
  gpu: true
  no_rand_seed: false
  render_start_end: false

rendering.num_rand_scenes: 3 creates three random scene variants, usually identified by seeds such as rand00001, rand00002, and rand00003. Each variant changes camera and lighting while preserving the object, character, and scene configuration. This is an inexpensive way to increase data diversity: the same cordless_drill can produce multiple conditioning frames before the pipeline calls Kling.

The important outputs are:

results/generation/asset_renders/
results/generation/cameras/
results/generation/depth_maps/

asset_renders contains the rendered PNG frames. Open these first when debugging. Check that the object is visible, the character is framed, the camera is not too close, and the interaction target is obvious. If the rendered input is wrong, the video model will usually amplify the mistake.

cameras contains camera parameters, commonly serialized as pickle files. These files are geometry, not decoration. GRAIL benefits from knowing the camera before video generation; part 3 will use that known context to reduce ambiguity during 4D HOI recovery.

depth_maps contains Blender depth output. For arbitrary videos, depth often has to be estimated. In GRAIL, the initial scene depth comes from the known synthetic scene, which gives the downstream optimizer a better metric anchor.

Before calling Kling, check:

  • PNGs in asset_renders clearly show both person and object.
  • The object is not clipping through the table or floor.
  • The character's body is not cropped in a way that hides hands or feet.
  • The camera is static and far enough to preserve the interaction.
  • Camera and depth files exist for the same seeds as the renders.
  • If you increase num_rand_scenes, you understand that video cost increases too.
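The seed-consistency item in this checklist is easy to automate. The sketch below assumes that each output file carries its seed as a rand-plus-digits token somewhere in its path, as the rand00001 examples suggest; the helper names are hypothetical and the real naming may differ.

```python
import re
from pathlib import Path
from typing import Dict, Set

SEED_PATTERN = re.compile(r"rand\d+")
STAGE_DIRS = ("asset_renders", "cameras", "depth_maps")


def seeds_by_stage(generation_dir: Path) -> Dict[str, Set[str]]:
    """Collect the rand seeds found in file paths under each stage folder."""
    found: Dict[str, Set[str]] = {}
    for stage in STAGE_DIRS:
        stage_dir = generation_dir / stage
        seeds: Set[str] = set()
        if stage_dir.is_dir():
            for path in stage_dir.rglob("*"):
                if path.is_file():
                    relative = path.relative_to(stage_dir).as_posix()
                    seeds.update(SEED_PATTERN.findall(relative))
        found[stage] = seeds
    return found


def missing_seeds(generation_dir: Path) -> Dict[str, Set[str]]:
    """For each stage, report seeds present in asset_renders but absent there."""
    found = seeds_by_stage(generation_dir)
    rendered = found["asset_renders"]
    return {
        stage: rendered - seeds
        for stage, seeds in found.items()
        if rendered - seeds
    }
```

An empty dictionary from missing_seeds(Path("results/generation")) means every rendered seed has a matching camera and depth file, which is the precondition for spending Kling requests.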

Step 4: Prompt Refinement and Kling Video

Step 4 takes a rendered image as the input frame, selects a base prompt, optionally refines it with OpenAI vision/chat, and calls Kling image-to-video. In manipulation.yaml:

video:
  num_videos: 1
  num_video_segments: 1
  model_api: kling-ai
  kling_model_name: kling-v2-5-turbo
  duration: "5"
  kling_mode: pro
  skip_prompt_refinement: false
  base_prompt:
    - The person interacts with the object. The camera should remain static.
    - The person moves the object (pull or push). The camera should remain static.
  video_max_retries: 100
  video_retry_wait: 30

video.kling_model_name is the specific model name passed to the Kling adapter. If the Kling API changes or your account has access to a different model, this is the field to change. video_retry_wait: 30 means the pipeline sleeps 30 seconds between failed or unfinished attempts. With video_max_retries: 100, a bad render or a congested API queue can keep a single job alive for roughly 50 minutes of waiting alone (100 × 30 seconds). During debugging, keep num_rand_scenes small and num_videos: 1.
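The interaction between video_max_retries and video_retry_wait follows the usual poll-and-sleep pattern, sketched below. This is a generic illustration and not the Kling adapter's code: poll_until_ready and the fetch callback are hypothetical, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def poll_until_ready(
    fetch: Callable[[], Optional[str]],
    max_retries: int = 100,
    retry_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns a result, sleeping between attempts."""
    for attempt in range(1, max_retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries:
            sleep(retry_wait)  # no sleep after the final failed attempt
    raise TimeoutError(f"no valid video after {max_retries} attempts")
```

While debugging, lowering both values in a copied config makes failures surface in minutes instead of tying up a terminal on a render that was never going to succeed.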

Prompt refinement serves two purposes. First, it makes the prompt object-specific, for example turning a generic interaction into "a person picks up a cordless drill and keeps holding it." Second, it reinforces the static camera constraint. GRAIL wants static-camera video because reconstruction depends on geometric consistency. A moving cinematic camera may look better as media, but it is usually worse as reconstruction input.

The final videos are written under:

results/generation/videos_kling/

For the smoke test, expect MP4 files under the dataset/category path:

results/
  generation/
    videos_kling/
      ComAsset/
        cordless_drill/
          kid_indoor2-manipulation_rand00001.mp4
          kid_indoor2-manipulation_rand00002.mp4
          kid_indoor2-manipulation_rand00003.mp4

The exact filenames may vary with scene key and video index, but the pattern should include character, scene, and random seed. If videos_kling is empty, check in this order: rendered PNGs exist, OPENAI_API_KEY is loaded, KLING_ACCESS_KEY and KLING_SECRET_KEY are loaded, the logs show successful prompt refinement or a fallback, and the Kling adapter is not retrying indefinitely.
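To check a run against the listing above, the expected names can be generated and compared with what is on disk. The pattern here is inferred from the three example filenames only and is an assumption: real runs may add a video index or use non-sequential seeds, so treat a mismatch as a prompt to look at the folder, not as proof of failure. Both helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def expected_video_names(character: str, scene: str, num_rand_scenes: int) -> List[str]:
    """Build names like kid_indoor2-manipulation_rand00001.mp4 for seeds 1..N."""
    return [
        f"{character}_{scene}_rand{index:05d}.mp4"
        for index in range(1, num_rand_scenes + 1)
    ]


def missing_videos(video_dir: Path, character: str, scene: str, num_rand_scenes: int) -> List[str]:
    """Return expected names that are not present in video_dir."""
    return [
        name
        for name in expected_video_names(character, scene, num_rand_scenes)
        if not (video_dir / name).is_file()
    ]
```

For the smoke test, video_dir would be results/generation/videos_kling/ComAsset/cordless_drill with character "kid" and scene "indoor2-manipulation".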

Smoke Test: ComAsset/cordless_drill

A good smoke test is cheap, visually easy to evaluate, and close to the default config. cordless_drill fits because it is handheld, asymmetric, already scaled in comasset.yaml, assigned to indoor2-manipulation, and easy for a video model to interpret. The desired action can be simple: a person approaches, holds, moves, pulls, or pushes the drill while the camera stays static.

Run the full pipeline:

source .env
python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --results_dir results \
  --video_model_api kling-ai

If this is your first run in the container, verify Blender setup. The code looks for Blender at imports/blender/blender; if it is missing, the error will point you to bash scripts/setup/install_env_docker.sh. That is an environment setup issue, not an object or prompt issue.

After the run, inspect outputs in stage order:

find results/generation/initial_states -maxdepth 3 -type f | head
find results/generation/asset_renders -maxdepth 4 -name '*.png' | head
find results/generation/cameras -maxdepth 4 -type f | head
find results/generation/depth_maps -maxdepth 4 -type f | head
find results/generation/videos_kling -maxdepth 4 -name '*.mp4' | head

Do not start by watching every video. Open the render first. If the render is wrong, the video is not worth evaluating. If the render is good but the video is wrong, then adjust prompt, model name, duration, or random seed.

When to Increase num_rand_scenes

num_rand_scenes increases diversity, but it also increases cost. With 3, you get three camera/lighting variants for one object. With 10, you have more chances to get a good video, but you also create more than three times as many video generation requests. Since each request can retry many times, changing 3 to 10 is not a small change.
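A back-of-the-envelope estimate makes the cost visible before a run. The sketch assumes that the pipeline issues one request per scene, per video, per segment, and that requests are processed one after another; both are assumptions for illustration, and the function name is hypothetical. The defaults mirror manipulation.yaml.

```python
from typing import Tuple


def estimate_video_load(
    num_rand_scenes: int,
    num_videos: int = 1,
    num_video_segments: int = 1,
    video_max_retries: int = 100,
    video_retry_wait: int = 30,
) -> Tuple[int, int]:
    """Return (number of requests, worst-case seconds spent sleeping)."""
    requests = num_rand_scenes * num_videos * num_video_segments
    # Upper bound: every request exhausts its retries, waiting each time.
    worst_case_sleep = requests * video_max_retries * video_retry_wait
    return requests, worst_case_sleep
```

Under these assumptions, 3 scenes give 3 requests and at most 9,000 seconds (2.5 hours) of sleeping, while 10 scenes give 10 requests and at most 30,000 seconds (about 8.3 hours). Real runs rarely hit the bound, but it shows why a broken render should be caught before fanning out.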

A practical schedule:

| Phase | num_rand_scenes | Reason |
| --- | --- | --- |
| Environment smoke test | 1 or 3 | Find Blender/API issues quickly |
| New object inspection | 3 | Enough to judge scale and camera stability |
| Small batch generation | 5 | Better diversity without runaway cost |
| Dataset generation | 10+ | Only after prompts, scenes, and caches are stable |

You may be able to override config values from the CLI depending on parser support, but for beginners the least confusing method is to copy the config, edit rendering.num_rand_scenes, and pass the file with --config.

Switching to Stairs, Sitting, and Pickup

Once the manipulation smoke test works, switch configs:

# Terrain stairs
python -m grail.pipelines.gen_2dhoi \
  --config configs/gen_2dhoi/terrain_stairs.yaml \
  --results_dir results

# Sitting
python -m grail.pipelines.gen_2dhoi \
  --config configs/gen_2dhoi/sitting.yaml \
  --results_dir results

# Pickup from table
python -m grail.pipelines.gen_2dhoi \
  --config configs/gen_2dhoi/pickup_table.yaml \
  --results_dir results

# Pickup from ground
python -m grail.pipelines.gen_2dhoi \
  --config configs/gen_2dhoi/pickup_ground.yaml \
  --results_dir results

These configs differ in more than prompt text. terrain_stairs.yaml uses synthetic stair object configs, sets skip_step1: true and skip_step2: true, raises num_rand_scenes to 10, sets duration to "10", and switches kling_model_name to kling-v3. Terrain does not need the same Bullet settling as a handheld object because the staircase is scene geometry. The prompt is also more constrained: the person should climb carefully, stay visible, and avoid walking through the ceiling or leaving the frame. That sounds like prompt engineering, but it is really data constraint engineering for reconstruction and locomotion.

sitting.yaml uses configs/objects/chair_sitting.yaml, samples: 128, num_rand_scenes: 10, no_rand_seed: true, and prompts such as "walks over and sits down on the chair naturally." Sitting is different from manipulation because the final contact is body-to-chair, not hand-to-object. You need the seat, back, legs, and approach direction to be clear in the render.

The pickup configs split tabletop pickup and ground pickup. Tabletop pickup is close to manipulation but has a clearer action target: approach, grasp, and lift. Ground pickup is harder because the person must bend or squat; if the camera crops the legs or the object is too small, the generated video often fails.

Debug by Output Folder

When the pipeline fails, debug forward by stage instead of guessing from the final video:

| Folder | If missing or wrong | Common cause |
| --- | --- | --- |
| initial_states | No state file or strange object pose | Blender setup error, wrong skip_step1, bad mesh/collision |
| asset_renders | No PNG or wrong scene | Wrong object config, wrong scene key, bad scale, cropped camera |
| cameras | Missing camera pickle | Render step did not finish or was skipped incorrectly |
| depth_maps | Missing depth | render_scene failed or output paths are wrong |
| videos_kling | No MP4 | Kling credentials, prompt refinement, API retries, invalid render input |
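Debugging forward by stage can start with a helper that names the first folder with no output. This is a minimal sketch with a hypothetical function name; the folder order follows the table, and obj_scales is left out because step 2 is skipped by default.

```python
from pathlib import Path
from typing import Optional

# Stage folders in the order the pipeline produces them.
STAGE_ORDER = (
    "initial_states",
    "asset_renders",
    "cameras",
    "depth_maps",
    "videos_kling",
)


def first_empty_stage(generation_dir: Path) -> Optional[str]:
    """Return the earliest stage folder that is missing or holds no files."""
    for stage in STAGE_ORDER:
        stage_dir = generation_dir / stage
        has_files = stage_dir.is_dir() and any(
            path.is_file() for path in stage_dir.rglob("*")
        )
        if not has_files:
            return stage
    return None  # every stage produced at least one file
```

The returned name points at the row of the table to read first. A result of None only says that files exist everywhere; whether they are correct still requires opening the renders.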

One common trap is using --skip_done after producing bad outputs. skip_done checks whether files exist; it does not judge quality. If you change scale or prompt after a bad run, use a fresh results directory or remove the specific bad outputs before rerunning. For production batches, keep the config and random seeds with the output so you can trace failures.
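Keeping the config and seeds with the output can be as simple as copying the config file and writing a small JSON manifest next to the results. The sketch below is one possible convention, not a GRAIL feature: the function name, the used_ prefix, and run_manifest.json are all hypothetical.

```python
import json
import shutil
from pathlib import Path
from typing import Iterable


def save_run_manifest(results_dir: Path, config_path: Path, seeds: Iterable[str]) -> Path:
    """Copy the config into results_dir and record which seeds were generated."""
    results_dir.mkdir(parents=True, exist_ok=True)
    config_copy = results_dir / f"used_{config_path.name}"
    shutil.copy2(config_path, config_copy)  # preserves the file's timestamps
    manifest = results_dir / "run_manifest.json"
    payload = {"config": config_copy.name, "seeds": sorted(seeds)}
    manifest.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return manifest
```

Because skip_done only looks at file existence, a manifest like this is also a quick way to see which config produced the files that a later run decided to skip.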

Why GRAIL Needs 2D Video

If the scene is already 3D, why not optimize the motion directly in simulation? Because the video foundation model contributes a human motion prior: how a person approaches an object, reaches, grasps, lifts, pushes, sits, or climbs. GRAIL does not blindly trust the video. It uses the video as a visual motion prior and then constrains reconstruction with the known geometry, depth, camera, scale, and robot-proportioned character.

That is why the 2D HOI stage has to be both expressive and disciplined. It must be expressive enough for the video model to generate natural human behavior, but disciplined enough for the reconstruction pipeline to recover metric motion. In part 4, the terrain branch applies the same principle to curb, slope, and stair traversal. In part 5, the recovered motion is retargeted to Unitree G1, where scale or camera mistakes from this stage become robot motion errors. Treat this stage as the place to slow down, inspect renders, test prompts, and only then fan out.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
