
GRAIL: Synthetic Data for G1 VLA

A practical GRAIL guide for creating synthetic data from 3D assets and video priors to fine-tune whole-body VLA policies on Unitree G1.

Nguyễn Anh Tuấn · June 5, 2026 · 12 min read

What problem does GRAIL solve?

Whole-body manipulation on a humanoid is not just arm control. A Unitree G1 has to walk into a useful pose, adjust its center of mass, coordinate torso and hands, maintain balance, make contact with an object, and recover to a stable posture. Collecting this data with teleoperation or motion capture is possible, but it is expensive to scale. Every new object, height, terrain shape, chair, stair, and manipulation style requires another round of physical setup and human operation.

GRAIL, short for Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors, proposes a more scalable route: keep data generation fully digital until deployment. The original paper is available as arXiv:2606.05160, the project page is hosted by NVIDIA Research, the code is in NVlabs/GRAIL, and the released dataset is on Hugging Face.

The key idea is subtle but important. GRAIL does not start from random internet videos and try to guess all geometry afterward. It starts from a fully specified 3D configuration: object mesh, camera intrinsics and extrinsics, metric scale, environment depth, and a SMPL-X character pre-fitted to Unitree G1-like proportions. Then it uses a video foundation model to synthesize a human-object interaction video, reconstructs metric 4D human-object trajectories, retargets them to the G1, trains tracking policies in simulation, and distills them into egocentric visual policies.

If you are new to VLA models, start with VLA models in robotics. This guide focuses on the data pipeline and how to use GRAIL-style data to fine-tune whole-body manipulation policies.

The paper's idea in one diagram

Think of GRAIL as a data factory for humanoid loco-manipulation:

3D object / terrain asset
        |
        v
Known 3D scene + G1-proportioned human character
        |
        v
Blender render first frame + VLM prompt
        |
        v
Video foundation model creates human-object video
        |
        v
4D HOI reconstruction
  - SMPL-X body and hands
  - object 6-DoF pose
  - metric depth and contact optimization
        |
        v
Retarget SMPL-X -> Unitree G1
        |
        v
SONIC task-general tracking in Isaac Lab
        |
        v
Egocentric RGB policy / VLA fine-tuning
        |
        v
Real Unitree G1 deployment

The paper reports more than 20,000 generated sequences across tabletop pick-up, ground pick-up, whole-body manipulation, sitting, curbs, slopes, and stairs. Using only GRAIL-generated data, the authors deploy on a real Unitree G1 and report 84% success for diverse object pick-up and 90% success for stair climbing. For pick-up, they train on 200 approach-and-pick-up sequences per object across cubes, apples, tea boxes, carrots, and wet wipes; unseen objects still average 80% success.

Architecture: three data layers, two policy layers

GRAIL is not a single neural network. It is a pipeline that combines graphics, video generation, 4D reconstruction, retargeting, physics-based tracking, and visual policy learning.

| Layer | Input | Output | Main tools |
|---|---|---|---|
| Asset-conditioned video generation | 3D asset, scene, camera, prompt | Synthetic HOI video | Blender, VLM, Kling AI |
| 4D HOI reconstruction | Video plus known scene context | SMPL-X motion plus object 6-DoF | GEM-SMPL, WiLoR, FoundationPose, SAM2, MoGe |
| Robot retargeting | SMPL-X and object motion | G1 joint trajectory plus USD assets | GMR, Isaac Lab |
| Task-general tracking | Retargeted motion library | Executable robot actions | SONIC |
| Visual/VLA policy | Egocentric RGB plus proprioception | Latent tokens/actions | Diffusion/VLA-style policy, optional GR00T fine-tune |

1. Robot-centric human video generation

GRAIL generates human videos rather than robot videos because current video foundation models have stronger priors for humans manipulating objects, sitting, stepping, carrying, and pushing. Human reconstruction tools are also more mature than robot reconstruction tools. SMPL-X, MANO, ViTPose, WiLoR, and related estimators provide usable body and hand motion from video.

To reduce morphology mismatch, the human character is pre-fitted to the Unitree G1 body proportions. The video still looks like a human interaction, but its body scale and limb proportions are closer to the target humanoid, which makes retargeting easier.

A minimal generation run looks like this:

# Smoke-test example from the GRAIL docs
source .env   # OPENAI_API_KEY, KLING_ACCESS_KEY, KLING_SECRET_KEY, HF_TOKEN

python -m grail.pipelines.gen_terrain \
  --type stairs \
  --num 5 \
  --output_dir data/syn_stairs

python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --results_dir results \
  --video_model_api kling-ai

Blender renders the first frame with known camera parameters. A VLM turns the rendered frame into an interaction prompt, and the video foundation model synthesizes the interaction under a static-camera setup. Because the camera and scene are known, later reconstruction does not need to infer everything from scratch.
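
To make "known camera parameters" concrete, here is a minimal sketch of the standard pinhole projection that links a metric 3D point in the rendered scene to a pixel in the first frame. The function name, the identity extrinsics, and the intrinsics values are illustrative assumptions, not values from the GRAIL code.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_point(
    point_world: Vec3,
    rotation: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation
    translation: Vec3,                    # world-to-camera translation (metres)
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float]:
    """Project a metric 3D world point to pixel coordinates (pinhole model)."""
    # Extrinsics: p_cam = R @ p_world + t
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then focal length and principal point
    return fx * x / z + cx, fy * y / z + cy


# Illustrative values only: camera at the world origin, looking along +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((0.1, -0.05, 2.0), identity, (0.0, 0.0, 0.0),
                     fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(u, v)
```

Because both the extrinsics and intrinsics are fixed at render time, any later estimate of object or body position can be checked against the video by reprojecting it in this way.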

2. 4D human-object interaction reconstruction

This is the most technical part of the system. The output cannot just be a plausible video. It must be a trajectory that can supervise robot learning:

sample = {
    "human": "SMPL-X pose per frame",
    "object": "6-DoF pose per frame",
    "contacts": "hand/object or body/scene contact labels",
    "camera": "known intrinsics and extrinsics",
    "scale": "metric scale aligned to rendered scene",
}

Human motion is estimated with GEM-SMPL/GENMO as SMPL-X. Hands are refined with WiLoR/MANO, interpolated through missing detections, and smoothed to reduce jitter. Object motion is tracked with FoundationPose, initialized from the known first-frame object pose. FoundationPose receives the object mesh, texture, and camera parameters, which makes 6-DoF tracking much better conditioned than unconstrained in-the-wild tracking.
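
The "interpolated through missing detections, and smoothed" step can be pictured with a small sketch. This is not the GRAIL implementation: it treats one scalar channel (for example a single joint angle), uses linear interpolation and a moving average, and the function names are hypothetical. Real hand poses are rotations, which would normally call for rotation-aware interpolation.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Linearly interpolate None entries; hold the nearest value at the ends."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no detections to interpolate from")
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled.append(values[nxt])
        elif nxt is None:
            filled.append(values[prev])
        else:
            w = (i - prev) / (nxt - prev)
            filled.append((1 - w) * values[prev] + w * values[nxt])
    return filled


def smooth(values: List[float], radius: int = 1) -> List[float]:
    """Centered moving average; the window is truncated at the sequence ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


# None marks frames where the hand detector returned nothing.
wrist_angle = [0.0, None, None, 0.3, 0.4, None]
filled = fill_missing(wrist_angle)
smoothed = smooth(filled)
print(filled)
print(smoothed)
```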

GRAIL then runs joint optimization because independent estimates often create floating contacts, penetration, scale drift, or foot skating. The optimization objective combines several losses:

| Loss | Purpose |
|---|---|
| L_kp | Keep body and hand keypoints aligned with the video |
| L_proj | Keep object projection aligned with FoundationPose |
| L_depth | Align metric depth using MoGe plus known background depth |
| L_cont | Encourage plausible hand-object or body-scene contact |
| L_reg | Reduce foot skating, velocity drift, and temporal jitter |
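
One common way to write such a combined objective is as a weighted sum. The weights below are assumed notation for illustration; this guide does not state the exact weighting or form used in the paper.

```latex
\mathcal{L} \;=\; \lambda_{kp}\, L_{kp} \;+\; \lambda_{proj}\, L_{proj} \;+\; \lambda_{depth}\, L_{depth} \;+\; \lambda_{cont}\, L_{cont} \;+\; \lambda_{reg}\, L_{reg}
```

Each weight trades off one constraint against the others: a larger contact weight pulls hands onto the object, while a larger regularization weight favors smooth motion over exact keypoint agreement.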

For beginners, the important lesson is that the known 3D scene turns an ambiguous video problem into a constrained reconstruction problem. GRAIL knows the camera, object mesh, scene scale, and background depth. That gives depth alignment and contact alignment a real geometric anchor.
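
As an illustration of why known background depth is such a useful anchor, the sketch below fits a scale and shift that map predicted depth onto metric depth using background pixels only, then applies the fit to foreground pixels. It assumes the monocular prediction is correct only up to an unknown scale and shift; the function name and numbers are hypothetical, and the actual GRAIL depth term may be formulated differently.

```python
from typing import Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], metric: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, shift) so that scale * pred + shift approximates metric."""
    n = len(pred)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(pred) / n
    mean_m = sum(metric) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depths are constant; scale is undefined")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    scale = cov / var_p
    return scale, mean_m - scale * mean_p


# Background pixels: predicted depth vs. depth rendered from the known scene (metres).
pred_bg = [0.5, 1.0, 1.5, 2.0]
metric_bg = [1.1, 2.1, 3.1, 4.1]  # toy data generated as 2.0 * pred + 0.1
scale, shift = fit_scale_shift(pred_bg, metric_bg)

# Apply the same correction to foreground (human/object) pixels.
foreground_metric = [scale * d + shift for d in [0.8, 0.9]]
print(scale, shift, foreground_metric)
```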

3. Retargeting to Unitree G1

After reconstruction, the motion is still human SMPL-X motion. GRAIL uses GMR to retarget it to the Unitree G1. The expected output is a robot-ready motion library:

data/motion_lib/<name>/
  robot/       # G1 joint trajectories, one pkl per motion
  objects/     # object 6-DoF trajectories
  object_usd/  # Isaac Lab-ready USD assets
  meta/        # table pose, object name, sequence metadata
  bps/         # shape encoding for multi-object manipulation
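
Before training, it can help to confirm that a motion library actually has this layout. The helper below is a hypothetical convenience check, not part of GRAIL; it only tests that each subdirectory listed above exists and is non-empty. Terrain-only libraries may legitimately lack some of these folders, so treat the report as a hint.

```python
import tempfile
from pathlib import Path
from typing import List

# Subdirectory names taken from the layout listed above.
EXPECTED_SUBDIRS = ["robot", "objects", "object_usd", "meta", "bps"]


def missing_subdirs(motion_lib: Path) -> List[str]:
    """Return expected subdirectories that are absent or empty."""
    problems = []
    for name in EXPECTED_SUBDIRS:
        sub = motion_lib / name
        if not sub.is_dir() or not any(sub.iterdir()):
            problems.append(name)
    return problems


# Demo on a throwaway directory that only contains robot/ with one file.
with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp)
    (lib / "robot").mkdir()
    (lib / "robot" / "motion_000.pkl").touch()
    report = missing_subdirs(lib)
print(report)
```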

End-to-end retargeting:

conda activate sonic
export DISPLAY=:1

bash grail/retargeting/scripts/retarget_pipeline.sh \
  data/genhoi/benchmark_v3/generation/4dhoi_recon_valid/Hunyuan \
  benchmark_v3_0203

Individual stages:

python -m grail.retargeting.retarget \
  --data_dir data/genhoi/benchmark_v3/generation/4dhoi_recon_valid/Hunyuan \
  --all \
  --robot unitree_g1 \
  --no_viewer \
  --output_dir data/motion_lib/benchmark_v3_0203

python -m grail.retargeting.process \
  --input data/motion_lib/benchmark_v3_0203 \
  --output data/motion_lib/benchmark_v3_0203_ha \
  --meta_pkl grail/retargeting/data/g1_skeleton_meta.pkl \
  --include_contact_points \
  --grasp_from_lift \
  --lift_threshold 0.02 \
  --grasp_anticipation_frames 10 \
  --skip_no_lift \
  --per_object

python -m grail.retargeting.compute_bps \
  --object_usd_dir data/motion_lib/benchmark_v3_0203/object_usd \
  --output_dir data/motion_lib/benchmark_v3_0203/bps

For terrain or sitting data, GRAIL recommends --zero_out_wrist because those tasks do not require hand IK.
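
The processing flags --grasp_from_lift, --lift_threshold, and --grasp_anticipation_frames suggest that grasp timing is derived from when the object leaves its support. The sketch below shows one plausible reading of that idea. It is an assumption about the heuristic, not the actual GRAIL code: the function name, the metre units, and the rule "grasp starts a fixed number of frames before lift-off" are all hypothetical.

```python
from typing import List


def grasp_labels_from_lift(
    object_heights: List[float],
    lift_threshold: float = 0.02,
    anticipation_frames: int = 10,
) -> List[bool]:
    """Label frames as 'grasping' from the object's vertical motion (hypothetical)."""
    rest_height = object_heights[0]
    # First frame where the object has risen more than the threshold above rest.
    lift_frame = next(
        (i for i, h in enumerate(object_heights) if h - rest_height > lift_threshold),
        None,
    )
    if lift_frame is None:
        return [False] * len(object_heights)  # the object was never lifted
    # The hand must close before the object moves, so start the label earlier.
    start = max(0, lift_frame - anticipation_frames)
    return [i >= start for i in range(len(object_heights))]


# Toy trajectory: object rests at 0.70 m for 20 frames, then sits at 0.75 m.
heights = [0.70] * 20 + [0.75] * 5
labels = grasp_labels_from_lift(heights)
print(labels.index(True))
```

Under this reading, a sequence with no lift produces no grasp label at all, which would be a natural candidate for a flag such as --skip_no_lift to discard.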

Installing GRAIL

The repository is tested on Ubuntu 22.04+ with NVIDIA GPUs including A6000, RTX 4090, RTX 5090, and RTX 6000 Ada. It uses three environments:

| Env | Python | Used for |
|---|---|---|
| grail | 3.10 | 2D generation, 4D reconstruction, optimization |
| hunyuan | 3.10 | Hunyuan3D-2.1 asset generation |
| sonic | 3.11 | retargeting, Isaac Lab, task-general tracking |

The easiest setup path is Docker:

docker pull docker.io/nvgrail/grail:latest

docker run --gpus all -it --shm-size=16g \
  -v /path/to/grail:/workspace/grail \
  docker.io/nvgrail/grail:latest

cd /workspace/grail
bash scripts/setup/install_env_docker.sh
bash scripts/setup/download_checkpoints.sh
conda activate grail
source .env

If you need full retargeting or SONIC training:

bash scripts/setup/install_env_sonic.sh

Minimum environment variables:

export CUDA_HOME=/usr/local/cuda-12.1
export PYOPENGL_PLATFORM=egl
export OMNI_KIT_ACCEPT_EULA=Yes
export OPENAI_API_KEY=<your-key>
export KLING_ACCESS_KEY=<your-key>
export KLING_SECRET_KEY=<your-key>
export HF_TOKEN=<your-token>
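
A missing key usually surfaces late, in the middle of a pipeline stage. A small preflight check, sketched here as a hypothetical helper that is not part of GRAIL, catches that earlier. The variable names are the ones listed above.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = [
    "CUDA_HOME", "PYOPENGL_PLATFORM", "OMNI_KIT_ACCEPT_EULA",
    "OPENAI_API_KEY", "KLING_ACCESS_KEY", "KLING_SECRET_KEY", "HF_TOKEN",
]


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```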

Plan for disk usage. GEM-SMPL is roughly 14 GB, GEM-SOMA is roughly 6.4 GB, FoundationPose is roughly 250 MB, and the setup also pulls SMPL-X body models and RealESRGAN. Isaac Sim/Lab for the sonic environment adds a large extra install.
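
The listed checkpoints alone add up to roughly 14 + 6.4 + 0.25 = 20.65 GB, before body models, RealESRGAN, and Isaac Sim/Lab. The sketch below checks free space against that total plus a margin; the 10 GB headroom is an arbitrary assumption, and the helper is not part of the GRAIL setup scripts.

```python
import shutil

# Approximate checkpoint sizes in GB, taken from the figures above.
CHECKPOINT_GB = {"GEM-SMPL": 14.0, "GEM-SOMA": 6.4, "FoundationPose": 0.25}


def has_room(path: str = ".", headroom_gb: float = 10.0) -> bool:
    """True if free space at `path` covers the listed checkpoints plus headroom."""
    needed_gb = sum(CHECKPOINT_GB.values()) + headroom_gb
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


listed_total = sum(CHECKPOINT_GB.values())
print(f"Listed checkpoints: about {listed_total:.2f} GB")
print("Enough free space here:", has_room())
```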

Training: from motion library to policy

GRAIL trains task-general tracking policies on top of SONIC instead of fitting a separate controller for every object. For manipulation, it adds an object-aware latent adaptor. The adaptor reads object state and motion context, modulates controller latent tokens, and emits hand actions. For terrain and sitting, GRAIL uses a scene-aware tracker with a height-map encoder, which helps the controller adapt stepping and body alignment to scene geometry.
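
To give a feel for what a height-map input looks like, the sketch below samples terrain height at a few points around the robot base and expresses it relative to the base. It is a simplified 1D illustration under stated assumptions: the staircase function, grid size, and spacing are made up here, and the real scene-aware tracker presumably uses a 2D grid with its own parameters.

```python
import math
from typing import Callable, List


def stairs_height(x: float, step_depth: float = 0.3, step_height: float = 0.15) -> float:
    """Terrain height of a staircase rising along +x, flat ground for x < 0."""
    return step_height * max(0, math.floor(x / step_depth))


def local_height_map(
    terrain: Callable[[float], float],
    base_x: float,
    base_z: float,
    size: int = 5,
    spacing: float = 0.2,
) -> List[float]:
    """Sample terrain heights behind and ahead of the base, relative to base height."""
    half = size // 2
    return [terrain(base_x + (i - half) * spacing) - base_z for i in range(size)]


# Robot standing just before the first step; the last sample sees the first riser.
height_map = local_height_map(stairs_height, base_x=0.05, base_z=0.0)
print(height_map)
```

An encoder that receives such samples can tell a flat floor from an upcoming step before the foot arrives, which is the kind of information a tracker needs to adapt stepping to scene geometry.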

A pick-up smoke test:

conda activate sonic
export HYDRA_FULL_ERROR=1 PYTHONUNBUFFERED=1 WANDB_MODE=offline
# DATA_DIR and BPS_DIR are assumed to be exported already (motion library root and BPS directory)

cd imports/SONIC
python -u train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=4 \
  headless=True \
  ++algo.config.num_learning_iterations=3 \
  ++manager_env.config.gpu_collision_stack_size_exp=28 \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

A more realistic 8-GPU launch:

conda activate sonic
export HYDRA_FULL_ERROR=1 PYTHONUNBUFFERED=1

cd imports/SONIC
accelerate launch --num_processes=8 train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=2048 \
  headless=True \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

Fine-tuning from a public reference checkpoint:

python -u train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=2048 \
  headless=True \
  ++resume=True \
  ++checkpoint=models/pnp_table/last.pt \
  experiment_dir=${FINETUNE_DIR} \
  ++algo.config.num_learning_iterations=10000 \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

For whole-body VLA fine-tuning, there are two practical ways to use GRAIL:

  1. Train the GRAIL tracker first, then distill it into an egocentric RGB policy as in the paper.
  2. Export GRAIL episodes into your VLA dataset format, mix them with a small amount of real teleoperation, and fine-tune a model such as GR00T or a similar policy.

The project page states that the authors also evaluate GRAIL for GR00T fine-tuning. Early co-training uses a 95% GRAIL / 5% teleoperation mixture, improving grasping success over teleoperation-only training and reducing cases where the policy gets stuck before reaching the target. For production, keep some real teleoperation data. It calibrates camera pose, latency, hand compliance, surface friction, and object appearance gaps that synthetic data will not fully cover.
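
A 95% / 5% mixture can be implemented in several ways. The sketch below samples each batch element from the synthetic pool with probability 0.95 and from the teleoperation pool otherwise. Whether the reported ratio is counted per episode, per frame, or per batch is not stated here, so treat this as one reasonable interpretation; the function and episode identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_mixed_batch(
    synthetic: Sequence[str],
    teleop: Sequence[str],
    batch_size: int,
    synthetic_fraction: float = 0.95,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw (source, episode_id) pairs with a fixed expected mixture."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        if rng.random() < synthetic_fraction:
            batch.append(("grail", rng.choice(synthetic)))
        else:
            batch.append(("teleop", rng.choice(teleop)))
    return batch


batch = sample_mixed_batch(["g0", "g1", "g2"], ["t0"], batch_size=2000)
grail_share = sum(1 for source, _ in batch if source == "grail") / len(batch)
print(f"GRAIL share in this batch: {grail_share:.3f}")
```

Sampling by probability, rather than concatenating the datasets, keeps the ratio stable even when the teleoperation set is far smaller than the synthetic one.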

A simple VLA export schema:

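# render_egocentric is a placeholder for your own simulator rendering hook;
# the grail_motion keys below are illustrative, not a fixed GRAIL schema.
def export_episode(grail_motion):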
    return {
        "observation": {
            "rgb": render_egocentric(grail_motion),
            "proprio": grail_motion["robot_state"],
            "language": "pick up the object and place it on the table",
        },
        "action": {
            "base": grail_motion["base_velocity"],
            "joints": grail_motion["g1_joint_targets"],
            "hands": grail_motion["hand_dof_pos"],
        },
        "metadata": {
            "object_usd": grail_motion["object_asset"],
            "source": "GRAIL",
            "sim_fps": 25,
        },
    }

If you are building a LeRobot-style pipeline, keep the same discipline: standardize the episode format, log RGB/proprio/action cleanly, and validate sim-to-real behavior before running the real robot.
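
"Standardize the episode format" is easier to enforce with an explicit check. The validator below is a minimal sketch against the export schema shown above. It assumes every observation and action stream is a per-frame sequence of equal length, which is an assumption of this example rather than a GRAIL or LeRobot requirement.

```python
from typing import Any, Dict, List

REQUIRED = {
    "observation": ["rgb", "proprio", "language"],
    "action": ["base", "joints", "hands"],
    "metadata": ["object_usd", "source", "sim_fps"],
}
# Streams expected to have one entry per timestep.
PER_FRAME = [("observation", "rgb"), ("observation", "proprio"),
             ("action", "base"), ("action", "joints"), ("action", "hands")]


def validate_episode(episode: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the episode passed."""
    problems = [
        f"missing {group}.{key}"
        for group, keys in REQUIRED.items()
        for key in keys
        if key not in episode.get(group, {})
    ]
    if problems:
        return problems
    lengths = {f"{g}.{k}": len(episode[g][k]) for g, k in PER_FRAME}
    if len(set(lengths.values())) > 1:
        problems.append(f"stream lengths differ: {lengths}")
    return problems


# Toy two-frame episode with placeholder values.
episode = {
    "observation": {"rgb": ["f0", "f1"], "proprio": [[0.0], [0.1]],
                    "language": "pick up the object and place it on the table"},
    "action": {"base": [[0.0], [0.0]], "joints": [[0.0], [0.1]], "hands": [[0.0], [1.0]]},
    "metadata": {"object_usd": "cube.usd", "source": "GRAIL", "sim_fps": 25},
}
print(validate_episode(episode))
```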

Inference on Unitree G1

In the paper deployment, the real robot streams head-camera RGB and proprioceptive input to a desktop with an NVIDIA RTX 5090. The desktop runs inference and streams actions back to the G1. The camera is a Luxonis OAK-D W, and inference runs at 10 Hz.

The runtime loop looks like this:

OAK-D W RGB frame
        |
        v
Visual policy / VLA
        |
        v
Latent token or action chunk
        |
        v
SONIC controller / low-level action decoder
        |
        v
Unitree G1 joints + hands
        |
        v
New proprioception and camera frame
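
Because actions in this loop travel from a desktop to the robot, the robot side should decide what to do when an action arrives late or not at all. The watchdog below is a minimal sketch of that idea, not code from the paper's deployment stack. The class name and the 0.3 s timeout (about three missed ticks at 10 Hz) are assumptions, and timestamps are passed in explicitly so the logic is easy to test.

```python
from typing import List, Optional


class ActionWatchdog:
    """Hold the last action briefly, then fall back to a safe action when stale."""

    def __init__(self, safe_action: List[float], timeout_s: float = 0.3):
        self.safe_action = safe_action
        self.timeout_s = timeout_s
        self._last_action: Optional[List[float]] = None
        self._last_time: Optional[float] = None

    def update(self, action: List[float], now: float) -> None:
        """Record a freshly received action and its arrival time (seconds)."""
        self._last_action, self._last_time = action, now

    def command(self, now: float) -> List[float]:
        """Return the action to execute at time `now`."""
        if self._last_time is None or now - self._last_time > self.timeout_s:
            return self.safe_action
        return self._last_action


watchdog = ActionWatchdog(safe_action=[0.0, 0.0], timeout_s=0.3)
watchdog.update([0.1, 0.2], now=0.0)
fresh = watchdog.command(now=0.1)  # last action is still recent
stale = watchdog.command(now=0.5)  # no update for 0.5 s, so fall back
print(fresh, stale)
```

On a real robot, `now` would come from a monotonic clock such as `time.monotonic()`, and the safe action would be whatever hold or stand command the low-level controller defines, since a zero vector is not necessarily safe for a humanoid.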

Deployment checklist:

| Item | Why it matters |
|---|---|
| Camera extrinsics | Wrong camera pose shifts reaching behavior |
| Joint limits | Synthetic actions may exceed real hardware limits |
| Latency | 10 Hz control needs buffering and watchdogs |
| Hand calibration | Real G1 hand contact differs from simulation |
| Emergency stop | Whole-body manipulation can cause falls and collisions |
| Domain randomization | Randomize lighting, texture, object pose, and camera noise |
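
For the joint-limits item, a last-line guard can clamp every target before it reaches the hardware and record which joints were clipped. The sketch below is illustrative: the joint names and limit values are placeholders, not real G1 limits, which should be read from the robot's URDF or vendor documentation.

```python
from typing import Dict, List, Tuple


def clamp_to_limits(
    targets: Dict[str, float],
    limits: Dict[str, Tuple[float, float]],
) -> Tuple[Dict[str, float], List[str]]:
    """Clamp joint targets (radians) to [lower, upper]; report which were clipped."""
    clamped: Dict[str, float] = {}
    clipped: List[str] = []
    for joint, value in targets.items():
        lower, upper = limits[joint]
        safe = min(max(value, lower), upper)
        if safe != value:
            clipped.append(joint)
        clamped[joint] = safe
    return clamped, clipped


# Placeholder limits for illustration only.
limits = {"left_elbow": (-1.0, 2.0), "waist_yaw": (-2.6, 2.6)}
safe_targets, clipped = clamp_to_limits({"left_elbow": 2.4, "waist_yaw": 0.3}, limits)
print(safe_targets, clipped)
```

Logging the clipped joints is as useful as the clamp itself: frequent clipping on one joint usually means the retargeted motion or the policy is asking for something the hardware cannot do.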

A common beginner mistake is deploying too early. Validate each layer first: visualize the trajectory, replay it in Isaac Lab, test the tracker headless, test the visual policy in simulation, run shadow mode on the G1, and only then allow the robot to apply real force.

Results and what they mean

| Experiment | Main result |
|---|---|
| HOI generation on 20 objects | GRAIL reaches 88.9% tracking SR, higher than HOIDiff, CHOIS, and DAViD |
| Task-general tracking | Full model reaches 81.4% SR, above HDMI at 48.5% and ResMimic at 49.2% |
| Real-world pick-up | 84% average on seen objects, 80% on unseen objects |
| Real-world stair climbing | 90% success |
| Dataset scale | More than 20,000 sequences, 1,000 object assets, and 1,000 terrain configurations |

The important takeaway is not only the success rate. GRAIL shows that synthetic data can become useful for humanoid manipulation when it is geometrically controlled. Video priors provide motion realism, known 3D scenes provide metric grounding, and physics simulation filters the data into robot-executable trajectories.

GRAIL still has limitations. It needs 3D assets, simulator-ready scenes, and a video foundation model that follows the intended interaction. Reconstruction can degrade under heavy occlusion, fast motion, inconsistent object appearance, or FoundationPose tracking failure. The filtering stage discards some sequences. Task-general trackers also still need training or fine-tuning when the motion family changes substantially.

Where to start

If you want to try this on a workstation:

  1. Install the Docker image and run the cordless_drill quick start.
  2. Inspect recon_result.mp4 and recon_comparison.mp4.
  3. Retarget one small motion to G1.
  4. Run the SONIC smoke test with num_envs=4.
  5. Export the motion into your policy episode format.
  6. Deploy on real hardware only after stable simulation replay.

GRAIL is a strong fit when you need more humanoid manipulation data than you can reasonably collect with teleoperation. For Unitree G1, the recipe is clear: use 3D assets to define the scene, video priors to generate behavior, 4D reconstruction to recover trajectories, SONIC to turn trajectories into control, and VLA fine-tuning to map perception to action.

