GRAIL: Synthetic Data for G1 VLA

What problem does GRAIL solve?

In traditional manipulation, data usually comes from teleoperation, motion capture, or a real robot driven through the same task repeatedly. This is valuable because the data sits close to the deployment distribution, but it is very hard to scale for humanoids. With the Unitree G1, an episode is more than a hand trajectory. The robot has to step to a suitable position, lower its center of mass, keep its balance, use its hands, avoid colliding with the table or floor, and then return to a stable state. If every object, every scene, and every type of manipulation has to be collected by a human operator, cost grows very quickly.

GRAIL, short for Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors, proposes a different route: create demonstrations entirely in digital space, up to the point of deployment. The original paper is published as arXiv:2606.05160, the project page is hosted at NVIDIA Research, the code is at NVlabs/GRAIL, and the dataset is released on Hugging Face.

The most important point: GRAIL does not try to blindly reconstruct in-the-wild video. It starts from a fully known 3D configuration: object mesh, camera intrinsics/extrinsics, metric scale, environment depth, and an SMPL-X character fitted close to the morphology of the Unitree G1. GRAIL then uses a video foundation model (VFM) to generate human-object interaction video, reconstructs it into 4D human-object interaction (HOI), retargets it to the G1, and trains a tracker and a visual policy.

If you are new to VLA, read the article on VLA models in robotics first. This post goes straight to the data pipeline and how to use it as a foundation for fine-tuning a whole-body manipulation policy.

The paper's idea in one diagram

GRAIL can be understood as a data factory for humanoid loco-manipulation:

3D object / terrain asset
        |
        v
Known 3D scene + G1-proportioned human character
        |
        v
Blender render first frame + VLM prompt
        |
        v
Video foundation model creates human-object video
        |
        v
4D HOI reconstruction
  - SMPL-X body and hands
  - object 6-DoF pose
  - metric depth and contact optimization
        |
        v
Retarget SMPL-X -> Unitree G1
        |
        v
SONIC task-general tracking in Isaac Lab
        |
        v
Egocentric RGB policy / VLA fine-tuning
        |
        v
Real Unitree G1 deployment

The paper reports more than 20,000 sequences covering tabletop pick-up, floor pick-up, whole-body manipulation, sitting, and curb, slope, and stair traversal. With policies trained only on GRAIL-generated data, the authors deploy on a Unitree G1 and reach 84% success for object pick-up and 90% success for stair-climbing. For pick-up, they train on 200 approach-and-pick-up sequences for each object in the set cube, apple, tea box, carrot, and wet wipes; unseen objects still reach 80% on average.

Architecture: three data layers, two policy layers

GRAIL is not a single model. It is a pipeline built from existing modules in computer vision, graphics, simulation, and humanoid control.

Layer	Input	Output	Main tools
Asset-conditioned video generation	3D asset, scene, camera, prompt	Synthetic HOI video	Blender, VLM, Kling AI
4D HOI reconstruction	Video + known scene context	SMPL-X motion + object 6-DoF	GEM-SMPL, WiLoR, FoundationPose, SAM2, MoGe
Robot retargeting	SMPL-X + object motion	G1 joint trajectory + USD assets	GMR, Isaac Lab
Task-general tracking	Retargeted motion library	Executable robot actions	SONIC
Visual/VLA policy	Egocentric RGB + proprioception	Latent tokens/actions	Diffusion/VLA-style policy, optional GR00T fine-tune

1. Robot-centric human video generation

GRAIL uses human video rather than robot video because video foundation models still have a much stronger prior on people: humans holding objects, sitting down, stepping over ledges, pulling carts, lifting boxes. The human pose reconstruction ecosystem (SMPL-X, MANO, ViTPose, WiLoR) is also more mature than robot pose reconstruction.

To reduce morphology mismatch, the human character asset is pre-fitted to the proportions of the Unitree G1. When the VFM generates video, the motion looks human but the body proportions are already close to the robot's, so retargeting to the G1 introduces less distortion.

Video generation typically runs the following steps:

# Smoke-test example from the GRAIL docs
source .env   # OPENAI_API_KEY, KLING_ACCESS_KEY, KLING_SECRET_KEY, HF_TOKEN

python -m grail.pipelines.gen_terrain \
  --type stairs \
  --num 5 \
  --output_dir data/syn_stairs

python -m grail.pipelines.gen_2dhoi \
  --dataset ComAsset \
  --category cordless_drill \
  --character kid \
  --results_dir results \
  --video_model_api kling-ai

Here, Blender renders the first frame with a known camera. A VLM writes an interaction prompt from that frame, and the video foundation model then generates the video. Because the camera is static and the initial 3D scene is known, the later reconstruction does not have to guess everything from scratch.
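
Because the camera is known, any 3D point in the scene maps to a pixel through the standard pinhole model. A minimal sketch of that mapping, assuming a world-to-camera rotation R, translation t, and intrinsics fx, fy, cx, cy; the function name and all numbers are illustrative and not taken from GRAIL's code:

```python
def project_point(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R (3x3 nested lists) and t (length 3) map world to camera coordinates:
    p_cam = R @ p_world + t. Returns (u, v, depth).
    """
    p_cam = [
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    ]
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v, z


# Illustrative values only: identity rotation, world origin 2 m in front of the camera.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v, depth = project_point(
    [0.1, -0.05, 0.0], R_identity, [0.0, 0.0, 2.0],
    fx=600.0, fy=600.0, cx=320.0, cy=240.0,
)
```

With the first-frame object pose known in this way, a tracker can be initialized from geometry rather than from a guess.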

2. 4D human-object interaction reconstruction

This is the most technical part of GRAIL. The desired output is not just a good-looking video but a trajectory that can feed robot training:

sample = {
    "human": "SMPL-X pose per frame",
    "object": "6-DoF pose per frame",
    "contacts": "hand/object or body/scene contact labels",
    "camera": "known intrinsics and extrinsics",
    "scale": "metric scale aligned to rendered scene",
}

Human motion is estimated with GEM-SMPL/GENMO in SMPL-X form. Hands are refined with WiLoR/MANO, then interpolated and smoothed to reduce jitter. Object motion is tracked with FoundationPose, initialized from the known first-frame pose. Because FoundationPose has the object mesh, texture, and camera parameters, 6-DoF tracking is far more stable than on in-the-wild video.
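
The interpolate-then-smooth step for hands can be illustrated on a single joint angle. This is a generic sketch, not GRAIL's implementation: it assumes dropped detections are marked as None, fills them by linear interpolation, and then applies a centered moving average. The window size is an arbitrary choice.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between valid neighbours.

    Leading and trailing gaps are filled with the nearest valid value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = values[nxt]
        elif nxt is None:
            filled[i] = values[prev]
        else:
            w = (i - prev) / (nxt - prev)
            filled[i] = (1 - w) * values[prev] + w * values[nxt]
    return filled


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


# One hand joint angle (radians) with two dropped detections.
raw = [0.0, None, 0.2, 0.9, None, 0.4]
smooth = moving_average(interpolate_missing(raw))
```

A moving average trades jitter for lag, so a real pipeline would likely tune the window per joint or use a different filter.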

GRAIL still does not fully trust the independent estimators. It runs a joint optimization with the following losses:

Loss	Purpose
`L_kp`	Keeps body/hand keypoints consistent with the video
`L_proj`	Keeps the object projection consistent with FoundationPose
`L_depth`	Aligns metric depth using MoGe plus the known background depth
`L_cont`	Enforces plausible hand-object and body-scene contact
`L_reg`	Reduces foot skating, velocity drift, and temporal jitter
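
Read together, the table suggests a single weighted objective. The form below is an assumption for illustration: the weights $\lambda$ are hypothetical, and the source does not state how the terms are actually combined or scheduled.

```latex
L_{\text{total}} = \lambda_{kp} L_{kp} + \lambda_{proj} L_{proj} + \lambda_{depth} L_{depth} + \lambda_{cont} L_{cont} + \lambda_{reg} L_{reg}
```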

A useful point for beginners: known 3D context turns an ambiguous problem into a constrained one. Instead of asking "what scale is this video at?", GRAIL knows the camera, the object, the mesh, and the background depth. Depth alignment and contact alignment therefore have a clear geometric basis.
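
One way to see why known background depth helps: a monocular depth prediction that is only correct up to scale and shift can be pinned to metric units by a least-squares fit on background pixels whose true depth is known. The sketch below works under that assumption; whether GRAIL's `L_depth` uses this exact affine form is not stated in the source, and the sample values are made up.

```python
def fit_scale_shift(pred, metric):
    """Least-squares fit of metric ~= scale * pred + shift over paired samples."""
    n = len(pred)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(pred) / n
    mean_m = sum(metric) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depths are constant; scale is undefined")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    scale = cov / var_p
    shift = mean_m - scale * mean_p
    return scale, shift


# Background pixels: relative depth from a monocular model vs. depth rendered
# from the known scene (metres). Values are made up for illustration.
pred_background = [0.2, 0.5, 0.8]
metric_background = [1.4, 2.0, 2.6]
scale, shift = fit_scale_shift(pred_background, metric_background)

# Apply the same mapping to a foreground pixel, e.g. on the hand.
hand_depth_m = scale * 0.35 + shift
```

The fit uses only background pixels, so the human and object depths inherit metric scale without being part of the fit themselves.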

3. Retargeting to Unitree G1

After reconstruction, the data is still in human SMPL-X form. GRAIL uses GMR to retarget it to the Unitree G1. The output is a set of robot-ready directories:

data/motion_lib/<name>/
  robot/       # G1 joint trajectories, one pkl per motion
  objects/     # object 6-DoF trajectories
  object_usd/  # Isaac Lab-ready USD assets
  meta/        # table pose, object name, sequence metadata
  bps/         # shape encoding for multi-object manipulation
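
Before pointing a training run at a motion library, it can help to confirm that the layout above is complete. A small sketch using only the directory names listed above; the helper name is hypothetical and not part of the GRAIL repo:

```python
from pathlib import Path

# Subdirectories taken from the layout listed above.
EXPECTED_SUBDIRS = ("robot", "objects", "object_usd", "meta", "bps")


def missing_subdirs(motion_lib_dir):
    """Return the expected subdirectories that are absent from a motion library."""
    root = Path(motion_lib_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all five directories exist; it does not check that the files inside are valid.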

End-to-end retargeting command:

conda activate sonic
export DISPLAY=:1

bash grail/retargeting/scripts/retarget_pipeline.sh \
  data/genhoi/benchmark_v3/generation/4dhoi_recon_valid/Hunyuan \
  benchmark_v3_0203

To run each stage separately:

python -m grail.retargeting.retarget \
  --data_dir data/genhoi/benchmark_v3/generation/4dhoi_recon_valid/Hunyuan \
  --all \
  --robot unitree_g1 \
  --no_viewer \
  --output_dir data/motion_lib/benchmark_v3_0203

python -m grail.retargeting.process \
  --input data/motion_lib/benchmark_v3_0203 \
  --output data/motion_lib/benchmark_v3_0203_ha \
  --meta_pkl grail/retargeting/data/g1_skeleton_meta.pkl \
  --include_contact_points \
  --grasp_from_lift \
  --lift_threshold 0.02 \
  --grasp_anticipation_frames 10 \
  --skip_no_lift \
  --per_object

python -m grail.retargeting.compute_bps \
  --object_usd_dir data/motion_lib/benchmark_v3_0203/object_usd \
  --output_dir data/motion_lib/benchmark_v3_0203/bps

For terrain or sitting, GRAIL recommends --zero_out_wrist because the task does not need hand IK.
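
The flag names `--grasp_from_lift`, `--lift_threshold 0.02`, and `--grasp_anticipation_frames 10` in the process stage suggest a simple labeling rule: detect when the object rises above its resting height, then mark the grasp as starting a few frames earlier. The sketch below is a guess at that logic for illustration only. The actual behavior of `grail.retargeting.process` may differ, the function name is hypothetical, and metres are assumed as the unit.

```python
def grasp_labels_from_lift(object_heights, lift_threshold=0.02, anticipation_frames=10):
    """Label each frame as grasped (True) or not, from object height alone.

    The object counts as lifted once its height exceeds the first-frame height
    by more than lift_threshold. The grasp label starts anticipation_frames
    before that, so the hand closes before lift-off.
    Returns None when the object is never lifted (compare --skip_no_lift).
    """
    rest_height = object_heights[0]
    lift_frame = next(
        (i for i, h in enumerate(object_heights) if h - rest_height > lift_threshold),
        None,
    )
    if lift_frame is None:
        return None
    grasp_start = max(0, lift_frame - anticipation_frames)
    return [i >= grasp_start for i in range(len(object_heights))]


# 30 frames resting at 0.75 m, then a steady lift of 1.5 cm per frame.
heights = [0.75] * 30 + [0.75 + 0.015 * k for k in range(1, 11)]
labels = grasp_labels_from_lift(heights)
```

Deriving the grasp from object motion avoids depending on noisy finger contact estimates, at the cost of missing grasps that never lift the object.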

Installing GRAIL

The repo is tested on Ubuntu 22.04+ with NVIDIA GPUs such as the A6000, RTX 4090, RTX 5090, and RTX 6000 Ada. There are three environments:

Env	Python	Used for
`grail`	3.10	2D generation, 4D reconstruction, optimization
`hunyuan`	3.10	Hunyuan3D-2.1 asset generation
`sonic`	3.11	retargeting, Isaac Lab, task-general tracking

The fastest route is Docker:

docker pull docker.io/nvgrail/grail:latest

docker run --gpus all -it --shm-size=16g \
  -v /path/to/grail:/workspace/grail \
  docker.io/nvgrail/grail:latest

cd /workspace/grail
bash scripts/setup/install_env_docker.sh
bash scripts/setup/download_checkpoints.sh
conda activate grail
source .env

If you need to train SONIC or run the full retargeting:

bash scripts/setup/install_env_sonic.sh

Minimum environment variables:

export CUDA_HOME=/usr/local/cuda-12.1
export PYOPENGL_PLATFORM=egl
export OMNI_KIT_ACCEPT_EULA=Yes
export OPENAI_API_KEY=<your-key>
export KLING_ACCESS_KEY=<your-key>
export KLING_SECRET_KEY=<your-key>
export HF_TOKEN=<your-token>

The checkpoints are fairly large: GEM-SMPL is about 14 GB, GEM-SOMA about 6.4 GB, FoundationPose about 250 MB, plus the SMPL-X body models and RealESRGAN. Make sure you have the disk space first, especially if you also install Isaac Sim/Lab for sonic.
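
A short preflight check can catch a missing key or a full disk before the download script runs. This is a sketch using only the variable names and sizes quoted above; the helper names and the 10 GB safety margin are assumptions, and the SMPL-X and RealESRGAN sizes are left out because the text does not give them.

```python
import os
import shutil

REQUIRED_ENV_VARS = (
    "CUDA_HOME",
    "PYOPENGL_PLATFORM",
    "OMNI_KIT_ACCEPT_EULA",
    "OPENAI_API_KEY",
    "KLING_ACCESS_KEY",
    "KLING_SECRET_KEY",
    "HF_TOKEN",
)

# Sizes quoted above, in GB; SMPL-X and RealESRGAN are not included.
CHECKPOINT_SIZES_GB = {"GEM-SMPL": 14.0, "GEM-SOMA": 6.4, "FoundationPose": 0.25}


def missing_env_vars(environ=os.environ):
    """Return required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


def has_disk_space(path=".", margin_gb=10.0):
    """Check free space against the listed checkpoints plus a safety margin."""
    needed_gb = sum(CHECKPOINT_SIZES_GB.values()) + margin_gb
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb
```

Calling `missing_env_vars()` after `source .env` should return an empty list; Isaac Sim/Lab needs additional space beyond what `has_disk_space` accounts for.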

Training: from motion library to policy

GRAIL trains a task-general tracking policy on SONIC rather than a separate controller per object. For manipulation, it adds an object-aware latent adaptor: the adaptor reads the object state and motion context, modulates the controller's latent token, and also outputs hand actions. For terrain/sitting, it uses a scene-aware tracker with a height-map encoder to condition stepping and body alignment.

Smoke test for pick-up:

conda activate sonic
export HYDRA_FULL_ERROR=1 PYTHONUNBUFFERED=1 WANDB_MODE=offline

cd imports/SONIC
python -u train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=4 \
  headless=True \
  ++algo.config.num_learning_iterations=3 \
  ++manager_env.config.gpu_collision_stack_size_exp=28 \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

Real training on 8 GPUs:

conda activate sonic
export HYDRA_FULL_ERROR=1 PYTHONUNBUFFERED=1

cd imports/SONIC
accelerate launch --num_processes=8 train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=2048 \
  headless=True \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

To fine-tune from the public checkpoint:

python -u train_agent_trl.py \
  +exp=manager/universal_token/hoi/pnp_table \
  num_envs=2048 \
  headless=True \
  ++resume=True \
  ++checkpoint=models/pnp_table/last.pt \
  experiment_dir=${FINETUNE_DIR} \
  ++algo.config.num_learning_iterations=10000 \
  ++manager_env.commands.motion.motion_lib_cfg.motion_file=${DATA_DIR}/robot \
  ++manager_env.commands.motion.motion_lib_cfg.object_motion_file=${DATA_DIR}/objects \
  ++manager_env.config.object_usd_path=${DATA_DIR}/object_usd \
  ++manager_env.commands.motion.motion_lib_cfg.bps_dir=${BPS_DIR}

For VLA whole-body manipulation, there are two ways to use GRAIL:

Use GRAIL to train the tracker first, then distill it into an egocentric RGB policy as in the paper.
Export GRAIL episodes to a VLA format, mix in a small amount of real teleoperation, then fine-tune a model such as GR00T or a similar policy.

The project page says the authors tried co-training GR00T on a mixture of 95% GRAIL + 5% teleoperation, which raised grasping success compared with teleoperation alone and reduced cases where the policy stood in front of the object without reaching the target. For production, you should still keep a small share of real teleop to calibrate camera, latency, hand compliance, and surface friction.
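
A 95/5 mixture can be implemented as a sampling rule at batch time rather than by physically merging datasets. The sketch below assumes the ratio applies per sampled episode, which the source does not specify; the function name and the stand-in episode IDs are hypothetical.

```python
import random


def sample_mixed_batch(grail_episodes, teleop_episodes, batch_size,
                       grail_fraction=0.95, rng=None):
    """Draw a batch where each slot comes from GRAIL with probability grail_fraction."""
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        source = grail_episodes if rng.random() < grail_fraction else teleop_episodes
        batch.append(rng.choice(source))
    return batch


# Stand-in episode IDs; a real loader would return observation/action tensors.
grail = [f"grail_{i}" for i in range(1000)]
teleop = [f"teleop_{i}" for i in range(50)]
batch = sample_mixed_batch(grail, teleop, batch_size=2000, rng=random.Random(0))
grail_share = sum(ep.startswith("grail_") for ep in batch) / len(batch)
```

Sampling by probability keeps the ratio stable even when the teleoperation set is much smaller than the synthetic one, at the cost of revisiting the same real episodes often.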

A simple schema for converting to VLA:

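def render_egocentric(grail_motion):
    """Placeholder: render head-camera RGB frames for one episode (not a GRAIL API)."""
    raise NotImplementedError("plug in your simulator's egocentric renderer")


def export_episode(grail_motion):
    """Map one GRAIL motion record to a generic VLA episode dict.

    The keys read from grail_motion are illustrative field names.
    """
    return {
        "observation": {
            "rgb": render_egocentric(grail_motion),
            "proprio": grail_motion["robot_state"],
            # Example instruction; replace with the task's own language label.
            "language": "pick up the object and place it on the table",
        },
        "action": {
            "base": grail_motion["base_velocity"],
            "joints": grail_motion["g1_joint_targets"],
            "hands": grail_motion["hand_dof_pos"],
        },
        "metadata": {
            "object_usd": grail_motion["object_asset"],
            "source": "GRAIL",
            "sim_fps": 25,
        },
    }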

If you use LeRobot or a custom stack, keep the same principles: standardize the episode format, log complete RGB/proprio/action, and check sim-to-real before running on the real robot.
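
Standardizing the format is easier to enforce with a small validator run over every exported episode. A sketch that checks the fields of the schema above; it assumes `rgb` and `joints` are per-frame sequences of equal length, which is an assumption about your exporter rather than something GRAIL defines.

```python
REQUIRED_FIELDS = {
    "observation": ("rgb", "proprio", "language"),
    "action": ("base", "joints", "hands"),
    "metadata": ("object_usd", "source", "sim_fps"),
}


def validate_episode(episode):
    """Return a list of problems found in one exported episode (empty if none)."""
    problems = []
    for group, keys in REQUIRED_FIELDS.items():
        section = episode.get(group)
        if not isinstance(section, dict):
            problems.append(f"missing section: {group}")
            continue
        for key in keys:
            if section.get(key) is None:
                problems.append(f"missing field: {group}.{key}")
    if not problems:
        # Assumes one RGB frame per joint-target step.
        n_obs = len(episode["observation"]["rgb"])
        n_act = len(episode["action"]["joints"])
        if n_obs != n_act:
            problems.append(f"length mismatch: {n_obs} frames vs {n_act} actions")
    return problems
```

Running this over the whole export and rejecting any episode with a non-empty result catches silent logging gaps before training.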

Inference on Unitree G1

In the paper, deployment streams head-camera RGB and proprioceptive input to a desktop with an NVIDIA RTX 5090, which streams actions back to the G1. The camera is a Luxonis OAK-D W, and inference runs at 10 Hz.

The runtime loop can be described as follows:

OAK-D W RGB frame
        |
        v
Visual policy / VLA
        |
        v
Latent token or action chunk
        |
        v
SONIC controller / low-level action decoder
        |
        v
Unitree G1 joints + hands
        |
        v
New proprioception and camera frame

Checklist before running on the real robot:

Item	Why it matters
Camera extrinsics	A wrong camera pose makes the policy reach off-target
Joint limits	Synthetic actions can exceed hardware limits
Latency	A 10 Hz loop needs explicit buffering and a watchdog
Hand calibration	Real G1 hand DOFs and contact differ from simulation
Emergency stop	Whole-body manipulation carries a risk of falls and collisions
Domain randomization	Randomize lighting, texture, object pose, and camera noise
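
The joint-limit and latency rows translate into two small guards on the desktop side. This is a generic sketch, not the paper's deployment code: the class and function names are hypothetical, and the 0.3 s timeout (three missed ticks at 10 Hz) is an arbitrary starting point.

```python
import time


def clamp_to_limits(targets, lower, upper):
    """Clip each joint target into its [lower, upper] range."""
    return [min(max(q, lo), hi) for q, lo, hi in zip(targets, lower, upper)]


class ActionWatchdog:
    """Flags the action stream as stale when no action arrives in time."""

    def __init__(self, timeout_s=0.3, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_action_time = None

    def feed(self):
        """Call whenever a fresh action arrives from the policy."""
        self.last_action_time = self.clock()

    def is_stale(self):
        """True if no action has arrived yet or the last one is too old."""
        if self.last_action_time is None:
            return True
        return self.clock() - self.last_action_time > self.timeout_s
```

When `is_stale()` returns True, the low-level loop would typically stop applying policy actions and fall back to a safe behavior; what that fallback is depends on your controller.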

Beginners often deploy the policy too early. Go layer by layer: visualize the trajectory, replay in Isaac Lab, test the tracker headless, test the visual policy in sim, run shadow mode on the G1, and only then let the robot exert force.

Results and takeaways

Experiment	Main result
HOI generation on 20 objects	GRAIL reaches a tracking success rate (SR) of 88.9%, higher than HOIDiff, CHOIS, and DAViD
Task-general tracking	The full model reaches an SR of 81.4%, versus 48.5% for HDMI and 49.2% for ResMimic
Real-world pick-up	84% average on seen objects, 80% on unseen objects
Real-world stair-climbing	90% success
Dataset scale	More than 20,000 sequences, 1,000 object assets, and 1,000 terrain configurations

What is worth learning from GRAIL is not only the numbers. The core idea: when synthetic data is built from controlled geometry, the video prior can contribute motion realism while simulation ensures physical feasibility. This is a more pragmatic direction than trusting video generation alone or manual teleoperation alone.

GRAIL still has limits. It needs 3D assets and scene setup, and the VFM has to follow the prompt well enough. Reconstruction can fail under heavy occlusion, very fast motion, object appearance that changes across frames, or when FoundationPose loses tracking. The failure filter discards a portion of the sequences. In addition, the task-general tracker still needs fine-tuning when the motion family changes substantially.

Where should you start?

If you want to try it on a workstation:

Install the Docker image and run the quick start with cordless_drill.
Visualize the outputs recon_result.mp4 and recon_comparison.mp4.
Retarget a small motion to the G1.
Run the SONIC smoke test with num_envs=4.
Export the motion to an episode format for your policy.
Deploy on hardware only after replay in sim is stable.

GRAIL fits best when you need to scale data for humanoid manipulation but do not have time to collect thousands of teleoperation episodes. For the Unitree G1, the pipeline gives a fairly clear recipe: 3D assets create the scene, the video prior creates the behavior, 4D reconstruction creates the trajectory, SONIC turns the trajectory into control, and the VLA learns the mapping from perception to action.

Nguyễn Anh Tuấn
