GR00T Whole-Body VLA Data: Simulation Data

Disclosure: This article may contain affiliate or referral links. If you buy or sign up through those links, VnRobo may earn a commission or service credit. The technical recommendations prioritize engineering fit first.

Part 1 used open datasets to walk through download -> verify -> fine-tune -> inference. Part 2 switches to simulation data: generate trajectories in Isaac Lab / IsaacLab-Arena, convert them into GR00T-LeRobot, then train in two modes:

Sim-only: train only on simulation data.
Sim + public mix: mix simulation data with a public dataset to reduce overfitting to the simulator.

Important caveat: as of June 6, 2026, there is no single universal command that converts every IsaacLab-Arena task into GR00T-LeRobot for every embodiment. Some datasets are already converted, such as nvidia/Arena-G1-Loco-Manipulation-Task. If you build a new task/env, you usually need a custom exporter from HDF5/trajectory buffers to LeRobot v2 plus meta/modality.json.

2.1 Generate Simulation Data

Goal

Generate episodes with:

RGB camera frames: observation.images.ego_view, optionally wrist cameras.
State: joint positions, base/torso state, hand state, object/goal pose if needed.
Action: raw joint targets, base velocities, hand pose, or SONIC latent/action depending on your controller.
Language annotation: task prompt per episode/frame.
Episode metadata: success/failure, length, task id.

For a G1 / GEAR-SONIC style whole-body VLA workflow, the target loop is:

sim rollout
  -> camera + robot state + task prompt + action
  -> GR00T-LeRobot v2
  -> train GR00T
  -> PolicyServer
  -> SONIC decoder / simulation control loop

Environment Requirements

Simulation machine:

Ubuntu 22.04/24.04.
NVIDIA GPU such as RTX 4090, RTX 6000 Ada, A6000, A100, or H100.
VRAM:
- 16-24 GB: small debug env, few cameras, low parallelism.
- 24-48 GB: moderate data collection with cameras.
- 80 GB or multi-GPU: large parallel rollout/rendering.
Isaac Sim / Isaac Lab version compatible with the IsaacLab-Arena release you use.
ACCEPT_EULA=Y and PRIVACY_CONSENT=Y when required.

Training machine:

Debug fine-tune: 1 GPU with 48-80 GB VRAM is much better than 24 GB.
Serious whole-body fine-tune: prepare 4+ 80 GB GPUs.
Loader-only verification can run on a smaller machine, but does not tell you about training quality.

Fastest Route: Use Already Converted Sim Data

Before writing your own exporter, download a converted simulation dataset:

cd Isaac-GR00T
mkdir -p datasets

uv run hf download nvidia/Arena-G1-Loco-Manipulation-Task \
  --repo-type dataset \
  --include "lerobot/**" \
  --local-dir datasets/arena_g1_loco

export SIM_DATASET="$PWD/datasets/arena_g1_loco/lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"

The dataset card describes a G1 loco-manipulation task in IsaacLab-Arena with HDF5 source data and a lerobot folder converted to GR00T-Lerobot. This is the best place to start because you can debug the GR00T side before debugging a custom exporter.

Load IsaacLab-Arena To Check Your Setup

The LeRobot/IsaacLab-Arena docs show a random rollout script pattern. Create:

cat > test_env_load_arena.py <<'PY'
import logging
from dataclasses import asdict
from pprint import pformat

import torch
import tqdm
from lerobot.configs import parser
from lerobot.configs.eval import EvalPipelineConfig

@parser.wrap()
def main(cfg: EvalPipelineConfig):
    logging.info(pformat(asdict(cfg)))
    from lerobot.envs import make_env

    env_dict = make_env(cfg.env, n_envs=cfg.env.num_envs, trust_remote_code=True)
    env = next(iter(env_dict.values()))[0]
    obs, info = env.reset()

    for _ in tqdm.tqdm(range(cfg.env.episode_length)):
        with torch.inference_mode():
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated.any() or truncated.any():
                obs, info = env.reset()
    env.close()

if __name__ == "__main__":
    main()
PY

Run:

export ACCEPT_EULA=Y
export PRIVACY_CONSENT=Y

python test_env_load_arena.py \
  --env.hub_path=nvidia/isaaclab-arena-envs \
  --env.type=isaaclab_arena \
  --env.environment=g1_locomanip_pnp \
  --env.embodiment=gr1_pink \
  --env.object=cracker_box \
  --env.num_envs=4 \
  --env.enable_cameras=true \
  --env.headless=false \
  --env.seed=1000 \
  --env.video=true \
  --env.video_length=10 \
  --env.video_interval=15

Headless:

python test_env_load_arena.py \
  --env.hub_path=nvidia/isaaclab-arena-envs \
  --env.type=isaaclab_arena \
  --env.environment=g1_locomanip_pnp \
  --env.embodiment=gr1_pink \
  --env.object=cracker_box \
  --env.num_envs=8 \
  --env.enable_cameras=true \
  --env.headless=true \
  --env.video=true \
  --env.video_length=10 \
  --env.video_interval=15

Done-correct criteria:

Env resets successfully.
Camera rendering works.
Eval videos appear under outputs/eval/.../videos/....
No EULA/privacy error.
If CUDA OOM occurs, reduce --env.num_envs, video frequency, or camera resolution.

Collect Episodes In Simulation

Common data generation modes:

Method	Use case	Notes
Teleoperation in sim	Human demonstrations	Better quality, slower collection.
Mimic / trajectory generation	Scaling from seed demos	Requires task annotations and source demos.
Scripted policy/controller	Simple baseline tasks	Easy to bias, weak generalization if randomization is poor.

Expected HDF5-style output:

sim_outputs/g1_pick_place_hdf5/
├── demo_000000.hdf5
├── demo_000001.hdf5
├── demo_000002.hdf5
└── manifest.json

Each HDF5 file should contain keys like:

/observations/state
/observations/images/ego_view
/observations/images/left_wrist
/observations/images/right_wrist
/actions
/timestamps
/language/task_description
/episode/success

If your generator uses different key names, that is fine. The converter must map them into LeRobot columns and modality.json.

Minimum Domain Randomization

Do not generate 1000 identical episodes. At minimum, randomize:

Object pose and goal pose.
Lighting, texture, table/floor material.
Camera extrinsic jitter: yaw, pitch, height.
Robot initial base pose.
Action/observation delay and noise.
Background distractors if policy uses RGB.

Example config concept:

randomization:
  object_xy: [-0.25, 0.25]
  object_yaw_deg: [-180, 180]
  goal_xy: [-0.20, 0.20]
  camera_yaw_deg: [-3, 3]
  camera_pitch_deg: [-2, 2]
  light_intensity: [300, 900]
  action_noise_std: 0.01
  joint_state_noise_std: 0.002

For locomotion/contact tasks, start with a narrow randomization range. Increase it only after the generator success rate is stable.

2.2 Convert Simulation Data To GR00T-LeRobot

Goal

Convert:

sim_outputs/g1_pick_place_hdf5/

into:

datasets/g1_pick_place_lerobot/
├── meta/
│   ├── info.json
│   ├── episodes.jsonl
│   ├── tasks.jsonl
│   └── modality.json
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── episode_000001.parquet
└── videos/
    └── chunk-000/
        └── observation.images.ego_view/
            ├── episode_000000.mp4
            └── episode_000001.mp4

Example Converter

This is a template you must adapt to your actual HDF5 keys. It is not an official NVIDIA script; it is a practical, testable converter shape.

mkdir -p tools
cat > tools/convert_sim_hdf5_to_groot_lerobot.py <<'PY'
import argparse
import json
from pathlib import Path

import h5py
import imageio.v3 as iio
import numpy as np
import pandas as pd

def write_jsonl(path, rows):
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--fps", type=int, default=20)
    parser.add_argument("--task", default="pick up the object and place it at the goal")
    args = parser.parse_args()

    in_dir = Path(args.input_dir)
    out = Path(args.output_dir)
    meta = out / "meta"
    data_dir = out / "data" / "chunk-000"
    video_dir = out / "videos" / "chunk-000" / "observation.images.ego_view"
    meta.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    video_dir.mkdir(parents=True, exist_ok=True)

    h5_files = sorted(in_dir.glob("*.hdf5"))
    if not h5_files:
        raise SystemExit(f"no .hdf5 files in {in_dir}")

    episodes = []
    tasks = [{"task_index": 0, "task": args.task}]
    global_frame_index = 0

    for ep_idx, h5_path in enumerate(h5_files):
        with h5py.File(h5_path, "r") as h5:
            state = np.asarray(h5["observations/state"], dtype=np.float32)
            action = np.asarray(h5["actions"], dtype=np.float32)
            images = np.asarray(h5["observations/images/ego_view"], dtype=np.uint8)

        if len(state) != len(action) or len(state) != len(images):
            raise SystemExit(f"length mismatch in {h5_path}")

        rows = []
        for t in range(len(state)):
            rows.append({
                "index": global_frame_index,
                "episode_index": ep_idx,
                "frame_index": t,
                "timestamp": t / args.fps,
                "task_index": 0,
                "observation.state": state[t].astype(np.float32).tolist(),
                "action": action[t].astype(np.float32).tolist(),
                "annotation.human.action.task_description": 0,
            })
            global_frame_index += 1

        pd.DataFrame(rows).to_parquet(data_dir / f"episode_{ep_idx:06d}.parquet", index=False)
        iio.imwrite(video_dir / f"episode_{ep_idx:06d}.mp4", images, fps=args.fps)
        episodes.append({"episode_index": ep_idx, "tasks": [0], "length": len(state)})

    state_dim = int(np.asarray(rows[0]["observation.state"]).shape[0])
    action_dim = int(np.asarray(rows[0]["action"]).shape[0])

    (meta / "info.json").write_text(json.dumps({
        "codebase_version": "v2.0",
        "fps": args.fps,
        "total_episodes": len(episodes),
        "total_frames": sum(ep["length"] for ep in episodes),
        "features": {
            "index": {"dtype": "int64", "shape": [1]},
            "episode_index": {"dtype": "int64", "shape": [1]},
            "frame_index": {"dtype": "int64", "shape": [1]},
            "timestamp": {"dtype": "float32", "shape": [1]},
            "task_index": {"dtype": "int64", "shape": [1]},
            "observation.state": {"dtype": "float32", "shape": [state_dim]},
            "action": {"dtype": "float32", "shape": [action_dim]},
            "observation.images.ego_view": {"dtype": "video", "shape": [480, 640, 3]},
        },
    }, indent=2), encoding="utf-8")

    modality = {
        "state": {"whole_body": {"start": 0, "end": state_dim}},
        "action": {"whole_body": {"start": 0, "end": action_dim}},
        "video": {"ego_view": {"original_key": "observation.images.ego_view"}},
        "annotation": {
            "human.action.task_description": {
                "original_key": "annotation.human.action.task_description"
            }
        }
    }
    (meta / "modality.json").write_text(json.dumps(modality, indent=2), encoding="utf-8")
    write_jsonl(meta / "episodes.jsonl", episodes)
    write_jsonl(meta / "tasks.jsonl", tasks)

    print(f"wrote {out}")
    print(f"episodes={len(episodes)}")
    print(f"state_dim={state_dim} action_dim={action_dim}")

if __name__ == "__main__":
    main()
PY

Run:

uv pip install h5py pandas pyarrow imageio imageio-ffmpeg

uv run python tools/convert_sim_hdf5_to_groot_lerobot.py \
  --input-dir sim_outputs/g1_pick_place_hdf5 \
  --output-dir datasets/g1_pick_place_lerobot \
  --fps 20 \
  --task "pick up the object and place it at the goal"

Verify:

export SIM_DATASET="$PWD/datasets/g1_pick_place_lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"
python -m json.tool "$SIM_DATASET/meta/modality.json"

This converter is a minimal template. Before a large training run, compare meta/info.json, parquet columns, and video layout against a public dataset that already loads in the same Isaac-GR00T/LeRobot branch. If your loader branch uses the LeRobot v2.1 data/train-00000.parquet style, keep that format instead of forcing chunk/episode files.

Improve `modality.json` For Whole-Body Data

The template uses a single "whole_body" slice. For better training and debugging, split state/action by body parts:

{
  "state": {
    "left_leg": { "start": 0, "end": 6 },
    "right_leg": { "start": 6, "end": 12 },
    "waist": { "start": 12, "end": 15 },
    "left_arm": { "start": 15, "end": 22 },
    "left_hand": { "start": 22, "end": 29 },
    "right_arm": { "start": 29, "end": 36 },
    "right_hand": { "start": 36, "end": 43 }
  },
  "action": {
    "base": { "start": 0, "end": 3 },
    "left_arm": { "start": 3, "end": 10 },
    "left_hand": { "start": 10, "end": 17 },
    "right_arm": { "start": 17, "end": 24 },
    "right_hand": { "start": 24, "end": 31 }
  }
}

If you use SONIC, action slicing must match the 64-dimensional latent token plus hands. If you only have raw joint/action targets, do not label it as SONIC.

2.3 Training With Simulation Data

Branch A: Sim-Only Training

Goal: verify that the model can learn inside the simulator distribution.

cd Isaac-GR00T

export NUM_GPUS=1
export SIM_DATASET=/abs/path/to/datasets/g1_pick_place_lerobot
export OUT=/mnt/checkpoints/groot_g1_sim_only_debug

CUDA_VISIBLE_DEVICES=0 uv run python \
  gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.7-3B \
  --dataset-path "$SIM_DATASET" \
  --embodiment-tag NEW_EMBODIMENT \
  --modality-config-path /abs/path/to/configs/g1_sim_config.py \
  --num-gpus $NUM_GPUS \
  --output-dir "$OUT" \
  --save-total-limit 3 \
  --save-steps 1000 \
  --max-steps 5000 \
  --global-batch-size 8 \
  --dataloader-num-workers 2

For UNITREE_G1_SONIC:

export NUM_GPUS=4
export SIM_DATASET=/abs/path/to/sonic_lerobot_dataset
export OUT=/mnt/checkpoints/groot_g1_sonic_sim_only

uv run torchrun --nproc_per_node=$NUM_GPUS --master_port=29500 \
  gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.7-3B \
  --dataset-path "$SIM_DATASET" \
  --embodiment-tag UNITREE_G1_SONIC \
  --modality-config-path gr00t/configs/data/embodiment_configs.py \
  --num-gpus $NUM_GPUS \
  --output-dir "$OUT" \
  --save-total-limit 5 \
  --save-steps 5000 \
  --max-steps 20000 \
  --use-wandb \
  --global-batch-size 32 \
  --dataloader-num-workers 4

Done-correct criteria:

Loader iterates through many episodes.
Loss is not NaN.
The checkpoint loads in open-loop eval or PolicyServer.
In sim, the policy does not collapse to a single average pose.

Branch B: Mix Simulation And Public Data

Goal: use public data to reduce overfitting to simulator textures, lighting, geometry, or task layout.

Two options:

If your launcher supports multiple dataset paths, use the documented flag. Verify with --help.
More reliable: physically merge datasets into a new LeRobot dataset and reindex episodes.

Check the launcher:

uv run python gr00t/experiment/launch_finetune.py --help | grep -i dataset

If multi-dataset support is unclear, merge physically.

Physical Dataset Merge

Create a manifest:

cat > datasets_mix.txt <<'TXT'
/abs/path/to/datasets/g1_pick_place_lerobot
/abs/path/to/datasets/arena_g1_loco/lerobot
TXT

Merge script:

cat > tools/merge_groot_lerobot_datasets.py <<'PY'
import argparse
import json
import shutil
from pathlib import Path

import pandas as pd

def read_jsonl(path):
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def write_jsonl(path, rows):
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-list", required=True)
    parser.add_argument("--output-dir", required=True)
    args = parser.parse_args()

    datasets = [
        Path(line.strip())
        for line in Path(args.dataset_list).read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    out = Path(args.output_dir)
    (out / "meta").mkdir(parents=True, exist_ok=True)
    (out / "data" / "chunk-000").mkdir(parents=True, exist_ok=True)
    (out / "videos" / "chunk-000").mkdir(parents=True, exist_ok=True)

    first_modality = json.loads((datasets[0] / "meta" / "modality.json").read_text())
    episodes_out = []
    tasks_out = []
    task_offset = 0
    ep_out = 0
    global_frame_index = 0

    for ds in datasets:
        modality = json.loads((ds / "meta" / "modality.json").read_text())
        if modality != first_modality:
            raise SystemExit(f"modality mismatch: {ds}. Merge only datasets with same schema.")

        tasks = read_jsonl(ds / "meta" / "tasks.jsonl")
        task_map = {}
        for task in tasks:
            old = task.get("task_index", len(task_map))
            new = task_offset + len(task_map)
            task_map[old] = new
            task["task_index"] = new
            tasks_out.append(task)
        task_offset += len(task_map)

        episodes = read_jsonl(ds / "meta" / "episodes.jsonl")
        parquet_files = sorted((ds / "data").rglob("*.parquet"))
        video_files = sorted((ds / "videos").rglob("*.mp4"))

        if len(parquet_files) < len(episodes):
            raise SystemExit(f"not enough parquet files in {ds}")

        for i, ep in enumerate(episodes):
            ep_new = dict(ep)
            ep_new["episode_index"] = ep_out
            ep_new["tasks"] = [task_map.get(t, t) for t in ep.get("tasks", [])]
            episodes_out.append(ep_new)

            df = pd.read_parquet(parquet_files[i])
            df["episode_index"] = ep_out
            if "frame_index" in df.columns:
                df["frame_index"] = range(len(df))
            if "task_index" in df.columns:
                task_id = ep_new["tasks"][0] if ep_new["tasks"] else 0
                df["task_index"] = task_id
            if "index" in df.columns:
                df["index"] = range(global_frame_index, global_frame_index + len(df))
            global_frame_index += len(df)
            df.to_parquet(out / "data" / "chunk-000" / f"episode_{ep_out:06d}.parquet", index=False)

            if i < len(video_files):
                target_dir = out / "videos" / "chunk-000" / video_files[i].parent.name
                target_dir.mkdir(parents=True, exist_ok=True)
                shutil.copy2(video_files[i], target_dir / f"episode_{ep_out:06d}.mp4")
            ep_out += 1

    shutil.copy2(datasets[0] / "meta" / "modality.json", out / "meta" / "modality.json")
    info = json.loads((datasets[0] / "meta" / "info.json").read_text())
    info["total_episodes"] = len(episodes_out)
    if all("length" in ep for ep in episodes_out):
        info["total_frames"] = sum(ep["length"] for ep in episodes_out)
    info["total_tasks"] = len(tasks_out)
    (out / "meta" / "info.json").write_text(json.dumps(info, indent=2), encoding="utf-8")
    write_jsonl(out / "meta" / "episodes.jsonl", episodes_out)
    write_jsonl(out / "meta" / "tasks.jsonl", tasks_out)
    print(f"merged {len(datasets)} datasets -> {out}")
    print(f"episodes={len(episodes_out)} tasks={len(tasks_out)}")

if __name__ == "__main__":
    main()
PY

Run:

uv run python tools/merge_groot_lerobot_datasets.py \
  --dataset-list datasets_mix.txt \
  --output-dir datasets/g1_mix_sim_public

uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public

The script intentionally fails if modality.json differs. That is correct. Do not directly mix raw G1 actions with SONIC latent actions unless you have converted them to the same schema.

Check the merged output:

uv pip install pandas pyarrow

uv run python -c "import pandas as pd; df=pd.read_parquet('datasets/g1_mix_sim_public/data/chunk-000/episode_000000.parquet'); print(df[['episode_index','frame_index','task_index']].head()); print(df.shape)"
uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public

Starting Mix Ratios

There is no universal ratio. Start with:

Goal	Suggested ratio
New sim task, public only for regularization	80% sim / 20% public
Better lighting/camera/object generalization	60% sim / 40% public
Very synthetic sim, high-quality real/public data	30-50% sim / 50-70% public

If physical merge has no sampling weights, duplicate paths in the manifest:

cat > datasets_mix_weighted.txt <<'TXT'
/abs/path/to/sim_dataset
/abs/path/to/sim_dataset
/abs/path/to/public_dataset
TXT

This is crude but predictable. For large training, implement sampler weights if your dataloader supports them.

Common Errors And Fixes

Error	Cause	Fix
IsaacLab-Arena renders no camera	Camera disabled	Add `--env.enable_cameras=true --env.headless=true`.
`EULA not accepted`	Missing env vars	`export ACCEPT_EULA=Y PRIVACY_CONSENT=Y`.
CUDA OOM in sim	Too many envs/cameras/videos	Reduce `--env.num_envs`, resolution, or video frequency.
Converted dataset fails loader	Wrong column names	Compare with public `lerobot` data; fix `observation.state`, `action`, annotation keys.
Video length differs from parquet length	Dropped frames or fps mismatch	Assert lengths in converter; interpolate or discard bad episodes.
Merge fails modality mismatch	Different schemas	Do not merge directly; convert to one schema or train separate stages.
Sim-only policy fails real robot	Sim-to-real gap	Add domain randomization, public/real mix, and verify camera calibration/action scaling.

Done-Correct Criteria

You completed Part 2 correctly if:

Simulation env loads and renders cameras.
You have at least 50 task-specific episodes for a serious first fine-tune.
Converted dataset has meta/info.json, episodes.jsonl, tasks.jsonl, and modality.json.
verify_groot_lerobot_dataset.py passes.
A 100-500 step smoke fine-tune runs without NaN/OOM.
Open-loop eval or PolicyServer loads the checkpoint.
If datasets are mixed, they share a schema or you have an explicit schema mapper.

Sources

GR00T Whole-Body VLA Data: Simulation Data

Sim-only: train only on simulation data.
Sim + public mix: mix simulation data with a public dataset to reduce overfitting to the simulator.

2.1 Generate Simulation Data

Goal

Generate episodes with:

RGB camera frames: observation.images.ego_view, optionally wrist cameras.
State: joint positions, base/torso state, hand state, object/goal pose if needed.
Action: raw joint targets, base velocities, hand pose, or SONIC latent/action depending on your controller.
Language annotation: task prompt per episode/frame.
Episode metadata: success/failure, length, task id.

For a G1 / GEAR-SONIC style whole-body VLA workflow, the target loop is:

sim rollout
  -> camera + robot state + task prompt + action
  -> GR00T-LeRobot v2
  -> train GR00T
  -> PolicyServer
  -> SONIC decoder / simulation control loop

Environment Requirements

Simulation machine:

Ubuntu 22.04/24.04.
NVIDIA GPU such as RTX 4090, RTX 6000 Ada, A6000, A100, or H100.
VRAM:
- 16-24 GB: small debug env, few cameras, low parallelism.
- 24-48 GB: moderate data collection with cameras.
- 80 GB or multi-GPU: large parallel rollout/rendering.
Isaac Sim / Isaac Lab version compatible with the IsaacLab-Arena release you use.
ACCEPT_EULA=Y and PRIVACY_CONSENT=Y when required.

Training machine:

Debug fine-tune: 1 GPU with 48-80 GB VRAM is much better than 24 GB.
Serious whole-body fine-tune: prepare 4+ 80 GB GPUs.
Loader-only verification can run on a smaller machine, but does not tell you about training quality.

Fastest Route: Use Already Converted Sim Data

Before writing your own exporter, download a converted simulation dataset:

cd Isaac-GR00T
mkdir -p datasets

uv run hf download nvidia/Arena-G1-Loco-Manipulation-Task \
  --repo-type dataset \
  --include "lerobot/**" \
  --local-dir datasets/arena_g1_loco

export SIM_DATASET="$PWD/datasets/arena_g1_loco/lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"

Load IsaacLab-Arena To Check Your Setup

The LeRobot/IsaacLab-Arena docs show a random rollout script pattern. Create:

cat > test_env_load_arena.py <<'PY'
import logging
from dataclasses import asdict
from pprint import pformat

import torch
import tqdm
from lerobot.configs import parser
from lerobot.configs.eval import EvalPipelineConfig

@parser.wrap()
def main(cfg: EvalPipelineConfig):
    logging.info(pformat(asdict(cfg)))
    from lerobot.envs import make_env

    env_dict = make_env(cfg.env, n_envs=cfg.env.num_envs, trust_remote_code=True)
    env = next(iter(env_dict.values()))[0]
    obs, info = env.reset()

    for _ in tqdm.tqdm(range(cfg.env.episode_length)):
        with torch.inference_mode():
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated.any() or truncated.any():
                obs, info = env.reset()
    env.close()

if __name__ == "__main__":
    main()
PY

Run:

export ACCEPT_EULA=Y
export PRIVACY_CONSENT=Y

python test_env_load_arena.py \
  --env.hub_path=nvidia/isaaclab-arena-envs \
  --env.type=isaaclab_arena \
  --env.environment=g1_locomanip_pnp \
  --env.embodiment=gr1_pink \
  --env.object=cracker_box \
  --env.num_envs=4 \
  --env.enable_cameras=true \
  --env.headless=false \
  --env.seed=1000 \
  --env.video=true \
  --env.video_length=10 \
  --env.video_interval=15

Headless:

python test_env_load_arena.py \
  --env.hub_path=nvidia/isaaclab-arena-envs \
  --env.type=isaaclab_arena \
  --env.environment=g1_locomanip_pnp \
  --env.embodiment=gr1_pink \
  --env.object=cracker_box \
  --env.num_envs=8 \
  --env.enable_cameras=true \
  --env.headless=true \
  --env.video=true \
  --env.video_length=10 \
  --env.video_interval=15

Done-correct criteria:

Env resets successfully.
Camera rendering works.
Eval videos appear under outputs/eval/.../videos/....
No EULA/privacy error.
If CUDA OOM occurs, reduce --env.num_envs, video frequency, or camera resolution.

Collect Episodes In Simulation

Common data generation modes:

Method	Use case	Notes
Teleoperation in sim	Human demonstrations	Better quality, slower collection.
Mimic / trajectory generation	Scaling from seed demos	Requires task annotations and source demos.
Scripted policy/controller	Simple baseline tasks	Easy to bias, weak generalization if randomization is poor.

Expected HDF5-style output:

sim_outputs/g1_pick_place_hdf5/
├── demo_000000.hdf5
├── demo_000001.hdf5
├── demo_000002.hdf5
└── manifest.json

Each HDF5 file should contain keys like:

/observations/state
/observations/images/ego_view
/observations/images/left_wrist
/observations/images/right_wrist
/actions
/timestamps
/language/task_description
/episode/success

If your generator uses different key names, that is fine. The converter must map them into LeRobot columns and modality.json.

Minimum Domain Randomization

Do not generate 1000 identical episodes. At minimum, randomize:

Object pose and goal pose.
Lighting, texture, table/floor material.
Camera extrinsic jitter: yaw, pitch, height.
Robot initial base pose.
Action/observation delay and noise.
Background distractors if policy uses RGB.

Example config concept:

randomization:
  object_xy: [-0.25, 0.25]
  object_yaw_deg: [-180, 180]
  goal_xy: [-0.20, 0.20]
  camera_yaw_deg: [-3, 3]
  camera_pitch_deg: [-2, 2]
  light_intensity: [300, 900]
  action_noise_std: 0.01
  joint_state_noise_std: 0.002

For locomotion/contact tasks, start with a narrow randomization range. Increase it only after the generator success rate is stable.

2.2 Convert Simulation Data To GR00T-LeRobot

Goal

Convert:

sim_outputs/g1_pick_place_hdf5/

into:

datasets/g1_pick_place_lerobot/
├── meta/
│   ├── info.json
│   ├── episodes.jsonl
│   ├── tasks.jsonl
│   └── modality.json
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── episode_000001.parquet
└── videos/
    └── chunk-000/
        └── observation.images.ego_view/
            ├── episode_000000.mp4
            └── episode_000001.mp4

Example Converter

This is a template you must adapt to your actual HDF5 keys. It is not an official NVIDIA script; it is a practical, testable converter shape.

mkdir -p tools
cat > tools/convert_sim_hdf5_to_groot_lerobot.py <<'PY'
import argparse
import json
from pathlib import Path

import h5py
import imageio.v3 as iio
import numpy as np
import pandas as pd

def write_jsonl(path, rows):
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--fps", type=int, default=20)
    parser.add_argument("--task", default="pick up the object and place it at the goal")
    args = parser.parse_args()

    in_dir = Path(args.input_dir)
    out = Path(args.output_dir)
    meta = out / "meta"
    data_dir = out / "data" / "chunk-000"
    video_dir = out / "videos" / "chunk-000" / "observation.images.ego_view"
    meta.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    video_dir.mkdir(parents=True, exist_ok=True)

    h5_files = sorted(in_dir.glob("*.hdf5"))
    if not h5_files:
        raise SystemExit(f"no .hdf5 files in {in_dir}")

    episodes = []
    tasks = [{"task_index": 0, "task": args.task}]
    global_frame_index = 0

    for ep_idx, h5_path in enumerate(h5_files):
        with h5py.File(h5_path, "r") as h5:
            state = np.asarray(h5["observations/state"], dtype=np.float32)
            action = np.asarray(h5["actions"], dtype=np.float32)
            images = np.asarray(h5["observations/images/ego_view"], dtype=np.uint8)

        if len(state) != len(action) or len(state) != len(images):
            raise SystemExit(f"length mismatch in {h5_path}")

        rows = []
        for t in range(len(state)):
            rows.append({
                "index": global_frame_index,
                "episode_index": ep_idx,
                "frame_index": t,
                "timestamp": t / args.fps,
                "task_index": 0,
                "observation.state": state[t].astype(np.float32).tolist(),
                "action": action[t].astype(np.float32).tolist(),
                "annotation.human.action.task_description": 0,
            })
            global_frame_index += 1

        pd.DataFrame(rows).to_parquet(data_dir / f"episode_{ep_idx:06d}.parquet", index=False)
        iio.imwrite(video_dir / f"episode_{ep_idx:06d}.mp4", images, fps=args.fps)
        episodes.append({"episode_index": ep_idx, "tasks": [0], "length": len(state)})

    state_dim = int(np.asarray(rows[0]["observation.state"]).shape[0])
    action_dim = int(np.asarray(rows[0]["action"]).shape[0])

    (meta / "info.json").write_text(json.dumps({
        "codebase_version": "v2.0",
        "fps": args.fps,
        "total_episodes": len(episodes),
        "total_frames": sum(ep["length"] for ep in episodes),
        "features": {
            "index": {"dtype": "int64", "shape": [1]},
            "episode_index": {"dtype": "int64", "shape": [1]},
            "frame_index": {"dtype": "int64", "shape": [1]},
            "timestamp": {"dtype": "float32", "shape": [1]},
            "task_index": {"dtype": "int64", "shape": [1]},
            "observation.state": {"dtype": "float32", "shape": [state_dim]},
            "action": {"dtype": "float32", "shape": [action_dim]},
            "observation.images.ego_view": {"dtype": "video", "shape": [480, 640, 3]},
        },
    }, indent=2), encoding="utf-8")

    modality = {
        "state": {"whole_body": {"start": 0, "end": state_dim}},
        "action": {"whole_body": {"start": 0, "end": action_dim}},
        "video": {"ego_view": {"original_key": "observation.images.ego_view"}},
        "annotation": {
            "human.action.task_description": {
                "original_key": "annotation.human.action.task_description"
            }
        }
    }
    (meta / "modality.json").write_text(json.dumps(modality, indent=2), encoding="utf-8")
    write_jsonl(meta / "episodes.jsonl", episodes)
    write_jsonl(meta / "tasks.jsonl", tasks)

    print(f"wrote {out}")
    print(f"episodes={len(episodes)}")
    print(f"state_dim={state_dim} action_dim={action_dim}")

if __name__ == "__main__":
    main()
PY

Run:

uv pip install h5py pandas pyarrow imageio imageio-ffmpeg

uv run python tools/convert_sim_hdf5_to_groot_lerobot.py \
  --input-dir sim_outputs/g1_pick_place_hdf5 \
  --output-dir datasets/g1_pick_place_lerobot \
  --fps 20 \
  --task "pick up the object and place it at the goal"

Verify:

export SIM_DATASET="$PWD/datasets/g1_pick_place_lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"
python -m json.tool "$SIM_DATASET/meta/modality.json"

Improve `modality.json` For Whole-Body Data

The template uses a single "whole_body" slice. For better training and debugging, split state/action by body parts:

{
  "state": {
    "left_leg": { "start": 0, "end": 6 },
    "right_leg": { "start": 6, "end": 12 },
    "waist": { "start": 12, "end": 15 },
    "left_arm": { "start": 15, "end": 22 },
    "left_hand": { "start": 22, "end": 29 },
    "right_arm": { "start": 29, "end": 36 },
    "right_hand": { "start": 36, "end": 43 }
  },
  "action": {
    "base": { "start": 0, "end": 3 },
    "left_arm": { "start": 3, "end": 10 },
    "left_hand": { "start": 10, "end": 17 },
    "right_arm": { "start": 17, "end": 24 },
    "right_hand": { "start": 24, "end": 31 }
  }
}

If you use SONIC, action slicing must match the 64-dimensional latent token plus hands. If you only have raw joint/action targets, do not label it as SONIC.

2.3 Training With Simulation Data

Branch A: Sim-Only Training

Goal: verify that the model can learn inside the simulator distribution.

cd Isaac-GR00T

export NUM_GPUS=1
export SIM_DATASET=/abs/path/to/datasets/g1_pick_place_lerobot
export OUT=/mnt/checkpoints/groot_g1_sim_only_debug

CUDA_VISIBLE_DEVICES=0 uv run python \
  gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.7-3B \
  --dataset-path "$SIM_DATASET" \
  --embodiment-tag NEW_EMBODIMENT \
  --modality-config-path /abs/path/to/configs/g1_sim_config.py \
  --num-gpus $NUM_GPUS \
  --output-dir "$OUT" \
  --save-total-limit 3 \
  --save-steps 1000 \
  --max-steps 5000 \
  --global-batch-size 8 \
  --dataloader-num-workers 2

For UNITREE_G1_SONIC:

export NUM_GPUS=4
export SIM_DATASET=/abs/path/to/sonic_lerobot_dataset
export OUT=/mnt/checkpoints/groot_g1_sonic_sim_only

uv run torchrun --nproc_per_node=$NUM_GPUS --master_port=29500 \
  gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.7-3B \
  --dataset-path "$SIM_DATASET" \
  --embodiment-tag UNITREE_G1_SONIC \
  --modality-config-path gr00t/configs/data/embodiment_configs.py \
  --num-gpus $NUM_GPUS \
  --output-dir "$OUT" \
  --save-total-limit 5 \
  --save-steps 5000 \
  --max-steps 20000 \
  --use-wandb \
  --global-batch-size 32 \
  --dataloader-num-workers 4

Done-correct criteria:

Loader iterates through many episodes.
Loss is not NaN.
The checkpoint loads in open-loop eval or PolicyServer.
In sim, the policy does not collapse to a single average pose.

Branch B: Mix Simulation And Public Data

Goal: use public data to reduce overfitting to simulator textures, lighting, geometry, or task layout.

Two options:

If your launcher supports multiple dataset paths, use the documented flag. Verify with --help.
More reliable: physically merge datasets into a new LeRobot dataset and reindex episodes.

Check the launcher:

uv run python gr00t/experiment/launch_finetune.py --help | grep -i dataset

If multi-dataset support is unclear, merge physically.

Physical Dataset Merge

Create a manifest:

cat > datasets_mix.txt <<'TXT'
/abs/path/to/datasets/g1_pick_place_lerobot
/abs/path/to/datasets/arena_g1_loco/lerobot
TXT

Merge script:

cat > tools/merge_groot_lerobot_datasets.py <<'PY'
import argparse
import json
import shutil
from pathlib import Path

import pandas as pd

def read_jsonl(path):
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def write_jsonl(path, rows):
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-list", required=True)
    parser.add_argument("--output-dir", required=True)
    args = parser.parse_args()

    datasets = [
        Path(line.strip())
        for line in Path(args.dataset_list).read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    out = Path(args.output_dir)
    (out / "meta").mkdir(parents=True, exist_ok=True)
    (out / "data" / "chunk-000").mkdir(parents=True, exist_ok=True)
    (out / "videos" / "chunk-000").mkdir(parents=True, exist_ok=True)

    first_modality = json.loads((datasets[0] / "meta" / "modality.json").read_text())
    episodes_out = []
    tasks_out = []
    task_offset = 0
    ep_out = 0
    global_frame_index = 0

    for ds in datasets:
        modality = json.loads((ds / "meta" / "modality.json").read_text())
        if modality != first_modality:
            raise SystemExit(f"modality mismatch: {ds}. Merge only datasets with same schema.")

        tasks = read_jsonl(ds / "meta" / "tasks.jsonl")
        task_map = {}
        for task in tasks:
            old = task.get("task_index", len(task_map))
            new = task_offset + len(task_map)
            task_map[old] = new
            task["task_index"] = new
            tasks_out.append(task)
        task_offset += len(task_map)

        episodes = read_jsonl(ds / "meta" / "episodes.jsonl")
        parquet_files = sorted((ds / "data").rglob("*.parquet"))
        video_files = sorted((ds / "videos").rglob("*.mp4"))

        if len(parquet_files) < len(episodes):
            raise SystemExit(f"not enough parquet files in {ds}")

        for i, ep in enumerate(episodes):
            ep_new = dict(ep)
            ep_new["episode_index"] = ep_out
            ep_new["tasks"] = [task_map.get(t, t) for t in ep.get("tasks", [])]
            episodes_out.append(ep_new)

            df = pd.read_parquet(parquet_files[i])
            df["episode_index"] = ep_out
            if "frame_index" in df.columns:
                df["frame_index"] = range(len(df))
            if "task_index" in df.columns:
                task_id = ep_new["tasks"][0] if ep_new["tasks"] else 0
                df["task_index"] = task_id
            if "index" in df.columns:
                df["index"] = range(global_frame_index, global_frame_index + len(df))
            global_frame_index += len(df)
            df.to_parquet(out / "data" / "chunk-000" / f"episode_{ep_out:06d}.parquet", index=False)

            if i < len(video_files):
                target_dir = out / "videos" / "chunk-000" / video_files[i].parent.name
                target_dir.mkdir(parents=True, exist_ok=True)
                shutil.copy2(video_files[i], target_dir / f"episode_{ep_out:06d}.mp4")
            ep_out += 1

    shutil.copy2(datasets[0] / "meta" / "modality.json", out / "meta" / "modality.json")
    info = json.loads((datasets[0] / "meta" / "info.json").read_text())
    info["total_episodes"] = len(episodes_out)
    if all("length" in ep for ep in episodes_out):
        info["total_frames"] = sum(ep["length"] for ep in episodes_out)
    info["total_tasks"] = len(tasks_out)
    (out / "meta" / "info.json").write_text(json.dumps(info, indent=2), encoding="utf-8")
    write_jsonl(out / "meta" / "episodes.jsonl", episodes_out)
    write_jsonl(out / "meta" / "tasks.jsonl", tasks_out)
    print(f"merged {len(datasets)} datasets -> {out}")
    print(f"episodes={len(episodes_out)} tasks={len(tasks_out)}")

if __name__ == "__main__":
    main()
PY

Run:

uv run python tools/merge_groot_lerobot_datasets.py \
  --dataset-list datasets_mix.txt \
  --output-dir datasets/g1_mix_sim_public

uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public

The script intentionally fails if modality.json differs. That is correct. Do not directly mix raw G1 actions with SONIC latent actions unless you have converted them to the same schema.

Check the merged output:

uv pip install pandas pyarrow

uv run python -c "import pandas as pd; df=pd.read_parquet('datasets/g1_mix_sim_public/data/chunk-000/episode_000000.parquet'); print(df[['episode_index','frame_index','task_index']].head()); print(df.shape)"
uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public

Starting Mix Ratios

There is no universal ratio. Start with:

Goal	Suggested ratio
New sim task, public only for regularization	80% sim / 20% public
Better lighting/camera/object generalization	60% sim / 40% public
Very synthetic sim, high-quality real/public data	30-50% sim / 50-70% public

If physical merge has no sampling weights, duplicate paths in the manifest:

cat > datasets_mix_weighted.txt <<'TXT'
/abs/path/to/sim_dataset
/abs/path/to/sim_dataset
/abs/path/to/public_dataset
TXT

This is crude but predictable. For large training, implement sampler weights if your dataloader supports them.

Common Errors And Fixes

Error	Cause	Fix
IsaacLab-Arena renders no camera	Camera disabled	Add `--env.enable_cameras=true --env.headless=true`.
`EULA not accepted`	Missing env vars	`export ACCEPT_EULA=Y PRIVACY_CONSENT=Y`.
CUDA OOM in sim	Too many envs/cameras/videos	Reduce `--env.num_envs`, resolution, or video frequency.
Converted dataset fails loader	Wrong column names	Compare with public `lerobot` data; fix `observation.state`, `action`, annotation keys.
Video length differs from parquet length	Dropped frames or fps mismatch	Assert lengths in converter; interpolate or discard bad episodes.
Merge fails modality mismatch	Different schemas	Do not merge directly; convert to one schema or train separate stages.
Sim-only policy fails real robot	Sim-to-real gap	Add domain randomization, public/real mix, and verify camera calibration/action scaling.

Done-Correct Criteria

You completed Part 2 correctly if:

Simulation env loads and renders cameras.
You have at least 50 task-specific episodes for a serious first fine-tune.
Converted dataset has meta/info.json, episodes.jsonl, tasks.jsonl, and modality.json.
verify_groot_lerobot_dataset.py passes.
A 100-500 step smoke fine-tune runs without NaN/OOM.
Open-loop eval or PolicyServer loads the checkpoint.
If datasets are mixed, they share a schema or you have an explicit schema mapper.

GR00T Whole-Body VLA Data: Simulation Data

2.1 Generate Simulation Data

Goal

Environment Requirements

Fastest Route: Use Already Converted Sim Data

Load IsaacLab-Arena To Check Your Setup

Collect Episodes In Simulation

Minimum Domain Randomization

2.2 Convert Simulation Data To GR00T-LeRobot

Goal

Example Converter

Improve modality.json For Whole-Body Data

2.3 Training With Simulation Data

Branch A: Sim-Only Training

Branch B: Mix Simulation And Public Data

Physical Dataset Merge

Starting Mix Ratios

Common Errors And Fixes

Done-Correct Criteria

Related Posts

Sources

Nguyễn Anh Tuấn

Related Posts

GR00T whole-body VLA data: dùng open dataset

GR00T whole-body VLA data: có cần data real?

GR00T whole-body VLA: train SONIC controller

GR00T Whole-Body VLA Data: Simulation Data

2.1 Generate Simulation Data

Goal

Environment Requirements

Fastest Route: Use Already Converted Sim Data

Load IsaacLab-Arena To Check Your Setup

Collect Episodes In Simulation

Minimum Domain Randomization

2.2 Convert Simulation Data To GR00T-LeRobot

Goal

Example Converter

Improve modality.json For Whole-Body Data

2.3 Training With Simulation Data

Branch A: Sim-Only Training

Branch B: Mix Simulation And Public Data

Physical Dataset Merge

Starting Mix Ratios

Common Errors And Fixes

Done-Correct Criteria

Related Posts

Sources

Nguyễn Anh Tuấn

Related Posts

GR00T whole-body VLA data: dùng open dataset

GR00T whole-body VLA data: có cần data real?

GR00T whole-body VLA: train SONIC controller

Improve `modality.json` For Whole-Body Data

Improve `modality.json` For Whole-Body Data