GR00T Whole-Body VLA Data: Simulation Data
Disclosure: This article may contain affiliate or referral links. If you buy or sign up through those links, VnRobo may earn a commission or service credit. The technical recommendations prioritize engineering fit first.
Part 1 used open datasets to walk through download -> verify -> fine-tune -> inference. Part 2 switches to simulation data: generate trajectories in Isaac Lab / IsaacLab-Arena, convert them into GR00T-LeRobot, then train in two modes:
- Sim-only: train only on simulation data.
- Sim + public mix: mix simulation data with a public dataset to reduce overfitting to the simulator.
Important caveat: as of June 6, 2026, there is no single universal command that converts every IsaacLab-Arena task into GR00T-LeRobot for every embodiment. Some datasets are already converted, such as nvidia/Arena-G1-Loco-Manipulation-Task. If you build a new task/env, you usually need a custom exporter from HDF5/trajectory buffers to LeRobot v2 plus meta/modality.json.
2.1 Generate Simulation Data
Goal
Generate episodes with:
- RGB camera frames:
observation.images.ego_view, optionally wrist cameras. - State: joint positions, base/torso state, hand state, object/goal pose if needed.
- Action: raw joint targets, base velocities, hand pose, or SONIC latent/action depending on your controller.
- Language annotation: task prompt per episode/frame.
- Episode metadata: success/failure, length, task id.
For a G1 / GEAR-SONIC style whole-body VLA workflow, the target loop is:
sim rollout
-> camera + robot state + task prompt + action
-> GR00T-LeRobot v2
-> train GR00T
-> PolicyServer
-> SONIC decoder / simulation control loop
Environment Requirements
Simulation machine:
- Ubuntu 22.04/24.04.
- NVIDIA GPU such as RTX 4090, RTX 6000 Ada, A6000, A100, or H100.
- VRAM:
- 16-24 GB: small debug env, few cameras, low parallelism.
- 24-48 GB: moderate data collection with cameras.
- 80 GB or multi-GPU: large parallel rollout/rendering.
- Isaac Sim / Isaac Lab version compatible with the IsaacLab-Arena release you use.
ACCEPT_EULA=YandPRIVACY_CONSENT=Ywhen required.
Training machine:
- Debug fine-tune: 1 GPU with 48-80 GB VRAM is much better than 24 GB.
- Serious whole-body fine-tune: prepare 4+ 80 GB GPUs.
- Loader-only verification can run on a smaller machine, but does not tell you about training quality.
Fastest Route: Use Already Converted Sim Data
Before writing your own exporter, download a converted simulation dataset:
cd Isaac-GR00T
mkdir -p datasets
uv run hf download nvidia/Arena-G1-Loco-Manipulation-Task \
--repo-type dataset \
--include "lerobot/**" \
--local-dir datasets/arena_g1_loco
export SIM_DATASET="$PWD/datasets/arena_g1_loco/lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"
The dataset card describes a G1 loco-manipulation task in IsaacLab-Arena with HDF5 source data and a lerobot folder converted to GR00T-Lerobot. This is the best place to start because you can debug the GR00T side before debugging a custom exporter.
Load IsaacLab-Arena To Check Your Setup
The LeRobot/IsaacLab-Arena docs show a random rollout script pattern. Create:
cat > test_env_load_arena.py <<'PY'
import logging
from dataclasses import asdict
from pprint import pformat
import torch
import tqdm
from lerobot.configs import parser
from lerobot.configs.eval import EvalPipelineConfig
@parser.wrap()
def main(cfg: EvalPipelineConfig):
logging.info(pformat(asdict(cfg)))
from lerobot.envs import make_env
env_dict = make_env(cfg.env, n_envs=cfg.env.num_envs, trust_remote_code=True)
env = next(iter(env_dict.values()))[0]
obs, info = env.reset()
for _ in tqdm.tqdm(range(cfg.env.episode_length)):
with torch.inference_mode():
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
if terminated.any() or truncated.any():
obs, info = env.reset()
env.close()
if __name__ == "__main__":
main()
PY
Run:
export ACCEPT_EULA=Y
export PRIVACY_CONSENT=Y
python test_env_load_arena.py \
--env.hub_path=nvidia/isaaclab-arena-envs \
--env.type=isaaclab_arena \
--env.environment=g1_locomanip_pnp \
--env.embodiment=gr1_pink \
--env.object=cracker_box \
--env.num_envs=4 \
--env.enable_cameras=true \
--env.headless=false \
--env.seed=1000 \
--env.video=true \
--env.video_length=10 \
--env.video_interval=15
Headless:
python test_env_load_arena.py \
--env.hub_path=nvidia/isaaclab-arena-envs \
--env.type=isaaclab_arena \
--env.environment=g1_locomanip_pnp \
--env.embodiment=gr1_pink \
--env.object=cracker_box \
--env.num_envs=8 \
--env.enable_cameras=true \
--env.headless=true \
--env.video=true \
--env.video_length=10 \
--env.video_interval=15
Done-correct criteria:
- Env resets successfully.
- Camera rendering works.
- Eval videos appear under
outputs/eval/.../videos/.... - No EULA/privacy error.
- If CUDA OOM occurs, reduce
--env.num_envs, video frequency, or camera resolution.
Collect Episodes In Simulation
Common data generation modes:
| Method | Use case | Notes |
|---|---|---|
| Teleoperation in sim | Human demonstrations | Better quality, slower collection. |
| Mimic / trajectory generation | Scaling from seed demos | Requires task annotations and source demos. |
| Scripted policy/controller | Simple baseline tasks | Easy to bias, weak generalization if randomization is poor. |
Expected HDF5-style output:
sim_outputs/g1_pick_place_hdf5/
├── demo_000000.hdf5
├── demo_000001.hdf5
├── demo_000002.hdf5
└── manifest.json
Each HDF5 file should contain keys like:
/observations/state
/observations/images/ego_view
/observations/images/left_wrist
/observations/images/right_wrist
/actions
/timestamps
/language/task_description
/episode/success
If your generator uses different key names, that is fine. The converter must map them into LeRobot columns and modality.json.
Minimum Domain Randomization
Do not generate 1000 identical episodes. At minimum, randomize:
- Object pose and goal pose.
- Lighting, texture, table/floor material.
- Camera extrinsic jitter: yaw, pitch, height.
- Robot initial base pose.
- Action/observation delay and noise.
- Background distractors if policy uses RGB.
Example config concept:
randomization:
object_xy: [-0.25, 0.25]
object_yaw_deg: [-180, 180]
goal_xy: [-0.20, 0.20]
camera_yaw_deg: [-3, 3]
camera_pitch_deg: [-2, 2]
light_intensity: [300, 900]
action_noise_std: 0.01
joint_state_noise_std: 0.002
For locomotion/contact tasks, start with a narrow randomization range. Increase it only after the generator success rate is stable.
2.2 Convert Simulation Data To GR00T-LeRobot
Goal
Convert:
sim_outputs/g1_pick_place_hdf5/
into:
datasets/g1_pick_place_lerobot/
├── meta/
│ ├── info.json
│ ├── episodes.jsonl
│ ├── tasks.jsonl
│ └── modality.json
├── data/
│ └── chunk-000/
│ ├── episode_000000.parquet
│ └── episode_000001.parquet
└── videos/
└── chunk-000/
└── observation.images.ego_view/
├── episode_000000.mp4
└── episode_000001.mp4
Example Converter
This is a template you must adapt to your actual HDF5 keys. It is not an official NVIDIA script; it is a practical, testable converter shape.
mkdir -p tools
cat > tools/convert_sim_hdf5_to_groot_lerobot.py <<'PY'
import argparse
import json
from pathlib import Path
import h5py
import imageio.v3 as iio
import numpy as np
import pandas as pd
def write_jsonl(path, rows):
with path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False) + "\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--input-dir", required=True)
parser.add_argument("--output-dir", required=True)
parser.add_argument("--fps", type=int, default=20)
parser.add_argument("--task", default="pick up the object and place it at the goal")
args = parser.parse_args()
in_dir = Path(args.input_dir)
out = Path(args.output_dir)
meta = out / "meta"
data_dir = out / "data" / "chunk-000"
video_dir = out / "videos" / "chunk-000" / "observation.images.ego_view"
meta.mkdir(parents=True, exist_ok=True)
data_dir.mkdir(parents=True, exist_ok=True)
video_dir.mkdir(parents=True, exist_ok=True)
h5_files = sorted(in_dir.glob("*.hdf5"))
if not h5_files:
raise SystemExit(f"no .hdf5 files in {in_dir}")
episodes = []
tasks = [{"task_index": 0, "task": args.task}]
global_frame_index = 0
for ep_idx, h5_path in enumerate(h5_files):
with h5py.File(h5_path, "r") as h5:
state = np.asarray(h5["observations/state"], dtype=np.float32)
action = np.asarray(h5["actions"], dtype=np.float32)
images = np.asarray(h5["observations/images/ego_view"], dtype=np.uint8)
if len(state) != len(action) or len(state) != len(images):
raise SystemExit(f"length mismatch in {h5_path}")
rows = []
for t in range(len(state)):
rows.append({
"index": global_frame_index,
"episode_index": ep_idx,
"frame_index": t,
"timestamp": t / args.fps,
"task_index": 0,
"observation.state": state[t].astype(np.float32).tolist(),
"action": action[t].astype(np.float32).tolist(),
"annotation.human.action.task_description": 0,
})
global_frame_index += 1
pd.DataFrame(rows).to_parquet(data_dir / f"episode_{ep_idx:06d}.parquet", index=False)
iio.imwrite(video_dir / f"episode_{ep_idx:06d}.mp4", images, fps=args.fps)
episodes.append({"episode_index": ep_idx, "tasks": [0], "length": len(state)})
state_dim = int(np.asarray(rows[0]["observation.state"]).shape[0])
action_dim = int(np.asarray(rows[0]["action"]).shape[0])
(meta / "info.json").write_text(json.dumps({
"codebase_version": "v2.0",
"fps": args.fps,
"total_episodes": len(episodes),
"total_frames": sum(ep["length"] for ep in episodes),
"features": {
"index": {"dtype": "int64", "shape": [1]},
"episode_index": {"dtype": "int64", "shape": [1]},
"frame_index": {"dtype": "int64", "shape": [1]},
"timestamp": {"dtype": "float32", "shape": [1]},
"task_index": {"dtype": "int64", "shape": [1]},
"observation.state": {"dtype": "float32", "shape": [state_dim]},
"action": {"dtype": "float32", "shape": [action_dim]},
"observation.images.ego_view": {"dtype": "video", "shape": [480, 640, 3]},
},
}, indent=2), encoding="utf-8")
modality = {
"state": {"whole_body": {"start": 0, "end": state_dim}},
"action": {"whole_body": {"start": 0, "end": action_dim}},
"video": {"ego_view": {"original_key": "observation.images.ego_view"}},
"annotation": {
"human.action.task_description": {
"original_key": "annotation.human.action.task_description"
}
}
}
(meta / "modality.json").write_text(json.dumps(modality, indent=2), encoding="utf-8")
write_jsonl(meta / "episodes.jsonl", episodes)
write_jsonl(meta / "tasks.jsonl", tasks)
print(f"wrote {out}")
print(f"episodes={len(episodes)}")
print(f"state_dim={state_dim} action_dim={action_dim}")
if __name__ == "__main__":
main()
PY
Run:
uv pip install h5py pandas pyarrow imageio imageio-ffmpeg
uv run python tools/convert_sim_hdf5_to_groot_lerobot.py \
--input-dir sim_outputs/g1_pick_place_hdf5 \
--output-dir datasets/g1_pick_place_lerobot \
--fps 20 \
--task "pick up the object and place it at the goal"
Verify:
export SIM_DATASET="$PWD/datasets/g1_pick_place_lerobot"
uv run python tools/verify_groot_lerobot_dataset.py "$SIM_DATASET"
python -m json.tool "$SIM_DATASET/meta/modality.json"
This converter is a minimal template. Before a large training run, compare meta/info.json, parquet columns, and video layout against a public dataset that already loads in the same Isaac-GR00T/LeRobot branch. If your loader branch uses the LeRobot v2.1 data/train-00000.parquet style, keep that format instead of forcing chunk/episode files.
Improve modality.json For Whole-Body Data
The template uses a single "whole_body" slice. For better training and debugging, split state/action by body parts:
{
"state": {
"left_leg": { "start": 0, "end": 6 },
"right_leg": { "start": 6, "end": 12 },
"waist": { "start": 12, "end": 15 },
"left_arm": { "start": 15, "end": 22 },
"left_hand": { "start": 22, "end": 29 },
"right_arm": { "start": 29, "end": 36 },
"right_hand": { "start": 36, "end": 43 }
},
"action": {
"base": { "start": 0, "end": 3 },
"left_arm": { "start": 3, "end": 10 },
"left_hand": { "start": 10, "end": 17 },
"right_arm": { "start": 17, "end": 24 },
"right_hand": { "start": 24, "end": 31 }
}
}
If you use SONIC, action slicing must match the 64-dimensional latent token plus hands. If you only have raw joint/action targets, do not label it as SONIC.
2.3 Training With Simulation Data
Branch A: Sim-Only Training
Goal: verify that the model can learn inside the simulator distribution.
cd Isaac-GR00T
export NUM_GPUS=1
export SIM_DATASET=/abs/path/to/datasets/g1_pick_place_lerobot
export OUT=/mnt/checkpoints/groot_g1_sim_only_debug
CUDA_VISIBLE_DEVICES=0 uv run python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.7-3B \
--dataset-path "$SIM_DATASET" \
--embodiment-tag NEW_EMBODIMENT \
--modality-config-path /abs/path/to/configs/g1_sim_config.py \
--num-gpus $NUM_GPUS \
--output-dir "$OUT" \
--save-total-limit 3 \
--save-steps 1000 \
--max-steps 5000 \
--global-batch-size 8 \
--dataloader-num-workers 2
For UNITREE_G1_SONIC:
export NUM_GPUS=4
export SIM_DATASET=/abs/path/to/sonic_lerobot_dataset
export OUT=/mnt/checkpoints/groot_g1_sonic_sim_only
uv run torchrun --nproc_per_node=$NUM_GPUS --master_port=29500 \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.7-3B \
--dataset-path "$SIM_DATASET" \
--embodiment-tag UNITREE_G1_SONIC \
--modality-config-path gr00t/configs/data/embodiment_configs.py \
--num-gpus $NUM_GPUS \
--output-dir "$OUT" \
--save-total-limit 5 \
--save-steps 5000 \
--max-steps 20000 \
--use-wandb \
--global-batch-size 32 \
--dataloader-num-workers 4
Done-correct criteria:
- Loader iterates through many episodes.
- Loss is not NaN.
- The checkpoint loads in open-loop eval or PolicyServer.
- In sim, the policy does not collapse to a single average pose.
Branch B: Mix Simulation And Public Data
Goal: use public data to reduce overfitting to simulator textures, lighting, geometry, or task layout.
Two options:
- If your launcher supports multiple dataset paths, use the documented flag. Verify with
--help. - More reliable: physically merge datasets into a new LeRobot dataset and reindex episodes.
Check the launcher:
uv run python gr00t/experiment/launch_finetune.py --help | grep -i dataset
If multi-dataset support is unclear, merge physically.
Physical Dataset Merge
Create a manifest:
cat > datasets_mix.txt <<'TXT'
/abs/path/to/datasets/g1_pick_place_lerobot
/abs/path/to/datasets/arena_g1_loco/lerobot
TXT
Merge script:
cat > tools/merge_groot_lerobot_datasets.py <<'PY'
import argparse
import json
import shutil
from pathlib import Path
import pandas as pd
def read_jsonl(path):
rows = []
with path.open("r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
rows.append(json.loads(line))
return rows
def write_jsonl(path, rows):
with path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False) + "\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dataset-list", required=True)
parser.add_argument("--output-dir", required=True)
args = parser.parse_args()
datasets = [
Path(line.strip())
for line in Path(args.dataset_list).read_text().splitlines()
if line.strip() and not line.strip().startswith("#")
]
out = Path(args.output_dir)
(out / "meta").mkdir(parents=True, exist_ok=True)
(out / "data" / "chunk-000").mkdir(parents=True, exist_ok=True)
(out / "videos" / "chunk-000").mkdir(parents=True, exist_ok=True)
first_modality = json.loads((datasets[0] / "meta" / "modality.json").read_text())
episodes_out = []
tasks_out = []
task_offset = 0
ep_out = 0
global_frame_index = 0
for ds in datasets:
modality = json.loads((ds / "meta" / "modality.json").read_text())
if modality != first_modality:
raise SystemExit(f"modality mismatch: {ds}. Merge only datasets with same schema.")
tasks = read_jsonl(ds / "meta" / "tasks.jsonl")
task_map = {}
for task in tasks:
old = task.get("task_index", len(task_map))
new = task_offset + len(task_map)
task_map[old] = new
task["task_index"] = new
tasks_out.append(task)
task_offset += len(task_map)
episodes = read_jsonl(ds / "meta" / "episodes.jsonl")
parquet_files = sorted((ds / "data").rglob("*.parquet"))
video_files = sorted((ds / "videos").rglob("*.mp4"))
if len(parquet_files) < len(episodes):
raise SystemExit(f"not enough parquet files in {ds}")
for i, ep in enumerate(episodes):
ep_new = dict(ep)
ep_new["episode_index"] = ep_out
ep_new["tasks"] = [task_map.get(t, t) for t in ep.get("tasks", [])]
episodes_out.append(ep_new)
df = pd.read_parquet(parquet_files[i])
df["episode_index"] = ep_out
if "frame_index" in df.columns:
df["frame_index"] = range(len(df))
if "task_index" in df.columns:
task_id = ep_new["tasks"][0] if ep_new["tasks"] else 0
df["task_index"] = task_id
if "index" in df.columns:
df["index"] = range(global_frame_index, global_frame_index + len(df))
global_frame_index += len(df)
df.to_parquet(out / "data" / "chunk-000" / f"episode_{ep_out:06d}.parquet", index=False)
if i < len(video_files):
target_dir = out / "videos" / "chunk-000" / video_files[i].parent.name
target_dir.mkdir(parents=True, exist_ok=True)
shutil.copy2(video_files[i], target_dir / f"episode_{ep_out:06d}.mp4")
ep_out += 1
shutil.copy2(datasets[0] / "meta" / "modality.json", out / "meta" / "modality.json")
info = json.loads((datasets[0] / "meta" / "info.json").read_text())
info["total_episodes"] = len(episodes_out)
if all("length" in ep for ep in episodes_out):
info["total_frames"] = sum(ep["length"] for ep in episodes_out)
info["total_tasks"] = len(tasks_out)
(out / "meta" / "info.json").write_text(json.dumps(info, indent=2), encoding="utf-8")
write_jsonl(out / "meta" / "episodes.jsonl", episodes_out)
write_jsonl(out / "meta" / "tasks.jsonl", tasks_out)
print(f"merged {len(datasets)} datasets -> {out}")
print(f"episodes={len(episodes_out)} tasks={len(tasks_out)}")
if __name__ == "__main__":
main()
PY
Run:
uv run python tools/merge_groot_lerobot_datasets.py \
--dataset-list datasets_mix.txt \
--output-dir datasets/g1_mix_sim_public
uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public
The script intentionally fails if modality.json differs. That is correct. Do not directly mix raw G1 actions with SONIC latent actions unless you have converted them to the same schema.
Check the merged output:
uv pip install pandas pyarrow
uv run python -c "import pandas as pd; df=pd.read_parquet('datasets/g1_mix_sim_public/data/chunk-000/episode_000000.parquet'); print(df[['episode_index','frame_index','task_index']].head()); print(df.shape)"
uv run python tools/verify_groot_lerobot_dataset.py datasets/g1_mix_sim_public
Starting Mix Ratios
There is no universal ratio. Start with:
| Goal | Suggested ratio |
|---|---|
| New sim task, public only for regularization | 80% sim / 20% public |
| Better lighting/camera/object generalization | 60% sim / 40% public |
| Very synthetic sim, high-quality real/public data | 30-50% sim / 50-70% public |
If physical merge has no sampling weights, duplicate paths in the manifest:
cat > datasets_mix_weighted.txt <<'TXT'
/abs/path/to/sim_dataset
/abs/path/to/sim_dataset
/abs/path/to/public_dataset
TXT
This is crude but predictable. For large training, implement sampler weights if your dataloader supports them.
Common Errors And Fixes
| Error | Cause | Fix |
|---|---|---|
| IsaacLab-Arena renders no camera | Camera disabled | Add --env.enable_cameras=true --env.headless=true. |
EULA not accepted |
Missing env vars | export ACCEPT_EULA=Y PRIVACY_CONSENT=Y. |
| CUDA OOM in sim | Too many envs/cameras/videos | Reduce --env.num_envs, resolution, or video frequency. |
| Converted dataset fails loader | Wrong column names | Compare with public lerobot data; fix observation.state, action, annotation keys. |
| Video length differs from parquet length | Dropped frames or fps mismatch | Assert lengths in converter; interpolate or discard bad episodes. |
| Merge fails modality mismatch | Different schemas | Do not merge directly; convert to one schema or train separate stages. |
| Sim-only policy fails real robot | Sim-to-real gap | Add domain randomization, public/real mix, and verify camera calibration/action scaling. |
Done-Correct Criteria
You completed Part 2 correctly if:
- Simulation env loads and renders cameras.
- You have at least 50 task-specific episodes for a serious first fine-tune.
- Converted dataset has
meta/info.json,episodes.jsonl,tasks.jsonl, andmodality.json. verify_groot_lerobot_dataset.pypasses.- A 100-500 step smoke fine-tune runs without NaN/OOM.
- Open-loop eval or PolicyServer loads the checkpoint.
- If datasets are mixed, they share a schema or you have an explicit schema mapper.
Related Posts
- GR00T Whole-Body VLA Data: Open Datasets
- GR00T Whole-Body VLA Data: Do You Need Real Data?
- GR00T Whole-Body VLA: Training SONIC Controller
- WholeBodyVLA open-source guide