What this article gives you
In part 1, we designed the pilot session so the operator, data supervisor, and robot have clear responsibilities. In part 2, we chose the teleoperation stack that produces learnable actions. In part 3, we used MCAP as the raw log layer for replay and audit.
Part 4 answers the next question: once the raw log is healthy, how should the training dataset live in the data lake?
If you only have a few dozen episodes, one LeRobot folder may be enough. Once you start collecting thousands of humanoid episodes, you need to separate two very different jobs:
| Need | Best fit | Why |
|---|---|---|
| Share a dataset, inspect metadata, load from Hugging Face Hub, debug episodes | LeRobotDataset Parquet + MP4 | Standard schema, easy inspection, strong fit with the Hub and LeRobot ecosystem |
| Fine-tune repeatedly, read sequentially at high speed, reduce decode and IO overhead | Robo-DM .vla trajectory | Self-contained container, Trajectory.add() for multimodal data, optimized loading for training |
The key idea is: LeRobot and Robo-DM are not mutually exclusive choices. In a humanoid VLA data center, LeRobot is the exchange and governance layer. Robo-DM is the high-throughput training cache. You can keep both, but you do not need to convert everything on day one.
If you are new to LeRobot, read our LeRobot ecosystem guide. If you are building the full humanoid path from ROS 2 to VLA deployment, the humanoid robot software stack article places this data lake layer in the broader system.
The three layers: raw, share, train
A durable data lake should have three layers, not one giant folder:
humanoid-data-lake/
  raw-mcap/
    2026-06-10/robot_g1_001/episode_000123/
      episode_000123_0.mcap
      metadata.yaml
      operator_notes.md
  lerobot/
    vnrobo/g1-pick-bin-v1/
      meta/info.json
      meta/tasks.parquet
      meta/episodes/chunk-000/file-000.parquet
      data/chunk-000/file-000.parquet
      videos/head_camera/chunk-000/file-000.mp4
      videos/left_wrist/chunk-000/file-000.mp4
  robodm-cache/
    g1-pick-bin-v1/
      episode_000123.vla
      episode_000124.vla
      manifest.parquet
Raw MCAP is the original evidence. LeRobot is the canonical dataset for sharing, review, and versioning. Robo-DM is a training cache that can be rebuilt from LeRobot. If the .vla cache is deleted, you rebuild it from Parquet+MP4. If a LeRobot export has a bug, you can return to MCAP and investigate the original robot streams.
A common beginner mistake is to convert MCAP directly into a custom training format and throw away the raw log. That works for a demo, but it destroys your audit path. When training fails two weeks later, you will not know whether the problem was the recorder, converter, timestamp alignment, camera codec, or batch loader. The three-layer design costs a little more discipline at the start, but it sharply lowers debugging cost later.
What LeRobotDataset v3 stores
Hugging Face's LeRobotDataset v3 documentation describes the format as a standard for multimodal robot data: sensorimotor time series, multi-camera video, and metadata for indexing, search, and visualization on the Hub. The important v3 design shift is that storage is no longer tied to one file per episode. Many episodes can be concatenated into shared Parquet or MP4 shards, and metadata reconstructs the episode-level view.
The files you need to understand are:
| File or folder | Role |
|---|---|
| meta/info.json | Canonical schema: features, shapes, dtypes, FPS, version, data_path, video_path, total frames, episodes, and tasks |
| meta/tasks.parquet | Natural-language tasks and task_index, such as "pick the red cup and place it in the bin" |
| meta/episodes/*.parquet | Per-episode metadata: length, task, indexes into data/video shards, per-episode stats |
| data/chunk-*/*.parquet | Frame-level data: observation.state, action, timestamp, episode_index, frame_index, low-dimensional signals |
| videos/chunk-*/*.mp4 or videos/{camera}/chunk-*/*.mp4 | Video shards per camera; current v3 templates usually include video_key for clear camera separation |
| meta/stats.json | Mean/std/min/max statistics used for normalization during training |
In current LeRobot source, the default task path is meta/tasks.parquet, episode metadata lives under meta/episodes/...parquet, frame data under data/...parquet, and videos under videos/{video_key}/...mp4. If you see older datasets with tasks.jsonl or episodes.jsonl, treat them as legacy layout and migrate before making them the standard for a new data center.
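A quick way to spot legacy layouts before migration is a file-existence check. The sketch below is an illustration under stated assumptions, not part of LeRobot: the function name `detect_layout` is hypothetical, and it only looks for the file names mentioned above (meta/tasks.parquet and meta/episodes/ versus meta/tasks.jsonl or meta/episodes.jsonl) without opening any file.

```python
from pathlib import Path


def detect_layout(root):
    """Classify a dataset folder by the metadata files it contains.

    Hypothetical helper: it checks file names only, never file contents.
    """
    meta = Path(root) / "meta"
    has_v3 = (meta / "tasks.parquet").is_file() and (meta / "episodes").is_dir()
    has_legacy = (meta / "tasks.jsonl").is_file() or (meta / "episodes.jsonl").is_file()
    if has_v3 and has_legacy:
        return "mixed"      # partially migrated folder, inspect by hand
    if has_v3:
        return "v3"
    if has_legacy:
        return "legacy"
    return "unknown"
```

A "mixed" or "unknown" result is a signal to inspect the folder manually rather than to run a migration blindly.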
A minimal humanoid VLA schema can start like this:
features = {
    "observation.state": {
        "dtype": "float32",
        "shape": (64,),
        "names": ["base", "torso", "left_arm", "right_arm", "hands"],
    },
    "action": {
        "dtype": "float32",
        "shape": (32,),
        "names": ["target_joints", "gripper", "base_velocity"],
    },
    # "video" stores camera frames in MP4 shards; "image" would embed them in Parquet instead.
    "observation.images.head": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.left_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.right_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
}
It does not need to be perfect on the first day. But three things must be stable: feature names, action/state shapes, and FPS. If action is a 32-dimensional joint target today and silently becomes 34-dimensional tomorrow because you added tactile gripper channels, training code will fail in a way that is hard to diagnose.
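One way to catch this kind of drift early is to validate every frame against the schema before it reaches the writer. The sketch below is a minimal illustration, not a LeRobot API: `check_frame` is a hypothetical helper, it assumes each frame is a dict of NumPy arrays keyed by feature name, and it compares dtypes only for numeric features.

```python
import numpy as np


def check_frame(frame, features):
    """Return a list of problems found in one frame; an empty list means it matches.

    Hypothetical helper: `frame` maps feature names to arrays and `features`
    is a schema dict with "dtype" and "shape" entries per feature.
    """
    problems = []
    for name, spec in features.items():
        if name not in frame:
            problems.append(f"missing feature: {name}")
            continue
        value = np.asarray(frame[name])
        expected_shape = tuple(spec["shape"])
        if tuple(value.shape) != expected_shape:
            problems.append(f"{name}: shape {value.shape} != {expected_shape}")
        # Camera entries ("image" or "video") are checked for shape only.
        if spec["dtype"] not in ("image", "video") and value.dtype != np.dtype(spec["dtype"]):
            problems.append(f"{name}: dtype {value.dtype} != {spec['dtype']}")
    # "task" is a per-frame label, not a schema feature, so it is allowed.
    extra = set(frame) - set(features) - {"task"}
    problems.extend(f"unexpected key: {key}" for key in sorted(extra))
    return problems
```

Failing the export as soon as this list is non-empty turns a silent 32-to-34 dimension change into an immediate, named error at conversion time.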
Creating a dataset with LeRobotDataset.create()
LeRobotDataset.create() creates a write-mode dataset. In the current LeRobot source, the important arguments include repo_id, fps, features, root, robot_type, use_videos, tolerance_s, batch_encoding_size, streaming_encoding, video_files_size_in_mb, and data_files_size_in_mb. After creation, you call add_frame(), save_episode(), then finalize().
Here is a skeleton converter from synchronized episode frames:
from pathlib import Path

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# `features` is the schema dict defined above.
dataset = LeRobotDataset.create(
    repo_id="vnrobo/g1-pick-bin-v1",
    fps=30,
    features=features,
    root=Path("/data/lerobot/vnrobo/g1-pick-bin-v1"),
    robot_type="unitree_g1",
    use_videos=True,
    batch_encoding_size=4,
    streaming_encoding=False,
    video_files_size_in_mb=200,
    data_files_size_in_mb=100,
)

# `synchronized_episodes` is a placeholder for the output of your MCAP exporter.
for episode in synchronized_episodes:
    for frame in episode.frames:
        dataset.add_frame({
            "observation.state": frame.state.astype("float32"),
            "action": frame.action.astype("float32"),
            "observation.images.head": frame.head_rgb,
            "observation.images.left_wrist": frame.left_wrist_rgb,
            "observation.images.right_wrist": frame.right_wrist_rgb,
            "task": episode.task_text,
        })
    # Close the episode once all of its frames have been added.
    dataset.save_episode()

# Call finalize() once, after the last episode.
dataset.finalize()
For beginners, the task field deserves attention. It is added on each frame so the writer can map natural-language task strings to task_index and update meta/tasks.parquet. You should not hand-edit tasks.parquet. Letting the writer own the index reduces drift between frame rows and episode metadata.
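To see why a single owner for the index matters, here is a toy illustration of the mapping. It is not LeRobot's implementation: the class name `TaskIndex` is hypothetical, and the sketch only shows the idea that each new task string receives the next integer while a repeated string reuses the one it already has.

```python
class TaskIndex:
    """Toy task registry: assigns a stable integer to each distinct task string."""

    def __init__(self):
        self._index_by_task = {}

    def get_or_add(self, task):
        task = task.strip()
        if not task:
            raise ValueError("task text must not be empty")
        # New strings get the next free integer; known strings keep theirs.
        return self._index_by_task.setdefault(task, len(self._index_by_task))

    def rows(self):
        """Rows in index order, similar in spirit to a tasks table."""
        ordered = sorted(self._index_by_task.items(), key=lambda item: item[1])
        return [{"task_index": index, "task": task} for task, index in ordered]
```

Frame rows store only the integer, so reordering or deleting rows in the table by hand would silently change which instruction every affected frame points to.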
After the dataset is written, run a quick smoke test:
find /data/lerobot/vnrobo/g1-pick-bin-v1 -maxdepth 3 -type f | sort | head -50
python - <<'PY'
from lerobot.datasets.lerobot_dataset import LeRobotDataset
ds = LeRobotDataset(
"vnrobo/g1-pick-bin-v1",
root="/data/lerobot/vnrobo/g1-pick-bin-v1",
)
print("episodes:", ds.num_episodes)
print("frames:", len(ds))
print("features:", ds.features.keys())
sample = ds[0]
print(sample["timestamp"], sample["observation.state"].shape, sample["action"].shape)
PY
Loading the first sample does not prove the dataset is good. It only proves the files are readable. You still need the QA layer in part 5: duration checks, black-frame detection, action spike detection, and task-label review.
When to keep Parquet+MP4
Keep LeRobot Parquet+MP4 as the main dataset when any of these conditions apply:
| Situation | Why LeRobot should stay |
|---|---|
| You want to upload to Hugging Face Hub | The Hub understands dataset repos, metadata, cards, streaming, and sharded files better than a private cache |
| Other people need to inspect the data | Parquet can be read with PyArrow/Pandas, and MP4 can be opened by common tools |
| You train with LeRobot policies | LeRobotDataset returns tensor dictionaries compatible with DataLoader and supports delta_timestamps |
| You version datasets as releases | repo_id, dataset cards, stats, and metadata make audits easier |
| The schema is still changing | Parquet+MP4 is easier to inspect and migrate than a custom binary trajectory cache |
LeRobot is also a strong fit for the data review phase. A data supervisor can open metadata, count episodes per task, inspect video shards, and check meta/episodes for episodes that are too short. When the dataset must be shared with another team, LeRobot is the common language.
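A review pass like that can be sketched in a few lines of plain Python. The function name `review_episodes` and the column names `episode_index`, `length`, and `tasks` are assumptions for illustration; check them against your own meta/episodes files. The rows could be loaded with, for example, `pyarrow.parquet.read_table(path).to_pylist()`.

```python
from collections import Counter


def review_episodes(rows, fps, min_seconds):
    """Count episodes per task and list episodes shorter than `min_seconds`.

    `rows` is a list of dicts with `episode_index`, `length` (in frames),
    and `tasks` (a list of task strings). These names are assumptions.
    """
    per_task = Counter()
    too_short = []
    for row in rows:
        for task in row["tasks"]:
            per_task[task] += 1
        # Convert the frame count to seconds before comparing.
        if row["length"] / fps < min_seconds:
            too_short.append(row["episode_index"])
    return dict(per_task), sorted(too_short)
```

The per-task counts show whether collection is balanced across tasks, and the short list gives the supervisor specific episodes to open in a video player.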
The tradeoff is throughput. With multi-camera humanoid data, the loader may need to read Parquet, resolve episode/frame offsets, decode MP4, apply transforms, and collate the batch. This is fine for many projects. But once you fine-tune on hundreds of thousands of episodes for many experiments, IO and video decoding can become the bottleneck.
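Before deciding that the loader is the bottleneck, it helps to measure it. The sketch below is a generic timing helper, not tied to LeRobot or Robo-DM: `measure_throughput` is a hypothetical name, and it accepts any iterable of batches, such as a PyTorch DataLoader.

```python
import time
from itertools import islice


def measure_throughput(batches, max_batches=50, warmup=5):
    """Return batches per second over up to `max_batches`, after `warmup` batches.

    `batches` is any iterable; nothing is assumed about the batch contents.
    """
    iterator = iter(batches)
    # Warm-up batches absorb worker start-up and first-file open costs.
    for _ in islice(iterator, warmup):
        pass
    start = time.perf_counter()
    count = sum(1 for _ in islice(iterator, max_batches))
    elapsed = time.perf_counter() - start
    return count / elapsed if count and elapsed > 0 else 0.0
```

Run it with the same batch size, transforms, and worker count on each loader you want to compare, and compare the result with how many batches per second the training step itself can consume.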
Where Robo-DM .vla fits
Robo-DM provides a Trajectory abstraction for writing multimodal robot data into .vla files. Its README shows the basic pattern:
import robodm
trajectory = robodm.Trajectory(path="/tmp/robot_demo.vla", mode="w")
trajectory.add("camera/rgb", image)
trajectory.add("robot/joint_positions", qpos)
trajectory.add("action/gripper_action", gripper_action)
trajectory.close()
The script examples/lerobot/lerobot_to_robodm_ingestion.py applies this idea to LeRobot. It loads LeRobotDataset, groups samples by episode_index, sorts them by frame_index, creates episode_XXX.vla, and calls Trajectory.add() for images, state, action, reward, and done signals when present. run_pipeline.py then demonstrates a more complete workflow: if robodm_data_dir already exists, ingestion is skipped; the pipeline loads .vla, builds a DataLoader, and trains a policy.
For humanoid VLA, the converter should use real timestamps from LeRobot when available. Do not assume a fixed frame_idx * 100 ms spacing: that corresponds to 10 Hz, while a 30 FPS dataset has frames about 33.3 ms apart. A safer skeleton looks like this:
from pathlib import Path

import numpy as np
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from robodm.trajectory import Trajectory


def tensor_to_hwc_uint8(image_tensor):
    # Convert a CHW tensor to an HWC uint8 array.
    image = image_tensor.permute(1, 2, 0).cpu().numpy()
    if image.max() <= 1.0:
        # Values in [0, 1] are rescaled to [0, 255].
        image = image * 255
    return image.astype(np.uint8)


def lerobot_to_vla(repo_id, root, output_dir, episodes=None):
    ds = LeRobotDataset(repo_id, root=root, episodes=episodes, video_backend="pyav")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Group samples by episode. This keeps every decoded frame in memory,
    # so pass a limited `episodes` list per call on large datasets.
    grouped = {}
    for sample in ds:
        ep = int(sample["episode_index"].item())
        frame = int(sample["frame_index"].item())
        grouped.setdefault(ep, []).append((frame, sample))

    for ep, frames in grouped.items():
        frames.sort(key=lambda item: item[0])
        traj = Trajectory(path=str(output_dir / f"episode_{ep:06d}.vla"), mode="w")
        try:
            for _, sample in frames:
                # Use the recorded timestamp (seconds) instead of a fixed step.
                timestamp_s = float(sample["timestamp"].item())
                timestamp_ms = int(round(timestamp_s * 1000))
                traj.add(
                    "observation/state",
                    sample["observation.state"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )
                traj.add(
                    "action",
                    sample["action"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )
                for key in sample.keys():
                    if key.startswith("observation.images."):
                        camera = key.split(".")[-1]
                        traj.add(
                            f"observation/images/{camera}",
                            tensor_to_hwc_uint8(sample[key]),
                            timestamp=timestamp_ms,
                            time_unit="ms",
                        )
        finally:
            # Always close the file, even if a frame fails to convert.
            traj.close()
In production, do not create .vla files and lose their origin. Write a manifest:
manifest.parquet
  vla_path
  source_repo_id
  source_dataset_version
  source_episode_index
  source_episode_length
  task
  robot_id
  converter_version
  created_at
  qa_status
The manifest answers practical questions: which LeRobot release produced this .vla file, which converter version created it, whether it passed QA, and which training runs may use it.
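A manifest writer along these lines might look like the sketch below. It is an illustration, not part of Robo-DM or LeRobot: `ManifestRow`, `make_row`, and `write_manifest` are hypothetical names, the `"pending"` default for qa_status is an assumed convention, and the Parquet step assumes pyarrow is installed.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ManifestRow:
    """One row per .vla file, with the columns listed above."""
    vla_path: str
    source_repo_id: str
    source_dataset_version: str
    source_episode_index: int
    source_episode_length: int
    task: str
    robot_id: str
    converter_version: str
    created_at: str
    qa_status: str


def make_row(vla_path, repo_id, dataset_version, episode_index, episode_length,
             task, robot_id, converter_version, qa_status="pending"):
    """Build one manifest row stamped with the current UTC time."""
    return ManifestRow(
        vla_path=str(vla_path),
        source_repo_id=repo_id,
        source_dataset_version=dataset_version,
        source_episode_index=int(episode_index),
        source_episode_length=int(episode_length),
        task=task,
        robot_id=robot_id,
        converter_version=converter_version,
        created_at=datetime.now(timezone.utc).isoformat(),
        qa_status=qa_status,
    )


def write_manifest(rows, path):
    """Write manifest rows to a Parquet file (pyarrow is imported lazily)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist([asdict(row) for row in rows])
    pq.write_table(table, path)
```

Calling `make_row()` right after each `traj.close()` and writing the manifest once at the end keeps the provenance record in step with the files that were actually produced.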
When to convert to .vla
You do not need to convert every dataset immediately. Use this table:
| Question | If yes | Action |
|---|---|---|
| Has the dataset passed QA and stabilized its schema? | Yes | Create a .vla cache for training |
| Do other teams need to download or inspect the dataset? | Yes | Keep LeRobot as the released dataset |
| Is training bottlenecked by video decoding or random access? | Yes | Benchmark a Robo-DM cache |
| Is this only a 20-episode pilot? | Yes | Skip .vla for now; keep LeRobot and raw MCAP |
| Do you need to replay or audit a sensor failure? | Yes | Return to MCAP, not .vla |
| Are you running many experiments against one frozen dataset? | Yes | The .vla cache is worth it |
A useful rule:
MCAP = source of truth
LeRobot = source of shareable training data
Robo-DM .vla = reproducible training cache
If you change camera crops, action normalization, or feature names, create a new .vla cache instead of overwriting the old one. A training run must know exactly which cache it used.
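One simple way to enforce this is to derive the cache directory name from the settings that shape the cache. The sketch below is an assumption-level illustration: `cache_key` and `cache_dir_name` are hypothetical helpers, and the config is any JSON-serializable dict you choose to describe crops, normalization, and feature names.

```python
import hashlib
import json


def cache_key(config):
    """Short, deterministic fingerprint of a JSON-serializable cache config."""
    # Sorted keys make the fingerprint independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def cache_dir_name(dataset_name, config):
    """Directory name that changes whenever the config changes."""
    return f"{dataset_name}-{cache_key(config)}"
```

Recording the resulting directory name in the training run's log gives every experiment an exact pointer to the cache it consumed.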
Recommended two-tier pipeline
The data center pipeline for this article is:
1. Record MCAP
2. Export synchronized episode frames
3. Write LeRobotDataset with LeRobotDataset.create()
4. Run metadata and visual QA
5. Publish or version the LeRobot dataset
6. Convert approved episodes to Robo-DM .vla
7. Train from the .vla cache
8. Log the training run back to the dataset version
Example command layout:
# Sharing layer
python tools/mcap_to_lerobot.py \
--raw-root /data/raw-mcap/2026-06-10 \
--repo-id vnrobo/g1-pick-bin-v1 \
--output-root /data/lerobot/vnrobo/g1-pick-bin-v1 \
--fps 30
# QA before training cache
python tools/qa_lerobot_dataset.py \
--repo-id vnrobo/g1-pick-bin-v1 \
--root /data/lerobot/vnrobo/g1-pick-bin-v1 \
--report reports/g1-pick-bin-v1-qa.html
# Training layer
python examples/lerobot/lerobot_to_robodm_ingestion.py \
--dataset vnrobo/g1-pick-bin-v1 \
--output_dir /data/robodm-cache/g1-pick-bin-v1 \
--video_backend pyav
python examples/lerobot/run_pipeline.py \
--robodm_data_dir /data/robodm-cache/g1-pick-bin-v1 \
--training_steps 100000 \
--batch_size 128
For beginners, the hardest part is not the converter code. The hard part is deciding what is canonical. Keep the answer simple: the canonical public dataset is LeRobot; the canonical raw evidence is MCAP; .vla is an artifact you can rebuild.
Checklist before training
Before a large training run consumes .vla, check:
| Item | How to check |
|---|---|
| Number of .vla files equals approved episodes | Compare manifest against meta/episodes |
| Timestamps increase monotonically | Assert timestamp[i] < timestamp[i+1] per trajectory |
| Cameras have enough frames | Each camera frame count should be close to episode length |
| State/action dtypes are stable | Use float32 and fixed shapes |
| Task labels are non-empty | Join source_episode_index with tasks.parquet |
| Cache is rebuildable | Delete one .vla, rebuild from LeRobot, compare metadata checksum |
| Training loader does not drop keys | Print batch keys before training |
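Several of these checks reduce to small pure functions that can run in CI before a training job starts. The sketch below is a minimal illustration with hypothetical function names; it assumes timestamps, episode indexes, and per-camera frame counts have already been read from the .vla files and the manifest, and the tolerance of 2 frames is an arbitrary example value.

```python
def check_timestamps(timestamps):
    """Return every index i where timestamp[i] >= timestamp[i + 1]."""
    return [
        i for i in range(len(timestamps) - 1)
        if timestamps[i] >= timestamps[i + 1]
    ]


def check_episode_coverage(manifest_episodes, approved_episodes):
    """Compare episode indexes in the manifest with the approved set."""
    manifest, approved = set(manifest_episodes), set(approved_episodes)
    return {
        "missing_from_cache": sorted(approved - manifest),
        "unexpected_in_cache": sorted(manifest - approved),
    }


def check_camera_frames(frame_counts, episode_length, tolerance=2):
    """Return cameras whose frame count is more than `tolerance` away from the episode length."""
    return sorted(
        camera for camera, count in frame_counts.items()
        if abs(count - episode_length) > tolerance
    )
```

Each function returns the offending items rather than a boolean, so a failed check already names the trajectory index, episode, or camera to investigate.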
When a check fails, fix the lowest layer that owns the bug. Timestamp sync issues belong in the MCAP-to-LeRobot exporter. Video decode issues belong in the MP4 shard layer. Missing .vla keys belong in the LeRobot-to-Robo-DM converter.
Technical sources
- Hugging Face LeRobotDataset v3.0 docs
- LeRobot v3 porting guide on GitHub
- LeRobotDataset source code
- Robo-DM GitHub repository
- Robo-DM paper on arXiv