
LeRobotDataset and Robo-DM Data Lake

Design a two-tier data lake: LeRobot Parquet+MP4 for sharing and Robo-DM .vla trajectories for high-throughput VLA training.

Nguyễn Anh Tuấn | June 10, 2026 | 11 min read

What this article gives you

In part 1, we designed the pilot session so the operator, data supervisor, and robot have clear responsibilities. In part 2, we chose the teleoperation stack that produces learnable actions. In part 3, we used MCAP as the raw log layer for replay and audit.

Part 4 answers the next question: once the raw log is healthy, how should the training dataset live in the data lake?

If you only have a few dozen episodes, one LeRobot folder may be enough. Once you start collecting thousands of humanoid episodes, you need to separate two very different jobs:

| Need | Best fit | Why |
|---|---|---|
| Share a dataset, inspect metadata, load from Hugging Face Hub, debug episodes | LeRobotDataset Parquet + MP4 | Standard schema, easy inspection, strong fit with the Hub and LeRobot ecosystem |
| Fine-tune repeatedly, read sequentially at high speed, reduce decode and IO overhead | Robo-DM .vla trajectory | Self-contained container, Trajectory.add() for multimodal data, optimized loading for training |

The key idea is: LeRobot and Robo-DM are not mutually exclusive choices. In a humanoid VLA data center, LeRobot is the exchange and governance layer. Robo-DM is the high-throughput training cache. You can keep both, but you do not need to convert everything on day one.

If you are new to LeRobot, read our LeRobot ecosystem guide. If you are building the full humanoid path from ROS 2 to VLA deployment, the humanoid robot software stack article places this data lake layer in the broader system.

The three layers: raw, share, train

A durable data lake should have three layers, not one giant folder:

humanoid-data-lake/
  raw-mcap/
    2026-06-10/robot_g1_001/episode_000123/
      episode_000123_0.mcap
      metadata.yaml
      operator_notes.md

  lerobot/
    vnrobo/g1-pick-bin-v1/
      meta/info.json
      meta/tasks.parquet
      meta/episodes/chunk-000/file-000.parquet
      data/chunk-000/file-000.parquet
      videos/head_camera/chunk-000/file-000.mp4
      videos/left_wrist/chunk-000/file-000.mp4

  robodm-cache/
    g1-pick-bin-v1/
      episode_000123.vla
      episode_000124.vla
      manifest.parquet

Raw MCAP is the original evidence. LeRobot is the canonical dataset for sharing, review, and versioning. Robo-DM is a training cache that can be rebuilt from LeRobot. If the .vla cache is deleted, you rebuild it from Parquet+MP4. If a LeRobot export has a bug, you can return to MCAP and investigate the original robot streams.

A common beginner mistake is to convert MCAP directly into a custom training format and throw away the raw log. That works for a demo, but it destroys your audit path. When training fails two weeks later, you will not know whether the problem was the recorder, converter, timestamp alignment, camera codec, or batch loader. The three-layer design costs a little more discipline at the start, but it sharply lowers debugging cost later.
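One way to keep the three layers from blurring together is to route every path through a few small helpers. The sketch below mirrors the folder tree above; the `LAKE_ROOT` mount point and the function names are assumptions for illustration, not part of LeRobot or Robo-DM.

```python
from datetime import date
from pathlib import Path

# Hypothetical mount point for the data lake; adjust to your storage.
LAKE_ROOT = Path("/data/humanoid-data-lake")


def raw_episode_dir(day: date, robot_id: str, episode: int) -> Path:
    """Raw layer: one folder per recorded episode, grouped by day and robot."""
    return LAKE_ROOT / "raw-mcap" / day.isoformat() / robot_id / f"episode_{episode:06d}"


def lerobot_root(repo_id: str) -> Path:
    """Share layer: one LeRobot dataset per repo_id ("org/name")."""
    return LAKE_ROOT / "lerobot" / repo_id


def vla_path(dataset_name: str, episode: int) -> Path:
    """Train layer: one rebuildable .vla file per approved episode."""
    return LAKE_ROOT / "robodm-cache" / dataset_name / f"episode_{episode:06d}.vla"


print(raw_episode_dir(date(2026, 6, 10), "robot_g1_001", 123))
print(lerobot_root("vnrobo/g1-pick-bin-v1"))
print(vla_path("g1-pick-bin-v1", 123))
```

Because every script asks these helpers for a location, a converter has no convenient way to write a training artifact into the raw layer by accident.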

What LeRobotDataset v3 stores

Hugging Face's LeRobotDataset v3 documentation describes the format as a standard for multimodal robot data: sensorimotor time series, multi-camera video, and metadata for indexing, search, and visualization on the Hub. The important v3 design shift is that storage is no longer tied to one file per episode. Many episodes can be concatenated into shared Parquet or MP4 shards, and metadata reconstructs the episode-level view.
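To make the "shared shard plus metadata" idea concrete, here is a toy sketch in plain Python. It is not LeRobot code: the row layout and helper names are assumptions, and the real metadata files carry their own column names. It only shows how start/stop offsets can rebuild an episode view from frames stored back to back.

```python
def build_shard(episode_lengths):
    """Concatenate several episodes into one flat list of frame rows."""
    rows = []
    for episode_index, length in enumerate(episode_lengths):
        for frame_index in range(length):
            rows.append({"episode_index": episode_index, "frame_index": frame_index})
    return rows


def build_episode_offsets(rows):
    """Return {episode_index: (start, stop)} offsets into the shared table.

    Assumes the frames of each episode are stored contiguously.
    """
    offsets = {}
    for i, row in enumerate(rows):
        ep = row["episode_index"]
        start, _ = offsets.get(ep, (i, i))
        offsets[ep] = (start, i + 1)
    return offsets


def episode_frames(rows, offsets, episode_index):
    """Slice one episode out of the shared shard without scanning every row."""
    start, stop = offsets[episode_index]
    return rows[start:stop]


shard = build_shard([3, 5, 2])           # three episodes in one shard
offsets = build_episode_offsets(shard)   # the episode-level metadata
print(offsets)
print(len(episode_frames(shard, offsets, 1)))
```

The offsets play the role of per-episode metadata: once they exist, a reader can jump straight to an episode even though no file is dedicated to it.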

The files you need to understand are:

| File or folder | Role |
|---|---|
| meta/info.json | Canonical schema: features, shapes, dtypes, FPS, version, data_path, video_path, total frames, episodes, and tasks |
| meta/tasks.parquet | Natural-language tasks and task_index, such as "pick the red cup and place it in the bin" |
| meta/episodes/*.parquet | Per-episode metadata: length, task, indexes into data/video shards, per-episode stats |
| data/chunk-*/*.parquet | Frame-level data: observation.state, action, timestamp, episode_index, frame_index, low-dimensional signals |
| videos/chunk-*/*.mp4 or videos/{camera}/chunk-*/*.mp4 | Video shards per camera; current v3 templates usually include video_key for clear camera separation |
| meta/stats.json | Mean/std/min/max statistics used for normalization during training |

In current LeRobot source, the default task path is meta/tasks.parquet, episode metadata lives under meta/episodes/...parquet, frame data under data/...parquet, and videos under videos/{video_key}/...mp4. If you see older datasets with tasks.jsonl or episodes.jsonl, treat them as legacy layout and migrate before making them the standard for a new data center.
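A quick way to triage an unfamiliar dataset folder is to check which of these files exist before trying to load it. The sketch below uses only the standard library. The file lists come from the layout described above, while the `describe_layout` helper and the `codebase_version` key it reads from info.json are assumptions you should verify against your LeRobot version.

```python
import json
from pathlib import Path

# Files that indicate the older JSONL-based layout.
LEGACY_FILES = ("meta/tasks.jsonl", "meta/episodes.jsonl")
# Files a v3-style dataset is expected to have.
V3_FILES = ("meta/info.json", "meta/tasks.parquet")


def describe_layout(root):
    """Report legacy files, missing v3 files, and a few info.json fields."""
    root = Path(root)
    info_path = root / "meta/info.json"
    info = json.loads(info_path.read_text()) if info_path.exists() else {}
    return {
        "legacy_files": [p for p in LEGACY_FILES if (root / p).exists()],
        "missing_v3_files": [p for p in V3_FILES if not (root / p).exists()],
        "codebase_version": info.get("codebase_version"),  # assumed key name
        "fps": info.get("fps"),
    }


print(describe_layout("/data/lerobot/vnrobo/g1-pick-bin-v1"))
```

If `legacy_files` is non-empty, plan a migration before the dataset becomes a dependency for other teams.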

A minimal humanoid VLA schema can start like this:

features = {
    "observation.state": {
        "dtype": "float32",
        "shape": (64,),
        "names": ["base", "torso", "left_arm", "right_arm", "hands"],
    },
    "action": {
        "dtype": "float32",
        "shape": (32,),
        "names": ["target_joints", "gripper", "base_velocity"],
    },
    # Camera streams use dtype "video" so frames are encoded into the MP4 shards
    # under videos/; dtype "image" would store individual frames instead.
    "observation.images.head": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.left_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.right_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
}

It does not need to be perfect on the first day. But three things must be stable: feature names, action/state shapes, and FPS. If action is a 32-dimensional joint target today and silently becomes 34-dimensional tomorrow because you added tactile gripper channels, training code will fail in a way that is hard to diagnose.
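A cheap guard against silent drift is to freeze the parts of the schema that must stay stable and compare every new export against them. This is a minimal sketch, assuming the features dictionary has the same structure as the example above; `schema_fingerprint` and `schema_drift` are hypothetical helper names, not LeRobot APIs.

```python
def schema_fingerprint(features, fps):
    """Reduce a features dict to the parts that must never change silently."""
    return {
        "fps": fps,
        "features": {
            name: (spec["dtype"], tuple(spec["shape"]))
            for name, spec in sorted(features.items())
        },
    }


def schema_drift(frozen, candidate):
    """Return human-readable differences; an empty list means no drift."""
    problems = []
    if frozen["fps"] != candidate["fps"]:
        problems.append(f"fps: {frozen['fps']} -> {candidate['fps']}")
    old, new = frozen["features"], candidate["features"]
    for name in sorted(set(old) | set(new)):
        if name not in new:
            problems.append(f"missing feature: {name}")
        elif name not in old:
            problems.append(f"unexpected feature: {name}")
        elif old[name] != new[name]:
            problems.append(f"{name}: {old[name]} -> {new[name]}")
    return problems


frozen = schema_fingerprint(
    {"action": {"dtype": "float32", "shape": (32,)}}, fps=30
)
# Tomorrow's export quietly adds two tactile gripper channels.
candidate = schema_fingerprint(
    {"action": {"dtype": "float32", "shape": (34,)}}, fps=30
)
for problem in schema_drift(frozen, candidate):
    print(problem)
```

Running a check like this in the exporter turns a confusing training failure into an explicit error at write time.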

Creating a dataset with LeRobotDataset.create()

LeRobotDataset.create() creates a write-mode dataset. In the current LeRobot source, the important arguments include repo_id, fps, features, root, robot_type, use_videos, tolerance_s, batch_encoding_size, streaming_encoding, video_files_size_in_mb, and data_files_size_in_mb. After creation, you call add_frame(), save_episode(), then finalize().

Here is a skeleton converter from synchronized episode frames:

from pathlib import Path
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="vnrobo/g1-pick-bin-v1",
    fps=30,
    features=features,
    root=Path("/data/lerobot/vnrobo/g1-pick-bin-v1"),
    robot_type="unitree_g1",
    use_videos=True,
    batch_encoding_size=4,
    streaming_encoding=False,
    video_files_size_in_mb=200,
    data_files_size_in_mb=100,
)

for episode in synchronized_episodes:
    for frame in episode.frames:
        dataset.add_frame({
            "observation.state": frame.state.astype("float32"),
            "action": frame.action.astype("float32"),
            "observation.images.head": frame.head_rgb,
            "observation.images.left_wrist": frame.left_wrist_rgb,
            "observation.images.right_wrist": frame.right_wrist_rgb,
            "task": episode.task_text,
        })

    dataset.save_episode()

dataset.finalize()

For beginners, the task field deserves attention. It is added on each frame so the writer can map natural-language task strings to task_index and update meta/tasks.parquet. You should not hand-edit tasks.parquet. Letting the writer own the index reduces drift between frame rows and episode metadata.

After the dataset is written, run a quick smoke test:

find /data/lerobot/vnrobo/g1-pick-bin-v1 -maxdepth 3 -type f | sort | head -50

python - <<'PY'
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(
    "vnrobo/g1-pick-bin-v1",
    root="/data/lerobot/vnrobo/g1-pick-bin-v1",
)

print("episodes:", ds.num_episodes)
print("frames:", len(ds))
print("features:", ds.features.keys())
sample = ds[0]
print(sample["timestamp"], sample["observation.state"].shape, sample["action"].shape)
PY

Loading the first sample does not prove the dataset is good. It only proves the files are readable. You still need the QA layer in part 5: duration checks, black-frame detection, action spike detection, and task-label review.

When to keep Parquet+MP4

Keep LeRobot Parquet+MP4 as the main dataset when any of these conditions apply:

| Situation | Why LeRobot should stay |
|---|---|
| You want to upload to Hugging Face Hub | The Hub understands dataset repos, metadata, cards, streaming, and sharded files better than a private cache |
| Other people need to inspect the data | Parquet can be read with PyArrow/Pandas, and MP4 can be opened by common tools |
| You train with LeRobot policies | LeRobotDataset returns tensor dictionaries compatible with DataLoader and supports delta_timestamps |
| You version datasets as releases | repo_id, dataset cards, stats, and metadata make audits easier |
| The schema is still changing | Parquet+MP4 is easier to inspect and migrate than a custom binary trajectory cache |

LeRobot is also a strong fit for the data review phase. A data supervisor can open metadata, count episodes per task, inspect video shards, and check meta/episodes for episodes that are too short. When the dataset must be shared with another team, LeRobot is the common language.
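The delta_timestamps support mentioned in the table is easiest to understand as a dictionary of time offsets, in seconds, relative to the current frame. The helpers below are a hypothetical sketch for building such a dictionary from the dataset FPS; the horizon and history values are illustrative, not recommendations.

```python
def action_chunk_deltas(fps: int, horizon: int) -> list[float]:
    """Offsets in seconds for the current frame plus the next horizon - 1 frames."""
    return [i / fps for i in range(horizon)]


def history_deltas(fps: int, steps: int) -> list[float]:
    """Offsets for `steps` past frames followed by the current frame (0.0)."""
    return [-i / fps for i in range(steps, 0, -1)] + [0.0]


# One past state plus the current one, and a 16-step action chunk at 30 FPS.
delta_timestamps = {
    "observation.state": history_deltas(fps=30, steps=1),
    "action": action_chunk_deltas(fps=30, horizon=16),
}
print(delta_timestamps["observation.state"])
print(len(delta_timestamps["action"]))
```

The resulting dictionary is what you would pass as the `delta_timestamps` argument when constructing `LeRobotDataset`. The offsets should be multiples of 1/fps so they land on real frames within the dataset's timestamp tolerance.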

The tradeoff is throughput. With multi-camera humanoid data, the loader may need to read Parquet, resolve episode/frame offsets, decode MP4, apply transforms, and collate the batch. This is fine for many projects. But once you fine-tune on hundreds of thousands of episodes for many experiments, IO and video decoding can become the bottleneck.
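Before deciding that IO or decoding is the bottleneck, measure it. The sketch below times any iterable of batches, so the same function can be pointed at a LeRobot DataLoader and later at a .vla-backed loader. The function name and default counts are assumptions, and the stand-in loader exists only so the example runs on its own.

```python
import time
from itertools import islice


def measure_throughput(loader, max_batches=50, warmup=5):
    """Time how fast `loader` yields batches, ignoring the first `warmup` ones."""
    iterator = iter(loader)
    # Warm-up batches absorb one-off costs such as worker startup and cache fills.
    for _ in islice(iterator, warmup):
        pass
    start = time.perf_counter()
    batches = sum(1 for _ in islice(iterator, max_batches))
    seconds = time.perf_counter() - start
    rate = batches / seconds if seconds > 0 else float("inf")
    return {"batches": batches, "seconds": seconds, "batches_per_s": rate}


# Stand-in loader; replace with your real DataLoader when benchmarking.
fake_loader = (i for i in range(100))
print(measure_throughput(fake_loader, max_batches=20))
```

Compare this number against the rate your training step can consume. If the loader alone is already slower than the model, a faster cache is worth benchmarking.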

Where Robo-DM .vla fits

Robo-DM provides a Trajectory abstraction for writing multimodal robot data into .vla files. Its README shows the basic pattern:

import robodm

trajectory = robodm.Trajectory(path="/tmp/robot_demo.vla", mode="w")
trajectory.add("camera/rgb", image)
trajectory.add("robot/joint_positions", qpos)
trajectory.add("action/gripper_action", gripper_action)
trajectory.close()

The script examples/lerobot/lerobot_to_robodm_ingestion.py applies this idea to LeRobot. It loads LeRobotDataset, groups samples by episode_index, sorts them by frame_index, creates episode_XXX.vla, and calls Trajectory.add() for images, state, action, reward, and done signals when present. run_pipeline.py then demonstrates a more complete workflow: if robodm_data_dir already exists, ingestion is skipped; the pipeline loads .vla, builds a DataLoader, and trains a policy.

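For humanoid VLA, the converter should use real timestamps from LeRobot when available. Do not synthesize them as frame_idx * 100 ms: that hard-codes a 10 Hz rate, while a 30 FPS dataset has frames about 33.3 ms apart, and it hides dropped or late frames. A safer skeleton looks like this: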

from pathlib import Path
import numpy as np
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from robodm.trajectory import Trajectory

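def tensor_to_hwc_uint8(image_tensor):
    # Convert a channel-first (C, H, W) tensor to an (H, W, C) uint8 array.
    image = image_tensor.permute(1, 2, 0).cpu().numpy()
    if np.issubdtype(image.dtype, np.floating):
        # Float frames are assumed to be in [0, 1]. Decide by dtype, not by max
        # value, so a nearly black uint8 frame is never rescaled by mistake.
        image = np.clip(image * 255.0, 0, 255).round()
    return image.astype(np.uint8)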

def lerobot_to_vla(repo_id, root, output_dir, episodes=None):
    ds = LeRobotDataset(repo_id, root=root, episodes=episodes, video_backend="pyav")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    grouped = {}
    for sample in ds:
        ep = int(sample["episode_index"].item())
        frame = int(sample["frame_index"].item())
        grouped.setdefault(ep, []).append((frame, sample))

    for ep, frames in grouped.items():
        frames.sort(key=lambda item: item[0])
        traj = Trajectory(path=str(output_dir / f"episode_{ep:06d}.vla"), mode="w")

        try:
            for _, sample in frames:
                timestamp_s = float(sample["timestamp"].item())
                timestamp_ms = int(round(timestamp_s * 1000))

                traj.add(
                    "observation/state",
                    sample["observation.state"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )
                traj.add(
                    "action",
                    sample["action"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )

                for key in sample.keys():
                    if key.startswith("observation.images."):
                        camera = key.split(".")[-1]
                        traj.add(
                            f"observation/images/{camera}",
                            tensor_to_hwc_uint8(sample[key]),
                            timestamp=timestamp_ms,
                            time_unit="ms",
                        )
        finally:
            traj.close()

In production, do not create .vla files and lose their origin. Write a manifest:

manifest.parquet
  vla_path
  source_repo_id
  source_dataset_version
  source_episode_index
  source_episode_length
  task
  robot_id
  converter_version
  created_at
  qa_status

The manifest answers practical questions: which LeRobot release produced this .vla file, which converter version created it, whether it passed QA, and which training runs may use it.
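As a minimal sketch of how manifest rows could be assembled, the code below uses a dataclass with exactly the columns listed above. The `make_row` and `usable_for_training` helpers, the example values, and the "passed"/"pending" status strings are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ManifestRow:
    vla_path: str
    source_repo_id: str
    source_dataset_version: str
    source_episode_index: int
    source_episode_length: int
    task: str
    robot_id: str
    converter_version: str
    created_at: str
    qa_status: str


def make_row(vla_path, episode_index, episode_length, task, qa_status="pending"):
    """Build one manifest row as a plain dict; the fixed values are placeholders."""
    return asdict(ManifestRow(
        vla_path=vla_path,
        source_repo_id="vnrobo/g1-pick-bin-v1",
        source_dataset_version="v1.0",        # hypothetical release tag
        source_episode_index=episode_index,
        source_episode_length=episode_length,
        task=task,
        robot_id="robot_g1_001",
        converter_version="0.1.0",            # hypothetical converter version
        created_at=datetime.now(timezone.utc).isoformat(),
        qa_status=qa_status,
    ))


def usable_for_training(rows):
    """Only trajectories that passed QA should reach a training run."""
    return [row for row in rows if row["qa_status"] == "passed"]


task = "pick the red cup and place it in the bin"
rows = [
    make_row("episode_000123.vla", 123, 450, task, qa_status="passed"),
    make_row("episode_000124.vla", 124, 430, task),
]
print([row["vla_path"] for row in usable_for_training(rows)])
```

Since the rows are plain dictionaries, they can be written to manifest.parquet with PyArrow, for example through `pyarrow.Table.from_pylist(rows)` followed by `pyarrow.parquet.write_table`.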

When to convert to .vla

You do not need to convert every dataset immediately. Use this table:

| Question | Answer | Action |
|---|---|---|
| Has the dataset passed QA and stabilized its schema? | Yes | Create a .vla cache for training |
| Do other teams need to download or inspect the dataset? | Yes | Keep LeRobot as the released dataset |
| Is training bottlenecked by video decoding or random access? | Yes | Benchmark a Robo-DM cache |
| Is this only a 20-episode pilot? | Yes | Skip .vla for now; keep LeRobot and raw MCAP |
| Do you need to replay or audit a sensor failure? | Yes | Return to MCAP, not .vla |
| Are you running many experiments against one frozen dataset? | Yes | The .vla cache is worth it |

A useful rule:

MCAP = source of truth
LeRobot = source of shareable training data
Robo-DM .vla = reproducible training cache

If you change camera crops, action normalization, or feature names, create a new .vla cache instead of overwriting the old one. A training run must know exactly which cache it used.
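One way to enforce "new cache, never overwrite" is to derive the cache directory name from a hash of everything that affects the converted data. The sketch below is an assumption about how you might do it with the standard library; the config keys and version strings are illustrative.

```python
import hashlib
import json


def cache_key(dataset_version, converter_version, transform_config):
    """Short, deterministic hash of every input that shapes the .vla contents."""
    payload = json.dumps(
        {
            "dataset_version": dataset_version,
            "converter_version": converter_version,
            "transform": transform_config,
        },
        sort_keys=True,             # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


config_a = {
    "crop": {"head": [0, 0, 480, 640]},
    "action_norm": "mean_std",
    "features": ["observation.state", "action"],
}
# Changing the head camera crop must produce a different cache.
config_b = {**config_a, "crop": {"head": [40, 0, 440, 640]}}

key_a = cache_key("v1.0", "0.1.0", config_a)
key_b = cache_key("v1.0", "0.1.0", config_b)
print(f"g1-pick-bin-v1-{key_a}")
print(f"g1-pick-bin-v1-{key_b}")
```

Logging the key with each training run gives you the exact cache it used, and two runs with the same key are known to have read identical inputs.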

The data center pipeline for this article is:

1. Record MCAP
2. Export synchronized episode frames
3. Write LeRobotDataset with LeRobotDataset.create()
4. Run metadata and visual QA
5. Publish or version the LeRobot dataset
6. Convert approved episodes to Robo-DM .vla
7. Train from the .vla cache
8. Log the training run back to the dataset version

Example command layout:

# Sharing layer
python tools/mcap_to_lerobot.py \
  --raw-root /data/raw-mcap/2026-06-10 \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --output-root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --fps 30

# QA before training cache
python tools/qa_lerobot_dataset.py \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --report reports/g1-pick-bin-v1-qa.html

# Training layer
python examples/lerobot/lerobot_to_robodm_ingestion.py \
  --dataset vnrobo/g1-pick-bin-v1 \
  --output_dir /data/robodm-cache/g1-pick-bin-v1 \
  --video_backend pyav

python examples/lerobot/run_pipeline.py \
  --robodm_data_dir /data/robodm-cache/g1-pick-bin-v1 \
  --training_steps 100000 \
  --batch_size 128

For beginners, the hardest part is not the converter code. The hard part is deciding what is canonical. Keep the answer simple: the canonical public dataset is LeRobot; the canonical raw evidence is MCAP; .vla is an artifact you can rebuild.

Checklist before training

Before a large training run consumes .vla, check:

| Item | How to check |
|---|---|
| Number of .vla files equals approved episodes | Compare manifest against meta/episodes |
| Timestamps increase monotonically | Assert timestamp[i] < timestamp[i+1] per trajectory |
| Cameras have enough frames | Each camera frame count should be close to episode length |
| State/action dtypes are stable | Use float32 and fixed shapes |
| Task labels are non-empty | Join source_episode_index with tasks.parquet |
| Cache is rebuildable | Delete one .vla, rebuild from LeRobot, compare metadata checksum |
| Training loader does not drop keys | Print batch keys before training |
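Two of these checks, timestamp ordering and camera frame counts, are easy to automate once you have the values out of a trajectory. The sketch below works on plain Python lists and dictionaries and deliberately does not call the Robo-DM loader. The function name and the 2% tolerance are assumptions to adapt.

```python
def check_trajectory(timestamps_ms, camera_frame_counts, episode_length, tolerance=0.02):
    """Return a list of problems for one trajectory; empty means both checks pass."""
    problems = []
    # Timestamps must be strictly increasing within a trajectory.
    for i in range(len(timestamps_ms) - 1):
        if not timestamps_ms[i] < timestamps_ms[i + 1]:
            problems.append(
                f"timestamp not increasing at index {i}: "
                f"{timestamps_ms[i]} -> {timestamps_ms[i + 1]}"
            )
            break  # the first violation is enough to flag the file
    # Each camera should have roughly one frame per step.
    allowed_gap = tolerance * episode_length
    for camera, count in sorted(camera_frame_counts.items()):
        if abs(count - episode_length) > allowed_gap:
            problems.append(f"{camera}: {count} frames for {episode_length} steps")
    return problems


good = check_trajectory([0, 33, 67, 100], {"head": 4, "left_wrist": 4}, episode_length=4)
bad = check_trajectory([0, 33, 33, 100], {"head": 4, "left_wrist": 1}, episode_length=4)
print(good)
print(bad)
```

Run a check like this over every file listed in the manifest and record the result in qa_status, so a failing trajectory is excluded before a long training job starts.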

When a check fails, fix the lowest layer that owns the bug. Timestamp sync issues belong in the MCAP-to-LeRobot exporter. Video decode issues belong in the MP4 shard layer. Missing .vla keys belong in the LeRobot-to-Robo-DM converter.
