LeRobotDataset and Robo-DM Data Lake

What this article gives you

In part 1, we designed the pilot session so the operator, data supervisor, and robot have clear responsibilities. In part 2, we chose the teleoperation stack that produces learnable actions. In part 3, we used MCAP as the raw log layer for replay and audit.

Part 4 answers the next question: once the raw log is healthy, how should the training dataset live in the data lake?

If you only have a few dozen episodes, one LeRobot folder may be enough. Once you start collecting thousands of humanoid episodes, you need to separate two very different jobs:

Need	Best fit	Why
Share a dataset, inspect metadata, load from Hugging Face Hub, debug episodes	LeRobotDataset `Parquet + MP4`	Standard schema, easy inspection, strong fit with the Hub and LeRobot ecosystem
Fine-tune repeatedly, read sequentially at high speed, reduce decode and IO overhead	Robo-DM `.vla` trajectory	Self-contained container, `Trajectory.add()` for multimodal data, optimized loading for training

The key idea: LeRobot and Robo-DM are not mutually exclusive. In a humanoid VLA data center, LeRobot is the exchange and governance layer, and Robo-DM is the high-throughput training cache. You can keep both, but you do not need to convert everything on day one.

If you are new to LeRobot, read our LeRobot ecosystem guide. If you are building the full humanoid path from ROS 2 to VLA deployment, the humanoid robot software stack article places this data lake layer in the broader system.

The three layers: raw, share, train

A durable data lake should have three layers, not one giant folder:

humanoid-data-lake/
  raw-mcap/
    2026-06-10/robot_g1_001/episode_000123/
      episode_000123_0.mcap
      metadata.yaml
      operator_notes.md

  lerobot/
    vnrobo/g1-pick-bin-v1/
      meta/info.json
      meta/tasks.parquet
      meta/episodes/chunk-000/file-000.parquet
      data/chunk-000/file-000.parquet
      videos/observation.images.head/chunk-000/file-000.mp4
      videos/observation.images.left_wrist/chunk-000/file-000.mp4

  robodm-cache/
    g1-pick-bin-v1/
      episode_000123.vla
      episode_000124.vla
      manifest.parquet

Raw MCAP is the original evidence. LeRobot is the canonical dataset for sharing, review, and versioning. Robo-DM is a training cache that can be rebuilt from LeRobot. If the .vla cache is deleted, you rebuild it from Parquet+MP4. If a LeRobot export has a bug, you can return to MCAP and investigate the original robot streams.

A common beginner mistake is to convert MCAP directly into a custom training format and throw away the raw log. That works for a demo, but it destroys your audit path. When training fails two weeks later, you will not know whether the problem was the recorder, converter, timestamp alignment, camera codec, or batch loader. The three-layer design costs a little more discipline at the start, but it sharply lowers debugging cost later.
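One way to keep the three layers linked is to derive every path from the same episode identifiers. The sketch below follows the folder layout shown above; `EpisodeRef`, `layer_paths`, and the `org` default are hypothetical names for illustration, not part of LeRobot or Robo-DM.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EpisodeRef:
    """Hypothetical identifier for one recorded episode."""
    date: str       # recording day, e.g. "2026-06-10"
    robot_id: str   # e.g. "robot_g1_001"
    episode: int    # e.g. 123
    dataset: str    # e.g. "g1-pick-bin-v1"


def layer_paths(lake: Path, ref: EpisodeRef, org: str = "vnrobo") -> dict:
    """Return where this episode lives in the raw, share, and train layers."""
    name = f"episode_{ref.episode:06d}"
    return {
        # Original evidence: one folder per episode.
        "raw": lake / "raw-mcap" / ref.date / ref.robot_id / name,
        # Canonical dataset: episodes share shards, so this is the dataset root.
        "lerobot": lake / "lerobot" / org / ref.dataset,
        # Rebuildable cache: one .vla file per episode.
        "cache": lake / "robodm-cache" / ref.dataset / f"{name}.vla",
    }
```

Because every path comes from one record, a failed training sample can be traced from the `.vla` file back to the raw MCAP folder without guessing names.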

What LeRobotDataset v3 stores

Hugging Face's LeRobotDataset v3 documentation describes the format as a standard for multimodal robot data: sensorimotor time series, multi-camera video, and metadata for indexing, search, and visualization on the Hub. The important v3 design shift is that storage is no longer tied to one file per episode. Many episodes can be concatenated into shared Parquet or MP4 shards, and metadata reconstructs the episode-level view.

The files you need to understand are:

File or folder	Role
`meta/info.json`	Canonical schema: `features`, shapes, dtypes, FPS, version, `data_path`, `video_path`, total frames, episodes, and tasks
`meta/tasks.parquet`	Natural-language tasks and `task_index`, such as "pick the red cup and place it in the bin"
`meta/episodes/*.parquet`	Per-episode metadata: length, task, indexes into data/video shards, per-episode stats
`data/chunk-*/*.parquet`	Frame-level data: `observation.state`, `action`, `timestamp`, `episode_index`, `frame_index`, low-dimensional signals
`videos/chunk-*/*.mp4` or `videos/{camera}/chunk-*/*.mp4`	Video shards per camera; current v3 templates usually include `video_key` for clear camera separation
`meta/stats.json`	Mean/std/min/max statistics used for normalization during training

In current LeRobot source, the default task path is meta/tasks.parquet, episode metadata lives under meta/episodes/...parquet, frame data under data/...parquet, and videos under videos/{video_key}/...mp4. If you see older datasets with tasks.jsonl or episodes.jsonl, treat them as legacy layout and migrate before making them the standard for a new data center.
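A quick way to tell the two layouts apart is to look at which metadata files exist. This is a minimal sketch based only on the file names mentioned above; `detect_layout` is a hypothetical helper, and it does not validate the contents of any file.

```python
from pathlib import Path


def detect_layout(root: Path) -> str:
    """Classify a dataset folder as 'v3', 'legacy', or 'unknown' from its meta files."""
    meta = root / "meta"
    # v3 layout: Parquet task table plus a folder of episode metadata shards.
    if (meta / "tasks.parquet").is_file() and (meta / "episodes").is_dir():
        return "v3"
    # Older layout: JSON Lines files for tasks and episodes.
    if (meta / "tasks.jsonl").is_file() or (meta / "episodes.jsonl").is_file():
        return "legacy"
    return "unknown"
```

Running this over every dataset root before a release is a cheap way to find folders that still need migration.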

A minimal humanoid VLA schema can start like this:

features = {
    "observation.state": {
        "dtype": "float32",
        "shape": (64,),
        "names": ["base", "torso", "left_arm", "right_arm", "hands"],
    },
    "action": {
        "dtype": "float32",
        "shape": (32,),
        "names": ["target_joints", "gripper", "base_velocity"],
    },
    # dtype "video" stores camera frames as MP4 shards; "image" would embed them in Parquet.
    "observation.images.head": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.left_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.right_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
}

It does not need to be perfect on the first day. But three things must be stable: feature names, action/state shapes, and FPS. If action is a 32-dimensional joint target today and silently becomes 34-dimensional tomorrow because you added tactile gripper channels, training code will fail in a way that is hard to diagnose.
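A small guard can catch this kind of drift before training starts. The sketch below compares a frozen schema against the parsed content of `meta/info.json`; it assumes the file exposes top-level `fps` and `features` keys with a `shape` per feature, and the function names are hypothetical.

```python
import json
from pathlib import Path


def schema_drift(info: dict, expected_fps: int, expected_shapes: dict) -> list:
    """Return human-readable differences between info.json content and a frozen schema."""
    problems = []
    if info.get("fps") != expected_fps:
        problems.append(f"fps: expected {expected_fps}, found {info.get('fps')}")
    features = info.get("features", {})
    for name, shape in expected_shapes.items():
        if name not in features:
            problems.append(f"{name}: missing")
            continue
        # JSON stores shapes as lists, so normalize to tuples before comparing.
        found = tuple(features[name].get("shape", ()))
        if found != tuple(shape):
            problems.append(f"{name}: expected shape {tuple(shape)}, found {found}")
    return problems


def check_dataset(root: Path, expected_fps: int, expected_shapes: dict) -> list:
    """Load meta/info.json from a dataset root and report schema drift."""
    info = json.loads((root / "meta" / "info.json").read_text())
    return schema_drift(info, expected_fps, expected_shapes)
```

For example, `check_dataset(root, 30, {"action": (32,), "observation.state": (64,)})` returns an empty list when the dataset still matches the schema agreed for the release.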

Creating a dataset with `LeRobotDataset.create()`

LeRobotDataset.create() creates a dataset in write mode. In the current LeRobot source, the important arguments include repo_id, fps, features, root, robot_type, use_videos, tolerance_s, batch_encoding_size, streaming_encoding, video_files_size_in_mb, and data_files_size_in_mb; check the signature in the version you have installed, because these arguments have changed between releases. After creation, you call add_frame() once per frame, save_episode() once per episode, and finalize() once at the end.

Here is a skeleton converter from synchronized episode frames:

from pathlib import Path
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="vnrobo/g1-pick-bin-v1",
    fps=30,
    features=features,
    root=Path("/data/lerobot/vnrobo/g1-pick-bin-v1"),
    robot_type="unitree_g1",
    use_videos=True,
    batch_encoding_size=4,
    streaming_encoding=False,
    video_files_size_in_mb=200,
    data_files_size_in_mb=100,
)

for episode in synchronized_episodes:
    for frame in episode.frames:
        dataset.add_frame({
            "observation.state": frame.state.astype("float32"),
            "action": frame.action.astype("float32"),
            "observation.images.head": frame.head_rgb,
            "observation.images.left_wrist": frame.left_wrist_rgb,
            "observation.images.right_wrist": frame.right_wrist_rgb,
            "task": episode.task_text,
        })

    dataset.save_episode()

dataset.finalize()

For beginners, the task field deserves attention. It is added on each frame so the writer can map natural-language task strings to task_index and update meta/tasks.parquet. You should not hand-edit tasks.parquet. Letting the writer own the index reduces drift between frame rows and episode metadata.
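To see why the writer should own the index, here is a toy model of the mapping. It only illustrates the idea of one stable index per distinct task string; `TaskIndex` is a hypothetical class and is not how LeRobot implements its task table.

```python
class TaskIndex:
    """Toy stand-in for a writer-owned task table: one index per distinct string."""

    def __init__(self):
        self._index = {}

    def lookup(self, task: str) -> int:
        # Assign the next free index the first time a task string is seen.
        if task not in self._index:
            self._index[task] = len(self._index)
        return self._index[task]

    def rows(self) -> list:
        # Rows in the shape of a task table: task_index plus the task text.
        return [{"task_index": i, "task": t} for t, i in self._index.items()]
```

The same model also shows a common labeling trap: "pick the red cup" and "Pick the red cup." are different strings, so they would receive different indexes. Normalize task text before it reaches the writer.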

After the dataset is written, run a quick smoke test:

find /data/lerobot/vnrobo/g1-pick-bin-v1 -maxdepth 3 -type f | sort | head -50

python - <<'PY'
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(
    "vnrobo/g1-pick-bin-v1",
    root="/data/lerobot/vnrobo/g1-pick-bin-v1",
)

print("episodes:", ds.num_episodes)
print("frames:", len(ds))
print("features:", ds.features.keys())
sample = ds[0]
print(sample["timestamp"], sample["observation.state"].shape, sample["action"].shape)
PY

Loading the first sample does not prove the dataset is good. It only proves the files are readable. You still need the QA layer in part 5: duration checks, black-frame detection, action spike detection, and task-label review.
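As a preview of what such checks look like, here is a minimal sketch of a duration check and an action spike check in plain Python. The function names and thresholds are assumptions for illustration; real limits depend on your task and robot.

```python
def duration_ok(num_frames: int, fps: int, min_s: float, max_s: float) -> bool:
    """True when the episode length in seconds falls inside the allowed window."""
    return min_s <= num_frames / fps <= max_s


def action_spikes(actions: list, max_step: float) -> list:
    """Return frame indexes where any action dimension jumps by more than max_step."""
    spikes = []
    for i in range(1, len(actions)):
        # Largest per-dimension change between consecutive frames.
        step = max(abs(a - b) for a, b in zip(actions[i], actions[i - 1]))
        if step > max_step:
            spikes.append(i)
    return spikes
```

An episode with 30 frames at 30 FPS lasts one second, so `duration_ok(30, 30, 5.0, 60.0)` rejects it as too short.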

When to keep Parquet+MP4

Keep LeRobot Parquet+MP4 as the main dataset when any of these conditions apply:

Situation	Why LeRobot should stay
You want to upload to Hugging Face Hub	The Hub understands dataset repos, metadata, cards, streaming, and sharded files better than a private cache
Other people need to inspect the data	Parquet can be read with PyArrow/Pandas, and MP4 can be opened by common tools
You train with LeRobot policies	`LeRobotDataset` returns tensor dictionaries compatible with DataLoader and supports `delta_timestamps`
You version datasets as releases	`repo_id`, dataset cards, stats, and metadata make audits easier
The schema is still changing	Parquet+MP4 is easier to inspect and migrate than a custom binary trajectory cache

LeRobot is also a strong fit for the data review phase. A data supervisor can open metadata, count episodes per task, inspect video shards, and check meta/episodes for episodes that are too short. When the dataset must be shared with another team, LeRobot is the common language.

The tradeoff is throughput. With multi-camera humanoid data, the loader may need to read Parquet, resolve episode/frame offsets, decode MP4, apply transforms, and collate the batch. This is fine for many projects. But once you fine-tune on hundreds of thousands of episodes for many experiments, IO and video decoding can become the bottleneck.
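Before deciding that IO is the bottleneck, measure it. The helper below is a rough sketch that times any iterable of batches, such as a PyTorch DataLoader; `samples_per_second` is a hypothetical name, and a serious benchmark would also warm up caches and repeat the run.

```python
import time
from typing import Iterable


def samples_per_second(loader: Iterable, max_batches: int, batch_size: int) -> float:
    """Time up to max_batches iterations and return the observed samples per second."""
    start = time.perf_counter()
    seen = 0
    for seen, _batch in enumerate(loader, start=1):
        if seen >= max_batches:
            break
    if seen == 0:
        return 0.0  # the loader produced nothing
    # Guard against a zero reading from a coarse timer.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return seen * batch_size / elapsed
```

Run it once against the LeRobot loader and once against the candidate cache with the same batch size, and compare both numbers with how fast the GPU consumes batches.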

Where Robo-DM `.vla` fits

Robo-DM provides a Trajectory abstraction for writing multimodal robot data into .vla files. Its README shows the basic pattern:

import robodm

trajectory = robodm.Trajectory(path="/tmp/robot_demo.vla", mode="w")
trajectory.add("camera/rgb", image)
trajectory.add("robot/joint_positions", qpos)
trajectory.add("action/gripper_action", gripper_action)
trajectory.close()

The script examples/lerobot/lerobot_to_robodm_ingestion.py applies this idea to LeRobot. It loads LeRobotDataset, groups samples by episode_index, sorts them by frame_index, creates episode_XXX.vla, and calls Trajectory.add() for images, state, action, reward, and done signals when present. run_pipeline.py then demonstrates a more complete workflow: if robodm_data_dir already exists, ingestion is skipped; the pipeline loads .vla, builds a DataLoader, and trains a policy.

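For humanoid VLA, the converter should use real timestamps from LeRobot when available. Do not hard-code frame_idx * 100 ms: that spacing assumes 10 Hz data, while the dataset in this article is recorded at 30 FPS, or about 33 ms per frame. A safer skeleton looks like this: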

from pathlib import Path
import numpy as np
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from robodm.trajectory import Trajectory

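def tensor_to_hwc_uint8(image_tensor):
    # LeRobot returns images as channel-first tensors (C, H, W); reorder to (H, W, C).
    image = image_tensor.permute(1, 2, 0).cpu().numpy()
    # Float images are usually scaled to [0, 1]; rescale to [0, 255] before casting.
    if np.issubdtype(image.dtype, np.floating) and image.max() <= 1.0:
        image = image * 255.0
    # Round and clip so the uint8 cast neither truncates nor wraps around.
    return np.clip(np.rint(image), 0, 255).astype(np.uint8)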

def lerobot_to_vla(repo_id, root, output_dir, episodes=None):
    ds = LeRobotDataset(repo_id, root=root, episodes=episodes, video_backend="pyav")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

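    # Group decoded samples by episode. This keeps every frame in memory, so pass
    # a small `episodes` list per call when the dataset is large.
    grouped = {}
    for sample in ds:
        ep = int(sample["episode_index"].item())
        frame = int(sample["frame_index"].item())
        grouped.setdefault(ep, []).append((frame, sample))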

    for ep, frames in grouped.items():
        frames.sort(key=lambda item: item[0])
        traj = Trajectory(path=str(output_dir / f"episode_{ep:06d}.vla"), mode="w")

        try:
            for _, sample in frames:
                timestamp_s = float(sample["timestamp"].item())
                timestamp_ms = int(round(timestamp_s * 1000))

                traj.add(
                    "observation/state",
                    sample["observation.state"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )
                traj.add(
                    "action",
                    sample["action"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )

                for key in sample.keys():
                    if key.startswith("observation.images."):
                        camera = key.split(".")[-1]
                        traj.add(
                            f"observation/images/{camera}",
                            tensor_to_hwc_uint8(sample[key]),
                            timestamp=timestamp_ms,
                            time_unit="ms",
                        )
        finally:
            traj.close()

In production, do not create .vla files and lose their origin. Write a manifest:

manifest.parquet
  vla_path
  source_repo_id
  source_dataset_version
  source_episode_index
  source_episode_length
  task
  robot_id
  converter_version
  created_at
  qa_status

The manifest answers practical questions: which LeRobot release produced this .vla file, which converter version created it, whether it passed QA, and which training runs may use it.
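A manifest row can be built with a plain dataclass so that every converter run fills the same columns. This is a sketch under assumptions: `ManifestRow` and `make_row` are hypothetical names, and the `"pending"` default for `qa_status` is an invented convention.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ManifestRow:
    vla_path: str
    source_repo_id: str
    source_dataset_version: str
    source_episode_index: int
    source_episode_length: int
    task: str
    robot_id: str
    converter_version: str
    created_at: str
    qa_status: str


def make_row(vla_path, repo_id, version, episode_index, length, task,
             robot_id, converter_version, qa_status="pending") -> dict:
    """Build one manifest row, stamping the creation time in UTC."""
    row = ManifestRow(
        vla_path=vla_path,
        source_repo_id=repo_id,
        source_dataset_version=version,
        source_episode_index=episode_index,
        source_episode_length=length,
        task=task,
        robot_id=robot_id,
        converter_version=converter_version,
        created_at=datetime.now(timezone.utc).isoformat(),
        qa_status=qa_status,
    )
    return asdict(row)

# With pandas and pyarrow installed, a list of these rows can be written with:
#   pandas.DataFrame(rows).to_parquet("manifest.parquet")
```

Appending one row per `.vla` file inside the converter loop keeps the manifest and the cache from drifting apart.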

When to convert to `.vla`

You do not need to convert every dataset immediately. Use this table:

Question	If yes	Action
Has the dataset passed QA and stabilized its schema?	Yes	Create a `.vla` cache for training
Do other teams need to download or inspect the dataset?	Yes	Keep LeRobot as the released dataset
Is training bottlenecked by video decoding or random access?	Yes	Benchmark a Robo-DM cache
Is this only a 20-episode pilot?	Yes	Skip `.vla` for now; keep LeRobot and raw MCAP
Do you need to replay or audit a sensor failure?	Yes	Return to MCAP, not `.vla`
Are you running many experiments against one frozen dataset?	Yes	The `.vla` cache is worth it

A useful rule:

MCAP = source of truth
LeRobot = source of shareable training data
Robo-DM .vla = reproducible training cache

If you change camera crops, action normalization, or feature names, create a new .vla cache instead of overwriting the old one. A training run must know exactly which cache it used.
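One simple way to enforce this is to name each cache after a hash of everything that affects its contents. The sketch below is an assumption about how you might do that; `cache_key` and the shape of `transform_config` are hypothetical.

```python
import hashlib
import json


def cache_key(dataset_version: str, converter_version: str, transform_config: dict) -> str:
    """Derive a short, deterministic identifier for one cache build."""
    payload = json.dumps(
        {
            "dataset": dataset_version,
            "converter": converter_version,
            "transforms": transform_config,
        },
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Use the key as a folder suffix, for example `robodm-cache/g1-pick-bin-v1-<key>/`, and log it with every training run. Changing a crop or a normalization setting then produces a new folder instead of silently replacing the old cache.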

Recommended two-tier pipeline

The data center pipeline for this article is:

1. Record MCAP
2. Export synchronized episode frames
3. Write LeRobotDataset with LeRobotDataset.create()
4. Run metadata and visual QA
5. Publish or version the LeRobot dataset
6. Convert approved episodes to Robo-DM .vla
7. Train from the .vla cache
8. Log the training run back to the dataset version

Example command layout:

# Sharing layer
python tools/mcap_to_lerobot.py \
  --raw-root /data/raw-mcap/2026-06-10 \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --output-root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --fps 30

# QA before training cache
python tools/qa_lerobot_dataset.py \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --report reports/g1-pick-bin-v1-qa.html

# Training layer
python examples/lerobot/lerobot_to_robodm_ingestion.py \
  --dataset vnrobo/g1-pick-bin-v1 \
  --output_dir /data/robodm-cache/g1-pick-bin-v1 \
  --video_backend pyav

python examples/lerobot/run_pipeline.py \
  --robodm_data_dir /data/robodm-cache/g1-pick-bin-v1 \
  --training_steps 100000 \
  --batch_size 128

For beginners, the hardest part is not the converter code. The hard part is deciding what is canonical. Keep the answer simple: the canonical public dataset is LeRobot; the canonical raw evidence is MCAP; .vla is an artifact you can rebuild.

Checklist before training

Before a large training run consumes .vla, check:

Item	How to check
Number of `.vla` files equals approved episodes	Compare manifest against `meta/episodes`
Timestamps increase monotonically	Assert `timestamp[i] < timestamp[i+1]` per trajectory
Cameras have enough frames	Each camera frame count should be close to episode length
State/action dtypes are stable	Use `float32` and fixed shapes
Task labels are non-empty	Join `source_episode_index` with `meta/episodes`, then resolve each task against `tasks.parquet`
Cache is rebuildable	Delete one `.vla`, rebuild from LeRobot, compare metadata checksum
Training loader does not drop keys	Print batch keys before training
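Two of these checks are easy to automate with plain Python. The sketch below assumes you have already read the timestamps of one trajectory into a list and the episode indexes of both the approved set and the manifest into sets; the function names are hypothetical.

```python
def strictly_increasing(timestamps: list) -> bool:
    """True when every timestamp is smaller than the one that follows it."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


def cache_mismatch(approved: set, manifest: set) -> dict:
    """Compare approved episode indexes against the episodes listed in the manifest."""
    return {
        # Approved episodes that have no .vla file yet.
        "missing": sorted(approved - manifest),
        # Cached episodes that were never approved and should not be trained on.
        "unexpected": sorted(manifest - approved),
    }
```

A clean cache returns `{"missing": [], "unexpected": []}`. Anything else should block the training run until the converter or the QA list is fixed.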

When a check fails, fix the lowest layer that owns the bug. Timestamp sync issues belong in the MCAP-to-LeRobot exporter. Video decode issues belong in the MP4 shard layer. Missing .vla keys belong in the LeRobot-to-Robo-DM converter.

Technical sources

Nguyễn Anh Tuấn
