LeRobotDataset and Robo-DM for the data lake

Goals of this article

In part 1, we designed the pilot shift so that the operator, the data supervisor, and the robot do not get in each other's way. In part 2, we chose a teleop stack that produces actions clean enough for a policy to learn from. In part 3, we used MCAP as the original raw log for replay and audit.

Part 4 answers the next question: once you have good raw logs, how should the data lake store training datasets?

If you only have a few dozen episodes, a single LeRobot folder is enough. But once you start collecting thousands of humanoid episodes, you need to separate two distinct needs:

Need	Suitable format	Why
Sharing datasets, viewing metadata, downloading from Hugging Face Hub, debugging episodes	LeRobotDataset `Parquet + MP4`	Standardized schema, easy to inspect, fits the Hub and the LeRobot ecosystem
Continuous fine-tuning, high-speed sequential reads, lower decode/IO overhead	Robo-DM `.vla` trajectory	Self-contained container, `Trajectory.add()` writes multiple modalities, loading optimized for training

The key point: do not treat LeRobot and Robo-DM as mutually exclusive choices. In a humanoid VLA data center, LeRobot is the exchange and data-governance layer, and Robo-DM is the high-speed training cache layer. You can keep both, but you do not need to convert everything on day one.

If you are new to LeRobot, read the LeRobot ecosystem guide as well. If you are building a complete humanoid pipeline from ROS 2 to VLA deployment, the humanoid robot software stack article will help place the data lake in the bigger picture.

The big picture: raw, share, train

A durable data lake should have three layers, not a single giant folder:

humanoid-data-lake/
  raw-mcap/
    2026-06-10/robot_g1_001/episode_000123/
      episode_000123_0.mcap
      metadata.yaml
      operator_notes.md

  lerobot/
    vnrobo/g1-pick-bin-v1/
      meta/info.json
      meta/tasks.parquet
      meta/episodes/chunk-000/file-000.parquet
      data/chunk-000/file-000.parquet
      videos/observation.images.head/chunk-000/file-000.mp4
      videos/observation.images.left_wrist/chunk-000/file-000.mp4

  robodm-cache/
    g1-pick-bin-v1/
      episode_000123.vla
      episode_000124.vla
      manifest.parquet

Raw MCAP is the original evidence. LeRobot is the canonical dataset for sharing, review, and versioning. Robo-DM is a training cache that can be regenerated from LeRobot. If the .vla cache is deleted, you can still rebuild it from Parquet+MP4. If the LeRobot export has a bug, you can still go back to MCAP to investigate.

A common beginner mistake is converting MCAP straight into a private training format and then discarding the raw logs. That runs fast in a demo but loses auditability. When training fails two weeks later, you cannot tell whether the fault lies in the recorder, the converter, the timestamps, the camera codec, or the batch loader. The three layers above make the pipeline slightly slower at first, but they greatly reduce debugging cost later.

What does LeRobotDataset v3 store?

Hugging Face's LeRobotDataset v3 documentation describes the format as a standard for multimodal robot data: sensorimotor time series, multi-camera video, and metadata for indexing, search, and visualization on the Hub. In v3, the core principle is that storage is no longer tied to per-episode file boundaries. Many episodes can be concatenated into the same Parquet or MP4 shard, and the metadata reconstructs the per-episode view.
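
A minimal sketch of that idea, not the LeRobot reader itself: episodes are laid back to back in one flat list of rows, and a small per-episode record of start and end offsets is enough to recover each episode. The `EpisodeSpan` type and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSpan:
    """Hypothetical per-episode metadata record (illustration only)."""
    episode_index: int
    from_index: int  # first row of the episode in the shared shard (inclusive)
    to_index: int    # one past the last row of the episode (exclusive)


def build_spans(episode_lengths):
    """Lay episodes back to back and record where each one starts and ends."""
    spans, cursor = [], 0
    for episode_index, length in enumerate(episode_lengths):
        spans.append(EpisodeSpan(episode_index, cursor, cursor + length))
        cursor += length
    return spans


def episode_rows(shard_rows, span):
    """Recover one episode from the shared shard using only its span."""
    return shard_rows[span.from_index:span.to_index]


# Three episodes of 3, 2 and 4 frames stored in a single flat list of rows.
lengths = [3, 2, 4]
shard = [(ep, frame) for ep, n in enumerate(lengths) for frame in range(n)]
spans = build_spans(lengths)
second_episode = episode_rows(shard, spans[1])
```

The point of the design is that the shard never needs to know where episodes begin; only the metadata does, which is why a broken or stale metadata table makes an otherwise intact shard unreadable per episode.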

The main files you need to understand:

File or folder	Role
`meta/info.json`	Canonical schema: `features`, shape, dtype, FPS, version, `data_path`, `video_path`, total frame/episode/task counts
`meta/tasks.parquet`	Table of natural-language tasks and `task_index`; for example "pick the red cup and place it in the bin"
`meta/episodes/*.parquet`	Per-episode metadata: length, task, index into data/video shards, per-episode statistics
`data/chunk-*/*.parquet`	Per-frame data: `observation.state`, `action`, `timestamp`, `episode_index`, `frame_index`, and other low-dimensional fields
`videos/chunk-*/*.mp4` or `videos/{camera}/chunk-*/*.mp4`	Per-camera video shards; v3 usually uses a path containing `video_key` to separate cameras clearly
`meta/stats.json`	mean/std/min/max statistics used for normalization during training

In the current LeRobot source, the default task path is meta/tasks.parquet, episode metadata lives under meta/episodes/...parquet, data lives under data/...parquet, and video lives under videos/{video_key}/...mp4. If you come across an older dataset with tasks.jsonl or episodes.jsonl, treat it as a legacy layout and migrate it before adopting it as the standard for a new data center.
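
A small sketch of how an ingest job might flag legacy layouts before accepting a dataset into the lake. The file names are the ones mentioned above; the classification rule and the `classify_layout` helper are assumptions made for illustration, not part of LeRobot.

```python
import tempfile
from pathlib import Path

# Marker files taken from the text above; the rule built on them is an assumption.
LEGACY_MARKERS = ("meta/tasks.jsonl", "meta/episodes.jsonl")
CURRENT_MARKERS = ("meta/info.json", "meta/tasks.parquet")


def classify_layout(root):
    """Return 'legacy', 'current' or 'unknown' for a dataset folder."""
    root = Path(root)
    if any((root / rel).is_file() for rel in LEGACY_MARKERS):
        return "legacy"
    if all((root / rel).is_file() for rel in CURRENT_MARKERS):
        return "current"
    return "unknown"


def _touch(root, rel):
    """Create an empty file, including its parent folders."""
    path = Path(root) / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


# Build two throwaway dataset folders and classify them.
with tempfile.TemporaryDirectory() as tmp:
    old_root, new_root = Path(tmp) / "old", Path(tmp) / "new"
    _touch(old_root, "meta/info.json")
    _touch(old_root, "meta/tasks.jsonl")
    _touch(new_root, "meta/info.json")
    _touch(new_root, "meta/tasks.parquet")
    old_layout = classify_layout(old_root)
    new_layout = classify_layout(new_root)
    missing_layout = classify_layout(Path(tmp) / "missing")
```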

A minimal schema for a humanoid VLA could start like this:

features = {
    # Proprioceptive state. "names" lists joint groups here, not all 64 dimensions.
    "observation.state": {
        "dtype": "float32",
        "shape": (64,),
        "names": ["base", "torso", "left_arm", "right_arm", "hands"],
    },
    "action": {
        "dtype": "float32",
        "shape": (32,),
        "names": ["target_joints", "gripper", "base_velocity"],
    },
    # Cameras use dtype "video" so frames are encoded into MP4 shards;
    # dtype "image" would keep them as still images rather than MP4.
    "observation.images.head": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.left_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.images.right_wrist": {
        "dtype": "video",
        "shape": (480, 640, 3),
        "names": ["height", "width", "channels"],
    },
}

It does not have to be perfect right away, but you must stabilize three things: feature names, action/state shapes, and FPS. If the action is a 32-dimensional joint target today and becomes 34-dimensional tomorrow because you added gripper tactile channels without versioning the dataset, the training code will break in ways that are very hard to read.
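
One way to make such drift visible before training is to diff two schema versions explicitly. This is a sketch under the assumption that each feature is a dict with `dtype` and `shape`, as in the schema above; `schema_changes` is a hypothetical helper, not a LeRobot function.

```python
def _contract(spec):
    """The part of a feature spec that training code depends on."""
    return (spec["dtype"], tuple(spec["shape"]))


def schema_changes(old, new):
    """List feature-level differences between two schema versions."""
    changes = []
    for name in sorted(set(old) | set(new)):
        if name not in new:
            changes.append(f"{name}: removed")
        elif name not in old:
            changes.append(f"{name}: added")
        elif _contract(old[name]) != _contract(new[name]):
            changes.append(f"{name}: {_contract(old[name])} -> {_contract(new[name])}")
    return changes


v1 = {
    "observation.state": {"dtype": "float32", "shape": (64,)},
    "action": {"dtype": "float32", "shape": (32,)},
}
# Same dataset a day later, after tactile channels were appended to the action.
v2 = {
    "observation.state": {"dtype": "float32", "shape": (64,)},
    "action": {"dtype": "float32", "shape": (34,)},
}

drift = schema_changes(v1, v2)
no_drift = schema_changes(v1, v1)
```

A non-empty result is a signal to cut a new dataset version rather than append to the existing one.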

Creating a dataset with `LeRobotDataset.create()`

LeRobotDataset.create() creates a dataset in write mode. According to the LeRobot source, the important parameters include repo_id, fps, features, root, robot_type, use_videos, tolerance_s, batch_encoding_size, streaming_encoding, video_files_size_in_mb, and data_files_size_in_mb. After creating it, you call add_frame(), save_episode(), and then finalize().

Example skeleton for a converter from already-synchronized episodes:

from pathlib import Path
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="vnrobo/g1-pick-bin-v1",
    fps=30,
    features=features,
    root=Path("/data/lerobot/vnrobo/g1-pick-bin-v1"),
    robot_type="unitree_g1",
    use_videos=True,
    batch_encoding_size=4,
    streaming_encoding=False,
    video_files_size_in_mb=200,
    data_files_size_in_mb=100,
)

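# `synchronized_episodes` is assumed to come from your own MCAP exporter: each episode
# exposes `.frames` (time-aligned state, action, and HWC uint8 RGB arrays) and `.task_text`.
for episode in synchronized_episodes:
    for frame in episode.frames:
        dataset.add_frame({
            "observation.state": frame.state.astype("float32"),
            "action": frame.action.astype("float32"),
            "observation.images.head": frame.head_rgb,
            "observation.images.left_wrist": frame.left_wrist_rgb,
            "observation.images.right_wrist": frame.right_wrist_rgb,
            "task": episode.task_text,
        })

    # Commit the buffered frames as one episode.
    dataset.save_episode()

# Finish writing so the final shards and metadata are complete.
dataset.finalize()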

Beginners should note: the task is passed in with every frame so the writer can collect tasks into task_index and update meta/tasks.parquet. Do not edit tasks.parquet by hand. Letting the writer manage the index reduces mismatches between the task in frame data and in episode metadata.
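
To make the mechanism concrete, here is a simplified illustration of how a writer can turn task strings into stable indices. It is not LeRobot's implementation; the `TaskIndex` class is hypothetical and only shows why a single owner of the mapping avoids mismatches.

```python
class TaskIndex:
    """Assign a stable integer index to each distinct task string."""

    def __init__(self):
        self._index_by_task = {}

    def index_for(self, task):
        task = task.strip()
        if not task:
            raise ValueError("task text must not be empty")
        # New tasks get the next free index; known tasks keep the one they have.
        return self._index_by_task.setdefault(task, len(self._index_by_task))

    def table(self):
        """Rows of (task, task_index), ordered by index."""
        return sorted(self._index_by_task.items(), key=lambda item: item[1])


tasks = TaskIndex()
frame_tasks = ["pick the red cup", "open the drawer", "pick the red cup"]
indices = [tasks.index_for(text) for text in frame_tasks]
task_table = tasks.table()
```

Editing the table by hand would break exactly this invariant: the same string must always map to the same index in both frame data and episode metadata.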

Once the dataset has been written, run a quick check:

find /data/lerobot/vnrobo/g1-pick-bin-v1 -maxdepth 3 -type f | sort | head -50

python - <<'PY'
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(
    "vnrobo/g1-pick-bin-v1",
    root="/data/lerobot/vnrobo/g1-pick-bin-v1",
)

print("episodes:", ds.num_episodes)
print("frames:", len(ds))
print("features:", ds.features.keys())
sample = ds[0]
print(sample["timestamp"], sample["observation.state"].shape, sample["action"].shape)
PY

If the first sample loads, that does not yet mean the dataset is good. You still need the QA from part 5: is the duration correct, are any cameras black, does the action have spikes, have task labels been copied incorrectly?
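
As a preview of those checks, the sketch below uses plain Python lists so it runs without any dataset. The helper names and the thresholds (`max_step`, `threshold`) are illustrative assumptions; real values depend on your robot and cameras.

```python
def duration_mismatch_s(num_frames, fps, recorded_duration_s):
    """Difference between the duration implied by frame count and the logged one."""
    return abs(num_frames / fps - recorded_duration_s)


def find_action_spikes(actions, max_step):
    """Frame indices where any action dimension jumps by more than max_step."""
    spikes = []
    for i in range(1, len(actions)):
        step = max(abs(a - b) for a, b in zip(actions[i], actions[i - 1]))
        if step > max_step:
            spikes.append(i)
    return spikes


def is_black_frame(pixels, threshold=5):
    """True when the mean of 0-255 pixel intensities falls below the threshold."""
    pixels = list(pixels)
    return sum(pixels) / len(pixels) < threshold


# A 300-frame episode at 30 FPS should last 10 seconds.
mismatch = duration_mismatch_s(300, 30, 10.0)
# The jump from 0.01 to 0.9 on the first dimension is flagged at frame 2.
spikes = find_action_spikes([[0.0, 0.0], [0.01, 0.0], [0.9, 0.0], [0.91, 0.0]], max_step=0.2)
dark = is_black_frame([0, 1, 2, 0])
lit = is_black_frame([120, 130, 125, 118])
```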

When to keep Parquet+MP4?

Keep LeRobot Parquet+MP4 as the primary copy when any of the following applies:

Situation	Reason to keep LeRobot
You want to upload to Hugging Face Hub	The Hub understands dataset repos, metadata, cards, streaming, and file sharding better than a private cache
You want others to inspect quickly	Parquet can be read with PyArrow/Pandas, and MP4 opens in common tools
You want to train with a LeRobot policy	`LeRobotDataset` returns a dictionary of tensors suited to DataLoader and supports `delta_timestamps`
You want to version the dataset by release	`repo_id`, the dataset card, stats, and metadata make audits easier
The dataset schema is still changing	Parquet+MP4 is easier to inspect and migrate than a custom binary trajectory

LeRobot also suits the "data review" stage: the data supervisor can open the metadata, count episodes per task, watch video shards, and check meta/episodes for episodes that are too short. When a dataset needs to be shared with another team, LeRobot is the common language.

The downside of this layer is that training throughput is not always optimal. With a multi-camera humanoid, the loader has to read Parquet, resolve episode/frame offsets, decode MP4, apply transforms, and collate batches. That is still fine for many projects, but when you fine-tune on hundreds of thousands of episodes over several days, IO and video decoding can become the bottleneck.
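
Before deciding that the loader is the bottleneck, measure it. The sketch below times any iterable of batches and reports samples per second; `measure_throughput` and `fake_loader` are hypothetical helpers, and the sleeping generator only stands in for a real DataLoader.

```python
import time

_END = object()  # sentinel marking an exhausted iterator


def measure_throughput(batches, batch_size, warmup=2, max_batches=50):
    """Return samples per second for any iterable of batches."""
    iterator = iter(batches)
    for _ in range(warmup):
        # Skip cold-start cost such as file opens and decoder initialisation.
        next(iterator, _END)
    seen = 0
    start = time.perf_counter()
    for _ in range(max_batches):
        if next(iterator, _END) is _END:
            break
        seen += 1
    elapsed = time.perf_counter() - start
    if seen == 0 or elapsed <= 0:
        return 0.0
    return seen * batch_size / elapsed


def fake_loader(num_batches, delay_s):
    """Stand-in for a real DataLoader: sleeps to imitate decode and IO cost."""
    for i in range(num_batches):
        time.sleep(delay_s)
        yield i


rate = measure_throughput(fake_loader(12, 0.002), batch_size=8)
empty_rate = measure_throughput([], batch_size=8)
```

Running the same function over the LeRobot loader and over a cache-backed loader, with identical batch size and transforms, gives a like-for-like number to compare against GPU step time.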

Where is Robo-DM `.vla` used?

Robo-DM defines Trajectory for writing multimodal robot data into a .vla file. The Robo-DM README shows the basic pattern:

import robodm

trajectory = robodm.Trajectory(path="/tmp/robot_demo.vla", mode="w")
trajectory.add("camera/rgb", image)
trajectory.add("robot/joint_positions", qpos)
trajectory.add("action/gripper_action", gripper_action)
trajectory.close()

The script examples/lerobot/lerobot_to_robodm_ingestion.py goes further: it loads a LeRobotDataset, groups samples by episode_index, sorts by frame_index, creates episode_XXX.vla, and then calls Trajectory.add() for image, state, action, and, if present, reward and done. run_pipeline.py illustrates a more production-like flow: if robodm_data_dir already exists, it skips ingestion, loads the .vla files, creates a DataLoader, and trains the policy.

For a humanoid VLA, the converter should use the real timestamps from LeRobot when available rather than always assuming frame_idx * 100 ms. A safer skeleton:

from pathlib import Path
import numpy as np
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from robodm.trajectory import Trajectory

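def tensor_to_hwc_uint8(image_tensor):
    """Convert a CHW tensor to an HWC uint8 array."""
    image = image_tensor.permute(1, 2, 0).cpu().numpy()
    # Float images are assumed to be in [0, 1]. Decide by dtype rather than by max(),
    # so a nearly black integer frame is not rescaled by mistake.
    if np.issubdtype(image.dtype, np.floating):
        image = np.clip(image * 255.0, 0, 255).round()
    return image.astype(np.uint8)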

def lerobot_to_vla(repo_id, root, output_dir, episodes=None):
    ds = LeRobotDataset(repo_id, root=root, episodes=episodes, video_backend="pyav")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

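    # Note: this buffers every decoded sample in memory. That is fine for a small
    # dataset; for a large one, convert a few episodes per call via `episodes=[...]`.
    grouped = {}
    for sample in ds:
        ep = int(sample["episode_index"].item())
        frame = int(sample["frame_index"].item())
        grouped.setdefault(ep, []).append((frame, sample))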

    for ep, frames in grouped.items():
        frames.sort(key=lambda item: item[0])
        traj = Trajectory(path=str(output_dir / f"episode_{ep:06d}.vla"), mode="w")

        try:
            for _, sample in frames:
                timestamp_s = float(sample["timestamp"].item())
                timestamp_ms = int(round(timestamp_s * 1000))

                traj.add(
                    "observation/state",
                    sample["observation.state"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )
                traj.add(
                    "action",
                    sample["action"].cpu().numpy().astype(np.float32),
                    timestamp=timestamp_ms,
                    time_unit="ms",
                )

                for key in sample.keys():
                    if key.startswith("observation.images."):
                        camera = key.split(".")[-1]
                        traj.add(
                            f"observation/images/{camera}",
                            tensor_to_hwc_uint8(sample[key]),
                            timestamp=timestamp_ms,
                            time_unit="ms",
                        )
        finally:
            traj.close()
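
To see why the fixed frame_idx * 100 ms assumption matters, the short calculation below compares it with timestamps derived from the actual frame rate. The 300-frame episode is an illustrative example, not measured data.

```python
def assumed_timestamps_ms(num_frames, step_ms=100):
    """Timestamps under a hard-coded 100 ms (10 Hz) step."""
    return [i * step_ms for i in range(num_frames)]


def real_timestamps_ms(num_frames, fps):
    """Timestamps derived from the dataset frame rate, rounded to milliseconds."""
    return [int(round(i / fps * 1000)) for i in range(num_frames)]


frames = 300  # roughly 10 seconds recorded at 30 FPS
assumed_end = assumed_timestamps_ms(frames)[-1]
real_end = real_timestamps_ms(frames, fps=30)[-1]
stretch = assumed_end / real_end
```

With these numbers the assumed timeline ends at 29900 ms while the real one ends at 9967 ms, so every velocity or rate computed from the cache would be off by about a factor of three.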

In production, do not just create .vla files and forget where they came from. Write a manifest as well:

manifest.parquet
  vla_path
  source_repo_id
  source_dataset_version
  source_episode_index
  source_episode_length
  task
  robot_id
  converter_version
  created_at
  qa_status

This manifest lets you answer: which LeRobot release did this .vla file come from, which converter version produced it, has it passed QA, and which training runs may use it.
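
A minimal sketch of the manifest as typed rows, using the column names listed above. It writes CSV with the standard library so it runs anywhere; writing `manifest.parquet` would need a Parquet library instead. All values in the example rows, including the version strings and the `approved`/`rejected` labels, are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class ManifestRow:
    vla_path: str
    source_repo_id: str
    source_dataset_version: str
    source_episode_index: int
    source_episode_length: int
    task: str
    robot_id: str
    converter_version: str
    created_at: str
    qa_status: str


def write_manifest_csv(rows, stream):
    """Write rows with a header taken from the dataclass field order."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(ManifestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


def approved_paths(rows):
    """Only episodes that passed QA may be handed to a training run."""
    return [row.vla_path for row in rows if row.qa_status == "approved"]


# Hypothetical example rows.
common = dict(
    source_repo_id="vnrobo/g1-pick-bin-v1",
    source_dataset_version="v1.0",
    source_episode_length=300,
    task="pick the red cup and place it in the bin",
    robot_id="robot_g1_001",
    converter_version="0.1.0",
    created_at="2026-06-10T00:00:00Z",
)
rows = [
    ManifestRow(vla_path="episode_000123.vla", source_episode_index=123, qa_status="approved", **common),
    ManifestRow(vla_path="episode_000124.vla", source_episode_index=124, qa_status="rejected", **common),
]

buffer = io.StringIO()
write_manifest_csv(rows, buffer)
header = buffer.getvalue().splitlines()[0]
usable = approved_paths(rows)
```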

When to convert to `.vla`?

You do not need to convert every dataset immediately. Use the following table:

Question	Answer	Action
Has the dataset passed QA, and is the schema stable?	Yes	Create a `.vla` cache for training
Do other teams need to download or view the dataset?	Yes	Keep LeRobot as the primary release
Is training bottlenecked by video decoding or slow random access?	Yes	Benchmark a Robo-DM cache
Is the dataset just a 20-episode pilot?	Yes	No `.vla` needed yet; keep LeRobot and raw MCAP
Do you need to replay or audit a sensor fault?	Yes	Go back to MCAP; do not debug from `.vla`
Are you running many experiments on the same frozen dataset?	Yes	A `.vla` cache is worth the cost

A practical rule:

MCAP = source of truth
LeRobot = source of shareable training data
Robo-DM .vla = reproducible training cache

If you change the camera crop, action normalization, or feature names, create a new .vla cache instead of overwriting the old one. A training run needs to know which cache it used.
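
One simple way to enforce this is to derive the cache folder name from the converter version and the preprocessing settings, so a changed setting cannot land in an old folder. This is a sketch; `cache_dir_name`, the naming scheme, and the config keys are assumptions for illustration.

```python
import hashlib
import json


def cache_dir_name(dataset_name, converter_version, preprocessing):
    """Build a folder name that changes whenever the preprocessing changes."""
    # sort_keys makes the digest independent of dict key order.
    payload = json.dumps(preprocessing, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"{dataset_name}-conv{converter_version}-{digest}"


base = {"crop": [0, 0, 480, 640], "action_norm": "mean_std"}
reordered = {"action_norm": "mean_std", "crop": [0, 0, 480, 640]}
recropped = {"crop": [40, 0, 440, 640], "action_norm": "mean_std"}

name_base = cache_dir_name("g1-pick-bin-v1", "0.1.0", base)
name_reordered = cache_dir_name("g1-pick-bin-v1", "0.1.0", reordered)
name_recropped = cache_dir_name("g1-pick-bin-v1", "0.1.0", recropped)
```

Logging the resulting folder name with each training run is then enough to tell which cache the run consumed.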

Proposed two-tier pipeline

The data center pipeline for this article:

1. Record MCAP
2. Export synchronized episode frames
3. Write LeRobotDataset with LeRobotDataset.create()
4. Run metadata + visual QA
5. Publish or version LeRobot dataset
6. Convert approved episodes to Robo-DM .vla
7. Train from .vla cache
8. Log training run back to dataset version

Example command layout:

# Sharing layer
python tools/mcap_to_lerobot.py \
  --raw-root /data/raw-mcap/2026-06-10 \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --output-root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --fps 30

# QA before building the training cache
python tools/qa_lerobot_dataset.py \
  --repo-id vnrobo/g1-pick-bin-v1 \
  --root /data/lerobot/vnrobo/g1-pick-bin-v1 \
  --report reports/g1-pick-bin-v1-qa.html

# Training layer
python examples/lerobot/lerobot_to_robodm_ingestion.py \
  --dataset vnrobo/g1-pick-bin-v1 \
  --output_dir /data/robodm-cache/g1-pick-bin-v1 \
  --video_backend pyav

python examples/lerobot/run_pipeline.py \
  --robodm_data_dir /data/robodm-cache/g1-pick-bin-v1 \
  --training_steps 100000 \
  --batch_size 128

For beginners, the hardest part is not the converter code. It is deciding what is canonical. The answer should be simple: the canonical public dataset is LeRobot, the canonical raw evidence is MCAP, and .vla is an artifact that can be rebuilt.

Pre-training checklist

Before feeding .vla files into a large training run, check:

Item	How to check
Number of `.vla` files equals the number of approved episodes	Compare the manifest against `meta/episodes`
Timestamps strictly increase	Assert `timestamp[i] < timestamp[i+1]` within each trajectory
Every camera has enough frames	Each camera's frame count is close to the episode length
State/action have the right dtype	`float32`, fixed shape
Task labels are not empty	Join `source_episode_index` with `tasks.parquet`
The cache can be rebuilt	Delete one `.vla`, rebuild it from LeRobot, and compare metadata checksums
The training loader does not silently drop keys	Print the batch keys before training

If a checklist item fails, fix it at the lowest layer possible. If the fault is timestamp sync, fix the MCAP-to-LeRobot exporter. If it is video decoding, check the MP4 shards. If the .vla is missing a key, fix the LeRobot-to-Robo-DM converter.
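
Several checklist items are easy to automate. The sketch below works on plain Python values that you would extract from the manifest and from each trajectory; the helper names and the one-frame tolerance are assumptions, not part of Robo-DM or LeRobot.

```python
def timestamps_strictly_increasing(timestamps):
    """True when every timestamp is smaller than the next one."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


def camera_frames_ok(frame_counts, episode_length, tolerance=1):
    """Map each camera to whether its frame count is close to the episode length."""
    return {cam: abs(n - episode_length) <= tolerance for cam, n in frame_counts.items()}


def missing_episodes(manifest_episode_indices, approved_episode_indices):
    """Approved episodes that have no entry in the cache manifest."""
    return sorted(set(approved_episode_indices) - set(manifest_episode_indices))


def missing_batch_keys(batch, required_keys):
    """Keys the training loader was expected to deliver but did not."""
    return sorted(set(required_keys) - set(batch))


good_ts = timestamps_strictly_increasing([0, 33, 67, 100])
bad_ts = timestamps_strictly_increasing([0, 33, 33, 100])  # duplicated timestamp
cameras = camera_frames_ok({"head": 300, "left_wrist": 299, "right_wrist": 250}, episode_length=300)
not_cached = missing_episodes([123, 124], [123, 124, 125])
dropped = missing_batch_keys(
    {"action": None, "observation.state": None},
    ["action", "observation.state", "observation.images.head"],
)
```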

Technical sources

Nguyễn Anh Tuấn
