
Run TREAD: relabel robot data with VLMs

A practical guide to TREAD: segment robot trajectories, relabel subtasks with VLMs, and fine-tune stronger VLAs on LIBERO.

Nguyễn Anh Tuấn · June 11, 2026 · 14 min read

TREAD, short for Task Robustness via Re-Labelling Vision-Action Robot Data, is a practical answer to a common VLA bottleneck: instead of collecting more expensive robot demonstrations, use a Vision-Language Model (VLM) to cut long demonstrations into shorter subtasks, relabel each segment with grounded language, and fine-tune a policy on a mixture of original and relabeled data.

The original paper by Artur Kuramshin, Özgür Aslan, Cyrus Neary, and Glen Berseth appeared on arXiv on June 9, 2026, with an official project page and code at akuramshin/tread. The idea is easy to state, but the details matter. TREAD does not merely paraphrase instructions. The VLM sees the first frame, sees the trajectory video, generates semantic subtasks, identifies temporal boundaries, and then produces diverse instructions grounded in object properties and spatial relationships. On LIBERO, this extra data diversity improves Octo and π0-FAST robustness to unseen tasks, unseen scene-instruction pairings, and altered wording.

This guide walks through the paper idea, the architecture, installation, the LIBERO relabeling pipeline, HDF5 slicing, VLA fine-tuning, inference/evaluation, and how to debug the common failure modes. If you have already read the OpenVLA deep dive, think of TREAD as the data engine before the VLA: the model architecture still matters, but the language attached to robot trajectories often decides whether the policy can generalize.

Why relabel robot data at all?

Robot demonstrations are usually stored like this:

instruction: "Put the white mug on the plate and put the chocolate pudding to the left of the plate"
trajectory:  image_0, state_0, action_0
             image_1, state_1, action_1
             ...
             image_T, state_T, action_T

The problem is that one instruction covers the whole long-horizon sequence. During imitation learning, every action in the middle of the trajectory is conditioned on the same broad sentence. A task like the example above actually contains several goals: grasp the mug, lift it, move it to the plate, place it, grasp the pudding, move the pudding, and place it left of the plate. If the policy is already holding the mug but the text still describes the whole mug-and-pudding task, the supervision is noisy.

TREAD fixes this by converting one long trajectory into several shorter language-action pairs:

Full demo:
  "Put the white mug on the plate and put the pudding left of the plate"

TREAD segments:
  00:00-00:02  "Grasp the white mug"
  00:02-00:04  "Lift the white mug"
  00:04-00:07  "Move the white mug to the plate"
  00:07-00:09  "Place the white mug on the plate"
  00:09-00:11  "Grasp the chocolate pudding"
  00:11-00:14  "Move the chocolate pudding left of the plate"
  00:14-00:16  "Place the chocolate pudding on the table"

The important part: these segments still use the real actions from the original demonstration. TREAD does not hallucinate new control trajectories, require fresh teleoperation, or ask a human annotator to mark every frame. The VLM acts as a semantic parser, temporal annotator, and language augmenter.

TREAD architecture in one diagram

TREAD has three main stages, matching the project page and the paper:

                 original LIBERO / Bridge demonstrations
                                  |
                                  v
        +----------------- semantic decomposition -----------------+
        | input: first frame + original instruction                 |
        | output: ordered subtask list                              |
        +----------------------------------------------------------+
                                  |
                                  v
        +------------------- motion segmentation ------------------+
        | input: full trajectory video + subtask list              |
        | output: time_range for each subtask                      |
        +----------------------------------------------------------+
                                  |
                                  v
        +---------------- grounded language diversity -------------+
        | input: first frame of segment + subtask instruction      |
        | output: paraphrases with object attributes/spatial cues  |
        +----------------------------------------------------------+
                                  |
                                  v
           sliced HDF5 subtask datasets + original full demos
                                  |
                                  v
                     fine-tune Octo / pi0-FAST / VLA

In the paper, the authors use Gemini 2.5 Pro as the default VLM. The public repository is structured around provider-neutral prompts and dataset adapters, but vlm_client.py currently includes Gemini support out of the box. If you want to use Qwen2.5-VL, GPT-4o, Claude vision, or an internal multimodal model, implement the VLMClient interface so it converts VLMMessage objects into your provider's API call and returns plain text.
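To make the adapter idea concrete, here is a minimal sketch of a provider-neutral client. The names `SketchMessage`, `SketchVLMClient`, and `EchoClient`, and all of their fields, are hypothetical stand-ins and are not taken from the repository. Check vlm_client.py for the real VLMClient and VLMMessage signatures before adapting anything.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SketchMessage:
    """Hypothetical stand-in for the repo's VLMMessage (fields are assumed)."""
    role: str                                             # e.g. "system" or "user"
    text: str                                             # prompt text
    media_paths: list[str] = field(default_factory=list)  # frames or videos


class SketchVLMClient(ABC):
    """Hypothetical provider-neutral interface: messages in, plain text out."""

    @abstractmethod
    def generate(self, messages: list[SketchMessage], temperature: float = 0.0) -> str:
        ...


class EchoClient(SketchVLMClient):
    """Offline dummy provider for exercising a pipeline without API calls."""

    def generate(self, messages, temperature=0.0):
        # A real adapter would translate `messages` into the provider's request
        # format here, send it, and return only the response text.
        user_text = " ".join(m.text for m in messages if m.role == "user")
        return f"[echo] {user_text}"


client = EchoClient()
reply = client.generate([SketchMessage("user", "List the subtasks.", ["frame_0.png"])])
print(reply)
```

An offline dummy like this is handy for checking prompts and file handling before spending money on real VLM calls.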

What exactly did the paper test?

TREAD is evaluated on LIBERO, a simulated manipulation benchmark. LIBERO contains suites designed around distribution shifts in object type, spatial arrangement, task goal, and long-horizon composition. The TREAD experiments focus on LIBERO-100, where each original task has 50 human-teleoperated demonstrations. Because of resource limits, the authors label five demonstrations per task and omit STUDY_SCENE tasks, for a total of 570 trajectories.

Key setup:

| Component | Paper choice | Practical meaning |
|---|---|---|
| VLM | Gemini 2.5 Pro | Generates subtasks, timecodes, paraphrases |
| Dataset | LIBERO-100 subset | Source image-action trajectories |
| Policy | Octo-Small 1.5, π0-FAST | BC/VLA models for fine-tuning |
| Augmentation | decomposition + diverse labels | Splits trajectories and diversifies language |
| Evaluation | Motion Generalization, Language Generalization, LIBERO-10 | Measures new-task and new-wording robustness |

The most useful part of the experiment is the ablation over three mixtures:

  1. Original Fine-tuned: fine-tune only on original trajectories.
  2. TREAD w/o diverse labels: mix original trajectories with decomposed sub-trajectories but no language paraphrase diversity.
  3. TREAD: mix original trajectories with decomposed and linguistically enriched sub-trajectories.

That separation tells us what each part contributes. Trajectory decomposition helps motion and planning generalization. Grounded paraphrasing helps language-conditioned policy generalization.

Main LIBERO results

The paper reports success rates across several cases. These are the numbers to remember:

| Test case | Metric | π0-FAST original | π0-FAST TREAD | Octo original | Octo TREAD |
|---|---|---|---|---|---|
| Language Generalization | Single Goal SR | 47% | 67% | 82% | 91% |
| Language Generalization | 2 of 2 SR | 36% | 39% | 30% | 31% |
| Motion Generalization | Single Goal SR | 28% | 34% | 7% | 22% |
| Motion Generalization | 1 of 2 SR | 73% | 82% | 13% | 43% |
| LIBERO-10 | 2 of 2 SR | 57% | 57% | 40% | 38% |
| Average | SR | 49% | 53% | 41% | 47% |

The right interpretation is not "TREAD solves LIBERO." Full two-goal success remains hard, especially for Motion Generalization 2 of 2, where both Octo and π0-FAST still struggle to complete the whole long-horizon task. The more convincing result is that TREAD improves single-goal and partial completion, especially for Octo in new environments. That is exactly what we would expect if shorter skill-level supervision helps the model recombine familiar motions in unfamiliar scenes.

Another important result is that in-distribution LIBERO-10 does not collapse. TREAD keeps long-horizon performance comparable to original fine-tuning, so the augmentation is not simply trading away original benchmark behavior for custom test cases.
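To see where the gains concentrate, it helps to compute the per-row differences. The snippet below only restates the success rates from the table above and subtracts them; the dictionary layout is my own choice.

```python
# Success rates (%) copied from the results table: (original, TREAD) per model.
results = {
    ("Language Generalization", "Single Goal"): {"pi0-FAST": (47, 67), "Octo": (82, 91)},
    ("Language Generalization", "2 of 2"): {"pi0-FAST": (36, 39), "Octo": (30, 31)},
    ("Motion Generalization", "Single Goal"): {"pi0-FAST": (28, 34), "Octo": (7, 22)},
    ("Motion Generalization", "1 of 2"): {"pi0-FAST": (73, 82), "Octo": (13, 43)},
    ("LIBERO-10", "2 of 2"): {"pi0-FAST": (57, 57), "Octo": (40, 38)},
}

deltas = {}
for (case, metric), per_model in results.items():
    for model, (orig, tread) in per_model.items():
        # Difference in percentage points, not relative improvement.
        deltas[(case, metric, model)] = tread - orig
        print(f"{case:24s} {metric:12s} {model:9s} "
              f"{orig:3d}% -> {tread:3d}%  ({tread - orig:+d} pts)")
```

The largest jump is Octo on Motion Generalization 1 of 2 (+30 points). The only drop is Octo on LIBERO-10 (-2 points), and π0-FAST is unchanged there.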


Environment setup

The TREAD repo is intentionally small. The public requirements.txt includes opencv-python, google-genai, h5py, and numpy. For an end-to-end LIBERO experiment, you also need a compatible LIBERO/robomimic/Octo environment. I recommend separating the relabeling environment from the policy training environment:

# Env 1: VLM relabeling and dataset slicing
conda create -n tread python=3.10 -y
conda activate tread
git clone https://github.com/akuramshin/tread.git
cd tread
pip install -r requirements.txt
export GEMINI_API_KEY="your-gemini-api-key"

# Env 2: policy fine-tuning/evaluation, depending on your stack
conda create -n libero-octo python=3.10 -y
conda activate libero-octo
# install LIBERO, robomimic, Octo, or your VLA repo here

If you only want to inspect the pipeline, Env 1 is enough. To fine-tune and evaluate a policy, Env 2 must be able to load LIBERO/robomimic-style HDF5 files and train your model.

Step 1: create trajectory videos from LIBERO

TREAD consumes .mp4 trajectory videos. Each video needs an instruction, scene, and demo id. The easiest structure is:

data/libero_videos/
  trajectory_000.mp4
  trajectory_001.mp4
  metadata.json

Use a metadata.json file like this:

{
  "trajectory_000.mp4": {
    "task": "Open the top drawer of the cabinet.",
    "scene": "KITCHEN_SCENE4",
    "demo_id": 0
  },
  "trajectory_001.mp4": {
    "task": "Put the white mug on the plate.",
    "scene": "LIVING_ROOM_SCENE2",
    "demo_id": 1
  }
}

Without metadata, the repo parses LIBERO filenames using the convention SCENE_task_words_demo_<id>.mp4. For beginners, metadata is safer because instructions often contain long object names, numbers, or punctuation that make filename parsing brittle.
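To illustrate why filename parsing is brittle, here is a rough sketch of a parser for that naming convention. The regular expression and the function name `parse_libero_filename` are my own assumptions, and the repo's actual parser may behave differently.

```python
import re

# Assumed pattern: upper-case scene tokens, lower-case task words, then the demo id.
FILENAME = re.compile(
    r"^(?P<scene>[A-Z0-9]+(?:_[A-Z0-9]+)*)"
    r"_(?P<task>[a-z0-9_]+?)"
    r"_demo_(?P<demo_id>\d+)\.mp4$"
)


def parse_libero_filename(name: str):
    """Return (scene, task, demo_id), or None when the name does not fit the pattern."""
    match = FILENAME.match(name)
    if match is None:
        return None
    task = match.group("task").replace("_", " ")
    return match.group("scene"), task, int(match.group("demo_id"))


good = parse_libero_filename("KITCHEN_SCENE4_open_the_top_drawer_of_the_cabinet_demo_0.mp4")
bad = parse_libero_filename("KITCHEN_SCENE4_Open_the_top-drawer_demo_0.mp4")
print(good)
print(bad)
```

Even in the successful case, the recovered task has lost its capitalization and final period. The second name fails outright because of the capital letter and the hyphen. A metadata.json entry avoids both problems.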

Before calling a VLM, check:

| Check | Why it matters |
|---|---|
| Video clearly shows gripper and objects | The VLM must see motion boundaries |
| FPS matches the original HDF5 | create_dataset.py converts timecodes to frames |
| Instruction matches the demonstration | Wrong input text creates wrong semantic plans |
| demo_id matches demo_<id> in HDF5 | Slicing skips demos when ids do not match |
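Some of these checks can be automated before any API call is made. The sketch below covers only the metadata side (files exist, fields are present, demo_id is an integer). `check_metadata` is a hypothetical helper of mine, not part of the repo, and it does not inspect video content or FPS.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("task", "scene", "demo_id")


def check_metadata(video_dir: str) -> list[str]:
    """Return a list of human-readable problems found in metadata.json."""
    root = Path(video_dir)
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    problems = []
    for name, entry in metadata.items():
        if not (root / name).is_file():
            problems.append(f"{name}: video file is missing")
        for key in REQUIRED_FIELDS:
            if key not in entry:
                problems.append(f"{name}: missing field '{key}'")
        if "demo_id" in entry and not isinstance(entry["demo_id"], int):
            problems.append(f"{name}: demo_id should be an integer")
    # Videos on disk that metadata.json does not describe.
    for video in sorted(root.glob("*.mp4")):
        if video.name not in metadata:
            problems.append(f"{video.name}: no metadata entry")
    return problems
```

Calling `check_metadata("data/libero_videos")` and printing the result before Step 2 is a cheap way to catch mismatches that would otherwise surface much later, during slicing.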

Step 2: run semantic decomposition

The first stage generates an ordered subtask list from the first frame and original instruction:

python semantic_segmentation.py \
  --input-dir data/libero_videos \
  --output-dir labels/libero \
  --provider gemini \
  --model gemini-2.5-pro \
  --temperature 0.0 \
  --api-key-env GEMINI_API_KEY \
  --max-retries 2

The main output is:

labels/libero/dataset_subtask_labels.json

Inspect a few entries before moving on:

{
  "trajectory_000.mp4_0": {
    "task": "Put the white mug on the plate and put the chocolate pudding to the left of the plate.",
    "semantic_subtasks": [
      "Grasp the white mug",
      "Lift the white mug",
      "Move the white mug to the plate",
      "Place the white mug on the plate",
      "Grasp the chocolate pudding",
      "Move the chocolate pudding to the left of the plate",
      "Place the chocolate pudding on the table"
    ]
  }
}

The scripts are resumable. If an output JSON already contains a processed trajectory, the next run skips it. That matters because VLM calls cost money and can hit rate limits.
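The resume behaviour follows a simple pattern: load the existing output, skip keys that are already present, and save after every new result. The function below is a minimal sketch of that pattern with a hypothetical `label_fn` callback. It is not the repo's implementation.

```python
import json
from pathlib import Path


def run_resumable(keys, label_fn, output_path):
    """Label each key once, saving after every call so a rerun can resume."""
    out = Path(output_path)
    done = json.loads(out.read_text(encoding="utf-8")) if out.exists() else {}
    for key in keys:
        if key in done:
            continue  # labelled in a previous run: no new VLM call
        done[key] = label_fn(key)
        # Checkpoint immediately so a crash or rate limit loses at most one call.
        out.write_text(json.dumps(done, indent=2), encoding="utf-8")
    return done
```

Saving after every call costs a little disk I/O but means an interrupted run never has to repeat paid requests.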

Step 3: run motion segmentation

The second stage uses the full video and subtask list to find time ranges:

python motion_segmentation.py \
  --input-dir data/libero_videos \
  --subtask-labels labels/libero/dataset_subtask_labels.json \
  --output-dir labels/libero \
  --provider gemini \
  --model gemini-2.5-pro \
  --temperature 0.0 \
  --api-key-env GEMINI_API_KEY \
  --max-retries 2

The output is:

labels/libero/dataset_motion_labels.json

A typical entry looks like:

{
  "trajectory_000.mp4_0": {
    "demo_id": 0,
    "motion_labels": [
      {
        "sub_task": "Grasp the white mug",
        "time_range": "00:00 - 00:02"
      },
      {
        "sub_task": "Place the white mug on the plate",
        "time_range": "00:07 - 00:09"
      }
    ],
    "response": "full VLM response"
  }
}

This is the most fragile step. If a time range is too short, the HDF5 segment may not contain enough frames for policy learning. If it is too long, the segment again contains multiple skills. A practical validation script should overlay labels on the video or read dataset_motion_labels.json, convert start/end times to frames, and display the first and last image for each segment.
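The timecode-to-frame part of such a validation script might look like the sketch below. It assumes the "MM:SS - MM:SS" format shown above and a half-open frame range. The `min_frames` threshold is arbitrary, and create_dataset.py may round or clip differently.

```python
import re

TIME_RANGE = re.compile(r"^\s*(\d+):(\d{2})\s*-\s*(\d+):(\d{2})\s*$")


def time_range_to_frames(time_range: str, fps: int, num_frames: int) -> tuple[int, int]:
    """Convert "MM:SS - MM:SS" into a [start, end) frame slice, clipped to the demo."""
    match = TIME_RANGE.match(time_range)
    if match is None:
        raise ValueError(f"unrecognised time_range: {time_range!r}")
    m0, s0, m1, s1 = (int(g) for g in match.groups())
    start = (m0 * 60 + s0) * fps
    end = min((m1 * 60 + s1) * fps, num_frames)
    if start >= end:
        raise ValueError(f"empty segment for {time_range!r} at {fps} fps")
    return start, end


def short_segments(motion_labels, fps, num_frames, min_frames=5):
    """Return (sub_task, start, end) for segments shorter than min_frames."""
    flagged = []
    for label in motion_labels:
        start, end = time_range_to_frames(label["time_range"], fps, num_frames)
        if end - start < min_frames:
            flagged.append((label["sub_task"], start, end))
    return flagged


print(time_range_to_frames("00:07 - 00:09", fps=10, num_frames=160))  # (70, 90)
```

Because timecodes have one-second resolution, a wrong `fps` shifts every boundary proportionally, which is why the FPS check in Step 1 matters.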

Step 4: generate grounded paraphrases

TREAD does not stop at trajectory decomposition. The third stage creates visually grounded paraphrases:

python language_paraphrasing.py \
  --dataset libero \
  --input-dir data/libero_videos \
  --subtask-labels labels/libero/dataset_motion_labels.json \
  --output-dir labels/libero

Then sample those paraphrases into the motion labels:

python augment_instructions.py \
  --input-dir labels/libero \
  --task-relabel-prob 0.25 \
  --subtask-relabel-prob 0.5 \
  --seed 0

--task-relabel-prob controls how often task-level instructions are replaced with a paraphrase; --subtask-relabel-prob does the same for subtask labels. Do not set both to 1.0 at the start: the policy still needs exposure to the canonical instructions so that it stays aligned with the benchmark wording. The paper studies several mixture ratios and selects them with Re-Mix; for small projects, the README defaults are a reasonable first run.
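The effect of a relabel probability can be pictured with a few lines of code. This is an illustration of the sampling idea only: `sample_instruction` and the example paraphrases are made up, and augment_instructions.py may sample differently.

```python
import random


def sample_instruction(canonical: str, paraphrases: list[str],
                       relabel_prob: float, rng: random.Random) -> str:
    """Keep the canonical text, or swap in a random paraphrase with prob. relabel_prob."""
    if paraphrases and rng.random() < relabel_prob:
        return rng.choice(paraphrases)
    return canonical


rng = random.Random(0)  # fixed seed, in the spirit of --seed 0
paraphrases = ["Pick up the white cup", "Grab the white mug next to the plate"]
samples = [sample_instruction("Grasp the white mug", paraphrases, 0.5, rng)
           for _ in range(1000)]
kept = samples.count("Grasp the white mug")
print(f"canonical instruction kept in {kept / 10:.1f}% of samples")
```

With a probability of 0.5, roughly half of the samples keep the canonical wording. At 1.0 the canonical wording would never appear in the relabeled set.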

Step 5: slice LIBERO HDF5 into subtask datasets

Once you have timecodes, use create_dataset.py to slice the original HDF5:

python create_dataset.py \
  --motion-segmentations labels/libero/dataset_motion_labels.json \
  --hdf5-dir data/libero_hdf5 \
  --output-dir data/libero_subtasks \
  --fps 10 \
  --image-size 256 256 \
  --name-by-label

The script assumes a LIBERO/robomimic-style schema. It reads actions, rewards, states, robot_states, dones, and observation keys such as agentview_rgb, eye_in_hand_rgb, gripper_states, joint_states, and ee_states. For each subtask, it creates a new HDF5 file, marks the final frame with reward/done, resizes images, and writes the new instruction.
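The core of the slicing step is small. Here is a simplified sketch on plain Python lists instead of HDF5 datasets: it cuts one frame range out of a demo, marks the final step as terminal, and attaches the new instruction. The dictionary layout and `slice_segment` are illustrative assumptions. The real script also resizes images and writes HDF5.

```python
def slice_segment(demo: dict, start: int, end: int, instruction: str) -> dict:
    """Cut frames [start, end) out of a demo and mark the last frame as terminal."""
    end = min(end, len(demo["actions"]))  # never read past the end of the demo
    length = end - start
    if length <= 0:
        raise ValueError("segment must contain at least one frame")
    segment = {key: list(demo[key][start:end]) for key in ("actions", "states")}
    segment["obs"] = {name: list(frames[start:end])
                      for name, frames in demo["obs"].items()}
    # The sub-trajectory ends when the subtask ends, so only the final step
    # carries reward 1 and done 1.
    segment["rewards"] = [0] * (length - 1) + [1]
    segment["dones"] = [0] * (length - 1) + [1]
    segment["instruction"] = instruction
    return segment


# Toy demo with 160 steps; real data would hold arrays read from HDF5.
demo = {
    "actions": [[0.0] * 7 for _ in range(160)],
    "states": [[0.0] * 9 for _ in range(160)],
    "obs": {"agentview_rgb": [f"frame_{i}" for i in range(160)]},
}
seg = slice_segment(demo, 70, 90, "Place the white mug on the plate")
print(len(seg["actions"]), seg["rewards"][-3:], seg["obs"]["agentview_rgb"][0])
```

Note that the actions are copied unchanged. Only the language, reward, and done signals differ from the original demonstration.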

One easy-to-miss detail: the script swaps left/right words by default to match LIBERO camera conventions. If your dataset does not need this correction, run:

python create_dataset.py \
  --motion-segmentations labels/libero/dataset_motion_labels.json \
  --hdf5-dir data/libero_hdf5 \
  --output-dir data/libero_subtasks \
  --fps 10 \
  --image-size 256 256 \
  --no-swap-left-right
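If you need the same correction elsewhere in your own tooling, a word-level swap could be sketched as follows. This is my own minimal version, not the repo's code. It swaps in a single regex pass, because two sequential string replacements would undo each other.

```python
import re

_SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", flags=re.IGNORECASE)


def swap_left_right(text: str) -> str:
    """Swap whole-word 'left' and 'right' in one pass, preserving capitalisation."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _PATTERN.sub(replace, text)


print(swap_left_right("Move the chocolate pudding to the left of the plate"))
# Move the chocolate pudding to the right of the plate
```

A purely lexical swap also flips non-spatial uses such as "right away", so it is worth spot-checking the relabeled instructions.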

Step 6: fine-tune a VLA on the mixture

TREAD is model-agnostic. The paper uses Octo-Small 1.5 and π0-FAST. For Octo, the authors fine-tune for 50,000 steps with batch size 256 and a warmup plus cosine decay schedule, then report the step 30,000 checkpoint. For π0-FAST, they full fine-tune for 30,000 steps with batch size 32 and report the step 15,000 checkpoint.
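For readers unfamiliar with the schedule, linear warmup followed by cosine decay can be written in a few lines. The peak learning rate (3e-4) and warmup length (1,000 steps) below are placeholder values of mine, since this guide does not state the ones used in the paper. Only the 50,000-step horizon comes from the text above.

```python
import math


def warmup_cosine(step: int, peak_lr: float, warmup_steps: int,
                  total_steps: int, end_lr: float = 0.0) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay down to end_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters (assumptions, not values from the paper).
for step in (0, 1_000, 30_000, 50_000):
    lr = warmup_cosine(step, peak_lr=3e-4, warmup_steps=1_000, total_steps=50_000)
    print(f"step {step:6d}: lr = {lr:.2e}")
```

One practical consequence: a checkpoint taken at step 30,000 of a 50,000-step run is evaluated while the learning rate is still well above its final value.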

A minimal Octo-style dataset config looks like:

dataset_mix:
  original_libero:
    path: data/libero_hdf5
    weight: 1.0
  tread_subtasks:
    path: data/libero_subtasks
    weight: 1.1

training:
  batch_size: 256
  max_steps: 50000
  lr_schedule: warmup_cosine
  checkpoint_every: 5000
  eval_checkpoints: [30000, 50000]

If you are using OpenVLA, VLA-Adapter, or a LoRA fine-tuning stack instead of Octo, the data mapping is still the same:

observation images + proprioception + relabeled instruction
    -> VLA tokenizer / processor
    -> action head
    -> supervised action loss on segment actions

The key rule is: do not train only on subtasks. Mix original full trajectories with TREAD segments. Original data teaches long-horizon instruction following. TREAD data teaches shorter skills and language variations.
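To get a feel for what the mixture weights mean, the sketch below normalizes the two weights from the config above into sampling probabilities and draws from them. It assumes the weights act directly as per-sample source probabilities. Your data loader may also scale by dataset size, so treat this as an approximation.

```python
import random
from collections import Counter

weights = {"original_libero": 1.0, "tread_subtasks": 1.1}  # from the config above

total = sum(weights.values())
probabilities = {name: w / total for name, w in weights.items()}
print(probabilities)

# Simulate which source each of 10,000 training samples would come from.
rng = random.Random(0)
names = list(weights)
draws = rng.choices(names, weights=[weights[n] for n in names], k=10_000)
counts = Counter(draws)
print(counts)
```

With these weights, about 47.6% of samples come from the original demonstrations and about 52.4% from TREAD segments, so the original full-length data stays close to half of every batch.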

Step 7: inference and evaluation

The repository includes an evaluation_pipeline/ folder for LIBERO rollouts with Octo checkpoints. The README is explicit that this part is optional and requires Octo/LIBERO branches with compatible APIs. Configure conf/config.yaml or override values on the command line:

finetuned_path: /path/to/checkpoints
dataset_statistics_path: /path/to/dataset_statistics.json
test_case_files:
  - /path/to/language_generalization.json
  - /path/to/motion_generalization.json
checkpoints:
  - name: "original"
    path: "${finetuned_path}/original"
    model_step: 30000
  - name: "tread"
    path: "${finetuned_path}/tread"
    model_step: 30000

Run rollouts:

python evaluation_pipeline/src/eval_libero.py \
  finetuned_path=/path/to/checkpoints \
  dataset_statistics_path=/path/to/dataset_statistics.json

Visualize results:

python evaluation_pipeline/src/visualize_results.py \
  output_dir=evaluation_pipeline/results

For your own evaluation, separate at least three buckets: original tasks, paraphrased instructions, and new scene-instruction pairings. If you only test original tasks, you will miss TREAD's main purpose: robustness under language and motion composition shifts.
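Keeping the buckets separate is mostly a bookkeeping matter. Here is a small sketch that reports a success rate per bucket rather than one pooled number. The rollout records are invented for illustration and `success_by_bucket` is a hypothetical helper, not part of the evaluation pipeline.

```python
from collections import defaultdict

# Hypothetical rollout records: (bucket, success). Values are made up.
rollouts = [
    ("original", True), ("original", True), ("original", False),
    ("paraphrased", True), ("paraphrased", False),
    ("new_pairing", False), ("new_pairing", True), ("new_pairing", False),
]


def success_by_bucket(records):
    """Return {bucket: success rate in [0, 1]} so buckets are never pooled."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [successes, trials]
    for bucket, success in records:
        counts[bucket][0] += int(success)
        counts[bucket][1] += 1
    return {bucket: wins / trials for bucket, (wins, trials) in counts.items()}


rates = success_by_bucket(rollouts)
for bucket, rate in rates.items():
    print(f"{bucket:12s} {rate:.0%}")
```

A pooled average would hide exactly the difference you are trying to measure, since a policy can do well on original tasks while failing on new pairings.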

Beginner debug checklist

| Symptom | Common cause | Fix |
|---|---|---|
| create_dataset.py reports missing demo_<id> | metadata.json demo_id does not match HDF5 | Open the HDF5 and inspect groups under /data |
| Empty segment | Wrong timecode or wrong FPS | Check --fps, then inspect start/end frames |
| Subtask labels are too vague | First frame lacks context or video is unclear | Inspect the first frame and add scene metadata |
| Good subtask success, poor long-horizon success | Mixture is too biased toward subtasks | Increase the weight of original trajectories |
| Evaluation script fails | Octo/LIBERO APIs do not match | Pin compatible branches or adjust the wrapper |

When should you use TREAD?

TREAD is a good fit when:

  • You have long demonstrations but coarse instructions.
  • You want more language diversity without hiring annotators.
  • You are training a VLA or behavior cloning policy on LIBERO, BridgeData, or similar HDF5 datasets.
  • You want better robustness to wording changes such as "close the drawer" instead of a longer original command.
  • You need to diagnose whether a policy fails because of language understanding or motion skill.

TREAD is not a universal fix. If the camera does not see the object, the VLM cannot reliably segment it. If the original actions are poor, better labels will not remove control noise. If the task depends on force feedback or contact-rich dexterity while the observation is RGB-only, subtask labels will not teach missing dynamics. TREAD is strongest when the bottleneck is semantic supervision, not actuators, simulators, or low-level controllers.

How TREAD fits with other VLA pipelines

TREAD pairs naturally with pipelines like VLA-Adapter on LIBERO and VLA-0 action-as-text. VLA-Adapter focuses on a smaller architecture and efficient training. VLA-0 explores action as text/token prediction. TREAD sits before both and improves the dataset through relabeling. For a small team, the practical sequence is:

1. Train/fine-tune a baseline on original data
2. Run TREAD to create subtasks and paraphrases
3. Fine-tune again on the mixture
4. Compare language generalization and motion generalization
5. Scale model size or collect more data only if performance is still lacking

Real robot data is expensive. TREAD is a reminder to first ask whether the data you already have is described well enough.
