VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

Multitask DiT Policy on LeRobot v0.5: One Model, Many Tasks

Hands-on guide to LeRobot v0.5 Multitask DiT Policy: one model for many tasks, CLIP text-conditioning, open-source on HuggingFace, deployed on SO-100/SO-101.

Nguyễn Anh Tuấn · May 18, 2026 · 9 min read · Updated: Jun 14, 2026
Multitask DiT Policy on LeRobot v0.5: One Model, Many Tasks with CLIP Text-Conditioning

In March 2026, HuggingFace shipped LeRobot v0.5.0, the largest release to date, with over 200 merged PRs. One addition that got less attention than Pi0-FAST or Wall-X, but matters a lot to hobbyist roboticists, is the Multitask Diffusion Transformer (DiT) Policy. It is a new-generation diffusion policy with only ~450M parameters that competes with multi-billion-parameter VLAs on dexterity. More importantly, a single trained model handles many tasks, switching between them via natural-language commands routed through a frozen CLIP text encoder.

This guide walks through the core idea (why DiT, why multitask, why CLIP), the architecture, real installation on SO-100/SO-101, a full training command, common failure modes, and the LIBERO benchmark numbers.

References: official Multitask DiT docs, LeRobot v0.5 release blog, original TRI LBM paper, and Bryson Jones's detailed dissection.

Robot arm performing manipulation task

1. Why a Multitask DiT Policy?

The limits of the original Diffusion Policy

The 2023 Diffusion Policy showed that treating robot actions as a sequence to be denoised yields smoother, multi-modal, more robust behavior than vanilla behavior cloning. But the original implementation used a 1D U-Net: a backbone built on local convolutions, hard to scale past a few hundred million parameters, and with no natural place to inject language.

When you want one robot to perform 5–10 different tasks on the same hardware (pick the red cube, drop it in the bowl, wipe the table, close the drawer, pour water…), the old approach was to train 5–10 separate models. That wastes data and VRAM, and switching tasks means reloading weights, which is not production-grade.

How Multitask DiT fixes this

Multitask DiT replaces the U-Net with a Diffusion Transformer (DiT — the same family that powers Stable Diffusion 3 and Sora) and wires in two conditioning streams:

  • CLIP Vision Encoder processes RGB from multiple cameras (wrist + overhead).
  • CLIP Text Encoder embeds the task command ("pick up the red cube"). The text embedding is injected into every transformer block via cross-attention.

The result: a single ~450M-parameter model can learn dozens of tasks, switching between them at inference time by just changing the text prompt. This is the same spirit as the TRI Large Behavior Model and the recent Boston Dynamics Atlas LBM blog, now officially ported into LeRobot.
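To make "injected via cross-attention" concrete, here is a minimal pure-Python sketch, assuming a single attention head with identity projections. The 2-D embeddings are made up for illustration and are not real CLIP outputs; an actual DiT block would add learned Q/K/V projections, multiple heads, and residual connections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(action_tokens, text_tokens):
    """Single-head cross-attention with identity projections.

    Queries come from the action tokens; keys and values come from the
    text tokens. Learned projections and residuals are omitted here.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in action_tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, text_tokens)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-D embeddings (hypothetical values, not real CLIP features).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
action_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

outputs, weights = cross_attention(action_tokens, text_tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each action token ends up as a weighted mix of the text tokens, so changing the prompt changes what every action token "reads" without touching the model weights.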

2. The architecture, in detail

                    ┌──────────────────────────┐
   "Pick up red ─→  │  CLIP Text Encoder       │ ──┐
    cube"           │ (frozen + learnable proj)│   │
                    └──────────────────────────┘   │
                                                   ▼
   RGB images   ─→  CLIP Vision Encoder  ──→  ┌──────────────────┐
   (wrist+top)      (lr_mult = 0.1)           │  DiT Backbone    │ ──→ Predicted
                                              │  (6–8 layers,    │     action chunk
   Action noise ──→ Action token embedder ──→ │   hidden=512–768)│     (horizon=32)
   z_t                                        └──────────────────┘
                          ▲
                          │
                  Diffusion timestep t
                  or Flow-matching t ∈ [0,1]

A few notable design choices:

  • Two objectives, one model: --policy.objective=diffusion (DDPM/DDIM, default) or --policy.objective=flow_matching (à la Boston Dynamics). Flow matching sometimes yields smoother actions but it is not a silver bullet.
  • RoPE positional encoding by default for the action sequence — better than absolute PE for this use case.
  • Vision encoder learning rate is only 0.1× the backbone LR: CLIP is already well pretrained, and fine-tuning it aggressively degrades its features.
  • Horizon = 32 ≈ 1.07 s at 30 Hz: the policy predicts 32 steps ahead but executes only the first 24 (0.8 s) before replanning.
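The two objectives in the first bullet differ mainly in how the noisy input and the regression target are built. The sketch below illustrates that difference under stated assumptions: epsilon prediction for diffusion, and a linear noise-to-data path for flow matching with t=0 as pure noise and t=1 as the clean action. The function names and the action vector are hypothetical; LeRobot's internal conventions may differ.

```python
import math
import random

def diffusion_example(action, alpha_bar, rng):
    """DDPM-style training pair with epsilon prediction.

    Noisy input: sqrt(alpha_bar) * action + sqrt(1 - alpha_bar) * noise.
    Regression target: the noise itself.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * x + b * n for x, n in zip(action, noise)]
    return noisy, noise

def flow_matching_example(action, t, rng):
    """Linear-path flow-matching training pair.

    Assumed convention: t=0 is pure noise, t=1 is the clean action.
    Noisy input: (1 - t) * noise + t * action.
    Regression target: the constant velocity (action - noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [(1.0 - t) * n + t * x for x, n in zip(action, noise)]
    velocity = [x - n for x, n in zip(action, noise)]
    return noisy, velocity

rng = random.Random(0)
action = [0.10, -0.25, 0.40]  # one made-up action vector

noisy_d, target_d = diffusion_example(action, alpha_bar=0.5, rng=rng)
noisy_f, target_f = flow_matching_example(action, t=0.25, rng=rng)

# Following the true velocity for the remaining time (1 - t) lands on the
# clean action, because the path is a straight line.
recovered = [x + (1.0 - 0.25) * v for x, v in zip(noisy_f, target_f)]
print([round(r, 6) for r in recovered])
```

With a straight-line path the sampler integrates a nearly constant velocity, which is one intuition for why flow matching can produce smooth trajectories in few steps.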

3. Installation

Minimum training hardware is a 16 GB-VRAM GPU; RTX 4090 / A100 / H100 are recommended for batch sizes 256–320. Inference needs at least an RTX 5070 Ti to run real-time at 30 Hz.

# 1. Create a Python 3.12+ env (required by LeRobot v0.5)
conda create -n lerobot python=3.12 -y
conda activate lerobot

# 2. Install LeRobot with the multi_task_dit extras
pip install "lerobot[multi_task_dit]"

# 3. Verify
python -c "from lerobot.policies.multi_task_dit import MultiTaskDiTPolicy; print('OK')"

The [multi_task_dit] extras pull in transformers>=5.0 so CLIP (vision + text) loads correctly.
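If the verification import fails, a quick preflight check can narrow down whether the Python version or a missing package is the cause. This is a small sketch using only the standard library; the package list is an assumption about what the extras install, so adjust it to your environment.

```python
import importlib.util
import sys
from importlib import metadata

def preflight(min_python=(3, 12), packages=("lerobot", "transformers", "torch")):
    """Report the Python version and which top-level packages are installed.

    Looks up package specs and metadata only; the packages themselves are
    not imported.
    """
    report = {"python_ok": sys.version_info[:2] >= min_python, "packages": {}}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            report["packages"][name] = None
            continue
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = "unknown"
    return report

report = preflight()
print("Python >= 3.12:", report["python_ok"])
for name, version in report["packages"].items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```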

4. Preparing a dataset for SO-100/SO-101

LeRobot v0.5 consolidated the SO-100 and SO-101 codebase — you use one shared interface. To train multitask, every episode in the dataset must carry a task description.

When recording teleop data, give a clear and varied task description:

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --dataset.repo_id=YOUR_USER/so101_multitask \
  --dataset.single_task="pick up the red cube and place it in the bowl" \
  --dataset.num_episodes=50 \
  --dataset.fps=30

Repeat for each task (swap --dataset.single_task every 50 episodes). Aim for at least 30–50 episodes per task and 3–5 tasks to see real multitask payoff. With only 20 episodes/task you will mostly see idling at inference.
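Retyping the recording command for every task invites typos in the task text. One option is to generate the commands from a list, as in this sketch. It reuses only the flags shown above; the task list is an example, and whether consecutive runs append to the same dataset depends on your LeRobot version, so check its recording docs before running them back to back.

```python
import shlex

def record_commands(tasks, repo_id, episodes_per_task=50, fps=30):
    """Build one lerobot-record command string per task description."""
    commands = []
    for task in tasks:
        args = [
            "lerobot-record",
            "--robot.type=so101_follower",
            "--robot.port=/dev/ttyACM0",
            "--teleop.type=so101_leader",
            "--teleop.port=/dev/ttyACM1",
            f"--dataset.repo_id={repo_id}",
            f"--dataset.single_task={task}",
            f"--dataset.num_episodes={episodes_per_task}",
            f"--dataset.fps={fps}",
        ]
        commands.append(shlex.join(args))  # quotes the task text for the shell
    return commands

tasks = [
    "pick up the red cube and place it in the bowl",
    "pick up the blue sponge",
    "open the drawer",
]
commands = record_commands(tasks, "YOUR_USER/so101_multitask")
for command in commands:
    print(command)
```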

One important detail: task descriptions must be specific and distinct. "Pick the cube" and "Pick the block" will confuse the model. Use "pick up the red cube", "pick up the blue sponge", "open the drawer"… each with a concrete noun, easy to disambiguate.
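A cheap way to catch weak descriptions before recording is to count how many words each one has that no other description uses. The helper below is a rough heuristic of my own, not part of LeRobot: it compares raw words rather than CLIP embeddings, so it cannot tell that "cube" and "block" are near-synonyms, but it does flag descriptions that differ by a single word.

```python
def unique_words(tasks):
    """Map each task description to the words no other description uses."""
    token_sets = [set(task.lower().split()) for task in tasks]
    result = {}
    for i, task in enumerate(tasks):
        others = set().union(*(s for j, s in enumerate(token_sets) if j != i))
        result[task] = sorted(token_sets[i] - others)
    return result

def weakly_distinct(tasks, min_unique=2):
    """Return descriptions with fewer than min_unique words of their own."""
    return [t for t, words in unique_words(tasks).items() if len(words) < min_unique]

vague = ["pick the cube", "pick the block"]
specific = ["pick up the red cube", "pick up the blue sponge", "open the drawer"]

print(weakly_distinct(vague))
print(weakly_distinct(specific))
print(unique_words(specific))
```

The threshold of two unique words is arbitrary; treat a flagged description as a prompt to reread it, not as proof the policy will confuse it.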

If you are new to the SO-101 teleop pipeline, see the LeRobot ecosystem primer and the SO-101 sim2real with Isaac Lab tutorial.

Robot training workflow

5. The full training command

This is the "golden" starting command for SO-101 multitask training. I have annotated every flag so you understand the reason behind it.

lerobot-train \
  --dataset.repo_id=YOUR_USER/so101_multitask \
  --output_dir=./outputs/multitask_dit_so101 \
  --job_name=so101_multitask_dit_v1 \
  \
  `# === Batch & training schedule ===` \
  --batch_size=256 \
  --steps=50000 \
  --save_freq=2000 \
  --log_freq=100 \
  --num_workers=8 \
  \
  `# === Policy core ===` \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  \
  `# === Action horizon (30Hz) ===` \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.n_obs_steps=2 \
  \
  `# === Objective: start with diffusion ===` \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.prediction_type=epsilon \
  --policy.clip_sample=true \
  \
  `# === DiT architecture ===` \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.num_heads=8 \
  --policy.dropout=0.1 \
  --policy.use_rope=true \
  \
  `# === Vision encoder ===` \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[256,256] \
  --policy.image_crop_shape=[224,224] \
  --policy.image_crop_is_random=true \
  --policy.vision_encoder_lr_multiplier=0.1 \
  \
  `# === Text encoder (CLIP) ===` \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  \
  `# === Optimizer ===` \
  --policy.optimizer_lr=2e-5 \
  --policy.optimizer_weight_decay=0 \
  \
  `# === Push checkpoint ===` \
  --policy.push_to_hub=true \
  --policy.repo_id=YOUR_USER/multitask-dit-so101 \
  --wandb.enable=true \
  --wandb.project=multitask_dit_so101

With 5 tasks × 50 episodes on an RTX 4090, expect 8–12 hours of training. Loss usually flatlines around 20k steps but keep going to 50k–100k — multitask models need extra time for language steerability to stabilize, even after loss has plateaued.

Speeding up inference with DDIM

After training, switch the sampler to DDIM with fewer steps:

--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10

DDIM with 10 steps gives roughly the same quality as DDPM with 100 steps while making 10× fewer denoising passes per action chunk.
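The speedup comes from visiting only a subset of the training timesteps at inference. A minimal sketch of that subsampling, assuming evenly spaced steps counted down from the noisiest; real schedulers offer several spacing options, so the exact indices in your setup may differ.

```python
def ddim_timesteps(num_train_timesteps=100, num_inference_steps=10):
    """Evenly spaced subset of training timesteps, noisiest first."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("need 0 < num_inference_steps <= num_train_timesteps")
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]

steps = ddim_timesteps()
print(steps)  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
print(f"denoiser calls per chunk: {len(steps)} instead of 100")
```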

6. Inference & deploy on a real SO-101

lerobot-eval \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --policy.path=YOUR_USER/multitask-dit-so101 \
  --policy.device=cuda \
  --eval.task="pick up the red cube and place it in the bowl" \
  --eval.n_episodes=10

The --eval.task argument is the text prompt fed into CLIP. You can switch to a different in-distribution task without reloading the model:

--eval.task="open the drawer slowly"

That is the big leap over the original ACT or Diffusion Policy — one binary, switch tasks via natural language.
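Conceptually, task switching at runtime is just a different string passed to the same loaded policy. The sketch below illustrates this together with the horizon=32 / n_action_steps=24 replanning scheme. `ChunkedController` and `stub_policy` are hypothetical stand-ins written for this example; they are not LeRobot classes.

```python
from collections import deque

class ChunkedController:
    """Executes the first n_action_steps of each predicted chunk, then replans."""

    def __init__(self, policy, horizon=32, n_action_steps=24):
        self.policy = policy
        self.horizon = horizon
        self.n_action_steps = n_action_steps
        self.queue = deque()

    def step(self, observation, task):
        if not self.queue:
            chunk = self.policy(observation, task)  # list of `horizon` actions
            if len(chunk) != self.horizon:
                raise ValueError("policy returned a chunk of unexpected length")
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()

    def reset(self):
        """Drop queued actions, e.g. when the task prompt changes."""
        self.queue.clear()

calls = []

def stub_policy(observation, task):
    """Hypothetical stand-in for the trained model: records each prompt."""
    calls.append(task)
    return [(task, i) for i in range(32)]

controller = ChunkedController(stub_policy)
for _ in range(48):  # 48 control ticks = 1.6 s at 30 Hz, i.e. two chunks
    controller.step(observation=None, task="pick up the red cube and place it in the bowl")

controller.reset()  # new prompt, same policy object
action = controller.step(observation=None, task="open the drawer slowly")
print(len(calls), action)
```

Clearing the queue on a prompt change avoids executing leftover actions that were planned for the previous task.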

7. Common failure modes & how to debug them

This section collects hands-on experience and is likely the most useful part when you train your own model.

Failure 1: idling / no motion

Symptoms: action output is near zero, the robot does not move or just jitters.

Common causes:

  • Dataset too small (< 200 total examples).
  • Tasks too similar (pick red cube vs pick blue cube) → model leans on vision and ignores text.
  • Loss already flat but training stopped too early.

Fixes:

  • Collect more data until the dataset exceeds 300 examples.
  • Keep training to 100k steps even if loss flatlines.
  • Diversify text instructions: "grasp the crimson block" instead of repeating "pick the cube".
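Idling is easier to confirm from logged actions than by watching the arm. A small sketch, assuming you have a list of per-step action vectors; the threshold and the sample data are made up and should be tuned to your action units.

```python
def mean_abs_step(actions):
    """Average absolute change per joint between consecutive actions."""
    if len(actions) < 2:
        raise ValueError("need at least two actions")
    total, count = 0.0, 0
    for prev, curr in zip(actions, actions[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    return total / count

def looks_idle(actions, threshold=1e-3):
    """True when the commanded motion per step is below the threshold."""
    return mean_abs_step(actions) < threshold

# Hypothetical logs: a nearly frozen chunk and a steadily moving one.
frozen = [[0.5 + 1e-5 * (i % 2), 0.1] for i in range(24)]
moving = [[0.01 * i, 0.0] for i in range(24)]

print(looks_idle(frozen), looks_idle(moving))
```

Large-amplitude jitter around a fixed pose would pass this check, so pair it with a look at net displacement over the whole chunk.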

Failure 2: executing the wrong task

Symptoms: you say "pick the red cube" and the robot wipes the table.

Cause: task descriptions overlap too much, or the requested task is out-of-distribution.

Fixes:

  • Verify each task instruction is specific.
  • Re-weight the ignored task during sampling.
  • Fine-tune a few hundred extra steps on the hard task.
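Re-weighting can be expressed as per-episode sampling probabilities. The sketch below gives each task equal total probability mass regardless of its episode count and then boosts the ignored task. The task labels, counts, and boost factor are illustrative, and how you feed such weights into your training loop (for example through a weighted sampler) depends on your setup.

```python
import random
from collections import Counter

def episode_weights(episode_tasks, boost=None):
    """Per-episode sampling weights that balance tasks, with optional boosts.

    Each task gets equal total probability mass no matter how many
    episodes it has; `boost` multiplies the mass of selected tasks.
    """
    boost = boost or {}
    counts = Counter(episode_tasks)
    raw = [boost.get(task, 1.0) / counts[task] for task in episode_tasks]
    total = sum(raw)
    return [w / total for w in raw]

episode_tasks = ["pick"] * 50 + ["wipe"] * 50 + ["drawer"] * 20
weights = episode_weights(episode_tasks, boost={"drawer": 2.0})

# Total probability mass per task after re-weighting.
mass = Counter()
for task, w in zip(episode_tasks, weights):
    mass[task] += w
print({task: round(m, 2) for task, m in mass.items()})

# Draw a small batch of episode indices with these weights.
rng = random.Random(0)
batch = rng.choices(range(len(episode_tasks)), weights=weights, k=8)
print(batch)
```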

Failure 3: training instability

Symptoms: loss oscillates wildly, or turns NaN after a few thousand steps.

Fixes:

  • Drop the learning rate from 2e-5 to 1e-5.
  • Increase batch size — anything under 64 is unstable.
  • Verify image normalization matches what the CLIP checkpoint expects: CLIP ships its own per-channel mean/std, which are close to but not identical to the ImageNet values.
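For the normalization check, note that OpenAI CLIP checkpoints use their own per-channel statistics, which differ slightly from the common ImageNet ones. The sketch below compares the two on a single pixel; it assumes channels are already scaled to [0, 1], and the helper name is hypothetical.

```python
# Per-channel statistics used by OpenAI CLIP checkpoints and by ImageNet models.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    if not all(0.0 <= c <= 1.0 for c in rgb):
        raise ValueError("expected channels in [0, 1]; divide uint8 values by 255 first")
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))

white = (1.0, 1.0, 1.0)
clip_white = normalize_pixel(white, CLIP_MEAN, CLIP_STD)
imagenet_white = normalize_pixel(white, IMAGENET_MEAN, IMAGENET_STD)
print([round(v, 3) for v in clip_white])
print([round(v, 3) for v in imagenet_white])
```

The range check also catches a more damaging mistake: feeding 0–255 values into a pipeline that expects 0–1 inputs.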

8. LIBERO benchmark

The LeRobot team reports that Multitask DiT reaches 90.6% average on LIBERO with the config: 8 layers, hidden 768, horizon 48, 100k steps, batch 320:

Suite             Success Rate
LIBERO Spatial    87.0%
LIBERO Object     98.2%
LIBERO Goal       93.8%
LIBERO 10         83.2%
Average           90.6%
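As a quick sanity check, the reported average is the unweighted mean of the four suite scores, rounded to one decimal place:

```python
from decimal import Decimal, ROUND_HALF_UP

suites = {
    "LIBERO Spatial": Decimal("87.0"),
    "LIBERO Object": Decimal("98.2"),
    "LIBERO Goal": Decimal("93.8"),
    "LIBERO 10": Decimal("83.2"),
}

# Decimal avoids binary floating-point surprises when rounding 90.55.
mean = sum(suites.values()) / len(suites)
reported = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(mean, reported)  # 90.55 90.6
```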

For comparison: original Diffusion Policy ≈ 78%, ACT ≈ 73%, OpenVLA 7B (multi-billion params) ≈ 76%. Multitask DiT with only 450M params beats several much larger VLAs. That is fantastic news for teams who only have an RTX 4090 / 5090.

9. When to use Multitask DiT, when not

Situation                                    Choice
Single task, tiny dataset (< 50 episodes)    ACT (simple, good enough)
Single task, medium dataset                  Original Diffusion Policy
2–10 tasks on the same hardware              Multitask DiT ✓
Need generalization to brand-new scenes      A real VLA (Pi0, SmolVLA, GR00T)
Long-horizon multi-subtask                   Multitask DiT + HIL-SERL

10. Closing thoughts

Multitask DiT in LeRobot v0.5 hits a rare sweet spot: powerful enough to learn dozens of tasks through language, small enough to run on consumer GPUs, and already packaged into lerobot-train. For the Vietnamese robotics community building real products with SO-100/SO-101 or similar low-cost arms, this is the strongest baseline you can grab without renting an A100 cluster.

A final tip: do not try to train multitask from day one. Train a single task with Multitask DiT first to verify your data pipeline, then add tasks one by one. Most "the policy will not learn" issues come from dirty data, not from the model itself.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
