Multitask DiT Policy on LeRobot v0.5: One Model, Many Tasks with CLIP Text-Conditioning

In March 2026, HuggingFace shipped LeRobot v0.5.0, its largest release to date, with over 200 merged PRs. One addition got less spotlight than Pi0-FAST or Wall-X but matters a lot for hobbyist roboticists: the Multitask Diffusion Transformer (DiT) Policy. It is a diffusion policy with only ~450M parameters that competes with multi-billion-parameter VLAs on dexterous manipulation. More importantly, a single trained model handles many tasks and switches between them via natural-language commands encoded by a frozen CLIP text encoder.

This guide walks through the core idea (why DiT, why multitask, why CLIP), the architecture, real installation on SO-100/SO-101, a full training command, common failure modes, and the LIBERO benchmark numbers.

References: official Multitask DiT docs, LeRobot v0.5 release blog, original TRI LBM paper, and Bryson Jones's detailed dissection.

Robot arm performing manipulation task

1. Why a Multitask DiT Policy?

The limits of the original Diffusion Policy

The 2023 Diffusion Policy showed that treating robot actions as a sequence to be denoised yields smoother, multi-modal, more robust behavior than vanilla behavior cloning. But the original implementation used a 1D U-Net: a convolutional backbone built around local receptive fields, hard to scale past a few hundred million parameters, and with no natural place to inject language.

When you want one robot to perform 5–10 different tasks on the same hardware (pick the red cube, drop it in the bowl, wipe the table, close the drawer, pour water…), the old approach was to train 5–10 separate models. That is wasteful in data and VRAM, and you have to reload weights every time you switch tasks, which is not production-grade.

How Multitask DiT fixes this

Multitask DiT replaces the U-Net with a Diffusion Transformer (DiT — the same family that powers Stable Diffusion 3 and Sora) and wires in two conditioning streams:

CLIP Vision Encoder processes RGB from multiple cameras (wrist + overhead).
CLIP Text Encoder embeds the task command ("pick up the red cube"). The text embedding is injected into every transformer block via cross-attention.

The result: a single ~450M-parameter model can learn dozens of tasks, switching between them at inference time by just changing the text prompt. This is the same spirit as the TRI Large Behavior Model and the recent Boston Dynamics Atlas LBM blog, now officially ported into LeRobot.
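To make "injected via cross-attention" concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not the LeRobot implementation: a real DiT block adds learned query/key/value projections, multiple heads, and residual connections, and the token values below are made up. Action tokens act as queries, and text tokens act as both keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(action_tokens, text_tokens):
    """Single-head cross-attention with no learned projections (illustration only).

    Each action token (query) attends over all text tokens (keys and values)
    and receives a weighted mix of them.
    """
    dim = len(text_tokens[0])
    outputs, all_weights = [], []
    for query in action_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_tokens
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * token[i] for w, token in zip(weights, text_tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 4-dimensional tokens: 2 action tokens, 3 text tokens.
action_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_tokens = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
outputs, weights = cross_attention(action_tokens, text_tokens)
```

Changing the prompt changes `text_tokens`, which changes what every action token receives. That is the mechanism that lets one set of weights serve several tasks.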

2. The architecture, in detail

                    ┌──────────────────────────┐
   "Pick up red ─→  │  CLIP Text Encoder       │ ──┐
    cube"           │  (frozen + learnable proj)│   │
                    └──────────────────────────┘   │
                                                    ▼
   RGB images   ─→  CLIP Vision Encoder  ──→  ┌─────────────────┐
   (wrist+top)      (lr_mult = 0.1)           │  DiT Backbone    │ ──→ Predicted
                                              │  (6–8 layers,    │     action chunk
   Action noise ──→ Action token embedder ──→ │   hidden=512–768)│     (horizon=32)
   z_t                                        └─────────────────┘
                          ▲
                          │
                  Diffusion timestep t
                  or Flow-matching t ∈ [0,1]

A few notable design choices:

Two objectives, one model: --policy.objective=diffusion (DDPM/DDIM, default) or --policy.objective=flow_matching (à la Boston Dynamics). Flow matching sometimes yields smoother actions but it is not a silver bullet.
RoPE positional encoding by default for the action sequence — better than absolute PE for this use case.
Vision encoder learning rate is only 0.1× the backbone LR: CLIP is already well pretrained, and fine-tuning it aggressively degrades its features.
Horizon = 32 ≈ 1.07 s at 30 Hz: the policy predicts 32 steps ahead but executes only the first 24 (0.8 s) before replanning.
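The horizon arithmetic is easy to check by hand. The sketch below uses the values quoted above (horizon 32, 24 executed steps, 30 Hz); the function names are hypothetical helpers, not part of LeRobot.

```python
def chunk_timing(horizon, n_action_steps, fps):
    """Durations implied by predicting `horizon` steps and executing `n_action_steps`."""
    return {
        "predicted_seconds": horizon / fps,
        "executed_seconds": n_action_steps / fps,
        "discarded_steps": horizon - n_action_steps,
    }


def replanning_steps(total_steps, n_action_steps):
    """Control steps at which a fresh action chunk is predicted."""
    return list(range(0, total_steps, n_action_steps))


timing = chunk_timing(horizon=32, n_action_steps=24, fps=30)
replans = replanning_steps(total_steps=90, n_action_steps=24)  # a 3-second episode
```

With these numbers each prediction covers about 1.07 s, 0.8 s of it is executed, and the last 8 predicted steps are thrown away at every replan. Over a 3-second episode the policy is queried at steps 0, 24, 48, and 72.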

3. Installation

Minimum training hardware is a 16 GB-VRAM GPU; RTX 4090 / A100 / H100 are recommended for batch sizes 256–320. Inference needs at least an RTX 5070 Ti to run real-time at 30 Hz.

# 1. Create a Python 3.12+ env (required by LeRobot v0.5)
conda create -n lerobot python=3.12 -y
conda activate lerobot

# 2. Install LeRobot with the multi_task_dit extras
pip install "lerobot[multi_task_dit]"

# 3. Verify
python -c "from lerobot.policies.multi_task_dit import MultiTaskDiTPolicy; print('OK')"

The [multi_task_dit] extras pull in transformers>=5.0 so CLIP (vision + text) loads correctly.

4. Preparing a dataset for SO-100/SO-101

LeRobot v0.5 consolidated the SO-100 and SO-101 codebase — you use one shared interface. To train multitask, every episode in the dataset must carry a task description.

When recording teleop data, give a clear and varied task description:

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --dataset.repo_id=YOUR_USER/so101_multitask \
  --dataset.single_task="pick up the red cube and place it in the bowl" \
  --dataset.num_episodes=50 \
  --dataset.fps=30

Repeat for each task (swap --dataset.single_task every 50 episodes). Aim for at least 30–50 episodes per task and 3–5 tasks to see real multitask payoff. With only 20 episodes/task you will mostly see idling at inference.
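If you record several tasks, generating the commands from a list keeps the prompts consistent between sessions. The sketch below only builds and prints the command lines, using the same flags as the command above; the second and third task strings are example placeholders. Whether a second run appends to an existing dataset or needs an extra resume option may depend on your installed version, so check the CLI help before running them.

```python
import shlex

# Example task list; replace with your own prompts.
TASKS = [
    "pick up the red cube and place it in the bowl",
    "pick up the blue sponge",
    "open the drawer",
]


def record_command(task, repo_id="YOUR_USER/so101_multitask", episodes=50):
    """Build the lerobot-record argument list for one task description."""
    return [
        "lerobot-record",
        "--robot.type=so101_follower",
        "--robot.port=/dev/ttyACM0",
        "--teleop.type=so101_leader",
        "--teleop.port=/dev/ttyACM1",
        f"--dataset.repo_id={repo_id}",
        f"--dataset.single_task={task}",
        f"--dataset.num_episodes={episodes}",
        "--dataset.fps=30",
    ]


commands = [record_command(task) for task in TASKS]
for command in commands:
    # shlex.join quotes the task string so it can be pasted into a shell.
    print(shlex.join(command))
```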

One important detail: task descriptions must be specific and distinct. "Pick the cube" and "Pick the block" will confuse the model. Use "pick up the red cube", "pick up the blue sponge", "open the drawer"… each with a concrete noun, easy to disambiguate.
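A quick pre-flight check can catch near-duplicate prompts before you record anything. The sketch below uses plain string similarity from the standard library, which is only a rough proxy: what actually matters is how close the CLIP text embeddings are, so treat the 0.7 threshold as an arbitrary starting point rather than a tuned value.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(tasks, threshold=0.7):
    """Return pairs of task descriptions whose character-level similarity is high."""
    flagged = []
    for first, second in combinations(tasks, 2):
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            flagged.append((first, second, round(ratio, 2)))
    return flagged


tasks = ["pick the cube", "pick the block", "open the drawer"]
flagged = similar_pairs(tasks)
for first, second, ratio in flagged:
    print(f"too similar ({ratio}): {first!r} vs {second!r}")
```

Here only the "cube" / "block" pair is flagged; "open the drawer" is far enough from both.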

If you are new to the SO-101 teleop pipeline, see the LeRobot ecosystem primer and the SO-101 sim2real with Isaac Lab tutorial.

Robot training workflow

5. The full training command

This is the "golden" starting command for SO-101 multitask training. The flags are grouped under commented section headers so you can see what each block controls.

lerobot-train \
  --dataset.repo_id=YOUR_USER/so101_multitask \
  --output_dir=./outputs/multitask_dit_so101 \
  --job_name=so101_multitask_dit_v1 \
  \
  `# === Batch & training schedule ===` \
  --batch_size=256 \
  --steps=50000 \
  --save_freq=2000 \
  --log_freq=100 \
  --num_workers=8 \
  \
  `# === Policy core ===` \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  \
  `# === Action horizon (30Hz) ===` \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.n_obs_steps=2 \
  \
  `# === Objective: start with diffusion ===` \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.prediction_type=epsilon \
  --policy.clip_sample=true \
  \
  `# === DiT architecture ===` \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.num_heads=8 \
  --policy.dropout=0.1 \
  --policy.use_rope=true \
  \
  `# === Vision encoder ===` \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[256,256] \
  --policy.image_crop_shape=[224,224] \
  --policy.image_crop_is_random=true \
  --policy.vision_encoder_lr_multiplier=0.1 \
  \
  `# === Text encoder (CLIP) ===` \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  \
  `# === Optimizer ===` \
  --policy.optimizer_lr=2e-5 \
  --policy.optimizer_weight_decay=0 \
  \
  `# === Push checkpoint ===` \
  --policy.push_to_hub=true \
  --policy.repo_id=YOUR_USER/multitask-dit-so101 \
  --wandb.enable=true \
  --wandb.project=multitask_dit_so101

With 5 tasks × 50 episodes on an RTX 4090, expect 8–12 hours of training. Loss usually flatlines around 20k steps but keep going to 50k–100k — multitask models need extra time for language steerability to stabilize, even after loss has plateaued.
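A few quantities implied by the command are worth computing once. The step count, batch size, learning rate, and multiplier below come from the command above; the episode length (about 10 s at 30 fps) is an assumption made only for this illustration, so substitute your own dataset's frame count.

```python
def training_budget(steps, batch_size, total_frames):
    """Return (samples processed, approximate passes over the dataset)."""
    samples_seen = steps * batch_size
    return samples_seen, samples_seen / total_frames


base_lr = 2e-5                 # --policy.optimizer_lr
vision_lr = base_lr * 0.1      # --policy.vision_encoder_lr_multiplier=0.1

# Assumption (not from the article): 5 tasks x 50 episodes, ~10 s each at 30 fps.
total_frames = 5 * 50 * 10 * 30

samples_seen, approx_epochs = training_budget(
    steps=50_000, batch_size=256, total_frames=total_frames
)
print(f"vision encoder LR: {vision_lr:.1e}")
print(f"samples processed: {samples_seen:,}")
print(f"approximate epochs: {approx_epochs:.0f}")
```

Under that assumption the run makes well over a hundred passes over the data, which is one reason the loss curve flattens long before training ends.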

Speeding up inference with DDIM

After training, switch the sampler to DDIM with fewer steps:

--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10

In practice, DDIM with 10 steps gives quality close to DDPM with 100 steps while running about 10× fewer denoising passes per action chunk.
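The speedup comes from visiting only a subset of the training timesteps. The sketch below shows one common spacing (an evenly strided subset, visited from noisy to clean); the scheduler you actually run may space the steps differently, so take this as an illustration of the idea rather than its exact schedule.

```python
def ddim_timesteps(num_train_timesteps, num_inference_steps):
    """Evenly strided subset of training timesteps, ordered from noisy to clean."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in 1..num_train_timesteps")
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


timesteps = ddim_timesteps(num_train_timesteps=100, num_inference_steps=10)
# Ratio of network forward passes only; other per-step costs are ignored.
forward_pass_ratio = 100 / len(timesteps)
print(timesteps)
```

Each entry is one forward pass through the DiT, so 10 entries instead of 100 is where the roughly tenfold reduction in denoising cost comes from.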

6. Inference & deploy on a real SO-101

lerobot-eval \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --policy.path=YOUR_USER/multitask-dit-so101 \
  --policy.device=cuda \
  --eval.task="pick up the red cube and place it in the bowl" \
  --eval.n_episodes=10

The --eval.task argument is the text prompt fed into CLIP. You can switch to a different in-distribution task without reloading the model:

--eval.task="open the drawer slowly"

That is the big leap over ACT or the original Diffusion Policy: one set of weights, with tasks switched via natural language.

7. Common failure modes & how to debug them

This section comes from hands-on experience and is the most useful part once you start training your own model.

Failure 1: idling / no motion

Symptoms: action output is near zero, the robot does not move or just jitters.

Common causes:

Dataset too small (< 200 total examples).
Tasks too similar (pick red cube vs pick blue cube) → model leans on vision and ignores text.
Training stopped too early, soon after the loss flattened.

Fixes:

Add demonstrations until the dataset exceeds 300 examples in total; doubling it is a reasonable first step.
Keep training to 100k steps even if loss flatlines.
Diversify text instructions: "grasp the crimson block" instead of repeating "pick the cube".
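Idling is easier to confirm from logged predictions than by watching the arm. The sketch below assumes the actions are joint targets, so an idle policy shows up as an action chunk that barely changes from step to step. The function name and the threshold are hypothetical, and the threshold depends on how your actions are normalized.

```python
def is_idle(action_chunk, threshold=1e-2):
    """True if the mean absolute step-to-step change in the chunk is below threshold.

    action_chunk: list of steps, each a list of per-joint action values.
    """
    deltas = [
        abs(current - previous)
        for previous_step, current_step in zip(action_chunk, action_chunk[1:])
        for previous, current in zip(previous_step, current_step)
    ]
    # A chunk with fewer than two steps has no motion to measure.
    return not deltas or sum(deltas) / len(deltas) < threshold


moving_chunk = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4]]
idle_chunk = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(is_idle(moving_chunk), is_idle(idle_chunk))
```

Running this over the chunks predicted for each prompt tells you whether idling affects every task or only some of them, which helps separate a data-volume problem from a text-conditioning problem.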

Failure 2: executing the wrong task

Symptoms: you say "pick the red cube" and the robot wipes the table.

Cause: task descriptions overlap too much, or the requested task is out-of-distribution.

Fixes:

Verify each task instruction is specific.
Re-weight the ignored task during sampling.
Fine-tune a few hundred extra steps on the hard task.
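Re-weighting can be sketched as per-episode sampling weights: divide by the task's episode count so every task is drawn equally often, then boost the task the policy keeps ignoring. This is a standalone illustration with made-up episode counts; how to plug such weights into the training data loader is not covered here and depends on your setup.

```python
import random
from collections import Counter


def episode_weights(episode_tasks, boost=None):
    """Per-episode weights: balanced across tasks, then scaled by an optional boost."""
    boost = boost or {}
    counts = Counter(episode_tasks)
    return [boost.get(task, 1.0) / counts[task] for task in episode_tasks]


# Hypothetical dataset: two well-covered tasks and one under-represented task.
episode_tasks = (
    ["open the drawer"] * 50
    + ["pick up the red cube"] * 50
    + ["wipe the table"] * 25
)
weights = episode_weights(episode_tasks, boost={"wipe the table": 2.0})

rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled = rng.choices(range(len(episode_tasks)), weights=weights, k=10_000)
share = Counter(episode_tasks[index] for index in sampled)
print(share)
```

With a boost of 2.0 the under-represented task receives about half of all draws, even though it has the fewest episodes.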

Failure 3: training instability

Symptoms: loss oscillates wildly or turns NaN after a few thousand steps.

Fixes:

Drop the learning rate from 2e-5 to 1e-5.
Increase batch size — anything under 64 is unstable.
Verify image normalization matches what the CLIP checkpoint expects. CLIP ships its own per-channel mean/std, which are close to but not identical to the ImageNet values.
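The normalization check is worth doing numerically, because the CLIP statistics and the ImageNet statistics look alike at a glance. The constants below are the values published with the OpenAI CLIP preprocessor and the standard ImageNet values; confirm them against the preprocessor config of the checkpoint you actually load.

```python
# Per-channel RGB statistics (for pixel values scaled to 0..1).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel given in the 0..255 range."""
    return tuple((channel / 255.0 - m) / s for channel, m, s in zip(rgb, mean, std))


white = (255, 255, 255)
white_clip = normalize_pixel(white, CLIP_MEAN, CLIP_STD)
white_imagenet = normalize_pixel(white, IMAGENET_MEAN, IMAGENET_STD)
max_gap = max(abs(a - b) for a, b in zip(white_clip, white_imagenet))
print(white_clip, white_imagenet, max_gap)
```

For a white pixel the two conventions differ by several tenths of a standard deviation per channel, which is enough to shift the features a pretrained encoder produces.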

8. LIBERO benchmark

The LeRobot team reports that Multitask DiT reaches 90.6% average on LIBERO with the config: 8 layers, hidden 768, horizon 48, 100k steps, batch 320:

Suite	Success Rate
LIBERO Spatial	87.0%
LIBERO Object	98.2%
LIBERO Goal	93.8%
LIBERO 10	83.2%
Average	90.6%
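The reported average can be reproduced from the four suite scores in the table. Decimal arithmetic is used so the half-up rounding to one decimal place is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

# Success rates (%) from the table above.
scores = {
    "LIBERO Spatial": "87.0",
    "LIBERO Object": "98.2",
    "LIBERO Goal": "93.8",
    "LIBERO 10": "83.2",
}

average = sum(Decimal(value) for value in scores.values()) / len(scores)
rounded = average.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(average, rounded)
```

The unrounded mean is 90.55, which rounds to the 90.6% shown in the table.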

For comparison: original Diffusion Policy ≈ 78%, ACT ≈ 73%, OpenVLA 7B ≈ 76%. With only ~450M parameters, Multitask DiT outscores a 7B-parameter VLA on this benchmark. That is good news for teams who only have an RTX 4090 / 5090.

9. When to use Multitask DiT, and when not to

Situation	Choice
Single task, tiny dataset (< 50 episodes)	ACT — simple, good enough
Single task, medium dataset	Original Diffusion Policy
2–10 tasks on the same hardware	Multitask DiT ✓
Need generalization to brand-new scenes	A real VLA (Pi0, SmolVLA, GR00T)
Long-horizon multi-subtask	Multitask DiT + HIL-SERL

10. Closing thoughts

Multitask DiT in LeRobot v0.5 hits a rare sweet spot: powerful enough to learn dozens of tasks through language, small enough to run on consumer GPUs, and already packaged into lerobot-train. For the Vietnamese robotics community building real products with SO-100/SO-101 or similar low-cost arms, this is the strongest baseline you can grab without renting an A100 cluster.

A final tip: do not try to train multitask from day one. Train a single task with Multitask DiT first to verify your data pipeline, then add tasks one by one. Most "the policy will not learn" issues come from dirty data, not from the model itself.


Nguyễn Anh Tuấn
