In September 2025, THU-ML (Tsinghua University Machine Learning Lab) released RDT2 — the successor to RDT-1B and one of the first foundation models to achieve zero-shot deployment on unseen embodiments. Practically: take the model off the shelf, plug it into your bimanual UR5e, type an English instruction ("pick up the apple"), and the robot does it — no data collection on your specific arm, no fine-tuning required.
This is a big jump from RDT-1B (October 2024), which only generalized well within the same embodiment. RDT2 unlocks real cross-embodiment generalization through clever data design: it uses UMI (Universal Manipulation Interface) — a hand-held gripper device — to collect 10,000+ hours of manipulation demos across 100+ scenes, embodiment-agnostic from day one.
This guide covers: the paper's core idea, the three-stage training architecture, how to set up the thu-ml/RDT2 repo, the data pipeline, training, inference, and real-world results.
Why cross-embodiment matters
In robotics, "embodiment" means the specific physical configuration: number of DOFs, gripper, cameras, base. Each lab usually has its own embodiment (single UR5e, ALOHA, dual Franka…). The pain: data collected on ALOHA is not directly usable on UR5e, because action space, reachable workspace, and dynamics all differ.
Consequence: the community keeps replicating data per embodiment — extremely expensive. Open X-Embodiment (2023) tried to pool data, but generalization stayed weak because formats weren't unified.
RDT2 solves this by not collecting data on a robot at all, but on an embodiment-agnostic device: the UMI gripper. UMI has a wrist-view fisheye camera + tracker. A human operator picks it up and performs a task → records action chunks + images. At deployment, any robot with a similar bimanual wrist-camera setup can execute the model's predicted action chunks.
RDT2's core idea
The paper RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization (arXiv:2602.03310) makes three key claims:
- Scale UMI data to 10,000+ hours — the largest open-source robotic dataset to date.
- Three-stage training recipe — bridges discrete linguistic knowledge (from Qwen2.5-VL) with continuous control through RVQ → flow matching → distillation.
- Relative action chunks, 24 frames — output 24 frames (0.8 seconds at 30 FPS), 20 DOF for bimanual (per arm: 3D position + 6D rotation + 1D gripper). Relative actions transfer cleanly across robots with different workspaces.
Result: RDT2 zero-shot generalizes to unseen objects, scenes, instructions, and embodiments, and even beats state-of-the-art baselines on hard tasks like table tennis, archery with 100 ms reaction time, and extinguishing burning incense.
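To make the relative format concrete, here is a minimal numpy sketch of applying one 10-DOF per-arm frame to the current TCP pose. The 6D-rotation-to-matrix step is the standard Gram-Schmidt construction; whether RDT2 expresses deltas in the TCP frame exactly like this is an assumption, so treat the composition as illustrative only.
# Illustrative only: apply one per-arm action frame (xyz + rot6d + gripper) to
# the current TCP pose. How RDT2 actually composes relative actions is an
# assumption here, not the repo's implementation.
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Gram-Schmidt the two 3D vectors of a 6D rotation into a 3x3 matrix."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def apply_relative_frame(tcp_pos: np.ndarray, tcp_rot: np.ndarray, frame: np.ndarray):
    """frame: 10 values for one arm = delta xyz (3) + delta rot6d (6) + gripper (1)."""
    d_pos, d_rot6d, gripper = frame[:3], frame[3:9], frame[9]
    new_rot = tcp_rot @ rot6d_to_matrix(d_rot6d)
    new_pos = tcp_pos + tcp_rot @ d_pos  # delta assumed to be expressed in the TCP frame
    return new_pos, new_rot, gripper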
Architecture: VQ + FM on Qwen2.5-VL-7B
RDT2 ships two paired models:
RDT2-VQ — discrete Vision-Language-Action
- Backbone: Qwen2.5-VL-7B-Instruct (a 7B-parameter open VLM on Hugging Face).
- Action tokenizer: MultiVQVAE (Residual Vector Quantization) — discretizes a 24-frame action chunk into 27 tokens. The win: 3× shorter than FAST, 8× shorter than binning. Shorter token sequences → much faster autoregressive decoding.
- Input: 2 wrist-view fisheye images + an English instruction.
- Output: discrete action tokens, decoded back to a continuous action chunk via the VQ-VAE decoder.
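To see why residual quantization keeps sequences short, here is a toy sketch of the RVQ idea. This is not the repo's MultiVQVAE; the 9-tokens-times-3-codebooks split below is purely illustrative.
# Toy sketch of residual vector quantization (RVQ): each codebook quantizes the
# residual left over by the previous one, so a few coarse-to-fine token streams
# encode the whole latent. Shapes and codebook sizes here are illustrative only.
import torch

def rvq_encode(latent, codebooks):
    # latent: (num_tokens, dim); each codebook: (codebook_size, dim)
    residual, token_ids = latent, []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (num_tokens, codebook_size)
        ids = dists.argmin(dim=-1)
        residual = residual - codebook[ids]       # keep only what is still unexplained
        token_ids.append(ids)
    return token_ids

def rvq_decode(token_ids, codebooks):
    # sum the selected entries across codebooks to reconstruct the latent
    return sum(cb[ids] for ids, cb in zip(token_ids, codebooks))

latent = torch.randn(9, 64)                        # e.g. 9 latent tokens per chunk
codebooks = [torch.randn(256, 64) for _ in range(3)]
ids = rvq_encode(latent, codebooks)                # e.g. 3 codebooks x 9 tokens = 27 ids
recon = rvq_decode(ids, codebooks)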
RDT2-FM — flow-matching action expert
- Backbone: still Qwen2.5-VL-7B, but frozen; its key-value cache is attended to by a separate action expert.
- Action expert: 400M params, an improved RDT architecture, trained with flow-matching loss (5 denoising steps).
- Stage 3: distill RDT2-FM into a one-step generator — a single forward pass from noise to action, ultra-low latency (good enough for table tennis at 1 m/s, archery at 100ms reaction).
Pick by use case: VQ for moderate-speed tasks needing explainability (action tokens are inspectable), FM for dynamic tasks demanding very low latency.
The three-stage training recipe
This is the most elegant design choice in the paper:
Stage 1 — RVQ pretraining: train Qwen2.5-VL-7B on UMI data with the task: 2 images + instruction → output discrete action tokens. Loss is the standard LLM cross-entropy. This stage teaches the model to map vision+language to actions — but in discrete space.
Stage 2 — Flow matching: swap the RVQ head for the 400M action expert. Freeze the Qwen backbone, train only the action expert with flow-matching loss. This brings outputs back to continuous space, removing quantization artifacts — critical for dexterous tasks (ping pong, archery).
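A minimal sketch of one flow-matching training step, assuming a rectified (straight-line) flow and a hypothetical action_expert(x_t, t, vlm_features) signature; the real RDT2-FM conditioning on the frozen Qwen2.5-VL KV cache is more involved.
# Sketch of a rectified flow-matching step for the action expert (illustrative;
# the expert's real interface and conditioning are not part of this guide).
import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, vlm_features, action_chunk):
    # action_chunk: (batch, 24, 20) ground-truth relative actions
    noise = torch.randn_like(action_chunk)
    t = torch.rand(action_chunk.shape[0], 1, 1, device=action_chunk.device)
    x_t = (1 - t) * noise + t * action_chunk        # straight line from noise to data
    target_velocity = action_chunk - noise          # constant velocity along that line
    pred_velocity = action_expert(x_t, t, vlm_features)
    return F.mse_loss(pred_velocity, target_velocity)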
Stage 3 — Distillation: distill the 5-step flow process into a 1-step direct mapping. Noise → action in a single forward pass. The idea is similar to consistency models / adversarial generation. This is the key to low enough latency for real-time control.
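And a sketch of the one-step distillation idea: regress a single-pass student onto the teacher's 5-step output. The teacher/student interfaces and the Euler sampler are placeholders, not the repo's API.
# One-step distillation sketch: noise -> action in a single student forward pass,
# supervised by the 5-step teacher. All interfaces here are placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_sample(teacher, vlm_features, noise, num_steps=5):
    x, dt = noise, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1, 1), i * dt, device=x.device)
        x = x + dt * teacher(x, t, vlm_features)    # Euler step along the learned flow
    return x

def distillation_loss(student, teacher, vlm_features, noise):
    target = teacher_sample(teacher, vlm_features, noise)
    pred = student(noise, vlm_features)             # single forward pass
    return F.mse_loss(pred, target)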
The philosophy: discrete language + continuous control, bridged by VQ then refined by flow matching. I expect this template to show up in many VLA models over the next 1-2 years.
Setting up thu-ml/RDT2
Hardware requirements:
- Inference: an NVIDIA RTX 4090 is enough for RDT2-FM or RDT2-VQ with LoRA (expect roughly 16 GB of VRAM in use).
- Full fine-tuning: A100/H100 80GB.
- OS: Ubuntu 24.04, Python 3.10, PyTorch 2.7.1.
- Other: Flash Attention, DeepSpeed, plus packages for UR5e or Franka Research 3.
git clone https://github.com/thu-ml/RDT2.git
cd RDT2
# Conda env
conda create -n rdt2 python=3.10 -y
conda activate rdt2
# PyTorch 2.7.1 with CUDA 12.4
pip install torch==2.7.1 torchvision --index-url https://download.pytorch.org/whl/cu124
# Requirements
pip install -r requirements.txt
# Flash Attention (~10 min compile)
pip install flash-attn==2.7.4 --no-build-isolation
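Optionally, a quick sanity check that the CUDA build of PyTorch and flash-attn import cleanly before moving on to the robot setup:
# Optional: verify the CUDA build of PyTorch and flash-attn import cleanly
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"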
Repo structure per the docs:
RDT2/
├── configs/ # dataset, robot, training configs
├── deploy/ # calibration scripts for UR5e/FR3
├── examples/ # per-robot deployment guides
├── models/ # RDT inference, normalizer
├── rdt/ # core modules
├── scripts/ # finetune_full_param.sh, finetune_lora.sh
├── vqvae/ # action tokenizer
├── main.py # training entry
└── train.py
Hardware setup: bimanual UR5e and Franka Research 3
RDT2 officially supports two platforms:
Bimanual UR5e:
- Payload 0.82 kg per arm.
- Authors recommend running at 30% speed initially for safety.
- Needs HikRobot fisheye cameras at the wrist, plus a Vive Tracker to calibrate the TCP-to-tracker transform.
Bimanual Franka Research 3:
- Gripper mass 1.9 kg.
- Similar setup; tracker calibration via scripts in deploy/.
Calibration is the most error-prone step. You'll need to:
- Mount the Vive Tracker on the gripper and run python deploy/calibrate_tcp.py.
- Measure the offset from the TCP to the tracker frame (rotation + translation).
- Save it to configs/robots/ur5e.yaml or franka.yaml.
A 5mm offset error can make the model "fly past" the target during pick. Test with a fixed calibration object before running the real task.
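As a sanity aid, here is a tiny sketch of how the calibrated offset is used at runtime: compose the tracker pose from the Vive with the fixed tracker-to-TCP transform. The function names and layout are illustrative, not the repo's schema.
# Illustrative: recover the TCP pose from the Vive tracker pose using the
# calibrated offset. A 5 mm error in T_tracker_tcp shifts every commanded pose.
import numpy as np

def pose_to_matrix(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = rotation       # 3x3 rotation from calibration
    T[:3, 3] = translation     # 3-vector offset from calibration
    return T

def tcp_pose(T_world_tracker: np.ndarray, T_tracker_tcp: np.ndarray) -> np.ndarray:
    # both are 4x4 homogeneous transforms; result is world -> TCP
    return T_world_tracker @ T_tracker_tcp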
Inference: running RDT2-VQ
Minimal code to run inference with the pretrained RDT2-VQ:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from vqvae.multi_vqvae import MultiVQVAE
from rdt.inference import batch_predict_action
# Load model + processor + VAE from Hugging Face
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"robotics-diffusion-transformer/RDT2-VQ",
torch_dtype="bfloat16",
device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotics-diffusion-transformer/RDT2-VQ")
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer")
# Camera input (2 wrist-view fisheye, 224x224 RGB)
import cv2
# cv2.imread returns BGR, so convert to RGB before feeding the model
left_rgb = cv2.cvtColor(cv2.imread("left_wrist.jpg"), cv2.COLOR_BGR2RGB)
right_rgb = cv2.cvtColor(cv2.imread("right_wrist.jpg"), cv2.COLOR_BGR2RGB)
# Predict 24-frame action chunk
result = batch_predict_action(
model, processor, vae,
examples=[{
"obs": {
"camera0_rgb": left_rgb,
"camera1_rgb": right_rgb,
}
}],
instruction="Pick up the red apple and place it in the bowl."
)
action_chunk = result[0]["action"] # shape (24, 20)
# 24 frames, 20 DOF: [left_xyz(3), left_rot6d(6), left_gripper(1),
# right_xyz(3), right_rot6d(6), right_gripper(1)]
The chunk is "executed" by streaming it to the robot controller at 30 Hz. Every 0.8 seconds the model is re-run to predict a new chunk — overlapping enough to avoid jitter.
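Here is a sketch of that receding-horizon loop, reusing batch_predict_action from above; get_wrist_images and send_tcp_target are hypothetical camera and controller helpers (the per-robot scripts in deploy/ handle this for real).
# Receding-horizon execution sketch. get_wrist_images() and send_tcp_target()
# are placeholders for your camera and controller interfaces.
import time

CONTROL_HZ = 30

while True:
    left_rgb, right_rgb = get_wrist_images()
    result = batch_predict_action(
        model, processor, vae,
        examples=[{"obs": {"camera0_rgb": left_rgb, "camera1_rgb": right_rgb}}],
        instruction="Pick up the red apple and place it in the bowl.",
    )
    for action in result[0]["action"]:          # 24 frames -> ~0.8 s of motion
        send_tcp_target(action)                 # stream one frame per control tick
        time.sleep(1.0 / CONTROL_HZ)
    # loop back: re-run the model on fresh observations for the next chunk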
Fine-tuning on your data
If your task is quirky (e.g. manipulating inside a tight cabinet), you'll want to fine-tune. The repo provides two scripts:
- scripts/finetune_full_param.sh — full-parameter, needs an A100 80GB.
- scripts/finetune_lora.sh — LoRA, runs on an RTX 4090.
Standard 3-step data prep:
- Convert to WebDataset shards: each sample is image.jpg + action.npy + action_token.npy. Conversion scripts in data/ handle ROS bags or UMI recordings (see the sketch after this list).
- Define a dataset config: a YAML pointing at the data path + a normalizer (action chunk mean/std).
- Run training: DeepSpeed ZeRO-2 or ZeRO-3 depending on GPU. The paper recommends fewer than 5 epochs to avoid overfitting — the pretrained model is strong, a few epochs is enough.
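For step 1, a hedged sketch of packing one shard in that layout; the real conversion scripts live in data/, and the paths, samples list, and dtypes below are placeholders.
# Pack samples into a WebDataset tar shard with the layout the docs describe:
# image.jpg + action.npy + action_token.npy. Everything below is illustrative.
import io
import numpy as np
import webdataset as wds

def npy_bytes(arr):
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

samples = [
    # (jpeg path, (24, 20) relative action chunk, 27 RVQ token ids) as placeholders
    ("frames/episode0_000.jpg", np.zeros((24, 20), dtype=np.float32), np.zeros(27, dtype=np.int64)),
]

with wds.TarWriter("shards/my_task-000000.tar") as sink:
    for i, (jpg_path, action_chunk, action_tokens) in enumerate(samples):
        sink.write({
            "__key__": f"sample{i:06d}",
            "image.jpg": open(jpg_path, "rb").read(),   # raw JPEG bytes
            "action.npy": npy_bytes(action_chunk),
            "action_token.npy": npy_bytes(action_tokens),
        })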
bash scripts/finetune_lora.sh \
--config configs/datasets/my_task.yaml \
--output_dir checkpoints/my_task \
--num_epochs 3
Benchmark results
Per the paper, RDT2 hits some impressive marks:
- Zero-shot generalization to unseen objects, scenes, instructions, and embodiments.
- Beats SOTA baselines (π0, OpenVLA) on long-horizon tasks like table setting.
- Dexterous tasks: table tennis with 1 m/s arm speed, archery at 100ms reaction time.
- Deformable objects: generalizes to fabrics with new textures/sizes (unseen garments).
Inference latency:
- RDT2-VQ: ~150-200 ms/chunk.
- RDT2-FM (5-step): ~80-100 ms/chunk.
- RDT2-FM (1-step distilled): ~20-30 ms/chunk — fast enough for real-time control.
Compared to training one policy per task, this is a step change in generalization.
Pitfalls and tips
- Camera setup must match: wrist-view fisheye, correct FOV and mount angle. A 15° angle error can throw actions off completely.
- Tracker calibration: test calibration before real tasks. I use a calibration cube to verify.
- Don't skip Stage 2: if you fine-tune the full pipeline, don't skip flow matching. Quantization-only output will be jittery.
- Speed limit: start at 30% speed and ramp up as you gain confidence. Bimanual collisions at full speed = broken grippers.
- Start with LoRA fine-tuning before going full-param — much cheaper and faster.
- Not every task is zero-shot: tasks far from the training distribution (e.g. surgery, micro-assembly) still need extra data.
Wrap-up
RDT2 is a major step forward for cross-embodiment manipulation. By scaling UMI data to 10,000+ hours and designing a three-stage training that bridges discrete language with continuous control, THU-ML built a foundation model that's truly "deployable" across different robots.
For Vietnamese engineers: if you have a UR5e or Franka, try the zero-shot demo before thinking about collecting your own data. You may be surprised — simple pick-and-place works out of the box. When you need task-specific behavior, LoRA fine-tuning on an RTX 4090 is enough.
Repo: thu-ml/RDT2 on GitHub. Models: robotics-diffusion-transformer/RDT2-VQ and RDT2-FM on Hugging Face. Project page: rdt-robotics.github.io/rdt2.