
RDT2: Zero-Shot Cross-Embodiment Foundation Model for Bimanual UR5e/Franka

Detailed guide to RDT2 from THU-ML — a foundation model that zero-shot deploys to bimanual UR5e and Franka, with open-source code.

Nguyễn Anh Tuấn · May 1, 2026 · 9 min read

In September 2025, THU-ML (Tsinghua University Machine Learning Lab) released RDT2 — the successor to RDT-1B and one of the first foundation models that achieves zero-shot deployment on unseen embodiments. Practically: take the model off the shelf, plug it into your bimanual UR5e, type an English instruction ("pick up the apple"), and the robot does it — no data collection on your specific arm, no fine-tuning required.

This is a big jump from RDT-1B (October 2024), which only generalized well within the same embodiment. RDT2 unlocks real cross-embodiment generalization through clever data design: it uses UMI (Universal Manipulation Interface) — a hand-held gripper device — to collect 10,000+ hours of manipulation demos across 100+ scenes, embodiment-agnostic from day one.

This guide covers: the paper's core idea, the three-stage training architecture, how to set up the thu-ml/RDT2 repo, the data pipeline, training, inference, and real-world results.

Bimanual robot arms for manipulation

Why cross-embodiment matters

In robotics, "embodiment" means the specific physical configuration: number of DOFs, gripper, cameras, base. Each lab usually has its own embodiment (single UR5e, ALOHA, dual Franka…). The pain: data collected on ALOHA is not directly usable on UR5e, because action space, reachable workspace, and dynamics all differ.

Consequence: the community keeps replicating data per embodiment — extremely expensive. Open X-Embodiment (2023) tried to pool data, but generalization stayed weak because formats weren't unified.

RDT2 solves this by not collecting data on a robot at all, but on an embodiment-agnostic device: the UMI gripper. UMI has a wrist-view fisheye camera + tracker. A human operator picks it up and performs a task, recording action chunks + images. At deployment, any robot with a similar bimanual wrist-camera setup can execute the model's outputs.

RDT2's core idea

The paper RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization (arXiv:2602.03310) makes three key claims:

  1. Scale UMI data to 10,000+ hours — the largest open-source robotic dataset to date.
  2. Three-stage training recipe — bridges discrete linguistic knowledge (from Qwen2.5-VL) with continuous control through RVQ → flow matching → distillation.
  3. Relative action chunks, 24 frames — output 24 frames (0.8 seconds at 30 FPS), 20 DOF for bimanual (per arm: 3D position + 6D rotation + 1D gripper). Relative actions transfer cleanly across robots with different workspaces.
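The relative chunk layout in point 3 can be sketched like this (the slice positions are my own illustration of the description, not RDT2's actual ordering):

```python
import numpy as np

# Hypothetical index layout for the (24, 20) relative action chunk described
# above (per arm: xyz, 6D rotation, gripper). The slice positions are my own
# illustration, not taken from the RDT2 codebase.
CHUNK_LEN, DOF = 24, 20  # 24 frames at 30 FPS = 0.8 s of motion

def split_chunk(chunk):
    """Split a (24, 20) chunk into per-arm position/rotation/gripper parts."""
    assert chunk.shape == (CHUNK_LEN, DOF)
    left, right = chunk[:, :10], chunk[:, 10:]
    return {
        "left_xyz": left[:, 0:3],  "left_rot6d": left[:, 3:9],  "left_grip": left[:, 9:],
        "right_xyz": right[:, 0:3], "right_rot6d": right[:, 3:9], "right_grip": right[:, 9:],
    }

parts = split_chunk(np.zeros((CHUNK_LEN, DOF)))
print(parts["left_rot6d"].shape)  # (24, 6)
```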

Result: RDT2 zero-shot generalizes to unseen objects, scenes, instructions, and embodiments, and even beats state-of-the-art baselines on hard tasks such as table tennis, archery with a 100 ms reaction window, and extinguishing burning incense.

Architecture: VQ + FM on Qwen2.5-VL-7B

RDT2 ships two paired models:

RDT2-VQ — discrete Vision-Language-Action

  • Backbone: Qwen2.5-VL-7B-Instruct (a 7B-parameter open VLM on Hugging Face).
  • Action tokenizer: MultiVQVAE (Residual Vector Quantization) — discretizes a 24-frame action chunk into 27 tokens. The win: the token sequence is 3× shorter than FAST and 8× shorter than binning, and shorter sequences mean much faster autoregressive decoding.
  • Input: 2 wrist-view fisheye images + an English instruction.
  • Output: discrete action tokens, decoded back to a continuous action chunk via the VQ-VAE decoder.
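The residual quantization idea behind MultiVQVAE can be illustrated with a toy sketch (the codebook count, size, and dimension are made up for illustration; RDT2's actual tokenizer config will differ):

```python
import numpy as np

# Toy residual vector quantization (RVQ), the mechanism behind MultiVQVAE:
# each codebook level quantizes the residual left by the previous level.
# 3 levels x 256 codes x dim 8 are illustrative numbers, not RDT2's config.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
for cb in codebooks:
    cb[0] = 0.0  # include a zero code so quantizing never increases error

def rvq_encode(x, codebooks):
    """Quantize level by level; each level encodes the previous residual."""
    residual, tokens = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected code vectors to reconstruct the input."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

x = rng.normal(size=8)
tokens = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
print(len(tokens))  # 3 tokens for one 8-dim vector
```

The same principle, scaled up, is how a whole 24-frame chunk compresses to 27 tokens.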

RDT2-FM — flow-matching action expert

  • Backbone: still Qwen2.5-VL-7B but frozen; its key-value cache is attended to by a separate action expert.
  • Action expert: 400M params, an improved RDT architecture, trained with flow-matching loss (5 denoising steps).
  • Stage 3: distill RDT2-FM into a one-step generator — a single forward pass from noise to action, ultra-low latency (good enough for table tennis at 1 m/s, archery at 100ms reaction).

Pick by use case: VQ for moderate-speed tasks needing explainability (action tokens are inspectable), FM for dynamic tasks demanding very low latency.

The three-stage training recipe

This is the most elegant design choice in the paper:

Stage 1 — RVQ pretraining: train Qwen2.5-VL-7B on UMI data with the task: 2 images + instruction → output discrete action tokens. Loss is the standard LLM cross-entropy. This stage teaches the model to map vision+language to actions — but in discrete space.

Stage 2 — Flow matching: swap the RVQ head for the 400M action expert. Freeze the Qwen backbone, train only the action expert with flow-matching loss. This brings outputs back to continuous space, removing quantization artifacts — critical for dexterous tasks (ping pong, archery).

Stage 3 — Distillation: distill the 5-step flow process into a 1-step direct mapping. Noise → action in a single forward pass. The idea is similar to consistency models / adversarial generation. This is the key to low enough latency for real-time control.
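The sampling side of Stages 2-3 can be sketched with a toy velocity field (assumed flow-matching mechanics, not RDT2's code): Euler integration over 5 steps versus a single step.

```python
import numpy as np

# Toy flow-matching sampler: integrate a velocity field v(x, t) from noise
# at t=0 to the action at t=1. The toy field below has a known solution, a
# straight line from x0 to `target`; in RDT2 the field is a learned network.
target = np.array([0.5, -0.2, 0.1])

def velocity(x, t):
    # True velocity of the linear path x_t = (1-t)*x0 + t*target.
    return (target - x) / max(1.0 - t, 1e-6)

def sample(x0, steps):
    """Euler-integrate the velocity field in `steps` uniform steps."""
    x, dt = x0.astype(float).copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

x0 = np.array([1.0, 1.0, 1.0])
five_step = sample(x0, 5)   # what RDT2-FM does at inference
one_step = sample(x0, 1)    # what the Stage-3 distilled model amortizes
print(five_step, one_step)
```

Stage 3 trains a network whose single forward pass reproduces what the 5-step integration would have produced, trading a little fidelity for a 5× latency cut.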

The philosophy: discrete language + continuous control, bridged by VQ then refined by flow matching. I expect this template to show up in many VLA models over the next 1-2 years.

Setting up thu-ml/RDT2

Hardware requirements:

  • Inference: an NVIDIA RTX 4090 (using ~16GB VRAM) is enough for RDT2-FM or RDT2-VQ with LoRA.
  • Full fine-tuning: A100/H100 80GB.
  • OS: Ubuntu 24.04, Python 3.10, PyTorch 2.7.1.
  • Other: Flash Attention, DeepSpeed, plus packages for UR5e or Franka Research 3.
git clone https://github.com/thu-ml/RDT2.git
cd RDT2

# Conda env
conda create -n rdt2 python=3.10 -y
conda activate rdt2

# PyTorch 2.7.1 with CUDA 12.4
pip install torch==2.7.1 torchvision --index-url https://download.pytorch.org/whl/cu124

# Requirements
pip install -r requirements.txt

# Flash Attention (~10 min compile)
pip install flash-attn==2.7.4 --no-build-isolation

Repo structure per the docs:

RDT2/
├── configs/        # dataset, robot, training configs
├── deploy/         # calibration scripts for UR5e/FR3
├── examples/       # per-robot deployment guides
├── models/         # RDT inference, normalizer
├── rdt/            # core modules
├── scripts/        # finetune_full_param.sh, finetune_lora.sh
├── vqvae/          # action tokenizer
├── main.py         # training entry
└── train.py

Hardware setup: bimanual UR5e and Franka Research 3

RDT2 officially supports two platforms:

Bimanual UR5e:

  • Payload 0.82 kg per arm.
  • Authors recommend running at 30% speed initially for safety.
  • Needs HikRobot fisheye cameras at the wrist, plus a Vive Tracker to calibrate the TCP-to-tracker transform.

Bimanual Franka Research 3:

  • Gripper mass 1.9 kg.
  • Similar setup, tracker calibration via scripts in deploy/.

Calibration is the most error-prone step. You'll need to:

  1. Mount the Vive Tracker on the gripper, run python deploy/calibrate_tcp.py.
  2. Measure offset from TCP to tracker frame (rotation + translation).
  3. Save to configs/robots/ur5e.yaml or franka.yaml.

A 5 mm offset error can make the arm overshoot the target during a pick. Test with a fixed calibration object before running the real task.
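How a calibration error propagates can be seen with homogeneous transforms (the frame names and offsets here are illustrative, not values from the repo):

```python
import numpy as np

# Homogeneous-transform sketch of the TCP-to-tracker calibration. It shows
# why step 2 matters: a 5 mm error in the calibrated offset shifts every
# commanded TCP pose by the same 5 mm.

def make_T(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

T_world_tracker = make_T(np.eye(3), np.array([0.40, 0.10, 0.30]))  # tracked pose
T_tracker_tcp = make_T(np.eye(3), np.array([0.0, 0.0, -0.12]))     # calibration

tcp = (T_world_tracker @ T_tracker_tcp)[:3, 3]

# Same chain with a 5 mm translation error in the calibration:
T_bad = make_T(np.eye(3), np.array([0.005, 0.0, -0.12]))
tcp_bad = (T_world_tracker @ T_bad)[:3, 3]
print(np.linalg.norm(tcp_bad - tcp))  # 0.005 m of pick error
```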

Code and config for the training pipeline

Inference: running RDT2-VQ

Minimal code to run inference with the pretrained RDT2-VQ:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from vqvae.multi_vqvae import MultiVQVAE
from rdt.inference import batch_predict_action

# Load model + processor + VAE from Hugging Face
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotics-diffusion-transformer/RDT2-VQ")
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer")

# Camera input (2 wrist-view fisheye, 224x224 RGB)
import cv2
# cv2.imread returns BGR; convert to RGB before feeding the model
left_rgb = cv2.cvtColor(cv2.imread("left_wrist.jpg"), cv2.COLOR_BGR2RGB)
right_rgb = cv2.cvtColor(cv2.imread("right_wrist.jpg"), cv2.COLOR_BGR2RGB)

# Predict 24-frame action chunk
result = batch_predict_action(
    model, processor, vae,
    examples=[{
        "obs": {
            "camera0_rgb": left_rgb,
            "camera1_rgb": right_rgb,
        }
    }],
    instruction="Pick up the red apple and place it in the bowl."
)

action_chunk = result[0]["action"]   # shape (24, 20)
# 24 frames, 20 DOF: [left_xyz(3), left_rot6d(6), left_gripper(1),
#                     right_xyz(3), right_rot6d(6), right_gripper(1)]
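The 6D rotation components can be turned into a rotation matrix with the standard Gram-Schmidt construction (Zhou et al., "On the Continuity of Rotation Representations"); whether RDT2 uses this exact convention is an assumption on my part:

```python
import numpy as np

# Gram-Schmidt conversion from a 6D rotation representation to a rotation
# matrix. Assumed here: rot6d stores the first two (unnormalized) columns
# of the matrix; RDT2's exact convention is not confirmed by the repo docs.

def rot6d_to_matrix(r6):
    """r6: shape (6,), the first two columns of a rotation matrix."""
    a, b = r6[:3], r6[3:]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x        # drop the component of b along x
    y = b / np.linalg.norm(b)
    z = np.cross(x, y)              # third column completes the frame
    return np.stack([x, y, z], axis=1)

R = rot6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.1, 1.0, 0.0]))
print(np.allclose(R @ R.T, np.eye(3)))  # True: orthonormal output
```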

The chunk is executed by streaming it to the robot controller at 30 Hz. The model is re-run roughly every 0.8 seconds to predict a new chunk, with enough overlap between chunks to avoid jitter.
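That receding-horizon loop can be sketched with stubbed interfaces (`predict_chunk` and `send_frame` are placeholders of my own, not RDT2 APIs):

```python
import numpy as np

# Receding-horizon execution of action chunks. Each 24-frame chunk covers
# 0.8 s at 30 Hz; executing only a prefix and replanning early gives the
# overlap between consecutive chunks that hides inference latency.

def predict_chunk():                        # stand-in for model inference
    return np.zeros((24, 20))

def send_frame(frame):                      # stand-in for the 30 Hz controller
    pass

def run(n_chunks=3, replan_every=18):
    executed = 0
    for _ in range(n_chunks):
        chunk = predict_chunk()
        for frame in chunk[:replan_every]:  # execute the prefix only,
            send_frame(frame)               # then replan mid-motion
            executed += 1
    return executed

print(run())  # 3 chunks x 18 frames = 54 frames sent
```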

Fine-tuning on your data

If your task is quirky (e.g. manipulating inside a tight cabinet), you'll want to fine-tune. The repo provides two scripts:

  • scripts/finetune_full_param.sh — full-parameter, needs A100 80GB.
  • scripts/finetune_lora.sh — LoRA, runs on RTX 4090.

Standard 3-step data prep:

  1. Convert to WebDataset shards: each sample is image.jpg + action.npy + action_token.npy. Conversion scripts in data/ handle ROS bags or UMI recordings.
  2. Define a dataset config: YAML pointing at the data path + a normalizer (action chunk mean/std).
  3. Run training: DeepSpeed ZeRO-2 or ZeRO-3 depending on GPU. The paper recommends fewer than 5 epochs to avoid overfitting — the pretrained model is strong, a few epochs is enough.
bash scripts/finetune_lora.sh \
    --config configs/datasets/my_task.yaml \
    --output_dir checkpoints/my_task \
    --num_epochs 3
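The normalizer from step 2 is just per-dimension statistics pooled over all action chunks; a minimal sketch (the dummy data and dict keys are my own, not the repo's format):

```python
import numpy as np

# Per-dimension normalizer: mean/std pooled over every frame of every
# (24, 20) action chunk in the dataset.
chunks = np.random.default_rng(0).normal(size=(100, 24, 20))  # dummy dataset

flat = chunks.reshape(-1, 20)                     # pool frames across chunks
stats = {"mean": flat.mean(axis=0), "std": flat.std(axis=0) + 1e-8}

normalized = (chunks - stats["mean"]) / stats["std"]
print(np.allclose(normalized.reshape(-1, 20).mean(axis=0), 0.0))  # True
```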

Benchmark results

Per the paper, RDT2 hits some impressive marks:

  • Zero-shot generalization to unseen objects, scenes, instructions, and embodiments.
  • Beats SOTA baselines (π0, OpenVLA) on long-horizon tasks like table setting.
  • Dexterous tasks: table tennis with 1 m/s arm speed, archery at 100ms reaction time.
  • Deformable objects: generalizes to fabrics with new textures/sizes (unseen garments).

Inference latency:

  • RDT2-VQ: ~150-200 ms/chunk.
  • RDT2-FM (5-step): ~80-100 ms/chunk.
  • RDT2-FM (1-step distilled): ~20-30 ms/chunk — fast enough for real-time control.

Compared to training one policy per task, this is a step change in generalization.

Pitfalls and tips

  • Camera setup must match: wrist-view fisheye, correct FOV and mount angle. A 15° angle error can throw actions off completely.
  • Tracker calibration: test calibration before real tasks. I use a calibration cube to verify.
  • Don't skip Stage 2: if you fine-tune the full pipeline, don't skip flow matching. Quantization-only output will be jittery.
  • Speed limit: start at 30% speed and ramp up as you gain confidence. Bimanual collisions at full speed = broken grippers.
  • Start with LoRA fine-tuning before going full-param — much cheaper and faster.
  • Not every task is zero-shot: tasks far from the training distribution (e.g. surgery, micro-assembly) still need extra data.

Wrap-up

RDT2 is a major step forward for cross-embodiment manipulation. By scaling UMI data to 10,000+ hours and designing a three-stage training that bridges discrete language with continuous control, THU-ML built a foundation model that's truly "deployable" across different robots.

For Vietnamese engineers: if you have a UR5e or Franka, try the zero-shot demo before thinking about collecting your own data. You may be surprised — simple pick-and-place works out of the box. When you need task-specific behavior, LoRA fine-tuning on an RTX 4090 is enough.

Repo: thu-ml/RDT2 on GitHub. Models: robotics-diffusion-transformer/RDT2-VQ and RDT2-FM on Hugging Face. Project page: rdt-robotics.github.io/rdt2.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

