
Train your first UMI Diffusion Policy and test on a real robot arm

Train Diffusion Policy from UMI's replay_buffer.zarr.zip, understand training metrics, and deploy the policy to a real robot arm using eval_real_umi.py — with verified official configs.

Nguyễn Anh Tuấn · June 3, 2026 · 5 min read · Updated: Jun 6, 2026

This is Part 4 in the UMI + VLA series. This post assumes you have a replay_buffer.zarr.zip from Part 3.

Goal: train a Diffusion Policy baseline, understand the training curve, and run eval_real_umi.py to test the policy on a real robot.

Why train Diffusion Policy first, not a big VLA?

This question comes up often. The practical answer:

1. Much faster debugging. Diffusion Policy trains in hours on one GPU. A 3B+ VLA needs multiple GPUs and takes 1–2 days. If your data has issues (misaligned timestamps, noisy poses, wrong coordinate frame), you want to find out fast with a small model.

2. Reference baseline. If your VLA later doesn't work, you need to know: is the problem in the data or the model? A small baseline gives you a comparison point.

3. VLAs are not magic. VLAs improve language conditioning and generalization, but can't rescue bad data. If the Diffusion Policy baseline can't learn anything, a VLA won't either.

Training environment setup

cd universal_manipulation_interface
conda activate umi

# Check CUDA
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

VRAM requirements:

  • UMI Diffusion Policy (RGB-only, single camera): 1× 12–24 GB
  • UMI Diffusion Policy (2 cameras): 1× 24 GB recommended
  • If VRAM is limited: reduce batch_size in the config

View official configs

ls diffusion_policy/config/
ls diffusion_policy/config/task/

# Configs for UMI single-arm:
# - diffusion_policy/config/task/umi.yaml
# - diffusion_policy/config/train_diffusion_unet_timm_umi_workspace.yaml
# - diffusion_policy/config/train_diffusion_transformer_umi_workspace.yaml

Read diffusion_policy/config/task/umi.yaml to understand the dataset keys the config expects. They must match the keys in your replay_buffer.zarr.zip.
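
One way to check this before launching a long run is to diff the key names the config expects against the arrays stored in the archive. The sketch below is not part of the UMI repo: find_key_mismatches and list_replay_buffer_keys are hypothetical helpers, the key names in the usage lines are placeholders, and the reader function assumes zarr v2 with the arrays stored under a top-level data group.

```python
def find_key_mismatches(expected_keys, dataset_keys):
    """Return (missing, unused): keys the config wants but the dataset lacks,
    and keys stored in the dataset that the config never reads."""
    expected, actual = set(expected_keys), set(dataset_keys)
    return sorted(expected - actual), sorted(actual - expected)


def list_replay_buffer_keys(zip_path):
    """List array names under the 'data' group of a zarr zip archive.

    Assumes zarr v2 (zarr.ZipStore) and a top-level 'data' group.
    """
    import zarr  # imported lazily so the comparison helper has no dependencies

    store = zarr.ZipStore(zip_path, mode="r")
    try:
        root = zarr.open_group(store=store, mode="r")
        return sorted(root["data"].array_keys())
    finally:
        store.close()


# Placeholder key names for illustration; take the real ones from task/umi.yaml.
expected = ["camera0_rgb", "robot0_eef_pos", "robot0_gripper_width"]
stored = ["camera0_rgb", "robot0_eef_pos", "robot0_gripper_with"]  # note the typo
missing, unused = find_key_mismatches(expected, stored)
print("missing from dataset:", missing)
print("not used by config:", unused)
```

Some config keys may be computed by the dataset class at load time rather than stored in the archive, so treat a reported mismatch as a prompt to look, not as proof of a bug.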

Training: official command

UMI uses the Hydra config system. Basic command:

python train.py --config-name=train_diffusion_unet_timm_umi_workspace \
  task.dataset.dataset_path=/absolute/path/to/replay_buffer.zarr.zip \
  training.seed=42

Transformer architecture variant:

python train.py --config-name=train_diffusion_transformer_umi_workspace \
  task.dataset.dataset_path=/absolute/path/to/replay_buffer.zarr.zip \
  training.seed=42

Useful Hydra overrides:

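# Reduce batch size if OOM (in the stock Diffusion Policy configs the batch size
# sits under dataloader, not training; confirm with --cfg job)
python train.py --config-name=train_diffusion_unet_timm_umi_workspace \
  task.dataset.dataset_path=/path/to/replay_buffer.zarr.zip \
  dataloader.batch_size=16 \
  training.num_epochs=500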

# Change output directory
python train.py --config-name=train_diffusion_unet_timm_umi_workspace \
  task.dataset.dataset_path=/path/to/replay_buffer.zarr.zip \
  hydra.run.dir=outputs/umi_run_001

See all options:

python train.py --help
python train.py --config-name=train_diffusion_unet_timm_umi_workspace --cfg job
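
When running several variants, it can help to generate these commands rather than type them. The following is a minimal sketch, assuming only the override keys already shown above; build_train_command is a hypothetical helper and the output directory names are made up.

```python
import shlex


def build_train_command(config_name, overrides):
    """Build the argv list for train.py from a config name and Hydra overrides."""
    cmd = ["python", "train.py", f"--config-name={config_name}"]
    # Hydra overrides are plain key=value tokens appended after the config name.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


for seed in (42, 43, 44):
    cmd = build_train_command(
        "train_diffusion_unet_timm_umi_workspace",
        {
            "task.dataset.dataset_path": "/absolute/path/to/replay_buffer.zarr.zip",
            "training.seed": seed,
            "hydra.run.dir": f"outputs/umi_seed_{seed}",
        },
    )
    # Printed for review; pass cmd to subprocess.run(cmd, check=True) to launch.
    print(shlex.join(cmd))
```

Giving each seed its own hydra.run.dir keeps checkpoints from different runs apart.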

Reading training metrics

Watch these metrics while training:

Train loss: should trend downward over the run. If it oscillates with no downward trend, either the learning rate needs adjustment or the data has issues.

Val loss: if the config defines a validation split, val loss should decrease alongside train loss. A large or widening gap between the two indicates overfitting.

Epoch count: UMI typically needs 500–2000 epochs depending on dataset size. Don't stop too early.

Good signs (illustrative values):

Epoch 50:  train_loss=0.08, val_loss=0.09
Epoch 200: train_loss=0.04, val_loss=0.05
Epoch 500: train_loss=0.02, val_loss=0.03

Bad signs (illustrative values):

Epoch 50:  train_loss=0.08
Epoch 200: train_loss=0.08  ← not decreasing = wrong learning rate or data issue
Epoch 500: train_loss=0.09  ← increasing = unstable training

If loss doesn't decrease after 100 epochs:

  1. Verify the replay buffer path is correct
  2. Check dataset keys match the config
  3. Visualize 1–2 samples from the dataset to confirm data format
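
The good and bad patterns above can also be checked from logged values instead of by eye. This is a rough sketch, not something the UMI training script provides: diagnose_loss is a hypothetical helper, it only compares the first and last logged loss, and the 10% tolerance is an arbitrary choice.

```python
import math


def diagnose_loss(losses, rel_tol=0.1):
    """Classify logged losses as 'nan', 'decreasing', 'flat', or 'increasing'.

    Compares the last value against the first (which must be positive);
    relative changes smaller than rel_tol count as flat.
    """
    if any(math.isnan(x) for x in losses):
        return "nan"
    first, last = losses[0], losses[-1]
    change = (last - first) / first
    if change < -rel_tol:
        return "decreasing"
    if change > rel_tol:
        return "increasing"
    return "flat"


# The train_loss values from the two example logs above.
print(diagnose_loss([0.08, 0.04, 0.02]))  # decreasing
print(diagnose_loss([0.08, 0.08, 0.09]))  # increasing
```

A first-versus-last comparison ignores oscillation in between, so it complements the plotted curve rather than replacing it.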

Checkpoint inspection

After training, checkpoints are in the output directory:

ls outputs/umi_run_001/checkpoints/
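
If the listed checkpoint names embed the epoch and training loss, the lowest-loss file can be picked out programmatically. The sketch below assumes a filename pattern like epoch=0100-train_loss=0.012.ckpt; that pattern, the example names, and the best_checkpoint helper are assumptions, so compare them with what the listing shows for your run.

```python
import re

# Assumed filename pattern, e.g. "epoch=0100-train_loss=0.012.ckpt".
CKPT_PATTERN = re.compile(r"epoch=(\d+)-train_loss=([0-9.]+)\.ckpt$")


def best_checkpoint(filenames):
    """Return (filename, epoch, loss) for the lowest-loss checkpoint, or None."""
    parsed = []
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match:  # skips names like latest.ckpt that carry no metrics
            parsed.append((name, int(match.group(1)), float(match.group(2))))
    return min(parsed, key=lambda item: item[2]) if parsed else None


# Made-up names for illustration; in practice pass os.listdir(checkpoint_dir).
names = [
    "latest.ckpt",
    "epoch=0100-train_loss=0.040.ckpt",
    "epoch=0200-train_loss=0.021.ckpt",
]
print(best_checkpoint(names))
```

Lowest training loss does not guarantee the best behavior on the robot, so it is worth evaluating more than one checkpoint.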

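python -c "
import dill
import torch
# Diffusion Policy workspaces pickle checkpoints with dill, so load with the same module.
# weights_only=False is needed on recent PyTorch; only load checkpoints you trust.
path = 'outputs/umi_run_001/checkpoints/latest.ckpt'
ckpt = torch.load(open(path, 'rb'), map_location='cpu', pickle_module=dill, weights_only=False)
print('Top-level keys:', list(ckpt.keys()))  # key names depend on the workspace
print('Epoch:', ckpt.get('epoch', 'N/A'))
print('Train loss:', ckpt.get('train_loss', 'N/A'))
"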

Deploy to robot: eval_real_umi.py

UMI includes an official script for running the policy on a real robot:

python scripts_real/eval_real_umi.py --help

python scripts_real/eval_real_umi.py \
  --input outputs/umi_run_001/checkpoints/latest.ckpt \
  --output data/eval/run_001 \
  --frequency 10

Mandatory before any real-robot evaluation:

[ ] Hardware E-stop connected and tested
[ ] Joint limits set in robot config
[ ] Workspace box constraint declared
[ ] Dry-run if robot SDK supports it
[ ] Someone standing by the E-stop at all times
[ ] Initial speed set low (50% or less)
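
As an illustration of the workspace box item, a software guard can clamp every commanded position to an axis-aligned box before it reaches the robot. This is a minimal sketch under stated assumptions, not UMI's own safety code: clamp_to_workspace is a hypothetical helper and the limits are placeholder values in metres.

```python
def clamp_to_workspace(position, box_min, box_max):
    """Clamp an (x, y, z) target to an axis-aligned box.

    Returns the clamped position and a flag telling whether clipping happened.
    """
    clamped = tuple(
        min(max(value, low), high)
        for value, low, high in zip(position, box_min, box_max)
    )
    return clamped, clamped != tuple(position)


# Illustrative limits in metres, in the robot base frame; measure your own cell.
BOX_MIN = (0.20, -0.30, 0.02)
BOX_MAX = (0.70, 0.30, 0.50)

target, was_clipped = clamp_to_workspace((0.85, 0.10, -0.05), BOX_MIN, BOX_MAX)
print(target, was_clipped)  # (0.7, 0.1, 0.02) True
```

Logging every clipped command is useful, since frequent clipping usually points to an action scale or frame problem. A software clamp complements the hardware E-stop and never replaces it.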

Robot adapter

eval_real_umi.py requires a robot adapter: a Python class that connects policy outputs to your robot SDK. The UMI repo includes adapters for Franka and some other robots in scripts_real/. Check what's available:

ls scripts_real/
# control_franka.py, control_robots.py, control_wsg_spacemouse.py, ...

If your robot doesn't have an existing adapter, you'll need to implement the interface.
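
To give a feel for what such an interface might involve, here is a hypothetical minimal adapter with a dry-run implementation that records commands instead of moving hardware. The class and method names are invented for illustration and are not UMI's actual interface; read the existing scripts in scripts_real/ to see what the eval script really calls.

```python
from abc import ABC, abstractmethod


class RobotAdapter(ABC):
    """Hypothetical minimal interface between a policy and a robot SDK."""

    @abstractmethod
    def get_ee_pose(self):
        """Return the current end-effector pose as (x, y, z, rx, ry, rz)."""

    @abstractmethod
    def send_ee_pose(self, pose, duration_s):
        """Ask the robot to reach `pose` within `duration_s` seconds."""


class DryRunAdapter(RobotAdapter):
    """Records commands instead of moving hardware, for offline checks."""

    def __init__(self, start_pose):
        self.pose = tuple(start_pose)
        self.log = []

    def get_ee_pose(self):
        return self.pose

    def send_ee_pose(self, pose, duration_s):
        self.log.append((tuple(pose), duration_s))
        self.pose = tuple(pose)  # pretend the robot reached the target


robot = DryRunAdapter(start_pose=(0.4, 0.0, 0.3, 0.0, 0.0, 0.0))
robot.send_ee_pose((0.45, 0.0, 0.3, 0.0, 0.0, 0.0), duration_s=0.1)
print(robot.get_ee_pose(), len(robot.log))
```

A dry-run class like this lets you inspect the commands a policy would send before any hardware is connected.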

Test scenarios: easy to hard

Start with the easiest scenario first:

Scenario                           | Purpose
Object at exact demo position      | Basic sanity check
Object shifted 5 cm from demo      | Small spatial generalization
Object shifted 10–15 cm            | Larger spatial generalization
Slightly different lighting        | Visual robustness
Object different color, same shape | Object generalization
If the first scenario fails, with the object at the exact demo position, the fault lies in the policy or its deployment setup rather than in generalization. Check:

  1. Is camera calibration correct?
  2. Does the action convention (relative vs absolute) match?
  3. Is the coordinate frame consistent?
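
To compare scenarios across the easy-to-hard progression, it helps to record each trial and compute a success rate per scenario. The sketch below is one simple way to do that; success_rates is a hypothetical helper and the trial outcomes are made up.

```python
from collections import defaultdict


def success_rates(trials):
    """Map scenario name -> fraction of successful trials.

    `trials` is an iterable of (scenario, succeeded) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # scenario -> [successes, total]
    for scenario, succeeded in trials:
        counts[scenario][0] += int(succeeded)
        counts[scenario][1] += 1
    return {name: wins / total for name, (wins, total) in counts.items()}


# Made-up outcomes for illustration only.
trials = [
    ("exact position", True), ("exact position", True), ("exact position", False),
    ("shifted 5 cm", True), ("shifted 5 cm", False),
]
for name, rate in success_rates(trials).items():
    print(f"{name}: {rate:.0%}")
```

With only a handful of trials per scenario the rates are noisy, so run enough repetitions before drawing conclusions.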

Common errors

Error                          | Cause                                         | Fix
Loss doesn't decrease          | Wrong path or data format mismatch            | Print one sample from the dataset and compare it with the keys the config expects
OOM during training            | Batch size too large                          | Reduce the batch size to 8–16
Robot moves erratically        | Wrong action scale                            | Check normalization in the config; add workspace constraints
Policy fails to reproduce demo | Wrong gripper width in data                   | Re-run ArUco detection with better settings
Loss is NaN                    | Learning rate too high or exploding gradients | Lower the learning rate; add gradient clipping

If baseline works: next steps

If the Diffusion Policy baseline performs well (>60% success on the easiest scenario), you can:

  1. Collect more data — 100–200 demos for the current task
  2. Add diverse scenarios — different lighting, positions, object variations
  3. Scale to bimanual (Part 5)
  4. Fine-tune a VLA — if you want language conditioning (GR00T or other)

Don't rush to VLA fine-tuning if the baseline isn't stable — fix the data first.
