Think a humanoid robot walking down a hallway carrying a cup of coffee is trivial? In reality it is one of the hardest problems in humanoid control: every footstep is a shock wave propagating up the arm, and a few degrees of tray tilt is enough to spill the cup. Researchers at ArcLab (UC San Diego) just released SteadyTray — an RL framework that lets a Unitree G1 carry a tray loaded with glasses, cups, and surgical tools across uneven terrain under external disturbances, with a 96.9% success rate.
This post dissects the paper, explains why "residual policy" is the unlock, and walks you through reproducing the result on IsaacLab all the way to a MuJoCo sim-to-sim deployment.
Why is tray carrying so hard?
For humans, carrying a tray is an unconscious skill. A humanoid robot has to resolve three classic conflicts:
- Locomotion generates perturbations. Every footstep produces z-axis acceleration (landing impact) and xy sway. These travel through the hips, spine, and shoulders before finally reaching the tray.
- End-to-end RL collapses. If you train one policy to both walk and stabilize, the agent tends to "cheat" — standing still or walking very slowly to reduce perturbation, which kills command-velocity tracking.
- Sim-to-real gap compounds. Small simulation errors (friction, inertia, latency) amplify down the long kinematic chain and explode at the end-effector. Policies trained in sim typically die on hardware.
SteadyTray addresses all three with an architecture called ReST-RL (REsidual STabilization RL).
The core idea: decouple locomotion from stabilization
Instead of learning from scratch, ReST-RL freezes a pre-trained locomotion policy (one that already tracks velocity well) and trains a residual module whose only job is to produce small corrections that keep the tray level.
The overall formula for the Residual Action variant:
a_final = α_base · a_base(s) + α_residual · a_residual(s, s_priv) + q_default
- a_base(s): action from the locomotion policy (frozen)
- a_residual(·): correction action from the new module
- q_default: default joint positions
- α_base, α_residual: blending coefficients (typically α_base = 1, α_residual small)
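A minimal inference-time sketch of that blend (the function and module names are illustrative, not the repo's actual API):

```python
import torch

def blended_action(obs, priv_obs, base_policy, residual_net,
                   q_default, alpha_base=1.0, alpha_residual=0.2):
    """Residual Action variant: frozen base action plus a small learned correction."""
    with torch.no_grad():                    # the locomotion policy stays frozen
        a_base = base_policy(obs)
    a_res = residual_net(torch.cat([obs, priv_obs], dim=-1))
    return alpha_base * a_base + alpha_residual * a_res + q_default
```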
The second variant — Residual FiLM — is more elegant. Instead of adding to the action, it modulates the base policy's features via scale/shift (Feature-wise Linear Modulation):
y'_i = y_i · (1 + γ_{t,i}) + β_{t,i}
where γ, β come from the residual encoder. Result: FiLM learns faster and more stably than the action adapter, because it never has to "undo" an already-correct locomotion behavior.
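A minimal FiLM layer sketch, assuming the residual encoder maps its conditioning input (object pose plus privileged observations) to per-feature γ and β; this is illustrative, not the paper's exact module:

```python
import torch
import torch.nn as nn

class FiLMModulation(nn.Module):
    """Scale/shift base-policy features: y' = y * (1 + gamma) + beta."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        # residual encoder head that predicts gamma and beta for each feature
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return features * (1 + gamma) + beta
```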
Architecture in detail
Observation space
| Component | Size | Description |
|---|---|---|
| Proprioception | ℝ²⁹ | Joint pos, joint vel, base angular vel, gravity projection |
| Object pose (camera) | ℝ⁷ | Position 3D + quaternion 4D (from AprilTag) |
| Privileged (training only) | ℝ²⁴ | Base lin vel, tray/object vel, CoM offset |
| Goal | ℝ³ | Velocity command (v_x, v_y, ω_z) |
| History window | H=32 | Last 32 timesteps of context |
Note that privileged observations are training-only. At deployment, the student policy sees only proprioception + camera-based object pose — thanks to a distillation phase described below.
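To make the H=32 window concrete, here is a minimal sketch of how the deployment side might stack observation frames (class and field names are my own, not the repo's):

```python
import collections
import numpy as np

H = 32  # history window from the table above

class ObsHistory:
    """Keep the last H observation frames and flatten them for the student policy."""
    def __init__(self, frame_dim: int = 29 + 7 + 3):  # proprio + object pose + goal
        self.buf = collections.deque(
            [np.zeros(frame_dim, dtype=np.float32)] * H, maxlen=H)

    def push(self, proprio, obj_pose, goal):
        self.buf.append(np.concatenate([proprio, obj_pose, goal]).astype(np.float32))

    def flat(self) -> np.ndarray:
        return np.concatenate(self.buf)  # shape: (H * frame_dim,)
```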
Action space & control
29 DoF (full Unitree G1): 12 leg joints, 3 waist joints, and 14 arm joints (including the 2 wrists that tilt the tray). The policy outputs target joint positions; a PD controller at 1 kHz tracks them, while the policy itself runs at 50 Hz.
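A rough sketch of that control hierarchy (the gains and the step_sim hook are placeholders, not the deployed values):

```python
KP, KD = 60.0, 2.0   # illustrative PD gains
DECIMATION = 20      # 1 kHz PD loop / 50 Hz policy = 20 PD steps per policy step

def control_step(policy, obs, q, dq):
    """One 50 Hz policy step followed by DECIMATION low-level PD updates."""
    q_target = policy(obs)                    # policy outputs target joint positions
    for _ in range(DECIMATION):
        tau = KP * (q_target - q) - KD * dq   # PD torque toward the target
        q, dq = step_sim(tau)                 # hypothetical 1 kHz low-level step
    return q, dq
```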
Reward function
Three reward groups:
Locomotion group (inherited from base policy):
- Linear XY tracking: exp(-λ_vxy · ||v_xy - v̂_xy||²)
- Torso stability: exp(-λ_ω · ||ω_xy||²)
- Foot impact smoothness
Tray stabilization group (new):
- Object upright: exp(-λ_up · ||P_xy(g_obj)||²) — keeps the object's gravity vector perpendicular to the tray surface
- Tray orientation: exp(-λ · ||P_xy(g_tray)||²) — tray stays level
- Object-tray contact: 𝟙{contact} — object must not leave the tray
Penalty group:
- Action rate, torque, joint limits — the standard Isaac Lab humanoid regularizers.
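The tray-stabilization group above translates into something like the following (the λ coefficients are illustrative, not the paper's values):

```python
import torch

def tray_stabilization_reward(g_obj_xy, g_tray_xy, obj_in_contact,
                              lam_up=5.0, lam_tray=5.0):
    """g_obj_xy / g_tray_xy: xy projection of the gravity vector in object/tray frame."""
    r_upright = torch.exp(-lam_up * g_obj_xy.square().sum(dim=-1))     # object upright
    r_tray    = torch.exp(-lam_tray * g_tray_xy.square().sum(dim=-1))  # tray stays level
    r_contact = obj_in_contact.float()                                 # 1 if still on the tray
    return r_upright + r_tray + r_contact
```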
3-stage training + distillation
| Stage | Task | Init | Output |
|---|---|---|---|
| 1 | Locomotion only | Scratch | loc_policy.pt |
| 2 | + Tray balance (no object) | Load Stage 1 | tray_policy.pt |
| 3 | + Object stabilization (residual teacher w/ privileged obs) | Freeze base, train residual | teacher.pt |
| 4 | Distillation to student | Student sees proprio + camera only | student.pt (deploy) |
Distillation loss matches both latent and action:
L = ||z_teacher - z_student||² + ||a_teacher - a_student||²
Use a small learning rate of 5×10⁻⁵, enough for the student to imitate the teacher without forgetting locomotion.
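In PyTorch the loss is just two MSE terms (a sketch; the actual rsl_rl distillation code may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_teacher, a_teacher, z_student, a_student):
    """Match both the latent embedding and the action (MSE form of the terms above)."""
    return F.mse_loss(z_student, z_teacher) + F.mse_loss(a_student, a_teacher)

# optimizer = torch.optim.Adam(student.parameters(), lr=5e-5)
```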
Environment setup
You need an NVIDIA GPU (RTX 3090 or better recommended), CUDA 11.8+, and Docker. The paper trained on 2× RTX A6000 with 8192 parallel envs — you can drop to 2048–4096 envs on a single GPU.
Step 1 — Clone IsaacLab fork and SteadyTray
SteadyTray requires a custom IsaacLab fork (adds tray assets + object randomization):
# Create workspace
mkdir -p ~/steadytray_ws && cd ~/steadytray_ws
# Clone IsaacLab fork
git clone https://github.com/AllenHuangGit/IsaacLab_SteadyTray.git
# Clone SteadyTray repo
git clone https://github.com/AllenHuangGit/steadytray.git
cd steadytray
git lfs pull # fetch checkpoints from model/
Step 2 — Build the Docker image
Docker is strongly recommended because IsaacLab has a painful dependency stack (OpenUSD, Kit SDK):
cd ~/steadytray_ws/IsaacLab_SteadyTray
./docker/container.py start
./docker/container.py enter
Inside the container, mount SteadyTray:
# In container
cd /workspace/steadytray
python -m pip install -e source/steadytray
apt-get update && apt-get install -y git-lfs tmux
Step 3 — Verify the env
Run visualization to confirm assets load correctly:
python scripts/rsl_rl/play.py \
--task G1-Steady-Tray-Pre-Locomotion \
--num_envs 4 --video
You should see 4 G1 robots standing, each with a tray attached to both hands, ready to train.
Training stage by stage
Stage 1 — Basic locomotion
python scripts/rsl_rl/train.py \
--task G1-Steady-Tray-Pre-Locomotion \
--num_envs 4096 --headless \
--max_iterations 3000
Runtime: ~4h on 1× A6000. The reward curve plateaus once the robot walks stably across the [-1, 1] m/s velocity range.
Stage 2 — Hold the tray (no object yet)
python scripts/rsl_rl/train.py \
--task G1-Steady-Tray-Balance \
--num_envs 4096 --headless \
--resume --load_run <stage1_run_name> \
--max_iterations 2000
Now the tray is attached but empty. The policy learns to keep its hands level while walking.
Stage 3 — Train the residual teacher with privileged obs
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Teacher \
--num_envs 4096 --headless \
--resume --load_run <stage2_run_name> \
--residual.mode film \
--max_iterations 4000
Note --residual.mode film selects the FiLM variant (better than the action adapter in nearly every scenario). On 2 GPUs:
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Teacher \
--num_envs 8192 --headless \
--distributed --nproc_per_node=2 \
--resume --load_run <stage2_run_name>
Stage 4 — Distill into the student
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Distillation \
--num_envs 4096 --headless \
--resume --load_run <stage3_run_name> \
--distill.lr 5e-5 \
--max_iterations 2000
The student only sees proprioception + camera-based object pose, but learns to mimic the teacher.
Domain randomization — the sim-to-real key
This is the most overlooked piece, and it decides whether your policy survives on hardware. Config in source/steadytray/tasks/.../randomization_cfg.py:
| Parameter | Range | Why |
|---|---|---|
| Robot friction | [0.3, 1.0] | Real surfaces differ from sim carpet |
| Object mass | [0.05, 1.0] kg | Empty cup → water bottle |
| Tray mass | [0.3, 0.7] kg | Material tolerances |
| Object radius | [0.7, 1.5] × 0.03 m | Small cup → tall glass |
| Object height | [0.75, 2.0] × 0.1 m | Short bottle → tall bottle |
| Torso CoM offset | ±[2.5, 5, 5] cm | Real calibration error |
| Push robot | ±0.5 m/s every [3, 5] s | Hallway bumps |
| Push object | ±0.3 m/s every [2, 4] s | Object sliding on tray |
| Observation delay | 10–30 ms | Camera latency |
Skip any row and sim-to-real almost certainly fails.
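As a mental model, each environment reset draws one set of physics parameters from those ranges, roughly like this sketch (field names are illustrative, not the actual randomization_cfg.py):

```python
import numpy as np

def sample_randomization(rng: np.random.Generator) -> dict:
    """Draw one parameter set per environment reset (ranges from the table above)."""
    return {
        "friction":      rng.uniform(0.3, 1.0),
        "object_mass":   rng.uniform(0.05, 1.0),         # kg
        "tray_mass":     rng.uniform(0.3, 0.7),           # kg
        "object_radius": rng.uniform(0.7, 1.5) * 0.03,    # m
        "object_height": rng.uniform(0.75, 2.0) * 0.1,    # m
        "com_offset":    rng.uniform(-1, 1, 3) * np.array([0.025, 0.05, 0.05]),  # m
        "obs_delay_ms":  rng.uniform(10, 30),
    }
```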
Inference: run the trained policy
Play inside Isaac Lab
python scripts/rsl_rl/play.py \
--task G1-Steady-Object-Distillation \
--num_envs 8 \
--checkpoint model/model_9999.pt \
--video --video_length 500
Sim2sim to MuJoCo (before touching real hardware)
The deploy/deploy_mujoco/ directory contains scripts that port the policy to a MuJoCo environment — this is a mandatory stop before you dare to run on hardware:
cd deploy/deploy_mujoco
python run_mujoco.py --policy ../../exported/student_film.onnx
If the policy stays stable for 30s in MuJoCo with the heaviest object and strongest push, it has a good chance of working on the real G1. If it fails, go back and crank up domain randomization.
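A minimal version of that check, assuming a hypothetical env wrapper that exposes the tray normal and a drop flag (not the actual run_mujoco.py interface):

```python
import numpy as np

MAX_TILT_RAD = 0.2            # illustrative failure threshold
DURATION_S, DT = 30.0, 0.02   # 30 s rollout at the 50 Hz policy rate

def stress_test(env, policy) -> bool:
    """Roll out the worst case: heaviest object, strongest pushes."""
    obs = env.reset(object_mass=1.0, push_vel=0.5)   # upper ends of the table above
    for step in range(int(DURATION_S / DT)):
        obs, info = env.step(policy(obs))
        tilt = np.arccos(np.clip(info["tray_up_z"], -1.0, 1.0))  # tray normal vs gravity
        if tilt > MAX_TILT_RAD or info["object_dropped"]:
            print(f"failed at t = {step * DT:.2f} s")
            return False
    return True
```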
Paper results
| Method | Command Track Success | Tilt (rad) | Vel Error (m/s) |
|---|---|---|---|
| Base policy (frozen loco) | 47.4% | 0.179 | 0.110 |
| End-to-End RL | 89.1% | 0.046 | 0.116 |
| ReST-RL (FiLM) | 96.9% | 0.046 | 0.106 |
In the push-robot test (hard shoves to the hip), ReST-RL (FiLM) achieves 84.6% success vs the base policy's 9.1% — a 9× gap. Similarly with push-object (hitting the cup on the tray): ReST-RL 74.6% vs base 25.2%.
Real-robot experiments: G1 carried a full wine glass, surgical tools on a tray, walked across uneven floors while being actively shoved by a human — nothing fell, nothing spilled.
Real-world applications
This is far from a toy problem. The same framework applies to:
- Restaurants / hospitality: humanoid servers carrying food and drinks.
- Healthcare: robotic assistants moving specimens and sterile instruments.
- Logistics: carrying unsecured packages over uneven warehouse floors.
- Home: household robots carrying plates without breakage.
The ReST-RL framework can even generalize to any task with two conflicting objectives: a mobile robot that must move fast yet keep its camera imagery stable, or a drone holding a payload steady through wind gusts.
Common pitfalls
- Skipping observation delay. Sim policies react instantly; real robots have 10–30 ms latency. Without delay randomization → the policy oscillates wildly on hardware.
- Residual coefficient too large. If α_residual = 1.0 from step 0, the residual overrides locomotion → robot falls within the first 10 steps. Warm up from 0.1 to 1.0 (see the sketch after this list).
- Skipping Stage 2. Many people rush from locomotion to object. Result: the policy never learned the tray exists, and wrist actions explode.
- Under-training distillation. Student needs ~2000 iterations to catch up to the teacher. Don't stop at 500.
- Using the action adapter instead of FiLM. The action adapter is easier to learn but caps out 5–7% below FiLM on success rate.
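The warm-up mentioned above can be as simple as a linear ramp on the residual coefficient (values are illustrative):

```python
def residual_alpha(iteration: int, warmup_iters: int = 500,
                   alpha_start: float = 0.1, alpha_end: float = 1.0) -> float:
    """Linearly ramp α_residual so early corrections stay small."""
    frac = min(iteration / warmup_iters, 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)
```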
Related Posts
- RL for Humanoid — Part 1: Introduction
- RL for Humanoid — Part 8: Loco-manipulation
- Fine-tune GR00T N1 on Isaac Lab