Think a humanoid robot walking down a hallway carrying a cup of coffee is trivial? In reality it is one of the hardest problems in humanoid control: every footstep is a shock wave propagating up the arm, and a few degrees of tray tilt is enough to spill the cup. Researchers at ArcLab (UC San Diego) just released SteadyTray — an RL framework that lets a Unitree G1 carry a tray loaded with glasses, cups, and surgical tools across uneven terrain under external disturbances, with a 96.9% success rate.
This post dissects the paper, explains why "residual policy" is the unlock, and walks you through reproducing the result on IsaacLab all the way to a MuJoCo sim-to-sim deployment.
Why is tray carrying so hard?
For humans, carrying a tray is an unconscious skill. A humanoid robot has to resolve three classic conflicts:
- Locomotion generates perturbations. Every footstep produces z-axis acceleration (landing impact) and xy sway. These travel through the hips, spine, and shoulders before finally reaching the tray.
- End-to-end RL collapses. If you train one policy to both walk and stabilize, the agent tends to "cheat" — standing still or walking very slowly to reduce perturbation, which kills command-velocity tracking.
- Sim-to-real gap compounds. Small simulation errors (friction, inertia, latency) amplify down the long kinematic chain and explode at the end-effector. Policies trained in sim typically die on hardware.
SteadyTray addresses all three with an architecture called ReST-RL (REsidual STabilization RL).
The core idea: decouple locomotion from stabilization
Instead of learning from scratch, ReST-RL freezes a pre-trained locomotion policy (one that already tracks velocity well) and trains a residual module whose only job is to produce small corrections that keep the tray level.
The overall formula for the Residual Action variant:
a_final = α_base · a_base(s) + α_residual · a_residual(s, s_priv) + q_default
- a_base(s): action from the locomotion policy (frozen)
- a_residual(·): correction action from the new module
- q_default: default joint positions
- α_base, α_residual: blending coefficients (typically α_base = 1, α_residual small)
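A minimal inference-time sketch of that blend (the function and module names are illustrative, not the repo's actual API):

```python
import torch

def blended_action(obs, priv_obs, base_policy, residual_net,
                   q_default, alpha_base=1.0, alpha_residual=0.2):
    """Residual Action variant: frozen base action plus a small learned correction."""
    with torch.no_grad():                    # the locomotion policy stays frozen
        a_base = base_policy(obs)
    a_res = residual_net(torch.cat([obs, priv_obs], dim=-1))
    return alpha_base * a_base + alpha_residual * a_res + q_default
```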
The second variant — Residual FiLM — is more elegant. Instead of adding to the action, it modulates the base policy's features via scale/shift (Feature-wise Linear Modulation):
y'_i = y_i · (1 + γ_{t,i}) + β_{t,i}
where γ, β come from the residual encoder. Result: FiLM learns faster and more stably than the action adapter, because it never has to "undo" an already-correct locomotion behavior.
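A minimal FiLM layer sketch, assuming the residual encoder maps its conditioning input (object pose plus privileged observations) to per-feature γ and β; this is illustrative, not the paper's exact module:

```python
import torch
import torch.nn as nn

class FiLMModulation(nn.Module):
    """Scale/shift base-policy features: y' = y * (1 + gamma) + beta."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        # residual encoder head that predicts gamma and beta for each feature
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return features * (1 + gamma) + beta
```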
Architecture in detail
Observation space
| Component | Size | Description |
|---|---|---|
| Proprioception | ℝ²⁹ | Joint pos, joint vel, base angular vel, gravity projection |
| Object pose (camera) | ℝ⁷ | Position 3D + quaternion 4D (from AprilTag) |
| Privileged (training only) | ℝ²⁴ | Base lin vel, tray/object vel, CoM offset |
| Goal | ℝ³ | Velocity command (v_x, v_y, ω_z) |
| History window | H=32 | Last 32 timesteps of context |
Note that privileged observations are training-only. At deployment, the student policy sees only proprioception + camera-based object pose — thanks to a distillation phase described below.
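To make the H=32 window concrete, here is a minimal sketch of how the deployment side might stack observation frames (class and field names are my own, not the repo's):

```python
import collections
import numpy as np

H = 32  # history window from the table above

class ObsHistory:
    """Keep the last H observation frames and flatten them for the student policy."""
    def __init__(self, frame_dim: int = 29 + 7 + 3):  # proprio + object pose + goal
        self.buf = collections.deque(
            [np.zeros(frame_dim, dtype=np.float32)] * H, maxlen=H)

    def push(self, proprio, obj_pose, goal):
        self.buf.append(np.concatenate([proprio, obj_pose, goal]).astype(np.float32))

    def flat(self) -> np.ndarray:
        return np.concatenate(self.buf)  # shape: (H * frame_dim,)
```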
Action space & control
29 DoF (full Unitree G1): 12 leg joints, 3 waist joints, and 14 arm joints (including the 2 wrists that tilt the tray). The policy outputs target joint positions; a PD controller at 1 kHz tracks them, while the policy itself runs at 50 Hz.
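A rough sketch of that control hierarchy (the gains and the step_sim hook are placeholders, not the deployed values):

```python
KP, KD = 60.0, 2.0   # illustrative PD gains
DECIMATION = 20      # 1 kHz PD loop / 50 Hz policy = 20 PD steps per policy step

def control_step(policy, obs, q, dq):
    """One 50 Hz policy step followed by DECIMATION low-level PD updates."""
    q_target = policy(obs)                    # policy outputs target joint positions
    for _ in range(DECIMATION):
        tau = KP * (q_target - q) - KD * dq   # PD torque toward the target
        q, dq = step_sim(tau)                 # hypothetical 1 kHz low-level step
    return q, dq
```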
Reward function
Three reward groups:
Locomotion group (inherited from base policy):
- Linear XY tracking: exp(-λ_vxy · ||v_xy - v̂_xy||²)
- Torso stability: exp(-λ_ω · ||ω_xy||²)
- Foot impact smoothness
Tray stabilization group (new):
- Object upright: exp(-λ_up · ||P_xy(g_obj)||²) — keeps the object's gravity vector perpendicular to the tray surface
- Tray orientation: exp(-λ · ||P_xy(g_tray)||²) — tray stays level
- Object-tray contact: 𝟙{contact} — object must not leave the tray
Penalty group:
- Action rate, torque, joint limits — the standard Isaac Lab humanoid regularizers.
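The tray-stabilization group above translates into something like the following (the λ coefficients are illustrative, not the paper's values):

```python
import torch

def tray_stabilization_reward(g_obj_xy, g_tray_xy, obj_in_contact,
                              lam_up=5.0, lam_tray=5.0):
    """g_obj_xy / g_tray_xy: xy projection of the gravity vector in object/tray frame."""
    r_upright = torch.exp(-lam_up * g_obj_xy.square().sum(dim=-1))     # object upright
    r_tray    = torch.exp(-lam_tray * g_tray_xy.square().sum(dim=-1))  # tray stays level
    r_contact = obj_in_contact.float()                                 # 1 if still on the tray
    return r_upright + r_tray + r_contact
```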
3-stage training + distillation
| Stage | Task | Init | Output |
|---|---|---|---|
| 1 | Locomotion only | Scratch | loc_policy.pt |
| 2 | + Tray balance (no object) | Load Stage 1 | tray_policy.pt |
| 3 | + Object stabilization (residual teacher w/ privileged obs) | Freeze base, train residual | teacher.pt |
| 4 | Distillation to student | Student sees proprio + camera only | student.pt (deploy) |
Distillation loss matches both latent and action:
L = ||z_teacher - z_student||² + ||a_teacher - a_student||²
Use a small learning rate of 5×10⁻⁵, enough for the student to imitate the teacher without forgetting locomotion.
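In PyTorch the loss is just two MSE terms (a sketch; the actual rsl_rl distillation code may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_teacher, a_teacher, z_student, a_student):
    """Match both the latent embedding and the action (MSE form of the terms above)."""
    return F.mse_loss(z_student, z_teacher) + F.mse_loss(a_student, a_teacher)

# optimizer = torch.optim.Adam(student.parameters(), lr=5e-5)
```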
Environment setup
You need an NVIDIA GPU (RTX 3090 or better recommended), CUDA 11.8+, and Docker. The paper trained on 2× RTX A6000 with 8192 parallel envs — you can drop to 2048–4096 envs on a single GPU.
Step 1 — Clone IsaacLab fork and SteadyTray
SteadyTray requires a custom IsaacLab fork (adds tray assets + object randomization):
# Create workspace
mkdir -p ~/steadytray_ws && cd ~/steadytray_ws
# Clone IsaacLab fork
git clone https://github.com/AllenHuangGit/IsaacLab_SteadyTray.git
# Clone SteadyTray repo
git clone https://github.com/AllenHuangGit/steadytray.git
cd steadytray
git lfs pull # fetch checkpoints from model/
Step 2 — Build the Docker image
Docker is strongly recommended because IsaacLab has a painful dependency stack (OpenUSD, Kit SDK):
cd ~/steadytray_ws/IsaacLab_SteadyTray
./docker/container.py start
./docker/container.py enter
Inside the container, mount SteadyTray:
# In container
cd /workspace/steadytray
python -m pip install -e source/steadytray
apt-get update && apt-get install -y git-lfs tmux
Step 3 — Verify the env
Run visualization to confirm assets load correctly:
python scripts/rsl_rl/play.py \
--task G1-Steady-Tray-Pre-Locomotion \
--num_envs 4 --video
You should see 4 G1 robots standing, each with a tray attached to both hands, ready to train.
Training stage by stage
Stage 1 — Basic locomotion
python scripts/rsl_rl/train.py \
--task G1-Steady-Tray-Pre-Locomotion \
--num_envs 4096 --headless \
--max_iterations 3000
Runtime: ~4h on 1× A6000. The reward curve plateaus once the robot walks stably across the [-1, 1] m/s velocity range.
Stage 2 — Hold the tray (no object yet)
python scripts/rsl_rl/train.py \
--task G1-Steady-Tray-Balance \
--num_envs 4096 --headless \
--resume --load_run <stage1_run_name> \
--max_iterations 2000
Now the tray is attached but empty. The policy learns to keep its hands level while walking.
Stage 3 — Train the residual teacher with privileged obs
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Teacher \
--num_envs 4096 --headless \
--resume --load_run <stage2_run_name> \
--residual.mode film \
--max_iterations 4000
Note --residual.mode film selects the FiLM variant (better than the action adapter in nearly every scenario). On 2 GPUs:
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Teacher \
--num_envs 8192 --headless \
--distributed --nproc_per_node=2 \
--resume --load_run <stage2_run_name>
Stage 4 — Distill into the student
python scripts/rsl_rl/train.py \
--task G1-Steady-Object-Distillation \
--num_envs 4096 --headless \
--resume --load_run <stage3_run_name> \
--distill.lr 5e-5 \
--max_iterations 2000
The student only sees proprioception + camera-based object pose, but learns to mimic the teacher.
Domain randomization — the sim-to-real key
This is the most overlooked piece, and it decides whether your policy survives on hardware. Config in source/steadytray/tasks/.../randomization_cfg.py:
| Parameter | Range | Why |
|---|---|---|
| Robot friction | [0.3, 1.0] | Real surfaces differ from sim carpet |
| Object mass | [0.05, 1.0] kg | Empty cup → water bottle |
| Tray mass | [0.3, 0.7] kg | Material tolerances |
| Object radius | [0.7, 1.5] × 0.03 m | Small cup → tall glass |
| Object height | [0.75, 2.0] × 0.1 m | Short bottle → tall bottle |
| Torso CoM offset | ±[2.5, 5, 5] cm | Real calibration error |
| Push robot | ±0.5 m/s every [3, 5] s | Hallway bumps |
| Push object | ±0.3 m/s every [2, 4] s | Object sliding on tray |
| Observation delay | 10–30 ms | Camera latency |
Skip any row and sim-to-real almost certainly fails.
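As a mental model, each environment reset draws one set of physics parameters from those ranges, roughly like this sketch (field names are illustrative, not the actual randomization_cfg.py):

```python
import numpy as np

def sample_randomization(rng: np.random.Generator) -> dict:
    """Draw one parameter set per environment reset (ranges from the table above)."""
    return {
        "friction":      rng.uniform(0.3, 1.0),
        "object_mass":   rng.uniform(0.05, 1.0),         # kg
        "tray_mass":     rng.uniform(0.3, 0.7),           # kg
        "object_radius": rng.uniform(0.7, 1.5) * 0.03,    # m
        "object_height": rng.uniform(0.75, 2.0) * 0.1,    # m
        "com_offset":    rng.uniform(-1, 1, 3) * np.array([0.025, 0.05, 0.05]),  # m
        "obs_delay_ms":  rng.uniform(10, 30),
    }
```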
Inference: run the trained policy
Play inside Isaac Lab
python scripts/rsl_rl/play.py \
--task G1-Steady-Object-Distillation \
--num_envs 8 \
--checkpoint model/model_9999.pt \
--video --video_length 500
Sim2sim to MuJoCo (before touching real hardware)
The deploy/deploy_mujoco/ directory contains scripts that port the policy to a MuJoCo environment — this is a mandatory stop before you dare to run on hardware:
cd deploy/deploy_mujoco
python run_mujoco.py --policy ../../exported/student_film.onnx
If the policy stays stable for 30s in MuJoCo with the heaviest object and strongest push, it has a good chance of working on the real G1. If it fails, go back and crank up domain randomization.
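A minimal version of that check, assuming a hypothetical env wrapper that exposes the tray normal and a drop flag (not the actual run_mujoco.py interface):

```python
import numpy as np

MAX_TILT_RAD = 0.2            # illustrative failure threshold
DURATION_S, DT = 30.0, 0.02   # 30 s rollout at the 50 Hz policy rate

def stress_test(env, policy) -> bool:
    """Roll out the worst case: heaviest object, strongest pushes."""
    obs = env.reset(object_mass=1.0, push_vel=0.5)   # upper ends of the table above
    for step in range(int(DURATION_S / DT)):
        obs, info = env.step(policy(obs))
        tilt = np.arccos(np.clip(info["tray_up_z"], -1.0, 1.0))  # tray normal vs gravity
        if tilt > MAX_TILT_RAD or info["object_dropped"]:
            print(f"failed at t = {step * DT:.2f} s")
            return False
    return True
```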
Paper results
| Method | Command Track Success | Tilt (rad) | Vel Error (m/s) |
|---|---|---|---|
| Base policy (frozen loco) | 47.4% | 0.179 | 0.110 |
| End-to-End RL | 89.1% | 0.046 | 0.116 |
| ReST-RL (FiLM) | 96.9% | 0.046 | 0.106 |
In the push-robot test (hard shoves to the hip), ReST-RL (FiLM) achieves 84.6% success vs the base policy's 9.1% — a 9× gap. Similarly with push-object (hitting the cup on the tray): ReST-RL 74.6% vs base 25.2%.
Real-robot experiments: G1 carried a full wine glass, surgical tools on a tray, walked across uneven floors while being actively shoved by a human — nothing fell, nothing spilled.
Real-world applications
This is far from a toy problem. The same framework applies to:
- Restaurants / hospitality: humanoid servers carrying food and drinks.
- Healthcare: robotic assistants moving specimens and sterile instruments.
- Logistics: carrying unsecured packages over uneven warehouse floors.
- Home: household robots carrying plates without breakage.
The ReST-RL framework can even generalize to any task with two conflicting objectives: a mobile robot that must move fast yet keep its camera imagery stable, or a drone holding a payload steady through wind gusts.
Common pitfalls
- Skipping observation delay. Sim policies react instantly; real robots have 10–30 ms latency. Without delay randomization → the policy oscillates wildly on hardware.
- Residual coefficient too large. If α_residual = 1.0 from step 0, the residual overrides locomotion → robot falls within the first 10 steps. Warm up from 0.1 to 1.0 (see the sketch after this list).
- Skipping Stage 2. Many people rush from locomotion to object. Result: the policy never learned the tray exists, and wrist actions explode.
- Under-training distillation. Student needs ~2000 iterations to catch up to the teacher. Don't stop at 500.
- Using the action adapter instead of FiLM. The action adapter is easier to learn but caps out 5–7% below FiLM on success rate.
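The warm-up mentioned above can be as simple as a linear ramp on the residual coefficient (values are illustrative):

```python
def residual_alpha(iteration: int, warmup_iters: int = 500,
                   alpha_start: float = 0.1, alpha_end: float = 1.0) -> float:
    """Linearly ramp α_residual so early corrections stay small."""
    frac = min(iteration / warmup_iters, 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)
```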
Related Posts
- RL for Humanoid — Part 1: Introduction
- RL for Humanoid — Part 8: Loco-manipulation
- Fine-tune GR00T N1 on Isaac Lab