humanoid · reinforcement-learning · unitree-g1 · isaac-lab · residual-rl · sim-to-real · manipulation

SteadyTray: Unitree G1 Tray Carrying with Residual RL

Reproduce SteadyTray on IsaacLab — train a Unitree G1 humanoid to carry a tray without dropping objects using a two-stage Residual RL architecture.

Nguyễn Anh Tuấn · April 19, 2026 · 10 min read

Think it's trivial for a humanoid robot to walk down a hallway with a cup of coffee? In reality it is one of the hardest problems in humanoid control: every footstep sends a shock wave up the arms, and a few degrees of tray tilt is enough to spill the cup. Researchers at ArcLab (UC San Diego) just released SteadyTray — an RL framework that lets a Unitree G1 carry a tray loaded with glasses, cups, and surgical tools across uneven terrain and under external disturbances, with a 96.9% success rate.

This post dissects the paper, explains why "residual policy" is the unlock, and walks you through reproducing the result on IsaacLab all the way to a MuJoCo sim-to-sim deployment.

Humanoid robot in a research environment — illustrating humanoid manipulation while locomoting

Why is tray carrying so hard?

For humans, carrying a tray is an unconscious skill. A humanoid robot has to resolve three classic conflicts:

  1. Locomotion generates perturbations. Every footstep produces z-axis acceleration (landing impact) and xy sway. These propagate through the hips, spine, and shoulders before reaching the tray.
  2. End-to-end RL collapses. If you train one policy to both walk and stabilize, the agent tends to "cheat" — standing still or walking very slowly to reduce perturbation, which kills command-velocity tracking.
  3. Sim-to-real gap compounds. Small simulation errors (friction, inertia, latency) amplify down the long kinematic chain and explode at the end-effector. Policies trained in sim typically die on hardware.

SteadyTray addresses all three with an architecture called ReST-RL (REsidual STabilization RL).

The core idea: decouple locomotion from stabilization

Instead of learning from scratch, ReST-RL freezes a pre-trained locomotion policy (one that already tracks velocity well) and trains a residual module whose only job is to produce small corrections that keep the tray level.

The overall formula for the Residual Action variant:

a_final = α_base · a_base(s) + α_residual · a_residual(s, s_priv) + q_default
  • a_base(s): action from the locomotion policy (frozen)
  • a_residual(·): correction action from the new module
  • q_default: default joint positions
  • α_base, α_residual: blending coefficients (typically α_base = 1, α_residual small)
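
As a rough sketch, the blending step looks like the following — a minimal PyTorch illustration where the function name, default coefficients, and warm-up schedule are placeholders, not the repo's actual API:

import torch

def blend_actions(a_base: torch.Tensor, a_residual: torch.Tensor,
                  q_default: torch.Tensor,
                  alpha_base: float = 1.0, alpha_residual: float = 0.1):
    # Frozen locomotion action plus a small correction, offset by the
    # default joint positions (alpha_residual is warmed up during training).
    return alpha_base * a_base + alpha_residual * a_residual + q_default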

The second variant — Residual FiLM — is more elegant. Instead of adding to the action, it modulates the base policy's features via scale/shift (Feature-wise Linear Modulation):

y'_i = y_i · (1 + γ_{t,i}) + β_{t,i}

where γ, β come from the residual encoder. Result: FiLM learns faster and more stably than the action adapter, because it never has to "undo" an already-correct locomotion behavior.
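
For intuition, a minimal FiLM layer fits in a few lines of PyTorch — class and variable names here are illustrative, not taken from the SteadyTray codebase:

import torch.nn as nn

class FiLMLayer(nn.Module):
    """Scale/shift the base policy's hidden features y using gamma/beta
    predicted from the residual encoder's conditioning vector."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, y, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return y * (1.0 + gamma) + beta  # y'_i = y_i * (1 + gamma_i) + beta_i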

Architecture in detail

Observation space

| Component | Size | Description |
| --- | --- | --- |
| Proprioception | ℝ²⁹ | Joint pos, joint vel, base angular vel, gravity projection |
| Object pose (camera) | ℝ⁷ | 3D position + 4D quaternion (from AprilTag) |
| Privileged (training only) | ℝ²⁴ | Base lin vel, tray/object vel, CoM offset |
| Goal | ℝ³ | Velocity command (v_x, v_y, ω_z) |
| History window | H = 32 | Last 32 timesteps of context |

Note that privileged observations are training-only. At deployment, the student policy sees only proprioception + camera-based object pose — thanks to a distillation phase described below.

Action space & control

29 DoF (full Unitree G1): 12 leg joints, 3 waist joints, and 14 arm joints (including the 2 wrist joints that tilt the tray). The policy outputs target joint positions; a PD controller running at 1 kHz tracks them, while the policy itself steps at 50 Hz.
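
In code, the 50 Hz policy / 1 kHz PD split is just a decimation loop — a minimal sketch with illustrative gains, not the deployed controller:

DECIMATION = 20  # 1000 Hz PD / 50 Hz policy

def pd_torque(q_target, q, qd, kp=100.0, kd=2.0):
    # Position-target PD law, applied DECIMATION times per policy step
    # while q_target stays fixed at the policy's latest output.
    return kp * (q_target - q) - kd * qd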

Reward function

Three reward groups:

Locomotion group (inherited from base policy):

  • Linear XY tracking: exp(-λ_vxy · ||v_xy - v̂_xy||²)
  • Torso stability: exp(-λ_ω · ||ω_xy||²)
  • Foot impact smoothness

Tray stabilization group (new):

  • Object upright: exp(-λ_up · ||P_xy(g_obj)||²) — keeps the object's gravity vector perpendicular to the tray surface
  • Tray orientation: exp(-λ · ||P_xy(g_tray)||²) — tray stays level
  • Object-tray contact: 𝟙{contact} — object must not leave the tray

Penalty group:

  • Action rate, torque, joint limits — the standard Isaac Lab humanoid regularizers.
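
As a concrete sketch of the stabilization terms above, the "object upright" reward penalizes the xy-projection of the gravity vector expressed in the tray frame — the coefficient and tensor layout below are illustrative, not the paper's values:

import torch

def upright_reward(g_in_tray_frame: torch.Tensor, lam: float = 5.0):
    # exp(-lam * ||P_xy(g)||^2): maximal when gravity is perpendicular
    # to the tray surface, i.e. its xy projection is zero.
    g_xy = g_in_tray_frame[..., :2]
    return torch.exp(-lam * g_xy.square().sum(dim=-1))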

3-stage training + distillation

| Stage | Task | Init | Output |
| --- | --- | --- | --- |
| 1 | Locomotion only | Scratch | loc_policy.pt |
| 2 | + Tray balance (no object) | Load Stage 1 | tray_policy.pt |
| 3 | + Object stabilization (residual teacher w/ privileged obs) | Freeze base, train residual | teacher.pt |
| 4 | Distillation to student | Student sees proprio + camera only | student.pt (deploy) |

Distillation loss matches both latent and action:

L = ||z_teacher - z_student||² + ||a_teacher - a_student||²

The student is trained with a small learning rate of 5×10⁻⁵ — enough to imitate the teacher without forgetting locomotion.
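
In PyTorch the loss is two MSE terms — a minimal sketch, assuming z denotes the policy's latent embedding and a its action output:

import torch.nn.functional as F

def distillation_loss(z_teacher, z_student, a_teacher, a_student):
    # Match both the latent representation and the final action;
    # teacher tensors are detached since only the student is updated.
    return (F.mse_loss(z_student, z_teacher.detach())
            + F.mse_loss(a_student, a_teacher.detach()))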

Environment setup

You need an NVIDIA GPU (RTX 3090 or better recommended), CUDA 11.8+, and Docker. The paper trained on 2× RTX A6000 with 8192 parallel envs — you can drop to 2048–4096 envs on a single GPU.

Step 1 — Clone IsaacLab fork and SteadyTray

SteadyTray requires a custom IsaacLab fork (adds tray assets + object randomization):

# Create workspace
mkdir -p ~/steadytray_ws && cd ~/steadytray_ws

# Clone IsaacLab fork
git clone https://github.com/AllenHuangGit/IsaacLab_SteadyTray.git

# Clone SteadyTray repo
git clone https://github.com/AllenHuangGit/steadytray.git
cd steadytray
git lfs pull  # fetch checkpoints from model/

Step 2 — Build the Docker image

Docker is strongly recommended because IsaacLab has a painful dependency stack (OpenUSD, Kit SDK):

cd ~/steadytray_ws/IsaacLab_SteadyTray
./docker/container.py start
./docker/container.py enter

Inside the container, mount SteadyTray:

# In container
cd /workspace/steadytray
python -m pip install -e source/steadytray
apt-get update && apt-get install -y git-lfs tmux

Step 3 — Verify the env

Run visualization to confirm assets load correctly:

python scripts/rsl_rl/play.py \
  --task G1-Steady-Tray-Pre-Locomotion \
  --num_envs 4 --video

You should see 4 G1 robots standing, each holding a tray in both hands — confirmation that the environment is ready for training.

Training stage by stage

Stage 1 — Basic locomotion

python scripts/rsl_rl/train.py \
  --task G1-Steady-Tray-Pre-Locomotion \
  --num_envs 4096 --headless \
  --max_iterations 3000

Runtime: ~4h on 1× A6000. The reward curve plateaus once the robot walks stably across the [-1, 1] m/s velocity range.

Stage 2 — Hold the tray (no object yet)

python scripts/rsl_rl/train.py \
  --task G1-Steady-Tray-Balance \
  --num_envs 4096 --headless \
  --resume --load_run <stage1_run_name> \
  --max_iterations 2000

Now the tray is attached but empty. The policy learns to keep its hands level while walking.

Stage 3 — Train the residual teacher with privileged obs

python scripts/rsl_rl/train.py \
  --task G1-Steady-Object-Teacher \
  --num_envs 4096 --headless \
  --resume --load_run <stage2_run_name> \
  --residual.mode film \
  --max_iterations 4000

Note --residual.mode film for the FiLM variant (better in nearly every scenario than the action adapter). On 2 GPUs:

python scripts/rsl_rl/train.py \
  --task G1-Steady-Object-Teacher \
  --num_envs 8192 --headless \
  --distributed --nproc_per_node=2 \
  --resume --load_run <stage2_run_name>

Stage 4 — Distill into the student

python scripts/rsl_rl/train.py \
  --task G1-Steady-Object-Distillation \
  --num_envs 4096 --headless \
  --resume --load_run <stage3_run_name> \
  --distill.lr 5e-5 \
  --max_iterations 2000

The student only sees proprioception + camera-based object pose, but learns to mimic the teacher.

Domain randomization — the sim-to-real key

This is the most overlooked piece, and it decides whether your policy survives on hardware. Config in source/steadytray/tasks/.../randomization_cfg.py:

| Parameter | Range | Why |
| --- | --- | --- |
| Robot friction | [0.3, 1.0] | Real surfaces differ from sim carpet |
| Object mass | [0.05, 1.0] kg | Empty cup → water bottle |
| Tray mass | [0.3, 0.7] kg | Material tolerances |
| Object radius | [0.7, 1.5] × 0.03 m | Small cup → tall glass |
| Object height | [0.75, 2.0] × 0.1 m | Short bottle → tall bottle |
| Torso CoM offset | ±[2.5, 5, 5] cm | Real calibration error |
| Push robot | ±0.5 m/s every [3, 5] s | Hallway bumps |
| Push object | ±0.3 m/s every [2, 4] s | Object sliding on tray |
| Observation delay | 10–30 ms | Camera latency |

Skip any row and sim-to-real almost certainly fails.
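
A hedged sketch of what per-episode sampling of a few of these terms might look like — plain Python with hypothetical key names, not the actual Isaac Lab event config:

import numpy as np

def sample_randomization(rng: np.random.Generator) -> dict:
    # Ranges mirror the table above; applied at each episode reset.
    return {
        "robot_friction": rng.uniform(0.3, 1.0),
        "object_mass_kg": rng.uniform(0.05, 1.0),
        "tray_mass_kg":   rng.uniform(0.3, 0.7),
        "obs_delay_ms":   rng.uniform(10.0, 30.0),
    }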

RL training dashboard — visualizing reward curves, episode length, domain randomization stats

Inference: run the trained policy

Play inside Isaac Lab

python scripts/rsl_rl/play.py \
  --task G1-Steady-Object-Distillation \
  --num_envs 8 \
  --checkpoint model/model_9999.pt \
  --video --video_length 500

Sim2sim to MuJoCo (before touching real hardware)

The deploy/deploy_mujoco/ directory contains scripts that port the policy to a MuJoCo environment — this is a mandatory stop before you dare to run on hardware:

cd deploy/deploy_mujoco
python run_mujoco.py --policy ../../exported/student_film.onnx

If the policy stays stable for 30s in MuJoCo with the heaviest object and strongest push, it has a good chance of working on the real G1. If it fails, go back and crank up domain randomization.

Paper results

| Method | Success | Tilt (rad) | Command-Tracking Vel Error (m/s) |
| --- | --- | --- | --- |
| Base policy (frozen loco) | 47.4% | 0.179 | 0.110 |
| End-to-End RL | 89.1% | 0.046 | 0.116 |
| ReST-RL (FiLM) | 96.9% | 0.046 | 0.106 |

In the push-robot test (hard shoves to the hip), ReST-RL (FiLM) achieves 84.6% success vs the base policy's 9.1% — a dramatic gap. Similarly for push-object (hitting the cup on the tray): ReST-RL 74.6% vs base 25.2%.

Real-robot experiments: G1 carried a full wine glass, surgical tools on a tray, walked across uneven floors while being actively shoved by a human — nothing fell, nothing spilled.

Real-world applications

This is far from a toy problem. The same framework applies to:

  • Restaurants / hospitality: humanoid servers carrying food and drinks.
  • Healthcare: robotic assistants moving specimens and sterile instruments.
  • Logistics: carrying unsecured packages over uneven warehouse floors.
  • Home: household robots carrying plates without breakage.

The ReST-RL framework can even generalize to any task with two conflicting objectives — e.g. a camera-carrying mobile robot that must move fast yet keep the image stable, or a drone holding a payload steady through wind gusts.

Common pitfalls

  1. Skipping observation delay. Sim policies react instantly; real robots have 10–30 ms of latency. Without delay randomization, the policy oscillates wildly on hardware — see the delay-buffer sketch after this list.
  2. Residual coefficient too large. If α_residual = 1.0 from step 0, the residual overrides locomotion → robot falls within the first 10 steps. Warm up from 0.1 to 1.0.
  3. Skipping Stage 2. Many people rush from locomotion to object. Result: the policy never learned the tray exists, and wrist actions explode.
  4. Under-training distillation. Student needs ~2000 iterations to catch up to the teacher. Don't stop at 500.
  5. Using the action adapter instead of FiLM. The action adapter is easier to train but caps out 5–7% below FiLM on success rate.
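
For pitfall 1, a minimal delay-buffer sketch — feed the policy an observation from a few control ticks in the past during training (class name and structure are illustrative):

from collections import deque

class DelayBuffer:
    """Simulates 10-30 ms observation latency: the policy always sees
    the observation from `delay_steps` control ticks ago."""
    def __init__(self, delay_steps: int):
        self.buf = deque(maxlen=delay_steps + 1)

    def push(self, obs):
        self.buf.append(obs)
        return self.buf[0]  # oldest entry = the delayed observation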


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.


6/5/202611 phút đọc