If you have followed humanoid RL over the last few years, you will recognize the pattern: pretrain in a massive simulator (usually Isaac Gym or IsaacLab with PPO), then zero-shot sim-to-real. The recipe works, but it is expensive — PPO is on-policy and needs hundreds of millions of timesteps, and when you port the policy to a new environment (Brax, Bullet, a real robot), it typically fails outright and takes hours of additional training to adapt. LIFT — an ICLR 2026 paper from BIGAI — attacks exactly this pain point.
LIFT (Large-scale pretraining + efficient FineTuning) proposes a three-stage pipeline: (1) pretrain SAC in MuJoCo Playground with large-batch updates and a high UTD ratio, (2) learn a physics-informed world model from the replay buffer, (3) fine-tune in the target environment (Brax or real hardware) by running deterministic actions in the real env while stochastic exploration happens inside the world model. The result: convergence in ~1 hour on an NVIDIA 4090, transfer to Brax and Booster T1 with only 80–590 seconds of real data.
Why is pretrain + finetune hard for humanoids?
For manipulators, pretrain-then-finetune is relatively easy: the dynamics are mostly kinematic, inertias are low, contacts are short. Humanoids have three traits that make naïve transfer almost always fail:
- Foot under-actuation — there is no joint at the ground contact. Balance depends entirely on reaction forces. A tiny mismatch in friction between MuJoCo and Brax is enough to turn "walks forward" into "falls backwards".
- Long kinematic chains, discrete contacts — ground forces propagate ankle → knee → hip → waist → shoulder. Each physics engine handles contacts differently (soft vs hard constraints, spring-damper, Coulomb approximations). The sim-to-sim gap is sometimes larger than sim-to-real.
- Action rate matters — humanoid policies need smooth actions. PPO with high entropy bonuses tends to produce jittery policies that oscillate in a different environment.
LIFT picks SAC, an off-policy algorithm whose replay buffer can be reused for the world model, and adds an action-rate L2 penalty to the reward, forcing smoothness from day one of pretraining. This is an important architectural choice — during pretrain you might not appreciate it, but once you reach fine-tuning you will be glad to start from a replay buffer and a smooth policy.
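The action-rate term is just an L2 penalty on the difference between consecutive actions. A minimal JAX sketch of such a shaping term, with an illustrative weight and reward composition that are not the paper's exact values:

```python
import jax.numpy as jnp

def action_rate_penalty(action: jnp.ndarray,
                        prev_action: jnp.ndarray,
                        weight: float = 0.1) -> jnp.ndarray:
    """L2 penalty on the change in action between consecutive control steps.

    Subtracted from the task reward so the policy learns smooth actuation
    from the start of pretraining. The weight is illustrative, not LIFT's.
    """
    return weight * jnp.sum(jnp.square(action - prev_action))

# Hypothetical reward composition:
# reward = tracking_reward - action_rate_penalty(a_t, a_prev)
```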
LIFT's three-stage architecture
Stage 1 — SAC pretrain in MuJoCo Playground
Instead of PPO with tens of millions of env-steps, LIFT uses SAC in JAX with three key tweaks:
- Large-batch updates: batch sizes in the thousands to saturate the GPU.
- High UTD (Update-To-Data) ratio: for each env-step the networks are updated many times (typically 4–16 gradient steps). A high UTD ratio extracts more learning from each stored transition, so less exploration is needed (see the sketch after this list).
- 1024 parallel envs in MuJoCo Playground: Playground runs on MJX (MuJoCo-on-JAX), so parallelism is essentially free — no Isaac required.
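To make the UTD ratio concrete, here is a minimal sketch of the stage-1 outer loop; the interfaces (envs, buffer, sac_update) are placeholders rather than the repo's API:

```python
def train_stage1(envs, buffer, policy, train_state, sac_update,
                 num_outer_steps, utd_ratio=8, batch_size=4096):
    """High-UTD SAC outer loop: one batched env-step, many gradient steps.

    envs, buffer, policy, and sac_update are hypothetical interfaces, not the
    repo's API; the point is the ratio of gradient updates to env-steps.
    """
    obs = envs.reset()  # shape (num_envs, obs_dim), e.g. 1024 parallel MJX envs
    for _ in range(num_outer_steps):
        # One synchronous step across all parallel environments.
        actions = policy.sample(train_state.params, obs)
        next_obs, rewards, dones = envs.step(actions)
        buffer.add(obs, actions, rewards, next_obs, dones)
        obs = next_obs
        # High UTD: reuse stored data several times before collecting more.
        for _ in range(utd_ratio):
            batch = buffer.sample(batch_size)  # large-batch update
            train_state = sac_update(train_state, batch)
    return train_state
```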
With this setup, the T1LowDimJoystickFlatTerrain task (Booster T1 following a joystick on flat ground) converges in ~40M timesteps, ~1 hour on a 4090. Compare that to PPO on Isaac Lab, which typically needs 4–8 hours and more VRAM for the same task.
If you are new to RL and want to understand SAC vs PPO, read AI Series 1: RL Basics. If you want a MuJoCo Playground intro first, see Sim Series 2: MuJoCo deep dive.
Stage 2 — Physics-informed world model
This is the original contribution of LIFT. Instead of a black-box world model like Dreamer (GRU + RSSM), LIFT factors the architecture into two branches:
- Lagrangian dynamics branch — writes the mechanical equation M(q)q̈ + C(q,q̇)q̇ + g(q) = τ + Jᵀf_c explicitly. The network only learns small residuals for M, C, g that correct the mismatch between the MJX sim and the target env, instead of relearning the whole dynamics from scratch.
- Contact residual predictor — a small MLP taking (q, q̇, foot height, torque) and predicting the contact force residual: the part that is "hard for theory but easy for data".
Upside: the world model generalises far better than a pure RSSM because the structural part is supplied by physics rather than learned. The WM training data comes from the stage-1 replay buffer — one more reason the authors chose off-policy SAC instead of PPO.
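A minimal sketch of what the physics-informed forward prediction could look like, assuming analytic M, C, g terms plus small learned residual networks (all names are illustrative, not the paper's code):

```python
import jax.numpy as jnp

def predict_qdd(q, qd, tau, f_contact, J,
                nominal, residual_M, residual_Cg, contact_mlp):
    """Physics-informed forward dynamics: analytic terms plus learned residuals.

    nominal holds the analytic M(q), C(q, qd), g(q) from the robot model;
    residual_M, residual_Cg, contact_mlp are small learned corrections.
    All names and signatures are illustrative, not LIFT's actual code.
    """
    M = nominal.mass_matrix(q) + residual_M(q)
    bias = nominal.coriolis(q, qd) @ qd + nominal.gravity(q) + residual_Cg(q, qd)
    f_c = f_contact + contact_mlp(q, qd)  # corrected contact force
    # Solve M qdd = tau + J^T f_c - C qd - g for the predicted acceleration.
    return jnp.linalg.solve(M, tau + J.T @ f_c - bias)
```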
Stage 3 — Fine-tuning: deterministic env + stochastic world model
This is the most important idea and the easiest to miss. During fine-tuning, LIFT does:
- In the real env (Brax or Booster T1): run the policy deterministically — take the mean μ(s) of the Gaussian policy, do not sample. This is safe (no wild surprises) and the resulting data is clean.
- Inside the world model: roll out stochastically — sample actions from N(μ, σ), compute rewards with the WM, and backprop gradients through both the WM and the policy.
This split does two things: (a) preserves the exploration SAC needs without hurting the real robot, (b) when the WM drifts from the true env, the loss pulls both the WM and the policy back into agreement.
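A minimal sketch of one fine-tune iteration under this split, with hypothetical interfaces rather than LIFT's actual code:

```python
def finetune_iteration(policy, world_model, real_env, collect_deterministic,
                       num_wm_rollouts=64):
    """One fine-tune iteration: deterministic collection, stochastic imagination.

    All interfaces here (collect_deterministic, world_model.rollout, .update)
    are hypothetical placeholders, not LIFT's actual code.
    """
    # (1) Real env: act with the policy mean mu(s) only; log clean transitions.
    real_batch = collect_deterministic(real_env, policy)

    # (2) Fit the world model to the freshly collected real transitions.
    world_model = world_model.update(real_batch)

    # (3) Imagination: sample stochastic actions from N(mu, sigma) inside the
    #     WM and backprop the imagined return through both WM and policy.
    imagined = world_model.rollout(policy, real_batch.states, num_wm_rollouts)
    policy = policy.update(imagined)
    return policy, world_model
```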
Cross-reference: if you want to know where sim2real typically fails, read Sim Series 5: Sim2Real Pipeline — most of those gaps are exactly what LIFT's fine-tune step fills in.
Installing LIFT-humanoid from scratch
System requirements
| Component | Minimum | Notes |
|---|---|---|
| OS | Ubuntu 22.04 | Author-tested |
| Python | 3.10 | Other versions can break JAX |
| GPU | NVIDIA 4090 or H800 | 24GB VRAM is enough for T1 low-dim |
| RAM | 32GB+ | Replay buffer is large |
| CUDA | 12.x | jax[cuda12] |
Setup steps
```bash
conda create -n lift python=3.10 -c conda-forge -y
conda activate lift
git clone https://github.com/bigai-ai/LIFT-humanoid.git
cd LIFT-humanoid
# Install the repo's patched MuJoCo Playground (custom envs)
cd mujoco_playground && pip install -e . && cd ..
# Install the Brax env wrapper
cd brax_env && pip install -e . && cd ..
# Install remaining deps (jax, flax, dill, wandb, ...)
pip install -r requirements.txt
```
Verify that JAX sees the GPU:
```python
import jax
print(jax.devices())  # should list CudaDevice(0)
```
If it says CPU-only, re-check jax[cuda12] and that LD_LIBRARY_PATH points to CUDA 12.
Pretrain SAC (Stage 1)
Canonical command for Booster T1, low-dim state, flat terrain:
```bash
CUDA_VISIBLE_DEVICES=0 python train_in_mujoco_playground.py \
    --env_name=T1LowDimSimFinetuneJoystickFlatTerrain \
    --domain_randomization \
    --num_timesteps 40000000 \
    --save_buffer_data \
    --wandb_entity your_wandb_entity
```
Flags worth noticing:
- --domain_randomization: randomize friction, COM, motor gain. Mandatory if you expect sim2sim to work later.
- --num_timesteps 40000000: ~40M env-steps, which is ~40k outer steps with 1024 parallel envs.
- --save_buffer_data: REQUIRED. This is the replay buffer consumed by stage 2.
- --wandb_entity: drop it if you do not use W&B.
After pretrain, the log folder looks like:
```
logs/T1LowDimSimFinetuneJoystickFlatTerrain-20260424-101530-abc12/
├── checkpoints/   # config snapshots
├── policies/      # policy_*.pkl (dill-pickled Flax state)
└── buffer_data/   # transitions for WM training
```
Reward sanity check: in W&B, eval/episode_reward should climb to ~250+ on flat terrain. If after 20M steps it still sits below 100, your domain randomization is probably too aggressive.
Pretrain the world model (Stage 2)
```bash
python train_wm_from_file.py \
    --env_name=T1LowDimSimFinetuneJoystickFlatTerrain \
    --data_path=logs/T1LowDimSimFinetuneJoystickFlatTerrain-20260424-101530-abc12
```
The script:
- Reads all transitions under buffer_data/.
- Fits the Lagrangian residual (M, C, g corrections) with an MSE loss on q̈.
- Fits the contact residual MLP.
- Saves wm_states/wm_*.pkl.
Wall time: ~15–30 minutes on a 4090. This is the cheapest stage. For newcomers I recommend:
- Plot predicted q̈ vs actual q̈ on a few trajectories — eyeball that the WM is not wildly off.
- Compare a WM rollout against an env rollout (1–2 s) — if the divergence stays under ~5% for the first 200 ms, the WM is good enough for fine-tuning (see the sketch below).
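A minimal sketch of that divergence check, assuming simple wm.step / env.step / policy.act interfaces (placeholders, not the repo's API):

```python
import numpy as np

def wm_matches_env(wm, env, policy, horizon=100, tol=0.05, hz=100):
    """Roll the WM and the env from the same start state and compare states.

    wm.step, env.step, and policy.act are placeholder interfaces; the point
    is the check itself: relative state error over the first ~200 ms.
    """
    obs_env = env.reset()
    obs_wm = obs_env
    errors = []
    for _ in range(horizon):
        action = policy.act(obs_env)  # deterministic action
        obs_env = env.step(action)
        obs_wm = wm.step(obs_wm, action)
        rel_err = np.linalg.norm(obs_wm - obs_env) / (np.linalg.norm(obs_env) + 1e-8)
        errors.append(rel_err)
    first_200ms = int(0.2 * hz)  # e.g. 20 steps at 100 Hz
    return float(np.mean(errors[:first_200ms])) < tol
```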
Sim2Sim fine-tune in Brax (Stage 3)
This is the critical test — if the MuJoCo-pretrained policy survives Brax (a totally different physics engine), then real-world transfer is also likely to work.
```bash
python finetune.py \
    --env_name=T1LowDimSimFinetuneJoystickFlatTerrain \
    --ac_training_state_path=logs/.../policies/policy_final.pkl \
    --wm_training_state_path=logs/.../wm_states/wm_final.pkl
```
What you will observe:
- Iter 0–5: zero-shot policy in Brax. If pretrain is healthy, the robot still walks but wobbles at target velocities > 1 m/s (outside the pretrain distribution).
- Iter 5–20: action rate ticks up slightly, policy fixes balance. WM loss drops as it sees more Brax-specific states.
- Iter 20+: stable. The authors report smooth walking up to 1.5 m/s.
Fine-tuning on the real Booster T1
Real workflow:
- Load the sim2sim-adapted policy (output of stage 3).
- Deploy on Booster T1 — run deterministically for 30–60s of gentle joystick commands.
- Log state-action-reward into a buffer.
- Offline update: fine-tune policy + WM with the new buffer (same script, but the env data is now real data); see the sketch after this list.
- Redeploy.
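A minimal sketch of steps 2 to 4, assuming a hypothetical robot interface and reward function (robot.get_obs, robot.send, reward_fn are placeholders):

```python
import numpy as np

def collect_real_episode(robot, act_fn, reward_fn, duration_s=60, hz=100):
    """Deterministic deployment with logging (steps 2-4 of the real workflow).

    robot, act_fn, and reward_fn are hypothetical interfaces: the hardware
    driver, the jitted deterministic policy, and the same reward used in sim.
    """
    buffer = []
    obs = robot.get_obs()
    for _ in range(int(duration_s * hz)):
        action = np.asarray(act_fn(obs))  # mean action, no sampling
        robot.send(action)
        next_obs = robot.get_obs()
        buffer.append((obs, action, reward_fn(obs, action, next_obs), next_obs))
        obs = next_obs
    return buffer  # feed this into the offline fine-tune step
```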
The authors report: with 80–590 seconds of real data, the robot goes from "wobbly, unstable" to "upright, smooth gait, tight velocity tracking" on grass, mud, concrete and slopes. Compared to Dreamer baselines that need hours, this is a ~50–100× improvement.
If the "residual RL" philosophy is more your thing, contrast it with SteadyTray: Residual RL on IsaacLab — two very different ways of attacking the same sim2real problem.
Inference: deploying the policy
The final policy is a small Flax MLP (typically 256-256-12 for T1 low-dim). Deployment:
```python
import dill
import jax
import jax.numpy as jnp
import numpy as np
from flax import linen as nn  # used to rebuild the policy module from the saved config

# Load the dill-pickled Flax training state saved by pretrain / fine-tune.
state = dill.load(open("policy_final.pkl", "rb"))
params = state.params
# policy_net: the Flax policy MLP, rebuilt from the repo's network config
# (not shown here); robot is your hardware interface.

@jax.jit
def act(obs):
    mu, _ = policy_net.apply(params, obs)
    return mu  # deterministic: use the mean, do not sample

# 100 Hz control loop
while True:
    obs = robot.get_obs()
    action = act(jnp.asarray(obs))
    robot.send(np.asarray(action))
```
Latency < 1ms on a Jetson Orin. Caveat: observation scaling/normalization must match pretrain exactly — a 1% mismatch is enough to wreck the gait. Store the mean/std in the checkpoint and reuse them at deploy time.
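A minimal sketch of carrying those statistics with the checkpoint; the obs_mean / obs_std attribute names are illustrative, store them however your checkpoint format allows:

```python
import numpy as np

# At pretrain time: save the running observation statistics next to the params.
# At deploy time: apply exactly the same normalization before calling act().
obs_mean = np.asarray(state.obs_mean)  # illustrative attribute names
obs_std = np.asarray(state.obs_std)

def normalize(obs: np.ndarray) -> np.ndarray:
    """Match the pretrain-time observation scaling exactly."""
    return (obs - obs_mean) / (obs_std + 1e-8)

# In the control loop: action = act(jnp.asarray(normalize(obs)))
```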
Headline results
- Pretrain speed: ~1h on a 4090 for T1 flat joystick, 4–8× faster than PPO baselines at matching reward.
- Sim2sim transfer: MuJoCo-pretrained policy runs in Brax at 1.0 m/s out-of-the-box; 20 fine-tune iterations reach 1.5 m/s (beyond the pretrain distribution).
- Real-world T1: with 80–590 s of real fine-tuning data, the policy handles grass / mud / concrete / slopes with no reward re-shaping.
- Cross-robot: the same pipeline works on Unitree G1 (full-body with arm + waist DOF) and T1 (12 DOF leg-only).
Common pitfalls when reproducing
- Forgetting --save_buffer_data → stage 2 has no data. Re-running pretrain costs 1 hour.
- Too aggressive domain randomization → SAC fails to converge. Start with a friction range of [0.5, 1.0], then widen it gradually.
- UTD ratio too high → the network overfits the replay buffer and eval reward drops. Start with UTD=4 and ramp up.
- Exploding WM residuals → L2 penalty on residuals too low; the robot flies into the sky during WM rollouts. Tighten regularization on M, C corrections.
- No safety during real fine-tune → the robot falls, literally. Always use a harness + e-stop and cap action magnitude at 70% of the pretrain range for the first iteration.
Quick comparison with other approaches
| Method | Pretrain | Finetune | Real data | Notes |
|---|---|---|---|---|
| PPO Isaac Lab | 4–8h | Zero-shot, patched via DR | 0 | Often fails on new terrain |
| Dreamer-V3 | 10h+ | Online WM | Several hours | Data-inefficient for humanoids |
| LIFT | 1h | WM + deterministic env | 80–590s | Best for data-efficient transfer |
| DAgger + IL | Needs demos | Supervised | Hours | Requires expert demonstrations |
LIFT is not a silver bullet — if you already have a PPO + DR pipeline that works and your team does not know JAX, porting is expensive. But if you are starting fresh and care about data efficiency, this is a baseline worth copying.
When to (and when not to) use LIFT
Use it when:
- You have a 4090 or H800 and care about data efficiency (minimize real rollout).
- Your team is comfortable with JAX / Flax.
- The robot is fragile and you want to keep real-world rollouts in the minutes range.
Reconsider when:
- You already have a PPO + IsaacLab pipeline that works — switching cost is high.
- The task needs complex reward shaping (not pure locomotion) — the paper mainly validates locomotion.
- There is no MuJoCo Playground equivalent for your custom robot — you must port the hardware first.
Wrap-up
LIFT gives a clear recipe for humanoid control: quickly pretrain off-policy SAC in MuJoCo Playground, learn a world model with the physics structure baked in, then fine-tune by separating exploration (inside the WM) from data collection (deterministic, in the real env). The result is a pipeline that drives real-world data requirements down to a few hundred seconds — a big step for practical humanoid RL. Full source code at github.com/bigai-ai/LIFT-humanoid, paper at arXiv:2601.21363.