In the 2026 humanoid robot boom, one engineering challenge keeps frustrating research teams: how do you make a policy trained in simulation run on a real robot without expensive fine-tuning? The answer from the Booster Robotics team at ICRA 2026 is Booster Gym — an open-source end-to-end reinforcement learning framework, validated directly on the real T1 humanoid, capable of zero-shot sim-to-real transfer for omnidirectional walking, 10-degree slope climbing, and recovery from a 10kg ball impact.
This article walks through the whole pipeline: paper motivation, the PPO asymmetric actor-critic architecture, installing Isaac Gym + Booster Gym, training with domain randomization, cross-simulation validation on MuJoCo, and finally deploying on a real T1 via the Python SDK. If you're comfortable with ROS2 or have read the RL Humanoid series, you'll be able to follow along.
What is Booster Gym? Why does it matter for ICRA 2026?
Before Booster Gym, frameworks like Humanoid-Gym (NVIDIA Isaac Gym + Unitree H1) and Berkeley Humanoid had already proven sim-to-real transfer for bipedal locomotion. The catch: every team had to re-implement reward functions, domain randomization, and deployment SDKs from scratch — no shared standard. Booster Gym addresses that by providing a single framework, from training to deployment, that runs on real hardware, with fully open-source code.
The paper Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion — Booster Robotics, ICRA 2026 — has four main contributions:
- Complete pipeline from Isaac Gym training → MuJoCo/Webots cross-validation → real T1 deployment via Python DDS middleware.
- Comprehensive domain randomization along 3 axes: robot dynamics (mass, CoM), actuator (Kp, Kd, friction, 0–20ms latency), environment (terrain, kicks, friction).
- Series-Parallel Conversion Module for parallel ankles — train on a virtual serial structure, convert via the transposed Jacobian at deployment (see the sketch after this list). This is a key innovation because many humanoids (Booster T1, Unitree G1) use parallel ankles, but simulators usually only support serial chains.
- Open-source repo at github.com/BoosterRobotics/booster_gym — including pre-trained policy and Python deployment SDK.
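To make the conversion idea concrete, here is a minimal sketch of the transposed-Jacobian mapping. This is an illustration of the principle, not the repo's implementation: the function name and the 2×2 Jacobian values are made up.

```python
import numpy as np

# Illustrative sketch (not the repo's code): map torques on the virtual
# serial ankle joints (pitch, roll) to the two parallel actuators.
def serial_to_parallel_torque(tau_serial: np.ndarray, J: np.ndarray) -> np.ndarray:
    # If qdot_serial = J @ qdot_parallel, then by virtual work
    # tau_parallel = J.T @ tau_serial.
    return J.T @ tau_serial

# Made-up Jacobian at the current ankle configuration:
J = np.array([[0.5,  0.5],
              [0.4, -0.4]])
tau_parallel = serial_to_parallel_torque(np.array([1.0, 0.2]), J)
```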
Booster T1 is a humanoid with 12 active joints, IMU + joint encoders, parallel ankle structure, and an onboard CPU strong enough to run a 50Hz policy via JIT compilation. The robot is more affordable than Unitree G1 or Tesla Optimus, making it well-suited for academic labs.
Architecture Overview
Booster Gym uses PPO with asymmetric actor-critic, a pattern that Walk These Ways proved effective for legged locomotion:
- Actor network receives only proprioceptive observations (what the real robot can measure): base angular velocity, gravity vector from the IMU, 12 joint positions + velocities, the previous action, velocity commands (vx, vy, ωyaw), and a gait cycle signal (cos, sin) that encodes the periodic walking pattern.
- Critic network receives the full state in simulation: everything the actor sees plus privileged info like terrain height, friction coefficient, and true CoM offset, things only the simulator knows. The critic estimates a better value function during training, but only the actor is shipped at deployment, so no privileged data is needed on the real robot.
Action space: 12-dimensional joint position offsets. The policy output a_t is added to a nominal pose q_0 to get the desired joint position q_des = q_0 + a_t. A higher-frequency PD controller (typically 1kHz) then converts this to torque: τ = Kp·(q_des - q) - Kd·q̇.
Observations (39-D)               Action (12-D)
┌───────────────────┐            ┌─────────────┐
│ ω_base (3)        │            │ joint pos   │
│ gravity (3)       │            │ offsets     │
│ joint pos (12)    │  → MLP →   │ a_t (12)    │ → PD → τ → Robot
│ joint vel (12)    │            │             │
│ prev action (12)  │            └─────────────┘
│ cmd (3)           │
│ gait cos/sin (2)  │
│ ...               │
└───────────────────┘
Policy frequency: 50Hz (20ms cycle). Measured real-world latency: 9–12ms round-trip, with policy inference under 1ms — plenty of headroom for DDS communication.
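To make the timing concrete, here is a minimal sketch of the two-rate structure: the policy updates q_des once per 20 ms cycle while the PD loop runs at 1 kHz. The gains, decimation factor, and nominal pose are placeholders, not T1 values.

```python
import numpy as np

KP, KD = 20.0, 0.5       # placeholder per-joint PD gains
DECIMATION = 20          # 1 kHz PD steps per 50 Hz policy step

def pd_torque(q_des, q, qd):
    # tau = Kp * (q_des - q) - Kd * qdot
    return KP * (q_des - q) - KD * qd

q, qd = np.zeros(12), np.zeros(12)
q0 = np.zeros(12)                  # nominal pose (placeholder)
a_t = np.zeros(12)                 # policy output: joint position offsets
q_des = q0 + a_t                   # held constant for the whole 20 ms cycle
for _ in range(DECIMATION):
    tau = pd_torque(q_des, q, qd)
    # ... step the plant/simulator at 1 kHz with tau here ...
```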
Reward function — 19 components
This is the most subtle part, since reward shaping decides whether the policy is robust. Booster Gym groups them into 4 categories:
Tracking rewards (positive):
- track_lin_vel_xy_exp: rewards exp(−‖v_xy_target − v_xy‖²/σ), tracking the xy velocity command (see the sketch at the end of this section).
- track_ang_vel_yaw_exp: the same form for the yaw rate.
- feet_swing: weight 3.0, encourages lifting the feet on the correct gait phase.
Posture & stability:
- base_height: weight −20.0, heavily penalizes deviation from the target height (~0.7 m).
- orientation: weight −5.0, uses the gravity vector in the base frame to penalize tilt.
Energy & smoothness:
- torques, dof_acc, action_rate: penalize high torques, joint accelerations, and abrupt action changes.
- power: penalizes Σ|τ·q̇|, reducing real-world power draw.
Safety:
- collision: weight −1.0, penalizes collisions between non-foot links.
- dof_pos_limits, dof_vel_limits: penalize hitting hard joint limits.
Practical tip: the base_height weight of −20.0 is much larger in magnitude than in comparable papers (typically −10). The reason is that the Booster T1 has a parallel ankle, and base-height oscillation easily excites ankle resonance during deployment; the heavy penalty trades some tracking accuracy for stability.
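As an illustration, here is a minimal sketch of the exponential tracking term, assuming batched torch tensors as in a vectorized Isaac Gym environment; the σ value is a guess, not the repo's.

```python
import torch

def track_lin_vel_xy_exp(v_cmd_xy: torch.Tensor,
                         v_base_xy: torch.Tensor,
                         sigma: float = 0.25) -> torch.Tensor:
    # reward = exp(-||v_target - v||^2 / sigma), computed per environment
    err = torch.sum((v_cmd_xy - v_base_xy) ** 2, dim=-1)
    return torch.exp(-err / sigma)

# Example: 4096 parallel envs, 2-D commanded vs. measured base velocity
r = track_lin_vel_xy_exp(torch.zeros(4096, 2), torch.randn(4096, 2))
```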
Domain randomization — the key to zero-shot transfer
This is the part you CANNOT skip if you want the policy to work on the real robot. Booster Gym randomizes 3 axes:
1. Robot dynamics (per episode):
- Trunk mass: ±15% from nominal.
- CoM offset of trunk and each link: ±5cm on 3 axes.
- Joint friction & damping: scale 0.5×–1.5×.
2. Actuator characteristics:
- Kp scale: 0.85×–1.15× — simulates calibration drift.
- Kd scale: 0.85×–1.15×.
- Latency of 0–20ms between observation and action — the most important one, since measured real latency is 9–12ms (see the sketch after this list).
- Motor torque limit: scale 0.9×–1.0×.
3. Environment:
- Terrain: flat, slopes (±10°), stairs, rough patches — sampled via curriculum.
- Friction coefficient: 0.3–1.5.
- Restitution: 0–0.5.
- External pushes: random impulses 200–500N applied on the trunk every 5–10s — this is how the team trains the robot to withstand a 10kg ball impact in the wild.
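A common way to implement the latency randomization above is an action-delay buffer. The sketch below is a structural assumption, not Booster Gym's actual code; it assumes a 1 ms sim step, so 0–20 ms of latency becomes 0–20 steps.

```python
import random
from collections import deque

class DelayedAction:
    """Delay applied actions by a per-episode random number of sim steps."""
    def __init__(self, max_delay_steps: int = 20):
        self.delay = random.randint(0, max_delay_steps)  # resample each episode
        self.buffer = deque()

    def step(self, action):
        # Push the newest action, return the delayed one to execute.
        self.buffer.append(action)
        if len(self.buffer) > self.delay + 1:
            self.buffer.popleft()
        return self.buffer[0]
```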
If you're comfortable with basic domain randomization, Booster Gym is a natural step up: it adds latency randomization and parallel-ankle simulation.
Installation
You'll need an NVIDIA GPU (RTX 3060+ is enough for 4096 envs). I tested on Ubuntu 22.04 + CUDA 11.8.
# Conda environment
conda create -n booster python=3.8 -y
conda activate booster
# PyTorch + CUDA
conda install numpy=1.21.6 pytorch=2.0 pytorch-cuda=11.8 \
-c pytorch -c nvidia -y
# Isaac Gym — download from NVIDIA developer portal first
# https://developer.nvidia.com/isaac-gym (NVIDIA dev account required)
cd ~/Downloads/isaacgym/python
pip install -e .
# Booster Gym
git clone https://github.com/BoosterRobotics/booster_gym.git
cd booster_gym
pip install -r requirements.txt
A common error is libpython3.8.so.1.0: cannot open shared object file. Fix it by adding to ~/anaconda3/envs/booster/etc/conda/activate.d/env_vars.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
Then conda deactivate && conda activate booster.
Training
The basic command:
python train.py --task=T1
On an RTX 4090 with 4096 envs, training to convergence takes 6–8 hours (~10000 iterations). Useful flags:
python train.py --task=T1 \
--num_envs=4096 \
--headless \
--seed=42 \
--max_iterations=15000
Track progress:
tensorboard --logdir logs
A "healthy" reward curve should look like: total reward goes up monotonically for the first 1000 iterations (tracking velocity command), then plateaus around 1000–3000 (balancing tracking and energy), then climbs again as the curriculum increases terrain difficulty (3000+).
If the reward oscillates wildly or collapses to 0 after ~500 iterations, check:
- Termination conditions that are too strict (height threshold): by default, Booster Gym terminates an episode when base height < 0.3 m or base velocity > 5 m/s, as in the sketch below.
- Unreasonable initial joint positions: q_0 should be close to T1's neutral pose.
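For reference, the termination logic has roughly this shape. The thresholds come from the text above, but the function itself is a hypothetical illustration, not the repo's API:

```python
def should_terminate(base_height_m: float, base_speed_mps: float) -> bool:
    fallen = base_height_m < 0.3     # robot has collapsed
    runaway = base_speed_mps > 5.0   # dynamics have blown up
    return fallen or runaway
```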
Evaluation and cross-simulation
After training, eval in Isaac Gym:
python play.py --task=T1 --checkpoint=-1
-1 means load the latest checkpoint. You'll see 16 robots in parallel following random commands.
Critical step: cross-validate on MuJoCo before deploying to real hardware. Isaac Gym uses PhysX, while MuJoCo has different dynamics (especially the contact model). If a policy only works in Isaac Gym but fails on MuJoCo, that's a sign of overfitting to one simulator's physics:
python play_mujoco.py --task=T1 --checkpoint=-1
Expectation: joint position trajectories on MuJoCo should be very close to Isaac Gym (delta < 5%). The Booster paper actually shows this delta is only 2–3% — evidence that domain randomization is doing its job.
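One way to quantify that delta yourself is to log joint trajectories from both simulators under the same command sequence and compare them. The file names and .npz layout below are assumptions for illustration:

```python
import numpy as np

q_isaac  = np.load("rollout_isaacgym.npz")["dof_pos"]   # shape (T, 12), assumed log format
q_mujoco = np.load("rollout_mujoco.npz")["dof_pos"]     # shape (T, 12)

mean_abs = np.abs(q_isaac - q_mujoco).mean(axis=0)      # per-joint mean error
motion   = q_isaac.max(axis=0) - q_isaac.min(axis=0)    # per-joint motion range
rel_pct  = 100.0 * mean_abs / np.maximum(motion, 1e-6)
print("per-joint delta (% of motion range):", rel_pct.round(1))
```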
Deployment on the real Booster T1
The final step is exporting the policy to TorchScript JIT format for the onboard CPU:
python export_model.py --task=T1 --checkpoint=-1
Output: policy.pt in logs/<date>/exported/. Copy it to the robot via scp:
scp logs/2026-05-06_10-30-00/exported/policy.pt \
[email protected]:~/booster_deploy/
On T1, there's a Python deployment SDK using DDS middleware. Run:
cd ~/booster_deploy
python run_policy.py --policy=policy.pt --freq=50
The SDK will:
- Subscribe to IMU + joint state via DDS topics.
- Build the observation vector in the same format as training.
- Run policy inference to get the action a_t.
- Publish q_des = q_0 + a_t over the DDS topic joint_command.
- The onboard PD controller converts this to torque at 1kHz.
The robot will stand up naturally (a_t = 0 initially, so the commanded pose equals the nominal pose), then you use a joystick to send (vx, vy, ωyaw) over the topic velocity_command.
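Putting it together, a hypothetical 50Hz control loop looks like the sketch below. The build_observation() and publish_joint_command() helpers are placeholder stubs standing in for the Booster SDK's DDS I/O; the real API is not reproduced here.

```python
import time
import torch

def build_observation() -> torch.Tensor:
    return torch.zeros(39)   # fill with IMU, joint states, prev action, cmd, gait

def publish_joint_command(q_des: torch.Tensor) -> None:
    pass                     # would publish q_des on the joint_command topic

policy = torch.jit.load("policy.pt").eval()
q0 = torch.zeros(12)         # replace with T1's nominal pose
dt = 1.0 / 50.0              # 20 ms cycle

while True:
    t0 = time.perf_counter()
    obs = build_observation()                  # same layout/scaling as training
    with torch.no_grad():
        a_t = policy(obs.unsqueeze(0)).squeeze(0)
    publish_joint_command(q0 + a_t)            # q_des = q_0 + a_t
    time.sleep(max(0.0, dt - (time.perf_counter() - t0)))
```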
Experimental results
The paper reports:
| Test | Result |
|---|---|
| Forward walking | 0.5–1.2 m/s, tracking error < 0.1 m/s |
| Backward + sideways | OK on all commands |
| Yaw rotation | Up to 1.0 rad/s |
| 10° slopes | Up + down without terrain perception |
| 6 surface types | grass, stone, soil, asphalt, concrete, tile — all OK |
| 10kg ball push | Recovers stable gait within a few steps |
| Latency tolerance | Up to 20ms (measured: 9–12ms) |
Most importantly: zero-shot — policy trained in sim, deployed directly, no fine-tuning. This is the gold standard for sim-to-real research.
Pitfalls and best practices
A few errors I hit while reproducing results in the lab:
- Forgetting to set LD_LIBRARY_PATH: Isaac Gym crashes silently. Always check echo $LD_LIBRARY_PATH | grep conda after activating the env.
- GPU OOM with 4096 envs on less than 12GB of VRAM: drop to 2048 envs and bump max_iterations to 20000 to compensate.
- Reward NaN after ~100 iterations: usually the dof_acc weight is too large combined with acceleration spikes in the first episode. Reduce the dof_acc weight from -2.5e-7 to -1e-7.
- High-frequency jitter on the real robot after deployment: caused by the real robot's Kp being larger than the Kp used in sim. Verify that actuator_cfg.yaml matches the T1 firmware.
- A large sim-to-real gap despite domain randomization being on: check whether the ankle is being trained as a serial structure. Booster Gym has a use_parallel_ankle flag that must be False during training and True at deployment.
Extensions
Booster Gym is a strong baseline for several research directions:
- Loco-manipulation — extend the action space to 23-D (12 leg + 11 arm), add task-specific reward. See Loco-manip humanoid.
- Whole-body MPC + RL — use the policy as a warm-start for MPC, similar to the approach in Sonic humanoid.
- Vision integration — feed depth/RGB through a CNN encoder, concat into observations (see the sketch after this list).
- Multi-skill policy — train one policy covering walking + standing + recovery, using a skill-conditioning bit as an extra command.
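As a toy sketch of the vision-integration idea, one could encode a depth image with a small CNN and concatenate the feature with proprioception before the actor MLP. All layer sizes here are assumptions, not from the paper:

```python
import torch
import torch.nn as nn

class VisionActor(nn.Module):
    """Toy actor: CNN depth encoder + proprioception, concatenated into an MLP."""
    def __init__(self, proprio_dim: int = 39, act_dim: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> 32-D visual feature
        )
        self.mlp = nn.Sequential(
            nn.Linear(proprio_dim + 32, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, depth: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(depth)                   # (B, 32)
        return self.mlp(torch.cat([feat, proprio], dim=-1))

actor = VisionActor()
a_t = actor(torch.zeros(2, 1, 64, 64), torch.zeros(2, 39))  # -> (2, 12)
```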
Compared to LIFT-humanoid (ICLR 2026), which focuses on world-model pretraining, Booster Gym is simpler but production-ready right now. If you need to deploy on T1 fast, use Booster Gym; if you need a foundation model for humanoids, LIFT is a better starting point.
Conclusion
Booster Gym embodies the 2026 trend: end-to-end framework, public code, validated on real hardware, with no hidden details. With an RTX 3060 and a Booster T1 (or an equivalent 12-DoF humanoid), a student in Vietnam can reproduce ICRA 2026 results in 2 days. That was unthinkable just 3 years ago.
When should you use Booster Gym? When you need a robust locomotion baseline for a 10–20 DoF humanoid, real deployment, no complex tasks. When NOT to use it? For complex manipulation, vision-language conditioning, or multi-robot — those cases call for VLA models or other specialized frameworks.