Tags: ai, humanoid, whole-body-control, nvidia, reinforcement-learning, motion-tracking, vr-teleoperation, isaac-lab

GEAR-SONIC: Whole-Body Control for Humanoid Robots

Deep dive into NVIDIA GEAR-SONIC — training a whole-body controller for humanoid robots with BONES-SEED dataset and VR teleoperation.

Nguyễn Anh Tuấn · April 13, 2026 · 11 min read

Imagine putting on a VR headset, moving your arms and legs — and a humanoid robot across the room mirrors your every motion in real time. This isn't science fiction. It's GEAR-SONIC — NVIDIA's whole-body control system that achieves 100% success rate on a Unitree G1 robot with zero real-world fine-tuning.

In this article, we'll explore the architecture, training data, deployment pipeline, and how you can get started with NVIDIA's open-source codebase.

Background: Why Whole-Body Control?

Traditionally, humanoid robot control splits into two separate branches: lower-body locomotion (walking, running, balance) and upper-body manipulation (grasping, interacting). Each uses its own controller, and coordinating them creates cascading problems — the robot can walk but drops objects mid-stride, or grasps well but falls when leaning.

Whole-body control solves this by training a single policy that commands all joints simultaneously. Instead of decomposing the problem into modules, we let a neural network learn to coordinate all 29 degrees of freedom at once.

GEAR-SONIC takes this further: rather than training separate behaviors, they use motion tracking — reproducing human movement — as a single scalable training task, then scale up with massive data. The result is a behavior foundation model that any downstream system (VR teleoperation, VLA models, gamepad control) can leverage.

Humanoid robot performing whole-body control — combining locomotion and manipulation in a single policy

What is SONIC? Architecture Overview

SONIC stands for Supersizing Motion Tracking for Natural Humanoid Whole-Body Control — a paper from NVIDIA Research (GEAR Lab), published as arXiv:2511.07820 in November 2025 by Zhengyi Luo, Ye Yuan, Tingwu Wang, and over 25 co-authors.

The core insight: Motion tracking is all you need. If a policy can accurately track any human motion, it has implicitly learned every skill — walking, running, jumping, picking up objects, dancing, even fighting.

Encoder-Decoder with Finite Scalar Quantization

SONIC's architecture has three main components:

Three specialized encoders:

  1. Robot Motion Encoder (Er) — processes 10 future frames (0.1s apart) of robot joint trajectories
  2. Human Motion Encoder (Eh) — processes 10 future frames (0.02s apart) of human SMPL poses
  3. Hybrid Motion Encoder (Em) — processes mixed robot/human commands (for 3-point teleoperation)

All encoders are MLPs with architecture [2048, 1024, 512, 512].

Finite Scalar Quantization (FSQ): This is the key step — outputs from all three encoders are quantized into the same discrete token space. Whether input comes from robot trajectories, SMPL poses, or hybrid commands, they're all represented in the same token "language." This enables seamless switching between control modes.
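A minimal numpy sketch of the FSQ idea, assuming 7 quantization levels per latent dimension (the paper's exact level counts and implementation details are not restated here):

```python
import numpy as np

def fsq_quantize(z, levels=7):
    """Finite Scalar Quantization: bound each latent dimension with
    tanh, then round it to one of `levels` uniformly spaced values.
    Training would use a straight-through estimator so gradients can
    pass through the rounding; this is a sketch, not SONIC's exact
    implementation."""
    half = (levels - 1) / 2.0
    z_bounded = np.tanh(z) * half        # squash into (-half, half)
    codes = np.round(z_bounded)          # integer codes in {-half, ..., half}
    return codes / half                  # quantized latent back in [-1, 1]

# Latents from any encoder (robot, human, or hybrid) land in the same
# discrete codebook of levels**dim tokens, here 7**4 = 2401 possibilities.
latent = np.array([0.3, -1.7, 0.05, 2.4])
token = fsq_quantize(latent)
```

Because every encoder output passes through the same quantizer, the decoders never need to know which modality produced a token, which is what makes mode switching seamless.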

Two decoders:

  1. Control Decoder (Dc) — transforms tokens into 29 target joint positions (Gaussian distribution)
  2. Auxiliary Robot Motion Decoder (Dr) — reconstructs robot motion for an additional supervision signal

Decoder MLP: [2048, 2048, 1024, 1024, 512, 512].

Observation and Action Space

| Component | Details |
| --- | --- |
| Observation | Joint poses, joint velocities, root angular velocity, gravity vector, previous action |
| Action | 29-dimensional target joint positions |
| Control frequency | 50 Hz (policy loop), 500 Hz (motor stream via Unitree low-level API) |
| Inference latency | 1-2 ms on Jetson Orin (TensorRT + CUDA Graph) |
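The observation above amounts to a concatenation of proprioceptive signals; a sketch with numpy, where the exact ordering and any normalization are assumptions rather than the paper's specification:

```python
import numpy as np

NUM_JOINTS = 29  # Unitree G1 DoF used by SONIC

def build_observation(q, qd, root_ang_vel, gravity_b, prev_action):
    """Assemble the proprioceptive observation listed in the table.
    Layout is illustrative; the real pipeline defines its own."""
    assert q.shape == (NUM_JOINTS,) and qd.shape == (NUM_JOINTS,)
    assert root_ang_vel.shape == (3,) and gravity_b.shape == (3,)
    assert prev_action.shape == (NUM_JOINTS,)
    return np.concatenate([q, qd, root_ang_vel, gravity_b, prev_action])

obs = build_observation(
    q=np.zeros(NUM_JOINTS), qd=np.zeros(NUM_JOINTS),
    root_ang_vel=np.zeros(3), gravity_b=np.array([0.0, 0.0, -1.0]),
    prev_action=np.zeros(NUM_JOINTS),
)
# 29 + 29 + 3 + 3 + 29 = 93-dimensional observation
```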

BONES-SEED Dataset: 700 Hours of Human Motion

Data is the decisive factor behind SONIC's success. The original paper trained on 100M+ frames (700 hours of motion capture) from 170 subjects with heights ranging from 145-199 cm.

In March 2026, at GTC, Bones Studio publicly released BONES-SEED (Skeletal Everyday Embodiment Dataset) — an expanded and public version of this data:

| Metric | Value |
| --- | --- |
| Total motions | 142,220 (71,132 original + 71,088 mirrored) |
| Duration | ~288 hours @ 120 fps |
| Actors | 522 (253 female, 269 male) |
| Age range | 17-71 years |
| Height range | 145-199 cm |
| File size | 114 GB |
| Capture system | Vicon optical motion capture (sub-millimeter accuracy) |

Three Data Formats

  1. SOMA Uniform (BVH) — standardized skeleton shared across all motions
  2. SOMA Proportional (BVH) — per-actor skeleton preserving body proportions
  3. Unitree G1 MuJoCo-compatible (CSV) — joint-angle trajectories ready for simulation

The dataset includes 51 metadata columns: up to 6 natural language descriptions per motion, temporal segmentation with precise timestamps, biomechanical descriptions, and actor biometrics.
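To get a feel for the simulation-ready CSV format, here is a hedged sketch that parses a synthetic joint-angle table with numpy. The real BONES-SEED files define their own header and column layout, so check the dataset card before relying on this:

```python
import io
import numpy as np

# Hypothetical layout for the Unitree G1 MuJoCo-compatible CSV:
# one row per frame, one column per joint angle in radians.
csv_text = "\n".join(
    ",".join(f"{v:.4f}" for v in np.sin(np.linspace(0, 1, 29) + t))
    for t in np.linspace(0, 2 * np.pi, 120)  # 120 frames = 1 s at 120 fps
)
frames = np.loadtxt(io.StringIO(csv_text), delimiter=",")
# frames.shape == (120, 29): frames x joint angles
```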

Motion category distribution:

| Category | Count |
| --- | --- |
| Locomotion | 74,488 |
| Communication | 21,493 |
| Interactions | 14,643 |
| Dances | 11,006 |
| Gaming | 8,700 |
| Everyday | 5,816 |
| Sport | 3,993 |

Retargeting from Human to Robot

Human motions are retargeted to Unitree G1 via GMR (Geometric Motion Retargeting). This step is essential — human bodies and robots have different proportions (arm length, leg length, joint ranges), so an intelligent mapping algorithm is needed to preserve the original motion intent while remaining physically feasible on the robot.
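A toy sketch of the proportional part of that mapping, with assumed limb lengths. GMR itself additionally solves inverse kinematics against the G1's joint limits; this only conveys the scaling idea:

```python
import numpy as np

def retarget_limb(human_points, human_len, robot_len):
    """Scale a limb's keypoints (expressed relative to the limb root)
    by the robot/human segment-length ratio. Illustrative only; the
    real GMR pipeline also enforces joint-limit feasibility."""
    scale = robot_len / human_len
    root = human_points[0]
    return root + (human_points - root) * scale

# A 0.45 m human shank mapped onto a 0.30 m robot shank (assumed lengths).
human_shank = np.array([[0.0, 0.0, 0.9], [0.0, 0.0, 0.45]])
robot_shank = retarget_limb(human_shank, human_len=0.45, robot_len=0.30)
```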

Training Pipeline

Reinforcement Learning with PPO

SONIC trains using PPO (Proximal Policy Optimization) in NVIDIA Isaac Lab — a high-speed GPU-accelerated physics simulator.

Reward function balances accurate tracking with safety:

| Component | Weight |
| --- | --- |
| Root orientation tracking | 0.5 |
| Body link positions (relative to root) | 1.0 |
| Body link orientations | 1.0 |
| Linear/angular velocities | 1.0 each |
| Action rate penalty | -0.1 |
| Joint limit violation | -10.0 |
| Undesired contacts | -0.1 |

Note the -10.0 weight for joint limit violations — an extremely strong penalty that forces the policy to respect the hardware's physical constraints.
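Assuming each tracking term is shaped as exp(-error), a common choice in motion-tracking RL and an assumption here rather than the paper's exact kernel, the weight table translates into a reward like:

```python
import numpy as np

# Weights from the table above.
WEIGHTS = {
    "root_orient": 0.5, "body_pos": 1.0, "body_orient": 1.0,
    "lin_vel": 1.0, "ang_vel": 1.0,
    "action_rate": -0.1, "joint_limit": -10.0, "undesired_contact": -0.1,
}
TRACK = ("root_orient", "body_pos", "body_orient", "lin_vel", "ang_vel")
PENALTY = ("action_rate", "joint_limit", "undesired_contact")

def tracking_reward(errors, penalties):
    """errors: non-negative tracking errors per term; penalties: raw
    penalty magnitudes (action-rate norm, limit violation, contacts)."""
    r = sum(WEIGHTS[k] * np.exp(-errors[k]) for k in TRACK)
    r += sum(WEIGHTS[k] * penalties[k] for k in PENALTY)
    return float(r)

best = tracking_reward({k: 0.0 for k in TRACK}, {k: 0.0 for k in PENALTY})
# best == 4.5; a joint-limit violation of just 0.1 already costs 1.0 of it
```

The asymmetry is the point: tracking rewards saturate at a few units, while the -10.0 limit penalty can dwarf them, so the policy learns that hardware safety is non-negotiable.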

Domain Randomization

To ensure the policy generalizes well to real-world deployment, SONIC uses aggressive domain randomization:

  • Friction: 0.3 to 1.6
  • Restitution: 0 to 0.5
  • External pushes: up to 0.5 m/s (simulating unexpected collisions)
  • Motion jitter: position and orientation noise (simulating sensor noise)
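The ranges above can be sampled once per episode roughly as follows; the push direction and the jitter standard deviation are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def _random_unit_vector():
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def randomize_episode():
    """Sample one episode's physics from the ranges listed above."""
    return {
        "friction": rng.uniform(0.3, 1.6),
        "restitution": rng.uniform(0.0, 0.5),
        "push_vel": rng.uniform(0.0, 0.5) * _random_unit_vector(),  # m/s
        "pos_jitter": rng.normal(0.0, 0.02, size=3),  # metres, assumed sigma
    }

params = randomize_episode()
```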

Adaptive Sampling

Not all trajectories are equally difficult. SONIC uses bin-based adaptive sampling — difficult trajectories (high failure rate) are sampled more frequently, helping the policy focus on its weak spots instead of repeating what it already masters.
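A minimal sketch of that idea, with an assumed probability floor so easy bins are never starved entirely:

```python
import numpy as np

def adaptive_bin_probs(failure_rates, floor=0.05):
    """Bin-based adaptive sampling sketch: draw each trajectory bin
    with probability proportional to its recent failure rate plus a
    small floor. The floor value is an assumption."""
    w = np.asarray(failure_rates, dtype=float) + floor
    return w / w.sum()

fail_rates = [0.9, 0.1, 0.0, 0.4]           # per-bin failure rates
probs = adaptive_bin_probs(fail_rates)
rng = np.random.default_rng(0)
next_bin = int(rng.choice(len(fail_rates), p=probs))  # hard bins drawn most
```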

Scaling Laws

A key finding: SONIC performance improves consistently when scaling along any of three axes:

  1. Model size: 1.2M → 42M parameters
  2. Data volume: more data = better tracking
  3. Compute: 9,000 → 21,000 GPU-hours (128 GPUs over ~3 days)

Among these, increasing data volume yields the largest gains — this is why BONES-SEED matters: it enables anyone with sufficient GPUs to train powerful whole-body controllers.

Reinforcement learning training pipeline — from motion capture data through simulation to real-world deployment

VR Teleoperation: Two Control Modes

Mode 1: Whole-body teleoperation

  • Hardware: PICO VR headset + 2 ankle trackers + 2 handheld controllers
  • Output: Full-body SMPL pose streamed in real-time
  • Encoder: Human Motion Encoder (Eh)

This mode enables full-body robot control with maximum precision — every joint is tracked.

Mode 2: 3-point teleoperation (lightweight)

  • Hardware: PICO headset + 2 handheld controllers only (no ankle trackers)
  • Input: Head + wrist SE(3) poses, finger angles, waist height, locomotion mode
  • Encoder: Kinematic Planner → Hybrid Motion Encoder (Em)

This is the more practical mode for large-scale data collection — fewer devices, faster setup, and still accurate enough for most tasks.

3-point teleoperation performance:

| Metric | Value |
| --- | --- |
| Mean latency | 121.9 ms |
| Wrist position error | 6 cm (mean), 13.3 cm (95th percentile) |
| Orientation error | 0.145 rad (mean), 0.267 rad (95th percentile) |

Kinematic Planner

The kinematic planner bridges high-level commands to motion tracking:

  • Representation: Pelvis-relative joint positions + global joint rotations
  • Backbone: Masked token prediction with 4x downsampling
  • Speed: 10 Hz, generates locomotion at 0-6 m/s
  • Root trajectory: Critically damped spring model
  • Supported styles: normal, stealth, happy, injured, boxing, kneeling, crawling
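A critically damped spring (damping ratio 1) drives the root toward its target without overshoot, which is why it makes a good trajectory smoother. A sketch of one integration step; the omega and dt values are assumed, not from the paper:

```python
def critically_damped_step(x, v, target, omega, dt):
    """One semi-implicit Euler step of x'' = omega^2 (target - x) - 2 omega x'."""
    a = omega ** 2 * (target - x) - 2.0 * omega * v
    v = v + a * dt
    x = x + v * dt
    return x, v

x, v = 0.0, 0.0
for _ in range(500):                      # 5 s at 100 Hz
    x, v = critically_damped_step(x, v, target=1.0, omega=6.0, dt=0.01)
# x settles smoothly onto the 1.0 m target
```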

Experimental Results

Simulation (AMASS benchmark)

On 1,602 trajectories from AMASS, SONIC significantly outperforms all baselines (Any2Track, BeyondMimic, GMT) in success rate, MPJPE (Mean Per Joint Position Error), and velocity/acceleration error.

Real-world (Unitree G1)

| Metric | Result |
| --- | --- |
| Success rate | 100% (50 diverse trajectories) |
| Deployment | Zero-shot (no real-hardware fine-tuning) |
| Behaviors | Dance, jumps, loco-manipulation |

100% success rate with zero-shot deployment — this is a remarkable result. The policy was trained entirely in simulation, with domain randomization strong enough to bridge the sim-to-real gap without any adaptation.

GR00T N1.5 Integration

When combined with the VLA model GR00T N1.5 (Vision-Language-Action model), SONIC achieves 95% success rate on mobile pick-and-place tasks (picking up an apple and placing it on a plate) — with only 300 fine-tuning trajectories.

Getting Started with GEAR-SONIC

Available Resources

| Resource | Link |
| --- | --- |
| GitHub repo | NVlabs/GR00T-WholeBodyControl |
| Pretrained model | HuggingFace: nvidia/GEAR-SONIC |
| BONES-SEED dataset | HuggingFace: bones-studio/seed |
| Interactive demo | MuJoCo browser demo |

Basic Setup

```bash
# Clone the repo
git clone https://github.com/NVlabs/GR00T-WholeBodyControl.git
cd GR00T-WholeBodyControl

# Install dependencies (requires NVIDIA Isaac Lab)
pip install -e .

# Download pretrained checkpoints
# The model consists of 3 ONNX files:
# - model_encoder.onnx
# - model_decoder.onnx
# - planner_sonic.onnx
```

Interactive Demo

The fastest way to experience SONIC is through the browser demo — it runs MuJoCo WASM directly in your browser, letting you load the policy and watch the G1 robot perform real-time motion tracking without installing anything.

Inference Stack

GEAR-SONIC provides a production-ready C++ inference stack:

  • TensorRT for optimized inference on Jetson Orin
  • CUDA Graph for reduced latency
  • Forward pass takes only 1-2 ms on Jetson Orin

This isn't just a research prototype — it's a system ready for real-world deployment.

License

  • Source code: Apache 2.0
  • Model weights: NVIDIA Open Model License (commercial use with attribution)
  • BONES-SEED: Free for academic research and qualifying startups; separate commercial licensing available

Comparison with Other Approaches

| Approach | Strengths | Limitations |
| --- | --- | --- |
| Decoupled WBC (RL lower + IK upper) | Simple, easy to debug | Poor upper-lower coordination |
| Model Predictive Control | Online optimization, flexible | Slow, requires accurate model |
| GEAR-SONIC | Unified policy, scales with data, zero-shot real | Needs powerful GPUs to train, depends on data quality |

SONIC belongs to a new paradigm: instead of designing controllers, collect data and scale models. Similar to how LLMs transformed NLP, motion foundation models are changing how we build robot controllers.

Key Takeaways

1. Data is king. Among the three scaling axes (model, data, compute), increasing data yields the largest gains. BONES-SEED with 142K motions and 288 hours of data is an invaluable resource for the community.

2. Motion tracking is a universal interface. Instead of training individual behaviors, SONIC proves that motion tracking — a single task — can serve as the foundation for all downstream applications.

3. Sim-to-real has matured. 100% zero-shot success rate on real hardware shows that domain randomization + enough data + the right architecture = completely bridging the sim-to-real gap, at least for basic whole-body locomotion and manipulation.

4. VR teleoperation enables a data flywheel. 3-point teleoperation requires only a VR headset + 2 controllers — cheap and simple enough for large-scale data collection, creating a virtuous cycle: data → better policy → easier teleoperation → more data.

Conclusion

GEAR-SONIC represents a significant leap in humanoid robotics: from handcrafting controllers to training behavior foundation models with data. With open-source code, pretrained models on HuggingFace, the public BONES-SEED dataset, and a production-ready C++ inference stack — there's never been a better time to start experimenting with whole-body control for humanoid robots.

If you're already working with simulation for robotics, GEAR-SONIC is the natural next project — it combines many techniques we've covered in previous series: reinforcement learning, domain randomization, and sim-to-real transfer.

