Imagine putting on a VR headset, moving your arms and legs — and a humanoid robot across the room mirrors your every motion in real time. This isn't science fiction. It's GEAR-SONIC — NVIDIA's whole-body control system that achieves 100% success rate on a Unitree G1 robot with zero real-world fine-tuning.
In this article, we'll explore the architecture, training data, deployment pipeline, and how you can get started with NVIDIA's open-source codebase.
Background: Why Whole-Body Control?
Traditionally, humanoid robot control splits into two separate branches: lower-body locomotion (walking, running, balance) and upper-body manipulation (grasping, interacting). Each uses its own controller, and coordinating them creates cascading problems — the robot can walk but drops objects mid-stride, or grasps well but falls when leaning.
Whole-body control solves this by training a single policy that commands all joints simultaneously. Instead of decomposing the problem into modules, we let a neural network learn to coordinate all 29 degrees of freedom at once.
GEAR-SONIC takes this further: rather than training separate behaviors, it uses motion tracking — reproducing human movement — as a single scalable training task, then scales up with massive data. The result is a behavior foundation model that any downstream system (VR teleoperation, VLA models, gamepad control) can leverage.
What is SONIC? Architecture Overview
SONIC stands for Supersizing Motion Tracking for Natural Humanoid Whole-Body Control — a paper from NVIDIA Research (GEAR Lab), published as arXiv:2511.07820 in November 2025 by Zhengyi Luo, Ye Yuan, Tingwu Wang, and over 25 co-authors.
The core insight: Motion tracking is all you need. If a policy can accurately track any human motion, it has implicitly learned every skill — walking, running, jumping, picking up objects, dancing, even fighting.
Encoder-Decoder with Finite Scalar Quantization
SONIC's architecture has three main components:
Three specialized encoders:
- Robot Motion Encoder (Er) — processes 10 future frames (0.1s apart) of robot joint trajectories
- Human Motion Encoder (Eh) — processes 10 future frames (0.02s apart) of human SMPL poses
- Hybrid Motion Encoder (Em) — processes mixed robot/human commands (for 3-point teleoperation)
All encoders are MLPs with architecture [2048, 1024, 512, 512].
Finite Scalar Quantization (FSQ): This is the key step — outputs from all three encoders are quantized into the same discrete token space. Whether input comes from robot trajectories, SMPL poses, or hybrid commands, they're all represented in the same token "language." This enables seamless switching between control modes.
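The idea behind FSQ can be sketched in a few lines. This is an illustrative simplification, not SONIC's actual codebook configuration: odd level counts are used here so that plain rounding yields exactly `levels[i]` values per dimension (the FSQ paper handles even counts with a half-step offset).

```python
import numpy as np

def fsq_quantize(z, levels):
    """FSQ sketch: bound each latent dimension with tanh, then round to one
    of `levels[i]` evenly spaced values, giving a shared discrete grid."""
    half = (np.asarray(levels, dtype=float) - 1.0) / 2.0  # e.g. 2.0 for 5 levels
    bounded = np.tanh(np.asarray(z, dtype=float)) * half  # continuous, in (-half, half)
    return np.round(bounded) / half                       # discrete grid in [-1, 1]

# Any encoder output, robot or human, lands on the same token grid.
token = fsq_quantize([0.7, -2.3, 0.1, 1.5], levels=[5, 5, 5, 5])
```

Because every encoder's output is snapped to this one grid, the decoders never need to know which modality produced a token.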
Two decoders:
- Control Decoder (Dc) — maps tokens to a Gaussian distribution over the 29 target joint positions
- Auxiliary Robot Motion Decoder (Dr) — reconstructs robot motion for an additional supervision signal
Decoder MLP: [2048, 2048, 1024, 1024, 512, 512].
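A Gaussian action head like the control decoder's can be sketched as follows. The `log_std` parameterization and the deterministic-at-deployment convention are common RL practice, shown here as assumptions rather than SONIC's exact implementation:

```python
import numpy as np

def gaussian_action(mean, log_std, rng=None):
    """Gaussian action head sketch: sample exploration noise around the
    decoder's mean during RL training; act on the mean at deployment."""
    if rng is None:
        return mean                                   # deterministic rollout
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

mean = np.zeros(29)                                   # 29 target joint positions
action = gaussian_action(mean, log_std=np.full(29, -2.0),
                         rng=np.random.default_rng(0))
```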
Observation and Action Space
| Component | Details |
|---|---|
| Observation | Joint poses, joint velocities, root angular velocity, gravity vector, previous action |
| Action | 29-dimensional target joint positions |
| Control frequency | 50 Hz (policy loop), 500 Hz (motor stream via Unitree low-level API) |
| Inference latency | 1-2 ms on Jetson Orin (TensorRT + CUDA Graph) |
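Assembling the observation from the table above amounts to concatenating the proprioceptive terms into one vector. The exact ordering and dimensions below are assumptions for illustration:

```python
import numpy as np

def build_observation(q, dq, root_ang_vel, gravity, prev_action):
    """Concatenate proprioceptive terms into a single policy input,
    matching the observation components listed above (sizes assumed)."""
    return np.concatenate([
        q,             # 29 joint positions
        dq,            # 29 joint velocities
        root_ang_vel,  # 3: base angular velocity
        gravity,       # 3: gravity direction in the base frame
        prev_action,   # 29: previous target joint positions
    ])

obs = build_observation(np.zeros(29), np.zeros(29), np.zeros(3),
                        np.array([0.0, 0.0, -1.0]), np.zeros(29))
```

Note that no global position or linear velocity appears: the policy relies only on quantities measurable onboard, which is part of what makes zero-shot deployment feasible.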
Training Data: From 700 Hours of Mocap to BONES-SEED
Data is the decisive factor behind SONIC's success. The original paper trained on 100M+ frames (700 hours of motion capture) from 170 subjects with heights ranging from 145 to 199 cm.
In March 2026, at GTC, Bones Studio publicly released BONES-SEED (Skeletal Everyday Embodiment Dataset) — an expanded and public version of this data:
| Metric | Value |
|---|---|
| Total motions | 142,220 (71,132 original + 71,088 mirrored) |
| Duration | ~288 hours @ 120 fps |
| Actors | 522 (253 female, 269 male) |
| Age range | 17-71 years |
| Height range | 145-199 cm |
| File size | 114 GB |
| Capture system | Vicon optical motion capture (sub-millimeter accuracy) |
Three Data Formats
- SOMA Uniform (BVH) — standardized skeleton shared across all motions
- SOMA Proportional (BVH) — per-actor skeleton preserving body proportions
- Unitree G1 MuJoCo-compatible (CSV) — joint-angle trajectories ready for simulation
The dataset includes 51 metadata columns: up to 6 natural language descriptions per motion, temporal segmentation with precise timestamps, biomechanical descriptions, and actor biometrics.
Motion category distribution:
| Category | Count |
|---|---|
| Locomotion | 74,488 |
| Communication | 21,493 |
| Interactions | 14,643 |
| Dances | 11,006 |
| Gaming | 8,700 |
| Everyday | 5,816 |
| Sport | 3,993 |
Retargeting from Human to Robot
Human motions are retargeted to the Unitree G1 via GMR (Geometric Motion Retargeting). This step is essential — human bodies and robots have different proportions (arm length, leg length, joint ranges), so an intelligent mapping algorithm is needed to preserve the original motion intent while remaining physically feasible on the robot.
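To see why proportions matter, here is a deliberately naive sketch (illustrative only, not GMR): scaling reach targets about the root by a limb-length ratio keeps them inside the robot's smaller workspace before any IK is solved. The limb lengths used are made-up example values:

```python
import numpy as np

def scale_targets(human_pts, human_arm_len, robot_arm_len, root):
    """Toy proportional retargeting: scale end-effector targets about the
    root by the robot/human limb ratio so they stay reachable."""
    ratio = robot_arm_len / human_arm_len
    return root + (np.asarray(human_pts) - root) * ratio

# A human reach target shrinks toward the root by the 0.55/0.75 ratio.
target = scale_targets([0.75, 0.0, 1.4], human_arm_len=0.75,
                       robot_arm_len=0.55, root=np.array([0.0, 0.0, 1.0]))
```

GMR goes far beyond this, reasoning geometrically about joint limits and contact, but the core problem — mapping between mismatched kinematic structures — is the same.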
Training Pipeline
Reinforcement Learning with PPO
SONIC trains using PPO (Proximal Policy Optimization) in NVIDIA Isaac Lab — a high-speed GPU-accelerated physics simulator.
Reward function balances accurate tracking with safety:
| Component | Weight |
|---|---|
| Root orientation tracking | 0.5 |
| Body link positions (relative to root) | 1.0 |
| Body link orientations | 1.0 |
| Linear/angular velocities | 1.0 each |
| Action rate penalty | -0.1 |
| Joint limit violation | -10.0 |
| Undesired contacts | -0.1 |
Note the -10.0 weight for joint limit violations — an extremely strong penalty that forces the policy to respect the hardware's physical constraints.
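The table above is just a weighted sum. In this sketch the per-term values are placeholders (tracking terms normalized to [0, 1], penalties as raw magnitudes), not SONIC's exact error formulations:

```python
# Weighted-sum sketch of the reward table above.
REWARD_WEIGHTS = {
    "root_orientation": 0.5,
    "body_positions": 1.0,
    "body_orientations": 1.0,
    "linear_velocity": 1.0,
    "angular_velocity": 1.0,
    "action_rate": -0.1,
    "joint_limit_violation": -10.0,
    "undesired_contacts": -0.1,
}

def total_reward(terms):
    """Combine per-component terms into one scalar reward."""
    return sum(REWARD_WEIGHTS[k] * v for k, v in terms.items())

# Perfect tracking plus one joint-limit violation: the -10.0 penalty dominates.
r = total_reward({"root_orientation": 1.0, "body_positions": 1.0,
                  "body_orientations": 1.0, "linear_velocity": 1.0,
                  "angular_velocity": 1.0, "action_rate": 0.0,
                  "joint_limit_violation": 1.0, "undesired_contacts": 0.0})
```

Even with flawless tracking (4.5 reward), a single joint-limit violation drives the total to -5.5, which is exactly the intended pressure on the policy.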
Domain Randomization
To ensure the policy generalizes well to real-world deployment, SONIC uses aggressive domain randomization:
- Friction: 0.3 to 1.6
- Restitution: 0 to 0.5
- External pushes: up to 0.5 m/s (simulating unexpected collisions)
- Motion jitter: position and orientation noise (simulating sensor noise)
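In code, per-episode randomization over the ranges above is a one-liner per parameter. This is a minimal sketch of the sampling step only (Isaac Lab provides its own randomization utilities):

```python
import random

def sample_domain_params(rng=random):
    """Sample per-episode physics parameters over the ranges listed above."""
    return {
        "friction": rng.uniform(0.3, 1.6),
        "restitution": rng.uniform(0.0, 0.5),
        "push_velocity": rng.uniform(0.0, 0.5),  # m/s, applied at random times
    }

params = sample_domain_params(random.Random(42))
```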
Adaptive Sampling
Not all trajectories are equally difficult. SONIC uses bin-based adaptive sampling — difficult trajectories (high failure rate) are sampled more frequently, helping the policy focus on its weak spots instead of repeating what it already masters.
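The mechanism can be sketched as failure-rate-weighted sampling over bins. The weight floor (so that mastered bins are never starved entirely) is an assumption of this sketch, not a stated detail of SONIC:

```python
import random

def adaptive_sample(bins, rng=random, floor=0.1):
    """Bin-based adaptive sampling sketch: weight each trajectory bin by
    its observed failure rate, with a floor so easy bins still appear."""
    weights = [floor + b["failures"] / max(b["attempts"], 1) for b in bins]
    return rng.choices(range(len(bins)), weights=weights, k=1)[0]

bins = [{"failures": 90, "attempts": 100},   # hard, e.g. dynamic jumps
        {"failures": 5, "attempts": 100}]    # easy, e.g. walking
rng = random.Random(0)
picks = [adaptive_sample(bins, rng) for _ in range(1000)]
```

With these counts the hard bin gets weight 1.0 against 0.15, so it is drawn roughly 87% of the time.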
Scaling Laws
A key finding: SONIC performance improves consistently when scaling along any of three axes:
- Model size: 1.2M → 42M parameters
- Data volume: more data = better tracking
- Compute: 9,000 → 21,000 GPU-hours (128 GPUs over ~3 days)
Among these, increasing data volume yields the largest gains — this is why BONES-SEED matters: it enables anyone with sufficient GPUs to train powerful whole-body controllers.
VR Teleoperation: Two Control Modes
Mode 1: Whole-body teleoperation
- Hardware: PICO VR headset + 2 ankle trackers + 2 handheld controllers
- Output: Full-body SMPL pose streamed in real-time
- Encoder: Human Motion Encoder (Eh)
This mode enables full-body robot control with maximum precision — every joint is tracked.
Mode 2: 3-point teleoperation (lightweight)
- Hardware: PICO headset + 2 handheld controllers only (no ankle trackers)
- Input: Head + wrist SE(3) poses, finger angles, waist height, locomotion mode
- Encoder: Kinematic Planner → Hybrid Motion Encoder (Em)
This is the more practical mode for large-scale data collection — fewer devices, faster setup, and still accurate enough for most tasks.
3-point teleoperation performance:
| Metric | Value |
|---|---|
| Mean latency | 121.9 ms |
| Wrist position error | 6 cm (mean), 13.3 cm (95th percentile) |
| Orientation error | 0.145 rad (mean), 0.267 rad (95th percentile) |
Kinematic Planner
The kinematic planner bridges high-level commands to motion tracking:
- Representation: Pelvis-relative joint positions + global joint rotations
- Backbone: Masked token prediction with 4x downsampling
- Speed: 10 Hz, generates locomotion at 0-6 m/s
- Root trajectory: Critically damped spring model
- Supported styles: normal, stealth, happy, injured, boxing, kneeling, crawling
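A critically damped spring is a standard way to smooth a root trajectory toward a moving waypoint: with damping ratio 1 it converges as fast as possible without overshoot. A minimal sketch with semi-implicit Euler integration (the gain `omega` here is an arbitrary example value):

```python
def spring_step(x, v, target, omega, dt):
    """One critically damped spring update toward `target`:
    x'' = -omega^2 * (x - target) - 2 * omega * x'
    Fastest convergence with no overshoot; integrated semi-implicitly."""
    a = -omega * omega * (x - target) - 2.0 * omega * v
    v = v + a * dt
    x = x + v * dt
    return x, v

# Drive the root toward a 1 m waypoint at the 50 Hz policy rate.
x, v = 0.0, 0.0
for _ in range(200):
    x, v = spring_step(x, v, target=1.0, omega=10.0, dt=0.02)
```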
Experimental Results
Simulation (AMASS benchmark)
On 1,602 trajectories from AMASS, SONIC significantly outperforms all baselines (Any2Track, BeyondMimic, GMT) in success rate, MPJPE (Mean Per Joint Position Error), and velocity/acceleration error.
Real-world (Unitree G1)
| Metric | Result |
|---|---|
| Success rate | 100% (50 diverse trajectories) |
| Deployment | Zero-shot (no real-hardware fine-tuning) |
| Behaviors | Dance, jumps, loco-manipulation |
100% success rate with zero-shot deployment — this is a remarkable result. The policy was trained entirely in simulation, with domain randomization strong enough to bridge the sim-to-real gap without any adaptation.
GR00T N1.5 Integration
When combined with GR00T N1.5, NVIDIA's Vision-Language-Action (VLA) model, SONIC achieves a 95% success rate on mobile pick-and-place tasks (picking up an apple and placing it on a plate) using only 300 fine-tuning trajectories.
Getting Started with GEAR-SONIC
Available Resources
| Resource | Link |
|---|---|
| GitHub repo | NVlabs/GR00T-WholeBodyControl |
| Pretrained model | HuggingFace: nvidia/GEAR-SONIC |
| BONES-SEED dataset | HuggingFace: bones-studio/seed |
| Interactive demo | MuJoCo browser demo |
Basic Setup
```bash
# Clone the repo
git clone https://github.com/NVlabs/GR00T-WholeBodyControl.git
cd GR00T-WholeBodyControl

# Install dependencies (requires NVIDIA Isaac Lab)
pip install -e .

# Download pretrained checkpoints
# The model consists of 3 ONNX files:
# - model_encoder.onnx
# - model_decoder.onnx
# - planner_sonic.onnx
```
Interactive Demo
The fastest way to experience SONIC is through the browser demo — it runs MuJoCo WASM directly in your browser, letting you load the policy and watch the G1 robot perform real-time motion tracking without installing anything.
Inference Stack
GEAR-SONIC provides a production-ready C++ inference stack:
- TensorRT for optimized inference on Jetson Orin
- CUDA Graph for reduced latency
- Forward pass takes only 1-2 ms on Jetson Orin
This isn't just a research prototype — it's a system ready for real-world deployment.
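To put the latency numbers in perspective: at 50 Hz each control tick has a 20 ms budget, so a 1-2 ms forward pass leaves ample headroom for sensor I/O and command streaming. A generic fixed-rate loop sketch (not GEAR-SONIC's C++ stack) makes the budget explicit:

```python
import time

def control_loop(policy_step, hz=50.0, steps=5):
    """Fixed-rate control loop sketch: run `policy_step`, then sleep out
    the remainder of each 1/hz-second tick."""
    period = 1.0 / hz
    for _ in range(steps):
        t0 = time.perf_counter()
        policy_step()                                  # inference + command send
        time.sleep(max(0.0, period - (time.perf_counter() - t0)))

start = time.perf_counter()
control_loop(lambda: None, hz=50.0, steps=5)
elapsed = time.perf_counter() - start                  # ~0.1 s for 5 ticks
```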
License
- Source code: Apache 2.0
- Model weights: NVIDIA Open Model License (commercial use with attribution)
- BONES-SEED: Free for academic research and qualifying startups; separate commercial licensing available
Comparison with Other Approaches
| Approach | Strengths | Limitations |
|---|---|---|
| Decoupled WBC (RL lower + IK upper) | Simple, easy to debug | Poor upper-lower coordination |
| Model Predictive Control | Online optimization, flexible | Slow, requires accurate model |
| GEAR-SONIC | Unified policy, scales with data, zero-shot real | Needs powerful GPUs to train, depends on data quality |
SONIC belongs to a new paradigm: instead of designing controllers, collect data and scale models. Similar to how LLMs transformed NLP, motion foundation models are changing how we build robot controllers.
Key Takeaways
1. Data is king. Among the three scaling axes (model, data, compute), increasing data yields the largest gains. BONES-SEED with 142K motions and 288 hours of data is an invaluable resource for the community.
2. Motion tracking is a universal interface. Instead of training individual behaviors, SONIC proves that motion tracking — a single task — can serve as the foundation for all downstream applications.
3. Sim-to-real has matured. 100% zero-shot success rate on real hardware shows that domain randomization + enough data + the right architecture = completely bridging the sim-to-real gap, at least for basic whole-body locomotion and manipulation.
4. VR teleoperation enables a data flywheel. 3-point teleoperation requires only a VR headset + 2 controllers — cheap and simple enough for large-scale data collection, creating a virtuous cycle: data → better policy → easier teleoperation → more data.
Conclusion
GEAR-SONIC represents a significant leap in humanoid robotics: from handcrafting controllers to training behavior foundation models with data. With open-source code, pretrained models on HuggingFace, the public BONES-SEED dataset, and a production-ready C++ inference stack — there's never been a better time to start experimenting with whole-body control for humanoid robots.
If you're already working with simulation for robotics, GEAR-SONIC is the natural next project — it combines many techniques we've covered in previous series: reinforcement learning, domain randomization, and sim-to-real transfer.
References
- SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control — Zhengyi Luo et al., NVIDIA Research, arXiv 2025
- BONES-SEED Dataset — Bones Studio, GTC 2026
- GR00T-WholeBodyControl Repository — NVIDIA NVLabs