Tags: ai, humanoid, whole-body-control, nvidia, reinforcement-learning, motion-tracking, vr-teleoperation, isaac-lab

GEAR-SONIC: Whole-Body Control for Humanoid Robots

Deep dive into NVIDIA GEAR-SONIC — training a whole-body controller for humanoid robots with BONES-SEED dataset and VR teleoperation.

Nguyễn Anh Tuấn · April 13, 2026 · 11 min read

Imagine putting on a VR headset, moving your arms and legs — and a humanoid robot across the room mirrors your every motion in real time. This isn't science fiction. It's GEAR-SONIC — NVIDIA's whole-body control system that achieves 100% success rate on a Unitree G1 robot with zero real-world fine-tuning.

In this article, we'll explore the architecture, training data, deployment pipeline, and how you can get started with NVIDIA's open-source codebase.

Background: Why Whole-Body Control?

Traditionally, humanoid robot control splits into two separate branches: lower-body locomotion (walking, running, balance) and upper-body manipulation (grasping, interacting). Each uses its own controller, and coordinating them creates cascading problems — the robot can walk but drops objects mid-stride, or grasps well but falls when leaning.

Whole-body control solves this by training a single policy that commands all joints simultaneously. Instead of decomposing the problem into modules, we let a neural network learn to coordinate all 29 degrees of freedom at once.

GEAR-SONIC takes this further: rather than training separate behaviors, they use motion tracking — reproducing human movement — as a single scalable training task, then scale up with massive data. The result is a behavior foundation model that any downstream system (VR teleoperation, VLA models, gamepad control) can leverage.

Humanoid robot performing whole-body control — combining locomotion and manipulation in a single policy

What is SONIC? Architecture Overview

SONIC stands for Supersizing Motion Tracking for Natural Humanoid Whole-Body Control — a paper from NVIDIA Research (GEAR Lab), published as arXiv:2511.07820 in November 2025 by Zhengyi Luo, Ye Yuan, Tingwu Wang, and over 25 co-authors.

The core insight: Motion tracking is all you need. If a policy can accurately track any human motion, it has implicitly learned every skill — walking, running, jumping, picking up objects, dancing, even fighting.

Encoder-Decoder with Finite Scalar Quantization

SONIC's architecture has three main components:

Three specialized encoders:

  1. Robot Motion Encoder (Er) — processes 10 future frames (0.1s apart) of robot joint trajectories
  2. Human Motion Encoder (Eh) — processes 10 future frames (0.02s apart) of human SMPL poses
  3. Hybrid Motion Encoder (Em) — processes mixed robot/human commands (for 3-point teleoperation)

All encoders are MLPs with architecture [2048, 1024, 512, 512].

Finite Scalar Quantization (FSQ): This is the key step — outputs from all three encoders are quantized into the same discrete token space. Whether input comes from robot trajectories, SMPL poses, or hybrid commands, they're all represented in the same token "language." This enables seamless switching between control modes.
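A minimal numpy sketch of the FSQ idea, assuming 7 quantization levels per latent dimension (the paper's exact level counts and implementation details are not restated here):

```python
import numpy as np

def fsq_quantize(z, levels=7):
    """Finite Scalar Quantization: bound each latent dimension with
    tanh, then round it to one of `levels` uniformly spaced values.
    Training would use a straight-through estimator so gradients can
    pass through the rounding; this is a sketch, not SONIC's exact
    implementation."""
    half = (levels - 1) / 2.0
    z_bounded = np.tanh(z) * half        # squash into (-half, half)
    codes = np.round(z_bounded)          # integer codes in {-half, ..., half}
    return codes / half                  # quantized latent back in [-1, 1]

# Latents from any encoder (robot, human, or hybrid) land in the same
# discrete codebook of levels**dim tokens, here 7**4 = 2401 possibilities.
latent = np.array([0.3, -1.7, 0.05, 2.4])
token = fsq_quantize(latent)
```

Because every encoder output passes through the same quantizer, the decoders never need to know which modality produced a token, which is what makes mode switching seamless.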

Two decoders:

  1. Control Decoder (Dc) — transforms tokens into 29 target joint positions (Gaussian distribution)
  2. Auxiliary Robot Motion Decoder (Dr) — reconstructs robot motion for an additional supervision signal

Decoder MLP: [2048, 2048, 1024, 1024, 512, 512].

Observation and Action Space

| Component | Details |
| --- | --- |
| Observation | Joint poses, joint velocities, root angular velocity, gravity vector, previous action |
| Action | 29-dimensional target joint positions |
| Control frequency | 50 Hz (policy loop), 500 Hz (motor stream via Unitree low-level API) |
| Inference latency | 1-2 ms on Jetson Orin (TensorRT + CUDA Graph) |
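The observation above amounts to a concatenation of proprioceptive signals; a sketch with numpy, where the exact ordering and any normalization are assumptions rather than the paper's specification:

```python
import numpy as np

NUM_JOINTS = 29  # Unitree G1 DoF used by SONIC

def build_observation(q, qd, root_ang_vel, gravity_b, prev_action):
    """Assemble the proprioceptive observation listed in the table.
    Layout is illustrative; the real pipeline defines its own."""
    assert q.shape == (NUM_JOINTS,) and qd.shape == (NUM_JOINTS,)
    assert root_ang_vel.shape == (3,) and gravity_b.shape == (3,)
    assert prev_action.shape == (NUM_JOINTS,)
    return np.concatenate([q, qd, root_ang_vel, gravity_b, prev_action])

obs = build_observation(
    q=np.zeros(NUM_JOINTS), qd=np.zeros(NUM_JOINTS),
    root_ang_vel=np.zeros(3), gravity_b=np.array([0.0, 0.0, -1.0]),
    prev_action=np.zeros(NUM_JOINTS),
)
# 29 + 29 + 3 + 3 + 29 = 93-dimensional observation
```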

BONES-SEED Dataset: 700 Hours of Human Motion

Data is the decisive factor behind SONIC's success. The original paper trained on 100M+ frames (700 hours of motion capture) from 170 subjects with heights ranging from 145-199 cm.

In March 2026, at GTC, Bones Studio publicly released BONES-SEED (Skeletal Everyday Embodiment Dataset) — an expanded and public version of this data:

| Metric | Value |
| --- | --- |
| Total motions | 142,220 (71,132 original + 71,088 mirrored) |
| Duration | ~288 hours @ 120 fps |
| Actors | 522 (253 female, 269 male) |
| Age range | 17-71 years |
| Height range | 145-199 cm |
| File size | 114 GB |
| Capture system | Vicon optical motion capture (sub-millimeter accuracy) |

Three Data Formats

  1. SOMA Uniform (BVH) — standardized skeleton shared across all motions
  2. SOMA Proportional (BVH) — per-actor skeleton preserving body proportions
  3. Unitree G1 MuJoCo-compatible (CSV) — joint-angle trajectories ready for simulation

The dataset includes 51 metadata columns: up to 6 natural language descriptions per motion, temporal segmentation with precise timestamps, biomechanical descriptions, and actor biometrics.
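To get a feel for the simulation-ready CSV format, here is a hedged sketch that parses a synthetic joint-angle table with numpy. The real BONES-SEED files define their own header and column layout, so check the dataset card before relying on this:

```python
import io
import numpy as np

# Hypothetical layout for the Unitree G1 MuJoCo-compatible CSV:
# one row per frame, one column per joint angle in radians.
csv_text = "\n".join(
    ",".join(f"{v:.4f}" for v in np.sin(np.linspace(0, 1, 29) + t))
    for t in np.linspace(0, 2 * np.pi, 120)  # 120 frames = 1 s at 120 fps
)
frames = np.loadtxt(io.StringIO(csv_text), delimiter=",")
# frames.shape == (120, 29): frames x joint angles
```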

Motion category distribution:

| Category | Count |
| --- | --- |
| Locomotion | 74,488 |
| Communication | 21,493 |
| Interactions | 14,643 |
| Dances | 11,006 |
| Gaming | 8,700 |
| Everyday | 5,816 |
| Sport | 3,993 |

Retargeting from Human to Robot

Human motions are retargeted to Unitree G1 via GMR (Geometric Motion Retargeting). This step is essential — human bodies and robots have different proportions (arm length, leg length, joint ranges), so an intelligent mapping algorithm is needed to preserve the original motion intent while remaining physically feasible on the robot.
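A toy sketch of the proportional part of that mapping, with assumed limb lengths. GMR itself additionally solves inverse kinematics against the G1's joint limits; this only conveys the scaling idea:

```python
import numpy as np

def retarget_limb(human_points, human_len, robot_len):
    """Scale a limb's keypoints (expressed relative to the limb root)
    by the robot/human segment-length ratio. Illustrative only; the
    real GMR pipeline also enforces joint-limit feasibility."""
    scale = robot_len / human_len
    root = human_points[0]
    return root + (human_points - root) * scale

# A 0.45 m human shank mapped onto a 0.30 m robot shank (assumed lengths).
human_shank = np.array([[0.0, 0.0, 0.9], [0.0, 0.0, 0.45]])
robot_shank = retarget_limb(human_shank, human_len=0.45, robot_len=0.30)
```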

Training Pipeline

Reinforcement Learning with PPO

SONIC trains using PPO (Proximal Policy Optimization) in NVIDIA Isaac Lab — a high-speed GPU-accelerated physics simulator.

Reward function balances accurate tracking with safety:

| Component | Weight |
| --- | --- |
| Root orientation tracking | 0.5 |
| Body link positions (relative to root) | 1.0 |
| Body link orientations | 1.0 |
| Linear/angular velocities | 1.0 each |
| Action rate penalty | -0.1 |
| Joint limit violation | -10.0 |
| Undesired contacts | -0.1 |

Note the -10.0 weight for joint limit violations — an extremely strong penalty that forces the policy to respect the hardware's physical constraints.
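Assuming each tracking term is shaped as exp(-error), a common choice in motion-tracking RL and an assumption here rather than the paper's exact kernel, the weight table translates into a reward like:

```python
import numpy as np

# Weights from the table above.
WEIGHTS = {
    "root_orient": 0.5, "body_pos": 1.0, "body_orient": 1.0,
    "lin_vel": 1.0, "ang_vel": 1.0,
    "action_rate": -0.1, "joint_limit": -10.0, "undesired_contact": -0.1,
}
TRACK = ("root_orient", "body_pos", "body_orient", "lin_vel", "ang_vel")
PENALTY = ("action_rate", "joint_limit", "undesired_contact")

def tracking_reward(errors, penalties):
    """errors: non-negative tracking errors per term; penalties: raw
    penalty magnitudes (action-rate norm, limit violation, contacts)."""
    r = sum(WEIGHTS[k] * np.exp(-errors[k]) for k in TRACK)
    r += sum(WEIGHTS[k] * penalties[k] for k in PENALTY)
    return float(r)

best = tracking_reward({k: 0.0 for k in TRACK}, {k: 0.0 for k in PENALTY})
# best == 4.5; a joint-limit violation of just 0.1 already costs 1.0 of it
```

The asymmetry is the point: tracking rewards saturate at a few units, while the -10.0 limit penalty can dwarf them, so the policy learns that hardware safety is non-negotiable.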

Domain Randomization

To ensure the policy generalizes well to real-world deployment, SONIC uses aggressive domain randomization:

  • Friction: 0.3 to 1.6
  • Restitution: 0 to 0.5
  • External pushes: up to 0.5 m/s (simulating unexpected collisions)
  • Motion jitter: position and orientation noise (simulating sensor noise)
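The ranges above can be sampled once per episode roughly as follows; the push direction and the jitter standard deviation are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def _random_unit_vector():
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def randomize_episode():
    """Sample one episode's physics from the ranges listed above."""
    return {
        "friction": rng.uniform(0.3, 1.6),
        "restitution": rng.uniform(0.0, 0.5),
        "push_vel": rng.uniform(0.0, 0.5) * _random_unit_vector(),  # m/s
        "pos_jitter": rng.normal(0.0, 0.02, size=3),  # metres, assumed sigma
    }

params = randomize_episode()
```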

Adaptive Sampling

Not all trajectories are equally difficult. SONIC uses bin-based adaptive sampling — difficult trajectories (high failure rate) are sampled more frequently, helping the policy focus on its weak spots instead of repeating what it already masters.
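A minimal sketch of that idea, with an assumed probability floor so easy bins are never starved entirely:

```python
import numpy as np

def adaptive_bin_probs(failure_rates, floor=0.05):
    """Bin-based adaptive sampling sketch: draw each trajectory bin
    with probability proportional to its recent failure rate plus a
    small floor. The floor value is an assumption."""
    w = np.asarray(failure_rates, dtype=float) + floor
    return w / w.sum()

fail_rates = [0.9, 0.1, 0.0, 0.4]           # per-bin failure rates
probs = adaptive_bin_probs(fail_rates)
rng = np.random.default_rng(0)
next_bin = int(rng.choice(len(fail_rates), p=probs))  # hard bins drawn most
```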

Scaling Laws

A key finding: SONIC performance improves consistently when scaling along any of three axes:

  1. Model size: 1.2M → 42M parameters
  2. Data volume: more data = better tracking
  3. Compute: 9,000 → 21,000 GPU-hours (128 GPUs over ~3 days)

Among these, increasing data volume yields the largest gains — this is why BONES-SEED matters: it enables anyone with sufficient GPUs to train powerful whole-body controllers.

Reinforcement learning training pipeline — from motion capture data through simulation to real-world deployment

VR Teleoperation: Two Control Modes

Mode 1: Whole-body teleoperation

  • Hardware: PICO VR headset + 2 ankle trackers + 2 handheld controllers
  • Output: Full-body SMPL pose streamed in real-time
  • Encoder: Human Motion Encoder (Eh)

This mode enables full-body robot control with maximum precision — every joint is tracked.

Mode 2: 3-point teleoperation (lightweight)

  • Hardware: PICO headset + 2 handheld controllers only (no ankle trackers)
  • Input: Head + wrist SE(3) poses, finger angles, waist height, locomotion mode
  • Encoder: Kinematic Planner → Hybrid Motion Encoder (Em)

This is the more practical mode for large-scale data collection — fewer devices, faster setup, and still accurate enough for most tasks.

3-point teleoperation performance:

| Metric | Value |
| --- | --- |
| Mean latency | 121.9 ms |
| Wrist position error | 6 cm (mean), 13.3 cm (95th percentile) |
| Orientation error | 0.145 rad (mean), 0.267 rad (95th percentile) |

Kinematic Planner

The kinematic planner bridges high-level commands to motion tracking:

  • Representation: Pelvis-relative joint positions + global joint rotations
  • Backbone: Masked token prediction with 4x downsampling
  • Speed: 10 Hz, generates locomotion at 0-6 m/s
  • Root trajectory: Critically damped spring model
  • Supported styles: normal, stealth, happy, injured, boxing, kneeling, crawling
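A critically damped spring (damping ratio 1) drives the root toward its target without overshoot, which is why it makes a good trajectory smoother. A sketch of one integration step; the omega and dt values are assumed, not from the paper:

```python
def critically_damped_step(x, v, target, omega, dt):
    """One semi-implicit Euler step of x'' = omega^2 (target - x) - 2 omega x'."""
    a = omega ** 2 * (target - x) - 2.0 * omega * v
    v = v + a * dt
    x = x + v * dt
    return x, v

x, v = 0.0, 0.0
for _ in range(500):                      # 5 s at 100 Hz
    x, v = critically_damped_step(x, v, target=1.0, omega=6.0, dt=0.01)
# x settles smoothly onto the 1.0 m target
```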

Experimental Results

Simulation (AMASS benchmark)

On 1,602 trajectories from AMASS, SONIC significantly outperforms all baselines (Any2Track, BeyondMimic, GMT) in success rate, MPJPE (Mean Per Joint Position Error), and velocity/acceleration error.

Real-world (Unitree G1)

| Metric | Result |
| --- | --- |
| Success rate | 100% (50 diverse trajectories) |
| Deployment | Zero-shot (no real-hardware fine-tuning) |
| Behaviors | Dance, jumps, loco-manipulation |

100% success rate with zero-shot deployment — this is a remarkable result. The policy was trained entirely in simulation, with domain randomization strong enough to bridge the sim-to-real gap without any adaptation.

GR00T N1.5 Integration

When combined with the VLA model GR00T N1.5 (Vision-Language-Action model), SONIC achieves 95% success rate on mobile pick-and-place tasks (picking up an apple and placing it on a plate) — with only 300 fine-tuning trajectories.

Getting Started with GEAR-SONIC

Available Resources

| Resource | Link |
| --- | --- |
| GitHub repo | NVlabs/GR00T-WholeBodyControl |
| Pretrained model | HuggingFace: nvidia/GEAR-SONIC |
| BONES-SEED dataset | HuggingFace: bones-studio/seed |
| Interactive demo | MuJoCo browser demo |

Basic Setup

```bash
# Clone the repo
git clone https://github.com/NVlabs/GR00T-WholeBodyControl.git
cd GR00T-WholeBodyControl

# Install dependencies (requires NVIDIA Isaac Lab)
pip install -e .

# Download pretrained checkpoints
# The model consists of 3 ONNX files:
# - model_encoder.onnx
# - model_decoder.onnx
# - planner_sonic.onnx
```

Interactive Demo

The fastest way to experience SONIC is through the browser demo — it runs MuJoCo WASM directly in your browser, letting you load the policy and watch the G1 robot perform real-time motion tracking without installing anything.

Inference Stack

GEAR-SONIC provides a production-ready C++ inference stack:

  • TensorRT for optimized inference on Jetson Orin
  • CUDA Graph for reduced latency
  • Forward pass takes only 1-2 ms on Jetson Orin

This isn't just a research prototype — it's a system ready for real-world deployment.

License

  • Source code: Apache 2.0
  • Model weights: NVIDIA Open Model License (commercial use with attribution)
  • BONES-SEED: Free for academic research and qualifying startups; separate commercial licensing available

Comparison with Other Approaches

| Approach | Strengths | Limitations |
| --- | --- | --- |
| Decoupled WBC (RL lower + IK upper) | Simple, easy to debug | Poor upper-lower coordination |
| Model Predictive Control | Online optimization, flexible | Slow, requires accurate model |
| GEAR-SONIC | Unified policy, scales with data, zero-shot real | Needs powerful GPUs to train, depends on data quality |

SONIC belongs to a new paradigm: instead of designing controllers, collect data and scale models. Similar to how LLMs transformed NLP, motion foundation models are changing how we build robot controllers.

Key Takeaways

1. Data is king. Among the three scaling axes (model, data, compute), increasing data yields the largest gains. BONES-SEED with 142K motions and 288 hours of data is an invaluable resource for the community.

2. Motion tracking is a universal interface. Instead of training individual behaviors, SONIC proves that motion tracking — a single task — can serve as the foundation for all downstream applications.

3. Sim-to-real has matured. 100% zero-shot success rate on real hardware shows that domain randomization + enough data + the right architecture = completely bridging the sim-to-real gap, at least for basic whole-body locomotion and manipulation.

4. VR teleoperation enables a data flywheel. 3-point teleoperation requires only a VR headset + 2 controllers — cheap and simple enough for large-scale data collection, creating a virtuous cycle: data → better policy → easier teleoperation → more data.

Conclusion

GEAR-SONIC represents a significant leap in humanoid robotics: from handcrafting controllers to training behavior foundation models with data. With open-source code, pretrained models on HuggingFace, the public BONES-SEED dataset, and a production-ready C++ inference stack — there's never been a better time to start experimenting with whole-body control for humanoid robots.

If you're already working with simulation for robotics, GEAR-SONIC is the natural next project — it combines many techniques we've covered in previous series: reinforcement learning, domain randomization, and sim-to-real transfer.

