GR00T N1 + Unitree G1: decoupled WBC+VLA architecture from 6Hz to 500Hz
This is post 1 of the GR00T N1 + Unitree G1 series. The series walks through every step from data collection to deploying a working whole-body VLA policy on G1, using G1 as the concrete example but structured to adapt to any humanoid with a URDF.
This post explains why the architecture must be decoupled, and what the different frequencies mean in practice.
The core problem: 150ms inference vs 5ms control
Every architectural decision in this stack traces back to one unchangeable reality:
GR00T N1 inference: ~150ms → ~6 Hz
GEAR upper body ctrl: 20ms → 50 Hz
SONIC loco ctrl: 5ms → 200 Hz
Robot joint servo: 2ms → 500 Hz
If you run the stack end-to-end, with VLA output fed directly into the joint servos, the setpoint only changes every 150ms: the robot jerks at each update and loses balance. The solution is decoupled layers: each layer runs at its own frequency and interpolates between the targets it receives from the slower layer above it.
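To make "runs at its own frequency and interpolates" concrete, here is a minimal sketch that assumes plain linear interpolation between two consecutive slow-layer targets. The real stack may well use a different scheme; `upsample_targets` is a hypothetical helper, not part of any NVIDIA API.

```python
def upsample_targets(prev_target, next_target, slow_hz=6.0, fast_hz=50.0):
    """Linearly interpolate from prev_target to next_target.

    Returns the intermediate setpoints a fast loop (fast_hz) would track
    during one period of the slow loop (slow_hz). Illustrative only.
    """
    steps = max(1, round(fast_hz / slow_hz))  # fast ticks per slow tick
    return [
        [p + (n - p) * (k / steps) for p, n in zip(prev_target, next_target)]
        for k in range(1, steps + 1)
    ]


# Hypothetical wrist x/y/z target that moves 8 cm along x between two VLA outputs.
setpoints = upsample_targets([0.30, 0.0, 0.9], [0.38, 0.0, 0.9])
print(len(setpoints))  # 8 setpoints per VLA update, i.e. round(50 / 6)
for sp in setpoints:
    print([round(v, 3) for v in sp])
```

Instead of one 8 cm jump every 150ms, the 50Hz layer sees eight steps of about 1 cm each.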
Three layers in the GR00T-WBC stack
┌─────────────────────────────────────────┐
│ GR00T N1 (VLA) 6Hz / 150ms │
│ Input: camera × 3 + language │
│ Output: target wrist pose L/R │
│ + gripper width L/R │
└──────────────┬──────────────────────────┘
│ high-level command
┌──────────────▼──────────────────────────┐
│ GEAR (upper body) 50Hz / 20ms │
│ RL-trained arm controller │
│ Input: wrist target + proprioception │
│ Output: arm joint torques │
└──────────────┬──────────────────────────┘
│
┌──────────────▼──────────────────────────┐
│ SONIC / HOVER (loco) 200Hz / 5ms │
│ MPC + RL whole-body balance │
│ Input: CoM target + terrain │
│ Output: ALL joint commands (27–29 DoF) │
└──────────────┬──────────────────────────┘
│
┌──────────────▼──────────────────────────┐
│ Robot servo drivers 500Hz / 2ms │
│ PD controller per joint │
└─────────────────────────────────────────┘
Why split this way?
- VLA doesn't need to know how to balance — it only needs to know "hand goes here"
- Locomotion doesn't need to understand language — it only needs to know where the CoM should be placed
- Each layer has a clear domain and can be debugged and replaced independently
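The rate separation can be sketched as a toy scheduler. This is an illustration, not the real-time implementation: it assumes a 1 kHz base clock (my choice, not from the stack) and only counts how often each layer would fire in one second, using the rates quoted in this post.

```python
# Layer rates taken from this post.
LAYER_HZ = {"vla": 6, "upper_body": 50, "loco": 200, "servo": 500}


def count_ticks(duration_s=1.0, base_hz=1000):
    """Step a base clock and count how often each layer becomes due."""
    counts = {name: 0 for name in LAYER_HZ}
    next_due = {name: 0.0 for name in LAYER_HZ}  # next firing time, seconds
    for i in range(int(duration_s * base_hz)):
        t = i / base_hz
        for name, hz in LAYER_HZ.items():
            if t >= next_due[name] - 1e-9:  # small tolerance for float drift
                counts[name] += 1
                next_due[name] += 1.0 / hz  # schedule from due time, not fire time
    return counts


counts = count_ticks()
print(counts)  # {'vla': 6, 'upper_body': 50, 'loco': 200, 'servo': 500}
```

Each layer keeps its own deadline, so a slow VLA step never delays a servo tick in this model. In a real deployment the layers would typically run in separate threads or processes.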
GR00T N1: model architecture
Repo: NVIDIA/Isaac-GR00T
Input:
- Left wrist camera RGB (224×224)
- Right wrist camera RGB (224×224)
- Head camera RGB (224×224) [optional]
- Language instruction (tokenized)
Backbone:
- Eagle2 vision encoder (NVIDIA, 300M params)
- SmolLM2 language model (adapted)
- Fusion: cross-attention layers
Action head:
- Flow-matching diffusion
- Predicts action chunk (T=16 steps ahead)
- Output per step: Δ end-effector pose L/R + gripper binary
Parameters: ~2B total
Inference: ~150ms on RTX 4090
Action chunking: GR00T N1 does not predict a single action; it predicts a chunk of 16 future steps. The robot executes the current chunk while N1 infers the next one, which is how a ~6Hz model produces smooth motion.
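Whether a 16-step chunk hides the 150ms latency depends on how fast the steps are consumed, which this post does not specify. The sketch below assumes one chunk step per control tick and swaps in the new chunk as soon as inference finishes. `starved_ticks` is a hypothetical helper that counts ticks where the action buffer ran empty.

```python
from collections import deque

CHUNK_LEN = 16    # T from this post
INFER_S = 0.150   # inference latency from this post


def starved_ticks(step_hz, duration_s=2.0):
    """Count control ticks with no action available.

    ASSUMPTION: one chunk step is consumed per tick at step_hz, and
    inference restarts immediately after each chunk is delivered.
    """
    buffer = deque(range(CHUNK_LEN))  # first chunk is already available
    infer_done_at = INFER_S           # the next chunk is being computed
    starved = 0
    for i in range(int(duration_s * step_hz)):
        t = i / step_hz
        if t >= infer_done_at:
            # New chunk arrives: drop the rest of the old one and restart inference.
            buffer = deque(range(CHUNK_LEN))
            infer_done_at = t + INFER_S
        if buffer:
            buffer.popleft()
        else:
            starved += 1
    return starved


print(CHUNK_LEN / 50)       # 0.32 s of motion per chunk at 50 Hz
print(starved_ticks(50))    # 0: the chunk outlasts the 150ms inference
print(starved_ticks(200))   # > 0: at 200 Hz a chunk covers only 80ms
```

Under these assumptions, consuming the chunk at the 50Hz upper-body rate leaves comfortable slack, while consuming it at 200Hz would leave gaps between chunks.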
Unitree G1: joint map
G1 has up to 29 DoF in its full configuration. The breakdown below lists 27 of them, grippers included:
Each leg (×2):
hip_yaw, hip_roll, hip_pitch → 3 DoF
knee_pitch → 1 DoF
ankle_pitch, ankle_roll → 2 DoF
= 6 DoF × 2 legs = 12 DoF
Each arm (×2):
shoulder_pitch, shoulder_roll, shoulder_yaw → 3 DoF
elbow_pitch → 1 DoF
wrist_roll, wrist_pitch → 2 DoF
= 6 DoF × 2 arms = 12 DoF
Torso (waist):
waist_yaw → 1 DoF
Gripper (×2):
gripper_left, gripper_right → 2 DoF
Total: 12 + 12 + 1 + 2 = 27 DoF as listed (the 29-DoF figure applies to configurations with additional joints not listed here)
GR00T-WBC controls all 27–29 DoF simultaneously, which is why WBC is much more complex than an arm-only policy.
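The breakdown above can be written as a small data structure to check the count. The names below follow the labels used in this post and are illustrative; the actual joint names in the G1 URDF may differ.

```python
LEG = ["hip_yaw", "hip_roll", "hip_pitch", "knee_pitch", "ankle_pitch", "ankle_roll"]
ARM = ["shoulder_pitch", "shoulder_roll", "shoulder_yaw",
       "elbow_pitch", "wrist_roll", "wrist_pitch"]


def build_joint_names():
    """Expand the per-limb lists into one flat, ordered joint list."""
    names = [f"{side}_{joint}" for side in ("left", "right") for joint in LEG]
    names += [f"{side}_{joint}" for side in ("left", "right") for joint in ARM]
    names += ["waist_yaw", "gripper_left", "gripper_right"]
    return names


joint_names = build_joint_names()
print(len(joint_names))  # 27 = 12 (legs) + 12 (arms) + 1 (waist) + 2 (grippers)
```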
Adapting for other robots
The stack is modular — you can swap G1 for any humanoid with a full URDF and joint SDK.
Swapping the robot: 3 steps
Step 1: Provide URDF
# Directory structure in GR00T-WBC
groot_wbc/robots/
├── g1/
│ ├── g1.urdf
│ ├── joint_config.yaml ← this is what to modify
│ └── pd_gains.yaml
├── gr1/
└── YOUR_ROBOT/ ← create new directory
├── your_robot.urdf
├── joint_config.yaml
└── pd_gains.yaml
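If you prefer to script Step 1, the layout above can be created as follows. This is a sketch that writes empty placeholder files into a temporary directory; `scaffold_robot` is a hypothetical helper, not a tool shipped with the repo.

```python
import tempfile
from pathlib import Path


def scaffold_robot(robots_dir, robot_name):
    """Create <robots_dir>/<robot_name>/ with the three files listed above."""
    robot_dir = Path(robots_dir) / robot_name
    robot_dir.mkdir(parents=True, exist_ok=True)
    for filename in (f"{robot_name}.urdf", "joint_config.yaml", "pd_gains.yaml"):
        (robot_dir / filename).touch(exist_ok=True)  # empty placeholder
    return sorted(p.name for p in robot_dir.iterdir())


# Demo in a throwaway directory; point robots_dir at your checkout instead.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_robot(Path(tmp) / "groot_wbc" / "robots", "your_robot")
    print(created)  # ['joint_config.yaml', 'pd_gains.yaml', 'your_robot.urdf']
```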
Step 2: Edit joint_config.yaml
# joint_config.yaml for your robot
robot_name: "your_robot"
urdf_path: "robots/YOUR_ROBOT/your_robot.urdf"
# Map joint names in the order your URDF specifies
left_arm_joints:
- "left_shoulder_pitch_joint"
- "left_shoulder_roll_joint"
- "left_shoulder_yaw_joint"
- "left_elbow_pitch_joint"
- "left_wrist_roll_joint"
- "left_wrist_pitch_joint"
right_arm_joints:
- "right_shoulder_pitch_joint"
# ... same pattern
leg_joints:
- "left_hip_yaw_joint"
# ... 12 joints for 2 legs
# End-effector frames (must exist in URDF)
left_ee_frame: "left_gripper_link"
right_ee_frame: "right_gripper_link"
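A mismatch between joint_config.yaml and the URDF is easy to introduce, so it is worth checking before launching anything. The sketch below uses a toy URDF string and a plain dict standing in for the parsed YAML; both are made up for illustration, and `missing_from_urdf` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Toy URDF with two joints and three links (not the G1 model).
URDF_TEXT = """<robot name="toy">
  <link name="base"/>
  <link name="left_upper_arm"/>
  <link name="left_gripper_link"/>
  <joint name="left_shoulder_pitch_joint" type="revolute">
    <parent link="base"/><child link="left_upper_arm"/>
  </joint>
  <joint name="left_wrist_pitch_joint" type="revolute">
    <parent link="left_upper_arm"/><child link="left_gripper_link"/>
  </joint>
</robot>"""

# Stand-in for the parsed joint_config.yaml.
config = {
    "left_arm_joints": ["left_shoulder_pitch_joint", "left_elbow_pitch_joint"],
    "left_ee_frame": "left_gripper_link",
}


def missing_from_urdf(urdf_text, config):
    """Return config joint and frame names that do not exist in the URDF."""
    root = ET.fromstring(urdf_text)
    joints = {j.get("name") for j in root.iter("joint")}
    links = {link.get("name") for link in root.iter("link")}
    missing = []
    for key, value in config.items():
        if key.endswith("_joints"):
            missing += [name for name in value if name not in joints]
        elif key.endswith("_ee_frame") and value not in links:
            missing.append(value)
    return missing


print(missing_from_urdf(URDF_TEXT, config))  # ['left_elbow_pitch_joint']
```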
Step 3: Tune PD gains
# pd_gains.yaml — adjust to match your motor specs
joint_gains:
left_shoulder_pitch_joint:
kp: 150.0 # position gain
kd: 10.0 # velocity gain (damping)
left_elbow_pitch_joint:
kp: 80.0
kd: 5.0
# ... each joint has its own gains
For G1: gains are already tuned by NVIDIA in groot_wbc/robots/g1/pd_gains.yaml. For other robots, start with low gains (kp 50, kd 3) and increase after testing in sim.
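For intuition about what kp and kd do, here is the standard per-joint PD law, torque = kp * (q_des - q) + kd * (dq_des - dq), applied to a toy joint. The unit inertia and the absence of gravity and friction are assumptions made for the sketch; real joints need the sim testing described above.

```python
def pd_torque(kp, kd, q_des, q, dq, dq_des=0.0):
    """Standard PD law: position error times kp plus velocity error times kd."""
    return kp * (q_des - q) + kd * (dq_des - dq)


def simulate_joint(kp, kd, q_des=0.5, inertia=1.0, dt=0.002, steps=2500):
    """Integrate one joint for steps * dt seconds at the 500 Hz servo rate.

    ASSUMPTION: pure inertia of 1.0 kg*m^2, no gravity, no friction.
    """
    q, dq = 0.0, 0.0
    for _ in range(steps):
        tau = pd_torque(kp, kd, q_des, q, dq)
        dq += (tau / inertia) * dt  # semi-implicit Euler: velocity first
        q += dq * dt                # then position with the new velocity
    return q


# Conservative starting gains from this post, then the shoulder gains from the example.
print(round(simulate_joint(50.0, 3.0), 3))
print(round(simulate_joint(150.0, 10.0), 3))
```

Both settings settle near the 0.5 rad target within the 5 simulated seconds. Of the two, the higher gains converge faster on this toy joint, and the low-gain setting is the more forgiving place to start on new hardware.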
Hardware requirements to run this series
| Component | Minimum | Recommended |
|---|---|---|
| GPU (training) | RTX 4090 (24GB) | A100 40GB |
| GPU (inference) | RTX 3090 (24GB) | RTX 4090 |
| RAM | 32GB | 64GB |
| Storage | 500GB SSD | 2TB NVMe |
| Robot | None (posts 2-4 run in sim) | Unitree G1 |
| Sim | Isaac Lab (Isaac Sim 4.x) | Isaac Lab |
No G1? Posts 2-4 can be done entirely in Isaac Sim with the G1 URDF. Post 5 (sim2real) requires real hardware.
Series roadmap
| Post | Topic |
|---|---|
| Post 1 (this) | Decoupled architecture, G1 joints, robot adaptation |
| Post 2 | Data collection: Isaac Lab teleop + xr_teleoperate, LeRobot format |
| Post 3 | Fine-tune GR00T N1: GPU config, training script |
| Post 4 | Deploy GR00T-WBC on G1: GEAR + SONIC |
| Post 5 | Sim2real + Evaluation: domain rand, humanoid-bench |
Key takeaway
The most important insight from this post: decoupling is not a compromise; it is the correct engineering choice. Mixing VLA inference (ML) and joint servo control into a single loop gives you a system that is hard to debug and unsafe. With decoupled layers, each layer has a clear responsibility, can be tested independently, and fails in an identifiable place when something goes wrong.
Next: Data collection with Isaac Lab and xr_teleoperate → LeRobot format.
References
- GR00T N1 paper (arxiv:2503.14734)
- GR00T-WBC paper (arxiv:2506.08000)
- NVIDIA/Isaac-GR00T GitHub
- NVlabs/GR00T-WholeBodyControl GitHub
- Unitree G1 URDF