GR00T N1 + Unitree G1: decoupled WBC+VLA architecture from 6Hz to 500Hz

How GR00T N1 (VLA at 6Hz) combines with GR00T-WBC (200Hz) to control Unitree G1 — decoupled architecture, why end-to-end doesn't work, and how to adapt for other robots.

Nguyễn Anh Tuấn | June 2, 2026 | 6 min read | Updated: Jun 6, 2026

This is post 1 of the GR00T N1 + Unitree G1 series. The series walks through every step from data collection to deploying a working whole-body VLA policy on G1, using G1 as the concrete example but structured to adapt to any humanoid with a URDF.

This post explains why the architecture must be decoupled, and what the different frequencies mean in practice.

The core problem: 150ms inference vs 5ms control

Every architectural decision in this stack traces back to one unchangeable reality:

GR00T N1 inference:  ~150ms  →  ~6 Hz
GEAR upper body ctrl:  20ms  →  50 Hz
SONIC loco ctrl:        5ms  → 200 Hz
Robot joint servo:      2ms  → 500 Hz

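If you run end-to-end, feeding VLA output directly into the joint servos, the 500 Hz servo loop holds each target for about 75 cycles (150ms / 2ms) and then jumps to the next one. The robot jerks heavily every 150ms and loses balance. The solution is decoupled layers: each layer runs at its own frequency and interpolates between the commands it receives from the layer above.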
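
To make "runs at its own frequency and interpolates" concrete, here is a minimal sketch of how a 50 Hz layer might blend between two consecutive VLA targets. The function name `interpolate_target` and the choice of plain linear interpolation on a 3D position are assumptions for illustration; the post does not say which interpolation scheme GR00T-WBC uses.

```python
def interpolate_target(prev, new, t_since_new_ms, period_ms=150.0):
    """Linearly blend from the previous slow-layer target to the newest one.

    prev, new: (x, y, z) wrist positions from two consecutive VLA outputs.
    t_since_new_ms: time elapsed since `new` arrived.
    period_ms: interval between VLA outputs (~150 ms in this post).
    """
    alpha = min(max(t_since_new_ms / period_ms, 0.0), 1.0)  # clamp to [0, 1]
    return tuple(p + alpha * (n - p) for p, n in zip(prev, new))


# Hypothetical targets: the wrist moves 6 cm along x between two VLA outputs.
prev_target = (0.30, 0.10, 0.90)
new_target = (0.36, 0.10, 0.90)

# A 50 Hz layer samples the blend every 20 ms instead of jumping in one step.
setpoints = [interpolate_target(prev_target, new_target, t) for t in range(0, 160, 20)]
```

Each 20ms tick moves the setpoint by a fraction of the 6 cm step, so the layers below never see the full jump at once.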

Three layers in the GR00T-WBC stack

┌─────────────────────────────────────────┐
│  GR00T N1 (VLA)          6Hz / 150ms   │
│  Input: camera × 3 + language           │
│  Output: target wrist pose L/R          │
│          + gripper width L/R            │
└──────────────┬──────────────────────────┘
               │ high-level command
┌──────────────▼──────────────────────────┐
│  GEAR (upper body)       50Hz / 20ms   │
│  RL-trained arm controller              │
│  Input: wrist target + proprioception   │
│  Output: arm joint torques              │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│  SONIC / HOVER (loco)   200Hz / 5ms    │
│  MPC + RL whole-body balance            │
│  Input: CoM target + terrain            │
│  Output: ALL joint commands (27-29 DoF) │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│  Robot servo drivers     500Hz / 2ms   │
│  PD controller per joint                │
└─────────────────────────────────────────┘

Why split this way?

  • VLA doesn't need to know how to balance — it only needs to know "hand goes here"
  • Locomotion doesn't need to understand language — it only needs to know where the CoM should be placed
  • Each layer has a clear domain and can be debugged and replaced independently
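
The layering above can be sketched as a multi-rate loop in which every layer runs on its own period and simply reads the most recent output of the layer above it. This is a toy simulation, not the GR00T-WBC scheduler: the layer names, the 1ms simulation tick, and the "newest command wins" rule are assumptions made for illustration. Only the four periods come from the post.

```python
PERIODS_US = {              # layer periods from the post, in microseconds
    "vla": 150_000,         # ~6 Hz
    "upper_body": 20_000,   # 50 Hz
    "loco": 5_000,          # 200 Hz
    "servo": 2_000,         # 500 Hz
}
ORDER = ["vla", "upper_body", "loco", "servo"]  # top of the stack first


def run_stack(horizon_us, tick_us=1_000):
    """Count how often each layer runs and how stale its input can get."""
    runs = {name: 0 for name in ORDER}
    last_pub = {name: None for name in ORDER}   # time of each layer's last output
    max_age = {name: 0 for name in ORDER[1:]}   # worst-case age of the input read
    for t in range(0, horizon_us, tick_us):
        for i, name in enumerate(ORDER):
            if t % PERIODS_US[name] != 0:
                continue                         # not this layer's turn
            if i > 0:
                # Read the newest upstream command, however old it is.
                age = t - last_pub[ORDER[i - 1]]
                max_age[name] = max(max_age[name], age)
            last_pub[name] = t
            runs[name] += 1
    return runs, max_age


runs, max_age = run_stack(horizon_us=600_000)   # simulate 0.6 s
```

In 0.6 s the VLA runs 4 times while the servo loop runs 300 times, and the upper-body layer sometimes works from a VLA command that is 140ms old. That staleness is what the lower layers have to absorb by interpolating and balancing on their own.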

GR00T N1: model architecture

Repo: NVIDIA/Isaac-GR00T

Input:
  - Left wrist camera RGB (224×224)
  - Right wrist camera RGB (224×224)
  - Head camera RGB (224×224) [optional]
  - Language instruction (tokenized)

Backbone:
  - Eagle-2 vision-language model (NVIDIA): SigLIP-2 image encoder + SmolLM2 language model
  - Fusion: cross-attention layers

Action head:
  - Flow-matching diffusion
  - Predicts action chunk (T=16 steps ahead)
  - Output per step: Δ end-effector pose L/R + gripper binary

Parameters: ~2B total
Inference: ~150ms on RTX 4090

Action chunking: GR00T N1 doesn't predict a single action — it predicts a chunk of 16 steps ahead. The robot executes that chunk while N1 infers the next one. This is how 6Hz becomes smooth-feeling motion.
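
A small simulation shows why a 16-step chunk covers the inference gap. It assumes one chunk step is consumed per 20ms (50 Hz) tick, which the post does not state explicitly, and it uses a deliberately simple policy where a new chunk replaces whatever is left of the old one. Under that assumption a chunk holds 16 × 20ms = 320ms of motion, more than the 150ms needed to produce the next chunk. The function name and the 10ms simulation tick are hypothetical.

```python
def simulate_chunks(chunk_len=16, infer_ms=150, step_ms=20, total_ms=3000):
    """Return (executed_steps, starved_ticks) for a simple chunk-replacement policy.

    infer_ms and step_ms must be multiples of the 10 ms simulation tick.
    """
    queue = []
    executed = starved = 0
    for t in range(0, total_ms, 10):
        if t > 0 and t % infer_ms == 0:
            # A fresh chunk arrives and replaces the remainder of the old one.
            queue = list(range(chunk_len))
        if t >= infer_ms and t % step_ms == 0:
            if queue:
                queue.pop(0)        # execute the next step of the chunk
                executed += 1
            else:
                starved += 1        # nothing left to execute on this tick
    return executed, starved


ok_executed, ok_starved = simulate_chunks(chunk_len=16)
short_executed, short_starved = simulate_chunks(chunk_len=4)
```

With 16 steps per chunk the controller always has a step to execute. With only 4 steps (80ms of motion) it runs dry before the next chunk arrives, which is the jerky behavior chunking is meant to avoid.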

Unitree G1: joint map

G1 has 29 DoF in full config (with gripper):

Each leg (×2):
  hip_yaw, hip_roll, hip_pitch  → 3 DoF
  knee_pitch                    → 1 DoF
  ankle_pitch, ankle_roll       → 2 DoF
  = 6 DoF × 2 legs = 12 DoF

Each arm (×2):
  shoulder_pitch, shoulder_roll, shoulder_yaw → 3 DoF
  elbow_pitch                               → 1 DoF
  wrist_roll, wrist_pitch                   → 2 DoF
  = 6 DoF × 2 arms = 12 DoF

Torso (waist):
  waist_yaw                     → 1 DoF

Gripper (×2):
  gripper_left, gripper_right   → 2 DoF

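Total for the joints listed above: 12 + 12 + 1 + 2 = 27 DoF. The 29 DoF figure includes joints not listed here (for example, additional waist or wrist axes, depending on the configuration), so take the final count from your URDF.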

GR00T-WBC controls all 27-29 DoF simultaneously — which is why WBC is much more complex than an arm-only policy.
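
As a quick check of the arithmetic, the joint groups listed above can be tallied in a few lines. The dictionary layout and the `count_dof` helper are illustrative only; the joint names are copied from the list in this post, not from the G1 URDF.

```python
# (joints per limb, number of copies), as listed in this post
G1_JOINT_GROUPS = {
    "leg": (["hip_yaw", "hip_roll", "hip_pitch",
             "knee_pitch", "ankle_pitch", "ankle_roll"], 2),
    "arm": (["shoulder_pitch", "shoulder_roll", "shoulder_yaw",
             "elbow_pitch", "wrist_roll", "wrist_pitch"], 2),
    "waist": (["waist_yaw"], 1),
    "gripper": (["gripper"], 2),
}


def count_dof(groups):
    """Sum joints-per-limb times number of limbs over all groups."""
    return sum(len(joints) * copies for joints, copies in groups.values())


per_group = {name: len(j) * n for name, (j, n) in G1_JOINT_GROUPS.items()}
total_dof = count_dof(G1_JOINT_GROUPS)
```

The listed joints add up to 27, so any higher count has to come from joints that are present in the URDF but not in this list.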

Adapting for other robots

The stack is modular — you can swap G1 for any humanoid with a full URDF and joint SDK.

Swapping the robot: 3 steps

Step 1: Provide URDF

# Directory structure in GR00T-WBC
groot_wbc/robots/
  ├── g1/
  │   ├── g1.urdf
  │   ├── joint_config.yaml     ← this is what to modify
  │   └── pd_gains.yaml
  ├── gr1/
  └── YOUR_ROBOT/               ← create new directory
      ├── your_robot.urdf
      ├── joint_config.yaml
      └── pd_gains.yaml
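
Before editing any config, it can help to confirm the new robot directory has the same three files as the G1 one. The helper below is a hypothetical convenience script, not part of GR00T-WBC; it only assumes the directory layout shown above.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("joint_config.yaml", "pd_gains.yaml")


def missing_robot_files(robot_dir):
    """List the files a robot directory still lacks, following the layout above."""
    robot_dir = Path(robot_dir)
    missing = [name for name in REQUIRED_FILES if not (robot_dir / name).is_file()]
    if not any(robot_dir.glob("*.urdf")):
        missing.append("*.urdf")
    return missing


# Demo in a throwaway directory so nothing in a real checkout is touched.
with tempfile.TemporaryDirectory() as root:
    robot = Path(root) / "YOUR_ROBOT"
    robot.mkdir()
    before = missing_robot_files(robot)
    for name in ("your_robot.urdf", "joint_config.yaml", "pd_gains.yaml"):
        (robot / name).touch()
    after = missing_robot_files(robot)
```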

Step 2: Edit joint_config.yaml

# joint_config.yaml for your robot
robot_name: "your_robot"
urdf_path: "robots/YOUR_ROBOT/your_robot.urdf"

# Map joint names in the order your URDF specifies
left_arm_joints:
  - "left_shoulder_pitch_joint"
  - "left_shoulder_roll_joint"
  - "left_shoulder_yaw_joint"
  - "left_elbow_pitch_joint"
  - "left_wrist_roll_joint"
  - "left_wrist_pitch_joint"

right_arm_joints:
  - "right_shoulder_pitch_joint"
  # ... same pattern

leg_joints:
  - "left_hip_yaw_joint"
  # ... 12 joints for 2 legs

# End-effector frames (must exist in URDF)
left_ee_frame: "left_gripper_link"
right_ee_frame: "right_gripper_link"
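
A common failure when porting is a joint or frame name in the config that does not exist in the URDF. The sketch below checks for that using only the Python standard library. It relies on the standard URDF layout (`<joint name="...">` and `<link name="...">` elements); the function name, the toy URDF, and passing the names in as plain lists rather than loading the YAML are assumptions made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET


def check_config_against_urdf(urdf_xml, joint_names, ee_frames):
    """Report config names that are missing from the URDF."""
    root = ET.fromstring(urdf_xml)
    urdf_joints = {j.get("name") for j in root.iter("joint")}
    urdf_links = {l.get("name") for l in root.iter("link")}
    return {
        "missing_joints": [n for n in joint_names if n not in urdf_joints],
        "missing_frames": [f for f in ee_frames if f not in urdf_links],
    }


# Toy URDF with two joints and one gripper link (hypothetical robot).
TOY_URDF = """
<robot name="toy">
  <link name="torso"/>
  <link name="left_upper_arm"/>
  <link name="left_gripper_link"/>
  <joint name="left_shoulder_pitch_joint" type="revolute">
    <parent link="torso"/><child link="left_upper_arm"/>
  </joint>
  <joint name="left_wrist_pitch_joint" type="revolute">
    <parent link="left_upper_arm"/><child link="left_gripper_link"/>
  </joint>
</robot>
"""

report = check_config_against_urdf(
    TOY_URDF,
    joint_names=["left_shoulder_pitch_joint", "left_elbow_pitch_joint"],
    ee_frames=["left_gripper_link", "right_gripper_link"],
)
```

Running a check like this before launching the controller turns a silent name mismatch into an explicit list of what to fix.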

Step 3: Tune PD gains

# pd_gains.yaml — adjust to match your motor specs
joint_gains:
  left_shoulder_pitch_joint:
    kp: 150.0   # position gain
    kd: 10.0    # velocity gain (damping)
  left_elbow_pitch_joint:
    kp: 80.0
    kd: 5.0
  # ... each joint has its own gains

For G1: gains are already tuned by NVIDIA in groot_wbc/robots/g1/pd_gains.yaml. For other robots, start with low gains (kp 50, kd 3) and increase after testing in sim.
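
To see what kp and kd do, here is the standard PD law applied to a toy single joint. The joint model (a bare 1 kg·m² inertia with no gravity or friction) and the 2ms step are assumptions chosen for illustration; a real joint behaves differently, which is why the gains should still be tuned in sim first.

```python
def pd_torque(kp, kd, q_des, q, dq_des=0.0, dq=0.0):
    """Standard PD law: torque from position error and velocity error."""
    return kp * (q_des - q) + kd * (dq_des - dq)


def simulate_joint(kp, kd, q_des=1.0, inertia=1.0, dt=0.002, seconds=5.0):
    """Drive a toy unit-inertia joint toward q_des at a 500 Hz servo rate."""
    q, dq = 0.0, 0.0
    for _ in range(round(seconds / dt)):
        tau = pd_torque(kp, kd, q_des, q, 0.0, dq)
        dq += (tau / inertia) * dt   # semi-implicit Euler: velocity first
        q += dq * dt                 # then position with the new velocity
    return q


initial_torque = pd_torque(150.0, 10.0, q_des=1.0, q=0.0)
final_low = simulate_joint(kp=50.0, kd=3.0)     # the "start low" gains
final_high = simulate_joint(kp=150.0, kd=10.0)  # gains from the example above
```

Both gain sets settle at the 1 rad target on this toy joint; the lower gains take longer and oscillate more on the way, but they also command less torque for the same error, which is the safer place to start.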

Hardware requirements to run this series

Component         Minimum                             Recommended
GPU (training)    RTX 4090 (24GB)                     A100 40GB
GPU (inference)   RTX 3090 (24GB)                     RTX 4090
RAM               32GB                                64GB
Storage           500GB SSD                           2TB NVMe
Robot             Optional for posts 2-4 (sim only)   Unitree G1
Sim               Isaac Lab (Isaac Sim 4.x)           Isaac Lab

No G1? Posts 2-4 can be done entirely in Isaac Sim with the G1 URDF. Post 5 (sim2real) requires real hardware.

Series roadmap

Post            Topic
Post 1 (this)   Decoupled architecture, G1 joints, robot adaptation
Post 2          Data collection: Isaac Lab teleop + xr_teleoperate, LeRobot format
Post 3          Fine-tune GR00T N1: GPU config, training script
Post 4          Deploy GR00T-WBC on G1: GEAR + SONIC
Post 5          Sim2real + Evaluation: domain rand, humanoid-bench

Key takeaway

The most important insight from this post: decoupling is not a compromise, it is sound engineering. Mixing VLA inference (ML) and joint servoing (control) into a single loop gives you a system that is hard to debug and unsafe to run. With separate layers, each one has a clear responsibility, can be tested independently, and fails in a way you can trace to a single layer.

Next: Data collection with Isaac Lab and xr_teleoperate → LeRobot format.

