
VLA + WBC repos from the US: NVIDIA GR00T, openpi, HumanPlus, TeleVision

A deep dive into VLA and whole-body control (WBC) GitHub repos from US companies and labs: NVIDIA Isaac-GR00T, GR00T-WholeBodyControl, Physical Intelligence openpi, Stanford HumanPlus, UC San Diego/MIT TeleVision, OpenHelix, and EgoHumanoid.

Nguyễn Anh Tuấn · June 5, 2026 · 8 min read

This is post 2 of the VLA + WBC repos landscape series. It takes a deep dive into US repositories from NVIDIA, Physical Intelligence, Stanford, UC San Diego, MIT, and other open-source labs.

What defines this group: a strong focus on generalization (one model works across multiple robots and tasks) and open research culture (most have paper + code + dataset released simultaneously).

NVIDIA: Isaac-GR00T and GR00T-WholeBodyControl

NVIDIA has two separate repos for two different problems — don't confuse them.

Isaac-GR00T (~7.3k stars)

Repo: NVIDIA/Isaac-GR00T

What it is: A foundation model for robots that maps images plus a language instruction to actions. Think of it as a LLaVA-style model whose output is robot actions instead of text. The first release is GR00T N1 (March 2025).

Architecture:

Observation: [wrist camera RGB] + [head camera RGB] + [text instruction]
Backbone: Eagle-2 vision-language model (NVIDIA), with a SigLIP-2 image encoder
Language: SmolLM2 LLM (inside Eagle-2)
Action head: diffusion transformer trained with flow matching
Output: joint positions (delta) or end-effector pose
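
To make the action head line concrete, here is a minimal sketch of how a flow-matching head samples an action chunk: start from Gaussian noise and integrate a velocity field from t=0 to t=1 with Euler steps. The velocity function is a toy stand-in, not GR00T's network, and the names `sample_actions` and `toy_velocity`, the chunk shape, and the step count are all assumptions for illustration.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Toy stand-in for the learned network v(x, t | observation).

    On the straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target at t = 1 is (target - x_t) / (1 - t).
    A real model never sees `target`; it predicts the velocity from the
    observation and instruction.
    """
    return (target - x) / (1.0 - t)

def sample_actions(velocity_fn, horizon=16, action_dim=7, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # pure noise at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        x = x + dt * velocity_fn(x, t)
    return x

target = np.zeros((16, 7))  # pretend the "correct" chunk is all zeros
actions = sample_actions(lambda x, t: toy_velocity(x, t, target))
```

With the toy field the sampler lands on `target`, which is only a sanity check on the integrator. The point is the loop shape: a handful of network calls per action chunk instead of one call per action.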

Strengths: Pretrained on 1000+ tasks across many robots (Franka, UR5, G1, H1, GR1...). Fine-tuning with ~50-100 demos is usually enough for a new task.

Weaknesses: Inference is heavy — needs GPU (recommended A100/H100 for training, RTX 4090 for inference). Not production-ready for real-time deployment (<100ms per step) yet.

Getting started:

git clone https://github.com/NVIDIA/Isaac-GR00T.git
cd Isaac-GR00T
pip install -e ".[dev]"

# Fine-tune with your demo data (LeRobot format)
python scripts/finetune.py \
  --model_path nvidia/GR00T-N1-2B \
  --dataset_path path/to/your/lerobot_dataset \
  --output_dir ./finetuned_model

Data format: LeRobot (HuggingFace). If you already have UMI data, you'll need to convert to LeRobot format first.
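
The conversion is mostly a reshaping job: one row per frame, plus index columns. Below is a sketch of that flattening in plain Python, deliberately without the lerobot library. The input layout (`states`, `actions`, `task`) is a hypothetical stand-in for whatever your UMI export looks like, and the output column names follow LeRobot's conventions only as an assumption, so check them against the dataset version you target.

```python
def flatten_episodes(episodes, fps=30):
    """Turn a list of episodes into one flat list of per-frame rows."""
    rows = []
    for episode_index, ep in enumerate(episodes):
        if len(ep["states"]) != len(ep["actions"]):
            raise ValueError(f"episode {episode_index}: states/actions length mismatch")
        for frame_index, (state, action) in enumerate(zip(ep["states"], ep["actions"])):
            rows.append({
                "index": len(rows),              # global frame counter
                "episode_index": episode_index,
                "frame_index": frame_index,      # restarts at 0 in every episode
                "timestamp": frame_index / fps,  # seconds since episode start
                "observation.state": list(state),
                "action": list(action),
                "task": ep["task"],              # kept as text here for readability
            })
    return rows

# Hypothetical demo data: two short episodes of the same task.
demo = [
    {"task": "pick up the red cup", "states": [[0.0, 0.1], [0.0, 0.2]], "actions": [[0.1], [0.2]]},
    {"task": "pick up the red cup", "states": [[0.5, 0.5]], "actions": [[0.0]]},
]
rows = flatten_episodes(demo, fps=10)
```

Camera frames are left out on purpose; they are usually stored as video files next to the tabular data rather than inside it.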

Paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots


GR00T-WholeBodyControl (~2.2k stars)

Repo: NVlabs/GR00T-WholeBodyControl

What it is: WBC platform for deploying policies on real humanoid robots — not a VLA model. This is the middle layer between VLA output and robot joints.

Decoupled architecture:

VLA policy → [target wrist pose L/R + gripper] (upper body)
                    ↓
         GR00T-WBC ← GEAR (upper body RL controller)
                    ↓
         SONIC (loco-manipulation controller)
                    ↓
         Robot joints (30+ DoF)

Three components:

  • N1.5 / N1.6: VLA policy (runs at 6Hz)
  • GEAR: upper body controller (50Hz, RL-trained, from joint targets → torques)
  • SONIC: whole-body loco controller (200Hz, MPC + RL)

Why decoupled? VLA inference (~150ms per step) and the robot control loop (5ms per step) can't run at the same rate. Decoupling lets each layer run at its own frequency.
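
A small simulation makes the rate mismatch tangible. It counts how often each layer runs during one second when everything is driven from a 200Hz base loop. The scheduler is a sketch for illustration, not code from the GR00T-WholeBodyControl repo; the layer names and the `simulate` function are assumptions.

```python
from fractions import Fraction

def simulate(rates_hz, base_hz=200, duration_s=1):
    """Count how often each layer runs when a base loop ticks at base_hz.

    A layer runs on the first base tick at or after its due time, then its
    due time advances by one period. Fractions avoid floating-point drift.
    """
    period = {name: Fraction(1, hz) for name, hz in rates_hz.items()}
    due = {name: Fraction(0) for name in rates_hz}
    runs = {name: 0 for name in rates_hz}
    for k in range(base_hz * duration_s):
        t = Fraction(k, base_hz)
        for name in rates_hz:
            if t >= due[name]:
                runs[name] += 1
                due[name] += period[name]
    return runs

runs = simulate({"vla": 6, "upper_body": 50, "whole_body": 200})
print(runs)  # {'vla': 6, 'upper_body': 50, 'whole_body': 200}
```

Between two VLA outputs the 200Hz loop runs roughly 33 times, each time reusing the latest target from the layer above.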

Supported robots: Fourier GR1, Unitree G1, Unitree H1 (and any robot with a URDF and sufficient DoF).

Paper: GR00T-WBC: Decoupled Whole-Body Control for Humanoid Locomotion and Manipulation


HOVER (~742 stars)

Repo: NVlabs/HOVER

What it is: A precursor experiment to GR00T-WBC: a single unified whole-body controller. Many people use HOVER as a baseline when comparing against GR00T-WBC.


Physical Intelligence: openpi (~12.2k stars)

Repo: physical-intelligence/openpi

What it is: The open-source release of π0 and π0.5, among the strongest open VLA models on published benchmarks (March 2025 release).

π0 architecture:

Base: PaliGemma (vision-language backbone from Google)
Action: flow-matching → continuous action
Frequency: 50Hz (fast enough for real robot)
Parameters: 3B
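
How can a 3B-parameter model drive a 50Hz control loop? The usual answer is action chunking: one forward pass predicts several future actions, so the model is called less often than the robot is commanded. The arithmetic below is a sketch; the chunk sizes are illustrative numbers, not openpi's configuration, and `inference_budget` is a hypothetical helper.

```python
def inference_budget(control_hz, steps_executed_per_chunk):
    """Return (model calls per second, time budget per call in ms)."""
    calls_per_s = control_hz / steps_executed_per_chunk
    budget_ms = 1000.0 / calls_per_s
    return calls_per_s, budget_ms

# Executing 25 actions from each predicted chunk at 50Hz control:
print(inference_budget(50, 25))  # (2.0, 500.0)
# Without chunking, every control step needs its own forward pass:
print(inference_budget(50, 1))   # (50.0, 20.0)
```

The trade-off is reactivity: the more steps you execute from one chunk, the longer the robot acts on an observation that is already out of date.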

π0.5 improvements:

  • Added reasoning chain (chain-of-thought for robots)
  • Long-horizon task execution (multi-step)
  • Zero-shot cross-embodiment transfer

Why openpi is appealing:

  1. Inference speed: 50Hz — usable in real-time
  2. Cross-embodiment: pretrained on many robots, transfer works
  3. Largest community: 12k+ stars, many tutorials and examples

Getting started:

git clone https://github.com/physical-intelligence/openpi.git
cd openpi
pip install -e .

# Download the pretrained π0 base checkpoint (cached locally after the first call)
python -c "from openpi.shared import download; print(download.maybe_download('gs://openpi-assets/checkpoints/pi0_base'))"

# Inference example
python examples/inference_example.py \
  --checkpoint pi0-base \
  --image path/to/obs.jpg \
  --instruction "pick up the red cup"

Fine-tuning: openpi supports fine-tuning on LeRobot-format datasets or through a custom dataset loader.

Papers:


Stanford: HumanPlus (~847 stars)

Repo: MarkFzp/HumanPlus

What it is: Whole-body humanoid imitation learning from human motion capture. A person wears a mocap suit (21 markers) → data → train policy for Unitree H1.

What makes it unique: HumanPlus is the "collect data from the human" approach (like UMI) but for the whole body. Instead of teleoperation, you simply... perform the action yourself.

Pipeline:

1. Person wears mocap suit and performs task
2. OptiTrack captures 21-marker body + 6DoF wrist poses
3. Retargeting: human → H1 URDF (via motion retargeting code in repo)
4. Train: ACT (Action Chunking Transformer) on whole-body trajectories
5. Deploy: H1 runs learned policy with head camera + proprioception
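
In its simplest form, the retargeting in step 3 copies each human joint angle onto the matching robot joint and clamps it to the robot's limits. The sketch below shows only that idea. It is not the code in the HumanPlus repo, and the joint names and limits are made up for illustration; a real pipeline reads them from the URDF and also has to handle differing link lengths.

```python
# Hypothetical name mapping and joint limits (radians).
JOINT_MAP = {
    "l_shoulder_pitch": "left_shoulder_pitch_joint",
    "l_elbow": "left_elbow_joint",
}
LIMITS = {
    "left_shoulder_pitch_joint": (-2.8, 2.8),
    "left_elbow_joint": (-1.2, 2.6),
}

def retarget(human_angles):
    """Map human joint angles to robot joint targets, clamped to limits."""
    robot_targets = {}
    for human_name, robot_name in JOINT_MAP.items():
        lo, hi = LIMITS[robot_name]
        angle = human_angles[human_name]
        robot_targets[robot_name] = min(max(angle, lo), hi)
    return robot_targets

# The shoulder angle exceeds the robot's range and gets clamped to 2.8.
targets = retarget({"l_shoulder_pitch": 3.5, "l_elbow": 0.4})
```

Clamping keeps the robot safe but distorts the motion near the limits, which is one reason practical retargeting tends to optimize over the whole pose instead of joint by joint.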

Demonstrated tasks: cabinet opening, object manipulation, carrying box, folding shirt (whole body).

Hardware requirements:

  • Unitree H1 humanoid robot
  • OptiTrack mocap system (or equivalent)
  • GPU workstation (training)

Why it matters: HumanPlus proved that complex teleoperation isn't necessary — a mocap suit + retargeting is enough for a working whole-body policy. This approach influenced many labs that followed.

Paper: HumanPlus: Humanoid Shadowing and Imitation from Humans


UC San Diego + MIT: TeleVision (~1.3k stars)

Repo: OpenTeleVision/TeleVision

What it is: Immersive teleoperation for humanoids using Apple Vision Pro or Meta Quest 3. Operator wears HMD → sees robot perspective → controls with hands.

Architecture:

Apple Vision Pro / Meta Quest 3
    ↓ (hand tracking + head pose)
Retargeting (human → robot arm)
    ↓
ROS2 / custom SDK
    ↓
Humanoid upper body (arms + hands)
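
The retargeting box starts with a change of coordinate frame: the headset reports the wrist pose relative to the operator's head, while the robot needs it relative to its own base. Here is a sketch of that step with homogeneous transforms. The frame names and numbers are assumptions for illustration, not TeleVision's actual calibration.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def wrist_in_robot_frame(T_robot_head, T_head_wrist):
    """Chain the transforms: robot <- head <- wrist."""
    return T_robot_head @ T_head_wrist

# Assumed setup: head frame rotated 90 degrees about z, 1.2 m above the robot base.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
T_robot_head = make_transform(Rz90, [0.0, 0.0, 1.2])
# Wrist 0.3 m along the head frame's x axis, no extra rotation.
T_head_wrist = make_transform(np.eye(3), [0.3, 0.0, 0.0])

T_robot_wrist = wrist_in_robot_frame(T_robot_head, T_head_wrist)
wrist_position = T_robot_wrist[:3, 3]  # approximately [0.0, 0.3, 1.2]
```

The resulting wrist pose is what an inverse-kinematics solver would then turn into arm joint targets.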

Key feature: Binocular video stream from robot head camera → operator gets stereoscopic view — feels like "being inside the robot." Total latency ~80ms (VR rendering + network + robot).

Weakness: Only covers upper body (arms + hands). Legs must run a separate locomotion controller. Apple Vision Pro is expensive ($3,500).

When to use: When you need high-quality manipulation data collection with human-level dexterity and no mocap room. Combine with SONIC/GR00T-WBC to drive legs.

Paper: Open-TeleVision: Teleoperation with Immersive Active Visual Feedback


OpenHelix (~378 stars)

Repo: OpenHelix-Team/OpenHelix

What it is: A dual-system architecture for VLA manipulation: a slow "thinking system" for planning plus a fast "acting system" for execution, inspired by the System 1/System 2 distinction in cognitive science.

What differentiates it from openpi/GR00T:

  • Designed explicitly for robot arms (not focused on full humanoid)
  • Easier to deploy because no full humanoid hardware needed
  • Architecture suited to bimanual manipulation tasks

Supported robots: WidowX, Franka, and any robot with URDF.

When to use: OpenHelix sits at the intersection between manipulation-focused VLA and bimanual systems. Good for studying VLA architecture before scaling to humanoids.


OpenDriveLab: EgoHumanoid (~161 stars)

Repo: OpenDriveLab/EgoHumanoid

What it is: Framework for collecting data and training policies for loco-manipulation (walking + manipulation) from egocentric human demos — person wears a head-mounted camera, performs task, no mocap suit needed.

The most important point about EgoHumanoid: it reduces the hardware barrier to a single head-mounted (ego) camera. No mocap room, no VR headset.

Pipeline:

1. Person wears GoPro/RealSense on head and performs task
2. Video + pose estimation (ViTPose, etc.)
3. 3D pose lifting from 2D video
4. Retargeting to humanoid URDF
5. Train: loco-manipulation policy (locomotion + manipulation jointly)
6. Deploy: humanoid with head camera + proprioception
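
Lifting 2D keypoints to 3D (step 3) is typically done with a learned model when only RGB video is available. When the head camera also provides depth, as a RealSense does, a simpler alternative is pinhole back-projection. The sketch below shows that swapped-in technique for illustration only; it is not EgoHumanoid's method, and the intrinsics are made-up numbers.

```python
import numpy as np

def backproject(keypoints_uv, depth_m, fx, fy, cx, cy):
    """Lift pixel keypoints (u, v) with per-keypoint depth z to camera-frame XYZ.

    Pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z.
    """
    uv = np.asarray(keypoints_uv, dtype=float)
    z = np.asarray(depth_m, dtype=float)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Two keypoints: one at the image center, one 100 px to the right and 2 m away.
points = backproject([[320, 240], [420, 240]], [1.0, 2.0],
                     fx=500, fy=500, cx=320, cy=240)
```

The output is in the camera frame, so a head-pose estimate is still needed to express the keypoints in a fixed world frame before retargeting.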

Results from paper (RSS 2026): First demo of whole-body loco-manipulation on a real robot from ego demos — humanoid walks to a table, picks up object, and moves to another location.

Why to watch: If this pipeline works well at scale, the barrier for whole-body data collection will drop significantly — just a head-mounted camera and a person demonstrating.

Paper: EgoHumanoid: Embodied Whole-Body Loco-Manipulation from Egocentric Demonstrations (RSS 2026)


Summary comparison — US group

Repo          Main problem               Hardware needed      Entry barrier
openpi        VLA inference + fine-tune  GPU workstation      Low
Isaac-GR00T   VLA fine-tune + sim        GPU workstation      Medium
GR00T-WBC     Deploy WBC to real robot   Humanoid robot       High
HumanPlus     WBC data collection        Humanoid + mocap     High
TeleVision    Teleoperation data         Humanoid + HMD       Medium
OpenHelix     Bimanual VLA               Robot arm            Low
EgoHumanoid   Loco-manip data            Humanoid + head cam  Medium

Observation

The US group stands out for foundation model quality (openpi, GR00T) and research novelty (EgoHumanoid, HumanPlus). NVIDIA is particularly notable because they're building the full stack — from VLA model to WBC controller — and releasing everything under permissive licenses.

Next: Chinese repos — Unitree, THU, and the open community.




