VLA + WBC repos from the US: NVIDIA GR00T, openpi, HumanPlus, TeleVision

This is post 2 of the VLA + WBC repos landscape series, covering vision-language-action (VLA) models and whole-body control (WBC). This post takes a deep dive into US repositories: NVIDIA, Physical Intelligence, Stanford, MIT, and other open-source labs.

What defines this group: a strong focus on generalization (one model that works across multiple robots and tasks) and an open research culture (most labs release paper, code, and dataset together).

NVIDIA: Isaac-GR00T and GR00T-WholeBodyControl

NVIDIA has two separate repos for two different problems — don't confuse them.

Isaac-GR00T (~7.3k stars)

Repo: NVIDIA/Isaac-GR00T

What it is: A foundation model for robots that maps images plus a language instruction to actions. It is analogous to LLaVA, except that the output is a robot action instead of text. The model is called GR00T N1 (March 2025).

Architecture:

Observation: [wrist camera RGB] + [head camera RGB] + [text instruction]
Backbone: Eagle-2 vision-language model (NVIDIA)
Language: SmolLM2 language model (the LLM inside Eagle-2)
Action head: Diffusion Transformer trained with flow matching
Output: joint positions (delta) or end-effector pose
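To make the action head less abstract: a flow-matching head learns a velocity field and produces an action by integrating from random noise (t = 0) to an action (t = 1). The sketch below shows only that integration loop. It is not GR00T code: `toy_velocity` is a hand-written stand-in for the trained network, and `sample_action` is a hypothetical name. In a real model the network predicts the velocity from the observation and instruction, and the target is of course not known in advance.

```python
import random


def toy_velocity(action, t, target):
    """Stand-in for the learned network (assumption, not GR00T's model).

    For a straight-line path a_t = (1 - t) * noise + t * target,
    the ideal velocity at time t is (target - a_t) / (1 - t).
    """
    remaining = max(1.0 - t, 1e-6)  # avoid division by zero at t = 1
    return [(g - a) / remaining for a, g in zip(action, target)]


def sample_action(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise to an action."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


if __name__ == "__main__":
    # Three joint deltas as a toy "action".
    print(sample_action([0.1, -0.2, 0.3]))
```

Fewer integration steps mean faster inference at the cost of accuracy, which is one reason this family of action heads is attractive for robot control.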

Strengths: Pretrained on 1000+ tasks across many robots (Franka, UR5, G1, H1, GR1...). Fine-tuning with ~50-100 demos is usually enough for a new task.

Weaknesses: Inference is heavy and needs a GPU (A100/H100 recommended for training, RTX 4090 for inference). It is not yet production-ready for real-time deployment at under 100 ms per step.

Getting started:

git clone https://github.com/NVIDIA/Isaac-GR00T.git
cd Isaac-GR00T
pip install -e ".[dev]"

# Fine-tune with your demo data (LeRobot format)
# Script path and flags can change between releases; confirm them in the repo README
python scripts/finetune.py \
  --model_path nvidia/GR00T-N1-2B \
  --dataset_path path/to/your/lerobot_dataset \
  --output_dir ./finetuned_model

Data format: LeRobot (HuggingFace). If you already have UMI data, you'll need to convert it to LeRobot format first.
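To give a feel for what such a conversion involves, here is a minimal sketch that flattens one recorded episode into per-frame records. Everything about the input is an assumption: the `states`, `actions`, and `instruction` keys are a hypothetical episode layout, not UMI's actual one. The output keys (`observation.state`, `action`, `episode_index`, `frame_index`, `timestamp`) follow LeRobot naming conventions, but a real conversion also has to handle camera streams and dataset metadata, so use LeRobot's own tooling for the final write.

```python
def episode_to_frames(episode, episode_index, fps=30):
    """Flatten one episode into LeRobot-style per-frame dictionaries.

    `episode` is a hypothetical dict with equal-length "states" and
    "actions" lists plus a text "instruction".
    """
    states = episode["states"]
    actions = episode["actions"]
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")

    frames = []
    for i, (state, action) in enumerate(zip(states, actions)):
        frames.append({
            "episode_index": episode_index,
            "frame_index": i,
            "timestamp": i / fps,  # seconds since episode start
            "observation.state": list(state),
            "action": list(action),
            "task": episode["instruction"],
        })
    return frames


if __name__ == "__main__":
    demo = {
        "states": [[0.0, 0.0], [0.1, 0.0]],
        "actions": [[0.1, 0.0], [0.1, 0.1]],
        "instruction": "pick up the red cup",
    }
    for frame in episode_to_frames(demo, episode_index=0, fps=10):
        print(frame)
```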

Paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

GR00T-WholeBodyControl (~2.2k stars)

Repo: NVlabs/GR00T-WholeBodyControl

What it is: A WBC platform for deploying policies on real humanoid robots. It is not a VLA model; it is the middle layer between the VLA output and the robot joints.

Decoupled architecture:

VLA policy → [target wrist pose L/R + gripper] (upper body)
                    ↓
         GR00T-WBC ← GEAR (upper body RL controller)
                    ↓
         SONIC (loco-manipulation controller)
                    ↓
         Robot joints (30+ DoF)

Three components:

N1.5 / N1.6: VLA policy (runs at 6Hz)
GEAR: upper body controller (50Hz, RL-trained, from joint targets → torques)
SONIC: whole-body loco controller (200Hz, MPC + RL)

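Why decoupled? VLA inference takes roughly 150 ms per step (about 6 Hz), while the low-level robot control loop has a 5 ms period (200 Hz), so the two cannot run at the same rate. Decoupling lets each layer run at its own frequency, with the faster layers reusing the most recent command from the layer above.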
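A small scheduling sketch makes the rate mismatch concrete. It drives all three layers from a single 200 Hz base tick and counts how often each one fires, using the 6 / 50 / 200 Hz figures from the component list. The function names are hypothetical and the layers do no real work here; this only illustrates the timing, not how GR00T-WBC is implemented.

```python
def layer_fires(tick, layer_hz, base_hz):
    """True when a layer running at layer_hz is due on this base tick.

    Uses integer arithmetic so there is no floating-point drift.
    """
    if tick == 0:
        return True
    return (tick * layer_hz) // base_hz != ((tick - 1) * layer_hz) // base_hz


def run_multirate(duration_s, base_hz=200, rates_hz=None):
    """Count how many times each layer runs in `duration_s` seconds."""
    rates_hz = rates_hz or {"vla": 6, "upper_body": 50, "whole_body": 200}
    counts = {name: 0 for name in rates_hz}
    for tick in range(int(duration_s * base_hz)):
        for name, hz in rates_hz.items():
            if layer_fires(tick, hz, base_hz):
                counts[name] += 1  # a real layer would compute a command here
    return counts


if __name__ == "__main__":
    print(run_multirate(1.0))
```

In one second the whole-body layer runs 200 times while the VLA runs only 6 times, so each VLA output has to stay valid for roughly 33 low-level control steps.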

Supported robots: Fourier GR1, Unitree G1, Unitree H1 (and any robot with a URDF and enough DoF).

Paper: GR00T-WBC: Decoupled Whole-Body Control for Humanoid Locomotion and Manipulation

HOVER (~742 stars)

Repo: NVlabs/HOVER

What it is: A precursor to GR00T-WBC: a single unified controller for whole-body control. HOVER is often used as a baseline to compare against GR00T-WBC.

Physical Intelligence: openpi (~12.2k stars)

Repo: physical-intelligence/openpi

What it is: Open-source release of π0 and π0.5, among the strongest VLA models on many benchmarks at the time of writing (March 2025 release).

π0 architecture:

Base: PaliGemma (vision-language backbone from Google)
Action: flow-matching → continuous action
Frequency: 50Hz (fast enough for real robot)
Parameters: 3B
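One way to read the 50Hz figure: it is the rate at which actions are sent to the robot, not necessarily the number of forward passes per second. The arithmetic below assumes (this is not stated above) that the policy predicts a chunk of future actions per forward pass, as action-chunking policies do. `chunk_budget` is a hypothetical helper, and the chunk lengths in the example are illustrative values only.

```python
def chunk_budget(control_hz, chunk_len, executed_per_chunk=None):
    """Return how often the model must run and how long each run may take.

    control_hz:         rate at which actions are sent to the robot
    chunk_len:          number of actions predicted per forward pass
    executed_per_chunk: how many of those actions are executed before
                        re-planning (defaults to the whole chunk)
    """
    executed = chunk_len if executed_per_chunk is None else executed_per_chunk
    if not 0 < executed <= chunk_len:
        raise ValueError("executed_per_chunk must be in 1..chunk_len")
    return {
        "inference_hz": control_hz / executed,
        "budget_ms": 1000.0 * executed / control_hz,
    }


if __name__ == "__main__":
    # Execute the full 50-step chunk before re-planning.
    print(chunk_budget(control_hz=50, chunk_len=50))
    # Re-plan after every 10 executed actions.
    print(chunk_budget(control_hz=50, chunk_len=50, executed_per_chunk=10))
```

Under these assumptions the robot still receives a new action every 20 ms, while the model itself only needs to finish a forward pass within the budget shown.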

π0.5 improvements:

Added reasoning chain (chain-of-thought for robots)
Long-horizon task execution (multi-step)
Zero-shot cross-embodiment transfer

Why openpi is appealing:

Inference speed: 50Hz — usable in real-time
Cross-embodiment: pretrained on many robots, transfer works
Largest community: 12k+ stars, many tutorials and examples

Getting started:

git clone https://github.com/physical-intelligence/openpi.git
cd openpi
pip install -e .

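# Download pretrained π0 checkpoint
# Module, helper, and script names below may not match the current openpi API; confirm them in the repo README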
python -c "from openpi.models import pi0; pi0.download_checkpoint('pi0-base')"

# Inference example
python examples/inference_example.py \
  --checkpoint pi0-base \
  --image path/to/obs.jpg \
  --instruction "pick up the red cup"

Fine-tuning: openpi supports fine-tuning on data in LeRobot format or through a custom dataset loader.

Papers:

Stanford: HumanPlus (~847 stars)

Repo: MarkFzp/HumanPlus

What it is: Whole-body humanoid imitation learning from human motion capture. A person wears a mocap suit (21 markers) → data → train policy for Unitree H1.

What makes it unique: HumanPlus is the "collect data from the human" approach (like UMI) but for the whole body. Instead of teleoperation, you simply... perform the action yourself.

Pipeline:

1. Person wears mocap suit and performs task
2. OptiTrack captures 21-marker body + 6DoF wrist poses
3. Retargeting: human → H1 URDF (via motion retargeting code in repo)
4. Train: ACT (Action Chunking Transformer) on whole-body trajectories
5. Deploy: H1 runs learned policy with head camera + proprioception
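Step 3 is where most of the robot-specific work happens. As a rough illustration, the simplest form of retargeting maps each human joint angle to a robot joint, applies a sign or scale correction, and clips the result to the robot's joint limits. The sketch below is not the retargeting code from the repo; the joint names and limits are made-up placeholders, not values from the H1 URDF.

```python
def retarget_pose(human_angles, joint_map, joint_limits):
    """Map human joint angles (radians) to clipped robot joint targets.

    joint_map:    robot_joint -> (human_joint, scale); scale handles sign
                  flips or range differences between the two skeletons
    joint_limits: robot_joint -> (lower, upper) in radians
    """
    robot_angles = {}
    for robot_joint, (human_joint, scale) in joint_map.items():
        raw = scale * human_angles[human_joint]
        lower, upper = joint_limits[robot_joint]
        robot_angles[robot_joint] = min(max(raw, lower), upper)
    return robot_angles


if __name__ == "__main__":
    # All names and numbers below are placeholders for illustration.
    human = {"r_elbow": 2.0, "r_shoulder": -0.5}
    mapping = {
        "right_elbow_joint": ("r_elbow", 1.0),
        "right_shoulder_pitch_joint": ("r_shoulder", -1.0),
    }
    limits = {
        "right_elbow_joint": (-1.25, 1.5),
        "right_shoulder_pitch_joint": (-2.0, 2.0),
    }
    print(retarget_pose(human, mapping, limits))
```

Real pipelines usually go further (different limb proportions, self-collision, balance), but clipping to joint limits is the part that cannot be skipped before sending targets to hardware.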

Demonstrated tasks: cabinet opening, object manipulation, carrying box, folding shirt (whole body).

Hardware requirements:

Unitree H1 humanoid robot
OptiTrack mocap system (or equivalent)
GPU workstation (training)

Why it matters: HumanPlus proved that complex teleoperation isn't necessary — a mocap suit + retargeting is enough for a working whole-body policy. This approach influenced many labs that followed.

Paper: HumanPlus: Humanoid Shadowing and Imitation from Humans

UC San Diego + MIT: TeleVision (~1.3k stars)

Repo: OpenTeleVision/TeleVision

What it is: Immersive teleoperation for humanoids using Apple Vision Pro or Meta Quest 3. The operator wears a head-mounted display (HMD), sees the robot's perspective, and controls the robot with their hands.

Architecture:

Apple Vision Pro / Meta Quest 3
    ↓ (hand tracking + head pose)
Retargeting (human → robot arm)
    ↓
ROS2 / custom SDK
    ↓
Humanoid upper body (arms + hands)
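The retargeting stage in this diagram starts with a change of coordinate frame: wrist positions arrive in the headset's frame and have to be expressed in the robot's base frame before any inverse kinematics can run. Here is a minimal sketch of that step, assuming the two frames differ only by a yaw rotation, a translation, and an optional scale for arm length. The function names and the calibration values are hypothetical, not TeleVision's API.

```python
import math


def yaw_matrix(yaw_rad):
    """3x3 rotation about the vertical (z) axis."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_robot_frame(p_headset, rotation, translation, scale=1.0):
    """Express a headset-frame point in the robot base frame.

    scale roughly compensates for human vs. robot arm length.
    """
    rotated = [
        sum(rotation[i][j] * p_headset[j] for j in range(3))
        for i in range(3)
    ]
    return [scale * r + t for r, t in zip(rotated, translation)]


if __name__ == "__main__":
    # Placeholder calibration: robot base rotated 90 degrees from the
    # headset frame and offset 0.5 m vertically.
    wrist_in_headset = [1.0, 0.0, 0.0]
    print(to_robot_frame(wrist_in_headset, yaw_matrix(math.pi / 2), [0.0, 0.0, 0.5]))
```

A full implementation would transform orientation as well as position and then solve inverse kinematics for the arm; this only covers the position part.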

Key feature: A binocular video stream from the robot's head camera gives the operator a stereoscopic view, so it feels like "being inside the robot." Total latency is about 80 ms (VR rendering + network + robot).

Weakness: Only covers upper body (arms + hands). Legs must run a separate locomotion controller. Apple Vision Pro is expensive ($3,500).

When to use: When you need high-quality manipulation data collection with human-level dexterity and have no mocap room. Combine it with SONIC/GR00T-WBC to drive the legs.

Paper: Open-TeleVision: Teleoperation with Immersive Active Visual Feedback

OpenHelix (~378 stars)

Repo: OpenHelix-Team/OpenHelix

What it is: A dual-system architecture for VLA manipulation, built from two modules: a "thinking system" (slow, for planning) and an "acting system" (fast, for execution). The split is inspired by the System 1/System 2 distinction in cognitive science.
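The control flow of such a dual system fits in a few lines. In the toy sketch below, the slow module refreshes a "latent" goal every few steps and the fast module acts on every step using whatever latent is currently available. `slow_plan`, `fast_act`, and the 1-D position are illustrative stand-ins, not OpenHelix code; in the real system both modules are neural networks.

```python
def slow_plan(step, goal):
    """Toy 'thinking system': returns a goal latent (expensive in reality)."""
    return {"goal": goal, "planned_at": step}


def fast_act(position, latent, gain=0.5):
    """Toy 'acting system': a proportional step toward the latent goal."""
    return gain * (latent["goal"] - position)


def run_dual_system(num_steps, slow_every=5, goal=1.0):
    """Run the fast loop every step and the slow loop every `slow_every` steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    position = 0.0
    latent = None
    replans = 0
    for step in range(num_steps):
        if step % slow_every == 0:
            latent = slow_plan(step, goal)  # slow path, runs rarely
            replans += 1
        position += fast_act(position, latent)  # fast path, runs every step
    return position, replans, latent["planned_at"]


if __name__ == "__main__":
    print(run_dual_system(10))
```

The design choice to note is that the fast module never waits for the slow one: it keeps acting on a slightly stale plan, which is what keeps the control loop responsive.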

What differentiates it from openpi/GR00T:

Designed explicitly for robot arms (not focused on full humanoid)
Easier to deploy because no full humanoid hardware needed
Architecture suited to bimanual manipulation tasks

Supported robots: WidowX, Franka, and any robot with URDF.

When to use: OpenHelix sits at the intersection of manipulation-focused VLA and bimanual systems. It is a good starting point for studying VLA architecture before scaling to humanoids.

OpenDriveLab: EgoHumanoid (~161 stars)

Repo: OpenDriveLab/EgoHumanoid

What it is: A framework for collecting data and training policies for loco-manipulation (walking + manipulation) from egocentric human demos. A person wears a head-mounted camera and performs the task; no mocap suit is needed.

The most important point about EgoHumanoid: it reduces the hardware barrier to just a head-mounted (egocentric) camera. No mocap room, no VR headset.

Pipeline:

1. Person wears GoPro/RealSense on head and performs task
2. Video + pose estimation (ViTPose, etc.)
3. 3D pose lifting from 2D video
4. Retargeting to humanoid URDF
5. Train: loco-manipulation policy (locomotion + manipulation jointly)
6. Deploy: humanoid with head camera + proprioception
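For step 3, the simplest case is when a depth reading is available for each detected keypoint (for example from a RealSense): a 2D pixel can then be lifted to 3D with the pinhole camera model. The sketch below shows that back-projection. It is an illustration under that assumption, not the method from the EgoHumanoid repo; with a monocular camera such as a GoPro, lifting would need a learned model instead. The intrinsics in the example are made-up values.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def lift_keypoints(keypoints_px, depths_m, fx, fy, cx, cy):
    """Lift named 2D keypoints to 3D, skipping joints without valid depth."""
    lifted = {}
    for name, (u, v) in keypoints_px.items():
        depth = depths_m.get(name)
        if depth is None or depth <= 0.0:
            continue  # no usable depth reading for this joint
        lifted[name] = backproject(u, v, depth, fx, fy, cx, cy)
    return lifted


if __name__ == "__main__":
    keypoints = {"wrist": (620, 240), "elbow": (320, 240)}
    depths = {"wrist": 2.0, "elbow": 0.0}  # elbow has no valid depth
    print(lift_keypoints(keypoints, depths, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```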

Results from the paper (RSS 2026): Reported as the first demo of whole-body loco-manipulation on a real robot learned from egocentric demos. The humanoid walks to a table, picks up an object, and moves to another location.

Why to watch: If this pipeline works well at scale, the barrier for whole-body data collection will drop significantly — just a head-mounted camera and a person demonstrating.

Paper: EgoHumanoid: Embodied Whole-Body Loco-Manipulation from Egocentric Demonstrations (RSS 2026)

Summary comparison — US group

Repo	Main problem	Hardware needed	Entry barrier
openpi	VLA inference + fine-tune	GPU workstation	Low
Isaac-GR00T	VLA fine-tune + sim	GPU workstation	Medium
GR00T-WBC	Deploy WBC to real robot	Humanoid robot	High
HumanPlus	WBC data collection	Humanoid + mocap	High
TeleVision	Teleoperation data	Humanoid + HMD	Medium
OpenHelix	Bimanual VLA	Robot arm	Low
EgoHumanoid	Loco-manip data	Humanoid + head cam	Medium

Observation

The US group stands out for foundation model quality (openpi, GR00T) and research novelty (EgoHumanoid, HumanPlus). NVIDIA is particularly notable because they're building the full stack — from VLA model to WBC controller — and releasing everything under permissive licenses.

Next: Chinese repos — Unitree, THU, and the open community.

Nguyễn Anh Tuấn

Related Posts

VLA + WBC repos map 2025-2026: an overview of humanoid GitHub repos

VLA + WBC repos from China: Unitree, THU RDT-1B, and the open community

GR00T N1 + Unitree G1: decoupled WBC+VLA architecture from 6Hz to 500Hz
