VnRobo
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

LeVERB: Humanoid Whole-Body Control via Latent Vision-Language

UC Berkeley's LeVERB is the first hierarchical framework bridging VLA and humanoid whole-body control via a learned latent verb space, reporting 58.5% overall task success and zero-shot sim-to-real deployment on the Unitree G1.

Nguyễn Anh Tuấn · June 24, 2026 · 11 min read

Most Vision-Language-Action (VLA) systems today assume the robot already has a perfect low-level controller. Just give a command — "pick up the cup", "walk to the chair" — and the robot's body figures out the rest. But for full-body humanoid robots, every step requires dozens of joints coordinating simultaneously while maintaining balance, avoiding obstacles, and parsing natural language semantics. No one had fully solved this.

The EMBER Lab at UC Berkeley, in collaboration with CMU, Simon Fraser University, and NTNU (Norway), just published LeVERB — Latent Vision-Language-Encoded Robot Behavior — the first hierarchical framework to tackle this problem end-to-end on a real humanoid.

Paper: LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction — Haoru Xue, Xiaoyu Huang, Dantong Niu et al., arXiv 2506.13751, June 2025.

Authors: Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, Shankar Sastry — UC Berkeley, CMU, Simon Fraser University, NTNU.

The Core Problem: The Gap Between Language and Dynamics

Consider saying to a Unitree G1: "Walk to the red chair and sit down."

Simple for a human, enormously complex for a robot:

  • Must recognize the chair visually in 3D space
  • Orient its body and navigate to the precise location
  • Execute a sit-down motion — coordinating legs, torso, and arms simultaneously
  • All while maintaining balance and avoiding obstacles

Existing approaches tackle this at extremes:

Approach 1 — End-to-end VLA: A large model directly outputs joint angles. Upside: flexible with language. Downside: no awareness of physical dynamics; robot easily falls.

Approach 2 — Traditional WBC: A rigid controller with a fixed action vocabulary ("walk forward", "turn left", "sit"). Upside: dynamically stable. Downside: can't understand natural language or respond to visual context flexibly.

LeVERB bridges this gap with an elegant idea: learn a latent verb space — an automatically discovered action vocabulary — so the two system tiers can communicate without any hand-crafted motion primitives.

LeVERB Architecture: Dual-Process System

LeVERB draws inspiration from Daniel Kahneman's cognitive theory of "System 1 & System 2":

  • System 2 — LeVERB-VL (10 Hz): Slow thinking, processes language and vision
  • System 1 — LeVERB-A (50 Hz): Fast reaction, executes physical control

LeVERB data collection and training pipeline: 3 steps from motion capture to distillation — source: arXiv 2506.13751

LeVERB-VL: The Vision-Language Tier

Inputs:

  • Egocentric camera (robot head camera, 10 Hz)
  • 2–3 third-person cameras
  • Natural language text instruction

Architecture:

  • Vision encoder: SigLIP ViT-B/16 (frozen — not fine-tuned)
  • Fusion Transformer: combines features from multiple cameras and text instruction
  • Output: latent vector $z_t$ encoding "motion semantics" — the "latent verb"

LeVERB-VL does not output joint angles directly. It outputs an abstract vector in a learned latent space, which is handed to LeVERB-A for execution.

LeVERB-A: The Dynamics Control Tier

Inputs:

  • Proprioceptive state: joint positions, IMU readings, angular velocities
  • Latent vector $z_t$ from LeVERB-VL

Architecture:

  • 2-layer Transformer
  • Output: joint position targets @ 50 Hz

Deployment: LeVERB-VL runs on an external RTX 4090 @ 10 Hz; LeVERB-A runs as ONNX/C++ onboard the robot @ 50 Hz. Each latent verb is reused across 5 control steps (50 Hz / 10 Hz = 5).
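The two-rate loop described above can be sketched as a zero-order hold: the latent verb is refreshed at 10 Hz and held constant for the five 50 Hz control steps in between. This is a minimal illustration only; `fake_vl` and `fake_controller` are hypothetical stand-ins, not the released LeVERB models or their interfaces.

```python
VL_HZ = 10      # LeVERB-VL update rate (from the article)
CTRL_HZ = 50    # LeVERB-A control rate (from the article)
STEPS_PER_LATENT = CTRL_HZ // VL_HZ  # 5 control steps per latent verb


def fake_vl(tick: int) -> float:
    """Hypothetical stand-in for LeVERB-VL: returns a scalar 'latent'."""
    return float(tick)


def fake_controller(latent: float, proprio: float) -> float:
    """Hypothetical stand-in for LeVERB-A: returns one joint target."""
    return 0.1 * latent + proprio


def run(num_ctrl_steps: int) -> list:
    """Refresh the latent every STEPS_PER_LATENT steps and hold it in between."""
    log = []
    latent = 0.0
    for step in range(num_ctrl_steps):
        if step % STEPS_PER_LATENT == 0:
            latent = fake_vl(step // STEPS_PER_LATENT)
        target = fake_controller(latent, proprio=0.0)
        log.append((step, latent, target))
    return log


log = run(10)
for step, latent, target in log:
    print(step, latent, round(target, 2))
```

In a real deployment the two tiers run on separate machines, so the hold would be implemented by the controller reusing the most recently received latent rather than by a shared loop counter.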

CVAE: The Bridge Between Tiers

The communication mechanism between the two tiers is a Conditional VAE (CVAE) with a residual latent design:

$$z_t = \text{mean}(z_{encoder}) + \text{residual}(z_{VL})$$

In practice:

  • During training: The CVAE encoder receives the ground-truth trajectory → encodes it into $z_t$, so LeVERB-A has accurate motion context
  • During inference: Only LeVERB-VL predicts $z_t$ from vision + language — no ground-truth trajectory needed

The residual design has a clear purpose: LeVERB-VL focuses on semantics (where to go, what to do), while the CVAE encoder captures motion details (how exactly to move). The ablations later in this post quantify each part: removing the kinematics encoder lowers overall success from 58.5% to 53.0%, while removing latent sampling in the low-level policy drops it to 6.5%.
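The training/inference asymmetry can be sketched as a function that sums a vision-language term with an optional trajectory-encoder term. This is a simplified illustration under stated assumptions: the function name, the list-of-floats representation, and the Gaussian sampling step are hypothetical, and the paper's exact parameterization of the two terms may differ.

```python
import random


def compose_latent(vl_term, encoder_term=None, std=None, rng=None):
    """Sum the VL term and, when available, the trajectory-encoder term.

    vl_term:      floats predicted from vision + language.
    encoder_term: floats from the ground-truth trajectory encoder;
                  None at inference, where no trajectory exists.
    std:          optional per-dimension standard deviation for sampling.
    """
    if encoder_term is None:
        mean = list(vl_term)
    else:
        mean = [v + e for v, e in zip(vl_term, encoder_term)]
    if std is None:
        return mean
    rng = rng or random.Random(0)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Training: both terms are available.
z_train = compose_latent([0.2, -0.1], encoder_term=[0.05, 0.3])

# Inference: only the vision-language term exists.
z_infer = compose_latent([0.2, -0.1])
```

The point of the sketch is only that the same downstream policy consumes `z_train` and `z_infer`, so the vision-language term alone has to be a usable latent at deployment time.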

Data Synthesis Pipeline

The biggest challenge in training humanoid VLA is the lack of real robot data. LeVERB sidesteps this entirely using a fully synthetic data generation pipeline.

LeVERB-Bench environments: hundreds of texture options, multi-angle cameras, and 10 diverse task categories — source: arXiv 2506.13751

Step 1: Collect Kinematic Motions from MoCap

  • Source: AMASS dataset (human motion capture)
  • Process: retarget human motions → Unitree G1 via motion retargeting
  • Result: 154 kinematic trajectories, each a complete motion clip

Step 2: Procedural Randomization × 100

Each of the 154 trajectories is randomized 100 times across:

Level | What's randomized
Scene-level | Floor/wall textures, lighting, material properties
Object-level | Chair colors, sizes, physical properties
Placement | Random object positions with automatic semantic labeling
Multi-view | 3–4 cameras rendered simultaneously (egocentric + third-person)
Mirroring | Left/right flip for diversity without additional capture
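A procedural randomizer along these lines could look like the sketch below. It is not the authors' pipeline (which renders in IsaacSim); the option pools, value ranges, and the `SceneVariant` fields are hypothetical and chosen only to mirror the rows of the table.

```python
import random
from dataclasses import dataclass

# Hypothetical option pools; the real pipeline draws from hundreds of textures.
FLOOR_TEXTURES = ["wood", "tile", "concrete"]
CHAIR_COLORS = ["red", "brown", "blue"]


@dataclass(frozen=True)
class SceneVariant:
    floor_texture: str      # scene-level
    light_intensity: float  # scene-level
    chair_color: str        # object-level
    chair_xy: tuple         # placement
    num_cameras: int        # multi-view (egocentric + third-person)
    mirrored: bool          # left/right flip


def sample_variant(rng: random.Random) -> SceneVariant:
    """Draw one randomized scene configuration."""
    return SceneVariant(
        floor_texture=rng.choice(FLOOR_TEXTURES),
        light_intensity=rng.uniform(0.5, 1.5),
        chair_color=rng.choice(CHAIR_COLORS),
        chair_xy=(rng.uniform(-2.0, 2.0), rng.uniform(1.0, 4.0)),
        num_cameras=rng.choice([3, 4]),
        mirrored=rng.random() < 0.5,
    )


def randomize(num_trajectories: int, variants_each: int, seed: int = 0) -> dict:
    """Map each trajectory id to its list of randomized variants."""
    rng = random.Random(seed)
    return {
        traj_id: [sample_variant(rng) for _ in range(variants_each)]
        for traj_id in range(num_trajectories)
    }


dataset = randomize(num_trajectories=154, variants_each=100)
```

Seeding the generator keeps the variant set reproducible, which matters when the same scenes are reused for both training and benchmark evaluation.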

Result: 17.1 hours of photorealistic video rendered in IsaacSim with ray-tracing. Plus 2.7 hours of language-only trajectories (no camera) for robustness to missing visual input.
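As a sanity check on the scale, the figures above imply roughly four seconds of video per randomized clip. The calculation assumes that all 154 × 100 variants are counted in the 17.1 hours, which the article implies but does not state explicitly.

```python
NUM_TRAJECTORIES = 154
VARIANTS_PER_TRAJECTORY = 100
VISION_HOURS = 17.1         # rendered vision-language data
LANGUAGE_ONLY_HOURS = 2.7   # trajectories without camera input

num_clips = NUM_TRAJECTORIES * VARIANTS_PER_TRAJECTORY
avg_clip_seconds = VISION_HOURS * 3600 / num_clips
total_hours = VISION_HOURS + LANGUAGE_ONLY_HOURS

print(f"clips: {num_clips}")
print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"total data: {total_hours:.1f} h")
```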

Step 3: Annotate with Natural Language

Each trajectory gets diverse text instruction annotations:

  • "Go to the brown chair and sit down"
  • "Walk straight and stop at the red marker"
  • "Turn left 90 degrees and approach the table"

No rigid templates — paraphrased instructions so the model learns semantics, not string patterns.

LeVERB-Bench: 154 Tasks, 10 Categories

The dataset is organized into a structured benchmark:

  • 154 vision-language tasks with clear success criteria
  • 10 categories: navigation, locomotion, sit-down, reaching, and combinations
  • Sim-to-real ready: test in sim → deploy on real robot without additional data

Public dataset on Hugging Face: ember-lab-berkeley/LeVERB-Bench-Dataset

Training Procedure: 3 Phases

Phase 1: Train LeVERB-VL

Goal: Learn a latent verb space from synthetic kinematic data.

Loss function with 3 components:

  1. Reconstruction loss (MSE): Ensures the latent verb encodes enough trajectory information for LeVERB-A to reconstruct the motion
  2. KL divergence: Regularizes the latent space distribution — standard VAE component
  3. Adversarial GRL (Gradient Reversal Layer): The most critical component — aligns distribution between vision-language data and language-only data
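The three components can be combined as a weighted sum, sketched below in plain Python. The weights `beta` and `adv_weight`, the standard-normal prior in the KL term, and the binary cross-entropy form of the discriminator loss are assumptions made for illustration; the article does not give the exact formulation.

```python
import math


def mse(pred, target):
    """Mean squared reconstruction error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def binary_cross_entropy(prob, label):
    """Discriminator loss for predicting 'has vision' (1) vs 'language only' (0)."""
    eps = 1e-12
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def total_loss(pred, target, mean, log_var, disc_prob, has_vision,
               beta=0.01, adv_weight=0.1):
    recon = mse(pred, target)
    kl = kl_to_standard_normal(mean, log_var)
    adv = binary_cross_entropy(disc_prob, 1.0 if has_vision else 0.0)
    # The adversarial term is added with a plus sign: with a gradient
    # reversal layer, the sign flip happens in the backward pass instead.
    return recon + beta * kl + adv_weight * adv


loss = total_loss(
    pred=[0.1, 0.4], target=[0.0, 0.5],
    mean=[0.2, -0.1], log_var=[0.0, 0.0],
    disc_prob=0.5, has_vision=True,
)
```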

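Why GRL matters so much: The training set mixes clips with camera input and language-only clips. If a classifier can tell the two apart from the latent, then the latent verbs for the same instruction differ depending on whether cameras were present, and the low-level policy sees two mismatched distributions. The GRL reverses the gradient of an adversarial modality classifier, pushing LeVERB-VL toward latents from which the modality cannot be predicted, so the latent verb carries the instruction's semantics regardless of which inputs were available.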

The ablation confirms this: remove GRL → success rate drops from 58.5% to 33.0%.
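The mechanics of a gradient reversal layer are simple enough to show without a deep learning framework: it is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The class below is a hand-written sketch; the scaling factor `lam` and the example gradient values are hypothetical.

```python
class GradientReversal:
    """Identity going forward; multiplies gradients by -lam going backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x):
        # The discriminator sees the features unchanged.
        return list(x)

    def backward(self, grad_output):
        # The encoder receives the negated gradient, so a step that helps
        # the discriminator becomes a step that hurts it.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [0.3, -1.2]
passed = grl.forward(features)
grad_from_discriminator = [0.8, -0.4]  # hypothetical dLoss/d(passed)
grad_to_encoder = grl.backward(grad_from_discriminator)
```

In an autodiff framework the same idea is usually written as a custom operation with a user-defined backward rule, so that the discriminator and the encoder can be trained with a single loss.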

Phase 2: Train Teacher WBC Policies

Goal: Create specialized RL teacher policies for each motion category.

  • Algorithm: PPO (Proximal Policy Optimization)
  • Input: privileged observations (full proprioception + reference trajectory — unrealistic in deployment)
  • Reward: motion_tracking_accuracy + λ₁·smoothness + λ₂·joint_limit_cost
  • Multiple teachers, each specializing in one task group (navigation, sitting, reaching...)

Teachers achieve high performance because they receive full information, but can't be deployed directly (they need privileged observations).

Phase 3: Distill LeVERB-A

Goal: Student policy learns from teachers using only real deployment inputs.

  • Algorithm: DAgger (Dataset Aggregation) — better than pure behavior cloning because it continuously rolls out and collects new data at the student's distribution
  • Student input: real proprioception + latent verb $z_t$
  • Critical trick: During training, sample $z_t$ from the full CVAE distribution (not just the mean) so the policy learns multi-modal behavior

Using only the CVAE mean (no sampling): success rate drops to 6.5% — catastrophic failure. This shows the model must not "mode collapse" to a single way of performing each task.
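A toy version of the two ideas in this phase is sketched below: a DAgger loop in which the student's own rollouts are labeled by a teacher, and a helper that samples the latent instead of taking its mean. The 1-D dynamics, the linear teacher, and all function names are hypothetical; the real student is a Transformer policy trained in simulation.

```python
import random


def teacher_action(x: float) -> float:
    """Hypothetical privileged teacher: a simple stabilizing feedback law."""
    return -0.5 * x


def fit_linear(data):
    """Least-squares fit of a = w * x through the origin."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, _ in data)
    return num / den


def dagger(iterations: int = 5, horizon: int = 20, seed: int = 0) -> float:
    rng = random.Random(seed)
    w = 0.0    # student starts with a do-nothing policy
    data = []  # aggregated (state, teacher label) pairs
    for _ in range(iterations):
        x = rng.uniform(1.0, 2.0)
        for _ in range(horizon):
            data.append((x, teacher_action(x)))   # teacher labels the state
            x = x + w * x + rng.gauss(0.0, 0.05)  # but the student drives
        w = fit_linear(data)                      # refit on all data so far
    return w


def sample_latent(mean, std, rng):
    """Draw z ~ N(mean, std^2) per dimension instead of using the mean alone."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


student_gain = dagger()
z = sample_latent([0.1, -0.3], [0.2, 0.2], random.Random(0))
```

The key DAgger property is visible in the inner loop: states come from the student's rollout while labels come from the teacher, so the dataset covers the states the student actually visits.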

Benchmark Results

Full Ablation Table

Configuration | Overall Success Rate
LeVERB (Full) | 58.5%
No Discriminator (GRL) | 33.0%
No Kinematics Encoder | 53.0%
End-to-end VLA (no WBC) | 25.5%
No Low-level Sampling | 6.5%

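The full LeVERB system more than doubles the end-to-end VLA row (58.5% vs. 25.5%, about 2.3×). The 7.8× headline figure from the paper is stated relative to a naive hierarchical whole-body VLA implementation, not to this row. Either way, the result reflects a fundamental insight: language and physics operate at different abstraction levels and need an intelligent bridge between them.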
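The ratios implied by the ablation table can be checked directly. The snippet uses only the success rates listed in the table.

```python
ABLATION = {
    "LeVERB (Full)": 58.5,
    "No Discriminator (GRL)": 33.0,
    "No Kinematics Encoder": 53.0,
    "End-to-end VLA (no WBC)": 25.5,
    "No Low-level Sampling": 6.5,
}

full = ABLATION["LeVERB (Full)"]

# Ratio of the full system's success rate to each configuration's rate.
ratios = {name: round(full / rate, 2) for name, rate in ABLATION.items()}

for name, ratio in ratios.items():
    print(f"{name:26s} full / this = {ratio:.2f}x")
```

Read this way, the discriminator and low-level sampling are the two components whose removal hurts most, while dropping the kinematics encoder costs about 10% in relative terms.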

Per-Task Breakdown

Task | Success Rate | Notes
Simple navigation (target ahead) | 80% | LeVERB excels here
Complex navigation (target behind) | 30% | Requires 180° turn
Cluttered environments | 25–50% | Varies with obstacle density
Visual sit-down on chair | 5% | Current bottleneck

80% on simple navigation is a strong result for zero-shot transfer from simulation. The 5% on chair sit-down marks the clearest current limitation, plausibly because sitting requires precise body positioning relative to the chair.

LeVERB on real Unitree G1: responding to language variations (top) and visual spatial reasoning for chairs (bottom) — source: arXiv 2506.13751

Real-World Deployment on Unitree G1

Hardware Setup

[RTX 4090 External GPU]
     │  10 Hz
     │  LeVERB-VL (vision + language → latent verb)
     ▼
[Unitree G1 Robot]
  └─ ONNX Runtime (C++)
     │  50 Hz
     │  LeVERB-A (latent verb + proprioception → joint positions)
     ▼
  [Actuators] → full-body motion

Zero-Shot Sim-to-Real Transfer

All training happens in simulation — no real robot data needed. LeVERB runs directly on the real G1 without any fine-tuning.

Three factors enable this:

  1. Aggressive domain randomization: Diverse textures, lighting, objects → model doesn't overfit to a specific visual environment
  2. Separation of concerns: LeVERB-VL handles semantics (robust to visual domain gap), LeVERB-A handles dynamics (trained in well-calibrated physics simulator)
  3. Appropriate control frequency: 50 Hz is sufficient for the G1's real dynamics

Robustness to Natural Language Variations

Tested with diverse phrasings for the same task:

  • "Go to the chair" / "Walk toward the seat" / "Approach the sitting furniture"
  • "Turn left" / "Rotate to the left" / "Face the left direction"

All work — the robot understands semantics, not literal strings.

Comparison with Related Work

Method | WBC | Vision | Language | Sim-to-real
LeVERB | ✅ Latent | ✅ Multi-cam | ✅ Natural | ✅ Zero-shot
WholebodyVLA (ICLR 2026) | ✅ Unified | ✅ | ✅ | Partial
HEX-VLA | ✅ | ✅ | ✅ | Partial
DREAM-Chunk | Partial | ✅ | ✅ | ✅
End-to-end VLA | ❌ | ✅ | ✅ | Hard

LeVERB stands out as the first paper to create a comprehensive benchmark (150+ tasks, 10 categories) for vision-language humanoid WBC, combined with zero-shot sim-to-real deployment.

Current Limitations and Future Directions

Current Bottlenecks

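1. LeVERB-VL latency: A 100 ms update period (10 Hz) is slow for tasks requiring instant reaction. If an obstacle suddenly appears, the latent verb cannot reflect it until the next LeVERB-VL update, up to 100 ms later; in the meantime LeVERB-A keeps executing the previous latent.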

2. No manipulation yet: LeVERB currently covers loco-navigation and locomotion. No grasping, pushing, or dexterous manipulation. The next step needs to extend the latent verb space to arm/hand tasks.

3. Data scale is small: 154 tasks from 154 MoCap trajectories. Need to scale to thousands of tasks to cover long-tail behaviors.

4. External GPU dependency: RTX 4090 as external compute limits portability. Need to optimize LeVERB-VL for Jetson Orin or onboard NPU.

Promising Extensions

  • Manipulation: Add arm/hand tasks to the latent verb space (grasping, pushing, inserting)
  • Scale: Combine with internet human video data for pre-training
  • Efficiency: Quantize LeVERB-VL → Jetson Orin NX standalone deployment
  • Memory: Add temporal context to handle long-horizon tasks

Why This Matters

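LeVERB isn't just a lab result: it points toward how humanoid robots could be commanded in real environments. The scenarios below are aspirational, since they involve manipulation that the current system does not yet support: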

  • Factory logistics: "Move the red box to shelf 3" → robot navigates, grasps, places precisely
  • Elderly assistance: "Bring me the chair from the corner" → robot understands and acts
  • Research labs: "Set up the workstation in configuration A" → robot follows description

If manipulation support is added in follow-up work, LeVERB could become a far more complete framework for humanoid service robots.

Conclusion

LeVERB solves one of robotics' hardest problems: how to let natural language command a full humanoid body in the physical world without a hand-crafted action vocabulary.

The answer, an automatically learned latent verb space connecting the semantic and dynamics tiers, is both theoretically elegant and practically effective. The evidence is 58.5% overall success versus 25.5% for the end-to-end VLA baseline, and 80% success on simple navigation tasks.

If you're working on whole-body VLA, sim-to-real transfer, or hierarchical robot control, this paper is essential reading.


Related Posts

  • VLA-JEPA Guide: Enhancing VLA with V-JEPA2 Latent World Models on LeRobot
  • DREAM-Chunk: Reactive Action Chunking for VLA with Latent World Models
  • HEX-VLA: Cross-Embodiment Whole-Body VLA for Humanoid Manipulation

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
