LeVERB: Humanoid Whole-Body Control via Latent Vision-Language

Most Vision-Language-Action (VLA) systems today assume the robot already has a perfect low-level controller. Just give a command ("pick up the cup", "walk to the chair") and the robot's body figures out the rest. But for a full-body humanoid, every step requires dozens of joints to coordinate while the robot keeps its balance, avoids obstacles, and interprets the instruction it was given. That combination had remained an open problem.

The EMBER Lab at UC Berkeley, in collaboration with CMU, Simon Fraser University, and NTNU (Norway), just published LeVERB — Latent Vision-Language-Encoded Robot Behavior — the first hierarchical framework to tackle this problem end-to-end on a real humanoid.

Paper: LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction — Haoru Xue, Xiaoyu Huang, Dantong Niu et al., arXiv 2506.13751, June 2025.

Authors: Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, Shankar Sastry — UC Berkeley, CMU, Simon Fraser University, NTNU.

The Core Problem: The Gap Between Language and Dynamics

Consider saying to a Unitree G1: "Walk to the red chair and sit down."

Simple for a human, enormously complex for a robot:

Must recognize the chair visually in 3D space
Orient its body and navigate to the precise location
Execute a sit-down motion — coordinating legs, torso, and arms simultaneously
All while maintaining balance and avoiding obstacles

Existing approaches tackle this at extremes:

Approach 1 — End-to-end VLA: A large model directly outputs joint angles. Upside: flexible with language. Downside: no awareness of physical dynamics; robot easily falls.

Approach 2 — Traditional WBC: A rigid controller with a fixed action vocabulary ("walk forward", "turn left", "sit"). Upside: dynamically stable. Downside: can't understand natural language or respond to visual context flexibly.

LeVERB bridges this gap with an elegant idea: learn a latent verb space — an automatically discovered action vocabulary — so the two system tiers can communicate without any hand-crafted motion primitives.

LeVERB Architecture: Dual-Process System

LeVERB draws inspiration from Daniel Kahneman's cognitive theory of "System 1 & System 2":

System 2 — LeVERB-VL (10 Hz): Slow thinking, processes language and vision
System 1 — LeVERB-A (50 Hz): Fast reaction, executes physical control

LeVERB data collection and training pipeline: 3 steps from motion capture to distillation — source: arXiv 2506.13751

LeVERB-VL: The Vision-Language Tier

Inputs:

Egocentric camera (robot head camera, 10 Hz)
2–3 third-person cameras
Natural language text instruction

Architecture:

Vision encoder: SigLIP ViT-B/16 (frozen — not fine-tuned)
Fusion Transformer: combines features from multiple cameras and text instruction
Output: latent vector $z_t$ encoding "motion semantics" — the "latent verb"

LeVERB-VL does not output joint angles directly. It outputs an abstract vector in a learned latent space, which is handed to LeVERB-A for execution.

LeVERB-A: The Dynamics Control Tier

Inputs:

Proprioceptive state: joint positions, IMU readings, angular velocities
Latent vector $z_t$ from LeVERB-VL

Architecture:

2-layer Transformer
Output: joint position targets @ 50 Hz

Deployment: LeVERB-VL runs on an external RTX 4090 @ 10 Hz; LeVERB-A runs as ONNX/C++ onboard the robot @ 50 Hz. Each latent verb is reused across 5 control steps (50 Hz / 10 Hz = 5).
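
The rate split above can be sketched as a single loop in which the slow tier refreshes the latent every fifth control tick. This is a simplified, synchronous illustration: the real system runs the two tiers on separate machines, and every name here (`Proprio`, `run_dual_rate`, the dummy policies) is hypothetical rather than taken from the LeVERB code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

CONTROL_HZ = 50  # LeVERB-A rate, from the article
VL_HZ = 10       # LeVERB-VL rate, from the article
STEPS_PER_LATENT = CONTROL_HZ // VL_HZ  # 5 control steps per latent verb


@dataclass
class Proprio:
    """Hypothetical container for proprioceptive readings."""
    joint_pos: List[float]


def run_dual_rate(
    vl_policy: Callable[[int], Sequence[float]],
    action_policy: Callable[[Proprio, Sequence[float]], List[float]],
    read_proprio: Callable[[], Proprio],
    send_targets: Callable[[List[float]], None],
    num_control_steps: int,
) -> List[int]:
    """Run the fast loop and refresh the latent every STEPS_PER_LATENT ticks.

    Returns the control-step indices at which the latent was refreshed.
    """
    refresh_ticks: List[int] = []
    latent: Sequence[float] = ()
    for tick in range(num_control_steps):
        if tick % STEPS_PER_LATENT == 0:
            latent = vl_policy(tick)  # slow tier: vision + language -> z_t
            refresh_ticks.append(tick)
        # Fast tier: the same latent is reused until the next refresh.
        send_targets(action_policy(read_proprio(), latent))
    return refresh_ticks


# Stand-in components so the sketch runs without a robot or a model.
def dummy_vl(tick: int) -> Sequence[float]:
    return [float(tick)]


def dummy_action(proprio: Proprio, z: Sequence[float]) -> List[float]:
    return list(proprio.joint_pos)


def dummy_proprio() -> Proprio:
    return Proprio(joint_pos=[0.0, 0.0])


sent: List[List[float]] = []
refreshes = run_dual_rate(dummy_vl, dummy_action, dummy_proprio, sent.append, 20)
print(refreshes)  # [0, 5, 10, 15]
```

Twenty control ticks cover 0.4 s of motion, during which the latent is refreshed four times and joint targets are sent twenty times.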

CVAE: The Bridge Between Tiers

The communication mechanism between the two tiers is a Conditional VAE (CVAE) with a residual latent design:

$$z_t = \text{mean}(z_{VL}) + \text{residual}(z_{encoder})$$

In practice:

During training: The CVAE encoder receives the ground-truth trajectory → encodes it into $z_t$, so LeVERB-A has accurate motion context
During inference: Only LeVERB-VL predicts $z_t$ from vision + language — no ground-truth trajectory needed

The residual design has a clear purpose: LeVERB-VL focuses on semantics (where to go, what to do), while the CVAE encoder captures motion details (how exactly to move). The ablations reflect this split: dropping the kinematics encoder lowers the overall success rate from 58.5% to 53.0%, and feeding LeVERB-A only the latent mean instead of samples lowers it to 6.5%.
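
One way to read the training/inference split is as a function that always starts from the vision-language prediction and optionally adds a trajectory-encoder term. The sketch below assumes the latent mean is a plain vector sum; the function name and the toy numbers are made up for illustration and are not from the paper's code.

```python
from typing import List, Optional


def compose_latent_mean(
    vl_mean: List[float],
    encoder_residual: Optional[List[float]] = None,
) -> List[float]:
    """Combine the VL prediction with an optional trajectory-encoder residual.

    Training: pass encoder_residual, computed from the ground-truth trajectory.
    Inference: pass only vl_mean; no ground-truth trajectory is involved.
    """
    if encoder_residual is None:
        return list(vl_mean)
    if len(encoder_residual) != len(vl_mean):
        raise ValueError("latent dimensions must match")
    return [m + r for m, r in zip(vl_mean, encoder_residual)]


vl_mean = [0.5, -1.0]
z_train = compose_latent_mean(vl_mean, encoder_residual=[0.25, 0.5])
z_infer = compose_latent_mean(vl_mean)
print(z_train)  # [0.75, -0.5]
print(z_infer)  # [0.5, -1.0]
```

Because the inference path is the same function with the residual left out, the downstream policy receives latents in the same space in both settings.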

Data Synthesis Pipeline

The biggest challenge in training humanoid VLA is the lack of real robot data. LeVERB sidesteps this entirely using a fully synthetic data generation pipeline.

LeVERB-Bench environments: hundreds of texture options, multi-angle cameras, and 10 diverse task categories — source: arXiv 2506.13751

Step 1: Collect Kinematic Motions from MoCap

Source: AMASS dataset (human motion capture)
Process: retarget human motions → Unitree G1 via motion retargeting
Result: 154 kinematic trajectories, each a complete motion clip

Step 2: Procedural Randomization × 100

Each of the 154 trajectories is randomized 100 times across:

Level	What's randomized
Scene-level	Floor/wall textures, lighting, material properties
Object-level	Chair colors, sizes, physical properties
Placement	Random object positions with automatic semantic labeling
Multi-view	3–4 cameras rendered simultaneously (egocentric + third-person)
Mirroring	Left/right flip for diversity without additional capture

Result: 17.1 hours of photorealistic video rendered in IsaacSim with ray-tracing. Plus 2.7 hours of language-only trajectories (no camera) for robustness to missing visual input.
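
To get a feel for the scale, the sketch below draws one randomized configuration per rollout and derives the average clip length from the numbers above. It assumes every one of the 154 × 100 rollouts yields exactly one rendered clip, which the article does not state outright. The option pools, field names, and the 50% mirroring probability are placeholders.

```python
import random

NUM_TRAJECTORIES = 154         # from the article
RANDOMIZATIONS_PER_TRAJ = 100  # from the article
RENDERED_HOURS = 17.1          # from the article

# Hypothetical option pools; the real asset lists are not given in the article.
FLOOR_TEXTURES = ["wood", "tile", "carpet"]
CHAIR_COLORS = ["red", "brown", "blue"]


def sample_episode(traj_id: int, rng: random.Random) -> dict:
    """Draw one randomized scene configuration for a given trajectory."""
    return {
        "traj_id": traj_id,
        "floor_texture": rng.choice(FLOOR_TEXTURES),
        "chair_color": rng.choice(CHAIR_COLORS),
        "num_third_person_cams": rng.randint(2, 3),  # plus one egocentric camera
        "mirrored": rng.random() < 0.5,              # assumed probability
    }


rng = random.Random(0)
episodes = [
    sample_episode(traj_id, rng)
    for traj_id in range(NUM_TRAJECTORIES)
    for _ in range(RANDOMIZATIONS_PER_TRAJ)
]

num_episodes = len(episodes)
avg_clip_seconds = RENDERED_HOURS * 3600 / num_episodes
print(num_episodes)                # 15400
print(round(avg_clip_seconds, 1))  # 4.0
```

Under that assumption the dataset consists of roughly four-second clips, which fits short motion clips such as a turn or a sit-down.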

Step 3: Annotate with Natural Language

Each trajectory gets diverse text instruction annotations:

"Go to the brown chair and sit down"
"Walk straight and stop at the red marker"
"Turn left 90 degrees and approach the table"

No rigid templates — paraphrased instructions so the model learns semantics, not string patterns.

LeVERB-Bench: 154 Tasks, 10 Categories

The dataset is organized into a structured benchmark:

154 vision-language tasks with clear success criteria
10 categories: navigation, locomotion, sit-down, reaching, and combinations
Sim-to-real ready: test in sim → deploy on real robot without additional data

Public dataset on Hugging Face: ember-lab-berkeley/LeVERB-Bench-Dataset

Training Procedure: 3 Phases

Phase 1: Train LeVERB-VL

Goal: Learn a latent verb space from synthetic kinematic data.

Loss function with 3 components:

Reconstruction loss (MSE): Ensures the latent verb encodes enough trajectory information for LeVERB-A to reconstruct the motion
KL divergence: Regularizes the latent space distribution — standard VAE component
Adversarial GRL (Gradient Reversal Layer): The most critical component — aligns distribution between vision-language data and language-only data
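
Written as a single objective, the three terms above might be combined as follows. The weights $\beta$ and $\lambda$ and the symbol names are assumptions for illustration; the article does not give the exact weighting.

$$\mathcal{L}_{VL} = \mathcal{L}_{rec} + \beta\,\mathcal{L}_{KL} + \lambda\,\mathcal{L}_{adv}$$

Here $\mathcal{L}_{adv}$ is the modality classifier's loss. The classifier itself minimizes it, while the gradient reversal layer flips the sign of its gradient before it reaches the latent encoder, so the encoder is pushed to make the classifier's job harder.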

Why GRL matters so much: If the latent reveals whether a sample came from "has camera" or "no camera" data, the two data sources end up in separate regions of the latent space, and what is learned from one does not carry over to the other. The GRL reverses the gradient of an adversarial classifier that tries to tell the sources apart. This pushes the encoder toward latents that carry no modality cue, so the latent verb encodes the semantics of the motion rather than where the data came from.
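
The mechanics of gradient reversal fit in a few lines. The sketch below is a hand-written scalar example, not the paper's implementation: a one-weight "encoder" feeds a one-weight logistic "discriminator", and the reversal layer is the identity going forward and multiplies the gradient by a negative factor going back. In a deep-learning framework the same thing is usually written as a custom autograd operation.

```python
import math


class GradientReversal:
    """Identity in the forward pass; scales the gradient by -lam going back."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam

    def forward(self, z: float) -> float:
        return z

    def backward(self, grad_from_discriminator: float) -> float:
        return -self.lam * grad_from_discriminator


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def grads(w: float, v: float, x: float, y: float, grl: GradientReversal):
    """Manual backprop through: x -> z = w*x -> GRL -> p = sigmoid(v*z)."""
    z = grl.forward(w * x)
    p = sigmoid(v * z)
    dloss_dlogit = p - y                     # gradient of binary cross-entropy
    grad_v = dloss_dlogit * z                # discriminator weight: normal gradient
    grad_z = grl.backward(dloss_dlogit * v)  # reversed before reaching the encoder
    grad_w = grad_z * x                      # encoder weight sees the flipped sign
    return grad_v, grad_w


grl = GradientReversal(lam=1.0)
grad_v, grad_w = grads(w=1.0, v=2.0, x=1.0, y=1.0, grl=grl)

# What the encoder weight would receive without the reversal layer.
plain_grad_w = (sigmoid(2.0) - 1.0) * 2.0 * 1.0
print(grad_w == -plain_grad_w)  # True
```

A gradient-descent step on `grad_v` makes the discriminator better at spotting the modality, while the same step on `grad_w` moves the encoder in the direction that increases the discriminator's loss.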

The ablation confirms this: remove GRL → success rate drops from 58.5% to 33.0%.

Phase 2: Train Teacher WBC Policies

Goal: Create specialized RL teacher policies for each motion category.

Algorithm: PPO (Proximal Policy Optimization)
Input: privileged observations (full proprioception + reference trajectory — unrealistic in deployment)
Reward: motion_tracking_accuracy + λ₁·smoothness − λ₂·joint_limit_cost
Multiple teachers, each specializing in one task group (navigation, sitting, reaching...)
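
As a rough illustration of how such a reward could be assembled, the sketch below treats smoothness as a bonus (zero is best) and the joint-limit term as a penalty. The specific term definitions, argument names, and default weights are assumptions; the article only names the three components.

```python
from typing import Sequence, Tuple


def teacher_reward(
    tracking_accuracy: float,
    joint_vel: Sequence[float],
    prev_joint_vel: Sequence[float],
    joint_pos: Sequence[float],
    joint_limits: Sequence[Tuple[float, float]],
    lam_smooth: float = 0.1,
    lam_limit: float = 1.0,
) -> float:
    """Tracking term plus a smoothness bonus minus a joint-limit penalty."""
    # Smoothness: negative squared change in joint velocity (0 is best).
    smoothness = -sum((v - pv) ** 2 for v, pv in zip(joint_vel, prev_joint_vel))
    # Joint-limit cost: how far each joint sits outside its [lo, hi] range.
    limit_cost = sum(
        max(0.0, lo - q) + max(0.0, q - hi)
        for q, (lo, hi) in zip(joint_pos, joint_limits)
    )
    return tracking_accuracy + lam_smooth * smoothness - lam_limit * limit_cost


r = teacher_reward(
    tracking_accuracy=1.0,
    joint_vel=[0.5, 0.0],
    prev_joint_vel=[0.0, 0.0],
    joint_pos=[0.0, 2.5],
    joint_limits=[(-1.0, 1.0), (-2.0, 2.0)],
)
print(round(r, 3))  # 0.475
```

In the example the second joint sits 0.5 rad past its upper limit, which costs far more reward than the jerky velocity change on the first joint.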

Teachers achieve high performance because they receive full information, but can't be deployed directly (they need privileged observations).

Phase 3: Distill LeVERB-A

Goal: Student policy learns from teachers using only real deployment inputs.

Algorithm: DAgger (Dataset Aggregation) — better than pure behavior cloning because it continuously rolls out and collects new data at the student's distribution
Student input: real proprioception + latent verb $z_t$
Critical trick: During training, sample $z_t$ from the full CVAE distribution (not just the mean) so the policy learns multi-modal behavior
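
The DAgger loop itself is short. The toy below swaps the humanoid for a one-dimensional state and the RL teacher for a proportional controller, so everything in it (the dynamics, the gain, the linear student) is a stand-in chosen to keep the example runnable. What carries over is the structure: roll out the student, label the visited states with the teacher, aggregate, retrain.

```python
import random
from typing import Callable, List, Tuple

K_TEACHER = 0.8  # hypothetical teacher gain for this toy problem


def teacher(x: float) -> float:
    """Stand-in expert: proportional controller driving the state to zero."""
    return -K_TEACHER * x


def rollout(policy: Callable[[float], float], x0: float, steps: int) -> List[float]:
    """Simulate x_{t+1} = x_t + a_t and return the visited states."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + policy(x)
    return states


def fit_linear(data: List[Tuple[float, float]]) -> float:
    """Least-squares fit of a = w * x on the aggregated dataset."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, _ in data)
    return num / den if den > 0 else 0.0


def dagger(iterations: int = 5, steps: int = 20, seed: int = 0) -> float:
    rng = random.Random(seed)
    dataset: List[Tuple[float, float]] = []
    w = 0.0  # student starts untrained
    for _ in range(iterations):
        x0 = rng.uniform(-1.0, 1.0)
        # Roll out the student so data comes from the states it actually visits.
        states = rollout(lambda x: w * x, x0, steps)
        # Ask the teacher for the correct action in each visited state.
        dataset.extend((x, teacher(x)) for x in states)
        w = fit_linear(dataset)  # retrain on everything collected so far
    return w


w_student = dagger()
print(round(w_student, 3))  # -0.8
```

Pure behavior cloning would only ever see states from the teacher's own rollouts. Here the first rollout comes from an untrained student that never moves, and the teacher's labels on those states are still enough to recover the gain.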

Using only the CVAE mean (no sampling): success rate drops to 6.5% — catastrophic failure. This shows the model must not "mode collapse" to a single way of performing each task.
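
The difference between the two settings comes down to what the student is fed during distillation. A minimal sketch, assuming a diagonal Gaussian latent and the usual reparameterized draw; the helper names and the toy mean and standard deviation are invented for the example.

```python
import random
from typing import List


def sample_latent(mean: List[float], std: List[float], rng: random.Random) -> List[float]:
    """Reparameterized draw: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def latent_for_student(
    mean: List[float],
    std: List[float],
    rng: random.Random,
    use_sampling: bool = True,
) -> List[float]:
    """Latent fed to the student policy during distillation."""
    return sample_latent(mean, std, rng) if use_sampling else list(mean)


rng = random.Random(0)
mean, std = [0.0, 1.0], [0.5, 0.1]
sampled = [latent_for_student(mean, std, rng) for _ in range(1000)]
mean_only = [latent_for_student(mean, std, rng, use_sampling=False) for _ in range(1000)]

spread_sampled = max(z[0] for z in sampled) - min(z[0] for z in sampled)
spread_mean = max(z[0] for z in mean_only) - min(z[0] for z in mean_only)
print(spread_sampled > 0.0, spread_mean == 0.0)  # True True
```

With sampling, the student trains on a neighbourhood of latents around each mean rather than on a single point per motion, which is the coverage the mean-only variant lacks.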

Benchmark Results

Full Ablation Table

Configuration	Overall Success Rate
LeVERB (Full)	58.5%
No Discriminator (GRL)	33.0%
No Kinematics Encoder	53.0%
End-to-end VLA (no WBC)	25.5%
No Low-level Sampling	6.5%

The paper reports a 7.8× improvement for full LeVERB over a naive hierarchical VLA baseline. Against the 25.5% end-to-end row in the table above, the gap is about 2.3× (58.5 / 25.5). Either way, the pattern reflects a basic point: language and physics operate at different abstraction levels, and need a bridge between them.
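
A quick way to read the ablation table is as ratios against the full model. The sketch below uses only the success rates listed in the table.

```python
ablation = {
    "LeVERB (Full)": 58.5,
    "No Discriminator (GRL)": 33.0,
    "No Kinematics Encoder": 53.0,
    "End-to-end VLA (no WBC)": 25.5,
    "No Low-level Sampling": 6.5,
}

full = ablation["LeVERB (Full)"]
# How many times higher the full model's success rate is than each variant's.
ratios = {name: round(full / rate, 2) for name, rate in ablation.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The largest ratio belongs to the no-sampling variant (9.0×), followed by the end-to-end row (2.29×) and the no-discriminator row (1.77×). Removing the kinematics encoder costs the least (1.1×).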

Per-Task Breakdown

Task	Success Rate	Notes
Simple navigation (target ahead)	80%	LeVERB excels here
Complex navigation (target behind)	30%	Requires 180° turn
Cluttered environments	25–50%	Varies with obstacle density
Visual sit-down on chair	5%	Current bottleneck

80% on simple navigation is impressive for zero-shot from simulation. The 5% on chair sit-down marks the clearest current limitation: the task requires precise positioning of the body relative to the chair.

LeVERB on real Unitree G1: responding to language variations (top) and visual spatial reasoning for chairs (bottom) — source: arXiv 2506.13751

Real-World Deployment on Unitree G1

Hardware Setup

[RTX 4090 External GPU]
     │  10 Hz
     │  LeVERB-VL (vision + language → latent verb)
     ▼
[Unitree G1 Robot]
  └─ ONNX Runtime (C++)
     │  50 Hz
     │  LeVERB-A (latent verb + proprioception → joint positions)
     ▼
  [Actuators] → full-body motion

Zero-Shot Sim-to-Real Transfer

All training happens in simulation — no real robot data needed. LeVERB runs directly on the real G1 without any fine-tuning.

Three factors enable this:

Aggressive domain randomization: Diverse textures, lighting, objects → model doesn't overfit to a specific visual environment
Separation of concerns: LeVERB-VL handles semantics (robust to visual domain gap), LeVERB-A handles dynamics (trained in well-calibrated physics simulator)
Appropriate control frequency: 50 Hz is sufficient for the G1's real dynamics

Robustness to Natural Language Variations

Tested with diverse phrasings for the same task:

"Go to the chair" / "Walk toward the seat" / "Approach the sitting furniture"
"Turn left" / "Rotate to the left" / "Face the left direction"

All work — the robot understands semantics, not literal strings.

Method	WBC	Vision	Language	Sim-to-real
LeVERB	✅ Latent	✅ Multi-cam	✅ Natural	✅ Zero-shot
WholebodyVLA ICLR 2026	✅ Unified	✅	✅	Partial
HEX-VLA	✅	✅	✅	Partial
DREAM-Chunk	Partial	✅	✅	✅
End-to-end VLA	❌	✅	✅	Hard

LeVERB stands out as the first paper to create a comprehensive benchmark (150+ tasks, 10 categories) for vision-language humanoid WBC, combined with zero-shot sim-to-real deployment.

Current Limitations and Future Directions

Current Bottlenecks

1. LeVERB-VL latency: 100 ms (10 Hz) is slow for tasks requiring instant reaction. If an obstacle suddenly appears, LeVERB-A keeps executing the previous latent verb until the next 10 Hz update arrives.

2. No manipulation yet: LeVERB currently covers loco-navigation and locomotion. No grasping, pushing, or dexterous manipulation. The next step needs to extend the latent verb space to arm/hand tasks.

3. Data scale is small: 154 tasks from 154 MoCap trajectories. Need to scale to thousands of tasks to cover long-tail behaviors.

4. External GPU dependency: RTX 4090 as external compute limits portability. Need to optimize LeVERB-VL for Jetson Orin or onboard NPU.

Promising Extensions

Manipulation: Add arm/hand tasks to the latent verb space (grasping, pushing, inserting)
Scale: Combine with internet human video data for pre-training
Efficiency: Quantize LeVERB-VL → Jetson Orin NX standalone deployment
Memory: Add temporal context to handle long-horizon tasks

Why This Matters

LeVERB is more than a lab result. It sketches how language-commanded humanoids could eventually be deployed in real environments:

Factory logistics: "Move the red box to shelf 3" → robot navigates, grasps, places precisely
Elderly assistance: "Bring me the chair from the corner" → robot understands and acts
Research labs: "Set up the workstation in configuration A" → robot follows description

When manipulation support is added (expected in follow-up work from EMBER Lab), LeVERB could become a much more complete framework for humanoid service robots.

Conclusion

LeVERB tackles one of robotics' hardest problems: how to let natural language command a full humanoid body in the physical world without a hand-crafted action vocabulary.

The answer, an automatically learned latent verb space connecting the semantic and dynamics tiers, is both theoretically elegant and practically effective. The reported 7.8× improvement over a naive hierarchical VLA baseline and the 80% success rate on simple navigation tasks support this.

If you're working on whole-body VLA, sim-to-real transfer, or hierarchical robot control, this paper is essential reading.