Most Vision-Language-Action (VLA) systems today assume the robot already has a perfect low-level controller. Just give a command — "pick up the cup", "walk to the chair" — and the robot's body figures out the rest. But for full-body humanoid robots, every step requires dozens of joints coordinating simultaneously while maintaining balance, avoiding obstacles, and parsing natural language semantics. No one had fully solved this.
The EMBER Lab at UC Berkeley, in collaboration with CMU, Simon Fraser University, and NTNU (Norway), just published LeVERB — Latent Vision-Language-Encoded Robot Behavior — the first hierarchical framework to tackle this problem end-to-end on a real humanoid.
Paper: LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction — Haoru Xue, Xiaoyu Huang, Dantong Niu et al., arXiv 2506.13751, June 2025.
Authors: Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, Shankar Sastry — UC Berkeley, CMU, Simon Fraser University, NTNU.
The Core Problem: The Gap Between Language and Dynamics
Consider saying to a Unitree G1: "Walk to the red chair and sit down."
Simple for a human, enormously complex for a robot:
- Must recognize the chair visually in 3D space
- Orient its body and navigate to the precise location
- Execute a sit-down motion — coordinating legs, torso, and arms simultaneously
- All while maintaining balance and avoiding obstacles
Existing approaches tackle this at extremes:
Approach 1 — End-to-end VLA: A large model directly outputs joint angles. Upside: flexible with language. Downside: no awareness of physical dynamics; robot easily falls.
Approach 2 — Traditional WBC: A rigid controller with a fixed action vocabulary ("walk forward", "turn left", "sit"). Upside: dynamically stable. Downside: can't understand natural language or respond to visual context flexibly.
LeVERB bridges this gap with an elegant idea: learn a latent verb space — an automatically discovered action vocabulary — so the two system tiers can communicate without any hand-crafted motion primitives.
LeVERB Architecture: Dual-Process System
LeVERB draws inspiration from Daniel Kahneman's cognitive theory of "System 1 & System 2":
- System 2 — LeVERB-VL (10 Hz): Slow thinking, processes language and vision
- System 1 — LeVERB-A (50 Hz): Fast reaction, executes physical control

LeVERB-VL: The Vision-Language Tier
Inputs:
- Egocentric camera (robot head camera, 10 Hz)
- 2–3 third-person cameras
- Natural language text instruction
Architecture:
- Vision encoder: SigLIP ViT-B/16 (frozen — not fine-tuned)
- Fusion Transformer: combines features from multiple cameras and text instruction
- Output: latent vector $z_t$ encoding "motion semantics" — the "latent verb"
LeVERB-VL does not output joint angles directly. It outputs an abstract vector in a learned latent space, which is handed to LeVERB-A for execution.
LeVERB-A: The Dynamics Control Tier
Inputs:
- Proprioceptive state: joint positions, IMU readings, angular velocities
- Latent vector $z_t$ from LeVERB-VL
Architecture:
- 2-layer Transformer
- Output: joint position targets @ 50 Hz
Deployment: LeVERB-VL runs on an external RTX 4090 @ 10 Hz; LeVERB-A runs as ONNX/C++ onboard the robot @ 50 Hz. Each latent verb is reused across 5 control steps (50 Hz / 10 Hz = 5).
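The rate mismatch above amounts to a zero-order hold on the latent verb. The sketch below illustrates only the scheduling arithmetic; `run_control_loop` and the `policy` callable are hypothetical stand-ins for the onboard loop and LeVERB-A, not the authors' code.

```python
VL_HZ = 10                # LeVERB-VL update rate, from the text
ACT_HZ = 50               # LeVERB-A control rate, from the text
HOLD = ACT_HZ // VL_HZ    # control steps per latent verb


def latent_index_per_step(num_control_steps):
    """For each 50 Hz control step, return the index of the 10 Hz latent in use."""
    return [step // HOLD for step in range(num_control_steps)]


def run_control_loop(latents, policy, num_control_steps):
    """Zero-order hold: reuse the latest latent until the next one arrives.

    `latents` holds one latent verb per 10 Hz tick; `policy(z, step)` is a
    hypothetical stand-in for LeVERB-A. If the latents run out, the last one
    is held.
    """
    actions = []
    for step in range(num_control_steps):
        z = latents[min(step // HOLD, len(latents) - 1)]
        actions.append(policy(z, step))
    return actions


print(latent_index_per_step(12))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
```

A consequence of this scheme is that the low-level policy can act on a latent that is up to 100 ms old.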
CVAE: The Bridge Between Tiers
The communication mechanism between the two tiers is a Conditional VAE (CVAE) with a residual latent design:
$$z_t = \text{mean}(z_{encoder}) + \text{residual}(z_{VL})$$
In practice:
- During training: The CVAE encoder receives the ground-truth trajectory → encodes it into $z_t$, so LeVERB-A has accurate motion context
- During inference: Only LeVERB-VL predicts $z_t$ from vision + language — no ground-truth trajectory needed
The residual design has a clear purpose: LeVERB-VL focuses on semantics (where to go, what to do), while the CVAE encoder captures motion details (how exactly to move). The ablations reflect this: removing the kinematics encoder lowers overall success from 58.5% to 53.0%, and training LeVERB-A on the latent mean alone, without sampling, drops it to 6.5%.
Data Synthesis Pipeline
The biggest challenge in training humanoid VLA is the lack of real robot data. LeVERB sidesteps this entirely using a fully synthetic data generation pipeline.

Step 1: Collect Kinematic Motions from MoCap
- Source: AMASS dataset (human motion capture)
- Process: retarget human motions → Unitree G1 via motion retargeting
- Result: 154 kinematic trajectories, each a complete motion clip
Step 2: Procedural Randomization × 100
Each of the 154 trajectories is randomized 100 times across:
| Level | What's randomized |
|---|---|
| Scene-level | Floor/wall textures, lighting, material properties |
| Object-level | Chair colors, sizes, physical properties |
| Placement | Random object positions with automatic semantic labeling |
| Multi-view | 3–4 cameras rendered simultaneously (egocentric + third-person) |
| Mirroring | Left/right flip for diversity without additional capture |
Result: 17.1 hours of photorealistic video rendered in IsaacSim with ray-tracing. Plus 2.7 hours of language-only trajectories (no camera) for robustness to missing visual input.
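To make the randomization table concrete, here is a minimal sketch of how one scene specification per variant might be sampled. The option pools, numeric ranges, and field names are all assumptions for illustration; the article does not list the actual assets or parameters. Only the counts (154 trajectories, 100 variants each) come from the text.

```python
import random

NUM_TRAJECTORIES = 154     # from the text
VARIANTS_PER_TRAJ = 100    # from the text

# Hypothetical option pools; the real asset lists are not given in the article.
FLOOR_TEXTURES = ["wood", "tile", "carpet", "concrete"]
CHAIR_COLORS = ["red", "brown", "blue", "green"]


def sample_variant(traj_id, rng):
    """Draw one randomized scene spec for a given kinematic trajectory."""
    return {
        "traj_id": traj_id,
        "floor_texture": rng.choice(FLOOR_TEXTURES),    # scene-level
        "light_intensity": rng.uniform(0.5, 1.5),       # scene-level (assumed range)
        "chair_color": rng.choice(CHAIR_COLORS),        # object-level
        "chair_scale": rng.uniform(0.9, 1.1),           # object-level (assumed range)
        "chair_xy": (rng.uniform(-2.0, 2.0), rng.uniform(1.0, 4.0)),  # placement
        "num_cameras": rng.choice([3, 4]),              # multi-view
        "mirrored": rng.random() < 0.5,                 # left/right flip
    }


def build_dataset_specs(seed=0):
    """Return one spec per (trajectory, variant) pair, reproducible via the seed."""
    rng = random.Random(seed)
    return [
        sample_variant(traj_id, rng)
        for traj_id in range(NUM_TRAJECTORIES)
        for _ in range(VARIANTS_PER_TRAJ)
    ]


specs = build_dataset_specs()
print(len(specs))  # 154 * 100 = 15400
```

Whether mirroring counts as one of the 100 variants or doubles the set on top of them is not stated in the article; the sketch treats it as a per-variant coin flip.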
Step 3: Annotate with Natural Language
Each trajectory gets diverse text instruction annotations:
- "Go to the brown chair and sit down"
- "Walk straight and stop at the red marker"
- "Turn left 90 degrees and approach the table"
No rigid templates — paraphrased instructions so the model learns semantics, not string patterns.
LeVERB-Bench: 154 Tasks, 10 Categories
The dataset is organized into a structured benchmark:
- 154 vision-language tasks with clear success criteria
- 10 categories: navigation, locomotion, sit-down, reaching, and combinations
- Sim-to-real ready: test in sim → deploy on real robot without additional data
Public dataset on Hugging Face: ember-lab-berkeley/LeVERB-Bench-Dataset
Training Procedure: 3 Phases
Phase 1: Train LeVERB-VL
Goal: Learn a latent verb space from synthetic kinematic data.
Loss function with 3 components:
- Reconstruction loss (MSE): Ensures the latent verb encodes enough trajectory information for LeVERB-A to reconstruct the motion
- KL divergence: Regularizes the latent space distribution — standard VAE component
- Adversarial GRL (Gradient Reversal Layer): The most critical component — aligns distribution between vision-language data and language-only data
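One compact way to write the three terms as a single objective is shown below. The weights $\beta$ and $\lambda$, the trajectory symbols $\tau$ and $\hat{\tau}$, and the distribution names $q$ and $p$ are notation assumed here for illustration, not taken from the paper:

$$\mathcal{L}_{\text{VL}} = \underbrace{\lVert \hat{\tau} - \tau \rVert_2^2}_{\text{reconstruction}} + \beta \, \underbrace{D_{\mathrm{KL}}\big(q(z_t \mid \cdot) \,\Vert\, p(z_t \mid \cdot)\big)}_{\text{latent regularization}} + \lambda \, \underbrace{\mathcal{L}_{\text{disc}}}_{\text{adversarial, via GRL}}$$

Here $q$ is the approximate posterior over the latent verb and $p$ the prior it is regularized toward. The discriminator minimizes $\mathcal{L}_{\text{disc}}$ as usual, while the gradient reversal layer flips the sign of that gradient before it reaches the encoder.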
Why GRL matters so much: If the model learns to distinguish "has camera" vs "no camera" data, it gets lazy and only uses visual input when cameras are available. GRL reverses the gradient of an adversarial classifier, forcing the model to be blind to modality → it must learn purely semantic features from language, independent of input modality.
The ablation confirms this: remove GRL → success rate drops from 58.5% to 33.0%.
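The mechanics of a gradient reversal layer fit in a few lines. The sketch below uses a scalar toy model with hand-written backpropagation so it runs without any deep learning library; the encoder `w * x`, the discriminator `v * z`, and all variable names are illustrative assumptions, not LeVERB's architecture.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grl_forward(z):
    """Gradient reversal layer, forward pass: identity."""
    return z


def grl_backward(grad_out, lam):
    """Gradient reversal layer, backward pass: flip the sign and scale by lam."""
    return -lam * grad_out


def discriminator_grads(x, d, w, v, lam=1.0):
    """Manual backprop through encoder -> GRL -> discriminator for one sample.

    x: scalar input; d: modality label (1 = has camera, 0 = language-only);
    w: toy encoder weight; v: toy discriminator weight.
    Returns (loss, grad_v, grad_w).
    """
    z = w * x                       # toy encoder
    logit = v * grl_forward(z)      # discriminator sees z unchanged
    p = sigmoid(logit)
    loss = -(d * math.log(p) + (1 - d) * math.log(1 - p))  # binary cross-entropy

    grad_logit = p - d
    grad_v = grad_logit * z                      # discriminator learns normally
    grad_z = grl_backward(grad_logit * v, lam)   # encoder receives reversed gradient
    grad_w = grad_z * x
    return loss, grad_v, grad_w


loss, grad_v, grad_w = discriminator_grads(x=1.0, d=1, w=0.5, v=1.0)
# A gradient-descent step on w now makes the discriminator's job harder.
loss_after, _, _ = discriminator_grads(x=1.0, d=1, w=0.5 - 0.1 * grad_w, v=1.0)
print(loss_after > loss)  # True
```

The discriminator weight still descends its own loss, while the encoder weight effectively ascends it. That is how the encoder is pushed toward features from which the modality cannot be predicted.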
Phase 2: Train Teacher WBC Policies
Goal: Create specialized RL teacher policies for each motion category.
- Algorithm: PPO (Proximal Policy Optimization)
- Input: privileged observations (full proprioception + reference trajectory — unrealistic in deployment)
- Reward: motion_tracking_accuracy + λ₁·smoothness + λ₂·joint_limit_cost
- Multiple teachers, each specializing in one task group (navigation, sitting, reaching...)
Teachers achieve high performance because they receive full information, but can't be deployed directly (they need privileged observations).
Phase 3: Distill LeVERB-A
Goal: Student policy learns from teachers using only real deployment inputs.
- Algorithm: DAgger (Dataset Aggregation) — better than pure behavior cloning because it continuously rolls out and collects new data at the student's distribution
- Student input: real proprioception + latent verb $z_t$
- Critical trick: During training, sample $z_t$ from the full CVAE distribution (not just the mean) so the policy learns multi-modal behavior
Using only the CVAE mean (no sampling): success rate drops to 6.5% — catastrophic failure. This shows the model must not "mode collapse" to a single way of performing each task.
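The difference between the two training variants comes down to how the latent is drawn. Below is a minimal sketch, assuming the latent distribution is a diagonal Gaussian parameterized by a mean and a log-variance (a common CVAE choice, not confirmed by the article); `sample_latent` is a hypothetical helper name.

```python
import math
import random


def sample_latent(mu, log_var, rng, deterministic=False):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    mu, log_var: lists of floats describing an assumed diagonal Gaussian.
    deterministic=True returns the mean only, which corresponds to the
    "No Low-level Sampling" ablation.
    """
    if deterministic:
        return list(mu)
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, 0.0]
z_mean_only = sample_latent(mu, log_var, rng, deterministic=True)
z_sampled = sample_latent(mu, log_var, rng)
```

With sampling, the student sees a spread of latents around each mean during distillation, so it is exposed to the kind of variation that LeVERB-VL's predictions will have at inference time.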
Benchmark Results
Full Ablation Table
| Configuration | Overall Success Rate |
|---|---|
| LeVERB (Full) | 58.5% |
| No Discriminator (GRL) | 33.0% |
| No Kinematics Encoder | 53.0% |
| End-to-end VLA (no WBC) | 25.5% |
| No Low-level Sampling | 6.5% |
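The paper reports that full LeVERB outperforms a naive hierarchical whole-body VLA implementation by 7.8×. Against the 25.5% end-to-end row in the table above, the gain is closer to 2.3× (58.5 / 25.5), so the 7.8× figure should not be read off that row. Either way, the pattern reflects a fundamental insight: language and physics operate at different abstraction levels and need an intelligent bridge between them.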
Per-Task Breakdown
| Task | Success Rate | Notes |
|---|---|---|
| Simple navigation (target ahead) | 80% | LeVERB excels here |
| Complex navigation (target behind) | 30% | Requires 180° turn |
| Cluttered environments | 25–50% | Varies with obstacle density |
| Visual sit-down on chair | 5% | Current bottleneck |
80% on simple navigation is impressive for a policy trained entirely on synthetic data. The 5% on chair sit-down marks the clearest current limitation, since the task demands precise body positioning relative to the chair.

Real-World Deployment on Unitree G1
Hardware Setup
[RTX 4090 External GPU]
│ 10 Hz
│ LeVERB-VL (vision + language → latent verb)
▼
[Unitree G1 Robot]
└─ ONNX Runtime (C++)
│ 50 Hz
│ LeVERB-A (latent verb + proprioception → joint positions)
▼
[Actuators] → full-body motion
Zero-Shot Sim-to-Real Transfer
All training happens in simulation — no real robot data needed. LeVERB runs directly on the real G1 without any fine-tuning.
Three factors enable this:
- Aggressive domain randomization: Diverse textures, lighting, objects → model doesn't overfit to a specific visual environment
- Separation of concerns: LeVERB-VL handles semantics (robust to visual domain gap), LeVERB-A handles dynamics (trained in well-calibrated physics simulator)
- Appropriate control frequency: 50 Hz is sufficient for the G1's real dynamics
Robustness to Natural Language Variations
Tested with diverse phrasings for the same task:
- "Go to the chair" / "Walk toward the seat" / "Approach the sitting furniture"
- "Turn left" / "Rotate to the left" / "Face the left direction"
All work — the robot understands semantics, not literal strings.
Comparison with Related Work
| Method | WBC | Vision | Language | Sim-to-real |
|---|---|---|---|---|
| LeVERB | ✅ Latent | ✅ Multi-cam | ✅ Natural | ✅ Zero-shot |
| WholebodyVLA (ICLR 2026) | ✅ Unified | ✅ | ✅ | Partial |
| HEX-VLA | ✅ | ✅ | ✅ | Partial |
| DREAM-Chunk | Partial | ✅ | ✅ | ✅ |
| End-to-end VLA | ❌ | ✅ | ✅ | Hard |
LeVERB stands out as the first paper to create a comprehensive benchmark (150+ tasks, 10 categories) for vision-language humanoid WBC, combined with zero-shot sim-to-real deployment.
Current Limitations and Future Directions
Current Bottlenecks
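1. LeVERB-VL latency: a 10 Hz update rate means a new latent verb arrives only every 100 ms, which is slow for tasks requiring instant reaction. If an obstacle suddenly appears, LeVERB-A keeps acting on the previous latent verb until the next LeVERB-VL update.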
2. No manipulation yet: LeVERB currently covers loco-navigation and locomotion. No grasping, pushing, or dexterous manipulation. The next step needs to extend the latent verb space to arm/hand tasks.
3. Data scale is small: 154 tasks from 154 MoCap trajectories. Need to scale to thousands of tasks to cover long-tail behaviors.
4. External GPU dependency: RTX 4090 as external compute limits portability. Need to optimize LeVERB-VL for Jetson Orin or onboard NPU.
Promising Extensions
- Manipulation: Add arm/hand tasks to the latent verb space (grasping, pushing, inserting)
- Scale: Combine with internet human video data for pre-training
- Efficiency: Quantize LeVERB-VL → Jetson Orin NX standalone deployment
- Memory: Add temporal context to handle long-horizon tasks
Why This Matters
LeVERB isn't just a lab result — it's a blueprint for deploying humanoid robots in real environments:
- Factory logistics: "Move the red box to shelf 3" → robot navigates, grasps, places precisely
- Elderly assistance: "Bring me the chair from the corner" → robot understands and acts
- Research labs: "Set up the workstation in configuration A" → robot follows description
If manipulation support is added in follow-up work, LeVERB could grow into a more complete framework for humanoid service robots. The scenarios above depend on it, since the current system does not grasp or place objects.
Conclusion
LeVERB tackles one of robotics' hardest problems: how to let natural language command a full humanoid body in the physical world without a hand-crafted action vocabulary.
The answer, an automatically learned latent verb space connecting the semantic and dynamics tiers, is both theoretically elegant and practically effective. The reported 7.8× improvement over a naive hierarchical VLA baseline and the 80% success rate on simple navigation are strong evidence, even though overall success is still 58.5%.
If you're working on whole-body VLA, sim-to-real transfer, or hierarchical robot control, this paper is essential reading.



