WholeBodyVLA: Action-Free Video to Loco-Manipulation

In Part 4, we got panoramic 3D perception for a humanoid. But perception is only half. The other half — and the hardest part of humanoids — is turning perception into whole-body action: balancing, walking, and reaching to manipulate, all at once. This is whole-body loco-manipulation, and this article analyzes a paper that solves it elegantly: WholeBodyVLA.

Paper: WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control — ICLR 2026. Tested on the AgiBot X2 humanoid.

WholeBodyVLA: unified latent VLA for whole-body loco-manipulation — source: arXiv 2512.11047

Source: WholeBodyVLA (arXiv 2512.11047) — from egocentric video to loco-manipulation control.

The problem: walking and manipulating can't be separated

In earlier series (GR00T N1 + Unitree G1), the common architecture is decoupled: one policy handles the arms (manipulation), one controller handles the legs (locomotion), running at different frequencies. This works when the two are independent — standing still to grab something, or walking while carrying nothing.

But real whole-body loco-manipulation is inseparable. When a humanoid reaches for a heavy box on a high shelf:

The center of mass shifts → the legs must compensate to not fall
The arm extends → torque on the hips and torso
Bending/leaning to get close → the whole kinematic chain from legs to arms coordinates

Arms and legs must "negotiate" in real time. WholeBodyVLA solves this by unifying them in one framework, while still respecting that arms and legs run at different frequencies.

Three core ideas of WholeBodyVLA

1. Unified latent VLA

Instead of two separate policies, WholeBodyVLA uses a single VLA that produces a unified latent action — an intermediate representation encoding "whole-body intent" (where to go, what to reach for, what posture). This latent is not a direct joint command, but a high-level goal for the lower tier to execute.

The benefit: the VLA reasons once about the whole body, avoiding the situation where arms and legs "fight" because two policies make conflicting decisions.

2. Learning from action-free egocentric video

This is the data breakthrough. Collecting real-robot data (teleop) is expensive and slow — the topic we dissect in Part 6. WholeBodyVLA leverages action-free egocentric video: first-person video (a person with a head camera doing chores, manipulating objects) — with no action labels (no idea which joint moved how much), only imagery.

Why action-free video is much cheaper:

Data source	Needs robot?	Needs action labels?	Cost	Scale
Robot teleop	Yes	Yes (precise)	Very high	Small
Action-free egocentric video	No	No	Very low	Massive (YouTube, Ego4D)

WholeBodyVLA learns representations and intent from that ocean of action-free video, then uses a small amount of robot data to "anchor" the latent to real actions.

3. Connecting a high-level VLA to a low-level RL controller

This is the decoupled high/low frequency architecture we've seen in earlier humanoid series, but done cleanly:

Observation (egocentric + state)
        ↓
High-level VLA (slow, ~a few Hz)
   → whole-body latent action (intent)
        ↓
RL loco-manipulation controller (fast, ~hundreds of Hz)
   → torque/joint commands to balance + execute the intent
        ↓
Robot (AgiBot X2 / Unitree G1)

The high-level VLA runs slowly and handles "what to do" — read the scene, understand the instruction, emit a latent.
The low-level RL controller runs fast and handles "how" — turn the latent into motion that stays balanced. This is trained with RL in simulation (Isaac Lab / MuJoCo), where the robot falls millions of times to learn to stay upright.

This frequency split isn't arbitrary: vision-language reasoning is inherently slow, while balancing needs a high-frequency control loop. Forcing both into one frequency is a classic mistake.

Relation to the rest of the series

WholeBodyVLA doesn't replace what we've learned — it sits on top:

3D pretraining (SPEAR-1, Part 1). The better the backbone understands 3D geometry, the more spatially accurate the latent VLA. 3D-awareness is the foundation for a good VLA.
Robo3R (Part 2). Provides a metric 3D scene in the robot frame so the VLA can reason about collisions and real distances when reaching.
DP3 (Part 3). A policy choice for fine hand manipulation, while WholeBodyVLA coordinates the whole body.
Panoramic perception (Part 4). Lets the VLA "see" beyond the FOV to plan walking and reaching.

In other words: Parts 1–4 build perception and policy for each piece; WholeBodyVLA is the whole-body coordinator that ties them into a humanoid that walks and works.

Comparison: old decoupled vs unified latent VLA

Criterion	Decoupled (2 separate policies)	WholeBodyVLA (unified latent)
Arm-leg coordination	Weak (two independent policies)	Strong (unified latent)
Primary data source	Robot teleop (expensive)	Action-free video (cheap) + some teleop
Balancing	Separate controller	Low-level RL controller
High-level reasoning	Limited	Full vision-language VLA
Best fit	Simple, separable tasks	Real, complex loco-manipulation

Notes for applying to the Unitree G1

The paper tests on AgiBot X2, but the architecture generalizes to any humanoid with an RL loco controller. For the Unitree G1:

You need a good RL loco-manipulation controller first. The latent VLA only emits intent; if the low-level controller can't hold balance during far reaches, the whole system collapses. Investing in an RL controller in sim (Isaac Lab) is mandatory — see Part 7.
Video domain gap. Egocentric human video differs in morphology from a robot (arm proportions, finger count, reach). You need retargeting/alignment so the latent learned from humans transfers to G1.
Sim2real for the loco part. An RL controller trained in sim needs domain randomization + an actuator-delay model to not "break" on the real robot (techniques discussed in ASAP/Unitree G1).

Conclusion: whole-body VLA is the destination, data is the bottleneck

WholeBodyVLA shows the direction for humanoids in 2026: a unified VLA coordinating the whole body, learning mostly from cheap video, executing through a balancing RL controller. This is where all the pieces of the series converge — 3D perception, point-cloud policy, panoramic perception — into a system that walks and manipulates.

But the paper exposes a truth: data is the bottleneck. Action-free video is cheap but lacks action labels; teleop is precise but expensive. How do you pick the right data strategy for your budget? That is the entire content of Part 6.

Part 4: Omni-Manip & Spatial Reasoning for Humanoids — Panoramic perception feeds the whole-body VLA
Part 6: Data Collection — Teleop vs Robot-Free vs Video — Solving the VLA data bottleneck
GR00T N1 + Unitree G1: Whole-Body VLA Architecture — Comparing decoupled arm/leg architectures

Paper: WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control — ICLR 2026. Tested on the AgiBot X2 humanoid.

WholeBodyVLA: unified latent VLA for whole-body loco-manipulation — source: arXiv 2512.11047

Source: WholeBodyVLA (arXiv 2512.11047) — from egocentric video to loco-manipulation control.

The problem: walking and manipulating can't be separated

But real whole-body loco-manipulation is inseparable. When a humanoid reaches for a heavy box on a high shelf:

The center of mass shifts → the legs must compensate to not fall
The arm extends → torque on the hips and torso
Bending/leaning to get close → the whole kinematic chain from legs to arms coordinates

Arms and legs must "negotiate" in real time. WholeBodyVLA solves this by unifying them in one framework, while still respecting that arms and legs run at different frequencies.

Three core ideas of WholeBodyVLA

1. Unified latent VLA

The benefit: the VLA reasons once about the whole body, avoiding the situation where arms and legs "fight" because two policies make conflicting decisions.

2. Learning from action-free egocentric video

Why action-free video is much cheaper:

Data source	Needs robot?	Needs action labels?	Cost	Scale
Robot teleop	Yes	Yes (precise)	Very high	Small
Action-free egocentric video	No	No	Very low	Massive (YouTube, Ego4D)

WholeBodyVLA learns representations and intent from that ocean of action-free video, then uses a small amount of robot data to "anchor" the latent to real actions.

3. Connecting a high-level VLA to a low-level RL controller

This is the decoupled high/low frequency architecture we've seen in earlier humanoid series, but done cleanly:

Observation (egocentric + state)
        ↓
High-level VLA (slow, ~a few Hz)
   → whole-body latent action (intent)
        ↓
RL loco-manipulation controller (fast, ~hundreds of Hz)
   → torque/joint commands to balance + execute the intent
        ↓
Robot (AgiBot X2 / Unitree G1)

The high-level VLA runs slowly and handles "what to do" — read the scene, understand the instruction, emit a latent.
The low-level RL controller runs fast and handles "how" — turn the latent into motion that stays balanced. This is trained with RL in simulation (Isaac Lab / MuJoCo), where the robot falls millions of times to learn to stay upright.

This frequency split isn't arbitrary: vision-language reasoning is inherently slow, while balancing needs a high-frequency control loop. Forcing both into one frequency is a classic mistake.

Relation to the rest of the series

WholeBodyVLA doesn't replace what we've learned — it sits on top:

3D pretraining (SPEAR-1, Part 1). The better the backbone understands 3D geometry, the more spatially accurate the latent VLA. 3D-awareness is the foundation for a good VLA.
Robo3R (Part 2). Provides a metric 3D scene in the robot frame so the VLA can reason about collisions and real distances when reaching.
DP3 (Part 3). A policy choice for fine hand manipulation, while WholeBodyVLA coordinates the whole body.
Panoramic perception (Part 4). Lets the VLA "see" beyond the FOV to plan walking and reaching.

In other words: Parts 1–4 build perception and policy for each piece; WholeBodyVLA is the whole-body coordinator that ties them into a humanoid that walks and works.

Comparison: old decoupled vs unified latent VLA

Criterion	Decoupled (2 separate policies)	WholeBodyVLA (unified latent)
Arm-leg coordination	Weak (two independent policies)	Strong (unified latent)
Primary data source	Robot teleop (expensive)	Action-free video (cheap) + some teleop
Balancing	Separate controller	Low-level RL controller
High-level reasoning	Limited	Full vision-language VLA
Best fit	Simple, separable tasks	Real, complex loco-manipulation

Notes for applying to the Unitree G1

The paper tests on AgiBot X2, but the architecture generalizes to any humanoid with an RL loco controller. For the Unitree G1:

You need a good RL loco-manipulation controller first. The latent VLA only emits intent; if the low-level controller can't hold balance during far reaches, the whole system collapses. Investing in an RL controller in sim (Isaac Lab) is mandatory — see Part 7.
Video domain gap. Egocentric human video differs in morphology from a robot (arm proportions, finger count, reach). You need retargeting/alignment so the latent learned from humans transfers to G1.
Sim2real for the loco part. An RL controller trained in sim needs domain randomization + an actuator-delay model to not "break" on the real robot (techniques discussed in ASAP/Unitree G1).

Conclusion: whole-body VLA is the destination, data is the bottleneck

Part 4: Omni-Manip & Spatial Reasoning for Humanoids — Panoramic perception feeds the whole-body VLA
Part 6: Data Collection — Teleop vs Robot-Free vs Video — Solving the VLA data bottleneck
GR00T N1 + Unitree G1: Whole-Body VLA Architecture — Comparing decoupled arm/leg architectures

WholeBodyVLA: Action-Free Video to Loco-Manipulation

The problem: walking and manipulating can't be separated

Three core ideas of WholeBodyVLA

1. Unified latent VLA

2. Learning from action-free egocentric video

3. Connecting a high-level VLA to a low-level RL controller

Relation to the rest of the series

Comparison: old decoupled vs unified latent VLA

Notes for applying to the Unitree G1

Conclusion: whole-body VLA is the destination, data is the bottleneck

Nguyễn Anh Tuấn

Related Posts

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

Thu data manipulation: teleop vs robot-free vs video

Vì sao VLA 2D chưa đủ cho manipulation

WholeBodyVLA: Action-Free Video to Loco-Manipulation

The problem: walking and manipulating can't be separated

Three core ideas of WholeBodyVLA

1. Unified latent VLA

2. Learning from action-free egocentric video

3. Connecting a high-level VLA to a low-level RL controller

Relation to the rest of the series

Comparison: old decoupled vs unified latent VLA

Notes for applying to the Unitree G1

Conclusion: whole-body VLA is the destination, data is the bottleneck

Nguyễn Anh Tuấn

Related Posts

Perception 3D cho humanoid: Omni-Manip & spatial reasoning

Thu data manipulation: teleop vs robot-free vs video

Vì sao VLA 2D chưa đủ cho manipulation