Newest WBC + VLA for Humanoids

Why WBC + VLA matters now

In robotics, Whole-Body Control (WBC) means controlling the entire robot body as one coordinated system. For a humanoid, that includes the legs, hips, torso, head, arms, hands, and grippers. Vision-Language-Action (VLA) models map images and language instructions to robot actions: given an instruction like "pick up the box and put it on the cart", a VLA produces an action sequence.

The frontier in 2026 is the intersection of these two ideas. A humanoid is of limited use if it can only manipulate objects while standing still. It must walk to the object, face it, squat if needed, stabilize its torso, use both arms, interact with heavy objects, and recover when contact forces disturb its balance. That is not just manipulation. It is loco-manipulation.

The most relevant paper for this topic is WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control, from Fudan University, OpenDriveLab, MMLab at The University of Hong Kong, and AgiBot. The official project page, opendrivelab.com/WholeBodyVLA, describes a system running on the AgiBot X2 humanoid. It takes egocentric images and language instructions, predicts latent action tokens, decodes them into upper-body joint actions and locomotion commands, and executes the lower-body command with a high-frequency RL controller. The OpenDriveLab/WholebodyVLA GitHub repo currently serves as a resource and paper list; its README says there is no concrete timeline for open-sourcing the full codebase.

As of 2026-06-03, the newest WBC + VLA landscape has three important branches:

Direction	Example	Main strength	Practical caveat
Unified latent VLA for loco-manipulation	WholeBodyVLA	Learns latent actions from action-free videos and runs large-space real-world tasks	Full official code is not yet released
WBC foundation controller	GR00T-WholeBodyControl and SONIC	Open controller stack, checkpoints, Isaac Lab, teleoperation, deployment docs	More of a WBC/motion foundation stack than the same VLA method
Physics-aware VLA/WBC	PhysiFlow	Connects semantic intent with physics-aware tracking and flow matching	Needs broader open code and real-robot validation

This article uses WholeBodyVLA as the main technical case because it directly answers the question: how can a VLA control a humanoid that walks, turns, squats, grasps, carries, and pushes in one task loop?

The core idea: learn whole-body intent from cheaper video

The first bottleneck is data. A conventional robot policy often needs trajectories like this:

observation_t = RGB image + proprioception + language
action_t      = joint targets / end-effector deltas / base velocity
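
As a rough illustration, one timestep of such a trajectory could be stored as shown below. The class name, field names, and the 14-value joint target (assuming two 7-DoF arms) are assumptions made for this sketch, not the data format of WholeBodyVLA or of any particular dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One timestep of a robot-aligned trajectory (hypothetical layout)."""
    rgb: List[List[List[int]]]   # H x W x 3 image as nested lists
    proprio: List[float]         # joint positions, velocities, IMU readings, ...
    instruction: str             # language command for the episode
    joint_targets: List[float] = field(default_factory=list)
    base_velocity: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])  # vx, vy, yaw_rate

    def has_action(self) -> bool:
        """Action-free video frames carry observations but no action labels."""
        return len(self.joint_targets) > 0


# A teleoperated step with action labels (14 targets: assumed two 7-DoF arms).
labeled = Step(rgb=[[[0, 0, 0]]], proprio=[0.1, 0.2],
               instruction="pick up the box", joint_targets=[0.0] * 14)
# A frame from plain video: the observation is there, the action is not.
video_only = Step(rgb=[[[0, 0, 0]]], proprio=[], instruction="")
```

The `has_action` check marks the gap discussed in this section: teleoperated steps fill every field, while ordinary video only fills the image.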

For a tabletop arm, collecting teleoperation data is already expensive. For a humanoid, it is much harder. The operator must control base motion, body height, torso posture, two arms, grippers, and sometimes contact forces. Motion capture requires a costly setup. VR plus joystick teleoperation is more flexible, but it still needs skilled operators and real hardware time.

WholeBodyVLA reframes the problem. Instead of requiring robot-aligned actions for every pretraining clip, it asks whether the model can learn useful motion intent from action-free egocentric video. If a person wearing a head camera walks toward a box, turns, squats, and reaches toward an object, the video has no robot motor commands, but it still contains important structure: approach direction, object affordances, body height change, visual flow, and manipulation preconditions.

The paper calls its answer unified latent learning. It trains a Latent Action Model (LAM) that compresses the visual change between consecutive frames into discrete latent tokens. These tokens are not motor commands. They are compact codes for how the scene changes: moving forward, turning, lowering the camera, approaching a target, reaching, or combining locomotion and manipulation cues.

The key design choice is to train two separate LAMs:

LAM	Data source	Why separate it?
Manipulation LAM	Manipulation data, including AgiBot World	Camera pose is mostly static; visual change is dominated by arms and objects
Locomotion LAM	Egocentric manipulation-aware locomotion videos	Camera moves continuously; visual change is dominated by body motion through the scene

This split is important. If one shared LAM is trained on mixed data, it can confuse hand motion with camera motion. An object moving in the image might mean the hand moved, or it might mean the whole body moved. For a humanoid, that ambiguity is dangerous. The robot may reach when it should step, or step when it should adjust the arm.

Architecture overview

A simplified WholeBodyVLA pipeline looks like this:

              Pretraining from action-free videos
      +------------------------------------------------+
      |  Human egocentric locomotion videos            |
      |  Robot manipulation videos                     |
      +-----------------------+------------------------+
                              |
                              v
        +---------------------+---------------------+
        |  Locomotion LAM     |  Manipulation LAM  |
        |  VQ-VAE codebook    |  VQ-VAE codebook   |
        +----------+----------+----------+----------+
                   |                     |
                   +----------+----------+
                              v
        +--------------------------------------------+
        | VLA policy                                  |
        | input: image + language                     |
        | output: latent locomotion + latent manip    |
        +--------------------+-----------------------+
                             |
                             v
        +--------------------------------------------+
        | Lightweight execution decoder               |
        | output 1: upper-body joint targets          |
        | output 2: discrete locomotion command       |
        +--------------------+-----------------------+
                             |
                             v
        +--------------------------------------------+
        | LMO RL low-level controller, about 50 Hz    |
        | balance, stepping, turning, squatting       |
        +--------------------------------------------+

The LAM follows the spirit of VQ-VAE. The encoder receives consecutive frames, produces a continuous latent vector, and quantizes it to the nearest entry in a learned codebook. The decoder reconstructs the future frame from the current frame and the quantized latent action. The paper mentions DINOv2 visual features and a spatio-temporal transformer for the LAM encoder.

A simplified LAM implementation would look like this:

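import torch.nn as nn
import torch.nn.functional as F


class LatentActionModel(nn.Module):
    """Encodes the change between two frames as a discrete latent action token."""

    def __init__(self, vision_backbone, st_transformer, codebook, decoder):
        super().__init__()
        self.vision = vision_backbone    # e.g. DINOv2 visual features
        self.temporal = st_transformer   # spatio-temporal transformer encoder
        self.codebook = codebook         # VQ codebook exposing quantize()
        self.decoder = decoder           # reconstructs the next frame

    def forward(self, frame_t, frame_tp1):
        feat_t = self.vision(frame_t)
        feat_tp1 = self.vision(frame_tp1)
        z_cont = self.temporal(feat_t, feat_tp1)                  # continuous latent
        z_q, token_id, vq_loss = self.codebook.quantize(z_cont)   # nearest codebook entry
        pred_tp1 = self.decoder(frame_t, z_q)                     # predicted next frame
        recon_loss = F.mse_loss(pred_tp1, frame_tp1)
        return token_id, recon_loss + vq_loss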
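
The `codebook.quantize` call above is left abstract. A minimal sketch of the nearest-entry lookup it stands for is shown below in plain Python, so it runs without a deep learning framework. The function name, the toy 2-D codebook, and its values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def quantize(z: List[float], codebook: List[List[float]]) -> Tuple[List[float], int]:
    """Return the codebook entry nearest to z and its index (the discrete token)."""
    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    token_id = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return codebook[token_id], token_id


# Toy 2-D codebook with three entries (values chosen only for illustration).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_q, token_id = quantize([0.9, 0.2], codebook)
print(token_id, z_q)
```

A trainable version would also need the usual VQ-VAE machinery, such as a straight-through gradient through the lookup and a commitment loss, both omitted here.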

After the LAMs are trained, the VLA policy learns to predict both locomotion and manipulation tokens from egocentric image observations and task language. This gives the policy a unified action representation that includes not only "grasp the object", but also "move closer", "turn before grasping", "lower the body", and "stabilize for contact".

LMO RL: why low-level control cannot be a simple velocity tracker

Many locomotion RL systems use continuous velocity commands: vx, vy, and yaw_rate. The low-level policy learns to track those values. That is reasonable for general walking, but it is not precise enough for loco-manipulation. When a humanoid needs to place a box into a container, a few centimeters of stopping error or a few degrees of yaw drift can break the grasp or placement. When the robot squats while lifting a load, arm motion creates structured disturbances that a generic velocity tracker may not handle well.
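
To get a feel for why a few degrees matter, the sketch below converts a heading error into a sideways miss at a target a given distance ahead. The 0.6 m distance and 5 degree error are assumed example values, not measurements from the paper.

```python
import math


def lateral_offset(distance_m: float, yaw_error_deg: float) -> float:
    """Sideways miss at a target `distance_m` ahead when heading is off by `yaw_error_deg`."""
    return distance_m * math.tan(math.radians(yaw_error_deg))


# Assumed example: a container 0.6 m in front of the robot, 5 degrees of yaw drift.
offset = lateral_offset(0.6, 5.0)
print(f"{offset * 100:.1f} cm")  # about 5 cm
```

A miss of roughly 5 cm is small for walking but can be enough to spoil a placement into a tight container.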

WholeBodyVLA introduces a Loco-Manipulation-Oriented RL policy (LMO). Instead of asking the high-level policy to output arbitrary continuous velocities, it uses a discrete command interface:

forward/backward flag
lateral flag
yaw turn flag
stance height command

This interface is closer to the task semantics: start moving, stop, sidestep, turn, squat. The LMO policy converts those flags into smooth, stable lower-body behavior. The policy observation is mostly proprioceptive: base angular velocity, gravity vector, joint positions, joint velocities, and the previous action. This is a good engineering choice because balance should not depend on slow visual perception or a heavy language model.
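
One way to picture this interface is as a small validated record. The sketch below is an assumption-heavy illustration: the class name, the -1/0/+1 encoding, the sign conventions, and the use of a continuous height offset in meters are all hypothetical, since the text only lists the four command fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocoCommand:
    """Discrete locomotion command (hypothetical encoding of the four fields)."""
    forward: int = 0      # -1 backward, 0 stop, +1 forward
    lateral: int = 0      # -1 right, 0 none, +1 left (assumed sign convention)
    yaw: int = 0          # -1 turn right, 0 none, +1 turn left (assumed)
    height: float = 0.0   # stance height offset in meters, negative to squat

    def __post_init__(self):
        # Reject anything that is not one of the three allowed flag values.
        for name in ("forward", "lateral", "yaw"):
            if getattr(self, name) not in (-1, 0, 1):
                raise ValueError(f"{name} must be -1, 0, or +1")

    def is_standstill(self) -> bool:
        """True when no stepping or turning is requested (height may still change)."""
        return self.forward == 0 and self.lateral == 0 and self.yaw == 0


STOP = LocoCommand()
SIDESTEP_LEFT_SQUAT = LocoCommand(lateral=1, height=-0.15)
```

Because the flags are discrete, "stop" is an exact command rather than a velocity that merely approaches zero, which is the property the precision argument above relies on.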

The LMO training pipeline has two stages:

1. Basic gait acquisition: learn a minimal gait that responds to the discrete commands and avoids falling. The upper body tracks simple pose targets so the legs experience changing upper-body disturbances.
2. Precision and stability: standardize cruising speeds, penalize terminal yaw drift, replay structured arm-motion perturbations, and discourage unnecessary leg motion during stand-still episodes.
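
Two of the second-stage ideas, penalizing terminal yaw drift and discouraging leg motion while standing still, can be sketched as a simple penalty term. The function name, the quadratic form, and the weights below are assumptions for illustration. The paper's actual reward terms are not reproduced here.

```python
import math
from typing import Sequence


def stage2_penalty(final_yaw: float, target_yaw: float,
                   leg_joint_vel: Sequence[float], standstill: bool,
                   w_yaw: float = 1.0, w_still: float = 0.1) -> float:
    """Penalty (to subtract from the reward) for yaw drift and idle leg motion."""
    # Wrap the yaw error into [-pi, pi] so a full turn counts as zero error.
    diff = final_yaw - target_yaw
    err = math.atan2(math.sin(diff), math.cos(diff))
    penalty = w_yaw * err ** 2
    if standstill:
        # Only punish leg joint velocity when the command says "do not move".
        penalty += w_still * sum(v * v for v in leg_joint_vel)
    return penalty


drift_only = stage2_penalty(0.1, 0.0, [0.0, 0.0], standstill=False)
fidgeting = stage2_penalty(0.0, 0.0, [1.0, 1.0], standstill=True)
```

In a real training setup these terms would sit alongside tracking and survival rewards, and the weights would need tuning in simulation.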

For beginners, this is one of the most important lessons in the paper. A powerful VLA is not enough. If the low-level controller stops in the wrong pose, rotates too far, or loses balance under manipulation forces, the whole system fails.

Installation and reproduction reality

The official WholeBodyVLA repository does not currently provide the full runnable training and deployment code. You can clone it to read the project README, citation, and research resource list, but you should not expect to reproduce the paper from that repo alone.

git clone https://github.com/OpenDriveLab/WholebodyVLA.git
cd WholebodyVLA
# At the time of writing, this is mainly a project README and resource list.

If your goal is to learn with open WBC code today, NVlabs/GR00T-WholeBodyControl is the more practical starting point. It includes gear_sonic, gear_sonic_deploy, motionbricks, Hugging Face download scripts, Isaac Lab training docs, MuJoCo simulation support, VR teleoperation, and C++ deployment components.

The basic setup flow is:

git clone https://github.com/NVlabs/GR00T-WholeBodyControl.git
cd GR00T-WholeBodyControl
git lfs pull
python check_environment.py

# Training requires Isaac Lab to be installed separately.
pip install -e "gear_sonic/[training]"
python download_from_hf.py --training

To reproduce the WholeBodyVLA idea as a research scaffold, split the work into modules:

wbc-vla-lab/
  data/
    ego_locomotion_videos/
    robot_manipulation_episodes/
    teleop_x2_or_g1/
  lam/
    train_locomotion_lam.py
    train_manipulation_lam.py
  vla/
    train_latent_vla.py
    finetune_decoder.py
  wbc/
    train_lmo_rl.py
    export_policy.py
  deploy/
    camera_node.py
    vla_inference.py
    low_level_bridge.py

Minimum checklist:

Component	Practical beginner step
Video data	Record head-mounted clips: forward, backward, sidestep, turn, squat near objects
Manipulation data	Start with an open dataset or teleoperate a small arm first
LAM	Train frame-to-frame reconstruction and inspect whether tokens cluster by motion type
VLA	Use a smaller backbone first and predict latent tokens instead of raw joints
Low-level control	Start in Isaac Lab or MuJoCo, not on a real humanoid
Deployment	Separate high-level 5-10 Hz inference from low-level 50 Hz or faster control
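
The LAM row in the checklist suggests inspecting whether tokens cluster by motion type. A minimal way to do that, assuming you have hand-labeled each clip with a motion name, is to count token usage per label. The labels and token ids below are made-up stand-ins for real LAM outputs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def token_usage(pairs: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count how often each latent token id appears under each motion label."""
    usage: Dict[str, Counter] = defaultdict(Counter)
    for label, token_id in pairs:
        usage[label][token_id] += 1
    return usage


def dominant_token(usage: Dict[str, Counter]) -> Dict[str, int]:
    """Most frequent token per label; distinct winners hint at useful clustering."""
    return {label: counts.most_common(1)[0][0] for label, counts in usage.items()}


# Made-up (label, token_id) pairs standing in for LAM outputs on labeled clips.
pairs = [("forward", 3), ("forward", 3), ("forward", 7),
         ("turn", 5), ("turn", 5), ("squat", 7), ("squat", 7)]
winners = dominant_token(token_usage(pairs))
print(winners)
```

If every label ends up with the same dominant token, the codebook is probably encoding something other than motion, such as lighting or background.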

Training pipeline in detail

WholeBodyVLA uses a three-stage recipe:

Stage I:  LAM pretraining
          video_t, video_t+1 -> discrete latent token

Stage II: VLA latent pretraining
          image + language -> locomotion token + manipulation token

Stage III: Real-robot finetuning
          latent tokens + robot state -> upper-body joints + locomotion command

In Stage I, the goal is not photorealistic video generation. The goal is to build a codebook whose tokens represent control-relevant primitives. If the codebook mostly captures lighting changes, camera noise, or background identity, downstream control will be weak. Data collection therefore needs intent: the camera wearer should walk toward manipulable objects such as tables, boxes, carts, handles, and containers, and should turn, squat, and sidestep near them.

In Stage II, the VLA is trained with cross-entropy on the two LAM token streams. The language instruction supplies task semantics. The same scene can require different latent sequences depending on whether the instruction is "load the box onto the cart" or "push the cart forward".
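
The Stage II objective can be sketched as the sum of two cross-entropy terms, one per token stream. The version below is written in plain Python for a single prediction step. The function names and the equal default weights are assumptions, and a real implementation would use a framework's batched loss instead.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def latent_vla_loss(loco_logits: Sequence[float], loco_target: int,
                    manip_logits: Sequence[float], manip_target: int,
                    w_loco: float = 1.0, w_manip: float = 1.0) -> float:
    """Weighted sum of the cross-entropy terms for the two token streams."""
    return (w_loco * cross_entropy(loco_logits, loco_target)
            + w_manip * cross_entropy(manip_logits, manip_target))


# Uniform logits over 2 locomotion codes and 4 manipulation codes (toy sizes).
loss = latent_vla_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 2)
```

With uniform logits the loss equals ln 2 + ln 4, the value an untrained model would start near for these toy codebook sizes.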

In Stage III, real teleoperation data grounds the latent representation into robot-specific actions. The paper evaluates on AgiBot X2, which has dual 7-DoF arms, grippers, legs, a waist joint, and an egocentric Intel RealSense D435i camera. The teleoperation setup uses VR plus joystick, with 50 executions per task. This is still real robot data, but it is much less than what would be required to train the whole policy from scratch because the model already has priors from action-free video.

Inference: slow semantic policy, fast physical controller

At runtime, WholeBodyVLA uses two different control rates:

Egocentric image + instruction
        |
        v
VLA policy, about 10 Hz
        |
        +--> upper-body joint targets
        |
        +--> discrete locomotion command
                    |
                    v
            LMO RL controller, about 50 Hz
                    |
                    v
              lower-body torques/actions

This separation is practical. A VLA is heavy and does not need to run at motor-control frequency. Balance, however, must be handled at a higher rate. The VLA produces intent, while the LMO controller turns that intent into stable stepping, turning, stopping, and squatting.

A minimal inference loop would look like this:

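loco_cmd = STOP_COMMAND  # hypothetical safe default until the first VLA output arrives

while robot.is_active():
    # Slow path: refresh semantic intent at about 10 Hz.
    if timer.ready("vla", hz=10):
        image = camera.read()
        state = robot.read_state()
        latent_loco, latent_manip = vla.predict(image, instruction)
        arm_targets, loco_cmd = decoder(latent_loco, latent_manip, state)
        robot.send_arm_targets(arm_targets)

    # Fast path: the LMO policy runs every cycle with the latest command.
    proprio = robot.read_proprioception()
    lower_action = lmo_policy.step(proprio, loco_cmd)
    robot.send_lower_body_action(lower_action)
    timer.sleep_until_next("lmo", hz=50)  # hypothetical helper pacing the loop at about 50 Hz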
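
The `timer.ready` helper in the loop above is hypothetical. One possible sketch is a small class that remembers when each named event is next due. The class name, the injectable clock, and the tolerance value are assumptions made so the example can be checked with a simulated clock.

```python
import time
from typing import Callable, Dict


class RateTimer:
    """Tracks named periodic events; `ready` fires at most about `hz` times per second."""

    def __init__(self, clock: Callable[[], float] = time.monotonic, tol: float = 1e-9):
        self._clock = clock
        self._tol = tol  # small slack so float rounding does not skip a tick
        self._next_due: Dict[str, float] = {}

    def ready(self, name: str, hz: float) -> bool:
        now = self._clock()
        due = self._next_due.get(name, now)  # the first call fires immediately
        if now + self._tol < due:
            return False
        self._next_due[name] = now + 1.0 / hz
        return True


# Simulate one second of a 50 Hz control loop with a fake clock.
tick = 0
timer = RateTimer(clock=lambda: tick / 50.0)
vla_calls = 0
for tick in range(50):
    if timer.ready("vla", hz=10):
        vla_calls += 1
print(vla_calls)
```

In the simulated second the slow branch fires on every fifth fast cycle, which is the 10 Hz over 50 Hz split described in this section.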

On real hardware, this loop also needs a safety layer: joint limits, torque limits, fall detection, emergency stop, stale-command timeout, and synchronized logging. These are not optional details. In a humanoid, small timing bugs between camera frames, robot state, and motor commands can become falls.
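
Of the items in that safety layer, the stale-command timeout is the easiest to sketch. The version below falls back to a stand-still command when the high-level policy has not produced a fresh output recently. The class name, the tuple layout of the command, and the 0.3 s timeout are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Command = Tuple[int, int, int, float]   # (forward, lateral, yaw, height), hypothetical layout
SAFE_STOP: Command = (0, 0, 0, 0.0)     # hypothetical "stand still" command


@dataclass
class CommandWatchdog:
    """Falls back to a safe command when the high-level policy goes quiet."""
    timeout_s: float = 0.3
    _last_cmd: Command = SAFE_STOP
    _last_time: Optional[float] = None

    def update(self, cmd: Command, now: float) -> None:
        """Record a fresh command together with the time it arrived."""
        self._last_cmd, self._last_time = cmd, now

    def current(self, now: float) -> Command:
        """Return the latest command, or SAFE_STOP if it is missing or too old."""
        if self._last_time is None or now - self._last_time > self.timeout_s:
            return SAFE_STOP
        return self._last_cmd


dog = CommandWatchdog(timeout_s=0.3)
dog.update((1, 0, 0, 0.0), now=0.0)
fresh = dog.current(now=0.1)   # within the timeout: keep walking forward
stale = dog.current(now=1.0)   # the VLA stalled: stop
```

The low-level loop would call `current` every cycle, so a hung VLA process leads to a controlled stop instead of the robot walking on an old command.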

Results: the gains appear in the hard transitions

The paper evaluates three real-world task suites on AgiBot X2:

Task	Main subgoals	Skills tested
Bag Packing	grasp bag, sidestep, squat, place	bimanual grasp, lateral stepping, squat precision
Box Loading	squat, grasp box, rise, turn, place	balance while lifting, turning, and placing
Cart Pushing	grab handle, push a cart over 50 kg	sustained locomotion, heading control, interaction force

In the main results table, WholeBodyVLA reaches an average score of 78.0%. The modular design baseline reaches 64.0%, OpenVLA-OFT with LMO reaches 56.7%, and GR00T with LMO reaches 42.0%. The ablations are also instructive: removing LAM drops to 39.3%; replacing LMO with a velocity-based RL controller reaches 54.0%; using a manipulation-only LAM reaches 63.3%; using one shared LAM reaches 66.0%. The paper also reports that WholeBodyVLA outperforms prior baselines by 21.3% and 24.0% in its comparison settings.
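
The gaps are easier to compare when computed side by side. The short script below uses only the average scores quoted above and prints each baseline's or ablation's distance from the full system. The dictionary keys are shortened labels chosen for this sketch.

```python
# Average scores as quoted in the text above (percent).
scores = {
    "WholeBodyVLA": 78.0,
    "Modular design": 64.0,
    "OpenVLA-OFT + LMO": 56.7,
    "GR00T + LMO": 42.0,
    "Without LAM": 39.3,
    "Velocity-based RL": 54.0,
    "Manipulation-only LAM": 63.3,
    "Shared LAM": 66.0,
}

full = scores["WholeBodyVLA"]
# Absolute gap to the full system, in percentage points.
gaps = {name: round(full - s, 1) for name, s in scores.items() if name != "WholeBodyVLA"}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} -{gap:.1f} points")
```

Two of these gaps, 21.3 and 24.0, equal the figures quoted above, which suggests they are absolute percentage-point differences. Check the paper for the exact comparison settings behind each number.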

The key lesson is where the improvement appears. The hardest parts are not isolated grasps while standing still. They are transitions: move and squat, rise and turn, push while staying aligned. These are exactly the phases where modular pipelines often accumulate error. The robot can grasp if it happens to stand in the right place, but the hard problem is creating the right body pose for the next manipulation step.

Comparison with other new systems

GR00T-WholeBodyControl and SONIC are best viewed as an open WBC foundation stack. NVIDIA's repo provides training code, deployment components, teleoperation, simulation, and documentation. If you want hands-on engineering experience with whole-body humanoid control, it is a very practical source. It does not replace WholeBodyVLA directly, because WholeBodyVLA is specifically about unified latent VLA for semantically guided loco-manipulation.

PhysiFlow, posted to arXiv in March 2026, follows a physics-aware multi-brain VLA direction. It combines semantic-motion intent, latent flow matching, and robust tracking to connect VLA reasoning with full-body execution. Its existence is a useful signal: the field is moving beyond "transformer predicts actions" toward architectures that explicitly respect balance, physics, tracking, and contact.

Neon VLA on PyPI, with version 0.1.5 released on 2026-03-31, also shows growing interest in open-source humanoid VLA, especially around Unitree G1. For beginners, the important distinction is this: an alpha package, an open WBC controller repo, and a paper with real robot results are not the same thing. Read them together, but do not treat them as one finished humanoid foundation model.

Where beginners should start

If you are new to this area, do not start on a real humanoid. A safer path is:

1. Review VLA basics: observations, language embeddings, action heads, and imitation learning.
2. Learn RL locomotion foundations: rewards, domain randomization, proprioception, and sim-to-real.
3. Read the WholeBodyVLA paper carefully to understand unified latent learning and LMO RL.
4. Try a simulated training stack before thinking about real humanoid deployment.

A good beginner project is to train a toy LAM on simple phone videos. Record clips of walking toward a table, turning left, turning right, squatting, and reaching. Train a small model to reconstruct the next frame through a discrete codebook. If the tokens start separating "camera moving forward" from "hand moving", you have understood the core of unified latent learning.
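
The first step of that toy project is turning a clip into training pairs. A minimal sketch, assuming the frames are already decoded into a list, is shown below. The function name and the `stride` parameter are illustrative choices.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def frame_pairs(frames: Sequence[Frame], stride: int = 1) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the one `stride` steps later for next-frame prediction."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [(frames[i], frames[i + stride]) for i in range(len(frames) - stride)]


# Stand-in "frames": in practice these would be decoded images from a phone clip.
clip = ["f0", "f1", "f2", "f3", "f4"]
pairs = frame_pairs(clip, stride=2)
print(pairs)
```

A larger stride makes the change between the two frames more visible, which can help a small model when consecutive frames of a phone video look nearly identical.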

Conclusion

WholeBodyVLA is important because it shifts the question from "how does a VLA control a robot arm?" to "how does a VLA control the whole body so manipulation becomes possible?" For humanoids, that shift is mandatory. A useful humanoid cannot simply stand still and reach. It must walk, turn, squat, stabilize, interact with heavy objects, and still follow language instructions.

The technical message is clear: VLA needs latent actions learned from cheaper data, WBC needs a manipulation-aware command interface, and runtime inference must separate high-level semantic reasoning from low-level physical stability. Systems such as WholeBodyVLA, SONIC/GR00T-WholeBodyControl, and PhysiFlow suggest that the near future of humanoid control will not be a planner calling isolated skills. It will be layered policies where language, vision, latent motion, and physical control are designed together.

Nguyễn Anh Tuấn
