HEX: Cross-Embodiment VLA for Full-Size Humanoid Robots

A complete guide to HEX — the first whole-body VLA for full-sized humanoid robots, supporting 7 embodiments, open-source, built on Qwen3-VL + MoE UPP + DiT flow-matching.

Nguyễn Anh Tuấn | June 10, 2026 | 9 min read
Most VLA frameworks today — including well-known ones like π₀ and GR00T — share a common weakness: they control each body part independently rather than modeling the coordinated whole-body motion a real human uses. The result is a robot that can move its arms skillfully but stumbles when it needs to walk, reach, and maintain balance simultaneously.

HEX (Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation) is the first VLA framework designed from the ground up to solve exactly this. It achieves a 79.8% average success rate across 7 real-world tasks, ahead of both π₀.₅ (71.8%) and GR00T N1.5 (70.2%).

The Problem HEX Solves

Imagine teaching a humanoid to carry a box from a conveyor belt to a shelf. The task requires:

  • Both hands gripping and holding the box
  • Torso adjusting for balance
  • Legs stepping forward simultaneously
  • Eyes tracking the shelf location

If the VLA model predicts actions for each joint independently, nothing ties the arms, torso, and legs together, and the robot struggles to learn this coordination. The model needs a common language to describe the whole body, whether the robot is a Unitree G1, a Tienkung 2.0, or any other embodiment.

This is why HEX was built around two core innovations:

  1. Canonical body-part state representation — encoding body state into standardized "slots" instead of raw joint indices
  2. Mixture-of-Experts Unified Proprioceptive Predictor (UPP) — learning whole-body coordination from data across 7 different embodiments

HEX Architecture

┌─────────────────────────────────────────────────────────────┐
│                        HEX Pipeline                         │
│                                                             │
│  Camera frames ──► VLM (Qwen3-VL-2B)                       │
│  + Text command      │  Temporal context cache              │
│                      ▼                                      │
│              Visual-Language Features                       │
│                      │                                      │
│  Robot joints ──► UPP (MoE Transformer)                    │
│  (canonical slots)   │  16 routed experts + 2 shared       │
│                      ▼                                      │
│              Proprioceptive Features                        │
│                      │                                      │
│              ┌───────┴───────┐                              │
│              ▼               ▼                              │
│       Action Expert (DiT-B, 16 layers)                      │
│         dual cross-attention fusion                         │
│              │                                              │
│              ▼                                              │
│         Action (flow-matching)                              │
└─────────────────────────────────────────────────────────────┘

1. Visual-Language Backbone: Qwen3-VL-2B

HEX uses Qwen3-VL-2B-Instruct as its vision-language backbone. The key addition is a lightweight history query feature cache — rather than feeding the full video stream into the model, HEX compresses temporal history into a compact context vector. This gives the model a sense of "what happened recently" without the memory cost of full video encoding.

2. Canonical State Representation

This is arguably HEX's most important contribution. Different robots have different joint structures (Unitree G1 has 23 to 43 DOF depending on configuration, Tienkung 3.0 has more), so raw joint indices do not line up across robots and cannot be used directly for cross-embodiment learning.

HEX defines 7 standardized body-part slots:

Slot       | Description             | Example joints
left_arm   | Left arm                | shoulder, elbow, wrist
right_arm  | Right arm               | shoulder, elbow, wrist
left_hand  | Left hand (dexterous)   | finger joints
right_hand | Right hand (dexterous)  | finger joints
legs       | Both legs               | hip, knee, ankle
head       | Head + neck             | pan, tilt
waist      | Torso                   | rotation joints

If a robot lacks a body part (e.g., a wheeled robot without legs), HEX fills those slots with learned missing-part tokens — the model handles it gracefully without any special-case logic.
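The slot idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the slot names come from the table above, while `to_canonical`, the `present` flags, and the toy joint names are hypothetical and not taken from the HEX codebase. In the real model a learned missing-part token would stand in wherever `present` is false.

```python
# Slot names follow the table above; everything else here is an assumption.
SLOTS = ["left_arm", "right_arm", "left_hand", "right_hand", "legs", "head", "waist"]

def to_canonical(joint_positions, joint_mapping):
    """Group a {joint_name: value} dict into the 7 canonical slots.

    Returns (slots, present): slots[name] is a list of joint values (empty if
    the robot lacks that part) and present[name] says whether real joint data
    exists or a missing-part token would be needed instead.
    """
    slots, present = {}, {}
    for name in SLOTS:
        joints = joint_mapping.get(name, [])
        slots[name] = [joint_positions[j] for j in joints]
        present[name] = len(joints) > 0
    return slots, present

# Toy robot: one joint per arm plus a waist joint; no hands, legs, or head.
mapping = {"left_arm": ["l_elbow"], "right_arm": ["r_elbow"], "waist": ["torso_joint"]}
readings = {"l_elbow": 0.10, "r_elbow": -0.25, "torso_joint": 0.0}

slots, present = to_canonical(readings, mapping)
missing = [name for name in SLOTS if not present[name]]
print(missing)  # ['left_hand', 'right_hand', 'legs', 'head']
```

Because every robot produces the same seven keys, downstream layers never need to know how many joints a particular embodiment has.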

3. UPP — Unified Proprioceptive Predictor

UPP is the core of HEX. It is a 4-layer transformer (hidden size 768) with a Mixture-of-Experts architecture:

Input: canonical body-part embeddings
       ↓
MoE Layer × 4:
  - 16 routed experts (embodiment-specific patterns)
  - 2 shared experts (cross-embodiment common dynamics)
  - Router selects top-K experts per token
       ↓
Output: temporal + whole-body coordination features

The idea: 16 "specialist" experts learn patterns specific to each robot's morphology, while 2 shared experts learn universal principles of balance and whole-body coordination. When deploying to a new robot, the router combines the right mix of experts.
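To make the routing concrete, here is a toy, dependency-free sketch of one MoE layer applied to a single token. It assumes `top_k=2`, softmax routing with renormalisation over the selected experts, and shared experts that are simply added on top. The article only states "top-K", so all of these are illustrative choices rather than HEX's actual configuration, and the lambda "experts" are placeholders for small feed-forward networks.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_logits, routed_experts, shared_experts, top_k=2):
    """Combine the top-k routed experts with always-on shared experts."""
    probs = softmax(router_logits)
    # Indices of the k routed experts with the highest router probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        weight = probs[i] / norm  # renormalise over the selected experts
        out = [o + weight * v for o, v in zip(out, routed_experts[i](token))]
    for expert in shared_experts:  # shared experts bypass the router
        out = [o + v for o, v in zip(out, expert(token))]
    return out, top

# Toy experts: routed expert i scales the token by (i + 1); shared ones by 0.5.
routed = [lambda x, s=i + 1: [s * v for v in x] for i in range(16)]
shared = [lambda x: [0.5 * v for v in x] for _ in range(2)]

logits = [0.0] * 16
logits[3], logits[7] = 2.0, 2.0  # the router prefers experts 3 and 7 equally
output, chosen = moe_forward([1.0, -1.0], logits, routed, shared)
print(chosen, output)
```

With equal weights on experts 3 and 7 (scales 4 and 8), the routed part contributes a scale of 6 and the two shared experts add 1 more, so the token `[1.0, -1.0]` comes out scaled by 7.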

4. Action Expert: DiT-B with Dual Cross-Attention

HEX's action head is a 16-layer DiT-B (Diffusion Transformer Base, hidden size 1024) with a dual cross-attention architecture:

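# Simplified dual cross-attention in HEX Action Expert
import torch.nn as nn

class DualCrossAttention(nn.Module):
    # Layer choices in __init__ are illustrative assumptions, not HEX's actual configuration
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn_vl = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn_prop = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, action_tokens, vl_features, prop_features):
        # Branch 1: attend to visual-language context (query = action tokens)
        x_vl, _ = self.cross_attn_vl(action_tokens, vl_features, vl_features)
        # Branch 2: attend to proprioceptive context
        x_prop, _ = self.cross_attn_prop(action_tokens, prop_features, prop_features)
        # Adaptive fusion gate: one weight in (0, 1) per action token
        alpha = self.gate(action_tokens)
        return alpha * x_vl + (1 - alpha) * x_prop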

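Training uses flow matching instead of a standard denoising-diffusion objective: the action expert learns a velocity field that carries noise to an action chunk. This is typically faster and more stable to train, and sampling needs only a few integration steps, which suits tasks requiring fast reaction times.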
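A minimal sketch of the flow-matching idea, assuming the common linear (rectified-flow style) path with noise at t=0 and the action at t=1. The time convention, the function names, and the `oracle` velocity field are assumptions for illustration; HEX's exact formulation may differ. The oracle stands in for a trained network in the special case where every sample should end at one fixed action.

```python
import random

def flow_matching_pair(noise, action, t):
    """Training pair for a straight path from noise (t=0) to action (t=1).

    Returns the interpolated point x_t and the velocity the network would be
    trained to predict there (constant along a straight path).
    """
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    v_target = [a - n for n, a in zip(noise, action)]
    return x_t, v_target

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(noise), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for a trained network: the exact velocity field when every
# sample should arrive at a single fixed target action.
target = [0.3, -0.7, 1.2]
oracle = lambda x, t: [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

random.seed(0)
start = [random.gauss(0, 1) for _ in target]
result = euler_sample(oracle, start, num_steps=10)
print([round(r, 6) for r in result])
```

With this velocity field the ten Euler steps land on `target` up to floating-point error, which illustrates why a handful of integration steps can be enough at inference time.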


Dataset and Training Data

HEX is pretrained on over 12 million frames from 4 data sources:

Source                | Type       | Embodiments      | Notes
HEX in-house dataset  | Real-world | Tienkung, Tienyi | Diverse manipulation
Humanoid Everyday     | Real-world | Multiple         | Daily household tasks
AgiBot World Colosseo | Real-world | AgiBot           | Wheeled humanoid
RoboCOIN              | Real-world | Leju, G1, H1     | Multi-embodiment

7 embodiments in total: Tienkung 2.0, Tienkung 3.0, Tienyi, Unitree G1, Unitree H1, AgiBot, Leju Kuavo.


Installation

System Requirements

  • Ubuntu 20.04/22.04 + CUDA 11.8+
  • GPU: at least 1× A100 40GB for inference, 2-4× A100 for fine-tuning
  • Python 3.10
  • RAM: 32GB+

Step 1: Clone and set up environment

git clone https://github.com/Open-X-Humanoid/HEX.git
cd HEX

conda create -n hex python=3.10 -y
conda activate hex

# System dependencies
sudo apt update && sudo apt install -y libegl1-mesa-dev libglu1-mesa

# Python dependencies
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Step 2: Download model weights

# Download HEX pretrained model (~2.4B params)
python hex/utils/download_model_hex.py

# Download base VLM (Qwen3-VL-2B)
python hex/utils/download_model_qwen.py

Both are hosted on Hugging Face. For slow connections, use hf_transfer:

pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 python hex/utils/download_model_hex.py

Step 3: Verify the install

# Run quick test in LIBERO simulation
bash scripts/libero/eval_libero.sh

# Or run the inference notebook
jupyter notebook notebooks/eval_model.ipynb
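Before launching the evaluation script, a quick environment check can save time. The sketch below uses only the standard library; the module names `torch`, `flash_attn`, and `hex` are inferred from the install commands and import statements in this guide, and `check_environment` is a hypothetical helper, not part of the HEX repository.

```python
import importlib.util
import sys

def check_environment(modules=("torch", "flash_attn", "hex")):
    """Report whether Python is 3.10 and which expected modules are importable."""
    report = {"python_3_10": sys.version_info[:2] == (3, 10)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report

if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key:12s} {'OK' if ok else 'CHECK'}")
```

Any line marked `CHECK` points to a step in the installation section worth repeating before running the LIBERO evaluation.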

Fine-tuning HEX on Your Robot

If you have tele-op data from a real robot, fine-tuning takes two steps.

Step 1: Prepare your data

HEX uses LeRobot v2.1 format. If you already have a LeRobot dataset, you just need to map your joint names to canonical slots:

# configs/embodiment/unitree_g1.yaml
embodiment: unitree_g1
joint_mapping:
  left_arm: [left_shoulder_pitch, left_shoulder_roll, left_shoulder_yaw,
             left_elbow, left_wrist_roll, left_wrist_pitch, left_wrist_yaw]
  right_arm: [right_shoulder_pitch, right_shoulder_roll, right_shoulder_yaw,
              right_elbow, right_wrist_roll, right_wrist_pitch, right_wrist_yaw]
  legs: [left_hip_pitch, left_hip_roll, left_hip_yaw,
         left_knee, left_ankle_pitch, left_ankle_roll,
         right_hip_pitch, right_hip_roll, right_hip_yaw,
         right_knee, right_ankle_pitch, right_ankle_roll]
  waist: [torso_joint]
  # G1 has no dexterous hands → missing-part tokens fill automatically
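A mapping file like this is easy to get subtly wrong (a misspelled joint, or a joint listed twice), so it may be worth validating it against the joint names in your dataset before training. The following is a hypothetical helper, not part of HEX; it assumes the mapping has already been loaded into a plain dict and that you can list the joint names in your LeRobot dataset.

```python
# Slot names come from the canonical-slot table; the checks are illustrative.
CANONICAL_SLOTS = ("left_arm", "right_arm", "left_hand", "right_hand",
                   "legs", "head", "waist")

def validate_joint_mapping(joint_mapping, dataset_joints):
    """Compare a slot -> joint-name mapping with the joint names in a dataset."""
    mapped = [j for joints in joint_mapping.values() for j in joints]
    return {
        "unknown_slots": sorted(set(joint_mapping) - set(CANONICAL_SLOTS)),
        "duplicate_joints": sorted({j for j in mapped if mapped.count(j) > 1}),
        "not_in_dataset": sorted(set(mapped) - set(dataset_joints)),
        "unmapped_joints": sorted(set(dataset_joints) - set(mapped)),
        "missing_slots": [s for s in CANONICAL_SLOTS if s not in joint_mapping],
    }

mapping = {
    "left_arm": ["left_elbow", "left_wrist_roll"],
    "right_arm": ["right_elbow", "right_wrist_rol"],  # deliberate typo
    "waist": ["torso_joint"],
}
dataset = ["left_elbow", "left_wrist_roll", "right_elbow",
           "right_wrist_roll", "torso_joint"]

report = validate_joint_mapping(mapping, dataset)
print(report["not_in_dataset"])   # ['right_wrist_rol']
print(report["unmapped_joints"])  # ['right_wrist_roll']
```

Slots listed under `missing_slots` are the ones that would fall back to missing-part tokens, so that list doubles as a sanity check on which body parts you intended to leave out.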

Step 2: Run fine-tuning

# Fine-tune on your embodiment (2-4 A100s, ~6-12 hours)
bash scripts/fine_tune_hex.sh \
  --embodiment unitree_g1 \
  --data_path /path/to/your/lerobot_dataset \
  --output_dir checkpoints/hex_g1_custom \
  --num_epochs 50 \
  --batch_size 8

# Full pretraining from scratch (~1000 A100 GPU-hours)
bash scripts/pretrain_hex.sh

Compute note: Full pretraining requires ~1000 A100 GPU-hours (200k steps, batch 16). With a limited budget, just fine-tune from the pretrained checkpoint — typically 6-12 hours on 2-4 A100s is enough for convergence.
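The budget comparison is simple arithmetic, worked through below. The figures are the ones quoted in the compute note; pairing 2 GPUs with 6 hours and 4 GPUs with 12 hours to get a low and high bound is an assumption, as is reading "batch 16" as the global batch size.

```python
PRETRAIN_GPU_HOURS = 1000            # from the compute note
PRETRAIN_STEPS, BATCH_SIZE = 200_000, 16

def gpu_hours(num_gpus, wall_hours):
    """Total GPU-hours for a job running on num_gpus for wall_hours."""
    return num_gpus * wall_hours

finetune_low = gpu_hours(2, 6)       # cheapest quoted setup
finetune_high = gpu_hours(4, 12)     # most expensive quoted setup
samples_seen = PRETRAIN_STEPS * BATCH_SIZE  # only if 16 is the global batch

print(f"fine-tuning: {finetune_low}-{finetune_high} GPU-hours "
      f"({finetune_low / PRETRAIN_GPU_HOURS:.1%} to "
      f"{finetune_high / PRETRAIN_GPU_HOURS:.1%} of pretraining)")
print(f"pretraining samples: {samples_seen:,}")
```

Under these assumptions fine-tuning costs 12 to 48 GPU-hours, a few percent of the pretraining budget.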


Running Inference on a Real Robot

After fine-tuning, the inference loop looks like this:

import torch

from hex import HEXPolicy
from hex.utils import load_embodiment_config

# Load the fine-tuned policy and tell it which embodiment it is driving
policy = HEXPolicy.from_pretrained("checkpoints/hex_g1_custom")
config = load_embodiment_config("unitree_g1")
policy.set_embodiment(config)
policy.eval().cuda()

# `robot` and `camera_frame` are placeholders for your own hardware interface.
# One inference step; repeat it for closed-loop control.
obs = {
    "image": camera_frame,          # (H, W, 3) numpy array
    "language": "pick up the bottle and place it on the shelf",
    "joint_positions": robot.get_joint_positions(),  # canonical slots
    "joint_velocities": robot.get_joint_velocities(),
}

with torch.no_grad():
    actions = policy.predict(obs, num_steps=10)  # predict 10-step chunk

# Execute the predicted chunk one action at a time
for action in actions:
    robot.set_joint_targets(action)
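In practice the predict-then-execute step is wrapped in a loop that re-plans before the chunk runs out. The sketch below shows one common pattern (execute the first few actions of each chunk at a fixed rate, then predict again) using stub classes so it runs without hardware. `StubPolicy`, `StubRobot`, `run_episode`, and the parameters `execute_steps` and `control_hz` are all hypothetical; only the `predict(obs, num_steps=...)` call shape is taken from the snippet above.

```python
import time

class StubPolicy:
    """Hypothetical stand-in that returns a chunk of all-zero actions."""
    def predict(self, obs, num_steps=10):
        return [[0.0] * len(obs["joint_positions"]) for _ in range(num_steps)]

class StubRobot:
    """Hypothetical stand-in that records the targets it receives."""
    def __init__(self, num_joints=4):
        self.num_joints, self.sent = num_joints, []
    def get_joint_positions(self):
        return [0.0] * self.num_joints
    def set_joint_targets(self, action):
        self.sent.append(action)

def run_episode(policy, robot, instruction, num_chunks=3,
                execute_steps=5, control_hz=50.0):
    """Predict a chunk, execute its first few steps at a fixed rate, re-plan."""
    period = 1.0 / control_hz
    for _ in range(num_chunks):
        obs = {"language": instruction,
               "joint_positions": robot.get_joint_positions()}
        chunk = policy.predict(obs, num_steps=10)
        for action in chunk[:execute_steps]:
            tick = time.monotonic()
            robot.set_joint_targets(action)
            # Sleep off whatever remains of this control period, if anything.
            time.sleep(max(0.0, period - (time.monotonic() - tick)))

robot = StubRobot()
run_episode(StubPolicy(), robot, "pick up the bottle", control_hz=1000.0)
print(len(robot.sent))  # 15
```

Executing only part of each chunk trades extra inference calls for fresher observations; how many steps to execute is a tuning choice that depends on policy latency and task dynamics.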

Benchmark Results

HEX was evaluated on 7 real-world tasks ranging from simple pick-and-place to long-horizon multi-stage manipulation.

Overall comparison (seen scenarios)

Model      | Avg. Success Rate | Params | Method
HEX        | 79.8%             | 2.4B   | MoE + DiT flow-matching
π₀.₅       | 71.8%             | ~3B    | Diffusion VLA
GR00T N1.5 | 70.2%             | ~1.5B  | DiT VLA
GR00T N1   | 52.4%             | ~1.5B  | DiT VLA

Long-horizon task: Box Conveyance (4 stages)

This is particularly demanding: the robot must complete 4 sequential stages (approach → grasp → transport → place). Final-stage success rate:

HEX:        53.3% ████████████████████████████░░░░░░░░░░
π₀.₅:       40.0% ████████████████████░░░░░░░░░░░░░░░░░░
GR00T N1.5: 20.0% ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░
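One way to read these numbers: if the four stages were independent and equally hard (a simplifying assumption that real tasks rarely satisfy), the final-stage rate would be the per-stage rate raised to the fourth power. Inverting that gives a rough sense of how reliable each stage must be.

```python
final_stage = {"HEX": 0.533, "pi0.5": 0.400, "GR00T N1.5": 0.200}
NUM_STAGES = 4

def per_stage_rate(final_rate, num_stages=NUM_STAGES):
    """Per-stage success that compounds to final_rate over equal, independent stages."""
    return final_rate ** (1.0 / num_stages)

for model, rate in final_stage.items():
    print(f"{model:11s} final {rate:.1%} -> about {per_stage_rate(rate):.1%} per stage")
```

Under this toy model even the best result implies a per-stage success rate in the mid-80s, which shows how quickly small per-stage failure rates compound on long-horizon tasks.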

Generalization to unseen task variants

Model      | Unseen Success Rate
HEX        | 61.8%
π₀.₅       | 44.3%
GR00T N1.5 | 41.0%

The largest gap appears on fast-reaction tasks and long-horizon tasks — exactly the scenarios UPP's temporal dynamics modeling was designed for.
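Putting the seen and unseen results side by side makes the generalization gap explicit. The numbers below are copied from the two comparison tables; the script only subtracts them.

```python
seen = {"HEX": 79.8, "pi0.5": 71.8, "GR00T N1.5": 70.2}
unseen = {"HEX": 61.8, "pi0.5": 44.3, "GR00T N1.5": 41.0}

# Percentage points lost when moving from seen to unseen variants.
drop = {m: round(seen[m] - unseen[m], 1) for m in seen}
# HEX's lead over each baseline, in percentage points.
lead_seen = {m: round(seen["HEX"] - seen[m], 1) for m in seen if m != "HEX"}
lead_unseen = {m: round(unseen["HEX"] - unseen[m], 1) for m in unseen if m != "HEX"}

print("drop seen -> unseen:", drop)
print("HEX lead (seen):    ", lead_seen)
print("HEX lead (unseen):  ", lead_unseen)
```

HEX loses 18.0 points on unseen variants while the baselines lose 27.5 and 29.2, so its lead roughly doubles (from 8.0 and 9.6 points to 17.5 and 20.8 points).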


The Review-and-Forecast Mechanism

One of HEX's more subtle innovations is the review-and-forecast paradigm:

Past frames → Visual History Summary (review)
                     ↓
             VLM processes context
                     ↓
Future state prediction ← UPP forecasts next body state (forecast)
                     ↓
             Action Expert generates actions
             conditioned on predicted future state

Instead of only reacting to the current observation, HEX also predicts what the robot's body state will be after executing the action. This auxiliary loss forces UPP to learn genuine temporal dynamics — not just a frame-to-action mapping.
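As a rough sketch of how such an auxiliary objective can be combined with the action loss: the total is the action loss plus a weighted forecast loss. The mean-squared-error form, the `forecast_weight` of 0.1, and all function names are assumptions for illustration; the article does not specify HEX's actual loss terms or weights.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def training_loss(pred_velocity, target_velocity,
                  pred_next_state, true_next_state, forecast_weight=0.1):
    """Action loss plus a weighted auxiliary loss on the forecast body state."""
    action_loss = mse(pred_velocity, target_velocity)
    forecast_loss = mse(pred_next_state, true_next_state)
    total = action_loss + forecast_weight * forecast_loss
    return total, action_loss, forecast_loss

total, act, aux = training_loss(
    pred_velocity=[0.5, 0.5], target_velocity=[0.0, 1.0],
    pred_next_state=[0.1, 0.1], true_next_state=[0.1, 0.5],
)
print(f"action={act:.3f} forecast={aux:.3f} total={total:.3f}")
```

The forecast term gives the proprioceptive branch a training signal even when the action loss alone could be satisfied by a purely reactive mapping.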


When to Use HEX

HEX is best suited when:

  • ✅ You have a full-size humanoid robot (bipedal or wheeled)
  • ✅ Tasks require simultaneous arms + legs + torso coordination
  • ✅ You want one pretrained model fine-tuned across multiple embodiments
  • ✅ Tasks are long-horizon (multiple sequential stages)

Less suitable when:

  • ❌ Your robot is arm-only (no whole-body needed)
  • ❌ You are resource-constrained: inference needs at least 1× A100 40GB
  • ❌ You need end-to-end control latency under 20 ms (iterative flow-matching sampling adds non-trivial latency)

For more context on the whole-body VLA ecosystem, see deploying WholebodyVLA on G1 and LeRobot + G1 + π₀Fast whole-body pipeline.


Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
