HEX: A Cross-Embodiment Whole-Body VLA for Humanoids

Most current VLA frameworks, including well-known ones such as π₀ and GR00T, share a weakness: they control each part of the robot's body independently rather than learning to coordinate the whole body the way a person does. The result is a robot that moves its arms well but stumbles when it has to walk, reach, and keep its balance at the same time.

HEX (Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation) is positioned as the first VLA framework designed from the outset for exactly this problem. It reports a 79.8% success rate across 7 real-world tasks, ahead of both π₀.₅ (71.8%) and GR00T N1.5 (70.2%).

The Problem HEX Solves

Imagine teaching a humanoid robot to carry a box from a conveyor belt to a shelf. The task requires:

Both hands to grip and hold the box
The torso to lean to keep the center of mass balanced
The legs to walk at the same time
The eyes to track the position of the shelf

If a VLA model only predicts actions for each joint independently, the robot will not learn this coordination. It needs a common language for describing the whole body, whether the robot is a Unitree G1, a Tienkung 2.0, or any other embodiment.

This is why HEX introduces two main ideas:

Canonical body-part state representation: encodes body state in standard "slots" instead of raw joint indices
Mixture-of-Experts Unified Proprioceptive Predictor (UPP): learns whole-body coordination from data covering 7 different embodiments

HEX Architecture

┌─────────────────────────────────────────────────────────────┐
│                        HEX Pipeline                         │
│                                                             │
│  Camera frames ──► VLM (Qwen3-VL-2B)                       │
│  + Text command      │  Temporal context cache              │
│                      ▼                                      │
│              Visual-Language Features                       │
│                      │                                      │
│  Robot joints ──► UPP (MoE Transformer)                    │
│  (canonical slots)   │  16 routed experts + 2 shared       │
│                      ▼                                      │
│              Proprioceptive Features                        │
│                      │                                      │
│              ┌───────┴───────┐                              │
│              ▼               ▼                              │
│       Action Expert (DiT-B, 16 layers)                      │
│         dual cross-attention fusion                         │
│              │                                              │
│              ▼                                              │
│         Action (flow-matching)                              │
└─────────────────────────────────────────────────────────────┘

1. Visual-Language Backbone: Qwen3-VL-2B

HEX uses Qwen3-VL-2B-Instruct as the backbone for image and language processing. Its distinguishing feature is a lightweight history-query feature cache: instead of feeding every video frame into the model, HEX compresses the history into a compact context vector, so the model can follow how the scene evolves over time without a large memory cost.
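
The caching idea can be illustrated with a small sketch. It assumes a fixed-length cache of per-frame feature vectors that is mean-pooled into one context vector; the class name `HistoryCache`, the cache size, and the pooling rule are all hypothetical and are not taken from the HEX code.

```python
from collections import deque


class HistoryCache:
    """Hypothetical fixed-size cache of per-frame feature vectors."""

    def __init__(self, max_frames=4):
        # A bounded deque drops the oldest entry automatically once full
        self.frames = deque(maxlen=max_frames)

    def push(self, feature):
        self.frames.append(list(feature))

    def context(self):
        """Mean-pool the cached features into one compact context vector."""
        if not self.frames:
            return []
        dim = len(self.frames[0])
        n = len(self.frames)
        return [sum(f[i] for f in self.frames) / n for i in range(dim)]


cache = HistoryCache(max_frames=2)
for feat in ([1.0, 0.0], [3.0, 2.0], [5.0, 4.0]):
    cache.push(feat)

# The oldest frame was evicted, so this is the mean of the last two
print(cache.context())  # [4.0, 3.0]
```

Memory stays constant regardless of episode length, which is the property the paragraph above describes; a real model would likely use a learned summary instead of a plain mean.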

2. Canonical State Representation

This is one of the most important contributions. Every robot has a different joint structure (the Unitree G1 has 43 DOF, and the Tienkung 3.0 has more), so raw joint indices cannot be used for cross-embodiment learning.

HEX defines 7 standard body-part slots:

Slot	Description	Example joints
`left_arm`	Left arm	shoulder, elbow, wrist
`right_arm`	Right arm	shoulder, elbow, wrist
`left_hand`	Left hand (dexterous)	finger joints
`right_hand`	Right hand (dexterous)	finger joints
`legs`	Both legs	hip, knee, ankle
`head`	Head and neck	pan, tilt
`waist`	Waist	torso rotation

If a robot lacks a body part (for example, a wheeled robot has no legs), HEX fills the slot with a learned missing-part token, so the model runs unchanged without any special-case handling.
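
A minimal sketch of the slot grouping, assuming joint values arrive as a name-to-value dictionary. The function `to_canonical`, the joint names, and the `present` flags are illustrative assumptions; the learned missing-part token itself is a model parameter and is not shown.

```python
CANONICAL_SLOTS = ["left_arm", "right_arm", "left_hand", "right_hand",
                   "legs", "head", "waist"]


def to_canonical(joint_values, joint_mapping):
    """Group named joint values by canonical slot.

    Returns (state, present). `state` maps each slot to its joint values
    (empty when the robot lacks that part); `present` flags which slots
    exist, so a model could substitute a missing-part token elsewhere.
    """
    state, present = {}, {}
    for slot in CANONICAL_SLOTS:
        names = joint_mapping.get(slot, [])
        state[slot] = [joint_values[name] for name in names]
        present[slot] = len(names) > 0
    return state, present


# Hypothetical wheeled robot: arms and waist only
mapping = {"left_arm": ["l_shoulder", "l_elbow"],
           "right_arm": ["r_shoulder", "r_elbow"],
           "waist": ["torso"]}
values = {"l_shoulder": 0.1, "l_elbow": 0.5,
          "r_shoulder": -0.1, "r_elbow": 0.4, "torso": 0.0}

state, present = to_canonical(values, mapping)
missing = [s for s in CANONICAL_SLOTS if not present[s]]
print(missing)  # ['left_hand', 'right_hand', 'legs', 'head']
```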

3. UPP — Unified Proprioceptive Predictor

This is the heart of HEX. UPP is a 4-layer transformer (hidden size 768) with a Mixture-of-Experts architecture:

Input: canonical body-part embeddings
       ↓
MoE Layer × 4:
  - 16 routed experts (embodiment-specific)
  - 2 shared experts (cross-embodiment common)
  - Router selects the top-K experts for each token
       ↓
Output: temporal + coordination features

The idea is that the 16 routed experts learn patterns specific to each robot, while the 2 shared experts learn general principles of balance and whole-body coordination. When the model is deployed on a new robot, the router combines the appropriate experts.
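
The routing pattern can be sketched in plain Python. This is a generic shared-plus-routed MoE forward pass, not the HEX implementation: the value `top_k=2`, the renormalisation over selected experts, and the toy `scale` experts are assumptions, since the article does not state K or the gating details.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_logits, routed_experts, shared_experts, top_k=2):
    """Combine always-on shared experts with the top-k routed experts."""
    probs = softmax(router_logits)
    # Indices of the k highest-probability routed experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalise over the selected experts
    out = [0.0] * len(token)
    for expert in shared_experts:      # shared experts process every token
        out = [o + y for o, y in zip(out, expert(token))]
    for i in top:                      # routed experts are weighted by the gate
        w = probs[i] / norm
        out = [o + w * y for o, y in zip(out, routed_experts[i](token))]
    return out, top


def scale(factor):
    """Toy expert that multiplies its input by a constant."""
    return lambda x: [factor * v for v in x]


routed = [scale(float(i)) for i in range(16)]
shared = [scale(1.0), scale(1.0)]
logits = [0.0] * 16
logits[3], logits[7] = 5.0, 5.0        # router prefers experts 3 and 7

out, chosen = moe_forward([1.0, 2.0], logits, routed, shared, top_k=2)
print(sorted(chosen), out)  # [3, 7] [7.0, 14.0]
```

Only the selected routed experts run for a given token, so compute per token stays roughly constant as the number of routed experts grows.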

4. Action Expert: DiT-B with Dual Cross-Attention

The HEX action head is a 16-layer DiT-B (Diffusion Transformer Base, hidden size 1024) with a dual cross-attention design:

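# Simplified attention flow in the HEX Action Expert.
# The layer choices in __init__ (8 heads, sigmoid gate) are illustrative assumptions.
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross_attn_vl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_prop = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, action_tokens, vl_features, prop_features):
        # Branch 1: attend to visual-language context
        x_vl, _ = self.cross_attn_vl(action_tokens, vl_features, vl_features)
        # Branch 2: attend to proprioceptive context
        x_prop, _ = self.cross_attn_prop(action_tokens, prop_features, prop_features)
        # Adaptive fusion: per-token gate in [0, 1]
        alpha = self.gate(action_tokens)
        return alpha * x_vl + (1 - alpha) * x_prop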

Training uses flow matching rather than conventional diffusion: sampling typically needs fewer integration steps and training tends to be more stable, which suits tasks that demand fast reaction times.
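
To make the flow-matching idea concrete, here is a sketch of one common formulation: a straight-line path from noise to the target action, with a constant-velocity regression target and Euler integration at sampling time. Whether HEX uses exactly this path and this sampler is an assumption; the `oracle` function stands in for a trained network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the network on this path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def sample(velocity_fn, x0, num_steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
action = [0.3, -0.2, 0.8]                       # toy target action
noise = [random.gauss(0, 1) for _ in action]


def oracle(x, t):
    """Stand-in for a trained network: the exact field for this one pair."""
    return target_velocity(noise, action)


result = sample(oracle, noise, num_steps=10)
print([round(r, 6) for r in result])  # matches `action` up to float error
```

With a well-trained velocity field, a handful of Euler steps is enough to move from noise to an action, which is where the latency advantage over many-step diffusion sampling comes from.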

Dataset and Training Data

HEX is pretrained on more than 12 million frames from 4 data sources:

Source	Type	Embodiment	Characteristics
HEX in-house dataset	Real-world	Tienkung, Tienyi	Diverse manipulation
Humanoid Everyday	Real-world	Various	Daily tasks
AgiBot World Colosseo	Real-world	AgiBot	Wheeled humanoid
RoboCOIN	Real-world	Leju, G1, H1	Multi-embodiment

In total, 7 embodiments are covered: Tienkung 2.0, Tienkung 3.0, Tienyi, Unitree G1, Unitree H1, AgiBot, and Leju Kuavo.

Installation

System requirements

Ubuntu 20.04/22.04 + CUDA 11.8+
GPU: at least 1× A100 40GB for inference, 8× A100 for fine-tuning
Python 3.10
RAM: 32GB+

Step 1: Clone the repo and create the environment

git clone https://github.com/Open-X-Humanoid/HEX.git
cd HEX

conda create -n hex python=3.10 -y
conda activate hex

# System dependencies
sudo apt update && sudo apt install -y libegl1-mesa-dev libglu1-mesa

# Python dependencies
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Step 2: Download model weights

# Download HEX pretrained model (~2.4B params)
python hex/utils/download_model_hex.py

# Download base VLM (Qwen3-VL-2B)
python hex/utils/download_model_qwen.py

Both sets of weights are hosted on Hugging Face. On a slow connection you can use hf_transfer:

pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 python hex/utils/download_model_hex.py

Step 3: Verify the installation

# Run a quick test on the LIBERO simulation
bash scripts/libero/eval_libero.sh

# Or run the inference notebook
jupyter notebook notebooks/eval_model.ipynb

Fine-tuning HEX on Your Robot

If you have teleoperation data from a real robot, fine-tuning is a two-step process.

Step 1: Prepare the data

HEX uses the LeRobot v2.1 format. If you already have a LeRobot dataset, you only need to map joint names to canonical slots:

# configs/embodiment/unitree_g1.yaml
embodiment: unitree_g1
joint_mapping:
  left_arm: [left_shoulder_pitch, left_shoulder_roll, left_shoulder_yaw,
             left_elbow, left_wrist_roll, left_wrist_pitch, left_wrist_yaw]
  right_arm: [right_shoulder_pitch, right_shoulder_roll, right_shoulder_yaw,
              right_elbow, right_wrist_roll, right_wrist_pitch, right_wrist_yaw]
  legs: [left_hip_pitch, left_hip_roll, left_hip_yaw,
         left_knee, left_ankle_pitch, left_ankle_roll,
         right_hip_pitch, right_hip_roll, right_hip_yaw,
         right_knee, right_ankle_pitch, right_ankle_roll]
  waist: [torso_joint]
  # G1 has no dexterous hands → missing-part tokens fill those slots automatically
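
Before training, it can help to sanity-check a mapping like the one above against the joint names in your dataset. The helper below is a hypothetical sketch, not part of the HEX repo: it reports joints that are unmapped, listed in more than one place, or absent from the dataset.

```python
def check_mapping(joint_mapping, dataset_joints):
    """Return (unmapped, duplicated, unknown) joint-name lists."""
    seen, duplicated = set(), []
    for names in joint_mapping.values():
        for name in names:
            if name in seen:
                duplicated.append(name)
            seen.add(name)
    # Joints present in the dataset but assigned to no slot
    unmapped = [j for j in dataset_joints if j not in seen]
    # Joints named in the mapping that the dataset does not contain
    unknown = sorted(seen - set(dataset_joints))
    return unmapped, duplicated, unknown


mapping = {
    "left_arm": ["left_shoulder_pitch", "left_elbow"],
    "waist": ["torso_joint", "left_elbow"],   # left_elbow listed twice
}
dataset = ["left_shoulder_pitch", "left_elbow", "torso_joint", "head_pan"]

print(check_mapping(mapping, dataset))
# (['head_pan'], ['left_elbow'], [])
```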

Step 2: Run fine-tuning

# Fine-tune on a new embodiment (2-4 A100, ~6-12 hours)
bash scripts/fine_tune_hex.sh \
  --embodiment unitree_g1 \
  --data_path /path/to/your/lerobot_dataset \
  --output_dir checkpoints/hex_g1_custom \
  --num_epochs 50 \
  --batch_size 8

# Pretrain from scratch (requires ~1000 A100 GPU-hours)
bash scripts/pretrain_hex.sh

A note on compute: full pretraining takes roughly 1000 A100 GPU-hours (200k steps, batch size 16). On a limited budget, fine-tuning from the pretrained checkpoint is enough; 6-12 hours on 2-4 A100s is usually sufficient for convergence.
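
A quick back-of-the-envelope comparison of the two budgets. The 8-GPU cluster size and the perfect-scaling assumption are illustrative only; real wall-clock time will be longer because of communication and data-loading overhead.

```python
def wall_clock_hours(gpu_hours, num_gpus):
    """Ideal wall-clock time, assuming perfect scaling across GPUs."""
    return gpu_hours / num_gpus


# Pretraining: ~1000 GPU-hours spread over a hypothetical 8-GPU node
pretrain_days = wall_clock_hours(1000, 8) / 24

# Fine-tuning: 2-4 GPUs for 6-12 hours, expressed in GPU-hours
finetune_low = 2 * 6
finetune_high = 4 * 12

print(round(pretrain_days, 1), finetune_low, finetune_high)  # 5.2 12 48
```

Under these assumptions fine-tuning costs roughly 12 to 48 GPU-hours, about 20 to 80 times less than pretraining from scratch.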

Inference on a Real Robot

Once you have a fine-tuned checkpoint, inference proceeds as follows:

import torch

from hex import HEXPolicy
from hex.utils import load_embodiment_config

# Load the fine-tuned policy and bind it to the target embodiment
policy = HEXPolicy.from_pretrained("checkpoints/hex_g1_custom")
config = load_embodiment_config("unitree_g1")
policy.set_embodiment(config)
policy.eval().cuda()

# Build one observation; `camera_frame` and `robot` come from your own
# camera and robot interfaces
obs = {
    "image": camera_frame,          # (H, W, 3) numpy array
    "language": "pick up the bottle and place it on the shelf",
    "joint_positions": robot.get_joint_positions(),  # canonical slots
    "joint_velocities": robot.get_joint_velocities(),
}

with torch.no_grad():
    actions = policy.predict(obs, num_steps=10)  # predict a 10-step action chunk

# Execute the predicted chunk step by step
for action in actions:
    robot.set_joint_targets(action)
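
In a continuous control loop, a common pattern with chunked policies is to execute only the first few actions of each chunk and then re-plan from a fresh observation. The sketch below shows that loop with stand-in functions; `run_episode`, the `execute_steps` value, and the fake policy and robot are all hypothetical and independent of the HEX API.

```python
def run_episode(predict_chunk, get_obs, send_action, total_steps,
                chunk_size=10, execute_steps=5):
    """Predict a chunk, execute only its first steps, then re-plan."""
    executed = 0
    replans = 0
    while executed < total_steps:
        chunk = predict_chunk(get_obs(), chunk_size)
        replans += 1
        for action in chunk[:execute_steps]:
            if executed >= total_steps:
                break
            send_action(action)
            executed += 1
    return replans


# Stand-ins for the policy and robot, just to exercise the loop
log = []
state = {"t": 0}


def fake_obs():
    return dict(state)


def fake_predict(obs, n):
    return [obs["t"] + i for i in range(n)]


def fake_send(action):
    log.append(action)
    state["t"] += 1


replans = run_episode(fake_predict, fake_obs, fake_send, total_steps=12)
print(replans, log)  # 3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Executing fewer steps per chunk makes the robot more responsive to changes in the scene at the cost of more policy calls.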

Benchmark Results

HEX is evaluated on 7 real-world tasks, ranging from simple scenarios to complex long-horizon ones.

Overall comparison (seen scenarios)

Model	Avg. Success Rate	Params	Framework
HEX	79.8%	2.4B	MoE + DiT flow-matching
π₀.₅	71.8%	~3B	Diffusion VLA
GR00T N1.5	70.2%	~1.5B	DiT VLA
GR00T N1	52.4%	~1.5B	DiT VLA

Long-horizon task: Box Conveyance (4 stages)

This task is especially hard because the robot must complete 4 consecutive stages: approach → grasp → carry → place. Final-stage success rate:

HEX:        53.3% ████████████████████████████░░░░░░░░░░
π₀.₅:       40.0% ████████████████████░░░░░░░░░░░░░░░░░░
GR00T N1.5: 20.0% ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░
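
To get a feel for what these numbers imply per stage, here is a rough calculation. It assumes the reported figure is the cumulative success through all four stages and that stages succeed independently with equal probability; neither assumption comes from the source, so treat the result as an intuition aid only.

```python
final_success = {"HEX": 0.533, "pi0.5": 0.400, "GR00T N1.5": 0.200}
num_stages = 4

# If every stage succeeds with probability p, the whole task succeeds
# with probability p ** num_stages, so p is the num_stages-th root
per_stage = {m: round(p ** (1 / num_stages), 3)
             for m, p in final_success.items()}
print(per_stage)
```

Under these assumptions, a per-stage reliability in the mid-80% range still yields only about half of the episodes finishing, which is why long-horizon tasks separate the models so sharply.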

Generalization (unseen task variants)

Model	Unseen Success Rate
HEX	61.8%
π₀.₅	44.3%
GR00T N1.5	41.0%

The largest gaps appear on fast-reaction tasks and long-horizon tasks, as expected given that UPP was designed around temporal dynamics.
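
The figures from the two tables above can be compared directly. This short calculation uses only the reported averages and expresses differences in percentage points.

```python
seen = {"HEX": 79.8, "pi0.5": 71.8, "GR00T N1.5": 70.2}
unseen = {"HEX": 61.8, "pi0.5": 44.3, "GR00T N1.5": 41.0}

# Drop in success rate when moving from seen to unseen variants
drop = {m: round(seen[m] - unseen[m], 1) for m in seen}

# HEX's lead over each baseline in both settings
lead_seen = {m: round(seen["HEX"] - seen[m], 1) for m in seen if m != "HEX"}
lead_unseen = {m: round(unseen["HEX"] - unseen[m], 1)
               for m in unseen if m != "HEX"}

print(drop)         # {'HEX': 18.0, 'pi0.5': 27.5, 'GR00T N1.5': 29.2}
print(lead_seen)    # {'pi0.5': 8.0, 'GR00T N1.5': 9.6}
print(lead_unseen)  # {'pi0.5': 17.5, 'GR00T N1.5': 20.8}
```

HEX loses 18.0 points on unseen variants while the baselines lose 27.5 and 29.2, so its lead roughly doubles outside the training distribution.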

Review-and-Forecast: A Key Mechanism

A notable innovation in HEX is the review-and-forecast paradigm:

Past frames → Visual History Summary (review)
                     ↓
             VLM processes context
                     ↓
Future state prediction ← UPP forecasts next body state (forecast)
                     ↓
             Action Expert generates actions
             conditioned on predicted future

Instead of only reacting to the current observation, HEX also predicts what the body state will be after the action is executed. This auxiliary loss forces UPP to learn real temporal dynamics rather than a plain frame → action mapping.
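
One simple way to combine an action objective with a forecasting objective is a weighted sum of two regression losses. The sketch below assumes mean-squared error for both terms and a hypothetical weight `forecast_weight`; the actual loss form and weighting used by HEX are not given in this article.

```python
def mse(pred, target):
    """Mean-squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_velocity, target_velocity,
               pred_next_state, true_next_state, forecast_weight=0.1):
    """Action loss plus a weighted body-state forecasting loss."""
    action_loss = mse(pred_velocity, target_velocity)
    forecast_loss = mse(pred_next_state, true_next_state)
    return action_loss + forecast_weight * forecast_loss


loss = total_loss(pred_velocity=[1.0, 0.0], target_velocity=[0.0, 0.0],
                  pred_next_state=[0.5, 0.5], true_next_state=[0.5, 1.5],
                  forecast_weight=0.1)
print(loss)
```

Because the forecasting term is computed from the proprioceptive branch, its gradient pushes that branch to model how the body will move, not just where it is now.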

When Should You Use HEX?

HEX is the best fit when:

✅ You have a full-size humanoid robot (bipedal or wheeled)
✅ The task requires coordinating arms, legs, and torso at the same time
✅ You want to pretrain once and fine-tune for multiple embodiments
✅ The task is long-horizon (many consecutive steps)

It is a poorer fit when:

❌ The robot is arm-only (no whole-body control needed)
❌ Resources are constrained: inference needs at least 1× A100 40GB
❌ You need real-time latency under 20 ms (flow-matching sampling adds latency)

See also the guide to deploying WholebodyVLA on the G1 and how LeRobot integrates with π₀Fast for whole-body control for other perspectives on the available options.

Related Articles

Nguyễn Anh Tuấn

Qwen-VLA: Alibaba's generalist VLA model

Fine-tune InternVLA-A1.5 with LeRobot

VLA-JEPA guide: VLA with the V-JEPA2 latent world model
