OpenHelix: Building a Dual-System VLA, From Survey to Deployment

You have probably heard of Helix, Figure AI's dual-system VLA architecture that runs on a real humanoid robot and took the robotics community by storm in 2025. The one problem: it is completely closed-source. You cannot read the code, reproduce it, or learn from it.

OpenHelix was created to change that. It is a full open-source implementation of the dual-system VLA architecture, accompanied by a survey and one of the most thorough empirical analyses available on the topic. The result: SOTA on CALVIN ABC-D with an average sequence length of 4.08, surpassing RoboDual, UniVLA, GR-MG, and Seer.

This article takes you from "what is a dual-system VLA" through environment setup, data preparation, training, and inference, with enough detail to run everything on your own machine.

What Is a Dual-System VLA? (And Why You Should Care)

Imagine driving through a crowded city. Your brain performs two jobs in parallel:

System 2 (slow, deliberate): reads road signs, recognizes complex situations ("an ambulance is approaching"), and makes strategic decisions ("stop and give way")
System 1 (fast, reflexive): steers, applies the right braking force, and keeps the car in its lane, all within milliseconds and without conscious thought

Robot manipulation faces exactly this tension. A multimodal LLM (MLLM) such as LLaVA is very good at understanding language and reasoning about context, but it runs at only 7-9 Hz, too slow for real-time robot control. A diffusion policy can react at 200+ Hz, but it has no semantic understanding; it simply maps sensor inputs to actions.

A dual-system VLA combines the two: the MLLM plays System 2 (language understanding and planning), and the diffusion policy plays System 1 (precise, real-time execution).
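
To make the rate mismatch concrete, the ratio of the two loop frequencies gives the number of System 1 steps that pass between consecutive System 2 updates. The sketch below only does that arithmetic with the figures quoted above; the function name is illustrative and not part of OpenHelix.

```python
def steps_per_update(fast_hz: float, slow_hz: float) -> float:
    """Fast-loop (System 1) steps that elapse per slow-loop (System 2) update."""
    if fast_hz <= 0 or slow_hz <= 0:
        raise ValueError("rates must be positive")
    return fast_hz / slow_hz


# Figures quoted in the text: System 1 at 200 Hz, System 2 at 7-9 Hz.
for slow_hz in (7, 9):
    ratio = steps_per_update(200, slow_hz)
    print(f"System 2 at {slow_hz} Hz: about {ratio:.1f} System 1 steps per update")
```

In other words, System 1 has to act for a few dozen control steps on the same System 2 output, which is why the latent it receives must stay useful while it goes stale.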

Overview of the dual-system VLA architecture: System 2 (MLLM) provides context to System 1 (policy)

OpenHelix: Three Main Contributions

The paper arXiv:2505.03912 by Can Cui, Pengxiang Ding, Wenxuan Song, and colleagues is more than a new model; it is a whole body of knowledge:

1. A Survey of the Landscape

OpenHelix systematizes the design space of dual-system VLAs: how to connect System 1 and System 2, how to train each component, and how to handle the latency mismatch. It is a map that keeps you from getting lost when reading other papers.

2. A Rigorous Empirical Analysis

Rather than simply claiming "my architecture is better", the authors systematically ablate each design choice:

Pre-trained policy vs. training from scratch?
Prompt tuning vs. full fine-tuning for the MLLM?
With or without an auxiliary prediction task?
Does pre-alignment before joint training matter?

3. An Open-Source Implementation

All code, checkpoints, and training scripts are released under the MIT license at github.com/OpenHelix-Team/OpenHelix.

Architecture in Detail

System 2: LLaVA-7B (The Slow Brain)

Input: observation image + language instruction
Model: LLaVA-7B (FROZEN, weights are not trained)
Adaptation: prompt tuning (only the learnable prompt tokens are trained, ~1% of parameters)
Output: latent embedding Z ∈ ℝ^(N×D)

Why freeze the MLLM? Fully fine-tuning LLaVA-7B risks losing the generalization it acquired from large-scale image-text pretraining. Prompt tuning preserves that knowledge while adapting the model to the robot domain, at a much lower compute cost.

A key ablation finding: by default the MLLM is largely insensitive to visual changes. Its output mostly reflects the semantics of the instruction and nearly ignores the observation image. This is a serious problem, because a robot must react to its environment.

Learned Token Bridge (Connecting the Two Systems)

import torch.nn as nn

# Projection layer connecting System 2 -> System 1
class TokenBridge(nn.Module):
    def __init__(self, llm_dim=4096, policy_dim=512):
        super().__init__()
        # Map LLaVA's 4096-d hidden size to the 512-d policy width
        self.proj = nn.Linear(llm_dim, policy_dim)
        # Normalize the projected tokens
        self.norm = nn.LayerNorm(policy_dim)

    def forward(self, llm_embedding):
        # Project from LLM space into policy space:
        # (..., llm_dim) -> (..., policy_dim)
        return self.norm(self.proj(llm_embedding))

The token bridge is the component that receives the most training. It learns to "translate" the latent representation from LLaVA's 4096-dimensional space into the 512-dimensional space the diffusion policy understands.

Important: the projection layer must be pre-aligned BEFORE joint training. If it is randomly initialized and everything is trained together from the start, gradients from the policy can "poison" the MLLM-side representation and lead to model collapse.
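
For a sense of scale, the bridge is tiny compared with the 7B-parameter MLLM. The sketch below counts its trainable parameters by hand, assuming the default dimensions shown above (4096 and 512) and PyTorch's defaults of a bias term in nn.Linear and a learnable scale and shift in nn.LayerNorm. The helper names are illustrative.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Weights (in_dim * out_dim) plus an optional bias vector (out_dim)."""
    return in_dim * out_dim + (out_dim if bias else 0)


def layernorm_params(dim: int) -> int:
    """One learnable scale and one learnable shift per feature."""
    return 2 * dim


def bridge_params(llm_dim: int = 4096, policy_dim: int = 512) -> int:
    """Total trainable parameters of a Linear + LayerNorm bridge."""
    return linear_params(llm_dim, policy_dim) + layernorm_params(policy_dim)


print(f"{bridge_params():,}")  # 2,098,688
```

At roughly two million parameters, the bridge is cheap to pre-align on its own before the joint training stage.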

System 1: 3D Diffuser Actor (The Executing Hand)

Input: token bridge output Z + proprioceptive state q + goal features g
Architecture: 3D Diffuser Actor (diffusion-based)
Output: action sequence a₀:T ∈ ℝ^(T×7)  # 7-DOF robot arm
Frequency: 200+ Hz (asynchronous inference)

3D Diffuser Actor uses a diffusion process to generate the action sequence, which lets the model capture multimodal action distributions (the same task can have several valid ways to execute it). It takes input from three sources:

Z from the token bridge: context about the task and the visual state
Proprioceptive state q: joint angles and the current end-effector pose
Goal features g: visual features of the goal state

Auxiliary Task: Forcing the MLLM to Look

Auxiliary loss: L_aux = MSE(f_aux(Z), a_expert)
f_aux: small MLP head on the MLLM output
Effect: forces the MLLM embedding to encode visual information

This is the smartest trick in OpenHelix. By adding an extra loss that requires the MLLM to predict actions from its own embedding, we force LLaVA to actually learn to look. Without the auxiliary task, the MLLM can be "lazy": encoding the instruction text alone is enough to minimize the training loss. With the auxiliary task, it has to incorporate visual information as well.

Ablation result: the auxiliary task improves performance by +0.4 average sequence length on CALVIN ABC-D.
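
The auxiliary term can be illustrated with plain arithmetic. The sketch below computes the MSE between a predicted and an expert action and adds it to the policy loss with a weight. The weighted-sum form (policy loss plus aux_weight times L_aux) is an assumption based on the --aux_weight flag used later in this article, not code taken from the repository, and the example numbers are made up.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(policy_loss: float,
               aux_pred: Sequence[float],
               expert_action: Sequence[float],
               aux_weight: float = 0.1) -> float:
    """Assumed combination: L_total = L_policy + aux_weight * L_aux."""
    return policy_loss + aux_weight * mse(aux_pred, expert_action)


# Hypothetical 7-dimensional action predicted by f_aux(Z) vs. the expert action.
aux_pred = [0.1, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0]
expert = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

print(f"L_aux   = {mse(aux_pred, expert):.6f}")
print(f"L_total = {total_loss(0.5, aux_pred, expert):.6f}")
```

Because the gradient of L_aux flows back through Z, the embedding can only reduce this term by encoding what the scene looks like, which is the effect described above.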

Environment Setup

Hardware Requirements

Component	Minimum	Recommended
GPU	1× RTX 3090 (24GB)	1× A100 (40GB)
RAM	32GB	64GB
Storage	200GB SSD	500GB SSD
CUDA	11.8+	12.1

Training the full OpenHelix model takes roughly 3-4 days on an A100. For inference only with a pre-trained checkpoint, an RTX 3090 is sufficient.

Create the Conda Environment

# Python 3.8 is required; OpenHelix does not yet support Python 3.10+
conda create -n openhelix python=3.8
conda activate openhelix

# Install PyTorch with CUDA 11.8
conda install pytorch==2.0.1 torchvision==0.15.2 \
    torchaudio==2.0.2 pytorch-cuda=11.8 \
    -c pytorch -c nvidia

Clone the Repository and Install Dependencies

git clone https://github.com/OpenHelix-Team/OpenHelix
cd OpenHelix

# Install the core dependencies
pip install -r requirements.txt

# Fetch the CALVIN simulator (submodule)
git submodule update --init --recursive

# Install DGL (Deep Graph Library), required by 3D Diffuser Actor
# Pick the wheel index that matches your CUDA version
pip install dgl==1.1.0 -f https://data.dgl.ai/wheels/cu118/repo.html

# Flash Attention to speed up MLLM inference
pip install flash-attn==2.5.9 --no-build-isolation

Note: flash-attn takes 5-10 minutes to compile, and often longer when it has to build from source. Do not close the terminal.

Preparing the CALVIN Dataset

CALVIN (Composing Actions from Language and Vision) is one of the most widely used benchmarks for language-conditioned manipulation. In the ABC-D split, the model trains on environments A, B, and C and is tested on environment D, which it has never seen.

Download the Dataset

# The dataset is large (~300GB for the full split), more than the 200GB
# minimum storage listed above, so check your free disk space first
# Download directly from the CALVIN dataset server
cd ~/data
wget https://calvin.cs.uni-freiburg.de/dataset/task_ABC_D.zip
unzip task_ABC_D.zip

# Directory structure after extraction:
# task_ABC_D/
# ├── training/          # Environments A, B, C
# │   ├── episode_*.npz  # Demonstration episodes
# │   └── lang_annotations/
# ├── validation/        # Environment D
# └── statistics.yaml
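
Before launching a long training job, it can help to confirm that the extracted folder matches the layout shown above. The sketch below checks only the entries named in that tree (training, training/lang_annotations, validation, statistics.yaml). The helper missing_entries is made up for illustration and is not part of the OpenHelix repository.

```python
from pathlib import Path

# Entries taken from the directory tree shown above (an assumption about
# your local copy; adjust if your extraction differs).
EXPECTED = (
    "training",
    "training/lang_annotations",
    "validation",
    "statistics.yaml",
)


def missing_entries(root) -> list:
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root).expanduser()  # expand '~' explicitly
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_entries("~/data/task_ABC_D")
print("Layout OK" if not missing else f"Missing: {missing}")
```

An empty result only says the folders exist; it does not validate the episode files themselves.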

Pre-encode Language Instructions (Optional but Recommended)

OpenHelix uses the CLIP text encoder to encode language instructions. Pre-encoding them makes training considerably faster, because the text encoder no longer has to run on every batch:

cd OpenHelix

python scripts/encode_instructions.py \
    --dataset_path ~/data/task_ABC_D \
    --output_path ~/data/task_ABC_D/lang_embeddings \
    --encoder clip-vit-base-patch32

# Or download the pre-encoded embeddings from HuggingFace (faster)
python -c "
import os
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='OpenHelix-Team/OpenHelix',
    filename='lang_embeddings.tar.gz',
    # Python does not expand '~' automatically, so expand it explicitly
    local_dir=os.path.expanduser('~/data/task_ABC_D/')
)
"

Verify the Dataset Structure

python scripts/verify_dataset.py --path ~/data/task_ABC_D

# Expected output:
# ✓ Training episodes: 23,856
# ✓ Validation episodes: 1,000
# ✓ Language annotations: 34 unique tasks
# ✓ CLIP embeddings: found

Training

Step 1: Pre-train the Projection Layer (Token Bridge Alignment)

Do NOT skip this step. It is what separates OpenHelix from many incorrect implementations:

cd OpenHelix

bash scripts/train_projection_pretrain.sh \
    --data_path ~/data/task_ABC_D \
    --output_dir ./checkpoints/projection_pretrain \
    --epochs 10 \
    --batch_size 64 \
    --lr 1e-4

# Takes about 2-3 hours on an A100
# Goal: the projection layer learns to align with the MLLM embedding space first

Step 2: Joint Training with the Auxiliary Task

bash train_trajectory_lcb_pt_act_simple.sh \
    --data_path ~/data/task_ABC_D \
    --pretrained_proj ./checkpoints/projection_pretrain/best.pt \
    --output_dir ./checkpoints/openhelix_full \
    --llm_model llava-hf/llava-1.5-7b-hf \
    --policy_lr 1e-4 \
    --prompt_lr 1e-3 \
    --aux_weight 0.1 \
    --epochs 100 \
    --batch_size 32

What the flags mean:

--prompt_lr 1e-3: the prompt tokens use a higher learning rate than the policy (1e-4) because they have far fewer parameters
--aux_weight 0.1: weight of the auxiliary loss; 0.1 is the best value according to the paper's ablation
--pretrained_proj: REQUIRED, the pre-aligned projection from Step 1

Monitoring Training

# TensorBoard logs
tensorboard --logdir ./checkpoints/openhelix_full/logs

# Metrics to watch:
# - train/policy_loss: should decrease steadily
# - train/aux_loss: should decrease; if it rises, lower aux_weight
# - val/avg_seq_len: the main metric, target > 3.5 after epoch 50

OpenHelix benchmark results on CALVIN ABC-D compared with other methods

Inference and Evaluation

Inference in Asynchronous Mode

OpenHelix uses asynchronous inference to handle the latency mismatch between System 1 and System 2. System 2 (LLaVA) runs at roughly 7-9 Hz while System 1 (the diffusion policy) runs at 200+ Hz, so the two have to run asynchronously to keep either one from blocking the other:

# --async_delay 10: 10-step delay between System 1 and System 2
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
    --checkpoint ./checkpoints/openhelix_full/epoch_100.pt \
    --data_path ~/data/task_ABC_D \
    --split validation \
    --async_delay 10 \
    --num_sequences 1000 \
    --output_path ./results/eval_epoch100.json
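
The scheduling idea behind the delay can be sketched in a few lines. The code below is a single-threaded simulation, not real concurrency and not the repository's implementation. It assumes that a delay of 10 means System 2 refreshes its latent every 10 control steps and System 1 reuses the most recent latent in between. The two stand-in functions are hypothetical placeholders for the MLLM and the policy.

```python
from typing import Any, Callable, List


def run_async(num_steps: int,
              delay: int,
              system2: Callable[[int], Any],
              system1: Callable[[int, Any], Any]) -> List[Any]:
    """Act every step with System 1, refreshing the System 2 latent every `delay` steps."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    actions = []
    latent = None
    for step in range(num_steps):
        if step % delay == 0:
            latent = system2(step)               # slow path: recompute the latent
        actions.append(system1(step, latent))    # fast path: reuse the latest latent
    return actions


# Hypothetical stand-ins that only record which latent each step used.
def fake_system2(step: int) -> str:
    return f"z@{step}"


def fake_system1(step: int, latent: str) -> tuple:
    return (step, latent)


trace = run_async(25, 10, fake_system2, fake_system1)
print(trace[9], trace[10], trace[24])
```

Steps 0 to 9 all act on the latent computed at step 0, so the policy has to tolerate context that is up to `delay - 1` steps old.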

Using the Pre-trained Checkpoint (No Training Required)

If you only want to try inference:

# Download the checkpoint from HuggingFace
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='OpenHelix-Team/OpenHelix',
    local_dir='./pretrained_checkpoints'
)
"

# Merge the safetensors shards into a single PyTorch file
python scripts/merge_safetensors.py \
    --input_dir ./pretrained_checkpoints \
    --output_path ./pretrained_checkpoints/openhelix_merged.pt

# Run inference
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
    --checkpoint ./pretrained_checkpoints/openhelix_merged.pt \
    --data_path ~/data/task_ABC_D \
    --split validation

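Reading the Evaluation Results

python scripts/analyze_results.py --result_path ./results/eval_epoch100.json

# Output format (illustrative values; Avg Seq Len is the sum of the five success rates):
# Task             | 1-task | 2-task | 3-task | 4-task | 5-task | Avg Seq Len
# push block       |  0.96  |  0.87  |  0.75  |  0.64  |  0.53  |   3.75
# stack block      |  0.91  |  0.79  |  0.68  |  0.55  |  0.44  |   3.37
# ...
# Overall          |  0.933 |  0.818 |  0.710 |  0.598 |  0.491 |   3.55

Each k-task column is the fraction of evaluation chains in which the first k tasks were all completed. Average sequence length is the mean number of consecutive tasks completed before the first failure, in a chain of at most 5 tasks, and it equals the sum of the five columns. The 4.08 reported for OpenHelix therefore means the robot completes 4.08 consecutive tasks on average.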
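
The metric is easy to recompute yourself. The sketch below shows the two equivalent views of average sequence length: the mean number of tasks completed per chain, and the sum of the per-step success rates. The chain results are made-up numbers and the function names are illustrative.

```python
from typing import List, Sequence


def success_rates(completed: Sequence[int], max_len: int = 5) -> List[float]:
    """Fraction of chains that completed at least k tasks, for k = 1..max_len."""
    n = len(completed)
    if n == 0:
        raise ValueError("need at least one chain")
    return [sum(1 for c in completed if c >= k) / n for k in range(1, max_len + 1)]


def avg_seq_len_from_chains(completed: Sequence[int]) -> float:
    """Mean number of consecutive tasks completed per chain."""
    if len(completed) == 0:
        raise ValueError("need at least one chain")
    return sum(completed) / len(completed)


def avg_seq_len_from_rates(rates: Sequence[float]) -> float:
    """Same metric, computed as the sum of the per-step success rates."""
    return sum(rates)


# Made-up results: tasks completed before the first failure, per chain.
chains = [5, 3, 0, 4, 5, 2, 5, 1, 5, 4]
rates = success_rates(chains)

print(rates)
print(round(avg_seq_len_from_chains(chains), 2))
print(round(avg_seq_len_from_rates(rates), 2))
```

Summing a row of the results table this way is a quick consistency check on any reported number.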

Results and Analysis

Comparison With SOTA

Model	CALVIN ABC-D Avg Seq Len
Seer (2024)	3.65
GR-MG (2024)	3.88
UniVLA (2025)	3.92
RoboDual (2025)	3.98
OpenHelix (2025)	4.08

4 Lessons From the Ablation Study

Lesson 1: Pre-trained policy > training from scratch (+1.2 seq len)
Do not try to train 3D Diffuser Actor from random weights. Always start from a pre-trained policy and fine-tune it; the gap is very large.

Lesson 2: Prompt tuning is good enough; full fine-tuning of the MLLM is unnecessary
Fine-tuning all of LLaVA does not improve performance and costs generalization. Prompt tuning with ~1% of parameters is the best option found.

Lesson 3: The auxiliary task is required for SOTA (+0.4 seq len)
Without the auxiliary task, the MLLM nearly ignores the visual input. This is the paper's most important finding.

Lesson 4: Pre-alignment prevents model collapse
Joint training from the start with a randomly initialized projection layer often leads to training instability. Pre-aligning for 10 epochs first is important insurance.

Common Troubleshooting

Error: CUDA Out of Memory

# Reduce the batch size
--batch_size 16  # instead of 32

# Or enable gradient checkpointing
--gradient_checkpointing True

Error: flash_attn ImportError

# Uninstall, then rebuild from source against your installed CUDA and PyTorch
pip uninstall -y flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn==2.5.9 --no-build-isolation

Training Loss Does Not Decrease

Check in this order: (1) Are you passing --pretrained_proj? (2) Is aux_weight too high (try 0.05)? (3) Is the learning rate appropriate for the batch size (scale it linearly with the batch size)?
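
The linear scaling rule from point (3) is simple arithmetic: multiply the learning rate by the ratio of the new batch size to the reference one. The sketch below applies it to the policy learning rate and batch size used in Step 2 (1e-4 at batch size 32). It is a rule of thumb rather than a guarantee, and the function name is illustrative.

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: the learning rate grows or shrinks with the batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Halving the batch size to avoid out-of-memory errors halves the learning rate.
print(scale_lr(1e-4, 32, 16))
# Doubling the batch size doubles it.
print(scale_lr(1e-4, 32, 64))
```

If you lower --batch_size to fit in memory, adjusting --policy_lr and --prompt_lr by the same ratio is a reasonable first thing to try.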

Going Further: Related Repos From OpenHelix-Team

Once you are comfortable with OpenHelix, you can explore:

VLA-Adapter: a tiny-scale VLA with real-world ALOHA deployment (much lighter)
VLA-RFT: Reinforcement Fine-Tuning for VLA (reinforcement learning applied to robot policies)
HiF-VLA: a hierarchical spatiotemporal VLA that targets long-horizon tasks
Spatial-Forcing: an ICLR 2026 paper that improves 3D spatial understanding

Nguyễn Anh Tuấn
