You've probably heard of Helix — Figure AI's dual-system VLA running on a humanoid robot that shook the robotics community in 2025. The only problem: it's completely closed-source. You can't read the code, can't reproduce it, can't learn from it.
OpenHelix was created to change that. It's a fully open-source implementation of the dual-system VLA architecture, paired with the most thorough survey and empirical analysis on the topic available today. The result: SOTA on CALVIN ABC-D with an average sequence length of 4.08 — beating RoboDual, UniVLA, GR-MG and Seer.
This post takes you from "what is dual-system?" → environment setup → data preparation → training → inference, with enough detail to actually run it on your own machine.
What Is a Dual-System VLA? (And Why You Should Care)
Imagine driving through a busy city. Your brain runs two parallel processes:
- System 2 (slow, deliberate): Reads signs, recognizes complex situations ("ambulance approaching"), makes strategic decisions ("stop, yield")
- System 1 (fast, reflexive): Steers the wheel, applies brake pressure, keeps the car in lane — all in milliseconds, without "thinking"
Robot manipulation faces exactly this tension. Multimodal LLMs (MLLMs) like LLaVA are excellent at understanding language and reasoning about context — but run at 7-9 Hz, far too slow for real-time robot control. Diffusion policies react at 200+ Hz — but don't "understand" anything, just mapping sensors → actions.
Dual-System VLA combines both: the MLLM plays System 2 (language understanding, planning), and the diffusion policy plays System 1 (precise execution, real-time).
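To make the rate mismatch concrete, here is a minimal sketch of how the two systems could interleave in a single loop. It assumes the slow model simply refreshes a latent every `steps_per_update` ticks while the fast policy acts on every tick; `slow_plan` and `fast_act` are hypothetical stand-ins, not OpenHelix functions.

```python
def slow_plan(observation, instruction):
    """Hypothetical System 2 stand-in: returns a 'latent' summarizing the goal."""
    return f"{instruction}@obs{observation}"


def fast_act(latent, observation):
    """Hypothetical System 1 stand-in: returns an action for the current tick."""
    return (latent, observation)


def run_dual_system(num_ticks, steps_per_update, instruction):
    """Run the fast policy every tick; refresh the latent only every N ticks."""
    latent = None
    actions = []
    slow_calls = 0
    for tick in range(num_ticks):
        observation = tick  # placeholder for a camera frame / robot state
        if tick % steps_per_update == 0:
            latent = slow_plan(observation, instruction)
            slow_calls += 1
        actions.append(fast_act(latent, observation))
    return actions, slow_calls


# With the rates quoted above (about 200 Hz vs. about 7 Hz), the fast policy
# takes roughly 200 // 7 = 28 steps per latent refresh.
actions, slow_calls = run_dual_system(
    num_ticks=56, steps_per_update=200 // 7, instruction="push block"
)
print(len(actions), slow_calls)
```

The point of the sketch is the ratio: the fast policy reuses the same latent for dozens of steps, so the latent has to carry enough task and scene context to stay useful until the next refresh.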
OpenHelix: Three Core Contributions
The paper arXiv:2505.03912 by Can Cui, Pengxiang Ding, Wenxuan Song et al. isn't just another model — it's a knowledge system:
1. Comprehensive Survey of the Landscape
OpenHelix systematically maps the full design space of dual-system VLAs: how to connect System 1 and System 2, how to train each component, how to handle latency mismatch. It is the map that keeps you oriented when reading other papers in this area.
2. Rigorous Empirical Analysis
Rather than just claiming "my architecture is better," the authors ablate each design choice systematically:
- Pre-trained policy vs. training from scratch?
- Prompt-tuning vs. full fine-tuning for the MLLM?
- With or without auxiliary prediction task?
- Does pre-alignment before joint training matter?
3. Open-Source Implementation
Complete code, checkpoints, training scripts — MIT license — at github.com/OpenHelix-Team/OpenHelix.
Architecture Deep Dive
System 2: LLaVA-7B (The Slow Brain)
Input: Visual observation + Language instruction
Model: LLaVA-7B (FROZEN — weights not trained)
Adapt: Prompt tuning (only ~1% parameters)
Output: Latent embedding Z ∈ ℝ^(N×D)
Why freeze the MLLM? Fully fine-tuning LLaVA-7B risks eroding the generalization it learned from large-scale image-text pretraining. Prompt tuning keeps that "intelligence" intact while adapting to the robot domain, at a fraction of the compute cost.
Key finding from ablation: By default, the MLLM is insensitive to visual changes — it mostly reflects instruction semantics and nearly ignores the visual observation. This is a serious problem because robots need to respond to their environment!
Learned Token Bridge (Connecting the Two Systems)
# Projection layer connecting System 2 → System 1
import torch.nn as nn


class TokenBridge(nn.Module):
    def __init__(self, llm_dim=4096, policy_dim=512):
        super().__init__()
        # Linear map from the LLM hidden size to the policy hidden size
        self.proj = nn.Linear(llm_dim, policy_dim)
        # Normalize so the policy sees inputs on a stable scale
        self.norm = nn.LayerNorm(policy_dim)

    def forward(self, llm_embedding):
        # Project from LLM space → policy space
        return self.norm(self.proj(llm_embedding))
The token bridge is the most heavily trained component. It learns to "translate" latent representations from LLaVA's 4096-dimensional space into the 512-dimensional space the diffusion policy understands.
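Critical note: You must pre-align the projection layer BEFORE joint training. The MLLM weights are frozen, but the prompt tokens and the policy are not: if the projection starts from random weights and everything is trained together from scratch, the policy receives meaningless conditioning and the gradients flowing back through the untrained projection destabilize the prompt tokens, which typically ends in unstable or collapsed training.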
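The two-stage schedule can be summarized as a small table of what receives gradients in each stage. This is a sketch of the idea only: the component names are illustrative labels rather than identifiers from the repository, and keeping the policy fixed during pre-alignment is an assumption.

```python
# Component names below are illustrative labels, not identifiers from the repo.
FROZEN_ALWAYS = {"llava_backbone"}

STAGES = {
    # Stage 1 (assumed): only the projection / token bridge is updated.
    "pre_align": {"token_bridge"},
    # Stage 2: joint training of everything except the frozen backbone.
    "joint": {"prompt_tokens", "token_bridge", "diffusion_policy", "aux_head"},
}


def trainable_components(stage):
    """Return the set of components that receive gradient updates in a stage."""
    components = STAGES[stage]
    overlap = components & FROZEN_ALWAYS
    if overlap:
        raise ValueError(f"frozen components marked trainable: {sorted(overlap)}")
    return components


for stage in ("pre_align", "joint"):
    print(stage, "->", sorted(trainable_components(stage)))
```

In a PyTorch training script, the same table would translate into setting `requires_grad` per parameter group before building the optimizer for each stage.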
System 1: 3D Diffuser Actor (The Executing Hand)
Input: Token bridge output Z + Proprioceptive state q + Goal features g
Model: 3D Diffuser Actor (diffusion-based)
Output: Action sequence a₀:T ∈ ℝ^(T×7) # 7-D action per step (end-effector pose + gripper)
Speed: 200+ Hz (asynchronous inference)
The 3D Diffuser Actor uses a diffusion process to generate action sequences, allowing the model to capture multimodal action distributions (the same task can have multiple valid execution styles). It receives input from three sources:
- Z from token bridge — task and visual state context
- Proprioceptive state q — current joint angles, end-effector pose
- Goal features g — visual features of the target state
Auxiliary Task: Forcing the MLLM to Actually Look
Auxiliary loss: L_aux = MSE(f_aux(Z), a_expert)
f_aux: Small MLP head on MLLM output
Effect: Forces MLLM embedding to encode visual information
This is the cleverest trick in OpenHelix. By adding an auxiliary loss requiring the MLLM to predict actions from its embedding, we force LLaVA to actually learn to see. Without this, the MLLM can "cheat" — just encoding the instruction text is sufficient to minimize training loss. With auxiliary loss, it's forced to incorporate visual information.
Ablation result: Auxiliary task improves performance by +0.4 average sequence length on CALVIN ABC-D.
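As a worked example of the auxiliary term, the sketch below computes the MSE on a toy 7-dimensional action and adds it to a policy loss. Combining the two as a weighted sum is an assumption; the 0.1 weight matches the `--aux_weight` value used in the training command later in this post, and the numbers are made up for illustration.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def total_loss(policy_loss, aux_prediction, expert_action, aux_weight=0.1):
    """Policy loss plus the weighted auxiliary MSE term (assumed combination)."""
    return policy_loss + aux_weight * mse(aux_prediction, expert_action)


# Toy 7-dimensional action: the auxiliary head is off by 0.5 in one dimension.
expert = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
predicted = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

aux = mse(predicted, expert)  # 0.25 / 7
loss = total_loss(policy_loss=0.25, aux_prediction=predicted, expert_action=expert)
print(f"aux={aux:.4f} total={loss:.4f}")
```

Because the auxiliary head only sees the MLLM embedding, this term can shrink only if the embedding carries information about the scene, not just the instruction.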
Environment Setup
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 1× RTX 3090 (24GB) | 1× A100 (40GB) |
| RAM | 32GB | 64GB |
| Storage | 200GB SSD | 500GB SSD |
| CUDA | 11.8+ | 12.1 |
Full OpenHelix training takes approximately 3-4 days on an A100. For inference-only with a pre-trained checkpoint, an RTX 3090 is sufficient.
Create Conda Environment
# Python 3.8 is required — OpenHelix does not support Python 3.10+ yet
conda create -n openhelix python=3.8
conda activate openhelix
# Install PyTorch with CUDA 11.8
conda install pytorch==2.0.1 torchvision==0.15.2 \
torchaudio==2.0.2 pytorch-cuda=11.8 \
-c pytorch -c nvidia
Clone and Install Dependencies
git clone https://github.com/OpenHelix-Team/OpenHelix
cd OpenHelix
# Install with key dependencies
pip install -r requirements.txt
# Install CALVIN simulator (submodule)
git submodule update --init --recursive
# Install DGL (Deep Graph Library) — required for 3D Diffuser Actor
# Choose version matching your CUDA
pip install dgl==1.1.0 -f https://data.dgl.ai/wheels/cu118/repo.html
# Flash Attention to accelerate MLLM inference
pip install flash-attn==2.5.9 --no-build-isolation
Note: flash-attn takes 5-10 minutes to compile. Don't close the terminal.
Preparing the CALVIN Dataset
CALVIN (Composing Actions from Language and Vision) is one of the most widely used benchmarks for language-conditioned manipulation. The ABC-D split trains on environments A, B, and C and tests on the unseen environment D.
Download Dataset
# Dataset is large (~300GB for full split)
cd ~/data
wget https://calvin.cs.uni-freiburg.de/dataset/task_ABC_D.zip
unzip task_ABC_D.zip
# Directory structure after extraction:
# task_ABC_D/
# ├── training/ # Environments A, B, C
# │ ├── episode_*.npz # Demonstration episodes
# │ └── lang_annotations/
# ├── validation/ # Environment D
# └── statistics.yaml
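Before starting a download of this size, it is worth checking free disk space. The sketch below uses only the standard library; the 2× headroom factor is an assumption that accounts for the zip and its extracted copy sitting on disk at the same time, and the 300 GB figure is the rough size quoted above.

```python
import shutil
from pathlib import Path


def has_room(free_bytes, required_gb, headroom=2.0):
    """True if free space covers the dataset size times a safety factor.

    headroom=2.0 is an assumed factor: the archive and the unpacked files
    coexist on disk until the archive is deleted.
    """
    required_bytes = required_gb * headroom * 1024 ** 3
    return free_bytes >= required_bytes


# Point this at your dataset directory, e.g. Path.home() / "data".
data_dir = Path.cwd()
free = shutil.disk_usage(data_dir).free
print(f"free: {free / 1024 ** 3:.0f} GiB, enough: {has_room(free, required_gb=300)}")
```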
Pre-encode Language Instructions (Optional but Recommended)
OpenHelix uses a CLIP text encoder for language instructions. Pre-encoding speeds up training significantly:
cd OpenHelix
python scripts/encode_instructions.py \
--dataset_path ~/data/task_ABC_D \
--output_path ~/data/task_ABC_D/lang_embeddings \
--encoder clip-vit-base-patch32
# Or download pre-encoded from HuggingFace (faster)
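python -c "
import os
from huggingface_hub import hf_hub_download
# Python does not expand '~' by itself, so resolve the path explicitly
hf_hub_download(
    repo_id='OpenHelix-Team/OpenHelix',
    filename='lang_embeddings.tar.gz',
    local_dir=os.path.expanduser('~/data/task_ABC_D/')
)
"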
Verify Dataset
python scripts/verify_dataset.py --path ~/data/task_ABC_D
# Expected output:
# ✓ Training episodes: 23,856
# ✓ Validation episodes: 1,000
# ✓ Language annotations: 34 unique tasks
# ✓ CLIP embeddings: found
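If you want a quick sanity check that does not depend on the repository script, a few lines of standard-library Python can count the files matching the `episode_*.npz` pattern from the directory layout shown earlier. This is a sketch, not a reimplementation of `verify_dataset.py`, and the file count it reports is not guaranteed to match the episode numbers printed above.

```python
import tempfile
from pathlib import Path


def count_episode_files(split_dir):
    """Count episode_*.npz files directly inside a split directory."""
    return sum(1 for _ in Path(split_dir).expanduser().glob("episode_*.npz"))


# Demo on a throwaway directory that mimics the layout shown earlier.
with tempfile.TemporaryDirectory() as tmp:
    training = Path(tmp) / "training"
    training.mkdir()
    for i in range(3):
        (training / f"episode_{i:07d}.npz").touch()
    (training / "lang_annotations").mkdir()
    demo_count = count_episode_files(training)

print(demo_count)  # 3
```

On a real download you would call `count_episode_files("~/data/task_ABC_D/training")` and the same for `validation`.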
Training
Step 1: Pre-train the Projection Layer (Token Bridge Alignment)
Do not skip this step. Pre-alignment is what separates OpenHelix from implementations that train everything jointly from a randomly initialized projection:
cd OpenHelix
bash scripts/train_projection_pretrain.sh \
--data_path ~/data/task_ABC_D \
--output_dir ./checkpoints/projection_pretrain \
--epochs 10 \
--batch_size 64 \
--lr 1e-4
# Takes ~2-3 hours on A100
# Goal: projection layer learns to align with MLLM space first
Step 2: Joint Training with Auxiliary Task
bash train_trajectory_lcb_pt_act_simple.sh \
--data_path ~/data/task_ABC_D \
--pretrained_proj ./checkpoints/projection_pretrain/best.pt \
--output_dir ./checkpoints/openhelix_full \
--llm_model llava-hf/llava-1.5-7b-hf \
--policy_lr 1e-4 \
--prompt_lr 1e-3 \
--aux_weight 0.1 \
--epochs 100 \
--batch_size 32
Flag explanations:
- --prompt_lr 1e-3: Prompt tokens learn faster than policy (1e-4) because they have fewer parameters
- --aux_weight 0.1: Auxiliary loss weight; 0.1 is the optimal value per paper ablation
- --pretrained_proj: REQUIRED; the pre-aligned projection from Step 1
Monitoring Training
# TensorBoard logs
tensorboard --logdir ./checkpoints/openhelix_full/logs
# Metrics to watch:
# - train/policy_loss: should decrease consistently
# - train/aux_loss: should decrease — if it increases, reduce aux_weight
# - val/avg_seq_len: main metric, target > 3.5 after epoch 50
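The "if it increases" rule for the auxiliary loss can be checked mechanically on logged values. The helper below is a hypothetical sketch that compares the mean of the most recent window with the window before it; the window size of 5 is an arbitrary choice, and the loss histories are invented for the demo.

```python
def is_trending_up(values, window=5):
    """True if the mean of the last `window` values exceeds the mean of the
    `window` values before them. Returns False with fewer than 2 * window values."""
    if len(values) < 2 * window:
        return False
    recent = values[-window:]
    previous = values[-2 * window:-window]
    return sum(recent) / window > sum(previous) / window


falling = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
rising = [0.5, 0.4, 0.3, 0.2, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

print(is_trending_up(falling))  # False
print(is_trending_up(rising))   # True
```

Comparing window means rather than consecutive values avoids reacting to the step-to-step noise that any minibatch loss curve shows.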
Inference and Evaluation
Inference with Asynchronous Mode
OpenHelix uses asynchronous inference to handle the latency mismatch between System 1 and System 2. System 2 (LLaVA) runs at roughly 7 Hz and System 1 (the diffusion policy) at 200 Hz, so async mode lets both operate concurrently without blocking each other:
# --async_delay 10: 10-step delay between System 1 and System 2
# (a comment after a trailing backslash would break the line continuation)
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
  --checkpoint ./checkpoints/openhelix_full/epoch_100.pt \
  --data_path ~/data/task_ABC_D \
  --split validation \
  --async_delay 10 \
  --num_sequences 1000 \
  --output_path ./results/eval_epoch100.json
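One plausible reading of a 10-step delay is that the latent System 1 uses at step t was computed from the observation at step t - 10. The sketch below models that with a small queue. It is an illustration of the idea under that assumption, not the script's actual logic, and `delayed_latents` is a hypothetical name.

```python
from collections import deque


def delayed_latents(num_steps, delay):
    """Return (step, latent_source_step) pairs where the latent lags by `delay`.

    latent_source_step is None until the first System 2 result is ready.
    """
    pending = deque()
    pairs = []
    for step in range(num_steps):
        pending.append(step)            # System 2 starts working on this frame
        source = None
        if len(pending) > delay:
            source = pending.popleft()  # the result from `delay` steps ago is ready
        pairs.append((step, source))
    return pairs


pairs = delayed_latents(num_steps=12, delay=10)
print(pairs[9], pairs[10], pairs[11])  # (9, None) (10, 0) (11, 1)
```

During the first `delay` steps there is no latent at all, so a real system needs a fallback for the warm-up phase, for example blocking on the first System 2 result.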
Using Pre-trained Checkpoint (No Training Required)
If you just want to try inference:
# Download checkpoint from HuggingFace
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='OpenHelix-Team/OpenHelix',
local_dir='./pretrained_checkpoints'
)
"
# Merge safetensor shards into single PyTorch file
python scripts/merge_safetensors.py \
--input_dir ./pretrained_checkpoints \
--output_path ./pretrained_checkpoints/openhelix_merged.pt
# Run inference
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
--checkpoint ./pretrained_checkpoints/openhelix_merged.pt \
--data_path ~/data/task_ABC_D \
--split validation
Reading Evaluation Results
python scripts/analyze_results.py --result_path ./results/eval_epoch100.json
# Output format:
# Task | 1-task | 2-task | 3-task | 4-task | 5-task | Avg Seq Len
# push block | 0.96 | 0.87 | 0.75 | 0.64 | 0.53 | 3.75
# stack block | 0.91 | 0.79 | 0.68 | 0.55 | 0.44 | 3.37
# ...
# Overall | 0.933 | 0.818 | 0.710 | 0.598 | 0.491 | 4.08
Avg Sequence Length 4.08 means the robot completes an average of 4.08 consecutive tasks before failing in a chain of up to 5 tasks.
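The metric follows directly from the per-column success rates: the expected number of consecutive tasks completed equals the sum of the probabilities of completing at least 1, 2, ..., 5 tasks. The sketch below checks this on the push block row from the table above and on a handful of made-up rollouts.

```python
def avg_seq_len_from_rates(success_rates):
    """Average chain length = sum of success rates at 1..N tasks in a row."""
    return sum(success_rates)


def success_rates(completed_counts, chain_length=5):
    """Fraction of chains that completed at least k tasks, for k = 1..chain_length."""
    n = len(completed_counts)
    return [
        sum(1 for c in completed_counts if c >= k) / n
        for k in range(1, chain_length + 1)
    ]


# Push block row from the table above.
push_block = [0.96, 0.87, 0.75, 0.64, 0.53]
print(round(avg_seq_len_from_rates(push_block), 2))  # 3.75

# Made-up rollouts: number of tasks completed in each 5-task chain.
rollouts = [5, 5, 3, 0, 2]
direct_mean = sum(rollouts) / len(rollouts)
via_rates = avg_seq_len_from_rates(success_rates(rollouts))
print(direct_mean, round(via_rates, 2))  # 3.0 3.0
```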
Results and Analysis
Comparison with SOTA
| Model | CALVIN ABC-D Avg Seq Len |
|---|---|
| Seer (2024) | 3.65 |
| GR-MG (2024) | 3.88 |
| UniVLA (2025) | 3.92 |
| RoboDual (2025) | 3.98 |
| OpenHelix (2025) | 4.08 |
4 Key Lessons from the Ablation Study
Lesson 1: Pre-trained policy >> Training from scratch (+1.2 seq len). Never try to train 3D Diffuser Actor from random weights. Always start from a pre-trained policy and fine-tune; the difference is enormous.
Lesson 2: Prompt tuning is sufficient, no need to fully fine-tune MLLM. Full fine-tuning of LLaVA doesn't improve performance and destroys generalization. Prompt tuning with ~1% of parameters is optimal.
Lesson 3: Auxiliary task is mandatory for SOTA (+0.4 seq len). Without the auxiliary task, the MLLM nearly ignores visual input. This is the paper's most important finding.
Lesson 4: Pre-alignment prevents model collapse. Joint training from scratch with a random projection layer typically leads to training instability. Pre-aligning for 10 epochs is essential insurance.
Common Troubleshooting
CUDA Out of Memory
# Reduce batch size
--batch_size 16 # instead of 32
# Or enable gradient checkpointing
--gradient_checkpointing True
flash_attn ImportError
# Uninstall, then force a rebuild from source against your local CUDA toolkit
# (the environment variable must come before the command, not after it)
pip uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn==2.5.9 --no-build-isolation
Training Loss Not Decreasing
Check in order: (1) Are you using --pretrained_proj? (2) Is aux_weight too high (try 0.05)? (3) Is the learning rate appropriate for your batch size (scale linearly)?
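For point (3), linear scaling means the learning rate changes in proportion to the batch size. A minimal worked example, taking the values from the joint-training command as the baseline (batch size 32, policy LR 1e-4, prompt LR 1e-3); `scale_lr` is a hypothetical helper, not a repository function.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate proportional to batch size."""
    return base_lr * new_batch_size / base_batch_size


# Halving the batch size from 32 to 16 (e.g. after an out-of-memory error)
# halves both learning rates.
policy_lr = scale_lr(1e-4, base_batch_size=32, new_batch_size=16)
prompt_lr = scale_lr(1e-3, base_batch_size=32, new_batch_size=16)
print(f"policy_lr={policy_lr:.1e} prompt_lr={prompt_lr:.1e}")
```

Linear scaling is a rule of thumb; treat the result as a starting point and confirm it against the loss curves.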
Extensions: Related Repos from OpenHelix-Team
Once you're comfortable with OpenHelix, explore:
- VLA-Adapter: Tiny-scale VLA with real-world ALOHA deployment (much lighter)
- VLA-RFT: Reinforcement Fine-Tuning for VLA (RL-based post-training for robots)
- HiF-VLA: Hierarchical spatiotemporal VLA for long-horizon tasks
- Spatial-Forcing: ICLR 2026 paper, improving 3D spatial understanding