The Core Problem: Think Fast or Think Deep?
Modern robot manipulation faces a fundamental tension.
Large language models — and their multimodal cousins — are excellent at scene understanding, language grounding, and long-horizon planning. But they run at 7–9 Hz. Traditional visuomotor policies react at 200+ Hz but lack high-level reasoning. You can't have both… or can you?
Dual-system VLA architectures resolve this by mirroring how the human brain works: a slow, deliberative System 2 decides what to do, while a fast, reactive System 1 handles how to do it moment-to-moment. System 2 runs infrequently to set goals; System 1 executes at high frequency to achieve them.
The concept has been validated — Figure AI's Helix demonstrated impressive real-robot results. But nearly every dual-system VLA in the literature is either closed-source or described at such a high level that reproducing it is practically impossible.
That's the gap OpenHelix fills.
What Is OpenHelix?
OpenHelix (arXiv:2505.03912) — "A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation" — makes three contributions:
- A structured survey of the dual-system VLA landscape, categorizing the key design choices the community debates
- A rigorous empirical analysis that tests which design choices actually matter versus which are just plausible-sounding assumptions
- A complete open-source implementation: code, pretrained weights, training scripts, and evaluation scripts — all public
OpenHelix is evaluated on CALVIN ABC-D, one of the toughest benchmarks for language-conditioned robot manipulation, and achieves state-of-the-art results among open dual-system VLA models.
Architecture: How the Two Systems Work Together
System 2: The Slow Thinker (MLLM)
OpenHelix uses LLaVA-7B as the System 2 backbone — a multimodal LLM pretrained on internet-scale image-text data.
Inputs:
- RGB image from the robot's camera
- Natural language task instruction (e.g., "pick up the red cube and place it in the drawer")
LLaVA-7B processes these and produces a 4096-dimensional latent vector encoding the semantic intent of the task. A linear projection layer maps this down to 512 dimensions, compatible with System 1.
Crucially, OpenHelix does not fine-tune all of LLaVA-7B. That would be compute-prohibitive and would likely destroy the model's generalization. Instead, it uses prompt tuning: a single learnable <ACT> token is appended to the language instruction. Only this token's embedding — plus the projection layer — is trained. The 7B LLaVA weights remain frozen throughout.
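To make the mechanics concrete, here is a minimal PyTorch sketch of that prompt-tuning setup — the class name, shapes, and token placement are illustrative assumptions, not the repo's actual code. The only trainable pieces are one learnable `<ACT>` embedding appended to the prompt and a 4096 → 512 linear projection over the hidden state at that position:

```python
import torch
import torch.nn as nn

class ActTokenBridge(nn.Module):
    """Sketch of the prompt-tuning bridge: a single learnable <ACT> embedding
    plus a linear projection. The 7B MLLM itself stays frozen throughout."""

    def __init__(self, mllm_dim: int = 4096, policy_dim: int = 512):
        super().__init__()
        # The one learnable token embedding (Stage 1 trains only this + proj).
        self.act_token = nn.Parameter(torch.randn(1, 1, mllm_dim) * 0.02)
        self.proj = nn.Linear(mllm_dim, policy_dim)

    def append_token(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq_len, mllm_dim) from the frozen embedder;
        # we append <ACT> as the final position of the sequence.
        batch = prompt_embeds.shape[0]
        return torch.cat([prompt_embeds, self.act_token.expand(batch, -1, -1)], dim=1)

    def goal_feature(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, mllm_dim) MLLM output; read the
        # embedding at the <ACT> position (last) and project 4096 -> 512.
        return self.proj(hidden_states[:, -1])

bridge = ActTokenBridge()
prompt = torch.randn(2, 10, 4096)   # stand-in for frozen-MLLM prompt embeddings
hidden = torch.randn(2, 11, 4096)   # stand-in for MLLM output hidden states
assert bridge.append_token(prompt).shape == (2, 11, 4096)
assert bridge.goal_feature(hidden).shape == (2, 512)
```

The trainable parameter count here is roughly one embedding row plus one linear layer — a tiny fraction of the frozen 7B backbone, which is what makes Stage 1 so cheap.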
An auxiliary task is added on top: the MLLM is asked to predict gripper state and action trajectory directly from its own embeddings. This seems redundant (System 1 will produce the actual actions), but empirical results show it's critical. Without it, the MLLM doesn't attend closely to the visual input — it processes language but ignores scene details. The auxiliary task forces genuine visual reasoning.
System 1: The Fast Actor (3D Diffusion Policy)
System 1 uses 3D Diffuser Actor — a diffusion-based policy that operates in 3D space, combining RGB features with point cloud geometry.
Inputs:
- RGB visual features
- 3D point cloud of the scene
- Proprioceptive state (joint positions, gripper state)
- 512-D goal feature from System 2
System 1 runs denoising diffusion to generate an action trajectory — a sequence of end-effector poses or joint positions executed at high frequency.
Importantly, System 1 is not trained from scratch. OpenHelix starts from a pretrained 3D Diffuser Actor and fine-tunes it. Empirical comparison: pretrained initialization achieves 96% single-task success vs. 89% for training from scratch on CALVIN.
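As a rough illustration of the sampling side, the sketch below runs a toy reverse diffusion process conditioned on the goal feature. The update rule is deliberately simplified (no noise schedule or variance terms) and the policy is a stub showing only the call signature — this is not 3D Diffuser Actor's actual scheduler:

```python
import torch

@torch.no_grad()
def denoise_trajectory(policy, goal_feat, obs, horizon=16, act_dim=7, steps=10):
    """Schematic DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise an action trajectory, conditioned on the 512-D goal
    feature from System 2. A real sampler follows a proper noise schedule."""
    traj = torch.randn(1, horizon, act_dim)      # start from pure noise
    for t in reversed(range(steps)):
        eps = policy(traj, t, obs, goal_feat)    # predicted noise at step t
        traj = traj - eps / steps                # simplified denoising update
    return traj

# Stub policy: illustrates the conditioning inputs only.
stub = lambda traj, t, obs, goal: torch.zeros_like(traj)
assert denoise_trajectory(stub, torch.randn(1, 512), obs=None).shape == (1, 16, 7)
```

The point to take away is the conditioning: the same denoising network is steered toward different behaviors purely by the 512-D goal vector it receives from System 2.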
The Bridge: Learned Token + Linear Projection
The connection mechanism is the architectural highlight:
Input:  [Visual tokens] + [Language tokens] + [<ACT> token]
                         │
              LLaVA-7B forward pass
                         │
           Embedding at <ACT> position
                         │
          Linear projection (4096 → 512)
                         │
              Goal feature vector
                         │
               3D Diffuser Actor
                         │
               Action trajectory
Advantages of this design:
- Zero changes to MLLM architecture — only one additional token embedding
- Parameter efficient — Stage 1 training touches only the token embedding and projection
- Modular — swapping the MLLM backbone requires no redesign of the bridge
Asynchronous Inference
System 2 (MLLM) is far slower than System 1 (diffusion policy). In practice, you cannot run LLaVA-7B at every control timestep.
OpenHelix evaluates asynchronous operation: System 2 updates the goal feature every N steps of System 1. The results are striking:
| Delay N (steps) | Performance drop |
|---|---|
| N=1 (sync) | Baseline |
| N=10 | ~1% |
| N=30 | ~2% |
| N=60 | ~3% |
This means you can run the MLLM on a separate GPU at low frequency while the policy runs at high frequency on edge hardware — a practical deployment pattern.
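The pattern reduces to a simple control loop. Below is a pure-Python sketch with stub functions (in a real deployment the two systems would run on separate devices or processes, with the goal feature passed over a queue):

```python
def control_loop(system2, system1, observe, act, total_steps=120, refresh_n=10):
    """Asynchronous dual-system loop: System 2 (slow MLLM) refreshes the goal
    feature only every `refresh_n` steps; System 1 (fast policy) acts at every
    step using the most recent goal it has."""
    goal, s2_calls = None, 0
    for step in range(total_steps):
        if step % refresh_n == 0:          # slow path: re-run the MLLM
            goal = system2(observe())
            s2_calls += 1
        act(system1(observe(), goal))      # fast path: diffusion policy
    return s2_calls

# With stubs: over 120 control steps at N=10, the MLLM runs only 12 times.
calls = control_loop(lambda obs: "goal", lambda obs, g: "action",
                     lambda: "obs", lambda a: None)
assert calls == 12
```

At N=60 the MLLM would run just twice over the same window — consistent with the table above, where even that refresh rate costs only a few points of success rate.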
Installation
Requirements
- Python 3.8
- CUDA 11.8
- GPU with at least 24 GB VRAM (to run LLaVA-7B + 3D Diffuser Actor concurrently)
- Ampere or newer GPU (RTX 3090/4090, A100) for Flash Attention 2
Step 1: Create Environment
conda update conda
conda create -n openhelix python=3.8 -y
conda activate openhelix
Step 2: Install CALVIN
OpenHelix is benchmarked on CALVIN, so install it first:
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd calvin/calvin_env && git checkout main && cd ..
pip install setuptools==57.5.0
./install.sh
cd ..
Step 3: Clone and Install OpenHelix
git clone [email protected]:OpenHelix-robot/OpenHelix.git
cd OpenHelix
pip install -e .
Step 4: Install Dependencies
# Diffusion library
pip install diffusers["torch"]
# Deep Graph Library (required by 3D Diffuser Actor)
pip install dgl -f https://data.dgl.ai/wheels/torch-2.2/cu118/repo.html
# Flash Attention 2 (significantly speeds up MLLM inference)
pip install packaging ninja
pip install flash-attn==2.5.9.post1 --no-build-isolation
Note: Flash Attention takes 10–20 minutes to compile. This is normal — let it finish.
Step 5: Download CALVIN Dataset
cd calvin/dataset
sh download_data.sh ABC # ~100 GB download
cd ../..
# Package into OpenHelix format
python data_preprocessing/package_calvin.py --split training
python data_preprocessing/package_calvin.py --split validation
Training: Two Stages
Stage 1: Pre-Alignment (2,000 iterations)
Only the <ACT> token embedding and linear projection are trained. Both MLLM and policy are fully frozen. The goal: align the MLLM's latent space with what the diffusion policy expects.
Stage 2: Policy Fine-Tuning (100,000 iterations)
The diffusion policy is unfrozen and trained jointly. The MLLM remains frozen; only the prompt token is updated. Combined loss:
- Diffusion loss: denoising action trajectory (primary policy loss)
- Auxiliary MLLM loss: BCE for gripper state + L1 for position/rotation, computed from MLLM embeddings directly
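Schematically, the combined Stage 2 objective might be assembled as below. This is a sketch: `aux_weight` is an assumed weighting knob, not a hyperparameter reported in the paper, and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def stage2_loss(eps_pred, eps_true, grip_logit, grip_label, pose_pred, pose_true,
                aux_weight=0.1):
    """Sketch of the Stage 2 objective: diffusion denoising loss plus the
    auxiliary MLLM loss (BCE on gripper open/close, L1 on position/rotation).
    aux_weight is an assumed value, not taken from the paper."""
    diffusion = F.mse_loss(eps_pred, eps_true)                    # primary policy loss
    gripper = F.binary_cross_entropy_with_logits(grip_logit, grip_label)
    pose = F.l1_loss(pose_pred, pose_true)
    return diffusion + aux_weight * (gripper + pose)

loss = stage2_loss(torch.randn(4, 16, 7), torch.randn(4, 16, 7),   # noise pred/target
                   torch.randn(4, 1), torch.ones(4, 1),            # gripper logit/label
                   torch.randn(4, 6), torch.randn(4, 6))           # pose pred/target
assert loss.ndim == 0 and loss.item() > 0
```

Note that the auxiliary terms backpropagate only into the `<ACT>` token embedding and projection, since the MLLM weights stay frozen.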
bash scripts/train_trajectory_lcb_pt_act_simple.sh
Full training takes approximately 2–3 days on 8× A100. For those without a cluster, pretrained weights are available on Hugging Face:
# Hub: OpenHelix/openhelix
# Variants:
# prompt_tuning — without auxiliary task
# prompt_tuning_aux — with auxiliary task (recommended)
# Merge sharded safetensors into pytorch_model.bin before loading
Inference and Evaluation
# Evaluate on CALVIN ABC-D with asynchronous inference (N=10)
bash scripts/test_trajectory_lcb_pt_act_simple_asy10.sh
This runs OpenHelix over the full CALVIN ABC-D test suite and reports:
- Single-task success rate
- Multi-task chain success rates (2 through 5 consecutive tasks)
- Average task length (EP_LEN)
Results
CALVIN ABC-D
| Model | 1 Task | 2 Tasks | 3 Tasks | 4 Tasks | 5 Tasks | Avg Len |
|---|---|---|---|---|---|---|
| OpenHelix | 93.3% | 81.8% | 67.2% | 55.1% | 46.0% | 3.45 |
| LCB | 89.5% | 74.9% | 60.4% | 47.5% | 37.2% | 3.09 |
| SuSIE | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69 |
All rows evaluated at EP_LEN = 360. These numbers are state of the art among open-source dual-system VLA models.
CALVIN-E (Language Generalization)
CALVIN-E uses richer, more varied instructions to probe generalization:
- Single-task: 78.9%
- Two-task chain: 57.1%
Key Empirical Findings
Finding 1 — Prompt tuning beats full MLLM fine-tuning:
Full fine-tuning improves in-distribution performance but hurts instruction generalization. Prompt tuning with an auxiliary task is the sweet spot: strong task performance while preserving the MLLM's broad language understanding.
Finding 2 — Always start from a pretrained policy:
Fine-tune from pretrained 3D Diffuser Actor: 96% single-task
Train from scratch: 89% single-task
The pretrained model encodes geometry and motion priors that pure task-specific training can't replicate efficiently.
Finding 3 — Asynchronous inference is surprisingly robust:
The MLLM's goal representation doesn't need refreshing at every step. You can run System 2 very infrequently — N=60 steps drops only ~3% — enabling practical deployment where MLLM runs on a high-power server GPU and the policy runs on a low-latency edge device.
Finding 4 — The auxiliary task is non-negotiable:
Without it, the MLLM processes language instructions without genuinely attending to the visual scene. The auxiliary task creates a training signal that forces the model to extract action-relevant visual features, dramatically improving the quality of latent representations passed to System 1.
Why OpenHelix Matters Beyond Its Benchmark Score
If you've studied VLA models for manipulation or diffusion policy in modern robotics, the architectural ideas here aren't entirely new. What makes OpenHelix notable isn't novelty alone — it's epistemic transparency:
- Full reproducibility: Code, weights, and scripts are public. Researchers can verify claims without waiting for an "official release" that never comes
- Honest ablations: The paper reports results for design choices that didn't work as expected, not just the winning configuration
- Practical guidance: The async inference analysis, pretraining findings, and auxiliary task insights directly answer questions engineers face when building real systems
Compare this to the typical "SOTA" robotics paper — a PDF and a polished demo video, no code — and the difference is significant.
For foundational background on the techniques OpenHelix builds on, see Diffusion Policy: How Denoising Generates Robot Actions and VLA Models: From ACT to π₀.
Open Questions and Future Directions
OpenHelix's roadmap includes:
- Real robot deployment — Bridging the sim-to-real gap from CALVIN to physical hardware
- Humanoid robots — Extending the architecture to whole-body control
- Larger MLLM backbones — Testing LLaVA-13B or more capable VLMs
- Richer manipulation tasks — Moving beyond pick-and-place to contact-rich and dexterous manipulation
The biggest open question: CALVIN is simulation-based. Real-world deployment introduces perception noise, contact dynamics, and distribution shifts that no simulator fully captures. The sim-to-real gap remains the hardest unsolved problem in this line of work.
Summary
OpenHelix earns its place as a reference point in dual-system VLA research not just because it achieves SOTA, but because it tells you why it achieves SOTA and gives you everything needed to reproduce, extend, or critique it.
The key design decisions in brief:
- System 2: LLaVA-7B frozen, trained only via prompt tuning + auxiliary task
- Bridge: Learned <ACT> token → linear projection (4096 → 512)
- System 1: Pretrained 3D Diffuser Actor, fine-tuned with diffusion loss
- Inference: Asynchronous, N up to 60 steps with minimal degradation
If you're building robot manipulation systems and want a strong, reproducible open-source baseline — OpenHelix is the right starting point.
GitHub: OpenHelix-robot/OpenHelix
Paper: arXiv:2505.03912
Weights: OpenHelix/openhelix on Hugging Face