The Core Problem: Think Fast or Think Deep?
Modern robot manipulation faces a fundamental tension.
Large language models — and their multimodal cousins — are excellent at scene understanding, language grounding, and long-horizon planning. But they run at 7–9 Hz. Traditional visuomotor policies react at 200+ Hz but lack high-level reasoning. You can't have both… or can you?
Dual-system VLA architectures resolve this by mirroring how the human brain works: a slow, deliberative System 2 decides what to do, while a fast, reactive System 1 handles how to do it moment-to-moment. System 2 runs infrequently to set goals; System 1 executes at high frequency to achieve them.
The concept has been validated — Figure AI's Helix demonstrated impressive real-robot results. But nearly every dual-system VLA in the literature is either closed-source or described at such a high level that reproducing it is practically impossible.
That's the gap OpenHelix fills.
What Is OpenHelix?
OpenHelix (arXiv:2505.03912) — "A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation" — makes three contributions:
- A structured survey of the dual-system VLA landscape, categorizing the key design choices the community debates
- A rigorous empirical analysis that tests which design choices actually matter versus which are just plausible-sounding assumptions
- A complete open-source implementation: code, pretrained weights, training scripts, and evaluation scripts — all public
OpenHelix is evaluated on CALVIN ABC-D, one of the toughest benchmarks for language-conditioned robot manipulation, and achieves state-of-the-art results among open dual-system VLA models.
Architecture: How the Two Systems Work Together
System 2: The Slow Thinker (MLLM)
OpenHelix uses LLaVA-7B as the System 2 backbone — a multimodal LLM pretrained on internet-scale image-text data.
Inputs:
- RGB image from the robot's camera
- Natural language task instruction (e.g., "pick up the red cube and place it in the drawer")
LLaVA-7B processes these and produces a 4096-dimensional latent vector encoding the semantic intent of the task. A linear projection layer maps this down to 512 dimensions, compatible with System 1.
Crucially, OpenHelix does not fine-tune all of LLaVA-7B. That would be compute-prohibitive and would likely destroy the model's generalization. Instead, it uses prompt tuning: a single learnable <ACT> token is appended to the language instruction. Only this token's embedding — plus the projection layer — is trained. The 7B LLaVA weights remain frozen throughout.
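To make the mechanics concrete, here is a minimal PyTorch sketch of that prompt-tuning setup — the class name, shapes, and token placement are illustrative assumptions, not the repo's actual code. The only trainable pieces are one learnable `<ACT>` embedding appended to the prompt and a 4096 → 512 linear projection over the hidden state at that position:

```python
import torch
import torch.nn as nn

class ActTokenBridge(nn.Module):
    """Sketch of the prompt-tuning bridge: a single learnable <ACT> embedding
    plus a linear projection. The 7B MLLM itself stays frozen throughout."""

    def __init__(self, mllm_dim: int = 4096, policy_dim: int = 512):
        super().__init__()
        # The one learnable token embedding (Stage 1 trains only this + proj).
        self.act_token = nn.Parameter(torch.randn(1, 1, mllm_dim) * 0.02)
        self.proj = nn.Linear(mllm_dim, policy_dim)

    def append_token(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq_len, mllm_dim) from the frozen embedder;
        # we append <ACT> as the final position of the sequence.
        batch = prompt_embeds.shape[0]
        return torch.cat([prompt_embeds, self.act_token.expand(batch, -1, -1)], dim=1)

    def goal_feature(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, mllm_dim) MLLM output; read the
        # embedding at the <ACT> position (last) and project 4096 -> 512.
        return self.proj(hidden_states[:, -1])

bridge = ActTokenBridge()
prompt = torch.randn(2, 10, 4096)   # stand-in for frozen-MLLM prompt embeddings
hidden = torch.randn(2, 11, 4096)   # stand-in for MLLM output hidden states
assert bridge.append_token(prompt).shape == (2, 11, 4096)
assert bridge.goal_feature(hidden).shape == (2, 512)
```

The trainable parameter count here is roughly one embedding row plus one linear layer — a tiny fraction of the frozen 7B backbone, which is what makes Stage 1 so cheap.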
An auxiliary task is added on top: the MLLM is asked to predict gripper state and action trajectory directly from its own embeddings. This seems redundant (System 1 will produce the actual actions), but empirical results show it's critical. Without it, the MLLM doesn't attend closely to the visual input — it processes language but ignores scene details. The auxiliary task forces genuine visual reasoning.
System 1: The Fast Actor (3D Diffusion Policy)
System 1 uses 3D Diffuser Actor — a diffusion-based policy that operates in 3D space, combining RGB features with point cloud geometry.
Inputs:
- RGB visual features
- 3D point cloud of the scene
- Proprioceptive state (joint positions, gripper state)
- 512-D goal feature from System 2
System 1 runs denoising diffusion to generate an action trajectory — a sequence of end-effector poses or joint positions executed at high frequency.
Importantly, System 1 is not trained from scratch. OpenHelix starts from a pretrained 3D Diffuser Actor and fine-tunes it. Empirical comparison: pretrained initialization achieves 96% single-task success vs. 89% for training from scratch on CALVIN.
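As a rough illustration of the sampling side, the sketch below runs a toy reverse diffusion process conditioned on the goal feature. The update rule is deliberately simplified (no noise schedule or variance terms) and the policy is a stub showing only the call signature — this is not 3D Diffuser Actor's actual scheduler:

```python
import torch

@torch.no_grad()
def denoise_trajectory(policy, goal_feat, obs, horizon=16, act_dim=7, steps=10):
    """Schematic DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise an action trajectory, conditioned on the 512-D goal
    feature from System 2. A real sampler follows a proper noise schedule."""
    traj = torch.randn(1, horizon, act_dim)      # start from pure noise
    for t in reversed(range(steps)):
        eps = policy(traj, t, obs, goal_feat)    # predicted noise at step t
        traj = traj - eps / steps                # simplified denoising update
    return traj

# Stub policy: illustrates the conditioning inputs only.
stub = lambda traj, t, obs, goal: torch.zeros_like(traj)
assert denoise_trajectory(stub, torch.randn(1, 512), obs=None).shape == (1, 16, 7)
```

The point to take away is the conditioning: the same denoising network is steered toward different behaviors purely by the 512-D goal vector it receives from System 2.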
The Bridge: Learned Token + Linear Projection
The connection mechanism is the architectural highlight:
Input:  [Visual tokens] + [Language tokens] + [<ACT> token]
                         │
              LLaVA-7B forward pass
                         │
           Embedding at <ACT> position
                         │
          Linear projection (4096 → 512)
                         │
              Goal feature vector
                         │
               3D Diffuser Actor
                         │
               Action trajectory
Advantages of this design:
- Zero changes to MLLM architecture — only one additional token embedding
- Parameter efficient — Stage 1 training touches only the token embedding and projection
- Modular — swapping the MLLM backbone requires no redesign of the bridge
Asynchronous Inference
System 2 (MLLM) is far slower than System 1 (diffusion policy). In practice, you cannot run LLaVA-7B at every control timestep.
OpenHelix evaluates asynchronous operation: System 2 updates the goal feature every N steps of System 1. The results are striking:
| Delay N (steps) | Performance drop |
|---|---|
| N=1 (sync) | Baseline |
| N=10 | ~1% |
| N=30 | ~2% |
| N=60 | ~3% |
This means you can run the MLLM on a separate GPU at low frequency while the policy runs at high frequency on edge hardware — a practical deployment pattern.
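The pattern reduces to a simple control loop. Below is a pure-Python sketch with stub functions (in a real deployment the two systems would run on separate devices or processes, with the goal feature passed over a queue):

```python
def control_loop(system2, system1, observe, act, total_steps=120, refresh_n=10):
    """Asynchronous dual-system loop: System 2 (slow MLLM) refreshes the goal
    feature only every `refresh_n` steps; System 1 (fast policy) acts at every
    step using the most recent goal it has."""
    goal, s2_calls = None, 0
    for step in range(total_steps):
        if step % refresh_n == 0:          # slow path: re-run the MLLM
            goal = system2(observe())
            s2_calls += 1
        act(system1(observe(), goal))      # fast path: diffusion policy
    return s2_calls

# With stubs: over 120 control steps at N=10, the MLLM runs only 12 times.
calls = control_loop(lambda obs: "goal", lambda obs, g: "action",
                     lambda: "obs", lambda a: None)
assert calls == 12
```

At N=60 the MLLM would run just twice over the same window — consistent with the table above, where even that refresh rate costs only a few points of success rate.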
Installation
Requirements
- Python 3.8
- CUDA 11.8
- GPU with at least 24 GB VRAM (to run LLaVA-7B + 3D Diffuser Actor concurrently)
- Ampere or newer GPU (RTX 3090/4090, A100) for Flash Attention 2
Step 1: Create Environment
conda update conda
conda create -n openhelix python=3.8 -y
conda activate openhelix
Step 2: Install CALVIN
OpenHelix is benchmarked on CALVIN, so install it first:
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd calvin/calvin_env && git checkout main && cd ..
pip install setuptools==57.5.0
./install.sh
cd ..
Step 3: Clone and Install OpenHelix
git clone [email protected]:OpenHelix-robot/OpenHelix.git
cd OpenHelix
pip install -e .
Step 4: Install Dependencies
# Diffusion library
pip install diffusers["torch"]
# Deep Graph Library (required by 3D Diffuser Actor)
pip install dgl -f https://data.dgl.ai/wheels/torch-2.2/cu118/repo.html
# Flash Attention 2 (significantly speeds up MLLM inference)
pip install packaging ninja
pip install flash-attn==2.5.9.post1 --no-build-isolation
Note: Flash Attention takes 10–20 minutes to compile. This is normal — let it finish.
Step 5: Download CALVIN Dataset
cd calvin/dataset
sh download_data.sh ABC # ~100 GB download
cd ../..
# Package into OpenHelix format
python data_preprocessing/package_calvin.py --split training
python data_preprocessing/package_calvin.py --split validation
Training: Two Stages
Stage 1: Pre-Alignment (2,000 iterations)
Only the <ACT> token embedding and linear projection are trained. Both MLLM and policy are fully frozen. The goal: align the MLLM's latent space with what the diffusion policy expects.
Stage 2: Policy Fine-Tuning (100,000 iterations)
The diffusion policy is unfrozen and trained jointly. The MLLM remains frozen; only the prompt token is updated. Combined loss:
- Diffusion loss: denoising action trajectory (primary policy loss)
- Auxiliary MLLM loss: BCE for gripper state + L1 for position/rotation, computed from MLLM embeddings directly
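Schematically, the combined Stage 2 objective might be assembled as below. This is a sketch: `aux_weight` is an assumed weighting knob, not a hyperparameter reported in the paper, and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def stage2_loss(eps_pred, eps_true, grip_logit, grip_label, pose_pred, pose_true,
                aux_weight=0.1):
    """Sketch of the Stage 2 objective: diffusion denoising loss plus the
    auxiliary MLLM loss (BCE on gripper open/close, L1 on position/rotation).
    aux_weight is an assumed value, not taken from the paper."""
    diffusion = F.mse_loss(eps_pred, eps_true)                    # primary policy loss
    gripper = F.binary_cross_entropy_with_logits(grip_logit, grip_label)
    pose = F.l1_loss(pose_pred, pose_true)
    return diffusion + aux_weight * (gripper + pose)

loss = stage2_loss(torch.randn(4, 16, 7), torch.randn(4, 16, 7),   # noise pred/target
                   torch.randn(4, 1), torch.ones(4, 1),            # gripper logit/label
                   torch.randn(4, 6), torch.randn(4, 6))           # pose pred/target
assert loss.ndim == 0 and loss.item() > 0
```

Note that the auxiliary terms backpropagate only into the `<ACT>` token embedding and projection, since the MLLM weights stay frozen.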
bash scripts/train_trajectory_lcb_pt_act_simple.sh
Full training takes approximately 2–3 days on 8× A100. For those without a cluster, pretrained weights are available on Hugging Face:
# Hub: OpenHelix/openhelix
# Variants:
# prompt_tuning — without auxiliary task
# prompt_tuning_aux — with auxiliary task (recommended)
# Merge sharded safetensors into pytorch_model.bin before loading
Inference and Evaluation
# Evaluate on CALVIN ABC-D with asynchronous inference (N=10)
bash scripts/test_trajectory_lcb_pt_act_simple_asy10.sh
This runs OpenHelix over the full CALVIN ABC-D test suite and reports:
- Single-task success rate
- Multi-task chain success rates (2 through 5 consecutive tasks)
- Average task length (EP_LEN)
Results
CALVIN ABC-D
| Model | 1 Task | 2 Tasks | 3 Tasks | 4 Tasks | 5 Tasks | Avg Len |
|---|---|---|---|---|---|---|
| OpenHelix | 93.3% | 81.8% | 67.2% | 55.1% | 46.0% | 3.45 |
| LCB | 89.5% | 74.9% | 60.4% | 47.5% | 37.2% | 3.09 |
| SuSIE | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69 |
All rows evaluated at EP_LEN = 360. These numbers are state of the art among open-source dual-system VLA models.
CALVIN-E (Language Generalization)
CALVIN-E uses richer, more varied instructions to probe generalization:
- Single-task: 78.9%
- Two-task chain: 57.1%
Key Empirical Findings
Finding 1 — Prompt tuning beats full MLLM fine-tuning:
Full fine-tuning improves in-distribution performance but hurts instruction generalization. Prompt tuning with an auxiliary task is the sweet spot: strong task performance while preserving the MLLM's broad language understanding.
Finding 2 — Always start from a pretrained policy:
Fine-tune from pretrained 3D Diffuser Actor: 96% single-task
Train from scratch: 89% single-task
The pretrained model encodes geometry and motion priors that pure task-specific training can't replicate efficiently.
Finding 3 — Asynchronous inference is surprisingly robust:
The MLLM's goal representation doesn't need refreshing at every step. You can run System 2 very infrequently — N=60 steps drops only ~3% — enabling practical deployment where MLLM runs on a high-power server GPU and the policy runs on a low-latency edge device.
Finding 4 — The auxiliary task is non-negotiable:
Without it, the MLLM processes language instructions without genuinely attending to the visual scene. The auxiliary task creates a training signal that forces the model to extract action-relevant visual features, dramatically improving the quality of latent representations passed to System 1.
Why OpenHelix Matters Beyond Its Benchmark Score
If you've studied VLA models for manipulation or diffusion policy in modern robotics, the architectural ideas here aren't entirely new. What makes OpenHelix notable isn't novelty alone — it's epistemic transparency:
- Full reproducibility: Code, weights, and scripts are public. Researchers can verify claims without waiting for an "official release" that never comes
- Honest ablations: The paper reports results for design choices that didn't work as expected, not just the winning configuration
- Practical guidance: The async inference analysis, pretraining findings, and auxiliary task insights directly answer questions engineers face when building real systems
Compare this to the typical "SOTA" robotics paper — a PDF and a polished demo video, no code — and the difference is significant.
For foundational background on the techniques OpenHelix builds on, see Diffusion Policy: How Denoising Generates Robot Actions and VLA Models: From ACT to π₀.
Open Questions and Future Directions
OpenHelix's roadmap includes:
- Real robot deployment — Bridging the sim-to-real gap from CALVIN to physical hardware
- Humanoid robots — Extending the architecture to whole-body control
- Larger MLLM backbones — Testing LLaVA-13B or more capable VLMs
- Richer manipulation tasks — Moving beyond pick-and-place to contact-rich and dexterous manipulation
The biggest open question: CALVIN is simulation-based. Real-world deployment introduces perception noise, contact dynamics, and distribution shifts that no simulator fully captures. The sim-to-real gap remains the hardest unsolved problem in this line of work.
Summary
OpenHelix earns its place as a reference point in dual-system VLA research not just because it achieves SOTA, but because it tells you why it achieves SOTA and gives you everything needed to reproduce, extend, or critique it.
The key design decisions in brief:
- System 2: LLaVA-7B frozen, trained only via prompt tuning + auxiliary task
- Bridge: Learned <ACT> token → linear projection (4096 → 512)
- System 1: Pretrained 3D Diffuser Actor, fine-tuned with diffusion loss
- Inference: Asynchronous, N up to 60 steps with minimal degradation
If you're building robot manipulation systems and want a strong, reproducible open-source baseline — OpenHelix is the right starting point.
GitHub: OpenHelix-robot/OpenHelix
Paper: arXiv:2505.03912
Weights: OpenHelix/openhelix on Hugging Face