OpenHelix: Building a Dual-System VLA, From Survey to Deployment

You've probably heard of Helix — Figure AI's dual-system VLA running on a humanoid robot that shook the robotics community in 2025. The only problem: it's completely closed-source. You can't read the code, can't reproduce it, can't learn from it.

OpenHelix was created to change that. It's a fully open-source implementation of the dual-system VLA architecture, paired with the most thorough survey and empirical analysis on the topic available today. The result: SOTA on CALVIN ABC-D with an average sequence length of 4.08 — beating RoboDual, UniVLA, GR-MG and Seer.

This post takes you from "what is dual-system?" → environment setup → data preparation → training → inference, with enough detail to actually run it on your own machine.

What Is a Dual-System VLA? (And Why You Should Care)

Imagine driving through a busy city. Your brain runs two parallel processes:

System 2 (slow, deliberate): Reads signs, recognizes complex situations ("ambulance approaching"), makes strategic decisions ("stop, yield")
System 1 (fast, reflexive): Steers the wheel, applies brake pressure, keeps the car in lane — all in milliseconds, without "thinking"

Robot manipulation faces exactly this tension. Multimodal LLMs (MLLMs) like LLaVA are excellent at understanding language and reasoning about context, but they run at only 7-9 Hz, far too slow for real-time robot control. Diffusion policies can react at 200+ Hz, but they don't "understand" anything: they simply map sensor readings to actions.

Dual-System VLA combines both: the MLLM plays System 2 (language understanding, planning), and the diffusion policy plays System 1 (precise execution, real-time).

Figure: Overview of the dual-system VLA architecture. System 2 (MLLM) provides context to System 1 (policy).

OpenHelix: Three Core Contributions

The paper arXiv:2505.03912 by Can Cui, Pengxiang Ding, Wenxuan Song et al. isn't just another model — it's a knowledge system:

1. Comprehensive Survey of the Landscape

OpenHelix systematically maps the full design space of dual-system VLAs: how to connect System 1 and System 2, how to train each component, and how to handle the latency mismatch. It is the map that keeps you oriented when reading other papers in this area.

2. Rigorous Empirical Analysis

Rather than just claiming "my architecture is better," the authors ablate each design choice systematically:

Pre-trained policy vs. training from scratch?
Prompt-tuning vs. full fine-tuning for the MLLM?
With or without auxiliary prediction task?
Does pre-alignment before joint training matter?

3. Open-Source Implementation

Complete code, checkpoints, training scripts — MIT license — at github.com/OpenHelix-Team/OpenHelix.

Architecture Deep Dive

System 2: LLaVA-7B (The Slow Brain)

Input:  Visual observation + Language instruction
Model:  LLaVA-7B (FROZEN — weights not trained)
Adapt:  Prompt tuning (only ~1% parameters)
Output: Latent embedding Z ∈ ℝ^(N×D)

Why freeze the MLLM? Fully fine-tuning LLaVA-7B on robot data tends to erode the generalization it acquired during large-scale vision-language pre-training. Prompt tuning keeps that "intelligence" intact while adapting to the robot domain, at a fraction of the compute cost.

Key finding from ablation: By default, the MLLM is insensitive to visual changes — it mostly reflects instruction semantics and nearly ignores the visual observation. This is a serious problem because robots need to respond to their environment!

Learned Token Bridge (Connecting the Two Systems)

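# Projection layer connecting System 2 → System 1
import torch
import torch.nn as nn

class TokenBridge(nn.Module):
    """Maps MLLM latent tokens into the policy's conditioning space."""

    def __init__(self, llm_dim=4096, policy_dim=512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, policy_dim)  # 4096 → 512
        self.norm = nn.LayerNorm(policy_dim)        # normalizes the projected tokens

    def forward(self, llm_embedding: torch.Tensor) -> torch.Tensor:
        # Project from LLM space → policy space: (..., llm_dim) → (..., policy_dim)
        return self.norm(self.proj(llm_embedding))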

The token bridge is the most heavily trained component. It learns to "translate" latent representations from LLaVA's 4096-dimensional space into the 512-dimensional space the diffusion policy understands.
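To get a feel for how small this bridge is, the arithmetic below counts its trainable parameters. It is a back-of-the-envelope sketch that assumes the default 4096 → 512 dimensions shown above, a bias term in the linear layer, and a LayerNorm with one scale and one shift value per feature; the helper names are made up for illustration.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight matrix plus optional bias vector of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim: int) -> int:
    """One learnable scale and one learnable shift per feature."""
    return 2 * dim


llm_dim, policy_dim = 4096, 512  # defaults from the TokenBridge snippet
bridge_params = linear_params(llm_dim, policy_dim) + layernorm_params(policy_dim)
print(f"{bridge_params:,} trainable parameters")  # 2,098,688 trainable parameters
```

At roughly 2.1 million parameters, the bridge is tiny next to a 7B-parameter MLLM, which is part of why it can be pre-aligned cheaply before joint training.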

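Critical note: You must pre-align the projection layer BEFORE joint training. The MLLM's weights are frozen, but its learnable prompt tokens still receive gradients through the bridge. If the projection starts from random weights and everything is trained together, the noisy early gradients from the policy can "poison" the representation those prompt tokens learn, and training can become unstable or collapse.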
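One way to picture the two-stage recipe is as a table of which components receive gradients in each stage. The sketch below is an interpretation of this post, not code from the repository: the module names are hypothetical, and the assumption that only the bridge trains during pre-alignment is inferred from the description above.

```python
# Hypothetical component names, used only for illustration.
MODULES = ("mllm", "prompt_tokens", "token_bridge", "policy", "aux_head")

# Assumed schedule: stage 1 aligns the bridge alone; stage 2 trains everything
# except the frozen MLLM backbone.
TRAINABLE_BY_STAGE = {
    "projection_pretrain": {"token_bridge"},
    "joint": {"prompt_tokens", "token_bridge", "policy", "aux_head"},
}


def requires_grad_flags(stage: str) -> dict:
    """Return {module_name: should_receive_gradients} for a training stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}


for stage in TRAINABLE_BY_STAGE:
    print(stage, requires_grad_flags(stage))
```

In a PyTorch training script these flags would typically be applied by setting `requires_grad` on each module's parameters before building the optimizer.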

System 1: 3D Diffuser Actor (The Executing Hand)

Input:  Token bridge output Z + Proprioceptive state q + Goal features g
Model:  3D Diffuser Actor (diffusion-based)
Output: Action sequence a₀:T ∈ ℝ^(T×7)  # 7-D action per step: end-effector pose (6) + gripper (1)
Speed:  200+ Hz (asynchronous inference)

The 3D Diffuser Actor uses a diffusion process to generate action sequences, allowing the model to capture multimodal action distributions (the same task can have multiple valid execution styles). It receives input from three sources:

Z from token bridge — task and visual state context
Proprioceptive state q — current joint angles, end-effector pose
Goal features g — visual features of the target state

Auxiliary Task: Forcing the MLLM to Actually Look

Auxiliary loss: L_aux = MSE(f_aux(Z), a_expert)
f_aux: Small MLP head on MLLM output
Effect: Forces MLLM embedding to encode visual information

This is the cleverest trick in OpenHelix. Adding an auxiliary loss that requires action prediction from the MLLM's output embedding forces that embedding to actually "see". Without it, the MLLM side can "cheat": encoding the instruction text alone is sufficient to minimize the training loss. With the auxiliary loss, the embedding has to incorporate visual information, because the expert action depends on the scene.

Ablation result: Auxiliary task improves performance by +0.4 average sequence length on CALVIN ABC-D.
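The auxiliary objective can be sketched in a few lines of plain Python. This is a toy illustration only: it assumes the total loss is the policy loss plus `aux_weight` times the MSE term, which matches the `--aux_weight` flag used later in this post but is not confirmed against the repository code. The numbers are invented.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(policy_loss: float, aux_pred: Sequence[float],
               expert_action: Sequence[float], aux_weight: float = 0.1) -> float:
    """Assumed combination: L = L_policy + aux_weight * MSE(f_aux(Z), a_expert)."""
    return policy_loss + aux_weight * mse(aux_pred, expert_action)


# Toy 7-D action: the auxiliary head predicts zeros, the expert moved in x
# and closed the gripper.
aux_pred = [0.0] * 7
expert_action = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
loss = total_loss(policy_loss=0.5, aux_pred=aux_pred, expert_action=expert_action)
print(round(loss, 4))
```

Because the expert action changes with the scene, an embedding that only encodes the instruction cannot drive this term to zero.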

Environment Setup

Hardware Requirements

Component	Minimum	Recommended
GPU	1× RTX 3090 (24GB)	1× A100 (40GB)
RAM	32GB	64GB
Storage	200GB SSD	500GB SSD
CUDA	11.8+	12.1

Full OpenHelix training takes approximately 3-4 days on an A100. For inference-only with a pre-trained checkpoint, an RTX 3090 is sufficient.
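A quick estimate shows why a 24GB card is enough for inference. The sketch below assumes roughly 7 billion parameters for LLaVA-7B and counts only the weights; activations, the vision tower, and the diffusion policy add to this.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 7e9  # assumption: about 7 billion parameters
fp16_gb = weight_memory_gb(params, 2)  # half precision, 2 bytes per weight
fp32_gb = weight_memory_gb(params, 4)  # full precision, 4 bytes per weight
print(f"fp16: {fp16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB")  # fp16: 14 GB, fp32: 28 GB
```

In half precision the frozen MLLM fits on an RTX 3090 with room to spare; in full precision it would not.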

Create Conda Environment

# Python 3.8 is required — OpenHelix does not support Python 3.10+ yet
conda create -n openhelix python=3.8
conda activate openhelix

# Install PyTorch with CUDA 11.8
conda install pytorch==2.0.1 torchvision==0.15.2 \
    torchaudio==2.0.2 pytorch-cuda=11.8 \
    -c pytorch -c nvidia

Clone and Install Dependencies

git clone https://github.com/OpenHelix-Team/OpenHelix
cd OpenHelix

# Install the Python dependencies
pip install -r requirements.txt

# Fetch the CALVIN simulator (git submodule)
git submodule update --init --recursive

# Install DGL (Deep Graph Library) — required for 3D Diffuser Actor
# Choose version matching your CUDA
pip install dgl==1.1.0 -f https://data.dgl.ai/wheels/cu118/repo.html

# Flash Attention to accelerate MLLM inference
pip install flash-attn==2.5.9 --no-build-isolation

Note: flash-attn takes 5-10 minutes to compile. Don't close the terminal.

Preparing the CALVIN Dataset

CALVIN (Composing Actions from Language and Vision) is one of the most widely used benchmarks for language-conditioned manipulation. In the ABC-D split, you train on environments A, B, and C, then test on the unseen environment D.

Download Dataset

# The ABC-D split is very large (hundreds of GB); check free disk space before downloading
mkdir -p ~/data && cd ~/data
wget https://calvin.cs.uni-freiburg.de/dataset/task_ABC_D.zip
unzip task_ABC_D.zip

# Directory structure after extraction:
# task_ABC_D/
# ├── training/          # Environments A, B, C
# │   ├── episode_*.npz  # Demonstration episodes
# │   └── lang_annotations/
# ├── validation/        # Environment D
# └── statistics.yaml

Pre-encode Language Instructions (Optional but Recommended)

OpenHelix uses a CLIP text encoder for language instructions. Pre-encoding speeds up training significantly:

cd OpenHelix

python scripts/encode_instructions.py \
    --dataset_path ~/data/task_ABC_D \
    --output_path ~/data/task_ABC_D/lang_embeddings \
    --encoder clip-vit-base-patch32

# Or download pre-encoded from HuggingFace (faster)
python -c "
import os
from huggingface_hub import hf_hub_download

# Python does not expand '~' inside strings, so expand it explicitly
hf_hub_download(
    repo_id='OpenHelix-Team/OpenHelix',
    filename='lang_embeddings.tar.gz',
    local_dir=os.path.expanduser('~/data/task_ABC_D/')
)
"

Verify Dataset

python scripts/verify_dataset.py --path ~/data/task_ABC_D

# Expected output:
# ✓ Training episodes: 23,856
# ✓ Validation episodes: 1,000
# ✓ Language annotations: 34 unique tasks
# ✓ CLIP embeddings: found
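If you want a quick sanity check that does not depend on any repository script, the directory layout described above can be inspected with the standard library alone. This is a minimal sketch: it assumes the `training/`, `validation/`, `episode_*.npz`, and `lang_annotations/` names from the directory tree shown earlier, and `summarize_split` is a made-up helper name.

```python
from pathlib import Path


def summarize_split(root: Path) -> dict:
    """Report, per split, whether the folder exists and what it contains."""
    summary = {}
    for split in ("training", "validation"):
        split_dir = root / split
        exists = split_dir.is_dir()
        summary[split] = {
            "exists": exists,
            # Count files matching the episode_*.npz pattern from the tree above
            "npz_files": len(list(split_dir.glob("episode_*.npz"))) if exists else 0,
            "has_lang_annotations": (split_dir / "lang_annotations").is_dir(),
        }
    return summary


report = summarize_split(Path("~/data/task_ABC_D").expanduser())
for split, info in report.items():
    print(split, info)
```

A missing folder or a zero file count usually means the archive was extracted into a different directory than the one passed to the training scripts.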

Training

Step 1: Pre-train the Projection Layer (Token Bridge Alignment)

Do not skip this step. This is the difference between OpenHelix and many incorrect implementations:

cd OpenHelix

bash scripts/train_projection_pretrain.sh \
    --data_path ~/data/task_ABC_D \
    --output_dir ./checkpoints/projection_pretrain \
    --epochs 10 \
    --batch_size 64 \
    --lr 1e-4

# Takes ~2-3 hours on A100
# Goal: projection layer learns to align with MLLM space first

Step 2: Joint Training with Auxiliary Task

bash train_trajectory_lcb_pt_act_simple.sh \
    --data_path ~/data/task_ABC_D \
    --pretrained_proj ./checkpoints/projection_pretrain/best.pt \
    --output_dir ./checkpoints/openhelix_full \
    --llm_model llava-hf/llava-1.5-7b-hf \
    --policy_lr 1e-4 \
    --prompt_lr 1e-3 \
    --aux_weight 0.1 \
    --epochs 100 \
    --batch_size 32

Flag explanations:

--prompt_lr 1e-3: The prompt tokens use a 10× higher learning rate than the policy (1e-4); they are a small set of new parameters, while the policy starts from pre-trained weights
--aux_weight 0.1: Auxiliary loss weight — 0.1 is the optimal value per paper ablation
--pretrained_proj: REQUIRED — pre-aligned projection from Step 1

Monitoring Training

# TensorBoard logs
tensorboard --logdir ./checkpoints/openhelix_full/logs

# Metrics to watch:
# - train/policy_loss: should decrease consistently
# - train/aux_loss: should decrease — if it increases, reduce aux_weight
# - val/avg_seq_len: main metric, target > 3.5 after epoch 50
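The "if it increases" rule for the auxiliary loss can be made concrete by comparing two consecutive windows of logged values. The helper below is a hypothetical sketch, not part of OpenHelix; the window size and the sample numbers are arbitrary.

```python
from statistics import mean
from typing import Sequence


def is_rising(values: Sequence[float], window: int = 3) -> bool:
    """True if the mean of the last `window` values exceeds the mean of the
    `window` values before them. Returns False when there is too little data."""
    if len(values) < 2 * window:
        return False
    recent = mean(values[-window:])
    previous = mean(values[-2 * window:-window])
    return recent > previous


aux_history = [0.60, 0.55, 0.50, 0.58, 0.66, 0.74]  # made-up aux_loss readings
if is_rising(aux_history):
    print("aux_loss is trending up: consider lowering --aux_weight")
```

Averaging over a window keeps a single noisy step from triggering the warning.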

Figure: OpenHelix benchmark results on CALVIN ABC-D compared to other methods.

Inference and Evaluation

Inference with Asynchronous Mode

OpenHelix uses asynchronous inference to handle the latency mismatch between the two systems. System 2 (LLaVA) runs at roughly 7-9 Hz, while System 1 (the diffusion policy) runs at 200+ Hz. Asynchronous mode lets both operate concurrently without blocking each other:

# --async_delay 10: 10-step delay between System 1 and System 2
# (the comment sits here because text after a trailing backslash breaks the line continuation)
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
    --checkpoint ./checkpoints/openhelix_full/epoch_100.pt \
    --data_path ~/data/task_ABC_D \
    --split validation \
    --async_delay 10 \
    --num_sequences 1000 \
    --output_path ./results/eval_epoch100.json
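To make the asynchronous schedule concrete, here is a toy model of it. It assumes that a 10-step delay means System 2 refreshes its latent once every 10 System 1 steps and that System 1 reuses the most recent latent in between. That reading is an interpretation of the flag, and the function name is made up.

```python
def latent_source_steps(num_steps: int, refresh_every: int) -> list:
    """For each control step t, return the step whose observation produced
    the System 2 latent that System 1 is conditioning on at time t."""
    return [(t // refresh_every) * refresh_every for t in range(num_steps)]


sources = latent_source_steps(num_steps=25, refresh_every=10)
staleness = [t - s for t, s in enumerate(sources)]

print(sources[8:12])   # [0, 0, 10, 10]
print(max(staleness))  # 9
```

Under this model the fast policy never waits for the MLLM; the price is that its language and scene context can be up to nine steps old.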

Using Pre-trained Checkpoint (No Training Required)

If you just want to try inference:

# Download checkpoint from HuggingFace
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='OpenHelix-Team/OpenHelix',
    local_dir='./pretrained_checkpoints'
)
"

# Merge the safetensors shards into a single PyTorch file
python scripts/merge_safetensors.py \
    --input_dir ./pretrained_checkpoints \
    --output_path ./pretrained_checkpoints/openhelix_merged.pt

# Run inference
bash test_trajectory_lcb_pt_act_simple_asy10.sh \
    --checkpoint ./pretrained_checkpoints/openhelix_merged.pt \
    --data_path ~/data/task_ABC_D \
    --split validation

Reading Evaluation Results

python scripts/analyze_results.py --result_path ./results/eval_epoch100.json

# Output format (illustrative values; Avg Seq Len is the sum of the five success rates in a row):
# Task             | 1-task | 2-task | 3-task | 4-task | 5-task | Avg Seq Len
# push block       |  0.96  |  0.87  |  0.75  |  0.64  |  0.53  |   3.75
# stack block      |  0.91  |  0.79  |  0.68  |  0.55  |  0.44  |   3.37
# ...
# Overall          |  ...   |  ...   |  ...   |  ...   |  ...   |   4.08

Avg Sequence Length 4.08 means the robot completes an average of 4.08 consecutive tasks before failing in a chain of up to 5 tasks.
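The average sequence length follows directly from the per-step success rates: the expected number of completed tasks equals the sum, over k = 1..5, of the probability of completing at least k tasks in a row. The sketch below checks this two ways, using the "push block" row from the sample output and a small made-up set of rollouts; the function names are illustrative.

```python
from typing import List, Sequence


def average_sequence_length(success_rates: Sequence[float]) -> float:
    """Sum of P(at least k tasks completed) for k = 1..len(success_rates)."""
    return sum(success_rates)


def rates_from_chain_lengths(lengths: Sequence[int], max_len: int = 5) -> List[float]:
    """Convert completed-task counts per rollout into k-task success rates."""
    return [sum(1 for n in lengths if n >= k) / len(lengths)
            for k in range(1, max_len + 1)]


push_block = [0.96, 0.87, 0.75, 0.64, 0.53]  # row from the sample output
print(round(average_sequence_length(push_block), 2))  # 3.75

chains = [5, 3, 0, 4, 5]  # made-up rollouts: tasks completed before the first failure
rates = rates_from_chain_lengths(chains)
print(rates)  # [0.8, 0.8, 0.8, 0.6, 0.4]
print(round(average_sequence_length(rates), 2), sum(chains) / len(chains))  # 3.4 3.4
```

Both routes give the same number, which makes the metric easy to recompute from a results file.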

Results and Analysis

Comparison with SOTA

Model	CALVIN ABC-D Avg Seq Len
Seer (2024)	3.65
GR-MG (2024)	3.88
UniVLA (2025)	3.92
RoboDual (2025)	3.98
OpenHelix (2025)	4.08

4 Key Lessons from the Ablation Study

Lesson 1: Pre-trained policy >> Training from scratch (+1.2 seq len)
Never try to train 3D Diffuser Actor from random weights. Always start from a pre-trained policy and fine-tune it; the difference is enormous.

Lesson 2: Prompt tuning is sufficient, no need to fully fine-tune the MLLM
Full fine-tuning of LLaVA doesn't improve performance and destroys generalization. Prompt tuning with ~1% of parameters is optimal.

Lesson 3: Auxiliary task is mandatory for SOTA (+0.4 seq len)
Without the auxiliary task, the MLLM nearly ignores visual input. This is the paper's most important finding.

Lesson 4: Pre-alignment prevents model collapse
Joint training from scratch with a random projection layer typically leads to training instability. Pre-aligning for 10 epochs is essential insurance.

Common Troubleshooting

CUDA Out of Memory

# Reduce batch size
--batch_size 16  # instead of 32

# Or enable gradient checkpointing
--gradient_checkpointing True

flash_attn ImportError

# Uninstall, then force a rebuild from source against your local CUDA toolkit
# (the environment variable must come before the command, not after it)
pip uninstall -y flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn==2.5.9 --no-build-isolation

Training Loss Not Decreasing

Check in order: (1) Are you using --pretrained_proj? (2) Is aux_weight too high (try 0.05)? (3) Is the learning rate appropriate for your batch size (scale linearly)?
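The linear scaling rule in point (3) is simple arithmetic: if the batch size changes by some factor, change the learning rate by the same factor. The sketch below applies it to the defaults used in this post (batch size 32, policy LR 1e-4, prompt LR 1e-3). Treat it as a starting heuristic rather than a guarantee.

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: the learning rate grows or shrinks with batch size."""
    return base_lr * new_batch_size / base_batch_size


# Halving the batch size from 32 to 16 (for example after a CUDA OOM)
policy_lr = scale_lr(1e-4, base_batch_size=32, new_batch_size=16)
prompt_lr = scale_lr(1e-3, base_batch_size=32, new_batch_size=16)
print(policy_lr, prompt_lr)  # 5e-05 0.0005
```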

Extensions: Related Repos from OpenHelix-Team

Once you're comfortable with OpenHelix, explore:

VLA-Adapter — Tiny-scale VLA with real-world ALOHA deployment (much lighter)
VLA-RFT — Reinforcement Fine-Tuning for VLA (RLHF for robots)
HiF-VLA — Hierarchical spatiotemporal VLA for long-horizon tasks
Spatial-Forcing — ICLR 2026 paper, improving 3D spatial understanding

Nguyễn Anh Tuấn
