While the Vision-Language-Action (VLA) field has been racing for scale — OpenVLA 7B, π0 3B, RDT 1B — China's OpenHelix Team uploaded a paper to arXiv that swims firmly against the current: VLA-Adapter uses only a Qwen2.5-0.5B backbone (half a billion parameters, 14× smaller than OpenVLA), no robotic pre-training, and still hits 99.2% on LIBERO-Object, rising to 99.6% on both LIBERO-Object and LIBERO-Spatial in the Pro variant. Crucially: 8 hours of training on a single RTX 4090, or as little as 9.6GB of VRAM on an RTX 3060.
This is great news for every robotics engineer who wants to enter the VLA game without an A100 cluster. In this guide we'll go from the paper idea → the Bridge Attention architecture → installation step by step → training on LIBERO → real-world deployment on Franka/UR-5.
Why this paper matters
Before VLA-Adapter, the VLA field largely followed a "scaling hypothesis": good manipulation requires a large backbone (≥7B) and pre-training on millions of robot trajectories (OXE dataset, DROID). The consequences:
- Training cost: OpenVLA needs ~8 days on 64 A100s. π0 needs a TPU pod. Small players can fine-tune but not train from scratch.
- Inference VRAM: A 7B fp16 model needs 14GB just to load weights. Add KV cache → hard to deploy on Jetson Orin (8-16GB).
- Latency: Action chunking on 7B → 5-10Hz, which limits high-speed manipulation.
OpenHelix asked the opposite question: do we really need a large backbone, or do we just need to know how to intelligently "extract" conditioning from a VLM?
The paper's answer: with a lightweight Policy module equipped with Bridge Attention, a 0.5B backbone is enough to beat OpenVLA-OFT (7B) on LIBERO. Original paper: arXiv:2509.09372 — Wang et al., 2025.
The core idea: Bridge Attention
Traditional VLA models use two ways to bridge VLM and action head:
- Token-as-action (OpenVLA, VLA-0): Treat actions as special tokens, decode autoregressively. Slow, because action tokens must be generated one at a time.
- Feature-as-condition (π0, RDT, OpenVLA-OFT): Use VLM hidden states as condition for a diffusion/flow policy. Faster but still requires a big backbone.
VLA-Adapter picks option 2 but adds three crucial improvements:
1. ActionQuery tokens — trainable queries
The authors append 64 learnable tokens (ActionQuery) to the VLM input, trained from scratch. These tokens behave like "questions" the policy module sends to the VLM: "What action does this scene call for?" The VLM answers by updating these tokens through self-attention layers.
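To make that concrete, here is a minimal sketch of what appending learnable query tokens looks like. The names (action_queries, build_vlm_input) and the init scale are my own assumptions; the real repo wires this into the Prismatic VLM rather than a bare function.
import torch
import torch.nn as nn

# Illustrative sketch (assumed names, not the repo's code):
# 64 learnable ActionQuery tokens appended to the VLM input sequence.
n_queries, d_model = 64, 896  # 896 = Qwen2.5-0.5B hidden size
action_queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)

def build_vlm_input(vl_embeds: torch.Tensor) -> torch.Tensor:
    # vl_embeds: [B, N, D] embedded vision + language tokens
    q = action_queries.unsqueeze(0).expand(vl_embeds.size(0), -1, -1)
    # The VLM's self-attention updates these query positions layer by layer.
    return torch.cat([vl_embeds, q], dim=1)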
2. Multi-layer feature injection
Instead of only grabbing the last-layer hidden state, VLA-Adapter injects features from multiple layers into the policy:
- Raw features (vision + language) are taken from middle layers (less semantic bias than deep layers).
- ActionQuery features are taken from deep layers (filtered through many attention layers).
Ablations show multi-layer beats single-layer by about 2% success rate.
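For intuition, here is a text-only toy showing how hidden states at multiple depths come out of a Hugging Face causal LM. In VLA-Adapter the sequence would also contain vision patches and the 64 ActionQuery tokens; this is not the repo's actual extraction code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ids = tok("pick up the red cup", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids=ids, output_hidden_states=True)
hidden = out.hidden_states            # tuple of (n_layers + 1) tensors, each [B, N, D]
raw_feats = hidden[len(hidden) // 2]  # middle depth -> "raw" features
deep_feats = hidden[-1]               # deepest layer -> where ActionQuery features live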
3. Bridge Attention with learnable injection degree
This is the main innovation. Instead of directly summing condition into the policy, VLA-Adapter uses gated cross-attention:
# Pseudo-code for Bridge Attention
gate_raw = sigmoid(W_g @ raw_features)  # learnable injection ratio in (0, 1)
gate_query = 1.0                        # ActionQuery features: fully injected
policy_input = gate_raw * raw_features + gate_query * action_query_features
action = diffusion_head(policy_input, noisy_action)
The intuition: raw features should be injected with restraint (they carry image-text pre-training bias), while ActionQuery features can be injected at full strength (they were trained specifically for action).
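A minimal PyTorch reading of the gated cross-attention idea, assuming the ActionQuery tokens attend to the raw features and the result is blended in through a learned sigmoid gate. This is one plausible interpretation, not the authors' exact module.
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Sketch: queries attend to raw VLM features; the raw contribution is gated."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1)  # per-token injection degree

    def forward(self, query_feats: torch.Tensor, raw_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: [B, 64, D] ActionQuery features (deep layers, fully injected)
        # raw_feats:   [B, N, D]  vision + language features (middle layers, gated)
        attended, _ = self.xattn(query_feats, raw_feats, raw_feats)
        g = torch.sigmoid(self.gate(attended))  # learnable injection ratio in (0, 1)
        return query_feats + g * attended       # gated raw path + full query path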
The numbers
| Suite | OpenVLA-OFT (7B) | π0 (3B) | VLA-Adapter (0.5B) | VLA-Adapter-Pro (0.5B) |
|---|---|---|---|---|
| LIBERO-Spatial | 97.6% | 96.8% | 97.8% | 99.6% |
| LIBERO-Object | 98.4% | 98.8% | 99.2% | 99.6% |
| LIBERO-Goal | 97.9% | 95.8% | 97.2% | 98.2% |
| LIBERO-Long | 94.5% | 85.2% | 95.0% | 96.4% |
| Average | 97.1% | 94.2% | 97.3% | 98.5% |
The kicker: with a frozen backbone, VLA-Adapter still hits 86.4% on LIBERO-Long, while competitors collapse to 0%. That shows Bridge Attention is genuinely extracting useful conditioning without needing to fine-tune the VLM.
For inference speed, VLA-Adapter reports the fastest inference among comparable VLAs, since it runs only one forward pass through a 0.5B model plus an 8-step diffusion policy (97M params).
Installation from scratch
System requirements
- OS: Ubuntu 22.04 (I've tested 20.04 too)
- CUDA: 11.8 or 12.1
- GPU: At least an RTX 2080Ti 11GB (batch=1, LoRA rank=64)
- Disk: ~50GB for dataset + checkpoints
Step 1: Create the conda env
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .
# Flash-attention — version 2.5.5 has been tested by the authors
pip install "flash-attn==2.5.5" --no-build-isolation
If flash-attn pip install fails (common on RTX 3060/3080):
pip install ninja
MAX_JOBS=4 pip install "flash-attn==2.5.5" --no-build-isolation
Step 2: Download the Prismatic + Qwen2.5-0.5B backbone
mkdir -p pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/OpenHelix-Team/prism-qwen25-extra-dinosiglip-224px-0_5b
This backbone uses:
- Vision encoder: DINOv2 + SigLIP (concatenated patches)
- Resolution: 224×224
- Language model: Qwen2.5-0.5B-Instruct
- Projector: 2-layer MLP
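If git-lfs cloning is slow or unavailable, the same snapshot can be fetched with huggingface_hub (a standard API, not a repo-specific script):
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OpenHelix-Team/prism-qwen25-extra-dinosiglip-224px-0_5b",
    local_dir="pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b",
)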
Step 3: Download the LIBERO dataset (~10GB)
cd VLA-Adapter
mkdir -p data && cd data
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_spatial.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_object.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_goal.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_10.zip # LIBERO-Long
for f in *.zip; do unzip "$f"; done
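A quick sanity check that everything extracted where the training script expects it (run from the repo root; directory names assumed from the zip names):
import os

# Verify each RLDS suite directory exists under data/ (assumed layout).
for suite in ["libero_spatial", "libero_object", "libero_goal", "libero_10"]:
    print(suite, "OK" if os.path.isdir(os.path.join("data", suite)) else "MISSING")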
Training on LIBERO-Object (target: 99.2%)
Config by VRAM
| VRAM | Batch size | LoRA rank | Typical GPU |
|---|---|---|---|
| 9.6GB | 1 | 64 | RTX 2080Ti 11GB, 3060 12GB |
| 24GB | 4 | 64 | RTX 3090, 4090 |
| 40-48GB | 8 | 64 | A100-40GB, RTX 5090 |
| ≥80GB | 16 | 64 | A100-80GB, H100 |
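For the 9.6GB row, the training command below still applies; swap in batch 1 with gradient accumulation to keep the effective batch at 4 (my suggestion, not an official config):
--batch_size 1 \
--grad_accumulation_steps 4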
Training command (single 24GB GPU)
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 \
vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--data_root_dir data \
--dataset_name libero_object \
--run_root_dir outputs \
--run_id LIBERO-Object-Pro \
--batch_size 4 \
--grad_accumulation_steps 1 \
--learning_rate 5e-4 \
--max_steps 200005 \
--save_freq 10000 \
--use_pro_version True \
--lora_rank 64 \
--image_aug True
On an RTX 4090, 200k steps ≈ 8 hours. With batch=1 on an RTX 3060, expect ~30-36 hours — still doable overnight plus a day.
Monitoring training
tensorboard --logdir outputs/LIBERO-Object-Pro/runs --port 6006
Key metrics:
- loss/action: should drop from ~0.6 to ~0.05 by 50k steps.
- loss/diffusion: similar, decreasing monotonically.
- lr: 1000-step warmup → cosine decay.
If loss plateaus too early (before 30k steps), common causes:
- Image augmentation disabled → set --image_aug True.
- LoRA rank too low → try rank 128.
- Learning rate too low → bump to 1e-3.
Inference & evaluation
Evaluate on the LIBERO simulator
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--pretrained_checkpoint outputs/LIBERO-Object-Pro/checkpoint-200000 \
--task_suite_name libero_object \
--use_pro_version True \
--num_trials_per_task 50
Expected: success rate ≈ 99.2-99.6%. If you see below 95%, check:
- Did the checkpoint load correctly? Look for key-mismatch warnings in the log.
- LIBERO version: use commit b5d8e0a exactly (linked in the README).
Real-time inference loop
from vla_adapter import VLAAdapter
import torch
from PIL import Image
vla = VLAAdapter.from_pretrained(
"outputs/LIBERO-Object-Pro/checkpoint-200000",
device="cuda",
use_pro_version=True
)
image = Image.open("camera_frame.jpg").resize((224, 224))
instruction = "pick up the red cup and place it on the plate"
with torch.no_grad():
action_chunk = vla.predict_action(
image=image,
instruction=instruction,
action_chunk_size=8 # predict 8 future actions
)
# action_chunk: shape [8, 7] for a 7-DOF action (xyz delta + axis-angle rotation + gripper)
print(action_chunk)
Inference speed on RTX 4090: ~30-50Hz (action chunk 8) — fast enough for moderate-speed manipulation.
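In a live loop you would typically execute part of the chunk and then re-predict. A hypothetical pattern, where get_camera_frame and send_to_robot stand in for your own camera and robot interfaces (they are not part of the repo):
import time

CONTROL_HZ = 15    # per-step execution rate; tune to your robot (assumption)
REPLAN_EVERY = 4   # execute 4 of the 8 predicted steps, then re-predict

while True:
    image = get_camera_frame()  # your camera helper (assumed)
    chunk = vla.predict_action(image=image, instruction=instruction, action_chunk_size=8)
    for action in chunk[:REPLAN_EVERY]:
        send_to_robot(action)   # your robot interface (assumed)
        time.sleep(1.0 / CONTROL_HZ)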
Real-world deployment on Franka / UR-5 / ALOHA
In March 2026 the authors added Cobot Magic (ALOHA-style) deployment support. Code lives in experiments/robot/aloha/. The workflow is similar for Franka/UR-5:
Step 1: Calibrate the camera
Place an RGB camera at a corner of the workspace (top-down or 45°). Resize input to 224×224. Important: keep the camera angle similar to the training data; a large mismatch tanks performance.
Step 2: Adapter for the action space
VLA-Adapter outputs delta end-effector pose (7-DOF: xyz delta + axis-angle + gripper). For Franka:
# Franka sketch via a libfranka-style Python wrapper (interface names vary by setup)
import numpy as np
from scipy.spatial.transform import Rotation
from franka_interface import FrankaArm

def delta_to_homog(delta_xyz, delta_rot):
    """Build a 4x4 homogeneous transform from a position delta and an axis-angle rotation."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(delta_rot).as_matrix()
    T[:3, 3] = delta_xyz
    return T

arm = FrankaArm()
current_pose = arm.get_pose()           # 4x4 end-effector pose matrix
delta_xyz = action_chunk[0][:3] * 0.05  # scale to 5 cm per step
delta_rot = action_chunk[0][3:6]        # axis-angle rotation delta
gripper_cmd = action_chunk[0][6]        # 0 = close, 1 = open
target_pose = current_pose @ delta_to_homog(delta_xyz, delta_rot)
arm.move_to_pose(target_pose, duration=0.05)
arm.set_gripper(gripper_cmd)
Step 3: Safety wrapper
Mandatory:
- Workspace bounds: clip the target pose to a safe region (e.g. xy ∈ [-0.5, 0.5], z ∈ [0.05, 0.5]); see the clamp sketch after this list.
- Force/torque limit: stop the arm when contact force exceeds 20N.
- Emergency stop: ESC hotkey or a physical button.
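A minimal clamp for the workspace-bounds item, using the example bounds above (tune them for your cell):
import numpy as np

XY_MIN, XY_MAX = -0.5, 0.5  # example bounds from the list above
Z_MIN, Z_MAX = 0.05, 0.5

def clip_to_workspace(target_pose: np.ndarray) -> np.ndarray:
    """Clamp the translation part of a 4x4 target pose to the safe region."""
    pose = target_pose.copy()
    pose[:2, 3] = np.clip(pose[:2, 3], XY_MIN, XY_MAX)  # x, y
    pose[2, 3] = np.clip(pose[2, 3], Z_MIN, Z_MAX)      # z
    return pose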
Step 4: Fine-tune on real-world data (if needed)
The LIBERO checkpoint can be deployed zero-shot, but for high success on a new task, collect ~50-100 demo trajectories and fine-tune for an extra 20k steps. Format the data as RLDS — see the LeRobot tutorial for how to collect demos.
Tradeoffs & pitfalls
Pros:
- 8 hours of training on a single consumer GPU — anyone can do it.
- The 0.5B model deploys on a Jetson AGX Orin (32GB).
- MIT license — commercial use is fine.
Cons:
- OOD generalization isn't broadly tested — the paper mostly runs on LIBERO/CALVIN.
- No robotic-data pre-training → may underperform π0/RDT on unseen tasks.
- Real-world deployment only verified on Cobot Magic; no public Franka/UR-5 benchmark.
Pitfalls I keep running into:
- Wrong flash-attention version → 5× slower training. Stick to version 2.5.5.
- Forgetting --use_pro_version True → ~2% lower results. Always use Pro.
- Wrong image resolution — it must be 224×224, not 256 or 384.
- Dataset format — the RLDS data must come from OpenVLA's modified_libero_rlds release, not raw LIBERO.
Quick comparison with other VLAs
Compared to NVIDIA's VLA-0 (also based on Qwen but action-as-text), VLA-Adapter is faster because it skips autoregressive decoding. Compared to OpenHelix's Dual-System VLA (same team), VLA-Adapter is simpler — a single backbone instead of fast+slow systems. Compared to SimpleVLA-RL (RL fine-tuning), VLA-Adapter uses pure imitation learning — no reward design needed.
If your goal is rapid prototyping, training on a consumer GPU, or edge deployment, VLA-Adapter is currently the top pick. If you need the strongest cross-task generalization, π0/π0-FAST with a 3B backbone is still the safer bet.
Conclusion
VLA-Adapter shows that "scale isn't everything." With a smartly designed Bridge Attention, a 0.5B model can beat a 7B model on LIBERO. The lesson for the field: before scaling to 70B, ask whether your architecture is fully exploiting the features your backbone already produces.
For Vietnamese engineers, the practical implication is huge: you can start VLA research on a single RTX 4090 ($2k) instead of an A100 cluster ($50k+). The entry bar has just dropped dramatically.