While the Vision-Language-Action (VLA) field has been racing for scale — OpenVLA 7B, π0 3B, RDT 1B — China's OpenHelix Team uploaded a paper to arXiv that swims firmly against the current: VLA-Adapter uses only a Qwen2.5-0.5B backbone (half a billion parameters, 14× smaller than OpenVLA), no robotic pre-training, and still hits 99.2% on LIBERO-Object, rising to 99.6% on both LIBERO-Object and LIBERO-Spatial in the Pro variant. Crucially: 8 hours of training on a single RTX 4090, or as little as 9.6GB of VRAM on an RTX 3060.
This is great news for every robotics engineer who wants to enter the VLA game without an A100 cluster. In this guide we'll go from the paper idea → the Bridge Attention architecture → installation step by step → training on LIBERO → real-world deployment on Franka/UR-5.
Why this paper matters
Before VLA-Adapter, the VLA field largely followed a "scaling hypothesis": good manipulation requires a large backbone (≥7B) and pre-training on millions of robot trajectories (OXE dataset, DROID). The consequences:
- Training cost: OpenVLA needs ~8 days on 64 A100s. π0 needs a TPU pod. Small players can fine-tune but not train from scratch.
- Inference VRAM: A 7B fp16 model needs 14GB just to load weights. Add KV cache → hard to deploy on Jetson Orin (8-16GB).
- Latency: Action chunking on 7B → 5-10Hz, which limits high-speed manipulation.
OpenHelix asked the opposite question: do we really need a large backbone, or do we just need to know how to intelligently "extract" conditioning from a VLM?
The paper's answer: with a lightweight Policy module equipped with Bridge Attention, a 0.5B backbone is enough to beat OpenVLA-OFT (7B) on LIBERO. Original paper: arXiv:2509.09372 — Wang et al., 2025.
The core idea: Bridge Attention
Traditional VLA models use two ways to bridge VLM and action head:
- Token-as-action (OpenVLA, VLA-0): Treat actions as special tokens, decode autoregressively. Slow, because action tokens must be generated one at a time.
- Feature-as-condition (π0, RDT, OpenVLA-OFT): Use VLM hidden states as condition for a diffusion/flow policy. Faster but still requires a big backbone.
VLA-Adapter picks option 2 but adds three crucial improvements:
1. ActionQuery tokens — trainable queries
The authors append 64 learnable tokens (ActionQuery) to the VLM input, trained from scratch. These tokens behave like "questions" the policy module sends to the VLM: "What action does this scene call for?" The VLM answers by updating these tokens through self-attention layers.
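To make that concrete, here is a minimal sketch of what appending learnable query tokens looks like. The names (action_queries, build_vlm_input) and the init scale are my own assumptions; the real repo wires this into the Prismatic VLM rather than a bare function.
import torch
import torch.nn as nn

# Illustrative sketch (assumed names, not the repo's code):
# 64 learnable ActionQuery tokens appended to the VLM input sequence.
n_queries, d_model = 64, 896  # 896 = Qwen2.5-0.5B hidden size
action_queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)

def build_vlm_input(vl_embeds: torch.Tensor) -> torch.Tensor:
    # vl_embeds: [B, N, D] embedded vision + language tokens
    q = action_queries.unsqueeze(0).expand(vl_embeds.size(0), -1, -1)
    # The VLM's self-attention updates these query positions layer by layer.
    return torch.cat([vl_embeds, q], dim=1)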
2. Multi-layer feature injection
Instead of only grabbing the last-layer hidden state, VLA-Adapter injects features from multiple layers into the policy:
- Raw features (vision + language) are taken from middle layers (less semantic bias than deep layers).
- ActionQuery features are taken from deep layers (filtered through many attention layers).
Ablations show multi-layer beats single-layer by about 2% success rate.
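For intuition, here is a text-only toy showing how hidden states at multiple depths come out of a Hugging Face causal LM. In VLA-Adapter the sequence would also contain vision patches and the 64 ActionQuery tokens; this is not the repo's actual extraction code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ids = tok("pick up the red cup", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids=ids, output_hidden_states=True)
hidden = out.hidden_states            # tuple of (n_layers + 1) tensors, each [B, N, D]
raw_feats = hidden[len(hidden) // 2]  # middle depth -> "raw" features
deep_feats = hidden[-1]               # deepest layer -> where ActionQuery features live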
3. Bridge Attention with learnable injection degree
This is the main innovation. Instead of directly summing condition into the policy, VLA-Adapter uses gated cross-attention:
# Pseudo-code for Bridge Attention
gate_raw = sigmoid(W_g @ raw_features)  # learnable injection ratio in (0, 1)
gate_query = 1.0                        # ActionQuery features: fully injected
policy_input = gate_raw * raw_features + gate_query * action_query_features
action = diffusion_head(policy_input, noisy_action)
The intuition: raw features should be injected with restraint (they carry image-text pre-training bias), while ActionQuery features can be injected at full strength (they were trained specifically for action).
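A minimal PyTorch reading of the gated cross-attention idea, assuming the ActionQuery tokens attend to the raw features and the result is blended in through a learned sigmoid gate. This is one plausible interpretation, not the authors' exact module.
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Sketch: queries attend to raw VLM features; the raw contribution is gated."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1)  # per-token injection degree

    def forward(self, query_feats: torch.Tensor, raw_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: [B, 64, D] ActionQuery features (deep layers, fully injected)
        # raw_feats:   [B, N, D]  vision + language features (middle layers, gated)
        attended, _ = self.xattn(query_feats, raw_feats, raw_feats)
        g = torch.sigmoid(self.gate(attended))  # learnable injection ratio in (0, 1)
        return query_feats + g * attended       # gated raw path + full query path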
The numbers
| Suite | OpenVLA-OFT (7B) | π0 (3B) | VLA-Adapter (0.5B) | VLA-Adapter-Pro (0.5B) |
|---|---|---|---|---|
| LIBERO-Spatial | 97.6% | 96.8% | 97.8% | 99.6% |
| LIBERO-Object | 98.4% | 98.8% | 99.2% | 99.6% |
| LIBERO-Goal | 97.9% | 95.8% | 97.2% | 98.2% |
| LIBERO-Long | 94.5% | 85.2% | 95.0% | 96.4% |
| Average | 97.1% | 94.2% | 97.3% | 98.5% |
The kicker: with a frozen backbone, VLA-Adapter still hits 86.4% on LIBERO-Long, while competitors collapse to 0%. That shows Bridge Attention is genuinely extracting useful conditioning without needing to fine-tune the VLM.
For inference speed, VLA-Adapter reports the fastest inference among comparable VLAs, since it runs only one forward pass through a 0.5B model plus an 8-step diffusion policy (97M params).
Installation from scratch
System requirements
- OS: Ubuntu 22.04 (I've tested 20.04 too)
- CUDA: 11.8 or 12.1
- GPU: At least an RTX 2080Ti 11GB (batch=1, LoRA rank=64)
- Disk: ~50GB for dataset + checkpoints
Step 1: Create the conda env
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .
# Flash-attention — version 2.5.5 has been tested by the authors
pip install "flash-attn==2.5.5" --no-build-isolation
If flash-attn pip install fails (common on RTX 3060/3080):
pip install ninja
MAX_JOBS=4 pip install "flash-attn==2.5.5" --no-build-isolation
Step 2: Download the Prismatic + Qwen2.5-0.5B backbone
mkdir -p pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/OpenHelix-Team/prism-qwen25-extra-dinosiglip-224px-0_5b
This backbone uses:
- Vision encoder: DINOv2 + SigLIP (concatenated patches)
- Resolution: 224×224
- Language model: Qwen2.5-0.5B-Instruct
- Projector: 2-layer MLP
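If git-lfs cloning is slow or unavailable, the same snapshot can be fetched with huggingface_hub (a standard API, not a repo-specific script):
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OpenHelix-Team/prism-qwen25-extra-dinosiglip-224px-0_5b",
    local_dir="pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b",
)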
Step 3: Download the LIBERO dataset (~10GB)
cd VLA-Adapter
mkdir -p data && cd data
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_spatial.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_object.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_goal.zip
wget https://huggingface.co/datasets/openvla/modified_libero_rlds/resolve/main/libero_10.zip # LIBERO-Long
for f in *.zip; do unzip "$f"; done
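A quick sanity check that everything extracted where the training script expects it (run from the repo root; directory names assumed from the zip names):
import os

# Verify each RLDS suite directory exists under data/ (assumed layout).
for suite in ["libero_spatial", "libero_object", "libero_goal", "libero_10"]:
    print(suite, "OK" if os.path.isdir(os.path.join("data", suite)) else "MISSING")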
Training on LIBERO-Object (target: 99.2%)
Config by VRAM
| VRAM | Batch size | LoRA rank | Typical GPU |
|---|---|---|---|
| 9.6GB | 1 | 64 | RTX 2080Ti 11GB, 3060 12GB |
| 24GB | 4 | 64 | RTX 3090, 4090 |
| 40-48GB | 8 | 64 | A100-40GB, RTX 5090 |
| ≥80GB | 16 | 64 | A100-80GB, H100 |
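For the 9.6GB row, the training command below still applies; swap in batch 1 with gradient accumulation to keep the effective batch at 4 (my suggestion, not an official config):
--batch_size 1 \
--grad_accumulation_steps 4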
Training command (single 24GB GPU)
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 \
vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--data_root_dir data \
--dataset_name libero_object \
--run_root_dir outputs \
--run_id LIBERO-Object-Pro \
--batch_size 4 \
--grad_accumulation_steps 1 \
--learning_rate 5e-4 \
--max_steps 200005 \
--save_freq 10000 \
--use_pro_version True \
--lora_rank 64 \
--image_aug True
On an RTX 4090, 200k steps ≈ 8 hours. With batch=1 on an RTX 3060, expect ~30-36 hours — still doable overnight plus a day.
Monitoring training
tensorboard --logdir outputs/LIBERO-Object-Pro/runs --port 6006
Key metrics:
- loss/action: should drop from ~0.6 to ~0.05 by 50k steps.
- loss/diffusion: similar, decreasing monotonically.
- lr: 1000-step warmup → cosine decay.
If loss plateaus too early (before 30k steps), common causes:
- Image augmentation disabled → set --image_aug True.
- LoRA rank too low → try rank 128.
- Learning rate too low → bump to 1e-3.
Inference & evaluation
Evaluate on the LIBERO simulator
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
--pretrained_checkpoint outputs/LIBERO-Object-Pro/checkpoint-200000 \
--task_suite_name libero_object \
--use_pro_version True \
--num_trials_per_task 50
Expected: success rate ≈ 99.2-99.6%. If you see below 95%, check:
- Did the checkpoint load correctly? Look for key-mismatch warnings in the log.
- LIBERO version: use commit b5d8e0a exactly (linked in the README).
Real-time inference loop
from vla_adapter import VLAAdapter
import torch
from PIL import Image
vla = VLAAdapter.from_pretrained(
"outputs/LIBERO-Object-Pro/checkpoint-200000",
device="cuda",
use_pro_version=True
)
image = Image.open("camera_frame.jpg").resize((224, 224))
instruction = "pick up the red cup and place it on the plate"
with torch.no_grad():
action_chunk = vla.predict_action(
image=image,
instruction=instruction,
action_chunk_size=8 # predict 8 future actions
)
# action_chunk: shape [8, 7] for a 7-DOF action (xyz delta + axis-angle rotation + gripper)
print(action_chunk)
Inference speed on RTX 4090: ~30-50Hz (action chunk 8) — fast enough for moderate-speed manipulation.
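In a live loop you would typically execute part of the chunk and then re-predict. A hypothetical pattern, where get_camera_frame and send_to_robot stand in for your own camera and robot interfaces (they are not part of the repo):
import time

CONTROL_HZ = 15    # per-step execution rate; tune to your robot (assumption)
REPLAN_EVERY = 4   # execute 4 of the 8 predicted steps, then re-predict

while True:
    image = get_camera_frame()  # your camera helper (assumed)
    chunk = vla.predict_action(image=image, instruction=instruction, action_chunk_size=8)
    for action in chunk[:REPLAN_EVERY]:
        send_to_robot(action)   # your robot interface (assumed)
        time.sleep(1.0 / CONTROL_HZ)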
Real-world deployment on Franka / UR-5 / ALOHA
In March 2026 the authors added Cobot Magic (ALOHA-style) deployment support. Code lives in experiments/robot/aloha/. The workflow is similar for Franka/UR-5:
Step 1: Calibrate the camera
Place an RGB camera at a corner of the workspace (top-down or 45°). Resize input to 224×224. Important: keep the camera angle similar to the training data; a large mismatch tanks performance.
Step 2: Adapter for the action space
VLA-Adapter outputs delta end-effector pose (7-DOF: xyz delta + axis-angle + gripper). For Franka:
# Franka sketch via a libfranka-style Python wrapper (interface names vary by setup)
import numpy as np
from scipy.spatial.transform import Rotation
from franka_interface import FrankaArm

def delta_to_homog(delta_xyz, delta_rot):
    """Build a 4x4 homogeneous transform from a position delta and an axis-angle rotation."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(delta_rot).as_matrix()
    T[:3, 3] = delta_xyz
    return T

arm = FrankaArm()
current_pose = arm.get_pose()           # 4x4 end-effector pose matrix
delta_xyz = action_chunk[0][:3] * 0.05  # scale to 5 cm per step
delta_rot = action_chunk[0][3:6]        # axis-angle rotation delta
gripper_cmd = action_chunk[0][6]        # 0 = close, 1 = open
target_pose = current_pose @ delta_to_homog(delta_xyz, delta_rot)
arm.move_to_pose(target_pose, duration=0.05)
arm.set_gripper(gripper_cmd)
Step 3: Safety wrapper
Mandatory:
- Workspace bounds: clip the target pose to a safe region (e.g. xy ∈ [-0.5, 0.5], z ∈ [0.05, 0.5]); see the clamp sketch after this list.
- Force/torque limit: stop the arm when contact force exceeds 20N.
- Emergency stop: ESC hotkey or a physical button.
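A minimal clamp for the workspace-bounds item, using the example bounds above (tune them for your cell):
import numpy as np

XY_MIN, XY_MAX = -0.5, 0.5  # example bounds from the list above
Z_MIN, Z_MAX = 0.05, 0.5

def clip_to_workspace(target_pose: np.ndarray) -> np.ndarray:
    """Clamp the translation part of a 4x4 target pose to the safe region."""
    pose = target_pose.copy()
    pose[:2, 3] = np.clip(pose[:2, 3], XY_MIN, XY_MAX)  # x, y
    pose[2, 3] = np.clip(pose[2, 3], Z_MIN, Z_MAX)      # z
    return pose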
Step 4: Fine-tune on real-world data (if needed)
The LIBERO checkpoint can be deployed zero-shot, but for high success on a new task, collect ~50-100 demo trajectories and fine-tune for an extra 20k steps. Format the data as RLDS — see the LeRobot tutorial for how to collect demos.
Tradeoffs & pitfalls
Pros:
- 8 hours of training on a single consumer GPU — anyone can do it.
- The 0.5B model deploys on a Jetson AGX Orin (32GB).
- MIT license — commercial use is fine.
Cons:
- OOD generalization isn't broadly tested — the paper mostly runs on LIBERO/CALVIN.
- No robotic-data pre-training → may underperform π0/RDT on unseen tasks.
- Real-world deployment only verified on Cobot Magic; no public Franka/UR-5 benchmark.
Pitfalls I keep running into:
- Wrong flash-attention version → 5× slower training. Stick to version 2.5.5.
- Forgetting --use_pro_version True → ~2% lower results. Always use Pro.
- Wrong image resolution — it must be 224×224, not 256 or 384.
- Dataset format — the RLDS data must come from OpenVLA's modified_libero_rlds release, not raw LIBERO.
Quick comparison with other VLAs
Compared to NVIDIA's VLA-0 (also based on Qwen but action-as-text), VLA-Adapter is faster because it skips autoregressive decoding. Compared to OpenHelix's Dual-System VLA (same team), VLA-Adapter is simpler — a single backbone instead of fast+slow systems. Compared to SimpleVLA-RL (RL fine-tuning), VLA-Adapter uses pure imitation learning — no reward design needed.
If your goal is rapid prototyping, training on a consumer GPU, or edge deployment, VLA-Adapter is currently the top pick. If you need the strongest cross-task generalization, π0/π0-FAST with a 3B backbone is still the safer bet.
Conclusion
VLA-Adapter shows that "scale isn't everything." With a smartly designed Bridge Attention, a 0.5B model can beat a 7B model on LIBERO. The lesson for the field: before scaling to 70B, ask whether your architecture is fully exploiting the features your backbone already produces.
For Vietnamese engineers, the practical implication is huge: you can start VLA research on a single RTX 4090 ($2k) instead of an A100 cluster ($50k+). The entry bar has just dropped dramatically.