
LabVLA: Open Source VLA for Lab Robots with Qwen3-VL

Run LabVLA — the first VLA model for scientific lab robots, combining Qwen3-VL-4B with DiT flow-matching and LeRobot v2 format. 71.1% on LabUtopia benchmark.

Nguyễn Anh Tuấn · June 12, 2026 · 12 min read

LabVLA: The First Open Source VLA Model for Scientific Laboratory Robots

Imagine handing a robot a written lab protocol — "Take the 100ml beaker, add 50ml of distilled water, place it on the magnetic stirrer, set temperature to 60°C for 5 minutes" — and watching the robot execute the entire procedure autonomously. That is precisely what LabVLA aims to deliver.

Released in June 2026 by researchers from Zhejiang University, Shanghai AI Laboratory, and Harbin Institute of Technology, LabVLA is the first Vision-Language-Action (VLA) model specifically designed for scientific laboratory environments. Unlike previous VLA systems trained on household manipulation data (picking objects, opening drawers, folding laundry), LabVLA understands laboratory equipment — beakers, flasks, magnetic stirrers, heating plates — and can reliably execute multi-step protocols.

Paper: LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories — Ren et al., arXiv:2606.13578, 2026
GitHub: github.com/zjunlp/LabVLA
Model: huggingface.co/zjunlp/LabVLA (5B parameters, MIT License)

LabVLA framework overview — source: zjunlp.github.io/LabVLA

Why Labs Need a Domain-Specific VLA

Existing VLA models like π₀ (Pi0 Fast) or X-VLA are primarily trained on Open X-Embodiment (OXE) data — hundreds of thousands of episodes involving household objects, kitchen tasks, and general manipulation. Scientific laboratories present fundamentally different challenges:

Perceptual challenges:

  • Transparent liquids (water, acids, solvents) — difficult to segment with standard RGB cameras
  • Specialized equipment (magnetic stirrers, pipettes, UV lamps, analytical balances) absent from home/kitchen datasets
  • Solution concentration inference from subtle color cues requiring fine-grained visual reasoning

Execution challenges:

  • Fixed sequential protocols: solutions must be prepared before heating — no reordering allowed
  • High precision requirements: operating pipettes with 1mm-diameter tips, manipulating narrow-mouth flasks
  • Safety constraints: no spills, no breakage of glassware

The authors define a four-tier capability pyramid for laboratory robots:

| Level | Name | Description |
|---|---|---|
| 1 | Apprentice | Single-step operations on instruction |
| 2 | Technician | Multi-step protocol execution with physical state changes (← LabVLA target) |
| 3 | Specialist | Precision instrument handling (micropipette, analytical balance) |
| 4 | Scientist | Adaptive decision-making based on experimental readouts |

LabVLA targets Level 2 — executing written protocols step by step, handling physical state transitions (liquid to vapor, color changes indicating reaction completion) in the correct order.

LabVLA Architecture: Qwen3-VL Meets DiT Flow-Matching

LabVLA has approximately 5 billion parameters, combining two main components into an end-to-end system:

1. VLM Backbone: Qwen3-VL-4B-Instruct

Qwen3-VL is the multimodal vision-language model from Alibaba's Qwen team. LabVLA uses the 4B instruction-tuned variant as the "brain" responsible for:

  • Understanding complex, multi-step language instructions with full context
  • Recognizing laboratory objects and equipment from multi-view RGB inputs
  • Producing 2560-dimensional hidden states that jointly encode language and visual information

2. Action Expert: 18-Layer DiT Flow-Matching Module

The DiT (Diffusion Transformer) action expert is an independent module with:

  • 18 transformer layers, width 1024, 8 attention heads, 128 head dimension
  • Cross-attention from DiT to VLM hidden states — the bridge between language/vision and action
  • Predicts 10-step continuous action chunks rather than single discrete actions
  • Inference requires only 10 Euler steps via deterministic vector field integration

Predicting action chunks rather than individual steps yields smoother robot motion and better handles control loop latency in real deployments.
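To make the "10 Euler steps" concrete, here is a minimal sketch of deterministic sampling for one 10-step action chunk. The `toy_velocity` function is a stand-in for the DiT action expert (it simply points at a known target), and the 7-dimensional action and the noise-at-t=0 convention are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

CHUNK_LEN, ACTION_DIM, NUM_EULER_STEPS = 10, 7, 10

def toy_velocity(x, t, target):
    """Stand-in for the DiT: a velocity that carries x to `target` by t = 1."""
    return (target - x) / (1.0 - t)

def sample_action_chunk(velocity_fn, rng, num_steps=NUM_EULER_STEPS):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)."""
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one deterministic Euler step
    return x

rng = np.random.default_rng(0)
target_chunk = np.linspace(0.0, 1.0, CHUNK_LEN * ACTION_DIM).reshape(CHUNK_LEN, ACTION_DIM)
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target_chunk), rng)
print(chunk.shape)
```

In the real model the velocity comes from a network forward pass conditioned on the VLM hidden states, so each chunk costs 10 passes through the action expert.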

Knowledge Insulation — The Key Design Insight

The most important architectural innovation in LabVLA is the stop-gradient between VLM and action expert during flow-matching posttraining.

Why it matters: If gradients from the flow-matching loss backpropagate into the VLM backbone, the model suffers from "catastrophic forgetting" — progressively losing its language understanding and visual recognition capabilities. This is a well-known failure mode when fine-tuning large VLMs with objective functions that differ significantly from the pretraining loss.

Knowledge Insulation ensures the VLM backbone retains its learned knowledge while the DiT action expert independently learns to transform hidden states into continuous actions.
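The stop-gradient idea can be illustrated in a few lines of PyTorch. This is only a sketch: the two `nn.Linear` layers are tiny stand-ins for the VLM backbone and the DiT action expert, and the loss is a generic flow-matching regression, not LabVLA's actual training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(16, 32)              # stand-in for the VLM backbone
action_expert = nn.Linear(32 + 7 + 1, 7)  # stand-in for the DiT action expert

obs = torch.randn(4, 16)     # fake fused vision-language features
target = torch.randn(4, 7)   # fake ground-truth actions
noise = torch.randn(4, 7)
t = torch.rand(4, 1)

hidden = backbone(obs)
hidden_insulated = hidden.detach()  # stop-gradient: cut the graph here

# Generic flow-matching regression on the insulated hidden states
x_t = (1 - t) * noise + t * target
pred = action_expert(torch.cat([hidden_insulated, x_t, t], dim=-1))
loss = ((pred - (target - noise)) ** 2).mean()
loss.backward()

backbone_has_grad = any(p.grad is not None for p in backbone.parameters())
expert_has_grad = all(p.grad is not None for p in action_expert.parameters())
print(backbone_has_grad, expert_has_grad)
```

Because of the `detach()`, the flow loss updates only the action expert; the backbone can still receive gradients from a separate loss computed on the non-detached `hidden`.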

Figure: Three-stage LabVLA training recipe: FAST action-token pretraining, flow-matching posttraining with knowledge insulation, and downstream task fine-tuning. Source: paper arXiv:2606.13578

RoboGenesis: Automated Laboratory Data Synthesis

The core bottleneck for lab robotics is data scarcity. No one has tens of thousands of robot episodes performing scientific laboratory tasks. The authors address this with RoboGenesis — an automated data synthesis pipeline:

Stage 1 — Environment Building: Constructs 10,000 diverse lab scenes from LabAssetLibrary containing 2,947 annotated 3D assets (beakers, flasks, hot plates, magnetic stirrers, pipettes, etc.). Scenes follow physical plausibility rules — liquids obey gravity, glassware has correct weight and friction properties.

Stage 2 — Agentic Workflow Generation: An LLM automatically generates experimental protocols in natural language, then "compiles" them into sequences of atomic skills (Pick, Place, Pour, Stir, Heat, Shake) instantiated across different robot embodiments with appropriate kinematics.

Stage 3 — Domain Randomization and Export: Each protocol executes under varied conditions: lighting (bright/dim/shadow), object placement, camera angles, random obstacles, surface textures. Only successful episodes pass the filter and are exported with 15 synchronized annotation streams.
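The randomize-then-filter loop in Stage 3 can be sketched as follows. Everything here is hypothetical: the `SceneConditions` fields, the sampling ranges, and the dummy `run_episode` success rule are illustrative placeholders, since the article does not describe RoboGenesis's actual interfaces.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneConditions:
    lighting: str
    camera_yaw_deg: float
    num_obstacles: int
    object_offset_cm: tuple

def sample_conditions(rng):
    """Draw one randomized set of conditions (ranges are made up)."""
    return SceneConditions(
        lighting=rng.choice(["bright", "dim", "shadow"]),
        camera_yaw_deg=rng.uniform(-15.0, 15.0),
        num_obstacles=rng.randint(0, 3),
        object_offset_cm=(rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)),
    )

def run_episode(conditions):
    """Placeholder for a simulator rollout; a dummy rule decides success."""
    return conditions.num_obstacles < 3

def synthesize(num_trials, seed=0):
    """Keep only the conditions whose rollout succeeded."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_trials):
        conditions = sample_conditions(rng)
        if run_episode(conditions):  # success filter before export
            kept.append(conditions)
    return kept

episodes = synthesize(100)
print(f"kept {len(episodes)} of 100 randomized rollouts")
```

Filtering on success means the exported dataset contains only demonstrations that completed the protocol, at the cost of discarding some simulation time.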

Output — LabEmbodied-Data:

  • 4 task families: single-arm primitives, multistep procedures, bimanual operations, mobile manipulation
  • 16 robot embodiments: 13 single-arm (UR5e, UR16e, FR3, Franka, Flexiv Rizon 4...) + 3 dual-arm
  • Format: LeRobot v2 (Parquet + metadata JSON, compatible with the LeRobot framework)
  • 15 annotation streams, including multi-view RGB, joint states, action labels, language instructions, and depth maps

Importantly, LabEmbodied-Data is useful beyond LabVLA itself: fine-tuning X-VLA on this dataset improved performance by +15.0 percentage points in-distribution and +19.3 points OOD without any architectural changes.

Installation

System requirements:

  • Python 3.10
  • CUDA 12.6
  • PyTorch 2.7.1
  • GPU: A100 80GB for training; for inference, a smaller GPU with enough memory for the ~10GB BF16 checkpoint

Step 1 — Clone and create conda environment:

git clone https://github.com/zjunlp/LabVLA
cd LabVLA

conda create -n labvla python=3.10 -y
conda activate labvla

Step 2 — Install PyTorch with CUDA 12.6:

pip install torch==2.7.1 torchvision==0.22.1 \
    --index-url https://download.pytorch.org/whl/cu126

Step 3 — Install Flash Attention:

# --no-build-isolation: use the already-installed PyTorch, not a fresh build env
pip install flash_attn==2.8.3 --no-build-isolation

Step 4 — Install remaining dependencies:

pip install -r requirements.txt

Step 5 — Download pretrained checkpoint:

# ~10GB, BF16 safetensors format
huggingface-cli download zjunlp/LabVLA --local-dir ./LabVLA-checkpoint
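After the steps above, a quick sanity check on the pinned versions can save a failed training launch. The sketch below uses only the standard library; the `check_environment` helper is a hypothetical name, not part of the LabVLA repository.

```python
from importlib import metadata

# Versions pinned in the installation steps of this article
EXPECTED = {"torch": "2.7.1", "torchvision": "0.22.1", "flash_attn": "2.8.3"}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_environment(expected=EXPECTED, lookup=installed_version):
    """Return {package: (expected, found)} for every mismatch."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # CUDA wheels carry a local suffix such as "+cu126", so compare prefixes
        if found is None or not found.startswith(wanted):
            problems[package] = (wanted, found)
    return problems

if __name__ == "__main__":
    for package, (wanted, found) in check_environment().items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result means all three pinned packages match.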

Three-Stage Training Pipeline

LabVLA uses an adapted π0.5 recipe. If you want to fine-tune on your own robot data, here is the complete pipeline:

Stage 0: Data Preparation

# Scan dataset: detect anomalous episodes, action outliers
python -m data_process scan \
    --root /path/to/your/dataset \
    --out /tmp/report.json

# Clean: remove faulty episodes, normalize action ranges
python -m data_process clean \
    --src /path/to/your/dataset \
    --dst /path/to/clean_dataset \
    --report /tmp/report.json

# Verify final statistics
python -m data_process stats \
    --dataset /path/to/clean_dataset \
    --schema robointer_droid

Data must be in LeRobot v2 format: each episode is a Parquet file with columns observation.images.*, observation.state, action, and timestamp. The data_process module validates and normalizes automatically.
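Before running the scan step, it can help to confirm that an episode file exposes the columns listed above. The following is a small sketch, assuming you already have the column names as a list; `missing_columns` is a hypothetical helper and does not replace the repository's `data_process` module.

```python
import fnmatch

# Column patterns named in this article for LeRobot v2 episodes
REQUIRED_PATTERNS = ["observation.images.*", "observation.state", "action", "timestamp"]

def missing_columns(columns, required=REQUIRED_PATTERNS):
    """Return the required patterns that no column name matches."""
    return [pattern for pattern in required if not fnmatch.filter(columns, pattern)]

# With pyarrow installed, the names could come from a Parquet schema, e.g.
#   import pyarrow.parquet as pq
#   columns = pq.read_schema("episode.parquet").names  # path is a placeholder
columns = [
    "observation.images.front",
    "observation.images.wrist",
    "observation.state",
    "action",
    "timestamp",
]
print(missing_columns(columns))
```

An empty list means every required pattern is matched by at least one column.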

Stage 1: VLM Pretraining with FAST Action Tokens

bash launch/vlm_pretrain/train_vlm_pretrain.sh

The Qwen3-VL backbone learns to "read" actions through FAST action tokenization — discretizing continuous actions into tokens and training with cross-entropy loss (analogous to next-token prediction in language modeling). The goal is to make the VLM aware of the action space before switching to continuous flow-matching.
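To give a feel for what "discretizing continuous actions into tokens" means, here is a simplified sketch in the spirit of FAST: a discrete cosine transform along the time axis followed by rounding to integers. The published FAST tokenizer additionally compresses the quantized coefficients with byte-pair encoding, which is omitted here; the `scale` value and the synthetic chunk are assumptions for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    mat[0] /= np.sqrt(2.0)
    return mat

def encode(chunk, scale=50.0):
    """Continuous (time, dof) chunk -> integer DCT coefficients."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk
    return np.round(coeffs * scale).astype(np.int64)

def decode(tokens, scale=50.0):
    """Integer coefficients -> approximate continuous chunk."""
    return dct_matrix(tokens.shape[0]).T @ (tokens / scale)

t = np.linspace(0.0, 1.0, 10)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8) / 7.0)  # synthetic (10, 7) chunk
tokens = encode(chunk)
max_error = np.abs(decode(tokens) - chunk).max()
print(tokens.shape, tokens.dtype, max_error < 0.05)
```

Smooth trajectories concentrate their energy in a few low-frequency coefficients, which is why a frequency-domain representation yields compact token sequences for cross-entropy training.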

Training performance (A100 80GB, DeepSpeed ZeRO-2):

| Stage | Batch size per GPU | Global batch size | Time per step |
|---|---|---|---|
| VLM Pretraining | 24 | 1,536 | ~7s |
| KI Posttraining | 16 | 1,024 | ~5s |
| Task Fine-tuning | 4 | 192 | ~3s |

Liger-Kernel operator fusion and selective gradient checkpointing substantially reduce memory compared to a naive implementation.
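The two batch-size columns in the table above imply how many GPUs (times gradient-accumulation steps) each stage used, since global batch = per-GPU batch × GPUs × accumulation steps. The article does not state the GPU count or accumulation setting, so the sketch below only recovers their product.

```python
# (per-GPU batch, global batch) as reported in the training table
STAGES = {
    "VLM Pretraining": (24, 1536),
    "KI Posttraining": (16, 1024),
    "Task Fine-tuning": (4, 192),
}

def implied_multiplier(per_gpu_batch, global_batch):
    """Return num_gpus * grad_accum_steps implied by the two batch sizes."""
    multiplier, remainder = divmod(global_batch, per_gpu_batch)
    if remainder:
        raise ValueError("global batch is not a multiple of the per-GPU batch")
    return multiplier

multipliers = {name: implied_multiplier(*sizes) for name, sizes in STAGES.items()}
for name, value in multipliers.items():
    print(f"{name}: GPUs x accumulation steps = {value}")
```

If you train on fewer GPUs, raising the accumulation steps to keep this product constant preserves the global batch size.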

Stage 2: Flow-Matching Posttraining with Knowledge Insulation

bash launch/ki_posttrain/train_ki_posttrain.sh

The DiT action expert is trained with the flow-matching objective to predict the vector field that carries noise to target actions; at inference, that field is integrated with the 10 Euler steps described earlier. The stop-gradient (Knowledge Insulation) prevents flow loss gradients from reaching the VLM backbone.
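For reference, one common way to write this objective is the rectified-flow form below. It is a sketch under assumptions: $a$ is the target action chunk, $\epsilon$ is Gaussian noise, $h$ is the VLM hidden state, $\mathrm{sg}(\cdot)$ denotes stop-gradient, and the paper's exact time convention or loss weighting may differ.

```latex
x_t = (1 - t)\,\epsilon + t\,a, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{t,\,\epsilon,\,a}
  \left\| v_\theta\big(x_t,\, t,\, \mathrm{sg}(h)\big) - (a - \epsilon) \right\|^2

x_{k+1} = x_k + \tfrac{1}{10}\, v_\theta\big(x_k,\, \tfrac{k}{10},\, h\big),
  \qquad k = 0, \dots, 9, \quad x_0 = \epsilon
```

The first line defines the interpolation path, the second is the training loss with the insulated hidden state, and the third is the 10-step Euler rollout used at inference.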

Stage 3: Task Fine-Tuning

# Fine-tune on LabUtopia benchmark tasks
bash launch/finetune/train_labutopia.sh

# Or fine-tune on your own dataset
bash launch/finetune/train_custom.sh \
    --data_path /path/to/clean_dataset \
    --output_dir ./checkpoints/my_labvla \
    --num_epochs 50

Fine-tuning is the most critical stage for adapting LabVLA to your specific robot. Even a modest real-robot dataset (a few hundred episodes) combined with LabEmbodied-Data yields substantially better performance than simulation data alone.

Inference and Deployment

LabVLA serves actions over the OpenPI msgpack WebSocket protocol, the msgpack-over-WebSocket client/server interface from Physical Intelligence's openpi project.

Start the inference server:

PRETRAINED_PATH=./LabVLA-checkpoint bash deployment/deploy.sh

The server listens on a WebSocket, receives observations (multi-view RGB images + joint state + language instruction), and returns a 10-step action chunk. With only 10 Euler steps required, inference latency is low enough for closed-loop robot control.

Basic Python client example:

import asyncio
import msgpack
import websockets
import numpy as np

async def get_lab_action(obs_images, joint_state, instruction):
    uri = "ws://localhost:8080"
    async with websockets.connect(uri) as ws:
        payload = msgpack.packb({
            "images": [img.tolist() for img in obs_images],
            "state": joint_state.tolist(),
            "instruction": instruction
        })
        await ws.send(payload)
        response = await ws.recv()
        result = msgpack.unpackb(response)
        # shape: (10, 7) — 10 steps, 7 DOF for Franka
        return np.array(result["actions"])

# Usage: front_cam_frame, wrist_cam_frame, and robot are placeholders for
# your own camera frames (NumPy arrays) and robot driver object.
actions = asyncio.run(get_lab_action(
    obs_images=[front_cam_frame, wrist_cam_frame],
    joint_state=robot.get_joint_positions(),
    instruction="Pour the liquid from the beaker into the flask carefully"
))

# Execute the action chunk on the robot, one step at a time
for action_step in actions:
    robot.move_to_joint_positions(action_step)

The WebSocket server architecture allows LabVLA to run on a powerful workstation (with GPU) while the robot controller runs on an embedded compute module — a common pattern in lab automation deployments.
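The client above executes a full chunk open-loop. A common alternative with chunked policies is to execute only the first few actions and then query again with a fresh observation. Whether LabVLA's deployment scripts do this is not stated in the article, so treat the sketch below as one possible control loop; `run_closed_loop` and its callbacks are hypothetical names.

```python
import numpy as np

def run_closed_loop(get_chunk, get_observation, send_action, num_steps, execute_per_chunk=5):
    """Execute the first `execute_per_chunk` actions of each chunk, then replan."""
    executed = 0
    queries = 0
    while executed < num_steps:
        chunk = np.asarray(get_chunk(get_observation()))  # e.g. shape (10, 7)
        queries += 1
        for action in chunk[:execute_per_chunk]:
            if executed >= num_steps:
                break
            send_action(action)
            executed += 1
    return queries

# Dummy stand-ins so the loop can run without a server or a robot
sent = []
queries = run_closed_loop(
    get_chunk=lambda obs: np.zeros((10, 7)),
    get_observation=lambda: None,
    send_action=sent.append,
    num_steps=12,
    execute_per_chunk=5,
)
print(queries, len(sent))
```

A smaller `execute_per_chunk` reacts faster to disturbances such as a shifted beaker, but issues more inference requests per second.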

Figure: 16 robot embodiments supported by a single LabVLA checkpoint: UR5e, UR16e, FR3, Franka, and dual-arm variants, all from a single set of weights. Source: zjunlp.github.io/LabVLA

Results: LabUtopia Benchmark

The authors built LabUtopia — a 6-task benchmark covering representative laboratory operations, evaluated in both in-distribution (ID) and out-of-distribution (OOD) conditions:

| Task | LabVLA ID | LabVLA OOD | π₀ 3B ID | GR00T N1.5 ID |
|---|---|---|---|---|
| Pick Up | 49.2% | 48.3% | 21.7% | 40.8% |
| Press Button | 100% | 98.3% | 92.5% | 99.2% |
| Open Door | 65.0% | 65.8% | 51.6% | 6.7% |
| Pour Liquid | 43.3% | 34.2% | 37.5% | 0% |
| Heat Beaker | 83.3% | 87.5% | 90.0% | 99.2% |
| Transport Beaker | 85.8% | 85.8% | 86.7% | 69.2% |
| Average | 71.1% | 70.0% | 63.3% | 52.5% |

Key observations:

  • LabVLA outperforms π₀ (3B) by 7.8 percentage points in-distribution and 6.8 points OOD
  • The ID→OOD gap is only 1.1 points (71.1%→70.0%), demonstrating robust domain randomization
  • GR00T N1.5 scores 0% on "Pour Liquid" — liquid handling is a complete blind spot without domain-specific training data
  • LabVLA is the most consistent model in-distribution: no ID task falls below 43% (its lowest OOD score is 34.2% on "Pour Liquid"), whereas the baselines vary widely across task types
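The averages in the table can be reproduced from the per-task scores, which also makes the consistency comparison easy to check. The sketch below copies the numbers from the table; the `summarize` helper is just for illustration.

```python
# Per-task success rates (%) in table order: Pick Up, Press Button, Open Door,
# Pour Liquid, Heat Beaker, Transport Beaker
RESULTS = {
    "LabVLA ID": [49.2, 100.0, 65.0, 43.3, 83.3, 85.8],
    "LabVLA OOD": [48.3, 98.3, 65.8, 34.2, 87.5, 85.8],
    "pi0 3B ID": [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
    "GR00T N1.5 ID": [40.8, 99.2, 6.7, 0.0, 99.2, 69.2],
}

def summarize(scores):
    """Mean, worst task, and best-minus-worst spread for one column."""
    return {
        "mean": round(sum(scores) / len(scores), 1),
        "min": min(scores),
        "spread": round(max(scores) - min(scores), 1),
    }

summary = {name: summarize(scores) for name, scores in RESULTS.items()}
for name, stats in summary.items():
    print(name, stats)
```

The worst-task column is the useful one here: LabVLA's lowest score is on "Pour Liquid" in both settings, while GR00T N1.5 ranges from 0% to 99.2% across tasks.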

Real-Robot Results on Franka

Beyond simulation, the team validated LabVLA on a Franka Emika arm with 4 composite tasks, 50 rollouts each, across four conditions:

| Task | In-domain | ID Clutter | OOD Clean | OOD Clutter |
|---|---|---|---|---|
| Shake Liquid | 92% | 86% | 84% | 80% |
| Pour Liquid | 86% | 78% | 76% | 72% |
| Magnetic Stir | 88% | 80% | 80% | 74% |
| Stopper Plug/Unplug | 80% | 76% | 80% | 70% |
| Average | 86.5% | 80.0% | 80.0% | 74.0% |

86.5% on in-domain clean conditions is a strong result for laboratory tasks. "Shake Liquid" at 92% requires controlling both trajectory and shake amplitude — the robot must keep liquid from spilling throughout the motion. "Stopper Plug/Unplug" at 80% demands sub-2mm alignment precision.

Figure: Franka real-robot experimental setup: the robot manipulating beakers, flasks, a magnetic stirrer, and a heating plate under in-domain and OOD conditions. Source: zjunlp.github.io/LabVLA

Baseline Comparison

| Model | Params | LabUtopia Avg ID | Domain-specific data | License |
|---|---|---|---|---|
| LabVLA | 5B | 71.1% | ✅ LabEmbodied-Data | MIT |
| π₀ (HF) | 3B | 63.3% | | Apache-2.0 |
| GR00T N1.5 | 3B | 52.5% | | NVIDIA |
| X-VLA | 1B | 57.5% | | Apache-2.0 |
| SmolVLA | 450M | 38.2% | | Apache-2.0 |

The advantage of LabVLA is not just the higher average score but also its consistency. GR00T N1.5, for example, reaches 99.2% on "Heat Beaker" but 0% on "Pour Liquid." LabVLA has no in-distribution task below 43%, which suggests that domain-specific training data narrows the gap across task types.

Limitations and Future Directions

The authors acknowledge several current limitations:

  1. Level 3 (Specialist) remains out of reach: Micropipette manipulation (1 µL precision), analytical balance operations with 0.001 g tolerance
  2. Special liquids: Colored solutions, foams, high-viscosity substances not fully tested
  3. Hardware requirements: Training requires A100 80GB GPUs — cloud access needed for smaller labs
  4. Long-horizon protocols: Sequences with more than 10 steps not rigorously evaluated

Roadmap: Level 3 (precision instruments) and Level 4 (adaptive decision-making) are in the development plan for subsequent releases.

Conclusion

LabVLA marks an important shift in direction: domain-specialized VLA models rather than purely generalist systems. By combining Qwen3-VL-4B with DiT flow-matching, training on RoboGenesis-synthesized lab data, and applying Knowledge Insulation during posttraining, the model achieves 71.1% in-distribution on the LabUtopia simulation benchmark, 7.8 points above the strongest baseline (π₀ at 63.3%), and an 86.5% in-domain average on a real Franka robot.

Most importantly, the entire pipeline (model weights, training code, data generation scripts, benchmark) is fully open-sourced under MIT license. If you are working on robotics for pharmaceutical, chemistry, biology, or any domain requiring laboratory automation, LabVLA provides a strong foundation to build on.

For a deeper understanding of the LeRobot ecosystem that LabVLA builds on, see the LeRobot framework guide.

