LabVLA: The First Open Source VLA Model for Scientific Laboratory Robots
Imagine handing a robot a written lab protocol — "Take the 100ml beaker, add 50ml of distilled water, place it on the magnetic stirrer, set temperature to 60°C for 5 minutes" — and watching the robot execute the entire procedure autonomously. That is precisely what LabVLA aims to deliver.
Released in June 2026 by researchers from Zhejiang University, Shanghai AI Laboratory, and Harbin Institute of Technology, LabVLA is the first Vision-Language-Action (VLA) model specifically designed for scientific laboratory environments. Unlike previous VLA systems trained on household manipulation data (picking objects, opening drawers, folding laundry), LabVLA is trained to recognize laboratory equipment such as beakers, flasks, magnetic stirrers, and heating plates, and to execute multi-step protocols in the prescribed order.
Paper: LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories — Ren et al., arXiv:2606.13578, 2026
GitHub: github.com/zjunlp/LabVLA
Model: huggingface.co/zjunlp/LabVLA (5B parameters, MIT License)
LabVLA framework overview — source: zjunlp.github.io/LabVLA
Why Labs Need a Domain-Specific VLA
Existing VLA models like π₀ (Pi0 Fast) or X-VLA are primarily trained on Open X-Embodiment (OXE) data — hundreds of thousands of episodes involving household objects, kitchen tasks, and general manipulation. Scientific laboratories present fundamentally different challenges:
Perceptual challenges:
- Transparent liquids (water, acids, solvents) — difficult to segment with standard RGB cameras
- Specialized equipment (magnetic stirrers, pipettes, UV lamps, analytical balances) absent from home/kitchen datasets
- Solution concentration inference from subtle color cues requiring fine-grained visual reasoning
Execution challenges:
- Fixed sequential protocols: solutions must be prepared before heating — no reordering allowed
- High precision requirements: operating pipettes with 1mm-diameter tips, manipulating narrow-mouth flasks
- Safety constraints: no spills, no breakage of glassware
The authors define a four-tier capability pyramid for laboratory robots:
| Level | Name | Description |
|---|---|---|
| 1 | Apprentice | Single-step operations on instruction |
| 2 | Technician | Multi-step protocol execution with physical state changes ← LabVLA target |
| 3 | Specialist | Precision instrument handling (micropipette, analytical balance) |
| 4 | Scientist | Adaptive decision-making based on experimental readouts |
LabVLA targets Level 2 — executing written protocols step by step, handling physical state transitions (liquid to vapor, color changes indicating reaction completion) in the correct order.
LabVLA Architecture: Qwen3-VL Meets DiT Flow-Matching
LabVLA has approximately 5 billion parameters, combining two main components into an end-to-end system:
1. VLM Backbone: Qwen3-VL-4B-Instruct
Qwen3-VL is the multimodal vision-language model from Alibaba's Qwen team. LabVLA uses the 4B instruction-tuned variant as the "brain" responsible for:
- Understanding complex, multi-step language instructions with full context
- Recognizing laboratory objects and equipment from multi-view RGB inputs
- Producing 2560-dimensional hidden states that jointly encode language and visual information
2. Action Expert: 18-Layer DiT Flow-Matching Module
The DiT (Diffusion Transformer) action expert is an independent module with:
- 18 transformer layers, width 1024, 8 attention heads, 128 head dimension
- Cross-attention from DiT to VLM hidden states — the bridge between language/vision and action
- Predicts 10-step continuous action chunks rather than single discrete actions
- Inference requires only 10 Euler steps via deterministic vector field integration
Predicting action chunks rather than individual steps yields smoother robot motion and better handles control loop latency in real deployments.
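The sampling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the LabVLA implementation: `toy_velocity` is a hypothetical stand-in for the DiT action expert, the 7-dimensional action is an assumption (a 7-DoF arm), and the time convention (noise at t = 0, clean actions at t = 1) is one common choice.

```python
import numpy as np

CHUNK_LEN = 10        # action steps per chunk (from the text)
ACTION_DIM = 7        # assumption: 7-DoF arm such as a Franka
NUM_EULER_STEPS = 10  # integration steps (from the text)


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the DiT action expert.

    On a straight path from noise to `target`, the velocity that carries
    x at time t onto the target at t = 1 is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(velocity_fn, rng, num_steps=NUM_EULER_STEPS):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
target_chunk = np.linspace(0.0, 1.0, CHUNK_LEN * ACTION_DIM).reshape(CHUNK_LEN, ACTION_DIM)
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target_chunk), rng)
print(chunk.shape)  # (10, 7)
```

Once the initial noise is drawn, the loop is fully deterministic, which is why the step count directly sets the inference cost: each Euler step is one forward pass of the action expert.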
Knowledge Insulation — The Key Design Insight
The most important architectural innovation in LabVLA is the stop-gradient between VLM and action expert during flow-matching posttraining.
Why it matters: If gradients from the flow-matching loss backpropagate into the VLM backbone, the model suffers from "catastrophic forgetting" — progressively losing its language understanding and visual recognition capabilities. This is a well-known failure mode when fine-tuning large VLMs with objective functions that differ significantly from the pretraining loss.
Knowledge Insulation ensures the VLM backbone retains its learned knowledge while the DiT action expert independently learns to transform hidden states into continuous actions.
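The gradient routing can be illustrated with a toy PyTorch example. The two `nn.Linear` layers are hypothetical stand-ins for the Qwen3-VL backbone and the DiT expert, and the squared-error loss stands in for the flow-matching loss; only the effect of the stop-gradient is shown, not the actual training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins (hypothetical): the real backbone is Qwen3-VL and the real
# action expert is an 18-layer DiT. Only the gradient routing is illustrated.
backbone = nn.Linear(16, 32)      # plays the role of the VLM
action_expert = nn.Linear(32, 7)  # plays the role of the DiT

obs = torch.randn(4, 16)
target_velocity = torch.randn(4, 7)

hidden = backbone(obs)
# Stop-gradient: the expert sees the values of the hidden states, but the
# autograd graph is cut here, so this loss cannot update the backbone.
pred = action_expert(hidden.detach())
loss = ((pred - target_velocity) ** 2).mean()
loss.backward()

print(backbone.weight.grad)                   # None
print(action_expert.weight.grad is not None)  # True
```

Removing `.detach()` would let the same loss populate `backbone.weight.grad`, which is the situation Knowledge Insulation is meant to avoid.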

RoboGenesis: Automated Laboratory Data Synthesis
The core bottleneck for lab robotics is data scarcity. No one has tens of thousands of robot episodes performing scientific laboratory tasks. The authors address this with RoboGenesis — an automated data synthesis pipeline:
Stage 1 — Environment Building: Constructs 10,000 diverse lab scenes from LabAssetLibrary containing 2,947 annotated 3D assets (beakers, flasks, hot plates, magnetic stirrers, pipettes, etc.). Scenes follow physical plausibility rules — liquids obey gravity, glassware has correct weight and friction properties.
Stage 2 — Agentic Workflow Generation: An LLM automatically generates experimental protocols in natural language, then "compiles" them into sequences of atomic skills (Pick, Place, Pour, Stir, Heat, Shake) instantiated across different robot embodiments with appropriate kinematics.
Stage 3 — Domain Randomization and Export: Each protocol executes under varied conditions: lighting (bright/dim/shadow), object placement, camera angles, random obstacles, surface textures. Only successful episodes pass the filter and are exported with 15 synchronized annotation streams.
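The three stages can be pictured with a small sketch. Everything here is illustrative: the class names, the randomization ranges, and the `rollout_fn` hook are assumptions made for this example, not the RoboGenesis API. It shows a protocol compiled by hand into atomic skills, a randomized scene sampler, and a filter that keeps only successful rollouts.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Skill(Enum):
    PICK = "Pick"
    PLACE = "Place"
    POUR = "Pour"
    STIR = "Stir"
    HEAT = "Heat"
    SHAKE = "Shake"


@dataclass(frozen=True)
class SkillCall:
    skill: Skill
    target: str            # object the skill acts on
    destination: str = ""  # optional second object (where to place or pour)


@dataclass(frozen=True)
class SceneConfig:
    lighting: str
    camera_yaw_deg: float
    num_distractors: int


def sample_scene(rng):
    """Draw one randomized scene; the ranges are illustrative assumptions."""
    return SceneConfig(
        lighting=rng.choice(["bright", "dim", "shadow"]),
        camera_yaw_deg=rng.uniform(-30.0, 30.0),
        num_distractors=rng.randint(0, 3),
    )


def generate_episodes(plan, rollout_fn, num_trials, seed=0):
    """Run the plan under randomized scenes and keep only successful episodes."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_trials):
        scene = sample_scene(rng)
        if rollout_fn(plan, scene):
            kept.append((plan, scene))
    return kept


# "Put the beaker on the hot plate, add water, then heat" as atomic skills.
plan = [
    SkillCall(Skill.PICK, "beaker"),
    SkillCall(Skill.PLACE, "beaker", "hot_plate"),
    SkillCall(Skill.PICK, "water_bottle"),
    SkillCall(Skill.POUR, "water_bottle", "beaker"),
    SkillCall(Skill.HEAT, "beaker"),
]

# Hypothetical simulator stub: pretend rollouts fail only in heavy clutter.
episodes = generate_episodes(plan, lambda p, s: s.num_distractors < 3, num_trials=20)
print(len(episodes), "of 20 rollouts kept")
```

In a real pipeline `rollout_fn` would run the simulator and check task success; the filter structure stays the same.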
Output — LabEmbodied-Data:
- 4 task families: single-arm primitives, multistep procedures, bimanual operations, mobile manipulation
- 16 robot embodiments: 13 single-arm (UR5e, UR16e, FR3, Franka, Flexiv Rizon 4...) + 3 dual-arm
- Format: LeRobot v2 (Parquet + metadata JSON, compatible with the LeRobot framework)
- 15 annotation streams, including multi-view RGB, joint states, action labels, language instructions, and depth maps
Importantly, LabEmbodied-Data is useful beyond LabVLA itself: fine-tuning X-VLA on this dataset improved performance by +15.0 percentage points in-distribution and +19.3 points OOD without any architectural changes.
Installation
System requirements:
- Python 3.10
- CUDA 12.6
- PyTorch 2.7.1
- GPU: A100 80GB for training; smaller GPUs sufficient for inference
Step 1 — Clone and create conda environment:
git clone https://github.com/zjunlp/LabVLA
cd LabVLA
conda create -n labvla python=3.10 -y
conda activate labvla
Step 2 — Install PyTorch with CUDA 12.6:
pip install torch==2.7.1 torchvision==0.22.1 \
--index-url https://download.pytorch.org/whl/cu126
Step 3 — Install Flash Attention:
# --no-build-isolation: use the already-installed PyTorch, not a fresh build env
pip install flash_attn==2.8.3 --no-build-isolation
Step 4 — Install remaining dependencies:
pip install -r requirements.txt
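After the install steps, a short script can confirm that the pinned versions are the ones actually present in the environment. This is a convenience sketch rather than part of the LabVLA repository; it relies only on the standard library and reports a package as missing instead of failing.

```python
from importlib import metadata

# Versions pinned in the steps above.
EXPECTED = {"torch": "2.7.1", "torchvision": "0.22.1", "flash_attn": "2.8.3"}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from the PyTorch index carry a local tag such as "+cu126".
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

An empty output means all three packages match the pinned versions.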
Step 5 — Download pretrained checkpoint:
# ~10GB, BF16 safetensors format
huggingface-cli download zjunlp/LabVLA --local-dir ./LabVLA-checkpoint
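The ~10GB download size is consistent with the stated parameter count: BF16 stores each weight in 2 bytes. A quick back-of-the-envelope check, ignoring file metadata and any non-weight files in the repository:

```python
def checkpoint_size_gb(num_params, bytes_per_param=2):
    """Approximate weight size in GB (10**9 bytes); BF16 uses 2 bytes per value."""
    return num_params * bytes_per_param / 1e9


print(checkpoint_size_gb(5e9))  # 10.0
```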
Three-Stage Training Pipeline
LabVLA uses an adapted π0.5 recipe. If you want to fine-tune on your own robot data, the complete pipeline is a data preparation step (Stage 0) followed by the three training stages:
Stage 0: Data Preparation
# Scan dataset: detect anomalous episodes, action outliers
python -m data_process scan \
--root /path/to/your/dataset \
--out /tmp/report.json
# Clean: remove faulty episodes, normalize action ranges
python -m data_process clean \
--src /path/to/your/dataset \
--dst /path/to/clean_dataset \
--report /tmp/report.json
# Verify final statistics
python -m data_process stats \
--dataset /path/to/clean_dataset \
--schema robointer_droid
Data must be in LeRobot v2 format: each episode is a Parquet file with columns observation.images.*, observation.state, action, and timestamp. The data_process module validates and normalizes automatically.
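Before running the scan, it can help to check that an episode file exposes the columns listed above. The helper below is a hypothetical pre-check, not part of `data_process`; it works on a plain list of column names so it can be tried without any dataset on disk.

```python
from fnmatch import fnmatchcase

# Column patterns named in the text; "*" matches any camera name.
REQUIRED_PATTERNS = ["observation.images.*", "observation.state", "action", "timestamp"]


def missing_columns(columns):
    """Return the required patterns that no column name matches."""
    return [
        pattern
        for pattern in REQUIRED_PATTERNS
        if not any(fnmatchcase(name, pattern) for name in columns)
    ]


# In practice the names could come from pyarrow.parquet.read_schema(path).names.
columns = ["observation.images.front", "observation.state", "action"]
print(missing_columns(columns))  # ['timestamp']
```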
Stage 1: VLM Pretraining with FAST Action Tokens
bash launch/vlm_pretrain/train_vlm_pretrain.sh
The Qwen3-VL backbone learns to "read" actions through FAST action tokenization — discretizing continuous actions into tokens and training with cross-entropy loss (analogous to next-token prediction in language modeling). The goal is to make the VLM aware of the action space before switching to continuous flow-matching.
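To make "discretizing continuous actions into tokens" concrete, here is a simplified sketch in the spirit of FAST, which compresses an action chunk with a discrete cosine transform (DCT) and then quantizes the coefficients. The published tokenizer additionally applies byte-pair encoding to the integers; that step is omitted here, and the `scale` value is an arbitrary assumption. This is not the tokenizer code used by LabVLA.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    mat[0] /= np.sqrt(2.0)
    return mat


def encode_chunk(chunk, scale=100.0):
    """Map a (T, D) action chunk to integer tokens via DCT over time + rounding."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk
    return np.round(coeffs * scale).astype(np.int64)


def decode_chunk(tokens, scale=100.0):
    """Invert encode_chunk up to quantization error."""
    return dct_matrix(tokens.shape[0]).T @ (tokens / scale)


# A smooth synthetic chunk: 10 time steps, 7 action dimensions.
t = np.linspace(0.0, 1.0, 10)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8)[None, :] / 7.0)
tokens = encode_chunk(chunk)
error = np.abs(decode_chunk(tokens) - chunk).max()
print(tokens.shape, error < 0.02)  # (10, 7) True
```

Because the integers form a finite vocabulary, the backbone can be trained on them with the same cross-entropy loss it uses for text tokens.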
Training throughput for all three stages (A100 80GB, DeepSpeed ZeRO-2):
| Stage | Batch size per GPU | Global batch size | Time per step |
|---|---|---|---|
| VLM Pretraining | 24 | 1,536 | ~7s |
| KI Posttraining | 16 | 1,024 | ~5s |
| Task Fine-tuning | 4 | 192 | ~3s |
Liger-Kernel operator fusion and selective gradient checkpointing substantially reduce memory compared to a naive implementation.
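The table does not state the GPU count, but the ratio of global to per-GPU batch size gives the product of GPU count and gradient-accumulation steps. The short calculation below only restates the table; how that product splits between the two factors is not given in the source.

```python
# (per-GPU batch size, global batch size) from the table above.
STAGES = {
    "VLM Pretraining": (24, 1536),
    "KI Posttraining": (16, 1024),
    "Task Fine-tuning": (4, 192),
}

# global = per_gpu * num_gpus * grad_accum_steps, so the quotient is the
# product num_gpus * grad_accum_steps implied by each row.
implied = {stage: global_bs // per_gpu for stage, (per_gpu, global_bs) in STAGES.items()}
for stage, multiplier in implied.items():
    print(f"{stage}: num_gpus x grad_accum = {multiplier}")
```

The first two stages imply a factor of 64 and the fine-tuning stage a factor of 48, so the fine-tuning runs apparently used a different GPU count or accumulation setting.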
Stage 2: Flow-Matching Posttraining with Knowledge Insulation
bash launch/ki_posttrain/train_ki_posttrain.sh
The DiT action expert learns to integrate the vector field from noise to target actions using the flow-matching objective. The stop-gradient (Knowledge Insulation) prevents flow loss gradients from reaching the VLM backbone.
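A minimal sketch of the flow-matching objective follows, under stated assumptions: a straight-line path between Gaussian noise (t = 0) and the clean action chunk (t = 1), and a mean-squared error on the predicted velocity. The sign and time conventions differ between papers, and `velocity_fn` is a hypothetical stand-in for the action expert conditioned on VLM hidden states.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, rng):
    """Monte Carlo estimate of the flow-matching loss for one batch.

    actions: (B, T, D) target action chunks. Convention assumed here:
    t = 0 is pure noise, t = 1 is the clean action chunk.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1, 1))
    x_t = (1.0 - t) * noise + t * actions  # point on the straight path
    target = actions - noise               # constant velocity of that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(8, 10, 7))
# An untrained stand-in that always predicts zero velocity.
zero_loss = flow_matching_loss(lambda x, t: np.zeros_like(x), actions, rng)
print(zero_loss > 0.0)  # True
```

Training drives this loss down so that, at inference time, integrating the learned velocity from noise yields an action chunk.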
Stage 3: Task Fine-Tuning
# Fine-tune on LabUtopia benchmark tasks
bash launch/finetune/train_labutopia.sh
# Or fine-tune on your own dataset
bash launch/finetune/train_custom.sh \
--data_path /path/to/clean_dataset \
--output_dir ./checkpoints/my_labvla \
--num_epochs 50
Fine-tuning is the most critical stage for adapting LabVLA to your specific robot. Even a modest real-robot dataset (a few hundred episodes) combined with LabEmbodied-Data yields substantially better performance than simulation data alone.
Inference and Deployment
LabVLA uses the OpenPI msgpack-over-WebSocket protocol, in which observations and actions are exchanged as msgpack-encoded messages over a WebSocket connection:
Start the inference server:
PRETRAINED_PATH=./LabVLA-checkpoint bash deployment/deploy.sh
The server listens on a WebSocket, receives observations (multi-view RGB images + joint state + language instruction), and returns a 10-step action chunk. With only 10 Euler steps required, inference latency is low enough for closed-loop robot control.
Basic Python client example:
import asyncio

import msgpack
import numpy as np
import websockets


async def get_lab_action(obs_images, joint_state, instruction):
    """Send one observation to the inference server and return an action chunk."""
    uri = "ws://localhost:8080"
    async with websockets.connect(uri) as ws:
        # Arrays are converted to nested lists so msgpack can serialize them.
        payload = msgpack.packb({
            "images": [img.tolist() for img in obs_images],
            "state": joint_state.tolist(),
            "instruction": instruction,
        })
        await ws.send(payload)
        response = await ws.recv()
        result = msgpack.unpackb(response)
        # shape: (10, 7), i.e. 10 steps x 7 DOF for Franka
        return np.array(result["actions"])


# Usage (front_cam_frame, wrist_cam_frame, and robot are placeholders for
# your own camera frames and robot interface):
actions = asyncio.run(get_lab_action(
    obs_images=[front_cam_frame, wrist_cam_frame],
    joint_state=robot.get_joint_positions(),
    instruction="Pour the liquid from the beaker into the flask carefully",
))
# Execute action chunk on the robot
for action_step in actions:
    robot.move_to_joint_positions(action_step)
The WebSocket server architecture allows LabVLA to run on a powerful workstation (with GPU) while the robot controller runs on an embedded compute module — a common pattern in lab automation deployments.
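The client example executes all 10 steps of a chunk before asking for the next one. A common alternative in chunked policies is receding-horizon execution: run only the first few actions, then re-observe and re-query. The sketch below shows that loop with caller-supplied hooks; the hook names and the choice of 5 steps per chunk are assumptions for illustration, not LabVLA defaults.

```python
import numpy as np


def run_closed_loop(get_obs, query_policy, send_action, num_chunks, steps_per_chunk=5):
    """Execute only the first `steps_per_chunk` actions of each predicted chunk.

    All three callables are hypothetical hooks supplied by the caller:
    get_obs() -> observation, query_policy(obs) -> (10, D) array,
    send_action(action) -> None.
    """
    executed = 0
    for _ in range(num_chunks):
        chunk = query_policy(get_obs())
        for action in chunk[:steps_per_chunk]:
            send_action(action)
            executed += 1
    return executed


# Dry run with stand-ins instead of a real robot and server.
log = []
total = run_closed_loop(
    get_obs=lambda: {"state": np.zeros(7)},
    query_policy=lambda obs: np.zeros((10, 7)),
    send_action=log.append,
    num_chunks=3,
)
print(total)  # 15
```

Shorter execution horizons react faster to disturbances such as a shifted beaker, at the cost of more inference calls per second.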

Results: LabUtopia Benchmark
The authors built LabUtopia — a 6-task benchmark covering representative laboratory operations, evaluated in both in-distribution (ID) and out-of-distribution (OOD) conditions:
| Task | LabVLA ID | LabVLA OOD | π₀ 3B ID | GR00T N1.5 ID |
|---|---|---|---|---|
| Pick Up | 49.2% | 48.3% | 21.7% | 40.8% |
| Press Button | 100% | 98.3% | 92.5% | 99.2% |
| Open Door | 65.0% | 65.8% | 51.6% | 6.7% |
| Pour Liquid | 43.3% | 34.2% | 37.5% | 0% |
| Heat Beaker | 83.3% | 87.5% | 90.0% | 99.2% |
| Transport Beaker | 85.8% | 85.8% | 86.7% | 69.2% |
| Average | 71.1% | 70.0% | 63.3% | 52.5% |
Key observations:
- LabVLA outperforms π₀ (3B) by 7.8 percentage points in-distribution and 6.8 points OOD
- The ID→OOD gap is only 1.1 points (71.1%→70.0%), which suggests that the domain randomization transfers well to unseen conditions
- GR00T N1.5 scores 0% on "Pour Liquid", while π₀ reaches 37.5% and LabVLA 43.3%: liquid handling is the weakest skill for every model in the table
- LabVLA has the highest floor: its weakest ID task is "Pour Liquid" at 43.3% (34.2% OOD), compared with 21.7% for π₀ ("Pick Up") and 0% for GR00T N1.5 ("Pour Liquid")
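The averages and the per-task spread can be recomputed directly from the table, which is a useful sanity check when comparing models on a small benchmark:

```python
# Per-task success rates (%) copied from the LabUtopia table above, in the
# order: Pick Up, Press Button, Open Door, Pour Liquid, Heat Beaker, Transport Beaker.
RESULTS = {
    "LabVLA ID": [49.2, 100.0, 65.0, 43.3, 83.3, 85.8],
    "LabVLA OOD": [48.3, 98.3, 65.8, 34.2, 87.5, 85.8],
    "pi0 3B ID": [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
    "GR00T N1.5 ID": [40.8, 99.2, 6.7, 0.0, 99.2, 69.2],
}

summary = {
    name: {"mean": round(sum(v) / len(v), 1), "min": min(v), "max": max(v)}
    for name, v in RESULTS.items()
}
for name, stats in summary.items():
    print(name, stats)
```

The recomputed means match the "Average" row, and the minimum column makes the weakest task of each model explicit.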
Real-Robot Results on Franka
Beyond simulation, the team validated LabVLA on a Franka Emika arm with 4 composite tasks, 50 rollouts each, across four conditions:
| Task | In-domain | ID Clutter | OOD Clean | OOD Clutter |
|---|---|---|---|---|
| Shake Liquid | 92% | 86% | 84% | 80% |
| Pour Liquid | 86% | 78% | 76% | 72% |
| Magnetic Stir | 88% | 80% | 80% | 74% |
| Stopper Plug/Unplug | 80% | 76% | 80% | 70% |
| Average | 86.5% | 80.0% | 80.0% | 74.0% |
86.5% on in-domain clean conditions is a strong result for laboratory tasks. "Shake Liquid" at 92% requires controlling both trajectory and shake amplitude — the robot must keep liquid from spilling throughout the motion. "Stopper Plug/Unplug" at 80% demands sub-2mm alignment precision.
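With 50 rollouts per cell, each percentage carries noticeable sampling uncertainty. The sketch below computes a Wilson score interval for one entry; it is a reader-side estimate, not an analysis from the paper, and it assumes the 50 rollouts apply to each task in each condition (every entry in the table is a multiple of 2%, which is consistent with that reading).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# 92% of 50 rollouts corresponds to 46 successes (Shake Liquid, in-domain).
low, high = wilson_interval(46, 50)
print(f"{low:.2f} to {high:.2f}")
```

Differences of a few points between conditions, such as 80% versus 76%, fall well inside intervals of this width.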

Baseline Comparison
| Model | Params | LabUtopia Avg ID | Domain-specific data | License |
|---|---|---|---|---|
| LabVLA | 5B | 71.1% | ✅ LabEmbodied-Data | MIT |
| π₀ (HF) | 3B | 63.3% | ❌ | Apache-2.0 |
| GR00T N1.5 | 3B | 52.5% | ❌ | NVIDIA |
| X-VLA | 1B | 57.5% | ❌ | Apache-2.0 |
| SmolVLA | 450M | 38.2% | ❌ | Apache-2.0 |
The advantage of LabVLA is not just the higher average score; it is the higher floor. Competing baselines like GR00T N1.5 achieve 99.2% on "Heat Beaker" but 0% on "Pour Liquid." LabVLA has no in-distribution task below 43%, which suggests that domain-specific training data raises the worst-case performance across task types, even though π₀ and GR00T N1.5 still score higher on individual tasks such as "Heat Beaker."
Limitations and Future Directions
The authors acknowledge several current limitations:
- Level 3 (Specialist) remains out of reach: Micropipette manipulation (1 µL precision), analytical balance operations with 0.001 g tolerance
- Special liquids: Colored solutions, foams, high-viscosity substances not fully tested
- Hardware requirements: Training requires A100 80GB GPUs — cloud access needed for smaller labs
- Long-horizon protocols: Sequences with more than 10 steps not rigorously evaluated
Roadmap: Level 3 (precision instruments) and Level 4 (adaptive decision-making) are in the development plan for subsequent releases.
Conclusion
LabVLA marks an important shift in direction: domain-specialized VLA models rather than purely generalist systems. By combining Qwen3-VL-4B with DiT flow-matching, training on RoboGenesis-synthesized lab data, and applying Knowledge Insulation during posttraining, the model achieves 71.1% in-distribution on the LabUtopia simulation benchmark, ahead of every evaluated baseline on average, and 86.5% in-domain on a real Franka robot.
Most importantly, the entire pipeline (model weights, training code, data generation scripts, benchmark) is fully open-sourced under MIT license. If you are working on robotics for pharmaceutical, chemistry, biology, or any domain requiring laboratory automation, LabVLA provides a strong foundation to build on.
For a deeper understanding of the LeRobot ecosystem that LabVLA builds on, see the LeRobot framework guide.


