LabVLA: The First Open Source VLA Model for Scientific Laboratory Robots

Imagine handing a robot a written lab protocol — "Take the 100ml beaker, add 50ml of distilled water, place it on the magnetic stirrer, set temperature to 60°C for 5 minutes" — and watching the robot execute the entire procedure autonomously. That is precisely what LabVLA aims to deliver.

Released in June 2026 by researchers from Zhejiang University, Shanghai AI Laboratory, and Harbin Institute of Technology, LabVLA is the first Vision-Language-Action (VLA) model specifically designed for scientific laboratory environments. Unlike previous VLA systems trained on household manipulation data (picking objects, opening drawers, folding laundry), LabVLA understands laboratory equipment — beakers, flasks, magnetic stirrers, heating plates — and can reliably execute multi-step protocols.

Paper: LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories — Ren et al., arXiv:2606.13578, 2026
GitHub: github.com/zjunlp/LabVLA
Model: huggingface.co/zjunlp/LabVLA (5B parameters, MIT License)

LabVLA framework overview — source: zjunlp.github.io/LabVLA

Why Labs Need a Domain-Specific VLA

Existing VLA models like π₀ (Pi0 Fast) or X-VLA are primarily trained on Open X-Embodiment (OXE) data — hundreds of thousands of episodes involving household objects, kitchen tasks, and general manipulation. Scientific laboratories present fundamentally different challenges:

Perceptual challenges:

Transparent liquids (water, acids, solvents) — difficult to segment with standard RGB cameras
Specialized equipment (magnetic stirrers, pipettes, UV lamps, analytical balances) absent from home/kitchen datasets
Solution concentration inference from subtle color cues requiring fine-grained visual reasoning

Execution challenges:

Fixed sequential protocols: solutions must be prepared before heating — no reordering allowed
High precision requirements: operating pipettes with 1mm-diameter tips, manipulating narrow-mouth flasks
Safety constraints: no spills, no breakage of glassware

The authors define a four-tier capability pyramid for laboratory robots:

Level	Name	Description
1	Apprentice	Single-step operations on instruction
2	Technician	Multi-step protocol execution with physical state changes ← LabVLA target
3	Specialist	Precision instrument handling (micropipette, analytical balance)
4	Scientist	Adaptive decision-making based on experimental readouts

LabVLA targets Level 2 — executing written protocols step by step, handling physical state transitions (liquid to vapor, color changes indicating reaction completion) in the correct order.

LabVLA Architecture: Qwen3-VL Meets DiT Flow-Matching

LabVLA has approximately 5 billion parameters, combining two main components into an end-to-end system:

1. VLM Backbone: Qwen3-VL-4B-Instruct

Qwen3-VL is Alibaba QwenLM's multimodal vision-language model. LabVLA uses the 4B instruction-tuned variant as the "brain" responsible for:

Understanding complex, multi-step language instructions with full context
Recognizing laboratory objects and equipment from multi-view RGB inputs
Producing 2560-dimensional hidden states that jointly encode language and visual information

2. Action Expert: 18-Layer DiT Flow-Matching Module

The DiT (Diffusion Transformer) action expert is an independent module with:

18 transformer layers, width 1024, 8 attention heads, 128 head dimension
Cross-attention from DiT to VLM hidden states — the bridge between language/vision and action
Predicts 10-step continuous action chunks rather than single discrete actions
Inference requires only 10 Euler steps via deterministic vector field integration

Predicting action chunks rather than individual steps yields smoother robot motion and better handles control loop latency in real deployments.
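To make the sampling step concrete, here is a minimal sketch of 10-step Euler integration that turns noise into a 10 × 7 action chunk. The function names, the 7-dimensional action, and the toy vector field are illustrative assumptions; in LabVLA the field would be the DiT action expert conditioned on the VLM hidden states.

```python
import random

CHUNK_LEN, ACTION_DIM, NUM_EULER_STEPS = 10, 7, 10

def toy_vector_field(x, tau, target):
    """Stand-in for the learned field (hypothetical, for illustration only).

    For the straight-line path x_tau = (1 - tau) * noise + tau * target,
    the ideal velocity at x_tau is (target - x_tau) / (1 - tau).
    """
    scale = 1.0 / (1.0 - tau)
    return [[(t - xi) * scale for xi, t in zip(row_x, row_t)]
            for row_x, row_t in zip(x, target)]

def sample_action_chunk(vector_field, target, num_steps=NUM_EULER_STEPS, seed=0):
    """Integrate dx/dtau = v(x, tau) from tau=0 (noise) to tau=1 with Euler steps."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # tau stays below 1, so the toy field never divides by zero
        v = vector_field(x, tau, target)
        x = [[xi + dt * vi for xi, vi in zip(row_x, row_v)]
             for row_x, row_v in zip(x, v)]
    return x

target = [[0.1 * j for j in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
chunk = sample_action_chunk(toy_vector_field, target)
print(len(chunk), len(chunk[0]))
```

Because the integration is deterministic once the initial noise is fixed, the same seed and observation always produce the same chunk, and the cost of inference is a fixed number of forward passes through the action expert.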

Knowledge Insulation — The Key Design Insight

The most important architectural innovation in LabVLA is the stop-gradient between VLM and action expert during flow-matching posttraining.

Why it matters: If gradients from the flow-matching loss backpropagate into the VLM backbone, the model suffers from "catastrophic forgetting" — progressively losing its language understanding and visual recognition capabilities. This is a well-known failure mode when fine-tuning large VLMs with objective functions that differ significantly from the pretraining loss.

Knowledge Insulation ensures the VLM backbone retains its learned knowledge while the DiT action expert independently learns to transform hidden states into continuous actions.
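One way to write the idea down uses a common rectified-flow convention. This is an assumption: the paper's exact notation and time direction may differ. Let $h_\phi$ be the VLM hidden states (VLM parameters $\phi$), $a$ the target action chunk, $\epsilon \sim \mathcal{N}(0, I)$ the noise, $v_\theta$ the action expert, and $\mathrm{sg}[\cdot]$ the stop-gradient operator.

```latex
x_\tau = (1 - \tau)\,\epsilon + \tau\,a, \qquad \tau \sim \mathcal{U}(0, 1)

\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{a,\,\epsilon,\,\tau}
    \left[ \left\| v_\theta\!\left(x_\tau,\ \tau,\ \mathrm{sg}[h_\phi]\right) - (a - \epsilon) \right\|^2 \right]

\frac{\partial \mathcal{L}_{\mathrm{FM}}}{\partial \phi} = 0
```

Since $\mathrm{sg}[h_\phi]$ is treated as a constant during backpropagation, the flow-matching loss updates only the action-expert parameters $\theta$ and contributes no gradient to the VLM parameters $\phi$.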

Figure: LabVLA three-stage training recipe (FAST action-token pretraining, flow-matching posttraining with knowledge insulation, and downstream task fine-tuning). Source: paper arXiv:2606.13578

RoboGenesis: Automated Laboratory Data Synthesis

The core bottleneck for lab robotics is data scarcity. No one has tens of thousands of robot episodes performing scientific laboratory tasks. The authors address this with RoboGenesis — an automated data synthesis pipeline:

Stage 1 — Environment Building: Constructs 10,000 diverse lab scenes from LabAssetLibrary containing 2,947 annotated 3D assets (beakers, flasks, hot plates, magnetic stirrers, pipettes, etc.). Scenes follow physical plausibility rules — liquids obey gravity, glassware has correct weight and friction properties.

Stage 2 — Agentic Workflow Generation: An LLM automatically generates experimental protocols in natural language, then "compiles" them into sequences of atomic skills (Pick, Place, Pour, Stir, Heat, Shake) instantiated across different robot embodiments with appropriate kinematics.

Stage 3 — Domain Randomization and Export: Each protocol executes under varied conditions: lighting (bright/dim/shadow), object placement, camera angles, random obstacles, surface textures. Only successful episodes pass the filter and are exported with 15 synchronized annotation streams.
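The three stages can be sketched as a small pipeline: compile a protocol into atomic skills, randomize the scene, and keep only successful rollouts. Only the six skill names and the randomization axes come from the description above; the class names, the "Skill: target" line format, the numeric ranges, and the stubbed simulator are hypothetical.

```python
import random
from dataclasses import dataclass
from enum import Enum

class Skill(Enum):
    PICK = "Pick"
    PLACE = "Place"
    POUR = "Pour"
    STIR = "Stir"
    HEAT = "Heat"
    SHAKE = "Shake"

@dataclass(frozen=True)
class Step:
    skill: Skill
    target: str

@dataclass(frozen=True)
class SceneConfig:
    lighting: str
    camera_yaw_deg: float
    object_offset_cm: float

def compile_protocol(lines):
    """Map 'Skill: target' lines (a hypothetical intermediate format) to Steps."""
    steps = []
    for line in lines:
        name, target = (part.strip() for part in line.split(":", 1))
        steps.append(Step(Skill(name), target))  # unknown skill names raise ValueError
    return steps

def randomize_scene(rng):
    """Draw one randomized scene; the ranges are illustrative assumptions."""
    return SceneConfig(
        lighting=rng.choice(["bright", "dim", "shadow"]),
        camera_yaw_deg=rng.uniform(-15.0, 15.0),
        object_offset_cm=rng.uniform(-3.0, 3.0),
    )

def rollout_succeeds(steps, scene, rng):
    """Stub for the simulator: a real pipeline would execute the steps and check success."""
    return rng.random() < (0.6 if scene.lighting == "dim" else 0.9)

def synthesize(protocol_lines, num_attempts, seed=0):
    """Run randomized rollouts and keep only the successful ones."""
    rng = random.Random(seed)
    steps = compile_protocol(protocol_lines)
    kept = []
    for _ in range(num_attempts):
        scene = randomize_scene(rng)
        if rollout_succeeds(steps, scene, rng):
            kept.append((steps, scene))
    return kept

protocol = ["Pick: beaker", "Pour: flask", "Place: hot plate", "Heat: hot plate"]
episodes = synthesize(protocol, num_attempts=20)
print(f"kept {len(episodes)} of 20 rollouts")
```

Filtering on success means the exported dataset contains only demonstrations that reached the goal, at the cost of discarding every failed attempt.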

Output — LabEmbodied-Data:

4 task families: single-arm primitives, multistep procedures, bimanual operations, mobile manipulation
16 robot embodiments: 13 single-arm (UR5e, UR16e, FR3, Franka, Flexiv Rizon 4, ...) + 3 dual-arm
Format: LeRobot v2 (Parquet + metadata JSON, compatible with the LeRobot framework)
15 annotation streams, including multi-view RGB, joint states, action labels, language instructions, and depth maps

Importantly, LabEmbodied-Data is useful beyond LabVLA itself: fine-tuning X-VLA on this dataset improved its performance by 15.0 percentage points in-distribution and 19.3 points OOD without any architectural changes.

Installation

System requirements:

Python 3.10
CUDA 12.6
PyTorch 2.7.1
GPU: A100 80GB for training; smaller GPUs sufficient for inference

Step 1 — Clone and create conda environment:

git clone https://github.com/zjunlp/LabVLA
cd LabVLA

conda create -n labvla python=3.10 -y
conda activate labvla

Step 2 — Install PyTorch with CUDA 12.6:

pip install torch==2.7.1 torchvision==0.22.1 \
    --index-url https://download.pytorch.org/whl/cu126

Step 3 — Install Flash Attention:

# --no-build-isolation: use the already-installed PyTorch, not a fresh build env
pip install flash_attn==2.8.3 --no-build-isolation

Step 4 — Install remaining dependencies:

pip install -r requirements.txt

Step 5 — Download pretrained checkpoint:

# ~10GB, BF16 safetensors format
huggingface-cli download zjunlp/LabVLA --local-dir ./LabVLA-checkpoint

Three-Stage Training Pipeline

LabVLA uses an adapted π0.5 recipe. If you want to fine-tune on your own robot data, here is the complete pipeline:

Stage 0: Data Preparation

# Scan dataset: detect anomalous episodes, action outliers
python -m data_process scan \
    --root /path/to/your/dataset \
    --out /tmp/report.json

# Clean: remove faulty episodes, normalize action ranges
python -m data_process clean \
    --src /path/to/your/dataset \
    --dst /path/to/clean_dataset \
    --report /tmp/report.json

# Verify final statistics
python -m data_process stats \
    --dataset /path/to/clean_dataset \
    --schema robointer_droid

Data must be in LeRobot v2 format: each episode is a Parquet file with columns observation.images.*, observation.state, action, and timestamp. The data_process module validates and normalizes automatically.
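As a rough illustration of what such validation involves, the sketch below checks one episode for the columns named above. It is not the `data_process` module's actual logic: `validate_episode` is a hypothetical helper, and it works on a plain dict of column name to per-frame values so that it stays dependency-free (real episodes would first be read from Parquet).

```python
REQUIRED_COLUMNS = ("observation.state", "action", "timestamp")
IMAGE_PREFIX = "observation.images."

def validate_episode(columns):
    """Return a list of problems found in one episode (empty list means OK).

    `columns` maps column name -> list of per-frame values.
    """
    problems = []
    for name in REQUIRED_COLUMNS:
        if name not in columns:
            problems.append(f"missing column: {name}")
    # At least one camera stream is expected under observation.images.*
    if not any(name.startswith(IMAGE_PREFIX) for name in columns):
        problems.append(f"no {IMAGE_PREFIX}* column")
    # Every column should hold one value per frame
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")
    # Frames should be ordered in time
    timestamps = columns.get("timestamp", [])
    if any(b <= a for a, b in zip(timestamps, timestamps[1:])):
        problems.append("timestamps are not strictly increasing")
    return problems

episode = {
    "observation.images.front": [b"", b""],
    "observation.state": [[0.0], [0.1]],
    "action": [[0.0], [0.1]],
    "timestamp": [0.0, 0.1],
}
print(validate_episode(episode))
```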

Stage 1: VLM Pretraining with FAST Action Tokens

bash launch/vlm_pretrain/train_vlm_pretrain.sh

The Qwen3-VL backbone learns to "read" actions through FAST action tokenization — discretizing continuous actions into tokens and training with cross-entropy loss (analogous to next-token prediction in language modeling). The goal is to make the VLM aware of the action space before switching to continuous flow-matching.
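The idea of turning continuous actions into discrete tokens can be illustrated with plain uniform binning. This is a deliberately simplified stand-in and not the FAST tokenizer itself, which is more elaborate (it compresses action chunks in the frequency domain before tokenizing). The bin count and the [-1, 1] action range are assumptions.

```python
NUM_BINS = 256          # assumed vocabulary size for action tokens
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def tokenize(action):
    """Map each continuous action value to an integer bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        index = int((clipped - LOW) / (HIGH - LOW) * NUM_BINS)
        tokens.append(min(index, NUM_BINS - 1))  # HIGH itself falls in the last bin
    return tokens

def detokenize(tokens):
    """Map bin indices back to the center of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]

action = [-1.0, -0.25, 0.0, 0.5, 1.0]
tokens = tokenize(action)
print(tokens)  # [0, 96, 128, 192, 255]
print(detokenize(tokens))
```

Once actions are integers from a fixed vocabulary, they can be appended to the text sequence and trained with the same cross-entropy loss as ordinary language tokens. The round trip loses at most half a bin width per value.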

Training performance for all three stages (A100 80GB, DeepSpeed ZeRO-2):

Stage	BS/GPU	Global BS	Time/step
VLM Pretraining	24	1,536	~7s
KI Posttraining	16	1,024	~5s
Task Fine-tuning	4	192	~3s

Liger-Kernel operator fusion and selective gradient checkpointing substantially reduce memory compared to a naive implementation.
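The batch sizes in the table imply a data-parallel factor for each stage: dividing the global batch size by the per-GPU batch size gives the number of GPUs multiplied by the gradient-accumulation steps. The source does not state the GPU count or accumulation settings, so the sketch below reports only that product.

```python
# (per-GPU batch size, global batch size) taken from the table above
STAGES = {
    "VLM Pretraining": (24, 1536),
    "KI Posttraining": (16, 1024),
    "Task Fine-tuning": (4, 192),
}

def parallel_factor(per_gpu_batch, global_batch):
    """Return num_gpus * grad_accum_steps implied by the two batch sizes."""
    factor, remainder = divmod(global_batch, per_gpu_batch)
    if remainder:
        raise ValueError("global batch is not a multiple of the per-GPU batch")
    return factor

for stage, (per_gpu, global_bs) in STAGES.items():
    print(f"{stage}: {parallel_factor(per_gpu, global_bs)}")
# VLM Pretraining: 64
# KI Posttraining: 64
# Task Fine-tuning: 48
```

If you reproduce a stage on fewer GPUs, raising the gradient-accumulation steps by the same ratio keeps the global batch size unchanged.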

Stage 2: Flow-Matching Posttraining with Knowledge Insulation

bash launch/ki_posttrain/train_ki_posttrain.sh

The DiT action expert learns to integrate the vector field from noise to target actions using the flow-matching objective. The stop-gradient (Knowledge Insulation) prevents flow loss gradients from reaching the VLM backbone.

Stage 3: Task Fine-Tuning

# Fine-tune on LabUtopia benchmark tasks
bash launch/finetune/train_labutopia.sh

# Or fine-tune on your own dataset
bash launch/finetune/train_custom.sh \
    --data_path /path/to/clean_dataset \
    --output_dir ./checkpoints/my_labvla \
    --num_epochs 50

Fine-tuning is the most critical stage for adapting LabVLA to your specific robot. Even a modest real-robot dataset (a few hundred episodes) combined with LabEmbodied-Data yields substantially better performance than simulation data alone.

Inference and Deployment

LabVLA uses the OpenPI msgpack WebSocket protocol — a standard communication layer in robot learning frameworks:

Start the inference server:

PRETRAINED_PATH=./LabVLA-checkpoint bash deployment/deploy.sh

The server listens on a WebSocket, receives observations (multi-view RGB images + joint state + language instruction), and returns a 10-step action chunk. With only 10 Euler steps required, inference latency is low enough for closed-loop robot control.

Basic Python client example:

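import asyncio
import msgpack
import websockets
import numpy as np

async def get_lab_action(obs_images, joint_state, instruction):
    """Send one observation to the inference server and return an action chunk."""
    uri = "ws://localhost:8080"  # adjust host and port to match your server
    # A new connection per call keeps the example short;
    # reuse a single connection in a real control loop.
    async with websockets.connect(uri) as ws:
        payload = msgpack.packb({
            "images": [img.tolist() for img in obs_images],  # NumPy arrays -> nested lists
            "state": joint_state.tolist(),                   # expects a NumPy array
            "instruction": instruction
        })
        await ws.send(payload)
        response = await ws.recv()
        result = msgpack.unpackb(response)
        # shape: (10, 7), i.e. 10 steps, 7 DOF for Franka
        return np.array(result["actions"])

# Usage: front_cam_frame, wrist_cam_frame, and robot are placeholders for
# your own camera frames (NumPy arrays) and robot driver object.
actions = asyncio.run(get_lab_action(
    obs_images=[front_cam_frame, wrist_cam_frame],
    joint_state=robot.get_joint_positions(),
    instruction="Pour the liquid from the beaker into the flask carefully"
))

# Execute action chunk on the robot
for action_step in actions:
    robot.move_to_joint_positions(action_step)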

The WebSocket server architecture allows LabVLA to run on a powerful workstation (with GPU) while the robot controller runs on an embedded compute module — a common pattern in lab automation deployments.
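A common way to use chunked predictions in closed loop is to execute only the first few steps of each chunk and then re-query the policy with a fresh observation. The sketch below shows that loop against stand-ins: `StubRobot`, `stub_policy`, and the `steps_per_query` parameter are hypothetical and not part of LabVLA's API.

```python
class StubRobot:
    """Minimal stand-in for a robot driver: a 7-joint arm that moves instantly."""

    def __init__(self):
        self.joints = [0.0] * 7
        self.commands_executed = 0

    def get_joint_positions(self):
        return list(self.joints)

    def move_to_joint_positions(self, target):
        self.joints = list(target)
        self.commands_executed += 1

def stub_policy(joint_state, instruction, chunk_len=10, step=0.01):
    """Stand-in for the inference server: a chunk that nudges every joint upward."""
    return [[q + step * (k + 1) for q in joint_state] for k in range(chunk_len)]

def run_closed_loop(robot, policy, instruction, total_steps, steps_per_query=5):
    """Execute `total_steps` actions, re-querying the policy every `steps_per_query` steps."""
    executed = 0
    while executed < total_steps:
        chunk = policy(robot.get_joint_positions(), instruction)
        if not chunk:
            raise RuntimeError("policy returned an empty action chunk")
        budget = min(steps_per_query, total_steps - executed)
        for action in chunk[:budget]:
            robot.move_to_joint_positions(action)
            executed += 1
    return executed

robot = StubRobot()
run_closed_loop(robot, stub_policy, "Pour the liquid into the flask", total_steps=12)
print(robot.commands_executed)  # 12
```

Executing fewer steps per query makes the robot react sooner to changes in the scene, while executing more steps reduces the number of round trips to the inference server.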

Figure: 16 robot embodiments (UR5e, UR16e, FR3, Franka, and dual-arm variants) supported by a single LabVLA checkpoint. Source: zjunlp.github.io/LabVLA

Results: LabUtopia Benchmark

The authors built LabUtopia — a 6-task benchmark covering representative laboratory operations, evaluated in both in-distribution (ID) and out-of-distribution (OOD) conditions:

Task	LabVLA ID	LabVLA OOD	π₀ 3B ID	GR00T N1.5 ID
Pick Up	49.2%	48.3%	21.7%	40.8%
Press Button	100%	98.3%	92.5%	99.2%
Open Door	65.0%	65.8%	51.6%	6.7%
Pour Liquid	43.3%	34.2%	37.5%	0%
Heat Beaker	83.3%	87.5%	90.0%	99.2%
Transport Beaker	85.8%	85.8%	86.7%	69.2%
Average	71.1%	70.0%	63.3%	52.5%

Key observations:

LabVLA outperforms π₀ (3B) by 7.8 percentage points in-distribution and 6.8 points OOD
The ID→OOD gap is only 1.1 points (71.1%→70.0%), demonstrating robust domain randomization
GR00T N1.5 scores 0% on "Pour Liquid" — liquid handling is a complete blind spot without domain-specific training data
LabVLA is the most consistent model in-distribution: no ID task falls below 43% (its lowest score overall is 34.2% on OOD "Pour Liquid"), whereas the baselines vary widely across task types
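The averages, the ID to OOD gap, and the per-model spread can be recomputed directly from the table. The snippet below uses only the numbers reported above; the variable names are illustrative.

```python
# Per-task success rates (%) in table order: Pick Up, Press Button, Open Door,
# Pour Liquid, Heat Beaker, Transport Beaker
RESULTS = {
    "LabVLA ID": [49.2, 100.0, 65.0, 43.3, 83.3, 85.8],
    "LabVLA OOD": [48.3, 98.3, 65.8, 34.2, 87.5, 85.8],
    "pi0 3B ID": [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
    "GR00T N1.5 ID": [40.8, 99.2, 6.7, 0.0, 99.2, 69.2],
}

def mean(values):
    return sum(values) / len(values)

averages = {name: round(mean(scores), 1) for name, scores in RESULTS.items()}
spreads = {name: round(max(scores) - min(scores), 1) for name, scores in RESULTS.items()}
gap = round(averages["LabVLA ID"] - averages["LabVLA OOD"], 1)

print(averages)
# {'LabVLA ID': 71.1, 'LabVLA OOD': 70.0, 'pi0 3B ID': 63.3, 'GR00T N1.5 ID': 52.5}
print(spreads)
# {'LabVLA ID': 56.7, 'LabVLA OOD': 64.1, 'pi0 3B ID': 70.8, 'GR00T N1.5 ID': 99.2}
print(gap)  # 1.1
```

The spread (best task minus worst task) puts a number on the consistency claim: 56.7 points for LabVLA in-distribution, against 70.8 for π₀ and 99.2 for GR00T N1.5.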

Real-Robot Results on Franka

Beyond simulation, the team validated LabVLA on a Franka Emika arm with 4 composite tasks, 50 rollouts each, across four conditions:

Task	In-domain	ID Clutter	OOD Clean	OOD Clutter
Shake Liquid	92%	86%	84%	80%
Pour Liquid	86%	78%	76%	72%
Magnetic Stir	88%	80%	80%	74%
Stopper Plug/Unplug	80%	76%	80%	70%
Average	86.5%	80.0%	80.0%	74.0%

86.5% on in-domain clean conditions is a strong result for laboratory tasks. "Shake Liquid" at 92% requires controlling both trajectory and shake amplitude — the robot must keep liquid from spilling throughout the motion. "Stopper Plug/Unplug" at 80% demands sub-2mm alignment precision.

Figure: Franka real-robot experimental setup, manipulating beakers, flasks, a magnetic stirrer, and a heating plate under in-domain and OOD conditions. Source: zjunlp.github.io/LabVLA

Baseline Comparison

Model	Params	LabUtopia Avg ID	Domain-specific data	License
LabVLA	5B	71.1%	✅ LabEmbodied-Data	MIT
π₀ (HF)	3B	63.3%	❌	Apache-2.0
GR00T N1.5	3B	52.5%	❌	NVIDIA
X-VLA	1B	57.5%	❌	Apache-2.0
SmolVLA	450M	38.2%	❌	Apache-2.0

The advantage of LabVLA is not just the higher average score; it is consistency. Baselines such as GR00T N1.5 reach 99.2% on "Heat Beaker" but 0% on "Pour Liquid." LabVLA has no in-distribution task below 43%, which suggests that domain-specific training data narrows the performance gap between task types.

Limitations and Future Directions

The authors acknowledge several current limitations:

Level 3 (Specialist) remains out of reach: micropipette manipulation (1 µL precision) and analytical balance operations with 0.001 g tolerance
Special liquids: Colored solutions, foams, high-viscosity substances not fully tested
Hardware requirements: Training requires A100 80GB GPUs — cloud access needed for smaller labs
Long-horizon protocols: Sequences with more than 10 steps not rigorously evaluated

Roadmap: Level 3 (precision instruments) and Level 4 (adaptive decision-making) are in the development plan for subsequent releases.

Conclusion

LabVLA marks an important shift in direction: domain-specialized VLA models rather than purely generalist systems. By combining Qwen3-VL-4B with DiT flow-matching, training on RoboGenesis-synthesized lab data, and applying Knowledge Insulation during posttraining, the model achieves 71.1% in-distribution on the LabUtopia simulation benchmark and 86.5% in-domain on a real Franka robot. On LabUtopia it is ahead of every listed baseline, none of which was trained on domain-specific data.

Most importantly, the entire pipeline (model weights, training code, data generation scripts, benchmark) is fully open-sourced under MIT license. If you are working on robotics for pharmaceutical, chemistry, biology, or any domain requiring laboratory automation, LabVLA provides a strong foundation to build on.

For a deeper understanding of the LeRobot ecosystem that LabVLA builds on, see the LeRobot framework guide.



Nguyễn Anh Tuấn
