VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

Qwen-VLA: Alibaba's Generalist VLA Model

Deep dive into Qwen-VLA — Alibaba's unified VLA using Qwen3.5-4B + DiT decoder, one set of weights for manipulation, navigation, and heterogeneous robots.

Nguyễn Anh Tuấn | June 29, 2026 | 11 min read
Imagine you're a robotics engineer responsible for three completely different robot systems: a dual-arm ALOHA robot for assembly, a WidowX mobile robot for warehouse picking, and a wheeled humanoid navigating a hallway. With today's VLA models, you'd need three separate models, three training pipelines, and three codebases. Every new robot means another mountain of work.

Qwen-VLA from Alibaba aims to solve exactly this: one set of weights, multiple robots, multiple tasks.

The paper Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (arXiv 2605.30280, submitted May 2026) proposes a unified framework that enables a single model to perform manipulation, navigation, and trajectory prediction — requiring only a text prompt change to switch between robot platforms.

The Problem Qwen-VLA Solves

Robotics in 2026 faces a paradox: we have increasingly powerful VLA models (OpenVLA, π₀ (pi-zero), RDT-1B), but each is typically trained and deployed for one task family or a narrow set of robots. Supporting a new robot usually means collecting new demonstrations and fine-tuning again. Switching from manipulation to navigation usually means switching to a different model.

The core issues are:

  • Hardware fragmentation: Each robot has a different action space — 7-DOF arm, differential drive, biped locomotion — so model output needs to change per hardware
  • Task fragmentation: Manipulation (pick-and-place), navigation (waypoint following), and trajectory prediction have fundamentally different output structures
  • Data fragmentation: Manipulation datasets can't be directly used to train navigation models and vice versa

Qwen-VLA addresses this with a unified action-and-trajectory prediction framework — representing all outputs (actions, waypoints, trajectories) in a shared space — and embodiment-aware prompt conditioning that tells the model which robot it's controlling.

Technical Architecture

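Qwen-VLA consists of two main components, a vision-language backbone and an action decoder, plus a prompt-conditioning mechanism that ties them to a specific robot: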

1. Vision-Language Backbone: Qwen3.5-4B

The "world understanding" part of the model is built on Qwen3.5-4B — Alibaba's vision-language foundation model. This backbone processes:

  • Camera images (RGB, depth, or multi-view depending on robot configuration)
  • Text instructions from the user ("pick up the red cup and place it on the tray")
  • Embodiment prompt — a text description of the current robot, its action space, and control convention

The Qwen3.5-4B backbone, pretrained on massive language and vision data, provides the visual grounding (localizing objects in space) and spatial reasoning (understanding spatial relationships) needed for both manipulation and navigation tasks.

2. Action Decoder: 1.15B DiT Flow-Matching

This is the most distinctive part of Qwen-VLA. Instead of autoregressive decoding (predicting tokens one by one like a standard LLM), the authors use a Diffusion Transformer (DiT) trained with flow-matching.

Flow-matching is a generative method that learns to "flow" from a noise distribution to the action distribution in fewer steps than traditional DDPM. The DiT action decoder has 1.15 billion parameters — significantly larger than typical action heads — enabling the model to learn complex, multi-modal action distributions.

Input (vision tokens + text tokens from Qwen3.5-4B)
        │
        ▼
DiT Action Decoder (1.15B params, flow-matching)
        │
        ▼
Continuous action vector (7-DOF joint positions, base velocity, etc.)
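
To make the sampling step concrete, here is a minimal pure-Python sketch of flow-matching inference: start from Gaussian noise and integrate a velocity field with a handful of Euler steps. The `toy_velocity` function is a hypothetical stand-in for the learned DiT (it uses the closed-form velocity of a straight-line flow toward a known target), and the step count is an assumption. Nothing here reflects the actual Qwen-VLA decoder or its conditioning.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network (assumption, not the real DiT).

    For a straight-line flow that must reach `target` at t = 1, the
    velocity at time t is (target - x) / (1 - t).
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate the flow from noise (t = 0) to an action (t = 1)."""
    rng = random.Random(seed)
    # Start from a sample of the noise distribution.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = toy_velocity(x, t, target)
        # Forward Euler step along the velocity field.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample_action([0.12, -0.03, 0.45]))
```

In a real decoder the velocity would come from a network conditioned on the backbone features, and the number of integration steps trades latency against accuracy.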

The key insight: the action decoder does NOT change between robots. Instead, robots are distinguished through the embodiment-aware prompt.

3. Embodiment-Aware Prompt Conditioning

This is the mechanism that enables one model to serve multiple robots. Before each task, the user provides a text description of the current robot:

"You are controlling an ALOHA dual-arm robot (6 joints + 1 gripper per arm).
The action space is [left_joint_0...5, right_joint_0...5, left_gripper, right_gripper].
Actions are in joint position space, range [-1, 1]."

For a navigation robot, the prompt changes:

"You are controlling a WidowX mobile manipulator.
The action space is [base_vx, base_vy, base_wz, arm_joint_0...5, gripper].
Navigate to the target location while avoiding obstacles."

The model learns to read this prompt and adjust its output accordingly. No per-platform output heads required — just swap the text.
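
Because a wrong action-space description leads to wrong behavior, it can help to generate these prompts from a structured spec instead of writing them by hand. The sketch below is one way to do that. `EmbodimentSpec`, `build_embodiment_prompt`, and `check_action` are hypothetical helpers written for this article, not part of the Qwen-VLA codebase, and the prompt wording is only an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical description of one robot's action interface."""
    name: str
    action_names: tuple
    convention: str

    @property
    def action_dim(self) -> int:
        return len(self.action_names)


def build_embodiment_prompt(spec: EmbodimentSpec) -> str:
    """Render the spec as a text prompt for the model."""
    names = ", ".join(spec.action_names)
    return (
        f"You are controlling a {spec.name}.\n"
        f"The action space is [{names}] ({spec.action_dim} dimensions).\n"
        f"{spec.convention}"
    )


def check_action(spec: EmbodimentSpec, action):
    """Reject model outputs whose length does not match the spec."""
    if len(action) != spec.action_dim:
        raise ValueError(
            f"expected {spec.action_dim} values, got {len(action)}"
        )
    return list(action)


# One arm: 6 joints followed by a gripper (7 values per arm, 14 in total).
_ARM = [f"joint_{i}" for i in range(6)] + ["gripper"]
ALOHA = EmbodimentSpec(
    name="dual-arm ALOHA robot",
    action_names=tuple(f"{side}_{part}" for side in ("left", "right") for part in _ARM),
    convention="Actions are joint positions normalized to [-1, 1].",
)

if __name__ == "__main__":
    print(build_embodiment_prompt(ALOHA))
```

Keeping the spec and the prompt in one place means the dimension check and the text the model sees cannot drift apart.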

┌─────────────────────────────────────────────────────┐
│                   QWEN-VLA MODEL                    │
│                                                     │
│  [Camera RGB]  [Depth]  [Embodiment Prompt]         │
│       │              │            │                 │
│       └──────────────┴────────────┘                 │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │  Qwen3.5-4B VLM     │  ← Vision tokens │
│            │  Visual Grounding   │    + Text tokens  │
│            │  Spatial Reasoning  │                  │
│            └──────────┬──────────┘                  │
│                       │ Feature embedding            │
│            ┌──────────▼──────────┐                  │
│            │ DiT Action Decoder  │  1.15B params     │
│            │ Flow-matching       │  ← Noise input    │
│            └──────────┬──────────┘                  │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │   Action Vector     │                  │
│            │  (continuous, N-D)  │                  │
│            └─────────────────────┘                  │
└─────────────────────────────────────────────────────┘

Training Pipeline

Qwen-VLA is trained via joint pretraining — all data types train a single model simultaneously. The training data includes:

| Data type | Example sources | Purpose |
|---|---|---|
| Robot manipulation trajectories | Open X-Embodiment, BridgeData V2, RoboTwin demos | Learn manipulation |
| Human egocentric videos | Ego4D, EPIC-Kitchens | Learn hand-object interaction |
| Synthetic simulation data | Isaac Sim, MuJoCo rollouts | Augmentation, rare scenarios |
| Vision-language navigation data | R2R, RxR, NavInstruct | Learn navigation |
| Trajectory-centric supervision | Keypoint tracks, optical flow | Learn trajectory prediction |
| Auxiliary VLM data | VQA, captioning | Maintain visual grounding |

Training proceeds in two stages:

  1. Pretraining: Train on all data above with mixed-task batching. The model learns fundamental skills: object recognition, instruction following, action generation.

  2. Instruction tuning (Instruct variant): Fine-tune on high-quality, task-specific data to improve instruction following and generalization.
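
Mixed-task batching in stage 1 can be pictured as drawing every training example from a weighted mixture of the data types listed above. The following is a minimal sketch of that idea only. The source names and mixture weights are made-up placeholders; the article does not state the actual sampling ratios.

```python
import random
from collections import Counter

# Hypothetical mixture weights (they sum to 1.0); not taken from the paper.
MIXTURE = {
    "manipulation": 0.40,
    "egocentric_video": 0.15,
    "simulation": 0.15,
    "navigation": 0.15,
    "trajectory": 0.10,
    "vlm_aux": 0.05,
}


def sample_mixed_batch(mixture, batch_size, rng):
    """Pick the data source of each example in one batch."""
    sources = list(mixture)
    weights = [mixture[s] for s in sources]
    # random.choices samples with replacement, proportionally to weights.
    return rng.choices(sources, weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = sample_mixed_batch(MIXTURE, batch_size=32, rng=rng)
    print(Counter(batch))
```

Each sampled source name would then be used to pull an actual example from the matching dataset, so a single batch contains manipulation, navigation, and auxiliary samples side by side.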

This yields Qwen-VLA-Base (after stage 1) and Qwen-VLA-Instruct (after stage 2). There's also Qwen-VLA-aloha — a variant with additional pretraining on real ALOHA robot data.

Installation and Usage

System Requirements

# Python 3.10+, CUDA 12.1+, GPU >= 24GB VRAM (for Instruct)
# or >= 16GB with quantization

# Clone the repository
git clone https://github.com/QwenLM/Qwen-VLA.git
cd Qwen-VLA

# Create conda environment
conda create -n qwen-vla python=3.10
conda activate qwen-vla

# Install dependencies
pip install -r requirements.txt
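
As a rough sanity check on the VRAM figures above, the weights-only footprint can be estimated from the parameter count and the bytes per parameter. This sketch assumes a nominal 4.0B-parameter backbone plus the 1.15B decoder, and it ignores activations, caches, and framework overhead, which is why the practical requirement is higher than the numbers it prints.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Weights-only memory in GiB (no activations or overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: "4B" is treated as exactly 4.0e9 parameters.
TOTAL_PARAMS = 4.0e9 + 1.15e9

if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(TOTAL_PARAMS, dtype):.1f} GiB")
```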

Loading the Model

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

# Load Qwen-VLA-Instruct
model_name = "Qwen/Qwen-VLA-Instruct"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

Inference with Embodiment Prompt

from PIL import Image
import torch

# Embodiment prompt for ALOHA robot (6 joints + 1 gripper per arm = 14 dims)
embodiment_prompt = """You are controlling an ALOHA dual-arm robot (6 joints + 1 gripper per arm).
Action space: [left_joint_0..5, right_joint_0..5, left_gripper, right_gripper].
Actions are normalized joint positions in range [-1, 1]."""

# Task instruction
task = "Pick up the yellow cup and place it on the white plate."

# Prepare inputs
image = Image.open("camera_frame.jpg")
messages = [
    {
        "role": "system",
        "content": embodiment_prompt
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": task}
        ]
    }
]

# Tokenize
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to("cuda")

# Generate action via flow-matching inference
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False
    )

# Decode to action vector
action = processor.decode_action(outputs[0])
print(f"Action: {action}")
# Output: tensor([0.12, -0.03, 0.45, ...]) — 14-dim ALOHA action
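
Since the prompt declares actions as normalized to [-1, 1], something on the robot side has to map them back to physical joint commands. A minimal sketch of that step follows. The `denormalize` helper and the joint limits are illustrative assumptions, so use the limits and conventions of your own robot driver.

```python
def denormalize(action, lower, upper):
    """Map normalized actions in [-1, 1] to [lower, upper] per dimension.

    Values outside [-1, 1] are clipped first so that a slightly
    out-of-range prediction never exceeds the joint limits.
    """
    if not (len(action) == len(lower) == len(upper)):
        raise ValueError("action and limit vectors must have equal length")
    commands = []
    for a, lo, hi in zip(action, lower, upper):
        a = max(-1.0, min(1.0, a))
        commands.append(lo + (a + 1.0) * 0.5 * (hi - lo))
    return commands


if __name__ == "__main__":
    # Hypothetical limits: every joint spans [-3.14, 3.14] rad.
    normalized = [0.12, -0.03, 0.45]
    print(denormalize(normalized, [-3.14] * 3, [3.14] * 3))
```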

Switching to a Navigation Robot

# Just swap the embodiment prompt; model weights stay IDENTICAL
embodiment_prompt = """You are controlling a WidowX mobile manipulator.
Action space: [base_vx, base_vy, base_wz, arm_joint_0..5, gripper].
Navigate to the target location while avoiding obstacles."""

task = "Navigate to the kitchen counter and pick up the bottle."

# Inference is exactly the same — only the prompt changes

For a deeper look at fine-tuning VLA models for specific tasks, see our guide on fine-tuning Embodied-R1.5 on LIBERO.

Benchmark Results

Qwen-VLA-Instruct achieves strong results across all three task categories:

Manipulation Benchmarks

| Benchmark | Qwen-VLA-Instruct | Best baseline | Improvement |
|---|---|---|---|
| LIBERO | 97.9% | 95.2% (π₀.5) | +2.7 pp |
| Simpler-WidowX | 73.7% | 71.6% (π₀.5) | +2.1 pp |
| RoboTwin-Easy | 86.1% | 81.0% | +5.1 pp |
| RoboTwin-Hard | 87.2% | 79.3% | +7.9 pp |
| DOMINO (zero-shot) | 26.6% | 18.2% | +8.4 pp |

DOMINO is particularly interesting: it tests dynamic manipulation with moving objects in zero-shot conditions. The +8.4 percentage point improvement over the best baseline suggests that Qwen-VLA generalizes better than the specialized models it was compared against.
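
The improvements above are differences in percentage points, which is not the same as a relative gain. The short sketch below recomputes both from the reported success rates to show the distinction. The `improvement` helper is written for this article, and only the numbers in `RESULTS` come from the table.

```python
# (Qwen-VLA-Instruct, best baseline) success rates in percent, from the table.
RESULTS = {
    "LIBERO": (97.9, 95.2),
    "Simpler-WidowX": (73.7, 71.6),
    "RoboTwin-Easy": (86.1, 81.0),
    "RoboTwin-Hard": (87.2, 79.3),
    "DOMINO (zero-shot)": (26.6, 18.2),
}


def improvement(ours: float, baseline: float):
    """Return (absolute gain in percentage points, relative gain in %)."""
    points = ours - baseline
    relative = 100.0 * points / baseline
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (ours, baseline) in RESULTS.items():
        points, relative = improvement(ours, baseline)
        print(f"{name}: +{points} pp, +{relative}% relative")
```

On a near-saturated benchmark such as LIBERO a small absolute gain is also a small relative gain, while on DOMINO the same kind of absolute gain is a large relative jump from a low baseline.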

Navigation Benchmarks

| Benchmark | Qwen-VLA-Instruct | Description |
|---|---|---|
| R2R (OSR) | 69.0% | Vision-Language Navigation in 3D environments |
| RxR (SR) | 59.6% | Multilingual navigation benchmark |

Crucially, these navigation results come from the same model that achieves SOTA manipulation — not a separate, fine-tuned navigation model.

Real-World ALOHA Performance

| Condition | Success rate |
|---|---|
| In-distribution | 83.6% |
| Out-of-distribution (OOD) | 76.9% |

Qwen-VLA-aloha maintains 76.9% success rate under out-of-distribution conditions — when the table is shifted, object colors change, or camera position varies. This robustness is critical for real-world deployment.

Qwen-VLA real-world demo: manipulation and navigation from a single set of weights — source: QwenLM

Qwen-RobotSuite: The Next Step (June 2026)

Following Qwen-VLA's release, the Qwen team announced Qwen-RobotSuite in June 2026 — a trio of more specialized models:

Qwen-RobotManip

An enhanced manipulation VLA:

  • 80-dimensional canonical action space with per-dimension binary masking: all robot actions are mapped into 80 dimensions, and unused dimensions are masked. ALOHA uses 14 (6 joints + 1 gripper per arm), WidowX uses 8. Same model, different mask.
  • In-context policy adaptation: Show the model 1-3 short demonstrations of a new task to adapt without fine-tuning
  • Camera-frame delta pose parameterization: Actions computed relative to camera frame rather than base frame, reducing variance when the camera moves
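
The canonical action space with binary masking can be illustrated with a few lines of Python: a robot-specific action is written into fixed slots of an 80-dimensional vector, and a mask marks which slots are in use. The slot assignment below (`ALOHA_SLOTS`) is a hypothetical layout chosen for the example. The article does not describe how Qwen-RobotManip actually assigns dimensions.

```python
CANONICAL_DIM = 80


def pack_action(action, slots):
    """Place a robot-specific action into the canonical vector.

    Returns (padded, mask), where mask[i] is 1 for used dimensions.
    """
    if len(action) != len(slots):
        raise ValueError("one slot index is required per action value")
    if len(set(slots)) != len(slots):
        raise ValueError("slot indices must be unique")
    if any(not 0 <= s < CANONICAL_DIM for s in slots):
        raise ValueError("slot index out of range")
    padded = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for value, slot in zip(action, slots):
        padded[slot] = value
        mask[slot] = 1
    return padded, mask


def unpack_action(padded, slots):
    """Read the robot-specific action back out of the canonical vector."""
    return [padded[s] for s in slots]


# Hypothetical layout: ALOHA occupies the first 14 canonical dimensions.
ALOHA_SLOTS = list(range(14))

if __name__ == "__main__":
    action = [0.1 * i for i in range(14)]
    padded, mask = pack_action(action, ALOHA_SLOTS)
    print(sum(mask), "of", len(mask), "dimensions used")
```

During training, a loss would typically be computed only on dimensions where the mask is 1, so unused slots never contribute gradients.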

Results: #1 on RoboChallenge Table30-v1; 91.4% on LIBERO-Plus (vs. 84.4% previous SOTA); 69.4% on RoboTwin-C2R Hard (vs. 47.9%).

Qwen-RobotNav

A dedicated navigation model built on Qwen3-VL (2B/4B/8B variants):

  • Predicts 8 waypoints simultaneously rather than step-by-step
  • +10.8% improvement on HM-EQA, +15.4% on EXPRESS-Bench
  • 77% reduction in navigation steps needed

Qwen-RobotWorld

A video world model (20B parameters) — predicts future video from actions. This is the third component of the ecosystem, enabling robots to "imagine" the consequences of an action before executing it.

Comparison with Other VLA Models

| Model | Backbone | Action decoder | Cross-embodiment | Multi-task |
|---|---|---|---|---|
| OpenVLA | Prismatic-7B | Autoregressive action tokens (discrete) | ❌ No | ❌ No |
| π₀ (pi-zero) | PaliGemma-3B | Flow-matching | ❌ Limited | ✅ Yes |
| RDT-1B | T5-XXL | DiT | ❌ No | ✅ Yes |
| HEX-VLA | Qwen3-VL | VQ-VAE | ✅ Yes | ✅ Yes |
| Qwen-VLA | Qwen3.5-4B | DiT Flow-matching | ✅ Yes | ✅ Yes |

Qwen-VLA stands out by combining both cross-embodiment and multi-task capability in a single model — something most previous VLA models couldn't achieve simultaneously.

For another approach to cross-embodiment generalization, see our article on HEX-VLA for humanoid whole-body control.

Analysis and Outlook

Strengths:

  • A single set of weights genuinely serves multiple robots and multiple tasks
  • Embodiment-aware prompt conditioning is elegant and extensible — no architectural changes when adding a new robot type
  • DiT action decoder captures more complex action distributions than MLP heads
  • Fully open-source: code, weights, and technical report

Points to Consider:

  • Large model size (Qwen3.5-4B backbone + 1.15B DiT ≈ 5.15B total parameters): about 10 GB for the weights alone in bfloat16, before activations and runtime overhead
  • Embodiment prompts must be carefully written — incorrect action space description leads to wrong robot behavior
  • DOMINO zero-shot success of 26.6% remains low — dynamic manipulation is still a hard open problem
  • Navigation benchmarks (R2R 69.0%, RxR 59.6%), while competitive, trail navigation specialists

What's Next:

  • Integration with Qwen-RobotWorld for world model-based planning (imagine → execute → verify)
  • Quantization for edge deployment (Jetson Orin, etc.)
  • Incorporating haptic feedback and proprioception into the input stream
  • Scaling the canonical action space to cover even more embodiments

If you're interested in how VLA models with similar Qwen3 backbones perform in lab settings, the LabVLA with Qwen3-VL guide offers another perspective.

Conclusion

Qwen-VLA marks a meaningful step toward the "generalist robot brain" vision — instead of each robot requiring its own dedicated VLA, a single model can serve multiple platforms through text prompt changes alone. The strong benchmark results (97.9% LIBERO, 76.9% ALOHA OOD) combined with full open-source release make Qwen-VLA an important baseline reference for any VLA project in 2026.

GitHub: QwenLM/Qwen-VLA — Paper: arXiv 2605.30280


Related Posts

  • OpenVLA: Deep Dive into the First Truly Open-Source VLA Model
  • HEX-VLA: Cross-Embodiment VLA for Humanoid Whole-Body Control
  • Embodied-R1.5: Fine-Tuning VLA on LIBERO with LeRobot

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
