Qwen-VLA: Alibaba's Generalist VLA Model

Imagine you're a robotics engineer responsible for three completely different robot systems: a dual-arm ALOHA robot for assembly, a WidowX mobile robot for warehouse picking, and a wheeled humanoid navigating a hallway. With today's VLA models, you'd need three separate models, three training pipelines, and three codebases. Every new robot means another mountain of work.

Qwen-VLA from Alibaba aims to solve exactly this: one set of weights, multiple robots, multiple tasks.

The paper Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (arXiv 2605.30280, submitted May 2026) proposes a unified framework that enables a single model to perform manipulation, navigation, and trajectory prediction — requiring only a text prompt change to switch between robot platforms.

The Problem Qwen-VLA Solves

Robotics in 2026 faces a paradox: we have increasingly powerful VLA models — OpenVLA, π₀ (pi-zero), RDT-1B — but each is designed for a specific task or robot type. Want to fine-tune for a new robot? Retrain from scratch. Want to switch from manipulation to navigation? Use a different model.

The core issues are:

Hardware fragmentation: Each robot has a different action space — 7-DOF arm, differential drive, biped locomotion — so model output needs to change per hardware
Task fragmentation: Manipulation (pick-and-place), navigation (waypoint following), and trajectory prediction have fundamentally different output structures
Data fragmentation: Manipulation datasets can't be directly used to train navigation models and vice versa

Qwen-VLA addresses this with a unified action-and-trajectory prediction framework — representing all outputs (actions, waypoints, trajectories) in a shared space — and embodiment-aware prompt conditioning that tells the model which robot it's controlling.

Technical Architecture

Qwen-VLA consists of two main components:

1. Vision-Language Backbone: Qwen3.5-4B

The "world understanding" part of the model is built on Qwen3.5-4B — Alibaba's vision-language foundation model. This backbone processes:

Camera images (RGB, depth, or multi-view depending on robot configuration)
Text instructions from the user ("pick up the red cup and place it on the tray")
Embodiment prompt — a text description of the current robot, its action space, and control convention

The Qwen3.5-4B backbone, pretrained on massive language and vision data, provides the visual grounding (localizing objects in space) and spatial reasoning (understanding spatial relationships) needed for both manipulation and navigation tasks.

2. Action Decoder: 1.15B DiT Flow-Matching

This is the most distinctive part of Qwen-VLA. Instead of autoregressive decoding (predicting tokens one-by-one like a standard LLM), they use a Diffusion Transformer (DiT) with flow-matching.

Flow-matching is a generative method that learns to "flow" from a noise distribution to the action distribution in fewer steps than traditional DDPM. The DiT action decoder has 1.15 billion parameters — significantly larger than typical action heads — enabling the model to learn complex, multi-modal action distributions.

Input (vision tokens + text tokens from Qwen3.5-4B)
        │
        ▼
DiT Action Decoder (1.15B params, flow-matching)
        │
        ▼
Continuous action vector (7-DOF joint positions, base velocity, etc.)

The key insight: the action decoder does NOT change between robots. Instead, robots are distinguished through the embodiment-aware prompt.

3. Embodiment-Aware Prompt Conditioning

This is the mechanism that enables one model to serve multiple robots. Before each task, the user provides a text description of the current robot:

"You are controlling a 7-DOF ALOHA dual-arm robot. 
The action space is [left_joint_0...6, right_joint_0...6, left_gripper, right_gripper].
Actions are in joint position space, range [-1, 1]."

For a navigation robot, the prompt changes:

"You are controlling a WidowX mobile manipulator.
The action space is [base_vx, base_vy, base_wz, arm_joint_0...5, gripper].
Navigate to the target location while avoiding obstacles."

The model learns to read this prompt and adjust its output accordingly. No per-platform output heads required — just swap the text.

┌─────────────────────────────────────────────────────┐
│                   QWEN-VLA MODEL                    │
│                                                     │
│  [Camera RGB]  [Depth]  [Embodiment Prompt]         │
│       │              │            │                 │
│       └──────────────┴────────────┘                 │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │  Qwen3.5-4B VLM     │  ← Vision tokens │
│            │  Visual Grounding   │    + Text tokens  │
│            │  Spatial Reasoning  │                  │
│            └──────────┬──────────┘                  │
│                       │ Feature embedding            │
│            ┌──────────▼──────────┐                  │
│            │ DiT Action Decoder  │  1.15B params     │
│            │ Flow-matching       │  ← Noise input    │
│            └──────────┬──────────┘                  │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │   Action Vector     │                  │
│            │  (continuous, N-D)  │                  │
│            └─────────────────────┘                  │
└─────────────────────────────────────────────────────┘

Training Pipeline

Qwen-VLA is trained via joint pretraining — all data types train a single model simultaneously. The training data includes:

Data Source	Type	Purpose
Robot manipulation trajectories	Open X-Embodiment, BridgeV2, RoboTwin demos	Learn manipulation
Human egocentric videos	Ego4D, EPIC-Kitchens	Learn hand-object interaction
Synthetic simulation data	Isaac Sim, MuJoCo rollouts	Augmentation, rare scenarios
Vision-language navigation data	R2R, RxR, NavInstruct	Learn navigation
Trajectory-centric supervision	Keypoint tracks, optical flow	Learn trajectory prediction
Auxiliary VLM data	VQA, captioning	Maintain visual grounding

Training proceeds in two stages:

Pretraining: Train on all data above with mixed-task batching. The model learns fundamental skills: object recognition, instruction following, action generation.
Instruction tuning (Instruct variant): Fine-tune on high-quality, task-specific data to improve instruction following and generalization.

This yields Qwen-VLA-Base (after stage 1) and Qwen-VLA-Instruct (after stage 2). There's also Qwen-VLA-aloha — a variant with additional pretraining on real ALOHA robot data.

Installation and Usage

System Requirements

# Python 3.10+, CUDA 12.1+, GPU >= 24GB VRAM (for Instruct)
# or >= 16GB with quantization

# Clone the repository
git clone https://github.com/QwenLM/Qwen-VLA.git
cd Qwen-VLA

# Create conda environment
conda create -n qwen-vla python=3.10
conda activate qwen-vla

# Install dependencies
pip install -r requirements.txt

Loading the Model

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

# Load Qwen-VLA-Instruct
model_name = "Qwen/Qwen-VLA-Instruct"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

Inference with Embodiment Prompt

from PIL import Image
import torch

# Embodiment prompt for ALOHA robot
embodiment_prompt = """You are controlling a 7-DOF ALOHA dual-arm robot.
Action space: [left_joint_0..6, right_joint_0..6, left_gripper, right_gripper].
Actions are normalized joint positions in range [-1, 1]."""

# Task instruction
task = "Pick up the yellow cup and place it on the white plate."

# Prepare inputs
image = Image.open("camera_frame.jpg")
messages = [
    {
        "role": "system",
        "content": embodiment_prompt
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": task}
        ]
    }
]

# Tokenize
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to("cuda")

# Generate action via flow-matching inference
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False
    )

# Decode to action vector
action = processor.decode_action(outputs[0])
print(f"Action: {action}")
# Output: tensor([0.12, -0.03, 0.45, ...]) — 14-dim ALOHA action

# Just swap the embodiment prompt — model weights stay IDENTICAL
embodiment_prompt = """You are controlling a WidowX mobile robot.
Action space: [base_x_vel, base_y_vel, base_theta_vel, arm_joint_0..5, gripper].
Navigate to target location while avoiding obstacles."""

task = "Navigate to the kitchen counter and pick up the bottle."

# Inference is exactly the same — only the prompt changes

For a deeper look at fine-tuning VLA models for specific tasks, see our guide on fine-tuning Embodied-R1.5 on LIBERO.

Benchmark Results

Qwen-VLA-Instruct achieves strong results across all three task categories:

Manipulation Benchmarks

Benchmark	Qwen-VLA-Instruct	Best Baseline	Improvement
LIBERO	97.9%	95.2% (π₀.5)	+2.7%
Simpler-WidowX	73.7%	71.6% (π₀.5)	+2.1%
RoboTwin-Easy	86.1%	81.0%	+5.1%
RoboTwin-Hard	87.2%	79.3%	+7.9%
DOMINO (zero-shot)	26.6%	18.2%	+8.4%

DOMINO is particularly interesting — it tests dynamic manipulation with moving objects in zero-shot conditions. The +8.4 percentage point improvement over baseline demonstrates that Qwen-VLA generalizes significantly better than specialized models.

Benchmark	Qwen-VLA-Instruct	Description
R2R (OSR)	69.0%	Vision-Language Navigation in 3D environments
RxR (SR)	59.6%	Multilingual navigation benchmark

Crucially, these navigation results come from the same model that achieves SOTA manipulation — not a separate, fine-tuned navigation model.

Real-World ALOHA Performance

Condition	Success Rate
In-distribution	83.6%
Out-of-distribution (OOD)	76.9%

Qwen-VLA-aloha maintains 76.9% success rate under out-of-distribution conditions — when the table is shifted, object colors change, or camera position varies. This robustness is critical for real-world deployment.

Qwen-VLA real-world demo: manipulation and navigation from a single set of weights — source: QwenLM

Qwen-RobotSuite: The Next Step (June 2026)

Following Qwen-VLA's release, the Qwen team announced Qwen-RobotSuite in June 2026 — a trio of more specialized models:

Qwen-RobotManip

An enhanced manipulation VLA:

80-dimensional canonical action space with per-dimension binary masking: all robot actions are normalized to 80 dimensions. Unused dimensions are masked — ALOHA uses 14 (7+7 joints), WidowX uses 8. Same model, different mask.
In-context policy adaptation: Show the model 1-3 short demonstrations of a new task to adapt without fine-tuning
Camera-frame delta pose parameterization: Actions computed relative to camera frame rather than base frame, reducing variance when the camera moves

Results: #1 on RoboChallenge Table30-v1; 91.4% on LIBERO-Plus (vs. 84.4% previous SOTA); 69.4% on RoboTwin-C2R Hard (vs. 47.9%).

Qwen-RobotNav

A dedicated navigation model built on Qwen3-VL (2B/4B/8B variants):

Predicts 8 waypoints simultaneously rather than step-by-step
+10.8% improvement on HM-EQA, +15.4% on EXPRESS-Bench
77% reduction in navigation steps needed

Qwen-RobotWorld

A video world model (20B parameters) — predicts future video from actions. This is the third component of the ecosystem, enabling robots to "imagine" the consequences of an action before executing it.

Comparison with Other VLA Models

Model	Backbone	Action Decoder	Cross-embodiment	Multi-task
OpenVLA	Prismatic-7B	MLP (discrete)	❌ No	❌ No
π₀ (pi-zero)	PaliGemma-3B	Flow-matching	❌ Limited	✅ Yes
RDT-1B	T5-Large	DiT	❌ No	✅ Yes
HEX-VLA	Qwen3-VL	VQ-VAE	✅ Yes	✅ Yes
Qwen-VLA	Qwen3.5-4B	DiT Flow-matching	✅ Yes	✅ Yes

Qwen-VLA stands out by combining both cross-embodiment and multi-task capability in a single model — something most previous VLA models couldn't achieve simultaneously.

For another approach to cross-embodiment generalization, see our article on HEX-VLA for humanoid whole-body control.

Analysis and Outlook

Strengths:

A single set of weights genuinely serves multiple robots and multiple tasks
Embodiment-aware prompt conditioning is elegant and extensible — no architectural changes when adding a new robot type
DiT action decoder captures more complex action distributions than MLP heads
Fully open-source: code, weights, and technical report

Points to Consider:

Large model size (Qwen3.5-4B backbone + 1.15B DiT ≈ 5.15B total parameters) — requires substantial GPU VRAM
Embodiment prompts must be carefully written — incorrect action space description leads to wrong robot behavior
DOMINO zero-shot success of 26.6% remains low — dynamic manipulation is still a hard open problem
Navigation benchmarks (R2R 69.0%, RxR 59.6%), while competitive, trail navigation specialists

What's Next:

Integration with Qwen-RobotWorld for world model-based planning (imagine → execute → verify)
Quantization for edge deployment (Jetson Orin, etc.)
Incorporating haptic feedback and proprioception into the input stream
Scaling the canonical action space to cover even more embodiments

If you're interested in how VLA models with similar Qwen3 backbones perform in lab settings, the LabVLA with Qwen3-VL guide offers another perspective.

Conclusion

Qwen-VLA marks a meaningful step toward the "generalist robot brain" vision — instead of each robot requiring its own dedicated VLA, a single model can serve multiple platforms through text prompt changes alone. The strong benchmark results (97.9% LIBERO, 76.9% ALOHA OOD) combined with full open-source release make Qwen-VLA an important baseline reference for any VLA project in 2026.

GitHub: QwenLM/Qwen-VLA — Paper: arXiv 2605.30280

Qwen-VLA from Alibaba aims to solve exactly this: one set of weights, multiple robots, multiple tasks.

The Problem Qwen-VLA Solves

The core issues are:

Hardware fragmentation: Each robot has a different action space — 7-DOF arm, differential drive, biped locomotion — so model output needs to change per hardware
Task fragmentation: Manipulation (pick-and-place), navigation (waypoint following), and trajectory prediction have fundamentally different output structures
Data fragmentation: Manipulation datasets can't be directly used to train navigation models and vice versa

Technical Architecture

Qwen-VLA consists of two main components:

1. Vision-Language Backbone: Qwen3.5-4B

The "world understanding" part of the model is built on Qwen3.5-4B — Alibaba's vision-language foundation model. This backbone processes:

Camera images (RGB, depth, or multi-view depending on robot configuration)
Text instructions from the user ("pick up the red cup and place it on the tray")
Embodiment prompt — a text description of the current robot, its action space, and control convention

2. Action Decoder: 1.15B DiT Flow-Matching

This is the most distinctive part of Qwen-VLA. Instead of autoregressive decoding (predicting tokens one-by-one like a standard LLM), they use a Diffusion Transformer (DiT) with flow-matching.

Input (vision tokens + text tokens from Qwen3.5-4B)
        │
        ▼
DiT Action Decoder (1.15B params, flow-matching)
        │
        ▼
Continuous action vector (7-DOF joint positions, base velocity, etc.)

The key insight: the action decoder does NOT change between robots. Instead, robots are distinguished through the embodiment-aware prompt.

3. Embodiment-Aware Prompt Conditioning

This is the mechanism that enables one model to serve multiple robots. Before each task, the user provides a text description of the current robot:

"You are controlling a 7-DOF ALOHA dual-arm robot. 
The action space is [left_joint_0...6, right_joint_0...6, left_gripper, right_gripper].
Actions are in joint position space, range [-1, 1]."

For a navigation robot, the prompt changes:

"You are controlling a WidowX mobile manipulator.
The action space is [base_vx, base_vy, base_wz, arm_joint_0...5, gripper].
Navigate to the target location while avoiding obstacles."

The model learns to read this prompt and adjust its output accordingly. No per-platform output heads required — just swap the text.

┌─────────────────────────────────────────────────────┐
│                   QWEN-VLA MODEL                    │
│                                                     │
│  [Camera RGB]  [Depth]  [Embodiment Prompt]         │
│       │              │            │                 │
│       └──────────────┴────────────┘                 │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │  Qwen3.5-4B VLM     │  ← Vision tokens │
│            │  Visual Grounding   │    + Text tokens  │
│            │  Spatial Reasoning  │                  │
│            └──────────┬──────────┘                  │
│                       │ Feature embedding            │
│            ┌──────────▼──────────┐                  │
│            │ DiT Action Decoder  │  1.15B params     │
│            │ Flow-matching       │  ← Noise input    │
│            └──────────┬──────────┘                  │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │   Action Vector     │                  │
│            │  (continuous, N-D)  │                  │
│            └─────────────────────┘                  │
└─────────────────────────────────────────────────────┘

Training Pipeline

Qwen-VLA is trained via joint pretraining — all data types train a single model simultaneously. The training data includes:

Data Source	Type	Purpose
Robot manipulation trajectories	Open X-Embodiment, BridgeV2, RoboTwin demos	Learn manipulation
Human egocentric videos	Ego4D, EPIC-Kitchens	Learn hand-object interaction
Synthetic simulation data	Isaac Sim, MuJoCo rollouts	Augmentation, rare scenarios
Vision-language navigation data	R2R, RxR, NavInstruct	Learn navigation
Trajectory-centric supervision	Keypoint tracks, optical flow	Learn trajectory prediction
Auxiliary VLM data	VQA, captioning	Maintain visual grounding

Training proceeds in two stages:

Pretraining: Train on all data above with mixed-task batching. The model learns fundamental skills: object recognition, instruction following, action generation.
Instruction tuning (Instruct variant): Fine-tune on high-quality, task-specific data to improve instruction following and generalization.

This yields Qwen-VLA-Base (after stage 1) and Qwen-VLA-Instruct (after stage 2). There's also Qwen-VLA-aloha — a variant with additional pretraining on real ALOHA robot data.

Installation and Usage

System Requirements

# Python 3.10+, CUDA 12.1+, GPU >= 24GB VRAM (for Instruct)
# or >= 16GB with quantization

# Clone the repository
git clone https://github.com/QwenLM/Qwen-VLA.git
cd Qwen-VLA

# Create conda environment
conda create -n qwen-vla python=3.10
conda activate qwen-vla

# Install dependencies
pip install -r requirements.txt

Loading the Model

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

# Load Qwen-VLA-Instruct
model_name = "Qwen/Qwen-VLA-Instruct"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

Inference with Embodiment Prompt

from PIL import Image
import torch

# Embodiment prompt for ALOHA robot
embodiment_prompt = """You are controlling a 7-DOF ALOHA dual-arm robot.
Action space: [left_joint_0..6, right_joint_0..6, left_gripper, right_gripper].
Actions are normalized joint positions in range [-1, 1]."""

# Task instruction
task = "Pick up the yellow cup and place it on the white plate."

# Prepare inputs
image = Image.open("camera_frame.jpg")
messages = [
    {
        "role": "system",
        "content": embodiment_prompt
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": task}
        ]
    }
]

# Tokenize
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to("cuda")

# Generate action via flow-matching inference
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False
    )

# Decode to action vector
action = processor.decode_action(outputs[0])
print(f"Action: {action}")
# Output: tensor([0.12, -0.03, 0.45, ...]) — 14-dim ALOHA action

# Just swap the embodiment prompt — model weights stay IDENTICAL
embodiment_prompt = """You are controlling a WidowX mobile robot.
Action space: [base_x_vel, base_y_vel, base_theta_vel, arm_joint_0..5, gripper].
Navigate to target location while avoiding obstacles."""

task = "Navigate to the kitchen counter and pick up the bottle."

# Inference is exactly the same — only the prompt changes

For a deeper look at fine-tuning VLA models for specific tasks, see our guide on fine-tuning Embodied-R1.5 on LIBERO.

Benchmark Results

Qwen-VLA-Instruct achieves strong results across all three task categories:

Manipulation Benchmarks

Benchmark	Qwen-VLA-Instruct	Best Baseline	Improvement
LIBERO	97.9%	95.2% (π₀.5)	+2.7%
Simpler-WidowX	73.7%	71.6% (π₀.5)	+2.1%
RoboTwin-Easy	86.1%	81.0%	+5.1%
RoboTwin-Hard	87.2%	79.3%	+7.9%
DOMINO (zero-shot)	26.6%	18.2%	+8.4%

Benchmark	Qwen-VLA-Instruct	Description
R2R (OSR)	69.0%	Vision-Language Navigation in 3D environments
RxR (SR)	59.6%	Multilingual navigation benchmark

Crucially, these navigation results come from the same model that achieves SOTA manipulation — not a separate, fine-tuned navigation model.

Real-World ALOHA Performance

Condition	Success Rate
In-distribution	83.6%
Out-of-distribution (OOD)	76.9%

Qwen-VLA real-world demo: manipulation and navigation from a single set of weights — source: QwenLM

Qwen-RobotSuite: The Next Step (June 2026)

Following Qwen-VLA's release, the Qwen team announced Qwen-RobotSuite in June 2026 — a trio of more specialized models:

Qwen-RobotManip

An enhanced manipulation VLA:

80-dimensional canonical action space with per-dimension binary masking: all robot actions are normalized to 80 dimensions. Unused dimensions are masked — ALOHA uses 14 (7+7 joints), WidowX uses 8. Same model, different mask.
In-context policy adaptation: Show the model 1-3 short demonstrations of a new task to adapt without fine-tuning
Camera-frame delta pose parameterization: Actions computed relative to camera frame rather than base frame, reducing variance when the camera moves

Results: #1 on RoboChallenge Table30-v1; 91.4% on LIBERO-Plus (vs. 84.4% previous SOTA); 69.4% on RoboTwin-C2R Hard (vs. 47.9%).

Qwen-RobotNav

A dedicated navigation model built on Qwen3-VL (2B/4B/8B variants):

Predicts 8 waypoints simultaneously rather than step-by-step
+10.8% improvement on HM-EQA, +15.4% on EXPRESS-Bench
77% reduction in navigation steps needed

Qwen-RobotWorld

Comparison with Other VLA Models

Model	Backbone	Action Decoder	Cross-embodiment	Multi-task
OpenVLA	Prismatic-7B	MLP (discrete)	❌ No	❌ No
π₀ (pi-zero)	PaliGemma-3B	Flow-matching	❌ Limited	✅ Yes
RDT-1B	T5-Large	DiT	❌ No	✅ Yes
HEX-VLA	Qwen3-VL	VQ-VAE	✅ Yes	✅ Yes
Qwen-VLA	Qwen3.5-4B	DiT Flow-matching	✅ Yes	✅ Yes

Qwen-VLA stands out by combining both cross-embodiment and multi-task capability in a single model — something most previous VLA models couldn't achieve simultaneously.

For another approach to cross-embodiment generalization, see our article on HEX-VLA for humanoid whole-body control.

Analysis and Outlook

Strengths:

A single set of weights genuinely serves multiple robots and multiple tasks
Embodiment-aware prompt conditioning is elegant and extensible — no architectural changes when adding a new robot type
DiT action decoder captures more complex action distributions than MLP heads
Fully open-source: code, weights, and technical report

Points to Consider:

Large model size (Qwen3.5-4B backbone + 1.15B DiT ≈ 5.15B total parameters) — requires substantial GPU VRAM
Embodiment prompts must be carefully written — incorrect action space description leads to wrong robot behavior
DOMINO zero-shot success of 26.6% remains low — dynamic manipulation is still a hard open problem
Navigation benchmarks (R2R 69.0%, RxR 59.6%), while competitive, trail navigation specialists

What's Next:

Integration with Qwen-RobotWorld for world model-based planning (imagine → execute → verify)
Quantization for edge deployment (Jetson Orin, etc.)
Incorporating haptic feedback and proprioception into the input stream
Scaling the canonical action space to cover even more embodiments

If you're interested in how VLA models with similar Qwen3 backbones perform in lab settings, the LabVLA with Qwen3-VL guide offers another perspective.

Conclusion

GitHub: QwenLM/Qwen-VLA — Paper: arXiv 2605.30280

The Problem Qwen-VLA Solves

Technical Architecture

1. Vision-Language Backbone: Qwen3.5-4B

2. Action Decoder: 1.15B DiT Flow-Matching

3. Embodiment-Aware Prompt Conditioning

Training Pipeline

Installation and Usage

System Requirements

Loading the Model

Inference with Embodiment Prompt

Switching to a Navigation Robot

Benchmark Results

Manipulation Benchmarks

Navigation Benchmarks

Real-World ALOHA Performance

Qwen-RobotSuite: The Next Step (June 2026)

Qwen-RobotManip

Qwen-RobotNav

Qwen-RobotWorld

Comparison with Other VLA Models

Analysis and Outlook

Conclusion

Related Posts

Nguyễn Anh Tuấn

Related Posts

HEX: VLA Toàn Thân Đa Embodiment cho Humanoid

ABot-M0: VLA Foundation Model với Action Manifold

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

The Problem Qwen-VLA Solves

Technical Architecture

1. Vision-Language Backbone: Qwen3.5-4B

2. Action Decoder: 1.15B DiT Flow-Matching

3. Embodiment-Aware Prompt Conditioning

Training Pipeline

Installation and Usage

System Requirements

Loading the Model

Inference with Embodiment Prompt

Switching to a Navigation Robot

Benchmark Results

Manipulation Benchmarks

Navigation Benchmarks

Real-World ALOHA Performance

Qwen-RobotSuite: The Next Step (June 2026)

Qwen-RobotManip

Qwen-RobotNav

Qwen-RobotWorld

Comparison with Other VLA Models

Analysis and Outlook

Conclusion

Related Posts

Nguyễn Anh Tuấn

Related Posts

HEX: VLA Toàn Thân Đa Embodiment cho Humanoid

ABot-M0: VLA Foundation Model với Action Manifold

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2