Overview
Wall-OSS-0.5 is one of the more important VLA releases of 2026: an open-source 4B Vision-Language-Action model from X Square Robot, designed so that pretrained robot capability can be measured directly before task-specific fine-tuning. That distinction matters. In robot learning, many models are described as foundation models, but their strongest numbers are often reported only after hundreds of demonstrations on the target benchmark. Wall-OSS-0.5 asks a sharper question: can VLA pretraining itself produce executable robot behavior, or is it only a better initialization for downstream imitation learning?
The technical report answers with a real-robot zero-shot evaluation. Wall-OSS-0.5 is initialized from Qwen2.5-VL-3B-Instruct and expanded beyond 4B parameters with action-generation components. Its pretraining covers more than 20 robot embodiments, processes more than one million robot trajectories per epoch, and mixes those trajectories with a grounded multimodal corpus. The pretrained checkpoint is evaluated on a 17-task real-robot suite, then fine-tuned on 15 tasks and compared against models such as π0.5 and DreamZero.
This guide is practical. We will first unpack the paper, then turn the architecture into a working setup for installation, dataset preparation, training, inference, and safety checks in LeRobot. If you are new to the ecosystem, read the LeRobot framework guide first, then the broader overview of VLA models for manipulation. For production deployment after fine-tuning, the workflow connects naturally to PEFT/LoRA and VLA deployment.
Primary sources used here are the Wall-OSS-0.5 Technical Report, the Wall-X GitHub repository, the Hugging Face model card, and the LeRobot WALL-OSS documentation.
What the Paper Is Really About
The 4B parameter count is not the main reason Wall-OSS-0.5 is interesting. Its core contribution is the way it separates three problems that are often merged in VLA papers:
- Can the model retain visual-language understanding after learning actions?
- Does the pretrained VLM backbone receive a strong enough action signal?
- Can the deployment-time policy output smooth continuous actions for a real robot?
Many VLA systems attach a continuous action head, diffusion model, or flow matching module to a pretrained VLM. That makes sense for deployment because robot controllers need continuous values: end-effector deltas, rotations, gripper commands, and sometimes joint states. But the Wall-OSS-0.5 report highlights a training issue: flow matching is a good execution objective, yet after the early training phase it provides a relatively weak update to the VLM backbone. Discrete action-token prediction has the opposite property. Cross-entropy is native to language-model training, so it gives the backbone a strong gradient. But directly decoding discrete action tokens is usually too coarse for precise manipulation.
Wall-OSS-0.5 addresses this with gradient-bridged co-training. During training, discrete action prediction acts as the gradient bridge that makes the backbone action-aware. Multimodal prediction keeps the model grounded in visual-language instruction following. Continuous flow matching remains the deployment interface for real-valued action chunks. In short: the discrete pathway teaches the backbone; the continuous pathway controls the robot.
The report summarizes the combined objective as:
L = L_flow + lambda_act * L_act-CE + lambda_mm * L_mm-CE
L_act-CE is the autoregressive loss over RVQ action tokens, L_mm-CE is multimodal next-token loss, and L_flow trains the continuous action generator. The report uses lambda_act = lambda_mm = 0.01 and an action-to-multimodal batch mixture of roughly 9:1. This is not just a mathematical detail. It explains why a flow-only fine-tuning run may adapt the action head while leaving the backbone less robotics-aware, and why removing the multimodal anchor can damage grounded vision-language competence.
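To make the weighting concrete, here is a minimal sketch of the combined objective as plain arithmetic. The function name and the three loss values are made up for illustration; only the formula and the 0.01 weights come from the report.

```python
def combined_loss(l_flow, l_act_ce, l_mm_ce, lambda_act=0.01, lambda_mm=0.01):
    """L = L_flow + lambda_act * L_act-CE + lambda_mm * L_mm-CE."""
    return l_flow + lambda_act * l_act_ce + lambda_mm * l_mm_ce


# Illustrative per-batch loss values (not taken from the report).
total = combined_loss(l_flow=0.20, l_act_ce=3.0, l_mm_ce=2.0)
print(round(total, 4))  # 0.25
```

In a real training loop the three terms would be tensors produced by the model, and the same weighted sum would be passed to the optimizer.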
Architecture: MoT, RVQ Tokens, Flow Actions
Wall-OSS-0.5 starts from Qwen2.5-VL-3B-Instruct and extends it with a Mixture-of-Transformers (MoT) structure. Each layer has two specialized experts:
- VL Expert handles vision tokens, language tokens, proprioception tokens, and discrete action tokens.
- Action Expert handles noisy continuous action tokens used by flow matching.
This is more integrated than attaching an action head to the final hidden state. The paper describes it as routing decomposition rather than gradient isolation: tokens are routed through the suitable expert, but gradients still flow end to end. An attention mask keeps the discrete action pathway and the continuous action pathway from directly seeing each other during the forward pass, so each objective is optimized on its own tokens without reading the other pathway's action tokens. The Action Expert can still use visual-language context when generating continuous actions.
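The masking idea can be sketched with a toy boolean mask. This is an assumption-laden illustration, not the Wall-X implementation: the token-type labels ("vl", "disc", "cont") and the function name are hypothetical, and causal ordering is deliberately left out so that only the pathway separation is visible.

```python
def build_pathway_mask(token_types):
    """Return mask[i][j] = True if query token i may attend to key token j.

    Assumed rule: discrete action tokens ("disc") and noisy continuous action
    tokens ("cont") never attend to each other; every other pair is allowed.
    """
    blocked = {("disc", "cont"), ("cont", "disc")}
    return [[(q, k) not in blocked for k in token_types] for q in token_types]


# Two vision/language/proprio tokens, two discrete and two continuous action tokens.
types = ["vl", "vl", "disc", "disc", "cont", "cont"]
mask = build_pathway_mask(types)
for row_type, row in zip(types, mask):
    print(row_type, ["x" if allowed else "." for allowed in row])
```

The rows for "cont" tokens still have access to the "vl" columns, which matches the statement that the Action Expert can use visual-language context.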

The second important component is the Vision-Aligned RVQ Action Tokenizer. Models such as Pi0-FAST use FAST tokenization to represent actions efficiently. FAST is useful, but it is largely a rule-based compressor. Wall-OSS-0.5 replaces it with a learned residual vector quantization tokenizer. The tokenizer operates in delta-action space. It encodes observation-conditioned action chunks, quantizes them through multiple residual codebook levels, and decodes them back into action sequences.
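For readers who have not met residual vector quantization before, the sketch below shows only the quantization step on a toy 2D delta action. The codebooks are hand-written and the function names are hypothetical; the real tokenizer uses learned, observation-conditioned encoders and decoders around this step.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous levels missed."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entry from every level to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],  # fine level
]
tokens = rvq_encode([1.2, 0.1], codebooks)
print(tokens, rvq_decode(tokens, codebooks))  # [1, 1] [1.25, 0.0]
```

Each extra level shrinks the reconstruction error, which is why a small number of discrete tokens per chunk can still describe an action fairly precisely.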
The key is that the tokenizer is not trained only for numeric reconstruction. It also uses visual-action alignment, next-frame prediction, and DCT-domain reconstruction. The goal is to make action tokens semantically useful for the VLM backbone rather than merely compact. The report's ablation shows this matters: replacing FAST with the Vision-Aligned RVQ tokenizer improves average real-robot task progress from 29.3% to 48.1% across four tasks under the same co-training setting.
The third component is Action-Space Supervision for flow matching. Standard flow matching learns a velocity field from noise toward clean actions. Wall-OSS-0.5 keeps velocity prediction as the network output, but defines the loss in recovered action space. The intuition is robotics-specific: robot action trajectories are usually low-dimensional and smooth, so the global low-frequency trajectory shape often matters more than high-frequency detail. This action-space loss improves convergence and stabilizes continuous action generation.
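One way to picture "velocity output, action-space loss" is the sketch below. It assumes a linear interpolation path x_t = (1 - t) * noise + t * action, so the target velocity is action - noise and the action recovered from a predicted velocity is x_t + (1 - t) * v_pred. The time convention, the plain mean-squared error, and all function names are assumptions; the report's exact formulation may differ.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the clean action (t=1)."""
    return [(1 - t) * n + t * a for n, a in zip(noise, action)]


def recover_action(x_t, v_pred, t):
    """Follow the predicted velocity from x_t to the end of the path."""
    return [x + (1 - t) * v for x, v in zip(x_t, v_pred)]


def action_space_loss(v_pred, noise, action, t):
    """Mean squared error between the recovered action and the clean action."""
    a_hat = recover_action(interpolate(noise, action, t), v_pred, t)
    return sum((p - a) ** 2 for p, a in zip(a_hat, action)) / len(action)


random.seed(0)
action = [0.1, -0.2, 0.05]
noise = [random.gauss(0.0, 1.0) for _ in action]
v_true = [a - n for a, n in zip(action, noise)]
v_zero = [0.0 for _ in action]

print(action_space_loss(v_true, noise, action, t=0.3))  # essentially zero
print(action_space_loss(v_zero, noise, action, t=0.3))  # clearly positive
```

The network still predicts a velocity, so inference-time integration is unchanged; only the quantity being penalized moves from velocity space to action space.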
The action representation also deserves attention. The report uses relative actions and 6D rotations instead of Euler angles or quaternions, avoiding discontinuities and orientation ambiguity. The standardized action space is 26-dimensional, covering bimanual end-effector pose deltas, 6D rotation, gripper values, and related state depending on the embodiment. For a beginner, the practical lesson is simple: do not plug the checkpoint into a random robot without verifying the action schema. Camera names, proprio fields, normalization keys, action dimension, action horizon, controller units, and safety limits must all match.
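The 6D rotation representation is usually converted to an orthonormal frame with a Gram-Schmidt step. The sketch below shows that standard construction; whether the three resulting vectors are stacked as rows or columns of the rotation matrix is a convention that must be checked against the checkpoint's adapter, and the function names here are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length, refusing near-zero vectors."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def rot6d_to_basis(r6):
    """Turn a 6D rotation vector into three orthonormal basis vectors."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    # Remove the component of a2 along b1, then normalize.
    b2 = normalize([y - dot * x for x, y in zip(b1, a2)])
    # The third axis is the cross product b1 x b2.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return b1, b2, b3


# A non-orthogonal input still yields an orthonormal frame.
print(rot6d_to_basis([2.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
```

Because any pair of non-parallel 3D vectors maps to a valid rotation, small prediction errors do not produce the jumps that Euler-angle wraparound or quaternion sign flips can cause.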
Installation
Wall-OSS-0.5 is integrated into LeRobot as policy.type=wall_x. The LeRobot documentation notes that the main docs require installing from source, while stable pip releases may lag behind. A practical workstation should have Ubuntu 22.04, Python 3.10, a CUDA-compatible PyTorch install, enough GPU memory for a 4B model, calibrated cameras, and a real emergency stop if you will run hardware. Because the Hugging Face model card lists BF16/F32 tensors for a 4B model, treat 24GB VRAM as a practical starting point for serious experiments. If you run out of memory, reduce batch size, use BF16, reduce camera count, or run policy inference on a separate GPU machine.
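A quick back-of-the-envelope estimate helps explain the 24GB suggestion. The sketch below counts weights only, using a nominal 4 billion parameters as an assumption rather than the exact count from the model card; activations, image tokens, gradients, and optimizer state all come on top.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


params = 4e9  # nominal "4B"; assumed, not read from the checkpoint
for name, nbytes in [("BF16", 2), ("F32", 4)]:
    print(f"{name}: {weight_memory_gib(params, nbytes):.1f} GiB")
```

That works out to about 7.5 GiB in BF16 and 14.9 GiB in F32 before any training overhead, which is why BF16 and a small batch size are the first things to try when memory is tight.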
A Wall-X native setup looks like this:
conda create --name wallx python=3.10
conda activate wallx
pip install torch torchvision transformers
pip install huggingface_hub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -r requirements.txt
MAX_JOBS=4 pip install flash-attn==2.7.4.post1 --no-build-isolation
MAX_JOBS=4 pip install --no-build-isolation --verbose -e .
If you are already working inside a LeRobot source checkout, the LeRobot docs expose the extra as:
pip install -e ".[wallx]"
Then run a minimal import check:
python - <<'PY'
import torch
print(torch.__version__, torch.cuda.is_available())
import wall_x
print("wall_x import ok")
PY
Most early failures are dependency failures, not model failures. flash-attn is sensitive to CUDA and PyTorch ABI versions and may take time to compile. For first inference tests, start with eager or sdpa attention if available, then optimize with flash_attention_2 after the pipeline is correct. Do not connect a real robot until fake inference and open-loop evaluation are stable.
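A small helper can make the "eager first, flash-attn later" advice explicit in your own scripts. This is a sketch: the function name and the `prefer_fast` flag are hypothetical, and it only checks that a `flash_attn` package is importable, not that it was built against your CUDA and PyTorch versions.

```python
import importlib.util


def choose_attn_implementation(prefer_fast=False):
    """Return a conservative attention backend name unless a faster one is requested and installed."""
    if prefer_fast and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


print(choose_attn_implementation())                  # eager
print(choose_attn_implementation(prefer_fast=True))  # depends on the environment
```

The returned string can then be passed to whatever option your setup uses to select the attention implementation, such as the `--policy.attn_implementation` flag used in the training command in this guide.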
Dataset Preparation in LeRobot Format
Fine-tuning Wall-OSS-0.5 requires a dataset in LeRobot format. Each episode should include image observations, proprioception, action, and a language instruction. Camera names should be stable: for example, face_view for a static external camera and right_wrist_view for a wrist camera. If your robot has only one arm, you still need to map its action schema into the policy's expected representation: position delta, orientation representation, gripper scalar, and any joint-related fields required by your adapter.
Use this dataset checklist:
- Each episode has a clear natural-language instruction such as "pick up the red cup" or "place the block into the matching tray".
- Camera frames and action timestamps are synchronized.
- The dataset does not contain too much idle action; the Wall-OSS-0.5 report also filters idle and outlier actions.
- Training and validation are split by object, scene, or task variant if you want a real generalization test.
- Normalization statistics are saved and used consistently during both training and inference.
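The idle-action item in the checklist above can be checked with a few lines before training. This is a minimal sketch assuming relative (delta) actions stored as lists of floats; the function name and the threshold are hypothetical and should be tuned to your controller units.

```python
def idle_fraction(actions, threshold=1e-3):
    """Fraction of steps whose largest absolute action component is below threshold."""
    if not actions:
        return 0.0
    idle = sum(1 for a in actions if max(abs(x) for x in a) < threshold)
    return idle / len(actions)


# Toy episode: two moving steps and two near-zero steps.
episode = [
    [0.010, 0.000, -0.020],
    [0.0001, 0.0000, 0.0002],
    [0.000, 0.000, 0.000],
    [0.005, 0.015, 0.000],
]
print(idle_fraction(episode))  # 0.5
```

If the gripper command is stored as an absolute value rather than a delta, exclude that dimension before calling the function, otherwise a held-closed gripper will hide idle arm motion.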
If you are collecting data from scratch, begin with 50-100 high-quality episodes for one simple task such as cup grasping or block sorting. Wall-OSS-0.5 is a strong prior, but it does not make poor demonstrations disappear. Clean camera views, consistent instructions, low controller latency, and accurate gripper timing matter more than simply adding more noisy episodes.
Training with LeRobot
The LeRobot WALL-OSS documentation uses lerobot-train with policy.type=wall_x. A minimal fine-tuning command is:
lerobot-train \
--dataset.repo_id=your-username/robot_pick_place \
--policy.type=wall_x \
--output_dir=./outputs/wallx_pick_place \
--job_name=wallx_pick_place \
--policy.repo_id=your-username/wallx_pick_place \
--policy.pretrained_name_or_path=x-square-robot/wall-oss-flow \
--policy.prediction_mode=diffusion \
--policy.attn_implementation=eager \
--steps=3000 \
--policy.device=cuda \
--batch_size=32
The key arguments are:
- --dataset.repo_id: your LeRobot dataset on the Hugging Face Hub or a compatible local repo.
- --policy.type=wall_x: selects the Wall-X policy implementation.
- --policy.pretrained_name_or_path: the checkpoint used to initialize training, such as an official Wall-OSS flow checkpoint.
- --policy.prediction_mode=diffusion: the docs describe this as the iterative denoising/flow-style action generation path; some branches also expose a fast next-token prediction mode.
- --policy.attn_implementation=eager: a conservative debug setting. Once stable, test sdpa or flash_attention_2.
- --batch_size: reduce to 1, 2, 4, or 8 if memory is tight.
The native Wall-X repo also provides workspace/lerobot_example/run.sh and configuration files under workspace/lerobot_example. That path gives more control over GPU setup, model paths, robot DOF configuration, hyperparameters, and evaluation scripts. For most beginners, start with the LeRobot CLI because it is easier to audit, then move to the native workspace when you need deeper customization.
A useful training run should track at least three signals: training loss, open-loop validation error, and rollout videos or dry-run action traces. If loss decreases but action magnitude is wrong, check normalization before changing architecture. If the model understands the instruction but closes the gripper late, inspect latency and gripper command mapping. If object grounding is weak, add visual diversity, improve camera placement, or include a wrist view.
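The "loss decreases but action magnitude is wrong" case is easy to spot with a per-dimension open-loop report. The sketch below assumes predicted and recorded actions are already aligned step by step and expressed in the same units; the function name and report fields are hypothetical.

```python
def open_loop_report(predicted, recorded, eps=1e-9):
    """Per-dimension mean absolute error and predicted/recorded magnitude ratio."""
    if len(predicted) != len(recorded) or not recorded:
        raise ValueError("predicted and recorded must be non-empty and equally long")
    steps, dims = len(recorded), len(recorded[0])
    report = []
    for d in range(dims):
        mae = sum(abs(p[d] - r[d]) for p, r in zip(predicted, recorded)) / steps
        pred_mag = sum(abs(p[d]) for p in predicted) / steps
        rec_mag = sum(abs(r[d]) for r in recorded) / steps
        # A ratio far from 1 usually points at a normalization or unit mismatch.
        ratio = pred_mag / rec_mag if rec_mag > eps else None
        report.append({"dim": d, "mae": mae, "magnitude_ratio": ratio})
    return report


recorded = [[0.02, 0.00], [0.04, 0.01]]
predicted = [[0.20, 0.00], [0.40, 0.10]]  # same shape, roughly ten times too large
for row in open_loop_report(predicted, recorded):
    print(row)
```

A consistent ratio across all dimensions suggests a global scaling problem, while one outlier dimension more often indicates a mis-mapped field such as the gripper.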
Inference and Hardware Safety
The Hugging Face model card includes a fake inference example using checkpoint x-square-robot/wall-oss-0.5. It builds an adapter, creates fake observations with eef_pos, eef_axisangle, gripper, face_view, and wrist_view, then calls generate_flow_action. The Wall-X repository also provides:
python ./scripts/fake_inference.py
python ./scripts/draw_openloop_plot.py
python ./scripts/vqa_inference.py
Use this order before touching hardware:
- Fake inference: the model loads, dtype is correct, output is finite, and output shape is [horizon, action_dim].
- Open-loop evaluation: recorded observations produce action chunks that are plausible against demonstrations.
- VQA/COT test: the model can identify objects and reason about the next step in your camera view.
- Robot dry run: scale actions down to 10-20% or disable gripper actuation while checking direction and limits.
- Closed-loop rollout: only run full commands after emergency stop, workspace limits, speed limits, and collision limits are active.
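The first check in the list above (finite output with the expected shape) can be automated. The sketch below works on nested Python lists, so a tensor returned by the model would first be converted, for example with `.tolist()`. The function name is hypothetical, the horizon of 4 is a placeholder, and 26 is the standardized action dimension mentioned earlier; use the values from your own checkpoint configuration.

```python
import math


def check_action_chunk(chunk, horizon, action_dim):
    """Return a list of problems found in a [horizon, action_dim] action chunk."""
    problems = []
    if len(chunk) != horizon:
        problems.append(f"expected {horizon} steps, got {len(chunk)}")
    for t, row in enumerate(chunk):
        if len(row) != action_dim:
            problems.append(f"step {t}: expected {action_dim} values, got {len(row)}")
        if not all(math.isfinite(x) for x in row):
            problems.append(f"step {t}: non-finite value")
    return problems


good = [[0.0] * 26 for _ in range(4)]
bad = [row[:] for row in good]
bad[2][5] = float("nan")
print(check_action_chunk(good, horizon=4, action_dim=26))  # []
print(check_action_chunk(bad, horizon=4, action_dim=26))   # ['step 2: non-finite value']
```

Running this on every chunk during the dry-run stage is cheap and catches dtype or normalization failures before they reach the controller.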

The robot-side adapter usually looks like this:
camera frames + proprio + instruction
-> LeRobot processor / Wall-X adapter
-> normalized relative action chunk
-> denormalize
-> convert 6D rotation to controller orientation
-> apply workspace and velocity limits
-> send commands at a fixed control rate
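The "apply workspace and velocity limits" stage of that pipeline might look like the sketch below for the position part of a command. It is an illustration only: the function name, the 20% dry-run scale, the per-step cap, and the workspace box are hypothetical values, and it does not replace the controller's own safety limits or an emergency stop.

```python
import math


def limit_command(delta_xyz, current_xyz, ws_min, ws_max, max_step, scale=0.2):
    """Scale a position delta, cap its length, and clamp the target to a workspace box."""
    scaled = [scale * d for d in delta_xyz]
    norm = math.sqrt(sum(s * s for s in scaled))
    if norm > max_step:
        # Shrink the step so it never exceeds max_step (metres per command here).
        scaled = [s * max_step / norm for s in scaled]
    target = [c + s for c, s in zip(current_xyz, scaled)]
    return [min(max(x, lo), hi) for x, lo, hi in zip(target, ws_min, ws_max)]


ws_min, ws_max = [0.20, -0.30, 0.05], [0.50, 0.30, 0.50]
target = limit_command(
    delta_xyz=[0.10, 0.0, 0.0],
    current_xyz=[0.495, 0.0, 0.20],
    ws_min=ws_min,
    ws_max=ws_max,
    max_step=0.01,
)
print(target)  # x is clamped to the workspace edge at 0.5
```

Applying the clamp after denormalization and before sending commands keeps the limits in physical units, which makes them much easier to audit than limits expressed in normalized action space.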
The report states that the pretrained policy reaches 15 Hz inference at high input resolution. Your actual number will depend on GPU, camera count, attention backend, model dtype, and controller loop. If your robot controller runs at 50-100 Hz, use action chunking or hold-last-action between policy calls. For manipulation, timing stability is as important as the model's visual accuracy.
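To see what the rate mismatch means in practice, the sketch below computes how many control ticks pass per policy call and shows a tiny hold-last buffer. Names are hypothetical. One assumption matters: the buffer stores absolute targets, because re-sending a relative delta on every tick would keep moving the arm instead of holding it.

```python
import math


def ticks_per_policy_call(control_hz, policy_hz):
    """Control ticks that elapse before the next policy output can arrive."""
    return math.ceil(control_hz / policy_hz)


class ChunkPlayer:
    """Plays absolute targets from the latest chunk and holds the last one when it runs out."""

    def __init__(self):
        self._targets = []
        self._index = 0
        self._last = None

    def load(self, targets):
        """Replace the pending targets with a freshly predicted chunk."""
        self._targets = list(targets)
        self._index = 0

    def next_target(self):
        """Return the next target, or repeat the last one if the chunk is exhausted."""
        if self._index < len(self._targets):
            self._last = self._targets[self._index]
            self._index += 1
        return self._last


print(ticks_per_policy_call(50, 15), ticks_per_policy_call(100, 15))  # 4 7

player = ChunkPlayer()
player.load(["pose_a", "pose_b"])
print([player.next_target() for _ in range(4)])  # ['pose_a', 'pose_b', 'pose_b', 'pose_b']
```

In a real loop, `load` would be called from the inference thread and `next_target` from the fixed-rate control thread, with appropriate locking between them.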
Reported Results
The zero-shot suite contains 17 real-robot tasks: 12 seen tasks from within the pretraining distribution and 5 unseen tasks, meaning held-out task configurations whose data was not collected in the same form on the current embodiment. The report uses task progress instead of binary success rate: each task is scored up to 100 over 10 trajectories. That metric suits foundation policies better because a robot may complete most steps but fail the final insertion or placement, and a binary success rate would hide that progress.
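The difference between the two metrics is easy to see on made-up numbers. The sketch below assumes each trajectory is assigned a progress value between 0 and 1 and that a task score is the mean over trajectories times 100; the report's exact scoring rubric may differ, and the function names and values are hypothetical.

```python
def task_progress(progress_per_trajectory):
    """Mean progress over trajectories, on a 0-100 scale."""
    return 100.0 * sum(progress_per_trajectory) / len(progress_per_trajectory)


def binary_success_rate(progress_per_trajectory):
    """Percentage of trajectories that completed every step."""
    done = sum(1 for p in progress_per_trajectory if p >= 1.0)
    return 100.0 * done / len(progress_per_trajectory)


# Ten illustrative rollouts: most get far, few finish.
rollouts = [1.0, 0.75, 0.75, 0.5, 1.0, 0.75, 0.5, 0.75, 1.0, 0.75]
print(task_progress(rollouts))        # 77.5
print(binary_success_rate(rollouts))  # 30.0
```

The same ten rollouts look like a mostly working policy under task progress and like a mostly failing one under binary success, which is the gap the paragraph above describes.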
At the 400k checkpoint, Wall-OSS-0.5 reaches 51.1% average task progress over the 17 tasks, with 50.0% on seen tasks and 53.6% on unseen tasks. Six tasks reach at least 60% zero-shot task progress: Block Sorting at 100%, Fruit Sorting at 96%, Ring Stacking at 86%, Rope Tightening at 82%, Cup Grasping at 64%, and Bean Pouring at 60%. Rope Tightening is especially interesting because it is an unseen deformable manipulation task, making pure template memorization less likely.
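As a sanity check on how these numbers fit together, the overall average should equal the task-count-weighted mean of the seen and unseen averages. The short calculation below uses only the figures quoted above.

```python
seen_tasks, unseen_tasks = 12, 5
seen_avg, unseen_avg = 50.0, 53.6

# Weighted mean over all 17 tasks.
overall = (seen_tasks * seen_avg + unseen_tasks * unseen_avg) / (seen_tasks + unseen_tasks)
print(round(overall, 1))  # 51.1
```

The result matches the reported 51.1%, so the three figures are mutually consistent.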
The limitations are also clear. Towel Folding reaches 10%, Table Setting 9%, and Charger Plugging 9% at the same checkpoint. These tasks require deformable state tracking, long-horizon state memory, fine-grained insertion, or sub-centimeter alignment. If your target task is connector plugging, precise folding, or dexterous hand manipulation, treat Wall-OSS-0.5 as a strong prior, not a production policy.
After fine-tuning, Wall-OSS-0.5 reaches 60.5% average task progress on 15 real-robot tasks, outperforming π0.5 by 17.5 percentage points and DreamZero by about 27.1 points according to the report/model-card summary. The suite contains 10 manipulation tasks and 5 reasoning tasks, with all models fine-tuned under the same protocol and roughly 500 trajectories per task. Ablations support the training recipe: full co-training reaches 57.0% on five ablation tasks, compared with 36.6% for flow-only, 31.9% for stop-gradient, and 49.6% for stop-gradient-to-co-training. The Vision-Aligned RVQ tokenizer also improves average task progress from 29.3% to 48.1% over four real-robot tasks under matched co-training.
The practical reading is straightforward: Wall-OSS-0.5 does not remove the need for data collection, but it changes the starting point. Instead of training a small policy from scratch for every task, you begin with a model that already has embodied grounding, semantic manipulation priors, and a continuous action interface. Fine-tuning becomes adaptation to your robot, camera setup, objects, and task distribution rather than teaching manipulation from zero.
When to Use It
Wall-OSS-0.5 is worth trying if you have a real robot arm, want language-conditioned manipulation, have enough GPU memory, and already know how to prepare LeRobot datasets. It is especially promising for tasks with clear semantics: color sorting, object picking by instruction, placing into specified regions, simple pouring, and moderate scene/object generalization.
It is not the best first step if you only have a laptop without a GPU, have no calibrated cameras or emergency stop, or have a task requiring sub-centimeter precision. In those cases, begin with ACT or Diffusion Policy on a narrow task, debug the data and controller pipeline, and then move to Wall-OSS-0.5 once the robotics infrastructure is reliable.