Wall-OSS-0.5: 4B VLA for LeRobot

Wall-OSS-0.5 is an open-source 4B VLA for zero-shot real-robot manipulation, integrated with LeRobot and gradient-bridged co-training.

Nguyễn Anh Tuấn · June 5, 2026 · 14 min read

Why Wall-OSS-0.5 matters

Wall-OSS-0.5 is an open-source Vision-Language-Action (VLA) model from X Square Robot. The technical report asks a practical question that robotics teams often avoid: does VLA pretraining itself produce executable behavior on real robot hardware, or does it only provide a better initialization for task-specific fine-tuning?

That evaluation stance is the main reason the release is interesting. Many VLA results are reported after downstream fine-tuning on the target task, target robot, or a very similar data distribution. That is useful for deployment, but it makes foundation capability hard to isolate. Wall-OSS-0.5 reports both layers: zero-shot real-robot behavior before task-specific fine-tuning, and post-fine-tuning performance on real manipulation tasks.

According to the original paper, Wall-OSS-0.5 has more than 4B parameters. It starts from Qwen2.5-VL-3B-Instruct as the vision-language backbone, then adds action-generation capacity. It is pretrained across more than 20 embodiments, processes more than one million robot trajectories per epoch, and co-trains with a grounded multimodal corpus of roughly 90 million samples. The pretrained checkpoint shows non-trivial zero-shot behavior on a 17-task real-robot suite. After fine-tuning, it reaches 60.5 average task progress on 15 real-robot tasks and outperforms pi0.5 by 17.5 percentage points.

For the LeRobot community, the release matters for a second reason: it is not only a paper result. WALL-OSS is documented in the Hugging Face LeRobot ecosystem with policy.type=wall_x, the wall-x repository provides training and inference code, and checkpoints are available on Hugging Face. That makes Wall-OSS-0.5 a useful study case for beginners who want to understand modern VLA systems beyond the simplified phrase "image in, action out".

Quick summary

Item                        | Wall-OSS-0.5
----------------------------+---------------------------------------------------------------
Model type                  | Vision-Language-Action foundation model
Size                        | More than 4B parameters
Backbone                    | Qwen2.5-VL-3B-Instruct
Core architecture           | Mixture-of-Transformers with VL Expert and Action Expert
Training recipe             | Gradient-bridged co-training
Deployment action interface | Continuous flow matching
Pretraining data            | Self-collected manipulation, open-source multi-embodiment trajectories, 90M multimodal samples
LeRobot support             | Available through policy.type=wall_x
License                     | Apache 2.0 according to LeRobot/Hugging Face docs

The core idea of the paper

The paper is built around a direct operational question: if we pretrain a large VLA on many robots and many forms of data, can the checkpoint control a physical robot immediately?

To answer it, the authors avoid reporting only post-training numbers. They deploy the pretrained checkpoint directly as a real-robot policy and score task progress on 17 tasks. Task progress is not a simple binary success metric. Instead of asking only whether the robot fully completed the task, it gives partial credit: did the robot identify the right object, approach it, grasp it, move in the right direction, and complete the final placement? For early foundation policies, this is often more informative than success rate because it shows where the capability is forming and where it still breaks.
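To make the difference between task progress and binary success concrete, here is a minimal sketch of a milestone-based score. The milestone names and point values are hypothetical and chosen only for illustration; the paper's actual rubrics may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    name: str
    points: float  # partial credit awarded when the milestone is reached


# Hypothetical rubric for a pick-and-place task (not taken from the paper).
RUBRIC = [
    Milestone("identify_object", 10),
    Milestone("approach", 20),
    Milestone("grasp", 30),
    Milestone("transport", 20),
    Milestone("place", 20),
]


def task_progress(reached, rubric=RUBRIC):
    """Return a 0-100 score: the share of rubric points earned in one rollout."""
    total = sum(m.points for m in rubric)
    earned = sum(m.points for m in rubric if m.name in reached)
    return 100.0 * earned / total


def binary_success(reached, rubric=RUBRIC):
    """Binary success only counts rollouts that reach every milestone."""
    return all(m.name in reached for m in rubric)


rollout = {"identify_object", "approach", "grasp"}
print(task_progress(rollout))   # 60.0
print(binary_success(rollout))  # False
```

A rollout that grasps the object but never places it scores zero under binary success, while the progress score shows that the failure happens late in the task.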

The central technical idea is gradient-bridged co-training. The tension is simple. Robot actions are continuous control signals, while the original VLM backbone is trained through next-token prediction over language and visual tokens. If we attach a flow-matching action head to the VLM, the action head can learn, but the gradient that updates the backbone is relatively weak. If we quantize robot actions into discrete tokens and train them with cross-entropy, the gradient is much more compatible with the VLM training interface, but decoded discrete actions are usually too coarse for precise robot control.

Wall-OSS-0.5 uses both pathways:

  • Discrete action tokens provide a strong VLM-native learning signal.
  • Multimodal text/image prediction preserves grounding and instruction following.
  • Continuous flow matching generates the executable action chunks used at deployment time.

In short: the discrete action pathway mainly teaches the backbone to understand action; the continuous pathway mainly drives the real robot.
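For readers who have not met flow matching before, the continuous pathway can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the Wall-OSS-0.5 implementation: it assumes a linear path from noise at t=0 to the action chunk at t=1 (the paper may use a different time convention), a 30-step chunk length picked arbitrarily, and an oracle velocity function standing in for a trained Action Expert.

```python
import numpy as np

rng = np.random.default_rng(0)


def flow_matching_pair(action_chunk, noise, t):
    """Linear path between noise (t=0) and the action chunk (t=1).

    Returns the interpolated sample and the velocity a network would regress.
    """
    x_t = (1.0 - t) * noise + t * action_chunk
    target_velocity = action_chunk - noise
    return x_t, target_velocity


def sample_with_euler(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Toy chunk: 30 timesteps of a 26-dimensional action (chunk length is assumed).
action_chunk = rng.normal(size=(30, 26))
noise = rng.normal(size=(30, 26))


def oracle_velocity(x, t):
    """Stand-in for a trained network: the exact velocity of the linear path."""
    return (action_chunk - x) / (1.0 - t)


recovered = sample_with_euler(oracle_velocity, noise, num_steps=10)
print(np.allclose(recovered, action_chunk))  # True
```

During training, a network is fit to `target_velocity` at random values of `t`; at deployment, a small number of integration steps turns noise into an executable chunk.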

Architecture: turning a VLM into a VLA

Wall-OSS-0.5 starts from Qwen2.5-VL-3B-Instruct. That backbone already understands images and language: it can parse visual scenes, follow instructions, answer questions, and reason over objects. To turn it into a VLA, the authors extend it with a Mixture-of-Transformers (MoT) layout.

Here is the simplified view:

Camera images + instruction + proprioception
          |
          v
  VLM-style token sequence
          |
          v
+-------------------------------+
| Mixture-of-Transformers        |
|                               |
|  VL Expert                    |
|  - vision tokens              |
|  - language tokens            |
|  - proprioception tokens      |
|  - discrete action tokens     |
|                               |
|  Action Expert                |
|  - noisy continuous actions   |
|  - flow-matching denoising    |
+-------------------------------+
          |
          v
 Continuous action chunk for robot

The VL Expert preserves the VLM side of the model. It processes vision tokens, language tokens, proprioception tokens, and discrete action tokens. The Action Expert handles noisy continuous action tokens and learns the flow-matching action generator.

This is not a frozen-backbone design. The two experts share sequence-level attention context, so the Action Expert can attend to visual and language information while generating actions, and gradients can still flow end to end. At the same time, the paper uses attention masking so the discrete and continuous action tokens are mutually invisible during the forward pass. That keeps the two action pathways decoupled enough to train and evaluate separately:

  • Discrete pathway: used for action-token cross-entropy during training.
  • Continuous pathway: used for flow matching and deployment-time inference.
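The masking rule can be pictured with a small boolean attention mask. The sketch below encodes only the one rule stated in the text (discrete and continuous action tokens cannot see each other); the segment lengths and the two additional rules marked as assumptions are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

# Hypothetical segment lengths; real sequences are much longer.
N_CONTEXT, N_DISCRETE, N_CONTINUOUS = 4, 3, 2

segments = np.array(
    ["ctx"] * N_CONTEXT + ["disc"] * N_DISCRETE + ["cont"] * N_CONTINUOUS
)
n = len(segments)

is_ctx = segments == "ctx"
is_disc = segments == "disc"
is_cont = segments == "cont"

# mask[q, k] is True when query token q may attend to key token k.
mask = np.ones((n, n), dtype=bool)

# Rule from the text: the two action pathways are mutually invisible.
mask[np.ix_(is_disc, is_cont)] = False
mask[np.ix_(is_cont, is_disc)] = False

# Assumption for this sketch: context tokens do not attend to action tokens,
# so both pathways read the same vision/language/proprioception context.
mask[np.ix_(is_ctx, ~is_ctx)] = False

# Assumption for this sketch: discrete action tokens are predicted
# autoregressively, so attention inside that segment is causal.
disc_idx = np.flatnonzero(is_disc)
for i, q in enumerate(disc_idx):
    mask[q, disc_idx[i + 1:]] = False

print(mask.astype(int))
```

Because neither pathway can read the other's tokens, the continuous pathway can be run at deployment without generating any discrete action tokens.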

The action representation is also carefully chosen. The model uses relative actions and 6D rotations instead of Euler angles or quaternions to avoid discontinuities. The paper describes a 26-dimensional action space: each arm uses relative 3D position, relative 6D rotation, and one gripper value; additional dimensions cover mobile base velocity, lift height, and head actuation. Both action pathways predict about a one-second horizon, with the frame count adjusted to the control frequency of each data source.
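A 6D rotation is commonly defined as the first two columns of a rotation matrix, recovered with Gram-Schmidt orthonormalization. The sketch below assumes that standard construction; the helper names are hypothetical, and the exact ordering of dimensions inside the 26-D vector is not specified here.

```python
import numpy as np


def rotation_6d_to_matrix(r6):
    """Map a 6D rotation (two 3D vectors) to a 3x3 rotation matrix via Gram-Schmidt."""
    r6 = np.asarray(r6, dtype=float)
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1  # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)               # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)  # columns are the basis vectors


def matrix_to_rotation_6d(R):
    """Keep the first two columns of the rotation matrix."""
    return np.concatenate([R[:, 0], R[:, 1]])


# Dimension count from the text: each arm uses 3 (position) + 6 (rotation)
# + 1 (gripper); two arms give 20, leaving 6 for base, lift, and head.
ARM_DIMS = 3 + 6 + 1
assert 2 * ARM_DIMS + 6 == 26

R = rotation_6d_to_matrix([1.0, 0.2, 0.0, 0.1, 1.0, 0.3])
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True
print(np.allclose(rotation_6d_to_matrix(matrix_to_rotation_6d(R)), R))     # True
```

Any pair of non-parallel 3D vectors maps to a valid rotation, so a network can regress the six numbers freely without hitting the wrap-around of Euler angles or the sign ambiguity of quaternions.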

Vision-Aligned RVQ Action Tokenizer

One of the most important components is the Vision-Aligned Residual Vector Quantization (RVQ) Action Tokenizer. It replaces a more rule-based FAST-style tokenizer.

For a beginner, RVQ can be understood as a multi-level codebook. The early levels capture coarse motion structure, and the later levels capture finer residual corrections. But in robotics, a tokenizer should not merely compress motor deltas. A useful action token should also carry semantic and physical meaning: what object is affected, what visual change should happen, and what future state the motion implies.

That is why the tokenizer is trained with several objectives:

Objective                 | Role
--------------------------+----------------------------------------------------------
Reconstruction            | Keeps action chunks recoverable
Visual-action alignment   | Pulls action latents toward corresponding visual features
Next-frame prediction     | Encourages tokens to encode action consequences
DCT-domain reconstruction | Suppresses high-frequency trajectory jitter

The resulting token is more than a motor code. It becomes a semantic training interface between robot motion and the VLM backbone. In the tokenizer ablation, the Vision-Aligned RVQ tokenizer improves average task progress from 29.3 to 48.1 on four real-robot tasks, while VQA accuracy slightly rises from 75.7 to 77.5. The key detail is that real-robot evaluation still uses continuous flow actions, so the tokenizer improvement is not confined to the discrete pathway. Better discrete action representations improve the continuous pathway through co-training.
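The multi-level codebook idea behind RVQ can be shown with a toy encoder and decoder. This sketch covers only the residual quantization mechanics; it uses random codebooks rather than learned ones, omits the vision alignment and other training objectives, and places a zero vector in every codebook (an assumption of this example) so that adding a level can never increase the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)


def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the previous level's residual."""
    residual = np.array(x, dtype=float)
    codes = []
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected code vectors from every level."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


def make_codebook(scale, num_codes=16, dim=8):
    """Random toy codebook; entry 0 is the zero vector (a 'no correction' code)."""
    codebook = rng.normal(scale=scale, size=(num_codes, dim))
    codebook[0] = 0.0
    return codebook


# Three levels with shrinking scale, so later levels act as finer corrections.
codebooks = [make_codebook(scale=1.0 / (4 ** level)) for level in range(3)]
latent = rng.normal(size=8)

codes = rvq_encode(latent, codebooks)
for depth in range(1, len(codebooks) + 1):
    approx = rvq_decode(codes[:depth], codebooks[:depth])
    error = float(np.linalg.norm(latent - approx))
    print(f"levels used: {depth}, reconstruction error: {error:.3f}")
```

Each action latent becomes a short tuple of integer codes, one per level, which is the form a VLM backbone can predict with ordinary cross-entropy.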

Training recipe: three losses in one stage

Wall-OSS-0.5 is trained in a single stage with a composite objective:

L_total = L_flow + 0.01 * L_act_CE + 0.01 * L_mm_CE

The terms are:

  • L_flow: flow-matching loss for continuous action generation.
  • L_act_CE: cross-entropy loss for autoregressive RVQ action-token prediction.
  • L_mm_CE: cross-entropy loss for multimodal text/image prediction.

The 0.01 weights are not arbitrary. The paper explains that the flow loss is roughly two orders of magnitude smaller than the cross-entropy terms under action-space supervision. Without scaling, language-style cross-entropy could dominate action learning. The authors also mix action and multimodal data at a 9:1 batch ratio, keeping action data central while using multimodal data as an anchor for grounded understanding.
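The effect of the 0.01 weights is easy to check with arithmetic. In the sketch below, the loss magnitudes are made-up values that only respect the "two orders of magnitude" relationship described above, and `split_batch` is one possible reading of the 9:1 ratio (a per-batch split), not a confirmed detail of the training code.

```python
def total_loss(l_flow, l_act_ce, l_mm_ce, w_act=0.01, w_mm=0.01):
    """Composite objective from the text: L_flow + 0.01 * L_act_CE + 0.01 * L_mm_CE."""
    return l_flow + w_act * l_act_ce + w_mm * l_mm_ce


# Illustrative magnitudes only (not values from the paper): the flow loss is
# about two orders of magnitude smaller than the cross-entropy terms.
l_flow, l_act_ce, l_mm_ce = 0.03, 3.0, 2.5

unweighted = total_loss(l_flow, l_act_ce, l_mm_ce, w_act=1.0, w_mm=1.0)
weighted = total_loss(l_flow, l_act_ce, l_mm_ce)

print(f"flow share of the objective, unweighted: {l_flow / unweighted:.1%}")
print(f"flow share of the objective, weighted:   {l_flow / weighted:.1%}")


def split_batch(batch_size, action_parts=9, multimodal_parts=1):
    """Split one batch by an action-to-multimodal ratio (9:1 by default)."""
    n_multimodal = round(batch_size * multimodal_parts / (action_parts + multimodal_parts))
    return batch_size - n_multimodal, n_multimodal


print(split_batch(320))  # (288, 32)
```

With equal weights the flow term is a tiny fraction of the objective; with the 0.01 scaling the three terms land in the same range, which is the balance the paper is aiming for.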

The pretraining data has three major parts:

  1. Self-collected manipulation data: tabletop bimanual systems, mobile manipulators, and XRZero-G0, an embodiment-free collection device that expands scene and task diversity.
  2. Curated open-source multi-embodiment data: sources include RoboMIND, AgiBotWorld, DROID, Bridge v2, Fractal/Google Robot, and other robotics datasets that are normalized into a shared schema.
  3. A multimodal corpus of roughly 90M samples: 78M open-source multimodal samples plus 12M embodied bridge samples built from robot trajectories.

The preprocessing pipeline has to solve practical robotics problems: inconsistent action definitions, coordinate frame conventions, gripper polarity, camera timestamp alignment, dataset imbalance, stationary frames, and long-tail task distribution. The paper uses square-root sampling with p = 0.5 so large data sources do not dominate every epoch. After sampling, each epoch contains more than one million trajectories, approximately 60% self-collected and 40% open-source.
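Square-root sampling draws each data source with probability proportional to its size raised to the power p = 0.5. The trajectory counts below are hypothetical and chosen only to show how the exponent flattens the distribution.

```python
def sampling_weights(source_sizes, p=0.5):
    """Sample each source with probability proportional to size ** p."""
    scaled = {name: size ** p for name, size in source_sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical trajectory counts (not figures from the paper).
sizes = {"large_source": 1_000_000, "medium_source": 40_000, "small_source": 10_000}

proportional = sampling_weights(sizes, p=1.0)
square_root = sampling_weights(sizes, p=0.5)

for name in sizes:
    print(f"{name}: {proportional[name]:.3f} -> {square_root[name]:.3f}")
```

In this toy example the largest source drops from about 95% of samples to about 77%, while the smallest rises from about 1% to about 8%.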

Installing with LeRobot and Wall-X

There are two practical ways to explore Wall-OSS-0.5: use the LeRobot integration or work directly with the Wall-X repository. For most users, starting from LeRobot is easier because the training interface is familiar.

Realistic requirements:

  • Linux, preferably Ubuntu 22.04.
  • Python 3.10 for the Wall-X repository.
  • An NVIDIA GPU with CUDA; fine-tuning a 4B VLA requires much more memory than ACT or small VLA tutorials.
  • bf16 support is strongly recommended.
  • A real robot setup with camera calibration, reliable low-level control, action limits, and an emergency stop.

Install the LeRobot source tree and Wall-X extras:

conda create --name wallx python=3.10
conda activate wallx

pip install torch torchvision transformers
pip install huggingface_hub

# From a LeRobot source checkout:
pip install -e ".[wallx]"

If you follow the Wall-X repository directly, the README asks for requirements.txt, flash-attn==2.7.4.post1, a specific LeRobot commit, and editable installation of the package. Do not treat those pins as cosmetic. With large VLAs, a mismatch in CUDA, FlashAttention, Transformers, or dtype can cause slow inference, checkpoint loading failures, or subtle numerical errors.

In LeRobot, the policy type is:

policy.type=wall_x

A minimal training command from the LeRobot documentation looks like this:

lerobot-train \
  --dataset.repo_id=your_dataset \
  --policy.type=wall_x \
  --output_dir=./outputs/wallx_training \
  --job_name=wallx_training \
  --policy.repo_id=your_repo_id \
  --policy.pretrained_name_or_path=x-square-robot/wall-oss-flow \
  --policy.prediction_mode=diffusion \
  --policy.attn_implementation=eager \
  --steps=3000 \
  --policy.device=cuda \
  --batch_size=32

In a real project, you will likely change batch_size, gradient accumulation, camera names, action dimension, normalization key, robot DOF, and dataset schema. If your demonstrations are not already in LeRobot format, the first milestone is not model training. It is dataset normalization:

episode/
  observation.images.front
  observation.images.wrist
  observation.state
  action
  task
  timestamp
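Before converting anything, it can help to run a cheap structural check over raw episodes. The sketch below assumes each episode is available as a list of per-frame dictionaries keyed by the field names listed above; that in-memory layout and the `validate_episode` helper are illustrative and are not part of the LeRobot API.

```python
REQUIRED_KEYS = {
    "observation.images.front",
    "observation.images.wrist",
    "observation.state",
    "action",
    "task",
    "timestamp",
}


def validate_episode(frames, required_keys=REQUIRED_KEYS):
    """Return a list of human-readable problems found in one episode."""
    problems = []
    previous_time = None
    for index, frame in enumerate(frames):
        missing = required_keys - set(frame)
        if missing:
            problems.append(f"frame {index}: missing {sorted(missing)}")
        timestamp = frame.get("timestamp")
        if timestamp is None:
            continue
        if previous_time is not None and timestamp <= previous_time:
            problems.append(f"frame {index}: timestamp not increasing")
        previous_time = timestamp
    return problems


def make_frame(timestamp):
    """Build a placeholder frame; real frames hold images and arrays."""
    frame = {key: None for key in REQUIRED_KEYS}
    frame["timestamp"] = timestamp
    return frame


good_episode = [make_frame(0.0), make_frame(0.033), make_frame(0.066)]
bad_episode = [make_frame(0.0), {"action": None, "timestamp": 0.0}]

print(validate_episode(good_episode))  # []
for problem in validate_episode(bad_episode):
    print(problem)
```

Catching missing camera keys or non-monotonic timestamps at this stage is much cheaper than discovering them after a failed training run.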

Fine-tuning on your robot data

Fine-tuning Wall-OSS-0.5 is not like fine-tuning an image classifier. The data has to preserve robot semantics:

Field    | What to verify
---------+-----------------------------------------------------
Cameras  | Fixed views, synchronized timestamps, stable exposure
State    | Consistent end-effector pose or joint state
Action   | Relative action, gripper convention, control rate
Language | Goal and step instructions do not conflict
Episode  | Idle frames are not overrepresented
Safety   | Action bounds, collision zones, emergency stop

Start with a small run:

lerobot-train \
  --dataset.repo_id=my-org/my-wallx-task \
  --policy.type=wall_x \
  --policy.pretrained_name_or_path=x-square-robot/wall-oss-0.5 \
  --output_dir=./outputs/wallx_pick_place \
  --job_name=wallx_pick_place_sft \
  --steps=1000 \
  --batch_size=4 \
  --policy.device=cuda \
  --policy.attn_implementation=sdpa

Once the pipeline is stable, increase steps, batch size, and task diversity. For a real robot, evaluate in layers:

  1. Offline sanity check: the model loads and returns the expected action shape.
  2. Open-loop replay: action chunks have plausible scale and do not saturate.
  3. Low-speed closed-loop rollout: constrain velocity and workspace.
  4. Full-speed evaluation: run multiple episodes and log both success and partial progress.
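The first two layers can be automated with a small check on each predicted chunk. This is a sketch under assumptions: the `check_action_chunk` helper, the per-dimension limits, and the 0.98 saturation threshold are all hypothetical and should be replaced with values from your own robot.

```python
import numpy as np


def check_action_chunk(chunk, action_dim, low, high, saturation_tol=0.98):
    """Offline checks: expected shape, finite values, and fraction of saturated entries."""
    chunk = np.asarray(chunk, dtype=float)
    report = {
        "shape_ok": chunk.ndim == 2 and chunk.shape[1] == action_dim,
        "finite": bool(np.isfinite(chunk).all()),
    }
    if report["shape_ok"] and report["finite"]:
        # Rescale each dimension so that its limits map to -1 and +1.
        center = (high + low) / 2.0
        half_range = (high - low) / 2.0
        normalized = (chunk - center) / half_range
        saturated = np.abs(normalized) >= saturation_tol
        report["saturated_fraction"] = float(saturated.mean())
    return report


# Hypothetical limits: every dimension bounded to +/- 0.05 per step.
ACTION_DIM = 26
low = np.full(ACTION_DIM, -0.05)
high = np.full(ACTION_DIM, 0.05)

chunk = np.zeros((30, ACTION_DIM))
chunk[0, 0] = 0.05  # one entry sitting exactly at its limit

print(check_action_chunk(chunk, ACTION_DIM, low, high))
```

A high saturated fraction during open-loop replay usually points to a normalization or unit mismatch rather than a model problem, so it is worth resolving before any closed-loop rollout.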

Inference: continuous action chunks

The Hugging Face model card includes a fake-input inference example. It loads x-square-robot/wall-oss-0.5, constructs an observation with end-effector position, axis-angle, gripper state, and 448x448 camera images, then calls generate_flow_action to produce predict_action with shape [horizon, action_dim].

Conceptually, deployment looks like this:

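def run_policy(wall_oss, read_cameras, read_robot_state, send_action,
               robot_is_running, instruction, prefix_len=8):
    """Conceptual receding-horizon loop. Every argument is a placeholder for
    your own stack, and the keyword names passed to the model are illustrative."""
    while robot_is_running():
        images = read_cameras()
        proprio = read_robot_state()

        # Returns a chunk of shape [horizon, action_dim] covering about one second.
        action_chunk = wall_oss.generate_flow_action(
            images=images,
            proprio=proprio,
            instruction=instruction,
        )

        # Execute only the first few actions, then replan from a fresh observation.
        for action in action_chunk[:prefix_len]:
            send_action(action)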

You generally should not execute the whole chunk blindly if the scene can change. A safer manipulation loop uses receding horizon control: generate a chunk, execute a short prefix, read fresh observations, then generate again. The paper emphasizes inference optimization because real-time control is highly sensitive to latency. On an RTX 5090, the optimized stack reaches about 21 Hz with 224x224 input and about 15 Hz with 448x448 input using 10 denoising steps, roughly a 4x end-to-end speedup over a PyTorch eager baseline.
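The rates quoted above translate directly into a latency budget for the control loop. The arithmetic below assumes that the reported Hz figures mean action chunks generated per second, and it uses a 30 Hz low-level control rate purely as an example.

```python
import math


def inference_budget(policy_hz, control_hz):
    """Return per-chunk latency in seconds and the control ticks that pass meanwhile."""
    latency_s = 1.0 / policy_hz
    ticks = math.ceil(control_hz / policy_hz)
    return latency_s, ticks


CONTROL_HZ = 30  # assumption for illustration; use your robot's control rate

for label, policy_hz in [("224x224", 21), ("448x448", 15)]:
    latency_s, ticks = inference_budget(policy_hz, CONTROL_HZ)
    print(f"{label}: {latency_s * 1000:.1f} ms per chunk, {ticks} control ticks")
```

Under these assumptions each new chunk arrives about two control ticks after the observation it was computed from, so the executed prefix should be at least that long and the first actions of a fresh chunk are already slightly stale.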

Main results

Zero-shot real-robot behavior

The zero-shot suite contains 17 tasks: 12 seen tasks from the pretraining distribution and 5 unseen task configurations not collected identically on the current embodiment. Each task is evaluated over 10 trajectories and scored with task progress up to 100.

Task (400k checkpoint) | Category                | Seen/Unseen | Task progress
-----------------------+-------------------------+-------------+--------------
Block Sorting          | Semantic understanding  | Seen        | 100
Fruit Sorting          | Semantic understanding  | Seen        | 96
Ring Stacking          | Rigid manipulation      | Seen        | 86
Rope Tightening        | Deformable manipulation | Unseen      | 82
Cup Grasping           | Rigid manipulation      | Seen        | 64
Bean Pouring           | Deformable manipulation | Unseen      | 60

Average progress also rises during pretraining. At the 50k checkpoint, the overall average is 25.5. At 400k, it is 51.1. Seen tasks increase from 26.1 to 50.0, while unseen tasks increase from 24.2 to 53.6. The paper notes that seen and unseen tasks are not difficulty-matched, so the trend is more meaningful than the final ordering.
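The overall averages are consistent with a per-task mean over the 12 seen and 5 unseen tasks. The check below assumes that weighting, which the text does not state explicitly.

```python
def pooled_average(seen_avg, unseen_avg, n_seen=12, n_unseen=5):
    """Task-count-weighted mean of the seen and unseen averages."""
    return (n_seen * seen_avg + n_unseen * unseen_avg) / (n_seen + n_unseen)


print(round(pooled_average(26.1, 24.2), 1))  # 25.5 (50k checkpoint)
print(round(pooled_average(50.0, 53.6), 1))  # 51.1 (400k checkpoint)
```

Both values match the reported overall averages, so the seen, unseen, and overall numbers can be read as one consistent set.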

The hard zero-shot cases are still very hard: Towel Folding, Table Setting, and Charger Plugging score below 20 at the 400k checkpoint. These tasks require deformable handling, insertion precision, long-horizon sequencing, or fine contact control. That is the right interpretation of the result: Wall-OSS-0.5 does not make robots generally competent household workers out of the box. It shows that VLA pretraining can produce directly measurable real-robot behavior before fine-tuning.

Real-robot fine-tuning

After fine-tuning, Wall-OSS-0.5 is compared against pi0.5 and DreamZero on 15 real-robot tasks, with approximately 500 demonstration trajectories per task under the same protocol.

Model        | Manipulation (10) | Reasoning (5) | Overall (15)
-------------+-------------------+---------------+-------------
Wall-OSS-0.5 | 61.1              | 59.3          | 60.5
pi0.5        | 35.0              | 58.9          | 43.0
DreamZero    | 33.7              | 32.7          | 33.4

Wall-OSS-0.5 leads overall and is especially strong on the manipulation subset. Examples include Color Block Sorting at 96 versus 42 for pi0.5, Ring Stacking at 91 versus 60, Drawer Organization at 52 versus 7, and Spoon-in-Bowl at 80 versus 43. pi0.5 still wins on some tasks, including Glasses Rack Placement, Fruit Basket Placement, and Object Matching. That nuance matters: Wall-OSS-0.5 is stronger on average, but it is not universally better on every task type.

Limitations to remember

The paper is clear about several limitations. The backbone is still at the 3B VLM scale, so the result does not show how the recipe behaves with much larger backbones. The model mainly uses single-frame inputs, which limits temporal memory for tasks that require a long observation history. The 26D action space is a concrete engineering choice, not a universal representation for every robot. Task progress is more informative than binary success, but it still depends on manually designed rubrics.

For deployment teams, the biggest constraints are hardware and safety. An open-source VLA checkpoint does not replace calibration, low-level controllers, force limits, collision checks, and supervised evaluation. Treat Wall-OSS-0.5 as a foundation policy for research and fine-tuning, not as a drop-in production controller.

Takeaway

Wall-OSS-0.5 is important because it pushes VLA evaluation toward a more honest standard: report pretrained zero-shot behavior on real robots, release weights, expose the training/inference stack, and integrate with open tooling such as LeRobot. Technically, the most reusable idea is gradient-bridged co-training: use discrete action cross-entropy to teach the backbone about action, use multimodal cross-entropy to preserve grounded VLM capability, and use continuous flow matching for deployment.

If you are new to VLA models, keep this mental model: a VLA is not simply an LLM with a robot head attached. A capable VLA needs a bridge between semantic reasoning and continuous control. Wall-OSS-0.5 is a concrete example of how that bridge can be designed, trained, measured, and opened to the robotics community.

