What problem does Omega-QVLA solve?
Vision-Language-Action models (VLAs) are becoming a central recipe for robot manipulation. A policy receives camera observations, parses a natural-language instruction, and produces an action sequence for a robot arm or a bimanual platform. The hard part is deployment. Modern VLA policies often combine a multi-billion-parameter vision-language backbone with a diffusion transformer (DiT) action head. On real robots, especially robots running on edge hardware such as an onboard workstation, Jetson-class device, Orin module, or industrial computer, the question is not only "does the policy solve the task?" It is also "can the policy run with enough latency, memory headroom, and control smoothness to close the loop safely?"
Omega-QVLA is arXiv paper 2605.28803, published on May 27, 2026, under the full title "Omega-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling." The important claim is that the method does not merely compress the language backbone. It compresses both the backbone and the full DiT action head to W4A4: 4-bit weights and 4-bit activations. That is an aggressive target, because the action head directly generates continuous control signals. A small numerical perturbation can become jerky motion, a failed grasp, or accumulated error in long-horizon manipulation.
The project code is available at UCMP13753/Omega-QVLA. The README describes recipes for GR00T-N1.5 and pi0.5 on LIBERO: the LLM/backbone side uses DuQuant rotation mode svd_hadamard with GPTQ; the DiT action head also uses svd_hadamard, but with an RTN residual path and a per-step activation scale table. This guide is written as a practical beginner-friendly walkthrough: what the paper is doing, how the architecture works, how to prepare the environment, how to build quantized packs, how to run evaluation, and what to watch before moving the policy to edge hardware.
Why W4A4 is harder than W4A16
Many deployment workflows are comfortable with W4A16 or W8A8. W4A16 means the weights are stored at 4-bit precision, while activations still run at 16-bit precision. This gives a strong memory reduction, but it does not fully unlock low-bit compute paths. If you want a better chance of using low-bit integer kernels or tensor-core paths, activations must also be quantized. That is why Omega-QVLA targets W4A4.
The difficulty comes from two families of outliers:
| Component | Main outlier source | What goes wrong with naive quantization |
|---|---|---|
| LLM/VLM backbone | A small number of unusually large activation channels | Token representations drift and language grounding weakens |
| DiT action head | Activation range changes across denoising steps | Continuous actions drift, especially over long horizons |
| Robot control loop | Small action errors are physically amplified | End-effector motion becomes jerky or overshoots contact |
If you directly apply GPTQ, AWQ, SmoothQuant, or DuQuant in the same way you would for a chatbot model, the DiT action head is usually where things collapse. The paper reports that full-stack W4A16 GPTQ/AWQ/OmniQuant can fall sharply on pi0.5 when the whole stack is quantized, with some success rates around 10-16%. That is the core lesson: VLA quantization is not just LLM quantization with robot data attached.
The paper idea: rotate first, then scale by denoising step
Omega-QVLA has two key mechanisms:
- Composite SVD-Hadamard rotation to smooth channel energy before quantization.
- Per-step DiT activation scaling to handle dynamic-range drift across diffusion denoising steps.
A simplified view:
Camera + language command
|
v
VLA backbone / LLM / VLM
|
| SVD-Hadamard rotation + GPTQ pack
v
Fused visual-language tokens
|
v
DiT action head
|
| SVD-Hadamard rotation + RTN residual
| per-step activation scale table
v
Action chunk: joints / end-effector command
|
v
Robot controller on edge
SVD rotation mainly addresses the weights. In a linear layer, a weight matrix can have a few rows or channels that carry much larger energy than the rest. With 4-bit per-channel quantization, a high-energy channel stretches the scale, wasting resolution for normal values. SVD moves the layer into a basis where row-wise energy is more balanced.
SVD alone does not guarantee that activations become smooth. The rotation is derived from weights, while activation outliers depend on data. Omega-QVLA therefore composes SVD with a Hadamard rotation. A Hadamard transform is an orthogonal mixing operation that spreads dominant channel energy across many channels. In plain terms: SVD reduces weight-side concentration; Hadamard diffuses the remaining activation-side spikes.
The DiT side needs another treatment because activation statistics are not static. A diffusion action head produces actions through multiple denoising steps. Early-step activations can have very different magnitudes from late-step activations. If one static scale is used for every step, some steps clip while others waste bit range. Omega-QVLA therefore stores an act_scale_table: scales indexed by layer, denoising step, and channel. The table is built offline from calibration trajectories, and inference simply looks up the scale for the current step.
How the repository is organized
The README describes the current recipe as offline packs:
| Side | Rotation | Quantizer | Per-step |
|---|---|---|---|
| LLM/Eagle in GR00T | DuQuant svd_hadamard |
GPTQ | No |
| DiT action head | DuQuant svd_hadamard |
RTN residual | Yes, act_scale_table |
| pi0.5 Expert | svd_hadamard |
GPTQ pack + RTN residual | Yes |
| pi0.5 PaliGemma | runtime DuQuant svd_hadamard |
No separate GPTQ builder | Not like Expert |
This is post-training quantization, not policy fine-tuning. You need a small calibration buffer, but you do not need to backpropagate through the policy for many epochs. For GR00T-N1.5, the repo builds one pack for the LLM side and another pack for the DiT side, then merges them into a single quantized.pt file loaded at runtime through GptqLinear.
results/packs/object_LLM/quantized.pt
results/packs/object_DiT/quantized.pt
|
v
results/packs/object_MERGED/quantized.pt
|
v
run_groot_benchmark.sh -> wrapped quantized linear layers
Hardware and environment assumptions
Do not start on a Jetson if you do not already have a quantized pack. The project README calls for one A100 40 GB GPU to build packs, and 4-8 GPUs for parallel multi-suite evaluation. A pragmatic workflow is:
- Build the quantized pack on a cloud GPU or strong workstation.
- Run LIBERO evaluation to check the policy did not collapse.
- Export the pack and the runtime pieces you need.
- Optimize kernels and runtime loading for your edge target.
- Run shadow mode on the real robot before giving the policy direct control.
The base environment from the README is:
conda create -n omega_qvla python=3.10 -y
conda activate omega_qvla
# Inside the Omega-QVLA repository
pip install -e .
# LIBERO benchmark
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git $HOME/LIBERO
pip install -e $HOME/LIBERO
Set paths explicitly:
export QUANTVLA_ROOT=$HOME/Omega-QVLA
export QUANTVLA_CONDA_ENV=omega_qvla
export CONDA_ROOT=$HOME/miniconda3
export CHECKPOINTS_ROOT=$HOME/ckpts
export LIBERO_ROOT=$HOME/LIBERO
export LIBERO_CONFIG_PATH=$HOME/.libero
export QUANTVLA_CACHE_ROOT=$HOME/.cache/omega_qvla
export OPENPI_ROOT=$HOME/openpi
mkdir -p $QUANTVLA_CACHE_ROOT $LIBERO_CONFIG_PATH
The most relevant code locations are:
gr00t/quantization/
quant.py # entry point for enabling quantization from config
gptq_layers.py # GptqLinear, GPTQ solver, per-step support
duquant_layers.py # rotation + RTN runtime
rtn_layers.py # pure RTN
tools/
build_gptq_weights.py
build_dit_a2lite_svd_gptq_perstep.py
build_pi05_a2lite_gptq_perstep.py
merge_packs.py
scripts/
run_groot_benchmark.sh
run_pi05_libero_benchmark.sh
Running GR00T-N1.5: build the LLM pack
The following example uses the object suite. Other choices are goal, spatial, and long.
cd $QUANTVLA_ROOT
SUITE=object
CKPT=$CHECKPOINTS_ROOT/gr00t-n1.5-libero-${SUITE}-posttrain
case "$SUITE" in
goal)
TASK=libero_goal
DCFG=examples.Libero.custom_data_config:LiberoDataConfigMeanStd
;;
long)
TASK=libero_10
DCFG=examples.Libero.custom_data_config:LiberoDataConfig
;;
*)
TASK=libero_${SUITE}
DCFG=examples.Libero.custom_data_config:LiberoDataConfig
;;
esac
LLM_RE='.*backbone\.eagle_model\.language_model\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
EXCLUDE='(?:^|\.)(vision|radio|norm|ln|layernorm|embed|lm_head|timestep_encoder|state_encoder|action_encoder|action_decoder|pos_embed|vl_self_attention|vlln|future_tokens)(?:\.|$)'
Build the LLM pack:
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
python -m tools.build_gptq_weights \
--checkpoint "$CKPT" \
--task-suite-name "$TASK" \
--data-config "$DCFG" \
--output-path results/packs/${SUITE}_LLM/quantized.pt \
--include-regex "$LLM_RE" \
--exclude-regex "$EXCLUDE" \
--duquant-rotation \
--duquant-rot-mode svd_hadamard \
--weight-bits 4 \
--num-samples 10 \
--token-cap 1024 \
--gptq-block-size 128 \
--gptq-damp-percent 0.05
The small --num-samples 10 setting is a calibration buffer. For reproduction, start with the documented value. For a deployment project, test sensitivity with 10, 50, and 100 trajectories. Too narrow a calibration set can look good in one suite and fail after a camera angle, object distribution, or task horizon changes.
Build the DiT action head pack
The DiT builder automatically targets transformer_blocks.*.attn1 and ff.net, and always uses svd_hadamard. The important flags are --use-rtn and --num-steps 8.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
python -m tools.build_dit_a2lite_svd_gptq_perstep \
--checkpoint "$CKPT" \
--task-suite-name "$TASK" \
--data-config "$DCFG" \
--output-path results/packs/${SUITE}_DiT/quantized.pt \
--num-samples 10 \
--token-cap 1024 \
--num-steps 8 \
--svd-rank 0 \
--use-rtn \
--w-bits 4 \
--a-bits 4 \
--act-percentile 99.9 \
--duquant-block-size 64 \
--duquant-block-out 64 \
--gptq-block-size 128 \
--gptq-damp-percent 0.05
A common beginner mistake is to assume that a-bits 4 fully defines activation quantization inside the pack. More precisely, the pack stores the required metadata, while evaluation/runtime also sets ABITS=4 so the wrapped layers execute 4-bit activation quantization. For the DiT side, the per-step table is built offline and dispatched at runtime according to the denoising step.
Merge the pack and run evaluation
Merge the LLM and DiT packs:
python -m tools.merge_packs \
--out results/packs/${SUITE}_MERGED/quantized.pt \
results/packs/${SUITE}_LLM/quantized.pt \
results/packs/${SUITE}_DiT/quantized.pt
Define the DiT regex:
DIT_RE='.*action_head\.model\.transformer_blocks\.\d+\.(attn1\.(to_q|to_k|to_v|to_out\.0)|ff\.net\.(0\.proj|2)).*'
Run the GR00T benchmark:
env CONDA_ROOT=$CONDA_ROOT \
SUITE=$SUITE WBITS=4 ABITS=4 \
LLM_QUANT=gptq DIT_QUANT=gptq DIT_ATTN=1 DIT_PERSTEP=1 \
GR00T_GPTQ_PATH_OVERRIDE=$QUANTVLA_ROOT/results/packs/${SUITE}_MERGED/quantized.pt \
GR00T_GPTQ_INCLUDE_OVERRIDE="(${LLM_RE}|${DIT_RE})" \
GR00T_GPTQ_MISSING=fallback \
GPU_LIST=0,1,2,3 PORT_BASE=8000 NUM_TRIALS_PER_TASK=10 \
GR00T_EVAL_INIT_OFFSET=10 \
OUTPUT_ROOT=$QUANTVLA_ROOT/results/eval/${SUITE} \
bash $QUANTVLA_ROOT/scripts/run_groot_benchmark.sh
python -c "import json; print(round(100*json.load(open('results/eval/${SUITE}/merged_summary.json'))['total_success_rate'],1), '%')"
For the long suite, the README recommends GPU_LIST=0,1,2,3,4,5,6,7. You can reduce trials for a smoke test on fewer GPUs, but do not use that as a method-level claim. The README also notes that short-suite 50-episode success rate can move by ±5-10 percentage points with a single seed.
Running pi0.5: what changes?
pi0.5 combines a PaliGemma backbone with an action Expert. The repo currently does not provide a PaliGemma-side GPTQ builder or prefix-activation recorder, so PaliGemma runs as runtime DuQuant svd_hadamard. The Expert is built as a GPTQ pack with --use-rtn and per-step scaling.
SUITE=object
EXPERT_RE='.*paligemma_with_expert\.gemma_expert\.model\.layers\.[0-9]+\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
PALI_RE='.*paligemma_with_expert\.paligemma\.model\.language_model\.layers\.[0-9]+\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
$OPENPI_ROOT/.venv/bin/python -m tools.build_pi05_a2lite_gptq_perstep \
--checkpoint $CHECKPOINTS_ROOT/pi05_libero_pytorch \
--data-config pi05_libero \
--obs-path duquant_act_stats/pi05_libero_${SUITE}_obs.pt \
--output results/packs/pi05_${SUITE}_expert/quantized.pt \
--include-regex "$EXPERT_RE" \
--max-samples 10 \
--token-cap 512 \
--num-steps 10 \
--use-rtn \
--w-bits 4 \
--a-bits 4 \
--duquant-block-size 64 \
--duquant-block-out 64 \
--gptq-block-size 128 \
--gptq-damp-percent 0.05
Evaluate pi0.5:
env CONDA_ROOT=$CONDA_ROOT METHOD=hybrid SUITE=$SUITE WBITS=4 ABITS=4 \
GPU_LIST=0,1,2,3 PORT_BASE=8100 NUM_TRIALS_PER_TASK=10 \
GR00T_EVAL_INIT_OFFSET=10 \
OPENPI_ROOT=$OPENPI_ROOT OPENPI_PY=$OPENPI_ROOT/.venv/bin/python \
OPENPI_CONFIG=pi05_libero OPENPI_CHECKPOINT=$CHECKPOINTS_ROOT/pi05_libero_pytorch \
OPENPI_GPTQ_PATH=$QUANTVLA_ROOT/results/packs/pi05_${SUITE}_expert/quantized.pt \
OPENPI_GPTQ_INCLUDE="$EXPERT_RE" OPENPI_DUQUANT_INCLUDE="$PALI_RE" \
GR00T_DUQUANT_ROT_MODE=svd_hadamard \
OUTPUT_ROOT=$QUANTVLA_ROOT/results/eval/pi05_${SUITE} \
bash $QUANTVLA_ROOT/scripts/run_pi05_libero_benchmark.sh
The biggest pi0.5 pitfall is small-dimension heads. The README warns that state_proj, action_in_proj, action_out_proj, and time_mlp can collapse under A4, so they should stay outside the include regex. This is a practical deployment detail: not every small or sensitive layer should be forced into 4-bit just because the headline says W4A4.
Training, calibration, and inference are different stages
The word "training" is easy to misuse here. Omega-QVLA does not train a new manipulation skill from demonstrations. It performs PTQ:
| Stage | Policy backprop? | Input | Output |
|---|---|---|---|
| Original VLA fine-tuning | Yes | Robot trajectories | FP16 checkpoint |
| Omega-QVLA calibration | Not policy fine-tuning | A few trajectories or observation dumps | Scales, rotations, quantized weights |
| Pack build | No | FP16 checkpoint + calibration | quantized.pt |
| Inference/evaluation | No | Camera + command | Action chunk |
If you already have a GR00T-N1.5 post-trained checkpoint for LIBERO, Omega-QVLA starts there. If your target is a different real robot, you still need the original policy to work first. Quantization does not create new manipulation ability; it tries to preserve existing behavior at lower precision.
A simplified inference loop looks like this:
obs = get_camera_and_robot_state()
instruction = "put the red block into the bowl"
with quantized_runtime(pack="object_MERGED/quantized.pt"):
tokens = backbone.encode(obs.images, instruction) # W4A4 wrapped linear
action_chunk = dit_action_head.denoise(
tokens=tokens,
state=obs.robot_state,
per_step_scale=True,
)
controller.execute(action_chunk)
On edge hardware, measure at least three things: end-to-end latency, action jitter, and thermal throttling. LIBERO success rate is not enough. A policy that reports 98% in simulation but oscillates between 60 ms and 180 ms latency on a physical robot can still fail because the control loop is irregular.
Main results from the paper
On LIBERO, the paper reports:
| Model | FP16 reference | Omega-QVLA W4A4 | Note |
|---|---|---|---|
| pi0.5 | 97.1% | 98.0% | Slightly above FP16 average |
| GR00T-N1.5 | 87.0% | 87.8% | Slightly above FP16 average |
For static memory footprint:
| Model | FP16 | Omega-QVLA W4A4 | Saving |
|---|---|---|---|
| pi0.5 | 4.27 GB | 1.20 GB | 72.0% |
| GR00T-N1.5 | 1.99 GB | 586 MB | 71.3% |
The real-world experiment uses pi0.5 on a bimanual ARX R5 robot with five tasks: Pick Cup, Put Blocks, Put Fruit, Put Flowers, and Fold Towel. Omega-QVLA W4A4 reaches an average progress score of 51.0, slightly above the Pi-0.5 Base score of 49.6 and far above QuantVLA's 25.0. The qualitative result matters as much as the number: the paper describes QuantVLA as producing jerky end-effector trajectories, while Omega-QVLA produces smoother actions and tracks reference trajectories more closely in open-loop analysis.
The ablation result is also useful. SVD-Hadamard plus per-step scaling reaches 87.75 average in the ablation table, while SVD plus per-step reaches 79.25. Removing per-step scaling costs around 2 points overall and hurts the Long suite more. That matches the intuition: long-horizon tasks amplify activation drift across denoising and action steps.
Edge deployment checklist
Before letting a W4A4 policy drive a real robot, use a checklist like this:
[ ] The FP16 policy already solves the task in sim or on the real robot
[ ] The quantized pack was built for the correct suite/task
[ ] W4A4 evaluation ran enough trials, not only a smoke test
[ ] Action norm, jerk, latency, and dropped frames are logged
[ ] Open-loop actions are compared against the FP16 reference
[ ] Shadow mode is tested: policy predicts but does not control the robot yet
[ ] Safety controller, joint limits, and velocity limits are active
[ ] Short primitives are tested before long-horizon tasks
[ ] Edge-device temperature and clock throttling are monitored
For real robots, do not only track average success rate. Log operational signals:
policy_latency_ms_p50
policy_latency_ms_p95
action_delta_l2
eef_velocity
joint_velocity_max
gripper_command_switch_rate
camera_frame_drop_rate
controller_watchdog_reset_count
If W4A4 behavior is jerky on edge hardware, check three things first: whether your include regex accidentally quantized small sensitive heads; whether the calibration sample is too small or too narrow; and whether the per-step scale table is actually loaded in runtime. If latency is good but the task still fails, the problem may be domain gap, camera calibration, action normalization, or controller mapping rather than quantization itself.
References
- Paper: Omega-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling
- Code: UCMP13753/Omega-QVLA
- GR00T N1.5: NVIDIA Research project page
- Benchmark: LIBERO