Omega-QVLA: W4A4 VLA on Edge

A practical guide to using Omega-QVLA to compress GR00T N1.5 and pi0.5 to W4A4 for VLA manipulation on edge hardware.

Nguyễn Anh Tuấn | June 10, 2026 | 13 min read

What problem does Omega-QVLA solve?

Vision-Language-Action models (VLAs) are becoming a central recipe for robot manipulation. A policy receives camera observations, parses a natural-language instruction, and produces an action sequence for a robot arm or a bimanual platform. The hard part is deployment. Modern VLA policies often combine a multi-billion-parameter vision-language backbone with a diffusion transformer (DiT) action head. On real robots, especially robots running on edge hardware such as an onboard workstation, Jetson-class device, Orin module, or industrial computer, the question is not only "does the policy solve the task?" It is also "can the policy run with enough latency, memory headroom, and control smoothness to close the loop safely?"

Omega-QVLA is arXiv paper 2605.28803, published on May 27, 2026, under the full title "Omega-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling." The important claim is that the method does not merely compress the language backbone. It compresses both the backbone and the full DiT action head to W4A4: 4-bit weights and 4-bit activations. That is an aggressive target, because the action head directly generates continuous control signals. A small numerical perturbation can become jerky motion, a failed grasp, or accumulated error in long-horizon manipulation.

The project code is available at UCMP13753/Omega-QVLA. The README describes recipes for GR00T-N1.5 and pi0.5 on LIBERO: the LLM/backbone side uses DuQuant rotation mode svd_hadamard with GPTQ; the DiT action head also uses svd_hadamard, but with an RTN residual path and a per-step activation scale table. This guide is written as a practical beginner-friendly walkthrough: what the paper is doing, how the architecture works, how to prepare the environment, how to build quantized packs, how to run evaluation, and what to watch before moving the policy to edge hardware.

Why W4A4 is harder than W4A16

Many deployment workflows are comfortable with W4A16 or W8A8. W4A16 means the weights are stored at 4-bit precision, while activations still run at 16-bit precision. This gives a strong memory reduction, but it does not fully unlock low-bit compute paths. If you want a better chance of using low-bit integer kernels or tensor-core paths, activations must also be quantized. That is why Omega-QVLA targets W4A4.

The difficulty comes from two families of outliers:

| Component | Main outlier source | What goes wrong with naive quantization |
|---|---|---|
| LLM/VLM backbone | A small number of unusually large activation channels | Token representations drift and language grounding weakens |
| DiT action head | Activation range changes across denoising steps | Continuous actions drift, especially over long horizons |
| Robot control loop | Small action errors are physically amplified | End-effector motion becomes jerky or overshoots contact |

If you directly apply GPTQ, AWQ, SmoothQuant, or DuQuant in the same way you would for a chatbot model, the DiT action head is usually where things collapse. The paper reports that full-stack W4A16 GPTQ/AWQ/OmniQuant can fall sharply on pi0.5 when the whole stack is quantized, with some success rates around 10-16%. That is the core lesson: VLA quantization is not just LLM quantization with robot data attached.

The paper idea: rotate first, then scale by denoising step

Omega-QVLA has two key mechanisms:

  1. Composite SVD-Hadamard rotation to smooth channel energy before quantization.
  2. Per-step DiT activation scaling to handle dynamic-range drift across diffusion denoising steps.

A simplified view:

Camera + language command
        |
        v
VLA backbone / LLM / VLM
        |
        |  SVD-Hadamard rotation + GPTQ pack
        v
Fused visual-language tokens
        |
        v
DiT action head
        |
        |  SVD-Hadamard rotation + RTN residual
        |  per-step activation scale table
        v
Action chunk: joints / end-effector command
        |
        v
Robot controller on edge

SVD rotation mainly addresses the weights. In a linear layer, a weight matrix can have a few rows or channels that carry much larger energy than the rest. With 4-bit per-channel quantization, a high-energy channel stretches the scale, wasting resolution for normal values. SVD moves the layer into a basis where row-wise energy is more balanced.
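
To make the scale-stretching effect concrete, here is a minimal sketch in plain Python. It is not code from the Omega-QVLA repository: quantize_symmetric, mean_abs_error, and the toy vectors are assumptions made for illustration, and the example uses one shared scale for a small group of values.

```python
def quantize_symmetric(values, bits=4):
    """Round-to-nearest symmetric quantization with one shared scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax      # the largest value sets the scale
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes], scale


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Toy data (hypothetical): eight ordinary values, then the same group with one outlier.
balanced = [0.9, -0.7, 0.5, -0.8, 0.6, -0.4, 0.3, -1.0]
with_outlier = balanced[:-1] + [16.0]

deq_bal, scale_bal = quantize_symmetric(balanced)
deq_out, scale_out = quantize_symmetric(with_outlier)

# Compare the error on the seven values that both groups share.
err_bal = mean_abs_error(balanced[:-1], deq_bal[:-1])
err_out = mean_abs_error(with_outlier[:-1], deq_out[:-1])

print(f"scale without outlier: {scale_bal:.3f}, error on ordinary values: {err_bal:.3f}")
print(f"scale with outlier:    {scale_out:.3f}, error on ordinary values: {err_out:.3f}")
```

With the outlier present, the shared scale grows so much that all seven ordinary values round to zero. Balancing energy before quantization is meant to avoid exactly this loss of resolution.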

SVD alone does not guarantee that activations become smooth. The rotation is derived from weights, while activation outliers depend on data. Omega-QVLA therefore composes SVD with a Hadamard rotation. A Hadamard transform is an orthogonal mixing operation that spreads dominant channel energy across many channels. In plain terms: SVD reduces weight-side concentration; Hadamard diffuses the remaining activation-side spikes.
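
The energy-spreading property of a Hadamard rotation can be checked on a toy vector. The sketch below is an illustration in plain Python, not the DuQuant or Omega-QVLA implementation: the hadamard and matvec helpers and the example vectors are assumptions. It relies only on the standard fact that an orthogonal rotation applied to both operands leaves their dot product unchanged.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1.0]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = math.sqrt(n)
    return [[v / norm for v in row] for row in h]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


H = hadamard(8)
x = [0.1, -0.2, 0.1, 12.0, 0.0, 0.2, -0.1, 0.1]   # hypothetical activation with one spike
w = [0.5, -0.3, 0.2, 0.1, -0.4, 0.3, 0.2, -0.1]   # hypothetical weight row

x_rot = matvec(H, x)
w_rot = matvec(H, w)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)

print(f"peak |x| before rotation: {peak_before:.2f}, after: {peak_after:.2f}")
print(f"dot product before: {dot(x, w):.6f}, after: {dot(x_rot, w_rot):.6f}")
```

The spike of 12.0 turns into eight entries with magnitudes between about 4.0 and 4.5, so a 4-bit grid no longer has to stretch to cover a single dominant channel, while the layer output is preserved up to floating-point error.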

The DiT side needs another treatment because activation statistics are not static. A diffusion action head produces actions through multiple denoising steps. Early-step activations can have very different magnitudes from late-step activations. If one static scale is used for every step, some steps clip while others waste bit range. Omega-QVLA therefore stores an act_scale_table: scales indexed by layer, denoising step, and channel. The table is built offline from calibration trajectories, and inference simply looks up the scale for the current step.
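
A per-step lookup of this kind can be sketched as follows. This is a simplified illustration: the nested-dictionary layout, the layer name, the calibration peaks, and the helper functions are all hypothetical, and the real act_scale_table format in the repository may differ.

```python
def fake_quant(value, scale, bits=4):
    """Quantize one value to a signed grid and map it back to floating point."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax - 1, min(qmax, round(value / scale)))
    return code * scale


# Hypothetical calibration peaks, indexed as [step][channel].
# Activation magnitude shrinks as denoising proceeds.
LAYER = "transformer_blocks.0.attn1.to_q"
calib_peaks = {LAYER: [[8.0, 6.0], [2.0, 1.5], [0.5, 0.4]]}


def build_act_scale_table(peaks, bits=4):
    """Offline step: turn per-step, per-channel peaks into quantization scales."""
    qmax = 2 ** (bits - 1) - 1
    return {
        layer: [[peak / qmax for peak in step] for step in steps]
        for layer, steps in peaks.items()
    }


act_scale_table = build_act_scale_table(calib_peaks)


def quantize_activation(layer, step, activation):
    """Runtime step: look up the scales for the current denoising step."""
    scales = act_scale_table[layer][step]
    return [fake_quant(v, s) for v, s in zip(activation, scales)]


late_step_act = [0.45, -0.3]                       # small late-step activations
per_step = quantize_activation(LAYER, 2, late_step_act)

# Baseline: one static scale sized for the largest peak over all steps.
static_scale = max(max(step) for step in calib_peaks[LAYER]) / 7
static = [fake_quant(v, static_scale) for v in late_step_act]

print("per-step:", per_step)
print("static:  ", static)
```

In this toy case the static scale is sized for the early steps, so the late-step activations collapse to zero, while the per-step scales keep them within a few percent of their original values.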

How the repository is organized

The README describes the current recipe as offline packs:

| Side | Rotation | Quantizer | Per-step |
|---|---|---|---|
| LLM/Eagle in GR00T | DuQuant svd_hadamard | GPTQ | No |
| DiT action head | DuQuant svd_hadamard | RTN residual | Yes, act_scale_table |
| pi0.5 Expert | svd_hadamard | GPTQ pack + RTN residual | Yes |
| pi0.5 PaliGemma | runtime DuQuant svd_hadamard | No separate GPTQ builder | Not like Expert |

This is post-training quantization, not policy fine-tuning. You need a small calibration buffer, but you do not need to backpropagate through the policy for many epochs. For GR00T-N1.5, the repo builds one pack for the LLM side and another pack for the DiT side, then merges them into a single quantized.pt file loaded at runtime through GptqLinear.

results/packs/object_LLM/quantized.pt
results/packs/object_DiT/quantized.pt
              |
              v
results/packs/object_MERGED/quantized.pt
              |
              v
run_groot_benchmark.sh -> wrapped quantized linear layers

Hardware and environment assumptions

Do not start on a Jetson if you do not already have a quantized pack. The project README calls for one A100 40 GB GPU to build packs, and 4-8 GPUs for parallel multi-suite evaluation. A pragmatic workflow is:

  1. Build the quantized pack on a cloud GPU or strong workstation.
  2. Run LIBERO evaluation to check the policy did not collapse.
  3. Export the pack and the runtime pieces you need.
  4. Optimize kernels and runtime loading for your edge target.
  5. Run shadow mode on the real robot before giving the policy direct control.

The base environment from the README is:

conda create -n omega_qvla python=3.10 -y
conda activate omega_qvla

# Inside the Omega-QVLA repository
pip install -e .

# LIBERO benchmark
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git $HOME/LIBERO
pip install -e $HOME/LIBERO

Set paths explicitly:

export QUANTVLA_ROOT=$HOME/Omega-QVLA
export QUANTVLA_CONDA_ENV=omega_qvla
export CONDA_ROOT=$HOME/miniconda3
export CHECKPOINTS_ROOT=$HOME/ckpts
export LIBERO_ROOT=$HOME/LIBERO
export LIBERO_CONFIG_PATH=$HOME/.libero
export QUANTVLA_CACHE_ROOT=$HOME/.cache/omega_qvla
export OPENPI_ROOT=$HOME/openpi

mkdir -p $QUANTVLA_CACHE_ROOT $LIBERO_CONFIG_PATH

The most relevant code locations are:

gr00t/quantization/
  quant.py              # entry point for enabling quantization from config
  gptq_layers.py        # GptqLinear, GPTQ solver, per-step support
  duquant_layers.py     # rotation + RTN runtime
  rtn_layers.py         # pure RTN

tools/
  build_gptq_weights.py
  build_dit_a2lite_svd_gptq_perstep.py
  build_pi05_a2lite_gptq_perstep.py
  merge_packs.py

scripts/
  run_groot_benchmark.sh
  run_pi05_libero_benchmark.sh

Running GR00T-N1.5: build the LLM pack

The following example uses the object suite. Other choices are goal, spatial, and long.

cd $QUANTVLA_ROOT

SUITE=object
CKPT=$CHECKPOINTS_ROOT/gr00t-n1.5-libero-${SUITE}-posttrain

case "$SUITE" in
  goal)
    TASK=libero_goal
    DCFG=examples.Libero.custom_data_config:LiberoDataConfigMeanStd
    ;;
  long)
    TASK=libero_10
    DCFG=examples.Libero.custom_data_config:LiberoDataConfig
    ;;
  *)
    TASK=libero_${SUITE}
    DCFG=examples.Libero.custom_data_config:LiberoDataConfig
    ;;
esac

LLM_RE='.*backbone\.eagle_model\.language_model\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
EXCLUDE='(?:^|\.)(vision|radio|norm|ln|layernorm|embed|lm_head|timestep_encoder|state_encoder|action_encoder|action_decoder|pos_embed|vl_self_attention|vlln|future_tokens)(?:\.|$)'
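
Before spending GPU time on a pack, it can help to dry-run the two patterns against a few layer names. The sketch below copies the regexes from above; the module names are hypothetical examples, and the way the repository combines include and exclude (match on the include pattern, search on the exclude pattern) is an assumption.

```python
import re

LLM_RE = r'.*backbone\.eagle_model\.language_model\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
EXCLUDE = r'(?:^|\.)(vision|radio|norm|ln|layernorm|embed|lm_head|timestep_encoder|state_encoder|action_encoder|action_decoder|pos_embed|vl_self_attention|vlln|future_tokens)(?:\.|$)'


def is_selected(name):
    """Assumed selection rule: include pattern matches and exclude pattern does not."""
    return re.match(LLM_RE, name) is not None and re.search(EXCLUDE, name) is None


# Hypothetical module names; print the real ones from your checkpoint instead.
candidates = [
    "backbone.eagle_model.language_model.model.layers.0.self_attn.q_proj",
    "backbone.eagle_model.language_model.model.layers.0.mlp.down_proj",
    "backbone.eagle_model.vision_model.encoder.layers.0.self_attn.q_proj",
    "backbone.eagle_model.language_model.lm_head",
    "action_head.state_encoder.layer1",
]

selected = [name for name in candidates if is_selected(name)]
for name in candidates:
    print("QUANT" if name in selected else "skip ", name)
```

Only the two language-model projection layers are selected here. If a name you expected to stay in higher precision shows up as QUANT, fix the pattern before building.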

Build the LLM pack:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
python -m tools.build_gptq_weights \
  --checkpoint "$CKPT" \
  --task-suite-name "$TASK" \
  --data-config "$DCFG" \
  --output-path results/packs/${SUITE}_LLM/quantized.pt \
  --include-regex "$LLM_RE" \
  --exclude-regex "$EXCLUDE" \
  --duquant-rotation \
  --duquant-rot-mode svd_hadamard \
  --weight-bits 4 \
  --num-samples 10 \
  --token-cap 1024 \
  --gptq-block-size 128 \
  --gptq-damp-percent 0.05

The --num-samples 10 setting defines a small calibration buffer. For reproduction, start with the documented value. For a deployment project, test sensitivity with 10, 50, and 100 trajectories. A calibration set that is too narrow can look good in one suite and then fail when the camera angle, object distribution, or task horizon changes.

Build the DiT action head pack

The DiT builder automatically targets transformer_blocks.*.attn1 and ff.net, and always uses svd_hadamard. The important flags are --use-rtn and --num-steps 8.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
python -m tools.build_dit_a2lite_svd_gptq_perstep \
  --checkpoint "$CKPT" \
  --task-suite-name "$TASK" \
  --data-config "$DCFG" \
  --output-path results/packs/${SUITE}_DiT/quantized.pt \
  --num-samples 10 \
  --token-cap 1024 \
  --num-steps 8 \
  --svd-rank 0 \
  --use-rtn \
  --w-bits 4 \
  --a-bits 4 \
  --act-percentile 99.9 \
  --duquant-block-size 64 \
  --duquant-block-out 64 \
  --gptq-block-size 128 \
  --gptq-damp-percent 0.05

A common beginner mistake is to assume that --a-bits 4 fully defines activation quantization inside the pack. More precisely, the pack stores the required metadata, while the evaluation runtime must also set ABITS=4 so that the wrapped layers actually execute 4-bit activation quantization. For the DiT side, the per-step table is built offline and dispatched at runtime according to the denoising step.
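
The --act-percentile 99.9 flag in the command above suggests that activation scales are derived from a high percentile rather than the absolute maximum. How the builder applies it internally is not documented in this guide, so the following is only a sketch of the general idea under that assumption, with hypothetical helper names and synthetic data.

```python
import math


def percentile_abs(values, pct):
    """Nearest-rank percentile of the absolute values."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def act_scale(values, pct, bits=4):
    """Scale that maps the chosen percentile onto the largest positive 4-bit code."""
    qmax = 2 ** (bits - 1) - 1
    return percentile_abs(values, pct) / qmax


# Synthetic calibration data: 1999 ordinary activations in [-1, 1] plus one rare spike.
samples = [((i % 201) - 100) / 100.0 for i in range(1999)] + [40.0]

scale_max = act_scale(samples, 100.0)     # driven entirely by the spike
scale_p999 = act_scale(samples, 99.9)     # ignores the rarest values

print(f"max-based scale:        {scale_max:.4f}")
print(f"99.9th-percentile scale: {scale_p999:.4f}")
```

The percentile-based scale clips the single spike but keeps the 4-bit grid fine enough for the remaining values, which is usually the better trade for a low-bit activation path.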

Merge the packs and run evaluation

Merge the LLM and DiT packs:

python -m tools.merge_packs \
  --out results/packs/${SUITE}_MERGED/quantized.pt \
  results/packs/${SUITE}_LLM/quantized.pt \
  results/packs/${SUITE}_DiT/quantized.pt

Define the DiT regex:

DIT_RE='.*action_head\.model\.transformer_blocks\.\d+\.(attn1\.(to_q|to_k|to_v|to_out\.0)|ff\.net\.(0\.proj|2)).*'

Run the GR00T benchmark:

env CONDA_ROOT=$CONDA_ROOT \
  SUITE=$SUITE WBITS=4 ABITS=4 \
  LLM_QUANT=gptq DIT_QUANT=gptq DIT_ATTN=1 DIT_PERSTEP=1 \
  GR00T_GPTQ_PATH_OVERRIDE=$QUANTVLA_ROOT/results/packs/${SUITE}_MERGED/quantized.pt \
  GR00T_GPTQ_INCLUDE_OVERRIDE="(${LLM_RE}|${DIT_RE})" \
  GR00T_GPTQ_MISSING=fallback \
  GPU_LIST=0,1,2,3 PORT_BASE=8000 NUM_TRIALS_PER_TASK=10 \
  GR00T_EVAL_INIT_OFFSET=10 \
  OUTPUT_ROOT=$QUANTVLA_ROOT/results/eval/${SUITE} \
  bash $QUANTVLA_ROOT/scripts/run_groot_benchmark.sh

python -c "import json; print(round(100*json.load(open('results/eval/${SUITE}/merged_summary.json'))['total_success_rate'],1), '%')"

For the long suite, the README recommends GPU_LIST=0,1,2,3,4,5,6,7. You can reduce trials for a smoke test on fewer GPUs, but do not use that as a method-level claim. The README also notes that short-suite 50-episode success rate can move by ±5-10 percentage points with a single seed.
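
The size of that swing is consistent with basic binomial statistics. As a rough check, the sketch below computes a Wilson score interval for a success rate; the episode counts are made-up examples, and the function is a generic statistical helper rather than part of the evaluation scripts.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical outcomes with the same 88% observed success rate.
low50, high50 = wilson_interval(44, 50)
low500, high500 = wilson_interval(440, 500)

print(f"50 episodes:  {100 * low50:.1f}% to {100 * high50:.1f}%")
print(f"500 episodes: {100 * low500:.1f}% to {100 * high500:.1f}%")
```

At 50 episodes the interval spans about 76% to 94%, so a few points of difference between two quantization settings is not evidence of anything. Ten times as many episodes narrows it to about 85% to 91%.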

Running pi0.5: what changes?

pi0.5 combines a PaliGemma backbone with an action Expert. The repo currently does not provide a PaliGemma-side GPTQ builder or prefix-activation recorder, so PaliGemma is quantized at runtime with DuQuant svd_hadamard. The Expert is built as a GPTQ pack with --use-rtn and per-step scaling.

SUITE=object

EXPERT_RE='.*paligemma_with_expert\.gemma_expert\.model\.layers\.[0-9]+\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
PALI_RE='.*paligemma_with_expert\.paligemma\.model\.language_model\.layers\.[0-9]+\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$QUANTVLA_ROOT \
$OPENPI_ROOT/.venv/bin/python -m tools.build_pi05_a2lite_gptq_perstep \
  --checkpoint $CHECKPOINTS_ROOT/pi05_libero_pytorch \
  --data-config pi05_libero \
  --obs-path duquant_act_stats/pi05_libero_${SUITE}_obs.pt \
  --output results/packs/pi05_${SUITE}_expert/quantized.pt \
  --include-regex "$EXPERT_RE" \
  --max-samples 10 \
  --token-cap 512 \
  --num-steps 10 \
  --use-rtn \
  --w-bits 4 \
  --a-bits 4 \
  --duquant-block-size 64 \
  --duquant-block-out 64 \
  --gptq-block-size 128 \
  --gptq-damp-percent 0.05

Evaluate pi0.5:

env CONDA_ROOT=$CONDA_ROOT METHOD=hybrid SUITE=$SUITE WBITS=4 ABITS=4 \
  GPU_LIST=0,1,2,3 PORT_BASE=8100 NUM_TRIALS_PER_TASK=10 \
  GR00T_EVAL_INIT_OFFSET=10 \
  OPENPI_ROOT=$OPENPI_ROOT OPENPI_PY=$OPENPI_ROOT/.venv/bin/python \
  OPENPI_CONFIG=pi05_libero OPENPI_CHECKPOINT=$CHECKPOINTS_ROOT/pi05_libero_pytorch \
  OPENPI_GPTQ_PATH=$QUANTVLA_ROOT/results/packs/pi05_${SUITE}_expert/quantized.pt \
  OPENPI_GPTQ_INCLUDE="$EXPERT_RE" OPENPI_DUQUANT_INCLUDE="$PALI_RE" \
  GR00T_DUQUANT_ROT_MODE=svd_hadamard \
  OUTPUT_ROOT=$QUANTVLA_ROOT/results/eval/pi05_${SUITE} \
  bash $QUANTVLA_ROOT/scripts/run_pi05_libero_benchmark.sh

The biggest pi0.5 pitfall is small-dimension heads. The README warns that state_proj, action_in_proj, action_out_proj, and time_mlp can collapse under A4, so they should stay outside the include regex. This is a practical deployment detail: not every small or sensitive layer should be forced into 4-bit just because the headline says W4A4.
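
A small guard can catch this before a build. The sketch below checks whether an include pattern would pull in any of the sensitive heads named above. The module names are hypothetical placeholders (including the time_mlp_in and time_mlp_out variants), and the substring test is an assumption about how those heads are named in your checkpoint.

```python
import re

EXPERT_RE = r'.*paligemma_with_expert\.gemma_expert\.model\.layers\.[0-9]+\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*'
SENSITIVE = ("state_proj", "action_in_proj", "action_out_proj", "time_mlp")

# Hypothetical module names; replace with the names from your own model.
module_names = [
    "paligemma_with_expert.gemma_expert.model.layers.0.self_attn.q_proj",
    "paligemma_with_expert.gemma_expert.model.layers.0.mlp.down_proj",
    "state_proj",
    "action_in_proj",
    "action_out_proj",
    "time_mlp_in",
    "time_mlp_out",
]


def sensitive_hits(include_regex, names):
    """Return sensitive heads that the include pattern would quantize."""
    pattern = re.compile(include_regex)
    return [n for n in names if pattern.match(n) and any(key in n for key in SENSITIVE)]


safe_hits = sensitive_hits(EXPERT_RE, module_names)
broad_hits = sensitive_hits(r'.*(proj|mlp).*', module_names)   # a deliberately careless pattern

print("EXPERT_RE catches:", safe_hits)
print("broad pattern catches:", broad_hits)
```

The documented EXPERT_RE leaves all of the small heads alone, while the careless pattern would quantize every one of them. Running a check like this whenever you edit a regex is cheaper than debugging a collapsed policy.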

Training, calibration, and inference are different stages

The word "training" is easy to misuse here. Omega-QVLA does not train a new manipulation skill from demonstrations. It performs PTQ:

| Stage | Policy backprop? | Input | Output |
|---|---|---|---|
| Original VLA fine-tuning | Yes | Robot trajectories | FP16 checkpoint |
| Omega-QVLA calibration | Not policy fine-tuning | A few trajectories or observation dumps | Scales, rotations, quantized weights |
| Pack build | No | FP16 checkpoint + calibration | quantized.pt |
| Inference/evaluation | No | Camera + command | Action chunk |

If you already have a GR00T-N1.5 post-trained checkpoint for LIBERO, Omega-QVLA starts there. If your target is a different real robot, you still need the original policy to work first. Quantization does not create new manipulation ability; it tries to preserve existing behavior at lower precision.

A simplified inference loop, written as illustrative pseudocode rather than runnable code, looks like this:

obs = get_camera_and_robot_state()
instruction = "put the red block into the bowl"

with quantized_runtime(pack="object_MERGED/quantized.pt"):
    tokens = backbone.encode(obs.images, instruction)      # W4A4 wrapped linear
    action_chunk = dit_action_head.denoise(
        tokens=tokens,
        state=obs.robot_state,
        per_step_scale=True,
    )

controller.execute(action_chunk)

On edge hardware, measure at least three things: end-to-end latency, action jitter, and thermal throttling. LIBERO success rate is not enough. A policy that reports 98% in simulation but oscillates between 60 ms and 180 ms latency on a physical robot can still fail because the control loop is irregular.
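
A latency log does not need special tooling. The following sketch times a callable and summarizes the samples with the standard library; measure_latency_ms, summarize, and the dummy workload are illustrative names, and on a real robot the callable would be one full policy step including preprocessing.

```python
import statistics
import time


def measure_latency_ms(step_fn, n_iters=50):
    """Time repeated calls to step_fn and return per-call latency in milliseconds."""
    samples = []
    for _ in range(n_iters):
        start = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Median, tail latency, and spread of a list of latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "jitter_ms": statistics.pstdev(samples_ms),
        "max": max(samples_ms),
    }


# Stand-in workload so the sketch runs anywhere; swap in the real policy step.
measured = measure_latency_ms(lambda: sum(range(1000)), n_iters=20)
print(summarize(measured))

# Synthetic log resembling the irregular case above: mostly 60 ms with 180 ms spikes.
irregular = summarize([60.0] * 90 + [180.0] * 10)
print(irregular)
```

In the synthetic log the median looks healthy at 60 ms, but the 95th percentile sits at 180 ms. The tail, not the median, is what the controller has to tolerate.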

Main results from the paper

On LIBERO, the paper reports:

| Model | FP16 reference | Omega-QVLA W4A4 | Note |
|---|---|---|---|
| pi0.5 | 97.1% | 98.0% | Slightly above FP16 average |
| GR00T-N1.5 | 87.0% | 87.8% | Slightly above FP16 average |

For static memory footprint:

| Model | FP16 | Omega-QVLA W4A4 | Saving |
|---|---|---|---|
| pi0.5 | 4.27 GB | 1.20 GB | 72.0% |
| GR00T-N1.5 | 1.99 GB | 586 MB | 71.3% |
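
These savings sit a little below the 75% that a pure 16-bit to 4-bit conversion would give. One plausible reason, stated here as an assumption rather than a claim from the paper, is that quantization metadata and any layers left in higher precision add overhead. The sketch below shows the arithmetic for a hypothetical layout with one 16-bit scale per group of 64 weights; neither number is taken from the repository.

```python
def effective_bits(weight_bits, group_size=None, scale_bits=16):
    """Average storage per weight, including one scale per group if a group size is given."""
    if group_size is None:
        return float(weight_bits)
    return weight_bits + scale_bits / group_size


def saving_vs_fp16(bits_per_weight):
    """Fraction of memory saved relative to 16 bits per weight."""
    return 1.0 - bits_per_weight / 16.0


ideal = saving_vs_fp16(effective_bits(4))                       # no metadata at all
with_scales = saving_vs_fp16(effective_bits(4, group_size=64))  # hypothetical group size

print(f"ideal saving:       {100 * ideal:.1f}%")        # 75.0%
print(f"with group scales:  {100 * with_scales:.1f}%")  # 73.4%
```

Use the same arithmetic with your own group size and list of unquantized layers to estimate whether a pack will fit in the memory budget of your edge device.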

The real-world experiment uses pi0.5 on a bimanual ARX R5 robot with five tasks: Pick Cup, Put Blocks, Put Fruit, Put Flowers, and Fold Towel. Omega-QVLA W4A4 reaches an average progress score of 51.0, slightly above the Pi-0.5 Base score of 49.6 and far above QuantVLA's 25.0. The qualitative result matters as much as the number: the paper describes QuantVLA as producing jerky end-effector trajectories, while Omega-QVLA produces smoother actions and tracks reference trajectories more closely in open-loop analysis.

The ablation result is also useful. SVD-Hadamard plus per-step scaling reaches an average of 87.75 in the ablation table, while SVD alone plus per-step scaling reaches 79.25, an 8.5-point gap attributable to the Hadamard component. Removing per-step scaling costs around 2 points overall and hurts the Long suite more. That matches the intuition: long-horizon tasks amplify activation drift across denoising and action steps.

Edge deployment checklist

Before letting a W4A4 policy drive a real robot, use a checklist like this:

[ ] The FP16 policy already solves the task in sim or on the real robot
[ ] The quantized pack was built for the correct suite/task
[ ] W4A4 evaluation ran enough trials, not only a smoke test
[ ] Action norm, jerk, latency, and dropped frames are logged
[ ] Open-loop actions are compared against the FP16 reference
[ ] Shadow mode is tested: policy predicts but does not control the robot yet
[ ] Safety controller, joint limits, and velocity limits are active
[ ] Short primitives are tested before long-horizon tasks
[ ] Edge-device temperature and clock throttling are monitored
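
For the open-loop comparison item, a minimal sketch could look like the following. It assumes you have already recorded action chunks from the FP16 and W4A4 policies on the same observations; open_loop_report and the toy chunks are hypothetical, and the acceptable thresholds depend on your robot.

```python
import math


def open_loop_report(ref_chunk, quant_chunk):
    """Compare two action chunks (lists of action vectors) element by element."""
    diffs = [
        q - r
        for ref_action, quant_action in zip(ref_chunk, quant_chunk)
        for r, q in zip(ref_action, quant_action)
    ]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {"rmse": rmse, "max_abs": max(abs(d) for d in diffs)}


# Toy chunks: two timesteps, two action dimensions each.
fp16_chunk = [[0.0, 0.1], [0.1, 0.2]]
w4a4_chunk = [[0.0, 0.1], [0.1, 0.5]]

report = open_loop_report(fp16_chunk, w4a4_chunk)
print(report)
```

Track both numbers: a low RMSE with a high maximum deviation usually means a single dimension or timestep is misbehaving, which averages tend to hide.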

For real robots, do not only track average success rate. Log operational signals:

policy_latency_ms_p50
policy_latency_ms_p95
action_delta_l2
eef_velocity
joint_velocity_max
gripper_command_switch_rate
camera_frame_drop_rate
controller_watchdog_reset_count
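
Two of these signals can be computed directly from logged actions. The definitions below are one reasonable interpretation of the metric names, not definitions taken from the paper or the repository, and the 0.5 gripper threshold is an assumption.

```python
import math


def action_delta_l2(actions):
    """L2 norm of the change between consecutive action vectors."""
    return [
        math.sqrt(sum((cur - prev) ** 2 for prev, cur in zip(a, b)))
        for a, b in zip(actions, actions[1:])
    ]


def gripper_command_switch_rate(gripper_cmds, threshold=0.5):
    """Fraction of consecutive steps where the open/close command flips."""
    states = [cmd > threshold for cmd in gripper_cmds]
    switches = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return switches / (len(states) - 1)


# Toy logs: three action vectors and five gripper commands.
actions = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
gripper = [0.0, 1.0, 0.0, 1.0, 1.0]

deltas = action_delta_l2(actions)
switch_rate = gripper_command_switch_rate(gripper)

print("action_delta_l2:", deltas)
print("gripper_command_switch_rate:", switch_rate)
```

A quantized policy whose action deltas or gripper switch rate are clearly higher than the FP16 baseline on the same episodes is a candidate for the jerky behavior discussed next.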

If W4A4 behavior is jerky on edge hardware, check three things first: whether your include regex accidentally quantized small sensitive heads; whether the calibration sample is too small or too narrow; and whether the per-step scale table is actually loaded in runtime. If latency is good but the task still fails, the problem may be domain gap, camera calibration, action normalization, or controller mapping rather than quantization itself.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
