manipulationvlafinevladual-armrobot-learningqwen

Run FineVLA for Dual-Arm Robots

A practical FineVLA guide covering the paper idea, architecture, installation, training, inference, and results.

Nguyễn Anh TuấnJune 5, 202613 min read
Run FineVLA for Dual-Arm Robots

FineVLA is an open framework for fine-grained instruction alignment in Vision-Language-Action (VLA) learning. The core idea is simple but important: robot policies should not only learn what task to complete, but also how the task should be executed. A coarse instruction such as "pick up the cup" does not tell a dual-arm robot which arm to use, where to contact the object, how to approach it, how to orient the gripper, or how to recover if the object slips.

This article is a practical guide to reading and running FineVLA for dual-arm VLA work. FineVLA is not just a single model checkpoint. It is a system with FineVLA-Tool for data construction, RoboFine-Bench for fine-grained robotic video understanding, RoboFine-VLM for scalable annotation, and FineVLA-Policy for training steerable VLA policies. If you are new to VLA models, read the paper idea and smoke-test sections before attempting distributed training.

Primary sources used for this guide:

The problem FineVLA tackles

A VLA policy consumes visual observations, language instructions, and sometimes robot state, then outputs robot actions. In many open robotics datasets, each trajectory is paired with a short goal-level instruction:

"Pick up the red block."

That may be enough for a basic single-arm task. It is not enough for many dual-arm tasks. Consider the difference between these two labels:

Coarse instruction:
  "Pick up the cup."

Fine-grained instruction:
  "Use the right arm to approach the cup from the front-right side,
   close the gripper around the upper body of the cup, lift vertically
   until it clears the table, then keep the cup upright while moving it
   to the target area."

Both labels describe the same high-level task, but they provide very different learning signals. The first says what to do. The second says what to do and how to do it. In dual-arm manipulation, the "how" often determines whether the behavior is usable: active arm, approach direction, contact region, trajectory shape, orientation, object relationship, and recovery behavior.

FineVLA asks a clean research question: if the trajectory, visual observation, state, and action labels stay the same, but the language label becomes more action-aligned, does the policy become more steerable? The answer in the paper is yes, with an important nuance. The best results usually come from mixing fine-grained instructions with raw goal-level instructions, not from replacing all raw language with long process descriptions.

The FineVLA system

FineVLA has four main pieces:

Raw robot datasets
  BridgeData-V2, BC-Z, RT-1, Galaxea, RoboMIND, RoboCOIN, RH20T, RDT, DROID
        |
        v
FineVLA-Tool
  1. Convert to a unified format
  2. Canonicalize state and action
  3. Cluster trajectories with DTW
  4. Annotate 10 fine-grained dimensions
        |
        v
FineVLA-Data
  47,159 human-verified fine-grained trajectories
        |
        +--> RoboFine-Bench
        |      500 videos, 10,816 atomic facts, 1,030 VQA questions
        |
        +--> RoboFine-VLM
        |      Qwen3.5-397B-A17B fine-tuned as a robotic video annotator
        |
        +--> FineVLA-Policy
               StarVLA-OFT / StarVLA-GR00T policies
               trained with Raw, FG, or mixed instructions

This design matters because FineVLA is not mainly an architecture paper. The authors deliberately instantiate FineVLA-Policy with existing StarVLA-style action decoders to isolate the role of instruction supervision. The experiment is not "we invented a new action head." The stronger claim is "with the same actions and visual data, richer language changes what the policy learns."

The ten fine-grained dimensions

FineVLA-Tool annotates each selected trajectory with ten control-relevant dimensions. For a beginner, treat these dimensions as a checklist for writing useful robot instructions.

Dimension What it captures in dual-arm robotics Example
Action Sequence Ordered manipulation steps reach, grasp, lift, move, place
Active Actor Which arm or end-effector acts left arm, right arm, both arms
Target Object Object disambiguation red block on the left
Initial Configuration Starting scene state cup upright near the tray
Final Configuration Desired end state cup placed upright on target
Contact & Approach Where and how contact happens grasp upper side, approach from front
Trajectory & Orientation Motion path and tool orientation lift vertically, rotate clockwise
Body Motion Whole-body or joint-level motion keep torso fixed, move wrist only
Object Interaction Relationships between objects stack block on top of another
Failure & Recovery Error handling behavior re-grasp if slipping

The practical lesson is direct: if your policy often completes the task but violates execution preferences, your language may be too coarse. A dual-arm policy cannot reliably infer "use the left arm" or "approach from the side" from a label that only says "pick up the object."

FineVLA-Policy architecture

FineVLA-Policy currently uses a modular StarVLA codebase. In the paper, the main policy experiments use two action-decoding designs:

Framework Backbone Action head Idea
StarVLA-OFT Qwen3.5-VL-4B MLP regression Reads hidden states at action tokens and predicts continuous action chunks
StarVLA-GR00T Qwen3.5-VL-4B DiT + flow matching Uses the VLM as a slow reasoning system and a DiT module for continuous action generation

The data path looks like this:

Camera frames + language instruction
        |
        v
Qwen3.5-VL-4B backbone
        |
        +-----------------------------+
        |                             |
        v                             v
StarVLA-OFT                     StarVLA-GR00T
MLP action regression           DiT flow-matching action generator
        |                             |
        v                             v
continuous action chunk         continuous action chunk

The dual-arm part is mostly encoded in the dataset and action representation. The paper uses AlohaMix, an ALOHA-compatible dual-arm mixture built from sources such as RDT, RoboCOIN, and RoboMIND. AlohaMix contains 86,662 episodes across 598 tasks, which gives the policy enough variation to test whether detailed language can steer different manipulation strategies on a consistent dual-arm embodiment.

Install FineVLA-Policy

You need Linux, Python 3.10, a compatible PyTorch/CUDA stack, and NVIDIA GPUs. FineVLA-Policy uses Accelerate and DeepSpeed for distributed training. For a first run, your goal should be simple: install dependencies, load the backbone, and pass the smoke test.

git clone https://github.com/xlang-ai/FineVLA.git
cd FineVLA/FineVLA-Policy

conda create -n finevla python=3.10 -y
conda activate finevla

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Check CUDA and package versions:

nvcc -V
pip list | grep -E 'torch|transformers|flash-attn|accelerate|deepspeed'

The official README notes that flash-attn==2.7.4.post1 works with CUDA 12.0 and 12.4. If FlashAttention fails to build, the common cause is a PyTorch/CUDA mismatch. Install the correct PyTorch build first, then install FlashAttention.

Download the Qwen3.5-VL-4B-Instruct backbone and place it where the example script expects it:

mkdir -p playground/Pretrained_models
# Download Qwen/Qwen3.5-VL-4B-Instruct from Hugging Face.
# Place the checkpoint at:
# ./playground/Pretrained_models/Qwen3.5-VL-4B-Instruct

Run the smoke test:

python starVLA/model/framework/QwenGR00T.py

If the model builds and prints its architecture, the policy code path is working. If you hit GPU memory errors later, reduce batch size, use fewer camera views, shorten action chunks, or move to larger GPUs.

Prepare data

FineVLA-Tool reads the dataset root from VLA_DATA_ROOT. The tool converts heterogeneous robot demonstrations into a unified format, canonicalizes state/action representation, clusters trajectories, and then annotates representative episodes.

cd FineVLA/FineVLA-Tool
export VLA_DATA_ROOT="/path/to/your/Lerobot_v21"

The data pipeline has four stages:

CanonicalizeAndClean
  - convert source datasets into a LeRobot-style format
  - canonicalize absolute coordinates and quaternion rotations
  - filter invalid videos, empty tasks, and corrupted trajectories

ClusteringAndSampling
  - compute trajectory similarity with dynamic time warping
  - cluster trajectories within each task
  - select representative episodes for annotation

AnnotationPipeline
  - use VLMs to describe each manipulation step
  - fill slots such as actor, target, contact, trajectory, and state change

RealANNO-Guidance
  - human annotators review video-grounded descriptions
  - factual and temporal alignment errors are corrected

You do not need to reconstruct the entire FineVLA-Data corpus to learn the training workflow. Start with released assets and example scripts. For your own robot, the best first milestone is to convert logs into a clean LeRobot-style dataset. This makes it easier to reuse dataloaders and visualization tools from the broader robot learning ecosystem.

Train FineVLA-Policy

FineVLA-Policy uses Accelerate and DeepSpeed. The general launch pattern is:

cd FineVLA/FineVLA-Policy

accelerate launch \
  --config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
  --num_processes 8 \
  starVLA/training/train_starvla.py \
  --config_yaml ./starVLA/config/training/your_config.yaml

The repo provides benchmark-specific scripts:

examples/Aloha/          ALOHA dual-arm tasks and FG:Raw mixing ratios
examples/Robotwin/       RoboTwin simulation benchmark
examples/LIBERO/         LIBERO benchmark
examples/Robocasa_tabletop/
examples/SimplerEnv/

Example: train the GR00T-style ALOHA setup with FG:Raw = 1:1.

bash examples/Aloha/run_qwen35_GR00T_aloha_multi_FG1_1_dlc.sh

The FG:Raw ratio controls how often a sampled trajectory is paired with a fine-grained instruction versus the original raw goal-level instruction.

Setting Instruction sampling When to use
Raw-only Only original goal-level instructions Baseline
FG-only Only fine-grained process instructions Tests whether detailed "how" language helps
FG:Raw = 1:4 20% FG, 80% Raw Conservative fine-tuning
FG:Raw = 1:2 About 33% FG, 67% Raw Strong setting in the paper
FG:Raw = 1:1 50% FG, 50% Raw Best average real-world result in the paper
FG:Raw = 2:1 or 4:1 Mostly FG Useful for ablations, but can overemphasize long descriptions

The important detail is that FineVLA keeps the same trajectories, actions, and visual observations while varying the paired language. This isolates the effect of action-aligned supervision from data scale or embodiment changes.

Inference and evaluation

As of this article date, the official GitHub README says pretrained and fine-tuned policy checkpoints are still "coming soon", while code, FineVLA-Tool, RoboFine-Bench, and RoboFine-VLM are released. That means you have two practical paths:

  1. Train a policy with the example scripts and evaluate your own checkpoint.
  2. When official policy checkpoints appear on Hugging Face, replace the model path in the eval config with the released checkpoint.

For LIBERO, the quick start is a policy server plus evaluator:

cd FineVLA/FineVLA-Policy

bash examples/LIBERO/eval_files/run_policy_server.sh &
bash examples/LIBERO/eval_files/eval_libero.sh

For a custom dual-arm robot, the inference loop conceptually looks like this:

observation = {
    "images": {
        "cam_high": high_rgb,
        "cam_left_wrist": left_wrist_rgb,
        "cam_right_wrist": right_wrist_rgb,
    },
    "state": robot_state,
}

instruction = (
    "Use the right arm to grasp the red block from its right side, "
    "lift it vertically, then place it on top of the blue block."
)

action_chunk = policy.predict_action(observation, instruction)
robot.execute(action_chunk)

You still need an embodiment-specific execution layer. Map the predicted action chunk to joint commands, end-effector deltas, gripper commands, or the ALOHA-style action format used by your controller. Fine-grained language does not replace calibration, camera extrinsics, time synchronization, action scaling, workspace limits, collision checks, or emergency stop logic.

Run RoboFine-Bench

RoboFine-Bench evaluates whether a VLM understands manipulation videos at the execution-detail level. It does not control the robot directly, but it is useful for testing annotators and video-language models before using them to generate training labels.

cd FineVLA/RoboFine-Bench

git lfs install
git clone https://huggingface.co/datasets/xlangai/RoboFine-bench EvalData/

pip install openai httpx tqdm pydantic Pillow av
export OPENAI_API_KEY="your-api-key"

Run VQA evaluation:

python RoboFine-Bench/vqa_eval/run_vqa.py \
  --model qwen3-vl-plus \
  --qa EvalData/QAEvalSets.json \
  --input EvalData/EvalSets.json \
  --num-workers 16

Run caption generation in hard mode, where the model does not receive the task instruction:

python RoboFine-Bench/caption_eval/annotate/run_annotate.py \
  --model qwen3.5-plus \
  --evalsets EvalData/EvalSets.json \
  --frame-index EvalData/frame_index.jsonl \
  --output-dir results/CaptionResult/hard/ \
  --num-workers 16 \
  --no-instruction

Score captions with Direct Alignment:

python -m caption_eval.atomic_eval.atomic_eval direct-align \
  --gt-facts EvalData/GT_AtomicFacts.jsonl \
  --caption results/CaptionResult/qwen3_5-plus_CaptionResult.jsonl \
  --output-dir results/AtomicResult/qwen3_5-plus/ \
  --num-workers 8 \
  --enable-thinking

The benchmark reports these core metrics:

Metric What it measures
Consistency Whether the caption matches the facts it claims
Coverage Whether the caption covers the ground-truth facts
Anti-Hallucination Whether the caption avoids invented actions
CaptionScore Average of the three metric groups

Results to remember

FineVLA-Data starts from 972,247 raw trajectories across 85K tasks and produces 47,159 human-verified fine-grained trajectories. The average instruction length grows from 9.3 words to 96.8 words, roughly a 10.4x increase.

On RoboTwin simulation, the strongest mixed settings reach:

Setting Raw-only Best mixed
RDT-OFT Easy/Hard 61.5 / 60.0 74.1 / 72.4
RDT-GR00T Easy/Hard 55.1 / 53.4 69.4 / 68.2
AlohaMix-OFT Easy/Hard 71.8 / 71.4 86.8 / 82.5

In real-world dual-arm manipulation, Raw-only reaches 49.9/100 average ID score, while FG:Raw = 1:1 reaches 62.7/100. The project page also reports instruction violations dropping from 34% to 12% for the best mixed setting. The largest gains appear in pose, color, and approach direction, which are exactly the factors that short goal-level instructions fail to specify.

RoboFine-VLM is another important result. It is fine-tuned from Qwen3.5-397B-A17B on FineVLA-Data and reaches 71.0% VQA accuracy and around 83.6% captioning score in the hard setting reported by the paper/project page. The training cost is large: the appendix reports RoboFine-VLM SFT on 256 H200 GPUs for about 40 hours. Most labs should use the released annotator, review a subset manually, and spend their compute budget on policy fine-tuning rather than reproducing the full annotator training run.

A small-lab implementation checklist

If you have a dual-arm robot, do not start with a 64-A100 pretraining run. Start with a narrow, verifiable loop:

1. Standardize robot logs
   camera timestamps, action/state format, gripper state, episode metadata

2. Convert to a LeRobot-style dataset
   the closer you are to FineVLA/LeRobot conventions, the easier reuse becomes

3. Write raw task instructions
   example: "stack the red block on the blue block"

4. Annotate 50-200 episodes with fine-grained instructions
   prioritize active actor, target object, contact, approach, and final configuration

5. Fine-tune a small policy or adapter first
   validate the data path before scaling training

6. Compare Raw-only, FG-only, FG:Raw = 1:2, and FG:Raw = 1:1
   keep trajectories and actions fixed to isolate the language effect

7. Deploy behind a safety wrapper
   action clipping, workspace limits, emergency stop, gripper force limits

FineVLA is most valuable when the same task can be executed in multiple valid ways. If all episodes are nearly identical, long instructions will add less. If your tasks vary by arm choice, approach direction, object pose, contact region, tool orientation, or recovery behavior, fine-grained instruction alignment gives the policy a clearer way to bind language to behavior.

NT

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

Khám phá VnRobo

Related Posts

EXPO-FT: RL Online Cho VLA π0.5
manipulation

EXPO-FT: RL Online Cho VLA π0.5

6/4/202616 min read
NT
UMI là gì? Cách thu data VLA cho robot mà không cần teleop
manipulation

UMI là gì? Cách thu data VLA cho robot mà không cần teleop

5/25/20268 min read
NT
RoboTwin 2.0: Hướng dẫn toàn diện dual-arm manipulation
manipulation

RoboTwin 2.0: Hướng dẫn toàn diện dual-arm manipulation

4/21/202610 min read
NT