OpenVLA: Open VLA for Robots

Why OpenVLA Matters

OpenVLA is one of the most useful projects to study if you want to understand how a Vision-Language-Action model turns a camera image and a natural-language instruction into a robot command. The original paper, "OpenVLA: An Open-Source Vision-Language-Action Model", was released on arXiv in 2024 by researchers from Stanford, UC Berkeley, Toyota Research Institute, Google DeepMind, Physical Intelligence, and MIT. Its value is not just the 7B-parameter model. The important part is the full stack around it: paper, project page, Hugging Face checkpoint, PyTorch code, fine-tuning scripts, inference examples, and real-robot evaluations.

If the earlier posts in this series gave you the basics of imitation learning, Diffusion Policy, VLA models, and hands-on LeRobot workflows, this article focuses on one concrete question: how does OpenVLA actually work, and what should a beginner do to run it carefully?

OpenVLA takes two main inputs: an image of the robot workspace and a language instruction such as "put the yellow corn on the plate". Its output is not a text answer like a chatbot's. The output is a robot action, usually represented as a 7-DoF end-effector delta: x, y, z, roll, pitch, yaw, gripper. That puts OpenVLA at the intersection of computer vision, language modeling, and robot control.

The first beginner trap is thinking that OpenVLA is simply an LLM that "talks to a robot". It is more precise to call it a policy learned from robot demonstrations. Language is a conditioning signal, not magic. If your robot, camera placement, action space, or workspace is far away from the pretraining data, you should expect to collect demonstrations and fine-tune.

What Problem Does the Paper Solve?

Before OpenVLA, many strong VLA systems were closed or difficult to reproduce. RT-2 and RT-2-X showed that combining vision-language pretraining with robot action data could produce more general robot policies, but outside users could not easily download comparable checkpoints, inspect the training recipe, or adapt the model to their own robot. OpenVLA takes a different route: it provides an open-source 7B VLA that is strong enough to be a serious research and prototyping baseline.

The paper's idea can be summarized in three points:

Problem	OpenVLA's answer	Why it matters
Prior VLAs were hard to access	Release code, checkpoints, model cards	You can run inference and fine-tune
Robot tasks are diverse	Train on 970k Open X-Embodiment trajectories	The model sees many scenes and embodiments
Fine-tuning 7B models is expensive	Support LoRA and quantization	Smaller labs can adapt it

According to the project page, OpenVLA is pretrained on 970k robot episodes from Open X-Embodiment. The paper abstract reports that OpenVLA outperforms the closed RT-2-X 55B model by 16.5 percentage points in absolute task success rate across 29 tasks and multiple robot embodiments, while using roughly seven times fewer parameters. The paper also reports that fine-tuned OpenVLA outperforms from-scratch Diffusion Policy by 20.4% in multi-task, multi-object settings where language grounding matters. These numbers do not mean OpenVLA always beats smaller policies. They mean that diverse pretraining helps when the task requires object grounding, instruction following, and generalization.

OpenVLA Architecture

OpenVLA is built on a Prismatic VLM. Instead of training every part from scratch, the authors fine-tune a pretrained vision-language model so that it predicts robot action tokens. The architecture has three main blocks:

Camera image
    |
    v
+-----------------------------+
| Fused visual encoder        |
| DINOv2 + SigLIP             |
+-----------------------------+
    |
    v
+-----------------------------+
| Projector                   |
| image embeddings -> LLM dim |
+-----------------------------+
    |
    v
+-----------------------------+
| Llama 2 7B backbone         |
| instruction + image tokens  |
+-----------------------------+
    |
    v
Tokenized robot actions
    |
    v
Continuous robot command

1. Fused Visual Encoder

OpenVLA uses two vision backbones: DINOv2 and SigLIP. DINOv2 is strong at self-supervised visual representation, while SigLIP is strong at image-language alignment. Combining them gives the model richer features for objects, colors, backgrounds, spatial layout, and instruction-conditioned targeting.

For a beginner, the simple mental model is this: the camera image is converted into a sequence of embeddings. Each embedding acts like a visual token containing information about a patch or region of the image. The language model does not directly read pixels. It reads the compressed visual embeddings produced by the encoder.

2. Projector

The projector is the bridge between the visual encoder and the LLM. Raw visual features do not naturally live in the same space as Llama 2 token embeddings. The projector learns to map image embeddings into a representation the LLM can process together with language tokens.

If you have used CLIP, BLIP, or other multimodal models, this is similar to an adapter between modalities. In robot learning, that adapter is especially important because the final output is not a caption. The final output is an action that will be executed by hardware.
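
As a shape-level sketch of what the projector does, the NumPy snippet below concatenates per-patch features from two encoders and maps them to the LLM width with a small MLP. The sizes (256 patches, 1024 and 1152 feature channels, 4096 LLM width), the channel-wise concatenation, the two-layer GELU MLP, and the random weights are all illustrative assumptions, not values read from the OpenVLA checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, for illustration only.
NUM_PATCHES = 256   # e.g. a 16 x 16 grid of image patches
DINO_DIM = 1024     # assumed DINOv2 feature width
SIGLIP_DIM = 1152   # assumed SigLIP feature width
LLM_DIM = 4096      # assumed LLM embedding width


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def project(dino_feats, siglip_feats, w1, b1, w2, b2):
    """Fuse per-patch features channel-wise, then map them to the LLM width."""
    fused = np.concatenate([dino_feats, siglip_feats], axis=-1)
    hidden = gelu(fused @ w1 + b1)
    return hidden @ w2 + b2


# Random stand-ins for encoder outputs and projector weights.
dino = rng.standard_normal((NUM_PATCHES, DINO_DIM), dtype=np.float32)
siglip = rng.standard_normal((NUM_PATCHES, SIGLIP_DIM), dtype=np.float32)
w1 = 0.02 * rng.standard_normal((DINO_DIM + SIGLIP_DIM, LLM_DIM), dtype=np.float32)
b1 = np.zeros(LLM_DIM, dtype=np.float32)
w2 = 0.02 * rng.standard_normal((LLM_DIM, LLM_DIM), dtype=np.float32)
b2 = np.zeros(LLM_DIM, dtype=np.float32)

visual_tokens = project(dino, siglip, w1, b1, w2, b2)
print(visual_tokens.shape)  # one LLM-width embedding per image patch
```

The point is only the interface: a sequence of patch features goes in, and a sequence of the same length comes out in the width the language model expects.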

3. Llama 2 7B Backbone

OpenVLA formats the task as a prompt:

In: What action should the robot take to {instruction}?
Out:

Instead of generating English text, the model generates action tokens. Those tokens are decoded into a continuous action vector and passed to the robot controller. Architecturally, OpenVLA reuses the autoregressive token prediction machinery of a language model. Functionally, it behaves as a visuomotor policy.
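
The prompt template above can be wrapped in a small helper so that every call uses the same format. `build_prompt` is a hypothetical name, and stripping whitespace and a trailing period is a choice of this sketch rather than something the OpenVLA code is known to do.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a natural-language instruction in the OpenVLA prompt template."""
    # Cleaning the instruction this way is an assumption of this sketch.
    cleaned = instruction.strip().rstrip(".")
    return f"In: What action should the robot take to {cleaned}?\nOut:"


prompt = build_prompt("  put the yellow corn on the plate. ")
print(prompt)
```

Keeping the template in one place makes it harder for inference prompts to drift away from the format used during fine-tuning.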

How Action Tokenization Works

Robot actions are usually continuous. A typical end-effector action looks like this:

action = [
  delta_x,
  delta_y,
  delta_z,
  delta_roll,
  delta_pitch,
  delta_yaw,
  gripper
]

A language model, however, predicts discrete tokens. OpenVLA therefore discretizes each action dimension separately into one of 256 bins. During training, continuous dataset actions are normalized, discretized, and represented as tokens. During inference, the model predicts tokens, the tokens are mapped back to bin values, and the result is unnormalized using robot- or dataset-specific statistics.

The conceptual pipeline is:

continuous action in dataset
        |
        v
normalize per dataset / robot
        |
        v
discretize into action bins
        |
        v
train LLM to predict action tokens
        |
        v
at inference: decode tokens -> unnormalize -> robot.act()

This is why unnorm_key is not a minor detail. If you use the wrong normalization statistics, the token may represent the right abstract action but the wrong physical scale. The robot may move too far, too slowly, or operate the gripper at the wrong time.
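
The round trip described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the OpenVLA implementation: it assumes uniform bins over [-1, 1], the `low` and `high` arrays are made-up stand-ins for per-dataset statistics, and the step that maps bin ids to tokens in the LLM vocabulary is left out.

```python
import numpy as np

N_BINS = 256  # bins per action dimension


def normalize(action, low, high):
    """Map each dimension from [low, high] to [-1, 1], clipping outliers."""
    scaled = 2.0 * (action - low) / (high - low) - 1.0
    return np.clip(scaled, -1.0, 1.0)


def to_bins(normalized):
    """Uniformly discretize values in [-1, 1] into integer ids 0..N_BINS-1."""
    ids = np.floor((normalized + 1.0) / 2.0 * N_BINS).astype(int)
    return np.clip(ids, 0, N_BINS - 1)


def from_bins(ids):
    """Decode each id to the center of its bin in [-1, 1]."""
    return (ids + 0.5) / N_BINS * 2.0 - 1.0


def unnormalize(normalized, low, high):
    """Map values in [-1, 1] back to physical units."""
    return (normalized + 1.0) / 2.0 * (high - low) + low


# Made-up statistics for a 7-DoF action: xyz, roll/pitch/yaw, gripper.
low = np.array([-0.05, -0.05, -0.05, -0.2, -0.2, -0.2, 0.0])
high = np.array([0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 1.0])
action = np.array([0.01, -0.02, 0.0, 0.05, 0.0, -0.1, 1.0])

ids = to_bins(normalize(action, low, high))
decoded = unnormalize(from_bins(ids), low, high)

# The same bin ids decoded with the wrong statistics (here, twice the range).
wrong = unnormalize(from_bins(ids), 2 * low, 2 * high)

print("bin ids:    ", ids)
print("decoded:    ", np.round(decoded, 4))
print("wrong stats:", np.round(wrong, 4))
```

With correct statistics the decoded action stays within half a bin of the original. With the doubled range, the same ids decode to an action twice as large, which is the failure mode a wrong normalization key produces.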

Environment Setup

OpenVLA has two practical setup levels: minimal inference and full training or fine-tuning. Beginners should start with inference to validate the GPU, dependencies, checkpoint loading, and output shape before connecting any hardware.

Minimal Inference Setup

conda create -n openvla python=3.10 -y
conda activate openvla

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers timm tokenizers pillow accelerate
pip install flash-attn --no-build-isolation

If flash-attn fails to build, you can try loading the model without attn_implementation="flash_attention_2", but inference will usually be slower and more memory hungry. In bfloat16 the 7B weights alone take roughly 14 to 15 GB, so a 24GB GPU leaves limited headroom for activations. On smaller GPUs you may need quantization or a server-client deployment where a larger GPU hosts the model.

Repository Setup for Fine-Tuning

conda create -n openvla-train python=3.10 -y
conda activate openvla-train

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y

git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .

pip install packaging ninja
pip install "flash-attn==2.5.5" --no-build-isolation

The original repository states that the code was developed with Python 3.10, PyTorch 2.2.x, transformers 4.40.1, tokenizers 0.19.1, timm 0.9.10, and flash-attn 2.5.5. In robotics projects, dependency drift is a common cause of wasted debugging time. If the model fails to load or produces strange outputs, check package versions before questioning the paper or your robot.

Basic Inference

The following example shows the flow from an image to a predicted action. It is not a complete robot controller, but it is enough to understand the interface.

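from PIL import Image
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the processor (image preprocessing + tokenizer) from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b",
    trust_remote_code=True,
)

# Load the policy in bfloat16 and move it to the first GPU.
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    attn_implementation="flash_attention_2",  # optional; requires flash-attn
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# A single RGB frame of the workspace and the task instruction.
image = Image.open("workspace.jpg").convert("RGB")
instruction = "put the yellow corn on the plate"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Predict one action; unnorm_key selects the normalization statistics.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(
    **inputs,
    unnorm_key="bridge_orig",
    do_sample=False,
)

print(action)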

On a real robot, image comes from a camera stream and robot.act(action) sends the command to the low-level controller. Before doing that, verify five things:

Component	Question to ask
Camera pose	Does the view match the dataset or fine-tuning data?
Action space	Does the robot expect end-effector deltas or joint commands?
Control frequency	Is the policy running at 5 Hz, 10 Hz, or 15 Hz?
Normalization	Is the `unnorm_key` correct for this dataset and robot?
Safety	Do you have workspace limits, emergency stop, and velocity clamps?

The OpenVLA model card states that the model predicts normalized 7-DoF end-effector deltas and that actions must be unnormalized using per-robot or per-dataset statistics before execution. This is not a small implementation detail. It is the difference between a controlled demo and unsafe hardware behavior.
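
One way to answer the control-frequency question before touching hardware is to time the policy call against the control period. The sketch below uses a hypothetical `fake_policy` stand-in (a short sleep) instead of the real model; `measure_latency`, `fits_control_rate`, and the 5 Hz target are assumptions for illustration.

```python
import statistics
import time


def fake_policy(image, instruction):
    # Hypothetical stand-in for the model call: sleeps briefly, returns a zero action.
    time.sleep(0.01)
    return [0.0] * 7


def measure_latency(policy, n_calls=20):
    """Return per-call latencies in seconds for n_calls invocations of policy."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        policy(None, "put the yellow corn on the plate")
        latencies.append(time.perf_counter() - start)
    return latencies


def fits_control_rate(latencies, control_hz):
    """True if even the slowest call fits inside one control period."""
    return max(latencies) <= 1.0 / control_hz


latencies = measure_latency(fake_policy)
print(f"median {statistics.median(latencies) * 1000:.1f} ms, "
      f"max {max(latencies) * 1000:.1f} ms")
print("fits 5 Hz:", fits_control_rate(latencies, control_hz=5))
```

Checking the maximum rather than the mean is deliberate: a controller that usually meets its deadline but occasionally misses it will still produce jerky motion.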

Fine-Tuning with LoRA

Fully fine-tuning a 7B model requires serious compute. The project page says that OpenVLA pretraining used a cluster of 64 A100 GPUs for 15 days. The GitHub repository provides a LoRA fine-tuning recipe because it is more realistic for many smaller labs.

LoRA adds low-rank adapters to selected layers and trains only those adapters instead of updating every model weight. The OpenVLA project reports that LoRA provides the best trade-off between performance and memory in their experiments, matching full fine-tuning while updating only about 1.4% of the parameters.
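
To see why the trainable fraction is so small, it helps to count parameters for a single linear layer. The arithmetic below assumes a 4096 x 4096 layer and rank 32 purely for illustration. The model-wide figure depends on which layers receive adapters, so this per-layer ratio is not expected to equal the reported 1.4%.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_in x d_out linear layer.

    The update is the product of two matrices of shape
    (d_in x rank) and (rank x d_out).
    """
    return rank * (d_in + d_out)


d_in = d_out = 4096  # assumed layer shape
rank = 32            # assumed LoRA rank

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, rank)
ratio = lora_params / full_params

print(f"full layer: {full_params:,} parameters")
print(f"LoRA adapter: {lora_params:,} parameters ({ratio:.2%} of the layer)")
```

The adapter grows linearly with the layer width while the frozen weight matrix grows quadratically, which is where the memory saving comes from.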

A representative command from the repository workflow looks like this:

torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path "openvla/openvla-7b" \
  --data_root_dir /data/robot_datasets \
  --dataset_name bridge_orig \
  --run_root_dir /runs/openvla_lora \
  --adapter_tmp_dir /tmp/openvla_adapter \
  --lora_rank 32 \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True \
  --wandb_project openvla \
  --wandb_entity your_team \
  --save_steps 5000

The repository notes that the example batch size of 16 can require around 72GB of GPU memory for BridgeData V2. If you have a smaller GPU, reduce batch_size and increase grad_accumulation_steps to keep the effective batch size stable. On a 24GB GPU, you may need a small batch, gradient accumulation, quantization, or a newer recipe such as OpenVLA-OFT if it matches your use case.
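
The trade between batch size and gradient accumulation is simple arithmetic. The helper below is a hypothetical convenience function, not part of the OpenVLA repository; it computes the accumulation steps needed to keep an effective batch size of 16 as the per-GPU batch shrinks.

```python
def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * num_gpus
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of per_device_batch * num_gpus")
    return target_effective_batch // per_step


for batch_size in (16, 8, 4, 2):
    steps = accumulation_steps(16, batch_size)
    print(f"--batch_size {batch_size} --grad_accumulation_steps {steps}")
```

Accumulation keeps the optimization behavior roughly comparable, but each optimizer update takes proportionally more forward and backward passes, so wall-clock training time goes up.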

Preparing a Custom Dataset

OpenVLA works naturally with RLDS because its pretraining and fine-tuning pipeline uses Open X-Embodiment data. If you have a custom robot dataset, the workflow should look like this:

teleoperation demos
    |
    v
store images + actions + language instructions
    |
    v
convert to RLDS or write PyTorch Dataset wrapper
    |
    v
register dataset config + transform
    |
    v
LoRA fine-tune
    |
    v
evaluate on held-out tasks

A minimal episode should contain:

episode/
  observations/
    image_primary[t]
    image_wrist[t]       # optional
    proprio[t]           # optional
  actions/
    world_vector[t]
    rotation_delta[t]
    gripper_closedness[t]
  language_instruction
  metadata/
    robot_id
    control_frequency
    camera_intrinsics

For a beginner project, do not start with 50 tasks. Pick two or three tasks with clear success criteria, such as "pick red block", "place block in bowl", and "open drawer". Collect 50-100 demonstrations per task, split train and validation data, and replay actions offline before training. If replay from the dataset is already wrong, the VLA will learn the wrong behavior quickly.
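
A minimal offline replay check, sketched below, integrates the recorded translation deltas and compares the result with the recorded end-effector positions. It assumes that each action is a position delta in the same frame as the recorded positions and that action t moves the robot from position t to position t+1. The synthetic episode and function names are illustrative only.

```python
import numpy as np


def replay_positions(start_pos, deltas):
    """Integrate per-step xyz deltas from start_pos; returns shape (T + 1, 3)."""
    steps = np.cumsum(deltas, axis=0)
    return np.vstack([start_pos, start_pos + steps])


def max_replay_error(recorded_pos, deltas):
    """Largest distance between recorded positions and positions replayed from deltas."""
    replayed = replay_positions(recorded_pos[0], deltas)
    return float(np.max(np.linalg.norm(replayed - recorded_pos, axis=1)))


# Synthetic episode: 50 small motions in metres (illustrative values).
rng = np.random.default_rng(0)
deltas = rng.uniform(-0.01, 0.01, size=(50, 3))
recorded = replay_positions(np.array([0.3, 0.0, 0.2]), deltas)

good_error = max_replay_error(recorded, deltas)

# Simulated labeling bug: actions shifted by one timestep relative to observations.
shifted = np.roll(deltas, 1, axis=0)
bad_error = max_replay_error(recorded, shifted)

print(f"aligned actions:     {good_error:.6f} m")
print(f"shifted by one step: {bad_error:.6f} m")
```

On real data the aligned error will not be exactly zero because of controller tracking error, but a large or steadily growing error usually points to a frame, unit, or timestep mismatch in the dataset.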

What to Watch During Training

Training logs may show loss and action accuracy, but robotics cannot be evaluated by token accuracy alone. A model can predict many correct tokens and still fail the rollout because one important grasp or placement step is wrong.

Track these metrics:

Metric	Meaning	Caveat
Training loss	Whether the model fits demonstrations	Fast low loss can mean overfitting
Action token accuracy	Discrete token correctness	Not a replacement for real rollouts
Validation rollout success	Whether the task is completed	Needs a fixed protocol
Intervention count	How often a human must rescue the run	Useful for debugging
Latency	Time from frame to action	Control needs consistency

A simple evaluation protocol:

For each task:
  run 20 trials
  randomize object position within safe region
  use the same natural-language instruction template
  record success/failure
  record failure reason:
    perception miss
    wrong object
    bad grasp
    collision / safety stop
    task sequencing error

When OpenVLA fails, classify the failure. If the model targets the wrong object, check camera placement, lighting, augmentation, and language labels. If the action scale is wrong, check normalization. If the policy moves in the right direction but too slowly, check control frequency and action clipping. If the robot grasps correctly but cannot complete the longer sequence, you may need more demonstrations or a recipe with action chunking such as OFT.
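
The evaluation protocol and failure classification can be logged with a few lines of Python. The trial records, the `summarize` helper, and the mapping from failure reason to next check are hypothetical examples that mirror the advice above, not outputs from a real run.

```python
from collections import Counter

# What to inspect first for each failure reason (mirrors the text above).
FAILURE_CHECKS = {
    "wrong object": "camera placement, lighting, augmentation, language labels",
    "wrong action scale": "normalization statistics",
    "too slow": "control frequency and action clipping",
    "task sequencing error": "more demonstrations or an action-chunking recipe",
}


def summarize(trials):
    """trials: iterable of (task, success, failure_reason or None)."""
    successes, totals, reasons = Counter(), Counter(), Counter()
    for task, success, reason in trials:
        totals[task] += 1
        if success:
            successes[task] += 1
        else:
            reasons[reason] += 1
    rates = {task: successes[task] / totals[task] for task in totals}
    return rates, reasons


# Made-up trial log for illustration.
trials = [
    ("pick red block", True, None),
    ("pick red block", False, "wrong object"),
    ("pick red block", True, None),
    ("pick red block", True, None),
    ("open drawer", False, "task sequencing error"),
    ("open drawer", True, None),
]

rates, reasons = summarize(trials)
for task, rate in rates.items():
    print(f"{task}: {rate:.0%} success")
for reason, count in reasons.most_common():
    print(f"{reason} x{count} -> check {FAILURE_CHECKS.get(reason, 'logs manually')}")
```

Recording the reason for every failed trial turns a single success rate into a prioritized debugging list.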

What Do the Results Tell Us?

OpenVLA has three result groups worth remembering.

First, direct evaluation on multiple robot platforms. The project page describes evaluations on the WidowX setup from Bridge V2 and the Google Robot from the RT-series of papers. OpenVLA outperforms RT-1-X, Octo, and even RT-2-X on the aggregate benchmark, while RT-2-X remains stronger on some difficult semantic tasks that require Internet concepts not strongly preserved during OpenVLA's robot-only fine-tuning.

Second, adaptation to new robot setups. The paper evaluates Franka-Tabletop and Franka-DROID. The important lesson is not that OpenVLA always beats Diffusion Policy. Diffusion Policy remains very strong for narrow, precise, single-instruction tasks. OpenVLA shines more in multi-object, multi-task settings where language grounding matters. That gives you a practical heuristic: if the task is a narrow industrial motion, a smaller policy may be enough; if the task has many objects and diverse instructions, a VLA becomes more attractive.

Third, parameter-efficient fine-tuning. LoRA shows that you can get close to full fine-tuning performance with much lower memory. This is why OpenVLA became a common baseline for robotics labs and startups: you do not need to reproduce a 64-A100 pretraining run before starting. You can begin from a public checkpoint and train adapters.

Practical Limitations

OpenVLA is powerful, but it is not plug-and-play for every robot.

Limitation	Impact	Risk reduction
Weak zero-shot transfer to unseen embodiments	Different kinematics or action spaces can fail	Collect demos and fine-tune
7B model latency	Control loop may be slow	Quantization, GPU server, OFT
Action normalization is critical	Wrong scale can be unsafe	Validate `unnorm_key`, clamp actions
Camera/domain shift	Wrong object or pose estimate	Calibration and augmentation
Semantic gaps	Some Internet concepts are not retained	Fine-tune diverse instructions

For serious deployment, put OpenVLA above a safety layer:

OpenVLA action
    |
    v
action unnormalization
    |
    v
safety filter: workspace, velocity, force, collision
    |
    v
low-level controller
    |
    v
robot hardware

Do not let a 7B model send raw commands directly to actuators without workspace limits and an emergency stop path. Robotics is different from chat: a bad output can damage equipment.
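
A minimal sketch of the workspace and step-size part of such a filter is shown below. The limits, the function name `filter_action`, and the assumption that the first three action entries are translation deltas in the same frame as `current_pos` are all illustrative. Force and collision checks are not covered, and a software clamp does not replace a hardware emergency stop.

```python
import numpy as np

# Illustrative limits; real values depend on the robot and the cell.
MAX_TRANSLATION = 0.02                         # metres per step
MAX_ROTATION = 0.1                             # radians per step
WORKSPACE_MIN = np.array([0.2, -0.3, 0.02])    # metres
WORKSPACE_MAX = np.array([0.6, 0.3, 0.40])     # metres


def filter_action(action, current_pos):
    """Clamp a 7-DoF delta action and keep the target inside the workspace box."""
    action = np.asarray(action, dtype=float)
    if action.shape != (7,) or not np.all(np.isfinite(action)):
        # Malformed output: let the caller trigger its stop path.
        raise ValueError("malformed action")

    safe = action.copy()
    safe[:3] = np.clip(safe[:3], -MAX_TRANSLATION, MAX_TRANSLATION)
    safe[3:6] = np.clip(safe[3:6], -MAX_ROTATION, MAX_ROTATION)

    # Clip the target position to the workspace, then convert back to a delta.
    target = np.clip(current_pos + safe[:3], WORKSPACE_MIN, WORKSPACE_MAX)
    safe[:3] = target - current_pos
    return safe


current_pos = np.array([0.59, 0.0, 0.10])
raw_action = [0.5, 0.0, -0.2, 0.0, 0.0, 1.0, 1.0]  # deliberately too large
safe_action = filter_action(raw_action, current_pos)
print(np.round(safe_action, 3))
```

In the example, the x delta is first limited to the per-step maximum and then shortened again so the end effector stops at the workspace boundary.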

When Should You Choose OpenVLA?

Choose OpenVLA if you are researching or prototyping language-conditioned robot manipulation with camera observations, multiple objects, and a realistic fine-tuning plan. Do not choose OpenVLA if you only need one narrow industrial task with strict cycle-time requirements, or if you do not have enough GPU/server capacity.

A quick comparison:

Option	Best fit	Weakness
Hand-coded motion + perception	Fixed task, high stability	Hard to scale language
ACT	Simple imitation learning, fast inference	Less generalization
Diffusion Policy	Contact-rich precise manipulation	From-scratch training, weaker language
OpenVLA	Multi-task, multi-object language grounding	7B model, needs careful fine-tuning
OpenVLA-OFT	Faster VLA adaptation and action chunking	Requires following the newer recipe

Beginner Checklist

If you want to get started in one or two weeks, follow this order:

Day 1-2:
  read the paper abstract and project page
  run minimal inference on a static image

Day 3-4:
  connect a camera stream
  verify prompt format and output shape
  do not connect robot actuators yet

Day 5-7:
  collect small demonstrations
  replay actions offline
  convert the dataset

Day 8-10:
  LoRA fine-tune
  monitor loss, action accuracy, validation samples

Day 11-14:
  run slow, bounded robot rollouts
  log failures
  improve data and normalization

The most important thing to learn from OpenVLA is the boundary between a foundation model and a robot system. The model gives you a strong prior from 970k trajectories, but the system still needs clean data, stable cameras, safe control, a repeatable evaluation protocol, and disciplined debugging.

Sources

Paper: OpenVLA: An Open-Source Vision-Language-Action Model
Project page: openvla.github.io
GitHub repo: openvla/openvla
Model card: openvla/openvla-7b on Hugging Face

Nguyễn Anh Tuấn
