Running the Pretrained OpenEAI-VLA with Qwen3-VL

OpenEAI-VLA is one of the most notable releases for the open-source robotics community in early June 2026: a Vision-Language-Action model with roughly 5B parameters, built on Qwen3-VL-4B-Instruct with a Diffusion Transformer action head. It comes with a public pretrained checkpoint, training and inference code, and a dataset pipeline, and it targets OpenEAI-Arm, a 6+1 DoF robot arm whose material cost is about 790 USD according to the paper.

The point is not just "one more VLA model". What makes OpenEAI-Platform interesting is that the authors try to open the whole stack: robot arm hardware, control, data format, pretraining, fine-tuning, and the policy server. For newcomers, this is a good opportunity to see how a modern VLA is packaged from paper to real robot. If you are new to VLAs or to diffusion/flow matching, read the architecture section a little more slowly before jumping into inference.

This post focuses on practice: installing the repo, downloading the pretrained checkpoint, preparing data, running the inference server, understanding the inputs and outputs, and knowing what to do if you want to fine-tune on a low-cost robot arm. Because OpenEAI-VLA is a new project, treat the guide below as a controlled reproduction workflow: get the server running first, check tensor shapes, send a mock request, and only then connect a real robot.

Project sources

The main sources to read:

Paper: OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform
Code: github.com/eai-yeslab/OpenEAI-VLA
Pretrained checkpoint: OpenEAI/OpenEAI-VLA-Pretrained
Processed dataset: OpenEAI/OpenEAI-Dataset
Backbone to download: Qwen/Qwen3-VL-4B-Instruct

The paper was submitted to arXiv on 2026-06-02. The abstract states that OpenEAI-Platform has two parts: OpenEAI-Arm, a low-cost 6+1 DoF arm, and OpenEAI-VLA, a reproducible VLA that uses Qwen3-VL-4B with a Diffusion Transformer action head. The repo currently covers installation, dataset conversion, pretraining, fine-tuning, and a FastAPI inference server. The Hugging Face checkpoint page lists a model size of about 5B parameters, tensor type F32, and a model.safetensors file of about 20.9 GB.

The paper's main idea

The problem OpenEAI wants to solve is reproducibility. Many strong VLAs such as π0/π0.5 report very good results, but their large-scale data and some training details are not fully open. Conversely, many low-cost robot arms lack precision, lack low-level control, or expose only a black-box interface, which makes data collection and policy deployment hard to reproduce.

OpenEAI-Platform chooses to open both sides:

Component	Role	Key point
OpenEAI-Arm	6+1 DoF robot arm	Material cost of about 790 USD, designed for desktop manipulation
FF-PID + action smoothing	Low-level control	Smooths the discrete action chunks from the VLA into continuous motion
OpenEAI-VLA	End-to-end policy	Takes images, instruction, and state; returns an action chunk
OpenEAI-Dataset	Unified data format	HDF5, state/action statistics, compressed images, per-dataset adapters
Policy server	Deployment	FastAPI + MessagePack; the robot client sends observations and receives actions

The logical flow can be pictured as follows:

Camera images + language instruction + proprioceptive state
                         |
                         v
              Qwen3-VL-4B-Instruct backbone
                         |
            learnable query embeddings readout
                         |
                         v
       Diffusion Transformer / flow-matching action expert
                         |
                         v
        50-step continuous action chunk for robot arm
                         |
                         v
        smoothing + low-level controller + real robot

The subtle part is the "learnable query embeddings". Instead of feeding all of the VLM's hidden states into the action head, OpenEAI-VLA adds a sequence of learnable query tokens to the Qwen3-VL input. After the forward pass, the model keeps only the hidden states at those query positions and uses them as conditioning embeddings for the action head. This creates a fixed-size bottleneck between perception/language and control: the action head's compute does not grow with the number of image patches or the length of the instruction, yet the model still learns which information to compress for controlling the robot.
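
The fixed-size bottleneck can be pictured with a toy readout. This is a minimal sketch in plain Python, not the repo's code: it assumes the query tokens sit at the end of the sequence, and `fake_hidden_states`, `readout_queries`, and the tiny hidden size are hypothetical names and values chosen for illustration. The only number taken from the config is the feature length of 20.

```python
# Toy illustration of a learnable-query readout.
# Assumption: the query tokens sit at the END of the VLM input sequence;
# the real layout in OpenEAI-VLA may differ.

NUM_QUERIES = 20   # "Feature length" from the public config
HIDDEN_DIM = 4     # toy size; the real Qwen hidden dim is 2560


def fake_hidden_states(seq_len, dim=HIDDEN_DIM):
    """Stand-in for the VLM's last-layer hidden states: one vector per token."""
    return [[float(t)] * dim for t in range(seq_len)]


def readout_queries(hidden_states, num_queries=NUM_QUERIES):
    """Keep only the hidden states at the query positions."""
    if len(hidden_states) < num_queries:
        raise ValueError("sequence is shorter than the number of query tokens")
    return hidden_states[-num_queries:]


# Two inputs with very different numbers of image/text tokens...
short_input = fake_hidden_states(seq_len=300 + NUM_QUERIES)
long_input = fake_hidden_states(seq_len=2000 + NUM_QUERIES)

# ...produce conditioning sequences of exactly the same shape.
cond_short = readout_queries(short_input)
cond_long = readout_queries(long_input)
print(len(cond_short), len(cond_short[0]))
print(len(cond_long), len(cond_long[0]))
```

Whatever the input length, the action head in this sketch always sees 20 conditioning vectors, which is the property the paragraph above describes.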

OpenEAI-VLA architecture

According to the paper and the public config, the model has these main components:

Parameter	Value in config
Backbone	`Qwen3-VL-4B-Instruct`
Resize image	`224`
Qwen hidden dim	`2560`
Action head hidden dim	`1664`
DiT layers	`18`
Attention heads	`32`
Action horizon	`50`
Denoise steps in config	`10`; the inference file sets `20`
Feature length	`20`

The default inference input in openeai/infer.py has 3 cameras:

batch = {
    "images": {
        "cam_left_wrist": left_wrist_rgb,
        "cam_right_wrist": right_wrist_rgb,
        "cam_high": third_person_rgb,
    },
    "state": robot_state,
    "prompt": "fold the towel",
}

The server resizes the images to 224x224, normalizes the state using the statistics stored in the processor, joins the prompt with 3 vision placeholders, and then calls model.infer(..., act=True). The output is actions, which are then unnormalized according to the dataset's data_key. If the checkpoint uses relative joint actions, the code has a branch that adds them back to the current state to produce absolute commands.
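
The bookkeeping around inference can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's processor: it assumes mean/std normalization (the real statistics may be quantile-based), it assumes every relative step is an offset from the state observed at inference time, and all function names and numbers are hypothetical.

```python
# Sketch of the state/action bookkeeping around inference.
# Assumption: mean/std normalization; the real processor may use quantiles.

def normalize(values, mean, std, eps=1e-8):
    """Map raw robot values to roughly zero-mean, unit-variance inputs."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Inverse of normalize(): map model outputs back to robot units."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


def relative_to_absolute(chunk, current_state):
    """Turn per-step joint offsets into absolute joint targets.

    Assumption: every step is an offset from the state observed at
    inference time, not from the previous step.
    """
    return [[c + d for c, d in zip(current_state, step)] for step in chunk]


# Hypothetical 3-joint example with made-up statistics.
state = [0.10, -0.20, 0.30]
state_mean, state_std = [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]
action_mean, action_std = [0.0, 0.0, 0.0], [0.02, 0.02, 0.02]

model_input = normalize(state, state_mean, state_std)
model_output = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0]]   # pretend 2-step chunk
deltas = [unnormalize(step, action_mean, action_std) for step in model_output]
targets = relative_to_absolute(deltas, state)
print(targets)
```

If the statistics used at inference differ from those used in training, the targets come out scaled or shifted, which is one reason to log the batch after the processor when debugging.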

In other words, the pretrained checkpoint is not a chatbot that looks at an image and answers in text. It is a robot policy, and the prompt is only one part of the conditioning. The output that matters is a continuous action chunk, which usually has to be sent through a controller at a suitable rate and with some form of smoothing.

Results in the paper

OpenEAI-Platform is evaluated on 4 real-world tasks:

Clean Table: pick and place objects on a table.
Make Tea: multi-step manipulation of rigid objects.
Fold Towel: deformable manipulation with a towel.
Fold T-shirt: dual-arm, long-horizon, deformable object.

The hardware comparison shows OpenEAI-Arm reaching an average success rate of 0.75 when running π0, compared with 0.71 for ARX R5 and 0.64 for AgileX Piper in the paper's setup. The hardware table also lists OpenEAI-Arm's material cost at about 0.79 kUSD, lower than 8.60 kUSD for ARX R5 and 2.16 kUSD for Piper.

Model comparison on OpenEAI-Arm:

Model	Clean Table avg	Make Tea final	Fold Towel final	Fold T-shirt final
ACT	0.72	0.60	0.33	0.00
Octo	0.20	0.00	0.00	multi-arm not supported
OpenVLA-oft	0.68	very low	0.27	0.00
π0	0.92	0.60	0.73	0.83
π0.5	0.96	0.80	0.80	0.83
OpenEAI-VLA	0.94	0.70	0.80	0.83

The practical takeaway: OpenEAI-VLA does not beat π0.5, but it comes very close to π0/π0.5 on the reported tasks while emphasizing that only open-source datasets were used for pretraining. That is why the "close to π0" label is justified, but you should still read the results within their scope: 4 real-robot tasks under the paper's own evaluation protocol, not a universal conclusion for every robot.

Preparing the machine

Start with a Linux machine that has an NVIDIA GPU. The F32 checkpoint is large, so 24 GB of VRAM can be tight depending on CUDA/PyTorch memory overhead. If you only want to read the code and test mock requests, a CPU-only machine is fine, but running the policy for real calls for a GPU.
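
A quick back-of-the-envelope calculation shows why 24 GB is tight. The sketch below uses only the 20.9 GB file size and the F32 tensor type quoted earlier; `weights_gb` is a hypothetical helper, and the bfloat16 row is an arithmetic comparison only, not a claim that the repo supports that dtype out of the box.

```python
# Rough memory estimate for the model weights alone (no activations,
# no CUDA context, no optimizer state).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_gb(num_params, dtype):
    """Size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 20.9 GB of F32 weights implies roughly this many parameters:
checkpoint_gb = 20.9
approx_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM["float32"]
print(f"approx. parameters: {approx_params / 1e9:.1f}B")

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gb(approx_params, dtype):.1f} GB for weights only")
```

The F32 weights alone take most of a 24 GB card, so the remaining headroom has to cover activations for three images plus the denoising loop.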

Practical minimum requirements:

Item	Recommendation
OS	Ubuntu 22.04 or equivalent
Python	3.10+
GPU	RTX 4090/A5000/A6000/A100; more VRAM makes things easier
Disk	80 GB for code + checkpoint; several TB if you download the full dataset
RAM	32 GB or more
Network	Needed to download models from Hugging Face

Install the base tools:

sudo apt update
sudo apt install -y git git-lfs ffmpeg libgl1
git lfs install

Create the environment:

conda create -n openeai python=3.10 -y
conda activate openeai

Installing OpenEAI-VLA

Clone the repo and install the dependencies:

git clone https://github.com/eai-yeslab/OpenEAI-VLA.git
cd OpenEAI-VLA

pip install -r requirements.txt
pip install -e .

The repo's requirements.txt includes packages such as torch, torchvision, transformers==4.57.1, accelerate, deepspeed, h5py, datasets, uvicorn, fastapi, opencv-python, scipy, imageio, and pillow. If your CUDA environment does not match, install PyTorch first using the official command for your machine's CUDA version, and only then install the requirements.

Basic import check:

python - <<'PY'
import torch
import transformers
print("torch", torch.__version__, "cuda", torch.cuda.is_available())
print("transformers", transformers.__version__)
PY

Downloading the pretrained checkpoint

Download the Qwen3-VL backbone:

huggingface-cli login
huggingface-cli download Qwen/Qwen3-VL-4B-Instruct --repo-type model

Download the pretrained OpenEAI-VLA:

huggingface-cli download OpenEAI/OpenEAI-VLA-Pretrained \
  --repo-type model \
  --local-dir log/OpenEAI-VLA-Pretrained/openeai

The repo's inference script has this variable by default:

CKPT_PATH = "log/finetune/openeai_finetune_openeaiarm_fold_towel/checkpoints/100000/openeai"

To run the pretrained model, point CKPT_PATH at the checkpoint directory you just downloaded, for example:

CKPT_PATH = "log/OpenEAI-VLA-Pretrained/openeai"

If you would rather not edit the source much, you can keep a local branch or apply a small patch that reads the path from an environment variable:

import os
CKPT_PATH = os.environ.get(
    "OPENEAI_CKPT",
    "log/OpenEAI-VLA-Pretrained/openeai",
)

Then run:

export OPENEAI_CKPT=log/OpenEAI-VLA-Pretrained/openeai
python openeai/infer.py

Running the inference server

The server uses FastAPI and listens on port 8000:

python openeai/infer.py

Once loading succeeds, the server will:

Read OpenEAIVLAConfig from the checkpoint.
Set config.denoise_steps = 20.
Select Qwen3-VL-4B-Instruct if the checkpoint path contains qwen3.
Load the processor and the model.
Move the model to cuda:0 with the default dtype torch.float32.
Create the POST /infer endpoint.

Observations are sent to the server as MessagePack, not plain JSON. A pseudo-client:

import msgpack
import numpy as np
import requests


def pack_array(obj):
    """msgpack `default` hook: encode NumPy arrays as a tagged dict."""
    if isinstance(obj, np.ndarray):
        return {
            b"__ndarray__": True,
            b"data": obj.tobytes(),
            b"dtype": obj.dtype.str,
            b"shape": obj.shape,
        }
    return obj


# Dummy observation: three black 480x640 RGB frames and a zero state vector.
obs = {
    "images": {
        "cam_left_wrist": np.zeros((480, 640, 3), dtype=np.uint8),
        "cam_right_wrist": np.zeros((480, 640, 3), dtype=np.uint8),
        "cam_high": np.zeros((480, 640, 3), dtype=np.uint8),
    },
    "state": np.zeros((14,), dtype=np.float32),
    "prompt": "clean the table",
}

# Serialize with MessagePack and POST the raw bytes to the server.
payload = msgpack.packb(obs, default=pack_array, use_bin_type=True)
resp = requests.post(
    "http://127.0.0.1:8000/infer",
    data=payload,
    headers={"Content-Type": "application/msgpack"},
    timeout=60,  # avoid hanging forever if the server is not up
)
# Print only the status code and the size of the response body.
print(resp.status_code, len(resp.content))

With a real robot, you do not send black images like the ones above. You need to grab synchronized frames from the 3 cameras, read the current joint/gripper state, and then stream the returned action chunk to the controller.
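
One common way to consume action chunks is to execute only the first part of each chunk and then query the policy again with a fresh observation. The loop below is a sketch of that pattern, not the OpenEAI client: `get_observation`, `infer`, and `send_action` are hypothetical hooks you would implement, and the 25-step replanning interval and 30 Hz control rate are arbitrary assumptions.

```python
import time


def run_policy_loop(get_observation, infer, send_action,
                    steps_per_request=25, control_hz=30.0, max_requests=3):
    """Query the policy, execute part of each chunk, then query again.

    All three callables are hypothetical hooks:
      get_observation() -> dict with images, state, prompt
      infer(obs)        -> list of actions (one chunk)
      send_action(a)    -> forwards one action to the safety/controller layer
    """
    period = 1.0 / control_hz
    executed = 0
    for _ in range(max_requests):
        chunk = infer(get_observation())
        # Execute only the first part of the chunk, then re-plan from a
        # fresh observation so errors do not pile up over the full chunk.
        for action in chunk[:steps_per_request]:
            send_action(action)
            executed += 1
            time.sleep(period)
    return executed


# Dry run with fake hooks: no server, no robot.
log = []
n = run_policy_loop(
    get_observation=lambda: {"prompt": "clean the table"},
    infer=lambda obs: [[0.0] * 14 for _ in range(50)],
    send_action=log.append,
    control_hz=1000.0,   # fast only because this is a dry run
)
print(n, len(log))
```

Running the loop against fake hooks first makes it easy to check timing and logging before any motor is enabled.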

Preparing data for fine-tuning

If your goal is just to "run the pretrained model", you can stop at the inference server. But if you want your robot to perform a specific task, you will almost certainly need to fine-tune. The reason is that action space, camera placement, gripper, calibration, and kinematics differ from robot to robot. A pretrained VLA gives the model a good prior, but it does not replace demonstration data on the real embodiment.

OpenEAI-Dataset uses a unified HDF5 format:

data/
  OpenEAI-Dataset/
    meta/
      pretrain_meta.json
      bc_z_meta.npy
      droid_meta.npy
    bc_z/
      0000.hdf5
        episode_0/
          attrs: instruction, action_type, length
          action: (traj_length, action_dim)
          state:  (traj_length, state_dim)
          image_mid: compressed image sequence

At a minimum, an episode needs:

instruction: a natural-language command, for example "put the red cup on the plate".
state: the robot's proprioception, usually joint positions and gripper state.
action: the target action, using the same convention as the controller.
image_*: camera frames, compressed or stored in a format the loader supports.
state_stat and action_stat: mean/std/quantiles for normalizing and unnormalizing.
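
Before converting a dataset, it can help to run a structural check on each episode. The sketch below works on a plain dict standing in for an HDF5 group, so it runs without h5py; the key names follow the layout described above, while `validate_episode` and the specific checks (including one image per timestep) are assumptions for illustration.

```python
def validate_episode(episode):
    """Return a list of problems found in one episode (empty list = OK).

    `episode` is a plain dict standing in for an HDF5 group. The checks
    are assumptions, e.g. that every stream has one entry per timestep.
    """
    problems = []
    for key in ("instruction", "state", "action"):
        if key not in episode:
            problems.append(f"missing '{key}'")
    image_keys = [k for k in episode if k.startswith("image_")]
    if not image_keys:
        problems.append("no image_* stream")
    if problems:
        return problems

    length = len(episode["action"])
    if len(episode["state"]) != length:
        problems.append("state and action lengths differ")
    for k in image_keys:
        if len(episode[k]) != length:
            problems.append(f"{k} length differs from action length")
    if not episode["instruction"].strip():
        problems.append("empty instruction")
    return problems


good = {
    "instruction": "put the red cup on the plate",
    "state": [[0.0] * 7] * 3,
    "action": [[0.0] * 7] * 3,
    "image_mid": [b"jpeg0", b"jpeg1", b"jpeg2"],
}
bad = dict(good, state=[[0.0] * 7] * 2, instruction=" ")
print(validate_episode(good))
print(validate_episode(bad))
```

The same checks could be applied to real HDF5 groups by reading dataset shapes instead of list lengths.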

Convert the dataset:

cd data_utils
bash run.sh openeai_arm_my_task

Then edit config/sft_openeai_multimodal.json:

{
  "task_name": "finetune",
  "pretrain_ckpt_dir": "log/OpenEAI-VLA-Pretrained/openeai",
  "train_steps": 100000,
  "action_chunk": 50,
  "data": {
    "data_root": "data",
    "name": "openeai_arm_my_task",
    "batch_size": 2,
    "resize_size": 224,
    "use_multimodal": true,
    "multimodal_root": "data/OpenEAI-Dataset/multi_modal",
    "multimodal_weight": 0.4
  },
  "optimizer": {
    "lr": 3e-5,
    "weight_decay": 1e-2
  }
}
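
A small pre-flight check can catch config values that disagree with the pretrained setup before a long run starts. This is a sketch under assumptions: `check_sft_config` is a hypothetical helper, the expected values (action horizon 50, image size 224) come from the config table earlier in this post, and the config text is a trimmed copy of the example above.

```python
import json

# Trimmed copy of the fine-tuning config shown above.
CONFIG_TEXT = """
{
  "task_name": "finetune",
  "pretrain_ckpt_dir": "log/OpenEAI-VLA-Pretrained/openeai",
  "train_steps": 100000,
  "action_chunk": 50,
  "data": {
    "name": "openeai_arm_my_task",
    "batch_size": 2,
    "resize_size": 224,
    "use_multimodal": true,
    "multimodal_root": "data/OpenEAI-Dataset/multi_modal",
    "multimodal_weight": 0.4
  }
}
"""


def check_sft_config(cfg, action_horizon=50, image_size=224):
    """Return warnings for values that disagree with the pretrained setup."""
    warnings = []
    if cfg.get("action_chunk") != action_horizon:
        warnings.append("action_chunk differs from the pretrained action horizon")
    data = cfg.get("data", {})
    if data.get("resize_size") != image_size:
        warnings.append("resize_size differs from the pretrained image size")
    if data.get("use_multimodal") and not data.get("multimodal_root"):
        warnings.append("use_multimodal is true but multimodal_root is empty")
    if not cfg.get("pretrain_ckpt_dir"):
        warnings.append("pretrain_ckpt_dir is not set")
    return warnings


cfg = json.loads(CONFIG_TEXT)
print(check_sft_config(cfg))                          # no warnings expected
print(check_sft_config(dict(cfg, action_chunk=16)))   # one warning
```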

Training and fine-tuning

Pretraining from scratch is a big undertaking. The public config sets:

{
  "train_steps": 100000,
  "action_chunk": 50,
  "data": {
    "mixed": true,
    "data_root": "data/OpenEAI-Dataset",
    "batch_size": 4,
    "resize_size": 224
  },
  "optimizer": {
    "lr": 1e-4,
    "weight_decay": 1e-2
  },
  "scheduler": {
    "warmup_steps": 5000,
    "decay_steps": 100000,
    "decay_lr": 1e-5
  }
}
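
The scheduler fields suggest a warmup followed by a decay from 1e-4 to 1e-5. The exact curve is not stated here, so the sketch below assumes linear warmup and cosine decay purely for illustration; `lr_at` is a hypothetical helper, not the repo's scheduler.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup_steps=5000,
          decay_steps=100000, decay_lr=1e-5):
    """Learning rate at a given step (ASSUMED linear warmup + cosine decay)."""
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= decay_steps:
        return decay_lr
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


for step in (0, 4999, 5000, 52500, 100000):
    print(step, f"{lr_at(step):.2e}")
```

If the real scheduler uses a different shape, only the middle of the curve changes; the endpoints follow directly from the config values.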

The repo gives a pretraining example with 8 GPUs on 4 nodes:

bash scripts/pretrain.sh openeai_pretrain

For beginners, fine-tuning is the more sensible route:

bash scripts/sft.sh openeai_arm_my_task

scripts/sft.sh uses accelerate launch --config_file config/single_node_zero2.yaml and calls openeai/sft_zero2.py. If you have only one GPU, review the Accelerate/DeepSpeed config so you can lower the batch size, adjust gradient accumulation, or enable checkpointing. With an F32 checkpoint and Qwen3-VL-4B, do not expect comfortable fine-tuning on a 12 GB GPU.

A practical fine-tuning checklist:

Step	What to check
1	Camera order matches inference: left wrist, high, right wrist
2	State dimension matches the processor stats
3	Action dimension matches the controller
4	Action convention is explicit: absolute joint, relative joint, or end-effector delta
5	Prompts are consistent between training and inference
6	Episodes are filtered by success/failure if the raw data contains mistakes
7	Replay actions offline before using the policy

Deploying to a low-cost robot arm

OpenEAI-VLA returns action chunks, but a robot arm needs a continuous trajectory. The paper uses FF-PID, dynamics feedforward, and three-point rolling Bézier action chunking to reduce discontinuities at chunk boundaries. If you use a different arm, you should at least have a safety/controller layer between the policy and the motors:

OpenEAI-VLA action chunk
        |
        v
action clipping + rate limit
        |
        v
joint limit check + collision zone check
        |
        v
trajectory smoothing
        |
        v
low-level position/velocity/torque controller
        |
        v
robot arm
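
The first two boxes of that pipeline can be sketched as a pure function. This is a minimal illustration, not the paper's controller: `safe_targets` is a hypothetical helper, it assumes the policy outputs absolute joint targets, and it does not cover collision-zone checks or smoothing.

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def safe_targets(chunk, current, joint_limits, max_step):
    """Clamp a chunk of absolute joint targets before it reaches the motors.

    chunk        : list of joint-target lists from the policy
    current      : joint positions right now (assumed to be within limits)
    joint_limits : list of (low, high) per joint
    max_step     : largest allowed change per joint per control step
    """
    safe = []
    previous = list(current)
    for step in chunk:
        limited = []
        for q, prev, (low, high) in zip(step, previous, joint_limits):
            q = clip(q, low, high)                          # joint limit
            q = clip(q, prev - max_step, prev + max_step)   # rate limit
            limited.append(q)
        safe.append(limited)
        previous = limited
    return safe


# Hypothetical 2-joint arm: the policy asks for a big jump and an
# out-of-range target; both get tamed.
limits = [(-1.0, 1.0), (-1.0, 1.0)]
raw_chunk = [[0.5, -2.0], [0.5, -2.0]]
print(safe_targets(raw_chunk, current=[0.0, 0.0],
                   joint_limits=limits, max_step=0.1))
```

Because the rate limit is applied relative to the previous safe target, a single bad prediction cannot move any joint by more than `max_step` per control step.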

Do not send model output straight to the motor driver without checking it first. On a low-cost arm, backlash, delay, servo saturation, and calibration error can make the policy fail even when inference is correct. A safe way to test:

Run the server with real cameras but the robot disabled, and log the actions.
Replay the actions in a simulator or a dry-run visualization.
Enable the robot at low speed with a restricted workspace.
Test a simple prompt such as "move to home" or "pick the cup".
Increase speed only after the actions are smooth and stay within joint limits.

If you are building a LeRobot or OpenArm pipeline, also compare how demonstrations are collected and actions are normalized in the related posts at the end of this article.

Common errors

Error	Likely cause	Fix
CUDA out of memory	F32 checkpoint, high batch size or denoise steps	Use a GPU with more VRAM, reduce denoise steps, check the dtype
`ModuleNotFoundError`	`pip install -e .` was not run	Do the editable install inside the repo
Output action has the wrong shape	State/action stats do not match the dataset	Check the meta `.npy` files, action_dim, state_dim
Robot jerks at chunk boundaries	No smoothing	Add interpolation or a rate limiter
Prompt has no effect	Fine-tuning prompts are sparse or inconsistent	Standardize instruction templates
Policy just stands still	Wrong normalization or wrong camera order	Log the batch after the processor, before inference
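
For the "robot jerks at chunk boundaries" row, the simplest interpolation is a linear cross-fade between the unexecuted tail of the old chunk and the head of the new one. This is a swapped-in, simpler technique than the paper's three-point rolling Bézier scheme; `blend_chunks` is a hypothetical helper, and it assumes both chunks are aligned on the same time steps and ignores inference latency.

```python
def blend_chunks(old_tail, new_head):
    """Linearly cross-fade from the old chunk's remaining steps to the new chunk.

    old_tail : steps of the previous chunk that were not executed yet
    new_head : the freshly predicted chunk, aligned to the same time steps
    Both are lists of joint-target lists; the shorter one sets the overlap.
    """
    overlap = min(len(old_tail), len(new_head))
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)   # weight of the new chunk, rising toward 1
        blended.append([
            (1.0 - w) * a + w * b
            for a, b in zip(old_tail[i], new_head[i])
        ])
    # After the overlap, follow the new chunk unchanged.
    return blended + new_head[overlap:]


# One joint: the old plan stays at 0.0, the new plan jumps to 1.0.
old_tail = [[0.0], [0.0], [0.0]]
new_head = [[1.0], [1.0], [1.0], [1.0], [1.0]]
print(blend_chunks(old_tail, new_head))
```

Instead of a step from 0.0 to 1.0, the commanded target ramps up over the overlapping steps and then follows the new chunk.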

When should you use OpenEAI-VLA?

Use OpenEAI-VLA if you want an open VLA stack for learning, fine-tuning, and deploying on manipulation tasks with camera + language + state. It suits small labs because the code and checkpoint are public, the dataset format is clear, and the paper also covers control. It is also a good fit if you want to compare against π0/OpenVLA/ACT in the same pipeline.

Do not treat the pretrained checkpoint as "plug it into any robot and it works". A VLA depends heavily on embodiment, cameras, action representation, and data distribution. On a low-cost robot arm, the mechanics and control matter as much as the model. If you have only one camera, a different action space, or a very different gripper, you will still need an adapter and fine-tuning data.

In short: OpenEAI-VLA is a good step for open-source robotics because it turns the question "how do I train a VLA like the paper?" into a more concrete pipeline. Beginners should start with the inference server and mock requests; if you already have a robot, collect 20-100 clean demonstrations for one narrow task before expanding to multi-task.

Nguyễn Anh Tuấn

Related posts

LaST-R1: Fine-tuning a VLA with Latent CoT and RL to reach 99.8%

TORL-VLA: Fine-tuning a VLA with Tactile Sensing and Online RL

Why do Multi-Agent systems beat a plain VLA? | AI Manipulation Agents #1