Ark v1.5: A Python Framework for Sim-to-Real Robot Learning

Have you ever had a "headache" trying to train an ACT policy in a simulator and then bring it to a real robot, only to realize you have to rewrite half the pipeline because the ROS messages, joint encoders, and frame names are all different? Most robotics engineers know this pain, and it is the reason the Ark framework was created. In April 2026, a team from Huawei Noah's Ark Lab, together with partners from University College London, the University of Oxford, and TU Darmstadt, released Ark v1.5, an open-source Python framework they describe as "PyTorch + Gym for robotics".

This article walks through Ark end to end: the ideas in the paper, the client-server architecture, installation, training an ACT policy in simulation, and then switching to a real robot by changing a single config flag. If you already know LeRobot or have read the Imitation Learning series, it will be easy to follow.

What is Ark? Why does it matter in 2026?

Before Ark, training an imitation learning policy for a robot meant combining at least four stacks:

ROS 2 for the hardware interface and messaging.
PyBullet/MuJoCo/Isaac for simulation (each with its own API).
PyTorch for training (with a custom dataset format).
Assorted glue code to bridge the three pieces above.

As a result, students can end up spending around 70% of their time on boilerplate and only 30% on research. Worse, "switching sim → real" often means rewriting the entire data pipeline, because ROS topics and simulator state behave differently.

The paper Ark: An Open-source Python-based Framework for Robot Learning (Dierking et al., 2026) makes four main contributions:

A unified Gym-style environment: the same env.reset() / env.step(action) API works on both the simulator and the real robot.
A lightweight client-server architecture with publisher-subscriber networking that can run distributed over a LAN.
Modular components: control, SLAM, motion planning, system identification, and visualization, all plug-and-play.
Native ROS interoperability: a built-in bridge publishes and subscribes to ROS 2 topics without needing rosbridge.

In addition, v1.5 (April 2026) adds optional C/C++ bindings for real-time performance when needed (for example, a 1 kHz control loop for a legged robot).
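Why a 1 kHz loop pushes you toward compiled code is easier to see as a time budget: at 1 kHz every tick has 1 ms, at 30 Hz about 33 ms. The sketch below is a hypothetical pure-Python fixed-rate scheduler (the class name is ours, not an Ark API). It schedules absolute deadlines so small delays do not accumulate, and it reports how late each tick was, which is the jitter you would measure before deciding whether Python is fast enough.

```python
import time


class FixedRateLoop:
    """Hypothetical fixed-rate scheduler (not part of Ark).

    Deadlines are absolute (start + k * period), so one slow tick does not
    shift every later tick. sleep() returns how late the tick finished
    relative to its deadline, including any oversleep by the OS.
    """

    def __init__(self, hz: float) -> None:
        self.period = 1.0 / hz
        self._next = time.perf_counter() + self.period

    def sleep(self) -> float:
        remaining = self._next - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        # Lateness relative to the deadline (0.0 when on time).
        lateness = max(0.0, time.perf_counter() - self._next)
        self._next += self.period
        # If we fell more than a full period behind, resynchronize instead of
        # firing a burst of back-to-back ticks to catch up.
        if self._next < time.perf_counter():
            self._next = time.perf_counter() + self.period
        return lateness


if __name__ == "__main__":
    loop = FixedRateLoop(hz=30)
    worst = 0.0
    for _ in range(30):  # about one second of a 30 Hz control loop
        # read sensors, run the policy, send the command here
        worst = max(worst, loop.sleep())
    print(f"period={loop.period * 1000:.1f} ms, worst lateness={worst * 1000:.3f} ms")
```

At 30 Hz a millisecond of scheduler jitter is noise; at 1 kHz it is a whole missed tick, which is the case the C/C++ bindings target.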

Overall architecture

Ark has three main layers:

┌─────────────────────────────────────────────┐
│  Application Layer                          │
│  (your training script, Gym env, RL agent)  │
└─────────────────────────────────────────────┘
                    ↕ Python API
┌─────────────────────────────────────────────┐
│  Ark Core                                   │
│  - Environment abstraction (sim ↔ real)     │
│  - Pub/Sub messaging                        │
│  - Modules: control, SLAM, planning, ID     │
└─────────────────────────────────────────────┘
                    ↕ Drivers
┌─────────────────────────────────────────────┐
│  Backend Layer                              │
│  - Sim: PyBullet (default), MuJoCo (opt)    │
│  - Real: ROS 2 bridge, custom drivers       │
└─────────────────────────────────────────────┘

The key point: when you write application code, you only see the Environment interface. Whether the environment is running in PyBullet or driving a real ViperX300s, the code does not change. Ark calls this an embodiment-agnostic API, analogous to torch.device('cuda') vs torch.device('cpu') in PyTorch.
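A minimal sketch of that idea, assuming nothing about Ark's internals: the application talks to one environment class, and only a backend lookup differs between sim and real. All names here (make_arm_env, SimBackend, FakeHardwareBackend) are hypothetical; a real hardware backend would publish commands and read joint states over ROS 2 instead of storing values in memory.

```python
from typing import Any, Dict, List, Protocol, Tuple

Obs = Dict[str, Any]


class Backend(Protocol):
    """What the environment needs from any backend (hypothetical interface)."""

    def read_joints(self) -> List[float]: ...

    def send_joints(self, q: List[float]) -> None: ...


class SimBackend:
    """Stand-in simulator: commanded joint positions are reached instantly."""

    def __init__(self, n_joints: int = 6) -> None:
        self._q = [0.0] * n_joints

    def read_joints(self) -> List[float]:
        return list(self._q)

    def send_joints(self, q: List[float]) -> None:
        self._q = list(q)


class FakeHardwareBackend(SimBackend):
    """Placeholder for a real driver; a real one would publish commands to a
    ROS 2 topic and read joint states back instead of storing them locally."""


class ArmEnv:
    """Gym-style environment; application code never touches the backend."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend

    def reset(self) -> Obs:
        n = len(self.backend.read_joints())
        self.backend.send_joints([0.0] * n)
        return {"joint_pos": self.backend.read_joints()}

    def step(self, delta: List[float]) -> Tuple[Obs, float, bool, Dict[str, Any]]:
        q = [qi + di for qi, di in zip(self.backend.read_joints(), delta)]
        self.backend.send_joints(q)
        return {"joint_pos": self.backend.read_joints()}, 0.0, False, {}


def make_arm_env(backend: str) -> ArmEnv:
    # The only thing that changes between sim and real is this lookup.
    backends = {"sim": SimBackend, "real": FakeHardwareBackend}
    return ArmEnv(backends[backend]())


# Identical application code for both backends.
for name in ("sim", "real"):
    env = make_arm_env(name)
    obs = env.reset()
    obs, reward, done, info = env.step([0.1] * 6)
```

The design choice mirrors the torch.device analogy: the training and inference loops depend only on reset/step and the observation dict, so adding a backend never touches them.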

Pub/sub messaging is an important design decision. Instead of the tight coupling of a ROS service call, Ark uses asynchronous topics. Each sensor (camera, IMU, joint encoder) is a publisher; each controller is a subscriber. This enables:

Distributed setups: data collection runs on a Jetson Orin next to the robot while training runs on a server with an RTX 4090; only the IP addresses in the config change.
Hot-swappable hardware: replace a RealSense camera with a ZED without changing any training code.
Replay: record the pub/sub stream and replay it exactly as it was collected.
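The decoupling behind these three points can be sketched with a tiny in-process bus. This is a hypothetical, synchronous stand-in, not Ark's asynchronous, networked transport, and the topic names are made up: publishers and subscribers share only a topic name, so swapping the camera that publishes /camera/rgb does not touch the subscriber, and the recorded log can be replayed into a fresh bus.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Tuple


class Bus:
    """Hypothetical in-process pub/sub bus with recording (not Ark's transport).

    Delivery is synchronous for clarity; the decoupling idea is the same as
    with asynchronous topics: publishers and subscribers only share a name.
    """

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self.log: List[Tuple[str, Any]] = []

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subs[topic].append(callback)

    def publish(self, topic: str, msg: Any) -> None:
        self.log.append((topic, msg))  # keep every message so it can be replayed
        for callback in self._subs[topic]:
            callback(msg)

    def replay(self, log: List[Tuple[str, Any]]) -> None:
        """Re-deliver a recorded stream to this bus's subscribers, in order."""
        for topic, msg in log:
            for callback in self._subs[topic]:
                callback(msg)


# Usage: the controller only knows the topic name, not which camera publishes.
frames: List[Any] = []
live = Bus()
live.subscribe("/camera/rgb", frames.append)
live.publish("/camera/rgb", "realsense_frame_0")
live.publish("/camera/rgb", "zed_frame_1")  # swapped camera, same topic
live.publish("/imu", {"ax": 0.0})           # recorded even with no subscriber

replayed: List[Any] = []
offline = Bus()
offline.subscribe("/camera/rgb", replayed.append)
offline.replay(live.log)
```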

Quick comparison with other frameworks

Framework	Sim/Real unified	Imitation algos	ROS bridge	Hardware swap
Ark v1.5	✅ Native	ACT, Diffusion Policy	Native	Easy
LeRobot	Partial	ACT, Diffusion, π0	External	Medium
Isaac Lab	Sim only	RL focus	Limited	Sim-only
ROS 2 + PyTorch	Manual glue	DIY	Native	Hard

Ark does not replace LeRobot; the two complement each other. LeRobot is strong on its model zoo and Hugging Face Hub integration, while Ark is stronger on sim-real switching and native ROS support. Many teams use both.

Installing Ark v1.5

Hardware requirements: a GPU with 8 GB+ VRAM (RTX 3060 or better), Ubuntu 22.04 (recommended), and Python 3.10. Apple Silicon (M-series) Macs are also supported, but need Python 3.11 and PyBullet installed from conda.

# Create a workspace
mkdir Ark && cd Ark

# Setup conda env
conda create -n ark_env python=3.10 -y
conda activate ark_env

# Clone + install ark_framework
git clone https://github.com/Robotics-Ark/ark_framework.git
cd ark_framework
pip install -e .
cd ..

# Install ark_types (message definitions)
git clone https://github.com/Robotics-Ark/ark_types.git
cd ark_types
pip install -e .
cd ..

# Verify
ark --help

On macOS:

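conda create -n ark_env python=3.11 -y
conda activate ark_env
# Install PyBullet from conda-forge first so pip does not try to build it from source
conda install -c conda-forge pybullet -y
git clone https://github.com/Robotics-Ark/ark_framework.git && cd ark_framework
pip install -e . && cd ..
git clone https://github.com/Robotics-Ark/ark_types.git && cd ark_types
pip install -e . && cd ..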

Optional: install MuJoCo if you want more accurate physics for contact-rich tasks:

pip install mujoco==3.2.0

For a closer look at how MuJoCo compares with other simulators, see the MuJoCo vs Isaac Lab Deep Dive article.

Example 1: Push task with the ViperX300s

This is the original demo from the Ark paper. You will:

Run the ViperX300s in PyBullet with a block on the table.
Collect 50 demos by teleoperation (arrow keys).
Train an ACT policy.
Run inference in sim, then switch to the real robot by changing a single config line.

Step 1: Create the environment

from ark.envs import make_env

# Sim mode
env = make_env(
    "viperx_push-v0",
    backend="pybullet",     # simulator; for the real robot, Step 5 loads a ros2 config instead
    render=True,
    obs_modalities=["joint_pos", "rgb"],
)

obs = env.reset(seed=42)
print(obs.keys())  # dict_keys(['joint_pos', 'rgb'])

Step 2: Collect data by teleoperation

from ark.teleop import KeyboardTeleop

teleop = KeyboardTeleop(env)
recorder = env.recorder("data/push_demos.zarr")

for episode in range(50):
    obs = env.reset()
    recorder.start_episode()
    done = False
    while not done:
        action = teleop.get_action()      # (joint_pos_delta,)
        next_obs, reward, done, info = env.step(action)
        recorder.add_step(obs, action, reward)  # pair each action with the obs it was chosen from
        obs = next_obs
    recorder.end_episode()

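The output is a .zarr folder containing observations, actions, and rewards. Zarr is a chunked array format that many tools can read, but LeRobot's own dataset format (LeRobotDataset) stores Parquet files plus MP4 videos, so reusing these demos in LeRobot likely needs a conversion step.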
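The article does not show how episodes are laid out inside that store. A common layout, used for example by the original Diffusion Policy codebase, concatenates all steps into flat arrays and keeps an episode_ends index. The in-memory sketch below illustrates that idea with hypothetical names; it is not Ark's recorder.

```python
from typing import Any, Dict, List


class EpisodeBuffer:
    """Hypothetical in-memory episode store: flat per-key lists plus an
    episode_ends index (exclusive end of each episode). Not Ark's recorder."""

    def __init__(self) -> None:
        self.data: Dict[str, List[Any]] = {"obs": [], "action": [], "reward": []}
        self.episode_ends: List[int] = []

    def add_step(self, obs: Any, action: Any, reward: float) -> None:
        self.data["obs"].append(obs)
        self.data["action"].append(action)
        self.data["reward"].append(reward)

    def end_episode(self) -> None:
        # Record where this episode stops in the flat arrays.
        self.episode_ends.append(len(self.data["action"]))

    @property
    def num_episodes(self) -> int:
        return len(self.episode_ends)

    def episode(self, i: int) -> Dict[str, List[Any]]:
        """Slice episode i back out of the flat arrays."""
        start = self.episode_ends[i - 1] if i > 0 else 0
        end = self.episode_ends[i]
        return {key: values[start:end] for key, values in self.data.items()}


# Usage: two short episodes of 3 and 2 steps.
buf = EpisodeBuffer()
for t in range(3):
    buf.add_step({"joint_pos": [float(t)]}, t, 0.0)
buf.end_episode()
for t in range(2):
    buf.add_step({"joint_pos": [float(t)]}, 10 + t, 1.0)
buf.end_episode()
second = buf.episode(1)
```

Flat arrays plus an index make it cheap to sample fixed-length windows (such as ACT action chunks) across the whole dataset while still respecting episode boundaries.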

Step 3: Train the ACT policy

Ark ships YAML configs for ACT and Diffusion Policy. See the in-depth ACT series or the Diffusion Policy article if you are not yet familiar with these architectures.

ark train \
    --config configs/act_viperx_push.yaml \
    --dataset data/push_demos.zarr \
    --output checkpoints/act_push

The act_viperx_push.yaml file (abridged):

algo: act
horizon: 16            # action chunk length
hidden_dim: 512
nheads: 8
enc_layers: 4
dec_layers: 6
batch_size: 64
epochs: 200
lr: 1.0e-4             # decimal point needed: PyYAML (YAML 1.1) reads "1e-4" as a string
optimizer: adamw
loss: l1
backbone: resnet18

On an RTX 4090, training 200 epochs on 50 episodes (~5,000 transitions) takes about 2 hours; on an RTX 3060, about 6–8 hours.
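To put those times in perspective, a quick estimate of the number of optimizer updates, assuming one epoch is one pass over the ~5,000 transitions with batch_size: 64. The article does not say how Ark samples ACT chunks per epoch, so treat the helper below (a hypothetical name) as a back-of-the-envelope calculation.

```python
import math


def gradient_steps(transitions: int, batch_size: int, epochs: int,
                   drop_last: bool = False) -> int:
    """Optimizer updates when each epoch is one pass over the transitions."""
    if drop_last:
        per_epoch = transitions // batch_size          # discard the partial batch
    else:
        per_epoch = math.ceil(transitions / batch_size)  # keep the partial batch
    return per_epoch * epochs


steps = gradient_steps(5000, 64, 200)   # 79 batches/epoch * 200 epochs
seconds_per_step = 2 * 3600 / steps     # implied by the ~2 h RTX 4090 figure
print(steps, round(seconds_per_step, 2))
```

Roughly 15,800 updates in about 2 hours works out to a bit under half a second per step, which is plausible for a ResNet-18 backbone on image batches of 64.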

Step 4: Inference in sim

from ark.policies import load_policy

policy = load_policy("checkpoints/act_push/best.ckpt")

obs = env.reset()
done = False
while not done:
    action = policy.predict(obs)
    obs, _, done, _ = env.step(action)
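The loop above calls policy.predict once per step, but with horizon: 16 an ACT policy outputs a chunk of 16 future actions. The article does not say whether Ark's predict executes chunks open-loop or blends them. The sketch below shows temporal ensembling as described in the ACT paper: query every step, collect all predictions that cover the current timestep, and average them with weights exp(-m * i), where i = 0 is the oldest prediction. The function name is hypothetical, actions are scalars for brevity, and predict_chunk takes the step index only so the example is testable; a real policy would take the current observation.

```python
import math
from collections import defaultdict
from typing import Callable, DefaultDict, List, Sequence


def run_with_temporal_ensembling(
    predict_chunk: Callable[[int], Sequence[float]],
    steps: int,
    m: float = 0.01,
) -> List[float]:
    """Execute `steps` actions, blending overlapping chunk predictions."""
    pending: DefaultDict[int, List[float]] = defaultdict(list)
    executed: List[float] = []
    for t in range(steps):
        # Query the policy every step; chunk[j] is its action for time t + j.
        for j, action in enumerate(predict_chunk(t)):
            pending[t + j].append(action)
        # All predictions for time t, oldest chunk first.
        preds = pending.pop(t)
        weights = [math.exp(-m * i) for i in range(len(preds))]
        executed.append(sum(w * a for w, a in zip(weights, preds)) / sum(weights))
    return executed


# Usage: a toy policy whose 16-step chunk repeats the current step index.
smoothed = run_with_temporal_ensembling(lambda t: [float(t)] * 16, steps=5)
```

Ensembling smooths the switch from one chunk to the next at the cost of one policy query per control step; the simpler alternative is to execute each chunk open-loop and query only every 16 steps.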

Step 5: Switch to the real robot

This is the core of Ark's sim-to-real workflow. Create the file configs/viperx_real.yaml:

backend: ros2
ros_topics:
  joint_pos: /viperx/joint_states
  command:   /viperx/joint_command
  rgb:       /camera/color/image_raw
gripper_topic: /viperx/gripper_command
control_freq: 30

Then:

env = make_env("viperx_push-v0", config="configs/viperx_real.yaml")
# Same observation space, same action space
action = policy.predict(env.reset())

No retraining and no changes to the observation parser: Ark handles the conversion between simulator state and ROS messages. Previously, this took 200–500 lines of hand-written glue code.
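Much of that glue is mundane but error-prone. One example: a sensor_msgs/JointState message carries parallel name and position arrays in whatever order the publisher chose, which need not match the joint order the policy was trained on. A hedged sketch of that conversion, with a hypothetical helper name and the Interbotix vx300s arm joint names assumed as the training order (check them against your URDF):

```python
from typing import Dict, List, Sequence

# Assumed policy joint order (Interbotix vx300s arm joints); verify against your URDF.
POLICY_JOINT_ORDER = ["waist", "shoulder", "elbow", "forearm_roll", "wrist_angle", "wrist_rotate"]


def joint_state_to_obs(names: Sequence[str], positions: Sequence[float],
                       order: Sequence[str] = POLICY_JOINT_ORDER) -> List[float]:
    """Reorder JointState-style name/position arrays into the policy's joint order.

    Indexing by name rather than by position avoids silent joint swaps.
    Extra joints (e.g. gripper fingers) are ignored; missing ones raise KeyError.
    """
    if len(names) != len(positions):
        raise ValueError("name and position arrays must have the same length")
    by_name: Dict[str, float] = dict(zip(names, positions))
    return [by_name[joint] for joint in order]


# Usage: a message whose joints arrive in a different order, plus the gripper.
msg_names = ["gripper", "elbow", "waist", "wrist_rotate", "shoulder", "wrist_angle", "forearm_roll"]
msg_positions = [0.02, 3.0, 1.0, 6.0, 2.0, 5.0, 4.0]
joint_pos = joint_state_to_obs(msg_names, msg_positions)
```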

Example 2: Cloth manipulation with the OpenPyro-A1 humanoid

The second demo in the paper is cloth folding with the OpenPyro-A1 humanoid, a harder task because it requires bimanual coordination and tactile feedback. The pipeline is the same as in Example 1, except:

The action space grows to 14-D (7 joints per arm).
Observations add tactile sensors and stereo RGB.
Diffusion Policy replaces ACT because the task is multimodal (fold horizontally, fold vertically, fold in half).
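Why multimodality rules out a plain regression head can be shown with a toy example (our own, not from the paper). If half of the demonstrations fold one way and half the other, the single prediction that minimizes squared error is the mean, and one that minimizes absolute error is the median; both land between the modes, on an action no demonstrator took. Diffusion Policy avoids this by modeling the full action distribution and sampling one mode.

```python
from statistics import mean, median

# Hypothetical 1-D "fold direction" targets: half of the demos fold left (-1.0),
# half fold right (+1.0). Both are valid behaviours; 0.0 is not.
demo_actions = [-1.0] * 50 + [1.0] * 50

# Best constant prediction under L2 loss (mean) and one under L1 loss (median).
l2_best = mean(demo_actions)
l1_best = median(demo_actions)

distance_to_nearest_mode = min(abs(l2_best - mode) for mode in (-1.0, 1.0))
print(l2_best, l1_best, distance_to_nearest_mode)
```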

ark train --config configs/dp_openpyro_cloth.yaml \
          --dataset data/cloth_demos.zarr

The paper reports a 78% success rate on cloth folding after 100 demos and 86% on object handover after 80 demos, fairly strong results for intermediate-level imitation learning.

Deeper ROS 2 integration

Ark does not replace ROS; it wraps ROS in idiomatic Python. You can still:

Work with topics published outside Ark by Gazebo, MoveIt2, or Nav2.
Run an Ark policy as a node in a ROS launch file.
Combine Ark with existing ROS controllers (the ros2_control hardware interface).

from ark.ros import RosBridge

bridge = RosBridge(node_name="ark_act_policy")
bridge.subscribe("/camera/color/image_raw", topic_type="sensor_msgs/Image")
# joint_trajectory is the controller's command topic; follow_joint_trajectory is an action, not a topic
bridge.publish("/arm_controller/joint_trajectory", ...)

If you are new to ROS 2, read the ROS 2 Introduction before this section.

Pitfalls and best practices

Some bugs I ran into often while reproducing the results:

Conda env conflicts with system ROS: always unset LD_LIBRARY_PATH before conda activate ark_env. Sourcing the ROS 2 setup script overwrites environment variables, which makes PyBullet load the wrong libraries.
Camera latency mismatch between sim and real: the simulator renders instantly, while a real RealSense has about 30 ms of latency. Train with obs_delay_ms: [0, 50] for robustness.
Joint encoder offsets: the real robot's encoder zeros do not match the URDF. Calibrate with ark calibrate --robot viperx before deploying.
Gripper noise: ACT predicts a continuous gripper position, but the real gripper is effectively binary (open/closed). Use a 0.5 threshold plus a low-pass filter.
Pub/sub buffer overflow: if data arrives at 30 Hz but the policy runs at 10 Hz, the queue keeps growing. Set qos_depth=5 on the subscriber.
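Three of these fixes can be sketched in plain Python. The helper names are hypothetical and the parameters (a uniform per-episode delay, alpha=0.3 for the filter) are our assumptions, not Ark's implementation of obs_delay_ms or qos_depth. In ROS 2, the last helper corresponds to a subscriber QoS profile with KEEP_LAST history and depth 5.

```python
import random
from collections import deque
from typing import Any, Iterable, List, Sequence


def sample_delay_ms(rng: random.Random, low: float = 0.0, high: float = 50.0) -> float:
    """Draw one camera delay per training episode, uniform in [low, high] ms."""
    return rng.uniform(low, high)


def delayed_index(t: int, dt_ms: float, delay_ms: float) -> int:
    """Index of the camera frame the policy would actually see at step t."""
    return max(0, t - round(delay_ms / dt_ms))


def gripper_commands(raw: Sequence[float], alpha: float = 0.3,
                     threshold: float = 0.5) -> List[int]:
    """Exponential moving average (first-order low-pass), then binarize."""
    commands: List[int] = []
    smoothed = raw[0] if raw else 0.0
    for x in raw:
        smoothed = alpha * x + (1 - alpha) * smoothed
        commands.append(1 if smoothed >= threshold else 0)
    return commands


def keep_last(messages: Iterable[Any], depth: int) -> List[Any]:
    """What a KEEP_LAST subscriber queue of the given depth still holds."""
    return list(deque(messages, maxlen=depth))


# Usage: one second of 30 Hz frames reaching a policy that reads the newest one.
queue = keep_last(range(30), depth=5)
latest_frame = queue[-1]
```

Note that the filter keeps a single 0.6 spike from toggling the gripper, which a raw threshold would not; the price is a few steps of delay on genuine open/close commands.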

One important note: Ark v1.5 is still under active refactoring, and the API may change slightly between minor releases. Pin the version in requirements.txt, replacing <tag> with the v1.5 release tag you have tested against:

git+https://github.com/Robotics-Ark/ark_framework.git@<tag>
git+https://github.com/Robotics-Ark/ark_types.git@<tag>

When to use Ark, and when not to

Use Ark when:

You need to switch sim ↔ real many times while developing your pipeline.
Your lab has several different robots (ViperX, UR5, Franka, humanoid) that need a shared abstraction.
You want to integrate an ML policy into an existing ROS-based stack.
You value a Python-first developer experience over maximum performance.

Don't use Ark when:

You need a real-time control loop above 1 kHz (use C++ ros2_control).
You are training RL at large scale (10k+ parallel envs); use Isaac Lab instead.
You already have a stable LeRobot pipeline and don't need to switch between sim and real.
Your robot is highly custom and has no Ark driver yet; writing a driver takes longer than using raw ROS.

Summary

Ark v1.5 tackles an old but persistent problem: the gap between the machine learning workflow (Python, Gym, datasets) and the robotics workflow (C++, ROS, hardware). It doesn't break ROS and doesn't compete head-on with LeRobot; instead, it bridges them in a way that Vietnamese students can pick up in a few days rather than a few months.

For anyone writing a thesis on imitation learning, especially students who have a UR5e or ViperX in the lab but are struggling with glue code, Ark can significantly shorten time-to-result. Start with the push-task tutorial, benchmark it on the hardware you have, then extend it to your own task.