ABot-M0: VLA Foundation Model with Action Manifold Learning from AMAP CVLab

In March 2026, the AMAP CVLab team at Alibaba Group released ABot-M0, a VLA foundation model for robotic manipulation, together with full code, pretrained weights, and the data processing pipeline. The most interesting part is not the scale (though their UniACT training set, with 6+ million trajectories and 9,500 hours of data, is currently the largest open mixture), but a shift in how the model learns actions: instead of predicting noise as diffusion policies do, ABot-M0 directly predicts clean actions that lie on a low-dimensional manifold shaped by physics and task constraints.

This post walks from the core idea, through the architecture, to running the pretrained weights yourself. Useful for engineers looking for a "plug-and-play" VLA on their own robot arm or humanoid.

References: ABot-M0 paper on arXiv, the official project page, GitHub repo, and HuggingFace weights.


1. The core idea: Action Manifold Hypothesis

If you've ever trained Diffusion Policy, the familiar recipe is: add Gaussian noise to an action chunk, then have the model denoise it, predicting the noise step by step to recover the clean action (π0 uses the closely related flow-matching variant, which predicts a velocity field instead of the noise). It's powerful, but ABot-M0 calls out two problems:

Slow decoding — many denoising steps per action chunk.
Instability across embodiments — a shared noise schedule applies to all robots, yet Franka, Aloha, UR-5, and dual-arm humanoids have very different action spaces, so the denoising process doesn't converge at the same rate.

The AMAP team proposes a hypothesis: in high-dimensional action space, successful actions are not scattered uniformly — they lie on a low-dimensional manifold shaped by physical constraints (kinematics, dynamics, contact), task goals, and environment. A pick-and-place trajectory isn't any random sequence in $\mathbb{R}^{T \times 7}$ — it's a thin slice within it.

If the manifold truly exists, we should learn a projection onto it rather than the noise. In code, the loss shifts from:

# Diffusion policy
noise_pred = model(noisy_action, t, obs)
loss = mse(noise_pred, true_noise)

to:

# Action Manifold Learning (AML)
action_pred = model(latent, obs)
loss = mse(action_pred, clean_action)

Much simpler, but to prevent the model from collapsing to the mean action, AML adds a few mechanisms: a structured latent prior, action-sequence smoothness regularization, and a residual head to capture fine detail. The paper's ablation shows AML both reduces decode steps and improves stability when training on heterogeneous cross-embodiment data.
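To make the contrast concrete, here is a minimal sketch in plain Python of the two regression targets, plus one possible smoothness penalty. The DDPM-style noising formula is standard; the `smoothness_penalty` function (mean squared first difference along time) and the toy numbers are assumptions for illustration, not the paper's exact regularizer.

```python
import math
import random


def mse(pred, target):
    """Mean squared error between two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def add_noise(clean, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * c + math.sqrt(1.0 - alpha_bar) * e
             for c, e in zip(clean, eps)]
    return noisy, eps


def smoothness_penalty(chunk):
    """Assumed regularizer: mean squared first difference along time.

    `chunk` is a list of T action vectors (each a list of floats).
    """
    if len(chunk) < 2:
        return 0.0
    diffs = [(b - a) ** 2
             for prev, nxt in zip(chunk, chunk[1:])
             for a, b in zip(prev, nxt)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
clean_chunk = [[0.01 * t, 0.0, -0.02 * t] for t in range(4)]  # toy T=4, 3-D actions
flat_clean = [v for step in clean_chunk for v in step]

# Diffusion-style objective: the regression target is the sampled noise.
noisy, eps = add_noise(flat_clean, alpha_bar=0.5, rng=rng)
noise_target = eps

# AML-style objective: the regression target is the clean action itself.
action_target = flat_clean

# A perfect prediction gives zero loss under either objective.
assert mse(noise_target, eps) == 0.0
assert mse(action_target, flat_clean) == 0.0
```

The only thing that changes between the two objectives is what the network is asked to regress; the smoothness term would be added to the AML loss with some weight, which is a hyperparameter not given in this post.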

To compare these two paradigms in a single codebase, read Diffusion Policy first — it explains denoising in detail and makes it clear where AML "inverts" the formulation.

2. Overall architecture

ABot-M0 uses a familiar DiT (Diffusion Transformer) backbone, but reworks the output head to emit actions directly instead of noise. There are three main blocks:

(a) VLM encoder — Takes multi-view RGB + a text instruction. Outputs semantic tokens describing "what to do, what is seen".

(b) 3D perception adapter (plug-and-play) — A clever design choice: ABot-M0 doesn't force a specific 3D module. You can plug VGGT (Visual Geometry Grounded Transformer) for dense point maps, or Qwen-Image-Edit for geometric priors, without touching the backbone. 3D tokens are concatenated alongside VLM tokens.

(c) Action manifold decoder — A transformer mapping (semantic tokens, 3D tokens, robot state) to an action chunk in manifold space. The final un-projection head converts manifold latents back to delta-actions in end-effector coordinates.
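One way to picture the plug-and-play adapter is as a small interface that the backbone never needs to know the details of. The sketch below is hypothetical: `PerceptionAdapter`, `DummyGeometryAdapter`, and `build_decoder_input` are names invented here, not classes from the ABot repo. It only shows the idea that 3D tokens are appended to the VLM tokens when an adapter is present and skipped otherwise.

```python
from typing import List, Optional, Protocol, Sequence

Token = List[float]  # one embedding vector


class PerceptionAdapter(Protocol):
    """Hypothetical interface: anything that turns images into 3D tokens."""

    def encode(self, images: Sequence[object]) -> List[Token]:
        ...


class DummyGeometryAdapter:
    """Stand-in for a real 3D module; emits one zero token per image."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def encode(self, images: Sequence[object]) -> List[Token]:
        return [[0.0] * self.dim for _ in images]


def build_decoder_input(
    vlm_tokens: Sequence[Token],
    images: Sequence[object],
    adapter: Optional[PerceptionAdapter] = None,
) -> List[Token]:
    """Concatenate VLM tokens with 3D tokens along the sequence axis."""
    tokens = list(vlm_tokens)
    if adapter is not None:
        tokens.extend(adapter.encode(images))
    return tokens


vlm_tokens = [[1.0] * 4 for _ in range(3)]   # toy: 3 semantic tokens of width 4
images = ["cam_front", "cam_wrist"]          # placeholders for image tensors

rgb_only = build_decoder_input(vlm_tokens, images)
with_3d = build_decoder_input(vlm_tokens, images, DummyGeometryAdapter(dim=4))
print(len(rgb_only), len(with_3d))
```

Because the decoder only sees a longer or shorter token sequence, swapping one 3D module for another does not require changes to the backbone, which is the property the post describes.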

A key technical detail for cross-embodiment: actions are normalized to end-effector deltas with a rotation vector representation (3D, rather than a 4D quaternion or 9D matrix). Why: for the small per-step deltas involved, rotation vectors avoid the quaternion double cover (q and -q encode the same rotation), are more compact than matrices, and have a delta range that is easier to normalize across robots. For mixing single-arm and dual-arm data, they use pad-to-dual: always pad to 2 arms (a 14-dimensional action, 7 per arm), and zero out the second arm for single-arm samples, so one model serves both.
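As a rough illustration of both conventions, the sketch below converts a unit quaternion to a rotation vector and pads a single-arm action to the dual-arm layout. The per-arm layout (3 translation + 3 rotation vector + 1 gripper) and the function names are assumptions made for this example; check the repo for the actual ordering.

```python
import math
from typing import List, Sequence

ARM_DIM = 7  # assumed layout: 3 translation + 3 rotation vector + 1 gripper


def quat_to_rotvec(w: float, x: float, y: float, z: float) -> List[float]:
    """Unit quaternion (w, x, y, z) -> rotation vector (axis * angle)."""
    if w < 0.0:
        # q and -q describe the same rotation; keep the one with angle <= pi.
        w, x, y, z = -w, -x, -y, -z
    norm_v = math.sqrt(x * x + y * y + z * z)
    if norm_v < 1e-12:
        # Small-angle limit: angle * axis ~= 2 * (x, y, z).
        return [2.0 * x, 2.0 * y, 2.0 * z]
    angle = 2.0 * math.atan2(norm_v, w)
    scale = angle / norm_v
    return [x * scale, y * scale, z * scale]


def pad_to_dual(action: Sequence[float]) -> List[float]:
    """Pad a single-arm action with zeros for the missing second arm."""
    if len(action) == 2 * ARM_DIM:
        return list(action)
    if len(action) != ARM_DIM:
        raise ValueError(f"expected {ARM_DIM} or {2 * ARM_DIM} values, got {len(action)}")
    return list(action) + [0.0] * ARM_DIM


# 90 degrees about the z axis, written as q and as -q.
c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
rv_pos = quat_to_rotvec(c, 0.0, 0.0, s)
rv_neg = quat_to_rotvec(-c, 0.0, 0.0, -s)

single_arm = [0.01, 0.0, -0.02] + rv_pos + [1.0]
dual = pad_to_dual(single_arm)
print(rv_pos, len(dual))
```

Both quaternion signs map to the same rotation vector, which is the ambiguity the representation removes before normalization.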


3. UniACT Dataset — 6M+ trajectories

The data engineering behind ABot-M0 is arguably more valuable than the model itself. They merge six of the largest public datasets (Open X-Embodiment, DROID, AgiBot-Beta, RoboMIND, RH20T, BridgeData V2…) into the UniACT dataset, with:

6,000,000+ trajectories
9,500+ hours of interaction
20+ embodiments (Franka Panda, UR-5, Aloha, Kuka, Galaxea, AgiBot, humanoid GR1/GR2…)

The pipeline has four stages:

Filter invalid samples — drop trajectories with empty instructions, blurry frames, or NaN actions.
Action normalization — convert to delta end-effector (rotation vector), normalize using per-embodiment statistics.
Pad-to-dual — as described above.
Re-balance — over/under-sample by embodiment so a giant dataset (like OXE) doesn't dominate every gradient step.
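The first two stages can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's pipeline: the sample dictionary keys are invented here, the blurry-frame check is omitted because it needs image data, and z-score normalization is one plausible reading of "per-embodiment statistics" (the actual pipeline may use a different statistic).

```python
import math
from collections import defaultdict


def is_valid(sample):
    """Stage 1 (partial): reject empty instructions and non-finite actions."""
    if not sample["instruction"].strip():
        return False
    return all(math.isfinite(v) for v in sample["action"])


def per_embodiment_stats(samples):
    """Per-dimension mean and standard deviation for each embodiment."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s["embodiment"]].append(s["action"])
    stats = {}
    for name, actions in grouped.items():
        dims = list(zip(*actions))  # transpose to per-dimension columns
        mean = [sum(d) / len(d) for d in dims]
        std = [math.sqrt(sum((v - m) ** 2 for v in d) / len(d))
               for d, m in zip(dims, mean)]
        stats[name] = (mean, std)
    return stats


def normalize(sample, stats, eps=1e-6):
    """Stage 2 (assumed z-score): scale actions by their embodiment's stats."""
    mean, std = stats[sample["embodiment"]]
    return [(a - m) / (s + eps) for a, m, s in zip(sample["action"], mean, std)]


raw = [
    {"embodiment": "franka", "instruction": "pick up the cup", "action": [0.0, 2.0]},
    {"embodiment": "franka", "instruction": "pick up the cup", "action": [2.0, 4.0]},
    {"embodiment": "franka", "instruction": "", "action": [1.0, 1.0]},
    {"embodiment": "ur5", "instruction": "push block", "action": [float("nan"), 0.0]},
]

clean = [s for s in raw if is_valid(s)]
stats = per_embodiment_stats(clean)
normalized_first = normalize(clean[0], stats)
print(len(clean), stats["franka"], normalized_first)
```

Computing the statistics after filtering matters: a single NaN action would otherwise poison the mean and standard deviation for the whole embodiment.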

If you've done imitation learning before, you know how critical the re-balance stage is. A dataset that's 70% Franka pick-and-place will yield a model that's only good at Franka pick-and-place, regardless of how "cross-embodiment" you claim it is.
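A common way to re-balance a mixture is to sample each source with probability proportional to its size raised to a power between 0 and 1. The sketch below shows that technique; it is a generic recipe, not necessarily the scheme AMAP used, and the trajectory counts are made up for the example.

```python
def sampling_weights(counts, alpha=0.5):
    """Sampling probability per embodiment, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw data proportions,
    alpha = 0.0 samples every embodiment equally often.
    """
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every embodiment needs a positive trajectory count")
    powered = {name: n ** alpha for name, n in counts.items()}
    total = sum(powered.values())
    return {name: value / total for name, value in powered.items()}


# Hypothetical mixture dominated by one embodiment.
counts = {"franka": 700_000, "aloha": 200_000, "gr1": 100_000}

proportional = sampling_weights(counts, alpha=1.0)
tempered = sampling_weights(counts, alpha=0.5)
uniform = sampling_weights(counts, alpha=0.0)

for name in counts:
    print(f"{name:8s} raw={proportional[name]:.3f} "
          f"tempered={tempered[name]:.3f} uniform={uniform[name]:.3f}")
```

With `alpha=0.5` the dominant embodiment still gets the most samples, but its share drops from 70% to roughly half, which leaves more gradient steps for the smaller sources.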

4. Benchmark results

ABot-M0 sits at or near the top on the main benchmarks:

Benchmark	ABot-M0	GR00T-N1.6	X-VLA	π0-FAST
LIBERO (avg 4 suites)	98.6%	97.0%	98.1%	96.4%
LIBERO-Plus (zero-shot, 7 shifts)	80.5%	—	—	—
RoboCasa-GR1 (24 tasks)	58.3%	47.6%	—	—
RoboTwin2.0 Clean	80.4%	—	—	—
RoboTwin2.0 Randomized	81.2%	—	—	—

The most informative number isn't LIBERO (saturated) but the LIBERO-Plus zero-shot score of 80.5%, a stress test with 7 distribution shifts (lighting, camera pose, distractors, language paraphrasing…). The table above lists no baseline numbers for that row, so it can't be ranked here, but holding 80.5% under those shifts is a real signal that the learned manifold generalizes rather than memorizes.

5. Installation

Minimum hardware for inference: 1× 24GB GPU (RTX 4090, A5000), 32GB RAM. For full fine-tuning: 4× A100 80GB or equivalent.

# Clone repo
git clone https://github.com/amap-cvlab/ABot-Manipulation.git
cd ABot-Manipulation

# Conda env
conda create -n ABot python=3.10 -y
conda activate ABot

# Core deps
pip install -r requirements.txt

# FlashAttention2 (required for the DiT backbone)
pip install flash-attn --no-build-isolation

# VGGT for 3D perception (optional, plug-and-play)
pip install vggt

# Install ABot package
pip install -e .

Common first-run errors:

flash-attn build fails → CUDA toolkit mismatch. Make sure CUDA 12.x is installed and nvcc --version matches your PyTorch CUDA build.
ModuleNotFoundError: vggt → either run pip install vggt, or use a variant that doesn't need a 3D adapter (you can skip it).
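Before the first run, it can save time to check which optional packages Python can actually see. The snippet below uses only the standard library's `importlib.util.find_spec`; the helper name `check_optional_modules` is made up here, and `flash_attn` is the import name of the `flash-attn` pip package.

```python
import importlib.util


def check_optional_modules(names):
    """Map each top-level module name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_optional_modules(["torch", "flash_attn", "vggt"])
for name, found in status.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

This only confirms that the package is installed in the active environment; a `flash_attn` wheel built against the wrong CUDA version can still fail at import time.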

6. Inference with pretrained weights

There are 4 variants under HuggingFace acvlab/:

Variant	Use case
`ABot-Pretrain`	Generalist backbone, fine-tunable
`ABot-LIBERO`	Fine-tuned for LIBERO, ready to eval
`ABot-RoboCasa-GR1`	Tabletop manipulation with humanoid GR1
`ABot-RoboTwin2`	Dual-arm Clean + Randomized

Basic inference with the LIBERO weights:

from abot import ABotPolicy
from abot.envs import LiberoEnv

# Load policy
policy = ABotPolicy.from_pretrained("acvlab/ABot-LIBERO", device="cuda")

# Env LIBERO Spatial
env = LiberoEnv(suite="libero_spatial", task_id=0)
obs = env.reset()

for step in range(200):
    # ABot-M0 predicts a clean action chunk (H=16 steps)
    action_chunk = policy.predict(
        rgb_images=obs["images"],          # multi-view RGB
        instruction=obs["language"],        # text
        robot_state=obs["state"],          # joint + gripper
    )
    # Execute only the first step of the chunk, then re-predict (receding-horizon control)
    obs, reward, done, info = env.step(action_chunk[0])
    if done:
        break

On a single RTX 4090, decoding one action chunk takes about 80ms — faster than typical diffusion policy (200-400ms for 50 denoising steps), because AML only needs one forward pass.
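How many steps of each chunk you execute before re-predicting is a trade-off between reactivity and inference load. The sketch below is a generic rollout loop with stub policy and environment functions (all names are placeholders, not the ABot API), plus the back-of-the-envelope bound on control rate if inference blocks the control loop.

```python
def run_chunked(policy_fn, step_fn, obs, execute_steps, max_steps):
    """Execute `execute_steps` actions from each predicted chunk, then re-plan.

    Returns (environment steps taken, number of policy calls).
    """
    steps = 0
    predictions = 0
    while steps < max_steps:
        chunk = policy_fn(obs)
        predictions += 1
        for action in chunk[:execute_steps]:
            obs, done = step_fn(action)
            steps += 1
            if done or steps >= max_steps:
                return steps, predictions
    return steps, predictions


def max_control_rate_hz(latency_s, execute_steps):
    """Upper bound on control rate when each policy call blocks for latency_s."""
    return execute_steps / latency_s


def stub_policy(obs):
    return [0.0] * 16  # stand-in for an H=16 action chunk


def make_stub_env():
    state = {"t": 0}

    def step_fn(action):
        state["t"] += 1
        return {"t": state["t"]}, False  # never terminates on its own

    return step_fn


every_step = run_chunked(stub_policy, make_stub_env(), {"t": 0}, execute_steps=1, max_steps=32)
half_chunk = run_chunked(stub_policy, make_stub_env(), {"t": 0}, execute_steps=8, max_steps=32)
print(every_step, half_chunk)
print(max_control_rate_hz(0.08, 1), max_control_rate_hz(0.08, 8))
```

Under the 80ms figure quoted above, re-predicting after every step caps a blocking loop at about 12.5 Hz, while executing 8 steps per chunk raises the ceiling to about 100 Hz at the cost of running open-loop between predictions.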

7. Fine-tuning on your own robot

Recommended workflow:

Collect teleop data — at least 50-100 demos for a simple task, 200-500 for tasks with multiple objects or phases.
Convert to UniACT format — data_process/convert_to_uniact.py in the repo handles this. You'll need: RGB from cameras (≥2 views recommended), language instruction, robot state, delta end-effector action.
Fine-tune from ABot-Pretrain:

python examples/finetune.py \
  --pretrained acvlab/ABot-Pretrain \
  --dataset /path/to/your_dataset \
  --output_dir ./checkpoints/my_robot \
  --batch_size 32 \
  --learning_rate 1e-4 \
  --num_epochs 30 \
  --enable_3d_adapter vggt  # optional

Evaluate — run 10-20 rollouts on the real robot, log success rate. If you see <50% on a simple task → audit your data quality before throwing more epochs at it.
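With only 10-20 rollouts, the measured success rate carries a lot of uncertainty, so it helps to report an interval alongside the point estimate. The sketch below computes a Wilson score interval, a standard choice for small binomial samples; applying it here is a suggestion of this post's editor-level tooling, not something the ABot repo provides.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for successes, trials in [(10, 20), (16, 20), (20, 20)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: {successes / trials:.0%} "
          f"(95% interval {low:.0%} to {high:.0%})")
```

A 10/20 result is compatible with a true success rate anywhere from about 30% to 70%, so a difference of a few rollouts between two checkpoints is rarely conclusive at this sample size.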

If you're fine-tuning on a full-body humanoid, the recent Whole-Body VLA paper has a data recipe compatible with ABot's dual-arm structure.

8. Comparison with other VLAs

Model	Params	Predict target	Cross-embodiment data	Open weights
RT-2	~55B	Action token	Limited	❌
OpenVLA	7B	Action token	OXE	✅
π0 / π0-FAST	3B	Noise (flow)	DROID + custom	Partial
GR00T-N1.6	2-3B	Noise (diffusion)	Mixed	✅
ABot-M0	2-3B	Clean action (AML)	UniACT 6M	✅ full

ABot-M0 isn't the biggest, nor the absolute SOTA on every benchmark, but as of 2026 it's the most complete open-source package: code, pretrained weights, data pipeline, and eval scripts are all public, with a commercial-friendly license (double-check the repo before shipping to production).

For a wider view of the VLA landscape, see the VLA models overview and VLA-0: action as text — VLA-0 goes the opposite direction (text tokens instead of a continuous manifold), and comparing the two approaches is genuinely illuminating.

9. Deployment pitfalls

Don't skip the 3D adapter too early — running with RGB tokens only works, but on spatial reasoning tasks (LIBERO-Spatial, stacking) VGGT or Qwen-Image-Edit improves success rate by ≥5%.
Chunk horizon H — default is H=16. If your robot runs at 10Hz, that's 1.6s open-loop, which can be too long for contact-rich tasks. Drop to H=8 if the gripper drifts.
Camera calibration — ABot-M0 expects intrinsics close to the DROID/OXE setup. Cameras that are too wide (FOV >120°) or too narrow (FOV <40°) confuse the 3D adapter. Calibrate before blaming the model.
Don't overfit pad-to-dual — single-arm robots padded to dual should have the second arm at zero, but if your fine-tune data leaves dummy actions non-zero, the model learns spurious patterns.
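Three of these pitfalls reduce to quick numeric checks you can run on your own setup. The helpers below are a sketch with made-up names; the FOV formula is the standard pinhole relation, the 40° to 120° range comes from the list above, and the 7-values-per-arm layout is an assumption.

```python
import math


def open_loop_seconds(horizon, control_hz):
    """Time the robot runs without feedback if a full chunk is executed."""
    return horizon / control_hz


def horizontal_fov_deg(image_width_px, fx_px):
    """Horizontal field of view of a pinhole camera from its focal length."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * fx_px)))


def fov_in_expected_range(fov_deg, low=40.0, high=120.0):
    """True if the FOV falls inside the range quoted in the list above."""
    return low <= fov_deg <= high


def second_arm_is_zero(action, arm_dim=7, tol=1e-8):
    """Check that a padded single-arm action has a zeroed second arm."""
    if len(action) != 2 * arm_dim:
        raise ValueError(f"expected {2 * arm_dim} values, got {len(action)}")
    return all(abs(v) <= tol for v in action[arm_dim:])


print(open_loop_seconds(16, 10))            # H=16 at 10 Hz
print(open_loop_seconds(8, 10))             # H=8 at 10 Hz
fov = horizontal_fov_deg(image_width_px=640, fx_px=320)
print(fov, fov_in_expected_range(fov))
print(second_arm_is_zero([0.1] * 7 + [0.0] * 7))
```

Running the last check over every sample in a fine-tuning set before training is a cheap way to catch leftover dummy values in the padded arm.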

10. Roads not taken

Three directions AMAP tried and dropped, worth knowing:

Score matching on the manifold — more complex than AML without better success rates, and training was less stable.
Token-based actions à la RT-2 / VLA-0 — simpler code, but lower precision on continuous control tasks (grasping a cup, gentle pushes).
End-to-end 3D backbone — instead of a plug-and-play 3D adapter. Compute-heavy, hard to scale, and loses modularity.

Keeping the modular 3D + AML head is the reason the repo is so research-friendly — swapping modules is trivial.

VLA-Adapter from OpenHelix: Tiny-Scale VLA on 9.6GB VRAM — opposite end of the scale spectrum.
Manipulation Series #4: Vision-Language-Action Models — the VLA foundations before diving into ABot.
GigaBrain-0: VLA + World Model + RL — how to bolt a world model alongside a VLA backbone.



Nguyễn Anh Tuấn

