A1 VLA: Deploy SOTA Vision-Language-Action on Franka/AgiBot with 72% Lower Latency

Imagine a robot arm trying to pick up an object. The robot's "brain" — the AI model — needs to think fast to command the arm. If it takes too long to decide, the arm has already moved past the target position. This is the VLA latency problem that researchers have struggled with for years.

In April 2026, ATeam Research published A1 on arXiv (2604.05672) — a fully open-source VLA model with a clever technical innovation: Inter-Layer Truncated Flow Matching. The results speak for themselves: up to 72% lower per-episode latency, up to 76.6% backbone computation reduction, while still achieving state-of-the-art performance on standard benchmarks.

This guide covers everything from A to Z: why A1 matters, how the architecture works, and how to install and run it on Franka Panda and AgiBot robots.

Background: Why Are VLA Models Still Slow?

Before diving into A1, it helps to understand why current VLA models are so slow in the first place.

A modern VLA typically uses a two-stage architecture:

VLM backbone (7B params, e.g. PaliGemma or Qwen): processes camera images and language instructions → generates context embeddings
Action head (usually Diffusion Policy or Flow Matching): takes embeddings → generates robot control sequences

The bottleneck is the iterative denoising in the action head. To generate one action sequence, the model must run several denoising steps, and each step passes through the VLM backbone again. With 10 denoising steps and about 35ms per backbone pass for a 7B model on GPU, that comes to roughly 35ms × 10 = 350ms to produce a single action sequence, or fewer than three decisions per second. That is far too slow for real-time robot control.
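The arithmetic above can be written as a tiny sketch. The function names are illustrative only, and the 35ms-per-pass figure is this article's rough estimate, not a measurement.

```python
def action_latency_ms(backbone_ms: float, denoising_steps: int) -> float:
    """Latency when every denoising step pays for one full backbone pass."""
    return backbone_ms * denoising_steps


def max_control_rate_hz(latency_ms: float) -> float:
    """Highest rate at which fresh actions can be produced back to back."""
    return 1000.0 / latency_ms


latency = action_latency_ms(backbone_ms=35.0, denoising_steps=10)
rate = max_control_rate_hz(latency)
print(f"{latency:.0f} ms per action sequence, about {rate:.1f} Hz")
```

Halving either the per-pass cost or the number of steps doubles the achievable rate, which is why A1 targets both.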

Previous solutions all have drawbacks:

Fewer denoising steps: reduces action quality
Consistency models: requires retraining from scratch
Distillation: complex pipeline with information loss

A1 takes a completely different approach, exploiting the internal structure of the VLM backbone itself.

A1 VLA: The Core Idea

Figure 1: A1 exploits representations from intermediate VLM layers rather than always running the full backbone

Key Insight: Intermediate Layers Are Good Enough

When a 7B VLM processes an image and language instruction, it passes through 32–40 transformer layers. The final layer produces the best embedding — but embeddings at layer 16, 20, or 24 already contain sufficient affordance information (understanding of what can be done with objects) to guide the action head toward a correct action.

This is the key insight A1 exploits: you don't need to run the full backbone at every denoising step. Instead:

First denoising step: run backbone to the final layer → high-quality embedding → start denoising
Subsequent steps: only run backbone to an intermediate layer → "good enough" embedding → continue denoising from where you left off (warm-start)

This is the "Inter-Layer" part: using representations from different layers, not just the last one. "Truncated" refers to cutting backbone computation short: instead of all 32 layers, subsequent steps run only 16–20 layers, which saves roughly 37–50% of backbone compute on each of those steps if every layer costs about the same.

The intuition is like reading a problem statement: after you've read the full problem (first step, full backbone), subsequent glances at partial context are enough to keep solving it.
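To get a feel for how truncation depth and step count interact, here is a back-of-the-envelope sketch. It assumes every transformer layer costs the same and ignores the action head, and the function names are hypothetical; it is not how the paper measures its reported reductions.

```python
def backbone_layer_passes(full_layers: int, truncated_layers: int, steps: int) -> int:
    """Layer evaluations when step 1 runs full depth and the rest are truncated."""
    if steps < 1:
        raise ValueError("need at least one denoising step")
    return full_layers + (steps - 1) * truncated_layers


def compute_saving(full_layers: int, truncated_layers: int,
                   steps: int, baseline_steps: int) -> float:
    """Fraction of layer evaluations saved vs. full depth on every baseline step."""
    baseline = full_layers * baseline_steps
    used = backbone_layer_passes(full_layers, truncated_layers, steps)
    return 1.0 - used / baseline


# Truncation only: 10 steps, first one full, nine truncated.
for k in (16, 20):
    saving = compute_saving(32, k, steps=10, baseline_steps=10)
    print(f"truncate to {k} layers: {saving:.1%} saved")

# Truncation plus stopping after 4 steps instead of 10.
early = compute_saving(32, 16, steps=4, baseline_steps=10)
print(f"truncate to 16 layers and stop at step 4: {early:.1%} saved")
```

Under these assumptions, truncation alone saves well under half of the backbone work; the larger savings appear only when fewer denoising steps are run as well.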

Architecture Deep Dive: 3 Key Components

1. Inter-Layer Truncated Flow Matching

Flow Matching (as opposed to DDPM Diffusion) learns a straight-line path from noise to the target action, rather than the complex curved trajectory of DDPM. It's faster and requires fewer steps.

A1 extends Flow Matching with warm-starting: instead of beginning each denoising step from pure noise, A1 initializes from the previous step's intermediate output, combined with an embedding from an intermediate VLM layer.

Formally, if we define:

h_L = embedding from the final layer L of the VLM
h_k = embedding from intermediate layer k (k < L)
a_t = action at denoising step t

Then A1 computes:

step 1: a_0 = FlowMatch(noise, h_L)    # full backbone
step 2: a_1 = FlowMatch(a_0, h_k)     # intermediate layer, warm-started from a_0
step 3: a_2 = FlowMatch(a_1, h_k')    # continue, k' can differ from k

This reduces backbone computation by up to 76.6% compared to running the full backbone at every step.
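The control flow of the three steps above can be illustrated with a toy sketch. Everything here is a stand-in: `toy_backbone` is a scalar recurrence rather than a VLM, `flow_step` is a hand-written Euler step along a straight-line path rather than a learned velocity field, and none of the names come from the A1 codebase.

```python
import random


def toy_backbone(observation: float, num_layers: int) -> float:
    """Stand-in for a VLM: each 'layer' moves the context halfway toward the observation."""
    h = 0.0
    for _ in range(num_layers):
        h += 0.5 * (observation - h)
    return h


def flow_step(action: float, context: float, t: float, dt: float) -> float:
    """One Euler step along the straight line from the current action to `context`."""
    velocity = (context - action) / (1.0 - t)
    return action + dt * velocity


def truncated_denoise(observation: float, full_layers: int = 32,
                      truncated_layers: int = 18, steps: int = 10, seed: int = 0):
    """Full-depth context on the first step, truncated context afterwards."""
    rng = random.Random(seed)
    action = rng.gauss(0.0, 1.0)  # start from noise
    layers_run = 0
    dt = 1.0 / steps
    for step in range(steps):
        depth = full_layers if step == 0 else truncated_layers
        context = toy_backbone(observation, depth)
        layers_run += depth
        # Warm start: continue from the previous step's action, not from noise.
        action = flow_step(action, context, t=step * dt, dt=dt)
    return action, layers_run


action, layers_run = truncated_denoise(observation=1.0)
print(f"final action {action:.4f}, layers run {layers_run} (vs {32 * 10} at full depth)")
```

In this toy setting the truncated context is already very close to the full-depth one, so the final action barely changes while the layer count drops. Whether that holds for a real VLM is exactly what the paper's benchmarks test.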

2. Action Consistency Monitoring

How do you know when to stop denoising early?

A1 tracks the consistency of predicted actions across consecutive denoising steps. When the cosine similarity between a_t and a_{t-1} exceeds a threshold, the action has converged — no further denoising needed.

consistency = cosine_similarity(a_t, a_t_prev)
if consistency > threshold:
    break  # Action has converged, stop early

In practice, many simple tasks (picking an unoccluded object, placing on a fixed target) converge in 3–5 steps instead of the standard 10. Early stopping = less computation = faster robot response.
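A minimal, self-contained sketch of this early-stopping rule follows. It assumes actions are plain lists of floats and that `step_fn` performs one denoising step; both names are hypothetical, and the `min_steps` guard mirrors the configuration option shown later in this article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treat as "not converged"
    return dot / (norm_a * norm_b)


def denoise_with_early_stop(step_fn, action, max_steps=10, min_steps=3, threshold=0.95):
    """Apply step_fn until consecutive actions agree or max_steps is reached."""
    for step in range(1, max_steps + 1):
        new_action = step_fn(action)
        converged = cosine_similarity(new_action, action) > threshold
        action = new_action
        if converged and step >= min_steps:
            return action, step
    return action, max_steps


# Toy refinement: each step moves the action halfway toward a fixed target.
TARGET = [0.4, -0.2, 0.1]


def halfway_to_target(action):
    return [x + 0.5 * (t - x) for x, t in zip(action, TARGET)]


final_action, steps_used = denoise_with_early_stop(halfway_to_target, [1.0, 1.0, 1.0])
print(f"stopped after {steps_used} steps")
```

Cosine similarity only compares direction, so two actions that differ in magnitude but point the same way count as consistent. A real implementation might combine it with a norm check.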

3. Budget-aware Adaptive Inference

Users or systems can set a latency budget — for example: "action must be ready within 100ms". A1 automatically adjusts three parameters:

Backbone depth: how many VLM layers to run
Max denoising steps: upper bound on iterations
Consistency threshold: early stopping sensitivity

This flexibility is something other VLAs lack: you can explicitly trade off accuracy for speed depending on your use case and hardware.
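One way such a budget could be turned into concrete settings is sketched below. The cost model (a fixed cost per layer, action head ignored), the preference for more steps over deeper truncation, and all names are assumptions made for illustration; the article does not describe how A1 actually makes this choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferencePlan:
    truncated_layers: int
    max_steps: int
    estimated_ms: float


def plan_for_budget(budget_ms: float, ms_per_layer: float, full_layers: int = 32,
                    min_layers: int = 8, max_steps: int = 10) -> InferencePlan:
    """Return the first plan that fits, trying more steps first, then deeper layers.

    Assumed cost: one full-depth pass plus (steps - 1) truncated passes.
    """
    for steps in range(max_steps, 0, -1):
        for layers in range(full_layers, min_layers - 1, -1):
            estimate = ms_per_layer * (full_layers + (steps - 1) * layers)
            if estimate <= budget_ms:
                return InferencePlan(layers, steps, estimate)
    raise ValueError("budget is below the cost of a single full backbone pass")


# Assumes ~35 ms for a full 32-layer pass, as in the earlier estimate.
MS_PER_LAYER = 35.0 / 32

fast = plan_for_budget(100.0, MS_PER_LAYER)
careful = plan_for_budget(300.0, MS_PER_LAYER)
print(fast)
print(careful)
```

A tighter budget forces both fewer steps and a shallower truncation, while a looser one keeps all ten steps and most of the backbone depth.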

Installation and Setup

System Requirements

# Minimum hardware
# GPU: NVIDIA with ≥16GB VRAM (A100, 4090, 3090 Ti)
# RAM: ≥32GB system RAM
# Storage: ≥100GB (dataset + checkpoints)

# Software
# Python 3.10+, CUDA 11.8+ or 12.1+, PyTorch 2.1+

Clone and Install

git clone https://github.com/ATeam-Research/A1.git
cd A1

conda create -n a1-vla python=3.10 -y
conda activate a1-vla

pip install -e .
pip install flash-attn --no-build-isolation

python -c "import a1_vla; print('A1 VLA ready!')"

Download Pretrained Checkpoints

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ATeam-Research/A1-7B",
    local_dir="./checkpoints/a1-7b"
)

A1 releases multiple model sizes: 1B (runs on a 16GB GPU), 7B (standard, needs 24GB+), and 34B (research, requires multi-GPU). For most practical use cases, the 7B model is the right choice.

Training and Fine-tuning

Data Preparation

A1 uses the RLDS (Reinforcement Learning Datasets) format, compatible with Open X-Embodiment (OpenX). If you have teleoperation data from a LeRobot-based system, convert it to RLDS first:

python scripts/convert_lerobot_to_rlds.py \
    --input_dir ./data/lerobot_episodes/ \
    --output_dir ./data/rlds_episodes/ \
    --robot franka

Training with PyTorch FSDP

A1 uses PyTorch FSDP (Fully Sharded Data Parallel) to train 7B models across multiple GPUs.

Single A100 80GB:

torchrun --nproc_per_node=1 vla-scripts/train.py \
    --base_model "ATeam-Research/A1-7B" \
    --dataset_dir ./data/rlds_episodes/ \
    --output_dir ./checkpoints/a1-franka-custom \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 10 \
    --gradient_checkpointing true

Multi-GPU (4× A100):

torchrun --nproc_per_node=4 vla-scripts/train.py \
    --base_model "ATeam-Research/A1-7B" \
    --dataset_dir ./data/rlds_episodes/ \
    --output_dir ./checkpoints/a1-franka-custom \
    --batch_size 32 \
    --learning_rate 5e-5 \
    --fsdp true \
    --fsdp_sharding_strategy FULL_SHARD

Important: A1 releases intermediate checkpoints every 1000 steps, enabling you to resume interrupted training and evaluate model quality at different training stages to select the best checkpoint.

Fine-tuning Configuration

# configs/finetune_franka.yaml
base_model: "ATeam-Research/A1-7B"
robot: "franka"
task: "pick_place"

# Truncated flow matching config
flow_matching:
  num_steps: 10
  min_steps: 3            # Stop early when consistency is high enough
  consistency_threshold: 0.95

# Backbone truncation
backbone:
  full_layers: 32         # Final layer used for first denoising step
  truncated_layers: 18    # Layers used for warm-start steps

# Training
training:
  batch_size: 8
  lr: 2e-5
  epochs: 20
  warmup_steps: 100
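Before launching a long fine-tuning run, it can help to sanity-check the relationships between these values. The sketch below works on a plain dictionary mirroring the YAML above; the `validate_config` helper and its rules are assumptions about what a sensible configuration looks like, not checks performed by the A1 code.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    fm = cfg["flow_matching"]
    bb = cfg["backbone"]

    if not 1 <= fm["min_steps"] <= fm["num_steps"]:
        problems.append("min_steps must be between 1 and num_steps")
    if not 0.0 < fm["consistency_threshold"] <= 1.0:
        problems.append("consistency_threshold must be in (0, 1]")
    if not 1 <= bb["truncated_layers"] <= bb["full_layers"]:
        problems.append("truncated_layers must be between 1 and full_layers")
    return problems


config = {
    "flow_matching": {"num_steps": 10, "min_steps": 3, "consistency_threshold": 0.95},
    "backbone": {"full_layers": 32, "truncated_layers": 18},
}

issues = validate_config(config)
print("config OK" if not issues else "\n".join(issues))
```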

Inference on Real Robots

Figure 2: With 100ms adaptive inference budget, A1 responds 3.5× faster than full backbone inference

Franka Panda

from a1_vla import A1Policy
from a1_vla.robots import FrankaRobot
import cv2

policy = A1Policy.from_pretrained(
    "./checkpoints/a1-7b",
    device="cuda",
    adaptive_inference=True,
    latency_budget_ms=100
)

robot = FrankaRobot(
    ip="192.168.1.100",
    use_gripper=True
)

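cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("Could not open camera 0")

instruction = "Pick up the red cup and place it on the tray"

try:
    while True:
        # Grab the latest camera frame; stop if the camera feed ends
        ret, frame = cap.read()
        if not ret:
            break

        # The current joint configuration is passed to the policy as robot state
        joint_pos = robot.get_joint_positions()

        action = policy.predict(
            image=frame,
            instruction=instruction,
            robot_state=joint_pos
        )

        # Non-blocking: return immediately so the next inference can start
        robot.execute(action, blocking=False)
finally:
    # Always release the camera, even on Ctrl+C or an exception
    cap.release()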

AgiBot

from a1_vla.robots import AgiBotRobot

robot = AgiBotRobot(
    config_path="./configs/agibot_world.yaml"
)

# Same interface as Franka: A1 uses a robot-agnostic API.
# Reuses `policy`, `frame`, and `instruction` from the Franka example above.
action = policy.predict(
    image=frame,
    instruction=instruction,
    robot_state=robot.get_state()
)
robot.execute(action)

Tuning the Latency Budget

# Low latency mode — prioritize speed
policy.set_budget(latency_ms=80)

# High quality mode — prioritize accuracy
policy.set_budget(latency_ms=300)

# Inspect current inference statistics
stats = policy.get_inference_stats()
print(f"Avg latency: {stats['avg_latency_ms']:.1f}ms")
print(f"Backbone layers used: {stats['avg_backbone_layers']}")
print(f"Denoising steps: {stats['avg_denoising_steps']:.1f}")
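It can also be useful to cross-check reported statistics with wall-clock timing from your own control loop. The helpers below are a generic sketch using only the Python standard library; `time_calls` and `summarize` are hypothetical names and are not part of the `a1_vla` package.

```python
import math
import statistics
import time


def time_calls(fn, n_calls: int = 20) -> list:
    """Call fn() n_calls times and return per-call wall-clock latencies in ms."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: list, budget_ms: float) -> dict:
    """Mean, nearest-rank 95th percentile, and count of calls over budget."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[p95_index],
        "over_budget": sum(1 for value in ordered if value > budget_ms),
    }


# Synthetic latencies for illustration; in practice pass the result of time_calls(...).
summary = summarize([10.0, 20.0, 30.0, 40.0, 200.0], budget_ms=100.0)
print(summary)
```

With a real policy, something like `time_calls(lambda: policy.predict(...))` would supply the latencies. The tail (p95) usually matters more for control stability than the mean.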

Benchmark Results

VLA Model Comparison

Model	LIBERO	VLABench	RoboChallenge	Latency (est.)
A1 (Full)	96.6%	53.5%	29.0%	~350ms
A1 (Adaptive 100ms)	94.1%	51.2%	27.8%	~100ms
π₀ (Pi-Zero)	~93%	~48%	28.3%	~400ms
X-VLA	~89%	~44%	21.3%	~500ms
RDT-1B	~85%	~40%	15.0%	~600ms

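The key finding: A1 Adaptive at 100ms still outperforms full π₀ on LIBERO (94.1% vs. ~93%) and VLABench (51.2% vs. ~48%) while running about 4× faster by the estimated latencies (~100ms vs. ~400ms). It trails π₀ slightly on RoboChallenge (27.8% vs. 28.3%). For real-world deployment, that is an excellent trade-off.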
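The comparison can be reproduced directly from the table. The numbers below are copied from it, with the approximate π₀ entries (marked "~" in the table) treated as exact for the sake of the arithmetic.

```python
# Success rates (%) and estimated latencies (ms) taken from the table above.
adaptive = {"LIBERO": 94.1, "VLABench": 51.2, "RoboChallenge": 27.8}
pi_zero = {"LIBERO": 93.0, "VLABench": 48.0, "RoboChallenge": 28.3}
adaptive_latency_ms = 100.0
pi_zero_latency_ms = 400.0

# Positive delta means A1 Adaptive scores higher than pi-zero.
deltas = {name: adaptive[name] - pi_zero[name] for name in adaptive}
speedup = pi_zero_latency_ms / adaptive_latency_ms

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
print(f"Speedup: {speedup:.1f}x")
```

Because the π₀ figures and all latencies are estimates, the one-point LIBERO margin in particular should be read as roughly on par rather than a clear win.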

The deeper implication: most of a VLA's "intelligence" doesn't come from running the full backbone at every denoising step. It comes from the quality of the denoising trajectory — which can be adequately guided by intermediate layer representations.

For more context on the VLA landscape and comparison with other architectures, see Overview of VLA Models: From RT-2 to OpenVLA.

Task Category Breakdown

A1 performs particularly well on:

Simple pick-and-place: 98%+ (fast convergence, few denoising steps)
Multi-step manipulation: 91% (requires more denoising steps)
Tasks with occluded objects: 85% (requires full backbone for first step)
Language-conditioned grasping: 96% (VLM backbone excels at language understanding)

Relative weaknesses:

Tasks requiring precise force control (compliant / deformable environments)
High-noise lighting conditions

Why A1 Matters for the Community

This is what genuinely sets A1 apart from many other SOTA papers:

1. Zero closed-source dependencies. Many SOTA VLAs depend on proprietary datasets or pretrained models not released publicly. A1 releases everything: training code, data processing pipeline, intermediate checkpoints, evaluation scripts. Nothing is hidden.

2. Full reproducibility. You can reproduce the paper's results from A to Z. This is a high standard that robotics research often skips because "robot experiments are hard to reproduce."

3. Scalable from 1B to 34B. The training code supports multiple model sizes — you don't need an 80GB A100. The 1B model trains on a 24GB 4090 with small batch size.

4. Standard RLDS data format. Compatible with Open X-Embodiment and the LeRobot ecosystem → easy integration with existing pipelines without rewriting data loaders from scratch.

When Should You Use A1?

Use A1 if you:

Need fast VLA inference for real-time robot control (sub-150ms latency)
Want to fine-tune on your own robot and tasks with full control over the training stack
Need reproducibility and transparency for research or demos
Have limited hardware and need flexible speed/accuracy trade-offs

Think twice if you:

Need high dexterity (complex multi-finger manipulation) — specialized models may perform better
Already have a stable Diffusion Policy pipeline — switching to A1 requires retraining from scratch
Don't have a GPU with ≥16GB VRAM — you'll lose the latency advantage entirely

Summary

A1 VLA solves one of the biggest pain points of VLA in practice: latency too high for real-time robot control. By exploiting that intermediate VLM layers already contain sufficient information to guide flow matching denoising, A1 saves up to 76.6% of backbone computation without sacrificing much performance.

What makes this project stand out beyond the technical contribution is its fully open-source and transparent approach — no closed-source dependencies, full training stack released, intermediate checkpoints available. This sets a benchmark that robotics research should follow more broadly.

If you're building a robot manipulation system and struggling with VLA latency, A1 is the first thing worth trying.

Nguyễn Anh Tuấn
