InSight: How VLAs Learn New Manipulation Skills Without Human Demos

Imagine you've taught a robot to pick up objects. Now you want it to pour water from a bottle — but you have zero demo recordings of that skill. Do you re-record from scratch? Hire a teleoperator? Or can the robot figure it out from what it already knows?

That's exactly the problem Stanford's InSight: Self-Guided Skill Acquisition via Steerable VLAs (arXiv 2606.24884, June 2026) addresses — and it does so elegantly.

The Problem: VLAs Are Capped by Their Training Data

Vision-Language-Action (VLA) models like π0 and π0.5 have proven they can learn manipulation from demonstrations. But they all share a fundamental limitation: you cannot ask the robot to do anything outside its training distribution. Want a new skill? Collect new demos → expensive, time-consuming, operator-dependent.

This is the biggest bottleneck in real-world robotics deployment. The robot's skill inventory is frozen at training time.

InSight attacks this directly with a key observation: manipulation skills are inherently compositional. "Pouring" isn't an entirely new skill — it's "move gripper to bottle", "grasp bottle", "tilt along Y-axis", "return to upright". If the robot already knows some primitives, it only needs to acquire the missing ones and compose them.

Core Idea: Steerable VLA

InSight doesn't replace the VLA — it makes the VLA controllable at the primitive-action level.

Instead of prompting "grab the bowl and pour it into the cup", you can tell the robot:

"move gripper to bowl" → robot only executes that move
"lift upward" → robot only lifts
"tilt along Y-axis" → robot only tilts

When the VLA accepts commands at individual primitive granularity, an external VLM (Gemini) can orchestrate entire task chains in natural language — and simultaneously identify which primitives are missing.

Stage 1: Automatic Primitive Segmentation

To train a steerable VLA, you need demonstration data labeled at the primitive level. But standard demo datasets only label at the task level ("pick and place the apple"). InSight solves this with an automatic segmentation pipeline.

How segmentation works

The pipeline runs in three steps:

Step 1 — VLM decomposition: Given an instruction like "open drawer then grasp object", the VLM (Gemini 3 Flash) decomposes it into a primitive sequence: ["reach to handle", "pull backward", "move to object", "grasp"].

Step 2 — Per-frame end-effector tagging: The system analyzes the end-effector trajectory, computes per-frame delta poses, and assigns a dominant motion axis tag: xy (horizontal movement), z (vertical), rxy (rotation in horizontal plane), rz (rotation around vertical axis). This distinguishes "lift up" from "reach forward" without any manual annotation.

Step 3 — Boundary refinement: Gripper open/close transitions (detected from gripper command velocity) are used to cut segments at visually unambiguous transitions — for example, the exact moment the gripper closes is the boundary between "reach" and "grasp".

Result: each demonstration is sliced into short labeled episodes, one per primitive. These become training data for the steerable VLA.

Stage 1: automatic primitive segmentation pipeline — source: insight-vla.github.io

The progress channel — smart termination

A key technical point: each primitive needs to know when to stop. InSight adds a progress channel to the VLA output — a value in [0, 1) supervised by the normalized timestep within each primitive segment. When progress exceeds 0.95, the primitive is considered complete and the system moves to the next one.

There's also auto-advance based on end-effector motion stillness, and VLM completion checks for more complex primitives.

Stage 2: VLM-Guided Skill Acquisition Flywheel

This is where the autonomous learning actually happens.

Identifying primitive gaps

When given a novel task — say "twist the cap then pour the bottle" — the VLM decomposes the task into its required primitive sequence and compares against vocabulary V (the list of primitives the robot already knows). Any primitive not in V is a "primitive gap".

Important design choice: InSight constrains primitive gaps to single-axis motion only (translation along one of dx/dy/dz, or rotation along one of drx/dry/drz). This isn't a limitation — it's a deliberate design that makes the problem tractable. The vast majority of real manipulation primitives decompose naturally into single-axis motions.

Autonomous execution and verification

For each primitive gap, the VLM analyzes the current scene (via camera) and proposes parameters:

Axis: which direction to move?
Magnitude: how much, and in which sign?

The robot attempts the primitive with these proposed parameters. Afterward, a VLM oracle compares "before" and "after" images to verify whether the primitive succeeded.

Success → the episode gets added to the training dataset.

Retraining with new data

After accumulating N successful rollouts for a new primitive, the VLA is retrained via LoRA — jointly on existing and new primitives. LoRA (Low-Rank Adaptation) allows fast fine-tuning of a small fraction of weights, preventing catastrophic forgetting of previously learned skills.

Stage 2: the VLM-guided data flywheel — source: insight-vla.github.io

Technical Details

Component	Details
Base VLA	π0.5 (architecture-agnostic)
Fine-tuning	LoRA on Gemma-2B backbone + Gemma-300M action expert
VLM	Gemini 3 Flash (4 roles: segmentation, planning, gap ID, verification)
Input	2x 224×224 RGB cameras (scene + wrist) + end-effector pose + gripper state
Output	End-effector deltas + gripper command + progress channel [0,1)
Real hardware	xArm 6 + Intel RealSense cameras
Stack	Python 3.11 + JAX/Flax + uv package manager

InSight is VLA-agnostic — in principle, it can be applied to any VLA that supports primitive-level language conditioning, not just π0.5.

Experimental Results

Real robot (xArm 6)

This is the most impressive part. InSight learns entirely new skills from zero human demos:

New skill acquired	InSight success	Best baseline (CaP-X)
Twist (cap rotation)	92%	32%
Pour (bottle tilting)	96%	16%
Twist-then-pour (14 primitives)	80%	4%
Sweep (from scooping demos)	100% (5/5)	—

Acquisition efficiency:

Twist: 20 successful primitives in 23 trials (~62s robot time per trial)
Pour: 20 successful primitives in 31 trials (~96s robot time per trial)

Skill retention: After learning twist and pour, the robot still achieves 100% on the original pick-and-place tasks. Zero catastrophic forgetting.

Speed advantage: InSight completes twisting ~2× faster than CaP-X (55s vs 109s) because it doesn't need to query the VLM at every timestep like a pure planner does.

Simulation results

Block flipping (learned from pick-and-place demos):

InSight: 75% after 246 successful primitive rollouts / 479 attempts
SAC (RL baseline): 0% — RL only managed to learn reaching (23%) and grasping (10%) within the same compute budget

Drawer closing (learned from drawer-opening demos, out-of-distribution initial states):

Collected 70 successful close-drawer primitives across 82 episodes
Retrained VLA: 100% success on both closing and opening tasks (25 trials)

This last result is notable: the robot learns the "inverse" of a skill it already knows, without any human demos of the inverse.

Installation and Usage

Requirements

# Python 3.11 + uv package manager
pip install uv

# Clone repo
git clone https://github.com/insight-vla/insight
cd insight
git submodule update --init --recursive

# Install dependencies
uv sync
uv sync --group rlds  # optional RLDS support

Compute normalization statistics

Before training, compute normalization stats on your dataset:

uv run python training/compute_norm_stats.py --config-name [config_name]

Training the steerable VLA (Stage 1)

uv run python training/train.py [config_name] \
  --exp-name=my_run \
  --batch-size 32 \
  --fsdp-devices 2

Training uses JAX/Flax with FSDP (Fully Sharded Data Parallelism) — minimum 2 GPUs. This fine-tunes π0.5 with LoRA on the auto-segmented primitive dataset.

Serving the policy for inference

uv run python training/serve_policy.py policy:checkpoint \
  --policy.config [config_name] \
  --policy.dir [checkpoint_path]

The policy is served via API. The Gemini-based VLM pipeline calls this API to control individual primitives during task execution.

Real robot hardware

To run on a physical robot:

xArm 6 (or compatible) with xArm SDK
Intel RealSense D435 (scene camera) + wrist camera
E-stop device (REQUIRED for safety)
Configure workspace bounds in your config file

Limitations and Open Problems

InSight has a few honest limitations worth noting:

Human resets still required: After failed primitive rollouts, a human must reset the environment. The paper acknowledges this and suggests future work via real-to-sim-to-real pipelines or learned world models.
Single-axis primitive constraint: While most real manipulation decomposes naturally, highly dexterous in-hand manipulation with simultaneous multi-axis motion is harder to fit this framework.
VLM quality dependency: Gemini 3 Flash plays four critical roles. If the VLM plans incorrectly or verifies incorrectly, the flywheel can accumulate low-quality primitives.
Acquisition time: Each new primitive requires ~20-30 trials (~1-2 hours of robot time). More complex skills may need more.

Why InSight Matters

InSight lays a practical foundation for continual learning in robotics — more pragmatic than pure RL because it uses natural language as the abstraction layer for defining and verifying skills, rather than requiring carefully designed reward functions.

Comparison to other approaches:

Approach	Pros	Cons
Collect new demos	High quality	Expensive, operator-dependent
RL from scratch	No demos needed	Sample-inefficient, hard reward design
CaP-X (VLM planning)	Flexible	VLM per step → slow, doesn't learn
InSight	Self-improving, composable, fast	Needs human resets, single-axis gap

With its architecture — steerable VLA + VLM orchestration + LoRA retraining — InSight points toward robots that can autonomously expand their skill inventory over time in real deployment environments. That's a meaningful step toward robots that get better at their jobs just by doing them.

For more on fine-tuning VLAs for specific tasks, see Fine-tuning VLA on LIBERO with Embodied-R1.5. For a deep dive on the VLA architecture itself, OpenVLA Deep Dive covers the backbone in detail.

If you're interested in how world models complement VLAs, Weaver: World Models for π0.5 Manipulation is a natural companion read.

That's exactly the problem Stanford's InSight: Self-Guided Skill Acquisition via Steerable VLAs (arXiv 2606.24884, June 2026) addresses — and it does so elegantly.

The Problem: VLAs Are Capped by Their Training Data

This is the biggest bottleneck in real-world robotics deployment. The robot's skill inventory is frozen at training time.

Core Idea: Steerable VLA

InSight doesn't replace the VLA — it makes the VLA controllable at the primitive-action level.

Instead of prompting "grab the bowl and pour it into the cup", you can tell the robot:

"move gripper to bowl" → robot only executes that move
"lift upward" → robot only lifts
"tilt along Y-axis" → robot only tilts

Stage 1: Automatic Primitive Segmentation

How segmentation works

The pipeline runs in three steps:

Result: each demonstration is sliced into short labeled episodes, one per primitive. These become training data for the steerable VLA.

Stage 1: automatic primitive segmentation pipeline — source: insight-vla.github.io

The progress channel — smart termination

There's also auto-advance based on end-effector motion stillness, and VLM completion checks for more complex primitives.

Stage 2: VLM-Guided Skill Acquisition Flywheel

This is where the autonomous learning actually happens.

Identifying primitive gaps

Autonomous execution and verification

For each primitive gap, the VLM analyzes the current scene (via camera) and proposes parameters:

Axis: which direction to move?
Magnitude: how much, and in which sign?

The robot attempts the primitive with these proposed parameters. Afterward, a VLM oracle compares "before" and "after" images to verify whether the primitive succeeded.

Success → the episode gets added to the training dataset.

Retraining with new data

Stage 2: the VLM-guided data flywheel — source: insight-vla.github.io

Technical Details

Component	Details
Base VLA	π0.5 (architecture-agnostic)
Fine-tuning	LoRA on Gemma-2B backbone + Gemma-300M action expert
VLM	Gemini 3 Flash (4 roles: segmentation, planning, gap ID, verification)
Input	2x 224×224 RGB cameras (scene + wrist) + end-effector pose + gripper state
Output	End-effector deltas + gripper command + progress channel [0,1)
Real hardware	xArm 6 + Intel RealSense cameras
Stack	Python 3.11 + JAX/Flax + uv package manager

InSight is VLA-agnostic — in principle, it can be applied to any VLA that supports primitive-level language conditioning, not just π0.5.

Experimental Results

Real robot (xArm 6)

This is the most impressive part. InSight learns entirely new skills from zero human demos:

New skill acquired	InSight success	Best baseline (CaP-X)
Twist (cap rotation)	92%	32%
Pour (bottle tilting)	96%	16%
Twist-then-pour (14 primitives)	80%	4%
Sweep (from scooping demos)	100% (5/5)	—

Acquisition efficiency:

Twist: 20 successful primitives in 23 trials (~62s robot time per trial)
Pour: 20 successful primitives in 31 trials (~96s robot time per trial)

Skill retention: After learning twist and pour, the robot still achieves 100% on the original pick-and-place tasks. Zero catastrophic forgetting.

Speed advantage: InSight completes twisting ~2× faster than CaP-X (55s vs 109s) because it doesn't need to query the VLM at every timestep like a pure planner does.

Simulation results

Block flipping (learned from pick-and-place demos):

InSight: 75% after 246 successful primitive rollouts / 479 attempts
SAC (RL baseline): 0% — RL only managed to learn reaching (23%) and grasping (10%) within the same compute budget

Drawer closing (learned from drawer-opening demos, out-of-distribution initial states):

Collected 70 successful close-drawer primitives across 82 episodes
Retrained VLA: 100% success on both closing and opening tasks (25 trials)

This last result is notable: the robot learns the "inverse" of a skill it already knows, without any human demos of the inverse.

Installation and Usage

Requirements

# Python 3.11 + uv package manager
pip install uv

# Clone repo
git clone https://github.com/insight-vla/insight
cd insight
git submodule update --init --recursive

# Install dependencies
uv sync
uv sync --group rlds  # optional RLDS support

Compute normalization statistics

Before training, compute normalization stats on your dataset:

uv run python training/compute_norm_stats.py --config-name [config_name]

Training the steerable VLA (Stage 1)

uv run python training/train.py [config_name] \
  --exp-name=my_run \
  --batch-size 32 \
  --fsdp-devices 2

Training uses JAX/Flax with FSDP (Fully Sharded Data Parallelism) — minimum 2 GPUs. This fine-tunes π0.5 with LoRA on the auto-segmented primitive dataset.

Serving the policy for inference

uv run python training/serve_policy.py policy:checkpoint \
  --policy.config [config_name] \
  --policy.dir [checkpoint_path]

The policy is served via API. The Gemini-based VLM pipeline calls this API to control individual primitives during task execution.

Real robot hardware

To run on a physical robot:

xArm 6 (or compatible) with xArm SDK
Intel RealSense D435 (scene camera) + wrist camera
E-stop device (REQUIRED for safety)
Configure workspace bounds in your config file

Limitations and Open Problems

InSight has a few honest limitations worth noting:

Human resets still required: After failed primitive rollouts, a human must reset the environment. The paper acknowledges this and suggests future work via real-to-sim-to-real pipelines or learned world models.
Single-axis primitive constraint: While most real manipulation decomposes naturally, highly dexterous in-hand manipulation with simultaneous multi-axis motion is harder to fit this framework.
VLM quality dependency: Gemini 3 Flash plays four critical roles. If the VLM plans incorrectly or verifies incorrectly, the flywheel can accumulate low-quality primitives.
Acquisition time: Each new primitive requires ~20-30 trials (~1-2 hours of robot time). More complex skills may need more.

Why InSight Matters

Comparison to other approaches:

Approach	Pros	Cons
Collect new demos	High quality	Expensive, operator-dependent
RL from scratch	No demos needed	Sample-inefficient, hard reward design
CaP-X (VLM planning)	Flexible	VLM per step → slow, doesn't learn
InSight	Self-improving, composable, fast	Needs human resets, single-axis gap

For more on fine-tuning VLAs for specific tasks, see Fine-tuning VLA on LIBERO with Embodied-R1.5. For a deep dive on the VLA architecture itself, OpenVLA Deep Dive covers the backbone in detail.

If you're interested in how world models complement VLAs, Weaver: World Models for π0.5 Manipulation is a natural companion read.

The Problem: VLAs Are Capped by Their Training Data

Core Idea: Steerable VLA

Stage 1: Automatic Primitive Segmentation

How segmentation works

The progress channel — smart termination

Stage 2: VLM-Guided Skill Acquisition Flywheel

Identifying primitive gaps

Autonomous execution and verification

Retraining with new data

Technical Details

Experimental Results

Real robot (xArm 6)

Simulation results

Installation and Usage

Requirements

Compute normalization statistics

Training the steerable VLA (Stage 1)

Serving the policy for inference

Real robot hardware

Limitations and Open Problems

Why InSight Matters

Related Posts

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

RLinf-Co: Sim-Real Co-Training cho VLA với RL

VLA-Adapter: Train VLA 0.5B với 9.6GB VRAM, 99.2% LIBERO

The Problem: VLAs Are Capped by Their Training Data

Core Idea: Steerable VLA

Stage 1: Automatic Primitive Segmentation

How segmentation works

The progress channel — smart termination

Stage 2: VLM-Guided Skill Acquisition Flywheel

Identifying primitive gaps

Autonomous execution and verification

Retraining with new data

Technical Details

Experimental Results

Real robot (xArm 6)

Simulation results

Installation and Usage

Requirements

Compute normalization statistics

Training the steerable VLA (Stage 1)

Serving the policy for inference

Real robot hardware

Limitations and Open Problems

Why InSight Matters

Related Posts

Nguyễn Anh Tuấn

Related Posts

Hướng dẫn VLA-JEPA: VLA với Latent World Model V-JEPA2

RLinf-Co: Sim-Real Co-Training cho VLA với RL

VLA-Adapter: Train VLA 0.5B với 9.6GB VRAM, 99.2% LIBERO