Tags: ai, lerobot, smolvla, vla, fine-tuning

SmolVLA: Train a 450M VLA on Consumer GPU

Detailed guide to fine-tuning SmolVLA — a 450M VLA model that runs on consumer GPUs, from data collection to real robot deployment.

Nguyễn Anh Tuấn · April 9, 2026 · 10 min read

SmolVLA: VLA for Everyone, Not Just Labs

If you've read our LeRobot v0.5 overview, you know that VLA (Vision-Language-Action) models are transforming how we program robots. But there's a major problem: most VLA models require data center GPUs — A100, H100 — for both training and inference. This limits VLA to large companies and well-funded research labs.

SmolVLA changes the picture entirely. With only 450M parameters, SmolVLA runs on RTX 3060 — the most popular GPU on the market. You can fine-tune it on a gaming laptop, deploy on Jetson Orin, and achieve 78% real-world success rate on manipulation tasks.

In this tutorial, we'll walk through everything step by step: understanding the architecture, collecting data, fine-tuning the model, and deploying on a real robot.

Robot learning on consumer hardware

What is SmolVLA?

Architecture Overview

SmolVLA is the most compact VLA model currently capable of real-world performance. The architecture consists of three main components:

1. SigLIP Vision Encoder (~100M params)

2. SmolLM2 Language Decoder (~250M params)

3. Flow Matching Action Expert (~100M params)

Total: ~450M parameters — roughly 15x smaller than OpenVLA (7B) and 7x smaller than Pi0 (3B).
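
As a quick sanity check on the arithmetic, the component sizes quoted above sum to the headline figure, and dividing the larger models' parameter counts by it gives the size ratios:

```python
# Approximate SmolVLA parameter budget (component sizes as quoted above).
components = {
    "siglip_vision_encoder": 100e6,
    "smollm2_decoder": 250e6,
    "flow_matching_action_expert": 100e6,
}
total = sum(components.values())
print(f"~{total / 1e6:.0f}M")   # ~450M
print(round(7e9 / total, 1))    # vs OpenVLA (7B)
print(round(3e9 / total, 1))    # vs Pi0 (3B)
```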

Why Is SmolVLA Small Yet Effective?

The secret lies in three key techniques:

Layer Skipping: Instead of passing every visual token through all transformer layers, SmolVLA only passes through selected layers. This reduces computation without significantly affecting accuracy.

Visual Token Reduction: Traditional VLAs encode each frame into 1024 tokens — very expensive for the attention mechanism. SmolVLA compresses this to 64 tokens using learned pooling, reducing computation 16x in attention layers.
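
To see why token count matters, compare rough FLOP counts for a linear layer (which scales linearly with sequence length) and for self-attention (which scales quadratically). The hidden size of 512 is an assumption for the sketch, not SmolVLA's actual width:

```python
# Rough FLOP comparison: 1024 vs 64 visual tokens per frame.
def linear_cost(n_tokens, dim):
    # Projections/MLPs scale linearly with token count: ~ n * d^2
    return n_tokens * dim * dim

def attention_cost(n_tokens, dim):
    # QK^T scores plus attention-weighted values: ~ 2 * n^2 * d
    return 2 * n_tokens**2 * dim

d = 512  # illustrative hidden size
print(linear_cost(1024, d) // linear_cost(64, d))        # 16
print(attention_cost(1024, d) // attention_cost(64, d))  # 256
```

Linear layers see the quoted 16x saving; self-attention over the visual tokens, being quadratic in sequence length, saves even more.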

Flow Matching over Diffusion: The action expert uses flow matching — a simpler and more efficient version of the diffusion process, requiring fewer denoising steps (5-10 steps instead of 50-100).
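
A toy 1D illustration of why few steps suffice: when the learned velocity field transports samples along a near-straight path, a handful of Euler integration steps already reaches the target. This is a conceptual sketch with an analytic velocity field, not SmolVLA's actual action expert:

```python
# Toy flow-matching inference: integrate a velocity field with a few Euler steps.
def velocity(x, t, target=1.0):
    # Velocity that transports x to `target` along a straight path by t=1.
    return (target - x) / max(1.0 - t, 1e-6)

def sample(num_steps=5):
    x, dt = 0.0, 1.0 / num_steps
    for i in range(num_steps):
        x += velocity(x, i * dt) * dt  # one Euler step
    return x

print(round(sample(5), 4))   # on target after only 5 steps
print(round(sample(10), 4))
```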

Comparison with Other VLA Models

Model      Params   Min GPU     Real-world Success   Inference Speed
OpenVLA    7B       A100 80GB   ~70%                 ~2 Hz
Pi0        3B       A100 40GB   ~85%                 ~5 Hz
Pi0-FAST   3B       RTX 4090    ~82%                 ~25 Hz
SmolVLA    450M     RTX 3060    78%                  ~15 Hz

SmolVLA sacrifices a bit of accuracy (78% vs Pi0's 85%) in exchange for the ability to run on commodity hardware — a very worthwhile tradeoff for most applications.

Step 1: Install LeRobot and SmolVLA Dependencies

System Requirements

GPU: an NVIDIA card with CUDA support; RTX 3060 (12 GB VRAM) or better for fine-tuning, Jetson Orin for inference
Python: 3.10 or newer (the commands below use Python 3.12)

Installation

# Create virtual environment
python3.12 -m venv smolvla-env
source smolvla-env/bin/activate

# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Install with SmolVLA extras
pip install -e ".[smolvla]"

# Verify installation
python -c "from lerobot.policies import SmolVLAPolicy; print('SmolVLA OK')"

The [smolvla] extra additionally installs transformers>=5.0, accelerate, and the dependencies needed for SigLIP and SmolLM2.

Check GPU

python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
"

Step 2: Collect Dataset

Why Data Matters More Than the Model

A critical lesson from the LeRobot community: data quality > model size. SmolVLA at 450M params with 100 high-quality episodes will outperform OpenVLA at 7B with 50 low-quality episodes.

This means you should spend 60-70% of your time on data collection and only 30-40% on training/tuning.

Collecting Data with LeRobot

If you have a real robot (SO-100, SO-101, or compatible), use lerobot-record:

# Record 50 episodes for pick-and-place task
lerobot-record \
  --robot.type=so100 \
  --repo_id=YOUR_USERNAME/pick_place_cube \
  --num_episodes=50 \
  --fps=30

For a deeper understanding of the data collection process, refer to our teleop and data collection guide in this series.

Tips for High-Quality Data

Diverse starting positions: Place the object at many different positions. If you only place the cube in the center of the table, the robot will only learn to pick from the center.

Good: 50 episodes, 10 different cube positions, 5 episodes per position
Bad: 50 episodes, cube always in the same spot
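
One way to plan this coverage before recording, sketched with made-up workspace coordinates (the grid values here are hypothetical, in metres):

```python
# Sketch: pre-plan diverse cube positions for 50 episodes.
import itertools
import random

grid_x = [0.15, 0.20, 0.25, 0.30, 0.35]
grid_y = [-0.10, 0.10]
positions = list(itertools.product(grid_x, grid_y))  # 10 distinct positions

episodes_per_position = 5
plan = [pos for pos in positions for _ in range(episodes_per_position)]
random.shuffle(plan)  # interleave so recording drift isn't correlated with position

print(len(plan), len(set(plan)))  # 50 episodes over 10 unique positions
```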

Diverse speeds: Teleop at a natural pace, not too fast or too slow. Avoid pausing mid-action — if you pause, the robot learns that "standing still" is valid behavior.

Check frame rate: Ensure you record at a stable 30 FPS. Unstable FPS creates temporal inconsistency in training.
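
You can sanity-check stability directly from a recording's frame timestamps; `unstable_fraction` is a hypothetical helper, not part of LeRobot:

```python
# Fraction of inter-frame intervals deviating more than 10% from the target FPS.
def unstable_fraction(timestamps, target_fps=30, tolerance=0.10):
    expected = 1.0 / target_fps
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    bad = sum(abs(d - expected) / expected > tolerance for d in deltas)
    return bad / len(deltas)

steady = [i / 30 for i in range(900)]  # a clean 30-second clip at 30 FPS
jittery = [t + (0.01 if i % 2 else 0.0) for i, t in enumerate(steady)]

print(unstable_fraction(steady))   # 0.0
print(unstable_fraction(jittery))  # 1.0 — every interval is 0.01 s off
```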

Minimum quantities: aim for at least 50 episodes for a simple pick-and-place task, and closer to 100 when object positions, lighting, or the task itself vary more.

Using Existing Datasets

If you don't have a robot, you can fine-tune on existing datasets:

# Use SmolVLA's reference dataset
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_pickplace \
  --training.batch_size=64 \
  --training.steps=20000

The lerobot/svla_so100_pickplace dataset contains ~200 pick-and-place episodes on SO-100, pre-formatted for SmolVLA.

Collecting data for robot learning

Step 3: Fine-tune SmolVLA from Pretrained

Why Fine-tune Instead of Training from Scratch?

The SmolVLA base model (lerobot/smolvla_base) has been pretrained on thousands of hours of community-contributed robot data from the LeRobot ecosystem.

The pretrained model has already learned general manipulation skills: how to approach objects, when to close the gripper, how to move smoothly. Fine-tuning only needs to teach it your specific task and specific embodiment.

This reduces required episodes from thousands to 50-100 and training time from days to a few hours.

Basic Fine-tuning Command

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=YOUR_USERNAME/pick_place_cube \
  --training.batch_size=64 \
  --training.steps=20000 \
  --training.lr=1e-5 \
  --training.save_freq=5000

Parameter Explanation

policy.path: the pretrained checkpoint to start from (lerobot/smolvla_base)
dataset.repo_id: your dataset on the Hugging Face Hub
training.batch_size: samples per optimization step; lower it if you hit out-of-memory errors
training.steps: total optimization steps; 20k is a reasonable starting point for 50-100 episodes
training.lr: learning rate; 1e-5 is a conservative value for fine-tuning
training.save_freq: write a checkpoint every 5,000 steps

Estimated Training Times

GPU         Batch Size   20k Steps    50k Steps
A100 80GB   128          ~2 hours     ~5 hours
RTX 4090    64           ~4 hours     ~10 hours
RTX 3090    32           ~6 hours     ~15 hours
RTX 3060    16           ~10 hours    ~25 hours

Google Colab

If you don't have a powerful GPU, SmolVLA can be trained on Google Colab Pro (A100):

# In Colab notebook
!pip install lerobot[smolvla]

# Log in to the Hugging Face Hub
from huggingface_hub import login
login()

# Train
!lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_pickplace \
  --training.batch_size=64 \
  --training.steps=20000 \
  --training.output_dir=/content/outputs

Monitoring Training

LeRobot v0.5 logs training metrics to Weights & Biases (if installed):

pip install wandb
wandb login

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=YOUR_USERNAME/pick_place_cube \
  --training.batch_size=64 \
  --training.steps=20000 \
  --training.wandb.enable=true \
  --training.wandb.project=smolvla-finetune

Monitor action_loss — it should gradually decrease and stabilize. If loss plateaus early (before 10k steps), you may need more data or a higher learning rate.
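
A minimal plateau check you could run on an exported loss history (`plateaued` is a hypothetical helper, not a LeRobot or W&B function):

```python
# True if the mean loss of the last window improved <1% over the previous window.
def plateaued(losses, window=1000, rel_improvement=0.01):
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) / prev < rel_improvement

falling = [1.0 / (1 + 0.001 * i) for i in range(4000)]  # still improving
flat = [0.5] * 4000                                      # stuck
print(plateaued(falling), plateaued(flat))  # False True
```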

Step 4: Evaluate on Real Robot

Deploying the Policy

After training, you can deploy the model directly:

# Run policy on real robot
lerobot-record \
  --robot.type=so100 \
  --policy.path=outputs/train/smolvla/checkpoints/last/pretrained_model \
  --repo_id=YOUR_USERNAME/eval_results \
  --num_episodes=10

This command will:

  1. Load the trained model
  2. Run inference loop: camera -> model -> robot actions
  3. Record results into a dataset (for later review)
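
The inference loop above boils down to a fixed-rate sense-predict-act cycle. A sketch with stubbed-out camera, policy, and robot interfaces (none of these are LeRobot's actual API):

```python
# Illustrative fixed-rate control loop: camera -> model -> robot actions.
import time

def control_loop(get_frame, policy, send_action, fps=30, max_steps=900):
    period = 1.0 / fps
    for _ in range(max_steps):
        start = time.monotonic()
        obs = get_frame()      # camera observation
        action = policy(obs)   # model inference
        send_action(action)    # command the robot
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, period - elapsed))  # hold the control rate

# Toy run with stub camera/policy/robot:
log = []
control_loop(lambda: 0, lambda o: o + 1, log.append, fps=1000, max_steps=3)
print(log)  # [1, 1, 1]
```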

Evaluate in Simulation

To test before deploying on a real robot:

lerobot-eval \
  --policy.path=outputs/train/smolvla/checkpoints/last/pretrained_model \
  --env.type=libero \
  --env.task=libero_object \
  --eval.num_episodes=50

Key Metrics

Success rate: the fraction of evaluation episodes that complete the task (SmolVLA's reference real-world number is 78%)
Inference speed: should stay at or above 10 Hz for smooth real-robot control

Advanced Features

Async Inference: 30% Faster

SmolVLA supports asynchronous inference — while the robot executes actions from the current chunk, the model runs inference for the next chunk in the background:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --policy.async_inference=true \
  --dataset.repo_id=YOUR_USERNAME/pick_place_cube

Result: 30% reduction in latency, 2x throughput increase. Especially useful on slower hardware (RTX 3060, Jetson Orin).
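
The idea can be sketched with a single background worker: while the actions of chunk k execute, chunk k+1 is already being predicted. This is an illustrative threading pattern, not LeRobot's implementation:

```python
# Overlap inference for chunk k+1 with execution of chunk k.
from concurrent.futures import ThreadPoolExecutor

def infer(chunk_id):            # stand-in for model inference
    return [f"a{chunk_id}_{i}" for i in range(3)]

def execute(actions, log):      # stand-in for sending actions to the robot
    log.extend(actions)

log = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(infer, 0)
    for k in range(1, 4):
        actions = future.result()       # wait for the ready chunk
        future = pool.submit(infer, k)  # predict next chunk in the background
        execute(actions, log)           # execute while inference runs
    execute(future.result(), log)       # drain the final chunk

print(len(log))  # 12 actions across 4 chunks, inference hidden behind execution
```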

Visual Token Reduction Details

By default, SmolVLA uses 64 visual tokens per frame. You can adjust:

# Fewer tokens = faster, but loses detail
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --policy.num_visual_tokens=32 \
  --dataset.repo_id=YOUR_USERNAME/dataset

# More tokens = slower, but more detailed (for precision tasks)
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --policy.num_visual_tokens=128 \
  --dataset.repo_id=YOUR_USERNAME/dataset

Rule of thumb: Use 64 tokens for most tasks. Increase to 128 if the task requires fine-grained spatial reasoning (e.g., inserting a peg into a hole).

Multi-camera Support

SmolVLA supports multiple cameras:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --policy.camera_names='["top","wrist"]' \
  --dataset.repo_id=YOUR_USERNAME/dataset

Tip: A wrist camera significantly improves grasping accuracy. If you only have one camera, prioritize wrist camera over top-down camera.

Common Troubleshooting

CUDA Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution: Reduce batch_size. If batch_size=8 still causes OOM, try:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --training.batch_size=4 \
  --training.gradient_accumulation_steps=8 \
  --dataset.repo_id=YOUR_USERNAME/dataset

gradient_accumulation_steps=8 with batch_size=4 gives effective batch size = 32 without extra VRAM.
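
In plain terms: average the gradients of 8 small micro-batches, then apply one parameter update, which matches a single update on a batch of 32. A framework-free sketch on a toy quadratic loss:

```python
# Gradient accumulation: one update from 8 micro-batches of size 4.
def train_step(micro_batches, lr=0.1, w=0.0):
    grad_sum = 0.0
    for xb in micro_batches:
        # Gradient of the mean squared error (w - x)^2 over the micro-batch.
        grad_sum += sum(2 * (w - x) for x in xb) / len(xb)
    w -= lr * grad_sum / len(micro_batches)  # single update, averaged gradient
    return w

data = [[1.0] * 4 for _ in range(8)]  # effective batch of 32 samples
print(round(train_step(data), 3))     # same step as one batch of 32
```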

Loss Not Decreasing

If loss plateaus from the start:

  1. Check dataset: Ensure correct format (lerobot-check-dataset --repo_id=YOUR/dataset)
  2. Increase learning rate: Try lr=5e-5 instead of 1e-5
  3. Check task description: SmolVLA uses language conditioning — task descriptions must be clear and consistent

Jerky Robot Actions

If the robot moves jerkily during inference:

  1. Enable action smoothing: --policy.temporal_smoothing=true
  2. Increase chunk_size: --policy.chunk_size=20 (default 10)
  3. Check FPS: inference must run at >=10 Hz
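
Temporal smoothing amounts to low-pass filtering the action stream; one simple form is an exponential moving average (illustrative only; the flag's exact method may differ):

```python
# Exponential moving average over a 1-D action sequence to damp jitter.
def smooth(actions, alpha=0.5):
    out, prev = [], actions[0]
    for a in actions:
        prev = alpha * a + (1 - alpha) * prev  # blend new command with history
        out.append(prev)
    return out

raw = [0.0, 1.0, 0.0, 1.0]  # jittery command sequence
print(smooth(raw))  # [0.0, 0.5, 0.25, 0.625] — steps are damped
```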

Conclusion

SmolVLA is a breakthrough in democratizing robot AI. For the first time, you can train a VLA model that works in the real world without needing data center GPUs. The complete pipeline — from data collection, fine-tuning, to deployment — all runs on hardware that any student or hobbyist can own.

However, remember: data quality is the decisive factor. Spending time collecting diverse, consistent, and sufficient data will yield much better results than focusing solely on hyperparameter tuning.

For a deeper theoretical foundation, read our articles on Ψ₀: Architecture Overview and the Diffusion Policy deep dive. When you're ready for a more powerful model, the next post covers Pi0-FAST, a 5x faster autoregressive VLA.

