SmolVLA: VLA for Everyone, Not Just Labs
If you've read our LeRobot v0.5 overview, you know that VLA (Vision-Language-Action) models are transforming how we program robots. But there's a major problem: most VLA models require data center GPUs — A100, H100 — for both training and inference. That keeps VLAs in the hands of large companies and well-funded research labs.
SmolVLA changes the picture entirely. With only 450M parameters, SmolVLA runs on an RTX 3060, one of the most popular consumer GPUs. You can fine-tune it on a gaming laptop, deploy it on a Jetson Orin, and reach a 78% real-world success rate on manipulation tasks.
In this tutorial, we'll walk through everything step by step: understanding the architecture, collecting data, fine-tuning the model, and deploying on a real robot.
What is SmolVLA?
Architecture Overview
SmolVLA is the most compact VLA model currently capable of real-world performance. The architecture consists of three main components:
1. SigLIP Vision Encoder (~100M params)
- Processes camera images
- Output: visual tokens representing the scene
- Key feature: only 64 tokens per frame (vs 1024 in other VLAs) thanks to visual token reduction
2. SmolLM2 Language Decoder (~250M params)
- Processes text instructions (e.g., "pick up the red cube")
- Combines visual tokens and language tokens
- Output: latent representations for the action expert
3. Flow Matching Action Expert (~100M params)
- Receives latent representations from language decoder
- Generates continuous robot actions via iterative denoising
- Output: action chunk (e.g., 10 actions, each containing joint positions + gripper state)
Total: ~450M parameters — roughly 15x smaller than OpenVLA (7B) and over 6x smaller than Pi0 (3B).
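To make the three-stage pipeline concrete, here is a toy shape trace of one forward pass. The token count (64 per frame) and chunk size come from the description above; hidden_dim, text_len, and action_dim are made-up illustrative values, and the real model is of course a full PyTorch network, not this bookkeeping sketch.

```python
def smolvla_forward_shapes(num_cameras=1, num_visual_tokens=64, text_len=16,
                           hidden_dim=576, chunk_size=10, action_dim=7):
    # Stage 1: SigLIP turns each frame into 64 visual tokens.
    visual = (num_cameras * num_visual_tokens, hidden_dim)
    # Stage 2: SmolLM2 fuses visual tokens with the instruction tokens.
    fused = (visual[0] + text_len, hidden_dim)
    # Stage 3: the action expert emits a chunk of continuous actions.
    chunk = (chunk_size, action_dim)
    return {"visual_tokens": visual, "fused_tokens": fused, "action_chunk": chunk}

shapes = smolvla_forward_shapes()
print(shapes["action_chunk"])  # (10, 7)
```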
Why Is SmolVLA Small Yet Effective?
The secret lies in three key techniques:
Layer Skipping: Instead of running every token through all of the language model's transformer layers, SmolVLA reads features from an intermediate layer and skips the rest. This cuts computation without significantly affecting accuracy.
Visual Token Reduction: Traditional VLAs encode each frame into 1024 tokens — very expensive for the attention mechanism. SmolVLA compresses this to 64 tokens using learned pooling, a 16x reduction in token count that shrinks the quadratic attention cost even more.
Flow Matching over Diffusion: The action expert uses flow matching — a simpler and more efficient version of the diffusion process, requiring fewer denoising steps (5-10 steps instead of 50-100).
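The denoising loop behind flow matching is only a few lines: start from Gaussian noise at t=0 and Euler-integrate a velocity field to t=1. In the real action expert the field is predicted by a transformer; the straight-line "transport to a fixed target" field below is purely illustrative, and the target values are made up.

```python
import random

def flow_matching_sample(velocity_field, action_dim=7, num_steps=10, seed=0):
    # Integrate dx/dt = v(x, t) from noise at t=0 to an action at t=1.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_field(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x

# Toy field that moves any sample straight toward one fixed action
# (6 joint positions + gripper state, values invented for the demo).
target = [0.5, -0.2, 0.1, 0.0, 0.3, -0.1, 1.0]
field = lambda x, t: [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
action = flow_matching_sample(field)  # converges to target in 10 steps
```

Because the toy field is linear in time, even 10 Euler steps land exactly on the target; a learned field needs the 5-10 steps quoted above to get close.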
Comparison with Other VLA Models
| Model | Params | Min GPU | Real-world Success | Inference Speed |
|---|---|---|---|---|
| OpenVLA | 7B | A100 80GB | ~70% | ~2 Hz |
| Pi0 | 3B | A100 40GB | ~85% | ~5 Hz |
| Pi0-FAST | 3B | RTX 4090 | ~82% | ~25 Hz |
| SmolVLA | 450M | RTX 3060 | 78% | ~15 Hz |
SmolVLA sacrifices a bit of accuracy (78% vs Pi0's 85%) in exchange for the ability to run on commodity hardware — a very worthwhile tradeoff for most applications.
Step 1: Install LeRobot and SmolVLA Dependencies
System Requirements
- Python 3.12+ (required for LeRobot v0.5)
- CUDA 12.1+ (or ROCm 6.0+ for AMD GPUs)
- GPU: minimum 6GB VRAM for inference, 12GB+ for training
- RAM: 16GB+ (32GB recommended)
Installation
# Create virtual environment
python3.12 -m venv smolvla-env
source smolvla-env/bin/activate
# Clone LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Install with SmolVLA extras
pip install -e ".[smolvla]"
# Verify installation
python -c "from lerobot.policies import SmolVLAPolicy; print('SmolVLA OK')"
The [smolvla] extra additionally installs transformers, accelerate, and the dependencies needed for the SigLIP and SmolLM2 components.
Check GPU
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
"
Step 2: Collect Dataset
Why Data Matters More Than the Model
A critical lesson from the LeRobot community: data quality > model size. SmolVLA at 450M params with 100 high-quality episodes will outperform OpenVLA at 7B with 50 low-quality episodes.
This means you should spend 60-70% of your time on data collection and only 30-40% on training/tuning.
Collecting Data with LeRobot
If you have a real robot (SO-100, SO-101, or compatible), use lerobot-record:
# Record 50 episodes for pick-and-place task
lerobot-record \
--robot.type=so100 \
--repo_id=YOUR_USERNAME/pick_place_cube \
--num_episodes=50 \
--fps=30
For a deeper understanding of the data collection process, refer to our teleop and data collection guide in this series.
Tips for High-Quality Data
Diverse starting positions: Place the object at many different positions. If you only place the cube in the center of the table, the robot will only learn to pick from the center.
Good: 50 episodes, 10 different cube positions, 5 episodes per position
Bad: 50 episodes, cube always in the same spot
Diverse speeds: Teleop at a natural pace, not too fast or too slow. Avoid pausing mid-action — if you pause, the robot learns that "standing still" is valid behavior.
Check frame rate: Ensure you record at a stable 30 FPS. Unstable FPS creates temporal inconsistency in training.
Minimum quantities:
- Simple task (pick-and-place 1 object): 50 episodes
- Complex task (stacking, sorting): 100-200 episodes
- Task with many variations: 10 episodes per variation (e.g., 5 object types x 10 episodes = 50)
Using Existing Datasets
If you don't have a robot, you can fine-tune on existing datasets:
# Use SmolVLA's reference dataset
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=lerobot/svla_so100_pickplace \
--training.batch_size=64 \
--training.steps=20000
The lerobot/svla_so100_pickplace dataset contains ~200 pick-and-place episodes on SO-100, pre-formatted for SmolVLA.
Step 3: Fine-tune SmolVLA from Pretrained
Why Fine-tune Instead of Training from Scratch?
The SmolVLA base model (lerobot/smolvla_base) has been pretrained on thousands of hours of robot data, including:
- Bridge V2 dataset
- DROID dataset
- Open-X Embodiment data
- And many more datasets
The pretrained model has already learned general manipulation skills: how to approach objects, when to close the gripper, how to move smoothly. Fine-tuning only needs to teach it your specific task and specific embodiment.
This reduces required episodes from thousands to 50-100 and training time from days to a few hours.
Basic Fine-tuning Command
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=YOUR_USERNAME/pick_place_cube \
--training.batch_size=64 \
--training.steps=20000 \
--training.lr=1e-5 \
--training.save_freq=5000
Parameter Explanation
- --policy.path: Path to the pretrained model on the Hugging Face Hub
- --dataset.repo_id: Your dataset (or a public one)
- --training.batch_size: Samples per batch. Reduce it if you run out of VRAM:
  - A100 80GB: batch_size=128
  - RTX 4090 24GB: batch_size=64
  - RTX 3090 24GB: batch_size=32
  - RTX 3060 12GB: batch_size=16
- --training.steps: Total training steps. 20000 is a good starting point for 50-100 episodes
- --training.lr: Learning rate. 1e-5 is a safe default for fine-tuning
- --training.save_freq: Save a checkpoint every N steps
Estimated Training Times
| GPU | Batch Size | 20k Steps | 50k Steps |
|---|---|---|---|
| A100 80GB | 128 | ~2 hours | ~5 hours |
| RTX 4090 | 64 | ~4 hours | ~10 hours |
| RTX 3090 | 32 | ~6 hours | ~15 hours |
| RTX 3060 | 16 | ~10 hours | ~25 hours |
Google Colab
If you don't have a powerful GPU, SmolVLA can be trained on Google Colab Pro (A100):
# In Colab notebook
!pip install "lerobot[smolvla]"
# Login Hugging Face
from huggingface_hub import login
login()
# Train
!lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=lerobot/svla_so100_pickplace \
--training.batch_size=64 \
--training.steps=20000 \
--training.output_dir=/content/outputs
Monitoring Training
LeRobot v0.5 logs training metrics to Weights & Biases (if installed):
pip install wandb
wandb login
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=YOUR_USERNAME/pick_place_cube \
--training.batch_size=64 \
--training.steps=20000 \
--training.wandb.enable=true \
--training.wandb.project=smolvla-finetune
Monitor action_loss — it should gradually decrease and stabilize. If loss plateaus early (before 10k steps), you may need more data or a higher learning rate.
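"Plateaus early" can be turned into an automatic check by comparing moving averages of the logged loss. The window size and threshold below are arbitrary heuristics of ours, not LeRobot or W&B defaults:

```python
def plateaued(losses, window=100, min_rel_drop=0.01):
    # Plateau if the mean over the last `window` steps improved on the
    # previous window's mean by less than min_rel_drop (relative change).
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return (previous - recent) / max(previous, 1e-9) < min_rel_drop

decreasing = [1.0 / (1 + 0.01 * i) for i in range(400)]  # still improving
flat = [0.5] * 400                                       # stuck
```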
Step 4: Evaluate on Real Robot
Deploying the Policy
After training, you can deploy the model directly:
# Run policy on real robot
lerobot-record \
--robot.type=so100 \
--policy.path=outputs/train/smolvla/checkpoints/last/pretrained_model \
--repo_id=YOUR_USERNAME/eval_results \
--num_episodes=10
This command will:
- Load the trained model
- Run inference loop: camera -> model -> robot actions
- Record results into a dataset (for later review)
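The loop in the second bullet can be sketched as plain Python. Here get_frame, policy, and send_action are stand-ins for the camera, the SmolVLA policy, and the robot driver that lerobot-record actually wires up; only the structure (consume a chunk, predict the next one, hold the control rate) reflects the real loop.

```python
import time

def control_loop(get_frame, policy, send_action, fps=30, max_steps=300):
    # Skeleton of camera -> model -> robot actions at a fixed rate.
    pending, period = [], 1.0 / fps
    for _ in range(max_steps):
        start = time.perf_counter()
        if not pending:
            # The policy returns a whole action chunk (e.g. 10 actions).
            pending = list(policy(get_frame()))
        send_action(pending.pop(0))
        # Sleep off the remainder of the control period.
        time.sleep(max(0.0, period - (time.perf_counter() - start)))

# Dry run with stubs: a "policy" that returns 5-action chunks of zeros.
sent = []
control_loop(lambda: None, lambda obs: [[0.0]] * 5, sent.append,
             fps=1000, max_steps=20)
```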
Evaluate in Simulation
To test before deploying on a real robot:
lerobot-eval \
--policy.path=outputs/train/smolvla/checkpoints/last/pretrained_model \
--env.type=libero \
--env.task=libero_object \
--eval.num_episodes=50
Key Metrics
- Success rate: Task completion rate (target >70%)
- Completion time: Average time to complete (target <15 seconds for pick-and-place)
- Smoothness: Trajectory without jitter or sudden movements
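Smoothness can be made measurable, for example as mean squared jerk (the third finite difference of a joint trajectory). This particular metric is our illustration; lerobot-eval does not report this exact number.

```python
def mean_squared_jerk(positions, fps=30):
    # Third finite difference of a 1-D joint trajectory; lower is smoother.
    dt = 1.0 / fps
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1]
         - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)

smooth = [0.01 * i for i in range(60)]                         # constant velocity
jerky = [0.01 * i + (0.02 if i % 2 else 0) for i in range(60)]  # oscillating
```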
Advanced Features
Async Inference: 30% Faster
SmolVLA supports asynchronous inference — while the robot executes actions from the current chunk, the model runs inference for the next chunk in the background:
lerobot-train \
--policy.path=lerobot/smolvla_base \
--policy.async_inference=true \
--dataset.repo_id=YOUR_USERNAME/pick_place_cube
Result: 30% reduction in latency, 2x throughput increase. Especially useful on slower hardware (RTX 3060, Jetson Orin).
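The pipelining idea can be sketched with one background thread: while the main thread executes chunk k, a worker computes chunk k+1. Here predict_chunk and execute_chunk are toy stand-ins with fixed sleep durations; LeRobot's actual async implementation differs.

```python
import queue
import threading
import time

def run_pipelined(predict_chunk, execute_chunk, num_chunks=5):
    # Worker precomputes the next chunk while the main thread executes
    # the current one; maxsize=1 keeps it exactly one chunk ahead.
    buf = queue.Queue(maxsize=1)

    def worker():
        for k in range(num_chunks):
            buf.put(predict_chunk(k))

    threading.Thread(target=worker, daemon=True).start()
    for _ in range(num_chunks):
        execute_chunk(buf.get())

# Pretend inference and execution each take 50 ms per chunk.
predict = lambda k: time.sleep(0.05) or k
execute = lambda c: time.sleep(0.05)

t0 = time.perf_counter()
run_pipelined(predict, execute)
pipelined = time.perf_counter() - t0  # ~0.30 s vs ~0.50 s for a serial loop
```

With these numbers the serial loop costs 5 x (0.05 + 0.05) = 0.50 s, while the pipelined version hides all but the first prediction behind execution.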
Visual Token Reduction Details
By default, SmolVLA uses 64 visual tokens per frame. You can adjust:
# Fewer tokens = faster, but loses detail
lerobot-train \
--policy.path=lerobot/smolvla_base \
--policy.num_visual_tokens=32 \
--dataset.repo_id=YOUR_USERNAME/dataset
# More tokens = slower, but more detailed (for precision tasks)
lerobot-train \
--policy.path=lerobot/smolvla_base \
--policy.num_visual_tokens=128 \
--dataset.repo_id=YOUR_USERNAME/dataset
Rule of thumb: Use 64 tokens for most tasks. Increase to 128 if the task requires fine-grained spatial reasoning (e.g., inserting a peg into a hole).
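Fixed average pooling is enough to see what this knob trades away. SmolVLA's pooling is learned, so averaging contiguous groups of tokens is only a stand-in for the idea of compressing 1024 tokens down to 64:

```python
def pool_tokens(tokens, target=64):
    # Average contiguous groups: e.g. 1024 tokens -> 64 "summary" tokens.
    group = len(tokens) // target
    dim = len(tokens[0])
    return [
        [sum(tok[d] for tok in tokens[i * group:(i + 1) * group]) / group
         for d in range(dim)]
        for i in range(target)
    ]

frame = [[float(i)] for i in range(1024)]  # 1024 one-dimensional "tokens"
reduced = pool_tokens(frame, target=64)    # 64 tokens, 16 inputs each
```

Each output token summarizes 16 inputs, which is exactly why fine spatial detail (the peg-in-hole case above) can get washed out at low token counts.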
Multi-camera Support
SmolVLA supports multiple cameras:
lerobot-train \
--policy.path=lerobot/smolvla_base \
--policy.camera_names='["top","wrist"]' \
--dataset.repo_id=YOUR_USERNAME/dataset
Tip: A wrist camera significantly improves grasping accuracy. If you only have one camera, prioritize wrist camera over top-down camera.
Common Troubleshooting
CUDA Out of Memory
torch.cuda.OutOfMemoryError: CUDA out of memory
Solution: Reduce batch_size. If batch_size=8 still causes OOM, try:
lerobot-train \
--policy.path=lerobot/smolvla_base \
--training.batch_size=4 \
--training.gradient_accumulation_steps=8 \
--dataset.repo_id=YOUR_USERNAME/dataset
gradient_accumulation_steps=8 with batch_size=4 gives effective batch size = 32 without extra VRAM.
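For equal-sized micro-batches, averaging the 8 micro-batch gradients reproduces the big-batch gradient exactly, which is why accumulation trades time for VRAM without changing the optimization. A one-parameter linear model makes that easy to verify:

```python
def grad(w, xs, ys):
    # d/dw of mean squared error for the model y_hat = w * x.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro=4):
    # Average the gradients of equal-sized micro-batches.
    grads = [grad(w, xs[i:i + micro], ys[i:i + micro])
             for i in range(0, len(xs), micro)]
    return sum(grads) / len(grads)

xs = [0.1 * i for i in range(32)]          # one "batch" of 32 samples
ys = [0.3 * x + 0.05 for x in xs]
full = grad(0.7, xs, ys)                    # one batch-32 step
accum = accumulated_grad(0.7, xs, ys, 4)    # 8 micro-batches of 4
```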
Loss Not Decreasing
If loss plateaus from the start:
- Check dataset: Ensure correct format (lerobot-check-dataset --repo_id=YOUR/dataset)
- Increase learning rate: Try lr=5e-5 instead of 1e-5
- Check task description: SmolVLA uses language conditioning — task descriptions must be clear and consistent
Jerky Robot Actions
If the robot moves jerkily during inference:
- Enable action smoothing: --policy.temporal_smoothing=true
- Increase chunk_size: --policy.chunk_size=20 (default 10)
- Check FPS: inference must run at >=10 Hz
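As a rough picture of what temporal smoothing does, here is an exponential moving average over consecutive actions. The actual implementation behind the flag may differ, and alpha=0.3 is a made-up constant:

```python
def smooth_actions(actions, alpha=0.3):
    # Blend each raw action with the previous smoothed one; smaller alpha
    # tracks the raw policy output less aggressively (smoother, more lag).
    out = [list(actions[0])]
    for a in actions[1:]:
        out.append([alpha * ai + (1 - alpha) * si
                    for ai, si in zip(a, out[-1])])
    return out

raw = [[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]]  # jittery 1-DoF commands
smoothed = smooth_actions(raw)                     # step-to-step jumps shrink
```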
Conclusion
SmolVLA is a breakthrough in democratizing robot AI. For the first time, you can train a VLA model that works in the real world without needing data center GPUs. The complete pipeline — from data collection, fine-tuning, to deployment — all runs on hardware that any student or hobbyist can own.
However, remember: data quality is the decisive factor. Spending time collecting diverse, consistent, and sufficient data will yield much better results than focusing solely on hyperparameter tuning.
For a deeper theoretical foundation, read our articles on Pi0: Architecture Overview and Diffusion Policy deep dive. When you're ready for a more powerful model, the next post covers Pi0-FAST — a 5x faster autoregressive VLA.
Related Posts
- LeRobot v0.5: What's New — Complete overview of LeRobot's biggest update
- VLA Models: From Theory to Practice — VLA theoretical foundations before fine-tuning
- Teleop and Data Collection — Detailed guide to collecting high-quality datasets