
SimpleVLA-RL (8): Training & Deploying on OpenArm

Train SmolVLA, ACT, Pi0-FAST for OpenArm box grasping — from fine-tuning to real robot deployment and improvement with HIL-SERL.

Nguyễn Anh Tuấn · April 11, 2026 · 13 min read

Training and Deploying on OpenArm: From 50 Episodes to Autonomous Box Grasping

In the previous post, you collected 50 box-grasping episodes on OpenArm — that was the fuel. This post is the engine: we will train 3 different policies, compare results, deploy on the real robot, and improve performance with Reinforcement Learning. This is the most comprehensive post in the series — from running the training command to having the robot grasp boxes autonomously without human intervention.

We will cover 3 training options from simple to complex: ACT (fastest, no pretrained model needed), SmolVLA (balancing quality and speed), and Pi0-FAST (most powerful but heaviest). You do not need to run all 3 — read the comparison at the end to choose the right approach.

Option 1: Train ACT — Fastest and Simplest

ACT (Action Chunking with Transformers) is a policy architecture designed specifically for robot manipulation. It requires no pretrained model, no language instruction — just teleoperation data and a mid-range GPU.
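The core trick in ACT is action chunking with temporal ensembling: each policy call predicts a short chunk of future actions, and at execution time all the overlapping predictions for the current step are averaged with exponential weights. A minimal sketch of that ensembling logic (the chunk size, weighting coefficient, and stand-in policy are all hypothetical):

```python
from collections import defaultdict
import math

CHUNK = 4   # actions predicted per policy call (hypothetical size)
M = 0.1     # exponential weighting coefficient (hypothetical)

def fake_policy(t):
    """Stand-in for ACT: predicts actions for steps t .. t+CHUNK-1."""
    return [float(t + i) for i in range(CHUNK)]

def temporal_ensemble(horizon):
    predictions = defaultdict(list)  # step -> every prediction made for it
    executed = []
    for t in range(horizon):
        # one policy call per step; its chunk covers several future steps
        for i, a in enumerate(fake_policy(t)):
            predictions[t + i].append(a)
        # average all available predictions for the current step,
        # weighting older predictions more: w_j = exp(-M * j), j = 0 oldest
        preds = predictions[t]
        weights = [math.exp(-M * j) for j in range(len(preds))]
        executed.append(sum(w * a for w, a in zip(weights, preds)) / sum(weights))
    return executed

print(temporal_ensemble(6))
```

Because every overlapping chunk contributes to each executed action, single bad predictions get smoothed out, which is part of why ACT is stable despite only 50 demonstrations.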

Why Start with ACT?

ACT is the perfect "first experiment" choice because:

  - It trains from scratch on your 50 episodes, with no pretrained weights to download
  - It is the fastest of the three options to train (1-2 hours on a single RTX 3090)
  - Inference runs at ~50 Hz, so no latency tricks are needed on the robot
  - Setup is minimal: no language instruction, no tokenizer configuration

Running Training

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=username/openarm-box-grasp \
  --steps=50000 \
  --batch_size=32

Breaking down each parameter:

  - --policy.type=act: select the ACT architecture, trained from scratch (no pretrained checkpoint)
  - --dataset.repo_id=username/openarm-box-grasp: the dataset you pushed to the Hub in the previous post
  - --steps=50000: ACT needs more optimizer steps than the fine-tuning options because it starts from random weights
  - --batch_size=32: fits in the 12-16 GB of VRAM of a mid-range GPU

Monitoring Training

LeRobot automatically logs metrics to Weights & Biases (if installed). Key metrics to watch:

  - loss: should decrease steadily and then flatten; if it plateaus almost immediately, revisit the learning rate
  - gradient norm: occasional spikes are normal, but a sustained climb signals instability

After 50K steps, the model is saved at outputs/act/checkpoints/last/pretrained_model/.

Training metrics dashboard — monitoring loss and success rate

When to Use ACT

Use ACT when you want a quick, single-task baseline on modest hardware: it trains in 1-2 hours on an RTX 3090, runs inference at ~50 Hz, and needs no language conditioning. If the ACT baseline grasps reliably, your data and calibration are good, and upgrading to SmolVLA is worth the extra training time.

Option 2: Fine-tune SmolVLA — Balancing Quality and Speed

SmolVLA is HuggingFace's 450M parameter VLA model, designed to run on consumer hardware. The biggest difference from ACT: SmolVLA has been pretrained on community data from multiple robot types — it already carries built-in "manipulation experience."

Why SmolVLA Is the Recommended Choice

As analyzed in the SmolVLA training post:

  - Pretraining on community datasets from many robot types means your 50 episodes go much further than training from scratch
  - At 450M parameters, it fine-tunes on a single consumer GPU
  - It accepts language instructions, which opens the door to multi-task use later

Running Fine-tuning

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=username/openarm-box-grasp \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/openarm_smolvla \
  --policy.device=cuda

Analysis:

  - --policy.path=lerobot/smolvla_base: start from the pretrained SmolVLA checkpoint instead of random weights
  - --steps=20000: fine-tuning converges in far fewer steps than ACT's 50K from-scratch run
  - --batch_size=64: sized for a large GPU; halve it if you hit out-of-memory errors
  - --policy.device=cuda: run training on the GPU

Estimated Training Time

GPU | Batch Size | Time (20K steps)
A100 (80GB) | 64 | ~4 hours
RTX 4090 (24GB) | 32 | ~8 hours
RTX 3090 (24GB) | 16 | ~12 hours
RTX 3060 (12GB) | 8 | ~20 hours (not recommended)
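The table's figures follow from simple arithmetic: total optimizer steps times seconds per step. A quick sketch, with a hypothetical throughput chosen to match the A100 row:

```python
def estimated_hours(steps, sec_per_step):
    """Wall-clock training time: optimizer steps times seconds per step."""
    return steps * sec_per_step / 3600

# Hypothetical throughput: ~0.72 s/step on an A100 at batch size 64
# reproduces the ~4 hour figure for 20K steps in the table above.
print(round(estimated_hours(20_000, 0.72), 1))  # → 4.0
```

Time a few hundred steps on your own GPU, plug the measured seconds-per-step in, and you get a realistic estimate before committing to a full run.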

SmolVLA Fine-tuning Tips

Learning rate: Use a lower learning rate compared to training from scratch. LeRobot's default for fine-tuning is typically 1e-5 — if the model has not converged, try 3e-5. If loss oscillates heavily, reduce to 5e-6.

Frozen backbone: If GPU is limited, you can freeze the vision encoder and only train the action head:

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=username/openarm-box-grasp \
  --policy.freeze_vision_encoder=true \
  --steps=10000

This is 3-4x faster but performance drops by approximately 5-10%.

Option 3: Fine-tune Pi0-FAST — Most Powerful, Most Demanding

Pi0-FAST (Physical Intelligence's π0 with FAST, the Frequency-space Action Sequence Tokenization) is a state-of-the-art VLA model. It combines a powerful vision-language backbone with the FAST tokenizer, which converts continuous action chunks into discrete tokens so that actions can be predicted autoregressively, the way a language model predicts text.

When Do You Need Pi0-FAST?

Reach for Pi0-FAST when you need the highest ceiling: multi-task operation driven by language instructions, the best expected success rate from the same 50 episodes, and A100-class hardware to pay for it. For a single box-grasping task on consumer GPUs, SmolVLA delivers most of the benefit at a fraction of the cost.

Running Fine-tuning

lerobot-train \
  --policy.type=pi0_fast \
  --policy.pretrained_path=lerobot/pi0_fast_base \
  --dataset.repo_id=username/openarm-box-grasp \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --steps=50000

Special parameters:

  - --policy.dtype=bfloat16: trains in 16-bit precision, roughly halving VRAM usage
  - --policy.gradient_checkpointing=true: trades extra compute for memory by recomputing activations during the backward pass

FAST Tokenizer for OpenArm

Pi0-FAST needs to know the robot's specific action space to build its tokenizer. For OpenArm 6-DOF, verify that the tokenizer config is appropriate:

# Check action space
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("username/openarm-box-grasp")
print(f"Action dim: {ds[0]['action'].shape}")            # Should be (6,) for 6-DOF
print(f"State dim: {ds[0]['observation.state'].shape}")  # Should be (6,); LeRobot stores proprioception under 'observation.state'

The FAST tokenizer automatically discretizes continuous actions into tokens based on the action range in the dataset. This is transparent to the user — no manual configuration needed.
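To see what "discretizes continuous actions into tokens based on the action range" means, here is a deliberately simplified per-dimension binning sketch. The real FAST tokenizer is more sophisticated (it compresses whole action chunks in frequency space before tokenizing), and the joint ranges below are hypothetical:

```python
def discretize(action, lo, hi, n_bins=256):
    """Map each continuous action dimension to one of n_bins integer tokens.
    Plain per-dimension binning, for illustration only: the real FAST
    tokenizer compresses whole action chunks before tokenizing them."""
    tokens = []
    for a, l, h in zip(action, lo, hi):
        frac = (min(max(a, l), h) - l) / (h - l)  # clamp, then scale to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

# Hypothetical 6-DOF joint ranges of ±4 rad
lo, hi = [-4.0] * 6, [4.0] * 6
print(discretize([0.0, 2.0, -2.0, 4.0, -4.0, 1.0], lo, hi))
# → [128, 192, 64, 255, 0, 160]
```

The key idea carries over: the bin edges come from the action ranges observed in your dataset, which is why the tokenizer needs no manual configuration.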

Warning: Pi0-FAST training is very VRAM-intensive. On RTX 4090 with gradient checkpointing + bfloat16, maximum batch_size is approximately 8-16. If OOM, reduce batch_size or switch to SmolVLA.

Deploying on the Real Robot — The Moment of Truth

This is the most exciting step — watching the model you just trained autonomously control the robot to grasp carton boxes without you holding the leader arm.

Running Policy Evaluation

LeRobot uses the same lerobot-record script but adds the --policy.path flag to run in autonomous mode:

lerobot-record \
  --robot.type=openarm_follower \
  --robot.port=can0 \
  --robot.side=right \
  --robot.id=my_follower \
  --robot.cameras="{ top: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grasp the carton box and lift it" \
  --dataset.repo_id=username/openarm-box-eval \
  --dataset.num_episodes=10 \
  --policy.path=outputs/openarm_smolvla/checkpoints/last/pretrained_model

What happens now:

  - The script loads the checkpoint and runs policy inference instead of mirroring the leader arm
  - Each control cycle, the camera frames and joint states are fed to the policy, which returns the next action
  - The follower arm executes those actions at the recording fps while episodes are logged like any other dataset
  - You mark each episode as a success or a failure, producing labeled evaluation data (HIL-SERL reuses these labels later)
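Under the hood, autonomous mode boils down to an observe-predict-act loop paced at the recording fps. A toy sketch with stub robot and policy classes (the class and method names are hypothetical, not LeRobot's actual interfaces):

```python
import time

FPS = 30  # control rate; should match the recording fps

class StubRobot:
    """Stand-in for the follower arm (hypothetical interface)."""
    def __init__(self):
        self.sent = []
    def observe(self):
        return {"state": [0.0] * 6}             # joint positions (toy values)
    def send_action(self, action):
        self.sent.append(action)

class StubPolicy:
    """Stand-in for a trained checkpoint."""
    def select_action(self, obs):
        return [s + 0.1 for s in obs["state"]]  # pretend to nudge each joint

def run_episode(robot, policy, max_steps, fps=FPS):
    """Observe, predict, act, paced to the control period."""
    period = 1.0 / fps
    for _ in range(max_steps):
        start = time.perf_counter()
        robot.send_action(policy.select_action(robot.observe()))
        # sleep off whatever remains of this control cycle
        time.sleep(max(0.0, period - (time.perf_counter() - start)))

robot = StubRobot()
run_episode(robot, StubPolicy(), max_steps=3)
print(len(robot.sent))  # → 3
```

The fixed control period is why inference speed matters: a policy that takes longer than 1/fps seconds per prediction cannot keep up with the loop.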

Evaluating Results

Run 10 evaluation episodes and log results:

Episode | Result | Notes
1 | Success | Accurate grasp, stable lift
2 | Success | Slow approach but successful
3 | Fail | Gripper opened too early, dropped box
4 | Success | -
... | ... | ...

Expected success rates (with 50 episodes training data):

Policy | Success Rate | Notes
ACT (from scratch) | 60-70% | Learning from only 50 episodes, no priors
SmolVLA (fine-tuned) | 75-85% | Pretrained manipulation knowledge helps
Pi0-FAST (fine-tuned) | 80-90% | Most powerful but needs more compute

If success rate is below 50%, there is likely a problem with the data or calibration. Go back and check the data collection post.

Robot arm autonomously performing manipulation task

Improving with HIL-SERL — RL Directly on the Real Robot

If your policy reaches 70-80% but you want to push to 90%+, HIL-SERL (Human-in-the-Loop Sample Efficient RL) is the most effective path. Instead of collecting more demonstrations (time-consuming), you let the robot self-improve through RL with human assistance.

Step 1: Train Reward Classifier

The reward classifier is a small neural network that predicts "did this task succeed or fail?" from camera images. It is trained from the evaluation data you just collected:

# Pseudo-code: train reward classifier
# Use 10 eval episodes already labeled (success/fail)
# Input: final camera frame of the episode
# Output: probability of success (0.0 - 1.0)

HIL-SERL uses this reward classifier instead of binary reward from a simulator — because we are training on a real robot, there is no simulator to query "is the task complete?"
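To make the idea concrete, here is a minimal classifier sketch in pure Python: a tiny logistic regression over a hand-made scalar feature, standing in for the small vision network used in practice (the feature and data are invented for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_classifier(features, labels, lr=0.5, epochs=200):
    """Tiny logistic regression trained with per-sample gradient descent.
    In practice the classifier is a small vision network over the final
    camera frame; the scalar feature here is invented for illustration."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                                    # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy feature: how high the box ended up above the table (meters)
feats = [[0.00], [0.02], [0.15], [0.20], [0.01], [0.18]]
labels = [0, 0, 1, 1, 0, 1]  # success = box was actually lifted
clf = train_reward_classifier(feats, labels)
print(clf([0.19]) > 0.5, clf([0.00]) < 0.5)
```

The output of the trained function is exactly what the RL loop needs: a success probability that can be thresholded into a binary reward.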

Step 2: Actor-Learner SAC Loop

SAC (Soft Actor-Critic) is the most suitable RL algorithm for real robots because:

  - It is off-policy: every transition goes into a replay buffer and is reused across many updates, which matters when each real-world episode costs minutes
  - Entropy regularization keeps exploration smooth rather than wildly random, which is safer on physical hardware
  - It natively handles continuous action spaces like joint commands

The HIL-SERL process:

  1. Robot performs the task (actor)
  2. Reward classifier evaluates the outcome
  3. SAC updates the policy (learner)
  4. Human intervenes when the robot is about to collide or go in the wrong direction
  5. Repeat for 100-200 episodes
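The five steps above can be sketched as a single loop. Everything here is a stub (random stand-ins for the actor, the classifier verdict, and human takeovers); the real system runs the actor and the SAC learner as separate processes sharing a replay buffer:

```python
import random

def hil_serl_loop(episodes=5, steps_per_episode=10, seed=0):
    """Toy version of the five-step HIL-SERL loop. The actor, the reward
    classifier, and the human are all random stand-ins."""
    rng = random.Random(seed)
    replay = []
    for _ in range(episodes):
        for _ in range(steps_per_episode):
            state = [rng.uniform(-1, 1) for _ in range(6)]
            action = [rng.uniform(-0.1, 0.1) for _ in range(6)]  # actor proposal
            intervened = rng.random() < 0.1       # ~10% chance of human takeover
            if intervened:
                action = [0.0] * 6                # human-corrected action
            reward = 1.0 if rng.random() < 0.7 else 0.0  # classifier verdict
            replay.append((state, action, reward, intervened))
        # learner step: sample a batch (the SAC update itself is omitted)
        batch = rng.sample(replay, min(32, len(replay)))
        assert batch  # placeholder for the actual gradient update
    return replay

buffer = hil_serl_loop()
print(len(buffer), sum(1 for t in buffer if t[3]))  # transitions, interventions
```

Note how intervened transitions are stored with a flag: keeping them separate is what lets the learner weight human corrections more heavily.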

Step 3: Human Interventions

This is the "Human-in-the-Loop" part — you sit next to the robot with a gamepad or keyboard:

  - Take over the moment the arm heads toward a collision or leaves the workspace
  - Nudge the gripper back on course when it drifts away from the box
  - Hand control back as soon as the robot recovers; takeovers are flagged as interventions in the data

Each intervention becomes a high-value data point — it tells the model exactly "in this state, the current behavior is wrong, here is the correct behavior."

Detailed analysis of HIL-SERL is available in the dedicated post. Read that for the full picture of actor-learner architecture, reward classifier training, and safety guidelines.

Improving with SimpleVLA-RL Style (If Simulation Is Available)

If you are an advanced user and want to leverage the SimpleVLA-RL approach — training RL entirely in simulation then transferring to the real robot — OpenArm has this path available too.

OpenArm in Isaac Lab

OpenArm has a repository supporting NVIDIA Isaac Lab (openarm_isaac_lab). This enables:

  1. SFT in sim: Use 50 real episodes to train a baseline policy, then generate additional data in simulation
  2. RL in sim: Apply GRPO/PPO to improve the policy using simulator rewards
  3. Sim-to-real: Transfer the policy to the real OpenArm

Complete pipeline: SFT (real data) → RL (sim data) → Deploy (real robot)

This is the most powerful but most complex path. You need:

  - An RTX-class NVIDIA GPU capable of running Isaac Lab, in addition to your training GPU
  - A simulated scene that matches your real table, box, and camera placement
  - Time to tune domain randomization so the policy survives the sim-to-real gap

Advice: If you are just starting, do not go the simulation route first. Start with ACT/SmolVLA, deploy, then HIL-SERL. The sim-to-real pipeline should only be attempted after you have mastered the basic pipeline.

Comprehensive Comparison: ACT vs SmolVLA vs Pi0-FAST

Here is the summary table to help you choose the right policy for your situation:

Criterion | ACT | SmolVLA | Pi0-FAST
Training time | 1-2 hours | 4-12 hours | 8-24 hours
Minimum GPU | 1x RTX 3090 | 1x RTX 4090 | 1x A100
VRAM required | 12-16 GB | 20-24 GB | 40-80 GB
Language instruction | No | Yes | Yes
Pre-training | No | Yes (community data) | Yes
Expected success (50 eps) | 60-70% | 75-85% | 80-90%
Multi-task | No (1 task/model) | Yes | Yes
Inference speed | Fast (~50 Hz) | Medium (~15 Hz) | Slow (~5 Hz)
Setup complexity | Low | Medium | High
When to use | First experiment | Recommended for production | Push state-of-the-art

How to Read This Table

Start from your hardware: with only an RTX 3090, ACT is the realistic choice; with an RTX 4090, SmolVLA is the sweet spot; Pi0-FAST only makes sense with A100-class VRAM. Then weigh the capability rows: language instructions and multi-task support are what justify a VLA over ACT, while ACT's ~50 Hz inference is hard to beat if your task needs fast, reactive control.

Complete Iteration Workflow

Here is the recommended process, from start to a stably operating robot:

Phase 1: Quick Baseline (Day 1)

  1. Collect 50 episodes (previous post)
  2. Train ACT — 1-2 hours
  3. Deploy and evaluate success rate
  4. If above 50%, the pipeline works and the data is good

Phase 2: Upgrade Policy (Days 2-3)

  1. If ACT baseline is good, train SmolVLA fine-tune — 4-12 hours
  2. Deploy SmolVLA and compare with ACT
  3. If SmolVLA exceeds ACT by 10%+, use SmolVLA as the primary policy

Phase 3: Collect More Data (Days 4-5)

  1. If needed, collect 50-100 more episodes
  2. Diversify: more box sizes, positions, different lighting conditions
  3. Retrain SmolVLA with the larger dataset

Phase 4: RL Improvement (Days 6-7)

  1. If you want to push above 85%, use HIL-SERL
  2. Run 100-200 RL episodes with human intervention
  3. Re-evaluate success rate

Phase 5: Advanced (Week 2+)

  1. If language control is needed, try Pi0-FAST
  2. If you want sim-to-real, set up the Isaac Lab environment
  3. Scale to multi-task: add "stack boxes," "sort by size"...

Summary: Complete Pipeline from Unboxing to Autonomous Grasping

Across these 2 posts (parts 7 and 8), we have covered the entire pipeline:

  1. Hardware setup: CAN bus, camera, calibration
  2. Data collection: 50 teleoperation episodes with LeRobot
  3. Training: ACT (baseline) → SmolVLA (recommended) → Pi0-FAST (advanced)
  4. Deployment: Running the policy on the real robot, evaluating success rate
  5. RL improvement: HIL-SERL for an additional 10-15% improvement

This pipeline is not limited to box grasping. You can use the same workflow for any manipulation task: stacking objects, pouring water, assembly... The only changes are the task description and training data.

This is the power of the end-to-end learning approach: you do not need to write complex control code for each task — just demonstrate to the robot, train, and deploy. And with OpenArm plus LeRobot, this pipeline is accessible to anyone with $3,500 and a GPU.

If you are new to this series, read SimpleVLA-RL (1): Overview to understand the big picture. And if you want a deeper understanding of the RL training process for VLA, that post explains the GRPO algorithm in detail and why it works so effectively.


Related Posts

SimpleVLA-RL (10): SFT & RL Training for OpenArm (16 min read)
A detailed guide to SFT fine-tuning and RL training with SimpleVLA-RL for OpenArm, from environment configuration to running GRPO.

SimpleVLA-RL (11): Sim-to-Real for OpenArm (17 min read)
Deploying a SimpleVLA-RL model from simulation to the real OpenArm: camera setup, action mapping, and tips for reducing the sim-to-real gap.

SimpleVLA-RL (6): OpenArm — Roadmap Analysis (13 min read)
A detailed analysis of the approach to training the 7-DoF OpenArm to grasp carton boxes, comparing two paths: LeRobot native vs SimpleVLA-RL.