SFT & RL Training for OpenArm with SimpleVLA-RL
In the previous post, we collected hundreds of demonstration episodes in Isaac Lab for OpenArm box grasping. The simulation data is ready — now it is time to turn it into a smart policy that can control a real robot. This article walks you through the entire SimpleVLA-RL training pipeline end-to-end: installing the stack, adapting the action space for OpenArm, running SFT (Supervised Fine-Tuning) to create a baseline, then boosting performance with RL (Reinforcement Learning) using the GRPO algorithm.
The biggest difference from Part 8 — where we used LeRobot — is that here we exclusively use the SimpleVLA-RL pipeline built on veRL and OpenVLA-OFT. No LeRobot. No ACT or SmolVLA. Just a single VLA model fine-tuned with SFT then upgraded with RL.
Why SimpleVLA-RL Instead of LeRobot?
Before diving into the technical details, let us understand why we chose this path. LeRobot is an excellent framework for Imitation Learning — collect demos, train a policy, deploy. But it stops there. The policy is only as good as the data you provide.
SimpleVLA-RL goes one step further: after SFT creates a baseline from demonstrations, the GRPO algorithm allows the robot to discover new strategies on its own through trial-and-error in simulation. Results from the original paper on the Piper robot show that RL can increase success rate from 17.5% to 38.5% — a 120% improvement over SFT alone.
Step 1: Install the SimpleVLA-RL Stack
The full stack has 4 core components: Python environment, veRL (RL framework), OpenVLA-OFT (VLA model), and SimpleVLA-RL (glue code connecting everything). Installation order matters due to the dependency chain.
Create Conda Environment
# Create a separate environment — DO NOT share with Isaac Lab env
conda create -n simplevla python=3.10 -y
conda activate simplevla
# PyTorch 2.4 with CUDA 12.4 — most thoroughly tested version
pip3 install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
Why Python 3.10 and not 3.11 or 3.12? Because veRL and flash-attention have compiled extensions built and tested primarily on 3.10. You can try 3.11 but risk compatibility issues that are not worth the trouble.
Install veRL — RL Training Framework
# Clone veRL v0.2 — stable version supported by SimpleVLA-RL
git clone -b v0.2.x https://github.com/volcengine/verl.git
cd verl
pip install -e .
cd ..
veRL (Volcano Engine Reinforcement Learning) is ByteDance's framework designed for RL training on LLM-scale models. It handles distributed training, rollout collection, and policy optimization — things you do not want to implement yourself.
Install OpenVLA-OFT — VLA Model
# Clone OpenVLA-OFT — VLA model with Optimized Fine-Tuning support
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip install -e .
cd ..
# Flash Attention — 2-3x inference speedup
pip install flash-attn --no-build-isolation
OpenVLA-OFT is an extended version of OpenVLA with better fine-tuning support (OFT stands for Optimized Fine-Tuning). It uses a Vision-Language-Action architecture with a Prismatic VLM backbone, discretizing each action degree of freedom into one of 256 bins, each represented by a dedicated token in the model's vocabulary.
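To build intuition for what that tokenization means in practice, here is a minimal sketch of uniform binning over a normalized [-1, 1] range. The function names are mine; the real tokenizer derives its normalization from dataset statistics, so treat this as illustrative only:

```python
import numpy as np

def tokenize_action(action, num_bins=256, low=-1.0, high=1.0):
    """Map each continuous action dim in [low, high] to a discrete bin index."""
    action = np.clip(action, low, high)
    # Scale to [0, 1], then to a bin index in [0, num_bins - 1]
    frac = (action - low) / (high - low)
    return np.minimum((frac * num_bins).astype(int), num_bins - 1)

def detokenize_action(tokens, num_bins=256, low=-1.0, high=1.0):
    """Map bin indices back to bin-center continuous values."""
    return low + (tokens + 0.5) / num_bins * (high - low)

a = np.array([-1.0, 0.0, 0.5, 1.0])
t = tokenize_action(a)
print(t)                      # bins 0, 128, 192, 255
print(detokenize_action(t))   # within half a bin width of the input
```

Note the resolution cost of discretization: with 256 bins over [-1, 1], the round-trip error is at most half a bin width, about 0.004 in normalized units.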
Clone SimpleVLA-RL
git clone https://github.com/simple-vla-rl/SimpleVLA-RL.git
cd SimpleVLA-RL
pip install -e .
After this step, verify the entire stack:
python -c "
import torch
import verl
import prismatic
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print('veRL imported successfully')
print('Prismatic (OpenVLA) imported successfully')
"
If everything works, you should see PyTorch 2.4.0, CUDA available, and your GPU count. If torch.cuda.device_count() returns 0, check your CUDA driver version with nvidia-smi.
Step 2: Adapt the Action Space for OpenArm
This is the most important step and also the easiest to get wrong. OpenVLA-OFT is designed by default for 7-DoF (6 joints + 1 gripper, like the Piper robot). OpenArm has 7 joints + 1 gripper = 8-DoF. We need to decide how to handle this difference.
Option A: Direct 8-DoF Mapping (Recommended)
OpenArm's 8-DoF (7 joints + 1 gripper) maps one-to-one onto an 8-dimensional action vector. If the model config accepts action_dim=8, we just need to get the joint order right:
# In config file or directly in code
ACTION_DIM = 8 # 7 joints + 1 gripper
STATE_DIM = 8 # Joint positions feedback
NUM_ACTION_TOKENS = 256 # Each DoF tokenized into 256 discrete values
# OpenArm joints -> action vector mapping
# [joint1, joint2, joint3, joint4, joint5, joint6, joint7, gripper]
# Each joint value normalized to [-1, 1] before tokenization
JOINT_LIMITS = {
    'joint1': (-3.14, 3.14),  # Base rotation
    'joint2': (-1.57, 1.57),  # Shoulder
    'joint3': (-1.57, 1.57),  # Elbow
    'joint4': (-3.14, 3.14),  # Wrist 1
    'joint5': (-1.57, 1.57),  # Wrist 2
    'joint6': (-3.14, 3.14),  # Wrist 3
    'joint7': (-1.57, 1.57),  # Wrist 4
    'gripper': (0.0, 1.0),    # Gripper open/close
}
Option B: Keep 7-DoF and Pad
If you want to keep the original 7-DoF architecture of OpenVLA-OFT without modifying the model:
# Merge the last 2 wrist joints into 1
# OpenArm 8-DoF -> 7-DoF mapping
def openarm_to_7dof(joint_positions_8):
    """Map 8-DoF OpenArm to 7-DoF OpenVLA format"""
    return [
        joint_positions_8[0],  # base
        joint_positions_8[1],  # shoulder
        joint_positions_8[2],  # elbow
        joint_positions_8[3],  # wrist1
        joint_positions_8[4],  # wrist2
        # Average wrist3 + wrist4 into single value
        (joint_positions_8[5] + joint_positions_8[6]) / 2,
        joint_positions_8[7],  # gripper
    ]
I recommend Option A because it preserves all kinematic information. Merging joints loses precision, especially for tasks requiring high wrist dexterity.
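To quantify what Option B gives up, round-trip a pose through the merge. The inverse mapping below is hypothetical (it naively copies the merged value back into both wrist joints), but it makes the information loss concrete:

```python
def openarm_to_7dof(q8):
    """Collapse wrist3/wrist4 (indices 5, 6) into their average."""
    return q8[:5] + [(q8[5] + q8[6]) / 2] + [q8[7]]

def sevendof_to_openarm(q7):
    """Naive inverse: copy the merged wrist value into both joints."""
    return q7[:5] + [q7[5], q7[5]] + [q7[6]]

q8 = [0.1, -0.4, 0.8, 0.0, 0.3, 0.6, -0.2, 1.0]  # wrist3=0.6, wrist4=-0.2
q7 = openarm_to_7dof(q8)
q8_back = sevendof_to_openarm(q7)
errors = [abs(a - b) for a, b in zip(q8, q8_back)]
print(max(errors))  # ~0.4 rad of wrist error, unrecoverable after merging
```

Whenever wrist3 and wrist4 differ, that difference is gone for good, which is exactly the dexterity loss mentioned above.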
Modify Action Tokenizer Config
In OpenVLA-OFT, open the config file and change:
# openvla-oft/prismatic/models/backbones/llm/action_tokenizer.py
# Or in config JSON depending on version
tokenizer_config = {
    "action_dim": 8,           # Change from 7 -> 8 for OpenArm
    "state_dim": 8,            # Feedback dimension
    "num_action_tokens": 256,  # Keep unchanged
    "action_chunk_size": 1,    # Predict 1 action per step
}
Step 3: Register the OpenArm Environment
SimpleVLA-RL needs to know how to interact with your environment. Three files need modification: rob_dataset.py (data), rob_rollout.py (rollout loop), and the reward function.
Add Dataset to rob_dataset.py
# SimpleVLA-RL/verl/workers/rob_dataset.py
# Add to DATASETS dictionary
DATASETS = {
    # ... existing datasets (LIBERO, Piper, etc.)
    "openarm_box_grasp": {
        "data_path": "/path/to/openarm_sim_demos/",
        "task_description": "Pick up the carton box from the table",
        "action_dim": 8,
        "state_dim": 8,
        "camera_names": ["front_camera"],  # Camera name in Isaac Lab
        "image_size": (224, 224),          # Resize for VLA input
        "max_episode_length": 400,         # Must match sim config
    },
}
Add Environment to rob_rollout.py
# SimpleVLA-RL/verl/workers/rob_rollout.py
def create_environment(env_name, env_config):
    """Create environment for rollout"""
    if env_name == "openarm_box_grasp":
        # Import Isaac Lab environment
        from omni.isaac.lab.envs import ManagerBasedRLEnv
        import omni.isaac.lab_tasks  # Register environments
        env = ManagerBasedRLEnv(
            cfg=env_config,
            render_mode="rgb_array",
        )
        return OpenArmWrapper(env)
    # ... other environments

# Add max_steps for OpenArm
MAX_STEPS = {
    # ... existing entries
    "openarm_box_grasp": 400,  # 400 steps x 0.02s = 8 seconds per episode
}
Implement get_info() for Success Detection
This is the most critical part — it determines the reward signal for RL:
class OpenArmWrapper:
    """Wrapper connecting Isaac Lab env to SimpleVLA-RL rollout"""

    def __init__(self, isaac_env):
        self.env = isaac_env
        self.step_count = 0

    def get_info(self):
        """
        Return dict containing success/failure info.
        SimpleVLA-RL uses binary reward: 1 if success, 0 if failure.
        """
        # Check: has the box been lifted above the table?
        # get_world_poses() returns (positions, orientations);
        # positions has shape (num_envs, 3), so take env 0
        positions, _ = self.env.scene["box"].get_world_poses()
        box_pos = positions[0]
        table_height = 0.75  # Table height in sim
        # Gripper holding box + box above table by 10cm
        gripper_holding = self.env.scene["gripper"].is_closed()
        box_lifted = box_pos[2] > table_height + 0.10
        success = bool(gripper_holding) and bool(box_lifted)
        return {
            "success": success,
            "reward": 1.0 if success else 0.0,  # Binary reward
            "box_height": float(box_pos[2]),
            "step": self.step_count,
        }

    def step(self, action):
        """Execute action and return observation"""
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.step_count += 1
        # Override reward with binary success signal
        custom_info = self.get_info()
        return obs, custom_info["reward"], terminated, truncated, custom_info

    def reset(self):
        self.step_count = 0
        return self.env.reset()
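To see how a wrapper with this interface slots into the rollout loop, here is a stripped-down sketch against a stub environment. StubEnv, its succeed-on-step-5 rule, and the rollout helper are all illustrative stand-ins, not SimpleVLA-RL code:

```python
class StubEnv:
    """Stand-in for the wrapped Isaac Lab env: succeeds on step 5."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return {"image": None}

    def step(self, action):
        self.t += 1
        success = self.t >= 5
        info = {"success": success,
                "reward": 1.0 if success else 0.0,
                "step": self.t}
        return {"image": None}, info["reward"], success, False, info

def rollout(env, policy, max_steps=400):
    """Collect one episode and return its binary return (0 or 1)."""
    obs = env.reset()
    for _ in range(max_steps):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        if terminated or truncated:
            break
    return info["reward"]

print(rollout(StubEnv(), policy=lambda obs: [0.0] * 8))  # 1.0
```

The key point: the episode return is just the final binary reward, which is exactly the signal GRPO will compare across samples in Step 5.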
Step 4: SFT Training — Creating the Baseline
SFT (Supervised Fine-Tuning) is the "cold-start" step — teaching the VLA model to imitate your collected demonstrations. It is similar to Imitation Learning but performed on top of a Vision-Language Model that has been pretrained on millions of images and text.
Download OpenVLA-OFT Base Model
# Download from HuggingFace — approximately 15GB
huggingface-cli download moojink/openvla-7b-oft \
--local-dir ./checkpoints/openvla-oft-base
This base model has been pretrained on the Open X-Embodiment dataset — a collection of data from dozens of different robot types. It "understands" general concepts of manipulation but knows nothing specific about OpenArm.
Prepare SFT Config
# Create config file for SFT training
cat > configs/openarm_sft.yaml << 'EOF'
# === Model ===
model:
  base_model_path: "./checkpoints/openvla-oft-base"
  action_dim: 8
  action_tokens: 256
  use_flash_attn: true

# === Data ===
data:
  dataset_name: "openarm_box_grasp"
  data_path: "/path/to/openarm_sim_demos/"
  image_size: 224
  batch_size: 32
  num_workers: 4

# === Training ===
training:
  learning_rate: 2e-5       # LR for SFT — higher than RL
  warmup_steps: 100
  max_steps: 5000           # Total optimizer steps; scale with dataset size
  gradient_accumulation: 2
  fp16: true                # Mixed precision

# === Logging ===
logging:
  wandb_project: "openarm-simplevla"
  wandb_run_name: "sft-openarm-v1"
  log_interval: 50
  save_interval: 1000

# === Hardware ===
hardware:
  num_gpus: 4               # Minimum 4x A100/A800
  per_gpu_batch_size: 8
EOF
Run SFT Training
# Activate environment
conda activate simplevla
# Set WANDB key for monitoring
export WANDB_API_KEY="your_wandb_key_here"
# Run SFT training
python -m verl.trainer.sft \
--config configs/openarm_sft.yaml \
--output_dir ./checkpoints/openarm-sft-v1
# Or use torchrun for multi-GPU
torchrun --nproc_per_node=4 \
-m verl.trainer.sft \
--config configs/openarm_sft.yaml \
--output_dir ./checkpoints/openarm-sft-v1
Training time: With 500 episodes and 4x A800, SFT takes about 2-4 hours. On 8x A800, about 1-2 hours. You will see loss drop rapidly in the first 1000 steps then plateau.
Evaluate SFT Baseline
After SFT completes, run evaluation in sim to establish a baseline:
python -m verl.trainer.evaluate \
--model_path ./checkpoints/openarm-sft-v1/final \
--env_name openarm_box_grasp \
--num_episodes 100 \
--render_video true \
--output_dir ./eval_results/sft_baseline
Expected outcome: SFT baseline yields a success rate of roughly 15-25% on the box grasping task. This sounds low but is perfectly normal — the SFT policy imitates demos without understanding "why" each action works. It lacks the ability to generalize when box position changes or when there is observation noise.
Step 5: RL Training — The GRPO Magic
This is the step that makes the biggest difference in SimpleVLA-RL. GRPO (Group Relative Policy Optimization) is a variant of PPO designed for LLM/VLA — it does not need a separate critic network, instead using group-relative advantages from multiple samples of the same query.
How GRPO Works (Intuition)
Imagine teaching a beginner to bowl. SFT is like showing them tutorial videos: they pick up the basics but are not skilled yet. GRPO is like letting them throw 8 times per turn, then comparing the results:
- The throw that knocks down the most pins: "good strategy, do more of that"
- The throw that misses: "avoid doing that"
In SimpleVLA-RL:
- Each "query" is an observation (camera image + task description)
- The model generates 8 action samples for the same observation
- Each sample is rolled out in sim to get a binary reward (0 or 1)
- Successful samples are reinforced, failed samples are penalized
- This process repeats thousands of times
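The steps above boil down to a few lines of arithmetic. Here is a minimal sketch of the group-relative advantage (the function name is mine, not veRL's): each sample's reward is centered by its group's mean and scaled by the group's standard deviation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: reward centered by the group mean, scaled by group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 8 rollouts of the same observation, binary rewards: 2 successes, 6 failures
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
adv = group_relative_advantages(rewards)
print([round(a, 2) for a in adv])  # successes ~ +1.73, failures ~ -0.58
```

Note the degenerate case: if every rollout in a group fails (or every one succeeds), all advantages are zero and the group contributes no gradient; the epsilon only guards against division by zero. This is why sufficient sampling temperature matters for exploration.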
Create RL Training Script
# Create RL training script for OpenArm
cat > examples/run_openvla_oft_rl_openarm.sh << 'SCRIPT_EOF'
#!/bin/bash
set -euo pipefail
# === Paths ===
SFT_MODEL_PATH="./checkpoints/openarm-sft-v1/final"
CKPT_PATH="./checkpoints/openarm-rl-v1"
DATASET_NAME="openarm_box_grasp"
# === Hardware ===
NUM_GPUS=8 # 8x A800 80GB recommended
# Also works with 4x A100 or 2x H100
# === Hyperparameters (tuned for manipulation tasks) ===
LEARNING_RATE=5e-6 # Lower than SFT since we are fine-tuning policy
BATCH_SIZE=64 # Rollout batch size
SAMPLES_PER_QUERY=8 # Number of action samples per observation
MINI_BATCH_SIZE=128 # Mini-batch for gradient update
CLIP_LOW=0.2 # Async clipping lower bound
CLIP_HIGH=1.28 # Async clipping upper bound
TEMPERATURE=1.6 # Sampling temperature — high for exploration
MAX_STEPS=400 # Max steps per episode
NUM_EPOCHS=200 # Total RL epochs
# === WANDB ===
export WANDB_PROJECT="openarm-simplevla"
export WANDB_RUN_NAME="rl-grpo-openarm-v1"
# === Launch RL training ===
torchrun --nproc_per_node=$NUM_GPUS \
-m verl.trainer.main_ppo \
--config configs/openarm_rl.yaml \
trainer.sft_model_path=$SFT_MODEL_PATH \
trainer.ckpt_path=$CKPT_PATH \
data.dataset_name=$DATASET_NAME \
algorithm.lr=$LEARNING_RATE \
algorithm.batch_size=$BATCH_SIZE \
algorithm.n_samples=$SAMPLES_PER_QUERY \
algorithm.mini_batch_size=$MINI_BATCH_SIZE \
algorithm.clip_range_low=$CLIP_LOW \
algorithm.clip_range_high=$CLIP_HIGH \
algorithm.temperature=$TEMPERATURE \
rollout.max_steps=$MAX_STEPS \
trainer.num_epochs=$NUM_EPOCHS \
trainer.val_only=False
SCRIPT_EOF
chmod +x examples/run_openvla_oft_rl_openarm.sh
Hyperparameter Reference Table
| Parameter | Value | Rationale |
|---|---|---|
| Learning rate | 5e-6 | Low to avoid catastrophic forgetting from SFT |
| Batch size | 64 | Balance between throughput and memory |
| Samples/query | 8 | Enough diversity for GRPO comparison without excess compute |
| Mini-batch | 128 | Smoother gradient updates |
| Clip low | 0.2 | Prevent policy from changing too aggressively downward |
| Clip high | 1.28 | Async clipping — allows more increase than decrease |
| Temperature | 1.6 | High to encourage exploration |
| Action tokens | 256 | Action resolution — 256 levels per DoF |
Note on async clipping (0.2, 1.28): This is a SimpleVLA-RL innovation. Traditional PPO uses symmetric clipping (0.2, 0.2). Async clipping allows the policy to "increase probability" of an action more than it can "decrease probability" — this makes sense because in manipulation, when a robot finds a successful strategy, it should commit strongly to it.
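The asymmetric clip can be sketched in a few lines. This is a toy scalar version (the function and argument names are mine; in veRL the clip is applied elementwise to token-level probability ratios), but it shows how the ratio is allowed to grow to 1 + 1.28 = 2.28 while it can shrink only to 1 - 0.2 = 0.8:

```python
def clipped_objective(ratio, advantage, clip_low=0.2, clip_high=1.28):
    """PPO-style surrogate with asymmetric clip range [1 - low, 1 + high]."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Take the pessimistic (lower) of the unclipped and clipped surrogates
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the ratio may grow up to 2.28 before the clip bites
print(clipped_objective(ratio=3.0, advantage=1.0))   # ~2.28
# Negative advantage: the downside is still clipped at the usual 0.8
print(clipped_objective(ratio=0.5, advantage=-1.0))  # ~-0.8
```

With symmetric PPO clipping both prints would be bounded by 1.2 and 0.8; the wider upper bound is what lets the policy commit hard to a newly discovered successful strategy.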
Run RL Training
# Run RL training
bash examples/run_openvla_oft_rl_openarm.sh
RL training takes 8-16 hours on 8x A800 depending on epochs and sim speed. On 4x A100, it may take up to 24 hours.
Monitor on W&B
During training, track these metrics on Weights & Biases:
Key metrics to watch:
1. rollout/success_rate — Should gradually increase from SFT baseline
2. policy/entropy — Gradually decreases as policy converges
3. policy/loss — Fluctuates but trends downward
4. rollout/avg_reward — Increases (correlates with success rate)
5. policy/kl_divergence — Should not be too high (< 0.5)
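To build intuition for metric 5, here is the KL computation for a toy 4-token action vocabulary (real policies have 256 tokens per DoF; the distributions below are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over action tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.25, 0.25, 0.25, 0.25]      # reference (SFT) policy over 4 tokens
drifted = [0.40, 0.30, 0.20, 0.10]  # current RL policy
kl = kl_divergence(drifted, ref)
print(round(kl, 3))  # 0.106
```

A KL this size is healthy drift. If the logged value climbs past roughly 0.5, the policy is straying far from its SFT initialization, which often precedes a collapse in success rate.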
Expected timeline:
- Epoch 1-20: Success rate fluctuates around SFT baseline (15-25%)
- Epoch 20-80: Starts climbing, robot discovers new strategies
- Epoch 80-150: Rapid increase, success rate reaches 40-60%
- Epoch 150-200: Plateau or continued slow improvement
The "Pushcut" Phenomenon — Robot Self-Invention
One of the most exciting results from RL training is that the robot can discover entirely new strategies not present in the demonstration data. The SimpleVLA-RL paper calls these "emergent" behaviors — the Piper robot learned to push objects into favorable positions before grasping, rather than grasping directly as shown in demos.
With OpenArm box grasping, you might observe the robot self-learning to:
- Rotate the wrist for a better approach angle
- Push the box toward the center of the workspace before grasping
- Open the gripper wider than demo data for a "safer" grasp
Step 6: Evaluate in Simulation
After RL training completes, evaluate the model on 100+ episodes for statistically reliable results.
Run Evaluation
# Modify script for evaluation only (no training)
bash examples/run_openvla_oft_rl_openarm.sh \
trainer.val_only=True \
trainer.sft_model_path=./checkpoints/openarm-rl-v1/best \
rollout.num_eval_episodes=100
SFT vs SFT+RL Comparison
+--------------------------------------------------+
| OpenArm Box Grasp — Sim Results |
+------------------+-----------+-------------------+
| Method | Success % | Improvement |
+------------------+-----------+-------------------+
| SFT only | ~20% | Baseline |
| SFT + RL (GRPO) | ~45-55% | +125-175% |
+------------------+-----------+-------------------+
| * Estimates based on original paper (Piper): |
| SFT 17.5% -> SFT+RL 38.5% (+120%) |
| OpenArm sim may be higher since sim is easier |
| than real-world |
+--------------------------------------------------+
Failure Mode Analysis
When success rate plateaus, analyze why the robot fails:
# Failure analysis script
import json

with open("eval_results/rl_eval/results.json") as f:
    results = json.load(f)

failures = [r for r in results if not r["success"]]
print(f"Total failures: {len(failures)}/{len(results)}")

# Categorize failures
failure_types = {
    "miss_grasp": 0,      # Gripper closes but does not grip the box
    "wrong_position": 0,  # Robot does not approach box correctly
    "timeout": 0,         # Ran out of time (400 steps)
    "collision": 0,       # Collision with table or obstacles
}

for fail in failures:
    if fail["step"] >= 400:
        failure_types["timeout"] += 1
    elif fail["box_height"] < 0.76:  # Box not lifted above the table
        if fail.get("gripper_closed", False):
            failure_types["miss_grasp"] += 1
        else:
            failure_types["wrong_position"] += 1
    else:
        failure_types["collision"] += 1

for ftype, count in failure_types.items():
    pct = count / len(failures) * 100
    print(f"  {ftype}: {count} ({pct:.1f}%)")
Based on failure analysis, you can adjust:
- Many miss_grasp — Increase domain randomization for box positions in sim
- Many timeout — Increase max_steps or reduce task difficulty
- Many wrong_position — Collect more demos with diverse box positions
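For the miss_grasp case, domain randomization can be as simple as resampling the box spawn pose on every reset. The ranges below are placeholders for illustration; tune them to your actual workspace and table layout:

```python
import random

def sample_box_pose(x_range=(0.35, 0.55),
                    y_range=(-0.15, 0.15),
                    yaw_range=(-0.5, 0.5)):
    """Sample a randomized box spawn pose (meters / radians) for each reset."""
    return {
        "x": random.uniform(*x_range),
        "y": random.uniform(*y_range),
        "yaw": random.uniform(*yaw_range),
    }

random.seed(0)
poses = [sample_box_pose() for _ in range(3)]
print(poses[0])  # a dict with randomized x, y, yaw
```

Widening these ranges gradually (a simple curriculum) tends to work better than jumping straight to the hardest distribution.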
Pipeline Summary
Isaac Lab Sim
|
v
Collect Demonstrations (500-1000 episodes)
|
v
SFT Training (OpenVLA-OFT base -> fine-tuned model)
| ~20% success rate
v
RL Training (GRPO with binary rewards)
| ~45-55% success rate
v
Sim Evaluation (100+ episodes)
|
v
Ready for Sim-to-Real (Next post!)
The entire pipeline — from installation to evaluation — can be completed in 2-3 days with appropriate hardware. SFT takes half a day, RL takes one day, and evaluation plus iteration take another half day.
In the next post, we will transfer the trained model from sim to the real OpenArm robot — the challenging but exciting sim-to-real step.
Hardware Requirements
| Config | GPUs | SFT Time | RL Time | Total |
|---|---|---|---|---|
| Recommended | 8x A800 80GB | 2h | 10h | ~12h |
| Minimum viable | 4x A100 40GB | 4h | 20h | ~24h |
| Budget | 2x H100 80GB | 3h | 14h | ~17h |
Cloud GPU providers like Lambda Labs, RunPod, or Vast.ai rent comparable 8-GPU A100/H100 nodes for about $15-25/hour. Total training cost is approximately $180-300 for the full pipeline.
Related Posts
- SimpleVLA-RL (1): Framework Overview for VLA RL — Understanding why RL matters for robot manipulation
- SimpleVLA-RL (3): Detailed Training Pipeline — Deep dive into GRPO and veRL internals
- SimpleVLA-RL (11): Sim-to-Real for OpenArm — Transferring the model from sim to real robot