Tags: ai, openarm, simplevla-rl, training, grpo, reinforcement-learning

SimpleVLA-RL (10): SFT & RL Training for OpenArm

Step-by-step guide to SFT fine-tuning and RL training with SimpleVLA-RL for OpenArm — from environment config to running GRPO.

Nguyễn Anh Tuấn · April 11, 2026 · 15 min read

SFT & RL Training for OpenArm with SimpleVLA-RL

In the previous post, we collected hundreds of demonstration episodes in Isaac Lab for OpenArm box grasping. The simulation data is ready — now it is time to turn it into a smart policy that can control a real robot. This article walks you through the entire SimpleVLA-RL training pipeline end-to-end: installing the stack, adapting the action space for OpenArm, running SFT (Supervised Fine-Tuning) to create a baseline, then boosting performance with RL (Reinforcement Learning) using the GRPO algorithm.

The biggest difference from Part 8 — where we used LeRobot — is that here we exclusively use the SimpleVLA-RL pipeline built on veRL and OpenVLA-OFT. No LeRobot. No ACT or SmolVLA. Just a single VLA model fine-tuned with SFT then upgraded with RL.

Why SimpleVLA-RL Instead of LeRobot?

Before diving into the technical details, let us understand why we chose this path. LeRobot is an excellent framework for Imitation Learning — collect demos, train a policy, deploy. But it stops there. The policy is only as good as the data you provide.

SimpleVLA-RL goes one step further: after SFT creates a baseline from demonstrations, the GRPO algorithm allows the robot to discover new strategies on its own through trial-and-error in simulation. Results from the original paper on the Piper robot show that RL can increase success rate from 17.5% to 38.5% — a 120% improvement over SFT alone.

SimpleVLA-RL training pipeline includes SFT cold-start and RL fine-tuning

Step 1: Install the SimpleVLA-RL Stack

The full stack has 4 core components: Python environment, veRL (RL framework), OpenVLA-OFT (VLA model), and SimpleVLA-RL (glue code connecting everything). Installation order matters due to the dependency chain.

Create Conda Environment

# Create a separate environment — DO NOT share with Isaac Lab env
conda create -n simplevla python=3.10 -y
conda activate simplevla

# PyTorch 2.4 with CUDA 12.4 — most thoroughly tested version
pip3 install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

Why Python 3.10 and not 3.11 or 3.12? Because veRL and flash-attention have compiled extensions built and tested primarily on 3.10. You can try 3.11 but risk compatibility issues that are not worth the trouble.

Install veRL — RL Training Framework

# Clone veRL v0.2 — stable version supported by SimpleVLA-RL
git clone -b v0.2.x https://github.com/volcengine/verl.git
cd verl
pip install -e .
cd ..

veRL (Volcano Engine Reinforcement Learning) is ByteDance's framework designed for RL training on LLM-scale models. It handles distributed training, rollout collection, and policy optimization — things you do not want to implement yourself.

Install OpenVLA-OFT — VLA Model

# Clone OpenVLA-OFT — VLA model with Open Fine-Tuning support
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip install -e .
cd ..

# Flash Attention — 2-3x inference speedup
pip install flash-attn --no-build-isolation

OpenVLA-OFT is an extended version of OpenVLA with better fine-tuning support. It uses a Vision-Language-Action architecture with a Prismatic VLM backbone, tokenizing actions into 256 discrete tokens per degree of freedom.
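To make the tokenization concrete, here is a minimal uniform-binning sketch of the idea — a simplification for illustration, not OpenVLA-OFT's actual tokenizer implementation (the library's binning details differ):

```python
import numpy as np

N_BINS = 256  # one discrete token per bin, per degree of freedom

def tokenize_action(action, n_bins=N_BINS):
    """Map continuous actions in [-1, 1] to discrete bin indices 0..n_bins-1."""
    clipped = np.clip(action, -1.0, 1.0)
    # scale [-1, 1] -> [0, n_bins - 1] and round to the nearest bin
    return np.round((clipped + 1.0) / 2.0 * (n_bins - 1)).astype(int)

def detokenize_action(tokens, n_bins=N_BINS):
    """Inverse mapping: bin indices back to values in [-1, 1]."""
    return tokens / (n_bins - 1) * 2.0 - 1.0

action = np.array([0.5, -1.0, 0.0, 1.0, 0.25, -0.3, 0.9, 1.0])  # 8-DoF example
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
# quantization error is bounded by half a bin width: 1/255
assert np.max(np.abs(recovered - action)) <= 1.0 / 255
```

The resolution trade-off is visible here: 256 bins over a [-1, 1] range means each DoF is commanded with a granularity of about 0.8% of its full travel.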

Clone SimpleVLA-RL

git clone https://github.com/simple-vla-rl/SimpleVLA-RL.git
cd SimpleVLA-RL
pip install -e .

After this step, verify the entire stack:

python -c "
import torch
import verl
import prismatic
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'veRL imported successfully')
print(f'Prismatic (OpenVLA) imported successfully')
"

If everything works, you should see PyTorch 2.4.0, CUDA available, and your GPU count. If torch.cuda.device_count() returns 0, check your CUDA driver version with nvidia-smi.

Step 2: Adapt the Action Space for OpenArm

This is the most important step and also the easiest to get wrong. OpenVLA-OFT is designed by default for 7-DoF (6 joints + 1 gripper, like the Piper robot). OpenArm has 7 joints + 1 gripper = 8-DoF. We need to decide how to handle this difference.

Option A: Direct 8-DoF Mapping (Recommended)

OpenArm's 8-DoF (7 joints + 1 gripper) has the same dimensionality as some default OpenVLA-OFT configurations. If the model supports action_dim=8, we just need to map joint order correctly:

# In config file or directly in code
ACTION_DIM = 8          # 7 joints + 1 gripper
STATE_DIM = 8           # Joint positions feedback
NUM_ACTION_TOKENS = 256  # Each DoF tokenized into 256 discrete values

# OpenArm joints -> action vector mapping
# [joint1, joint2, joint3, joint4, joint5, joint6, joint7, gripper]
# Each joint value normalized to [-1, 1] before tokenization
JOINT_LIMITS = {
    'joint1': (-3.14, 3.14),   # Base rotation
    'joint2': (-1.57, 1.57),   # Shoulder
    'joint3': (-1.57, 1.57),   # Elbow
    'joint4': (-3.14, 3.14),   # Wrist 1
    'joint5': (-1.57, 1.57),   # Wrist 2
    'joint6': (-3.14, 3.14),   # Wrist 3
    'joint7': (-1.57, 1.57),   # Wrist 4
    'gripper': (0.0, 1.0),     # Gripper open/close
}
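The normalization step mentioned in the comment above can be sketched as follows — a simple linear mapping per joint using the limits table (the function names here are illustrative, not part of the SimpleVLA-RL codebase):

```python
import numpy as np

# Joint limits from the mapping above (radians; gripper in [0, 1])
JOINT_LIMITS = {
    'joint1': (-3.14, 3.14),
    'joint2': (-1.57, 1.57),
    'joint3': (-1.57, 1.57),
    'joint4': (-3.14, 3.14),
    'joint5': (-1.57, 1.57),
    'joint6': (-3.14, 3.14),
    'joint7': (-1.57, 1.57),
    'gripper': (0.0, 1.0),
}
JOINT_ORDER = list(JOINT_LIMITS)  # [joint1, ..., joint7, gripper]

def normalize(raw):
    """Map raw joint values to [-1, 1] per the limits table (clipping out-of-range values)."""
    out = []
    for name, value in zip(JOINT_ORDER, raw):
        lo, hi = JOINT_LIMITS[name]
        out.append(2.0 * (np.clip(value, lo, hi) - lo) / (hi - lo) - 1.0)
    return np.array(out)

def denormalize(norm):
    """Inverse mapping: [-1, 1] back to raw joint values."""
    out = []
    for name, value in zip(JOINT_ORDER, norm):
        lo, hi = JOINT_LIMITS[name]
        out.append((value + 1.0) / 2.0 * (hi - lo) + lo)
    return np.array(out)
```

For example, joint1 at 0 rad normalizes to 0.0, and a half-open gripper (0.5) also normalizes to 0.0 because its range is [0, 1].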

Option B: Collapse to 7-DoF by Merging Wrist Joints

If you want to keep the original 7-DoF architecture of OpenVLA-OFT without modifying the model:

# Merge the last 2 wrist joints into 1
# OpenArm 8-DoF -> 7-DoF mapping
def openarm_to_7dof(joint_positions_8):
    """Map 8-DoF OpenArm to 7-DoF OpenVLA format"""
    return [
        joint_positions_8[0],  # base
        joint_positions_8[1],  # shoulder
        joint_positions_8[2],  # elbow
        joint_positions_8[3],  # wrist1
        joint_positions_8[4],  # wrist2
        # Average wrist3 + wrist4 into single value
        (joint_positions_8[5] + joint_positions_8[6]) / 2,
        joint_positions_8[7],  # gripper
    ]

I recommend Option A because it preserves all kinematic information. Merging joints loses precision, especially for tasks requiring high wrist dexterity.
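The information loss is easy to demonstrate: the merge is not invertible, so two physically different wrist configurations collapse to the same 7-DoF action vector (reusing the mapping function from above):

```python
def openarm_to_7dof(j):
    """Option B mapping from above: merge wrist3 + wrist4 into one averaged value."""
    return [j[0], j[1], j[2], j[3], j[4], (j[5] + j[6]) / 2, j[7]]

# Two physically distinct poses...
pose_a = [0.0, 0.3, -0.5, 0.1, 0.2,  0.8, -0.8, 1.0]  # wrists at +0.8 / -0.8
pose_b = [0.0, 0.3, -0.5, 0.1, 0.2, -0.8,  0.8, 1.0]  # wrists swapped

# ...become indistinguishable after merging: both wrist pairs average to 0.0
assert openarm_to_7dof(pose_a) == openarm_to_7dof(pose_b)
```

A policy trained on the merged representation can never learn to distinguish these poses, which is exactly the precision loss that hurts wrist-dexterity tasks.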

Modify Action Tokenizer Config

In OpenVLA-OFT, open the config file and change:

# openvla-oft/prismatic/models/backbones/llm/action_tokenizer.py
# Or in config JSON depending on version

tokenizer_config = {
    "action_dim": 8,         # Change from 7 -> 8 for OpenArm
    "state_dim": 8,          # Feedback dimension
    "num_action_tokens": 256, # Keep unchanged
    "action_chunk_size": 1,  # Predict 1 action per step
}

Step 3: Register the OpenArm Environment

SimpleVLA-RL needs to know how to interact with your environment. Three files need modification: rob_dataset.py (data), rob_rollout.py (rollout loop), and the reward function.

Add Dataset to rob_dataset.py

# SimpleVLA-RL/verl/workers/rob_dataset.py

# Add to DATASETS dictionary
DATASETS = {
    # ... existing datasets (LIBERO, Piper, etc.)
    
    "openarm_box_grasp": {
        "data_path": "/path/to/openarm_sim_demos/",
        "task_description": "Pick up the carton box from the table",
        "action_dim": 8,
        "state_dim": 8,
        "camera_names": ["front_camera"],  # Camera name in Isaac Lab
        "image_size": (224, 224),          # Resize for VLA input
        "max_episode_length": 400,         # Must match sim config
    },
}

Add Environment to rob_rollout.py

# SimpleVLA-RL/verl/workers/rob_rollout.py

def create_environment(env_name, env_config):
    """Create environment for rollout"""
    if env_name == "openarm_box_grasp":
        # Import Isaac Lab environment
        from omni.isaac.lab.envs import ManagerBasedRLEnv
        import omni.isaac.lab_tasks  # Register environments
        
        env = ManagerBasedRLEnv(
            cfg=env_config,
            render_mode="rgb_array",
        )
        return OpenArmWrapper(env)
    # ... other environments

# Add max_steps for OpenArm
MAX_STEPS = {
    # ... existing entries
    "openarm_box_grasp": 400,  # 400 steps x 0.02s = 8 seconds per episode
}

Implement get_info() for Success Detection

This is the most critical part — it determines the reward signal for RL:

class OpenArmWrapper:
    """Wrapper connecting Isaac Lab env to SimpleVLA-RL rollout"""
    
    def __init__(self, isaac_env):
        self.env = isaac_env
        self.step_count = 0
        
    def get_info(self):
        """
        Return dict containing success/failure info.
        SimpleVLA-RL uses binary reward: 1 if success, 0 if failure.
        """
        # Check: has the box been lifted above the table?
        # get_world_poses() returns (positions, orientations); take the
        # position of the first (only) env instance
        positions, _ = self.env.scene["box"].get_world_poses()
        box_pos = positions[0]
        table_height = 0.75  # Table height in sim (meters)

        # Gripper holding box + box lifted at least 10 cm above the table
        gripper_holding = self.env.scene["gripper"].is_closed()
        box_lifted = box_pos[2] > table_height + 0.10
        
        success = gripper_holding and box_lifted
        
        return {
            "success": success,
            "reward": 1.0 if success else 0.0,  # Binary reward
            "box_height": float(box_pos[2]),
            "step": self.step_count,
        }
    
    def step(self, action):
        """Execute action and return observation"""
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.step_count += 1
        
        # Override reward with binary success signal
        custom_info = self.get_info()
        return obs, custom_info["reward"], terminated, truncated, custom_info
    
    def reset(self):
        self.step_count = 0
        return self.env.reset()

SimpleVLA-RL system architecture connecting VLA model with simulation environment

Step 4: SFT Training — Creating the Baseline

SFT (Supervised Fine-Tuning) is the "cold-start" step — teaching the VLA model to imitate your collected demonstrations. It is similar to Imitation Learning but performed on top of a Vision-Language Model that has been pretrained on millions of images and text.

Download OpenVLA-OFT Base Model

# Download from HuggingFace — approximately 15GB
huggingface-cli download moojink/openvla-7b-oft \
    --local-dir ./checkpoints/openvla-oft-base

This base model has been pretrained on the Open X-Embodiment dataset — a collection of data from dozens of different robot types. It "understands" general concepts of manipulation but knows nothing specific about OpenArm.

Prepare SFT Config

# Create config file for SFT training
cat > configs/openarm_sft.yaml << 'EOF'
# === Model ===
model:
  base_model_path: "./checkpoints/openvla-oft-base"
  action_dim: 8
  action_tokens: 256
  use_flash_attn: true

# === Data ===
data:
  dataset_name: "openarm_box_grasp"
  data_path: "/path/to/openarm_sim_demos/"
  image_size: 224
  batch_size: 32
  num_workers: 4

# === Training ===
training:
  learning_rate: 2e-5       # LR for SFT — higher than RL
  warmup_steps: 100
  max_steps: 5000           # Total gradient steps; suits 500-1000 episodes
  gradient_accumulation: 2
  fp16: true                # Mixed precision
  
# === Logging ===
logging:
  wandb_project: "openarm-simplevla"
  wandb_run_name: "sft-openarm-v1"
  log_interval: 50
  save_interval: 1000

# === Hardware ===
hardware:
  num_gpus: 4               # Minimum 4x A100/A800
  per_gpu_batch_size: 8
EOF

Run SFT Training

# Activate environment
conda activate simplevla

# Set WANDB key for monitoring
export WANDB_API_KEY="your_wandb_key_here"

# Run SFT training
python -m verl.trainer.sft \
    --config configs/openarm_sft.yaml \
    --output_dir ./checkpoints/openarm-sft-v1

# Or use torchrun for multi-GPU
torchrun --nproc_per_node=4 \
    -m verl.trainer.sft \
    --config configs/openarm_sft.yaml \
    --output_dir ./checkpoints/openarm-sft-v1

Training time: With 500 episodes and 4x A800, SFT takes about 2-4 hours. On 8x A800, about 1-2 hours. You will see loss drop rapidly in the first 1000 steps then plateau.

Evaluate SFT Baseline

After SFT completes, run evaluation in sim to establish a baseline:

python -m verl.trainer.evaluate \
    --model_path ./checkpoints/openarm-sft-v1/final \
    --env_name openarm_box_grasp \
    --num_episodes 100 \
    --render_video true \
    --output_dir ./eval_results/sft_baseline

Expected outcome: SFT baseline yields a success rate of roughly 15-25% on the box grasping task. This sounds low but is perfectly normal — the SFT policy imitates demos without understanding "why" each action works. It lacks the ability to generalize when box position changes or when there is observation noise.

Step 5: RL Training — The GRPO Magic

This is the step that makes the biggest difference in SimpleVLA-RL. GRPO (Group Relative Policy Optimization) is a variant of PPO designed for LLM/VLA — it does not need a separate critic network, instead using group-relative advantages from multiple samples of the same query.

How GRPO Works (Intuition)

Imagine you are teaching someone new to bowl. SFT is like showing them tutorial videos — they get the basics but are not skilled yet. GRPO is like letting them throw 8 times per turn, then comparing results: throws better than the group average get reinforced, and throws worse than average get discouraged.

In SimpleVLA-RL, the same idea applies: for each observation the model samples several action sequences, rolls each one out in simulation, scores it with the binary success reward, and computes each sample's advantage relative to the group average. No separate critic network is needed.
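In code, the group-relative advantage is just each rollout's reward standardized against its own group. A minimal sketch assuming the binary rewards described earlier (the function name is illustrative, not veRL's API):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """
    Group-relative advantages: standardize each rollout's reward against
    the mean/std of its own group of samples (no critic network needed).
    rewards: shape (num_queries, samples_per_query), e.g. (B, 8)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One observation, 8 sampled action sequences, binary success rewards:
rewards = [[1, 0, 0, 1, 0, 0, 0, 0]]  # 2 of 8 rollouts succeeded
adv = grpo_advantages(rewards)
# successes get positive advantage, failures negative;
# an all-success (or all-failure) group yields ~0 advantage everywhere
assert adv[0][0] > 0 and adv[0][1] < 0
```

Note the degenerate case: if all 8 samples succeed or all fail, every advantage is near zero and the group contributes no learning signal — one reason exploration (high sampling temperature) matters.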

Create RL Training Script

# Create RL training script for OpenArm
cat > examples/run_openvla_oft_rl_openarm.sh << 'SCRIPT_EOF'
#!/bin/bash
set -euo pipefail

# === Paths ===
SFT_MODEL_PATH="./checkpoints/openarm-sft-v1/final"
CKPT_PATH="./checkpoints/openarm-rl-v1"
DATASET_NAME="openarm_box_grasp"

# === Hardware ===
NUM_GPUS=8                  # 8x A800 80GB recommended
                            # Also works with 4x A100 or 2x H100

# === Hyperparameters (tuned for manipulation tasks) ===
LEARNING_RATE=5e-6          # Lower than SFT since we are fine-tuning policy
BATCH_SIZE=64               # Rollout batch size
SAMPLES_PER_QUERY=8         # Number of action samples per observation
MINI_BATCH_SIZE=128         # Mini-batch for gradient update
CLIP_LOW=0.2                # Asymmetric clipping: lower bound (ratio floor 0.8)
CLIP_HIGH=1.28              # Asymmetric clipping: upper ratio bound
TEMPERATURE=1.6             # Sampling temperature — high for exploration
MAX_STEPS=400               # Max steps per episode
NUM_EPOCHS=200              # Total RL epochs

# === WANDB ===
export WANDB_PROJECT="openarm-simplevla"
export WANDB_RUN_NAME="rl-grpo-openarm-v1"

# === Launch RL training ===
torchrun --nproc_per_node=$NUM_GPUS \
    -m verl.trainer.main_ppo \
    --config configs/openarm_rl.yaml \
    trainer.sft_model_path=$SFT_MODEL_PATH \
    trainer.ckpt_path=$CKPT_PATH \
    data.dataset_name=$DATASET_NAME \
    algorithm.lr=$LEARNING_RATE \
    algorithm.batch_size=$BATCH_SIZE \
    algorithm.n_samples=$SAMPLES_PER_QUERY \
    algorithm.mini_batch_size=$MINI_BATCH_SIZE \
    algorithm.clip_range_low=$CLIP_LOW \
    algorithm.clip_range_high=$CLIP_HIGH \
    algorithm.temperature=$TEMPERATURE \
    rollout.max_steps=$MAX_STEPS \
    trainer.num_epochs=$NUM_EPOCHS \
    trainer.val_only=False \
    "$@"  # forward extra overrides (e.g. for eval-only runs)
SCRIPT_EOF

chmod +x examples/run_openvla_oft_rl_openarm.sh

Hyperparameter Reference Table

+---------------+-------+------------------------------------------------------------+
| Parameter     | Value | Rationale                                                  |
+---------------+-------+------------------------------------------------------------+
| Learning rate | 5e-6  | Low to avoid catastrophic forgetting from SFT              |
| Batch size    | 64    | Balance between throughput and memory                      |
| Samples/query | 8     | Enough diversity for GRPO comparison without excess compute|
| Mini-batch    | 128   | Smoother gradient updates                                  |
| Clip low      | 0.2   | Prevent policy from changing too aggressively downward     |
| Clip high     | 1.28  | Asymmetric clipping allows more increase than decrease     |
| Temperature   | 1.6   | High to encourage exploration                              |
| Action tokens | 256   | Action resolution: 256 levels per DoF                      |
+---------------+-------+------------------------------------------------------------+

Note on asymmetric clipping (0.2, 1.28): This is a SimpleVLA-RL innovation. Traditional PPO clips the probability ratio symmetrically to [1 - 0.2, 1 + 0.2] = [0.8, 1.2]. Asymmetric clipping widens only the upper bound, giving [0.8, 1.28], so the policy can raise the probability of an action more than it can lower it. This makes sense in manipulation: when the robot finds a successful strategy, it should be allowed to commit to it strongly.
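The effect is easy to see in a toy PPO-style clipped surrogate. This sketch assumes the ratio bounds implied by the values above ([1 - 0.2, 1.28]); it is an illustration, not veRL's implementation:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=1.28):
    """
    PPO-style clipped objective with asymmetric bounds: the probability
    ratio is clipped to [1 - clip_low, clip_high] = [0.8, 1.28].
    Symmetric PPO would clip to [0.8, 1.2] instead.
    """
    clipped = np.clip(ratio, 1.0 - clip_low, clip_high)
    # pessimistic (lower) bound, as in standard PPO
    return np.minimum(ratio * advantage, clipped * advantage)

# A strongly up-weighted good action: ratio 1.25 survives unclipped here,
# while symmetric clipping at 1.2 would have capped its objective.
assert clipped_surrogate(1.25, 1.0) == 1.25
assert clipped_surrogate(1.25, 1.0, clip_high=1.2) == 1.2
# For a bad action whose probability already dropped (ratio 0.5, advantage -1),
# the pessimistic bound floors the objective at the clipped value -0.8.
assert clipped_surrogate(0.5, -1.0) == -0.8
```

In short, good actions get ~40% more headroom to grow (up to 1.28x) than the symmetric default, while the downward side stays as conservative as standard PPO.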

Run RL Training

# Run RL training
bash examples/run_openvla_oft_rl_openarm.sh

RL training takes 8-16 hours on 8x A800 depending on epochs and sim speed. On 4x A100, it may take up to 24 hours.

Monitor on W&B

During training, track these metrics on Weights & Biases:

Key metrics to watch:

1. rollout/success_rate     — Should gradually increase from SFT baseline
2. policy/entropy           — Gradually decreases as policy converges
3. policy/loss              — Fluctuates but trends downward
4. rollout/avg_reward       — Increases (correlates with success rate)
5. policy/kl_divergence     — Should not be too high (< 0.5)
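The KL metric in that list is typically estimated from per-token log-probabilities under the old and new policies. A minimal sketch of the common low-variance sample-based estimator — an assumption for illustration; veRL's exact estimator may differ:

```python
import numpy as np

def kl_estimate(logp_new, logp_old):
    """
    Sample-based estimate of KL(new || old) from per-token log-probs,
    using the low-variance form E[exp(d) - d - 1] with d = logp_old - logp_new.
    Always non-negative, unlike the naive E[logp_new - logp_old].
    """
    d = np.asarray(logp_old) - np.asarray(logp_new)
    return float(np.mean(np.exp(d) - d - 1.0))

# Identical policies -> estimate of exactly 0
assert kl_estimate([-1.0, -2.0], [-1.0, -2.0]) == 0.0
# Diverging policies -> positive estimate; worry if it creeps above ~0.5
assert kl_estimate([-1.0, -2.0], [-1.5, -2.5]) > 0.0
```

A rising KL means the RL policy is drifting far from the SFT reference — exactly the catastrophic-forgetting risk the low learning rate is meant to prevent.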

Expected timeline:

The "Pushcut" Phenomenon — Robot Self-Invention

One of the most exciting results from RL training is that the robot can discover entirely new strategies not present in the demonstration data. The SimpleVLA-RL paper calls these "emergent" behaviors — the Piper robot learned to push objects into favorable positions before grasping, rather than grasping directly as shown in demos.

With OpenArm box grasping, you might observe the robot self-learning to:

Robot arm performing manipulation task in research environment

Step 6: Evaluate in Simulation

After RL training completes, evaluate the model on 100+ episodes for statistically reliable results.
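Why 100+ episodes? With a binary success metric the evaluation itself is noisy, and a quick confidence-interval check (here the standard Wilson score interval, stdlib only) shows how episode count tightens the estimate:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# A measured 50% success rate on 20 vs. 100 episodes:
lo20, hi20 = wilson_interval(10, 20)     # roughly (0.30, 0.70)
lo100, hi100 = wilson_interval(50, 100)  # roughly (0.40, 0.60)
assert (hi100 - lo100) < (hi20 - lo20)   # more episodes -> tighter interval
```

With only 20 episodes, a "50%" result is statistically compatible with anything from ~30% to ~70% — far too wide to distinguish the SFT baseline from the RL policy; 100 episodes narrows that to roughly ±10 points.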

Run Evaluation

# Modify script for evaluation only (no training)
bash examples/run_openvla_oft_rl_openarm.sh \
    trainer.val_only=True \
    trainer.sft_model_path=./checkpoints/openarm-rl-v1/best \
    rollout.num_eval_episodes=100

SFT vs SFT+RL Comparison

+--------------------------------------------------+
|        OpenArm Box Grasp — Sim Results           |
+------------------+-----------+-------------------+
| Method           | Success % | Improvement       |
+------------------+-----------+-------------------+
| SFT only         | ~20%      | Baseline          |
| SFT + RL (GRPO)  | ~45-55%   | +125-175%         |
+------------------+-----------+-------------------+
| * Estimates based on original paper (Piper):     |
|   SFT 17.5% -> SFT+RL 38.5% (+120%)              |
|   OpenArm sim may be higher since sim is easier  |
|   than real-world                                |
+--------------------------------------------------+

Failure Mode Analysis

When success rate plateaus, analyze why the robot fails:

# Failure analysis script
import json

with open("eval_results/rl_eval/results.json") as f:
    results = json.load(f)

failures = [r for r in results if not r["success"]]
print(f"Total failures: {len(failures)}/{len(results)}")

# Categorize failures
failure_types = {
    "miss_grasp": 0,      # Gripper closes but does not grip the box
    "wrong_position": 0,  # Robot does not approach box correctly
    "timeout": 0,         # Ran out of time (400 steps)
    "collision": 0,       # Collision with table or obstacles
}

for r in failures:
    if r["step"] >= 400:
        failure_types["timeout"] += 1
    elif r["box_height"] < 0.76:  # box still at table height (0.75m + 1cm tolerance)
        if r.get("gripper_closed", False):
            failure_types["miss_grasp"] += 1
        else:
            failure_types["wrong_position"] += 1
    else:
        failure_types["collision"] += 1

for ftype, count in failure_types.items():
    pct = count / len(failures) * 100
    print(f"  {ftype}: {count} ({pct:.1f}%)")

Based on failure analysis, you can adjust:

Pipeline Summary

Isaac Lab Sim
    |
    v
Collect Demonstrations (500-1000 episodes)
    |
    v
SFT Training (OpenVLA-OFT base -> fine-tuned model)
    |  ~20% success rate
    v
RL Training (GRPO with binary rewards)
    |  ~45-55% success rate
    v
Sim Evaluation (100+ episodes)
    |
    v
Ready for Sim-to-Real (Next post!)

The entire pipeline — from installation to evaluation — can be completed in 2-3 days with appropriate hardware. SFT takes half a day, RL takes one day, evaluation and iteration takes half a day.

In the next post, we will transfer the trained model from sim to the real OpenArm robot — the challenging but exciting sim-to-real step.

Hardware Requirements

+----------------+--------------+----------+---------+-------+
| Config         | GPUs         | SFT Time | RL Time | Total |
+----------------+--------------+----------+---------+-------+
| Recommended    | 8x A800 80GB | 2h       | 10h     | ~12h  |
| Minimum viable | 4x A100 40GB | 4h       | 20h     | ~24h  |
| Budget         | 2x H100 80GB | 3h       | 14h     | ~17h  |
+----------------+--------------+----------+---------+-------+

Cloud GPU providers like Lambda Labs, RunPod, or Vast.ai rent 8x A800 for about $15-25/hour. Total training cost is approximately $180-300 for the full pipeline.


Related Posts

- [Research] FlashSAC: Faster RL than PPO for Robots (April 11, 2026, 10 min read)
  FlashSAC, a new off-policy RL method that beats PPO in both speed and efficiency on 100+ robotics tasks, from humanoid locomotion to dexterous manipulation.

- [Tutorial, Part 11] SimpleVLA-RL (11): Sim-to-Real for OpenArm (April 11, 2026, 17 min read)
  Deploying the SimpleVLA-RL model from simulation to the real OpenArm robot: camera setup, action mapping, and tips for reducing the sim-to-real gap.

- [Research, Part 4] SimpleVLA-RL (4): Results & Lessons (April 11, 2026, 14 min read)
  Analysis of SimpleVLA-RL results: ablation studies, the pushcut phenomenon, real-world transfer, and five lessons learned.