You've trained a VLA model and SFT looks great on demos, but the robot fails in the real world. This is a fundamental problem: SFT only imitates success; it never learns from failure. TGRPO (arXiv 2506.08440) from Jilin University proposes a different approach: online RL with dual-level advantages and LLM-generated dense rewards, pushing OpenVLA-7B from 86.4% to 91.0% success rate on LIBERO-Object using a single A100 GPU.
This post breaks down the technical details: why standard GRPO falls short for robotics, how an LLM generates dense rewards automatically, and how the dual-level advantage mechanism works.
The Problem: SFT Teaches Imitation, Not Adaptation
Imagine learning to drive by watching 1000 videos of successful drives. You'd learn the pattern, but the first time something unexpected happens — a car cuts you off, the road is wet — you're lost because you've never experienced and recovered from failure.
VLA models trained with SFT have the same problem. When an object sits slightly off its demonstrated position, the lighting changes, or the task order shifts, the policy collapses. Technically: SFT minimizes a behavior cloning loss on successful demonstrations only. There's no signal about "why did it fail" or "how to recover."
Reinforcement learning (RL) addresses this by letting the robot actually try things and receive feedback. But applying RL to VLAs faces three challenges:
- Reward design is hard: Tabletop manipulation involves complex sub-tasks. Using a sparse reward (0/1 at episode end) makes credit assignment nearly impossible across 200+ steps.
- Long credit assignment horizon: Step 50 might cause success at step 150 — which steps actually mattered?
- Standard GRPO wasn't designed for long trajectories: GRPO (Group Relative Policy Optimization) was built for LLM reasoning, where each "trajectory" is a few dozen tokens. Robot trajectories span hundreds of continuous action steps.
TGRPO addresses all three simultaneously.
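To make the sparse-reward problem concrete, here is a minimal sketch (not taken from the paper) of GRPO-style group normalization applied to 0/1 episode outcomes. The helper name `group_advantage` and the `eps` constant are assumptions for illustration. When every rollout in a group fails, all advantages are zero and the update carries no learning signal.

```python
import numpy as np

def group_advantage(returns, eps=1e-8):
    """Normalize returns within one group: (R - mean) / (std + eps)."""
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)

# Sparse 0/1 reward, all four rollouts fail: every advantage is zero.
all_fail = group_advantage([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: it gets a positive advantage, the others negative.
one_success = group_advantage([0.0, 1.0, 0.0, 0.0])

print(all_fail)
print(one_success)
```

Early in RL fine-tuning on a hard task, the all-fail case can be common, which is one reason a denser reward is attractive.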
What is TGRPO? — Big Picture First
TGRPO = Trajectory-wise Group Relative Policy Optimization.
Three main components:
[LLM] → Dense Reward Function
↓
[Robot runs N=4 parallel trajectories] → collect (state, action, reward) tuples
↓
[TGRPO] → step-level + trajectory-level advantage → update policy
The key difference from standard GRPO: GRPO compares at the group level (which trajectory was better). TGRPO adds step-level comparison — at timestep t, which action among N parallel trajectories yielded a higher reward?
Source: TGRPO paper, Jilin University (arXiv 2506.08440)
Component 1: LLM Dense Reward — No More Hand-Crafted Rewards
This is the most practical contribution of the paper. Instead of engineers spending weeks designing reward functions for each task, TGRPO uses an LLM to generate them automatically.
How the LLM Generates Rewards
The LLM receives a prompt containing:
- Natural language task description (e.g., "Pick up the milk and place it in the basket")
- LIBERO environment code (so the LLM understands state space, object positions, success conditions)
- Specific format requirements
Core prompt (from the paper):
"Based on task description, LIBERO environment code, and RL robotics characteristics,
generate a multi-stage reward function where robot receives constant stage-specific
rewards at proximity thresholds, with progressively increasing values per stage
and significantly higher completion rewards."
LLM output is a Python function with a multi-stage structure:
# Helpers (compute_distance, pose_similarity) and constants (alpha,
# demo_ee_pose) are assumed to be defined in the surrounding module.
def compute_reward(state, info):
    reward = 0.0
    # Stage 1: Approach object
    dist_to_obj = compute_distance(state.ee_pos, state.obj_pos)
    if dist_to_obj < 0.05:  # 5cm threshold
        reward += 1.0
    # Stage 2: Grasp
    if info.is_grasped:
        reward += 3.0
    # Stage 3: Transport to goal
    dist_to_goal = compute_distance(state.obj_pos, state.goal_pos)
    if dist_to_goal < 0.1:
        reward += 5.0
    # Stage 4: Place (completion)
    if info.task_success:
        reward += 10.0
    # End-effector pose shaping from demonstration
    reward += alpha * pose_similarity(state.ee_pose, demo_ee_pose)
    return reward
The reward function has two components:
- f₁(P_object(t), P_pose^k): Object position tracking, which rewards the object for advancing toward the next milestone
- f₂(P_pose^k, s_t): End-effector pose shaping, which encourages poses similar to demonstrations
Formula: R_t = f₁(P_object(t), P_pose^k) + f₂(P_pose^k, s_t), where P_pose^k is the k-th key pose (written `keypose` in the training loop later in this post)
The key insight: the LLM understands the semantics of "put milk in basket" and automatically infers sub-goals (approach → grasp → lift → place). No engineer needs to hard-code this.
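As a runnable illustration of the multi-stage structure, the sketch below re-implements only the four threshold stages on a toy state. `ToyState` and `staged_reward` are hypothetical names, the thresholds and bonus values are copied from the example function above, and the pose-shaping term is left out. It is not the reward the paper's LLM actually produced.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ToyState:
    ee_pos: np.ndarray      # end-effector position in meters
    obj_pos: np.ndarray     # object position in meters
    goal_pos: np.ndarray    # goal position in meters
    is_grasped: bool
    task_success: bool

def staged_reward(s: ToyState) -> float:
    """Constant bonus per stage, larger for later stages."""
    reward = 0.0
    if np.linalg.norm(s.ee_pos - s.obj_pos) < 0.05:    # stage 1: approach
        reward += 1.0
    if s.is_grasped:                                   # stage 2: grasp
        reward += 3.0
    if np.linalg.norm(s.obj_pos - s.goal_pos) < 0.10:  # stage 3: transport
        reward += 5.0
    if s.task_success:                                 # stage 4: completion
        reward += 10.0
    return reward

obj = np.array([0.2, 0.0, 0.0])
goal = np.array([0.5, 0.0, 0.0])
far = np.array([0.0, 0.0, 0.3])
rollout = [
    ToyState(far, obj, goal, False, False),    # start: nothing achieved
    ToyState(obj, obj, goal, False, False),    # gripper reaches the object
    ToyState(obj, obj, goal, True, False),     # object grasped
    ToyState(goal, goal, goal, True, False),   # object carried to the goal
    ToyState(goal, goal, goal, True, True),    # task reported as complete
]
rewards = [staged_reward(s) for s in rollout]
print(rewards)  # [0.0, 1.0, 4.0, 9.0, 19.0]
```

The reward rises at every milestone, so a rollout that gets further through the task is distinguishable from one that stalls early even if neither succeeds.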
Component 2: Trajectory-wise GRPO — Dual-Level Advantage
This is the core math. Standard GRPO computes advantage by comparing N parallel outputs against each other. For robotics, TGRPO extends this into 2 levels of advantage:
Step-Level Advantage Sᵢ,ₜ
At each timestep t, we have N trajectories running in parallel. Step-level advantage compares the reward at that specific timestep t across N trajectories:
S_{i,t} = (R_{i,t} - mean({R_{j,t} for j in 1..N})) / std({R_{j,t} for j in 1..N})
"At this specific step, did I get a higher reward than the other parallel trajectories?"
This captures local action quality — was the specific action at timestep t better than the alternatives running in parallel?
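A minimal NumPy sketch of the step-level formula, assuming all N trajectories are padded to the same length T, a population standard deviation, and a small `eps` to avoid division by zero (the function name, `eps`, and the toy reward matrix are assumptions, not details from the paper):

```python
import numpy as np

def step_level_advantage(rewards, eps=1e-8):
    """rewards has shape (N, T); returns S with the same shape."""
    rewards = np.asarray(rewards, dtype=float)
    mean_t = rewards.mean(axis=0, keepdims=True)   # mean over trajectories at each t
    std_t = rewards.std(axis=0, keepdims=True)     # std over trajectories at each t
    return (rewards - mean_t) / (std_t + eps)

# Toy rewards for N=4 trajectories and T=3 timesteps.
R = np.array([
    [0.0, 1.0, 4.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
])
S = step_level_advantage(R)
print(np.round(S, 3))
```

At the first timestep all trajectories have the same reward, so the step-level advantage there is zero for everyone. At later timesteps each column is standardized independently.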
Trajectory-Level Advantage Tᵢ
Looking at the entire trajectory i, compare its cumulative reward R_i against those of all N trajectories:
T_i = (R_i - mean({R_j for j in 1..N})) / std({R_j for j in 1..N}), where R_i = sum of R_{i,t} over all timesteps t
"Looking at the full trajectory, did I do better or worse than the group overall?"
This captures global task success — was this trajectory heading in the right direction overall?
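The trajectory-level term can be sketched the same way. The assumptions here are that the cumulative reward is a plain undiscounted sum over timesteps and that a small `eps` guards the division; the function name and toy rewards are illustrative.

```python
import numpy as np

def trajectory_level_advantage(rewards, eps=1e-8):
    """rewards has shape (N, T); returns T_i with shape (N,)."""
    rewards = np.asarray(rewards, dtype=float)
    returns = rewards.sum(axis=1)                  # cumulative reward per trajectory
    return (returns - returns.mean()) / (returns.std() + eps)

R = np.array([
    [0.0, 1.0, 4.0],   # cumulative reward 5
    [0.0, 0.0, 1.0],   # cumulative reward 1
    [0.0, 1.0, 1.0],   # cumulative reward 2
    [0.0, 0.0, 0.0],   # cumulative reward 0
])
T_adv = trajectory_level_advantage(R)
print(np.round(T_adv, 3))
```

The third trajectory sits exactly at the group mean, so its trajectory-level advantage is zero, while the first one is pushed up and the last one pushed down.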
Fused Advantage
Combine both levels:
Adv_{i,t} = α₁ · S_{i,t} + α₂ · T_i
The paper finds empirically that α₁ = 0.3, α₂ = 0.7 works well as a default, though the best weights vary by task (see Known Limitations). With these values the trajectory-level weight is higher: the "big picture" matters more than any individual step.
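Putting the two levels together, a compact sketch of the fused advantage might look like this. The function name, `eps`, and the toy rewards are assumptions; the weights default to the 0.3 / 0.7 values quoted above.

```python
import numpy as np

def fused_advantage(rewards, alpha_step=0.3, alpha_traj=0.7, eps=1e-8):
    """Adv[i, t] = alpha_step * S[i, t] + alpha_traj * T[i] for rewards of shape (N, T)."""
    rewards = np.asarray(rewards, dtype=float)
    # Step level: standardize each timestep across the N trajectories.
    step = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    # Trajectory level: standardize cumulative rewards across the N trajectories.
    returns = rewards.sum(axis=1)
    traj = (returns - returns.mean()) / (returns.std() + eps)
    # Broadcast the per-trajectory term over every timestep of that trajectory.
    return alpha_step * step + alpha_traj * traj[:, None]

R = np.array([
    [0.0, 1.0, 4.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
])
adv = fused_advantage(R)
print(np.round(adv, 3))
```

Note that even at a timestep where every trajectory earned the same reward (the first column here), the fused advantage is non-zero because the trajectory-level term still separates good rollouts from bad ones.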
Why Dual-Level Beats Single-Level
The ablation study is clear:
| Method | LIBERO-Object avg |
|---|---|
| Step-level only | 73.6% |
| Trajectory-level only | 86.8% |
| TGRPO (both) | 91.0% |
Step-level alone fails because a "randomly good" action at step t doesn't mean the overall strategy is good. Trajectory-level alone lacks local guidance. Combined, both signals reinforce each other.

Full Architecture
OpenVLA-7B (frozen backbone)
└── LoRA adapter (trainable, rank=16)
    ├── SigLIP encoder (visual features)
    ├── DINOv2 encoder (visual features)
    └── Llama2-7B language backbone
Training loop:
For episode = 1..30:
    Sample N=4 parallel trajectories from current policy π_θ
    Collect (s_t, a_t, R_t) for each trajectory, max 200 steps
    LLM Dense Reward:
        R_t = f1(object_pos, keypose) + f2(ee_pose, demo_ee_pose)
    Compute advantages:
        S_{i,t} = normalize_step(R_{i,t})   # across N trajectories at t
        T_i = normalize_traj(sum(R_i))      # across N trajectories total
        Adv_{i,t} = 0.3 * S_{i,t} + 0.7 * T_i
    Update LoRA with AdamW, lr=1e-5:
        L = -E[Adv_{i,t} * log π_θ(a_{i,t} | s_{i,t})]
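The loss in the last line of the loop can be evaluated numerically as follows. This is a NumPy sketch that only computes the scalar; a real training run would compute it in an autodiff framework with the advantages treated as constants. The `mask` argument for padded timesteps is an assumption added here to handle trajectories that end before the step limit.

```python
import numpy as np

def tgrpo_loss(log_probs, advantages, mask):
    """L = -(mean over valid steps of Adv[i, t] * log pi(a[i, t] | s[i, t]))."""
    log_probs = np.asarray(log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mask = np.asarray(mask, dtype=float)           # 1 for real steps, 0 for padding
    weighted = advantages * log_probs * mask
    return -weighted.sum() / max(mask.sum(), 1.0)

# Two trajectories, two timesteps; the second trajectory ended after one step.
log_probs = [[-1.0, -2.0], [-1.0, -1.0]]
advantages = [[1.0, 1.0], [-1.0, -1.0]]
mask = [[1, 1], [1, 0]]
loss = tgrpo_loss(log_probs, advantages, mask)
print(loss)
```

Minimizing this quantity raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.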
Practical Setup
Requirements
- GPU: NVIDIA A100 (or equivalent 40GB+ VRAM)
- LIBERO simulator installed
- OpenVLA-7B checkpoint
- LLM API access (for reward generation)
Setup LIBERO
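# Install LIBERO (shell). If this package name does not resolve in your
# environment, install from the LIBERO GitHub repository instead.
pip install libero
# Download OpenVLA-7B (Python)
from huggingface_hub import snapshot_download
snapshot_download("openvla/openvla-7b", local_dir="./models/openvla-7b")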
Generate Reward Functions with LLM
Before training, generate rewards for each task:
import anthropic

def generate_reward_function(task_description: str, env_code: str) -> str:
    client = anthropic.Anthropic()
    prompt = f"""Based on this task description and LIBERO environment code,
generate a multi-stage reward function for RL training.
Task: {task_description}
Environment code: {env_code}
Requirements:
- Multi-stage rewards with progressively increasing values
- Proximity thresholds for each stage
- High completion bonus at final stage
- End-effector pose shaping from demonstrations
- Return a Python function: def compute_reward(state, info) -> float
"""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
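The function above returns raw text, which may wrap the code in a fenced block or surround it with commentary. One way to handle that, sketched here with only the standard library, is to pull out the first fenced block and check that it parses and defines `compute_reward(state, info)`. The helper names are hypothetical, and this check only validates structure: it says nothing about whether the reward logic is correct, and the code should still be reviewed before it is executed.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

def extract_reward_source(llm_text: str) -> str:
    """Return the first fenced code block, or the raw text if there is none."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, llm_text, re.DOTALL)
    return (match.group(1) if match else llm_text).strip()

def check_reward_source(source: str) -> None:
    """Raise if the source is not valid Python defining compute_reward(state, info)."""
    tree = ast.parse(source)  # raises SyntaxError on invalid Python
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    target = next((f for f in funcs if f.name == "compute_reward"), None)
    if target is None:
        raise ValueError("no top-level compute_reward function found")
    arg_names = [a.arg for a in target.args.args]
    if arg_names != ["state", "info"]:
        raise ValueError(f"unexpected signature: {arg_names}")

sample = "\n".join([
    "Here is the function:",
    FENCE + "python",
    "def compute_reward(state, info):",
    "    return 0.0",
    FENCE,
    "Let me know if you need changes.",
])
source = extract_reward_source(sample)
check_reward_source(source)
print(source)
```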
Training with TGRPO
# NOTE: TGRPOTrainer and TGRPOConfig are shorthand for the training entry
# point; check the TGRPO codebase for the exact interface.
from tgrpo import TGRPOTrainer, TGRPOConfig

config = TGRPOConfig(
    model_path="./models/openvla-7b",
    lora_rank=16,
    learning_rate=1e-5,
    optimizer="adamw",
    # TGRPO specific
    n_parallel=4,      # N parallel trajectories per update
    alpha_step=0.3,    # weight for step-level advantage
    alpha_traj=0.7,    # weight for trajectory-level advantage
    # Training
    max_episodes=30,
    max_steps_per_episode=200,
    n_test_episodes=50,
)
trainer = TGRPOTrainer(
    config=config,
    env_name="libero_object",
    reward_fn=generated_reward_fn,  # callable built from the LLM-generated code
)
trainer.train()
Inference After Fine-tuning
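# NOTE: the OpenVLA wrapper and load_lora call are shorthand; adapt them to
# however your codebase loads the base model and the LoRA weights.
from openvla import OpenVLA

# Load fine-tuned model
model = OpenVLA.from_pretrained("./models/openvla-7b")
model.load_lora("./checkpoints/tgrpo-libero-object")
model.eval()

max_steps = 200  # same limit as max_steps_per_episode during training
# env is assumed to be a LIBERO environment created elsewhere
obs = env.reset()
for step in range(max_steps):
    action = model.predict_action(
        image=obs["rgb"],
        instruction="Pick up the milk and place it in the basket"
    )
    obs, reward, done, info = env.step(action)
    if done:
        break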
Detailed Results
LIBERO-Object (10 tasks)
| Method | Avg Success Rate |
|---|---|
| OpenVLA (zero-shot) | ~60% |
| SFT | 86.4% |
| PPO | 86.6% |
| GRAPE | ~89% |
| TGRPO | 91.0% |
All 4 LIBERO Suites (Table 1)
| Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| SFT | 84.7% | 88.4% | 79.2% | 51.1% | 76.5% |
| GRAPE | 88.5% | 92.1% | 83.1% | 57.2% | 80.2% |
| TGRPO | 90.4% | 92.2% | 81.0% | 59.2% | 80.7% |
Notable: LIBERO-Long (long-horizon tasks) shows the largest TGRPO improvement, +8.1 percentage points over SFT. This is exactly where dual-level advantage shines: trajectory-level advantage gives the robot a "longer view."
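The per-suite gains can be recomputed directly from the SFT and TGRPO rows, using the values exactly as listed in the table above (variable names are illustrative):

```python
suites = ["Spatial", "Object", "Goal", "Long"]
sft = [84.7, 88.4, 79.2, 51.1]
tgrpo = [90.4, 92.2, 81.0, 59.2]

# Gain of TGRPO over SFT in percentage points, rounded to one decimal.
gains = {s: round(t - b, 1) for s, b, t in zip(suites, sft, tgrpo)}
best = max(gains, key=gains.get)

print(gains)
print(best)
```

The gains are 5.7, 3.8, 1.8, and 8.1 points respectively, so the long-horizon suite benefits most and LIBERO-Goal least.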
Group Size N Ablation
| N (parallel trajectories) | LIBERO-Goal success |
|---|---|
| N=2 | 76.2% |
| N=4 | 81.0% |
| N=8 | 80.5% |
N=4 is the sweet spot. N=2 lacks diversity for meaningful comparison. N=8 doubles GPU cost relative to N=4 without improving on it (80.5% vs 81.0%).
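One possible reading of why N=2 underperforms (an interpretation, not a claim from the paper): with only two samples, mean/std normalization collapses every comparison to roughly +1 and -1, so the size of the gap between the two rollouts is lost. The sketch below assumes a population standard deviation and a small `eps`; the function name is illustrative.

```python
import numpy as np

def group_zscore(values, eps=1e-8):
    """Standardize values within one group."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)

small_gap = group_zscore([5.0, 5.1])    # nearly identical rollouts
large_gap = group_zscore([0.0, 19.0])   # one failure, one full success

print(np.round(small_gap, 3))
print(np.round(large_gap, 3))
```

Both pairs produce approximately the same advantages, so a marginal difference is reinforced as strongly as a decisive one. Larger groups give the normalization more values to spread across.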
Known Limitations
1. α₁, α₂ require manual tuning per task: The paper finds that the optimal weights vary significantly by task, and there is no general formula for selecting them automatically. Tasks 1, 7, and 8 prefer α₁=10, α₂=1, while tasks 4, 5, and 6 prefer α₁≈0.3, α₂≈0.7. Real-world deployment needs a grid search.
2. LLM-generated rewards can be wrong: The LLM generates rewards from text descriptions, but environment state space may be more complex than the LLM understands. Always verify reward functions manually before training.
3. Only validated on LIBERO: No results on real robots or other benchmarks (RoboMimic, MetaWorld). Transfer performance is unknown.
4. Requires a good SFT checkpoint: TGRPO fine-tunes from an SFT checkpoint, not from scratch. Weak SFT means RL won't save it.
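For the second limitation, part of the manual verification can be scripted. The sketch below assumes one plausible property of a staged reward, namely that it should never decrease as hand-written probe states move further through the task. `check_monotonic`, `toy_reward`, and the probe dictionaries are all hypothetical, and passing this check does not prove a reward function is correct.

```python
def check_monotonic(reward_fn, probes):
    """probes: list of (label, state, info), ordered from least to most progress."""
    values = [(label, reward_fn(state, info)) for label, state, info in probes]
    for (prev_label, prev), (label, cur) in zip(values, values[1:]):
        if cur < prev:
            raise AssertionError(
                f"reward drops from {prev_label} ({prev}) to {label} ({cur})"
            )
    return values

def toy_reward(state, info):
    """Stand-in for an LLM-generated reward, using plain dicts."""
    reward = 0.0
    if state["dist_to_obj"] < 0.05:
        reward += 1.0
    if info["is_grasped"]:
        reward += 3.0
    if info["task_success"]:
        reward += 10.0
    return reward

probes = [
    ("start", {"dist_to_obj": 0.30}, {"is_grasped": False, "task_success": False}),
    ("near object", {"dist_to_obj": 0.02}, {"is_grasped": False, "task_success": False}),
    ("grasped", {"dist_to_obj": 0.00}, {"is_grasped": True, "task_success": False}),
    ("done", {"dist_to_obj": 0.00}, {"is_grasped": True, "task_success": True}),
]
report = check_monotonic(toy_reward, probes)
print(report)
```

A reward that pays less for a grasped object than for merely hovering near it would raise here, which is the kind of mistake worth catching before spending GPU hours on rollouts.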
When to Use TGRPO
Good fit when:
- You have an SFT-trained VLA that needs a performance boost
- You have a simulator for rollouts (LIBERO, MuJoCo, Isaac Gym)
- Your task has clear sub-stages (grasp → lift → place)
- You want automated reward design instead of hand-crafting
Not ready for:
- No simulator available (real-robot RL is expensive)
- Tasks so sparse that LLM can't generate meaningful dense rewards
- Limited GPU budget (the setup above assumes an A100 or an equivalent GPU with 40GB+ VRAM)
Comparison with Other Approaches
| Method | Reward Design | Advantage | GPU | Performance |
|---|---|---|---|---|
| SFT | None | N/A | Low | 86.4% |
| PPO | Hand-crafted | Value function | High | 86.6% |
| GRAPE | Sparse + shaped | Trajectory-level | Medium | ~89% |
| TGRPO | LLM dense | Step + Trajectory | Medium | 91.0% |
TGRPO hits a good balance: no hand-crafted rewards (uses LLM), no value network (critic-free like GRPO), better GPU efficiency than PPO.

The Bigger Lesson from TGRPO
TGRPO teaches a broader lesson: in robotics RL, reward design and credit assignment are the two hardest problems, and solving both simultaneously is what creates real breakthroughs.
- LLM dense reward solves reward design: instead of hand-crafting, use LLM semantic understanding
- Dual-level advantage solves credit assignment: look at both local (step) and global (trajectory) signals
This trend will expand: instead of engineering rewards, we'll "prompt" them; instead of choosing one RL algorithm, we'll compose advantage signals tailored to specific task structures.
If you're fine-tuning a VLA and want to exceed SFT performance without manual reward engineering — TGRPO is a strong starting point.



