VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam

TGRPO: Fine-tuning VLA with Trajectory GRPO and LLM Dense Reward

TGRPO combines LLM-generated dense rewards with dual-level advantage estimation (step + trajectory) to fine-tune OpenVLA-7B on LIBERO, reaching a 91.0% success rate on LIBERO-Object, 4.6 percentage points above SFT.

Nguyễn Anh Tuấn · July 1, 2026 · 10 min read

You've trained a VLA model, SFT looks great on demos, but the robot fails in the real world. This is a fundamental problem: SFT only imitates success, it never learns from failure. TGRPO (arXiv 2506.08440) from Jilin University proposes a different approach — online RL with dual-level advantages and LLM-generated dense rewards, pushing OpenVLA-7B from 86.4% to 91.0% success rate on LIBERO-Object using a single A100 GPU.

This post breaks down the technical details: why standard GRPO falls short for robotics, how LLM generates dense rewards automatically, and how the dual-level advantage mechanism works.


The Problem: SFT Teaches Imitation, Not Adaptation

Imagine learning to drive by watching 1000 videos of successful drives. You'd learn the pattern, but the first time something unexpected happens — a car cuts you off, the road is wet — you're lost because you've never experienced and recovered from failure.

VLA models trained with SFT have the same problem. When an object is slightly off-position compared to demonstrations, lighting changes, or task order shifts — the policy collapses. Technically: SFT minimizes behavior cloning loss on successful demonstrations only. There's no signal about "why did it fail" or "how to recover."

Reinforcement learning (RL) addresses this by letting the robot actually try things and receive feedback. But applying RL to VLAs faces 3 challenges:

  1. Reward design is hard: Tabletop manipulation involves complex sub-tasks. Using sparse reward (0/1 at episode end) makes credit assignment nearly impossible across 200+ steps.
  2. Long credit assignment horizon: Step 50 might cause success at step 150 — which steps actually mattered?
  3. Standard GRPO wasn't designed for long trajectories: GRPO (Group Relative Policy Optimization) was built for LLM reasoning, where each sample is a single text completion that receives one reward at the end. Robot trajectories span hundreds of action steps, each of which can be rewarded separately.

TGRPO addresses all three simultaneously.

What is TGRPO? — Big Picture First

TGRPO = Trajectory-wise Group Relative Policy Optimization.

Three main components:

[LLM] → Dense Reward Function
           ↓
[Robot runs N=4 parallel trajectories] → collect (state, action, reward) tuples
           ↓
[TGRPO] → step-level + trajectory-level advantage → update policy

The key difference from standard GRPO: GRPO compares at the group level (which trajectory was better). TGRPO adds step-level comparison — at timestep t, which action among N parallel trajectories yielded a higher reward?

Source: TGRPO paper, Jilin University (arXiv 2506.08440)

Component 1: LLM Dense Reward — No More Hand-Crafted Rewards

This is the most practical contribution of the paper. Instead of engineers spending weeks designing reward functions for each task, TGRPO uses an LLM to generate them automatically.

How the LLM Generates Rewards

The LLM receives a prompt containing:

  • Natural language task description (e.g., "Pick up the milk and place it in the basket")
  • LIBERO environment code (so the LLM understands state space, object positions, success conditions)
  • Specific format requirements

Core prompt (from the paper):

"Based on task description, LIBERO environment code, and RL robotics characteristics,
generate a multi-stage reward function where robot receives constant stage-specific
rewards at proximity thresholds, with progressively increasing values per stage
and significantly higher completion rewards."

LLM output is a Python function with a multi-stage structure:

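# Illustrative output: compute_distance, pose_similarity, alpha, and demo_ee_pose
# are assumed to be defined elsewhere in the generated module.
def compute_reward(state, info):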
    reward = 0.0
    
    # Stage 1: Approach object
    dist_to_obj = compute_distance(state.ee_pos, state.obj_pos)
    if dist_to_obj < 0.05:  # 5cm threshold
        reward += 1.0
    
    # Stage 2: Grasp
    if info.is_grasped:
        reward += 3.0
    
    # Stage 3: Transport to goal
    dist_to_goal = compute_distance(state.obj_pos, state.goal_pos)
    if dist_to_goal < 0.1:
        reward += 5.0
    
    # Stage 4: Place (completion)
    if info.task_success:
        reward += 10.0
    
    # End-effector pose shaping from demonstration
    reward += alpha * pose_similarity(state.ee_pose, demo_ee_pose)
    
    return reward

The reward function has two components:

  • f₁(P_object(t), P_pose^k): Object position tracking, which rewards the object for advancing toward the next milestone
  • f₂(P_pose^k, s_t): End-effector pose shaping, which encourages poses similar to demonstrations

Formula: Rₜ = f₁(P_object(t), P_pose^k) + f₂(P_pose^k, s_t)

The key insight: the LLM understands the semantics of "put milk in basket" and automatically infers sub-goals (approach → grasp → lift → place). No engineer needs to hard-code this.
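
A staged reward like the example above can be sanity-checked before any training by replaying a few hand-written snapshots of an idealized successful episode and confirming that the reward never decreases. The sketch below is only an illustration: `staged_reward` and the snapshot values are hypothetical, the thresholds and bonuses mirror the example above, and the pose-shaping term is left out.

```python
import math

def staged_reward(ee_pos, obj_pos, goal_pos, is_grasped, task_success):
    """Staged reward with the same thresholds and bonuses as the example above.

    Hypothetical helper for illustration; the pose-shaping term is omitted.
    """
    reward = 0.0
    if math.dist(ee_pos, obj_pos) < 0.05:    # Stage 1: gripper within 5 cm of object
        reward += 1.0
    if is_grasped:                           # Stage 2: object grasped
        reward += 3.0
    if math.dist(obj_pos, goal_pos) < 0.1:   # Stage 3: object within 10 cm of goal
        reward += 5.0
    if task_success:                         # Stage 4: task completed
        reward += 10.0
    return reward

# Made-up snapshots of an idealized episode (positions in meters)
goal = (0.5, 0.0, 0.0)
snapshots = [
    dict(ee_pos=(0.0, 0.0, 0.3), obj_pos=(0.2, 0.0, 0.0), is_grasped=False, task_success=False),
    dict(ee_pos=(0.2, 0.0, 0.02), obj_pos=(0.2, 0.0, 0.0), is_grasped=False, task_success=False),
    dict(ee_pos=(0.2, 0.0, 0.0), obj_pos=(0.2, 0.0, 0.0), is_grasped=True, task_success=False),
    dict(ee_pos=(0.45, 0.0, 0.0), obj_pos=(0.45, 0.0, 0.0), is_grasped=True, task_success=False),
    dict(ee_pos=(0.5, 0.0, 0.2), obj_pos=(0.5, 0.0, 0.0), is_grasped=False, task_success=True),
]

rewards = [staged_reward(goal_pos=goal, **snap) for snap in snapshots]

# The reward should never drop while the episode makes progress
is_monotonic = all(later >= earlier for earlier, later in zip(rewards, rewards[1:]))
```

In the last snapshot the gripper has released the object, so the approach and grasp bonuses disappear. The reward still rises only because the completion bonus is larger than what was lost, which is the kind of interaction worth checking in any generated reward function.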

Component 2: Trajectory-wise GRPO — Dual-Level Advantage

This is the core math. Standard GRPO computes advantage by comparing N parallel outputs against each other. For robotics, TGRPO extends this into 2 levels of advantage:

Step-Level Advantage Sᵢ,ₜ

At each timestep t, we have N trajectories running in parallel. Step-level advantage compares the reward at that specific timestep t across N trajectories:

S_{i,t} = (R_{i,t} - mean({R_{j,t} for j in 1..N})) / std({R_{j,t} for j in 1..N})

"At this specific step, did I get a higher reward than the other parallel trajectories?"

This captures local action quality — was the specific action at timestep t better than the alternatives running in parallel?
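
A minimal sketch of the step-level normalization, in plain Python. The function name `step_level_advantages`, the `eps` guard against a zero standard deviation, the use of the population standard deviation, and the assumption that all trajectories have equal length are choices made for illustration; the paper may handle these details differently.

```python
from statistics import mean, pstdev

def step_level_advantages(rewards, eps=1e-8):
    """rewards[i][t] is the reward of trajectory i at timestep t.

    Returns S with the same shape, where S[i][t] is the reward at step t
    normalized across the N trajectories. Assumes equal-length trajectories.
    """
    n_traj = len(rewards)
    horizon = len(rewards[0])
    adv = [[0.0] * horizon for _ in range(n_traj)]
    for t in range(horizon):
        column = [rewards[i][t] for i in range(n_traj)]
        mu = mean(column)
        sigma = pstdev(column)
        for i in range(n_traj):
            # eps keeps the division finite when every trajectory got the same reward
            adv[i][t] = (column[i] - mu) / (sigma + eps)
    return adv

# N=4 trajectories, 3 timesteps of staged rewards (made-up numbers)
rewards = [
    [0.0, 1.0, 4.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 4.0],
]
S = step_level_advantages(rewards)
```

At the first timestep every trajectory has the same reward, so all step-level advantages there are zero. This is also why a sparse 0/1 reward gives almost no step-level signal: most timesteps look identical across the group.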

Trajectory-Level Advantage Tᵢ

Looking at the entire trajectory i, compare cumulative reward against all N trajectories:

T_i = (R_i - mean({R_j for j in 1..N})) / std({R_j for j in 1..N})

"Looking at the full trajectory, did I do better or worse than the group overall?"

This captures global task success — was this trajectory heading in the right direction overall?
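
The trajectory-level term can be sketched the same way. Here R_i is taken to be the sum of per-step rewards in trajectory i; the function name `trajectory_level_advantages`, the `eps` guard, and the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev

def trajectory_level_advantages(rewards, eps=1e-8):
    """rewards[i] is the list of per-step rewards of trajectory i.

    Returns one advantage per trajectory: its cumulative reward normalized
    across the group. Trajectories may have different lengths.
    """
    returns = [sum(traj) for traj in rewards]
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(ret - mu) / (sigma + eps) for ret in returns]

# Same made-up group of N=4 trajectories; cumulative rewards are 5, 2, 1, 5
rewards = [
    [0.0, 1.0, 4.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 4.0],
]
T = trajectory_level_advantages(rewards)
```

Trajectories with above-average cumulative reward get a positive value that is shared by every step in that trajectory, and the values sum to roughly zero across the group.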

Fused Advantage

Combine both levels:

Adv_{i,t} = α₁ · S_{i,t} + α₂ · T_i

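The paper selects these weights empirically. A setting of α₁ = 0.3, α₂ = 0.7 works well on several tasks, although the best values vary by task (see Known Limitations). With this setting the trajectory-level term dominates: the "big picture" matters more than any individual step.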
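
As a small worked example of the fusion formula, the sketch below combines already-normalized step-level and trajectory-level advantages. The function name `fuse_advantages` and the toy input values are assumptions for illustration.

```python
def fuse_advantages(step_adv, traj_adv, alpha_step=0.3, alpha_traj=0.7):
    """Adv[i][t] = alpha_step * S[i][t] + alpha_traj * T[i]."""
    return [
        [alpha_step * s + alpha_traj * traj_adv[i] for s in step_adv[i]]
        for i in range(len(step_adv))
    ]

# Two trajectories, two timesteps (toy values)
step_adv = [[1.0, -1.0], [-1.0, 1.0]]   # S[i][t]
traj_adv = [1.0, -1.0]                  # T[i]

adv = fuse_advantages(step_adv, traj_adv)
```

In this toy case the better trajectory keeps a positive advantage (about 0.4) even at the step where it was locally worse, because the trajectory-level term carries the larger weight.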

Why Dual-Level Beats Single-Level

The ablation study is clear:

Method                | LIBERO-Object avg
Step-level only       | 73.6%
Trajectory-level only | 86.8%
TGRPO (both)          | 91.0%

Step-level alone fails because a "randomly good" action at step t doesn't mean the overall strategy is good. Trajectory-level alone lacks local guidance. Combined, both signals reinforce each other.

TGRPO results on LIBERO-Object compared to SFT, PPO, and ablation variants across all 10 tasks
Source: TGRPO paper — LIBERO-Object 10-task benchmark results (arXiv 2506.08440)

Full Architecture

OpenVLA-7B (base weights frozen)
  ├── SigLIP encoder (visual features)
  ├── DINOv2 encoder (visual features)
  ├── Llama 2 7B language backbone
  └── LoRA adapters (trainable, rank=16)

Training loop:
  For episode = 1..30:
    Sample N=4 parallel trajectories from current policy π_θ
    Collect (s_t, a_t, R_t) for each trajectory, max 200 steps
    
    LLM Dense Reward:
      R_t = f1(object_pos, keypose) + f2(ee_pose, demo_ee_pose)
    
    Compute advantages:
      S_{i,t} = normalize_step(R_{i,t})    # across N trajectories at t
      T_i = normalize_traj(sum(R_i))        # across N trajectories total
      Adv_{i,t} = 0.3 * S_{i,t} + 0.7 * T_i
    
    Update LoRA with AdamW, lr=1e-5:
      L = -E[Adv_{i,t} * log π_θ(a_{i,t} | s_{i,t})]
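
The objective on the last line of the loop can be illustrated with plain numbers. This is only a sketch of the arithmetic: real training would compute the log-probabilities with the policy network and differentiate through them, and the function name `tgrpo_surrogate_loss` and the toy values are assumptions.

```python
def tgrpo_surrogate_loss(advantages, log_probs):
    """Mean of -Adv[i][t] * log pi(a[i][t] | s[i][t]) over all steps of all trajectories."""
    total = 0.0
    count = 0
    for adv_traj, logp_traj in zip(advantages, log_probs):
        for adv, logp in zip(adv_traj, logp_traj):
            total += -adv * logp
            count += 1
    return total / count

# Two trajectories, two timesteps (toy values)
advantages = [[1.0, 0.4], [-1.0, -0.4]]
log_probs = [[-0.5, -1.0], [-2.0, -1.0]]

loss = tgrpo_surrogate_loss(advantages, log_probs)
```

Minimizing this quantity raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.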

Practical Setup

Requirements

  • GPU: NVIDIA A100 (or equivalent 40GB+ VRAM)
  • LIBERO simulator installed
  • OpenVLA-7B checkpoint
  • LLM API access (for reward generation)

Setup LIBERO

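# Install LIBERO from source (shell), following the LIBERO repository README
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -r requirements.txt
pip install -e .
pip install huggingface_hub

# Download OpenVLA-7B (Python)
from huggingface_hub import snapshot_download
snapshot_download("openvla/openvla-7b", local_dir="./models/openvla-7b")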

Generate Reward Functions with LLM

Before training, generate rewards for each task:

import anthropic

def generate_reward_function(task_description: str, env_code: str) -> str:
    client = anthropic.Anthropic()
    
    prompt = f"""Based on this task description and LIBERO environment code,
generate a multi-stage reward function for RL training.

Task: {task_description}
Environment code: {env_code}

Requirements:
- Multi-stage rewards with progressively increasing values
- Proximity thresholds for each stage
- High completion bonus at final stage
- End-effector pose shaping from demonstrations
- Return a Python function: def compute_reward(state, info) -> float
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
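
The function above returns raw text, which may wrap the code in a Markdown fence or add commentary around it. One possible way to turn that text into a callable is sketched below. `load_reward_function` is a hypothetical helper, and it uses `exec`, so it should only be run on output you have read and trust.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def load_reward_function(llm_text: str):
    """Extract Python source from an LLM reply and return its compute_reward callable."""
    match = CODE_BLOCK.search(llm_text)
    # Fall back to treating the whole reply as code when no fence is found
    source = match.group(1) if match else llm_text
    namespace = {}
    # exec runs arbitrary code: review the generated source before calling this
    exec(compile(source, "<llm_reward>", "exec"), namespace)
    reward_fn = namespace.get("compute_reward")
    if not callable(reward_fn):
        raise ValueError("LLM output does not define compute_reward(state, info)")
    return reward_fn

# Made-up reply standing in for a real LLM response
reply = (
    "Here is the reward function:\n"
    + FENCE + "python\n"
    + "def compute_reward(state, info):\n"
    + "    return 10.0 if info['task_success'] else 0.0\n"
    + FENCE
)
reward_fn = load_reward_function(reply)
```

Calling `reward_fn(None, {"task_success": True})` on this toy reply returns the completion bonus. A reply that never defines `compute_reward` raises `ValueError` instead of failing later during training.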

Training with TGRPO

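# Illustrative trainer interface: adjust the import and argument names to the TGRPO implementation you use.
from tgrpo import TGRPOTrainer, TGRPOConfig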

config = TGRPOConfig(
    model_path="./models/openvla-7b",
    lora_rank=16,
    learning_rate=1e-5,
    optimizer="adamw",
    
    # TGRPO specific
    n_parallel=4,          # N parallel trajectories per update
    alpha_step=0.3,        # weight for step-level advantage
    alpha_traj=0.7,        # weight for trajectory-level advantage
    
    # Training
    max_episodes=30,
    max_steps_per_episode=200,
    n_test_episodes=50,
)

trainer = TGRPOTrainer(
    config=config,
    env_name="libero_object",
    reward_fn=generated_reward_fn,  # from LLM
)

trainer.train()

Inference After Fine-tuning

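import torch
from peft import PeftModel
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the base model, then attach and merge the fine-tuned LoRA adapter
processor = AutoProcessor.from_pretrained("./models/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "./models/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")
model = PeftModel.from_pretrained(base, "./checkpoints/tgrpo-libero-object").merge_and_unload()
model.eval()

instruction = "Pick up the milk and place it in the basket"
prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"

# Assumptions: env and max_steps come from your LIBERO setup, obs["rgb"] is an
# HxWx3 uint8 array, and unnorm_key names a dataset in the checkpoint's norm stats.
obs = env.reset()
for step in range(max_steps):
    image = Image.fromarray(obs["rgb"])
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = model.predict_action(**inputs, unnorm_key="libero_object", do_sample=False)
    obs, reward, done, info = env.step(action)
    if done:
        break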

Detailed Results

LIBERO-Object (10 tasks)

Method              | Avg Success Rate
OpenVLA (zero-shot) | ~60%
SFT                 | 86.4%
PPO                 | 86.6%
GRAPE               | ~89%
TGRPO               | 91.0%

All 4 LIBERO Suites (Table 1)

Method | Spatial | Object | Goal  | Long  | Average
SFT    | 84.7%   | 88.4%  | 79.2% | 51.1% | 76.5%
GRAPE  | 88.5%   | 92.1%  | 83.1% | 57.2% | 80.2%
TGRPO  | 90.4%   | 92.2%  | 81.0% | 59.2% | 80.7%

Notable: LIBERO-Long (long-horizon tasks) shows the largest TGRPO improvement over SFT, +8.1 percentage points. This is exactly where dual-level advantage shines: trajectory-level advantage gives the robot a "longer view."
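
The per-suite gains can be recomputed directly from the SFT and TGRPO rows of the table. The snippet below is plain arithmetic on those reported numbers; the variable names are illustrative.

```python
# Success rates (%) copied from the SFT and TGRPO rows of the table above
results = {
    "SFT":   {"Spatial": 84.7, "Object": 88.4, "Goal": 79.2, "Long": 51.1},
    "TGRPO": {"Spatial": 90.4, "Object": 92.2, "Goal": 81.0, "Long": 59.2},
}

# Gain of TGRPO over SFT in percentage points, rounded to one decimal
gains = {
    suite: round(results["TGRPO"][suite] - results["SFT"][suite], 1)
    for suite in results["SFT"]
}

# Suite with the largest gain
largest = max(gains, key=gains.get)
```

With these inputs the gains are 5.7 (Spatial), 3.8 (Object), 1.8 (Goal), and 8.1 (Long) percentage points, so `largest` is `"Long"`.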

Group Size N Ablation

N (parallel trajectories) | LIBERO-Goal success
N=2                       | 76.2%
N=4                       | 81.0%
N=8                       | 80.5%

N=4 is the sweet spot. N=2 lacks diversity for meaningful comparison. N=8 doubles GPU cost relative to N=4 and scores slightly lower (80.5% vs 81.0%).

Known Limitations

1. α₁, α₂ require manual tuning per task: The paper finds that optimal weights vary significantly by task, with no general formula for automatic selection. Tasks 1, 7, and 8 use α₁ = 10, α₂ = 1; tasks 4, 5, and 6 use α₁ ≈ 0.3, α₂ ≈ 0.7. A new task therefore needs its own grid search.

2. LLM-generated rewards can be wrong: The LLM generates rewards from text descriptions, but environment state space may be more complex than the LLM understands. Always verify reward functions manually before training.

3. Only validated on LIBERO: No results on real robots or other benchmarks (robomimic, Meta-World). Transfer performance is unknown.

4. Requires a good SFT checkpoint: TGRPO fine-tunes from an SFT checkpoint, not from scratch. Weak SFT means RL won't save it.
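
For the first limitation, a per-task grid search over the two weights might look like the sketch below. `grid_search_alphas` and the `evaluate` callable are hypothetical: in practice `evaluate` would run a full fine-tuning and evaluation cycle for one weight pair, which is expensive, so the grid should stay small. The stand-in scoring function exists only to make the example runnable.

```python
from itertools import product

def grid_search_alphas(evaluate, step_weights, traj_weights):
    """Return (best_score, alpha_step, alpha_traj) over all weight pairs.

    evaluate(alpha_step, alpha_traj) is a hypothetical callable that trains
    and evaluates a policy for one task and returns its success rate.
    """
    best = None
    for alpha_step, alpha_traj in product(step_weights, traj_weights):
        score = evaluate(alpha_step, alpha_traj)
        if best is None or score > best[0]:
            best = (score, alpha_step, alpha_traj)
    return best

def fake_evaluate(alpha_step, alpha_traj):
    # Stand-in for a real train-and-evaluate run; peaks at (0.3, 0.7) by construction
    return 1.0 - abs(alpha_step - 0.3) - abs(alpha_traj - 0.7)

best = grid_search_alphas(fake_evaluate, [0.3, 1.0, 10.0], [0.7, 1.0])
```

The candidate values here include both weight settings mentioned in the first limitation, so a real search would at least cover the two regimes the paper reports.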

When to Use TGRPO

Good fit when:

  • You have an SFT-trained VLA that needs a performance boost
  • You have a simulator for rollouts (LIBERO, MuJoCo, Isaac Gym)
  • Your task has clear sub-stages (grasp → lift → place)
  • You want automated reward design instead of hand-crafting

Not ready for:

  • No simulator available (real-robot RL is expensive)
  • Tasks so sparse that LLM can't generate meaningful dense rewards
  • Limited GPU budget (need at least A100)

Comparison with Other Approaches

Method | Reward design   | Advantage         | GPU cost | LIBERO-Object success
SFT    | None            | N/A               | Low      | 86.4%
PPO    | Hand-crafted    | Value function    | High     | 86.6%
GRAPE  | Sparse + shaped | Trajectory-level  | Medium   | ~89%
TGRPO  | LLM dense       | Step + Trajectory | Medium   | 91.0%

TGRPO hits a good balance: no hand-crafted rewards (uses LLM), no value network (critic-free like GRPO), better GPU efficiency than PPO.

Success case demonstrations of TGRPO policy on LIBERO-Object tasks after fine-tuning
Source: TGRPO paper — success cases across 10 LIBERO-Object tasks

The Bigger Lesson from TGRPO

TGRPO teaches a broader lesson: in robotics RL, reward design and credit assignment are the two hardest problems, and solving both simultaneously is what creates real breakthroughs.

  • LLM dense reward solves reward design: instead of hand-crafting, use LLM semantic understanding
  • Dual-level advantage solves credit assignment: look at both local (step) and global (trajectory) signals

This trend will expand: instead of engineering rewards, we'll "prompt" them; instead of choosing one RL algorithm, we'll compose advantage signals tailored to specific task structures.

If you're fine-tuning a VLA and want to exceed SFT performance without manual reward engineering — TGRPO is a strong starting point.


Related Posts

  • OpenVLA Deep Dive: Architecture and Fine-tuning a 7B-parameter VLA
  • Fine-tuning VLA on LIBERO with RL: Embodied-R1 Practical Guide
  • ProcVLM: Dense Reward via Process Supervision for VLA RL

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
