TGRPO: Fine-tuning VLA with Trajectory GRPO and LLM Dense Reward

You've trained a vision-language-action (VLA) model. Supervised fine-tuning (SFT) looks great on the demos, but the robot fails in the real world. This is a fundamental problem: SFT only imitates success and never learns from failure. TGRPO (arXiv 2506.08440) from Jilin University proposes a different approach: online RL with dual-level advantages and LLM-generated dense rewards, pushing OpenVLA-7B from 86.4% to 91.0% success rate on LIBERO-Object using a single A100 GPU.

This post breaks down the technical details: why standard GRPO falls short for robotics, how an LLM generates dense rewards automatically, and how the dual-level advantage mechanism works.

The Problem: SFT Teaches Imitation, Not Adaptation

Imagine learning to drive by watching 1000 videos of successful drives. You'd learn the pattern, but the first time something unexpected happens — a car cuts you off, the road is wet — you're lost because you've never experienced and recovered from failure.

VLA models trained with SFT have the same problem. When an object is slightly off-position compared to demonstrations, lighting changes, or task order shifts — the policy collapses. Technically: SFT minimizes behavior cloning loss on successful demonstrations only. There's no signal about "why did it fail" or "how to recover."

Reinforcement learning (RL) addresses this by letting the robot actually try things and receive feedback. But applying RL to VLAs faces three challenges:

Reward design is hard: Tabletop manipulation involves complex sub-tasks. A sparse reward (0/1 at episode end) makes credit assignment nearly impossible across episodes of up to 200 steps.
Long credit assignment horizon: Step 50 might cause success at step 150. Which steps actually mattered?
Standard GRPO wasn't designed for long trajectories: GRPO (Group Relative Policy Optimization) was built for LLM reasoning, where each sampled response is typically scored once as a whole. Robot trajectories span hundreds of continuous action steps inside an interactive environment.

TGRPO addresses all three simultaneously.

What is TGRPO? — Big Picture First

TGRPO = Trajectory-wise Group Relative Policy Optimization.

Three main components:

[LLM] → Dense Reward Function
           ↓
[Robot runs N=4 parallel trajectories] → collect (state, action, reward) tuples
           ↓
[TGRPO] → step-level + trajectory-level advantage → update policy

The key difference from standard GRPO: GRPO compares at the group level (which trajectory was better). TGRPO adds step-level comparison — at timestep t, which action among N parallel trajectories yielded a higher reward?

Source: TGRPO paper, Jilin University (arXiv 2506.08440)

Component 1: LLM Dense Reward — No More Hand-Crafted Rewards

This is the most practical contribution of the paper. Instead of engineers spending weeks designing reward functions for each task, TGRPO uses an LLM to generate them automatically.

How the LLM Generates Rewards

The LLM receives a prompt containing:

Natural language task description (e.g., "Pick up the milk and place it in the basket")
LIBERO environment code (so the LLM understands state space, object positions, success conditions)
Specific format requirements

Core prompt (from the paper):

"Based on task description, LIBERO environment code, and RL robotics characteristics,
generate a multi-stage reward function where robot receives constant stage-specific
rewards at proximity thresholds, with progressively increasing values per stage
and significantly higher completion rewards."

LLM output is a Python function with a multi-stage structure:

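import numpy as np

ALPHA = 0.1          # pose-shaping weight (illustrative value)
DEMO_EE_POSE = None  # set to the demonstration key pose before training

def compute_distance(a, b):
    """Euclidean distance between two positions or poses."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def pose_similarity(pose, demo_pose):
    """Illustrative similarity: 1.0 for identical poses, decaying with distance."""
    return float(np.exp(-compute_distance(pose, demo_pose)))

def compute_reward(state, info):
    reward = 0.0

    # Stage 1: Approach object (end effector within 5 cm)
    dist_to_obj = compute_distance(state.ee_pos, state.obj_pos)
    if dist_to_obj < 0.05:
        reward += 1.0

    # Stage 2: Grasp
    if info.is_grasped:
        reward += 3.0

    # Stage 3: Transport to goal (object within 10 cm of the goal)
    dist_to_goal = compute_distance(state.obj_pos, state.goal_pos)
    if dist_to_goal < 0.1:
        reward += 5.0

    # Stage 4: Place (completion), with a significantly higher reward
    if info.task_success:
        reward += 10.0

    # End-effector pose shaping from demonstration
    if DEMO_EE_POSE is not None:
        reward += ALPHA * pose_similarity(state.ee_pose, DEMO_EE_POSE)

    return reward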

The reward function has two components:

f₁(P_object(t), P_pose^k): Object position tracking. Rewards the object for advancing toward the next milestone (key pose k).
f₂(P_pose^k, s_t): End-effector pose shaping. Encourages poses similar to the demonstrations.

Formula: Rₜ = f₁(P_object(t), P_pose^k) + f₂(P_pose^k, s_t)

The key insight: the LLM understands the semantics of "put milk in basket" and automatically infers sub-goals (approach → grasp → lift → place). No engineer needs to hard-code this.

Component 2: Trajectory-wise GRPO — Dual-Level Advantage

This is the core math. Standard GRPO computes advantage by comparing N parallel outputs against each other. For robotics, TGRPO extends this into 2 levels of advantage:

Step-Level Advantage Sᵢ,ₜ

At each timestep t, we have N trajectories running in parallel. Step-level advantage compares the reward at that specific timestep t across N trajectories:

S_{i,t} = (R_{i,t} - mean({R_{j,t} for j in 1..N})) / std({R_{j,t} for j in 1..N})

"At this specific step, did I get a higher reward than the other parallel trajectories?"

This captures local action quality — was the specific action at timestep t better than the alternatives running in parallel?
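To make the step-level formula concrete, here is a minimal sketch in plain Python. The function name `step_level_advantage`, the small `eps` added to the denominator, and the use of the population standard deviation are assumptions for illustration; the formula above does not say how to handle a timestep where all N rewards are equal, which is common with staged rewards.

```python
from statistics import mean, pstdev

def step_level_advantage(rewards, eps=1e-8):
    """rewards[i][t] is the reward of trajectory i at timestep t.

    Returns S with the same shape: each column (fixed t) is normalized
    across the N trajectories. eps is an assumed guard against a zero
    standard deviation; it is not part of the formula in the post.
    """
    n_traj, horizon = len(rewards), len(rewards[0])
    adv = [[0.0] * horizon for _ in range(n_traj)]
    for t in range(horizon):
        column = [rewards[i][t] for i in range(n_traj)]
        mu, sigma = mean(column), pstdev(column)
        for i in range(n_traj):
            adv[i][t] = (column[i] - mu) / (sigma + eps)
    return adv

# N=4 toy trajectories, 3 timesteps each (made-up rewards)
rewards = [
    [0.0, 1.0, 4.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
S = step_level_advantage(rewards)
print(S[0])
```

At t=0 every trajectory has reward 0, so all step-level advantages are 0 there. At t=2 the first trajectory has reached a later stage than the others and gets the largest positive value.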

Trajectory-Level Advantage Tᵢ

Looking at the entire trajectory i, compare its cumulative reward R_i = Σ_t R_{i,t} against all N trajectories:

T_i = (R_i - mean({R_j for j in 1..N})) / std({R_j for j in 1..N})

"Looking at the full trajectory, did I do better or worse than the group overall?"

This captures global task success — was this trajectory heading in the right direction overall?
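The same normalization applied to whole-trajectory returns can be sketched as follows. As before, the function name, the `eps` guard, and the population standard deviation are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev

def trajectory_level_advantage(rewards, eps=1e-8):
    """rewards[i][t] is the reward of trajectory i at timestep t.

    Sums each trajectory's rewards, then normalizes the N returns
    against each other. Returns one value T_i per trajectory.
    """
    returns = [sum(traj) for traj in rewards]
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

# N=4 toy trajectories (made-up rewards); returns are 5, 1, 2, 0
rewards = [
    [0.0, 1.0, 4.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
T = trajectory_level_advantage(rewards)
print(T)
```

The third trajectory's return equals the group mean, so its advantage is 0: the update neither reinforces nor suppresses it on the trajectory-level signal alone.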

Fused Advantage

Combine both levels:

Adv_{i,t} = α₁ · S_{i,t} + α₂ · T_i

The paper reports α₁ = 0.3, α₂ = 0.7 as a good setting for several tasks. There the trajectory-level weight is higher, so the "big picture" matters more than any individual step. The best weights are task-dependent, though (see Known Limitations).
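A small sketch of the fusion step, assuming the step-level and trajectory-level advantages have already been computed. The function name `fuse_advantages` and the toy numbers are hypothetical; the default weights are the 0.3 / 0.7 values quoted above.

```python
def fuse_advantages(step_adv, traj_adv, alpha_step=0.3, alpha_traj=0.7):
    """Adv[i][t] = alpha_step * S[i][t] + alpha_traj * T[i].

    step_adv is an N x T nested list, traj_adv has one value per trajectory.
    The trajectory-level term is broadcast to every timestep of trajectory i.
    """
    return [
        [alpha_step * s + alpha_traj * traj_adv[i] for s in row]
        for i, row in enumerate(step_adv)
    ]

# Two toy trajectories: the first is better overall but worse at the last step
step_adv = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
traj_adv = [1.0, -1.0]
adv = fuse_advantages(step_adv, traj_adv)
print(adv)
```

With these weights, the first trajectory keeps a positive advantage even at the step where it was locally worse, because the trajectory-level term dominates. Swapping the weights would flip that sign.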

Why Dual-Level Beats Single-Level

The ablation study is clear:

Method	LIBERO-Object avg
Step-level only	73.6%
Trajectory-level only	86.8%
TGRPO (both)	91.0%

Step-level alone fails because a "randomly good" action at step t doesn't mean the overall strategy is good. Trajectory-level alone lacks local guidance. Combined, both signals reinforce each other.

Figure: TGRPO results on LIBERO-Object compared to SFT, PPO, and ablation variants across all 10 tasks. Source: TGRPO paper, LIBERO-Object 10-task benchmark results (arXiv 2506.08440).

Full Architecture

OpenVLA-7B (base weights frozen)
  ├── SigLIP encoder (visual features)
  ├── DINOv2 encoder (visual features)
  ├── Llama 2 7B language backbone
  └── LoRA adapters (trainable, rank=16)

Training loop:
  For episode = 1..30:
    Sample N=4 parallel trajectories from current policy π_θ
    Collect (s_t, a_t, R_t) for each trajectory, max 200 steps
    
    LLM Dense Reward:
      R_t = f1(object_pos, keypose) + f2(ee_pose, demo_ee_pose)
    
    Compute advantages:
      S_{i,t} = normalize_step(R_{i,t})    # across N trajectories at t
      T_i = normalize_traj(sum(R_i))        # across N trajectories total
      Adv_{i,t} = 0.3 * S_{i,t} + 0.7 * T_i
    
    Update LoRA with AdamW, lr=1e-5:
      L = -E[Adv_{i,t} * log π_θ(a_{i,t} | s_{i,t})]
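The loss on the last line of the loop can be written out as a plain-Python sketch to show the arithmetic. In real training this would run on tensors in an autodiff framework so gradients flow into the LoRA weights. The `mask` argument is an assumption: it marks which steps are real when trajectories end at different times, a detail the pseudocode above does not cover.

```python
def tgrpo_loss(log_probs, advantages, mask):
    """Policy-gradient loss L = -mean(Adv[i][t] * log pi(a[i][t] | s[i][t])).

    log_probs, advantages, mask are N x T nested lists. mask[i][t] is
    True for real steps and False for padding after an episode ended
    (an assumed convention, not taken from the paper).
    """
    total, count = 0.0, 0
    for i in range(len(log_probs)):
        for t in range(len(log_probs[i])):
            if mask[i][t]:
                total += advantages[i][t] * log_probs[i][t]
                count += 1
    return -total / max(count, 1)

# Two toy trajectories; the second one ended after its first step
log_probs = [[-1.0, -2.0], [-0.5, -0.5]]
advantages = [[1.0, 1.0], [-1.0, -1.0]]
mask = [[True, True], [True, False]]
loss = tgrpo_loss(log_probs, advantages, mask)
print(loss)
```

Minimizing this value raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.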

Practical Setup

Requirements

GPU: NVIDIA A100 (or equivalent 40GB+ VRAM)
LIBERO simulator installed
OpenVLA-7B checkpoint
LLM API access (for reward generation)

Setup LIBERO

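# Install LIBERO from source (shell)
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -r requirements.txt
pip install -e .

# Download OpenVLA-7B (Python)
from huggingface_hub import snapshot_download
snapshot_download("openvla/openvla-7b", local_dir="./models/openvla-7b")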

Generate Reward Functions with LLM

Before training, generate rewards for each task:

import anthropic

def generate_reward_function(task_description: str, env_code: str) -> str:
    client = anthropic.Anthropic()
    
    prompt = f"""Based on this task description and LIBERO environment code,
generate a multi-stage reward function for RL training.

Task: {task_description}
Environment code: {env_code}

Requirements:
- Multi-stage rewards with progressively increasing values
- Proximity thresholds for each stage
- High completion bonus at final stage
- End-effector pose shaping from demonstrations
- Return a Python function: def compute_reward(state, info) -> float
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
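The string returned above is raw model output and may wrap the function in a Markdown code fence or surround it with prose. The following is a minimal sketch, using only the standard library, for extracting the code and checking that it parses and defines `compute_reward` before anything is executed. The helper names `extract_reward_source` and `load_reward_fn` are hypothetical.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code-fence marker

def extract_reward_source(llm_text: str) -> str:
    """Pull the Python source out of an LLM reply and sanity-check it."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(llm_text)
    source = match.group(1) if match else llm_text
    tree = ast.parse(source)  # raises SyntaxError on malformed code
    names = [n.name for n in tree.body if isinstance(n, ast.FunctionDef)]
    if "compute_reward" not in names:
        raise ValueError("no top-level compute_reward function found")
    return source

def load_reward_fn(source: str):
    """Execute reviewed source and return its compute_reward function.

    exec runs arbitrary code: call this only after a human has read
    the generated source.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["compute_reward"]

# Example with a stand-in reply (not real model output)
reply = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def compute_reward(state, info):\n    return 1.0\n"
    + FENCE + "\n"
)
reward_fn = load_reward_fn(extract_reward_source(reply))
print(reward_fn(None, None))
```

A syntax check only catches malformed code. Whether the thresholds and stages make sense for the task still needs a manual read and a few test rollouts.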

Training with TGRPO

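# Illustrative interface: the module, class, and parameter names below are
# placeholders and may not match the official code release.
from tgrpo import TGRPOTrainer, TGRPOConfig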

config = TGRPOConfig(
    model_path="./models/openvla-7b",
    lora_rank=16,
    learning_rate=1e-5,
    optimizer="adamw",
    
    # TGRPO specific
    n_parallel=4,          # N parallel trajectories per update
    alpha_step=0.3,        # weight for step-level advantage
    alpha_traj=0.7,        # weight for trajectory-level advantage
    
    # Training
    max_episodes=30,
    max_steps_per_episode=200,
    n_test_episodes=50,
)

trainer = TGRPOTrainer(
    config=config,
    env_name="libero_object",
    reward_fn=generated_reward_fn,  # from LLM
)

trainer.train()

Inference After Fine-tuning

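# Illustrative wrapper: `OpenVLA` and `load_lora` stand in for your own loading code.
# The released checkpoint loads through Hugging Face transformers
# (AutoModelForVision2Seq with trust_remote_code=True).
from openvla import OpenVLA

# Load fine-tuned model
model = OpenVLA.from_pretrained("./models/openvla-7b")
model.load_lora("./checkpoints/tgrpo-libero-object")
model.eval()

# `env` is a LIBERO environment and `max_steps` the episode step limit,
# both created elsewhere
obs = env.reset()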
for step in range(max_steps):
    action = model.predict_action(
        image=obs["rgb"],
        instruction="Pick up the milk and place it in the basket"
    )
    obs, reward, done, info = env.step(action)
    if done:
        break

Detailed Results

LIBERO-Object (10 tasks)

Method	Avg Success Rate
OpenVLA (zero-shot)	~60%
SFT	86.4%
PPO	86.6%
GRAPE	~89%
TGRPO	91.0%

All 4 LIBERO Suites (Table 1)

Method	Spatial	Object	Goal	Long	Average
SFT	84.7%	88.4%	79.2%	51.1%	76.5%
GRAPE	88.5%	92.1%	83.1%	57.2%	80.2%
TGRPO	90.4%	92.2%	81.0%	59.2%	80.7%

Notable: LIBERO-Long (long-horizon tasks) shows the largest TGRPO improvement over SFT in this table, +8.1 percentage points. This is exactly where dual-level advantage shines: the trajectory-level advantage gives the robot a "longer view." On LIBERO-Goal, GRAPE remains ahead (83.1% vs 81.0%).
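The per-suite gains can be checked directly from the table rows. This short sketch only re-uses the SFT and TGRPO numbers listed above; the variable names are arbitrary.

```python
# Success rates (%) copied from the SFT and TGRPO rows of the table above
sft = {"Spatial": 84.7, "Object": 88.4, "Goal": 79.2, "Long": 51.1}
tgrpo = {"Spatial": 90.4, "Object": 92.2, "Goal": 81.0, "Long": 59.2}

# Gain in percentage points per suite, rounded to one decimal
gains = {suite: round(tgrpo[suite] - sft[suite], 1) for suite in sft}
best = max(gains, key=gains.get)

for suite, gain in gains.items():
    print(f"{suite}: +{gain} points")
print("largest gain:", best)
```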

Group Size N Ablation

N (parallel trajectories)	LIBERO-Goal success
N=2	76.2%
N=4	81.0%
N=8	80.5%

N=4 is the sweet spot. N=2 lacks diversity for meaningful comparison. N=8 doubles the number of rollouts per update and scores slightly lower than N=4 (80.5% vs 81.0%).

Known Limitations

1. α₁, α₂ require manual tuning per task: The paper finds that the optimal weights vary significantly by task, and there is no general formula for selecting them automatically. Tasks 1, 7, 8 use α₁=10, α₂=1 (step-level dominant), while tasks 4, 5, 6 use α₁≈0.3, α₂≈0.7 (trajectory-level dominant). In practice, expect to run a per-task grid search.

2. LLM-generated rewards can be wrong: The LLM generates rewards from text descriptions, but environment state space may be more complex than the LLM understands. Always verify reward functions manually before training.

3. Only validated on LIBERO: No results on real robots or other benchmarks (RoboMimic, MetaWorld). Transfer performance is unknown.

4. Requires a good SFT checkpoint: TGRPO fine-tunes from an SFT checkpoint, not from scratch. Weak SFT means RL won't save it.
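For the first limitation, a per-task grid search over the two weights can be sketched as below. Everything here is hypothetical scaffolding: `evaluate` stands for whatever routine trains with the given weights and returns a success rate, and `fake_evaluate` is a stand-in so the sketch runs on its own.

```python
from itertools import product

def grid_search_alphas(evaluate, step_weights, traj_weights):
    """Try every (alpha_step, alpha_traj) pair and keep the best score.

    evaluate(alpha_step, alpha_traj) -> float is supplied by the caller,
    for example a short training run followed by test episodes.
    Returns (best_score, alpha_step, alpha_traj).
    """
    best = None
    for a_step, a_traj in product(step_weights, traj_weights):
        score = evaluate(a_step, a_traj)
        if best is None or score > best[0]:
            best = (score, a_step, a_traj)
    return best

def fake_evaluate(alpha_step, alpha_traj):
    # Stand-in objective that peaks at (0.3, 0.7); replace with real rollouts
    return 1.0 - abs(alpha_step - 0.3) - abs(alpha_traj - 0.7)

best = grid_search_alphas(fake_evaluate, [0.3, 1.0, 10.0], [0.7, 1.0])
print(best)
```

Each call to `evaluate` is a full fine-tuning run in practice, so the grid should stay small. The candidate values above simply mirror the settings mentioned in the list.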

When to Use TGRPO

Good fit when:

You have an SFT-trained VLA that needs a performance boost
You have a simulator for rollouts (LIBERO, MuJoCo, Isaac Gym)
Your task has clear sub-stages (grasp → lift → place)
You want automated reward design instead of hand-crafting

Not ready for:

No simulator available (real-robot RL is expensive)
Tasks so sparse that LLM can't generate meaningful dense rewards
Limited GPU budget (need at least A100)

Comparison with Other Approaches

Method	Reward Design	Advantage	GPU	Performance
SFT	None	N/A	Low	86.4%
PPO	Hand-crafted	Value function	High	86.6%
GRAPE	Sparse + shaped	Trajectory-level	Medium	~89%
TGRPO	LLM dense	Step + Trajectory	Medium	91.0%

TGRPO hits a good balance: no hand-crafted rewards (uses LLM), no value network (critic-free like GRPO), better GPU efficiency than PPO.

Figure: Success case demonstrations of the TGRPO policy on LIBERO-Object tasks after fine-tuning. Source: TGRPO paper, success cases across 10 LIBERO-Object tasks.

The Bigger Lesson from TGRPO

TGRPO teaches a broader lesson: in robotics RL, reward design and credit assignment are the two hardest problems, and solving both simultaneously is what creates real breakthroughs.

LLM dense reward solves reward design: instead of hand-crafting, use LLM semantic understanding
Dual-level advantage solves credit assignment: look at both local (step) and global (trajectory) signals

This trend will expand: instead of engineering rewards, we'll "prompt" them; instead of choosing one RL algorithm, we'll compose advantage signals tailored to specific task structures.

If you're fine-tuning a VLA and want to exceed SFT performance without manual reward engineering — TGRPO is a strong starting point.

Nguyễn Anh Tuấn
