ROVE: VLA-RL with Human Feedback Loop for Dexterous Humanoid

Imagine you're teaching a robot to fold clothes. It fails — you grab its hand and correct it. But you're not a robotics expert; your correction is a bit clumsy, not perfectly optimal. Can the robot still learn something useful from that imperfect intervention?

This is exactly the problem ROVE (Reinforcement Learning with Optimistic Value Estimation) from XPENG Robotics addresses. Published on arXiv in June 2026, the paper introduces a new direction for training dexterous humanoid manipulation: instead of simply imitating good data, the robot learns to extract high-value behaviors from mixed-quality trajectories.

The Core Problem: Human Interventions Are Imperfect

In robotics, "human-in-the-loop" (HITL) is a popular technique: when the robot struggles, an operator intervenes and corrects it. This intervention data is then used to train a better policy.

That sounds straightforward — but with dexterous-hand humanoids, it becomes far more complex:

Teleoperation gap: Operators control the robot with VR headsets and motion capture, but the human hand and the robot hand differ drastically in structure, and the operator receives no haptic (force) feedback. Every movement is translated imperfectly into the robot's joint space.
Intervention actions are often suboptimal: When rescuing the robot, operators act quickly and awkwardly rather than following an optimal path. If the robot blindly imitates everything, it also learns the bad behaviors.
50 degrees of freedom: XPENG's IRON-R01-1.11 robot has a 50-dimensional proprioceptive state (body joints + dexterous hands). Controlling this entire chain via VR is a massive engineering challenge.

What Is ROVE? The Overall Framework

ROVE is a post-training RL framework — it does not train a VLA from scratch, but rather fine-tunes an existing VLA policy using newly collected intervention data.

ROVE framework overview — source: XPENG Robotics project page

Three core components:

1. Human-in-the-Loop Data Collection Pipeline

Each episode is split into three phases:

Autonomous rollout: The VLA policy runs independently, completing as much of the task as possible.
Intervention-adaptation phase: When the policy is about to fail, an operator takes over. The system detects "failure boundaries" conservatively to intervene before complete collapse.
Intervention-recovery phase: Once the operator stabilizes the situation, the VLA continues completing the remainder of the task.

A key detail: the system uses a command filter to smooth the handover of control — preventing abrupt joint jerks that could damage the robot or introduce noisy data.
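The article does not spell out how the command filter works. One simple way to picture a smooth handover is a linear ramp from the policy's joint command to the operator's command. The sketch below assumes that interpretation; `blend_handover` and `ramp_steps` are hypothetical names and values, not taken from the paper.

```python
import numpy as np

def blend_handover(policy_cmd, operator_cmd, step, ramp_steps=20):
    """Linearly blend from the policy's joint command to the operator's.

    Hypothetical helper: `ramp_steps` is an assumed ramp length, not a
    value from the paper.
    """
    # alpha goes from 0 (all policy) to 1 (all operator) over the ramp
    alpha = min(max(step / ramp_steps, 0.0), 1.0)
    policy_cmd = np.asarray(policy_cmd, dtype=float)
    operator_cmd = np.asarray(operator_cmd, dtype=float)
    return (1.0 - alpha) * policy_cmd + alpha * operator_cmd

policy_cmd = np.zeros(3)     # joint targets from the VLA at takeover
operator_cmd = np.ones(3)    # joint targets from teleoperation
midway = blend_handover(policy_cmd, operator_cmd, step=10)
```

Halfway through the ramp the command is an even mix of the two sources, so no joint sees a step change at the moment of takeover.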

2. Optimistic Value Estimation (OVE)

This is the core theoretical contribution. The fundamental question: how do you evaluate which actions are "worth learning" from mixed-quality data?

OVE uses expectile regression to estimate a value function biased toward higher values:

$$\mathcal{L}_{OVE}(\phi) = \mathbb{E}\left[\left|\tau - \mathbf{1}\{\hat{V}_t - V_\phi(s_t) < 0\}\right|\left(\hat{V}_t - V_\phi(s_t)\right)^2\right]$$

Where:

$\hat{V}_t = \sum_{i=t}^{t+H-1} \gamma^{i-t} r_i + \gamma^H V_{\bar{\phi}}(s_{t+H})$ is the H-step bootstrapped value target
$\tau \in (0.5, 1)$ controls the "degree of optimism" — closer to 1 means the critic is more biased toward high-value trajectories

Default parameters: $\tau = 0.7$, $H = 16$ steps for inference/advantage, $H = 50$ for training.

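The key advantage over Monte-Carlo regression, which fits the average return of the mixed-quality data: the expectile objective weights targets above the current estimate more heavily, so the critic is not pulled down by poor trajectories. And because OVE learns a state-value function rather than a Q-function, it never needs to query out-of-distribution actions, sidestepping the classical offline RL distribution shift problem.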
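To make the "degree of optimism" concrete, here is a small numeric sketch. It computes the expectile of a handful of made-up returns using the same asymmetric weights as the OVE loss. The function name `expectile` and the sample returns are illustrative assumptions, not values from the paper.

```python
import numpy as np

def expectile(samples, tau, iters=100):
    """Return the tau-expectile of `samples` by fixed-point iteration."""
    samples = np.asarray(samples, dtype=float)
    v = samples.mean()
    for _ in range(iters):
        # Same asymmetric weights as the OVE loss: tau above v, 1 - tau below
        w = np.where(samples >= v, tau, 1.0 - tau)
        v = float((w * samples).sum() / w.sum())
    return v

# Hypothetical returns observed from one state: three failures, one success
returns = [0.0, 0.0, 0.0, 1.0]
mean_value = expectile(returns, tau=0.5)        # plain mean: 0.25
optimistic_value = expectile(returns, tau=0.7)  # about 0.4375
```

With τ = 0.5 the estimate is just the average return. With τ = 0.7 the single success counts for more, so the value estimate rises even though most trajectories failed.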

3. Advantage-Conditioned Policy Extraction

After training the critic, ROVE computes the advantage for each step:

$$A_{\phi_k} = \sum_{i=t}^{t+H-1} \gamma^{i-t} r_i + \gamma^H V_{\phi_k}(s_{t+H}) - V_{\phi_k}(s_t)$$

Steps receive binary labels: "improvement" if the advantage exceeds the 70th percentile threshold, "no improvement" otherwise. The actor is then fine-tuned via advantage-conditioned behavior cloning: the label is passed to the policy as a conditioning input, so that at inference the policy can be steered toward "improvement" behavior.

At inference, classifier-free guidance decoding is applied:

$$v_{cfg} = v_{neg} + \beta(v_{pos} - v_{neg})$$

This steers generation toward high-advantage actions without requiring any changes to the network architecture.
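The guidance formula is a one-line extrapolation. The sketch below assumes that $v_{pos}$ and $v_{neg}$ are the policy's outputs under the "improvement" and "no improvement" conditions and that $\beta$ is a guidance scale. The function name `cfg_combine` and the example vectors are hypothetical.

```python
import numpy as np

def cfg_combine(v_neg, v_pos, beta):
    """Compute v_cfg = v_neg + beta * (v_pos - v_neg)."""
    v_neg = np.asarray(v_neg, dtype=float)
    v_pos = np.asarray(v_pos, dtype=float)
    return v_neg + beta * (v_pos - v_neg)

v_neg = np.array([0.0, 1.0])   # hypothetical output under "no improvement"
v_pos = np.array([1.0, 3.0])   # hypothetical output under "improvement"
plain = cfg_combine(v_neg, v_pos, beta=1.0)    # equals v_pos
guided = cfg_combine(v_neg, v_pos, beta=2.0)   # extrapolates past v_pos
```

With β = 1 the result is simply the positively conditioned output. Values above 1 push further away from the negative branch.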

The ROVE Training Loop: Step by Step

Here is the complete algorithm from Algorithm 1 in the paper:

Initialization:

Initial policy $\pi_0$ (a VLA pre-trained on demonstrations)
Demonstration dataset $\mathcal{D}^0_{critic}$

Iterate $k = 0, 1, 2$ (three rounds):

Train critic $V_{\phi_k}$ on $\mathcal{D}^k_{critic}$ using the OVE objective
Compute advantages for all transitions in the dataset
Assign binary labels using the 70th percentile threshold
Fine-tune actor $\pi_{k+1}$ from $\pi_k$ via advantage-conditioned BC
Deploy $\pi_{k+1}$, collect new episodes + interventions → create $\mathcal{D}^{k+1}_{critic}$

Each iteration corresponds to roughly one real-robot data collection session.
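The control flow of the loop can be sketched as plain Python. Every callable here (`train_critic`, `label_steps`, `finetune`, `collect`) is a hypothetical placeholder that you would supply yourself. The sketch also assumes that newly collected episodes are appended to the existing critic dataset, which the article does not state explicitly.

```python
def rove_loop(policy, critic_data, train_critic, label_steps, finetune, collect, rounds=3):
    """Run `rounds` critic/actor/collection cycles; return (policy, critic_data).

    All four callables are hypothetical placeholders supplied by the caller.
    """
    for _ in range(rounds):
        critic = train_critic(critic_data)              # OVE objective
        labels = label_steps(critic, critic_data)       # advantages + percentile labels
        policy = finetune(policy, critic_data, labels)  # advantage-conditioned BC
        # Assumption: new rollouts and interventions are added to the old data
        critic_data = critic_data + collect(policy)
    return policy, critic_data

# Toy stand-ins so the control flow can be executed end to end
policy, data = rove_loop(
    policy=0,
    critic_data=["demo"],
    train_critic=lambda d: len(d),
    label_steps=lambda c, d: [1] * len(d),
    finetune=lambda p, d, l: p + 1,
    collect=lambda p: [f"round{p}"],
)
```

The toy run ends with a policy that has been updated three times and a dataset holding the original demonstrations plus one batch per round.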

Cross-Embodiment Learning: Learning from Human Videos

Another compelling aspect of ROVE: it incorporates 180 egocentric human videos per task. These videos capture humans performing similar manipulations, used to supervise value estimation — especially valuable for long-tailed failure scenarios.

Why does this work? Because human hand postures during recovery situations carry information about the structure of the problem — the robot can learn "when to do what" from this data, even though the robot and human body differ significantly.

Experiments: Two Challenging Real-World Tasks

Task 1: Erase the Whiteboard (contact-rich)

Policy improvement across ROVE iterations — source: XPENG Robotics

Robot: IRON-R01-1.11
Initial data: 225 demonstration episodes
Per iteration: additional 82/71/79 episodes
Result after 3 iterations: 45.0% → 80.0% success rate

This is complex because it requires continuous force contact (correct pressure on the board), grip angle adjustment, and smooth trajectory following.

Task 2: Put Bread into Toaster (fine-grained insertion)

Initial data: 220 demonstration episodes
Per iteration: additional 97/69/104 episodes
Intervention fraction: ~4.53% of total steps
Result after 3 iterations: 56.7% → 86.7% success rate

This task demands extreme precision: the toaster slot has tight tolerances, and the robot must align its dexterous hand in 3D space with minimal force feedback.
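For a sense of the data budget, the figures quoted for both tasks can be added up directly. The snippet below uses only the numbers listed above and assumes the three per-iteration counts correspond to the three rounds in order.

```python
tasks = {
    "whiteboard": {"demos": 225, "per_round": [82, 71, 79], "before": 45.0, "after": 80.0},
    "toaster": {"demos": 220, "per_round": [97, 69, 104], "before": 56.7, "after": 86.7},
}

summary = {}
for name, t in tasks.items():
    collected = sum(t["per_round"])   # episodes added over the three rounds
    summary[name] = {
        "collected": collected,
        "total": t["demos"] + collected,
        "gain_points": round(t["after"] - t["before"], 1),
    }
```

That works out to 232 extra episodes (457 in total) for a 35-point gain on the whiteboard task, and 270 extra episodes (490 in total) for a 30-point gain on the toaster task.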

Comparison Against Baselines

ROVE outperforms all baselines tested:

Method	Whiteboard	Bread Toaster
Base policy (SFT)	45.0%	56.7%
Filtered BC	~52%	~63%
HG-DAgger	Below base	~61%
RECAP	~65%	~72%
ROVE (ours)	80.0%	86.7%

Notably, HG-DAgger falls below the initial policy on the whiteboard task. This happens because HG-DAgger directly imitates every operator intervention, including suboptimal ones.

Why OVE Matters: Ablation Study

Ablation: Value estimation with and without human experience data — source: XPENG Robotics

Key ablations from the paper:

OVE vs. Monte-Carlo regression: OVE consistently wins because it isn't "dragged down" by low-value trajectories.
With vs. without human videos: Human videos significantly improve value estimation quality for rare failure scenarios.
τ sensitivity: τ=0.7 is the right balance — τ=0.5 degenerates to mean regression, τ→1 overfits to best-case trajectories only.

Applying ROVE Principles to Your Own Project

Although ROVE does not have open-source code at the time of writing, its principles can be applied with any VLA:

Step 1: Prepare a Base VLA Policy

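# Assumes you have a VLA (for example OpenVLA, or π0 via LeRobot) pre-trained on demos
# ROVE pipeline requires the VLA to return action distributions

# The import path depends on the LeRobot version: recent releases use
# lerobot.policies.pi0.modeling_pi0, older ones lerobot.common.policies.pi0.modeling_pi0
from lerobot.policies.pi0.modeling_pi0 import PI0Policy

base_policy = PI0Policy.from_pretrained("your-checkpoint")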

Step 2: Collect Human-in-the-Loop Data

# A possible per-episode data structure for ROVE-style training:
# - states: observation at time t
# - actions: action taken (by the robot or by the intervening human)
# - rewards: sparse reward (0/1 task completion)
# - intervention: True if the human intervened at this step

episode = {
    "states": [],        # List[np.ndarray], T entries of shape (state_dim,)
    "actions": [],       # List[np.ndarray], T entries of shape (action_dim,)
    "rewards": [],       # List[float], length T
    "intervention": []   # List[bool], length T
}
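A few small helpers make this structure easier to fill during a rollout and let you check the intervention fraction reported in the experiments. The helper names (`new_episode`, `add_step`, `intervention_fraction`) and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def new_episode():
    """Create an empty episode with the four per-step fields."""
    return {"states": [], "actions": [], "rewards": [], "intervention": []}

def add_step(episode, state, action, reward, intervened):
    """Append one timestep; robot and human actions share the same list."""
    episode["states"].append(np.asarray(state, dtype=np.float32))
    episode["actions"].append(np.asarray(action, dtype=np.float32))
    episode["rewards"].append(float(reward))
    episode["intervention"].append(bool(intervened))

def intervention_fraction(episode):
    """Share of steps that were executed by the human operator."""
    flags = episode["intervention"]
    return sum(flags) / len(flags) if flags else 0.0

episode = new_episode()
for t in range(4):
    # Hypothetical 2-dim state and 1-dim action; the human takes over at step 2 only
    add_step(episode, state=[t, t], action=[0.1], reward=float(t == 3), intervened=(t == 2))

fraction = intervention_fraction(episode)   # 1 of 4 steps
```

Keeping the intervention flag per step, rather than per episode, is what lets you later compare how human and robot steps are labeled.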

Step 3: Train the Critic with OVE

import torch
import torch.nn as nn

class OVECritic(nn.Module):
    def __init__(self, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state):
        return self.net(state).squeeze(-1)

def ove_loss(predictions, targets, tau=0.7):
    """
    Expectile regression loss with optimism parameter tau.
    tau=0.5 → standard mean regression
    tau→1   → learn only from highest-value trajectories
    """
    errors = targets - predictions
    weights = torch.where(errors >= 0,
                          torch.full_like(errors, tau),
                          torch.full_like(errors, 1 - tau))
    return (weights * errors.pow(2)).mean()

def compute_bootstrap_target(rewards, next_values, gamma=0.99, H=16):
    """
    V̂_t = Σ γ^i r_i + γ^H V(s_{t+H})
    """
    discounted_rewards = sum(
        gamma**i * r for i, r in enumerate(rewards[:H])
    )
    return discounted_rewards + gamma**H * next_values

Step 4: Compute Advantages and Assign Labels

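import numpy as np
import torch

def compute_advantages_and_labels(
    states, rewards, critic, gamma=0.99, H=16, percentile=70
):
    """
    Compute H-step advantages and assign binary labels at the given percentile.
    Only the first len(states) - H steps get a label, because the last H
    steps have no s_{t+H} to bootstrap from.
    """
    # Cast to float32 so the input dtype matches the critic's weights
    state_tensor = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    with torch.no_grad():
        values = critic(state_tensor).numpy()

    advantages = []
    for t in range(len(states) - H):
        bootstrap_val = compute_bootstrap_target(
            rewards[t:t+H], values[t+H], gamma, H
        )
        advantages.append(bootstrap_val - values[t])
    advantages = np.asarray(advantages)

    threshold = np.percentile(advantages, percentile)
    labels = (advantages >= threshold).astype(int)

    return advantages, labels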

Step 5: Fine-Tune the Actor with Advantage Conditioning

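def fine_tune_actor_advantage_conditioned(policy, dataset, labels):
    """
    Fine-tune on steps with label=1 (high advantage), conditioning the
    policy on the improvement signal.
    Simplification: a fuller advantage-conditioned BC setup would also train
    on label=0 steps with improvement_signal=False, which is what gives
    classifier-free guidance its negative branch.
    `dataset` yields (state, action) pairs aligned with `labels`;
    `improvement_signal` is a hypothetical keyword of your policy wrapper.
    """
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
    loss_fn = nn.MSELoss()

    for (state, action), label in zip(dataset, labels):
        if label == 0:  # Skip low-advantage steps
            continue

        # Standard behavior cloning loss
        pred_action = policy(state, improvement_signal=True)
        loss = loss_fn(pred_action, action)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()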

Important Regularization

ROVE uses two key techniques to prevent overfitting:

State dropout: 0.3 probability of dropping proprioceptive state → policy doesn't overfit to specific sensor readings.
Gaussian perturbation: Add Gaussian noise (σ=0.4) to proprioceptive inputs → increases robustness.
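Both regularizers are easy to prototype on a single proprioceptive vector. The sketch below makes several assumptions the article does not confirm: that "dropping" means zeroing the whole state, that noise is applied only when the state is not dropped, and that σ = 0.4 is expressed in the same (normalized) units as the state. `regularize_state` is a hypothetical name.

```python
import numpy as np

def regularize_state(state, rng, drop_prob=0.3, sigma=0.4):
    """Zero the state with probability drop_prob, otherwise add Gaussian noise."""
    state = np.asarray(state, dtype=np.float32)
    if rng.random() < drop_prob:
        return np.zeros_like(state)   # assumed meaning of "dropping" the state
    noise = rng.normal(0.0, sigma, size=state.shape).astype(np.float32)
    return state + noise

rng = np.random.default_rng(0)
state = np.ones(50, dtype=np.float32)   # 50-dim proprioceptive state, as in the article
batch = np.stack([regularize_state(state, rng) for _ in range(1000)])

# Fraction of samples whose state was dropped entirely (should be near 0.3)
dropped_fraction = float((batch == 0).all(axis=1).mean())
```

Apply this only during training; at inference the policy should see the unmodified state.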

When to Use ROVE

ROVE is best suited when:

You already have a VLA baseline policy — ROVE is post-training, not from scratch.
The task requires contact or high precision — sparse reward alone is insufficient for standard RL.
You have available operators who can intervene but don't need to be robotics experts.
You want iterative improvement — each round collects ~70-100 additional episodes.

ROVE is less suitable when:

The task is simple enough for dense-reward RL from scratch.
No human operators are available for real-robot interventions.
You need many more than 3 iterations on a tight timeline, since each iteration requires a real-robot data collection session.

Strengths and Limitations

Strengths:

Leverages imperfect interventions — much more realistic than requiring expert demonstrations.
OVE sidesteps the out-of-distribution problem common in offline RL.
Cross-embodiment learning from human videos handles rare failure edge cases.

Current Limitations:

Requires a physical robot for data collection — no sim-to-real pipeline yet.
Performance depends on the quality and quantity of human interventions.
No open-source code at time of writing (June 2026).
Tested on only 2 tasks — generality is unclear.

Conclusion

ROVE answers an important practical question: "How do you turn an operator's clumsiness into valuable training data?" The answer is to use RL with Optimistic Value Estimation to distill value from mixed-quality data, rather than imitating everything or applying rigid filters.

With results of 45%→80% and 57%→87% success rate in just 3 collection rounds, ROVE is a meaningful step toward real-world humanoid deployment — where expert robot operators aren't always available.

Original paper: ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning — Wei Xiao et al., XPENG Robotics, arXiv June 2026

