
Loco-Manipulation: Walking and Manipulating Simultaneously

Techniques for controlling humanoids that both move and manipulate objects — decoupled control, teleoperation, and imitation learning.

Nguyen Anh Tuan · February 23, 2026 · 8 min read

What is Loco-Manipulation?

Most humanoid research focuses on locomotion (walking, running, climbing) OR manipulation (grasping, lifting, object handling). But real-world applications require both simultaneously: robots must walk while carrying objects, move while opening doors, climb stairs while maintaining balance with objects in hand.

This is loco-manipulation — one of the hardest open problems in humanoid robotics today.

Why It's Difficult

  1. Coupling: When an arm lifts a heavy object, the center of mass (CoM) shifts, directly affecting balance and gait stability
  2. Competing objectives: Locomotion needs stable, rhythmic feet; manipulation needs dexterous arms — both sharing the same rigid skeleton
  3. High-dimensional: Humanoids have 30-75 DOF; controlling all of them simultaneously is computationally expensive and algorithmically complex
  4. Contact-rich: Both feet (ground contact) and hands (object contact) create complex dynamics simultaneously
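
To make point 1 concrete, here is a minimal sketch (illustrative numbers, not measurements from any specific robot) of how a held payload shifts the combined center of mass:

```python
# How a held payload shifts the combined center of mass (CoM).
# Illustrative numbers: a 50 kg robot with its CoM centered over the
# feet (x = 0) holds a 5 kg object 0.6 m in front of it.
def combined_com(masses_positions):
    total_mass = sum(m for m, _ in masses_positions)
    return sum(m * x for m, x in masses_positions) / total_mass

robot = (50.0, 0.0)    # (mass in kg, horizontal CoM offset in m)
payload = (5.0, 0.6)   # held at arm's reach

shift = combined_com([robot, payload])  # ~0.055 m forward
```

A ~5.5 cm forward shift is a large fraction of a typical foot-length support margin, which is why the gait controller cannot simply ignore what the arms are carrying.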

Humanoid robot performing loco-manipulation in a real-world environment

Approach 1: Decoupled Upper/Lower Body Control

The simplest idea: separate the control of the upper body (manipulation) and lower body (locomotion). Each part has its own controller, communicating through a lightweight interface.

Mobile-TeleVision Paper

"Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a canonical example of the decoupled approach.

Architecture:

  1. Upper body: IK + motion retargeting from a human operator (teleoperation)
  2. Lower body: RL policy for locomotion, conditioned on upper body motion
  3. PMP (Predictive Motion Priors): A CVAE model predicts future upper body motion, allowing the locomotion policy to "know ahead of time" what the arms will do

Why PMP? If the locomotion policy only sees the current arm state, it is reactive — slow to respond when the arms change suddenly (e.g., lifting a heavy object). PMP enables anticipation — knowing the arms are about to lift something, the legs can pre-adjust posture.

# Simplified decoupled control architecture
class DecoupledController:
    def __init__(self, arm_joints):
        self.upper_body_ik = InverseKinematics(arm_joints)
        self.lower_body_rl = load_policy("locomotion_policy.pt")
        self.pmp = load_model("predictive_motion_prior.pt")
        self.arm_history = []  # Rolling window of recent arm targets

    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for the arms
        arm_targets = self.upper_body_ik.solve(upper_body_target)
        self.arm_history.append(arm_targets)

        # 2. Predict future upper-body motion (CVAE prior)
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=self.arm_history,
        )

        # 3. Lower body: RL policy conditioned on the motion prior
        obs = concat([
            proprioception,       # Joint states
            velocity_command,     # Move command
            motion_prior,         # Predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)

        return concat([arm_targets, leg_actions])

Results: Tested on Fourier GR-1 and Unitree H1 in simulation, and Unitree H1 in the real world. The robot can walk while holding objects, and walk while opening doors.

Advantages of Decoupled Control

  • Easy to debug: Each component can be validated independently
  • Reusable: Locomotion policy can be reused across many manipulation tasks
  • Teleoperation-friendly: Human operator controls only the arms; legs handle themselves automatically

Disadvantages

  • Limited coordination: Hard to perform tasks requiring whole-body synergy (e.g., a powerful throw, which needs leg-driven momentum)
  • Interface bottleneck: Information shared between the two components is limited by interface design
  • Sub-optimal: Cannot simultaneously optimize locomotion and manipulation objectives

Approach 2: Teleoperation + Imitation Learning

Instead of hand-designing controllers, learn from humans:

  1. Teleoperate the humanoid: A human operator drives the robot to perform a task (holding objects, tidying up, etc.)
  2. Collect data: Record all observations and corresponding actions
  3. Train a policy: Use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to replicate the behavior

Teleoperation Systems for Humanoids

TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems:

  • VR headset: Oculus/Meta Quest for head tracking + hand tracking
  • Motion capture: Full-body tracking of the human operator
  • Retargeting: Map human movements to robot motion (accounting for different body proportions and DOF counts)
  • Force feedback: Operator feels resistance from the robot

# Teleoperation data collection pipeline
class TeleoperationCollector:
    def __init__(self, robot, vr_interface):
        self.robot = robot
        self.vr = vr_interface
        self.dataset = []

    def collect_episode(self, task_name, max_steps=5000):
        """Collect one teleoperation episode."""
        episode_data = []

        for _ in range(max_steps):
            if self.vr.episode_done():  # Operator signals completion
                break

            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # Left + right
            body_pose = self.vr.get_body_pose()

            # Retarget human pose to robot joint targets
            robot_targets = retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose,
                },
                robot_model=self.robot.model,
            )

            # Execute on the robot and log the applied action
            observation = self.robot.get_observation()
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()

            # Save data
            episode_data.append({
                'observation': observation,
                'action': action,
                'image': self.robot.get_camera_images(),
            })

        self.dataset.append(episode_data)
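
The `retarget` call above hides much of the difficulty. A common first-order trick (a sketch under the assumption that human and robot share kinematic conventions; real pipelines follow this with IK and joint-limit handling) is to scale human end-effector positions by the ratio of robot to human limb lengths:

```python
# First-order retargeting sketch: scale the human wrist position
# (relative to the shoulder) by the robot/human arm-length ratio,
# then clamp to the robot's reachable workspace.
def retarget_wrist(human_wrist, human_arm_len, robot_arm_len):
    scale = robot_arm_len / human_arm_len
    scaled = [c * scale for c in human_wrist]
    # Never command a point outside the robot's reach sphere
    norm = sum(c * c for c in scaled) ** 0.5
    if norm > robot_arm_len:
        scaled = [c * robot_arm_len / norm for c in scaled]
    return scaled

# Human reaches 0.6 m forward with a 0.7 m arm; robot arm is 0.56 m.
target = retarget_wrist([0.6, 0.0, 0.0], 0.7, 0.56)  # ~[0.48, 0.0, 0.0]
```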

Imitation Learning for Loco-Manipulation

After collecting data, train a policy using:

Behavior Cloning (BC) — simplest approach:

# policy(observation) -> action
loss = MSE(policy(obs), expert_action)
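
As a self-contained toy illustration of that objective (a hypothetical 1-D example, not from any paper): fit a linear policy a = w·s to expert state-action pairs by minimizing the MSE in closed form.

```python
# Toy behavior cloning: fit a linear policy a = w * s to expert
# demonstrations by minimizing mean-squared error (closed-form
# least squares, w* = sum(s*a) / sum(s^2)).
def behavior_clone(states, actions):
    num = sum(s * a for s, a in zip(states, actions))
    den = sum(s * s for s in states)
    return num / den

# Expert demonstrations where the expert always outputs a = 2 * s
states = [0.5, 1.0, 1.5, 2.0]
actions = [1.0, 2.0, 3.0, 4.0]
w = behavior_clone(states, actions)  # -> 2.0
```

Real BC replaces the linear fit with a neural network trained by gradient descent on the same loss, but the objective is identical.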

ACT (Action Chunking with Transformers) — more effective:

  • Predicts a chunk of future actions (e.g., 100 actions at once) instead of 1 action at a time
  • Uses CVAE to model multi-modal behavior
  • State-of-the-art for manipulation tasks
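
The chunking idea can be sketched in a few lines (a toy, not the ACT implementation): the policy emits a chunk of future actions every step, so several overlapping chunks predict an action for the same timestep; one common weighting scheme blends them with exponentially decaying weights ("temporal ensembling"):

```python
import math

# Toy temporal ensembling for action chunking: several overlapping
# chunks each predicted an action for the current timestep; blend
# them with exponential weights w_i = exp(-m * i), where i is the
# age (in steps) of the chunk that made the prediction.
def temporal_ensemble(predictions, m=0.1):
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, predictions)) / total

# Three chunks predicted slightly different actions for this step:
blended = temporal_ensemble([1.0, 1.2, 0.9], m=0.1)
```

Averaging over chunks smooths out single-step prediction noise, which is one reason chunked policies tend to produce less jittery motion than step-by-step BC.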

Diffusion Policy — strongest for multi-modal scenarios:

  • Models the action distribution using a diffusion process
  • Handles multi-modal behaviors (many valid ways to accomplish one task)

See Imitation Learning for Robotics and ACT: Action Chunking with Transformers for deeper coverage.

Teleoperation system for humanoid data collection

Approach 3: End-to-End RL

Instead of separating components, train one single policy for the entire loco-manipulation task.

ALMI — Adversarial Locomotion and Motion Imitation

ALMI uses adversarial training between upper and lower body:

  • Lower body provides robust locomotion
  • Upper body tracks diverse motion targets
  • Adversarial loss ensures both parts "cooperate" well

ResMimic — Residual Learning

ResMimic uses a 2-stage approach:

  1. Stage 1: Train a general motion tracking policy (walking, standing, waving arms, etc.)
  2. Stage 2: Train a residual policy on top of Stage 1 for specific tasks (carrying, opening doors)

The residual policy only needs to learn the difference between general motion and task-specific motion — faster and more stable than training from scratch.

import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.base_policy = load_pretrained("motion_tracking.pt")  # Frozen
        self.residual = nn.Sequential(                            # Trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs):
        with torch.no_grad():
            base_action = self.base_policy(obs)  # Stage-1 policy, not updated

        residual_action = self.residual(obs)

        # Final action = base + small learned correction
        return base_action + 0.1 * residual_action

WholeBodyVLA — Vision-Language-Action

WholeBodyVLA (ICLR 2026) is the latest framework, combining Vision-Language-Action models with humanoid loco-manipulation:

  • Learns from egocentric videos (video from the robot's own viewpoint) — cheap and easy to collect
  • Uses VLA architecture to understand language + images → action
  • Tested on AgiBot X2, outperforming all baselines by 21.3%

This is the hottest direction right now — combining LLM/VLM understanding with whole-body control.

Comparison of Approaches

Approach      | Pros                               | Cons                              | Papers
--------------|------------------------------------|-----------------------------------|------------------------
Decoupled     | Easy to debug, reusable components | Sub-optimal, limited coordination | Mobile-TeleVision
Teleop + IL   | Learns from demos, flexible        | Needs data and skilled operators  | TWIST, Open-TeleVision
End-to-End RL | Optimal, no human in loop          | Hard to train, reward engineering | ALMI
Residual      | Fast convergence, stable           | Requires a good base policy       | ResMimic
VLA           | Language grounding, general        | Compute-intensive, data-hungry    | WholeBodyVLA

Remaining Challenges

1. Dexterous Manipulation While Walking

Most research only demonstrates power grasp (firm grip) while walking. Dexterous manipulation (rotating objects, unscrewing caps, using tools) while moving remains very difficult and largely unsolved.

2. Reactive Balance Under Load

When a robot is carrying a heavy object and gets pushed, it must simultaneously maintain its grip and recover balance. The priority tradeoff between these two competing goals has not been well solved.

3. Long-Horizon Tasks

Tasks like tidying a room or preparing a meal require long-horizon planning — something current policies struggle with. Integration with LLM-based task planning is promising but still nascent.

4. Multi-Contact Scenarios

When a robot interacts with multiple objects simultaneously (holding 2 items, using a foot to pin a door), the number of contact points increases and the dynamics complexity grows exponentially.

Research Directions 2026-2027

  1. VLA + Whole-body: Using language models to command full humanoid body motion
  2. Teleoperation at scale: Collecting data from hundreds of operators, training general-purpose policies
  3. Sim-to-real for manipulation: Sim-to-real for locomotion works well, but manipulation remains hard due to contact-rich dynamics
  4. Multi-humanoid collaboration: Two humanoids coordinating to lift heavy objects or perform joint tasks

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
