humanoid · loco-manipulation · teleoperation · imitation-learning

Loco-Manipulation: Walking and Manipulating Simultaneously

Techniques for controlling humanoids that both move and manipulate objects — decoupled control, teleoperation, and imitation learning.

Nguyen Anh Tuan · February 23, 2026 · 8 min read

What is Loco-Manipulation?

Most humanoid research focuses on locomotion (walking, running, climbing) OR manipulation (grasping, lifting, object handling). But real-world applications require both simultaneously: robots must walk while carrying objects, move while opening doors, climb stairs while maintaining balance with objects in hand.

This is loco-manipulation — one of the hardest open problems in humanoid robotics today.

Why It's Difficult

  1. Coupling: When an arm lifts a heavy object, the center of mass (CoM) shifts, directly affecting balance and gait stability
  2. Competing objectives: Locomotion needs stable, rhythmic feet; manipulation needs dexterous arms — both sharing the same rigid skeleton
  3. High-dimensional: Humanoids have 30-75 DOF; controlling everything simultaneously is computationally and algorithmically very complex
  4. Contact-rich: Both feet (ground contact) and hands (object contact) create complex dynamics simultaneously
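Point 1 can be made concrete with a quick calculation: the combined CoM is the mass-weighted average of the robot's CoM and the payload position. The masses and positions below are purely illustrative.

```python
import numpy as np

def combined_com(robot_mass, robot_com, payload_mass, payload_pos):
    """Mass-weighted average of the robot CoM and the payload position."""
    total = robot_mass + payload_mass
    return (robot_mass * np.asarray(robot_com, dtype=float)
            + payload_mass * np.asarray(payload_pos, dtype=float)) / total

# 47 kg humanoid with CoM at torso height; 5 kg box held 0.4 m in front
com = combined_com(47.0, [0.0, 0.0, 0.9], 5.0, [0.4, 0.0, 1.1])
print(com[0])  # forward CoM shift of roughly 0.04 m
```

Even a few centimeters of forward shift eats into the support-polygon margin, which is exactly why the lower-body controller needs to know what the arms are holding.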

Humanoid robot performing loco-manipulation in a real-world environment

Approach 1: Decoupled Upper/Lower Body Control

The simplest idea: separate the control of the upper body (manipulation) and lower body (locomotion). Each part has its own controller, communicating through a lightweight interface.

Mobile-TeleVision Paper

"Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a canonical example of the decoupled approach.

Architecture:

  1. Upper body: IK + motion retargeting from a human operator (teleoperation)
  2. Lower body: RL policy for locomotion, conditioned on upper body motion
  3. PMP (Predictive Motion Priors): A CVAE model predicts future upper body motion, allowing the locomotion policy to "know ahead of time" what the arms will do

Why PMP? If the locomotion policy only sees the current arm state, it is reactive — slow to respond when the arms change suddenly (e.g., lifting a heavy object). PMP enables anticipation — knowing the arms are about to lift something, the legs can pre-adjust posture.

# Simplified decoupled control architecture (sketch; the IK solver,
# policy loaders, and concat() are assumed helpers)
class DecoupledController:
    def __init__(self, arm_joints):
        self.upper_body_ik = InverseKinematics(arm_joints)
        self.lower_body_rl = load_policy("locomotion_policy.pt")
        self.pmp = load_model("predictive_motion_prior.pt")
        self.arm_history = []  # recent arm states, consumed by the PMP
    
    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for arms
        arm_targets = self.upper_body_ik.solve(upper_body_target)
        self.arm_history.append(arm_targets)
        
        # 2. Predict future upper body motion
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=self.arm_history
        )
        
        # 3. Lower body: RL policy conditioned on the motion prior
        obs = concat([
            proprioception,       # Joint states
            velocity_command,     # Move command
            motion_prior,         # Predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)
        
        return concat([arm_targets, leg_actions])

Results: Tested on Fourier GR-1 and Unitree H1 in simulation, and Unitree H1 in the real world. The robot can walk while holding objects, and walk while opening doors.

Advantages of Decoupled Control

  - Easy to debug: each controller can be developed and tested in isolation
  - Reusable components: the same locomotion policy can be paired with different manipulation front-ends

Disadvantages

  - Sub-optimal: no single objective optimizes whole-body behavior jointly
  - Limited coordination: arms and legs interact only through a lightweight interface, so tightly coupled maneuvers are hard

Approach 2: Teleoperation + Imitation Learning

Instead of hand-designing controllers, learn from humans:

  1. Teleoperate the humanoid: A human operator drives the robot to perform a task (holding objects, tidying up, etc.)
  2. Collect data: Record all observations and corresponding actions
  3. Train a policy: Use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to replicate the behavior

Teleoperation Systems for Humanoids

TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems:

# Teleoperation data collection pipeline (sketch; retarget() and
# the VR/robot APIs are assumed)
class TeleoperationCollector:
    def __init__(self, robot, vr_interface):
        self.robot = robot
        self.vr = vr_interface
        self.dataset = []
    
    def collect_episode(self, task_name):
        """Collect one teleoperation episode."""
        episode_data = []
        
        while not self.vr.task_complete():  # operator signals end of episode
            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # Left + Right
            body_pose = self.vr.get_body_pose()
            
            # Retarget human pose to robot joint targets
            robot_targets = retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose
                },
                robot_model=self.robot.model
            )
            
            # Execute on robot
            observation = self.robot.get_observation()
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()
            
            # Save one (observation, action, image) step
            episode_data.append({
                'observation': observation,
                'action': action,
                'image': self.robot.get_camera_images()
            })
        
        self.dataset.append({'task': task_name, 'steps': episode_data})

Imitation Learning for Loco-Manipulation

After collecting data, train a policy using:

Behavior Cloning (BC) — simplest approach:

# policy(observation) -> action
loss = MSE(policy(obs), expert_action)
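The BC loss above can be fleshed out into a minimal training loop. This is a generic sketch, not code from any of the cited papers; the network size, `obs_dim`, `act_dim`, and the random stand-in dataset are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, act_dim = 64, 19  # placeholder dimensions

# policy(observation) -> action
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stand-in for a teleoperation dataset of (observation, expert action) pairs
obs = torch.randn(512, obs_dim)
expert_action = torch.randn(512, act_dim)

for epoch in range(20):
    optimizer.zero_grad()
    loss = mse(policy(obs), expert_action)  # the BC objective from above
    loss.backward()
    optimizer.step()
```

In practice the pairs come from the teleoperation dataset collected earlier, mini-batched and shuffled rather than fit in one tensor.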

ACT (Action Chunking with Transformers) — more effective: the policy predicts a chunk of future actions at once, which reduces the compounding errors of step-by-step prediction.

Diffusion Policy — strongest for multi-modal scenarios: it models the action distribution with a denoising diffusion process, so it can represent several equally valid ways of performing the same task.
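ACT's chunking-plus-ensembling idea can be illustrated with a toy inference loop. `fake_policy` is a stand-in for a trained chunking model, and all shapes and numbers are illustrative.

```python
import numpy as np

K = 4          # chunk length (actions predicted per query)
act_dim = 2
horizon = 8

def fake_policy(t):
    """Stand-in for a trained chunking policy: K actions starting at step t."""
    return np.stack([np.full(act_dim, float(t + k)) for k in range(K)])

buffers = [[] for _ in range(horizon + K)]
executed = []
for t in range(horizon):
    chunk = fake_policy(t)
    for k in range(K):
        buffers[t + k].append(chunk[k])           # overlapping predictions
    executed.append(np.mean(buffers[t], axis=0))  # temporal ensemble for step t
```

Because several past queries predict an action for the same timestep, averaging them smooths out individual prediction errors — the "temporal ensembling" trick from the ACT paper.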

See Imitation Learning for Robotics and ACT: Action Chunking with Transformers for deeper coverage.

Teleoperation system for humanoid data collection

Approach 3: End-to-End RL

Instead of separating components, train one single policy for the entire loco-manipulation task.

ALMI — Adversarial Locomotion and Motion Imitation

ALMI trains the upper and lower body adversarially: the lower-body policy learns locomotion that stays robust under upper-body motion disturbances, while the upper-body policy learns to track target motions despite the perturbations the legs introduce.

ResMimic — Residual Learning

ResMimic uses a 2-stage approach:

  1. Stage 1: Train a general motion tracking policy (walking, standing, waving arms, etc.)
  2. Stage 2: Train a residual policy on top of Stage 1 for specific tasks (carrying, opening doors)

The residual policy only needs to learn the difference between general motion and task-specific motion — faster and more stable than training from scratch.

import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.base_policy = load_pretrained("motion_tracking.pt")  # Frozen
        self.residual = nn.Sequential(                            # Trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim)
        )
    
    def forward(self, obs):
        with torch.no_grad():
            base_action = self.base_policy(obs)
        
        residual_action = self.residual(obs)
        
        # Final action = base + small residual (scaling keeps early training stable)
        return base_action + 0.1 * residual_action

WholeBodyVLA — Vision-Language-Action

WholeBodyVLA (ICLR 2026) is the latest framework, combining Vision-Language-Action models with humanoid loco-manipulation: language instructions are grounded directly into whole-body robot actions.

This is the hottest direction right now — combining LLM/VLM understanding with whole-body control.

Comparison of Approaches

| Approach | Pros | Cons | Papers |
| --- | --- | --- | --- |
| Decoupled | Easy to debug, reusable components | Sub-optimal, limited coordination | Mobile-TeleVision |
| Teleop + IL | Learns from demos, flexible | Needs data and skilled operators | TWIST, Open-TeleVision |
| End-to-End RL | Optimal, no human in loop | Hard to train, reward engineering | ALMI |
| Residual | Fast convergence, stable | Requires a good base policy | ResMimic |
| VLA | Language grounding, general | Compute-intensive, data-hungry | WholeBodyVLA |

Remaining Challenges

1. Dexterous Manipulation While Walking

Most research only demonstrates power grasp (firm grip) while walking. Dexterous manipulation (rotating objects, unscrewing caps, using tools) while moving remains very difficult and largely unsolved.

2. Reactive Balance Under Load

When a robot is carrying a heavy object and gets pushed, it must simultaneously maintain its grip and recover balance. The priority tradeoff between these two competing goals has not been well solved.

3. Long-Horizon Tasks

Tasks like tidying a room or preparing a meal require long-horizon planning — something current policies struggle with. Integration with LLM-based task planning is promising but still nascent.

4. Multi-Contact Scenarios

When a robot interacts with multiple objects simultaneously (holding 2 items, using a foot to pin a door), the number of contact points increases and the dynamics complexity grows exponentially.
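One way to see the blow-up: if each contact can be in one of a few discrete states (e.g. inactive, sticking, sliding), then n simultaneous contacts produce exponentially many hybrid modes for a planner to reason over. A toy count, with illustrative numbers:

```python
def contact_modes(n_contacts, states_per_contact=3):
    """Number of hybrid contact-mode combinations across all contacts."""
    return states_per_contact ** n_contacts

# 2 feet -> 9 modes; 2 feet + 2 palms + 8 finger contacts -> 3**12 modes
for n in [2, 4, 12]:
    print(n, contact_modes(n))
```

This combinatorial structure is why multi-contact planners rely on mode pruning or contact-implicit optimization rather than enumerating modes exhaustively.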

Research Directions 2026-2027

  1. VLA + Whole-body: Using language models to command full humanoid body motion
  2. Teleoperation at scale: Collecting data from hundreds of operators, training general-purpose policies
  3. Sim-to-real for manipulation: Sim-to-real for locomotion works well, but manipulation remains hard due to contact-rich dynamics
  4. Multi-humanoid collaboration: Two humanoids coordinating to lift heavy objects or perform joint tasks
