What is Loco-Manipulation?
Most humanoid research focuses on locomotion (walking, running, climbing) OR manipulation (grasping, lifting, object handling). But real-world applications require both simultaneously: robots must walk while carrying objects, move while opening doors, climb stairs while maintaining balance with objects in hand.
This is loco-manipulation — one of the hardest open problems in humanoid robotics today.
Why It's Difficult
- Coupling: When an arm lifts a heavy object, the center of mass (CoM) shifts, directly affecting balance and gait stability
- Competing objectives: Locomotion needs stable, rhythmic feet; manipulation needs dexterous arms — both sharing the same rigid skeleton
- High-dimensional: Humanoids have 30-75 DOF; controlling everything simultaneously is computationally and algorithmically very complex
- Contact-rich: Both feet (ground contact) and hands (object contact) create complex dynamics simultaneously
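The CoM coupling in the first bullet can be made concrete with a toy computation: holding a payload at arm's length shifts the combined CoM forward, eating into the margin before the CoM leaves the support polygon. The masses and positions below are illustrative, not from any specific robot.

```python
import numpy as np

# Toy model: robot body as a point mass, payload held out in front.
robot_mass = 50.0          # kg
payload_mass = 5.0         # kg
robot_com = np.array([0.0, 0.0, 0.9])     # m, centered over the feet
payload_pos = np.array([0.4, 0.0, 1.1])   # m, arm extended forward

# Combined CoM is the mass-weighted average of the two point masses.
total_mass = robot_mass + payload_mass
combined_com = (robot_mass * robot_com + payload_mass * payload_pos) / total_mass

# Forward CoM shift the balance controller must now compensate for:
shift_x = combined_com[0] - robot_com[0]
print(f"CoM shifts forward by {shift_x * 100:.1f} cm")  # ~3.6 cm
```

A few centimeters sounds small, but a humanoid foot is only ~20-30 cm long, so even a light payload consumes a meaningful fraction of the stability margin.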
Approach 1: Decoupled Upper/Lower Body Control
The simplest idea: separate the control of the upper body (manipulation) and lower body (locomotion). Each part has its own controller, communicating through a lightweight interface.
Mobile-TeleVision Paper
"Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a canonical example of the decoupled approach.
Architecture:
- Upper body: IK + motion retargeting from a human operator (teleoperation)
- Lower body: RL policy for locomotion, conditioned on upper body motion
- PMP (Predictive Motion Priors): A CVAE model predicts future upper body motion, allowing the locomotion policy to "know ahead of time" what the arms will do
Why PMP? If the locomotion policy only sees the current arm state, it is reactive — slow to respond when the arms change suddenly (e.g., lifting a heavy object). PMP enables anticipation — knowing the arms are about to lift something, the legs can pre-adjust posture.
```python
# Simplified decoupled control architecture
class DecoupledController:
    def __init__(self):
        self.upper_body_ik = InverseKinematics(arm_joints)
        self.lower_body_rl = load_policy("locomotion_policy.pt")
        self.pmp = load_model("predictive_motion_prior.pt")
        self.arm_history = []  # recent arm targets, consumed by the motion prior

    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for arms
        arm_targets = self.upper_body_ik.solve(upper_body_target)
        self.arm_history.append(arm_targets)

        # 2. Predict future upper body motion
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=self.arm_history,
        )

        # 3. Lower body: RL policy conditioned on the motion prior
        obs = concat([
            proprioception,    # joint positions/velocities, IMU
            velocity_command,  # commanded base velocity
            motion_prior,      # predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)
        return concat([arm_targets, leg_actions])
```
Results: Tested on Fourier GR-1 and Unitree H1 in simulation, and Unitree H1 in the real world. The robot can walk while holding objects, and walk while opening doors.
Advantages of Decoupled Control
- Easy to debug: Each component can be validated independently
- Reusable: Locomotion policy can be reused across many manipulation tasks
- Teleoperation-friendly: Human operator controls only the arms; legs handle themselves automatically
Disadvantages
- Limited coordination: Hard to perform tasks requiring whole-body synergy (e.g., throwing far)
- Interface bottleneck: Information shared between the two components is limited by interface design
- Sub-optimal: Cannot simultaneously optimize locomotion and manipulation objectives
Approach 2: Teleoperation + Imitation Learning
Instead of hand-designing controllers, learn from humans:
- Teleoperate the humanoid: A human operator drives the robot to perform a task (holding objects, tidying up, etc.)
- Collect data: Record all observations and corresponding actions
- Train a policy: Use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to replicate the behavior
Teleoperation Systems for Humanoids
TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems:
- VR headset: Oculus/Meta Quest for head tracking + hand tracking
- Motion capture: Full-body tracking of the human operator
- Retargeting: Map human movements to robot motion (accounting for different body proportions and DOF counts)
- Force feedback: Operator feels resistance from the robot
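The retargeting step above must account for the mismatch between human and robot body proportions. A minimal sketch of one common scheme for hand positions, assuming only position retargeting (real systems also retarget orientation and enforce joint limits; all names here are illustrative):

```python
import numpy as np

def retarget_hand_position(human_hand, human_shoulder, robot_shoulder,
                           human_arm_len, robot_arm_len):
    """Scale a human hand position into the robot's reachable workspace.

    Express the hand relative to the shoulder, scale the offset by the
    arm-length ratio, then re-anchor it at the robot's shoulder.
    """
    offset = np.asarray(human_hand) - np.asarray(human_shoulder)
    scale = robot_arm_len / human_arm_len
    return np.asarray(robot_shoulder) + scale * offset
```

For example, a human with a 0.6 m arm pointing straight ahead maps to the robot's 0.5 m arm pointing straight ahead, keeping the gesture's intent while staying inside the robot's workspace.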
```python
# Teleoperation data collection pipeline
class TeleoperationCollector:
    def __init__(self, robot, vr_interface):
        self.robot = robot
        self.vr = vr_interface
        self.dataset = []

    def collect_episode(self, task_name):
        """Collect one teleoperation episode."""
        episode_data = []
        while not self.vr.episode_done():  # operator signals task completion
            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # left + right
            body_pose = self.vr.get_body_pose()

            # Retarget to robot
            robot_targets = retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose,
                },
                robot_model=self.robot.model,
            )

            # Execute on robot
            observation = self.robot.get_observation()
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()

            # Save data
            episode_data.append({
                'task': task_name,
                'observation': observation,
                'action': action,
                'image': self.robot.get_camera_images(),
            })
        self.dataset.append(episode_data)
```
Imitation Learning for Loco-Manipulation
After collecting data, train a policy using:
Behavior Cloning (BC) — simplest approach:
```python
# policy(observation) -> action
loss = MSE(policy(obs), expert_action)
```
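Fleshed out, one BC training step in PyTorch looks like the following. The network and dimensions are toy placeholders, not from any of the papers above:

```python
import torch
import torch.nn as nn

obs_dim, action_dim = 48, 19   # illustrative dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                       nn.Linear(128, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def bc_step(obs_batch, expert_actions):
    """One behavior-cloning gradient step on a batch of demonstrations."""
    pred = policy(obs_batch)
    loss = loss_fn(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

BC's weakness is compounding error: once the policy drifts off the demonstration distribution, it sees states the expert never visited, which is exactly what ACT and Diffusion Policy mitigate.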
ACT (Action Chunking with Transformers) — more effective:
- Predicts a chunk of future actions (e.g., 100 actions at once) instead of 1 action at a time
- Uses CVAE to model multi-modal behavior
- State-of-the-art for manipulation tasks
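The chunking idea can be sketched as follows: the policy emits a chunk of H future actions each timestep, and overlapping predictions for the same timestep are averaged (temporal ensembling, as in the ACT paper). The class and variable names here are illustrative, not from the ACT codebase:

```python
import numpy as np

class ChunkedExecutor:
    """Executes actions from overlapping chunks via temporal ensembling."""

    def __init__(self, policy, horizon, action_dim):
        self.policy = policy      # obs -> array of shape (horizon, action_dim)
        self.H = horizon
        self.buffer = {}          # timestep -> list of predicted actions
        self.t = 0

    def step(self, obs):
        chunk = self.policy(obs)  # predict H future actions at once
        for k in range(self.H):
            self.buffer.setdefault(self.t + k, []).append(chunk[k])
        # Average every past prediction that targets the current timestep
        action = np.mean(self.buffer.pop(self.t), axis=0)
        self.t += 1
        return action
```

Averaging over chunks predicted at different times smooths the executed trajectory and reduces sensitivity to any single bad prediction.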
Diffusion Policy — strongest for multi-modal scenarios:
- Models the action distribution using a diffusion process
- Handles multi-modal behaviors (many valid ways to accomplish one task)
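The core of a diffusion policy at inference time is an iterative denoising loop: start from Gaussian noise and repeatedly apply a learned noise predictor. A DDPM-style sketch with the noise network stubbed out as any callable (schedules simplified for clarity; not a specific paper's implementation):

```python
import numpy as np

def sample_action(eps_model, obs, action_dim, n_steps=50, rng=None):
    """Sample one action by ancestral (DDPM-style) denoising.

    eps_model(obs, a_t, t) predicts the noise component of a_t,
    conditioned on the observation.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, n_steps)   # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)        # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(obs, a, t)
        # Posterior mean of a_{t-1} given a_t and the predicted noise
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return a
```

Because sampling starts from different noise each time, the same observation can yield different but individually valid actions, which is exactly what makes diffusion policies suited to multi-modal demonstration data.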
See Imitation Learning for Robotics and ACT: Action Chunking with Transformers for deeper coverage.
Approach 3: End-to-End RL
Instead of separating components, train one single policy for the entire loco-manipulation task.
ALMI — Adversarial Locomotion and Motion Imitation
ALMI uses adversarial training between upper and lower body:
- Lower body provides robust locomotion
- Upper body tracks diverse motion targets
- Adversarial loss ensures both parts "cooperate" well
ResMimic — Residual Learning
ResMimic uses a 2-stage approach:
- Stage 1: Train a general motion tracking policy (walking, standing, waving arms, etc.)
- Stage 2: Train a residual policy on top of Stage 1 for specific tasks (carrying, opening doors)
The residual policy only needs to learn the difference between general motion and task-specific motion — faster and more stable than training from scratch.
```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.base_policy = load_pretrained("motion_tracking.pt")  # frozen
        self.residual = nn.Sequential(  # trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs):
        with torch.no_grad():  # base policy stays frozen
            base_action = self.base_policy(obs)
        residual_action = self.residual(obs)
        # Final action = base + small residual correction
        return base_action + 0.1 * residual_action
```
WholeBodyVLA — Vision-Language-Action
WholeBodyVLA (ICLR 2026) is the latest framework, combining Vision-Language-Action models with humanoid loco-manipulation:
- Learns from egocentric videos (video from the robot's own viewpoint) — cheap and easy to collect
- Uses VLA architecture to understand language + images → action
- Tested on AgiBot X2, outperforming all baselines by 21.3%
This is the hottest direction right now — combining LLM/VLM understanding with whole-body control.
Comparison of Approaches
| Approach | Pros | Cons | Papers |
|---|---|---|---|
| Decoupled | Easy to debug, reusable components | Sub-optimal, limited coordination | Mobile-TeleVision |
| Teleop + IL | Learns from demos, flexible | Needs data and skilled operators | TWIST, Open-TeleVision |
| End-to-End RL | Optimal, no human in loop | Hard to train, reward engineering | ALMI |
| Residual | Fast convergence, stable | Requires a good base policy | ResMimic |
| VLA | Language grounding, general | Compute-intensive, data-hungry | WholeBodyVLA |
Remaining Challenges
1. Dexterous Manipulation While Walking
Most research only demonstrates power grasp (firm grip) while walking. Dexterous manipulation (rotating objects, unscrewing caps, using tools) while moving remains very difficult and largely unsolved.
2. Reactive Balance Under Load
When a robot is carrying a heavy object and gets pushed, it must simultaneously maintain its grip and recover balance. The priority tradeoff between these two competing goals has not been well solved.
3. Long-Horizon Tasks
Tasks like tidying a room or preparing a meal require long-horizon planning — something current policies struggle with. Integration with LLM-based task planning is promising but still nascent.
4. Multi-Contact Scenarios
When a robot interacts with multiple objects simultaneously (holding 2 items, using a foot to pin a door), the number of contact points increases and the dynamics complexity grows exponentially.
Research Directions 2026-2027
- VLA + Whole-body: Using language models to command full humanoid body motion
- Teleoperation at scale: Collecting data from hundreds of operators, training general-purpose policies
- Sim-to-real for manipulation: Sim-to-real for locomotion works well, but manipulation remains hard due to contact-rich dynamics
- Multi-humanoid collaboration: Two humanoids coordinating to lift heavy objects or perform joint tasks
Next in Series
- Part 4: RL for Humanoid: From Humanoid-Gym to Sim2Real
- Part 6: The Future of Humanoids: Opportunities for Robotics Engineers
Related Articles
- Imitation Learning for Robotics — Behavior Cloning, DAgger, and more
- ACT: Action Chunking with Transformers — State-of-the-art manipulation policy
- Diffusion Policy for Robot Control — Diffusion-based control
- Foundation Models for Robots: RT-2, Octo, OpenVLA — VLA models overview
- RL for Bipedal Walking — RL locomotion fundamentals