What is Loco-Manipulation?
Most humanoid research focuses on locomotion (walking, running, climbing) OR manipulation (grasping, lifting, object handling). But real-world applications require both simultaneously: robots must walk while carrying objects, move while opening doors, climb stairs while maintaining balance with objects in hand.
This is loco-manipulation — one of the hardest open problems in humanoid robotics today.
Why It's Difficult
- Coupling: When an arm lifts a heavy object, the center of mass (CoM) shifts, directly affecting balance and gait stability
- Competing objectives: Locomotion needs stable, rhythmic feet; manipulation needs dexterous arms — both sharing the same rigid skeleton
- High-dimensional: Humanoids have 30-75 DOF; controlling everything simultaneously is computationally and algorithmically very complex
- Contact-rich: Both feet (ground contact) and hands (object contact) create complex dynamics simultaneously
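The CoM coupling in the first bullet can be made concrete with a toy computation: holding a payload at arm's length shifts the combined CoM forward, eating into the margin before the CoM leaves the support polygon. The masses and positions below are illustrative, not from any specific robot.

```python
import numpy as np

# Toy model: robot body as a point mass, payload held out in front.
robot_mass = 50.0          # kg
payload_mass = 5.0         # kg
robot_com = np.array([0.0, 0.0, 0.9])     # m, centered over the feet
payload_pos = np.array([0.4, 0.0, 1.1])   # m, arm extended forward

# Combined CoM is the mass-weighted average of the two point masses.
total_mass = robot_mass + payload_mass
combined_com = (robot_mass * robot_com + payload_mass * payload_pos) / total_mass

# Forward CoM shift the balance controller must now compensate for:
shift_x = combined_com[0] - robot_com[0]
print(f"CoM shifts forward by {shift_x * 100:.1f} cm")  # ~3.6 cm
```

A few centimeters sounds small, but a humanoid foot is only ~20-30 cm long, so even a light payload consumes a meaningful fraction of the stability margin.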
Approach 1: Decoupled Upper/Lower Body Control
The simplest idea: separate the control of the upper body (manipulation) and lower body (locomotion). Each part has its own controller, communicating through a lightweight interface.
Mobile-TeleVision Paper
"Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a canonical example of the decoupled approach.
Architecture:
- Upper body: IK + motion retargeting from a human operator (teleoperation)
- Lower body: RL policy for locomotion, conditioned on upper body motion
- PMP (Predictive Motion Priors): A CVAE model predicts future upper body motion, allowing the locomotion policy to "know ahead of time" what the arms will do
Why PMP? If the locomotion policy only sees the current arm state, it is reactive — slow to respond when the arms change suddenly (e.g., lifting a heavy object). PMP enables anticipation — knowing the arms are about to lift something, the legs can pre-adjust posture.
```python
# Simplified decoupled control architecture
class DecoupledController:
    def __init__(self):
        self.upper_body_ik = InverseKinematics(arm_joints)
        self.lower_body_rl = load_policy("locomotion_policy.pt")
        self.pmp = load_model("predictive_motion_prior.pt")
        self.arm_history = []  # recent arm targets, consumed by the motion prior

    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for arms
        arm_targets = self.upper_body_ik.solve(upper_body_target)
        self.arm_history.append(arm_targets)

        # 2. Predict future upper body motion
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=self.arm_history,
        )

        # 3. Lower body: RL policy conditioned on the motion prior
        obs = concat([
            proprioception,    # joint positions/velocities, IMU
            velocity_command,  # commanded base velocity
            motion_prior,      # predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)
        return concat([arm_targets, leg_actions])
```
Results: Tested on Fourier GR-1 and Unitree H1 in simulation, and Unitree H1 in the real world. The robot can walk while holding objects, and walk while opening doors.
Advantages of Decoupled Control
- Easy to debug: Each component can be validated independently
- Reusable: Locomotion policy can be reused across many manipulation tasks
- Teleoperation-friendly: Human operator controls only the arms; legs handle themselves automatically
Disadvantages
- Limited coordination: Hard to perform tasks requiring whole-body synergy (e.g., throwing far)
- Interface bottleneck: Information shared between the two components is limited by interface design
- Sub-optimal: Cannot simultaneously optimize locomotion and manipulation objectives
Approach 2: Teleoperation + Imitation Learning
Instead of hand-designing controllers, learn from humans:
- Teleoperate the humanoid: A human operator drives the robot to perform a task (holding objects, tidying up, etc.)
- Collect data: Record all observations and corresponding actions
- Train a policy: Use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to replicate the behavior
Teleoperation Systems for Humanoids
TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems:
- VR headset: Oculus/Meta Quest for head tracking + hand tracking
- Motion capture: Full-body tracking of the human operator
- Retargeting: Map human movements to robot motion (accounting for different body proportions and DOF counts)
- Force feedback: Operator feels resistance from the robot
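The retargeting step above must account for the mismatch between human and robot body proportions. A minimal sketch of one common scheme for hand positions, assuming only position retargeting (real systems also retarget orientation and enforce joint limits; all names here are illustrative):

```python
import numpy as np

def retarget_hand_position(human_hand, human_shoulder, robot_shoulder,
                           human_arm_len, robot_arm_len):
    """Scale a human hand position into the robot's reachable workspace.

    Express the hand relative to the shoulder, scale the offset by the
    arm-length ratio, then re-anchor it at the robot's shoulder.
    """
    offset = np.asarray(human_hand) - np.asarray(human_shoulder)
    scale = robot_arm_len / human_arm_len
    return np.asarray(robot_shoulder) + scale * offset
```

For example, a human with a 0.6 m arm pointing straight ahead maps to the robot's 0.5 m arm pointing straight ahead, keeping the gesture's intent while staying inside the robot's workspace.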
```python
# Teleoperation data collection pipeline
class TeleoperationCollector:
    def __init__(self, robot, vr_interface):
        self.robot = robot
        self.vr = vr_interface
        self.dataset = []

    def collect_episode(self, task_name):
        """Collect one teleoperation episode."""
        episode_data = []
        while not self.vr.episode_done():  # operator signals task completion
            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # left + right
            body_pose = self.vr.get_body_pose()

            # Retarget to robot
            robot_targets = retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose,
                },
                robot_model=self.robot.model,
            )

            # Execute on robot
            observation = self.robot.get_observation()
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()

            # Save data
            episode_data.append({
                'task': task_name,
                'observation': observation,
                'action': action,
                'image': self.robot.get_camera_images(),
            })
        self.dataset.append(episode_data)
```
Imitation Learning for Loco-Manipulation
After collecting data, train a policy using:
Behavior Cloning (BC) — simplest approach:
```python
# policy(observation) -> action
loss = MSE(policy(obs), expert_action)
```
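Fleshed out, one BC training step in PyTorch looks like the following. The network and dimensions are toy placeholders, not from any of the papers above:

```python
import torch
import torch.nn as nn

obs_dim, action_dim = 48, 19   # illustrative dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                       nn.Linear(128, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def bc_step(obs_batch, expert_actions):
    """One behavior-cloning gradient step on a batch of demonstrations."""
    pred = policy(obs_batch)
    loss = loss_fn(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

BC's weakness is compounding error: once the policy drifts off the demonstration distribution, it sees states the expert never visited, which is exactly what ACT and Diffusion Policy mitigate.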
ACT (Action Chunking with Transformers) — more effective:
- Predicts a chunk of future actions (e.g., 100 actions at once) instead of 1 action at a time
- Uses CVAE to model multi-modal behavior
- State-of-the-art for manipulation tasks
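The chunking idea can be sketched as follows: the policy emits a chunk of H future actions each timestep, and overlapping predictions for the same timestep are averaged (temporal ensembling, as in the ACT paper). The class and variable names here are illustrative, not from the ACT codebase:

```python
import numpy as np

class ChunkedExecutor:
    """Executes actions from overlapping chunks via temporal ensembling."""

    def __init__(self, policy, horizon, action_dim):
        self.policy = policy      # obs -> array of shape (horizon, action_dim)
        self.H = horizon
        self.buffer = {}          # timestep -> list of predicted actions
        self.t = 0

    def step(self, obs):
        chunk = self.policy(obs)  # predict H future actions at once
        for k in range(self.H):
            self.buffer.setdefault(self.t + k, []).append(chunk[k])
        # Average every past prediction that targets the current timestep
        action = np.mean(self.buffer.pop(self.t), axis=0)
        self.t += 1
        return action
```

Averaging over chunks predicted at different times smooths the executed trajectory and reduces sensitivity to any single bad prediction.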
Diffusion Policy — strongest for multi-modal scenarios:
- Models the action distribution using a diffusion process
- Handles multi-modal behaviors (many valid ways to accomplish one task)
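The core of a diffusion policy at inference time is an iterative denoising loop: start from Gaussian noise and repeatedly apply a learned noise predictor. A DDPM-style sketch with the noise network stubbed out as any callable (schedules simplified for clarity; not a specific paper's implementation):

```python
import numpy as np

def sample_action(eps_model, obs, action_dim, n_steps=50, rng=None):
    """Sample one action by ancestral (DDPM-style) denoising.

    eps_model(obs, a_t, t) predicts the noise component of a_t,
    conditioned on the observation.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, n_steps)   # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)        # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(obs, a, t)
        # Posterior mean of a_{t-1} given a_t and the predicted noise
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return a
```

Because sampling starts from different noise each time, the same observation can yield different but individually valid actions, which is exactly what makes diffusion policies suited to multi-modal demonstration data.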
See Imitation Learning for Robotics and ACT: Action Chunking with Transformers for deeper coverage.
Approach 3: End-to-End RL
Instead of separating components, train one single policy for the entire loco-manipulation task.
ALMI — Adversarial Locomotion and Motion Imitation
ALMI uses adversarial training between upper and lower body:
- Lower body provides robust locomotion
- Upper body tracks diverse motion targets
- Adversarial loss ensures both parts "cooperate" well
ResMimic — Residual Learning
ResMimic uses a 2-stage approach:
- Stage 1: Train a general motion tracking policy (walking, standing, waving arms, etc.)
- Stage 2: Train a residual policy on top of Stage 1 for specific tasks (carrying, opening doors)
The residual policy only needs to learn the difference between general motion and task-specific motion — faster and more stable than training from scratch.
```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.base_policy = load_pretrained("motion_tracking.pt")  # frozen
        self.residual = nn.Sequential(  # trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs):
        with torch.no_grad():  # base policy stays frozen
            base_action = self.base_policy(obs)
        residual_action = self.residual(obs)
        # Final action = base + small residual correction
        return base_action + 0.1 * residual_action
```
WholeBodyVLA — Vision-Language-Action
WholeBodyVLA (ICLR 2026) is the latest framework, combining Vision-Language-Action models with humanoid loco-manipulation:
- Learns from egocentric videos (video from the robot's own viewpoint) — cheap and easy to collect
- Uses VLA architecture to understand language + images → action
- Tested on AgiBot X2, outperforming all baselines by 21.3%
This is the hottest direction right now — combining LLM/VLM understanding with whole-body control.
Comparison of Approaches
| Approach | Pros | Cons | Papers |
|---|---|---|---|
| Decoupled | Easy to debug, reusable components | Sub-optimal, limited coordination | Mobile-TeleVision |
| Teleop + IL | Learns from demos, flexible | Needs data and skilled operators | TWIST, Open-TeleVision |
| End-to-End RL | Optimal, no human in loop | Hard to train, reward engineering | ALMI |
| Residual | Fast convergence, stable | Requires a good base policy | ResMimic |
| VLA | Language grounding, general | Compute-intensive, data-hungry | WholeBodyVLA |
Remaining Challenges
1. Dexterous Manipulation While Walking
Most research only demonstrates power grasp (firm grip) while walking. Dexterous manipulation (rotating objects, unscrewing caps, using tools) while moving remains very difficult and largely unsolved.
2. Reactive Balance Under Load
When a robot is carrying a heavy object and gets pushed, it must simultaneously maintain its grip and recover balance. The priority tradeoff between these two competing goals has not been well solved.
3. Long-Horizon Tasks
Tasks like tidying a room or preparing a meal require long-horizon planning — something current policies struggle with. Integration with LLM-based task planning is promising but still nascent.
4. Multi-Contact Scenarios
When a robot interacts with multiple objects simultaneously (holding 2 items, using a foot to pin a door), the number of contact points increases and the dynamics complexity grows exponentially.
Research Directions 2026-2027
- VLA + Whole-body: Using language models to command full humanoid body motion
- Teleoperation at scale: Collecting data from hundreds of operators, training general-purpose policies
- Sim-to-real for manipulation: Sim-to-real for locomotion works well, but manipulation remains hard due to contact-rich dynamics
- Multi-humanoid collaboration: Two humanoids coordinating to lift heavy objects or perform joint tasks
Next in Series
- Part 4: RL for Humanoid: From Humanoid-Gym to Sim2Real
- Part 6: The Future of Humanoids: Opportunities for Robotics Engineers
Related Articles
- Imitation Learning for Robotics — Behavior Cloning, DAgger, and more
- ACT: Action Chunking with Transformers — State-of-the-art manipulation policy
- Diffusion Policy for Robot Control — Diffusion-based control
- Foundation Models for Robots: RT-2, Octo, OpenVLA — VLA models overview
- RL for Bipedal Walking — RL locomotion fundamentals