VnRobo
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with♥in Vietnam
Tags: humanoid, loco-manipulation, teleoperation, imitation-learning

Loco-Manipulation: Walking and Manipulating Simultaneously

Techniques for controlling humanoids that both move and manipulate objects — decoupled control, teleoperation, and imitation learning.

Nguyen Anh Tuan · February 23, 2026 · 8 min read · Updated: Jun 14, 2026

What is Loco-Manipulation?

Most humanoid research focuses on either locomotion (walking, running, climbing) or manipulation (grasping, lifting, object handling). Real-world applications require both at once: a robot must walk while carrying objects, keep moving while opening a door, or climb stairs and stay balanced with its hands full.

This is loco-manipulation — one of the hardest open problems in humanoid robotics today.

Why It's Difficult

  1. Coupling: When an arm lifts a heavy object, the center of mass (CoM) shifts, directly affecting balance and gait stability
  2. Competing objectives: Locomotion needs stable, rhythmic footsteps while manipulation needs dexterous arms, and both share the same body
  3. High-dimensional: Humanoids have 30-75 DOF; controlling everything simultaneously is computationally and algorithmically very complex
  4. Contact-rich: Both feet (ground contact) and hands (object contact) create complex dynamics simultaneously
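
To make the coupling point concrete, here is a small worked example of how far the combined center of mass moves when the robot holds a payload in front of its body. The masses and distances are made-up illustrative numbers, not measurements from any particular robot, and `combined_com` is a hypothetical helper.

```python
# Toy numbers (assumed for illustration): a 50 kg robot whose CoM sits over
# the ankles (x = 0 m), holding a 5 kg payload 0.4 m in front of the body.

def combined_com(masses, positions):
    """Mass-weighted average position along one axis."""
    total_mass = sum(masses)
    return sum(m * x for m, x in zip(masses, positions)) / total_mass

robot_mass, robot_com_x = 50.0, 0.0
payload_mass, payload_x = 5.0, 0.4

com_x = combined_com([robot_mass, payload_mass], [robot_com_x, payload_x])
print(f"CoM shifts forward by {com_x * 100:.1f} cm")  # prints 3.6 cm
```

Even a shift of a few centimetres is significant relative to the support area under the feet, so the locomotion controller has to compensate, for example by leaning the torso back.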

Humanoid robot performing loco-manipulation in a real-world environment

Approach 1: Decoupled Upper/Lower Body Control

The simplest idea: separate the control of the upper body (manipulation) and lower body (locomotion). Each part has its own controller, communicating through a lightweight interface.

Mobile-TeleVision Paper

"Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a canonical example of the decoupled approach.

Architecture:

  1. Upper body: IK + motion retargeting from a human operator (teleoperation)
  2. Lower body: RL policy for locomotion, conditioned on upper body motion
  3. PMP (Predictive Motion Priors): A CVAE model predicts future upper body motion, allowing the locomotion policy to "know ahead of time" what the arms will do

Why PMP? If the locomotion policy only sees the current arm state, it is reactive — slow to respond when the arms change suddenly (e.g., lifting a heavy object). PMP enables anticipation — knowing the arms are about to lift something, the legs can pre-adjust posture.
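
The difference between reacting and anticipating can be illustrated with a toy simulation. This is not PMP itself: the "prediction" here is a perfect preview of the future disturbance, whereas PMP learns its prediction with a CVAE. The first-order lag, the `gain`, and the `track` function are all assumptions made for the sketch.

```python
def track(disturbance, preview=0, gain=0.2):
    """Leg compensation follows the (possibly previewed) arm disturbance
    through a first-order lag. Returns the peak balance error."""
    compensation, peak_error = 0.0, 0.0
    last = len(disturbance) - 1
    for t, d in enumerate(disturbance):
        peak_error = max(peak_error, abs(d - compensation))
        target = disturbance[min(t + preview, last)]  # preview=0 is reactive
        compensation += gain * (target - compensation)
    return peak_error

# The arm picks up a load at t = 20: the disturbance steps from 0 to 1.
disturbance = [0.0] * 20 + [1.0] * 30

reactive_peak = track(disturbance, preview=0)
anticipatory_peak = track(disturbance, preview=5)
print(f"reactive: {reactive_peak:.2f}, anticipatory: {anticipatory_peak:.2f}")
```

With a five-step preview the legs start adjusting before the load arrives. That costs a small error just before the lift, but the worst-case error is lower than in the purely reactive case.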

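# Simplified decoupled control architecture (illustrative sketch, not the paper's code).
# The IK solver, locomotion policy, and PMP model are injected; their interfaces are assumed.
from collections import deque

import numpy as np


class DecoupledController:
    def __init__(self, upper_body_ik, lower_body_rl, pmp, history_len=10):
        self.upper_body_ik = upper_body_ik   # IK solver for the arm joints
        self.lower_body_rl = lower_body_rl   # RL locomotion policy (callable)
        self.pmp = pmp                       # Predictive motion prior (e.g., a CVAE)
        self.arm_history = deque(maxlen=history_len)

    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for arms
        arm_targets = np.asarray(self.upper_body_ik.solve(upper_body_target))

        # 2. Predict future upper body motion from the recent arm history
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=list(self.arm_history),
        )
        self.arm_history.append(arm_targets)

        # 3. Lower body: RL conditioned on motion prior
        obs = np.concatenate([
            proprioception,       # Joint states
            velocity_command,     # Move command
            motion_prior,         # Predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)

        return np.concatenate([arm_targets, leg_actions])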

Results: Tested on Fourier GR-1 and Unitree H1 in simulation, and Unitree H1 in the real world. The robot can walk while holding objects, and walk while opening doors.

Advantages of Decoupled Control

  • Easy to debug: Each component can be validated independently
  • Reusable: Locomotion policy can be reused across many manipulation tasks
  • Teleoperation-friendly: Human operator controls only the arms; legs handle themselves automatically

Disadvantages

  • Limited coordination: Hard to perform tasks requiring whole-body synergy (e.g., throwing far)
  • Interface bottleneck: Information shared between the two components is limited by interface design
  • Sub-optimal: Cannot simultaneously optimize locomotion and manipulation objectives

Approach 2: Teleoperation + Imitation Learning

Instead of hand-designing controllers, learn from humans:

  1. Teleoperate the humanoid: A human operator drives the robot to perform a task (holding objects, tidying up, etc.)
  2. Collect data: Record all observations and corresponding actions
  3. Train a policy: Use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to replicate the behavior

Teleoperation Systems for Humanoids

TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems. Setups of this kind typically combine some of the following components:

  • VR headset: Oculus/Meta Quest for head tracking + hand tracking
  • Motion capture: Full-body tracking of the human operator
  • Retargeting: Map human movements to robot motion (accounting for different body proportions and DOF counts)
  • Force feedback (where supported): Operator feels resistance from the robot

# Teleoperation data collection pipeline (illustrative; robot and VR interfaces are assumed)
class TeleoperationCollector:
    def __init__(self, robot, vr_interface, retarget):
        self.robot = robot
        self.vr = vr_interface
        self.retarget = retarget  # Maps a human pose dict to robot joint targets
        self.dataset = []

    def collect_episode(self, task_name, is_task_complete):
        """Collect one teleoperation episode; is_task_complete() ends the loop."""
        episode_data = []

        while not is_task_complete():
            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # Left + Right
            body_pose = self.vr.get_body_pose()

            # Retarget to robot
            robot_targets = self.retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose,
                },
                robot_model=self.robot.model,
            )

            # Record what the robot senses before acting, then execute
            observation = self.robot.get_observation()
            images = self.robot.get_camera_images()
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()

            # Save data
            episode_data.append({
                'observation': observation,
                'action': action,
                'image': images,
            })

        self.dataset.append({'task': task_name, 'steps': episode_data})
        return episode_data
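
The retargeting step in the pipeline above is left abstract. As a rough sketch of what it might do for a single wrist, the example below scales the operator's shoulder-to-wrist vector by the ratio of arm lengths and clamps it to the robot's reach. The function name `retarget_wrist`, the arm lengths, and the clamping rule are assumptions for illustration; real systems also retarget orientation, fingers, and the rest of the body.

```python
import math

def retarget_wrist(human_wrist, human_shoulder, human_arm_len, robot_arm_len):
    """Scale the human shoulder-to-wrist vector by the arm-length ratio and
    clamp it to the robot's reach. Returns the wrist target relative to the
    robot's shoulder as (x, y, z) in metres."""
    scale = robot_arm_len / human_arm_len
    offset = [scale * (w - s) for w, s in zip(human_wrist, human_shoulder)]
    reach = math.sqrt(sum(c * c for c in offset))
    if reach > robot_arm_len:  # operator over-extended: pull the target back in
        offset = [c * robot_arm_len / reach for c in offset]
    return tuple(offset)

# Operator's wrist is 0.3 m in front of the shoulder; the robot's arm is shorter.
target = retarget_wrist(
    human_wrist=(0.3, 0.0, 1.4),
    human_shoulder=(0.0, 0.0, 1.4),
    human_arm_len=0.6,
    robot_arm_len=0.5,
)
print(target)
```

The resulting Cartesian target would then be handed to an IK solver to obtain joint angles.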

Imitation Learning for Loco-Manipulation

After collecting data, train a policy using:

Behavior Cloning (BC) — simplest approach:

# policy(observation) -> action; regress the expert's action with an MSE loss
import torch.nn.functional as F

loss = F.mse_loss(policy(obs), expert_action)
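
To show what minimizing that loss looks like end to end, here is a minimal sketch of Behavior Cloning with a one-dimensional linear policy trained by plain gradient descent. The synthetic "expert" (action = 2 * observation + 1), the learning rate, and the step count are assumptions chosen so the example runs without any dependencies. A real policy would be a neural network trained on recorded teleoperation data.

```python
# Synthetic expert demonstrations (assumed): action = 2 * observation + 1
observations = [-1.0, -0.5, 0.0, 0.5, 1.0]
expert_actions = [2.0 * o + 1.0 for o in observations]

w, b = 0.0, 0.0          # parameters of the linear policy a = w * o + b
learning_rate = 0.1
n = len(observations)

for _ in range(500):
    # Gradients of the mean squared error with respect to w and b
    errors = [(w * o + b) - a for o, a in zip(observations, expert_actions)]
    grad_w = 2.0 / n * sum(e * o for e, o in zip(errors, observations))
    grad_b = 2.0 / n * sum(errors)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")  # prints w = 2.000, b = 1.000
```

The policy recovers the expert's mapping because the demonstrations are consistent. When demonstrations disagree for the same observation, this kind of regression averages them, which is the weakness the methods below try to address.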

ACT (Action Chunking with Transformers) — more effective:

  • Predicts a chunk of future actions (e.g., 100 actions at once) instead of 1 action at a time
  • Uses a CVAE to model multi-modal behavior
  • Delivers strong results on fine-grained bimanual manipulation tasks, the setting it was introduced for
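
Action chunking can be sketched as follows. The policy is queried at every step and returns a chunk of future actions, and the action executed at each step is a weighted average of every prediction made for that step. This is loosely modeled on the temporal ensembling described for ACT; the weighting constant `m`, the function names, and the stub `ramp_policy` are assumptions for illustration, not the reference implementation.

```python
import math

def ensemble_action(predictions, m=0.1):
    """Weighted average of predictions for one timestep, oldest first.
    Weights decay as exp(-m * i), so older predictions count slightly more."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * a for w, a in zip(weights, predictions)) / sum(weights)

def run_chunked(policy, num_steps, chunk_size, m=0.1):
    pending = {}   # timestep -> predictions made for it so far, oldest first
    executed = []
    for t in range(num_steps):
        chunk = policy(t, chunk_size)          # actions for t .. t+chunk_size-1
        for offset, action in enumerate(chunk):
            pending.setdefault(t + offset, []).append(action)
        executed.append(ensemble_action(pending.pop(t), m))
    return executed

def ramp_policy(t, chunk_size):
    """Stub policy: the correct action at step s is simply s."""
    return [float(t + i) for i in range(chunk_size)]

executed = run_chunked(ramp_policy, num_steps=6, chunk_size=3)
print(executed)
```

Because overlapping chunks are blended rather than switched abruptly, the executed trajectory stays smooth even when consecutive predictions disagree slightly.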

Diffusion Policy — strongest for multi-modal scenarios:

  • Models the action distribution using a diffusion process
  • Handles multi-modal behaviors (many valid ways to accomplish one task)
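
Why multi-modality matters can be seen in a two-demonstration toy case. The example does not implement diffusion; it only shows the failure mode of a single-output regressor that a distribution-modeling policy avoids. The scenario (steer left or right around an obstacle) and the numbers are assumed.

```python
# Two equally good expert demonstrations at an obstacle: steer left or right.
expert_actions = [-1.0, +1.0]

def mse(prediction, targets):
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Scan candidate single-action predictions and keep the one with lowest MSE.
candidates = [i / 10 for i in range(-10, 11)]
best = min(candidates, key=lambda c: mse(c, expert_actions))

print(best, mse(best, expert_actions))  # prints 0.0 1.0
```

The MSE-optimal prediction is 0.0, which means driving straight into the obstacle: the average of two valid actions is not itself valid. A policy that models the full action distribution can instead sample one of the two modes.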

See Imitation Learning for Robotics and ACT: Action Chunking with Transformers for deeper coverage.

Teleoperation system for humanoid data collection

Approach 3: End-to-End RL

Instead of separating components, train one single policy for the entire loco-manipulation task.

ALMI — Adversarial Locomotion and Motion Imitation

ALMI uses adversarial training between upper and lower body:

  • Lower body provides robust locomotion
  • Upper body tracks diverse motion targets
  • Adversarial training pits the two against each other, so each becomes robust to the other's motion and the two coordinate well

ResMimic — Residual Learning

ResMimic uses a 2-stage approach:

  1. Stage 1: Train a general motion tracking policy (walking, standing, waving arms, etc.)
  2. Stage 2: Train a residual policy on top of Stage 1 for specific tasks (carrying, opening doors)

The residual policy only needs to learn the difference between general motion and task-specific motion, which is faster and more stable than training from scratch.

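import torch
import torch.nn as nn


class ResidualPolicy(nn.Module):
    def __init__(self, base_policy, obs_dim, action_dim, residual_scale=0.1):
        super().__init__()
        self.base_policy = base_policy            # Pretrained motion tracker, frozen
        for param in self.base_policy.parameters():
            param.requires_grad_(False)
        self.residual = nn.Sequential(            # Trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.residual_scale = residual_scale

    def forward(self, obs):
        with torch.no_grad():
            base_action = self.base_policy(obs)

        residual_action = self.residual(obs)

        # Final action = base + small residual
        return base_action + self.residual_scale * residual_action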

WholeBodyVLA — Vision-Language-Action

WholeBodyVLA (ICLR 2026) is the latest framework, combining Vision-Language-Action models with humanoid loco-manipulation:

  • Learns from egocentric videos (video from the robot's own viewpoint) — cheap and easy to collect
  • Uses VLA architecture to understand language + images → action
  • Tested on AgiBot X2, outperforming all baselines by 21.3%

This is the hottest direction right now — combining LLM/VLM understanding with whole-body control.

Comparison of Approaches

Approach      | Pros                               | Cons                              | Papers
Decoupled     | Easy to debug, reusable components | Sub-optimal, limited coordination | Mobile-TeleVision
Teleop + IL   | Learns from demos, flexible        | Needs data and skilled operators  | TWIST, Open-TeleVision
End-to-End RL | Optimal, no human in loop          | Hard to train, reward engineering | ALMI
Residual      | Fast convergence, stable           | Requires a good base policy       | ResMimic
VLA           | Language grounding, general        | Compute-intensive, data-hungry    | WholeBodyVLA

Remaining Challenges

1. Dexterous Manipulation While Walking

Most research only demonstrates power grasp (firm grip) while walking. Dexterous manipulation (rotating objects, unscrewing caps, using tools) while moving remains very difficult and largely unsolved.

2. Reactive Balance Under Load

When a robot is carrying a heavy object and gets pushed, it must simultaneously maintain its grip and recover balance. The priority tradeoff between these two competing goals has not been well solved.

3. Long-Horizon Tasks

Tasks like tidying a room or preparing a meal require long-horizon planning — something current policies struggle with. Integration with LLM-based task planning is promising but still nascent.

4. Multi-Contact Scenarios

When a robot interacts with multiple objects simultaneously (holding 2 items, using a foot to pin a door), the number of contact points increases and the number of possible contact-mode combinations grows exponentially.
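
One way to make the exponential growth concrete is to count contact modes. The sketch below assumes each contact point is in one of three states (open, sticking, or sliding), which is a common simplification; the function name `contact_modes` is hypothetical.

```python
from itertools import product

def contact_modes(num_contacts, states=("open", "sticking", "sliding")):
    """Enumerate every combination of per-contact states."""
    return list(product(states, repeat=num_contacts))

for n in (2, 4, 6):
    print(n, "contacts ->", len(contact_modes(n)), "modes")
# 2 contacts -> 9 modes, 4 contacts -> 81 modes, 6 contacts -> 729 modes
```

Going from two feet on the ground to two feet plus two hands plus two held objects multiplies the number of cases a planner or policy has to handle by a factor of 81 under this counting.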

Research Directions 2026-2027

  1. VLA + Whole-body: Using language models to command full humanoid body motion
  2. Teleoperation at scale: Collecting data from hundreds of operators, training general-purpose policies
  3. Sim-to-real for manipulation: Sim-to-real for locomotion works well, but manipulation remains hard due to contact-rich dynamics
  4. Multi-humanoid collaboration: Two humanoids coordinating to lift heavy objects or perform joint tasks

Next in Series

  • Part 4: RL for Humanoid: From Humanoid-Gym to Sim2Real
  • Part 6: The Future of Humanoids: Opportunities for Robotics Engineers

Related Articles

  • Imitation Learning for Robotics — Behavior Cloning, DAgger, and more
  • ACT: Action Chunking with Transformers — State-of-the-art manipulation policy
  • Diffusion Policy for Robot Control — Diffusion-based control
  • Foundation Models for Robots: RT-2, Octo, OpenVLA — VLA models overview
  • RL for Bipedal Walking — RL locomotion fundamentals

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.

humanoid-engineering series: Part 5 of 6

Related Posts

  • SUGAR: Train a Humanoid from Human Video, No Reward Needed (Tutorial · 5/29/2026 · 12 min read): A detailed guide to SUGAR (arXiv 2026.05), a framework for learning humanoid loco-manipulation from real human video. Fully open-source and beginner-friendly from A to Z.
  • The Future of Humanoids: Opportunities for Robotics Engineers (Deep Dive · Part 6 · 2/27/2026 · 9 min read): The humanoid robot market at $38B by 2035, hiring trends, required skills, and a learning roadmap for robotics engineers in Vietnam.
  • RL for Humanoid: From Humanoid-Gym to Sim2Real (Tutorial · Part 4 · 2/19/2026 · 9 min read): A guide to training a locomotion policy for a humanoid robot with Humanoid-Gym, from reward configuration to zero-shot sim-to-real transfer.