Loco-Manipulation: Robots That Walk and Manipulate at the Same Time

What Is Loco-Manipulation?

Most humanoid research focuses on either locomotion (walking, running, climbing) or manipulation (grasping, lifting, handling objects). Real-world applications, however, need both at once: a robot has to carry objects while walking, open a door while moving, or climb stairs while keeping an object balanced in its hands.

This is loco-manipulation, one of the hardest open problems in humanoid robotics today.

Why Is It Hard?

Coupling: When the arms lift a heavy object, the center of mass (CoM) shifts, which affects balance.
Competing objectives: Locomotion needs stable legs, manipulation needs agile arms, and both share the same body (skeleton).
High-dimensional: A humanoid has roughly 30-75 DOF, and controlling all of them simultaneously is complex.
Contact-rich: Both the feet (ground contact) and the hands (object contact) involve contact dynamics.
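
To make the coupling point concrete, the sketch below computes how far the combined center of mass moves when the robot holds a payload at arm's length. The masses and distances are made-up illustrative numbers, not taken from any specific robot, and `combined_com` is a hypothetical helper.

```python
def combined_com(masses, positions):
    """Return the mass-weighted average position (1-D, in meters)."""
    total_mass = sum(masses)
    return sum(m * x for m, x in zip(masses, positions)) / total_mass


# Assumed numbers for illustration only: a 50 kg robot whose CoM sits over
# the ankles (x = 0.0 m) and a 5 kg payload held 0.4 m in front of it.
robot_mass, robot_com_x = 50.0, 0.0
payload_mass, payload_x = 5.0, 0.4

com_without_load = combined_com([robot_mass], [robot_com_x])
com_with_load = combined_com(
    [robot_mass, payload_mass], [robot_com_x, payload_x]
)
com_shift = com_with_load - com_without_load
print(f"CoM shift: {com_shift * 100:.1f} cm forward")  # CoM shift: 3.6 cm forward
```

Even a shift of a few centimeters matters when the support area is only about the size of the feet, so the legs have to compensate whenever the arms pick something up.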

Figure: A humanoid robot performing loco-manipulation in a real-world environment

Approach 1: Decoupled Upper/Lower Body

The simplest idea is to control the upper body (manipulation) and the lower body (locomotion) separately. Each part has its own controller, and the two communicate through a lightweight interface.

Mobile-TeleVision (arXiv:2412.07773)

The paper "Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control" (arXiv:2412.07773) is a representative example of the decoupled approach.

Architecture:

Upper body: IK + motion retargeting from a human operator (teleoperation)
Lower body: RL policy for locomotion, conditioned on upper-body motion
PMP (Predictive Motion Priors): a CVAE model that predicts future upper-body motion, so the locomotion policy "knows in advance" what the arms will do

Why is PMP needed? If the locomotion policy only sees the current arm state, it is purely reactive: it responds late when the arms change suddenly (for example, when lifting a heavy object). PMP lets the policy anticipate: knowing that the arms are about to lift something, it can adjust its posture beforehand.

# Simplified decoupled control architecture (illustrative sketch)
from collections import deque

import numpy as np


class DecoupledController:
    def __init__(self, upper_body_ik, lower_body_rl, pmp, history_len=10):
        # Components are injected: an IK solver for the arms, an RL locomotion
        # policy, and a Predictive Motion Prior (PMP) model.
        self.upper_body_ik = upper_body_ik
        self.lower_body_rl = lower_body_rl
        self.pmp = pmp
        self.arm_history = deque(maxlen=history_len)  # Recent arm targets

    def step(self, upper_body_target, velocity_command, proprioception):
        # 1. Upper body: IK for the arms
        arm_targets = self.upper_body_ik.solve(upper_body_target)

        # 2. Predict future upper-body motion
        motion_prior = self.pmp.predict(
            current_arm_state=arm_targets,
            history=list(self.arm_history),
        )
        self.arm_history.append(arm_targets)

        # 3. Lower body: RL policy conditioned on the motion prior
        obs = np.concatenate([
            proprioception,       # Joint states
            velocity_command,     # Velocity command
            motion_prior,         # Predicted arm motion
        ])
        leg_actions = self.lower_body_rl(obs)

        return np.concatenate([arm_targets, leg_actions])
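
In the paper the motion prior comes from a learned CVAE. To show only the interface (arm history in, predicted future arm states out), the sketch below swaps in plain constant-velocity extrapolation. This is a stand-in chosen for illustration, not the paper's method, and `extrapolate_arm_motion` along with its parameters are hypothetical names.

```python
def extrapolate_arm_motion(history, horizon, dt):
    """Predict future arm joint positions by constant-velocity extrapolation.

    history: list of joint-position lists, oldest first (needs >= 2 entries).
    horizon: number of future steps to predict.
    dt: control period in seconds.
    NOTE: a stand-in for the learned CVAE predictor, not the paper's method.
    """
    prev, curr = history[-2], history[-1]
    velocity = [(c - p) / dt for p, c in zip(prev, curr)]
    return [
        [c + v * dt * k for c, v in zip(curr, velocity)]
        for k in range(1, horizon + 1)
    ]


# Shoulder pitch rising by 0.25 rad per step; elbow held still.
history = [[0.0, 1.0], [0.25, 1.0], [0.5, 1.0]]
future = extrapolate_arm_motion(history, horizon=2, dt=0.125)
print(future)  # [[0.75, 1.0], [1.0, 1.0]]
```

A locomotion policy that receives `future` as part of its observation can start shifting its weight before the arm reaches the new pose, which is the anticipation effect described above.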

Results: The method was tested on the Fourier GR-1 and Unitree H1 in simulation, and on the Unitree H1 in the real world. The robot can carry objects while walking and open doors while walking.

Advantages of the Decoupled Approach

Easy to debug: each part can be inspected in isolation
Reusable: the locomotion policy can be reused across many manipulation tasks
Teleoperation friendly: the operator only controls the arms, and the legs are handled automatically

Disadvantages

Limited coordination: tasks that need the whole body (for example, throwing a ball far) are hard
Interface bottleneck: the information exchanged between the two parts is limited
Sub-optimal: locomotion and manipulation cannot be optimized jointly

Approach 2: Teleoperation + Imitation Learning

Instead of hand-designing a controller, learn from humans:

Teleoperate the humanoid: an operator drives the robot through the task (picking up objects, tidying up, ...)
Collect data: record observations and actions
Train a policy: use Imitation Learning (Behavior Cloning, ACT, Diffusion Policy) to learn a policy

Teleoperation Systems for Humanoids

TWIST (Teleoperated Whole-Body Imitation System) and Open-TeleVision are two prominent teleoperation systems. Setups of this kind draw on components such as:

VR headset: Oculus/Meta Quest for head tracking + hand tracking
Motion capture: the operator's whole body is tracked
Retargeting: map human motion -> robot motion (the two differ in size and DOF)
Force feedback: the operator feels forces from the robot
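
The retargeting step has to cope with the size and joint-range mismatch between operator and robot. A minimal sketch of two common ingredients follows: scaling the shoulder-to-wrist vector by the arm-length ratio, and clamping joint angles to the robot's limits. The function names, arm lengths, and joint limits are assumptions for illustration and do not come from TWIST or Open-TeleVision.

```python
def scale_wrist_target(human_wrist, human_shoulder, human_arm_len, robot_arm_len):
    """Scale the shoulder-to-wrist vector by the arm-length ratio.

    Returns the wrist target relative to the robot's shoulder (meters).
    """
    ratio = robot_arm_len / human_arm_len
    return [(w - s) * ratio for w, s in zip(human_wrist, human_shoulder)]


def clamp_to_limits(joint_angles, lower, upper):
    """Clip each joint angle (radians) into the robot's joint range."""
    return [min(max(q, lo), hi) for q, lo, hi in zip(joint_angles, lower, upper)]


# Assumed values: a 0.8 m human arm mapped onto a 0.4 m robot arm.
target = scale_wrist_target(
    human_wrist=[0.8, 0.0, 1.5],
    human_shoulder=[0.0, 0.0, 1.5],
    human_arm_len=0.8,
    robot_arm_len=0.4,
)
safe_angles = clamp_to_limits([2.0, -0.5], lower=[-1.5, -1.0], upper=[1.5, 1.0])
print(target, safe_angles)
```

A fully stretched human arm then maps to a fully stretched robot arm, and a joint command outside the robot's range is clipped rather than sent to the motors.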

# Teleoperation data collection pipeline (illustrative sketch)
class TeleoperationCollector:
    def __init__(self, robot, vr_interface, retarget):
        self.robot = robot
        self.vr = vr_interface
        self.retarget = retarget  # Callable: human pose -> robot targets
        self.dataset = []

    def collect_episode(self, task_name, is_task_complete):
        """Collect one teleoperation episode.

        is_task_complete: zero-argument callable that returns True when done.
        """
        episode_data = []

        while not is_task_complete():
            # Read VR input
            head_pose = self.vr.get_head_pose()
            hand_poses = self.vr.get_hand_poses()  # Left + right
            body_pose = self.vr.get_body_pose()

            # Retarget to the robot
            robot_targets = self.retarget(
                human_pose={
                    'head': head_pose,
                    'hands': hand_poses,
                    'body': body_pose,
                },
                robot_model=self.robot.model,
            )

            # Execute on the robot
            observation = self.robot.get_observation()
            image = self.robot.get_camera_images()  # Captured before acting
            self.robot.set_targets(robot_targets)
            action = self.robot.get_applied_action()

            # Store the sample
            episode_data.append({
                'observation': observation,
                'action': action,
                'image': image,
            })

        self.dataset.append({'task': task_name, 'steps': episode_data})

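Imitation Learning for Loco-Manipulation

Once the data is collected, train a policy with one of the following:

Behavior Cloning (BC), the simplest option:

import torch.nn.functional as F

# policy(observation) -> action; regress onto the expert's action
loss = F.mse_loss(policy(obs), expert_action)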
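
To show the loss above inside a complete training loop, here is a minimal sketch that fits a scalar linear policy by gradient descent on the same MSE objective. Real policies are neural networks over high-dimensional observations; the linear model, the toy "expert", and the hyperparameters are assumptions picked so the example runs with the standard library alone.

```python
def train_bc(observations, expert_actions, lr=0.1, epochs=200):
    """Fit action = w * obs + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(observations)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for obs, act in zip(observations, expert_actions):
            error = (w * obs + b) - act       # Prediction minus expert label
            grad_w += 2.0 * error * obs / n   # d(MSE)/dw
            grad_b += 2.0 * error / n         # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy demonstrations from an "expert" that follows action = 2 * obs + 1.
demo_obs = [-1.0, -0.5, 0.0, 0.5, 1.0]
demo_act = [2.0 * o + 1.0 for o in demo_obs]
w, b = train_bc(demo_obs, demo_act)
print(f"learned policy: action = {w:.2f} * obs + {b:.2f}")
```

The learned coefficients converge to the expert's, which is all BC does: supervised regression from observations to demonstrated actions.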

ACT (Action Chunking with Transformers), more effective:

Predicts a chunk of actions (for example, the next 100 actions) instead of a single action
Uses a CVAE to model multi-modal behavior
Reported strong results on fine-grained manipulation tasks
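
The effect of chunking on the control loop can be sketched as follows: the policy is queried once per chunk and the returned actions are executed open-loop. The policy, observation source, and actuator here are dummy stand-ins, and `rollout_with_chunks` is a hypothetical helper. The ACT paper additionally proposes temporal ensembling of overlapping chunks, which this sketch leaves out.

```python
def rollout_with_chunks(policy, get_obs, apply_action, horizon, chunk_size):
    """Run `horizon` control steps, querying the policy once per chunk."""
    num_queries = 0
    step = 0
    while step < horizon:
        chunk = policy(get_obs(), chunk_size)  # List of future actions
        num_queries += 1
        for action in chunk[: horizon - step]:  # Do not overrun the horizon
            apply_action(action)
            step += 1
    return num_queries


# Dummy stand-ins so the sketch runs without a robot.
executed = []
dummy_policy = lambda obs, k: [obs + i for i in range(k)]
queries = rollout_with_chunks(
    policy=dummy_policy,
    get_obs=lambda: len(executed),
    apply_action=executed.append,
    horizon=250,
    chunk_size=100,
)
print(queries, len(executed))  # 3 250
```

With a chunk size of 100, a 250-step episode needs only 3 policy queries instead of 250, which shortens the effective decision horizon and limits how often prediction errors can compound.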

Diffusion Policy, particularly strong for multi-modal behavior:

Models the action distribution with a diffusion process
Handles multi-modal behaviors (several ways of doing the same task)
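
Why multi-modality matters can be seen in a toy case. Suppose demonstrators pass an obstacle either on the left or on the right. A policy trained with plain MSE regression is pulled toward the average of the two, which is neither. The numbers below are invented for illustration; the snippet shows the failure mode that distribution-modeling policies are meant to avoid and does not implement diffusion itself.

```python
import statistics

# Demonstrations for the same observation: half steer left (-1.0),
# half steer right (+1.0) around an obstacle straight ahead (0.0).
demo_actions = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0]

# The constant action that minimizes MSE is the mean of the demonstrations.
mse_optimal_action = statistics.fmean(demo_actions)


def nearest_mode_distance(action, modes=(-1.0, 1.0)):
    """Distance from an action to the closest demonstrated behavior."""
    return min(abs(action - m) for m in modes)


print(mse_optimal_action, nearest_mode_distance(mse_optimal_action))  # 0.0 1.0
```

The MSE-optimal action steers straight into the obstacle, while a policy that models the full action distribution can commit to one of the two demonstrated modes.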

For details on ACT and Diffusion Policy, see Imitation Learning for Robotics and ACT: Action Chunking with Transformers.

Figure: Teleoperation system for humanoid data collection

Approach 3: End-to-End RL

Instead of splitting the body, train a single policy for the whole loco-manipulation problem.

ALMI -- Adversarial Locomotion and Motion Imitation

ALMI uses adversarial training between the upper and lower body:

The lower body provides robust locomotion
The upper body tracks a variety of motions
The adversarial loss ensures the two parts "cooperate" well

ResMimic -- Residual Learning

ResMimic uses two stages:

Stage 1: Train a general motion tracking policy (walking, standing, waving, ...)
Stage 2: Train a residual policy on top of stage 1 for a specific task (grasping an object, opening a door)

The residual policy only has to learn the difference between general motion and task-specific motion, which is faster and more stable than training from scratch.

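import torch
import torch.nn as nn


class ResidualPolicy(nn.Module):
    def __init__(self, base_policy, obs_dim, action_dim, residual_scale=0.1):
        super().__init__()
        # Pretrained motion-tracking policy (an nn.Module), kept frozen
        self.base_policy = base_policy
        for param in self.base_policy.parameters():
            param.requires_grad_(False)
        self.residual = nn.Sequential(  # Trainable
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.residual_scale = residual_scale

    def forward(self, obs):
        with torch.no_grad():
            base_action = self.base_policy(obs)

        residual_action = self.residual(obs)

        # Final action = base + residual (with a small scale)
        return base_action + self.residual_scale * residual_action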

WholeBodyVLA -- Vision-Language-Action

WholeBodyVLA (ICLR 2026) is a recent framework that combines a Vision-Language-Action model with humanoid loco-manipulation:

Learns from egocentric videos (video from the robot's point of view), which are cheap and easy to collect
Uses a VLA architecture to map language + images -> actions
Tested on the AgiBot X2, outperforming the baselines by 21.3%

This is one of the most active directions right now: combining LLMs/VLMs with whole-body control.

Comparison of Approaches

Approach	Advantages	Disadvantages	Papers
Decoupled	Easy to debug, reusable	Sub-optimal, limited coordination	Mobile-TeleVision
Teleop + IL	Learns from demos, flexible	Needs data and an operator	TWIST, Open-TeleVision
End-to-end RL	Jointly optimized, no human needed	Hard to train, reward design	ALMI
Residual	Fast, stable	Needs a base policy	ResMimic
VLA	Language grounding	Needs compute and data	WholeBodyVLA

Remaining Challenges

1. Dexterous manipulation while walking

Most research only demonstrates power grasps (a firm whole-hand grip) while walking. Dexterous manipulation (rotating an object, opening a lid, using tools) while moving is still very hard.

2. Reactive balance

When the robot is holding a heavy object and gets pushed, it has to keep hold of the object and keep its balance at the same time. Which one takes priority? This trade-off has not been resolved well.
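
One simple way to express such a trade-off is a weighted least-squares compromise between the two objectives. The sketch below does this for a single scalar (a torso pitch command); the scenario, targets, and weights are assumptions for illustration, and real whole-body controllers solve a much larger version of this problem, often with strict priorities instead of soft weights.

```python
def weighted_compromise(targets, weights):
    """Minimize sum_i w_i * (x - target_i)^2 over a scalar x (closed form)."""
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)


# Assumed scenario: balance wants the torso at 0.0 rad, while keeping the
# held object level wants 0.2 rad. Balance gets 4x the weight.
torso_pitch = weighted_compromise(targets=[0.0, 0.2], weights=[4.0, 1.0])
print(f"torso pitch command: {torso_pitch:.2f} rad")  # torso pitch command: 0.04 rad
```

Changing the weights moves the command between the two targets, which makes the open question explicit: someone, or some learned reward, still has to decide how much a dropped object is worth relative to a fall.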

3. Long-horizon tasks

Long tasks (for example, tidying a room or cooking) require long-term planning, which current policies do not handle well. They need to be combined with task planning (LLM-based).

4. Multi-contact

The robot interacts with several things at once (for example, holding an object with both hands while bracing with a foot). As the number of contact points grows, so does the complexity.

Research Directions for 2026-2027

VLA + Whole-body: use language models to control the whole humanoid body
Teleoperation at scale: collect data from hundreds of operators and train a general-purpose policy
Sim-to-real for manipulation: sim-to-real works well for locomotion today, but remains hard for manipulation (contact-rich)
Multi-robot loco-manipulation: two humanoids cooperating to lift a heavy object

Next in the Series

Part 4: RL for Humanoids: From Humanoid-Gym to sim2real
Part 6: The Future of Humanoids: Opportunities for Robotics Engineers

Nguyễn Anh Tuấn

