Tool Use: Robots Learning to Use Tools with RL

Tool use is one of the traits that distinguishes humans from most other animals. We use a hammer to drive nails, a knife to cut, a screwdriver to turn screws; each tool is an extension of the hand. Teaching robots to do this is a frontier of manipulation research, because the robot must understand both how to grasp the tool (grasp affordance) and how to use it (functional affordance).

Having mastered contact-rich manipulation, we now move into tool use, where the robot does not just interact with objects but interacts with the world through objects.

Tool Use: Why Is It Hard?

Challenge	Description
Dual object	Must manage both the tool and the target object
Affordance	Knowing to grasp the hammer's handle, not its head
Extended kinematics	The tool changes the workspace and dynamics
Multi-phase	Grasp tool → Position → Use → Release
Force transmission	Forces are transmitted through the tool in complex ways
Tool variety	Each tool has its own dynamics
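
To make the "extended kinematics" row concrete, here is a minimal sketch of how a rigidly held tool shifts the point the robot actually controls. The function name `tool_tip_in_world` and the 4x4 homogeneous pose convention are assumptions for illustration; the offset of 0.186 m is taken from the screwdriver geometry used later in this article (body offset 0.01 m plus tip site at 0.176 m).

```python
import numpy as np

def tool_tip_in_world(T_world_ee, tip_offset_ee):
    """Return the tool tip position in the world frame.

    T_world_ee: 4x4 homogeneous pose of the end-effector (assumed convention).
    tip_offset_ee: tool tip position expressed in the end-effector frame.
    """
    tip_h = np.append(np.asarray(tip_offset_ee, dtype=float), 1.0)
    return (T_world_ee @ tip_h)[:3]

# End-effector at (0.5, 0, 0.6), rotated 90 degrees about the y-axis.
T_rotated = np.array([
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 1.0],
])
# Same position, no rotation.
T_upright = np.eye(4)
T_upright[:3, 3] = [0.5, 0.0, 0.6]

tip_offset = [0.0, 0.0, -0.186]  # screwdriver tip below the wrist
tip_rotated = tool_tip_in_world(T_rotated, tip_offset)
tip_upright = tool_tip_in_world(T_upright, tip_offset)
print(tip_upright, tip_rotated)
```

The same wrist position yields two tip positions almost 0.2 m apart, which is why a tool-use policy usually needs the tip pose, not just the end-effector pose, in its observation.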

Affordance Learning

Affordance answers the question "which part of the tool should be grasped, and which part does the work?" For example:

Hammer: grasp the handle, strike with the head
Screwdriver: grasp the handle, rotate the tip in the screw
Spatula: grasp the handle, slide the flat blade under the object

import numpy as np

class ToolAffordance:
    """Define affordances for each tool type."""
    
    def __init__(self):
        self.tools = {
            'hammer': {
                'grasp_region': {'center': [0, 0, 0.1], 'radius': 0.03},
                'functional_region': {'center': [0, 0, -0.05], 'radius': 0.02},
                'grasp_orientation': 'perpendicular',  # Grasp perpendicular to the handle
                'use_direction': [0, 0, -1],            # Strike downward
                'mass': 0.5,
                'length': 0.25,
            },
            'screwdriver': {
                'grasp_region': {'center': [0, 0, 0.06], 'radius': 0.02},
                'functional_region': {'center': [0, 0, -0.08], 'radius': 0.005},
                'grasp_orientation': 'parallel',  # Grasp parallel to the handle
                'use_direction': [0, 0, -1],       # Press down and rotate
                'use_rotation': True,               # Rotation required
                'mass': 0.15,
                'length': 0.18,
            },
            'spatula': {
                'grasp_region': {'center': [0, 0, 0.12], 'radius': 0.015},
                'functional_region': {'center': [0.04, 0, -0.01], 'radius': 0.04},
                'grasp_orientation': 'perpendicular',
                'use_direction': [1, 0, 0],  # Slide horizontally
                'mass': 0.1,
                'length': 0.3,
            },
        }
    
    def get_grasp_reward(self, tool_name, grasp_pos, tool_frame):
        """Reward for grasping at the correct location."""
        tool = self.tools[tool_name]
        grasp_center = np.array(tool['grasp_region']['center'])
        
        # Transform grasp_center to world frame
        grasp_world = tool_frame @ np.append(grasp_center, 1)
        
        dist = np.linalg.norm(grasp_pos - grasp_world[:3])
        radius = tool['grasp_region']['radius']
        
        if dist < radius:
            return 1.0  # Grasp is inside the target region
        else:
            return max(0, 1.0 - (dist - radius) / 0.05)
    
    def get_functional_alignment(self, tool_name, tool_tip_pos, 
                                  target_pos, tool_orientation):
        """Check whether the tool is oriented correctly for use."""
        tool = self.tools[tool_name]
        use_dir = np.array(tool['use_direction'])
        
        # The tool tip must point in the right direction
        # ... alignment reward computation
        dist = np.linalg.norm(tool_tip_pos - target_pos)
        return 1.0 - np.tanh(5.0 * dist)
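
The method above only scores the tip-to-target distance; the direction check is elided. One possible way to include it, shown here purely as a sketch and not as the author's implementation, is to multiply the distance term by a cosine-based direction term. The function name `functional_alignment` and the `desired_direction` argument are hypothetical.

```python
import numpy as np

def functional_alignment(tool_tip_pos, target_pos, tool_rotation,
                         use_direction, desired_direction, dist_scale=5.0):
    """Combine tip-to-target distance with direction alignment.

    tool_rotation: 3x3 rotation matrix of the tool in the world frame.
    use_direction: unit vector in the tool frame (e.g. [0, 0, -1]).
    desired_direction: unit vector in the world frame along which the tool
        should act (assumed input, not part of the original class).
    """
    diff = np.asarray(tool_tip_pos, dtype=float) - np.asarray(target_pos, dtype=float)
    dist_term = 1.0 - np.tanh(dist_scale * np.linalg.norm(diff))

    # Express the tool's use direction in the world frame.
    use_world = np.asarray(tool_rotation, dtype=float) @ np.asarray(use_direction, dtype=float)
    cos_angle = np.clip(np.dot(use_world, desired_direction), -1.0, 1.0)
    dir_term = 0.5 * (1.0 + cos_angle)  # maps cosine in [-1, 1] to [0, 1]

    return float(dist_term * dir_term)

down = np.array([0.0, 0.0, -1.0])
target = np.array([0.5, 0.0, 0.42])
upright = np.eye(3)                    # tool pointing down, as intended
flipped = np.diag([1.0, -1.0, -1.0])   # tool rotated 180 degrees about x

score_aligned = functional_alignment(target, target, upright, down, down)
score_flipped = functional_alignment(target, target, flipped, down, down)
print(score_aligned, score_flipped)
```

Multiplying the two terms means a tool held upside down earns nothing even when its tip touches the target; an additive combination would be a softer alternative.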

Two-Phase Learning: Grasp Tool → Use Tool

A practical strategy for tool use is to split the task into two phases and train them separately:

Phase 1: Tool Grasping Policy

class ToolGraspEnv:
    """Environment for learning to grasp a tool correctly."""
    
    def __init__(self, tool_name="screwdriver"):
        self.tool_name = tool_name
        self.affordance = ToolAffordance()
        # ... MuJoCo setup ...
    
    def compute_reward(self, gripper_pos, tool_qpos, tool_contacts):
        """Reward for grasping the tool at the right location and orientation."""
        
        # 1. Approach the grasp region
        grasp_reward = self.affordance.get_grasp_reward(
            self.tool_name, gripper_pos, self.tool_frame
        )
        
        # 2. Orientation alignment: grasp with the correct orientation
        # ...
        
        # 3. Stable grasp: lift the tool without dropping it
        if self._is_grasped():
            lift_height = tool_qpos[2] - self.tool_init_height
            stable_reward = min(lift_height / 0.1, 1.0)
        else:
            stable_reward = 0.0
        
        return grasp_reward + 2.0 * stable_reward
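
The orientation term in step 2 is left out above. As a rough sketch of what it could look like, the function below scores the angle between the gripper's approach axis and the tool's long axis, using the `'parallel'` / `'perpendicular'` labels from `ToolAffordance`. Which gripper axis those labels refer to is an assumption here, and `grasp_orientation_reward` is a hypothetical name.

```python
import numpy as np

def grasp_orientation_reward(approach_axis, tool_axis, mode):
    """Score in [0, 1] for how well the gripper is oriented relative to the tool.

    approach_axis: gripper approach direction in the world frame (assumed).
    tool_axis: long axis of the tool handle in the world frame.
    mode: 'parallel' or 'perpendicular', as in ToolAffordance.
    """
    a = np.asarray(approach_axis, dtype=float)
    b = np.asarray(tool_axis, dtype=float)
    # Absolute cosine: the sign of either axis does not matter for a grasp.
    cos_abs = abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
    if mode == 'parallel':
        return float(cos_abs)
    if mode == 'perpendicular':
        return float(1.0 - cos_abs)
    raise ValueError(f"unknown grasp orientation mode: {mode!r}")

# Gripper approaching along x, tool handle lying along z.
side_grasp = grasp_orientation_reward([1, 0, 0], [0, 0, 1], 'perpendicular')
wrong_mode = grasp_orientation_reward([1, 0, 0], [0, 0, 1], 'parallel')
print(side_grasp, wrong_mode)
```

A term like this could be added to the returned sum with its own weight; the weight would need tuning against the position and lift terms.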

Phase 2: Tool Use Policy

class ToolUseEnv:
    """Environment for learning to use a tool that is already grasped."""
    
    def __init__(self, tool_name="screwdriver"):
        self.tool_name = tool_name
        # The tool is already attached to the gripper (grasping is skipped)
        # ... MuJoCo setup with welded tool ...
    
    def compute_reward(self, tool_tip_pos, target_pos,
                       force_on_target, task_progress):
        """Reward for using the tool."""
        
        # 1. Alignment: bring the tool tip to the target
        alignment = 1.0 - np.tanh(10.0 * np.linalg.norm(
            tool_tip_pos - target_pos
        ))
        
        # 2. Force application: reward force on the target once the tip is within 1 cm
        if np.linalg.norm(tool_tip_pos - target_pos) < 0.01:
            force_reward = np.tanh(force_on_target / 5.0)
        else:
            force_reward = 0.0
        
        # 3. Task progress
        progress_reward = 10.0 * task_progress
        
        # 4. Completion
        if task_progress >= 0.95:
            completion = 50.0
        else:
            completion = 0.0
        
        return alignment + 3.0 * force_reward + progress_reward + completion
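
Training the two phases separately leaves open how they are chained at execution time. A minimal sketch, assuming each trained policy is a callable from observation to action and that some grasp-stability check is available, is a controller that switches once from the grasp policy to the use policy. `TwoPhaseController` and `is_grasp_stable` are hypothetical names, and the dummy policies below stand in for trained networks.

```python
from typing import Callable
import numpy as np

Policy = Callable[[np.ndarray], np.ndarray]

class TwoPhaseController:
    """Run the grasp policy until the grasp is stable, then the use policy."""

    def __init__(self, grasp_policy: Policy, use_policy: Policy,
                 is_grasp_stable: Callable[[np.ndarray], bool]):
        self.grasp_policy = grasp_policy
        self.use_policy = use_policy
        self.is_grasp_stable = is_grasp_stable
        self.phase = "grasp"

    def reset(self) -> None:
        self.phase = "grasp"

    def act(self, obs: np.ndarray) -> np.ndarray:
        # One-way switch: once in the "use" phase we never go back.
        if self.phase == "grasp" and self.is_grasp_stable(obs):
            self.phase = "use"
        policy = self.grasp_policy if self.phase == "grasp" else self.use_policy
        return policy(obs)

# Dummy stand-ins: obs[0] plays the role of lift height in meters.
controller = TwoPhaseController(
    grasp_policy=lambda obs: np.array([0.0]),
    use_policy=lambda obs: np.array([1.0]),
    is_grasp_stable=lambda obs: obs[0] > 0.1,
)
actions = [controller.act(np.array([h]))[0] for h in (0.0, 0.2, 0.0)]
print(actions, controller.phase)
```

The one-way switch keeps the sketch simple; a real system would probably also want to fall back to the grasp phase if the tool is dropped.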

Screwdriver Task in MuJoCo

Here is a more complete example, in which the robot uses a screwdriver to drive a screw:

import mujoco

SCREWDRIVER_XML = """
<mujoco model="screwdriver_task">
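  <!-- MJCF angles default to degrees; the joint ranges below are in radians -->
  <compiler angle="radian"/>
  <option timestep="0.002" gravity="0 0 -9.81"/>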
  
  <worldbody>
    <light pos="0 0 2"/>
    <geom type="plane" size="1 1 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Workbench -->
    <body name="bench" pos="0.5 0 0.4">
      <geom type="box" size="0.3 0.3 0.02" rgba="0.5 0.3 0.1 1" mass="100"/>
    </body>
    
    <!-- Screw (embedded in workbench) -->
    <body name="screw" pos="0.5 0 0.42">
      <joint name="screw_rot" type="hinge" axis="0 0 1" 
             range="0 31.4" damping="0.5"/>  <!-- 5 full turns -->
      <geom name="screw_head" type="cylinder" size="0.008 0.003" 
            rgba="0.7 0.7 0.7 1" contype="1" conaffinity="1"/>
      <geom name="screw_slot" type="box" size="0.001 0.006 0.001" 
            pos="0 0 0.003" rgba="0.5 0.5 0.5 1"/>
      <site name="screw_top" pos="0 0 0.003" size="0.002"/>
    </body>
    
    <!-- Robot arm with attached screwdriver -->
    <body name="arm_base" pos="0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14" damping="2"/>
      <geom type="cylinder" size="0.04 0.03" rgba="0.7 0.7 0.7 1"/>
      
      <body name="l1" pos="0 0 0.06">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.5 1.5" damping="1.5"/>
        <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.03" rgba="0.7 0.7 0.7 1"/>
        
        <body name="l2" pos="0.25 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2 2" damping="1"/>
          <geom type="capsule" fromto="0 0 0 0.2 0 0" size="0.025" rgba="0.7 0.7 0.7 1"/>
          
          <body name="wrist" pos="0.2 0 0">
            <joint name="j3" type="hinge" axis="0 0 1" range="-100 100" damping="0.3"/>
            <joint name="j4" type="hinge" axis="1 0 0" range="-1.57 1.57" damping="0.5"/>
            <site name="ee" pos="0 0 0"/>
            
            <!-- Screwdriver (rigidly attached to gripper) -->
            <body name="screwdriver" pos="0 0 -0.01">
              <!-- Handle -->
              <geom name="sd_handle" type="cylinder" size="0.012 0.05" 
                    rgba="1 0.8 0.1 1" mass="0.08"/>
              <!-- Shaft -->
              <geom name="sd_shaft" type="cylinder" size="0.003 0.06" 
                    pos="0 0 -0.11" rgba="0.7 0.7 0.7 1" mass="0.02"/>
              <!-- Tip -->
              <geom name="sd_tip" type="box" size="0.001 0.005 0.003" 
                    pos="0 0 -0.173" rgba="0.6 0.6 0.6 1" mass="0.005"
                    contype="1" conaffinity="1" friction="1 0.1 0.01"/>
              <site name="tip" pos="0 0 -0.176" size="0.002"/>
            </body>
          </body>
        </body>
      </body>
    </body>
  </worldbody>
  
  <actuator>
    <position name="a0" joint="j0" kp="200"/>
    <position name="a1" joint="j1" kp="200"/>
    <position name="a2" joint="j2" kp="150"/>
    <velocity name="a3" joint="j3" kv="20"/>  <!-- Velocity control for rotation -->
    <position name="a4" joint="j4" kp="80"/>
  </actuator>
  
  <sensor>
    <force name="tip_force" site="tip"/>
    <jointpos name="screw_pos" joint="screw_rot"/>
  </sensor>
</mujoco>
"""


class ScrewdriverEnv:
    """Screwdriver insertion environment."""
    
    def __init__(self):
        self.model = mujoco.MjModel.from_xml_string(SCREWDRIVER_XML)
        self.data = mujoco.MjData(self.model)
        self.max_steps = 500
        self.target_turns = 5  # 5 full rotations
        self.target_angle = self.target_turns * 2 * np.pi
        # The screw joint and site are declared before the arm in the XML,
        # so positional indices are easy to get wrong; look things up by name.
        self.arm_joints = ['j0', 'j1', 'j2', 'j3', 'j4']
        # Populate site positions and sensor data before the first read
        mujoco.mj_forward(self.model, self.data)
    
    def _get_obs(self):
        joint_pos = np.array([self.data.joint(n).qpos[0] for n in self.arm_joints])
        joint_vel = np.array([self.data.joint(n).qvel[0] for n in self.arm_joints])
        
        tip_pos = self.data.site('tip').xpos.copy()          # screwdriver tip
        screw_pos = self.data.site('screw_top').xpos.copy()  # top of the screw
        
        screw_angle = self.data.joint('screw_rot').qpos[0]
        
        rel = tip_pos - screw_pos
        
        tip_force = self.data.sensor('tip_force').data.copy()
        
        return np.concatenate([
            joint_pos, joint_vel,
            tip_pos, screw_pos, rel,
            [screw_angle / self.target_angle],  # Normalized progress
            tip_force,
        ])
    
    def compute_reward(self):
        tip_pos = self.data.site('tip').xpos
        screw_pos = self.data.site('screw_top').xpos
        
        # 1. Alignment: bring the tip over the screw
        lateral_dist = np.linalg.norm(tip_pos[:2] - screw_pos[:2])
        height_dist = abs(tip_pos[2] - screw_pos[2])
        align_reward = 2.0 * (1.0 - np.tanh(20.0 * lateral_dist))
        
        # 2. Contact: tip is within 5 mm of the screw head, laterally and vertically
        contact_reward = 0.0
        if lateral_dist < 0.005 and height_dist < 0.005:
            contact_reward = 2.0
        
        # 3. Rotation progress
        screw_angle = self.data.joint('screw_rot').qpos[0]
        progress = screw_angle / self.target_angle
        rotation_reward = 10.0 * progress
        
        # 4. Completion
        if progress >= 0.95:
            complete_reward = 100.0
        else:
            complete_reward = 0.0
        
        # 5. Penalty for excessive force
        force = np.linalg.norm(self.data.sensor('tip_force').data)
        force_penalty = -0.1 * max(0, force - 10.0)
        
        return (align_reward + contact_reward + 
                rotation_reward + complete_reward + force_penalty)
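
Because `compute_reward` reads directly from simulator state, it is hard to sanity-check by hand. One option, sketched below, is to factor the same five terms into a pure function that takes plain arrays, so the reward can be evaluated at a few hand-picked states without running MuJoCo. `screwdriver_reward` and its argument names are hypothetical; the weights and thresholds are copied from the method above.

```python
import numpy as np

def screwdriver_reward(tip_pos, screw_pos, screw_angle, tip_force,
                       target_angle=5 * 2 * np.pi):
    """Same terms as ScrewdriverEnv.compute_reward, without simulator access."""
    tip_pos = np.asarray(tip_pos, dtype=float)
    screw_pos = np.asarray(screw_pos, dtype=float)

    lateral_dist = np.linalg.norm(tip_pos[:2] - screw_pos[:2])
    height_dist = abs(tip_pos[2] - screw_pos[2])

    align = 2.0 * (1.0 - np.tanh(20.0 * lateral_dist))
    contact = 2.0 if (lateral_dist < 0.005 and height_dist < 0.005) else 0.0
    progress = screw_angle / target_angle
    rotation = 10.0 * progress
    complete = 100.0 if progress >= 0.95 else 0.0
    force_penalty = -0.1 * max(0.0, np.linalg.norm(tip_force) - 10.0)

    return float(align + contact + rotation + complete + force_penalty)

screw = [0.5, 0.0, 0.423]
no_force = [0.0, 0.0, 0.0]

r_seated = screwdriver_reward(screw, screw, 0.0, no_force)            # tip on screw, no turns
r_done = screwdriver_reward(screw, screw, 5 * 2 * np.pi, no_force)    # all five turns done
r_pushing = screwdriver_reward(screw, screw, 0.0, [0.0, 0.0, -20.0])  # 20 N, above the 10 N limit
print(r_seated, r_done, r_pushing)
```

Evaluating a few states like this makes the relative scale of the terms visible: the completion bonus is far larger than the per-step shaping terms, and the force penalty only removes 0.1 per newton above the threshold.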

Tool Transfer: One Tool → Many Tools

One of the most interesting goals is transferring a skill between similar tools:

class ToolTransferTraining:
    """Train a policy that transfers between tools."""
    
    def __init__(self, source_tool, target_tools):
        self.source_tool = source_tool
        self.target_tools = target_tools
    
    def train_with_tool_embedding(self):
        """Use a tool embedding to generalize across tools."""
        
        # Tool embedding: encodes each tool's properties as
        # [length, tip_radius, mass, is_flat, is_phillips]
        tool_embeddings = {
            'screwdriver_flat': np.array([0.18, 0.005, 0.15, 1, 0]),
            'screwdriver_phillips': np.array([0.18, 0.005, 0.15, 0, 1]),
            'allen_key': np.array([0.12, 0.003, 0.08, 0, 0]),
            'hex_driver': np.array([0.15, 0.004, 0.12, 0, 0]),
        }
        
        # Concatenate the tool embedding with the observation so that
        # the policy learns tool-conditioned behavior
        return tool_embeddings
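
The method above returns the embeddings but does not show the concatenation step. A minimal sketch of that step follows, assuming the environment observation is a flat vector. The helper `tool_conditioned_obs` is a hypothetical name, and the embedding values are copied from the dictionary above.

```python
import numpy as np

# [length, tip_radius, mass, is_flat, is_phillips], as in the class above
TOOL_EMBEDDINGS = {
    'screwdriver_flat': np.array([0.18, 0.005, 0.15, 1, 0]),
    'screwdriver_phillips': np.array([0.18, 0.005, 0.15, 0, 1]),
    'allen_key': np.array([0.12, 0.003, 0.08, 0, 0]),
    'hex_driver': np.array([0.15, 0.004, 0.12, 0, 0]),
}

def tool_conditioned_obs(obs, tool_name, embeddings=TOOL_EMBEDDINGS):
    """Append the tool's embedding to a flat observation vector."""
    if tool_name not in embeddings:
        raise KeyError(f"no embedding for tool {tool_name!r}")
    return np.concatenate([np.asarray(obs, dtype=float), embeddings[tool_name]])

raw_obs = np.zeros(23)  # placeholder for an environment observation
obs_flat = tool_conditioned_obs(raw_obs, 'screwdriver_flat')
obs_allen = tool_conditioned_obs(raw_obs, 'allen_key')
print(obs_flat.shape, obs_flat[-5:], obs_allen[-5:])
```

The policy network's input size then grows by the embedding length (5 here), and the same weights can be trained on episodes from several tools by swapping `tool_name` per episode.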

Source Tool	Target Tool	Success Rate (no transfer)	Success Rate (with transfer)
Flat screwdriver	Phillips screwdriver	15%	72%
Flat screwdriver	Allen key	8%	58%
Hammer	Mallet	22%	81%
Spatula	Paint scraper	12%	65%

Transfer learning cuts training time by up to 5x for tools in the same family.

Next in the Series

The final post, Multi-Step Manipulation: Curriculum Learning for Long-Horizon Tasks, tackles the hardest problem: long chains of consecutive manipulation subtasks (10+ steps), using hierarchical RL and an automatic curriculum.

Nguyễn Anh Tuấn
