Force Control with RL: Keeping a Cup of Water Balanced

Imagine carrying a full cup of coffee from the kitchen to the table. Your brain continuously adjusts grip force, wrist angle, and walking speed, all without conscious effort. Now teach a robot to do the same. This is not just a grasping problem; it is force control: regulating forces finely enough that the object does not tilt, shake, or spill.

In the previous post, Grasping with RL, we learned how to grasp an object. Now we raise the bar: the robot must grasp a cup of water and keep it balanced for the entire motion.

The Cup-of-Water Problem

Why Is Force Control Hard?

Force control differs fundamentally from position control:

Criterion	Position Control	Force Control
Goal	Move the end-effector to position X	Maintain force/torque at value Y
Feedback	Encoder (joint position)	Force/torque sensor
Sensitivity	Tolerates errors of a few mm	Sensitive to changes of 0.1 N
Contact dynamics	Minor concern	Critical
Stability	Easy to stabilize	Prone to oscillation

When holding a cup of water, the robot must simultaneously:

Apply just enough grip force: too much breaks the cup, too little drops it
Keep the cup upright: tilting past about 15 degrees risks spilling
Move smoothly: sudden acceleration sloshes water over the rim
Reduce vibration: high jerk makes the water surface oscillate

Impedance Control Baseline

Before turning to RL, it helps to understand impedance control, the classical approach to force control. Impedance control models the robot as a spring-damper system:

$$F = K(x_{desired} - x) + D(\dot{x}_{desired} - \dot{x})$$

where $K$ is the stiffness and $D$ is the damping.
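As a quick check on how $K$ and $D$ interact, consider a single axis with a constant target ($\dot{x}_{desired} = 0$) and an effective moving mass $m$. The mass is an assumption introduced here for illustration; the post does not state one. With $m\ddot{x} = F$, the tracking error $e = x_{desired} - x$ behaves like a damped oscillator:

$$m\ddot{e} + D\dot{e} + Ke = 0, \qquad \omega_n = \sqrt{K/m}, \qquad \zeta = \frac{D}{2\sqrt{Km}}$$

With the default translational gains used in this post's controller ($K = 100$, $D = 20$) and an assumed $m = 1$ kg, $\zeta = 20 / (2\sqrt{100 \cdot 1}) = 1$, which is critically damped. A lighter effective mass pushes $\zeta$ above 1 (sluggish), and a heavier one pushes it below 1 (oscillatory), which hints at why fixed gains struggle once the payload changes.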

import numpy as np

class ImpedanceController:
    """Impedance controller for cup balancing (gains can be overwritten at runtime)."""

    def __init__(self, kp_pos=100.0, kd_pos=20.0,
                 kp_rot=50.0, kd_rot=10.0):
        self.kp_pos = kp_pos  # Stiffness (position)
        self.kd_pos = kd_pos  # Damping (position)
        self.kp_rot = kp_rot  # Stiffness (orientation)
        self.kd_rot = kd_rot  # Damping (orientation)

    def compute_action(self, current_pos, desired_pos,
                       current_vel, desired_vel,
                       current_rot, desired_rot,
                       current_angvel):
        """Compute the commanded force and torque (6-vector)."""
        # Position and velocity errors
        pos_error = desired_pos - current_pos
        vel_error = desired_vel - current_vel

        # Force command: spring term plus damper term
        force = self.kp_pos * pos_error + self.kd_pos * vel_error

        # Orientation error (simplified: keep the cup upright)
        rot_error = desired_rot - current_rot
        torque = self.kp_rot * rot_error - self.kd_rot * current_angvel

        return np.concatenate([force, torque])

    def compute_grasp_force(self, object_mass, tilt_angle):
        """Compute the grip force from the object mass and tilt angle (rad)."""
        # Minimum grip force so the cup does not slip:
        # m * g / (2 * mu), with two fingers and mu = 0.5
        min_force = object_mass * 9.81 / (2 * 0.5)

        # Increase the margin as the cup tilts (1.5 when upright, 2.0 at pi rad)
        safety_factor = 1.5 + 0.5 * abs(tilt_angle) / np.pi

        return min_force * safety_factor
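To make the grip-force arithmetic concrete, here is a minimal sketch of the same friction bound, evaluated as the water level drops. The helper name `min_grip_force` is hypothetical; the masses (0.09 kg for the cup shell, 0.2 kg for the liquid) are taken from the MuJoCo model later in this post, and mu = 0.5 is the value assumed in the controller above.

```python
G = 9.81  # gravitational acceleration, m/s^2


def min_grip_force(mass_kg: float, mu: float, n_contacts: int = 2) -> float:
    """Smallest normal force per finger whose friction can carry the weight.

    Friction at each contact supports at most mu * N, so
    n_contacts * mu * N >= m * g  =>  N >= m * g / (n_contacts * mu).
    """
    if mu <= 0 or n_contacts <= 0:
        raise ValueError("mu and n_contacts must be positive")
    return mass_kg * G / (n_contacts * mu)


CUP_SHELL_KG = 0.09  # 0.05 (bottom) + 4 * 0.01 (walls) in the model below

for water_kg in (0.20, 0.10, 0.00):
    total = CUP_SHELL_KG + water_kg
    print(f"water={water_kg:.2f} kg -> min grip force "
          f"{min_grip_force(total, mu=0.5):.2f} N per finger")
```

The required force falls by roughly two thirds between a full and an empty cup, so a grip tuned for one fill level is either too loose or needlessly tight at another.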

Impedance control works well for simple trajectories, but with fixed gains it cannot adapt to:

Changing water mass (as the cup is gradually emptied)
External disturbances (someone bumping the cup)
Slippery or uneven surfaces along the way

This is why we need RL.

Reward Design for Cup Balancing

Multi-Objective Reward

The reward function for cup balancing has to balance several competing objectives:

class CupBalanceReward:
    """Reward function for the cup-of-water balancing task."""

    def __init__(self):
        self.tilt_threshold = np.radians(15)   # Soft limit: 15 degrees
        self.spill_threshold = np.radians(30)  # Beyond this the water spills
        self.jerk_weight = 0.1
        self.prev_vel = None

    def compute(self, cup_tilt, cup_angular_vel, ee_vel,
                ee_accel, goal_dist, action, grasping):
        """
        Args:
            cup_tilt: Cup tilt angle from vertical (rad)
            cup_angular_vel: Cup angular velocity [3]
            ee_vel: End-effector velocity [3]
            ee_accel: End-effector acceleration [3]
            goal_dist: Distance to the goal
            action: Action vector
            grasping: Bool, True while the cup is held
        Returns:
            (total reward, info dict with each term plus a 'spill' flag)
        """
        if not grasping:
            return -50.0, {'spill': True}  # Large penalty for dropping the cup

        rewards = {}

        # 1. TILT PENALTY: keep the cup upright
        tilt_magnitude = abs(cup_tilt)
        spilled = tilt_magnitude > self.spill_threshold
        if spilled:
            rewards['tilt'] = -20.0  # Water spilled
        else:
            # Penalty grows quadratically with the tilt angle
            rewards['tilt'] = -5.0 * (tilt_magnitude / self.tilt_threshold) ** 2

        # 2. ANGULAR VELOCITY PENALTY: reduce wobble
        ang_vel_mag = np.linalg.norm(cup_angular_vel)
        rewards['angular_vel'] = -2.0 * np.tanh(3.0 * ang_vel_mag)

        # 3. SMOOTHNESS PENALTY: acceleration magnitude is used as a proxy for jerk
        jerk = np.linalg.norm(ee_accel)
        rewards['jerk'] = -self.jerk_weight * np.tanh(jerk)

        # 4. PROGRESS REWARD: move toward the goal
        rewards['progress'] = 2.0 * (1.0 - np.tanh(3.0 * goal_dist))

        # 5. SPEED REWARD: fast, but not too fast
        speed = np.linalg.norm(ee_vel)
        if goal_dist > 0.1:
            # Encourage motion while far from the goal (saturates at 0.3 m/s)
            rewards['speed'] = 0.5 * min(speed, 0.3) / 0.3
        else:
            # Slow down near the goal
            rewards['speed'] = -1.0 * speed

        # 6. SUCCESS BONUS
        if goal_dist < 0.05 and tilt_magnitude < self.tilt_threshold:
            rewards['success'] = 20.0
        else:
            rewards['success'] = 0.0

        # 7. ACTION MAGNITUDE PENALTY: discourage large commands
        rewards['action_smooth'] = -0.01 * np.sum(action ** 2)

        # Sum only the numeric terms; the spill flag is reported separately
        total = sum(rewards.values())
        info = dict(rewards, spill=spilled)
        return total, info
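The tilt term is the one that most directly encodes "do not spill", so it is worth looking at its shape in isolation. The sketch below re-implements only that term as a standalone function (the name `tilt_penalty` is hypothetical) using the same constants as the reward class.

```python
import numpy as np

TILT_SOFT = np.radians(15)   # soft limit used to normalize the penalty
TILT_SPILL = np.radians(30)  # beyond this the flat spill penalty applies


def tilt_penalty(tilt_rad: float) -> float:
    """Quadratic penalty up to the spill angle, flat -20 beyond it."""
    t = abs(tilt_rad)
    if t > TILT_SPILL:
        return -20.0
    return -5.0 * (t / TILT_SOFT) ** 2


for deg in (0, 5, 15, 30, 45):
    print(f"tilt={deg:2d} deg -> penalty {tilt_penalty(np.radians(deg)):7.3f}")
```

Because 30 degrees is exactly twice the soft limit, the quadratic branch reaches -5 * 2^2 = -20 at the spill angle, so the penalty is continuous there. The learning signal therefore grows smoothly all the way up to a spill rather than jumping at the threshold.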

Trade-off Analysis

This reward makes the multi-objective trade-offs explicit:

Speed vs. stability: the robot wants to reach the goal quickly (progress reward) but must not shake the cup (jerk penalty)
Firm vs. gentle grip: gripping too hard causes vibration, gripping too softly drops the cup
Upright vs. moving: the cup should stay upright, yet it has to tilt slightly when turning
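The last trade-off can be put in numbers with a standard physics result: under a steady horizontal acceleration a, a liquid surface settles perpendicular to the effective gravity, so leaning the cup by atan(a / g) keeps the surface parallel to the rim. The sketch below is only that formula; the function names are hypothetical, and steady acceleration is an idealization that ignores sloshing transients.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def compensating_tilt_deg(lateral_accel: float) -> float:
    """Tilt (degrees) that aligns the cup axis with effective gravity."""
    return math.degrees(math.atan2(lateral_accel, G))


def max_accel_for_tilt(tilt_deg: float) -> float:
    """Largest steady lateral acceleration (m/s^2) matched by a given tilt."""
    return G * math.tan(math.radians(tilt_deg))


for a in (0.5, 1.0, 2.0):
    print(f"a={a:.1f} m/s^2 -> lean {compensating_tilt_deg(a):.2f} deg")
print(f"15 deg of lean matches {max_accel_for_tilt(15):.2f} m/s^2")
```

Under this idealization a 1 m/s^2 turn calls for a lean of about 5.8 degrees, comfortably inside the 15 degree soft limit, which is why a reward that penalizes small tilts only mildly leaves room for this behavior.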

MuJoCo Environment: Cup with Liquid Approximation

MuJoCo does not directly support liquid simulation, but we can approximate it with rigid-body dynamics: a small damped ball that slides inside the cup stands in for the sloshing water.

import mujoco
import numpy as np

CUP_BALANCE_XML = """
<mujoco model="cup_balance">
  <option timestep="0.002" gravity="0 0 -9.81"/>
  
  <worldbody>
    <light pos="0 0 3" dir="0 0 -1"/>
    <geom type="plane" size="2 2 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.6 0.6 0.02" rgba="0.6 0.4 0.2 1" mass="100"/>
    </body>
    
    <!-- Robot arm (simplified 5-DOF) -->
    <body name="base" pos="0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14" damping="2"/>
      <geom type="cylinder" size="0.05 0.04" rgba="0.3 0.3 0.3 1"/>
      
      <body name="l1" pos="0 0 0.08">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.57 1.57" damping="2"/>
        <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.035" rgba="0.7 0.7 0.7 1"/>
        
        <body name="l2" pos="0.3 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2.5 2.5" damping="1.5"/>
          <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.03" rgba="0.7 0.7 0.7 1"/>
          
          <body name="l3" pos="0.25 0 0">
            <joint name="j3" type="hinge" axis="0 0 1" range="-3.14 3.14" damping="1"/>
            <geom type="capsule" fromto="0 0 0 0.1 0 0" size="0.025" rgba="0.5 0.5 0.5 1"/>
            
            <body name="wrist" pos="0.1 0 0">
              <joint name="j4" type="hinge" axis="1 0 0" range="-1.57 1.57" damping="1"/>
              <site name="ee" pos="0 0 0" size="0.01"/>
              
              <!-- Gripper fingers -->
              <body name="fl" pos="0 0.025 0">
                <joint name="jfl" type="slide" axis="0 1 0" range="0 0.035" damping="5"/>
                <geom type="box" size="0.008 0.004 0.04" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.5 0.01"/>
              </body>
              <body name="fr" pos="0 -0.025 0">
                <joint name="jfr" type="slide" axis="0 -1 0" range="0 0.035" damping="5"/>
                <geom type="box" size="0.008 0.004 0.04" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.5 0.01"/>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Cup -->
    <body name="cup" pos="0.45 0 0.44">
      <freejoint name="cup_free"/>
      <site name="cup_top" pos="0 0 0.06" size="0.005"/>
      
      <!-- Cup walls (hollow cylinder approximation) -->
      <geom name="cup_bottom" type="cylinder" size="0.03 0.003" pos="0 0 0" 
            rgba="0.9 0.9 1 0.8" mass="0.05" contype="1" conaffinity="1"/>
      <geom name="cup_wall1" type="box" size="0.003 0.03 0.03" pos="0.03 0 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01" contype="1" conaffinity="1"/>
      <geom name="cup_wall2" type="box" size="0.003 0.03 0.03" pos="-0.03 0 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01" contype="1" conaffinity="1"/>
      <geom name="cup_wall3" type="box" size="0.03 0.003 0.03" pos="0 0.03 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01" contype="1" conaffinity="1"/>
      <geom name="cup_wall4" type="box" size="0.03 0.003 0.03" pos="0 -0.03 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01" contype="1" conaffinity="1"/>
      
      <!-- Liquid approximation: ball inside cup -->
      <body name="liquid" pos="0 0 0.02">
        <joint name="liquid_x" type="slide" axis="1 0 0" range="-0.02 0.02" damping="5"/>
        <joint name="liquid_y" type="slide" axis="0 1 0" range="-0.02 0.02" damping="5"/>
        <geom name="liquid_ball" type="sphere" size="0.02" rgba="0.2 0.5 1 0.6" 
              mass="0.2" contype="0" conaffinity="0"/>
      </body>
    </body>
    
    <!-- Goal position -->
    <body name="goal" pos="0.5 0.3 0.55">
      <geom type="sphere" size="0.03" rgba="0 1 0 0.3" contype="0" conaffinity="0"/>
      <site name="goal_site" pos="0 0 0" size="0.01"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="a0" joint="j0" kp="200"/>
    <position name="a1" joint="j1" kp="200"/>
    <position name="a2" joint="j2" kp="200"/>
    <position name="a3" joint="j3" kp="100"/>
    <position name="a4" joint="j4" kp="100"/>
    <position name="afl" joint="jfl" kp="80"/>
    <position name="afr" joint="jfr" kp="80"/>
  </actuator>
</mujoco>
"""


class CupBalanceEnv:
    """Environment for the cup balancing task.

    qpos layout for the model above: the 5 arm joints and 2 finger joints
    occupy indices 0-6, the cup free joint occupies 7-13 (position 7:10,
    quaternion 10:14), and the liquid slide joints are 14 and 15.
    The helpers _check_grasp() and _get_obs() are not shown in this post.
    """

    def __init__(self):
        self.model = mujoco.MjModel.from_xml_string(CUP_BALANCE_XML)
        self.data = mujoco.MjData(self.model)
        self.reward_fn = CupBalanceReward()
        self.max_steps = 300
        self.steps = 0
        self.goal_pos = np.array([0.5, 0.3, 0.55])
        self.dt = self.model.opt.timestep * 10  # 10 physics steps per control step
        mujoco.mj_forward(self.model, self.data)  # Populate site positions
        self.prev_ee_pos = self.data.site_xpos[0].copy()
        self.prev_ee_vel = np.zeros(3)

    def get_cup_tilt(self):
        """Compute the cup tilt angle from vertical (rad)."""
        cup_quat = self.data.qpos[10:14]  # Cup quaternion (w, x, y, z)
        # Convert the quaternion to a rotation matrix
        rot = np.zeros(9)
        mujoco.mju_quat2Mat(rot, cup_quat)
        rot = rot.reshape(3, 3)
        # Up vector of the cup in world coordinates
        cup_up = rot[:, 2]
        # Angle between cup_up and world up (0, 0, 1)
        cos_angle = cup_up[2]  # Dot product with (0, 0, 1)
        tilt = np.arccos(np.clip(cos_angle, -1, 1))
        return tilt

    def get_liquid_offset(self):
        """Return the liquid ball's offset relative to the cup."""
        liq_x = self.data.qpos[14]  # liquid_x joint
        liq_y = self.data.qpos[15]  # liquid_y joint
        return np.array([liq_x, liq_y])

    def step(self, action):
        # Apply the action: small joint deltas keep the motion smooth
        joint_delta = action[:5] * 0.03
        gripper = (action[5] + 1) / 2 * 0.035

        self.data.ctrl[:5] = self.data.qpos[:5] + joint_delta
        self.data.ctrl[5] = gripper
        self.data.ctrl[6] = gripper

        for _ in range(10):
            mujoco.mj_step(self.model, self.data)
        self.steps += 1

        # Finite-difference end-effector velocity and acceleration
        ee_pos = self.data.site_xpos[0].copy()  # Site 0 is "ee"
        ee_vel = (ee_pos - self.prev_ee_pos) / self.dt
        ee_accel = (ee_vel - self.prev_ee_vel) / self.dt

        cup_tilt = self.get_cup_tilt()
        cup_angular_vel = self.data.qvel[10:13]  # Angular part of the cup free joint
        goal_dist = np.linalg.norm(ee_pos - self.goal_pos)
        liquid_offset = self.get_liquid_offset()

        # Check whether the cup is still grasped
        grasping = self._check_grasp()

        reward, info = self.reward_fn.compute(
            cup_tilt, cup_angular_vel, ee_vel,
            ee_accel, goal_dist, action, grasping
        )

        self.prev_ee_vel = ee_vel
        self.prev_ee_pos = ee_pos

        # End the episode on a spill/drop or when the step budget runs out
        done = bool(info['spill']) or self.steps >= self.max_steps
        return self._get_obs(), reward, done, info
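The tilt computation does not actually need the full rotation matrix. For a unit quaternion (w, x, y, z), which is the ordering MuJoCo uses, the z-component of the rotated up vector is 1 - 2(x^2 + y^2). The NumPy-only sketch below uses that identity so the geometry can be checked without a simulator; both function names are hypothetical.

```python
import numpy as np


def tilt_from_quat(quat_wxyz) -> float:
    """Angle (rad) between the body z-axis and world z for a (w, x, y, z) quaternion."""
    q = np.asarray(quat_wxyz, dtype=float)
    q = q / np.linalg.norm(q)           # guard against drift from unit length
    _, x, y, _ = q
    cos_tilt = 1.0 - 2.0 * (x * x + y * y)  # entry R[2, 2] of the rotation matrix
    return float(np.arccos(np.clip(cos_tilt, -1.0, 1.0)))


def axis_angle_quat(axis, angle_rad) -> np.ndarray:
    """Build a (w, x, y, z) quaternion from a rotation axis and angle."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    half = 0.5 * angle_rad
    return np.concatenate([[np.cos(half)], np.sin(half) * axis])


upright = [1.0, 0.0, 0.0, 0.0]
leaning = axis_angle_quat([1, 0, 0], np.radians(30))  # 30 deg about x
spun = axis_angle_quat([0, 0, 1], np.radians(90))     # yaw only

print(np.degrees(tilt_from_quat(upright)))
print(np.degrees(tilt_from_quat(leaning)))
print(np.degrees(tilt_from_quat(spun)))
```

Rotating about the vertical axis leaves the tilt at zero, which is the behavior a cup-balancing reward wants: yaw is free, only roll and pitch are penalized.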

Training with SAC

from stable_baselines3 import SAC

# cup_env is CupBalanceEnv wrapped for Gymnasium compatibility
# (wrapper code omitted for brevity)

model = SAC(
    "MlpPolicy",
    cup_env,
    learning_rate=1e-4,       # Lower than the default for stability
    buffer_size=500_000,
    batch_size=512,
    tau=0.001,                # Slow target-network update
    gamma=0.995,              # Higher than the usual 0.99
    train_freq=2,
    gradient_steps=2,
    ent_coef="auto",
    target_entropy="auto",
    verbose=1,
    policy_kwargs=dict(
        net_arch=[256, 256, 128],  # Larger network
    )
)

model.learn(total_timesteps=3_000_000)
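Two of these hyperparameters are easier to judge once converted to time scales. The sketch below uses the common rule of thumb that a discount factor gamma looks roughly 1 / (1 - gamma) steps ahead, and computes how many soft updates a target network needs to close half the gap to the online network. The function names are hypothetical; the control period of 0.02 s comes from the environment above (10 physics steps of 0.002 s).

```python
import math

CONTROL_DT = 0.002 * 10  # seconds per environment step in this post's setup


def effective_horizon_steps(gamma: float) -> float:
    """Rule-of-thumb planning horizon, in steps, for a discount factor."""
    return 1.0 / (1.0 - gamma)


def polyak_half_life(tau: float) -> float:
    """Soft updates needed for the target network to close half the gap."""
    return math.log(0.5) / math.log(1.0 - tau)


for gamma in (0.99, 0.995):
    steps = effective_horizon_steps(gamma)
    print(f"gamma={gamma}: ~{steps:.0f} steps = {steps * CONTROL_DT:.1f} s")

for tau in (0.005, 0.001):
    print(f"tau={tau}: half-life ~{polyak_half_life(tau):.0f} updates")
```

With gamma = 0.995 the horizon is about 200 steps, or 4 s, which covers most of a 300-step (6 s) episode, whereas 0.99 would cover only about 2 s. Lowering tau from 0.005 to 0.001 makes the target network roughly five times slower to follow, trading learning speed for steadier value targets.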

Comparing RL and Impedance Control

Metric	Impedance Control	SAC (RL)
Max tilt (avg)	12.3 deg	6.8 deg
Spill rate	18%	4%
Avg travel time	8.2 s	5.1 s
Jerk (smoothness)	15.6	8.3
Adapts to new cups	Needs retuning	Adapts on its own
Adapts to perturbation	Poor	Good

RL clearly comes out ahead on these metrics, especially in adaptability. The learned policy slows down before turning, tilts the cup slightly to compensate for centrifugal force, and reacts quickly to disturbances.

Advanced Technique: Variable Impedance RL

A powerful approach is to combine impedance control with RL: the RL policy does not output joint commands directly but instead sets the impedance parameters ($K$, $D$):

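class VariableImpedancePolicy:
    """Maps RL outputs to impedance parameters for a base controller."""

    def __init__(self, base_controller):
        self.controller = base_controller

    def get_trajectory_point(self):
        """Nominal desired position at the current time (not shown in this post)."""
        raise NotImplementedError

    def act(self, obs, rl_output):
        """
        obs: assumed here to be a dict with keys 'ee_pos', 'ee_vel',
             'cup_up' (cup up vector) and 'cup_angvel'
        rl_output: [kp_x, kp_y, kp_z, kd_x, kd_y, kd_z,
                     desired_x, desired_y, desired_z]
        """
        # RL picks stiffness and damping in log space; clipping the
        # log-gains to +/- ln(10) is what makes the stated ranges hold
        log_gain = np.clip(rl_output[:6], -np.log(10), np.log(10))
        kp = np.exp(log_gain[:3]) * 50   # [5, 500] range
        kd = np.exp(log_gain[3:]) * 5    # [0.5, 50] range

        # RL picks a desired position offset, at most 2 cm per axis
        desired_offset = np.clip(rl_output[6:9], -1, 1) * 0.02

        # Per-axis gains stay as vectors because compute_action
        # multiplies gains and errors element-wise
        self.controller.kp_pos = kp
        self.controller.kd_pos = kd

        current_desired = self.get_trajectory_point() + desired_offset

        return self.controller.compute_action(
            obs['ee_pos'], current_desired,
            obs['ee_vel'], np.zeros(3),
            obs['cup_up'], np.array([0, 0, 1]),
            obs['cup_angvel']
        )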
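The stiffness range quoted above, 5 to 500, is one decade on either side of the nominal value 50. Whether a policy can actually reach those ends depends on how its raw outputs are scaled. The sketch below shows one possible mapping, assuming the policy emits values in [-1, 1] (as a tanh-squashed SAC actor typically does); `log_space_gain` is a hypothetical helper, not part of any library.

```python
import numpy as np


def log_space_gain(u, nominal: float, decades: float = 1.0) -> np.ndarray:
    """Map u in [-1, 1] to nominal * 10**(decades * u), clipping out-of-range inputs."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return nominal * 10.0 ** (decades * u)


u = np.array([-1.0, 0.0, 1.0])
print("base-10 mapping :", log_space_gain(u, nominal=50.0))
print("exp(u) * 50     :", 50.0 * np.exp(u))
```

With inputs limited to [-1, 1], the plain exp(u) * 50 form only spans about 18.4 to 135.9, so either the exponent needs rescaling as sketched here or the raw outputs must be allowed to reach about 2.3 in magnitude. Working in log space either way keeps the gains strictly positive and gives the policy equal resolution at soft and stiff settings.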

This approach has a major advantage for sim-to-real transfer: the impedance controller provides safety bounds while RL provides adaptability. For details on sim-to-real for force control, see the Domain Randomization post.

Next in the Series

In the next post, Precise Pick-and-Place: Position and Orientation Control, we tackle placing objects with sub-centimeter accuracy, including orientation alignment. Hindsight Experience Replay (HER) will take center stage.