Precise Pick-and-Place: Position and Orientation Control

Pick-and-place is the most common task in industrial robotics: picking components off a conveyor and placing them precisely on a PCB, a pallet, or an assembly line. It sounds simple, but once both position and orientation must be accurate to better than 1 cm, the problem becomes very challenging for RL.

The previous post, Force Control with RL, covered fine-grained force control. This post focuses on precision, and our secret weapon is Hindsight Experience Replay (HER).

Why Is Precision Pick-and-Place Hard for RL?

In ordinary pick-and-place, the robot only needs to bring the object "near" the target position. In manufacturing, however:

Electronic components must be placed with an error below 0.5 mm
Orientation must be correct (a part rotated by 90 degrees is a reject)
Many steps: reach → grasp → lift → transport → align → place → release

Challenge	Description	RL solution
Sparse reward	+1 only when the placement is fully correct	HER
6-DOF goal	Needs both position and orientation	SE(3) goal space
Precision	Sub-cm accuracy	Fine-tuned reward shaping
Multi-phase	Several consecutive phases	Phase-based reward
Long horizon	200+ steps	Discount tuning
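To make the "Long horizon" row more concrete, the short sketch below relates the discount factor to how much a single terminal reward is worth at the start of an episode. The helper names `effective_horizon` and `discounted_terminal_value` are illustrative and not part of any library, and the 1 / (1 - gamma) horizon is only a rule of thumb.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb planning horizon 1 / (1 - gamma) for gamma < 1."""
    return 1.0 / (1.0 - gamma)


def discounted_terminal_value(gamma: float, steps: int, reward: float = 1.0) -> float:
    """Value at step 0 of a single reward received `steps` steps later."""
    return reward * gamma ** steps


for gamma in (0.95, 0.99, 0.995):
    print(
        f"gamma={gamma}: horizon ~ {effective_horizon(gamma):.0f} steps, "
        f"+1 after 200 steps is worth {discounted_terminal_value(gamma, 200):.2e}"
    )
```

With a small discount factor, a success reward 200 steps away is almost invisible at the first step, which is one reason the discount factor is usually tuned together with the episode length.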

Hindsight Experience Replay (HER)

The Sparse Reward Problem

With a sparse reward (+1 only on success), the robot almost never receives a positive reward early in training, because the probability of hitting the exact placement through random actions is extremely low. No reward signal means nothing to learn from.
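A rough back-of-the-envelope estimate shows how low that probability is. The sketch assumes the goal ranges and thresholds used in the environment code later in this article, and it optimistically assumes the final object pose is uniformly distributed over the goal region; a real random policy does worse, since it first has to grasp the object at all.

```python
import math

# Goal sampling region in metres (x, y, z), as used in the environment code below
workspace_low = (0.3, -0.2, 0.42)
workspace_high = (0.7, 0.2, 0.5)
pos_threshold = 0.01  # success radius: 1 cm
yaw_threshold = 0.1   # success if the yaw error is below 0.1 rad

# Volume of the goal region versus volume of the success sphere around the goal
workspace_volume = math.prod(h - l for l, h in zip(workspace_low, workspace_high))
success_volume = 4.0 / 3.0 * math.pi * pos_threshold ** 3

p_position = success_volume / workspace_volume
p_yaw = 2.0 * yaw_threshold / (2.0 * math.pi)  # yaw uniform over a full turn
p_success = p_position * p_yaw

print(f"P(position ok) ~ {p_position:.1e}")
print(f"P(yaw ok)      ~ {p_yaw:.1e}")
print(f"P(both ok)     ~ {p_success:.1e}")
```

Under these assumptions, only on the order of one random placement in a hundred thousand would earn the +1.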

The HER Idea

HER solves this with a simple but ingenious trick: after each episode, pretend that the original goal was the position the robot actually reached.

Example: the robot tries to place the object at $(0.5, 0.3, 0.4)$ but actually places it at $(0.3, 0.1, 0.4)$. Instead of treating this episode as a complete failure, HER creates an additional "virtual experience" with goal = $(0.3, 0.1, 0.4)$, and under that goal the episode counts as a success.
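The relabeling in this example can be sketched in a few lines. The `sparse_reward` helper and its 1 cm threshold are assumptions made for illustration, and only position is considered here.

```python
import numpy as np


def sparse_reward(achieved, goal, threshold=0.01):
    """Return 1.0 if `achieved` lies within `threshold` metres of `goal`, else 0.0."""
    distance = np.linalg.norm(np.asarray(achieved) - np.asarray(goal))
    return float(distance < threshold)


desired_goal = np.array([0.5, 0.3, 0.4])   # where the robot was asked to place the object
achieved_goal = np.array([0.3, 0.1, 0.4])  # where the object actually ended up

# Reward under the original goal: the episode is a failure
original_reward = sparse_reward(achieved_goal, desired_goal)

# Hindsight relabeling: pretend the achieved position was the goal all along
relabeled_reward = sparse_reward(achieved_goal, achieved_goal)

print(original_reward, relabeled_reward)
```

The same transition is stored twice: once with the original goal and zero reward, and once with the hindsight goal and a success reward.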

import numpy as np
from collections import deque

class HindsightExperienceReplay:
    """HER implementation for goal-conditioned manipulation."""
    
    def __init__(self, buffer_size=1_000_000, k_future=4, 
                 strategy="future"):
        self.buffer = deque(maxlen=buffer_size)
        self.k_future = k_future  # Number of hindsight goals per transition
        self.strategy = strategy
        self.episode_buffer = []
    
    def store_transition(self, obs, action, reward, next_obs, 
                         done, achieved_goal, desired_goal):
        """Store a transition from the current episode."""
        self.episode_buffer.append({
            'obs': obs,
            'action': action,
            'reward': reward,
            'next_obs': next_obs,
            'done': done,
            'achieved_goal': achieved_goal,
            'desired_goal': desired_goal,
        })
    
    def end_episode(self, compute_reward_fn):
        """Finish the episode and create hindsight goals."""
        episode = self.episode_buffer
        n = len(episode)

        for i, transition in enumerate(episode):
            # Store the original transition
            self.buffer.append(transition)

            # Choose which steps supply the k hindsight goals
            if self.strategy == "future":
                # Steps from the current one to the end of the episode
                indices = np.random.randint(i, n, size=self.k_future)
            elif self.strategy == "episode":
                # Any step in the episode
                indices = np.random.randint(0, n, size=self.k_future)
            else:
                raise ValueError(f"Unknown HER strategy: {self.strategy!r}")

            for idx in indices:
                # Use the achieved_goal of the chosen step as the new desired_goal
                new_goal = episode[idx]['achieved_goal']

                # Recompute the reward against the new goal
                new_reward = compute_reward_fn(
                    transition['achieved_goal'], new_goal
                )

                # Build the hindsight transition
                hindsight = {
                    'obs': self._replace_goal(transition['obs'], new_goal),
                    'action': transition['action'],
                    'reward': new_reward,
                    'next_obs': self._replace_goal(transition['next_obs'], new_goal),
                    'done': transition['done'],
                    'achieved_goal': transition['achieved_goal'],
                    'desired_goal': new_goal,
                }
                self.buffer.append(hindsight)

        self.episode_buffer = []
    
    def _replace_goal(self, obs, new_goal):
        """Replace the goal inside the observation."""
        new_obs = obs.copy()
        # Assumes the goal occupies the tail of the obs vector
        new_obs[-len(new_goal):] = new_goal
        return new_obs
    
    def sample(self, batch_size):
        """Sample a batch for training."""
        indices = np.random.randint(0, len(self.buffer), size=batch_size)
        batch = [self.buffer[i] for i in indices]
        return batch

Why Does HER Work?

HER substantially improves data efficiency because:

Every transition, even from a failed episode, yields $k$ extra relabeled transitions, some of which carry a success reward
The policy learns goal-conditioned behavior: it knows how to reach any position
No elaborate reward shaping is required: the sparse reward is enough
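The effect on reward density can be illustrated with a toy experiment. The sketch below is not a robot simulation: it assumes a random-walk "policy" in 3D, a 1 cm success radius, and the "future" strategy with the same index convention as the class above (the current step is a valid hindsight goal). All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
THRESHOLD = 0.01  # success radius in metres


def sparse_reward(achieved, goal):
    """1.0 if the achieved position is within THRESHOLD of the goal, else 0.0."""
    return float(np.linalg.norm(achieved - goal) < THRESHOLD)


def rollout(steps=50):
    """Random walk from the origin; returns positions after each step and a random goal."""
    goal = rng.uniform(-0.5, 0.5, size=3)
    achieved = np.cumsum(rng.normal(scale=0.02, size=(steps, 3)), axis=0)
    return achieved, goal


def reward_stats(episodes=200, k_future=4):
    """Fraction of positive rewards for original versus relabeled transitions."""
    original, relabeled = [], []
    for _ in range(episodes):
        achieved, goal = rollout()
        n = len(achieved)
        for t in range(n):
            original.append(sparse_reward(achieved[t], goal))
            # "future" strategy: goals come from steps t, t+1, ..., n-1
            for idx in rng.integers(t, n, size=k_future):
                relabeled.append(sparse_reward(achieved[t], achieved[idx]))
    return float(np.mean(original)), float(np.mean(relabeled))


orig_rate, her_rate = reward_stats()
print(f"positive rewards without HER: {orig_rate:.4f}")
print(f"positive rewards among relabeled transitions: {her_rate:.4f}")
```

In this toy setting most of the positive relabeled rewards come from goals taken at the current step or very close to it, but they still give the critic a non-trivial signal where the original rewards are almost always zero.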

6-DOF Goal Space

Precision pick-and-place requires both position (3-DOF) and orientation (3-DOF):

class SE3Goal:
    """Goal representation in SE(3) for 6-DOF placement."""
    
    def __init__(self, pos_threshold=0.01, rot_threshold=0.1):
        self.pos_threshold = pos_threshold  # 1cm
        self.rot_threshold = rot_threshold  # ~5.7 degrees
    
    def compute_reward(self, achieved, desired):
        """
        Compute the reward for a 6-DOF goal.
        achieved: [x, y, z, qw, qx, qy, qz]
        desired: [x, y, z, qw, qx, qy, qz]
        """
        # Position error
        pos_error = np.linalg.norm(achieved[:3] - desired[:3])
        
        # Orientation error (quaternion distance)
        q1 = achieved[3:7] / np.linalg.norm(achieved[3:7])
        q2 = desired[3:7] / np.linalg.norm(desired[3:7])
        dot = abs(np.dot(q1, q2))
        rot_error = 2.0 * np.arccos(np.clip(dot, 0, 1))
        
        # Sparse reward
        if pos_error < self.pos_threshold and rot_error < self.rot_threshold:
            return 1.0
        return 0.0
    
    def compute_dense_reward(self, achieved, desired):
        """Dense reward variant for faster learning."""
        pos_error = np.linalg.norm(achieved[:3] - desired[:3])
        
        q1 = achieved[3:7] / np.linalg.norm(achieved[3:7])
        q2 = desired[3:7] / np.linalg.norm(desired[3:7])
        dot = abs(np.dot(q1, q2))
        rot_error = 2.0 * np.arccos(np.clip(dot, 0, 1))
        
        pos_reward = 1.0 - np.tanh(10.0 * pos_error)
        rot_reward = 1.0 - np.tanh(5.0 * rot_error)
        
        success = float(pos_error < self.pos_threshold 
                        and rot_error < self.rot_threshold)
        
        return 0.5 * pos_reward + 0.5 * rot_reward + 10.0 * success
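The orientation error used in both reward functions is the rotation angle between two unit quaternions, $\theta = 2 \arccos(|\langle q_1, q_2 \rangle|)$. The standalone sketch below repeats that formula in a helper called `quat_angle` (a name chosen here for illustration) and checks three cases: a 90-degree yaw error, the sign ambiguity between q and -q, and a small error inside the 0.1 rad threshold.

```python
import numpy as np


def quat_angle(q1, q2):
    """Rotation angle in radians between two quaternions in [qw, qx, qy, qz] order."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    dot = abs(np.dot(q1, q2))  # abs() treats q and -q as the same rotation
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))


def yaw_quat(yaw):
    """Quaternion for a rotation of `yaw` radians about the z axis."""
    return np.array([np.cos(yaw / 2), 0.0, 0.0, np.sin(yaw / 2)])


identity = yaw_quat(0.0)

# A part rotated by 90 degrees about z
print(np.degrees(quat_angle(identity, yaw_quat(np.pi / 2))))

# q and -q encode the same rotation, so the error is zero
print(quat_angle(identity, -identity))

# A 0.05 rad yaw error is inside the 0.1 rad threshold
print(quat_angle(identity, yaw_quat(0.05)) < 0.1)
```

Without the absolute value, a goal stored as -q would look like a full-turn error even though the physical orientation is identical.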

Training Pipeline for Precision Pick-and-Place

Using Stable-Baselines3 with HER

import numpy as np
import gymnasium as gym
from stable_baselines3 import SAC, HerReplayBuffer

# SE3Goal is the class defined in the 6-DOF Goal Space section above.
# A custom pick-and-place env must follow the goal-conditioned (GoalEnv-style) interface:
# observation space = Dict with "observation", "achieved_goal", "desired_goal"

class PrecisionPickPlaceEnv(gym.Env):
    """Pick-and-place environment with a goal-conditioned interface."""
    
    def __init__(self):
        super().__init__()
        
        obs_dim = 25  # robot state
        goal_dim = 7  # pos(3) + quat(4)
        
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-np.inf, np.inf, (obs_dim,)),
            'achieved_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
            'desired_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
        })
        
        self.action_space = gym.spaces.Box(-1, 1, shape=(7,))
        self.se3_goal = SE3Goal(pos_threshold=0.01, rot_threshold=0.1)
    
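    def compute_reward(self, achieved_goal, desired_goal, info):
        """Required by HER: reward for a single goal or a batch of goals."""
        achieved_goal = np.asarray(achieved_goal)
        desired_goal = np.asarray(desired_goal)
        if achieved_goal.ndim == 1:
            # Single goal, e.g. when called from step()
            return self.se3_goal.compute_reward(achieved_goal, desired_goal)
        # Batch of goals, as passed by HerReplayBuffer during relabeling
        rewards = [self.se3_goal.compute_reward(ag, dg)
                   for ag, dg in zip(achieved_goal, desired_goal)]
        return np.array(rewards, dtype=np.float32)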
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random for reproducible goals
        # ... reset simulation ...
        # Randomize goal position and orientation
        goal_pos = self.np_random.uniform([0.3, -0.2, 0.42], [0.7, 0.2, 0.5])
        goal_quat = self._random_quaternion()
        self._goal = np.concatenate([goal_pos, goal_quat])
        
        return self._get_obs(), {}
    
    def _random_quaternion(self):
        """Generate a random valid placement quaternion [qw, qx, qy, qz]."""
        # Randomize yaw only (rotation about z) so the part stays upright
        yaw = self.np_random.uniform(-np.pi, np.pi)
        return np.array([np.cos(yaw/2), 0, 0, np.sin(yaw/2)])


# ---- Training with HER ----
env = PrecisionPickPlaceEnv()

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,             # k_future = 4
        goal_selection_strategy="future",
    ),
    learning_rate=1e-3,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.05,
    gamma=0.95,
    verbose=1,
    tensorboard_log="./pick_place_logs/"
)

model.learn(total_timesteps=5_000_000)

Typical Results

Method	Position Error (mm)	Orientation Error (deg)	Success Rate
SAC (dense reward)	8.3	12.1	52%
SAC + HER (sparse)	4.7	5.8	71%
SAC + HER (dense)	3.2	4.1	83%
SAC + HER + curriculum	2.1	2.8	89%

Among the variants without a curriculum, HER combined with a dense reward performs best (83% success). Adding a curriculum (start with a large threshold and tighten it gradually) pushes the success rate to nearly 90%.
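One simple way to realize such a curriculum is a schedule that maps training progress to a success threshold. The sketch below uses a linear schedule; the function name `curriculum_threshold` and the start and end values (5 cm down to 5 mm) are assumptions chosen for illustration, not the settings behind the table above.

```python
def curriculum_threshold(step, total_steps, start=0.05, end=0.005):
    """Linearly anneal the position threshold (metres) from `start` to `end`.

    Progress is clamped to [0, 1], so the threshold stays at `end` once
    `step` reaches `total_steps`.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + progress * (end - start)


# Threshold at a few points during a 5M-step run
for step in (0, 1_250_000, 2_500_000, 5_000_000):
    print(step, round(curriculum_threshold(step, 5_000_000), 4))
```

In a training loop the returned value would be written to the environment's success threshold (for the classes in this article, `env.se3_goal.pos_threshold`) at regular intervals, for example from a callback.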

Application: PCB Component Placement

A practical example is placing an IC chip on a PCB:

class PCBPlacementEnv(PrecisionPickPlaceEnv):
    """Place electronic components on a PCB."""
    
    def __init__(self):
        super().__init__()
        self.se3_goal = SE3Goal(
            pos_threshold=0.005,   # 5 mm (tightened gradually via curriculum)
            rot_threshold=0.087,   # 5 degrees
        )
        
        # Component types with different sizes (metres)
        self.components = {
            'smd_0805': {'size': [0.002, 0.00125, 0.0005]},
            'soic_8': {'size': [0.005, 0.004, 0.0015]},
            'qfp_44': {'size': [0.012, 0.012, 0.002]},
            'bga_256': {'size': [0.017, 0.017, 0.002]},
        }
    
    def reset(self, seed=None, options=None):
        obs, info = super().reset(seed=seed)
        
        # Pick a random component type
        comp_type = np.random.choice(list(self.components.keys()))
        self.current_component = self.components[comp_type]
        
        # Randomize the PCB pad position
        # In practice this position comes from the CAD file
        pad_pos = np.array([
            0.5 + np.random.uniform(-0.05, 0.05),
            np.random.uniform(-0.05, 0.05),
            0.42  # PCB surface height
        ])
        pad_orientation = np.array([1, 0, 0, 0])  # Upright
        
        self._goal = np.concatenate([pad_pos, pad_orientation])
        
        return self._get_obs(), {}

Tips for Precision RL

Small action scaling: use a max delta of 0.02 rad instead of 0.05 for finer movements
Observation normalization: normalize all observations to the [-1, 1] range
Asymmetric actor-critic: the critic receives extra privileged information (exact position), while the actor only receives sensor data
Residual RL: use motion planning for coarse motion and RL for fine adjustment
Reward annealing: start with a large threshold and tighten it as the policy improves
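Three of these tips (action scaling, observation normalization, and residual RL) are easy to sketch in code. The helpers below are hypothetical: `residual_action` adds a clipped, scaled policy output on top of a planner command, and `normalize_obs` maps bounded observations to [-1, 1]. Combining the 0.02 limit with the residual formulation is an assumption made here for illustration.

```python
import numpy as np

MAX_DELTA = 0.02  # max learned correction per step, from the action-scaling tip


def residual_action(planner_action, policy_output, max_delta=MAX_DELTA):
    """Combine a coarse planner command with a small learned correction.

    `policy_output` is expected in [-1, 1]; it is clipped and scaled so the
    learned residual never exceeds `max_delta` in any dimension.
    """
    residual = max_delta * np.clip(policy_output, -1.0, 1.0)
    return np.asarray(planner_action, dtype=float) + residual


def normalize_obs(obs, low, high):
    """Map observations elementwise from [low, high] to [-1, 1]."""
    obs, low, high = (np.asarray(a, dtype=float) for a in (obs, low, high))
    return 2.0 * (obs - low) / (high - low) - 1.0


planner_cmd = np.zeros(7)            # e.g. joint deltas from a motion planner
policy_out = np.full(7, 5.0)         # out-of-range output gets clipped to 1.0
print(residual_action(planner_cmd, policy_out))
print(normalize_obs([0.0, 5.0, 10.0], 0.0, 10.0))
```

Keeping the residual small means the planner still determines the overall motion, while the policy only has to learn the last few millimetres of alignment.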

References

Hindsight Experience Replay — Andrychowicz et al., NeurIPS 2017
Multi-Goal Reinforcement Learning — Plappert et al., 2018
Asymmetric Actor Critic for Image-Based Robot Learning — Pinto et al., RSS 2018

Next in the Series

The next post, Carrying and Transporting Objects: Stability in Motion, combines grasping, force control, and precision placement to tackle transporting objects across complex terrain.