
Precision Pick-and-Place: Position & Orientation Control

Train precise pick-and-place with RL — HER for sparse rewards, 6-DOF placement, and orientation alignment with sub-cm accuracy.

Nguyễn Anh Tuấn · March 22, 2026 · 7 min read

Pick-and-place is the most common task in industrial robotics — picking components from a conveyor and placing them precisely on a PCB, pallet, or assembly line. Sounds simple, but when you require sub-centimeter accuracy for both position and orientation, the problem becomes extremely challenging for RL.

The previous post — RL Force Control — covered delicate force regulation. Now, we focus on precision — and our secret weapon is Hindsight Experience Replay (HER).

Why is Precision Pick-and-Place Hard for RL?

In standard pick-and-place, the robot only needs to bring the object "near" the target. But in manufacturing:

| Challenge | Description | RL Solution |
|---|---|---|
| Sparse reward | Only +1 when placed perfectly | HER |
| 6-DOF goal | Needs position + orientation | SE(3) goal space |
| Precision | Sub-cm accuracy | Fine-tuned reward shaping |
| Multi-phase | Sequential stages | Phase-based reward |
| Long horizon | 200+ steps | Discount tuning |

[Image: Precision robotic assembly]

Hindsight Experience Replay (HER)

The Sparse Reward Problem

With sparse rewards (only +1 on success), the robot almost never receives positive reward early in training — because the probability of randomly placing precisely is extremely low. No reward signal = no learning.
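To get a feel for just how rare random success is, here is a small standalone Monte Carlo estimate. The workspace bounds mirror the goal-sampling range used in the environment code below; the 1 cm tolerance is illustrative, and orientation is ignored, which only makes real success rarer:

```python
import numpy as np

# Monte Carlo estimate: how often does a uniformly random placement
# land within 1 cm of a goal position? (Illustrative workspace bounds.)
rng = np.random.default_rng(0)
low = np.array([0.3, -0.2, 0.42])
high = np.array([0.7, 0.2, 0.5])
goal = (low + high) / 2            # goal at the workspace center

samples = rng.uniform(low, high, size=(100_000, 3))
hits = np.linalg.norm(samples - goal, axis=1) < 0.01   # 1 cm tolerance
print(f"random success rate ~ {hits.mean():.5f}")      # on the order of 1e-4
```

A success rate this low means a randomly exploring policy essentially never sees a positive sparse reward, so nothing propagates through the value function.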

The HER Idea

HER solves this with a simple but brilliant trick: after each episode, relabel the stored transitions as if the goal had been the state the robot actually reached.

Example: The robot tries to place an object at $(0.5, 0.3, 0.4)$ but actually places it at $(0.3, 0.1, 0.4)$. Instead of treating this episode as a complete failure, HER creates an additional "virtual experience" with goal = $(0.3, 0.1, 0.4)$ — and this episode becomes a success.

import numpy as np
from collections import deque

class HindsightExperienceReplay:
    """HER implementation for goal-conditioned manipulation."""
    
    def __init__(self, buffer_size=1_000_000, k_future=4, 
                 strategy="future"):
        self.buffer = deque(maxlen=buffer_size)
        self.k_future = k_future
        self.strategy = strategy
        self.episode_buffer = []
    
    def store_transition(self, obs, action, reward, next_obs, 
                         done, achieved_goal, desired_goal):
        """Store transition in current episode."""
        self.episode_buffer.append({
            'obs': obs,
            'action': action,
            'reward': reward,
            'next_obs': next_obs,
            'done': done,
            'achieved_goal': achieved_goal,
            'desired_goal': desired_goal,
        })
    
    def end_episode(self, compute_reward_fn):
        """End episode, create hindsight goals."""
        episode = self.episode_buffer
        n = len(episode)
        
        for i, transition in enumerate(episode):
            # Store original transition
            self.buffer.append(transition)
            
            # Sample k replacement goals from achieved states
            if self.strategy == "future":
                # Goals achieved later in the same episode
                indices = np.random.randint(i, n, size=self.k_future)
            elif self.strategy == "episode":
                # Goals achieved anywhere in the episode
                indices = np.random.randint(0, n, size=self.k_future)
            else:
                raise ValueError(f"Unknown strategy: {self.strategy}")
            
            for idx in indices:
                new_goal = episode[idx]['achieved_goal']
                # Recompute the sparse reward under the relabeled goal
                # (achieved_goal is the post-transition achieved state)
                new_reward = compute_reward_fn(
                    transition['achieved_goal'], new_goal
                )
                hindsight = {
                    'obs': self._replace_goal(transition['obs'], new_goal),
                    'action': transition['action'],
                    'reward': new_reward,
                    'next_obs': self._replace_goal(transition['next_obs'], new_goal),
                    'done': transition['done'],
                    'achieved_goal': transition['achieved_goal'],
                    'desired_goal': new_goal,
                }
                self.buffer.append(hindsight)
        
        self.episode_buffer = []
    
    def _replace_goal(self, obs, new_goal):
        """Replace goal in observation."""
        new_obs = obs.copy()
        new_obs[-len(new_goal):] = new_goal
        return new_obs
    
    def sample(self, batch_size):
        """Sample batch for training."""
        indices = np.random.randint(0, len(self.buffer), size=batch_size)
        batch = [self.buffer[i] for i in indices]
        return batch

Why HER Works

HER multiplies data efficiency because:

  1. Each failed episode still generates $k$ relabeled transitions per step — many of them successes under their new goals
  2. The policy learns goal-conditioned behavior — how to reach any position
  3. No need for complex reward shaping — sparse reward suffices
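Point 1 is easy to quantify: with the `future` strategy, an episode of $T$ steps yields $T$ original plus $k\,T$ relabeled transitions. A standalone toy example (1-D reaching, with hypothetical tolerance and episode values) shows sparse rewards appearing only after relabeling:

```python
import numpy as np

# Toy 1-D episode: the agent moves toward goal 1.0 but falls short.
rng = np.random.default_rng(0)
achieved = np.linspace(0.0, 0.6, 50)   # positions reached at each step
desired = 1.0                          # original goal: never reached
k = 4                                  # hindsight goals per transition

def sparse_reward(ag, g, tol=0.05):
    return float(abs(ag - g) < tol)

original = [sparse_reward(ag, desired) for ag in achieved]

# 'future' relabeling: for step i, sample k goals from later achieved states
relabeled = []
for i, ag in enumerate(achieved):
    for j in rng.integers(i, len(achieved), size=k):
        relabeled.append(sparse_reward(ag, achieved[j]))

print(sum(original), "/", len(original))     # no positive rewards originally
print(sum(relabeled), "/", len(relabeled))   # many positives after relabeling
```

The original episode contains zero learning signal; the relabeled copy of the same experience contains plenty.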

6-DOF Goal Space

Precision pick-and-place requires both position (3-DOF) and orientation (3-DOF):

class SE3Goal:
    """Goal representation in SE(3) for 6-DOF placement."""
    
    def __init__(self, pos_threshold=0.01, rot_threshold=0.1):
        self.pos_threshold = pos_threshold  # 1cm
        self.rot_threshold = rot_threshold  # ~5.7 degrees
    
    def compute_reward(self, achieved, desired):
        """
        Compute reward for 6-DOF goal.
        achieved: [x, y, z, qw, qx, qy, qz]
        desired: [x, y, z, qw, qx, qy, qz]
        """
        pos_error = np.linalg.norm(achieved[:3] - desired[:3])
        
        q1 = achieved[3:7] / np.linalg.norm(achieved[3:7])
        q2 = desired[3:7] / np.linalg.norm(desired[3:7])
        dot = abs(np.dot(q1, q2))
        rot_error = 2.0 * np.arccos(np.clip(dot, 0, 1))
        
        if pos_error < self.pos_threshold and rot_error < self.rot_threshold:
            return 1.0
        return 0.0
    
    def compute_dense_reward(self, achieved, desired):
        """Dense reward variant for faster learning."""
        pos_error = np.linalg.norm(achieved[:3] - desired[:3])
        
        q1 = achieved[3:7] / np.linalg.norm(achieved[3:7])
        q2 = desired[3:7] / np.linalg.norm(desired[3:7])
        dot = abs(np.dot(q1, q2))
        rot_error = 2.0 * np.arccos(np.clip(dot, 0, 1))
        
        pos_reward = 1.0 - np.tanh(10.0 * pos_error)
        rot_reward = 1.0 - np.tanh(5.0 * rot_error)
        
        success = float(pos_error < self.pos_threshold 
                        and rot_error < self.rot_threshold)
        
        return 0.5 * pos_reward + 0.5 * rot_reward + 10.0 * success
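As a standalone sanity check of the quaternion distance used above (wxyz ordering, matching the goal vectors): a pure 10-degree yaw offset should come out as exactly 10 degrees, and the `abs()` should make the metric invariant to the $q \equiv -q$ double cover.

```python
import numpy as np

def quat_angle(q1, q2):
    """Geodesic angle between two unit quaternions (wxyz), in radians."""
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    dot = abs(np.dot(q1, q2))          # abs() handles the double cover (q == -q)
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

identity = np.array([1.0, 0.0, 0.0, 0.0])
yaw = np.deg2rad(10.0)
q_yaw = np.array([np.cos(yaw / 2), 0.0, 0.0, np.sin(yaw / 2)])

err = quat_angle(identity, q_yaw)
print(np.rad2deg(err))                 # ~10.0, up to floating-point error
```

Without the `abs()`, the negated-but-identical quaternion `-q_yaw` would report an error near 350 degrees — a classic bug in orientation rewards.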

Training Pipeline for Precision Pick-and-Place

Using Stable-Baselines3 with HER

from stable_baselines3 import SAC, HerReplayBuffer
import gymnasium as gym

class PrecisionPickPlaceEnv(gym.Env):
    """Pick-and-place environment with goal-conditioned interface."""
    
    def __init__(self):
        super().__init__()
        
        obs_dim = 25
        goal_dim = 7  # pos(3) + quat(4)
        
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-np.inf, np.inf, (obs_dim,)),
            'achieved_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
            'desired_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
        })
        
        self.action_space = gym.spaces.Box(-1, 1, shape=(7,))
        self.se3_goal = SE3Goal(pos_threshold=0.01, rot_threshold=0.1)
    
    def compute_reward(self, achieved_goal, desired_goal, info):
        """Required by HER — handles single and batched goals."""
        achieved = np.atleast_2d(achieved_goal)
        desired = np.atleast_2d(desired_goal)
        rewards = np.array([
            self.se3_goal.compute_reward(ag, dg)
            for ag, dg in zip(achieved, desired)
        ])
        return rewards if np.ndim(achieved_goal) > 1 else rewards[0]
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seed the gymnasium RNG
        goal_pos = np.random.uniform([0.3, -0.2, 0.42], [0.7, 0.2, 0.5])
        goal_quat = self._random_quaternion()
        self._goal = np.concatenate([goal_pos, goal_quat])
        return self._get_obs(), {}
    
    def _get_obs(self):
        """Assemble the goal-conditioned observation dict.
        (Simulator state — joint angles, gripper and object pose —
        is environment-specific and elided here.)
        """
        return {
            'observation': np.zeros(25, dtype=np.float32),   # placeholder
            'achieved_goal': np.zeros(7, dtype=np.float32),  # placeholder
            'desired_goal': self._goal.astype(np.float32),
        }
    
    def _random_quaternion(self):
        """Generate a random yaw-only placement quaternion (wxyz)."""
        yaw = np.random.uniform(-np.pi, np.pi)
        return np.array([np.cos(yaw/2), 0, 0, np.sin(yaw/2)])
    
    # step() — physics stepping and success bookkeeping — is
    # simulator-specific and omitted in this sketch.


# ---- Training with HER ----
env = PrecisionPickPlaceEnv()

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    learning_rate=1e-3,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.05,
    gamma=0.95,
    verbose=1,
    tensorboard_log="./pick_place_logs/"
)

model.learn(total_timesteps=5_000_000)

Typical Results

| Method | Position Error (mm) | Orientation Error (deg) | Success Rate |
|---|---|---|---|
| SAC (dense reward) | 8.3 | 12.1 | 52% |
| SAC + HER (sparse) | 4.7 | 5.8 | 71% |
| SAC + HER (dense) | 3.2 | 4.1 | 83% |
| SAC + HER + curriculum | 2.1 | 2.8 | 89% |

HER combined with dense reward gives the best results. Adding a curriculum (start with a loose threshold and tighten it as the policy improves) pushes the success rate to nearly 90%.

[Image: Precision placement visualization]

Application: PCB Component Placement

A practical example — placing an IC chip on a PCB:

class PCBPlacementEnv(PrecisionPickPlaceEnv):
    """Place electronic components on PCB."""
    
    def __init__(self):
        super().__init__()
        self.se3_goal = SE3Goal(
            pos_threshold=0.005,   # 5mm (decreases via curriculum)
            rot_threshold=0.087,   # 5 degrees
        )
        
        self.components = {
            'smd_0805': {'size': [0.002, 0.00125, 0.0005]},
            'soic_8': {'size': [0.005, 0.004, 0.0015]},
            'qfp_44': {'size': [0.012, 0.012, 0.002]},
            'bga_256': {'size': [0.017, 0.017, 0.002]},
        }
    
    def reset(self, seed=None, options=None):
        obs, info = super().reset(seed=seed)
        
        comp_type = np.random.choice(list(self.components.keys()))
        self.current_component = self.components[comp_type]
        
        pad_pos = np.array([
            0.5 + np.random.uniform(-0.05, 0.05),
            np.random.uniform(-0.05, 0.05),
            0.42
        ])
        pad_orientation = np.array([1, 0, 0, 0])
        
        self._goal = np.concatenate([pad_pos, pad_orientation])
        return self._get_obs(), {}

Tips for Precision RL

  1. Small action scaling: Use max delta = 0.02 rad instead of 0.05 for finer movement
  2. Observation normalization: Normalize all observations to [-1, 1] range
  3. Asymmetric actor-critic: Critic receives privileged information (exact position), actor receives only sensor data
  4. Residual RL: Use motion planning for coarse motion, RL for fine adjustment
  5. Reward annealing: Start with large threshold, decrease gradually as policy improves
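Tip 5 can be sketched as a success-driven threshold schedule. The numbers below (start at 5 cm, shrink by 20% whenever the recent success rate clears 80%, floor at 5 mm) are illustrative assumptions, not tuned values:

```python
class ThresholdCurriculum:
    """Anneal the position threshold as the policy improves.

    Hypothetical schedule: multiply the threshold by `decay`
    whenever the rolling success rate clears `target_rate`,
    never going below `final`.
    """

    def __init__(self, initial=0.05, final=0.005, decay=0.8, target_rate=0.8):
        self.threshold = initial
        self.final = final
        self.decay = decay
        self.target_rate = target_rate

    def update(self, success_rate):
        """Call periodically with the recent evaluation success rate."""
        if success_rate > self.target_rate:
            self.threshold = max(self.final, self.threshold * self.decay)
        return self.threshold


# Usage: tighten the goal tolerance after each evaluation round
curriculum = ThresholdCurriculum()
for rate in [0.5, 0.85, 0.9, 0.9, 0.9]:
    thr = curriculum.update(rate)
print(f"threshold after 5 rounds: {thr:.4f} m")
```

The returned threshold would be fed back into `SE3Goal.pos_threshold` between training phases; a multiplicative decay keeps the task difficulty ramp roughly constant on a log scale.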

References

  1. Hindsight Experience Replay — Andrychowicz et al., NeurIPS 2017
  2. Multi-Goal Reinforcement Learning — Plappert et al., 2018
  3. Asymmetric Actor Critic for Image-Based Robot Learning — Pinto et al., RSS 2018

Next in the Series

Next up — Carrying & Transporting Objects: Stability During Motion — we combine grasping, force control, and precision placement to solve object transport through complex terrain.
