Pick-and-place is the most common task in industrial robotics — picking components from a conveyor and placing them precisely on a PCB, pallet, or assembly line. Sounds simple, but when you require sub-centimeter accuracy for both position and orientation, the problem becomes extremely challenging for RL.
The previous post — RL Force Control — covered delicate force regulation. Now, we focus on precision — and our secret weapon is Hindsight Experience Replay (HER).
Why is Precision Pick-and-Place Hard for RL?
In standard pick-and-place, the robot only needs to bring the object "near" the target. But in manufacturing:
- Electronic components must be placed with < 0.5mm error
- Orientation must be correct (90-degree rotation = defective)
- Multiple steps: reach, grasp, lift, transport, align, place, release
| Challenge | Description | RL Solution |
|---|---|---|
| Sparse reward | Only +1 when placed perfectly | HER |
| 6-DOF goal | Need position + orientation | SE(3) goal space |
| Precision | Sub-cm accuracy | Fine-tuned reward shaping |
| Multi-phase | Sequential stages | Phase-based reward |
| Long horizon | 200+ steps | Discount tuning |
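The phase-based reward idea from the table can be sketched in a few lines. The stage bonus and progress weighting below are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

# Hypothetical phase-based shaping: each completed stage earns a fixed bonus,
# plus a dense progress term toward the next waypoint so the agent still has
# a gradient inside every phase.
PHASES = ["reach", "grasp", "lift", "transport", "align", "place", "release"]

def phase_reward(completed_phases, dist_to_next_waypoint):
    stage_bonus = float(completed_phases)                  # +1 per finished phase
    progress = 1.0 - np.tanh(5.0 * dist_to_next_waypoint)  # in (0, 1]
    return stage_bonus + progress
```

A freshly started episode (no phases done, 0.5 m from the grasp waypoint) scores far below one that has already lifted the object and is 1 cm from the placement pose, so the agent is rewarded for completing stages in order.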
Hindsight Experience Replay (HER)
The Sparse Reward Problem
With sparse rewards (only +1 on success), the robot almost never receives positive reward early in training — because the probability of randomly placing precisely is extremely low. No reward signal = no learning.
The HER Idea
HER solves this with a simple but brilliant trick: after each episode, relabel its transitions as if the goal had been the pose the robot actually achieved.
Example: The robot tries to place an object at $(0.5, 0.3, 0.4)$ but actually places it at $(0.3, 0.1, 0.4)$. Instead of treating this episode as a complete failure, HER creates an additional "virtual experience" with goal = $(0.3, 0.1, 0.4)$ — and this episode becomes a success.
```python
import numpy as np
from collections import deque


class HindsightExperienceReplay:
    """HER implementation for goal-conditioned manipulation."""

    def __init__(self, buffer_size=1_000_000, k_future=4, strategy="future"):
        self.buffer = deque(maxlen=buffer_size)
        self.k_future = k_future
        self.strategy = strategy  # "future" or "episode"
        self.episode_buffer = []

    def store_transition(self, obs, action, reward, next_obs,
                         done, achieved_goal, desired_goal):
        """Store a transition in the current episode."""
        self.episode_buffer.append({
            'obs': obs,
            'action': action,
            'reward': reward,
            'next_obs': next_obs,
            'done': done,
            'achieved_goal': achieved_goal,  # pose reached after this step
            'desired_goal': desired_goal,
        })

    def end_episode(self, compute_reward_fn):
        """Flush the episode into the buffer, adding hindsight goals."""
        episode = self.episode_buffer
        n = len(episode)
        for i, transition in enumerate(episode):
            # Store the original transition
            self.buffer.append(transition)
            # Sample k relabeling goals according to the chosen strategy
            if self.strategy == "future":
                # Goals achieved at or after step i in the same episode
                goal_indices = np.random.randint(i, n, size=self.k_future)
            else:  # "episode"
                # Goals achieved anywhere in the episode
                goal_indices = np.random.randint(0, n, size=self.k_future)
            for idx in goal_indices:
                new_goal = episode[idx]['achieved_goal']
                # Recompute the reward as if new_goal had been the target
                new_reward = compute_reward_fn(
                    transition['achieved_goal'], new_goal
                )
                self.buffer.append({
                    'obs': self._replace_goal(transition['obs'], new_goal),
                    'action': transition['action'],
                    'reward': new_reward,
                    'next_obs': self._replace_goal(transition['next_obs'], new_goal),
                    'done': transition['done'],
                    'achieved_goal': transition['achieved_goal'],
                    'desired_goal': new_goal,
                })
        self.episode_buffer = []

    def _replace_goal(self, obs, new_goal):
        """Overwrite the goal slot at the end of the observation vector."""
        new_obs = obs.copy()
        new_obs[-len(new_goal):] = new_goal
        return new_obs

    def sample(self, batch_size):
        """Sample a training batch uniformly from the buffer."""
        indices = np.random.randint(0, len(self.buffer), size=batch_size)
        return [self.buffer[i] for i in indices]
```
Why HER Works
HER dramatically improves sample efficiency because:
- Every transition of a failed episode is relabeled into $k$ additional transitions that count as successes for some goal
- The policy learns goal-conditioned behavior: how to reach any commanded pose
- No complex reward shaping is needed; the sparse reward suffices
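A tiny standalone example of the relabeling trick, using the numbers from above and an assumed 1 cm success threshold:

```python
import numpy as np

def sparse_reward(achieved, desired, threshold=0.01):
    # +1 only when the achieved position is within threshold of the goal
    return float(np.linalg.norm(achieved - desired) < threshold)

goal     = np.array([0.5, 0.3, 0.4])   # where we wanted the object
achieved = np.array([0.3, 0.1, 0.4])   # where it actually ended up

print(sparse_reward(achieved, goal))      # 0.0: a failure as recorded
print(sparse_reward(achieved, achieved))  # 1.0: a success after relabeling
```

Under the original goal the episode carries zero learning signal; under the hindsight goal, the exact same trajectory teaches the policy how to reach $(0.3, 0.1, 0.4)$.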
6-DOF Goal Space
Precision pick-and-place requires both position (3-DOF) and orientation (3-DOF):
```python
class SE3Goal:
    """Goal representation in SE(3) for 6-DOF placement."""

    def __init__(self, pos_threshold=0.01, rot_threshold=0.1):
        self.pos_threshold = pos_threshold  # meters (0.01 = 1 cm)
        self.rot_threshold = rot_threshold  # radians (0.1 ≈ 5.7 degrees)

    def _errors(self, achieved, desired):
        """Position error (m) and geodesic rotation error (rad).

        achieved / desired layout: [x, y, z, qw, qx, qy, qz]
        """
        pos_error = np.linalg.norm(achieved[:3] - desired[:3])
        q1 = achieved[3:7] / np.linalg.norm(achieved[3:7])
        q2 = desired[3:7] / np.linalg.norm(desired[3:7])
        dot = abs(np.dot(q1, q2))  # abs: q and -q are the same rotation
        rot_error = 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))
        return pos_error, rot_error

    def compute_reward(self, achieved, desired):
        """Sparse reward: 1.0 only when both thresholds are met."""
        pos_error, rot_error = self._errors(achieved, desired)
        if pos_error < self.pos_threshold and rot_error < self.rot_threshold:
            return 1.0
        return 0.0

    def compute_dense_reward(self, achieved, desired):
        """Dense variant for faster learning: shaped terms plus a success bonus."""
        pos_error, rot_error = self._errors(achieved, desired)
        pos_reward = 1.0 - np.tanh(10.0 * pos_error)
        rot_reward = 1.0 - np.tanh(5.0 * rot_error)
        success = float(pos_error < self.pos_threshold
                        and rot_error < self.rot_threshold)
        return 0.5 * pos_reward + 0.5 * rot_reward + 10.0 * success
```
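The quaternion distance used in the reward can be sanity-checked in isolation. `quat_angle` below is a standalone helper written for this check, not part of the class:

```python
import numpy as np

def quat_angle(q1, q2):
    # Geodesic angle between two unit quaternions [qw, qx, qy, qz];
    # abs() makes it sign-invariant, since q and -q encode the same rotation.
    dot = abs(np.dot(q1 / np.linalg.norm(q1), q2 / np.linalg.norm(q2)))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

q_identity = np.array([1.0, 0.0, 0.0, 0.0])           # no rotation
yaw = np.pi / 2                                        # 90-degree yaw
q_yaw90 = np.array([np.cos(yaw / 2), 0, 0, np.sin(yaw / 2)])

print(np.degrees(quat_angle(q_identity, q_yaw90)))     # ≈ 90.0
```

A 90° yaw error therefore measures as ~1.57 rad, far above the 0.1 rad threshold, which is exactly why a rotated component counts as a failure.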
Training Pipeline for Precision Pick-and-Place
Using Stable-Baselines3 with HER
```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC, HerReplayBuffer


class PrecisionPickPlaceEnv(gym.Env):
    """Pick-and-place environment with a goal-conditioned interface.

    (step() and _get_obs() are omitted here; they wrap the simulator.)
    """

    def __init__(self):
        super().__init__()
        obs_dim = 25
        goal_dim = 7  # pos(3) + quat(4)
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-np.inf, np.inf, (obs_dim,)),
            'achieved_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
            'desired_goal': gym.spaces.Box(-np.inf, np.inf, (goal_dim,)),
        })
        self.action_space = gym.spaces.Box(-1, 1, shape=(7,))
        self.se3_goal = SE3Goal(pos_threshold=0.01, rot_threshold=0.1)

    def compute_reward(self, achieved_goal, desired_goal, info):
        """Required by HER: vectorized over a batch of goal pairs."""
        return np.array([
            self.se3_goal.compute_reward(ag, dg)
            for ag, dg in zip(achieved_goal, desired_goal)
        ])

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        goal_pos = np.random.uniform([0.3, -0.2, 0.42], [0.7, 0.2, 0.5])
        goal_quat = self._random_quaternion()
        self._goal = np.concatenate([goal_pos, goal_quat])
        return self._get_obs(), {}

    def _random_quaternion(self):
        """Random yaw-only placement quaternion [qw, qx, qy, qz]."""
        yaw = np.random.uniform(-np.pi, np.pi)
        return np.array([np.cos(yaw / 2), 0, 0, np.sin(yaw / 2)])


# ---- Training with HER ----
env = PrecisionPickPlaceEnv()

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    learning_rate=1e-3,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.05,
    gamma=0.95,
    verbose=1,
    tensorboard_log="./pick_place_logs/",
)
model.learn(total_timesteps=5_000_000)
```
Typical Results
| Method | Position Error (mm) | Orientation Error (deg) | Success Rate |
|---|---|---|---|
| SAC (dense reward) | 8.3 | 12.1 | 52% |
| SAC + HER (sparse) | 4.7 | 5.8 | 71% |
| SAC + HER (dense) | 3.2 | 4.1 | 83% |
| SAC + HER + curriculum | 2.1 | 2.8 | 89% |
HER combined with a dense reward gives the best results. Adding a curriculum (starting with a loose threshold and tightening it gradually) pushes the success rate to nearly 90%.
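A linear threshold schedule is one simple way to implement such a curriculum; the start, end, and annealing values below are illustrative assumptions:

```python
def curriculum_threshold(step, start=0.02, end=0.002, anneal_steps=2_000_000):
    # Linearly tighten the success threshold from 2 cm down to 2 mm
    # over anneal_steps environment steps, then hold the final value.
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

print(curriculum_threshold(0))           # 0.02
print(curriculum_threshold(1_000_000))   # ≈ 0.011
print(curriculum_threshold(5_000_000))   # ≈ 0.002
```

The env would re-read this threshold each episode (e.g. assign it to `se3_goal.pos_threshold` at reset), so early training earns frequent successes while late training demands full precision.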
Application: PCB Component Placement
A practical example — placing an IC chip on a PCB:
```python
class PCBPlacementEnv(PrecisionPickPlaceEnv):
    """Place electronic components on a PCB."""

    def __init__(self):
        super().__init__()
        self.se3_goal = SE3Goal(
            pos_threshold=0.005,  # 5 mm to start; tightened via curriculum
            rot_threshold=0.087,  # ~5 degrees
        )
        # Component footprints in meters (length, width, height)
        self.components = {
            'smd_0805': {'size': [0.002, 0.00125, 0.0005]},
            'soic_8':   {'size': [0.005, 0.004, 0.0015]},
            'qfp_44':   {'size': [0.012, 0.012, 0.002]},
            'bga_256':  {'size': [0.017, 0.017, 0.002]},
        }

    def reset(self, seed=None, options=None):
        obs, info = super().reset(seed=seed)
        comp_type = np.random.choice(list(self.components.keys()))
        self.current_component = self.components[comp_type]
        # Target pad pose on the board surface
        pad_pos = np.array([
            0.5 + np.random.uniform(-0.05, 0.05),
            np.random.uniform(-0.05, 0.05),
            0.42,
        ])
        pad_orientation = np.array([1, 0, 0, 0])  # identity quaternion
        self._goal = np.concatenate([pad_pos, pad_orientation])
        return self._get_obs(), {}
```
Tips for Precision RL
- Small action scaling: Use max delta = 0.02 rad instead of 0.05 for finer movement
- Observation normalization: Normalize all observations to [-1, 1] range
- Asymmetric actor-critic: Critic receives privileged information (exact position), actor receives only sensor data
- Residual RL: Use motion planning for coarse motion, RL for fine adjustment
- Reward annealing: Start with large threshold, decrease gradually as policy improves
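The residual RL tip can be sketched in a few lines; `residual_action` and the 0.02 scale are assumptions for illustration:

```python
import numpy as np

def residual_action(planner_action, rl_correction, scale=0.02, limit=1.0):
    # The motion planner supplies the coarse command; RL adds a small,
    # bounded correction on top for the final millimeters.
    return np.clip(planner_action + scale * rl_correction, -limit, limit)

coarse     = np.array([0.50, -0.20, 0.10])
correction = np.array([1.0, -1.0, 0.0])     # raw policy output in [-1, 1]
print(residual_action(coarse, correction))  # ≈ [0.52, -0.22, 0.10]
```

Keeping the RL term small means a bad policy can never fling the arm far from the planned path, which makes early training much safer on real hardware.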
References
- Hindsight Experience Replay — Andrychowicz et al., NeurIPS 2017
- Multi-Goal Reinforcement Learning — Plappert et al., 2018
- Asymmetric Actor Critic for Image-Based Robot Learning — Pinto et al., RSS 2018
Next in the Series
Next up — Carrying & Transporting Objects: Stability During Motion — we combine grasping, force control, and precision placement to solve object transport through complex terrain.