How does a robot arm learn to pick up objects without being explicitly programmed for every motion? The answer lies in Reinforcement Learning (RL) — a paradigm that lets robots explore, experiment, and gradually discover how to manipulate the physical world.
This is the first post in our RL for Manipulation series, where we will go from theoretical foundations to training robots on complex manipulation tasks. In this post, we focus on three pillars: MDP formulation, reward design, and environment setup in MuJoCo.
If you are new to RL fundamentals, consider reading RL Basics for Robotics first to build a solid foundation.
Manipulation Through the MDP Lens
What is a Markov Decision Process?
Every RL problem is modeled as a Markov Decision Process (MDP), defined by the tuple $(S, A, T, R, \gamma)$:
| Component | Meaning | Manipulation Example |
|---|---|---|
| $S$ (State space) | Set of all states | Joint positions, velocities, object pose |
| $A$ (Action space) | Set of all actions | Joint torques or target positions |
| $T$ (Transition) | State transition probability | Physics simulation (MuJoCo dynamics) |
| $R$ (Reward) | Reward function | +1 when grasping succeeds, -0.01 per step |
| $\gamma$ (Discount) | Discount factor | Typically 0.99 for manipulation |
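To make the discount factor concrete, here is a quick sketch (plain Python; the function name is illustrative) of the discounted return $G = \sum_t \gamma^t r_t$ that the agent maximizes:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Sparse grasping reward: success only on the last step of a 50-step episode
rewards = [0.0] * 49 + [1.0]
print(round(discounted_return(rewards), 3))  # 0.611
```

With $\gamma = 0.99$, a success 50 steps away is still worth about 61% of an immediate one, which is why 0.99 works well for manipulation episodes that run a few hundred steps.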
The critical challenge in applying the MDP framework to manipulation is designing the right state and action spaces. Give the robot too little information (an impoverished state) and it cannot learn; give it too much irrelevant information and learning becomes slow and unstable.
State Space for Manipulation
The state space in manipulation typically includes three groups of information:
1. Robot State (Proprioception)
- Joint positions: $q \in \mathbb{R}^n$
- Joint velocities: $\dot{q} \in \mathbb{R}^n$
- End-effector pose: $x_{ee} \in SE(3)$
- Gripper state (open/closed): $g \in [0, 1]$
2. Object State
- Object position: $p_{obj} \in \mathbb{R}^3$
- Object orientation: $q_{obj} \in SO(3)$ (typically stored as a unit quaternion)
- Linear and angular velocity: $v_{obj}, \omega_{obj}$
3. Task-Specific Information
- Goal position: $p_{goal} \in \mathbb{R}^3$
- Contact information (contact forces)
- Tactile sensing data
```python
import numpy as np

def get_manipulation_state(sim):
    """Extract state vector for a manipulation task."""
    # Robot proprioception
    joint_pos = sim.data.qpos[:7]     # 7-DOF arm
    joint_vel = sim.data.qvel[:7]
    gripper_pos = sim.data.qpos[7:9]  # 2-finger gripper

    # End-effector pose
    ee_pos = sim.data.site_xpos[0]                # End-effector position
    ee_rot = sim.data.site_xmat[0].reshape(3, 3)  # Rotation matrix (not included below)

    # Object state
    obj_pos = sim.data.qpos[9:12]    # Object position
    obj_quat = sim.data.qpos[12:16]  # Object quaternion
    obj_vel = sim.data.qvel[9:15]    # 6-DOF velocity

    # Relative information (very important!)
    rel_pos = obj_pos - ee_pos       # Vector from gripper to object

    state = np.concatenate([
        joint_pos, joint_vel,        # 14 dims
        gripper_pos,                 # 2 dims
        ee_pos,                      # 3 dims
        obj_pos, obj_quat, obj_vel,  # 13 dims
        rel_pos,                     # 3 dims
    ])
    return state  # Total: 35 dims
```
Key tip: Always include relative position (distance from gripper to object) in the state. This relative information helps policies generalize much better than using only absolute positions.
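A quick check of why relative positions help: translating the whole scene changes every absolute coordinate, but leaves the gripper-to-object vector untouched. An illustrative NumPy sketch (the positions are made up):

```python
import numpy as np

ee_pos = np.array([0.4, 0.0, 0.5])
obj_pos = np.array([0.5, 0.1, 0.45])
rel_pos = obj_pos - ee_pos

# Shift the entire scene by 10 cm along x
offset = np.array([0.1, 0.0, 0.0])
rel_shifted = (obj_pos + offset) - (ee_pos + offset)

print(np.allclose(rel_pos, rel_shifted))  # True: relative state is translation-invariant
```

A policy that keys on `rel_pos` sees the same input wherever the scene happens to sit in the workspace, so experience transfers across object placements.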
Action Space: Torque vs Position vs Velocity
The choice of action space significantly impacts learning ability and motion quality:
| Action Space | Pros | Cons | When to Use |
|---|---|---|---|
| Joint Torque | Most flexible, direct force control | Hard to learn, high sample complexity, jerky motions | Force control research |
| Joint Position | Easy to learn, smooth motions | No direct force control | Most manipulation tasks |
| Joint Velocity | Balance between torque and position | Requires PD gain tuning | Pick-and-place, trajectory following |
| End-effector Delta | Intuitive, easy to transfer | Needs IK solver, singularity issues | Teleoperation-style tasks |
In practice, Joint Position control (or Delta Position) is the most commonly used because it allows the policy to focus on "where to go" rather than "how much force to use":
```python
import numpy as np
from gymnasium import spaces

# Action space: delta joint position
action_space = spaces.Box(
    low=-0.05,   # Max change per step
    high=0.05,
    shape=(8,),  # 7 joints + 1 gripper
    dtype=np.float32
)

def apply_action(sim, action):
    """Apply delta position action."""
    current_pos = sim.data.qpos[:8].copy()
    target_pos = current_pos + action
    # Clip to joint limits (joint_lower/joint_upper come from the robot model)
    target_pos = np.clip(target_pos, joint_lower, joint_upper)
    sim.data.ctrl[:8] = target_pos
```
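The delta-position scheme composes two simple operations: scale a normalized action into a small joint delta, then clip the target to the joint limits. A standalone sketch with made-up limits for a 4-joint arm:

```python
import numpy as np

# Illustrative joint limits (radians)
joint_lower = np.array([-3.14, -1.57, -2.0, -2.0])
joint_upper = np.array([ 3.14,  1.57,  2.0,  2.0])

def delta_position_target(current_pos, action, max_delta=0.05):
    """Map a normalized action in [-1, 1] to a clipped joint-position target."""
    target = current_pos + np.clip(action, -1.0, 1.0) * max_delta
    return np.clip(target, joint_lower, joint_upper)

current = np.array([0.0, 1.56, 0.0, 0.0])
target = delta_position_target(current, np.array([1.0, 1.0, -0.5, 0.0]))
print(target)  # [0.05, 1.57, -0.025, 0.0] — joint 1 stops at its upper limit
```

The per-step clipping is what keeps the policy from commanding unreachable or violent targets, regardless of what the network outputs.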
Reward Design: Art and Science
Reward design is the hardest and most important part of RL for manipulation. A good reward function can help a robot learn in hours, while a bad one ensures it never learns anything — or worse, learns bizarre "reward hacking" behaviors.
Sparse vs Dense Reward
Sparse Reward — only reward upon task completion:
```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Reward = 1 if goal reached, 0 otherwise."""
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 1.0 if distance < threshold else 0.0
```
Dense Reward — continuous feedback based on progress:
```python
def dense_reward(ee_pos, obj_pos, goal_pos, grasped):
    """Continuous reward based on distance and state."""
    # Phase 1: Reach toward object
    reach_dist = np.linalg.norm(ee_pos - obj_pos)
    reach_reward = 1.0 - np.tanh(5.0 * reach_dist)

    # Phase 2: Lift object toward goal
    if grasped:
        place_dist = np.linalg.norm(obj_pos - goal_pos)
        place_reward = 1.0 - np.tanh(5.0 * place_dist)
        return reach_reward + 5.0 * place_reward

    return reach_reward
```
| Criterion | Sparse Reward | Dense Reward |
|---|---|---|
| Ease of design | Very easy | Difficult, error-prone |
| Learning speed | Very slow | Much faster |
| Reward hacking risk | Low | High if poorly designed |
| Requires HER? | Nearly essential | Not necessary |
| Scalability | Good | Hard for multi-step |
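The table's "learning speed" row comes down to signal density. Sketching both rewards along a trajectory where the gripper closes in on the goal shows sparse feedback staying flat until the very end, while the shaped reward rises at every step (illustrative NumPy, distances made up):

```python
import numpy as np

distances = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.04])  # gripper approaching the goal

sparse = (distances < 0.05).astype(float)  # zero until the final step
dense = 1.0 - np.tanh(5.0 * distances)     # grows with every bit of progress

print(sparse)                     # [0. 0. 0. 0. 0. 1.]
print(np.all(np.diff(dense) > 0))  # True: each step of progress is rewarded
```

A random policy almost never stumbles into the single rewarded state, which is why sparse-reward learning stalls without techniques like HER.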
Principles for Reward Function Design
Over years of research, the RL community has distilled several golden principles:
1. Divide reward into phases: Each phase corresponds to a stage of the task (approach, grasp, move, place).
2. Use distance-based shaping: Functions like tanh or exponential decay provide stable gradients:
```python
# Good: bounded reward with a strong gradient near the goal
reward = 1.0 - np.tanh(5.0 * distance)

# Workable but unbounded: the reward scale depends on workspace size
reward = -distance

# Poor: gradient is zero everywhere except at the threshold boundary
reward = float(distance < threshold)
```
3. Penalize undesired behavior: Add penalties for excessive actions, hard collisions, or dropping objects:
```python
# Penalty for large actions (encourages smooth motion)
action_penalty = -0.01 * np.sum(action ** 2)

# Penalty for dropping an object that was previously grasped
drop_penalty = -10.0 if was_grasped and not is_grasped else 0.0
```
4. Normalize reward components: Ensure all reward components have comparable scales so the agent does not ignore important terms.
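One way to follow principle 4 is to squash every component into [0, 1] before weighting, so the total reward's range is explicit. A sketch — the function name and weights are illustrative, not a standard recipe:

```python
import numpy as np

def combined_reward(reach_dist, goal_dist, action, weights=(1.0, 2.0, 0.1)):
    """Each term lies in [0, 1] before weighting, so no term silently dominates."""
    w_reach, w_goal, w_action = weights
    reach_term = 1.0 - np.tanh(5.0 * reach_dist)   # in [0, 1]
    goal_term = 1.0 - np.tanh(5.0 * goal_dist)     # in [0, 1]
    # Mean squared action, clipped into [0, 1] as a smoothness penalty
    action_term = np.clip(np.mean(action ** 2), 0.0, 1.0)
    return w_reach * reach_term + w_goal * goal_term - w_action * action_term

r = combined_reward(0.2, 0.3, np.array([0.5, -0.5, 0.0, 0.0]))
print(-0.1 <= r <= 3.0)  # True: the total is bounded by the weights alone
```

Keeping every term in a known range makes the weights the single place where you tune trade-offs, instead of fighting hidden scale differences between components.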
Setting Up Manipulation Environments in MuJoCo
Why MuJoCo?
MuJoCo (Multi-Joint dynamics with Contact) is the leading simulator for manipulation RL because of:
- Accurate contact dynamics — realistic friction and complex collision modeling
- High speed — thousands of steps per second
- Free and open source — DeepMind acquired MuJoCo in 2021 and open-sourced it in 2022
- Excellent integration with Gymnasium and Stable-Baselines3
Creating a Custom Manipulation Environment
Below is complete code to create a basic manipulation environment — a robot arm must grasp a cube and bring it to a target location:
```python
import gymnasium as gym
from gymnasium import spaces
import mujoco
import mujoco.viewer
import numpy as np

# ---- MuJoCo XML model ----
MANIPULATION_XML = """
<mujoco model="simple_manipulation">
  <option timestep="0.002" gravity="0 0 -9.81"/>
  <worldbody>
    <!-- Ground plane -->
    <geom type="plane" size="1 1 0.1" rgba="0.9 0.9 0.9 1"/>

    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.4 0.4 0.02" rgba="0.6 0.3 0.1 1"
            mass="100" contype="1" conaffinity="1"/>
    </body>

    <!-- Robot arm (simplified 4-DOF) -->
    <body name="base" pos="0.0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14"/>
      <geom type="cylinder" size="0.04 0.05" rgba="0.2 0.2 0.8 1"/>
      <body name="link1" pos="0 0 0.1">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.57 1.57"/>
        <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.03" rgba="0.3 0.3 0.9 1"/>
        <body name="link2" pos="0.3 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
          <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.025" rgba="0.4 0.4 1.0 1"/>
          <body name="link3" pos="0.25 0 0">
            <joint name="j3" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
            <geom type="capsule" fromto="0 0 0 0.1 0 0" size="0.02" rgba="0.5 0.5 1.0 1"/>
            <!-- Simple gripper -->
            <body name="gripper" pos="0.1 0 0">
              <site name="grip_site" pos="0 0 0" size="0.01"/>
              <body name="finger_left" pos="0 0.02 0">
                <joint name="finger_l" type="slide" axis="0 1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
              <body name="finger_right" pos="0 -0.02 0">
                <joint name="finger_r" type="slide" axis="0 -1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>

    <!-- Object to manipulate -->
    <body name="object" pos="0.5 0 0.45">
      <freejoint name="obj_joint"/>
      <geom type="box" size="0.025 0.025 0.025" rgba="0.1 0.8 0.1 1"
            mass="0.1" contype="1" conaffinity="1" friction="1 0.05 0.01"/>
    </body>

    <!-- Goal marker (visual only) -->
    <body name="goal" pos="0.4 0.2 0.55">
      <geom type="sphere" size="0.02" rgba="1 0 0 0.5" contype="0" conaffinity="0"/>
    </body>
  </worldbody>

  <actuator>
    <position name="a_j0" joint="j0" kp="100"/>
    <position name="a_j1" joint="j1" kp="100"/>
    <position name="a_j2" joint="j2" kp="100"/>
    <position name="a_j3" joint="j3" kp="100"/>
    <position name="a_fl" joint="finger_l" kp="50"/>
    <position name="a_fr" joint="finger_r" kp="50"/>
  </actuator>
</mujoco>
"""
```
```python
class SimpleManipulationEnv(gym.Env):
    """Custom MuJoCo manipulation environment.

    Task: the robot arm must grasp the green cube
    and bring it to the goal position (red sphere).
    """

    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None, reward_type="dense"):
        super().__init__()
        self.render_mode = render_mode
        self.reward_type = reward_type

        # Load MuJoCo model
        self.model = mujoco.MjModel.from_xml_string(MANIPULATION_XML)
        self.data = mujoco.MjData(self.model)

        # Action: 4 joint deltas + 1 gripper command
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(5,), dtype=np.float32
        )

        # Observation: joint pos/vel + ee pos + obj pos/vel + relative vectors
        obs_dim = 4 + 4 + 3 + 3 + 3 + 3 + 3  # = 23
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float64
        )

        self.max_steps = 200
        self.current_step = 0
        self._goal_pos = np.array([0.4, 0.2, 0.55])

    def _get_obs(self):
        """Build observation vector."""
        joint_pos = self.data.qpos[:4]
        joint_vel = self.data.qvel[:4]
        ee_pos = self.data.site_xpos[0]  # grip_site position
        obj_pos = self.data.qpos[6:9]    # Object position (freejoint)
        obj_vel = self.data.qvel[6:9]    # Object linear velocity
        rel_obj = obj_pos - ee_pos
        rel_goal = self._goal_pos - obj_pos
        return np.concatenate([
            joint_pos, joint_vel, ee_pos,
            obj_pos, obj_vel, rel_obj, rel_goal
        ])

    def _compute_reward(self, obs):
        """Compute reward based on current state."""
        ee_pos = obs[8:11]
        obj_pos = obs[11:14]
        reach_dist = np.linalg.norm(ee_pos - obj_pos)
        goal_dist = np.linalg.norm(obj_pos - self._goal_pos)

        if self.reward_type == "sparse":
            return 1.0 if goal_dist < 0.05 else 0.0

        # Dense reward: continuous feedback
        reach_reward = 1.0 - np.tanh(5.0 * reach_dist)
        goal_reward = 1.0 - np.tanh(5.0 * goal_dist)
        # Bonus when the object is near the goal
        success_bonus = 10.0 if goal_dist < 0.05 else 0.0
        return reach_reward + 2.0 * goal_reward + success_bonus

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)

        # Randomize object position on the table
        self.data.qpos[6] = 0.5 + self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[7] = self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[8] = 0.45

        # Randomize goal (the red marker in the XML stays fixed;
        # only the reward computation uses this position)
        self._goal_pos = np.array([
            0.4 + self.np_random.uniform(-0.1, 0.1),
            0.2 + self.np_random.uniform(-0.1, 0.1),
            0.55 + self.np_random.uniform(-0.05, 0.05),
        ])

        mujoco.mj_forward(self.model, self.data)
        self.current_step = 0
        return self._get_obs(), {}

    def step(self, action):
        # Scale actions
        joint_delta = action[:4] * 0.05           # Max 0.05 rad per step
        gripper_cmd = (action[4] + 1) / 2 * 0.04  # Map [-1, 1] to [0, 0.04]

        # Apply actions
        self.data.ctrl[:4] = self.data.qpos[:4] + joint_delta
        self.data.ctrl[4] = gripper_cmd  # Left finger
        self.data.ctrl[5] = gripper_cmd  # Right finger

        # Simulate
        for _ in range(10):  # 10 substeps
            mujoco.mj_step(self.model, self.data)

        obs = self._get_obs()
        reward = self._compute_reward(obs)
        self.current_step += 1

        # Gymnasium convention: hitting the time limit truncates
        # the episode rather than terminating it
        terminated = False
        truncated = self.current_step >= self.max_steps
        return obs, reward, terminated, truncated, {}
```
Training with Stable-Baselines3
With our environment created, we can train an agent using PPO or SAC:
```python
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.callbacks import EvalCallback

def make_env(rank, seed=0):
    def _init():
        env = SimpleManipulationEnv(reward_type="dense")
        env.reset(seed=seed + rank)
        return env
    return _init

# Create parallel environments
# (SubprocVecEnv spawns worker processes; on Windows/macOS, run this
# from inside an `if __name__ == "__main__":` guard)
num_envs = 8
env = SubprocVecEnv([make_env(i) for i in range(num_envs)])

# Train PPO
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=256,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./logs/",
)

# Evaluation callback
eval_env = SimpleManipulationEnv(reward_type="dense")
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    eval_freq=10_000,
    n_eval_episodes=20,
    deterministic=True,
)

# Start training
model.learn(total_timesteps=1_000_000, callback=eval_callback)
model.save("manipulation_ppo")
```
Sparse vs Dense Reward: Experimental Comparison
After training for 1 million timesteps, here are the typical results:
| Metric | Sparse Reward (PPO) | Dense Reward (PPO) | Dense Reward (SAC) |
|---|---|---|---|
| Success Rate | ~5% | ~65% | ~78% |
| Avg Return | 0.12 | 145.3 | 187.6 |
| Convergence | Does not converge | ~400K steps | ~300K steps |
| Behavior quality | Random | Reasonable | Smooth |
Sparse reward is nearly impossible to learn with vanilla PPO — you will need Hindsight Experience Replay (HER) to solve this, which we will discuss in detail in Post 4: Precision Pick-and-Place.
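As a preview of why HER rescues sparse rewards: a failed trajectory is relabeled as if whatever the object actually ended up doing had been the goal all along, so a zero-reward episode becomes a successful one. A minimal sketch of the relabeling idea (the "final" goal-selection strategy, not the full algorithm; the trajectory data is made up):

```python
import numpy as np

def sparse_reward(achieved, desired, threshold=0.05):
    return 1.0 if np.linalg.norm(achieved - desired) < threshold else 0.0

# A failed episode: the object never reached the commanded goal
achieved_goals = np.array([
    [0.50, 0.00, 0.45],
    [0.52, 0.03, 0.45],
    [0.55, 0.08, 0.45],
])
desired_goal = np.array([0.40, 0.20, 0.55])

original = [sparse_reward(a, desired_goal) for a in achieved_goals]

# HER "final" strategy: pretend the last achieved state was the goal
relabeled_goal = achieved_goals[-1]
relabeled = [sparse_reward(a, relabeled_goal) for a in achieved_goals]

print(original)   # [0.0, 0.0, 0.0] — nothing to learn from
print(relabeled)  # [0.0, 0.0, 1.0] — the final transition now carries signal
```

Every episode teaches the agent how to reach *some* goal, and that competence transfers to the goals we actually care about.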
Gymnasium Robotics: Ready-Made Environments
Instead of building environments from scratch, you can use Gymnasium Robotics — a collection of standardized manipulation environments:
```python
import gymnasium as gym
import gymnasium_robotics

# Fetch environments (7-DOF arm + gripper)
env = gym.make("FetchReach-v3")         # Move gripper to target
env = gym.make("FetchPush-v3")          # Push object to position
env = gym.make("FetchSlide-v3")         # Slide object on table
env = gym.make("FetchPickAndPlace-v3")  # Pick and place

# Shadow Hand environments (24-DOF)
env = gym.make("HandReach-v2")            # Move fingers to position
env = gym.make("HandManipulateBlock-v2")  # Rotate block
```
Each environment returns a dictionary observation with `observation` (the full state), `desired_goal`, and `achieved_goal` keys — the ideal structure for HER.
References
- Gymnasium Robotics: A Standard Interface for Reinforcement Learning in Robot Manipulation — Towers et al., 2024
- MuJoCo: A physics engine for model-based control — Todorov et al., 2012
- A Survey on Robot Manipulation — Cui et al., 2023
Next in the Series
The next post — Grasping with RL: Stable Grasp & Object Variety — will apply these foundations to train robots to grasp objects of varying shapes and sizes. We will compare PPO vs SAC and design curriculum learning for grasping.