← Back to Blog
manipulationrlmanipulationmujocoreward-design

RL for Manipulation: MDP, Reward Design & Environment Setup

Build the foundation of RL for manipulation — from MDP formulation and reward function design to setting up MuJoCo environments from scratch.

Nguyễn Anh Tuấn10 tháng 3, 202612 min read
RL for Manipulation: MDP, Reward Design & Environment Setup

How does a robot arm learn to pick up objects without being explicitly programmed for every motion? The answer lies in Reinforcement Learning (RL) — a paradigm that lets robots explore, experiment, and gradually discover how to manipulate the physical world.

This is the first post in our RL for Manipulation series, where we will go from theoretical foundations to training robots on complex manipulation tasks. In this post, we focus on three pillars: MDP formulation, reward design, and environment setup in MuJoCo.

If you are new to RL fundamentals, consider reading RL Basics for Robotics first to build a solid foundation.

Manipulation Through the MDP Lens

What is a Markov Decision Process?

Every RL problem is modeled as a Markov Decision Process (MDP), defined by the tuple $(S, A, T, R, \gamma)$:

Component Meaning Manipulation Example
$S$ (State space) Set of all states Joint positions, velocities, object pose
$A$ (Action space) Set of all actions Joint torques or target positions
$T$ (Transition) State transition probability Physics simulation (MuJoCo dynamics)
$R$ (Reward) Reward function +1 when grasping succeeds, -0.01 per step
$\gamma$ (Discount) Discount factor Typically 0.99 for manipulation

The critical challenge when applying MDP to manipulation is designing the right state and action spaces. Give the robot too little information (impoverished state) and it cannot learn. Give it too much irrelevant information and learning becomes slow and unstable.

Robot arm in simulation environment

State Space for Manipulation

The state space in manipulation typically includes three groups of information:

1. Robot State (Proprioception)

2. Object State

3. Task-Specific Information

import numpy as np

def get_manipulation_state(sim):
    """Extract state vector for manipulation task."""
    # Robot proprioception
    joint_pos = sim.data.qpos[:7]      # 7-DOF arm
    joint_vel = sim.data.qvel[:7]
    gripper_pos = sim.data.qpos[7:9]   # 2-finger gripper

    # End-effector pose
    ee_pos = sim.data.site_xpos[0]     # End-effector position
    ee_rot = sim.data.site_xmat[0].reshape(3, 3)  # Rotation matrix

    # Object state
    obj_pos = sim.data.qpos[9:12]      # Object position
    obj_quat = sim.data.qpos[12:16]    # Object quaternion
    obj_vel = sim.data.qvel[9:15]      # 6-DOF velocity

    # Relative information (very important!)
    rel_pos = obj_pos - ee_pos         # Vector from gripper to object

    state = np.concatenate([
        joint_pos, joint_vel,           # 14 dims
        gripper_pos,                    # 2 dims
        ee_pos,                        # 3 dims
        obj_pos, obj_quat, obj_vel,    # 13 dims
        rel_pos,                       # 3 dims
    ])
    return state  # Total: 35 dims

Key tip: Always include relative position (distance from gripper to object) in the state. This relative information helps policies generalize much better than using only absolute positions.

Action Space: Torque vs Position vs Velocity

The choice of action space significantly impacts learning ability and motion quality:

Action Space Pros Cons When to Use
Joint Torque Most flexible, direct force control Hard to learn, high sample complexity, jerky motions Force control research
Joint Position Easy to learn, smooth motions No direct force control Most manipulation tasks
Joint Velocity Balance between torque and position Requires PD gain tuning Pick-and-place, trajectory following
End-effector Delta Intuitive, easy to transfer Needs IK solver, singularity issues Teleoperation-style tasks

In practice, Joint Position control (or Delta Position) is the most commonly used because it allows the policy to focus on "where to go" rather than "how much force to use":

# Action space: delta joint position
action_space = spaces.Box(
    low=-0.05,  # Max change per step
    high=0.05,
    shape=(8,),  # 7 joints + 1 gripper
    dtype=np.float32
)

def apply_action(sim, action):
    """Apply delta position action."""
    current_pos = sim.data.qpos[:8].copy()
    target_pos = current_pos + action
    # Clip to joint limits
    target_pos = np.clip(target_pos, joint_lower, joint_upper)
    sim.data.ctrl[:8] = target_pos

Reward Design: Art and Science

Reward design is the hardest and most important part of RL for manipulation. A good reward function can help a robot learn in hours, while a bad one ensures it never learns anything — or worse, learns bizarre "reward hacking" behaviors.

Visualization of reward shaping concept

Sparse vs Dense Reward

Sparse Reward — only reward upon task completion:

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Reward = 1 if goal reached, 0 otherwise."""
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 1.0 if distance < threshold else 0.0

Dense Reward — continuous feedback based on progress:

def dense_reward(ee_pos, obj_pos, goal_pos, grasped):
    """Continuous reward based on distance and state."""
    # Phase 1: Reach toward object
    reach_dist = np.linalg.norm(ee_pos - obj_pos)
    reach_reward = 1.0 - np.tanh(5.0 * reach_dist)

    # Phase 2: Lift object toward goal
    if grasped:
        place_dist = np.linalg.norm(obj_pos - goal_pos)
        place_reward = 1.0 - np.tanh(5.0 * place_dist)
        return reach_reward + 5.0 * place_reward
    
    return reach_reward
Criterion Sparse Reward Dense Reward
Ease of design Very easy Difficult, error-prone
Learning speed Very slow Much faster
Reward hacking risk Low High if poorly designed
Requires HER? Nearly essential Not necessary
Scalability Good Hard for multi-step

Principles for Reward Function Design

Over years of research, the RL community has distilled several golden principles:

1. Divide reward into phases: Each phase corresponds to a stage of the task (approach, grasp, move, place).

2. Use distance-based shaping: Functions like tanh or exponential decay provide stable gradients:

# Good: stable gradient at all distances
reward = 1.0 - np.tanh(5.0 * distance)

# Poor: gradient too small when far, too large when near
reward = -distance

# Poor: gradient = 0 everywhere except at boundary
reward = float(distance < threshold)

3. Penalize undesired behavior: Add penalties for excessive actions, hard collisions, or dropping objects:

# Penalty for large actions (encourage smooth motion)
action_penalty = -0.01 * np.sum(action ** 2)

# Penalty for dropping (if previously grasped)
drop_penalty = -10.0 if was_grasped and not is_grasped else 0.0

4. Normalize reward components: Ensure all reward components have comparable scales so the agent does not ignore important terms.

Setting Up Manipulation Environments in MuJoCo

Why MuJoCo?

MuJoCo (Multi-Joint dynamics with Contact) is the leading simulator for manipulation RL because of:

Creating a Custom Manipulation Environment

Below is complete code to create a basic manipulation environment — a robot arm must grasp a cube and bring it to a target location:

import gymnasium as gym
from gymnasium import spaces
import mujoco
import mujoco.viewer
import numpy as np


# ---- MuJoCo XML model ----
MANIPULATION_XML = """
<mujoco model="simple_manipulation">
  <option timestep="0.002" gravity="0 0 -9.81"/>
  
  <worldbody>
    <!-- Ground plane -->
    <geom type="plane" size="1 1 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.4 0.4 0.02" rgba="0.6 0.3 0.1 1"
            mass="100" contype="1" conaffinity="1"/>
    </body>
    
    <!-- Robot Arm (simplified 4-DOF) -->
    <body name="base" pos="0.0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14"/>
      <geom type="cylinder" size="0.04 0.05" rgba="0.2 0.2 0.8 1"/>
      
      <body name="link1" pos="0 0 0.1">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.57 1.57"/>
        <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.03" rgba="0.3 0.3 0.9 1"/>
        
        <body name="link2" pos="0.3 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
          <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.025" rgba="0.4 0.4 1.0 1"/>
          
          <body name="link3" pos="0.25 0 0">
            <joint name="j3" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
            <geom type="capsule" fromto="0 0 0 0.1 0 0" size="0.02" rgba="0.5 0.5 1.0 1"/>
            
            <!-- Simple Gripper -->
            <body name="gripper" pos="0.1 0 0">
              <site name="grip_site" pos="0 0 0" size="0.01"/>
              <body name="finger_left" pos="0 0.02 0">
                <joint name="finger_l" type="slide" axis="0 1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
              <body name="finger_right" pos="0 -0.02 0">
                <joint name="finger_r" type="slide" axis="0 -1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Object to manipulate -->
    <body name="object" pos="0.5 0 0.45">
      <freejoint name="obj_joint"/>
      <geom type="box" size="0.025 0.025 0.025" rgba="0.1 0.8 0.1 1"
            mass="0.1" contype="1" conaffinity="1" friction="1 0.05 0.01"/>
    </body>
    
    <!-- Goal marker (visual only) -->
    <body name="goal" pos="0.4 0.2 0.55">
      <geom type="sphere" size="0.02" rgba="1 0 0 0.5" contype="0" conaffinity="0"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="a_j0" joint="j0" kp="100"/>
    <position name="a_j1" joint="j1" kp="100"/>
    <position name="a_j2" joint="j2" kp="100"/>
    <position name="a_j3" joint="j3" kp="100"/>
    <position name="a_fl" joint="finger_l" kp="50"/>
    <position name="a_fr" joint="finger_r" kp="50"/>
  </actuator>
</mujoco>
"""


class SimpleManipulationEnv(gym.Env):
    """Custom MuJoCo manipulation environment.
    
    Task: Robot arm must grasp a green cube
    and bring it to the goal position (red sphere).
    """
    metadata = {"render_modes": ["human", "rgb_array"]}
    
    def __init__(self, render_mode=None, reward_type="dense"):
        super().__init__()
        
        self.render_mode = render_mode
        self.reward_type = reward_type
        
        # Load MuJoCo model
        self.model = mujoco.MjModel.from_xml_string(MANIPULATION_XML)
        self.data = mujoco.MjData(self.model)
        
        # Action: 4 joint deltas + 1 gripper command
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(5,), dtype=np.float32
        )
        
        # Observation: joint pos/vel + ee pos + obj pos/vel + goal + relative
        obs_dim = 4 + 4 + 3 + 3 + 3 + 3 + 3  # = 23
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float64
        )
        
        self.max_steps = 200
        self.current_step = 0
        self._goal_pos = np.array([0.4, 0.2, 0.55])
        
    def _get_obs(self):
        """Build observation vector."""
        joint_pos = self.data.qpos[:4]
        joint_vel = self.data.qvel[:4]
        ee_pos = self.data.site_xpos[0]  # grip_site position
        obj_pos = self.data.qpos[6:9]     # Object position (freejoint)
        obj_vel = self.data.qvel[6:9]
        rel_obj = obj_pos - ee_pos
        rel_goal = self._goal_pos - obj_pos
        
        return np.concatenate([
            joint_pos, joint_vel, ee_pos,
            obj_pos, obj_vel, rel_obj, rel_goal
        ])
    
    def _compute_reward(self, obs):
        """Compute reward based on current state."""
        ee_pos = obs[8:11]
        obj_pos = obs[11:14]
        
        reach_dist = np.linalg.norm(ee_pos - obj_pos)
        goal_dist = np.linalg.norm(obj_pos - self._goal_pos)
        
        if self.reward_type == "sparse":
            return 1.0 if goal_dist < 0.05 else 0.0
        
        # Dense reward: continuous feedback
        reach_reward = 1.0 - np.tanh(5.0 * reach_dist)
        goal_reward = 1.0 - np.tanh(5.0 * goal_dist)
        
        # Bonus when object is near goal
        success_bonus = 10.0 if goal_dist < 0.05 else 0.0
        
        return reach_reward + 2.0 * goal_reward + success_bonus
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)
        
        # Randomize object position on table
        self.data.qpos[6] = 0.5 + self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[7] = self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[8] = 0.45
        
        # Randomize goal
        self._goal_pos = np.array([
            0.4 + self.np_random.uniform(-0.1, 0.1),
            0.2 + self.np_random.uniform(-0.1, 0.1),
            0.55 + self.np_random.uniform(-0.05, 0.05),
        ])
        
        mujoco.mj_forward(self.model, self.data)
        self.current_step = 0
        
        return self._get_obs(), {}
    
    def step(self, action):
        # Scale actions
        joint_delta = action[:4] * 0.05  # Max 0.05 rad per step
        gripper_cmd = (action[4] + 1) / 2 * 0.04  # [0, 0.04]
        
        # Apply actions
        self.data.ctrl[:4] = self.data.qpos[:4] + joint_delta
        self.data.ctrl[4] = gripper_cmd   # Left finger
        self.data.ctrl[5] = gripper_cmd   # Right finger
        
        # Simulate
        for _ in range(10):  # 10 substeps
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        reward = self._compute_reward(obs)
        
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        
        return obs, reward, terminated, truncated, {}

Training with Stable-Baselines3

With our environment created, we can train an agent using PPO or SAC:

from stable_baselines3 import PPO, SAC
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.callbacks import EvalCallback

def make_env(rank, seed=0):
    def _init():
        env = SimpleManipulationEnv(reward_type="dense")
        env.reset(seed=seed + rank)
        return env
    return _init

# Create parallel environments
num_envs = 8
env = SubprocVecEnv([make_env(i) for i in range(num_envs)])

# Train PPO
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=256,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./logs/"
)

# Evaluation callback
eval_env = SimpleManipulationEnv(reward_type="dense")
eval_callback = EvalCallback(
    eval_env, 
    best_model_save_path="./best_model/",
    eval_freq=10000,
    n_eval_episodes=20,
    deterministic=True
)

# Start training
model.learn(total_timesteps=1_000_000, callback=eval_callback)
model.save("manipulation_ppo")

Sparse vs Dense Reward: Experimental Comparison

After training for 1 million timesteps, here are the typical results:

Metric Sparse Reward (PPO) Dense Reward (PPO) Dense Reward (SAC)
Success Rate ~5% ~65% ~78%
Avg Return 0.12 145.3 187.6
Convergence Does not converge ~400K steps ~300K steps
Behavior quality Random Reasonable Smooth

Sparse reward is nearly impossible to learn with vanilla PPO — you will need Hindsight Experience Replay (HER) to solve this, which we will discuss in detail in Post 4: Precision Pick-and-Place.

Training curves comparison

Gymnasium Robotics: Ready-Made Environments

Instead of building environments from scratch, you can use Gymnasium Robotics — a collection of standardized manipulation environments:

import gymnasium as gym
import gymnasium_robotics

# Fetch environments (7-DOF arm + gripper)
env = gym.make("FetchReach-v3")        # Move gripper to target
env = gym.make("FetchPush-v3")         # Push object to position
env = gym.make("FetchSlide-v3")        # Slide object on table
env = gym.make("FetchPickAndPlace-v3") # Pick and place

# Shadow Hand environments (24-DOF)
env = gym.make("HandReach-v2")         # Move fingers to position
env = gym.make("HandManipulateBlock-v2")  # Rotate block

Each environment provides both observation (full state) and desired_goal / achieved_goal — the ideal structure for HER.

References

  1. Gymnasium Robotics: A Standard Interface for Reinforcement Learning in Robot Manipulation — Towers et al., 2024
  2. MuJoCo: A physics engine for model-based control — Todorov et al., 2012
  3. A Survey on Robot Manipulation — Cui et al., 2023

Next in the Series

The next post — Grasping with RL: Stable Grasp & Object Variety — will apply these foundations to train robots to grasp objects of varying shapes and sizes. We will compare PPO vs SAC and design curriculum learning for grasping.

Related Posts

Related Posts

TutorialSim-to-Real cho Humanoid: Deployment Best Practices
sim2realhumanoiddeploymentrlunitreePart 10

Sim-to-Real cho Humanoid: Deployment Best Practices

Pipeline hoàn chỉnh deploy RL locomotion policy lên robot humanoid thật — domain randomization, system identification, safety, và Unitree SDK.

9/4/202611 min read
ResearchUnifoLM-VLA-0: Mô hình VLA cho Manipulation trên Unitree G1
vlaunitreeg1manipulationhumanoid

UnifoLM-VLA-0: Mô hình VLA cho Manipulation trên Unitree G1

Phân tích và hướng dẫn triển khai UnifoLM-VLA-0 — mô hình VLA open-source đầu tiên chạy trực tiếp trên G1 humanoid

8/4/202623 min read
ResearchWholeBodyVLA: VLA Toàn Thân cho Humanoid Loco-Manipulation
vlahumanoidloco-manipulationiclrrl

WholeBodyVLA: VLA Toàn Thân cho Humanoid Loco-Manipulation

ICLR 2026 — học manipulation từ egocentric video, kết hợp VLA + RL cho locomotion-aware control

7/4/202613 min read