Grasping with RL: Stable Grasp & Object Variety

Grasping — the act of picking up objects — sounds trivial for humans, but it remains one of the biggest challenges in robotics. A six-month-old baby can grab toys, while robots still struggle after millions of training steps. Why? Because grasping requires precise coordination between perception, planning, and control — all within an environment with complex contact dynamics.

In the previous post — RL for Manipulation: MDP, Reward Design & Environment Setup — we built the theoretical foundation. Now, let us apply everything to train a robot to grasp objects stably, across a variety of shapes and sizes.

Types of Grasping

Before coding, let us understand the types of grasps in robotics:

Grasp Type	Description	Suitable Gripper	RL Difficulty
Parallel Jaw	Two-finger parallel squeeze	Parallel gripper	Low
Suction	Vacuum suction	Suction cup	Very low
Power Grasp	Full-hand wrap	Dexterous hand	High
Precision Grasp	Fingertip grasp	Dexterous hand	Very high
Enveloping	Wrap around object	Soft gripper	Medium

In this post, we focus on Parallel Jaw Grasping — the most common type in industry — and gradually expand to more complex variants.

Robot gripper grasping objects

Reward Design for Grasping

A reward function for grasping must capture multiple phases: approach, align, contact, grasp, and lift. Each phase needs its own reward component.

Multi-Phase Reward Function

import numpy as np

class GraspReward:
    """Multi-phase reward function for grasping task."""
    
    def __init__(self):
        self.grasp_height_threshold = 0.1  # 10cm lift = success
        self.contact_threshold = 0.01       # Contact distance
    
    def compute(self, ee_pos, obj_pos, obj_height_init,
                finger_contact, obj_vel, action):
        """
        Compute reward for each grasping phase.
        
        Args:
            ee_pos: End-effector position [3]
            obj_pos: Object position [3]
            obj_height_init: Initial object height
            finger_contact: Bool, True if both fingers touch object
            obj_vel: Object velocity [6]
            action: Action vector [N]
        
        Returns:
            total_reward, reward_info dict
        """
        rewards = {}
        
        # Phase 1: REACH — Move toward object
        reach_dist = np.linalg.norm(ee_pos - obj_pos)
        rewards['reach'] = 1.0 - np.tanh(5.0 * reach_dist)
        
        # Phase 2: CONTACT — Establish contact
        if finger_contact:
            rewards['contact'] = 1.0
        else:
            rewards['contact'] = 0.0
        
        # Phase 3: LIFT — Raise the object
        lift_height = obj_pos[2] - obj_height_init
        if finger_contact and lift_height > 0:
            rewards['lift'] = min(lift_height / self.grasp_height_threshold, 1.0)
        else:
            rewards['lift'] = 0.0
        
        # Phase 4: HOLD — Hold stably (minimal oscillation)
        if lift_height > self.grasp_height_threshold:
            obj_angular_vel = np.linalg.norm(obj_vel[3:6])
            rewards['hold'] = 1.0 - np.tanh(2.0 * obj_angular_vel)
            rewards['success'] = 10.0  # Large bonus on success
        else:
            rewards['hold'] = 0.0
            rewards['success'] = 0.0
        
        # Penalties
        rewards['action_penalty'] = -0.001 * np.sum(action ** 2)
        
        # Weighted sum
        total = (1.0 * rewards['reach'] +
                 2.0 * rewards['contact'] +
                 5.0 * rewards['lift'] +
                 3.0 * rewards['hold'] +
                 rewards['success'] +
                 rewards['action_penalty'])
        
        return total, rewards

Phase-by-Phase Analysis

Phase 1 (Reach): Uses tanh shaping so the gripper approaches the object. Low weight since this is just preparation.

Phase 2 (Contact): Binary reward when both fingers touch the object. This encourages the robot to actually make contact rather than just hovering nearby.

Phase 3 (Lift): Reward proportional to lift height. Only active when contact exists — preventing reward hacking like "push the object upward."

Phase 4 (Hold): Penalizes high angular velocity — the object must be held stably without spinning.

Environment for Grasping Training

MuJoCo Grasping Environment

import gymnasium as gym
from gymnasium import spaces
import mujoco
import numpy as np

GRASPING_XML = """
<mujoco model="grasping_env">
  <option timestep="0.002" gravity="0 0 -9.81">
    <flag contact="enable"/>
  </option>
  
  <default>
    <geom condim="4" friction="1 0.5 0.01"/>
  </default>
  
  <worldbody>
    <light pos="0 0 2" dir="0 0 -1"/>
    <geom type="plane" size="2 2 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.5 0.5 0.02" rgba="0.6 0.4 0.2 1" mass="200"/>
    </body>
    
    <!-- Panda-like 7DOF arm (simplified) -->
    <body name="base_link" pos="0 0 0.42">
      <geom type="cylinder" size="0.06 0.03" rgba="0.9 0.9 0.9 1" mass="5"/>
      
      <body name="link1" pos="0 0 0.06">
        <joint name="j1" type="hinge" axis="0 0 1" range="-2.9 2.9" damping="1"/>
        <geom type="capsule" fromto="0 0 0 0 0 0.15" size="0.04" rgba="0.9 0.9 0.9 1"/>
        
        <body name="link2" pos="0 0 0.15">
          <joint name="j2" type="hinge" axis="0 1 0" range="-1.76 1.76" damping="1"/>
          <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.035" rgba="0.9 0.9 0.9 1"/>
          
          <body name="link3" pos="0.3 0 0">
            <joint name="j3" type="hinge" axis="0 1 0" range="-2.0 0.1" damping="1"/>
            <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.03" rgba="0.9 0.9 0.9 1"/>
            
            <body name="link4" pos="0.25 0 0">
              <joint name="j4" type="hinge" axis="0 0 1" range="-3.07 3.07" damping="0.5"/>
              <geom type="capsule" fromto="0 0 0 0.08 0 0" size="0.025" rgba="0.3 0.3 0.3 1"/>
              
              <body name="wrist" pos="0.08 0 0">
                <joint name="j5" type="hinge" axis="1 0 0" range="-1.57 1.57" damping="0.5"/>
                <site name="ee_site" pos="0 0 0" size="0.01"/>
                
                <!-- Parallel Jaw Gripper -->
                <body name="left_finger" pos="0 0.02 0">
                  <joint name="finger_l" type="slide" axis="0 1 0" range="0 0.04" damping="5"/>
                  <geom type="box" size="0.01 0.005 0.04" rgba="0.8 0.2 0.2 1" mass="0.05"
                        contype="1" conaffinity="1" friction="2 1 0.01"/>
                </body>
                <body name="right_finger" pos="0 -0.02 0">
                  <joint name="finger_r" type="slide" axis="0 -1 0" range="0 0.04" damping="5"/>
                  <geom type="box" size="0.01 0.005 0.04" rgba="0.8 0.2 0.2 1" mass="0.05"
                        contype="1" conaffinity="1" friction="2 1 0.01"/>
                </body>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Object (randomized at reset) -->
    <body name="object" pos="0.5 0 0.44">
      <freejoint name="obj_free"/>
      <geom name="obj_geom" type="box" size="0.025 0.025 0.025" 
            rgba="0.1 0.8 0.1 1" mass="0.05" contype="1" conaffinity="1"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="act_j1" joint="j1" kp="200"/>
    <position name="act_j2" joint="j2" kp="200"/>
    <position name="act_j3" joint="j3" kp="200"/>
    <position name="act_j4" joint="j4" kp="100"/>
    <position name="act_j5" joint="j5" kp="100"/>
    <position name="act_fl" joint="finger_l" kp="100"/>
    <position name="act_fr" joint="finger_r" kp="100"/>
  </actuator>
  
  <sensor>
    <touch name="touch_l" site="ee_site"/>
  </sensor>
</mujoco>
"""


class GraspingEnv(gym.Env):
    """Grasping environment with object randomization."""
    
    def __init__(self, reward_type="dense", curriculum_level=0):
        super().__init__()
        self.reward_type = reward_type
        self.curriculum_level = curriculum_level
        
        self.model = mujoco.MjModel.from_xml_string(GRASPING_XML)
        self.data = mujoco.MjData(self.model)
        
        # 5 joints + 1 gripper = 6 actions
        self.action_space = spaces.Box(-1, 1, shape=(6,), dtype=np.float32)
        
        # State: joints(10) + ee(3) + obj(7) + relative(3) + contacts(2)
        self.observation_space = spaces.Box(
            -np.inf, np.inf, shape=(25,), dtype=np.float64
        )
        
        self.reward_fn = GraspReward()
        self.max_steps = 150
        self.current_step = 0
        self.obj_init_height = 0.44
        
    def _check_contact(self):
        """Check if fingers are touching the object."""
        left_contact = False
        right_contact = False
        for i in range(self.data.ncon):
            c = self.data.contact[i]
            geom1 = self.model.geom(c.geom1).name
            geom2 = self.model.geom(c.geom2).name
            pair = {geom1, geom2}
            if "obj_geom" in pair:
                if any("left" in g or "finger" in g for g in pair - {"obj_geom"}):
                    left_contact = True
                if any("right" in g for g in pair - {"obj_geom"}):
                    right_contact = True
        return left_contact and right_contact
        
    def _get_obs(self):
        joint_pos = self.data.qpos[:5]
        joint_vel = self.data.qvel[:5]
        ee_pos = self.data.site_xpos[0]
        obj_pos = self.data.qpos[7:10]
        obj_quat = self.data.qpos[10:14]
        rel = obj_pos - ee_pos
        contacts = np.array([float(self._check_contact()), 
                           self.data.qpos[5] + self.data.qpos[6]])
        
        return np.concatenate([
            joint_pos, joint_vel, ee_pos, obj_pos, obj_quat, rel, contacts
        ])
    
    def _randomize_object(self):
        """Randomize object based on curriculum level."""
        if self.curriculum_level == 0:
            # Easy: cube at center, minimal variation
            x = 0.5 + np.random.uniform(-0.05, 0.05)
            y = np.random.uniform(-0.05, 0.05)
        elif self.curriculum_level == 1:
            # Medium: wider spread
            x = 0.5 + np.random.uniform(-0.15, 0.15)
            y = np.random.uniform(-0.15, 0.15)
        else:
            # Hard: wide spread + random friction
            x = 0.5 + np.random.uniform(-0.2, 0.2)
            y = np.random.uniform(-0.2, 0.2)
            # Randomize friction
            geom_id = mujoco.mj_name2id(self.model, mujoco.mjtObj.mjOBJ_GEOM, "obj_geom")
            self.model.geom_friction[geom_id] = [
                np.random.uniform(0.3, 2.0),
                np.random.uniform(0.01, 0.5),
                np.random.uniform(0.001, 0.01)
            ]
        
        self.data.qpos[7] = x
        self.data.qpos[8] = y
        self.data.qpos[9] = 0.44
        self.obj_init_height = 0.44
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)
        self._randomize_object()
        mujoco.mj_forward(self.model, self.data)
        self.current_step = 0
        return self._get_obs(), {}
    
    def step(self, action):
        # Apply scaled actions
        joint_delta = action[:5] * 0.05
        gripper = (action[5] + 1) / 2 * 0.04
        
        self.data.ctrl[:5] = self.data.qpos[:5] + joint_delta
        self.data.ctrl[5] = gripper
        self.data.ctrl[6] = gripper
        
        for _ in range(10):
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        ee_pos = obs[10:13]
        obj_pos = obs[13:16]
        obj_vel = self.data.qvel[7:13]
        contact = self._check_contact()
        
        reward, info = self.reward_fn.compute(
            ee_pos, obj_pos, self.obj_init_height,
            contact, obj_vel, action
        )
        
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        
        return obs, reward, terminated, False, info

Curriculum Learning for Grasping

Curriculum learning is an "easy to hard" strategy — start with simple objects and favorable positions, then gradually increase difficulty. This is a key technique for achieving high success rates.

Curriculum learning progression

Designing the Curriculum

class GraspingCurriculum:
    """Automatic curriculum for grasping training."""
    
    def __init__(self, success_threshold=0.7, window_size=100):
        self.level = 0
        self.max_level = 4
        self.success_threshold = success_threshold
        self.success_history = []
        self.window_size = window_size
        
        self.levels = {
            0: {
                'name': 'Cube - Center',
                'objects': ['cube'],
                'position_range': 0.05,
                'size_range': (0.025, 0.025),
                'friction_range': (1.0, 1.0),
            },
            1: {
                'name': 'Cube - Wide Area',
                'objects': ['cube'],
                'position_range': 0.15,
                'size_range': (0.02, 0.035),
                'friction_range': (0.5, 1.5),
            },
            2: {
                'name': 'Multi-Shape - Easy',
                'objects': ['cube', 'cylinder', 'sphere'],
                'position_range': 0.15,
                'size_range': (0.02, 0.04),
                'friction_range': (0.5, 1.5),
            },
            3: {
                'name': 'Multi-Shape - Hard',
                'objects': ['cube', 'cylinder', 'sphere', 'ellipsoid'],
                'position_range': 0.20,
                'size_range': (0.015, 0.045),
                'friction_range': (0.3, 2.0),
            },
            4: {
                'name': 'Adversarial',
                'objects': ['cube', 'cylinder', 'sphere', 'ellipsoid', 'mesh'],
                'position_range': 0.25,
                'size_range': (0.01, 0.05),
                'friction_range': (0.1, 2.5),
            },
        }
    
    def update(self, success: bool):
        """Update after each episode."""
        self.success_history.append(float(success))
        if len(self.success_history) > self.window_size:
            self.success_history.pop(0)
        
        if len(self.success_history) >= self.window_size:
            rate = np.mean(self.success_history)
            if rate >= self.success_threshold and self.level < self.max_level:
                self.level += 1
                self.success_history.clear()
                print(f"Curriculum advanced to Level {self.level}: "
                      f"{self.levels[self.level]['name']}")
    
    def get_config(self):
        """Return config for current level."""
        return self.levels[self.level]

Complete Training Pipeline

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback

class CurriculumCallback(BaseCallback):
    """Callback to automatically advance curriculum level."""
    
    def __init__(self, curriculum, verbose=0):
        super().__init__(verbose)
        self.curriculum = curriculum
    
    def _on_step(self):
        for info in self.locals.get('infos', []):
            if 'success' in info:
                self.curriculum.update(info['success'])
                if hasattr(self.training_env, 'envs'):
                    for env in self.training_env.envs:
                        env.curriculum_level = self.curriculum.level
        return True

# ---- Main Training ----
curriculum = GraspingCurriculum(success_threshold=0.7, window_size=100)
env = GraspingEnv(reward_type="dense", curriculum_level=0)

# SAC typically outperforms PPO for manipulation
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    learning_starts=1000,
    verbose=1,
    tensorboard_log="./grasp_logs/"
)

curriculum_cb = CurriculumCallback(curriculum)
model.learn(total_timesteps=2_000_000, callback=curriculum_cb)
model.save("grasping_sac_curriculum")

PPO vs SAC for Grasping

Criterion	PPO	SAC
Sample Efficiency	Low (~3M steps)	High (~1M steps)
Stability	Very stable	Occasional divergence
Final Performance	60-70% success	75-85% success
Parallelization	Excellent (many envs)	Limited (replay buffer)
Continuous Actions	OK	Excellent
Hyperparameter Sensitivity	Low	Medium

Conclusion: SAC is the better choice for grasping due to high sample efficiency and excellent handling of continuous action spaces. However, PPO remains useful when you have many parallel environments available.

Evaluation Metrics

Evaluating a grasping policy requires more than just success rate:

def evaluate_grasping(model, env, n_episodes=100):
    """Comprehensive evaluation of grasping policy."""
    metrics = {
        'success_rate': 0,
        'grasp_success': 0,    # Grasped but not lifted
        'lift_success': 0,     # Successfully lifted
        'hold_success': 0,     # Held stably for 2 seconds
        'avg_grasp_time': [],  # Time to grasp
        'avg_lift_height': [], # Lift height
        'drop_rate': 0,        # Drop rate after grasping
    }
    
    for ep in range(n_episodes):
        obs, _ = env.reset()
        grasped = False
        lifted = False
        
        for step in range(env.max_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _, info = env.step(action)
            
            if info.get('contact', False) and not grasped:
                grasped = True
                metrics['grasp_success'] += 1
                metrics['avg_grasp_time'].append(step)
            
            if info.get('lift', 0) > 0.5 and not lifted:
                lifted = True
                metrics['lift_success'] += 1
            
            if grasped and not info.get('contact', False):
                metrics['drop_rate'] += 1
            
            if done:
                if info.get('success', False):
                    metrics['success_rate'] += 1
                    metrics['hold_success'] += 1
                break
    
    for key in ['success_rate', 'grasp_success', 'lift_success', 
                'hold_success', 'drop_rate']:
        metrics[key] /= n_episodes
    
    metrics['avg_grasp_time'] = (np.mean(metrics['avg_grasp_time']) 
                                  if metrics['avg_grasp_time'] else float('inf'))
    
    return metrics

Typical results after 2M timesteps with SAC + curriculum:

Metric	Level 0 (Cube)	Level 2 (Multi-shape)	Level 4 (Adversarial)
Success Rate	92%	78%	61%
Grasp Success	97%	89%	75%
Drop Rate	3%	8%	15%
Avg Grasp Time	25 steps	35 steps	48 steps

References

DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset — Wang et al., 2023
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation — Kalashnikov et al., 2018
Learning Dexterous In-Hand Manipulation — OpenAI, 2018

Next in the Series

Next up — RL Force Control: Balancing a Cup of Water — we tackle the problem of delicate force control: a robot must grasp and move a cup of water without spilling. This is where impedance control meets RL.

Types of Grasping

Before coding, let us understand the types of grasps in robotics:

Grasp Type	Description	Suitable Gripper	RL Difficulty
Parallel Jaw	Two-finger parallel squeeze	Parallel gripper	Low
Suction	Vacuum suction	Suction cup	Very low
Power Grasp	Full-hand wrap	Dexterous hand	High
Precision Grasp	Fingertip grasp	Dexterous hand	Very high
Enveloping	Wrap around object	Soft gripper	Medium

In this post, we focus on Parallel Jaw Grasping — the most common type in industry — and gradually expand to more complex variants.

Robot gripper grasping objects

Reward Design for Grasping

A reward function for grasping must capture multiple phases: approach, align, contact, grasp, and lift. Each phase needs its own reward component.

Multi-Phase Reward Function

import numpy as np

class GraspReward:
    """Multi-phase reward function for grasping task."""
    
    def __init__(self):
        self.grasp_height_threshold = 0.1  # 10cm lift = success
        self.contact_threshold = 0.01       # Contact distance
    
    def compute(self, ee_pos, obj_pos, obj_height_init,
                finger_contact, obj_vel, action):
        """
        Compute reward for each grasping phase.
        
        Args:
            ee_pos: End-effector position [3]
            obj_pos: Object position [3]
            obj_height_init: Initial object height
            finger_contact: Bool, True if both fingers touch object
            obj_vel: Object velocity [6]
            action: Action vector [N]
        
        Returns:
            total_reward, reward_info dict
        """
        rewards = {}
        
        # Phase 1: REACH — Move toward object
        reach_dist = np.linalg.norm(ee_pos - obj_pos)
        rewards['reach'] = 1.0 - np.tanh(5.0 * reach_dist)
        
        # Phase 2: CONTACT — Establish contact
        if finger_contact:
            rewards['contact'] = 1.0
        else:
            rewards['contact'] = 0.0
        
        # Phase 3: LIFT — Raise the object
        lift_height = obj_pos[2] - obj_height_init
        if finger_contact and lift_height > 0:
            rewards['lift'] = min(lift_height / self.grasp_height_threshold, 1.0)
        else:
            rewards['lift'] = 0.0
        
        # Phase 4: HOLD — Hold stably (minimal oscillation)
        if lift_height > self.grasp_height_threshold:
            obj_angular_vel = np.linalg.norm(obj_vel[3:6])
            rewards['hold'] = 1.0 - np.tanh(2.0 * obj_angular_vel)
            rewards['success'] = 10.0  # Large bonus on success
        else:
            rewards['hold'] = 0.0
            rewards['success'] = 0.0
        
        # Penalties
        rewards['action_penalty'] = -0.001 * np.sum(action ** 2)
        
        # Weighted sum
        total = (1.0 * rewards['reach'] +
                 2.0 * rewards['contact'] +
                 5.0 * rewards['lift'] +
                 3.0 * rewards['hold'] +
                 rewards['success'] +
                 rewards['action_penalty'])
        
        return total, rewards

Phase-by-Phase Analysis

Phase 1 (Reach): Uses tanh shaping so the gripper approaches the object. Low weight since this is just preparation.

Phase 2 (Contact): Binary reward when both fingers touch the object. This encourages the robot to actually make contact rather than just hovering nearby.

Phase 3 (Lift): Reward proportional to lift height. Only active when contact exists — preventing reward hacking like "push the object upward."

Phase 4 (Hold): Penalizes high angular velocity — the object must be held stably without spinning.

Environment for Grasping Training

MuJoCo Grasping Environment

import gymnasium as gym
from gymnasium import spaces
import mujoco
import numpy as np

GRASPING_XML = """
<mujoco model="grasping_env">
  <option timestep="0.002" gravity="0 0 -9.81">
    <flag contact="enable"/>
  </option>
  
  <default>
    <geom condim="4" friction="1 0.5 0.01"/>
  </default>
  
  <worldbody>
    <light pos="0 0 2" dir="0 0 -1"/>
    <geom type="plane" size="2 2 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.5 0.5 0.02" rgba="0.6 0.4 0.2 1" mass="200"/>
    </body>
    
    <!-- Panda-like 7DOF arm (simplified) -->
    <body name="base_link" pos="0 0 0.42">
      <geom type="cylinder" size="0.06 0.03" rgba="0.9 0.9 0.9 1" mass="5"/>
      
      <body name="link1" pos="0 0 0.06">
        <joint name="j1" type="hinge" axis="0 0 1" range="-2.9 2.9" damping="1"/>
        <geom type="capsule" fromto="0 0 0 0 0 0.15" size="0.04" rgba="0.9 0.9 0.9 1"/>
        
        <body name="link2" pos="0 0 0.15">
          <joint name="j2" type="hinge" axis="0 1 0" range="-1.76 1.76" damping="1"/>
          <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.035" rgba="0.9 0.9 0.9 1"/>
          
          <body name="link3" pos="0.3 0 0">
            <joint name="j3" type="hinge" axis="0 1 0" range="-2.0 0.1" damping="1"/>
            <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.03" rgba="0.9 0.9 0.9 1"/>
            
            <body name="link4" pos="0.25 0 0">
              <joint name="j4" type="hinge" axis="0 0 1" range="-3.07 3.07" damping="0.5"/>
              <geom type="capsule" fromto="0 0 0 0.08 0 0" size="0.025" rgba="0.3 0.3 0.3 1"/>
              
              <body name="wrist" pos="0.08 0 0">
                <joint name="j5" type="hinge" axis="1 0 0" range="-1.57 1.57" damping="0.5"/>
                <site name="ee_site" pos="0 0 0" size="0.01"/>
                
                <!-- Parallel Jaw Gripper -->
                <body name="left_finger" pos="0 0.02 0">
                  <joint name="finger_l" type="slide" axis="0 1 0" range="0 0.04" damping="5"/>
                  <geom type="box" size="0.01 0.005 0.04" rgba="0.8 0.2 0.2 1" mass="0.05"
                        contype="1" conaffinity="1" friction="2 1 0.01"/>
                </body>
                <body name="right_finger" pos="0 -0.02 0">
                  <joint name="finger_r" type="slide" axis="0 -1 0" range="0 0.04" damping="5"/>
                  <geom type="box" size="0.01 0.005 0.04" rgba="0.8 0.2 0.2 1" mass="0.05"
                        contype="1" conaffinity="1" friction="2 1 0.01"/>
                </body>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Object (randomized at reset) -->
    <body name="object" pos="0.5 0 0.44">
      <freejoint name="obj_free"/>
      <geom name="obj_geom" type="box" size="0.025 0.025 0.025" 
            rgba="0.1 0.8 0.1 1" mass="0.05" contype="1" conaffinity="1"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="act_j1" joint="j1" kp="200"/>
    <position name="act_j2" joint="j2" kp="200"/>
    <position name="act_j3" joint="j3" kp="200"/>
    <position name="act_j4" joint="j4" kp="100"/>
    <position name="act_j5" joint="j5" kp="100"/>
    <position name="act_fl" joint="finger_l" kp="100"/>
    <position name="act_fr" joint="finger_r" kp="100"/>
  </actuator>
  
  <sensor>
    <touch name="touch_l" site="ee_site"/>
  </sensor>
</mujoco>
"""


class GraspingEnv(gym.Env):
    """Grasping environment with object randomization."""
    
    def __init__(self, reward_type="dense", curriculum_level=0):
        super().__init__()
        self.reward_type = reward_type
        self.curriculum_level = curriculum_level
        
        self.model = mujoco.MjModel.from_xml_string(GRASPING_XML)
        self.data = mujoco.MjData(self.model)
        
        # 5 joints + 1 gripper = 6 actions
        self.action_space = spaces.Box(-1, 1, shape=(6,), dtype=np.float32)
        
        # State: joints(10) + ee(3) + obj(7) + relative(3) + contacts(2)
        self.observation_space = spaces.Box(
            -np.inf, np.inf, shape=(25,), dtype=np.float64
        )
        
        self.reward_fn = GraspReward()
        self.max_steps = 150
        self.current_step = 0
        self.obj_init_height = 0.44
        
    def _check_contact(self):
        """Check if fingers are touching the object."""
        left_contact = False
        right_contact = False
        for i in range(self.data.ncon):
            c = self.data.contact[i]
            geom1 = self.model.geom(c.geom1).name
            geom2 = self.model.geom(c.geom2).name
            pair = {geom1, geom2}
            if "obj_geom" in pair:
                if any("left" in g or "finger" in g for g in pair - {"obj_geom"}):
                    left_contact = True
                if any("right" in g for g in pair - {"obj_geom"}):
                    right_contact = True
        return left_contact and right_contact
        
    def _get_obs(self):
        joint_pos = self.data.qpos[:5]
        joint_vel = self.data.qvel[:5]
        ee_pos = self.data.site_xpos[0]
        obj_pos = self.data.qpos[7:10]
        obj_quat = self.data.qpos[10:14]
        rel = obj_pos - ee_pos
        contacts = np.array([float(self._check_contact()), 
                           self.data.qpos[5] + self.data.qpos[6]])
        
        return np.concatenate([
            joint_pos, joint_vel, ee_pos, obj_pos, obj_quat, rel, contacts
        ])
    
    def _randomize_object(self):
        """Randomize object based on curriculum level."""
        if self.curriculum_level == 0:
            # Easy: cube at center, minimal variation
            x = 0.5 + np.random.uniform(-0.05, 0.05)
            y = np.random.uniform(-0.05, 0.05)
        elif self.curriculum_level == 1:
            # Medium: wider spread
            x = 0.5 + np.random.uniform(-0.15, 0.15)
            y = np.random.uniform(-0.15, 0.15)
        else:
            # Hard: wide spread + random friction
            x = 0.5 + np.random.uniform(-0.2, 0.2)
            y = np.random.uniform(-0.2, 0.2)
            # Randomize friction
            geom_id = mujoco.mj_name2id(self.model, mujoco.mjtObj.mjOBJ_GEOM, "obj_geom")
            self.model.geom_friction[geom_id] = [
                np.random.uniform(0.3, 2.0),
                np.random.uniform(0.01, 0.5),
                np.random.uniform(0.001, 0.01)
            ]
        
        self.data.qpos[7] = x
        self.data.qpos[8] = y
        self.data.qpos[9] = 0.44
        self.obj_init_height = 0.44
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)
        self._randomize_object()
        mujoco.mj_forward(self.model, self.data)
        self.current_step = 0
        return self._get_obs(), {}
    
    def step(self, action):
        # Apply scaled actions
        joint_delta = action[:5] * 0.05
        gripper = (action[5] + 1) / 2 * 0.04
        
        self.data.ctrl[:5] = self.data.qpos[:5] + joint_delta
        self.data.ctrl[5] = gripper
        self.data.ctrl[6] = gripper
        
        for _ in range(10):
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        ee_pos = obs[10:13]
        obj_pos = obs[13:16]
        obj_vel = self.data.qvel[7:13]
        contact = self._check_contact()
        
        reward, info = self.reward_fn.compute(
            ee_pos, obj_pos, self.obj_init_height,
            contact, obj_vel, action
        )
        
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        
        return obs, reward, terminated, False, info

Curriculum Learning for Grasping

Curriculum learning is an "easy to hard" strategy — start with simple objects and favorable positions, then gradually increase difficulty. This is a key technique for achieving high success rates.

Curriculum learning progression

Designing the Curriculum

class GraspingCurriculum:
    """Automatic curriculum for grasping training."""
    
    def __init__(self, success_threshold=0.7, window_size=100):
        self.level = 0
        self.max_level = 4
        self.success_threshold = success_threshold
        self.success_history = []
        self.window_size = window_size
        
        self.levels = {
            0: {
                'name': 'Cube - Center',
                'objects': ['cube'],
                'position_range': 0.05,
                'size_range': (0.025, 0.025),
                'friction_range': (1.0, 1.0),
            },
            1: {
                'name': 'Cube - Wide Area',
                'objects': ['cube'],
                'position_range': 0.15,
                'size_range': (0.02, 0.035),
                'friction_range': (0.5, 1.5),
            },
            2: {
                'name': 'Multi-Shape - Easy',
                'objects': ['cube', 'cylinder', 'sphere'],
                'position_range': 0.15,
                'size_range': (0.02, 0.04),
                'friction_range': (0.5, 1.5),
            },
            3: {
                'name': 'Multi-Shape - Hard',
                'objects': ['cube', 'cylinder', 'sphere', 'ellipsoid'],
                'position_range': 0.20,
                'size_range': (0.015, 0.045),
                'friction_range': (0.3, 2.0),
            },
            4: {
                'name': 'Adversarial',
                'objects': ['cube', 'cylinder', 'sphere', 'ellipsoid', 'mesh'],
                'position_range': 0.25,
                'size_range': (0.01, 0.05),
                'friction_range': (0.1, 2.5),
            },
        }
    
    def update(self, success: bool):
        """Update after each episode."""
        self.success_history.append(float(success))
        if len(self.success_history) > self.window_size:
            self.success_history.pop(0)
        
        if len(self.success_history) >= self.window_size:
            rate = np.mean(self.success_history)
            if rate >= self.success_threshold and self.level < self.max_level:
                self.level += 1
                self.success_history.clear()
                print(f"Curriculum advanced to Level {self.level}: "
                      f"{self.levels[self.level]['name']}")
    
    def get_config(self):
        """Return config for current level."""
        return self.levels[self.level]

Complete Training Pipeline

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback

class CurriculumCallback(BaseCallback):
    """Callback to automatically advance curriculum level."""
    
    def __init__(self, curriculum, verbose=0):
        super().__init__(verbose)
        self.curriculum = curriculum
    
    def _on_step(self):
        for info in self.locals.get('infos', []):
            if 'success' in info:
                self.curriculum.update(info['success'])
                if hasattr(self.training_env, 'envs'):
                    for env in self.training_env.envs:
                        env.curriculum_level = self.curriculum.level
        return True

# ---- Main Training ----
curriculum = GraspingCurriculum(success_threshold=0.7, window_size=100)
env = GraspingEnv(reward_type="dense", curriculum_level=0)

# SAC typically outperforms PPO for manipulation
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    learning_starts=1000,
    verbose=1,
    tensorboard_log="./grasp_logs/"
)

curriculum_cb = CurriculumCallback(curriculum)
model.learn(total_timesteps=2_000_000, callback=curriculum_cb)
model.save("grasping_sac_curriculum")

PPO vs SAC for Grasping

Criterion	PPO	SAC
Sample Efficiency	Low (~3M steps)	High (~1M steps)
Stability	Very stable	Occasional divergence
Final Performance	60-70% success	75-85% success
Parallelization	Excellent (many envs)	Limited (replay buffer)
Continuous Actions	OK	Excellent
Hyperparameter Sensitivity	Low	Medium

Evaluation Metrics

Evaluating a grasping policy requires more than just success rate:

def evaluate_grasping(model, env, n_episodes=100):
    """Comprehensive evaluation of grasping policy."""
    metrics = {
        'success_rate': 0,
        'grasp_success': 0,    # Grasped but not lifted
        'lift_success': 0,     # Successfully lifted
        'hold_success': 0,     # Held stably for 2 seconds
        'avg_grasp_time': [],  # Time to grasp
        'avg_lift_height': [], # Lift height
        'drop_rate': 0,        # Drop rate after grasping
    }
    
    for ep in range(n_episodes):
        obs, _ = env.reset()
        grasped = False
        lifted = False
        
        for step in range(env.max_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _, info = env.step(action)
            
            if info.get('contact', False) and not grasped:
                grasped = True
                metrics['grasp_success'] += 1
                metrics['avg_grasp_time'].append(step)
            
            if info.get('lift', 0) > 0.5 and not lifted:
                lifted = True
                metrics['lift_success'] += 1
            
            if grasped and not info.get('contact', False):
                metrics['drop_rate'] += 1
            
            if done:
                if info.get('success', False):
                    metrics['success_rate'] += 1
                    metrics['hold_success'] += 1
                break
    
    for key in ['success_rate', 'grasp_success', 'lift_success', 
                'hold_success', 'drop_rate']:
        metrics[key] /= n_episodes
    
    metrics['avg_grasp_time'] = (np.mean(metrics['avg_grasp_time']) 
                                  if metrics['avg_grasp_time'] else float('inf'))
    
    return metrics

Typical results after 2M timesteps with SAC + curriculum:

Metric	Level 0 (Cube)	Level 2 (Multi-shape)	Level 4 (Adversarial)
Success Rate	92%	78%	61%
Grasp Success	97%	89%	75%
Drop Rate	3%	8%	15%
Avg Grasp Time	25 steps	35 steps	48 steps

References

DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset — Wang et al., 2023
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation — Kalashnikov et al., 2018
Learning Dexterous In-Hand Manipulation — OpenAI, 2018

Grasping with RL: Stable Grasp & Object Variety

Types of Grasping

Reward Design for Grasping

Multi-Phase Reward Function

Phase-by-Phase Analysis

Environment for Grasping Training

MuJoCo Grasping Environment

Curriculum Learning for Grasping

Designing the Curriculum

Complete Training Pipeline

PPO vs SAC for Grasping

Evaluation Metrics

References

Next in the Series

Nguyễn Anh Tuấn

Related Posts

RL cho Manipulation: MDP, Reward Design và Environment Setup

Multi-Step Manipulation: Curriculum Learning cho Long-Horizon

Tool Use: Robot học sử dụng dụng cụ bằng RL

Grasping with RL: Stable Grasp & Object Variety

Types of Grasping

Reward Design for Grasping

Multi-Phase Reward Function

Phase-by-Phase Analysis

Environment for Grasping Training

MuJoCo Grasping Environment

Curriculum Learning for Grasping

Designing the Curriculum

Complete Training Pipeline

PPO vs SAC for Grasping

Evaluation Metrics

References

Next in the Series

Nguyễn Anh Tuấn

Related Posts

RL cho Manipulation: MDP, Reward Design và Environment Setup

Multi-Step Manipulation: Curriculum Learning cho Long-Horizon

Tool Use: Robot học sử dụng dụng cụ bằng RL