Why is Grasping the Hardest Problem in Manipulation?
For humans, picking up a coffee cup is unconscious -- eyes see, hand reaches, fingers close. For robots, the same act demands a tight interplay of perception, planning, and control.
A robot arm must answer three questions before grasping: (1) where on the object should it make contact? (2) which gripper orientation is stable? (3) how much force holds the object without crushing it? Each is a research problem in its own right, and solution methods have evolved from analytical (geometry and physics calculations) to learning-based (learning from data) over the past two decades.
This is Part 1 of the Robot Manipulation Masterclass series -- I'll go from classical theory to the latest deep learning models, with comparison tables to help you choose the right approach for your project.
Analytical Grasping: Physical Foundations
Force Closure -- Necessary and Sufficient Condition
The most important concept in analytical grasping is force closure: a grasp has force closure when its contact forces can resist any external wrench (force + moment) applied to the object.
Simple analogy: when you hold a ball between two fingers (thumb and index), friction at the two contact points creates a wrench space large enough to hold the ball steady without slipping, even if you tilt your hand or someone nudges it.
Mathematical condition: a grasp G with n contact points has force closure if and only if the convex hull of its primitive contact wrenches contains the origin of the 6D wrench space (3 force + 3 torque components) in its interior.
Steps to compute:
- Define the contact model: frictionless point contact, point contact with friction (PCWF, "hard finger"), or soft finger
- Compute friction cones: at each contact, the friction coefficient mu defines a cone of allowable force directions around the surface normal
- Map to wrench space: each contact force f applied at position p becomes a wrench [f; p × f]
- Check force closure: does the convex hull of all primitive wrenches contain the origin?
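The four steps above can be sketched in a few lines of NumPy and SciPy. Everything here is illustrative, not from any particular library: the contact positions and inward-pointing normals would come from your object model, and the 8-edge cone discretization is a common but arbitrary choice.

```python
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def friction_cone_edges(normal, mu, num_edges=8):
    """Discretize the friction cone at one contact into unit force directions."""
    n = normal / np.linalg.norm(normal)
    # Build a tangent basis orthogonal to the contact normal
    t1 = np.cross(n, [1.0, 0.0, 0.0])
    if np.linalg.norm(t1) < 1e-8:            # normal parallel to the x-axis
        t1 = np.cross(n, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    angles = np.linspace(0.0, 2.0 * np.pi, num_edges, endpoint=False)
    edges = n + mu * (np.cos(angles)[:, None] * t1 + np.sin(angles)[:, None] * t2)
    return edges / np.linalg.norm(edges, axis=1, keepdims=True)

def has_force_closure(positions, normals, mu=0.5):
    """Force closure test: does the convex hull of the primitive
    wrenches contain the wrench-space origin in its interior?"""
    wrenches = []
    for p, n in zip(positions, normals):
        for f in friction_cone_edges(np.asarray(n, float), mu):
            wrenches.append(np.hstack([f, np.cross(p, f)]))  # wrench = [force; torque]
    try:
        hull = ConvexHull(np.array(wrenches))
    except QhullError:
        return False  # wrenches don't span 6D, so the origin can't be interior
    # Qhull facets satisfy normal . x + offset <= 0 for interior points,
    # so the origin is strictly inside iff every offset is negative.
    return bool(np.all(hull.equations[:, -1] < -1e-12))
```

For example, six frictional contacts along the coordinate axes of a unit sphere pass the test, while two antipodal hard-finger contacts fail it (they cannot resist torque about the axis through both contacts).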
GraspIt! -- Classical Grasp Simulator
GraspIt! is a classical grasp-planning simulator developed at Columbia University since 2004 and still widely used for analytical grasping research.
GraspIt! allows:
- Load robot hand models (Barrett, Shadow, Allegro) and object meshes
- Auto-search grasp poses via eigengrasp planner or simulated annealing
- Calculate grasp quality metrics: the epsilon metric (the smallest disturbance wrench that breaks the grasp, i.e. its worst-case resistance) and the volume metric (total volume of the grasp wrench space)
```shell
# Install GraspIt! (Ubuntu)
sudo apt-get install libqt5-dev libsoqt520-dev libcoin-dev
git clone https://github.com/graspit-simulator/graspit.git
cd graspit && mkdir build && cd build
cmake .. && make -j$(nproc)
```
Limitations of the analytical approach: it needs accurate 3D object models and precise friction coefficients, and it doesn't scale to unknown objects. These weaknesses are what motivated learning-based methods.
Learning-Based Grasping: Learning from Data
PointNetGPD -- Direct Point Cloud
PointNetGPD (Liang et al., 2018) is one of the first works to use PointNet to evaluate grasp quality directly from point clouds.
Core idea: instead of complex geometric analysis, let a neural network learn from roughly 350K labeled grasps on the YCB object set. The input is the point cloud inside the gripper closing volume; the output is a grasp quality score.
```python
# PointNetGPD inference pipeline (simplified)
import torch
from pointnet_gpd import PointNetGPD

model = PointNetGPD(num_classes=2)
model.load_state_dict(torch.load("pointnetgpd_weights.pth"))
model.eval()

# 1. Sample antipodal grasp candidates from the point cloud
grasp_candidates = sample_antipodal_grasps(point_cloud, num_samples=200)

with torch.no_grad():
    for grasp in grasp_candidates:
        # 2. Crop the points inside the gripper closing volume
        local_points = crop_points_in_gripper(point_cloud, grasp)
        # 3. PointNet classifies the crop; use the positive-class
        #    probability as the grasp quality score
        logits = model(local_points)
        grasp.quality = torch.softmax(logits, dim=-1)[..., 1].item()

# 4. Select the grasp with the highest score
best_grasp = max(grasp_candidates, key=lambda g: g.quality)
```
Advantages: real-time, no object model needed, generalizes well to unseen objects.
Disadvantages: supports only parallel-jaw grippers and scores each candidate in isolation, without considering scene context (occlusion, clutter).
Contact-GraspNet -- 6-DoF in Clutter
Contact-GraspNet (Sundermeyer et al., 2021) solves PointNetGPD's limitations by generating 6-DoF grasps directly from scene point cloud, while accounting for clutter.
Key breakthroughs:
- Contact-based representation: each point in the point cloud is a potential grasp contact -- only need to predict 4-DoF (approach direction + grasp width) instead of full 6-DoF, which reduces learning complexity
- Trained on 17 million simulated grasps, generalizes well to real sensor data
- Achieves >90% success rate on unseen objects in structured clutter, roughly halving the failure rate of the prior state of the art
```python
# Contact-GraspNet inference (simplified)
from contact_graspnet import ContactGraspNet

model = ContactGraspNet.load_pretrained()

# Input: single-view depth image -> point cloud
point_cloud = depth_to_pointcloud(depth_image, camera_intrinsics)

# Output: set of 6-DoF grasps with confidence scores
grasps, scores, contact_points = model.predict(
    point_cloud,
    forward_passes=5,  # multiple passes for uncertainty estimation
)

# Filter by confidence, then pick the highest-scoring survivor
mask = scores > 0.5
valid_grasps, valid_scores = grasps[mask], scores[mask]
best_grasp = valid_grasps[valid_scores.argmax()]
```
Method Comparison
| Criterion | Analytical (GraspIt!) | PointNetGPD | Contact-GraspNet |
|---|---|---|---|
| Input | 3D mesh + friction | Point cloud (local) | Point cloud (scene) |
| Output | Grasp + quality metric | Grasp score | 6-DoF grasps + scores |
| Unknown objects | No (needs mesh) | Yes | Yes |
| Clutter handling | No | Limited | Good |
| Speed | Slow (optimization) | Real-time | ~0.5s/scene |
| Gripper type | Multi-finger | Parallel-jaw | Parallel-jaw |
| Training data | Not needed | 350K grasps | 17M grasps |
| Success rate (real) | ~70-80% (known objects) | ~85% | >90% |
| Best use case | Research, multi-finger | Quick prototype | Production clutter |
Grasp Quality Metrics
Regardless of method, you need to measure grasp quality. Here are the most common metrics:
Epsilon Metric (Force Closure Quality)
The epsilon metric is the radius of the largest ball, centered at the wrench-space origin, that fits inside the convex hull of the primitive wrenches. Epsilon > 0 means force closure; a larger epsilon means the grasp can resist larger disturbance wrenches.
```python
# Compute epsilon metric
import numpy as np
from scipy.spatial import ConvexHull

def epsilon_metric(wrenches):
    """
    wrenches: (N, 6) array of primitive wrenches
    Returns: epsilon value (> 0 means force closure)
    """
    hull = ConvexHull(wrenches)
    # Qhull facet equations satisfy normal . x + offset <= 0 for points
    # inside the hull, with unit-length normals, so -offset is the signed
    # distance from the origin to each facet plane.
    signed_distances = -hull.equations[:, -1]
    # If any distance is negative, the origin lies outside the hull and
    # the grasp has no force closure; otherwise epsilon is the distance
    # to the nearest facet (radius of the largest origin-centered ball).
    return signed_distances.min()
```
Grasp Success Rate (Empirical)
The most practical metric: run N grasp attempts and count successes. A grasp "succeeds" when the robot picks up the object, lifts it 10 cm, and holds it for 3 seconds without dropping it.
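Raw counts hide statistical uncertainty: 9/10 and 90/100 give the same rate but very different evidence. A small sketch that reports the rate with a Wilson score 95% confidence interval (the trial counts below are made up for illustration):

```python
import math

def success_rate_ci(successes, trials, z=1.96):
    """Empirical success rate with a Wilson score confidence interval
    (z = 1.96 corresponds to a 95% interval)."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return p, center - half, center + half

rate, lo, hi = success_rate_ci(successes=43, trials=50)
print(f"success rate {rate:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

Reporting the interval alongside the rate makes comparisons between grasp planners with different trial counts honest.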
Diversity and Coverage
Beyond quality, diversity also matters: a good grasp planner generates many candidates from different approach directions, giving the robot fallback options when the preferred grasp is blocked by obstacles.
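One cheap way to quantify that diversity (a rough heuristic of my own, not a standard metric) is to bin the candidates' approach directions into the eight sign octants and count how many are occupied:

```python
import numpy as np

def approach_coverage(approach_dirs):
    """Fraction of the 8 direction octants covered by at least one grasp.
    approach_dirs: (N, 3) array-like of approach vectors."""
    dirs = np.asarray(approach_dirs, dtype=float)
    # Bin each direction by the signs of its components (zero counts as +)
    octants = {tuple((d >= 0).astype(int)) for d in dirs}
    return len(octants) / 8.0
```

A planner whose 200 candidates all approach from straight above scores 1/8; one that also proposes side and angled grasps scores higher and leaves more fallback options.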
Analytical vs. Learning: When to Use What?
Choose Analytical when:
- Working with known objects that have accurate 3D models (e.g., assembly lines with fixed parts)
- Need interpretability -- explain why a grasp is good/bad (important for safety-critical applications)
- Using multi-finger hands (Shadow, Allegro) -- learning methods for multi-finger are still immature
- Need grasp quality guarantees (provable force closure)
Choose Learning-Based when:
- Need to generalize to unknown objects (warehouse, home environment)
- Environment has clutter (multiple overlapping objects)
- Only have partial observations (single-view depth camera)
- Need real-time performance (<1s per grasp)
- Using parallel-jaw gripper (most common in industry)
Hybrid Approach
The 2025-2026 trend is to combine both: use a learning model to quickly generate grasp candidates, then use analytical metrics to verify and rank them -- an approach groups like Google DeepMind and UC Berkeley have been exploring in their latest systems.
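The hybrid pattern itself is a few lines of glue code. In this sketch, `learned_score` and `analytic_quality` are placeholder callables (e.g. a network's confidence and an epsilon-metric evaluation); only candidates that pass the analytical check survive, ranked by the learned score:

```python
def hybrid_rank(candidates, learned_score, analytic_quality, min_quality=0.0):
    """Filter learned grasp candidates with an analytical quality check,
    then rank the survivors by the network's confidence score."""
    verified = [g for g in candidates if analytic_quality(g) > min_quality]
    return sorted(verified, key=learned_score, reverse=True)

# Toy usage: each candidate is a (confidence, epsilon) pair
cands = [(0.9, -0.1), (0.7, 0.05), (0.8, 0.02)]
ranked = hybrid_rank(cands,
                     learned_score=lambda g: g[0],
                     analytic_quality=lambda g: g[1])
# The highest-confidence grasp is dropped because it fails force closure
```

This keeps the learning model's speed and generalization while regaining some of the analytical method's interpretability.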
Hands-on: Running Contact-GraspNet
If you want to try it right away, here's the fastest setup:
```shell
# Clone repo
git clone https://github.com/NVlabs/contact_graspnet.git
cd contact_graspnet

# Install dependencies (Python 3.8+, CUDA 11.x)
pip install -r requirements.txt

# Download pre-trained weights
bash download_weights.sh

# Run inference on sample depth image
python contact_graspnet/inference.py \
    --np_path=test_data/scene_0.npy \
    --forward_passes=5 \
    --z_range=[0.2,1.2]
```
Output is a set of 6-DoF grasps visualized on the point cloud. From there you can integrate with a robot arm via ROS 2 or directly through inverse kinematics.
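Each predicted grasp comes out as a 4x4 homogeneous pose matrix, while most IK solvers and ROS 2 pose messages expect a position plus quaternion. A small conversion sketch using SciPy (assuming the matrix contains a well-formed rotation):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def matrix_to_pos_quat(T):
    """Split a 4x4 grasp pose into (xyz position, xyzw quaternion)."""
    T = np.asarray(T, dtype=float)
    position = T[:3, 3]
    quat = Rotation.from_matrix(T[:3, :3]).as_quat()  # [x, y, z, w]
    return position, quat
```

From there, feed the position and quaternion to your IK solver or publish them as a `geometry_msgs/PoseStamped`.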
Resources
- GraspIt! docs: https://graspit-simulator.github.io/
- Contact-GraspNet paper: arXiv:2103.14127
- PointNetGPD paper: arXiv:1809.06267
- Grasp quality metrics: Ferrari & Canny, "Planning Optimal Grasps" (1992) -- the foundational paper on the epsilon and volume force-closure metrics
Next in Series
This is Part 1 of Robot Manipulation Masterclass. Coming up:
- Part 2: Imitation Learning for Manipulation: BC, DAgger, ACT -- Teaching robots manipulation from demonstrations
- Part 3: Diffusion Policy in Practice: From Theory to Code -- State-of-the-art policy learning
Related Posts
- Imitation Learning for Manipulation: BC, DAgger, ACT -- Part 2 of this series
- Tactile Sensing for Manipulation -- How tactile sensors improve grasping precision
- Foundation Models for Robots: RT-2, Octo, OpenVLA -- VLA models that can grasp zero-shot
- Inverse Kinematics for 6-DOF Robots -- IK needed to execute grasp poses