
Dense Models in Robotics: From DON to DenseMatcher

Explore Dense Object Nets, DenseFusion, and DenseMatcher — three dense visual descriptor technologies revolutionizing how robots see and grasp objects.

By Nguyễn Anh Tuấn | April 22, 2026 | 12 min read

Imagine asking a robot to grab the right ear of a caterpillar plush toy. Not the left ear, not the body: specifically the right ear. And the robot has never seen this particular toy before.

Sound like science fiction? It isn't. MIT CSAIL researchers demonstrated this in 2018 using dense visual descriptors, the idea at the heart of what this article calls dense models in robotics.

This article takes you from theoretical foundations to working code, covering the three most important works in dense visual representation for robots: Dense Object Nets (DON), DenseFusion, and DenseMatcher.


What Does "Dense" Mean — And Why Does It Matter?

Before diving into specifics, let's clarify what "dense" means in this context.

In computer vision, there are two main approaches:

Sparse: Extract a small number of salient feature points from an image — corners, edges, or keypoints like SIFT/ORB. Fast, but throws away a lot of information.

Dense: Compute a feature vector for every pixel in the image. Slower and more resource-intensive, but extraordinarily rich in information.

Sparse: [  *   ,    *  ,  *   ,     *    ]
         keypoint keypoint ...

Dense:  [p₁, p₂, p₃, ..., pₙ]  ← every pixel has a feature vector

For robots, dense representation unlocks capabilities that sparse approaches simply cannot achieve:

  • Identify specific points on objects: "Grab here, not there"
  • Generalize to unseen objects: Learn from one shoe → grasp any shoe
  • Robust to partial occlusion: Still works when the object is partially hidden
  • No CAD model required: No engineering drawings needed

Dense Object Nets (DON) — The Foundation of Everything

Paper: Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation
Authors: Peter R. Florence, Lucas Manuelli, Russ Tedrake — MIT CSAIL
Conference: CoRL 2018 (Best Paper Award)
GitHub: RobotLocomotion/pytorch-dense-correspondence

The Core Idea

DON learns a mapping function: every pixel (u, v) in an image → a feature vector D ∈ ℝᵈ (typically d = 3 or d = 16).

These vectors are learned such that:

  • The same point on an object → nearby vectors in feature space
  • Different points → distant vectors

The magic: these vectors are viewpoint-invariant. The right ear of the caterpillar toy always maps to the same feature vector, whether viewed from the front, side, or top.

DON Architecture

RGB Image (H×W×3)
      │
      ▼
  FCN Backbone (ResNet-34 or VGG-16)
      │
      ▼
  Dense Feature Map (H×W×D)
  ← each pixel has a D-dimensional descriptor

DON uses a Fully Convolutional Network (FCN) — no fully connected layers — so the output retains the same spatial dimensions as the input image.

Training: Self-Supervised with RGB-D

The most impressive aspect of DON is that it is entirely self-supervised — no manual labels required. Training data is generated automatically from an RGB-D camera:

1. Capture many images of the object from multiple angles
2. Use depth + camera pose to back-project pixels into 3D points in a shared world frame
3. Auto-generate correspondences: pixel A in image 1 ↔ pixel B in image 2
   (same 3D point on the object)
4. Train with contrastive loss
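
The geometry behind steps 2 and 3 can be sketched with a pinhole camera model. This is a minimal illustration, not the DON data pipeline: the intrinsics `K`, the two camera poses, and the helper names `pixel_to_world` and `world_to_pixel` are all assumptions made up for the example.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, T_world_cam):
    """Back-project pixel (u, v) with metric depth into world coordinates."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])
    return (T_world_cam @ p_cam)[:3]

def world_to_pixel(p_world, K, T_world_cam):
    """Project a world point into a camera; returns (u, v, depth in that camera)."""
    p_cam = np.linalg.inv(T_world_cam) @ np.append(p_world, 1.0)
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    return u, v, p_cam[2]

# Hypothetical intrinsics and camera poses (camera-to-world transforms)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_world_camA = np.eye(4)
T_world_camB = np.eye(4)
T_world_camB[0, 3] = 0.1  # camera B sits 10 cm along +x from camera A

# A pixel in image A with 1 m measured depth
u_a, v_a, d_a = 320.0, 240.0, 1.0
p_world = pixel_to_world(u_a, v_a, d_a, K, T_world_camA)

# The same 3D point seen from camera B gives the matching pixel
u_b, v_b, d_b = world_to_pixel(p_world, K, T_world_camB)
print(f"({u_a}, {v_a}) in A <-> ({u_b}, {v_b}) in B")
```

A real pipeline would also compare `d_b` with the depth image of view B at that pixel and discard the pair if they disagree, since that indicates the point is occluded in view B.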

Contrastive Loss:

# Match loss: matching pair descriptors should be close
L_match = ||D(u_a) - D(u_b)||²

# Non-match loss: non-matching pair descriptors should be further than margin M
L_non_match = max(0, M - ||D(u_a) - D(u_c)||)²

# Total loss
L = L_match + α * L_non_match
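
The loss above can be written out in NumPy as follows. This is a simplified sketch under stated assumptions: it averages over all sampled pairs, the margin and `alpha` values are arbitrary, and the function name `pixelwise_contrastive_loss` is chosen for this example rather than taken from the DON code.

```python
import numpy as np

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b,
                               margin=0.5, alpha=1.0):
    """desc_*: (H, W, D) descriptor maps; index arrays are (N, 2) as (row, col)."""
    # Matching pairs: squared descriptor distance should go to zero
    da = desc_a[matches_a[:, 0], matches_a[:, 1]]
    db = desc_b[matches_b[:, 0], matches_b[:, 1]]
    match_loss = np.mean(np.sum((da - db) ** 2, axis=1))

    # Non-matching pairs: penalize only those closer than the margin
    na = desc_a[non_matches_a[:, 0], non_matches_a[:, 1]]
    nb = desc_b[non_matches_b[:, 0], non_matches_b[:, 1]]
    dist = np.linalg.norm(na - nb, axis=1)
    non_match_loss = np.mean(np.maximum(0.0, margin - dist) ** 2)

    return match_loss + alpha * non_match_loss, match_loss, non_match_loss

# Tiny 2x2 descriptor maps with D = 3
desc_a = np.zeros((2, 2, 3))
desc_b = np.zeros((2, 2, 3))
desc_b[1, 1] = [1.0, 0.0, 0.0]

matches_a = np.array([[0, 0]])
matches_b = np.array([[0, 0]])
non_matches_a = np.array([[0, 0], [0, 0]])
non_matches_b = np.array([[1, 1], [0, 1]])  # first pair is far apart, second is not

total, l_match, l_non_match = pixelwise_contrastive_loss(
    desc_a, desc_b, matches_a, matches_b, non_matches_a, non_matches_b)
print(total, l_match, l_non_match)
```

In this toy case only the second non-match pair contributes: its descriptors are identical, so it sits inside the margin and is pushed apart, while the first pair is already farther than `margin` and costs nothing.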

Practical Application: Point-Specific Robot Grasping

import torch
import numpy as np
from dense_correspondence.network.dense_correspondence_network import DenseCorrespondenceNetwork

# Load trained model
dcn = DenseCorrespondenceNetwork.from_model_folder('path/to/trained_model')
dcn.eval()

# Reference image: user points to the grasp location
# (load_image and get_camera_frame are placeholder helpers, not library functions)
img_ref = load_image('caterpillar_reference.png')  # H×W×3
target_row, target_col = 120, 85  # (row, col) = (y, x) of the pixel picked by the user

# Get descriptor for the target point
with torch.no_grad():
    descriptor_map_ref = dcn.forward_single_image(img_ref)  # H×W×D
    target_descriptor = descriptor_map_ref[target_row, target_col]  # D-dim vector

# Live image from robot camera
img_live = get_camera_frame()  # new image of the object (different position/angle)

# Find corresponding point in live image
with torch.no_grad():
    descriptor_map_live = dcn.forward_single_image(img_live)  # H×W×D

    # L2 distance between the target descriptor and every pixel's descriptor
    diff = descriptor_map_live - target_descriptor  # broadcasts over H×W
    dist = torch.norm(diff, dim=-1)  # H×W

    # Best match: flat argmin converted back to (row, col) using the map width
    best_flat = torch.argmin(dist).item()
    best_y, best_x = divmod(best_flat, dist.shape[1])

print(f"Grasp point in live image: (x={best_x}, y={best_y})")

DON Results

MIT tested DON with a Kuka LBR iiwa robot arm:

  • Grasping accuracy: 87.4% success rate at the designated point
  • Cross-instance: Learned from one shoe → grasped a different, unseen shoe
  • Non-rigid objects: Works with fabric, bags, and soft objects

Figure: Dense visual descriptor map. Each color represents a semantic location on the object.


DenseFusion — 6D Pose Estimation with RGB-D Fusion

Paper: DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion
Authors: Chen Wang, Danfei Xu, Yuke Zhu, et al. — Stanford University
Conference: CVPR 2019
GitHub: j96w/DenseFusion

The Problem: What Is 6D Pose Estimation?

For a robot to grasp an object, it needs to know exactly:

  • Translation (3D): Where is the object in space (x, y, z)
  • Rotation (3D): Which direction is the object oriented (roll, pitch, yaw)

Six degrees of freedom total — called 6D pose.
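
A 6D pose is commonly packed into a single 4×4 homogeneous transform. The sketch below assumes the convention R = Rz(yaw) · Ry(pitch) · Rx(roll) with angles in radians; other Euler orders are equally valid, and `pose_to_matrix` is a name made up for this example.

```python
import numpy as np

def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 transform from translation and roll/pitch/yaw (radians)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # rotation part (3 degrees of freedom)
    T[:3, 3] = [x, y, z]       # translation part (3 degrees of freedom)
    return T

# Object 0.5 m in front along x, 0.2 m up, rotated 90 degrees about z
T_cam_obj = pose_to_matrix(0.5, 0.0, 0.2, 0.0, 0.0, np.pi / 2)

# A point 10 cm along the object's own x axis, expressed in the camera frame
p_obj = np.array([0.1, 0.0, 0.0, 1.0])
p_cam = T_cam_obj @ p_obj
print(p_cam[:3])
```

With the pose in this form, any point defined on the object (a grasp point, a handle, a corner) can be moved into the camera or robot frame with one matrix multiplication.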

Why Is DON Not Enough?

DON is excellent at finding correspondences, but it outputs per-pixel descriptors rather than a rigid transform. For precise 6D pose, we need:

  1. Combined information from color (RGB) and depth
  2. Dense pixel-level processing rather than sparse keypoints

DenseFusion addresses this by fusing dense features from both RGB and point cloud.

DenseFusion Architecture

RGB Image ──────────────────────► RGB Feature Extractor (PSPNet)
      │                                    │
      │                                    ▼ per-pixel color features
      │
Depth Image → Point Cloud ───────► PointNet Feature Extractor
                                           │
                                           ▼ per-point geometry features
                                    
                          [Fusion Layer: RGB + Geometry per point]
                                           │
                                           ▼
                             Pose Estimation Head
                             → (R, t) per object instance
                             → Confidence score per prediction

The key insight: DenseFusion does not fuse at the global level — it fuses at each point in the point cloud with the corresponding pixel in the RGB image. This preserves extremely detailed spatial information.
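
The per-point association can be illustrated with plain array indexing. This is a shape-level sketch only: random arrays stand in for the outputs of the two feature extractors, the feature widths are arbitrary, and the variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 5            # image size (tiny, for illustration)
C_RGB, C_GEO = 8, 6    # feature widths, chosen arbitrarily
N = 10                 # number of sampled points inside the object mask

# Stand-ins for network outputs (random values instead of real features)
color_feat_map = rng.normal(size=(H, W, C_RGB))   # per-pixel color features
geo_feat = rng.normal(size=(N, C_GEO))            # per-point geometry features

# Flat pixel index that each sampled 3D point was back-projected from
pixel_index = rng.choice(H * W, size=N, replace=False)

# Pixel-to-point association: look up the color feature of each point's pixel
color_feat = color_feat_map.reshape(H * W, C_RGB)[pixel_index]   # (N, C_RGB)

# Per-point fusion: concatenate color and geometry features point by point
fused = np.concatenate([color_feat, geo_feat], axis=1)           # (N, C_RGB + C_GEO)

# Each fused point would yield one pose hypothesis plus a confidence;
# the final pose is the hypothesis with the highest confidence.
confidence = rng.uniform(size=N)          # stand-in for predicted confidences
best_point = int(np.argmax(confidence))
print(fused.shape, best_point)
```

Because every row of `fused` pairs one 3D point with the appearance of exactly the pixel it came from, an occluded or mis-segmented region only corrupts its own rows instead of a single global feature.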

Setup and Running DenseFusion

# Install dependencies
conda create -n densefusion python=3.7
conda activate densefusion
pip install torch==1.7.1 torchvision==0.8.2
pip install scipy==1.2.0 opencv-python transforms3d

# Clone repo
git clone https://github.com/j96w/DenseFusion.git
cd DenseFusion

# Training on YCB-Video dataset
python tools/train.py \
  --dataset ycb \
  --dataset_root path/to/YCB_Video_Dataset \
  --batch_size 8 \
  --workers 10 \
  --lr 0.0001 \
  --start_epoch 0

Iterative Refinement — The Critical Second Step

DenseFusion doesn't just predict pose once — it has an Iterative Refinement step:

# Step 1: Initial pose prediction
initial_pose = densefusion_model(rgb, depth, roi)

# Step 2: Iterative refinement
current_pose = initial_pose
for i in range(num_iterations):
    # Use current pose to transform point cloud
    transformed_cloud = transform_point_cloud(depth_cloud, current_pose)
    
    # Predict refinement (delta R, delta t)
    delta = refinement_model(rgb, transformed_cloud, roi)
    
    # Update pose
    current_pose = compose_pose(current_pose, delta)

final_pose = current_pose

Each iteration improves the pose estimate incrementally — like fine-tuning the fit of an object until it locks perfectly into place.
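
The update rule in the loop above can be made concrete with a toy example. This sketch only shows how residual poses are composed; `toy_refiner` is a hypothetical stand-in for the learned refinement network and simply corrects half of the remaining centroid offset at each step.

```python
import numpy as np

def compose_pose(R, t, dR, dt):
    """Apply a residual pose (dR, dt) on top of the current pose (R, t)."""
    return dR @ R, dR @ t + dt

def toy_refiner(cloud_est, cloud_obs):
    """Hypothetical refiner: identity rotation, half of the centroid offset."""
    dt = 0.5 * (cloud_obs.mean(axis=0) - cloud_est.mean(axis=0))
    return np.eye(3), dt

rng = np.random.default_rng(0)
model_points = rng.normal(size=(50, 3))             # points on the object model
R_true, t_true = np.eye(3), np.array([0.3, -0.1, 0.8])
cloud_obs = model_points @ R_true.T + t_true         # what the camera observes

R, t = np.eye(3), np.zeros(3)                        # deliberately poor initial pose
errors = []
for _ in range(5):
    cloud_est = model_points @ R.T + t               # model under the current pose
    errors.append(np.linalg.norm(t_true - t))
    dR, dt = toy_refiner(cloud_est, cloud_obs)       # predict a residual pose
    R, t = compose_pose(R, t, dR, dt)                # fold it into the estimate

print([round(e, 4) for e in errors], round(np.linalg.norm(t_true - t), 4))
```

The translation error halves at every step in this toy setup. A learned refiner is not guaranteed to behave this regularly, but the composition step stays the same.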

DenseFusion Results

Tested on YCB-Video and LineMOD datasets:

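Metric                 | DenseFusion (per-pixel, no refinement) | DenseFusion (iterative refinement)
-----------------------+----------------------------------------+-----------------------------------
ADD-S AUC on YCB-Video | 91.2%                                  | 93.1%
ADD(-S) on LineMOD     | 86.2%                                  | 94.3%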

The paper reports that DenseFusion outperforms PoseCNN with ICP refinement by 3.5% in pose accuracy on YCB-Video while running roughly 200× faster at inference.
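
For readers unfamiliar with the metrics, ADD averages the distance between model points under the ground-truth and predicted poses, and ADD-S replaces the one-to-one pairing with a closest-point distance so that symmetric objects are not penalized. The sketch below is a plain NumPy rendering of those definitions; the function names are chosen for this example.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return np.mean(np.linalg.norm(gt - pred, axis=1))

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (N, N)
    return np.mean(pairwise.min(axis=1))

# A shape with 4-fold symmetry about the z axis
points = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
R_gt, t_gt = np.eye(3), np.zeros(3)

# Prediction is off by exactly 90 degrees about z
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
R_pred = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_pred = np.zeros(3)

add = add_metric(points, R_gt, t_gt, R_pred, t_pred)
add_s = add_s_metric(points, R_gt, t_gt, R_pred, t_pred)
print(add, add_s)
```

The 90 degree error maps the symmetric shape onto itself, so ADD reports a large error while ADD-S is essentially zero. This is why the symmetric variant is used for symmetric objects.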


DenseMatcher — Generalizing from a Single Demo

Paper: DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from a Single Demo
Authors: Junzhe Zhu et al. — TEA Lab
Conference: ICLR 2025
GitHub: TEA-Lab/DenseMatcher
Project: https://tea-lab.github.io/DenseMatcher/

The New Problem: Category-Level Generalization

DenseFusion works at the instance level: it estimates the pose of specific objects it was trained on. DON can generalize across instances, but only within a class it has collected training data for. If we want a robot to learn from one demo and apply it to any mug, regardless of shape or size, we need category-level generalization from a single example, which is a much harder problem.

DenseMatcher solves this by learning dense 3D correspondence between objects within the same category.

Figure: DenseMatcher learns 3D correspondence from multiview 2D features and functional maps.

DenseMatcher Architecture

Object Mesh A                          Object Mesh B
     │                                      │
     ▼                                      ▼
Multi-view Rendering                 Multi-view Rendering
(many viewpoints)                    (many viewpoints)
     │                                      │
     ▼                                      ▼
2D Foundation Model                  2D Foundation Model
(DINO / Stable Diffusion)            (DINO / Stable Diffusion)
     │                                      │
     ▼                                      ▼
Project features → Mesh Vertices     Project features → Mesh Vertices
     │                                      │
     ▼                                      ▼
3D GNN Refinement                    3D GNN Refinement
     │                                      │
     └──────────────────┬─────────────────┘
                        ▼
              Functional Map Layer
              (find correspondence function)
                        │
                        ▼
              Dense 3D Correspondence
              (every vertex A ↔ vertex B)

Step 1 — Project 2D features onto 3D mesh:
Render the object from multiple views (16–32), extract 2D features with DINO or Stable Diffusion, then project back onto mesh vertices via weighted averaging.

Step 2 — 3D GNN refinement:
A Graph Neural Network runs over the mesh graph to smooth and enforce consistency in features, leveraging the 3D structure of the object.

Step 3 — Functional Maps:
Instead of computing correspondences directly (extremely expensive for large meshes), DenseMatcher uses Functional Maps — representing correspondence as a mapping matrix between the spectral spaces of two meshes, then converting back to point-to-point correspondences.
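
The functional map idea in Step 3 can be tried on a toy problem. The sketch below is a generic least-squares functional map, not DenseMatcher's layer: it uses graph Laplacian eigenvectors of a random weighted graph as the spectral basis, treats shape B as a relabelled copy of shape A, and all function names are assumptions for this example. On real meshes the basis would come from a mesh Laplacian with its mass matrix.

```python
import numpy as np

def laplacian_basis(weights, k):
    """First k eigenvectors (as columns) of the graph Laplacian L = D - W."""
    laplacian = np.diag(weights.sum(axis=1)) - weights
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues ascending, columns orthonormal
    return vecs[:, :k]

def functional_map(phi_src, phi_dst, feat_src, feat_dst):
    """Least-squares C such that C @ coeffs(feat_src) ≈ coeffs(feat_dst)."""
    src = phi_src.T @ feat_src            # (k, d) spectral coefficients
    dst = phi_dst.T @ feat_dst
    c_t, *_ = np.linalg.lstsq(src.T, dst.T, rcond=None)   # solves src.T @ C.T = dst.T
    return c_t.T

def nearest_rows(query, ref):
    """Index of the closest row of ref for every row of query."""
    d2 = ((query[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy "meshes": shape B is shape A with its vertices relabelled by perm
rng = np.random.default_rng(0)
n, k, d = 30, 8, 12
w_a = np.triu(rng.uniform(0.1, 1.0, size=(n, n)), 1)
w_a = w_a + w_a.T
perm = rng.permutation(n)                 # vertex j of B is vertex perm[j] of A
w_b = w_a[np.ix_(perm, perm)]

phi_a, phi_b = laplacian_basis(w_a, k), laplacian_basis(w_b, k)
feat_a = phi_a @ rng.normal(size=(k, d))  # smooth per-vertex features on A
feat_b = feat_a[perm]                     # the same features as seen on B

# Functional map from functions on B to functions on A: only a k x k matrix
c_ba = functional_map(phi_b, phi_a, feat_b, feat_a)

# Convert back to point-to-point matches by nearest neighbor in spectral space
a_to_b = nearest_rows(phi_a @ c_ba, phi_b)   # a_to_b[i] = vertex of B matching vertex i of A
print(c_ba.shape, a_to_b[:5])
```

The unknown here is an 8×8 matrix instead of a 30×30 assignment, which is the source of the efficiency gain on large meshes; the vertex-level matches are recovered only at the end.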

Setting Up DenseMatcher

# Clone and setup
git clone https://github.com/TEA-Lab/DenseMatcher.git
cd DenseMatcher
conda create -n densematcher python=3.9
conda activate densematcher
pip install -r requirements.txt

# Download pre-trained model
python scripts/download_model.py

# Test with two object meshes
python demo.py \
  --mesh_a data/example/cup_A.obj \
  --mesh_b data/example/cup_B.obj \
  --output_dir results/

Applying to Robot Manipulation

from densematcher import DenseMatcher
import numpy as np

# Load model
matcher = DenseMatcher.from_pretrained('densematcher-v1')

# Demo: the robot is shown how to hold cup A
# Grasp point on cup A (from human demo)
demo_mesh = load_mesh('cup_A.obj')  # load_mesh: placeholder mesh loader
demo_grasp_vertex = 1247  # vertex index at the grasp location

# Target: cup B (never seen before, different shape)
target_mesh = load_mesh('cup_B.obj')

# Find correspondence
correspondence = matcher.compute_correspondence(demo_mesh, target_mesh)

# Find corresponding vertex on mug B
target_grasp_vertex = correspondence[demo_grasp_vertex]
target_grasp_position = target_mesh.vertices[target_grasp_vertex]

print(f"Grasp point on target cup: {target_grasp_position}")
# → Robot uses this point to execute grasp

DenseMatcher Results

  • Outperforms baselines: +43.5% over the best prior 3D matching method
  • Cross-category: Learn from mugs → grasp water bottles, boxes, despite different categories
  • Long-horizon tasks: Executes complex action sequences (open lid → pour → close) from just 1 demo
  • Zero-shot: No fine-tuning required for new objects

Comparison of Three Methods

Criterion        | DON (2018)               | DenseFusion (2019)     | DenseMatcher (2025)
-----------------+--------------------------+------------------------+----------------------------------------
Input            | RGB-D images             | RGB + Point Cloud      | 3D Mesh
Output           | 2D pixel descriptors     | 6D pose (R, t)         | 3D vertex correspondence
Generalization   | Cross-instance (rigid)   | Instance-level         | Category-level (cross-instance)
Supervision      | Self-supervised          | Fully supervised       | Self-supervised (via foundation models)
Foundation model | No                       | No                     | Yes (DINO, SD)
Real-time        | Yes (~10ms)              | Yes (~20ms)            | No (offline)
Use case         | Grasping specific points | Precise pick-and-place | One-shot imitation learning

A Practical Pipeline for Robot Manipulation

In practice, a manipulation system can combine all three approaches, with each one handling a different part of perception:

Camera (RGB-D)
      │
      ├──► Segmentation (Mask R-CNN / SAM) → object ROI
      │
      ├──► DenseFusion ─────────────────────► 6D Pose (for trajectory planning)
      │
      └──► DON / DenseMatcher ───────────────► Grasp point (to determine where to grip)
                                                      │
                                              Motion Planner
                                                      │
                                              Robot Execution

Figure: Combined dense perception pipeline for robot manipulation in industrial settings.


Future Directions

Following DenseMatcher, research is moving toward:

1. Dense Features from Foundation Models:
DINOBot (2024) and similar works use DINO-ViT features directly — no additional training required. These features are already "dense" and "semantic" enough for matching.

2. Dense World Models:
Rather than predicting discrete poses, predict the entire scene representation as a dense feature map — enabling more accurate long-horizon planning.

3. 4D Dense Tracking:
Track dense correspondences through time (not just spatially) — understanding how every point on an object has moved and changed throughout robot interaction.

4. Integration with VLA Models:
Use dense visual features as visual tokens for Vision-Language-Action models, replacing global image embeddings — giving models much finer-grained visual detail.


Quick Start (DON)

If you want to try DON today:

# Requirements: Ubuntu 20.04+, CUDA 11+, Python 3.8+
git clone https://github.com/RobotLocomotion/pytorch-dense-correspondence.git
cd pytorch-dense-correspondence

# Install dependencies
pip install torch torchvision
pip install -r requirements.txt

# Download pre-trained model (shoes dataset)
python scripts/download_models.py --model shoes

# Run demo: find correspondences between 2 images
python scripts/find_correspondences.py \
  --model_path trained_models/shoes \
  --image_a data/shoes/image_a.png \
  --image_b data/shoes/image_b.png \
  --pixel_a 150 120  # pixel of interest in image A

Output: an image with the corresponding point marked in image B — you can immediately see how descriptor matching works in practice.


Conclusion

Dense models have fundamentally changed how robots understand and interact with objects:

  • DON (2018) laid the groundwork: every pixel carries meaning, not just the whole image
  • DenseFusion (2019) fused RGB + Depth at the pixel level → accurate 6D pose
  • DenseMatcher (2025) generalized to category-level using foundation models

The trend is clear: dense representations become increasingly powerful when combined with foundation models (DINO, Stable Diffusion). With dense matching, a robot may need only a handful of demonstrations, and in DenseMatcher's case a single one, rather than thousands to generalize within a category.

If you're building a manipulation system, here's the recommended stack:

  1. Segmentation: SAM (Segment Anything Model) to isolate objects
  2. Pose: DenseFusion or FoundationPose for 6D pose
  3. Grasp point: DenseMatcher for category-level generalization
  4. Policy: ACT or Diffusion Policy for learned closed-loop control

