
Task Planning for Manipulation on Jetson Edge

Deploy task planning for robot manipulation on NVIDIA Jetson AGX Orin 64GB — from cuTAMP and cuMotion to VLM inference at the edge.

Nguyễn Anh Tuấn · April 14, 2026 · 14 min read

You have a robot arm, an NVIDIA Jetson AGX Orin 64GB, and a cluttered workspace full of objects. The hardest question isn't "can the robot grasp?" but rather "how does the robot decide what to grasp first, where to place it, and how to avoid collisions throughout the entire action sequence?"

That's the core challenge of Task and Motion Planning (TAMP) — and in this article, we'll dive deep into solving it directly on edge devices with the Jetson AGX Orin 64GB, using three technology pillars: cuTAMP (GPU-accelerated TAMP), cuMotion (CUDA motion planning), and VLM-based task planning via TensorRT Edge-LLM.

Why Task Planning on Edge?

First, let's understand why running task planning directly on edge devices matters instead of offloading to the cloud:

  1. Low latency — Robot manipulation demands control loops under 30ms. Cloud round-trips add 100-500ms in network latency alone.
  2. Offline operation — Factories, warehouses, and field environments don't always have reliable internet.
  3. Data security — Camera and sensor data never leaves the device.
  4. Operating costs — No cloud inference fees for every robot decision.

The NVIDIA Jetson AGX Orin 64GB delivers 275 TOPS of AI performance with an Ampere GPU featuring 2048 CUDA cores, 64 Tensor Cores, dual NVDLA v2.0, and 64GB LPDDR5 — enough to run perception, planning, and control simultaneously.

Jetson AGX Orin 64GB — the edge brain for robot manipulation

The TAMP Problem — Task and Motion Planning

TAMP combines two layers of planning:

  • Task planning (high level): Decides the action sequence — pick object A first, move to position B, place it down, then pick object C. This is a symbolic reasoning problem.
  • Motion planning (low level): Finds the specific trajectory for each action — how the robot arm moves from pose X to pose Y without colliding with obstacles.
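In code, the two layers work with very different objects. A toy illustration (all names and numbers below are ours, purely for illustration):

```python
# Toy illustration of the two TAMP layers (all values are made up).

# Task-planning output: a symbolic action sequence over object names.
task_plan = [
    ("pick", "box_A"),
    ("place", "box_A", "shelf_top"),
    ("pick", "box_B"),
    ("place", "box_B", "shelf_bottom"),
]

# Motion-planning output: for ONE symbolic action, a concrete joint-space
# trajectory -- here 3 waypoints for a 7-DoF arm.
pick_A_trajectory = [
    [0.00, -0.78, 0.00, -2.36, 0.00, 1.57, 0.79],  # start configuration
    [0.31, -0.52, 0.12, -2.10, 0.05, 1.40, 0.80],  # approach waypoint
    [0.45, -0.30, 0.20, -1.90, 0.08, 1.30, 0.81],  # grasp configuration
]

# TAMP must keep both consistent: every symbolic action needs a
# collision-free trajectory, and a trajectory failure invalidates the plan.
assert len(task_plan) == 4 and len(pick_A_trajectory[0]) == 7
```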

The biggest challenge: these two layers are interdependent. The task planner might propose "pick A then pick B," but the motion planner discovers that picking A first blocks the approach path to B. Result? Backtracking, retrying, and exponentially growing computation time.

Traditional Methods and Their Limits

Traditional TAMP solvers like PDDLStream or Lazy-PRM work sequentially:

  1. Propose a plan skeleton (abstract action sequence)
  2. Try to solve continuous variables (grasp poses, trajectories) for that skeleton
  3. If it fails → backtrack, try another skeleton
  4. Repeat until a solution is found

The problem? Step 2 is the bottleneck. Each attempt to solve continuous variables for a skeleton requires checking inverse kinematics, collision, and stability constraints — all on the CPU, sequentially. For complex problems (many objects, tight spaces), this can take minutes to hours.
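The four-step loop above can be sketched in a few lines of plain Python (a toy problem with illustrative names, not PDDLStream's actual API):

```python
import itertools

def solve_continuous(skeleton):
    """Stand-in for the expensive CPU step: IK, collision, and stability
    checks for one skeleton. Toy rule: box B sits on top of box A, so any
    skeleton that picks A before B is geometrically infeasible."""
    return skeleton.index("pick(B)") < skeleton.index("pick(A)")

def serial_tamp(actions):
    attempts = 0
    # Step 1: propose plan skeletons one at a time (here: all orderings)
    for skeleton in itertools.permutations(actions):
        attempts += 1
        # Step 2: try to bind continuous variables for this skeleton
        if solve_continuous(list(skeleton)):
            return list(skeleton), attempts
        # Step 3: failure, so backtrack and try the next skeleton
    return None, attempts  # Step 4 never succeeded

plan, attempts = serial_tamp(
    ["pick(A)", "pick(B)", "place(A)", "place(B)"]
)
print(attempts)  # 7: six full (slow) solves wasted before the first success
```

In a real solver each failed attempt costs seconds of IK and collision checking, which is why the attempt count, not the per-attempt cost, dominates total runtime.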

cuTAMP — GPU-Parallelized TAMP

cuTAMP (Differentiable GPU-Parallelized Task and Motion Planning) is a paper from MIT CSAIL and NVIDIA Research, published at RSS 2025. The core idea: instead of solving candidates sequentially on CPU, solve thousands of candidates in parallel on GPU.

cuTAMP Architecture

cuTAMP uses bilevel search:

Level 1 — Skeleton enumeration: Enumerate feasible plan skeletons. Each skeleton is an abstract action sequence, e.g., pick(A) → place(A, shelf_top) → pick(B) → place(B, shelf_bottom).

Level 2 — GPU-parallel continuous optimization: For each skeleton, cuTAMP:

  1. Samples thousands of seeds on GPU — Each seed is a set of continuous values (grasp positions, placement angles, joint configurations)
  2. Differentiable batch optimization — All seeds are optimized simultaneously on GPU to satisfy:
    • Inverse kinematics constraints
    • Collision avoidance (robot-object, object-object)
    • Stability constraints (placed objects must be stable)
    • Cost minimization (shortest path, minimum time)
  3. Feasibility heuristic — Uses sampling results to estimate skeleton success probability, prioritizing promising skeletons

Setting up a problem might look like the following (illustrative API; the released cuTAMP interface may differ):
import torch
from cutamp import CuTAMPSolver, ProblemDef

problem = ProblemDef(
    robot_urdf="franka_panda.urdf",
    objects=["box_A", "box_B", "box_C"],
    goal_predicates=[
        ("on", "box_A", "shelf_top"),
        ("on", "box_B", "shelf_mid"),
        ("on", "box_C", "shelf_bottom"),
    ],
    obstacles=["table", "wall"],
)

solver = CuTAMPSolver(
    num_particles=4096,
    device="cuda",
    max_iterations=200,
    learning_rate=0.01,
)

solution = solver.solve(problem)
# solution.skeleton: ['pick(A)', 'place(A, top)', 'pick(B)', ...]
# solution.trajectories: list of joint-space trajectories
# solution.solve_time: ~2.3 seconds
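The Level-2 idea, optimizing thousands of seeds as one batch instead of one at a time, can be sketched with NumPy standing in for the GPU. This is a toy 2-D placement problem; the obstacle, goal, and penalty weights are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous problem: choose a 2-D placement (x, y) that minimizes
# distance to a goal while staying outside a circular obstacle.
goal = np.array([0.8, 0.5])
obstacle_center, obstacle_radius = np.array([0.5, 0.5]), 0.15

# 1. Sample thousands of seeds at once (cuTAMP: on the GPU).
seeds = rng.uniform(0.0, 1.0, size=(4096, 2))

# 2. Batched gradient descent on ALL seeds simultaneously.
for _ in range(200):
    grad = 2.0 * (seeds - goal)                  # gradient of squared distance
    diff = seeds - obstacle_center
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    inside = dist < obstacle_radius
    # Penalty gradient pushes violating seeds radially out of the obstacle.
    grad -= 10.0 * inside * diff / np.maximum(dist, 1e-8)
    seeds -= 0.01 * grad

# 3. Keep the best feasible seed; the feasible fraction also serves as a
#    success-probability estimate for this skeleton.
feasible = np.linalg.norm(seeds - obstacle_center, axis=1) >= obstacle_radius
costs = np.linalg.norm(seeds - goal, axis=1)
best = seeds[feasible][np.argmin(costs[feasible])]
print(best, feasible.mean())
```

One batched update costs roughly the same as one serial update on a GPU, so evaluating 4,096 seeds is nearly free compared to evaluating them one by one, which is the core of cuTAMP's speedup.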

Benchmark Results

cuTAMP was tested across multiple manipulation domains:

| Domain | Description | cuTAMP | Baseline (serial) |
|---|---|---|---|
| Single-object packing | Place 1 object in a tight box | 14x faster | Random-init optimization |
| Tetris packing | Pack multiple blocks in a frame (only 0.3% valid candidates) | Seconds | Infeasible in reasonable time |
| Stick-Button | Use stick to press button (5,698+ skeletons) | < 1.5 seconds | Minutes |
| Fruit packing | 5 objects, requires obstruction reasoning | Seconds | Tens of seconds |

cuTAMP has been validated on real robots (Franka Panda, Kinova Gen3, and UR5), finding solutions in under 30 seconds in the reported experiments.

Integration with cuRobo

cuTAMP integrates with cuRobo — NVIDIA's CUDA-accelerated robot motion planning library — to generate executable trajectories. cuRobo runs on GPU, optimizing collision-free trajectories in milliseconds instead of seconds.

cuMotion — Motion Planning on Jetson

While cuTAMP solves "what to do and in what order," cuMotion (part of NVIDIA Isaac ROS) solves "how to move the robot arm." cuMotion is a CUDA-accelerated motion planning library designed to run on Jetson.

cuMotion Features

  • Optimal-time, minimal-jerk trajectories: Smooth, time-optimized paths
  • GPU-parallel trajectory optimization: Runs multiple trajectory optimizations concurrently
  • Collision avoidance: Supports obstacles as cuboids, meshes, or signed distance fields
  • MoveIt 2 integration: Plugs directly into the ROS 2 MoveIt ecosystem
  • Robot segmentation: Removes the robot from depth streams to avoid false self-collision detection

Performance on Jetson

| Platform | Planning time |
|---|---|
| Jetson Thor | Fraction of a second |
| Jetson AGX Orin | < 1 second (typical) |
| RTX 6000 (discrete GPU) | Tens of milliseconds |

On Jetson AGX Orin 64GB, cuMotion is fast enough for pick-and-place cycle times under 5 seconds — including perception, planning, and execution.

Setting Up cuMotion on Jetson AGX Orin

# Install JetPack 6.x (Ubuntu 22.04 + CUDA 12.x)
sudo apt update && sudo apt install nvidia-jetpack

# Setup Isaac ROS workspace
mkdir -p ~/ros2_ws/src && cd ~/ros2_ws/src
git clone https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_cumotion.git
git clone https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_common.git

# Build in Docker container (recommended)
cd ~/ros2_ws
./src/isaac_ros_common/scripts/run_dev.sh
colcon build --packages-select isaac_ros_cumotion
source install/setup.bash

Using cuMotion with MoveIt 2

import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node
from moveit_msgs.action import MoveGroup
from geometry_msgs.msg import PoseStamped

class ManipulationPlanner(Node):
    def __init__(self):
        super().__init__('manipulation_planner')
        # MoveGroup is an action interface, so use an ActionClient
        # (MoveIt 2 exposes it on the 'move_action' action name)
        self._move_group_client = ActionClient(
            self, MoveGroup, 'move_action'
        )

    def plan_pick(self, target_pose: PoseStamped):
        """Plan a motion to target_pose.
        The cuMotion planner plugin optimizes the trajectory on the GPU."""
        goal = MoveGroup.Goal()
        goal.request.group_name = "panda_arm"
        goal.request.goal_constraints = [
            self._create_pose_constraint(target_pose)
        ]
        self._move_group_client.wait_for_server()
        return self._move_group_client.send_goal_async(goal)

    def _create_pose_constraint(self, pose):
        from moveit_msgs.msg import Constraints, PositionConstraint
        constraint = Constraints()
        pos = PositionConstraint()
        pos.header = pose.header
        pos.link_name = "panda_hand"
        pos.constraint_region.primitive_poses = [pose.pose]
        constraint.position_constraints.append(pos)
        return constraint

VLM-based Task Planning on Edge

The third approach — and the fastest-growing one — uses Vision-Language Models (VLMs) as task planners. Instead of hand-coding symbolic rules, the VLM observes the camera feed and reasons about "what to do next."

Why VLM for Task Planning?

Traditional TAMP (including cuTAMP) requires domain specification — you must predefine all objects, predicates, and actions. When encountering new objects or situations outside the specification, the system fails.

VLM approach:

  1. Camera captures workspace image
  2. VLM analyzes the scene: "There are 3 boxes on the table, 1 bottle tipped over, and the gripper is open"
  3. User says: "Clean up the table"
  4. VLM generates plan: ["pick up bottle", "place in bin", "stack boxes neatly"]
  5. Each step is converted to motion commands via a policy or motion planner
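Step 5, turning each text step into motion commands, can be sketched as a simple dispatcher. The command schema below is ours, not a standard API; the step format matches the JSON the VLM is prompted to emit later in this article:

```python
# Minimal sketch of step 5: map one VLM plan step to motion primitives
# that a motion planner (e.g. cuMotion) would turn into trajectories.
# The primitive names and heights are illustrative.

def step_to_commands(step):
    """One pick-and-place step -> a list of low-level motion primitives."""
    px, py = step["pick_region"]
    qx, qy = step["place_region"]
    return [
        ("move_to", px, py, 0.15),   # hover above the object
        ("move_to", px, py, 0.02),   # descend to grasp height
        ("close_gripper",),
        ("move_to", px, py, 0.15),   # lift
        ("move_to", qx, qy, 0.15),   # transit above the place region
        ("move_to", qx, qy, 0.02),   # descend
        ("open_gripper",),
    ]

plan = [{"object": "bottle", "pick_region": [0.3, 0.4],
         "place_region": [0.8, 0.5]}]
commands = [cmd for step in plan for cmd in step_to_commands(step)]
print(len(commands))  # 7 primitives for the single step
```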

TensorRT Edge-LLM for Jetson

NVIDIA developed TensorRT Edge-LLM — an open-source C++ framework for LLM/VLM inference on embedded platforms. Key features:

  • EAGLE-3 speculative decoding: Accelerates token generation
  • NVFP4 quantization: Reduces model size while preserving accuracy
  • Chunked prefill: Optimizes memory for long-context inputs
  • Concurrent model serving: Runs multiple models simultaneously
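To see why 4-bit quantization shrinks models so much, here is generic 4-bit affine quantization in NumPy, a simplified stand-in for NVFP4 (the real format uses FP4 values with per-block scaling, which this sketch does not reproduce):

```python
import numpy as np

# Generic 4-bit affine quantization (simplified stand-in for NVFP4).

def quantize_4bit(weights):
    """Map float weights to integers 0..15 plus a scale and offset."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 15.0
    q = np.clip(np.round((weights - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    return q.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
q, scale, w_min = quantize_4bit(weights)
recovered = dequantize_4bit(q, scale, w_min)

# 4 bits/weight vs 16 bits/weight -> 4x smaller before packing overhead.
max_err = np.abs(recovered - weights).max()
print(max_err <= scale)  # rounding error is bounded by one quantization step
```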

Architecture pipeline: VLM → Task Plan → cuMotion → Robot Action

Models That Run on Jetson AGX Orin 64GB

| Model | Parameters | Tokens/sec | Purpose |
|---|---|---|---|
| Qwen2.5-VL-7B | 7B | ~15-20 | Scene understanding + task planning |
| LLaVA-13B | 13B | ~8-12 | Detailed visual reasoning |
| Phi-3.5-Vision | 4.2B | ~25-30 | Lightweight, fast inference |
| VILA 1.5-3B | 3B | ~35-40 | Real-time visual queries |

With 64GB of memory, the Jetson AGX Orin comfortably runs 7-13B models for task planning while retaining enough memory for the perception pipeline and cuMotion.
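The "comfortably runs 7-13B models" claim is easy to check with back-of-the-envelope arithmetic (weights only; KV cache and activations add a few more GB on top):

```python
def model_memory_gb(params_billion, bytes_per_param):
    """Approximate weight memory: parameter count x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# FP16 = 2 bytes/param, INT4 = 0.5 bytes/param (ignoring packing overhead)
for name, params in [("Phi-3.5-Vision", 4.2), ("Qwen2.5-VL-7B", 7.0),
                     ("LLaVA-13B", 13.0)]:
    fp16 = model_memory_gb(params, 2.0)
    int4 = model_memory_gb(params, 0.5)
    print(f"{name}: ~{fp16:.0f} GB FP16, ~{int4:.1f} GB INT4")

# Qwen2.5-VL-7B lands at ~14 GB in FP16, and even a 13B model (~26 GB)
# leaves more than half of the 64 GB for perception and planning.
```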

Complete Pipeline: VLM + cuMotion on Jetson

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

class EdgeTaskPlanner:
    def __init__(self):
        # Qwen2.5-VL has its own model class in recent transformers releases
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            "Qwen/Qwen2.5-VL-7B-Instruct",
            torch_dtype=torch.float16,
            device_map="cuda"
        )
        self.processor = AutoProcessor.from_pretrained(
            "Qwen/Qwen2.5-VL-7B-Instruct"
        )

    def plan_from_image(self, image: Image, instruction: str):
        """Takes workspace image + instruction, returns task plan."""
        messages = [
            {
                "role": "system",
                "content": (
                    "You are a robot task planner. Given a workspace image "
                    "and instruction, output a JSON list of pick-and-place "
                    "actions. Each action has: object, pick_region, "
                    "place_region."
                )
            },
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": instruction}
                ]
            }
        ]

        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            text=[text], images=[image],
            return_tensors="pt"
        ).to("cuda")

        output_ids = self.model.generate(
            **inputs, max_new_tokens=512,
            do_sample=True, temperature=0.1  # temperature requires sampling
        )
        plan = self.processor.batch_decode(
            output_ids, skip_special_tokens=True
        )
        return self._parse_plan(plan[0])

    def _parse_plan(self, raw_output):
        import json
        start = raw_output.find('[')
        end = raw_output.rfind(']') + 1
        if start == -1 or end == 0:
            raise ValueError("VLM output contains no JSON list")
        return json.loads(raw_output[start:end])

# Usage
planner = EdgeTaskPlanner()
camera_image = Image.open("/dev/video0_capture.jpg")
plan = planner.plan_from_image(
    camera_image,
    "Sort the red boxes to the left bin and blue boxes to the right bin"
)
# plan = [
#     {"object": "red_box_1", "pick_region": [0.3, 0.4], "place_region": [0.1, 0.5]},
#     {"object": "blue_box_1", "pick_region": [0.5, 0.3], "place_region": [0.8, 0.5]},
#     ...
# ]
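VLM output is untrusted text, so it pays to validate the parsed plan before sending anything to the robot. A minimal sketch, assuming a normalized workspace and a perception-provided object list (the schema and bounds are ours):

```python
def validate_plan(plan, workspace_bounds=((0.0, 1.0), (0.0, 1.0)),
                  known_objects=None):
    """Reject malformed or physically implausible VLM plans before
    execution. Returns (ok, errors)."""
    errors = []
    (x_lo, x_hi), (y_lo, y_hi) = workspace_bounds
    for i, step in enumerate(plan):
        # Schema check: required keys must exist.
        for key in ("object", "pick_region", "place_region"):
            if key not in step:
                errors.append(f"step {i}: missing '{key}'")
        # Geometry check: regions must lie inside the workspace.
        for key in ("pick_region", "place_region"):
            region = step.get(key, [])
            if len(region) != 2 or not (
                x_lo <= region[0] <= x_hi and y_lo <= region[1] <= y_hi
            ):
                errors.append(f"step {i}: {key} outside workspace")
        # Grounding check: the object must have been seen by perception.
        if known_objects is not None and step.get("object") not in known_objects:
            errors.append(f"step {i}: unknown object {step.get('object')!r}")
    return (len(errors) == 0), errors

ok, errs = validate_plan(
    [{"object": "red_box_1", "pick_region": [0.3, 0.4],
      "place_region": [1.7, 0.5]}],          # place_region out of bounds
    known_objects={"red_box_1"},
)
print(ok, errs)  # False: the hallucinated placement is caught
```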

Comparing the Three Approaches

| Criterion | cuTAMP | cuMotion (standalone) | VLM Task Planning |
|---|---|---|---|
| Input | Domain specification + goal | Start/end pose | Camera image + language |
| Output | Full plan + trajectories | Single trajectory | Action sequence (text) |
| Generalization | Within defined domain | Any reachable pose | Open-world |
| Latency | 1-30s (full plan) | 10-500ms (1 trajectory) | 2-10s (VLM inference) |
| GPU memory | ~4-8GB | ~2-4GB | ~16-32GB (model) |
| Jetson AGX Orin 64GB | ✅ Sufficient | ✅ Very comfortable | ✅ Handles 7-13B |

When to Use What?

  • cuTAMP: Complex manipulation problems with many objects requiring optimality guarantees (packing, assembly). Best for factories with well-defined tasks.
  • cuMotion: When you already have a task plan (from cuTAMP or VLM) and need fast trajectory generation for the robot arm. Always needed in the pipeline.
  • VLM Task Planning: Dynamic environments requiring open-world reasoning, robots accepting natural language commands. Best for service robots and cobots working alongside humans.

Complete Pipeline Setup Guide

Step 1: Prepare Jetson AGX Orin

# Flash JetPack 6.1+ (supports CUDA 12.6, TensorRT 10.x)
# Download SDK Manager: https://developer.nvidia.com/sdk-manager

# Verify the GPU (nvidia-smi output is limited on Jetson;
# tegrastats is the traditional alternative)
nvidia-smi
# Expected: Orin (nvgpu), CUDA 12.6

# Install Docker + nvidia-container-toolkit
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker

Step 2: Install Isaac ROS + cuMotion

# Clone Isaac ROS repos
mkdir -p ~/isaac_ros_ws/src && cd ~/isaac_ros_ws/src
git clone -b release-3.2 \
  https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_cumotion.git
git clone -b release-3.2 \
  https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_common.git

# Build in container
cd ~/isaac_ros_ws
./src/isaac_ros_common/scripts/run_dev.sh
colcon build --symlink-install
source install/setup.bash

Step 3: Install VLM for Task Planning

# In a Python environment on the Jetson (install PyTorch from NVIDIA's
# Jetson wheels first if it is not already present)
pip install transformers accelerate pillow

# Download model (will cache to ~/.cache/huggingface)
python -c "
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    'Qwen/Qwen2.5-VL-7B-Instruct',
    torch_dtype='float16'
)
print('Model loaded successfully')
"

Step 4: Connect the Pipeline

#!/usr/bin/env python3
"""Full pipeline: Camera → VLM → cuMotion → Robot."""

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image as RosImage
from cv_bridge import CvBridge
from PIL import Image
import numpy as np

class FullManipulationPipeline(Node):
    def __init__(self):
        super().__init__('full_pipeline')
        self.bridge = CvBridge()
        # Both classes are defined earlier in this article
        self.planner = EdgeTaskPlanner()
        self.motion = ManipulationPlanner()

        self.camera_sub = self.create_subscription(
            RosImage, '/camera/color/image_raw',
            self.on_image, 10
        )
        self.current_task = None

    def on_image(self, msg):
        if self.current_task is not None:
            return  # a plan is already executing

        cv_image = self.bridge.imgmsg_to_cv2(msg, "rgb8")
        pil_image = Image.fromarray(cv_image)

        plan = self.planner.plan_from_image(
            pil_image,
            "Pick up all objects and place them in the sorting bins"
        )

        self.get_logger().info(f"Generated plan: {len(plan)} steps")
        self.current_task = plan  # block re-planning while executing
        self.execute_plan(plan)
        self.current_task = None

    def execute_plan(self, plan):
        for i, step in enumerate(plan):
            self.get_logger().info(
                f"Step {i+1}/{len(plan)}: {step['object']}"
            )
            pick_pose = self.region_to_pose(step['pick_region'])
            place_pose = self.region_to_pose(step['place_region'])

            # plan_pick just plans a motion to a pose, so it serves for
            # both the pick approach and the place approach
            self.motion.plan_pick(pick_pose)
            self.motion.plan_pick(place_pose)

    def region_to_pose(self, region):
        from geometry_msgs.msg import PoseStamped
        pose = PoseStamped()
        pose.header.frame_id = "base_link"
        pose.pose.position.x = region[0]
        pose.pose.position.y = region[1]
        pose.pose.position.z = 0.15
        pose.pose.orientation.w = 1.0
        return pose

def main():
    rclpy.init()
    node = FullManipulationPipeline()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Performance Optimization on Jetson

Memory Management

With 64GB shared memory (CPU + GPU), proper allocation is essential:

| Component | Memory | Notes |
|---|---|---|
| VLM (Qwen2.5-VL-7B, FP16) | ~14GB | Can reduce to ~5GB with INT4 |
| cuMotion | ~2-4GB | Depends on scene complexity |
| Perception (FoundationPose, etc.) | ~4-6GB | Object detection + pose estimation |
| OS + ROS 2 | ~4GB | Base system |
| Total | ~24-28GB | ~36GB headroom remaining |
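As a quick sanity check on the budget, summing the midpoints of the component estimates above:

```python
# Memory budget sanity check using the table's component estimates (GB).
budget = {
    "VLM (Qwen2.5-VL-7B, FP16)": 14.0,
    "cuMotion": 3.0,               # midpoint of 2-4 GB
    "Perception": 5.0,             # midpoint of 4-6 GB
    "OS + ROS 2": 4.0,
}
total = sum(budget.values())
headroom = 64.0 - total
print(total, headroom)  # 26.0 GB used, 38.0 GB headroom
```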

Quantization for VLM

import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quantization_config,
    device_map="cuda"
)
# Memory: ~14GB → ~5GB, speed loss < 10%

Power Mode

# High performance mode (60W) for heavy inference
sudo nvpmodel -m 0
sudo jetson_clocks

# Balanced mode (30W) for normal operation
sudo nvpmodel -m 1

Limitations and Future Directions

Current Limitations

  1. cuTAMP lacks a Jetson-native build — Currently cuTAMP runs on desktop GPUs (RTX series). Porting to Jetson requires compiling cuRobo + custom CUDA kernels for Ampere mobile architecture.
  2. VLM latency remains high — 2-10 seconds per planning call. Not fast enough for reactive manipulation (needs < 100ms).
  3. Hallucination risk — VLMs can generate infeasible plans (e.g., grasping objects that are too heavy, placing objects on tilted surfaces).

Future Directions 2026-2027

  • Jetson Thor (Q4 2026): 7.5x AI compute vs. Orin, capable of running 20-70B models locally → more accurate and faster VLM planning.
  • TensorRT Edge-LLM: Expanding Jetson Orin support with EAGLE-3 speculative decoding for 2-3x speedup.
  • Hybrid TAMP + VLM: Using VLM to initialize plans, cuTAMP to refine + guarantee feasibility — combining the strengths of both approaches.
  • VLA models on edge: Models like GR00T N1 are running end-to-end on Jetson — perception to action in a single model, bypassing discrete task planning entirely.

Conclusion

Task planning for robot manipulation on edge devices has shifted from "impossible" to "feasible" thanks to three breakthroughs:

  1. cuTAMP tackles combinatorial explosion with GPU parallelism — from hours to seconds.
  2. cuMotion generates collision-free trajectories in milliseconds on Jetson — fast enough for real-time control loops.
  3. VLMs on Jetson AGX Orin 64GB enable open-world reasoning — robots understand natural language and reason in never-before-seen environments.

The Jetson AGX Orin 64GB with 275 TOPS is currently the most powerful edge platform for manipulation — and with Jetson Thor on the horizon, the boundary between edge and cloud AI will continue to blur.



Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
