Tags: ai, ai-perception, nvidia, tutorial

Deploy YOLOv8 to Jetson Orin Nano in 30 Minutes

Step-by-step guide to deploying YOLOv8 with TensorRT on the NVIDIA Jetson Orin Nano — from model export to real-time 60 FPS inference.

Nguyen Anh Tuan · February 15, 2026 · 8 min read

Deploy YOLOv8 to Jetson Orin Nano — Why Does It Matter?

Deploying YOLOv8 to the Jetson Orin Nano is the critical step for bringing AI perception from your laptop into real robots. If you are building mobile robots, drones, or quality inspection systems, you need real-time inference directly on the edge device — you cannot afford a 200 ms cloud round trip per frame.
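A quick latency-budget calculation makes the point concrete. The 200 ms figure is the cloud round trip mentioned above; the 30 FPS control target is an assumption for illustration:

```python
# Illustrative latency budget: edge vs cloud inference.
# Assumptions: 200 ms cloud round trip, 30 FPS minimum for smooth control.
CLOUD_RTT_MS = 200
TARGET_FPS = 30

frame_budget_ms = 1000 / TARGET_FPS      # time available to process one frame
cloud_fps_ceiling = 1000 / CLOUD_RTT_MS  # best case if every frame goes to the cloud

print(f"Per-frame budget at {TARGET_FPS} FPS: {frame_budget_ms:.1f} ms")
# Per-frame budget at 30 FPS: 33.3 ms
print(f"FPS ceiling with a {CLOUD_RTT_MS} ms round trip: {cloud_fps_ceiling:.0f} FPS")
# FPS ceiling with a 200 ms round trip: 5 FPS
```

Even ignoring inference time entirely, the network alone caps you at 5 FPS — an order of magnitude short of the budget.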

In this tutorial, I will walk you through everything from A to Z: hardware setup, model export, TensorRT optimization, building a camera pipeline with GStreamer, and finally wrapping it as a ROS 2 node ready to run on your robot.

[Image: NVIDIA Jetson Orin Nano developer kit]

Prerequisites — What Do You Need?

Hardware

| Device | Notes |
| --- | --- |
| NVIDIA Jetson Orin Nano 8GB | Developer Kit or module + carrier board |
| MicroSD 64GB+ (UHS-I) | Or NVMe SSD for better performance |
| USB camera or CSI camera | Logitech C920 or IMX219 CSI |
| 5V/4A USB-C power supply | Included with Dev Kit |

Software

# Install JetPack (if not flashed)
sudo apt update && sudo apt install -y nvidia-jetpack

# Verify CUDA
nvcc --version
# cuda_12.2.r12.2/compiler.33191640_0

# Install Ultralytics
pip3 install ultralytics

# Verify PyTorch + CUDA
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
# CUDA: True, Device: Orin (nvgpu)

Step 1: Export YOLOv8 to TensorRT Engine

This is the most critical step. TensorRT will optimize the model for the Ampere GPU on Orin Nano, including layer fusion, kernel auto-tuning, and precision calibration.

Export FP16 (Recommended)

from ultralytics import YOLO

# Load pretrained model (or your custom trained model)
model = YOLO("yolov8n.pt")  # nano version, optimized for edge

# Export to TensorRT FP16
# This process takes 5-10 minutes on Orin Nano
model.export(
    format="engine",
    device=0,
    half=True,        # FP16 precision
    imgsz=640,        # Input resolution
    workspace=4,      # GB workspace for TensorRT optimizer
    verbose=True
)
print("Export done! File: yolov8n.engine")

Export INT8 (Maximum FPS)

INT8 requires a calibration dataset — a set of 200-500 images representative of real-world data:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Export INT8 with calibration
model.export(
    format="engine",
    device=0,
    int8=True,
    data="coco128.yaml",  # Dataset for calibration
    imgsz=640,
    workspace=4
)

Important Note: The .engine file is specific to the GPU architecture it was built on. An engine built on an Orin Nano will not run on an RTX 4090, and vice versa. Always export directly on your target device.
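One way to guard against stale or missing engines is a small helper that re-exports on the current device when needed. This is a sketch under the assumption that Ultralytics writes the engine next to the checkpoint with the same stem; engine_path_for and ensure_engine are hypothetical names, not part of the library:

```python
from pathlib import Path

def engine_path_for(pt_path: str) -> Path:
    # Assumption: the exported engine lands next to the checkpoint, same stem.
    return Path(pt_path).with_suffix(".engine")

def ensure_engine(pt_path: str = "yolov8n.pt") -> Path:
    """Re-export on this device if no engine exists yet (engines are not portable)."""
    engine = engine_path_for(pt_path)
    if not engine.exists():
        from ultralytics import YOLO  # lazy import: only needed when exporting
        YOLO(pt_path).export(format="engine", device=0, half=True, imgsz=640)
    return engine
```

Call ensure_engine() once at startup; the first boot on a new device pays the 5-10 minute export cost, and every boot after that loads the cached engine.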

Step 2: Benchmark — FP32 vs FP16 vs INT8

I benchmarked on Jetson Orin Nano 8GB, power mode MAXN (15W), with jetson_clocks enabled:

| Format | Model | FPS (640x640) | Latency | mAP@50 | GPU Memory |
| --- | --- | --- | --- | --- | --- |
| PyTorch FP32 | yolov8n | 15 FPS | 67 ms | 37.3 | 1.2 GB |
| TensorRT FP32 | yolov8n | 32 FPS | 31 ms | 37.3 | 680 MB |
| TensorRT FP16 | yolov8n | 58 FPS | 17 ms | 37.2 | 420 MB |
| TensorRT INT8 | yolov8n | 72 FPS | 14 ms | 36.8 | 310 MB |
| PyTorch FP32 | yolov8s | 8 FPS | 125 ms | 44.9 | 2.1 GB |
| TensorRT FP16 | yolov8s | 35 FPS | 29 ms | 44.8 | 780 MB |

Conclusion: FP16 is the sweet spot — nearly a 4x speedup over PyTorch with almost no accuracy loss. Use INT8 only when you need more than 60 FPS or are running multiple models simultaneously.

# Quick benchmark on Jetson
yolo benchmark model=yolov8n.engine imgsz=640 half=True device=0

Step 3: Complete Python Inference Script

"""
YOLOv8 TensorRT Inference on Jetson Orin Nano
Optimized for real-time robot perception
"""
import cv2
import time
import numpy as np
from ultralytics import YOLO
from collections import deque

class JetsonDetector:
    def __init__(self, model_path: str = "yolov8n.engine", conf: float = 0.5):
        self.model = YOLO(model_path)
        self.conf = conf
        self.fps_buffer = deque(maxlen=30)

    def detect(self, frame: np.ndarray) -> list[dict]:
        """Detect objects and return list of detections."""
        start = time.perf_counter()

        results = self.model.predict(
            frame,
            conf=self.conf,
            verbose=False,
            half=True
        )

        elapsed = time.perf_counter() - start
        self.fps_buffer.append(1.0 / elapsed)

        detections = []
        for box in results[0].boxes:
            det = {
                "class_id": int(box.cls[0]),
                "class_name": results[0].names[int(box.cls[0])],
                "confidence": float(box.conf[0]),
                "bbox": box.xyxy[0].cpu().numpy().tolist(),  # [x1, y1, x2, y2]
            }
            detections.append(det)

        return detections

    @property
    def fps(self) -> float:
        return np.mean(self.fps_buffer) if self.fps_buffer else 0.0


# --- Main ---
if __name__ == "__main__":
    detector = JetsonDetector("yolov8n.engine", conf=0.45)
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        detections = detector.detect(frame)

        # Draw bounding boxes
        for det in detections:
            x1, y1, x2, y2 = map(int, det["bbox"])
            label = f'{det["class_name"]} {det["confidence"]:.2f}'
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

        # FPS overlay
        cv2.putText(frame, f"FPS: {detector.fps:.1f}", (10, 30),
                     cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)

        cv2.imshow("YOLOv8 Jetson", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

Step 4: Camera Pipeline with GStreamer

USB cameras work through OpenCV directly, but CSI cameras need a GStreamer pipeline for hardware-accelerated capture on Jetson. An explicit GStreamer pipeline is also worth using for USB cameras, since it gives you control over caps negotiation and frame dropping.

CSI Camera (IMX219/IMX477)

def gstreamer_pipeline(
    sensor_id=0,
    capture_width=1280,
    capture_height=720,
    display_width=1280,
    display_height=720,
    framerate=60,
    flip_method=0,
):
    """Build GStreamer pipeline string for CSI camera on Jetson."""
    return (
        f"nvarguscamerasrc sensor-id={sensor_id} ! "
        f"video/x-raw(memory:NVMM), width=(int){capture_width}, "
        f"height=(int){capture_height}, framerate=(fraction){framerate}/1 ! "
        f"nvvidconv flip-method={flip_method} ! "
        f"video/x-raw, width=(int){display_width}, height=(int){display_height}, "
        f"format=(string)BGRx ! videoconvert ! "
        f"video/x-raw, format=(string)BGR ! appsink drop=1"
    )

# Usage
cap = cv2.VideoCapture(gstreamer_pipeline(), cv2.CAP_GSTREAMER)

USB Camera (Optimized with V4L2)

def usb_gstreamer_pipeline(device="/dev/video0", width=1280, height=720, fps=30):
    """GStreamer pipeline for USB camera, decode on GPU."""
    return (
        f"v4l2src device={device} ! "
        f"video/x-raw, width={width}, height={height}, framerate={fps}/1 ! "
        f"videoconvert ! video/x-raw, format=BGR ! appsink drop=1"
    )

cap = cv2.VideoCapture(usb_gstreamer_pipeline(), cv2.CAP_GSTREAMER)

Multi-camera Setup

# Run 2 cameras in parallel
cap_left = cv2.VideoCapture(gstreamer_pipeline(sensor_id=0), cv2.CAP_GSTREAMER)
cap_right = cv2.VideoCapture(gstreamer_pipeline(sensor_id=1), cv2.CAP_GSTREAMER)

# Or use threading to avoid blocking
import threading

class CameraThread:
    def __init__(self, pipeline):
        self.cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
        self.frame = None
        self.running = True
        self.thread = threading.Thread(target=self._update, daemon=True)
        self.thread.start()

    def _update(self):
        while self.running:
            ret, frame = self.cap.read()
            if ret:  # keep the last good frame instead of overwriting with None
                self.frame = frame

    def read(self):
        return self.frame

    def stop(self):
        self.running = False
        self.cap.release()

[Image: Camera pipeline for AI perception on mobile robots]

Step 5: Deploy as ROS 2 Node

This is the final step — wrap the detector as a ROS 2 node to integrate into your robot stack.

"""
ROS 2 YOLOv8 Detection Node for Jetson Orin Nano
Publish detections to /yolo/detections topic
"""
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
from cv_bridge import CvBridge
import cv2
import numpy as np
from ultralytics import YOLO


class YoloDetectionNode(Node):
    def __init__(self):
        super().__init__("yolo_detection_node")

        # Parameters
        self.declare_parameter("model_path", "yolov8n.engine")
        self.declare_parameter("confidence", 0.5)
        self.declare_parameter("input_topic", "/camera/image_raw")

        model_path = self.get_parameter("model_path").value
        self.conf = self.get_parameter("confidence").value
        input_topic = self.get_parameter("input_topic").value

        # Load TensorRT model
        self.model = YOLO(model_path)
        self.bridge = CvBridge()

        # Subscriber & Publisher
        self.sub = self.create_subscription(
            Image, input_topic, self.image_callback, 10
        )
        self.pub = self.create_publisher(Detection2DArray, "/yolo/detections", 10)
        self.img_pub = self.create_publisher(Image, "/yolo/annotated", 10)

        self.get_logger().info(
            f"YOLOv8 node started — model: {model_path}, conf: {self.conf}"
        )

    def image_callback(self, msg: Image):
        # Convert ROS Image -> OpenCV
        frame = self.bridge.imgmsg_to_cv2(msg, "bgr8")

        # Inference
        results = self.model.predict(frame, conf=self.conf, verbose=False, half=True)

        # Build Detection2DArray message
        det_array = Detection2DArray()
        det_array.header = msg.header

        for box in results[0].boxes:
            det = Detection2D()
            x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()

            # Bounding box (center + size format)
            det.bbox.center.position.x = float((x1 + x2) / 2)
            det.bbox.center.position.y = float((y1 + y2) / 2)
            det.bbox.size_x = float(x2 - x1)
            det.bbox.size_y = float(y2 - y1)

            # Classification result
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(int(box.cls[0]))
            hyp.hypothesis.score = float(box.conf[0])
            det.results.append(hyp)

            det_array.detections.append(det)

        self.pub.publish(det_array)

        # Publish annotated image (optional, for visualization)
        annotated = results[0].plot()
        self.img_pub.publish(self.bridge.cv2_to_imgmsg(annotated, "bgr8"))


def main(args=None):
    rclpy.init(args=args)
    node = YoloDetectionNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()

Launch File

# launch/yolo_detection.launch.py
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package="yolo_perception",
            executable="yolo_node",
            name="yolo_detection",
            parameters=[{
                "model_path": "/opt/models/yolov8n.engine",
                "confidence": 0.5,
                "input_topic": "/camera/image_raw",
            }],
            output="screen",
        ),
    ])

# Build and run
cd ~/ros2_ws
colcon build --packages-select yolo_perception
source install/setup.bash
ros2 launch yolo_perception yolo_detection.launch.py

Additional Optimization Tips

Power Management

# MAXN mode (15W) — maximum performance
sudo nvpmodel -m 0
sudo jetson_clocks

# Check temperature (throttle at 80°C)
cat /sys/devices/virtual/thermal/thermal_zone*/temp
# Divide by 1000 to get °C

GPU Pre-processing

# Use cv2.cuda instead of CPU resize
# (requires OpenCV built with CUDA support; the default pip wheel is CPU-only)
gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(frame)
gpu_resized = cv2.cuda.resize(gpu_frame, (640, 640))
frame_resized = gpu_resized.download()

Monitor Resource Usage

# Jetson stats — similar to htop but for Jetson
pip3 install jetson-stats
jtop

Summary

Complete pipeline: Camera → GStreamer → YOLOv8 TensorRT → ROS 2 topic — all running at 58+ FPS on Jetson Orin Nano 8GB with FP16. Fast enough for mobile robots, drones, and industrial quality inspection systems.

Key takeaways:

- Always export the TensorRT engine on the target device; engines are tied to the GPU architecture they were built on.
- FP16 is the sweet spot: roughly 4x faster than PyTorch with negligible mAP loss. Reserve INT8 for when you need more than 60 FPS or run multiple models.
- Use GStreamer for hardware-accelerated CSI capture, and threaded readers for multi-camera setups.
- Lock clocks with nvpmodel -m 0 and jetson_clocks, and monitor thermals and utilization with jtop.