Object detection theory: bounding boxes, anchor boxes, IOU calculation, and the NMS algorithm explained
Introduction
Object detection is one of the core tasks in computer vision. It combines two sub-tasks: classification (identifying "what") and localization (marking "where"). It is widely used in autonomous driving, security monitoring, and industrial quality inspection. This article breaks down the core theory and the reusable code for the key operations to give you a solid foundation.
📂 Stage: Stage 2 - Deep Learning Vision Basics (CNN) 🔗 Related chapters: Transfer Learning · YOLO Family in Practice
1. Object detection basics
1.1 Task definition and representation
Input: a single image of shape Height × Width × Channels (H×W×C).
Output: N detected objects, each described by three pieces of information:
- Category label (class): which category the object belongs to
- Bounding box (bbox): the position of the object in the image
- Confidence (confidence): how sure the model is of the detection
Put simply, given an image, an object detection model tells you which objects are present, where each one is, and how confident it is in each detection.
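To make the output format concrete, here is a minimal sketch of how one detection could be stored. The `Detection` class and the `filter_by_confidence` helper are hypothetical names chosen for illustration, and the box is assumed to be in pixel corner coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (illustrative container, not a library type)."""
    label: str                               # category label
    bbox: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    confidence: float                        # score in [0, 1]


def filter_by_confidence(dets: List[Detection], min_conf: float) -> List[Detection]:
    """Keep only detections whose confidence is at least min_conf."""
    return [d for d in dets if d.confidence >= min_conf]


detections = [
    Detection("car", (48.0, 60.0, 210.0, 180.0), 0.92),
    Detection("person", (230.0, 40.0, 280.0, 190.0), 0.35),
]
confident = filter_by_confidence(detections, min_conf=0.5)
print([d.label for d in confident])  # ['car']
```

A detector's raw output is usually a list of such records, and thresholding on confidence is typically the first filtering step.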
1.2 Core application scenarios
1.3 Brief history of algorithm development (key nodes)
- 2014: R-CNN (one of the first detectors to apply CNNs to object detection)
- 2015: Faster R-CNN (introduced the region proposal network, RPN, enabling end-to-end training)
- 2016: YOLOv1/SSD (single-stage detectors with much higher speed)
- 2020: DETR (first to apply a Transformer directly to object detection, with no anchor boxes)
- 2020-2023: YOLOv5/v6/v7/v8 (a strong balance of ease of use, accuracy, and speed)
2. Bounding box representation and operations
Bounding boxes are the basic unit for localizing objects. The two mainstream formats, xyxy (corner coordinates) and xywh (a reference point plus width and height), can be converted into each other.
2.1 Mainstream representation formats
💡 Normalization technique: Divide x coordinates by the image width and y coordinates by the image height. This maps them to the [0,1] range and makes the representation independent of the input image size.
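A minimal sketch of this normalization, assuming boxes in xyxy pixel format. The function names `normalize_xyxy` and `denormalize_xyxy` are illustrative, not taken from any library.

```python
def normalize_xyxy(box, img_w, img_h):
    """Map pixel coordinates (x1, y1, x2, y2) to the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)


def denormalize_xyxy(box, img_w, img_h):
    """Map normalized coordinates back to pixels for a given image size."""
    x1, y1, x2, y2 = box
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


norm = normalize_xyxy((64, 48, 320, 240), img_w=640, img_h=480)
print(norm)  # (0.1, 0.1, 0.5, 0.5)

# The same normalized box can be projected onto a resized image.
resized = denormalize_xyxy(norm, img_w=1280, img_h=960)
```

Because the normalized box does not depend on resolution, the same annotation stays valid after the image is resized.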
2.2 Reusable core operation code
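The following is a minimal sketch of the basic box operations: conversion between xyxy and xywh, and clipping to the image bounds. It assumes that xywh means (center x, center y, width, height), as in YOLO-style labels; some datasets, such as COCO annotations, instead store the top-left corner, so check the convention before reusing it. All function names are illustrative.

```python
def xyxy_to_xywh(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h), assuming center-based xywh."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def xywh_to_xyxy(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2), the inverse of xyxy_to_xywh."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def clip_xyxy(box, img_w, img_h):
    """Clamp a box so that it lies inside the image."""
    x1, y1, x2, y2 = box
    return (
        min(max(x1, 0), img_w),
        min(max(y1, 0), img_h),
        min(max(x2, 0), img_w),
        min(max(y2, 0), img_h),
    )


print(xyxy_to_xywh((10, 20, 50, 80)))            # (30.0, 50.0, 40, 60)
print(xywh_to_xyxy((30.0, 50.0, 40, 60)))        # (10.0, 20.0, 50.0, 80.0)
print(clip_xyxy((-5, 10, 700, 500), 640, 480))   # (0, 10, 640, 480)
```

The xyxy form is convenient for computing overlaps and clipping, while the center-based form is convenient for regression targets.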
3. IOU (Intersection over Union) calculation
IOU is the central evaluation and matching metric in object detection. It measures how much a predicted box overlaps a ground-truth box.
3.1 Basic concepts
Definition: the ratio of the intersection area of two bounding boxes to the area of their union.
Formula: IOU = intersection area / union area, where union area = area of box A + area of box B - intersection area.
Common threshold reference:
- IOU > 0.5: acceptable match (the AP50 threshold in COCO)
- IOU > 0.7: good match
- IOU > 0.9: almost perfect overlap
3.2 Reusable core computing code
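A minimal pure-Python sketch of the IOU formula for two boxes in xyxy format. The names `box_area` and `iou` are illustrative, and returning 0.0 for two degenerate boxes is a design choice of this sketch.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


# Intersection is 5 x 5 = 25, union is 100 + 100 - 25 = 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # about 0.143
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

Clamping the intersection width and height at zero is what makes disjoint boxes yield 0 instead of a spurious positive area.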
4. NMS (non-maximum suppression)
NMS (Non-Maximum Suppression) is the core post-processing algorithm for removing duplicate detection boxes. Most detectors rely on it; set-prediction models such as DETR are a notable exception.
4.1 Basic process
- Sort all detection boxes in descending order by confidence score
- Select the box with the highest confidence and keep it as the current box
- Compute the IOU between the current box and each remaining box, and delete the boxes whose IOU exceeds the threshold
- Repeat the selection and deletion steps on the remaining boxes until none are left
4.2 Reusable basic NMS code
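The process described in 4.1 can be sketched in pure Python as follows. This is a class-agnostic version written for readability rather than speed; the function names and the default threshold of 0.5 are assumptions of this sketch, and boxes are taken to be in xyxy format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS."""
    # Indices sorted by descending confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        current = order.pop(0)  # highest-scoring remaining box
        keep.append(current)
        # Drop every remaining box that overlaps the current one too much.
        order = [i for i in order
                 if iou(boxes[current], boxes[i]) <= iou_threshold]
    return keep


boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with an IOU of about 0.82 and is suppressed, while the third box does not overlap anything and is kept. In multi-class detection, NMS is commonly run separately per class so that overlapping objects of different categories do not suppress each other.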
4.3 Commonly used optimization variants (introduction)
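One widely cited variant is Soft-NMS, which lowers the scores of overlapping boxes instead of deleting them outright. The sketch below assumes the Gaussian decay form, score × exp(-IOU² / sigma); the function name, the default `sigma=0.5`, and the `score_threshold=0.001` cutoff are assumptions of this example, not values taken from this article.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS sketch: returns (index, rescored confidence) pairs."""
    scores = list(scores)  # work on a copy, the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        # Decay the scores of the other boxes according to their overlap.
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Discard boxes whose score has decayed below the cutoff.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep


boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
print([i for i, _ in result])  # [0, 2, 1]
```

Here the heavily overlapping second box is not removed; its score is reduced, so it drops to the end of the ranking. This can help in crowded scenes where two real objects overlap strongly.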
5. Anchor box mechanism
Anchor boxes are predefined reference boxes that two-stage and early single-stage detectors (such as Faster R-CNN, SSD, YOLOv2-v3) use to handle objects of different scales and shapes.
5.1 Basic concepts
At each grid cell of the feature map, a set of reference boxes with predefined sizes and aspect ratios is placed. For each anchor, the model only needs to predict:
- Bounding box offsets relative to the anchor box (not absolute coordinates)
- An objectness score: how likely it is that the anchor contains an object
- Class probabilities for the object
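To illustrate what "offsets relative to the anchor" means, the sketch below assumes the Faster R-CNN style parameterization: center shifts are scaled by the anchor size, and width and height are predicted as log ratios. Other detectors, such as the YOLO family, use different encodings. The function names are illustrative, and boxes are in (cx, cy, w, h) form.

```python
import math


def encode_offsets(anchor, box):
    """Turn a target box into offsets (tx, ty, tw, th) relative to an anchor."""
    acx, acy, aw, ah = anchor
    cx, cy, w, h = box
    return ((cx - acx) / aw, (cy - acy) / ah,
            math.log(w / aw), math.log(h / ah))


def decode_offsets(anchor, offsets):
    """Apply predicted offsets to an anchor to recover a (cx, cy, w, h) box."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (acx + tx * aw, acy + ty * ah,
            aw * math.exp(tw), ah * math.exp(th))


anchor = (50, 50, 20, 40)
target = (55, 60, 30, 20)
offsets = encode_offsets(anchor, target)
decoded = decode_offsets(anchor, offsets)  # approximately equal to target

# Zero offsets reproduce the anchor itself.
print(decode_offsets(anchor, (0, 0, 0, 0)))  # (50, 50, 20.0, 40.0)
```

Predicting small offsets around a well-placed anchor is generally an easier regression target than predicting absolute coordinates from scratch.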
5.2 Key parameters
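Anchor layouts are commonly described by three parameters: the feature-map stride, a set of scales, and a set of aspect ratios. The sketch below generates anchors from them. It is an assumption-laden illustration: the parameter names, the convention that a ratio means width / height, and the placement of anchors at cell centers are choices made for this example.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (cx, cy, w, h) anchors for every cell of a feature map.

    stride: how many input pixels one feature-map cell covers.
    scales: anchor side lengths in pixels (for a square anchor).
    ratios: width / height aspect ratios; the area stays scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=[32, 64], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 36 = 2 * 3 cells * 2 scales * 3 ratios
print(anchors[1])    # (8.0, 8.0, 32.0, 32.0)
```

The number of anchors grows with the product of cells, scales, and ratios, which is why these parameters trade recall against computation.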
6. Core evaluation metrics
The most widely used evaluation protocol for object detection is the COCO standard. Its core metrics are variants of average precision (AP) and average recall (AR), computed at different IOU thresholds and object sizes.
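As a rough illustration of how average precision is computed for one class at one IOU threshold, the sketch below takes detections that have already been marked as true or false positives and integrates the precision-recall curve. It uses all-point interpolation, which is simpler than the 101-point interpolation and threshold averaging of the official COCO evaluation, so treat it as a teaching aid rather than a substitute for the official tooling. The function name and arguments are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class at one IOU threshold (all-point interpolation)."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Walk through detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Four detections, three of them correct, four ground-truth objects in total.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(ap)  # 0.625
```

Whether a detection counts as a true positive depends on its IOU with a ground-truth box, which is how the IOU threshold enters metrics such as AP50.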
7. Summary
This article covers four core foundations of object detection:
- Bounding box representation: converting between xyxy and xywh, normalization, and clipping
- IOU calculation: the core matching and evaluation metric, its formula, and its implementation
- NMS algorithm: the core post-processing step for removing duplicate detections, in its basic form and common variants
- Anchor box mechanism: predefined reference boxes used by two-stage and early single-stage detectors
These concepts underpin most modern object detection models and are worth mastering.

