Object detection theory: bounding boxes, anchor boxes, IoU calculation, and the NMS algorithm explained

Introduction

Object detection is one of the core tasks in computer vision. It performs two sub-tasks at once: classification (identifying "what") and localization (marking "where"). It is widely used in fields such as autonomous driving, security monitoring, and industrial quality inspection. This article walks through the core theory and provides reusable code for the key operations, to help you build a solid foundation.

1. Object detection basics

1.1 Task definition and representation

Input: a single image of shape Height × Width × Channels (H×W×C).

Output: N detected objects, each described by three pieces of information:

Category label (class): which category the object belongs to
Bounding box (bbox): where the object is in the image
Confidence: how sure the model is of the detection

Put simply, an object detection model reads an image and reports which objects it contains, where each one is, and how confident it is in each detection.
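
One way to hold these three fields in code is a small record type. The sketch below is illustrative only: the `Detection` class and its field names are assumptions made for this example, not part of any particular library, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, for illustration)."""
    class_name: str                          # category label
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    confidence: float                        # score in [0, 1]


# The output for one image is simply a list of such records.
detections = [
    Detection("car", (48.0, 120.0, 310.0, 290.0), 0.92),
    Detection("person", (350.0, 80.0, 420.0, 300.0), 0.81),
]

# Keep only the detections the model is reasonably sure about.
confident = [d for d in detections if d.confidence >= 0.9]
print([d.class_name for d in confident])
```

Filtering by confidence like this is usually the first post-processing step before duplicate removal.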

1.2 Core application scenarios

Field	Specific application examples
Autonomous driving	Real-time detection of vehicles/pedestrians/traffic lights/road signs
Security monitoring	Abnormal behavior/intruder/dangerous goods identification
Industrial quality inspection	Product defect/dimensional deviation/surface defect detection
Medical imaging	Locating lung nodules/tumors/fractures and other diseased areas
Smart retail/agriculture	Product shelf inventory/customer movement tracking/crop pests and diseases/fruit maturity detection

1.3 Brief history of algorithm development (key nodes)

2014: R-CNN (applied CNN features to region proposals for object detection)
2015: Faster R-CNN (introduced the Region Proposal Network, RPN, enabling end-to-end training)
2016: YOLOv1/SSD (single-stage detectors with much faster inference)
2020: DETR (first to apply a Transformer directly to object detection, with no anchor boxes)
2020-2023: YOLOv5/v6/v7/v8 (a practical balance of ease of use, accuracy, and speed)

2. Bounding box representation and operation

Bounding boxes are the basic means of localizing objects. Two representations are in common use, and each can be converted into the other.

2.1 Mainstream representation formats

Format type	Representation	Meaning	Typical uses
Corner coordinates (xyxy)	(x_min, y_min, x_max, y_max)	Pixel coordinates of the top-left and bottom-right corners	Drawing, cropping, IoU calculation, and other basic operations
Center coordinates (xywh)	(c_x, c_y, w, h)	Center point coordinates plus width and height	Normalization, anchor box regression

💡 Normalization: divide x coordinates and widths by the image width, and y coordinates and heights by the image height. This maps every value into the [0, 1] range and makes the box independent of the input image size.
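
The normalization tip can be sketched as follows. The function names `normalize_xywh` and `denormalize_xywh` are hypothetical helpers introduced for this example, and the box is assumed to be in center (xywh) format.

```python
def normalize_xywh(bbox_xywh, img_h, img_w):
    """Scale a pixel-space (cx, cy, w, h) box into the [0, 1] range."""
    cx, cy, w, h = bbox_xywh
    return cx / img_w, cy / img_h, w / img_w, h / img_h


def denormalize_xywh(bbox_norm, img_h, img_w):
    """Map a normalized (cx, cy, w, h) box back to pixel coordinates."""
    cx, cy, w, h = bbox_norm
    return cx * img_w, cy * img_h, w * img_w, h * img_h


# A 160x120 box centered at (320, 240) in a 640x480 image.
norm = normalize_xywh((320, 240, 160, 120), img_h=480, img_w=640)
print(norm)  # (0.5, 0.5, 0.25, 0.25)

# The same normalized box maps onto an image of any other size.
print(denormalize_xywh(norm, img_h=960, img_w=1280))  # (640.0, 480.0, 320.0, 240.0)
```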

2.2 Reusable core operation code

def xyxy2xywh(bbox_xyxy):
    """Corner coordinates -> center coordinates."""
    x1, y1, x2, y2 = bbox_xyxy
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return cx, cy, w, h

def xywh2xyxy(bbox_xywh):
    """Center coordinates -> corner coordinates."""
    cx, cy, w, h = bbox_xywh
    x1 = cx - w / 2
    y1 = cy - h / 2
    x2 = cx + w / 2
    y2 = cy + h / 2
    return x1, y1, x2, y2

def clip_bbox(bbox_xyxy, img_h, img_w):
    """Clip a bounding box so that it lies inside the image."""
    x1, y1, x2, y2 = bbox_xyxy
    x1 = max(0, min(x1, img_w))
    y1 = max(0, min(y1, img_h))
    x2 = max(0, min(x2, img_w))
    y2 = max(0, min(y2, img_h))
    return x1, y1, x2, y2

3. IoU (Intersection over Union) calculation

IoU (Intersection over Union) is the central matching and evaluation metric in object detection. It measures how much a predicted box overlaps a ground-truth box.

3.1 Basic concepts

Definition: the ratio of the intersection area of two bounding boxes to the area of their union.

Formula: IoU = intersection area / union area
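
As a worked example (the two boxes are made up for illustration), take A = (0, 0, 10, 10) and B = (5, 5, 15, 15) in corner format. Their overlap is the 5×5 square from (5, 5) to (10, 10):

```latex
\begin{aligned}
|A \cap B| &= (10 - 5) \times (10 - 5) = 25 \\
|A \cup B| &= |A| + |B| - |A \cap B| = 100 + 100 - 25 = 175 \\
\mathrm{IoU}(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{25}{175} = \frac{1}{7} \approx 0.143
\end{aligned}
```

Two equal squares offset by half a side length in each direction therefore reach an IoU of only about 0.14, which shows how quickly the metric drops as boxes drift apart.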

Common threshold reference:

IOU > 0.5: preliminary match (COCO AP50 standard)
IOU > 0.7: good match
IOU > 0.9: almost perfect overlap

3.2 Reusable core computing code

def calculate_iou(box1_xyxy, box2_xyxy):
    """Compute the IoU of two boxes given in corner (xyxy) format."""
    # 1. Coordinates of the intersection rectangle
    x1_inter = max(box1_xyxy[0], box2_xyxy[0])
    y1_inter = max(box1_xyxy[1], box2_xyxy[1])
    x2_inter = min(box1_xyxy[2], box2_xyxy[2])
    y2_inter = min(box1_xyxy[3], box2_xyxy[3])

    # 2. No overlap: the intersection rectangle is empty
    if x2_inter <= x1_inter or y2_inter <= y1_inter:
        return 0.0

    # 3. Intersection and union areas
    inter_area = (x2_inter - x1_inter) * (y2_inter - y1_inter)
    box1_area = (box1_xyxy[2] - box1_xyxy[0]) * (box1_xyxy[3] - box1_xyxy[1])
    box2_area = (box2_xyxy[2] - box2_xyxy[0]) * (box2_xyxy[3] - box2_xyxy[1])
    union_area = box1_area + box2_area - inter_area

    # 4. IoU
    return inter_area / union_area
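
To illustrate the "matching" role of IoU, the sketch below labels each prediction as a true or false positive against a set of ground-truth boxes. It is a simplified greedy scheme written for this example; the names `iou_xyxy` and `match_predictions` are hypothetical, and real benchmark evaluators differ in details such as how already-matched ground-truth boxes are handled.

```python
def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; 0.0 when they do not overlap."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_predictions(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    """Return one True/False (TP/FP) flag per prediction, in descending score order.

    Each ground-truth box can be matched by at most one prediction.
    """
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    used = set()   # indices of ground-truth boxes already matched
    flags = []
    for i in order:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in used:
                continue
            overlap = iou_xyxy(pred_boxes[i], gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_threshold:
            used.add(best_j)
            flags.append(True)    # true positive
        else:
            flags.append(False)   # false positive (duplicate or miss)
    return flags


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (0, 0, 9, 9), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
flags = match_predictions(preds, scores, gt)
print(flags)  # [True, False, False]
```

The second prediction overlaps the first ground-truth box well, but that box is already claimed by a higher-scoring prediction, so it counts as a duplicate.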

4. NMS (non-maximum suppression)

NMS (Non-Maximum Suppression) is the standard post-processing algorithm for removing duplicate detection boxes. Most detectors rely on it; set-prediction models such as DETR are a notable exception.

4.1 Basic process

1. Sort all detection boxes in descending order of confidence score.
2. Select the box with the highest confidence and keep it as the current box.
3. Compute the IoU between the current box and each remaining box, and delete every box whose IoU reaches the threshold.
4. Repeat steps 2-3 until all boxes have been processed.

4.2 Reusable basic NMS code

def nms_basic(boxes_xyxy, scores, iou_threshold=0.5):
    """
    Basic NMS.

    Args:
        boxes_xyxy: list of boxes in corner format [(x1,y1,x2,y2), ...]
        scores: list of matching confidence scores [s1, s2, ...]
        iou_threshold: IoU threshold for suppression

    Returns:
        keep_indices: list of indices of the boxes that are kept
    """
    if len(boxes_xyxy) == 0:
        return []

    # Indices sorted by score, highest first
    sorted_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []

    while sorted_indices:
        # Keep the box with the highest remaining score
        current_idx = sorted_indices[0]
        keep.append(current_idx)

        # Remove the box that was just kept
        sorted_indices = sorted_indices[1:]

        # Suppress boxes whose IoU with the kept box is at or above the threshold
        sorted_indices = [
            idx for idx in sorted_indices
            if calculate_iou(boxes_xyxy[current_idx], boxes_xyxy[idx]) < iou_threshold
        ]

    return keep

4.3 Commonly used optimization variants (introduction)

Variant	Core improvement	Typical uses
Soft-NMS	Lowers the scores of overlapping boxes instead of deleting them outright	Dense object detection (such as crowds or flocks of birds)
DIoU-NMS	Adds a center-distance term to the overlap test, so overlapping boxes whose centers are far apart are less likely to be suppressed	Scenes with heavy occlusion
Class-aware NMS	Runs NMS only among boxes of the same category, so boxes of different categories do not affect each other	All multi-class detection scenarios (the usual default)
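
The Soft-NMS row can be sketched with the Gaussian decay rule, where a neighbour's score is multiplied by exp(-IoU² / sigma) each time a box is selected. The function names, the default `sigma=0.5`, and the `score_threshold` used for final pruning are assumptions made for this example rather than values taken from this article.

```python
import math


def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; 0.0 when they do not overlap."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def soft_nms_gaussian(boxes_xyxy, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS sketch.

    Returns (kept_indices, kept_scores) in the order boxes were selected.
    """
    scores = list(scores)  # work on a copy; scores are decayed in place
    remaining = [i for i in range(len(scores)) if scores[i] >= score_threshold]
    keep, kept_scores = [], []
    while remaining:
        # Select the remaining box with the highest (possibly decayed) score.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        kept_scores.append(scores[best])
        # Decay neighbours: the larger the overlap, the stronger the penalty.
        for i in remaining:
            overlap = iou_xyxy(boxes_xyxy[best], boxes_xyxy[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop boxes whose score has fallen below the pruning threshold.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, kept_scores


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
keep, kept_scores = soft_nms_gaussian(boxes, [0.9, 0.8, 0.7])
print(keep)  # [0, 2, 1]
```

Unlike hard NMS, the heavily overlapping box 1 survives, but its score is decayed so that it now ranks below the non-overlapping box 2.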

5. Anchor box mechanism

Anchor boxes are predefined reference boxes that two-stage and early single-stage detectors (such as Faster R-CNN, SSD, and YOLOv2-v3) use to handle objects of different scales and shapes.

5.1 Basic concepts

At each grid point of the feature map, a set of reference boxes with predefined sizes and aspect ratios is placed. For each anchor box, the model only needs to predict:

Bounding box offsets relative to the anchor box (not absolute coordinates)
Objectness score: whether the anchor box contains an object
Class probabilities for the object
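
How offsets turn back into a box depends on the detector. As a sketch, the code below uses the Faster R-CNN style parameterization: center shifts are expressed as fractions of the anchor size, and width and height are scaled through an exponential. This article does not specify an encoding, so treat the choice, and the hypothetical function names, as assumptions; YOLO variants use different formulas.

```python
import math


def decode_anchor_offsets(anchor_xywh, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor_xywh
    tx, ty, tw, th = offsets
    cx = ax + tx * aw          # shift the center by a fraction of the anchor width
    cy = ay + ty * ah          # ... and of the anchor height
    w = aw * math.exp(tw)      # scale the size multiplicatively
    h = ah * math.exp(th)
    return cx, cy, w, h


def encode_anchor_offsets(anchor_xywh, target_xywh):
    """Inverse of decode: the regression targets for a ground-truth box."""
    ax, ay, aw, ah = anchor_xywh
    gx, gy, gw, gh = target_xywh
    return (gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)


anchor = (100.0, 100.0, 64.0, 64.0)

# Zero offsets reproduce the anchor itself.
print(decode_anchor_offsets(anchor, (0.0, 0.0, 0.0, 0.0)))    # (100.0, 100.0, 64.0, 64.0)

# Shift right by half the anchor width and up by a quarter of its height.
print(decode_anchor_offsets(anchor, (0.5, -0.25, 0.0, 0.0)))  # (132.0, 84.0, 64.0, 64.0)
```

Predicting small offsets around a sensible reference box is an easier regression problem than predicting absolute coordinates from scratch.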

5.2 Key parameters

Parameter	Function	Example values
Scale	Size of the anchor box	[32, 64, 128, 256]
Aspect ratio (AR)	Shape of the anchor box, as width/height	[0.5, 1.0, 2.0]
Stride	Spacing, in input-image pixels, between adjacent feature-map grid points	[8, 16, 32] (multi-scale feature maps)
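
The three parameters combine as in the sketch below, which lays out anchors over a single feature map. Two conventions here are assumptions made for this example: anchors are centered on each grid cell, and each anchor keeps an area of scale² while the aspect ratio redistributes it between width and height. The function name `generate_anchors` is hypothetical.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Return anchors as (cx, cy, w, h) tuples in input-image pixels.

    One anchor per (scale, ratio) pair is placed at every feature-map cell,
    where ratio = width / height.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this cell, mapped back to input-image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(
    feat_h=2, feat_w=3, stride=16,
    scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0],
)
print(len(anchors))  # 36 = 2 * 3 cells x 2 scales x 3 ratios
print(anchors[1])    # (8.0, 8.0, 32.0, 32.0)
```

The anchor count grows with feature-map area times the number of scale and ratio combinations, which is why real detectors produce thousands of candidate boxes per image.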

6. Core evaluation indicators

The most widely used evaluation protocol for object detection is the COCO standard. Its core metrics are:

Metric	Full name	Meaning
AP50 / mAP@0.5	Average Precision at IoU=0.5	Average precision, averaged over all categories, at an IoU threshold of 0.5
AP75 / mAP@0.75	Average Precision at IoU=0.75	Same, at the stricter IoU threshold of 0.75
mAP@0.5:0.95	COCO mAP	Average precision averaged over 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05 (the primary COCO metric)
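
For a single category and a single IoU threshold, average precision is the area under the precision-recall curve. The sketch below computes it with all-point interpolation from a list of true/false-positive flags sorted by descending confidence. This is a simplification written for this example: the official COCO evaluator samples precision at 101 fixed recall levels instead, so its numbers can differ slightly. The function name `average_precision` is hypothetical.

```python
def average_precision(tp_flags, num_gt):
    """Area under the precision-recall curve (all-point interpolation).

    tp_flags: True/False per prediction, sorted by descending confidence.
    num_gt:   number of ground-truth boxes for this category.
    """
    if num_gt == 0:
        return 0.0
    recalls, precisions = [], []
    tp = 0
    for k, is_tp in enumerate(tp_flags, start=1):
        tp += int(is_tp)
        recalls.append(tp / num_gt)   # fraction of ground truth found so far
        precisions.append(tp / k)     # fraction of predictions so far that are correct
    # Replace each precision with the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three predictions, two ground-truth boxes: hit, miss, hit.
ap = average_precision([True, False, True], num_gt=2)
print(round(ap, 4))  # 0.8333
```

Repeating this per category and averaging gives mAP at one threshold; repeating that over the 10 IoU thresholds and averaging again gives mAP@0.5:0.95.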

Hands-on practice: visualize bounding boxes, IoU, and NMS in Python to deepen your understanding.
Read the source code: the YOLOv5/v8 source code contains efficient, well-structured implementations of these core algorithms.
Easy first, then difficult: master basic IoU and NMS before moving on to variants such as Soft-NMS and DIoU-NMS.

7. Summary

This article covered four core foundations of object detection:

Bounding box representation: converting between xyxy and xywh, normalization, and clipping
IoU calculation: the core matching and evaluation metric, its formula and implementation
NMS algorithm: the standard post-processing step for removing duplicate detections, in its basic form and common variants
Anchor box mechanism: predefined reference boxes used by two-stage and early single-stage detectors

These concepts underpin most modern object detection models and are worth mastering.
