Object detection theory: a detailed explanation of bounding boxes, anchor boxes, IoU calculation, and the NMS algorithm

Introduction

Object detection is one of the core tasks in computer vision. It combines two sub-tasks: classification (identifying what an object is) and localization (marking where it is). It is widely used in fields such as autonomous driving, security monitoring, and industrial quality inspection. This article walks through the core theory and provides reusable code for the key operations, giving you a solid foundation.


1. Object detection basics

1.1 Task definition and representation

Input: a single image of shape Height × Width × Channels (H×W×C).

Output: N detected objects, each described by three pieces of information:

  • Category label (class): which category the object belongs to
  • Bounding box (bbox): the position of the object in the image
  • Confidence: how sure the model is of the detection

Put simply, an object detection model reads an image and reports which objects it contains, where each one is, and a confidence score for each detection.
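
The three pieces of information above map naturally onto a small record type. The following is a minimal sketch, not any library's API: the `Detection` class and the `filter_by_confidence` helper are hypothetical names introduced here for illustration, and the box is assumed to be in pixel corner coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not from any library)."""
    label: str                               # category label, e.g. "car"
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    confidence: float                        # model confidence in [0, 1]


def filter_by_confidence(detections: List[Detection], min_conf: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is at least min_conf."""
    return [d for d in detections if d.confidence >= min_conf]


# Made-up output for one image: N = 3 detections.
detections = [
    Detection("car", (34.0, 50.0, 210.0, 180.0), 0.92),
    Detection("person", (220.0, 40.0, 260.0, 170.0), 0.81),
    Detection("dog", (5.0, 120.0, 60.0, 175.0), 0.30),
]
confident = filter_by_confidence(detections)
print([d.label for d in confident])  # ['car', 'person']
```

Dropping low-confidence detections like this is a common first post-processing step; the 0.5 cutoff is only an illustrative default.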

1.2 Core application scenarios

| Field | Application examples |
| --- | --- |
| Autonomous driving | Real-time detection of vehicles, pedestrians, traffic lights, and road signs |
| Security monitoring | Identification of abnormal behavior, intruders, and dangerous goods |
| Industrial quality inspection | Detection of product defects, dimensional deviations, and surface defects |
| Medical imaging | Locating lung nodules, tumors, fractures, and other diseased areas |
| Smart retail / agriculture | Shelf inventory, customer movement tracking, crop pest and disease detection, fruit maturity detection |

1.3 Brief history of algorithm development (key milestones)

  • 2014: R-CNN (brought CNN features to region-based object detection)
  • 2015: Faster R-CNN (introduced the Region Proposal Network, RPN, enabling end-to-end training)
  • 2016: YOLOv1/SSD (single-stage detectors with much higher speed)
  • 2020: DETR (applied a Transformer directly to object detection for the first time, with no anchor boxes)
  • 2020-2023: YOLOv5/v6/v7/v8 (a practical balance of ease of use, accuracy, and speed)

2. Bounding box representations and operations

A bounding box is how a detector expresses an object's location. There are two mainstream representations, and each can be converted into the other.

2.1 Mainstream representation formats

| Format | Representation | Meaning | Applicable scenarios |
| --- | --- | --- | --- |
| Corner coordinates (xyxy) | (x_min, y_min, x_max, y_max) | Pixel coordinates of the upper-left and lower-right corners | Basic operations such as drawing, cropping, and IoU calculation |
| Center coordinates (xywh) | (c_x, c_y, w, h) | Center point coordinates plus width and height | Normalization, anchor box regression |

💡 Normalization technique: divide x coordinates by the image width and y coordinates by the image height to map them into the [0, 1] range. The same box description then stays valid when the image is resized.
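
The normalization tip can be sketched in a few lines. This is an illustrative example under assumptions: the helper names `normalize_xyxy` and `denormalize_xyxy` are hypothetical, and the boxes are assumed to be in corner (xyxy) pixel coordinates.

```python
def normalize_xyxy(bbox_xyxy, img_w, img_h):
    """Map pixel corner coordinates into the [0, 1] range."""
    x1, y1, x2, y2 = bbox_xyxy
    # x values are divided by the width, y values by the height.
    return x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h


def denormalize_xyxy(bbox_norm, img_w, img_h):
    """Map normalized corner coordinates back to pixels for a given image size."""
    x1, y1, x2, y2 = bbox_norm
    return x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h


# A box on a 640x480 image (made-up numbers).
norm = normalize_xyxy((64, 48, 320, 240), img_w=640, img_h=480)
# The same normalized box projected onto a 1280x960 version of the image.
resized = denormalize_xyxy(norm, img_w=1280, img_h=960)
print(norm)
print(resized)
```

Because the normalized box is independent of resolution, it can be stored once and projected onto any resized copy of the image.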

2.2 Reusable core operation code

def xyxy2xywh(bbox_xyxy):
    """Corner coordinates -> center coordinates."""
    x1, y1, x2, y2 = bbox_xyxy
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return cx, cy, w, h

def xywh2xyxy(bbox_xywh):
    """Center coordinates -> corner coordinates."""
    cx, cy, w, h = bbox_xywh
    x1 = cx - w / 2
    y1 = cy - h / 2
    x2 = cx + w / 2
    y2 = cy + h / 2
    return x1, y1, x2, y2

def clip_bbox(bbox_xyxy, img_h, img_w):
    """Clip a bounding box so it lies within the image boundaries."""
    x1, y1, x2, y2 = bbox_xyxy
    x1 = max(0, min(x1, img_w))
    y1 = max(0, min(y1, img_h))
    x2 = max(0, min(x2, img_w))
    y2 = max(0, min(y2, img_h))
    return x1, y1, x2, y2

3. IoU (Intersection over Union) calculation

IoU (Intersection over Union) is the central evaluation and matching metric in object detection. It measures how much a predicted box overlaps a ground-truth box.

3.1 Basic concepts

Definition: the ratio of the intersection area of two bounding boxes to the area of their union.

Formula: IoU = intersection area / union area
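
A worked example may make the formula concrete. The sketch below applies it by hand to two made-up 10×10 boxes in corner (xyxy) form; the numbers are illustrative only.

```python
# Two 10x10 boxes in (x_min, y_min, x_max, y_max) form, offset by 5 pixels.
box_a = (0, 0, 10, 10)
box_b = (5, 5, 15, 15)

# Overlap rectangle: max of the top-left corners, min of the bottom-right corners.
inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])  # 10 - 5 = 5
inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])  # 10 - 5 = 5
inter_area = max(0, inter_w) * max(0, inter_h)               # 5 * 5 = 25

area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])       # 100
area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])       # 100

# The overlap is counted in both areas, so subtract it once.
union_area = area_a + area_b - inter_area                    # 175

iou = inter_area / union_area                                # 25 / 175 = 1/7
print(round(iou, 4))  # 0.1429
```

Even though a quarter of each box is covered, the IoU is only about 0.14, which shows how quickly the metric drops as boxes drift apart.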

Common reference thresholds:

  • IoU ≥ 0.5: counted as a match (the COCO AP50 criterion)
  • IoU ≥ 0.7: good match
  • IoU ≥ 0.9: almost perfect overlap

3.2 Reusable core computing code

def calculate_iou(box1_xyxy, box2_xyxy):
    """Compute the IoU of two boxes given in corner (xyxy) coordinates."""
    # 1. Corner coordinates of the intersection rectangle
    x1_inter = max(box1_xyxy[0], box2_xyxy[0])
    y1_inter = max(box1_xyxy[1], box2_xyxy[1])
    x2_inter = min(box1_xyxy[2], box2_xyxy[2])
    y2_inter = min(box1_xyxy[3], box2_xyxy[3])

    # 2. No overlap: the intersection rectangle is empty
    if x2_inter <= x1_inter or y2_inter <= y1_inter:
        return 0.0

    # 3. Intersection and union areas
    inter_area = (x2_inter - x1_inter) * (y2_inter - y1_inter)
    box1_area = (box1_xyxy[2] - box1_xyxy[0]) * (box1_xyxy[3] - box1_xyxy[1])
    box2_area = (box2_xyxy[2] - box2_xyxy[0]) * (box2_xyxy[3] - box2_xyxy[1])
    union_area = box1_area + box2_area - inter_area

    # 4. IoU
    return inter_area / union_area
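
Since IoU is also used for matching, here is a small sketch of how predictions might be paired with ground-truth boxes. It is an assumption-laden illustration rather than a standard routine: `iou_xyxy` is a compact stand-in for the function above so the block runs on its own, `match_predictions` is a hypothetical helper, and the boxes are made up.

```python
def iou_xyxy(a, b):
    """Compact IoU for two corner-format boxes (stand-in for calculate_iou)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_predictions(pred_boxes, gt_boxes, iou_threshold=0.5):
    """For each prediction, return the index of the best-overlapping
    ground-truth box, or None if the best IoU is below the threshold."""
    matches = []
    for pred in pred_boxes:
        best_idx, best_iou = None, 0.0
        for j, gt in enumerate(gt_boxes):
            value = iou_xyxy(pred, gt)
            if value > best_iou:
                best_idx, best_iou = j, value
        matches.append(best_idx if best_iou >= iou_threshold else None)
    return matches


gt = [(0, 0, 10, 10), (20, 20, 40, 40)]
pred = [(1, 1, 10, 10), (22, 18, 41, 40), (50, 50, 60, 60)]
print(match_predictions(pred, gt))  # [0, 1, None]
```

This sketch lets several predictions claim the same ground-truth box. Evaluation protocols usually enforce one-to-one matching, which would need an extra bookkeeping step.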

4. NMS (non-maximum suppression)

NMS (Non-Maximum Suppression) is the standard post-processing algorithm for removing duplicate detection boxes. Most detectors use it; set-prediction models such as DETR are a notable exception.

4.1 Basic process

  1. Sort all detection boxes in descending order of confidence score
  2. Select the box with the highest confidence and keep it as the current box
  3. Compute the IoU between the current box and each remaining box, and delete the boxes whose IoU is at or above the threshold
  4. Repeat steps 2-3 until all boxes have been processed

4.2 Reusable basic NMS code

def nms_basic(boxes_xyxy, scores, iou_threshold=0.5):
    """
    Basic NMS.

    Args:
        boxes_xyxy: list of corner-format boxes [(x1, y1, x2, y2), ...]
        scores: list of matching confidence scores [s1, s2, ...]
        iou_threshold: IoU threshold at or above which a box is suppressed

    Returns:
        keep_indices: list of indices of the boxes that are kept
    """
    if len(boxes_xyxy) == 0:
        return []

    # Indices sorted by score, highest first
    sorted_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []

    while sorted_indices:
        # Keep the current highest-scoring box
        current_idx = sorted_indices[0]
        keep.append(current_idx)

        # Remove the box that was just kept
        sorted_indices = sorted_indices[1:]

        # Suppress boxes that overlap the kept box too much
        sorted_indices = [
            idx for idx in sorted_indices
            if calculate_iou(boxes_xyxy[current_idx], boxes_xyxy[idx]) < iou_threshold
        ]

    return keep

4.3 Commonly used optimization variants (introduction)

| Variant | Core improvement | Applicable scenarios |
| --- | --- | --- |
| Soft-NMS | Lowers the scores of overlapping boxes instead of deleting them outright | Dense object detection (e.g. crowds, flocks of birds) |
| DIoU-NMS | Adds a center-distance term to the overlap test, so lower-scoring boxes are suppressed mainly when their centers are close to the center of the current box | Scenes with heavy occlusion |
| Category-aware NMS | Runs NMS only among boxes of the same category; boxes of different categories do not affect each other | Multi-category detection (the usual default) |
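
To illustrate the first row of the table, here is a minimal sketch of Soft-NMS with Gaussian score decay. Several things are assumptions made for this example: the function name `soft_nms_gaussian`, the `sigma` and `score_threshold` defaults, and the local `_iou` helper (a compact stand-in for the IoU function, so that the block runs on its own).

```python
import math


def _iou(a, b):
    """Compact IoU for two corner-format boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def soft_nms_gaussian(boxes_xyxy, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, decayed_score) pairs in the order boxes are selected."""
    scores = list(scores)                       # work on a copy
    remaining = list(range(len(boxes_xyxy)))
    kept = []
    while remaining:
        # Pick the box with the highest current (possibly decayed) score.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))

        survivors = []
        for i in remaining:
            overlap = _iou(boxes_xyxy[best], boxes_xyxy[i])
            # Gaussian decay: more overlap means a stronger penalty; zero overlap leaves the score unchanged.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
            if scores[i] >= score_threshold:    # drop boxes whose score has collapsed
                survivors.append(i)
        remaining = survivors
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = soft_nms_gaussian(boxes, [0.9, 0.8, 0.7])
print([idx for idx, _ in kept])  # [0, 2, 1]
```

Box 1 overlaps box 0 heavily, so hard NMS at a 0.5 threshold would delete it. Here it survives with a reduced score and is ranked last, which is what helps in crowded scenes.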

5. Anchor box mechanism

Anchor boxes are predefined reference boxes used by two-stage and early single-stage detectors (such as Faster R-CNN, SSD, and YOLOv2-v3) to handle objects of different scales and shapes.

5.1 Basic concepts

At each grid point of the feature map, a set of reference boxes with predefined sizes and aspect ratios is placed. For each anchor box, the model then only needs to predict:

  1. The bounding box offset relative to the anchor box (not absolute coordinates)
  2. An objectness score: whether the anchor box contains an object
  3. The category probabilities of the object
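
The offset prediction in item 1 can be sketched as follows. This assumes one common parameterization, the one used in Faster R-CNN (center shifts scaled by the anchor size, log-scaled width and height). Other detectors use different formulas, and the function names `decode_anchor` and `encode_anchor` are hypothetical.

```python
import math


def decode_anchor(anchor_xywh, deltas):
    """Turn predicted offsets (tx, ty, tw, th) into a box in (cx, cy, w, h) form."""
    ax, ay, aw, ah = anchor_xywh
    tx, ty, tw, th = deltas
    cx = ax + tx * aw          # center shift, in units of the anchor width
    cy = ay + ty * ah          # center shift, in units of the anchor height
    w = aw * math.exp(tw)      # log-space scaling keeps the width positive
    h = ah * math.exp(th)
    return cx, cy, w, h


def encode_anchor(anchor_xywh, gt_xywh):
    """Inverse of decode_anchor: the regression target for a ground-truth box."""
    ax, ay, aw, ah = anchor_xywh
    gx, gy, gw, gh = gt_xywh
    return (gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)


anchor = (100.0, 100.0, 64.0, 64.0)
print(decode_anchor(anchor, (0.0, 0.0, 0.0, 0.0)))    # (100.0, 100.0, 64.0, 64.0)
print(decode_anchor(anchor, (0.25, -0.5, 0.0, 0.0)))  # (116.0, 68.0, 64.0, 64.0)
```

All-zero offsets reproduce the anchor itself, so the network only has to learn small corrections around a reasonable starting box.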

5.2 Key parameters

| Parameter | Function | Example values |
| --- | --- | --- |
| Scale | Size of the anchor box | [32, 64, 128, 256] |
| Aspect ratio (AR) | Shape of the anchor box, as width/height | [0.5, 1.0, 2.0] |
| Stride | Step in original-image pixels between adjacent feature-map grid points | [8, 16, 32] (multi-scale feature maps) |
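
The three parameters combine as in the sketch below, which generates anchors for a single feature map. It is illustrative only: `generate_anchors` is a hypothetical name, anchors are centered on each grid cell (some implementations place them at the cell corner instead), and each anchor keeps an area of scale² while the aspect ratio changes its shape.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Return anchors as (cx, cy, w, h) tuples in original-image pixels."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this grid cell, projected back onto the original image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ar in aspect_ratios:       # ar = width / height
                    w = scale * math.sqrt(ar)
                    h = scale / math.sqrt(ar)  # so that w * h == scale ** 2
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 * 3 grid cells * 2 scales * 3 ratios = 36
print(anchors[1])    # (8.0, 8.0, 32.0, 32.0): first cell, scale 32, ratio 1.0
```

The anchor count grows as cells × scales × ratios, which is why multi-scale detectors often assign only a few scales to each stride.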

6. Core evaluation indicators

The most widely used evaluation protocol for object detection is the COCO standard. Its core metrics are:

| Metric | Full name | Meaning |
| --- | --- | --- |
| AP50 / mAP@0.5 | Average Precision at IoU=0.5 | Average precision over all categories at an IoU threshold of 0.5 |
| AP75 / mAP@0.75 | Average Precision at IoU=0.75 | Average precision at an IoU threshold of 0.75 (stricter) |
| mAP@0.5:0.95 | COCO mAP | Mean of the average precision at 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05 (the primary COCO metric) |
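
The sketch below shows roughly how these numbers are assembled. It is a simplification: it computes all-point interpolated AP for one category at one IoU threshold, whereas the official COCO tooling samples precision at 101 recall points, so values may differ slightly. The function names are hypothetical, and the true-positive flags are assumed to come from an IoU-based matching step.

```python
def average_precision(is_true_positive, num_gt):
    """AP for one category at one IoU threshold.

    is_true_positive: one flag per detection, sorted by descending confidence.
    num_gt: number of ground-truth boxes of this category.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Precision envelope: each point takes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def coco_iou_thresholds():
    """The 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


# Three detections (hit, miss, hit) against two ground-truth boxes.
ap = average_precision([True, False, True], num_gt=2)
print(round(ap, 4))  # 0.8333
print(coco_iou_thresholds())
```

mAP@0.5:0.95 is then the mean of such AP values over all categories and over the ten thresholds, with the true-positive flags recomputed at each threshold.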

Suggestions for further study:

  1. Hands-on practice: use Python to visualize bounding boxes, IoU, and the NMS process to deepen your understanding.
  2. Read the source code: the YOLOv5/v8 source code contains efficient, well-structured implementations of these core algorithms.
  3. Easy first, then difficult: master basic IoU and NMS first, then learn variants such as Soft-NMS and DIoU-NMS.

7. Summary

This article covered four core foundations of object detection:

  1. Bounding box representation: converting between xyxy and xywh, normalization, and clipping
  2. IoU calculation: the core matching and evaluation metric, its formula, and its implementation
  3. NMS algorithm: the core post-processing step for removing duplicate detections, in its basic form and common variants
  4. Anchor box mechanism: predefined reference boxes for two-stage and early single-stage detectors

These concepts underpin most modern object detection models and are worth mastering.