title: Detailed explanation of YOLO series: Real-time target detection revolution from YOLOv1 to YOLOv10 | Daoman PythonAI
description: In-depth analysis of the YOLO (You Only Look Once) series of target detection algorithms, the evolution from YOLOv1 to YOLOv10, including detailed architecture analysis, PyTorch implementation and practical application scenarios.
keywords: [YOLO, target detection, real-time detection, YOLOv1, YOLOv5, YOLOv8, YOLOv10, deep learning, computer vision, PyTorch]

Detailed explanation of YOLO series: Real-time target detection revolution from YOLOv1 to YOLOv10

Introduction

In computer vision, object detection is the combined task of "classification + localization": the model must not only identify what is in the image (What), but also draw a box around where it is (Where).

Traditional two-stage algorithms (R-CNN, Fast R-CNN, Faster R-CNN) led in accuracy for many years, but their two-step logic of first generating region proposals and then classifying and refining them makes it hard to meet the latency requirements of real-time scenarios such as autonomous driving, industrial quality inspection, and live-stream interaction. The deadlock was broken in 2015, when Joseph Redmon and colleagues published the groundbreaking paper "You Only Look Once".

YOLO reduces object detection to a single regression problem, with no region proposal step. By "looking at the entire image" once, it outputs the classes, locations and confidence scores of all objects, striking a strong balance between speed and accuracy and becoming one of the most widely used object detection paradigms in industry over the past 10 years.


1. Minimal overview of YOLO series

1.1 Core design philosophy

"Simple, direct and fast" - this is the soul of YOLO that has continued from v1 to this day:

  • Abandon the complex two-stage pipeline and process the full image end to end
  • Utilize global context information to reduce background false detections
  • Supports single GPU training and multi-platform deployment

1.2 Key evolution nodes (lite version)

In order to avoid information overload, let’s sort out the core milestone versions first:

graph LR
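A[YOLOv1<br>2015: foundation, two-stage → one-stage] --> B[YOLOv2/v3<br>2016-2018: industrial backbone takes shape]
B --> C[YOLOv5/v8<br>2020-2023: Ultralytics-led ecosystem era]
C --> D[YOLOv9/v10<br>2024: dual breakthroughs in architecture + post-processing]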

2. Understand the core principles of YOLOv1 from scratch

2.1 Grid division: assigning "detection responsibilities"

The first step of YOLOv1 is to evenly divide the input image (448×448) into a grid of S×S=7×7:

  • Each grid cell is responsible only for objects whose center point falls inside that cell (at most 1 class and 2 boxes per cell)
  • This design naturally takes advantage of the global context and avoids the problem that the two-stage algorithm only focuses on local candidate boxes.
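def visualize_yolov1_grid(image, grid_size=7, center_threshold=5):
    """
    Visualize YOLOv1's grid division and the assignment of detection
    responsibility by object center point.
    Args:
        image: input image (H, W, 3)
        grid_size: number of grid cells per side
        center_threshold: threshold for deciding whether a center point lies
            in a cell (kept from the original signature; unused in this example)
    Returns:
        visualization image with the grid and annotations drawn
    """
    import cv2
    import numpy as np
    img_copy = image.copy()
    h, w = img_copy.shape[:2]
    cell_h, cell_w = h // grid_size, w // grid_size
    
    # 1. Draw the grid lines
    for i in range(1, grid_size):
        cv2.line(img_copy, (i*cell_w, 0), (i*cell_w, h), (255, 255, 255), 1)
        cv2.line(img_copy, (0, i*cell_h), (w, i*cell_h), (255, 255, 255), 1)
    
    # 2. Hypothetical ground-truth centers (example data for a 448x448 image)
    gt_centers = np.array([[180, 120], [320, 380], [400, 200]])
    for cx, cy in gt_centers:
        # Find the owning cell; clamp so points on the far edge stay in range,
        # and cast to plain int so OpenCV accepts the coordinates
        grid_x = min(int(cx) // cell_w, grid_size - 1)
        grid_y = min(int(cy) // cell_h, grid_size - 1)
        # Draw the center point and highlight the responsible cell
        cv2.circle(img_copy, (int(cx), int(cy)), 5, (0, 255, 0), -1)
        cv2.rectangle(
            img_copy, 
            (grid_x*cell_w, grid_y*cell_h), 
            ((grid_x+1)*cell_w-1, (grid_y+1)*cell_h-1), 
            (0, 255, 0), 2
        )
    return img_copy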

2.2 Output tensor: all information in a single pass

For a configuration with a 7×7 grid, 2 boxes per grid cell, and the 80 COCO classes, the output dimensions of YOLOv1 are: 7 × 7 × (2 × 5 + 80) = 7 × 7 × 90. (The original paper used the 20 PASCAL VOC classes, which gives 7 × 7 × 30.)

Break down the meaning of each part:

| Module | Content | Dimension description |
|---|---|---|
| Bounding box | (x, y, w, h, c) | 2 per grid cell, 2 × 5 = 10 dimensions in total |
| Class probability | P(Class_i \| Object) | 1 set per grid cell, 80 dimensions in total (valid only when an object falls within the cell) |

  • x, y: the offset of the box center relative to the upper-left corner of the grid cell
  • w, h: the width and height of the box, normalized by the width and height of the entire image
  • c: confidence = P(Object) × IoU(Box, GT)
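
To make this layout concrete, the following is a minimal sketch of how a 7 × 7 × 90 prediction could be split into boxes and class scores. The function name `split_yolov1_output` and the per-cell channel order (all box values first, then class probabilities) are assumptions made for illustration; the original Darknet code lays out memory differently.

```python
import numpy as np

def split_yolov1_output(pred, S=7, B=2, C=80):
    """Split a YOLOv1-style prediction into image-relative boxes and class scores.

    Assumed (illustrative) layout per cell: [B * (x, y, w, h, c), C class probs].
    """
    pred = np.asarray(pred, dtype=np.float32).reshape(S, S, B * 5 + C)
    boxes = pred[..., : B * 5].reshape(S, S, B, 5)   # (S, S, B, 5)
    class_probs = pred[..., B * 5 :]                 # (S, S, C)

    # Axis 0 indexes grid rows (y), axis 1 indexes grid columns (x).
    col = np.arange(S).reshape(1, S, 1)
    row = np.arange(S).reshape(S, 1, 1)

    # Cell-relative center offsets -> centers normalized to the whole image.
    cx = (boxes[..., 0] + col) / S
    cy = (boxes[..., 1] + row) / S
    w, h = boxes[..., 2], boxes[..., 3]              # already image-normalized
    conf = boxes[..., 4]

    # Class-specific score = box confidence * conditional class probability.
    scores = conf[..., None] * class_probs[:, :, None, :]   # (S, S, B, C)
    xywh = np.stack([cx, cy, w, h], axis=-1)                # (S, S, B, 4)
    return xywh, scores
```

Because the class probabilities are shared by both boxes of a cell, a cell can report only one class, which is the "maximum 1 category" limitation mentioned in section 2.1.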

3. Core improvements in key versions (skipping branches outside the industrial mainstream)

3.1 YOLOv2/v3: Make up for the shortcomings in accuracy

After v1, the Redmon team released v2 and v3 in succession, largely addressing v1's inaccurate localization and missed detections of small objects:

| Improvement | Approach | Effect |
|---|---|---|
| Anchor boxes | Borrowed from Faster R-CNN; prior boxes are generated by K-means clustering | Improves recall and reduces localization offset |
| Multi-scale training | Randomly change the input resolution (between 320 and 608, in multiples of 32) every 10 batches | Makes the model more robust to targets of different sizes |
| Darknet-53 + feature pyramid (FPN) | Darknet-53 as the backbone (with residual connections); FPN-style fusion of the three scales P3/P4/P5 | Greatly improves small-object detection |
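
The K-means step for anchor boxes can be sketched as follows. This is a minimal illustration, not the Darknet implementation: it clusters (width, height) pairs using 1 − IoU as the distance, as described in the YOLOv2 paper, and the function names, the random initialization and the stopping rule are assumptions.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes, all aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors with distance d = 1 - IoU."""
    boxes = np.asarray(boxes, dtype=np.float64)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        # Smallest (1 - IoU) distance is the same as largest IoU.
        assign = np.argmax(wh_iou(boxes, anchors), axis=1)
        new_anchors = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):              # keep the old anchor if a cluster is empty
                new_anchors[j] = members.mean(axis=0)
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    # Sort by area so small anchors come first.
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the error no longer scales with box size.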

3.2 YOLOv8: The most mainstream ecosystem choice at present

YOLOv8, launched by Ultralytics in 2023, is currently one of the most popular object detection frameworks in industry and in competitions. Beyond its strong accuracy/speed trade-off, it supports multiple tasks (detection, segmentation, classification, and pose estimation), and its ecosystem is very complete:

  • Anchor-Free: No need to predefine a priori boxes, simplifying the process
  • Decoupled Head: Separation of classification head and regression head (resolving task conflicts)
  • Task-Aligned Assigner (TAL): Dynamically assign labels (replaces traditional IoU assignment)
  • Mosaic augmentation: splices several training images into one (4 images by default in the Ultralytics implementation; the Mosaic9 variant uses 9)
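
The anchor-free and decoupled-head ideas can be illustrated with a small PyTorch sketch. It is a simplification under stated assumptions, not the Ultralytics `Detect` head: the class name `DecoupledHeadSketch`, the hidden width and the direct regression of four distances are illustrative, whereas the official head predicts a distribution over distance bins for each side.

```python
import torch
import torch.nn as nn

class DecoupledHeadSketch(nn.Module):
    """Illustrative anchor-free head for one feature level (not the official head)."""

    def __init__(self, in_channels: int, num_classes: int = 80, hidden: int = 64):
        super().__init__()

        def branch(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden, 3, padding=1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU(),
                nn.Conv2d(hidden, out_channels, 1),
            )

        # Separate branches so classification and regression do not share features.
        self.cls_branch = branch(num_classes)
        self.reg_branch = branch(4)  # distances: left, top, right, bottom

    def forward(self, x: torch.Tensor):
        cls_logits = self.cls_branch(x)
        ltrb = torch.relu(self.reg_branch(x))  # distances must be non-negative
        return cls_logits, ltrb

def decode_ltrb(ltrb: torch.Tensor, stride: int) -> torch.Tensor:
    """Convert per-cell (l, t, r, b) distances in grid units into xyxy pixel boxes."""
    _, _, h, w = ltrb.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=ltrb.device),
        torch.arange(w, device=ltrb.device),
        indexing="ij",
    )
    cx = (xs + 0.5).to(ltrb.dtype)  # cell centers replace predefined anchor boxes
    cy = (ys + 0.5).to(ltrb.dtype)
    l, t, r, b = ltrb.unbind(dim=1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=1) * stride
```

Each cell predicts distances from its own center to the four box sides, so no prior box shapes need to be clustered or tuned per dataset.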

4. PyTorch implementation: YOLOv8 core components (lite version)

In order to allow readers to truly understand the internal logic of YOLO, we reproduce the three core modules of YOLOv8 (for complete implementation, please refer to Ultralytics official code).

4.1 Basic module: Conv + C2f

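from __future__ import annotations  # lets the `int | tuple` hints work on Python < 3.10

import torch
import torch.nn as nn
import torch.nn.functional as F

def autopad(k: int | tuple, p: int | tuple | None = None, d: int = 1) -> int | tuple:
    """Compute the padding that keeps output size equal to input size (when stride=1)."""
    if d > 1:
        # With dilation, pad for the effective kernel size
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p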

class Conv(nn.Module):
    """Standard YOLOv8 convolution block: Conv2d + BatchNorm2d + SiLU activation."""
    default_act = nn.SiLU()  # default activation shared by all Conv blocks
    
    def __init__(self, c1: int, c2: int, k: int = 1, s: int = 1, p: int | tuple | None = None, g: int = 1, act: bool | nn.Module = True):
        """
        Args:
            c1: number of input channels
            c2: number of output channels
            k: kernel size
            s: stride
            p: padding (computed automatically by default)
            g: number of groups for grouped convolution
            act: activation (True = SiLU, False = none, nn.Module = custom)
        """
        super().__init__()
        # bias=False because BatchNorm already provides a learnable shift
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

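class C2f(nn.Module):
    """YOLOv8 C2f module: a lightweight CSP-style block that balances accuracy and speed."""
    def __init__(self, c1: int, c2: int, n: int = 1, shortcut: bool = False, g: int = 1, e: float = 0.5):
        """
        Args:
            c1: number of input channels
            c2: number of output channels
            n: number of Bottleneck blocks
            shortcut: whether to use residual connections
            g: number of groups for grouped convolution
            e: expansion ratio for the hidden channels
        """
        super().__init__()
        self.c = int(c2 * e)  # number of hidden channels
        # First conv: produces 2*c channels, later split into two halves
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        # Second conv: fuses the features of all branches
        self.cv2 = Conv((2 + n) * self.c, c2, 1)
        # n Bottleneck blocks
        self.m = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut, g, e=1.0) 
            for _ in range(n)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: cv1, then split into two halves along the channel dimension
        y1, y2 = self.cv1(x).chunk(2, dim=1)
        # Step 2: feed y2 through the n Bottlenecks, collecting each intermediate output
        y = [y1, y2]
        for m in self.m:
            y.append(m(y[-1]))
        # Step 3: concatenate all outputs and fuse them with cv2
        return self.cv2(torch.cat(y, dim=1))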

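class Bottleneck(nn.Module):
    """Simplified bottleneck block: 1×1 conv followed by 3×3 conv, with an optional residual.

    Note: the official Ultralytics C2f uses two 3×3 convs inside its Bottleneck;
    this lite version keeps a 1×1 + 3×3 pair.
    """
    def __init__(self, c1: int, c2: int, shortcut: bool = False, g: int = 1, e: float = 0.5):
        super().__init__()
        self.c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, self.c_, 1, 1)
        self.cv2 = Conv(self.c_, c2, 3, 1, g=g)
        self.add = shortcut and (c1 == c2)  # a residual is only possible when input and output channels match

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))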

5. Get started quickly: use Ultralytics YOLOv8 for training/inference

Ultralytics provides a very friendly API, so even without prior experience you can get a basic object detection project running in about 10 minutes.

5.1 Installation and environment preparation

# Install the latest version of ultralytics
pip install ultralytics
# Check whether CUDA is available (optional but recommended)
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

5.2 Inference with an official pretrained model

from ultralytics import YOLO

# 1. Load a pretrained model (the nano version is the fastest, good for demos)
model = YOLO("yolov8n.pt")

# 2. Run inference on a single image
results = model.predict(
    source="test_image.jpg",
    conf=0.5,  # confidence threshold
    iou=0.45,  # IoU threshold for NMS
    save=True,  # save the visualized result
    save_txt=True  # save the detections as a txt file
)

# 3. Access the inference results
for result in results:
    print(f"Number of detected objects: {len(result.boxes)}")
    for box in result.boxes:
        print(f"Class ID: {int(box.cls.item())}, confidence: {box.conf.item():.2f}")
        print(f"Bounding box (xyxy): {box.xyxy.cpu().numpy().tolist()[0]}")
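
The `iou=0.45` argument in the call above controls non-maximum suppression (NMS), which removes duplicate boxes for the same object. The following is a minimal NumPy sketch of greedy NMS to show what that threshold means. It is class-agnostic and written for illustration only, so it is not the Ultralytics implementation, and the function names are assumptions.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one xyxy box and an (N, 4) array of xyxy boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thr=0.45):
    """Return indices of kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)          # candidates sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much.
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_thr]
    return keep
```

A lower threshold suppresses more aggressively, which helps with duplicates but can merge genuinely adjacent objects; a higher threshold does the opposite.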

6. Practical Suggestions: Pitfall Avoidance Guide and Tuning Tips

6.1 Data preparation (the most important step!)

  • Labeling quality: make sure each bounding box fits tightly around the target, with no missing or wrong labels (tools such as LabelImg or Label Studio can help)
  • Data augmentation: Ultralytics enables Mosaic and HSV color jitter by default, while MixUp is available but disabled by default (mixup=0.0). If you have small targets, you can additionally consider crop-based augmentation such as RandomCrop.
  • Class balance: if a class has very few samples, this can be mitigated with oversampling (duplicating samples), Focal Loss, or class weights
  • Multi-scale training: if the sizes of the targets to be detected vary greatly, it is recommended to use imgsz=640 or imgsz=1280 (if you have enough GPU memory)

6.2 Model deployment

  • Edge Devices (Mobile Phone/Raspberry Pi): Export to NCNN/TFLite format
  • GPU Server: Export to ONNX/TensorRT format (TensorRT can speed up inference by roughly 3-10 times, depending on the model, precision and hardware)
  • Browser: Export to ONNX format, use ONNX Runtime Web inference
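
These targets map onto the `format` argument of the Ultralytics `export` method. The sketch below groups the format strings by deployment target; the target names and helper functions are hypothetical, and each export additionally needs its own toolchain installed (for example, a GPU with TensorRT for the `engine` format).

```python
# Deployment target -> Ultralytics export format strings.
# The target names are illustrative labels, not part of the Ultralytics API.
EXPORT_FORMATS = {
    "edge": ["ncnn", "tflite"],
    "gpu_server": ["onnx", "engine"],  # "engine" is the TensorRT format
    "browser": ["onnx"],
}

def formats_for_target(target: str) -> list:
    """Look up the export formats suggested for a deployment target."""
    if target not in EXPORT_FORMATS:
        raise ValueError(f"unknown target {target!r}; choose from {sorted(EXPORT_FORMATS)}")
    return EXPORT_FORMATS[target]

def export_for_target(weights: str, target: str) -> list:
    """Export a model once per suggested format and return the exported paths."""
    from ultralytics import YOLO  # imported lazily so the lookup works without it

    formats = formats_for_target(target)  # validate before loading the model
    model = YOLO(weights)
    return [model.export(format=fmt) for fmt in formats]

# Usage (requires ultralytics and the relevant export dependencies):
# export_for_target("yolov8n.pt", "browser")
```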

7. Summary

The YOLO series has gone through nine years from the "revolutionary paradigm" in 2015 to the "industrial standard + cutting-edge exploration" in 2024 - its success lies not only in the innovation of the algorithm itself, but also in the continuous contribution of the community and the improvement of the ecosystem.

For beginners, it is recommended to start with the official API of Ultralytics YOLOv8 to get through the inference and training process; for advanced users, you can delve into cutting-edge technologies such as YOLOv9's "Programmable Gradient Information (PGI)" and YOLOv10's "NMS-free detection head".


Don't read the complete source code at the beginning! First understand the core principles (grid division, output tensor, anchor boxes), then use the official API to run a project, and finally take apart the modules you are interested in one by one.
