Practical project: Autonomous driving perception system

Introduction

The perception system is the foundation of autonomous driving. It combines computer vision, deep learning, and multi-sensor fusion to understand the road environment in real time, and it has to handle several tasks simultaneously: lane line detection, vehicle recognition, pedestrian detection, traffic sign recognition, and distance estimation.



1. System Overview

1.1 The importance of the perception system

If autonomous driving is compared to human driving, the perception system plays the role of the eyes and the visual part of the brain. It continuously answers three key questions: "What is around me?", "Where are these objects?", and "What might they do next?". Only when these questions are answered accurately can the planning system decide where to go next.

The importance of the perception system is specifically reflected in:

  • Environment Understanding: Capture lane lines, traffic signs, pedestrians, vehicles, and other information in real time, covering the entire driving scene much as the human eye does.
  • Safety Guarantee: Aim to detect potential dangers earlier than a human driver would, helping to avoid accidents and to reduce errors caused by fatigue or distraction.
  • Intelligent Decision-Making: Provide the data that the path planning and control modules need for advanced driving behaviors such as lane changing, overtaking, and car following.

1.2 Composition of the perception system

A complete autonomous driving perception system is organized in layers, much like a team with a clear chain of command:

  • Perception layer (the "eyes"): cameras, lidar, millimeter-wave radar, ultrasonic sensors, and so on, responsible for collecting raw data.
  • Algorithm layer (the "thinking"): core algorithms such as object detection, semantic segmentation, depth estimation, and object tracking.
  • Fusion layer (the "summary"): brings data from different sensors into a common spatial and temporal reference frame to form a consistent view of the scene.
  • Decision-making layer (the "command"): performs behavior prediction, path planning, and control instruction generation based on the fusion results.

The focus of this tutorial is on the algorithm layer - especially how to use multi-task learning so that a single model performs lane line detection, vehicle recognition, and distance estimation at the same time.


2. Multi-task learning architecture

Multi-task learning is a core technique in autonomous driving perception systems. Instead of training a separate model for each task, several tasks share the main part of one network, and only a few small "dedicated heads" branch off at the end. This brings several benefits:

  • Parameter Sharing: The total number of model parameters drops significantly, which lowers deployment cost.
  • Task Collaboration: Knowledge from different tasks is complementary. For example, the detected vehicle position can help estimate distance more accurately.
  • Better Generalization: The shared feature representation tends to be more robust, so performance is more stable in new scenarios.
  • Good Real-Time Performance: The backbone is evaluated only once per image yet produces several outputs, so inference is faster than running one model per task.
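
To make the parameter-sharing benefit concrete, the following minimal sketch counts convolution parameters for one shared backbone with three heads versus three fully separate models. The layer sizes are hypothetical and chosen only for illustration, and normalization layers are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def stack_params(layers):
    """Total parameters of a list of (c_in, c_out, kernel) convolutions."""
    return sum(conv_params(*layer) for layer in layers)


# Hypothetical layer sizes, chosen only for illustration.
backbone = [(3, 64, 7), (64, 128, 3), (128, 256, 3), (256, 512, 3)]
heads = {
    "lane": [(512, 128, 3), (128, 2, 1)],
    "vehicle": [(512, 256, 3), (256, 21, 1)],
    "depth": [(512, 128, 3), (128, 1, 1)],
}

backbone_total = stack_params(backbone)
head_totals = {name: stack_params(layers) for name, layers in heads.items()}

# One backbone shared by all heads vs. one backbone per task.
shared = backbone_total + sum(head_totals.values())
separate = sum(backbone_total + h for h in head_totals.values())

print(f"shared backbone : {shared:,} parameters")
print(f"separate models : {separate:,} parameters")
print(f"saving          : {1 - shared / separate:.1%}")
```

With three tasks, the shared design saves exactly two copies of the backbone, whatever the head sizes are.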

2.1 Shared backbone network design

The backbone network acts as the "visual center" of the whole system: its job is to extract rich feature maps from the input image. Below, a simplified convolutional network is used for demonstration:

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Tuple, List

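class SharedBackbone(nn.Module):
    """Shared backbone network: extracts general-purpose visual features."""
    def __init__(self, input_channels=3, backbone_type='simple'):
        super().__init__()

        if backbone_type == 'simple':
            # A basic feature extractor built from Conv, BN, ReLU, and pooling layers
            self.features = nn.Sequential(
                # First layer: a large kernel with stride 2 quickly reduces resolution
                nn.Conv2d(input_channels, 64, 7, stride=2, padding=3),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

                # The following conv blocks extract progressively more abstract semantics
                nn.Conv2d(64, 128, 3, padding=1),
                nn.BatchNorm2d(128),
                nn.ReLU(inplace=True),

                nn.Conv2d(128, 256, 3, padding=1),
                nn.BatchNorm2d(256),
                nn.ReLU(inplace=True),

                nn.Conv2d(256, 512, 3, padding=1),
                nn.BatchNorm2d(512),
                nn.ReLU(inplace=True),
            )
        else:
            # Only the teaching backbone is implemented; fail early instead of
            # hitting an AttributeError on self.features in forward()
            raise ValueError(f"Unsupported backbone_type: {backbone_type!r}")

    def forward(self, x):
        # Output: 512 channels at 1/4 of the input resolution
        return self.features(x)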

In real projects, the backbone is usually a more powerful architecture such as ResNet, MobileNet, or EfficientNet. The simplest possible version is used here for clarity.
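
It can help to check how much the simple backbone shrinks the image, because every task head later works at that reduced resolution. The sketch below applies the standard output-size formula for convolution and pooling layers to the kernel, stride, and padding values used above; the 384 x 1280 input size is only an assumed example.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a Conv2d/MaxPool2d layer (floor rounding, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of each layer of the 'simple' backbone that has a
# spatial footprint, in order. BatchNorm and ReLU do not change the size.
LAYERS = [
    (7, 2, 3),  # first convolution, halves the resolution
    (3, 2, 1),  # max pooling, halves it again
    (3, 1, 1),  # the three 3x3 convolutions keep the size unchanged
    (3, 1, 1),
    (3, 1, 1),
]


def backbone_output_hw(height, width):
    """Height and width of the feature map produced for a given input size."""
    for kernel, stride, padding in LAYERS:
        height = out_size(height, kernel, stride, padding)
        width = out_size(width, kernel, stride, padding)
    return height, width


# Assumed example input of 384 x 1280 pixels.
print(backbone_output_hw(384, 1280))
```

The total stride is 4, so the outputs of the segmentation and depth heads must be upsampled by a factor of 4 if full-resolution maps are needed.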

2.2 Feature Pyramid Network

Objects on the road can be far or near, large or small. In the same image, a distant pedestrian may cover only a few dozen pixels, while a nearby vehicle occupies a large area. To let the model handle targets of different scales at the same time, we introduce the Feature Pyramid Network (FPN).

The idea of FPN is to take feature maps of different resolutions from different stages of the backbone, and then propagate the high-level semantic information of the low-resolution maps down to the high-resolution, low-level features through a top-down path and lateral connections.

class FeaturePyramidNetwork(nn.Module):
    """Feature pyramid network: fuses multi-scale features."""
    def __init__(self, channels_list=(256, 512, 1024, 2048)):
        super().__init__()

        # A tuple default avoids sharing one mutable list between instances
        self.channels_list = list(channels_list)
        self.num_levels = len(self.channels_list)

        # 1x1 convolutions bring every level to 256 channels
        self.adjust_convs = nn.ModuleList([
            nn.Conv2d(channels, 256, 1) for channels in self.channels_list
        ])

        # One 3x3 convolution per level smooths the fused features
        self.top_down_layers = nn.ModuleList([
            nn.Conv2d(256, 256, 3, padding=1) for _ in range(self.num_levels)
        ])

    def forward(self, features_list):
        # features_list is ordered from the highest resolution to the lowest
        # Step 1: unify the channel count of every level to 256
        laterals = [conv(feat) for conv, feat in zip(self.adjust_convs, features_list)]

        # Step 2: upsample from the top level downwards and add, fusing information
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i-1] = laterals[i-1] + F.interpolate(
                laterals[i], size=laterals[i-1].shape[2:], mode='nearest'
            )

        # Step 3: smooth each fused feature map
        return [conv(feat) for conv, feat in zip(self.top_down_layers, laterals)]

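After FPN processing, the feature map at every level carries both low-level detail and high-level semantic information, which makes it well suited to the task heads that follow. Note that the default channels_list matches the stage outputs of a ResNet-50-style backbone; the single-output SharedBackbone above would need to expose several intermediate feature maps before it could feed this module.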
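
The top-down fusion step is easier to follow on tiny numbers. The sketch below reproduces only the "upsample and add" part on single-channel grids stored as plain Python lists; the 1x1 and 3x3 convolutions are left out, and the function names are illustrative rather than part of any library.

```python
def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbor upsampling of a 2D list to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def top_down_merge(laterals):
    """Fuse levels ordered from highest to lowest resolution.

    Each level receives the upsampled, already merged level above it,
    mirroring the loop in FeaturePyramidNetwork.forward.
    """
    merged = [[row[:] for row in level] for level in laterals]  # copy the input
    for i in range(len(merged) - 1, 0, -1):
        h, w = len(merged[i - 1]), len(merged[i - 1][0])
        up = upsample_nearest(merged[i], h, w)
        merged[i - 1] = [
            [a + b for a, b in zip(row_fine, row_up)]
            for row_fine, row_up in zip(merged[i - 1], up)
        ]
    return merged


levels = [
    [[1] * 4 for _ in range(4)],   # 4x4 high-resolution level
    [[10, 20], [30, 40]],          # 2x2 middle level
    [[100]],                       # 1x1 coarsest level
]
merged = top_down_merge(levels)
for row in merged[0]:
    print(row)
```

The coarsest level is left unchanged, while every finer level accumulates the information from all levels above it.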


3. Core perception tasks

With a powerful feature extractor in place, let's design dedicated "task heads" for the three core tasks.

3.1 Lane line detection

Lane line detection is essentially a semantic segmentation problem - we need to classify every pixel in the image as "lane line" or "background". This is crucial for functions such as keeping the vehicle in the center of the lane and enabling automatic lane changes.

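class LaneDetection(nn.Module):
    """Lane line detection: per-pixel binary classification."""
    def __init__(self, in_channels=256, num_classes=2):  # 0: background, 1: lane line
        super().__init__()

        self.segmentation_head = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),

            nn.Conv2d(128, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),

            # The last layer outputs, per pixel, the probability of "background" / "lane line".
            # Note: nn.CrossEntropyLoss expects raw logits, so drop the Softmax
            # (or apply it only at inference time) when training with that loss.
            nn.Conv2d(64, num_classes, 1),
            nn.Softmax(dim=1)
        )

    def forward(self, features):
        # Output shape: (batch, num_classes, H, W) at the feature-map resolution
        return self.segmentation_head(features)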

During actual deployment, post-processing can also be done based on the segmentation results, such as using curve fitting to restore the shape of the lane lines.
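
As one possible form of this curve-fitting step, the sketch below fits a quadratic x = a*y^2 + b*y + c to the pixel coordinates of a single lane line by least squares. It assumes the lane pixels have already been extracted from the segmentation mask and grouped per lane; the quadratic model and the function name are illustrative choices, not something prescribed by this tutorial.

```python
def fit_quadratic(ys, xs):
    """Least-squares fit of x = a*y^2 + b*y + c via the normal equations."""
    # s[k] = sum of y^k, t[k] = sum of x * y^k
    s = [sum(y ** k for y in ys) for k in range(5)]
    t = [sum(x * y ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        acc = m[r][3] - sum(m[r][c] * coeffs[c] for c in range(r + 1, 3))
        coeffs[r] = acc / m[r][r]
    return tuple(coeffs)  # (a, b, c)


# Synthetic lane pixels lying exactly on x = 0.5*y^2 - 2*y + 3.
ys = list(range(10))
xs = [0.5 * y * y - 2.0 * y + 3.0 for y in ys]
a, b, c = fit_quadratic(ys, xs)
print(a, b, c)
```

Fitting x as a function of y (rather than y as a function of x) suits lane lines, which run mostly vertically in the image. At least three distinct y values are required for the system to be solvable.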

3.2 Vehicle detection

Vehicle detection is a typical object detection task: for each vehicle we need its position (a bounding box) and the probability that the box actually contains a vehicle. Here we follow an idea similar to YOLO: a fixed number of anchor boxes is preset at each cell of the feature map, and the model predicts the offsets and the category for every anchor box.

class VehicleDetection(nn.Module):
    """Vehicle detection: predicts classes and bounding-box regression together."""
    def __init__(self, in_channels=256, num_classes=2, anchors_per_cell=3):
        super().__init__()

        self.num_classes = num_classes
        self.anchors_per_cell = anchors_per_cell

        # First half of the detection head: further feature extraction
        self.detection_head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),

            nn.Conv2d(256, 512, 3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )

        # Classification branch: num_classes raw scores (logits) per anchor box
        self.classifier = nn.Conv2d(512, anchors_per_cell * num_classes, 1)
        # Regression branch: (center x, center y, width, height, confidence) per anchor box
        self.regressor = nn.Conv2d(512, anchors_per_cell * 5, 1)

    def forward(self, features):
        features = self.detection_head(features)

        # Class scores, reshaped to (batch, anchors, num_classes, H, W)
        class_pred = self.classifier(features)
        class_pred = class_pred.view(class_pred.size(0), self.anchors_per_cell,
                                     self.num_classes, *class_pred.shape[2:])

        # Box predictions, reshaped to (batch, anchors, 5, H, W)
        bbox_pred = self.regressor(features)
        bbox_pred = bbox_pred.view(bbox_pred.size(0), self.anchors_per_cell,
                                   5, *bbox_pred.shape[2:])

        return class_pred, bbox_pred
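
The regression branch outputs raw numbers, which still have to be converted into boxes in image coordinates. The sketch below shows one common way to do this for a single anchor, assuming a YOLOv2/v3-style parameterization with anchor sizes given in pixels. The tutorial does not specify the decoding, so the formula, the function name, and the argument layout are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Turn raw offsets for one anchor into (center_x, center_y, width, height) in pixels.

    Assumed parameterization: the center is a sigmoid offset inside grid cell
    (cell_x, cell_y), scaled by the feature-map stride; the size rescales the
    anchor exponentially.
    """
    center_x = (cell_x + sigmoid(tx)) * stride
    center_y = (cell_y + sigmoid(ty)) * stride
    width = anchor_w * math.exp(tw)
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height


# All-zero offsets place the box at the cell center with the anchor's own size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=2,
                 anchor_w=32.0, anchor_h=64.0, stride=4)
print(box)
```

The stride of 4 matches the downsampling of the simple backbone; a different backbone would need its own stride.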

The post-processing stage usually uses non-maximum suppression (NMS) to eliminate overlapping detection boxes and obtain a final clean result.
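
A minimal pure-Python sketch of greedy NMS is shown below, assuming boxes in (x1, y1, x2, y2) corner format and a hypothetical IoU threshold of 0.5. Production code would normally use an optimized, batched implementation instead.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: returns the indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the third box does not overlap anything and survives.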

3.3 Distance estimation

It is not enough to know "there is a car in front", you must also know "how far away this car is from me", otherwise you cannot safely follow or avoid the car. Distance estimation can be viewed as a monocular depth estimation task, that is, predicting the distance corresponding to each pixel from an RGB image.

class DistanceEstimation(nn.Module):
    """Distance estimation: per-pixel depth estimation."""
    def __init__(self, in_channels=256, output_channels=1):
        super().__init__()

        # Similar to the segmentation head, but the final output is a single
        # channel holding the normalized depth value
        self.depth_head = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),

            nn.Conv2d(128, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),

            nn.Conv2d(64, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),

            nn.Conv2d(32, output_channels, 1),
            nn.Sigmoid()  # squashes the output into 0~1, i.e. a normalized distance
        )

    def forward(self, features):
        return self.depth_head(features)

The Sigmoid squashes the output to a value between 0 and 1, and during training this value is compared with the ground-truth distance, normalized in the same way. To recover distances in meters, the normalization has to be inverted; mapping pixels to 3D positions additionally requires the camera intrinsic parameters, which is beyond the scope of this section.
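
As a rough illustration of working with the normalized depth, the sketch below assumes a simple linear normalization over a fixed range and reads off one distance per detected vehicle as the median depth inside its bounding box. The 80 m range, the linear mapping, and the helper names are assumptions made for this example only.

```python
from statistics import median

MAX_DEPTH_M = 80.0  # assumed normalization range in meters (hypothetical)


def normalize_depth(depth_m):
    """Map a metric distance to [0, 1], clipping at the assumed range."""
    return min(max(depth_m, 0.0), MAX_DEPTH_M) / MAX_DEPTH_M


def box_distance_m(depth_map, box):
    """Median metric distance inside box = (x1, y1, x2, y2), end-exclusive.

    depth_map is a 2D list of normalized depths in [0, 1].
    """
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values) * MAX_DEPTH_M


# Toy 4x4 depth map: far background (0.9) with a nearer object in the middle.
depth_map = [
    [0.9, 0.9, 0.9, 0.9],
    [0.9, 0.25, 0.25, 0.9],
    [0.9, 0.25, 0.5, 0.9],
    [0.9, 0.9, 0.9, 0.9],
]
distance = box_distance_m(depth_map, (1, 1, 3, 3))
print(distance)
```

The median is used rather than the mean so that a few background pixels inside the box do not distort the estimate.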


4. Complete perception system integration

Now we assemble the various components introduced earlier to form an end-to-end autonomous driving perception system.

class AutonomousDrivingPerceptionSystem(nn.Module):
    """Complete autonomous driving perception system: one model, several outputs."""
    def __init__(self):
        super().__init__()

        # Shared backbone network
        self.shared_backbone = SharedBackbone()

        # Three task-specific heads (the simple backbone outputs 512 channels)
        self.lane_detection = LaneDetection(in_channels=512)
        self.vehicle_detection = VehicleDetection(in_channels=512)
        self.distance_estimation = DistanceEstimation(in_channels=512)

    def forward(self, image):
        # 1. Extract the shared features
        features = self.shared_backbone(image)

        # 2. Run each task head on the same features
        lane_pred = self.lane_detection(features)
        vehicle_pred = self.vehicle_detection(features)  # tuple: (class_pred, bbox_pred)
        distance_pred = self.distance_estimation(features)

        # 3. Return everything in a dictionary
        return {
            'lane_segmentation': lane_pred,
            'vehicle_detection': vehicle_pred,
            'depth_estimation': distance_pred
        }

This class shows the idea of multi-task learning in full: the input is a single image, but the output contains three results: a lane line segmentation map, vehicle detection boxes, and a depth map. During training, the overall loss is typically a weighted sum of the losses of these three tasks.
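
The weighted sum of task losses can be sketched as follows. The loss values and weights are hypothetical and hand-picked for illustration; in practice the weights are tuned or learned, and the per-task losses come from the respective loss functions.

```python
def multi_task_loss(losses, weights):
    """Weighted sum of per-task losses.

    Both arguments are dicts keyed by task name. The values in losses may be
    plain floats or scalar tensors, since only * and + are used.
    """
    if losses.keys() != weights.keys():
        raise KeyError("losses and weights must cover the same tasks")
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical per-task losses and weights, for illustration only.
losses = {"lane": 0.5, "vehicle": 2.0, "depth": 0.25}
weights = {"lane": 1.0, "vehicle": 0.5, "depth": 2.0}
total = multi_task_loss(losses, weights)
print(total)
```

Choosing the weights matters: if one task's loss is numerically much larger than the others, it dominates the gradients of the shared backbone.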


5. Performance optimization and safety assurance

5.1 Performance optimization strategy

Autonomous driving has extremely high requirements for real-time performance, usually requiring one frame to be processed within tens of milliseconds. Common optimization methods include:

  • Model Quantization: Convert model parameters from float32 to int8, which reduces memory usage and computation time significantly.
  • Model Pruning: Remove unimportant connections or channels to slim down the network while keeping accuracy as close as possible to the original.
  • Hardware Acceleration: Run inference on a GPU, TPU, or dedicated NPU (neural processing unit).
  • Pipeline Processing: Process several frames in parallel or pipeline the stages of the system to raise overall throughput.
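
The arithmetic behind quantization can be illustrated on a handful of numbers. The sketch below uses symmetric per-tensor quantization, which is only one of several schemes; real toolchains also calibrate activations and often quantize per channel. The function names are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [-2.0, 0.0, 0.5, 2.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values :", q)
print("scale       :", scale)
print("max error   :", max_error)
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half the scale.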

5.2 Safety and Reliability

A failure of the perception system can have disastrous consequences. Safety and reliability therefore have to be designed in throughout the entire development process:

  • Redundant Design: Use several sensors, such as cameras and lidar, at the same time, so that the system keeps basic sensing capability when one sensor fails.
  • Fault Detection: Continuously monitor the operating status of each module and raise timely alarms for abnormal response delays or erroneous outputs.
  • Safety Mechanism: Design emergency stopping, backup control channels, and other safety strategies to protect passengers in extreme situations.
  • Verification Testing: Cover millions of kilometers of real-road and simulated driving, including difficult conditions such as rain, night, and strong light.
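
As a small illustration of the fault detection idea, the sketch below checks one processed frame for two kinds of problems: a latency above a budget and implausible values in the depth output. The 50 ms budget, the function name, and the choice of checks are assumptions for this example; a real monitoring layer would cover far more conditions.

```python
import math


def check_frame(latency_ms, depth_values, budget_ms=50.0):
    """Return a list of fault descriptions for one processed frame.

    depth_values is a flat list of normalized depths, expected in [0, 1].
    """
    faults = []
    if latency_ms > budget_ms:
        faults.append(
            f"latency {latency_ms:.1f} ms exceeds budget {budget_ms:.1f} ms"
        )
    if any(math.isnan(v) or math.isinf(v) for v in depth_values):
        faults.append("depth output contains NaN or Inf")
    elif any(not 0.0 <= v <= 1.0 for v in depth_values):
        faults.append("depth output outside the expected [0, 1] range")
    return faults


print(check_frame(20.0, [0.1, 0.9]))
print(check_frame(80.0, [0.1, float("nan")]))
```

An empty list means the frame passed all checks; any entry would be forwarded to the alarm or fallback logic.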

Summary

The autonomous driving perception system brings together many computer vision and deep learning techniques. Its core technologies include multi-task learning, object detection, semantic segmentation, and depth estimation. By building a complete end-to-end perception framework, we give the vehicle the basic ability to "see the world around it clearly" and lay a solid data foundation for subsequent decision-making and planning.

In this tutorial, we used PyTorch to show how a prototype of a multi-task perception system can be built with a small amount of code. In real projects, data processing, loss design, post-processing, and sensor fusion all need further work.

Autonomous driving perception is one of the most demanding applications of computer vision. It is advisable to master the basics first, such as image classification and object detection, and then move on step by step to multi-task learning and sensor fusion. In real projects, the safety and reliability of the system often matter more than the implementation of any single function.

💡 Important reminder: Autonomous driving perception systems must meet extremely high safety and reliability requirements. Before deployment, they must undergo rigorous testing and verification to ensure that they operate safely in a wide range of complex environments.