Inference Acceleration Frameworks Explained: A Complete Guide to TensorRT

🎯 One-sentence summary: TensorRT is NVIDIA's inference optimizer and runtime for deep learning models on NVIDIA GPUs; it can speed up inference by roughly 2 to 8 times while shrinking the model, which makes it an essential skill for anyone deploying algorithms to production.

Introduction

When a trained model is dropped straight into production, we often find that it runs, but not fast enough: face recognition takes two seconds per frame, and an autonomous-driving point cloud pipeline stutters like a slideshow. An inference acceleration framework is the industrial-grade answer to this problem. Through computation graph pruning, operator fusion, hardware-specific instructions, and reduced precision, it can cut inference latency by more than 90%, double single-GPU throughput, shrink the model, and reduce GPU memory usage. In real-time surveillance, for example, face recognition drops from about 2 s per frame on a CPU to about 20 ms per frame with TensorRT, and the cost of inference on a cloud GPU cluster can fall by two thirds. That is the appeal of an inference acceleration framework.



1. Why is a "separate inference framework" needed?

Many beginners are confused at this point: **Isn't it enough to run the model with PyTorch / TensorFlow? Why bother introducing ONNX and TensorRT?**

The answer lies in the fact that training and inference have very different goals:

  • Training phase: the goal is "fast convergence + high accuracy". The framework needs dynamic computation graphs, automatic differentiation, a rich set of optimizers, and training-specific operators, and it tolerates heavy use of compute resources.
  • Inference phase: the goal is "maximum speed + minimum resource usage along a fixed computation path". Backpropagation and training-specific operators are not needed; instead, the computation graph should be frozen and optimized as far as possible.

A PyTorch .pt checkpoint or a TorchScript export performs only basic operator fusion. On its own it does not provide deep pruning of the graph, kernel selection tuned to make full use of NVIDIA's Tensor Cores, or calibrated low-precision inference (FP16/INT8). These capabilities are the core advantages of hardware-specific inference engines such as TensorRT.


2. Comparison of mainstream inference acceleration frameworks

The inference frameworks commonly used in industry fall into three broad categories. Let's look at the big picture first and then focus on TensorRT:

| Framework type | Representative products | Applicable hardware | Core advantages | Typical scenarios |
| --- | --- | --- | --- | --- |
| Universal cross-platform | ONNX Runtime, TFLite | Fully compatible with CPU / GPU / NPU / FPGA | Fast cross-platform migration and strong compatibility | Multi-terminal deployment (PC + mobile phone + IoT) |
| Hardware exclusive (GPU) | TensorRT, TF-TRT | NVIDIA full series of GPUs | Extreme performance, deeply optimized for NVIDIA hardware | Autonomous driving, real-time security video, cloud GPU clusters |
| Embedded / Edge | OpenVINO, RKNN, MNN | Intel / Rockchip / Mobile | Chip-level adaptation, ultra-low resource usage | Edge computing boxes, mobile apps, AIoT devices |

As you can see, if you run demanding inference workloads on NVIDIA GPUs, TensorRT is the first choice and hard to avoid.


3. TensorRT's four core optimization techniques

TensorRT's optimizations can be divided into "static graph optimization at build time" and "hardware acceleration at runtime". The four most important techniques are explained below in plain language.

1. Static computation graph pruning

A trained model carries many "branches and leaves" that are never used during inference: Dropout's random deactivation, BatchNorm's running-statistics updates, the links needed for backpropagation, and so on. TensorRT automatically identifies and removes these redundant nodes, leaving only the leanest forward-inference backbone.
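The idea can be sketched on a toy graph. This is only an illustration in plain Python, not TensorRT's actual implementation: the dictionary format, the node names, and the function prune_for_inference are all hypothetical. It bypasses ops that are identities at inference time (Dropout) and then drops every node the requested output does not depend on.

```python
# Toy graph (hypothetical format): node name -> (op type, input node names).
TRAINING_GRAPH = {
    "input": ("Input", []),
    "conv1": ("Conv", ["input"]),
    "bn1": ("BatchNorm", ["conv1"]),
    "bn1_stats": ("UpdateRunningStats", ["conv1"]),  # training-only side branch
    "relu1": ("ReLU", ["bn1"]),
    "drop1": ("Dropout", ["relu1"]),                 # identity at inference time
    "fc": ("Gemm", ["drop1"]),
    "loss": ("CrossEntropyLoss", ["fc"]),            # training-only
}


def prune_for_inference(graph, outputs, identity_ops=("Dropout",)):
    """Bypass identity ops, then keep only nodes the outputs depend on."""

    def resolve(name):
        # Follow a chain of identity ops back to the node that feeds it.
        while graph[name][0] in identity_ops:
            name = graph[name][1][0]
        return name

    rewired = {
        name: (op, [resolve(src) for src in inputs])
        for name, (op, inputs) in graph.items()
        if op not in identity_ops
    }

    # Walk backwards from the requested outputs; anything not visited is dead.
    keep, stack = set(), [resolve(out) for out in outputs]
    while stack:
        name = stack.pop()
        if name not in keep:
            keep.add(name)
            stack.extend(rewired[name][1])

    return {name: node for name, node in rewired.items() if name in keep}


pruned = prune_for_inference(TRAINING_GRAPH, outputs=["fc"])
print(list(pruned))
```

In the pruned graph the Dropout node, the statistics-update branch, and the loss are gone, and fc reads directly from relu1.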

2. Kernel Fusion

This is the optimization with the most immediate effect. For example, a common convolution post-processing path:

Conv → BatchNorm → ReLU → Add (residual connection)

Without fusion, the GPU launches 4 independent kernels in sequence. After each kernel finishes, the intermediate result must be written back to GPU memory, and the next kernel reads it out again. The time spent moving data is far greater than the time spent computing, which creates a memory-access bottleneck.

TensorRT merges these consecutive operations into a single dedicated kernel, such as Conv_BatchNorm_ReLU_Add. All operations are completed in one pass and intermediate data is no longer read and written repeatedly, which greatly reduces latency.
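The arithmetic behind this kind of fusion can be sketched in plain Python. The example below is an illustration under simplifying assumptions, not what TensorRT generates: the "convolution" is reduced to a single-channel 1×1 case (a scalar multiply-add), and the function names are hypothetical. It folds the BatchNorm parameters into the convolution's weight and bias, then applies ReLU and the residual Add in the same pass, and checks that the result matches the four-step version.

```python
import math


def unfused(x, skip, w, b, gamma, beta, mean, var, eps=1e-5):
    """Four separate passes; each list stands in for a tensor written to GPU memory."""
    conv = [w * v + b for v in x]
    bn = [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in conv]
    act = [max(v, 0.0) for v in bn]
    return [a + s for a, s in zip(act, skip)]


def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding conv's weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def fused(x, skip, w_folded, b_folded):
    """One pass: conv with folded BN, ReLU, and residual add, no intermediates."""
    return [max(w_folded * v + b_folded, 0.0) + s for v, s in zip(x, skip)]


x = [-2.0, -0.5, 0.0, 1.5, 3.0]
skip = [0.1, 0.2, 0.3, 0.4, 0.5]
params = dict(w=0.8, b=0.1, gamma=1.2, beta=-0.3, mean=0.4, var=2.5)

reference = unfused(x, skip, **params)
w_folded, b_folded = fold_batchnorm(**params)
result = fused(x, skip, w_folded, b_folded)
max_diff = max(abs(r - f) for r, f in zip(reference, result))
print(f"max difference between unfused and fused: {max_diff:.2e}")
```

The unfused version materializes three intermediate lists, while the fused version touches each input element once; on a GPU that difference is the saved round trips to memory.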

3. Hardware-specific instruction set: Tensor Cores

Starting with the Volta architecture (V100) and continuing with Turing (RTX 20 series) and later generations, NVIDIA GPUs include Tensor Cores, units specialized for matrix multiply-accumulate (GEMM). In models such as CNNs and Transformers, GEMM accounts for more than 90% of the computation.

TensorRT automatically detects the GEMM-style operators in the model and, depending on the input precision (FP16/INT8 is ideal) and shape, dispatches them to Tensor Cores, which can be 2 to 8 times faster than ordinary CUDA cores.
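One reason FP16 inputs work well on Tensor Cores is that the products can be accumulated in FP32. The following sketch only emulates that numeric behavior on the CPU with the standard library; it says nothing about speed, and the helper names are hypothetical. It compares a dot product with FP16 inputs accumulated in FP32 against the same dot product accumulated in FP16.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_inputs(a, b, accumulate):
    """Dot product with FP16 operands; `accumulate` rounds the running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + to_fp16(x) * to_fp16(y))
    return acc


n = 4096
a = [0.1] * n
b = [0.1] * n
exact = n * 0.1 * 0.1

fp32_acc = dot_fp16_inputs(a, b, accumulate=to_fp32)
fp16_acc = dot_fp16_inputs(a, b, accumulate=to_fp16)
print(f"exact={exact:.4f}  fp32 accumulate={fp32_acc:.4f}  fp16 accumulate={fp16_acc:.4f}")
```

With an FP16 accumulator the running sum stops growing once its spacing exceeds the size of each new product, whereas the FP32 accumulator stays close to the exact value. This is why mixed precision usually costs very little accuracy.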

4. Precision compression (quantization)

Compress the model's weights and activations from FP32 (32-bit floating point) to FP16 (half precision) or INT8 (8-bit integer) to gain significant speed and size reductions with minimal loss of accuracy.

  • FP16 mixed precision: most vision models (ResNet, YOLO, etc.) lose almost no accuracy, the model is halved in size, inference is typically 1 to 4 times faster, and FP16 maps naturally onto Tensor Cores.
  • INT8 quantization: the model shrinks to about 1/4 of its FP32 size and runs a further 1 to 2 times faster than FP16, but it requires quantization calibration first (TensorRT runs the model on a small amount of real data, records the range of activation values, and then maps that range to INT8) to avoid a large drop in accuracy.
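The calibration step described above can be sketched as follows. This is a minimal illustration that assumes the simplest scheme, symmetric max calibration; TensorRT's own calibrators choose the range more carefully (for example with an entropy-based method), and the function names here are hypothetical.

```python
def calibrate_scale(calibration_batches):
    """Symmetric max calibration: map the largest observed |activation| to 127."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / 127.0


def quantize(x, scale):
    """FP32 -> INT8: divide by the scale, round, and saturate to [-127, 127]."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """INT8 -> FP32 approximation of the original value."""
    return q * scale


# A few "real" activations recorded while running calibration data.
calibration_batches = [[0.5, -1.27, 0.02], [0.3, 1.0, -0.8]]
scale = calibrate_scale(calibration_batches)

for x in (0.5, -0.8, 5.0):
    q = quantize(x, scale)
    print(f"x={x:+.2f} -> int8 {q:+4d} -> {dequantize(q, scale):+.4f}")
```

Values inside the calibrated range come back with an error of at most half a quantization step, while a value outside it (5.0 here) saturates. That is why calibration data should be representative of real inputs.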

4. PyTorch → TensorRT: a minimal deployment walkthrough

PyTorch has no built-in exporter for TensorRT engines, so the usual route goes through a "middleman": PyTorch → ONNX → TensorRT.
ONNX (Open Neural Network Exchange) acts like a universal translator that connects training frameworks to inference engines.

Let’s use the classic ResNet18 image classification model to walk you through it from beginning to end.

Environment preparation

Make sure the following libraries are installed (using the official NVIDIA Docker image nvcr.io/nvidia/tensorrt:24.05-py3 is highly recommended, since it removes the trouble of version matching):

  • PyTorch + torchvision (with CUDA)
  • ONNX
  • ONNX Runtime (used to verify ONNX correctness)
  • TensorRT Python API

Step 1: PyTorch Export ONNX Model

Core points to note:

  • The model must be switched to eval() mode, which freezes the BN layers.
  • Prepare a dummy input so that the exporter knows the input shape.
  • opset_version: a value from 15 to 18 is recommended for the best compatibility.

import torch
import torchvision.models as models

# 1. Load the pretrained model, switch to eval mode, and move it to the GPU.
#    The weights= argument replaces the deprecated pretrained=True (torchvision >= 0.13).
resnet18 = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval().cuda()

# 2. Prepare the dummy input: batch_size=1, 3-channel RGB, 224x224
dummy_input = torch.randn(1, 3, 224, 224).cuda()

# 3. Export to ONNX
onnx_path = "resnet18.onnx"
torch.onnx.export(
    resnet18,
    dummy_input,
    onnx_path,
    export_params=True,       # the weights must be exported
    opset_version=17,         # ONNX operator set version
    do_constant_folding=True, # constant-folding optimization
    input_names=["input"],    # input node name (used again later)
    output_names=["output"],  # output node name
    dynamic_axes=None         # static shape is used here
)

print(f"✅ ONNX model exported to: {onnx_path}")

Step 2: Verify ONNX model correctness

This step is easy to overlook, but it is crucial. Run one inference with ONNX Runtime and compare the result with PyTorch to confirm that the exported model computes the same output.

import onnx
import onnxruntime as ort
import numpy as np

# 1. Check that the model structure is valid
onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
print("✅ ONNX model structure is valid")

# 2. Prepare the input data (numpy format)
ort_input = {onnx_model.graph.input[0].name: dummy_input.cpu().numpy()}

# 3. Create the ONNX Runtime session (prefer the CUDA execution provider)
ort_session = ort.InferenceSession(
    onnx_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# 4. Get the outputs of PyTorch and ONNX Runtime
with torch.no_grad():
    pytorch_output = resnet18(dummy_input).cpu().numpy()
ort_output = ort_session.run(None, ort_input)[0]

# 5. Compare the maximum difference; a value below 1e-5 is considered normal
diff = np.abs(pytorch_output - ort_output).max()
print(f"✅ Max difference between PyTorch and ONNX Runtime outputs: {diff:.8f} (threshold < 1e-5)")

Step 3: Convert ONNX to TensorRT engine

FP16 Mixed Precision is selected here to balance speed and accuracy.

import tensorrt as trt

# 1. Create the logger (INFO level lets you watch the optimization process)
logger = trt.Logger(trt.Logger.INFO)

# 2. Create the Builder, Network, and ONNX Parser
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)                                      # the batch dimension must be explicit
parser = trt.OnnxParser(network, logger)

# 3. Parse the ONNX file
with open(onnx_path, "rb") as f:
    if not parser.parse(f.read()):
        # On failure, print the parser's error messages
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit(1)
print("✅ ONNX model parsed successfully")

# 4. Configure the Builder
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace, used by operator fusion and similar steps

# Enable FP16 mixed precision (most effective when the GPU has Tensor Cores)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)
    print("✅ FP16 mixed precision enabled")

# 5. Build and save the TensorRT engine.
#    build_serialized_network replaces build_engine, which was removed in TensorRT 10.
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise SystemExit("TensorRT engine build failed")
print("✅ TensorRT engine built successfully")

engine_path = "resnet18_fp16.trt"
with open(engine_path, "wb") as f:
    f.write(serialized_engine)
print(f"✅ TensorRT engine saved to: {engine_path}")

At this point, a lightweight, high-speed inference engine is complete. Loading the engine and running inference afterwards takes only a few lines of code, and latency is usually in the millisecond range.
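Loading the engine and running one inference might look like the sketch below. It assumes TensorRT 8.5 or newer (the tensor-address API), a CUDA build of PyTorch used only to allocate GPU buffers, FP32 input and output tensors, and the tensor names "input" and "output" chosen during the ONNX export. The function name run_trt_inference is hypothetical, and the code has not been tuned for throughput.

```python
def run_trt_inference(engine_path, input_tensor):
    """Run one forward pass through a serialized TensorRT engine.

    input_tensor: a contiguous float32 CUDA torch.Tensor whose shape matches
    the engine's static input shape, e.g. (1, 3, 224, 224).
    """
    # Imported lazily so this file can be loaded on machines without a GPU stack.
    import tensorrt as trt
    import torch

    assert input_tensor.is_cuda and input_tensor.is_contiguous()

    # Deserialize the engine file produced in Step 3.
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Allocate the output buffer on the GPU using the shape stored in the engine.
    output_shape = tuple(engine.get_tensor_shape("output"))
    output_tensor = torch.empty(output_shape, dtype=torch.float32, device="cuda")

    # Bind the device pointers by tensor name, then launch on the current stream.
    context.set_tensor_address("input", input_tensor.data_ptr())
    context.set_tensor_address("output", output_tensor.data_ptr())
    stream = torch.cuda.current_stream()
    context.execute_async_v3(stream.cuda_stream)
    stream.synchronize()
    return output_tensor
```

With the engine from Step 3, a call such as run_trt_inference("resnet18_fp16.trt", dummy_input) would return the logits as a GPU tensor. In a real service the engine and execution context should be created once and reused across requests rather than rebuilt on every call.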


5. Tips for avoiding pitfalls

  1. Version matching is a prerequisite: the versions of CUDA, cuDNN, TensorRT, and PyTorch must be compatible with one another. Using the official NVIDIA Docker image directly is strongly recommended to avoid "dependency hell".
  2. Prefer static shapes and batch sizes: dynamic shapes are flexible, but TensorRT can optimize them less aggressively, and inference may be 10% to 30% slower than with static shapes. If your input size is fixed (for example, security camera frames resized to 224×224), use static shapes without hesitation.
