Inference Acceleration Frameworks Explained: A Complete Guide to TensorRT
🎯 One-sentence summary: TensorRT is NVIDIA's accelerator for running deep learning models on NVIDIA GPUs; it can speed up inference by 2 to 8 times while shrinking the model, which makes it an essential skill for putting algorithms into production.
Introduction
When we drop a trained model straight into production, we often find that it runs, but not fast enough: face recognition takes two seconds per frame, and autonomous-driving point clouds stutter like a slideshow. Inference acceleration frameworks are the industrial-grade answer to this problem. Through computation graph pruning, operator fusion, hardware-specific instructions, and precision compression, they can cut inference latency by more than 90%, double single-GPU throughput, shrink the model, and reduce GPU memory usage. Face recognition for real-time monitoring drops from 2 s per frame on a CPU to 20 ms per frame with TensorRT, and the cost of inference on a cloud GPU cluster falls by two thirds. That is the appeal of inference acceleration frameworks.
1. Why is a separate inference framework needed?
Many beginners wonder: **Isn't it enough to run the model with PyTorch / TensorFlow? Why bother introducing ONNX and TensorRT?**
The answer is that training and inference have very different goals:
- Training phase: the goal is fast convergence and high accuracy. The framework needs dynamic computation graphs, automatic differentiation, a rich set of optimizers, and training-specific operators, and it tolerates heavy use of computing resources.
- Inference phase: the goal is maximum speed and minimum resource usage along a fixed computation path. Backpropagation and training-specific operators are not needed; instead, the computation graph should be frozen and optimized as far as possible.
PyTorch's .pt checkpoints and TorchScript perform only basic operator fusion. Out of the box they do not fully exploit NVIDIA's Tensor Cores, deep pruning of the computation graph, or low-precision inference (FP16/INT8). These capabilities are the core advantages of hardware-specific inference engines such as TensorRT.
2. Comparison of mainstream inference acceleration frameworks
The inference frameworks commonly used in industry fall into three broad categories. Let's look at the big picture first and then focus on TensorRT:
In short, if you run demanding inference workloads on NVIDIA GPUs, TensorRT is the first choice and there is no getting around it.
3. TensorRT core optimization principles (plain-language version)
TensorRT's optimizations can be divided into static graph optimization at build time and hardware acceleration at run time. The four most important techniques are explained below in the plainest terms possible.
1. Static computation graph pruning
A trained model carries many "branches and leaves" that are never used during inference: Dropout's random deactivation, BatchNorm's running-mean updates, the backpropagation path, and so on. TensorRT automatically identifies and removes these redundant nodes, leaving only the leanest forward-inference backbone.
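To make the idea concrete, here is a minimal toy sketch in plain Python. It is not how TensorRT is implemented: the graph representation (a list of name, op type, input tuples) and the names TRAINING_ONLY_OPS and prune_for_inference are hypothetical, chosen only for illustration. It removes a Dropout node, which acts as an identity at inference time, and rewires its consumer to its producer.

```python
# Toy illustration only: ops that do nothing useful at inference time.
TRAINING_ONLY_OPS = {"Dropout"}


def prune_for_inference(nodes):
    """Drop training-only nodes from a topologically sorted graph.

    Each node is a (name, op_type, input_name) tuple with a single input.
    A removed node is treated as an identity, so its consumers are
    rewired to read from its input instead.
    """
    alias = {}  # removed node name -> the tensor that replaces it
    kept = []
    for name, op_type, source in nodes:
        source = alias.get(source, source)
        if op_type in TRAINING_ONLY_OPS:
            alias[name] = source
        else:
            kept.append((name, op_type, source))
    return kept


graph = [
    ("conv1", "Conv", "image"),
    ("drop1", "Dropout", "conv1"),
    ("fc", "Gemm", "drop1"),
]
pruned = prune_for_inference(graph)
for node in pruned:
    print(node)
```

After pruning, the fc node reads directly from conv1 and the Dropout node is gone.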
2. Kernel Fusion
This is the optimization with the most immediate effect. Take a common sequence of operations that follows a convolution:
Conv → BatchNorm → ReLU → Add (residual connection)
Without fusion, the GPU executes 4 independent kernels in sequence. After each kernel finishes, the intermediate result must be written back to GPU memory, and the next kernel reads it out again. The time spent moving data far exceeds the time spent computing, which creates a memory-access bottleneck.
TensorRT merges these consecutive operations into a single dedicated kernel, such as Conv_BatchNorm_ReLU_Add. All operations complete in one pass and intermediate data is no longer read and written repeatedly, which greatly reduces latency.
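The arithmetic behind this fusion can be sketched in plain Python. This is only an illustration under simplifying assumptions, not TensorRT's kernel: the "convolution" is reduced to a single pointwise weight and bias, and all function names are hypothetical. The point is that BatchNorm with frozen statistics is an affine transform, so it can be folded into the convolution's weight and bias, and the remaining ReLU and Add can then be applied in the same pass without materializing intermediate lists.

```python
import math


def conv1x1(x, w, b):
    """Pointwise stand-in for a convolution: y = w * x + b."""
    return [w * v + b for v in x]


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm with frozen running statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [(v - mean) * scale + beta for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def add(x, residual):
    return [a + r for a, r in zip(x, residual)]


def unfused(x, residual, w, b, gamma, beta, mean, var):
    """Four separate passes; each intermediate list mimics a round trip to memory."""
    y = conv1x1(x, w, b)
    y = batchnorm(y, gamma, beta, mean, var)
    y = relu(y)
    return add(y, residual)


def fused(x, residual, w, b, gamma, beta, mean, var, eps=1e-5):
    """One pass: BatchNorm is folded into the conv weight and bias up front."""
    scale = gamma / math.sqrt(var + eps)
    w_folded = w * scale
    b_folded = (b - mean) * scale + beta
    return [max(0.0, w_folded * v + b_folded) + r for v, r in zip(x, residual)]


params = dict(w=1.5, b=0.2, gamma=0.9, beta=-0.1, mean=0.3, var=4.0)
x = [-2.0, 0.5, 3.0]
residual = [1.0, 1.0, 1.0]
out_unfused = unfused(x, residual, **params)
out_fused = fused(x, residual, **params)
print(out_unfused)
print(out_fused)
```

Both versions produce the same values up to floating-point rounding; the fused version simply touches each element once.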
3. Hardware-specific instruction set: Tensor Cores
Starting with the NVIDIA Volta architecture (V100) and continuing with Turing (RTX 20 series) and later generations, GPUs include Tensor Cores, units specialized for matrix multiply-accumulate (GEMM). More than 90% of the computation in models such as CNNs and Transformers is GEMM.
TensorRT automatically detects GEMM operators in the model and, depending on the input precision (FP16 or INT8 is ideal) and shape, dispatches them directly to Tensor Cores, which is 2 to 8 times faster than ordinary CUDA cores.
4. Precision compression (quantization)
Model weights and activations are compressed from FP32 (32-bit floating point) to FP16 (half precision) or INT8 (8-bit integer), which makes the model significantly faster and smaller with minimal accuracy loss.
- FP16 mixed precision: most vision models (ResNet, YOLO, etc.) lose almost no accuracy, are halved in size, run 1 to 4 times faster, and map naturally onto Tensor Cores.
- INT8 quantization: the model shrinks to 1/4 of its FP32 size and runs 1 to 2 times faster than FP16, but quantization calibration must be done first (TensorRT runs the model on a small amount of real data, records the range of activation values, and then maps that range to INT8) to avoid a large drop in accuracy.
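The calibration step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not TensorRT's calibrator: it assumes symmetric quantization with a scale taken from the largest absolute activation seen in the calibration data, whereas TensorRT's built-in calibrators use more refined range selection. The function names are hypothetical.

```python
def calibrate_scale(calibration_batches):
    """Record the activation range on sample data and derive an INT8 scale."""
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / 127.0


def quantize(values, scale):
    """Map floats to integers in [-127, 127], clamping out-of-range values."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Activations observed while running a few calibration samples.
calibration_batches = [[0.5, -1.0], [2.54, 0.1]]
scale = calibrate_scale(calibration_batches)

activations = [2.54, -1.0, 10.0]  # 10.0 lies outside the calibrated range
codes = quantize(activations, scale)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

Values inside the calibrated range are recovered to within half a quantization step, while the out-of-range value 10.0 is clamped. This is why calibration data should resemble real inputs: a range that is too narrow clips activations, and one that is too wide wastes resolution.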
4. PyTorch → TensorRT deployment: a minimal walkthrough
PyTorch cannot export a TensorRT model directly and has to go through an intermediate format:
PyTorch → ONNX → TensorRT
ONNX (Open Neural Network Exchange) acts like a universal translator that connects training frameworks to inference engines.
We will use the classic ResNet18 image classification model and walk through the process from start to finish.
Environment preparation
Make sure the following libraries are installed. It is highly recommended to use the official NVIDIA Docker image nvcr.io/nvidia/tensorrt:24.05-py3, which removes the trouble of version matching:
- PyTorch + torchvision (with CUDA)
- ONNX
- ONNX Runtime (used to verify ONNX correctness)
- TensorRT Python API
Step 1: Export the PyTorch model to ONNX
Key points:
- The model must be switched to eval() mode, which freezes the BN layers.
- Prepare a dummy input so that the exporter knows the input shape.
- For opset_version, a value between 15 and 18 is recommended for the best compatibility.
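A minimal sketch of the export, under stated assumptions: the function name, the file name resnet18.onnx, the tensor names input and output, and the choice of opset 17 are illustrative, and the model uses randomly initialized weights (weights=None) so that nothing is downloaded. The heavy imports sit inside the function so the file can be imported on a machine without the GPU stack.

```python
def export_resnet18_to_onnx(onnx_path="resnet18.onnx", opset_version=17):
    """Export a torchvision ResNet18 to ONNX with a fixed 1x3x224x224 input."""
    import torch
    import torchvision

    # weights=None gives random weights; load trained weights in real use.
    model = torchvision.models.resnet18(weights=None)
    model.eval()  # inference mode: BN uses frozen statistics, Dropout is off

    # The dummy input only tells the exporter the input shape and dtype.
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],    # assumed names, reused in later steps
        output_names=["output"],
        opset_version=opset_version,
    )
    return model, onnx_path
```

Calling export_resnet18_to_onnx() writes the ONNX file and returns the model so that its outputs can be compared against the exported graph.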
Step 2: Verify ONNX model correctness
This step is easy to overlook, but it is crucial. Run one inference with ONNX Runtime and compare the result with PyTorch's output to confirm that the exported model is structurally and numerically correct.
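One way to do this check is sketched below. The function name verify_onnx and the tolerances are assumptions made for illustration; it expects an eval-mode PyTorch model and the ONNX file exported from that same model, with a 1x3x224x224 input.

```python
def verify_onnx(model, onnx_path, rtol=1e-3, atol=1e-5):
    """Compare ONNX Runtime output with PyTorch output on one random input."""
    import numpy as np
    import onnx
    import onnxruntime as ort
    import torch

    # Structural check: raises if the ONNX graph is malformed.
    onnx.checker.check_model(onnx.load(onnx_path))

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        expected = model(x).numpy()

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    (actual,) = session.run(None, {input_name: x.numpy()})

    # Numerical check: raises AssertionError if the outputs diverge.
    np.testing.assert_allclose(expected, actual, rtol=rtol, atol=atol)
    return float(np.max(np.abs(expected - actual)))
```

The function returns the largest absolute difference, which is useful to log; small nonzero differences are normal because the two runtimes order floating-point operations differently.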
Step 3: Convert ONNX to TensorRT engine
FP16 mixed precision is used here to balance speed and accuracy.
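A minimal build sketch using the TensorRT Python API follows. The function name, file names, and the 1 GiB workspace limit are assumptions. The EXPLICIT_BATCH flag is required for ONNX models on TensorRT 8.x and is deprecated (explicit batch is the default) on TensorRT 10, so treat that line as version-dependent.

```python
def build_fp16_engine(onnx_path="resnet18.onnx", engine_path="resnet18.engine"):
    """Parse an ONNX file and serialize a TensorRT engine with FP16 enabled."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Explicit-batch network; the flag is needed on TensorRT 8.x (assumption:
    # it is still accepted, with a deprecation warning, on TensorRT 10).
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Upper bound on scratch memory the builder may use while tuning kernels.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    # Allow FP16 kernels; TensorRT still picks FP32 where it is faster or safer.
    config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")

    with open(engine_path, "wb") as f:
        f.write(serialized)
    return engine_path
```

Building can take from seconds to minutes because TensorRT benchmarks candidate kernels on the target GPU. The resulting engine file is tied to that GPU model and TensorRT version, so it should be rebuilt when either changes.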
At this point, a lightweight, high-speed inference engine is ready. Loading the engine and running inference afterwards takes only a few lines of code, and latency is usually in the millisecond range.
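Loading and running the engine might look like the sketch below. It makes several assumptions: TensorRT 8.5 or newer (the name-based tensor API), an engine whose tensors are named input and output with FP32 input and output types, and PyTorch CUDA tensors used as device buffers instead of a separate CUDA binding. The function name run_engine is hypothetical.

```python
def run_engine(engine_path, batch):
    """Deserialize a TensorRT engine and run one inference on a torch tensor."""
    import tensorrt as trt
    import torch

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Device buffers: the input copied to the GPU, and an empty output tensor.
    x = batch.float().contiguous().cuda()
    out_shape = tuple(engine.get_tensor_shape("output"))  # assumed tensor name
    y = torch.empty(out_shape, dtype=torch.float32, device="cuda")

    # Bind raw device pointers by tensor name (TensorRT >= 8.5 API).
    context.set_tensor_address("input", x.data_ptr())
    context.set_tensor_address("output", y.data_ptr())

    stream = torch.cuda.current_stream()
    context.execute_async_v3(stream.cuda_stream)
    stream.synchronize()  # wait for the GPU before reading the result
    return y.cpu()
```

In a real service the engine and execution context would be created once and reused for every request, since deserialization is far slower than a single inference.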
5. Tips for avoiding pitfalls
- Version matching is a prerequisite: the versions of CUDA, cuDNN, TensorRT, and PyTorch must be compatible with one another. It is strongly recommended to use the official NVIDIA Docker image directly to avoid "environment hell".
- Prefer static shapes and batch sizes: dynamic shapes are flexible, but they limit how aggressively TensorRT can optimize, and the engine may run 10% to 30% slower than with static shapes. If your input size is fixed (for example, security camera frames resized to 224×224), use static shapes without hesitation.
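For contrast, this is roughly what opting into a dynamic batch size involves. The sketch assumes the ONNX model was exported with a dynamic batch axis (for example via the dynamic_axes argument of torch.onnx.export) and that the input tensor is named input; the helper name and the batch sizes are hypothetical. The builder and config arguments are the tensorrt.Builder and builder config used when building the engine.

```python
def add_dynamic_batch_profile(builder, config, input_name="input",
                              min_batch=1, opt_batch=8, max_batch=32,
                              chw=(3, 224, 224)):
    """Register one optimization profile covering a range of batch sizes."""
    profile = builder.create_optimization_profile()
    profile.set_shape(
        input_name,
        (min_batch, *chw),  # smallest shape the engine must accept
        (opt_batch, *chw),  # shape TensorRT tunes its kernels for
        (max_batch, *chw),  # largest shape the engine must accept
    )
    config.add_optimization_profile(profile)
    return profile
```

TensorRT tunes kernels for the opt shape, so other shapes in the range may run on less suitable kernels. With a static shape none of this is needed, which is part of why the static path is both simpler and faster.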

