Inference acceleration frameworks: core components from concept to implementation (with an OpenVINO/TensorRT quick-start bonus)

Introduction

An inference acceleration framework is the last-mile technology that takes a deep learning model from a "laboratory demo" to an "industrial or consumer-grade service". Research models often weigh tens of MB or even several GB and take tens of milliseconds per inference. Such a framework uses the following key techniques to shrink them (in some scenarios down to the KB level) and to cut latency to a few milliseconds or even microseconds, while keeping the power consumption and memory footprint of hardware such as CPUs, GPUs, NPUs, and TPUs under control:

  • Computation graph pruning and compression: remove nodes that forward propagation does not need, reducing the amount of computation;
  • Lightweight operator fusion: merge several independent operations into one "super operator" to reduce memory reads and writes;
  • Hardware-specific instruction-set adaptation: tune for the SIMD width, thread count, and other characteristics of a specific platform;
  • Dynamic/static precision calibration and compression: replace FP32 with lower-bit data types for significant speedups with almost no loss of accuracy.
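
To make operator fusion concrete, here is a minimal NumPy sketch that folds a batch-normalization layer into the linear layer in front of it, so that two operations become one. The function name `fold_batchnorm` and the toy shapes are illustrative assumptions, not any framework's API; inference engines apply the same algebra to convolution + BatchNorm pairs.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * (W @ x + b - mean) / sqrt(var + eps) + beta into one linear op."""
    scale = gamma / np.sqrt(var + eps)      # one scale factor per output channel
    W_fused = W * scale[:, None]            # scale every output row of W
    b_fused = (b - mean) * scale + beta     # fold the shift into the bias
    return W_fused, b_fused

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=3)

# Unfused: linear layer followed by batch normalization (two passes over memory)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

# Fused: a single linear layer with rewritten weights
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
y_fused = W_f @ x + b_f

max_abs_diff = float(np.max(np.abs(y_ref - y_fused)))
print(f"max abs difference between fused and unfused output: {max_abs_diff:.2e}")
```

The fused layer produces the same output up to floating-point rounding, but the intermediate tensor never has to be written to and read back from memory.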

These techniques are what make scenarios such as autonomous-driving perception, real-time translation on mobile phones, and edge alerts for security monitoring feasible, all of which place extreme demands on response time, resource limits, and deployment cost.



Why not use native PyTorch / TensorFlow inference?

Many people who are new to deployment ask: **Isn't it good enough to run inference directly with torch.load() + model.to(device) + model.eval()?** It is convenient, but only for quick validation. Once you move to large-scale deployment, edge devices, or scenarios with strict real-time requirements, the performance shortcomings of the native framework become obvious. The table below compares the two:

| Dimension | Native PyTorch / TensorFlow inference | Mainstream inference acceleration frameworks (TensorRT / OpenVINO, etc.) |
| --- | --- | --- |
| Response latency | High (tens or even hundreds of ms) | Low (down to a few ms or even microseconds; INT8 quantization can cut latency by more than 90% in favorable cases) |
| Hardware utilization | Low (no operator fusion, many redundant computations, kernels not tuned to the hardware instruction set) | High (operator fusion reduces memory access, kernels are tuned to the instruction set; utilization can exceed 80%) |
| Resource usage | High (loads the full training framework plus an uncompressed model; large RAM/VRAM footprint) | Low (loads only the inference engine plus a compressed model; edge-device friendly) |
| Platform compatibility | Weak (deploying to NPUs, TPUs, or other edge accelerators usually requires extra conversion tools or vendor plugins) | Strong (TensorRT covers the NVIDIA GPU range; OpenVINO targets Intel CPUs, GPUs, NPUs, and edge chips) |
| Dynamic batch support | Supported natively, but efficiency fluctuates widely | Supported, but requires configuration up front (TensorRT's dynamic shapes may increase build time; OpenVINO's dynamic batch is more flexible) |

💡 In one sentence: native inference belongs on the "lab bench", while an inference acceleration framework is the real "production floor".
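
Latency figures like the ones in the table depend heavily on the model and the hardware, so it is worth measuring them yourself. The helper below is a minimal sketch using only the standard library; `benchmark` and its parameters are hypothetical names, and the lambda stands in for a real inference call.

```python
import statistics
import time

def benchmark(fn, warmup=10, runs=100):
    """Return latency statistics in milliseconds for a zero-argument callable."""
    for _ in range(warmup):              # warm-up runs are excluded from the statistics
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
    }

# Stand-in workload; replace the lambda with your own inference call.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

When timing GPU inference, make sure the callable waits for the device to finish (for example with torch.cuda.synchronize()) before it returns, otherwise only the kernel launch is measured.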


Core concepts for getting started: first build a knowledge base

Before diving into inference acceleration frameworks, there are a few frequently used concepts you need to understand first.

1. Computation graph

The computation graph is the underlying form of a deep learning model: every tensor operation (convolution, pooling, activation, fully connected layer, and so on) is abstracted as a node, and the data flowing between operations is abstracted as an edge. The bulk of an inference acceleration framework's optimization work comes down to "simplifying the graph structure + optimizing the execution order".
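
As a toy illustration of "simplifying the graph structure", the sketch below represents a graph as a plain dictionary and applies two classic passes: constant folding and dead-node pruning. The graph encoding and the function names are assumptions made for this example and do not mirror the internals of any real framework.

```python
# Each node maps name -> (op, payload). For "const" the payload is a value,
# for "add"/"mul" it is the list of input node names.
graph = {
    "x":   ("input", []),
    "c1":  ("const", 2.0),
    "c2":  ("const", 3.0),
    "k":   ("mul", ["c1", "c2"]),   # depends only on constants, so it can be folded
    "y":   ("mul", ["x", "k"]),     # the model output
    "dbg": ("add", ["x", "c1"]),    # never reaches the output, so it is dead
}
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def fold_constants(graph):
    """Replace every op whose inputs are all constants by a precomputed constant."""
    folded = dict(graph)
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op in OPS and all(folded[a][0] == "const" for a in args):
                folded[name] = ("const", OPS[op](*(folded[a][1] for a in args)))
                changed = True
    return folded

def prune_dead_nodes(graph, outputs):
    """Keep only the nodes that the requested outputs actually depend on."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op in OPS:
            stack.extend(args)
    return {name: node for name, node in graph.items() if name in live}

optimized = prune_dead_nodes(fold_constants(graph), outputs=["y"])
print(sorted(optimized))   # node names that survive both passes
print(optimized["k"])      # "k" is now a precomputed constant
```

Six nodes shrink to three: the multiplication of two constants is computed once at optimization time, and the branch that never feeds the output disappears.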

2. ONNX (Open Neural Network Exchange)

ONNX is currently the most widely used intermediate representation (IR) format for deep learning models in the industry. It works like a "USB adapter for the model world": models trained with PyTorch, TensorFlow, Keras, MXNet, and other frameworks can be exported to a unified .onnx file, which is then handed to an inference framework such as TensorRT or OpenVINO to be parsed and further optimized for specific hardware.

From a process perspective, ONNX is the bridge between the training framework and the inference framework.

3. Precision calibration and compression

The weights and activations originally stored as 32-bit floating-point numbers (FP32, the default precision in research) are compressed into:

  • 16-bit floating point (FP16/BF16)
  • 8-bit integer (INT8)
  • Even 2-bit / 1-bit integers (combined with QAT or extreme compression schemes)

Why does this work? Deep learning models are inherently robust to small numerical errors, so in most cases the accuracy lost to compression is negligible, while the gains in speed and model size are considerable.
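
The round trip below is a minimal NumPy sketch of symmetric per-tensor INT8 quantization, shown on random stand-in weights rather than a real model. `quantize_int8` and `dequantize` are illustrative helper names, not a library API.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: FP32 values -> INT8 values plus one FP32 scale."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)  # stand-in weights

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = float(np.max(np.abs(weights - restored)))

print(f"storage: {weights.nbytes} bytes -> {q.nbytes} bytes")
print(f"max abs error: {max_err:.6f} (half a quantization step is {scale / 2:.6f})")
```

Storage drops by a factor of four, and no value moves by more than about half a quantization step, which is the kind of small perturbation most networks tolerate.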

There are two main schools of precision compression:

  • PTQ (Post-Training Quantization): no retraining is needed. A small amount of calibration data (tens to hundreds of images) is used to calibrate the quantization range of the activations, so it is fast to deploy and easy to verify.
  • QAT (Quantization-Aware Training): "simulated quantization nodes" are inserted in the final stage of training so that the model adapts to the future quantization error in advance. The accuracy loss is much smaller than with PTQ, which makes it suitable for scenarios with extremely high accuracy requirements (such as medical image diagnosis).
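
The calibration step of PTQ can be sketched in a few lines of NumPy. The example assumes a simple min/max range tracker and unsigned 8-bit affine quantization; `MinMaxObserver` and `fake_quantize` are hypothetical names for this sketch, and the random arrays stand in for real ReLU activations collected from calibration images. Production toolkits typically offer more elaborate range estimators.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the activation range seen across calibration batches."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, activations):
        self.lo = min(self.lo, float(activations.min()))
        self.hi = max(self.hi, float(activations.max()))

    def qparams(self, qmin=0, qmax=255):
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)   # the range must contain zero
        scale = (hi - lo) / (qmax - qmin)
        zero_point = int(round(qmin - lo / scale))
        return scale, zero_point

def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then dequantize, exposing the error the integer model would see."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
observer = MinMaxObserver()
for _ in range(32):                                      # 32 calibration batches
    batch = np.maximum(rng.normal(size=(8, 64)), 0.0)    # stand-in ReLU activations
    observer.observe(batch)

scale, zero_point = observer.qparams()
test_batch = np.maximum(rng.normal(size=(8, 64)), 0.0)
error = float(np.mean(np.abs(test_batch - fake_quantize(test_batch, scale, zero_point))))
print(f"scale={scale:.5f}, zero_point={zero_point}, mean abs error={error:.5f}")
```

The same quantize-then-dequantize round trip is, in spirit, what the simulated quantization nodes of QAT apply in the forward pass during training, so that the weights learn to compensate for the rounding.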

Quick-start bonus: PyTorch → ONNX → TensorRT / OpenVINO inference in a few short scripts

Here we use a ResNet18 image classification model as an example to walk through the inference pipeline PyTorch → ONNX → TensorRT (NVIDIA side) / OpenVINO (Intel side).

Environment preparation

We recommend creating a separate conda virtual environment and installing the required dependencies in it:

# Common dependencies
pip install torch torchvision onnx onnxruntime

# NVIDIA side: install the matching CUDA/cuDNN versions first, then install TensorRT
# See the official documentation: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html

# Intel side: install OpenVINO directly with pip
pip install openvino openvino-dev

Step 1: PyTorch → ONNX model conversion

First convert the official pre-trained ResNet18 to ONNX format:

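import torch
import torchvision.models as models

# 1. Load the pre-trained model and switch to inference mode
#    (torchvision >= 0.13; older releases use models.resnet18(pretrained=True))
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# 2. Define a dummy input (tells ONNX the input shape and data type)
dummy_input = torch.randn(1, 3, 224, 224)   # batch=1, 3 RGB channels, 224×224

# 3. Export the ONNX model
onnx_path = "resnet18.onnx"
torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    export_params=True,          # export the model weights
    opset_version=13,            # ONNX opset version (>= 12 recommended for compatibility)
    do_constant_folding=True,    # constant-folding optimization
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }                             # allow a dynamic batch dimension
)
print(f"Model exported to ONNX format, saved at: {onnx_path}")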

Step 2: ONNX → OpenVINO fast inference

The OpenVINO inference process is very intuitive: load ONNX → compile to device executable network → create inference request → perform inference.

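import numpy as np
from openvino.runtime import Core   # newer releases also expose this as openvino.Core

onnx_path = "resnet18.onnx"   # file produced in Step 1

# 1. Initialize the OpenVINO inference core
core = Core()

# 2. Read the ONNX model and compile it for a device (CPU here; GPU, NPU, etc. also work)
model = core.read_model(model=onnx_path)
compiled_model = core.compile_model(model=model, device_name="CPU")

# 3. Create an inference request
infer_request = compiled_model.create_infer_request()

# 4. Prepare the input (random data here; replace with a real image in practice)
image = np.random.randn(1, 3, 224, 224).astype(np.float32)

# 5. Run synchronous inference (switch to asynchronous inference for high throughput)
output = infer_request.infer(inputs={"input": image})

# 6. Fetch the output (ResNet18 returns raw scores for 1000 classes)
output_tensor = output[compiled_model.output(0)]
predicted_class = int(np.argmax(output_tensor))
print(f"OpenVINO inference finished, predicted class: {predicted_class}")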

🔧 Tip: you can also first convert the ONNX model to OpenVINO's IR format (.xml + .bin) with the mo (Model Optimizer) command-line tool and then load and compile that, which sometimes yields better performance.
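
The examples in this article feed random data to the model. For a real image, the ImageNet-pretrained torchvision ResNet18 expects normalized RGB input in NCHW layout. The function below is a minimal NumPy sketch of that preprocessing; `preprocess` is a hypothetical helper name, and it assumes the image has already been decoded and resized to 224×224.

```python
import numpy as np

# Channel-wise statistics used by torchvision's ImageNet-pretrained models
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_rgb_uint8):
    """HWC uint8 RGB image (already 224x224) -> NCHW float32 batch of size 1."""
    x = image_rgb_uint8.astype(np.float32) / 255.0    # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD            # normalize each channel
    x = np.transpose(x, (2, 0, 1))                    # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...])   # add the batch dimension

# Random pixels stand in for a decoded photo
fake_image = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
batch = preprocess(fake_image)
print(batch.shape, batch.dtype)
```

If the image is loaded with OpenCV's cv2.imread, remember that it comes back in BGR channel order and has to be converted to RGB before this step.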


Step 3: ONNX → TensorRT fast inference

Getting started with TensorRT is a little more involved, because the ONNX model first has to be compiled into a dedicated inference engine (Engine), but the core idea is similar to OpenVINO.

import numpy as np
import tensorrt as trt
import torch   # used only for GPU memory and the CUDA stream

# This version uses the name-based tensor API, which requires TensorRT 8.5 or later.
onnx_path = "resnet18.onnx"   # file produced in Step 1

# 1. Initialize the TensorRT logger and runtime
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
TRT_RUNTIME = trt.Runtime(TRT_LOGGER)

def build_engine(onnx_file_path):
    """
    Build a TensorRT engine from an ONNX file
    """
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)   # 1 GB workspace

    # Enable FP16 optimization if the GPU supports it
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # Parse the ONNX model
    with open(onnx_file_path, "rb") as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # The ONNX model has a dynamic batch axis, so an optimization profile is required
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224))
    config.add_optimization_profile(profile)

    # Build the serialized engine, then deserialize it for execution
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        return None
    return TRT_RUNTIME.deserialize_cuda_engine(serialized)

# 2. Build the engine
engine = build_engine(onnx_path)
if engine is None:
    raise SystemExit("Failed to build the TensorRT engine!")

# 3. Create the execution context and fix the batch size for this run
context = engine.create_execution_context()
context.set_input_shape("input", (1, 3, 224, 224))
stream = torch.cuda.Stream()

# 4. Allocate device buffers and register their addresses with the context
def allocate_buffers(engine, context):
    inputs, outputs = {}, {}
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        shape = tuple(context.get_tensor_shape(name))
        dtype = trt.nptype(engine.get_tensor_dtype(name))
        device_mem = torch.from_numpy(np.zeros(shape, dtype=dtype)).cuda()
        context.set_tensor_address(name, device_mem.data_ptr())
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            inputs[name] = device_mem
        else:
            outputs[name] = device_mem
    return inputs, outputs

inputs, outputs = allocate_buffers(engine, context)

# 5. Prepare the input and copy it to the GPU
image = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs["input"].copy_(torch.from_numpy(image))
torch.cuda.synchronize()   # make sure the upload finished before launching on another stream

# 6. Run asynchronous inference, then wait for it to finish
context.execute_async_v3(stream_handle=stream.cuda_stream)
stream.synchronize()

# 7. Copy the output from the GPU back to the CPU
logits = outputs["output"].cpu().numpy()

# 8. Pick the class with the highest score
predicted_class = int(np.argmax(logits))
print(f"TensorRT inference finished, predicted class: {predicted_class}")
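
Both runtimes return the raw 1000-way scores (logits) of ResNet18 rather than probabilities. If you want probabilities or the top few candidates instead of a single argmax, a small post-processing step is enough. The sketch below uses only NumPy; `softmax` and `top_k` are illustrative helper names, and the random logits stand in for a real model output.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def top_k(logits, k=5):
    """Return the k most likely (class_index, probability) pairs for one sample."""
    probs = softmax(logits.reshape(-1))
    indices = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in indices]

# Random scores stand in for the output of the inference step
logits = np.random.default_rng(0).normal(size=(1, 1000)).astype(np.float32)
results = top_k(logits, k=5)
for class_index, probability in results:
    print(f"class {class_index}: {probability:.4f}")
```

To turn class indices into readable labels you additionally need the ImageNet class list that matches the pretrained weights.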

Summary and advanced directions

In this article you have:

  • ✅ seen why inference acceleration frameworks matter and how they differ fundamentally from native PyTorch/TensorFlow inference;
  • ✅ consolidated core concepts such as computation graphs, ONNX, and precision calibration and compression (PTQ/QAT);
  • ✅ walked through the entry-level deployment pipeline PyTorch → ONNX → TensorRT / OpenVINO yourself, using ResNet18 as a complete example.

If you want to take it to the next level, you can refer to the following advanced paths:

  1. Precision compression in depth: run comparative experiments on PTQ vs QAT, and master techniques such as OpenVINO INT8 calibration and TensorRT INT8 quantization.
  2. Principles of computation graph optimization: dig into TensorRT's operator fusion mechanism and OpenVINO's graph pruning and constant-folding strategies.
  3. Edge device deployment optimization: carry out a full deployment on devices such as Jetson Nano and Intel NCS2, and deal with issues such as power consumption, heat dissipation, and multi-model scheduling.
  4. High-throughput tuning for cloud services: use dynamic batch, dynamic shape, asynchronous inference, multi-threaded pipelines, and similar techniques to serve large numbers of concurrent requests.
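
As a first taste of the dynamic batch idea from item 4, the sketch below groups pending single-image requests into batches of limited size. `make_batches` is a hypothetical helper, and the zero-filled arrays stand in for preprocessed images; a real serving system would also bound how long a request may wait for its batch to fill.

```python
import numpy as np

def make_batches(requests, max_batch_size):
    """Group same-shaped per-request tensors into batches of at most max_batch_size."""
    for start in range(0, len(requests), max_batch_size):
        yield np.stack(requests[start:start + max_batch_size], axis=0)

# Ten pending requests, each a CHW image tensor
pending = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(10)]
batch_shapes = [batch.shape for batch in make_batches(pending, max_batch_size=4)]
print(batch_shapes)
```

The last batch is smaller than the others, which is exactly why the ONNX export in Step 1 declares the batch dimension as a dynamic axis.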

Daoman PythonAI will keep publishing advanced material on inference acceleration frameworks, so stay tuned!