Inference Acceleration Frameworks: ONNX Runtime, TensorRT, and OpenVINO Explained

Introduction

Training an excellent deep learning model is only half the story. Once you deploy that model to servers, mobile phones, cameras, or smart speakers, you immediately run into a hard problem: inference speed. Training frameworks such as PyTorch and TensorFlow are built for backpropagation and debugging. For online inference, which only runs the forward pass, they tend to be bloated, slow, and resource-hungry.

This is where a specialized inference acceleration framework comes in. It acts like a turbocharger for model deployment. Through techniques such as computation graph optimization, operator fusion, deep adaptation to hardware instruction sets, and quantization, it can cut latency sharply and raise throughput. Whether the task is millisecond-level decision-making for autonomous driving or real-time video analysis for security monitoring, this step is indispensable.



1. Overview of inference acceleration frameworks

1.1 Training vs. inference: why not use the native framework directly?

An analogy: training is like building and tuning an F1 car. It needs sensors and engineers making adjustments, and it must be able to brake and reverse at any time (backpropagation). Inference is like the race itself: you only step on the accelerator (forward propagation), and faster is better.

The main problems with direct inference in native frameworks are:

  • Low efficiency: nothing is tuned specifically for the target hardware, so GPU/CPU utilization stays low.
  • Wasted resources: the training-time computation graph keeps many nodes that only matter for weight updates; during inference they are dead weight.
  • Awkward deployment: embedding a PyTorch model directly in a C++ service is cumbersome, and cross-platform support is limited.

Therefore, we need a specialized inference framework to "slim down, tune, and package" the model into the form best suited to the specific hardware.
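As one concrete instance of the "slimming down" step, inference engines commonly fold a BatchNorm layer into the weights of the layer before it, so the normalization costs nothing at run time. The sketch below uses a tiny fully connected layer in plain Python rather than a real convolution. The function names `linear`, `batchnorm`, and `fold_batchnorm` and all parameter values are made up for illustration; real engines apply this rewrite to the graph automatically.

```python
import math

def linear(x, weight, bias):
    """y[j] = sum_i weight[j][i] * x[i] + bias[j]"""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weight', bias') so that linear(x, weight', bias')
    equals batchnorm(linear(x, weight, bias))."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    new_weight = [[w * sc for w in row] for row, sc in zip(weight, scale)]
    new_bias = [(b - m) * sc + bt for b, m, sc, bt in zip(bias, mean, scale, beta)]
    return new_weight, new_bias

# Toy parameters: 3 inputs, 2 output channels (all values are arbitrary).
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]

x = [1.0, 2.0, 3.0]
two_step = batchnorm(linear(x, weight, bias), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
one_step = linear(x, folded_w, folded_b)

# The two results agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(two_step, one_step))
print(f"max difference after folding: {max_diff:.2e}")
```

After folding, the BatchNorm node can be dropped from the graph entirely, which is the kind of saving a training framework does not make on its own.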

1.2 Quick overview of mainstream frameworks

| Framework | Origin | Core advantages | Suitable scenarios | Performance level |
| --- | --- | --- | --- | --- |
| ONNX Runtime | Microsoft | Cross-platform, multiple hardware backends, easy to get started | General cross-platform deployment, rapid prototyping | Medium-High |
| TensorRT | NVIDIA | Extreme optimization on NVIDIA GPUs, mature quantization support | Cloud GPUs and Jetson edge devices | Extremely high |
| OpenVINO | Intel | Deeply optimized for the whole Intel hardware family, edge-friendly | Intel CPU/GPU/VPU/NPU | High |
| TVM | Apache (started at the University of Washington) | Compiler-level automatic optimization, support for custom hardware | Research, niche hardware adaptation | High |

In practice, the usual workflow is: export the model from whichever training framework you use to the unified ONNX format, get it running quickly with ONNX Runtime, and switch to TensorRT or OpenVINO only when you need maximum performance.


2. ONNX: The “master key” for model exchange

ONNX (Open Neural Network Exchange) is an open standard that defines a common graph representation, so models can move between frameworks and hardware targets. You can train a model in PyTorch, export it to ONNX, and then run it from C#, Java, or on mobile devices, and even on accelerators from different vendors, which greatly reduces deployment friction.

2.1 Getting started with converting PyTorch to ONNX

PyTorch has a friendly export function built in. The core is a single call, torch.onnx.export. The following example demonstrates the complete conversion and validation process:

import torch
import torch.onnx
import onnx

class ExampleCNN(torch.nn.Module):
    """A simple classification CNN, for illustration only."""
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 64, 3, padding=1)
        self.bn1 = torch.nn.BatchNorm2d(64)
        self.relu = torch.nn.ReLU()
        self.pool = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(64, 10)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# 1. Instantiate the model and switch to eval mode (BatchNorm uses running stats, Dropout is disabled)
model = ExampleCNN().eval()

# 2. Build a dummy input whose dtype and non-dynamic dimensions match the real input
dummy_input = torch.randn(1, 3, 224, 224)  # one image, 3 channels, 224x224

# 3. Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "example_cnn.onnx",
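    export_params=True,          # store the trained weights in the file
    opset_version=14,            # ONNX operator set version; the target runtime must support it
    do_constant_folding=True,    # precompute operations whose inputs are all constants
    input_names=["input"],       # name the input node
    output_names=["output"],     # name the output node
    dynamic_axes={               # allow the batch size to vary at inference time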
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)

# 4. Validate the exported model with the onnx library
onnx_model = onnx.load("example_cnn.onnx")
onnx.checker.check_model(onnx_model)
print("✅ ONNX model validated successfully!")

After exporting, you will have a .onnx file that any ONNX-compatible inference engine can load.
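Note that onnx.checker.check_model only confirms the file is structurally valid; it does not confirm that the exported graph computes the same numbers as the original model. A common follow-up is to compare outputs numerically. The helper below is a minimal sketch: `outputs_match` and its tolerance defaults are assumptions chosen for illustration, and the arrays are stand-ins so the snippet runs on its own.

```python
import numpy as np

def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare two output arrays; return (ok, max_abs_error)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        return False, float("inf")
    max_abs_error = float(np.max(np.abs(reference - candidate))) if reference.size else 0.0
    ok = bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))
    return ok, max_abs_error

# Stand-in arrays. In practice `reference` would be the PyTorch output and
# `candidate` the output of the ONNX inference engine for the same input.
reference = np.array([[0.10, -1.20, 3.40]], dtype=np.float32)
candidate = reference + np.float32(1e-6)   # simulate tiny numerical drift
ok, err = outputs_match(reference, candidate)
print(f"match={ok}, max abs error={err:.2e}")
```

With the export above, the reference would be `model(dummy_input).detach().numpy()`, and the candidate would come from running example_cnn.onnx on `dummy_input.numpy()`.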


3. ONNX Runtime: The first choice for running the model

ONNX Runtime is a high-performance inference engine maintained by Microsoft. Its biggest advantages are versatility and ease of use. You can switch the compute backend (CPU, CUDA, TensorRT, DirectML, and others) by changing the list of execution providers, provided the installed build includes that provider, so one codebase can serve multiple kinds of hardware. Its performance is good enough for most rapid deployment scenarios.

3.1 Quick-start inference

import onnxruntime as ort
import numpy as np

# 1. Create an inference session; providers are tried in priority order (CUDA first, then CPU)
session = ort.InferenceSession(
    "example_cnn.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# 2. Inspect the input/output names and shapes the model expects
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
input_shape = session.get_inputs()[0].shape  # e.g. ['batch_size', 3, 224, 224]; a named axis is dynamic

# 3. Prepare data (note: the dtype must match the model's input type, float32 here)
input_data = np.random.randn(4, 3, 224, 224).astype(np.float32)  # run 4 images in one call

# 4. Run inference
results = session.run([output_name], {input_name: input_data})
output = results[0]
print(f"✅ Inference done, output shape: {output.shape}")
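The providers list is a priority order, and a provider is only usable if the installed ONNX Runtime build includes it. One defensive pattern is to filter the preferred list against what is actually available, which `ort.get_available_providers()` reports. The sketch below keeps that call in a comment so it runs without ONNX Runtime installed; `pick_providers` is a hypothetical helper name, and the "available" list is simulated.

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that are available, preserving priority order.

    The CPU provider is appended as a last-resort fallback if it is not
    already in the result.
    """
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

# With ONNX Runtime installed, `available` would come from:
#   import onnxruntime as ort
#   available = ort.get_available_providers()
# A CPU-only install is simulated here.
available = ["CPUExecutionProvider"]

providers = pick_providers(preferred, available)
print(providers)
```

The resulting list can be passed as the `providers` argument of `ort.InferenceSession`, so the same script behaves sensibly on a GPU server and on a CPU-only laptop.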

3.2 A few tips to make inference faster

import onnxruntime as ort

def get_optimized_session(model_path):
    # Enable all built-in graph optimizations
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    # CPU: threads used inside a single operator; physical cores minus one is a common starting point
    sess_options.intra_op_num_threads = 4
    sess_options.inter_op_num_threads = 1
    # Memory options: reuse buffers for intermediate tensors to lower memory usage
    sess_options.enable_mem_pattern = True
    sess_options.enable_mem_reuse = True

    return ort.InferenceSession(
        model_path,
        sess_options=sess_options,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )

These small settings can squeeze out extra throughput without changing the model itself.
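Whether a given setting helps depends on the model and the machine, so it is worth measuring. The following is a rough timing sketch: `benchmark` and `fake_inference` are illustrative names, and `fake_inference` stands in for a real call such as `session.run(...)`. Warm-up iterations are excluded so one-time costs do not skew the numbers, and the 95th percentile is a simple nearest-rank approximation.

```python
import statistics
import time

def benchmark(run_once, warmup=5, iterations=50):
    """Time a zero-argument callable; return latency statistics in milliseconds."""
    for _ in range(warmup):          # discard one-time costs (allocation, lazy init)
        run_once()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],   # approximate 95th percentile
    }

def fake_inference():
    """Placeholder workload standing in for a real inference call."""
    return sum(i * i for i in range(10_000))

stats = benchmark(fake_inference)
print({name: round(value, 3) for name, value in stats.items()})
```

To compare two session configurations, one could pass `lambda: session.run(None, {input_name: input_data})` for each session and look at the tail latency as well as the mean.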


4. Which framework to choose? A table to help you decide

In real projects, the choice of framework is rarely made on a hunch; it is driven by the hardware you deploy on.

| Deployment hardware / scenario | Preferred framework | Alternative framework |
| --- | --- | --- |
| NVIDIA cloud GPUs such as A100, T4, V100 | TensorRT | ONNX Runtime (CUDA backend) |
| NVIDIA Jetson edge devices | TensorRT | ONNX Runtime (TensorRT backend) |
| Intel Core/Xeon CPU | OpenVINO | ONNX Runtime (OpenVINO backend) |
| Intel Arc discrete GPU / Movidius VPU | OpenVINO | - |
| Mixed hardware, or one codebase for multiple platforms | ONNX Runtime | - |
| Rapid validation during development | ONNX Runtime | Native PyTorch |

Generally, you first get the pipeline working end to end with ONNX Runtime, and then decide from stress-test results whether to move to TensorRT or OpenVINO to get the most out of the hardware.
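Because ONNX Runtime can delegate to CUDA, TensorRT, or OpenVINO through its execution providers, the ONNX Runtime entries in the table can be written down as a provider priority list per hardware class. The mapping below is only a sketch: the hardware labels and the `providers_for` helper are hypothetical, while the provider strings are the names ONNX Runtime uses for those backends.

```python
# Hypothetical hardware labels mapped to ONNX Runtime provider priority lists.
PROVIDER_PRIORITY = {
    "nvidia_cloud_gpu": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "nvidia_jetson": ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    "intel_cpu": ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
}

def providers_for(hardware):
    """Return a provider list for a hardware label, defaulting to CPU only."""
    return list(PROVIDER_PRIORITY.get(hardware, ["CPUExecutionProvider"]))

for label in ("nvidia_jetson", "intel_cpu", "unknown_device"):
    print(label, "->", providers_for(label))
```

Keeping this choice in one table-like structure makes it easy to start every target on ONNX Runtime and later swap a single entry for a native TensorRT or OpenVINO path if benchmarks justify it.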

Some general deployment optimization suggestions

  1. Try half precision (FP16) first. On GPUs with FP16 hardware support, FP16 can often come close to doubling speed with almost no loss in accuracy, and it is standard practice on edge devices.

  2. Batch inference is your friend. Grouping independent requests into one batch wherever possible significantly improves GPU utilization and lowers the average cost per request.

  3. Keep operations inside the graph when you can. Preprocessing such as normalization, resizing, and color space conversion should be built into the ONNX graph or the engine's own pre- and post-processing as far as possible, to reduce data transfer between CPU and GPU.

  4. Warm up at startup. Before the service goes live, run a few dummy inferences on random input so that one-time costs such as model loading, memory allocation, and kernel compilation are paid in advance, and the first real request is not flagged as slow.
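Tip 2 can be sketched as a small helper that groups independent samples into batches, runs each batch through the model once, and hands results back in the original order. Everything here is illustrative: `run_batched` is a made-up name, and `fake_model` stands in for something like a `session.run` call on a stacked input.

```python
import numpy as np

def run_batched(run_batch, samples, max_batch_size=8):
    """Run samples through run_batch in chunks of at most max_batch_size.

    run_batch takes an array of shape (N, ...) and returns an array whose
    first dimension is also N. Results come back in input order.
    """
    outputs = []
    for start in range(0, len(samples), max_batch_size):
        batch = np.stack(samples[start:start + max_batch_size])  # shape (N, ...)
        outputs.extend(run_batch(batch))                          # one model call per batch
    return outputs

calls = []

def fake_model(batch):
    """Placeholder model: records the batch size and returns per-sample means."""
    calls.append(batch.shape[0])
    return batch.reshape(batch.shape[0], -1).mean(axis=1)

samples = [np.full((3, 4, 4), float(i), dtype=np.float32) for i in range(10)]
results = run_batched(fake_model, samples, max_batch_size=4)
print(f"{len(results)} results from batch sizes {calls}")
```

A production server would also cap how long a request waits for its batch to fill, since batching trades a little individual latency for throughput.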


Inference acceleration is a core skill in deep learning engineering. A good path is to master ONNX conversion and general-purpose inference with ONNX Runtime first, and then dig into the hardware-specific optimizations of TensorRT or OpenVINO for the hardware you have.

Summary

Inference acceleration frameworks are the key bridge between research experiments and production deployment:

  • ONNX is the cornerstone of the bridge, breaking down the barriers between different frameworks.
  • ONNX Runtime is a universal Swiss army knife that covers most common deployment scenarios.
  • TensorRT / OpenVINO are professional racing engines. When you have extreme requirements for performance, they are the best choice.

💡 Important reminder: there is no universally optimal framework, only the one best suited to your hardware and scenario. Before deploying, be sure to run a complete benchmark with your own data!
