How to convert a PyTorch model to TensorRT and speed up inference (2023)

The life of a machine learning engineer consists of long periods of frustration and a few moments of joy!

First, you fight to get your model to perform well on your training data. You look at the data, clean it up, and train again. You read about the bias-variance tradeoff in machine learning to approach the training process systematically.

One fine day, your PyTorch model will be perfectly trained and ready for production.

This is pure joy!

You take pride in the accuracy, mark your task as complete in the project tracker, and let your CTO know the model is ready.

She shakes her head disapprovingly and announces that the model is not ready for production. Training a model is not enough. You also need to make the model efficient at run time (that is, at inference).

You don't know how to proceed. Your friendly CTO recommends reading this post on TensorRT. So here you are, ready for yet another learning experience.

This post shows how to quickly and easily use TensorRT for deployment if you already have a network trained in PyTorch.

We will use the following steps.

  1. Train a model with PyTorch
  2. Convert the model to ONNX format
  3. Use NVIDIA TensorRT for inference

In this tutorial, we'll just use a pre-trained model, so we'll skip step 1. Now, let's understand what ONNX and TensorRT are.

What is ONNX?

There are many frameworks for training a deep learning model. The most popular are TensorFlow and PyTorch. However, a model trained in TensorFlow cannot be used directly in PyTorch, and vice versa.

ONNX stands for Open Neural Network Exchange. It is an open format created to represent machine learning models.

You can train your model in any framework of your choice and then convert it to ONNX format.

The great advantage of a common format is that the software or hardware that loads your model at runtime only has to be ONNX compatible.

ONNX is to machine learning models what JPEG is to images or MPEG to video.

What is TensorRT?

NVIDIA TensorRT is a high performance deep learning inference SDK.

It provides APIs for running inference on pre-trained models and generates runtime engines optimized for your platform.

This optimization is achieved in a variety of ways. For example, TensorRT lets us use INT8 (8-bit integer) or FP16 (16-bit floating point) arithmetic instead of the usual FP32. This reduction in precision can significantly speed up inference at the cost of a small drop in accuracy.

Other optimizations include minimizing the GPU memory footprint by reusing memory, fusing layers and tensors, and selecting the appropriate data layers based on the hardware.
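To get a feel for what reduced precision does to individual values, here is a minimal sketch that uses only the Python standard library. It rounds a few made-up weights to FP16 and applies a simple symmetric INT8 quantization with a single scale. The helper names are hypothetical, and the INT8 part is a simplified scheme for illustration: TensorRT picks its INT8 scales through its own calibration process, not this way.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(values, scale):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized, scale):
    """Map quantized integers back to floats."""
    return [q * scale for q in quantized]


# Made-up weights, used only to illustrate rounding error.
weights = [0.1, -0.7312, 0.25, 1.0]
fp16_weights = [to_fp16(w) for w in weights]

# One scale for the whole list, chosen so the largest magnitude maps to 127.
scale = max(abs(w) for w in weights) / 127
int8_weights = dequantize_int8(quantize_int8(weights, scale), scale)

for w, h, q in zip(weights, fp16_weights, int8_weights):
    print(f"value={w:+.6f}  fp16 error={abs(w - h):.2e}  int8 error={abs(w - q):.2e}")
```

Values that are exact in binary, such as 0.25 and 1.0, survive the FP16 round trip unchanged, while 0.1 picks up a small error. The INT8 error is bounded by half the scale.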

Setting up the environment for TensorRT

To reproduce the experiments in this article, you need an NVIDIA graphics card. Any architecture newer than Maxwell, whose compute capability is 5.0, will do. You can look up your GPU's compute capability in NVIDIA's table of CUDA GPUs. Don't forget to install the appropriate driver.

Install PyTorch, ONNX and OpenCV

Install Python 3.6 or later and run

    python3 -m pip install -r requirements.txt

The code has been tested on specific versions. But it's okay to try running it on other versions if you already have some of those components installed.

Install TensorRT

  1. Download and install NVIDIA CUDA 10.0 or later, following the official instructions.
  2. Download and extract the cuDNN library for your version of CUDA (login required).
  3. Download and extract the NVIDIA TensorRT library for your version of CUDA (login required). The minimum required version is [...]. Follow the installation guide for your system, and don't forget to install the Python part.
  4. Add the absolute paths to the CUDA, TensorRT, and cuDNN libraries to the PATH or LD_LIBRARY_PATH environment variable.
  5. Install PyCUDA.
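A quick way to confirm that the library directories really are on the search path is a small check like the one below. It is only a sketch: the function name and the two directories are hypothetical placeholders, so substitute the locations where you actually installed CUDA, cuDNN, and TensorRT.

```python
import os


def missing_library_dirs(required_dirs, env_value):
    """Return the required directories that are absent from a PATH-style string."""
    present = {os.path.normpath(p) for p in env_value.split(os.pathsep) if p}
    return [d for d in required_dirs if os.path.normpath(d) not in present]


# Hypothetical install locations; replace them with your own.
required = ["/usr/local/cuda/lib64", "/opt/TensorRT/lib"]
current = os.environ.get("LD_LIBRARY_PATH", "")

for directory in missing_library_dirs(required, current):
    print("not on LD_LIBRARY_PATH:", directory)
```

If the loop prints nothing, every listed directory is already on the variable.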

Now we are ready for our experiment.

How to convert a PyTorch model to TensorRT

Let's walk through the steps required to convert a model from PyTorch to TensorRT.

1. Load and run a pre-trained model with PyTorch

First, let's implement a simple classifier with a pre-trained network in PyTorch. As an example we will take ResNet-50, but you can choose any network you like. More information about working with PyTorch can be found in "PyTorch for beginners: image classification with pre-trained models".

    from torchvision import models

    model = models.resnet50(pretrained=True)

The next important step is to preprocess the input image. We need to know which transformations were applied during training so that we can replicate them for inference. We recommend the following modules for the preprocessing step: albumentations and cv2 (OpenCV).

The model was trained on 224×224 images. The input data was then normalized (pixel values divided by 255, the mean subtracted, and the result divided by the standard deviation).
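The normalization arithmetic for a single pixel can be sketched in plain Python as follows. The mean and standard deviation are the values used in the preprocessing code of this post. The function name is hypothetical, and the sketch assumes the three channel values arrive in the same order as the mean and standard deviation.

```python
# Per-channel statistics used by the preprocessing code in this post.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(pixel):
    """Scale 8-bit channel values to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(pixel, MEAN, STD))


print(normalize_pixel((255, 255, 255)))  # brightest possible pixel
print(normalize_pixel((0, 0, 0)))        # darkest possible pixel
```

A pixel whose scaled value equals the channel mean maps to zero, so typical inputs end up roughly centered around zero.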

    import cv2
    import torch
    from albumentations import Resize, Compose
    from albumentations.pytorch.transforms import ToTensor
    from albumentations.augmentations.transforms import Normalize


    def preprocess_image(img_path):
        # transformations for the input data
        transforms = Compose([
            Resize(224, 224, interpolation=cv2.INTER_NEAREST),
            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            ToTensor(),
        ])

        # read the input image
        input_img = cv2.imread(img_path)
        # apply the transformations
        input_data = transforms(image=input_img)["image"]

Prepare a batch to pass to the network. In our case there is only one image in the batch. Note that we load the input data onto the GPU to run the program faster and to keep the comparison with TensorRT fair.

        # add the batch dimension
        batch_data = torch.unsqueeze(input_data, 0)
        return batch_data


    input = preprocess_image("turkish_coffee.jpg").cuda()

Now we can run inference. Don't forget to switch the model to evaluation mode and copy it to the GPU as well. As a result, we get a tensor of shape [1, 1000] with the confidence for each class the object might belong to.

    model.eval()
    model.cuda()
    output = model(input)

To get human-readable results, we need a post-processing step. The class names can be found in imagenet_classes.txt. Compute the softmax to get a percentage for each class and print the top classes predicted by the network.

    def postprocess(output_data):
        # get class names
        with open("imagenet_classes.txt") as f:
            classes = [line.strip() for line in f.readlines()]
        # compute human-readable confidences with softmax
        confidences = torch.nn.functional.softmax(output_data, dim=1)[0] * 100
        # find the top predicted classes
        _, indices = torch.sort(output_data, descending=True)
        i = 0
        # print the top classes predicted by the model
        while confidences[indices[0][i]] > 0.5:
            class_idx = indices[0][i]
            print(
                "class:", classes[class_idx],
                ", confidence:", confidences[class_idx].item(),
                "%, index:", class_idx.item(),
            )
            i += 1


    postprocess(output)

It's time to test our script! Our input picture:

[Image: the input picture, turkish_coffee.jpg]

And results:

    class: cup, confidence: 92.430747 %, index: 968
    class: espresso, confidence: 6.138075 %, index: 967
    class: coffee mug, confidence: 0.728557 %, index: 504

2. Convert the PyTorch model to ONNX format

To convert the resulting model you need just one function, torch.onnx.export, which takes the following arguments: the pre-trained model itself, a tensor of the same size as the input data, the name of the ONNX file, and the input and output names.

    ONNX_FILE_PATH = 'resnet50.onnx'
    torch.onnx.export(model, input, ONNX_FILE_PATH, input_names=['input'],
                      output_names=['output'], export_params=True)

To check that the model converted correctly, call onnx.checker.check_model:

    import onnx

    onnx_model = onnx.load(ONNX_FILE_PATH)
    onnx.checker.check_model(onnx_model)

3. View the ONNX model

Now let's visualize our ONNX graph with Netron. To install it, run:

    python3 -m pip install netron

Type netron on the command line and open http://localhost:8080/ in your browser. You will see the complete network graph. Make sure the input and output have the expected sizes.

4. Initialize the model in TensorRT

Now it's time to parse the ONNX model and initialize the TensorRT context and engine. To do this, we need to create an instance of the Builder. The builder can create a Network and generate an Engine (optimized for your platform and hardware) from that network. When we create the Network we can define its structure through flags, but in our case the default flag is sufficient, which means that all tensors have an implicit batch dimension. With the Network definition we can create an instance of the Parser and finally parse our ONNX file.

    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy as np
    import tensorrt as trt

    # logger to capture errors, warnings, and other information
    # during the build and inference phases
    TRT_LOGGER = trt.Logger()


    def build_engine(onnx_file_path):
        # initialize the TensorRT engine and parse the ONNX model
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network()
        parser = trt.OnnxParser(network, TRT_LOGGER)

        # parse ONNX
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            parser.parse(model.read())
        print('Completed parsing of ONNX file')

It is possible to configure some engine parameters, for example the maximum memory the TensorRT engine may use, or FP16 mode. We also need to specify the batch size.

        # allow TensorRT to use up to 1 GB of GPU memory for tactic selection
        builder.max_workspace_size = 1 << 30
        # we have only one image in the batch
        builder.max_batch_size = 1
        # use FP16 mode if possible
        if builder.platform_has_fast_fp16:
            builder.fp16_mode = True

After that we can generate the engine and create the execution context. The engine takes input data, performs inference, and emits the inference output.

        # generate the TensorRT engine optimized for the target platform
        print('Building an engine...')
        engine = builder.build_cuda_engine(network)
        context = engine.create_execution_context()
        print("Completed creating Engine")

        return engine, context

Tip: Initialization can take a long time because TensorRT tries to find the best and fastest way to run your network on your platform. To do it only once and then reuse the already built engine, you can serialize your engine. Serialized engines are not portable across different GPU models, platforms, or TensorRT versions. Engines are specific to the exact hardware and software they were built on.
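One way to build only once is a small cache around the build step, sketched below. The function name load_or_build and the file name resnet50.engine are hypothetical. The caching logic is plain Python and is exercised here with a stand-in builder, so it runs without a GPU. The TensorRT wiring in the comment is an untested assumption based on the engine's serialize method and the runtime's deserialize_cuda_engine method.

```python
import os
import tempfile


def load_or_build(cache_path, build_serialized):
    """Return serialized engine bytes, building them only when no cache exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    data = bytes(build_serialized())
    with open(cache_path, "wb") as f:
        f.write(data)
    return data


# With TensorRT this could be wired up roughly as follows (not executed here,
# and assuming the serialized engine can be converted with bytes()):
#   data = load_or_build("resnet50.engine",
#                        lambda: build_engine(ONNX_FILE_PATH)[0].serialize())
#   engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(data)

# Stand-in builder so the caching logic can be tried without a GPU.
calls = []


def fake_build():
    calls.append(1)
    return b"serialized-engine-bytes"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.engine")
    first = load_or_build(path, fake_build)   # builds and writes the cache
    second = load_or_build(path, fake_build)  # reads the cache, no rebuild
    print("builds:", len(calls), "identical:", first == second)
```

Because serialized engines are tied to the hardware and TensorRT version, a cache file like this should be deleted and rebuilt whenever either one changes.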

5. Main pipeline

So what does the full inference pipeline in TensorRT look like? Let's look at the main function. First, we parse the model and initialize the engine and context:

    def main():
        # initialize the TensorRT engine and parse the ONNX model
        engine, context = build_engine(ONNX_FILE_PATH)

Once the engine is initialized, we can find out the input and output dimensions in our program. Knowing them, we can allocate the memory required for the input and output data. In general a model can have many inputs and outputs, but in our case we know that there is only one input and one output.

        # get the input and output sizes and allocate the memory
        # required for the input and output data
        for binding in engine:
            if engine.binding_is_input(binding):  # we expect only one input
                input_shape = engine.get_binding_shape(binding)
                input_size = trt.volume(input_shape) * engine.max_batch_size * np.dtype(np.float32).itemsize  # in bytes
                device_input = cuda.mem_alloc(input_size)
            else:  # and one output
                output_shape = engine.get_binding_shape(binding)
                # create page-locked memory buffers (i.e. they won't be swapped to disk)
                host_output = cuda.pagelocked_empty(trt.volume(output_shape) * engine.max_batch_size, dtype=np.float32)
                device_output = cuda.mem_alloc(host_output.nbytes)
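The buffer sizes computed above can be checked by hand. The sketch below repeats the arithmetic in plain Python, assuming the shapes implied by this post: a 3×224×224 FP32 input, 1000 FP32 output values, and a batch size of 1 with an implicit batch dimension. The helper name volume is hypothetical and only mirrors what trt.volume does for a shape.

```python
def volume(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


FLOAT32_BYTES = 4
batch_size = 1
input_shape = (3, 224, 224)  # assumed per-image shape, batch dimension implicit
output_shape = (1000,)       # one confidence value per class

input_bytes = volume(input_shape) * batch_size * FLOAT32_BYTES
output_bytes = volume(output_shape) * batch_size * FLOAT32_BYTES
print("input buffer:", input_bytes, "bytes")
print("output buffer:", output_bytes, "bytes")
```

If the sizes your program reports differ from these, the engine's binding shapes are probably not what you expect.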

CUDA functions can be called asynchronously in streams, which are sequences of commands that execute in order. All commands within a stream execute sequentially, but different streams can execute their commands concurrently or out of order. If you run asynchronous CUDA commands without specifying a stream, the runtime uses the default null stream. In our simple script we create only one stream, and that is enough. In more complicated cases you can, for example, use different streams to process different images at the same time.

        # create a stream in which to copy inputs/outputs and run inference
        stream = cuda.Stream()

To get the same result in TensorRT as in PyTorch, we prepare the data for inference by repeating every preprocessing step we did before. The main advantage of the Python API for TensorRT is that data pre-processing and post-processing can be reused from the PyTorch part. The only extra things we need to do are to lay the data out contiguously and to use page-locked memory where possible. We can then copy this data to the GPU and use it for inference.

        # preprocess the input data
        host_input = np.array(preprocess_image("turkish_coffee.jpg").numpy(), dtype=np.float32, order='C')
        cuda.memcpy_htod_async(device_input, host_input, stream)

Run inference and copy the output from the device to the host:

        # run inference
        context.execute_async(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_output, device_output, stream)
        stream.synchronize()

The result is stored in host_output as a one-dimensional array. So before we reuse the PyTorch post-processing to get human-readable values, we need to reshape it.

        # postprocess the results
        output_data = torch.Tensor(host_output).reshape(engine.max_batch_size, output_shape[0])
        postprocess(output_data)
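To make the reshape and post-processing steps concrete, here is a sketch of the same logic in plain Python, without PyTorch. The helper names are hypothetical and the logits are made up. It splits a flat buffer into one row per image, applies a numerically stable softmax, and keeps the classes above the same 0.5 % threshold that postprocess uses.

```python
import math


def reshape_rows(flat, batch_size):
    """Split a flat buffer into batch_size rows of equal length."""
    n = len(flat) // batch_size
    return [list(flat[i * n:(i + 1) * n]) for i in range(batch_size)]


def softmax_percent(logits):
    """Numerically stable softmax, scaled to percentages."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_classes(logits, threshold=0.5):
    """Return (index, confidence) pairs above threshold, best first."""
    conf = softmax_percent(logits)
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    return [(i, conf[i]) for i in order if conf[i] > threshold]


# Made-up logits for a batch of 2 images with 3 classes each.
flat_output = [2.0, 0.1, 5.0, -1.0, 4.0, 0.0]
for row in reshape_rows(flat_output, batch_size=2):
    print(top_classes(row))
```

Subtracting the largest logit before exponentiating does not change the result, but it avoids overflow for large values.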

That's all! Now you can run and test your script.

6. Accuracy test

We ran some ad hoc tests, summarized in the table below.

    class         index   PyTorch      TensorRT FP32   TensorRT FP16
    coffee mug    504     0.728557 %   0.728557 %      0.760683 %

As we can see, the predicted classes agree. The confidences are almost identical in FP32 mode (the error is smaller than 1e-05). In FP16 mode the error is larger (~0.003) but still small enough to give correct predictions.
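A comparison like this can be automated with a small helper. The sketch below computes the largest absolute difference between two lists of confidences and applies it to the class 504 values from the table. The function name is hypothetical. Note that the values are in percent, so the differences are in percentage points, and that a single row says nothing about the error for other classes.

```python
def max_abs_error(reference, candidate):
    """Largest absolute difference between two equally long sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


# Confidences (in percent) for class 504, taken from the table.
pytorch = [0.728557]
trt_fp32 = [0.728557]
trt_fp16 = [0.760683]

print("FP32 error:", max_abs_error(pytorch, trt_fp32))
print("FP16 error:", max_abs_error(pytorch, trt_fp16))
```

In practice you would pass the full 1000-element confidence vectors from both frameworks rather than a single value.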

Note that there is no guarantee that tests with different hardware, different software, or even a different input image will give the same error. The error may depend on the initial benchmarking decisions and may differ between cards. We obtained these results with the following configuration:

Ubuntu 18.04.4, AMD® Ryzen 7 2700x eight-core processor × 16, GeForce RTX 2070 SUPER, TensorRT, CUDA 10.0

7. Accelerate with TensorRT

To compare timings in PyTorch and TensorRT, we do not measure the model initialization time, because the model is initialized only once. We compare only the inference time. On the first run, CUDA initializes and caches some data, so the first call to any CUDA function is slower than usual. To account for this, we run inference several times and take the average time. Here is what we got:

In our example, we achieved 4x to 6x speedup in FP16 mode and 2x to 3x speedup in FP32 mode.
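The measurement procedure described above (skip the slow first calls, then average over several runs) can be sketched as follows. The function name average_time and its parameters are hypothetical, and the workload is a CPU stand-in so that the example runs anywhere. For GPU code, pass a synchronization function such as torch.cuda.synchronize, because asynchronous calls may return before the work has finished.

```python
import time


def average_time(fn, warmup=3, runs=10, sync=None):
    """Average wall-clock seconds per call, ignoring the first warmup calls."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / runs


# Stand-in workload; replace it with e.g. lambda: model(input) and
# sync=torch.cuda.synchronize when timing GPU inference.
def fake_inference():
    sum(i * i for i in range(10_000))


baseline = average_time(fake_inference)
print(f"average: {baseline * 1e3:.3f} ms per call")
```

Dividing the PyTorch average by the TensorRT average gives the speedup factor.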



Article information

Author: Ouida Strosin DO

Last Updated: 07/30/2023
