Sometimes it can be very beneficial to use tools such as compilers that can modify and compile your models for optimal inference performance. In this post, we explore TensorRT and how to use it with Amazon SageMaker inference using NVIDIA Triton Inference Server. We explore how TensorRT works and how to host and optimize these models for performance and cost efficiency on SageMaker. SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model, or multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint for higher resource utilization.

To serve models, Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the TensorRT backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and get great results.

Deep dive into the TensorRT backend

TensorRT enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs. By adopting and compiling models to use TensorRT, you can optimize performance and utilization for your inference workloads. In some cases, there are trade-offs, which is typical of techniques such as quantization, but the results can be dramatic in benefiting performance, addressing latency and the number of transactions that can be processed.

The TensorRT backend is used to run TensorRT models. TensorRT is an SDK developed by NVIDIA that provides a high-performance deep learning inference library. It’s optimized for NVIDIA GPUs and provides a way to accelerate deep learning inference in production environments. TensorRT supports major deep learning frameworks and includes a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for AI applications.

TensorRT accelerates model performance through graph optimization, which optimizes the computation graph generated by a deep learning model. It minimizes the memory footprint by freeing unnecessary memory and efficiently reusing it. TensorRT compilation fuses the sparse operations inside the model graph into larger kernels to avoid the overhead of multiple small kernel launches. With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. Additionally, TensorRT employs CUDA streams to enable parallel processing of models, further improving GPU utilization and performance. Finally, through quantization, TensorRT can use the mixed-precision acceleration of Tensor Cores, enabling the model to run in FP32, TF32, FP16, or INT8 precision for the best inference performance. However, although reduced precision generally improves latency, it can introduce numerical instability and degrade model accuracy. Overall, TensorRT’s combination of techniques can result in faster inference and lower latency compared to other inference engines.
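To get a feel for the precision trade-off, the following sketch rounds a few values to FP16 and reports the error that introduces. It uses only the Python standard library and is not part of TensorRT; the helper names to_fp16 and fp16_error are ours, chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    # The "e" struct format packs a float into 2 bytes of half precision.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_error(x: float) -> float:
    """Absolute error introduced by storing x in FP16."""
    return abs(x - to_fp16(x))


# FP16 has a 10-bit mantissa, so the spacing between representable
# values grows with magnitude (about 0.001 near 1.0, 0.5 near 1000).
for value in (0.1, 1.0001, 1000.3):
    print(f"{value} -> {to_fp16(value)} (abs error {fp16_error(value):.3g})")
```

Small differences between activations, such as 1.0 versus 1.0001, can disappear entirely in FP16, which is one reason to validate accuracy after building a reduced-precision engine.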

The TensorRT backend for Triton Inference Server is designed to take advantage of the powerful inference capabilities of NVIDIA GPUs. To use TensorRT as a backend for Triton Inference Server, you need to create a TensorRT engine from your trained model using the TensorRT API. This engine is then loaded into Triton Inference Server and used to perform inference on incoming requests. The following are the basic steps to use TensorRT as a backend for Triton Inference Server:

  1. Convert your trained model to the ONNX format. Triton Inference Server supports ONNX as a model format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. If your model isn’t already in the ONNX format, you need to convert it using the appropriate framework-specific tool. For example, in PyTorch, this can be done using the torch.onnx.export method.
  2. Import the ONNX model into TensorRT and generate the TensorRT engine. There are several ways to build a TensorRT engine from your ONNX model. For this post, we use the trtexec CLI tool, which lets you use TensorRT quickly without having to develop your own application. The trtexec tool has three main purposes:
    1. Benchmarking networks on random or user-provided input data.
    2. Generating serialized engines from models.
    3. Generating a serialized timing cache from the builder.
  3. Load the TensorRT engine in Triton Inference Server. After the TensorRT engine is generated, it can be loaded into Triton Inference Server by creating a model configuration file. The model configuration (config.pbtxt) file should include the path to the TensorRT engine file and the input and output shapes of the model.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. There are several key points to note in this configuration file:

  • name – This field defines the model’s name and must be unique within the model repository.
  • platform – This field defines the type of the model: TensorRT engine, PyTorch, or something else.
  • max_batch_size – This specifies the maximum batch size that can be passed to this model. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model. For models that don’t support batching, or don’t support batching in the specific ways we’ve described, max_batch_size must be set to 0.
  • Input and output – These fields are required because NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs.
  • instance_group – This determines how many instances of this model will be created and whether they will use the GPU or CPU.
  • dynamic_batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. The preferred_batch_size property indicates the batch sizes that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in Recommended Configuration Process. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In this case, because some optimization profiles may give significant performance improvement compared to others, it may make sense to use preferred_batch_size for the batch sizes supported by those higher-performance optimization profiles. You can also reference the batch size that was previously used when running trtexec. Finally, you can configure a delay so that requests wait in the scheduler for a limited time, allowing other requests to join the dynamic batch.
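The fields above can be assembled programmatically. The following is a minimal sketch that renders a config.pbtxt as text; render_io and render_config are hypothetical helpers written for this illustration, not part of Triton. The max_queue_delay_microseconds field is the dynamic batcher setting that controls the delay time mentioned above.

```python
def render_io(entries):
    """Render (name, data_type, dims) tuples as pbtxt tensor entries."""
    blocks = []
    for name, data_type, dims in entries:
        dims_text = ", ".join(str(d) for d in dims)
        blocks.append(
            f'  {{\n    name: "{name}"\n    data_type: {data_type}\n'
            f"    dims: [{dims_text}]\n  }}"
        )
    return ",\n".join(blocks)


def render_config(name, max_batch_size, inputs, outputs, instance_count=1,
                  preferred_batch_sizes=(), max_queue_delay_us=None):
    """Return config.pbtxt text for a TensorRT plan served on GPU."""
    sections = [
        f'name: "{name}"',
        'platform: "tensorrt_plan"',
        f"max_batch_size: {max_batch_size}",
        "input [\n" + render_io(inputs) + "\n]",
        "output [\n" + render_io(outputs) + "\n]",
        f"instance_group {{\n  count: {instance_count}\n  kind: KIND_GPU\n}}",
    ]
    # Only emit dynamic_batching when at least one setting was requested.
    batching = []
    if preferred_batch_sizes:
        sizes = ", ".join(str(s) for s in preferred_batch_sizes)
        batching.append(f"  preferred_batch_size: [{sizes}]")
    if max_queue_delay_us is not None:
        batching.append(f"  max_queue_delay_microseconds: {max_queue_delay_us}")
    if batching:
        sections.append("dynamic_batching {\n" + "\n".join(batching) + "\n}")
    return "\n".join(sections) + "\n"


config_text = render_config(
    "bert", 128,
    inputs=[("token_ids", "TYPE_INT32", [128]), ("attn_mask", "TYPE_INT32", [128])],
    outputs=[("output", "TYPE_FP32", [128, 768]), ("pooled_output", "TYPE_FP32", [768])],
    preferred_batch_sizes=[16],
    max_queue_delay_us=100,  # illustrative value, tune for your latency budget
)
print(config_text)
```

Note that dims excludes the batch dimension when max_batch_size is greater than 0; Triton adds it implicitly.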

The TensorRT backend is improved to have significantly better performance. Improvements include reducing thread contention, using pinned memory for faster transfers between CPU and GPU, and increasing compute and memory copy overlap on GPUs. It also reduces memory usage of TensorRT models in many cases by sharing weights across multiple model instances. Overall, the TensorRT backend for Triton Inference Server provides a powerful and flexible way to serve deep learning models with optimized TensorRT inference. By adjusting the configuration options, you can optimize performance and control behavior to suit your specific use case.

SageMaker provides Triton via SMEs and MMEs

SageMaker enables you to deploy both single and multi-model endpoints with Triton Inference Server. Triton supports a heterogeneous cluster with both GPUs and CPUs, which helps standardize inference across platforms and dynamically scales out to any CPU or GPU to handle peak loads. The following diagram illustrates the Triton Inference Server architecture. Inference requests arrive at the server via either HTTP/REST or by the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is a file system-based repository of the models that Triton will make available for inferencing.

Triton architecture

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU, at the same time, with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon Simple Storage Service (Amazon S3) to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU-accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPU, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling groups. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. For single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you might want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
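As a rough sketch of such a policy, the following code builds the request parameters for a target tracking policy on SageMakerVariantInvocationsPerInstance using the Application Auto Scaling API. The capacity limits, target value, and cooldowns are placeholder assumptions, and build_scaling_requests and apply_scaling are helper names we introduce here for illustration.

```python
def build_scaling_requests(endpoint_name, variant_name="AllTraffic",
                           min_instances=1, max_instances=4,
                           invocations_per_instance=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Placeholder target: average invocations per instance per minute.
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds, assumed value
            "ScaleOutCooldown": 60,   # seconds, assumed value
        },
    }
    return target, policy


def apply_scaling(endpoint_name, **kwargs):
    """Register the variant as a scalable target and attach the policy."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    autoscaling = boto3.client("application-autoscaling")
    target, policy = build_scaling_requests(endpoint_name, **kwargs)
    autoscaling.register_scalable_target(**target)
    autoscaling.put_scaling_policy(**policy)


target, policy = build_scaling_requests("my-triton-mme")  # hypothetical endpoint name
print(target["ResourceId"])
```

Calling apply_scaling requires AWS credentials and an endpoint that is already InService. A sensible target value depends on the latency and throughput you measure for your own models.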

Solution overview

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s TensorRT backend, which allows you to deploy TensorRT models. The TensorRT_backend repo contains the documentation and source for the backend. In the following sections, we walk you through the example notebook that demonstrates how to use NVIDIA Triton Inference Server on SageMaker MMEs with the GPU feature to deploy a BERT natural language processing (NLP) model.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the pre-built Triton ECR image:

import transformers
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
bucket = sagemaker_session.default_bucket()
print(bucket)

# AWS account that hosts the Triton container image in each Region
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError("UNSUPPORTED REGION")

# China Regions use a different domain suffix for Amazon ECR
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

Add utility methods for preparing the request payload

We create the functions to transform the sample text we’re using for inference into the payload that can be sent for inference to Triton Inference Server. The tritonclient package, which was installed at the beginning, provides utility methods to generate the payload without having to know the details of the specification. We use the created methods to convert our inference request into a binary format, which provides lower latencies for inference. These functions are used during the inference step.
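For readers who want to see the shape of such a request, the following sketch builds the plain JSON variant of the payload by hand, without tritonclient. It assumes the token IDs and attention mask already come from a tokenizer, that sequences are padded to 128 tokens with a pad ID of 0, and that the tensors are named token_ids and attn_mask as in the model configuration later in this post. The helper names are hypothetical.

```python
import json

MAX_SEQ_LEN = 128  # assumed fixed sequence length for the BERT model


def pad_or_truncate(ids, length=MAX_SEQ_LEN, pad_id=0):
    """Return exactly `length` IDs, truncating or right-padding with pad_id."""
    ids = list(ids)[:length]
    return ids + [pad_id] * (length - len(ids))


def build_json_payload(input_ids, attention_mask):
    """Build an inference request body with two INT32 tensors of shape [1, 128]."""

    def tensor(name, values):
        return {
            "name": name,
            "shape": [1, MAX_SEQ_LEN],
            "datatype": "INT32",
            "data": pad_or_truncate(values),
        }

    return {
        "inputs": [
            tensor("token_ids", input_ids),
            tensor("attn_mask", attention_mask),
        ]
    }


# Illustrative token IDs only; real values come from the model's tokenizer.
payload = build_json_payload([101, 7592, 102], [1, 1, 1])
body = json.dumps(payload)
print(len(body))
```

The binary format used in the notebook carries the same tensor metadata but appends the raw tensor bytes after the JSON header, which avoids the cost of encoding large arrays as text.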

Prepare the TensorRT model

In this step, we load the pre-trained BERT model and convert to ONNX representation using the torch ONNX exporter and the script. After the ONNX model is created, we use the TensorRT trtexec command to create the model plan to be hosted with Triton. This is run as part of the script from the following cell. Note that the cell takes around 30 minutes to complete.

!docker run --gpus=all --rm -it \
-v `pwd`/workspace:/workspace \ /bin/bash

While waiting for the command to finish running, you can check the scripts used in this step. In the script, we use the torch.onnx.export function for ONNX model creation:

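torch.onnx.export(
    model,
    dummy_inputs,
    "model.onnx",  # output file, consumed by trtexec in the next step
    export_params=True,
    opset_version=10,
    input_names=["token_ids", "attn_mask"],
    output_names=["output", "pooled_output"],
    # Batch size and sequence length are dynamic for the inputs
    dynamic_axes={"token_ids": [0, 1], "attn_mask": [0, 1], "output": [0]},
)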

The command line in the file creates the TensorRT model plan. For more information, refer to the trtexec command-line tool.

trtexec --onnx=model.onnx --saveEngine=model_bs16.plan \
  --minShapes=token_ids:1x128,attn_mask:1x128 \
  --optShapes=token_ids:16x128,attn_mask:16x128 \
  --maxShapes=token_ids:128x128,attn_mask:128x128 \
  --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt
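If you build engines for several batch-size ranges, it can help to generate the trtexec arguments instead of editing them by hand. The following is a small sketch that reproduces the shape flags from the preceding command; shape_arg and build_trtexec_cmd are hypothetical helpers, and the sketch omits the --verbose and --workspace options.

```python
import shlex


def shape_arg(input_names, batch, seq_len):
    """Format shapes as comma-separated name:BATCHxSEQ pairs."""
    return ",".join(f"{name}:{batch}x{seq_len}" for name in input_names)


def build_trtexec_cmd(onnx_path, engine_path, input_names, seq_len,
                      min_batch, opt_batch, max_batch, fp16=True):
    """Assemble the trtexec argument list for a dynamic-batch engine build."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--minShapes={shape_arg(input_names, min_batch, seq_len)}",
        f"--optShapes={shape_arg(input_names, opt_batch, seq_len)}",
        f"--maxShapes={shape_arg(input_names, max_batch, seq_len)}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd


cmd = build_trtexec_cmd(
    "model.onnx", "model_bs16.plan", ["token_ids", "attn_mask"],
    seq_len=128, min_batch=1, opt_batch=16, max_batch=128,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) inside a container where trtexec is installed. The optShapes batch size is the one TensorRT tunes kernels for, so it is a natural candidate for preferred_batch_size in the Triton configuration.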

Build a TensorRT NLP BERT model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the BERT model is as follows:

Folder structure for model

Note that Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, the folder 1 represents version 1 of the BERT model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifacts required by that backend. Here, we are using the TensorRT backend, which requires the TensorRT plan file that is used for serving (for this example, model.plan). If we were using the PyTorch backend, a serialized PyTorch model file would be required instead. For more details on naming conventions for model files, refer to Model Files.
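The layout described above can be sketched in a few lines of Python. The following example creates the directory structure in a temporary folder with placeholder file contents; build_model_repo and list_repo are hypothetical helpers, and a real repository would contain the plan file produced by trtexec.

```python
import tempfile
from pathlib import Path


def build_model_repo(root, model_name, config_text, plan_bytes, version=1):
    """Create <root>/<model_name>/config.pbtxt and <root>/<model_name>/<version>/model.plan."""
    model_dir = Path(root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    (version_dir / "model.plan").write_bytes(plan_bytes)
    return model_dir


def list_repo(root):
    """Return all files under root as sorted POSIX-style relative paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder contents stand in for the real config and engine.
    build_model_repo(tmp, "bert", 'name: "bert"\n', b"placeholder-engine")
    layout = list_repo(tmp)

print(layout)
```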

Every TensorRT model must provide a config.pbtxt file describing the model configuration. To use this backend, you must set the platform field of your model’s config.pbtxt file to tensorrt_plan. The following code shows an example of how to define the configuration file for the BERT model being served through Triton’s TensorRT backend:

name: "bert"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attn_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 768]
  },
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 16
}

SageMaker expects a .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. To simulate several similar models being hosted, you might think all it takes is to tar the model repository we have already built, and then copy it with different file names. However, Triton requires unique model names. Therefore, we first copy the model repo N times, changing the model directory names and their corresponding config.pbtxt files. You can increase N to create more copies of the model that can be dynamically loaded to the hosting endpoint, which simulates the model load and unload actions managed by SageMaker. See the following code:

import os
import shutil
import time

N = 5
prefix = 'bert-mme'
model_repo_base = 'model_repo'

# Get model names from model_repo_0
model_names = [
    name for name in os.listdir(f'{model_repo_base}_0')
    if os.path.isdir(f'{model_repo_base}_0/{name}')
]

for i in range(N):
    # Make a copy of the base model repo and increment the repo id
    shutil.copytree(f'{model_repo_base}_0', f'{model_repo_base}_{i+1}')
    time.sleep(5)
    for name in model_names:
        model_dirs_path = f'{model_repo_base}_{i+1}/{name}'
        # Rewrite each model's config file so the model id is incremented there
        with open(f'{model_dirs_path}/config.pbtxt', "rt") as fin:
            data = fin.read()
        data = data.replace(name, name[:-1] + str(i+1))
        with open(f'{model_dirs_path}/config.pbtxt', "wt") as fout:
            fout.write(data)
        # Change model directory name to match new config
        os.rename(model_dirs_path, model_dirs_path[:-1] + str(i+1))
    time.sleep(2)
    if i == 0:
        # Package and upload the original repo on the first pass
        tar_file_name = f'bert-{i}.tar.gz'
        model_repo_target = f'{model_repo_base}_{i}/'
        !tar -C $model_repo_target -czf $tar_file_name .
        sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    tar_file_name = f'bert-{i+1}.tar.gz'
    model_repo_target = f'{model_repo_base}_{i+1}/'
    !tar -C $model_repo_target -czf $tar_file_name .
    sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    !sudo rm -r "$tar_file_name" "$model_repo_target"

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create the SageMaker model object, endpoint configuration, and endpoint.

Firstly, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate SageMaker will create the endpoint with MME container specifications. See the following code:

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data_uri,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to a g5.xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

With this endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

endpoint_name = "triton-nlp-bert-trt-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
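One way to wait for the deployment is to poll the endpoint status. The following sketch takes the describe call as a parameter, so it can be used with sm.describe_endpoint or with a stand-in during testing; wait_for_endpoint is a helper name we introduce here, and the poll interval and timeout are assumed defaults.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        sleep(poll_seconds)


# With the SageMaker client from earlier in this post:
# wait_for_endpoint(sm.describe_endpoint, endpoint_name)
```

Alternatively, boto3 provides a built-in waiter for the same purpose: sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name).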

Invoke your model hosted on the SageMaker endpoint

When the endpoint is running, we can use some sample raw data to perform inference using either JSON or binary+JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send the inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Here we invoke the endpoint in a for loop, targeting a different model on each iteration, so that the endpoint dynamically loads or unloads models based on the requests:

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

for i in range(N):
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"bert-{i}.tar.gz",
    )
    print(json.loads(response["Body"].read().decode("utf8")))

You can monitor the model loading and unloading status using Amazon CloudWatch metrics and logs. SageMaker multi-model endpoints provide instance-level metrics to monitor; for more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch. The LoadedModelCount metric shows the number of models loaded in the containers. The ModelCacheHit metric shows the number of invocations to models that are already loaded onto the container, which gives you model invocation-level insights. To check whether models are unloaded from memory, you can look for the successful unload log entries in the endpoint’s CloudWatch logs.

The notebook can be found in the GitHub repository.

Best practices

Before starting any optimization effort with TensorRT, it’s essential to determine what should be measured. Without measurements, it’s impossible to make reliable progress or measure whether success has been achieved. Here are some best practices to consider when using the TensorRT backend for Triton Inference Server:

  • Optimize your TensorRT model – Before deploying a model on Triton with the TensorRT backend, make sure to optimize the model following the TensorRT best practices guide. This will help you achieve better performance by reducing inference time and memory consumption.
  • Use TensorRT instead of other Triton backends when possible – TensorRT is designed to optimize deep learning models for deployment on NVIDIA GPUs, so using it can significantly improve inference performance compared to using other supported Triton backends.
  • Use the right precision – TensorRT supports multiple precisions (FP32, FP16, INT8), and selecting the right precision for your model can have a significant impact on performance. Consider using lower precision when possible.
  • Use batch sizes that fit your hardware – Make sure to choose batch sizes that fit your GPU’s memory and compute capabilities. Using batch sizes that are too large or too small can negatively impact performance.
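As a starting point for measurement, the following sketch times repeated calls to any function and reports latency percentiles. The measured function here is a stand-in that sleeps for a millisecond; in practice you would pass a function that calls invoke_endpoint with a fixed payload. The measure_latency helper and its defaults are illustrative assumptions.

```python
import statistics
import time


def measure_latency(call, warmup=3, runs=50):
    """Time `call` sequentially and return latency percentiles in milliseconds."""
    # Warm-up calls are excluded so one-time costs (such as model loading
    # on a multi-model endpoint) do not skew the percentiles.
    for _ in range(warmup):
        call()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "sequential_rps": runs / (sum(samples_ms) / 1000.0),
    }


result = measure_latency(lambda: time.sleep(0.001), warmup=1, runs=20)
print({key: round(value, 3) for key, value in result.items()})
```

Comparing these numbers before and after a change (for example, FP32 versus FP16 engines, or different batch sizes) shows whether an optimization actually helped. Because the calls are sequential, the throughput figure is a lower bound on what concurrent clients can achieve.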


In this post, we dove deep into the TensorRT backend that Triton Inference Server supports on SageMaker. This backend provides GPU acceleration of your TensorRT models. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to take advantage of this capability using single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.

 About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Read more about this on: AWS