Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale machine learning (ML) models. It provides a pay-per-use model, which is ideal for services where endpoint invocations are infrequent and unpredictable. Unlike a real-time hosting endpoint, which is backed by a long-running instance, compute resources for serverless endpoints are provisioned on demand, thereby eliminating the need to choose instance types or manage scaling policies.

The following high-level architecture illustrates how a serverless endpoint works. A client invokes an endpoint, which is backed by AWS managed infrastructure.

However, serverless endpoints are prone to cold starts on the order of seconds, and are therefore more suitable for intermittent or unpredictable workloads.

To help determine whether a serverless endpoint is the right deployment option from a cost and performance perspective, we have developed the SageMaker Serverless Inference Benchmarking Toolkit, which tests different endpoint configurations and compares the optimal one against a comparable real-time hosting instance.

In this post, we introduce the toolkit and provide an overview of its configuration and outputs.

## Solution overview

You can download the toolkit and install it from the GitHub repo. Getting started is easy: install the library, create a SageMaker model, and provide the name of your model along with a JSON Lines-formatted file containing a sample set of invocation parameters, including the payload body and content type. A convenience function is provided to convert a list of sample invocation arguments to a JSON Lines file, or to a pickle file for binary payloads such as images, video, or audio.

## Install the toolkit

First install the benchmarking library into your Python environment using pip:

pip install sm-serverless-benchmarking

You can run the following code from an Amazon SageMaker Studio instance, SageMaker notebook instance, or any instance with programmatic access to AWS and the appropriate AWS Identity and Access Management (IAM) permissions. The requisite IAM permissions are documented in the GitHub repo. For additional guidance and example policies for IAM, refer to How Amazon SageMaker Works with IAM. This code runs a benchmark with a default set of parameters on a model that expects a CSV input with two example records. It’s a good practice to provide a representative set of examples to analyze how the endpoint performs with different input payloads.

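from sm_serverless_benchmarking import benchmark
from sm_serverless_benchmarking.utils import convert_invoke_args_to_jsonl

# Name of an existing SageMaker model to benchmark
model_name = "<SageMaker Model Name>"

# Two sample CSV records; use payloads representative of your real traffic
example_invoke_args = [
    {"Body": "1,2,3,4,5", "ContentType": "text/csv"},
    {"Body": "6,7,8,9,10", "ContentType": "text/csv"},
]

# Write the samples to a JSON Lines file in the current directory
example_args_file = convert_invoke_args_to_jsonl(example_invoke_args, output_path=".")

# Run the benchmark with the default parameters
r = benchmark.run_serverless_benchmarks(model_name, example_args_file)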
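To make the input format more concrete, the following is a minimal standard-library sketch of what a JSON Lines file of invocation arguments could look like. It assumes each line holds one JSON object with `Body` and `ContentType` keys. The helper names `write_jsonl` and `read_jsonl` and the file name are hypothetical, and the toolkit's own convenience function may serialize the file differently.

```python
import json
import tempfile
from pathlib import Path

# Sample invocation arguments, mirroring the example in this post.
example_invoke_args = [
    {"Body": "1,2,3,4,5", "ContentType": "text/csv"},
    {"Body": "6,7,8,9,10", "ContentType": "text/csv"},
]


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical file name, written to a temporary directory for illustration.
jsonl_path = Path(tempfile.mkdtemp()) / "invoke_args_examples.jsonl"
write_jsonl(example_invoke_args, jsonl_path)

round_tripped = read_jsonl(jsonl_path)
print(round_tripped)
```

Because every record sits on its own line, you can append or inspect individual sample payloads without parsing the whole file.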

Additionally, you can run the benchmark as a SageMaker Processing job, which may be a more reliable option for longer-running benchmarks with a large number of invocations. See the following code:

from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job

run_as_sagemaker_job(
    role="<execution_role_arn>",
    model_name="<model_name>",
    invoke_args_examples_file="<invoke_args_examples_file>",
)

Note that this incurs the additional cost of running an ml.m5.large SageMaker Processing instance for the duration of the benchmark.

Both methods accept a number of configuration parameters, such as a list of memory configurations to benchmark and the number of times each configuration is invoked. In most cases, the default options should suffice as a starting point, but refer to the GitHub repo for a complete list and descriptions of each parameter.

## Benchmarking configuration

Before delving into what the benchmark does and what outputs it produces, it’s important to understand a few key concepts when it comes to configuring serverless endpoints.

There are two key configuration options: MemorySizeInMB and MaxConcurrency. MemorySizeInMB configures the amount of memory that is allocated to the instance, and can be 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. The number of vCPUs also scales proportionally to the amount of memory allocated. The MaxConcurrency parameter adjusts how many concurrent requests an endpoint is able to service. With a MaxConcurrency of 1, a serverless endpoint can only process a single request at a time.

To summarize, the MemorySizeInMB parameter provides a mechanism for vertical scalability, allowing you to adjust memory and compute resources to serve larger models, whereas MaxConcurrency provides a mechanism for horizontal scalability, allowing your endpoint to process more concurrent requests.
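As an illustration of how these two settings fit together, the following sketch builds the production variant dictionary that carries a `ServerlessConfig` block. The helper `build_serverless_variant`, the variant name, and the model name are hypothetical placeholders. The allowed memory sizes are the ones listed in this post, so check the current SageMaker documentation before relying on them.

```python
# Memory sizes listed in this post for serverless endpoints.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_size_in_mb, max_concurrency):
    """Return a production variant dict containing a ServerlessConfig block."""
    if memory_size_in_mb not in ALLOWED_MEMORY_SIZES_MB:
        raise ValueError(f"MemorySizeInMB must be one of {ALLOWED_MEMORY_SIZES_MB}")
    if max_concurrency < 1:
        raise ValueError("MaxConcurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",  # hypothetical variant name
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_size_in_mb,  # vertical scaling
            "MaxConcurrency": max_concurrency,    # horizontal scaling
        },
    }


# "my-model" is a placeholder for an existing SageMaker model name.
variant = build_serverless_variant("my-model", 2048, 4)
print(variant)

# With boto3 installed and AWS credentials configured, the variant could be
# passed to the SageMaker API along these lines (not executed here):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(
#       EndpointConfigName="my-serverless-config",  # placeholder name
#       ProductionVariants=[variant],
#   )
```

Validating the memory size locally surfaces a misconfiguration before any endpoint is created.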

The cost of operating an endpoint is largely determined by the memory size, and there is no cost associated with increasing the max concurrency. However, there is a per-Region account limit for max concurrency across all endpoints. Refer to SageMaker endpoints and quotas for the latest limits.

## Benchmarking outputs

Given this, the goal of benchmarking a serverless endpoint is to determine the most cost-effective and reliable memory size setting, and the minimum max concurrency that can handle your expected traffic patterns.

By default, the tool runs two benchmarks. The first is a stability benchmark, which deploys an endpoint for each of the specified memory configurations and invokes each endpoint with the provided sample payloads. The goal of this benchmark is to determine the most effective and stable MemorySizeInMB setting. The benchmark captures the invocation latencies and computes the expected per-invocation cost for each endpoint. It then compares the cost against a similar real-time hosting instance.

When the benchmarking is complete, the tool generates several outputs in the specified result_save_path directory with the following directory structure:

├── benchmarking_report
├── concurrency_benchmark_raw_results
├── concurrency_benchmark_summary_results
├── cost_analysis_summary_results
├── stability_benchmark_raw_results
├── stability_benchmark_summary_results

The benchmarking_report directory contains a consolidated report with all the summary outputs that we outline in this post. Additional directories contain raw and intermediate outputs that you can use for additional analyses. Refer to the GitHub repo for a more detailed description of each output artifact.

Let’s examine a few actual benchmarking outputs for an endpoint serving a computer vision MobileNetV2 TensorFlow model. If you’d like to reproduce this example, refer to the example notebooks directory in the GitHub repo.

The first output within the consolidated report is a summary table that provides the minimum, mean, median, and maximum latency metrics for each MemorySizeInMB configuration that was benchmarked successfully. As shown in the following table, the average invocation latency (invocation_latency_mean) continued to improve as the memory configuration was increased to 3072 MB, but stopped improving thereafter.
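For readers who want to reproduce this kind of summary from raw latencies, the following sketch computes the same four statistics per memory size. The latency values are hypothetical and are not taken from the benchmark described in this post. The dictionary keys are illustrative rather than the toolkit's actual column names.

```python
import statistics

# Hypothetical client-side latencies in milliseconds (not real benchmark data).
latencies_ms = {
    1024: [182.0, 175.5, 190.2, 178.8, 240.1],
    2048: [121.3, 118.9, 125.0, 119.7, 160.4],
}


def summarize(samples):
    """Return the min, mean, median, and max of a list of latencies."""
    return {
        "min": min(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


summary = {mem: summarize(samples) for mem, samples in latencies_ms.items()}

for mem, stats in summary.items():
    print(f"{mem} MB: " + ", ".join(f"{k}={v:.1f}" for k, v in stats.items()))
```

Comparing the mean with the median is a quick way to spot a few slow outliers, such as cold starts, that pull the average up.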

In addition to the high-level descriptive statistics, a chart is provided showing the distribution of latency as observed from the client for each of the memory configurations. Again, we can observe that the 1024 MB configuration isn’t as performant as the other options, but there isn’t a substantial difference in performance among configurations of 2048 MB and above.

Amazon CloudWatch metrics associated with each endpoint configuration are also provided. One key metric here is ModelSetupTime, which measures how long it took to load the model when the endpoint was invoked in a cold state. The metric may not always appear in the report because an endpoint is launched in a warm state. A cold_start_delay parameter is available for specifying the number of seconds to sleep before starting the benchmark on a deployed endpoint. Setting this parameter to a higher number such as 600 seconds should increase the likelihood of a cold state invocation and improve the chances of capturing this metric. Additionally, this metric is far more likely to be captured with the concurrent invocation benchmark, which we discuss later in this section.

The following table shows the metrics captured by CloudWatch for each memory configuration.

The next chart shows the performance and cost trade-offs of different memory configurations. One line shows the estimated cost of invoking the endpoint 1 million times, and the other shows the average response latency. These metrics can inform your decision of which endpoint configuration is most cost-effective. In this example, we see that the average latency flattens out after 2048 MB, whereas the cost continues to increase, indicating that for this model a memory size configuration of 2048 MB would be optimal.

The final output of the cost and stability benchmark is a recommended memory configuration, along with a table comparing the cost of operating a serverless endpoint against a comparable SageMaker hosting instance. Based on the data collected, the tool determined that the 2048 MB configuration is the optimal one for this model. Although the 3072 MB configuration provides roughly 10 milliseconds better latency, that comes with a roughly 30% increase in cost, from $4.55 to $5.95 per 1 million requests. Additionally, the output shows that a serverless endpoint would provide savings of up to 88.72% against a comparable real-time hosting instance when there are fewer than 1 million monthly invocation requests, and breaks even with a real-time endpoint at about 8.5 million requests.
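The arithmetic behind this comparison can be sketched as follows. The two per-million costs come from this post. The monthly instance cost is a hypothetical value chosen only for illustration, so the resulting break-even and savings figures won't match the report exactly. The sketch also assumes that serverless cost scales linearly with request volume and that the real-time instance has a flat monthly cost, which may differ from how the toolkit computes its figures.

```python
# Figures quoted in this post (USD per 1 million requests).
cost_2048_mb = 4.55
cost_3072_mb = 5.95

# Relative cost increase when moving from 2048 MB to 3072 MB.
pct_increase = (cost_3072_mb - cost_2048_mb) / cost_2048_mb * 100
print(f"Cost increase from 2048 MB to 3072 MB: {pct_increase:.1f}%")


def break_even_requests_millions(instance_monthly_cost, serverless_cost_per_million):
    """Monthly request volume (in millions) at which both options cost the same."""
    return instance_monthly_cost / serverless_cost_per_million


def serverless_savings_pct(instance_monthly_cost, serverless_cost_per_million,
                           monthly_requests_millions):
    """Percentage saved by serverless relative to a flat-cost instance."""
    serverless_cost = serverless_cost_per_million * monthly_requests_millions
    return (1 - serverless_cost / instance_monthly_cost) * 100


# Hypothetical monthly cost of a real-time instance (assumption, not from the report).
hypothetical_instance_monthly_cost = 40.0

break_even = break_even_requests_millions(hypothetical_instance_monthly_cost, cost_2048_mb)
savings_at_1m = serverless_savings_pct(hypothetical_instance_monthly_cost, cost_2048_mb, 1)

print(f"Break-even volume: {break_even:.2f} million requests per month")
print(f"Savings at 1 million requests: {savings_at_1m:.2f}%")
```

Below the break-even volume the pay-per-use model is cheaper. Above it, the flat-cost instance wins.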

The second type of benchmark is optional and tests various MaxConcurrency settings under different traffic patterns. This benchmark is usually run using the optimal MemorySizeInMB configuration from the stability benchmark. The two key parameters for this benchmark are a list of MaxConcurrency settings to test and a list of client multipliers, which determine the number of simulated concurrent clients that the endpoint is tested with.

For example, by setting the concurrency_benchmark_max_conc parameter to [4, 8] and concurrency_num_clients_multiplier to [1, 1.5, 2], two endpoints are launched: one with a MaxConcurrency of 4 and the other with a MaxConcurrency of 8. Each endpoint is then benchmarked with a (MaxConcurrency x multiplier) number of simulated concurrent clients, which for the endpoint with a concurrency of 4 translates to load test benchmarks with 4, 6, and 8 concurrent clients.
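The client counts in this example can be worked out with a few lines of Python. The function name `simulated_client_counts` is hypothetical, and rounding fractional products to the nearest integer is an assumption; the toolkit may handle non-integer products differently.

```python
def simulated_client_counts(max_concurrency_settings, client_multipliers):
    """Map each MaxConcurrency setting to its list of simulated client counts."""
    return {
        max_conc: [round(max_conc * multiplier) for multiplier in client_multipliers]
        for max_conc in max_concurrency_settings
    }


# Values from the example in this post.
client_counts = simulated_client_counts([4, 8], [1, 1.5, 2])
print(client_counts)  # {4: [4, 6, 8], 8: [8, 12, 16]}
```

Multipliers above 1 deliberately push more clients at the endpoint than it can serve concurrently, which is what exposes throttling behavior.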

The first output of this benchmark is a table that shows the latency metrics, throttling exceptions, and transactions per second metrics (TPS) associated with each MaxConcurrency configuration with different numbers of concurrent clients. These metrics help determine the appropriate MaxConcurrency setting to handle the expected traffic load. In the following table, we can see that an endpoint configured with a max concurrency of 8 was able to handle up to 16 concurrent clients with only two throttling exceptions out of 2,500 invocations made at an average of 24 transactions per second.
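To put the throttling figure in perspective, the following sketch converts the numbers quoted above into a throttle rate and a rough test duration. It assumes that the 24 transactions per second is an aggregate average over the whole run, which is an interpretation rather than something the report states.

```python
# Figures quoted in this post for MaxConcurrency 8 with 16 concurrent clients.
throttled = 2
total_invocations = 2500
average_tps = 24

# Share of invocations rejected with a throttling exception.
throttle_rate_pct = throttled / total_invocations * 100

# Rough duration, assuming the TPS figure is an aggregate average for the run.
approx_duration_s = total_invocations / average_tps

print(f"Throttle rate: {throttle_rate_pct:.2f}%")
print(f"Approximate run duration: {approx_duration_s:.0f} seconds")
```

A throttle rate this low suggests that the setting has headroom for the tested load. A rate that climbs with the client count would point to a MaxConcurrency that is too small.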

The next set of outputs provides a chart for each MaxConcurrency setting showing the distribution of latency under different loads. In this example, we can see that an endpoint with a MaxConcurrency setting of 4 was able to successfully process all requests with up to 8 concurrent clients with a minimal increase in invocation latency.

The final output provides a table with CloudWatch metrics for each MaxConcurrency configuration. Unlike the earlier table of CloudWatch metrics for each memory configuration, which may not always display the cold start ModelSetupTime metric, this table is far more likely to include it because of the larger number of invocation requests and the greater MaxConcurrency.

## Conclusion

In this post, we introduced the SageMaker Serverless Inference Benchmarking Toolkit and provided an overview of its configuration and outputs. The tool can help you make a more informed decision with regard to serverless inference by load testing different configurations with realistic traffic patterns. Try the benchmarking toolkit with your own models to see for yourself the performance and cost savings you can expect by deploying a serverless endpoint. Refer to the GitHub repo for additional documentation and example notebooks.