Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common ML algorithms that are optimized to run efficiently against extremely large data in a distributed environment.

SageMaker real-time inference is ideal for workloads that have real-time, interactive, low-latency requirements. With SageMaker real-time inference, you can deploy REST endpoints that are backed by a specific instance type with a certain amount of compute and memory. Deploying a SageMaker real-time endpoint is only the first step in the path to production for many customers. We want to be able to maximize the performance of the endpoint to achieve a target transactions per second (TPS) while adhering to latency requirements. A large part of performance optimization for inference is making sure you select the proper instance type and count to back an endpoint.

This post describes the best practices for load testing a SageMaker endpoint to find the right configuration for the number of instances and size. This can help us understand the minimum provisioned instance requirements to meet our latency and TPS requirements. From there, we dive into how you can track and understand the metrics and performance of the SageMaker endpoint utilizing Amazon CloudWatch metrics.

We first benchmark the performance of our model on a single instance to identify the TPS it can handle per our acceptable latency requirements. Then we extrapolate the findings to decide on the number of instances we need in order to handle our production traffic. Finally, we simulate production-level traffic and set up load tests for a real-time SageMaker endpoint to confirm our endpoint can handle the production-level load. The entire set of code for the example is available in the following GitHub repository.

Overview of solution

For this post, we deploy a pre-trained Hugging Face DistilBERT model from the Hugging Face Hub. This model can perform a number of tasks, but we send a payload specifically for sentiment analysis and text classification. With this sample payload, we strive to achieve 1000 TPS.

Deploy a real-time endpoint

This post assumes you are familiar with how to deploy a model. Refer to Create your endpoint and deploy your model to understand the internals behind hosting an endpoint. For now, we can quickly point to this model in the Hugging Face Hub and deploy a real-time endpoint with the following code snippet:

# Hub Model configuration.
hub = { 'HF_MODEL_ID':'distilbert-base-uncased', 'HF_TASK':'text-classification'
} # create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
) # deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
initial_instance_count=1, # number of instances
instance_type='ml.m5.12xlarge' # ec2 instance type

Let’s test our endpoint quickly with the sample payload that we want to use for load testing:

import boto3
import json
client = boto3.client('sagemaker-runtime')
content_type = "application/json"
request_body = {'inputs': "I am super happy right now."}
data = json.loads(json.dumps(request_body))
payload = json.dumps(data)
response = client.invoke_endpoint(
result = response['Body'].read()

Hyperedge- . IoT, Embedded Systems, Artificial Intelligence,

Note that we’re backing the endpoint using a single Amazon Elastic Compute Cloud (Amazon EC2) instance of type ml.m5.12xlarge, which contains 48 vCPU and 192 GiB of memory. The number of vCPUs is a good indication of the concurrency the instance can handle. In general, it’s recommended to test different instance types to make sure we have an instance that has resources that are properly utilized. To see a full list of SageMaker instances and their corresponding compute power for real-time Inference, refer to Amazon SageMaker Pricing.

Metrics to track

Before we can get into load testing, it’s essential to understand what metrics to track to understand the performance breakdown of your SageMaker endpoint. CloudWatch is the primary logging tool that SageMaker uses to help you understand the different metrics that describe your endpoint’s performance. You can utilize CloudWatch logs to debug your endpoint invocations; all logging and print statements you have in your inference code are captured here. For more information, refer to How Amazon CloudWatch works.

There are two different types of metrics CloudWatch covers for SageMaker: instance-level and invocation metrics.

Instance-level metrics

The first set of parameters to consider is the instance-level metrics: CPUUtilization and MemoryUtilization (for GPU-based instances, GPUUtilization). For CPUUtilization, you may see percentages above 100% at first in CloudWatch. It’s important to realize for CPUUtilization, the sum of all the CPU cores is being displayed. For example, if the instance behind your endpoint contains 4 vCPUs, this means the range of utilization is up to 400%. MemoryUtilization, on the other hand, is in the range of 0–100%.

Specifically, you can use CPUUtilization to get a deeper understanding of if you have sufficient or even an excess amount of hardware. If you have an under-utilized instance (less than 30%), you could potentially scale down your instance type. Conversely, if you are around 80–90% utilization, it would benefit to pick an instance with greater compute/memory. From our tests, we suggest around 60–70% utilization of your hardware.

Invocation metrics

As suggested by the name, invocation metrics is where we can track the end-to-end latency of any invokes to your endpoint. You can utilize the invocation metrics to capture error counts and what type of errors (5xx, 4xx, and so on) that your endpoint may be experiencing. More importantly, you can understand the latency breakdown of your endpoint calls. A lot of this can be captured with ModelLatency and OverheadLatency metrics, as illustrated in the following diagram.


The ModelLatency metric captures the time that inference takes within the model container behind a SageMaker endpoint. Note that the model container also includes any custom inference code or scripts that you have passed for inference. This unit is captured in microseconds as an invocation metric, and generally you can graph a percentile across CloudWatch (p99, p90, and so on) to see if you’re meeting your target latency. Note that several factors can impact model and container latency, such as the following:

  • Custom inference script – Whether you have implemented your own container or used a SageMaker-based container with custom inference handlers, it’s best practice to profile your script to catch any operations that are specifically adding a lot of time to your latency.
  • Communication protocol – Consider REST vs. gRPC connections to the model server within the model container.
  • Model framework optimizations – This is framework specific, for example with TensorFlow, there are a number of environment variables you can tune that are TF Serving specific. Make sure to check what container you’re using and if there are any framework-specific optimizations you can add within the script or as environment variables to inject in the container.

OverheadLatency is measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This part is largely outside of your control and falls under the time taken by SageMaker overheads.

End-to-end latency as a whole depends on a variety of factors and isn’t necessarily the sum of ModelLatency plus OverheadLatency. For example, if you client is making the InvokeEndpoint API call over the internet, from the client’s perspective, the end-to-end latency would be internet + ModelLatency + OverheadLatency. As such, when load testing your endpoint in order to accurately benchmark the endpoint itself, it’s recommended to focus on the endpoint metrics (ModelLatency, OverheadLatency, and InvocationsPerInstance) to accurately benchmark the SageMaker endpoint. Any issues related to end-to-end latency can then be isolated separately.

A few questions to consider for end-to-end latency:

  • Where is the client that is invoking your endpoint?
  • Are there any intermediary layers between your client and the SageMaker runtime?

Auto scaling

We don’t cover auto scaling in this post specifically, but it’s an important consideration in order to provision the correct number of instances based on the workload. Depending on your traffic patterns, you can attach an auto scaling policy to your SageMaker endpoint. There are different scaling options, such as TargetTrackingScaling, SimpleScaling, and StepScaling. This allows your endpoint to scale in and out automatically based on your traffic pattern.

A common option is target tracking, where you can specify a CloudWatch metric or custom metric that you have defined and scale out based on that. A frequent utilization of auto scaling is tracking the InvocationsPerInstance metric. After you have identified a bottleneck at a certain TPS, you can often use that as a metric to scale out to a greater number of instances to be able to handle peak loads of traffic. To get a deeper breakdown of auto scaling SageMaker endpoints, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.

Load testing

Although we utilize Locust to display how we can load test at scale, if you’re trying to right size the instance behind your endpoint, SageMaker Inference Recommender is a more efficient option. With third-party load testing tools, you have to manually deploy endpoints across different instances. With Inference Recommender, you can simply pass an array of the instance types you want to load test against, and SageMaker will spin up jobs for each of these instances.


For this example, we use Locust, an open-source load testing tool that you can implement using Python. Locust is similar to many other open-source load testing tools, but has a few specific benefits:

  • Easy to set up – As we demonstrate in this post, we’ll pass a simple Python script that can easily be refactored for your specific endpoint and payload.
  • Distributed and scalable – Locust is event-based and utilizes gevent under the hood. This is very useful for testing highly concurrent workloads and simulating thousands of concurrent users. You can achieve high TPS with a single process running Locust, but it also has a distributed load generation feature that enables you to scale out to multiple processes and client machines, as we will explore in this post.
  • Locust metrics and UI – Locust also captures end-to-end latency as a metric. This can help supplement your CloudWatch metrics to paint a full picture of your tests. This is all captured in the Locust UI, where you can track concurrent users, workers, and more.

To further understand Locust, check out their documentation.

Amazon EC2 setup

You can set up Locust in whatever environment is compatible for you. For this post, we set up an EC2 instance and install Locust there to conduct our tests. We use a c5.18xlarge EC2 instance. The client-side compute power is also something to consider. At times when you run out of compute power on the client side, this is often not captured, and is mistaken as a SageMaker endpoint error. It’s important to place your client in a location of sufficient compute power that can handle the load that you are testing at. For our EC2 instance, we use an Ubuntu Deep Learning AMI, but you can utilize any AMI as long as you can properly set up Locust on the machine. To understand how to launch and connect to your EC2 instance, refer to the tutorial Get started with Amazon EC2 Linux instances.

The Locust UI is accessible via port 8089. We can open this by adjusting our inbound security group rules for the EC2 Instance. We also open up port 22 so we can SSH into the EC2 instance. Consider scoping the source down to the specific IP address you are accessing the EC2 instance from.

Security Groups

After you’re connected to your EC2 instance, we set up a Python virtual environment and install the open-source Locust API via the CLI:

virtualenv venv #venv is the virtual environment name, you can change as you desire
source venv/bin/activate #activate virtual environment
pip install locust

We’re now ready to work with Locust for load testing our endpoint.

Locust testing

All Locust load tests are conducted based off of a Locust file that you provide. This Locust file defines a task for the load test; this is where we define our Boto3 invoke_endpoint API call. See the following code:

config = Config(
retries = { 'max_attempts': 0, 'mode': 'standard'
) self.sagemaker_client = boto3.client('sagemaker-runtime',config=config)
self.endpoint_name = host.split('/')[-1]
self.region = region
self.content_type = content_type
self.payload = payload

In the preceding code, adjust your invoke endpoint call parameters to suit your specific model invocation. We use the InvokeEndpoint API using the following piece of code in the Locust file; this is our load test run point. The Locust file we’re using is

def send(self): request_meta = { "request_type": "InvokeEndpoint", "name": "SageMaker", "start_time": time.time(), "response_length": 0, "response": None, "context": {}, "exception": None,
start_perf_counter = time.perf_counter() try:
response = self.sagemaker_client.invoke_endpoint(
response_body = response["Body"].read()

Now that we have our Locust script ready, we want to run distributed Locust tests to stress test our single instance to find out how much traffic our instance can handle.

Locust distributed mode is a little more nuanced than a single-process Locust test. In distributed mode, we have one primary and multiple workers. The primary worker instructs the workers on how to spawn and control the concurrent users that are sending a request. In our script, we see by default that 240 users will be distributed across the 60 workers. Note that the --headless flag in the Locust CLI removes the UI feature of Locust.

#replace with your endpoint name in format https://<<endpoint-name>>
export ENDPOINT_NAME=https://$1 export REGION=us-east-1
export CONTENT_TYPE=application/json
export PAYLOAD='{"inputs": "I am super happy right now."}'
export USERS=240
export WORKERS=60
export RUN_TIME=1m
export LOCUST_UI=false # Use Locust UI .
. locust -f $SCRIPT -H $ENDPOINT_NAME --master --expect-workers $WORKERS -u $USERS -t $RUN_TIME --csv results &
. for (( c=1; c<=$WORKERS; c++ ))
locust -f $SCRIPT -H $ENDPOINT_NAME --worker --master-host=localhost &

./ huggingface-pytorch-inference-2022-10-04-02-46-44-677 #to execute Distributed Locust test

We first run the distributed test on a single instance backing the endpoint. The idea here is we want to fully maximize a single instance to understand the instance count we need to achieve our target TPS while staying within our latency requirements. Note that if you want to access the UI, change the Locust_UI environment variable to True and take the public IP of your EC2 instance and map port 8089 to the URL.

The following screenshot shows our CloudWatch metrics.

CloudWatch Metrics

Eventually, we notice that although we initially achieve a TPS of 200, we start noticing 5xx errors in our EC2 client-side logs, as shown in the following screenshot.

We can also verify this by looking at our instance-level metrics, specifically CPUUtilization.

CloudWatch MetricsHere we notice CPUUtilization at nearly 4,800%. Our ml.m5.12x.large instance has 48 vCPUs (48 * 100 = 4800~). This is saturating the entire instance, which also helps explain our 5xx errors. We also see an increase in ModelLatency.

It seems as if our single instance is getting toppled and doesn’t have the compute to sustain a load past the 200 TPS that we are observing. Our target TPS is 1000, so let’s try to increase our instance count to 5. This might have to be even more in a production setting, because we were observing errors at 200 TPS after a certain point.

Endpoint settings

We see in both the Locust UI and CloudWatch logs that we have a TPS of nearly 1000 with five instances backing the endpoint.


CloudWatch MetricsIf you start experiencing errors even with this hardware setup, make sure to monitor CPUUtilization to understand the full picture behind your endpoint hosting. It’s crucial to understand your hardware utilization to see if you need to scale up or even down. Sometimes container-level problems lead to 5xx errors, but if CPUUtilization is low, it indicates that it’s not your hardware but something at the container or model level that might be leading to these issues (proper environment variable for number of workers not set, for example). On the other hand, if you notice your instance is getting fully saturated, it’s a sign that you need to either increase the current instance fleet or try out a larger instance with a smaller fleet.

Although we increased the instance count to 5 to handle 100 TPS, we can see that the ModelLatency metric is still high. This is due to the instances being saturated. In general, we suggest aiming to utilize the instance’s resources between 60–70%.

Clean up

After load testing, make sure to clean up any resources you won’t utilize via the SageMaker console or through the delete_endpoint Boto3 API call. In addition, make sure to stop your EC2 instance or whatever client setup you have to not incur any further charges there as well.


In this post, we described how you can load test your SageMaker real-time endpoint. We also discussed what metrics you should be evaluating when load testing your endpoint to understand your performance breakdown. Make sure to check out SageMaker Inference Recommender to further understand instance right-sizing and more performance optimization techniques.

About the Authors

Hyperedge- . IoT, Embedded Systems, Artificial Intelligence,Marc Karp is a ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Hyperedge- . IoT, Embedded Systems, Artificial Intelligence,Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Read more about this on: AWS