Unlock cost savings with the new scale down to zero feature in SageMaker Inference
Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints maintained a minimum number of instances to provide continuous availability, even during periods of low or no traffic. With this update, available when using SageMaker inference components, you have more options to align your resource usage with your specific needs and traffic patterns.
Refer to the accompanying notebooks to get started with the new scale down to zero feature.
The new feature expands the possibilities for managing SageMaker inference endpoints. It allows you to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. With this feature, you can closely match your compute resource usage to your actual needs, potentially reducing costs during times of low demand. This enhancement builds upon the existing auto scaling capabilities in SageMaker, offering more granular control over resource allocation. You can now configure your scaling policies to include scaling to zero, allowing for more precise management of your AI inference infrastructure.
The scale down to zero feature presents new opportunities for how businesses can approach their cloud-based ML operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, you are encouraged to carefully evaluate how it fits into your overall architecture and operational needs, considering factors such as response times and the specific requirements of your applications.
In this post, we explore the new scale to zero feature for SageMaker inference endpoints, demonstrating how to implement and use this capability to optimize costs and manage resources more effectively. We cover the key scenarios where scaling to zero is beneficial, provide best practices for optimizing scale-up time, and walk through the step-by-step process of implementing this functionality. Additionally, we discuss how to set up scheduled scaling actions for predictable traffic patterns and test the behavior of your scaled-to-zero endpoints.
Determining when to scale to zero
Before we dive into the implementation details, it’s important to understand when and why to use the scale to zero feature. Although scaling SageMaker inference endpoints to zero instances offers significant cost-saving potential, not all scenarios benefit equally, and in some cases it may affect the performance of your applications. Let’s explore how to identify the scenarios where it provides the most value.
The ability to scale SageMaker inference endpoints to zero instances is particularly beneficial in three key scenarios:
- Predictable traffic patterns – If your inference traffic is predictable and follows a consistent schedule, you can use this scaling functionality to automatically scale down to zero during periods of low or no usage. This eliminates the need to manually delete and recreate inference components and endpoints.
- Sporadic or variable traffic – For applications that experience sporadic or variable inference traffic patterns, scaling down to zero instances can provide significant cost savings. However, scaling from zero instances back up to serving traffic is not instantaneous. During the scale-out process, any requests sent to the endpoint will fail, and these NoCapacityInvocationFailures will be captured in Amazon CloudWatch.
- Development and testing environments – The scale to zero functionality is also beneficial when testing and evaluating new ML models. During model development and experimentation, you might create temporary inference endpoints to test different configurations. However, it’s possible to forget to delete these endpoints when you’re done. Scaling down to zero makes sure these test endpoints automatically scale back to zero instances when not in use, preventing unwanted charges. This allows you to freely experiment without closely monitoring infrastructure usage or remembering to manually delete endpoints. The automatic scaling to zero provides a cost-effective way to test out ideas and iterate on your ML solutions.
By carefully evaluating your specific use case against these scenarios, you can make informed decisions about implementing scale to zero functionality. This approach makes sure you maximize cost savings without compromising on the performance and availability requirements of your ML applications. It’s important to note that although scaling to zero can provide significant benefits, it also introduces a trade-off in terms of initial response time when scaling back up. Therefore, it’s crucial to assess whether your application can tolerate this potential delay and to implement appropriate strategies to manage it. In the following sections, we dive deeper into each scenario and provide guidance on how to determine if scaling to zero is the right choice for your specific needs. We also discuss best practices for implementation and strategies to mitigate potential drawbacks.
Scale down to zero is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.
Now that we understand when to use the scale to zero feature, let’s dive into how to optimize its performance and implement it effectively. Scaling up from zero instances to serving traffic introduces a brief delay (cold start), which can impact your application’s responsiveness. To mitigate this, we first explore best practices for minimizing scale-up time. Then we walk through the step-by-step process of implementing the scale to zero functionality for your SageMaker inference endpoints.
Optimizing scale-up time best practices
When using the scale to zero feature, it’s crucial to minimize the time it takes for your endpoint to scale up and begin serving requests. The following are several best practices you can implement to decrease the scale-out time for your SageMaker inference endpoints:
- Decrease model or container download time – Use an uncompressed model format to reduce the time it takes to download the model artifacts when scaling up. Compressed model files may save storage space, but they require additional time to uncompress and can’t be downloaded in parallel, which can slow down the scale-up process. To learn more, see Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.
- Reduce model server startup time – Look for ways to optimize the startup and initialization of your model server container. This could include techniques like building packages into the image, using multi-threading, or minimizing unnecessary initialization steps. For more details, see Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1.
- Use faster auto scaling metrics – Take advantage of more granular auto scaling metrics like ConcurrentRequestsPerCopy to more accurately monitor and react to changes in inference traffic. These sub-minute metrics can help trigger scale-out actions more precisely, reducing the number of NoCapacityInvocationFailures your users might experience. For more information, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
- Handle failed requests – When scaling from zero instances, there will be a brief period where requests fail with NoCapacityInvocationFailures while SageMaker provisions resources. To handle this, you can use queues or implement client-side retries:
- Use a serverless queue like Amazon Simple Queue Service (Amazon SQS) to buffer requests during scale-out. When a failure occurs, enqueue the request and dequeue it after the model copies have scaled up from zero.
- Alternatively, have your client retry failed requests after a delay, which gives the model copies time to scale out from zero. You can retrieve the number of copies of an inference component at any time by making the DescribeInferenceComponent API call and checking CurrentCopyCount. Retrying this way handles the transition transparently for end-users.
By implementing these best practices, you can help make sure your SageMaker inference endpoints can scale out quickly and efficiently to meet changes in traffic, providing a responsive and reliable experience for your end-users.
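The client-side retry approach can be sketched as follows. This is a minimal illustration and not the post's own code: the helper names `is_no_capacity_error` and `invoke_with_retry`, the delay values, and the string match on the error message are all assumptions made for the example.

```python
import time


def is_no_capacity_error(exc: Exception) -> bool:
    """Heuristic (assumption): match the 'no capacity' wording in the error text."""
    return "no capacity" in str(exc).lower()


def invoke_with_retry(invoke, max_attempts=8, base_delay=15.0, max_delay=120.0,
                      sleep=time.sleep):
    """Call invoke(); retry with exponential backoff on no-capacity failures only."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            # Re-raise anything that is not a scale-from-zero failure,
            # and give up once the attempt budget is exhausted.
            if last_attempt or not is_no_capacity_error(exc):
                raise
            sleep(min(base_delay * (2 ** attempt), max_delay))


# Usage sketch (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = invoke_with_retry(lambda: runtime.invoke_endpoint(
#     EndpointName="my-endpoint",
#     InferenceComponentName="my-inference-component",
#     ContentType="application/json",
#     Body=b'{"inputs": "hello"}',
# ))
```

Because scale-out from zero takes minutes rather than seconds, the delays here are deliberately long. Tune them to the scale-out times you observe for your own model.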
Solution overview
With these best practices in mind, let’s now walk through the process of enabling your SageMaker inference endpoints to scale down to zero instances. This process involves a few key steps that are crucial for optimizing your endpoint’s performance and cost-efficiency:
- Configure your endpoint – The first and most critical step is to enable managed instance scaling for your SageMaker endpoint. This is the foundational action that allows you to implement advanced scaling features, including scaling to zero. By enabling managed instance scaling, you’re creating an inference component endpoint, which is essential for the fine-grained control over scaling behaviors we discuss later in this post. After you configure managed instance scaling, you then configure the SageMaker endpoint to set the MinInstanceCount parameter to 0. This parameter allows the endpoint to scale all the way down to zero instances when not in use, maximizing cost-efficiency. Enabling managed instance scaling and setting MinInstanceCount to 0 work together to provide a highly flexible and cost-effective endpoint configuration. However, scaling up from zero will introduce cold starts, potentially impacting response times for initial requests after periods of inactivity. The inference component endpoint created through managed instance scaling serves as the foundation for implementing the sophisticated scaling policies we explore in the next step.
- Define scaling policies – Next, you need to create two scaling policies that work in tandem to manage the scaling behavior of your endpoint effectively:
- Scaling policy for inference component copies – This target tracking scaling policy will manage the scaling of your inference component copies. It’s a dynamic policy that adjusts the number of copies based on a specified metric, such as CPU utilization or request count. The policy is designed to scale the copy count to zero when there is no traffic, making sure you’re not paying for unused resources. Conversely, it will scale back up to your desired capacity when needed, allowing your endpoint to handle incoming requests efficiently. When configuring this policy, you need to carefully choose the target metric and threshold that best reflect your workload patterns and performance requirements.
- Scale out from zero policy – This policy is crucial for enabling your endpoint to scale out from zero model copies when traffic arrives. It’s implemented as a step scaling policy that adds model copies when triggered by incoming requests. This allows SageMaker to provision the necessary instances to support the model copies and handle the incoming traffic. When configuring this policy, you need to consider factors such as the expected traffic patterns, the desired responsiveness of your endpoint, and the potential cold start latency. You may want to set up multiple steps in your policy to handle different levels of incoming traffic more granularly.
By implementing these scaling policies, you create a flexible and cost-effective infrastructure that can automatically adjust to your workload demands and scale to zero when needed.
Now let’s see how to use this feature step by step.
Set up your endpoint
The first crucial step in enabling your SageMaker endpoint to scale to zero is properly configuring the endpoint and its associated components. This process involves three main steps:
- Create the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale down all the way to zero instances when not in use.
- Create the SageMaker endpoint.
- Create the inference component for your endpoint.
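The three steps above can be sketched with the AWS SDK for Python (Boto3). Treat this as an illustration under assumptions: the helper names, the routing strategy, the initial instance count, and the compute requirements are placeholders chosen for the example, and only `MinInstanceCount: 0` comes directly from this post.

```python
def endpoint_config_request(config_name, role_arn, variant_name,
                            instance_type, max_instance_count):
    """Build CreateEndpointConfig arguments with managed instance scaling enabled."""
    return {
        "EndpointConfigName": config_name,
        "ExecutionRoleArn": role_arn,
        "ProductionVariants": [{
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,  # assumption: start with one instance
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 0,  # lets the endpoint scale to zero instances
                "MaxInstanceCount": max_instance_count,
            },
            # Assumption: routing strategy picked for illustration only.
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }],
    }


def inference_component_request(name, endpoint_name, variant_name, model_name,
                                accelerators=1, min_memory_mb=1024, copy_count=1):
    """Build CreateInferenceComponent arguments (resource values are placeholders)."""
    return {
        "InferenceComponentName": name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copy_count},
    }


def deploy(sm, config_request, endpoint_name, component_request):
    """Run the three steps in order; sm is a boto3 SageMaker client."""
    sm.create_endpoint_config(**config_request)
    sm.create_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=config_request["EndpointConfigName"])
    # Wait until the endpoint is InService before adding the inference component.
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    sm.create_inference_component(**component_request)


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# deploy(boto3.client("sagemaker"), config_request, "my-endpoint", component_request)
```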
Add scaling policies
After the endpoint is deployed and InService, you can add the necessary scaling policies:
- A target tracking policy that can scale down the copy count for our inference component model copies to zero, and from 1 to n
- A step scaling policy that will allow the endpoint to scale up from zero
Scaling policy for inference components model copies
After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. When you register the target, set MinCapacity to 0, which is required for your endpoint to scale down to zero.
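Registering the scalable target might look like the following sketch. The helper name `scalable_target_request` and the `max_copy_count` argument are assumptions for illustration. The resource ID and scalable dimension follow the Application Auto Scaling conventions for inference components.

```python
def scalable_target_request(inference_component_name, max_copy_count):
    """Build RegisterScalableTarget arguments for an inference component."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": 0,  # required so the copy count can reach zero
        "MaxCapacity": max_copy_count,
    }


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**scalable_target_request("my-inference-component", 4))
```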
After you have registered your new scalable target, the next step is to define your target tracking policy. For example, setting TargetValue to 5 instructs the auto scaling system to increase capacity when the number of concurrent requests per model copy reaches or exceeds 5.
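A target tracking policy along these lines could be defined as shown below. This is a sketch under assumptions: the policy name is a placeholder, and the predefined metric type is our best understanding of the high-resolution ConcurrentRequestsPerCopy metric, so verify it against the Application Auto Scaling documentation before relying on it.

```python
def target_tracking_policy_request(inference_component_name, target_value=5):
    """Build PutScalingPolicy arguments for a target tracking policy on copy count."""
    return {
        "PolicyName": f"{inference_component_name}-target-tracking",  # placeholder
        "PolicyType": "TargetTrackingScaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "TargetTrackingScalingPolicyConfiguration": {
            "PredefinedMetricSpecification": {
                # Assumption: predefined metric backing ConcurrentRequestsPerCopy.
                "PredefinedMetricType":
                    "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
            },
            "TargetValue": target_value,
        },
    }


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.put_scaling_policy(**target_tracking_policy_request("my-inference-component"))
```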
Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 1 minute (using one 1-minute data point), and the second triggers scale-in after 15 minutes (using 90 10-second data points). The time to trigger the scaling action is usually 1–2 minutes longer than these intervals because it takes time for the endpoint to publish metrics to CloudWatch, and it also takes time for auto scaling to react.
Scale out from zero model copies policy
To enable your endpoint to scale out from zero instances, complete the following steps:
- Create a step scaling policy that defines when and how to scale out from zero. This policy will add one model copy when triggered, enabling SageMaker to provision the instances required to handle incoming requests after being idle. In this walkthrough, the policy scales from zero to one model copy ("ScalingAdjustment": 1). Depending on your use case, you can adjust ScalingAdjustment as required.
- Create a CloudWatch alarm with the metric NoCapacityInvocationFailures.
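The step scaling policy from the first step can be sketched as follows. The helper name, the policy name, the cooldown, and the metric aggregation type are assumptions for illustration. The `ScalingAdjustment` of 1 is the value described above.

```python
def step_scaling_policy_request(inference_component_name, scaling_adjustment=1):
    """Build PutScalingPolicy arguments for a scale-out-from-zero step policy."""
    return {
        "PolicyName": f"{inference_component_name}-scale-out-from-zero",  # placeholder
        "PolicyType": "StepScaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",  # assumption
            "Cooldown": 60,                      # assumption, in seconds
            "StepAdjustments": [
                # Any breach of the alarm threshold adds scaling_adjustment copies.
                {"MetricIntervalLowerBound": 0,
                 "ScalingAdjustment": scaling_adjustment},
            ],
        },
    }


# Usage sketch (requires boto3 and AWS credentials). The response contains the
# policy ARN that the CloudWatch alarm uses as its action:
# import boto3
# aas = boto3.client("application-autoscaling")
# response = aas.put_scaling_policy(**step_scaling_policy_request("my-inference-component"))
# step_scaling_policy_arn = response["PolicyARN"]
```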
When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see the documentation.
We have also set the following:
- EvaluationPeriods to 1
- DatapointsToAlarm to 1
- ComparisonOperator to GreaterThanOrEqualToThreshold
This results in waiting approximately 1 minute for the step scaling policy to trigger after our endpoint receives a single request.
Replace <STEP_SCALING_POLICY_ARN> with the Amazon Resource Name (ARN) of the scaling policy you created in the previous step.
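A CloudWatch alarm with these settings might be defined as in the following sketch. `EvaluationPeriods`, `DatapointsToAlarm`, and `ComparisonOperator` come from the settings listed above. The alarm name, statistic, period, threshold, and missing-data handling are assumptions chosen for the example.

```python
def no_capacity_alarm_request(inference_component_name,
                              step_scaling_policy_arn="<STEP_SCALING_POLICY_ARN>"):
    """Build PutMetricAlarm arguments that trigger the step scaling policy."""
    return {
        "AlarmName": f"{inference_component_name}-scale-from-zero",  # placeholder
        "AlarmActions": [step_scaling_policy_arn],
        "Namespace": "AWS/SageMaker",
        "MetricName": "NoCapacityInvocationFailures",
        "Dimensions": [
            {"Name": "InferenceComponentName", "Value": inference_component_name},
        ],
        "Statistic": "Maximum",         # assumption
        "Period": 60,                   # assumption, in seconds
        "Threshold": 1,                 # assumption: a single failed request
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",  # assumption
    }


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(
#     **no_capacity_alarm_request("my-inference-component", step_scaling_policy_arn))
```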
Notice the "MinInstanceCount": 0 setting in the endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker inference endpoint will now be able to automatically scale down to zero instances when not in use.
Test the solution
When our SageMaker endpoint doesn’t receive requests for 15 minutes, it automatically scales the number of model copies down to zero.
After 10 additional minutes of inactivity, SageMaker automatically stops all underlying instances of the endpoint, eliminating all associated instance costs.
If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error:
An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.
However, after 1 minute, our step scaling policy should start. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests.
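To observe the scale-out during testing, you can poll the inference component until a model copy is available. The following is a minimal sketch: the helper name `wait_for_copies`, the timeout, and the polling interval are assumptions, and it reads `CurrentCopyCount` from the `RuntimeConfig` section of the DescribeInferenceComponent response.

```python
import time


def wait_for_copies(sm, inference_component_name, min_copies=1,
                    timeout_s=600, poll_s=15,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll DescribeInferenceComponent until CurrentCopyCount >= min_copies."""
    deadline = clock() + timeout_s
    while True:
        description = sm.describe_inference_component(
            InferenceComponentName=inference_component_name)
        current = description.get("RuntimeConfig", {}).get("CurrentCopyCount") or 0
        if current >= min_copies:
            return current
        if clock() >= deadline:
            raise TimeoutError(
                f"{inference_component_name} still has {current} copies "
                f"after {timeout_s} seconds")
        sleep(poll_s)


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# wait_for_copies(boto3.client("sagemaker"), "my-inference-component")
```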
Schedule scaling down to zero
In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:
- Weekend scale-in (Friday evening) – Configure a scheduled action to reduce the number of model copies to zero. This will instruct SageMaker to scale the number of instances behind the endpoint to zero, completely eliminating costs during the weekend period of no usage.
- Workweek scale-out (Monday morning) – Set up a complementary scheduled action to restore the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.
You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the UpdateInferenceComponentRuntimeConfig API. This approach maintains your endpoint configuration while eliminating compute costs during periods of inactivity.
Amazon EventBridge Scheduler can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. For more information about how to create this role, see Set up the execution role. The specific permissions needed depend on the target API being called.
As an example, consider two scheduled actions for the inference component during 2024–2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, and the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedules start on November 29, 2024, end on December 31, 2025, and are deleted after completion.
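These two schedules can be sketched with the EventBridge Scheduler `create_schedule` API and its universal target for SageMaker. This is an illustration under assumptions: the helper name, schedule names, and role ARN are placeholders, and the time zone identifier is left to the caller because the exact value to use for UTC+1 is not specified here.

```python
import json
from datetime import datetime, timezone

# Dates taken from the description above; adjust them to your own window.
START = datetime(2024, 11, 29, tzinfo=timezone.utc)
END = datetime(2025, 12, 31, tzinfo=timezone.utc)


def copy_count_schedule(name, cron, copy_count, inference_component_name,
                        role_arn, tz_name, start=START, end=END):
    """Build CreateSchedule arguments that call UpdateInferenceComponentRuntimeConfig."""
    return {
        "Name": name,
        "ScheduleExpression": cron,
        "ScheduleExpressionTimezone": tz_name,
        "StartDate": start,
        "EndDate": end,
        "ActionAfterCompletion": "DELETE",  # remove the schedule once it has ended
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            # Universal target: arn:aws:scheduler:::aws-sdk:<service>:<apiAction>
            "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:"
                   "updateInferenceComponentRuntimeConfig",
            "RoleArn": role_arn,
            "Input": json.dumps({
                "InferenceComponentName": inference_component_name,
                "DesiredRuntimeConfig": {"CopyCount": copy_count},
            }),
        },
    }


# Usage sketch (requires boto3 and AWS credentials; all names are placeholders):
# import boto3
# scheduler = boto3.client("scheduler")
# scheduler.create_schedule(**copy_count_schedule(
#     "scale-in-friday", "cron(0 18 ? * FRI *)", 0, "my-ic", role_arn, tz_name))
# scheduler.create_schedule(**copy_count_schedule(
#     "scale-out-monday", "cron(0 7 ? * MON *)", 1, "my-ic", role_arn, tz_name))
```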
The second method is to delete the inference components by calling the DeleteInferenceComponent API. This approach achieves the same cost-saving benefit while completely removing the components from your configuration. For example, one scheduled action can automatically delete the inference component every Friday at 18:00 UTC+1 during 2024–2025, and a complementary scheduled action can recreate it every Monday at 07:00 UTC+1.
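The delete-and-recreate variant can be sketched in the same way. The helper names and schedule names below are hypothetical, and `component_request` is assumed to hold the same arguments you originally passed to CreateInferenceComponent.

```python
import json


def sdk_schedule(name, cron, api_action, payload, role_arn, tz_name):
    """Build CreateSchedule arguments for a SageMaker universal target."""
    return {
        "Name": name,
        "ScheduleExpression": cron,
        "ScheduleExpressionTimezone": tz_name,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": f"arn:aws:scheduler:::aws-sdk:sagemaker:{api_action}",
            "RoleArn": role_arn,
            "Input": json.dumps(payload),
        },
    }


def delete_and_recreate_schedules(component_request, role_arn, tz_name):
    """Friday delete and Monday recreate schedules limited to 2024-2025."""
    name = component_request["InferenceComponentName"]
    return [
        sdk_schedule("delete-ic-friday", "cron(0 18 ? * FRI 2024-2025)",
                     "deleteInferenceComponent",
                     {"InferenceComponentName": name}, role_arn, tz_name),
        sdk_schedule("create-ic-monday", "cron(0 7 ? * MON 2024-2025)",
                     "createInferenceComponent",
                     component_request, role_arn, tz_name),
    ]


# Usage sketch (requires boto3 and AWS credentials):
# import boto3
# scheduler = boto3.client("scheduler")
# for schedule in delete_and_recreate_schedules(component_request, role_arn, tz_name):
#     scheduler.create_schedule(**schedule)
```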
To scale to zero on an endpoint with multiple inference components, all components must be either set to 0 or deleted. You can also automate this process by using EventBridge Scheduler to trigger an AWS Lambda function that handles either deletion or zero-setting of all inference components.
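A Lambda function for the zero-setting variant might look like the following sketch. The function names and the `endpoint_name` event key are assumptions for illustration. The SageMaker calls used are ListInferenceComponents and UpdateInferenceComponentRuntimeConfig.

```python
def scale_endpoint_to_zero(sm, endpoint_name):
    """Set CopyCount to 0 for every inference component on the endpoint."""
    names = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm.list_inference_components(**kwargs)
        names += [component["InferenceComponentName"]
                  for component in page.get("InferenceComponents", [])]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results

    for name in names:
        sm.update_inference_component_runtime_config(
            InferenceComponentName=name,
            DesiredRuntimeConfig={"CopyCount": 0},
        )
    return names


def lambda_handler(event, context):
    """Entry point; expects a hypothetical 'endpoint_name' key in the event."""
    import boto3  # imported here so the helper above can be tested without boto3
    sm = boto3.client("sagemaker")
    return {"scaled_to_zero": scale_endpoint_to_zero(sm, event["endpoint_name"])}
```

Keeping the SageMaker client as a parameter of `scale_endpoint_to_zero` makes the logic easy to exercise with a stub before wiring it to EventBridge Scheduler.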
Performance evaluation
We evaluated the performance implications of the Scale to Zero feature by conducting tests using a Llama3-8B instruct model. These tests utilized container caching and optimized model loading techniques, and were performed with both Target Tracking and Step Scaling policies in place. Our findings for Llama3-8B instruct show that when using the Target Tracking policy, SageMaker will scale the endpoint to zero model copies in approximately 15 minutes, and then take an additional 10 minutes to fully scale down the underlying instances, for a total scale-in time of 25 minutes. Conversely, when scaling the endpoint back up from zero, the Step Scaling policy triggers in around 1 minute, provisioning the instance(s) takes approximately 1.748 minutes, and scaling out the model copies takes approximately 2.28 minutes, resulting in a total scale-out time of around 5.028 minutes.
The performance tests on LLaMa3.1 models (8B and 70B variants) demonstrate SageMaker’s Scale to Zero feature’s effectiveness, with intentionally conservative scaling times to prevent endpoint thrashing and accommodate spiky traffic patterns. For both model sizes, scaling in takes a total of 25 minutes, allowing a 15-minute buffer before initiating scale-down and an additional 10 minutes to fully decommission instances. This cautious approach helps avoid premature scaling during temporary lulls in traffic. When scaling out, the 8B model takes about 5 minutes, while the 70B model needs approximately 6 minutes. These times include a 1-minute trigger delay, followed by instance provisioning and model copy instantiation. The slightly longer scale-out times, especially for larger models, provide a balance between responsiveness and stability, ensuring the system can handle sudden traffic increases without constantly scaling up and down. This measured approach to scaling helps maintain consistent performance and cost-efficiency in environments with variable workloads.
LLaMa3.1 8B Instruct, scale in
Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min)
15 | 10 | 25

LLaMa3.1 8B Instruct, scale out
Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min)
1 | 1.748 | 2.28 | 5.028

LLaMa3.1 70B, scale in
Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min)
15 | 10 | 25

LLaMa3.1 70B, scale out
Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min)
1 | 3.018 | 1.984 | 6.002
Scale up Trials
LLaMa3.1 8B Instruct
Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
1 | 1 | 1.96 | 3.1 | 6.06 |
2 | 1 | 1.75 | 2.6 | 5.35 |
3 | 1 | 1.4 | 2.1 | 4.5 |
4 | 1 | 1.96 | 1.9 | 4.86 |
5 | 1 | 1.67 | 1.7 | 4.37 |
Average | 1 | 1.748 | 2.28 | 5.028 |
LLaMa3.1 70B
Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
1 | 1 | 3.1 | 1.98 | 6.08 |
2 | 1 | 2.92 | 1.98 | 5.9 |
3 | 1 | 2.82 | 1.98 | 5.8 |
4 | 1 | 3.27 | 2 | 6.27 |
5 | 1 | 2.98 | 1.98 | 5.96 |
Average | 1 | 3.018 | 1.984 | 6.002 |
- Target Tracking: Scale Model Copies to Zero (min) – This refers to the time it took target tracking to trigger the alarm and SageMaker to decrease model copies to zero on the instance
- Scale in Instance Count to Zero (min) – This refers to the time it takes SageMaker to scale the instances down to zero after all inference component model copies are zero
- Step Scaling: Scale up Model Copies from Zero (min) – This refers to the time it took step scaling to trigger the alarm and for SageMaker to provision the instances
- Scale out Instance Count from Zero (min) – This refers to the time it takes for SageMaker to scale out and add inference component model copies
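The averages reported in the scale-up trial tables can be reproduced from the per-trial values with a few lines of Python. The variable and function names are ours, and each tuple holds the trigger, provisioning, and model copy instantiation times in minutes.

```python
from statistics import mean

# (trigger, provision instance(s), instantiate model copy), in minutes per trial
TRIALS_8B = [(1, 1.96, 3.1), (1, 1.75, 2.6), (1, 1.4, 2.1), (1, 1.96, 1.9), (1, 1.67, 1.7)]
TRIALS_70B = [(1, 3.1, 1.98), (1, 2.92, 1.98), (1, 2.82, 1.98), (1, 3.27, 2), (1, 2.98, 1.98)]


def summarize(trials):
    """Return the per-phase averages followed by the average total scale-out time."""
    averages = [round(mean(column), 3) for column in zip(*trials)]
    return averages + [round(sum(averages), 3)]


summary_8b = summarize(TRIALS_8B)
summary_70b = summarize(TRIALS_70B)
```

Because each trial's total is the sum of its three phases, the average total equals the sum of the phase averages, which is what `summarize` computes.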
If you want more customization and faster scaling, consider using step scaling to scale model copies instead of target tracking.
Customer testimonials
The new Scale to Zero feature for SageMaker inference endpoints has sparked considerable interest across customers. We gathered initial reactions from companies who have previewed and evaluated this capability, highlighting its potential impact on AI and machine learning operations.
Atlassian, headquartered in Sydney, Australia, is a software company specializing in collaboration tools for software development and project management:
“The new Scale to Zero feature for SageMaker inference strongly aligns with our commitment to efficiency and innovation. We’re enthusiastic about its potential to revolutionize how we manage our machine learning inference resources, and we look forward to integrating it into our operations.”
– Guarav Awadhwal – Senior Engineering Manager at Atlassian
iFood is a Latin American online food delivery firm based in Brazil. It works with over 300,000 restaurants, connecting them with millions of customers every month.
“The Scale to Zero feature for SageMaker Endpoints will be fundamental for iFood’s Machine Learning Operations. Over the years, we’ve collaborated closely with the SageMaker team to enhance our inference capabilities. This feature represents a significant advancement, as it allows us to improve cost efficiency without compromising the performance and quality of our ML services, given that inference constitutes a substantial part of our infrastructure expenses.”
– Daniel Vieira, MLOps Engineer Manager at iFood
VIDA, headquartered in Jakarta, Indonesia, is a leading digital identity provider that enables individuals and businesses to conduct business in a safe and secure digital environment.
“SageMaker’s new Scale to Zero feature for GPU inference endpoints shows immense promise for deep fake detection operations. The potential to efficiently manage our face liveness and document verification inference models while optimizing infrastructure costs aligns perfectly with our goals. We’re excited to leverage this capability to enhance our identity verification solutions.”
– Keshav Sharma, ML Platform Architect at VIDA
APOIDEA Group is a leading AI-focused FinTech ISV company headquartered in Hong Kong. Leveraging cutting-edge generative AI and deep learning technologies, the company develops innovative AI FinTech solutions for multinational banks. APOIDEA’s products automate repetitive human analysis tasks, extracting valuable financial insights from extensive financial documents to accelerate AI-driven transformation across the industry.
“SageMaker’s Scale to Zero feature is a game changer for our AI financial analysis solution in operations. It delivers significant cost savings by scaling down endpoints during quiet periods, while maintaining the flexibility we need for batch inference and model testing. This capability is transforming how we manage our GenAI workloads and evaluate new models. We’re eager to harness its power to further optimize our deep learning and NLP model deployments.”
– Mickey Yip, VP of Product at APOIDEA Group
Fortiro, based in Melbourne, Australia, is a FinTech company specializing in automated document fraud detection and financial verification for trusted financial institutions.
“The new Scale-to-Zero capability in SageMaker is a game-changer for our MLOps and delivers great cost savings. Being able to easily scale inference endpoints and GPUs means we can take advantage of a fast, highly responsive environment, without incurring unnecessary costs. Our R&D teams constantly experiment with new AI-based document fraud detection methods, which involves a lot of testing and repeating. This capability empowers us to do this both faster and more efficiently.”
– Amir Vahid, Chief Technology Officer at Fortiro
These testimonials underscore the anticipation for SageMaker’s Scale to Zero feature. As organizations begin to implement this capability, we expect to see innovative applications that balance cost efficiency with performance in machine learning deployments.
Conclusion
In this post, we introduced the new scale to zero feature in SageMaker, an innovative capability that enables you to optimize costs by automatically scaling in your inference endpoints when they’re not in use. We guided you through the detailed process of implementing this feature, including configuring endpoints, setting up auto scaling policies, and managing inference components for both automatic and scheduled scaling scenarios.
This cost-saving functionality presents new possibilities for how you can approach your ML operations. With this feature, you can closely align your compute resource usage with actual needs, potentially reducing costs during periods of low demand. We encourage you to try this capability and start optimizing your SageMaker inference costs today.
To help you get started quickly, we’ve prepared notebooks containing an end-to-end example of how to configure an endpoint to scale to zero.
About the authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using AWS’s comprehensive suite of tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Raj Vippagunta is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team at AWS. He uses his vast experience of 18+ years in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI and ML space. He has helped build various at-scale solutions for AWS and Amazon. In his spare time, he likes reading books, long-distance running, and exploring new places with his family.