Governing ML lifecycle at scale: Best practices to set up cost and usage visibility of ML workloads in multi-account environments

[ad_1]

Cloud costs can significantly impact your business operations. Gaining real-time visibility into infrastructure expenses, usage patterns, and cost drivers is essential. This insight enables agile decision-making, optimized scalability, and maximizes the value derived from cloud investments, providing cost-effective and efficient cloud utilization for your organization’s future growth. What makes cost visibility even more important for the cloud is that cloud usage is dynamic. This requires continuous cost reporting and monitoring to make sure costs don’t exceed expectations and you only pay for the usage you need. Additionally, you can measure the value the cloud delivers to your organization by quantifying the associated cloud costs.

For a multi-account environment, you can track costs at an AWS account level to associate expenses. However, to allocate costs to cloud resources, a tagging strategy is essential. A combination of an AWS account and tags provides the best results. Implementing a cost allocation strategy early is critical for managing your expenses and future optimization activities that will reduce your spend.

This post outlines steps you can take to implement a comprehensive tagging governance strategy across accounts, using AWS tools and services that provide visibility and control. By setting up automated policy enforcement and checks, you can achieve cost optimization across your machine learning (ML) environment.

Implement a tagging strategy

A tag is a label you assign to an AWS resource. Tags consist of a customer-defined key and an optional value to help manage, search for, and filter resources. Tag keys and values are case sensitive. A tag value (for example, Production) is also case sensitive, like the keys.

It’s important to define a tagging strategy for your resources as soon as possible when establishing your cloud foundation. Tagging is an effective scaling mechanism for implementing cloud management and governance strategies. When defining your tagging strategy, you need to determine the right tags that will gather all the necessary information in your environment. You can remove tags when they’re no longer needed and apply new tags whenever required.

Categories for designing tags

Some of the common categories used for designing tags are as follows:

Cost allocation tags – These help track costs by different attributes like department, environment, or application. This allows reporting and filtering costs in billing consoles based on tags.
Automation tags – These are used during resource creation or management workflows. For example, tagging resources with their environment allows automating tasks like stopping non-production instances after hours.
Access control tags – These enable restricting access and permissions based on tags. AWS Identity and Access Management (IAM) roles and policies can reference tags to control which users or services can access specific tagged resources.
Technical tags – These provide metadata about resources. For example, tags like environment or owner help identify technical attributes. The AWS reserved prefix aws: tags provide additional metadata tracked by AWS.
Compliance tags – These may be needed to adhere to regulatory requirements, such as tagging with classification levels or whether data is encrypted or not.
Business tags – These represent business-related attributes, not technical metadata, such as cost centers, business lines, and products. This helps track spending for cost allocation purposes.

A tagging strategy also defines a standardized convention and implementation of tags across all resource types.

When defining tags, use the following conventions:

Use all lowercase for consistency and to avoid confusion
Separate words with hyphens
Use a prefix to identify and separate AWS generated tags from third-party tool generated tags

Tagging dictionary

When defining a tagging dictionary, delineate between mandatory and discretionary tags. Mandatory tags help identify resources and their metadata, regardless of purpose. Discretionary tags are the tags that your tagging strategy defines, and they should be made available to assign to resources as needed. The following table provides examples of a tagging dictionary used for tagging ML resources.

Tag Type	Tag Key	Purpose	Cost Allocation	Mandatory
Workload	`anycompany:workload:application-id`	Identifies disparate resources that are related to a specific application	Y	Y
Workload	`anycompany:workload:environment`	Distinguishes between `dev`, `test`, and `production`	Y	Y
Financial	`anycompany:finance:owner`	Indicates who is responsible for the resource, for example `SecurityLead`, `SecOps`, `Workload-1-Development-team`	Y	Y
Financial	`anycompany:finance:business-unit`	Identifies the business unit the resource belongs to, for example `Finance`, `Retail`, `Sales`, `DevOps`, `Shared`	Y	Y
Financial	`anycompany:finance:cost-center`	Indicates cost allocation and tracking, for example `5045`, `Sales-5045`, `HR-2045`	Y	Y
Security	`anycompany:security:data-classification`	Indicates data confidentiality that the resource supports	N	Y
Automation	`anycompany:automation:encryption`	Indicates if the resource needs to store encrypted data	N	N
Workload	`anycompany:workload:name`	Identifies an individual resource	N	N
Workload	`anycompany:workload:cluster`	Identifies resources that share a common configuration or perform a specific function for the application	N	N
Workload	`anycompany:workload:version`	Distinguishes between different versions of a resource or application component	N	N
Operations	`anycompany:operations:backup`	Identifies if the resource needs to be backed up based on the type of workload and the data that it manages	N	N
Regulatory	`anycompany:regulatory:framework`	Requirements for compliance to specific standards and frameworks, for example NIST, HIPAA, or GDPR	N	N

You need to define what resources require tagging and implement mechanisms to enforce mandatory tags on all necessary resources. For multiple accounts, assign mandatory tags to each one, identifying its purpose and the owner responsible. Avoid personally identifiable information (PII) when labeling resources because tags remain unencrypted and visible.

Tagging ML workloads on AWS

When running ML workloads on AWS, primary costs are incurred from compute resources required, such as Amazon Elastic Compute Cloud (Amazon EC2) instances for hosting notebooks, running training jobs, or deploying hosted models. You also incur storage costs for datasets, notebooks, models, and so on stored in Amazon Simple Storage Service (Amazon S3).

A reference architecture for the ML platform with various AWS services is shown in the following diagram. This framework considers multiple personas and services to govern the ML lifecycle at scale. For more information about the reference architecture in detail, see Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker.

Machine Learning Platform Reference Architecture

The reference architecture includes a landing zone and multi-account landing zone accounts. These should be tagged to track costs for governance and shared services.

The key contributors towards recurring ML cost that should be tagged and tracked are as follows:

Amazon DataZone – Amazon DataZone allows you to catalog, discover, govern, share, and analyze data across various AWS services. Tags can be added at an Amazon DataZone domain and used for organizing data assets, users, and projects. Usage of data is tracked through the data consumers, such as Amazon Athena, Amazon Redshift, or Amazon SageMaker.
AWS Lake Formation – AWS Lake Formation helps manage data lakes and integrate them with other AWS analytics services. You can define metadata tags and assign them to resources like databases and tables. This identifies teams or cost centers responsible for those resources. Automating resource tags when creating databases or tables with the AWS Command Line Interface (AWS CLI) or SDKs provides consistent tagging. This enables accurate tracking of costs incurred by different teams.
Amazon SageMaker – Amazon SageMaker uses a domain to provide access to an environment and resources. When a domain is created, tags are automatically generated with a DomainId key by SageMaker, and administrators can add a custom ProjectId Together, these tags can be used for project-level resource isolation. Tags on a SageMaker domain are automatically propagated to any SageMaker resources created in the domain.
Amazon SageMaker Feature Store – Amazon SageMaker Feature Store allows you to tag your feature groups and search for feature groups using tags. You can add tags when creating a new feature group or edit the tags of an existing feature group.
Amazon SageMaker resources – When you tag SageMaker resources such as jobs or endpoints, you can track spending based on attributes like project, team, or environment. For example, you can specify tags when creating the SageMaker Estimator that launches a training job.

Using tags allows you to incur costs that align with business needs. Monitoring expenses this way gives insight into how budgets are consumed.

Enforce a tagging strategy

An effective tagging strategy uses mandatory tags and applies them consistently and programmatically across AWS resources. You can use both reactive and proactive approaches for governing tags in your AWS environment.

Proactive governance uses tools such as AWS CloudFormation, AWS Service Catalog, tag policies in AWS Organizations, or IAM resource-level permissions to make sure you apply mandatory tags consistently at resource creation. For example, you can use the CloudFormation Resource Tags property to apply tags to resource types. In Service Catalog, you can add tags that automatically apply when you launch the service.

Reactive governance is for finding resources that lack proper tags using tools such as the AWS Resource Groups tagging API, AWS Config rules, and custom scripts. To find resources manually, you can use Tag Editor and detailed billing reports.

Proactive governance

Proactive governance uses the following tools:

Service catalog – You can apply tags to all resources created when a product launches from the service catalog. The service catalog provides a TagOptions Use this to define the tag key-pairs to associate with the product.
CloudFormation Resource Tags – You can apply tags to resources using the AWS CloudFormation Resource Tags property. Tag only those resources that support tagging through AWS CloudFormation.
Tag policies – Tag policies standardize tags across your organization’s account resources. Define tagging rules in a tag policy that apply when resources get tagged. For example, specify that a CostCenter tag attached to a resource must match the case and values the policy defines. Also specify that noncompliant tagging operations on some resources get enforced, preventing noncompliant requests from completing. The policy doesn’t evaluate untagged resources or undefined tags for compliance. Tag policies involve working with multiple AWS services:
- To enable the tag policies feature, use AWS Organizations. You can create tag policies and then attach those policies to organization entities to put the tagging rules into effect.
- Use AWS Resource Groups to find noncompliant tags on account resources. Correct the noncompliant tags in the AWS service where you created the resource.
Service Control Policies – You can restrict the creation of an AWS resource without proper tags. Use Service Control Policies (SCPs) to set guardrails around requests to create resources. SCPs allow you to enforce tagging policies on resource creation. To create an SCP, navigate to the AWS Organizations console, choose Policies in the navigation pane, then choose Service Control Policies.

Reactive governance

Reactive governance uses the following tools:

AWS Config rules – Check resources regularly for improper tagging. The AWS Config rule required-tags examines resources to make sure they contain specified tags. You should take action when resources lack necessary tags.
AWS Resource Groups tagging API – The AWS Resource Groups Tagging API lets you tag or untag resources. It also enables searching for resources in a specified AWS Region or account using tag-based filters. Additionally, you can search for existing tags in a Region or account, or find existing values for a key within a specific Region or account. To create a resource tag group, refer to Creating query-based groups in AWS Resource Groups.
Tag Editor – With Tag Editor, you build a query to find resources in one or more Regions that are available for tagging. To find resources to tag, see Finding resources to tag.

SageMaker tag propagation

Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. SageMaker Studio automatically copies and assign tags to the SageMaker Studio notebooks created by the users, so you can track and categorize the cost of SageMaker Studio notebooks.

Amazon SageMaker Pipelines allows you to create end-to-end workflows for managing and deploying SageMaker jobs. Each pipeline is composed of a sequence of steps that transform data into a trained model. Tags can be applied to pipelines similarly to how they are used for other SageMaker resources. When a pipeline is run, its tags can potentially propagate to the underlying jobs launched as part of the pipeline steps.

When models are registered in Amazon SageMaker Model Registry, tags can be propagated from model packages to other related resources like endpoints. Model packages in the registry can be tagged when registering a model version. These tags become associated with the model package. Tags on model packages can potentially propagate to other resources that reference the model, such as endpoints created using the model.

Tag policy quotas

The number of policies that you can attach to an entity (root, OU, and account) is subject to quotas for AWS Organizations. See Quotas and service limits for AWS Organizations for the number of tags that you can attach.

Monitor resources

To achieve financial success and accelerate business value realization in the cloud, you need complete, near real-time visibility of cost and usage information to make informed decisions.

Cost organization

You can apply meaningful metadata to your AWS usage with AWS cost allocation tags. Use AWS Cost Categories to create rules that logically group cost and usage information by account, tags, service, charge type, or other categories. Access the metadata and groupings in services like AWS Cost Explorer, AWS Cost and Usage Reports, and AWS Budgets to trace costs and usage back to specific teams, projects, and business initiatives.

Cost visualization

You can view and analyze your AWS costs and usage over the past 13 months using Cost Explorer. You can also forecast your likely spending for the next 12 months and receive recommendations for Reserved Instance purchases that may reduce your costs. Using Cost Explorer enables you to identify areas needing further inquiry and to view trends to understand your costs. For more detailed cost and usage data, use AWS Data Exports to create exports of your billing and cost management data by selecting SQL columns and rows to filter the data you want to receive. Data exports get delivered on a recurring basis to your S3 bucket for you to use with your business intelligence (BI) or data analytics solutions.

You can use AWS Budgets to set custom budgets that track cost and usage for simple or complex use cases. AWS Budgets also lets you enable email or Amazon Simple Notification Service (Amazon SNS) notifications when actual or forecasted cost and usage exceed your set budget threshold. In addition, AWS Budgets integrates with Cost Explorer.

Cost allocation

Cost Explorer enables you to view and analyze your costs and usage data over time, up to 13 months, through the AWS Management Console. It provides premade views displaying quick information about your cost trends to help you customize views suiting your needs. You can apply various available filters to view specific costs. Also, you can save any view as a report.

Monitoring in a multi-account setup

SageMaker supports cross-account lineage tracking. This allows you to associate and query lineage entities, like models and training jobs, owned by different accounts. It helps you track related resources and costs across accounts. Use the AWS Cost and Usage Report to track costs for SageMaker and other services across accounts. The report aggregates usage and costs based on tags, resources, and more so you can analyze spending per team, project, or other criteria spanning multiple accounts.

Cost Explorer allows you to visualize and analyze SageMaker costs from different accounts. You can filter costs by tags, resources, or other dimensions. You can also export the data to third-party BI tools for customized reporting.

Conclusion

In this post, we discussed how to implement a comprehensive tagging strategy to track costs for ML workloads across multiple accounts. We discussed implementing tagging best practices by logically grouping resources and tracking costs by dimensions like environment, application, team, and more. We also looked at enforcing the tagging strategy using proactive and reactive approaches. Additionally, we explored the capabilities within SageMaker to apply tags. Lastly, we examined approaches to provide visibility of cost and usage for your ML workloads.

For more information about how to govern your ML lifecycle, see Part 1 and Part 2 of this series.

About the authors

Gunjan Jain, an AWS Solutions Architect based in Southern California, specializes in guiding large financial services companies through their cloud transformation journeys. He expertly facilitates cloud adoption, optimization, and implementation of Well-Architected best practices. Gunjan’s professional focus extends to machine learning and cloud resilience, areas where he demonstrates particular enthusiasm. Outside of his professional commitments, he finds balance by spending time in nature.

Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, reliable and scalable GenAI/ML systems to help enterprise customers improve their business outcomes. In his spare time, he rides motorcycle and enjoys walking with his dogs!

[ad_2]

Governing ML lifecycle at scale: Best practices to set up cost and usage visibility of ML workloads in multi-account environments