Friday, October 24, 2025

Enhancing Cost Transparency for Machine Learning Workloads on Amazon EKS with AWS Split Cost Allocation Data


We are excited to introduce split cost allocation support for accelerated workloads in Amazon Elastic Kubernetes Service (Amazon EKS). This enhancement to Split Cost Allocation Data for EKS enables customers to track container-level resource costs for accelerator-powered workloads. Split Cost Allocation Data now supports AWS Trainium, AWS Inferentia, NVIDIA GPUs, and AMD GPUs, complementing existing CPU and memory cost tracking capabilities. This cost data is available in the AWS Cost and Usage Report (legacy CUR and CUR 2.0), providing organizations with a consolidated view of their cloud expenditures. This feature is available in all AWS commercial Regions (excluding the China Regions) at no additional cost to customers.

The Challenges of Monitoring and Allocating Container Costs for Accelerated Workloads

Organizations are increasingly running accelerator-powered workloads on Amazon EKS to power Artificial Intelligence (AI) applications, including Machine Learning (ML) and Generative AI applications. These specialized workloads typically run in multi-tenant clusters, using shared Amazon Elastic Compute Cloud (Amazon EC2) instances to host multiple application containers. The high demand and value associated with accelerator resources make it essential to optimize their usage and ensure maximum return on investment.

These clusters often support application workloads spanning teams, departments, and environments. Consequently, customers require granular cost visibility and accountability to accurately allocate expenses, set budgets, and promote efficient resource utilization. Relying solely on CPU and memory metrics for accelerated workloads provides an incomplete view of infrastructure usage, which can lead to cost misallocation. Customers therefore increasingly seek detailed pod-level usage data for accelerator resources alongside traditional metrics. This need often pushes them toward homegrown solutions or costly third-party products, adding complexity to resource management.

Get Granular Cost Visibility for Accelerated Workloads Running on EKS with Split Cost Allocation Data

The newly added accelerator support in Split Cost Allocation Data for EKS provides customers with a native AWS solution for visibility into the cost and usage of Kubernetes pods, based on the actual utilization of accelerators (Trainium, Inferentia, NVIDIA GPUs, and AMD GPUs), CPUs, and memory. This capability is particularly powerful because it lets organizations use cost allocation tags, including aws:eks:cluster-name, aws:eks:namespace, aws:eks:node, aws:eks:workload-type, aws:eks:workload-name, and aws:eks:deployment. These tags, automatically enabled for accelerator-powered pods, facilitate a consolidated view of applications’ costs and resource usage in shared, multi-tenant environments.

Granular cost data allows customers to allocate Inferentia, Trainium, and GPU expenses accurately across respective cost centers. This not only fosters accountability in resource usage but also informs critical product prioritization decisions. Additionally, the Split Cost Allocation Data feature aids in identifying unused compute resources, enabling customers to optimize their cluster configurations and container reservations to minimize inefficiencies. This alleviates the need for developing custom cost management tools, which can be both resource-intensive and financially burdensome to maintain.

Customers running machine learning workloads can opt in to Split Cost Allocation Data for Amazon EKS through the AWS Billing and Cost Management console. Once opted in, the system automatically scans for clusters across all accounts in the organization, ingests accelerator, CPU, and memory reservation data for container workloads, and prepares detailed cost data for the current month. The feature automatically calculates split allocation cost metrics, including GPU usage per Kubernetes pod, accounting for the amortized costs of Amazon EC2 instances and applicable discounts. Customers can use the aforementioned cost allocation tags to categorize costs, gaining insights at hourly, daily, or monthly granularity and enabling internal chargebacks.

For specific instructions on enabling split cost allocation data for EKS, please refer to Understanding split cost allocation data.

How EKS Split Cost Allocation Works

To use this feature, customers must first activate Split Cost Allocation Data. For existing users of this capability, accelerator support is enabled automatically. The process ingests accelerator, CPU, and memory reservations along with actual utilization, and uses the greater of reservation and usage to compute the resources allocated to each pod.

To illustrate this, consider an example with a single EC2 instance running four pods across two namespaces. Suppose the instance type is a p3.16xlarge featuring 8 GPUs, 64 vCPUs, and 488 GB of RAM, with an On-Demand cost of $10 per hour. If the instance is covered by a commitment (Savings Plan or Reserved Instance), the net amortized cost is used for the calculations. Split Cost Allocation Data calculates a normalized cost per resource based on a relative ratio of GPU to CPU and memory of 9:1, meaning each GPU is weighted nine times more heavily than one unit of CPU or memory.
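
To make the walkthrough below concrete, here is a minimal sketch in Python of the example instance and pods. Only the p3.16xlarge specifications and the $10-per-hour example price come from this post; the pod names, namespaces, and per-pod request and usage numbers are hypothetical values chosen to match the outcomes described in the steps that follow.

```python
# Illustrative setup for the walkthrough. Only the p3.16xlarge specs and
# the $10/hr example price come from the post; the pods, namespaces, and
# (requested, actual) figures are hypothetical.

INSTANCE = {
    "type": "p3.16xlarge",
    "gpus": 8,
    "vcpus": 64,
    "memory_gb": 488,
    "cost_per_hour": 10.00,  # net amortized cost if covered by a Savings Plan / RI
}

# Four pods across two namespaces; each resource maps to a
# (requested, actual usage) pair for one hour.
PODS = {
    "pod-1": {"namespace": "team-a", "gpu": (2, 2), "vcpu": (16, 12), "mem_gb": (100, 90)},
    "pod-2": {"namespace": "team-a", "gpu": (2, 3), "vcpu": (8, 20),  "mem_gb": (100, 140)},
    "pod-3": {"namespace": "team-b", "gpu": (2, 1), "vcpu": (20, 14), "mem_gb": (120, 100)},
    "pod-4": {"namespace": "team-b", "gpu": (1, 1), "vcpu": (8, 6),   "mem_gb": (80, 60)},
}
```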

Step #1 – Compute the Unit Cost

Using the specified accelerator (Trainium, Inferentia, or GPU), CPU, and memory resources on the EC2 instance, Split Cost Allocation Data calculates the unit cost for GPU-hr, vCPU-hr, and GB-hr at $0.50, $0.05, and $0.005, respectively.
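
As a rough illustration of the normalization, the sketch below spreads the hourly instance cost across GPU, vCPU, and memory capacity using relative weights. The specific weights (a GPU counted as roughly ten vCPUs, and a vCPU as roughly ten GB of memory) are an assumption, not AWS's published formula; they simply produce unit costs close to the $0.50, $0.05, and $0.005 figures above.

```python
# Step 1 sketch (continues the setup above). The weights are illustrative
# assumptions, not AWS's exact formula.

def unit_costs(instance, gpu_weight=100.0, vcpu_weight=10.0, gb_weight=1.0):
    """Spread the hourly instance cost across GPU, vCPU, and memory capacity."""
    total_weight = (
        instance["gpus"] * gpu_weight
        + instance["vcpus"] * vcpu_weight
        + instance["memory_gb"] * gb_weight
    )
    per_weight_cost = instance["cost_per_hour"] / total_weight
    return {
        "gpu_hr": per_weight_cost * gpu_weight,
        "vcpu_hr": per_weight_cost * vcpu_weight,
        "gb_hr": per_weight_cost * gb_weight,
    }

print(unit_costs(INSTANCE))
# ≈ {'gpu_hr': 0.519, 'vcpu_hr': 0.052, 'gb_hr': 0.0052}
```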

Step #2 – Calculate Allocated and Unused Capacity

By assessing the GPU, vCPU, and memory requests alongside actual usage across the four Kubernetes pods, the system computes the resources allocated to each pod. For instance, if Pod 2 uses more GPU, CPU, and memory than it requested (because no limit was defined), Split Cost Allocation Data bases the allocation on the higher of actual and requested usage. In this example, the allocated values leave zero unused GPU and vCPU capacity but reveal 48 GB of unallocated memory.
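
Continuing the sketch, Step 2 can be expressed as taking the greater of each pod's request and its observed usage, then deriving the unused capacity at the instance level; with the hypothetical numbers above, this reproduces the 48 GB of unallocated memory.

```python
# Step 2 sketch (continues the setup above): allocation = max(requested, actual).

RESOURCES = ("gpu", "vcpu", "mem_gb")

def allocated(pods):
    """Per-pod allocation is the greater of the requested and actual usage."""
    return {name: {res: max(spec[res]) for res in RESOURCES} for name, spec in pods.items()}

ALLOC = allocated(PODS)

CAPACITY = {"gpu": INSTANCE["gpus"], "vcpu": INSTANCE["vcpus"], "mem_gb": INSTANCE["memory_gb"]}
UNUSED = {res: CAPACITY[res] - sum(p[res] for p in ALLOC.values()) for res in RESOURCES}
print(UNUSED)  # {'gpu': 0, 'vcpu': 0, 'mem_gb': 48} with the hypothetical pod numbers
```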

Step #3 – Compute Utilization Ratios and Split Usage Ratios

The split-usage ratio is calculated as each pod's allocated GPU, vCPU, or memory as a percentage of the total resources available on the EC2 instance. Similarly, an unused ratio is determined from each pod's allocated resources relative to the resources allocated across all pods; this ratio is later used to distribute the cost of unused capacity. For example, the 48 GB of unallocated memory is just under 10 percent of the instance's 488 GB.
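
In code, the two ratios from this step might look like the following, continuing the sketch: the split-usage ratio compares a pod's allocation to the instance's capacity, while the unused ratio compares it to the total allocated across all pods and feeds the redistribution in the next step.

```python
# Step 3 sketch (continues the setup above): per-pod ratios per resource.

TOTAL_ALLOC = {res: sum(p[res] for p in ALLOC.values()) for res in RESOURCES}

SPLIT_USAGE_RATIOS = {
    name: {res: alloc[res] / CAPACITY[res] for res in RESOURCES}
    for name, alloc in ALLOC.items()
}
UNUSED_RATIOS = {
    name: {res: alloc[res] / TOTAL_ALLOC[res] for res in RESOURCES}
    for name, alloc in ALLOC.items()
}
# The 48 GB of unallocated memory is 48 / 488, just under 10% of the instance's memory.
```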

Step #4 – Compute the Split and Unused Costs

Once the pod-level split costs are computed by multiplying the split-usage ratios by the per-resource unit costs, any unused resource costs (like the aforementioned 48 GB of memory) are redistributed to the pods based on the computed unused ratios. This yields both specific pod-level costs and aggregate costs at the namespace level, providing a comprehensive view that can be further categorized via cost allocation tags.
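
Putting the pieces together, a final sketch prices each pod's allocation with the Step 1 unit costs, redistributes the cost of the idle 48 GB of memory according to the unused ratios, and rolls the result up by namespace; the four pods sum back to the full hourly instance cost.

```python
# Step 4 sketch (continues the setup above): split costs, unused costs,
# and a namespace-level roll-up.

units = unit_costs(INSTANCE)
UNIT = {"gpu": units["gpu_hr"], "vcpu": units["vcpu_hr"], "mem_gb": units["gb_hr"]}

pod_costs = {}
for name, alloc in ALLOC.items():
    split_cost = sum(alloc[res] * UNIT[res] for res in RESOURCES)
    unused_cost = sum(UNUSED[res] * UNIT[res] * UNUSED_RATIOS[name][res] for res in RESOURCES)
    pod_costs[name] = {"split_cost": split_cost, "unused_cost": unused_cost,
                       "total": split_cost + unused_cost}

namespace_costs = {}
for name, cost in pod_costs.items():
    ns = PODS[name]["namespace"]
    namespace_costs[ns] = namespace_costs.get(ns, 0.0) + cost["total"]

print(pod_costs)
print(namespace_costs)  # the per-namespace totals add up to the $10/hr instance cost
```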

What Are the New Cost and Usage Report Columns?

For existing users of split cost allocation data, no new columns are introduced; the accelerator support uses the current column structure. New users will see Kubernetes pod-level metrics in their CUR reports, such as “SplitLineItem/SplitUsage,” which shows GPU, vCPU, or memory allocation at the pod level over the specified timeframe. For more detailed information, refer to the CUR data dictionary.

The demo CUR report shows how this data appears in these columns. You can use the Containers Cost Allocation dashboard to visualize EKS costs in Amazon QuickSight, and the CUR query library to query EKS costs with Amazon Athena.
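
If you prefer to explore the data outside of QuickSight or Athena, a small pandas sketch like the one below can aggregate the split cost columns from a locally downloaded CUR file. The file name, the EKSPod-EC2 operation filter, and the snake_case column names are assumptions based on the CUR data dictionary; adjust them to match your own export, since legacy CUR and CUR 2.0 name these fields differently (for example, splitLineItem/SplitCost and resourceTags/aws:eks:namespace in legacy CUR).

```python
# A minimal sketch of aggregating split cost data by EKS namespace with pandas.
# Column names and the operation value are assumptions; check them against
# your own CUR export and the CUR data dictionary before running.

import pandas as pd

cur = pd.read_parquet("cur-export.snappy.parquet")  # hypothetical local copy of a CUR file

# Keep only the pod-level split cost allocation line items.
eks_pods = cur[cur["line_item_operation"] == "EKSPod-EC2"]

by_namespace = (
    eks_pods
    .groupby("resource_tags_aws_eks_namespace")[
        ["split_line_item_split_cost", "split_line_item_unused_cost"]
    ]
    .sum()
)
print(by_namespace)
```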
