Right-Sizing Guide

Most cloud workloads are over-provisioned. Matching instance types to actual resource consumption is the single highest-impact optimization, typically saving 20-40% with no performance degradation.

Utilization Thresholds

Recommended action based on average resource utilization over a 14-day observation period.

| CPU Utilization | Memory Utilization | Assessment | Action |
|---|---|---|---|
| Below 10% | Below 20% | Severely over-provisioned | Downsize 2+ instance sizes or consolidate |
| 10-30% | 20-40% | Over-provisioned | Downsize 1 instance size |
| 30-70% | 40-80% | Well-sized | No change needed |
| Above 70% | Above 80% | Under-provisioned | Upsize or enable auto-scaling |
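The thresholds above can be captured in a small helper. This is an illustrative sketch, not a library API; mixed signals (e.g. low CPU but high memory) fall back conservatively to "no change", and under-provisioning is checked first for safety:

```python
def recommend(cpu_pct: float, mem_pct: float) -> str:
    """Map 14-day average CPU/memory utilization to a right-sizing action.

    Checks under-provisioning first (capacity risk outweighs cost),
    then the over-provisioned bands; mixed signals return "No change".
    """
    if cpu_pct > 70 or mem_pct > 80:
        return "Upsize or enable auto-scaling"
    if cpu_pct < 10 and mem_pct < 20:
        return "Downsize 2+ instance sizes or consolidate"
    if cpu_pct < 30 and mem_pct < 40:
        return "Downsize 1 instance size"
    return "No change needed"
```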

Savings Estimate by Over-Provisioning Level

Estimated monthly savings for a fleet of 20 general-purpose instances (m7i.xlarge baseline at $0.2016/hr, 720-hour month).

| Over-Provisioning | Current Cost/mo | Right-Sized To | New Cost/mo | Monthly Savings | % Saved |
|---|---|---|---|---|---|
| 2x (50% idle CPU) | $2,903 | m7i.large | $1,452 | $1,451 | 50% |
| 4x (75% idle CPU) | $5,806 | m7i.large | $1,452 | $4,354 | 75% |
| Wrong family (GPU unused) | $14,832 | m7i.xlarge | $2,903 | $11,929 | 80% |
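The arithmetic behind the first row is straightforward; a short sketch (720-hour month, on-demand rates, no discounts assumed):

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the table above

def fleet_monthly_cost(hourly_rate: float, count: int) -> float:
    """Monthly on-demand cost for a uniform fleet."""
    return hourly_rate * count * HOURS_PER_MONTH

current = fleet_monthly_cost(0.2016, 20)      # 20 x m7i.xlarge -> ~$2,903
rightsized = fleet_monthly_cost(0.1008, 20)   # 20 x m7i.large  -> ~$1,452
savings = current - rightsized                # ~$1,451
pct_saved = savings / current * 100           # 50%
```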

Container Resource Limits

Kubernetes Requests vs Limits

Set resource requests to the P95 of actual usage (from metrics). Set limits to 2x requests as a safety margin. Requests set far above actual usage waste cluster capacity, because the scheduler reserves node resources based on requests, not real consumption; limits set far above requests allow bursting but risk contention when many pods burst at once.
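A minimal sketch of deriving request and limit values from usage samples, assuming you already export per-container CPU samples in millicores (the nearest-rank percentile and 2x factor follow the guidance above):

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def container_sizing(cpu_millicores: list[float], limit_factor: float = 2.0) -> dict:
    """Request = P95 of observed usage; limit = limit_factor x request."""
    request = p95(cpu_millicores)
    return {"request_m": request, "limit_m": request * limit_factor}
```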

Vertical Pod Autoscaler (VPA)

VPA automatically adjusts container resource requests based on observed usage. Run in recommendation mode first to review suggestions before enabling auto-updates. Avoid using VPA and HPA simultaneously on CPU metrics.
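Recommendation mode corresponds to `updateMode: "Off"` in the VPA spec. A minimal manifest sketch (the Deployment name `web` is a placeholder for your workload):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # your workload here
  updatePolicy:
    updateMode: "Off"      # compute recommendations, do not apply them
```

Review the recommendations with `kubectl describe vpa web-vpa` before switching `updateMode` to `"Auto"`.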

Auto-Scaling Configuration

Target Tracking

Set CPU target to 50-70% for web workloads. This provides headroom for traffic spikes while keeping average utilization in the well-sized range. Use custom metrics (request latency, queue depth) when CPU is not the bottleneck.
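Target tracking scales capacity roughly in proportion to the ratio of current metric to target. A simplified model of that behavior (real implementations add cooldowns and per-datapoint evaluation; this sketch assumes load spreads evenly across instances):

```python
import math

def desired_capacity(current_capacity: int, metric: float, target: float) -> int:
    """Capacity needed to bring the average metric back to target.

    Rounds up so the fleet never undershoots the target headroom.
    """
    return max(1, math.ceil(current_capacity * metric / target))
```

For example, 10 instances averaging 90% CPU against a 60% target scale out to 15 instances.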

Scale-Down Cooldowns

Set scale-down cooldowns to 5-10 minutes to avoid flapping. AWS ASG default is 300s. GCP MIG default is 10 minutes. Aggressive scale-down saves money but risks capacity shortages during bursty traffic.
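The effect of a cooldown is simply to suppress scale-down events that fire too soon after the last accepted one. An illustrative sketch:

```python
def apply_cooldown(scale_down_times: list[float], cooldown_s: float = 300) -> list[float]:
    """Filter ascending event timestamps (seconds), dropping any event
    within cooldown_s of the last accepted scale-down."""
    accepted: list[float] = []
    last = None
    for t in scale_down_times:
        if last is None or t - last >= cooldown_s:
            accepted.append(t)
            last = t
    return accepted
```

With the 300s AWS default, events at t=0, 100, 300, 650 collapse to three actual scale-downs; flapping within the window is absorbed.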

Scheduled Scaling

For predictable traffic patterns (business hours, batch jobs), schedule capacity changes instead of relying solely on reactive scaling. Pre-warm capacity 10-15 minutes before expected load increases.
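A schedule can be as simple as a function from time to desired capacity. This sketch uses hypothetical business hours (09:00-18:00 UTC) and hourly granularity, so the pre-warm lead is one hour rather than the 10-15 minutes suggested above:

```python
def scheduled_capacity(hour_utc: int,
                       peak_start: int = 9, peak_end: int = 18,
                       base: int = 4, peak: int = 12,
                       prewarm_h: int = 1) -> int:
    """Desired capacity by hour: peak during business hours,
    pre-warmed prewarm_h hours early, base capacity otherwise."""
    if peak_start - prewarm_h <= hour_utc < peak_end:
        return peak
    return base
```

Reactive auto-scaling stays enabled on top of the schedule to absorb anything the schedule did not predict.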

Cluster Autoscaler

Kubernetes Cluster Autoscaler adds/removes nodes based on pending pods. Configure over-provisioning with a low-priority pause pod to maintain a buffer of 1-2 spare nodes, reducing scale-up latency from minutes to seconds.
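The buffer pattern works by running placeholder pods at negative priority: real workloads preempt them instantly, and the evicted pause pods go pending, triggering node scale-up in the background. A manifest sketch (names and request sizes are illustrative; size each replica to roughly fill one node):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                 # below any real workload, so it is preempted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2              # ~1-2 spare nodes' worth of headroom
  selector:
    matchLabels: {app: overprovisioning}
  template:
    metadata:
      labels: {app: overprovisioning}
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"       # tune so one replica occupies most of a node
            memory: 2Gi
```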