Right-Sizing Guide
Most cloud workloads are over-provisioned. Matching instance types to actual resource consumption is the single highest-impact optimization, typically saving 20-40% with no performance degradation.
Utilization Thresholds
Recommended action based on average resource utilization over a 14-day observation period.
| CPU Utilization | Memory Utilization | Recommendation | Action |
|---|---|---|---|
| Below 10% | Below 20% | Severely over-provisioned | Downsize 2+ instance sizes or consolidate |
| 10-30% | 20-40% | Over-provisioned | Downsize 1 instance size |
| 30-70% | 40-80% | Well-sized | No change needed |
| Above 70% | Above 80% | Under-provisioned | Upsize or enable auto-scaling |
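The thresholds above can be sketched as a small decision helper. This is a hypothetical function (the table does not define mixed cases, e.g. low CPU but high memory; here they fall through to the next band):

```python
def rightsizing_recommendation(cpu_pct: float, mem_pct: float) -> str:
    """Map 14-day average CPU/memory utilization to an action,
    following the thresholds in the table above."""
    if cpu_pct < 10 and mem_pct < 20:
        return "Severely over-provisioned: downsize 2+ sizes or consolidate"
    if cpu_pct < 30 and mem_pct < 40:
        return "Over-provisioned: downsize 1 instance size"
    if cpu_pct <= 70 and mem_pct <= 80:
        return "Well-sized: no change needed"
    # Either metric above its upper threshold: add capacity.
    return "Under-provisioned: upsize or enable auto-scaling"
```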
Savings Estimate by Over-Provisioning Level
Estimated monthly savings for a fleet of 20 general-purpose instances, using m7i-family on-demand pricing (m7i.xlarge at $0.2016/hr) and 720 billing hours per month.
| Over-Provisioning | Current Cost/mo | Right-Sized To | New Cost/mo | Monthly Savings | % Saved |
|---|---|---|---|---|---|
| 2x (50% idle CPU) | $2,903 | m7i.large | $1,452 | $1,451 | 50% |
| 4x (75% idle CPU) | $5,806 | m7i.large | $1,452 | $4,354 | 75% |
| Wrong family (GPU unused) | $14,832 | m7i.xlarge | $2,903 | $11,929 | 80% |
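The arithmetic behind the table reduces to hourly rate × 720 hours × fleet size. A minimal sketch (720 hours/month and a 20-instance fleet are the table's assumptions):

```python
HOURS_PER_MONTH = 720  # billing hours assumed in the table above

def monthly_fleet_cost(hourly_rate: float, count: int = 20) -> float:
    """Monthly on-demand cost for a fleet of identical instances."""
    return hourly_rate * HOURS_PER_MONTH * count

def savings(current_rate: float, rightsized_rate: float, count: int = 20):
    """Return (dollars saved per month, percent saved) from right-sizing."""
    current = monthly_fleet_cost(current_rate, count)
    new = monthly_fleet_cost(rightsized_rate, count)
    return current - new, (current - new) / current * 100
```

For the 2x row: `savings(0.2016, 0.1008)` reproduces roughly $1,452/month saved, or 50%.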
Container Resource Limits
Kubernetes Requests vs Limits
Set resource requests to the P95 of actual usage (from metrics), and set limits to 2x requests as a safety margin. The scheduler reserves node capacity based on requests, so requests set far above actual usage waste cluster capacity, while limits far above requests allow nodes to become overcommitted when many pods burst at once.
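Deriving the request from a P95 can be sketched with the standard library; the 2x-limit rule is applied mechanically. The millicore unit and sample source are assumptions for illustration:

```python
from statistics import quantiles

def recommend_resources(cpu_samples_millicores: list[float]) -> dict:
    """Sketch: derive container CPU requests/limits from usage samples.
    Request = P95 of observed usage; limit = 2x request, per the
    guidance above."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(cpu_samples_millicores, n=100)[94]
    request = round(p95)
    return {"requests": {"cpu": f"{request}m"},
            "limits": {"cpu": f"{request * 2}m"}}
```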
Vertical Pod Autoscaler (VPA)
VPA automatically adjusts container resource requests based on observed usage. Run it in recommendation mode first and review the suggestions before enabling auto-updates. Avoid running VPA and HPA on the same metric (e.g., CPU) simultaneously, as the two controllers will work against each other.
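Recommendation mode corresponds to `updateMode: "Off"` in the VPA object, which records suggestions without evicting pods. A sketch of such a manifest as a Python dict (the target workload name is hypothetical):

```python
# VerticalPodAutoscaler in recommendation-only mode: "Off" means
# recommendations are computed and stored, but never applied.
vpa_manifest = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "web-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1",
                      "kind": "Deployment",
                      "name": "web"},           # hypothetical workload
        "updatePolicy": {"updateMode": "Off"},  # recommend, don't apply
    },
}
```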
Auto-Scaling Configuration
Target Tracking
Set CPU target to 50-70% for web workloads. This provides headroom for traffic spikes while keeping average utilization in the well-sized range. Use custom metrics (request latency, queue depth) when CPU is not the bottleneck.
Scale-Down Cooldowns
Set scale-down cooldowns to 5-10 minutes to avoid flapping. The AWS Auto Scaling group default cooldown is 300 seconds; the GCP managed instance group stabilization period defaults to 10 minutes. Aggressive scale-down saves money but risks capacity shortfalls during bursty traffic.
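The cooldown is just a gate on how recently the group last scaled. A minimal sketch of that check (timestamps in seconds are an assumption for illustration):

```python
def may_scale_down(now_s: float, last_scaling_s: float,
                   cooldown_s: float = 300.0) -> bool:
    """Allow scale-in only after the cooldown has elapsed since the
    last scaling activity (300s default, matching the AWS ASG default),
    so back-to-back removals can't flap capacity."""
    return now_s - last_scaling_s >= cooldown_s
```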
Scheduled Scaling
For predictable traffic patterns (business hours, batch jobs), schedule capacity changes instead of relying solely on reactive scaling. Pre-warm capacity 10-15 minutes before expected load increases.
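The pre-warm offset is simple clock arithmetic; a sketch, with the lead time defaulting to the 15-minute upper bound suggested above:

```python
from datetime import datetime, timedelta

def prewarm_time(expected_load: datetime, lead_minutes: int = 15) -> datetime:
    """Return when to trigger the scheduled scale-up so capacity is
    warm before the expected load increase."""
    return expected_load - timedelta(minutes=lead_minutes)
```

For a 09:00 business-hours ramp, this schedules the capacity bump at 08:45.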
Cluster Autoscaler
Kubernetes Cluster Autoscaler adds/removes nodes based on pending pods. Configure over-provisioning with a low-priority pause pod to maintain a buffer of 1-2 spare nodes, reducing scale-up latency from minutes to seconds.
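The over-provisioning buffer is built from a negative PriorityClass plus "balloon" pause pods that real workloads preempt, forcing the autoscaler to keep spare nodes warm. A sketch of the two objects as Python dicts (names, replica count, and resource sizes are hypothetical):

```python
# Negative priority: any default-priority pod preempts the balloon,
# and the evicted balloon pod then triggers a node scale-up.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "overprovisioning"},
    "value": -10,           # below the default priority of 0
    "globalDefault": False,
}

# Pause pods sized to hold roughly 1-2 spare nodes' worth of capacity.
balloon_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "overprovisioning"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "overprovisioning"}},
        "template": {
            "metadata": {"labels": {"app": "overprovisioning"}},
            "spec": {
                "priorityClassName": "overprovisioning",
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",
                    "resources": {"requests": {"cpu": "1",
                                               "memory": "2Gi"}},
                }],
            },
        },
    },
}
```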