Right-Sizing Guide
Most cloud workloads are over-provisioned. Matching instance types to actual resource consumption is the single highest-impact optimization, typically saving 20-40% with no performance degradation.
Utilization Thresholds
Recommended action based on average resource utilization over a 14-day observation period.
| CPU Utilization | Memory Utilization | Recommendation | Action |
|---|---|---|---|
| Below 10% | Below 20% | Severely over-provisioned | Downsize 2+ instance sizes or consolidate |
| 10-30% | 20-40% | Over-provisioned | Downsize 1 instance size |
| 30-70% | 40-80% | Well-sized | No change needed |
| Above 70% | Above 80% | Under-provisioned | Upsize or enable auto-scaling |
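The thresholds above can be sketched as a small decision helper. This is a hypothetical function (the table does not define mixed cases, e.g. low CPU but high memory; here they fall through to the next band):

```python
def rightsizing_recommendation(cpu_pct: float, mem_pct: float) -> str:
    """Map 14-day average CPU/memory utilization to an action,
    following the thresholds in the table above."""
    if cpu_pct < 10 and mem_pct < 20:
        return "Severely over-provisioned: downsize 2+ sizes or consolidate"
    if cpu_pct < 30 and mem_pct < 40:
        return "Over-provisioned: downsize 1 instance size"
    if cpu_pct <= 70 and mem_pct <= 80:
        return "Well-sized: no change needed"
    # Either metric above its upper threshold: add capacity.
    return "Under-provisioned: upsize or enable auto-scaling"
```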
Savings Estimate by Over-Provisioning Level
Estimated monthly savings for a fleet of 20 general-purpose instances, using m7i-family on-demand pricing (m7i.xlarge at $0.2016/hr) and 720 billing hours per month.
| Over-Provisioning | Current Cost/mo | Right-Sized To | New Cost/mo | Monthly Savings | % Saved |
|---|---|---|---|---|---|
| 2x (50% idle CPU) | $2,903 | m7i.large | $1,452 | $1,451 | 50% |
| 4x (75% idle CPU) | $5,806 | m7i.large | $1,452 | $4,354 | 75% |
| Wrong family (GPU unused) | $14,832 | m7i.xlarge | $2,903 | $11,929 | 80% |
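The arithmetic behind the table reduces to hourly rate × 720 hours × fleet size. A minimal sketch (720 hours/month and a 20-instance fleet are the table's assumptions):

```python
HOURS_PER_MONTH = 720  # billing hours assumed in the table above

def monthly_fleet_cost(hourly_rate: float, count: int = 20) -> float:
    """Monthly on-demand cost for a fleet of identical instances."""
    return hourly_rate * HOURS_PER_MONTH * count

def savings(current_rate: float, rightsized_rate: float, count: int = 20):
    """Return (dollars saved per month, percent saved) from right-sizing."""
    current = monthly_fleet_cost(current_rate, count)
    new = monthly_fleet_cost(rightsized_rate, count)
    return current - new, (current - new) / current * 100
```

For the 2x row: `savings(0.2016, 0.1008)` reproduces roughly $1,452/month saved, or 50%.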
Container Resource Limits
Kubernetes Requests vs Limits
Set resource requests to the P95 of actual usage (from metrics), and set limits to 2x requests as a safety margin. The scheduler reserves node capacity based on requests, so requests set far above actual usage waste cluster capacity, while limits far above requests allow nodes to become overcommitted when many pods burst at once.
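Deriving the request from a P95 can be sketched with the standard library; the 2x-limit rule is applied mechanically. The millicore unit and sample source are assumptions for illustration:

```python
from statistics import quantiles

def recommend_resources(cpu_samples_millicores: list[float]) -> dict:
    """Sketch: derive container CPU requests/limits from usage samples.
    Request = P95 of observed usage; limit = 2x request, per the
    guidance above."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(cpu_samples_millicores, n=100)[94]
    request = round(p95)
    return {"requests": {"cpu": f"{request}m"},
            "limits": {"cpu": f"{request * 2}m"}}
```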
Vertical Pod Autoscaler (VPA)
VPA automatically adjusts container resource requests based on observed usage. Run it in recommendation mode first and review the suggestions before enabling auto-updates. Avoid running VPA and HPA on the same metric (e.g., CPU) simultaneously, as the two controllers will work against each other.
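Recommendation mode corresponds to `updateMode: "Off"` in the VPA object, which records suggestions without evicting pods. A sketch of such a manifest as a Python dict (the target workload name is hypothetical):

```python
# VerticalPodAutoscaler in recommendation-only mode: "Off" means
# recommendations are computed and stored, but never applied.
vpa_manifest = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "web-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1",
                      "kind": "Deployment",
                      "name": "web"},           # hypothetical workload
        "updatePolicy": {"updateMode": "Off"},  # recommend, don't apply
    },
}
```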
Auto-Scaling Configuration
Target Tracking
Set CPU target to 50-70% for web workloads. This provides headroom for traffic spikes while keeping average utilization in the well-sized range. Use custom metrics (request latency, queue depth) when CPU is not the bottleneck.
Scale-Down Cooldowns
Set scale-down cooldowns to 5-10 minutes to avoid flapping. The AWS Auto Scaling group default cooldown is 300 seconds; the GCP managed instance group stabilization period defaults to 10 minutes. Aggressive scale-down saves money but risks capacity shortfalls during bursty traffic.
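The cooldown is just a gate on how recently the group last scaled. A minimal sketch of that check (timestamps in seconds are an assumption for illustration):

```python
def may_scale_down(now_s: float, last_scaling_s: float,
                   cooldown_s: float = 300.0) -> bool:
    """Allow scale-in only after the cooldown has elapsed since the
    last scaling activity (300s default, matching the AWS ASG default),
    so back-to-back removals can't flap capacity."""
    return now_s - last_scaling_s >= cooldown_s
```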
Scheduled Scaling
For predictable traffic patterns (business hours, batch jobs), schedule capacity changes instead of relying solely on reactive scaling. Pre-warm capacity 10-15 minutes before expected load increases.
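The pre-warm offset is simple clock arithmetic; a sketch, with the lead time defaulting to the 15-minute upper bound suggested above:

```python
from datetime import datetime, timedelta

def prewarm_time(expected_load: datetime, lead_minutes: int = 15) -> datetime:
    """Return when to trigger the scheduled scale-up so capacity is
    warm before the expected load increase."""
    return expected_load - timedelta(minutes=lead_minutes)
```

For a 09:00 business-hours ramp, this schedules the capacity bump at 08:45.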
Cluster Autoscaler
Kubernetes Cluster Autoscaler adds/removes nodes based on pending pods. Configure over-provisioning with a low-priority pause pod to maintain a buffer of 1-2 spare nodes, reducing scale-up latency from minutes to seconds.
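The over-provisioning buffer is built from a negative PriorityClass plus "balloon" pause pods that real workloads preempt, forcing the autoscaler to keep spare nodes warm. A sketch of the two objects as Python dicts (names, replica count, and resource sizes are hypothetical):

```python
# Negative priority: any default-priority pod preempts the balloon,
# and the evicted balloon pod then triggers a node scale-up.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "overprovisioning"},
    "value": -10,           # below the default priority of 0
    "globalDefault": False,
}

# Pause pods sized to hold roughly 1-2 spare nodes' worth of capacity.
balloon_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "overprovisioning"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "overprovisioning"}},
        "template": {
            "metadata": {"labels": {"app": "overprovisioning"}},
            "spec": {
                "priorityClassName": "overprovisioning",
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",
                    "resources": {"requests": {"cpu": "1",
                                               "memory": "2Gi"}},
                }],
            },
        },
    },
}
```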