Kubernetes Scaling Strategies for Production Workloads
Horizontal Pod Autoscaling (HPA)
HPA automatically adjusts the number of pod replicas based on observed metrics. The most common metric is CPU utilization, but custom and external metrics (requests per second, queue depth, latency percentiles) often provide better scaling signals for web applications. Configure HPA with a target utilization that leaves headroom for traffic spikes — a target of 70% CPU means pods scale up before they're saturated.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120

Set scaleDown stabilization to at least 5 minutes to prevent flapping during variable traffic. Scaling up should be aggressive; scaling down should be conservative.
Resource Requests and Limits
Accurate resource requests are the foundation of effective autoscaling. Requests determine scheduling: the scheduler places pods on nodes with sufficient unreserved resources. Limits cap resource usage and trigger OOM kills or CPU throttling. Set requests based on observed steady-state usage (p50-p75 of actual consumption) and CPU limits at 2-3x requests to accommodate bursts; memory deserves tighter limits, as the checklist below explains. Under-requesting leads to overcommitted nodes and degraded performance; over-requesting wastes capacity and increases costs. A sketch applying these guidelines follows the checklist.
- Set CPU requests to the p50-p75 of observed steady-state usage
- Set memory requests to the p90 of observed usage (memory is incompressible)
- Set CPU limits to 2-3x requests, or omit them entirely in non-critical namespaces
- Set memory limits close to requests — memory overcommit causes OOM kills
- Use Vertical Pod Autoscaler in recommendation mode to discover optimal values
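To make these guidelines concrete, here is a minimal sketch of the relevant pod-template fields on a Deployment. Every number here is a placeholder to replace with your own measurements:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
      - name: web-api
        image: example.com/web-api:1.0   # placeholder image
        resources:
          requests:
            cpu: 250m        # ~p50-p75 of observed steady-state CPU
            memory: 512Mi    # ~p90 of observed memory (incompressible)
          limits:
            cpu: 750m        # ~3x the request to absorb bursts
            memory: 640Mi    # close to the request; memory overcommit means OOM kills

And a VerticalPodAutoscaler in recommendation mode, assuming the VPA components are installed in the cluster. It publishes suggested requests without evicting anything; read them with kubectl describe vpa web-api:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or rewrite running pods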
Cluster Autoscaling
Cluster Autoscaler (CA) adds or removes nodes based on pending pods and node utilization. When pods can't be scheduled due to insufficient resources, CA provisions new nodes. When nodes are underutilized (below 50% resource allocation by default), CA cordons and drains them. Configure CA with appropriate node group sizes, scale-down delays, and priority expanders to balance cost and availability.
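As a sketch of what that tuning looks like, here is an excerpt of flags commonly set on the cluster-autoscaler container. The values are illustrative defaults to adjust per cluster, and the cloud provider and node ceiling are assumptions about your environment:

command:
- ./cluster-autoscaler
- --cloud-provider=aws                      # assumption: an AWS cluster
- --expander=priority                       # pick node groups by configured priority
- --balance-similar-node-groups=true        # spread scale-ups across matching groups
- --scale-down-utilization-threshold=0.5    # "underutilized" means requests below 50% of allocatable
- --scale-down-unneeded-time=10m            # how long a node must stay underutilized before removal
- --scale-down-delay-after-add=10m          # cooldown after a scale-up before considering scale-down
- --max-nodes-total=100                     # hard ceiling across all node groups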
Cluster Autoscaler can take 2-5 minutes to provision new nodes. For latency-sensitive workloads, maintain buffer capacity with pod priority and preemption (one pattern is sketched below), or use Karpenter for faster provisioning.
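One common buffer pattern is an overprovisioning deployment of placeholder pods at negative priority: they hold spare capacity on warm nodes, the scheduler preempts them the moment real pods need the room, and the evicted placeholders go pending, prompting CA to replace the lost headroom. A minimal sketch; the names, replica count, and request sizes are assumptions to tune:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1            # below the default priority (0), so any real pod preempts these
globalDefault: false
description: "Placeholder pods that reserve headroom for scale-ups"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 2        # tune to the headroom you want, e.g. one node's worth
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; exists only to reserve resources
        resources:
          requests:
            cpu: "1"
            memory: 2Gi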
Capacity Planning and Load Testing
Autoscaling is not a substitute for capacity planning. Understand your application's resource profile under load through regular load testing. Identify the scaling bottleneck — is it CPU, memory, database connections, or external API rate limits? Determine the maximum pods your application can scale to before hitting infrastructure limits (database connection pools, load balancer limits, DNS resolution rates). Plan for peak capacity at 2x your expected maximum to handle unexpected traffic surges.
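As a worked example with hypothetical numbers: if each pod holds 20 database connections and the database accepts 1,000, with 200 reserved for batch jobs and admin tooling, the hard ceiling is (1000 - 200) / 20 = 40 pods. Cap the HPA at the bottleneck, not at what the cluster could theoretically run:

# Relevant fields of the HPA spec from earlier.
# Hypothetical sizing: 20 DB connections per pod, 1,000-connection limit,
# 200 connections reserved for batch and admin tooling:
#   (1000 - 200) / 20 = 40 pods maximum
spec:
  minReplicas: 3
  maxReplicas: 40   # capped by the database, not by node capacity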