Kubernetes Scaling Strategies for Production Workloads

Kubernetes provides powerful primitives for scaling workloads, but configuring them correctly requires understanding your application's resource profile, traffic patterns, and failure modes. This guide covers the three scaling dimensions in Kubernetes — horizontal pod autoscaling, vertical pod autoscaling, and cluster autoscaling — with practical configuration examples and production lessons.

Horizontal Pod Autoscaling (HPA)

HPA automatically adjusts the number of pod replicas based on observed metrics. The most common metric is CPU utilization, but custom and external metrics (requests per second, queue depth, latency percentiles) often provide better scaling signals for web applications. Configure HPA with a target utilization that leaves headroom for traffic spikes — a target of 70% CPU means pods scale up before they're saturated.

k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
Tip: Set scaleDown stabilization to at least 5 minutes to prevent flapping during variable traffic. Scaling up should be aggressive; scaling down should be conservative.

Resource Requests and Limits

Accurate resource requests are the foundation of effective autoscaling. Requests determine scheduling — the scheduler places pods on nodes with sufficient unreserved resources. Limits cap resource usage and trigger OOM kills or CPU throttling. Set requests based on observed steady-state usage (p50-p75 of actual consumption) and limits at 2-3x requests to accommodate burst traffic. Under-requesting leads to overcommitted nodes and degraded performance; over-requesting wastes capacity and increases costs.

  • Set CPU requests to the p50-p75 of observed steady-state usage
  • Set memory requests to the p90 of observed usage (memory is incompressible)
  • Set CPU limits to 2-3x requests, or omit them entirely in non-critical namespaces
  • Set memory limits close to requests — memory overcommit causes OOM kills
  • Use Vertical Pod Autoscaler in recommendation mode to discover optimal values
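Applied to a container spec, the guidance above might look like the fragment below. The numbers are illustrative assumptions, not recommendations; derive your own from observed usage (for example, via VPA recommendation mode or your metrics pipeline).

```yaml
# Illustrative values only — replace with percentiles from your own metrics.
resources:
  requests:
    cpu: "500m"      # ~p50-p75 of observed steady-state CPU
    memory: "512Mi"  # ~p90 of observed memory (memory is incompressible)
  limits:
    cpu: "1500m"     # 3x the CPU request for burst headroom
    memory: "640Mi"  # close to the request; memory overcommit risks OOM kills
```

Note that HPA's Utilization targets are computed against requests, so inaccurate requests skew autoscaling decisions as well as scheduling.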

Cluster Autoscaling

Cluster Autoscaler (CA) adds or removes nodes based on pending pods and node utilization. When pods can't be scheduled due to insufficient resources, CA provisions new nodes. When nodes are underutilized (below 50% resource allocation by default), CA cordons and drains them. Configure CA with appropriate node group sizes, scale-down delays, and priority expanders to balance cost and availability.
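The knobs mentioned above map to Cluster Autoscaler command-line flags. The flag names below are real CA options; the cloud provider, node-group name, and values are illustrative assumptions:

```yaml
# Fragment of a cluster-autoscaler container spec; values are illustrative.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws                    # assumption: AWS; adjust per cloud
  - --nodes=3:20:web-api-nodes              # min:max:node-group (hypothetical name)
  - --expander=priority                     # choose node groups by configured priority
  - --scale-down-utilization-threshold=0.5  # nodes below 50% allocation are candidates
  - --scale-down-unneeded-time=10m          # how long a node must stay underutilized
  - --scale-down-delay-after-add=10m        # cooldown after a scale-up event
```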

Warning: Cluster autoscaler can take 2-5 minutes to provision new nodes. For latency-sensitive workloads, maintain buffer capacity with pod priority and preemption, or use Karpenter for faster provisioning.
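The buffer-capacity pattern works by running low-priority placeholder pods that reserve spare node capacity; when real workloads arrive, the scheduler preempts the placeholders immediately while CA provisions replacement nodes in the background. A minimal sketch (the resource names and sizes are hypothetical):

```yaml
# Hypothetical overprovisioning buffer: placeholder pods at negative priority
# are evicted as soon as any default-priority pod needs their capacity.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods preempted by any default-priority workload"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer   # hypothetical name
spec:
  replicas: 2             # size the buffer to your provisioning latency
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only reserves resources
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```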

Capacity Planning and Load Testing

Autoscaling is not a substitute for capacity planning. Understand your application's resource profile under load through regular load testing. Identify the scaling bottleneck — is it CPU, memory, database connections, or external API rate limits? Determine the maximum pods your application can scale to before hitting infrastructure limits (database connection pools, load balancer limits, DNS resolution rates). Plan for peak capacity at 2x your expected maximum to handle unexpected traffic surges.