Triage: Sustained CPU Spike

You receive a P2 alert: “High CPU usage (>90%) on Payment Service”. The service is running on Kubernetes.

What are your first steps to triage this alert? How do you distinguish between a valid surge and a bug?

Immediate Check: Health endpoints (is the service actually down/slow?), Error rates, and Latency graphs.
Correlations: Check throughput/RPS. If RPS increased proportionally with CPU, it’s likely scaling lag.
Logs: Check for errors or repetitive log lines (infinite loop).
Action: Horizontal Pod Autoscaler (HPA) should handle it. If max replicas reached, check upstream changes (deployments) or downstream dependencies (slow database causing churn).

Title here