Etcd Cluster Loss Recovery

Scenario

You run an HA Kubernetes cluster with three master nodes. A critical failure (for example, two nodes corrupted at the same time) has cost etcd its quorum, so the API server is read-only or unresponsive. You have a snapshot backup, snapshot.db.
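
To see why losing two of three members is fatal, recall that etcd only commits writes when a majority of members, floor(n/2) + 1, agree. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members can still form a majority."""
    return healthy >= quorum(members)


# Walk through a 3-member cluster losing 0, 1, 2, and 3 members.
for lost in range(4):
    healthy = 3 - lost
    print(f"lost={lost} healthy={healthy} "
          f"quorum={quorum(3)} writable={has_quorum(3, healthy)}")
```

With three members the quorum is two, so a single surviving member cannot accept writes, and the cluster has to be rebuilt from the snapshot.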

Question

How do you restore the cluster functionality?

Expected Answer

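Stop Instances: Stop the kube-apiserver and etcd static pods on every master so that nothing writes to the damaged data. On a kubeadm-style setup this usually means moving their manifests out of the kubelet's static pod directory (typically /etc/kubernetes/manifests).
Restore (etcdctl): On one master node, run etcdctl snapshot restore snapshot.db --data-dir /var/lib/etcd-new ... to write a fresh data directory. That directory holds a new single-member cluster seeded from the snapshot. (From etcd 3.5 on, etcdutl snapshot restore is the preferred form of this command.)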
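Update Config: Edit the etcd.yaml manifest on that node so the data directory (both the --data-dir flag and the hostPath volume) points to the new directory, and set --initial-cluster-state=new with --initial-cluster listing only this member.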
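Start & Scale: Start this first member and wait until it reports healthy. Then join the other two nodes one at a time as new members: clear each node's old etcd data, register it with etcdctl member add, and start it with --initial-cluster-state=existing.
Restart Control Plane: Restart the kube-apiserver, kube-scheduler, and kube-controller-manager on all masters, then verify that the API server accepts writes again.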
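
The steps above can be sketched as a dry-run plan that only prints the commands rather than executing them. This is an illustration under stated assumptions, not a definitive procedure: the member names, IP addresses, and backup directory are hypothetical, the manifest and certificate paths are the kubeadm defaults, and the restore flags shown are common ones that may differ from those elided in the restore step.

```python
import shlex

# Assumed paths and values; adjust to your environment.
MANIFESTS = "/etc/kubernetes/manifests"   # kubeadm default static pod directory
HOLD = "/root/manifests-hold"             # hypothetical holding directory
PKI = "/etc/kubernetes/pki/etcd"          # kubeadm default etcd certificate directory
FIRST = ("master-1", "https://10.0.0.11:2380")    # hypothetical member
OTHERS = [("master-2", "https://10.0.0.12:2380"),  # hypothetical members
          ("master-3", "https://10.0.0.13:2380")]


def stop_static_pods():
    """Moving manifests out of the watched directory makes the kubelet stop the pods."""
    return [["mkdir", "-p", HOLD],
            ["mv", f"{MANIFESTS}/kube-apiserver.yaml", HOLD],
            ["mv", f"{MANIFESTS}/etcd.yaml", HOLD]]


def restore_snapshot(name, peer_url):
    """Restore the snapshot into a fresh data directory as a one-member cluster."""
    return [["etcdctl", "snapshot", "restore", "snapshot.db",
             "--data-dir", "/var/lib/etcd-new",
             "--name", name,
             "--initial-cluster", f"{name}={peer_url}",
             "--initial-advertise-peer-urls", peer_url]]


def start_component(manifest):
    """Moving a manifest back lets the kubelet start that static pod again."""
    return [["mv", f"{HOLD}/{manifest}", MANIFESTS]]


def add_member(name, peer_url):
    """Register one additional member with the already running cluster."""
    return [["etcdctl", "member", "add", name, f"--peer-urls={peer_url}",
             "--endpoints=https://127.0.0.1:2379",
             f"--cacert={PKI}/ca.crt",
             f"--cert={PKI}/server.crt",
             f"--key={PKI}/server.key"]]


def recovery_plan():
    plan = stop_static_pods()             # repeat on every master
    plan += restore_snapshot(*FIRST)      # on one master only
    # Edit etcd.yaml by hand here (new data dir, --initial-cluster-state=new).
    plan += start_component("etcd.yaml")
    for name, url in OTHERS:              # one at a time, after a health check
        plan += add_member(name, url)
    plan += start_component("kube-apiserver.yaml")
    return plan


for command in recovery_plan():
    print(shlex.join(command))
```

Printing the plan instead of running it keeps the sketch safe to try on any machine. In a real recovery each command runs on a specific node, with a health check such as etcdctl endpoint health between member additions.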
