Cluster auto-healing is a feature of the Kubernetes service on the Catalyst Cloud that monitors the health of the cluster and automatically repairs unhealthy Kubernetes components (for example, etcd or kubelet) or nodes (master or worker). It is distinct from, and complementary to, the self-healing that Kubernetes performs for pods.
Auto-healing is enabled by default. If desired, it can be disabled
at cluster creation time via the label
An agent called magnum-auto-healer is deployed as a daemon set to the Kubernetes cluster (automatically, during cluster creation time). It monitors the health state of the following components:
For master nodes, it checks the output of the healthz API call for the health status of etcd every 30 seconds.
For all nodes, it checks whether the kubelet status is ready every 30 seconds.
A repair action is triggered if any component is unhealthy for more than 3 minutes. The repair procedure rebuilds the unhealthy node from scratch, while minimising impact to running application workloads as much as possible.
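The timing rule above (a check every 30 seconds, repair after more than 3 minutes of continuous unhealthiness) can be illustrated with a small sketch. This is not the magnum-auto-healer implementation; the class name `HealthTracker`, the helper `simulate`, and the node name `worker-0` are hypothetical and exist only to show how such a threshold might behave, assuming the timer resets whenever a check succeeds.

```python
from dataclasses import dataclass, field
from typing import Dict, List

CHECK_INTERVAL_SECONDS = 30        # polling interval described above
UNHEALTHY_THRESHOLD_SECONDS = 180  # 3 minutes, as described above


@dataclass
class HealthTracker:
    """Hypothetical tracker: remembers when each node first became unhealthy."""
    threshold: float = UNHEALTHY_THRESHOLD_SECONDS
    # node name -> timestamp of the first failed check in the current streak
    unhealthy_since: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, healthy: bool, now: float) -> bool:
        """Record one check result; return True if a repair should be triggered."""
        if healthy:
            # Assumption: a successful check resets the unhealthy timer.
            self.unhealthy_since.pop(node, None)
            return False
        first_seen = self.unhealthy_since.setdefault(node, now)
        # Repair only once the node has been unhealthy for MORE than the threshold.
        return now - first_seen > self.threshold


def simulate(results: List[bool], interval: float = CHECK_INTERVAL_SECONDS) -> List[bool]:
    """Feed a sequence of check results, one per interval, and collect decisions."""
    tracker = HealthTracker()
    return [
        tracker.record("worker-0", healthy, now=i * interval)
        for i, healthy in enumerate(results)
    ]


if __name__ == "__main__":
    # Eight consecutive failed checks cover 0 s to 210 s; only the last exceeds 180 s.
    print(simulate([False] * 8))
```

In this sketch a node that recovers, even briefly, starts a fresh 3-minute window on its next failure, so a short transient fault does not lead to a rebuild. Whether the real agent resets its timer in the same way is not stated in this document.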