A pod is the smallest deployable unit in Kubernetes. It groups one or more containers that run in a shared context, such as the same host and the same IP address. The status of the containers can be checked by so-called probes, and Kubernetes aggregates the results into the status of the pod. A probe is a diagnostic that the kubelet performs periodically on a running container. To perform it, the kubelet calls an endpoint implemented by the container process or executes a binary in the container. The kubelet can perform and react to three types of probes: readiness, liveness, and startup.
Typical questions around these probes are:
- Do we need separate endpoints for liveness and readiness probes?
- Do we always need both?
- Does the endpoint’s code check the availability of upstream services?
- How to handle exceptions?
The difference between phase and state
A pod’s lifecycle is described in two ways: the pod’s phase, which is a simple, high-level summary of where the pod is in its lifecycle, and the pod state, which is an array of conditions through which the pod has or has not passed. Additionally, each container has its own state, which is quite simple and can be waiting, running, or terminated.
The pod phase can be viewed, for example, by issuing kubectl get pods, and the detailed pod state by issuing kubectl describe pod <pod name>.
But what does this mean for the readiness of a pod? If a pod is in phase running, at least one container is in state running. For a multi-container pod, however, the state of a single container is not sufficient. The pod condition is only ready once all containers are in state running (and, where a readiness probe is defined, have passed it). In that case the pod is added to the load balancing pool of all matching services; otherwise it is removed.
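The relation between container states, pod phase, and the ready condition can be illustrated with a small model. This is only a sketch under simplifying assumptions, not how Kubernetes is implemented: the names Container, pod_is_running, and pod_is_ready are hypothetical, and the other pod phases as well as init containers are ignored.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    state: str           # "waiting", "running", or "terminated"
    ready: bool = False  # readiness result (assumed True if running and no probe is set)


def pod_is_running(containers):
    """Simplified: phase running means at least one container is running."""
    return any(c.state == "running" for c in containers)


def pod_is_ready(containers):
    """Simplified: condition ready requires every container to be running and ready."""
    return all(c.state == "running" and c.ready for c in containers)


pod = [
    Container("app", "running", ready=True),
    Container("sidecar", "waiting"),
]
print(pod_is_running(pod))  # True
print(pod_is_ready(pod))    # False
```

The pod above is already in phase running, but it would not yet receive traffic because the sidecar is still waiting.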
This procedure ensures that most cases of container creation or deletion are handled correctly and automatically. The only thing we have to take care of is the gap between the container reaching state running (which Kubernetes can observe directly) and the application code actually being able to serve requests. The readiness probe can cover this initialization time. In these simple cases it can use the same endpoint as the liveness probe.
The readiness probe is a periodic probe of container service readiness. Containers are removed from the service load balancer if the probe fails.
Recommendation: Use a readiness probe…
- … during container startup phase
- … if an application takes itself down for maintenance
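Both recommendations can be covered by a single readiness endpoint backed by two flags. The following is a minimal sketch using only the Python standard library. The class names, the /ready path, and the idea of a maintenance flag toggled by the application are assumptions made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler


class Readiness:
    """Thread-safe readiness flags shared by the application and the probe handler."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self._maintenance = False

    def mark_initialized(self):
        # Called by the application once its startup work is done.
        with self._lock:
            self._initialized = True

    def set_maintenance(self, enabled):
        # Called by the application when it takes itself down or comes back.
        with self._lock:
            self._maintenance = enabled

    def status_code(self):
        with self._lock:
            ready = self._initialized and not self._maintenance
        # The kubelet treats HTTP status codes from 200 to 399 as success.
        return 200 if ready else 503


readiness = Readiness()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/ready" is an assumed path; it has to match the probe configuration.
        code = readiness.status_code() if self.path == "/ready" else 404
        self.send_response(code)
        self.end_headers()
```

To serve the endpoint, ProbeHandler could be passed to http.server.HTTPServer together with a port of your choice. Note that the handler checks only local flags and no external dependency.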
Question: Should a readiness probe check the application dependency?
Suppose the readiness probes of three application pods (replicas) check access to a DB service they depend on. If the DB becomes unavailable, all application replicas are removed from the Service at once. In effect, the application is then offline, which is indistinguishable from “all pods are failing” or “the deployment is deleted”. The fault propagates and eventually becomes a system failure. Avoiding this is the main reason for not checking dependencies. Remember also that readiness in the sense of Kubernetes means technically ready, not business ready. The probe only signals whether or not the pod is added to the LB. A removed pod always means failure, and this can never be a valid business status.
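The fault propagation can be made concrete with a toy model. It is a deliberately simplified sketch: the function endpoints_in_service and the replica names are hypothetical, and the replicas themselves are assumed to be healthy.

```python
def endpoints_in_service(replicas, db_available, check_dependency):
    """Return the replicas that remain in the load balancing pool.

    replicas: names of otherwise healthy application pods
    check_dependency: True if the readiness probe also checks the DB
    """
    if check_dependency and not db_available:
        # Every replica sees the same outage, so every probe fails at once.
        return []
    return list(replicas)


replicas = ["app-0", "app-1", "app-2"]
print(endpoints_in_service(replicas, db_available=False, check_dependency=True))
# []
print(endpoints_in_service(replicas, db_available=False, check_dependency=False))
# ['app-0', 'app-1', 'app-2']
```

With the dependency check in place, a single DB outage empties the pool. Without it, the pods stay reachable and can decide for themselves how to answer.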
What happens if the DB is not available?
As mentioned above, to avoid fault propagation, it is not advisable to simply let the readiness probe fail. One option is to implement some sort of degraded mode. For instance, a REST service answers only those requests that can be served from cache or from defaults, while responding with 503 (Service Unavailable) to writes (PUT/POST). We have to make sure, of course, that downstream services are aware of this kind of degraded mode (in general, downstream services should in any case be resilient to faulty calls to upstream services).
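A degraded mode of this kind might look like the following sketch. The function handle_request, the cache dictionary, and the DEFAULT_BODY fallback are hypothetical, and the regular code path is reduced to a placeholder.

```python
DEFAULT_BODY = "default"  # hypothetical fallback value


def handle_request(method, key, cache, db_available):
    """Return (status_code, body); a sketch of a degraded mode, not a full service."""
    if db_available:
        return 200, "handled normally"  # placeholder for the regular code path
    if method == "GET":
        # Degraded mode: answer reads from the cache or from a default.
        return 200, cache.get(key, DEFAULT_BODY)
    # Degraded mode: writes (PUT/POST) cannot be served without the DB.
    return 503, "Service Unavailable"


cache = {"user/1": "cached user 1"}
print(handle_request("GET", "user/1", cache, db_available=False))
# (200, 'cached user 1')
print(handle_request("GET", "user/2", cache, db_available=False))
# (200, 'default')
print(handle_request("POST", "user/3", cache, db_available=False))
# (503, 'Service Unavailable')
```

The pod stays in the load balancing pool the whole time. Only the individual responses tell callers that something is wrong.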
For the sake of completeness: a disadvantage of degraded modes is that they tend to spread into a kind of distributed degraded mode that can be difficult to handle. So replying with 503 to everything may be a good option too.
How to handle exceptions?
If the application code encounters an unexpected and unrecoverable internal exception while calculating the readiness response, it should crash on its own. Such an exception most likely indicates a serious container-internal issue that has no connection with external dependencies.
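One way to express "crash on its own" is sketched below. The function names are hypothetical. The sketch uses os._exit because a probe handler typically runs in a worker thread, where sys.exit would only end that thread and not the process.

```python
import os
import traceback


def compute_readiness():
    """Hypothetical application check; returns True when the app can serve."""
    return True


def readiness_status(check=compute_readiness, crash=os._exit):
    """Return an HTTP status for the probe, or terminate the process on an unexpected error."""
    try:
        return 200 if check() else 503
    except Exception:
        # Unexpected internal error: log it and let the process die, so that
        # the container is restarted according to its restart policy.
        traceback.print_exc()
        crash(1)
```

The crash parameter exists only so that the behavior can be tested without killing the test process. In production code the default would be left in place.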
The liveness probe is a periodic probe of container liveness. The container is restarted if the probe fails.
The liveness probe should be used if
- the process in your container is unable to crash on its own whenever it encounters an issue or becomes unhealthy
- the application code runs inside a framework that controls its execution (e.g. a servlet container)
With regard to the verification of upstream dependencies, the same applies as for the readiness probe. Liveness probes should only help to determine whether the container process is responding or not. If the container process is able to detect its unhealthiness on its own, it can simply exit.
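As an illustration, a liveness check can be limited to purely local information, for example a heartbeat that the main work loop updates. This is a sketch under assumptions: the Heartbeat class, the 30 second limit, and the choice of status 500 are hypothetical and not taken from any Kubernetes API.

```python
import time


class Heartbeat:
    """Records when the main work loop last made progress."""

    def __init__(self, max_age_seconds=30.0, clock=time.monotonic):
        self._clock = clock
        self._max_age = max_age_seconds
        self._last_beat = clock()

    def beat(self):
        # Called by the work loop on every iteration.
        self._last_beat = self._clock()

    def liveness_status(self):
        # Only local state is inspected; no upstream dependency is called.
        age = self._clock() - self._last_beat
        return 200 if age <= self._max_age else 500
```

A process that is stuck, and therefore cannot exit by itself, stops calling beat. The liveness endpoint then starts to fail and the kubelet restarts the container.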
The startup probe indicates that the pod has successfully initialized. If specified, no other probes are executed until it completes successfully. As with the liveness probe, the container is restarted if it fails.
This probe (introduced as an alpha feature in Kubernetes 1.16 and stable since 1.20) reflects the long boot times typically experienced with legacy applications or technologies with uncomfortably long initialization times, such as Spring Boot. Using the liveness probe alone forces us to take these delays into account, and it can be tricky to set the parameters without compromising a fast response to an unhealthy application. So if your container normally needs more than initialDelaySeconds + failureThreshold × periodSeconds to start, you should specify a startup probe and use the same endpoint as the liveness probe.
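The rule of thumb above is simple arithmetic and can be written down as a small helper. The function names are hypothetical. The default arguments mirror the Kubernetes defaults for probes (initialDelaySeconds 0, failureThreshold 3, periodSeconds 10), and the result is an approximation that ignores probe timeouts.

```python
def liveness_startup_budget(initial_delay_seconds, failure_threshold, period_seconds):
    """Approximate time in seconds a container has to start before the liveness probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def needs_startup_probe(typical_startup_seconds, initial_delay_seconds=0,
                        failure_threshold=3, period_seconds=10):
    """True if the container normally starts more slowly than the liveness probe tolerates."""
    budget = liveness_startup_budget(initial_delay_seconds, failure_threshold, period_seconds)
    return typical_startup_seconds > budget


print(liveness_startup_budget(0, 3, 10))  # 30
print(needs_startup_probe(90))            # True
print(needs_startup_probe(20))            # False
```

An application that typically needs 90 seconds to boot would therefore be killed by a liveness probe with default settings long before it is up, which is exactly the case a startup probe is meant for.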
Resilience and availability must be considered more globally and take into account the overall behavior of the application. However, probes are an important feature for increasing resiliency. They can help to prevent faults from becoming failures. But neither liveness nor readiness probes magically solve every resiliency challenge for you. Merely using them does not make an application resilient. They are technical means for carrying out health checks and should be used wisely.