Measuring the Health of Your Application: Why Healthchecks are not Enough

I was inspired to write this after reading a discussion on Twitter about how we define “health”. Specifically, I am going to talk about:

  • Health for web services
  • Focusing on larger systems with high availability requirements
  • Systems that lose money if they are down

Using health checks to check the health of your service is like checking the health of your car by seeing whether the headlights turn on.

A health check could be a /health API that runs a deep health check, or simply a ping that checks whether the service is up.

Health checks in the simple ping form tell you very little about your system. They give you a single piece of information: a certain API is reachable. They don’t necessarily tell you anything about the logic underneath or the interactions with the rest of the system.

Deep health checks, those that exercise your API and its dependencies, are a better indication of health, but they are costly. They might take a long time to run, and they consume computing resources on your machine. They also incur a cost for your dependencies and therefore for your entire system. Even so, deep health checks only verify some of your assumptions about how the system works. They don’t verify the entire system.
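
To make the difference concrete, here is a minimal sketch of the two kinds of check. The function names, the probe callables, and the "ok"/"degraded" statuses are all assumptions made for illustration, not part of any particular framework.

```python
import time
from typing import Callable, Dict


def shallow_check() -> dict:
    # A ping-style check: it only proves this process can answer a request.
    return {"status": "ok"}


def deep_check(dependencies: Dict[str, Callable[[], bool]]) -> dict:
    # Probe each dependency and record the result and how long the probe took.
    results = {}
    for name, probe in dependencies.items():
        started = time.perf_counter()
        try:
            ok = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            ok = False
        results[name] = {"ok": ok, "seconds": time.perf_counter() - started}
    healthy = all(r["ok"] for r in results.values())
    return {"status": "ok" if healthy else "degraded", "dependencies": results}


def failing_cache() -> bool:
    # Hypothetical probe standing in for an unreachable cache.
    raise ConnectionError("cache unreachable")


# Hypothetical probes; real ones would query the database, cache, and so on.
report = deep_check({"database": lambda: True, "cache": failing_cache})
print(shallow_check()["status"], report["status"])
```

Note that the shallow check still reports "ok" while the cache is down, and that the deep check spends time and load on every dependency each time it runs.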

Shallow health checks are equivalent to checking whether the headlights turn on in your car. If the headlights turn on, you declare the car healthy.

Deep health checks are equivalent to starting the car and revving the engine. If you hear the engine running, you declare the car healthy.

In the car example, we haven’t really tested the brakes, the steering, the tail lights, the electrical system, the air filters, the AC, the sound system, etc.

Similarly, using a health check as the measure of your service’s health is not enough. It doesn’t tell you much about whether your service is actually healthy.

Define SLAs for your System

You need to define SLAs for your system. Google popularized the concept in their SRE book, but many companies have been using SLAs for a long time.

If you don’t have a clear idea of how your system should behave, how do you even define if it’s not doing what it should be doing?

An SLA defines the conditions our system should meet in its steady state. Deployments and failures shouldn’t affect a highly available system. The more concrete our SLAs are, the better we know what to expect from our system.
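
As a rough sketch of what “concrete” can mean, the snippet below checks one window of requests against an availability target and a p99 latency target. The `Sla` class, the target values, and the sample data are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Sla:
    # Hypothetical targets, chosen only for illustration.
    min_availability: float    # fraction of requests that must succeed
    max_p99_latency_ms: float  # ceiling for the 99th percentile latency


def percentile(values: List[float], pct: int) -> float:
    # Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(requests: List[Tuple[bool, float]], sla: Sla) -> Dict[str, object]:
    # Each request is a (succeeded, latency_ms) pair from one measurement window.
    if not requests:
        raise ValueError("no requests in the measurement window")
    availability = sum(1 for ok, _ in requests if ok) / len(requests)
    p99 = percentile([latency for _, latency in requests], 99)
    met = availability >= sla.min_availability and p99 <= sla.max_p99_latency_ms
    return {"availability": availability, "p99_latency_ms": p99, "met": met}


sla = Sla(min_availability=0.995, max_p99_latency_ms=300.0)
window = [(True, 50.0)] * 999 + [(False, 900.0)]
print(evaluate(window, sla))
```

With numbers like these written down, “is the system doing what it should?” becomes a question you can answer from measurements rather than from intuition.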

Defining Health

The health of your system is the total of the health of all its subsystems. I am using the word system rather than service or application because we must be aware of all the little components that make it up. A web service might involve APIs, static content, CDNs, load balancers and the hosts it runs on. It might include databases and storage, replication, fault tolerance mechanisms and much more.

The best way to think about the health of your system is to decompose it into smaller subsystems and define health for each of them.

Examples of Measures of Health (SLA, SLI and SLO everything)

The list is not exhaustive in any way and is very high level (I might follow up with a different article on this); below are examples of things we could measure to track the health of our service.

  • Backend web service: latency, failure, errors
  • Database: latency, failure, replication lag
  • Load Balancer: number of connections, services available
  • Host: Disk, CPU, IO, memory
  • Runtime environment: garbage collection pauses, process crashes, heap size
  • CDN: availability, latency, failures, errors, cache hits
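
One possible way to turn a list like this into something checkable is to attach a limit to each measure and report the ones that are breached or missing. The subsystem names, metric names, and limits below are hypothetical, and the sketch only handles “lower is better” measures (something like cache hits would need a minimum instead).

```python
from typing import Dict, List

# Hypothetical upper limits per subsystem; real values depend on your SLAs.
LIMITS: Dict[str, Dict[str, float]] = {
    "backend": {"p99_latency_ms": 300.0, "error_rate": 0.01},
    "database": {"replication_lag_s": 5.0},
    "host": {"cpu_utilization": 0.85, "disk_utilization": 0.90},
}


def breaches(measurements: Dict[str, Dict[str, float]]) -> List[str]:
    # Compare each observed value with its limit; a missing value is also reported.
    found = []
    for subsystem, limits in LIMITS.items():
        observed = measurements.get(subsystem, {})
        for metric, limit in limits.items():
            value = observed.get(metric)
            if value is None:
                found.append(f"{subsystem}.{metric}: missing")
            elif value > limit:
                found.append(f"{subsystem}.{metric}: {value} > {limit}")
    return found


# Made-up sample measurements for one point in time.
sample = {
    "backend": {"p99_latency_ms": 120.0, "error_rate": 0.002},
    "database": {"replication_lag_s": 12.0},
    "host": {"cpu_utilization": 0.40},
}
for line in breaches(sample):
    print(line)
```

Treating a missing measurement as a problem is a deliberate choice here: a subsystem you cannot measure is one whose health you do not know.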

Hierarchy of Health

In order to define a good Health Monitor for your system, you need to have a hierarchy for your system. This hierarchy structure can run deep. In a very large complex system with hundreds or thousands of components and services, having this hierarchy is important. Being able to measure health for everything, and dive into little components will help you debug problems faster and figure out what’s wrong immediately.
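
A hierarchy like this can be sketched as a small tree in which a parent’s health is derived from its children. The `Node` class, the example component names, and the rule that a parent is healthy only when every child is healthy are assumptions for illustration; a real system might weight components or tolerate partial failures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: Optional[bool] = None  # set on leaf components only
    children: List["Node"] = field(default_factory=list)

    def is_healthy(self) -> bool:
        # A leaf reports its own state; a parent is healthy only if all children are.
        if not self.children:
            return bool(self.healthy)
        return all(child.is_healthy() for child in self.children)

    def unhealthy_paths(self, prefix: str = "") -> List[str]:
        # Walk down the tree and return the path to every unhealthy leaf.
        path = f"{prefix}/{self.name}" if prefix else self.name
        if not self.children:
            return [] if self.healthy else [path]
        paths: List[str] = []
        for child in self.children:
            paths.extend(child.unhealthy_paths(path))
        return paths


# Hypothetical system layout.
system = Node("system", children=[
    Node("web", children=[Node("api", healthy=True), Node("cdn", healthy=True)]),
    Node("storage", children=[Node("primary", healthy=True),
                              Node("replica", healthy=False)]),
])
print(system.is_healthy(), system.unhealthy_paths())
```

The top-level answer tells you something is wrong, and the paths tell you where to start looking.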

Takeaways

  • Define SLAs for your system and measure them
  • Decompose the health of your system into the health of smaller components
  • Monitor everything you care about in your system
  • Have an overarching Health Alarm/Monitor to track the health of your system
  • Health checks are not bad; they are just not enough by themselves

My Tweets

The tweets that inspired me to write this are below. Let me know what you think about the health of your system.
