SRE School: Health Checking

Any service that has complex logic or external dependencies might stop working for unexpected reasons. While instrumentation and monitoring can help bring these problems to human attention, it can be difficult to use dashboards or alerts for low-latency automated responses. A load balancer, for example, should respond to unhealthy backends on the order of seconds – long before any human can become aware of the problem.

Health checking is the process by which processes self-monitor for problems, report those problems to other parts of the service, and respond to other processes' unhealthiness in ways that mitigate overall service degradation.

Reporting Problems

Health checks are done not for a process's own benefit, but for the benefit of others. The first part of any health checking logic is the endpoint by which other processes poll it. This is essentially a miniature black-box monitoring system.

Health checks should almost always be performed over the same protocol that handles normal requests and responses. If your HTTP server processes health checks in a separate thread pool or with a special low-dependency handler, then the risk of health checks reporting OK for an unhealthy process is significantly increased.

HTTP

Many distributed systems use HTTP as a transport protocol, so adding a simple /healthcheck endpoint is popular. The semantics are usually "always respond 200 OK", and upstream load balancers treat timeouts or other response codes as unhealthy. Repeated failed health checks cause the load balancer to stop sending requests to that backend.
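
A minimal version of such an endpoint, using only Go's standard library, might look like the sketch below; the /healthcheck path and the port are illustrative. Note that the handler is registered on the same mux as normal traffic, per the advice above.

package main

import (
    "log"
    "net/http"
)

func main() {
    mux := http.NewServeMux()
    // Application handlers would be registered on this same mux, so health
    // checks exercise the same server and routing as real traffic.
    mux.HandleFunc("/healthcheck", func(w http.ResponseWriter, r *http.Request) {
        // "Always respond 200 OK"; the load balancer treats timeouts or
        // other status codes as unhealthy.
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok\n"))
    })
    log.Fatal(http.ListenAndServe(":8080", mux))
}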

A few changes to this basic model can improve the efficiency:

  • Certain error codes can be special-cased to mean "stop sending requests immediately" – for example, Envoy treats 503 as a hard go-away.
  • When using common ports like :80 or :443, an "expected name" might be attached to the request to identify which backend the load balancer expects to be talking to. If a different process happens to be listening on that port, it will reject the health-check request, and the load balancer will avoid sending it traffic meant for the other service (see the sketch after this list).
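
Here is a sketch of a handler implementing both refinements. The "service" query parameter and the expected name are assumptions about how a particular load balancer might be configured, not a standard convention.

package main

import "net/http"

// expectedService is a hypothetical name identifying this backend.
const expectedService = "com.example.Motd"

func healthCheck(w http.ResponseWriter, r *http.Request) {
    // If the load balancer attached an expected name and it doesn't match,
    // a different service owns this port now; respond 503 so the load
    // balancer stops sending traffic immediately.
    if got := r.URL.Query().Get("service"); got != "" && got != expectedService {
        http.Error(w, "unexpected service: "+got, http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/healthcheck", healthCheck)
    http.ListenAndServe(":8080", nil)
}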

gRPC

gRPC has a standardised and extended version of the basic HTTP health check. It expects each port to respond to /grpc.health.v1.Health/Check, and allows requests to specify which service name they are for.
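
For illustration, here is a client-side check using the stock grpc-go health stubs; the address and service name are borrowed from the Registration example later in this article.

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "127.0.0.1:1234", grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()

    // An empty Service name asks about the server as a whole; a non-empty
    // name asks about one specific service offered on this port.
    resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{
        Service: "com.example.Motd",
    })
    if err != nil {
        log.Fatalf("health check failed: %v", err)
    }
    log.Printf("status: %v", resp.GetStatus())
}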

The handling of service names is important because each gRPC server can offer multiple gRPC services, each logically distinct and with its own health check logic. For example, an authorization server with separate "issue token" and "validate token" services might be temporarily unable to issue tokens, but could still validate any that were previously issued.
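
That scenario can be expressed with the stock grpc-go health server (distinct from the article's own health package used below) by giving each service name its own status; the service names here are hypothetical, and registration of the application services themselves is omitted.

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    srv := grpc.NewServer()

    // The stock health server from google.golang.org/grpc/health, not the
    // custom health package used later in this article.
    healthSrv := health.NewServer()
    grpc_health_v1.RegisterHealthServer(srv, healthSrv)

    // Token issuance is temporarily broken, but validation still works, so
    // the two service names report different statuses.
    healthSrv.SetServingStatus("com.example.auth.IssueToken", grpc_health_v1.HealthCheckResponse_NOT_SERVING)
    healthSrv.SetServingStatus("com.example.auth.ValidateToken", grpc_health_v1.HealthCheckResponse_SERVING)

    socket, err := net.Listen("tcp", "127.0.0.1:1235")
    if err != nil {
        log.Fatal(err)
    }
    srv.Serve(socket)
}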

Dependencies

While a service might become unhealthy because of some internal problem, it's far more common for unhealthiness to be caused by dependencies on other components. A mail server might be unable to send email because it's getting CONNECTION_REFUSED from smtpd, unable to show existing emails because the database machine is rebooting, or unable to do anything at all because a human has manually marked its local machine as bad.

Within a single process, health status is detected and propagated to relevant services via a dependency tree. Ideally, the codebase is structured so that depending on any external resource (a database, an RPC backend, a secret key installed by Puppet) requires going through the dependency framework.

Interfaces

type HealthChecker interface {
    // Metadata describes this dependency for logs and debugging output.
    Metadata() *Metadata
    // Children returns nested dependencies, forming the dependency tree.
    Children() []HealthChecker
    // HealthCheck runs until the context is cancelled, calling the callback
    // with the result of each probe (nil means healthy).
    HealthCheck(context.Context, func(error))
}

type Metadata struct {
    Name        string
    Description string
}

Defining Dependencies

type FileDependency struct {
    Path string
}

var _ health.HealthChecker = (*FileDependency)(nil)

func (f *FileDependency) Metadata() *health.Metadata {
    return &health.Metadata{
        Name: fmt.Sprintf("local file: %s", f.Path),
    }
}

func (f *FileDependency) Children() []health.HealthChecker { return nil }

func (f *FileDependency) HealthCheck(ctx context.Context, cb func(error)) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            fp, err := os.Open(f.Path)
            if err == nil {
                fp.Close()
            }
            cb(err)
        case <-ctx.Done():
            return
        }
    }
}

Registration

type motdImpl struct {
    motdFile *FileDependency
}

func (i *motdImpl) Motd(ctx context.Context, req *pb.MotdRequest) (*pb.MotdResponse, error) {
    motd, err := ioutil.ReadFile(i.motdFile.Path)
    if err != nil {
        return nil, err
    }
    return &pb.MotdResponse{Message: motd}, nil
}

func main() {
    ctx := context.Background()

    impl := &motdImpl{
        motdFile: &FileDependency{
            Path: "/etc/motd",
        },
    }
    machineHealthyFile := &FileDependency{
        Path: "/etc/machine-healthy",
    }

    srv := grpc.NewServer()
    pb.RegisterMotdServer(srv, impl)

    healthSrv := health.NewHealthServer()
    grpc_health_v1.RegisterHealthServer(srv, healthSrv)

    healthSrv.Register(impl.motdFile, health.ServiceName("com.example.Motd"))
    healthSrv.Register(machineHealthyFile)

    // waits for dependencies to become healthy
    healthSrv.Start(ctx)

    address := "127.0.0.1:1234"
    socket, err := net.Listen("tcp", address)
    if err != nil {
        log.Fatalf("net.Listen(%q): %v", address, err)
    }
    srv.Serve(socket)
}

Server Startup

Server startup should block until dependencies have become healthy, so that service implementation code doesn't have to deal with "half-open" dependencies (unless explicitly written to do so). Since dependencies can take a few seconds to initialize, starting them in parallel also helps reduce overall startup time.
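
As a sketch of how that blocking might be implemented, the helper below runs each checker's probe loop in parallel and returns once every checker has reported success at least once. It is assumed to live in the same package as the HealthChecker interface above; the function name and structure are assumptions, not the article's actual Start implementation.

package health

import (
    "context"
    "sync"
)

// waitUntilHealthy runs every registered checker's probe loop in parallel
// and returns once each checker has reported a nil error at least once,
// or when ctx is cancelled.
func waitUntilHealthy(ctx context.Context, checkers []HealthChecker) error {
    healthy := make(chan struct{}, len(checkers))
    for _, checker := range checkers {
        checker := checker
        go func() {
            var once sync.Once
            // HealthCheck runs until ctx is cancelled, reporting the result
            // of each probe through the callback.
            checker.HealthCheck(ctx, func(err error) {
                if err == nil {
                    once.Do(func() { healthy <- struct{}{} })
                }
            })
        }()
    }
    for range checkers {
        select {
        case <-healthy:
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return nil
}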

Not all dependencies should block server startup, and some should block startup without otherwise affecting health checking. The levels (see the sketch after this list) are:

  • Hard dependencies block startup until the dependency is healthy, and the service (or entire process) becomes unhealthy if the dependency is unhealthy. Examples might include the main database server, a proxy for outgoing connections, or disk space for critical logs.
  • Startup dependencies block startup, but once loaded don't need to be re-checked. Examples include a per-service private key, a large file loaded from local disk, or configuration data stored remotely.
  • Optional dependencies do not block startup, but do propagate health status to services that depend on them. This is useful when a single process is providing many services, and there's no problem with only accepting traffic for some of them.
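
One hypothetical way to express these levels is an extra option passed to Register, mirroring the style of health.ServiceName; none of the names below exist in the code above.

package health

// Level is a hypothetical classification of how a dependency participates
// in startup and in ongoing health checks.
type Level int

const (
    Hard     Level = iota // blocks startup; unhealthiness propagates afterwards
    Startup               // blocks startup; not re-checked once healthy
    Optional              // never blocks startup; unhealthiness still propagates
)

// RegisterOption mirrors the option style used by health.ServiceName; the
// name and shape are assumptions.
type RegisterOption func(*registration)

type registration struct {
    level Level
}

// WithLevel selects a dependency's level; Hard is assumed to be the default.
func WithLevel(l Level) RegisterOption {
    return func(r *registration) { r.level = l }
}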