Overview


Just Checking

Ackal reduces the toil in managing globally-distributed health checks for your gRPC services. Less toil means your team has more time to do what it does best.

Using Ackal's command-line tool (ackalctl) or Ackal's gRPC API, you can...

  1. Schedule
  2. Manage
  3. Monitor

...your health checks with the goal of ensuring that your gRPC services are available reliably.

Ackal has a single purpose: Just Checking

Health Checks

Health checks are an important SRE tool.

Complexity means that, even with reliable platforms and networks, it's not possible to assume that a deployed service is healthy.

Healthiness can be difficult to define and measure. It is often a combination of a service's:

  • developers' perspective (the service depends on a database that must be available, correctly configured, and itself healthy),
  • SRE's perspective (the service is impacted by a data-center issue),
  • and, most importantly, the customer's perspective (I'm unable to access the service).

In many cases, this complexity is reduced to an application-layer protocol check, such as checking that an HTTP endpoint returns a 2xx status code. While indicative of a service's health, such checks are only a reflection of it.

For gRPC services, health checking is more complicated.

If a server does not support reflection, it's not possible to enumerate the methods that it supports in order to find one to use (as a proxy) for a health check.

Even if reflection is supported, the only way to check the liveness of a method is to construct and send a request message to the method.

Should we even be calling arbitrary methods with arbitrary messages on a service?

A good solution to this challenge is the gRPC Health Checking Protocol. It defines two methods (Check and Watch) and two messages (HealthCheckRequest and HealthCheckResponse) that, when implemented by a server, provide a standard mechanism by which the health of the server's services may be checked.
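
As a rough illustration of the protocol, the sketch below calls Check for a single service using grpcurl, a general-purpose gRPC command-line client that is separate from Ackal. It assumes grpcurl is installed and that the server either supports reflection or that you supply the protocol's proto file with grpcurl's -proto flag. The function names are hypothetical, and the endpoint and service name in the usage comment are the placeholders used elsewhere in this document.

```shell
# Build the JSON form of a HealthCheckRequest for one service name.
# An empty service name asks about the health of the server as a whole.
health_request() {
  printf '{"service":"%s"}\n' "$1"
}

# Call grpc.health.v1.Health/Check on an endpoint (requires grpcurl).
# Usage (placeholder values): health_check server-01:443 ServiceA
health_check() {
  grpcurl -d "$(health_request "$2")" "$1" grpc.health.v1.Health/Check
}
```

A healthy service is expected to answer with a HealthCheckResponse whose status is SERVING. In practice the service name is usually the fully-qualified name (package and service) that the server registered with its health implementation.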

As with any effective health checking mechanism, it is the responsibility of the service's developers to ensure that the health checking mechanism correctly reflects the state of the service.

Pre-requisite

For the reasons described in the Health Checks section, Ackal implements the gRPC Health Checking Protocol.

Consequently, Ackal can only check services that also implement this protocol.

Schedule Health Checks

Assuming you have a gRPC service that meets the Pre-requisite by implementing the gRPC Health Checking Protocol, you can create a health check for the service using Ackal.

You will need to decide:

  1. How frequently the health check should be made
  2. From where the health check should run
  3. Which subset of your gRPC services should be tested

To create health checks, you can then use either Ackal's command-line tool (ackalctl) or Ackal's gRPC API:


ackalctl \
--endpoint="server-01:443" \
--period="300s" \
--location_ids="the-dalles,tokyo" \
--services="ServiceA,ServiceB" \
create check

id: 123456789abcdef0123456789abcdef0123456789abcdef012345678
                            

Ackal health checks continue to run from the locations and on the frequency you specify until you delete them.

Manage Health Checks

You can list and delete Ackal health checks using Ackal's command-line tool (ackalctl) or Ackal's gRPC API:


ackalctl \
list checks

Endpoint      Period Location(s)      Services(s)       Enabled
server-01:443 300    the-dalles,tokyo ServiceA,ServiceB true
                            

ackalctl \
--endpoint="server-01:443" \
--period="300s" \
delete check

id: 123456789abcdef0123456789abcdef0123456789abcdef012345678
                            

Monitor Health Checks

Ackal health checks publish metrics on an HTTP endpoint using Prometheus' text-based exposition format.

You can use curl or a similar tool to browse (GET) these endpoints.

More commonly, you will want to configure Prometheus or a Prometheus-compatible tool to scrape your health check endpoints periodically for you. Ackal provides two proxies for authenticated service discovery of your Ackal health checks.

NOTE You should configure Prometheus to scrape your health check endpoints no more frequently than the health checks are being made.
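
One way to apply this note, offered as an assumption rather than Ackal guidance: when a single Prometheus job scrapes several health checks, choose a scrape interval that is no shorter than the longest check period among them. The helper below is a hypothetical sketch that takes check periods in seconds and prints that value.

```shell
# Print the largest of the given check periods (in seconds). Using it as the
# scrape interval keeps a job from scraping any check more often than it runs.
min_scrape_interval() {
  max=0
  for period in "$@"; do
    if [ "${period}" -gt "${max}" ]; then
      max="${period}"
    fi
  done
  echo "${max}"
}
```

For checks with periods of 600 and 300 seconds, min_scrape_interval 600 300 prints 600. An alternative is to group checks by period into separate scrape jobs, each with its own interval.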

Here is an example using curl to enumerate the endpoints for your Ackal health checks:


# Your Customer ID
CUSTOMER="..."

# Your Credentials File JWT
JWT=$(\
  cat "${HOME}/.config/ackalctl/credentials.json" \
  | jq -r .jwt)

curl \
--silent \
--get \
--header "Authorization: Bearer ${JWT}" \
--header "Ackal-Customer: ${CUSTOMER}" \
--header "Content-Type: application/json" \
https://listr.ack.al \
| jq -r .
                            
NOTE This script uses jq to parse JSON.

The resulting JSON describes your Ackal health check endpoints:


{
    "exporters": [
      {
        "url": "https://healthcheck-01",
        "labels": {
          "endpoint": "server-01:443",
          "location": "somewhere",
          "period": "600"
        }
      },
      {
        "url": "https://healthcheck-02",
        "labels": {
          "endpoint": "server-02:443",
          "location": "somewhere",
          "period": "300"
        }
      }
    ]
  }
                            

To see the JSON used by Ackal's ackalctl-http-proxy, which provides authenticated service discovery of your Ackal health checks for Prometheus, append ?format=prometheus to the URL:


[
  {
    "targets": [
      "healthcheck-01:443"
    ],
    "labels": {
      "endpoint": "server-01:443",
      "location": "somewhere",
      "period": "600"
    }
  },
  {
    "targets": [
      "healthcheck-02:443"
    ],
    "labels": {
      "endpoint": "server-02:443",
      "location": "somewhere",
      "period": "300"
    }
  }
]
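
Comparing the two sample responses, the Prometheus form appears to be derivable from the default form: each exporter's URL loses its https:// scheme and gains port 443, and the labels carry over unchanged. The jq filter below is a sketch of that mapping, inferred only from the sample output shown here; it is not a description of how ackalctl-http-proxy works, and the function name is hypothetical.

```shell
# Read the default (exporters) JSON on standard input and print it in the
# Prometheus target-group form (requires jq).
# Assumption: every exporter URL uses https:// and listens on port 443.
exporters_to_targets() {
  jq '[.exporters[]
       | {targets: [(.url | ltrimstr("https://")) + ":443"],
          labels: .labels}]'
}
```

Piping the default response through exporters_to_targets should yield a list of target groups shaped like the Prometheus sample.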
                            

Ackal secures health check endpoints, so you must provide credentials (using an Authorization header) with requests to them:


# Your Customer ID
CUSTOMER="..."

# Your Credentials File JWT
JWT=$(\
  cat "${HOME}/.config/ackalctl/credentials.json" \
  | jq -r .jwt)

curl \
--silent \
--get \
--header "Authorization: Bearer ${JWT}" \
--header "Ackal-Customer: ${CUSTOMER}" \
--header "Content-Type: application/json" \
https://healthcheck-01/metrics \
| awk '/^ackal_exporter_check/ {print}'
                            
NOTE You must prefix health check URLs with the scheme (https://) and append /metrics. In the example above, awk extracts Ackal's metrics (ackal_exporter_check) from the list of metrics.

ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="5"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="10"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="25"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="50"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="100"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="250"} 0
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="500"} 1
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="1000"} 6
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="2500"} 6
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="5000"} 6
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="10000"} 6
ackal_exporter_check_histogram_ms_bucket{service="ServiceA",status="SERVING",le="+Inf"} 6
ackal_exporter_check_histogram_ms_sum{service="ServiceA",status="SERVING"} 3880
ackal_exporter_check_histogram_ms_count{service="ServiceA",status="SERVING"} 6
ackal_exporter_check_summary_ms{service="ServiceA",status="SERVING",quantile="0.5"} 698
ackal_exporter_check_summary_ms{service="ServiceA",status="SERVING",quantile="0.9"} 698
ackal_exporter_check_summary_ms{service="ServiceA",status="SERVING",quantile="0.99"} 698
ackal_exporter_check_summary_ms_sum{service="ServiceA",status="SERVING"} 3880
ackal_exporter_check_summary_ms_count{service="ServiceA",status="SERVING"} 6
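
Because the histogram exposes a running sum and count, the mean check latency can be read straight from output like this. The awk sketch below does that; the function name is hypothetical, and it assumes the input holds a single service and status (handling several would require keying the values by label set).

```shell
# Read Prometheus text-format metrics on standard input and print the mean
# check latency in milliseconds, computed from the histogram's _sum and
# _count series. Assumes one service/status combination in the input.
mean_latency_ms() {
  awk '
    /^ackal_exporter_check_histogram_ms_sum/   { sum = $NF }
    /^ackal_exporter_check_histogram_ms_count/ { count = $NF }
    END { if (count > 0) printf "%.1f\n", sum / count }
  '
}
```

For the sample metrics (a sum of 3880 ms over 6 checks) this prints 646.7.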
                            

As described above, you should consider using one of Ackal's proxies for authenticated service discovery in order to automate Prometheus' scraping of your Ackal health checks. See ackalctl-http-proxy.

Improved Service Reliability

When the inevitable happens and one of your services fails, you don't want to hear about it first from your customers.

By monitoring metrics from your services' Ackal health checks, you can create alerts using your preferred tools (PagerDuty, Slack, etc.) to ensure your on-call team is able to deal with service issues promptly.
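
As a minimal sketch of the kind of condition an alert might be built on, the function below scans scraped metrics for checks recorded with a status other than SERVING. It assumes the status label carries the gRPC Health Checking Protocol's status names (for example NOT_SERVING), which this document only shows for SERVING; the function name is hypothetical.

```shell
# Read Prometheus text-format metrics on standard input and print every
# ackal_exporter_check_histogram_ms_count series whose status label is not
# SERVING and whose value is above zero. Exits non-zero if any are found.
# Assumption: non-healthy results appear with a different status label value.
unhealthy_checks() {
  awk '
    /^ackal_exporter_check_histogram_ms_count[{]/ && $0 !~ /status="SERVING"/ && $NF > 0 {
      print
      found = 1
    }
    END { exit found + 0 }
  '
}
```

These counts are cumulative, so a match only shows that a check failed at some point since the exporter started. An alerting rule in Prometheus or a compatible tool would more likely watch the increase over a recent time window.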

More Engineering

The main benefit of using Ackal is that you spend less time keeping your systems running ("toil") and more time undertaking novel (interesting) engineering work.