Good Alerts

S7rthak
5 min read · Oct 1, 2024


This write-up summarizes some Opinions™ that I have around good alerts, particularly in the context of infrastructure and platforms. The advice here doesn’t represent universal truths, merely my experience from being on-call for systems both large and small, reliable and unreliable.

The TL;DR is that a good alert is all three of the following:

  • Actionable: When the alert fires, it means that a real human needs to take real action to mitigate and resolve the issue. If a human doesn’t need to take action, then it’s not a good alert.
  • Specific: The alert points to a specific component that is impaired or in danger of becoming impaired.
  • Helpful: The alert provides helpful context around the error, such as the likely causes, how critical the issue is, and where the on-call engineer should go to diagnose and resolve the root cause.

Good alerts don’t just come out of nowhere. They are the result of constant vigilance and refinement. It’s not uncommon for a good alert to succumb to the laws of entropy and become a bad alert when you least expect it. If you want good alerts, you have to periodically weed the garden, trim dead leaves, and water your beautiful alert flowers.

The rest of this document dives into some detail on each quality of a good alert.

Actionable

This is the most critical feature of a good alert: when the alert pages you, it should be indicative of a real problem that requires human intervention to resolve. If it’s not both of those things, it’s probably noise, and noise is the worst thing that an alert can be. Noisy alerts waste your time (and sanity) and create a boy-who-cried-wolf problem whereby you’ll start ignoring real, actionable alerts because you rightfully assume that they’re most likely noise.

Part of being actionable is ensuring that your alerts are based on metrics that people actually care about. For example, it doesn’t matter that a queue worker is using more than 70% of its CPU. What likely does matter is whether the queue latency (the time that messages spend waiting in a queue) is above 10 seconds on average over the past 2 minutes. CPU utilization, memory utilization, network I/O, and disk IOPS are all useful signals when queue latency is high, but they aren’t particularly noteworthy alert signals on their own.
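
For illustration, here’s roughly what that kind of symptom-based alert could look like as a Datadog metric monitor, sketched in Python as the JSON-shaped payload the monitor API accepts. The metric name, tags, threshold, and runbook URL are hypothetical; the point is that the query tracks queue latency, not CPU.

# A minimal sketch of a symptom-based monitor. The metric, tags, and
# threshold are made up; substitute whatever your queue actually emits.
queue_latency_monitor = {
    "name": "Queue latency is high",
    "type": "metric alert",
    # Alert on the symptom users feel (time spent waiting in the queue),
    # not on causes like CPU, memory, or disk IOPS.
    "query": "avg(last_2m):avg:frob.queue.latency_seconds{env:prod} > 10",
    "message": (
        "Messages are waiting more than 10 seconds in the queue on average. "
        "Check worker capacity, crash loops, and recent deploys.\n\n"
        "Runbook: https://<company_name>.atlassian.net/..."
    ),
    "tags": ["team:frob-platform"],
}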

Even better, alerts on high-level metrics that users care about do the job of dozens or even hundreds of individual alerts. For example, an alert on high queue latency will notify you if the workers are under-provisioned, if pods are in a crash loop, if the network I/O is saturated, if a bug is causing workers to fail to make progress, if no pods are running at all, and so on. You don’t need to worry about covering every possible cause in the universe, just worry about the small number of symptoms that you actually want to be woken up for. A good dashboard will help you find the root cause quickly.

Other guidelines for actionable alerts:

  • Prefer percentages or ratios to absolute numbers. For example, an alert on “more than 100 errors in the past hour” may be helpful if you average 1,000 requests per minute, but is pure noise if you grow to 1M requests per minute. Instead, you may find a percentage more helpful, like “5xx error rate > 1% on average over the past hour” (see the sketch after this list).
  • If you find yourself consistently forwarding the alert to another team, talk to that team to see if the alert should be permanently migrated to them. Even if they periodically need to escalate back to you, sending alerts to the teams best equipped to respond lowers the time to recovery and minimizes alert fatigue for teams building infrastructure and platforms used by dozens or hundreds of other teams.
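
To make the first point concrete, here’s a toy Python sketch (not tied to any monitoring product) of why a ratio-based threshold keeps its meaning as traffic grows while an absolute one decays into noise. The thresholds mirror the examples above.

def absolute_alert(errors_last_hour: int) -> bool:
    # "More than 100 errors in the past hour" ignores traffic volume.
    return errors_last_hour > 100

def ratio_alert(errors_last_hour: int, requests_last_hour: int) -> bool:
    # "5xx error rate > 1% over the past hour" scales with traffic.
    if requests_last_hour == 0:
        return False
    return errors_last_hour / requests_last_hour > 0.01

# At 1M requests per minute (60M per hour), 6,000 errors is a 0.01% error
# rate: the absolute alert fires anyway, the ratio alert correctly stays quiet.
print(absolute_alert(6_000))           # True  (noise)
print(ratio_alert(6_000, 60_000_000))  # False (healthy)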

A note on “warnings”: In my experience of being on-call, I have not once seen a team consistently respond to warning alerts. At best, they’re annoying to configure and maintain; at worst (which they always are), they’re pure noise. If it’s important and actionable, it should be an alert. If it’s not, it can be a metric on a dashboard.

Specific

Alerts should be scoped to a reasonable but not overly specific level of granularity. For example, instead of monitoring the average offset lag across all consumer groups in a Kafka topic, monitor each consumer group’s offset lag individually. If you ever find yourself responding to an alert by first trying to figure out which service, component, or system is affected, you should make the alert target a deeper level of granularity.
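
As a sketch, here’s what a per-consumer-group monitor might look like in Datadog’s multi-alert style, again as a hypothetical Python payload. The metric and tag names are illustrative and depend on how your Kafka integration reports lag; the important part is the “by {consumer_group}” grouping, which evaluates each consumer group separately so the page names the exact group that’s falling behind.

consumer_lag_monitor = {
    "name": "Kafka consumer lag is high",
    "type": "metric alert",
    # Grouping "by {consumer_group}" turns this into a multi alert: each
    # consumer group is evaluated independently instead of being averaged away.
    "query": (
        "avg(last_5m):avg:kafka.consumer_lag{env:prod} "
        "by {consumer_group} > 100000"
    ),
    "message": (
        "Consumer group {{consumer_group.name}} is falling behind.\n\n"
        "Runbook: https://<company_name>.atlassian.net/..."
    ),
}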

By making your alerts specific, you’ll save yourself time and stress when responding to alerts. It’s a balance, however. An overly specific alert may be a noisy alert (the worst!), so use judgement. For example, you probably don’t need to alert at the specific pod level for something like high response error rates in an API.

Helpful

If you’re steeped in the details of a particular system, an alert like “high p99 pcx latency” or “scorb rate elevated” might make total sense to you, but it’s complete nonsense to new engineers or people outside the team. Use the alert description field as a place to provide context and point on-call engineers in the right direction. For example:

The rate of S.C.O.R.B. (Standardized Core Operating Requirements Breaches) is
high. This likely means that customers are seeing long page load times or even
errors. If you need help, reach out to #ask-foo-team.

- Region: us1
- Cluster: prod-<custom_name>-c
- Namespace: frob-platform
- Service: frob-xxnxx-canary

- Runbook: https://<company_name>.atlassian.net/...
- Dashboard: https://app.datadoghq.com/...
- System architecture: https://<company_name>.atlassian.net/...

The alert description shouldn’t supplant a true runbook; keep it brief and easily scannable. Link to anything that needs to be longer than a couple sentences.

You can also include things like:

  • If particular teams are likely to be impacted more severely than others by issues with the service/component, call that out and provide a channel or @-tag that should be notified.
  • Link to relevant architecture diagrams or design documents, if they’re not too out of date. This can help on-call engineers understand how the system/component fits into the larger picture and where they might need to look if the root cause lies elsewhere.

Datadog lets you reference the monitor’s group-by tags as template variables in the alert description, which is especially helpful for adding context and creating links to dashboards with the variables pre-filled:

...

- Region: {{region.name}}
- Cluster: {{kube_cluster.name}}
- Namespace: {{kube_namespace.name}}
- Service: {{service.name}}

- Dashboard: https://app.datadoghq.com/...

...

OpsGenie supposedly supports HTML in alert descriptions, but it’s honestly completely broken and doesn’t play nicely with Slack. It also doesn’t render Markdown in Slack messages. You’re best off using the plainest formatting you can get away with: newline-separated paragraphs and “-” prefixed lists.
