How do you manage alerts during code deploys?

Apr 21, 2014 by Berkay Mollamustafaoglu

As organizations embrace DevOps and Continuous Deployment, it’s becoming common to do frequent code deploys, often multiple times a day. Deployments inevitably cause monitoring tools to generate alerts as applications & servers become temporarily unavailable/unresponsive. This can be problematic since these alerts:

generate noise, and unnecessarily interrupt people
mislead people and cause them to waste time chasing down nonexistent problems
erode attention and cause people to miss real problems

Therefore, ops teams need mechanisms to prevent (or at least easily identify) unnecessary alerts triggered by code deploys. Given the increasing frequency of deploys, ideally this needs to be done programmatically as part of the automated deployment pipeline.

Dev/Ops teams employ a wide range of tools to move code into production. As part of the deployment process, it makes sense to integrate with the monitoring tools and services, and to control the generated alerts as well. Most organizations use multiple solutions to monitor their applications and infrastructure. Automation becomes a lot easier if the alerts flow through a single solution that provides the necessary interfaces to control the alert flow programmatically.

Alert management patterns during deploys

Regardless of how the deployment process is orchestrated, whether the process is governed by continuous integration tools, or managed from chat rooms, a few common patterns emerge to manage the alerts generated by monitoring tools during code deploys.

Dropping alerts from monitoring tools during code deploys

Alerts can be stopped at their source by interfacing with the monitoring tools directly. This is a commonly used pattern as it is relatively easy to implement (easier with some tools than others). Alternatively, if another tool like OpsGenie is being used to aggregate alerts, the aggregation solution can be configured to ignore the alerts generated by the monitoring tool. When multiple monitoring tools are being used, disabling alerts at the aggregation level can be easier as it provides a single integration point.

Dropping all alerts from a monitoring tool may not always be a viable (or acceptable) option, particularly when the monitoring tool is used to monitor many different applications. If the monitoring tool is sufficiently flexible, it may be possible to make more targeted configuration changes and stop only the relevant alerts, but this approach often requires significantly more work and planning.

OpsGenie supports configuring one or more “integrations” with monitoring tools. Each integration can be disabled or enabled via the UI or the API, so orchestration tools can use the API to disable alerts during code deploys and re-enable them once the deployment is complete. Support for multiple integration points makes it possible to separate alerts for different applications/services, even when alerts come from the same monitoring tool, provided that the monitoring tool is able to differentiate them.
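The disable-then-re-enable step is easy to get wrong when a deploy fails halfway, so it may help to wrap it in a construct that always restores alerting. The following is a minimal sketch, not OpsGenie's own client: `disable`, `enable`, and `deploy` are hypothetical callables, and in a real pipeline each of the first two would wrap one authenticated API call to the monitoring or aggregation tool.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def alerts_suppressed(disable: Callable[[], None],
                      enable: Callable[[], None]) -> Iterator[None]:
    """Disable an alert source for the duration of a deploy.

    `disable` and `enable` are hypothetical callables supplied by the
    caller; they stand in for whatever API calls the tooling offers.
    """
    disable()
    try:
        yield
    finally:
        # Always re-enable, even when the deploy step raises, so a failed
        # deploy cannot leave alerting switched off.
        enable()


def run_deploy(deploy: Callable[[], None],
               disable: Callable[[], None],
               enable: Callable[[], None]) -> None:
    """Run the deploy step with the alert source disabled around it."""
    with alerts_suppressed(disable, enable):
        deploy()
```

The `finally` clause is the important design choice here: if the deploy raises, the exception still propagates to the orchestration tool, but the integration is switched back on first.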

Tagging alerts during code deploys

Some Dev/Ops teams may not be entirely comfortable with dropping alerts during deploys. Instead of dropping the alerts, they prefer to “mark” them. We use this pattern internally. As part of the deployment process, an OpsGenie policy is enabled to tag relevant alerts with the “deployment” tag. Tagged alerts are filtered out in our notification rules, so users are not disrupted by notifications for these alerts but can still see them in the OpsGenie UI.

We use Slack for team communications, and alerts that have the deployment tag are posted to a different channel. This allows developers who are pushing code into production to monitor the alerts in that channel, without interrupting the rest of the team.
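The tagging pattern boils down to three small decisions: tag the alert if a deploy is running, skip notifications for tagged alerts, and route tagged alerts to their own channel. The sketch below models that logic in plain Python for illustration only; it does not call OpsGenie or Slack, and the `Alert` class, function names, and channel names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Set

DEPLOY_TAG = "deployment"  # tag name taken from the text above


@dataclass
class Alert:
    message: str
    tags: Set[str] = field(default_factory=set)


def tag_if_deploying(alert: Alert, deploy_in_progress: bool) -> Alert:
    """Mimic the policy: mark alerts that arrive while a deploy is running."""
    if deploy_in_progress:
        alert.tags.add(DEPLOY_TAG)
    return alert


def should_notify(alert: Alert) -> bool:
    """Notification rule: tagged alerts stay visible but notify nobody."""
    return DEPLOY_TAG not in alert.tags


def chat_channel(alert: Alert) -> str:
    """Route tagged alerts to a separate channel (hypothetical names)."""
    return "#deploy-alerts" if DEPLOY_TAG in alert.tags else "#alerts"
```

Nothing is discarded in this model: a tagged alert still exists and is still posted somewhere, which is what distinguishes this pattern from dropping alerts outright.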

Delaying alert notifications

Along the same lines as tagging alerts, delaying alert notifications during code deploys is another pattern that aims to provide visibility while minimizing unnecessary interruptions.

OpsGenie supports specifying a time delay for sending alert notifications for a subset of alerts. During deploys, an OpsGenie policy can be enabled to delay the notifications. The alert is still created in OpsGenie, but notifications via push, SMS, phone, etc. are not sent to users for the specified amount of time, say 10 minutes. If the alert is acknowledged or closed within those 10 minutes, users are not notified at all.

During deploys, application processes often get restarted, generating a lot of alerts. These alerts are often ignored (even if they are created) since they are expected. When the application comes back up, alerts typically get closed automatically. If something goes wrong and an alert does not get closed, the appropriate people are notified once the delay expires. This provides a safety net in case the alert was missed among all the others generated during the deploy.
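The delay rule can be stated as a single check: notify only if the delay has elapsed and the alert is still open and unacknowledged. Here is a minimal sketch of that check, assuming a simplified alert record; `DelayedAlert` and `notification_due` are illustrative names rather than part of any OpsGenie API, and the 10-minute default simply mirrors the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DelayedAlert:
    created_at: datetime
    acknowledged: bool = False
    closed: bool = False


def notification_due(alert: DelayedAlert, now: datetime,
                     delay: timedelta = timedelta(minutes=10)) -> bool:
    """Return True once the delay has elapsed for an alert that is
    still open and unacknowledged."""
    if alert.acknowledged or alert.closed:
        # Handled (or auto-closed) within the window: nobody is notified.
        return False
    return now - alert.created_at >= delay
```

An alert that closes automatically when the application comes back up never reaches the notification step, while one that stays open past the delay does.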

Handling alerts properly during code deploys is an essential step in dealing with alert fatigue. When we get used to receiving alert notifications that are not real, we get desensitized to alerts even if we are aware that they are not real.