"Empowering the alert recipients" has been a core principle for our product development since the beginning, driving many of the capabilities that differentiates OpsGenie from the alternatives. We believe that the role of the alerting system does not end with an alert notification that is devoid of any useful information. Sure, we need to make sure the right person is notified when there is a problem, but we cannot declare "mission accomplished" just because we’ve told someone that there is an alert. We believe that if we’re to interrupt someone, at the very least ask for the attention, worst wake them up, we ought to provide the relevant information that would enable them to assess the severity and the urgency of the problem as well.
At OpsGenie, we strive to provide a platform that enables OpsGenie users to design alerts that inform and empower the recipients. Here are some of the core capabilities that enable OpsGenie users to design effective alerts:
An alert model rich enough to convey useful, actionable information
OpsGenie alerts can have as many alert fields and tags as needed, and fields can include URLs. More importantly, information from other tools (images, logs, docs, etc.) can be attached to the alerts as files. These attachments become part of the alert and are therefore accessible via the OpsGenie apps. The ability to attach files to alerts is a simple but quite powerful concept. It means that we can enrich the alerts with information from any source. Graphs for the relevant metrics can be retrieved from performance monitoring tools, and logs can be searched to extract entries that may explain what’s happening. If there are runbooks (and there should be), they can be included with the alert, accessible by the recipients with no friction. We can include configuration information, as well as the change history that may well enable the recipients to quickly home in on the cause of the problem.
Compared to receiving an alert that says “you have an alert” (congratulations?), receiving an alert where the supporting information is available along with the alert is far more useful. OpsGenie makes it easy to provide this information, but alerts still have to be designed. There is no magic here. It still takes brain sweat to identify the right set of information that may provide insight into the cause and severity of the problem. This paradigm works particularly well when the people who will receive the alerts can design their own alerts (self service). Just make it easy for them to include the relevant information and enable them to iterate and improve the alerts as needed.
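To make the contrast concrete, here is a minimal sketch of a bare alert next to a designed one. The `Alert` class, its field names, and the example URL are assumptions made up for illustration; they are not OpsGenie's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    """Hypothetical alert model; names are illustrative, not OpsGenie's schema."""
    message: str
    tags: List[str] = field(default_factory=list)
    # Free-form fields; values may be URLs (runbooks, dashboards, ...).
    details: Dict[str, str] = field(default_factory=dict)
    # Attached files, keyed by file name.
    attachments: Dict[str, bytes] = field(default_factory=dict)

    def attach(self, name: str, content: bytes) -> None:
        self.attachments[name] = content

    def is_bare(self) -> bool:
        """True for a "you have an alert" style alert with no supporting info."""
        return not (self.details or self.attachments)


# The alert nobody wants to be woken up by.
bare = Alert(message="You have an alert")

# A designed alert: context, a runbook link, and evidence travel with it.
designed = Alert(
    message="Checkout API p99 latency above 2s",
    tags=["checkout", "latency"],
    details={
        "runbook": "https://wiki.example.com/runbooks/checkout-latency",
        "last_deploy": "checkout-api build 512, 25 minutes ago",
    },
)
designed.attach("recent-errors.log", b"12:01:07 ERROR upstream timeout\n")

print(bare.is_bare(), designed.is_bare())
```

A check like `is_bare()` could be run by whoever owns the alert definitions to spot alerts that still need design work.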
Automated actions to enrich the alerts
OpsGenie supports a simple yet powerful mechanism that can be used to add the relevant information to alerts. Whenever an alert is created, alert data is passed to an application called Marid. Marid can execute a script that gathers the relevant information directly from systems and applications or from management tools, and attaches the information to the alert. This is a generic mechanism that works regardless of which tool generated the alert, and it can gather information from any tool or service that provides some sort of programmable interface, whether it’s a web-based API or a CLI.
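The enrichment step described above can be sketched roughly as follows. This is not Marid's scripting interface: the `enrich` function, the collector names, and the canned log and config data are all hypothetical stand-ins, and a real script would call the relevant tool's API or CLI and then attach the results to the alert.

```python
import json
from typing import Callable, Dict

# Hypothetical collector signature: takes the alert data, returns text to attach.
Collector = Callable[[dict], str]


def recent_log_lines(alert: dict) -> str:
    """Stand-in for a log search; a real script would query the log tool."""
    sample_log = [
        "12:00:58 INFO  request ok host=web-1",
        "12:01:07 ERROR upstream timeout host=web-2",
        "12:01:09 ERROR upstream timeout host=web-2",
    ]
    host = alert.get("host", "")
    return "\n".join(line for line in sample_log if f"host={host}" in line)


def config_snapshot(alert: dict) -> str:
    """Stand-in for fetching configuration from a management tool."""
    return json.dumps({"host": alert.get("host"), "pool_size": 20}, indent=2)


def enrich(alert: dict, collectors: Dict[str, Collector]) -> Dict[str, str]:
    """Run every collector and return attachments keyed by file name.

    One failing collector should not block the others, so an error is
    recorded as the attachment text instead of being raised.
    """
    attachments = {}
    for file_name, collect in collectors.items():
        try:
            attachments[file_name] = collect(alert)
        except Exception as exc:
            attachments[file_name] = f"collection failed: {exc}"
    return attachments


alert = {"message": "High error rate", "host": "web-2"}
attachments = enrich(
    alert,
    {"errors.log": recent_log_lines, "config.json": config_snapshot},
)
for name, content in attachments.items():
    print(f"--- {name} ---\n{content}")
```

Keeping collectors independent of the tool that raised the alert mirrors the point made above: the same script can serve alerts from any source.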
Executing actions interactively to respond to alerts
OpsGenie supports defining “custom actions” when creating alerts. This allows users to execute relevant actions directly from the OpsGenie apps when they receive the alerts. When a user initiates an action from OpsGenie, the alert data and the action executed by the user get passed to Marid. Marid can execute the corresponding script to gather the relevant data and post it to the alert. This capability not only enables users to execute investigative and even corrective actions rapidly from the OpsGenie apps, but also provides visibility into everything that is done to investigate the problem, since the actions and the results are posted to the alert. As a result, if the alert is escalated, team members joining in don’t have to start over.
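A rough sketch of that flow: an action name arrives with the alert data, the matching script runs, and the outcome is recorded on the alert so that later responders can see it. The `ACTIONS` registry, the `handle_action` function, and the canned results are hypothetical; the alert is modeled as a plain dictionary rather than through any OpsGenie or Marid API.

```python
from typing import Callable, Dict

# Hypothetical registry of custom actions: action name -> script to run.
# The lambdas return canned text; real scripts would query or change systems.
ACTIONS: Dict[str, Callable[[dict], str]] = {
    "check-disk": lambda alert: f"disk usage on {alert['host']}: 91%",
    "restart-service": lambda alert: f"restarted {alert['service']} on {alert['host']}",
}


def handle_action(alert: dict, action: str, user: str) -> str:
    """Run the script for `action` and record the outcome on the alert."""
    script = ACTIONS.get(action)
    result = script(alert) if script else f"unknown action '{action}'"
    note = f"{user} ran {action}: {result}"
    # Every action and its result stays on the alert for whoever joins later.
    alert.setdefault("notes", []).append(note)
    return note


alert = {
    "message": "Disk almost full",
    "host": "db-1",
    "service": "postgres",
    "actions": list(ACTIONS),  # the custom actions offered to the recipient
}
handle_action(alert, "check-disk", user="alice")
handle_action(alert, "restart-service", user="bob")
print("\n".join(alert["notes"]))
```

Because each note names the user, the action, and the result, someone picking up an escalated alert can read the history instead of repeating the same checks.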
Bi-directional integration with group chat tools to enable ChatOps
OpsGenie provides direct integration with Campfire, HipChat and Slack, and can be integrated with IRC and others via webhooks or Marid. At OpsGenie, we use group chat rooms for team communications. We’ve integrated most of the tools we use for development and operations with our chat system, and we forward alerts to the chat system as well. The information posted by these tools, as well as the actions executed by team members, provides valuable context that can help in the analysis of alerts. Upon receiving an alert notification, a quick look at the chat history can reveal what changed recently, what others are working on, etc. Here is a short screencast on executing investigative actions from a chat room using OpsGenie, for a glimpse of what can be achieved using this approach.
The realization that we do need to “design” alerts is the first and probably the most important step. Regardless of what your tooling is, do invest time in designing effective alerts. It’s worth it, and future you will thank you for it.