<p>Alerting is largely a signal-to-noise problem - catching critical problems while trying not to drown in a sea of data. Put another way, we don’t want to miss any critical problems, and we don’t want too many alert notifications.</p><p>OpsGenie strives to improve the lives of alert recipients, so let’s take a look at how OpsGenie does its part to tackle this formidable challenge:</p><h3><strong>"We don’t want to miss problems"</strong></h3><p>OpsGenie has many features to ensure alerts are not missed:</p><ul><li>Users can receive alert notifications through <a href="https://www.opsgenie.com/public/features/multiple.channels.html">multiple notification channels</a> (SMS, push, email, phone, chat, etc.).</li><li>OpsGenie can <a href="https://www.opsgenie.com/public/features/escalations.html">automatically escalate</a> to other users if the initial user does not acknowledge the alert within a specified amount of time.</li><li>OpsGenie can renotify users if the alert is not closed within a specified amount of time.</li></ul><p>So it’s safe to say that if you don’t want to miss alerts, OpsGenie has you covered. We can send many notifications every which way to ensure that you are notified when there is a problem. Yet this often introduces a different problem, which brings us to the second part of this challenge.</p><h3>"We don’t want too many alert notifications"</h3><p>Too many alert notifications are just as bad as too few. In fact, it is often easier to miss a critical alert when there are many alerts. When we receive too many notifications, alert fatigue sets in, and we stop paying (sufficient) attention to the alerts. “Attention” has become a scarce resource, so it is essential to minimize how much attention we demand from alert recipients. 
Here are some of the capabilities OpsGenie provides to minimize the number of alert notifications:</p><p><strong>- Escalations</strong></p><p>Escalations can significantly reduce the number of notifications users receive. Instead of notifying all users at once, OpsGenie can use escalation policies to notify the first tier of users first. Users outside the first tier are notified only if the first-tier users do not acknowledge the alert within the allocated time.</p><p><strong>- On-call schedules</strong></p><p>Most ops organizations take turns responding to alerts - at least during off hours - instead of disrupting everyone with alert notifications (attention is a scarce commodity). OpsGenie schedules allow notifying only the on-call person first. The on-call person can then analyze the problem and either address it or escalate to the right people.</p><p><strong>- Bulk actions</strong></p><p>Users can select multiple alerts and acknowledge or close them in bulk (currently web UI only), or use the “acknowledge all” or “close all” actions to acknowledge or close multiple alerts with a single action. These actions are executed asynchronously at the back end so as not to block the user.</p><p><strong>- Mute notifications</strong></p><p>Users can “Mute” OpsGenie. When muted, OpsGenie stops sending notifications to that user for the next 5 minutes. Users can also turn notifications off or on at any time using the OpsGenie apps.</p><p><strong>- Alert deduplication</strong></p><p>OpsGenie alerts can be deduplicated using the “alias” field, which is a user-defined unique identifier for alerts. Using the alias field, multiple related alerts can be consolidated into a single alert, reducing noise. The alias field also makes it possible to automatically close alerts in OpsGenie when monitoring tools send recovery alerts.</p><p><strong>- Transient alerts</strong></p><p>OpsGenie can delay notifications to users for some alerts. 
Delaying notifications prevents transient alerts from disrupting users unnecessarily. If the problem is resolved within the specified time frame, users are not notified. Users can still see the alerts in the OpsGenie apps whenever they want. Delaying notifications also allows users who happen to be in front of their computers to work with the alerts and acknowledge them when appropriate, so that others are not notified unnecessarily.</p><p><strong>- Suppressing alert notifications</strong></p><p>OpsGenie can suppress notifications for some alerts. This can be quite handy during deployments, etc. to prevent superfluous notifications while still providing visibility into the alerts. Suppressing the notifications rather than not creating alerts (as is often done) is often the better approach: it still provides insight into what’s happening, and it eliminates the risk of a problem falling through the cracks because alerts were expected. This is one area where OpsGenie’s approach of separating alerts from notifications clearly shows its impact.</p><p><strong>- Notification rules</strong></p><p>OpsGenie notification rules empower users to control how they are notified. Users can define notification rules to receive notifications through different methods based on time of day and alert content; for example, users can limit notifications to only critical alerts at certain times. By providing this capability to users, OpsGenie also significantly reduces the administrative burden, since admins no longer have to repeatedly adjust rules to satisfy continuously changing user preferences.</p><p>Notification rules also reduce the number of notifications by allowing users to apply different notification methods in sequence. For example, a user can define a rule to receive email and push notifications immediately, an SMS notification 2 minutes later, and a phone notification 5 minutes later. 
OpsGenie sends notifications until the user “sees” the alert, so if the user sees the alert after receiving the email or push notification, OpsGenie does not send the SMS or phone notification unnecessarily, since the user is already aware of the alert.</p><p><strong>- Notification aggregation</strong></p><p>When there are multiple notifications, OpsGenie aggregates the SMS and phone notifications and sends a single aggregated notification, reducing both the noise level and the cost to customers (especially for international users).</p><p>Improving the signal-to-noise ratio is a continuous struggle. We march on.</p>
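<p>To make the notification rule example above concrete, here is a minimal sketch of the "notify until seen" behavior in Python. It is only an illustration of the idea, not OpsGenie's implementation or API: the names <code>Step</code>, <code>RULE</code>, and <code>methods_sent</code> are hypothetical, and the timings are the ones from the example (email and push immediately, SMS after 2 minutes, phone after 5 minutes).</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of a hypothetical notification rule."""
    delay_minutes: int  # minutes after alert creation
    method: str         # e.g. "email", "push", "sms", "phone"


# Hypothetical rule mirroring the example in the text.
RULE = [
    Step(0, "email"),
    Step(0, "push"),
    Step(2, "sms"),
    Step(5, "phone"),
]


def methods_sent(rule: List[Step], seen_after_minutes: Optional[float]) -> List[str]:
    """Return the methods that fire before the user sees the alert.

    seen_after_minutes is None when the user never sees the alert,
    in which case every step in the rule fires.
    """
    sent = []
    for step in sorted(rule, key=lambda s: s.delay_minutes):
        # Once the user has seen the alert, later steps are skipped.
        if seen_after_minutes is not None and step.delay_minutes >= seen_after_minutes:
            break
        sent.append(step.method)
    return sent


# User opens the alert one minute in: SMS and phone are never sent.
print(methods_sent(RULE, 1))
# User never looks: all four methods fire in order.
print(methods_sent(RULE, None))
```

<p>In this sketch, a user who sees the alert within the first two minutes only ever receives the email and push notifications, which is the noise reduction the rule is meant to achieve.</p>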