Channel: OpsGenie Blog

Role of alert notifications in IT Operations

<p>Mathias (<a href="http://twitter.com/roidrage" target="_blank">@roidrage</a>) of <a href="https://travis-ci.org/" target="_blank">Travis CI</a> has an <a href="http://www.paperplanes.de/2013/1/2/on-pager-duty.html" target="_blank">excellent blog post</a> on the operations of a hosted product and the role of alerting. It&#8217;s a good read for anyone who is in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve. <br/><br/> At <a href="http://www.opsgenie.com">OpsGenie</a>, our goals are highly relevant to the topics discussed in the post. We provide <strong>alert &amp; notification management tools that enable ops teams to manage the entire alert life cycle</strong>: what happens from the time an alert is generated until the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:</p><h4>Should developers of platforms/applications be fully involved in the operations side of things?</h4><p>For small to medium-sized teams, I believe the most efficient and effective model is for the developers to be directly involved in operations, especially for mission-critical applications. After all, no one knows the applications better than the developers. This modus operandi feels a lot more comfortable for modern applications running on cloud infrastructure. As stated in the post, being part of operations vastly improves how developers think about the application, the production environment, and the requirements of resilient systems, and it leads them to write code that makes the software more operable. The advantages are well articulated in the post, and based on past experience I couldn&#8217;t agree more. 
<br/><br/> For critical applications with high availability requirements, having the very developer who implemented the code (or has good knowledge of it) at hand is invaluable in resolving issues rapidly. As such, at OpsGenie, developers are heavily involved in operations as well, to ensure potential issues are handled before they have any impact on our customers. We also understand that, as good as this is for the operations of our service, if we&#8217;re not careful, involving developers in daily operations can drastically slow down the pace of development. To mitigate the risk of bogging down developers with operational duties such as responding to alerts, we automate whenever we can and implement tools that enable team members to be as efficient as possible. Many of these capabilities are provided to our customers as part of the OpsGenie service as well (some of them are mentioned below). <br/><br/> To be clear, in the (large) enterprise, however, there are often hundreds, if not thousands, of applications supported by the operations teams. Many of these applications are not built by internal development teams, or their developers are no longer around. As such, operations processes in these environments tend to be very different, and it&#8217;s important to recognize these differences.</p><h4>Empowering the alert recipients with knowledge - Playbooks</h4><p>Creating playbooks, aka runbooks, for alerts is one of the most effective ways to increase operational efficiency and reduce the dependency on individuals. At OpsGenie, we assign a different code to each exception generated by the application and to each alert generated by monitoring tools or by our applications themselves. Each alert code is associated with a short description that explains what the alert is about, whether it&#8217;s critical, etc. This also allows us to create playbooks starting with the more common alert codes. 
When an alert is generated, the recipient can use this information to make an initial assessment of the alert and determine its urgency, the right person to handle the problem, etc. <br/><br/> One of the key OpsGenie features that enable playbooks is the ability to <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attach files to alerts</a>, either via the UI or the API. Using this capability, OpsGenie customers can attach the relevant playbook to the alert, empowering its recipients to determine the right course of action. Combining the runbook with additional information such as configuration data, change history, etc. (take a look at a sample alert in OpsGenie for an example) enables the recipients to make that determination quickly. Recipients don&#8217;t have to scramble to find a computer, connect to a variety of systems, etc. to collect the information necessary for an assessment. <br/><br/> In addition, OpsGenie allows <a href="http://support.opsgenie.com/customer/portal/articles/761242-alert-action-execution">alert recipients to execute actions</a> to collect further data or initiate remedial actions directly from their mobile devices, without needing access to a computer, network connection, VPN, etc. For example, recipients can <a href="http://www.opsgenie.com/public/features/alert.actions.html">initiate an action</a> that executes a script to collect additional data relevant to the problem and attaches it to the alert, making it available to all recipients. <br/><br/> I believe these are precisely the types of things operations teams need to be able to do in order to improve operational efficiency and minimize the overhead of handling alerts. 
Efficient and effective operations require empowering alert recipients to handle most alerts directly from their mobile devices, and we&#8217;re building the underlying infrastructure to enable operations teams to implement these capabilities.</p><h4>Can I rely on the alert system to get the notifications?</h4><p>One of the main problems we&#8217;ve tackled with OpsGenie is increasing the reliability of notifications. We concluded early on that we could not rely on any single notification channel alone and that we had to use <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification channels</a>. When a single channel like SMS is used for alert notifications, not only may a notification not be delivered in a timely manner (SMS is best effort and delivery is not guaranteed), but recipients may also simply miss it. OpsGenie leverages iPhone and Android push notifications, SMS, and phone calls for notifications, and allows users to put them in an order with a time delay between them. OpsGenie tries each method until the recipient sees the alert or the alert is acknowledged by one of the recipients. There are also a number of other capabilities, such as <a href="http://www.opsgenie.com/public/features/tracking.html">detailed tracking</a> and <a href="http://support.opsgenie.com/customer/portal/articles/759603-heartbeat-monitoring">heartbeat monitoring</a>, that improve the reliability of the system overall.</p><h4>Who gets up in the middle of the night when an alert goes off?</h4><p>OpsGenie currently allows specifying <a href="http://support.opsgenie.com/customer/portal/articles/551517-creating-alerts-via-the-web-ui">users and/or groups</a> as the recipients of an alert when the alert is created. At OpsGenie, for our own alerting, we chose to specify the group name as the recipient. This allows us to modify group membership as needed to determine who will be notified of alerts. Team members can add themselves to or remove themselves from the group at any time. 
And with the recently added &#8220;<a href="http://support.opsgenie.com/customer/portal/articles/912552-escalations">escalations</a>&#8221; capability, it is now possible to have tiered notifications. For example, using escalations, OpsGenie users can now notify a user or a group first, and notify the larger team if the alert is not acknowledged within x minutes. This approach wakes up only the designated team member(s) for alerts, but if that user does not attend to the alert for whatever reason, other team members get notified. <br/><br/> OpsGenie does not yet support on-call schedules, where the on-call team member is assigned automatically based on date and time (though alerts can be sent to a group and group members can be changed). This is one of the more commonly requested features, so we will work on it in the near future. If you&#8217;re an OpsGenie user and have requirements or thoughts in this area, please do <a href="http://support.opsgenie.com/customer/portal/emails/new">share them with us</a>, as we&#8217;re actively working on the design of this feature. <br/><br/> Sending alert notifications is only the first step of effective alerting. At OpsGenie, we believe what happens next really matters, and we strive to make it as painless as possible for ops teams to manage the process. <br/><br/><a href="http://twitter.com/roidrage" target="_blank">@berkay</a></p>
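<p>To make the ordered channels and tiered escalation described above more concrete, here is a minimal Python sketch. It is not OpsGenie code and does not use the OpsGenie API. The names <code>Step</code>, <code>build_plan</code>, and <code>steps_fired</code>, the channel names, and the delay values are all hypothetical and chosen only for illustration. The sketch assumes that the designated user is tried on each channel in order with a fixed gap, that the larger team is tried the same way once the escalation delay has passed, and that nothing further is sent after the alert is acknowledged.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One planned notification (hypothetical model, not an OpsGenie type)."""
    delay_minutes: int  # minutes after the alert was created
    target: str         # user or group to notify
    channel: str        # e.g. "push", "sms", "voice"


def build_plan(user: str, team: str, channels: List[str],
               channel_gap: int, escalate_after: int) -> List[Step]:
    """Build the ordered notification steps for a single alert.

    The designated user is tried on each channel in order, channel_gap
    minutes apart. Starting at escalate_after minutes, the larger team
    is tried on the same channels in the same order.
    """
    steps = [Step(i * channel_gap, user, ch) for i, ch in enumerate(channels)]
    steps += [Step(escalate_after + i * channel_gap, team, ch)
              for i, ch in enumerate(channels)]
    # sorted() is stable, so steps with equal delays keep their relative order.
    return sorted(steps, key=lambda s: s.delay_minutes)


def steps_fired(plan: List[Step], acked_at: Optional[int] = None) -> List[Step]:
    """Return the steps that are sent before the alert is acknowledged.

    acked_at is the acknowledgement time in minutes after creation,
    or None if nobody ever acknowledges the alert.
    """
    return [s for s in plan if acked_at is None or s.delay_minutes < acked_at]


if __name__ == "__main__":
    plan = build_plan("alice", "ops-team", ["push", "sms", "voice"],
                      channel_gap=2, escalate_after=10)
    # Alice acknowledges 3 minutes in, so the team is never notified.
    for step in steps_fired(plan, acked_at=3):
        print(step.delay_minutes, step.target, step.channel)
```

<p>In this sketch, an acknowledgement at minute 3 stops the sequence after the first two steps for the designated user. If the alert is never acknowledged, all six steps fire and the larger team is notified from minute 10 onward.</p>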
