<p>Erik Budin of <a href="http://www.sciencelogic.com">ScienceLogic</a> has a great <a href="http://blog.sciencelogic.com/getting-the-right-event-to-the-right-person/06/2013">blog post</a> that describes the integration of ScienceLogic with (our competitor) <a href="http://www.pagerduty.com">PagerDuty</a>. Kudos to both parties for coming up with a well-thought-out, bi-directional integration that goes well beyond the alerting integration supported by many of the monitoring solutions on the market! We believe that to truly enable operations teams to work effectively, monitoring and alerting integration needs to be much richer than just forwarding alerts. Hence, it’s good to see this type of effort implemented and described in detail. Erik starts the blog post with a real-world scenario that has become possible with the integrated solution:</p><p>“It’s Friday night. You’re at a friend’s place enjoying the evening’s activities when you get an SMS alert from work. You’re thinking, ‘I’m not on-call tonight for data center issues? The on-call tech should be on this anyway.’ Next thing you know, the phone rings and this time it’s an automated pre-recorded message indicating that your attention is required to address a critical service issue. You step aside from all the fun and login to your notification service portal and check the issue. The normal on-call engineer was unresponsive and you’re next in line in the notification workflow; one step from the Operations Director (your boss).</p><p>There is a critical incident reported by the ScienceLogic network management system that is being handled by your PagerDuty service. First, you acknowledge the incident so your boss doesn’t get the next call out. Next, you click on the embedded link in the PagerDuty incident and it prompts you to login to your ScienceLogic portal on your tablet. Up comes the source event and you can see the issue causing a potential service disruption.
Within a few clicks, you’re off troubleshooting and addressing the problem while the others at the party wonder where you are. After a few minutes resolving the issue, a new event appears in your ScienceLogic Smart IT software indicating service is back to normal. You double check with PagerDuty and you see that the original incident has already been resolved. It’s back to the party and back to having a fun Friday evening.”</p><p>This is indeed a good solution, yet it leaves a lot to be desired. Notice that the recipient of the alert has very little information, no more than “something is wrong”. What is the problem? What’s the context? How urgent is it? Why did he get notified? Not surprisingly, the scenario continues with the alert recipient first logging in to the notification service portal to figure out why he got notified, and then to the ScienceLogic portal to figure out what the problem is. And that’s where the problem lies. <strong>Monitoring and alerting solutions have failed the recipient of the alert by not providing the information he needs to make a rapid assessment of the problem.</strong> As a result, the recipient has to access multiple systems to gather the information. It does not have to be this way. The alert can contain at least the minimal information that would enable the recipient to make a quick assessment and determine what the next steps should be. In addition, I’d wager that the vast majority of ScienceLogic implementations are NOT accessible from outside the corporate network. The recipient cannot access the ScienceLogic system without first finding a computer and connecting to the corporate network. As such, this is often not a feasible solution when users are mobile. And even when they are not, the recipient would have to spend a lot of time accessing the systems just to be able to make an assessment. It does not have to be this way. This is one of the fundamental problems we strive to solve with OpsGenie.
It is our core belief that alerts should contain sufficient information about what the problem is to empower the alert recipients. The fundamental challenge for alerting and monitoring systems today is NOT “waking people up”. That’s largely a solved problem (using <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification methods</a> and <a href="http://www.opsgenie.com/public/features/escalations.html">escalations</a>). Rather, the challenge is enabling the alert recipients to handle alerts as efficiently as possible. We need to enable the alert recipients to make good decisions quickly, by assessing whether or not the problem is critical, whether it can wait or not, and if it cannot, who the best person to handle the problem may be (if not the recipient).</p><p>So how would we do it differently? ScienceLogic has a wealth of management data: performance metrics, configuration data, etc. When ScienceLogic creates an alert, it can <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attach all the relevant information</a> (alert history, device configuration, performance metrics, logs, etc.) to the alert. Alert recipients would have access to all this information directly from their devices using OpsGenie apps, and can make use of this information to determine what to do next. Recipients can also <a href="http://www.opsgenie.com/public/features/alert.actions.html">take actions</a> directly from OpsGenie apps: acknowledge the alert, comment on it, or execute commands like traceroute to gather additional information. User actions would automatically be synced to ScienceLogic via OpsGenie <a href="http://support.opsgenie.com/customer/portal/articles/1072547">callbacks</a> (no need to poll the service every 60 seconds to see what changed). In short, OpsGenie extends monitoring tools to mobile devices. Our solution focuses on what happens after an alert is generated by a monitoring system.
We continue to strive to make the alert recipient’s life a little less unpleasant every day. <a href="http://twitter.com/berkay">@berkay</a></p>