<p>Data generated by monitoring systems can be used to support operations processes in different ways, and I think it’s useful to understand the distinction between the two core uses:</p><h4>Troubleshooting problems</h4><p>Modern applications rely on complex technology infrastructures with many levels of abstraction and consist of dozens of interdependent applications, as well as applications and services managed by third parties. Saying that troubleshooting problems and identifying root causes in such complex environments can be “challenging” is likely a gross understatement. The operational data generated by monitoring tools, as well as by the applications and systems themselves, is invaluable in facilitating the troubleshooting process. When troubleshooting a problem, we want all the data we can get our hands on. We want the log files, the metrics (resource utilization, response time, application performance, business, etc.), configuration data (entities, relationships, change history, etc.) and all other relevant data. The more data the merrier, provided that data is stored and organized in a way that makes it easy to access, query, filter, correlate and analyze. We want to be able to compare and contrast, perform time-based correlation, topology-based correlation, etc. to understand the problem, eliminate potential causes and eventually determine the root cause. <br/><br/> This mode of operations (sometimes referred to as bottom-up monitoring) does not necessarily require a thorough understanding of the potential problems in advance. Instead, it requires scalable and flexible systems that can handle large amounts of loosely structured data. For most organizations that rely on traditional tools, handling this type of data in a cost-efficient way has not been a viable option until recently. As a result, metric data gets aggregated (averages lie!), events get created based on logs, then normalized, filtered, deduplicated, etc., and a lot of information gets lost in the process. In addition, most organizations use monitoring tools with disparate data stores, so it’s often not possible for operations to analyze all the relevant data together. Thanks to big data technologies, however, this has become less of a challenge. Operations teams can now take advantage of tools like <a href="http://graphite.wikidot.com/">Graphite</a> and <a href="http://opentsdb.net/">OpenTSDB</a> to store and analyze vast amounts of metric data, and use the <a href="http://logstash.net/">Logstash</a> / <a href="http://www.elasticsearch.org">ElasticSearch</a> / <a href="http://kibana.org/">Kibana</a> combo to aggregate and search all their log files. The popularity of <a href="http://www.splunk.com">Splunk</a> can be attributed to the fact that it was the first enterprise solution that could handle large amounts of unstructured operational data, primarily logs but metrics as well. <br/><br/> It is important to note that although it’s possible to code the intelligence to analyze this data programmatically and identify problems, by and large this approach requires carbon-based intelligence to perform the analysis. As such, the interface to the data, the ease of access, the query capabilities and the visualization of the data play a crucial role in enabling operations folks to make effective use of it.</p>
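<p>As a concrete illustration of how little it takes to get metric data into one of these tools, here is a minimal sketch that pushes a data point to Graphite via Carbon’s plaintext protocol (one “path value timestamp” line per data point, sent to the default plaintext port 2003). The host name and metric path are placeholders; adjust them for your own environment:</p>
<pre><code>import socket
import time

# Hypothetical endpoint; point this at your own Carbon/Graphite instance.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003  # Carbon's default plaintext listener port


def send_metric(path, value, timestamp=None):
    """Send one data point using Graphite's plaintext protocol:
    'metric.path value unix_timestamp' followed by a newline."""
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example: record the web tier's response time so it can be graphed and
# correlated with other series later.
send_metric("webapp.frontend.response_time_ms", 123.4)
</code></pre>
<p>Once the data points are in Graphite (or OpenTSDB), they can be graphed, compared against other series and correlated with change events during an investigation.</p>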
<h4>Alerting</h4><p>The goal of alerting is to detect problems that, now or in the future, may have a negative impact on the services provided, and to notify the right set of people. There are many different ways to generate alerts, including:</p><ul><li>thresholds on metric (time series) data</li><li>parsing log files for specific keywords</li><li>active checks of the state of applications, systems, etc.</li></ul><p>Many organizations confuse the monitoring requirements for troubleshooting problems, as described above, with alerting. As a result, they generate alerts for everything that is monitored and end up with far too many alerts. I’ve worked with many organizations where the number of active alerts runs into the tens of thousands. In these organizations, alerts are not primarily used to detect problems and notify the appropriate people. Rather, alerts are mostly used as a troubleshooting tool, and a poor one at that, since alerts do not contain all the data. When creating alerts, thinking about the required action provides good guidance on whether or not an alert should be generated at all. What do we expect the recipients of the alert to do? Will the recipients be able to figure out what the problem is, what the impact is, how urgent it is, or at least how to start troubleshooting it? If there is no clear answer to these questions, chances are the alert is not actionable and hence has limited value. <br/><br/>A “top down” approach to alerting is to determine what the potential problems are and what data may be needed to diagnose them. We can start from the problems that would have the worst impact, are most likely to happen, etc., and work our way down from there. This approach has a number of advantages:</p><ul><li>it may be possible to implement the solution faster, since the data we need to collect is often a small subset of all the monitoring data we could collect</li><li>we can classify and assign a code to each problem. Even better, we can document the troubleshooting and recovery procedures, the impact and the severity of each problem, providing handy runbooks that empower the recipients to respond to alerts effectively</li><li>we can tune the content of each alert to the specific problem to make analysis easier. What additional information should be provided with the alert? Metric data trends, configuration data, change history, alert history, etc.</li></ul><p>Looking at it from this perspective, it should be clear that creating an alert when CPU utilization is over a set threshold is almost never a good idea; it is simply not actionable information. “Server will run out of disk space within x minutes”, however, is very much an actionable alert, and since running out of disk space is also a common problem, it makes sense to think about how we can collect the necessary data and diagnose it.</p>
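<p>To make that concrete, here is a minimal sketch of such a check, assuming free-space samples are already being collected elsewhere. The alert window, host and mount names are illustrative, and a real implementation would feed a proper alerting pipeline rather than print to stdout:</p>
<pre><code>import time

# Illustrative threshold: alert if the disk is projected to fill within the hour.
ALERT_WINDOW_MINUTES = 60


def minutes_until_full(samples):
    """Estimate minutes until free space reaches zero from a list of
    (unix_timestamp, free_bytes) samples, using a simple linear trend
    between the first and last sample. Returns None if there is not
    enough data or free space is not shrinking."""
    if len(samples) &lt; 2:
        return None
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    elapsed = t1 - t0
    rate = (free1 - free0) / elapsed if elapsed else 0.0  # bytes per second
    if rate >= 0:
        return None  # flat or growing free space: nothing to predict
    return (free1 / -rate) / 60.0


def check_disk(host, mount, samples):
    """Emit an actionable alert when the mount is predicted to fill up soon."""
    eta = minutes_until_full(samples)
    if eta is not None and eta &lt;= ALERT_WINDOW_MINUTES:
        # A real implementation would page the on-call rotation and attach
        # the trend data, recent change history and a link to the runbook.
        print(f"ALERT: {host}:{mount} will run out of disk space in ~{eta:.0f} minutes")


# Example: two synthetic samples taken ten minutes apart show /var shrinking fast.
now = time.time()
check_disk("web01", "/var", [(now - 600, 5 * 1024**3), (now, 2 * 1024**3)])
</code></pre>
<p>The point is not the specific math (a linear trend is crude); it is that the alert is tied to a known problem, carries an estimate of urgency, and can reference a documented recovery procedure.</p>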
<p>Monitoring for troubleshooting and alerting are both useful and necessary disciplines. When implementing monitoring solutions, understanding the differences between the two and what the primary goal is goes a long way in providing the clarity needed to plot the right course. <br/><br/><a href="http://twitter.com/berkay">@berkay</a></p>