Smarts notifications on mobile devices

August 3, 2012, 7:48 am

EMC <a href="http://www.emc.com/it-management/smarts/index.htm"> Smarts (Ionix)</a> Service Assurance Manager (SAM) “tools” enable operators to execute custom actions interactively from the Smarts console, and “escalation policies” enable automated responses to problems detected by Smarts root-cause analysis engines. <a href="http://www.opsgenie.com">OpsGenie</a> is a cloud-based service that provides <a href="http://www.opsgenie.com/public/features/rich.notifications.html">rich alert notifications</a> and mobile response capabilities. Leveraging Smarts tools and escalation policies, OpsGenie extends Smarts’ root-cause analysis capabilities to mobile users. When Smarts detects a critical problem that requires attention, OpsGenie notifies the users through <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification channels</a> (SMS, mobile push, voice, etc.) and enables the recipients to view the alert directly from their mobile devices. Here is how it works:<ol><li>A Smarts server tool or escalation policy executes a shell script for Smarts events.</li><li>The shell script uses the OpsGenie <a href="http://support.opsgenie.com/customer/portal/articles/574596-lamp-command-line-interface-for-opsgenie">lamp utility</a> to forward the alert to the OpsGenie service.</li><li>OpsGenie notifies the specified recipients through <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple channels</a> (iPhone/Android push notifications, email, SMS, phone calls, etc.) according to users’ preferences.</li><li>The lamp script also retrieves additional relevant information, such as properties of the object the problem is about, 
from Smarts and other servers, generates mobile-friendly HTML files, and <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attaches the files</a> to the alert in OpsGenie.</li><li>Recipients view the alert and the supporting information using OpsGenie mobile apps (<a href="http://itunes.apple.com/us/app/opsgenie/id528590328?ls=1&mt=8">iPhone app</a>, web app, etc.).</li><li>Recipients execute appropriate actions (acknowledge, take ownership, etc.) directly from the mobile apps to communicate with others, execute searches, initiate remedial processes, etc.</li><li>OpsGenie passes the user-executed action information to the customers’ systems.</li><li>All the activity around the alert (when the alert was created, when the recipients were notified, when they viewed the alert, which actions they executed, etc.) is <a href="http://www.opsgenie.com/public/features/tracking.html">tracked and reported</a> by OpsGenie.</li></ol>OpsGenie can be used not only to rapidly notify the right people when there is a problem, but also to give stakeholders access to incident-focused management information without having to open up all management systems. Internal and external users can be kept up to date about the state of the services they use through simple apps.<img alt="image" src="http://media.tumblr.com/f2f060ffdf9c37b1a966c2c1e9421b0c/tumblr_inline_mu3bqgHdUo1soq1dj.png"/> Like most IT management tools, Smarts can send notifications via email, and there are software products that enable Smarts to send notifications via SMS. So why use OpsGenie? Here are some of the reasons: <ul><li>For critical alerts, “fire and forget” email/SMS notifications are not suitable. Recipients may not see email or text messages in a timely manner for a variety of reasons, and Smarts has no way to determine whether an alert has been seen by a recipient. 
OpsGenie uses <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification channels</a> (email, SMS, mobile push notifications, phone calls, etc.) to notify the users to <a href="http://www.opsgenie.com/public/features/reliable.notifications.html">ensure delivery</a>, and <a href="http://www.opsgenie.com/public/features/tracking.html">tracks</a> whether recipients have seen the alert or not. Users can configure OpsGenie to try multiple notification methods in succession until the recipients see the alert. For example, a user can configure OpsGenie to send an iPhone push notification first, and make a voice call if the user does not view the alert within X minutes.</li><li>OpsGenie alerts are not limited to short text messages. OpsGenie allows alerts to have <a href="http://www.opsgenie.com/public/features/rich.notifications.html">multiple optional fields, tags, and attached files</a>. For example, for Smarts alerts, object details are retrieved from Smarts servers and attached to the alert as an HTML file, providing context and enabling recipients to determine the best course of action directly on their mobile devices. Even when OpsGenie sends notifications via SMS, the text message includes a link to the OpsGenie web app so that users can view all alert details. In addition, OpsGenie uses multiple notification methods to overcome delays in SMS or email delivery and ensure recipients receive the notifications.</li><li>OpsGenie is a cloud-based service and recipients do not need access to Smarts servers to view alert details. All alert data and other supporting information are stored in OpsGenie systems and available through OpsGenie apps regardless of where the recipients may be, as long as they have access to the Internet. 
This enables organizations to easily share information with external partners as well as internal users in a controlled fashion.</li><li>OpsGenie empowers users to manage their own notification preferences, ensuring contact information accuracy and reducing administrative overhead.</li></ul> <a href="https://www.opsgenie.com/customer/signUp">Sign up for free</a> and start getting Smarts notifications on your smartphones today. Detailed instructions on how to integrate Smarts and OpsGenie are available on the <a href="http://support.opsgenie.com/customer/portal/articles/673566-smarts-integration">Smarts Integration Page.</a>

↧

Why use mobile apps for IT management alerts?

August 20, 2012, 7:48 am

IT Ops folks have been using electronic devices for notifications for decades. It started with pagers on our belts, and pagers got more sophisticated over time. <a href="https://en.wikipedia.org/wiki/Pager"><img alt="pager" src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Alphadual.jpg/220px-Alphadual.jpg"/></a>Alphanumeric pagers followed numeric ones that could only display a phone number, and two-way pagers with tiny keyboards followed them. Pagers are still used by some operations folks but have largely been replaced by mobile phones, thanks to the text messaging capabilities available on almost any mobile phone. IT operations processes largely use email as the main communication method to notify users when an action is required and rely on short text messages (SMS) when there is some urgency. Today’s widely used smartphones (iPhone, Android, Blackberry, etc.) have an incredible array of capabilities. Taking advantage of these capabilities, notification solutions continue to evolve. All major smartphone platforms provide robust (push) notification methods, and the combination of push notifications and smartphone apps provides the infrastructure for the next generation of alert and notification management solutions. As a notification channel, mobile push notifications have a number of advantages over SMS:<ul><li>Cost: There is no charge for sending or receiving push notifications on your <a href="http://itunes.apple.com/us/app/opsgenie/id528590328?ls=1&mt=8">iPhone</a> or <a href="http://play.google.com/store/apps/details?id=com.ifountain.opsgenie">Android</a>, other than minimal data costs. 
The cost of sending text messages can add up depending on the provider, and in the US, recipients have to pay for receiving SMS as well.</li><li>Global reach: Mobile push notifications work the same way everywhere, whereas sending SMS to international numbers can be both costly and difficult.</li><li>Tablets: Mobile push notifications work not only on smartphones but also on tablets as long as they are connected to the Internet.</li><li>Control and Flexibility: Smartphones give the users a lot of control over how <a href="http://support.opsgenie.com/customer/portal/articles/553138-notification-rules">they’d like to be notified</a>. They can use different sounds for different notifications, vibration, and even lights. And apps enable the users to easily manage notification preferences directly from their phones, change escalation rules, turn off some notifications, etc.</li></ul>As such, push notifications are an essential part of a modern, effective notification management solution. So, what are the essential ingredients of an effective notification management solution? I believe an effective notification solution should be able to do the following:<img alt="image" src="http://media.tumblr.com/b1b89ad31d23adfa6c495aca229aaeac/tumblr_inline_mu3cz7Ch511soq1dj.png"/><ol><li>ensure that recipients are notified in a <a href="http://www.opsgenie.com/public/features/reliable.notifications.html">timely manner</a></li><li><a href="http://www.opsgenie.com/public/features/rich.notifications.html">provide all the necessary information</a> to the recipients to enable them to determine the right course of action</li><li>enable the recipients to <a href="http://www.opsgenie.com/public/why/incident.management.html">take action rapidly</a> whenever possible</li></ol>Notifications via text messages and emails alone fail to meet all the basic requirements listed above. In addition to email and SMS, OpsGenie provides apps for smartphones (iPhone, Android, etc.) 
to satisfy these requirements:<ol><li>In addition to email and SMS, OpsGenie can notify users via mobile push notifications and phone calls. Users can specify which <a href="http://www.opsgenie.com/public/features/multiple.channels.html">communication channels</a> should be used in what order. OpsGenie tries each method until the user sees the alert.</li><li>An OpsGenie alert can have as many fields as necessary, as well as <a href="http://www.opsgenie.com/public/features/rich.notifications.html">attached files</a>. Recipients can view all the information directly from OpsGenie apps, before deciding what to do next.</li><li>OpsGenie enables users to <a href="http://www.opsgenie.com/public/why/incident.management.html">initiate appropriate actions</a> directly from OpsGenie apps.</li></ol>For example, if a server is having a problem, the alert can include configuration information for the server, change history, past events, performance charts, run book, etc. to arm the recipient to at least do some triage and determine the right course of action; whether it is to escalate the problem to someone else, initiate corrective actions, or simply communicate with others that may be involved. It’s time to move beyond fire and forget text messages as the only means of notifications. <a href="https://www.opsgenie.com/customer/signUp">Create your free OpsGenie account</a> today and enjoy the next generation notifications on your smartphones!

↧

Zapier, another way to integrate with OpsGenie

August 31, 2012, 7:48 am

As Software as a Service (SaaS) solutions continue to make inroads into the enterprise, integration among disparate SaaS solutions is becoming necessary, as has been the case with on-premise applications. <a href="https://zapier.com/">Zapier</a>, itself a SaaS offering, is tackling this problem. Zapier provides a platform and an intuitive web-based user interface to integrate various web applications. There are already almost 90 applications that can be integrated via Zapier, and we’ve already found a number of use cases for integrating various tools such as <a href="https://trello.com/">Trello</a> and <a href="https://www.hipchat.com/">HipChat</a>.<img alt="image" src="http://media.tumblr.com/32f824a957572f157634b76beccb103d/tumblr_inline_mu3d2lQE2O1soq1dj.png"/> I’m happy to announce that we’ve added <a href="https://zapier.com/zapbook/opsgenie/">OpsGenie</a> to the <a href="https://zapier.com/zapbook/">Zapier app directory</a>, providing another easy integration method for <a href="http://www.opsgenie.com">OpsGenie</a>.

↧

AWS CloudWatch alarms on your SmartPhones with OpsGenie

September 4, 2012, 7:48 am

Amazon <a href="http://aws.amazon.com/cloudwatch/">CloudWatch</a> provides monitoring for Amazon Web Services (AWS) and the applications that make use of AWS. There are many alternatives for collecting resource utilization metrics from EC2 instances; however, when AWS services like ELB, RDS, DynamoDB, SQS, etc. are used, <a href="http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/CW_Support_For_AWS.html">CloudWatch metrics</a> play a critical role in monitoring the applications running on the AWS cloud. One of the key capabilities of the CloudWatch service is alarms. A <a href="http://aws.typepad.com/aws/2012/07/ec2-instance-status-metrics.html">CloudWatch alarm</a> can watch a single metric over a specified time period and execute automated actions based on the value of the watched metric relative to a given threshold. The automated action may be sending emails, calling HTTP/S endpoints, etc.<h3>Notify users via iPhone/Android apps, SMS, and phone calls</h3>OpsGenie customers can configure CloudWatch to call the OpsGenie web services API and get the right people notified rapidly via push notifications to iPhone/Android apps, SMS, voice calls, etc., according to the preferences of the recipients. It’s very easy to set this up and try:<ul><li>If you don’t already have one, <a href="https://www.opsgenie.com/customer/signUp">create an OpsGenie account</a>.</li><li>Get the customerKey for your account from the “<a href="https://www.opsgenie.com/customer/settings">Account Settings</a>” page.</li><li><a href="http://docs.amazonwebservices.com/sns/latest/gsg/CreateTopic.html">Create an SNS topic</a> and an <a href="http://docs.amazonwebservices.com/sns/latest/gsg/Subscribe.html">HTTPS subscription</a>. The endpoint should point to OpsGenie. customerKey and recipients are mandatory parameters; message, description, alias, actions, and entity are optional parameters. 
For example:</li></ul><a href="https://api.opsgenie.com/v1/json/cloudwatch?customerKey=342532b52342&setAlias=AlarmName&recipients=operations&actions=acknowledge,reboot">https://api.opsgenie.com/v1/json/cloudwatch?customerKey=342532b52342&setAlias=AlarmName&recipients=operations&actions=acknowledge,reboot</a> <img alt="image" src="http://media.tumblr.com/16bf8d3d4e2fc9504e8a6acd6f0e5c1a/tumblr_inline_mu3esl7N9L1soq1dj.png"/><ul><li><a href="http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/AlarmThatSendsEmail.html">Create a CloudWatch alarm</a> with any metric and select the SNS topic you’ve created as the action.</li></ul><img alt="image" src="http://media.tumblr.com/34a9f006c9907ed803a91828a79648e5/tumblr_inline_mu3etgGbiQ1soq1dj.png"/> That’s all it takes. When the alarm condition is met, AWS CloudWatch will pass the alarm details to OpsGenie, and OpsGenie will notify the specified users (recipients parameter) based on the recipients’ notification preferences.<img alt="image" src="http://media.tumblr.com/f6c1533510216f45d8b4b45381bcbeae/tumblr_inline_mu3eu3Q0mS1soq1dj.png"/> If you don’t already have one, <a href="https://www.opsgenie.com/customer/signUp">create a free OpsGenie account</a> now, and give it a try. If you have an iPhone or Android phone, you can download the OpsGenie app and get notified via mobile push notifications; for other phones, you can get notifications via SMS and voice calls, and use the OpsGenie mobile or desktop web apps. <h3>Enrich CloudWatch alarms and provide context</h3>Using OpsGenie, you can not only send CloudWatch alarms to users as described above, but also provide additional information to enable the recipients to make an assessment and determine the right course of action. For example, you can retrieve data for the relevant metrics from CloudWatch via its APIs, generate graphs with the metric data, and attach the graphs to the alert using OpsGenie APIs. 
Similarly, you can retrieve relevant information from other systems such as configuration management tools, and attach it to the alert. Recipients can view all the information directly from OpsGenie apps and determine what to do next. As both CloudWatch and OpsGenie provide web APIs, this functionality can be implemented on any platform using your choice of language. OpsGenie provides a handy tool called Marid that makes the process as easy as possible. Marid can execute a Groovy or Ruby script in response to a web request, so a script can be triggered by a CloudWatch alarm. OpsGenie AWS integration files include an example script that retrieves configured metrics for the alarm from CloudWatch, generates metric charts, creates an OpsGenie alert, and attaches the charts to the alert. <div><img alt="image" src="http://media.tumblr.com/e436774cf4d5f9bd027e2348dea96f38/tumblr_inline_mu3ev05DC31soq1dj.png"/><img alt="image" src="http://media.tumblr.com/e96601f3af6b343b47032b582db5073a/tumblr_inline_mu3ev4gNT11soq1dj.png"/></div><h3>Empower recipients to initiate corrective actions from anywhere</h3>When an alert requires the rapid attention of the users, it is often necessary for the recipients to take action rapidly: acknowledge the alert, communicate with others, gather more information, initiate corrective actions, etc. As described above, with OpsGenie, recipients can be provided with the information they need to determine the right course of action. Furthermore, OpsGenie enables the users to initiate actions directly from OpsGenie apps, potentially significantly reducing mean time to repair, preventing impact on services, etc. 
<img alt="image" src="http://media.tumblr.com/edde9a5553a43e3ee3393a23410f1ba7/tumblr_inline_mu3evjVwbm1soq1dj.png"/> <a href="http://support.opsgenie.com/customer/portal/articles/719296-marid-integration-server-for-opsgenie">Instructions on how to set up and configure Marid </a>can be found on our <a href="http://support.opsgenie.com">support site</a>. OpsGenie also provides another integration that can be used to create alerts and attach CloudWatch performance graphs to the alert: <a href="http://support.opsgenie.com/customer/portal/articles/714183-all-downloads#cloudwatch-integration">AWS CloudWatch Integration Download.</a>
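The SNS subscription endpoint from the example earlier in this post can be assembled programmatically. The following is a minimal Python sketch, not an official client: the base URL and the parameter names (customerKey, setAlias, recipients, actions) are taken from the example URL above, while the function name and the choice to join multiple recipients with commas are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base URL copied from the example endpoint shown in this post.
CLOUDWATCH_ENDPOINT = "https://api.opsgenie.com/v1/json/cloudwatch"


def build_sns_endpoint(customer_key, recipients, set_alias=None, actions=None):
    """Build the HTTPS endpoint for the SNS subscription (hypothetical helper).

    customerKey and recipients are mandatory per the post; the other
    parameters are optional. Joining lists with commas mirrors the
    'actions=acknowledge,reboot' example and is an assumption for recipients.
    """
    if not customer_key or not recipients:
        raise ValueError("customerKey and recipients are mandatory")
    params = [("customerKey", customer_key)]
    if set_alias:
        params.append(("setAlias", set_alias))
    params.append(("recipients", ",".join(recipients)))
    if actions:
        params.append(("actions", ",".join(actions)))
    # safe="," keeps commas readable instead of percent-encoding them.
    return CLOUDWATCH_ENDPOINT + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_sns_endpoint("342532b52342", ["operations"],
                             set_alias="AlarmName",
                             actions=["acknowledge", "reboot"]))
```

The resulting string is what you would paste into the endpoint field when creating the HTTPS subscription for the SNS topic.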

↧

OpsGenie Email Integration - Creating alerts and notifying users just got easier

September 27, 2012, 7:48 am

OpsGenie has a simple web API to create alerts, and we also provide tools like Lamp and Marid to ensure IT management tools can be integrated with OpsGenie easily. Most, if not all, IT management tools can also send email notifications. We’ve just added email as an alert source for OpsGenie to make creating alerts even easier. An OpsGenie email address is generated for each account and can be found on the account settings page. You can configure your IT management tools to send an email to this address. You can, of course, also send emails directly from your email client to create alerts for testing, etc. Configuration instructions can be found in the <a href="http://support.opsgenie.com/customer/portal/articles/757537-creating-alerts-via-email">creating alerts via email</a> document. OpsGenie uses email rules to create the alert based on the incoming email. A default email rule for each account is defined when the account is created. By default, recipients is set to “all” (every user in the account is notified), the email subject is used as the notification message, and the email body is put in the description field. This behavior is completely configurable; you can change the default rule or define additional rules in the <a href="http://www.opsgenie.com/customer/settings#emailtoalertrules" target="_blank">Account Settings</a> page. Defining email rules is simple. You specify a “condition” based on the incoming email content (for example, the email comes from a specific address, or the subject or body contains specific text), then set the alert properties such as recipients, tags, actions, etc. Email fields such as from, subject, etc. can be used as variables when setting alert properties. Email rules can be used to route the alerts to the appropriate users and groups, specify the available actions for different types of alerts, etc. 
For example, the following rule would notify John Smith and members of the web_operations group if the email from address contains “nagios” and email subject contains “down”.<img alt="" src="http://support.opsgenie.com/customer/portal/attachments/77640"/>
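To make the rule behavior concrete, here is a rough Python sketch of how such condition-based routing could work. It is an illustration only, not OpsGenie's implementation: the EmailRule class, the case-insensitive "contains" matching, the first-match-wins ordering, and the recipient identifiers are all assumptions. The fallback mirrors the default rule described above (recipients "all", subject as message, body as description).

```python
from dataclasses import dataclass, field


@dataclass
class EmailRule:
    """Hypothetical rule: both 'contains' conditions must hold."""
    from_contains: str = ""
    subject_contains: str = ""
    recipients: list = field(default_factory=list)

    def matches(self, email):
        # Case-insensitive substring matching is an assumption.
        return (self.from_contains.lower() in email["from"].lower()
                and self.subject_contains.lower() in email["subject"].lower())


def email_to_alert(email, rules):
    """Turn an incoming email into alert fields using the first matching rule."""
    recipients = ["all"]  # default rule: notify every user in the account
    for rule in rules:
        if rule.matches(email):
            recipients = rule.recipients
            break
    return {
        "message": email["subject"],    # subject becomes the notification message
        "description": email["body"],   # body goes into the description field
        "recipients": recipients,
    }


if __name__ == "__main__":
    nagios_down = EmailRule("nagios", "down", ["john.smith", "web_operations"])
    mail = {"from": "nagios@example.com", "subject": "web01 is DOWN", "body": "details"}
    print(email_to_alert(mail, [nagios_down]))
```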

↧

Notification methods - which one to use when

October 1, 2012, 7:48 am

<a href="http://www.opsgenie.com">OpsGenie</a> provides <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification methods</a> (email, SMS, iPhone/Android push notifications, voice calls, etc.) to users for a number of reasons:<ul><li>Timely delivery of notifications via methods like email and SMS is not guaranteed. Carriers offer SMS delivery as “best effort” and delivery times can vary. OpsGenie allows users to use multiple methods so that they are not dependent on a single method. Note that this does not mean users will get multiple notifications: once the user views the alert, OpsGenie stops sending notifications for that alert through other notification methods.</li><li>The combination of these methods ensures the widest coverage, enabling OpsGenie to notify anyone who has a computer or a phone.</li><li>Different notification methods have different strengths and weaknesses.</li></ul>Let’s look at each of the notification methods:<ul><li>Push notifications to the OpsGenie iPhone/Android app are our preferred notification method. They allow us (as we feast on our own dog food) to receive notifications not only on our smartphones but also on iPod touch, iPad and Android tablets. We typically recommend using push notifications as the primary notification method to our users, provided they have an iPhone or an Android phone.</li><li>Text messages (SMS) ensure that any user who has a mobile phone can be notified by OpsGenie and, if their phone has a web browser, can view all the alert details and manage their notification preferences using the OpsGenie mobile web app, just as iPhone/Android users can do using native apps.</li><li>Voice phone calls ensure OpsGenie can reach anyone with a phone (mobile or landline). In addition, a phone call is one notification method that has guaranteed delivery, and a ringing phone may be harder to miss than a text message or email. 
OpsGenie uses text-to-speech to read the alert details to the users, and enables users to respond by pressing phone keys. We typically use a phone call as the last notification method.</li></ul>OpsGenie allows users to set their own notification rules, and users often have their own reasons to use different rules. It is also very easy to change these rules, disable/enable notification methods, etc., using OpsGenie apps. Here is what we recommend as a good starting point (and use ourselves):<ul><li>Notify me via email immediately. If I happen to be on my PC, this allows me to use the OpsGenie web application to view the alert, etc. rather than reaching for my mobile.</li><li>Notify me via mobile app immediately. This means I get a push notification on my Android/iPhone and I can click on the notification to see the alert easily. Push notifications also have the advantage of working in places with no cell coverage, like data centers in basements. As long as you have a network connection, all is well.</li><li>Notify me via SMS after 2 minutes. If I have not seen the alert within 2 minutes, OpsGenie sends a text message. I can click on the link in the message or use any OpsGenie app (full web app, iPhone/Android app, etc.) to view the alert.</li><li>Notify me via phone call after 4 minutes. If I still have not seen the alert after 4 minutes, OpsGenie calls me on my phone. I can listen to OpsGenie read me the alert or can use an OpsGenie app to view the alert.</li></ul>We’re also considering adding in-browser notifications (Firefox and Chrome) and instant messaging as notification methods, assuming customers demand it. If you have an opinion on the matter, we’d love to hear it!
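The recommended starting rules above can be modeled as a simple list of (method, delay) pairs. The Python sketch below is only an illustration of the described behavior (each method fires after its delay unless the alert has already been viewed); the names and the exact boundary handling are assumptions, not OpsGenie's actual logic.

```python
# Recommended starting rules from this post: (method, delay in minutes).
RECOMMENDED_RULES = [("email", 0), ("push", 0), ("sms", 2), ("voice", 4)]


def notifications_sent(rules, seen_after_minutes=None):
    """Return the methods that fire before the user views the alert.

    seen_after_minutes is how long after creation the alert was viewed;
    None means it was never viewed, so every method eventually fires.
    A method fires only if its delay elapses strictly before the view
    (this boundary choice is an assumption).
    """
    sent = []
    for method, delay in sorted(rules, key=lambda rule: rule[1]):
        if seen_after_minutes is not None and delay >= seen_after_minutes:
            break  # the alert was viewed, so later methods are skipped
        sent.append(method)
    return sent


if __name__ == "__main__":
    for seen in (1, 3, None):
        print(seen, notifications_sent(RECOMMENDED_RULES, seen))
```

With these rules, a user who opens the alert after one minute only receives the email and the push notification, and the SMS and phone call are never sent.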

↧

Get notified for OpenNMS events

October 5, 2012, 7:48 am

OpenNMS is an award-winning, enterprise-grade network management application platform. The OpsGenie OpenNMS integration plugin has just been released. The plugin enables OpenNMS users to get notified for specified events via email, SMS, iPhone/Android push notifications, and phone calls using the OpsGenie service. OpenNMS supports multiple notification methods. The first version of the integration plugin includes documentation and examples that use the email and script notification methods. The script notification method provides the additional capability to add supporting information about the event to the alert in OpsGenie as attached files. The integration document has step-by-step instructions describing how to configure both OpenNMS and OpsGenie. We’re looking forward to feedback from the OpenNMS community to improve the integration and to make notifications more useful for users. Detailed instructions on how to integrate OpenNMS and OpsGenie are available on the <a href="http://support.opsgenie.com/customer/portal/articles/769280-opennms-integration">OpenNMS Integration Page.</a>

↧

Overwriting quiet hours for critical alerts

October 11, 2012, 7:48 am

OpsGenie empowers users to control how they are notified. One of the available features is <a href="http://support.opsgenie.com/customer/portal/articles/711340-notification-preferences">quiet hours</a>. If the user specifies quiet hours, OpsGenie does not send notifications to the user during these hours. This feature is typically used by users who’d normally like to be notified when something goes wrong but do not want to be woken up in the middle of the night unless they have to be. But what if, for some alerts, they do want to be notified at any time? An early <a href="http://support.opsgenie.com/customer/portal/questions/359762-feature-request-alert-severity">enhancement request</a> was the option to ignore quiet hours settings for critical alerts. Along these lines, we’ve just implemented the “OverwriteQuietHours” tag. If an alert has this tag, OpsGenie ignores the quiet hours settings and notifies the users according to their preferences as it would normally do. This tag can be added to any alert while it’s being created via the API or the <a href="http://support.opsgenie.com/customer/portal/articles/757537-creating-alerts-via-email">email rules</a>. Tags add a powerful yet simple way to categorize alerts and process them accordingly. We expect that there will be other similar use cases. If you have other special processing requirements, please let us know!
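The decision described above fits in a few lines. The following Python sketch is an illustration under stated assumptions, not OpsGenie's code: the function names are hypothetical, quiet hours are modeled as a single daily window that may wrap past midnight, and the tag comparison is assumed to be exact and case-sensitive. Only the tag name itself comes from this post.

```python
from datetime import time

# Tag name as given in this post.
OVERWRITE_TAG = "OverwriteQuietHours"


def in_quiet_hours(now, start, end):
    """True if 'now' falls in [start, end), allowing windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # e.g. 22:00 to 07:00


def should_notify(now, quiet_start, quiet_end, tags):
    """Suppress notifications during quiet hours unless the alert carries the tag."""
    if OVERWRITE_TAG in tags:
        return True
    return not in_quiet_hours(now, quiet_start, quiet_end)


if __name__ == "__main__":
    start, end = time(22, 0), time(7, 0)
    print(should_notify(time(3, 0), start, end, []))               # suppressed
    print(should_notify(time(3, 0), start, end, [OVERWRITE_TAG]))  # notified
```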

↧

Get Rackspace cloud monitoring alerts via OpsGenie

October 19, 2012, 7:48 am

It would be very unusual for a web product to not have an API these days, and designing and implementing the APIs first and foremost is considered the right way to do things by many. <a href="http://www.rackspace.com/cloud/public/monitoring/" title="Rackspace cloud monitoring" target="_blank">Rackspace cloud monitoring</a> seems to have embraced this philosophy fully. In fact, it does not even have a graphical user interface (at least not at the moment). Instead, it has a well-designed JSON/HTTP API and a (somewhat complex) data model. It seems odd at first, but it is most welcome if you work on integration and automation as I do. After all, the service is built for developers. In addition to the JSON/HTTP API, Rackspace provides a command line interface called <a href="https://github.com/racker/rackspace-monitoring-cli" target="_blank">raxmon</a> (similar to OpsGenie <a href="http://support.opsgenie.com/customer/portal/articles/574596-lamp-command-line-interface-for-opsgenie" target="_blank">lamp</a>) to interact with the API easily. Given that <a href="http://www.opsgenie.com" target="_blank">OpsGenie</a> also has an HTTP/JSON <a href="http://support.opsgenie.com/customer/portal/articles/565567-web-api" target="_blank">API</a> (as do a number of services we use), one can not only automate the entire process of configuring Rackspace to monitor their services from multiple locations (monitoring zones) but also forward alerts to OpsGenie to notify the users. The Rackspace API currently supports notifications via email and webhook. Although either one can be used to forward Rackspace alerts to OpsGenie, webhook is the preferred mechanism as it does not rely on another component in the middle. The OpsGenie Rackspace integration enables Rackspace cloud monitoring users to forward alerts from Rackspace to OpsGenie using webhooks. 
<a href="http://support.opsgenie.com/customer/portal/articles/793489-rackspace-cloud-monitoring-integration" target="_blank">OpsGenie Rackspace Integration guide</a> describes how to configure Rackspace and OpsGenie. The integration enables users to receive and view Rackspace alerts on their mobile phones using OpsGenie apps. And using OpsGenie custom actions, users can even initiate corrective actions, reboot servers, start new instances, etc. Rackspace cloud monitoring does not yet provide historical data and graphing but once these capabilities are added, you can expect that it will be possible to attach the data or the graphs related to the alarms to the alert as well. Happy monitoring!

↧

Notifications and working with Netcool from your smartphones

October 29, 2012, 7:48 am

IBM Tivoli <a href="http://www-01.ibm.com/software/tivoli/products/netcool-omnibus/" target="_blank">Netcool</a> is the most common event (<a href="http://support.opsgenie.com/customer/portal/articles/756707-what-are-opsgenie-alerts-and-notifications">alerts</a> in OpsGenie terminology) management solution used by operations, particularly in large enterprises and service providers. Since Netcool is used to collect and consolidate events from many event sources into a central repository, it makes sense to integrate OpsGenie with Netcool to add the capability to notify users of events that are important to them. <a href="http://support.opsgenie.com/customer/portal/articles/800164-ibm-tivoli-netcool-integration">OpsGenie Netcool integration</a> is bi-directional. The integration not only enables Netcool to send notifications to users using <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification methods</a> (iPhone & Android push notifications, email, SMS, phone calls, etc.) to ensure they get notified in a timely manner, but also empowers the users to <a href="http://www.opsgenie.com/public/features/alert.actions.html">respond to the events</a> directly from OpsGenie apps. As a result, users can interact with Netcool events even when they are not in front of a console. Using OpsGenie apps running on their mobile phones, users can receive notifications for critical events, acknowledge the events, suppress or escalate them, write to the event journal, etc. One of the most popular features of Netcool is the ability to add fields to Netcool events and populate these fields with additional information that provides context and helps operations folks determine impact and the appropriate course of action. 
With traditional notification mechanisms such as sending text messages, most of this information cannot be conveyed to the recipients due to the limitations of SMS, hence crippling the ability of the recipients to determine the right course of action. OpsGenie not only allows conveying all event data to the recipients, but also allows <a href="http://www.opsgenie.com/public/features/rich.notifications.html">enrichment of the alert</a> with information from other data sources. For example, if a router in the network is having a problem, event history for that router can be retrieved from historical event stores and attached to the alert using OpsGenie file attachment support. Similarly, configuration change history, performance charts, any additional information about the device, etc. can be attached to the alert, and made available to the recipients. This type of information is often invaluable to the recipients in determining what to do next and can save massive amounts of time and increase efficiencies.Please don’t take our word for it, get a free account and give it a try. Integration can be as simple as <a href="http://support.opsgenie.com/customer/portal/articles/757537-creating-alerts-via-email">forwarding emails</a> to OpsGenie, or sophisticated as bi-directional as described in the <a href="http://support.opsgenie.com/customer/portal/articles/800164-ibm-tivoli-netcool-integration">integration documentation</a>.
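To make the SMS limitation concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the event field names, the file names, and the `to_sms` / `to_rich_alert` helpers are hypothetical and are not part of Netcool or the OpsGenie API. It only shows how much of an enriched event survives a single 160-character text message compared with an alert that carries every field plus attachments.

```python
SMS_LIMIT = 160  # length of a single SMS segment

# Hypothetical Netcool-style event; the field names are illustrative only.
event = {
    "Node": "core-router-7",
    "Severity": "Critical",
    "Summary": "BGP peer 10.0.0.2 down",
    "Location": "Frankfurt DC2, row 14",
    "Customer": "ACME Corp (gold SLA)",
    "ImpactedServices": "VPN, VoIP",
    "RunbookURL": "http://wiki.example.com/runbooks/bgp-peer-down",
}


def to_sms(event):
    """Flatten the event into one text message, truncated to the SMS limit."""
    text = "; ".join(f"{key}={value}" for key, value in event.items())
    return text[:SMS_LIMIT]


def to_rich_alert(event, attachments=()):
    """Keep every field and carry the supporting files alongside the message."""
    return {
        "message": f"{event['Severity']}: {event['Summary']} on {event['Node']}",
        "details": dict(event),
        "attachments": list(attachments),
    }


sms = to_sms(event)
alert = to_rich_alert(event, ["event_history.html", "config_changes.html"])

# Fields that did not make it into the text message in full.
lost = [key for key, value in event.items() if f"{key}={value}" not in sms]
print(len(sms), "characters in the SMS; fields cut off:", lost)
print(len(alert["details"]), "fields and", len(alert["attachments"]), "attachments in the rich alert")
```

The more context fields an event carries, the more the truncated message loses, which is exactly the information a recipient needs to decide what to do next.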


Monitoring applications on the cloud - Part Zero

November 8, 2012, 5:41 am


I’ve been thinking about the impact of “cloudification” of technology infrastructure on IT operations management, and particularly on monitoring. Unfortunately, every time I want to write about something, I feel like I need to write about a lot of other things first, just to provide the context. Monitoring as a discipline covers a surprisingly vast area. What I wanted to write about was the management/monitoring capabilities needed to manage production applications running on (private or public) server instances provided as a service (aka IaaS). I’ll refer to this as “managing applications on the cloud” for brevity, and hope that it does not cause too much confusion. So first, in this post, I’ll attempt to describe the management disciplines that are relevant to managing production applications running on the cloud. Hopefully the posts to follow will make more sense with this context.<h3>Log management</h3>Access to the log files is probably the most essential management requirement for operations. To be sure, one does not need any special “tools” to view the log files. Operations can indeed have shell access to the servers, and view the log files using tools like grep and tail. However, it is safe to say that this is not good practice for a number of reasons:<ul><li>Giving shell access to production servers increases operational risks. Sure, it can be managed with access rights, but that also introduces overhead and may not always work as intended.</li><li>It is typical for cloud applications to have many instances of the same application component running across different server instances (virtual or physical). In this type of environment, looking for errors means accessing all the servers and looking at each of the log files, which can be quite painful.</li><li>Logs can be quite verbose and application exceptions often consist of dozens of lines (particularly in Java apps). 
It is very difficult to process this information in a command line window.</li></ul>Centralizing the logs in a searchable repository is necessary regardless of where applications are hosted; however, it is even more essential when applications run on the (public or private) cloud. The solution should provide the users not only real-time access, similar to tailing a file, but also the capability to browse and query historical logs. Although applications can send logs directly over the network to a central repository, this type of coupling is widely considered an unnecessary risk, and the use of connectionless protocols like UDP introduces the risk of logs getting lost. As such, aggregation of the logs often requires some sort of “agent” on the server instances to ship the logs to the central repository. The agent can be basic, with minimal overhead, and simply ship the log files, or it can do some filtering and parsing as well. If the applications are running on the cloud, a basic agent becomes more appealing as it has much less chance of impacting the performance of the applications running on the server instance. In the enterprise world, <a href="http://www.splunk.com" target="_blank">Splunk</a> is by far the most common solution for this purpose. The <a href="http://www.splunk.com" target="_blank">Splunk</a> <a href="http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Introducingtheuniversalforwarder" target="_blank">universal forwarder</a> is a lightweight agent that only ships the logs to the central Splunk server. Splunk can parse most log files out of the box and provides a nice user interface to work with the logs. 
<a href="https://www.splunkstorm.com/" target="_blank">SplunkStorm</a>, the SaaS based version of Splunk, recently came out of beta, and although it’s missing some key features so far, it seems to be catching up quickly. The <a href="http://logstash.net/" target="_blank">Logstash</a> and <a href="http://www.elasticsearch.org/" target="_blank">ElasticSearch</a> combination is the open source alternative. As is often the case with open source, this option appeals to the do it yourself crowd, and requires integration of various components and some development. <a href="https://papertrailapp.com/" target="_blank">Papertrail</a> and <a href="http://www.loggly.com/" target="_blank">Loggly</a> are other SaaS solutions for log management.<h3>Server monitoring</h3>Monitoring of server resource utilization (CPU, memory, disk IO, disk space, network IO, etc.) and of the processes running on the server is probably what most people think of when they refer to monitoring. The information gathered is typically used for troubleshooting, fault management, performance management, capacity planning, etc.<h4>Server based agents</h4>Server monitoring has traditionally been done using a server based agent, but nowadays some basic information is available through the hypervisor as well. Traditional server agents typically perform “active checks”, that is, they periodically execute code that checks the availability of application components, collects resource utilization metrics, etc. Server agents provided by the enterprise vendors are known to be quite heavy (high overhead on the server as well as high administrative overhead to deploy and maintain), hence mostly unusable for monitoring server instances running on the cloud. Unfortunately, the dominant open source option, <a href="http://www.nagios.com/" target="_blank">Nagios</a>, does not fare much better in terms of administrative overhead. 
A number of new generation SaaS based monitoring providers such as <a href="http://www.datadoghq.com" target="_blank">Datadog</a>, <a href="http://newrelic.com/product/server-monitoring" target="_blank">New Relic</a>, and <a href="http://copperegg.com/revealcloud-server-monitoring/" target="_blank">CopperEgg</a> provide server based agents. Another problem with traditional agents is that resource utilization metrics collected once every couple of minutes often lack the granularity needed to debug problems, and increasing the frequency to, say, sub 1 minute intervals may increase the load on the server and therefore may not be acceptable.<h4>Passive agents</h4>Going forward, the use of server based agents to perform solely periodic active checks to gather data and monitor applications will likely continue to diminish. Better options have been emerging in the market. <a href="http://www.appfirst.com/" target="_blank">AppFirst</a> is a SaaS based monitoring solution with an agent technology that passively collects detailed data by listening to calls made by the applications to the operating system. It can not only collect resource utilization data for the server but can also track processes individually (CPU usage, number of open files, network connections, threads, etc.) with little overhead.<h4>Log collection agents</h4>Another option that has emerged is using the agent deployed for log monitoring to collect server monitoring data, metrics, faults, etc. as well. <a href="http://splunk-base.splunk.com/apps/22314/splunk-for-unix-and-linux" target="_blank">Splunk</a>, for instance, provides “<a href="http://splunk-base.splunk.com/apps/22315/splunk-app-for-windows" target="_blank">apps</a>” to collect and visualize resource utilization metrics, leveraging the log processing infrastructure already in place. 
<a href="http://logstash.net/docs/1.1.0/tutorials/metrics-from-logs" target="_blank">Logstash</a> can forward performance metrics to various products such as <a href="http://graphite.wikidot.com/">Graphite</a> & <a href="http://opentsdb.net/" target="_blank">OpenTSDB</a>, and services such as <a href="https://circonus.com/" target="_blank">Circonus</a> & <a href="https://metrics.librato.com/" target="_blank">Librato</a>, and it can forward events elsewhere as well. However, Logstash lacks the ability to collect the resource utilization metric data itself. Additional code, scripts, etc. would need to be deployed on the server and executed periodically to provide the data through Logstash. Some application developers embed the code to do this into their applications, dumping the data to log files periodically for Logstash to process and ship to a time series database or to an event repository.<h3>Application (availability and response time) monitoring</h3>Monitoring the application components using a server based agent can be misleading, from both availability and performance standpoints, as it does not reflect how users access the application. In addition, installing an agent on every server instance is difficult, if not unviable, for many organizations. As such, many organizations employ methods to monitor the availability of the applications from outside using “synthetic transactions”. This approach is also referred to as “agentless monitoring”. Synthetic transactions are essentially active checks that simulate users or application components, such as requesting a web page via HTTP, resolving a host name in DNS, etc. Synthetic transactions are executed from one or more external locations. There are numerous products and services with varying strengths and weaknesses in this area. 
To name just a few, Nagios is probably the most popular open source solution as it can not only run standard checks but can also be extended with custom checks. The major shortcoming of Nagios seems to be that it’s quite painful to operate at scale. <a href="http://www.opennms.org/" target="_blank">OpenNMS</a> is a highly scalable open source solution typically favored by folks who need to monitor a large number of servers, and the apps running on them, with a wide selection of checks. <a href="http://www.rackspace.com/cloud/public/monitoring/" target="_blank">Rackspace cloud monitoring</a>, <a href="http://copperegg.com/" target="_blank">CopperEgg</a> and <a href="https://circonus.com/" target="_blank">Circonus</a> are some of the companies offering granular (1 minute or less), API driven checks from multiple locations for most common web services. However, (AFAIK) these solutions do not offer sophisticated multi-step checks such as simulating a user logging in to a web app, clicking through several pages, filling in a form, etc. For public facing web applications, <a href="http://www.compuware.com/application-performance-management/end-user-experience-synthetic-monitoring.html" target="_blank">Compuware Gomez</a> and <a href="http://www.keynote.com/" target="_blank">Keynote</a> provide a somewhat different monitoring service. They execute synthetic transactions from thousands of computers and mobile devices distributed globally and running actual browsers, and offer advanced scripting to simulate complex user interactions. Although it is possible to use synthetic transactions and agent based server monitoring to the same ends, endless agent based vs agentless discussions mostly miss the mark. 
These capabilities mostly complement each other and both are essential ingredients of a robust monitoring solution. External checks can determine problems as perceived by users more accurately, and server monitoring can be instrumental in determining the cause of the problem and in preventing problems from impacting users in the first place.<h3>Application performance monitoring</h3>Applications have their own metrics indicating the performance of the application as well as business metrics (number of users, credit card transactions, etc.). Attempts to establish standards to collect application performance data have failed. There are a number of different methods to collect application performance metrics.<h4>Extensions to server based agents</h4>Using the server based agents to monitor the performance of application components has been the traditional approach. Most agents can be extended, either by configuration (check these ports, responses, etc.) or by scripts, to check the availability and performance of the application components running on the same server instance or on other instances. For example, there are thousands of Nagios plugins to monitor anything and everything from applications to routing protocols, and it’s straightforward to add your own plugins with custom checks. AppFirst (mentioned above) has a pragmatic approach and leverages this vast set of available <a href="http://exchange.nagios.org/directory/Plugins" target="_blank">Nagios plugins</a> to monitor application availability and performance. <a href="http://support.hyperic.com/display/hyperforge/Home">Hyperic</a> is another monitoring solution that provides an agent with a large set of plugins as well as support for custom plugins. Although this approach is mostly used for availability monitoring, it is also used to collect performance metrics. 
The weaknesses of this approach include lack of granularity (hence the potential to miss intermittent problems) and the fact that only a small subset of actual application transactions is simulated.<h4>Application components collecting data themselves</h4>For in-house developed applications, often the best application performance metrics can be collected by the applications themselves. Collected data can be pushed to another system/process, written to files, etc. As with logs, sending performance metrics over the network directly to a repository is a possibility, but not without its problems. Hence, this option is particularly appealing for organizations that have already deployed a log monitoring agent like Splunk or Logstash, have a time series data repository such as <a href="http://opentsdb.net/" target="_blank">OpenTSDB</a> or <a href="http://graphite.wikidot.com/" target="_blank">Graphite</a> in place, and can therefore collect and store the data easily. Another option is using an agent specifically for this purpose. <a href="https://github.com/etsy/statsd" target="_blank">StatsD</a> and its <a href="http://joemiller.me/2011/09/21/list-of-statsd-server-implementations/" target="_blank">variants</a> have emerged as a common solution. There are statsd client libraries in almost every language, and the use of UDP means minimal impact on application performance. AppFirst and <a href="http://www.datadoghq.com" target="_blank">Datadog</a> agents include embedded <a href="https://github.com/appfirst/statsd_clients" target="_blank">statsd</a> daemons, enabling them to <a href="http://www.slideshare.net/appfirst/statsd-webinar-final" target="_blank">receive metrics</a> from applications.<h4>Agents running in application servers</h4>Today most web applications use application servers as part of the solution. 
One highly successful approach to gathering application performance metrics with little effort has been running an agent on the application server to monitor all application activity. 
Since most application traffic flows through the application servers, peeking into the application server activity can provide powerful insights into the application performance and help identify problems. The shortcoming of this approach is that it requires deployment of an agent on the application server. The agent also introduces some performance overhead that varies depending on the agent. <a href="http://www.ca.com/us/application-management.aspx" target="_blank">CA APM (Wily)</a> has been the pioneer of this approach and is still widely used in large enterprises. <a href="http://newrelic.com" target="_blank">New Relic</a> provides this technology as a SaaS solution, making it available to the masses. For example, one can install the New Relic Java agent, restart the application server, and within minutes observe the performance of applications running on that application server, as well as their interactions with back end services, databases, etc. This technology can help identify not only operational issues, but also problems in the code, slow SQL statements, etc. AFAIK, there are no viable open source projects providing these capabilities.<h4>Network based tools</h4>A rather different approach is determining application performance by analyzing the network traffic. These network appliances typically receive traffic from a mirrored port on the switches that the servers are connected to (other techniques are used as well). They can analyze the traffic to figure out the performance of real user transactions as well as transactions between application components and back end services. The fundamental appeal of this approach is that it can be deployed without any changes to the application on the server or the client side (no agents, code changes, etc.), though in practice some changes to the configuration or application code (to be able to stitch transactions spanning multiple servers, etc.) 
seem to improve the quality of the analysis. Another advantage is that this approach does not introduce any performance overhead, as the appliances passively process mirrored traffic. The weakness of this approach is that it requires a hardware device to be deployed on the network, which may not always be feasible. Another problem is that in virtual environments, the traffic between VMs may not go through physical switches at all if the VMs are running on the same host. In this case, the suggested solution seems to be deploying a VM on each of the physical hosts to monitor the network traffic on that host, somewhat departing from the easy deployment and no overhead promise. <a href="http://www.extrahop.com/" target="_blank">ExtraHop</a> provides a product that uses this approach.<h3>Configuration management</h3>A configuration management system is needed to deploy software, make and track configuration changes, etc. in an automated, repeatable, testable manner. Having shell access to production servers and installing applications manually is a high risk endeavor. It is easy to introduce errors that can cause outages, and errors introduced this way are typically very hard to find afterwards. It is also considered a security risk, hence may not be acceptable in risk averse organizations. Although it is possible to automate the process using scripts and ssh into the servers, a more common approach is to have an agent running on the server. <a href="http://puppetlabs.com/" target="_blank">Puppet</a> and <a href="http://www.opscode.com/chef/" target="_blank">Chef</a> are the most popular open source configuration management tools with large communities. Chef is also available as a <a href="http://www.opscode.com/hosted-chef/" target="_blank">hosted service</a>. 
<a href="http://linkedin.github.com/glu/docs/latest/html/contents.html" target="_blank">Glu</a> and <a href="http://ansible.cc/" target="_blank">Ansible</a> are some of the less well known alternatives with smaller communities (also open source). There are also tools more focused on application deployment like <a href="https://github.com/capistrano/capistrano" target="_blank">Capistrano</a>. If you don’t have anything better to do, you can follow me on twitter <a href="http://twitter.com/berkay">@berkay</a>
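For readers who have not seen the StatsD approach mentioned under application performance monitoring, here is a minimal sketch of what an application-side client does. It assumes a StatsD-compatible daemon listening on UDP port 8125 on the same host (the conventional default); the metric names are made up, and `format_metric` / `send_metric` are illustrative helpers rather than any particular client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD datagram in the form <name>:<value>|<type>."""
    if metric_type not in ("c", "ms", "g"):
        raise ValueError("metric_type must be 'c' (counter), 'ms' (timer) or 'g' (gauge)")
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire and forget: send one UDP datagram and do not wait for any reply."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


# Example calls an application might make (hypothetical metric names):
#   send_metric("checkout.payments", 1, "c")      # count one payment
#   send_metric("checkout.latency", 320, "ms")    # record a 320 ms request
```

Because the datagram is sent without waiting for an acknowledgement, a slow or missing daemon does not block the application; the flip side, as with logs over UDP, is that metrics can be lost silently.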


Librato alerts on your mobile devices

December 11, 2012, 5:41 am


Operations folks at Etsy said it best with “<a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/">measure anything, measure everything</a>”. Metric (aka time series) data collection, visualization, and alerting are essential operations management capabilities. We need to be able to track not only system metrics such as CPU and memory utilization, but also (even more so) application and business metrics such as response times, number of transactions, etc.<img alt="image" src="http://media.tumblr.com/93f37247ed427a6cd2f54f244b7dd1a0/tumblr_inline_mu3fc7kvf71soq1dj.png"/> <a href="https://metrics.librato.com/" target="_blank">Librato Metrics</a> is a cloud based service just for that. Librato makes it easy to collect, store and visualize time series data.<ul><li>Collect: They provide a simple web API (HTTP/JSON), as well as language bindings and integration with collecting agents such as statsd and collectd to collect data.</li><li>Store: Librato stores all the data, takes care of roll-ups, and scales as needed.</li><li>Visualize: Metric data can be visualized as individual metrics, correlated with other metrics, organized in dashboards, etc. Charts update in real-time to reflect the latest values for the metrics. You can also embed the charts in your own apps using the JavaScript SDK.</li></ul><img alt="image" src="http://media.tumblr.com/6f3641baf34c0f2c1060fe342cca79b1/tumblr_inline_mu3fcwlPwn1soq1dj.png"/>Librato provides alerting capabilities as well. You can define thresholds for metrics, and trigger alerts when thresholds are exceeded. We’re happy to announce that in collaboration with folks at Librato, we’ve integrated OpsGenie with Librato. Using OpsGenie, Librato users can now create alerts in OpsGenie, manage the alert life cycle (acknowledge, take/assign ownership, close, etc.) and receive notifications on their mobile devices using iPhone/Android push notifications, SMS, and phone calls. 
<a href="http://support.opsgenie.com/customer/portal/articles/868562-librato-integration">OpsGenie Librato integration document</a> describes the configuration. It is now easier than ever to track key metrics and get notifications on your mobile devices. Like OpsGenie, Librato provides a free trial, so if you’d like to track some key metrics, give it a try.
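The idea of a threshold that turns metric samples into an alert can be sketched in a few lines of Python. This is an illustration only: it does not reproduce how Librato evaluates thresholds or what it sends to OpsGenie. The `check_threshold` function, the `min_breaches` rule, and the shape of the returned alert are all assumptions made for the example.

```python
def check_threshold(metric, samples, threshold, min_breaches=3):
    """Return an alert dict if the last `min_breaches` samples all exceed
    the threshold, otherwise None.

    `samples` is a list of (timestamp, value) pairs, oldest first.
    """
    recent = samples[-min_breaches:]
    if len(recent) < min_breaches:
        return None  # not enough data to decide
    if any(value <= threshold for _, value in recent):
        return None  # at least one recent sample is within limits
    return {
        "source": "metrics",
        "message": f"{metric} above {threshold} for {min_breaches} consecutive samples",
        "details": {"latest": recent[-1][1], "since": recent[0][0]},
    }


# Hypothetical response time samples in milliseconds, one per minute.
samples = [(1, 180), (2, 210), (3, 650), (4, 700), (5, 720)]
alert = check_threshold("api.response_time", samples, threshold=500)
```

Requiring several consecutive breaches is one simple way to avoid waking someone up for a single spike; the resulting alert is what a notification service would then deliver and track.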


Alert life cycle management in OpsGenie

December 27, 2012, 5:41 am


Most operations teams use a number of disparate monitoring tools (and services) to monitor the technology infrastructure, network, systems, applications, etc. These monitoring tools all have some degree of alerting. They can generate alerts when they detect problems and can send alert notifications via email, etc. Yet alerting, particularly what happens after an alert is generated, differs significantly between tools. This is a problem for operations teams as it makes it difficult to have a consistent process around alerts. What’s next after an alert is generated by a monitoring tool? At OpsGenie, we try to make this process as efficient and pleasant as possible for ops teams by providing the necessary tools for alert life cycle management. Our goal is to make OpsGenie the “alert (and notification) management” solution regardless of where the alerts originate. OpsGenie already provides a number of capabilities to facilitate alert management. For example, all alert related activities are tracked by OpsGenie: when the alert was created, who was notified, when and whether recipients have seen the alert and initiated actions, etc. Until the latest release, OpsGenie had support for two standard alert actions, “Add Note” and “Close”, as well as support for defining custom actions that can be specified for each alert when the alert gets created. Yet there are other actions that are commonly used in alert management, and it made sense to implement these as standard actions in OpsGenie to facilitate better alert life cycle management. Hence, we’ve just added acknowledge, take and assign ownership, and delete as standard alert actions. Here is the current list of standard alert actions in OpsGenie:<ul><li>Acknowledge action sets the acknowledged property value to true, and assigns the ownership of the alert to the user that acknowledged the alert. 
This is typically used to indicate that at least one of the recipients has seen the alert and assumes responsibility for it.</li><li>Take ownership action allows a user to assume the ownership of an alert. Take ownership action is only available for acknowledged alerts and can be used to change the owner of the alert.</li><li>Assign action allows a user to assign the ownership of an alert to another user. Assign action can be used by managers, dispatchers, etc. to assign the alert to someone in the team, or can be used by a recipient when someone else needs to take over, etc. Alerts can be assigned to another OpsGenie user, or to an external entity such as a service provider.</li><li>Add note action can be used by recipients to communicate with each other. Each note is attached to the alert and can be seen by the other recipients.</li><li>Close action closes an alert that is no longer active/valid.</li><li>Delete action removes the alert.</li></ul>The new alert management actions are available via the web UI, the API and client libraries, as well as the latest release of the iPhone, Android and mobile web apps. We’ve also made a number of improvements to make alert management easier:<ul><li>support for executing actions for multiple alerts at once (currently web UI only)</li><li>a filter in the alert list view to make it easier to see unacknowledged alerts</li><li>alert list shows the “owner” of the alert as well as whether the alert is acknowledged or not</li><li>actions are available via drop down menu in the web UI and by swipe actions in mobile apps (without going into alert details)</li></ul>We hope that OpsGenie alert and notification management capabilities will make collaboration and coordination a little bit easier for operations teams. 
We’re committed to further improving OpsGenie’s alert life cycle management capabilities and will continue to work with OpsGenie users in this area, so thanks for all your feedback so far, and keep the ideas/requirements coming!
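The rules behind the standard actions described above can be summarized as a small state model. The sketch below is a toy illustration in Python, not OpsGenie’s implementation or API: the `Alert` class, its attribute names, and the activity log format are assumptions. Delete is left out because it simply removes the alert from whatever store holds it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    message: str
    acknowledged: bool = False
    owner: Optional[str] = None
    status: str = "open"
    notes: list = field(default_factory=list)
    log: list = field(default_factory=list)  # (user, action) pairs, oldest first

    def acknowledge(self, user):
        # Acknowledging marks the alert as seen and makes the user its owner.
        self.acknowledged = True
        self.owner = user
        self.log.append((user, "acknowledge"))

    def take_ownership(self, user):
        # Only available for acknowledged alerts; changes the current owner.
        if not self.acknowledged:
            raise ValueError("take ownership requires an acknowledged alert")
        self.owner = user
        self.log.append((user, "take ownership"))

    def assign(self, user, assignee):
        # A manager, dispatcher or recipient hands the alert to someone else.
        self.owner = assignee
        self.log.append((user, f"assign to {assignee}"))

    def add_note(self, user, text):
        # Notes stay attached to the alert for the other recipients to read.
        self.notes.append((user, text))
        self.log.append((user, "add note"))

    def close(self, user):
        # The alert is no longer active/valid.
        self.status = "closed"
        self.log.append((user, "close"))
```

Keeping every action in a log alongside the state change is what makes it possible to answer afterwards who acknowledged an alert, who owned it, and when it was closed.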


Nagios and OpsGenie, Yin and Yang

January 2, 2013, 5:41 am


Nagios is an open source IT infrastructure monitoring tool that offers monitoring and alerting for servers, switches, applications, and services. OpsGenie is an alert and notification management service that is highly complementary to Nagios. OpsGenie Nagios integration leverages the Nagios notification system to forward alerts to OpsGenie (either via email or API) and notify users via iPhone/Android push notifications, email, SMS, and phone calls. There are already many OpsGenie users taking advantage of the integration. So what does OpsGenie have to offer Nagios users?<h4>Consolidated alert and notification management</h4>Nagios is great for many things but aggregating alerts from other sources is not one of its strengths. Nagios users can leverage OpsGenie to aggregate and manage alerts not only from Nagios but also from other sources, and use Nagios to do what it is designed for. OpsGenie also enables users to <a href="http://www.opsgenie.com/public/why/consolidated.notifications.html">maintain their own notification information and preferences</a> in one place, eliminating the burden of keeping this <a href="http://www.opsgenie.com/public/why/eliminate.overhead.html">information current and accurate</a> in multiple disparate tools.<h4>Alert life cycle management</h4>For Nagios users, OpsGenie integration provides full alert life cycle management capabilities. Using OpsGenie, users can not only receive notifications for critical problems detected by Nagios, but also <a href="http://www.opsgenie.com/blog/2012/12/27/alert-lifecycle-management.html">acknowledge alerts, take or assign ownership of the alerts, comment on them, etc. easily no matter when and where they may receive the alerts</a>. OpsGenie keeps track of all alert activity seamlessly: when the alert was created, who was notified when and how, whether and when recipients have seen the alert and acknowledged it, who executed which action, etc. 
OpsGenie can also automatically close alerts when the host/service comes back up and Nagios checks indicate the state as “OK”.<h4>Stay connected when you’re mobile</h4>Using OpsGenie, Nagios users can receive notifications for critical alerts via <a href="http://www.opsgenie.com/public/features/multiple.channels.html">SMS, phone calls, and iPhone & Android push notifications</a>, and can <a href="http://www.opsgenie.com/public/features/alert.actions.html">respond to the alerts</a> directly from their mobile devices using OpsGenie apps. And since OpsGenie is a cloud based service accessible from anywhere, unlike mobile web apps for Nagios that require users to have access to the Nagios server, users don’t need access to the corporate network (which is often restricted) to be able to work with Nagios alerts.<h4>Alerts that empower</h4>A short text message (SMS) typically used to notify users often fails to convey sufficient information to enable the recipients to assess the problem and determine the right course of action. <a href="http://www.opsgenie.com/public/features/rich.notifications.html">OpsGenie alerts are not limited to a couple hundred characters of text</a>; they include many fields, tags, and <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attached files</a>. Recipients can see not only the alert message but also all the supporting information, charts, etc. and figure out what to do next. For example, Nagios alerts by default include host/service data and an alert histogram. Forward a Nagios alert to OpsGenie and you’ll see what we mean. 
You can <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attach any information</a> that you consider relevant to the alert and make it available to the recipients, either via the API or the web UI.<h4>Bi-directional integration</h4>When users update alerts in OpsGenie, OpsGenie can forward user actions to Nagios using <a href="http://www.opsgenie.com/public/features/alert.actions.html">action execution</a> capabilities and the <a href="http://support.opsgenie.com/customer/portal/topics/345445-marid/articles">Marid integration tool</a>. For example, when a user acknowledges an alert using an OpsGenie app (iPhone/Android/Web), OpsGenie can make a call to the Nagios server (using Marid) to acknowledge the alert in Nagios as well. Not convinced? Don’t take our word for it. Sign up for a free trial account and see for yourself. We think you’ll like it :)
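The two directions of the integration can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the actual integration or Marid script: `action_for_notification` and `nagios_ack_command` are hypothetical helpers, and the mapping covers only the create and auto-close behavior described above. The acknowledgement line follows the Nagios external command format for `ACKNOWLEDGE_SVC_PROBLEM`, which is written to the Nagios command file; check the field order against the external command documentation for your Nagios version before relying on it.

```python
import time


def action_for_notification(notification_type):
    """Nagios -> alerting service: decide what to do with a notification.

    Uses values of the Nagios $NOTIFICATIONTYPE$ macro. Anything that is
    not a problem or a recovery is ignored in this sketch.
    """
    mapping = {"PROBLEM": "create", "RECOVERY": "close"}
    return mapping.get(notification_type, "ignore")


def nagios_ack_command(host, service, author, comment, now=None):
    """Alerting service -> Nagios: build an external command line that
    acknowledges a service problem.

    sticky=2 keeps the acknowledgement until the service recovers,
    notify=1 sends a notification about the acknowledgement,
    persistent=0 drops the comment when Nagios restarts.
    """
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;"
        f"{host};{service};2;1;0;{author};{comment}"
    )


line = nagios_ack_command("web-01", "HTTP", "alice", "Acknowledged from mobile", now=1357105260)
```

A script triggered by the user’s acknowledgement would append a line like this to the Nagios command file, so that the Nagios console and the mobile app show the same state.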


Role of alert notifications in IT Operations

January 8, 2013, 5:54 am


Mathias (<a href="http://twitter.com/roidrage" target="_blank">@roidrage</a>) of <a href="https://travis-ci.org/" target="_blank">Travis CI</a> has an <a href="http://www.paperplanes.de/2013/1/2/on-pager-duty.html" target="_blank">excellent blog post</a> on operations of a hosted product and the role alerting. It’s a good read for anyone who is in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve. At <a href="http://www.opsgenie.com">OpsGenie</a>, our goals are highly relevant to the topics discussed in the post. We provide alert & notification management tools to enable ops teams to manage entire alert life cycle, what happens after an alert is generated till the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:<h4>Should developers of platforms/applications be fully involved in the operations side of things?</h4>For small to medium size teams I believe the most efficient and effective model is for the developers to be directly involved in operations, especially for mission critical application. After all, no one knows the applications better than the developers. This modus operandi feels a lot more comfortable for modern applications running on the cloud infrastructure. As stated in the post, being part of operations vastly improve how developers think about the application, production environment, requirements of resilient systems, and write code to make the software more operable. Advantages are well articulated in the post, and based on past experience I couldn’t agree more. 
For critical applications with high availability requirements, having the very developer who implemented the code (or has good knowledge of it) at hand is invaluable in resolving issues rapidly. As such, at OpsGenie, developers are very involved in operations as well, in order to ensure potential issues are handled before they have any impact on our customers. We also understand that as good as this is for the operations of our service, if we’re not careful, involving developers in daily operations can drastically slow down the pace of development. To mitigate the risk of bogging down developers with operational duties such as responding to alerts, we automate whenever we can, and implement tools to enable team members to be as efficient as possible. Many of these capabilities are provided to our customers as part of the OpsGenie service as well (some of them are mentioned below). To be clear, in the (large) enterprise, however, there are often hundreds, if not thousands, of applications supported by the operations teams. Many of these applications are not built by internal development teams, or their developers are no longer around. As such, in these environments, operations processes tend to be very different, and it’s important to recognize these differences.<h4>Empowering the alert recipients with knowledge - Playbooks</h4>Creating playbooks, aka runbooks, for alerts is one of the most effective ways to increase operational efficiency and reduce the dependency on individuals. At OpsGenie, we assign a different code to each exception generated by the application and to each alert generated by monitoring tools or by our applications themselves. Each alert code is associated with a short description that explains what the alert is about, whether it’s critical, etc. This also allows us to create playbooks starting with the more common alert codes.
When an alert is generated, the recipient can use this information to make an initial assessment of the alert and determine the urgency, the right person to handle the problem, etc. One of the key OpsGenie features that enable playbooks is the ability to <a href="http://support.opsgenie.com/customer/portal/articles/551579-attachments">attach files to alerts</a> in OpsGenie either via the UI or the API. Using this capability, OpsGenie customers can attach the relevant playbook to the alert. Combining the runbook with additional information such as configuration data, change history, etc. (take a look at a sample alert in OpsGenie for an example) truly enables the recipients to determine the right course of action quickly. Recipients don’t have to scramble to find a computer, connect to a variety of systems, etc. to collect the information necessary for an assessment. In addition, OpsGenie allows <a href="http://support.opsgenie.com/customer/portal/articles/761242-alert-action-execution">alert recipients to execute actions</a> to collect further data or initiate remedial actions directly from their mobile devices, without having to have access to a computer, network connection, VPN, etc. For example, recipients can <a href="http://www.opsgenie.com/public/features/alert.actions.html">initiate an action</a> that executes a script to collect additional data relevant to the problem and attaches it to the alert to make it available to the recipients. I believe these are precisely the type of things operations teams need to be able to do in order to improve operational efficiency and minimize the overhead of handling alerts.
Efficient and effective operations require empowering alert recipients to handle most alerts directly from their mobile devices, and we’re building the underlying infrastructure to enable operations teams to implement these capabilities.<h4>Can I rely on the alert system to get the notifications?</h4>One of the main problems we’ve tackled with OpsGenie is increasing the reliability of the notifications. We concluded early on that we could not rely on any single notification channel alone and that we had to use <a href="http://www.opsgenie.com/public/features/multiple.channels.html">multiple notification channels</a>. When a single channel like SMS is used for alert notifications, not only may the notification not be delivered in a timely manner (SMS is best effort and delivery is not guaranteed), but recipients may simply miss it. OpsGenie leverages iPhone and Android push notifications, SMS and phone calls for notifications and allows users to put them in an order with a time delay between them. OpsGenie tries each method until the recipient sees the alert or the alert is acknowledged by one of the recipients. There are also a number of other capabilities such as <a href="http://www.opsgenie.com/public/features/tracking.html">detailed tracking</a> and <a href="http://support.opsgenie.com/customer/portal/articles/759603-heartbeat-monitoring">heartbeat monitoring</a> to improve the reliability of the system overall.<h4>Who gets up in the middle of the night when an alert goes off?</h4>OpsGenie currently allows specifying <a href="http://support.opsgenie.com/customer/portal/articles/551517-creating-alerts-via-the-web-ui">users and/or groups</a> as the recipients of an alert when an alert is created. At OpsGenie, for our own alerting, we chose to specify the group name as the recipient. This allows us to modify group membership as needed to determine who will be notified for alerts. Team members can add themselves to or remove themselves from the group at any time.
And with the recently added “<a href="http://support.opsgenie.com/customer/portal/articles/912552-escalations">escalations</a>” capabilities, it is now possible to have tiered notifications. For example, using escalations, OpsGenie users can now notify a user or a group first, and notify the larger team if the alert is not acknowledged within x minutes. This approach wakes up only the designated team member(s) for alerts, but if that user does not attend to the alert for whatever reason, other team members get notified. OpsGenie does not yet have support for on-call schedules, where the on-call team member is assigned automatically based on date and time (though alerts can be sent to a group and group members can be changed). This is one of the more commonly requested features, so we will work on it in the near future. If you’re an OpsGenie user and have requirements/thoughts in this area, please do <a href="http://support.opsgenie.com/customer/portal/emails/new">share them with us</a> as we’re actively working on the design of this feature. Sending alert notifications is only the first step of effective alerting. At OpsGenie, we believe what happens next really matters, and we strive to make it as painless as possible for ops teams to manage the process. <a href="http://twitter.com/berkay" target="_blank">@berkay</a>
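The ordered, delayed notification channels described in this post can be illustrated with a small simulation. This is only a sketch of the idea, not OpsGenie's implementation: the `NotificationStep` class, the `channels_tried` function, the channel names and the delays are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NotificationStep:
    channel: str        # hypothetical channel name, e.g. "push", "sms", "voice"
    delay_minutes: int  # minutes after alert creation before this channel is tried


def channels_tried(steps: List[NotificationStep],
                   acked_at: Optional[int] = None) -> List[str]:
    """Return the channels that fire, in order, before the alert is acknowledged.

    acked_at is the minute at which the alert was seen or acknowledged,
    or None if nobody ever responded.
    """
    tried = []
    for step in sorted(steps, key=lambda s: s.delay_minutes):
        # Stop as soon as the alert has been acknowledged.
        if acked_at is not None and acked_at <= step.delay_minutes:
            break
        tried.append(step.channel)
    return tried


# Assumed user preference: push first, SMS after 5 minutes, phone call after 10.
preferences = [
    NotificationStep("push", 0),
    NotificationStep("sms", 5),
    NotificationStep("voice", 10),
]
```

With these assumed preferences, an alert acknowledged after 7 minutes would have triggered the push notification and the SMS, but not the phone call.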

↧

Monitoring for troubleshooting problems vs alerting

January 14, 2013, 5:54 am

Data generated by monitoring systems can support operational processes in different ways, and I think it’s useful to know the distinction between the two core uses:<h4>Troubleshooting problems</h4>Modern applications rely on complex technology infrastructures with many levels of abstraction and consist of dozens of interdependent applications, as well as applications/services managed by third parties. Saying that troubleshooting problems and identifying the root causes in such complex environments can be “challenging” is likely a gross understatement. The operational data generated by monitoring tools, as well as by applications and systems themselves, is invaluable in facilitating the troubleshooting process. When troubleshooting a problem, we want all the data we can get our hands on. We want the log files, the metrics (resource utilization, response time, application performance, business, etc.), configuration data (entities, relationships, change history, etc.) and all other relevant data. The more data the merrier, provided that the data is stored and organized in a way that makes it easy to access, query, filter, correlate and analyze. We want to be able to compare and contrast, perform time based correlation, topology based correlation, etc. to understand the problems, eliminate potential causes and eventually determine the root cause. This mode of operations (sometimes referred to as bottom up monitoring) does not necessarily require a thorough understanding of the potential problems in advance. Instead, it requires scalable and flexible systems that can handle large amounts of loosely structured data. For most organizations that rely on traditional tools, handling this type of data in a cost efficient way has not been a viable option until recently. As a result, metric data gets aggregated (averages lie!), events get created based on logs, then normalized, filtered, deduplicated, etc.
and a lot of information gets lost in the process. In addition, most organizations use monitoring tools with disparate data stores, hence it’s often not possible for operations to analyze all the relevant data together. However, thanks to big data technologies, this has become less of a challenge. Operations teams can now take advantage of tools like <a href="http://graphite.wikidot.com/">Graphite</a> and <a href="http://opentsdb.net/">OpenTSDB</a> to store and analyze vast amounts of metric data, and use the <a href="http://logstash.net/">Logstash</a> / <a href="http://www.elasticsearch.org">ElasticSearch</a> / <a href="http://kibana.org/">Kibana</a> combo to aggregate and search all log files. The popularity of <a href="http://www.splunk.com">Splunk</a> can be attributed to the fact that it was the first solution in the enterprise that could handle large amounts of unstructured operational data, primarily logs but metrics as well. It is important to note that although it’s possible to code the intelligence to analyze this data programmatically and identify problems, by and large this approach requires carbon based intelligence to perform the analysis. As such, the interface to the data, the ease of access, query capabilities, and the visualization of the data play a crucial role in enabling operations folks to make effective use of the data.<h4>Alerting</h4>The goal of alerting is to detect problems that, currently or in the future, may have a negative impact on provided services, and to notify the right set of people. There are many different ways to generate alerts, including:<ul><li>thresholds on metric (time series) data</li><li>parsing log files for specific keywords</li><li>active checks to check the state of applications, systems, etc.</li></ul>Many organizations confuse the monitoring requirements for troubleshooting problems, as described above, with those for alerting. As a result, they generate alerts for everything that is monitored and end up with too many alerts.
I’ve worked with many organizations where the number of active alerts is in the many tens of thousands. In these organizations, alerts are not primarily used to detect problems and notify the appropriate people. Rather, alerts are mostly used as a troubleshooting tool, and a poor one at that, since alerts do not contain all the data. When creating alerts, thinking about the required action provides good guidance on whether or not that alert should be generated. What do we expect the recipients of the alert to do? Will the recipients be able to figure out what the problem is, what the impact is, how urgent it is, or at least how to start troubleshooting the problem? If there is no clear answer to these questions, chances are the alert is not actionable, hence has limited value. A “top down” approach to alerting is determining what the potential problems are and what data may be needed in order to diagnose these problems. We can start from the problems that would have the worst impact, are most likely to happen, etc. and work our way down from there. This approach has a number of advantages:<ul><li>it may be possible to implement the solution faster since the data that we may need to collect is often a small subset of all possible monitoring data we can collect</li><li>we can classify each problem and assign a code to it. And even better, we can document the troubleshooting and recovery procedures, the impact and the severity of the problem, providing handy runbooks for the recipients to empower them to respond to the alerts effectively.</li><li>we can tune the content of each alert according to the problem to make analysis of the problem easier. What additional information should be provided with the alert? Metric data trends, configuration data, change history, alert history, etc.</li></ul>Looking at the problem from this perspective, it should be clear that creating an alert when CPU utilization is over a set threshold is almost never a good idea.
It is simply not actionable information. “Server will run out of disk space within x minutes”, however, is very much an actionable alert, and since it is also a common problem, it makes sense to think about how we can collect the necessary data and diagnose this problem. Monitoring for troubleshooting problems and alerting are both useful and necessary disciplines. When implementing monitoring solutions, understanding the differences between the two, and what the primary goal is, goes a long way in providing the clarity needed to plot the right course. <a href="http://twitter.com/berkay">@berkay</a>
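To make the “server will run out of disk space within x minutes” example concrete, here is a minimal sketch of how such an alert condition could be evaluated. It assumes a simple linear extrapolation between the oldest and newest usage samples; the function names, the sample format and the numbers are hypothetical, and a real monitoring system would likely use a more robust trend estimate.

```python
from typing import List, Optional, Tuple

# Each sample is (minutes since some reference time, units of disk space used).
Sample = Tuple[float, float]


def minutes_until_full(samples: List[Sample], capacity: float) -> Optional[float]:
    """Estimate minutes until the disk is full by linear extrapolation.

    Uses only the first and last samples. Returns None when usage is
    flat or shrinking, since the disk is then not on course to fill up.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    growth_per_minute = (used1 - used0) / (t1 - t0)
    if growth_per_minute <= 0:
        return None
    return max(0.0, (capacity - used1) / growth_per_minute)


def should_alert(samples: List[Sample], capacity: float,
                 horizon_minutes: float) -> bool:
    """Alert only when the disk is projected to fill within the horizon."""
    eta = minutes_until_full(samples, capacity)
    return eta is not None and eta <= horizon_minutes


# Assumed data: usage grew from 50 to 70 units in 10 minutes on a 100-unit disk.
usage = [(0, 50), (10, 70)]
```

Unlike a static threshold, this condition carries its own urgency: the projected time to failure tells the recipient how quickly they need to act.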

↧

Annual payment option and reduced international SMS prices

January 29, 2013, 5:54 am

OpsGenie apps for Apple and Android smartphones not only allow us to provide rich alert notifications that are not limited to a couple hundred characters of text, but also enable OpsGenie to send push notifications to these devices. Unlike text messages, there is no charge for push notifications, which means that owners of Apple and Android devices can receive alert notifications without any additional cost to them (other than data transfer costs) wherever they may be in the world. SMS notifications are still necessary for non Apple/Android phones, as well as an additional notification mechanism for everyone. After all, you cannot rely on any single notification channel for critical alert delivery. As such, SMS notifications are fully supported by OpsGenie, and used by many OpsGenie users. We’re happy to announce that the cost for international SMS notifications is reduced to $0.10 per notification. We’ve also added discounted annual payment options, which bring the cost of the enterprise plan to $16/user/month (when paid yearly). The plan includes unlimited US/Canada SMS/phone notifications as well as 25 international SMS notifications and 10 international phone notifications. Subscription plans and pricing details can be found on our <a href="http://www.opsgenie.com/pricing/details">pricing details</a> page.

↧

Reducing alert noise using escalations

January 30, 2013, 5:54 am

We’ve recently added support for “<a href="http://www.opsgenie.com/public/features/escalations.html">escalations</a>” in OpsGenie. Escalations typically refer to notifying different users at different times until the alert is seen and processed (acknowledged) by someone, or the problem is resolved and the alert is closed. If the user who gets notified first resolves the problem, or determines the problem is not urgent, etc., other users don’t have to be notified. Since escalations allow notifying only a subset of the users for alerts initially, they can be quite useful in reducing “alert (notification) noise” while still ensuring alerts don’t fall through the cracks. OpsGenie supports both “rules based” and “ad-hoc” escalations. You can create <a href="http://support.opsgenie.com/customer/portal/articles/912552-escalations">escalation rules</a> that specify who should be notified when. You can then use the escalation rule as the recipient of an alert, instead of specifying users or groups directly. For example, the following escalation rule would notify user “fili” as soon as the alert is created, and if the alert is not acknowledged within 10 minutes, OpsGenie would notify the members of the “web_team” group.<img alt="image" src="http://media.tumblr.com/d514e644f3e6da1ad34607ab1118fe9b/tumblr_inline_mu3g40EElO1soq1dj.png"/>Escalation rules are quite useful when there is a predefined, agreed-on escalation process; however, it is not always clear how an alert should be escalated. Who should be notified? An escalation rule like the one in the example above would notify all the members of the web_team group. Wouldn’t it be better if only the right person were notified instead? Of course, this is not always possible or desirable, but in most cases, the first responder may be able to determine to whom to escalate the problem.
In these situations, “<a href="http://www.opsgenie.com/blog/2012/12/27/alert-lifecycle-management.html">ad-hoc escalations</a>” provide the opportunity to further reduce the noise by empowering users to control the escalation path. So what do we mean by ad-hoc escalations? We refer to the set of features that enable users to notify additional people using alert actions. The recipients of an alert can either “assign” the ownership of an alert to a user, or add others to the alert as recipients. In either case, OpsGenie would notify the additional users according to their notification preferences. As with any alert related activity, these actions are tracked by OpsGenie, and users can see who was notified, who has seen the alert, etc. in the recipients section and in the alert log. Reducing alert noise is essential in ensuring operations folks (and whoever else participates in operations) do not get overwhelmed by a high number of alerts. At OpsGenie, our goal is to provide tools that help ops folks minimize the number of interruptions (during the day or night), and we hope that you can use escalations to do just that. <a href="http://twitter.com/berkay">@berkay</a>
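The rules based escalation example from this post (notify “fili” immediately, then the “web_team” group if the alert is not acknowledged within 10 minutes) can be sketched as plain data plus a small evaluation function. This is an illustrative model only, not how OpsGenie represents or evaluates rules; the `notified_users` function, the rule format and the group membership shown are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# A rule is a list of (delay in minutes, user or group name) steps.
EscalationRule = List[Tuple[int, str]]


def notified_users(rule: EscalationRule,
                   groups: Dict[str, List[str]],
                   elapsed_minutes: int,
                   acked_at: Optional[int] = None) -> List[str]:
    """Return the users notified so far, in notification order, without duplicates.

    A step fires once its delay has elapsed, unless the alert was
    acknowledged at or before that time (acked_at, in minutes).
    """
    notified: List[str] = []
    for delay, target in sorted(rule, key=lambda step: step[0]):
        if delay > elapsed_minutes:
            break  # this step and all later ones are not due yet
        if acked_at is not None and acked_at <= delay:
            break  # acknowledged in time, so stop escalating
        # A target that is not a known group is treated as a single user.
        for user in groups.get(target, [target]):
            if user not in notified:
                notified.append(user)
    return notified


# The rule from the post; the members of web_team are made up for this example.
rule = [(0, "fili"), (10, "web_team")]
groups = {"web_team": ["fili", "kili", "bofur"]}
```

If “fili” acknowledges the alert after 4 minutes, nobody else in the group is notified, which is exactly the noise reduction the escalation rule is meant to provide.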

↧

Who to notify when - can I do that with OpsGenie?

February 4, 2013, 5:54 am

Since we released support for <a href="http://www.opsgenie.com/blog/2013/01/28/oncall-schedules.html">escalations and on-call schedules</a>, we’ve been fielding questions about whether one scenario or another is supported. So far, we’re quite happy with the outcome, since we have indeed been able to address the requirements OpsGenie users have thrown at us. I admit to being a little giddy about the fact that a number of use cases presented by users of competing solutions are handled by OpsGenie with ease. We’ve been collecting and organizing the use cases we’ve heard <a href="http://support.opsgenie.com/customer/portal/topics/437981-escalations-on-call-schedules-and-rotations">on our support site</a>, and we’ll continue to add new ones. Have a scenario in mind? Please do let us know. I can’t guarantee that it is supported, but if it is not, I can promise that it won’t take us 2 years to support it :) <a href="http://twitter.com/berkay">@berkay</a>

↧

Complex systems, IT operations and learning from others

February 6, 2013, 5:54 am

I first found out about complex systems almost 20 years ago when I read “<a href="http://www.amazon.com/COMPLEXITY-EMERGING-SCIENCE-ORDER-CHAOS/dp/0671872346">Complexity, emerging science at the edge of order and chaos</a>” by <a href="https://plus.google.com/117836940450633501050">Mitchell Waldrop</a>. The book chronicled the development of complexity theory and the scientists involved. As I read the book and contemplated its core ideas, I realized that almost everything I was interested in learning more about was indeed a <a href="http://en.wikipedia.org/wiki/Complex_adaptive_system">complex adaptive system</a>: political systems, the economy, biological organisms, nature, society, etc. The notion that these complex systems may have common characteristics and may be governed by similar rules put my mind into overdrive. I was hooked. In the months that followed, I devoured books on related topics: <a href="http://www.amazon.com/Emergence-Chaos-Order-Helix-Books/dp/0738201421">emergence</a>, <a href="http://www.amazon.com/At-Home-Universe-Self-Organization-Complexity/dp/0195111303">self organization</a>, <a href="http://www.amazon.com/Complexity-Cooperation-Agent-Based-Competition-Collaboration/dp/0691015678">game theory and cooperation</a>, <a href="http://www.amazon.com/Complexity-Life-at-Edge-Chaos/dp/0226476553">edge of chaos</a>, etc. It’s safe to say the ideas in these books and complex adaptive systems theory have shaped how I view the world. One of the core ideas of complexity theory is that complex systems are <a href="http://www.globalsystemsinitiatives.net/TwelveSimpleRules.pdf">governed by a set of rules</a> that are unlike those of other systems but common among complex systems. As such, to understand one complex adaptive system, we can study and learn from other complex systems. This idea strongly resonated with me and influenced how I approach analyzing any complex problem throughout my life.
As I learned about complex systems ranging from political systems to biology, I found myself looking for patterns, comparing and contrasting what I was looking at with other systems. I love talking to and listening to people with in-depth understanding of any sufficiently complex phenomena for this very reason, regardless of the topic. It is not only immensely fun for me but can be very useful. There is so much to learn from others. To be clear, I’m not pretending to unearth some previously unknown secret of life. Many people have been doing this forever, even unconsciously. And many have been using this approach as an innovation tool. <a href="http://www.amazon.com/Biomimicry-Innovation-Inspired-Janine-Benyus/dp/0060533226">Biomimicry</a>, <a href="http://www.ted.com/talks/janine_benyus_biomimicry_in_action.html">looking at nature for inspiration</a> for new inventions and solving complex problems, has emerged as a new branch of science, for example. And even in our field, as the level of complexity in information technologies increases, folks have been looking at other fields for inspiration. <a href="http://twitter.com/botchagalupe">John Willis</a> has been obsessed with Deming for the last couple of months, digging into what we can learn from his ideas on quality and variations and <a href="http://itrevolution.com/deming-to-devops-part-1/">how they apply to devops</a>. <a href="https://twitter.com/roidrage">Mathias Meyer</a> has been looking into other industries and <a href="http://www.paperplanes.de/2013/1/21/failure-is-always-an-option.html">how they approach failure and post-mortems</a>. <a href="https://twitter.com/jamesurquhart">James Urquhart</a> seems to be a fellow complex system enthusiast and has been contemplating whether <a href="http://gigaom.com/2012/01/08/cloud-is-complex-deal-with-it/">the cloud is becoming a complex adaptive system</a>, and what the implications may be for devops. These are remarkably valuable explorations. Pure brain food.
Essential ingredients triggering new thoughts, and making room for the innovation we desperately need to tackle the challenges we face in an ever-changing technology landscape. As I spend a lot of time working with alerts and notifications these days, I’ve also been looking into parallels in other industries. What can we learn from medical practices about alerting? Are there parallels between the relationship of ER doctors and specialists and the relationship of operations engineers and developers when dealing with problems? Are the dashboards in planes useful? How do pilots handle the barrage of flashing lights and beeps when there is a problem? How do the thinkers in that industry envision controls and instruments will evolve? What do they think is the right level of automation? Will carbon based intelligence still be the decision maker going forward? What type of intelligence and automation seems to be gaining ground: heuristic, rules based systems, or statistical analytics? Each of these explorations opens up possibilities. Perhaps it’s worth noting that learning from other complex systems does not mean just copying the patterns. Yes, we can learn from their best practices, and see whether they apply to our field, but we can also look at their trials and tribulations and maybe avoid repeating their mistakes. If nothing else, they may simply trigger new ways of thinking. And that is what we sorely need to move forward. <a href="http://twitter.com/berkay">@berkay</a>

↧