As long as our applications are in production, maximizing uptime and avoiding outages is the highest priority for developers and operations teams alike. Despite great care, achieving 100% uptime is a challenge for even the most rigorous DevOps teams. Imagine that one of your data centers stops responding and, in turn, your email service goes down completely, or that your payment service goes offline during Black Friday. Remember the AWS outage in April 2011 that lasted four days and affected countless cloud services? It is a good reminder that outages happen even in the most robust environments. Now what? Are you going to dig through huge log files to find out what went wrong? Are you going to notify all of your operations teams and developers at once to investigate the cause? Unless you devote substantial resources to chaos engineering as Netflix does, you will most likely have very limited time to resolve the issue, so those aren’t realistic options for most organizations.
What if you already knew the exact reason for the outage? These are precisely the real-world scenarios that Honeybadger and OpsGenie address for most DevOps teams. By combining Honeybadger's modern error management and monitoring suite with OpsGenie's meaningful and actionable alerts, powerful on-call schedules, and escalations, you can easily understand the context of problems within your app. You’ll be able to monitor your applications for errors and performance from the perspective of both a developer and an end user. Joshua Wood, Co-Founder at Honeybadger, briefly summarizes the power of integrating both services: "Organizations with dedicated IT and DevOps teams often want fine-grained control over who gets alerted, and when (i.e. at 3 am on a Saturday morning). Honeybadger's built-in alerts work great for keeping development teams in sync when it comes to application errors, but we're really excited to integrate with OpsGenie to bring our customers all the benefits of a full-featured IT alerting platform. With Honeybadger and OpsGenie you get the best of both worlds: highly contextual error reports sent to the right people at the right time."
Well-Rounded Capability, Easy Configuration
On the Honeybadger side, you can select which events are forwarded to OpsGenie. In addition to Honeybadger's error and uptime monitoring events, its rate escalation feature is a bonus: it creates an event and sends it to OpsGenie whenever an error occurs more often than a user-defined threshold, such as five times in twelve hours. You may also want to receive a notification when several critical or non-critical errors occur in production.
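To make the rate escalation idea concrete, here is a minimal sketch of a sliding-window threshold check in Python. It is not Honeybadger's implementation; the class name `RateEscalation` and its parameters are hypothetical and only illustrate the "more than five times in twelve hours" rule described above.

```python
from collections import deque
from datetime import datetime, timedelta


class RateEscalation:
    """Hypothetical sketch: escalate when an error occurs more than
    `threshold` times within a sliding `window`."""

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold
        self.window = window
        self._seen = deque()  # timestamps of recent occurrences, oldest first

    def record(self, occurred_at: datetime) -> bool:
        """Record one occurrence; return True if it should escalate."""
        self._seen.append(occurred_at)
        # Drop occurrences that have fallen out of the window.
        cutoff = occurred_at - self.window
        while self._seen and self._seen[0] <= cutoff:
            self._seen.popleft()
        return len(self._seen) > self.threshold


# Example: escalate on more than five occurrences in twelve hours.
rule = RateEscalation(threshold=5, window=timedelta(hours=12))
start = datetime(2024, 1, 1)
for hour in range(6):
    escalate = rule.record(start + timedelta(hours=hour))
print(escalate)  # True: the sixth occurrence exceeds the threshold
```

A real integration would send an event to OpsGenie at the point where `record` returns True; that step is left out here because it depends on your account configuration.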
When you add a Honeybadger integration, you will find a ready-to-use configuration. You can modify this default configuration and manage your alert rules: when to create an alert from Honeybadger events, what the alert properties should be, when to acknowledge, close, or add a note to those alerts, and who should be notified. The illustration below shows an example of a Honeybadger alert created from an error event using the default configuration.
You can also take advantage of OpsGenie's Alert De-Duplication to prevent an alert storm at the moment of truth. Because each Honeybadger event contains an ID that is unique to the error or website, you can easily tell where an error originated. What if the problem resolves itself? In that case, it makes sense to close the related alert automatically so that an already-resolved problem does not keep bothering the user. The default configuration for the Honeybadger integration includes this capability.
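The de-duplication and auto-close behavior described above can be modeled with a small in-memory sketch in Python. This is a toy model, not OpsGenie's implementation: the `AlertStore` class, the event field names (`event`, `fault_id`, `message`), and the alias format are all assumptions made for illustration. The idea is that repeated events carrying the same ID map to one open alert whose count increases, and a "resolved" event closes that alert.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alias: str
    message: str
    count: int = 1
    status: str = "open"


class AlertStore:
    """Hypothetical in-memory model of alias-based de-duplication."""

    def __init__(self):
        self._open = {}   # alias -> open Alert
        self.closed = []  # alerts closed by a "resolved" event

    @property
    def open_alerts(self):
        return list(self._open.values())

    def handle(self, event: dict):
        # Assumed event shape: {"event": ..., "fault_id": ..., "message": ...}
        alias = f"honeybadger-{event['fault_id']}"
        if event["event"] == "occurred":
            alert = self._open.get(alias)
            if alert is not None:
                alert.count += 1  # duplicate: bump the count, no new alert
            else:
                alert = Alert(alias, event["message"])
                self._open[alias] = alert
            return alert
        if event["event"] == "resolved":
            alert = self._open.pop(alias, None)
            if alert is not None:
                alert.status = "closed"  # close the alert automatically
                self.closed.append(alert)
            return alert
        return None  # other event types are ignored in this sketch


store = AlertStore()
for _ in range(3):
    store.handle({"event": "occurred", "fault_id": 42, "message": "Timeout"})
print(len(store.open_alerts), store.open_alerts[0].count)  # 1 3
store.handle({"event": "resolved", "fault_id": 42, "message": "Timeout"})
print(len(store.open_alerts))  # 0
```

Keying on the per-error ID is what keeps three occurrences of the same error from producing three separate notifications.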
Close Collaboration Between Both Sides
Two years have passed since we first announced our Honeybadger integration. Both OpsGenie and Honeybadger have continually worked to improve their services, and the two teams have cooperated closely to improve this integration on both sides. From the very beginning, we exchanged ideas to clearly define how each service works and what customers expect from the integration. Throughout this process, another important consideration was the maintainability of the existing integrations on both sides. Considering how rare such close collaboration is in the SaaS market, we are pleased with the result: the integration can be set up within minutes, works well with all the services that both parties provide, and has a ready-to-use default configuration that covers the most common use cases.
Final Words: Why should you adopt a monitoring and incident management cycle?
Wouldn't it be embarrassing if your users called to let you know your site is down? Wouldn't that harm your reputation? A problem with your application or service can turn into a major outage that you cannot resolve quickly unless you know about it in time. Instead of hoping Murphy's Law doesn’t come true, you should consistently monitor your application for performance and errors, and you should set up a strong incident management cycle. Stop guessing, and sign up if you haven't yet.