Best tactics to resolve incidents with certified responders
For over a decade now, 24/7 support has been part of the services we provide to our customers. We started from the ground up, offering it to a single client; over time, it has grown in importance and is now in demand with the majority of our clients.
A few weeks back I attended a day-long seminar on incident response run by PagerDuty. The information they shared with the audience was invaluable and very well presented, and for the most part it validated my own experience and knowledge of the subject. At the end of the seminar, we were even offered the opportunity to take a test and earn an incident responder certification.
This is not a particularly glamorous topic, but given the amount of work happening in the background, I thought it would be useful to share some insights on the subject with you.
What is incident response and why do I need it?
Digital Presence as a concept has come a long way since the times of “it’s only a website”, and it has evolved into a real ecosystem: complex content management systems, bespoke components, third-party integrations, real-time data, automated deployments, cloud hosting components, performance requirements, you name it. Obviously, the more complexity is added to a platform, the greater the chances that something will go wrong.
The importance of a proper monitoring and response strategy has been clear to us from the very beginning, and, as with most processes, we continue to refine it over time. This is a challenge in its own right, as it requires different tools and integrations, bespoke components for additional information, know-how, and the involvement of several staff members and our clients’ teams.
Being on call
In simple terms, “being on call” means that you can be contacted at any time (in our case day and night) to troubleshoot and address an issue arising from any of the supported systems, and it’s a responsibility that can’t be taken lightly. The call can come either from a customer having trouble performing a specific task, or from an automated system that has detected an anomaly or an outage. Depending on the severity of the incident, this also means that you may have to drop whatever you were doing to address the issue, and sometimes it even means skipping meals altogether! However, being on call is only the tip of the iceberg: in order to be ready to react to an incident, the staff needs to know what they’re looking at and how to respond.
Before an incident
First, let’s define the concept of an incident: in this context, we refer to an incident as any unplanned disruption, service interruption or degradation that is actively affecting our clients’ or their customers’ ability to use the services as expected. It can be something as simple as a service or a website that has become slow to respond, or a major outage affecting several services at once.
In my opinion, before any response strategy is deployed, it is fundamental to have the right monitoring and communication plans in place. It is also worth pointing out that monitoring strategies may differ from system to system: what one client considers a minor nuisance may be highly important to another, and it needs to be treated accordingly. However, the workflow used to respond to an issue deserves the same attention regardless of the nature of the incident. I appreciate that this may sound like it contradicts what I have just said about monitoring, but using the same approach every time avoids mistakes caused by panic responses. We are only human after all, and incidents can be stressful to deal with, especially in the early phases when there are still many unanswered questions.
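To illustrate the per-client side of this, here is a minimal sketch of how the same signal can map to different severities for different clients. Everything in it (client names, metric, thresholds) is a hypothetical example, not our actual configuration:

```python
from dataclasses import dataclass

@dataclass
class AlertPolicy:
    metric: str
    warning_threshold: float   # triggers a low-urgency notification
    critical_threshold: float  # triggers an immediate page

# Hypothetical policies: client_a treats slowness as critical,
# while client_b tolerates it until it is far more severe.
POLICIES = {
    "client_a": AlertPolicy("response_time_ms", 500, 1000),
    "client_b": AlertPolicy("response_time_ms", 2000, 5000),
}

def classify(client: str, metric: str, value: float) -> str:
    """Map a measured value to a severity for the given client."""
    policy = POLICIES[client]
    if metric != policy.metric:
        return "unmonitored"
    if value >= policy.critical_threshold:
        return "critical"   # page the on-call responder now
    if value >= policy.warning_threshold:
        return "warning"    # notify during business hours
    return "ok"
```

With these example thresholds, a 1200 ms response time pages the responder for client_a but is still considered fine for client_b, which is exactly the kind of difference the monitoring plan has to capture.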
Once the alarm is triggered, responders need to know what to look for, who to contact for the required remediations, and what information to relay for effective communication. The priority is to resolve the issue as soon as possible, and the process must run as close to muscle memory as possible to be effective.
During an incident
As with most things, communication is key. Regardless of whether the information comes from monitoring systems or from conversations between humans, the first step is to identify the problem (i.e., what the issue is and how it was raised, whether by a monitoring tool or by a user reporting a fault). During this information-gathering phase, the golden rule is to keep calm, as panic does not help anyone. In fact, panicking makes things look far worse than they are.
Typically, during an incident investigation, a call among the responders is opened. The ultimate purpose of this call is to facilitate the resolution of the incident as quickly as possible. To get there, a few repeatable steps are required, but it is important to remember that this is not a time for discussing the severity of the issue or its consequences, nor for complaints or blame games.
This is a time to assign clearly defined tasks to specific people to quickly identify and resolve the issue. More often than not, one cycle is enough to resolve the issue (e.g., a simple restart, a manual failover, or a rollback); however, sometimes the actual problem turns out to be different from first appearances, and the cycle needs to be repeated until a valid resolution is applied.
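The repeatable cycle described above could be sketched like this. Every name in it (diagnose, apply_fix, is_resolved) is an illustrative placeholder for a step in the team's runbook, not a real incident API:

```python
def resolution_cycle(state, diagnose, apply_fix, is_resolved, max_attempts=5):
    """Repeat the identify -> fix -> verify cycle until resolved.

    `diagnose`, `apply_fix` and `is_resolved` are supplied by the
    team's runbook; here they are plain callables for illustration.
    """
    notes = []
    for attempt in range(1, max_attempts + 1):
        hypothesis = diagnose(state)       # e.g. "stale cache", "needs failover"
        apply_fix(state, hypothesis)       # e.g. restart, failover, rollback
        if is_resolved(state):
            return attempt, notes          # resolved on this cycle
        # Not resolved: what we learnt feeds the next diagnosis.
        notes.append(f"attempt {attempt}: '{hypothesis}' did not resolve")
    raise RuntimeError("cycles exhausted without resolution; escalate")
```

Note that each failed cycle is recorded rather than discarded: a fix that did not work still narrows down the next diagnosis, which ties in with the point about troubleshooting below.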
Ideally the problem is addressed on the first attempt, but as we know, that’s not always the case. Troubleshooting should be considered part of the information-gathering process, since even a fix that turns out to be wrong teaches you more than making no decision at all. To be clear, the “no panic” rule still prevails, and it is important not to attempt random resolutions for the sake of it.
After the incident
So, the incident has been addressed, what’s next? In my experience, most of the incidents happen overnight, so (if that’s the case) catching up on sleep would be my immediate priority! :P
Following the incident, the team needs to compile what’s called a post-mortem report. This document contains all the important information such as the timeline of the incident, the applied resolution(s), the root cause, the learnings, and suggestions on how to prevent this issue from happening again.
Provided that all the relevant information was properly transcribed during the resolution call, compiling this document should be relatively straightforward; however, the review must be thorough, and the entire team must be involved. The ultimate purpose of this document is not to point fingers, but rather to ensure that the team has learnt from the experience and to prevent this type of issue from happening again.
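As a rough illustration, the post-mortem's contents could be modelled like this; the field names simply mirror the list above, not any particular tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    """Illustrative structure for a post-mortem report."""
    incident_id: str
    timeline: list            # (timestamp, event) pairs from the resolution call
    resolutions: list         # fixes applied, in order
    root_cause: str
    learnings: list
    prevention_actions: list  # follow-up tasks to stop a recurrence
    reviewed_by: list = field(default_factory=list)

    def is_review_complete(self, team: set) -> bool:
        """The report is only done once every team member has reviewed it."""
        return team.issubset(set(self.reviewed_by))
```

Encoding the review as part of the structure reflects the point above: the report isn't finished when it is written, but when the whole team has gone through it.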
PagerDuty, our weapon of choice for on-call alerts, has compiled a very detailed set of guidelines for post-mortems, and if you are interested in exploring the topic further, you can find them here.
Having an incident response plan in place, along with round-the-clock support, is crucial to maintaining our clients’ sites and applications for them and their customers. I hope this has shed some light on the subject, and if you think your company needs help managing the process described above, please feel free to contact us!