Jun 09, 2021

Effective tactics for incident resolution with responders

For over a decade now, 24/7 Support has been part of the services we provide to our customers. We started from the ground up, offering this service to a single client, and over time it has become increasingly important and in demand among the majority of our clients.

A few weeks back, I attended a day-long seminar about incident response run by PagerDuty. The information they shared with the audience was invaluable and very well presented, and for the most part it validated my own experience and knowledge on the matter. At the end of the seminar, we were even offered the opportunity to take a test and earn an incident responder certificate.

This is not a particularly glamorous topic, but given the amount of work happening in the background, I thought it would be useful to share some insights with you.

What is incident response and why do I need it?

Digital Presence as a concept has come a long way since the times of “it’s only a website”, and it has evolved into a real ecosystem: complex content management systems, bespoke components, third-party integrations, real-time data, automated deployments, cloud hosting components, performance requirements, you name it. Obviously, the more complexity is added to a platform, the greater the chance that something goes wrong.

The importance of a proper monitoring and response strategy has been very clear to us from the very beginning, and as with most processes, we continue to refine it over time. This task is a challenge in its own right, as it requires different tools and integrations, bespoke components for additional information, know-how, and the involvement of several staff members and our clients' teams.

Being on call

In simple terms, “being on call” means that you can be contacted at any time (in our case day and night) to troubleshoot and address an issue that arises from any of the supported systems, and it’s a responsibility that can’t be taken lightly. The caller can be either a customer having a problem performing a specific task, or an automated call from a system that has detected an anomaly or an outage. Depending on the severity of the incident, you may have to drop whatever you were doing to address the issue, and sometimes that even means skipping meals altogether! However, being on call is only the tip of the iceberg: to be ready to respond to an incident, the staff needs to know what they’re looking at and how to respond.

Before an incident

First, let’s define the concept of incident: in this context, we refer to an incident as any unplanned disruption, service interruption or degradation that is actively affecting our clients’ or their customers’ ability to avail of the services as otherwise expected. It can be something as simple as a service or a website that has become slow to respond, or a major outage affecting various services at once.

In my opinion, before any response strategy is deployed, it is fundamental to have the right monitoring and communication plans in place. It is also worth pointing out that monitoring strategies may differ from system to system; what may be considered a minor nuisance by some clients could be regarded as highly important by others, and it needs to be treated accordingly. However, the workflow for responding to the issue requires the same level of attention regardless of the incident's nature. I appreciate that this can sound at odds with what I have just said about monitoring, but using the same approach every time avoids mistakes caused by panic responses. We are only human, after all, and incidents can be stressful to deal with, especially in the early phases when many questions remain unanswered.
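To make the distinction between "different monitoring, same workflow" a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the client names, the thresholds, the severity labels and the workflow steps are hypothetical and do not describe our actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClientPolicy:
    """Hypothetical per-client thresholds; the numbers are illustrative only."""
    slow_after_seconds: float   # responses slower than this count as degraded
    degraded_severity: str      # how seriously this client treats slowness


# Assumed example clients, not real configuration.
POLICIES: Dict[str, ClientPolicy] = {
    "brochure-site": ClientPolicy(slow_after_seconds=5.0, degraded_severity="low"),
    "online-shop": ClientPolicy(slow_after_seconds=1.5, degraded_severity="high"),
}


def classify(client: str, is_up: bool, response_seconds: float) -> str:
    """Map one health-check result to a severity for the given client."""
    policy = POLICIES[client]
    if not is_up:
        return "critical"       # an outage is critical for every client
    if response_seconds > policy.slow_after_seconds:
        return policy.degraded_severity
    return "none"


def respond(client: str, severity: str) -> List[str]:
    """Return the same workflow steps whatever the severity is."""
    if severity == "none":
        return []
    return [
        f"acknowledge alert for {client} ({severity})",
        "identify the problem",
        "assign tasks",
        "verify the fix",
        "record notes for the post-mortem",
    ]
```

With these assumed thresholds, a three-second response is not an incident for the brochure site but is a high-severity one for the shop, and in both cases `respond` walks through the same steps.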

Once the alarm is triggered, the responders need to know what to look for, who to contact for all the required remediations and what relevant information to relay for effective communication. The priority is to resolve the issue ASAP, and the process must work as close to muscle memory as possible to be effective.
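One way to get closer to "muscle memory" is to write this knowledge down per system before anything goes wrong. The sketch below is a hypothetical Python structure, not a description of any particular tool: the `Runbook` fields, the example system and the contact names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Runbook:
    """What a responder needs at hand for one monitored system (hypothetical)."""
    first_checks: List[str]                 # what to look for first
    contacts: List[str]                     # ordered: who to try first, second, ...
    info_to_relay: List[str] = field(default_factory=list)


# Assumed example entry, for illustration only.
RUNBOOKS: Dict[str, Runbook] = {
    "website": Runbook(
        first_checks=["is the site reachable?", "did a deployment just go out?"],
        contacts=["on-call engineer", "hosting provider", "client contact"],
        info_to_relay=["what is affected", "since when", "current status"],
    ),
}


def next_contact(system: str, already_tried: Set[str]) -> Optional[str]:
    """Return the next person to contact, or None if everyone has been tried."""
    for contact in RUNBOOKS[system].contacts:
        if contact not in already_tried:
            return contact
    return None
```

For example, `next_contact("website", {"on-call engineer"})` returns `"hosting provider"`, so the responder never has to decide under pressure who comes next.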

During an incident

As with most things, communication is key. Regardless of whether the information sources are monitoring systems or human conversations, the first step is to identify the problem (i.e., what the issue is and how it was raised, whether by a monitoring tool or by a user reporting a fault). During this information-gathering phase, the golden rule is to keep calm, as panic does not help anyone. In fact, panicking makes things look way worse than they are.

Typically, during an incident investigation, a call is opened among the responders. The ultimate purpose of this call is to facilitate the resolution of the incident as quickly as possible. To get there, a few repeatable steps are required, but it is important to remember that this is not a time for discussing the severity of the issue or its consequences, nor for complaints or blame games.

This is a time to assign clearly defined tasks to specific people to quickly identify and resolve the issue. More often than not, one cycle is enough to resolve the issue (e.g., a simple restart, a manual failover, or a rollback); sometimes, the solution to the actual problem can be different from first appearances, and the cycle needs to be repeated until a valid resolution is applied.
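The cycle described above can be sketched as a simple loop: apply one planned remediation, check whether the service has recovered, and move on to the next only if it has not. This is a minimal Python illustration under stated assumptions; the function names, the remediation labels and the simulated service are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A remediation is a label plus the action that applies it (both hypothetical).
Remediation = Tuple[str, Callable[[], None]]


def run_cycles(
    remediations: List[Remediation],
    is_healthy: Callable[[], bool],
    notes: List[str],
) -> Optional[str]:
    """Apply planned remediations in order until the health check passes.

    Returns the label of the remediation that worked, or None if none did.
    Every attempt is written to `notes` so it can feed the post-mortem later.
    """
    for label, apply in remediations:
        notes.append(f"trying: {label}")
        apply()
        if is_healthy():
            notes.append(f"resolved by: {label}")
            return label
        notes.append(f"no effect: {label}")
    return None


# Simulated incident: a restart does not help, a rollback does.
service = {"healthy": False}


def restart() -> None:
    pass  # stands in for a real restart that does not fix this fault


def rollback() -> None:
    service["healthy"] = True  # stands in for a successful rollback


notes: List[str] = []
fixed_by = run_cycles(
    [("restart service", restart), ("roll back release", rollback)],
    lambda: service["healthy"],
    notes,
)
```

In this simulation the first cycle fails and the second succeeds, and `notes` ends up holding a record of both attempts in order.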

Ideally, the problem should be addressed on the first attempt, but as we know, that’s not always the case. Troubleshooting needs to be considered part of the information-gathering process, since applying the wrong fix is better than making no decision at all. Just to clarify, the “no panic” rule still needs to prevail, and it is important not to attempt random solutions for their own sake. It is also important to note that having a robust business-as-usual (BAU) website support contract in place means incidents are handled within defined SLAs, with proactive monitoring and on-call support. Our ongoing website support and monitoring service is called ArekiboCare.

After the incident

So, the incident has been addressed. What’s next? In my experience, most incidents happen overnight, so (if that’s the case) catching up on sleep would be my immediate priority! :P

Following the incident, the team needs to compile what’s called a post-mortem report. This document contains all the important information, including the timeline of the incident, the applied resolution(s), the root cause, the lessons learned, and suggestions to prevent this issue from happening again.
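As a rough illustration, the sections listed above could be captured in a small structure that also tells the team what is still missing before the review. This is a hypothetical Python sketch; the class and field names are assumptions, not a standard template.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class PostMortem:
    """Hypothetical container for the sections of a post-mortem report."""
    title: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    resolutions: List[str] = field(default_factory=list)
    root_cause: str = ""
    lessons_learned: List[str] = field(default_factory=list)
    prevention: List[str] = field(default_factory=list)

    def note(self, when: datetime, text: str) -> None:
        """Add one timeline entry, e.g. transcribed from the resolution call."""
        self.timeline.append((when, text))

    def missing_sections(self) -> List[str]:
        """List the sections that are still empty, in report order."""
        sections = [
            ("timeline", self.timeline),
            ("resolutions", self.resolutions),
            ("root cause", self.root_cause),
            ("lessons learned", self.lessons_learned),
            ("prevention", self.prevention),
        ]
        return [name for name, content in sections if not content]
```

A report created with only a title returns all five section names from `missing_sections()`, and the list shrinks as the team fills each one in.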

Provided that all the relevant information has been properly transcribed during the resolution call, compiling this document should be relatively straightforward; the review must be thorough, and the entire team must be involved. The ultimate purpose of this document is not to point fingers, but rather to ensure that the team has learned from the experience and to prevent this type of issue from happening again.

PagerDuty, our weapon of choice for on-call alerts, has compiled a very detailed set of guidelines for post-mortems, and if you are interested in exploring this further, you can find it here.

That's all, folks!

Having an incident response plan in place, along with round-the-clock support, is crucial to maintaining our clients’ sites and applications for them and their customers. I hope this has shed some light on the subject. If you think your company needs help managing the process described above, please feel free to contact us!

