Practical Monitoring

Effective Strategies for the Real World

Mike Julian

What about incidents that are actual outages and last longer than a few minutes? In that case, a well-defined set of roles becomes crucial. Each of these roles has a single function, and nobody should be doing double duty:

Incident commander (IC)
This person’s job is to make decisions. Notably, they are not performing any remediation, customer or internal communication, or investigation. Their job is to oversee the outage investigation and that’s it. Often, the on-call person adopts the IC role at the start of the incident. Sometimes the IC role is handed off to someone else, especially if the person on-call is better suited for another role.

Scribe
The scribe’s job is to write down what’s going on: who’s saying what and when, what decisions are being made, and what follow-up items are being identified. Again, this role should not be performing any investigation or remediation.

Communication liaison
This role communicates status updates to stakeholders, whether they are internal or external. In a sense, the liaison is the sole communication point between people working on the incident and people demanding to know what’s going on. One facet of this role is to prevent stakeholders (e.g., managers) from interfering with the incident by directly asking those working on resolving it for status updates.

Subject matter experts (SMEs)
These are the people actually working on the incident.
