This is a great topic, and something my organisation has thought a lot about improving. One process we've introduced that has helped onboard people into incident handling is a mock incident. We take a previous incident and run it as a war game with a set of engineers, who try to discover what the problem was. A manager plays the role of incident manager in a mock Slack room.
This has been a good tool for onboarding people onto the process, because otherwise their first experience is usually a live production issue, which can be daunting.
Thank you for compiling all this information. I learned a lot about incident management in my last job and am working to improve it in my current one, so this is especially relevant to me.
One thing I didn't see mentioned is analyzing groups of incidents for commonalities and trends. For instance, one analysis I did showed that a large percentage of our high-severity incidents were due to problems with scalability. Because we didn't understand what drove scalability, we would run out of a given resource, and that would cause an incident. After each incident we held a post-mortem, but we never stepped back to observe and address the overall trend.
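A minimal sketch of that kind of trend analysis, assuming each incident has been tagged in its post-mortem with a severity and a single contributing cause. The records, field names, and the `cause_share` helper below are made up for illustration and are not the analysis described above:

```python
from collections import Counter

# Hypothetical incident records; IDs, field names, and values are illustrative only.
incidents = [
    {"id": "INC-1", "severity": "high", "cause": "scalability"},
    {"id": "INC-2", "severity": "high", "cause": "deploy"},
    {"id": "INC-3", "severity": "low", "cause": "config"},
    {"id": "INC-4", "severity": "high", "cause": "scalability"},
    {"id": "INC-5", "severity": "high", "cause": "config"},
    {"id": "INC-6", "severity": "low", "cause": "deploy"},
    {"id": "INC-7", "severity": "high", "cause": "scalability"},
]


def cause_share(records, severity):
    """Return each cause's fraction of the incidents at the given severity."""
    causes = [r["cause"] for r in records if r["severity"] == severity]
    total = len(causes)
    if total == 0:
        return {}
    # most_common() orders causes from most to least frequent.
    return {cause: n / total for cause, n in Counter(causes).most_common()}


shares = cause_share(incidents, "high")
for cause, share in shares.items():
    print(f"{cause}: {share:.0%}")
```

Running something like this across a quarter or a year of post-mortems, rather than reading each one in isolation, is what makes a recurring cause stand out.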
Also, Commonplace (a newsletter run by a guy named Cedric) has several articles on extracting tacit knowledge. Sources of Power by Gary Klein also has insights regarding how people develop and use tacit knowledge.