Incident management, a structured set of processes and procedures for identifying, diagnosing, and resolving IT incidents, allows us to do exactly that. Panagiotis Fezoulidis is IT Quality Manager at Porsche and explains more about the proper management of IT issues.
What is incident management?
Incident management is a set of processes that helps IT teams to quickly recognize and fix the problems that need attention. Here, the focus is not so much on developing the perfect solution or detecting the underlying cause of the problem, but rather on finding ad hoc fixes and workarounds. In the event of an unplanned interruption, incident management empowers IT teams to restore the affected service as quickly as possible. Incident management is part of the ITIL (IT Infrastructure Library) framework developed and popularized by the UK Government’s Central Computer and Telecommunications Agency in the 1980s to standardize IT management practices.
The aim of our incident management at Porsche is to restore services to our customers as quickly as possible, thereby ensuring the provision of top-quality service to all clients at all times. By channeling all IT incidents through a single point of contact, we stay on top of IT issues and keep IT costs under control.
Incident types, phases, and priorities
At Porsche, we distinguish between three types of incidents:
1. service requests, such as a password reset or an account registration,
2. incidents (every incident is recorded as an incident ticket in the incident management process; an accumulation of similar service requests may also indicate an impairment of the IT system or IT service),
3. major incidents, which trigger the major incident management process in addition to an incident ticket.
Our incident handling process is structured into four phases:
1. identify, record, classify incidents and provide initial support,
2. investigate and diagnose incident,
3. resolve incident and restore service,
4. close incident.
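The four phases above can be read as a simple linear lifecycle. The following is a minimal sketch under that assumption; the names `Phase` and `advance` are hypothetical illustrations and not part of any tooling described here.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical labels for the four phases listed above."""
    RECORDED = 1      # identified, recorded, classified, initial support given
    DIAGNOSING = 2    # investigation and diagnosis
    RESOLVED = 3      # incident resolved, service restored
    CLOSED = 4        # incident closed


def advance(phase: Phase) -> Phase:
    """Move an incident to the next phase; a closed incident cannot advance."""
    if phase is Phase.CLOSED:
        raise ValueError("incident is already closed")
    return Phase(phase.value + 1)
```

Modeling the phases as an ordered enumeration makes it impossible to skip from recording straight to closure without passing through diagnosis and resolution.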
Once a workaround has been implemented, we initiate long-term troubleshooting measures via root cause analysis if necessary. This analysis is handled in the problem management process.
How do our IT teams know what is critical and what is not? Incidents are prioritized according to impact and urgency, with the highest priority given to critical incidents. Based on the assessed severity, the incidents are classified and processed by our teams.
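Prioritization by impact and urgency is often expressed as a matrix. The sketch below assumes a three-level scale for both dimensions and five priority classes, which is a common ITIL-style convention; the actual scale used at Porsche is not stated here, so `Level` and `priority` are purely illustrative.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-level scale for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (critical) to 5 (lowest).

    The sum of impact and urgency ranges from 2 to 6, so subtracting
    it from 7 maps high/high to 1 and low/low to 5.
    """
    return 7 - (impact + urgency)


# Example: sort open tickets so that the most critical come first.
tickets = [
    ("password reset", Level.LOW, Level.LOW),
    ("production line outage", Level.HIGH, Level.HIGH),
    ("slow intranet page", Level.MEDIUM, Level.LOW),
]
ordered = sorted(tickets, key=lambda t: priority(t[1], t[2]))
```

With this mapping, an incident of high impact and high urgency always lands in the top priority class, while either dimension alone can only raise an incident to the middle of the scale.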
Major incident management
Major incidents can have a high impact on IT systems, which in turn can cause production line outages, for example, or even prevent a car sale. Disruptions of this kind can cost the company a lot of money. Therefore, the resolution of major incidents takes priority over other tasks in day-to-day business for everyone involved. The most promising solution has the highest priority and is pursued jointly. The Major Incident Manager (MIM) coordinates tasks and decisions and is responsible for resolving the major incident as quickly as possible.
Our major incident management process is also structured into four phases:
1. incident detection and prioritization,
2. automatic alerting (a preselected group is notified via SMS and email),
3. acceptance, communication, and resolution (on-demand communication with the entire organization via email),
4. incident closure (final communication to the entire organization via email and quality review of the incident ticket).
Moreover, we have defined several business rules for major incident management. For example, if the designated persons with the appropriate competencies cannot be reached, those involved should assume technical and substantive responsibility for finding a solution. Any decision that is made is generally better than no decision at all, as it offers the possibility of remedying the situation.
If the Major Incident Manager is already active on a major incident and has started the MIM communication, only the MIM may downgrade the incident. At the same time, the MIM ensures MIM de-escalation. Once an incident has been resolved, the priority must not be changed, as this would falsify the service-level agreement evaluation.
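The two priority rules above could be enforced in a ticketing tool roughly as follows. This is a sketch under stated assumptions: the `MajorIncident` record, its fields, and the `change_priority` function are hypothetical, and lower numbers are assumed to mean higher priority.

```python
from dataclasses import dataclass


@dataclass
class MajorIncident:
    priority: int              # assumption: 1 is the highest priority
    mim_active: bool = False   # the MIM has started the MIM communication
    resolved: bool = False


def change_priority(incident: MajorIncident, new_priority: int,
                    actor_is_mim: bool) -> None:
    """Apply a priority change only if the business rules allow it."""
    if incident.resolved:
        # Changing priority after resolution would distort the SLA evaluation.
        raise PermissionError("priority is frozen once the incident is resolved")
    is_downgrade = new_priority > incident.priority
    if is_downgrade and incident.mim_active and not actor_is_mim:
        raise PermissionError("only the Major Incident Manager may downgrade")
    incident.priority = new_priority
```

Checking the resolved flag first means that not even the MIM can alter the priority of a resolved incident, which matches the rule that the service-level agreement evaluation must stay unaltered.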
In conclusion, incidents will always occur in IT processes. What counts is how we deal with them, so that their consequences are mitigated as effectively as possible. This limits IT failures and helps our teams perform at their best.