Sunday, April 11, 2010

Fault and Event Management - Are we missing the boat?

In the beginning, folks used to tail log files.   As interesting things would show up, folks would see what was happening and respond to the situation.  Obviously, this didn't scale too well as you can only get about 35-40 lines per screen. As things evolved, folks looked for ways to visually queue the important messages.   When  you look at swatch, it changes the text colors and background colors, blinking, etc. as interesting things were noted.

Applications like xnmevents from OpenView NNM provided an event display that is basically a sequential event list.  (Here's a link to an image snapshot of xnmevents in action -> HERE! )

In Openview Operations, events are aligned by nodes that belong to the user.  If the event is received from a node that is not in the users node group, they don't receive the event.

Some applications tended to attempt to mask downstream events from users through topology based correlation.  And while this appears to be a good thing in that it reduces the numbers of events in an events display, it takes away the ability to notify customers based on side effect events. A true double edged sword - especially if you want to be customer focused.

With some implementations, the focus of event management is to qualify and only include those events that are perceived as being worthy of being displayed. While it may seem a valid strategy, the importance should be on the situation awareness of the NOC and not on the enrichment.  You may miss whole pieces of information and awareness... But your customers and end users may not miss them!

All in all, we're still just talking about discreet events here.  These events may or may not be conditional or situational or even pertinent to a particular users given perspective.

From an ITIL perspective, (Well, I have ascertained the 3 different versions of ITIL Incident definitions as things have evolved...) as:
"Incident (ITILv3):    [Service OperationAn unplanned interruption to an IT Service or a reduction in the Quality of an IT ServiceFailure of a Configuration Item that has not yet impacted Service is also an Incident. For example, Failure of one diskfrom a mirror set.
See alsoProblem
Incident (ITILv2):    An event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity.
An Incident might give rise to the identification and investigation of a Problem, but never become a Problem. Even if handed over to the Problem Management process for 2nd Line Incident Control, it remains an IncidentProblem Management might, however, manage the resolution of the Incident and Problem in tandem, for instance if the Incident can only be closed by resolution of the Problem.
Incident (ITILv1):    An event which is not part of the normal operation of an IT Service. It will have an impact on the service, although this may be slight and may even be transparent to customers.

From the ITIL specification folks, I got this on Incident Management: Ref:
Quoting them "

'Real World' definition of Incident Management: IM is the way that the Service Desk puts out the 'daily fires'.

An 'Incident' is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service.

The objective of Incident Management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.

Inputs for Incident Management mostly come from users, but can have other sources as well like management Information or Detection Systems. The outputs of the process are RFC’s (Requests for Changes), resolved and closed Incidents, management information and communication to the customer.

Activities of the Incident Management process:

Incident detection and recording
Classification and initial support
Investigation and diagnosis
Resolution and recovery
Incident closure
Incident ownership, monitoring, tracking and communication

These elements provides a baseline for management review.

Also, I got this snippet from the same web site :

"Incidents and Service Requests are formally managed through a staged process to conclusion. This process is referred to as the "Incident Management Lifecycle". The objective of the Incident Management Lifecycle is to restore the service as quickly as possible to meet Service Level Agreements. The process is primarily aimed at the user level."

From an Event perspective, and event may or may not signify an Incident. An incident, by definition, has a lifecycle from start to conclusion which means it is a defined process.  This process can and should be mapped out, optimized, and documented.

Even the fact that you process an unknown event should, according to ITIL best practices, align your process steps toward an Incident Lifecycle on to an escalation that captures and uses information derived from the new incident to be mapped, process wise.

So, if one is presented with an event, is it an incident?  If it is, what is the process by which this Incident is handled? And if it is an Incident and it is being processed, what step  in the Incident process is the incident?  How long has it been in processing? What steps need to be taken right away, to process this incident effectively?

From a real world perspective, the events we work from are discreet events. They may be presented in a way that signifies a discreet "start of an Incident" process.  But inherently, an Incident may have several valid inputs from discreet events as part of the Incident Management process.

So, are we missing the boat here? Is every event presented an Incident? Not hardly. Now, intuitively, are your users managing events or incidents? Events - Hmmm Thought so.  How do you apply process and process optimization to something you don't inherently manage to in real time?  Incident management becomes an ABSTRACTION of event management. And you manage to Events in hopes that you'll make Incident Management better.

My take is that the abstraction is backwards because the software hasn't evolved to be incident / problem focused. So you see folks optimize to events as thats they way information is presented to them.  But it is not the same as managing incidents. 

For example, lets say I have a disk drive go south for the winter. And OK, its mirrored and is capable of being corrected without downtime. AWESOME. However, when you replace the drive, your mirror has to synch.  When it does, applications that use that drive - let's say a database - are held back from operating due to the synchronization.

From the aspect of incidents, you have a disk drive failure which is an incident to the System administrator for the system. This disk drive error may present thousands of events in that the dependencies of the CIs upon the failed or errored component span over into multiple areas.  For example, if you're scraping error logs and sending them in as traps, each unique event presents itself as something separate. Application performance thresholds present events depicting conditional changes in performance. 

This one incident could have a profound waterfall effect on events, their numbers and handling, given a single incident. Only the tools mandate that you manage to events which further exacerbates the workflow issue.

Organizations attempt to work around this by implementing ticketing systems. Only, once to move a user from the single pane of glass / near real time display to individual tickets, the end users become unaware of the real time aspects of the environment.  Once a ticket is opened and worked, all Hades could break loose and that user wouldn't be aware.

In Summary

The Event Management tools today present and process events. And the align the users toward Events. Somewhere along the way, we have missed the fact that an event does not equal an Incident.  But the tools don't align the information to incident so it has hampered effective ITIL implementation.

The Single Pane of Glass applications need to start migrating and evolving toward empowering Incident management in that near real time realm they do best. Create awareness of incidents as well as the incident process lifecycle. 


  1. Check this out:
    The fellas at Rivermuse are segregating events from alerts as well. NICE!

  2. You can ask the event management company to help you set up special meetings that your employees and/or business partners may need to attend.