Dougie's Enterprise Management World: workflow

Showing posts with label workflow. Show all posts

Sunday, July 1, 2012

ENSM Products are Commodities?

On your quest to put in network and systems management capabilities, you have to figure in several explicit and implicit factors related to your end goals. What I mean is that while it's easy to go to your Framework Vendor of choice, break out the Bill of Materials spreadsheet, and sit down with the Sales person and go through the elements you would need for your environment, it may be filled with hidden challenges. And some challenges may be harder to overcome than others once you have signed the check.

Don't forget, these products don't magically install and run themselves. They take care and feeding. Some more than others. And the more complex it is, the more complex it is to figure out when something goes awry.

Sounds so easy! After all, all of these products are commodities. And buying from a single vendor gives you a single point of support... and blame. In effect, a single "throat to choke". NOTHING could be further from the truth!

Most of the big vendor's product frameworks are aggregations and conglomerations of products that have been acquired, some overlapping, into what looks like a somewhat unified solution. In many cases, it is only after you buy the product framework that you discover stuff like there are different portals with different products and these portals don't effectively integrate together. Or you may find the north bound interface of one product is a kludge to somewhat loosely fit the two products together. Or two products use competing Java versions.

Some vendors product suites have become more and more complex as new releases are GAed. In many cases, these new levels of complexity have a profound impact on your ability to install, administer, or diagnose issues as they arise.

First up - Where are your requirements? Do you know the numbers and types of elements in your environment? What about the applications? How do these apply to Service Level Agreements? Do you have varying levels of maintenance and support for the components in your environment?

Do you know who the users will be? Have you defined your support model? Which groups need access to what elements of information? Do you have or have you prepared a proposed workflow of how users, managers, and even customers are going to interact with the new capabilities?

Who is going to take care of the management systems and applications? Have you aligned your organization to be successful in deployment? Do you have the skill sets? Do you have adequate skills coverage?

Have you defined the event flow? What about performance reports needs and distributions? And ad hoc reporting needs? Have you defined any baseline thresholds?

Do you have SNMP access? What about ICMP? SSH? Have you considered the implications of management traffic across your security zones?

Product Choices

While there are a plethora of choices available to you, many do not want to go through the hassle of doing due diligence. But be forewarned, failure to do due diligence can wreak mayhem in you environment. I know, the big guns say that "our product works in your competitors" but does it really? You don't know? As is your competition that undifferentiated from you? (May not be a good thing!)

When you go through product selection, you need to realize the support needed to administer the new management applications. Do you need specialists just to install it? What about training? Are you going to need other resources like Business Intelligence Analysts, Web Developers, Database Administrators, Script Developers, or even additional Analysts or Engineers.

Here are some signs you may experience:

If the product takes longer than a couple of days to install and integrate, here's your sign.

If two or more products in your big vendor product suite need a significant amount of customization to work together, here's your sign.

If the installation document for the product deviates from the actual installation, here's your sign.

If you find out you actually have to install additional product as discovered during the installation, here's your sign.

If you end up realizing that the recommended hardware specs are either overkill or under-speced, here's your sign.

If you end up having to deal with libraries and utilities that are not included or resolved with the product installation, here's your sign.

Missed it by THAT much!

If you find yourself opening up support tickets in the middle of the installation, here's your sign.

If you find that the product breaks your security model AFTER you do the installation, here's your sign.

If it takes Vendor specific Engineering to install the product, here's your sign.

If you cannot see value in the first day after the installation of a product, here's your sign.

If you find that you need to restructure and build out your support team AFTER the installation, here's your sign.

Systems Management

Systems Management brings whole new challenges to your environment. Some of the things you need to evaluate up front are:

Agent deployment - Level of Difficulty - OS Coverage - consistent data across agents. Manual, Automatic, or distribute able
Agent-less - Browser specific? Adequate coverage? Full transactions? Handles redirection?
Agent run time - Resource utilization - memory footprint - stability - Security.
Data collection - Pull or push model? Resiliency? Effect on run time resources?
External Restrictions - Java versions? Perl versions? Python versions?
Adequate application coverage?
Thresholds - Level of difficulty? Binary only or degrees of utilization/capacity/performance? Stateful? Dynamic thresholds? Northbound traps already defined or do you have to do your own?

Summary

Enterprise Management does not have to be that difficult. There are products out there that work very well for what they do and are easy to deploy and maintain. For example, go do an OpenNMS installation. Even though OpenNMS runs on just about any platform (a testament to their developer community and product maturity), you go to their wiki page http://www.opennms.org/documentation/installguide.html , pick out your platform of choice, and follow the procedure. Most of the time, you are looking at maybe an hour. In an hour, you're starting discovery and picking up inventory to monitor and manage.

Solarwinds isn't too bad either. Nice, clean install on Windows.

Splunk is awesome and up in running in no time. http://www.splunk.com/

Hyperic HQ wasn't a bad installation either. Pretty simple. However, it is time sensitive on the agents. Kind of thick (I think its the Struts), Java wise. http://www.hyperic.com/

eGInnovations is cake. One agent everywhere for OS and applications. Handles VMWare, Xen and others. And the UI is straight forward. A Ton of value across both system and application monitoring and performance. http://www.eginnovations.com/

Appliance based solutions take a bit more time in the planning phase up front but take the sting out of installation. Some of these include:

http://www.sevone.com/ (SevOne does offer a software download for evaluation)
http://www.sciencelogic.com/
http://www.loglogic.com/ (They also offer a virtual appliance download)

One solution I dig is Tavve ZoneRanger for solving those access issues like UDP/SNMP across firewalls, SSH access across a firewall, etc., without having to run through proxies upon proxies and still maintain consistent auditing and logging. It deploys as an appliance of virtual appliance. http://www.tavve.com/

Another aspect you may consider include hosted applications. ServiceNow is easy to deploy because it is a hosted solution. http://www.servicenow.com/

Saturday, May 22, 2010

Support Model Woes

I find it ironic that folks claim to understand ITIL Management processes yet do not understand the levels of support model.

Most support organizations have multiple ters of support. For example, there is usually a Level 1 which is the initial interface toward the customer. Level 2 is usually a traige or technical feet on the street. Level 3 is usually Engineering or Development. In some cases, Level 4 is used to denote on site vendor support or third party support.

In organizations where Level 1 does dispatch only or doesn't follow through problems with the customer, the customer ends up owning and following through the problem to solution. What does this say to customers?

- There are technically equal to or better than level 1
- they become accustomed to automatic escalation.
- They lack confidence in the service provider
- They look for specific Engineers versus following process
- They build organizations specifically to follow and track problems through to resolution.

If your desks do dispatch only, event management systems are only used to present the initial event leading up to a problem. What a way to render Netcool useless! Netcool is designed to display the active things that are happening in your environment. If all you ever do is dispatch, why do you need Netcool? Just generate a ticket straight away. No need to display it.

What? Afraid of rendering your multi-million dollar investment useless? Why leave it in a disfunctional state of semi-uselessness when you could save a significant amount of money getting rid of it? Just think, every trap can become an object that gets processed like a "traplet" of sorts.

One of the first things that crops up is that events have a tendency to be dumb. Somewhere along the way, somebody had the bright idea that to put an agent into production that is "lightweight" - meaning no intelligence or correlation built in. Its so easy to do little or nothing, isn't it? Besides, its someone elses problem to deal with the intelligence. In the
vendors case, many times agent functionality is an afterthought.

Your model is all wrong. And until the model is corrected, you will never realize the potential ROI of what the systems can provide. You cannot evolve because you have to attempt to retrofit everything back to the same broken model. And when you automate, you automate the broken as well.

Heres the way it works normally. You have 3 or 4 levels of support.

Level 1 is considered first line and they perform the initial customer engagement, diagnostics and triage, and initiate workflow. Level 1 initiates and engages additional levels of support trackingt thingts through to completion. In effect, level 1 owns the incident / problem management process but also provides customer engagement and fulfillment.

Level 2 is specialized support for various areas like network, hosting, or application support. They are engaged through Level 1 personnel and are matrixed to report to level 1, problem by problem such that they empower level 1 to keep the customer informed of status and timelines, set expectations, and answer questions.

Level 3 is engaged when the problem becomes beyond the technical capabilities of levels 1 and 2, requires project, capital expenditure, architecture, and planning support.

Level 4 is reserved for Vendor support or consulting support and engagement.

A Level 0 ia used to describe automation and correlation performed before workflow is enacted.

When you breakdown your workflow into these levels, you can start to optimize and realize ROI by reducing the cost of maintenance actions across the board. By establishing goals to solve 60-70% of all incidents at LEvel 1, Level 2-4 involvement helps to drive knowledge and understanding downward to level 1 folks why better utilizing level 2 - 4 folks.

In order to implement these levels of support, you have organize and define your support organization accordingly. Define its rolls and responsibilities, set expectations, and work towards success. Netcool, as an Event Management platform, need to be aligned to the support model. Things that ingress and egress tickets need to be updated in Netcool. Workflow that occurs, needs to update Netcool so that personnel have awareness of what is going on.

Sunday, April 18, 2010

Netcool and Evolution toward Situation Management

Virtually no new evolution in Fault Management and correlation has been done in the last ten years. Seems we have a presumption that what we have today is as far as we can go. Truly sad.

In recent discussions on the INUG Netcool Users Forum, we discussed shortfalls in the products in hopes that big Blue may see its way clear of the technical obstacles. I don't think they are accepting or open to mine and other suggestions. But thats OK. you plant a seed - water it - feed it. And hopefully, one day, it comes to life!

Most of Netcool design is based somewhat loosely on TMF standards. They left out the hard stuff like object modelling but I understand why. The problem is that most Enterprises and MSPs don't fit the TMF design pattern. Nor do they fit eTOM. This plays specifically to my suggestion that "There's more than one way to do it!" - The Slogan behind Perl.

The underlying premise behind Netcool is that it is a single pane of glass for viewing and recognizing what is going on in your environment. It provides a way to achieve situation awareness and a platform which can be used to drive interactive work from. So what about ITIL and Netcool?

From the aspect of product positioning, most ITIL based platforms have turned out to be rehashs of Trouble Ticketing systems. When you talk to someone about ITIL, they immediately think of HP ITSM or BMC Remedy. Because of the complexity, these systems sometimes takes several months to implement. And nothing is cheap. Some folks resort to open source like RT or OTRS. Others want to migrate towards a different, appliance based model like ServiceNow and ScienceLogic EM7.

The problem is that once you transition out of Netcool, you lose your situation awareness. Its like having a notebook full of pages. Once you flip to page 50, pages 1-49 are out of sight and therefore gone. All hell could break lose and you'd never know.

So, why not implement ITIL in Netcool? May be a bit difficult. Here are a few things to consider:

1. The paradigm that an event has only 2 states is bogus.
2. The concept that there are events and these lead to incidents, problems, and changes.
3. Introduces workflow to Netcool.
4. Needs to be aware of CI references and relationships.
5. Introduces the concept that the user is part of the system in lieu of being an external entity.
6. May change the exclusion approach toward event processing.
7. Requires data storage and retrieval capabilities.

End Game

From a point of view where you'd like to end up, there are several use cases one could apply. For example:

One could see a situation develop and get solved in the Netcool display over time. As it is escalated and transitioned, you are able to see what has occurred, the workflow steps taken to solve this, and the people involved.

One could take a given situation and search through all of the events to see which ones may be applicable to the situation. Applying a ranking mechanism like a google search would help to position somewhat fuzzy information in proper contexts for the users.

Be able to take the process as it occurred and diagnose the steps and elements of information to optimize processes in future encounters.

Be able to automate, via the system, steps in the incident / problem process. Like escalations or notifications. Or executing some action externally.

Once you introduce workflow to Netcool, you need to introduce the concept of user awareness and collaboration. Who is online? What situations are they actively working versus observing? How do you handle Management escalations?

In ITIL definitions, an Incident has a defined workflow process from start to finish. Netcool could help to make the users aware of the process along with its effectiveness. Even in a simple event display you can show last, current and next steps in fields.

Value Proposition

From the aspect of implementation, the implementation of ITIL based systems has been focused solely around trouble ticketing systems. These systems have become huge behemoths of applications and with this comes two significant factors that hinder success - The loss of situation Awareness and the inability to realize and optimize processes in the near term.

These behemoth systems become difficult to adapt and difficult to keep up with optimizations. As such, they slow down the optimization process making it painful to move forward. If its hard to optimize, it will be hard to differentiate service because you cannot adapt to changes and measure the effectiveness fast enough to do any good.

A support organization that is aware of whats going on, subliminally portrays confidence. This confidence carries a huge weight in interactions with customers and staff alike. It is a different world on a desk when you're empowered to do good work for your customer.

More to come!

Hopefully, this will provide some food for thought on the evolution of event management into Situation Management. In the coming days I plan on adding to this thread several concepts like evolution toward complex event processing, Situation Awareness and Knowledge, data warehousing, and visualization.

Sunday, April 11, 2010

Fault and Event Management - Are we missing the boat?

In the beginning, folks used to tail log files. As interesting things would show up, folks would see what was happening and respond to the situation. Obviously, this didn't scale too well as you can only get about 35-40 lines per screen. As things evolved, folks looked for ways to visually queue the important messages. When you look at swatch, it changes the text colors and background colors, blinking, etc. as interesting things were noted.

Applications like xnmevents from OpenView NNM provided an event display that is basically a sequential event list. (Here's a link to an image snapshot of xnmevents in action -> HERE! )

In Openview Operations, events are aligned by nodes that belong to the user. If the event is received from a node that is not in the users node group, they don't receive the event.

Some applications tended to attempt to mask downstream events from users through topology based correlation. And while this appears to be a good thing in that it reduces the numbers of events in an events display, it takes away the ability to notify customers based on side effect events. A true double edged sword - especially if you want to be customer focused.

With some implementations, the focus of event management is to qualify and only include those events that are perceived as being worthy of being displayed. While it may seem a valid strategy, the importance should be on the situation awareness of the NOC and not on the enrichment. You may miss whole pieces of information and awareness... But your customers and end users may not miss them!

All in all, we're still just talking about discreet events here. These events may or may not be conditional or situational or even pertinent to a particular users given perspective.

From an ITIL perspective, (Well, I have ascertained the 3 different versions of ITIL Incident definitions as things have evolved...) as:
"Incident (ITILv3): [Service Operation] An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. For example, Failure of one diskfrom a mirror set.

See also: Problem

Incident (ITILv2): An event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity.

An Incident might give rise to the identification and investigation of a Problem, but never become a Problem. Even if handed over to the Problem Management process for 2nd Line Incident Control, it remains an Incident. Problem Management might, however, manage the resolution of the Incident and Problem in tandem, for instance if the Incident can only be closed by resolution of the Problem.

See also: Incident Management

Incident (ITILv1): An event which is not part of the normal operation of an IT Service. It will have an impact on the service, although this may be slight and may even be transparent to customers.

" as quoted from http://www.knowledgetransfer.net/dictionary/ITIL/en/Incident.htm

From the ITIL specification folks, I got this on Incident Management: Ref: http://www.itlibrary.org/index.php?page=Incident_Management

Quoting them "

'Real World' definition of Incident Management: IM is the way that the Service Desk puts out the 'daily fires'.

An 'Incident' is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service.

The objective of Incident Management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.

Inputs for Incident Management mostly come from users, but can have other sources as well like management Information or Detection Systems. The outputs of the process are RFC’s (Requests for Changes), resolved and closed Incidents, management information and communication to the customer.

Activities of the Incident Management process:

Incident detection and recording

Classification and initial support

Investigation and diagnosis

Resolution and recovery

Incident closure

Incident ownership, monitoring, tracking and communication

These elements provides a baseline for management review.

Also, I got this snippet from the same web site :

"Incidents and Service Requests are formally managed through a staged process to conclusion. This process is referred to as the "Incident Management Lifecycle". The objective of the Incident Management Lifecycle is to restore the service as quickly as possible to meet Service Level Agreements. The process is primarily aimed at the user level."

From an Event perspective, and event may or may not signify an Incident. An incident, by definition, has a lifecycle from start to conclusion which means it is a defined process. This process can and should be mapped out, optimized, and documented.

Even the fact that you process an unknown event should, according to ITIL best practices, align your process steps toward an Incident Lifecycle on to an escalation that captures and uses information derived from the new incident to be mapped, process wise.

So, if one is presented with an event, is it an incident? If it is, what is the process by which this Incident is handled? And if it is an Incident and it is being processed, what step in the Incident process is the incident? How long has it been in processing? What steps need to be taken right away, to process this incident effectively?

From a real world perspective, the events we work from are discreet events. They may be presented in a way that signifies a discreet "start of an Incident" process. But inherently, an Incident may have several valid inputs from discreet events as part of the Incident Management process.

So, are we missing the boat here? Is every event presented an Incident? Not hardly. Now, intuitively, are your users managing events or incidents? Events - Hmmm Thought so. How do you apply process and process optimization to something you don't inherently manage to in real time? Incident management becomes an ABSTRACTION of event management. And you manage to Events in hopes that you'll make Incident Management better.

My take is that the abstraction is backwards because the software hasn't evolved to be incident / problem focused. So you see folks optimize to events as thats they way information is presented to them. But it is not the same as managing incidents.

For example, lets say I have a disk drive go south for the winter. And OK, its mirrored and is capable of being corrected without downtime. AWESOME. However, when you replace the drive, your mirror has to synch. When it does, applications that use that drive - let's say a database - are held back from operating due to the synchronization.

From the aspect of incidents, you have a disk drive failure which is an incident to the System administrator for the system. This disk drive error may present thousands of events in that the dependencies of the CIs upon the failed or errored component span over into multiple areas. For example, if you're scraping error logs and sending them in as traps, each unique event presents itself as something separate. Application performance thresholds present events depicting conditional changes in performance.

This one incident could have a profound waterfall effect on events, their numbers and handling, given a single incident. Only the tools mandate that you manage to events which further exacerbates the workflow issue.

Organizations attempt to work around this by implementing ticketing systems. Only, once to move a user from the single pane of glass / near real time display to individual tickets, the end users become unaware of the real time aspects of the environment. Once a ticket is opened and worked, all Hades could break loose and that user wouldn't be aware.

In Summary

The Event Management tools today present and process events. And the align the users toward Events. Somewhere along the way, we have missed the fact that an event does not equal an Incident. But the tools don't align the information to incident so it has hampered effective ITIL implementation.

The Single Pane of Glass applications need to start migrating and evolving toward empowering Incident management in that near real time realm they do best. Create awareness of incidents as well as the incident process lifecycle.