Monday, August 14, 2017

Machine Learning and Status Determination

In thinking about status polling in networks, most of the software we use relies on ping lists of IP addresses or host names to perform the polling. We set standard timeouts and poll rates across the board, statically. Not that ICMP ping is all that accurate in determining whether a specific node is REALLY up or down. It is rife with misconceptions among those who lack an in-depth understanding of ICMP and its purpose in life. This is especially true if there are devices in the network that are not pingable but can affect latency and connectivity.

What if we are thinking about this wrong? 

Artificial intelligence and machine learning techniques may enable us to do a much better job in status "determination" versus just status polling.  

First, I would think that we should reset the objectives for status polling. The goal used to be something like: ping every device in the network every 2 minutes. While this seems OK, is it really what you want? What if you set the objective to:

Establish a status interval of 2 minutes for every managed device. Or I need to know when a device is no longer responding within 2 minutes. And what about latency and thresholds?

Here's a thought

I have 10000 device IPs and 2 pollers. 

Give the list to both pollers. Have them go through the list to establish:

The latency distance from the poller.
Establish the baseline root mean and standard deviation for latency.

Now, organize your latency distances from the longest to the shortest into groups or bands. Work to distribute the list as primary and secondary based on best results but evenly.
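The steps above could be sketched roughly as follows. This is a minimal illustration, not a finished poller: the function names, the data layout, and the choice of four bands are all hypothetical, the baseline is treated as a plain arithmetic mean, and it assumes every IP already has latency samples from at least two pollers.

```python
from statistics import mean, pstdev

def baseline(samples):
    """Return (mean, standard deviation) for a list of latency samples in ms."""
    return mean(samples), pstdev(samples)

def into_bands(mean_by_ip, n_bands=4):
    """Sort IPs from longest to shortest mean latency and cut into n_bands groups."""
    ordered = sorted(mean_by_ip, key=mean_by_ip.get, reverse=True)
    size = max(1, -(-len(ordered) // n_bands))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def assign_primary(latency_by_poller):
    """Pick (primary, secondary) pollers per IP: best latency, but evenly loaded.

    latency_by_poller: {ip: {poller_name: [latency samples in ms]}}
    """
    pollers = sorted({p for per_ip in latency_by_poller.values() for p in per_ip})
    cap = -(-len(latency_by_poller) // len(pollers))  # max IPs per poller
    load = {p: 0 for p in pollers}

    def gap(ip):
        # How much worse the worst poller is than the best one for this IP.
        means = sorted(mean(s) for s in latency_by_poller[ip].values())
        return means[-1] - means[0]

    assignment = {}
    # Place the IPs with the most to lose first, while both pollers have room.
    for ip in sorted(latency_by_poller, key=gap, reverse=True):
        per_poller = latency_by_poller[ip]
        ranked = sorted(per_poller, key=lambda p: mean(per_poller[p]))
        primary = next((p for p in ranked if load[p] < cap), ranked[0])
        secondary = next(p for p in ranked if p != primary)
        load[primary] += 1
        assignment[ip] = (primary, secondary)
    return assignment
```

With 10000 IPs and 2 pollers, each poller would end up primary for at most 5000 IPs and secondary for the rest.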

Assumptions

The farthest latency distance may or may not be the farthest physical distance
Non-responsive IPs may or may not be down.
Polling the farthest latencies should be more synchronous than closer IPs.

What if...?

What if I organize the data for each band into a heat map?  Could I use this to visualize clusters of changes or anomalies in latency on the network?  The problem today is that we respond to ping failures individually and we work those tickets that way.   In some cases, by the time the engineer gets around to diagnosis, the problem has magically disappeared. So, we do what everyone else does - we give away the SLA to let only the persistent problems bubble up.

By organizing in bands, in the heat map, what you will see is:

Changes in closer latency bands may affect other more distant bands. So, diagnose the closer first, then work your way out. (Kind of helps to prioritize the list when things get rough)
The outages are as important as the changes. When one or more IPs change enough latency-wise to transition from one band to another, it illuminates something in the network that may be indicative of something much bigger. Or may be precursors to outages. Or may be indicative of static or dynamic changes in the network.
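As a small sketch of spotting band transitions, the snippet below compares a baseline latency per IP against the latest poll. The band edges (5, 20, 50, 100 ms) and the function names are made up for illustration, and a missing or None reading is treated as "no response" rather than "down".

```python
def band_of(latency_ms, edges=(5, 20, 50, 100)):
    """Map a latency to a band index; band 0 is the closest band."""
    return sum(latency_ms >= e for e in edges)

def band_transitions(baseline_ms, current_ms, edges=(5, 20, 50, 100)):
    """Return (old_band, ip, new_band) for every IP whose band changed.

    new_band is None when the IP did not respond. The result is sorted
    closest band first, so the nearer changes get diagnosed first.
    """
    changes = []
    for ip, base in baseline_ms.items():
        now = current_ms.get(ip)
        old_band = band_of(base, edges)
        new_band = None if now is None else band_of(now, edges)
        if new_band != old_band:
            changes.append((old_band, ip, new_band))
    changes.sort(key=lambda c: (c[0], c[1]))
    return changes
```

Each tuple in the result is one cell that would change color in the heat map.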

Machine Learning possibilities

Look for groupings in the managed elements. For example, each subnet has 1 or more IP addresses that are a part of that subnet. 

IPs in some subnets may display different latency distances. For example, a 255.255.255.252 subnet has 2 host IPs and is commonly used for WAN links. If the distance to one is longer than the other, you can discern that it's probably the distant end. (If you can poll the more distant one, would that not imply that the nearer IP is also up?) Interestingly, subnets like this may be able to be visualized as "bridges" between bands.
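One way to find these near and far ends, sketched with Python's standard ipaddress module. It assumes you already know which polled IPs sit on /30 links (in practice the mask would come from inventory or configuration data), and the function name and sample addresses are hypothetical.

```python
import ipaddress
from collections import defaultdict

def wan_link_ends(mean_latency_ms, prefix=30):
    """Group polled IPs by /30 subnet and label the near and far end.

    mean_latency_ms: {ip_string: baseline mean latency in ms}
    Returns {subnet: (near_ip, far_ip)} for subnets where both hosts were polled.
    """
    by_subnet = defaultdict(list)
    for ip in mean_latency_ms:
        # strict=False lets us pass a host address and get its containing network.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        by_subnet[str(net)].append(ip)

    links = {}
    for net, ips in by_subnet.items():
        if len(ips) == 2:
            near, far = sorted(ips, key=mean_latency_ms.get)
            links[net] = (near, far)
    return links

links = wan_link_ends({"192.0.2.1": 12.0, "192.0.2.2": 31.5, "192.0.2.9": 4.0})
# links == {"192.0.2.0/30": ("192.0.2.1", "192.0.2.2")}
```

If the near and far ends fall into different latency bands, that pair is a candidate "bridge" between those bands.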

On networks that have a larger mask, one can assume that each shared network has one or more routers that provide service to that network.  And they have one or more switches used to interconnect devices at the logical link layer.   It may not be the case that you can imply that a node being up on a LAN also means the switch and router are up.  But, when things are broken, you would want to check the routers first, then the switches, then the nodes.

Use machine learning to capture and hypothesize failures as patterns.  When something changes, what also changed around it?  What broke further downstream?  After seeing a few failures, you should be able to pick up a probability on the causes and side effects in the network. (What a brilliant insight this would be for engineering teams to identify and solve for availability, performance, and redundancy issues based on priors?) 

Let machine learning determine the frequency based on the goals and objectives.

When you overlay other data elements in with this, the picture becomes a much more effective story.  For example, in the heat map identify any node / IP that has experienced configuration changes.  What about has an open ticket on it? What about event counts beyond the ping?

What if I use RUM and a bit of creative data from IPFIX to overlay onto the heat map application performance?  Interesting...

Sunday, August 13, 2017

Machine Learning and Event Classification

When you look at the Netcool probe systems, there are a couple of things you can learn from the rules files as well as the events and raw data. Really, there is a tremendous amount of data at your fingertips that could be leveraged in new ways.

When you look at basic Naive Bayes as a function, it is an effective algorithm to use to classify text elements. While I like the approach, I chose to simplify it a bit here to make it more understandable and usable without undue complication. Let's go on a data exploration journey and see what we can derive from the mental exercise...

Here's a thought

Given that I have a set of rules, an event example, and raw event data, one of the first things I'd like to train on is severity. This is important because it is an organizationally defined tag depicting a user perception.

On a side note, I seem to always revert back to the Golden Rule of correlation:

ENTERPRISE:NODE:SUBOBJECT.INSTANCE:PROPERTY

The Scope applies to how much of the record is applied to delineate the Managed Object. For example a Node Scope means that correlation elements need to match on Enterprise and Node to apply together.

Naive Bayes is about intelligently "guessing" whether the elements that make up a fact are good or bad. For a given small sample of events, I'm going to use 4 events as my reference training to illustrate potential learning.

Summary = Interface GE1 is down , Severity = 4

Summary = Interface GE1 is up , Severity = 0

Summary = Node down ping fail, Severity = 5

Summary = Node up, Severity = 0

So, for the training process, I break down each word in the Summary as follows:

			Word hit count
	Clear	Indeterminate	Warning	Minor	Major	Critical
Interface	1				1
GE1	1				1
is	1				1
down					1	1
up	2
Node	1					1
ping						1
fail						1

It should be noted that some words may be a bit nebulous and have no differentiation in the determination of severity. For example, check out the following table.

	Interface	GE1	is	down	up	fail	ping	Node
Clear	1	1	1	0	2	0	0	1
Indeterminate	0	0	0	0	0	0	0	0
Warning	0	0	0	0	0	0	0	0
Minor	0	0	0	0	0	0	0	0
Major	1	1	1	1	0	0	0	0
Critical	0	0	0	1	0	1	1	1

			Ratio of occurrence by Severity
Clear	0.125	0.125	0.125	0	0.25	0	0	0.125
Indeterminate	0	0	0	0	0	0	0	0
Warning	0	0	0	0	0	0	0	0
Minor	0	0	0	0	0	0	0	0
Major	0.125	0.125	0.125	0.125	0	0	0	0
Critical	0	0	0	0.125	0	0.125	0.125	0.125

			Ratio of Non-occurrence
Clear	0.875	0.875	0.875	1	0.75	1	1	0.875
Indeterminate	1	1	1	1	1	1	1	1
Warning	1	1	1	1	1	1	1	1
Minor	1	1	1	1	1	1	1	1
Major	0.875	0.875	0.875	0.875	1	1	1	1
Critical	1	1	1	0.875	1	0.875	0.875	0.875

Normally, naive Bayes is presented as a simple true or false kind of thing. With 6 different severities, one could classify severities 2-5 as true and severities 0 and 1 as false. In this case, you look for words that “differentiate” true from false. In the example, the differentiating words are down, up, ping, and fail. In this first iteration, I would be tempted to either drop ping or add it to the Up event.

In analyzing the words and distribution of severities, we can discern that 4 words are differentiators. If words like “is” are distributed across multiple severities, they aren't so relevant as a predictor, or a prior in Bayesian terms.
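The true/false split can be sketched in a few lines. The function name and the problem_floor parameter are hypothetical; the events are the four training events from this example, with severities 2-5 counted as true and 0-1 as false.

```python
def differentiators(events, problem_floor=2):
    """Return words seen only on one side of the true/false severity split.

    events: list of (summary_text, severity_number) pairs.
    """
    seen = {}
    for summary, severity in events:
        for word in summary.split():
            # Record which side(s) of the split this word has appeared on.
            seen.setdefault(word, set()).add(severity >= problem_floor)
    return sorted(word for word, sides in seen.items() if len(sides) == 1)

training = [
    ("Interface GE1 is down", 4),
    ("Interface GE1 is up", 0),
    ("Node down ping fail", 5),
    ("Node up", 0),
]
print(differentiators(training))  # ['down', 'fail', 'ping', 'up']
```

Words such as Interface, GE1, is, and Node show up on both sides of the split, so they drop out.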

The ratios are calculated against the number of unique words in the entire sample (8 in this example). For each word, the ratio shows how much that word plays into the determination of severity.

The ratio of non-occurrence is interesting in that it shows you how often a word does not occur, and it clearly illustrates the severities that can be ignored given a specific word's presence.

This is a very basic machine learning event word classification mechanism spawned mainly out of the simplification of Naive Bayes theory. While Naive Bayes is very rudimentary, a lot of folks get hung up in the math that goes with it in its truest sense. Here are the formulas I use:

Ratio of occurrence is the number of times a word is seen via a given severity divided by the total number of unique words.

Ratio of non-occurrence is the total number of unique words minus the occurrences of this word in this severity, divided by the total number of unique words.
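These two formulas can be reproduced against the four training events in a short sketch. The function names are hypothetical, and the severity numbers are mapped to names in the order used in the tables above (0 = Clear through 5 = Critical).

```python
from collections import defaultdict

SEVERITIES = ["Clear", "Indeterminate", "Warning", "Minor", "Major", "Critical"]

def word_hits(events):
    """Count how often each Summary word is seen under each severity."""
    counts = defaultdict(lambda: dict.fromkeys(SEVERITIES, 0))
    for summary, severity in events:
        for word in summary.split():
            counts[word][SEVERITIES[severity]] += 1
    return counts

def ratios(counts):
    """Return (occurrence, non_occurrence) ratios per word and severity."""
    n = len(counts)  # number of unique words in the sample
    occurrence = {w: {s: c / n for s, c in row.items()} for w, row in counts.items()}
    non_occurrence = {w: {s: (n - c) / n for s, c in row.items()} for w, row in counts.items()}
    return occurrence, non_occurrence

training = [
    ("Interface GE1 is down", 4),
    ("Interface GE1 is up", 0),
    ("Node down ping fail", 5),
    ("Node up", 0),
]
occ, non = ratios(word_hits(training))
print(occ["up"]["Clear"], non["up"]["Clear"])  # 0.25 0.75
```

There are 8 unique words in the sample, so a single hit works out to 0.125, matching the tables.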

The benefit is that Severity ratings are a perception of importance. The more accurate and consistent these simple classifications become, the better your more intensive Inferences and deep learning will be going forward.

Interestingly enough, these are just words that are independent of one another. Other than occurrence and perception, "is" is no different than "down".

Now, what if I look for words in the event text that are also in the AlertKey? Normally, AlertKey is used to identify the Subobject and instance of the event. Would this not readily identify the noun/pronoun/object this event relates to? Could it be skipped in the calculations to make things more accurate?

What if you used AlertGroup to do associative vectors to only group together like events? Like a defined cluster of events by type.

What if you did the same thing for Type? Melding in the dimension of whether an event relates to a Problem or Resolution could help drive the accuracy of your event processing system.

Solving a Problem

The questions you may be looking to answer are:

Are my event wordings consistent to be able to guess a severity if we didn't have one already?
Is my severity selection consistent?
Is my event object definition accurate?

Let's say I do this exercise on the top 20 events seen in my environment. Use these events to hone the “dictionary” of initial training. Then, let's apply it to the rest of the events.

There are some good news elements here. These include:

Identify and build in consistency into your event processing.
Use the results to identify the severity of new and existing events.
You can store the results by event (using Identifier as the key) so that you don't have to recalculate each time.
You can use history to build this.

Another very interesting “feature” here is that many of the events processed are SNMP based. Within the rules used to process the event, in many cases, you get the OID of the enterprise, OIDs or varbinds, and enumerations used to translate the numbers into text values. And what about the MIB object descriptions? And variable bindings that are instanced usually point to a given subobject and instance. (ifEntry.4 as an example)

Some may even ask why do this. What if, programmatically, I can determine that performance and error conditions related to a given sub-object are, in fact, supporting evidence for a most probable cause event that resulted in an outage? Now we are getting somewhere. Think beyond the Root Cause and Side Effect paradigm into the realm of recognizing and understanding event clusters.

Summary

I should note that while I use Netcool as an example, one should not constrain these techniques to a single product. I could see these same sorts of techniques used in HPE Node Manager, ScienceLogic EM7, Splunk, OpenNMS, Monolith, and others.

Machine learning is about exploring your data, classifying it, and producing better information.

Tuesday, October 13, 2015

Getting ahead of the Customer Experience Perception Case

I have read several articles on how users are the alarm of prevalence in many environments. How we should be looking at customer experience.

This is so true and apropos. If you are truly looking to provide customer service, the customer experience should be at the forefront of your service philosophy.

Why?

In the beginning, help desk staff listened for a call. They waited on the phone to ring. In fact, in some (MANY!) environments, they still do. They also use the Call routing information to determine if there is a problem in a specific area, neighborhood, or part of the infrastructure. Wild huh? If your NOC is using call statistics to do correlation, you are definitely managing alarms and alerts in the user perception space.

Even in many modern-day Operations centers, Operations operates in a mode of being purely reactive to incoming events. Furthermore, in cases where the inputs overwhelm the staff's ability to discern real problems and prioritize them as time evolves, you see people who will wait for the loudest problem to surface with incoming calls.

If your layer 1 support is doing dispatch only, in almost all cases, you are operating in a purely reactive environment.

If you only allow events to be presented that are predefined and actionable, you are simulating that phone call in software. Same thing, different media.

User Perception Window

It is the time interval from when the customer is affected to the time it gets too painful to go on without reporting the problem.

In some environments, this can be hours. Especially if the end user has not seen results from previous service outages. Or they have had negative experiences in calling in problems. Some will commence to doing their own troubleshooting like rebooting everything. Some will merely wait, go take a break, or go to lunch in hopes that the problem heals or gets fixed.

When folks transition from in house support to an outsourced arrangement, one of the factors that is common is the need for better support. More responsiveness. Better up time. Better awareness.

In some instances, the time has become so critical, end users will introduce problems just to test and see how long it takes for the managed services provider to respond. This results in a very short window and usually fares badly for the Service provider.

Negative perceptions by your customers have a negative effect on your Net Promoter Scores and can be the most prevalent cause of customer churn. They affect the effectiveness of the support organization. And the ability to generate new revenue.

Architecture

Most management architectures are designed wrong for the ability to migrate towards a proactive management stance. If you are waiting on Traps and syslog events, you are also waiting on the phone to ring. While this is cheap and easy, it carries with it the consequences of always being after the fact, always post-cognitive.

And the problem is profoundly exacerbated by the introduction of agility in the enterprise. The migration towards constant updates, infrastructure movement and redefinition, migration of applications across cloud platforms and containers, even off premise.

Consider this - changes in the environment can happen ANYWHERE in the Green, Red, or Yellow zones. In effect, a change can lead up to an event horizon, cause other effects after the event horizon, or change the effects by changing in the middle of a problem.

If your architecture only looks at the red and yellow zones, you can never get AHEAD of the User Perception Window. You can get a better handle on how you handle problems, identification and prioritization of problems, even building better workflow and run book processes.

How Do You Get There?

In many cases, architects and management have chosen the path of least resistance in hopes that enterprise management, as a technology, is a commodity. (Funny - This was a marketing ploy by wares vendors to circumvent having to compete!)

Interesting thing about getting ahead of the customer is that this is the hard part. It is the part where you have to go through the data, the workflow, the results, and come up with solutions to designing and implementing around architectural and product shortcomings, improving the processes and automations, and building and putting in place more effective instrumentation.

I'd like to warn you up front - if you're not willing to commit to the challenge, it's better to admit that you will never get ahead of the customer experience perception. Maybe you can set expectations with customers. Maybe you can put some spin on it.

There are several, very important Continuous Process Improvement sorts of tasks that need to be undertaken. These include:

1. Post Mortems.

What was the root cause of the problem? Was there more than a single cause?
Did the organization mishandle the problem?
Were there things that could have made the problem correction better?
Are the runbooks and processes in order?
Has redundancy, DR, and HA been addressed properly?

A post mortem analysis is imperative to go through and analyze what happened and how the support organization responded. You need the data to be able to benchmark how information was derived and how things were accomplished from start to finish.

2. Failure Analysis

In the course of time, periodically, you need to go through your tickets and look for hardware and software that has failed over the reporting period. Look for patterns and inconsistencies in the products, services, and systems.

An important gauge is to come up with a way of providing a cost of maintenance per device / Device type / Application. Analyze both Scheduled and unscheduled maintenance actions. This gives you an EXCELLENT way of illuminating problem areas in a way that non-technical people understand - dollars and cents. Doesn't have to be real but relative and relevant.
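A relative cost figure like this can be rolled up from ticket data with very little code. The sketch below is only an illustration: the ticket fields (device_type, scheduled, hours) and the flat hourly_rate are assumptions, not the schema of any particular ticketing system, and the rate only needs to be relative, not real.

```python
from collections import defaultdict

def maintenance_cost(tickets, hourly_rate=100.0):
    """Roll up scheduled and unscheduled maintenance cost per device type.

    tickets: iterable of dicts with 'device_type', 'scheduled' (bool), 'hours'.
    """
    totals = defaultdict(lambda: {"scheduled": 0.0, "unscheduled": 0.0})
    for ticket in tickets:
        kind = "scheduled" if ticket["scheduled"] else "unscheduled"
        totals[ticket["device_type"]][kind] += ticket["hours"] * hourly_rate
    return dict(totals)

def worst_offenders(totals):
    """Device types ordered by unscheduled maintenance cost, highest first."""
    return sorted(totals, key=lambda d: totals[d]["unscheduled"], reverse=True)
```

The same roll-up works per device or per application by swapping the key used to group the tickets.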

Many Operations environments actually inflict a lot of pain on themselves by not doing failure analysis. You need to be ahead of the curve on equipment, systems, and software that fail more and more, take more time to maintain, and cause more downtime.

3. Instrumentation

In the course of getting ahead of the customer perception window, you have to advance the instrumentation to seek out and illuminate issues before user perception is realized. If you are not increasing the instrumentation to be more predictive, you will never be able to see anything before the event horizon.

With containerization and microarchitectures, you need to build in advanced monitoring capabilities. In fact, this advanced precognitive monitoring needs to be an integral part of the microarchitecture.

If you are not fielding advanced correlation where you are ACTIVELY looking for pre-cognitive conditions - conditions that lead up to a potential failure, you will NEVER EVER get in front of the customer. If you are still waiting on a trap, a syslog event, or even a timed threshold, you are tragically on the wrong side of the timeline.

You need to look at user transactions from the user perspective. A 3-second deviation, while not discernible to many, could yield huge insight into an oncoming disaster.

What about IPFIX / Netflow data? What can you discern from this data to yield insights? Can you instrument the patterns into software to turn it into something to alert on?

4. Adaptive Analytics

You need to be able to sample through the combinations of configurations and analyze event streams, instrumentation, and workflow data to look for predictive patterns that point to a customer experience potential problem BEFORE the event horizon occurs.

What things happened to illuminate a pending event horizon?

Can you discern loading conditions and thresholds from your analytics?

Linear regression? By time intervals? Related or not related. Causal or not?
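As one possible reading of the linear regression question, here is a minimal least-squares sketch that fits a trend to a metric sampled at regular time intervals and estimates how many intervals remain before it crosses a threshold. The function names, the metric, and the threshold are all hypothetical, and a straight-line fit is only an assumption about how the metric behaves; it says nothing about whether the relationship is causal.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def intervals_until(threshold, xs, ys):
    """Estimate intervals left before the fitted trend reaches threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])

# Response time in seconds, sampled over four consecutive intervals.
remaining = intervals_until(4.0, [0, 1, 2, 3], [1.0, 1.5, 2.0, 2.5])
# remaining is about 3 intervals before the 4-second mark
```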

Summary

While out of the box Enterprise Management applications say they are proactive, take a good look at where they function in the customer perception space. Could be, they are proactive only after the customer has perceived the problem.

Until you instrument and threshold on things that are before the customer perception window,

YOU CANNOT GET AHEAD OF THE CUSTOMER EXPERIENCE.

In the comments, I'd be interested in hearing how your product / service fits in the Customer Perception window. Leave a comment!

Saturday, September 12, 2015

To Build Versus Buy

Over the course of time that Management systems have evolved - at least in my years of exposure - there seems to always be the question - Build vs. Buy.

I have been in 3 different scenarios that show different perspectives as to where these views come from. These perspectives include:

Product Company
Integrator
End User

Product Perspective

From a product perspective, many tend to believe that the product is close enough to the 80% rule that the question of build versus buy should never come up. In fact, some believe strongly enough to count the end user out as eccentric or lacking. Or they are intimidated by the Integrator.

I always liked the possibilities I could bring about with Log Matrix NerveCenter. And the functions I put into the menus of HP OpenView.

Some products painted themselves into a corner. Like Netcool OMNIbus. When they aligned to Telco-oriented standards, they mandated that problems outside of the telco realm had to adapt to the Telco standards. For example, color coding of Severity. But then again, not everything has a real severity or conforms so much to event management.

Some product companies are skeptical and somewhat intimidated by premier Integrators. Instead of listening to the requirements and approaches, then embracing change, they will shun it.

As an Integrator, I was avoided, patronized, and shunned. A few select product companies embraced my approaches.

As an end user, product companies seemed a bit taken aback when I told them their product doesn't scale. Or doesn't fit. Sometimes to the point of having their lawyers call me (Kinda weird). And I've seen Sales folks do all sorts of things in lieu of fixing the technical issues. Like a visit to your VP. Or a talk with your CEO. You know you've been sandbagged when your VP comes in with a bunch of glossies and tells you to evaluate this or that.

Integrator

As an Integrator, products are viewed as something they can use to deliver a value add. While there are a lot of Integrators that prefer to just do installation and setup, the premier Integrators are always looking for products that create a difference. That segregates them from others.

From an Integrator perspective, I always strived to achieve success for both my team and my customer. Sometimes, that takes a bit of work. And some thinking out of the box.

Sometimes, it was with a product and its capabilities. Other times, I did my own code.

In the industry, there are a few folks out there that take products out of the box, put some code around them to integrate with other products or to add capabilities, and sell that and services around the extended capabilities and services. Product companies don't always know how to leverage these folks or even consider them viable.

End User

From an end user perspective, some products just don't scale enough to make it. Or they lack critical capabilities. And yes, price can be a factor.

I've been in places that could not use commercial products without significant work to make it work. For example, I know of a place that had 8 separate and distinct eHealth instances. And they ended up with 8 people supporting the product.

In instances where Build has evolved into a viable option, cost, capabilities, and scalability are the primary reasons.

In the immortal words of Larry Wall -- "There's more than one way to do it!". Product companies don't have a stranglehold on innovation. In many cases, quite the opposite is true. It's hard to do something different if the product does 50% or less of what the requirement is if it means refactoring and redeveloping code. In fact, many developers consider that the product is done once coded. Of the ones that have evolved, you see a lot of refactoring and reworking to achieve more capabilities and scale.

In Summary... EMBRACE the Innovator.

Product Companies - Use these Innovators to expand horizons, empower repeatable integrations, and drive solutions over tools.

Integrators - Want to get to the next level? Get you some Innovators and start productizing your Value Add.

End Users - Use Innovators to work through problems and get to solutions. Tools aren't much if they don't fit your organization's form and function and its workflow. Innovators do that. Embrace innovations that enhance your business in meaningful and distinctive ways. Keep driving efficiency and customer satisfaction up.

Wednesday, August 8, 2012

Enterprise Management - What's Missing?

Enterprise Management has been a somewhat strange market over the years. The problem definitions are really tough but are rewritten to lessen the amount of development needed to achieve something that marketing can spin. Some elements like correlation and root cause analysis have been dumbed down and respun in an effort for many to tout a "me too" to the uninitiated.

Other areas like performance management platforms have become huge report generators. After all, pretty graphs and charts sell. In some systems, it is amazing the amounts of graphs produced, 90% of which may never be looked at.

The other amazing part is that even though a product has evolved and been bought out, there comes a point where design decisions of the past dictate what can and cannot be done going forward. And this holds true for not just the older products but some new ones as well. It is incredible to see a product that, in its relatively young life cycle, has already made it impossible to extend the architecture further.

The emerging technologies that have come to light most in the past 3 years or so are provisioning, configuration management, and workflow automation. This is especially true in the Cloud realm as Cloud needs standard configuration items, a way of automating provisioning, and ways to pre-provision systems overall. After all, virtualization is the technology and Cloud is the process.

OK - What's Missing?

I know we've seen all of the sales pitches on performance management with charts and graphs. And we've been barraged by the constant pitch for business services management with its dashboards and service trees. And we've seen the event and alarm lists, the maps, and even ticketing.

Someone once said that "a fool with a tool is still a fool". Effective Enterprise Management Architectures and implementations have NEVER been about tools. Tools are the technology and Workflow is the process. Tools do nothing for the bottom line of an Enterprise or Service Provider without someone using it. In fact, the tools that are not used usually become shelfware or junkware. In many cases, you can't even sell the stuff to some other fool looking for a tool.

One of the most interesting assessments of emerging technologies is the Gartner Hype Cycle. Here is the Wikipedia article on the Hype Cycle.

http://en.wikipedia.org/wiki/Hype_cycle

While it is not so scientific and not really a cycle per se, it is a way to describe the subjective nature of a technology and the phases it goes through. Interestingly, there are five different stages in the Gartner Hype Cycle. Following is an excerpt from Wikipedia describing the Hype cycles.

Five phases

Hype cycle for emerging technologies as of July, 2009

A hype cycle in Gartner's interpretation comprises five phases:

"Technology Trigger" — The first phase of a hype cycle is the "technology trigger" or breakthrough, product launch or other event that generates significant press and interest.
"Peak of Inflated Expectations" — In the next phase, a frenzy of publicity typically generates over-enthusiasm and unrealistic expectations. There may be some successful applications of a technology, but there are typically more failures.
"Trough of Disillusionment" — Technologies enter the "trough of disillusionment" because they fail to meet expectations and quickly become unfashionable. Consequently, the press usually abandons the topic and the technology.
"Slope of Enlightenment" — Although the press may have stopped covering the technology, some businesses continue through the "slope of enlightenment" and experiment to understand the benefits and practical application of the technology.
"Plateau of Productivity" — A technology reaches the "plateau of productivity" as the benefits of it become widely demonstrated and accepted. The technology becomes increasingly stable and evolves in second and third generations. The final height of the plateau varies according to whether the technology is broadly applicable or benefits only a niche market.

The term is now used more broadly in the marketing of new technologies.

Now keep in mind, the hype cycle is very subjective. Dependent upon perspective, one person could place a given technology in a different category than you would. (This is why you enlist the expert analysis of the Gartner Analysts.)

Workflow Automation

You see a significant number of Workflow automation products that are just now coming into Service Providers and Enterprises. Part of this is being driven by requirements from cloud services. Cloud Services NEED to be automated. The services need to be automated and user driven as much as possible.

Did we ever get to be ITIL compliant? Or is there such a thing? Maybe for ticketing systems. But do your organizations really, truly follow ITIL Incident and problem management to the letter? Can your organization?

ITIL defines an Incident as:

--------------------------------

Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to or a reduction in, the quality of that service. The stated ITIL objective is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.
-------------------------------

Effective workflow automation and process optimization begins with a baseline of the process. If you cannot define the process or measure it, how will you improve it? In the spirit of getting ITIL Incident Management to be optimized and effective, one needs to implement effective business process management techniques.

But there are psychological and philosophical problems with how processes are approached and presented. Some of these "issues" include:

The belief that documenting the process is to train someone else for your job.
The belief that documenting some technical processes is impossible as it takes into consideration judgement and subjective decisions.
The belief that documenting processes gets in the way of actual work.

A significant amount of this is Fear, Uncertainty, and Doubt or FUD. But it can block or kill effective process optimization. You have to manage to and reward process optimization.

Knowledge Management

While you see segmented implementations of Knowledge Management by various products or as modules in ticketing systems, rarely do you see somewhat robust implementations beyond the very large Service Providers.

Knowledge Management takes knowledge, facts, and references, organizes that information, and makes it available in the various processes as a way of putting the knowledge to work. For example, an event is presented to a level 1 support person that denotes an actionable event. In the course of this work, knowledge can be presented in several different ways, including:

Notification instructions to contact the customer
Escalation instructions to the appropriate Triage Team
Recent histories of the elements in question
Scheduled actions and maintenance actions
Recent Configuration changes.
Checks and validations process data.
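To make the idea concrete, here is a minimal sketch of attaching that kind of knowledge to an event before it reaches level 1 support. The class names, field names, and sample entries are hypothetical; a real implementation would pull this from a CMDB, a ticketing system, and a change calendar rather than from in-memory dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    element: str    # the managed element the event is about
    customer: str   # the customer who owns that element
    summary: str

@dataclass
class KnowledgeBase:
    # Each dictionary stands in for a real knowledge source (assumption).
    notification: dict = field(default_factory=dict)  # customer -> contact instructions
    escalation: dict = field(default_factory=dict)    # element  -> triage team
    history: dict = field(default_factory=dict)       # element  -> recent events
    maintenance: dict = field(default_factory=dict)   # element  -> scheduled actions
    changes: dict = field(default_factory=dict)       # element  -> recent config changes

    def enrich(self, event):
        """Return the event plus whatever knowledge is on file for it."""
        return {
            "event": event,
            "notify": self.notification.get(event.customer, "no instructions on file"),
            "escalate_to": self.escalation.get(event.element, "default triage"),
            "recent_history": self.history.get(event.element, []),
            "scheduled_maintenance": self.maintenance.get(event.element, []),
            "recent_changes": self.changes.get(event.element, []),
        }

kb = KnowledgeBase(
    notification={"acme": "Call the NOC lead before emailing"},
    escalation={"core-rtr-1": "Network Triage"},
    changes={"core-rtr-1": ["ACL update pushed yesterday"]},
)
enriched = kb.enrich(Event("core-rtr-1", "acme", "Interface down"))
print(enriched["escalate_to"], enriched["recent_changes"])
```

The point of the sketch is that the support person never has to go hunting: the context arrives with the event.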

Knowledge Management takes information and creates confidence and situation awareness in support organizations. The knowledge enables each and every user or support member to work with common technical terms regardless of level of expertise. Even a simple knowledge management task, such as attaching links to vendor technical documentation to configuration records, enables users to reference and communicate better regarding diagnosis, configuration, or provisioning.

Collaboration

When you take a step back and look at every application on the market today in the Enterprise Management realm, the one thing that sticks out profoundly is the lack of collaboration capabilities and multi-user awareness. Not one single application has user awareness. For example, you open a network map. Who else is in that map? When you open a ticket, who else has that ticket open?
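User awareness does not have to be complicated. Here is a minimal sketch of a presence registry that answers "who else has this open?" for any map or ticket. The class and method names are my own invention for illustration, and a real product would need to share this state across sessions and servers.

```python
from collections import defaultdict

class PresenceRegistry:
    """Tracks which users currently have a given resource open."""

    def __init__(self):
        self._viewers = defaultdict(set)  # resource id -> set of user names

    def open(self, resource, user):
        """Register the user and return who else already has the resource open."""
        others = sorted(self._viewers[resource] - {user})
        self._viewers[resource].add(user)
        return others

    def close(self, resource, user):
        """Remove the user from the resource's viewer list."""
        self._viewers[resource].discard(user)

    def who_has_open(self, resource):
        """List everyone currently viewing the resource."""
        return sorted(self._viewers.get(resource, ()))

registry = PresenceRegistry()
registry.open("ticket-1", "alice")
already_there = registry.open("ticket-1", "bob")  # bob learns alice is in the ticket
print(already_there, registry.who_has_open("ticket-1"))
```

Surfacing that list in the map or ticket view is what turns a single-user tool into something a team can work in together.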

Applications don't really enable folks to work together. They are more geared toward individual actions. Then when that person is done, it is escalated to the next person. And when things move across products, it is even worse.

Consider the customer who calls into a Service Provider with a problem. The first level takes down the ticket information, runs through some run book elements, then escalates the ticket to the next level of support. At this point, the customer is told they will get a call from the next tier. So, the customer hangs up and waits on the next call. In this scenario, who owns the problem? What if Tier 2 doesn't call back for a long period of time?

While this scenario is all too common, who is collaborating? The customer is stuck dealing with a bunch of individual contributors to get their problem resolved. There is no team. The only way the customer gets a team is to initiate an escalation to the Service Representative for the Account and crank up a conference call with all of the Managers.

Effective customer service and customer support is a TEAM effort that requires collaboration and socialization. It is not about fielding tickets. It is about taking care of the customer.

There are products on the horizon that will empower collaboration as a whole but still, a significant amount of work needs to be done on each individual product.

The Hype Cycle

The Gartner folks are much more precise and much better at accurately placing technologies on the Hype Cycle than I am. However, I like the way it depicts a product or technology lifecycle. I am merely overlaying my opinion on the Gartner definitions. While I know it's subjective and only my opinion, here's where I put these technologies in the Gartner Hype Cycle:

Workflow Automation – Between the “Peak of Inflated Expectations” and the “Trough of Disillusionment”.

There are a significant number of product offerings in the industry that do these functions, and more arrive each day. While we do see new product offerings for the same sort of functionality, everyone seems to think they do it better. Yet many fall short, are too complex to maintain, or are too expensive.

The big players have purchased many of the initial workflow automation applications and next generation applications are in process. Additionally, BPM related systems are being enhanced to perform more of the workflow automation associated with Enterprise Management.

Interesting links:

Workflow Management Coalition

Cordys Business Operations Platform

HP OO

BMC Atrium Orchestrator

Knowledge Management – “Technology Trigger”

While there are a couple of KM system offerings, most of the implementations are released as product features or are developed by large service providers for in-house use only. For example, there are KM functions in SCOM/SCCM. Additionally, there are KM modules for ticketing systems like Remedy.

What you do see is that larger Service Providers are investing heavily in KM systems. It is done for many reasons, chief among them driving down the cost of support and leveraging knowledge across multiple customers. This makes sense, as these large outsourcing and service providers are usually a training ground for entry-level personnel. Implementing KM systems is an accelerator for learning in these environments.

Interesting links:

WiiKno - The company helps organizations put in place Knowledge Management systems and integration.
Microsoft Knowledge Management Architecture paper
BMC Knowledge Management

Collaboration – “Technology Trigger”

There are a couple of emerging technologies here that could very well redefine how IT organizations perform operations and support. You may very well see some older technologies (like ticketing) take a back seat to newer collaboration and socialization technologies.

Part of the problem is that ticketing systems rarely capture enough information to do detailed operations analysis and optimization. They typically demand too much of the user to capture notes in real time, so much of that detail is missed or entered after the fact.

When problems are ongoing, do you get the full monty of what's going on? Or do you have to aggregate your own picture? How effective is your post mortem analysis data? Do you see and recognize involvement by different support functions, business units, and the customers?

Interesting links:

MOOGSoft - Check out the guys behind the curtains.