Monday, April 9, 2012

Product Quality Dilemma

All too often, we buy products, put them in production, and attempt to work through the shortcomings and obfuscated abnormalities prevalent in so many of them. (I call these product "ISMs" and use the term to describe specific product behaviors or personalities.) Over that long-lived life cycle, changes, additions, deprecations, and shifts in behavior accumulate. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.

All too often, developers tend to think of their creation in a way that may be significantly different from the deployed environments it goes into. It's easy to get stuck in microcosms and walled-off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.

With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... even Ishikawa diagrams!

All this is a distraction from what the end user really NEEDS. I'll take a shot at outlining some basic requirements pertinent to all performance management products.

1. Don't keep trying to collect on broken access mechanisms.

Many performance applications continue to collect, or attempt to collect, even when they haven't had any valid data in hours or days. It's crazy, as all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?

SNMP authentication failures are a HUGE clue that your app is wasting resources, often on something simple like a bad credential. Listening for ICMP Source Quench messages will tell you if you're hammering end devices.
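
As a minimal sketch (the poller interface and names here are hypothetical, not any product's API), a collector could back off targets whose SNMP credentials keep failing instead of retrying forever:

    from collections import defaultdict

    MAX_CONSECUTIVE_AUTH_FAILURES = 5   # stop wasting cycles after this many misses

    auth_failures = defaultdict(int)    # device -> consecutive SNMP auth failures
    suspended = set()                   # devices we have stopped polling

    def record_poll_result(device, auth_ok):
        """Track auth failures and suspend collection on chronically broken access."""
        if auth_ok:
            auth_failures[device] = 0
            suspended.discard(device)
        else:
            auth_failures[device] += 1
            if auth_failures[device] >= MAX_CONSECUTIVE_AUTH_FAILURES:
                suspended.add(device)   # flag for credential review instead of re-polling

    def should_poll(device):
        return device not in suspended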

2. Migrate away from mass-produced reports in favor of providing information.

If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.

3. If you can't create new reports without code, it's too complicated.

All too often, products want to put glue code or even programming environments / IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business-process oriented.

4. Data and indexes should be documented and manageable. If you have to BYODBA (Bring Your Own DBA), the software vendor hasn't done their homework.

How many times have we loaded up a big performance management application only to find out we have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?

And you end up having to dig through the logs to figure out what works and what doesn't.

If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?

In some instances, databases used by performance management applications are geared toward polling and collection rather than the reporting of information. In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation. For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and derive an average, root mean square, and standard deviation.
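
As a hedged illustration of that kind of derivative (a sketch only, not any particular product's method), a dynamic threshold can be set from the sample's mean and standard deviation:

    from statistics import mean, stdev
    from math import sqrt

    def dynamic_threshold(samples, sigmas=3.0):
        """Derive a simple dynamic threshold from a series of recent values."""
        avg = mean(samples)
        rms = sqrt(sum(v * v for v in samples) / len(samples))  # root mean square
        sd = stdev(samples)
        return {"mean": avg, "rms": rms, "stdev": sd,
                "upper": avg + sigmas * sd,            # flag values beyond mean + N sigma
                "lower": max(avg - sigmas * sd, 0.0)}

    # Example: utilization samples from the last polling window
    print(dynamic_threshold([42.0, 45.5, 39.8, 47.2, 44.1, 41.3]))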

If a reporting person has to do more than one join to get to their data elements, your data needs to be better organized, normalized, and accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance / capacity engineers. They would rather work the data than work your system.

5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustworthy.

There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how they work, you're going to alienate a large portion of your operations base.

6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!

Secrets are BAD. Cisco publishes all of its bugs online specifically because doing so eliminates the perception that it is keeping something from the customer.

Remember Concord's eHealth Health Index? In the early days, it was SECRET SQUIRREL SAUCE. Many an engineer got a bad review or lost their job because of the arrogance of not publishing the elements that made up the Health Index.

7. Be prepared to handle BI types of access: bulk transfers, ODBC and Excel/Access replication, ETL tool access, etc.

If engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Provide shared tables, timed transfers, transformations, and data dumps.
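
As a minimal sketch of that kind of access (the table and column names here are hypothetical, and sqlite3 stands in for whatever ODBC source a product exposes), a timed dump to CSV is often enough for a BI team to get started:

    import csv
    import sqlite3  # stand-in for the product's actual ODBC/JDBC-accessible database

    def dump_interface_stats(db_path, out_path):
        """Bulk-export a reporting table so BI tools can consume it on their own terms."""
        conn = sqlite3.connect(db_path)
        try:
            cursor = conn.execute(
                "SELECT device, interface, poll_time, in_octets, out_octets "
                "FROM interface_stats WHERE poll_time >= datetime('now', '-1 day')")
            with open(out_path, "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
        finally:
            conn.close()

    # Run from cron or a scheduler for a timed transfer, e.g.:
    # dump_interface_stats("perf.db", "interface_stats_daily.csv")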

8. Reports are not just a graph on a splash page or a table of data. To operations personnel, reports mean putting text and formatting around the graphs, charts, tables, and data to relate the operational aspects of the environment to the illustrations.

9. In many cases, you need to transfer data, in a transformed state, from one reporting system to another. Without ETL tools, your reporting solution kind of misses the mark.

Think about this... You have configuration data and you need this data in a multitude of applications: Netcool. Your CMDB. Your Operational Data Store. Your discovery tools. Your ticketing system. Your performance management system. And it seems that every one of these data sources may be text, XML, databases of various forms and flavors, even HTML. How does the data get transformed from one place to another?
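
A rough sketch of the kind of glue involved (the XML layout and field names here are hypothetical, not any particular product's export format); even a small transform script like this counts as ETL:

    import csv
    import json
    import xml.etree.ElementTree as ET

    def devices_from_xml(xml_path):
        """Extract: pull device records out of a discovery tool's XML export."""
        root = ET.parse(xml_path).getroot()
        for node in root.iter("device"):
            yield {"name": node.get("name"),
                   "ip": node.get("ip"),
                   "site": node.findtext("site", default="unknown")}

    def write_for_cmdb(devices, csv_path):
        """Load: emit the transformed records as CSV for a CMDB import job."""
        with open(csv_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["name", "ip", "site"])
            writer.writeheader()
            writer.writerows(devices)

    def write_for_ticketing(devices, json_path):
        """Load: the same records as JSON for a ticketing-system integration."""
        with open(json_path, "w") as fh:
            json.dump(list(devices), fh, indent=2)

    # devices = list(devices_from_xml("discovery_export.xml"))
    # write_for_cmdb(devices, "cmdb_import.csv")
    # write_for_ticketing(devices, "ticketing_devices.json")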

10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.

As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, what happens is that when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.

The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, many times the problem just continues until it's a major ordeal to correct. Reconciliation can be rough when there are a significant number of discrepancies.

Last but not least: if you develop an application and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION is JUNK. It's dangerous, haphazard, incredibly short-sighted, and should be avoided at all costs. Recently, a dear friend of mine developed, tested, and validated a performance management application upgrade. After a month in test and QA running many different use case validations, it was put into production. Upon placement into production, the application overwrote the pared-down configurations with defaults, polled EVERYTHING, and caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. People were terminated. And a lot of manpower was expended trying to fix the issues.
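
A hedged sketch of a guard against exactly that failure (assuming configurations can be exported to files; the file names are hypothetical and nothing here is product-specific): fingerprint the exported configuration in each environment and refuse to go live if they differ:

    import hashlib

    def config_fingerprint(path):
        """Return a stable hash of an exported configuration file."""
        with open(path, "rb") as fh:
            return hashlib.sha256(fh.read()).hexdigest()

    def verify_promotion(qa_export, prod_export):
        """Fail loudly if the configuration changed on its way to production."""
        qa_hash = config_fingerprint(qa_export)
        prod_hash = config_fingerprint(prod_export)
        if qa_hash != prod_hash:
            raise RuntimeError("Production config does not match QA; halt rollout "
                               f"(QA {qa_hash[:12]}... vs prod {prod_hash[:12]}...)")
        return True

    # verify_promotion("polling_config_qa.xml", "polling_config_prod.xml")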

In no uncertain terms, I will never let my friends and customers be caught by this product.

Saturday, May 22, 2010

Support Model Woes

I find it ironic that folks claim to understand ITIL management processes yet do not understand the levels-of-support model.

Most support organizations have multiple tiers of support. For example, there is usually a Level 1, which is the initial interface toward the customer. Level 2 is usually triage or technical feet on the street. Level 3 is usually Engineering or Development. In some cases, Level 4 is used to denote on-site vendor support or third-party support.

In organizations where Level 1 does dispatch only, or doesn't follow problems through with the customer, the customer ends up owning the problem and following it through to solution. What does this do to customers?

- They become technically equal to or better than Level 1.
- They become accustomed to automatic escalation.
- They lack confidence in the service provider.
- They look for specific engineers versus following process.
- They build organizations specifically to follow and track problems through to resolution.

If your desks do dispatch only, event management systems are only used to present the initial event leading up to a problem. What a way to render Netcool useless! Netcool is designed to display the active things that are happening in your environment. If all you ever do is dispatch, why do you need Netcool? Just generate a ticket straight away. No need to display it.

What? Afraid of rendering your multi-million-dollar investment useless? Why leave it in a dysfunctional state of semi-uselessness when you could save a significant amount of money by getting rid of it? Just think, every trap can become an object that gets processed like a "traplet" of sorts.

One of the first things that crops up is that events have a tendency to be dumb. Somewhere along the way, somebody had the bright idea to put an agent into production that is "lightweight" - meaning no intelligence or correlation built in. It's so easy to do little or nothing, isn't it? Besides, it's someone else's problem to deal with the intelligence. In the vendor's case, agent functionality is often an afterthought.

Your model is all wrong. And until the model is corrected, you will never realize the potential ROI of what the systems can provide. You cannot evolve because you have to attempt to retrofit everything back to the same broken model. And when you automate, you automate the broken as well.

Here's the way it works normally. You have 3 or 4 levels of support.

Level 1 is considered first line; they perform the initial customer engagement, diagnostics and triage, and initiate workflow. Level 1 initiates and engages additional levels of support, tracking things through to completion. In effect, Level 1 owns the incident / problem management process but also provides customer engagement and fulfillment.

Level 2 is specialized support for various areas like network, hosting, or application support. They are engaged through Level 1 personnel and are matrixed to report to Level 1, problem by problem, such that they empower Level 1 to keep the customer informed of status and timelines, set expectations, and answer questions.

Level 3 is engaged when the problem goes beyond the technical capabilities of Levels 1 and 2, or requires project, capital expenditure, architecture, or planning support.

Level 4 is reserved for Vendor support or consulting support and engagement.

A Level 0 is used to describe automation and correlation performed before workflow is enacted.

When you break down your workflow into these levels, you can start to optimize and realize ROI by reducing the cost of maintenance actions across the board. By establishing goals to solve 60-70% of all incidents at Level 1, Level 2-4 involvement helps drive knowledge and understanding down to the Level 1 folks while better utilizing the Level 2-4 folks.

In order to implement these levels of support, you have to organize and define your support organization accordingly. Define its roles and responsibilities, set expectations, and work towards success. Netcool, as an Event Management platform, needs to be aligned to the support model. Systems that ingress and egress tickets need to update Netcool. Workflow that occurs needs to update Netcool so that personnel have awareness of what is going on.

Sunday, April 18, 2010

Netcool and Evolution toward Situation Management

Virtually no new evolution in Fault Management and correlation has been done in the last ten years. It seems we presume that what we have today is as far as we can go. Truly sad.

In recent discussions on the INUG Netcool Users Forum, we discussed shortfalls in the products in hopes that Big Blue may see its way clear of the technical obstacles. I don't think they are accepting or open to my suggestions and others', but that's OK. You plant a seed, water it, feed it. And hopefully, one day, it comes to life!

Most of Netcool's design is based somewhat loosely on TMF standards. They left out the hard stuff like object modelling, but I understand why. The problem is that most Enterprises and MSPs don't fit the TMF design pattern. Nor do they fit eTOM. This plays specifically to my suggestion that "There's more than one way to do it!" - the slogan behind Perl.

The underlying premise behind Netcool is that it is a single pane of glass for viewing and recognizing what is going on in your environment. It provides a way to achieve situation awareness and a platform from which to drive interactive work. So what about ITIL and Netcool?

From the aspect of product positioning, most ITIL-based platforms have turned out to be rehashes of Trouble Ticketing systems. When you talk to someone about ITIL, they immediately think of HP ITSM or BMC Remedy. Because of the complexity, these systems sometimes take several months to implement. And nothing is cheap. Some folks resort to open source like RT or OTRS. Others want to migrate towards a different, appliance-based model like ServiceNow and ScienceLogic EM7.

The problem is that once you transition out of Netcool, you lose your situation awareness. It's like having a notebook full of pages. Once you flip to page 50, pages 1-49 are out of sight and therefore gone. All hell could break loose and you'd never know.

So, why not implement ITIL in Netcool? It may be a bit difficult. Here are a few things to consider (a rough data-model sketch follows the list):

1. The paradigm that an event has only two states is bogus.
2. Events lead to incidents, problems, and changes, and that chain has to be represented.
3. It introduces workflow to Netcool.
4. It needs to be aware of CI references and relationships.
5. It introduces the concept that the user is part of the system instead of being an external entity.
6. It may change the exclusion approach toward event processing.
7. It requires data storage and retrieval capabilities.
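
Here is a minimal sketch of the kind of data model those points imply (my own illustration, not anything IBM ships): a situation carries workflow state, linked events, CI references, and the users working it:

    from dataclasses import dataclass, field
    from enum import Enum

    class SituationState(Enum):
        # More states than the classic open/clear pair an event row usually gets
        NEW = "new"
        ACKNOWLEDGED = "acknowledged"
        IN_PROGRESS = "in progress"
        ESCALATED = "escalated"
        RESOLVED = "resolved"
        CLOSED = "closed"

    @dataclass
    class Event:
        identifier: str
        node: str
        summary: str
        severity: int

    @dataclass
    class Situation:
        identifier: str
        state: SituationState = SituationState.NEW
        events: list = field(default_factory=list)        # contributing events
        config_items: set = field(default_factory=set)     # CI references (CMDB keys)
        participants: set = field(default_factory=set)     # users are part of the system
        worklog: list = field(default_factory=list)        # workflow steps taken

        def transition(self, new_state, user, note=""):
            """Record a workflow step instead of just flipping a severity column."""
            self.worklog.append((self.state, new_state, user, note))
            self.participants.add(user)
            self.state = new_state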

End Game

From the point of view of where you'd like to end up, there are several use cases one could apply. For example:

One could see a situation develop and get solved in the Netcool display over time. As it is escalated and transitioned, you are able to see what has occurred, the workflow steps taken to solve it, and the people involved.

One could take a given situation and search through all of the events to see which ones may be applicable to the situation. Applying a ranking mechanism like a Google search would help to position somewhat fuzzy information in the proper context for the users.
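
A minimal sketch of such a ranking (a naive keyword-overlap score, purely illustrative and nothing like what a real search engine does under the hood):

    def relevance(situation_text, event_summary):
        """Score an event against a situation by shared keywords (0.0 - 1.0)."""
        situation_terms = set(situation_text.lower().split())
        event_terms = set(event_summary.lower().split())
        if not event_terms:
            return 0.0
        return len(situation_terms & event_terms) / len(situation_terms | event_terms)

    def rank_events(situation_text, events):
        """Return events ordered by how applicable they look to the situation."""
        return sorted(events, key=lambda e: relevance(situation_text, e), reverse=True)

    candidates = ["BGP peer down on edge-rtr-01",
                  "Disk space low on reporting server",
                  "Interface flapping edge-rtr-01 Gi0/1"]
    print(rank_events("edge-rtr-01 customer circuits down", candidates))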

Be able to take the process as it occurred and review the steps and elements of information in order to optimize the process in future encounters.

Be able to automate, via the system, steps in the incident / problem process. Like escalations or notifications. Or executing some action externally.

Once you introduce workflow to Netcool, you need to introduce the concept of user awareness and collaboration. Who is online? What situations are they actively working versus observing? How do you handle Management escalations?

In ITIL definitions, an Incident has a defined workflow process from start to finish. Netcool could help to make the users aware of the process along with its effectiveness. Even in a simple event display you can show last, current and next steps in fields.

Value Proposition

From the aspect of implementation, ITIL-based systems have been focused solely around trouble ticketing. These systems have become huge behemoths of applications, and with this come two significant factors that hinder success: the loss of Situation Awareness and the inability to realize and optimize processes in the near term.

These behemoth systems become difficult to adapt and difficult to keep up with optimizations. As such, they slow down the optimization process, making it painful to move forward. If it's hard to optimize, it will be hard to differentiate service, because you cannot adapt to changes and measure the effectiveness fast enough to do any good.

A support organization that is aware of what's going on subliminally portrays confidence. This confidence carries a huge weight in interactions with customers and staff alike. It is a different world on a desk when you're empowered to do good work for your customer.

More to come!

Hopefully, this will provide some food for thought on the evolution of event management into Situation Management. In the coming days I plan on adding to this thread several concepts like evolution toward complex event processing, Situation Awareness and Knowledge, data warehousing, and visualization.

Saturday, March 27, 2010

NPS, Enterprise Management, and Situation Awareness

In the course of what I do, I sometimes have to take non-technical metrics and understand where implementation of technology - especially in the ENMS realm - applies toward achieving real business goals.

Recently, a lot of services-based companies have been working toward understanding and improving their Net Promoter Score, or NPS. As part of this initiative, what can I do to realize the overall goal?

First, I went looking for a definition of NPS to understand the terms, conditions, and metrics related to this KPI. I found this definition:

"

What is Net Promoter?

Net Promoter® is both a loyalty metric and a discipline for using customer feedback to fuel profitable growth in your business. Developed by Satmetrix, Bain & Company, and Fred Reichheld, the concept was first popularized through Reichheld's book The Ultimate Question, and has since been embraced by leading companies worldwide as the standard for measuring and improving customer loyalty.

The Net Promoter Score, or NPS®, is a straightforward metric that holds companies and employees accountable for how they treat customers. It has gained popularity thanks to its simplicity and its linkage to profitable growth. Employees at all levels of the organization understand it, opening the door to customer-centric change and improved performance.

Net Promoter programs are not traditional customer satisfaction programs, and simply measuring your NPS does not lead to success. Companies need to follow an associated discipline to actually drive improvements in customer loyalty and enable profitable growth. They must have leadership commitment, and the right business processes and systems in place to deliver real-time information to employees, so they can act on customer feedback and achieve results.

"

I found this at :

http://www.netpromoter.com/np/index.jsp


In essence, the NPS KPI is a metric by which to measure customer loyalty. In its simplicity comes the subjectivity of how you treat your customers. So this raises the question: what can I do from an Enterprise Management perspective to affect this?
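
For reference, the arithmetic behind the score itself is simple. A quick sketch, assuming the standard 0-10 survey scale (9-10 are promoters, 0-6 are detractors, the rest passives):

    def net_promoter_score(ratings):
        """Compute NPS from 0-10 survey responses: % promoters minus % detractors."""
        promoters = sum(1 for r in ratings if r >= 9)
        detractors = sum(1 for r in ratings if r <= 6)
        return 100.0 * (promoters - detractors) / len(ratings)

    print(net_promoter_score([10, 9, 8, 7, 6, 10, 3, 9]))  # -> 25.0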


From my perspective, the NPS is a measure of the effectiveness of your support CULTURE first and foremost. This is a personal, core-belief sort of thing at its foundation. Customer-facing people in your support organization must project several key personality traits and behaviors. Some of these I envision to be:


Dedication. The customer is the only person in the room sort of thing.

Urgency of need. The support person must understand the importance of the situation.

Empathy. A willingness and understanding of the customer's pain.

Confidence. In the face of unknown issues and varying conditions, the customer facing person must exhibit technical strength.

Follow Through. If the customer trusts you enough to let you off the phone to handle things, you MUST FOLLOW THROUGH.

There is also the notion that in a Service oriented company, EVERYONE is a sales person in one way or another. Every interaction means an opportunity to understand the customer and help them be successful.

When you go to McDonald's and you're trying to figure out what you'd like, or what level of polyunsaturated fat and cholesterol you want to propagate to your family... Ever gotten the person that asks you what you want, and when you don't know, they just stand there looking at you? NPS score --.

Now, if they engage you and suggest items like a 12-pack of Big Macs, they are DEDICATED, empathetic to your hunger pangs, understand your urgency of need, and have confidence your order is going to be up in a minute or so after inputting it in the computer. And in the end they ask about dessert - a great salesperson and a GREAT customer service person. NPS score ++.

From a personal work habits perspective, one of the key behaviors to be considered is creating and maintaining Situation Awareness. I ran across the term SA while working on an Air Force project and found it profoundly apropos for operations organizations doing customer service. Check this out on Wikipedia:

http://en.wikipedia.org/wiki/Situation_awareness

I also read through several sections of the Google Books preview of the book on Situation Awareness by Dr. Mica Endsley and Daniel Garland. It is at:

http://books.google.com/books?id=tUwqcqa_QaMC&printsec=frontcover&dq=Situation+Awareness&source=bl&ots=NccDiPzgMI&sig=NW0LAHrBsOTFXVmSyCSKz6154IU&hl=en&ei=pUWuS_HdE4X7lwf9tdCRAQ&sa=X&oi=book_result&ct=result&resnum=2&ved=0CBQQ6AEwAQ#v=onepage&q=&f=false

(I'm ordering the book!)

The model graphic they provide is useful as well:

http://en.wikipedia.org/wiki/File:SA_Wikipedia_Figure_1_Shared_SA_(20Nov2007).jpg

In effect, what enterprise management applications and technology MUST do to effectively achieve a higher NPS is to empower SA at all levels. In doing so, you create a culture where information is meant to be shared and used to make predictions and elicit responses and decisions based upon the information being presented.

Now this is a bit taller of an order than once thought. For example, on the Event / Fault Management side of things, information is presented as events. People respond to events. They test or open tickets or whatever workflow they do when an event is received.

But an event is NOT a situation! A Situation is something a bit different and more abstract than a simple event. So, you have to transition your events to be situation-focused! Interesting thought... especially since event presentation is the prevalent method today! Maybe the Netcool approach needs to evolve a bit!

Interestingly, OpenNMS introduces the concept that events are different from Alarms in its own GUI. Check it out at:

http://www.opennms.org/wiki/Alarms

A brilliant piece of work (and a notion that simple is Good!) in that EVENTS != ALARMS! My hat's off to the OpenNMS guys and the OGP for GETTING IT! In fact, it's a start down the road of understanding the concept of Situations in SA.

Trouble Ticketing systems attempt to do this situation grouping via tickets, but it's almost too late once it leaves your near-real-time pane of glass. Once you transition away from a single pane of glass, you effectively lose your real-time SA. And if you attempt to work out of tickets, you miss all of the elemental sorts of things that happen underneath. Even elements of information like event activity, performance thresholds, support activity, and the like have to be discerned and recognized in near real time to be effective information. If you miss it, you don't know. But your customer may not miss it!

If you cut tickets straight from events, you're asking for problems. Problems like tickets that are not problems but side effects. Or side effects that are problems, just rolled up under a ticket. Or no awareness that conditions have cleared while the ticket is still being escalated and worked. Or missing all of the adjacent issues, like a router taking out a subnet, taking out an application and its three different desks.

The interesting part here is that two given situations may have events that affect each of them. This may throw a kink into normal, database-table-based event management systems. It may be a bit difficult to implement and support.

I am beginning to think a bit differently about event processing, especially with regard to SA and understanding, recognizing, and responding to SITUATIONS. For example, check out this presentation by Tim Bass of Cyberstrategics. He has a long history of thought leadership in Situational Awareness in Cyberspace.

http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

CEP techniques would enable an event to be consumed by multiple situations as situations develop and dissipate. Think about the weighting of events and conditions within a given situation. Some elements may be much more pertinent than others.
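
A toy sketch of that idea (my own illustration of the concept, not a CEP engine): one event can contribute evidence to several open situations, each with its own weight:

    class WeightedSituation:
        """A developing situation that accumulates weighted evidence from events."""
        def __init__(self, name, keywords):
            self.name = name
            self.keywords = {k.lower() for k in keywords}
            self.score = 0.0

        def consume(self, event_summary, weight=1.0):
            # An event only adds evidence if it looks relevant to this situation
            terms = set(event_summary.lower().split())
            if terms & self.keywords:
                self.score += weight

    situations = [WeightedSituation("core router outage", ["edge-rtr-01", "bgp"]),
                  WeightedSituation("reporting backlog", ["report", "disk"])]

    # The same event stream is fanned out to every open situation
    for summary, weight in [("BGP peer down on edge-rtr-01", 3.0),
                            ("Disk space low on report server", 1.0),
                            ("Interface flap on edge-rtr-01", 1.5)]:
        for situation in situations:
            situation.consume(summary, weight)

    for s in situations:
        print(s.name, s.score)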

A significant part of Situation Awareness is the visualization and presentation of data regarding the ongoing situation. For example, check out this video: http://www.youtube.com/watch?v=FdKOxZIIKmQ

From the aspect of true, situational awareness, shouldn't we be looking at evolving Enterprise Management toward being able to deal with situations?

Another thought here. If I'm worried about an NPS, could I MANAGE to it live? Or at least closer to real time? What if I could meld in the capabilities of Evolve24's The Mirror product as a look at the REPUTATION SITUATION as it evolves? Check it out at: http://www.evolve24.com/mirror_for_social_media.php

This kind of changes the face of what we have been considering as BSM, doesn't it?

The common denominator in all this process and technology is Knowledge Management. How are you developing knowledge? How are you integrating it with EVERY person? How are you using it to create SA and HUGE business discriminators? How are you using KM to empower your customers?