All too often, we buy products, put them in production, and then attempt to work through the shortcomings and obfuscated abnormalities so prevalent in this space. (I call these product "ISMs," and I use the term to describe specific product behaviors or personalities.) Over a product's long life cycle, changes, additions, deprecations, and behavioral shifts accumulate. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.
All too often, developers tend to think of their creation in ways that may be significantly different from the deployed environments it goes into. It's easy to get stuck in microcosms and walled-off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.
With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... even Ishikawa diagrams!
All of this is a distraction from what the end user really, really NEEDS. I'll take a shot at outlining some basic requirements pertinent to all performance management products.
1. Don't keep trying to collect on broken access mechanisms.
Many performance applications continue to collect, or attempt to collect, even when they haven't had any valid data in hours or days. It's crazy, as all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?
SNMP authentication failures are a HUGE clue that your app is wasting resources on something simple. Listening for ICMP Source Quench messages will tell you if you're hammering end devices.
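A minimal sketch of what "don't keep trying to collect on broken access mechanisms" could look like in a poller: after a few consecutive failures, suspend the target for a while instead of hammering it. The names, limits, and the `fetch` callback here are all invented for illustration, not any particular product's API.

```python
import time

# Hypothetical sketch: suspend polling for targets that keep failing,
# instead of burying valid data under a flood of access errors.
FAILURE_LIMIT = 5          # consecutive failures before we suspend
SUSPEND_SECONDS = 3600     # re-check a suspended target hourly

class Target:
    def __init__(self, host):
        self.host = host
        self.failures = 0
        self.suspended_until = 0.0

def poll(target, fetch):
    """fetch(host) returns a sample, or raises on SNMP/access errors."""
    now = time.time()
    if now < target.suspended_until:
        return None                       # skip: access is known-broken
    try:
        sample = fetch(target.host)
        target.failures = 0               # success resets the counter
        return sample
    except Exception:
        target.failures += 1
        if target.failures >= FAILURE_LIMIT:
            target.suspended_until = now + SUSPEND_SECONDS
        return None
```

The point isn't the exact backoff policy; it's that the collector notices its own failures and stops generating noise.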
2. Migrate away from mass produced reports in favor of providing information.
If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.
3. If you can't create new reports without code, it's too complicated.
All too often, products want to put glue code or even programming environments / IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business-process oriented.
4. Data and indexes should be documented and manageable. If you have to BYODBA (Bring Your Own DBA), the wares vendor hasn't done their homework.
How many times have we loaded up a big performance management application only to find out we have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?
And you end up having to dig through the logs to figure out what works and what doesn't.
If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?
In some instances, databases used by performance management applications are geared toward polling and collection rather than the reporting of information. In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation. For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and derive the mean, root mean square, and standard deviation.
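That dynamic thresholding derivative can be sketched in a few lines. This is a minimal illustration of the statistics described above, not any vendor's actual algorithm; the window size and the k-sigma cutoff are assumptions.

```python
import math
import statistics

# Derive mean, root mean square, and standard deviation from a sample
# window, then flag values that fall outside mean +/- k * stddev.
def derive(window):
    mean = statistics.fmean(window)
    rms = math.sqrt(statistics.fmean([v * v for v in window]))
    stdev = statistics.pstdev(window)
    return mean, rms, stdev

def breaches_threshold(value, window, k=3.0):
    """True when value is a dynamic-threshold outlier for this window."""
    mean, _rms, stdev = derive(window)
    return abs(value - mean) > k * stdev
```

Precomputing these derivatives into their own tables is exactly the kind of thing that keeps reporting fast without retuning the collection schema.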
If a reporting person has to do more than one join to get to their data elements, your data needs to be better organized, normalized, and made accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance / capacity engineers. They would rather work the data than work your system.
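The "no more than one join" idea can be delivered with a simple database view: do the join once, on the vendor side, so reporters run plain SELECTs. The schema and names below are invented for illustration, shown here against an in-memory SQLite database.

```python
import sqlite3

# Hypothetical sketch: a pre-joined reporting view so BI users never
# need to understand the collection schema. Table names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sample (device_id INTEGER, ts TEXT, util REAL);
    -- The view hides the join; reporters query v_utilization directly.
    CREATE VIEW v_utilization AS
        SELECT d.name AS device, s.ts, s.util
        FROM sample s JOIN device d ON d.id = s.device_id;
""")
conn.execute("INSERT INTO device VALUES (1, 'core-rtr-1')")
conn.execute("INSERT INTO sample VALUES (1, '2010-07-11T00:00', 42.5)")
rows = conn.execute("SELECT device, util FROM v_utilization").fetchall()
```

A reporting person now sees one flat, documented view per question, and never has to reverse-engineer the polling tables.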
5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustable.
There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how they work, you're going to alienate a large portion of your operations base.
6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!
Secrets are BAD. Cisco publishes all of its bugs online specifically because doing so eliminates the perception that it is keeping something from the customer.
If one remembers Concord's eHealth Health Index... in the earlier days, it was SECRET SQUIRREL SAUCE. Many an engineer got a bad review or lost a job because of the arrogance of not publishing the elements that made up the Health Index.
7. Be prepared to handle BI types of access. Bulk transfers, ODBC and Excel/Access replication, ETL tools access, etc.
If engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Provide shared tables, timed transfers, transformations, and data dumps.
8. Reports are not just a graph on a splash page or a table of data. To operations personnel, a report means putting text and formatting around the graphs, charts, tables, and data to relate the operational aspects of the environment to the illustrations.
9. In many cases, you need to transfer data in a transformed state from one reporting system to another. Without ETL tools, your reporting solution misses the mark.
Think about this... You have configuration data and you need this data in a multitude of applications. Netcool. Your CMDB. Your Operational Data Store. Your discovery tools. Your ticketing system. Your performance management system. And it seems that every one of these data sources may be text, XML, databases of various forms and flavors, even HTML. How do you get data transformed from one place to another?
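One tiny ETL hop, as a sketch: extract device records from an XML export, rename the fields, and load them out as CSV for the next tool in the chain. The XML layout and field names here are invented for illustration; real CMDB and Netcool integrations will obviously differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML inventory export; element names are made up.
SOURCE_XML = """
<inventory>
  <device><hostname>core-rtr-1</hostname><site>DAL</site></device>
  <device><hostname>edge-sw-9</hostname><site>HOU</site></device>
</inventory>
"""

def xml_to_csv(xml_text):
    """Extract device records, rename fields, emit CSV for the next tool."""
    rows = [
        {"name": d.findtext("hostname"), "location": d.findtext("site")}
        for d in ET.fromstring(xml_text).iter("device")
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "location"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Every one of those extract-rename-load hops is trivial on its own; the point is that the product should provide them, rather than making every customer script them by hand.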
10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.
As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, what happens is that when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.
The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, many times the problem just continues until it's a major ordeal to correct. Reconciliation when there are a significant number of discrepancies can be rough.
Last but not least: if you develop an application, and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION is JUNK. It's dangerous, haphazard, incredibly short-sighted, and should be avoided at all costs. Recently, I had a dear friend develop, test, and validate a performance management application upgrade. After a month in test and QA running many different use-case validations, it was put into production. Upon placement into production, the application overrode the pared-down configurations with defaults, polled EVERYTHING, and caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. People were terminated. And a lot of manpower was expended trying to fix the issues.
In no uncertain terms, I will never let my friends and customers be caught by this product.
Dougie Stevenson's thoughts, ramblings, and ideas concerning Enterprise Management technology and application.
Monday, April 9, 2012
Sunday, July 11, 2010
ENMS User Interfaces...
Ever watch folks and how they use various applications? When you do some research around the science of Situation Awareness, you realize that human behavior in user interfaces is vital to understanding how to put information in front of users in ways that empower them, in line with what they need.
In ENMS related systems, it is imperative that you present information in ways that empower users to understand situations and conditions beyond just a single node. While all of the wares vendors have been focused on delivering some sort of Root Cause Analysis, this may not be what is REALLY needed by the users. And dependent upon whether you are a Service Provider or an Enterprise, the rules may be different.
What I look for in applications and user interfaces are ways to streamline the interaction versus being disruptive. If your application swaps a lot of screens, watch your user. If they have to readjust their vision or posture, the UI is disrupting their flow.
For example, suppose the user is looking at an events display and executes a function from the menu, and that function produces a screen that covers the existing events display. If you watch the user, you will see them have to readjust to the screen change.
I feel like this is one of the primary reasons ticketing systems do not capture more real-time data. It becomes too disruptive to keep changing screens, so the user waits until later to update the ticket. Inherently, data is filtered and lost.
This has an effect on other processes. One is that if you are attempting to do BSM scorecards, ticket loading and resource management in near real time, you don’t have all of the data to complete your picture. In effect, situation awareness for management levels is skewed until the data is input.
The second effect to this is that if you’re doing continuous process improvement, especially with the incident and problem management aspects of ITIL, you miss critical data and time elements necessary to measure and improve upon.
Some folks have attempted to work around this by managing from ticket queues. So, you end up with one display of events and incoming situation elements, and a second interface as the ticket interface. In order to make this even close to effective, the tendency is to automatically generate tickets for every incoming event. Without doing a lot of intelligent correlation up front, automatic ticket generation can be very dangerous. Due diligence must be applied to each and every event that gets propagated, or you may end up with false ticket generation or missed ticket opportunities.
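The "intelligent correlation up front" step can be as simple as deduplicating repeat events against open situations before cutting a ticket. This is a toy sketch of that idea, with an invented event tuple shape and time window, not any event manager's real correlation engine.

```python
# Hypothetical sketch: collapse repeat events into existing tickets so a
# flood of related events yields one ticket, not hundreds.
def correlate_to_tickets(events, window=300):
    """events: iterable of (timestamp, node, symptom).
    Returns the (node, symptom) keys that warrant a NEW ticket."""
    open_tickets = {}   # (node, symptom) -> last timestamp seen
    created = []
    for ts, node, symptom in sorted(events):
        key = (node, symptom)
        last = open_tickets.get(key)
        if last is None or ts - last > window:
            created.append(key)           # genuinely new situation
        open_tickets[key] = ts            # else: refresh existing ticket
    return created
```

Real correlation also has to handle clears, flapping, and topology, but even this much filtering changes the ticket volume dramatically.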
Consider this as well. An event management system is capable of handling a couple thousand events pretty handily, but handling 2,000 ongoing tickets at one time is outside the working parameters of many ticketing systems.
Also, consider that in Remedy 7.5, the potential exists that each ticket may utilize 1GB or more of Database space. 2000 active tickets means you’re actively working across 2TB of drive / database space.
I like simple update utilities or popups that solicit information needed and move that information element back into the working Situation Awareness screen. For example, generating a ticket should be a simple screen to solicit data that is needed for the ticket that cannot be looked up directly or indirectly. Elements like ticket synopsis or symptom. Assignment to a queue or department. Changing status of a ticket.
Maps
Maps can be handy. But if you cannot overlay tools and status effectively, or the map isn't dynamic, it becomes more of a marketing display than a tool that you can use. This is even more prevalent when maps are not organized into hierarchies.
One of the main obstacles is the canvas. You can only place a certain number of objects on a given screen. Some applications use scroll bars to enable you to get around. Others use a zoom-in / zoom-out capability where they scale the size of the icons and text according to the zoom. Others enable dragging the canvas. Another approach is a hyperbolic display, where analysis of detail is accomplished by establishing a moveable region under a higher-level map, akin to a magnifying glass over a desktop document.
3D displays get around the limitations of a small canvas a bit by using depth to position things in front or behind. However, 3D displays have to use techniques like LOD (Level of Detail) or fog to ensure that only nearby objects are attended to; otherwise they have to render every object, local and remote. This can be computationally intensive.
A couple of techniques I like in the 3D world are CAVE / immersion displays and the concept of HUDs and avatars. CAVE displays present your environment from several perspectives, including top, bottom, front, left, right, and even behind. Movement is accomplished by interacting with one screen, and the other screens are synchronized to the main, frontal screen. This gives the user the effect of an immersive visual environment.
A HUD or heads up display enables you to present real time information directly in front of a user regardless of position or view.
The concept of an avatar is important in that if you have an avatar or user symbol, you can use that symbol to enable collaboration. In fact, your proximity to a given object may be used to help others collaborate and team up to solve problems.
Next week, I’ll discuss network layouts, transitioning, state and condition management, and morphing displays. Hopefully, in the coming weeks, I’ll take a shot at designing a hybrid, immersive 2D display that is true multiuser, and can be used as a solid tools and analysis visualization system.
Labels: ENMS, Event management, ITIL, Maps, NMS, Root Cause Analysis, Situation Awareness, topology, User Interfaces