All too often, we have products that we have bought, put in production, and attempted to work through the shortcomings and obfuscated abnormalities prevalent in so many products. (I call these product "ISMs," and I use the term to describe specific product behaviors or personalities.) Over this long-lived life cycle, things change: additions, deprecations, and shifts in behavior. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.
All too often, developers tend to think of their creation in a way that may be significantly different from the deployed environments it goes into. It's easy to get stuck in microcosms and walled-off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.
With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... even Ishikawa diagrams!
All this is a distraction from what the end user really NEEDS. I'll take a shot at outlining some basic requirements pertinent to all performance management products.
1. Don't keep trying to collect on broken access mechanisms.
Many performance applications continue to collect, or attempt to collect, even when they haven't received any valid data in hours or days. It's crazy: all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?
SNMP authentication failures are a HUGE clue that your app is wasting resources over something simple. Listening for ICMP Source Quench messages will tell you if you're hammering end devices.
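As a rough sketch of what backing off broken access could look like (the hook names here are hypothetical, not from any particular product), assuming the poller reports the outcome of each attempt:

```python
import time

class CollectorGate:
    """Suspend collection against a target after repeated failures.

    Hypothetical sketch: the poller calls record_result() after every
    attempt and checks should_poll() before the next one.
    """

    def __init__(self, max_failures=10, hold_down_seconds=3600):
        self.max_failures = max_failures
        self.hold_down_seconds = hold_down_seconds
        self.failures = {}   # target -> consecutive failure count
        self.suspended = {}  # target -> time the hold-down expires

    def record_result(self, target, ok):
        if ok:
            self.failures[target] = 0
            self.suspended.pop(target, None)
            return
        self.failures[target] = self.failures.get(target, 0) + 1
        if self.failures[target] >= self.max_failures:
            # SNMP auth failures / ICMP errors have piled up; back off
            # instead of burning cycles and polluting the data.
            self.suspended[target] = time.time() + self.hold_down_seconds

    def should_poll(self, target):
        expires = self.suspended.get(target)
        return expires is None or time.time() >= expires
```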
2. Migrate away from mass-produced reports in favor of providing information.
If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.
3. If you can't create new reports without code, it's too complicated.
All too often, products want to put glue code or even programming environments / IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business-process oriented.
4. Data and indexes should be documented and manageable. If you have to BYODBA (Bring Your Own DBA), the vendor hasn't done their homework.
How many times have we loaded up a big performance management application only to find out that we have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?
And you end up having to dig through the logs to figure out what works and what doesn't.
If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?
In some instances, databases used by performance management applications are geared toward polling and collection rather than the reporting of information. In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation. For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and derive the average, root mean square, and standard deviation.
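Here is a minimal sketch of that kind of derivative, assuming the sample window has already been pulled out of the collection tables; the window values and the k multiplier are illustrative only:

```python
from statistics import mean, stdev

def dynamic_threshold(samples, k=2.0):
    """Derive a baseline and an upper threshold from a window of samples.

    Sketch of the mean / standard deviation derivative described above:
    anything more than k standard deviations above the average is flagged.
    """
    avg = mean(samples)
    sd = stdev(samples) if len(samples) > 1 else 0.0
    return avg, avg + k * sd

def exceptions(samples, k=2.0):
    """Return the (index, value) pairs that break the derived threshold."""
    avg, upper = dynamic_threshold(samples, k)
    return [(i, v) for i, v in enumerate(samples) if v > upper]

# Example: five-minute utilization samples for one interface
window = [34.1, 36.7, 33.9, 35.2, 34.8, 61.3, 35.0]
print(exceptions(window))   # flags the 61.3 spike
```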
If a reporting person has to do more than one join to get to their data elements, your data needs to be better organized, normalized, and accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance / Capacity Engineers. They would rather work the data than work your system.
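To make the single-join point concrete, here is a hedged sketch using invented table and column names: the collection-oriented tables stay as they are, and one flat view carries the reporting load.

```python
import sqlite3

# Hypothetical schema: a device table and a raw samples table, laid out the
# way a collection-oriented database might store them.
ddl = """
CREATE TABLE device (device_id INTEGER PRIMARY KEY, name TEXT, site TEXT);
CREATE TABLE sample  (device_id INTEGER, ts TEXT, if_name TEXT, util REAL);
CREATE INDEX idx_sample_device_ts ON sample (device_id, ts);

-- One flat, reporting-friendly view: no joins required downstream.
CREATE VIEW v_interface_util AS
SELECT d.name      AS device,
       d.site      AS site,
       s.if_name   AS interface,
       s.ts        AS sampled_at,
       s.util      AS utilization_pct
FROM sample s
JOIN device d ON d.device_id = s.device_id;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
# A reporting person (or Excel via ODBC) now just does:
rows = conn.execute("SELECT * FROM v_interface_util WHERE utilization_pct > 90")
```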
5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustworthy.
There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how it works, you're going to alienate a large portion of your operations base.
6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!
Secrets are BAD. Cisco publishes all of its bugs online specifically because it eliminates the perception that they are keeping something from the customer.
If one remembers Concord's eHealth Health Index... In the earlier days, it was SECRET SQUIRREL SAUCE. Many an Engineer got a bad review or lost their job because of the arrogance of not publishing the elements that made up the Health Index.
7. Be prepared to handle BI types of access. Bulk transfers, ODBC and Excel/Access replication, ETL tools access, etc.
If Engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Provide shared tables, timed transfers, transformations, and data dumps.
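A small sketch of a timed bulk dump, reusing the hypothetical v_interface_util view from the earlier sketch; the scheduling itself (cron, a product job, whatever) is left out:

```python
import csv
import sqlite3
from datetime import date

def nightly_dump(db_path, out_dir):
    """Dump the flat reporting view to a dated CSV for BI / ETL consumption.

    Sketch only: v_interface_util is the hypothetical view from the earlier
    example; swap in whatever table or view your product actually exposes.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT * FROM v_interface_util")
    out_file = f"{out_dir}/interface_util_{date.today():%Y%m%d}.csv"
    with open(out_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)                                  # bulk rows
    conn.close()
    return out_file
```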
8. Reports are not just a graph on a splash page or a table of data. Reports, to Operations personnel, mean putting text and formatting around the graphs, charts, tables, and data to relate the operational aspects of the environment to the illustrations.
9. In many cases, you need to transfer data, in a transformed state, from one reporting system to another. Without ETL tools, your reporting solution kind of misses the mark.
Think about this... You have configuration data and you need this data in a multitude of applications. Netcool. Your CMDB. Your Operational Data Store. Your discovery tools. Your ticketing system. Your performance management system. And it seems that every one of these data stores may be text, XML, databases of various forms and flavors, even HTML. How do you get data transformed from one place to another?
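As one hedged example of that kind of transform step, here is a sketch that flattens an invented device-config XML export into JSON records another system could load; the element and attribute names are illustrative, not any product's schema:

```python
import json
import xml.etree.ElementTree as ET

def xml_config_to_records(xml_text):
    """Flatten a device-config XML export into plain dicts.

    The <device name=... ip=... site=...> layout here is invented for
    illustration; the point is the small transform step, not the schema.
    """
    root = ET.fromstring(xml_text)
    return [
        {"name": dev.get("name"), "ip": dev.get("ip"), "site": dev.get("site")}
        for dev in root.iter("device")
    ]

sample = """
<inventory>
  <device name="core-rtr-01" ip="10.1.1.1" site="DAL"/>
  <device name="edge-sw-07"  ip="10.2.4.7" site="HOU"/>
</inventory>
"""

records = xml_config_to_records(sample)
print(json.dumps(records, indent=2))   # ready for a CMDB / ODS / ticketing load
```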
10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.
As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, what happens is that when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.
The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, the problem often just continues until it's a major ordeal to correct. Reconciliation when there are a significant number of discrepancies can be rough.
Last but not least - if you develop an application and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION is JUNK. It's dangerous, haphazard, incredibly short-sighted, and should be avoided at all costs. Recently, I had a dear friend develop, test, and validate a performance management application upgrade. After a month in test and QA running many different use case validations, it was put into production. Upon placement into production, the application overrode the pared-down configurations with defaults, polled EVERYTHING, and caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. People were terminated. And a lot of manpower was expended trying to fix the issues.
In no uncertain terms, I will never let my friends and customers be caught by this product.
Sunday, April 4, 2010
Simplifying topology
I have been looking at monitoring and how it's typically implemented. Part of my look is to drive visualization, but also to see how I can leverage the data in a way that organizes people's thoughts on the desk.
Part of my thought process is around OpenNMS: what can I contribute to make the project better?
What I came to realize is that Nodes are monitored on a Node / IP address basis by the majority of products available today. All of the alarms and events are aligned by node - even the sub-object based events get aggregated back to the node level. And for the most part, this is OK. You dispatch a tech to the Node level, right?
When you look at topology in a general sense, you can see the relationship between the poller and the Node under test. Between the poller and the end node, there is a list of elements that make up the lineage of network service components. So, from a service perspective, a simple traceroute between the poller and the end node produces a simple network "lineage".
Extending this a bit further: traceroute is typically done with ICMP, which gives you an IP-level perspective of the network. Note also that because traceroute exploits the Time to Live (TTL) parameter of IP, it can be accomplished in any transport layer protocol. For example, traceroute could work on TCP port 80 or 8080. The important part is that you place a protocol-specific responder at the end of the check to see if the service is actually working beyond just responding to a connection request.
And while traceroute is a one-way street, it still derives a path lineage between the poller and the Node under test - and now the protocol or SERVICE under test. And it is still a simple lineage.
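A sketch of what a service-level traceroute could look like, using Scapy and assuming raw-socket privileges; note that a SYN/ACK only proves the port answered, so a real check would still hang a protocol-specific responder (an HTTP GET, for instance) on the end:

```python
# Sketch of a TCP "traceroute" toward a service port, using Scapy.
from scapy.all import IP, TCP, ICMP, sr1

def service_lineage(target, dport=8080, max_hops=30):
    """Walk the TTL up until the service port answers; return the hop list."""
    hops = []
    for ttl in range(1, max_hops + 1):
        probe = IP(dst=target, ttl=ttl) / TCP(dport=dport, flags="S")
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            hops.append((ttl, None))            # silent hop or filtered
            continue
        hops.append((ttl, reply.src))
        if reply.haslayer(TCP):                 # SYN/ACK or RST from the end node
            break
        if reply.haslayer(ICMP) and reply[ICMP].type == 3:
            break                               # unreachable / administratively prohibited
    return hops

# for ttl, hop in service_lineage("192.0.2.10", dport=8080):
#     print(ttl, hop or "*")
```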
The significance of the path lineage is that in order to do some level of path correlation, you need to understand what is connected to what. Given that this can be very volatile and change very quickly, topology-based correlation can be somewhat problematic - especially if your "facts" change on the fly. And IP-based networks do that. They are supposed to do that. They are a best-effort communications methodology that needs to adapt to various conditions.
Traceroute doesn't give you ALL of the topology - far from it. Consider the case of a simple Frame Relay circuit. A Frame Relay circuit is mapped end to end by a circuit provider but uses T-carrier access to the local exchange. Traceroute only captures the IP-level access and doesn't capture the elements below that. In fact, if you have ISDN backup enabled for a Frame Relay circuit, your end points for the circuit will, in most cases, change for the access. And the hop count may change as well.
The good part about tracerouting via a legitimate protocol is that you get to visualize any administrative access issues up front. For example, if port 8080 is blocked between the poller and the end node, the traceroute will fail. Additionally, you may see ICMP administratively-prohibited messages as well. In effect, by positioning the poller according to end-user populations, you get to see the service access pathing.
Now, think about this... From a basic service perspective, if you poll via the service, you get a basic understanding of the service you are providing via that connection. When something breaks, you also have a BASELINE with which to diagnose the problem. So, if the poll fails, rerun the traceroute via the protocol and see where it stops.
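Once the lineage is stored as a baseline, finding where the fresh path diverges is a small comparison; here is a sketch with made-up hop addresses:

```python
def first_divergence(baseline_hops, current_hops):
    """Compare a stored path lineage to a fresh one; return the first hop
    where they differ (or where the new path simply stops short)."""
    for i, expected in enumerate(baseline_hops):
        actual = current_hops[i] if i < len(current_hops) else None
        if actual != expected:
            return i + 1, expected, actual   # hop number, what was, what is
    return None                              # paths match

baseline = ["10.0.0.1", "10.0.12.2", "172.16.4.1", "192.0.2.10"]
broken   = ["10.0.0.1", "10.0.12.2", None]
print(first_divergence(baseline, broken))    # -> (3, '172.16.4.1', None)
```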
Here are the interesting things to note about this approach:
- You are simply replicating human expert knowledge in software. Easy to explain. Easy to transition to personnel.
- You get to derive path breakage points pretty quickly.
- You get to discern the perspective of the end user.
- You are now managing your Enterprise via SERVICE!
This approach is an absolute NATURAL for OpenNMS. Let me explain...
Look at the Path Outages tab. While it is currently configured manually, the traceroute-by-service lineage provides a way of both populating and visualizing that path.
OpenNMS supports service pollers natively. There are a lot of different services out of the box, and it's easy to add more if you find something different from what they already do.
Look at the difference between Alarms and Events. A service outage could be directly related to an Alarm, while the things eventing underneath, which may affect the service, are presented as Events.
What if you took the reports and charts and aligned the elements to the service lineage? For example, if you had a difference in service response, you could align all of the IO graphs for everything in the service lineage. You could also align all of the CPU utilizations as well.
Where there are subobjects abstracted in the lineage, if you discover them, you could add them to the lineage. For example, if you discovered the Frame Relay PVCs and LEC access circuits, these could be included in your visualization underneath the path where they are present.
The other part is that the way you work may need to evolve as well. For example, if you've traditionally ticketed outages on Nodes, now you may need to transition to a Service-based model. And while you may still issue tickets on a node, your ticket on a Service becomes the overarching, dominant ticket, in that multiple node problems may be present within a single service problem.
And the important thing: you become aware of the customer and the Service first, then the elements underneath. It becomes easier to manage to the service, along with impact assessments, when you manage to a service rather than to a node. And when you throw in the portability, agility, and abstractness of Cloud computing, this approach is a very logical fit.