Sunday, April 17, 2011

Data Analytics and Decision Support - Continued

Data Interpolation

In some performance management products, nulls or failed polls in a data series cause real consternation. If you are missing data polls, the software doesn't seem to know what to do.

I think the important thing is to catch counter rollovers in instances where your data elements are counters.  You want to catch a rollover in such a way that when you calculate the delta, you add up to the maximum counter value, then start adding again from zero up to the new value.  In effect you get a good delta between the two values.  What you do not want is tainted counters that span more than one counter rollover.
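
As a rough illustration, here is a minimal Python sketch of that rollover-aware delta, assuming a 32-bit Counter32 (use 2**64 for Counter64). It only accounts for a single rollover between polls, which is exactly why counters spanning more than one rollover stay tainted.

MAX_COUNTER = 2**32  # assumed Counter32 maximum

def counter_delta(previous, current, max_counter=MAX_COUNTER):
    # Normal case: the counter simply advanced.
    if current >= previous:
        return current - previous
    # Rollover case: count up to the maximum, then from zero to the new value.
    return (max_counter - previous) + current

print(counter_delta(4294967000, 500))  # 796: one rollover between polls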

If you take the delta across a missed poll, as in:

P1                   500
P2                   Missed
P3                   1000

In smoothing, you can take the difference between P3 and P1 and halve it to get a delta.  Simply add this to P1 to produce P2, which yields a derived time series of:

P1                    500
P2                    750
P3                    1000
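
As an illustration only, here is a minimal Python sketch of that midpoint fill; the function name and the use of None for a missed poll are my own conventions, not any particular product's.

def interpolate_missed(series):
    # Replace None entries with the midpoint of their two neighbors.
    filled = list(series)
    for i in range(1, len(filled) - 1):
        if filled[i] is None and filled[i - 1] is not None and filled[i + 1] is not None:
            filled[i] = filled[i - 1] + (filled[i + 1] - filled[i - 1]) / 2
    return filled

print(interpolate_missed([500, None, 1000]))  # [500, 750.0, 1000]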

With gauges, you may have to sample the elements around the missed data element and use an average to smooth over it.  Truth be told, I don't see why a graphing engine cannot simply connect P1 to P3 without the interpolation!  It would make graphs much simpler and richer: if you don't see a data point in the time slot, it didn't happen!  In the smoothing scenario, you cannot tell where the smoothing occurred.

Availability

In my own thinking, I view availability as either a discrete event in time or a state over a time period.  Both have valid uses but need to be presented in the context in which they were born so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:

P1             Good
P2             Good
P3             Fail
P4             Good
P5             Fail
P6             Good
P7             Good
P8             Good
P9             Good
P10           Good

This time series denotes ping failures at intervals P3 and P5.  In theory, a line graph is not appropriate here because the data is boolean and discrete, and it is not representative of the entire time period.  If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:

Period        Poller uTime       SysUptime
P0             1303088400      2500000
P1             1303088700      2503000
P2             1303089000      2506000
P3             1303089300      2509000
P4             1303089600      2512000
P5             1303089900      20000
P6             1303090200      50000
P7             1303090500      80000
P8             1303090800      110000
P9             1303091100      5000
P10           1303091400      35000

utime is the number of seconds since January 1, 1970 and, as such, it increments every second.  sysUpTime is the number of clock ticks since the management system reinitialized.  When you look hard at the TimeTicks data type, it is a modulo counter of the number of 1/100ths of a second over a given epoch period. This integer rolls over at 4294967296, or approximately every 497 days.


In looking for availability in the time series, derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks is 1/100th of a second). If the sysUpTime value is less than the previous value plus the derived delta in timeticks, you can clearly see where you have an availability issue. Compared to the previous time series using ICMP ping, P3 was meaningless and only P5 managed to ascertain some level of availability discrepancy.


You can also derive from the errant time series periods that P5 lost 10000 timeTicks out of the expected delta value of 30000 (300 seconds * 100). So, for period P5 we were unavailable for 10000 timeticks, not the full period. Also note that if you completely miss a poll or don't get a response, the delta at the next successful poll should tell you whether availability was actually affected, even though the missed poll did not go as planned.
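
Here is a hedged Python sketch of that sysUpTime check, using the P4 to P5 rows from the table above. It assumes a single restart within the interval and ignores TimeTicks rollover, so treat it as an outline of the idea rather than a finished implementation.

def interval_downtime(prev_utime, prev_uptime, cur_utime, cur_uptime):
    # Elapsed wall-clock time between polls, converted to 1/100 s timeticks.
    elapsed_ticks = (cur_utime - prev_utime) * 100
    # Restart detected when sysUpTime is less than the previous value plus
    # the derived delta in timeticks.
    if cur_uptime >= prev_uptime + elapsed_ticks:
        return 0
    # After a restart sysUpTime counts from zero, so the current value is
    # roughly how long the agent has been back up within this interval.
    return max(elapsed_ticks - cur_uptime, 0)

# P4 -> P5: 300 s elapsed, sysUpTime fell from 2512000 to 20000,
# so roughly 30000 - 20000 = 10000 ticks (100 s) of the interval were lost.
print(interval_downtime(1303089600, 2512000, 1303089900, 20000))  # 10000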


From a pure time series conversion standpoint, one could translate sysUpTime into an accumulated number of available seconds over the time period.  From a statistics perspective, this makes sense in that it becomes a count or integer.

Discrete to Time Series Interpolation

Let's say you have an event management system, and within this system you receive a significant number of events.  If you can categorize and count your events into instances per interval, you can convert them into a counter for a time period.  For example, you received 4 CPU threshold events in 5 minutes for a given node.
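
A small Python sketch of that bucketing, assuming event timestamps in epoch seconds and 5 minute intervals (the names are illustrative):

from collections import Counter

def events_per_interval(timestamps, interval=300):
    # Bucket each event timestamp (epoch seconds) by the start of its interval.
    return Counter((ts // interval) * interval for ts in timestamps)

cpu_events = [1303088410, 1303088450, 1303088500, 1303088650]
print(events_per_interval(cpu_events))  # Counter({1303088400: 4})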

In other instances, you may have to convert discrete events using a more stateful approach. For example, you have a link down and 10 seconds later you get a link up. You have to translate this into non-availability of the link for 10 seconds of a 5 minute interval.  What is really interesting about this is that you have very precise time stamps.  When you do, you can use them to compare against other elements that may have changed around the same period of time.  Kind of a cause and effect analysis.
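
A hedged sketch of that conversion in Python: it clips a single down/up pair to one interval, and a real implementation would also have to carry outages across interval boundaries.

def unavailable_seconds(down_ts, up_ts, interval_start, interval_len=300):
    # Clip the outage [down_ts, up_ts) to the interval and return its length.
    start = max(down_ts, interval_start)
    end = min(up_ts, interval_start + interval_len)
    return max(end - start, 0)

# Link down 100 seconds into the interval, back up 10 seconds later.
print(unavailable_seconds(1303088500, 1303088510, 1303088400))  # 10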

This is especially pertinent when you analyze data from devices, event data, and workflow and business related elements. For example, what if you did a comparison of HTTP response times, Internet network IO rates, and Apache click rates on your web servers? What if you threw in trouble tickets and change orders over this same period?   What about query rates on your Oracle databases?

Now you can start correlating technical metrics with business metrics, and even cause and effect elements, because they are all tied to a common denominator - time. While many management platforms concentrate on the purely technical - like IO rates of router interfaces - that may not mean as much to a business person.   What if your web servers seem to run best when IO rates are between 10 and 15 percent?  If that range is where the company performs best, I'd tune to that range.  If you change your apps to increase efficiency, you can go back and look at these same metrics after you make changes and validate them handily.

This gets even more interesting when you get Netflow data for the intervals as well.  But I'll cover that in a later post!

Sunday, April 10, 2011

Data Analytics and Decision Support

OK.  The problem is this: given a significant amount of raw performance data, I want to be able to determine some indication, and potentially a prediction, of whether the metrics are growing or declining and what will potentially happen beyond the data we currently have, given the existing trend.

Sounds pretty easy, doesn't it?


When you look at the overall problem, what you're trying to do is take raw data and turn it into information. In effect, it looks like a Business Intelligence system.  Forrester Research defines Business Intelligence as:

"Forrester often defines BI in one of two ways. Typically, we use the following broad definition: Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making. But when using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. Therefore, we also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. We define the narrower BI market as: A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery."


Here is the link to their glossary for reference.


When you compare this definition to what you get in today's performance management systems, you see somewhat of a dichotomy.  Most performance management systems are graph and report producing products.   You get what you get, and so it's up to you to evolve toward a Decision Support System rather than a graphical report builder.


Many of the performance management applications lack the ability to produce and use derivative data sets.  They do produce graphs and reports on canned data sets, and it is left up to the user to sort through the graphs to determine what is going on in the environment.


It should be noted that many products are hesitant to support extremely large data sets with OLTP oriented relational databases.  When the database table sizes go up, you start seeing degradation or slowness in the response of CRUD type transactions. Some vendors resort to batch loading.  Others use complex tablespace schemes.


Part of the problem with row oriented databases is that they must conform to the ACID model. Within the ACID model, a transaction must either go or not go.  Also, you cannot read at the same time a write is occurring or your consistency is suspect. In performance management systems, there are typically far more writes than reads. Also, as the data set gets bigger and bigger, the indexes have to grow as well.


Bottom line - If you want to build a Decision Support System around Enterprise Management and the corresponding Operational data, you have to have a database that can scale to the multiplication of data through new derivatives.


Consider this - a Terabyte of raw data may be able to produce more than 4 times that when performing analysis. You could end up with a significant number of derivative data sets and storing them in a database enables you to use them repeatedly. (Keep in mind, some OLAP databases are very good at data compression.)

Time - The Common Denominator

In most graphs and in the data itself, the one common denominator of performance data is that it is intrinsically time series oriented.  This provides a good foundation for comparisons.  So now, what we need to do is select samples in a context, or SICs.  For example, we may need to derive a sample of an hour over the context of a month.

In instances where a time series is not directly implied, it can be derived.  For example, the number of state changes for a given object can be counted for the time series interval.  Another instance may be a derivative of the state for a given time interval, or the state when the interval occurred.  In this example, you use whatever state the object is in at the time stamp.

Given that raw data may be collected at intervals much smaller than the sample period, we must look for a way to summarize the individual raw elements well enough to characterize the whole sample period.  This gets one contemplating what would be interesting to look at.  Probably the simplest way to characterize a counter is to sum all of the raw data values into a sample (simple addition). For example, if you retrieved counts on 5 minute intervals, you could simply add up the deltas over the course of the 1 hour sample into a derivative data element, producing a total count for the sample.   In some instances, you could just look at the counter at the start and at the end, if there are no resets or counter rollovers during the sample.
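
A minimal Python sketch of that roll-up, assuming the 5 minute deltas have already been rollover-corrected:

def hourly_totals(deltas, polls_per_hour=12):
    # Sum each run of twelve 5-minute deltas into a 1-hour sample total.
    return [sum(deltas[i:i + polls_per_hour])
            for i in range(0, len(deltas), polls_per_hour)]

five_minute_deltas = [3000] * 24           # two hours of 5-minute counts
print(hourly_totals(five_minute_deltas))   # [36000, 36000]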

In other instances, you may need to take the deltas of the raw data for the sample to produce an average.  This average is useful by itself in many cases.  But in others, you could go back through the raw data and compare each delta to the average to produce an offset or deviation from the average.  What this offset or deviation really does is produce a metric for recognizing activity within the sample.  The more up and down off of the average, the greater the deviations will be.
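
A short sketch of that average-and-offset characterization (illustrative names only):

def average_and_offsets(deltas):
    # Average the deltas, then record each delta's offset from that average.
    avg = sum(deltas) / len(deltas)
    return avg, [d - avg for d in deltas]

avg, offsets = average_and_offsets([3000, 5000, 1000, 3000])
print(avg)      # 3000.0
print(offsets)  # [0.0, 2000.0, -2000.0, 0.0]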

Another thing you need to be aware of is data set tainting.  For example, let's say you have a period within your context where data is not collected, not available, or suspect.  This sample could be considered tainted, but in many cases it can still be used.   It is sometimes important to compare data collection to different elements like availability, change management windows, counter resets, or collection "holes" in the data.

What these techniques provide is a tool bag of functions that you can use to characterize and compare time series data sets.  What if you wanted to understand whether there is any cause and effect correlation between metrics on a given device?  There are elements you may find that behave alike.  Yet there are others that may be opposites: when one metric increases, the other decreases. The characterization enables you to programmatically determine relationships between metrics.
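
One way to do that programmatically is a plain Pearson correlation coefficient - my own choice of characterization here, not something any particular product prescribes. Values near +1 mean the metrics move together, values near -1 mean they move in opposite directions.

from math import sqrt

def pearson(xs, ys):
    # Classic Pearson correlation between two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

io_rate   = [10, 12, 15, 13, 11]
http_resp = [200, 180, 150, 170, 190]
print(pearson(io_rate, http_resp))  # approximately -1: they move in opposite directions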

The result of normalizing samples is to produce derivative data sets that can be used in different contexts and correlations.  When you think about it, one should even be able to characterize and compare multiple SICs (Samples in Contexts).  For example, one could compare day-in-month to week-in-month SICs.

Back to our samples in context paradigm. What we need samples for is to look at the contexts over time and look for growth or decline behaviors.  The interesting part is the samples, in that samples are needed when performing Bayesian Belief Networks.  BBNs take a series of samples and produce a ratio of occurrences and anti-occurrences, translated to a simple +1 to -1.
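
This is not a full Bayesian Belief Network, just a hedged sketch of the occurrence versus anti-occurrence ratio idea: score each sample +1 if the metric grew against the previous sample, -1 if it declined, and average the scores into a single value between -1 and +1.

def growth_score(samples):
    # +1 for each growth step, -1 for each decline, 0 for no change.
    moves = [1 if b > a else -1 if b < a else 0
             for a, b in zip(samples, samples[1:])]
    return sum(moves) / len(moves) if moves else 0.0

print(growth_score([100, 120, 130, 125, 140]))  # 0.5: mostly growing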

When you look at Operations as a whole, activity and work are accomplished around business hours.  Weekdays become important, as do shifts. So it makes sense to align your samples to operational time domains.

The first exercise we can do is take a complex SIC like weekday by week by month. For each weekday, we total up our metric. For each weekday-in-week-in-month sample, is the metric growing or declining?  If you create a derivative data set that captures the percent difference from the previous value, you can start to visualize the change over the samples.
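
A small sketch of that percent-difference derivative data set, with made-up weekday totals:

def percent_change(samples):
    # Percent difference of each sample from the one before it.
    return [100.0 * (cur - prev) / prev
            for prev, cur in zip(samples, samples[1:])]

weekday_totals = [36000, 39600, 41580, 37422]
print(percent_change(weekday_totals))  # [10.0, 5.0, -10.0]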

When you start down the road of building the data sets, you need to be able to visualize the data in different ways.  Graphs, charts, scatterplots, bubble charts, pie charts, and a plethora of other mechanisms provide ways you can see and use the data.  One of the most common tools that BI professionals use is a Spreadsheet. You can access the data sources via ODBC and use the tools and functions within the spreadsheet to work through the analysis.