Sunday, April 17, 2011

Data Analytics and Decision Support - Continued

Data Interpolation

Some performance management products handle nulls or failed polls in a data series poorly. When data polls are missing, the software doesn't seem to know what to do.

I think the important thing is to catch counter rollovers in instances where your data elements are counters.  You want to handle a rollover in such a way that when you calculate the delta, you count up to the maximum counter value, then continue from zero up to the new value.  In effect, you get a good delta between the two samples.  What you do not want is tainted counters that span more than one counter rollover.
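The rollover handling above can be sketched as follows. This is a minimal sketch; the function name is hypothetical, and it assumes a 32-bit counter (a Counter64 would use 2**64) and at most one rollover between samples:

```python
MAX32 = 2**32  # Counter32 wraps at 4294967296

def counter_delta(prev, curr, max_value=MAX32):
    """Delta between two counter samples, assuming at most one rollover."""
    if curr >= prev:
        return curr - prev
    # Rolled over: count up to the maximum, then from zero to the new value.
    return (max_value - prev) + curr

# Previous sample near the top of the counter, current sample just past zero
print(counter_delta(4294967000, 500))  # -> 796
```

If the counter can wrap more than once between polls, the delta is unrecoverable, which is exactly the tainted case described above.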

Suppose you need a delta across a missed poll, as in:

P1                   500
P2                   Missed
P3                   1000

In smoothing, you can take the difference between P3 and P1 and halve it to get a delta.  Simply add this to P1 to produce P2, which yields a derived time series of:

P1                    500
P2                    750
P3                    1000

For gauges, you may have to sample the elements around the missed data element and use an average to smooth over it.  Truth be told, I don't see why a graphing engine cannot connect from P1 to P3 without the interpolation!  It would make graphs much simpler and richer: if you don't see a data point in the time slot, it didn't happen!  In the smoothing scenario, you cannot tell where the smoothing occurred.
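The midpoint smoothing described above can be sketched as follows. This is a hypothetical helper that assumes a missed poll is marked as None and has good neighbors on both sides:

```python
def interpolate_missed(series):
    """Replace None entries with the midpoint of their neighbors."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None and 0 < i < len(filled) - 1:
            prev, nxt = filled[i - 1], filled[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = prev + (nxt - prev) / 2
    return filled

# The P1/P2/P3 example from above
print(interpolate_missed([500, None, 1000]))  # -> [500, 750.0, 1000]
```

Note that once the gap is filled, nothing in the derived series marks P2 as interpolated, which is the objection raised above.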


In my own thoughts, I view availability as either a discrete event in time or a state over a time period.  Both have valid uses but need to be presented in the context in which they were born so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:

P1             Good
P2             Good
P3             Fail
P4             Good
P5             Fail
P6             Good
P7             Good
P8             Good
P9             Good
P10           Good

This time series denotes ping failures at intervals P3 and P5.  In theory, a line graph is not appropriate here because the data is boolean and discrete, and not representative of the entire time period.  If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:

Period        Poller uTime       SysUptime
P0             1303088400      2500000
P1             1303088700      2503000
P2             1303089000      2506000
P3             1303089300      2509000
P4             1303089600      2512000
P5             1303089900      20000
P6             1303090200      50000
P7             1303090500      80000
P8             1303090800      110000
P9             1303091100      5000
P10           1303091400      35000

utime is the number of seconds since January 1, 1970, and as such it increments every second.  sysUpTime is the number of clock ticks since the management system reinitialized.  When you look hard at the TimeTicks data type, it is a modulo counter of the number of 1/100ths of a second for a given epoch period. This integer rolls over at 4294967296, or approximately every 497 days.

In looking for availability in the time series, derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks are 1/100th of a second).  If the sysUpTime value is less than the previous value plus the derived delta in timeticks, you can clearly see where you have an availability issue. Compared to the previous time series using ICMP ping, P3 was meaningless, and only P5 managed to reveal some level of availability discrepancy.

You can also derive from the errant time series periods that P5 lost 10000 timeticks from the minimum delta value of 30000 (300 seconds * 100). So, for period P5 we were unavailable for 10000 timeticks, not the full period. Also note that if you completely miss a poll or don't get a response, the delta at the next successful poll can still tell you whether availability was affected.
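The sysUpTime check above can be sketched as follows. This is a hypothetical helper using values from the table above; it assumes at most one agent restart per interval:

```python
TICKS_PER_SECOND = 100  # TimeTicks are 1/100th of a second

def downtime_ticks(prev_utime, prev_uptime, curr_utime, curr_uptime):
    """Ticks of downtime in the poll interval, assuming at most one restart."""
    expected = (curr_utime - prev_utime) * TICKS_PER_SECOND
    if curr_uptime >= prev_uptime:
        return 0  # uptime kept climbing: the agent was up the whole interval
    # Restarted: the observed uptime is how long it has been back up,
    # so the remainder of the expected ticks is downtime.
    return max(expected - curr_uptime, 0)

# P4 -> P5 from the table: 30000 expected ticks, sysUpTime reset to 20000
print(downtime_ticks(1303089600, 2512000, 1303089900, 20000))  # -> 10000
```

The same delta logic works across a missed poll: just stretch the expected ticks over the longer utime gap.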

From a pure time-series conversion standpoint, one could translate sysUpTime into an accumulated number of available seconds over the time period.  From a statistics perspective, this makes sense in that it becomes a simple count.

Discrete to Time Series Interpolation

Let's say you have an event management system, and within this system you receive a significant number of events.  If you can categorize and count your events into instances per interval, you can convert them into a counter for a time period.  For example, you received 4 CPU threshold events in 5 minutes for a given node.
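That categorize-and-count step can be sketched as follows. This is a hypothetical helper; the timestamps are illustrative Unix epoch seconds and the 5-minute interval is from the example above:

```python
from collections import Counter

INTERVAL = 300  # 5 minutes, in seconds

def events_per_interval(timestamps, interval=INTERVAL):
    """Map each interval start time to the number of events that fell in it."""
    return Counter((ts // interval) * interval for ts in timestamps)

# 4 CPU threshold events for one node, all within the same 5 minutes
events = [1303088410, 1303088500, 1303088650, 1303088690]
print(events_per_interval(events))  # one bucket at 1303088400 with a count of 4
```

The resulting counts slot directly into the same time-series machinery as any polled counter.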

In other instances, you may have to convert discrete events using a more stateful approach. For example, you see a link down and, 10 seconds later, a link up. You have to translate this into non-availability of the link for 10 seconds of a 5-minute interval.  What is really interesting about this is that you have very precise timestamps.  When you do, you can compare them to other elements that may have changed around the same period of time.  It's a kind of cause-and-effect analysis.
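The link-down/link-up conversion can be sketched as follows. This is a hypothetical helper that assumes a single down/up pair and clamps the outage to the interval boundaries:

```python
def unavailable_seconds(down_ts, up_ts, interval_start, interval_len=300):
    """Seconds the link was down inside [interval_start, interval_start + interval_len)."""
    interval_end = interval_start + interval_len
    # Clip the outage window to the interval we are accounting for.
    start = max(down_ts, interval_start)
    end = min(up_ts, interval_end)
    return max(end - start, 0)

# Link down 100 seconds into the interval, back up 10 seconds later
print(unavailable_seconds(1303088500, 1303088510, 1303088400))  # -> 10
```

An outage that straddles an interval boundary gets split naturally: call the helper once per interval and each call returns only its own share of the downtime.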

This is especially pertinent when you analyze data from devices, event data, and workflow- and business-related elements. For example, what if you compared HTTP response times to Internet network IO rates and Apache click rates on your web servers? What if you threw in trouble tickets and change orders over the same period?  What about query rates on your Oracle databases?

Now you can start correlating real metrics with business metrics and even cause-and-effect elements, because they all share a common denominator: time. While many management platforms concentrate on the purely technical, like IO rates of router interfaces, that may not mean much to a business person.  What if your web servers seem to run best when IO rates are between 10 and 15 percent?  If that range is where the company performs best, I'd tune to that range.  If you change your apps to increase efficiency, you can go back, look at these same metrics after the changes, and validate handily.

This gets even more interesting when you get Netflow data for the intervals as well.  But I'll cover that in a later post!
