Some performance management products react badly to nulls or failed polls in a data series. If you are missing data polls, the software doesn't seem to know what to do.
I think the important thing is to catch counter rollovers when your data elements are counters. When you calculate the delta across a rollover, you count up to the maximum counter value, then continue counting from zero up to the new value. In effect, you get a good delta between the two values. What you do not want is tainted deltas that span more than one counter rollover.
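The rollover handling described above can be sketched roughly as follows. This is a minimal illustration, not any product's implementation: the function name `counter_delta` and the `max_value` parameter are made up for the example, and it assumes at most one rollover between the two polls.

```python
def counter_delta(previous, current, max_value=2**32):
    """Return the increase between two polls of a wrapping counter.

    max_value is the counter modulus (2**32 for a 32-bit counter,
    2**64 for a 64-bit one). Assumes at most one rollover between polls.
    """
    if current >= previous:
        return current - previous
    # Counter wrapped: count up to the maximum, then from zero to current.
    return (max_value - previous) + current


print(counter_delta(4294967000, 704))  # 1000
```

A drop in the counter can also mean the device reset its counters rather than rolled over, and a sketch like this cannot tell the two apart on its own.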
If you take the delta value across a missed poll, as in:
In smoothing, you can take the difference between P3 and P1 and halve it to get a delta. Simply add this delta to P1 to produce P2, which yields a derived time series of:
For gauges, you may have to sample the elements around the missed data element and use their average to smooth over the gap. Truth be told, I don't see why a graphing engine cannot connect from P1 to P4 without the interpolation! It would make graphs much simpler and richer: if you don't see a data point in the time slot, guess what, it didn't happen! In the smoothing scenario, you cannot tell where the smoothing is.
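As a rough sketch of the smoothing described above, the function below fills an isolated missed poll with the midpoint of its two neighbours, which is the same as adding half the P3 minus P1 delta to P1. The function name and the sample values are hypothetical. It also returns a flag per point, which is one possible way around the problem that you otherwise cannot tell where the smoothing is.

```python
def fill_missing(samples):
    """Fill isolated None entries with the midpoint of their neighbours.

    Returns (values, interpolated), where interpolated[i] is True for
    every point that was derived rather than polled.
    """
    values = list(samples)
    interpolated = [False] * len(values)
    for i in range(1, len(values) - 1):
        before, after = samples[i - 1], samples[i + 1]
        if samples[i] is None and before is not None and after is not None:
            # Midpoint: P1 + (P3 - P1) / 2
            values[i] = before + (after - before) / 2
            interpolated[i] = True
    return values, interpolated


values, flags = fill_missing([100, None, 300, 400])
print(values)  # [100, 200.0, 300, 400]
print(flags)   # [False, True, False, False]
```

Gaps of two or more consecutive missed polls are left as None here, so a graphing engine could still choose to draw nothing for them.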
In my own thoughts, I view availability as either a discrete event in time or a state over a time period. Both have valid uses but need to be presented in the context in which they arise so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:
This time series denotes ping failures at the P3 and P5 intervals. In theory, a line graph is not appropriate for this instance because the data is boolean and discrete, and not representative of the entire time period.
Now consider SNMP polls of sysUpTime instead. If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:
Period  Poller utime  sysUpTime
P0      1303088400    2500000
P1      1303088700    2530000
P2      1303089000    2560000
P3      1303089300    2590000
P4      1303089600    2620000
P5      1303089900    20000
P6      1303090200    50000
P7      1303090500    80000
P8      1303090800    110000
P9      1303091100    5000
P10     1303091400    35000
utime is the number of seconds since January 1, 1970 and, as such, it increments every second. sysUpTime is the number of clock ticks since the network management portion of the system was last reinitialized. When you look hard at the TimeTicks data type, it is a modulo 2^32 counter of the number of 1/100ths of a second elapsed in a given time period. This integer rolls over at the value of 4294967296, which at 100 ticks per second is approximately every 497 days.
To look for availability in the time series, derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks are 1/100ths of a second). If the sysUpTime value is less than the previous value plus the derived delta in TimeTicks, you can clearly see where you have an availability issue. Compared to the previous time series using ICMP ping, P3 was meaningless, and only P5 ascertained some level of availability discrepancy.
You can also derive from the errant time series periods that P5 lost 10000 TimeTicks from the minimum delta value of 30000 (300 seconds * 100). So, for period P5 we were unavailable for 10000 TimeTicks and not for the full period. Also note that if you completely miss a poll or don't get a response, the delta from the last successful poll should tell you whether availability was affected or not, even though the poll did not go as planned.
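The check described above might look roughly like the sketch below, run against rows P5 through P10 of the table. The function name `lost_ticks` is hypothetical, and the sketch ignores polling jitter and the 2^32 TimeTicks rollover. Following the text, it counts the whole stretch before the restart as lost, which is really an upper bound on the downtime.

```python
TICKS_PER_SECOND = 100  # TimeTicks are hundredths of a second


def lost_ticks(prev_poll, cur_poll):
    """Return the TimeTicks missing between two (utime, sysUpTime) polls.

    Returns 0 when sysUpTime advanced at least as far as the poller clock.
    """
    prev_time, prev_uptime = prev_poll
    cur_time, cur_uptime = cur_poll
    expected = (cur_time - prev_time) * TICKS_PER_SECOND
    if cur_uptime >= prev_uptime + expected:
        return 0
    if cur_uptime < prev_uptime:
        # The agent restarted and has only been up for cur_uptime ticks.
        return max(expected - cur_uptime, 0)
    # No restart, but sysUpTime advanced less than the poller clock.
    return prev_uptime + expected - cur_uptime


labels = ["P5", "P6", "P7", "P8", "P9", "P10"]
polls = [
    (1303089900, 20000),
    (1303090200, 50000),
    (1303090500, 80000),
    (1303090800, 110000),
    (1303091100, 5000),
    (1303091400, 35000),
]
for i in range(1, len(polls)):
    lost = lost_ticks(polls[i - 1], polls[i])
    if lost:
        print(labels[i], lost)  # prints: P9 25000
```

The same arithmetic applied to P5 (20000 ticks of uptime against a 30000 tick interval) gives the 10000 lost TimeTicks mentioned above. Because the expected delta comes from the poller timestamps, a missed poll simply makes the interval 600 seconds and the check still works.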
From a pure time series conversion standpoint, one could translate sysUpTime into an accumulated number of available seconds over the time period. From a statistics perspective, this makes sense because the value becomes a count, an integer.
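One way to sketch that translation is shown below. The function name is hypothetical, and it makes the simplifying assumption that after a restart the system only counts as available for the uptime it reports, so the result is a lower bound on the available seconds.

```python
def available_seconds(prev_poll, cur_poll):
    """Seconds of the polling interval that sysUpTime accounts for.

    Each poll is a (utime, sysUpTime) pair; sysUpTime is in 1/100ths
    of a second.
    """
    prev_time, prev_uptime = prev_poll
    cur_time, cur_uptime = cur_poll
    interval = cur_time - prev_time
    if cur_uptime < prev_uptime:
        # Restarted during the interval: only the time since restart counts.
        return min(cur_uptime // 100, interval)
    return interval


print(available_seconds((1303090800, 110000), (1303091100, 5000)))  # 50
print(available_seconds((1303091100, 5000), (1303091400, 35000)))   # 300
```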
Discrete to Time Series Interpolation
Let's say you have an event management system, and within this system you receive a significant number of events. If you can categorize your events and count the instances per interval, you can convert this into a counter for a time period. For example, you received 4 CPU threshold events in 5 minutes for a given node.
In other instances, you may have to convert discrete events using a more stateful approach. For example, you have a link down and, 10 seconds later, you get a link up. You have to translate this into non-availability of the link for 10 seconds of a 5 minute interval. What is really interesting about this is that you have very precise timestamps. When you do, you can use them to compare against other elements that may have changed around the same period of time. It becomes a kind of cause and effect analysis.
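Counting events per interval could be sketched like this. The function name and the event timestamps are made up for illustration; the 300 second default matches the 5 minute interval in the example.

```python
from collections import Counter


def events_per_interval(timestamps, interval=300):
    """Count events in fixed-width buckets keyed by bucket start time."""
    counts = Counter()
    for ts in timestamps:
        counts[ts - ts % interval] += 1
    return dict(sorted(counts.items()))


# Hypothetical CPU threshold event times (seconds since January 1, 1970)
events = [1303088410, 1303088475, 1303088590, 1303088699, 1303088705]
print(events_per_interval(events))
# {1303088400: 4, 1303088700: 1}
```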
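The link down and link up translation can be sketched as below. The function name and timestamps are hypothetical; the point is that an outage gets charged to the interval it falls in, and split when it straddles an interval boundary.

```python
def downtime_per_interval(down_ts, up_ts, interval=300):
    """Split one outage [down_ts, up_ts) across fixed-width intervals.

    Returns {interval_start: seconds_unavailable}.
    """
    result = {}
    start = down_ts
    while start < up_ts:
        bucket = start - start % interval
        end = min(up_ts, bucket + interval)
        result[bucket] = result.get(bucket, 0) + (end - start)
        start = end
    return result


# Link down, then link up 10 seconds later, inside one 5 minute interval
print(downtime_per_interval(1303088500, 1303088510))
# {1303088400: 10}

# The same 10 second outage straddling an interval boundary
print(downtime_per_interval(1303088695, 1303088705))
# {1303088400: 5, 1303088700: 5}
```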
This is especially pertinent when you analyze data from devices, event data, and workflow and business related elements. For example, what if you compared HTTP response times to Internet network IO rates and Apache click rates on your web servers? What if you threw in trouble tickets and change orders over this same period? What about query rates on your Oracle databases?
Now you can start relating technical metrics to business metrics, and even to cause and effect elements, because they all share a common denominator: time. Many management platforms concentrate on the purely technical, like IO rates of router interfaces, which may not mean as much to a business person. What if your web servers seem to run best when IO rates are between 10 and 15 percent? If that range is where the company performs best, I'd tune to that range. If you change your apps to increase efficiency, you can go back and look at these same metrics after you make the changes and validate the results handily.
This gets even more interesting when you get Netflow data for the intervals as well. But I'll cover that in a later post!