Sunday, April 10, 2011

Data Analytics and Decision Support

OK, the problem is this: given a significant amount of raw performance data, I want to be able to determine whether the metrics are growing or declining, and to get some indication of what will happen beyond the data we currently have, based on the existing trend.

Sounds pretty easy, doesn't it?


When you look at the overall problem, what you're trying to do is to take raw data and turn it into information. In effect, it looks like a Business Intelligence system.  Forrester Research defines Business Intelligence as

"Forrester often defines BI in one of two ways. Typically, we use the following broad definition: Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making. But when using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. Therefore, we also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. We define the narrower BI market as: A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery."


Here is the link to their glossary for reference.


When you compare this definition to what you get in today's performance management systems, you see something of a dichotomy. Most performance management systems are graph- and report-producing products. You get what you get, so it's up to you to evolve toward a Decision Support System rather than a graphical report builder.


Many performance management applications lack the ability to produce and use derivative data sets. They produce graphs and reports from canned data sets, and it is left to the user to sort through the graphs to determine what is going on in the environment.


It should be noted that many products struggle to support extremely large data sets on OLTP-oriented relational databases. As table sizes grow, you start seeing degradation or slowness in the response of CRUD-type transactions. Some vendors resort to batch loading; others use complex tablespace schemes.


Part of the problem with row-oriented databases is that they must conform to the ACID model. Within ACID, a transaction either commits or it doesn't, and you cannot read while a write is occurring without putting consistency in question. In performance management systems there are typically far more writes than reads, and as the data set gets bigger and bigger, the indexes have to grow as well.


Bottom line - if you want to build a Decision Support System around Enterprise Management and the corresponding operational data, you need a database that can scale as the data multiplies through new derivative data sets.


Consider this - a terabyte of raw data may produce more than four times that volume during analysis. You can end up with a significant number of derivative data sets, and storing them in a database lets you use them repeatedly. (Keep in mind that some OLAP databases are very good at data compression.)

Time - The Common Denominator

The one common denominator of performance data, in the graphs and in the data itself, is that it is intrinsically time-series oriented. This provides a good foundation for comparisons. So what we need to do is select Samples In a Context, or SICs. For example, we may need to derive a sample of an hour over the context of a month.
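As a rough sketch, here is one way to read "an hour over the context of a month": characterize each hour of the day across a month of raw readings. The data layout, timestamps, and values are hypothetical; Python is used purely for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw data: (timestamp, value) pairs collected during one month.
raw = [
    (datetime(2011, 4, 1, 9, 5), 120.0),
    (datetime(2011, 4, 1, 9, 35), 135.0),
    (datetime(2011, 4, 2, 9, 10), 110.0),
    # ... more readings ...
]

# Sample = one hour of the day, Context = the whole month (a simple SIC).
def hour_over_month(raw_points):
    buckets = defaultdict(list)
    for ts, value in raw_points:
        buckets[ts.hour].append(value)  # group every reading by hour of day
    # Characterize each hourly sample with its average across the month.
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

print(hour_over_month(raw))  # e.g. {9: 121.66...}
```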

Where a time series is not directly implied, it can be derived. For example, the number of state changes for a given object can be counted per time series interval. Another approach is to derive the state for a given interval from the state at the moment the interval occurred; in that case, you use whatever state the object is in at the timestamp.
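A minimal sketch of the first approach, deriving a time series by counting state changes per interval; the event list and interval length are made up for illustration.

```python
from collections import Counter

# Hypothetical state-change events: (timestamp_minute, new_state) pairs for one object.
events = [(3, "down"), (12, "up"), (47, "down"), (95, "up"), (130, "down")]

INTERVAL = 60  # one-hour time series interval, in minutes

# Derive a time series by counting the state changes that fall in each interval.
changes_per_interval = Counter(ts // INTERVAL for ts, _state in events)
print(dict(changes_per_interval))  # {0: 3, 1: 1, 2: 1}
```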

Given that raw data may be collected at intervals much smaller than the sample period, we need a way to summarize the individual raw elements well enough to characterize the whole sample period. That gets you thinking about what would be interesting to look at. Probably the simplest way to characterize a counter is to sum all of the raw values into the sample (simple addition): if you retrieved counts on 5-minute intervals, you could simply add up the deltas over the course of the 1-hour sample into a derivative data element that gives a total count for the sample. In some instances you could just read the counter at the start and at the end, provided there are no resets or counter rollovers during the sample.
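The simple-addition case in a few lines, with hypothetical 5-minute deltas standing in for real counter data:

```python
# Hypothetical 5-minute counter deltas collected over one hour (12 readings).
deltas_5min = [420, 380, 510, 400, 395, 610, 455, 430, 390, 470, 505, 440]

# Simple addition: roll the raw deltas up into a single one-hour sample total.
hourly_total = sum(deltas_5min)
print(hourly_total)  # 5405
```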

In other instances, you may want to take the deltas of the raw data for the sample and produce an average. That average is useful by itself in many cases, but you can also go back through the raw data and compare each delta to the average to produce an offset, or deviation, from the average. What this deviation really gives you is a metric for recognizing activity within the sample: the more the data bounces above and below the average, the greater the deviations will be.
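A small sketch of that idea, reusing the same hypothetical deltas; summarizing the offsets as a mean absolute deviation is one interpretation of the "activity" metric described here, not the only one.

```python
# Hypothetical 5-minute deltas for one sample period.
deltas = [420, 380, 510, 400, 395, 610, 455, 430, 390, 470, 505, 440]

average = sum(deltas) / len(deltas)                        # the sample average
offsets = [d - average for d in deltas]                    # deviation of each delta from the average
activity = sum(abs(o) for o in offsets) / len(offsets)     # mean absolute deviation as an "activity" score

print(round(average, 1), round(activity, 1))
```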

Another thing to be aware of is data set tainting. For example, let's say you have a period within your context where data was not collected, is not available, or is suspect. That sample can be considered tainted, but in many cases it can still be used. It is sometimes important to compare data collection against other elements such as availability, change management windows, counter resets, or collection "holes" in the data.
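One way to flag collection "holes" while still keeping the sample usable; the timestamps, expected interval, and sample structure below are assumptions for the sake of the sketch.

```python
# Hypothetical timestamps (in minutes) at which raw data actually arrived for a one-hour sample.
collected_at = [0, 5, 10, 15, 30, 35, 40, 45, 50, 55]   # the 20- and 25-minute readings are missing

EXPECTED_INTERVAL = 5
expected = set(range(0, 60, EXPECTED_INTERVAL))
missing = sorted(expected - set(collected_at))

# Mark the sample as tainted if any collection holes exist, but keep it in the data set.
sample = {"total": 4200, "tainted": bool(missing), "missing_intervals": missing}
print(sample)  # {'total': 4200, 'tainted': True, 'missing_intervals': [20, 25]}
```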

What these techniques provide is a tool bag of functions you can use to characterize and compare time series data sets. What if you wanted to understand whether there is any cause-and-effect correlation between metrics on a given device? Some metrics you may find move alike, while others move in opposition - when one increases, the other decreases. The characterization enables you to programmatically determine relationships between metrics.
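One simple way to compare two characterized metrics programmatically is a plain Pearson correlation over their sample series; the metric names and values below are invented, and correlation of course only suggests, rather than proves, cause and effect.

```python
# Hypothetical hourly sample totals for two metrics on the same device.
cpu_busy  = [55, 60, 72, 80, 78, 65, 50]
queue_len = [ 3,  4,  7, 10,  9,  5,  2]

def pearson(xs, ys):
    # Pearson correlation: +1 means the metrics move together, -1 means they move in opposition.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(cpu_busy, queue_len), 2))  # close to +1 for these made-up values
```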

The result of normalizing samples is a set of derivative data sets that can be used in different contexts and correlations. When you think about it, you should even be able to characterize and compare multiple SICs (Samples In Contexts) - for example, comparing day-in-month to week-in-month SICs.

Back to our samples-in-context paradigm. What we need samples for is to look at the contexts over time and watch for growth or decline behaviors. The interesting part is that samples are also what you need when working with Bayesian Belief Networks: a BBN takes a series of samples and produces a ratio of occurrences to anti-occurrences, translated to a simple +1 to -1 scale.
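A full Bayesian Belief Network is beyond a quick example, but the occurrence/anti-occurrence ratio described above can be sketched very simply; the sample values are hypothetical, and this is just the scoring step, not a BBN itself.

```python
# Hypothetical weekly sample totals for one metric.
samples = [980, 1010, 1045, 1030, 1090, 1120, 1115, 1160]

# Count growth occurrences vs. decline anti-occurrences between consecutive samples,
# then scale the result onto the simple +1 (always growing) to -1 (always declining) range.
ups   = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
downs = sum(1 for a, b in zip(samples, samples[1:]) if b < a)
score = (ups - downs) / max(ups + downs, 1)
print(score)  # 5 ups, 2 downs over 7 transitions -> about +0.43
```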

When you look at Operations as a whole, activity and work are accomplished around business hours. Weekdays become important, as do shifts. So it makes sense to align your samples to operational time domains.

The first exercise we can do is take a complex SIC like weekday by week by month. For each weekday, we total up our metric. Then, for each weekday-in-week-in-month sample, is the metric growing or declining? If you create a derivative data set that captures the percent difference from the previous value, you can start to visualize the change across the samples.
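For instance, a percent-difference derivative over Monday totals might look like the sketch below; the totals are made-up numbers for one month.

```python
# Hypothetical Monday totals for each week in a month (a weekday-by-week-by-month SIC).
monday_totals = [12500, 13100, 13800, 13600]

# Derivative data set: percent difference from the previous sample's value.
pct_change = [
    round((curr - prev) / prev * 100, 1)
    for prev, curr in zip(monday_totals, monday_totals[1:])
]
print(pct_change)  # [4.8, 5.3, -1.4] -> mostly growing, with a dip in the last week
```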

When you start down the road of building these data sets, you need to be able to visualize the data in different ways. Graphs, charts, scatterplots, bubble charts, pie charts, and a plethora of other mechanisms provide ways to see and use the data. One of the most common tools BI professionals use is a spreadsheet: you can access the data sources via ODBC and use the tools and functions within the spreadsheet to work through the analysis.
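The same ODBC pull can also be scripted when a spreadsheet gets unwieldy. A minimal sketch using the pyodbc package follows; the DSN, credentials, table, and column names are placeholders, not a real schema.

```python
import pyodbc  # third-party ODBC bridge: pip install pyodbc

# Hypothetical DSN and credentials -- substitute your own performance data source.
conn = pyodbc.connect("DSN=perfdata;UID=report;PWD=secret")
cursor = conn.cursor()

# Pull an hourly sample set straight into Python instead of a spreadsheet.
cursor.execute(
    "SELECT sample_hour, metric_name, sample_total "
    "FROM hourly_samples WHERE sample_day = ?", "2011-04-10"
)
for sample_hour, metric_name, sample_total in cursor.fetchall():
    print(sample_hour, metric_name, sample_total)

conn.close()
```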
