Monday, October 10, 2011

Dante's Inferno in IT...

We’ve all been through Dante’s Inferno. I know I have.

I was at one place where they developed an in-house SNMP agent to be deployed on all managed systems. Because of the variety of operating systems supported and the number of bugs in the software, many customers hated the agent and would request that it be taken off of their systems. But because the agent was “free”, it lived on in a miserable existence. It turned out there were many different versions supported and deployed. Additionally, the design of the agent's sub-agent capabilities deviated significantly from industry standards, which hamstrung the openness of the open source base agent.

Dante’s Inferno came in when you had to deal with the agent capabilities as an Architect. Saying anything negative about the agent was Heresy. The Manager who owned the agent would resort to Anger, Fraud, and Treachery to divert any negative attention from their baby. Part of the reason for hanging on to this agent was that ownership of more developed products fed the Manager’s Gluttony and Greed. It was his silo of management technology.

While I was busy circumnavigating the Machiavellian Urinary Olympics, that Manager was working hard to put me in Limbo. Any requirements that I put forth were immediately thrown into negotiation such that I could never finish them. Finally, in total frustration, I sent out a Final version of the requirements. Doing this sent the Manager into a frenzy of new Machiavellian Urinary Olympics such that my actions were elevated all the way up to a Sr. VP. Alas, I could not overcome the Marijuana Principle of Management (the harder you suck, the higher you get!)

I left shortly afterward. So did several of my coworkers. Some are still there. All with the common experience that we’ve all been through Dante’s Inferno.

Lessons for the Architect:

  • Be careful in calling someone’s baby ugly. Given the “embeddedness” of a given politician, there may be some things you cannot change. 

  • Some Silos can only break down through years of pain and years of continued failure.

  • In moving toward Cloud computing models, some folks may have an inclination to bring with them all of the bad habits they have currently.

  • If a person has only ever seen one place, they may not understand that success looks totally different in other places.

  • There is a direct cost and an indirect cost to supporting internally developed products. If your internally developed product is holding back progress and new business, it is a danger sign…

As an Architect, be wary of consensus. “Where there is no vision, the people perish. (Proverbs 29:18)”

Monday, September 5, 2011

Topology based Correlation

Amazing how many products have attempted just this sort of thing and, for one reason or another, ended up with something a bit more complex than it really needs to be.

Consider this... In an IP network, when you have a connectivity issue, basic network diagnosis mandates a connectivity test like a ping, and if that doesn't work, a traceroute to the end node to see how far you get.

What a traceroute doesn't give you is the interface-by-interface, blow-by-blow detail, or the Layer 2 picture.  If you turn on the verbose flag in ping or traceroute, you will see ICMP HOST_UNREACHABLE control messages. These come from the router that services the IP subnet for the end device when an ARP request goes unanswered.

So, consider this.  When you have a connectivity problem to an end node:

Can you ping the end Node?
If not, can you ping the router for that subnet?
If yes, you have a Layer 2 problem.
If not, you have a layer 3 problem.  Perform a traceroute.
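This decision tree is simple enough to sketch in code. Here is a hedged Python sketch (the blog's examples aren't code; the `ping`/`traceroute` flags assume a Linux box, and the host names are whatever you pass in):

```python
import subprocess

def can_ping(host):
    """True if host answers two ICMP echoes (Linux ping flags assumed)."""
    return subprocess.run(
        ["ping", "-c", "2", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def classify(end_node_ok, router_ok):
    """The decision tree above, as a pure function."""
    if end_node_ok:
        return "ok"        # end node reachable, no problem
    if router_ok:
        return "layer2"    # router answers but end node doesn't: local segment
    return "layer3"        # can't even reach the subnet router: routing problem

def diagnose(end_node, subnet_router):
    verdict = classify(can_ping(end_node), can_ping(subnet_router))
    if verdict == "layer3":
        # traceroute shows how far into the network we actually get
        subprocess.run(["traceroute", end_node])
    return verdict
```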

The traceroute should bottom out where it cannot go further, giving you some indication of where the problem is.

On a layer 2 problem, you need to check the path from the end node, through the switch or switches, on to the Router.

How's that for simplicity?

Sunday, June 12, 2011

Thinking Outside of the outside

Typical of any engineer, I like puzzles - problem sets.  And as with any puzzle, you have to decide what the puzzle looks like when solved. Want to test your resolve and hone your pattern matching skills? Purchase a 1000 piece puzzle and solve it with the picture face down!

I miss the scale and speed of working with problem sets in Global NetOps at AOL.  What if you could experience a live event stream of over 1 million events a minute? How would you deal with the rate?  Log turnover? Parsing and handling these as events?   Handling decisions at this rate?  I tell you, it gives you a whole new perspective.

A couple of requirements we had with all of our applications were:

   Must run 24 By Forever.
   Must Scale

24 By Forever meant that you had control ports in your processes.  You could start, stop, change, failover, failback, reroute... Anything to keep from failing.

Must scale meant that you could start small and get tall without intervention.  Without buffering too much.  Without rendering the information useless.

I started off with a simple Netcool Syslog probe. It was filtered and only passed events every minute, based on a cron job that filtered the log and dumped only the pertinent events to Netcool.  Why? It just wasn't keeping up.  There were a lot of syslog events that were not useful for Netcool.  And yet, it was always behind.  The data was outdated before it even got to Netcool.

Lesson 1 - Delaying Events from presentation can render your information useless.

I built a little process called collatord that ran as a daemon, watched all of the syslog logs, and pattern matched incoming lines for dispatch and formatting to Netcool.   I subsequently moved from the venerable Syslog probe over to TCP port probes, which we had implemented behind load balancers. All I had to do was output my events in a name=value pair manner with a \n\n line termination. (I subsequently found out that, win or lose, it always returned an OK!)

Lesson 2 - Always check returns!

Little collatord had an %ACTIONS hash keyed by regex patterns, with code references to subroutines as values.  When a pattern "hit", it executed the subroutine, passing in the line as an argument. Turns out, it ran pretty quickly.  I was running in a POE kernel with only a single session.  Even at 100K lines a minute, it still skimmed right along!
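The original was Perl under POE, but the dispatch idea translates directly. A minimal Python sketch, with invented patterns and handlers standing in for the real %ACTIONS contents:

```python
import re

# Regex pattern -> handler: the moral equivalent of the %ACTIONS hash.
# These patterns and handlers are invented for illustration.
ACTIONS = {
    re.compile(r"link down"): lambda line: ("LinkDown", line),
    re.compile(r"%SYS-5-RESTART"): lambda line: ("Restart", line),
}

def dispatch(line):
    """Run every handler whose pattern hits the incoming line."""
    results = []
    for pattern, handler in ACTIONS.items():
        if pattern.search(line):
            results.append(handler(line))
    return results
```

Because the hash maps patterns to code references, adding a new event type is just adding one entry - which is what makes the control-port trick later in this post possible.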

One of the problems I had was that I would write a subroutine that parsed and handled a specific pattern for a given event, and the syslog message would change over time. Maybe a field moved from one position to another.  Maybe it was a slightly different format.  I found that if I took a sample line and put it in the subroutine as a comment, the whole subroutine became innately simpler in that I could see what the pattern was before and adapt the new pattern within the subroutine.

Lesson 3 - Take care to make your app reentrant. The better you are at this, the less rewriting code you'll do trying to figure out how to change things.

Now, with these samples, I got the bright idea that I could take any event in the sample, change the time and hostname to protect the innocent, and reinsert it as a Unit test.

Lesson 4 - Having a repeatable Unit test. PRICELESS!

Then I figured out that if I appended the word TRACEMESSAGE to any incoming event, I could profile each and every sample line.  All I had to do was recognize /TRACEMESSAGE$/ and log a microsecond timestamp along with what the function was doing.
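The TRACEMESSAGE trick might look something like this sketch (the marker handling is reconstructed from the description; timing uses a monotonic microsecond clock):

```python
import re
import time

TRACE = re.compile(r"TRACEMESSAGE$")

def handle(line, work):
    """Run work(line); if the line ends in TRACEMESSAGE, log microsecond timings."""
    traced = bool(TRACE.search(line))
    if traced:
        start = time.monotonic()
        print(f"TRACE start {work.__name__}")
    result = work(line)
    if traced:
        elapsed = (time.monotonic() - start) * 1e6
        print(f"TRACE end {work.__name__} after {elapsed:.0f}us")
    return result
```

Normal events pay almost nothing; only lines carrying the marker get timed, which is what makes it safe to use on a live stream.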

Lesson 5 - Being able to profile a specific function is INVALUABLE in a live system where you suspect something weird or intermittent and you don't have to restart it.

After I ran into a couple of pattern / parser problems where I had to schedule downtime, I went and talked to my cohorts in crime. I got to looking at their stuff and found that they could change code on the fly.  They didn't need downtime. (24 by Forever!) I went back to my desk and started putting together a control port.

In the control port, I'd connect to the collatord process via a TCP socket, authenticate, and run commands. In Perl, you can even handle subroutines through an eval.  So, I would pass in a new subroutine, eval it, and put the pattern and Code Reference in the %ACTIONS Hash.
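In Perl this was an eval of a subroutine sent over the authenticated socket. A rough Python analogue (the socket and authentication are left out, and `ACTIONS` stands in for the real dispatch hash):

```python
import re

ACTIONS = {}  # regex -> handler, as in the dispatch loop

def install_handler(pattern, source, func_name):
    """Compile handler source received over the control port and register it.
    WARNING: this executes remote code - only do it after authentication,
    as the original control port did."""
    namespace = {}
    exec(source, namespace)
    ACTIONS[re.compile(pattern)] = namespace[func_name]

# Example: push a new handler without restarting the daemon.
install_handler(
    r"fan failure",
    "def on_fan(line):\n    return 'EnvAlert: ' + line",
    "on_fan",
)
```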

Lesson 6 - Being able to adapt on the fly without downtime. BRILLIANT

Not all languages can be adapted to do this so your mileage may vary!

Sunday, April 17, 2011

Data Analytics and Decision Support - Continued

Data Interpolation

In some performance management systems, nulls or failed polls in your data series draw consternation. If you are missing data polls, the software doesn't seem to know what to do.

I think the important thing is to catch counter rollovers in instances where your data elements are counters.  You want to catch a rollover in such a way that when you calculate the delta, you add up to the maximum counter value, then start adding again up to the new value.  In effect you get a good delta between the two values.  What you do not want is tainted counters that span more than one counter rollover.
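For a 32-bit counter, the rollover-safe delta described above can be sketched as:

```python
def counter_delta(prev, curr, max_count=2**32):
    """Delta between two 32-bit counter samples, assuming at most one rollover.
    If more than one rollover could fit in the poll interval, the sample is
    tainted and should be discarded rather than 'corrected'."""
    if curr >= prev:
        return curr - prev
    # Counter wrapped: count up to the maximum, then up to the new value.
    return (max_count - prev) + curr
```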

Consider taking the delta across a missed poll, as in:

P1                   500
P2                   Missed
P3                   1000

In smoothing, you can take the difference between P3 and P1 and divide it by two to get a delta.  Simply add this to P1 to produce P2, which yields a derived time series of:

P1                    500
P2                    750
P3                    1000

In gauges, you may have to sample the elements around the missed data element and use an average to smooth over it.  Truth be told, I don't see why a graphing engine cannot connect from P1 to P3 without the interpolation!  It would make graphs much simpler and richer - if you don't see a data point in the time slot, guess it didn't happen!  In the smoothing scenario, you cannot tell where the smoothing is.
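The midpoint smoothing above is easy to sketch; here `None` stands for a missed poll:

```python
def smooth(series):
    """Fill isolated missed polls (None) with the midpoint of their neighbors.
    Runs of two or more misses are left alone - smoothing those would hide
    too much, per the caveat above."""
    out = list(series)
    for i in range(1, len(out) - 1):
        if out[i] is None and out[i-1] is not None and out[i+1] is not None:
            out[i] = (out[i-1] + out[i+1]) / 2
    return out
```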


In my own thoughts, I view availability as a discrete event in time or a state over a time period.  Both have valid uses but need to be presented in the context in which they are borne so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:

P1             Good
P2             Good
P3             Fail
P4             Good
P5             Fail
P6             Good
P7             Good
P8             Good
P9             Good
P10           Good

This time series denotes ping failures at the P3 and P5 intervals.  In theory, a line graph is not appropriate here because the data is boolean and discrete, not representative of the entire time period.  If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:

Period        Poller uTime       SysUptime
P0             1303088400      2500000
P1             1303088700      2503000
P2             1303089000      2506000
P3             1303089300      2509000
P4             1303089600      2512000
P5             1303089900      20000
P6             1303090200      50000
P7             1303090500      80000
P8             1303090800      110000
P9             1303091100      5000
P10           1303091400      35000

utime is the number of seconds since January 1, 1970, and as such, it increments every second.  sysUpTime is the number of clock ticks since the management subsystem on the device reinitialized.  When you look hard at the TimeTicks data type, it is a modulo counter of the number of 1/100ths of a second for a given epoch period. This integer will roll over at the value of 4294967296, or approximately every 497 days.

In looking for availability in the time series, derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks are 1/100th of a second). If the sysUpTime value is less than the previous value plus the derived delta in timeticks, you can clearly see where you have an availability issue. Compared to the previous time series using ICMP ping, P3 was meaningless and only P5 managed to ascertain some level of availability discrepancy.

You can also derive from the errant time series that P5 lost 10000 timeticks from the minimum delta value of 30000 (300 seconds * 100). So, for period P5, we were unavailable for 10000 timeticks and not the full period. Also note that if you completely miss a poll or don't get a response, the delta at the next successful poll should tell you whether availability was affected, even though the poll did not go as planned.
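Assuming a 300 second poll cycle and ignoring the 497-day wrap, the sysUpTime check above might be sketched as:

```python
def lost_ticks(prev_uptime, curr_uptime, poll_seconds=300):
    """TimeTicks of downtime implied by two consecutive sysUpTime polls.
    1 tick = 1/100 s, so a clean 300 s interval adds 30000 ticks.
    A counter that went backwards means the agent restarted; the ticks
    missing from the expected delta were spent down."""
    expected = poll_seconds * 100
    if curr_uptime < prev_uptime:          # restart during the interval
        return max(expected - curr_uptime, 0)
    return 0                               # uptime kept pace; no loss detected
```

With the P4-to-P5 values from the table, lost_ticks(2512000, 20000) comes out to 10000 ticks, i.e. 100 seconds of downtime in the interval.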

From a pure time series data kind of conversion, one could translate sysUpTime to an accumulated number of available seconds over the time period.  From a statistics perspective, this makes sense in that it becomes a count or integer.

Discrete to Time Series Interpolation

Let's say you have an event management system, and within this system you receive a significant number of events.  If you can categorize and number your events into instances counted per interval, you can convert this into a counter for a time period.  For example, you received 4 CPU threshold events in 5 minutes for a given node.

In other instances, you may have to convert discrete events using more of a stateful approach. For example, you have a link down and 10 seconds later you get a link up. You have to translate this into non-availability of the link for 10 seconds of a 5 minute interval.  What is really interesting about this is that you have very finite timestamps.  When you do, you can compare them to other elements that may have changed around the same period of time.  Kind of a cause and effect analysis.
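A sketch of that down/up-to-state conversion, clipping an outage to one polling interval (timestamps are plain epoch seconds; the numbers in the test are invented):

```python
def downtime_in_interval(down_ts, up_ts, interval_start, interval_len=300):
    """Seconds of the [down_ts, up_ts) outage falling inside one interval."""
    interval_end = interval_start + interval_len
    overlap = min(up_ts, interval_end) - max(down_ts, interval_start)
    return max(overlap, 0)
```

Run once per interval, this turns a pair of discrete link events into a per-interval unavailability series that lines up with everything else on the time axis.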

This is especially pertinent when you analyze data from devices, event data, and workflow and business related elements. For example, what if you did a comparison between HTTP response times to Internet network IO rates, and Apache click rates on your web servers? What if you threw in trouble tickets and Change orders over this same period?   What about query rates on your Oracle databases?

Now, you can start correlating real metrics with business metrics and even cause and effect elements because they are all done using a common denominator - time. While many management platforms concentrate on the purely technical - like IO rates of router interfaces - it may not mean as much to a business person.   What if your web servers seem to run best when IO rates are between 10 and 15 percent?  If that range is where the company performs best, I'd tune to that range.  If you change your apps to increase efficiency, you can go back, look at these same metrics after the changes, and validate handily.

This gets even more interesting when you get Netflow data for the intervals as well.  But I'll cover that in a later post!

Sunday, April 10, 2011

Data Analytics and Decision Support

OK, the problem is this: given a significant amount of raw performance data, I want to be able to determine some indication, and potentially predictions, of whether the metrics are growing or declining, and what will potentially happen beyond the data we currently have, given the existing trend.

Sounds pretty easy, doesn't it?

When you look at the overall problem, what you're trying to do is take raw data and turn it into information. In effect, it looks like a Business Intelligence system.  Forrester Research defines Business Intelligence as:

"Forrester often defines BI in one of two ways. Typically, we use the following broad definition: Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making. But when using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. Therefore, we also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. We define the narrower BI market as: A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery."

Here is the link to their glossary for reference.

When you look at this definition versus what you get in today's performance management systems, you see somewhat of a dichotomy.  Most performance management systems are graph and report producing products.   You get what you get, so it's up to you to evolve toward a Decision Support System versus a graphical report builder.

Many of the performance management applications lack the ability to produce and use derivative data sets.  They do produce graphs and reports on canned data sets. It is left up to the user to sort through the graphs to determine what is going on in the environment.

It should be noted that many products are hesitant to support extremely large data sets with OLTP-oriented relational databases.  When database table sizes go up, you start seeing degradation or slowness in the response of CRUD-type transactions. Some vendors resort to batch loading.  Others use complex tablespace schemes.

Part of the problem with row-oriented databases is that they must conform to the ACID model. Within the ACID model, a transaction must either commit or not. Also, you cannot read at the same time as a write is occurring, or your consistency is suspect. In performance management systems, there are typically far more writes than reads. And as the data set gets bigger and bigger, the indexes have to grow as well.

Bottom line - If you want to build a Decision Support System around Enterprise Management and the corresponding Operational data, you have to have a database that can scale to the multiplication of data through new derivatives.

Consider this - a Terabyte of raw data may be able to produce more than 4 times that when performing analysis. You could end up with a significant number of derivative data sets and storing them in a database enables you to use them repeatedly. (Keep in mind, some OLAP databases are very good at data compression.)

Time - The Common Denominator

In most graphs and in the data, the one common denominator of performance data is that it is intrinsically time series oriented.  This provides a good foundation for comparisons.  So now, what we need to do is select samples in a context, or SICs.  For example, we may need to derive a sample of an hour over the context of a month.

In instances where a time series is not directly implied, it can be derived.  For example, the number of state changes for a given object can be counted per time series interval.  Another instance may be a derivative of the state for a given time interval, or the state when the interval occurred.  In that example, you use whatever state the object is in at the timestamp.

Given that raw data may be collected at intervals much smaller than the sample period, we must look for a way to summarize the individual raw elements sufficiently to characterize the whole sample period.  This gets one contemplating what would be interesting to look at.  Probably the simplest way to characterize a counter is to sum all of the raw data values into a sample (simple addition). For example, if you retrieved counts at 5 minute intervals, you could simply add up the deltas over the course of the 1 hour sample into a derivative data element, producing a total count for the sample.   In some instances, you could just look at the counter at the start and at the end, provided there are no resets or counter rollovers during the sample.

In other instances, you may need to take the deltas of the raw data for the sample to produce an average.  This average is useful by itself in many cases.  But in others, you could go back through the raw data and compare each delta to the average to produce an offset or deviation from the average.  What this offset or deviation really gives you is a metric for recognizing activity within the sample.  The more up and down off of the average, the greater the deviations will be.
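The sum / average / deviation characterization above can be sketched over one sample period of deltas:

```python
def characterize(deltas):
    """Summarize one sample period's raw deltas: total, average, and mean
    absolute deviation (the 'activity' metric described above)."""
    total = sum(deltas)
    avg = total / len(deltas)
    activity = sum(abs(d - avg) for d in deltas) / len(deltas)
    return total, avg, activity
```

A flat hour and a bursty hour can share the same total and average; the activity number is what tells them apart.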

Another thing you need to be aware of is data set tainting.  For example, let's say you have a period within your context where data is not collected, not available, or you suspect bad data.  The sample could be considered tainted, but in many cases it can still be used.   It is sometimes important to compare data collection to different elements like availability, change management windows, counter resets, or collection "holes" in the data.

What these techniques provide is a tool bag of functions that you can use to characterize and compare time series data sets.  What if you wanted to understand if there were any cause and effect correlation of any metric on a given device.  There are elements you may find that are alike.  Yet there are others that may be opposite from each other.  For example, when one metric increases, the other decreases. The characterization enables you to programmatically determine relationships between metrics.

The result of normalizing samples is a set of derivative data sets that can be used in different contexts and correlations.  When you think about it, one should even be able to characterize and compare multiple SICs (Samples In Contexts).  For example, one could compare day-in-month to week-in-month SICs.

Back to our samples in context paradigm. What we need samples for is to look at the contexts over time and look for growth or decline behaviors.  Samples are particularly interesting because they are what you need when working with Bayesian Belief Networks.  BBNs take a series of samples and produce a ratio of occurrences and anti-occurrences, translated to a simple +1 to -1.

When you look at Operations as a whole, activity and work is accomplished around business hours.  Weekdays become important, as do shifts. So, it makes sense to align your samples to operational time domains.

The first exercise we can do is take a complex SIC like weekday by week by month. For each weekday, we total up our metric. For each weekday-in-week-in-month sample, is the metric growing or declining?  If you create a derivative data set that captures the percent difference from the previous value, you can start to visualize the change over the samples.
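That percent-difference derivative set can be sketched as follows (`None` marks the first sample, which has no predecessor):

```python
def percent_change(series):
    """Derivative series: percent difference from the previous value."""
    return [None] + [
        round(100.0 * (curr - prev) / prev, 2)
        for prev, curr in zip(series, series[1:])
    ]
```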

When you start down the road of building the data sets, you need to be able to visualize the data in different ways.  Graphs, charts, scatterplots, bubble charts, pie charts, and a plethora of other mechanisms provide ways you can see and use the data.  One of the most common tools that BI professionals use is a Spreadsheet. You can access the data sources via ODBC and use the tools and functions within the spreadsheet to work through the analysis.