Sunday, June 12, 2011

Thinking Outside of the Outside

Like any engineer, I like puzzles, also known as problem sets. And as with any puzzle, you have to decide what the puzzle looks like when solved. Want to test your resolve and hone your pattern-matching skills? Purchase a 1000-piece puzzle and solve it with the picture side down!

I miss the scale and speed of working with problem sets in Global NetOps at AOL. What if you could experience a live event stream of over 1 million events a minute? How would you deal with the rate? Log turnover? Parsing and handling these as events? Making decisions at this rate? I tell you, it gives you a whole new perspective.

A couple of requirements we had with all of our applications were:

   Must run 24 By Forever.
   Must Scale

24 By Forever meant that you had control ports in your processes.  You could start, stop, change, failover, failback, reroute... Anything to keep from failing.

Must scale meant that you could start small and get tall without intervention.  Without buffering too much.  Without rendering the information useless.

I started off with a simple Netcool Syslog probe. It only passed events once a minute: a cron job filtered the log and sent only the pertinent events to Netcool. Why? The probe just wasn't keeping up, and a lot of the syslog events were not useful for Netcool anyway. And yet, even filtered, it was always behind. The information was outdated before it even got to Netcool.

Lesson 1 - Delaying Events from presentation can render your information useless.

I built a little process called collatord that ran as a daemon, watched all of the syslog logs, and pattern matched incoming lines for dispatch and formatting to Netcool. I subsequently moved from the venerable Syslog probe over to TCP Port probes, which we had implemented behind load balancers. All I had to do was output my events as name=value pairs with a \n\n termination. (I subsequently found out that, win or lose, the probe always returned an OK!)
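To make the event format concrete, here is a minimal sketch in Python rather than the original Perl. It assumes one name=value pair per line followed by a blank line, which is only one reading of the "\n\n termination" described above, and the field names (Node, Severity, Summary) are made up for illustration.

```python
def format_event(fields):
    """Render a dict as name=value lines, ended by a blank line ("\\n\\n")."""
    lines = [f"{name}={value}" for name, value in fields.items()]
    return "\n".join(lines) + "\n\n"


# Hypothetical field names; a real probe would define its own.
event = format_event({"Node": "router1", "Severity": 4, "Summary": "link down"})
```

Since the probe reportedly answered OK no matter what, a sender built on this would need some other way to confirm delivery than trusting the reply.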

Lesson 2 - Always check returns!

Little collatord had an ACTIONS hash with a regex pattern as each key and a code reference to a subroutine as each value. When a pattern "hit", it executed the subroutine, passing in the line as an argument. Turns out, it ran pretty quickly. I was running in a POE kernel with only a single session. Even with 100K lines a minute, it still skimmed right along!
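The pattern-to-handler table could look something like this in Python (the original was a Perl hash of code references). The patterns and handler names here are invented for the example, and stopping at the first match is an assumption; the original may well have run every matching handler.

```python
import re


def handle_link_down(line):
    # Hypothetical handler: tag the line with an event type.
    return ("link_down", line)


def handle_auth_failure(line):
    return ("auth_failure", line)


# Compiled regex -> handler, mirroring the ACTIONS hash described above.
ACTIONS = {
    re.compile(r"LINK-3-UPDOWN"): handle_link_down,
    re.compile(r"authentication failure"): handle_auth_failure,
}


def dispatch(line):
    """Call the handler of the first matching pattern; None if nothing matches."""
    for pattern, handler in ACTIONS.items():
        if pattern.search(line):
            return handler(line)
    return None
```

Lines that match nothing fall through cheaply, which is part of why a single loop like this can keep up with a busy log.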

One of the problems I had was that I would write a subroutine that parsed and handled a specific pattern for a given event, and then the syslog message would change over time. Maybe a field moved from one position to another. Maybe the format was slightly different. I found that if I took a sample line and put it in the subroutine as a comment, the whole subroutine became much simpler to maintain: I could see what the old pattern was and adapt the subroutine to the new one.

Lesson 3 - Take care to make your app reentrant. The better you are at this, the less rewriting code you'll do trying to figure out how to change things.

Now, with these samples, I got the bright idea that I could take any event in the sample, change the time and hostname to protect the innocent, and reinsert it as a Unit test.
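As a rough Python illustration of both ideas (the sample line kept as a comment, then reused as a test), assuming a made-up sshd message with the time and hostname already scrubbed:

```python
import re

# Sample line this handler was written against (time and hostname changed):
# Jun 12 10:00:00 host1 sshd[1234]: Failed password for root from 192.0.2.10 port 22 ssh2
FAILED_LOGIN = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\ssshd\[\d+\]:\s"
    r"Failed password for (?P<user>\S+) from (?P<addr>\S+)"
)


def parse_failed_login(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = FAILED_LOGIN.match(line)
    return match.groupdict() if match else None


SAMPLE = ("Jun 12 10:00:00 host1 sshd[1234]: "
          "Failed password for root from 192.0.2.10 port 22 ssh2")


def test_sample_line():
    # Feed the scrubbed sample back through the parser as a unit test.
    event = parse_failed_login(SAMPLE)
    assert event is not None
    assert event["host"] == "host1"
    assert event["user"] == "root"
    assert event["addr"] == "192.0.2.10"
```

When the message format shifts, the comment shows what the regex used to expect, and the old sample becomes a regression test for the new version.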

Lesson 4 - Having a repeatable Unit test. PRICELESS!

Then I figured out that if I appended the word TRACEMESSAGE to any incoming event, I could profile each and every sample line. All I had to do was recognize /TRACEMESSAGE$/ and log a microsecond timestamp along with what the function was doing.
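A small Python sketch of the trace marker, under the assumption that the marker is stripped before parsing and that trace records go to an in-memory list (a real daemon would write them to a log). The handler and its step names are hypothetical.

```python
import re
import time

TRACE = re.compile(r"\s*TRACEMESSAGE$")
trace_log = []  # stand-in for a real log file


def trace(enabled, step):
    """Record a microsecond timestamp and a step name when tracing is on."""
    if enabled:
        trace_log.append((time.time_ns() // 1000, step))


def handle(line):
    tracing = bool(TRACE.search(line))
    line = TRACE.sub("", line)  # drop the marker before parsing
    trace(tracing, "received")
    fields = line.split()
    trace(tracing, "parsed")
    return fields
```

Only lines carrying the marker pay the logging cost, so the same code path can be profiled in a live process without a restart.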

Lesson 5 - Being able to profile a specific function is INVALUABLE in a live system where you suspect something weird or intermittent and you don't have to restart it.

After I ran into a couple of pattern / parser problems where I had to schedule downtime, I went and talked to my cohorts in crime. I got to looking at their stuff and found that they could change code on the fly.  They didn't need downtime. (24 by Forever!) I went back to my desk and started putting together a control port.

In the control port, I'd connect to the collatord process via a TCP socket, authenticate, and run commands. In Perl, you can even handle subroutines through an eval.  So, I would pass in a new subroutine, eval it, and put the pattern and Code Reference in the %ACTIONS Hash.
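Python can do something similar with exec. The sketch below leaves out the TCP socket and authentication entirely and only shows the "compile new code, then register it" step; install_handler and the disk-full handler are hypothetical names, not anything from collatord.

```python
import re

ACTIONS = {}  # compiled pattern -> handler, as in the dispatch table


def install_handler(pattern, source, func_name):
    """Compile handler source received at runtime and register it in ACTIONS."""
    namespace = {}
    # exec runs arbitrary code: only accept source from an authenticated operator.
    exec(source, namespace)
    ACTIONS[re.compile(pattern)] = namespace[func_name]


# Text that might arrive over the control port (hypothetical example).
NEW_CODE = '''
def handle_disk_full(line):
    return "disk_full: " + line
'''

install_handler(r"No space left on device", NEW_CODE, "handle_disk_full")
```

Because the table is just data, adding or replacing an entry takes effect on the next incoming line, with no restart.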

Lesson 6 - Being able to adapt on the fly without downtime. BRILLIANT

Not all languages can be adapted to do this, so your mileage may vary!