The other day I was discussing integration work between products and how that takes a lot of scripting, data munging, and transformations to make an ENMS architecture work. In the course of the discussion, I was taken aback a bit when asked about supporting a bunch of scripts. Definitely caught me off guard.
Part of the thing that caught me off guard was that some folks believe that products integrate tightly together without glueware and scripting. In fact, I got the impression that the products they had did enough out of the box for them.
So, why do you script?
To integrate products, tools, and technology. But most of all, INFORMATION. Scripts enable you to plumb data flow from product to product and feature to feature.
Think about CGI tools, forms, and ASP pages. All scripting. Even the JavaScript inside of your HTML that interfaces with Dojo libs on a server... SCRIPT.
Think about the ETL tasks you have that grab data out of one application and fit it or compare it to data sets in other applications.
Think about all of those mundane reports that you do. The configuration data. The performance data. The post mortems you do.
Rules of Thumb
1. There is no such thing as a temporary or disposable script. Scripts begin life as something simple and linear and end up living far longer than one would ever think.
2. There will never be time to document a script after you put it in place. You have to document as you go. In fact, I really like to leave notes and design considerations within the script.
3. You have to assume that sooner or later, someone else will need to maintain your script. You have to document egress and ingress points, expansion capabilities, and integrate in test cases.
4. Assume that portions of your code may be usable by others. Work to make things modular, reusable, extensible, and portable. Probably 70% of all scripting done by System Administrators is done initially by reviewing someone else's code. Given this, you should strive to set the example.
Things I like in Scripting
perldoc -- Perldoc is the stuff. Document your code inside of your own code. Your own module. Your script.
perl -MCPAN -e shell -- Getting modules to perform things: PRICELESS!
Templates. You need to build and use templates when developing code. For each function, sub-routine, code block, or whatever, you need to have documentation, test cases, logging, debugging, and return codes. Ultimately, it leads to much better consistency across the board, and code reviews get gauged around the template AND the functionality.
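As a rough illustration of what I mean (the script name, options, and the load_nodes routine below are just placeholders, not any standard), a template might carry its own POD, debug logging, and explicit return values:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

=head1 NAME

fit_nodes.pl - example template: document, log, and return consistently

=cut

my %opt = ( debug => 0 );
GetOptions( \%opt, 'input=s', 'debug' )
    or die "usage: $0 --input FILE [--debug]\n";
die "usage: $0 --input FILE [--debug]\n" unless $opt{input};

# Each code block gets a doc header, debug logging, and a clear return value.
sub load_nodes {
    my ($file) = @_;
    print STDERR "DEBUG: loading $file\n" if $opt{debug};
    open my $fh, '<', $file
        or do { print STDERR "ERROR: $file: $!\n"; return };
    my @nodes = map { chomp; $_ } <$fh>;
    close $fh;
    return \@nodes;    # undef on failure, arrayref on success
}

my $nodes = load_nodes( $opt{input} ) or exit 1;
print scalar(@$nodes), " nodes loaded\n";

Everything in the list above shows up in one place: POD for perldoc, Getopt::Long for options, STDERR for errors and debug, and a return value the caller can test.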
Control ports - In long running or daemon processes, control ports save your Butt!
getopt is your friend!!!
STDERR is awesome for logging errors.
POE - POE lets you organize your code into callbacks and subroutines around an event loop.
/usr/bin/logger is AWESOME! I have used the LOCAL0 facility as an impromptu message bus as many apps only log to LOCAL1-7.
Data::Dumper -- Nuff said!!!
Date::Manip -- If you are dealing with date and time transformations, Date::Manip is your ace in the hole. It can translate a string like "last week" into to and from date-time stamps and even on to a Unix time value.
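A quick sketch of that, assuming Date::Manip is installed from CPAN (the exact phrase you feed ParseDate is up to you):

use strict;
use warnings;
use Date::Manip;

# Turn a human phrase into concrete timestamps.
my $start = ParseDate('last week');           # whatever "last week" resolves to
my $end   = DateCalc( $start, '+ 1 week' );   # one week later

# Printable date-time stamps plus Unix epoch seconds.
print UnixDate( $start, "from: %Y-%m-%d %H:%M:%S  epoch: %s\n" );
print UnixDate( $end,   "to:   %Y-%m-%d %H:%M:%S  epoch: %s\n" );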
Spreadsheet::WriteExcel -- I love this module! It lets me build Excel spreadsheets on the fly, including multiple sheets, formulas, lookup tables, and even charts and graphs. And using an .xls file extension, most browsers know how to handle them. And EVERYONE knows how to work through a spreadsheet!
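A minimal sketch (the file name, sheet names, and data are made up for illustration):

use strict;
use warnings;
use Spreadsheet::WriteExcel;

my $workbook = Spreadsheet::WriteExcel->new('ifstats.xls');
my $summary  = $workbook->add_worksheet('Summary');
my $raw      = $workbook->add_worksheet('RawData');

my $bold = $workbook->add_format( bold => 1 );

# Raw data rows on one sheet...
$raw->write_row( 0, 0, [ 'Node',    'Errors' ], $bold );
$raw->write_row( 1, 0, [ 'router1', 42 ] );
$raw->write_row( 2, 0, [ 'router2', 17 ] );

# ...and a formula on the summary sheet pulling from it.
$summary->write( 0, 0, 'Total Errors', $bold );
$summary->write( 0, 1, '=SUM(RawData!B2:B3)' );

$workbook->close();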
ENMS products have a lot of scripting capabilities. Check out Impact. HP OO. BMC RBA. Logmatrix NerveCenter. Ionix SMARTs. The list goes on and on.
Bottom line - If you have integration work to do, you will need to script. It could be Perl, shell, Python, or whatever. The products just don't have enough cross-product functionality to fit themselves together out of the box. In fact, there are several products that embrace scripting and scripting capabilities out of the box. Even products within the same product line will require scripting and glueware when you really start using them. After all -> YOU ARE FITTING INFORMATION.
Dougie Stevenson's thoughts, ramblings, and ideas concerning Enterprise Management technology and application.
Monday, October 10, 2011
Dante's Inferno in IT...
We’ve all been through Dante’s Inferno. I know I have.
I was at one place where they developed an in-house SNMP agent to be deployed on all managed systems. Because of the variety of operating systems supported and the number of bugs in the software, many customers hated the agent and would request that it be taken off of their systems. But because the agent was “free”, it kept living a miserable existence. It turns out there were many different versions supported and deployed. Additionally, the design of the agent's sub-agent capabilities deviated significantly from industry standards, which hamstrung the openness of the open source base agent.
Dante’s Inferno came in when you had to deal with the agent capabilities as an Architect. Saying anything negative about the agent was Heresy. The Manager who owned the Agent would resort to Anger, Fraud, and Treachery in order to divert any negative attention away from their baby. Part of the reason for hanging on to this Agent was that the ownership of more developed products promoted the Manager's gluttony and greed. It was his silo of management technology.
While I was busy circumnavigating the Machiavellian Urinary Olympics, that Manager was working hard to put me in Limbo. Any requirements that I put forth were immediately thrown into negotiation such that I could never finish them. Finally, in total frustration, I sent out a Final version of the requirements. Doing this sent the Manager into a frenzy of new Machiavellian Urinary Olympics such that my actions were escalated all the way up to a Sr. VP. Alas, I could not overcome the Marijuana Principle of Management (the harder you suck, the higher you get!).
I left shortly afterward. So did several of my coworkers. Some are still there. All with the common experience that we’ve all been through Dante’s Inferno.
Lessons for the Architect:
- Be careful in calling someone’s baby ugly. Given the “embeddedness” of a given politician, there may be some things you cannot change.
- Some Silos can only break down through years of pain and years of continued failure.
- In moving toward Cloud computing models, some folks may have an inclination to bring with them all of the bad habits they have currently.
- If a person has only ever seen one place, they may not understand that success looks totally different in other places.
- There is a direct cost and an indirect cost to supporting internally developed products. If your internally developed product is holding back progress and new business, it is a danger sign…
As an Architect, be wary of consensus. “Where there is no vision, the people perish. (Proverbs 29:18)”
Monday, September 5, 2011
Topology based Correlation
It's amazing how many products have attempted just this sort of thing and, for one reason or another, ended up with something a bit more complex than it really should be.
Consider this... In an IP network, when you have a connectivity issue, basic network diagnosis mandates a connectivity test like a ping and, if that doesn't work, a traceroute to the end node to see how far you get.
What a traceroute doesn't give you is the interface-by-interface, blow-by-blow detail, or the Layer 2 picture. If you turn on the verbose flag in ping or traceroute, you will see ICMP HOST_UNREACHABLE control messages. These come from the router that services the IP subnet for the end device when an ARP request goes unanswered.
So, consider this: when you have a connectivity problem to an end node:
Can you ping the end Node?
If not, can you ping the router for that subnet?
If yes, you have a Layer 2 problem.
If not, you have a Layer 3 problem. Perform a traceroute.
Within the traceroute, it should bottom out where it cannot go further, giving you some indication of where the problem is.
On a Layer 2 problem, you need to check the path from the end node, through the switch or switches, on to the router.
How's that for simplicity?
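Here's a rough sketch of that decision tree in Perl. The subnet router is passed in as an argument here; in practice it would come out of your topology data, and the ICMP mode of Net::Ping needs root.

use strict;
use warnings;
use Net::Ping;

my ( $node, $gateway ) = @ARGV;          # end node and the router serving its subnet
my $p = Net::Ping->new( 'icmp', 2 );     # ICMP ping, 2 second timeout

if ( $p->ping($node) ) {
    print "$node answers - no connectivity problem\n";
}
elsif ( $p->ping($gateway) ) {
    # The subnet router is reachable but the end node is not: suspect Layer 2.
    print "Layer 2 problem: check the path $node -> switch(es) -> $gateway\n";
}
else {
    # Cannot even reach the subnet router: Layer 3. Let traceroute show how far we get.
    print "Layer 3 problem: tracing the route to $node\n";
    system( 'traceroute', $node );
}
$p->close();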
Sunday, June 12, 2011
Thinking Outside of the outside
Typical of any engineer, I like puzzles - problem sets. And as with any puzzle, you have to decide what the puzzle looks like when solved. Want to test your resolve and hone your pattern matching skills? Purchase a 1000 piece puzzle and solve it with the picture face down!
I miss the scale and speed of working with problem sets in Global NetOps at AOL. What if you could experience a live event stream composed of over 1 million events a minute? How would you deal with the rate? Log turnover? Parsing and handling these as events? Handling decisions at this rate? I tell you, it gives you a whole new perspective.
A couple of requirements we had with all of our applications were:
Must run 24 By Forever.
Must Scale
24 By Forever meant that you had control ports in your processes. You could start, stop, change, failover, failback, reroute... Anything to keep from failing.
Must scale meant that you could start small and get tall without intervention. Without buffering too much. Without rendering the information useless.
I started off with a simple Netcool Syslog probe. It was filtered and only passed events every minute, based upon a cron job that filtered the log and dumped only the pertinent events to Netcool. Why? It just wasn't keeping up, and there were a lot of syslog events that were not useful for Netcool. And yet, it was always behind. It was outdated before it even got to Netcool.
Lesson 1 - Delaying Events from presentation can render your information useless.
I built a little process called collatord that ran as a daemon, watched all of the syslog logs, and pattern matched incoming lines for dispatch and formatting to Netcool. I subsequently moved from the venerable Syslog probe over to TCP Port probes, which we had implemented behind load balancers. All I had to do was output my events in a name=value pair format with a \n\n line termination. (I subsequently found out that, win or lose, it always returned an OK!)
Lesson 2 - Always check returns!
Little collatord had an ACTIONS hash that was made up of regex patterns as keys and code references to subroutines. When a pattern "hit", it executed the subroutine, passing in the line as an argument. Turns out, it ran pretty quickly. I was running in a POE kernel with only a single session. Even with 100K lines a minute, it still skimmed right along!
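Stripped of the POE plumbing, the dispatch idea was roughly this. The patterns and the send_event() helper below are invented for illustration, not the original code:

use strict;
use warnings;

# Regex pattern (string) => code reference; a matching line gets handed to its handler.
# send_event() stands in for whatever pushes the formatted event on to Netcool.
my %ACTIONS = (
    'LINK-3-UPDOWN' => sub {
        my ($line) = @_;
        send_event( class => 'link',   summary => $line );
    },
    'SYS-5-RESTART' => sub {
        my ($line) = @_;
        send_event( class => 'reboot', summary => $line );
    },
);

while ( my $line = <STDIN> ) {
    chomp $line;
    for my $pattern ( keys %ACTIONS ) {
        $ACTIONS{$pattern}->($line) if $line =~ /$pattern/;
    }
}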
One of the problems I had was that I would write a subroutine that parsed and handled a specific pattern for a given event, and the syslog message would change over time. Maybe it was a field that moved from one position to another. Maybe it was a slightly different format. I found that if I took a sample line and put it in the subroutine as a comment, my whole subroutine became innately simpler in that I could see what the pattern was before and adapt the new pattern within the subroutine.
Lesson 3 - Take care to make your app reentrant. The better you are at this, the less rewriting code you'll do trying to figure out how to change things.
Now, with these samples, I got the bright idea that I could take any event in the sample, change the time and hostname to protect the innocent, and reinsert it as a Unit test.
Lesson 4 - Having a repeatable Unit test. PRICELESS!
Then I figured out that if I appended the word TRACEMESSAGE to any incoming event, I could profile each and every sample line. All I had to do was recognize /TRACEMESSAGE$/ and log a microsecond timestamp along with what the function was doing.
Lesson 5 - Being able to profile a specific function is INVALUABLE in a live system where you suspect something weird or intermittent and you don't have to restart it.
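Inside a handler, the profiling hook amounted to something like this (sketched from memory; the log format and sub names are arbitrary):

use Time::HiRes qw(gettimeofday);

sub handle_line {
    my ($line) = @_;
    my $trace = $line =~ /TRACEMESSAGE$/;    # test events carry the marker
    _trace('parse start') if $trace;
    # ... normal parsing and dispatch happens here ...
    _trace('parse done')  if $trace;
}

sub _trace {
    my ($what) = @_;
    my ( $sec, $usec ) = gettimeofday();     # microsecond timestamp
    print STDERR "TRACE $sec.$usec $what\n";
}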
After I ran into a couple of pattern / parser problems where I had to schedule downtime, I went and talked to my cohorts in crime. I got to looking at their stuff and found that they could change code on the fly. They didn't need downtime. (24 by Forever!) I went back to my desk and started putting together a control port.
In the control port, I'd connect to the collatord process via a TCP socket, authenticate, and run commands. In Perl, you can even handle subroutines through an eval. So, I would pass in a new subroutine, eval it, and put the pattern and Code Reference in the %ACTIONS Hash.
Lesson 6 - Being able to adapt on the fly without downtime. BRILLIANT
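A bare-bones sketch of such a control port is below. The ADD command syntax is made up for illustration, and no authentication is shown, which you would absolutely want in real life:

use strict;
use warnings;
use IO::Socket::INET;

our %ACTIONS;    # the same dispatch table the main loop matches against

my $server = IO::Socket::INET->new(
    LocalPort => 7777,    # control port (arbitrary choice here)
    Listen    => 5,
    Reuse     => 1,
) or die "control port: $!";

# One command per connection: "ADD <pattern> sub { ... }"
while ( my $client = $server->accept() ) {
    my $request = do { local $/; <$client> };    # read until the client closes its side
    if ( $request =~ /^ADD\s+(\S+)\s+(sub\s*\{.*\})\s*$/s ) {
        my ( $pattern, $code_text ) = ( $1, $2 );
        my $handler = eval $code_text;           # compile the new handler on the fly
        if ( ref $handler eq 'CODE' ) {
            $ACTIONS{$pattern} = $handler;       # live, no restart
            print {$client} "OK $pattern\n";
        }
        else {
            print {$client} "ERR $@\n";
        }
    }
    close $client;
}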
Not all languages can be adapted to do this so your mileage may vary!
Sunday, April 17, 2011
Data Analytics and Decision Support - Continued
Data Interpolation
In some performance management systems, nulls or failed polls in your data series draw consternation. If you are missing data polls, the software doesn't seem to know what to do.
I think the important thing is to catch counter rollovers in instances where your data elements are counters. You want to catch a rollover in such a way that when you calculate the delta, you add up to the maximum counter value, then start adding again from zero up to the new value. In effect, you get a good delta between the two values. What you do not want is tainted counters that span more than one rollover.
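A sketch of a rollover-safe delta for a 32-bit counter:

# Delta between two Counter32 samples, allowing for a single rollover.
sub counter_delta {
    my ( $prev, $curr ) = @_;
    my $max = 2**32;                    # Counter32 wraps at 4294967296
    return $curr >= $prev
        ? $curr - $prev                 # normal case
        : ( $max - $prev ) + $curr;     # count up to the wrap, then on to $curr
}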
Consider taking the delta value across a missed poll, as in:
P1 500
P2 Missed
P3 1000
In smoothing, you can take the difference between P3 and P1 and divide it by two as a delta. Simply add this to P1 to produce P2, which produces a derived time series of:
P1 500
P2 750
P3 1000
For gauges, you may have to sample the elements around the missed data element and use an average to smooth over it. Truth be known, I don't see why a graphing engine cannot connect from P1 to P3 without the interpolation! It would make graphs much simpler and richer - if you don't see a data point in the time slot, guess it didn't happen! In the smoothing scenario, you cannot tell where the smoothing is.
Availability
In my own thoughts, I view availability as either a discrete event in time or a state over a time period. Both have valid uses but need to be presented in the context from which they are born so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:
P1 Good
P2 Good
P3 Fail
P4 Good
P5 Fail
P6 Good
P7 Good
P8 Good
P9 Good
P10 Good
This time series denotes ping failures at the P3 and P5 intervals. In theory, a line graph is not appropriate here because the data is boolean and discrete, and not representative of the entire time period. If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:
Period  Poller utime  sysUpTime
P0 1303088400 2500000
P1 1303088700 2503000
P2 1303089000 2506000
P3 1303089300 2509000
P4 1303089600 2512000
P5 1303089900 20000
P6 1303090200 50000
P7 1303090500 80000
P8 1303090800 110000
P9 1303091100 5000
P10 1303091400 35000
utime is the number of seconds since January 1, 1970, and as such, it increments every second. sysUpTime is the number of clock ticks since the management system reinitialized. When you look hard at the TimeTicks data type, it is a modulo-2^32 counter of the number of 1/100ths of a second for a given epoch period. This integer will roll over at the value of 4294967296, or approximately every 49.7 days.
In looking for availability in the time series, derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks are 1/100ths of a second). If the sysUpTime value is less than the previous value plus that derived delta in timeticks, you can clearly see where you have an availability issue. Compared to the previous time series using ICMP ping, P3 was meaningless, and only P5 managed to ascertain some level of availability discrepancy.
You can also derive from the errant time series period that P5 lost 10000 timeticks from the minimum delta value of 30000 (300 seconds * 100). So, for period P5, we were unavailable for 10000 timeticks rather than the full period. Also note that if you completely miss a poll or don't get a response, the delta from the last successful poll could still tell you whether availability was affected, even though the poll did not go as planned.
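That check is only a few lines of code. A sketch, ignoring rollover of sysUpTime itself:

# Given consecutive polls (utime in seconds, sysUpTime in 1/100s ticks),
# flag intervals where the agent re-initialized and estimate the lost ticks.
sub check_interval {
    my ( $prev_utime, $prev_ticks, $curr_utime, $curr_ticks ) = @_;
    my $expected = ( $curr_utime - $prev_utime ) * 100;    # 300s poll -> 30000 ticks
    if ( $curr_ticks < $prev_ticks ) {
        my $lost = $expected - $curr_ticks;                # ticks the box was not up
        return "reinitialized: ~$lost of $expected ticks unavailable";
    }
    return "up for the full interval";
}

# P4 -> P5 from the table above: 2512000 ticks, then 20000 ticks -> 10000 lost
print check_interval( 1303089600, 2512000, 1303089900, 20000 ), "\n";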
From a pure time series data kind of conversion, one could translate sysUpTime to an accumulated number of available seconds over the time period. From a statistics perspective, this makes sense in that it becomes a count or integer.
Discrete to Time Series Interpolation
Let's say you have an event management system, and within this system you receive a significant number of events. If you can categorize and number your events into instances counted per interval, you can convert this into a counter for a time period. For example, you received 4 CPU threshold events in 5 minutes for a given node.
In other instances, you may have to convert discrete events into more of a stateful approach. For example, you have a link down and, 10 seconds later, you get a link up. You have to translate this to a non-availability of the link for 10 seconds of a 5-minute interval. What is really interesting about this is that you have very finite time stamps. When you do, you can use them to compare to other elements that may have changed around the same period of time. Kind of a cause and effect analysis.
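As a sketch, both conversions reduce to bucketing by interval. The 300-second interval is an assumption, and an outage that spans buckets is naively charged to the bucket where it started:

my %count_per_interval;       # e.g. CPU threshold events counted per 5-minute bucket
my %downtime_per_interval;    # link outage seconds attributed to a bucket

sub bucket { return int( $_[0] / 300 ) * 300 }    # align a timestamp to its interval

# Discrete event -> counter per interval
sub count_event {
    my ($ts) = @_;
    $count_per_interval{ bucket($ts) }++;
}

# Link down at $down_ts, link up at $up_ts -> unavailability within the interval
sub record_outage {
    my ( $down_ts, $up_ts ) = @_;
    $downtime_per_interval{ bucket($down_ts) } += $up_ts - $down_ts;
}

count_event(1303089610);                    # one CPU threshold event
record_outage( 1303089600, 1303089610 );    # link down for 10 seconds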
This is especially pertinent when you analyze data from devices, event data, and workflow and business related elements. For example, what if you did a comparison between HTTP response times, Internet network IO rates, and Apache click rates on your web servers? What if you threw in trouble tickets and change orders over this same period? What about query rates on your Oracle databases?
Now, you can start tying technical metrics to business metrics and even cause and effect elements, because they are all expressed in a common denominator - time. While many management platforms concentrate on the purely technical - like IO rates of router interfaces - it may not mean as much to a business person. What if your web servers seem to run best when IO rates are between 10-15 percent? If that range is where the company performs best, I'd tune to that range. If you change your apps to increase efficiency, you can go back and look at these same metrics after you make changes and validate handily.
This gets even more interesting when you get Netflow data for the intervals as well. But I'll cover that in a later post!
Sunday, April 10, 2011
Data Analytics and Decision Support
OK, the problem is that, given a significant amount of raw performance data, I want to be able to determine some indication, and potentially predictions, of whether the metrics are growing or declining and what will potentially happen beyond the data we currently have, given the existing trend.
Sounds pretty easy, doesn't it?
When you look at the overall problem, what you're trying to do is to take raw data and turn it into information. In effect, it looks like a Business Intelligence system. Forrester Research defines Business Intelligence as:
"Forrester often defines BI in one of two ways. Typically, we use the following broad definition: Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making. But when using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. Therefore, we also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. We define the narrower BI market as: A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery."
Here is the link to their glossary for reference.
When you look at this definition versus what you get in today's performance management systems, you see somewhat of a dichotomy. Most of the performance management systems are graph and report producing products. You get what you get, and so it's up to you to evolve toward a Decision Support System versus a graphical report builder.
Many of the performance management applications lack the ability to produce and use derivative data sets. They do produce graphs and reports on canned data sets. It is left up to the user to sort through the graphs to determine what is going on in the environment.
It should be noted that many products are hesitant to support extremely large data sets with OLTP oriented Relational Databases. When the database table sizes go up, you start seeing degradation or slowness in the response of CRUD type transactions. Some vendors resort to batch loading. Others use complex table space schemes.
Part of the problem with row oriented databases is that they must conform to the ACID model. Within the ACID model, a transaction must go or not go. Also, you cannot read at the same time as a write is occurring or your consistency is suspect. In performance management systems, there are typically a lot more writes than reads. Also, as the data set gets bigger and bigger, the indexes have to grow as well.
Bottom line - If you want to build a Decision Support System around Enterprise Management and the corresponding Operational data, you have to have a database that can scale to the multiplication of data through new derivatives.
Consider this - a Terabyte of raw data may be able to produce more than 4 times that when performing analysis. You could end up with a significant number of derivative data sets and storing them in a database enables you to use them repeatedly. (Keep in mind, some OLAP databases are very good at data compression.)
Time - The Common Denominator
In most graphs and in the data, the one common denominator about performance data is that it is intrinsically time series oriented. This provides a good foundation for comparisons. So now, what we need to do is to select samples in a context or SICs. For example, we may need to derive a sample of an hour over the context of a month.
In instances where a time series is not directly implied, it can be derived. For example, the number of state changes for a given object can be counted for the time series interval. Another instance may be a derivative of the state for a given time interval, or the state when the interval occurred. In this example, you use whatever state the object is in at the time stamp.
Given that raw data may be collected at intervals much smaller than the sample period, we must look for a way to summarize the individual raw elements sufficiently to characterize the whole sample period. This gets one contemplating what would be interesting to look at. Probably the simplest way to characterize a counter is to sum all of the raw data values into a sample (simple addition). For example, if you retrieved counts on 5 minute intervals, you could simply add up the deltas over the course of the 1 hour sample into a derivative data element, producing a total count for the sample. In some instances, you could just look at the counter at the start and at the end, if there are no resets or counter rollovers during the sample.
In other instances, you may see the need to take the deltas of the raw data for the sample and produce an average. This average is useful by itself in many cases. But in others, you could go back through the raw data and compare each delta to the average to produce an offset or deviation from the average. What this offset or deviation really does is produce a metric that recognizes activity within the sample. The more up and down off of the average, the greater the deviations will be.
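For a one-hour sample built from 5-minute deltas, that characterization boils down to a few derived values. A sketch, using mean absolute deviation as the "activity" measure:

# @deltas: the twelve 5-minute deltas that make up a 1-hour sample
sub characterize_sample {
    my (@deltas) = @_;
    my $total = 0;
    $total += $_ for @deltas;               # simple addition -> sample total
    my $avg = $total / scalar(@deltas);     # average delta for the sample

    my $dev = 0;
    $dev += abs( $_ - $avg ) for @deltas;   # offset from the average
    $dev /= scalar(@deltas);                # mean deviation = activity within the sample

    return ( $total, $avg, $dev );
}

my ( $sum, $avg, $activity ) = characterize_sample(
    100, 120, 90, 400, 110, 95, 105, 130, 85, 115, 98, 102
);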
Another thing you need to be aware of is data set tainting. For example, let's say you have a period within your context where data is not collected, not available, or suspect. This sample could be considered tainted, but in many cases it can still be used. It is important sometimes to compare data collection to different elements like availability, change management windows, counter resets, or collection "holes" in the data.
What these techniques provide is a tool bag of functions that you can use to characterize and compare time series data sets. What if you wanted to understand whether there were any cause and effect correlation between metrics on a given device? There are elements you may find that are alike. Yet there are others that may be opposite from each other. For example, when one metric increases, the other decreases. The characterization enables you to programmatically determine relationships between metrics.
The result of normalizing samples is to produce derivative data sets so that those can be used in different contexts and correlations. When you think about it, one should even be able to characterize and compare multiple SICs (Samples in Contexts). For example, one could compare day-in-month to week-in-month SICs.
Back to our samples in context paradigm. What we need samples for is to look at the contexts over time and look for growth or decline behaviors. The interesting part is the samples, in that samples are what you need when applying Bayesian Belief Networks. BBNs take a series of samples and produce a ratio of occurrences and anti-occurrences, translated to a simple +1 to -1.
When you look at Operations as a whole, activity and work are accomplished around business hours. Weekdays become important, as do shifts. So, it makes sense to align your samples to Operational time domains.
The first exercise we can do is to take a complex SIC like weekday by week by month. For each weekday, we need to total up our metric. For each weekday-in-week-in-month sample, is the metric growing or declining? If you create a derivative data set that captures the percent difference from the previous value, you can start to visualize the change over the samples.
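A sketch of that derivative data set (percent change from the previous sample; the input numbers are made up):

# Turn a series of weekday totals into percent-change-from-previous values.
sub percent_change {
    my (@totals) = @_;
    my @pct;
    for my $i ( 1 .. $#totals ) {
        push @pct, $totals[ $i - 1 ]
            ? sprintf( '%.1f', 100 * ( $totals[$i] - $totals[ $i - 1 ] ) / $totals[ $i - 1 ] )
            : undef;    # previous value was zero; leave the point undefined
    }
    return @pct;
}

# e.g. Monday totals across four weeks -> 5.0, 4.0, 14.5
print join( ', ', percent_change( 1200, 1260, 1310, 1500 ) ), "\n";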
When you start down the road of building the data sets, you need to be able to visualize the data in different ways. Graphs, charts, scatterplots, bubble charts, pie charts, and a plethora of other mechanisms provide ways you can see and use the data. One of the most common tools that BI professionals use is a Spreadsheet. You can access the data sources via ODBC and use the tools and functions within the spreadsheet to work through the analysis.
Saturday, November 27, 2010
EM7 Dynamic App Development
Dynamic Application development in EM7 is easy and a lot of fun.
A Dynamic Application in EM7 is used to gather statistics, configuration data, performance data, set thresholds, and setup alerts and events.
In the past, I've done a lot of SNMP based applications where I would have to take SNMP walks and the MIBs and put together polling, thresholds, and triggers for events. I'd download the MIB, suck it into Excel, and start analyzing the data. A lot of time would be spent parsing MIBs, verifying tables, and looking at counters.
Enter EM7.
Log in and go to System->Tools->MIB Compiler.
It provides a list of all MIBs it knows about.
From this, you can see if the MIB you're looking for is there and compiled. If it's not, you can hit the import button up on the top right portion of the window and it will prompt you to upload a new MIB file. (Pretty friggin straightforward if I do say so myself!)
To compile the MIB, hit the lightning icon on the MIB entry. Simple as that.
So, let's say you want to create a dynamic application around some configuration items associated with a couple of MIB tables in a MIB. For our example, we will use the CISCO-CCME-MIB, applicable to the Cisco Call Manager Express systems. You can view the MIB using the I icon, download or export the MIB via the green disk icon on the right side of the entry, or compile it using the lightning bolt. I'm going to view the MIB, so I select the I icon on the row for the MIB I want.
I hit the icon and it pulls up a file edit window. I scroll through to find an object I'd like to work with as an application. I highlight the object and copy it to my clipboard.
I go to System->Tools->OID Browser, select "Where name is like" in the search function, and paste in the object I was looking at in the MIB. This pulls up a hierarchy of the OIDs and MIB objects below the object I input. All in all, we're into the second minute if you're a bit slow on the keyboard like me!

Notice on the top of the selection box is the MIB object path. This is handy to visualize where you are as you start to work with MIB objects.
So, the first application I want to put together is a configuration type of application centered around parameters that are part of the Music on Hold system function. I select the MIB elements I want to work with in my configuration application by clicking on the box for each parameter.
On the bottom right corner of this window is a selection box with a default value of [Select Action]. Pull down the menu and select Add to a new Configuration Application. Select the go button directly to its right.
You are then prompted if this is really what you want to do. Select Yes.
What this does is take the MIB objects and create a dynamic application with your collection objects already set up. Here is what it looks like.
If you do nothing else, go to the Properties tab, input an application Name, then save it. You have just created your first dynamic application in its basic sense of form and function.
This is what the Properties screen looks like:
Overall, you still have some work to do in that you probably need to clean up the names of the objects so that they show up better on reports, set up polling intervals, and set up any thresholds, alerts, or events as needed. But you have an application in a few minutes. And it can cross multiple tables if you like! CLICK CLICK BOOM!
ENMS Projects
I have been looking over why a significant number of enterprise management implementations fail. And not only do they fail in implementation, they continue to live in a failed state year over year. Someone once said that 90% or greater of all enterprise management implementations fail.
I cannot help but think that part of this may have something to do with big projects and evolution during implementation.
Another consideration I have seen is that lower level managers tend to manage to the green and are afraid of having to defend issues in their department or their products. When they manage to the green, management above them loses awareness of how things are really working beneath them in the infrastructure.
One article I really liked was by Frank Hayes called "Big Projects, Done Small" in a recent Computerworld issue. Here is the link to it. I like his way of thinking in that big projects need to be sliced and diced into smaller pieces in order to facilitate success more readily.
Those of us in technical roles tend to operate and function in a 90-days-or-less mode. We know from experience that if a project exceeds 90 days, the "Dodge Syndrome" will rear its ugly head. (The rules have changed!) In actuality, requirements tend to change and evolve around an implementation if the implementation takes over 90 days.
The second thing you realize is that in projects that span over 90 days, mid-level Managers tend to get nervous and lose faith in the project. Once this happens, you start seeing the Manager retract support and resources for the project. The representatives don't show up to meetings.
But Tier 1 managers like the big, high-cost projects as they are a way of keeping score. They like the big price tag and they like the enormous scope of global projects.
It is the job of Architects to align projects into more manageable chunks for implementation and integration. They need to know and plan the budget for these projects so that you get what needs to be done, in the proper sequence, with the proper hooks, and with the proper resources. If this is left to Project Managers, Contractors, or Development staff, you risk the project becoming tactical in nature and losing sight of strategic goals.
When a project becomes tactical, the tasks become a cut list that has to happen every week. When something changes or impacts the schedule, tactical decisions are made to recover. For example, a decision to modify a configuration may be made today that kills or hampers an integration that needs to be done in a few months. These tactical decisions may save some time and resources today but yield a much larger need later on.
It is the equivalent of painting yourself into a corner.
Herein lie the situations that are oh so common:
1) Tier 1 Management wants to show the shareholders and Board of Directors that they are leading the company in a new direction.
2) Mid-level tiers of management can choose to use this project to promote their own empires if possible.
3) Mid-Level managers can choose just to accept the big project and support it as needed.
4) Architects need to be keenly aware and on top of the project as a whole. They need to set, establish, and control direction.
5) End users need to be empowered to use the product to solve problems and work with Architecture to get their requirements in.
Conclusion
A big project can be accomplished and can succeed. But the strategic direction needs to be set with a vision for both tactical and strategic goals alignment. Things need to be broken up by goals and objectives and costed accordingly. Simplify - simplify - simplify.
Tuesday, October 12, 2010
The Case for EM7
I have been monitoring a lot of activity around various products that I have had my head wrapped around for years. See, I have been a Consultant, Engineer, end user, OSS Developer/Architect and even Chief Technologist. I've developed apps and I've done a TON of integration across the years and problem sets. I have worked with, implemented, and supported a ton of different products and functions.
I think about all of the engagements I worked on where we spent 4-6 weeks just getting the requirements doc together and starting down the road of doing a BOM. Then, after somebody approved the design, we'd get down to ordering hardware and software.
Then, we'd go about bringing up the systems, loading the software, and tuning the systems.
Before you knew it, we were 6 months on site.
This has been standard operations for EONS. For as long as I can remember.
While the boom was on, this was acceptable. Business needed capabilities and that came from tools and technology. But now, money is tight and times are hard. However, companies still need IT capabilities that are managed and reliable.
Nowadays, customers want results. Not in 6 months. Not even next month. NOW! When you look at the prospects of Cloud, what attracts business and IT alike is that in a cloud, you can deploy applications quickly!
Enter EM7. What if I told you I could bring a couple of appliances (well, depends on the size of your environment...), rack and stack, and have you seeing results right away? Events. Performance. Ticketing. Discovery. Reports.
EM7 lives in this Rapid Deployment, lots of capabilities up front sort of scenario. Literally, in a few hours, you have a management platform.
Question: How long does it take for you to get changes into your network management platform? Days? Weeks?
Question: How long does it take for you to get a new report in place on your existing system?
Question: Are you dependent upon receiving traps and events only as your interface into receiving problem triggers? Do you even perform polls?
Question: How many FTEs are used just to maintain the monitoring system? How many Consultants with that?
Now, the question is WHY should you check out EM7?
First and foremost, you want what's best for your environment. The term "Best" changes over time. And while many companies will tell you best of breed is dead and that you need a single throat to choke, it's because they KNOW they are not Best of Breed and they HATE competition. But competition is good for customers. It keeps value in perspective.
Without a bit of competition, you don't know what's best. Your environment has been evolving, and so have the management capabilities and platforms (well, some have!).
In some environments, politics keep new products and capabilities at bay. You can't find capital or you know you'll step on toes if you propose new products. So, you need to have some lab time. With EM7, you spend more time working with devices and capabilities in your lab than you do installing, administering, and integrating the management platform. EM7 is significantly faster to USE than others.
EM7 has a lot of the management applications you need to manage an IT / Network Infrastructure, in an APPLIANCE model. All of the sizing, OS tuning, Database architecture, etc. is done up front in definable methods. We do this part and you get to spend your time adding value to your environment instead of the management platform.
In a politically charged environment, EM7 can be installed, up and running, and adding SIGNIFICANT value before the fiefdom owners even know. EM7 quickly becomes the standard in a buy-vs.-build scenario because there are a lot of capabilities up front. But there are also feature-rich ways to extend management and reporting beyond the stock product. Dynamic applications enable the same developers to build upon the EM7 foundation rather than attempting to redevelop the foundation first and the advanced capabilities later.
If you're adapting and integrating Cloud capabilities, EM7 makes sense early on. Lots of both agent and agentless capabilities, service polling, mapping and dashboards, run book automation, ticketing, dynamic applications... the list goes on and on. Cloud and virtualization software and APIs can change very quickly. Standards are not always first. So, you have to be able to adapt and overcome challenges quickly.
If you're in the managed services world, you need to be vigilant in your quest for capabilities that turn into revenue. If you can generate revenue with a solution, it is entirely possible to make more than it costs. Do you know how much revenue is generated by your management platform? Now compare that to what it costs in maintenance and support.
Now, go back and look at the things you've done to generate and stand up a new product offering. How long did it take from concept to offering? How much did it cost, resource-wise?
In these times, when margins are lean and business is very competitive, if you aren't moving forward to compete, you're probably facing losing market share.
With EM7, you get to COMPETE. You get a solid foundation with which to discriminate and differentiate your services from your competition. Quickly. Efficiently.
Sunday, August 15, 2010
Save the Village ID ten T
Ever been in one of these situations?
So, you have this Manager you work with. Seems that every time you have to interact with them, it becomes a painful ordeal. See, they are the Political type. You know, the type that destroys the creativity of organizations.
When you deal with them, all of a sudden your world is in Black and white... Not color. It's a 6th Sense moment. "I see Stupid People. And they don't even know they're stupid!".
You ask a technical question and get the trout look. You know - Fresh off the plate! This person no more understands what the F you're talking about than the man in the moon. Nor do they want to understand. They have tuned you out. In an effort to shut you up, they proceed up the food chain to suck up to every manager in an attempt to put more red tape and BS in front of you.
Well, you figure that because you are obviously not getting your technical points across, you do memos and white papers explaining the technical challenges. Then, you're too condescending and academic.
You do requirements and everything becomes a new negotiation. You can get nothing done. To this person, it doesn't matter if you have valid requirements or not. What matters most is their own personal interest and growing their own empire.
It's Managers like this that hold back Corporations. They kill productivity and teamwork and they can single-handedly poison your Development and integration teams. They have no idea the destruction they cause. The negative effects they have on folks. The creativity they squash.
They foster Analysis Paralysis, lack of evolution, and the tragic killing of products and solutions. These types of folks tend to latch on to products or functions and never release control, to the point that it kills functionality and evolution beyond the basics. Everything is about the bare minimums as you don't have to work at it much. You can float, do what you can, and go on. So what if you never ever deliver new capabilities UNLESS it is POLITICALLY ADVANCING to do so.
How do you save someone like this from themselves? They refuse to listen. They are already Lunar Lander personalities - "Tranquility Base here! The EGO has landed!". They don't even have an inkling of what they don't know or understand. Yet they continue to drive out innovation.
As a Manager of one of these personality types, it becomes dangerous for you. You either subscribe to the lineage of politics or they start politically going around you. So the deceit, lies, and ignorance perpetuate. And the company continues to suffer. It takes strong leadership to either control the politics or make a decision to oust the impediments to progress. Both take intestinal fortitude, leadership, and a lot of paperwork.
In summary, I wish I could reach these folks. Have them understand the creativity and people's careers they kill. All in the name of self-preservation. Have them understand that a little common sense would open their whole world.
In my career, I have been in this situation 4-5 times. A situation where the only way to be successful is to compromise your principles and ethics and start sucking up. My heart goes out to my friends that endure the torture of mindless decisions, the lack of experience and unwillingness to understand, and the unending political positioning and rambling rather than doing the right thing.
I'm so elated to be away from this. Words cannot do justice to how elated and ecstatic I am to be away from the mindlessness.
Event Management and Java programmers
Question: Why is it a good idea to hire a Java programmer over someone with direct Netcool experience for a Netcool specific job?
I recently heard about an open position for a Netcool specialist where they opened up the position to be more of a Java programmer slot. In part, I think it's a lack of understanding of what it takes, domain-knowledge-wise, to support Netcool successfully.
Most Java programmers want to do some SOA sort of thing, Jakarta Struts, the Spring Framework, Tomcat/JBoss, or some portal rehash. Most of these technologies have little to do with Netcool and rules. In fact, the lack of domain knowledge around the behavior of network and systems devices in specific failure-induced conditions may limit a true Java programmer's ability to be successful without significant training and exposure over time.
Here is the ugly truth:
1. Not every problem can be adequately or most effectively solved with Object-Oriented Programming. Some things are done much more efficiently with Turing machines or finite state automata (see the sketch after this list).
2. You give away resources in OO to empower some level of portability and reuse. If your problem is linear and pretty static, why give away resources?
3. With Oracle suing Google for copyright infringement over the use of Java, Java may be career limiting. I mean, if Oracle is willing to initiate a lawsuit against Google, which has the resources to handle such a lawsuit, how much more willing will it be to sue companies with significantly fewer legal resources?
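To make point 1 concrete, here is a minimal sketch of the kind of linear, state-machine-style processing an event stream calls for, in plain Perl with no object model at all. The event format (node, interface, event name) and the transition table are illustrative assumptions, not Netcool rules syntax.

#!/usr/bin/perl
# Tiny finite state machine over a link up/down event stream.
# Repeated LinkDowns on an interface that is already down are absorbed,
# so only real state changes get reported.
use strict;
use warnings;

my %state;   # "node:ifIndex" => 'up' or 'down'

# current state + incoming event => next state
my %next = (
    'up:LinkDown' => 'down',
    'down:LinkUp' => 'up',
);

while ( my $line = <DATA> ) {
    chomp $line;
    my ( $node, $if, $event ) = split /\s+/, $line;
    my $key = "$node:$if";
    my $cur = $state{$key} // 'up';
    my $new = $next{"$cur:$event"} // $cur;    # unknown transition: stay put
    print "$key: $cur -> $new on $event\n" if $new ne $cur;
    $state{$key} = $new;
}

__DATA__
core1 3 LinkDown
core1 3 LinkDown
core1 3 LinkUp
edge2 7 LinkDown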
So, unless you plan on rewriting Netcool in Java, I'd say this repositioning is a pretty limiting move. And if you were foolish enough to think this is even close to being cost effective, I'd pretty much say that as a Manager, you are dangerous to your company and shareholders.
Sunday, July 11, 2010
ENMS User Interfaces...
Ever watch folks and how they use various applications? When you do some research around the science of Situation Awareness, you realize that human behavior in user interfaces is vital to understanding how to put information in front of users in ways that empower them, in line with what they need.
In ENMS related systems, it is imperative that you present information in ways that empower users to understand situations and conditions beyond just a single node. While all of the wares vendors have been focused on delivering some sort of Root Cause Analysis, this may not be what is REALLY needed by the users. And dependent upon whether you are a Service Provider or an Enterprise, the rules may be different.
What I look for in applications and User Interfaces are ways to streamline the interaction versus being disruptive. If you are swapping a lot of screens, watch your user. If they have to readjust their vision or posture, the UI is disrupting their flow.
For example, suppose the user is looking at an events display and executes a function from the menu, and that function produces a screen that covers the existing events display. If you watch your user, you will see them have to readjust to the screen change.
I feel like this is one of the primary reasons ticketing systems do not capture more real time data. It becomes too disruptive to keep changing screens so the user waits until later to update the ticket. Inherently, data is filtered and lost.
This has an effect on other processes. One is that if you are attempting to do BSM scorecards, ticket loading and resource management in near real time, you don’t have all of the data to complete your picture. In effect, situation awareness for management levels is skewed until the data is input.
The second effect to this is that if you’re doing continuous process improvement, especially with the incident and problem management aspects of ITIL, you miss critical data and time elements necessary to measure and improve upon.
Some folks have attempted to work around this by managing from ticket queues. So, you end up with one display of events and incoming situation elements and a second interface as the ticket interface. In order to try to make this even close to being effective, the tendency is to automatically generate tickets for every incoming event. Without doing a lot of intelligent correlation up front, automatic ticket generation can be very dangerous. Due diligence must be applied to each and every event that gets propagated or you may end up with false ticket generation or missed ticket opportunities.
Consider this as well. An Event Management system is capable of handling a couple thousand events pretty handily. A Ticketing system that has to handle 2000 open tickets at one time, however, is outside the operating parameters of many ticketing systems.
Also, consider that in Remedy 7.5, the potential exists that each ticket may utilize 1GB or more of Database space. 2000 active tickets means you’re actively working across 2TB of drive / database space.
I like simple update utilities or popups that solicit information needed and move that information element back into the working Situation Awareness screen. For example, generating a ticket should be a simple screen to solicit data that is needed for the ticket that cannot be looked up directly or indirectly. Elements like ticket synopsis or symptom. Assignment to a queue or department. Changing status of a ticket.
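As a sketch of what I mean by a simple, non-disruptive popup, here is a bare-bones Perl/Tk dialog that asks only for the things that cannot be looked up and then hands them back to whatever generates the ticket. The field names and the print at the end are placeholders; in practice you would wire the answers into your own ticketing hook.

#!/usr/bin/perl
# Minimal "grab only what the operator must type" ticket popup.
# Field names and the output step are illustrative assumptions.
use strict;
use warnings;
use Tk;

my @fields = ( 'Synopsis', 'Assign Queue' );
my %field  = map { $_ => '' } @fields;

my $mw = MainWindow->new( -title => 'Quick Ticket' );

for my $label (@fields) {
    my $f = $mw->Frame->pack( -fill => 'x' );
    $f->Label( -text => $label, -width => 14, -anchor => 'w' )->pack( -side => 'left' );
    $f->Entry( -textvariable => \$field{$label}, -width => 40 )->pack( -side => 'left' );
}

$mw->Button(
    -text    => 'Create Ticket',
    -command => sub {
        # Hand the answers back to the console / ticketing hook here.
        print "SYNOPSIS=$field{Synopsis} QUEUE=$field{'Assign Queue'}\n";
        $mw->destroy;
    },
)->pack;

MainLoop;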
Maps
Maps can be handy. But if you cannot overlay tools and status effectively or the map isn’t dynamic, it becomes more of a marketing display rather than a tool that you can use. This is even more prevalent when maps are not organized into hierarchies.
One of the main obstacles is the canvas. You can only place a certain number of objects on a given screen. Some applications use scroll bars to enable you to get around. Others use a zoom in - zoom out capability where they scale the size of the icons and text according to the zoom. Others enable dragging the canvas. Another approach is to use a hyperbolic display, where analysis of detail is accomplished by establishing a moveable region under a higher level map, akin to a magnifying glass over a desktop document.
3D displays get around the limitations of a small canvas a bit by using depth to position things in front or behind. However, 3D displays have to use techniques like LOD (Level of Detail) or fog to ensure that only nearby objects are rendered in detail; otherwise, they have to render every object, local and remote. This can be computationally intensive.
A couple of techniques I like in the 3D world are CAVE / Immersion displays and the concept of HUDs and Avatars. CAVE displays display your environment from several perspectives including top, bottom, front, left, right, and even behind. Movement is accomplished interacting with one screen and the other screens are synchronized to the main, frontal screen. This gives the user the effect of an immersive visual environment.
A HUD or heads up display enables you to present real time information directly in front of a user regardless of position or view.
The concept of an avatar is important in that if you have an avatar or user symbol, you can use that symbol to enable collaboration. In fact, your proximity to a given object may be used to help others collaborate and team up to solve problems.
Next week, I’ll discuss network layouts, transitioning, state and condition management, and morphing displays. Hopefully, in the coming weeks, I’ll take a shot at designing a hybrid, immersive 2D display that is true multiuser, and can be used as a solid tools and analysis visualization system.
Labels: ENMS, Event management, ITIL, Maps, NMS, Root Cause Analysis, Situation Awareness, topology, User Interfaces
ENMS Architecture notes...
From the Architect…
OK. You've got a huge task before you. You walk into an organization where you have an Event Management tool, a Network Management application, a Help Desk application, performance management applications, databases ad nauseam... And each becomes its own silo of a Beast. Each with its own competing management infrastructure, own budget, and own support staff.
I get emails every week from friends and colleagues facing this, as well as recruiters looking for an Architect that can come in for their customer, round up the wagons, and get everything in line going forward.
Sounds rather daunting, doesn't it? Let's look at what it's going to take to get on track towards success.
1. You need to identify and map out the Functional Empires. Who’s running what product and what is the current roadmap for each Functional Empire.
2. You need to be aware of any upcoming product “deals”.
3. You need to understand the organizational capabilities and the budget.
4. In some instances, you’ll need to be strong enough technically to defend your architecture. Not just to internal customers but to product vendors. If you’re not strong enough technically, you need to find someone that is to cover you.
5. You need to understand who the Executive is, what the goals are, and the timelines needed by the Corporation.
ITIL is about processes. I tend to label ITIL as Functional Process Areas. These are the process areas needed in an effective IT Service. FCAPS is about Functional Management Areas. It is about the Functional Areas in which you need to organize and apply technology and workflow. eTOM adds Service Delivery and provisioning in a service environment into the mix as well.
The standards are the easy part.
The really hard part is merging the silos you already have and doing so without selling the organization down the river. And the ultimate goal – getting the users using the systems.
The big 4 Wares vendors are counting on you not being able to consolidate the silos on your own. I've heard that "Best of Breed" is dead and that "a single throat to choke" is what matters to customers. These are planted seeds that they want you to believe. The only way to even come close to merging, in their eyes, is to use only one vendor's wares.
When you deviate from addressing requirements and functionality in your implementation, you end up with whatever the vendor you picked says you’re gonna get.
You need to put together a strategy that spans 2 major release cycles and delineate the functionality needed across your design. Go back to the basics, incorporate the standards, and put EVERYTHING on the table. Your strategy needs to evolve into a vision of where the Enterprise Management system should be within those 2 major release cycles. The moment you let your guard down on focus, something will present itself to thwart forward movement.
Be advised. Regardless of how hard you work and what products and capabilities you implement, sometimes an organization becomes so narcissistic that it cannot change. No matter what you do, nothing gets put into production because the people in the silos block your every move. There are some that are totally resistant to change, evolution, and continuous improvement.
And you’re up against a lot of propaganda. Every vendor will tell you they are the leader or market best. And they will show you charts and statistics from analysis firms that show you that they are leaders or visionaries in the market space. It is all superfluous propaganda. Keep to requirements, capabilities, and proving/reproving these functions and their usability.
And listen to your end users most carefully. If the function adds to their arsenal and adds value, it will be accepted. If the function gets in the way or creates confusion or distraction, it will not be used.
--------------
Cross posted at : http://blog.sciencelogic.com/enterprise-network-management-systems-notes-from-the-architect/07/2010
Sunday, June 13, 2010
Appreciate our Military...
Saturday, I was flying back from Reston. On my leg from O'Hare to Oklahoma City, I ran into some Airmen coming back from a deployment. As we talked, I realized they were stationed at Tinker.
I told the Airman that I remember when my unit was the only unit on Tinker wearing Camos. He told me I must have been in the 3rd Herd. We talked some about unit history. How the original patch was an Owl and the motto was "World Wise Communications". I also told him about the transition to the War Frog under Col. Lurie and the transition to the 3rd Herd under Col. Witt.
On my building there used to be a sign painted on the roof:
Дом третьей группы борьбе связь
Всегда готов
It means "Home of the 3rd Combat Communications Group. Always Prepared" in Russian. Col. Lurie was a bit hard core but it motivated us to be ready at all times.
In the 80's - when I was stationed at Tinker - I was assigned to the 3rd Combat Communications Group. From an Air Force perspective, it wasn't considered a "choice" assignment but I wanted it anyway. Originally, out of Tech School, I had an assignment to Clark AB, Philippines. I swapped assignments to get to Tinker.
I spent 6 1/2 years at the 3rd. 65 Tactical Air Base exercises. 35 real-world and JCS exercise deployments. 2 ORIs. Over 2 years in a tent. I earned my Combat Readiness Medal.
Combat Comm can be hard duty. Field conditions a lot of the time. You deploy on a moment's notice anywhere in the world, and you have to stay ready. My deployment bag stayed packed. And I had uniforms for jungle, desert, arctic, and chemical warfare that I maintained at all times. You don't know when and how long you'll be gone.
Our equipment had to be ready to rock at all times as well. It was amazing how fast and how much equipment we moved when it became time to "Rock and Roll".
What you do learn in a Unit like the 3rd:
- How to face difficult conditions and work through challenges
- Ability to survive and operate
- Mental and physical preparedness
- Adapting to the Mission
- Urgency of need
- Teamwork
- Perseverance, persistence, and focus.
- Never settle for anything less than your best.
- Leadership
- Strength of Character
Many colleges have classes and discussions about these things but in the 3rd, you lived it. You were immersed in it.
In garrison, you trained and prepared. On deployment, you delivered.
I can't help but thank a lot of the folks I worked with and for. Col. Lurie. MSgt Olival. MSgt Reinhardt. CMSgt Kremer. Capt. Kovacs. 1LT Faughn. TSgt Tommy Brown. SSgt Jeff Brock. SSgt Tim Palmer. Sgt. Elmer Shipman. SSgt Nyberg. SSgt Roy David. A1C David Cubit. A1C Vince Simmons. A1C Rod Pitts. SSgt Darren Newell. The list goes on and on. We worked together. Prepared and helped each other.
The Airman I was talking to represented the United States of America, the Air Force, my old Unit, and those stripes very well. He is a testament to dedication to duty, service, and country. And this got me to thinking. I am a shadow Warrior from the Air Force's finest Combat Communications Group. Behind Airman Whittington and his team are thousands of us Shadow Warriors (those that have served before him) who pray for those that are actively serving. And we all hope and pray that the things we did while we were Active Duty helped, in some small way, those who serve now to do better than we did.
So - to Airman Whittington and Team: Thank you for what you do. Thank you for representing well. God bless y'all! Welcome home!
And to Airman Whittington personally - Congratulations on your upcoming promotion. Nice talking with you, Sir!
Labels: 3rd Herd, Tinker AFB, US Air Force
Wednesday, May 26, 2010
Tactical Integration Decisions...
Inherently, I am a pre-cognitive Engineer. I think about and operate in a future realm. It is the way I work best and achieve success.
Recently I became aware of a situation where a commercial product replaced an older, in house developed product. This new product has functionality and capabilities well beyond that of the "function" it replaced.
During the integration effort, it was realized that a bit of event / threshold customization was needed on the SNMP Traps front in order to get Fault Management capabilities into the integration.
In an effort to take a short cut, it was determined that they would adapt the new commercial product to the functionality and limitations of the previous "function". This is bad for several reasons:
1. You limit the capabilities of the new product going forward to those functions that were in the previous function. No new capabilities.
2. You taint the event model of the commercial product to that of the legacy function. Now all event customizations have to deal with old OIDs, and varbinds.
3. You limit the supportability and upgrade-ability of the new product. Now, every patch, upgrade, and enhancement must be transitioned back to the legacy methodology.
4. It defies common sense. How can you "let go" of the past when you readily limit yourself to the past?
5. You assume that this product cannot provide any new value to the customer or infrastructure.
You can either do things right the first time or you can create a whole new level of work that has to be done over and over. People that walk around backwards are resigned to the past. They make better historians than Engineers or Developers. How do you know where to go if you can't see anything but your feet or the end of your nose?
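For illustration, here is what that "short cut" turns into on the trap front: a translation shim, sketched in Perl, that has to map every new-product trap OID back to the OID the legacy rules expect. Every OID here is made up, and the pipe-delimited feed format is an assumption; the point is the ever-growing table, and the traps that get dropped because the legacy model never had a slot for them.

#!/usr/bin/perl
# Sketch of the maintenance burden: rewrite every new-product trap so it
# masquerades as the legacy function's events. All OIDs are illustrative.
use strict;
use warnings;

# new enterprise OID => legacy OID the downstream rules still expect
my %oid_map = (
    '1.3.6.1.4.1.99999.1.2.1' => '1.3.6.1.4.1.11111.5.0.3',   # linkFault
    '1.3.6.1.4.1.99999.1.2.2' => '1.3.6.1.4.1.11111.5.0.7',   # powerFault
    # ...grows with every patch, upgrade, and new capability
);

while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $trap_oid, @varbinds ) = split /\|/, $line;   # assumed pipe-delimited feed
    if ( my $legacy = $oid_map{$trap_oid} ) {
        print join( '|', $legacy, @varbinds ), "\n";   # pretend to be the old function
    }
    else {
        # anything the legacy model never had simply falls on the floor
        warn "dropping unmapped trap $trap_oid\n";
    }
}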
Saturday, May 22, 2010
Support Model Woes
I find it ironic that folks claim to understand ITIL Management processes yet do not understand the levels of support model.
Most support organizations have multiple tiers of support. For example, there is usually a Level 1, which is the initial interface toward the customer. Level 2 is usually triage or technical feet on the street. Level 3 is usually Engineering or Development. In some cases, Level 4 is used to denote on-site vendor support or third-party support.
In organizations where Level 1 does dispatch only or doesn't follow through problems with the customer, the customer ends up owning and following through the problem to solution. What does this say to customers?
- They are technically equal to or better than Level 1
- They become accustomed to automatic escalation.
- They lack confidence in the service provider
- They look for specific Engineers versus following process
- They build organizations specifically to follow and track problems through to resolution.
If your desks do dispatch only, event management systems are only used to present the initial event leading up to a problem. What a way to render Netcool useless! Netcool is designed to display the active things that are happening in your environment. If all you ever do is dispatch, why do you need Netcool? Just generate a ticket straight away. No need to display it.
What? Afraid of rendering your multi-million dollar investment useless? Why leave it in a dysfunctional state of semi-uselessness when you could save a significant amount of money getting rid of it? Just think, every trap can become an object that gets processed like a "traplet" of sorts.
One of the first things that crops up is that events have a tendency to be dumb. Somewhere along the way, somebody had the bright idea to put an agent into production that is "lightweight" - meaning no intelligence or correlation built in. It's so easy to do little or nothing, isn't it? Besides, it's someone else's problem to deal with the intelligence. In the vendor's case, agent functionality is many times an afterthought.
Your model is all wrong. And until the model is corrected, you will never realize the potential ROI of what the systems can provide. You cannot evolve because you have to attempt to retrofit everything back to the same broken model. And when you automate, you automate the broken as well.
Here's the way it works normally. You have 3 or 4 levels of support.
Level 1 is considered first line and they perform the initial customer engagement, diagnostics and triage, and initiate workflow. Level 1 initiates and engages additional levels of support, tracking things through to completion. In effect, Level 1 owns the incident / problem management process but also provides customer engagement and fulfillment.
Level 2 is specialized support for various areas like network, hosting, or application support. They are engaged through Level 1 personnel and are matrixed to report to level 1, problem by problem such that they empower level 1 to keep the customer informed of status and timelines, set expectations, and answer questions.
Level 3 is engaged when the problem goes beyond the technical capabilities of Levels 1 and 2, or requires project, capital expenditure, architecture, and planning support.
Level 4 is reserved for Vendor support or consulting support and engagement.
A Level 0 is used to describe automation and correlation performed before workflow is enacted.
When you break down your workflow into these levels, you can start to optimize and realize ROI by reducing the cost of maintenance actions across the board. By establishing goals to solve 60-70% of all incidents at Level 1, Level 2-4 involvement helps to drive knowledge and understanding downward to the Level 1 folks while better utilizing the Level 2-4 folks.
In order to implement these levels of support, you have to organize and define your support organization accordingly. Define its roles and responsibilities, set expectations, and work towards success. Netcool, as an Event Management platform, needs to be aligned to the support model. Things that ingress and egress tickets need to be updated in Netcool. Workflow that occurs needs to update Netcool so that personnel have awareness of what is going on.
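As an example of what "workflow updates Netcool" can look like in glue code, here is a minimal Perl sketch that pipes ObjectServer SQL into nco_sql. The TTNumber and SupportTier columns, the serial numbers, and the ticket IDs are all assumptions for illustration only; adjust them to your own alerts.status schema and credentials.

#!/usr/bin/perl
# Minimal sketch (not a drop-in): push ticket state back into the
# ObjectServer so the event list reflects workflow. TTNumber and
# SupportTier are assumed to be CUSTOM alerts.status columns -- adjust
# the SQL to whatever your own schema actually has.
use strict;
use warnings;

my @updates = (
    { serial => 12345, ticket => 'INC0004711', tier => 2 },
    { serial => 12389, ticket => 'INC0004720', tier => 1 },
);

open( my $sql, '|-', 'nco_sql -server NCOMS -user root -password ""' )
    or die "cannot start nco_sql: $!";

for my $u (@updates) {
    printf $sql "update alerts.status set TTNumber = '%s', SupportTier = %d, Acknowledged = 1 where Serial = %d;\n",
        $u->{ticket}, $u->{tier}, $u->{serial};
}
print $sql "go\n";
close $sql or warn "nco_sql exited non-zero: $?";

Run from cron or kicked off by the ticketing system, something this small keeps Level 1 from having to swivel-chair between consoles.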
Labels:
ENMS,
ITIL,
knowledge management,
Managed Services,
Office politics,
workflow
Politically based Engineering Organizations
When an organization is resistant to change and evolution, it can in many cases be attributed to weak technical leadership. This weakness is often due to the politicization of Engineering and Development organizations.
Some of the warning signs include:
- Political assassinations to protect the current system
- Senior personnel and very experienced people intimidate the management structure and are discounted
- Inability to deliver
- An unwillingness to question existing process or function.
People who function at a very technical level find places where politics is prevalent to be difficult places to work.
- Management doesn't want to know whats wrong with their product.
- They don't want to change.
- They shun and avoid senior technical folks, and experience is discounted.
Political tips and tricks
- Put up processes and negotiations to stop progress.
- Randomly changing processes.
- Sandbagging of information.
- When something is brought to the attention of the Manager, the technical person's motivations are called into question. What VALUE does the person bring to the Company?
- The person's looks or mannerisms are called into question.
- The person's heritage or background is called into question.
- The person's ability to be part of a team is called into question.
- Diversions are put in place to stop progress at any cost.
General Rules of Thumb
+ Politically oriented Supervisors kill technical organizations.
+ Image becomes more important than capability.
+ Politicians cannot fail or admit failure. Therefore, risks are avoided.
+ Plausible deniability is prevalent in politics.
+ Blame-metrics is prevalent. "You said..."
Given strong ENGINEERING leadership, technical folks will grow very quickly. Consequently, their product will become better and better: as people learn new techniques and share them, everyone gets smarter. True Engineers have a willingness to help others, solve problems, and do great things. A little autonomy and simple recognition and you're off to the races.
Politicians are more suited to Sales jobs. Sales is about making the numbers at whatever cost. In fact, Sales people will do just about anything to close a deal. They are more apt to help themselves than to help others, unless helping others also helps themselves. Engineers need to know that their managers have their backs. Sales folks have allegiance to the sale and themselves... A bad combination for Engineers.
One of the best and most visionary Vice Presidents I've ever worked under, John Schanz once told us in a Group meeting that he wanted us to fail. Failure is the ability to discover what doesn't work. So, Fail only once. And be willing to share your lessons, good and bad. And dialog is good. Keep asking new questions.
In Operations, Sales people can be very dangerous. They have the "instant gratification" mentality in many cases. Sign the PO, stand up the product, and the problem is solved in their minds. They lack the understanding that true integration comes at a personal level with each and every user. This level of integration is hard to achieve. And when things don't work or are not accepted, folks are quick to blame someone or some vendor.
The best Engineers, Developers, and Architects are the ones that have come up through the ranks. They have worked in a NOC or Operations Center. They have fielded customer calls and problems. They understand how to span from the known to the unknown even when the customer is right there. And they learn to think on their feet.
Labels:
Architecture,
Engineering,
ENMS,
ITIL,
Office politics,
software development
Thursday, May 20, 2010
Architecture and Vision
One of the key aspects of OSS/ENMS Architecture is that you need to build a model of where you'd like to be - systems, applications, and process wise - that is realizable within 2 major product release cycles. This usually equates to an 18-24 month "what it needs to look like" sort of vision.
Why?
- It provides a reference model to set goals and objectives.
- It aligns development and integration teams toward a common roadmap
- It empowers risk assessment and mitigation in context.
- It empowers planning as well as execution.
What happens if you don't?
- You have a lot of "cowboy" and rogue development efforts.
- Capabilities are built in one organization that kill capabilities and performance in other products.
- Movement forward is next to impossible in that when something doesn't directly agree with the stakeholder's localized vision, process pops up to block progress.
- You create an environment where you HAVE to manage to the Green.
- Any flaws or shortcomings in localized capabilities result in fierce political maneuvering.
What are the warning signs?
- Self directed design and development
- Products that are deployed in multiple versions.
- You have COTS products in production that are EOLed or are no longer supported.
- Seemingly simple changes turn out to require significant development efforts.
- You are developing commodity product using in house staff.
- You have teams with an Us vs. Them mentality.
Benefits?
- Products start to become "Blockified". Things become easier to change, adapt, and modify.
- Development to Product to Support to Sales becomes aligned. Same goals.
- Elimination of a lot of the weird permutations. No more weird products that are marsupials and have duck bills.
- The collective organizational intelligence goes up. Better teamwork.
- Migration away from "Manage to the Green" towards a Teamwork driven model.
- Better communication throughout. No more "Secret Squirrel" groups.
OSS ought to own a Product Catalog, data warehouse, and CMS. Not a hundred different applications. OSS Support should own the apps; the users should own the configurations and the data, as these users need to be empowered to use the tools and systems as they see fit.
Every release of capability should present changes to the product catalog. New capabilities, new functions, and even the loss of functionality need to be kept in synch with the product teams. If I ran product development, I'd want to be a lurker on the Change Advisory Board and I'd want my list published and kept up to date at all times. Add a new capability? OSS had BETTER inform the product teams.
Labels:
Alarm Management,
Architecture,
ENMS,
Leadership,
Office politics
INUG Activities...
Over the past few weeks, I've made a couple of observations regarding INUG traffic...
Jim Popovitch, a stalwart in the Netcool community, left IBM and the Netcool product behind to go to Monolith! What the hell does that say?
There was some discussion of architectural problems with Netcool and after that - CRICKETS. Interaction on the list by guys like Rob Cowart, Victor Havard, Jim - Silence. Even Heath Newburn's posts are very short.
There is a storm brewing. Somebody SBDed the party. And you can smell I mean tell. The SBD was licensing models. EVERYONE is checking their shoes and double checking their implementations. While Wares vendors push license "true ups" as a way to drive adoption and have them pay later, on the user side it is seen as Career Limiting as it is very difficult to justify your existence when you have to go back to the well for an unplanned budget item at the end of the year.
Something is brewing because talk is very limited.
Product Evaluations...
I consider product competition as a good thing. It keeps everyone working to be the best in breed, deliver the best and most cost effective solution to the customer, and drives the value proposition.
In fact, in product evaluations I like to pit vendors' products against each other so that my end customer gets the best and most cost effective solution. For example, I use capabilities that may not have been in the original requirements to further the customer capability refinement. If they run across something that makes their life better, why not leverage that in my product evaluations? In the end, I get a much more effective solution and my customer gets the best product for them.
When faced with using internal resources to develop a capability and using an outside, best of breed solution, danger exists in that if you grade on a curve for internally developed product, you take away competition and ultimately the competitive leadership associated with a Best of Breed product implementation.
It is too easy to start to minimize requirements to the bare necessities and to further segregate these requirements into phases. When you do, you lose the benefit of competition and you lose the edge you get when you tell the vendors to bring the best they have.
It's akin to looking at the problem space and asking what is the bare minimum needed to do this, versus asking what is the best solution for this problem set. Two completely different approaches.
If you evaluate on bare minimums, you get bare minimums. You will always be behind the technology curve in that you will never consider new approaches, capabilities, or technology in your evaluation. And your customer is always left wanting.
It becomes even more dangerous when you evaluate internally developed product versus COTS in that, if you apply the minimum curve gradient to only the internally developed product, the end customer only gets bare minimum capabilities within the development window. No new capabilities. No new technology. No new functionality.
It is not a fair and balanced evaluation anyway if you only apply bare minimums to evaluations. I want the BEST solution for my customer. Bare minimums are not the BEST for my customer. They are best for the development team because now, they don't have to be the best. They can slow down innovation through development processes. And the customer suffers.
If you're using developers in house, it is an ABSOLUTE WASTE of company resources and money to develop commodity software that does not provide clear business discriminators. Free is not a business discriminator in that FREE doesn't deliver any new capabilities - capabilities that commodity software doesn't already have.
Inherently, there are two mindsets that evolve. You take away or you empower the customer. A Gatekeeper or a Provider.
If you do bare minimums, capabilities that the customer wants get taken away simply because they aren't bare minimums.
If you evaluate on Best of Breed, you ultimately bring capabilities to them.
Labels:
competition,
Engineering,
Product Evaluation
Tuesday, May 18, 2010
IT Managed Services and Waffle House
OK.
So now you're thinking - What the hell does Waffle House have to do with IT Managed Services? Give me a minute and let me 'splain it a bit!
When you go into a Waffle House, immediately you get greeted at the door. Good morning! Welcome to Waffle House! If the place is full, you may have a door corps person to seat you in waiting chairs and fetch you coffee or juice while you're waiting.
When you get to the table, the Waitress sets up your silverware and ensures you have something to drink. Asks you if you have decided on what you'd like.
When they get your order, they call in to the cook what you'd like to eat.
A few minutes later, food arrives, drinks get refilled, and things are taken care of.
Pretty straightforward, don't you think? One needs to look at the behaviors and mannerisms of successful customer service representatives to see what behaviors are needed in IT Managed Services.
1. The Customer is acknowledged and engaged at the earliest possible moment.
2. Even if there are no open tables, work begins to establish a connection and a level of trust.
3. The CSR establishes a dialog and works to further the trust and connection. To the customer, they are made to feel like they are the most important customer in the place. (Focus, eye contact. Setting expectations. Assisting where necessary.)
4. They call in the order to the cook. First, meats are pulled as they take longer to cook. Next, if you watch closely, the cook lays out the order using a plate marking system. The cook then prepares the food according to the plate markings.
5. Food is delivered. Any open ends are closed. (drinks)
6. Customer finishes. Customer is engaged again to address any additional needs.
7. Customer pays out. Satisfied.
All too often, we get IT CSRs that don't readily engage customers. As soon as they assess the problem, they escalate to someone else. This someone else then calls the customer back, reiterates the situation, then begins work. When they cannot find the problem, or something like a supply action or technician dispatch needs to occur, the customer engagement gets terminated and re-initiated by the new person in the process.
In Waffle House, if you got waited on by a couple of different waitresses and the cook, how inefficient would that be? How confusing would that be as a customer? Can you imagine the labor costs to support 3 different wait staff even in slow periods? How long would Waffle House stay in business?
Regarding the system of calling and marking out product... This system is a process that's taught to EVERY COOK and Wait staff person in EVERY Waffle House EVERYWHERE. The process is tried, vetted, optimized, and implemented. And the follow through is taken care of by YOUR Wait person. The one that's focused on YOU, the Customer.
I wish I could take every Service provider person and manager and put them through 90 days of Waffle House Boot Camp. Learn how to be customer focused. Learn urgency of need, customer engagement, trust, and services fulfillment. If you could thrive at Waffle House and take with you the customer lessons, customer service in an IT environment should be Cake.
And it is living proof that workflow can be very effective even at a very rudimentary level.
Tuesday, May 11, 2010
Blocking Data
Do yourself a favor.
Take a 10000 line list of names, etc. and put it in a Database. Now, put that same list in a single file.
Crank up a term window and a SQL client on one and command line on another. Pick a pattern that gets a few names from the list.
Now in the SQL Window, do a SELECT * from Names WHERE name LIKE 'pattern';
In a second window, do a grep 'pattern' on the list file.
Now hit return on each - simultaneously if possible. Which one showed results first?
The grep, right!!!!
SQL is Blocking code. The access blocks until it returns with data. If you front end an application with code that doesn't do callbacks right and doesn't handle the blocking, what happens to your UI? IT BLOCKS!
Blocking UIs are what? That's right! JUNK!
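If you want numbers behind that experiment, a throwaway timer like the Perl sketch below will do. It assumes a SQLite file named names.db with a table called names, a flat file names.txt with the same entries, and the DBI/DBD::SQLite and Time::HiRes modules; treat it as an illustration, not a benchmark.

#!/usr/bin/perl
# Rough timing of a blocking SQL query versus a grep over a flat file.
# Assumes ./names.db (SQLite, table "names" with column "name") and
# ./names.txt contain the same 10,000 entries.
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $pattern = shift @ARGV || 'Smith';

my $t0  = [gettimeofday];
my $dbh = DBI->connect( 'dbi:SQLite:dbname=names.db', '', '', { RaiseError => 1 } );
my $rows = $dbh->selectall_arrayref(
    'SELECT * FROM names WHERE name LIKE ?', undef, "%$pattern%" );
printf "SQL:  %d rows in %.4f s\n", scalar @$rows, tv_interval($t0);

my $t1 = [gettimeofday];
my @hits = grep { /$pattern/ } do { open my $fh, '<', 'names.txt' or die $!; <$fh> };
printf "grep: %d rows in %.4f s\n", scalar @hits, tv_interval($t1);

Either way, note that the calling code sits and waits on the SQL handle. If that handle lives in your UI thread, so does the wait.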
Sunday, May 9, 2010
ENMS Products and Strategies
Cloud Computing is changing the rules of how IT and ENMS products are done. With Cloud Computing, resources are configurable and adaptable very quickly, and applications are changing to fit the paradigm of running in a virtual machine instance. We may even see apps deployed with their own VMs as a package.
This also changes the rules of how Service providers deliver. In the past, you had weeks to get the order out, setup the hardware, and do the integration. In the near future, all Service providers and application wares vendors will be pushed to Respond and Deliver at the speed of Cloud.
It changes a couple of things real quickly:
1. He who responds and delivers first will win.
2. Relationships don't deliver anything. Best of Breed is back. No excuse for the blame game.
3. If it takes longer for you to fix a problem than it does to replace, GAME OVER!
4. If your application takes too long to install or doesn't fit in the Cloud paradigm, it may become obsolete SOON.
5. In instances where hardware is required to be tuned to the application, appliances will sell before buying your own hardware.
6. In a Cloud environment, your Java JRE uses resources. Therefore it COSTS. Lighter is better. Look for more Perl, Python, PHP and Ruby. And Javascript seems to be a language now!
Some of the differentiators in the Cloud:
1. Integrated system and unit tests with the application.
2. Inline knowledge base.
3. Migration from tool to a workflow strategy.
4. High agility. Configurability. Customizability.
5. Hyper-scaling technology like memcached and distributed databases incorporated.
6. Integrated products on an appliance or VM instance.
7. Lightweight and Agile Web 2.0/3.0 UIs.
Labels:
Cloud Computing,
ENMS,
memcached,
speed of Cloud,
virtual machines
SNMP MIBs and Data and Information Models
Recently, I was having a discussion about SNMP MIB data and organization and its application into a Federated CMDB and thought it might provoke a bit of thought going forward.
When you compile a MIB for a management application, what you do is organize the definitions and the objects, according to name and OID, in a way that's searchable and applicable to performing polling, OID to text interpretation, variable interpretation, and definitions. In effect, your compiled MIB turns out to be every possible SNMP object you could poll from in your enterprise.
This "Global Tree" has to be broken down logically with each device / agent. When you do this, you build an information model related to each managed object. In breaking this down further, there are branches that are persistent for every node of that type and there are branches that are only populated / instanced if that capability is present.
For example, on a Router that has only LAN type interfaces, you'd see MIB branches for Ethernet-like interfaces but not DSX or ATM. These transitional branches are dependent upon configuration and presence of CIs underneath the Node CI - associated with the CIs corresponding to these functions.
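To make that concrete, here's a small Perl sketch that walks one instanced branch (IF-MIB ifDescr) off a single node's slice of the global tree using Net::SNMP; the hostname and community string are placeholders.

#!/usr/bin/perl
# Minimal sketch: walk the ifDescr column of the ifTable on one node.
# The rows that come back are the "instanced" portion of that node's
# own MIB tree -- only interfaces the agent actually has get populated.
use strict;
use warnings;
use Net::SNMP;

my $if_descr = '1.3.6.1.2.1.2.2.1.2';    # IF-MIB::ifDescr

my ( $session, $error ) = Net::SNMP->session(
    -hostname  => 'router1.example.net',   # placeholder
    -community => 'public',                # placeholder
    -version   => 'snmpv2c',
);
die "SNMP session failed: $error\n" unless defined $session;

my $table = $session->get_table( -baseoid => $if_descr )
    or die 'walk failed: ' . $session->error . "\n";

printf "%-35s %s\n", $_, $table->{$_} for sort keys %$table;
$session->close;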
From a CMDB Federation standpoint, a CI element has a source of truth from the node itself, using the instance provided via the MIB, and the methods via the MIB branch and attributes. But a MIB goes even further to identify keys on rows in tables, enumerations, data types, and descriptive definitions. A MIB element can even link together multiple MIB Objects based on relationships or inheritance.
In essence, I like the organization of NerveCenter Property Groups and Properties:
Property Groups are initially organized by MIB and they include every branch in that MIB. And these initial Property Groups are assigned to individual Nodes via a mapping of the system.sysObjectID to Property Group. The significance of the Property Group is that it contains a list of the MIB branches applicable to a given node.
These Property Groups are very powerful in that they are how Polls, traps, and other trigger generators are contained and applied according to the end node behavior. For example, you could have a model that uses two different MIB branches via poll definitions, but depending on the node and its property group assignment, only the polls applicable to the node's property group are applied. Yet it is done with a single model definition.
The power behind property groups was that you could add custom properties to property groups and apply these new Property Groups on the fly. So, you could use a custom property to group a specific set of nodes together.
I have set up 3 distinct property groups in NerveCenter corresponding to 3 different polling interval SLAs and used a common model to poll at three different rates dependent upon the custom properties for Cisco_basic, Cisco_advanced, and Cisco_premium - 2 minutes, 1 minute, or 20 seconds respectively.
I used the same trigger name for all three poll definitions but set the property to only be applicable to Cisco_basic, Cisco_advanced, or Cisco_premium respectively.
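The same idea is easy to mimic in glue code outside of NerveCenter: a table mapping property group to polling SLA, and nodes mapped to groups. The Perl sketch below is purely illustrative - the group names come from the example above, everything else is made up.

#!/usr/bin/perl
# Hypothetical sketch of the property-group idea: map each group to a
# polling SLA and ask, per node, which interval applies. This is glue
# code around the concept, not NerveCenter's own API.
use strict;
use warnings;

my %poll_interval = (    # seconds
    Cisco_basic    => 120,
    Cisco_advanced => 60,
    Cisco_premium  => 20,
);

my %node_group = (       # normally driven off sysObjectID plus custom tags
    'edge-rtr-01' => 'Cisco_basic',
    'core-rtr-01' => 'Cisco_premium',
);

for my $node ( sort keys %node_group ) {
    my $group = $node_group{$node};
    printf "%-12s group=%-15s poll every %3d s\n",
        $node, $group, $poll_interval{$group};
}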
What Property Groups do is enable you to set up and maintain a specific MIB tree for a given node type. Taking this a bit further, in reality, every node has its own MIB tree. Some of the tree is standard for every node of the same type while other branches are option or capability specific. This tree actually corresponds to the information model for any given node.
Seems kinda superfluous at this point. If you have an information model, you have a model of data elements and the method to retrieve that data. You also have associative data elements and relational data elements. What's missing?
Associated with these CIs related to capabilities like a DSX interface or an ATM interface or even something as mundane as an ATA Disk drive, are elements of information like technical specifications and documentation, process information, warranty and maintenance information... even mundane elements like configuration notes.
So, when you're building your information model, the CMDB is only a small portion of the overall information system. But it can be used to meta-tag or cross reference other data elements and help to turn these into a cohesive information model.
This information model ties well into fault management, performance management, and even ITIL Incident, Problem and change management. But you have to think of the whole as an Information model to make things work effectively.
Wouldn't it be awesome if you could manage and respond by service versus just a node? When you get an event on a node, do you enrich the event to provide the service or do you wait until a ticket is open? If the problem was presented as a service issue, you could work it as a service issue. For example, if you know the service lineage or pathing, you can start to overlay elements of information that empower you to put together a more cohesive awareness of your service.
Let's say you have a simple 3 tier Web enabled application that consists of a Web Server, an Application Server, and a Database. On the periphery, you have network components, firewalls, switches, etc. How valuable is just the lineage? Now, if I can overlay information elements on this ontology, it comes alive. For example, show me a graph of CPU performance on everything in the lineage. Add in memory and IO utilization. If I can overlay response times for application transactions, the picture becomes indispensable as an aid to situation awareness.
Looking at things from a different perspective, what if I could overlay network errors? Or disk errors? Overlaying other, seemingly irrelevant elements of information, like ticket activities or the amount of support time spent on each component in the service lineage, takes raw data and empowers you to present it in a service context.
On a BSM front, what if I could overlay the transaction rate with CPU, IO, Disk, or even application memory size or CPU usage? Scorecards start becoming a bit more relevant it would seem.
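One way to picture that overlay is as nothing more than a service lineage keyed by CI with metric values hung off each element. The Perl sketch below is illustrative only; the node names and numbers are invented.

#!/usr/bin/perl
# Illustrative only: a 3-tier service lineage with a couple of overlaid
# metrics per CI. Real data would come from your performance and ticket
# systems; the names and values here are invented.
use strict;
use warnings;

my %service = (
    name    => 'order-entry',
    lineage => [
        { ci => 'web01',   role => 'web server', cpu_pct => 42, tickets_30d => 1 },
        { ci => 'app01',   role => 'app server', cpu_pct => 77, tickets_30d => 4 },
        { ci => 'db01',    role => 'database',   cpu_pct => 63, tickets_30d => 2 },
        { ci => 'fw-edge', role => 'firewall',   cpu_pct => 12, tickets_30d => 0 },
    ],
);

print "Service: $service{name}\n";
for my $ci ( @{ $service{lineage} } ) {
    printf "  %-8s %-11s cpu=%3d%%  tickets(30d)=%d\n",
        $ci->{ci}, $ci->{role}, $ci->{cpu_pct}, $ci->{tickets_30d};
}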
In Summary
SNMP MIB data is vital toward not only the technical aspects of polling and traps but conversion and linkage to technical information. SNMP is a source of truth for a significant number of CIs and the MIB definitions tell you how the information is presented.
But all this needs to be part of an Information plan. How do I take this data, derive new data and information, and present it to folks when they need it the most?
BSM assumes that you have the data you need organized in a database that can enable you to present scorecards and service trees. Many applications go through very complex gyrations on SQL queries in an attempt to pull the data out. When the data isn't there or it isn't fully baked, BSM vendors may tend to stub up the data to show that the application works. This gets the sale but the customer ends up finding that the BSM application isn't as much out of the box as the vendor said it was.
These systems depend on Data and information. Work needs to be done to align and index data sources toward being usable. For example, if you commonly use inner and outer joins in queries, you haven't addressed making your data accessible. If it takes elements from 2 tables to do selects on others, you need to work on your data model.
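One hedged way to attack that is to pre-join the awkward paths into a flattened reporting table on a schedule, so the scorecard queries become single-table selects. In the Perl/DBI sketch below, every table and column name is invented for illustration; substitute your own schema and DSN.

#!/usr/bin/perl
# Sketch: materialize a flattened reporting table so BSM/scorecard queries
# don't need multi-table joins at read time. Table and column names are
# invented; swap in your own schema and DSN.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=oss.db', '', '', { RaiseError => 1 } );

$dbh->do('DROP TABLE IF EXISTS svc_scorecard');
$dbh->do(<<'SQL');
CREATE TABLE svc_scorecard AS
SELECT s.service_name,
       c.ci_name,
       p.cpu_pct,
       p.sample_time
FROM   services s
JOIN   service_ci_map m ON m.service_id = s.service_id
JOIN   cis c            ON c.ci_id      = m.ci_id
JOIN   perf_samples p   ON p.ci_id      = c.ci_id
SQL

# Downstream consumers now do the cheap, join-free read:
my $rows = $dbh->selectall_arrayref(
    'SELECT ci_name, cpu_pct FROM svc_scorecard WHERE service_name = ?',
    undef, 'order-entry' );
printf "%d rows for order-entry\n", scalar @$rows;

The join cost gets paid once per refresh instead of on every scorecard render, and the query the BSM layer has to issue becomes trivial.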