Dougie's Enterprise Management World

Wednesday, May 26, 2010

Tactical Integration Decisions...

Inherently, I am a pre-cognitive Engineer. I think about and operate in a future realm. It is they way I work best and achieve success.

Recently I became aware of a situation where a commercial product replaced an older, in house developed product. This new product has functionality and capabilities well beyond that of the "function" it replaced.

During the integration effort, it was realized that a bit of event / threshold customization was needed on the SNMP Traps front in order to get Fault Management capabilities into the integration.

In an effort to take a short cut, it was determined that they would adapt the new commercial product to the functionality and limitations of the previous "function". This is bad for several reasons:

1. You limit the capabilities of the new product going forward to those functions that were in the previous function. No new capabilities.

2. You taint the event model of the commercial product to that of the legacy function. Now all event customizations have to deal with old OIDs, and varbinds.

3. You limit the supportability and upgrade-ability of the new product. Now, every patch, upgrade, and enhancement must be transitioned back to the legacy methodology.

4. It defies common sense. How can you "let go" of the past when when you readily limit yourself to the past?

5. You assume that this product cannot provide any new value to the customer or infrastructure.

You can either do things right the first time or you can create a whole new level of work that has to be done over and over. People that walk around backwards are resigned to the past. They make better historians than Engineers or Developers. How do you know where to go if you can't see anything but your feet or the end of your nose?

Saturday, May 22, 2010

Support Model Woes

I find it ironic that folks claim to understand ITIL Management processes yet do not understand the levels of support model.

Most support organizations have multiple ters of support. For example, there is usually a Level 1 which is the initial interface toward the customer. Level 2 is usually a traige or technical feet on the street. Level 3 is usually Engineering or Development. In some cases, Level 4 is used to denote on site vendor support or third party support.

In organizations where Level 1 does dispatch only or doesn't follow through problems with the customer, the customer ends up owning and following through the problem to solution. What does this say to customers?

- There are technically equal to or better than level 1
- they become accustomed to automatic escalation.
- They lack confidence in the service provider
- They look for specific Engineers versus following process
- They build organizations specifically to follow and track problems through to resolution.

If your desks do dispatch only, event management systems are only used to present the initial event leading up to a problem. What a way to render Netcool useless! Netcool is designed to display the active things that are happening in your environment. If all you ever do is dispatch, why do you need Netcool? Just generate a ticket straight away. No need to display it.

What? Afraid of rendering your multi-million dollar investment useless? Why leave it in a disfunctional state of semi-uselessness when you could save a significant amount of money getting rid of it? Just think, every trap can become an object that gets processed like a "traplet" of sorts.

One of the first things that crops up is that events have a tendency to be dumb. Somewhere along the way, somebody had the bright idea that to put an agent into production that is "lightweight" - meaning no intelligence or correlation built in. Its so easy to do little or nothing, isn't it? Besides, its someone elses problem to deal with the intelligence. In the
vendors case, many times agent functionality is an afterthought.

Your model is all wrong. And until the model is corrected, you will never realize the potential ROI of what the systems can provide. You cannot evolve because you have to attempt to retrofit everything back to the same broken model. And when you automate, you automate the broken as well.

Heres the way it works normally. You have 3 or 4 levels of support.

Level 1 is considered first line and they perform the initial customer engagement, diagnostics and triage, and initiate workflow. Level 1 initiates and engages additional levels of support trackingt thingts through to completion. In effect, level 1 owns the incident / problem management process but also provides customer engagement and fulfillment.

Level 2 is specialized support for various areas like network, hosting, or application support. They are engaged through Level 1 personnel and are matrixed to report to level 1, problem by problem such that they empower level 1 to keep the customer informed of status and timelines, set expectations, and answer questions.

Level 3 is engaged when the problem becomes beyond the technical capabilities of levels 1 and 2, requires project, capital expenditure, architecture, and planning support.

Level 4 is reserved for Vendor support or consulting support and engagement.

A Level 0 ia used to describe automation and correlation performed before workflow is enacted.

When you breakdown your workflow into these levels, you can start to optimize and realize ROI by reducing the cost of maintenance actions across the board. By establishing goals to solve 60-70% of all incidents at LEvel 1, Level 2-4 involvement helps to drive knowledge and understanding downward to level 1 folks why better utilizing level 2 - 4 folks.

In order to implement these levels of support, you have organize and define your support organization accordingly. Define its rolls and responsibilities, set expectations, and work towards success. Netcool, as an Event Management platform, need to be aligned to the support model. Things that ingress and egress tickets need to be updated in Netcool. Workflow that occurs, needs to update Netcool so that personnel have awareness of what is going on.

Politically based Engineering Organizations

When an organization is resistant to change and evolution, it can in many cases be attributed to weak technical leadership. Now this weakness can be because of the politicalization of Engineering and Development organizations.

Some of the warning signs include:

- Political assasinations to protect the current system
- Senior personnel and very experienced people intimidate the management structure and are discounted
- Inability to deliver
- An unwillingness to question existing process or function.

People that function at a very technical level find places where politics prevalent, a difficult place to work.

- Management doesn't want to know whats wrong with their product.
- They don't want to change.
- They shun and avoid senior technical folks and experience is shunned.

Political tips and tricks

- Put up processes and negotiations to stop progress.
- Random changing processes.
- Sandbagging of information.
- When something is brought to the attention of the Manager, the technical
person's motivations are called into question. What VALUE does the person bring to the Company?
- The person's looks or mannerisms are called into question.
- The persons heritage or background is called into question.
- The person's ability to be part of a team is called into question.
- Diversions are put in place to stop progress at any cost.

General Rules of Thumb

+ Politically oriented Supervisors kill technical organizations.
+ Image becomes more important than capability.
+ Politicians cannot fail or admit failure. Therefore, risks are avoided.
+ Plausible deniability is prevalent in politics.
+ Blame-metrics is prevalent. "You said..."

Given strong ENGINEERING leadership, technical folks will grow very quickly. Consequently, their product will become better and better as you learn new techniques and share them, everyone gets smarter. True Engineers have a willingness to help others, solve problems, and do great things. A little autonomy and simple recognitions and you're off to the races.

Politicians are more suited to Sales jobs. Sales is about making the numbers at whatever cost. In fact, Sales people will do just about anything to close a deal. They are more apt to help themselves than to help others unless it also helps themselves. Engineers need to know that their managers have their backs. Sales folks have allegiance to the sale and themselves... A bad combination for Engineers.

One of the best and most visionary Vice Presidents I've ever worked under, John Schanz once told us in a Group meeting that he wanted us to fail. Failure is the ability to discover what doesn't work. So, Fail only once. And be willing to share your lessons, good and bad. And dialog is good. Keep asking new questions.

In Operations, Sales people can be very dangerous. They have the "instant gratification" mentality in many cases. Sign the PO, stand up the product, and the problem is solved in their minds. They lack the understanding that true integration comes at a personal level with each and every user. This level of integration is hard to achieve. And when things don't work or are not accepted, folks are quick to blame someone or some vendor.

The best Engineers, Developers, and Architects are the ones that have come up through the ranks. They have worked in a NOC or Operations Center. They have fielded customer calls and problems. They understand how to span from the known to and unknown even when the customer is right there. And they learn to think on their feet.

Thursday, May 20, 2010

Architecture and Vision

One of the key aspects of OSS/ENMS Architecture is that you need to build a model that is realizable in 2 major product release cycles, of where you'd like to be, systems, applications, and process wise. This equates usually to an 18-24 month "what it needs to look like" sort of vision.

Why?

- It provides a reference model to set goals and objectives.
- It aligns development and integration teams toward a common roadmap
- It empowers risk assessment and mitigation in context.
- It empowers planning as well as execution.

What happens if you don't?

- You have a lot of "cowboy" and rogue development efforts.
- Capabilities are built in one organization that kill capabilities and performance in other products.
- Movement forward is next to impossible in that when something doesn't directly agree with the stakeholder's localized vision, process pops up to block progress.
- You create an environment where you HAVE to manage to the Green.
- Any flaws or shortcomings in localized capabilities results in fierce political maneuvering.

What are the warning signs?

- Self directed design and development
- Products that are deployed in multiple versions.
- You have COTS products in production that are EOLed or are no longer supported.
- Seemingly simple changes turn out to require significant development efforts.
- You are developing commodity product using in house staff.
- You have teams with an Us vs. Them mentality.

Benefits?

- Products start to become "Blockified". Things become easier to change, adapt, and modify.
- Development to Product to Support to Sales becomes aligned. Same goals.
- Elimination of a lot of the weird permutations. No more weird products that are marsupials and have duck bills.
- The collective organizational intelligence goes up. Better teamwork.
- Migration away from "Manage to the Green" towards a Teamwork driven model.
- Better communication throughout. No more "Secret Squirrel" groups.

OSS ought to own a Product Catalog, data warehouse, and CMS. Not a hundred different applications. OSS Support should own the apps the the users should own the configurations and the data as these users need to be empowered to use the tools and systems as they see fit.

Every release of capability should present changes to the product catalog. New capabilities, new functions, and even the loss of functionality, needs to be kept in synch with the product teams. If I ran product development, I'd want to be a lurker on the Change Advisory Board and I'd want my list published and kept up to date at all times. Add a new capability. OSS had BETTER inform product teams.

INUG Activities...

Over the past few weeks, I've made a couple of observations regarding INUG traffic...

Jim Popovitch A Stalwart in the Netcool community, left IBM and the Netcool product behind to go to Monolith! What the hell does that say?

There was some discussion of architectural problems with Netcool and after that - CRICKETS. Interaction on the list by guys like Rob Cowart, Victor Havard, Jim - Silence. Even Heath Newburn's posts are very short.

There is a storm brewing. Somebody SBDed the party. And you can smell I mean tell. The SBD was licensing models. EVERYONE is checking their shoes and double checking their implementations. While Wares vendors push license "true ups" as a way to drive adoption and have them pay later, on the user side it is seen as Career Limiting as it is very difficult to justify your existence when you have to go back to the well for an unplanned budget item at the end of the year.

Something is brewing because talk is very limited.

Product Evaluations...

I consider product competition as a good thing. It keeps everyone working to be the best in breed, deliver the best and most cost effective solution to the customer, and drives the value proposition.

In fact, in product evaluations I like to pit vendors products against each other so that my end customer gets the best solution and the most cost effective. For example, I use capabilities that may not have been in the original requirements to further the customer capability refinement. If they run across something that makes their life better, why not leverage that in my product evaluations? In the end, I get a much more effective solution and my customer gets the best product for them.

When faced with using internal resources to develop a capability and using an outside, best of breed solution, danger exists in that if you grade on a curve for internally developed product, you take away competition and ultimately the competitive leadership associated with a Best of Breed product implementation.

It is too easy to start to minimize requirements to the bare necessities and to further segregate these requirements into phases. When you do, you lose the benefit of competition and you lose the edge you get when you tell the vendors to bring the best they have.

Its akin to looking at the problem space and asking what is th bare minimum needed to do this. Or asking what is the best solution for this problem set? Two completely different approaches.

If you evaluate on bare minimums, you get bare minimums. You will always be behind the technology curve in that you will never consider new approaches, capabilities, or technology in your evaluation. And your customer is always left wanting.

It becomes even more dangerous when you evaluate internally developed product versus COTS in that, if you apply the minimum curve gradient to only the internally developed product, the end customer only gets bare minimum capabilities within the development window. No new capabilities. No new technology. No new functionality.

It is not a fair and balanced evaluation anyway if you only apply bare minimums to evaluations. I want the BEST solution for my customer. Bare minimums are not the BEST for my customer. They are best for the development team because now, they don't have to be the best. They can slow down innovation through development processes. And the customer suffers.

If you're using developer in house, it is an ABSOLUTE WASTE of company resources and money to develop commodity software that does not provide clear business discriminators. Free is not a business discriminator in that FREE doesn't deliver any new capabilities - capabilities that commodity software doesn't already have.

Inherently, there are two mindsets that evolve. You take away or you empower the customer. A Gatekeeper or a Provider.

If you do bare minimums, you take away capabilities that the customer wants but because it may not be a bare minimum, the capability is taken away.

If you evaluate on Best of Breed, you ultimately bring capabilities to them.

Tuesday, May 18, 2010

IT Managed Services and Waffle House

OK.

So now you're thinking - What the hell does Waffle House have to do with IT Managed Services. Give me a minute and let me 'splain it a bit!

When you go into a Waffle House, immediately you get greeted at the door. Good morning! Welcome to Waffle House! If the place is full, you may have a door corps person to seat you in waiting chairs and fetch you coffee or juice while you're waiting.

When you get to the table, the Waitress sets up your silverware and ensures you have something to drink. Asks you if you have decided on what you'd like.

When they get your order, they call in to the cook what you'd like to eat.

A few minutes later, food arrives, drinks get refilled, and things are taken care of.

Pretty straightforward, don't you think? One needs to look at the behaviors and mannerisms od successful customer service representatives to see what behaviors are needed in IT Managed Services.

1. The Customer is acknowledged and engaged at the earliest possible moment.

2. Even if there are no open tables, work begins to establish a connection and a level of trust.

3. The CSR establishes a dialog and works to further the trust and connection. To the customer, they are made to feel like they are the most important customer in the place. (Focus, eye contact. Setting expectations. Assisting where necessary.)

4. They call in the order to the cook. First, meats are pulled as they take longer to cook. Next, if you watch closely, the cook lays out the order using a plate marking system. The customer then prepares the food according to the plate markings.

5. Food is delivered. Any open ends are closed. (drinks)

6. Customer finishes. Customer is engaged again to address any additional needs.

7. Customer pays out. Satisfied.

All too often, we get IT CSRs that don't readily engage customers. As soon as they assess the problem, they escalate to someone else. This someone else then calls the customer back, reiterates the situation, then begins work. When they cannot find the problem or something like a supply action of technician dispatch needs to occur, the customer gets terminated and re-initiated by the new person in the process.

In Waffle House, if you got waited on by a couple of different waitresses and the cook, how inefficient would that be? How confusing would that be as a customer? Can you imagine the labor costs to support 3 different wait staff even in slow periods? How long would Waffle House stay in business?

Regarding the system of calling and marking out product... This system is a process thats taught to EVERY COOK and Wait staff person in EVERY Waffle House EVERYWHERE. The process is tried, vetted, optimized, and implemented. And the follow through is taken care of by YOUR Wait person. The one thats focused on YOU, the Customer.

I wish I could take every Service provider person and manager and put them through 90 days of Waffle House Boot Camp. Learn how to be customer focused. Learn urgency of need, customer engagement, trust, and services fulfillment. If you could thrive at Waffle House and take with you the customer lessons, customer service in an IT environment should be Cake.

And it is living proof that workflow can be very effective even at a very rudimentary level.

Tuesday, May 11, 2010

Blocking Data

Do yourself a favor.

Take a 10000 line list of names, etc. and put it in a Database. Now, put that same list in a single file.

Crank up a term window and a SQL client on one and command line on another. Pick a pattern that gets a few names from the list.

Now in the SQL Window, do a SELECT * from Names WHERE name LIKE 'pattern';

In a second window, do a grep 'pattern' on the list file.

Now hit return on each - simultaneously if possible. Which one did you see results first?

The grep, right!!!!

SQL is Blocking code. The access blocks until it returns with data. If you front end an application with code that doesn't do callbacks right and doesn't handle the blocking, what happens to your UI? IT BLOCKS!

Blocking UIs are what? Thats right! JUNK!

Sunday, May 9, 2010

ENMS Products and Strategies

Cloud Computing is changing the rules of how IT and ENMS products are done. With Cloud Computing, resources being configurable and adaptable very quickly, applications are also changing to fit the paradigm of fitting in a virtual machine instance. We may even see apps deployed with their own VMs as a package.

This also changes the rules of how Service providers deliver. In the past, you had weeks to get the order out, setup the hardware, and do the integration. In the near future, all Service providers and application wares vendors will be pushed to Respond and Deliver at the speed of Cloud.

It changes a couple of things real quickly:

1. He who responds and delivers first will win.
2. Relationships don't deliver anything. Best of Breed is back. No excuse for the blame game.
3. If it takes longer for you to fix a problem than it does to replace, GAME OVER!
4. If your application takes too long to install or doesn't fit in the Cloud paradigm, it may become obsolete SOON.
5. In instances where hardware is required to be tuned to the application, appliances will sell before buying your own hardware.
6. In a Cloud environment, your Java JRE uses resources. Therefore it COSTS. Lighter is better. Look for more Perl, Python, PHP and Ruby. And Javascript seems to be a language now!

Some of the differentiators in the Cloud:

1. Integrated system and unit tests with the application.
2. Inline knowledge base.
3. Migration from tool to a workflow strategy.
4. High agility. Configurability. Customizeability.
5. Hyper-scaling technology like memcached and distributed databases incorporated.
6. Integrated products on an appliance or VM instance.
7. Lightweight and Agile Web 2.0/3.0 UIs.

SNMP MIBs and Data and Information Models

Recently, I was having a discussion about SNMP MIB data and organization and its application into a Federated CMDB and thought it might provoke a bit of thought going forward.

When you compile a MIB for a management application, what you do is to organize the definitions and the objects according to name a OID, in a way thats searchable and applicable to performing polling, OID to text interpretation, variable interpretation, and definitions. In effect, you compiled MIB turns out to be every possible SNMP object you could poll from in your enterprise.

This "Global Tree" has to be broken down logically with each device / agent. When you do this, you build an information model related to each managed object. In breaking this down further, there are branches that are persistent for every node of that type and there are branches that are only populated / instanced if that capability is present.

For example, on a Router that has only LAN type interfaces, you'd see MIB branches for Ethernet like interfaces but not DSX or ATM. These transitional branches are dependent upon configuration and presence of CIs underneath the Node CI - associated with the CIs corresponding to these functions.

From a CMDB Federation standpoint, a CI element has a source of truth from the node itself, using the instance provided via the MIB, and the methods via the MIB branch and attributes. But a MIB goes even further to identify keys on rows in tables, enumerations, data types, and descriptive definitions. A MIB element can even link together multiple MIB Objects based on relationships or inheritance.

In essence, I like the organization of NerveCenter Property Groups and Properties:

Property Groups are initially organized by MIB and they include every branch in that MIB. And these initial Property Groups are assigned to individual Nodes via a mapping of the system.sysObjectID to Property Group. The significance of the Property Group is that is contains a list of the MIB branches applicable to a given node.

These Property Groups are very powerful in that it is how Polls, traps, and other trigger generators are contained and applied according to the end node behavior. For example, you could have a model that uses two different MIB branches via poll definitions, but depending the node and its property group assignment, only the polls applicable to the node's property group, are applied. Yet it is done with a single model definition.

The power behind property groups was that you could add custom properties to property groups and apply these new Property Groups on the fly. So, you could use a custom property to group a specific set of nodes together.

I have setup 3 distinct property groups in NerveCenter corresponding to 3 different polling interval SLAs and used a common model to poll at three different rates dependent upon the custom properties for Cisco_basic, Cisco_advanced, and Cisco_premium to poll at 2 minutes, 1 minute, or 20 seconds respectively.

I used the same trigger name for all three poll definitions but set the property to only be applicable to Cisco_basic, Cisco_advanced, or Cisco_premium respectively.

What Property Groups do is to enable you to setup and maintain a specific MIB tree for a given node type. Taking this a bit further, in reality, every node has its own MIB tree. Some of the tree is standard for every node of the same type while other branches are option or capability specific. This tree actually corresponds to the information model for any given node.

Seems kinda superfluous at this point. If you have an information model, you have a model of data elements and the method to retrieve that data. You also have associative data elements and relational data elements. Whats missing?

Associated with these CIs related to capabilities like a DSX interface or an ATM interface or even something as mundane as an ATA Disk drive, are elements of information like technical specifications and documentation, process information, warranty and maintenance information.. even mundane elements like configuration notes.

So, when you're building your information model, the CMDB is only a small portion of the overall information system. But it can be used to Meta or cross reference other data elements and help to turn these into a cohesive information model.

This information model ties well into fault management, performance management, and even ITIL Incident, Problem and change management. But you have to think of the whole as an Information model to make things work effectively.

Wouldn't it be awesome if to could manage and respond by service versus just a node? When you get an event on a node, do you enrich the event to provide the service or do you wait until a ticket is open? If the problem was presented as a service issue, you could work it as a service issue. For example, if you know the service lineage or pathing, you can start to overlay elements of information that empower you to start to put together a more cohesive awareness of your service.

Lets say you have a simple 3 tier Web enabled application that consists of a Web Server, and application Server, and a Database. On the periphery, you have network components, firewalls, switches, etc. How valuable is just the lineage? Now, if I can overlay information elements on this ontology, it comes alive. For example, show me a graph of CPU performance on everything in the lineage. Add in memory and IO utilization. If I can overlay response times for application transactions, the picture becomes indispensable as an aid to situation awareness.

Looking at things from a different perspective, what if I could overlay network errors. Or disk errors. What about other, seemingly irrelevant elements of information like ticket activities or the amount of support time spent on each component in the service lineage, takes data and empowers you to present it in a service context.

On a BSM front, what if I could overlay the transaction rate with CPU, IO, Disk, or even application memory size or CPU usage? Scorecards start becoming a bit more relevant it would seem.

In Summary

SNMP MIB data is vital toward not only the technical aspects of polling and traps but conversion and linkage to technical information. SNMP ius a source of truth for a significant number of CIs and the MIB definitions tell you how the information is presented.

But all this needs to be part of an Information plan. How do I take this data, derive new data and information, and present it to folks when they need it the most?

BSM assumes that you have the data you need organized in a database that can enable you to present scorecards and service trees. Many applications go through very complex gyrations on SQL queries in an attempt to pull the data out. When the data isn't there or it isn't fully baked, BSM vendors may tend to stub up the data to show that the application works. This gets the sale but the customer ends up finding that the BSM application isn't as much out of the box as the vendor said it was.

These systems depend on Data and information. Work needs to be done to align and index data sources toward being usable. For example, if you commonly use inner and outer joins in queries, you haven't addressed making your data accessible. If it takes elements from 2 tables to do selects on others, you need to work on your data model.

Monday, May 3, 2010

Event Processing...

I ran across this blog post by Tim Bass on Complex Event Processing dubbed "Orwellian Event Processing" and it struck a nerve.

The rules we build into products like Netcool Omnibus, HP Openview, and others are all based on simple if-then-else logic. Yet, through fear that somebody may do something bad with an event loop, recursive processing is shunned.

In his blog, he describes his use of Bayesian Belief Networks to learn versus the static if-then-else logic.

Because BBNs learn patterns through evidence and though cause and effect, the application of BBNs in common event classification and correlation systems makes total sense. And the better you get at classification, the better you get at dealing with uncertainty.

In its simplest form, BBNs output a ratio of occurrences ranging from 1 to -1 where 1 is 100 percent that an element in a pattern occurs and -1 or 100% that an element in a pattern never occurs.

The interesting part is that statistically, a BBN will recognize patterns that a human cannot. We naturally filter out things that are obfuscated or don't appear relevant.

What if I could have a BBN build my rules files for Netcool based upon statistical analysis of the raw event data? What would it look like as compared to current rule sets? Could I setup and establish patterns that lead up to an event horizon? Could I also understand the cause an effect of an event? What would that do to the event presentation?

Does this not open up the thought that events are presented in patterns? How could I use that to drive up the accuracy of event presentation?

Sunday, May 2, 2010

Cloud Computing - Notes from a guy behind the curtain

The latest buzz is Cloud computing. When I attended a CloudCamp hosted here in St. Louis, it became rather obvious that the term Cloud computing has an almost unlimited supply of definitions depending upon which Marketing Dweeb you talk to. It can range from everything from hosted MS Exchange to hosted VMs to applications and even hosted services. I'm really not in a position to say what Cloud is or isn't and in fact, I don't believe theres any way to win that argument. Cloud computing is a marketing perception that is rapidly morphing into whatever marketing types deem necessary to sell something. right or wrong - the spin doctors own the term and it is whatever they think will sell.

In my own perception, Cloud Computing is a process by which applications, services, and infrastructure are delivered to a customer in a rapid manner and empowers the customer to pay for what they use in small, finite increments. Cloud Computing incorporates a lot of technologies and process to make this happen. Technology like Virtualization, configuration management databases, hardware, and software.

What used to take days or weeks to deliver now takes minutes. MINUTES. What does this mean? It takes longer to secure an S Corp and setup a corresponding Tax ID than it does to setup and deliver a new companies web access to customers. And not just local, down home Main street customers but you are in business, competing at a GLOBAL LEVEL in MINUTES. And you pay as you go!

Sounds tasty, huh. Now heres a kink. How the heck do you manage this? I know the big Four Management companies say a relationship is more important than Best of Breed. I've heard in in numerous presentations and conversations. If you are in a position to sit still in business, this may be OK for you. Are you so secure in your market space that you do not fear competition to the point where you would sit idly?

Their products reflect this same lack of concern in that it is the same old stuff. It hasn't evolved much - it takes forever to get up and running and it takes months to be productive. For example, IBM Tivoli ITNM/IP. Takes at least a week of planning just to get ready to install it - IF you have hardware. Next, you need another week and consulting to get things cranking on a discovery for the first time. Takes weeks to integrate in your environment dealing with community string issues, network access, and even dealing with discovery elements.

Dealing with LDAP and network views is another nightmare altogether.

The UI is clunky, slow, and non-interactive. Way too slow to be used interactively as a diagnostic tool. At least you could work with NNM in earlier versions to get some sort of speed. (Well, with the Web UI, you had HP - slower than molasses - and then when you needed something that worked you bought Edge Technologies or Onion Peel.) In the ITNM/IP design, somebody in their infinite wisdom decided to store the map objects and map instances in a binary glob field in MySQL. At least if you had the coordinates you could FIX the topoviz maps or even display them in something a bit faster and more Web 2.0 - Like Flash / Flex. (Hardware is CHEAP!)

And how do you apply this product to a cloud infrastructure? If you can only discover once every few days, I think you're gonna miss a few customers setting up new infrastructures and not to mention any corresponding VMotion events that occur when things fail or load balance. How do you even discover and display the virtual network infrastructure with the real network infrastructure?

Even if you wanted to use it like TADDM, Tideway, or DDMi, the underlying database is not architected right. It doesn't allow you to map out the relationships between entities enough to make it viable. Even if you did a custom Discovery agent and plugged in NMap - (Hey! Everybody uses it!) you cannot fit the data correctly. And it isn't even close to the CIM schema.

And every time you want some additional functionality like performance data integration, its a new check and a new ball game. They sort of attempt to address this by enabling short term polling via the UI. Huge Fail. How do you look at data from yesterday? DOH!

ITNM/IP + Cloud == Shelfware.

If we are expected to respond at the speed of Cloud, there is a HUGE Pile of Compost consisting of management technology of the past that just isn't going to make it. These products take too much support, take too much resources to maintain, and they hold back innovation. The cost just doesn't justify the integration. Even the products we considered as untouchables. Many have been architected in a way that paints them in a corner. Once you evolve a tool kit into a solution, you have to take care not to close up integration capabilities along the way.

They take too long to install, take a huge level of effort to keep running, and the yearly maintenance costs can be rather daunting. The Cloud methodology kind of changes the rules a bit. In the cloud, its SaaS. You sign up for management. You pay for what you get. If you don't like it or want something else, presto changeo - an hour later, you're on a new plan! AND you pay as you go. No more HUGE budget outlays, planning, and negotiation cycles. No more "True-Ups" at the end of the year that kill your upward mobility and career.

BMC - Bring More Cash
EMC - EXtreme Monetary Concerns
IBM - I've Been Mugged!
HP - Huge PriceTag!
CA - Cash Advanced!

Think about Remedy. Huge Cash outlay up front. Takes a long time to get up and running. Takes even longer to get into production. Hard to change over time. And everything custom becomes an ordeal at upgrade time.

They are counting on you to not be able to move. That you have become so political and process bound, you couldn't replace it if you wanted to. In fact, in the late 80s and early 90s, there was the notion that the applications that ran on mainframes could never be moved off of those huge platforms. I remember working on the transition from Space Station Freedom to International Space Station Information systems. The old MacDac folks kept telling up there was no way we could move to open systems. Especially on time and under budget. 9 months later, 2 ES9000 Duals and a bunch of Vaxes repurposed. 28 Applications migrated. Reduced support head count from over 300 to 70. And it was done with less than half of the cost of software maintenance for a year. New software costs ~15% of what they were before. And we had alot more users. And it was faster too! NASA. ESA. CSA. NASDA. RSA. All customers.

Bert Beals, Mark Spooner, and Tim Forrester are among my list of folks that had a profound effect on my career in that they taught me through example to keep it simple and that NOTHING is impossible. And to keep asking "And then what?"

And while not every app fits in a VM, there is a growing catalog of appliance based applications that make total sense. You get to optimize the hardware according to the application and its data. That first couple of months of planning, sizing, and procurement - DONE.

And some apps thrive on the Cloud virtualization. If you need a data warehouse or are looking to make sense of your data footprint, check out Greenplum. Distributed Database BASED on VMs! You plug in resources as VMs as you grow and change!

And the line between the network and the systems and the applications and the users - disappearing quickly. Presents an ever increasing data challenge to be able to discover and use all these relationships to deliver better services to customers.

Cloud Computing is bringing that revolution and reinvention cycle back into focus in the IT industry. It is a culling event as it will cull out the non-producers and change the customer engagement rules. Best of Breed is back! And with a Vengeance!

Sunday, April 25, 2010

Performance Management Architecture

Performance Management systems in IT infrastructures do a few common things. These are:

Gather performance data
Enable processing of the data to produce:
Events and thresholds
New data and information
Baseline and average information
Present data through a UI or via scheduled reports.
Provide for ad hoc and data mining exercises

Common themes for broken systems include:

If you have to redevelop your application to add new metrics
If you have more than one or two data access points.
If data is not consistent
If reporting mechanisms have to be redeveloped for changes to occur
If a development staff owns access to the data
If a Development staff controls what data gets gathered and stored.
If multiple systems are in place and they overlap (Significantly) in coverage.
If you cannot graph any data newer than 5 minutes.
If theres no such thing as a live graph or the live graph is done via Metarefresh.

I dig SevOne. Easy to setup. Easy to use. Baselines. New graphs. New reports. And schedules. But they also do drill down from SNMP into IPFIX DIRECTLY. No popping out of one system and popping into another. SEAMLESSLY.

It took me 30 minutes or so to rack and stack the appliance. I went back to my desk, verified I could access the appliance, then called the SE. He setup a WebEx and it was 7 minutes and a few odd seconds later I got my first reports. Quite a significant difference from the previous Proviso install which took more than a single day to install.

The real deal is that with SevOne, your network engineers can get and setup the data collection they need. And the hosting engineers can follow suite. Need a new metric. Engineering sets it up. NO DEVELOPMENT EFFORT.

And it can be done today. Not 3 months from now. When something like a performance management system cannot be used as part of the diagnostics and triage of near real time, it significantly detracts from usability in both the near real time and the longer term trending functions as well.

BSM - Sounds EXCELLENT. YMMV

Business Service Management

OK. Here goes. First and foremost, I went hunting for a definition. Heres one from bitpipe that I thought sounded good.

ALSO CALLED: BSM
DEFINITION: A strategy and an approach for linking key IT components to the goals of the business. It enables you to understand and predict how technology impacts the business and how business impacts the IT infrastructure.

Sounds good, right?

When I analyze this definition, it looks very much like the definition for Situation Awareness. Check out the article on Wikipedia.

Situation awareness, or SA, is the perception of environmental elements within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future

So, I see where BSM as a strategy, creates a system where Situation Awareness for business as a function of IT services, can be achieved. In effect, BSM creates SA for business users through IT service practices.

Sounds all fine, good, and well in theory. But in practice, there are a ton of data sources. Some are database enabled. Some are Web services. Some are simple web content elements. How do you assemble, index, and align all this data from multiple sources, in a way that enables a business user to achieve situation awareness? How do you handle the data sources being timed wrong or failing?

The Road to Success

First of all, if you have a BSM strategy and you're buying or considering a purchase of a BSM framework, you need to seriously consider BI and a Data Architecture as well. All three technologies are interdependent. You have to organize your data, use it to create information, them make it suitable enough to be presented in a consistent way.

As you develop your data, you also develop your data model. With the data model will come information derivation and working through query and explain plans. In some instances, you need to look at a Data warehouse of sorts. You need to be able to organize and index your data to be presented in a timely and expeditious fashion so that the information helps to drive SA by business users.

A recent data warehouse sort od product has come to my attention. It is Greenplum. Love the technology. Scalable. But based on mature technology. My thoughts are about taking data from disparate sources, organizing that data, deriving new information, and indexing the data so that the reports you provide, can happen in a timely fashion.

Organizing your data around a data warehouse allows you to get around having to deal with multiple databases, multiple access mechanisms, and latency issues. And how easier it is to analyze cause and effect, derivatives, and patterns given you can search across these data sources from a single access. Makes true Business intelligence easier.

BSM products tend to be around creative SQL queries and dashboard/scorecard generation. You may not need to buy the entire cake to get a taste. Look for web generation utilities that can be used to augment your implementation and strategy.

And if you're implementing a BSM product, wouldn't it make sense to setup SLAs on performance, availability, and response time for the app and its data sources? This is the ONE App that could be used to set a standard and a precedence.

I tend to develop the requirements, then storyboard the dashboards and drill throughs. This gives you a way of visualizing holes in the dashboards and layouts but it also enables you to drive to completion. Developing dashboards can really drive scope creep if you don't manage it.
Storyboarding allows you to manage expectations and drive delivery.

Saturday, April 24, 2010

SNMP + Polling Techniques

Over the course of many years, it seems that I see the same lack of evolution regarding SNMP polling, how its accomplished, and the underlying ramifications. To give credit where credit is due, I learned alot from Ari Hirschman, Eric Wall, Will Pearce, and Alex Keifer. And of the things we learned - Bill Frank, Scott Rife, and Mike O'Brien.

Building an SNMP poller isn't bad. Provided you understand the data structures, understand what happens on the end node, and understand how it performs in its client server model.

First off, there are 5 basic operations one can perform. These are:

GET
GET-NEXT
SET
GET-RESPONSE
GET-BULK

Here is a reference link to RFC-1157 where SNMP v1 is defined.

The GET-BULK operator was introduced when SNMP V2 was proposed and it carried into SNMP V3. While SNMP V2 was never a standard, its defacto implementations followed the Community based model referenced in RFCs 1901-1908.

SNMP V3 is the current standard for SNMP (STD0062) and version 1 and 2 SNMP are considered obsolete or historical.

SNMP TRAPs and NOTIFICATIONs are event type messages sent from the Managed object back to the Manager. In the case of NOTIFICATIONs, the Manager returns the trap as an acknowledgement.

From a polling perspective, lets start with a basic SNMP Get Request. I will illustrate this via the Net::SNMP perl module directly. (URL is http://search.cpan.org/dist/Net-SNMP/lib/Net/SNMP.pm)

get_request() - send a SNMP get-request to the remote agent

$result = $session->get_request(
[-callback => sub {},] # non-blocking
[-delay => $seconds,] # non-blocking
[-contextengineid => $engine_id,] # v3
[-contextname => $name,] # v3
-varbindlist => \@oids,
);
This method performs a SNMP get-request query to gather data from the remote agent on the host associated with the Net::SNMP object. The message is built using the list of OBJECT IDENTIFIERs in dotted notation passed to the method as an array reference using the -varbindlist argument. Each OBJECT IDENTIFIER is placed into a single SNMP GetRequest-PDU in the same order that it held in the original list.

A reference to a hash is returned in blocking mode which contains the contents of the VarBindList. In non-blocking mode, a true value is returned when no error has occurred. In either mode, the undefined value is returned when an error has occurred. The error() method may be used to determine the cause of the failure.

This can be either blocking - meaning the request will block until data is returned or non-blocking - the session will return right away but will initiate a callback subroutine upon finishing or timing out.

For the args:

-callback is used to attach a handler subroutine for non-blocking calls
-delay is used to delay the SNMP Porotocol exchange for the given number of seconds.
-contextengineid is used to pass the contextengineid needed for SNMP V3.
-contextname is used to pass the SNMP V3 contextname.
-varbindlist is an array of OIDs to get.

What this does is to setup a Session object for a given node and run through the gets in the varbindlist one PDU at a time. If you have set it up to be non-blocking, the PDUs are assembled and sent one right after another. If you are using blocking mode, the first PDU is sent and a response is received before the second one is sent.

GET requests require you to know the instance of the attribute ahead of time. Some tables are zero instanced while others may be instanced by one or even multiple indexes. For example, MIB-2.system is a zero instanced table in that there is only one row in the table. Other tables like MIB-2.interfaces.ifTable.ifEntry have multiple rows indexed by ifIndex. Here is a reference to the MIB-2 RFC-1213.

A GET-NEXT request is like a GET request except that it does not require the instance up front. For example, if you start with a table like ifEntry and you do not know what the first instance is, you would query the table without an instance.

Now here is the GET-NEXT:

$result = $session->get_next_request(
[-callback => sub {},] # non-blocking
[-delay => $seconds,] # non-blocking
[-contextengineid => $engine_id,] # v3
[-contextname => $name,] # v3
-varbindlist => \@oids,
);

In the Net::SNMP module, each OID in th \@oids array reference is passed as a single PDU instance. And like the GET, it can also be performed in blocking mode or non-blocking mode.

An snmpwalk is simply a macro of multiple recursive GET-NEXTs for a given starting OID.

As polling started to evolve, folks started looking for ways to make things a bit more scalable and faster. One of the ways they proposed was the GET-BULK operator. This enabled an SNMP Manager to pull whole portions of an SNMP MIB Table with a single request.

A GETBULK request is like a getnext but tells the agent to return as much as it can from the table. And yes, it can return partial results.
$result = $session->get_bulk_request(
[-callback => sub {},] # non-blocking
[-delay => $seconds,] # non-blocking
[-contextengineid => $engine_id,] # v3
[-contextname => $name,] # v3
[-nonrepeaters => $non_reps,]
[-maxrepetitions => $max_reps,]
-varbindlist => \@oids,
);

In SNMP V2, the GET BULK operator came into being. This was done to enable a large amount of table data to be retrieved from a single request. It does introduce two new parameters:

nonrepeaters partial information.
maxrepetitions

Nonrepeaters tells the get-bulk command that the first N objects can be retrieved with a simple get-next operation or single successor MIB objects.

Max-repetitions tells the get-bulk command to attempt up to M get-next operations to retrieve the remaining objects or how many times to repeat the get process.

The difficult part of GET BULK is you have to guess how many rows and there and you have to deal with partial returns.

As things evolved, folks started realizing that multiple OIDs were possible in SNMP GET NEXT operations through a concept of PDU Packing. However, not all agents are created equal. Some will support a few operations in a single PDU while some could support upwards of 512 in a single SNMP PDU.

In effect, by packing PDUs, you can overcome certain annoyances in data like time skew between two attributes given that they can be polled simultaneously.

When you look at the SNMP::Multi module, it not only allows multiple OIDs in a PDU by packing, it enables you to poll alot of hosts at one time. Follwing is a "synopsis" quote from the SNMP::Multi module:

use SNMP::Multi;

my $req = SNMP::Multi::VarReq->new (
nonrepeaters => 1,
hosts => [ qw/ router1.my.com router2.my.com / ],
vars => [ [ 'sysUpTime' ], [ 'ifInOctets' ], [ 'ifOutOctets' ] ],
);
die "VarReq: $SNMP::Multi::VarReq::error\n" unless $req;

my $sm = SNMP::Multi->new (
Method => 'bulkwalk',
MaxSessions => 32,
PduPacking => 16,
Community => 'public',
Version => '2c',
Timeout => 5,
Retries => 3,
UseNumeric => 1,
# Any additional options for SNMP::Session::new() ...
)
or die "$SNMP::Multi::error\n";

$sm->request($req) or die $sm->error;
my $resp = $sm->execute() or die "Execute: $SNMP::Multi::error\n";

print "Got response for ", (join ' ', $resp->hostnames()), "\n";
for my $host ($resp->hosts()) {

print "Results for $host: \n";
for my $result ($host->results()) {
if ($result->error()) {
print "Error with $host: ", $result->error(), "\n";
next;
}

print "Values for $host: ", (join ' ', $result->values());
for my $varlist ($result->varlists()) {
print map { "\t" . $_->fmt() . "\n" } @$varlist;
}
print "\n";
}
}

Using the Net::SNMP libraries underneath means that you're still constrained by port as it only uses one UDP port to poll and through requestIDs, handles the callbacks. In higher end pollers, the SNMP Collector can poll from multiple ports simultaneously.

Summary

Alot of evolution and technique has went into making SNMP data collection efficient over the years. It would be nice to see SNMP implementations that used these enhancements and evolve a bit as well. The evolution of these techniques came about for a reason. When I see places that haven't evolved in their SNMP Polling techniques, I tend to believe that they haven't evolved enough as an IT service to experience the pain that necessitated the lessons learned of the code evolution.

Sunday, April 18, 2010

Web Visualization...

I have been trying to get my head around visualization for several months. Web presentation presents a few challenges that some of the product vendors seem to overlook.

First off, there is an ever increasing propensity for each vendor to develop and produce their own portal. It must be a common Java class in a lot of schools because it is so prevalent. And not all portals are created equal or even open in some cases. I think that while they are redeveloping the wheel, they are missing the point in that they need to develop CONTENT first.

So, what are the essential parts of a portal?

Security Model
Content Customization and Presentation
Content organization

In a security model, you need to understand that users belong to groups and are identified with content and brandings. A user can be part of a team (shared content), assigned access to tools and technologies (content distribution), and will need to be able to organize the data in ways that make it easy for them to work (content brandings).

In some cases, multi-tenancy is a prime concern. How do you take and segregate discreet content yet share the shareable content?

A Web presence lends itself very well to project or incident based portal instances if you make it easy to put in place new instances pertinent to projects and situations. This empowers the capture of knowledge within given conditions, projects, or team efforts. The more relevant the cature is, the better the information is as an end result. (The longer you wait, the more daat and information you lose.)

Single Sign On.

While vendors say they do SSO, they typically only do so across their product line. Proxying, cookies and sessions, authentications and certificates are all ways to have someone have to authenticate to access systems.

From the actor perspective, once you have to stop what you're doing to log into another application, subconsciously, you have to switch gears. This switching becomes a hindrance because people will instinctively avoid disruptive processes. And in many cases, this also refocuses the screen on another window which also detracts from user focus.

Every web presence has content, a layout, and a look and feel. Templates for content layout, branding, organization, become the more common elements addressed in a portal. In some cases, language translation also plays a part. In other cases, branding also plays a significant part.

I happen to like Edge Technologies enPortal. Let me explain.

It is a general purpose Portal with Single sign On across product, it has a strong security model, and it lets you deploy web sites as needed. You can synch with LDAP and you can bring in content from a variety of sources... Even sources that are not web enabled. They do this with an interface module integrated with Sun Secure Global Desktop(The old Tarantella product...)

The enPortal is solid and fault tolerant. Can be deployed in redundant configurations.

But web visualization in support organizations needs to go much further in the future. They need to enable collaboration, topology and GIS maps, fold in external data sources like weather and traffic data. And they need to incorporate reward mechanisms for users processing data faster and more efficient.

Data and information must be melded across technologies. Fault to performance to security to applications to even functions like release management, need to be incorporated, content wise.

Some Wares vendors in the BSM space claim that they support visualization. They do. In part... Alot of the BSM products out there cater specifically to CxO level and a couple of levels below that. They lack firm grounding in the bottom layers of an organization. In fact, many times the BSM products will get in the way of folks on the desks.

A sure fire litmus test is to have the vendor install the product, give them a couple of data sources and have them show you a graphical view of the elements they found. Many cannot even come close! They depend on you to put all the data and relationships together.

Ever thought about the addictiveness of online games? They have reward mechanisms that empower you to earn points, gold, or coins or gold starts - something. These small reward mechanisms shape behavior by rewarding small things to accumulate better behavior over time.

In many cases, the data underneath required to provide effective visualization is not there, is too difficult to access, or is not in a format that is usable for reporting. When you start looking at data sources, you must examine explain plans, understand indexes as well as views, and be prepared to create information from raw data.

If you can get the data organized, you can use a multitude of products to create good, usable content. Be prepared to create data subsets, cubes of data, reference data elements, as well as provide tools that enable you to munge these data elements and sources, put it all together, and produce some preliminary results.

Netcool and Evolution toward Situation Management

Virtually no new evolution in Fault Management and correlation has been done in the last ten years. Seems we have a presumption that what we have today is as far as we can go. Truly sad.

In recent discussions on the INUG Netcool Users Forum, we discussed shortfalls in the products in hopes that big Blue may see its way clear of the technical obstacles. I don't think they are accepting or open to mine and other suggestions. But thats OK. you plant a seed - water it - feed it. And hopefully, one day, it comes to life!

Most of Netcool design is based somewhat loosely on TMF standards. They left out the hard stuff like object modelling but I understand why. The problem is that most Enterprises and MSPs don't fit the TMF design pattern. Nor do they fit eTOM. This plays specifically to my suggestion that "There's more than one way to do it!" - The Slogan behind Perl.

The underlying premise behind Netcool is that it is a single pane of glass for viewing and recognizing what is going on in your environment. It provides a way to achieve situation awareness and a platform which can be used to drive interactive work from. So what about ITIL and Netcool?

From the aspect of product positioning, most ITIL based platforms have turned out to be rehashs of Trouble Ticketing systems. When you talk to someone about ITIL, they immediately think of HP ITSM or BMC Remedy. Because of the complexity, these systems sometimes takes several months to implement. And nothing is cheap. Some folks resort to open source like RT or OTRS. Others want to migrate towards a different, appliance based model like ServiceNow and ScienceLogic EM7.

The problem is that once you transition out of Netcool, you lose your situation awareness. Its like having a notebook full of pages. Once you flip to page 50, pages 1-49 are out of sight and therefore gone. All hell could break lose and you'd never know.

So, why not implement ITIL in Netcool? May be a bit difficult. Here are a few things to consider:

1. The paradigm that an event has only 2 states is bogus.
2. The concept that there are events and these lead to incidents, problems, and changes.
3. Introduces workflow to Netcool.
4. Needs to be aware of CI references and relationships.
5. Introduces the concept that the user is part of the system in lieu of being an external entity.
6. May change the exclusion approach toward event processing.
7. Requires data storage and retrieval capabilities.

End Game

From a point of view where you'd like to end up, there are several use cases one could apply. For example:

One could see a situation develop and get solved in the Netcool display over time. As it is escalated and transitioned, you are able to see what has occurred, the workflow steps taken to solve this, and the people involved.

One could take a given situation and search through all of the events to see which ones may be applicable to the situation. Applying a ranking mechanism like a google search would help to position somewhat fuzzy information in proper contexts for the users.

Be able to take the process as it occurred and diagnose the steps and elements of information to optimize processes in future encounters.

Be able to automate, via the system, steps in the incident / problem process. Like escalations or notifications. Or executing some action externally.

Once you introduce workflow to Netcool, you need to introduce the concept of user awareness and collaboration. Who is online? What situations are they actively working versus observing? How do you handle Management escalations?

In ITIL definitions, an Incident has a defined workflow process from start to finish. Netcool could help to make the users aware of the process along with its effectiveness. Even in a simple event display you can show last, current and next steps in fields.

Value Proposition

From the aspect of implementation, the implementation of ITIL based systems has been focused solely around trouble ticketing systems. These systems have become huge behemoths of applications and with this comes two significant factors that hinder success - The loss of situation Awareness and the inability to realize and optimize processes in the near term.

These behemoth systems become difficult to adapt and difficult to keep up with optimizations. As such, they slow down the optimization process making it painful to move forward. If its hard to optimize, it will be hard to differentiate service because you cannot adapt to changes and measure the effectiveness fast enough to do any good.

A support organization that is aware of whats going on, subliminally portrays confidence. This confidence carries a huge weight in interactions with customers and staff alike. It is a different world on a desk when you're empowered to do good work for your customer.

More to come!

Hopefully, this will provide some food for thought on the evolution of event management into Situation Management. In the coming days I plan on adding to this thread several concepts like evolution toward complex event processing, Situation Awareness and Knowledge, data warehousing, and visualization.

Sunday, April 11, 2010

Fault and Event Management - Are we missing the boat?

In the beginning, folks used to tail log files. As interesting things would show up, folks would see what was happening and respond to the situation. Obviously, this didn't scale too well as you can only get about 35-40 lines per screen. As things evolved, folks looked for ways to visually queue the important messages. When you look at swatch, it changes the text colors and background colors, blinking, etc. as interesting things were noted.

Applications like xnmevents from OpenView NNM provided an event display that is basically a sequential event list. (Here's a link to an image snapshot of xnmevents in action -> HERE! )

In Openview Operations, events are aligned by nodes that belong to the user. If the event is received from a node that is not in the users node group, they don't receive the event.

Some applications tended to attempt to mask downstream events from users through topology based correlation. And while this appears to be a good thing in that it reduces the numbers of events in an events display, it takes away the ability to notify customers based on side effect events. A true double edged sword - especially if you want to be customer focused.

With some implementations, the focus of event management is to qualify and only include those events that are perceived as being worthy of being displayed. While it may seem a valid strategy, the importance should be on the situation awareness of the NOC and not on the enrichment. You may miss whole pieces of information and awareness... But your customers and end users may not miss them!

All in all, we're still just talking about discreet events here. These events may or may not be conditional or situational or even pertinent to a particular users given perspective.

From an ITIL perspective, (Well, I have ascertained the 3 different versions of ITIL Incident definitions as things have evolved...) as:
"Incident (ITILv3): [Service Operation] An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. For example, Failure of one diskfrom a mirror set.

Sunday, April 4, 2010

Simplifying topology

I have been looking at monitoring and how its typically implemented. Part of my look is to drive visualization but also how can I leverage the data in a way that organizes people's thoughts on the desk.

Part of my thought process is around OpenNMS. What can I contribute to make the project better.

What I came to realize is that Nodes are monitored on a Node / IP address basis by the majority of products available today. All of the alarms and events are aligned by node - even the sub-object based events get aggregated back to the node level. And for the most part, this is OK. You dispatch a tech to the Node level, right?

When you look at topology at a general sense, you can see the relationship between the poller and the Node under test. Between the poller and the end node, there is a list of elements that make up the lineage of network service components. So, from a service perspective, a simple traceroute between the poller and the end node produces a simple network "lineage".

Extending this a bit further, knowing that traceroute is typically done in ICMP, this gives you an IP level perspective of the network. Note also that because traceroute exploits the time to Live parameter of IP, it can be accomplished in any transport layer protocol. For example, traceroute could work on TCP port 80 or 8080. The importance is that you place a protocol specific responder on the end of the code to see if the service is actually working beyond just responding to a connection request.

And while traceroute is a one way street, it still derives a lineage of path between the poller and the Node under test - and now the protocol or SERVICE under test. And it is still a simple lineage.

The significance of the path lineage is that in order to do some level of path correlation, you need to understand what is connected to what. given that this can be very volatile and change very quickly, topology based correlation can be somewhat problematic - especially if your "facts" change on the fly. and IP based networks do that. They are supposed to do that. They are a best efffort communications methodology that needs to adapt to various conditions.

Traceroute doesn't give you ALL of the topology. By far. Consider the case of a simple frame relay circuit. A Frame Relay circuit is mapped end to end by a Circuit provider but uses T carrier access to the local exchange. Traceroute only captures the IP level access and doesn't capture elements below that. In fact, if you have ISDN backup enabled for a Frame Relay circuit, your end points for the circuit will change in most cases, for the access. And the hop count may change as well.

The good part about tracerouteing via a legitimate protocol is that you get to visualize any administrative access issues up front. For example, if port 8080 is blocked between the poller and the end node, the traceroute will fail. Additionally, you may see ICMP administratively prohibited messages as well. In effect, by positioning the poller according to end users populations, you get to see the service access pathing.

Now, think about this... From a basic service perspective, if you poll via the service, you get a basic understanding of the service you are providing via that connection. When something breaks, you also have a BASELINE with which to diagnose the problem. So, if the poll fails, rerun the traceroute via the protocol and see where it stops.

Here are the interesting things to note about this approach:

You are simply replicating human expert knowledge in software. Easy to explain. Easy to transition to personnel.
You get to derive path breakage points pretty quickly.
You get to discern the perspective of the end user.
You are now managing your Enterprise via SERVICE!

Topology really doesn't mean ANYTHING until you evolve to manage by Service and not by individual nodes. You can have all the pretty maps you want. It doesn't mean crapola until you start managing by service.

This approach is an absolute NATURAL for OpenNMS. Let me explain...

Look at the Path Outages tab. While it is currently manually configured, using the traceroute by service lineage here provides a way of visualizing the path lineage.

OpenNMS supports services pollers natively. There are alot of different services out of the box and its easy to do more if you find something different from what they already do.

Look at the difference between Alarms versus Events. Service outages could directly be related to an Alarm while the things that are eventing underneath may affect the service, are presented as events.

What if you took the reports and charts and aligned the elements to the service lineage? For example, if you had a difference in service response, you could align all of the IO graphs for everything in the service lineage. You could also align all of the CPU utilizations as well.

In elements where there are subobjects abstracted in the lineage, if you discover them, you could add those in the lineage. For example, if you discovered the Frame Relay PVCs and LEC access circuits, these could be included in with your visualization underneath the path where they are present.

The other part is that the way you work may need to evolve as well. For example, if you've traditionally ticketed outages on Nodes, now you may need to transition to a Service based model. And while you may issue tickets on a node, your ticket on a Service becomes the overlying dominant ticket in that multiple node problems may be present in a service problem.

And the important thing. You become aware of the customer and Service first, then elements underneath that. It becomes easier to manage to service along with impact assessments, when you manage to a service versus manage to a node. And when you throw in the portability, agility, and abstractness of Cloud computing, this approach is a very logical fit.

Dangerous Development

Ever been in an environment where the developed solutions tend to work around problems rather than confronting issues directly? Do you see bandaids to issues as the normal, mode of operation? Do you have software that is developed without requirements or User Acceptance testing?

What you are seeing are solutions developed in a vacuum without regard to the domain knowledge necessary to understand that a problem needs to be corrected and not "worked around". Essentially, when developers lack the domain knowledge around identifying and correcting problems in areas outside of software, you end up with software developed that works around or bandaid across issues. Essentially, they don't know how to diagnose or correct the problem, diagnose the effects of the problem, or in many cases, even understand that it is a problem.

In some cases, you need strong managerial leadership to stand up and make things right. The problem may be exacerbated by weak management or politically charged environments where managers manage to the Green. And some problems do need escalation.

This gets very dangerous to an IT environment for a multitude of reasons including:

It masks a problem. Once a problem is masked, any fix to the real problem breaks the developed bandaid solution.
It sets an even more dangerous precedent in that now its OK to develop bandaid solutions.
Once developed and in place, it is difficult to replace the solution. (It is easier to do nothing.)
It creates a mandate that further development will always be required because of the work arounds in the environment. In essence, no standards based product can no longer fulfill the requirements because of the work arounds.

A lot of factors contribute to this condition commonly known as "Painted in a Corner" development. In essence, development efforts paint themselves into a corner where they cannot be truly finished or the return on investment can never be fully realized. A developer or IT organization cannot divorce itself or disengage from a product. In effect, you cannot finish it!

A common factor is a lack of life cycle methodology in the development organization. Without standards and methodologies, it is so easy for developers to skip over certain steps because of the pain and suffering. These elements include:

Requirements and Requirements traceability
Unit Testing
System Testing
Test Harnesses and structured testing
Quality Assurance
Coding standards
Documentation
Code / Project refactoring
Acceptance Testing.

This is no different from doing other tasks such as Network Engineering, Systems Engineering, and Applications Engineering. The danger is that once the precedence is set that its OK to go around the Policies, Procedures, and Discipline associated with effective Software Development, it is very hard to reign it back in. In effect, the organization has been compromised. And they lack the awareness that they are compromised.

What do you do the right the ship?

Obviously, there is a lack of standards and governance up front. These need to be remedied. Coding standards, software lifecycle management techniques need to be chosen and implemented up front. Need to get away from Cowboy code and software development that is not customer driven. Additionally, it should be obvious that design and architecture choices need to be made external to this software development team in the foreseeable future.

Every piece of code written needs to be reviewed and corrected. You need to get rid of the bandaids and "solutions" that perpetuate problems. And you need to start addressing the real problems instead of working around them.

Software that perpetuates problems by masking them via workarounds and bandaids is a dangerous pattern to manifest in your organization. Its like finding roaches in your house. Not only will you see the bandaid redeveloped and reused over and over again, you have an empowered development staff that bandaids versus fixes. Until you can do a thorough cleaning and bombing of the house, it is next to impossible to get rid of the roaches.

Sometimes, development managers tend to be promoted a bit early in that while they have experience with code and techniques, exposure to a lot of different problem sets segregates the good development leaders from the politicians and wannabes. Those that are pipelined do not always understand how to reason through problems, discern the good from bad techniques and approaches, and lead down the right paths. Some turn very political because it is easier for them to respond politically than technically.

What are the Warnings Signs?

Typically, you see in house developed solutions that are developed around a problem in functionality that would not normally be seen in many different places. This can be manifested in many ways. Some examples or warning signs include:

Non-applicability of commercial products because of one off in house solutions or applied workarounds.
Typical terms like "thats the way we've always done this" or "We developed this to work around..." arise in discussions.
You see alot of in house developed products in many different versions.
In house developed products tend to lack sufficient documentation.
You see major deviations away from standards.
No QA and code review is only internal to the group.
No Unit or System test functionality available to the support organization.
In house developed software that never transitions out of the development group.
Software developed in house that is never finished.
You get political answers to technical questions.

A lesson here is that it does matter that the person you setup to lead, have exposure to a lot of different problem sets, SDLC methodologies in practice and not just theory, and they have some definite problem reasoning skills. A Politician is not a good coach or development leader in these environments.

In Summary

One must be very careful in the design and implementation of software done in house. If wrong, you can quickly paint development and the developed capabilities into a corner where you must forklift to get functionality back. And if you're not careful, you will stop evolution in the environment because the technical solutions will continue to work around instead of directly addressing problems.