Saturday, November 27, 2010

EM7 Dynamic App Development

Dynamic Application development in EM7 is easy and a lot of fun.

A Dynamic Application in EM7 is used to gather statistics, configuration data, and performance data, set thresholds, and set up alerts and events.

In the past, I've done a lot of SNMP-based applications where I would have to take SNMP walks and the MIBs and put together polling, thresholds, and triggers for events. I'd download the MIB, suck it into Excel, and start analyzing the data. A lot of time would be spent parsing MIBs, verifying tables, and looking at counters.

Enter EM7.

Log in and go to System->Tools->MIB Compiler

It provides a list of all MIBs it knows about.


From this, you can see if the MIB you're looking for is there and compiled. If it's not, you can hit the import button up on the top right portion of the window and it will prompt you to upload a new MIB file. (Pretty friggin straightforward if I do say so myself!)

To compile the MIB, hit the lightning icon on the MIB entry. Simple as that.

So, let's say you want to create a dynamic application around some configuration items associated with a couple of MIB tables in a MIB. For our example, we will use the CISCO-CCME-MIB, applicable to the Cisco Call Manager Express systems. You can view the MIB using the I icon, download or export the MIB via the green disk icon on the right side of the entry, or compile it using the lightning bolt. I'm going to view the MIB, so I select the I icon on the row for the MIB I want.

I hit the icon and it pulls up a file edit window.  I scroll through to find an object I'd like to work with as an application.  I highlight the object and copy it to my clipboard.

I go to System->Tools->OID Browser, select "Where name is like" in the search function, and paste in the object I was looking at in the MIB. This pulls up a hierarchy of the OIDs and MIB objects below the object I input. All in all, we're into the second minute if you're a bit slow on the keyboard like me!

Notice on the top of the selection box is the MIB object path.  This is handy to visualize where you are as you start to work with MIB objects.
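Before I start building collection objects around a table, I like to sanity check that a real device actually populates it. Here's a rough sketch of doing that from Python; it assumes Net-SNMP's snmpwalk is installed and the MIB file is on the local MIB search path, and the host, community string, and subtree name are placeholders rather than anything EM7-specific.

```python
# Sketch: preview what a device actually returns for a MIB subtree before
# building a dynamic application around it. Assumes Net-SNMP's snmpwalk is
# installed and the MIB is on the local MIB search path; the host, community,
# and subtree are placeholders.
import subprocess

def walk_subtree(host, community, subtree, mib="CISCO-CCME-MIB"):
    """Run snmpwalk against one subtree and return the raw output lines."""
    cmd = [
        "snmpwalk", "-v2c", "-c", community,
        "-m", mib,           # load the MIB so output shows object names
        host, subtree,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

if __name__ == "__main__":
    # Hypothetical subtree; substitute the Music on Hold objects you found.
    for line in walk_subtree("10.0.0.1", "public", "CISCO-CCME-MIB::ciscoCcmeMIB"):
        print(line)
```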

So, the first application I want to put together is a configuration type of application centered around parameters that are part of the Music on Hold system function.  I select the MIB elements I want to work with in my configuration application by clicking on the box for each parameter.


On the bottom right corner of this window is a selection box with a default value of [Select Action]. Pull down the menu and select Add to a new Configuration Application.  Select the go button directly to its right.

You are then prompted if this is really what you want to do. Select Yes.

This takes the MIB objects and creates a dynamic application with your collection objects already set up. Here is what it looks like.

If you do nothing else, go to the Properties tab, input an application Name, then save it. You have just created your first dynamic application in its basic sense of form and function.

This is what the Properties screen looks like:

Overall, you still have a fair amount of work to do: you probably need to clean up the names of the objects so that they show up better on reports, set up polling intervals, and set up any thresholds, alerts, or events as needed. But you have an application in a few minutes. And it can cross multiple tables if you like! CLICK CLICK BOOM!

ENMS Projects

I have been looking over why a significant number of enterprise management implementations fail. And not only do they fail in implementation, they continue to live in a failed state year over year. Someone once said that 90% or greater of all enterprise management implementations fail.

I cannot help but think that part of this may have something to do with big projects and evolution during implementation.


Another consideration I have seen is that lower level managers tend to manage to the green and are afraid of having to defend issues in their department or their products. When they manage to the Green, management above them loses awareness of how things are really working beneath them in the infrastructure.

One article I really liked was by Frank Hayes called "Big Projects, Done Small" in a recent Computerworld issue. Here is the link to it. I like his way of thinking in that big projects need to be sliced and diced into smaller pieces in order to facilitate success more readily.

Those of us in technical roles tend to operate and function in a 90-days-or-less mode. We know from experience that if a project exceeds 90 days, the "Dodge Syndrome" will rear its ugly head. (The rules have changed!) In actuality, requirements tend to change and evolve around an implementation if the implementation takes over 90 days.

The second part you realize is that in projects that need to span over 90 days, mid-level Managers tend to get nervous and lose faith in the project. Once this happens, you start seeing the Manager retract support and resources for the project. The representatives don't show up to meetings.

But Tier 1 managers like the big, high cost projects as it is a way of keeping score. They like the big price tag and they like the enormous scope of global projects.

It is the job of Architects to align projects into more manageable chunks for implementation and integration. They need to know and plan the budget for these projects so that you get what needs to be done, in the proper sequence, with the proper hooks, and with the proper resources. If this is done by Project Managers, Contractors, or Development staff, you risk the project becoming tactical in nature and losing sight of strategic goals.

When a project becomes tactical, the tasks become a cut list that has to happen every week. When something changes or impacts the schedule, tactical decisions are made to recover. For example, a decision to modify a configuration may be made today that kills or hampers an integration that needs to be done in a few months. These tactical decisions may save some time and resources today but yield a much larger need later on.

It is the equivalent of painting yourself into a corner.

Herein lie the situations that are oh so common:

1) Tier 1 Management wants to show the shareholders and Board of Directors that they are leading the company in a new direction.

2) Mid-level tiers of management can choose to use this project to promote their own empires if possible.

3) Mid-level managers can choose just to accept the big project and support it as needed.

4) Architects need to be keenly aware and on top of the project as a whole. They need to set, establish, and control direction.

5) End users need to be empowered to use the product to solve problems and work with Architecture to get their requirements in.

Conclusion

A big project can be accomplished and can succeed. But the strategic direction needs to be set with a vision for both tactical and strategic goals alignment. Things need to be broken up by goals and objectives and costed accordingly. Simplify - simplify - simplify.

Tuesday, October 12, 2010

The Case for EM7

I have been monitoring a lot of activity around various products that I have had my head wrapped around for years. See, I have been a Consultant, Engineer, end user, OSS Developer/Architect, and even Chief Technologist. I've developed apps and I've done a TON of integration across the years and problem sets. I have worked with, implemented, and supported a ton of different products and functions.

I think about all of the engagements I worked on where we spent 4-6 weeks just getting the requirements doc together and starting down the road of doing a BOM. Then, after somebody approved the design, we'd get down to ordering hardware and software.

Then, we'd go about bringing up the systems, loading the software, and tuning the systems.

Before you knew it, we were 6 months on site.

This has been standard operations for EONS. For as long as I remember.

While the boom was on, this was acceptable. Business needed capabilities and that came from tools and technology. But now, money is tight and times are hard. However, companies still need IT capabilities that are managed and reliable.

Nowadays, customers want results. Not in 6 months. Not even next month. NOW! When you look at the prospects of Cloud, what attracts business and IT alike is that in a cloud, you can deploy applications quickly!

Enter EM7. What if I told you I could bring a couple of appliances (well, depends on the size of your environment...), rack and stack, and have you seeing results right away? Events. Performance. Ticketing. Discovery. Reports.

EM7 lives in this Rapid Deployment, lots of capabilities up front sort of scenario. Literally, in a few hours, you have a management platform.

Question: How long does it take for you to get changes into your network management platform? Days? Weeks?

Question: How long does it take for you to get a new report in place on your existing system?

Question: Are you dependent upon receiving traps and events only as your interface into receiving problem triggers? Do you even perform polls?

Question: How many FTEs are used just to maintain the monitoring system? How many Consultants with that?

Now, the question is WHY should you check out EM7?

First and foremost, you want what's best for your environment. The term "Best" changes over time. And while many companies will tell you best of breed is dead and that you need a single throat to choke, it's because they KNOW they are not Best of Breed and they HATE competition. But competition is good for customers. It keeps value in perspective.

Without a bit of competition, you don't know what's best. Your environment has been evolving and so have the management capabilities and platforms (well, some have!).

In some environments, politics keep new products and capabilities at bay. You can't find capital or you know you'll step on toes if you propose new products. So, you need to have some lab time. With EM7, you spend more time working with devices and capabilities in your lab than you do installing, administering, and integrating the management platform. EM7 is significantly faster to USE than others.

EM7 has a lot of the management applications you need to manage an IT / Network Infrastructure, in an APPLIANCE model. All of the sizing, OS tuning, Database architecture, etc. is done up front in definable methods. We do this part and you get to spend your time adding value to your environment instead of the management platform.

In a politically charged environment, EM7 can be installed, up and running, and adding SIGNIFICANT value before the fiefdom owners even know. EM7 quickly becomes the standard in a buy vs. build scenario, as there are a lot of capabilities up front. But there are feature-rich ways to extend management and reporting beyond the stock product. Dynamic applications enable the same developers to build upon the EM7 foundation rather than attempting to redevelop the foundation and then the advanced capabilities.

If you're adapting and integrating Cloud capabilities, EM7 makes sense early on. Lots of both agent and agentless capabilities, service polling, mapping and dashboards, run book automation, ticketing, dynamic applications... the list goes on and on. Cloud and virtualization software and APIs can change very quickly. Standards are not always first. So, you have to be able to adapt and overcome challenges quickly.

If you're in the managed services world, you need to be vigilant in your quest for capabilities that turn into revenue. If you can generate revenue with a solution, it is entirely possible to make more than it costs. Do you know how much revenue is generated by your management platform? Now compare that to what it costs in maintenance and support.

Now, go back and look at things you've done to generate and instantiate a new product. How long did it take from concept to offering? How much did it cost, resource-wise?

In these times, when margins are lean and business is very competitive, if you aren't moving forward to compete, you're probably facing losing market share.

With EM7, you get to COMPETE. You get a solid foundation with which to discriminate and differentiate your services from your competition. Quickly. Efficiently.

Sunday, August 15, 2010

Save the Village ID ten T

Ever been in one of these situations?

So, you have this Manager you work with. Seems that every time you have to interact with them, it becomes a painful ordeal. See, they are the Political type. You know, the type that destroys the creativity of organizations.

When you deal with them, all of a sudden your world is in Black and white... Not color. It's a 6th Sense moment. "I see Stupid People. And they don't even know they're stupid!".

You ask a technical question and get the trout look. You know - Fresh off the plate! This person no more understands what the F you're talking about than the man in the moon. Nor do they want to understand. They have tuned you out. In an effort to shut you up, they proceed up the food chain to suck up to every manager in an attempt to put more red tape and BS in front of you.

Well, you figure that because you are obviously not getting your technical points across, you do memos and white papers explaining the technical challenges. Then, you're too condescending and academic.

You do requirements and everything becomes a new negotiation. You can get nothing done. To this person, it doesn't matter if you have valid requirements or not. What matters most is their own personal interest and growing their own empire.

It's Managers like this that hold back Corporations. They kill productivity and teamwork and they can single-handedly poison your Development and integration teams. They have no idea the destruction they cause. The negative effects they have on folks. The creativity they squash.

They foster Analysis Paralysis, lack of evolution, and the tragic killing of products and solutions. These types of folks tend to latch on to products or functions and never release control, such that it kills functionality and evolution beyond the basics. Everything is about the bare minimums, as you don't have to work at it much. You can float, do what you can, and go on. So what if you never ever deliver new capabilities UNLESS it is POLITICALLY ADVANCING to do so.

How do you save someone like this from themselves? They refuse to listen. They are already Lunar Lander personalities - "Tranquility Base here! The EGO has landed!". They don't even have an inkling of what they don't know or understand. Yet they continue to drive out innovation.

As a Manager of one of these personality types, it becomes dangerous for you. You either subscribe to the lineage of politics or they start politically going around you. So the deceit, lies, and ignorance perpetuates. And the company continues to suffer. It takes strong leadership to either control the politics or make a decision to oust the impediments to progress. Both take intestinal fortitude, leadership, and a lot of paperwork.

In summary, I wish I could reach these folks. Have them understand the creativity and peoples careers they kill. All in the name of self preservation. Have them understand that a little common sense would open their whole world.

In my career, I have been in this situation 4-5 times. A situation where the only way to be successful is to compromise your principles and ethics and start sucking up. My heart goes out to my friends that endure the torture of mindless decisions, the lack of experience and unwillingness to understand, and the unending political positioning and rambling rather than doing the right thing.

I'm so elated to be away from this. Words cannot do justice to how elated and ecstatic I am to be away from the mindlessness.

Event Management and Java programmers

Question: Why is it a good idea to hire a Java programmer over someone with direct Netcool experience for a Netcool specific job?

I recently heard about an open position for a Netcool specialist where they opened up the position to be more of a Java programmer slot. In part, I think it's a lack of understanding regarding supporting Netcool and understanding what it takes, domain knowledge wise, to be successful at it.

Most Java programmers either want to do some SOA sort of thing, Jakarta and Struts, Spring Framework, Tomcat/JBoss, or some portal rehash sort of thing. Most of these technologies have little to do with Netcool and rules. In fact, the lack of domain knowledge around the behavior of network and systems devices in specific failure-induced conditions may limit a true Java programmer's ability to be successful without significant training and exposure over time.

Here is the ugly truth:

1. Not every problem can be adequately or most effectively solved with Object Oriented Programming. Some things are done much more efficiently in Turing machines or Finite State automata.

2. You give away resources in OO to empower some level of portability and reuse. If your problem is linear and pretty static, why give away resources?

3. With Oracle suing Google for copyright infringement over the use of Java, Java may be career limiting. I mean, if Oracle is willing to initiate a lawsuit against Google, which has the resources to handle such a lawsuit, how much more willing will it be to sue companies with significantly fewer legal resources?

So, unless you plan on rewriting Netcool in Java, I'd say this repositioning is a pretty limiting move. And if you were foolish enough to think this is even close to being cost effective, I'd pretty much say that as a Manager, you are dangerous to your company and shareholders.

Sunday, July 11, 2010

ENMS User Interfaces...

Ever watch folks and how they use various applications? When you do some research around the science of Situation Awareness, you realize that human behavior in user interfaces is vital to understanding how to put information in front of users in ways that empower them, in line with what they need.

In ENMS related systems, it is imperative that you present information in ways that empower users to understand situations and conditions beyond just a single node. While all of the wares vendors have been focused on delivering some sort of Root Cause Analysis, this may not be what is REALLY needed by the users. And dependent upon whether you are a Service Provider or an Enterprise, the rules may be different.

What I look for in applications and User Interfaces are ways to streamline the interaction versus being disruptive. If you are swapping a lot of screens, watch your user. If they have to readjust their vision or posture, the UI is disrupting their flow.

For example, say the user is looking at an events display and they execute a function from the menu, and this function produces a screen that covers the existing events display. If you watch your user, you will see them have to readjust to the screen change.

I feel like this is one of the primary reasons ticketing systems do not capture more real time data. It becomes too disruptive to keep changing screens so the user waits until later to update the ticket. Inherently, data is filtered and lost.

This has an effect on other processes. One is that if you are attempting to do BSM scorecards, ticket loading and resource management in near real time, you don’t have all of the data to complete your picture. In effect, situation awareness for management levels is skewed until the data is input.

The second effect to this is that if you’re doing continuous process improvement, especially with the incident and problem management aspects of ITIL, you miss critical data and time elements necessary to measure and improve upon.

Some folks have attempted to work around this by managing from ticket queues. So, you end up with one display of events and incoming situation elements and a second interface as the ticket interface. In order to try to make this even close to being effective, the tendency is to automatically generate tickets for every incoming event. Without doing a lot of intelligent correlation up front, automatic ticket generation can be very dangerous. Due diligence must be applied to each and every event that gets propagated or you may end up with false ticket generation or missed ticket opportunities.
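To put a little code behind the "due diligence" point, here's a minimal sketch (not any particular product's API; every name is made up) of the kind of gate I'd want in front of automatic ticket generation: de-duplicate repeats and suppress flapping so only events that survive the checks become tickets.

```python
# Sketch of a correlation gate in front of automatic ticket creation.
# All names are illustrative; a real event manager supplies the event stream
# and a real create_ticket() call would hit your ticketing system.
import time
from collections import defaultdict

DEDUP_WINDOW = 300     # seconds: identical conditions inside this window share a ticket
FLAP_THRESHOLD = 5     # repeats inside the window that suggest flapping, not a new fault

last_ticketed = {}                # (node, alarm) -> timestamp of the last ticket cut
occurrences = defaultdict(list)   # (node, alarm) -> recent event timestamps

def should_ticket(event, now=None):
    """Return True only if this event deserves a brand-new ticket."""
    now = now or time.time()
    key = (event["node"], event["alarm"])

    # Track recent occurrences and age out anything older than the window.
    occurrences[key] = [t for t in occurrences[key] if now - t < DEDUP_WINDOW]
    occurrences[key].append(now)

    # Flapping: lots of repeats in the window -> suppress and let a human look.
    if len(occurrences[key]) > FLAP_THRESHOLD:
        return False

    # De-duplication: a ticket was already cut recently for the same condition.
    if key in last_ticketed and now - last_ticketed[key] < DEDUP_WINDOW:
        return False

    last_ticketed[key] = now
    return True

if __name__ == "__main__":
    ev = {"node": "rtr-01", "alarm": "linkDown", "severity": 4}
    print(should_ticket(ev))   # True the first time
    print(should_ticket(ev))   # False: duplicate inside the window
```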

Consider this as well. An Event Management system is capable of handling a couple thousand events pretty handily. A ticketing system handling 2000 ongoing tickets at one time, though, stretches the parameters many ticketing systems are designed for.

Also, consider that in Remedy 7.5, the potential exists that each ticket may utilize 1GB or more of Database space. 2000 active tickets means you’re actively working across 2TB of drive / database space.

I like simple update utilities or popups that solicit information needed and move that information element back into the working Situation Awareness screen. For example, generating a ticket should be a simple screen to solicit data that is needed for the ticket that cannot be looked up directly or indirectly. Elements like ticket synopsis or symptom. Assignment to a queue or department. Changing status of a ticket.

Maps

Maps can be handy. But if you cannot overlay tools and status effectively or the map isn’t dynamic, it becomes more of a marketing display rather than a tool that you can use. This is even more prevalent when maps are not organized into hierarchies.

One of the main obstacles is the canvas. You can only place a certain number of objects on a given screen. Some applications use scroll bars to enable you to get around. Others use a zoom in - zoom out capability where they scale the size of the icons and text according to the zoom. Others enable dragging the canvas. Another approach is to use a Hyperbolic display where analysis of detail is accomplished by establishing a moveable region under a higher level map, akin to a magnifying glass over a desktop document.

3D displays get around the limitations of a small canvas a bit by using depth to position things in front or behind. However, 3D displays have to use techniques like LOD (Level of Detail) or fog so that only nearby objects are rendered in detail; otherwise, every object, local and remote, has to be rendered. This can be computationally intensive.

A couple of techniques I like in the 3D world are CAVE / Immersion displays and the concept of HUDs and Avatars. CAVE displays show your environment from several perspectives including top, bottom, front, left, right, and even behind. Movement is accomplished by interacting with one screen, and the other screens are synchronized to the main, frontal screen. This gives the user the effect of an immersive visual environment.

A HUD or heads up display enables you to present real time information directly in front of a user regardless of position or view.

The concept of an avatar is important in that if you have an avatar or user symbol, you can use that symbol to enable collaboration. In fact, your proximity to a given object may be used to help others collaborate and team up to solve problems.

Next week, I’ll discuss network layouts, transitioning, state and condition management, and morphing displays. Hopefully, in the coming weeks, I’ll take a shot at designing a hybrid, immersive 2D display that is true multiuser, and can be used as a solid tools and analysis visualization system.

ENMS Architecture notes...

From the Architect…

OK. You’ve got a huge task before you. You walk into an organization where you have an Event Management tool, a Network Management application, a Help Desk application, performance management applications, databases ad nauseum… And each becomes its own silo of a Beast. Each with its own competing management infrastructure, own budget, and own support staff.

I get emails every week from friends and colleagues facing this, as well as recruiters looking for an Architect that can come in for their customer, round up the wagons, and get everything in line going forward.

Sounds rather daunting, doesn't it? Let's look at what it's going to take to get on track towards success.

1. You need to identify and map out the Functional Empires. Who’s running what product and what is the current roadmap for each Functional Empire.
2. You need to be aware of any upcoming product “deals”.
3. You need to understand the organizational capabilities and the budget.
4. In some instances, you’ll need to be strong enough technically to defend your architecture. Not just to internal customers but to product vendors. If you’re not strong enough technically, you need to find someone that is to cover you.
5. You need to understand who the Executive is, what the goals are, and the timelines needed by the Corporation.

ITIL is about processes. I tend to label ITIL as Functional Process Areas. These are the process areas needed in an effective IT Service. FCAPS is about Functional Management Areas. It is about the Functional Areas in which you need to organize and apply technology and workflow. eTOM adds Service Delivery and provisioning in a service environment into the mix as well.

The standards are the easy part.

The really hard part is merging the silos you already have and doing so without selling the organization down the river. And the ultimate goal – Getting the users using the systems.

The big 4 Wares vendors are counting on you not being able to consolidate the silos on your own. I’ve heard the term “Best of Breed” is dead and “A single Throat to Choke” as being important to customers. These are planted seeds that they want you to believe. The only way to even come close to merging in their eyes is to use only one vendor’s wares.

When you deviate from addressing requirements and functionality in your implementation, you end up with whatever the vendor you picked says you’re gonna get.

You need to put together a strategy that spans 2 major release cycles and delineate the functionality needed across your design. Go back to the basics, incorporate the standards, and put EVERYTHING on the table. Your strategy needs to evolve into a vision of where the Enterprise Management system should be within that 2-major-release time cycle. The moment you let your guard down on focus, the chance that something will thwart forward movement will present itself.

Be advised. Regardless of how hard you work and what products and capabilities you implement, sometimes an organization becomes so narcissistic that it cannot change. No matter what you do, nothing gets put into production because the people in the silos block your every move. There are some that are totally resistant to change, evolution, and continuous improvement.

And you’re up against a lot of propaganda. Every vendor will tell you they are the leader or market best. And they will show you charts and statistics from analysis firms that show you that they are leaders or visionaries in the market space. It is all superfluous propaganda. Keep to requirements, capabilities, and proving/reproving these functions and their usability.

And listen to your end users most carefully. If the function adds to their arsenal and adds value, it will be accepted. If the function gets in the way or creates confusion or distraction, it will not be used.

--------------
Cross posted at : http://blog.sciencelogic.com/enterprise-network-management-systems-notes-from-the-architect/07/2010

Sunday, June 13, 2010

Appreciate our Military...

Saturday, I was flying back from Reston. On my leg from O'Hare to Oklahoma City, I ran into some Airmen coming back from a deployment. As we talked, I realized they were stationed at Tinker.

I told the Airman that I remember when my unit was the only unit on Tinker wearing Camos. He told me I must have been in the 3rd Herd. We talked some about unit history. How the original patch was an Owl and the motto was "World Wise Communications". I also told him about the transition to the War Frog under Col. Lurie and the transition to the 3rd Herd under Col. Witt.

On my building there used to be a sign painted on the roof:

Дом третьей группы борьбе связь
Всегда готов

It means "Home of the 3rd Combat Communications Group. Always Prepared" in Russian. Col. Lurie was a bit hard core but it motivated us to be ready at all times.

In the 80's - when I was stationed at Tinker - I was assigned to the 3rd Combat Communications Group. From an Air Force perspective, it wasn't considered a "choice" assignment but I wanted it anyway. Originally, out of Tech School, I had an assignment to Clark AB, Philippines. I swapped assignments to get to Tinker.

I spent 6 1/2 years at the 3rd. 65 Tactical Air Base exercises. 35 real-world and JCS exercise deployments. 2 ORIs. Over 2 years in a tent. I earned my Combat Readiness Medal.

Combat Comm can be hard duty. Field conditions a lot of the time. You deploy on a moment's notice anywhere in the world, and you had to stay ready. My deployment bag stayed packed. And I had uniforms for jungle, desert, arctic, and chemical warfare that I maintained at all times. You don't know when you'll go or how long you'll be gone.

Our equipment had to be ready to rock at all times as well. It was amazing how fast and how much equipment we moved when it became time to "Rock and Roll".

What you do learn in a Unit like the 3rd:

- How to face difficult conditions and work through challenges
- Ability to survive and operate
- Mental and physical preparedness
- Adapting to the Mission
- Urgency of need
- Teamwork
- Perseverance, persistence, and focus.
- Never settle for anything less than your best.
- Leadership
- Strength of Character

Many colleges have classes and discussions about these things but in the 3rd, you lived it. You were immersed in it.

In garrison, you trained and prepared. On deployment, you delivered.

I can't help but thank a lot of the folks I worked with and for. Col. Lurie. MSgt Olival. MSgt Reinhardt. CMSgt Kremer. Capt. Kovacs. 1LT Faughn. TSgt Tommy Brown. SSgt Jeff Brock. SSgt Tim Palmer. Sgt. Elmer Shipman. SSgt Nyberg. SSgt Roy David. A1C David Cubit. A1C Vince Simmons. A1C Rod Pitts. SSgt Darren Newell. The list goes on and on. We worked together. Prepared and helped each other.

The Airman I was talking to represented the United States of America, the Air Force, my old Unit, and those stripes very well. He is a testament to dedication to duty, service, and country. And this got me to thinking. I am a Shadow Warrior from the Air Force's finest Combat Communications Group. Behind Airman Whittington and his team are thousands of us Shadow Warriors (those that have served before him) who pray for those that are actively serving. And we all hope and pray that the things we did while we were Active Duty helped, in some small way, the ability of those that serve now to do better than we did.

So - to Airman Whittington and Team. Thank you for what you do. Thank you for representing well. God Bless ya'll! Welcome home!

And to Airman Whittington personally - Congratulations on your upcoming promotion. Nice talking with you, Sir!

Wednesday, May 26, 2010

Tactical Integration Decisions...

Inherently, I am a pre-cognitive Engineer. I think about and operate in a future realm. It is the way I work best and achieve success.

Recently I became aware of a situation where a commercial product replaced an older, in house developed product. This new product has functionality and capabilities well beyond that of the "function" it replaced.

During the integration effort, it was realized that a bit of event / threshold customization was needed on the SNMP Traps front in order to get Fault Management capabilities into the integration.

In an effort to take a short cut, it was determined that they would adapt the new commercial product to the functionality and limitations of the previous "function". This is bad for several reasons:

1. You limit the capabilities of the new product going forward to those functions that were in the previous function. No new capabilities.

2. You taint the event model of the commercial product to that of the legacy function. Now all event customizations have to deal with old OIDs, and varbinds.

3. You limit the supportability and upgrade-ability of the new product. Now, every patch, upgrade, and enhancement must be transitioned back to the legacy methodology.

4. It defies common sense. How can you "let go" of the past when you readily limit yourself to the past?

5. You assume that this product cannot provide any new value to the customer or infrastructure.

You can either do things right the first time or you can create a whole new level of work that has to be done over and over. People that walk around backwards are resigned to the past. They make better historians than Engineers or Developers. How do you know where to go if you can't see anything but your feet or the end of your nose?

Saturday, May 22, 2010

Support Model Woes

I find it ironic that folks claim to understand ITIL Management processes yet do not understand the levels of support model.

Most support organizations have multiple tiers of support. For example, there is usually a Level 1 which is the initial interface toward the customer. Level 2 is usually triage or technical feet on the street. Level 3 is usually Engineering or Development. In some cases, Level 4 is used to denote on site vendor support or third party support.

In organizations where Level 1 does dispatch only or doesn't follow through problems with the customer, the customer ends up owning and following through the problem to solution. What does this say to customers?

- They are technically equal to or better than Level 1
- They become accustomed to automatic escalation.
- They lack confidence in the service provider
- They look for specific Engineers versus following process
- They build organizations specifically to follow and track problems through to resolution.

If your desks do dispatch only, event management systems are only used to present the initial event leading up to a problem. What a way to render Netcool useless! Netcool is designed to display the active things that are happening in your environment. If all you ever do is dispatch, why do you need Netcool? Just generate a ticket straight away. No need to display it.

What? Afraid of rendering your multi-million dollar investment useless? Why leave it in a dysfunctional state of semi-uselessness when you could save a significant amount of money getting rid of it? Just think, every trap can become an object that gets processed like a "traplet" of sorts.

One of the first things that crops up is that events have a tendency to be dumb. Somewhere along the way, somebody had the bright idea to put an agent into production that is "lightweight" - meaning no intelligence or correlation built in. It's so easy to do little or nothing, isn't it? Besides, it's someone else's problem to deal with the intelligence. In the vendor's case, agent functionality is many times an afterthought.

Your model is all wrong. And until the model is corrected, you will never realize the potential ROI of what the systems can provide. You cannot evolve because you have to attempt to retrofit everything back to the same broken model. And when you automate, you automate the broken as well.

Here's the way it works normally. You have 3 or 4 levels of support.

Level 1 is considered first line and they perform the initial customer engagement, diagnostics and triage, and initiate workflow. Level 1 initiates and engages additional levels of support, tracking things through to completion. In effect, Level 1 owns the incident / problem management process but also provides customer engagement and fulfillment.

Level 2 is specialized support for various areas like network, hosting, or application support. They are engaged through Level 1 personnel and are matrixed to report to level 1, problem by problem such that they empower level 1 to keep the customer informed of status and timelines, set expectations, and answer questions.

Level 3 is engaged when the problem is beyond the technical capabilities of Levels 1 and 2, or requires project, capital expenditure, architecture, and planning support.

Level 4 is reserved for Vendor support or consulting support and engagement.

A Level 0 is used to describe automation and correlation performed before workflow is enacted.

When you break down your workflow into these levels, you can start to optimize and realize ROI by reducing the cost of maintenance actions across the board. By establishing goals to solve 60-70% of all incidents at Level 1, Level 2-4 involvement helps to drive knowledge and understanding downward to Level 1 folks while better utilizing Level 2-4 folks.

In order to implement these levels of support, you have to organize and define your support organization accordingly. Define its roles and responsibilities, set expectations, and work towards success. Netcool, as an Event Management platform, needs to be aligned to the support model. Things that ingress and egress tickets need to be updated in Netcool. Workflow that occurs needs to update Netcool so that personnel have awareness of what is going on.

Politically based Engineering Organizations

When an organization is resistant to change and evolution, it can in many cases be attributed to weak technical leadership. Now this weakness can be because of the politicization of Engineering and Development organizations.

Some of the warning signs include:

- Political assassinations to protect the current system
- Senior personnel and very experienced people intimidate the management structure and are discounted
- Inability to deliver
- An unwillingness to question existing process or function.


People that function at a very technical level find places where politics are prevalent a difficult place to work.

- Management doesn't want to know whats wrong with their product.
- They don't want to change.
- They shun and avoid senior technical folks, and experience is discounted.

Political tips and tricks

- Put up processes and negotiations to stop progress.
- Randomly changing processes.
- Sandbagging of information.
- When something is brought to the attention of the Manager, the technical person's motivations are called into question. What VALUE does the person bring to the Company?
- The person's looks or mannerisms are called into question.
- The person's heritage or background is called into question.
- The person's ability to be part of a team is called into question.
- Diversions are put in place to stop progress at any cost.

General Rules of Thumb

+ Politically oriented Supervisors kill technical organizations.
+ Image becomes more important than capability.
+ Politicians cannot fail or admit failure. Therefore, risks are avoided.
+ Plausible deniability is prevalent in politics.
+ Blame-metrics is prevalent. "You said..."

Given strong ENGINEERING leadership, technical folks will grow very quickly. Consequently, their product will become better and better; as you learn new techniques and share them, everyone gets smarter. True Engineers have a willingness to help others, solve problems, and do great things. A little autonomy and some simple recognition, and you're off to the races.

Politicians are more suited to Sales jobs. Sales is about making the numbers at whatever cost. In fact, Sales people will do just about anything to close a deal. They are more apt to help themselves than to help others, unless helping others also helps themselves. Engineers need to know that their managers have their backs. Sales folks have allegiance to the sale and themselves... A bad combination for Engineers.

One of the best and most visionary Vice Presidents I've ever worked under, John Schanz, once told us in a group meeting that he wanted us to fail. Failure is the ability to discover what doesn't work. So, fail only once. And be willing to share your lessons, good and bad. And dialog is good. Keep asking new questions.


In Operations, Sales people can be very dangerous. They have the "instant gratification" mentality in many cases. Sign the PO, stand up the product, and the problem is solved in their minds. They lack the understanding that true integration comes at a personal level with each and every user. This level of integration is hard to achieve. And when things don't work or are not accepted, folks are quick to blame someone or some vendor.

The best Engineers, Developers, and Architects are the ones that have come up through the ranks. They have worked in a NOC or Operations Center. They have fielded customer calls and problems. They understand how to span from the known to the unknown even when the customer is right there. And they learn to think on their feet.

Thursday, May 20, 2010

Architecture and Vision

One of the key aspects of OSS/ENMS Architecture is that you need to build a model, realizable in 2 major product release cycles, of where you'd like to be systems-, applications-, and process-wise. This usually equates to an 18-24 month "what it needs to look like" sort of vision.

Why?

- It provides a reference model to set goals and objectives.
- It aligns development and integration teams toward a common roadmap
- It empowers risk assessment and mitigation in context.
- It empowers planning as well as execution.

What happens if you don't?

- You have a lot of "cowboy" and rogue development efforts.
- Capabilities are built in one organization that kill capabilities and performance in other products.
- Movement forward is next to impossible in that when something doesn't directly agree with the stakeholder's localized vision, process pops up to block progress.
- You create an environment where you HAVE to manage to the Green.
- Any flaws or shortcomings in localized capabilities results in fierce political maneuvering.

What are the warning signs?

- Self directed design and development
- Products that are deployed in multiple versions.
- You have COTS products in production that are EOLed or are no longer supported.
- Seemingly simple changes turn out to require significant development efforts.
- You are developing commodity product using in house staff.
- You have teams with an Us vs. Them mentality.

Benefits?

- Products start to become "Blockified". Things become easier to change, adapt, and modify.
- Development to Product to Support to Sales becomes aligned. Same goals.
- Elimination of a lot of the weird permutations. No more weird products that are marsupials and have duck bills.
- The collective organizational intelligence goes up. Better teamwork.
- Migration away from "Manage to the Green" towards a Teamwork driven model.
- Better communication throughout. No more "Secret Squirrel" groups.

OSS ought to own a Product Catalog, data warehouse, and CMS, not a hundred different applications. OSS Support should own the apps; the users should own the configurations and the data, as these users need to be empowered to use the tools and systems as they see fit.

Every release of capability should present changes to the product catalog. New capabilities, new functions, and even the loss of functionality, needs to be kept in synch with the product teams. If I ran product development, I'd want to be a lurker on the Change Advisory Board and I'd want my list published and kept up to date at all times. Add a new capability. OSS had BETTER inform product teams.

INUG Activities...

Over the past few weeks, I've made a couple of observations regarding INUG traffic...

Jim Popovitch, a stalwart in the Netcool community, left IBM and the Netcool product behind to go to Monolith! What the hell does that say?

There was some discussion of architectural problems with Netcool and after that - CRICKETS. Interaction on the list by guys like Rob Cowart, Victor Havard, Jim - Silence. Even Heath Newburn's posts are very short.

There is a storm brewing. Somebody SBDed the party. And you can smell I mean tell. The SBD was licensing models. EVERYONE is checking their shoes and double checking their implementations. While Wares vendors push license "true ups" as a way to drive adoption and have them pay later, on the user side it is seen as Career Limiting as it is very difficult to justify your existence when you have to go back to the well for an unplanned budget item at the end of the year.

Something is brewing because talk is very limited.

Product Evaluations...

I consider product competition as a good thing. It keeps everyone working to be the best in breed, deliver the best and most cost effective solution to the customer, and drives the value proposition.

In fact, in product evaluations I like to pit vendors' products against each other so that my end customer gets the best and most cost-effective solution. For example, I use capabilities that may not have been in the original requirements to further the customer's capability refinement. If they run across something that makes their life better, why not leverage that in my product evaluations? In the end, I get a much more effective solution and my customer gets the best product for them.

When faced with using internal resources to develop a capability and using an outside, best of breed solution, danger exists in that if you grade on a curve for internally developed product, you take away competition and ultimately the competitive leadership associated with a Best of Breed product implementation.

It is too easy to start to minimize requirements to the bare necessities and to further segregate these requirements into phases. When you do, you lose the benefit of competition and you lose the edge you get when you tell the vendors to bring the best they have.

It's akin to looking at the problem space and asking what is the bare minimum needed to do this, versus asking what is the best solution for this problem set. Two completely different approaches.

If you evaluate on bare minimums, you get bare minimums. You will always be behind the technology curve in that you will never consider new approaches, capabilities, or technology in your evaluation. And your customer is always left wanting.

It becomes even more dangerous when you evaluate internally developed product versus COTS in that, if you apply the minimum curve gradient to only the internally developed product, the end customer only gets bare minimum capabilities within the development window. No new capabilities. No new technology. No new functionality.

It is not a fair and balanced evaluation anyway if you only apply bare minimums to evaluations. I want the BEST solution for my customer. Bare minimums are not the BEST for my customer. They are best for the development team because now, they don't have to be the best. They can slow down innovation through development processes. And the customer suffers.

If you're using developers in house, it is an ABSOLUTE WASTE of company resources and money to develop commodity software that does not provide clear business discriminators. Free is not a business discriminator, in that FREE doesn't deliver any new capabilities - capabilities that commodity software doesn't already have.

Inherently, there are two mindsets that evolve. You take away or you empower the customer. A Gatekeeper or a Provider.

If you do bare minimums, you take away capabilities the customer wants, simply because they are not bare minimums.

If you evaluate on Best of Breed, you ultimately bring capabilities to them.

Tuesday, May 18, 2010

IT Managed Services and Waffle House

OK.

So now you're thinking - what the hell does Waffle House have to do with IT Managed Services? Give me a minute and let me 'splain it a bit!

When you go into a Waffle House, immediately you get greeted at the door. Good morning! Welcome to Waffle House! If the place is full, you may have a door corps person to seat you in waiting chairs and fetch you coffee or juice while you're waiting.

When you get to the table, the Waitress sets up your silverware and ensures you have something to drink. Asks you if you have decided on what you'd like.

When they get your order, they call in to the cook what you'd like to eat.

A few minutes later, food arrives, drinks get refilled, and things are taken care of.

Pretty straightforward, don't you think? One needs to look at the behaviors and mannerisms of successful customer service representatives to see what behaviors are needed in IT Managed Services.

1. The Customer is acknowledged and engaged at the earliest possible moment.

2. Even if there are no open tables, work begins to establish a connection and a level of trust.

3. The CSR establishes a dialog and works to further the trust and connection. To the customer, they are made to feel like they are the most important customer in the place. (Focus, eye contact. Setting expectations. Assisting where necessary.)

4. They call in the order to the cook. First, meats are pulled as they take longer to cook. Next, if you watch closely, the cook lays out the order using a plate marking system. The cook then prepares the food according to the plate markings.

5. Food is delivered. Any open ends are closed. (drinks)

6. Customer finishes. Customer is engaged again to address any additional needs.

7. Customer pays out. Satisfied.

All too often, we get IT CSRs that don't readily engage customers. As soon as they assess the problem, they escalate to someone else. This someone else then calls the customer back, reiterates the situation, then begins work. When they cannot find the problem or something like a supply action or technician dispatch needs to occur, the customer gets terminated and re-initiated by the new person in the process.

In Waffle House, if you got waited on by a couple of different waitresses and the cook, how inefficient would that be? How confusing would that be as a customer? Can you imagine the labor costs to support 3 different wait staff even in slow periods? How long would Waffle House stay in business?

Regarding the system of calling and marking out product... This system is a process that's taught to EVERY COOK and Wait staff person in EVERY Waffle House EVERYWHERE. The process is tried, vetted, optimized, and implemented. And the follow-through is taken care of by YOUR Wait person. The one that's focused on YOU, the Customer.

I wish I could take every Service provider person and manager and put them through 90 days of Waffle House Boot Camp. Learn how to be customer focused. Learn urgency of need, customer engagement, trust, and services fulfillment. If you could thrive at Waffle House and take with you the customer lessons, customer service in an IT environment should be Cake.

And it is living proof that workflow can be very effective even at a very rudimentary level.

Tuesday, May 11, 2010

Blocking Data

Do yourself a favor.

Take a 10000 line list of names, etc. and put it in a Database. Now, put that same list in a single file.

Crank up a term window and a SQL client on one and command line on another. Pick a pattern that gets a few names from the list.

Now in the SQL Window, do a SELECT * from Names WHERE name LIKE 'pattern';

In a second window, do a grep 'pattern' on the list file.

Now hit return on each - simultaneously if possible. Which one did you see results first?

The grep, right!!!!

SQL is Blocking code. The access blocks until it returns with data. If you front end an application with code that doesn't do callbacks right and doesn't handle the blocking, what happens to your UI? IT BLOCKS!

Blocking UIs are what? That's right! JUNK!
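If you're wondering what handling the blocking actually looks like, here's a minimal sketch: push the blocking query onto a worker thread and hand the results back through a callback, so the UI thread never sits waiting on the database. The table, pattern, and SQLite file are just placeholders for whatever blocking data access you're front-ending.

```python
# Sketch: keep a UI responsive by running a blocking SQL query on a worker
# thread and delivering results via a callback. The Names table and the
# SQLite file stand in for any blocking back-end data access.
import sqlite3
import threading

def query_async(db_path, pattern, on_done):
    """Run the blocking SELECT off the UI thread; call on_done(rows) when finished."""
    def worker():
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT * FROM Names WHERE name LIKE ?", (pattern,)
            ).fetchall()
        finally:
            conn.close()
        on_done(rows)   # a real UI would marshal this back onto its event loop

    threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    # Build a tiny throwaway database so the example runs end to end.
    setup = sqlite3.connect("names.db")
    setup.execute("CREATE TABLE IF NOT EXISTS Names (name TEXT)")
    setup.executemany("INSERT INTO Names VALUES (?)",
                      [("alice smith",), ("bob jones",), ("carol smith",)])
    setup.commit()
    setup.close()

    done = threading.Event()

    def show(rows):
        print(f"{len(rows)} matches")   # the 'UI' update happens here
        done.set()

    query_async("names.db", "%smith%", show)
    print("UI thread is still free to repaint while the query runs...")
    done.wait(timeout=5)
```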

Sunday, May 9, 2010

ENMS Products and Strategies

Cloud Computing is changing the rules of how IT and ENMS products are done. With Cloud Computing resources being configurable and adaptable very quickly, applications are also changing to fit the paradigm of running in a virtual machine instance. We may even see apps deployed with their own VMs as a package.

This also changes the rules of how Service providers deliver. In the past, you had weeks to get the order out, set up the hardware, and do the integration. In the near future, all Service providers and application wares vendors will be pushed to Respond and Deliver at the speed of Cloud.

It changes a couple of things real quickly:

1. He who responds and delivers first will win.
2. Relationships don't deliver anything. Best of Breed is back. No excuse for the blame game.
3. If it takes longer for you to fix a problem than it does to replace, GAME OVER!
4. If your application takes too long to install or doesn't fit in the Cloud paradigm, it may become obsolete SOON.
5. In instances where hardware is required to be tuned to the application, appliances will sell before buying your own hardware.
6. In a Cloud environment, your Java JRE uses resources. Therefore it COSTS. Lighter is better. Look for more Perl, Python, PHP and Ruby. And Javascript seems to be a language now!

Some of the differentiators in the Cloud:

1. Integrated system and unit tests with the application.
2. Inline knowledge base.
3. Migration from tool to a workflow strategy.
4. High agility. Configurability. Customizeability.
5. Hyper-scaling technology like memcached and distributed databases incorporated.
6. Integrated products on an appliance or VM instance.
7. Lightweight and Agile Web 2.0/3.0 UIs.

SNMP MIBs and Data and Information Models

Recently, I was having a discussion about SNMP MIB data and organization and its application into a Federated CMDB and thought it might provoke a bit of thought going forward.

When you compile a MIB for a management application, what you do is organize the definitions and the objects according to name and OID, in a way that's searchable and applicable to performing polling, OID-to-text interpretation, variable interpretation, and definitions. In effect, your compiled MIB turns out to be every possible SNMP object you could poll in your enterprise.
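As a rough illustration of what that searchable organization buys you (a toy sketch, not how any particular compiler actually stores things), here's an OID-to-name lookup with longest-prefix matching, so instanced OIDs arriving in polls or traps still translate to text. The three entries are standard ifTable columns.

```python
# Sketch: the practical payoff of a compiled MIB - a searchable map from OIDs
# to names so poll results and traps can be translated to text. A real
# compile covers every object in every MIB you load; these are just the
# standard ifTable columns.
OID_TO_NAME = {
    "1.3.6.1.2.1.2.2.1.1": "ifIndex",
    "1.3.6.1.2.1.2.2.1.2": "ifDescr",
    "1.3.6.1.2.1.2.2.1.8": "ifOperStatus",
}

def translate(oid):
    """Longest-prefix match so instanced OIDs (column + row index) still resolve."""
    best = None
    for prefix, name in OID_TO_NAME.items():
        if oid == prefix or oid.startswith(prefix + "."):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, name)
    if best is None:
        return oid                      # unknown object: fall back to the raw OID
    prefix, name = best
    instance = oid[len(prefix):].lstrip(".")
    return f"{name}.{instance}" if instance else name

print(translate("1.3.6.1.2.1.2.2.1.8.3"))   # -> ifOperStatus.3
```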

This "Global Tree" has to be broken down logically with each device / agent. When you do this, you build an information model related to each managed object. In breaking this down further, there are branches that are persistent for every node of that type and there are branches that are only populated / instanced if that capability is present.

For example, on a Router that has only LAN type interfaces, you'd see MIB branches for Ethernet like interfaces but not DSX or ATM. These transitional branches are dependent upon configuration and presence of CIs underneath the Node CI - associated with the CIs corresponding to these functions.

From a CMDB Federation standpoint, a CI element has a source of truth from the node itself, using the instance provided via the MIB, and the methods via the MIB branch and attributes. But a MIB goes even further to identify keys on rows in tables, enumerations, data types, and descriptive definitions. A MIB element can even link together multiple MIB Objects based on relationships or inheritance.

In essence, I like the organization of NerveCenter Property Groups and Properties:

Property Groups are initially organized by MIB and they include every branch in that MIB. And these initial Property Groups are assigned to individual Nodes via a mapping of the system.sysObjectID to Property Group. The significance of the Property Group is that it contains a list of the MIB branches applicable to a given node.

These Property Groups are very powerful in that they are how Polls, traps, and other trigger generators are contained and applied according to the end node behavior. For example, you could have a model that uses two different MIB branches via poll definitions, but depending on the node and its property group assignment, only the polls applicable to the node's property group are applied. Yet it is done with a single model definition.

The power behind property groups was that you could add custom properties to property groups and apply these new Property Groups on the fly. So, you could use a custom property to group a specific set of nodes together.

I have set up 3 distinct property groups in NerveCenter corresponding to 3 different polling interval SLAs and used a common model to poll at three different rates, dependent upon the custom properties Cisco_basic, Cisco_advanced, and Cisco_premium, polling at 2 minutes, 1 minute, or 20 seconds respectively.

I used the same trigger name for all three poll definitions but set the property to only be applicable to Cisco_basic, Cisco_advanced, or Cisco_premium respectively.
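To make that mapping concrete, here's a rough sketch of what it amounts to - three property groups, three polling rates, one lookup. This is illustrative Python only; NerveCenter itself does this through its own poll and property definitions, not code like this, and the node names are placeholders.

```python
# Sketch of the property-group idea: one logical poll, three service tiers,
# three polling rates. Illustrative only - node names and the group
# assignments are placeholders for what a real system would discover.
POLL_INTERVALS = {
    "Cisco_basic": 120,      # seconds (2 minutes)
    "Cisco_advanced": 60,    # 1 minute
    "Cisco_premium": 20,     # 20 seconds
}

# Nodes get a default group from sysObjectID; custom groups override per node.
NODE_GROUPS = {
    "edge-rtr-01": "Cisco_basic",
    "dist-rtr-01": "Cisco_advanced",
    "core-rtr-01": "Cisco_premium",
}

def interval_for(node):
    """Return the polling interval implied by the node's property group."""
    group = NODE_GROUPS.get(node, "Cisco_basic")
    return POLL_INTERVALS[group]

for node in NODE_GROUPS:
    print(node, "polls every", interval_for(node), "seconds")
```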

What Property Groups do is enable you to set up and maintain a specific MIB tree for a given node type. Taking this a bit further, in reality, every node has its own MIB tree. Some of the tree is standard for every node of the same type while other branches are option- or capability-specific. This tree actually corresponds to the information model for any given node.

Seems kinda superfluous at this point. If you have an information model, you have a model of data elements and the method to retrieve that data. You also have associative data elements and relational data elements. What's missing?

Associated with the CIs related to capabilities like a DSX interface, an ATM interface, or even something as mundane as an ATA disk drive are elements of information like technical specifications and documentation, process information, warranty and maintenance information... even mundane items like configuration notes.

So, when you're building your information model, the CMDB is only a small portion of the overall information system. But it can be used to meta-tag or cross-reference other data elements and help turn these into a cohesive information model.

This information model ties well into fault management, performance management, and even ITIL Incident, Problem and change management. But you have to think of the whole as an Information model to make things work effectively.

Wouldn't it be awesome if you could manage and respond by service versus just a node? When you get an event on a node, do you enrich the event to provide the service, or do you wait until a ticket is open? If the problem was presented as a service issue, you could work it as a service issue. For example, if you know the service lineage or pathing, you can start to overlay elements of information that empower you to put together a more cohesive awareness of your service.

Let's say you have a simple 3 tier Web enabled application that consists of a web server, an application server, and a database. On the periphery, you have network components, firewalls, switches, etc. How valuable is just the lineage? Now, if I can overlay information elements on this ontology, it comes alive. For example, show me a graph of CPU performance on everything in the lineage. Add in memory and IO utilization. If I can overlay response times for application transactions, the picture becomes indispensable as an aid to situation awareness.

Looking at things from a different perspective, what if I could overlay network errors? Or disk errors? Even seemingly irrelevant elements of information, like ticket activities or the amount of support time spent on each component in the service lineage, take raw data and empower you to present it in a service context.

On a BSM front, what if I could overlay the transaction rate with CPU, IO, Disk, or even application memory size or CPU usage? Scorecards start becoming a bit more relevant it would seem.

In Summary

SNMP MIB data is vital not only to the technical aspects of polling and traps but also to the conversion and linkage of technical information. SNMP is a source of truth for a significant number of CIs, and the MIB definitions tell you how the information is presented.

But all this needs to be part of an Information plan. How do I take this data, derive new data and information, and present it to folks when they need it the most?

BSM assumes that you have the data you need organized in a database that can enable you to present scorecards and service trees. Many applications go through very complex gyrations on SQL queries in an attempt to pull the data out. When the data isn't there or it isn't fully baked, BSM vendors may tend to stub up the data to show that the application works. This gets the sale but the customer ends up finding that the BSM application isn't as much out of the box as the vendor said it was.

These systems depend on Data and information. Work needs to be done to align and index data sources toward being usable. For example, if you commonly use inner and outer joins in queries, you haven't addressed making your data accessible. If it takes elements from 2 tables to do selects on others, you need to work on your data model.

Monday, May 3, 2010

Event Processing...

I ran across this blog post by Tim Bass on Complex Event Processing dubbed "Orwellian Event Processing" and it struck a nerve.

The rules we build into products like Netcool Omnibus, HP Openview, and others are all based on simple if-then-else logic. Yet, through fear that somebody may do something bad with an event loop, recursive processing is shunned.

In his blog, he describes his use of Bayesian Belief Networks to learn versus the static if-then-else logic.

Because BBNs learn patterns through evidence and through cause and effect, the application of BBNs in common event classification and correlation systems makes total sense. And the better you get at classification, the better you get at dealing with uncertainty.

In its simplest form, a BBN outputs a ratio of occurrences ranging from 1 to -1, where 1 means 100 percent certainty that an element in a pattern occurs and -1 means 100 percent certainty that it never occurs.

The interesting part is that statistically, a BBN will recognize patterns that a human cannot. We naturally filter out things that are obfuscated or don't appear relevant.

What if I could have a BBN build my rules files for Netcool based upon statistical analysis of the raw event data? What would it look like compared to current rule sets? Could I set up and establish patterns that lead up to an event horizon? Could I also understand the cause and effect of an event? What would that do to the event presentation?
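I won't pretend to sketch a real BBN here, but the raw material one would learn from is easy to picture: co-occurrence statistics over the event archive. A toy sketch in Perl, with made-up event names and an arbitrary 30 second window, might look like this:

use strict;
use warnings;

# Hypothetical (timestamp, event) pairs pulled from an event archive.
my @events = (
    [ 0,   'LINK_DOWN'     ],
    [ 5,   'OSPF_NBR_LOSS' ],
    [ 60,  'LINK_DOWN'     ],
    [ 63,  'OSPF_NBR_LOSS' ],
    [ 200, 'LINK_DOWN'     ],
    [ 300, 'FAN_FAIL'      ],
);

my $window = 30;    # seconds: "B follows A" if it lands inside this window
my (%count_a, %count_ab);

for my $i (0 .. $#events) {
    my ($ta, $a) = @{ $events[$i] };
    $count_a{$a}++;
    for my $j ($i + 1 .. $#events) {
        my ($tb, $b) = @{ $events[$j] };
        last if $tb - $ta > $window;     # events are time ordered
        $count_ab{$a}{$b}++;
    }
}

# Conditional frequency: how often does B show up within the window of A?
for my $a (sort keys %count_ab) {
    for my $b (sort keys %{ $count_ab{$a} }) {
        printf "P(%s within %ds of %s) ~ %.2f\n",
            $b, $window, $a, $count_ab{$a}{$b} / $count_a{$a};
    }
}

That is not a belief network, just a naive frequency count, but it is the kind of evidence a learning system could turn into candidate correlation rules that a human never would have written down.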

Does this not open up the thought that events are presented in patterns? How could I use that to drive up the accuracy of event presentation?

Sunday, May 2, 2010

Cloud Computing - Notes from a guy behind the curtain

The latest buzz is Cloud computing. When I attended a CloudCamp hosted here in St. Louis, it became rather obvious that the term Cloud computing has an almost unlimited supply of definitions depending upon which Marketing Dweeb you talk to. It can range from hosted MS Exchange to hosted VMs to applications and even hosted services. I'm really not in a position to say what Cloud is or isn't and, in fact, I don't believe there's any way to win that argument. Cloud computing is a marketing perception that is rapidly morphing into whatever marketing types deem necessary to sell something. Right or wrong, the spin doctors own the term and it is whatever they think will sell.

In my own perception, Cloud Computing is a process by which applications, services, and infrastructure are delivered to a customer in a rapid manner, empowering the customer to pay for what they use in small, finite increments. Cloud Computing incorporates a lot of technologies and processes to make this happen - technology like virtualization, configuration management databases, hardware, and software.

What used to take days or weeks to deliver now takes minutes. MINUTES. What does this mean? It takes longer to secure an S Corp and set up a corresponding Tax ID than it does to set up and deliver a new company's web access to customers. And not just local, down-home Main Street customers - you are in business, competing at a GLOBAL LEVEL in MINUTES. And you pay as you go!

Sounds tasty, huh? Now here's a kink. How the heck do you manage this? I know the Big Four management companies say a relationship is more important than Best of Breed. I've heard it in numerous presentations and conversations. If you are in a position to sit still in business, this may be OK for you. Are you so secure in your market space that you do not fear competition to the point where you would sit idly?

Their products reflect this same lack of concern in that it is the same old stuff. It hasn't evolved much - it takes forever to get up and running and it takes months to be productive. For example, IBM Tivoli ITNM/IP takes at least a week of planning just to get ready to install it - IF you have hardware. Next, you need another week and consulting to get things cranking on a discovery for the first time. It takes weeks to integrate into your environment, dealing with community string issues, network access, and even the discovery elements themselves.

Dealing with LDAP and network views is another nightmare altogether.

The UI is clunky, slow, and non-interactive. Way too slow to be used interactively as a diagnostic tool. At least you could work with NNM in earlier versions to get some sort of speed. (Well, with the Web UI, you had HP - slower than molasses - and when you needed something that worked you bought Edge Technologies or Onion Peel.) In the ITNM/IP design, somebody in their infinite wisdom decided to store the map objects and map instances in a binary blob field in MySQL. If you at least had the coordinates, you could FIX the topoviz maps or even display them in something a bit faster and more Web 2.0 - like Flash / Flex. (Hardware is CHEAP!)

And how do you apply this product to a cloud infrastructure? If you can only discover once every few days, I think you're gonna miss a few customers setting up new infrastructures, not to mention any corresponding VMotion events that occur when things fail or load balance. How do you even discover and display the virtual network infrastructure alongside the real network infrastructure?

Even if you wanted to use it like TADDM, Tideway, or DDMi, the underlying database is not architected right. It doesn't allow you to map out the relationships between entities enough to make it viable. Even if you did a custom Discovery agent and plugged in NMap - (Hey! Everybody uses it!) you cannot fit the data correctly. And it isn't even close to the CIM schema.

And every time you want some additional functionality like performance data integration, it's a new check and a new ball game. They sort of attempt to address this by enabling short term polling via the UI. Huge fail. How do you look at data from yesterday? DOH!

ITNM/IP + Cloud == Shelfware.

If we are expected to respond at the speed of Cloud, there is a HUGE pile of compost consisting of management technology of the past that just isn't going to make it. These products take too much support, take too many resources to maintain, and they hold back innovation. The cost just doesn't justify the integration. Even the products we considered untouchable. Many have been architected in a way that paints them into a corner. Once you evolve a tool kit into a solution, you have to take care not to close off integration capabilities along the way.

They take too long to install, take a huge level of effort to keep running, and the yearly maintenance costs can be rather daunting. The Cloud methodology changes the rules a bit. In the cloud, it's SaaS. You sign up for management. You pay for what you get. If you don't like it or want something else, presto changeo - an hour later, you're on a new plan! AND you pay as you go. No more HUGE budget outlays, planning, and negotiation cycles. No more "True-Ups" at the end of the year that kill your upward mobility and career.

BMC - Bring More Cash
EMC - EXtreme Monetary Concerns
IBM - I've Been Mugged!
HP - Huge PriceTag!
CA - Cash Advanced!

Think about Remedy. Huge Cash outlay up front. Takes a long time to get up and running. Takes even longer to get into production. Hard to change over time. And everything custom becomes an ordeal at upgrade time.

They are counting on you not being able to move - that you have become so political and process bound that you couldn't replace it if you wanted to. In fact, in the late 80s and early 90s, there was the notion that the applications that ran on mainframes could never be moved off of those huge platforms. I remember working on the transition from Space Station Freedom to the International Space Station information systems. The old MacDac folks kept telling us there was no way we could move to open systems, especially on time and under budget. 9 months later: 2 ES9000 Duals and a bunch of Vaxes repurposed. 28 applications migrated. Support head count reduced from over 300 to 70. And it was done with less than half of the cost of software maintenance for a year. New software costs were ~15% of what they were before. And we had a lot more users. And it was faster too! NASA. ESA. CSA. NASDA. RSA. All customers.

Bert Beals, Mark Spooner, and Tim Forrester are among my list of folks that had a profound effect on my career in that they taught me through example to keep it simple and that NOTHING is impossible. And to keep asking "And then what?"

And while not every app fits in a VM, there is a growing catalog of appliance based applications that make total sense. You get to optimize the hardware according to the application and its data. That first couple of months of planning, sizing, and procurement - DONE.

And some apps thrive on Cloud virtualization. If you need a data warehouse or are looking to make sense of your data footprint, check out Greenplum - a distributed database BASED on VMs! You plug in resources as VMs as you grow and change!

And the line between the network, the systems, the applications, and the users is disappearing quickly. This presents an ever increasing data challenge: being able to discover and use all these relationships to deliver better services to customers.

Cloud Computing is bringing that revolution and reinvention cycle back into focus in the IT industry. It is a culling event as it will cull out the non-producers and change the customer engagement rules. Best of Breed is back! And with a Vengeance!

Sunday, April 25, 2010

Performance Management Architecture

Performance Management systems in IT infrastructures do a few common things. These are:

Gather performance data
Enable processing of the data to produce:
Events and thresholds
New data and information
Baseline and average information
Present data through a UI or via scheduled reports.
Provide for ad hoc and data mining exercises

Common themes for broken systems include:

If you have to redevelop your application to add new metrics
If you have more than one or two data access points.
If data is not consistent
If reporting mechanisms have to be redeveloped for changes to occur
If a development staff owns access to the data
If a Development staff controls what data gets gathered and stored.
If multiple systems are in place and they overlap (Significantly) in coverage.
If you cannot graph any data newer than 5 minutes.
If there's no such thing as a live graph, or the live graph is done via meta refresh.

I dig SevOne. Easy to set up. Easy to use. Baselines. New graphs. New reports. And schedules. But they also do drill down from SNMP into IPFIX DIRECTLY. No popping out of one system and popping into another. SEAMLESSLY.

It took me 30 minutes or so to rack and stack the appliance. I went back to my desk, verified I could access the appliance, then called the SE. He set up a WebEx, and 7 minutes and a few odd seconds later I got my first reports. Quite a significant difference from the previous Proviso install, which took more than a single day just to install.

The real deal is that with SevOne, your network engineers can get and set up the data collection they need. And the hosting engineers can follow suit. Need a new metric? Engineering sets it up. NO DEVELOPMENT EFFORT.

And it can be done today. Not 3 months from now. When a performance management system cannot be used as part of diagnostics and triage in near real time, it significantly detracts from usability in both the near real time and the longer term trending functions.

BSM - Sounds EXCELLENT. YMMV

Business Service Management

OK. Here goes. First and foremost, I went hunting for a definition. Here's one from Bitpipe that I thought sounded good.

ALSO CALLED: BSM
DEFINITION: A strategy and an approach for linking key IT components to the goals of the business. It enables you to understand and predict how technology impacts the business and how business impacts the IT infrastructure.

Sounds good, right?

When I analyze this definition, it looks very much like the definition for Situation Awareness. Check out the article on Wikipedia.
Situation awareness, or SA, is the perception of environmental elements within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future

So, I see where BSM as a strategy, creates a system where Situation Awareness for business as a function of IT services, can be achieved. In effect, BSM creates SA for business users through IT service practices.

Sounds all fine, good, and well in theory. But in practice, there are a ton of data sources. Some are database enabled. Some are Web services. Some are simple web content elements. How do you assemble, index, and align all this data from multiple sources in a way that enables a business user to achieve situation awareness? How do you handle data sources being timed wrong or failing?

The Road to Success

First of all, if you have a BSM strategy and you're buying or considering a purchase of a BSM framework, you need to seriously consider BI and a data architecture as well. All three technologies are interdependent. You have to organize your data, use it to create information, then make it suitable enough to be presented in a consistent way.

As you develop your data, you also develop your data model. With the data model will come information derivation and working through query and explain plans. In some instances, you need to look at a Data warehouse of sorts. You need to be able to organize and index your data to be presented in a timely and expeditious fashion so that the information helps to drive SA by business users.

A data warehouse sort of product has recently come to my attention: Greenplum. Love the technology. Scalable. But based on mature technology. My thoughts are about taking data from disparate sources, organizing that data, deriving new information, and indexing the data so that the reports you provide can happen in a timely fashion.

Organizing your data around a data warehouse allows you to get around having to deal with multiple databases, multiple access mechanisms, and latency issues. And how much easier it is to analyze cause and effect, derivatives, and patterns when you can search across these data sources from a single access point. It makes true Business Intelligence easier.

BSM products tend to be built around creative SQL queries and dashboard/scorecard generation. You may not need to buy the entire cake to get a taste. Look for web generation utilities that can be used to augment your implementation and strategy.

And if you're implementing a BSM product, wouldn't it make sense to setup SLAs on performance, availability, and response time for the app and its data sources? This is the ONE App that could be used to set a standard and a precedence.

I tend to develop the requirements, then storyboard the dashboards and drill-throughs. This gives you a way of visualizing holes in the dashboards and layouts, but it also enables you to drive to completion. Developing dashboards can really drive scope creep if you don't manage it. Storyboarding allows you to manage expectations and drive delivery.

Saturday, April 24, 2010

SNMP + Polling Techniques

Over the course of many years, it seems that I see the same lack of evolution regarding SNMP polling, how it's accomplished, and the underlying ramifications. To give credit where credit is due, I learned a lot from Ari Hirschman, Eric Wall, Will Pearce, and Alex Keifer. And of the things we learned - Bill Frank, Scott Rife, and Mike O'Brien.

Building an SNMP poller isn't bad, provided you understand the data structures, understand what happens on the end node, and understand how it performs in its client-server model.

First off, there are 5 basic operations one can perform. These are:

GET
GET-NEXT
SET
GET-RESPONSE
GET-BULK

Here is a reference link to RFC-1157 where SNMP v1 is defined.

The GET-BULK operator was introduced when SNMP V2 was proposed, and it carried into SNMP V3. While SNMP V2 was never a standard, its de facto implementations followed the Community based model referenced in RFCs 1901-1908.

SNMP V3 is the current standard for SNMP (STD0062) and version 1 and 2 SNMP are considered obsolete or historical.

SNMP TRAPs and NOTIFICATIONs are event type messages sent from the managed object back to the Manager. In the case of acknowledged NOTIFICATIONs (INFORMs), the Manager returns a response as an acknowledgement.

From a polling perspective, let's start with a basic SNMP GET request. I will illustrate this via the Net::SNMP Perl module directly. (URL is http://search.cpan.org/dist/Net-SNMP/lib/Net/SNMP.pm)

get_request() - send a SNMP get-request to the remote agent

$result = $session->get_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    -varbindlist      => \@oids,
);
This method performs a SNMP get-request query to gather data from the remote agent on the host associated with the Net::SNMP object. The message is built using the list of OBJECT IDENTIFIERs in dotted notation passed to the method as an array reference using the -varbindlist argument. Each OBJECT IDENTIFIER is placed into a single SNMP GetRequest-PDU in the same order that it held in the original list.

A reference to a hash is returned in blocking mode which contains the contents of the VarBindList. In non-blocking mode, a true value is returned when no error has occurred. In either mode, the undefined value is returned when an error has occurred. The error() method may be used to determine the cause of the failure.

This can be either blocking - meaning the request will block until data is returned - or non-blocking, where the call returns right away and a callback subroutine is initiated upon finishing or timing out.

For the args:

-callback is used to attach a handler subroutine for non-blocking calls
-delay is used to delay the SNMP protocol exchange for the given number of seconds.
-contextengineid is used to pass the contextengineid needed for SNMP V3.
-contextname is used to pass the SNMP V3 contextname.
-varbindlist is an array of OIDs to get.

What this does is set up a session object for a given node and run through the gets in the varbindlist one PDU at a time. If you have set it up to be non-blocking, the PDUs are assembled and sent one right after another. If you are using blocking mode, the first PDU is sent and a response is received before the second one is sent.

GET requests require you to know the instance of the attribute ahead of time. Some tables are zero instanced while others may be instanced by one or even multiple indexes. For example, MIB-2.system is a zero instanced table in that there is only one row in the table. Other tables like MIB-2.interfaces.ifTable.ifEntry have multiple rows indexed by ifIndex. Here is a reference to the MIB-2 RFC-1213.
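To make that concrete, here is a minimal, runnable sketch of a blocking GET against the zero-instanced MIB-2.system table. The target address and community string are placeholders; the rest is plain Net::SNMP usage.

#!/usr/bin/perl
use strict;
use warnings;
use Net::SNMP;

my $sysDescr  = '1.3.6.1.2.1.1.1.0';   # MIB-2 system.sysDescr, instance .0
my $sysUpTime = '1.3.6.1.2.1.1.3.0';   # MIB-2 system.sysUpTime, instance .0

my ($session, $error) = Net::SNMP->session(
    -hostname  => '192.0.2.1',          # placeholder address
    -community => 'public',             # placeholder community
    -version   => 'snmpv2c',
);
die "Session error: $error\n" unless defined $session;

# Blocking call: the OIDs go out in one GetRequest-PDU and we wait
# for the GetResponse before moving on.
my $result = $session->get_request(-varbindlist => [ $sysDescr, $sysUpTime ]);
die "Request error: ", $session->error(), "\n" unless defined $result;

printf "sysDescr  : %s\n", $result->{$sysDescr};
printf "sysUpTime : %s\n", $result->{$sysUpTime};

$session->close();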

A GET-NEXT request is like a GET request except that it does not require the instance up front. For example, if you start with a table like ifEntry and you do not know what the first instance is, you would query the table without an instance.

Now here is the GET-NEXT:

$result = $session->get_next_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    -varbindlist      => \@oids,
);

In the Net::SNMP module, each OID in the \@oids array reference is passed as a single PDU instance. And like the GET, it can also be performed in blocking or non-blocking mode.

An snmpwalk is simply a macro of multiple recursive GET-NEXTs for a given starting OID.
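Here is a rough sketch of that macro using Net::SNMP, again with a placeholder host and community: keep issuing GET-NEXTs until the agent hands back an OID that falls outside the subtree you started from.

#!/usr/bin/perl
use strict;
use warnings;
use Net::SNMP qw(oid_base_match);

my $base = '1.3.6.1.2.1.2.2.1.2';   # MIB-2 interfaces.ifTable.ifEntry.ifDescr

my ($session, $error) = Net::SNMP->session(
    -hostname  => '192.0.2.1',       # placeholder address
    -community => 'public',          # placeholder community
    -version   => 'snmpv2c',
);
die "Session error: $error\n" unless defined $session;

my $oid = $base;                     # start without an instance
while (1) {
    my $result = $session->get_next_request(-varbindlist => [ $oid ]);
    last unless defined $result;     # error or end of MIB

    ($oid) = keys %{$result};        # the OID the agent actually returned
    last unless oid_base_match($base, $oid);   # walked off the end of the subtree

    print "$oid = $result->{$oid}\n";
}

$session->close();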

As polling started to evolve, folks started looking for ways to make things a bit more scalable and faster. One of the ways they proposed was the GET-BULK operator. This enabled an SNMP Manager to pull whole portions of an SNMP MIB Table with a single request.

A GET-BULK request is like a GET-NEXT but tells the agent to return as much of the table as it can. And yes, it can return partial results.

$result = $session->get_bulk_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    [-nonrepeaters    => $non_reps,]
    [-maxrepetitions  => $max_reps,]
    -varbindlist      => \@oids,
);

In SNMP V2, the GET-BULK operator came into being. This was done to enable a large amount of table data to be retrieved with a single request. It introduces two new parameters:

nonrepeaters
maxrepetitions

Nonrepeaters tells the get-bulk command that the first N objects in the request should be retrieved as with a simple get-next operation - single successor MIB objects.

Max-repetitions tells the get-bulk command to attempt up to M get-next operations to retrieve the remaining objects - in other words, how many times to repeat the get-next process.

The difficult part of GET-BULK is that you have to guess how many rows are there, and you have to deal with partial returns.
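A sketch of the same ifDescr walk done as a single GET-BULK, assuming the same placeholder host and community as before. The maxrepetitions value of 25 is exactly the kind of guess described above, and the filter at the end deals with the agent spilling past the end of the column.

#!/usr/bin/perl
use strict;
use warnings;
use Net::SNMP qw(oid_base_match);

my $base = '1.3.6.1.2.1.2.2.1.2';    # ifDescr column

my ($session, $error) = Net::SNMP->session(
    -hostname  => '192.0.2.1',        # placeholder address
    -community => 'public',           # placeholder community
    -version   => 'snmpv2c',          # get-bulk needs v2c or v3
);
die "Session error: $error\n" unless defined $session;

my $result = $session->get_bulk_request(
    -nonrepeaters   => 0,             # no "get-next once" scalars up front
    -maxrepetitions => 25,            # guess: up to 25 successors of the OID
    -varbindlist    => [ $base ],
);
die "Request error: ", $session->error(), "\n" unless defined $result;

# The agent may return fewer rows, or varbinds beyond the end of the
# column, so filter the response back down to the subtree we care about.
for my $oid (grep { oid_base_match($base, $_) } keys %{$result}) {
    print "$oid = $result->{$oid}\n";
}

$session->close();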

As things evolved, folks started realizing that multiple OIDs were possible in SNMP GET-NEXT operations through a concept of PDU packing. However, not all agents are created equal. Some will support only a few varbinds in a single PDU while others can support upwards of 512 in a single SNMP PDU.

In effect, by packing PDUs, you can overcome certain annoyances in data like time skew between two attributes given that they can be polled simultaneously.

When you look at the SNMP::Multi module, it not only allows multiple OIDs in a PDU by packing, it enables you to poll a lot of hosts at one time. Following is a "synopsis" quote from the SNMP::Multi module:


use SNMP::Multi;

my $req = SNMP::Multi::VarReq->new(
    nonrepeaters => 1,
    hosts        => [ qw/ router1.my.com router2.my.com / ],
    vars         => [ [ 'sysUpTime' ], [ 'ifInOctets' ], [ 'ifOutOctets' ] ],
);
die "VarReq: $SNMP::Multi::VarReq::error\n" unless $req;

my $sm = SNMP::Multi->new(
    Method      => 'bulkwalk',
    MaxSessions => 32,
    PduPacking  => 16,
    Community   => 'public',
    Version     => '2c',
    Timeout     => 5,
    Retries     => 3,
    UseNumeric  => 1,
    # Any additional options for SNMP::Session::new() ...
)
or die "$SNMP::Multi::error\n";

$sm->request($req) or die $sm->error;
my $resp = $sm->execute() or die "Execute: $SNMP::Multi::error\n";

print "Got response for ", (join ' ', $resp->hostnames()), "\n";
for my $host ($resp->hosts()) {

    print "Results for $host: \n";
    for my $result ($host->results()) {
        if ($result->error()) {
            print "Error with $host: ", $result->error(), "\n";
            next;
        }

        print "Values for $host: ", (join ' ', $result->values());
        for my $varlist ($result->varlists()) {
            print map { "\t" . $_->fmt() . "\n" } @$varlist;
        }
        print "\n";
    }
}

Using the Net::SNMP libraries underneath means that you're still constrained by port, as it uses only one UDP port to poll and handles the callbacks through request IDs. In higher end pollers, the SNMP collector can poll from multiple ports simultaneously.

Summary

A lot of evolution and technique has gone into making SNMP data collection efficient over the years. It would be nice to see SNMP implementations use these enhancements and evolve a bit as well. The evolution of these techniques came about for a reason. When I see places that haven't evolved their SNMP polling techniques, I tend to believe they haven't evolved enough as an IT service to experience the pain that necessitated the lessons learned in the code evolution.

Sunday, April 18, 2010

Web Visualization...

I have been trying to get my head around visualization for several months. Web presentation presents a few challenges that some of the product vendors seem to overlook.

First off, there is an ever increasing propensity for each vendor to develop and produce their own portal. It must be a common Java class in a lot of schools because it is so prevalent. And not all portals are created equal or even open in some cases. I think that while they are redeveloping the wheel, they are missing the point in that they need to develop CONTENT first.

So, what are the essential parts of a portal?

Security Model
Content Customization and Presentation
Content organization

In a security model, you need to understand that users belong to groups and are identified with content and brandings. A user can be part of a team (shared content), assigned access to tools and technologies (content distribution), and will need to be able to organize the data in ways that make it easy for them to work (content brandings).

In some cases, multi-tenancy is a prime concern. How do you segregate discrete content yet share the shareable content?

A Web presence lends itself very well to project or incident based portal instances if you make it easy to put in place new instances pertinent to projects and situations. This empowers the capture of knowledge within given conditions, projects, or team efforts. The more relevant the capture is, the better the information is as an end result. (The longer you wait, the more data and information you lose.)

Single Sign On.

While vendors say they do SSO, they typically only do so across their own product line. Proxying, cookies and sessions, authentication and certificates are all ways of making someone authenticate to access systems.

From the actor perspective, once you have to stop what you're doing to log into another application, subconsciously, you have to switch gears. This switching becomes a hindrance because people will instinctively avoid disruptive processes. And in many cases, this also refocuses the screen on another window which also detracts from user focus.

Every web presence has content, a layout, and a look and feel. Templates for content layout, branding, organization, become the more common elements addressed in a portal. In some cases, language translation also plays a part. In other cases, branding also plays a significant part.

I happen to like Edge Technologies enPortal. Let me explain.

It is a general purpose portal with Single Sign On across products, it has a strong security model, and it lets you deploy web sites as needed. You can sync with LDAP and you can bring in content from a variety of sources... even sources that are not web enabled. They do this with an interface module integrated with Sun Secure Global Desktop (the old Tarantella product...).

The enPortal is solid and fault tolerant. Can be deployed in redundant configurations.

But web visualization in support organizations needs to go much further in the future. It needs to enable collaboration, topology and GIS maps, and fold in external data sources like weather and traffic data. And it needs to incorporate reward mechanisms for users who process data faster and more efficiently.

Data and information must be melded across technologies. Fault to performance to security to applications to even functions like release management, need to be incorporated, content wise.

Some wares vendors in the BSM space claim that they support visualization. They do - in part. A lot of the BSM products out there cater specifically to the CxO level and a couple of levels below that. They lack firm grounding in the bottom layers of an organization. In fact, many times the BSM products will get in the way of the folks on the desks.

A sure fire litmus test is to have the vendor install the product, give them a couple of data sources and have them show you a graphical view of the elements they found. Many cannot even come close! They depend on you to put all the data and relationships together.

Ever thought about the addictiveness of online games? They have reward mechanisms that empower you to earn points, coins, or gold stars - something. These small reward mechanisms shape behavior by rewarding small things that accumulate into better behavior over time.

In many cases, the data underneath required to provide effective visualization is not there, is too difficult to access, or is not in a format that is usable for reporting. When you start looking at data sources, you must examine explain plans, understand indexes as well as views, and be prepared to create information from raw data.

If you can get the data organized, you can use a multitude of products to create good, usable content. Be prepared to create data subsets, cubes of data, reference data elements, as well as provide tools that enable you to munge these data elements and sources, put it all together, and produce some preliminary results.

Netcool and Evolution toward Situation Management

Virtually no new evolution in Fault Management and correlation has been done in the last ten years. Seems we have a presumption that what we have today is as far as we can go. Truly sad.

In recent discussions on the INUG Netcool Users Forum, we discussed shortfalls in the products in hopes that Big Blue may see its way clear of the technical obstacles. I don't think they are accepting or open to my and others' suggestions. But that's OK. You plant a seed - water it - feed it. And hopefully, one day, it comes to life!

Most of the Netcool design is based somewhat loosely on TMF standards. They left out the hard stuff like object modelling, but I understand why. The problem is that most Enterprises and MSPs don't fit the TMF design pattern. Nor do they fit eTOM. This plays specifically to my suggestion that "There's more than one way to do it!" - the slogan behind Perl.

The underlying premise behind Netcool is that it is a single pane of glass for viewing and recognizing what is going on in your environment. It provides a way to achieve situation awareness and a platform which can be used to drive interactive work from. So what about ITIL and Netcool?

From the aspect of product positioning, most ITIL based platforms have turned out to be rehashes of trouble ticketing systems. When you talk to someone about ITIL, they immediately think of HP ITSM or BMC Remedy. Because of the complexity, these systems sometimes take several months to implement. And nothing is cheap. Some folks resort to open source like RT or OTRS. Others want to migrate toward a different, appliance based model like ServiceNow and ScienceLogic EM7.

The problem is that once you transition out of Netcool, you lose your situation awareness. It's like having a notebook full of pages. Once you flip to page 50, pages 1-49 are out of sight and therefore gone. All hell could break loose and you'd never know.

So, why not implement ITIL in Netcool? May be a bit difficult. Here are a few things to consider:

1. The paradigm that an event has only 2 states is bogus.
2. The concept that there are events and these lead to incidents, problems, and changes.
3. Introduces workflow to Netcool.
4. Needs to be aware of CI references and relationships.
5. Introduces the concept that the user is part of the system in lieu of being an external entity.
6. May change the exclusion approach toward event processing.
7. Requires data storage and retrieval capabilities.

End Game

From a point of view where you'd like to end up, there are several use cases one could apply. For example:

One could see a situation develop and get solved in the Netcool display over time. As it is escalated and transitioned, you are able to see what has occurred, the workflow steps taken to solve this, and the people involved.

One could take a given situation and search through all of the events to see which ones may be applicable to the situation. Applying a ranking mechanism like a Google search would help to position somewhat fuzzy information in the proper context for the users.

Be able to take the process as it occurred and diagnose the steps and elements of information to optimize processes in future encounters.

Be able to automate, via the system, steps in the incident / problem process. Like escalations or notifications. Or executing some action externally.

Once you introduce workflow to Netcool, you need to introduce the concept of user awareness and collaboration. Who is online? What situations are they actively working versus observing? How do you handle Management escalations?

In ITIL definitions, an Incident has a defined workflow process from start to finish. Netcool could help to make the users aware of the process along with its effectiveness. Even in a simple event display you can show last, current and next steps in fields.

Value Proposition

From the aspect of implementation, ITIL based systems have been focused solely around trouble ticketing systems. These systems have become huge behemoths of applications, and with this come two significant factors that hinder success - the loss of situation awareness and the inability to realize and optimize processes in the near term.

These behemoth systems are difficult to adapt and difficult to keep in step with optimizations. As such, they slow down the optimization process, making it painful to move forward. If it's hard to optimize, it will be hard to differentiate your service because you cannot adapt to changes and measure the effectiveness fast enough to do any good.

A support organization that is aware of what's going on subliminally portrays confidence. This confidence carries a huge weight in interactions with customers and staff alike. It is a different world on a desk when you're empowered to do good work for your customer.

More to come!

Hopefully, this will provide some food for thought on the evolution of event management into Situation Management. In the coming days I plan on adding to this thread several concepts like evolution toward complex event processing, Situation Awareness and Knowledge, data warehousing, and visualization.