Sunday, April 25, 2010

Performance Management Architecture

Performance Management systems in IT infrastructures do a few common things (a quick sketch of the processing piece follows this list). These are:

Gather performance data
Enable processing of the data to produce:
Events and thresholds
New data and information
Baseline and average information
Present data through a UI or via scheduled reports.
Provide for ad hoc and data mining exercises
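
For the processing bullet above, here's a minimal, hedged sketch (hypothetical metric values and threshold, not tied to any particular product) of turning gathered samples into a threshold event and a baseline average:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical samples gathered by a poller: 5-minute CPU utilization readings.
my @samples   = (42, 47, 51, 88, 45, 44);
my $threshold = 80;

# Threshold processing: emit an event when a sample crosses the limit.
for my $value (@samples) {
    print "EVENT: cpu utilization ${value}% exceeded ${threshold}% threshold\n"
        if $value > $threshold;
}

# Baseline / average information derived from the same raw data.
my $average = sum(@samples) / @samples;
printf "BASELINE: average cpu utilization %.1f%% over %d samples\n",
    $average, scalar @samples;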

Common themes for broken systems include:

If you have to redevelop your application to add new metrics
If you have more than one or two data access points.
If data is not consistent
If reporting mechanisms have to be redeveloped for changes to occur
If a development staff owns access to the data
If a Development staff controls what data gets gathered and stored.
If multiple systems are in place and they overlap significantly in coverage.
If you cannot graph any data newer than 5 minutes.
If there's no such thing as a live graph, or the live graph is done via a meta refresh.

I dig SevOne. Easy to set up. Easy to use. Baselines. New graphs. New reports. And schedules. But they also do drill down from SNMP into IPFIX DIRECTLY. No popping out of one system and into another. SEAMLESSLY.

It took me 30 minutes or so to rack and stack the appliance. I went back to my desk, verified I could access the appliance, then called the SE. He set up a WebEx, and 7 minutes and a few odd seconds later I had my first reports. Quite a significant difference from the previous Proviso install, which took more than a single day.

The real deal is that with SevOne, your network engineers can get and set up the data collection they need. And the hosting engineers can follow suit. Need a new metric? Engineering sets it up. NO DEVELOPMENT EFFORT.

And it can be done today. Not 3 months from now. When something like a performance management system cannot be used as part of near real time diagnostics and triage, it significantly detracts from usability in both the near real time and the longer term trending functions.

BSM - Sounds EXCELLENT. YMMV

Business Service Management

OK. Here goes. First and foremost, I went hunting for a definition. Here's one from Bitpipe that I thought sounded good.

ALSO CALLED: BSM
DEFINITION: A strategy and an approach for linking key IT components to the goals of the business. It enables you to understand and predict how technology impacts the business and how business impacts the IT infrastructure.

Sounds good, right?

When I analyze this definition, it looks very much like the definition for Situation Awareness. Check out the article on Wikipedia.
"Situation awareness, or SA, is the perception of environmental elements within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future."

So I see where BSM, as a strategy, creates a system where Situation Awareness for the business, as a function of IT services, can be achieved. In effect, BSM creates SA for business users through IT service practices.

Sounds all fine, good, and well in theory. But in practice, there are a ton of data sources. Some are database enabled. Some are Web services. Some are simple web content elements. How do you assemble, index, and align all this data from multiple sources in a way that enables a business user to achieve situation awareness? How do you handle data sources that are mistimed or failing?

The Road to Success

First of all, if you have a BSM strategy and you're buying or considering a purchase of a BSM framework, you need to seriously consider BI and a Data Architecture as well. All three technologies are interdependent. You have to organize your data, use it to create information, then make it suitable enough to be presented in a consistent way.

As you develop your data, you also develop your data model. With the data model will come information derivation and working through query and explain plans. In some instances, you need to look at a Data warehouse of sorts. You need to be able to organize and index your data to be presented in a timely and expeditious fashion so that the information helps to drive SA by business users.

A recent data warehouse sort of product has come to my attention: Greenplum. Love the technology. Scalable, but based on mature technology. My thoughts are about taking data from disparate sources, organizing that data, deriving new information, and indexing the data so that the reports you provide can happen in a timely fashion.

Organizing your data around a data warehouse allows you to get around having to deal with multiple databases, multiple access mechanisms, and latency issues. And it is much easier to analyze cause and effect, derivatives, and patterns when you can search across these data sources from a single access point. It makes true Business Intelligence easier.
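
As a hedged sketch of that single access point (hypothetical table and column names; Greenplum speaks the PostgreSQL protocol, so Perl DBI with DBD::Pg is one plausible way in), here is what deriving hourly service response averages from data that originally lived in several source systems might look like:

use strict;
use warnings;
use DBI;

# Hypothetical warehouse connection and schema -- adjust for your environment.
my $dbh = DBI->connect('dbi:Pg:dbname=warehouse;host=dw01', 'report', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

# One query across data that originally lived in several source systems,
# already loaded and indexed in the warehouse.
my $sth = $dbh->prepare(q{
    SELECT  s.service_name,
            date_trunc('hour', m.sample_time) AS hour,
            avg(m.response_ms)                AS avg_response_ms
    FROM    perf_samples m
    JOIN    service_map  s ON s.node_id = m.node_id
    GROUP BY s.service_name, date_trunc('hour', m.sample_time)
    ORDER BY hour
});
$sth->execute();

while (my $row = $sth->fetchrow_hashref()) {
    printf "%s %s %.1f ms\n",
        $row->{hour}, $row->{service_name}, $row->{avg_response_ms};
}
$dbh->disconnect();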

BSM products tend to be built around creative SQL queries and dashboard/scorecard generation. You may not need to buy the entire cake to get a taste. Look for web generation utilities that can be used to augment your implementation and strategy.

And if you're implementing a BSM product, wouldn't it make sense to set up SLAs on performance, availability, and response time for the app and its data sources? This is the ONE app that could be used to set a standard and a precedent.

I tend to develop the requirements, then storyboard the dashboards and drill throughs. This gives you a way of visualizing holes in the dashboards and layouts but it also enables you to drive to completion. Developing dashboards can really drive scope creep if you don't manage it.
Storyboarding allows you to manage expectations and drive delivery.

Saturday, April 24, 2010

SNMP + Polling Techniques

Over the course of many years, it seems that I see the same lack of evolution regarding SNMP polling, how it's accomplished, and the underlying ramifications. To give credit where credit is due, I learned a lot from Ari Hirschman, Eric Wall, Will Pearce, and Alex Keifer. And credit for the things we learned along the way goes to Bill Frank, Scott Rife, and Mike O'Brien.

Building an SNMP poller isn't bad. Provided you understand the data structures, understand what happens on the end node, and understand how it performs in its client server model.

First off, there are 5 basic operations one can perform. These are:

GET
GET-NEXT
SET
GET-RESPONSE
GET-BULK

Here is a reference link to RFC-1157 where SNMP v1 is defined.

The GET-BULK operator was introduced when SNMP V2 was proposed, and it carried into SNMP V3. While SNMP V2 was never a standard, its de facto implementations followed the Community-based model referenced in RFCs 1901-1908.

SNMP V3 is the current standard for SNMP (STD0062), and versions 1 and 2 of SNMP are considered obsolete or historical.

SNMP TRAPs and NOTIFICATIONs are event-type messages sent from the managed object back to the Manager. In the case of acknowledged notifications (INFORMs), the Manager returns an acknowledgement to the sender.
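
As an aside, here is a minimal, hedged sketch of sending an acknowledged notification with the Net::SNMP Perl module discussed below (the target host and enterprise OIDs are made up); the acknowledgement from the manager is what separates an INFORM from a plain TRAP:

use strict;
use warnings;
use Net::SNMP qw(:asn1);

my ($session, $error) = Net::SNMP->session(
    -hostname  => 'manager.example.com',
    -community => 'public',
    -version   => 'snmpv2c',
);
die "Session error: $error\n" unless defined $session;

# An InformRequest carries sysUpTime.0 and snmpTrapOID.0 first,
# followed by any additional varbinds.
my $result = $session->inform_request(
    -varbindlist => [
        '1.3.6.1.2.1.1.3.0',     TIMETICKS,         600,
        '1.3.6.1.6.3.1.1.4.1.0', OBJECT_IDENTIFIER, '1.3.6.1.4.1.99999.1',
        '1.3.6.1.4.1.99999.1.1', OCTET_STRING,      'link degraded',
    ],
);
printf "Inform %s\n",
    defined $result ? 'acknowledged' : 'failed: ' . $session->error();
$session->close();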

From a polling perspective, let's start with a basic SNMP Get request. I will illustrate this via the Net::SNMP Perl module directly. (URL is http://search.cpan.org/dist/Net-SNMP/lib/Net/SNMP.pm)

get_request() - send a SNMP get-request to the remote agent

$result = $session->get_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    -varbindlist      => \@oids,
);
This method performs a SNMP get-request query to gather data from the remote agent on the host associated with the Net::SNMP object. The message is built using the list of OBJECT IDENTIFIERs in dotted notation passed to the method as an array reference using the -varbindlist argument. Each OBJECT IDENTIFIER is placed into a single SNMP GetRequest-PDU in the same order that it held in the original list.

A reference to a hash is returned in blocking mode which contains the contents of the VarBindList. In non-blocking mode, a true value is returned when no error has occurred. In either mode, the undefined value is returned when an error has occurred. The error() method may be used to determine the cause of the failure.

This can be either blocking - meaning the request will block until data is returned - or non-blocking - meaning the call returns right away and a callback subroutine is invoked when the request completes or times out.

For the args:

-callback is used to attach a handler subroutine for non-blocking calls
-delay is used to delay the SNMP protocol exchange for the given number of seconds.
-contextengineid is used to pass the contextengineid needed for SNMP V3.
-contextname is used to pass the SNMP V3 contextname.
-varbindlist is an array of OIDs to get.

What this does is set up a session object for a given node and send a get-request for the OIDs in the varbindlist (per the documentation above, they are carried in a single GetRequest-PDU). If you have set it up to be non-blocking, the request is queued and the callback fires when the response, or a timeout, arrives; you can queue requests for many nodes and let them run in parallel. If you are using blocking mode, the call does not return until the response is received or the request times out.

GET requests require you to know the instance of the attribute ahead of time. Some tables are zero instanced while others may be instanced by one or even multiple indexes. For example, MIB-2.system is a zero instanced table in that there is only one row in the table. Other tables like MIB-2.interfaces.ifTable.ifEntry have multiple rows indexed by ifIndex. Here is a reference to the MIB-2 RFC-1213.
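
A small, hedged example of that instancing using the get_request() call above (hypothetical host and community string): sysDescr lives in the zero-instanced system group, so its instance is .0, while ifInOctets has to be qualified with an ifIndex.

use strict;
use warnings;
use Net::SNMP;

my ($session, $error) = Net::SNMP->session(
    -hostname  => 'router1.example.com',
    -community => 'public',
    -version   => 'snmpv2c',
);
die "Session error: $error\n" unless defined $session;

my $sysDescr   = '1.3.6.1.2.1.1.1.0';      # zero-instanced: instance .0
my $ifInOctets = '1.3.6.1.2.1.2.2.1.10.2'; # instanced by ifIndex: interface 2

my $result = $session->get_request(-varbindlist => [ $sysDescr, $ifInOctets ]);
die "Get error: ", $session->error(), "\n" unless defined $result;

print "sysDescr.0   = $result->{$sysDescr}\n";
print "ifInOctets.2 = $result->{$ifInOctets}\n";
$session->close();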

A GET-NEXT request is like a GET request except that it does not require the instance up front. For example, if you start with a table like ifEntry and you do not know what the first instance is, you would query the table without an instance.

Now here is the GET-NEXT:

$result = $session->get_next_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    -varbindlist      => \@oids,
);

In the Net::SNMP module, the OIDs in the \@oids array reference are carried in the GetNextRequest-PDU. And like the GET, it can also be performed in blocking mode or non-blocking mode.

An snmpwalk is simply a macro of repeated GET-NEXTs for a given starting OID, each request picking up from the OID the previous response returned.
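
Here's a hedged sketch of that macro (same hypothetical host as above): keep issuing get-nexts, each one starting from the OID the agent just returned, and stop once the reply falls outside the subtree you started in.

use strict;
use warnings;
use Net::SNMP qw(oid_base_match);

my ($session, $error) = Net::SNMP->session(
    -hostname  => 'router1.example.com',
    -community => 'public',
);
die "Session error: $error\n" unless defined $session;

my $base = '1.3.6.1.2.1.2.2.1.2';   # ifDescr column of the ifTable
my $oid  = $base;

while (1) {
    my $result = $session->get_next_request(-varbindlist => [ $oid ]);
    last unless defined $result;              # error or timeout ends the walk

    ($oid) = keys %{$result};                 # the OID the agent actually returned
    last unless oid_base_match($base, $oid);  # stepped past the subtree - done

    print "$oid = $result->{$oid}\n";
}
$session->close();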

As polling started to evolve, folks started looking for ways to make things a bit more scalable and faster. One of the ways they proposed was the GET-BULK operator. This enabled an SNMP Manager to pull whole portions of an SNMP MIB Table with a single request.

A GET-BULK request is like a GET-NEXT but tells the agent to return as much as it can from the table. And yes, it can return partial results.

$result = $session->get_bulk_request(
    [-callback        => sub {},]     # non-blocking
    [-delay           => $seconds,]   # non-blocking
    [-contextengineid => $engine_id,] # v3
    [-contextname     => $name,]      # v3
    [-nonrepeaters    => $non_reps,]
    [-maxrepetitions  => $max_reps,]
    -varbindlist      => \@oids,
);

In SNMP V2, the GET BULK operator came into being. This was done to enable a large amount of table data to be retrieved from a single request. It does introduce two new parameters:

non-repeaters
max-repetitions

Non-repeaters tells the get-bulk command that the first N objects are to be retrieved as with a simple get-next operation - single successor MIB objects.

Max-repetitions tells the get-bulk command to attempt up to M get-next operations to retrieve the remaining objects - in other words, how many times to repeat the get-next process for the repeating columns.

The difficult part of GET-BULK is that you have to guess how many rows are there, and you have to deal with partial returns.
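
A hedged sketch of coping with that (hypothetical host again): ask for a block of rows, keep whatever landed inside the column, and if the last OID returned is still inside the table, re-issue the get-bulk starting from it.

use strict;
use warnings;
use Net::SNMP qw(oid_base_match oid_lex_sort);

my ($session, $error) = Net::SNMP->session(
    -hostname  => 'router1.example.com',
    -community => 'public',
    -version   => 'snmpv2c',
);
die "Session error: $error\n" unless defined $session;

my $base = '1.3.6.1.2.1.2.2.1.10';   # ifInOctets column
my $last = $base;
my %table;

while (1) {
    my $result = $session->get_bulk_request(
        -maxrepetitions => 10,        # a guess at how many rows remain
        -varbindlist    => [ $last ],
    );
    last unless defined $result;

    my $in_table = 0;
    for my $oid (oid_lex_sort(keys %{$result})) {
        next unless oid_base_match($base, $oid);   # ignore anything past the column
        $table{$oid} = $result->{$oid};
        $last        = $oid;
        $in_table    = 1;
    }
    last unless $in_table;            # the block ran off the end of the table: done
}

print "$_ = $table{$_}\n" for oid_lex_sort(keys %table);
$session->close();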

As things evolved, folks started realizing that multiple OIDs were possible in SNMP GET-NEXT operations through a concept of PDU packing. However, not all agents are created equal. Some will support only a few varbinds in a single PDU, while others can support upwards of 512 in a single SNMP PDU.

In effect, by packing PDUs, you can overcome certain annoyances in data like time skew between two attributes given that they can be polled simultaneously.
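
For instance, a hedged little example (hypothetical host, reusing the MIB-2 interface counters) of asking for both counters of one interface in a single request so the agent samples them together:

use strict;
use warnings;
use Net::SNMP;

my ($session, $error) = Net::SNMP->session(
    -hostname  => 'router1.example.com',
    -community => 'public',
);
die "Session error: $error\n" unless defined $session;

# Both counters ride in one GetRequest-PDU, so ifInOctets.2 and ifOutOctets.2
# are sampled by the agent at effectively the same instant -- no time skew.
my $result = $session->get_request(
    -varbindlist => [
        '1.3.6.1.2.1.2.2.1.10.2',   # ifInOctets.2
        '1.3.6.1.2.1.2.2.1.16.2',   # ifOutOctets.2
    ],
);
die "Get error: ", $session->error(), "\n" unless defined $result;

printf "in=%s out=%s\n",
    $result->{'1.3.6.1.2.1.2.2.1.10.2'}, $result->{'1.3.6.1.2.1.2.2.1.16.2'};
$session->close();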

When you look at the SNMP::Multi module, it not only allows multiple OIDs in a PDU by packing, it enables you to poll a lot of hosts at one time. Following is a "synopsis" quote from the SNMP::Multi module:


use SNMP::Multi;

my $req = SNMP::Multi::VarReq->new(
    nonrepeaters => 1,
    hosts        => [ qw/ router1.my.com router2.my.com / ],
    vars         => [ [ 'sysUpTime' ], [ 'ifInOctets' ], [ 'ifOutOctets' ] ],
);
die "VarReq: $SNMP::Multi::VarReq::error\n" unless $req;

my $sm = SNMP::Multi->new(
    Method      => 'bulkwalk',
    MaxSessions => 32,
    PduPacking  => 16,
    Community   => 'public',
    Version     => '2c',
    Timeout     => 5,
    Retries     => 3,
    UseNumeric  => 1,
    # Any additional options for SNMP::Session::new() ...
) or die "$SNMP::Multi::error\n";

$sm->request($req) or die $sm->error;
my $resp = $sm->execute() or die "Execute: $SNMP::Multi::error\n";

print "Got response for ", (join ' ', $resp->hostnames()), "\n";
for my $host ($resp->hosts()) {

    print "Results for $host: \n";
    for my $result ($host->results()) {
        if ($result->error()) {
            print "Error with $host: ", $result->error(), "\n";
            next;
        }

        print "Values for $host: ", (join ' ', $result->values());
        for my $varlist ($result->varlists()) {
            print map { "\t" . $_->fmt() . "\n" } @$varlist;
        }
        print "\n";
    }
}

Using the Net::SNMP libraries underneath means that you're still constrained by port: the poller uses one UDP port and matches responses to callbacks via request IDs. In higher end pollers, the SNMP collector can poll from multiple ports simultaneously.
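
A hedged sketch of that single-port, request-ID-multiplexed style with plain Net::SNMP (hypothetical hosts): every session queues its request, and one dispatcher loop services all the responses via callbacks.

use strict;
use warnings;
use Net::SNMP qw(snmp_dispatcher);

my $sysUpTime = '1.3.6.1.2.1.1.3.0';
my @hosts     = qw(router1.example.com router2.example.com switch1.example.com);

for my $host (@hosts) {
    my ($session, $error) = Net::SNMP->session(
        -hostname    => $host,
        -community   => 'public',
        -nonblocking => 1,
    );
    if (!defined $session) {
        warn "Session error for $host: $error\n";
        next;
    }

    # Queue the request; the callback fires when this host answers or times out.
    $session->get_request(
        -varbindlist => [ $sysUpTime ],
        -callback    => [ \&poll_done, $host ],
    );
}

# One call services every outstanding request and invokes the callbacks.
snmp_dispatcher();

sub poll_done {
    my ($session, $host) = @_;
    my $result = $session->var_bind_list();
    if (defined $result) {
        print "$host sysUpTime = $result->{$sysUpTime}\n";
    } else {
        print "$host failed: ", $session->error(), "\n";
    }
}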

Summary

A lot of evolution and technique has gone into making SNMP data collection efficient over the years. It would be nice to see SNMP implementations use these enhancements and evolve a bit as well. The evolution of these techniques came about for a reason. When I see places that haven't evolved their SNMP polling techniques, I tend to believe they haven't evolved enough as an IT service to experience the pain that necessitated the lessons learned in the code's evolution.

Sunday, April 18, 2010

Web Visualization...

I have been trying to get my head around visualization for several months. Web presentation presents a few challenges that some of the product vendors seem to overlook.

First off, there is an ever increasing propensity for each vendor to develop and produce their own portal. It must be a common Java class in a lot of schools because it is so prevalent. And not all portals are created equal, or even open in some cases. I think that while they are reinventing the wheel, they are missing the point: they need to develop CONTENT first.

So, what are the essential parts of a portal?

Security Model
Content Customization and Presentation
Content organization

In a security model, you need to understand that users belong to groups and are identified with content and brandings. A user can be part of a team (shared content), assigned access to tools and technologies (content distribution), and will need to be able to organize the data in ways that make it easy for them to work (content brandings).

In some cases, multi-tenancy is a prime concern. How do you segregate discrete content yet share the shareable content?

A Web presence lends itself very well to project or incident based portal instances if you make it easy to put in place new instances pertinent to projects and situations. This empowers the capture of knowledge within given conditions, projects, or team efforts. The more relevant the capture is, the better the information is as an end result. (The longer you wait, the more data and information you lose.)

Single Sign On.

While vendors say they do SSO, they typically only do so across their own product line. Proxying, cookies and sessions, authentication and certificates are all mechanisms by which someone ends up having to authenticate to access systems.

From the actor perspective, once you have to stop what you're doing to log into another application, subconsciously, you have to switch gears. This switching becomes a hindrance because people will instinctively avoid disruptive processes. And in many cases, this also refocuses the screen on another window which also detracts from user focus.

Every web presence has content, a layout, and a look and feel. Templates for content layout, branding, and organization become the more common elements addressed in a portal. In some cases, language translation also plays a part. In other cases, branding also plays a significant part.

I happen to like Edge Technologies enPortal. Let me explain.

It is a general purpose Portal with Single Sign On across products, it has a strong security model, and it lets you deploy web sites as needed. You can synch with LDAP and you can bring in content from a variety of sources... even sources that are not web enabled. They do this with an interface module integrated with Sun Secure Global Desktop (the old Tarantella product...)

The enPortal is solid and fault tolerant, and it can be deployed in redundant configurations.

But web visualization in support organizations needs to go much further in the future. It needs to enable collaboration, topology and GIS maps, and fold in external data sources like weather and traffic data. And it needs to incorporate reward mechanisms for users who process data faster and more efficiently.

Data and information must be melded across technologies. Fault to performance to security to applications - even functions like release management - need to be incorporated, content-wise.

Some software vendors in the BSM space claim that they support visualization. They do. In part... A lot of the BSM products out there cater specifically to the CxO level and a couple of levels below that. They lack firm grounding in the bottom layers of an organization. In fact, many times the BSM products will get in the way of the folks on the desks.

A sure fire litmus test is to have the vendor install the product, give them a couple of data sources and have them show you a graphical view of the elements they found. Many cannot even come close! They depend on you to put all the data and relationships together.

Ever thought about the addictiveness of online games? They have reward mechanisms that empower you to earn points, coins, or gold stars - something. These small reward mechanisms shape behavior by rewarding small things to accumulate better behavior over time.

In many cases, the data underneath required to provide effective visualization is not there, is too difficult to access, or is not in a format that is usable for reporting. When you start looking at data sources, you must examine explain plans, understand indexes as well as views, and be prepared to create information from raw data.

If you can get the data organized, you can use a multitude of products to create good, usable content. Be prepared to create data subsets, cubes of data, reference data elements, as well as provide tools that enable you to munge these data elements and sources, put it all together, and produce some preliminary results.

Netcool and Evolution toward Situation Management

Virtually no new evolution in Fault Management and correlation has been done in the last ten years. Seems we have a presumption that what we have today is as far as we can go. Truly sad.

In recent discussions on the INUG Netcool Users Forum, we discussed shortfalls in the products in hopes that Big Blue may see its way clear of the technical obstacles. I don't think they are accepting of or open to my suggestions and others'. But that's OK. You plant a seed - water it - feed it. And hopefully, one day, it comes to life!

Most of Netcool design is based somewhat loosely on TMF standards. They left out the hard stuff like object modelling but I understand why. The problem is that most Enterprises and MSPs don't fit the TMF design pattern. Nor do they fit eTOM. This plays specifically to my suggestion that "There's more than one way to do it!" - The Slogan behind Perl.

The underlying premise behind Netcool is that it is a single pane of glass for viewing and recognizing what is going on in your environment. It provides a way to achieve situation awareness and a platform which can be used to drive interactive work from. So what about ITIL and Netcool?

From the aspect of product positioning, most ITIL based platforms have turned out to be rehashes of Trouble Ticketing systems. When you talk to someone about ITIL, they immediately think of HP ITSM or BMC Remedy. Because of the complexity, these systems sometimes take several months to implement. And nothing is cheap. Some folks resort to open source like RT or OTRS. Others want to migrate toward a different, appliance based model like ServiceNow and ScienceLogic EM7.

The problem is that once you transition out of Netcool, you lose your situation awareness. It's like having a notebook full of pages. Once you flip to page 50, pages 1-49 are out of sight and therefore gone. All hell could break loose and you'd never know.

So, why not implement ITIL in Netcool? May be a bit difficult. Here are a few things to consider:

1. The paradigm that an event has only 2 states is bogus.
2. The concept that there are events and these lead to incidents, problems, and changes.
3. Introduces workflow to Netcool.
4. Needs to be aware of CI references and relationships.
5. Introduces the concept that the user is part of the system in lieu of being an external entity.
6. May change the exclusion approach toward event processing.
7. Requires data storage and retrieval capabilities.

End Game

From a point of view where you'd like to end up, there are several use cases one could apply. For example:

One could see a situation develop and get solved in the Netcool display over time. As it is escalated and transitioned, you are able to see what has occurred, the workflow steps taken to solve this, and the people involved.

One could take a given situation and search through all of the events to see which ones may be applicable to the situation. Applying a ranking mechanism like a google search would help to position somewhat fuzzy information in proper contexts for the users.

Be able to take the process as it occurred and diagnose the steps and elements of information to optimize processes in future encounters.

Be able to automate, via the system, steps in the incident / problem process. Like escalations or notifications. Or executing some action externally.

Once you introduce workflow to Netcool, you need to introduce the concept of user awareness and collaboration. Who is online? What situations are they actively working versus observing? How do you handle Management escalations?

In ITIL definitions, an Incident has a defined workflow process from start to finish. Netcool could help to make the users aware of the process along with its effectiveness. Even in a simple event display you can show last, current and next steps in fields.

Value Proposition

From the aspect of implementation, ITIL based systems have been focused solely around trouble ticketing systems. These systems have become huge behemoths of applications, and with this come two significant factors that hinder success - the loss of Situation Awareness and the inability to realize and optimize processes in the near term.

These behemoth systems become difficult to adapt and difficult to keep up with optimizations. As such, they slow down the optimization process, making it painful to move forward. If it's hard to optimize, it will be hard to differentiate your service, because you cannot adapt to changes and measure the effectiveness fast enough to do any good.

A support organization that is aware of what's going on subliminally portrays confidence. This confidence carries a huge weight in interactions with customers and staff alike. It is a different world on a desk when you're empowered to do good work for your customer.

More to come!

Hopefully, this will provide some food for thought on the evolution of event management into Situation Management. In the coming days I plan on adding to this thread several concepts like evolution toward complex event processing, Situation Awareness and Knowledge, data warehousing, and visualization.

Sunday, April 11, 2010

Fault and Event Management - Are we missing the boat?

In the beginning, folks used to tail log files. As interesting things showed up, folks would see what was happening and respond to the situation. Obviously, this didn't scale too well, as you can only get about 35-40 lines per screen. As things evolved, folks looked for ways to visually cue the important messages. When you look at swatch, it changes text colors and background colors, adds blinking, etc., as interesting things are noted.

Applications like xnmevents from OpenView NNM provided an event display that is basically a sequential event list.  (Here's a link to an image snapshot of xnmevents in action -> HERE! )

In OpenView Operations, events are aligned to nodes that belong to the user. If the event is received from a node that is not in the user's node group, they don't receive the event.

Some applications tended to attempt to mask downstream events from users through topology based correlation.  And while this appears to be a good thing in that it reduces the numbers of events in an events display, it takes away the ability to notify customers based on side effect events. A true double edged sword - especially if you want to be customer focused.

With some implementations, the focus of event management is to qualify and only include those events that are perceived as being worthy of being displayed. While it may seem a valid strategy, the importance should be on the situation awareness of the NOC and not on the enrichment.  You may miss whole pieces of information and awareness... But your customers and end users may not miss them!

All in all, we're still just talking about discrete events here. These events may or may not be conditional or situational, or even pertinent to a particular user's perspective.

From an ITIL perspective (well, I have gathered the 3 different versions of the ITIL Incident definition as things have evolved), an Incident is defined as:

"Incident (ITILv3): [Service Operation] An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident - for example, failure of one disk from a mirror set. See also: Problem.

Incident (ITILv2): An event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity. An Incident might give rise to the identification and investigation of a Problem, but never becomes a Problem. Even if handed over to the Problem Management process for 2nd Line Incident Control, it remains an Incident. Problem Management might, however, manage the resolution of the Incident and Problem in tandem, for instance if the Incident can only be closed by resolution of the Problem.

Incident (ITILv1): An event which is not part of the normal operation of an IT Service. It will have an impact on the service, although this may be slight and may even be transparent to customers."

From the ITIL specification folks, I got this on Incident Management: Ref: http://www.itlibrary.org/index.php?page=Incident_Management
Quoting them "

'Real World' definition of Incident Management: IM is the way that the Service Desk puts out the 'daily fires'.


An 'Incident' is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service.


The objective of Incident Management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.


Inputs for Incident Management mostly come from users, but can have other sources as well like management Information or Detection Systems. The outputs of the process are RFC’s (Requests for Changes), resolved and closed Incidents, management information and communication to the customer.


Activities of the Incident Management process:


Incident detection and recording
Classification and initial support
Investigation and diagnosis
Resolution and recovery
Incident closure
Incident ownership, monitoring, tracking and communication


These elements provides a baseline for management review.
"

Also, I got this snippet from the same web site :

"Incidents and Service Requests are formally managed through a staged process to conclusion. This process is referred to as the "Incident Management Lifecycle". The objective of the Incident Management Lifecycle is to restore the service as quickly as possible to meet Service Level Agreements. The process is primarily aimed at the user level."

From an Event perspective, an event may or may not signify an Incident. An incident, by definition, has a lifecycle from start to conclusion, which means it is a defined process. This process can and should be mapped out, optimized, and documented.

Even processing an unknown event should, according to ITIL best practices, align your process steps to the Incident Lifecycle - through an escalation that captures and uses the information derived from the new incident so it can be mapped, process wise.

So, if one is presented with an event, is it an incident? If it is, what is the process by which this Incident is handled? And if it is an Incident and it is being processed, what step of the Incident process is it at? How long has it been in processing? What steps need to be taken right away to process this incident effectively?

From a real world perspective, the events we work from are discrete events. They may be presented in a way that signifies a discrete "start of an Incident" process. But inherently, an Incident may have several valid inputs from discrete events as part of the Incident Management process.

So, are we missing the boat here? Is every event presented an Incident? Not hardly. Now, intuitively, are your users managing events or incidents? Events - Hmmm Thought so.  How do you apply process and process optimization to something you don't inherently manage to in real time?  Incident management becomes an ABSTRACTION of event management. And you manage to Events in hopes that you'll make Incident Management better.

My take is that the abstraction is backwards because the software hasn't evolved to be incident / problem focused. So you see folks optimize to events, as that's the way information is presented to them. But it is not the same as managing incidents.

For example, let's say I have a disk drive go south for the winter. And OK, it's mirrored and is capable of being corrected without downtime. AWESOME. However, when you replace the drive, your mirror has to synch. When it does, applications that use that drive - let's say a database - are held back from operating due to the synchronization.

From the aspect of incidents, you have a disk drive failure which is an incident to the System administrator for the system. This disk drive error may present thousands of events in that the dependencies of the CIs upon the failed or errored component span over into multiple areas.  For example, if you're scraping error logs and sending them in as traps, each unique event presents itself as something separate. Application performance thresholds present events depicting conditional changes in performance. 

This one incident could have a profound waterfall effect on events - their numbers and their handling. Yet the tools mandate that you manage to events, which further exacerbates the workflow issue.
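
To make the abstraction concrete, here is a hedged little sketch (hypothetical event fields and CI names, not any particular product's schema) of folding a waterfall of discrete events into a single incident keyed by the root CI:

use strict;
use warnings;

# Hypothetical discrete events that all stem from the one mirrored-disk incident.
my @events = (
    { node => 'db01', ci => 'db01:disk:c0t1d0', summary => 'disk fault'           },
    { node => 'db01', ci => 'db01:disk:c0t1d0', summary => 'mirror resync start'  },
    { node => 'db01', ci => 'ordersdb',         summary => 'db response degraded' },
);

# Hypothetical CI-to-parent dependencies used to fold side effects together.
my %parent_ci = ( 'ordersdb' => 'db01:disk:c0t1d0' );

my %incident;    # one incident record per root CI, holding its lifecycle state
for my $event (@events) {
    my $root = exists $parent_ci{ $event->{ci} }
             ? $parent_ci{ $event->{ci} }
             : $event->{ci};
    my $inc  = $incident{$root} ||= { state => 'detection', events => [] };
    push @{ $inc->{events} }, $event;    # the event feeds the incident; it isn't one
}

for my $root (keys %incident) {
    printf "Incident on %s (state: %s) carries %d events\n",
        $root, $incident{$root}{state}, scalar @{ $incident{$root}{events} };
}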

Organizations attempt to work around this by implementing ticketing systems. Only, once you move a user from the single pane of glass / near real time display to individual tickets, the end user becomes unaware of the real time aspects of the environment. Once a ticket is opened and worked, all Hades could break loose and that user wouldn't be aware.

In Summary

The Event Management tools today present and process events. And they align the users toward Events. Somewhere along the way, we have missed the fact that an event does not equal an Incident. But the tools don't align the information to incidents, and that has hampered effective ITIL implementation.

The Single Pane of Glass applications need to start migrating and evolving toward empowering Incident management in that near real time realm they do best. Create awareness of incidents as well as the incident process lifecycle. 


Sunday, April 4, 2010

Simplifying topology

I have been looking at monitoring and how it's typically implemented. Part of my look is to drive visualization, but also to see how I can leverage the data in a way that organizes people's thoughts on the desk.

Part of my thought process is around OpenNMS. What can I contribute to make the project better?

What I came to realize is that Nodes are monitored on a Node / IP address basis by the majority of products available today.  All of the alarms and events are aligned by node - even the sub-object based events get aggregated back to the node level.  And for the most part, this is OK.  You dispatch a tech to the Node level, right?

When you look at topology in a general sense, you can see the relationship between the poller and the Node under test. Between the poller and the end node, there is a list of elements that make up the lineage of network service components. So, from a service perspective, a simple traceroute between the poller and the end node produces a simple network "lineage".


Extending this a bit further: traceroute is typically done with ICMP or UDP probes, which gives you an IP level perspective of the network. Note also that because traceroute exploits the Time To Live field of IP, it can be accomplished over any transport layer protocol. For example, traceroute could work on TCP port 80 or 8080. The important thing is that you place a protocol specific responder at the far end so you can see whether the service is actually working beyond just responding to a connection request.

And while traceroute is a one way street, it still derives a lineage of path between the poller and the Node under test - and now the protocol or SERVICE under test. And it is still a simple lineage.

The significance of the path lineage is that in order to do some level of path correlation, you need to understand what is connected to what. Given that this can be very volatile and change very quickly, topology based correlation can be somewhat problematic - especially if your "facts" change on the fly. And IP based networks do that. They are supposed to do that. They are a best effort communications methodology that needs to adapt to various conditions.

Traceroute doesn't give you ALL of the topology. Not by far. Consider the case of a simple Frame Relay circuit. A Frame Relay circuit is mapped end to end by a circuit provider but uses T-carrier access to the local exchange. Traceroute only captures the IP level access and doesn't capture the elements below that. In fact, if you have ISDN backup enabled for a Frame Relay circuit, your end points for the circuit will change in most cases, at least for the access. And the hop count may change as well.

The good part about tracerouting via a legitimate protocol is that you get to visualize any administrative access issues up front. For example, if port 8080 is blocked between the poller and the end node, the traceroute will fail. Additionally, you may see ICMP administratively-prohibited messages as well. In effect, by positioning the poller according to end user populations, you get to see the service access pathing.

Now, think about this... From a basic service perspective, if you poll via the service, you get a basic understanding of the service you are providing via that connection.  When something breaks, you also have a BASELINE with which to diagnose the problem. So, if the poll fails, rerun the traceroute via the protocol and see where it stops.
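
A hedged sketch of that workflow (hypothetical host and port, and it assumes a TCP-capable traceroute such as tcptraceroute is installed on the poller): poll over the service's own protocol, and on failure re-run the path trace on the same port to see where it stops.

use strict;
use warnings;
use IO::Socket::INET;

my $host = 'app01.example.com';
my $port = 8080;

# Service poll: connect over the service's own protocol/port, not just ICMP.
my $sock = IO::Socket::INET->new(
    PeerAddr => $host,
    PeerPort => $port,
    Proto    => 'tcp',
    Timeout  => 5,
);

if ($sock) {
    print "Service poll OK: $host:$port is answering\n";
    close $sock;
} else {
    # Poll failed: re-derive the path lineage over the same port to see
    # where it stops (assumes tcptraceroute is installed on the poller).
    print "Service poll FAILED for $host:$port -- tracing path...\n";
    my @hops = `tcptraceroute -n $host $port 2>&1`;
    print @hops;
}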

Here are the interesting things to note about this approach:

  • You are simply replicating human expert knowledge in software.  Easy to explain.  Easy to transition to personnel.
  • You get to derive path breakage points pretty quickly.
  • You get to discern the perspective of the end user.
  • You are now managing your Enterprise via SERVICE!
Topology really doesn't mean ANYTHING until you evolve to manage by Service and not by individual nodes.  You can have all the pretty maps you want.  It doesn't mean crapola until you start managing by service.

This approach is an absolute NATURAL for OpenNMS.  Let me explain...

Look at the Path Outages tab. While it is currently manually configured, using the traceroute by service lineage here provides a way of visualizing the path lineage.

OpenNMS supports service pollers natively. There are a lot of different services out of the box, and it's easy to add more if you need something different from what they already do.

Look at the difference between Alarms and Events. Service outages could be tied directly to an Alarm, while the things eventing underneath - which may affect the service - are presented as Events.

What if you took the reports and charts and aligned the elements to the service lineage?  For example, if you had a difference in service response, you could align all of the IO graphs for everything in the service lineage.  You could also align all of the CPU utilizations as well.

In elements where there are subobjects abstracted in the lineage, if you discover them, you could add those in the lineage.  For example, if you discovered the Frame Relay PVCs and LEC access circuits, these could be included in with your visualization underneath the path where they are present.

The other part is that the way you work may need to evolve as well.  For example, if you've traditionally ticketed outages on Nodes, now you may need to transition to a Service based model. And while you may issue tickets on a node, your ticket on a Service becomes the overlying dominant ticket  in that multiple node problems may be present in a service problem.

And the important thing.  You become aware of the customer and Service first, then elements underneath that.  It becomes easier to manage to service along with impact assessments, when you manage to a service versus manage to a node.  And when you throw in the portability, agility, and abstractness of Cloud computing, this approach is a very logical fit.

Dangerous Development

Ever been in an environment where the developed solutions tend to work around problems rather than confronting issues directly? Do you see bandaids to issues as the normal mode of operation? Do you have software that is developed without requirements or User Acceptance testing?

What you are seeing are solutions developed in a vacuum, without regard to the domain knowledge necessary to understand that a problem needs to be corrected and not "worked around". Essentially, when developers lack the domain knowledge around identifying and correcting problems in areas outside of software, you end up with software that works around or bandaids over issues. In short, they don't know how to diagnose or correct the problem, diagnose the effects of the problem, or in many cases, even understand that it is a problem.

In some cases, you need strong managerial leadership to stand up and make things right. The problem may be exacerbated by weak management or politically charged environments where managers manage to the Green. And some problems do need escalation.

This gets very dangerous to an IT environment for a multitude of reasons including:

  • It masks a problem. Once a problem is masked, any fix to the real problem breaks the developed bandaid solution.
  • It sets an even more dangerous precedent in that now its OK to develop bandaid solutions.
  • Once developed and in place, it is difficult to replace the solution. (It is easier to do nothing.)
  • It creates a mandate that further development will always be required because of the workarounds in the environment. In essence, no standards-based product can fulfill the requirements any longer because of the workarounds.

A lot of factors contribute to this condition commonly known as "Painted in a Corner" development. In essence, development efforts paint themselves into a corner where they cannot be truly finished or the return on investment can never be fully realized. A developer or IT organization cannot divorce itself or disengage from a product. In effect, you cannot finish it!

A common factor is a lack of life cycle methodology in the development organization. Without standards and methodologies, it is so easy for developers to skip over certain steps because of the pain and suffering. These elements include:

  
  • Requirements and Requirements traceability
  • Unit Testing
  • System Testing
  • Test Harnesses and structured testing
  • Quality Assurance
  • Coding standards
  • Documentation
  • Code / Project refactoring
  • Acceptance Testing.

This is no different from doing other tasks such as Network Engineering, Systems Engineering, and Applications Engineering. The danger is that once the precedent is set that it's OK to go around the Policies, Procedures, and Discipline associated with effective Software Development, it is very hard to rein it back in. In effect, the organization has been compromised. And they lack the awareness that they are compromised.

  
What do you do to right the ship?

  
Obviously, there is a lack of standards and governance up front. These need to be remedied. Coding standards and software lifecycle management techniques need to be chosen and implemented up front. You need to get away from cowboy code and software development that is not customer driven. Additionally, it should be obvious that design and architecture choices need to be made external to this software development team for the foreseeable future.

  
Every piece of code written needs to be reviewed and corrected. You need to get rid of the bandaids and "solutions" that perpetuate problems. And you need to start addressing the real problems instead of working around them.

  
Software that perpetuates problems by masking them via workarounds and bandaids is a dangerous pattern to manifest in your organization. It's like finding roaches in your house. Not only will you see the bandaid redeveloped and reused over and over again, you have an empowered development staff that bandaids versus fixes. Until you can do a thorough cleaning and bombing of the house, it is next to impossible to get rid of the roaches.

Sometimes, development managers tend to be promoted a bit early: while they have experience with code and techniques, it is exposure to a lot of different problem sets that separates the good development leaders from the politicians and wannabes. Those who are pipelined do not always understand how to reason through problems, discern good techniques and approaches from bad, and lead down the right paths. Some turn very political because it is easier for them to respond politically than technically.

What are the Warnings Signs?

Typically, you see in-house developed solutions built around a problem in functionality - workarounds you would not normally see in many other places. This can be manifested in many ways. Some examples or warning signs include:


  • Non-applicability of commercial products because of one-off in-house solutions or applied workarounds.
  • Typical phrases like "that's the way we've always done this" or "We developed this to work around..." arise in discussions.
  • You see a lot of in-house developed products in many different versions.
  • In house developed products tend to lack sufficient documentation.
  • You see major deviations away from standards.
  • No QA, and code review is only internal to the group.
  • No Unit or System test functionality available to the support organization.
  • In house developed software that never transitions out of the development group.
  • Software developed in house that is never finished.
  • You get political answers to technical questions.
A lesson here is that it does matter that the person you set up to lead has exposure to a lot of different problem sets and to SDLC methodologies in practice, not just theory, and has some definite problem-reasoning skills. A politician is not a good coach or development leader in these environments.

In Summary

One must be very careful in the design and implementation of software done in house. If it goes wrong, you can quickly paint development and the developed capabilities into a corner where you must forklift to get functionality back. And if you're not careful, you will stop evolution in the environment, because the technical solutions will continue to work around problems instead of directly addressing them.