Showing posts with label Network Management. Show all posts

Monday, April 9, 2012

Product Quality Dilemma

All too often, we buy products, put them in production, and attempt to work through the shortcomings and obfuscated abnormalities so prevalent in them. (I call these product "ISMs," and I use the term to describe specific product behaviors or personalities.) Over this long life cycle, changes, additions, deprecations, and behaviors shift over time. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.

All too often, developers tend to think of their creation in a way that may be significantly different from the deployed environments it goes into. It's easy to get stuck in microcosms and walled-off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.

With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... even Ishikawa diagrams!

All this is a distraction from what the end user really, really NEEDS. I'll take a shot at outlining some basic requirements pertinent to all performance management products.

1. Don't keep trying to collect on broken access mechanisms.

Many performance applications continue to collect, or attempt to collect, even when they haven't had any valid data in hours or days. It's crazy, as all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?

SNMP authentication failures are a HUGE clue that your app is wasting resources on something simple. Listening for ICMP Source Quench messages will tell you if you're hammering end devices.
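One way to act on those clues is to back off and eventually park a target instead of hammering it. Here is a sketch in Python; the class, thresholds, and names are my own illustration, not any product's API:

```python
import time

class PollTarget:
    """Track collection health for one device (illustrative sketch)."""

    def __init__(self, host, max_failures=5, base_backoff=60):
        self.host = host
        self.max_failures = max_failures
        self.base_backoff = base_backoff  # seconds
        self.failures = 0
        self.next_poll = 0.0

    def record_success(self):
        self.failures = 0

    def record_failure(self, now=None):
        """Called on e.g. an SNMP auth failure or timeout; back off exponentially."""
        now = time.time() if now is None else now
        self.failures += 1
        # Double the wait for every consecutive failure, capped at one hour.
        delay = min(self.base_backoff * 2 ** (self.failures - 1), 3600)
        self.next_poll = now + delay

    def should_poll(self, now=None):
        now = time.time() if now is None else now
        if self.failures >= self.max_failures:
            return False  # park it for an operator to fix; stop wasting cycles
        return now >= self.next_poll
```

The point is not the arithmetic; it's that a target that has produced nothing but errors for days should stop consuming poller cycles until a human looks at it.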

2. Migrate away from mass produced reports in favor of providing information.

If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.

3. If you can't create new reports without code, it's too complicated.

All too often, products want to put glue code or even programming environments/IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business-process oriented.

4. Data and indexes should be documented and manageable. If you have to BYODBA (Bring Your Own DBA), the vendor hasn't done their homework.

How many times have we loaded up a big performance management application only to find out we have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?

And you end up having to dig through the logs to figure out what works and what doesn't.

If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?

In some instances, databases used by performance management applications are geared toward polling and collection rather than the reporting of information. In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation. For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and compute an average, root mean square, and standard deviation.
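As an illustration of that kind of derivative, here is a sketch using only the Python standard library; the multiplier k is a tunable of my choosing, not anyone's published algorithm:

```python
import statistics

def dynamic_threshold(samples, k=3.0):
    """Return (mean, stdev, upper bound) for a window of samples.

    A value above mean + k*stdev is treated as anomalous. This is a
    sketch of the mean/standard-deviation derivative described above,
    not any vendor's actual thresholding algorithm.
    """
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)  # population stdev over the window
    return mean, stdev, mean + k * stdev

def is_anomalous(value, samples, k=3.0):
    _, _, upper = dynamic_threshold(samples, k)
    return value > upper
```

Precomputing these derivatives into their own table is exactly the sort of thing that saves the reporting person from multi-join queries.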

If a reporting person has to do more than one join to get to their data elements, your data needs to be better organized, normalized, and accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance/capacity engineers. They would rather work the data than work your system.

5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustworthy.

There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how it works, you're going to alienate a large portion of your operations base.

6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!

Secrets are BAD. Cisco publishes all of its bugs online specifically because it eliminates the perception that they are keeping something from the customer.

If one remembers Concord's eHealth Health Index... in the earlier days, it was SECRET SQUIRREL SAUCE. Many an engineer got a bad review or lost their job because of the arrogance of not publishing the elements that made up the Health Index.

7. Be prepared to handle BI types of access: bulk transfers, ODBC, Excel/Access replication, ETL tool access, etc.

If engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Provide shared tables, timed transfers, transformations, and data dumps.

8. Reports are not just a graph on a splash page or a table of data. To Operations personnel, a report means putting text and formatting around the graphs, charts, tables, and data, relating the operational aspects of the environment to the illustrations.

9. In many cases, you need to transfer data in a transformed state from one reporting system to another. Without ETL tools, your reporting solution misses the mark.

Think about this... You have configuration data, and you need it in a multitude of applications: Netcool, your CMDB, your Operational Data Store, your discovery tools, your ticketing system, your performance management system. And it seems that every one of these data stores may be text, XML, databases of various forms and flavors, even HTML. How do you get data transformed from one place to another?
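Even a trivial ETL step is mostly field remapping. Here is a hedged sketch in Python, where the field maps and consumer names are purely hypothetical:

```python
import csv
import io
import json

# Hypothetical field maps: the inventory export calls a device "hostname",
# the CMDB wants "name", the ticketing system wants "ci_name".
FIELD_MAPS = {
    "cmdb":    {"hostname": "name", "ip": "ip_address"},
    "tickets": {"hostname": "ci_name", "ip": "ip"},
}

def transform(csv_text, target):
    """Extract rows from a CSV export and remap the fields one consumer wants.

    Columns not in the target's map are dropped; everything else is renamed.
    Output is JSON, on the assumption the consumer takes a bulk load.
    """
    mapping = FIELD_MAPS[target]
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return json.dumps(rows)
```

Real ETL tools add scheduling, error handling, and dozens of formats, but extract, transform, load is all they are at heart.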

10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.

As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, what happens is that when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.

The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, many times the problem just continues until it's a major ordeal to correct. Reconciliation can be rough when there are a significant number of discrepancies.

Last but not least: if you develop an application and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION is JUNK. It's dangerous, haphazard, incredibly short-sighted, and should be avoided at all costs. Recently, a dear friend of mine developed, tested, and validated a performance management application upgrade. After a month in test and QA running many different use-case validations, it was put into production. Upon placement into production, the application overrode the pared-down configurations with defaults, polled EVERYTHING, and caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. People were terminated. And a lot of manpower was expended trying to fix the issues.

In no uncertain terms, I will never let my friends and customers be caught by this product.

Sunday, April 25, 2010

Performance Management Architecture

Performance Management systems in IT infrastructures do a few common things. These are:

  • Gather performance data
  • Enable processing of the data to produce:
      • Events and thresholds
      • New data and information
      • Baseline and average information
  • Present data through a UI or via scheduled reports
  • Provide for ad hoc and data mining exercises

Common themes for broken systems include:

  • If you have to redevelop your application to add new metrics
  • If you have more than one or two data access points
  • If data is not consistent
  • If reporting mechanisms have to be redeveloped for changes to occur
  • If a development staff owns access to the data
  • If a development staff controls what data gets gathered and stored
  • If multiple systems are in place and they overlap (significantly) in coverage
  • If you cannot graph any data newer than 5 minutes
  • If there's no such thing as a live graph, or the live graph is done via meta refresh

I dig SevOne. Easy to set up. Easy to use. Baselines. New graphs. New reports. And schedules. But they also do drill-down from SNMP into IPFIX DIRECTLY. No popping out of one system and into another. SEAMLESSLY.

It took me 30 minutes or so to rack and stack the appliance. I went back to my desk, verified I could access the appliance, then called the SE. He set up a WebEx, and seven minutes and a few odd seconds later I had my first reports. Quite a significant difference from the previous Proviso install, which took more than a day.

The real deal is that with SevOne, your network engineers can set up the data collection they need. And the hosting engineers can follow suit. Need a new metric? Engineering sets it up. NO DEVELOPMENT EFFORT.

And it can be done today, not three months from now. When a performance management system cannot be used as part of diagnostics and triage in near real time, it significantly detracts from usability in both the near-real-time and the longer-term trending functions.

Saturday, April 24, 2010

SNMP + Polling Techniques

Over the course of many years, I keep seeing the same lack of evolution regarding SNMP polling, how it's accomplished, and the underlying ramifications. To give credit where credit is due, I learned a lot from Ari Hirschman, Eric Wall, Will Pearce, and Alex Keifer. And for the things we learned along the way: Bill Frank, Scott Rife, and Mike O'Brien.

Building an SNMP poller isn't bad, provided you understand the data structures, what happens on the end node, and how it performs in its client-server model.

First off, there are 5 basic operations one can perform. These are:

GET
GET-NEXT
SET
GET-RESPONSE
GET-BULK

Here is a reference link to RFC-1157 where SNMP v1 is defined.

The GET-BULK operator was introduced when SNMP V2 was proposed, and it carried into SNMP V3. While SNMP V2 never became a full standard, its de facto implementations followed the community-based model referenced in RFCs 1901-1908.

SNMP V3 is the current standard for SNMP (STD0062) and version 1 and 2 SNMP are considered obsolete or historical.

SNMP TRAPs and INFORMs are event-type messages sent from the managed object back to the manager. In the case of INFORMs, the manager returns an acknowledgement to the sender.

From a polling perspective, let's start with a basic SNMP Get request. I will illustrate this via the Net::SNMP Perl module directly. (URL is http://search.cpan.org/dist/Net-SNMP/lib/Net/SNMP.pm)

get_request() - send a SNMP get-request to the remote agent

$result = $session->get_request(
    [-callback        => sub {},]      # non-blocking
    [-delay           => $seconds,]    # non-blocking
    [-contextengineid => $engine_id,]  # v3
    [-contextname     => $name,]       # v3
    -varbindlist      => \@oids,
);

This method performs a SNMP get-request query to gather data from the remote agent on the host associated with the Net::SNMP object. The message is built using the list of OBJECT IDENTIFIERs in dotted notation passed to the method as an array reference using the -varbindlist argument. Each OBJECT IDENTIFIER is placed into a single SNMP GetRequest-PDU in the same order that it held in the original list.

A reference to a hash is returned in blocking mode which contains the contents of the VarBindList. In non-blocking mode, a true value is returned when no error has occurred. In either mode, the undefined value is returned when an error has occurred. The error() method may be used to determine the cause of the failure.

This can be either blocking (the request waits until data is returned) or non-blocking (the call returns immediately and invokes a callback subroutine upon completion or timeout).

For the args:

-callback is used to attach a handler subroutine for non-blocking calls
-delay is used to delay the SNMP protocol exchange for the given number of seconds.
-contextengineid is used to pass the contextengineid needed for SNMP V3.
-contextname is used to pass the SNMP V3 contextname.
-varbindlist is an array of OIDs to get.

What this does is set up a session object for a given node and run through the gets in the varbindlist one PDU at a time. If you have set it up to be non-blocking, the PDUs are assembled and sent one right after another. If you are using blocking mode, the first PDU is sent and a response is received before the second one is sent.

GET requests require you to know the instance of the attribute ahead of time. Some tables are zero instanced while others may be instanced by one or even multiple indexes. For example, MIB-2.system is a zero instanced table in that there is only one row in the table. Other tables like MIB-2.interfaces.ifTable.ifEntry have multiple rows indexed by ifIndex. Here is a reference to the MIB-2 RFC-1213.

A GET-NEXT request is like a GET request except that it does not require the instance up front. For example, if you start with a table like ifEntry and you do not know what the first instance is, you would query the table without an instance.

Now here is the GET-NEXT:

$result = $session->get_next_request(
    [-callback        => sub {},]      # non-blocking
    [-delay           => $seconds,]    # non-blocking
    [-contextengineid => $engine_id,]  # v3
    [-contextname     => $name,]       # v3
    -varbindlist      => \@oids,
);

In the Net::SNMP module, each OID in the \@oids array reference is passed as a single PDU instance. And like the GET, it can be performed in either blocking or non-blocking mode.

An snmpwalk is simply a macro of multiple recursive GET-NEXTs for a given starting OID.
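To make that concrete, here is a Python sketch of a walk against a fake in-memory agent. The MIB table and class names are my own illustration (a real poller would go through a library like Net::SNMP); the one subtlety worth showing is that OIDs must be compared numerically, not as strings:

```python
def oid_key(oid):
    """Numeric sort key for a dotted OID string ('1.10' sorts after '1.2')."""
    return tuple(int(p) for p in oid.split("."))

class FakeAgent:
    """Stand-in for an SNMP agent: a MIB as an OID -> value table."""

    def __init__(self, mib):
        self.mib = mib
        self.oids = sorted(mib, key=oid_key)

    def get_next(self, oid):
        """Return (oid, value) for the first OID after `oid`, or None at the end."""
        key = oid_key(oid)
        for candidate in self.oids:
            if oid_key(candidate) > key:
                return candidate, self.mib[candidate]
        return None

def walk(agent, base_oid):
    """snmpwalk: repeated get-nexts until the reply leaves the subtree."""
    results = {}
    oid = base_oid
    prefix = base_oid + "."
    while True:
        nxt = agent.get_next(oid)
        if nxt is None or not nxt[0].startswith(prefix):
            break  # ran off the end of the table
        oid, value = nxt
        results[oid] = value
    return results
```

Note that the walk starts without knowing any instance, exactly as described above: the first get-next on the base OID returns the first row of the table.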

As polling started to evolve, folks started looking for ways to make things a bit more scalable and faster. One of the ways they proposed was the GET-BULK operator. This enabled an SNMP Manager to pull whole portions of an SNMP MIB Table with a single request.

A GET-BULK request is like a GET-NEXT but tells the agent to return as much as it can from the table. And yes, it can return partial results.

$result = $session->get_bulk_request(
    [-callback        => sub {},]      # non-blocking
    [-delay           => $seconds,]    # non-blocking
    [-contextengineid => $engine_id,]  # v3
    [-contextname     => $name,]       # v3
    [-nonrepeaters    => $non_reps,]
    [-maxrepetitions  => $max_reps,]
    -varbindlist      => \@oids,
);

In SNMP V2, the GET-BULK operator came into being. This was done to enable a large amount of table data to be retrieved with a single request. It introduces two new parameters:

  • non-repeaters
  • max-repetitions

Non-repeaters tells the get-bulk command that the first N objects are to be retrieved with a simple get-next operation, as single-successor MIB objects.

Max-repetitions tells the get-bulk command to attempt up to M get-next operations to retrieve the remaining objects; in other words, how many times to repeat the get-next process.

The difficult part of GET-BULK is that you have to guess how many rows there are, and you have to deal with partial returns.
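Here is a sketch in Python of the caller's side of that problem, against a fake in-memory MIB. The agent may return fewer rows than max-repetitions asked for, and it may overshoot the end of the table, so the walker has to trim and re-request; the function names are illustrative:

```python
def oid_key(oid):
    """Numeric sort key for a dotted OID string."""
    return tuple(int(p) for p in oid.split("."))

def get_bulk(mib, start_oid, max_repetitions):
    """Return up to max_repetitions (oid, value) successors of start_oid.

    Like a real agent, this happily returns a partial result (fewer rows
    than asked) and rows past the end of the table the caller wanted.
    """
    ordered = sorted(mib, key=oid_key)
    key = oid_key(start_oid)
    successors = [(o, mib[o]) for o in ordered if oid_key(o) > key]
    return successors[:max_repetitions]

def bulk_walk(mib, base_oid, max_repetitions=10):
    """Repeat get-bulks, trimming replies that overshoot the subtree."""
    results, oid, prefix = {}, base_oid, base_oid + "."
    while True:
        chunk = get_bulk(mib, oid, max_repetitions)
        if not chunk:
            break
        overshot = False
        for o, v in chunk:
            if not o.startswith(prefix):
                overshot = True  # left the table: discard the rest and stop
                break
            results[o] = v
            oid = o
        if overshot or len(chunk) < max_repetitions:
            break
    return results
```

Tuning max-repetitions is the guessing game: too small and you lose the efficiency win over get-next; too large and the agent wastes work fetching rows past the table you discard anyway.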

As things evolved, folks realized that multiple OIDs were possible in SNMP GET-NEXT operations through a concept called PDU packing. However, not all agents are created equal. Some will support only a few varbinds in a single PDU, while some support upwards of 512 in a single SNMP PDU.

In effect, by packing PDUs, you can overcome certain annoyances in the data, like time skew between two attributes, given that they can be polled simultaneously.
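The packing itself is nothing more than chunking the OID list to each agent's per-PDU limit. A sketch in Python; the limit is purely a per-agent tunable (the SNMP::Multi synopsis that follows uses a PduPacking of 16):

```python
def pack_varbinds(oids, pdu_packing=16):
    """Split a long OID list into PDU-sized chunks.

    Agents differ in how many varbinds they accept per PDU, so the
    pdu_packing limit has to be tuned (or discovered) per agent.
    """
    return [oids[i:i + pdu_packing] for i in range(0, len(oids), pdu_packing)]
```

Every OID in one chunk rides in one request, so the values come back from the same instant on the agent; that is where the time-skew win comes from.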

When you look at the SNMP::Multi module, it not only allows multiple OIDs in a PDU by packing, it enables you to poll a lot of hosts at one time. Following is a "synopsis" quote from the SNMP::Multi module:


use SNMP::Multi;

my $req = SNMP::Multi::VarReq->new(
    nonrepeaters => 1,
    hosts        => [ qw/ router1.my.com router2.my.com / ],
    vars         => [ [ 'sysUpTime' ], [ 'ifInOctets' ], [ 'ifOutOctets' ] ],
);
die "VarReq: $SNMP::Multi::VarReq::error\n" unless $req;

my $sm = SNMP::Multi->new(
    Method      => 'bulkwalk',
    MaxSessions => 32,
    PduPacking  => 16,
    Community   => 'public',
    Version     => '2c',
    Timeout     => 5,
    Retries     => 3,
    UseNumeric  => 1,
    # Any additional options for SNMP::Session::new() ...
)
or die "$SNMP::Multi::error\n";

$sm->request($req) or die $sm->error;
my $resp = $sm->execute() or die "Execute: $SNMP::Multi::error\n";

print "Got response for ", (join ' ', $resp->hostnames()), "\n";
for my $host ($resp->hosts()) {
    print "Results for $host: \n";
    for my $result ($host->results()) {
        if ($result->error()) {
            print "Error with $host: ", $result->error(), "\n";
            next;
        }
        print "Values for $host: ", (join ' ', $result->values());
        for my $varlist ($result->varlists()) {
            print map { "\t" . $_->fmt() . "\n" } @$varlist;
        }
        print "\n";
    }
}

Using the Net::SNMP libraries underneath means you're still constrained by port: it uses a single UDP port to poll and handles the callbacks through request IDs. In higher-end pollers, the SNMP collector can poll from multiple ports simultaneously.

Summary

A lot of evolution and technique has gone into making SNMP data collection efficient over the years. It would be nice to see SNMP implementations that used these enhancements and evolved a bit as well. The evolution of these techniques came about for a reason. When I see places that haven't evolved their SNMP polling techniques, I tend to believe that they haven't evolved enough as an IT service to experience the pain that necessitated the lessons learned of the code evolution.

Sunday, April 4, 2010

Simplifying topology

I have been looking at monitoring and how it's typically implemented. Part of my look is to drive visualization, but also to see how I can leverage the data in a way that organizes people's thoughts on the desk.

Part of my thought process is around OpenNMS: what can I contribute to make the project better?

What I came to realize is that nodes are monitored on a node/IP address basis by the majority of products available today. All of the alarms and events are aligned by node; even the sub-object based events get aggregated back to the node level. And for the most part, this is OK. You dispatch a tech to the node level, right?

When you look at topology in a general sense, you can see the relationship between the poller and the node under test. Between the poller and the end node, there is a list of elements that make up the lineage of network service components. So, from a service perspective, a simple traceroute between the poller and the end node produces a simple network "lineage".


Extending this a bit further: traceroute typically relies on ICMP, which gives you an IP-level perspective of the network. Note also that because traceroute exploits the Time to Live parameter of IP, it can be accomplished in any transport-layer protocol. For example, traceroute could work on TCP port 80 or 8080. The important thing is that you place a protocol-specific responder on the end of the probe to see if the service is actually working beyond just responding to a connection request.

And while traceroute is a one way street, it still derives a lineage of path between the poller and the Node under test - and now the protocol or SERVICE under test. And it is still a simple lineage.

The significance of the path lineage is that in order to do some level of path correlation, you need to understand what is connected to what. Given that this can be very volatile and change very quickly, topology-based correlation can be somewhat problematic, especially if your "facts" change on the fly. And IP-based networks do that. They are supposed to do that. They are a best-effort communications methodology that needs to adapt to various conditions.

Traceroute doesn't give you ALL of the topology. Not by far. Consider the case of a simple Frame Relay circuit. A Frame Relay circuit is mapped end to end by a circuit provider but uses T-carrier access to the local exchange. Traceroute only captures the IP-level access and doesn't capture elements below that. In fact, if you have ISDN backup enabled for a Frame Relay circuit, your end points for the circuit will change, in most cases, for the access. And the hop count may change as well.

The good part about tracerouting via a legitimate protocol is that you get to see any administrative access issues up front. For example, if port 8080 is blocked between the poller and the end node, the traceroute will fail. Additionally, you may see ICMP Administratively Prohibited messages as well. In effect, by positioning pollers according to end-user populations, you get to see the service access pathing.

Now, think about this... From a basic service perspective, if you poll via the service, you get a basic understanding of the service you are providing via that connection.  When something breaks, you also have a BASELINE with which to diagnose the problem. So, if the poll fails, rerun the traceroute via the protocol and see where it stops.
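If you keep the baseline path lineage around, the "see where it stops" step is a simple comparison. A sketch in Python; the hop names are purely illustrative:

```python
def locate_break(baseline_path, current_path):
    """Compare a failing service traceroute against its baseline lineage.

    Returns (last_good_hop, first_missing_hop): the deepest hop the two
    paths share, and the baseline hop where the current trace diverges
    or stops (None if the current path covers the whole baseline).
    """
    last_good = None
    for i, hop in enumerate(baseline_path):
        if i < len(current_path) and current_path[i] == hop:
            last_good = hop
        else:
            return last_good, hop
    return last_good, None
```

This is exactly the human expert's move, replicated in software: lay the failing trace over the known-good lineage and point at the first element that stopped answering.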

Here are the interesting things to note about this approach:

  • You are simply replicating human expert knowledge in software.  Easy to explain.  Easy to transition to personnel.
  • You get to derive path breakage points pretty quickly.
  • You get to discern the perspective of the end user.
  • You are now managing your Enterprise via SERVICE!
Topology really doesn't mean ANYTHING until you evolve to manage by Service and not by individual nodes.  You can have all the pretty maps you want.  It doesn't mean crapola until you start managing by service.

This approach is an absolute NATURAL for OpenNMS.  Let me explain...

Look at the Path Outages tab. While it is currently manually configured, using the traceroute by service lineage here provides a way of visualizing the path lineage.

OpenNMS supports service pollers natively. There are a lot of different services out of the box, and it's easy to add more if you need something different from what they already do.

Look at the difference between alarms and events. A service outage could directly be related to an alarm, while the things eventing underneath, which may affect the service, are presented as events.

What if you took the reports and charts and aligned the elements to the service lineage?  For example, if you had a difference in service response, you could align all of the IO graphs for everything in the service lineage.  You could also align all of the CPU utilizations as well.

In elements where there are sub-objects abstracted in the lineage, if you discover them, you could add those to the lineage. For example, if you discovered the Frame Relay PVCs and LEC access circuits, these could be included in your visualization underneath the path where they are present.

The other part is that the way you work may need to evolve as well.  For example, if you've traditionally ticketed outages on Nodes, now you may need to transition to a Service based model. And while you may issue tickets on a node, your ticket on a Service becomes the overlying dominant ticket  in that multiple node problems may be present in a service problem.

And the important thing: you become aware of the customer and service first, then the elements underneath. It becomes easier to manage to a service, along with impact assessments, when you manage to a service versus a node. And when you throw in the portability, agility, and abstractness of cloud computing, this approach is a very logical fit.

Saturday, March 27, 2010

Implicit Status Determination example


It has come to my attention that a lot of folks do not understand state-based polling, let alone the importance and scalability of doing things like active polling in a state machine representation.

For some, heartbeats and traps are a way of life. While this is OK, if you're serious about detecting problems and pointing your people at fixing them, it's going to take an active stance. Waiting on heartbeats and traps is kind of lame, in that you might as well be waiting on your end user to call you.

So, I'm going to start off with a very simple model that incorporates a technique called Implicit Status Determination (ISD) that illuminates how effective state machines can be. For the purposes of this example, I am working from an implementation done in LogMatrix NerveCenter.

In a NerveCenter Finite State Machine model, SNMP and ICMP polls are only applied when your model reaches a state where the poll takes it out of the current state. So, in this model, I may have several different poll conditions. These polls are only applied when they need to be. Additionally, traps are masked in as triggers as well.

In the NerveCenter model, you have States, triggers, and Actions. States are represented by the octagonal symbols in the GUI. The color of the symbols is also mapped (or not) to platforms like Network Node Manager. (As a side note, I map the purple used in this example for unreachable to the NNM status "Restricted" which is a brick brownish red color. When you see path outages in NNM, you'll see nodes depicted as Restricted where NerveCenter cannot reach those nodes.)

Triggers are the lines and boxes that connect state to state. They are named and can be created in a lot of different ways, including:

  • SNMP Polls
  • ICMP Polls
  • Masks
  • Actions
  • Timers
  • Counters
  • Perl Sub-Routines
  • External via nccmd

Actions are the work that occurs when a model transitions from state to state. Any number of actions can be attached to a transition: informing a platform by event, running a Perl subroutine, incrementing or decrementing counters, sending email, and so on.

Using this simple Node Status model, node status poll rates can be cranked down to very tight intervals without killing performance and scalability. Additionally, traps that are not applicable to the current state get filtered based on their usefulness in the current state. If a trap mask doesn't transition the model instance to another state, it is not used.

Let's go through the states...

Ground

All models are instanced at Ground. This model is scoped "Node," so only one instance per node is used. The first state reachable is the Needs Poll state. The transition is the result of a successful SNMP GET of system.sysUpTime. (Pretty lightweight. But if you want to store the variable, it gives you a very valid data point for agent availability.)

Out of the Needs Poll state are several poll and trap conditions that relate to various states of the Node. The first of which is Node OK.

NodeOK

NodeOK is achieved on a valid SNMP poll of system.sysUpTime where the value of sysUpTime is more than the interval times 100 (sysUpTime is expressed in TimeTicks, which are 1/100th of a second).
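That check is easy to express in code. A sketch in Python with illustrative trigger names: sysUpTime is in TimeTicks, and a reboot also shows up as the counter going backwards between polls:

```python
def classify_sysuptime(ticks, last_ticks, interval_s):
    """Classify a sysUpTime sample (TimeTicks = 1/100th of a second).

    Returns 'NODE_OK' when the agent has been up longer than one poll
    interval, 'REBOOT_DETECTED' when the counter went backwards or the
    agent has been up for less than one interval (it just booted).
    Trigger names here are illustrative, not NerveCenter's.
    """
    if last_ticks is not None and ticks < last_ticks:
        return "REBOOT_DETECTED"  # uptime counter reset between polls
    if ticks > interval_s * 100:
        return "NODE_OK"
    return "REBOOT_DETECTED"
```

Storing the raw ticks alongside the classification is what gives you the agent-availability data point mentioned under Ground.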

In the transition to NodeOK, a poll timer is set to a value corresponding to the ISD interval window. In this example, it is 15 seconds. As other valid SNMP polls occur for the node, these send a trigger called timer_reset. These triggers go from NodeOK back to NodeOK, and they reset the poll timer to the next ISD interval window. What this does is "slide" the poll timer window on the assumption that any valid poll to the node IMPLIES a valid node status poll. So, while I'm setting and using a 15-second window, I am only discretely polling for node status every 10 to 15 minutes, depending on the number of sub-objects I'm managing in other state models.

Agent Down

Agent Down occurs when the trigger PORT_UNREACHABLE is received. PORT_UNREACHABLE is triggered when an ICMP Port Unreachable message is received from the node. It signifies that the host is telling you that no process is listening on UDP 161; the agent is no longer listening. (Notice that this means the node is actually still up. It has the wherewithal to send you an ICMP control message, which means the IP stack is still jammin'.)

Rebooted

The Rebooted state is used to capture a node reboot and suppress all of the downs until the node comes back up. This state is also used to reset subobject models so that they do not become stale especially when instances change.

Agent Down Node Up

From the Agent down state, if the node responds to an ICMP ping, it is transitioned to Agent Down Node Up.

Node Down

Both SNMP and ICMP communications have failed. The node is deemed down.

Unreachable

Unreachable occurs when polling attempts yield either NET_UNREACHABLE or NODE_UNREACHABLE triggers. These triggers are derived directly from ICMP destination unreachable messages, namely Network Unreachable and Host Unreachable.

From RFC 792 - Internet Control Message Protocol

"Description

If, according to the information in the gateway's routing tables, the network specified in the internet destination field of a datagram is unreachable, e.g., the distance to the network is infinity, the gateway may send a destination unreachable message to the internet source host of the datagram. In addition, in some networks, the gateway may be able to determine if the internet destination host is unreachable. Gateways in these networks may send destination unreachable messages to the source host when the destination host is unreachable."

These two messages come from routers in the path between the poller and the end node. As such, when route convergence occurs, it is possible that a route to an end network transitions to infinity or becomes unreachable until routing metrics are recalculated and the traffic is rerouted. It is an indication that your topology has changed.

In cases where the path is persistently lost, it is an outage. But because the message is emitted from the device that recognized the path loss, you have all of the good path TESTED ALREADY.

I typically see Net Unreachable when a router loses a route to a given network. I have seen a Node Unreachable when the ARP entry for the node is waxed out of the ARP cache on the router that services the node's local network.
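The states and triggers above boil down to a transition table. Here is a much-simplified sketch in Python; the table is my own illustration, not NerveCenter's actual model. A trigger that isn't valid in the current state is simply ignored, which is how irrelevant traps get filtered for free:

```python
# Simplified Node Status model. State and trigger names follow the post;
# the transition table itself is illustrative.
TRANSITIONS = {
    "Ground":          {"SYSUPTIME_OK": "NeedsPoll"},
    "NeedsPoll":       {"NODE_OK": "NodeOK",
                        "PORT_UNREACHABLE": "AgentDown",
                        "NET_UNREACHABLE": "Unreachable",
                        "NODE_UNREACHABLE": "Unreachable"},
    "NodeOK":          {"timer_reset": "NodeOK",  # slides the ISD window
                        "PORT_UNREACHABLE": "AgentDown",
                        "REBOOT_DETECTED": "Rebooted",
                        "NET_UNREACHABLE": "Unreachable"},
    "AgentDown":       {"ICMP_OK": "AgentDownNodeUp",
                        "ICMP_FAIL": "NodeDown"},
    "AgentDownNodeUp": {"SYSUPTIME_OK": "NodeOK"},
    "NodeDown":        {"SYSUPTIME_OK": "NodeOK"},
    "Rebooted":        {"SYSUPTIME_OK": "NodeOK"},
    "Unreachable":     {"SYSUPTIME_OK": "NodeOK"},
}

class NodeModel:
    """One instance per node, starting at Ground."""

    def __init__(self):
        self.state = "Ground"

    def fire(self, trigger):
        """Apply a trigger; anything not valid in this state is ignored."""
        nxt = TRANSITIONS[self.state].get(trigger)
        if nxt is not None:
            self.state = nxt
        return self.state
```

Polls are only issued for triggers that can move the instance out of its current state, which is where the scalability of the approach comes from.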

Results

Using ISD and model sets, I can usually outperform the polling and status mechanisms in other COTS offerings by 4-5:1. I have benchmarked against NNM, running 20-second status intervals with 20% of the traffic netmon uses at 5-minute intervals.

And because I am catching active topology changes at the IP level in near real time, I'd say that in many cases, topology changes seen via NerveCenter will not be seen by other products. (I have captured Dijkstra recalculations in midstream on slow and underpowered Cisco devices. Hey! It's a problem!)

In Summary

While this is a very basic Node Status model, you see the technique and prowess of the Finite State Machine in action. It is a very different world from the static poll lists so prevalent in basic pollers you see. And while a poll list is very lightweight code wise, you never make a decision in line with the poll. You end up polling continuously even when you don't want to.

While I do realize that some organizations want an "out of the box" solution, they tend to get a solution that is dependent upon the vendor to adapt to their infrastructure, their workflow, their organizational technical knowledge. Do you control the pace, or do the vendors? Do you evolve? Are you committed to continuous process improvement, or is a lack of change your modus operandi?

I also realize that, inevitably, some organizations must reinvent the wheel. Their ego-versus-reality ratio has not matured or stabilized yet. Good luck. NerveCenter is still the standard in intelligent status polling, and has been for years. I got my first looksee in 1993 via the Netlabs Dual Manager.