Saturday, March 27, 2010

Implicit Status Determination example


It has come to my attention that a lot of folks do not understand state-based polling, let alone the importance and scalability of active polling driven by a state machine representation.

For some, heartbeats and traps are a way of life. While this is OK, if you're serious about detecting problems and pointing your people at fixing them, it's going to take an active stance. Waiting on heartbeats and traps is kind of lame; you might as well be waiting on your end user to call you.

So, I'm going to start off with a very simple model that incorporates a technique called Implicit Status Determination (ISD) and illustrates how effective state machines can be. For the purposes of this example, I am working from an implementation done in LogMatrix NerveCenter.

In a NerveCenter finite state machine model, SNMP and ICMP polls are only applied when the model is in a state that the resulting trigger can take it out of. So, in this model, I may have several different poll conditions, but these polls are only applied when they are needed. Additionally, traps are masked in as triggers.
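
To make that idea concrete, here is a rough sketch in Python of the gating behavior. The state and trigger names are illustrative only, not NerveCenter's internals:

```python
# Minimal sketch of state-gated polling: a poll or trap mask is only
# evaluated when the model instance is in a state that the resulting
# trigger can transition out of. Names are illustrative only.
TRANSITIONS = {
    # (current_state, trigger) -> next_state
    ("Ground",    "got_sysUpTime"):    "NeedsPoll",
    ("NeedsPoll", "node_ok"):          "NodeOK",
    ("NodeOK",    "PORT_UNREACHABLE"): "AgentDown",
    ("AgentDown", "icmp_echo_reply"):  "AgentDownNodeUp",
    ("AgentDown", "icmp_timeout"):     "NodeDown",
}

def active_triggers(state):
    """Only these triggers are worth polling or listening for right now."""
    return {trig for (st, trig) in TRANSITIONS if st == state}

def apply_trigger(state, trigger):
    """Ignore any trigger that cannot move the instance out of its state."""
    return TRANSITIONS.get((state, trigger), state)
```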

In the NerveCenter model, you have States, triggers, and Actions. States are represented by the octagonal symbols in the GUI. The color of the symbols is also mapped (or not) to platforms like Network Node Manager. (As a side note, I map the purple used in this example for unreachable to the NNM status "Restricted" which is a brick brownish red color. When you see path outages in NNM, you'll see nodes depicted as Restricted where NerveCenter cannot reach those nodes.)

Triggers are the lines and boxes that connect state to state. They are named and can be created in a lot of different ways, including:

  • SNMP Polls
  • ICMP Polls
  • Masks
  • Actions
  • Timers
  • Counters
  • Perl Sub-Routines
  • External via nccmd

Actions are the work that occurs when a model transitions from state to state. Any number of actions can be attached to a transition: informing a platform via an event, running a Perl subroutine, incrementing or decrementing counters, sending email, and so on.
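
As a hedged illustration (this is not NerveCenter's actual object model), you can think of a transition as a trigger, a destination state, and a list of actions that fire when the transition occurs:

```python
# Sketch only: a transition carries the work (actions) that runs when it fires.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Transition:
    trigger: str
    next_state: str
    actions: List[Callable[[], None]] = field(default_factory=list)

def inform_platform():
    # e.g. forward a status-change event to a platform like NNM
    print("Inform: node status change")

def bump_counter():
    print("Increment a counter")

node_down = Transition("icmp_timeout", "NodeDown",
                       actions=[inform_platform, bump_counter])

for action in node_down.actions:   # actions run as the model changes state
    action()
```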

Using this simple Node Status model, node status poll rates can be cranked down to very tight intervals without killing performance or scalability. Additionally, traps that are not applicable to the current state are filtered out: if a trap mask does not transition the model instance to another state, it is not used.

Let's go through the states...

Ground

All models are instanced at Ground. This model is scoped "Node," so only one instance per node is used. The first state reachable is the Needs Poll state. The transition is the result of a successful SNMP Get of system.sysUpTime. (Pretty lightweight. But if you want to store the variable, it gives you a very valid data point for Agent Availability.)
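
For illustration only, that sort of lightweight sysUpTime poll could look like the sketch below, assuming pysnmp's 4.x-style hlapi; the address and community string are placeholders, and NerveCenter of course does its own polling internally:

```python
# Sketch: a lightweight SNMP GET of sysUpTime.0 (1.3.6.1.2.1.1.3.0).
# A successful reply is what fires the transition out of Ground; the
# returned TimeTicks value also doubles as an agent-availability data point.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

errorIndication, errorStatus, errorIndex, varBinds = next(getCmd(
    SnmpEngine(),
    CommunityData('public'),                             # placeholder community
    UdpTransportTarget(('192.0.2.10', 161), timeout=2, retries=1),
    ContextData(),
    ObjectType(ObjectIdentity('1.3.6.1.2.1.1.3.0'))      # sysUpTime.0
))

if errorIndication is None and not errorStatus:
    uptime_ticks = int(varBinds[0][1])                   # TimeTicks, 1/100 s
    print("agent reachable, sysUpTime =", uptime_ticks)
```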

Out of the Needs Poll state are several poll and trap conditions that relate to the various states of the Node, the first of which is NodeOK.

NodeOK

NodeOK is reached when a valid SNMP poll of system.sysUpTime returns a value greater than the poll interval times 100 (sysUpTime is expressed in TimeTicks, which are 1/100th of a second).
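
In other words, the comparison is just this (a sketch with a hypothetical helper name):

```python
# sysUpTime is reported in TimeTicks (1/100 s), so a poll interval in
# seconds must be multiplied by 100 before comparing. Hypothetical helper.
def implies_node_ok(sys_uptime_ticks, poll_interval_sec):
    return sys_uptime_ticks > poll_interval_sec * 100

implies_node_ok(123456, 15)   # True: agent has been up well past one interval
implies_node_ok(900, 15)      # False: under 15 s of uptime, likely a fresh boot
```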

In the transition to NodeOK, a poll timer is set to a value corresponding to the ISD interval window. In this example, it is 15 seconds. As other valid SNMP polls occur for the node, they send a trigger called timer_reset. These triggers go from NodeOK back to NodeOK and reset the poll timer to the next ISD interval window. This "slides" the poll timer window on the assumption that any valid poll to the Node IMPLIES a valid Node Status poll. So, while I'm setting and using a 15-second window, I am only discretely polling for Node Status every 10 to 15 minutes, depending on the number of sub-objects I'm managing in other state models.
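
A rough sketch of that sliding-window behavior, with illustrative names:

```python
# Sketch of Implicit Status Determination: every valid SNMP reply from any
# poll against the node "slides" the explicit status-poll timer forward,
# so the discrete Node Status poll only fires when nothing else has
# answered within the ISD window.
import time

ISD_WINDOW_SEC = 15

class IsdTimer:
    def __init__(self):
        self.deadline = time.monotonic() + ISD_WINDOW_SEC

    def timer_reset(self):
        """Called whenever any valid poll of the node succeeds."""
        self.deadline = time.monotonic() + ISD_WINDOW_SEC

    def needs_explicit_status_poll(self):
        """True only if nothing has implied node status within the window."""
        return time.monotonic() >= self.deadline
```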

Agent Down

Agent Down occurs when the trigger PORT_UNREACHABLE is received. PORT_UNREACHABLE is fired when an ICMP Port Unreachable message is received from the Node. It signifies that the host is telling you that no process is listening on UDP 161, i.e., the Agent is no longer listening. (Notice that this means the Node is actually still up. It has the wherewithal to send you an ICMP control message, which means the IP stack is still jammin'.)

Rebooted

The Rebooted state is used to capture a node reboot and suppress all of the downs until the node comes back up. This state is also used to reset sub-object models so that they do not become stale, especially when instances change.
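
One common way to recognize a reboot is a sysUpTime value that goes backwards between polls, since the agent's TimeTicks counter restarts near zero; a sketch of that comparison (the helper name is hypothetical):

```python
# Sketch: a reboot shows up as sysUpTime going backwards between polls,
# because the agent's TimeTicks counter restarts at (near) zero.
def rebooted(previous_uptime_ticks, current_uptime_ticks):
    return current_uptime_ticks < previous_uptime_ticks

rebooted(4_500_000, 1_200)   # True: the counter reset, the node restarted
```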

Agent Down Node Up

From the Agent Down state, if the node responds to an ICMP ping, it is transitioned to Agent Down Node Up.

Node Down

Both SNMP and ICMP communications have failed. The node is deemed down.
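
Taken together, the Agent Down, Agent Down Node Up, and Node Down states boil down to combining SNMP and ICMP evidence. A simplified sketch of that mapping (not the model's literal transition logic):

```python
# Sketch: how SNMP and ICMP evidence maps to the down-side states.
def classify(snmp_ok, ping_ok, port_unreachable):
    if snmp_ok:
        return "NodeOK"
    if port_unreachable:
        return "AgentDown"        # host told us nothing listens on UDP 161
    if ping_ok:
        return "AgentDownNodeUp"  # SNMP is silent, but the node answers pings
    return "NodeDown"             # no SNMP, no ICMP: the node is deemed down
```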

Unreachable

Unreachable occurs when polling attempts yield either the NET_UNREACHABLE or NODE_UNREACHABLE trigger. These triggers are derived directly from ICMP Destination Unreachable messages, namely Network Unreachable and Host (Node) Unreachable.

From RFC 792 - Internet Control Message Protocol

"Description

If, according to the information in the gateway's routing tables, the network specified in the internet destination field of a datagram is unreachable, e.g., the distance to the network is infinity, the gateway may send a destination unreachable message to the internet source host of the datagram. In addition, in some networks, the gateway may be able to determine if the internet destination host is unreachable. Gateways in these networks may send destination unreachable messages to the source host when the destination host is unreachable."

These two messages come from routers in the path between the poller and the end node. As such, when route convergence occurs, a route to an end network may transition to infinity or become unreachable until routing metrics are recalculated and the traffic is rerouted. It is an indication that your topology has changed.

In cases where the path is persistently lost, it is an outage. But because the message is emitted from the device that recognized the path loss, you have the entire good portion of the path TESTED ALREADY.

I typically see Net Unreachable when a router loses a route to a given network. I have seen Node Unreachable when the ARP entry for the node is waxed out of the ARP cache on the router that serves the Node's local network.
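
For reference, the triggers above correspond to ICMP Destination Unreachable (type 3) codes from RFC 792; a sketch of that mapping (the trigger names follow the model, the dictionary itself is illustrative):

```python
# Sketch: ICMP Destination Unreachable (type 3) codes from RFC 792 and the
# triggers they feed in this model.
ICMP_UNREACHABLE_TRIGGERS = {
    0: "NET_UNREACHABLE",    # net unreachable: a router lost the route
    1: "NODE_UNREACHABLE",   # host unreachable: last-hop router cannot deliver
    3: "PORT_UNREACHABLE",   # port unreachable: the host has no listener on UDP 161
}

def trigger_for(icmp_type, icmp_code):
    if icmp_type == 3:
        return ICMP_UNREACHABLE_TRIGGERS.get(icmp_code)
    return None
```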

Results

Using ISD and model sets, I can usually outperform the polling and status mechanisms in other COTS offerings by 4-5:1. I have benchmarked against NNM and delivered 20-second status intervals with 20% of the traffic netmon uses at 5-minute intervals.

And because I am catching active topology changes at the IP level in near real time, I'd say that in many cases, topology changes seen via NerveCenter will not be seen on other products. (I have captured Dijkstra recalcs in midstream on slow, underpowered Cisco devices. Hey! It's a problem!)

In Summary

While this is a very basic Node Status model, you can see the technique and prowess of the finite state machine in action. It is a very different world from the static poll lists so prevalent in basic pollers. And while a poll list is very lightweight code-wise, you never make a decision in line with the poll. You end up polling continuously even when you don't want to.

While I do realize that some organizations want an "out of the box" solution, they tend to get a solution that depends on the vendor to adapt to their infrastructure, their workflow, and their organizational technical knowledge. Do you control the pace, or do the vendors? Do you evolve? Are you committed to continuous process improvement, or is a lack of change your modus operandi?

I also realize that inevitably, some organizations must reinvent the wheel. Their Ego vs. Reality ratio has not matured or stabilized yet. Good luck. NerveCenter is still the standard in intelligent status polling. Has been for years. I got my first looksee in 1993 via Netlabs Dual Manager.

1 comment:

  1. Oh the Nerve! FSMs are network management crack cocaine... we just keep going back there, don't we? ;)
