Saturday, March 27, 2010

NPS, Enterprise Management, and Situation Awareness

In the course of what I do, I have to sometimes take non-technical metrics and understand where implementation of technology - especially in the ENMS realm - applies toward achieving real business goals.

Recently, a lot of services-based companies have been working toward understanding and improving their Net Promoter Score, or NPS. As part of this initiative, what can I do to realize the overall goal?

First, I went looking for a definition of NPS to understand the terms, conditions, and metrics related to this KPI. I found this definition:

"

What is Net Promoter?

Net Promoter® is both a loyalty metric and a discipline for using customer feedback to fuel profitable growth in your business. Developed by Satmetrix, Bain & Company, and Fred Reichheld, the concept was first popularized through Reichheld's book The Ultimate Question, and has since been embraced by leading companies worldwide as the standard for measuring and improving customer loyalty.

The Net Promoter Score, or NPS®, is a straightforward metric that holds companies and employees accountable for how they treat customers. It has gained popularity thanks to its simplicity and its linkage to profitable growth. Employees at all levels of the organization understand it, opening the door to customer-centric change and improved performance.

Net Promoter programs are not traditional customer satisfaction programs, and simply measuring your NPS does not lead to success. Companies need to follow an associated discipline to actually drive improvements in customer loyalty and enable profitable growth. They must have leadership commitment, and the right business processes and systems in place to deliver real-time information to employees, so they can act on customer feedback and achieve results.

"

I found this at :

http://www.netpromoter.com/np/index.jsp


In essence, the NPS KPI is a metric for measuring customer loyalty. With its simplicity comes the subjectivity of how you treat your customers. So this raises the question: What can I do from an Enterprise Management perspective to affect this?
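For reference, the score itself is simple arithmetic. Here is a minimal sketch, assuming the usual 0-10 "how likely are you to recommend us" survey in which 9-10 counts as a promoter and 0-6 as a detractor. Those thresholds are not part of the definition quoted above, and the function name is just illustrative.

```python
def net_promoter_score(ratings):
    """Return NPS (-100..100) from a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)    # assumed threshold: 9-10
    detractors = sum(1 for r in ratings if r <= 6)   # assumed threshold: 0-6
    # Passives (7-8) count only in the denominator.
    return 100.0 * (promoters - detractors) / len(ratings)


sample = [10, 9, 8, 7, 6, 3, 10, 9]
score = net_promoter_score(sample)
print(score)  # 4 promoters, 2 detractors, 8 responses -> 25.0
```

The arithmetic is the easy part. Everything that follows is about what moves those ratings.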


From my perspective, the NPS is a measure of the effectiveness of your support CULTURE first and foremost. This is a personal - core belief - sort of thing at its foundation. Customer facing people in your support organization must project several key personality traits and behaviors. Some of these I envision to be:


Dedication. The customer is the only person in the room sort of thing.

Urgency of need. The support person must understand the importance of the situation.

Empathy. A willingness and understanding of the customer's pain.

Confidence. In the face of unknown issues and varying conditions, the customer facing person must exhibit technical strength.

Follow Through. If the customer trusts you enough to let you off the phone to handle things, you MUST FOLLOW THROUGH.

There is also the notion that in a Service oriented company, EVERYONE is a sales person in one way or another. Every interaction means an opportunity to understand the customer and help them be successful.

When you go to McDonald's and you're trying to figure out what you'd like or what level of polyunsaturated fat and cholesterol you want to propagate to your family... Ever gotten the person who asks you what you want, and you don't know, and they just stand there looking at you? NPS score --.

Now, if they engage you and suggest items like a 12 pack of Big Macs, they are DEDICATED, empathetic to your hunger pains, understand your urgency of need, and have confidence your order is going to be up in a minute or so after inputting it in the computer. And in the end they ask about dessert - Great sales person and GREAT customer service person. NPS score ++.

From a personal work habits perspective, one of the key behaviors to be considered is creating and maintaining Situation Awareness. I ran across the term SA while working on an Air Force project and found it profoundly apropos for operations organizations doing customer service. Check this out on Wikipedia:

http://en.wikipedia.org/wiki/Situation_awareness

I also read through several sections of the Google Books preview of Situation Awareness by Dr. Mica Endsley and Daniel Garland. This is at:

http://books.google.com/books?id=tUwqcqa_QaMC&printsec=frontcover&dq=Situation+Awareness&source=bl&ots=NccDiPzgMI&sig=NW0LAHrBsOTFXVmSyCSKz6154IU&hl=en&ei=pUWuS_HdE4X7lwf9tdCRAQ&sa=X&oi=book_result&ct=result&resnum=2&ved=0CBQQ6AEwAQ#v=onepage&q=&f=false

(I'm ordering the book!)

The model graphic they provide is useful as well:

http://en.wikipedia.org/wiki/File:SA_Wikipedia_Figure_1_Shared_SA_(20Nov2007).jpg

In effect, what enterprise management applications and technology MUST do to effectively achieve a higher NPS is to empower SA at all levels. In doing so, you create a culture where information is meant to be shared and used to make predictions and elicit responses and decisions based upon the information being presented.

Now this is a bit taller of an order than once thought. For example, on the Event / Fault Management side of things, information is presented as events. People respond to events. They test or open tickets or whatever workflow they do when an event is received.

But an event is NOT a situation! A Situation is something a bit different and more abstract than a simple event. So, you have to transition your events to be situation focused! Interesting thought... Especially since event presentation is the real prevalent method! Maybe the Netcool approach needs to evolve a bit!

Interesting in that OpenNMS introduces the concept that events are different from Alarms in their own GUI. Check it out at:

http://www.opennms.org/wiki/Alarms

A brilliant piece of work (and a notion that simple is Good!) in that EVENTS != ALARMS! My hat's off to the OpenNMS guys and the OGP for GETTING IT! In fact, it's a start down the road of understanding the concept of Situations in SA.

Trouble Ticketing systems attempt to do this situation grouping via tickets, but it's almost too late once it leaves your near real time pane of glass. Once you transition away from a single pane of glass, you effectively lose your SA of the real time. And if you attempt to work out of tickets, you miss all of the elemental sorts of things that happen underneath. Even elements of information like event activity, performance thresholds, support activity, and the like have to be discerned and recognized in near real time to be effective information. If you miss it, you don't know. But your customer may not miss it!

If you go straight from event to ticket, you're asking for problems. Problems like tickets that are not problems but side effects. Or side effects that are problems, just rolled up under a ticket. Or not being aware that conditions have cleared while the ticket is still being escalated and worked. Or missing all of the adjacent issues, like a router taking out a subnet, taking out an application and its three different desks.

The interesting part here is that two given situations may share events that affect each situation. This may throw a kink in normal, database table based event management systems. It may be a bit difficult to implement and support.

I am beginning to think a bit differently about Event processing, especially with regard to SA and understanding, recognizing, and responding to SITUATIONS. For example, check out this presentation by Tim Bass of Cyberstrategics. He has a long history of thought leadership in Situational Awareness in Cyberspace.

http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

CEP techniques would enable an event to be consumed by multiple situations as situations develop and dissipate. Think about the weighting of events and conditions within a given situation. Some elements may be much more pertinent than others.
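To make the "one event, many situations" idea concrete, here is a rough sketch. It is not taken from any CEP product. The class, the event types, the weights, and the threshold are all made up for illustration, and it leaves out the time-based decay you would need for situations to dissipate.

```python
class Situation:
    """A named situation that scores the events it cares about by weight."""

    def __init__(self, name, weights, threshold):
        self.name = name
        self.weights = weights      # event type -> weight (hypothetical values)
        self.threshold = threshold  # score at which the situation is "active"
        self.score = 0.0
        self.evidence = []

    def consume(self, event):
        # Events this situation does not care about carry zero weight.
        weight = self.weights.get(event["type"], 0.0)
        if weight:
            self.score += weight
            self.evidence.append(event)

    @property
    def active(self):
        return self.score >= self.threshold


def offer(event, situations):
    """Offer one event to every situation; several may consume it."""
    for situation in situations:
        situation.consume(event)


situations = [
    Situation("site_outage", {"link_down": 0.6, "bgp_flap": 0.5}, 1.0),
    Situation("app_degraded", {"high_latency": 0.7, "link_down": 0.4}, 1.0),
]
events = [
    {"type": "link_down", "node": "rtr1"},
    {"type": "high_latency", "node": "web1"},
    {"type": "bgp_flap", "node": "rtr1"},
]
for ev in events:
    offer(ev, situations)

for s in situations:
    print(s.name, s.active, [e["type"] for e in s.evidence])
```

The single link_down event ends up as evidence in both situations, with a different weight in each. That is exactly what a one-row-per-event table has trouble expressing.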

A significant part of Situation Awareness is the visualization and presentation of data regarding the ongoing situation. For example, check out this video: http://www.youtube.com/watch?v=FdKOxZIIKmQ

From the aspect of true, situational awareness, shouldn't we be looking at evolving Enterprise Management toward being able to deal with situations?

Another thought here. If I'm worried about an NPS, could I MANAGE to it live? Or at least closer to real time? What if I could meld in the capabilities of Evolve24's The Mirror product as a look at the REPUTATION SITUATION as it evolves? Check it out at: http://www.evolve24.com/mirror_for_social_media.php

This kind of changes the face of what we have been considering as BSM, doesn't it.

The common denominator in all this process and technology is Knowledge Management. How are you developing knowledge? How are you integrating it in with EVERY person. How are you using it to create SA and HUGE business discriminators? How are you using KM to empower your customers?



Implicit Status Determination example


It has come to my attention that a lot of folks do not understand state based polling, let alone the importance and scalability of doing things like active polling in a state machine representation.

For some, heartbeats and traps are a way of life. While this is OK, if you're serious about detecting problems and pointing your people to fix these issues, it's going to take an active stance. Waiting on heartbeats and traps is kind of lame in that you might as well be waiting on your end user to call you.

So, I'm going to start off with a very simple model that incorporates a technique called Implicit Status Determination that illuminates how effective state machines can be. For purposes of this example, I am working from an implementation done in LogMatrix NerveCenter.

In a NerveCenter Finite State Machine model, SNMP and ICMP polls are only applied when your model reaches a state where the poll takes it out of the current state. So, in this model, I may have several different poll conditions. These polls are only applied when they need to be. Additionally, traps are masked in as triggers as well.

In the NerveCenter model, you have States, triggers, and Actions. States are represented by the octagonal symbols in the GUI. The color of the symbols is also mapped (or not) to platforms like Network Node Manager. (As a side note, I map the purple used in this example for unreachable to the NNM status "Restricted" which is a brick brownish red color. When you see path outages in NNM, you'll see nodes depicted as Restricted where NerveCenter cannot reach those nodes.)

Triggers are the lines and boxes that connect state to state. They are named and can be created in a lot of different ways including:

  • SNMP Polls
  • ICMP Polls
  • Masks
  • Actions
  • Timers
  • Counters
  • Perl Sub-Routines
  • External via nccmd

Actions are the work that occurs when a model transitions from state to state. Any number of actions can be attached to a transition, such as informing a platform by event, running a Perl sub-routine, incrementing or decrementing counters, sending email and so on.

Using this simple Node Status model, node status poll rates can be cranked down to very tight intervals without killing performance and scalability. Additionally, traps get filtered based on their usefulness in the current state. If the trap mask doesn't transition the model instance to another state, it is not used.
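The filtering behavior can be sketched in a few lines. This is not NerveCenter's API or model format, just a plain transition table. The state names follow this post, the table is a simplified subset of the model, and the trigger names sysUpTime_ok and icmp_reply and the action names are made up.

```python
# Transition table: (current state, trigger) -> (next state, actions).
# Simplified subset of the model; action names are illustrative only.
TRANSITIONS = {
    ("Ground", "sysUpTime_ok"): ("NodeOK", ["set_poll_timer"]),
    ("NodeOK", "PORT_UNREACHABLE"): ("AgentDown", ["notify_platform"]),
    ("NodeOK", "NET_UNREACHABLE"): ("Unreachable", ["notify_platform"]),
    ("AgentDown", "icmp_reply"): ("AgentDownNodeUp", ["notify_platform"]),
    ("AgentDown", "sysUpTime_ok"): ("NodeOK", ["set_poll_timer"]),
}


class NodeModel:
    """One model instance per node, starting at Ground."""

    def __init__(self, node):
        self.node = node
        self.state = "Ground"
        self.actions_run = []

    def fire(self, trigger):
        """Apply a trigger; return False if it is ignored in this state."""
        key = (self.state, trigger)
        if key not in TRANSITIONS:
            return False  # trigger does not leave the current state: filtered
        self.state, actions = TRANSITIONS[key]
        self.actions_run.extend(actions)
        return True


model = NodeModel("rtr1")
used = [model.fire(t) for t in
        ("icmp_reply", "sysUpTime_ok", "PORT_UNREACHABLE", "NET_UNREACHABLE")]
print(used, model.state)
```

Only triggers that have an entry for the current state do any work. A poller built on this table would likewise only schedule the polls that can move the instance out of its current state.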

Let's go through the States...

Ground

All models are instanced at Ground. This model is scoped "Node" so only one instance per node is used. The first state reachable is the Needs Poll state. The transition is the result of a successful SNMP Get of system.sysUptime. (Pretty light weight. But if you want to store the variable, it gives you a very valid data point for Agent Availability.)

Out of the Needs Poll state are several poll and trap conditions that relate to various states of the Node. The first of which is Node OK.

NodeOK

NodeOK is achieved on a valid SNMP poll of system.sysUptime where the value of sysUptime is more than the interval times 100. (sysUptime is expressed in TimeTicks, which are 1/100ths of a second.)
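As a quick worked example of that comparison: with a 15 second interval the threshold is 15 x 100 = 1500 TimeTicks. The sketch below assumes that a value under the threshold means the agent restarted since the last poll and should land in Rebooted. That mapping is my reading of the model, not something it spells out, and the function name is hypothetical.

```python
TICKS_PER_SECOND = 100  # SNMP TimeTicks are hundredths of a second


def classify_uptime(sys_uptime_ticks, poll_interval_s):
    """Return 'NodeOK' if the agent has been up longer than one poll
    interval, else 'Rebooted' (assumed: it restarted since the last poll)."""
    if sys_uptime_ticks > poll_interval_s * TICKS_PER_SECOND:
        return "NodeOK"
    return "Rebooted"


print(classify_uptime(360000, 15))  # one hour of uptime vs. a 1500 tick threshold
print(classify_uptime(1200, 15))    # 12 seconds of uptime: under the threshold
```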

In the transition to NodeOK, a poll timer is set to a value corresponding to the ISD interval window. In this example, it is 15 seconds. As other valid SNMP polls occur for the node, they send a trigger called timer_reset. These triggers go from Node OK back to Node OK and they reset the poll timer to the next ISD interval window. What this does is "slide" the poll timer window on the assumption that any valid poll to the Node IMPLIES a valid Node Status poll. So, while I'm setting and using a 15 second window, I am only discretely polling for Node Status every 10 to 15 minutes, depending upon the number of sub-objects I'm managing in other State models.
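The sliding window is easy to simulate. The sketch below is a toy clock-driven version with made-up class and method names, not how NerveCenter implements its timers. Any valid poll pushes the deadline out, and an explicit status poll only fires when the window expires untouched.

```python
class IsdTimer:
    """Sliding status-poll window for one node (times are in seconds)."""

    def __init__(self, window_s, now=0.0):
        self.window_s = window_s
        self.deadline = now + window_s
        self.explicit_polls = 0

    def on_valid_poll(self, now):
        """Any good SNMP response from the node slides the window forward
        (the timer_reset trigger in the model)."""
        self.deadline = now + self.window_s

    def tick(self, now):
        """Return True when an explicit status poll is actually required."""
        if now >= self.deadline:
            self.explicit_polls += 1
            self.deadline = now + self.window_s
            return True
        return False


# Node with other models polling it every 10 seconds: status is implied.
busy = IsdTimer(window_s=15)
for t in range(1, 61):
    if t % 10 == 0:
        busy.on_valid_poll(t)
    busy.tick(t)

# Node with no other poll traffic: status must be polled explicitly.
quiet = IsdTimer(window_s=15)
for t in range(1, 61):
    quiet.tick(t)

print(busy.explicit_polls, quiet.explicit_polls)  # 0 4
```

Over the same 60 seconds the busy node costs no extra status polls at all, while the quiet node gets one every 15 seconds. The tight window only costs traffic where nothing else is already talking to the node.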

Agent Down

Agent Down occurs when the trigger PORT_UNREACHABLE is received. PORT_UNREACHABLE is triggered when an ICMP Port Unreachable message is received from the Node. It signifies that the host is telling you that no process is listening on UDP 161, meaning the Agent is no longer listening. (Notice that this means that the Node is actually still up. It has the wherewithal to send you an ICMP control message, which means the INET daemon is still jammin.)

Rebooted

The Rebooted state is used to capture a node reboot and suppress all of the downs until the node comes back up. This state is also used to reset subobject models so that they do not become stale especially when instances change.

Agent Down Node Up

From the Agent down state, if the node responds to an ICMP ping, it is transitioned to Agent Down Node Up.

Node Down

Both SNMP and ICMP communications have failed. The node is deemed down.

Unreachable

Unreachable occurs when polling attempts yield either NET_UNREACHABLE or NODE_UNREACHABLE triggers. These are directly derived from the Destination Unreachable control messages in ICMP - namely Network Unreachable and Host Unreachable (Node Unreachable in this model's terms).

From RFC 792 - Internet Control Message Protocol

"Description

If, according to the information in the gateway's routing tables, the network specified in the internet destination field of a datagram is unreachable, e.g., the distance to the network is infinity, the gateway may send a destination unreachable message to the internet source host of the datagram. In addition, in some networks, the gateway may be able to determine if the internet destination host is unreachable. Gateways in these networks may send destination unreachable messages to the source host when the destination host is unreachable."

These two messages come from routers in the path between the poller and the end node. And, as such, when route convergence occurs, it is possible that a route to an end network may transition to infinity or become unreachable until routing metrics can be recalculated and the traffic rerouted. It is an indication that your topology has changed.

In cases where the path is persistently lost, it is an outage. But because the message is emitted from the device that recognized the path loss, you have all of the good path TESTED ALREADY.

I typically see Net Unreachable when a router loses a route to a given network. I have seen a Node Unreachable when the ARP entry for the node is waxed out of the ARP cache on the router that services the node's local network.

Results

Using ISD and model sets, I can usually outperform polling and status mechanisms in other COTS offerings 4-5:1. I have benchmarked against NNM and performed on 20 second status intervals with 20% of the traffic netmon uses at 5 minute intervals.

And because I am catching active topology changes at the IP level in near real time, I'd say in many cases, topology changes that will be seen via NerveCenter will not be seen on other products. (I have captured Dijkstra recalcs in midstream on slow and underpowered Cisco devices. Hey! It's a problem!)

In Summary

While this is a very basic Node Status model, you see the technique and prowess of the Finite State Machine in action. It is a very different world from the static poll lists so prevalent in basic pollers you see. And while a poll list is very lightweight code wise, you never make a decision in line with the poll. You end up polling continuously even when you don't want to.

While I do realize that some organizations want an "Out of the Box" solution, they tend to get a solution that is dependent upon the vendor to adapt to your infrastructure - your workflow - your organizational technical knowledge. Do you control the pace or do the Vendors? Do you evolve? Are you committed to continuous process improvement or is a lack of change your modus operandi?

I also realize that inevitably, some organizations must reinvent the wheel. Their Ego vs. Reality ratio has not matured or stabilized yet. Good luck. NerveCenter is still the standard in intelligent status polling. Has been for years. I got my first looksee in 1993 via Netlabs Dual Manager.








Saturday, March 13, 2010

Java and Finite State Machines

OK. I have this thing about Event Driven Architectures and Finite State Automata. Sounds big and bold but it's really very simple once you get past the lingo and hoopla!

I like Finite State Machines because it's how we, as people, step through logic and it's how we enact and implement workflow. They are easy to illustrate and explain, even to the novice or PhD (Pointy Haired Dude!).

FSMs track the condition of "thingies" through various conditions and logic cases. Associated with each FSM are a Start state, transitions, and states. Simple enough, right?

When you instance an FSM, it becomes an Object. This means that you have started to track a "Thingie" in your FSM and it is in the Start state.

Generally, there are two types of Finite State Automata - Moore or Mealy model. In practice, a Moore Model of a state machine uses only Entry Actions, such that its output depends on the state. A Mealy model of a state machine uses only Input Actions, such that the output depends on the state and also on inputs.
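The difference is easier to see side by side. Below is a small sketch with an invented two-state Up/Down machine, run once Moore-style (output looked up from the state just entered) and once Mealy-style (output looked up from the state plus the input that caused the transition). The tables and output strings are purely illustrative.

```python
# Shared next-state table: (state, input) -> next state.
NEXT_STATE = {("Up", "ok"): "Up", ("Up", "fail"): "Down",
              ("Down", "ok"): "Up", ("Down", "fail"): "Down"}

# Moore: output is a function of the state alone (an entry action).
MOORE_OUTPUT = {"Up": "green", "Down": "red"}

# Mealy: output is a function of the state AND the input (a transition action).
MEALY_OUTPUT = {("Up", "ok"): None, ("Up", "fail"): "raise alarm",
                ("Down", "ok"): "clear alarm", ("Down", "fail"): None}


def run_moore(state, inputs):
    outputs = []
    for symbol in inputs:
        state = NEXT_STATE[(state, symbol)]
        outputs.append(MOORE_OUTPUT[state])  # depends only on the new state
    return outputs


def run_mealy(state, inputs):
    outputs = []
    for symbol in inputs:
        outputs.append(MEALY_OUTPUT[(state, symbol)])  # depends on state + input
        state = NEXT_STATE[(state, symbol)]
    return outputs


inputs = ["ok", "fail", "fail", "ok"]
moore_out = run_moore("Up", inputs)
mealy_out = run_mealy("Up", inputs)
print(moore_out)
print(mealy_out)
```

The Moore run reports a color on every step. The Mealy run only says something on the two steps where the status actually changes, which is usually what you want from an alarm.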

Sounds complex but it's not. It all breaks down to states, transitions, and actions.
For simplicity's sake, a Finite State Machine can be described in a few database tables:

CREATE TABLE States (
    OLD_STATE varchar(32),
    NEW_STATE varchar(32),
    Transition_Name varchar(32),
    Actions_Index integer
);

CREATE TABLE Transitions (
    Transition_Name varchar(32),
    Transition_Method integer
);

CREATE TABLE Actions (
    Actions_Index integer,
    Action_cmd varchar(255)
);

When a State is Achieved as in becomes a new state, any transition that progresses the State Machine needs to be enacted and scheduled. For example, if you have a start state and its one transition out of start requires an action, that action needs to be enacted.

When a state machine receives triggers, these are parsed and assigned to transitions which move a tracked object from state to state. If an object is in a state where a transition cannot be applied, the transition is dropped. For example, if you have an object in an Up state and the poll determination sends a transition of Obj_up, but that transition is not present in the Up state, the transition is dropped.

When an Object transitions from an Old State to a New State, any actions for that transition need to be executed. (This is the workflow.) Once the New State is achieved, we restart the process.

The benefit of a state machine is that it lets you model objects in an asynchronous way, as fast or as slow as need be. Methods are only executed upon reaching an achieved state. So, you don't have to execute ALL methods upon instantiation of an object... Only as you progress through the state machine.

From a purely "Persistent" point of view, an Object instance is a row in a DB table. This row depicts the current state and a date-time stamp. Everything else around the FSM logic is used to determine the next state and perform actions based upon transitioning from one state to another.
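Here is a small sketch of that persistent view using Python's built-in sqlite3 module. The States and Actions tables follow the layout described earlier. The Instances table, its column names, and the sample rows are assumptions for illustration, and the Transitions table is left out to keep it short.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE States (OLD_STATE TEXT, NEW_STATE TEXT,
                     Transition_Name TEXT, Actions_Index INTEGER);
CREATE TABLE Actions (Actions_Index INTEGER, Action_cmd TEXT);
-- Hypothetical table: one row per tracked object, state plus timestamp.
CREATE TABLE Instances (Object_Name TEXT PRIMARY KEY, STATE TEXT,
                        Updated TEXT DEFAULT CURRENT_TIMESTAMP);
INSERT INTO States VALUES ('Start', 'Up', 'Obj_up', 1),
                          ('Up', 'Down', 'Obj_down', 2),
                          ('Down', 'Up', 'Obj_up', 1);
INSERT INTO Actions VALUES (1, 'log up'), (2, 'open ticket');
INSERT INTO Instances (Object_Name, STATE) VALUES ('router1', 'Start');
""")


def apply_transition(obj, transition):
    """Move obj if the transition is valid in its current state.
    Returns the action command to run, or None if the trigger is dropped."""
    row = db.execute(
        """SELECT s.NEW_STATE, a.Action_cmd
             FROM Instances i
             JOIN States s ON s.OLD_STATE = i.STATE
             JOIN Actions a ON a.Actions_Index = s.Actions_Index
            WHERE i.Object_Name = ? AND s.Transition_Name = ?""",
        (obj, transition)).fetchone()
    if row is None:
        return None  # no such transition out of the current state: dropped
    db.execute("UPDATE Instances SET STATE = ?, Updated = CURRENT_TIMESTAMP "
               "WHERE Object_Name = ?", (row[0], obj))
    return row[1]


results = [apply_transition("router1", t)
           for t in ("Obj_up", "Obj_up", "Obj_down")]
final_state = db.execute("SELECT STATE FROM Instances "
                         "WHERE Object_Name = 'router1'").fetchone()[0]
print(results, final_state)
```

The second Obj_up arrives while router1 is already Up, finds no matching row, and is dropped. The only thing that persists per object is that one Instances row.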

Now that we have the basics down, let's look at some code examples:

First of all, there was this fellow named Rocco Caputo who developed a set of Perl modules called POE, or Perl Object Environment. As per Rocco: "POE originally was developed as the core of a persistent object server and runtime environment. It has evolved into a general purpose multitasking and networking framework, encompassing and providing a consistent interface to other event loops such as Event and the Tk and Gtk toolkits."

POE consists of a kernel that can be thought of as a small operating system running in a user process. Each kernel supports one or more Sessions and each Session has its own space called a Heap. Each Session, in turn, has a series of events and event handlers which run when called.

Events can be yielded (they are posted to the end of the event queue for processing) or they can be called (the handler runs immediately, ahead of anything queued). Event handlers are Perl subroutines that are executed when the event is dispatched to the session.

Additionally, Sessions can be named and events can be sent from one session to another.

Sessions are initiated in a couple of ways: with States or with Objects.

This is the States way:

POE::Session->create(
    inline_states => {
        one => \&some_handler,
        two => \&some_handler,
        six => \&some_handler,
        ten => \&some_handler,
        _start => sub {
            $_[KERNEL]->yield($_) for qw(one two six ten);
        }
    }
);

Here's a session initiation with Objects and Inline States:

POE::Session->create(
    object_states => [
        $object_1 => { event_1a => "method_1a" },
        $object_2 => { event_2a => "method_2a" },
    ],
    inline_states => {
        event_3 => \&piece_of_code,
    },
);

Notice that the events in inline states point to subroutine code references. Each event handler must be organized as a subroutine.

Each subroutine is set up like:

sub Yada_yada {
    my ($kernel, $heap, $parameter) = @_[KERNEL, HEAP, ARG0];
    # Do stuff in the sub...
    # ....
    return;
}

While it may seem a bit unorthodox, Perl actually inits subs with an Arguments array @_ and POE uses this natively.

So, in POE, we init sessions which have states and actions (callbacks). And we have a Heap space to store our state data.

In a simplistic way of looking at it, transitions and their application to state are accomplished in the events and callbacks. If the object is in the proper OLD_STATE to transition to a NEW_STATE, the transition occurs (writing the new state name to the Heap and executing the Actions).

Now, here's something VERY INTERESTING about POE and Perl:

$_[KERNEL]->state sets or removes a Handler for an EVENT_NAME within the current Session. For example, the following line would remove the handler for the EVENT_NAME in the current session.

$_[KERNEL]->state( 'on_client_input' );

Subsequent calls that pass a subroutine code reference replace the handler, as in:

$_[KERNEL]->state( 'on_client_input', \&new_subroutine );

Given Perl eval, one could read in new subroutines, check them in eval, AND put them into action within POE without having to stop, reread, and restart. Can you say 24 BY FOREVER!
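For readers who don't run POE, the same hot-swap idea can be sketched in Python. To be clear, this is an analogue and not POE's API. The handler registry and function names are invented, and exec stands in for Perl's eval as the "check it before you use it" step.

```python
handlers = {}  # event name -> callable, consulted on every dispatch


def set_handler(event_name, source=None, func_name="handler"):
    """Install, replace, or (with source=None) remove a handler at runtime.
    The source must define a function called func_name."""
    if source is None:
        handlers.pop(event_name, None)
        return
    namespace = {}
    # compile() raises SyntaxError on bad code before anything is replaced.
    exec(compile(source, "<%s>" % event_name, "exec"), namespace)
    handlers[event_name] = namespace[func_name]


def dispatch(event_name, *args):
    """Run the current handler for an event, or return None if there is none."""
    handler = handlers.get(event_name)
    return handler(*args) if handler else None


set_handler("on_client_input", "def handler(line):\n    return line.upper()\n")
first = dispatch("on_client_input", "hi")

# Swap in new logic without stopping the process.
set_handler("on_client_input", "def handler(line):\n    return line[::-1]\n")
second = dispatch("on_client_input", "hi")

set_handler("on_client_input")  # remove the handler
third = dispatch("on_client_input", "hi")
print(first, second, third)
```

As with Perl eval, this should only ever be fed code you trust.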

Now, when I look at Java and State Machines, I am in a bit of a conundrum. Objects must run their methods right away. So trying to model a "Thingie" becomes an exercise where my "Thingie" object becomes a container of state objects and transition objects. All of a sudden, the app is not scalable.

And in keeping pace with POE, each "Thingie" object must be a separate thread, as each Session is its own "thread of execution..."

In looking over the Finite State Machine Framework on SourceForge :

http://unimod.sourceforge.net/fsm-framework.html

I notice that this is a good FSM framework. However, State machines must be compiled and rerun under the JVM. No dynamic, non-deterministic methods.

I could spoof Java into doing a persistent State machine by only handling transitions in objects. Everything else must be done via a persistence store such that only transitions are instanced and transition actions are executed upon transition execution. This adds a bit of overhead in the IO model as well, as the states have to be hibernated or stored in a data structure of some kind.

Changing transitions on the fly is an exercise in calling classes out of a database store. If a new class is applied, it gets executed by name via the DB record. But, in order to change things, the process must stop and restart to reread all of the classes and class hierarchy.

Each transition must either have the same number of method arguments or it must be uniquely named. Because the methods underneath a transition can keep changing, method overloading would become rather prolific.

My Conclusion:

FSMs are hard to do in an OO type Object Model without instancing a whole lot of objects. But it could be accomplished if you make the object model look kinda like an FSM. Still nowhere near as dynamic as Perl and POE though. And because of the cooperative nature of the POE kernel, it is significantly tighter than attempting to spawn out hundreds of threads.

Topology Based Correlation

I know a lot of folks would like to think that Topology based correlation, as delivered today, is real 100% of the time. It can be. IF your network is REALLY simple. I have yet to see a simple network. (Guess I don't get out much!)

The problem is that IP networks are designed as a best effort communications medium that is designed to morph, adapt, and overcome problems in the network. As such, routes will change on the fly. Application connectivity can change very quickly. And this "adaptation" may hide outages from you.

Check out this wikipedia article on MPLS Fast Reroute and local protection :

http://en.wikipedia.org/wiki/MPLS_local_protection

There are a whole host of protocols and techniques around creating and maintaining a highly available network service as a utility. This includes the gamut from Spanning Tree Protocol to Extended BGP.

Does your topology map really map out your "true" network? Does it capture routing protocol decisions and seemingly mundane little network idiosyncrasies like GRE Tunnels, Parreto Tunnels and VPNs? If you want to REALLY see topology as a true map that's live, you need to check out:

http://www.packetdesign.com/products/rex.htm

Does downstream suppression really buy you anything? Think about it for a moment. While you are responding to network outages that may appear, then disappear, do you wait until you have a "hard down" condition to ticket? How do you know you have a hard down condition anyway?

How do you know if you're giving away your SLA time? When a customer experiences an outage, do you really know and understand when that is? What impact does your interface down have on your customer? What impact does it have on your customer's customer?

I find it ironic and somewhat disturbing that folks have not thought through SLAs in their product strategies. If you do not check and verify in your correlation, you do not verify the problem, and you will most likely either miss a problem due to false downstream suppression or have a problem for several minutes before your support organization knows.

Sometimes problems are not Hard down as one might think. Issues like packet drops, big latency changes, and asymmetric routes can wreak havoc in communications without having hard down issues. They can make things extremely slow, cause a lot of retransmissions, and cause service resets. If the customer is not having a good experience with their work over the network, it's a very bad problem in that your reality doesn't match the customer's perception of reality. Your situational awareness is disconnected from your customer at a time when it should be connected.

So, how do we create awareness and customer cognizance?

1. You need to understand the services that are important to them and measure them first.

If you measure latency and connectivity through Exchange and the times increase significantly or the webmail client response goes to heck in a bucket, HOUSTON - you have a problem!

2. You need to understand the topology at a basic level from the service poller to the service.

By understanding the topology at a basic level between a service poller and the service provider point, when a problem occurs, you can discern the before and after. Additionally, this gives you a baseline network lineage with which to apply other data.

For example, in a service lineage, you could show all of the CPUs in every element that had a CPU in a service. If you saw an abnormally high or low CPU condition along the path lineage, it could give you very valuable insight into problem areas or places where the service may be degraded from.

This service lineage gives the engineer, technician, and support personnel the ability to visualize the service but drill down into details from the context of a service. Even mundane things, like depicting within the service lineage all change taskings and trouble ticket activity within the last week. Or showing the operating system versions and patch levels across the lineage.

From a CMDB perspective, the relationships needed can be easily modelled using the CIM model. You can find the CIM Schema here at:

http://www.dmtf.org/standards/cim/cim_schema_v28

The funny part is that with a wee bit of creativity, traceroute provides a great baseline for a lineage. When you look at how traceroute works, it uses the Time To Live parameter to step through a network. When a TTL / hop count is exceeded, the packet is discarded and an ICMP Time Exceeded message is sent back to the host (ICMP Type 11). Traceroute typically sends 3 packets at a time.

Box stock traceroute is usually done via UDP, and on Windows it is done with ICMP. However, TCP is used in more advanced tools. The benefit of using TCP is that you can perform a traceroute to a specific protocol port. This will tell you if a protocol such as HTTP (TCP port 80) can connect. If the specific port is blocked, the traceroute will not progress past the point of filtering. Additionally, an ICMP Administratively Prohibited message may be seen if available.

While traceroute may skip over elements in the network, when something breaks, at the basic IP/Service protocol layer, you have an impromptu schematic with which to baseline your service diagnosis. If the path is broken or changed, you'll be able to tell.
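Comparing a stored baseline against the current path takes very little code once you have the hop lists. The sketch below assumes you have already collected each traceroute as an ordered list of hop addresses, with None for a hop that never answered. The function name and the sample addresses (documentation ranges) are made up.

```python
def diff_lineage(baseline, current):
    """Compare two traceroute hop lists (hop IPs in order, None = no reply).
    Returns (status, index of the first differing hop or None)."""
    for i, (old, new) in enumerate(zip(baseline, current)):
        if old != new:
            # A silent hop where one used to answer reads as a break;
            # a different address reads as a path change.
            return ("broken" if new is None else "rerouted", i)
    if len(current) < len(baseline):
        return ("broken", len(current))    # trace stopped short of the service
    if len(current) > len(baseline):
        return ("rerouted", len(baseline)) # path grew extra hops
    return ("unchanged", None)


baseline = ["192.0.2.1", "192.0.2.9", "198.51.100.5"]
same = diff_lineage(baseline, list(baseline))
rerouted = diff_lineage(baseline, ["192.0.2.1", "192.0.2.17", "198.51.100.5"])
broken = diff_lineage(baseline, ["192.0.2.1", None])
print(same, rerouted, broken)
```

The returned index points at the first hop that differs from the baseline, which is where to start looking.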

There are 3 elements of IP and how it works that give you topology based correlation in near real time. These elements are:

ICMP Net Unreachable - ICMP Type 3 Code 0
ICMP Host Unreachable - ICMP Type 3 Code 1
ICMP Port Unreachable - ICMP Type 3 Code 3

Check out http://tools.ietf.org/html/rfc792 for reference on the ICMP protocol.

You usually see Net Unreachable when a router in the given path for a destination IP has no route to that network. Likewise, you see Host Unreachable from the router for a given network when the destination IP is not in that router's ARP cache and the IP address isn't responding to an ARP request. You see a Port Unreachable from a destination host that does not have a listener on the port you're trying to communicate with.
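Those three cases map cleanly onto a lookup. Here is a sketch. The type and code numbers come from RFC 792, while the function, the result fields, and the wording of the meanings are just one illustrative way to record what each message tells you about the path.

```python
# ICMP Destination Unreachable (type 3) codes from RFC 792.
UNREACHABLE_CODES = {0: "NET_UNREACHABLE",
                     1: "HOST_UNREACHABLE",
                     3: "PORT_UNREACHABLE"}

MEANINGS = {
    "NET_UNREACHABLE": "router %s has no route toward %s",
    "HOST_UNREACHABLE": "router %s cannot resolve %s on its local network",
    "PORT_UNREACHABLE": "host %s is up but nothing listens on the port (target %s)",
}


def interpret_unreachable(icmp_type, icmp_code, reporter_ip, target_ip):
    """Turn an ICMP unreachable into a correlation hint, or None if the
    message is not one of the three codes above."""
    if icmp_type != 3 or icmp_code not in UNREACHABLE_CODES:
        return None
    trigger = UNREACHABLE_CODES[icmp_code]
    return {
        "trigger": trigger,
        # The reporter answered, so the path up to it is known good.
        "path_tested_to": reporter_ip,
        # Only the target itself can say "port unreachable".
        "target_up": trigger == "PORT_UNREACHABLE",
        "meaning": MEANINGS[trigger] % (reporter_ip, target_ip),
    }


net = interpret_unreachable(3, 0, "192.0.2.1", "198.51.100.7")
port = interpret_unreachable(3, 3, "198.51.100.7", "198.51.100.7")
other = interpret_unreachable(11, 0, "192.0.2.1", "198.51.100.7")
print(net["meaning"])
print(port["meaning"])
```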

These elements are a FUNDAMENTAL part of how IP works as a protocol. It is just a matter of using the information intelligently to tell you when your network breaks.

Consider this - If you monitored the ingress to your Enterprise passively using Snort on ICMP, you could potentially monitor all of the flow control occurring into and out of your enterprise. And by definition, the IP header plus the first 64 bits (8 bytes) of the original datagram are included in the payload of the ICMP message!
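That embedded copy of the original packet is what makes passive monitoring useful, because it tells you which conversation the error refers to. RFC 792 guarantees the original IP header plus the first 8 bytes of its data, which is enough to recover the UDP or TCP port numbers. Here is a sketch of pulling it apart with the standard struct module. The function name and the hand-built sample packet are illustrative, and the checksums are left at zero.

```python
import socket
import struct


def parse_icmp_error(icmp):
    """Parse an ICMP error message (bytes starting at the ICMP header)."""
    icmp_type, icmp_code = struct.unpack("!BB", icmp[:2])
    inner = icmp[8:]                 # skip type, code, checksum, unused
    ihl = (inner[0] & 0x0F) * 4      # embedded IP header length in bytes
    proto = inner[9]
    src = socket.inet_ntoa(inner[12:16])
    dst = socket.inet_ntoa(inner[16:20])
    # First 4 bytes after the IP header: source and destination port
    # (only meaningful when the embedded protocol is TCP or UDP).
    sport, dport = struct.unpack("!HH", inner[ihl:ihl + 4])
    return {"type": icmp_type, "code": icmp_code, "proto": proto,
            "src": src, "dst": dst, "sport": sport, "dport": dport}


# Hand-built sample: a Port Unreachable about an SNMP poll (UDP 161).
inner_ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                       socket.inet_aton("192.0.2.10"),
                       socket.inet_aton("198.51.100.7"))
inner_udp = struct.pack("!HHHH", 40000, 161, 8, 0)
sample = struct.pack("!BBHI", 3, 3, 0, 0) + inner_ip + inner_udp

info = parse_icmp_error(sample)
print(info)
```

In the sample, the embedded header says a poller at 192.0.2.10 sent UDP to port 161 on 198.51.100.7, so the Port Unreachable can be tied back to a specific SNMP poll.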

In Summary

I like the Service Lineage approach. It gets folks aware of services and it keeps the service and the customer in the forefront. Even across service degradations and security impairments. How useful would this "view" be in mitigating a bot net that had infiltrated your enterprise? Or the recognition of the "Low and Slow" penetration probing that gets overlooked? And to be able to visualize cause and effect!

SNMP Agents...

In the past couple of weeks, I have been researching the agent and sub-agent capabilities of the Net-SNMP agent distribution. I came to the startling conclusion that most of the applications commonly found in enterprises have sub-agents readily available, either from the vendors or as open source.

The first site I went to provided a list of Net-SNMP specific sub-agents and links to them. This list can be reviewed at:

http://www.net-snmp.org/wiki/index.php/Net-snmp_extensions
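To give a feel for how little code an agent extension can take, here is a minimal sketch of a script for Net-SNMP's "pass" mechanism, in which snmpd runs an external program with -g OID (GET) or -n OID (GETNEXT) and reads back three lines: OID, type, and value. Note that this is the simple pass interface, not an AgentX sub-agent like the ones on the list. The OID subtree and the values are hypothetical placeholders, not registered or real.

```python
import sys

BASE = ".1.3.6.1.4.1.99999.1"    # hypothetical subtree, not a registered OID

# Hypothetical values; a real script would query the application instead.
VALUES = {
    BASE + ".1.0": ("string", "example-app"),
    BASE + ".2.0": ("integer", "42"),
}


def oid_key(oid):
    """Turn a dotted OID into a tuple of ints so OIDs sort numerically."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def handle(flag, oid):
    """Return the three reply lines for a -g or -n request, or [] if none."""
    hit = None
    if flag == "-g" and oid in VALUES:
        hit = oid
    elif flag == "-n":
        later = sorted((o for o in VALUES if oid_key(o) > oid_key(oid)),
                       key=oid_key)
        hit = later[0] if later else None
    if hit is None:
        return []                # no output means "no such object" to snmpd
    kind, value = VALUES[hit]
    return [hit, kind, value]


def main(argv):
    if len(argv) >= 3:
        lines = handle(argv[1], argv[2])
        if lines:
            print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

It would be hooked in with a line in snmpd.conf along the lines of "pass .1.3.6.1.4.1.99999.1 /usr/bin/python3 /path/to/script.py", where both the OID and the path are placeholders.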

In the list I noticed sub-agents for elements listed in the requirements in the agent specific section. For example, Jasmin implements the IETF Script MIB using Java as the script language. Jasmin is available through :

http://www.ibr.cs.tu-bs.de/projects/jasmin/README.html

I find Jasmin intriguing in that Java programmers are a prolific bunch! You can find them everywhere! It would be interesting to see what a good Java purist could come up with on an agent extension, given that SNMP is supposed to be lightweight, with very tightly written code. And Java, because of the JVM up front, tends to not be so "lightweight".

What if you could use Jasmin to leverage Java Web Start to dynamically add and update sub-agents?

Interestingly enough, I also ran across Ramon, an open source implementation of RMON2! Ramon data can be found here:

http://savannah.nongnu.org/projects/ramon/

I wonder what overhead it introduces on a host's IP stack. Could this be used in conjunction with old hardware to provide some very interesting data sets for a given enterprise at a nice operational price point?

And another, very pertinent sub-agent I found is this one for VMWare at:

http://www.vmware.com/support/esx21/doc/esx21admin_snmpagents.html

If you're working with VMWare and vSphere in Cloud environments, this presents some very interesting possibilities. While I doubt the sub-agent is up to speed on the latest ESX version, more adoption would drive that priority.

In the following table, I list the applications verbatim from the requirements. I’ve also plugged in direct or indirect references to sub-agent capabilities.


Application: Sub-Agent Source

Web Servers
Apache: http://mod-apache-snmp.sourceforge.net/english/docs.htm
WebLogic: http://download.oracle.com/docs/cd/E13222_01/wls/docs81/snmpman/index.html
WebSphere: http://www.webnms.com/snmpadaptor/datasheet.html
    http://www.pcuniverse.com/IBM-WebSphere-Transformation-Extender-SNMP-Collection-v.-8.2-media-CD/BA0NSEN/pd/p4740071
    http://publib.boulder.ibm.com/infocenter/tivihelp/v3r1/index.jsp?topic=/com.ibm.itcamwas.doc/cynmst464.htm
IIS: http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/4a168955-4982-44d5-8a18-e252d37a3557.mspx?mfr=true

Web Application Servers
J2EE: http://blogs.sun.com/orivat/entry/glassfish_snmp_j2ee_mib_presentation
Jboss: http://community.jboss.org/wiki/JBossSNMPAdapter
.NET: http://www.webnms.com/net-snmp/index.html
Tomcat: http://www.opennms.org/index.php/Tomcat_5.5_JMX_How-To
    http://forums.adventnet.com/viewtopic.php?t=959&start=0
    http://jakarta.apache.org/jmeter/usermanual/build-monitor-test-plan.html

LDAP
MS Active Directory: http://technet.microsoft.com/en-us/library/cc783142(WS.10).aspx
OpenLDAP: http://ostatic.com/netsitter
Sun LDAP: http://docs.sun.com/source/816-6698-10/snmp.html

Relational Database Servers
Oracle: http://download.oracle.com/docs/cd/B10501_01/em.920/a96672/toc.htm
Sybase: http://download.oracle.com/docs/cd/B10501_01/em.920/a96672/toc.htm
DB2: http://bytes.com/topic/db2/answers/181104-db2-snmp-support-v8
SQL Server: http://technet.microsoft.com/en-us/library/dd316347.aspx
SQL Server Cluster: http://technet.microsoft.com/en-us/library/dd316347.aspx
Informix: http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.snmp.doc/snmp35.htm
MySQL: http://mysqldump.azundris.com/archives/63-Sysadmins-Nightly-Mental-Pain-SNMP.html
PostgresQL: http://pgfoundry.org/projects/pgsnmpd

Email
Exchange: http://www.oidview.com/mibs/311/WINDOWS-NT-PERFORMANCE-EXCHANGE.html
POP/IMAP: http://netilium.org/~mad/technotes/postfixstat/
SMTP: http://netilium.org/~mad/technotes/postfixstat/

DNS Servers
Bind: http://www.packetmischief.ca/network/monitoring/bind9/
    http://www.l3jane.net/wiki/factory:projects:b9agent_en
Active Directory: http://technet.microsoft.com/en-us/library/cc783142(WS.10).aspx
MS DHCP Server: http://www.oidview.com/mibs/311/DHCP-MIB.html

MS SCOM: http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx
MS SMS: http://microsoft-sms-network-monitor.software.informer.com/
    http://www.dlldll.com/snmpelea.dll_download.html
MS Radius: http://support.microsoft.com/kb/237295
MS RAS: http://software.informer.com/getfree-snmp-mibs-microsoft-ras-vpn/


My take:

Let's face it. Everyone is watching every penny in the IT budget. Why not leverage this technology?

In the Beginning...

This is my start of a personal project where I get to share my thoughts, ideas, philosophy, and experiences related to Enterprise Management with all of you. Well, those willing to read my ramblings.

About me a bit...

I have a bit over 30 years of experience in the IT/Networks/Telecommunications world. I started off in the Air Force as a Cryptographic Equipment / Systems technician, working my way up from ground-zero level training through many years of projects and different exposures to technology. The Air Force taught me a few things up front:

The ability to survive and operate.
The ability to think and reason on my feet, under stress.
A willingness to accept seemingly impossible tasks and show some level of success.
An understanding of mission and Urgency of Need.
Responsibility.
A realization that life is not about what you get, it's about what you give!

I spent my entire Air Force duty beyond Tech School in Combat Communications. I was prepared to go anywhere in the world on a moment's notice, do what I needed to do, and protect and defend the Constitution of the United States of America.