Monday, August 14, 2017

Machine Learning and Status Determination

In thinking about status polling in networks, most of the software we use works from ping lists of IP addresses or host names. We set standard timeouts and poll rates across the board, statically. Yet ICMP ping is not all that accurate in determining whether a specific Node is REALLY up or down. It is rife with misconceptions among those who lack an in-depth understanding of the ICMP protocol and its purpose in life, especially when there are devices in the network that are not pingable but can still affect latency and connectivity.
What if we are thinking about this wrong?
Artificial intelligence and machine learning techniques may enable us to do a much better job in status "determination" versus just status polling.
First, I would think that we should reset the objectives for status polling. Status polling goals used to read like "ping every device in the network every 2 minutes." While this seems OK, is it really what you want? What if you set the objective to:
Establish a status interval of 2 minutes for every managed device. Or: I need to know within 2 minutes when a device is no longer responding. And what about latency and thresholds?

Here's a thought

I have 10,000 device IPs and 2 pollers.
Give the list to both pollers. Have them go through the list to establish:
  1. The latency distance from the poller.
  2. The baseline mean and standard deviation for latency.
Now, organize your latency distances from longest to shortest into groups or bands. Work to distribute the list between the pollers as primary and secondary, based on the best results but keeping the split even.
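To make the banding concrete, here is a minimal sketch in Python, assuming each poller has already collected a handful of latency samples per IP. The band width, function names, and sample data are illustrative assumptions, not part of any specific product:

```python
import statistics
from collections import defaultdict

def baseline(samples):
    """Return (mean, standard deviation) for a list of latency samples in ms."""
    return statistics.mean(samples), statistics.pstdev(samples)

def band_of(mean_latency, band_width_ms=20):
    """Assign an IP to a latency band; band 0 is closest to the poller."""
    return int(mean_latency // band_width_ms)

def organize(latency_samples, band_width_ms=20):
    """latency_samples: {ip: [latency_ms, ...]} -> {band: [(ip, mean, stdev), ...]}"""
    bands = defaultdict(list)
    for ip, samples in latency_samples.items():
        mean, stdev = baseline(samples)
        bands[band_of(mean, band_width_ms)].append((ip, mean, stdev))
    return dict(bands)

def split_primary_secondary(bands, pollers=("poller-a", "poller-b")):
    """Alternate IPs within each band so the two pollers carry an even load."""
    assignment = {p: [] for p in pollers}
    for band in sorted(bands, reverse=True):            # farthest bands first
        for i, (ip, mean, stdev) in enumerate(bands[band]):
            primary = pollers[i % len(pollers)]
            secondary = pollers[(i + 1) % len(pollers)]
            assignment[primary].append((ip, "primary", band))
            assignment[secondary].append((ip, "secondary", band))
    return assignment

# Hypothetical sample data: three IPs with a handful of latency measurements (ms).
samples = {"10.0.0.1": [2, 3, 2], "10.1.0.1": [45, 50, 47], "10.2.0.1": [110, 98, 105]}
bands = organize(samples)
print(bands)
print(split_primary_secondary(bands))
```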

Assumptions

  1. The farthest latency distance may or may not be the farthest physical distance
  2. Non-responsive IPs may or may not be down.
  3. Polling of the farthest latencies should be more tightly synchronized than polling of closer IPs.

What if...?

What if I organize the data for each band into a heat map? Could I use this to visualize clusters of changes or anomalies in latency across the network? The problem today is that we respond to ping failures individually and work the tickets that way. In some cases, by the time the engineer gets around to diagnosis, the problem has magically disappeared. So, we do what everyone else does: we give away the SLA and let only the persistent problems bubble up.
By organizing the data into bands in the heat map, what you will see is:
  • Changes in closer latency bands may affect other, more distant bands. So, diagnose the closer bands first, then work your way out. (This helps prioritize the list when things get rough.)
  • The outages are as important as the changes. When one or more IPs change enough, latency-wise, to transition from one band to another, it illuminates something in the network that may be indicative of something much bigger. These transitions may be precursors to outages, or may reflect static or dynamic changes in the network. (See the sketch after this list.)
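As a minimal sketch of that band-over-time view, here is one way to draw it, assuming you keep a per-interval mean latency for each band. The data, the band and interval counts, and the use of numpy/matplotlib are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Rows = latency bands (band 0 is closest), columns = polling intervals.
# Each cell is the band's mean latency for that interval, in ms (hypothetical data).
band_latency = np.array([
    [ 3,  3,  4,  3,  9,  3],   # band 0
    [22, 23, 22, 40, 41, 24],   # band 1 - a bulge that may ripple outward
    [48, 47, 70, 72, 71, 50],   # band 2
])

fig, ax = plt.subplots()
im = ax.imshow(band_latency, aspect="auto", cmap="hot")
ax.set_xlabel("polling interval")
ax.set_ylabel("latency band (0 = closest)")
fig.colorbar(im, label="mean latency (ms)")
plt.savefig("band_heatmap.png")
```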

Machine Learning possibilities

Look for groupings in the managed elements. For example, each subnet has one or more IP addresses that are part of it.
IPs in some subnets may display different latency distances. For example, a 255.255.255.252 subnet has 2 host IPs and is commonly used for WAN links. If the latency distance to one is longer than to the other, you can discern that it is probably the distant end. (If you can poll the more distant one, would that not imply that the nearer IP is also up?) Interestingly, subnets like this may be able to be visualized as "bridges" between bands.
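A minimal sketch of that inference, assuming you know which /30 subnets carry WAN links and have the latest poll result for each host IP. The data structures and function names here are hypothetical:

```python
import ipaddress

def wan_pair(cidr):
    """Return the two host IPs of a /30 subnet (commonly a WAN link)."""
    hosts = list(ipaddress.ip_network(cidr).hosts())
    return str(hosts[0]), str(hosts[1])

def infer_near_end(poll_results, near_ip, far_ip):
    """If the far end of the link answered, the near end must be forwarding traffic."""
    if poll_results.get(far_ip) == "up":
        return "up (implied by reachable far end)"
    return poll_results.get(near_ip, "unknown")

near, far = wan_pair("192.0.2.0/30")       # hypothetical WAN link
poll_results = {far: "up"}                 # only the distant end answered this cycle
print(near, "->", infer_near_end(poll_results, near, far))
```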
On networks that have a larger mask, one can assume that each shared network has one or more routers that provide service to that network, and one or more switches used to interconnect devices at the logical link layer. You may not be able to infer that a node being up on a LAN also means the switch and router are up. But when things are broken, you would want to check the routers first, then the switches, then the nodes.
Use machine learning to capture and hypothesize failures as patterns. When something changes, what also changed around it? What broke further downstream? After seeing a few failures, you should be able to assign probabilities to the causes and side effects in the network. (What a brilliant insight this would be for engineering teams looking to identify and solve availability, performance, and redundancy issues based on priors.)
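One very simple starting point is co-occurrence counting: each time something fails, record what else failed in the same window, and turn the counts into conditional probabilities. A minimal sketch, with hypothetical failure windows as input:

```python
from collections import defaultdict
from itertools import permutations

def failure_probabilities(failure_windows):
    """failure_windows: list of sets of devices that failed together in one window.
    Returns P(b also failed | a failed) for every observed pair."""
    failed_count = defaultdict(int)
    co_failed = defaultdict(int)
    for window in failure_windows:
        for device in window:
            failed_count[device] += 1
        for a, b in permutations(window, 2):
            co_failed[(a, b)] += 1
    return {(a, b): co_failed[(a, b)] / failed_count[a] for (a, b) in co_failed}

# Hypothetical history: three outage windows observed by the poller.
history = [
    {"router-1", "switch-3", "host-17"},
    {"router-1", "switch-3"},
    {"switch-3", "host-17"},
]
for (a, b), p in sorted(failure_probabilities(history).items()):
    print(f"P({b} failed | {a} failed) = {p:.2f}")
```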
Let machine learning determine the polling frequency based on the goals and objectives.

When you overlay other data elements onto this, the picture becomes a much more effective story. For example, in the heat map, identify any node / IP that has experienced configuration changes. What about one that has an open ticket on it? What about event counts beyond the ping?
What if I use RUM and a bit of creative data from IPFIX to overlay onto the heat map application performance? Interesting...

Sunday, August 13, 2017

Machine Learning and Event Classification


When you look at the Netcool probe systems, there are a couple of things you can learn
from the rules files as well as the events and raw data.  Really, there is a tremendous amount of data at your fingertips that could be leveraged in new ways.

When you look at basic Naive Bayes as a function, it is an effective algorithm for classifying text elements. While I like the approach, I chose to simplify it a bit here to make it more understandable and usable without undue complication. Let's go on a data exploration journey and see what we can derive from the mental exercise...

Here's a thought

Given that I have a set of rules, an example event, and raw event data, one of the first things I'd like to train on is severity. This is important because it is an organizationally defined tag depicting a user perception.

On a side note, I seem to always revert to the Golden Rule of correlation:

ENTERPRISE:NODE:SUBOBJECT.INSTANCE:PROPERTY

The Scope applies to how much of the record is used to delineate the Managed Object. For example, a Node Scope means that correlation elements need to match on Enterprise and Node to apply together.

Naive Bayes is about intelligently "guessing" the elements that make up a fact, as to whether they are good or bad. For a given small sample of events, I'm going to use 4 events as my reference training set to illustrate potential learning.

Summary = Interface GE1 is down, Severity = 4
Summary = Interface GE1 is up, Severity = 0
Summary = Node down ping fail, Severity = 5
Summary = Node up, Severity = 0

So, for the training process, I break down each word in the Summary as follows:

Word hit count

Word        Clear  Indeterminate  Warning  Minor  Major  Critical
Interface   1      0              0        0      1      0
GE1         1      0              0        0      1      0
is          1      0              0        0      1      0
down        0      0              0        0      1      1
up          2      0              0        0      0      0
Node        1      0              0        0      0      1
ping        0      0              0        0      0      1
fail        0      0              0        0      0      1

It should be noted that some words may be a bit nebulous and provide no differentiation in the determination of severity. For example, check out the following table.


Severity       Interface  GE1  is  down  up  fail  ping  Node
Clear          1          1    1   0     2   0     0     1
Indeterminate  0          0    0   0     0   0     0     0
Warning        0          0    0   0     0   0     0     0
Minor          0          0    0   0     0   0     0     0
Major          1          1    1   1     0   0     0     0
Critical       0          0    0   1     0   1     1     1

Ratio of occurrence by Severity

Severity       Interface  GE1    is     down   up     fail   ping   Node
Clear          0.125      0.125  0.125  0      0.25   0      0      0.125
Indeterminate  0          0      0      0      0      0      0      0
Warning        0          0      0      0      0      0      0      0
Minor          0          0      0      0      0      0      0      0
Major          0.125      0.125  0.125  0.125  0      0      0      0
Critical       0          0      0      0.125  0      0.125  0.125  0.125

Ratio of Non-occurrence

Severity       Interface  GE1    is     down   up     fail   ping   Node
Clear          0.875      0.875  0.875  1      0.75   1      1      0.875
Indeterminate  1          1      1      1      1      1      1      1
Warning        1          1      1      1      1      1      1      1
Minor          1          1      1      1      1      1      1      1
Major          0.875      0.875  0.875  0.875  1      1      1      1
Critical       1          1      1      0.875  1      0.875  0.875  0.875

Normally, naive Bayes specifies a simple true or false kind of thing. With 6 different severities, one could classify severities 2-5 as true and severities 0 and 1 as false. In this case, you look for words that "differentiate" true from false. In the example, the differentiating words are down, up, ping, and fail. In this first iteration, I would be tempted to either drop ping or add it to the Up event.

In analyzing the words and the distribution of severities, we can discern that 4 words are differentiators. If a word like "is" is distributed across multiple severities, it isn't very relevant as a predictor, or a prior in Bayesian terms.

The ratios are calculated against the total word count for the entire sample: for each word, what ratio would that word contribute to the determination of severity?

The ratio of non-occurrence is interesting in that it shows you how often a word does not occur with a given severity, and it clearly illustrates severities that can be ruled out by a specific word's presence.

This is a very basic machine learning event word classification mechanism, spawned mainly out of a simplification of Naive Bayes theory. While Naive Bayes is very rudimentary, a lot of folks get hung up in the math that goes with it in its truest sense. Here are the formulas I use:

Ratio of occurrence: the number of times a word is seen with a given severity, divided by the total number of unique words.

Ratio of non-occurrence: the total number of unique words, minus the occurrences of this word at this severity, divided by the total number of unique words.
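Here is a minimal sketch of those two ratios in Python, using the four training events above. The tokenization, severity labels, and counts come from the example; the function and variable names are just illustrative:

```python
from collections import defaultdict

SEVERITIES = ["Clear", "Indeterminate", "Warning", "Minor", "Major", "Critical"]
SEV_NAME = {0: "Clear", 1: "Indeterminate", 2: "Warning", 3: "Minor", 4: "Major", 5: "Critical"}

# The four reference training events from the post.
training = [
    ("Interface GE1 is down", 4),
    ("Interface GE1 is up", 0),
    ("Node down ping fail", 5),
    ("Node up", 0),
]

# Word hit count per severity.
hits = defaultdict(lambda: defaultdict(int))
vocabulary = set()
for summary, severity in training:
    for word in summary.split():
        hits[word][SEV_NAME[severity]] += 1
        vocabulary.add(word)

total_words = len(vocabulary)   # 8 unique words in this sample

# Ratio of occurrence and non-occurrence, as defined above.
occurrence = {w: {s: hits[w][s] / total_words for s in SEVERITIES} for w in vocabulary}
non_occurrence = {w: {s: (total_words - hits[w][s]) / total_words for s in SEVERITIES}
                  for w in vocabulary}

print("down occurrence:", occurrence["down"])          # 0.125 for Major and Critical
print("down non-occurrence:", non_occurrence["down"])  # 0.875 for Major and Critical

# Words whose hits land only in severities 2-5 ("true") or only in 0-1 ("false") differentiate.
def differentiators():
    words = []
    for w in vocabulary:
        high = sum(hits[w][s] for s in ("Warning", "Minor", "Major", "Critical"))
        low = sum(hits[w][s] for s in ("Clear", "Indeterminate"))
        if (high == 0) != (low == 0):      # hits on one side only
            words.append(w)
    return words

print("differentiating words:", sorted(differentiators()))  # down, fail, ping, up
```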

The benefit is that severity ratings are a perception of importance. The more accurate and consistent these simple classifications become, the better your more intensive inferences and deep learning will be going forward.

Interestingly enough, these are just words that are independent of one another. Other than occurrence and perception, "is" is no different than "down".

Now, what if I look for words in the event text that are also in the AlertKey? Normally, AlertKey is used to identify the subobject and instance of the event. Would this not readily identify the noun/pronoun/object this event relates to? Could those words be skipped in the calculations to make things more accurate?
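A minimal sketch of that filtering idea, assuming each event carries a Summary and an AlertKey field (the field names follow Netcool conventions, but the function itself is only an illustration):

```python
def words_for_training(summary, alert_key):
    """Drop Summary words that also appear in the AlertKey, since they name the
    subobject/instance rather than differentiate severity."""
    alert_key_words = set(alert_key.replace(".", " ").split())
    return [w for w in summary.split() if w not in alert_key_words]

# Hypothetical event: the AlertKey names the interface instance.
print(words_for_training("Interface GE1 is down", "GE1"))
# -> ['Interface', 'is', 'down']
```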

What if you used AlertGroup to create associative vectors that only group together like events? Like a defined cluster of events by type.

What if you did the same thing for Type? Melding in the dimension of whether an event relates to a Problem or a Resolution could help drive the accuracy of your event processing system.
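As a rough sketch, grouping could simply partition the training data by AlertGroup and Type before the word counting above is run per group. The events below are hypothetical:

```python
from collections import defaultdict

events = [
    {"Summary": "Interface GE1 is down", "AlertGroup": "Link", "Type": "Problem"},
    {"Summary": "Interface GE1 is up", "AlertGroup": "Link", "Type": "Resolution"},
    {"Summary": "Node down ping fail", "AlertGroup": "Ping", "Type": "Problem"},
    {"Summary": "Node up", "AlertGroup": "Ping", "Type": "Resolution"},
]

# Cluster like events by (AlertGroup, Type) so each cluster can be trained separately.
clusters = defaultdict(list)
for event in events:
    clusters[(event["AlertGroup"], event["Type"])].append(event["Summary"])

for key, summaries in clusters.items():
    print(key, summaries)
```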

Solving a Problem

The questions you may be looking to answer are:
  1. Are my event wordings consistent to be able to guess a severity if we didn't have one already?
  2. Is my severity selection consistent?
  3. Is my event object definition accurate?
Let's say I do this exercise on the top 20 events seen in the environment. Use these events to hone the "dictionary" of initial training. Then, let's apply it to the rest of the events.

There are some good news elements here. These include:
  1. You can identify and build consistency into your event processing.
  2. You can use the results to identify the severity of new and existing events.
  3. You can store the results by event (using Identifier as the key) so that you don't have to recalculate each time. (See the sketch after this list.)
  4. You can use history to build this.
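A minimal sketch of that caching idea, keyed by the event Identifier. The cache structure and the trivial classifier are hypothetical, not a Netcool feature:

```python
classification_cache = {}   # Identifier -> predicted severity

def classify(identifier, summary, classify_fn):
    """Look up a previously computed severity by Identifier, or compute and store it."""
    if identifier not in classification_cache:
        classification_cache[identifier] = classify_fn(summary)
    return classification_cache[identifier]

# Hypothetical usage: a trivial classifier that flags "down" events as Major.
print(classify("node1:IfDown:GE1", "Interface GE1 is down",
               lambda s: "Major" if "down" in s.split() else "Clear"))
```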
Another very interesting "feature" here is that many of the events processed are SNMP based. Within the rules used to process the event, in many cases, you get the OID of the enterprise, the OIDs of the varbinds, and the enumerations used to translate the numbers into text values. And what about the MIB object descriptions? And variable bindings that are instanced usually point to a given subobject and instance (ifEntry.4, as an example).

Some may even ask why do this. What if, programmatically, I can determine that performance and error conditions related to a given sub-object are, in fact, supporting evidence for a most probable cause event that resulted in an outage? Now, we are getting somewhere. Think beyond the Root Cause and Side Effect paradigm into the realm of recognizing and understanding event clusters.
Summary
I should note that while I use Netcool as an example, one should not constrain these techniques to a single product. I could see these same sorts of techniques used in HPE Node Manager, ScienceLogic EM7, Splunk, OpenNMS, Monolith, and others.

Machine learning is about exploring your data, classifying it, and producing better information.