Monday, August 14, 2017

Machine Learning and Status Determination

In thinking about status polling in networks, most of the software we use works from ping lists of IP addresses or host names. We set standard timeouts and poll rates across the board, statically. ICMP ping is not all that accurate at determining whether a specific node is REALLY up or down, and it is rife with misconceptions among those who lack an in-depth understanding of the ICMP protocol and its purpose in life. That is especially true when there are devices in the network that are not pingable but can still affect latency and connectivity.
What if we are thinking about this wrong?
Artificial intelligence and machine learning techniques may enable us to do a much better job in status "determination" versus just status polling.
First, I think we should reset the objectives for status polling. Status polling objectives used to read like "ping every device in the network every 2 minutes." While that seems OK, is it really what you want? What if you set the objective to:
Establish a status interval of 2 minutes for every managed device. Or: I need to know within 2 minutes when a device is no longer responding. And what about latency and thresholds?
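Here is a minimal sketch of what an objective-driven polling spec might look like, assuming nothing about any particular NMS. The field names and numbers are placeholders for illustration; the point is that you state the outcome you need and let the scheduler work backward from it.

```python
# Hypothetical objective-driven polling spec (not from any specific product).
from dataclasses import dataclass

@dataclass
class StatusObjective:
    detect_within_s: int = 120      # "I need to know within 2 minutes"
    latency_warn_ms: float = 150.0  # assumed absolute latency threshold
    latency_sigma: float = 3.0      # flag samples > 3 std devs from baseline

objective = StatusObjective(detect_within_s=120)
```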

Here's a thought

I have 10000 device IPs and 2 pollers.
Give the list to both pollers. Have them go through the list to establish:
  1. The latency distance from the poller.
  2. Establish the baseline mean and standard deviation for latency.
Now, organize the latency distances from longest to shortest into groups, or bands. Then distribute the list between the pollers as primary and secondary, assigning each IP based on the best results while keeping the split roughly even.
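A rough sketch of that flow, assuming a measure_rtt_ms() helper that returns one round-trip sample in milliseconds (stand in your own ICMP library or a subprocess call to ping). Nothing here is tied to a specific tool, and a real scheduler would also rebalance the primary lists so each poller carries roughly half the load.

```python
import statistics
from collections import defaultdict

def baseline(ip, samples):
    """Baseline latency distance for one IP: mean and standard deviation."""
    return {"ip": ip,
            "mean_ms": statistics.mean(samples),
            "stdev_ms": statistics.pstdev(samples)}

def band_of(mean_ms, width_ms=20.0):
    """Bucket a latency distance into a band, e.g. 0-20 ms, 20-40 ms, ..."""
    return int(mean_ms // width_ms)

def assign_pollers(ips, pollers, measure_rtt_ms, probes=5):
    """Both pollers probe every IP; the closer poller becomes primary.
    Returns per-poller assignments plus the latency bands for the heat map."""
    results = {p: {} for p in pollers}
    for p in pollers:
        for ip in ips:
            samples = [measure_rtt_ms(p, ip) for _ in range(probes)]
            results[p][ip] = baseline(ip, samples)

    bands = defaultdict(list)
    assignments = {p: {"primary": [], "secondary": []} for p in pollers}
    for ip in ips:
        # Shorter latency distance wins the primary role for this IP.
        ordered = sorted(pollers, key=lambda p: results[p][ip]["mean_ms"])
        assignments[ordered[0]]["primary"].append(ip)
        assignments[ordered[1]]["secondary"].append(ip)
        bands[band_of(results[ordered[0]][ip]["mean_ms"])].append(ip)
    return assignments, bands
```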

Assumptions

  1. The farthest latency distance may or may not be the farthest physical distance
  2. Non-responsive IPs may or may not be down.
  3. Polling the farthest-latency IPs should be more synchronous than polling closer IPs.

What if...?

What if I organize the data for each band into a heat map? Could I use this to visualize clusters of changes or anomalies in latency on the network? The problem today is that we respond to ping failures individually and we work those tickets that way. In some cases, by the time the engineer gets around to diagnosis, the problem has magically disappeared. So, we do what everyone else does - we give away SLA time so that only the persistent problems bubble up.
By organizing in bands, in the heat map, what you will see is:
  • Changes in closer latency bands may affect more distant bands. So diagnose the closer bands first, then work your way out. (It kind of helps to prioritize the list when things get rough.)
  • The outages are as important as the changes. When one or more IPs shift enough latency-wise to transition from one band to another, it illuminates something in the network that may be indicative of something much bigger. It may be a precursor to an outage, or indicative of static or dynamic changes in the network.
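As a rough sketch of the mechanics, assuming you keep per-IP latency baselines between polling cycles, the heat map is just counts per (band, time bucket) cell, and the interesting events are IPs that hop bands. The data layout here is illustrative, not a specific tool's format.

```python
from collections import defaultdict

def heatmap_cell_counts(samples, band_width_ms=20.0):
    """samples: iterable of (timestamp_bucket, ip, mean_ms).
    Returns {(band, timestamp_bucket): count} suitable for plotting."""
    cells = defaultdict(int)
    for bucket, _ip, mean_ms in samples:
        cells[(int(mean_ms // band_width_ms), bucket)] += 1
    return cells

def band_transitions(previous, current, band_width_ms=20.0):
    """Flag IPs whose latency moved them from one band to another.
    previous/current: {ip: mean_ms} from consecutive polling cycles."""
    moved = []
    for ip, mean_ms in current.items():
        old = previous.get(ip)
        if old is not None and int(old // band_width_ms) != int(mean_ms // band_width_ms):
            moved.append((ip, int(old // band_width_ms), int(mean_ms // band_width_ms)))
    return moved
```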

Machine Learning possibilities

Look for groupings in the managed elements. For example, each subnet has one or more managed IP addresses in it.
IPs within the same subnet may display different latency distances. For example, a 255.255.255.252 (/30) subnet has 2 host IPs and is commonly used for WAN links. If the latency distance to one is longer than to the other, you can discern that it's probably the distant end. (And if you can poll the more distant one, would that not imply that the nearer IP is also up?) Interestingly, subnets like this may be visualized as "bridges" between bands.
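A sketch of that inference using the standard library's ipaddress module: group IPs by their /30 network and treat the higher-latency host as the probable distant end of the WAN link. The latency values are assumed to come from the baselines gathered earlier.

```python
import ipaddress
from collections import defaultdict

def wan_link_ends(latency_by_ip):
    """latency_by_ip: {'10.1.1.1': 12.0, '10.1.1.2': 41.0, ...} in ms.
    Returns (near_ip, far_ip) pairs for /30 subnets where both hosts were seen."""
    by_net = defaultdict(list)
    for ip, ms in latency_by_ip.items():
        net = ipaddress.ip_network(f"{ip}/30", strict=False)
        by_net[net].append((ms, ip))
    pairs = []
    for net, hosts in by_net.items():
        if len(hosts) == 2:
            hosts.sort()                      # lower latency first
            near, far = hosts[0][1], hosts[1][1]
            pairs.append((near, far))         # far is probably the distant end
    return pairs
```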
On networks that have a larger mask (more host addresses), one can assume that each shared network has one or more routers providing service to that network, and one or more switches interconnecting devices at the link layer. You cannot necessarily imply that a node being up on a LAN means its switch and router are also up. But when things are broken, you would want to check the routers first, then the switches, then the nodes.
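A tiny sketch of that triage order, assuming each managed element carries a role tag from inventory (the role names here are placeholders for whatever your CMDB actually uses).

```python
ROLE_PRIORITY = {"router": 0, "switch": 1, "node": 2}

def diagnosis_order(failing_elements):
    """failing_elements: list of {'ip': ..., 'role': 'router'|'switch'|'node'}.
    Returns the list ordered routers first, then switches, then nodes."""
    return sorted(failing_elements, key=lambda e: ROLE_PRIORITY.get(e["role"], 99))
```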
Use machine learning to capture and hypothesize failures as patterns. When something changes, what else changed around it? What broke further downstream? After seeing a few failures, you should be able to assign probabilities to the causes and side effects in the network. (What a brilliant insight this would be for engineering teams identifying and solving availability, performance, and redundancy issues based on priors.)
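As a rough sketch of those priors: count how often element B fails within the same window as element A, and turn the counts into conditional probabilities. A real system would add topology and timing features, but even these simple counts are useful triage hints. The input format is an assumption for illustration.

```python
from collections import defaultdict
from itertools import permutations

def co_failure_probabilities(failure_windows):
    """failure_windows: list of sets of element IDs that failed together,
    e.g. [{'rtr1', 'sw3', 'host9'}, {'rtr1', 'sw3'}, ...].
    Returns {(a, b): P(b fails | a fails)} estimated from the history."""
    fail_count = defaultdict(int)
    pair_count = defaultdict(int)
    for window in failure_windows:
        for a in window:
            fail_count[a] += 1
        for a, b in permutations(window, 2):
            pair_count[(a, b)] += 1
    return {(a, b): pair_count[(a, b)] / fail_count[a] for (a, b) in pair_count}
```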
Let machine learning determine the polling frequency based on the goals and objectives.
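A sketch of letting the data set the frequency: stable, low-variance elements get polled near the objective's detection deadline, while noisy or recently band-shifted elements get polled more often. The weighting below is an assumption, not a learned model, but it is the kind of knob an ML scheduler could tune.

```python
def poll_interval_s(detect_within_s, stdev_ms, mean_ms, recently_moved_band):
    """Shrink the poll interval for unstable or recently shifted elements."""
    stability = 1.0 / (1.0 + (stdev_ms / max(mean_ms, 1.0)))  # 0..1, higher = calmer
    interval = detect_within_s * stability
    if recently_moved_band:
        interval *= 0.5                      # tighten up after a band transition
    return max(15.0, min(interval, detect_within_s))
```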

When you overlay other data elements onto this, the picture becomes a much more effective story. For example, in the heat map, identify any node / IP that has experienced configuration changes. What about nodes that have an open ticket? What about event counts beyond the ping?
What if I use RUM and a bit of creative data from IPFIX to overlay application performance onto the heat map? Interesting...
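A minimal sketch of that enrichment: for each IP, merge flags such as "recent config change", an open ticket, event counts, and an optional application-latency figure, so a heat map cell can carry more context than the ping result alone. The field names and lookup structures are placeholders for whatever CMDB, ticketing, and flow (IPFIX/RUM) sources you actually have.

```python
def enrich(ip, band, config_changes, open_tickets, event_counts, rum_ms=None):
    """Merge overlay attributes onto one heat map entry."""
    return {
        "ip": ip,
        "band": band,
        "config_changed": ip in config_changes,   # set of IPs with recent changes
        "open_ticket": open_tickets.get(ip),      # ticket id or None
        "events": event_counts.get(ip, 0),        # non-ping event count
        "app_latency_ms": rum_ms,                 # optional RUM/IPFIX overlay
    }
```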
