Saturday, March 13, 2010

Topology Based Correlation

I know a lot of folks would like to think that topology-based correlation, as delivered today, is accurate 100% of the time. It can be, IF your network is REALLY simple. I have yet to see a simple network. (Guess I don't get out much!)

The problem is that IP networks are a best-effort communications medium, designed to morph, adapt, and overcome problems in the network. As such, routes will change on the fly. Application connectivity can change very quickly. And this "adaptation" may hide outages from you.

Check out this Wikipedia article on MPLS Fast Reroute and local protection:

http://en.wikipedia.org/wiki/MPLS_local_protection

There are a whole host of protocols and techniques for creating and maintaining a highly available network service as a utility. They run the gamut from Spanning Tree Protocol to Extended BGP.

Does your topology map really map out your "true" network? Does it capture routing protocol decisions and seemingly mundane little network idiosyncrasies like GRE Tunnels, Parreto Tunnels and VPNs? If you want to REALLY see topology as a true map that's live, you need to check out:

http://www.packetdesign.com/products/rex.htm

Does downstream suppression really buy you anything? Think about it for a moment. While you are responding to network outages that may appear, then disappear, do you wait until you have a "hard down" condition to ticket? How do you know you have a hard down condition anyway?

How do you know if you're giving away your SLA time? When a customer experiences an outage, do you really know and understand when that is? What impact does your interface down have on your customer? What impact does it have on your customer's customer?

I find it ironic and somewhat disturbing that folks have not thought through SLAs in their product strategies. If your correlation does not check and verify the problem, you will most likely either miss a problem because of false downstream suppression, or the problem will run for several minutes before your support organization knows about it.

Sometimes problems are not hard down as one might think. Issues like packet drops, big latency changes, and asymmetric routes can wreak havoc in communications without any hard down condition. They can make things extremely slow, cause a lot of retransmissions, and cause service resets. If the customer is not having a good experience with their work over the network, it's a very bad problem, because your reality doesn't match the customer's perception of reality. Your situational awareness is disconnected from your customer at exactly the time when it should be connected.

So, how do we create awareness and customer cognizance?

1. You need to understand the services that are important to them and measure them first.

If you measure latency and connectivity through Exchange and the times increase significantly or the webmail client response goes to heck in a bucket, HOUSTON - you have a problem!

2. You need to understand the topology at a basic level from the service poller to the service.

By understanding the topology at a basic level between a service poller and the point where the service is provided, you can discern the before and after when a problem occurs. Additionally, this gives you a baseline network lineage to which you can apply other data.

For example, in a service lineage, you could show the CPU utilization of every element in the service that has a CPU. If you saw an abnormally high or low CPU condition along the path lineage, it could give you very valuable insight into problem areas or places where the service may be degrading.

This service lineage gives the engineer, technician, and support personnel the ability to visualize the service and drill down into details from the context of that service. Even mundane things help: depict, within the service lineage, all change taskings and trouble ticket activity within the last week. Or show me the operating system versions and patch levels across the lineage.
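
As a minimal sketch of that idea, assuming a lineage is just an ordered list of elements with a CPU reading attached, the check could look something like this. The element names, the LineageHop type, and the 5% / 85% thresholds are all made up for illustration; a real lineage would be fed from your poller and your CMDB.

```python
from dataclasses import dataclass


@dataclass
class LineageHop:
    name: str           # hypothetical element name along the service path
    cpu_percent: float  # most recent CPU utilization reported for that element


def flag_cpu_outliers(lineage, low=5.0, high=85.0):
    """Return the names of hops whose CPU falls outside [low, high]."""
    return [
        hop.name
        for hop in lineage
        if hop.cpu_percent < low or hop.cpu_percent > high
    ]


# Illustrative lineage from the service poller to the service itself.
lineage = [
    LineageHop("poller", 22.0),
    LineageHop("edge-router", 91.5),
    LineageHop("core-switch", 18.0),
    LineageHop("mail-server", 2.0),
]

print(flag_cpu_outliers(lineage))
```

The same list could carry change tickets or patch levels per element; CPU is just the easiest attribute to show.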

From a CMDB perspective, the relationships needed can be easily modelled using the CIM model. You can find the CIM Schema here at:

http://www.dmtf.org/standards/cim/cim_schema_v28

The funny part is that with a wee bit of creativity, traceroute provides a great baseline for a lineage. When you look at how traceroute works, it uses the Time To Live field to step through a network one hop at a time. When a TTL / hop count is exceeded, the packet is discarded and an ICMP Time Exceeded message (ICMP Type 11) is sent back to the sending host. Traceroute typically sends 3 probe packets per hop.

Box stock traceroute on Unix-like systems usually sends UDP probes, while the Windows tracert sends ICMP Echo Requests. More advanced tools can use TCP. The benefit of using TCP is that you can perform a traceroute to a specific protocol port. This will tell you if a protocol such as HTTP (TCP port 80) can connect. If the specific port is blocked, the traceroute will not progress past the point of filtering. Additionally, an ICMP Administratively Prohibited message (ICMP Type 3 Code 13) may be seen if the filtering device sends one.

While traceroute may skip over elements in the network, when something breaks at the basic IP / service protocol layer, you have an impromptu schematic with which to baseline your service diagnosis. If the path is broken or changed, you'll be able to tell.
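
To make the TTL stepping concrete, here is a small simulation rather than a real traceroute (a real one needs raw sockets and elevated privileges). It assumes the classic UDP style of probing, where routers answer with Time Exceeded and the destination host answers with Port Unreachable. The function name and the addresses (taken from the documentation ranges) are just for illustration.

```python
ICMP_TIME_EXCEEDED = 11     # sent by a router when the TTL reaches zero
ICMP_DEST_UNREACHABLE = 3   # code 3 (port unreachable) ends a UDP traceroute


def simulate_traceroute(path, max_ttl=30):
    """Simulate TTL stepping over `path`.

    `path` is an ordered list of hop addresses; the last entry is the
    destination host. Returns (ttl, responder, icmp_type) tuples.
    """
    hops = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for index, node in enumerate(path):
            if index == len(path) - 1:
                # The probe reached the destination, which rejects the UDP port.
                hops.append((ttl, node, ICMP_DEST_UNREACHABLE))
                return hops
            remaining -= 1  # each router decrements the TTL before forwarding
            if remaining == 0:
                # The router discards the probe and reports Time Exceeded.
                hops.append((ttl, node, ICMP_TIME_EXCEEDED))
                break
    return hops


path = ["192.0.2.1", "198.51.100.1", "203.0.113.10"]
for ttl, responder, icmp_type in simulate_traceroute(path):
    print(ttl, responder, icmp_type)
```

Each increase in TTL exposes one more router, and the list of responders is the lineage.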
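
Telling whether the path is broken or changed can be as simple as comparing the hop list you saved as a baseline against the one you just collected. A rough sketch, assuming each lineage is a plain list of hop addresses in order (the function name and addresses are illustrative only):

```python
def compare_lineage(baseline, current):
    """Compare two hop lists position by position.

    Returns (hop_number, before, after) for every position that differs.
    None marks a hop that is missing from one of the two lists.
    """
    changes = []
    for position in range(max(len(baseline), len(current))):
        before = baseline[position] if position < len(baseline) else None
        after = current[position] if position < len(current) else None
        if before != after:
            changes.append((position + 1, before, after))
    return changes


baseline = ["192.0.2.1", "198.51.100.1", "203.0.113.10"]
rerouted = ["192.0.2.1", "198.51.100.77", "203.0.113.10"]
broken = ["192.0.2.1"]

print(compare_lineage(baseline, rerouted))  # hop 2 changed
print(compare_lineage(baseline, broken))    # hops 2 and 3 are gone
```

A positional comparison like this is crude. It will flag every later hop if one hop is inserted in the middle, but that is usually enough to say "the path moved, go look."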

There are 3 elements of IP and how it works that give you topology based correlation in near real time. These elements are:

ICMP Net Unreachable - ICMP Type 3 Code 0
ICMP Host Unreachable - ICMP Type 3 Code 1
ICMP Port Unreachable - ICMP Type 3 Code 3

Check out http://tools.ietf.org/html/rfc792
for reference on the ICMP protocol.

You usually see Net Unreachable when a router in the path to a destination IP has no route to that network. Likewise, you see Host Unreachable from the router for a given network when the destination IP is not in that router's ARP cache and the IP address isn't responding to an ARP request. You see Port Unreachable from a destination host that does not have a listener on the port you're trying to communicate with.

These elements are a FUNDAMENTAL part of how IP works as a protocol. It is just a matter of using the information intelligently to tell you when your network breaks.
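
Using that information can start with nothing more than reading the first two bytes of the ICMP message. Here is a minimal sketch that maps the type and code to a diagnosis. The wording of the diagnoses is mine, and the code assumes you already have the raw ICMP message in hand (from a capture, for example), starting at the ICMP header.

```python
import struct

# Destination Unreachable (type 3) codes discussed in this post.
UNREACHABLE_CODES = {
    0: "net unreachable: a router on the path has no route to the network",
    1: "host unreachable: the last router could not reach the host",
    3: "port unreachable: the host is up but nothing listens on the port",
    13: "administratively prohibited: a filter is blocking the traffic",
}


def classify_icmp(message: bytes) -> str:
    """Classify a raw ICMP message by its type and code fields."""
    icmp_type, code, _checksum = struct.unpack("!BBH", message[:4])
    if icmp_type == 3:
        return UNREACHABLE_CODES.get(code, f"destination unreachable (code {code})")
    if icmp_type == 11:
        return "time exceeded: TTL ran out at this hop"
    return f"other ICMP type {icmp_type}"


# Type 3, code 1, zeroed checksum: a Host Unreachable header.
print(classify_icmp(bytes([3, 1, 0, 0])))
```

Net Unreachable points at routing, Host Unreachable points at the last hop network, and Port Unreachable points at the service itself, which is already a rough correlation.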

Consider this - If you monitored the ingress to your Enterprise passively using Snort on ICMP, you could potentially watch all of the ICMP control traffic flowing into and out of your enterprise. And by definition, an ICMP error message carries the IP header plus at least the first 64 bits (8 bytes) of the original datagram in its payload, which is enough to see the original source, destination, and TCP or UDP ports!
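
Here is a rough sketch of pulling that embedded information back out of an ICMP error message. It assumes an IPv4 datagram quoted after the 8-byte ICMP header, with at least the first 8 bytes of the original payload present. The function name and the sample packet are invented for illustration, and checksums are left as zero because nothing here validates them.

```python
import socket
import struct


def parse_quoted_datagram(icmp_message: bytes) -> dict:
    """Extract the original flow from the payload of an ICMP error message."""
    quoted = icmp_message[8:]             # skip the 8-byte ICMP header
    header_len = (quoted[0] & 0x0F) * 4   # original IP header length in bytes
    protocol = quoted[9]                  # 6 = TCP, 17 = UDP
    info = {
        "protocol": protocol,
        "src": socket.inet_ntoa(quoted[12:16]),
        "dst": socket.inet_ntoa(quoted[16:20]),
        "src_port": None,
        "dst_port": None,
    }
    if protocol in (6, 17):
        # TCP and UDP both start with source port, then destination port.
        info["src_port"], info["dst_port"] = struct.unpack(
            "!HH", quoted[header_len:header_len + 4]
        )
    return info


# Hand-built sample: Port Unreachable quoting a UDP probe to port 53.
icmp_header = struct.pack("!BBHI", 3, 3, 0, 0)
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 28, 0, 0, 64, 17, 0,
    socket.inet_aton("192.0.2.10"),
    socket.inet_aton("203.0.113.5"),
)
udp_start = struct.pack("!HHHH", 40000, 53, 8, 0)
sample = icmp_header + ip_header + udp_start

info = parse_quoted_datagram(sample)
print(info)
```

With the original destination and port recovered, an unreachable message can be tied back to the specific service that was being requested.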

In Summary

I like the Service Lineage approach. It gets folks aware of services and it keeps the service and the customer in the forefront, even across service degradations and security impairments. How useful would this "view" be in mitigating a botnet that had infiltrated your enterprise? Or in recognizing the "Low and Slow" penetration probing that gets overlooked? And to be able to visualize cause and effect!
