
Sunday, July 15, 2012

Thoughts on Telepresence Management

Many organizations have turned to Telepresence capabilities as a way to enable collaboration and teaming across widely separated geographical locations. When you consider the cost of getting managers, directors, and engineers together to address challenges and work through solutions, Telepresence systems have the potential not only to save money but also to empower organizations to respond much faster as an organization.

Going further, the proliferation of video streams is becoming more commonplace every day. Not only are collaboration systems becoming more popular, many Service Providers are offering video-based services to both corporate and consumer markets.

Video conferencing technology has been around for several years and has evolved significantly along with its codecs, MCUs, and associated standards. Additionally, the transport of these streams is becoming common on the internet and private TCP/IP networks where, in the past, dedicated DSx or ISDN circuits were used.

Video traffic, by its very nature, places a series of constraints on the network.

Bandwidth

First, there are bandwidth requirements. High-motion digital MPEG video can present a significant amount of traffic to the network.

QoS

Additionally, video traffic is very jitter sensitive, and dynamics like packet reordering can wreak havoc on what you see on screen. In some cases, you may not even own portions of the network you traverse, and the symptoms may only be present for a very short period of time.

Video Traffic

A lot of multimedia and streaming type traffic is facilitated by the pairing of RTP and RTCP channels. RTP usually runs on an even numbered port between 1024 and 65535, with the RTCP channel running on the next higher port number. This pairing is called a tuple. Additionally, while RTP and RTCP are transport independent, they typically run over UDP, as TCP tends to sacrifice timeliness in favor of reliability.
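To make that pairing convention concrete, here is a quick sketch (illustrative only, not tied to any particular product) that derives the RTCP port for a given RTP port and enforces the even/odd rule:

def rtcp_port_for(rtp_port):
    """Return the RTCP port conventionally paired with an RTP port.

    Assumes the common convention described above: RTP on an even port,
    RTCP on the next higher (odd) port.
    """
    if rtp_port % 2 != 0:
        raise ValueError("RTP conventionally uses an even port number")
    if not 1024 <= rtp_port <= 65534:
        raise ValueError("RTP port expected in the 1024-65535 range")
    return rtp_port + 1

# Example: an RTP stream on port 16384 would pair with RTCP on 16385.
print(rtcp_port_for(16384))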

Multipoint communications are facilitated by IP multicast.

In digitized speech passing across the network with G.711 and G.729, the two most common VoIP encoding methods, you get two samples of voice per packet. Voice packets tend to be slightly larger than 220 bytes each, so the overall behavior is that voice is a constant bit stream constrained by the sample rate needed per interval. Because of that sampling, the packets occur on a 20 msec interval.
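To put a rough number on that constant bit stream, here is a back-of-the-envelope calculation using the figures above (a sketch, not a vendor bandwidth figure):

# Rough per-call bandwidth estimate for a constant-bit-rate voice stream.
# Assumptions (illustrative): ~220 bytes on the wire per packet and a
# 20 msec packetization interval, as described above.
PACKET_BYTES = 220           # approximate size of one voice packet on the wire
PACKET_INTERVAL_SEC = 0.020  # one packet every 20 msec

packets_per_second = 1 / PACKET_INTERVAL_SEC           # 50 pps
bits_per_second = PACKET_BYTES * 8 * packets_per_second

print(f"{packets_per_second:.0f} packets/sec, ~{bits_per_second / 1000:.0f} kbps per direction")
# -> 50 packets/sec, ~88 kbps per direction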

Some video systems use ITU H.264 encoding, which is equivalent to MPEG-4 Part 10, at 30 frames per second. A new frame must be assembled every 33 msec, but MPEG encoding can produce a variable number of bytes and packets per frame depending upon the video content. MPEG video is accomplished using a reference frame, with subsequent data sections providing updates to the original frame in what is called a GOP, or Group of Pictures.

A GOP starts off with an I frame, which is the reference frame. A P frame encodes a significant change relative to the initial I frame but continues to reference it. B frames are predicted from both the I and P frames around them.
The types of frames and their location within a GOP can be defined as a temporal sequence. The temporal distance is the time, or number of images, between specific types of images in a digital video: "m" is the distance between successive P frames and "n" is the distance between I frames.
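A small sketch makes the m/n structure easier to picture. The helper below generates the frame pattern for one GOP given m (the anchor frame spacing) and n (the GOP length); the defaults are illustrative, and real encoders choose these values dynamically:

def gop_pattern(m=3, n=12):
    """Build the frame-type sequence for one Group of Pictures.

    m: distance between anchor frames (I or P), i.e. number of B frames + 1
    n: total frames between successive I frames (GOP length)
    Illustrative only; real encoders choose these dynamically.
    """
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")   # reference frame
        elif i % m == 0:
            frames.append("P")   # predicted from the previous anchor frame
        else:
            frames.append("B")   # predicted from the surrounding anchor frames
    return frames

print(" ".join(gop_pattern()))   # -> I B B P B B P B B P B B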

Contrast this with serial digital television, where the entire screen is digitized and sent 30 times per second. With MPEG, the transmission requirements are significantly less than serial digital because only the changes to the I frame are sent, using the P and B frames.

Following is a diagram of a typical Group of Pictures and how the frames are sequenced.



Following is another illustration of a GOP along with data requirements per frame.


This figure shows the different types of frames that can compose a group of pictures (GOP). A GOP can be characterized by the depth of compressed predicted frames (m) as compared to the total number of frames (n). This example shows that a GOP starts with an intra-frame (I-frame) and that intra-frames typically require the largest number of bytes to represent the image (200 kB in this example). The depth m represents the number of frames that exist between the I-frames and P-frames.

When you compare the data sizes to the GOP diagram, you start to realize the breakdown of packets and packet streaming necessary to enable the GOP sequence to work in near real time. For example, the first I frame must pass 200 kB sequentially, which could turn out to be well over a hundred packets across the network. These packets are time and sequence sensitive. When things go wrong, you get video presentations that are frozen or pixelated.
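A quick back-of-the-envelope count shows the scale involved. Assuming roughly 1,400 bytes of video payload per packet (an assumption for illustration, not a number from the CTS documentation), a single 200 kB I frame becomes well over a hundred packets:

import math

# How many packets does a 200 kB I frame become on the wire?
# Assumption (illustrative): ~1,400 bytes of video payload per packet,
# which is typical once IP/UDP/RTP overhead is subtracted from a
# 1,500 byte Ethernet MTU.
I_FRAME_BYTES = 200 * 1000
PAYLOAD_BYTES_PER_PACKET = 1400

packets = math.ceil(I_FRAME_BYTES / PAYLOAD_BYTES_PER_PACKET)
print(f"~{packets} packets for one 200 kB I frame")   # -> ~143 packets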


In the Cisco TelePresence System (CTS), latency is defined as the time it takes for audio and video to go from input on one end to presentation on the other. The system measures latency between two endpoints via time stamps in the RTP data as well as the RTCP return reports. It is only a measure of the network and does not take the codecs into consideration. The recommended latency target for Cisco TelePresence is 150 msec or less. However, this is not always possible. When latency exceeds 250 msec over a 10 second period, the receiving system generates an alarm, an on screen message, a syslog message, and an SNMP trap. The on screen message is only displayed for 15 seconds and isn't displayed again unless the session is restarted or stopped and reinitiated.

Packet Loss

Packet loss can occur across the network for a variety of reasons: Layer 1 errors, Ethernet duplex mismatches, overrunning the queue depth in routers, and even loss induced by jitter. For Cisco CTS systems, Cisco recommends loss of less than 0.05 percent of traffic in each direction. When packet loss exceeds 1 percent averaged over a 10 second period, the system presents an on screen message, generates a syslog message, and sends an SNMP trap.

When packet loss exceeds 10 percent averaged over a 10 second period, an additional on screen message is generated, syslog messages are logged, and an SNMP trap is generated.

If the CTS system experiences packet loss greater than 10 percent averaged over a 60 second interval, it downgrades the quality of its outgoing video and puts up an onscreen message. Following is a table of key metrics.

Metric        Target                                      1st Threshold                            2nd Threshold
Latency       150 ms                                      250 ms (2 seconds for Satellite mode)    -
Jitter        10 ms packet jitter / 50 ms frame jitter    125 ms of video frame jitter             165 ms of video frame jitter
Packet Loss   0.05%                                       1%                                       10%
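If you want to alarm on these numbers from your own collected data (IP SLA results, probe statistics, and so on), the logic is just a pair of threshold comparisons per metric. A minimal sketch using the table above as illustrative values:

# Illustrative thresholds taken from the table above (ms and percent).
THRESHOLDS = {
    "latency_ms":      {"target": 150,  "first": 250},
    "frame_jitter_ms": {"target": 50,   "first": 125, "second": 165},
    "packet_loss_pct": {"target": 0.05, "first": 1.0, "second": 10.0},
}

def evaluate(metric, value):
    """Classify a measured value as ok, above target, or past a threshold."""
    t = THRESHOLDS[metric]
    if "second" in t and value >= t["second"]:
        return "second threshold exceeded"
    if value >= t["first"]:
        return "first threshold exceeded"
    if value > t["target"]:
        return "above target"
    return "ok"

print(evaluate("latency_ms", 180))        # -> above target
print(evaluate("packet_loss_pct", 2.5))   # -> first threshold exceeded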

If you are building KPIs around something like a Cisco CTS implementation, you need to look at areas where latency, jitter, and packet loss can occur. Sometimes, it is in your control and sometimes it is not. And there are a lot of different possibilities.

Some errors and performance conditions are persistent, and some are intermittent and situationally specific. For example, a duplex mismatch on either end will result in persistent packet loss during TelePresence calls. An overloaded WAN router in the path may only drop packets during the call.

While the CTS system gives your Operations personnel awareness that jitter, latency, or packet loss is occurring, it does not tell you where the problem is. And if it causes the call to reset, you may not be able to discern where the problem is either. Once the traffic is gone, the problem may disappear with it.





When you look at the diagram, you see a Headquarters CTS system capable of connecting to both a Remote Campus CTS system and a Branch office equipped with a CTS system. Multiple redundant network connections are provided between Headquarters and the Remote Campus, but a single Metro Ethernet connection connects the Branch office to the Headquarters and Remote Campus facilities.

When you look at the potential instrumentation that could be applied, there are a lot of different points to sort out device by device. But diagnosis of something like this goes from end to end. The underlying performance and status data is really most useful when an engineer is drilling into the problem to look for root causes, capacity planning, or performing a post mortem on a specific problem.

Most Cisco TelePresence systems are high visibility, as the users tend to be presented with the issues either by cue on screen or by performance during the call. It doesn't take too many jerky or broken calls to render it a non-working technology in the minds of some. And these users tend to be high profile, like senior management and business unit decision makers.

So, What Do You Do?

First of all, Operations personnel need to know, proactively, when a CTS system is presenting errors and experiencing problems. On the elements you can monitor, set up and measure key performance metrics and thresholds, set up status change mechanisms, and alert on misconfigured elements. Thresholds, traps, and events help you create situation awareness in your environment. This is a good start toward being able to recognize problems and conditions. Even creating awareness that a configuration change has been made to any component in the CTS system enables Operations to be aware of elements in the service.

In the monitoring and management of the infrastructure, you are probably receiving SNMP traps and syslog messages from the various components in the environment. This is a good start toward seeing failures and error conditions, as they are sent as the events occur. However, you need performance data to be able to drill down and analyze conditions and situations that could affect streaming services.

But when you think about it, you actually need to start with a concept of performance and availability end to end, as a service. Having IO performance data on a specific router interface makes no sense until it's put into a service perspective.

End to End Strategy

First up, you need to look at an end to end monitoring strategy. Probably one of the most common steps to take is to employ something like Cisco IP SLA capabilities.

Cisco has a capability in many of its routers and switches that enables managing service levels using the various measurements provided. There are a series of capabilities that can be utilized in your CTS monitoring and management solution to make it more effective beyond the basic instrumentation. Cisco recommends that shadow devices be used, as the IP SLA tests will circumvent routing and queuing mechanisms and can have an effect on production devices if run there.

Not all devices support all tests. So, be sure to check with Cisco to ensure you can use a specific IOS version and platform for the given tests. Some service providers and enterprises use decommissioned components to reduce cost.

Basic Connectivity

One of the most commonly used Cisco IP SLA tests is the ICMP Echo test. The ICMP Echo test sends a ping from the shadow router doing the test to an end IP address. Looking over figure 2, we would need to set up an ICMP Echo test from the shadow router on one side to the CTS system on the other side, to establish a connectivity check in each direction.

This will also provide a latency metric but at the IP level. ICMP is the control mechanism for the IP protocol. There may be added latency at higher level protocols depending upon the elements interacting on those protocols. For example, firewalls that maintain state at the transport layer (TCP) may introduce additional latency. Traffic shapers may do so as well.

The collected data and status of the tests are in the CISCO-RTTMON-MIB.my MIB definition file. Pertinent data shows up in rttMonCtrlAdminEntry, rttMonCtrlOperEntry, rttMonStatsCaptureEntry, and rttMonStatsCollectEntry tables.
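If you want to pull those tables into your own tooling, a simple approach is to walk the relevant OIDs from the shadow router. Here is a minimal sketch using the Net-SNMP snmpwalk command from Python; the hostname, community string, and table choice are placeholders:

import subprocess

# Walk one of the CISCO-RTTMON-MIB tables on a shadow router.
# "shadow-rtr-1" and "public" are placeholders; the MIB file must be
# loaded by Net-SNMP for the symbolic table name to resolve.
def walk_rttmon(host="shadow-rtr-1", community="public",
                table="CISCO-RTTMON-MIB::rttMonStatsCaptureEntry"):
    result = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, host, table],
        capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

for line in walk_rttmon():
    print(line)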


Jitter

We know that jitter can have a profound effect on video streams: a frame that arrives too late or out of sequence will be dropped by the receiving codec, because it cannot go back in the video stream and redo a past frame of the GOP. We can also discern that at 30 frames a second, a frame must be sent every 33 ms.
The jitter test is executed between a Source and a Destination and, consequently, needs a Responder to operate. It is accomplished by sending packets separated by a fixed interval. Both source to destination and destination to source measurements are accommodated.

The statistics available provide the types of information you are looking to threshold and extract. The Cisco IP SLA ICMP based Jitter test supports the following statistics:
  • Jitter (Source to Destination and Destination to Source)
  • Latency (Source to Destination and Destination to Source)
  • Round Trip Latency
  • Packet Loss
  • Successive Packet Loss
  • Out of Sequence Packets (Source to Destination and Destination to Source)
  • Late Packets
A couple of factors need to be considered in setting up jitter with regards to video streams. These are:
  • Jitter Interval
  • The number of packets
  • Test frequency
  • Dealing with Load Balancing and NAT
The jitter interval needs to be set to the GOP frame interval of 33 ms.

The number of packets needs to be set to something significant beyond single digits but below a value which would hammer the network. What you are looking for is just enough packets to understand you have an issue. I would probably recommend 20 at first, even though the real video packet counts will be significantly higher. Not every network is created equal, so you will probably need to tune this. While the number of packets in a GOP frame can be highly variable, you just need to sample for the statistics that affect service.

Test frequency can have an effect on traffic. I would probably start with 5 minutes during active periods. But this too needs to be adjusted according to your environment.

In the jitter test setup, you can use either IP addresses or hostnames when you specify the source and destination.

The statistical data is collected in the rttMonJitterStatsEntry table.
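For reference, the jitter figure these operations report is essentially the RTP-style interarrival jitter estimate. Here is a small sketch of that calculation (the smoothing formula from RFC 3550, fed with made-up timestamps; IP SLA does this on the router for you):

def interarrival_jitter(send_times, recv_times):
    """RFC 3550 style smoothed interarrival jitter, in the same units
    as the timestamps. Illustrative only."""
    jitter = 0.0
    prev_transit = None
    for s, r in zip(send_times, recv_times):
        transit = r - s
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0   # RFC 3550 smoothing gain of 1/16
        prev_transit = transit
    return jitter

# Packets sent every 33 ms; network delay wobbles between 40 and 55 ms.
send = [i * 33 for i in range(10)]
recv = [s + (40 if i % 2 == 0 else 55) for i, s in enumerate(send)]
print(f"estimated jitter: {interarrival_jitter(send, recv):.1f} ms")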


ICMP Path Echo

When connectivity issues occur, or latency changes, one needs to be able to diagnose the path from both ends. The Cisco IP SLA ICMP Path Echo test is an ICMP based traceroute function. This test can help diagnose not only path connectivity issues but also latency in the path and asymmetric paths.

Traceroute uses the TTL field of IP to "walk" through a network. The TTL is incremented one hop at a time, so each attempt discovers the next hop along the path. As each hop is discovered, an ICMP Echo test is performed against it to measure the response time.

Data will be presented in the rttMonStatsCaptureEntry table.
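To see the TTL-walking mechanism itself, here is a bare-bones sketch in Python using Scapy (illustrative only; it needs root privileges, the destination address is a placeholder, and the IP SLA operation does the equivalent from the shadow router):

# A minimal traceroute-style TTL walk, measuring response time per hop.
# Requires scapy and root privileges; 198.51.100.10 is a placeholder.
import time
from scapy.all import IP, ICMP, sr1

def ttl_walk(dest="198.51.100.10", max_hops=15):
    for ttl in range(1, max_hops + 1):
        start = time.time()
        reply = sr1(IP(dst=dest, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        rtt_ms = (time.time() - start) * 1000
        if reply is None:
            print(f"{ttl:2d}  *  (no response)")
            continue
        print(f"{ttl:2d}  {reply.src}  {rtt_ms:.1f} ms")
        if reply.src == dest:
            break   # reached the end node under test

ttl_walk()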



Advanced Functions


Looking over what we've discussed so far, we have reviewed in-system diagnostic events, external events and conditions, and using IP SLA capabilities to test and evaluate, in shadow mode, the path between two or more telepresence systems. But when things happen, they do so in real time. Coupled with the fact that many different areas can contribute to degradation in services, in some instances it may be necessary to provide enhanced levels of service.

Additionally, multiple problems can present confusing and anomalous event patterns and symptoms, exacerbating the service restoration process. Of course, because you have high visibility, any confusion adds fuel to the fire.

Media Delivery Index

IneoQuest has derived a service metric for video stream services that provides a KPI for service delivery. The Media Delivery Index (MDI) can be used to monitor the quality of a delivered video stream as well as to show system margin, by providing an accurate measurement of jitter and delay. The MDI addresses a need to assure the health and quality of systems that are delivering ever higher numbers of streams simultaneously, by providing a predictable, repeatable measurement rather than relying on subjective human observations. Use of the MDI further provides a network margin indication that warns system operators of impending operational issues with enough advance notice to allow corrective action before observed video is impaired.

The MDI is also documented in RFC 4445, jointly authored by IneoQuest and Cisco.
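Conceptually, the MDI is reported as two numbers, DF:MLR. The Delay Factor tracks how far a virtual buffer, drained at the stream's nominal rate, swings during a measurement interval, and the Media Loss Rate counts lost or out-of-order packets per interval. The sketch below is a heavily simplified illustration of that bookkeeping, not IneoQuest's implementation or the exact RFC 4445 procedure:

def media_delivery_index(arrivals, payload_bytes, stream_rate_bps, lost_packets):
    """Very simplified MDI-style calculation over one measurement interval.

    arrivals:        per-packet arrival times in seconds
    payload_bytes:   media payload bytes per packet (assumed constant here)
    stream_rate_bps: nominal media rate of the stream
    lost_packets:    packets lost or out of order during the interval
    """
    drain_per_sec = stream_rate_bps / 8.0
    buffered = 0.0
    low, high = float("inf"), float("-inf")
    prev_t = arrivals[0]
    for t in arrivals:
        buffered -= (t - prev_t) * drain_per_sec   # virtual buffer drains at nominal rate
        buffered += payload_bytes                  # ...and fills as packets arrive
        low, high = min(low, buffered), max(high, buffered)
        prev_t = t
    delay_factor_ms = (high - low) / drain_per_sec * 1000
    media_loss_rate = lost_packets                 # per measurement interval
    return delay_factor_ms, media_loss_rate

df, mlr = media_delivery_index(
    arrivals=[i * 0.001 for i in range(100)],      # 100 packets, 1 ms apart
    payload_bytes=1316, stream_rate_bps=10_000_000, lost_packets=2)
print(f"MDI = {df:.1f}:{mlr}")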

IneoQuest uses probes and software to analyze streams in real time, to validate the actual video stream as it occurs. In fact, they can be used on a port SPAN, in a network tap arrangement, or they can actually JOIN a conference! Why is this necessary?

The Singulus and Geminus probes look at each GOP and the packets that make up each frame. They capture and provide metrics on any data loss, jitter, and Program Clock Reference (PCR). They are also able to analyze the data stream and determine, in real time, the quality of the video. You can create awareness of increased jitter even before video is affected by analyzing the jitter margins. This takes the subjectiveness out of the measurement equation.

Additionally, you can record and play back video streams that are problematic. This is very important because, in encoded video streams, the actual data patterns can be wildly different from one moment to the next depending upon the changes that occur in the GOP. For example, if the camera is pointed at a whiteboard, there are very few changes. However, if the camera is capturing 50 people in a room all moving, the GOP is going to be thick with updates to the I, P, and B frames.

For instance, a Provider Network could have a cbQoS setup that's perfectly acceptable for a whiteboard camera or even moderate traffic, but that starts buffering packets when times get tough. The bandwidth may be oversubscribed and reprioritized in areas where you have no visibility. This can introduce jitter and even packet loss, but only under the specific conditions at hand. And how do you PROVE that back to your Service Provider?

You can go back and replay the video streams during a maintenance window to validate the infrastructure, and test, diagnose, and validate specific network components with your network folks and service providers alike to deliver a validated service as a known functional capability.

And because you can replicate the captures, what a great way to test load and validate your end to end during maintenance windows!

The IQ Video Management System (iVMS) controls the probes, assembles the data into statistics, and provides the technical diagnostic front end to access the detailed information elements available. Additionally, when problems occur and thresholds are crossed, it forwards SNMP traps, syslog messages, and potentially email regarding the issues at hand.


Analyzing the Data

First and foremost, you have traps and syslog messages that tell you of impending conditions that are occurring in near real time. For example, due to the transmission quality of the stream, the end presentation may downgrade from 1080i to 720i to save on bandwidth.

The IP SLA tests establish end to end connectivity and a portion of the path. With traceroute, you normally only see one interface per device as it walks the path. It only gives you one interface for each device, not the in and out interfaces, because the Time to Live parameter won't be exceeded again until the next hop.

Keep in mind, though, that elements that can affect a video stream may not be a function of the direct path. For example, if a router in the path has a low memory condition caused by buffering from a fast network onto a low speed WAN link, this may affect the router's ability to correctly handle streaming data under load.

What if path changes affect the order of packets? What if timing slips cause jitter on the circuit? Even BGP micro flaps may cause jitter and out of sequence frames. All in all, you have your work cut out for you.

Summary

Prepare yourself to approach the monitoring and management of video streams from a holistic point of view. You have to look at it first from an end to end perspective. Also be prepared to fill in the blanks with other configuration and performance data.

Because of the visibility of video teleconferencing systems, be prepared to validate your environment on an ongoing basis. Things change constantly. Changes only remotely related, or out of your span of control, can affect the system's ability to perform well. Validate and test your QoS and performance from a holistic view on a recurring basis. A little recurring discipline will save you countless hours of heartbreak later.

Because of the situational aspects of video telepresence, be prepared to enable diagnosis in near real time. Specialized tools like IneoQuest can really empower you to support and understand what's going on and work through analysis, design changes, maintenance actions, and even vendor support to ensure you are on the road to an effective collaboration service.

Sometimes, it pays to enlist the aid of an SME who is vendor independent. After all, in doing this blog post, I credit a significant amount of education and technical validation on the GOP, RTP, RTCP, and other elements of streaming to Stephen Bavington of Bavington Consulting. http://bavingtonconsulting.com/


Sunday, May 2, 2010

Cloud Computing - Notes from a guy behind the curtain

The latest buzz is Cloud computing. When I attended a CloudCamp hosted here in St. Louis, it became rather obvious that the term Cloud computing has an almost unlimited supply of definitions depending upon which Marketing Dweeb you talk to. It can range from hosted MS Exchange to hosted VMs to applications and even hosted services. I'm really not in a position to say what Cloud is or isn't and, in fact, I don't believe there's any way to win that argument. Cloud computing is a marketing perception that is rapidly morphing into whatever marketing types deem necessary to sell something. Right or wrong, the spin doctors own the term and it is whatever they think will sell.

In my own perception, Cloud Computing is a process by which applications, services, and infrastructure are delivered to a customer in a rapid manner, empowering the customer to pay for what they use in small, finite increments. Cloud Computing incorporates a lot of technologies and processes to make this happen: technology like virtualization, configuration management databases, hardware, and software.

What used to take days or weeks to deliver now takes minutes. MINUTES. What does this mean? It takes longer to secure an S Corp and set up a corresponding Tax ID than it does to set up and deliver a new company's web access to customers. And not just local, down home Main Street customers - you are in business, competing at a GLOBAL LEVEL, in MINUTES. And you pay as you go!

Sounds tasty, huh. Now here's a kink: how the heck do you manage this? I know the big Four Management companies say a relationship is more important than Best of Breed. I've heard it in numerous presentations and conversations. If you are in a position to sit still in business, this may be OK for you. Are you so secure in your market space that you do not fear competition to the point where you would sit idly?

Their products reflect this same lack of concern in that it is the same old stuff. It hasn't evolved much - it takes forever to get up and running and it takes months to be productive. For example, IBM Tivoli ITNM/IP: it takes at least a week of planning just to get ready to install it - IF you have hardware. Next, you need another week and consulting to get things cranking on a discovery for the first time. It takes weeks to integrate in your environment, dealing with community string issues, network access, and even the discovery elements themselves.

Dealing with LDAP and network views is another nightmare altogether.

The UI is clunky, slow, and non-interactive - way too slow to be used interactively as a diagnostic tool. At least you could work with NNM in earlier versions to get some sort of speed. (Well, with the Web UI you had HP - slower than molasses - and then when you needed something that worked you bought Edge Technologies or Onion Peel.) In the ITNM/IP design, somebody in their infinite wisdom decided to store the map objects and map instances in a binary blob field in MySQL. At least if you had the coordinates you could FIX the Topoviz maps or even display them in something a bit faster and more Web 2.0 - like Flash / Flex. (Hardware is CHEAP!)

And how do you apply this product to a cloud infrastructure? If you can only discover once every few days, I think you're gonna miss a few customers setting up new infrastructures, not to mention any corresponding VMotion events that occur when things fail or load balance. How do you even discover and display the virtual network infrastructure alongside the real network infrastructure?

Even if you wanted to use it like TADDM, Tideway, or DDMi, the underlying database is not architected right. It doesn't allow you to map out the relationships between entities well enough to make it viable. Even if you did a custom discovery agent and plugged in NMap (hey! everybody uses it!), you cannot fit the data in correctly. And it isn't even close to the CIM schema.

And every time you want some additional functionality, like performance data integration, it's a new check and a new ball game. They sort of attempt to address this by enabling short term polling via the UI. Huge fail. How do you look at data from yesterday? DOH!

ITNM/IP + Cloud == Shelfware.

If we are expected to respond at the speed of Cloud, there is a HUGE pile of compost consisting of management technology of the past that just isn't going to make it. These products take too much support, too many resources to maintain, and they hold back innovation. The cost just doesn't justify the integration. Even the products we considered untouchable. Many have been architected in a way that paints them into a corner. Once you evolve a tool kit into a solution, you have to take care not to close up integration capabilities along the way.

They take too long to install, take a huge level of effort to keep running, and the yearly maintenance costs can be rather daunting. The Cloud methodology kind of changes the rules a bit. In the cloud, it's SaaS. You sign up for management. You pay for what you get. If you don't like it or want something else, presto changeo - an hour later, you're on a new plan! AND you pay as you go. No more HUGE budget outlays, planning, and negotiation cycles. No more "True-Ups" at the end of the year that kill your upward mobility and career.

BMC - Bring More Cash
EMC - EXtreme Monetary Concerns
IBM - I've Been Mugged!
HP - Huge PriceTag!
CA - Cash Advanced!

Think about Remedy. Huge Cash outlay up front. Takes a long time to get up and running. Takes even longer to get into production. Hard to change over time. And everything custom becomes an ordeal at upgrade time.

They are counting on you to not be able to move - that you have become so political and process bound, you couldn't replace it if you wanted to. In fact, in the late 80s and early 90s, there was the notion that the applications that ran on mainframes could never be moved off of those huge platforms. I remember working on the transition from Space Station Freedom to International Space Station information systems. The old MacDac folks kept telling us there was no way we could move to open systems, especially on time and under budget. Nine months later: 2 ES9000 duals and a bunch of VAXes repurposed. 28 applications migrated. Support head count reduced from over 300 to 70. And it was done with less than half of the cost of software maintenance for a year. New software costs were ~15% of what they were before. And we had a lot more users. And it was faster too! NASA. ESA. CSA. NASDA. RSA. All customers.

Bert Beals, Mark Spooner, and Tim Forrester are among my list of folks that had a profound effect on my career in that they taught me through example to keep it simple and that NOTHING is impossible. And to keep asking "And then what?"

And while not every app fits in a VM, there is a growing catalog of appliance based applications that make total sense. You get to optimize the hardware according to the application and its data. That first couple of months of planning, sizing, and procurement - DONE.

And some apps thrive on Cloud virtualization. If you need a data warehouse or are looking to make sense of your data footprint, check out Greenplum: a distributed database BASED on VMs! You plug in resources as VMs as you grow and change!

And the line between the network, the systems, the applications, and the users is disappearing quickly. This presents an ever increasing data challenge: being able to discover and use all these relationships to deliver better services to customers.

Cloud Computing is bringing that revolution and reinvention cycle back into focus in the IT industry. It is a culling event as it will cull out the non-producers and change the customer engagement rules. Best of Breed is back! And with a Vengeance!

Sunday, April 4, 2010

Simplifying topology

I have been looking at monitoring and how it's typically implemented. Part of my look is to drive visualization, but also to see how I can leverage the data in a way that organizes people's thoughts on the desk.

Part of my thought process is around OpenNMS: what can I contribute to make the project better?

What I came to realize is that Nodes are monitored on a Node / IP address basis by the majority of products available today.  All of the alarms and events are aligned by node - even the sub-object based events get aggregated back to the node level.  And for the most part, this is OK.  You dispatch a tech to the Node level, right?

When you look at topology in a general sense, you can see the relationship between the poller and the Node under test. Between the poller and the end node, there is a list of elements that make up the lineage of network service components. So, from a service perspective, a simple traceroute between the poller and the end node produces a simple network "lineage".


Extending this a bit further, knowing that traceroute is typically done in ICMP, this gives you an IP level perspective of the network. Note also that because traceroute exploits the Time to Live parameter of IP, it can be accomplished in any transport layer protocol. For example, traceroute could work on TCP port 80 or 8080. The important part is that you place a protocol specific responder at the far end to see if the service is actually working beyond just responding to a connection request.

And while traceroute is a one way street, it still derives a lineage of path between the poller and the Node under test - and now the protocol or SERVICE under test. And it is still a simple lineage.

The significance of the path lineage is that in order to do some level of path correlation, you need to understand what is connected to what. Given that this can be very volatile and change very quickly, topology based correlation can be somewhat problematic - especially if your "facts" change on the fly. And IP based networks do that. They are supposed to do that. They are a best effort communications methodology that needs to adapt to various conditions.

Traceroute doesn't give you ALL of the topology - not by far. Consider the case of a simple Frame Relay circuit. A Frame Relay circuit is mapped end to end by a circuit provider but uses T-carrier access to the local exchange. Traceroute only captures the IP level access and doesn't capture elements below that. In fact, if you have ISDN backup enabled for a Frame Relay circuit, your end points for the circuit will change in most cases for the access, and the hop count may change as well.

The good part about tracerouting via a legitimate protocol is that you get to visualize any administrative access issues up front. For example, if port 8080 is blocked between the poller and the end node, the traceroute will fail. Additionally, you may see ICMP administratively prohibited messages as well. In effect, by positioning the poller according to end user populations, you get to see the service access pathing.

Now, think about this... From a basic service perspective, if you poll via the service, you get a basic understanding of the service you are providing via that connection. When something breaks, you also have a BASELINE with which to diagnose the problem. So, if the poll fails, rerun the traceroute via the protocol and see where it stops.
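A minimal sketch of that workflow - poll the service over its own protocol, and only rerun the path trace when the poll fails (the host, port, and use of the system traceroute command are placeholders for illustration):

import socket
import subprocess

SERVICE_HOST = "203.0.113.20"   # placeholder end node
SERVICE_PORT = 8080             # placeholder service port

def poll_service(host, port, timeout=3):
    """Poll the service the way a user would reach it: a TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if poll_service(SERVICE_HOST, SERVICE_PORT):
    print("service poll OK - baseline lineage still valid")
else:
    # Poll failed: rerun the trace toward the end node to see where the
    # path stops. A TCP-based traceroute (e.g. traceroute -T -p 8080 on
    # Linux) follows the service's own protocol.
    trace = subprocess.run(
        ["traceroute", "-T", "-p", str(SERVICE_PORT), SERVICE_HOST],
        capture_output=True, text=True)
    print(trace.stdout)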

Here are the interesting things to note about this approach:

  • You are simply replicating human expert knowledge in software.  Easy to explain.  Easy to transition to personnel.
  • You get to derive path breakage points pretty quickly.
  • You get to discern the perspective of the end user.
  • You are now managing your Enterprise via SERVICE!
Topology really doesn't mean ANYTHING until you evolve to manage by Service and not by individual nodes.  You can have all the pretty maps you want.  It doesn't mean crapola until you start managing by service.

This approach is an absolute NATURAL for OpenNMS.  Let me explain...

Look at the Path Outages tab. While it is currently manually configured, using the traceroute by service lineage here provides a way of visualizing the path lineage.

OpenNMS supports service pollers natively. There are a lot of different services out of the box, and it's easy to add more if you find something different from what they already do.

Look at the difference between Alarms and Events. A service outage could be directly related to an Alarm, while the things happening underneath that may affect the service are presented as events.

What if you took the reports and charts and aligned the elements to the service lineage?  For example, if you had a difference in service response, you could align all of the IO graphs for everything in the service lineage.  You could also align all of the CPU utilizations as well.

In elements where there are subobjects abstracted in the lineage, if you discover them, you could add those to the lineage. For example, if you discovered the Frame Relay PVCs and LEC access circuits, these could be included in your visualization underneath the path where they are present.

The other part is that the way you work may need to evolve as well. For example, if you've traditionally ticketed outages on Nodes, now you may need to transition to a Service based model. And while you may issue tickets on a node, your ticket on a Service becomes the overriding dominant ticket, in that multiple node problems may be present in a service problem.

And the important thing: you become aware of the customer and the Service first, then the elements underneath that. It becomes easier to manage to a service, along with impact assessments, when you manage to a service versus managing to a node. And when you throw in the portability, agility, and abstractness of Cloud computing, this approach is a very logical fit.

Dangerous Development

Ever been in an environment where the developed solutions tend to work around problems rather than confronting issues directly? Do you see bandaids applied to issues as the normal mode of operation? Do you have software that is developed without requirements or User Acceptance Testing?

What you are seeing are solutions developed in a vacuum, without the domain knowledge necessary to understand that a problem needs to be corrected and not "worked around". Essentially, when developers lack the domain knowledge needed to identify and correct problems in areas outside of software, you end up with software that works around or bandaids over issues. They don't know how to diagnose or correct the problem, diagnose the effects of the problem, or, in many cases, even understand that it is a problem.

In some cases, you need strong managerial leadership to stand up and make things right. The problem may be exacerbated by weak management or politically charged environments where managers manage to the Green. And some problems do need escalation.

This gets very dangerous to an IT environment for a multitude of reasons including:

  • It masks a problem. Once a problem is masked, any fix to the real problem breaks the developed bandaid solution.
  • It sets an even more dangerous precedent in that now its OK to develop bandaid solutions.
  • Once developed and in place, it is difficult to replace the solution. (It is easier to do nothing.)
  • It creates a mandate that further development will always be required because of the workarounds in the environment. In essence, no standards based product can fulfill the requirements any longer because of the workarounds.

A lot of factors contribute to this condition, commonly known as "Painted in a Corner" development. In essence, development efforts paint themselves into a corner where they cannot be truly finished or the return on investment can never be fully realized. A developer or IT organization cannot divorce itself or disengage from the product. In effect, you cannot finish it!

A common factor is a lack of life cycle methodology in the development organization. Without standards and methodologies, it is so easy for developers to skip over certain steps because of the pain and suffering. These elements include:

  
  • Requirements and Requirements traceability
  • Unit Testing
  • System Testing
  • Test Harnesses and structured testing
  • Quality Assurance
  • Coding standards
  • Documentation
  • Code / Project refactoring
  • Acceptance Testing.

This is no different from doing other tasks such as Network Engineering, Systems Engineering, and Applications Engineering. The danger is that once the precedent is set that it's OK to go around the Policies, Procedures, and Discipline associated with effective Software Development, it is very hard to rein it back in. In effect, the organization has been compromised. And it lacks the awareness that it is compromised.

  
What do you do to right the ship?

  
Obviously, there is a lack of standards and governance up front. These need to be remedied. Coding standards and software lifecycle management techniques need to be chosen and implemented up front. You need to get away from cowboy code and software development that is not customer driven. Additionally, it should be obvious that design and architecture choices need to be made external to this software development team for the foreseeable future.

  
Every piece of code written needs to be reviewed and corrected. You need to get rid of the bandaids and "solutions" that perpetuate problems. And you need to start addressing the real problems instead of working around them.

  
Software that perpetuates problems by masking them via workarounds and bandaids is a dangerous pattern to manifest in your organization. It's like finding roaches in your house. Not only will you see the bandaid redeveloped and reused over and over again, you now have an empowered development staff that bandaids versus fixes. Until you can do a thorough cleaning and bombing of the house, it is next to impossible to get rid of the roaches.

Sometimes, development managers tend to be promoted a bit early: while they have experience with code and techniques, it is exposure to a lot of different problem sets that separates the good development leaders from the politicians and wannabes. Those that are pipelined do not always understand how to reason through problems, discern good techniques and approaches from bad ones, and lead down the right paths. Some turn very political because it is easier for them to respond politically than technically.

What are the Warnings Signs?

Typically, you see in house developed solutions built around a problem in functionality that would not normally be seen in many other places. This can manifest in many ways. Some examples or warning signs include:


  • Non-applicability of commercial products because of one off in house solutions or applied workarounds.
  • Typical phrases like "that's the way we've always done this" or "We developed this to work around..." arise in discussions.
  • You see a lot of in house developed products in many different versions.
  • In house developed products tend to lack sufficient documentation.
  • You see major deviations away from standards.
  • No QA and code review is only internal to the group.
  • No Unit or System test functionality available to the support organization.
  • In house developed software that never transitions out of the development group.
  • Software developed in house that is never finished.
  • You get political answers to technical questions.
A lesson here is that it does matter that the person you set up to lead has exposure to a lot of different problem sets, knows SDLC methodologies in practice and not just theory, and has some definite problem reasoning skills. A politician is not a good coach or development leader in these environments.

In Summary

One must be very careful in the design and implementation of software done in house. If done wrong, you can quickly paint development and the developed capabilities into a corner where you must forklift to get functionality back. And if you're not careful, you will stop evolution in the environment because the technical solutions will continue to work around problems instead of directly addressing them.








Saturday, March 27, 2010

NPS, Enterprise Management, and Situation Awareness

In the course of what I do, I sometimes have to take non-technical metrics and understand where implementation of technology - especially in the ENMS realm - applies toward achieving real business goals.

Recently, a lot of services based companies have been working toward understanding and improving their Net Promoter Score, or NPS. As part of this initiative, what can I do to help realize the overall goal?

First, I went looking for a definition of NPS to understand the terms, conditions, and metrics related to this KPI. I found this definition:

"

What is Net Promoter?

Net Promoter® is both a loyalty metric and a discipline for using customer feedback to fuel profitable growth in your business. Developed by Satmetrix, Bain & Company, and Fred Reichheld, the concept was first popularized through Reichheld's book The Ultimate Question, and has since been embraced by leading companies worldwide as the standard for measuring and improving customer loyalty.

The Net Promoter Score, or NPS®, is a straightforward metric that holds companies and employees accountable for how they treat customers. It has gained popularity thanks to its simplicity and its linkage to profitable growth. Employees at all levels of the organization understand it, opening the door to customer- centric change and improved performance.

Net Promoter programs are not traditional customer satisfaction programs, and simply measuring your NPS does not lead to success. Companies need to follow an associated discipline to actually drive improvements in customer loyalty and enable profitable growth. They must have leadership commitment, and the right business processes and systems in place to deliver real-time information to employees, so they can act on customer feedback and achieve results.

"

I found this at :

http://www.netpromoter.com/np/index.jsp


In essence, the NPS KPI is a metric by which to measure customer loyalty. In its simplicity comes the subjectiveness of how you treat your customers. So this begs the question: what can I do from an Enterprise Management perspective to affect this?


From my perspective, the NPS is a measure of the effectiveness of your support CULTURE first and foremost. This is a personal - core belief - sort of thing at its foundation. Customer facing people in your support organization must project several key personality traits and behaviors. Some of these I envision to be:


Dedication. The customer is the only person in the room sort of thing.

Urgency of need. The support person must understand the importance of the situation.

Empathy. A willingness and understanding of the customer's pain.

Confidence. In the face of unknown issues and varying conditions, the customer facing person must exhibit technical strength.

Follow Through. If the customer trusts you enough to let you off the phone to handle things, you MUST FOLLOW THROUGH.

There is also the notion that in a Service oriented company, EVERYONE is a sales person in one way or another. Every interaction means an opportunity to understand the customer and help them be successful.

When you go to McDonald's and you're trying to figure out what you'd like, or what level of polyunsaturated fat and cholesterol you want to propagate to your family... Ever gotten the person that asks you what you want and, when you don't know, just stands there looking at you? NPS score --.

Now, if they engage you and suggest items like a 12 pack of Big Macs, they are DEDICATED, empathic to your hunger pains, understand your urgency of need, and have confidence your order is going to be up in a minute or so after inputting it in the computer. And in the end they ask about dessert - a great sales person and a GREAT customer service person. NPS score ++.

From a personal work habits perspective, one of the key behaviors to be considered is creating and maintaining Situation Awareness. I ran across the term SA while working on an Air Force project and found it profoundly apropos for operations organizations doing customer service. Check this out on Wikipedia:

http://en.wikipedia.org/wiki/Situation_awareness

I also read through several sections of the Google Books preview of the Situation Awareness book by Dr. Mica Endsley and Daniel Garland. It is at:

http://books.google.com/books?id=tUwqcqa_QaMC&printsec=frontcover&dq=Situation+Awareness&source=bl&ots=NccDiPzgMI&sig=NW0LAHrBsOTFXVmSyCSKz6154IU&hl=en&ei=pUWuS_HdE4X7lwf9tdCRAQ&sa=X&oi=book_result&ct=result&resnum=2&ved=0CBQQ6AEwAQ#v=onepage&q=&f=false

(I'm ordering the book!)

The model graphic they provide is useful as well:

http://en.wikipedia.org/wiki/File:SA_Wikipedia_Figure_1_Shared_SA_(20Nov2007).jpg

In effect, what enterprise management applications and technology MUST do to effectively achieve a higher NPS is empower SA at all levels. In doing so, you create a culture where information is meant to be shared and used to make predictions and elicit responses and decisions based upon the information being presented.

Now this is a bit taller an order than one might think at first. For example, on the Event / Fault Management side of things, information is presented as events. People respond to events. They test, or open tickets, or run whatever workflow they use when an event is received.

But an event is NOT a situation! A situation is something a bit different and more abstract than a simple event. So, you have to transition your events to be situation focused! Interesting thought... especially since event presentation is the prevalent method! Maybe the Netcool approach needs to evolve a bit!

It is interesting that OpenNMS introduces the concept that events are different from Alarms in its own GUI. Check it out at:

http://www.opennms.org/wiki/Alarms

A brilliant piece of work (and a notion that simple is good!) in that EVENTS != ALARMS! My hat's off to the OpenNMS guys and the OGP for GETTING IT! In fact, it's a start down the road of understanding the concept of Situations in SA.

Trouble Ticketing systems attempt to do this situation grouping via tickets, but it's almost too late once it leaves your near real time pane of glass. Once you transition away from a single pane of glass, you effectively lose your SA of the real time picture. And if you attempt to work out of tickets, you miss all of the elemental sorts of things that happen underneath. Even elements of information like event activity, performance thresholds, and support activity have to be discerned and recognized in near real time to be effective information. If you miss it, you don't know. But your customer may not miss it!

If you ticket straight from event to ticket, you're asking for problems. Problems like tickets that are not problems but side effects. Or side effects that are problems, just rolled up under a ticket. Or losing awareness that conditions have cleared while the ticket is still being escalated and worked. Or missing all of the adjacent issues, like a router taking out a subnet, taking out an application and its three different desks.

The interesting part here is that two given situations may share events that affect each situation. This may throw a kink into normal, database table based event management systems. It may be a bit difficult to implement and support.

I am beginning to think a bit differently about event processing, especially with regards to SA and understanding, recognizing, and responding to SITUATIONS. For example, check out this presentation by Tim Bass of Cyberstrategics. He has a long history of thought leadership in situational awareness in cyberspace.

http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

CEP techniques would enable an event to be consumed by multiple situations as situations develop and dissipate. Think about the weighting of events and conditions within a given situation. Some elements may be much more pertinent than others.
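As a toy illustration of that idea (nothing to do with any particular CEP engine; the situation names and weights are made up), a single event can be fanned out to every open situation whose criteria it matches, each with its own weight:

# Toy sketch: one event feeding multiple weighted "situations".
class Situation:
    def __init__(self, name, matches, weight=1.0):
        self.name = name
        self.matches = matches   # predicate deciding if an event is pertinent
        self.weight = weight     # how much a matching event matters here
        self.score = 0.0

    def consume(self, event):
        if self.matches(event):
            self.score += self.weight

situations = [
    Situation("WAN degradation", lambda e: e["source"].startswith("wan-rtr"), weight=2.0),
    Situation("Video service impact", lambda e: e["type"] in ("jitter", "packet-loss"), weight=1.0),
]

event = {"source": "wan-rtr-3", "type": "jitter"}
for s in situations:
    s.consume(event)             # the same event is consumed by both situations
    print(s.name, s.score)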

A significant part of Situation Awareness is the visualization and presentation of data regarding the ongoing situation. For example, check out this video: http://www.youtube.com/watch?v=FdKOxZIIKmQ

From the aspect of true, situational awareness, shouldn't we be looking at evolving Enterprise Management toward being able to deal with situations?

Another thought here: if I'm worried about an NPS, could I MANAGE to it live? Or at least closer to real time? What if I could meld in the capabilities of Evolve24's The Mirror product to get a look at the REPUTATION SITUATION as it evolves? Check it out at: http://www.evolve24.com/mirror_for_social_media.php

This kind of changes the face of what we have been considering as BSM, doesn't it?

The common denominator in all this process and technology is Knowledge Management. How are you developing knowledge? How are you integrating it in with EVERY person? How are you using it to create SA and HUGE business discriminators? How are you using KM to empower your customers?