Dougie's Enterprise Management World: Thoughts on Telepresense Management

Many organizations have turned to Telepresense capabilities as a way to enable collaboration and teaming across very separated geographical locations. When you think about it, the cost of getting managers, directors, and engineers together to address challenges and work through solutions, Telepresense systems have the potential to not only save money but also empower organizations to respond much faster as an organization.

Going further, the proliferation of video streams is becoming more common place every day. Not only are collaboration systems becoming more popular, many Service Providers are offering video based services to both corporate and consumer markets.

Video conferencing technology has been around for several years. It has evolved significantly as the Codecs and MCUs have evolved significantly along with several standards. Additionally, the transport of these streams is becoming common on internet and private TCP/IP networks where in the past, dedicated DSx or ISDN circuits were used..

Video traffic, by its very nature, places a series of constraints on the network due to the nature and behavior of the traffic.

Bandwidth

First there are bandwidth requirements. High motion digital MPEG video can present a significant amount of traffic to the network.

QoS

Additionally, video traffic is very jitter sensitive and dynamics like packet reordering can wreak havoc in the stream of what you see on screen. In some cases, you may not even own portions of the network you traverse and the symptoms may only be present for a very short period of time.

Video Traffic

Alot of multimedia and streaming type traffic is facilitated by the pairing of RTP and RTCP channels. RTP usually runs on an even numbered port between 1024 and 65535 with the RTCP channel running on the next higherport number. This pairing is called a tuple. Additionally, while RTP and RTCP are protocol independent, they typically run on a UDP transport as TCP tends to sacrifice timeliness in favor of reliability.

Multipoint communications is facilitated by IP multicast.

In digitized speech passing across the network, with G.711 and G.729, the two most common VoIP encoding methods, you get two samples of voice per packet. The size of voice packets tends to be slightly larger than 220 bytes per packet and therefore the overall behavior is the Voice is a constant bit stream constrained by the sample rates needed per interval. Because of sampling, the packets occur on a 20 msec interval.

Some video systems use ITU H.264 encoding which is equivalent to MPEG-4 Part 10, and uses 30 Frames per second. Each frame must be setup every 33 msec. But MPEG encoding per frame can have a variable number of bytes and packets depending upon the video content. MPEG video is accomplished using a reference frame and subsequent data sections provide updates to the original frame in what is called a GOP or Group of Pictures.

A GOP starts off with an I frame which is the reference frame. A P frame represents a significant change in the initial I frame but continues to reference the I frame. B Frames update both I frames and P frames.

The types of frames and their location within a GOP can be defined in a temporal sequence. This temporal distance is the time or number of images between specific types of images in a digital video. “m” is the distance between successive P frames and “n” is the distance between I frames.

In contrast to serial digital television, 30 frames per second is digitized where the entire screen is digitized and sent every frame. As compared to MPEG,the transmission requirements are significantly less than serial digital because only the changes to the I frame are updated using the P and B frames.

Following is a diagram of a typical Group of Pictures and how the frames are sequenced.

http://en.wikipedia.org/wiki/File:GOP.gif

Following is another illustration of a GOP along with data requirements per frame.

This figure shows how different types of frames that can compose a group of pictures (GOP). A GOP can be characterized as the depth of compressed predicted frames (m) as compared to the total number of frames (n). This example shows that a GOP starts within an intra-frame (I-frame) and that intra-frames typically requires the largest number of bytes to represent the image (200 kB in this example). The depth m represents the number of frames that exist between the I-frames and P-frames.

When you look at the data sizes, you realize that the transmission of data across a network as compared to the GOP diagram, you start to realize the breakdown of packets and packet streaming necessary to enable the GOP sequence to work in near real time. For example, the first I frame must pass 200 kB sequentially which could turn out to be several hundred packets across the network. These packets are time and sequence sensitive. When things go wrong, you get video presentations that are frozen or pixelated.

In Cisco TPS, latency is defined as the time it takes for audio and video input to go from input on one end to presentation on the other. It measures latency between two systems via time stamps in the RTP data as well as the RTCP return reports. It is only a measure of the network and does not take into consideration the codecs. The recommended latency target for Cisco TPs is 150 msec or less. However, this is not always possible. When latency is exceeds over 250 msec in a 10 second period, it generates an alarm, an on screen message, a syslog message and an SNMP Trap on the receiving system. The onscreen message is only displayed 15 seconds and isn’t displayed again unless the session is restarted or stopped and reinitiated.

Packet Loss

Packet Loss can occur across the network for a variety of reasons. It could be layer 1 errors, Ethernet duplex mismatches, overrunning the queue depth in routers, and even induced by jitter. In Cisco CTP systems, they recommend a loss less than .05 percent of traffic in each direction. When packet loss exceeds 1 percent averaged over a 10 second period, the system presents an on screen message, generates a syslog message and an SNMP Trap.

When packet loss exceeds 10 percent average over a 10 second period, an additional on screen message is generated, syslog messages are logged and an SNMP Trap is generated.

If the CTS system experiences packet loss greater than 10 percent averaged over a 60 second interval, it downgrades the quality of its outgoing video and puts up an onscreen message. Following is a table of key metrics.

Metric	Target	1^st Threshold	2^nd Threshold
Latency	150 ms	250 ms (2 seconds for Satellite mode)
Jitter	10 ms Packet Jitter 50 ms Frame jitter	125 ms of video frame jitter	165 ms of video frame jitter
Packet Loss	.05%	1%	10%

If you are building KPIs around something like a Cisco CTS implementation, you need to look at areas where latency, jitter, and packet loss can occur. Sometimes, it is in your control and sometimes it is not. And there are a lot of different possibilities.

Some errors and performance conditions are persistent and some are intermittent and situationally specific. For example, a duplex mismatch on either end will result in a persistent packet loss during TPS calls. An overloaded WAN router in the path may be dropping packets during the call.

While the CTS system gives your Operations personnel awareness that jitter, latency, or packet loss is occurring, it does not tell you where the problem is. And if it causes the call to reset, you may not be able to discern where the problem is either. Once the traffic is gone, the problem may disappear with it.

When you look at the diagram, you see a Headquarters CTS system capable of connecting to both a Remote Campus CTS system and a Branch office equipped with a CTS system. Multiple redundant metwork connections are provided via Headquarters and the Remote Campus but a single Metro Ethernet connection connects the Branch office to the Headquarters and Remote campus facilities.

When you look at the potential instrumentation that could be applied, there are a lot of different points to sort out device by device. But diagnosis of something like this goes from end to end. The underlying performance and status data is really most useful when an engineer is drilling into the problem to look for root causes, capacity planning, or performing a post mortem on a specific problem.

Most Cisco TPS systems are high visibility as the users tend to be presented with the issues either by cue on screen or by performance during the call. It doesn’t take too many jerky or broken calls to render it a non-working technology in the minds of some.And these users tend to be high profile like senior management and business unit decision makers.

So, What Do You Do?

First of all, Operations personnel need to know proactively, when a CTS system is presenting errors and experiencing problems. On the elements you can monitor, setup and measure key performance metrics and thresholds, setup status change mechanisms, and threshold on mis-configured elements. Thresholds, traps, and events help you create situation awareness in your environment. This is a good start toward being able to recognize problems and conditions. Even creating awareness that a configuration change is made to any component in the CTS system enables Operations to be aware of elements in the service.

In the monitoring and management of the infrastructure, you are probably receiving SNMP Traps and Syslog from the various components in the environment. This is a good start toward seeing failures and error conditions as they are sent as the events occur. However, you need performance data to be able to drill down into the data to discern and analyze conditions and situations that could affect streaming services.

But when you think about it, you actually need to start with a concept of performance and availability of end to end, as a service. Having IO performance data on a specific router interface makes no seense until its put into a service perspective.

End to End Strategy

First up, you need to look at an end to end monitoring strategy. And probably one of themost common steps to take is to employ something like Cisco IPSLA capabilities.

Cisco has a capability in many of its routers and switches that enables managing service levels using various measurements provided. There are a series of capabilities that can be utilized in your CTS Monitoring and management solution to make it more effective beyond the basic instrumentation. Cisco recommends that shadow devices be used as the IP SLA tests will circumvent routing and queuing mechanisms and can have an effect if done on production devices.

Not all devices support all tests. So, be sure to check with Cisco to ensure you can use a specific IOS version and platform for the given tests. Some service providers and enterprises use decommissioned components to reduce cost.

Basic Connectivity

One of the Cisco IP SLA most commonly used test is the ICMP Echo test. The ICMP Echo test sends a ping from the shadow router doing the test to an end IP. In looking over figure 2, we would need to setup an ICMP Echo test from the shadow router on one side to the CTS system on the other side. This is to establish a connectivity check in each direction.

This will also provide a latency metric but at the IP level. ICMP is the control mechanism for the IP protocol. There may be added latency at higher level protocols depending upon the elements interacting on those protocols. For example, firewalls that maintain state at the transport layer (TCP) may introduce additional latency. Traffic shapers may do so as well.

The collected data and status of the tests are in the CISCO-RTTMON-MIB.my MIB definition file. Pertinent data shows up in rttMonCtrlAdminEntry, rttMonCtrlOperEntry, rttMonStatsCaptureEntry, and rttMonStatsCollectEntry tables.

Jitter

We know that Jitter can have a profound effect on video streams as a GOP that arrives too late or out of sequence will be dropped by the receiving codec as it cannot go back in the video stream and redo the past GOP frame. We can also discern that at 30 frames a second, the GOP frame must be sent every 33 ms.

The jitter test is executed between a Source and Destination and subsequently, needs a Responder to operate. It is accomplished by sending packets separated by an interval. Both source to destination and destination to source is accommodated.

The statistics available provide the types of information you are looking to threshold and extract. The Cisco IP SLA ICMP based Jitter test supports the following statistics:

Jitter (Source to Destination and Destination to Source)
Latency (Source to Destination and Destination to Source
Round Trip Latency
Packet Loss
Successive Packet Loss
Out of Sequence Packets (Source to Destination and Destination to Source)
Late Packets

A couple of factors need to be considered in setting up jitter with regards to video streams. These are:

Jitter Interval
The number of packets
Test frequency
Dealing with Load Balancing and NAT

The jitter interval needs to be set to the GOP frame interval of 33 ms.

The number of packets needs to be set to something significant beyond single digits but below a value which would hammer the network. What you are looking for is just enough packets to understand you have an issue. I would probably recommend 20 at first even though the video packet numbers will be significantly higher. Not every network is created equal so you probably need to tune this. While the number of packets in a GOP frame can be highly variable, you just need to sample for the statistics that effect service.

Test frequency can have an effect on traffic. I would probably start with 5 minutes during active periods. But this too needs to be adjusted according to your environment.

In the Jitter test setup, you can use either IP addresses and hostnames when you specify source and destination.

The statistical data is collected in the rttMonJitterStatsEntry table.

ICMP Path Echo

When connectivity issues occur or changes in latency occur, one needs to be able to diagnose the path from both ends. The Cisco IP SLA ICMP Echo Path test is an ICMP based traceroute function. This test can help diagnose not only path connectivity issues but latency in the path, and asymmetric paths.

Traceroute uses the TTL field of IP to “walk” through a network. As a new hop is discovered, the TTL is incremented and attempted again. As each hop is discovered, an ICMP Echo test is performed to measure the response time.

Data will be presented in the rttMonStatsCaptureEntry table.

Advanced Functions

In looking over what we’ve discussed so far, we have reviewed in system diagnostic events, external events and conditions, and using IP SLA capabilities to test and evaluated in a shadow mode, between two or more telepresense systems. But when things happen, they do so in real time. Coupled with the facts of many areas that could affect degradation in services, in some instances, it may be necessary to provide enhanced levels of service.

Additionally, multiple problems can present confusing and anomalous event patterns and symptoms exacerbating the service restoral process. Of course, because you have high visibility, any confusion adds fuel to the fire.

Media Delivery Index

Ineoquest has derived a service metric related to video streams services and provide a KPI for service delivery. The Media Delivery Index (MDI) can be used to monitor both the quality of a delivered video stream as well as to show system margin by providing an accurate measurement of jitter and delay. The MDI addresses a need to assure the health and quality of systems that are delivering ever higher numbers of streams simultaneously by providing a predictable, repeatable measurement rather than relying on subjective human observations. Use of the MDI further provides a network margin indication that warns system operators of impending operational issues with enough advance notice to allow corrective action before observed video is impaired.

The MDI is also presented in RFC-4445 as jointly authored by Ineoquest and Cisco.

Ineoquest uses Probes and software to analyze streams in real time, to validate the actual video stream as it occurs. In fact, they can be used in a portspan, a network tap arrangement, or they can actually JOIN a conference! Why is this necessary?

The Singulus and Geminus probes look at each GOP and the packets that make up that frame. They capture and provide metrics on any data loss, jitter, and Program Clock Reference (PCR). They are also able to analyze the data stream and determine in real time, the quality, of the video. You can create awareness for increased jitter even before video is affected by analyzing the jitter margins. This takes the subjectiveness out of the measurement equation.

Additionally, you can record and play back video streams that are problematic. This is very important that in video encoding streams, the actual data patterns can be wildly different from one moment to the next depending upon the changes that occur in the GOP. For example, if the camera is displaying a white board, there a very few changes. However, if the camera is capturing 50 people in a room all moving, the GOP is going to be thick with updates to the I, P, and B frames.

For instance, a Provider Network setup could have a cbQoS setup that’s perfectly acceptable for a Whiteboard camera or even moderate traffic. It may start buffering packets when times get tough. The bandwidth may be oversubscribed and reprioritized in areas where you have no visibility. This can introduce jitter and even packet loss only its specific to the conditions at hand. And how do you PROVE that back to your Service Provider?

You can go back and replay the video streams during a maintenance window to validate the infrastructure. And test, diagnose, and validate specific network components with your network folks and service providers alike to deliver a validated service as a known functional capability.

And because you can replicate the captures, what a great way to test load and validate your end to end during maintenance windows!

The IQ Video Management System (iVMS) controls that probes, assembles the data into statistics and provides the technical diagnostic front end to access the detailed information elements available. Additionally, when problems and thresholds occur, it forwards SNMP Traps, syslogs, and potentially email regarding the issues at hand.

Analyzing the Data

First and foremost, you have traps and syslog that tell you of impending conditions that are occuring in near real time. For example, due to the transmission quality of the stream, the end presentation may downgrade from 1080i to 720i to save on bandwidth.

The IPSLA tests establish end to end connectivity and a portion of the path. In Traceroute, normally you only see one interface per device as traceroute walks the path. If only gives you one interface for the device and not the in and out interfaces. This is true because the Time to Live parameter wont be exceeded again until the next hop.

Keep in mind though that elements that can affect a video stream may not be in the direct path as a function. For example, if a router in the path has a low memory condition caused by the buffering from a fast network to a low speed WAN Link, this may affect the router's ability to correctly handle streaming data under load.

What if path changes affect the order of packets? What if timing slips causes jitter on the circuit? Even BGP micro flaps may cause jitter and out of sequence frames. all in all, you have your work cut out for you.

Summary

Prepare yourself to approach the monitoring and management of video streams from a holistic point of view. You have to look at it first from the perspective of end to end to end. Also be prepared to be able to fill in the blanks with other configuration and performance data.

Because of the visibility of Video teleconferencing systems, be prepared to validate your environment on an ongoing basis. Things change constantly. Changes even only remotely related or out of your span of control, can affect the systems ability to perform well. Validate and test your QoS and performance from a holistic view on a recurring basis. A little recurring discipline will save you countless hours of heartbreak later.

Because of the situational aspects of video telepresense, be prepared to enable diagnosis in near real time. Specialized tools like Ineoquest can really empower you to support and understand whats going on and work through analysis,design changes, maintenance actions, and even vendor support to ensure you are on the road to an effective collaboration service.

Sometimes, it pays to enlist the aid of an SME that is vendor independent. After all, in doing this Blog post, I credit a significant amount of education and technical validation on the GOP, RTP, RTVP, and other elements of streaming to Stephen Bavington of Bavington Consulting. http://bavingtonconsulting.com/

Dougie's Enterprise Management World

Sunday, July 15, 2012

Thoughts on Telepresense Management

No comments:

Post a Comment