
Sunday, July 15, 2012

Thoughts on Telepresence Management

Many organizations have turned to telepresence capabilities as a way to enable collaboration and teaming across widely separated geographic locations. When you consider the cost of getting managers, directors, and engineers together to address challenges and work through solutions, telepresence systems have the potential not only to save money but also to help the organization respond much faster.

Going further, video streams are proliferating every day. Not only are collaboration systems becoming more popular, many service providers are offering video-based services to both corporate and consumer markets.

Video conferencing technology has been around for several years and has evolved significantly as codecs, MCUs, and the associated standards have matured. Additionally, the transport of these streams is becoming common on the internet and private TCP/IP networks where, in the past, dedicated DSx or ISDN circuits were used.

Video traffic places a series of constraints on the network because of its volume and behavior.

Bandwidth

 First there are bandwidth requirements. High motion digital MPEG video can present a significant amount of traffic to the network. 

QoS

Additionally, video traffic is very jitter sensitive, and dynamics like packet reordering can wreak havoc on what you see on screen. In some cases, you may not even own portions of the network you traverse, and the symptoms may be present for only a very short period of time.

Video Traffic

A lot of multimedia and streaming traffic is facilitated by the pairing of RTP and RTCP channels. RTP usually runs on an even-numbered port between 1024 and 65535, with the RTCP channel running on the next higher port number. This pairing is called a tuple. Additionally, while RTP and RTCP are independent of the underlying transport, they typically run over UDP because TCP tends to sacrifice timeliness in favor of reliability.
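Just to make the pairing concrete, here is a quick Python sketch. The port range and the even/odd convention come from the description above; the allocation function itself is purely illustrative.

import random

def allocate_rtp_tuple(low=1024, high=65535):
    """Pick an even RTP port in range; RTCP rides on the next higher (odd) port."""
    rtp_port = random.randrange(low, high, 2)   # even ports only, since low is even and step is 2
    rtcp_port = rtp_port + 1                    # the conventional RTCP pairing
    return rtp_port, rtcp_port

print(allocate_rtp_tuple())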

Multipoint communications are facilitated by IP multicast.

For digitized speech passing across the network with G.711 and G.729, the two most common VoIP encoding methods, you get two voice samples per packet. Voice packets tend to be slightly larger than 220 bytes each, so voice behaves as a constant bit rate stream constrained by the sample rate needed per interval. Because of the sampling, packets occur on a 20 msec interval.
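As a back-of-the-envelope check in Python (assuming G.711 with 20 ms packetization and untagged Ethernet framing, so the exact byte counts are illustrative rather than a vendor specification):

# Rough VoIP bandwidth estimate for one direction (assumed: G.711, 20 ms packetization).
SAMPLE_INTERVAL_MS = 20          # one packet every 20 ms
PAYLOAD_BYTES      = 160         # G.711: 8000 samples/s * 1 byte * 0.020 s
RTP_UDP_IP_BYTES   = 12 + 8 + 20 # RTP + UDP + IPv4 headers
ETHERNET_BYTES     = 18          # Ethernet header + FCS, no 802.1Q tag

packet_bytes    = PAYLOAD_BYTES + RTP_UDP_IP_BYTES + ETHERNET_BYTES
packets_per_sec = 1000 // SAMPLE_INTERVAL_MS
bandwidth_kbps  = packet_bytes * 8 * packets_per_sec / 1000

print(f"{packet_bytes} bytes/packet, {packets_per_sec} pps, ~{bandwidth_kbps:.1f} kbps per direction")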

Some video systems use ITU H.264 encoding, which is equivalent to MPEG-4 Part 10 and runs at 30 frames per second, so a frame must be sent every 33 msec. But MPEG encoding can produce a variable number of bytes and packets per frame depending upon the video content. MPEG video is built from a reference frame, with subsequent data sections providing updates to that original frame in what is called a GOP, or Group of Pictures.

A GOP starts off with an I frame, which is the reference frame. A P frame represents a significant change from the initial I frame but continues to reference it. B frames update both I frames and P frames.
The types of frames and their locations within a GOP can be defined as a temporal sequence. This temporal distance is the time, or number of images, between specific types of images in a digital video: "m" is the distance between successive P frames and "n" is the distance between I frames.

Contrast this with serial digital television, where the entire screen is digitized and sent for every one of the 30 frames per second. With MPEG, the transmission requirements are significantly lower than serial digital because only the changes to the I frame are sent using the P and B frames.

Following is a diagram of a typical Group of Pictures and how the frames are sequenced.



Following is another illustration of a GOP along with data requirements per frame.


This figure shows the different types of frames that can compose a group of pictures (GOP). A GOP can be characterized by the depth of compressed predicted frames (m) as compared to the total number of frames (n). This example shows that a GOP starts with an intra-frame (I-frame) and that intra-frames typically require the largest number of bytes to represent the image (200 kB in this example). The depth m represents the number of frames that exist between the I-frames and P-frames.

When you compare the data sizes to the GOP diagram, you start to realize the breakdown of packets and packet streaming necessary for the GOP sequence to work in near real time. For example, the first I frame must pass 200 kB sequentially, which could turn out to be several hundred packets across the network. These packets are time and sequence sensitive. When things go wrong, you get video presentations that are frozen or pixelated.
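Here is that packet math as a rough Python sketch. The 1,400-byte usable payload per packet is an assumption for illustration, not a CTS specification.

# Rough packet count for the 200 kB I-frame from the GOP example (assumed 1,400-byte payloads).
I_FRAME_BYTES   = 200 * 1000    # 200 kB reference frame from the example above
PAYLOAD_PER_PKT = 1400          # assumed usable payload after IP/UDP/RTP headers
FRAME_BUDGET_MS = 33            # 30 fps -> one frame roughly every 33 ms

packets = -(-I_FRAME_BYTES // PAYLOAD_PER_PKT)   # ceiling division
print(f"~{packets} packets must arrive, in order, within about {FRAME_BUDGET_MS} ms")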


In Cisco TelePresence (CTS) systems, latency is defined as the time it takes for audio and video to go from input on one end to presentation on the other. It is measured between two systems via time stamps in the RTP data as well as the RTCP return reports. It is only a measure of the network and does not take the codecs into consideration. The recommended latency target for Cisco TelePresence is 150 msec or less; however, this is not always possible. When latency exceeds 250 msec over a 10 second period, the receiving system generates an alarm, an on-screen message, a syslog message, and an SNMP trap. The on-screen message is displayed for only 15 seconds and isn't displayed again unless the session is restarted or stopped and reinitiated.

Packet Loss

Packet loss can occur across the network for a variety of reasons: layer 1 errors, Ethernet duplex mismatches, overrunning the queue depth in routers, and even loss induced by jitter. For CTS systems, Cisco recommends loss of less than 0.05 percent of traffic in each direction. When packet loss exceeds 1 percent averaged over a 10 second period, the system presents an on-screen message, generates a syslog message, and sends an SNMP trap.

When packet loss exceeds 10 percent averaged over a 10 second period, an additional on-screen message is generated, syslog messages are logged, and an SNMP trap is sent.

If the CTS system experiences packet loss greater than 10 percent averaged over a 60 second interval, it downgrades the quality of its outgoing video and puts up an onscreen message. Following is a table of key metrics.

Metric         Target                                     1st Threshold                           2nd Threshold
Latency        150 ms                                     250 ms (2 seconds for Satellite mode)   --
Jitter         10 ms packet jitter / 50 ms frame jitter   125 ms of video frame jitter            165 ms of video frame jitter
Packet Loss    0.05%                                      1%                                      10%

If you are building KPIs around something like a Cisco CTS implementation, you need to look at areas where latency, jitter, and packet loss can occur. Sometimes, it is in your control and sometimes it is not. And there are a lot of different possibilities.
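To give a feel for what a KPI check against the table above might look like, here is an illustrative Python sketch. The metric names and the k-style classification are made up; the target and threshold values are taken from the table.

# Illustrative KPI check against the CTS-style thresholds tabulated above.
THRESHOLDS = {
    # metric: (target, first_threshold, second_threshold) -- units as noted in the table
    "latency_ms":      (150,  250,  None),
    "frame_jitter_ms": (50,   125,  165),
    "packet_loss_pct": (0.05, 1.0,  10.0),
}

def classify(metric, value):
    target, first, second = THRESHOLDS[metric]
    if second is not None and value >= second:
        return "second threshold exceeded"
    if value >= first:
        return "first threshold exceeded"
    if value > target:
        return "above target"
    return "within target"

print(classify("packet_loss_pct", 1.3))   # -> first threshold exceeded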

Some errors and performance conditions are persistent, and some are intermittent and situationally specific. For example, a duplex mismatch on either end will result in persistent packet loss during TelePresence calls, while an overloaded WAN router in the path may drop packets only while it is congested during the call.

While the CTS system gives your Operations personnel awareness that jitter, latency, or packet loss is occurring, it does not tell you where the problem is. And if it causes the call to reset, you may not be able to discern where the problem is either. Once the traffic is gone, the problem may disappear with it.





When you look at the diagram, you see a Headquarters CTS system capable of connecting to both a Remote Campus CTS system and a Branch office equipped with a CTS system. Multiple redundant network connections are provided between Headquarters and the Remote Campus, but a single Metro Ethernet connection connects the Branch office to the Headquarters and Remote Campus facilities.

When you look at the potential instrumentation that could be applied, there are a lot of different points to sort out device by device. But diagnosis of something like this goes from end to end. The underlying performance and status data is really most useful when an engineer is drilling into the problem to look for root causes, capacity planning, or performing a post mortem on a specific problem.

Most Cisco TelePresence systems are high visibility, as users tend to be presented with the issues either by an on-screen cue or by the performance of the call itself. It doesn't take too many jerky or broken calls to render it a non-working technology in the minds of some. And these users tend to be high profile, like senior management and business unit decision makers.

So, What Do You Do?

First of all, Operations personnel need to know proactively when a CTS system is presenting errors and experiencing problems. On the elements you can monitor, set up and measure key performance metrics and thresholds, set up status change mechanisms, and threshold on misconfigured elements. Thresholds, traps, and events help you create situational awareness in your environment. This is a good start toward being able to recognize problems and conditions. Even creating awareness that a configuration change was made to any component in the CTS system keeps Operations aware of the elements in the service.

In the monitoring and management of the infrastructure, you are probably already receiving SNMP traps and syslog from the various components in the environment. This is a good start toward seeing failures and error conditions, since the events are sent as they occur. However, you also need performance data so you can drill down to discern and analyze conditions and situations that could affect streaming services.

But when you think about it, you actually need to start with a concept of performance and availability end to end, as a service. Having I/O performance data on a specific router interface makes no sense until it is put into a service perspective.

End to End Strategy

First up, you need to look at an end to end monitoring strategy. Probably one of the most common steps to take is to employ something like Cisco IP SLA capabilities.

Cisco provides a capability in many of its routers and switches that enables managing service levels using a variety of measurements. These capabilities can be used in your CTS monitoring and management solution to make it more effective beyond the basic instrumentation. Cisco recommends that shadow devices be used, as the IP SLA tests circumvent routing and queuing mechanisms and can have an impact if run on production devices.

Not all devices support all tests, so be sure to check with Cisco that a specific IOS version and platform supports the tests you need. Some service providers and enterprises use decommissioned components as shadow devices to reduce cost.

Basic Connectivity

One of the most commonly used Cisco IP SLA tests is the ICMP Echo test, which sends a ping from the shadow router running the test to a target IP address. Looking over figure 2, we would need to set up an ICMP Echo test from the shadow router on one side to the CTS system on the other side, and vice versa, to establish a connectivity check in each direction.

This also provides a latency metric, but at the IP level. ICMP is the control mechanism for the IP protocol, and there may be added latency at higher level protocols depending upon the elements interacting at those layers. For example, firewalls that maintain state at the transport layer (TCP) may introduce additional latency, and traffic shapers may do so as well.

The collected data and status of the tests are in the CISCO-RTTMON-MIB.my MIB definition file. Pertinent data shows up in rttMonCtrlAdminEntry, rttMonCtrlOperEntry, rttMonStatsCaptureEntry, and rttMonStatsCollectEntry tables.


Jitter

We know that jitter can have a profound effect on video streams: a GOP frame that arrives too late or out of sequence is dropped by the receiving codec, because it cannot go back in the video stream and redo a past GOP frame. We can also discern that at 30 frames a second, a GOP frame must be sent every 33 ms.
The jitter test is executed between a source and a destination and therefore needs a responder to operate. It is accomplished by sending packets separated by an interval, and both source-to-destination and destination-to-source measurements are accommodated.

The statistics available provide the types of information you are looking to threshold and extract. The Cisco IP SLA ICMP based Jitter test supports the following statistics:
  • Jitter (Source to Destination and Destination to Source)
  • Latency (Source to Destination and Destination to Source)
  • Round Trip Latency
  • Packet Loss
  • Successive Packet Loss
  • Out of Sequence Packets (Source to Destination and Destination to Source)
  • Late Packets
A couple of factors need to be considered in setting up jitter with regards to video streams. These are:
  • Jitter Interval
  • The number of packets
  • Test frequency
  • Dealing with Load Balancing and NAT
The jitter interval needs to be set to the GOP frame interval of 33 ms.

The number of packets needs to be set to something well beyond single digits but below a value that would hammer the network. What you are looking for is just enough packets to understand whether you have an issue. I would probably start with 20, even though the packet counts in real video streams will be significantly higher. Not every network is created equal, so you will probably need to tune this. While the number of packets in a GOP frame can be highly variable, you only need to sample enough to capture the statistics that affect the service.

Test frequency can have an effect on traffic. I would probably start with 5 minutes during active periods. But this too needs to be adjusted according to your environment.

In the jitter test setup, you can use either IP addresses or hostnames when you specify the source and destination.

The statistical data is collected in the rttMonJitterStatsEntry table.
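Under the hood, jitter statistics like these are derived from send and receive timestamps on the probe packets. Here is a minimal Python sketch of the standard RTP interarrival jitter estimator (RFC 3550-style smoothing), using made-up timestamps; the IP SLA implementation reports additional per-direction breakdowns.

# Minimal interarrival-jitter estimator (RFC 3550-style smoothing) over probe timestamps.
def interarrival_jitter(send_times_ms, recv_times_ms):
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0      # exponential smoothing as in RFC 3550
        prev_transit = transit
    return jitter

# Probes sent every 33 ms; arrival times show a little network-induced variation.
sends = [0, 33, 66, 99, 132]
recvs = [20, 55, 91, 118, 154]
print(f"estimated jitter: {interarrival_jitter(sends, recvs):.2f} ms")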


ICMP Path Echo

When connectivity issues or changes in latency occur, one needs to be able to diagnose the path from both ends. The Cisco IP SLA ICMP Path Echo test is an ICMP-based traceroute function. This test can help diagnose not only path connectivity issues but also latency along the path and asymmetric paths.

Traceroute uses the TTL field of IP to "walk" through a network: the TTL is incremented and the probe sent again until a new hop is discovered. As each hop is discovered, an ICMP Echo test is performed against it to measure the response time.

Data will be presented in the rttMonStatsCaptureEntry table.



Advanced Functions


Looking over what we've discussed so far, we have reviewed in-system diagnostic events, external events and conditions, and using IP SLA capabilities to test and evaluate, in a shadow mode, between two or more telepresence systems. But when things happen, they do so in real time. Coupled with the many areas that could contribute to service degradation, in some instances it may be necessary to provide an enhanced level of service assurance.

Additionally, multiple problems can present confusing and anomalous event patterns and symptoms exacerbating the service restoral process.  Of course, because you have high visibility, any confusion adds fuel to the fire.

Media Delivery Index

Ineoquest has derived a service metric for video stream services that provides a KPI for service delivery. The Media Delivery Index (MDI) can be used to monitor the quality of a delivered video stream as well as to show system margin by providing an accurate measurement of jitter and delay. The MDI addresses the need to assure the health and quality of systems that deliver ever higher numbers of simultaneous streams by providing a predictable, repeatable measurement rather than relying on subjective human observations. The MDI also provides a network margin indication that warns system operators of impending operational issues with enough advance notice to allow corrective action before observed video is impaired.

The MDI is also documented in RFC 4445, which was jointly authored by Ineoquest and Cisco.
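To give a feel for the two MDI components, here is a highly simplified Python sketch of the Delay Factor (a virtual buffer model) and the Media Loss Rate. The arrival data, packet sizes, and media rate are assumed values, and a real RFC 4445 implementation tracks these continuously per stream and also counts out-of-order packets in the MLR.

# Simplified MDI sketch: Delay Factor from a virtual buffer model, plus a basic Media Loss Rate.
def delay_factor_ms(arrivals, media_rate_bps):
    """arrivals: list of (arrival_time_sec, payload_bytes) over a one-second interval."""
    start = arrivals[0][0]
    received = 0
    levels = []
    for t, size in arrivals:
        received += size
        drained = (t - start) * media_rate_bps / 8.0   # bytes the decoder would have consumed so far
        levels.append(received - drained)              # virtual buffer depth in bytes
    df_bytes = max(levels) - min(levels)
    return df_bytes * 8.0 / media_rate_bps * 1000.0    # express the buffer spread in milliseconds

def media_loss_rate(expected_packets, received_packets, interval_sec=1.0):
    # Lost packets per second; out-of-order handling is omitted in this sketch.
    return (expected_packets - received_packets) / interval_sec

arrivals = [(0.000, 1316), (0.004, 1316), (0.012, 1316), (0.013, 1316), (0.020, 1316)]
print(f"DF  ~ {delay_factor_ms(arrivals, media_rate_bps=3_750_000):.2f} ms")
print(f"MLR = {media_loss_rate(expected_packets=300, received_packets=298):.0f} packets/sec")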

Ineoquest uses probes and software to analyze streams in real time and validate the actual video stream as it occurs. In fact, the probes can be used on a port span, in a network tap arrangement, or they can actually JOIN a conference! Why is this necessary?

The Singulus and Geminus probes look at each GOP and the packets that make up each frame. They capture and provide metrics on data loss, jitter, and Program Clock Reference (PCR). They are also able to analyze the data stream and determine the quality of the video in real time. By analyzing the jitter margins, you can create awareness of increased jitter even before the video is affected. This takes the subjectiveness out of the measurement equation.

Additionally, you can record and play back problematic video streams. This matters because in encoded video streams, the actual data patterns can be wildly different from one moment to the next depending upon the changes that occur in the GOP. For example, if the camera is pointed at a whiteboard, there are very few changes. However, if the camera is capturing 50 people in a room all moving, the GOP is going to be thick with updates to the I, P, and B frames.

For instance, a provider network could have a cbQoS setup that's perfectly acceptable for a whiteboard camera or even moderate traffic, yet starts buffering packets when times get tough. The bandwidth may be oversubscribed and reprioritized in areas where you have no visibility. This can introduce jitter and even packet loss, but only under the specific conditions at hand. And how do you PROVE that back to your service provider?

You can go back and replay the captured streams during a maintenance window to validate the infrastructure, and test, diagnose, and validate specific network components with your network folks and service providers alike to deliver a validated service as a known functional capability.

And because you can replicate the captures, what a great way to test load and validate your end to end during maintenance windows!

The IQ Video Management System (iVMS) controls the probes, assembles the data into statistics, and provides the technical diagnostic front end for accessing the detailed information elements available. Additionally, when problems occur and thresholds are crossed, it forwards SNMP traps, syslog messages, and potentially email regarding the issues at hand.


Analyzing the Data

First and foremost, you have traps and syslog that tell you of conditions that are occurring in near real time. For example, due to the transmission quality of the stream, the end presentation may downgrade from 1080i to 720i to save on bandwidth.

The IP SLA tests establish end to end connectivity and visibility into a portion of the path. With traceroute, you normally see only one interface per device as it walks the path; it gives you one interface for each device, not both the in and out interfaces. This is because the Time to Live won't be exceeded again until the next hop.

Keep in mind, though, that elements that can affect a video stream may not be in the direct path as a function. For example, if a router in the path has a low memory condition caused by buffering from a fast network onto a low speed WAN link, that may affect the router's ability to correctly handle streaming data under load.

What if path changes affect the order of packets? What if timing slips cause jitter on the circuit? Even BGP micro-flaps may cause jitter and out of sequence frames. All in all, you have your work cut out for you.

Summary

Prepare yourself to approach the monitoring and management of video streams from a holistic point of view. You have to look at it first from an end to end perspective. Also be prepared to fill in the blanks with other configuration and performance data.

Because of the visibility of video teleconferencing systems, be prepared to validate your environment on an ongoing basis. Things change constantly. Changes that are only remotely related, or outside your span of control, can affect the system's ability to perform well. Validate and test your QoS and performance from a holistic view on a recurring basis. A little recurring discipline will save you countless hours of heartbreak later.

Because of the situational aspects of video telepresence, be prepared to enable diagnosis in near real time. Specialized tools like Ineoquest can really empower you to understand what's going on and work through analysis, design changes, maintenance actions, and even vendor support to ensure you are on the road to an effective collaboration service.

Sometimes it pays to enlist the aid of an SME who is vendor independent. After all, in doing this blog post, I credit a significant amount of education and technical validation on the GOP, RTP, RTCP, and other elements of streaming to Stephen Bavington of Bavington Consulting. http://bavingtonconsulting.com/


Monday, June 25, 2012

Interview in Seattle - Perf Data in large environments

I recently interviewed at a rather large company in the Northwest, and during the course of the interview I discovered a few things.

If you're limited to SQL Server for performance management data, there are a couple of things to consider:
  • You're stuck with about a 25000-30000 insert per second maximum input rate.
  • While you are at this maximum, you're not getting any data out.
  • When you get to 100 million or so records, indexing and inserting slows down. Reporting slows down even more.
They had a 3 tier data model that kept 7 days of 2 or 5 minute data on the lowest tier database.
This data was rolled up to a 1 hour average and stored at the second data tier for a period of 90 days.
This second tier data was again rolled up to a 12 hour average and sent to the highest tier for a data retention of 1 year.

Some of the head scratchers I came away with were:


  • How do you do a graph that spans time periods?
  • If you're doing post mortems, how effective is any of the data in the second or third tier?
  • How useful is CPU utilization that has been averaged to 1 hour or 12 hours?
  • How do you effectively trend multiple elements across tiers without a significant amount of development just so you can provide a report?


So, what are you gonna do?

What kind of data is it you're dealing with?

When you look at performance related data, a SIGNIFICANT part of it is simply time series data. It has a unique identifier relating it to the managed object, attribute, and instance you are collecting data against, it has a timestamp of when it was collected, and then there's the value.

So, in a relational database, you would probably set this up as:

CREATE TABLE TSMETRICS (
    Metrickey   varchar(64),
    timestamp   datetime,
    value       integer
);

You would probably want to create an index on Metrickey so that you can more efficiently grab metric values from a given Metrickey.

When you consider that you're collecting 10 metrics every 5 minutes for 10,000 nodes, you start to realize that the number of records adds up quickly: 288 records per metric per day.
So 10 metrics every 5 minutes turns into 2,880 records per node per day, times 10,000 nodes, which equals 28,800,000 records per day. Before the end of the fourth day, you're crossing that 100 million record boundary.
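Here is that arithmetic spelled out in Python, using the same assumed collection profile (10 metrics, 5-minute polls, 10,000 nodes):

# Record growth arithmetic for the assumed profile: 10 metrics, 5-minute polls, 10,000 nodes.
polls_per_day    = 24 * 60 // 5          # 288 samples per metric per day
records_per_node = polls_per_day * 10    # 2,880 records per node per day
records_per_day  = records_per_node * 10_000

print(f"{records_per_day:,} records/day")                                     # 28,800,000
print(f"100M mark crossed after ~{100_000_000 / records_per_day:.1f} days")   # ~3.5 days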

Scaling

What if we changed TSMETRICS structure?  We could change it to:

CREATE TABLE TSMETRICS (
    Metrickey   varchar(64),
    starttime   datetime,
    endtime     datetime,
    slot1       integer,
    ...
    slot288     integer
);

This effectively flattens the table and eliminates the duplicated Metrickey strings, which saves a significant amount of repetitive record space. In effect, this is how round robin data stores hold metric data. But consider this: you either have to index each column, or you have to process a row at a time to do so efficiently.
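For illustration, here is how a collector might map a sample timestamp into one of the 288 slots of this flattened, RRD-style row. The 5-minute slots aligned to midnight are an assumption of this sketch.

# Mapping a sample timestamp to a slot in the flattened, RRD-style row (assumed 5-minute slots).
from datetime import datetime, timezone

SLOT_SECONDS = 300                       # 5-minute resolution -> 288 slots per day

def slot_for(ts: datetime) -> int:
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((ts - midnight).total_seconds()) // SLOT_SECONDS + 1   # slot1..slot288

print(slot_for(datetime(2012, 6, 25, 14, 37, tzinfo=timezone.utc)))   # -> 176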

This gets us into the 1000 days realm!  10000 nodes * 10 metrics each * 1000 days = 100000000 records.

But the problem expands because the inserts and extracts become much more complex. And you have to go back to work on indexing.

Sharding

Some folks resort to sharding: they take specific time ranges and move them onto their own table spaces. Sharding ends up being a HUGE nightmare. While it enables the DBA to control the table spaces and the number of records, getting data back out becomes another cryptic exercise of first finding the data, connecting to the appropriate database, and running your query. So the first query is to find the data you want, and subsequent queries go get that data. You're probably going to need a scratch database to aggregate the data from the multiple shards so that reporting tools can be more efficient.

Another technique folks employ is to use a data hierarchy.  Within the hierarchy, you keep high resolution data, say 7 days, in the lowest level. Roll the data up from a 5 minute interval to a 1 hour interval into a second data tier.  Then roll up the 1 hour data to 12 hour data in a third data tier.  I actually know of a large service provider that does exactly this sort of thing.

Imagine mining through the data and you need to look at the CPU of 5 systems over the course of a year.  How has the CPU load grown or declined over time? Now overlay the significant configuration changes and patch applications over that time period.  Now, overlay the availability of the 5 systems over that time.

All of a sudden, what looks like a simple reporting exercise becomes a major production issue. You have to get data from 3 separate databases, munge it together, and handle the graphing of elements where the X axis resolution is not uniform, and it becomes mission impossible.

Suggestions

If you're looking at moderate to heavy data spaces, consider the options you have.  Do not automatically assume that everything fits in an RDBMS space effectively.

The advantages of a Round Robin Data store are:


  • Space is preallocated
  • It is already aligned to time series
  • Relatively compact
  • Handles missing data


Other considerations: when you read an RRD type store, you store a copy in memory and read from that, so your data inserts do not get blocked.

There are certain disadvantages to RRD stores as well, including:


  • Concentration of RRD stores on given controllers can drive disk IO rather high.
  • Once you start down the road of RRD store distribution, how do you keep up with the when and where of your data in a transparent manner?
  • RRD doesn't fit the SQL paradigm.
If you need SQL, why not take a look at columnar databases?

Take a look see at Vertica or Calpont InfiniDB.

When you think about it, most time series data is ultra-simple. Yet when you do a graph or report, you are always comparing one time series element to another. A columnar database LIVES here because the DB engine aligns the data by column and not by row.

Another thought here is massive parallelism. If you can increase your I/O and processing power, you can overcome large data challenges. Go check out Greenplum. While it is based on PostgreSQL, it sets up as appliances based on VM instances. So you start out with a few servers, and as your data grows, you install another "appliance" and go. As you install more components, you add to the overall performance potential of the data warehouse.

If you can run without SQL, take a look at the big data and NoSQL options like Cassandra or Hadoop / MapReduce.

Links for you:


http://hadoop.apache.org/
http://hadoop.apache.org/mapreduce/

An interesting experiment:

Run RRD or JRobin stores on top of Hadoop and HDFS, and use MapReduce to index the RRD stores.

I wonder how that would scale compared to straight RRD stores, against an RDBMS, or a Columnar Database.





Monday, April 9, 2012

Product quality Dilemma

All too often, we have products that we have bought, put into production, and attempted to work through the shortcomings and obfuscated abnormalities prevalent in so many products. (I call these product "ISMs," and I use the term to describe specific product behaviors or personalities.) Over this long life cycle, changes, additions, deprecations, and behavior shifts accumulate. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.

All too often, developers tend to think of their creation in a way that may be significantly different from the deployed environments it goes into. It's easy to get stuck in microcosms and walled-off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.

With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... even Ishikawa diagrams!

All this is a distraction from what the end user really, really NEEDS. I'll take a shot at outlining some basic requirements pertinent to all performance management products.

1. Don't keep trying to collect on broken access mechanisms.

Many performance applications continue to collect, or attempt to collect, even when they haven't had any valid data in several hours or days. It's crazy, as all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?

SNMP authentication failures are a HUGE clue that your app is wasting resources on something simple. Listening for ICMP Source Quenches will tell you if you're hammering end devices.

2. Migrate away from mass produced reports in favor of providing information.

If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.

3. If you can't create new reports without code, it's too complicated.

All too often, products want to put glue code or even programming environments / IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business process oriented.

4. Data and indexes should be documented and manageable. If you have to BYODBA (Bring Your Own DBA), the software vendor hasn't done their homework.

How many times have we loaded up a big performance management application only to find out you have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?

And you end up having to dig through the logs to figure out what works and what doesn't.

If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?

In some instances, databases used by performance management applications are geared toward polling and collection rather than the reporting of information. In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation. For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and derive the average, root mean square, and standard deviation.
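A minimal Python sketch of that kind of dynamic threshold follows. The sample window and the k multiplier are arbitrary choices for illustration.

# Simple dynamic threshold: flag values outside mean +/- k standard deviations of a sample window.
import statistics

def dynamic_threshold(samples, k=2.0):
    mean  = statistics.mean(samples)
    stdev = statistics.pstdev(samples)          # population standard deviation of the window
    return mean - k * stdev, mean + k * stdev

window = [42, 45, 44, 47, 43, 46, 44, 45]
low, high = dynamic_threshold(window)
print(f"alert outside {low:.1f} .. {high:.1f}")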

If a reporting person has to do more than one join to get to their data elements,  your data needs to be better organized, normalized, and accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance / Capacity Engineers. They would rather work the data than work your system.

5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustworthy.

There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how it works, you're going to alienate a large portion of your operations base.

6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!

Secrets are BAD. Cisco publishes all of its bugs online specifically because it eliminates the perception that they are keeping something from the customer.

If one remembers Concord's eHealth Health Index...  In the earlier days, it was SECRET SQUIRREL SAUCE. Many an Engineer got a bad review or lost their job because of the arrogance of not publishing the elements that made up the Health Index.

7. Be prepared to handle BI types of access.  Bulk transfers, ODBC and Excel/Access replication, ETL tools access, etc.

If engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Provide ways of offering shared tables, timed transfers, transformations, and data dumps.

8. Reports are not just a graph on a splash page or a table of data. Reports, to Operations personnel, mean text and formatting around the graphs, charts, tables, and data that relate the operational aspects of the environment to the illustrations.

9. In many cases, you need to transfer data in a transformed state from one system that reports to another. Without ETL tools, your reporting solution kind of misses the mark.

Think about this... you have configuration data and you need it in a multitude of applications: Netcool, your CMDB, your operational data store, your discovery tools, your ticketing system, your performance management system. And it seems that every one of these data elements may be text, XML, databases of various forms and flavors, even HTML. How do you get it transformed from one place to another?

10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.

As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.

The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, many times the problem just continues until it's a major ordeal to correct. Reconciliation when there are a significant number of discrepancies can be rough.

Last but not least: if you develop an application and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION IS JUNK. It's dangerous, haphazard, incredibly short sighted, and should be avoided at all costs. Recently, I had a dear friend develop, test, and validate a performance management application upgrade. After a month in test and QA running many different use case validations, it was put into production. Upon placement into production, the application overrode the pared down configurations with defaults, polled EVERYTHING, and caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. People were terminated. And a lot of manpower was expended trying to fix the issues.

In no uncertain terms, I will never let my friends and customers be caught by this product.

Sunday, April 17, 2011

Data Analytics and Decision Support - Continued

Data Interpolation

In some performance management systems, the fact that you may have nulls or failed polls in your data series draws consternation. If you are missing data polls, the software doesn't seem to know what to do.

I think the important thing is to catch counter rollovers in instances where your data elements are counters. You want to catch a rollover in such a way that when you calculate the delta, you add up to the maximum counter value and then keep adding up to the new value. In effect, you get a good delta between the two values. What you do not want is tainted counters that span more than one rollover.
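A minimal Python sketch of a rollover-safe delta for a 32-bit counter, assuming at most one wrap between polls:

# Rollover-safe delta for a 32-bit SNMP counter (assumes at most one wrap between polls).
COUNTER32_MAX = 2**32

def counter_delta(previous, current, max_value=COUNTER32_MAX):
    if current >= previous:
        return current - previous
    # Counter wrapped: count up to the maximum, then up to the new value.
    return (max_value - previous) + current

print(counter_delta(4_294_960_000, 5_000))   # -> 12296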

If you take the delta across a missed poll, as in:

P1                   500
P2                   Missed
P3                   1000

For smoothing, you can take the difference between P3 and P1 and divide it in half to get a delta. Simply add this to P1 to produce P2, which yields a derived time series of:

P1                    500
P2                    750
P3                    1000

For gauges, you may have to sample the elements around the missed data element and use an average to smooth over it. Truth be known, I don't see why a graphing engine cannot connect from P1 to P3 without the interpolation! It would make graphs much simpler and richer: if you don't see a data point in the time slot, guess it didn't happen! In the smoothing scenario, you cannot tell where the smoothing is.
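Here is the smoothing described above as a small Python sketch. The series is hypothetical, and only gaps with good neighbors on both sides are filled.

# Linear interpolation of missed polls in a time series (None marks a failed poll).
def smooth(series):
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None and 0 < i < len(filled) - 1:
            prev, nxt = filled[i - 1], filled[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = (prev + nxt) / 2.0     # midpoint between the neighboring polls
    return filled

print(smooth([500, None, 1000]))    # -> [500, 750.0, 1000]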

Availability

In my own thinking, I view availability as either a discrete event in time or a state over a time period. Both have valid uses but need to be presented in the context in which they are borne so that users do not get confused. For example, an ICMP ping that occurs every 5 minutes is a discrete event in time. Here is an example time series of pings:

P1             Good
P2             Good
P3             Fail
P4             Good
P5             Fail
P6             Good
P7             Good
P8             Good
P9             Good
P10           Good

This time series denotes ping failures at the P3 and P5 intervals. In theory, a line graph is not appropriate here because the data is boolean and discrete, not representative of the entire time period. If P0 is equal to 1303088400 on the local poller and the sysUpTime value at P0 is 2500000, then the following SNMP polls yield:

Period        Poller uTime       SysUptime
P0             1303088400      2500000
P1             1303088700      2503000
P2             1303089000      2506000
P3             1303089300      2509000
P4             1303089600      2512000
P5             1303089900      20000
P6             1303090200      50000
P7             1303090500      80000
P8             1303090800      110000
P9             1303091100      5000
P10           1303091400      35000

utime is the number of seconds since January 1, 1970, and as such it increments every second. sysUpTime is the number of clock ticks since the management system reinitialized. When you look hard at the TimeTicks data type, it is a modulo counter of the number of 1/100ths of a second for a given epoch period. This integer rolls over at the value of 4294967296, or approximately every 497 days.


In looking for availability in the time series, if you derive the delta between the utime timestamps and multiply it by 100 (given that TimeTicks are 1/100ths of a second), you can see that whenever the sysUpTime value is less than the previous value plus the derived delta in timeticks, you clearly have an availability issue. Compared to the previous time series using ICMP ping, P3 turned out to be meaningless and only P5 reflected an actual availability discrepancy.


You can also derive from the errant time series period that P5 lost 10000 TimeTicks from the minimum delta value of 30000 (300 seconds * 100). So, for period P5 we were unavailable for 10000 timeticks, not the full period. Also note that if you completely miss a poll or don't get a response, the delta at the next successful poll can tell you whether availability was actually affected, even though the missed poll did not go as planned.


From a pure time series data kind of conversion, one could translate sysUpTime to an accumulated number of available seconds over the time period.  From a statistics perspective, this makes sense in that it becomes a count or integer.
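Pulling the pieces together, here is a Python sketch of deriving per-interval downtime from sysUpTime versus the poller's own clock, using the P4-to-P5 values from the table above. sysUpTime wrap handling is omitted for brevity, and the small tolerance value is an assumption.

# Deriving per-interval downtime from sysUpTime vs. the poller's own clock.
def downtime_ticks(prev_utime, curr_utime, prev_sysuptime, curr_sysuptime):
    expected = (curr_utime - prev_utime) * 100              # poll interval expressed in TimeTicks (1/100 s)
    if curr_sysuptime >= prev_sysuptime + expected - 100:   # small tolerance for clock skew
        return 0                                            # agent stayed up for the whole interval
    # Agent reinitialized: it has only been up for curr_sysuptime ticks of this interval.
    return max(expected - curr_sysuptime, 0)

# P4 -> P5 from the table above: 300-second interval, sysUpTime resets to 20000.
print(downtime_ticks(1303089600, 1303089900, 2512000, 20000))   # -> 10000 ticks (100 seconds)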

Discrete to Time Series Interpolation

Let's say you have an event management system, and within this system you receive a significant number of events. If you can categorize and count your events into instances per interval, you can convert them into a counter for a time period. For example, you received 4 CPU threshold events in 5 minutes for a given node.

In other instances, you may have to convert discrete events using more of a stateful approach. For example, you get a link down and, 10 seconds later, a link up. You have to translate this into non-availability of the link for 10 seconds of a 5 minute interval. What is really interesting about this is that you have very finite time stamps. When you do, you can use them to compare against other elements that may have changed around the same period of time. Kind of a cause and effect analysis.
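A small Python sketch of that conversion follows. The timestamps are hypothetical; it just measures how much of a down/up window overlaps a 5-minute bucket.

# Folding discrete link down/up events into per-interval unavailability seconds.
INTERVAL = 300   # 5-minute buckets

def unavailable_seconds(down_ts, up_ts, bucket_start):
    """Seconds of downtime that fall inside the bucket [bucket_start, bucket_start + INTERVAL)."""
    bucket_end = bucket_start + INTERVAL
    overlap = min(up_ts, bucket_end) - max(down_ts, bucket_start)
    return max(overlap, 0)

# Link down at t=1000, link up 10 seconds later, bucket starting at t=900.
print(unavailable_seconds(1000, 1010, 900))   # -> 10 seconds of the 300-second interval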

This is especially pertinent when you analyze data from devices, event data, and workflow and business related elements. For example, what if you did a comparison between HTTP response times to Internet network IO rates, and Apache click rates on your web servers? What if you threw in trouble tickets and Change orders over this same period?   What about query rates on your Oracle databases?

Now you can start relating real metrics to business metrics, and even cause and effect elements, because they are all expressed using a common denominator: time. While many management platforms concentrate on the purely technical, like I/O rates of router interfaces, that may not mean as much to a business person. What if your web servers seem to run best when I/O rates are between 10-15 percent? If that range is where the company performs best, I'd tune to that range. If you change your apps to increase efficiency, you can go back, look at these same metrics after you make the changes, and validate handily.

This gets even more interesting when you get Netflow data for the intervals as well.  But I'll cover that in a later post!

Sunday, April 10, 2011

Data Analytics and Decision Support

OK. The problem is that, given a significant amount of raw performance data, I want to be able to determine some indication, and potentially predictions, of whether the metrics are growing or declining and what will potentially happen beyond the data we currently have, given the existing trend.

Sounds pretty easy, doesn't it.


When you look at the overall problem, what you're trying to do is to take raw data and turn it into information. In effect, it looks like a Business Intelligence system.  Forrester Research defines Business Intelligence as

"Forrester often defines BI in one of two ways. Typically, we use the following broad definition: Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making. But when using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. Therefore, we also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. We define the narrower BI market as: A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery."


Here is the link to their glossary for reference.


When you look at this definition versus what you get in today's performance management systems, you see somewhat of a dichotomy. Most performance management systems are graph and report producing products. You get what you get, and so it's up to you to evolve toward a Decision Support System rather than a graphical report builder.


Many of the performance management applications lack the ability to produce and use derivative data sets.  They do produce graphs and reports on canned data sets. It is left up to the user to sort through the graphs to determine what is going on in the environment.


It should be noted that many products are hesitant to support extremely large data sets with OLTP oriented Relational Databases.  When the database table sizes go up, you start seeing degradation or slowness in the response of CRUD type transactions. Some vendors resort to batch loading.  Others use complex table space schemes. 


Part of the problem with row oriented databases is that they must conform to the ACID model. Within the ACID model, a transaction either commits or it doesn't. Also, you cannot read at the same time a write is occurring, or your consistency is suspect. In performance management systems, there are typically a lot more writes than there are reads. Also, as the data set gets bigger and bigger, the indexes have to grow as well.


Bottom line - If you want to build a Decision Support System around Enterprise Management and the corresponding Operational data, you have to have a database that can scale to the multiplication of data through new derivatives.


Consider this - a Terabyte of raw data may be able to produce more than 4 times that when performing analysis. You could end up with a significant number of derivative data sets and storing them in a database enables you to use them repeatedly. (Keep in mind, some OLAP databases are very good at data compression.)

Time - The Common Denominator

In most graphs, and in the data itself, the one common denominator of performance data is that it is intrinsically time series oriented. This provides a good foundation for comparisons. So now what we need to do is select samples in a context, or SICs. For example, we may need to derive a sample of an hour over the context of a month.

In instances where a time series is not directly implied, it can be derived. For example, the number of state changes for a given object can be counted for the time series interval. Another instance may be a derivative of the state for a given time interval, or the state when the interval occurred; in that case, you use whatever state the object is in at the time stamp.

Given that raw data may be collected at intervals much smaller than the sample period, we must look for a way to summarize the individual raw elements sufficiently to characterize the whole sample period. This gets one contemplating what would be interesting to look at. Probably the simplest way to characterize a counter is to sum all of the raw data values into a sample (simple addition). For example, if you retrieved counts on 5 minute intervals, you could simply add up the deltas over the course of the 1 hour sample into a derivative data element, producing a total count for the sample. In some instances, you could just look at the counter at the start and at the end, provided there are no resets or counter rollovers during the sample.

In other instances, you may need to take the deltas of the raw data for the sample and produce an average. This average is useful by itself in many cases. In others, you could go back through the raw data and compare each delta to the average to produce an offset, or deviation, from the average. What this offset really does is produce a metric that captures activity within the sample: the more movement up and down off the average, the greater the deviations will be.
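As a Python sketch of that summarization, here are twelve made-up 5-minute deltas rolled into one hourly sample, with a simple measure of activity within it:

# Roll twelve 5-minute deltas into an hourly sample and measure activity within it.
def summarize(deltas):
    total   = sum(deltas)
    average = total / len(deltas)
    offsets = [d - average for d in deltas]            # deviation of each raw delta from the average
    return {"sum": total, "avg": average, "max_offset": max(abs(o) for o in offsets)}

hour = [120, 130, 118, 500, 125, 122, 119, 121, 480, 117, 123, 126]
print(summarize(hour))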

Another thing you need to be aware of is data set tainting. For example, let's say you have a period within your context where data is not collected, not available, or suspect. This sample could be considered tainted, but in many cases it can still be used. It is sometimes important to compare data collection to different elements like availability, change management windows, counter resets, or collection "holes" in the data.

What these techniques provide is a tool bag of functions that you can use to characterize and compare time series data sets. What if you wanted to understand whether there is any cause and effect correlation between metrics on a given device? There are elements you may find that move alike, and others that move opposite each other; for example, when one metric increases, the other decreases. The characterization enables you to programmatically determine relationships between metrics.

The result of normalizing samples is a set of derivative data sets that can be used in different contexts and correlations. When you think about it, one should even be able to characterize and compare multiple SICs (Samples in Contexts). For example, one could compare day-in-month to week-in-month SICs.

Back to our samples in context paradigm. What we need samples for is to look at the contexts over time and look for growth or decline behaviors. The interesting part is the samples themselves, in that samples are what you need when building Bayesian Belief Networks. BBNs take a series of samples and produce a ratio of occurrences and anti-occurrences, translated as a simple +1 to -1.

When you look at Operations as a whole, activity and work is accomplished around business hours.  Weekdays become important as does shifts. So, it makes sense to align your Samples to Operational time domains.

The first exercise we can do is to take a complex SIC like weekday by week by month. For each weekday, we total up our metric. For each weekday-in-week-in-month sample, is the metric growing or declining? If you create a derivative data set that captures the percent difference from the previous value, you can start to visualize the change over the samples.

When you start down the road of building the data sets, you need to be able to visualize the data in different ways.  Graphs, charts, scatterplots, bubble charts, pie charts, and a plethora of other mechanisms provide ways you can see and use the data.  One of the most common tools that BI professionals use is a Spreadsheet. You can access the data sources via ODBC and use the tools and functions within the spreadsheet to work through the analysis.

Sunday, April 25, 2010

Performance Management Architecture

Performance Management systems in IT infrastructures do a few common things. These are:

  • Gather performance data
  • Enable processing of the data to produce:
      • Events and thresholds
      • New data and information
      • Baseline and average information
  • Present data through a UI or via scheduled reports
  • Provide for ad hoc and data mining exercises

Common themes for broken systems include:

  • If you have to redevelop your application to add new metrics
  • If you have more than one or two data access points
  • If data is not consistent
  • If reporting mechanisms have to be redeveloped for changes to occur
  • If a development staff owns access to the data
  • If a development staff controls what data gets gathered and stored
  • If multiple systems are in place and they overlap (significantly) in coverage
  • If you cannot graph any data newer than 5 minutes
  • If there's no such thing as a live graph, or the live graph is done via meta refresh

I dig SevOne. Easy to setup. Easy to use. Baselines. New graphs. New reports. And schedules. But they also do drill down from SNMP into IPFIX DIRECTLY. No popping out of one system and popping into another. SEAMLESSLY.

It took me 30 minutes or so to rack and stack the appliance. I went back to my desk, verified I could access the appliance, then called the SE. He set up a WebEx, and 7 minutes and a few odd seconds later I got my first reports. Quite a significant difference from the previous Proviso install, which took more than a single day.

The real deal is that with SevOne, your network engineers can get and set up the data collection they need, and the hosting engineers can follow suit. Need a new metric? Engineering sets it up. NO DEVELOPMENT EFFORT.

And it can be done today, not 3 months from now. When something like a performance management system cannot be used as part of near real time diagnostics and triage, it significantly detracts from usability in both the near real time and the longer term trending functions.