Tuesday, October 13, 2015

Getting ahead of the Customer Experience Perception Case

I have read several articles on how users are the most prevalent alarm in many environments, and how we should be looking at customer experience.

This is so true and apropos. If you are truly looking to provide customer service, the customer experience should be at the forefront of your service philosophy.

Why?

In the beginning,  help desk staff listened for a call.  They waited on the phone to ring.  In fact, in some (MANY!) environments, they still do.  They also use the Call routing information to determine if there is a problem in a specific area, neighborhood, or part of the infrastructure.  Wild huh?  If your NOC is using call statistics to do correlation, you are definitely managing alarms and alerts in the user perception space.

Even in many modern day Operations centers, staff operate in a mode of being purely reactive to incoming events.  Furthermore, in cases where the inputs overwhelm the staff's ability to discern real problems and prioritize them as time evolves, you see people who will wait for the loudest problem to surface with incoming calls.

If your layer 1 support is doing dispatch only, in almost all cases you are operating in a purely reactive environment.

If you only allow events to be presented that are predefined and actionable, you are simulating that phone call in software. Same thing, different media.

User Perception Window

It is the time interval from when the customer is affected to the time it gets too painful to go on without reporting the problem.
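
As a rough illustration, the window can be measured from timestamps. This is only a sketch: the function names and the incident timeline are made up, and it assumes you can pin down when the customer was first affected (often something you only learn in a post mortem).

```python
from datetime import datetime, timedelta

def perception_window(affected_at, reported_at):
    """Time from when the customer was affected to when they reported it."""
    return reported_at - affected_at

def got_ahead(detected_at, reported_at):
    """True if monitoring flagged the problem before the customer reported it."""
    return detected_at < reported_at

# Hypothetical incident timeline (illustrative values only).
affected = datetime(2015, 10, 13, 9, 0)
reported = datetime(2015, 10, 13, 9, 45)
detected = datetime(2015, 10, 13, 9, 50)  # the alert arrived after the call

print(perception_window(affected, reported))  # 0:45:00
print(got_ahead(detected, reported))          # False
```

If got_ahead() is False for most of your incidents, the phone is still your primary alarm.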

In some environments, this can be hours. Especially if the end user has not seen results from previous service outages.  Or they have had negative experiences in calling in problems.  Some will commence to doing their own troubleshooting like rebooting everything.  Some will merely wait, go take a break, or go to lunch in hopes that the problem heals or gets fixed.

When folks transition from in house support to an outsourced arrangement, one of the factors that is common is the need for better support.  More responsiveness.  Better up time. Better awareness.

In some instances, the time has become so critical, end users will introduce problems just to test and see how long it takes for the managed services provider to respond.   This results in a very short window and usually fares badly for the Service provider.

Negative perceptions by your customers have a negative effect on your Net Promoter Scores and can be the most prevalent cause of customer churn. They affect the effectiveness of the support organization.  And the ability to generate new revenue.

Architecture

Most management architectures are designed wrong for the ability to migrate towards a proactive management stance.   If you are waiting on Traps and syslog events, you are also waiting on the phone to ring.  While this is cheap and easy, it carries with it the consequences of always being after the fact, always post-cognitive.

And the problem is profoundly exacerbated by the introduction of agility in the enterprise.  The migration towards constant updates, infrastructure movement and redefinition, migration of applications across cloud platforms and containers, even off premise.

Consider this - changes in the environment can happen ANYWHERE in the Green, Red, or Yellow zones.  In effect, a change can lead up to an event horizon, cause other effects after the event horizon, or change the effects by changing in the middle of a problem.

If your architecture only looks at the red and yellow zones, you can never get AHEAD of the User Perception Window.  You can get a better handle on how you handle problems, identification and prioritization of problems, even building better workflow and run book processes.

How Do You Get There?

In many cases, architects and management have chosen the path of least resistance in hopes that enterprise management, as a technology, is a commodity. (Funny - This was a marketing ploy by wares vendors to circumvent having to compete!)

Interesting thing about getting ahead of the customer is that this is the hard part.  It is the part where you have to go through the data, the workflow, the results, and come up with solutions to designing and implementing around architectural and product shortcomings, improving the processes and automations, and building and putting in place more effective instrumentation.

I'd like to warn you up front - if you're not willing to commit to the challenge, it's better to admit that you will never get ahead of the customer experience perception window.  Maybe you can set expectations with customers. Maybe you can put some spin on it.

There are several, very important Continuous Process Improvement sorts of tasks that need to be undertaken.  These include:

1. Post Mortems.


What was the root cause of the problem?  Was there more than a single cause?
Did the organization mishandle the problem?
Were there things that could have made the problem correction better?
Are the runbooks and processes in order?
Has redundancy, DR, and HA been addressed properly?

A post mortem analysis is imperative to go through and analyze what happened and how the support organization responded.  You need the data to be able to benchmark how information was derived and how things were accomplished from start to finish.

2. Failure Analysis


In the course of time, periodically, you need to go through your tickets and look for hardware and software that has failed over the reporting period.  Look for patterns and inconsistencies in the products, services, and systems.

An important gauge is to come up with a way of providing a cost of maintenance per device / Device type / Application.  Analyze both Scheduled and unscheduled maintenance actions.   This gives you an EXCELLENT way of illuminating problem areas in a way that non-technical people understand - dollars and cents.  Doesn't have to be real but relative and relevant.
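
Here is a minimal sketch of that kind of roll-up. The ticket fields, the device types, and the hourly rate are all assumptions made up for illustration; as noted above, the dollars only need to be relative, not real.

```python
from collections import defaultdict

# Hypothetical ticket records: (device_type, scheduled, labor_hours, parts_cost)
TICKETS = [
    ("edge-router", False, 6.0, 1200.0),
    ("edge-router", True, 2.0, 0.0),
    ("access-switch", False, 1.5, 0.0),
    ("access-switch", False, 3.0, 250.0),
    ("ups", True, 1.0, 80.0),
]

HOURLY_RATE = 100.0  # assumed relative labor rate, not a real figure

def maintenance_cost_by_type(tickets, hourly_rate=HOURLY_RATE):
    """Sum scheduled and unscheduled maintenance cost per device type."""
    totals = defaultdict(lambda: {"scheduled": 0.0, "unscheduled": 0.0})
    for device_type, scheduled, hours, parts in tickets:
        bucket = "scheduled" if scheduled else "unscheduled"
        totals[device_type][bucket] += hours * hourly_rate + parts
    return dict(totals)

def rank_by_unscheduled(costs):
    """Device types ordered by unscheduled cost, worst offender first."""
    return sorted(costs, key=lambda t: costs[t]["unscheduled"], reverse=True)

costs = maintenance_cost_by_type(TICKETS)
print(rank_by_unscheduled(costs))  # ['edge-router', 'access-switch', 'ups']
```

Splitting scheduled from unscheduled work matters here: a device type with a high unscheduled total is the one generating surprises.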

Many Operations environments actually inflict a lot of pain on themselves by not doing failure analysis. You need to be ahead of the curve on equipment, systems, and software that fail more and more, take more time to maintain, and cause more downtime.

3. Instrumentation


In the course of getting ahead of the customer perception window, you have to advance the instrumentation to seek out and illuminate issues before user perception is realized. If you are not increasing the instrumentation to be more predictive, you will never be able to see what is happening before the event horizon.

With containerization and microarchitectures, you need to build in advanced monitoring capabilities.  In fact, this advanced precognitive monitoring needs to be an integral part of the microarchitecture.

If you are not fielding advanced correlation where you are ACTIVELY looking for pre-cognitive conditions - conditions that lead up to a potential failure, you will NEVER EVER get in front of the customer.  If you are still waiting on a trap, a syslog event, or even a timed threshold, you are tragically on the wrong side of the timeline.

You need to look at user transactions from the user perspective. A 3 second deviation, while not discernible to many, could yield huge insight into an oncoming disaster.
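
A simple way to picture this is a check of the latest transaction time against a baseline. This sketch assumes you already collect response times for a synthetic or real user transaction; the function name, the use of a median baseline, and the sample values are illustrative choices, not a prescribed method.

```python
from statistics import median

def deviates(history_s, latest_s, max_deviation_s=3.0):
    """True if the latest response time exceeds the baseline by the allowed deviation.

    The baseline is the median of recent response times, which is less
    sensitive to a single outlier than the mean.
    """
    baseline = median(history_s)
    return (latest_s - baseline) >= max_deviation_s

recent = [1.2, 1.1, 1.3, 1.2, 1.4]  # hypothetical response times in seconds

print(deviates(recent, 4.5))  # True
print(deviates(recent, 2.0))  # False
```

The point is that the alert fires on a change the user may not have noticed yet, rather than on a failure they are already calling about.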

What about IPFIX / Netflow data? What can you discern from this data to yield insights? Can you instrument the patterns into software to turn it into something to alert on?

4.  Adaptive Analytics


You need to be able to sample through the combinations of configurations and analyze event streams, instrumentation, and workflow data to look for predictive patterns that point to a customer experience potential problem BEFORE the event horizon occurs.

What things happened to illuminate a pending event horizon?

Can you discern loading conditions and thresholds from your analytics?

Linear regression? By time intervals?  Related or not related. Causal or not?
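
To make the linear regression question concrete, here is a small sketch that fits a straight line to samples taken at regular intervals and estimates how many intervals remain before a threshold is crossed. The function names and the sample data are made up, and a fitted trend says nothing about whether the relationship is causal; it is only a first-pass indicator.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def intervals_until(xs, ys, threshold):
    """Intervals after the last sample until the trend reaches the threshold.

    Returns None when the trend is flat or falling, since the fitted line
    would never reach the threshold.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - xs[-1])

# Hypothetical utilization samples (percent), one per time interval.
intervals = [0, 1, 2, 3, 4]
utilization = [50, 55, 60, 65, 70]

print(intervals_until(intervals, utilization, 90))  # 4.0
```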

Summary


While out of the box Enterprise Management applications say they are proactive, take a good look at where they function relative to the Customer perception window.  Could be, they are proactive only after the customer has already perceived the problem.

Until you instrument and threshold on things that are before the customer perception window,

YOU CANNOT GET AHEAD OF THE CUSTOMER EXPERIENCE.  

In the comments, I'd be interested in hearing how your product / service fits in the Customer Perception window.  Leave a comment!


Saturday, September 12, 2015

To Build Versus Buy

Over the course of time that Management systems have evolved - at least in my years of exposure - there seems to always be the question - Build vs. Buy.

I have been in 3 different scenarios that show different perspectives as to where these views come from.  These perspectives include:

  • Product Company
  • Integrator
  • End User

 Product Perspective


From a product perspective, many tend to believe that the product is close enough to the 80% rule that the question of build versus buy should never come up.  In fact, some believe strongly enough to count the end user out as eccentric or lacking. Or they are intimidated by the Integrator.

I always liked the possibilities I could bring about with Log Matrix NerveCenter.  And the functions I put into the menus of HP OpenView.

Some products painted themselves into a corner. Like Netcool OMNIbus.  When they aligned to Telco oriented standards, they mandated that problems outside of the telco realm had to adapt to the Telco standards.   For example, color coding of Severity. But then again, not everything has a real severity or conforms so much to event management.

Some product companies are skeptical and somewhat intimidated by premier Integrators.  Instead of listening to the requirements and approaches, then embracing change, they will shun it.

As an Integrator, I was avoided, patronized, and shunned.  A few select product companies embraced my approaches.

As an end user, product companies seemed a bit taken aback when I told them their product doesn't scale.  Or doesn't fit.  Sometimes to the point of having their lawyers call me (Kinda weird).  And I've seen Sales folks do all sorts of things in lieu of fixing the technical issues.  Like a visit to your VP.  Or a talk with your CEO.  You know you've been sandbagged when your VP comes in with a bunch of glossies and tells you to evaluate this or that.

Integrator


From an Integrator's perspective, products are viewed as something they can use to deliver a value add. While there are a lot of Integrators that prefer to just do installation and setup, the premier Integrators are always looking for products that create a difference.  That sets them apart from others.

From an Integrator perspective, I always strived to achieve success for both my team and my customer.  Sometimes, that takes a bit of work. And some thinking out of the box.

Sometimes, it was with a product and its capabilities.  Other times, I did my own code.

In the industry, there are a few folks out there that take products out of the box, put some code around them to integrate with other products or to add capabilities, and sell that and services around the extended capabilities and services.   Product companies don't always know how to leverage these folks or even consider them viable.

End User


From an end user perspective, some products just don't scale enough to make it.  Or they lack critical capabilities. And yes, price can be a factor.

I've been in places that could not use commercial products without significant work to make it work. For example, I know of a place that had 8 separate and distinct eHealth instances. And they ended up with 8 people supporting the product.

In instances where Build has evolved into a viable option, cost, capabilities, and scalability are the primary reasons.

In the immortal words of Larry Wall -- "There's more than one way to do it!".  Product companies don't have a stranglehold on innovation.  In many cases, quite the opposite is true. It's hard to do something different if the product does 50% or less of what the requirement is and closing the gap means refactoring and redeveloping code.  In fact, many developers consider that the product is done once coded.  Of the ones that have evolved, you see a lot of refactoring and reworking to achieve more capabilities and scale.

In Summary... EMBRACE the Innovator.


Product Companies - Use these Innovators to expand horizons, empower repeatable integrations, and drive solutions over tools.

Integrators - Want to get to the next level? Get you some Innovators and start productizing your Value Add.

End Users - Use Innovators to work through problems and get to solutions.  Tools aren't much if they don't fit your organization's form and function and its workflow.  Innovators do that.  Embrace innovations that enhance your business in meaningful and distinctive ways. Keep driving efficiency and customer satisfaction up.




Wednesday, August 8, 2012

Enterprise Management - What's Missing?

Enterprise Management has been a somewhat strange market over the years. The problem definitions are really tough but are rewritten to lessen the amount of development needed to achieve something that marketing can spin.  Some elements like correlation and root cause analysis have been dumbed down and respun in an effort for many to tout a "me too" to the uninitiated.

Other areas like performance management platforms have become huge report generators. After all, pretty graphs and charts sell. In some systems, it is amazing the amounts of graphs produced, 90% of which may never be looked at.

The other amazing part is that even though a product is evolved and bought out, there comes a point where design decisions of the past dictate what can and cannot be done going forward. And this holds true for not just the older products but some new ones as well.  It is incredible to see a product that, in its relatively young life cycle, has already made it impossible to extend the architecture further.

The emerging technologies that have come to light most in the past 3 years or so are provisioning, configuration management, and workflow automation. This is especially true in the Cloud realm as Cloud needs standard configuration items, a way of automating provisioning, and ways to pre-provision systems overall. After all, virtualization is the technology and Cloud is the process.

OK - What's Missing?

I know we've seen all of the sales pitches on performance management with charts and graphs. And we've been barraged by the constant pitch for business services management with its dashboards and service trees. And we've seen the event and alarm lists, the maps, and even ticketing.

Someone once said that "a fool with a tool is still a fool".  Effective Enterprise Management Architectures and implementations have NEVER been about tools. Tools are the technology and Workflow is the process. Tools do nothing for the bottom line of an Enterprise or Service Provider without someone using it. In fact, the tools that are not used usually become shelfware or junkware. In many cases, you can't even sell the stuff to some other fool looking for a tool.

One of the most interesting assessments of emerging technologies is the Gartner Hype Cycle. Here is the Wikipedia article on the Hype Cycle.


While it is not so scientific and not really a cycle per se, it is a way to describe the subjective nature of a technology and the phases it goes through. Interestingly, there are five different stages in the Gartner Hype Cycle. Following is an excerpt from Wikipedia describing the Hype cycles.

Five phases

Hype cycle for emerging technologies as of July, 2009
A hype cycle in Gartner's interpretation comprises five phases:
  1. "Technology Trigger" — The first phase of a hype cycle is the "technology trigger" or breakthrough, product launch or other event that generates significant press and interest.
  2. "Peak of Inflated Expectations" — In the next phase, a frenzy of publicity typically generates over-enthusiasm and unrealistic expectations. There may be some successful applications of a technology, but there are typically more failures.
  3. "Trough of Disillusionment" — Technologies enter the "trough of disillusionment" because they fail to meet expectations and quickly become unfashionable. Consequently, the press usually abandons the topic and the technology.
  4. "Slope of Enlightenment" — Although the press may have stopped covering the technology, some businesses continue through the "slope of enlightenment" and experiment to understand the benefits and practical application of the technology.
  5. "Plateau of Productivity" — A technology reaches the "plateau of productivity" as the benefits of it become widely demonstrated and accepted. The technology becomes increasingly stable and evolves in second and third generations. The final height of the plateau varies according to whether the technology is broadly applicable or benefits only a niche market.
The term is now used more broadly in the marketing of new technologies.


Now keep in mind, the hype cycle is very subjective. Dependent upon perspective, one person could place a given technology in a different category than you would. (This is why you enlist the expert analysis of the Gartner Analysts.)

Workflow Automation

You see a significant number of Workflow automation products that are just now coming into Service Providers and Enterprises. Part of this is being driven by requirements from cloud services. Cloud Services NEED to be automated. The services need to be automated and user driven as much as possible.


Did we ever get to be ITIL compliant? Or is there such a thing? Maybe for ticketing systems.  But do your organizations really, truly follow ITIL Incident and problem management to the letter? Can your organization?

ITIL defines an Incident as : 
--------------------------------
Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to or a reduction in, the quality of that service. The stated ITIL objective is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.
-------------------------------
Effective workflow automation and process optimization begins with a baseline of the process. If you cannot define the process or measure it, how will you improve it? In the spirit of getting ITIL Incident Management to be optimized and effective, one needs to implement effective business process management techniques.

But there are psychological and philosophical problems with how processes are approached and presented.   Some of these "issues" include:
  • The belief that documenting the process is to train someone else for your job.
  • The belief that documenting some technical processes is impossible as it takes into consideration judgement and subjective decisions.
  • The belief that documenting processes gets in the way of actual work.
A significant amount of this is Fear, Uncertainty, and Doubt or FUD.  But it can block or kill effective process optimization. You have to manage to and reward process optimization.

Knowledge Management

While you see segmented implementations of Knowledge Management by various products or as modules in ticketing systems, rarely do you see somewhat robust implementations beyond the very large Service Providers.

Knowledge Management takes knowledge, facts, and references, organizes that information, and makes it available in the various processes as a way of putting the knowledge to work. For example, an actionable event is presented to a level 1 support person. In the course of this work, knowledge can be presented at several different points, including:

  • Notification instructions to contact the customer
  • Escalation instructions to the appropriate Triage Team
  • Recent histories of the elements in question
  • Scheduled actions and maintenance actions
  • Recent Configuration changes.
  • Checks and validations process data.

Knowledge Management takes information and creates confidence and situation awareness in support organizations. The knowledge enables each and every user or support member to work with common technical terms regardless of level of expertise. Even simple knowledge management tasks, such as attaching links to vendor technical documentation to configuration records, enable users to reference and communicate better regarding diagnosis, configuration, or provisioning.

Collaboration


When you take a step back and look at every application on the market today in the Enterprise Management realm, the one thing that sticks out profoundly is the lack of collaboration capabilities and multi-user awareness. Not one single application has user awareness. For example, you open a network map. Who else is in that map? When you open a ticket, who else has that ticket open?

Applications don't really enable folks to work together. They are more geared toward individual actions. Then when that person is done, it is escalated to the next person. And when things move across products, it is even worse.
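
To show how small the missing piece is, here is a rough sketch of user awareness: a registry that answers "who else has this open?" when someone opens a ticket or a map. The class and method names are hypothetical and this is an in-memory toy, not how any particular product does it; a real implementation would need shared state across application instances and a way to expire stale sessions.

```python
from collections import defaultdict

class PresenceRegistry:
    """Tracks which users currently have a given resource open."""

    def __init__(self):
        self._viewers = defaultdict(set)

    def open(self, resource, user):
        """Register the user and return who else already has the resource open."""
        others = sorted(self._viewers[resource] - {user})
        self._viewers[resource].add(user)
        return others

    def close(self, resource, user):
        """Remove the user from the resource's viewer set, if present."""
        self._viewers[resource].discard(user)

    def viewers(self, resource):
        """Everyone who currently has the resource open."""
        return sorted(self._viewers.get(resource, set()))

registry = PresenceRegistry()
registry.open("ticket-1001", "tier1-alice")
print(registry.open("ticket-1001", "tier2-bob"))  # ['tier1-alice']
registry.close("ticket-1001", "tier1-alice")
print(registry.viewers("ticket-1001"))            # ['tier2-bob']
```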

Consider the customer who calls into a Service Provider with a problem. The first level takes down the ticket information, runs through some run book elements, then escalates the ticket to the next level of support. At this point, the customer is told they will get a call from the next tier. So, the customer hangs up and waits on the next call. In this scenario, who owns the problem? What if Tier 2 doesn't call back for a long period of time?

While this scenario is all too common, who is collaborating? The customer is stuck dealing with a bunch of individual contributors to get their problem resolved. There is no team. The only way the customer gets a team is to initiate an escalation to the Service Representative for the Account and crank up a conference call with all of the Managers.

Effective customer service and customer support is a TEAM effort that requires collaboration and socialization. It is not about fielding tickets. It is about taking care of the customer.

There are products on the horizon that will empower collaboration as a whole but still, a significant amount of work needs to be done on each individual product.

The Hype Cycle


The Gartner folks are much more precise and much better at accurately charting the Hype Cycle than I am. However, I like the way it depicts a product or technology lifecycle. I am merely overlaying my opinion on the Gartner definitions. While I know it's subjective and only my opinion, here's where I put these technologies in the Gartner Hype Cycle:

Workflow Automation – Between the “Peak of Inflated Expectations” and the “Trough of Disillusionment”.


There are a significant number of product offerings in the industry that do these functions. And getting more and more each day. While we do see new product offerings for the same sort of functionality, everyone seems to think they do it better. Yet many fall short, are too complex to maintain, or are too expensive.

The big players have purchased many of the initial workflow automation applications and next generation applications are in process. Additionally, BPM related systems are being enhanced to perform more of the workflow automation associated with Enterprise Management.

Knowledge Management -- “Technology Trigger”



While there are a couple of KM systems offerings, most of the implementations are released as product features or are developed by large service providers for in house use only. For example, there are KM functions in SCOM/SCCM. Additionally, there are KM modules for ticketing systems like Remedy.

What you do see is that larger Service Providers are investing heavily in KM systems. It is done for many reasons, chief among them is driving down the cost of support and leveraging knowledge across multiple customers. Makes sense as these large outsourcing and service providers are usually a training ground for entry level personnel. Implementing KM systems is an accelerator for learning in these environments.

Interesting links:

WiiKno - The company helps organizations put in place Knowledge Management systems and integration.
Microsoft Knowledge Management Architecture paper
BMC Knowledge Management




Collaboration – “Technology Trigger”




There are a couple of emerging technologies here that could very well redefine how IT organizations perform operations and support. You may very well see some older technologies (Like Ticketing) take a back seat to newer collaboration and socialization technologies.

Part of the problem is that ticketing systems rarely capture enough information to do detailed operations analysis and optimization. They are typically too burdensome for the user to capture real time notes, so much of that detail gets missed or is entered after the fact.

When problems are ongoing, do you get the full monty of what's going on? Or do you have to aggregate your own picture? How effective is your post mortem analysis data?  Do you see and recognize involvement by different support functions, business units, and the customers?


Interesting links:

MOOGSoft - Check out the guys behind the curtains.


Sunday, July 15, 2012

Thoughts on Telepresence Management

Many organizations have turned to Telepresence capabilities as a way to enable collaboration and teaming across widely separated geographical locations. When you think about the cost of getting managers, directors, and engineers together to address challenges and work through solutions, Telepresence systems have the potential to not only save money but also empower organizations to respond much faster.

Going further, the proliferation of video streams is becoming more common place every day. Not only are collaboration systems becoming more popular, many Service Providers are offering video based services to both corporate and consumer markets.

Video conferencing technology has been around for several years. It has evolved significantly as the codecs and MCUs have matured along with several standards. Additionally, the transport of these streams is becoming common on internet and private TCP/IP networks where, in the past, dedicated DSx or ISDN circuits were used.

Video traffic, by its very nature, places a series of constraints on the network due to the nature and behavior of the traffic. 

Bandwidth

 First there are bandwidth requirements. High motion digital MPEG video can present a significant amount of traffic to the network. 

QoS

Additionally, video traffic is very jitter sensitive and dynamics like packet reordering can wreak havoc in the stream of what you see on screen. In some cases, you may not even own portions of the network you traverse and the symptoms may only be present for a very short period of time.

Video Traffic

A lot of multimedia and streaming type traffic is facilitated by the pairing of RTP and RTCP channels. RTP usually runs on an even numbered port between 1024 and 65535 with the RTCP channel running on the next higher port number.   This pairing is called a tuple. Additionally, while RTP and RTCP are transport protocol independent, they typically run on a UDP transport as TCP tends to sacrifice timeliness in favor of reliability.
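
The pairing convention is easy to express in code, which is handy when you are matching flows in packet captures or flow records. This is a sketch of the convention described above, not a protocol requirement; the function names are made up, and some implementations negotiate other ports.

```python
def rtcp_port_for(rtp_port):
    """Return the conventional RTCP port for a given RTP port.

    Assumes the usual convention: RTP on an even port, RTCP on the
    next higher (odd) port.
    """
    if not 1024 <= rtp_port <= 65534:
        raise ValueError("RTP port outside the expected range")
    if rtp_port % 2 != 0:
        raise ValueError("RTP conventionally uses an even port")
    return rtp_port + 1

def is_conventional_pair(rtp_port, rtcp_port):
    """True if the two ports look like a conventional RTP/RTCP pair."""
    return rtp_port % 2 == 0 and rtcp_port == rtp_port + 1

print(rtcp_port_for(16384))                # 16385
print(is_conventional_pair(16384, 16385))  # True
print(is_conventional_pair(16385, 16386))  # False
```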

Multipoint communications is facilitated  by IP multicast.

In digitized speech passing across the network, with G.711 and G.729, the two most common VoIP encoding methods, you get two samples of voice per packet. The size of voice packets tends to be slightly larger than 220 bytes per packet and therefore the overall behavior is that voice is a constant bit stream constrained by the sample rates needed per interval. Because of sampling, the packets occur on a 20 msec interval.

Some video systems use ITU H.264 encoding, which is equivalent to MPEG-4 Part 10, at 30 frames per second. Each frame must be set up every 33 msec. But MPEG encoding per frame can have a variable number of bytes and packets depending upon the video content. MPEG video is accomplished using a reference frame, and subsequent data sections provide updates to the original frame in what is called a GOP or Group of Pictures.
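
The constant bit stream behavior falls straight out of the arithmetic. The sketch below uses the 20 msec interval and the roughly 220 byte packet size quoted above as inputs; the actual on-wire size depends on the codec and the layer 2 framing, so treat the numbers as illustrative.

```python
def packets_per_second(interval_ms):
    """Packet rate for a stream that sends one packet per interval."""
    return 1000 / interval_ms

def stream_bits_per_second(packet_bytes, interval_ms):
    """Constant bit rate implied by a fixed packet size and interval."""
    return packet_bytes * 8 * packets_per_second(interval_ms)

print(packets_per_second(20))           # 50.0
print(stream_bits_per_second(220, 20))  # 88000.0
```

So a single voice stream at those figures is about 50 packets and 88 kbps each way, every second, for the life of the call.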

A GOP starts off with an I frame which is the reference frame. A P frame represents a significant change in the initial I frame but continues to reference the I frame. B Frames update both I frames and P frames.
The types of frames and their location within a GOP can be defined in a temporal sequence. This temporal distance is the time or number of images between specific types of images in a digital video. “m” is the distance between successive P frames and “n” is the distance between I frames.

In contrast, serial digital television digitizes the entire screen and sends it every frame, 30 frames per second. With MPEG, the transmission requirements are significantly less than serial digital because only the changes to the I frame are sent, using the P and B frames.

Following is a diagram of a typical Group of Pictures and how the frames are sequenced.



Following is another illustration of a GOP along with data requirements per frame.


This figure shows the different types of frames that can compose a group of pictures (GOP). A GOP can be characterized as the depth of compressed predicted frames (m) as compared to the total number of frames (n). This example shows that a GOP starts with an intra-frame (I-frame) and that intra-frames typically require the largest number of bytes to represent the image (200 kB in this example). The depth m represents the number of frames that exist between the I-frames and P-frames.

When you compare the data sizes to the GOP diagram, you start to realize the breakdown of packets and packet streaming necessary to enable the GOP sequence to work in near real time. For example, the first I frame must pass 200 kB sequentially, which could turn out to be several hundred packets across the network. These packets are time and sequence sensitive. When things go wrong, you get video presentations that are frozen or pixelated.
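
To get a feel for the packet counts, here is a quick calculation. It assumes 200 kB means 200,000 bytes and that each packet carries a fixed amount of video payload; the payload sizes are assumptions, since the real figure depends on the MTU, the headers, and how the encoder packetizes the frame.

```python
import math

def packets_for_frame(frame_bytes, payload_bytes_per_packet):
    """Number of packets needed to carry one frame at a given payload size."""
    return math.ceil(frame_bytes / payload_bytes_per_packet)

I_FRAME_BYTES = 200_000  # the 200 kB I frame from the example above

print(packets_for_frame(I_FRAME_BYTES, 1400))  # 143
print(packets_for_frame(I_FRAME_BYTES, 500))   # 400
```

Either way, one reference frame is a long burst of packets that all have to arrive, in order, before the picture can be rebuilt.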


In Cisco TPS, latency is defined as the time it takes for audio and video to go from input on one end to presentation on the other. The system measures latency between two systems via time stamps in the RTP data as well as the RTCP return reports. It is only a measure of the network and does not take into consideration the codecs. The recommended latency target for Cisco TPS is 150 msec or less. However, this is not always possible. When latency exceeds 250 msec in a 10 second period, it generates an alarm, an on screen message, a syslog message and an SNMP Trap on the receiving system. The onscreen message is only displayed for 15 seconds and isn’t displayed again unless the session is restarted or stopped and reinitiated.

Packet Loss

Packet Loss can occur across the network for a variety of reasons. It could be layer 1 errors, Ethernet duplex mismatches, overrunning the queue depth in routers, and even induced by jitter. For Cisco CTS systems, the recommendation is a loss of less than .05 percent of traffic in each direction. When packet loss exceeds 1 percent averaged over a 10 second period, the system presents an on screen message, generates a syslog message and an SNMP Trap.

When packet loss exceeds 10 percent averaged over a 10 second period, an additional on screen message is generated, syslog messages are logged and an SNMP Trap is generated.

If the CTS system experiences packet loss greater than 10 percent averaged over a 60 second interval, it downgrades the quality of its outgoing video and puts up an onscreen message. Following is a table of key metrics.

Metric      | Target                                  | 1st Threshold                         | 2nd Threshold
Latency     | 150 ms                                  | 250 ms (2 seconds for Satellite mode) |
Jitter      | 10 ms packet jitter, 50 ms frame jitter | 125 ms of video frame jitter          | 165 ms of video frame jitter
Packet Loss | 0.05%                                   | 1%                                    | 10%
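
If you want to apply the same thresholds to data you collect yourself, the logic is simple enough to sketch. This is only an illustration: the metric names and the classify function are my own, not anything the CTS exposes, and the numbers are copied from the thresholds listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    target: float
    first: float
    second: Optional[float] = None  # the latency row lists no second threshold


# Values copied from the thresholds listed above.
# Units: milliseconds, milliseconds, percent. Key names are hypothetical.
CTS_THRESHOLDS = {
    "latency_ms": Thresholds(target=150, first=250),
    "frame_jitter_ms": Thresholds(target=50, first=125, second=165),
    "packet_loss_pct": Thresholds(target=0.05, first=1, second=10),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'above_target', 'first_threshold', or 'second_threshold'."""
    t = CTS_THRESHOLDS[metric]
    if t.second is not None and value > t.second:
        return "second_threshold"
    if value > t.first:
        return "first_threshold"
    if value > t.target:
        return "above_target"
    return "ok"
```

The "above_target" state is the interesting one for Operations: the system has not alarmed yet, but you are already outside the recommended range.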

If you are building KPIs around something like a Cisco CTS implementation, you need to look at areas where latency, jitter, and packet loss can occur. Sometimes, it is in your control and sometimes it is not. And there are a lot of different possibilities.

Some errors and performance conditions are persistent and some are intermittent and situationally specific. For example, a duplex mismatch on either end will result in persistent packet loss during TPS calls. An overloaded WAN router in the path, on the other hand, may drop packets only while it is under load during the call.

While the CTS system gives your Operations personnel awareness that jitter, latency, or packet loss is occurring, it does not tell you where the problem is. And if it causes the call to reset, you may not be able to discern where the problem is either. Once the traffic is gone, the problem may disappear with it.





When you look at the diagram, you see a Headquarters CTS system capable of connecting to both a Remote Campus CTS system and a Branch office equipped with a CTS system.  Multiple redundant network connections are provided between Headquarters and the Remote Campus, but only a single Metro Ethernet connection connects the Branch office to the Headquarters and Remote Campus facilities.

When you look at the potential instrumentation that could be applied, there are a lot of different points to sort out device by device. But diagnosis of something like this goes from end to end. The underlying performance and status data is really most useful when an engineer is drilling into the problem to look for root causes, capacity planning, or performing a post mortem on a specific problem.

Most Cisco TPS systems are high visibility as the users tend to be presented with the issues either by cue on screen or by performance during the call. It doesn’t take too many jerky or broken calls to render it a non-working technology in the minds of some. And these users tend to be high profile, like senior management and business unit decision makers.

So, What Do You Do?

First of all, Operations personnel need to know, proactively, when a CTS system is presenting errors and experiencing problems. On the elements you can monitor, set up and measure key performance metrics and thresholds, set up status change mechanisms, and threshold on misconfigured elements. Thresholds, traps, and events help you create situational awareness in your environment. This is a good start toward being able to recognize problems and conditions. Even creating awareness that a configuration change has been made to any component in the CTS system helps Operations stay aware of the elements in the service.

In the monitoring and management of the infrastructure, you are probably receiving SNMP Traps and Syslog from the various components in the environment.  This is a good start toward seeing failures and error conditions as they are sent as the events occur. However, you need performance data to be able to drill down into the data to discern and analyze conditions and situations that could affect streaming services.

But when you think about it, you actually need to start with a concept of performance and availability end to end, as a service. Having IO performance data on a specific router interface makes no sense until it's put into a service perspective.

End to End Strategy

First up, you need to look at an end to end monitoring strategy. And probably one of the most common steps to take is to employ something like Cisco IP SLA capabilities.

Cisco has a capability in many of its routers and switches that enables managing service levels using various measurements provided. There are a series of capabilities that can be utilized in your CTS Monitoring and management solution to make it more effective beyond the basic instrumentation.  Cisco recommends that shadow devices be used as the IP SLA tests will circumvent routing and queuing mechanisms and can have an effect if done on production devices. 

Not all devices support all tests. So, be sure to check with Cisco to ensure you can use a specific IOS version and platform for the given tests. Some service providers and enterprises use decommissioned components to reduce cost.

Basic Connectivity

One of the most commonly used Cisco IP SLA tests is the ICMP Echo test. The ICMP Echo test sends a ping from the shadow router doing the test to an end IP. In looking over figure 2, we would need to set up an ICMP Echo test from the shadow router on each side to the CTS system on the other side. This establishes a connectivity check in each direction.

This will also provide a latency metric but at the IP level. ICMP is the control mechanism for the IP protocol. There may be added latency at higher level protocols depending upon the elements interacting on those protocols. For example, firewalls that maintain state at the transport layer (TCP) may introduce additional latency. Traffic shapers may do so as well.

The collected data and status of the tests are in the CISCO-RTTMON-MIB.my MIB definition file. Pertinent data shows up in rttMonCtrlAdminEntry, rttMonCtrlOperEntry, rttMonStatsCaptureEntry, and rttMonStatsCollectEntry tables.


Jitter

We know that jitter can have a profound effect on video streams, as a GOP frame that arrives too late or out of sequence will be dropped by the receiving codec. It cannot go back in the video stream and redo a past GOP frame. We can also discern that at 30 frames a second, a GOP frame must be sent every 33 ms.

The jitter test is executed between a Source and a Destination and, consequently, needs a Responder to operate. It works by sending packets separated by an interval. Both source to destination and destination to source measurements are accommodated.

The statistics available provide the types of information you are looking to threshold and extract. The Cisco IP SLA ICMP based Jitter test supports the following statistics:
  • Jitter (Source to Destination and Destination to Source)
  • Latency (Source to Destination and Destination to Source)
  • Round Trip Latency
  • Packet Loss
  • Successive Packet Loss
  • Out of Sequence Packets (Source to Destination and Destination to Source)
  • Late Packets
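
To make these statistics a bit more concrete, here is a rough sketch of how loss, out of sequence counts, latency, and jitter can be derived from a burst of timestamped probes. This is not how IP SLA computes them internally; the function name and data layout are hypothetical, jitter is taken as the change in one-way delay between consecutive probes, and the one-way numbers assume the two clocks are synchronized.

```python
from statistics import mean


def jitter_stats(sent_ms, received):
    """Summarize one probe burst.

    sent_ms:  send time of each probe in ms, indexed by sequence number.
    received: list of (sequence_number, receive_time_ms) in arrival order;
              probes that never arrived are simply absent.
    """
    lost = len(sent_ms) - len({seq for seq, _ in received})

    # A packet is out of sequence if it arrives after a higher-numbered one.
    out_of_sequence = 0
    highest_seen = -1
    for seq, _ in received:
        if seq < highest_seen:
            out_of_sequence += 1
        else:
            highest_seen = seq

    # One-way delay per packet, then jitter as the change in delay
    # between consecutively numbered packets that both arrived.
    delay = {seq: rx - sent_ms[seq] for seq, rx in received}
    deltas = [abs(delay[s + 1] - delay[s]) for s in sorted(delay) if s + 1 in delay]

    return {
        "lost": lost,
        "out_of_sequence": out_of_sequence,
        "avg_latency_ms": mean(list(delay.values())) if delay else None,
        "avg_jitter_ms": mean(deltas) if deltas else None,
    }
```

For example, calling jitter_stats([0, 33, 66, 99, 132], [(0, 40), (1, 75), (3, 141), (2, 150)]) describes five probes where the last one never arrived and probe 2 showed up after probe 3.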
A couple of factors need to be considered in setting up jitter with regards to video streams. These are:
  • Jitter Interval
  • The number of packets
  • Test frequency
  • Dealing with Load Balancing and NAT
The jitter interval needs to be set to the GOP frame interval of 33 ms.

The number of packets needs to be set to something significant beyond single digits but below a value which would hammer the network. What you are looking for is just enough packets to understand you have an issue. I would probably recommend 20 at first even though the video packet numbers will be significantly higher. Not every network is created equal so you probably need to tune this. While the number of packets in a GOP frame can be highly variable, you just need to sample for the statistics that affect service.

Test frequency can have an effect on traffic. I would probably start with 5 minutes during active periods. But this too needs to be adjusted according to your environment.
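
It helps to run the numbers on these starting values before you commit to them. The quick calculation below uses the 30 frames a second, 20 packets, and 5 minute frequency from above; the probe size is purely a placeholder of mine, so plug in whatever your test actually sends.

```python
FRAMES_PER_SECOND = 30       # from the text
PACKETS_PER_TEST = 20        # starting point suggested in the text
TEST_FREQUENCY_S = 5 * 60    # one test every 5 minutes
PROBE_SIZE_BYTES = 200       # hypothetical probe size, not from the text

interval_ms = 1000 / FRAMES_PER_SECOND           # spacing between probes
burst_ms = PACKETS_PER_TEST * interval_ms        # how long one test transmits
bytes_per_test = PACKETS_PER_TEST * PROBE_SIZE_BYTES
avg_bps = bytes_per_test * 8 / TEST_FREQUENCY_S  # averaged over the test cycle

print(f"probe interval: {interval_ms:.1f} ms")
print(f"burst length:   {burst_ms:.0f} ms")
print(f"average load:   {avg_bps:.1f} bit/s")
```

The takeaway is that each test only watches the network for well under a second out of every five minutes, which is why the packet count and frequency are the knobs to tune for your environment.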

In the Jitter test setup, you can use either IP addresses or hostnames when you specify source and destination.

The statistical data is collected in the rttMonJitterStatsEntry table.


ICMP Path Echo

When connectivity issues occur or changes in latency occur, one needs to be able to diagnose the path from both ends. The Cisco IP SLA ICMP Path Echo test is an ICMP based traceroute function. This test can help diagnose not only path connectivity issues but also latency in the path and asymmetric paths.

Traceroute uses the TTL field of IP to “walk” through a network. As a new hop is discovered, the TTL is incremented and attempted again. As each hop is discovered, an ICMP Echo test is performed to measure the response time.
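
The TTL walk is easier to see in a toy simulation than in a packet capture. The sketch below does not send any real ICMP; the path is just a list of made-up hop names, and probe stands in for "send a packet with this TTL and see who answers."

```python
def probe(path, ttl):
    """Simulate one packet sent along `path` with the given TTL.

    The packet expires at hop number `ttl`, and that hop answers,
    unless the packet reaches the destination (the last entry) first.
    """
    index = min(ttl, len(path)) - 1
    return path[index]


def trace(path, max_ttl=30):
    """Discover the path by raising the TTL one hop at a time."""
    discovered = []
    for ttl in range(1, max_ttl + 1):
        discovered.append((ttl, probe(path, ttl)))
        if ttl >= len(path):  # destination answered, stop probing
            break
    return discovered


# Hypothetical path from a headquarters shadow router to a branch CTS.
for ttl, hop in trace(["hq-edge", "wan-core", "branch-edge", "branch-cts"]):
    print(ttl, hop)
```

In the real test, each discovered hop would then get its own ICMP Echo to measure response time, which is where the per-hop latency comes from.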

Data will be presented in the rttMonStatsCaptureEntry table.



Advanced Functions


In looking over what we’ve discussed so far, we have reviewed in-system diagnostic events, external events and conditions, and using IP SLA capabilities to test and evaluate, in a shadow mode, between two or more telepresence systems. But when things happen, they do so in real time. Given the many areas that could degrade services, in some instances it may be necessary to provide enhanced levels of service.

Additionally, multiple problems can present confusing and anomalous event patterns and symptoms exacerbating the service restoral process.  Of course, because you have high visibility, any confusion adds fuel to the fire.

Media Delivery Index

Ineoquest has derived a service metric for video stream services that provides a KPI for service delivery. The Media Delivery Index (MDI) can be used to monitor both the quality of a delivered video stream as well as to show system margin by providing an accurate measurement of jitter and delay. The MDI addresses a need to assure the health and quality of systems that are delivering ever higher numbers of streams simultaneously by providing a predictable, repeatable measurement rather than relying on subjective human observations. Use of the MDI further provides a network margin indication that warns system operators of impending operational issues with enough advance notice to allow corrective action before observed video is impaired.

The MDI is also presented in RFC 4445, jointly authored by Ineoquest and Cisco.
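
For the curious, the MDI is expressed as two numbers, a Delay Factor (DF) and a Media Loss Rate (MLR). Below is a simplified sketch based on my reading of the RFC 4445 definitions, not the Ineoquest implementation. The function names and input layout are my own, and a real probe would derive the expected packet count from sequence numbers or continuity counters rather than being handed it.

```python
def delay_factor_ms(arrivals, media_rate_bps):
    """Delay Factor for one measurement interval, in milliseconds.

    arrivals: list of (arrival_time_s, payload_bytes), with times relative
              to the start of the interval, in arrival order.
    media_rate_bps: nominal media rate in bits per second.
    """
    rate_bytes_per_s = media_rate_bps / 8
    received = 0
    levels = []
    for t, size in arrivals:
        pre = received - rate_bytes_per_s * t  # virtual buffer just before arrival
        received += size
        post = pre + size                      # virtual buffer just after arrival
        levels.extend((pre, post))
    return (max(levels) - min(levels)) / rate_bytes_per_s * 1000


def media_loss_rate(expected_packets, received_packets, interval_s):
    """Media Loss Rate: lost packets per second over the interval."""
    return (expected_packets - received_packets) / interval_s


def mdi(arrivals, media_rate_bps, expected_packets, interval_s=1.0):
    """Format the index the usual way, as DF:MLR."""
    df = delay_factor_ms(arrivals, media_rate_bps)
    mlr = media_loss_rate(expected_packets, len(arrivals), interval_s)
    return f"{df:.1f}:{mlr:.1f}"
```

A smoothly paced stream keeps the virtual buffer swing, and therefore the DF, small. The same packets arriving in one burst push the DF up well before anything is actually lost, which is the margin indication the MDI is meant to give you.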

Ineoquest uses Probes and software to analyze streams in real time, to validate the actual video stream as it occurs. In fact, they can be used in a portspan, a network tap arrangement, or they can actually JOIN a conference! Why is this necessary?

The Singulus and Geminus probes look at each GOP and the packets that make up that frame. They capture and provide metrics on any data loss, jitter, and Program Clock Reference (PCR). They are also able to analyze the data stream and determine, in real time, the quality of the video. You can create awareness for increased jitter even before video is affected by analyzing the jitter margins. This takes the subjectivity out of the measurement equation.

Additionally, you can record and play back video streams that are problematic. This is very important because in encoded video streams, the actual data patterns can be wildly different from one moment to the next depending upon the changes that occur in the GOP. For example, if the camera is displaying a white board, there are very few changes. However, if the camera is capturing 50 people in a room all moving, the GOP is going to be thick with updates to the I, P, and B frames.

For instance, a Provider Network could have a cbQoS setup that’s perfectly acceptable for a Whiteboard camera or even moderate traffic. It may start buffering packets when times get tough. The bandwidth may be oversubscribed and reprioritized in areas where you have no visibility. This can introduce jitter and even packet loss, only it's specific to the conditions at hand. And how do you PROVE that back to your Service Provider?

You can go back and replay the video streams during a maintenance window to validate the infrastructure. And test, diagnose, and validate specific network components with your network folks and service providers alike to deliver a validated service as a known functional capability.

And because you can replicate the captures, what a great way to test load and validate your end to end during maintenance windows!

The IQ Video Management System (iVMS) controls the probes, assembles the data into statistics, and provides the technical diagnostic front end to access the detailed information elements available. Additionally, when problems occur and thresholds are crossed, it forwards SNMP Traps, syslogs, and potentially email regarding the issues at hand.


Analyzing the Data

First and foremost, you have traps and syslog that tell you of impending conditions that are occurring in near real time.  For example, due to the transmission quality of the stream, the end presentation may downgrade from 1080i to 720i to save on bandwidth.

The IP SLA tests establish end to end connectivity and a portion of the path. In traceroute, you normally see only one interface per device as traceroute walks the path. It gives you only one interface for the device, not both the in and out interfaces. This is true because the Time to Live parameter won't be exceeded again until the next hop.

Keep in mind though that elements that can affect a video stream may not be in the direct path as a function.  For example, if a router in the path has a low memory condition caused by the buffering from a fast network to a low speed WAN Link, this may affect the router's ability to correctly handle streaming data under load.

What if path changes affect the order of packets? What if timing slips cause jitter on the circuit? Even BGP micro flaps may cause jitter and out of sequence frames. All in all, you have your work cut out for you.

Summary

Prepare yourself to approach the monitoring and management of video streams from a holistic point of view. You have to look at it first from an end to end perspective.  Also be prepared to fill in the blanks with other configuration and performance data.

Because of the visibility of video teleconferencing systems, be prepared to validate your environment on an ongoing basis. Things change constantly. Changes, even ones only remotely related or out of your span of control, can affect the system's ability to perform well. Validate and test your QoS and performance from a holistic view on a recurring basis. A little recurring discipline will save you countless hours of heartbreak later.

Because of the situational aspects of video telepresence, be prepared to enable diagnosis in near real time. Specialized tools like Ineoquest can really empower you to support and understand what's going on and work through analysis, design changes, maintenance actions, and even vendor support to ensure you are on the road to an effective collaboration service.

Sometimes, it pays to enlist the aid of an SME that is vendor independent. After all, in doing this Blog post, I credit a significant amount of education and technical validation on the GOP, RTP, RTCP, and other elements of streaming to Stephen Bavington of Bavington Consulting. http://bavingtonconsulting.com/


Wednesday, July 11, 2012

Management and Cloud Migration

So, you've been tasked with analyzing the systems you have and migrating them to the Cloud. Sounds like a bit of a challenge. In part, you know there are some systems that will take longer to get to the Cloud than others. After all, you have systems like databases, ERP applications, and even Enterprise, Fault, and Performance systems that are critical to your operation. In many cases, email and DNS are very critical to your IT service. A lot of decisions, some harder than others.

Step Back

 All too often you find some Manager head over heels trying to push everything to the Cloud right up front. They've been to a couple of seminars and lurked on a few webcasts. It is so easy to just make a decision and hope for the best thinking it must be better in the cloud because everyone else is doing it.

Even if you deploy your own virtualized environment, call it your private cloud, and start migrating things over wantonly, you are headed for more pitfalls than you think. From a Management perspective, there are differences in legacy management applications and techniques and more "service oriented" approaches. Both have their form and function but migrating to cloud without considering the ramifications of management coverage could leave you exposed and vulnerable. (Translated as CLMs - Career Limiting Moves!)

What does Cloud Give you?

Mobility.

First and foremost, it gives you the ability to move. Over the course of years, your process for putting new computing assets in place has grown and become more complex as it evolved. First, it's the engineering process. Then the acquisition process. Then development and prototyping. Then migration into production. Each and every process along the way has put in place more and more controls, check points, decision validations, and scheduling, all to get something into production and working for a customer.

If the IT department can't respond in short intervals, then why not hire a consultant to stand up a Web infrastructure in the cloud, upload our content, and take it down a month later when the need is diminished? In fact, you see a lot of this: the business needs Web assets for specific time periods and on short notice, and IT cannot even deliver a design in the time that the business unit needs the capability. A lot of business units will circumvent the IT department to make something happen. Only problem - When it breaks, who do they call?

Being able to streamline the processes that enable new services to be delivered quickly is a key benefit of cloud. You reserve your longer, legacy processes for developing the infrastructure and lining up the external cloud service contracts.

Standardization

Because the unit of deployment can be whatever you decide, you must work to develop standard packages that can be pre-packaged, tested, and validated ahead of time. This saves countless hours in engineering as the engineering is done once and replicated. Additionally, patch management becomes more effective as you test against standard packages.

Standardization also enables you as an IT Service provider, to provide a comprehensive catalog that business owners can use to satisfy their needs. Along with this, you can begin to standardize support costs more effectively. Once you standardize the platforms and applications, you can begin to really standardize your processes.

Think about your processes around monitoring and management and around support, care and feeding. Can you really start down the road of true ITIL Incident Management where corrective actions are defined?

Kill the Deadwood!

 Do you have elements / Silos in your environment that just get in the way of progress? Come on now, everyone does. Sometimes products / integrations just grow obsolete.

Sometimes a product becomes ingrained but the ability to support it or replace it is gone. At that point, people are afraid to touch it. It has become a boil on the rear of IT at that point. Nasty. No one will touch it. And it festers on.

Sometimes this happens because a person or group becomes the champion of the element and when you need something that the legacy product doesn't do, it means work. There's an ongoing fallacy that in house developed products are free. Nothing is free. Someone has to continue development. Someone has to support it. Someone has to document it. Someone has to train others on it.

Sometimes, an organization is just averse to change. This is much more prevalent than one might think. There are folks that just don't do well with shifting sands. They prefer the status quo as it is more comfortable to avoid change.

Changing to Cloud and Virtualization changes the way you monitor and manage things. Instead of managing by IP address or node, you must start with the service first. After all, a vMotion event can move your node to a completely different location, hardware, and even disk. And it happens very quickly.

The legacy monitoring apps you've been used to may not be able to function effectively going forward. You must be willing to analyze and be real about the legacy products or you're going to get stuck with something that becomes a ball and chain. Even if the product you have is free, a part of an ELA that doesn't cost anything, or you are living on maintenance only, if the product does not fit, you are wasting TIME AND MONEY.

Analyze and Define Your Needs

Requirements is where you start and where you validate.  If the solutions you put in place do not address the requirements, you need to analyze the risk of non-coverage and work through solutions. Wares vendors love to see valid requirements.  They get an opportunity to address the things they do on a point by point basis.

Because you are dealing with moving technology, keep the requirements open. If a Wares vendor has additional capability that your requirements do not address, do not throw away these capabilities in your evaluations. Be extremely wary of vendors who want to minimize your requirements. It is often a sign of a lack of capability and a lack of the ability to effectively compete in their functional area.

Be open to new things.  If you are deploying VDI capabilities, look at VDI Management solutions.   Some of these technologies are new enough that deployment without specialized management capabilities can leave you naked using legacy management technology. Consider this - SNMP support is waning in VMWare ESXi and they prefer you access their capabilities via their Web Services API.

Organize your Data Model

Going forward, you need to look at your sources of truth. Be it configuration, performance, situational, or workflow data, the elements need to be aligned and understood as a data model. What you do not want to end up with is multiple sources of the same data modeled in different ways, and not correlated. This will become painfully apparent when you start organizing a service catalog. How confusing would it be in your catalog to have an entry as Server - Redhat + MySQL + PHP + Apache and an adjacent entry that is LAMP + MySQL? Think about the differences in CPU Cores.
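
One way to catch this kind of duplication early is to reduce every catalog entry to a canonical set of components and compare the sets. The sketch below is only an illustration: the alias table and the components function are hypothetical, and treating Redhat as simply "linux" is an assumption about how coarse you want the catalog to be.

```python
# Hypothetical alias table: how shorthand names expand into components.
ALIASES = {
    "lamp": {"linux", "apache", "mysql", "php"},
    "redhat": {"linux"},  # assumption: the catalog tracks OS family, not distribution
}


def components(entry):
    """Reduce a catalog entry to a canonical set of component names."""
    # Drop a leading category label such as "Server - ".
    _, _, stack = entry.rpartition(" - ")
    result = set()
    for name in stack.split("+"):
        key = name.strip().lower()
        result |= ALIASES.get(key, {key})
    return frozenset(result)


a = components("Server - Redhat + MySQL + PHP + Apache")
b = components("LAMP + MySQL")
print(a == b)  # the two entries describe the same stack
```

Once entries are canonical, duplicates fall out of a simple set comparison instead of somebody noticing them in a spreadsheet.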

Validate your Security Posture

Work diligently to identify and understand the risks of your systems, applications, and data as things migrate to a cloud or virtualized model. Consider where your data is, what its importance is, and what accesses you have to it. Moving to the Cloud or Virtualization may mean your data may move to a SAN.

Network access protection changes a bit in Cloud. One may be limited in what can be done to protect access. And with vMotion this becomes even more complex.

Integration

Use the move to Cloud and Virtualization to stress integration. You have requirements that may be addressed by multiple solutions. However, one must work to make things seamless. Because of the speed of deployment and deactivation, you need to work through how information and integration should work across products and data elements.

When a service is provisioned, what happens next? Does it get put in your Service Desk solution? Is monitoring activated? Is billing activated? Customer notified? Work through defining and addressing the workflow and integration up front. Even if you set up a basic process, you can work to make it more efficient over time.

Cleanup your Applications

Many applications you have may run inconsistently in constrained environments. What happens when you introduce latency into your application? What about running on constrained hardware? (1 CPU core and 1 GB of memory when the Java JVM takes 1.1 GB of memory itself.) Some applications are terrible communicators across the network but network technology hides it. For example, open a Word document on a network share and watch it with a protocol analyzer. Now hit the page down key in your document and watch what happens. It downloads the entire document again and fseeks to the next page!

Cloud and virtualized environments introduce new challenges. Combinations of applications may affect each other when presented on the same blade. Oversubscription can wreak havoc if you put two dominant applications together.

Invest in some APM applications, a protocol analyzer, and some data gathering and reporting tools. I like to use both Class Loader based APM systems and network traffic capture based APM systems, as you get to analyze applications from multiple aspects.

Summary

Moving to the Cloud and toward virtualization does not have to be scary. But neither should you approach it blindly. Identify and understand your requirements and work through ensuring that whatever you implement, you understand what you are getting, what the risks are related to security, data, and availability, and what it's going to take to implement and support.

Moving to the Cloud is also an opportunity to clean up your legacy systems, applications, processes, and organization. Get rid of the functions that don't work. Work to eliminate all of the red tape, OSI Layer 8 infrastructures, and inefficiencies that have been dragging down your IT Department's ability to provide good service.  If you migrate broken functions to the cloud, these will continue to break things, only much faster.

Sunday, July 1, 2012

ENSM Products are Commodities?

On your quest to put in network and systems management capabilities, you have to figure in several explicit and implicit factors related to your end goals.  What I mean is that while it's easy to go to your Framework Vendor of choice, break out the Bill of Materials spreadsheet, and sit down with the Sales person and go through the elements you would need for your environment, it may be filled with hidden challenges. And some challenges may be harder to overcome than others once you have signed the check.

Don't forget, these products don't magically install and run themselves. They take care and feeding. Some more than others. And the more complex it is, the more complex it is to figure out when something goes awry.

Sounds so easy! After all, all of these products are commodities. And buying from a single vendor gives you a single point of support... and blame. In effect, a single "throat to choke".  NOTHING could be further from the truth!

Most of the big vendors' product frameworks are aggregations and conglomerations of products that have been acquired, some overlapping, into what looks like a somewhat unified solution.  In many cases, it is only after you buy the product framework that you discover stuff like there are different portals with different products and these portals don't effectively integrate together. Or you may find the north bound interface of one product is a kludge to somewhat loosely fit the two products together. Or two products use competing Java versions.

Some vendors' product suites have become more and more complex as new releases are GAed.   In many cases, these new levels of complexity have a profound impact on your ability to install, administer, or diagnose issues as they arise.

First up - Where are your requirements? Do you know the numbers and types of elements in your environment? What about the applications? How do these apply to Service Level Agreements? Do you have varying levels of maintenance and support for the components in your environment?

Do you know who the users will be?  Have you defined your support model? Which groups need access to what elements of information?  Do you have or have you prepared a proposed workflow of how users, managers, and even customers are going to interact with the new capabilities?

Who is going to take care of the management systems and applications? Have you aligned your organization to be successful in deployment? Do you have the skill sets? Do you have adequate skills coverage?

Have you defined the event flow? What about performance reporting needs and distributions?  And ad hoc reporting needs? Have you defined any baseline thresholds?

Do you have SNMP access? What about ICMP? SSH? Have you considered the implications of management traffic across your security zones?

Product Choices


While there are a plethora of choices available to you, many do not want to go through the hassle of doing due diligence.  But be forewarned, failure to do due diligence can wreak mayhem in your environment.  I know, the big guns say that "our product works at your competitors" but does it really? You don't know?  And is your competition that undifferentiated from you? (May not be a good thing!)

When you go through product selection, you need to realize the support needed to administer the new management applications.  Do you need specialists just to install it? What about training? Are you going to need other resources like Business Intelligence Analysts, Web Developers, Database Administrators, Script Developers, or even additional Analysts or Engineers?

Here are some signs you may experience:

 If the product takes longer than a couple of days to install and integrate,  here's your sign.

If two or more products in your big vendor product suite need a significant amount of customization to work together, here's your sign.

 If the installation document for the product deviates from the actual installation, here's your sign.

  If you find out you actually have to install additional product as discovered during the installation, here's your sign.

  If you end up realizing that the recommended hardware specs are either overkill or under-specced, here's your sign.

 If you end up having to deal with libraries and utilities that are not included or resolved with the product installation, here's your sign.

Missed it by THAT much!
 If you find yourself opening up support tickets in the middle of the installation, here's your sign.

 If you find that the product breaks your security model AFTER you do the installation, here's your sign.

   If it takes Vendor specific Engineering to install the product, here's your sign.

   If you cannot see value in the first day after the installation of a product, here's your sign.

   If you find that you need to restructure and build out your support team AFTER the installation, here's your sign.

Systems Management


Systems Management brings whole new challenges to your environment.  Some of the things you need to evaluate up front are:


  • Agent deployment - Level of Difficulty - OS Coverage - consistent data across agents. Manual, Automatic, or distributable
  • Agent-less - Browser specific? Adequate coverage?  Full transactions? Handles redirection?
  • Agent run time - Resource utilization - memory footprint - stability - Security.
  • Data collection - Pull or push model? Resiliency? Effect on run time resources?
  • External Restrictions - Java versions? Perl versions? Python versions?
  • Adequate application coverage?
  • Thresholds - Level of difficulty? Binary only or degrees of utilization/capacity/performance? Stateful? Dynamic thresholds? Northbound traps already defined or do you have to do your own?

Summary


Enterprise Management does not have to be that difficult.   There are products out there that work very well for what they do and are easy to deploy and maintain. For example, go do an OpenNMS installation.  Even though OpenNMS runs on just about any platform (a testament to their developer community and product maturity), you go to their wiki page http://www.opennms.org/documentation/installguide.html , pick out your platform of choice, and follow the procedure.  Most of the time, you are looking at maybe an hour. In an hour, you're starting discovery and picking up inventory to monitor and manage.

Solarwinds isn't too bad either. Nice, clean install on Windows.

Splunk is awesome and up and running in no time. http://www.splunk.com/

Hyperic HQ wasn't a bad installation either. Pretty simple. However, it is time sensitive on the agents.  Kind of thick (I think it's the Struts), Java wise. http://www.hyperic.com/

eGInnovations is cake.  One agent everywhere for OS and applications. Handles VMWare, Xen and others. And the UI is straightforward. A ton of value across both system and application monitoring and performance.   http://www.eginnovations.com/

Appliance based solutions take a bit more time in the planning phase up front but take the sting out of installation.  Some of these include:

http://www.sevone.com/ (SevOne does offer a software download for evaluation)
http://www.sciencelogic.com/
http://www.loglogic.com/ (They also offer a virtual appliance download)

One solution I dig is Tavve ZoneRanger for solving those access issues like UDP/SNMP across firewalls, SSH access across a firewall, etc., without having to run through proxies upon proxies, while still maintaining consistent auditing and logging. It deploys as an appliance or virtual appliance. http://www.tavve.com/

Another option you may consider is hosted applications. ServiceNow is easy to deploy because it is a hosted solution. http://www.servicenow.com/

Monday, June 25, 2012

Interview in Seattle - Perf Data in large environments

I recently interviewed at a rather large company in the Northwest and, during the course of the interview, I discovered a few things.

If you're limited to SQL Server for performance management data, there are a couple of things to consider:
  • You're stuck with a maximum input rate of about 25,000-30,000 inserts per second.
  • While you are at this maximum, you're not getting any data out.
  • When you get to 100 million or so records, indexing and inserting slows down. Reporting slows down even more.
They had a 3 tier data model that kept 7 days of 2 or 5 minute data on the lowest tier database.
This data was rolled up to a 1 hour average and stored at the second data tier for a period of 90 days.
This second tier data was again rolled up to a 12 hour average and sent to the highest tier for a data retention of 1 year.

Some of the head scratchers I came away with were:


  • How do you do a graph that spans time periods?
  • If you're doing post mortems, how effective is any of the data in the second or third tier?
  • How useful is CPU utilization that has been averaged to 1 hour or 12 hours?
  • How do you effectively trend multiple elements across tiers without a significant amount of development just so you can provide a report?
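The question about averaged CPU utilization can be made concrete with a small sketch. The sample values below are invented for illustration, and `roll_up` is a hypothetical helper, not anything the company in question used. It shows how a 10 minute spike to 100% all but disappears once 5 minute samples are rolled up into a 1 hour average.

```python
from statistics import mean


def roll_up(samples, factor):
    """Average each consecutive group of `factor` samples into one value."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# One hour of 5 minute CPU samples (percent); invented data with a
# 10 minute spike to 100% in the middle of an otherwise quiet hour.
five_minute_cpu = [20, 20, 20, 100, 100, 20, 20, 20, 20, 20, 20, 20]

# 12 five-minute samples make up one hour.
hourly = roll_up(five_minute_cpu, 12)

print("peak in 5 minute data:", max(five_minute_cpu))
print("1 hour average:", round(hourly[0], 1))
```

The peak is 100 in the raw data and roughly 33 in the hourly tier, so a post mortem run against the second or third tier would never see the spike.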


So, what are you gonna do?

What kind of data is it you're dealing with?

When you look at performance related data, a SIGNIFICANT part of it is simply time series data. It has a unique identifier relating it to the managed object, attribute, and instance you are collecting data against, and it has a timestamp of when it was collected. And then there's the value.

So, in a relational database, you would probably set this up as:

CREATE TABLE TSMETRICS (
Metrickey           varchar(64),   -- managed object / attribute / instance
timestamp           datetime,      -- when the sample was collected
value               integer        -- the collected value
);

You would probably want to create an index on Metrickey so that you can more efficiently grab metric values from a given Metrickey.
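As a rough, runnable sketch of that layout, the snippet below uses Python's built-in `sqlite3` module as a stand-in for SQL Server. The index name, the metric keys, the `ts` column (epoch seconds instead of a datetime), and the sample data are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not SQL Server
conn.execute(
    "CREATE TABLE tsmetrics (metrickey TEXT, ts INTEGER, value INTEGER)"
)
# Index on the metric key so a single series can be pulled without a full scan.
conn.execute("CREATE INDEX idx_tsmetrics_key ON tsmetrics (metrickey)")

# One day of 5 minute samples (288 rows each) for two hypothetical metrics.
rows = [("node1.cpu.0", 300 * i, 10 + i) for i in range(288)]
rows += [("node2.cpu.0", 300 * i, 50) for i in range(288)]
conn.executemany("INSERT INTO tsmetrics VALUES (?, ?, ?)", rows)


def fetch_series(conn, key, start, end):
    """Return (ts, value) pairs for one metric key in [start, end)."""
    cur = conn.execute(
        "SELECT ts, value FROM tsmetrics "
        "WHERE metrickey = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (key, start, end),
    )
    return cur.fetchall()


first_hour = fetch_series(conn, "node1.cpu.0", 0, 3600)
print(len(first_hour), "samples in the first hour")
```

Note that every row repeats the full metric key, which is the repetition the later flattened layout tries to avoid.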

When you consider you're collecting 10 metrics every 5 minutes for 10,000 nodes, you start to realize that the number of records adds up quickly. Each metric produces 288 records per day.
So 10 metrics every 5 minutes turns into 2,880 records per node per day. Times 10,000 nodes, that equals 28,800,000 records per day. By the end of day 4, you have crossed that 100 million record boundary.
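The arithmetic above is easy to rerun for other environments. This is a minimal sketch, with the 100 million row figure taken from the observation earlier in the post; the function name is just for illustration.

```python
def records_per_day(nodes, metrics_per_node, interval_minutes=5):
    """Rows written per day with one row per metric per sample."""
    samples_per_day = 24 * 60 // interval_minutes  # 288 at 5 minutes
    return nodes * metrics_per_node * samples_per_day


ROW_LIMIT = 100_000_000  # point where indexing and inserts reportedly slow down

per_day = records_per_day(nodes=10_000, metrics_per_node=10)
days_to_limit = -(-ROW_LIMIT // per_day)  # ceiling division

print(f"{per_day:,} records per day")
print(f"limit crossed during day {days_to_limit}")
```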

Scaling

What if we changed TSMETRICS structure?  We could change it to:

CREATE TABLE TSMETRICS (
Metrickey             varchar(64),   -- one row per metric per day
starttime             datetime,      -- start of the day covered by the row
endtime               datetime,      -- end of the day covered by the row
slot1                 integer,       -- first 5 minute sample of the day
...
slot288               integer        -- last 5 minute sample of the day
);

This effectively flattens the table and removes the repeated Metrickey string, which saves a significant amount of record space. In effect, this is how Round Robin Data stores store metric data. But consider this: you either have to index each column or you have to process a row at a time to work with the data efficiently.

This gets us into the 1,000 days realm! With one row per metric per day, 10,000 nodes * 10 metrics each * 1,000 days = 100,000,000 records.

But the problem expands because the inserts and extracts become much more complex. And you have to go back to work on indexing.
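To see why inserts get more involved with the wide layout, here is a sketch of the bookkeeping each sample now needs: work out which day row and which slot column it belongs to, then update that one cell. The dictionary stands in for the table, and the function names are hypothetical.

```python
from datetime import datetime

SLOT_MINUTES = 5
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 288 slots, slot1 .. slot288


def slot_for(ts):
    """Return (start of day, 1-based slot number) for a sample timestamp."""
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes_into_day = ts.hour * 60 + ts.minute
    return day_start, minutes_into_day // SLOT_MINUTES + 1


def write_sample(rows, key, ts, value):
    """Find or create the (key, day) row, then fill the matching slot."""
    day_start, slot = slot_for(ts)
    row = rows.setdefault((key, day_start), [None] * SLOTS_PER_DAY)
    row[slot - 1] = value


rows = {}
write_sample(rows, "node1.cpu.0", datetime(2012, 6, 25, 12, 7), 42)
```

A plain insert has become a lookup plus an update of one column out of 288, and reading a time range means unpacking slots again on the way out.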

Sharding

Some folks resort to sharding. What they will do is take specific time ranges and move them onto their own table space. Sharding ends up being a HUGE nightmare. While it enables the DBA to control the table spaces and number of records, getting data back out becomes a cryptic exercise in first finding the data, connecting to the appropriate database, and running your query. So, the first query is to find the data you want. Subsequent queries are used to go get that data. You're probably going to need to create a scratch database to aggregate the data from the multiple shards so that reporting tools can be more efficient.
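The "first query" step can be sketched as a catalog lookup. The catalog contents and database names below are made up for illustration; a real deployment would keep this mapping in a table of its own.

```python
from datetime import date

# Hypothetical shard catalog: (first day, last day, database holding that range).
SHARD_CATALOG = [
    (date(2012, 1, 1), date(2012, 3, 31), "perfdb_2012q1"),
    (date(2012, 4, 1), date(2012, 6, 30), "perfdb_2012q2"),
    (date(2012, 7, 1), date(2012, 9, 30), "perfdb_2012q3"),
]


def shards_for(start, end):
    """Return every shard whose date range overlaps [start, end]."""
    return [
        name
        for first, last, name in SHARD_CATALOG
        if first <= end and last >= start
    ]


# A one month report that straddles a quarter boundary touches two databases.
needed = shards_for(date(2012, 3, 15), date(2012, 4, 15))
print(needed)
```

Every shard returned here means another connection, another query, and another result set to stitch together before the reporting tool sees anything.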

Another technique folks employ is to use a data hierarchy.  Within the hierarchy, you keep high resolution data, say 7 days, in the lowest level. Roll the data up from a 5 minute interval to a 1 hour interval into a second data tier.  Then roll up the 1 hour data to 12 hour data in a third data tier.  I actually know of a large service provider that does exactly this sort of thing.

Imagine mining through the data and you need to look at the CPU of 5 systems over the course of a year.  How has the CPU load grown or declined over time? Now overlay the significant configuration changes and patch applications over that time period.  Now, overlay the availability of the 5 systems over that time.

All of a sudden, what looks like a simple reporting exercise becomes a major production issue. You have to get data from 3 separate databases, munge it together, handle the graphing of elements where the X axis is not linear, and it becomes mission impossible.

Suggestions

If you're looking at moderate to heavy data spaces, consider the options you have.  Do not automatically assume that everything fits in an RDBMS space effectively.

The advantages of a Round Robin Data store are:


  • Space is preallocated
  • It is already aligned to time series
  • Relatively compact
  • Handles missing data


Another consideration is that when you read an RRD type store, you load a copy into memory and read from that. Your data inserts do not get blocked.

RRD stores have certain disadvantages as well, including:
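The properties listed above can be illustrated with a toy ring buffer. This is a deliberately simplified sketch and not how RRDtool or JRobin are implemented (real stores add consolidation, heartbeats, and multiple archives); the class and method names are invented.

```python
class RoundRobinStore:
    """Toy fixed-size time series store: preallocated, time aligned, gap aware."""

    def __init__(self, slots, step):
        self.step = step                # seconds per slot
        self.values = [None] * slots    # space is preallocated up front
        self.last_slot_time = None

    def update(self, timestamp, value):
        slot_time = timestamp - timestamp % self.step  # align to the step
        if self.last_slot_time is not None:
            if slot_time < self.last_slot_time:
                raise ValueError("updates must move forward in time")
            # Mark any slots skipped since the last update as missing.
            missed = (slot_time - self.last_slot_time) // self.step - 1
            base = self.last_slot_time // self.step
            for i in range(1, min(missed, len(self.values)) + 1):
                self.values[(base + i) % len(self.values)] = None
        self.values[(slot_time // self.step) % len(self.values)] = value
        self.last_slot_time = slot_time

    def snapshot(self):
        """Readers work on a copy, so they never block the writer."""
        return list(self.values)


store = RoundRobinStore(slots=4, step=300)
store.update(0, 1)
store.update(300, 2)
store.update(900, 4)     # the sample for t=600 never arrived
copy_before_wrap = store.snapshot()
store.update(1200, 5)    # wraps around and overwrites the oldest slot
```

The store never grows, the missing sample shows up as an explicit gap, and the oldest value is silently overwritten once the buffer wraps.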


  • Concentration of RRD stores on given controllers can drive disk IO rather high.
  • Once you start down the road of RRD store distribution, how do you keep up with the when and where of your data in a transparent manner?
  • RRD doesn't fit the SQL paradigm.
If you need SQL, why not take a look at columnar databases?

Take a look see at Vertica or Calpont InfiniDB.

When you think about it, most time series data is ultra-simple. Yet when you do a graph or report, you are always comparing one time series element to another. A columnar database LIVES here because the DB engine aligns the data by column and not by row.

Another thought here is massive parallelism. If you can increase your IO and processing power, you can overcome large data challenges. Go check out Greenplum. While it is based on PostgreSQL, it sets up as appliances based on VM instances. So, you start out with a few servers and, as your data grows, you install another "appliance" and go. As you install more components, you add to the performance potential of the overall data warehouse.

If you can run without SQL, take a look at the Big Data and NoSQL options like Cassandra or Hadoop / MapReduce.

Links for you:


http://hadoop.apache.org/
http://hadoop.apache.org/mapreduce/

An interesting experiment:

Run RRD or JRobin stores under Hadoop and HDFS.  Use MapReduce to index the RRD Stores.

I wonder how that would scale compared to straight RRD stores, against an RDBMS, or a Columnar Database.





Monday, April 9, 2012

Product quality Dilemma

All too often, we have products that we have bought, put in production, and attempted to work through the shortcomings and obfuscated abnormalities prevalent in so many products. (I call these product "ISMs" and I use the term to describe specific product behaviors or personalities.) Over this long life cycle, features get changed, added, and deprecated, and behaviors shift over time. Whether it's fixing bugs or rewriting functions as upgrades or enhancements, things happen.

All too often, developers tend to think of their creation in a way that may be significantly different than the deployed environments it goes into. It's easy to get stuck in microcosms and walled off development environments. Sometimes you miss the urgency of need, the importance of the functionality, or the sense of mandate around the business.

With performance management products, it's all too easy just to gather everything and produce reports ad nauseam. With an overwhelming level of output, it's easy to get caught up in the flash, glitz, and glamour of fancy graphs, pie charts, bar charts... Even Ishikawa diagrams!

All this is a distraction from what the end user really, really NEEDS. I'll give a shot at outlining some basic requirements pertinent to all performance management products.

1. Don't keep trying to collect on broken access mechanisms.

Many performance applications continue to collect, or attempt to collect, even when they haven't had any valid data in several hours or days. It's crazy, as all of the errors just get in the way of valid data. And some applications will continue to generate reports even though no data has been collected! Why?

SNMP Authentication failures are a HUGE clue your app is wasting resources on something simple. Listening for ICMP Source Quenches will tell you if you're hammering end devices.
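One way a collector could stop hammering a broken access path is to count consecutive failures per device and suspend polling past a limit. The sketch below is an assumption about how that might look, not any product's behavior; `PollTarget`, `poll_cycle`, and the limit of 3 are all hypothetical.

```python
class PollTarget:
    """Tracks consecutive collection failures for one device (hypothetical helper)."""

    def __init__(self, name, max_failures=3):
        self.name = name
        self.max_failures = max_failures
        self.failures = 0
        self.suspended = False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.suspended = True  # stop collecting until someone fixes access


def poll_cycle(targets, poll):
    """Poll every target that is not suspended; return the names actually polled."""
    polled = []
    for target in targets:
        if target.suspended:
            continue
        polled.append(target.name)
        target.record(poll(target.name))
    return polled


def fake_poll(name):
    """Stand-in for a real SNMP poll: only the device named "good" answers."""
    return name == "good"


targets = [PollTarget("good"), PollTarget("bad")]
history = [poll_cycle(targets, fake_poll) for _ in range(4)]
```

After three failed cycles the bad device drops out of the polling list. A real system would also want to raise an event when a target is suspended and give operators a way to re-enable it.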

2. Migrate away from mass produced reports in favor of providing information.

If no one is looking at the reports, you are wasting cycles, hardware, and personnel time on results that are meaningless.

3. If you can't create new reports without code, it's too complicated.

All too often, products want to put glue code or even programming environments / IDEs in front of your reporting. Isn't it a stretch to assume that a developer will be the reporting person? Most of the time it's someone more business process oriented.

4. Data and indexes should be documented and manageable.  If you have to BYODBA (Bring Your Own DBA), the wares vendor hasn't done their homework.

How many times have we loaded up a big performance management application only to find out you have to do a significant amount of work tuning the data and the database parameters just to get the app to generate reports on time?

And you end up having to dig through the logs to figure out what works and what doesn't.

If you know what goes into the database, why not put in indexes, checks and balances, and even recommended functions for when expansion occurs?

In some instances, databases used by performance management applications are geared toward the polling and collection versus the reporting of information.  In many cases, one needs to build data derivatives of multiple elements in order to facilitate information presentation.  For example, a simple dynamic thresholding mechanism is to take a sample of a series of values and perform an average, root mean, and standard deviation derivative.
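The dynamic thresholding example can be sketched in a few lines. This version uses only the mean and sample standard deviation; the multiplier `k`, the function names, and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev


def dynamic_threshold(samples, k=3.0):
    """Threshold set k standard deviations above the mean of recent samples."""
    return mean(samples) + k * stdev(samples)


def breaches(samples, new_value, k=3.0):
    """True when a new value exceeds the threshold derived from `samples`."""
    return new_value > dynamic_threshold(samples, k)


# Invented recent utilization samples (percent).
recent = [10, 12, 11, 13, 12, 11, 10, 12]

print("threshold:", round(dynamic_threshold(recent), 2))
print("14 breaches?", breaches(recent, 14))
print("20 breaches?", breaches(recent, 20))
```

If the database only exposes raw polled rows, every report that wants a threshold like this has to recompute it, which is the argument for keeping such derivatives in their own tables or views.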

If a reporting person has to do more than one join to get to their data elements,  your data needs to be better organized, normalized, and accessible via either derivative tables or a view. Complex data access mechanisms tend to alienate BI and performance / Capacity Engineers. They would rather work the data than work your system.

5. If the algorithm is too complex to explain without a PhD, it is neither usable nor trustworthy.

There are a couple of applications that use patented algorithms to extrapolate bandwidth, capacity, or effective usage. If you haven't simplified the explanation of how it works, you're going to alienate a large portion of your operations base.

6. If an algorithm or method is held as SECRET, it works just until something breaks or is suspect. Then your problem is a SECRET too!

Secrets are BAD. Cisco publishes all of its bugs online specifically because it eliminates the perception that they are keeping something from the customer.

If one remembers Concord's eHealth Health Index...  In the earlier days, it was SECRET SQUIRREL SAUCE. Many an Engineer got a bad review or lost their job because of the arrogance of not publishing the elements that made up the Health Index.

7. Be prepared to handle BI types of access.  Bulk transfers, ODBC and Excel/Access replication, ETL tools access, etc.

If Engineers are REALLY using your data, they want to use it in their own applications, their own analysis work, and their own business activities. The more useful your data is, the more embedded and valuable your application is. Offer shared tables, timed transfers, transformations, and data dumps.

8. Reports are not just a graph on a splash page or a table of data.  To Operations personnel, a report means putting text and formatting around the graphs, charts, tables, and data to relate the operational aspects of the environment to the illustrations.

9. In many cases, you need to transfer data in a transformed state from one system that reports to another. Without ETL tools, your reporting solution kind of misses the mark.

Think about this... You have configuration data and you need this data in a multitude of applications. Netcool. Your CMDB. Your Operational Data Store. Your discovery tools. Your ticketing system. Your performance management system. And it seems that every one of these data elements may be text, XML, databases of various forms and flavors, even HTML. How do you get it transformed from one place to another?

10. If you cannot store, archive, and activate polling, collection, threshold, and reporting configurations accurately, you will drive away customization.

As soon as a data source becomes difficult to work with, it gets in the way of progress. In essence, what happens is that when a data source becomes difficult to access, it quits being used beyond its own internal function. When this occurs, you start seeing separation and duplication of data.

The definitions of the data can also morph over time. When this occurs and the data is shared, you can correct it pretty quickly. When data is isolated, many times the problem just continues until it's a major ordeal to correct. Reconciliation when there are a significant number of discrepancies can be rough.

Last but not least - If you develop an application and you move the configuration from test/QA to production and it does not stay EXACTLY the same, YOUR APPLICATION is JUNK. It's dangerous, haphazard, incredibly shortsighted, and should be avoided at all costs. Recently, I had a dear friend develop, test, and validate a performance management application upgrade. After a month in test and QA running many different use case validations, it was put into production. Upon placement into production, the application reset the pared-down configurations to defaults and polled EVERYTHING, which caused major outages and major consternation for the business. In fact, heads rolled. The business lost customers. There were people that were terminated. And a lot of manpower was expended trying to fix the issues.
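A cheap guard against that kind of surprise is to fingerprint the validated configuration and compare it with what is actually live after the move. The sketch below assumes the configuration can be exported as a plain dictionary; the keys and values shown are invented.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable SHA-256 over a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def config_drift(expected, actual):
    """Return {key: (expected value, actual value)} for every difference."""
    keys = set(expected) | set(actual)
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


# Hypothetical pared-down config validated in QA vs. what production loaded.
validated = {"poll_interval": 300, "poll_groups": ["core-routers"], "discovery": False}
production = {"poll_interval": 300, "poll_groups": ["ALL"], "discovery": True}

same = config_fingerprint(validated) == config_fingerprint(production)
drift = config_drift(validated, production)
```

If the fingerprints differ, the drift report says exactly which settings changed before the first production poll goes out.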

In no uncertain terms, I will never let my friends and customers be caught by this product.