Sunday, July 15, 2012

Thoughts on Telepresence Management

Many organizations have turned to Telepresence capabilities as a way to enable collaboration and teaming across widely separated geographic locations. When you consider the cost of getting managers, directors, and engineers together to address challenges and work through solutions, Telepresence systems have the potential not only to save money but also to empower the organization to respond much faster.

Going further, video streams are proliferating every day. Not only are collaboration systems becoming more popular, but many Service Providers are offering video-based services to both corporate and consumer markets.

Video conferencing technology has been around for several years and has evolved significantly along with its codecs, MCUs, and standards. Additionally, the transport of these streams is becoming common on Internet and private TCP/IP networks where, in the past, dedicated DSx or ISDN circuits were used.

Video traffic, by its very nature and behavior, places a series of constraints on the network.

Bandwidth

First, there are bandwidth requirements. High-motion digital MPEG video can present a significant amount of traffic to the network.

QoS

Additionally, video traffic is very jitter sensitive, and dynamics like packet reordering can wreak havoc on what you see on screen. In some cases, you may not even own portions of the network the stream traverses, and the symptoms may be present for only a very short period of time.

Video Traffic

A lot of multimedia and streaming traffic is carried over paired RTP and RTCP channels. RTP usually runs on an even-numbered port between 1024 and 65535, with the RTCP channel running on the next higher port number. This pairing is called a tuple. Additionally, while RTP and RTCP are transport independent, they typically run over UDP because TCP tends to sacrifice timeliness in favor of reliability.
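To make the pairing concrete, here is a minimal sketch (in Python, purely illustrative) of the even/odd port convention described above; the port numbers are examples, not values from any particular codec.

```python
# Minimal sketch of the conventional RTP/RTCP port pairing described above.
# The port values are illustrative only.

def rtcp_port_for(rtp_port: int) -> int:
    """Return the RTCP port conventionally paired with a given RTP port."""
    if not (1024 <= rtp_port <= 65534):
        raise ValueError("RTP port outside the usable dynamic range")
    if rtp_port % 2 != 0:
        raise ValueError("RTP conventionally uses an even-numbered port")
    return rtp_port + 1  # RTCP rides on the next higher (odd) port

print(rtcp_port_for(16384))  # -> 16385
```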

Multipoint communication is facilitated by IP multicast.

With G.711 and G.729, the two most common VoIP encoding methods, digitized speech crosses the network with two voice samples per packet. Voice packets tend to be slightly larger than 220 bytes each, so the overall behavior is a constant bit stream constrained by the sampling rate needed per interval: because of that sampling, packets are sent on a 20 msec interval.
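As a back-of-the-envelope check of those numbers, here is a small Python calculation assuming G.711 with 20 ms packetization; the header sizes are the usual IP/UDP/RTP and Ethernet values and should be adjusted for your own encapsulation.

```python
# Rough per-call bandwidth math for G.711 at a 20 ms packetization interval.
# Header sizes are assumptions for a plain Ethernet/IP/UDP/RTP encapsulation.

PAYLOAD = 160                 # bytes of G.711 audio per 20 ms (64 kbps * 0.02 s / 8)
IP_UDP_RTP = 20 + 8 + 12      # network and transport headers
ETHERNET = 18                 # L2 header plus FCS

packet_bytes = PAYLOAD + IP_UDP_RTP + ETHERNET   # roughly the ~220 bytes cited above
packets_per_second = 1000 // 20                  # one packet every 20 ms -> 50 pps
bandwidth_bps = packet_bytes * 8 * packets_per_second

print(f"{packet_bytes} bytes/packet, {packets_per_second} pps, "
      f"{bandwidth_bps / 1000:.0f} kbps per direction")
```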

Some video systems use ITU H.264 encoding, which is equivalent to MPEG-4 Part 10, at 30 frames per second. Each frame must be delivered every 33 msec, but the number of bytes and packets per frame varies with the video content. MPEG video works from a reference frame, with subsequent data providing updates to that frame in what is called a GOP, or Group of Pictures.

A GOP starts with an I frame, which is the reference frame. A P frame represents a significant change from the initial I frame but continues to reference it. B frames update both I frames and P frames.
The types of frames and their positions within a GOP are defined as a temporal sequence, where temporal distance is the time or number of images between specific frame types in a digital video: "m" is the distance between successive P frames and "n" is the distance between I frames.
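As an illustration of how m and n shape a GOP, here is a small Python sketch that generates the frame pattern for example values; the specific m and n are assumptions, not values from any particular codec.

```python
# Illustrative sketch: build the frame pattern of a single GOP from m
# (anchor-to-anchor distance) and n (GOP length, I frame to I frame).

def gop_pattern(m: int, n: int) -> str:
    """Return a frame sequence such as 'IBBPBBPBB' for one GOP."""
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")   # reference frame that opens the GOP
        elif i % m == 0:
            frames.append("P")   # predicted from the preceding anchor frame
        else:
            frames.append("B")   # bi-directionally predicted frame
    return "".join(frames)

print(gop_pattern(m=3, n=12))    # -> IBBPBBPBBPBB
```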

Contrast this with serial digital television, where the entire screen is digitized and sent for every one of the 30 frames per second. MPEG's transmission requirements are significantly lower than serial digital's because only the changes to the I frame are sent, using the P and B frames.

Following is a diagram of a typical Group of Pictures and how the frames are sequenced.



Following is another illustration of a GOP along with data requirements per frame.


This figure shows how different types of frames can compose a group of pictures (GOP). A GOP can be characterized by the depth of compressed predicted frames (m) as compared to the total number of frames (n). This example shows that a GOP starts with an intra-frame (I-frame) and that intra-frames typically require the largest number of bytes to represent the image (200 kB in this example). The depth m represents the number of frames that exist between the I-frames and P-frames.

When you compare the data sizes in the GOP diagram to what has to cross the network, you start to realize the breakdown of packets and packet streaming necessary for the GOP sequence to work in near real time. For example, the first I frame must pass 200 kB sequentially, which could turn out to be several hundred packets across the network. These packets are time and sequence sensitive. When things go wrong, you get video presentations that are frozen or pixelated.
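Here is the packetization arithmetic for that 200 kB I frame, assuming an RTP payload of 1,316 bytes (seven 188-byte MPEG transport packets per IP packet, a common arrangement); smaller payloads, audio, and FEC push the count higher.

```python
# Rough packetization arithmetic for the 200 kB I-frame example above.
# The payload size is an assumption; smaller payloads mean more packets.

I_FRAME_BYTES = 200_000
RTP_PAYLOAD = 7 * 188                         # 1,316 bytes of video per packet

packets = -(-I_FRAME_BYTES // RTP_PAYLOAD)    # ceiling division
print(f"~{packets} packets to carry one I frame")
# At 30 frames per second, the whole burst has to land inside the 33 ms slot
# before the next frame is due.
```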


Latency

In Cisco CTS, latency is defined as the time it takes for audio and video to go from input on one end to presentation on the other. The system measures latency between two endpoints via time stamps in the RTP data as well as the RTCP return reports. It is only a measure of the network and does not take the codecs into consideration. The recommended latency target for Cisco CTS is 150 msec or less; however, this is not always achievable. When latency exceeds 250 msec in a 10 second period, the receiving system generates an alarm, an on-screen message, a syslog message, and an SNMP trap. The on-screen message is displayed for only 15 seconds and isn't displayed again unless the session is restarted or stopped and reinitiated.
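If you want to reproduce that alarm logic in your own monitoring, here is a minimal sketch using the quoted thresholds; whether the system averages the 10 second window or reacts to any single sample is an implementation detail, so this sketch simply averages.

```python
# Minimal sketch of the latency thresholds above. The averaging behavior is
# an assumption; the threshold values mirror the quoted Cisco figures.

TARGET_MS = 150   # recommended design target
ALARM_MS = 250    # alarm threshold over a 10 second period

def classify_latency(window_ms: list[float]) -> str:
    """Classify a 10 second window of one-way latency samples."""
    avg = sum(window_ms) / len(window_ms)
    if avg > ALARM_MS:
        return "alarm"          # CTS raises an on-screen message, syslog, and SNMP trap
    if avg > TARGET_MS:
        return "above target"
    return "ok"

print(classify_latency([180, 240, 310, 290]))   # -> alarm
```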

Packet Loss

Packet loss can occur across the network for a variety of reasons: layer 1 errors, Ethernet duplex mismatches, overrun queue depths in routers, and even excessive jitter. For Cisco CTS systems, Cisco recommends loss of less than 0.05 percent of traffic in each direction. When packet loss exceeds 1 percent averaged over a 10 second period, the system presents an on-screen message and generates a syslog message and an SNMP trap.

When packet loss exceeds 10 percent averaged over a 10 second period, an additional on-screen message is presented, syslog messages are logged, and another SNMP trap is generated.

If the CTS system experiences packet loss greater than 10 percent averaged over a 60 second interval, it downgrades the quality of its outgoing video and puts up an on-screen message. The key metrics, targets, and thresholds are:

  • Latency - target: 150 ms; 1st threshold: 250 ms (2 seconds for Satellite mode)
  • Jitter - target: 10 ms packet jitter / 50 ms frame jitter; 1st threshold: 125 ms of video frame jitter; 2nd threshold: 165 ms of video frame jitter
  • Packet Loss - target: 0.05%; 1st threshold: 1%; 2nd threshold: 10%

If you are building KPIs around something like a Cisco CTS implementation, you need to look at the areas where latency, jitter, and packet loss can occur. Sometimes they are in your control and sometimes they are not. And there are a lot of different possibilities.

Some errors and performance conditions are persistent, and some are intermittent and situation specific. For example, a duplex mismatch on either end will result in persistent packet loss during TelePresence calls, whereas an overloaded WAN router in the path may drop packets only while it is under load.

While the CTS system gives your Operations personnel awareness that jitter, latency, or packet loss is occurring, it does not tell you where the problem is. And if the problem causes the call to reset, you may lose the chance to find out: once the traffic is gone, the problem may disappear with it.





When you look at the diagram, you see a Headquarters CTS system capable of connecting to both a Remote Campus CTS system and a Branch office equipped with a CTS system. Multiple redundant network connections are provided between Headquarters and the Remote Campus, but a single Metro Ethernet connection links the Branch office to the Headquarters and Remote Campus facilities.

When you look at the potential instrumentation that could be applied, there are a lot of different points to sort out device by device. But diagnosis of something like this goes end to end. The underlying performance and status data is really most useful when an engineer is drilling into a problem to look for root causes, doing capacity planning, or performing a post mortem on a specific incident.

Most Cisco TelePresence systems are highly visible, as users tend to be presented with issues either by on-screen cues or by the performance of the call itself. It doesn't take too many jerky or broken calls to render it a non-working technology in the minds of some. And these users tend to be high profile: senior management and business unit decision makers.

So, What Do You Do?

First of all, Operations personnel need to know proactively when a CTS system is presenting errors and experiencing problems. On the elements you can monitor, set up and measure key performance metrics and thresholds, set up status-change mechanisms, and threshold on misconfigured elements. Thresholds, traps, and events help you create situational awareness in your environment, which is a good start toward being able to recognize problems and conditions. Even awareness that a configuration change has been made to any component of the CTS system keeps Operations informed about the elements in the service.

In monitoring and managing the infrastructure, you are probably already receiving SNMP traps and syslog messages from the various components in the environment. This is a good start toward seeing failures and error conditions as they occur. However, you also need performance data so you can drill down and analyze the conditions and situations that could affect streaming services.

But when you think about it, you actually need to start with a concept of end-to-end performance and availability, as a service. Having I/O performance data on a specific router interface makes no sense until it is put into a service perspective.

End to End Strategy

First up, you need to look at an end-to-end monitoring strategy, and probably one of the most common steps to take is to employ something like Cisco IP SLA capabilities.

Cisco has a capability in many of its routers and switches that enables managing service levels using a variety of built-in measurements. These capabilities can be used in your CTS monitoring and management solution to make it more effective beyond the basic instrumentation. Cisco recommends using dedicated shadow devices, as the IP SLA tests will circumvent routing and queuing mechanisms and can have an effect on production devices if run there.

Not all devices support all tests, so be sure to check with Cisco that your specific IOS version and platform support the tests you need. Some service providers and enterprises use decommissioned components as shadow devices to reduce cost.

Basic Connectivity

One of the most commonly used Cisco IP SLA tests is the ICMP Echo test, which sends a ping from the shadow router running the test to a destination IP. Looking at figure 2, we would need to set up an ICMP Echo test from the shadow router on each side to the CTS system on the other side, to establish a connectivity check in each direction.

This also provides a latency metric, but only at the IP level; ICMP is the control mechanism for the IP protocol. There may be added latency at higher-level protocols depending on the elements interacting at those layers. For example, firewalls that maintain state at the transport layer (TCP) may introduce additional latency, and traffic shapers may do so as well.

The collected data and the status of the tests are defined in the CISCO-RTTMON-MIB.my MIB definition file. Pertinent data shows up in the rttMonCtrlAdminEntry, rttMonCtrlOperEntry, rttMonStatsCaptureEntry, and rttMonStatsCollectEntry tables.
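As a sketch of how you might pull those results off the shadow router, here is a pysnmp walk of one object from that MIB; the hostname and community string are placeholders, and the CISCO-RTTMON-MIB must be available to pysnmp's MIB resolver for the symbolic name to resolve.

```python
# Hedged sketch: walk the latest IP SLA completion times from the shadow
# router. Host and community are placeholders; requires the CISCO-RTTMON-MIB
# to be compiled/available to pysnmp.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

iterator = nextCmd(
    SnmpEngine(),
    CommunityData('public'),                              # placeholder community
    UdpTransportTarget(('shadow-rtr.example.net', 161)),  # placeholder host
    ContextData(),
    # last completed round-trip time, one row per configured IP SLA operation
    ObjectType(ObjectIdentity('CISCO-RTTMON-MIB', 'rttMonLatestRttOperCompletionTime')),
    lexicographicMode=False)

for err_indication, err_status, err_index, var_binds in iterator:
    if err_indication or err_status:
        break
    for name, value in var_binds:
        print(f"{name.prettyPrint()} = {value.prettyPrint()} ms")
```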


Jitter

We know that jitter can have a profound effect on video streams: a frame that arrives too late or out of sequence will be dropped by the receiving codec, which cannot go back in the video stream and redo a past frame. We can also discern that, at 30 frames a second, a frame must be sent every 33 ms.

The jitter test is executed between a source and a destination and therefore needs a responder to operate. It works by sending a series of packets separated by a fixed interval, and both source-to-destination and destination-to-source measurements are accommodated.

The statistics available provide the types of information you are looking to threshold and extract. The Cisco IP SLA ICMP based Jitter test supports the following statistics:
  • Jitter (Source to Destination and Destination to Source)
  • Latency (Source to Destination and Destination to Source)
  • Round Trip Latency
  • Packet Loss
  • Successive Packet Loss
  • Out of Sequence Packets (Source to Destination and Destination to Source)
  • Late Packets
A few factors need to be considered in setting up the jitter test with regard to video streams. These are:
  • Jitter Interval
  • The number of packets
  • Test frequency
  • Dealing with Load Balancing and NAT
The jitter interval needs to be set to the video frame interval of 33 ms.

The number of packets needs to be set to something significant beyond single digits but below a value that would hammer the network. What you are looking for is just enough packets to recognize that you have an issue. I would probably recommend 20 at first, even though the packet counts in a real video stream will be significantly higher. Not every network is created equal, so you will probably need to tune this. While the number of packets per frame can be highly variable, you just need to sample enough to capture the statistics that affect service.

Test frequency can have an effect on traffic. I would probably start with 5 minutes during active periods. But this too needs to be adjusted according to your environment.
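To get a feel for how much probe traffic those starting values generate, here is a quick sizing calculation; the probe packet size is an assumption and should be tuned for your own tests.

```python
# Rough sizing of the jitter test traffic using the starting values above:
# 20 packets per run, 33 ms spacing, one run every 5 minutes.

PACKETS_PER_RUN = 20
INTERVAL_MS = 33            # matches the 33 ms video frame interval
FREQUENCY_S = 300           # one test run every 5 minutes
PACKET_BYTES = 64           # assumed probe packet size

run_duration_ms = PACKETS_PER_RUN * INTERVAL_MS
bytes_per_hour = PACKET_BYTES * PACKETS_PER_RUN * (3600 // FREQUENCY_S)

print(f"each run lasts ~{run_duration_ms} ms and adds "
      f"~{bytes_per_hour / 1024:.1f} KiB per hour per path")
```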

In the jitter test setup, you can use either IP addresses or hostnames when you specify the source and destination.

The statistical data is collected in the rttMonJitterStatsEntry table.


ICMP Path Echo

When connectivity issues or changes in latency occur, you need to be able to diagnose the path from both ends. The Cisco IP SLA ICMP Path Echo test is an ICMP-based traceroute function. This test can help diagnose not only path connectivity issues but also latency along the path and asymmetric paths.

Traceroute uses the IP TTL field to "walk" through a network: after each hop is discovered, the TTL is incremented and the probe is sent again, and an ICMP Echo test is performed against each discovered hop to measure its response time.
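For illustration, here is a minimal Python version of that TTL walk using scapy (it needs root privileges to craft raw packets); the destination is a placeholder, and a real IP SLA Path Echo operation does the equivalent on the router itself.

```python
# Minimal sketch of a TTL walk with scapy. Destination is a placeholder.
from scapy.all import IP, ICMP, sr1
import time

def ttl_walk(dst: str, max_hops: int = 15) -> None:
    for ttl in range(1, max_hops + 1):
        start = time.time()
        reply = sr1(IP(dst=dst, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        rtt_ms = (time.time() - start) * 1000
        if reply is None:
            print(f"{ttl:2d}  *")                      # no answer at this hop
            continue
        print(f"{ttl:2d}  {reply.src}  {rtt_ms:.1f} ms")
        if reply[ICMP].type == 0:                      # echo reply: destination reached
            break

ttl_walk("cts-far-end.example.net")
```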

Data will be presented in the rttMonStatsCaptureEntry table.



Advanced Functions


Looking over what we've discussed so far, we have reviewed in-system diagnostic events, external events and conditions, and the use of IP SLA capabilities to test and evaluate, in shadow mode, the path between two or more Telepresence systems. But when things happen, they happen in real time. Coupled with the fact that many different areas can degrade the service, in some instances it may be necessary to provide enhanced levels of service assurance.

Additionally, multiple simultaneous problems can present confusing, anomalous event patterns and symptoms that complicate service restoration. Of course, because of the high visibility, any confusion adds fuel to the fire.

Media Delivery Index

Ineoquest has derived a service metric for video stream services that provides a KPI for service delivery. The Media Delivery Index (MDI) can be used both to monitor the quality of a delivered video stream and to show system margin, by providing an accurate measurement of jitter and delay. The MDI addresses the need to assure the health and quality of systems delivering ever higher numbers of simultaneous streams by providing a predictable, repeatable measurement rather than relying on subjective human observation. The MDI also provides a network margin indication that warns operators of impending operational issues with enough advance notice to allow corrective action before the observed video is impaired.

The MDI is also documented in RFC 4445, which was jointly authored by Ineoquest and Cisco.
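As a rough illustration of the two MDI components from RFC 4445, the Delay Factor (DF) and the Media Loss Rate (MLR), here is a simplified Python sketch; real probes compute this per stream in hardware, and the arrival trace below is hypothetical.

```python
# Simplified sketch of the MDI components over a one second interval.
# DF models a virtual buffer fed by arrivals and drained at the nominal
# stream rate; MLR is lost (or out-of-order) media packets per second.

def media_loss_rate(expected: int, received: int) -> float:
    return float(expected - received)

def delay_factor(arrivals, nominal_bps: float) -> float:
    """arrivals: list of (arrival_time_s, payload_bytes); returns DF in ms."""
    buffer_bits, hi, lo, last_t = 0.0, 0.0, 0.0, arrivals[0][0]
    for t, nbytes in arrivals:
        buffer_bits -= nominal_bps * (t - last_t)   # drain since last packet
        buffer_bits += nbytes * 8                   # fill on arrival
        hi, lo, last_t = max(hi, buffer_bits), min(lo, buffer_bits), t
    return (hi - lo) / nominal_bps * 1000.0

# Example: 1,316-byte media packets arriving every 3 ms, perfectly paced.
PKT = 1316
nominal = PKT * 8 / 0.003
trace = [(i * 0.003, PKT) for i in range(333)]
print(f"DF = {delay_factor(trace, nominal):.1f} ms, "
      f"MLR = {media_loss_rate(333, 333):.0f}")
```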

Ineoquest uses probes and software to analyze streams in real time and validate the actual video stream as it occurs. In fact, the probes can be attached to a port SPAN or a network tap, or they can actually JOIN a conference! Why is this necessary?

The Singulus and Geminus probes look at each GOP and the packets that make up each frame. They capture and provide metrics on data loss, jitter, and Program Clock Reference (PCR). They are also able to analyze the data stream and determine, in real time, the quality of the video. By analyzing jitter margins, you can create awareness of increased jitter even before the video is affected, which takes the subjectivity out of the measurement equation.

Additionally, you can record and play back problematic video streams. This is important because, in encoded video, the actual data patterns can be wildly different from one moment to the next depending on the changes occurring within the GOP. For example, if the camera is pointed at a whiteboard, there are very few changes. If the camera is capturing 50 people in a room, all moving, the GOP is going to be thick with updates to the I, P, and B frames.

For instance, a provider network could have a CBQoS configuration that is perfectly acceptable for a whiteboard camera or even moderate traffic, yet start buffering packets when times get tough. The bandwidth may be oversubscribed and reprioritized in areas where you have no visibility. This can introduce jitter and even packet loss, but only under the specific conditions at hand. And how do you PROVE that back to your Service Provider?

You can go back and replay the video streams during a maintenance window to validate the infrastructure, and test, diagnose, and validate specific network components with your network folks and service providers alike, delivering a validated service as a known functional capability.

And because you can replay the captures, what a great way to load test and validate your end-to-end path during maintenance windows!

The IQ Video Management System (iVMS) controls the probes, assembles the data into statistics, and provides the technical diagnostic front end for accessing the detailed information available. Additionally, when problems occur or thresholds are crossed, it forwards SNMP traps, syslog messages, and potentially email regarding the issues at hand.


Analyzing the Data

First and foremost, you have traps and syslog messages that tell you about conditions occurring in near real time. For example, due to the transmission quality of the stream, the end presentation may downgrade from 1080p to 720p to save bandwidth.

The IP SLA tests establish end-to-end connectivity and reveal a portion of the path. With traceroute, you normally see only one interface per device as it walks the path; it gives you only one interface for each device, not both the in and out interfaces, because the Time to Live parameter won't be exceeded again until the next hop.

Keep in mind, though, that elements that can affect a video stream may not show up as part of the direct path. For example, if a router in the path develops a low-memory condition caused by buffering from a fast network onto a low-speed WAN link, that may affect the router's ability to handle streaming data correctly under load.

What if path changes affect the order of packets? What if timing slips cause jitter on the circuit? Even BGP micro-flaps may cause jitter and out-of-sequence frames. All in all, you have your work cut out for you.

Summary

Prepare yourself to approach the monitoring and management of video streams from a holistic point of view. You have to look at it first from an end-to-end perspective. Also be prepared to fill in the blanks with other configuration and performance data.

Because of the visibility of video teleconferencing systems, be prepared to validate your environment on an ongoing basis. Things change constantly. Changes that are only remotely related, or outside your span of control, can affect the system's ability to perform well. Validate and test your QoS and performance from a holistic view on a recurring basis. A little recurring discipline will save you countless hours of heartbreak later.

Because of the situational nature of video telepresence problems, be prepared to enable diagnosis in near real time. Specialized tools like Ineoquest's can really empower you to understand what is going on and work through analysis, design changes, maintenance actions, and even vendor support to ensure you are on the road to an effective collaboration service.

Sometimes it pays to enlist the aid of an SME who is vendor independent. After all, in doing this blog post, I credit a significant amount of education and technical validation on GOPs, RTP, RTCP, and other elements of streaming to Stephen Bavington of Bavington Consulting. http://bavingtonconsulting.com/


Wednesday, July 11, 2012

Management and Cloud Migration

So, you've been tasked with analyzing the systems you have and migrating them to the Cloud. Sounds like a bit of a challenge. In part, you know there are some systems that will take longer to get to the Cloud than others. After all, you have systems like databases, ERP applications, and even Enterprise, Fault, and Performance systems that are critical to your operation. In many cases, email and DNS are very critical to your IT service. A lot of decisions, some harder than others.

Step Back

 All too often you find some Manager head over heels trying to push everything to the Cloud right up front. They've been to a couple of seminars and lurked on a few webcasts. It is so easy to just make a decision and hope for the best thinking it must be better in the cloud because everyone else is doing it.

Even if you deploy your own virtualized environment, call it your private cloud, and start migrating things over wantonly, you are headed for more pitfalls than you think. From a management perspective, there are differences between legacy management applications and techniques and more "service oriented" approaches. Both have their form and function, but migrating to cloud without considering the ramifications for management coverage could leave you exposed and vulnerable. (Translated: CLMs - Career Limiting Moves!)

What does Cloud Give you?

Mobility.

First and foremost, it gives you the ability to move. Over the course of years, your process for putting new computing assets in place has grown longer and more complex as it evolved. First there's the engineering process, then acquisition, then development and prototyping, then migration into production. Each process along the way has put in place more and more controls, checkpoints, decision validations, and scheduling, all to get something into production and working for a customer.

If the IT department can't respond in short intervals, then why not hire a consultant to stand up a Web infrastructure in the cloud, upload the content, and take it down a month later when the need has passed? In fact, you see a lot of this: the business needs Web assets for a specific, short time period, and IT cannot even deliver a design in the time frame in which the business unit needs the capability. A lot of business units will circumvent the IT department to make something happen. The only problem: when it breaks, who do they call?

Being able to streamline the processes that deliver new services quickly is a key benefit of cloud. You reserve your longer, legacy processes for developing the infrastructure and lining up the external cloud service contracts.

Standardization

Because the unit of deployment can be whatever you decide, you must work to develop standard packages that can be pre-packaged, tested, and validated ahead of time. This saves countless hours in engineering as the engineering is done once and replicated. Additionally, patch management becomes more effective as you test against standard packages.

Standardization also enables you, as an IT service provider, to offer a comprehensive catalog that business owners can use to satisfy their needs. Along with this, you can begin to standardize support costs more effectively. Once you standardize the platforms and applications, you can really begin to standardize your processes.

Think about your processes around monitoring and management and around support, care and feeding. Can you really start down the road of true ITIL Incident Management where corrective actions are defined?

Kill the Deadwood!

Do you have elements or silos in your environment that just get in the way of progress? Come on now, everyone does. Sometimes products and integrations just grow obsolete.

Sometimes a product becomes ingrained but the ability to support or replace it is gone. At that point, people are afraid to touch it. It has become a boil on the rear of IT. Nasty. No one will touch it. And it festers on.

Sometimes this happens because a person or group becomes the champion of the element, and when you need something the legacy product doesn't do, it means work. There's an ongoing fallacy that in-house developed products are free. Nothing is free. Someone has to continue development. Someone has to support it. Someone has to document it. Someone has to train others on it.

Sometimes an organization is just averse to change. This is much more prevalent than one might think. There are folks who just don't do well with shifting sands; they prefer the status quo because it is more comfortable to avoid change.

Changing to Cloud and virtualization changes the way you monitor and manage things. Instead of managing by IP address or node, you must start with the service first. For example, a vMotion event can move your node to completely different hardware, a different location, and even different disk, and it happens very quickly.

The legacy monitoring apps you've been used to may not be able to function effectively going forward. You must be willing to analyze and be real about the legacy products or you're going to get stuck with something that becomes a ball and chain. Even if the product you have is free, a part of an ELA that doesn't cost anything, or you are living on maintenance only, if the product does not fit, you are wasting TIME AND MONEY.

Analyze and Define Your Needs

Requirements are where you start and where you validate. If the solutions you put in place do not address the requirements, you need to analyze the risk of non-coverage and work through alternatives. Wares vendors love to see valid requirements; they get an opportunity to address the things they do on a point-by-point basis.

Because you are dealing with moving technology, keep the requirements open: if a Wares vendor has additional capabilities that your requirements do not address, do not throw those capabilities away in your evaluations. Be extremely wary of vendors who want to minimize your requirements. It is often a sign of a lack of capability and of an inability to compete effectively in their functional area.

Be open to new things. If you are deploying VDI capabilities, look at VDI management solutions. Some of these technologies are new enough that deploying them with only legacy management technology can leave you naked. Consider this: SNMP support is waning in VMware ESXi, and VMware prefers that you access its capabilities via its Web Services API.

Organize your Data Model

Going forward, you need to look at your sources of truth: configuration, performance, situational, and workflow data elements all need to be aligned and understood as a data model. What you do not want to end up with is multiple sources of the same data, modeled in different ways and not correlated. This becomes painfully apparent when you start organizing a service catalog. How confusing would it be to have one catalog entry for Server - Redhat + MySQL + PHP + Apache and an adjacent entry for LAMP + MySQL? Think about the differences in CPU cores.
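As a hypothetical sketch of what one normalized catalog record might look like, so that the "LAMP" and "Redhat + MySQL + PHP + Apache" offerings resolve to a single entry rather than two uncorrelated ones (field names are illustrative only):

```python
# Hypothetical normalized catalog entry; all names and values are examples.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str                        # single canonical name for the offering
    os: str
    stack: tuple[str, ...]           # normalized component list
    cpu_cores: int
    memory_gb: int
    aliases: tuple[str, ...] = ()    # what business units may call it

lamp = CatalogEntry(
    name="Web application server (LAMP)",
    os="RHEL 6",
    stack=("Apache", "MySQL", "PHP"),
    cpu_cores=2,
    memory_gb=4,
    aliases=("LAMP + MySQL", "Server - Redhat + MySQL + PHP + Apache"),
)
print(lamp.name, lamp.cpu_cores, "cores")
```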

Validate your Security Posture

Work diligently to identify and understand the risks to your systems, applications, and data as things migrate to a cloud or virtualized model. Consider where your data is, what its importance is, and what access you have to it. Moving to the Cloud or to virtualization may mean your data moves to a SAN.

Network access protection changes a bit in Cloud. One may be limited in what can be done to protect access. And with vMotion this becomes even more complex.

Integration

Use the move to Cloud and Virtualization to stress integration. You have requirements that may be addressed by multiple solutions. However, one must work to make things seamless. Because of the speed of deployment and deactivation, you need to work through how information and integration should work across products and data elements.

When a service is provisioned, what happens next? Does it get put into your Service Desk solution? Is monitoring activated? Is billing activated? Is the customer notified? Work through defining and addressing the workflow and integration up front. Even if you start with a basic process, you can work to make it more efficient over time.

Clean up your Applications

Many of your applications may run inconsistently in constrained environments. What happens when you introduce latency into your application? What about running on constrained hardware (1 CPU core and 1 GB of memory when the Java JVM takes 1.1 GB by itself)? Some applications are terrible communicators across the network, but network technology hides it. For example, open a Word document on a network share and watch it with a protocol analyzer. Now hit the Page Down key and watch what happens: it downloads the entire document again and seeks to the next page!

Cloud and virtualized environments introduce new challenges. Combinations of applications may affect each other when placed on the same blade. Oversubscription can wreak havoc if you put two dominant applications together.

Invest in some APM applications, a protocol analyzer, and some data gathering and reporting tools. I like to use both class-loader-based APM systems and network-traffic-capture-based APM systems, as you get to analyze applications from multiple angles.

Summary

Moving to the Cloud and toward virtualization does not have to be scary, but neither should you approach it blindly. Identify and understand your requirements, and make sure that for whatever you implement, you understand what you are getting, what the risks are related to security, data, and availability, and what it's going to take to implement and support.

Moving to the Cloud is also an opportunity to clean up your legacy systems, applications, processes, and organization. Get rid of the functions that don't work. Work to eliminate all of the red tape, OSI Layer 8 infrastructures, and inefficiencies that have been dragging down your IT Department's ability to provide good service. If you migrate broken functions to the cloud, they will continue to break things, only much faster.

Sunday, July 1, 2012

ENSM Products are Commodities?

On your quest to put network and systems management capabilities in place, you have to figure in several explicit and implicit factors related to your end goals. What I mean is that while it's easy to go to your framework vendor of choice, break out the Bill of Materials spreadsheet, and sit down with the salesperson to go through the elements you would need for your environment, the exercise may be filled with hidden challenges. And some challenges are harder to overcome than others once you have signed the check.

Don't forget, these products don't magically install and run themselves. They take care and feeding, some more than others. And the more complex the product, the harder it is to figure out what happened when something goes awry.

Sounds so easy! After all, all of these products are commodities. And buying from a single vendor gives you a single point of support... and blame. In effect, a single "throat to choke".  NOTHING could be further from the truth!

Most of the big vendors' product frameworks are aggregations and conglomerations of acquired products, some overlapping, assembled into what looks like a somewhat unified solution. In many cases, it is only after you buy the product framework that you discover things like different products having different portals that don't effectively integrate with each other. Or you may find that the northbound interface of one product is a kludge that loosely glues two products together. Or that two products require competing Java versions.

Some vendors' product suites have become more and more complex as new releases are GAed. In many cases, these new levels of complexity have a profound impact on your ability to install, administer, and diagnose issues as they arise.

First up - Where are your requirements? Do you know the numbers and types of elements in your environment? What about the applications? How do these apply to Service Level Agreements? Do you have varying levels of maintenance and support for the components in your environment?

Do you know who the users will be?  Have you defined your support model? Which groups need access to what elements of information?  Do you have or have you prepared a proposed workflow of how users, managers, and even customers are going to interact with the new capabilities?

Who is going to take care of the management systems and applications? Have you aligned your organization to be successful in deployment? Do you have the skill sets? Do you have adequate skills coverage?

Have you defined the event flow? What about performance reports needs and distributions?  And ad hoc reporting needs? Have you defined any baseline thresholds?

Do you have SNMP access? What about ICMP? SSH? Have you considered the implications of management traffic across your security zones?

Product Choices


While there is a plethora of choices available to you, many people do not want to go through the hassle of doing due diligence. But be forewarned: failure to do due diligence can wreak mayhem in your environment. I know, the big guns say "our product works at your competitor's," but does it really? Do you know? And is your competition really that undifferentiated from you? (That may not be a good thing!)

When you go through product selection, you need to understand the support required to administer the new management applications. Do you need specialists just to install it? What about training? Are you going to need other resources like Business Intelligence Analysts, Web Developers, Database Administrators, Script Developers, or even additional Analysts or Engineers?

Here are some signs you may experience:

 If the product takes longer than a couple of days to install and integrate,  here's your sign.

If two or more products in your big vendor product suite need a significant amount of customization to work together, here's your sign.

 If the installation document for the product deviates from the actual installation, here's your sign.

  If you find out during the installation that you actually have to install additional products, here's your sign.

  If you end up realizing that the recommended hardware specs are either overkill or under-speced, here's your sign.

 If you end up having to deal with libraries and utilities that are not included or resolved with the product installation, here's your sign.

Missed it by THAT much!
 If you find yourself opening up support tickets in the middle of the installation, here's your sign.

 If you find that the product breaks your security model AFTER you do the installation, here's your sign.

   If it takes Vendor specific Engineering to install the product, here's your sign.

   If you cannot see value in the first day after the installation of a product, here's your sign.

   If you find that you need to restructure and build out your support team AFTER the installation, here's your sign.

Systems Management


Systems Management brings whole new challenges to your environment.  Some of the things you need to evaluate up front are:


  • Agent deployment - level of difficulty, OS coverage, consistent data across agents. Manual, automatic, or distributable?
  • Agentless - browser specific? Adequate coverage? Full transactions? Handles redirection?
  • Agent run time - resource utilization, memory footprint, stability, security.
  • Data collection - pull or push model? Resiliency? Effect on run-time resources?
  • External restrictions - Java versions? Perl versions? Python versions?
  • Adequate application coverage?
  • Thresholds - level of difficulty? Binary only, or degrees of utilization/capacity/performance? Stateful? Dynamic thresholds? Northbound traps already defined, or do you have to define your own?

Summary


Enterprise Management does not have to be that difficult. There are products out there that work very well for what they do and are easy to deploy and maintain. For example, go do an OpenNMS installation. Even though OpenNMS runs on just about any platform (a testament to its developer community and product maturity), you go to their wiki page http://www.opennms.org/documentation/installguide.html , pick out your platform of choice, and follow the procedure. Most of the time, you are looking at maybe an hour. Within an hour, you're starting discovery and picking up inventory to monitor and manage.

Solarwinds isn't too bad either. Nice, clean install on Windows.

Splunk is awesome and up and running in no time. http://www.splunk.com/

Hyperic HQ wasn't a bad installation either. Pretty simple. However, it is time sensitive on the agents, and it's kind of thick Java-wise (I think it's the Struts). http://www.hyperic.com/

eGInnovations is cake. One agent everywhere for OS and applications. It handles VMware, Xen, and others, and the UI is straightforward. A ton of value across both system and application monitoring and performance. http://www.eginnovations.com/

Appliance-based solutions take a bit more time in the planning phase up front but take the sting out of installation. Some of these include:

http://www.sevone.com/ (SevOne does offer a software download for evaluation)
http://www.sciencelogic.com/
http://www.loglogic.com/ (They also offer a virtual appliance download)

One solution I dig is Tavve ZoneRanger for solving access issues like UDP/SNMP across firewalls, SSH access across a firewall, and the like, without having to run through proxies upon proxies, while still maintaining consistent auditing and logging. It deploys as an appliance or a virtual appliance. http://www.tavve.com/

Another option you may consider is hosted applications. ServiceNow is easy to deploy because it is a hosted solution. http://www.servicenow.com/