Monday, June 25, 2012

Interview in Seattle - Perf Data in large environments

I recently interviewed at a rather large company in the Northwest, and during the course of the interview I discovered a few things.

If you're limited to SQL Server for performance management data, there are a couple of things to consider:
  • You're stuck with a maximum input rate of roughly 25,000-30,000 inserts per second.
  • While you are at this maximum, you're not getting any data out.
  • When you get to 100 million or so records, indexing and inserting slow down. Reporting slows down even more.
They had a 3-tier data model that kept 7 days of 2- or 5-minute data in the lowest-tier database.
That data was rolled up to a 1-hour average and stored in the second data tier for 90 days.
The second-tier data was in turn rolled up to a 12-hour average and sent to the highest tier, with a retention of 1 year.
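A rough sketch of what that kind of tiering looks like as tables (the table and column names here are illustrative guesses, not their actual schema):

-- Tier 1: raw 2- or 5-minute samples, retained for 7 days
CREATE TABLE TIER1_RAW (
    Metrickey    varchar(64),
    sampletime   datetime,
    value        integer
);

-- Tier 2: 1-hour averages rolled up from tier 1, retained for 90 days
CREATE TABLE TIER2_HOURLY (
    Metrickey    varchar(64),
    hourstart    datetime,
    avgvalue     integer
);

-- Tier 3: 12-hour averages rolled up from tier 2, retained for 1 year
CREATE TABLE TIER3_HALFDAY (
    Metrickey    varchar(64),
    periodstart  datetime,
    avgvalue     integer
);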

Some of the head scratchers I came away with were:


  • How do you do a graph that spans time periods?
  • If you're doing post mortems, how effective is any of the data in the second or third tier?
  • How useful is CPU utilization that has been averaged to 1 hour or 12 hours?
  • How do you effectively trend multiple elements across tiers without a significant amount of development just so you can provide a report?


So, what are you gonna do?

What kind of data is it you're dealing with?

When you look at performance related data, a SIGNIFICANT part of it is simply time series data. It has a unique identifier relating it to the managed object, attribute, and instance you are collecting data against, it has a timestamp of when it was collected, and then there's the value.

So, in a relational database, you would probably set this up as:

CREATE TABLE TSMETRICS (
    Metrickey    varchar(64),   -- managed object / attribute / instance key
    timestamp    datetime,      -- when the sample was collected
    value        integer        -- the collected value
);

You would probably want to create an index on Metrickey so that you can more efficiently grab the metric values for a given Metrickey.
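
For example, a composite index on Metrickey and the timestamp keeps those per-metric range scans cheap (a sketch; the index name is arbitrary):

-- Find all samples for one metric, ordered by time, without a full table scan
CREATE INDEX idx_tsmetrics_key_time
    ON TSMETRICS (Metrickey, timestamp);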

When you consider you're collecting 10 metrics every 5 minutes for 10,000 nodes, you start to realize that the number of records adds up quickly: 288 records per metric for every day.
So 10 metrics every 5 minutes turns into 2,880 records per node per day, times 10,000 nodes, which equals 28,800,000 records per day. By the end of day 4, you're looking at crossing that 100 million record boundary.

Scaling

What if we changed the TSMETRICS structure? We could change it to:

CREATE TABLE TSMETRICS (
    Metrickey    varchar(64),   -- managed object / attribute / instance key
    starttime    datetime,      -- start of the day this row covers
    endtime      datetime,      -- end of the day this row covers
    slot1        integer,       -- 5-minute sample 1 of the day
    -- ... slot2 through slot287 ...
    slot288      integer        -- 5-minute sample 288 of the day
);

This effectively flattens the table and eliminates the duplicated Metrickey string, which saves a significant amount of repetitive record space. In effect, this is how Round Robin Data stores store metric data. But consider this: to work with it efficiently, you either have to index each column or process a row at a time.

This gets us into the 1,000-day realm! 10,000 nodes * 10 metrics each * 1,000 days = 100,000,000 records.

But the problem expands because the inserts and extracts become much more complex. And you have to go back to work on indexing.
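
For instance, a single 5-minute sample now turns into an UPDATE of one specific slot column, and the application has to work out which slot that is (a sketch; the key and values are made up):

-- One row per metric per day; the writer computes the slot as
--   slot = (sampletime - starttime) / 5 minutes + 1
-- so a 00:35 sample for this metric lands in slot8:
UPDATE TSMETRICS
   SET slot8 = 73
 WHERE Metrickey = 'node0042.cpu.util.total'
   AND starttime = '2012-06-25 00:00:00';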

Sharding

Some folks resort to sharding. What they will do is take specific time ranges and move them into their own table space. Sharding ends up being a HUGE nightmare. While it enables the DBA to control the table spaces and the number of records, getting data back out becomes another cryptic exercise in first finding the data, connecting to the appropriate database, and running your query. So, the first query is to find the data you want. Subsequent queries are used to go get that data. You're probably going to need to create a scratch database to aggregate the data from the multiple shards so that reporting tools can be more efficient.
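
In practice that multi-step lookup tends to look something like this sketch (the catalog table and shard names are made up):

-- Step 1: ask a catalog which shard holds the time range you care about
SELECT shard_name
  FROM SHARD_CATALOG
 WHERE rangestart < '2012-06-08' AND rangeend > '2012-06-01';

-- Step 2: connect to that shard and pull the actual samples
-- (repeated per shard, then merged into the scratch database)
SELECT Metrickey, timestamp, value
  FROM TSMETRICS_2012_W23              -- shard returned by step 1
 WHERE Metrickey = 'node0042.cpu.util.total'
   AND timestamp >= '2012-06-01' AND timestamp < '2012-06-08';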

Another technique folks employ is to use a data hierarchy. Within the hierarchy, you keep high-resolution data, say 7 days' worth, at the lowest level. Roll the data up from a 5-minute interval to a 1-hour interval into a second data tier. Then roll up the 1-hour data to 12-hour data in a third data tier. I actually know of a large service provider that does exactly this sort of thing.
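
With tier tables shaped like the illustrative ones sketched earlier, each roll-up is essentially a scheduled aggregate query. A sketch of the 5-minute to 1-hour step (the hour-truncation idiom here is SQL Server flavored):

-- Roll yesterday's 5-minute samples up to 1-hour averages in the second tier
INSERT INTO TIER2_HOURLY (Metrickey, hourstart, avgvalue)
SELECT Metrickey,
       DATEADD(hour, DATEDIFF(hour, 0, sampletime), 0) AS hourstart,
       AVG(value) AS avgvalue
  FROM TIER1_RAW
 WHERE sampletime >= '2012-06-24' AND sampletime < '2012-06-25'
 GROUP BY Metrickey, DATEADD(hour, DATEDIFF(hour, 0, sampletime), 0);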

Imagine mining through the data: you need to look at the CPU of 5 systems over the course of a year. How has the CPU load grown or declined over time? Now overlay the significant configuration changes and patch applications over that time period. Now overlay the availability of the 5 systems over that time.

All of a sudden, what looks like a simple reporting exercise becomes a major production effort. You have to get data from 3 separate databases, munge it together, and handle the graphing of elements whose sample intervals don't line up on the X axis, and it becomes mission impossible.
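
Even if all three tiers lived in one database (in that deployment they did not), just stitching one metric's year of history back together already looks something like this sketch, and the sample interval still changes twice partway through the result set, which is exactly what breaks the graphs:

-- A year of CPU history for one metric, stitched across the three tiers
SELECT Metrickey, sampletime AS ts, value AS val
  FROM TIER1_RAW
 WHERE Metrickey = 'node0042.cpu.util.total'
UNION ALL
SELECT Metrickey, hourstart, avgvalue
  FROM TIER2_HOURLY
 WHERE Metrickey = 'node0042.cpu.util.total'
UNION ALL
SELECT Metrickey, periodstart, avgvalue
  FROM TIER3_HALFDAY
 WHERE Metrickey = 'node0042.cpu.util.total'
 ORDER BY ts;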

Suggestions

If you're looking at moderate to heavy data volumes, consider the options you have. Do not automatically assume that everything fits effectively into an RDBMS.

The advantages of a Round Robin Data store are:


  • Space is preallocated
  • It is already aligned to time series
  • Relatively compact
  • Handles missing data


Another consideration: when you read an RRD-type store, you take a copy in memory and read from that, so your data inserts do not get blocked.

There are certain disadvantages to RRD stores as well, including:


  • Concentration of RRD stores on given controllers can drive disk IO rather high.
  • Once you start down the road of RRD store distribution, how do you keep up with the when and where of your data in a transparent manner?
  • RRD doesn't fit the SQL paradigm.
If you need SQL, why not take a look at columnar databases?

Take a look-see at Vertica or Calpont InfiniDB.

When you think about it, most time series data is ultra-simple. Yet when you do a graph or report, you are always comparing one time series element to another. A columnar database LIVES here because the DB engine aligns the data by column and not by row.
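
The kind of query columnar engines are built for is exactly that "line two metrics up on the same time axis" comparison against the simple (Metrickey, timestamp, value) shape from earlier (a sketch; the metric keys are hypothetical):

-- Compare the CPU utilization of two nodes on the same time axis.
-- A columnar engine only has to touch the three referenced columns and can
-- compress the long runs of repeated Metrickey values.
SELECT a.timestamp,
       a.value AS node1_cpu,
       b.value AS node2_cpu
  FROM TSMETRICS a
  JOIN TSMETRICS b
    ON b.timestamp = a.timestamp
 WHERE a.Metrickey = 'node0001.cpu.util.total'
   AND b.Metrickey = 'node0002.cpu.util.total'
 ORDER BY a.timestamp;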

Another thought here is massive parallelism. If you can increase your IO and processing power, you can overcome large data challenges. Go check out Greenplum. While it is based on PostgreSQL, it sets up as appliances based on VM instances. So, you start out with a few servers, and as your data grows, you install another "appliance" and go. As you install more components, you add to the performance potential of the overall data warehouse.

If you can run without SQL, take a look at the Big Data and NoSQL options like Cassandra or Hadoop / MapReduce.

Links for you:


http://hadoop.apache.org/
http://hadoop.apache.org/mapreduce/

An interesting experiment:

Run RRD or JRobin stores under Hadoop and HDFS. Use MapReduce to index the RRD stores.

I wonder how that would scale compared to straight RRD stores, an RDBMS, or a columnar database.