A Model for SQL Server Performance based on Waits

Patrick O'Keeffe

Data, data everywhere…

Project Lucy users that run Spotlight on SQL Server have been uploading SQL Server Waits data since May 2011. We now have hundreds of millions of samples of wait data from thousands of servers. We recently pushed an update to the Project Lucy home page that (amongst other things) displays a breakdown of the SQL Server population by hardware manufacturer type. This visualisation also splits the instances into two groups that display their performance ‘health’. To do this we came up with a model that calculated the performance health of a SQL Server instance based on wait data.

The Model

The Nature of the data

We are using wait statistics from SQL Server within this model.  A variant of this model is well known in Oracle circles and we wanted to see if it could be applied to SQL Server.  To do this we needed data, which is lucky because Spotlight on SQL Server Users who have opted in to uploading data to Project Lucy contribute 4 samples of wait stats data for each server that Spotlight monitors every day.  Since our users have been uploading this data since May 2011 we now have a large corpus to play with.

Modelled Variables

Currently the model evaluates three variables.

‘Good’ Waits

CPU wait and I/O Wait – the presumption being that if you are waiting on these at least you are doing ‘work’.  CPU wait is simply signal wait, which is a wait for CPU time after having waited on something else. I/O Wait consists of a wait on any of the following wait types – ASYNC_DISKPOOL_LOCK, ASYNC_IO_COMPLETION, CHKPT, DISKIO_SUSPEND, FCB_REPLICA_READ, FCB_REPLICA_WRITE, IO_COMPLETION, LOGBUFFER, LOGMGR, LOGMGR_FLUSH, LOGMGR_RESERVE_APPEND, PAGEIOLATCH_DT, PAGEIOLATCH_EX, PAGEIOLATCH_KP, PAGEIOLATCH_NL, PAGEIOLATCH_SH, PAGEIOLATCH_UP, SOAP_READ, SOAP_WRITE, WRITELOG

‘Bad’ Waits

‘Bad’ waits are the waits that tell you that you are waiting for something that is stopping you from actually doing any useful work. There are a number of categories of ‘bad’ wait:

  • Lock (LCK_M_*)
  • Latch (LATCH_*, PAGELATCH_*, TRAN_MARK_*, TRAN_MARKLATCH_*)
  • CLR (CLR_*, CLRHOST_*, SQLCLR_*)
  • Other wait types (too numerous to list)

We do exclude Network (ASYNC_NETWORK_IO, NETWORKIO) from our model, since these are more indicative of client side issues.

I/O Subsystem Response

Databases are large data pumps.  They exist to get data into and out of permanent storage as fast as possible.  Performance is definitely going to suffer if the database cannot get pages on or off the disk fast enough, so we need to include in our model a measure of the responsiveness of the I/O subsystem.  We are using Single Page Read latency which is simply the average time per wait for PAGEIOLATCH_* waits.

In English please…?

Basically in simple terms we are positing that as long as your single page read latency is low and the majority of time spent waiting is spent waiting on CPU or IO (i.e. you’re actually doing work) then you have no performance bottlenecks.

Doing the Math

Now we are going to use our model to calculate a score for a SQL Server. Currently we are using a score of above 80 for everything is OK.  Let’s use as an example, a SQL Server instance that has a single page read latency of 24 ms and spends 50% of total wait time waiting on CPU and 40% of total waiting on IO.

We start with a score of 100 for Single page read latency. This score is reduced by 20 for each increment of 5ms above 5ms.

SinglePageReadLatencyvsScore.png

In the case of our example, the score is 20 because our I/O Single Page Read latency is between 20 and 25.

We start with a score of 100 for CPU + IO.  For every increment of 5 that the sum of CPU and IO wait expressed as a percentage of total wait falls we subtract 20.

CPUPlusIOvsScore.png

Again in our case, we have figures of 50% and 40% respectively for a total of 90% for CPU + IO as a percentage of total wait.  90% falls into the 95-90 range so the second score becomes 80.

These two scores are then equally weighted to produce a single score. (20 + 80) / 2 = 50.

In this case we would argue that the performance of this SQL Server is not OK because even though it is doing work, it cannot get pages on or off the disk fast enough because the I/O subsystem is not fast enough.

 

Results

Running our algorithm over the whole corpus gave us the following distribution of performance scores.  As you can see the majority of SQL Servers in the population are OK in terms of performance.

It is interesting to note the ‘hump’ around the 50 mark.  This leads us to think that our algorithm could use some refinement.  Perhaps single page read latency is less important when there is minimal I/O occurring, so the weighting should be less for this part of score under these conditions.  Feedback and discussion on this is welcome.

PerformanceScoreDistribution.png

Real world examples

So what does a SQL Server with a ‘good’ performance score look like?  Here is an example.  This server got a score of 100. As you can see, the only wait is ‘good’ wait which indicates that the SQL Server is doing actual work and that there is no contention for other resources getting in the way.  In addition, the single page read latency is 1ms which means that there is no I/O subsystem bottleneck.

WaitScoreDist100Badge.png

What about a SQL Server with a ‘bad’ performance score?  Here is an example with a score of 30.  Note the large proportion of time spent waiting on latches.  Whilst this wait is occurring, the time available to do useful work has decreased.  Also note the large single page read latency – 22ms is an unacceptably large amount of time to wait for an I/O to occur.  This server clearly has contention issues and also I/O subsystem issues.

WaitScoreDist30Badge.png

Here is an example showing a performance score of 50.  This server has a low single page read latency which is good, but is spending a lot of time waiting on Locks and Latches.  It is possible that the I/O read latency is low because the contention on Locks and Latches means that not enough I/O is being done to saturate the I/O subsystem.

WaitScoreDist50Badge.png

In Conclusion

This model seems to be a reasonable way to quickly gauge whether a SQL Server instance has any performance bottlenecks and what those bottlenecks are.  Note that this model does not make any judgement at this stage whether the work being done is ‘useful’.  That is, even if there are no bottlenecks, the response time may not be acceptable because of application design issues.

We welcome any comment or feedback you might have.

 

Data: SQL Server Version Breakdown for Week Ending 9 September

Patrick O'Keeffe

This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and Spotlight on SQL Server uploaders and has a sample size of 960 servers – up from 890 this time last week.  No major changes, but out sample size is growing. We still see that SQL Server 2008 and SQL Server 2008 R2 represent over half the population followed by SQL Server 2005 representing nearly two fifths and with SQL Server 2000 bringing up the rear.

 

SQLVersion09-09-11.png

Here is this weeks plot of sample size over time:

SQLVersionSampleSize09-09-2011.png

Data: SQL Server Version Breakdown for Week Ending 2 September

Patrick O'Keeffe

This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and Spotlight on SQL Server uploaders and has a sample size of 890 servers – up from 355 this time last week.  The reason for this increse is that we made some changes that allowed us to consume version data from our Spotlight on SQL Server uploaders.  The picture of our population has changed somewhat with this new data.  We see that SQL Server 2008 and SQL Server 2008 R2 represent over half the population followed by SQL Server 2005 representing nearly two fifths and with SQL Server 2000 bringing up the rear.

I will be interesting to see what the trend looks like in a few weeks.

 

SQLVersion02-09-11.png

Here is this weeks plot of sample size over time:

SQLVersionSampleSize02-09-2011.png

Data: Distribution of IO Stall for All SQL Server Uploaders

Patrick O'Keeffe

This is some more data that we thought you might find interesting.

Have you ever wondered what I/O wait times SQL Server users were experiencing? Wonder no more – now we can begin to tell you.

This is the distribution of I/O Stall times across all Project Lucy users.  This data is derived from users who use Spotlight on SQL Server to upload data to Project Lucy and the sample size is 102822 distinct data and log files.  This data shows that as we would expect, the majority of IO”s performed across our user population are waiting less than 2.5 milliseconds – 2.5 ms per I/O means 400 I/Os per second.  It would be interesting to split these numbers into log and data files, but we are not in a position to do this just yet (we will be able to in a few weeks).

 

IOStallDist26-08-2011.png

By way of explanation, the source of the data for this is the sys.dm_io_virtual_file_stats DMV. Spotlight collects the io_stall_read_ms and io_stall_write_ms columns. These are then divided by the num_of_reads and num_of_writes values respectively to get the average stall time per IO.

Data: SQL Server Version Breakdown for Week Ending 26 August

Patrick O'Keeffe

This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 355 servers – up from 352 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

 

SQLVersion26-08-11.png

Here is this weeks plot of sample size over time:

SQLVersionSampleSize26-08-2011.png

Data: SQL Server Version Breakdown for Week Ending 19 August

Patrick O'Keeffe

This is the third in a series of posts that publish some data that we find interesting.

This is the SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 352 servers – up from 341 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

 

SQLVersion19-08-11.png

Here is this week”s plot of sample size over time:

SQLVersionSampleSize19-08-2011.png

Tales from the Engine Room: Bulk Updates

Patrick O'Keeffe

We have been tweaking the Project Lucy datastore again this week, and have found out a couple of things we think worth sharing.  We found out the fastest way to update a lot of rows, was to do the complete opposite of what you’d think was the right thing to do.

Let me paint a scenario – we store wait stats data coming from Spotlight on SQL Server Enterprise customers in a fact table. One of the columns in the fact table is wait_type, which contains the name of the wait. This was recognized as a bad idea some time ago, but we had other fish to fry.

Recently our piscatorial problems have come back to bite us.

Since this column is an nvarchar(100) it takes up beaucoup space, and since it is indexed multiple times, multiply that by the number of indexes. This results in a performance problem, since SQL Azure has to read lots more pages because there are less rows per page. Also we have an operational problem – not only are we are using more space that we want to (and space costs money), but any DDL operations on the table are expensive because of its size.

So since we have already gone back to the nineties by eschewing the NoSQL-ness of Azure Table Services and adopting SQL Azure in all its relational glory, we thought; why stop there? Why not travel back into the eighties and start doing a bit of data modelling?

Hence, we find ourselves creating a dim_wait_types dimension table and instead of storing the wait type directly in the fact table, we store the id of the related row from the dimension table.

wbt-er-diagram.png

So far so good.

This database stuff is easy! We just create the dimension table and populate it, create a wait_type_id column in the fact table and write an update statement that updates the wait_type_id based on the value of the wait_type column. The update looks something like this:

update sqlserver_fact_waits_by_time set wait_type_id = 620, wait_category_id = 2 where wait_type = ''XE_DISPATCHER_JOIN''

The table has close to 40 million rows in it, so a single update is out of the question. SQL Azure will throttle us based on log space long before the update completes. We have to break it down, so we decide to do it per wait type, since there are 635 SQL Server wait types (that we know about).

We start running 635 update statements….

  • We find each statement does a full table scan, so we get hold of the plan and create the appropriate index
  • We find we cannot create the index from on-prem – the CREATE INDEX statement takes too long and the socket between us the the datacenter keeps getting closed. The solution is to remote desktop to one of the roles in the running application, install SQL Server 2008R2 Express Tools and run the CREATE INDEX from there. Don”t forget to use the WITH (ONLINE=ON) option, otherwise SQL Azure will throttle you because you use too much log space
  • Now that the index has been created, running the updates can commence. Chaos ensues. Because each update is doing a bunch of physical IO, all of our queued processing (end-users upload data around the clock) starts failing with timeouts.
  • We turn queue processing off, and allow the updates to continue. Each one averages 3-5 minutes and there are 635 of them which is roughly 60hrs in total.
  • Realization dawns that we cannot leave processing turned off for 3 days, so we stop the updates and turn queue processing back on. We need to find a way to trade time for concurrency. In other words, we don”t care if the updates take 3 days, but we need to be able to process incoming data whilst the updates are going on

A Solution

We went back to the original query plan for the update and found that 90% of the cost was in actually doing the update, very little cost was extant in actually finding the rows. Also and most significantly, we found that the amount of physical IO required scaled somewhat worse than linearly with the number of rows updated. The answer was to keep the actual number of rows updated small. So now our update looks something like this:

update
sqlserver_fact_waits_by_time
set
wait_type_id = 620, wait_category_id = 2
from
sqlserver_fact_waits_by_time wbt
inner join
(select top(2500) upload_id, id from sqlserver_fact_waits_by_time
where wait_type_id is null and wait_type = ''XE_DISPATCHER_JOIN'') wbt_top
on wbt.upload_id = wbt_top.upload_id and wbt.id = wbt_top.id

This new update statement takes under 20 seconds to run and we have a command line application run these for all wait types in a loop until there are no rows updated. Whilst these updates are running, user processing continues normally.

What we Learned

  • Execute long running DDL operations from within the data centre
  • Try to look for ways that allow you to trade time for maintaining concurrency
  • On SQL Azure, doing a larger number of small updates (where each update touches less rows) took less elapsed time than doing a smaller number of large updates (where each update touches more rows)

Data: SQL Server Version Breakdown for Week Ending 12 August

Patrick O'Keeffe

This is the second in a series of posts that publish some data that we find interesting.  Two data points – a few more and we have a trend!

This is the SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 341 servers – up from 329 this time last week.  We can see that the percentage of SQL Server 2008 is rising and the percentage of SQL Server 2000 is falling – albeit slightly.

 

SQLVersion12-08-11.png

Here is the plot of sample size over time:

SQLVersionSampleSize12-08-2011.png

SQL Server Management Studio Plugin for Project Lucy

Patrick O'Keeffe

We have just shipped a plugin for SQL Server Management Studio that allows you to quickly do a trace and upload it to Project Lucy. I have recorded a short video that covers how you might use the plugin to look at some SQL for an application.

[youtube Kha1OzPE2RA nolink]

We plan to improve the plugin in the future by adding the ability to capture workload without using trace and to capture other sorts of data to help you manage the performance of your SQL Servers.

Your feedback is encouraged.

Tales from the Engine Room: Migrating to SQL Azure

Patrick O'Keeffe

We’ve been very busy here at Project Lucy aver the last few weeks.  We are “cup-runneth-over” with ideas on how to provide really cool and valuable stuff to our users – only problem was, we were getting bogged down in performance and data storage issues.

The way we were storing our data was not conducive to providing the sorts of features we wanted to provide and something had to change.  After we re-architected our approach 4 times, with still no suitable solution, we decided our only two alternatives were between doing something stupid, or going broke.

We decided that we should store our analysed data in SQL Azure.  Now we have done this, we will be able to move much faster in providing new features that expose even more interesting things in the data that our users upload.

How we used to do it

It is useful to look at how we did things up until a week ago in order to explain the changes so let’s look at an example – SQL Workload Analysis.

When we process a trace file (this applies to Spotlight on SQL Server uploads that include SQL Analysis data as well), we have a piece of code that detects a SQL event; we will use SQL:BatchCompleted as an example.  Way back when we first went live we did some on the fly aggregation of these events but didn’t really store any event data because we had no way to display it.  Fast forward to the end of Feb 2011 and we went live with the ability to explore the workload data within your upload using a grid and some ajax magic that allowed you to do multi-level grouping on the data in real-time.

In order to allow you to do grouping in real-time, we couldn’t store aggregates any more – we had to store every event.  We could have stored each event as an entity in Table Services if we were prepared to live with the latency of fetching these entities.  But we weren’t.  You see, some of Project Lucy’s users upload trace files with hundreds of thousands of SQL events and pulling 200,000 entities out of table services is two orders of magnitude slower than a wet week.  Another solution was required.

We found that when we stripped all of the text data (SQL, application name, login name, database name) out of the event data, it really wasn’t that big – certainly small enough to fit in memory.  So that is what we did; as we were processing the data we put all of the text columns in look up tables and stored the even data in a big list.  What we ended up with was a binary structure that had one big list for the event data, and numerous smaller hash tables for the category (SQL, application, database etc) data.  This we then streamed out into a blob.

When a user came to the workload page and started interacting with it, we would check to see if we had the blob that contained the data we wanted to render for the user on the web server.  If the file did not exist we would go get it from blob storage and cache it locally.  Once the blob was local to the web server, we would load it into memory and perform the query user had requested using Linq. This worked surprisingly well and the response times even from blob files that contained many hundreds of thousands of events were acceptable.

The downside of this approach was that the data was stored  one blob per upload.  This meant that a few things were out of reach.  Because a user can upload files from multiple servers in an upload, we had no easy way of getting data from just one server.

Also, we had no easy way of querying across time.  We wanted to give users the ability to look at performance characteristics over time.  In order to do this with our blob approach we would have had to load each blob in turn, get the rows we needed from it (after reading all the rows in it) and then merge those rows at the end.  So we decided to invent our own database.

No seriously. We did.  We spent some hours basking in warm delusions of our own grandeur, scribbling madly on a whiteboard.

We then came to our senses and decided that inventing our own database was somewhat foolish, and might distract us from our core objective of providing value to our users.  So we decided to use SQL Azure to store our data.   Now instead of a binary file, we store the event data in a big fact table and we store all the “category” (application, database etc) data in dimension tables.  This now means that we can start to build features that allow querying across time – which is what we are working on right now.

What we learnt

As part of our migration to SQL Azure we learnt a few things:

  • The database size limit is there to encourage you to think scale out, instead of scale up.  This is a good thing – with scale out, you are not constrained by being on one machine. You need to design your scale out strategy from the start.  We plan to use SQL Azure Federations when it ships, so our schema is optimised for this
  • For tables that you are going to federate, you need to make sure there are no database scoped functions used.  This means no identity columns.  We used uniqueidentifier with a newid() default constraint
  • Make sure you are religious about putting clustered indexes on tables.  Operations against tables without clustered indexes will work in the development environment with a local database, but when you push into the cloud it will break
  • Using table services for bulk inserts or updates involves high latency. You can only insert/modify 100 entities (rows) at a time and this takes 500ms – designs which involve big fetches of many rows from Table Services will be unsatisfactory
  • If you are going to tempt fate and use Table Services, in order to fetch data from table services quickly enough (it takes 500ms per 1000 entities fetched) you have to be very careful about the partition and key values used when storing the data. (The partition+key must be unique for each entity; and you select multiple rows by partition.) The trick is to define your partition key so you can fetch multiple rows at a time, but not too many rows. Invariably this mean the same data will need to be stored multiple times, via different partition values. For example, for SQL Server wait statistics, you might need to store data partitioned by time (to service a trend over time chart) and store the same data partitioned by wait-name so you can compare different users with each other.  As a consequence of storing the data multiple times, you end up writing a lot more (serialization-deserialization) code than you really feel you should need to
  • Tables in Table Services store entities. An entity is an object with arbitrary properties.  Just because you can store different entities in the same table doesn’t mean you should. We got clever with C# generics and got this to work and then decided it was just a rod for our own backs
  • SQL Azure is expensive (we hope this changes) make sure you are indexing just enough.  You can double your storage requirements by just creating an index – make sure that the index is useful
  • High concurrency MERGE SQL statements that update multiple rows were problematic for us in SQL Azure. We had deadlocks, no matter what you try. The only way we were able to stop the deadlock was with a WITH TABLOCKX locking hint – not really a solution.  Until Microsoft provide a way to get the deadlock graph from SQL Azure so you can really troubleshoot the problem rather than guess, we will be left with a cloying sense of dissatisfaction
  • We’ve seen really impressive IO rates from our SQL Azure database (7,500 writes per second) – when you consider that this involves replication to two slaves, color us extra impressed
  • Keep the persistence layer of your software simple. i.e. Where feasible, perform simple SQL select statements from multiple tables and then use something akin to LINQ in C# to merge and project that data. This will make life easier in future – when you decide to locate those tables within different Azure databases, or use federations. We encountered this issue. Our database grew to over 43Gb in size (50Gb is the limit, as at August 2011). We had to move one of our tables into a different database. Consequently, any SQL statements joining this table to other tables had to be achieved via LINQ instead. (There are no cross-database joins in SQL Azure)
  • Long running SQL Statements like CREATE INDEX have to be run from within the data centre.  This will drive you nuts.  The internet will conspire against you to ensure that the socket will get closed while you wait for the query to finish

In general, our experience with SQL Azure has been a positive one, and now having taken the leap, we look forward to building out the new features we have planned in Project Lucy.