Dec 182008
 

I spent time today analysing a system monitoring tool that collects a lot of information about your servers and stores the data into a table.  A lot of data is collected, and the table was currently at 23 GB with 180,000,000 rows, with only a clustered index on a non-unique GUID.  There was no primary key set, although there was also an identity column.

A delete process runs nightly, and deletes anything older than 30 days.  This delete process was highly inefficient, and was timing out after 10 minutes, causing the transaction to roll back, and nothing deleted.  Additionally, during these 10 minutes, the table was locked and inserts couldn’t not be performed.

The code for the deletion looked like this (although this query is from AdventureWorks, and 2500 days is needed for demonstration purposes):

DELETE FROM Sales.SalesOrderHeader 
WHERE datediff(day, ModifiedDate, getdate()) >= 2500 

If you examine the query, you’ll notice two things.  Firstly, as there is no index on ModifiedDate, we’ll need to perform a table scan.  That’s fixable with a non-clustered index:

CREATE NONCLUSTERED INDEX ncix_moddate ON
    Sales.SalesOrderHeader(ModifiedDate)

We’ll ignore the fact that our new index on 180,000,000 rows will take up 4.8 GB – a fast database is a large database!  Changing the clustered index to ModifiedDate, GUID would remove the need for the uniquifier and the new index.

The other thing you can notice about this query is that each row must process the datediff function to determine the number of days difference between the ModifiedDate and the current date.  180,000,000 function calls, and this can’t be indexed.

A much better way of writing this query would be to determine the boundary date.  Any rows earlier than this date will be deleted.  By checking the ModifiedDate against this boundary date (which is calculated once at the start of the query), the new index on ModifiedDate can be used to find the rows to delete.

DELETE FROM Sales.SalesOrderHeader 
WHERE ModifiedDate <= dateadd(day, -2500, getdate())

If we compare the execution plans, we can see the following:

image

Note that I’m using a SELECT instead of a delete, as AdventureWorks has cascading deletes on the Sales.SalesOrderDetails table.  The first query takes 61 reads – an entire index scan of my new index, whereas the second query takes just 6 reads – seeking to those parts of the index where the date matches.

The only bad news now is that I can’t modify this code, as it appears to be run by a service and not in a stored procedure.  I’m hoping there will be an option to turn off purging the table, and I can run my own delete command.

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>