Mar 28 2011
 

A colleague mentioned that he received a warning while building indexes online the other day, so I decided to check it out. A quick search in sys.messages for message text LIKE '%online%index%' found the following:

Warning: Online index operation on table ‘%.*ls’ will proceed but concurrent access to the table may be limited due to residual lock on the table from a previous operation in the same transaction.
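
If you want to run the search yourself, it's a one-liner against sys.messages (1033 being the language_id for English):

-- Find messages mentioning online index operations
SELECT message_id, severity, text
FROM sys.messages
WHERE language_id = 1033
   AND text LIKE '%online%index%'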

The warning message is fairly clear – if you have a transaction open with existing locks, you'll get this warning, letting you know that your online index rebuild might not be as online as you think it will be.

As a test, let’s try the following code on AdventureWorks2008:

BEGIN TRAN
UPDATE Person.Person SET FirstName = 'Kenneth' WHERE BusinessEntityID = 1
-- Rebuild an index on this table
ALTER INDEX IX_Person_LastName_FirstName_MiddleName ON Person.Person 
   REBUILD WITH (ONLINE = ON)
ROLLBACK

In this case, we get the warning message due to the five extra locks taken as part of the UPDATE command. While this index is being rebuilt, other sessions trying to access that row will be blocked. This is a good warning.
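If you want to see those locks for yourself, sys.dm_tran_locks will list them – run something like this in the same session, inside the transaction and before the rebuild:

-- List the locks held by the current session
SELECT resource_type, resource_associated_entity_id,
   request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID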

However, what happens if we try to rebuild an index on a different table? Ideally, I’d want the same sort of warning – if I have any locks that could cause blocking, I’d like to know immediately. The following code will test that:

BEGIN TRAN
UPDATE Person.Person SET FirstName = 'Kenneth' WHERE BusinessEntityID = 1
-- Rebuild an index on a completely different table
ALTER INDEX IX_Customer_TerritoryID ON Sales.Customer 
   REBUILD WITH (ONLINE = ON)
ROLLBACK

The result is that we get no warning. Rebuilding an index on the Sales.Customer table has nothing to do with the Person.Person update we performed, so we miss out on the warning.

This goes against what I'd ideally like. In this case, access to the row modified in Person.Person will block until the index rebuild has finished and the transaction ends.
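
It's easy to demonstrate. While the transaction above is still open and the rebuild is running, run this from a second session and watch it block:

-- Session 2: this seek hits the X-locked row in Person.Person,
-- and blocks until session 1's transaction commits or rolls back
SELECT FirstName
FROM Person.Person
WHERE BusinessEntityID = 1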

This is not much of an issue, though. I can't think of many situations where I'd want to rebuild an index inside a larger transaction – it's the type of task that's more likely to be run in its own narrow, constrained transaction.

Mar 15 2011
 

Every couple of years, I pull out a particular book that’s been sitting on my bookshelf for the past decade, read it, and am suddenly refilled with the passion of computers and computing science. This book is Code: The Hidden Language of Computer Hardware and Software, authored by Charles Petzold.

This book is perfect for those who never did an EE or CS degree at university, and want to know exactly how and why computers work. I'm sure a lot of DBAs fall into this category! Petzold takes us on a wonderful journey through how communication is encoded into our lives, focussing particularly on digital codes – binary. Starting with the simplest concepts (communicating simple messages with flashlights and Morse code over short range), new concepts are introduced in simple, logical steps. From the flashlights, electricity is introduced, and a telegraph system is designed.

From Morse code, we are introduced to Braille and to UPC barcodes, all tied together with the theme of binary logic – the information can be coded by a series of ones and zeroes.

The book slowly builds up these concepts in an easy-to-follow fashion, using telegraph relays to build logic gates, through to a machine that can add two binary numbers, finally culminating in a design for a theoretical general-purpose, programmable computer.

From there, the concept of a transistor is introduced, the Intel 8080 and the Motorola 68000 CPUs are examined in detail, and it’s then a whirlwind of fast logical steps from machine code, to assembly, to higher level languages, operating systems and file systems (with a reasonably detailed look at CP/M). Finally, modern (well, 1999) graphical operating systems are examined, along with how various types of information are stored – graphics formats, sound formats, video, and HTML.

The book doesn’t go into networking protocols (except for a brief couple of paragraphs on modems), but it is easy to extrapolate how computers can talk to one another from the previous concepts.

All along the way, Petzold sprinkles in the history of computing, and the entire story is woven in a very engaging tone. Every time I read this book, I’m reawakened to the beauty and underlying simplicity (underlying – there’s nothing simple about 10 million transistor processors!) of computing.

Highly recommended!

Mar 08 2011
 

The subject of T-SQL Tuesday #016, kindly hosted by Jes Borland (blog | twitter), is Aggregation. Although it’s a fine topic, I could not figure out what I could write about aggregation that would capture my interest, and more importantly, dear reader, yours.

I thought about the topic ceaselessly while doing dishes, feeding children, and cleaning up the house. I considered aggregate functions, and optional grouping parameters. I twisted the topic inside and out, trying to figure out an interesting angle. I considered telling a story about the Tuples vs the Aggregates on their way to Query Station (and spent an embarrassingly long time on that train of thought). Finally, I went out for a run, thought of a topic pretty quickly and started turning the idea over in my mind for the next 7 km.

Columnstore Indexes

A columnstore index is a new feature of Denali, whereby an index is stored vertically – one column per page. Here is the whitepaper download – highly recommended, and only 5 pages of text. Imagine a row of a typical covering index (in this case, AdventureWorks' Person.Person table, and I've covered every column):

[Image: a single row of the covering index, with every column stored together on one page]

Every row stored in this particular index is stored on the same page. As each row is quite wide, only 5 rows will fit per 8 KB page.
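
For concreteness, the sort of index I have in mind looks something like this (the index name is mine; the clustering key, BusinessEntityID, is carried along automatically):

-- Key on the name columns; every remaining column dragged in via INCLUDE
CREATE NONCLUSTERED INDEX IX_Person_CoverEverything
ON Person.Person (LastName, FirstName, MiddleName)
INCLUDE (PersonType, NameStyle, Title, Suffix, EmailPromotion,
   AdditionalContactInfo, Demographics, rowguid, ModifiedDate)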

A columnstore index, however, stores each column on a separate set of pages. I am unsure of the ordering of the rows* but each page is compressed. As the column is likely to contain similar data, this means that each page can have excellent compression ratios – a factor of 4-15 is quoted in the whitepaper – and thousands of rows can fit on a single page.

* Due to a lack of internals documentation so far. Are they ordered by the column, or by the clustering key? I can see arguments for either way, but haven’t yet dived deep enough to decide one or the other.

This allows for fantastic performance gains when performing aggregates, as fewer pages need to be read to service the query, and only the columns actually required by the query need to be accessed.
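
For reference, creating one looks something like this – the syntax is from the Denali CTP documentation, and the index name and column list are my own:

-- One columnstore index covering the columns of interest
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Person_ColumnStore
ON Person.Person (BusinessEntityID, FirstName, MiddleName, LastName, ModifiedDate)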

However, because columnstore indexes are read-only, they really only have applications in data warehouses.

Can we reproduce columnstore indexes in SQL Server 2008?

The question of whether it is possible to reproduce columnstore indexes in SQL Server 2008 popped into my mind, and the short answer is “No, not a chance”. Sorry.

Let’s look at the properties of a columnstore index.

Read Only – SQL Server 2008 can certainly build index structures on read-only data. And, as it's read only, separate aggregation steps can be created in 2008 to summarise data ahead of time (there's a sketch of this after the list). However, one of columnstore's selling points is to remove the need to aggregate. Columnstore gives you a massive performance benefit with little effort.

Highly Compressed – SQL Server 2008 can compress indexes. However, to get highly compressible indexes similar to columnstore's, you'll want only a single column per index, which makes the index less useful, and forces you to join frequently.

Only select required columns – A query accessing a columnstore index will only access those columns that are needed by the query. To get the same sort of functionality in 2008, you would need to either create a covering index for each query you want (nothing like columnstore!), or join multiple indexes together (resulting in a large number of logical reads).

Different table structure – Denali is able to use a completely different page structure, particularly as this data is read-only. This is not possible in SQL Server 2008 while remaining usable from T-SQL.

No code changes – Denali’s implementation allows us to create a single columnstore index on all columns in the table, and we don’t need to modify any queries to use these columnstore indexes. It’s completely seamless.
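
To make this concrete, here's roughly what the closest 2008 equivalents look like (object names are mine, and neither approach gets anywhere near columnstore's compression or seamlessness):

-- 1. Pre-aggregate with an indexed view: the rollup is materialised
--    and maintained automatically, but only for this one grouping
CREATE VIEW dbo.CustomerCountByTerritory
WITH SCHEMABINDING
AS
SELECT TerritoryID, COUNT_BIG(*) AS CustomerCount
FROM Sales.Customer
GROUP BY TerritoryID
GO
CREATE UNIQUE CLUSTERED INDEX IX_CustomerCountByTerritory
ON dbo.CustomerCountByTerritory (TerritoryID)
GO

-- 2. A single-column, page-compressed index: the nearest thing to one
--    column "segment", but every extra column means another join
CREATE NONCLUSTERED INDEX IX_Customer_TerritoryID_Compressed
ON Sales.Customer (TerritoryID)
WITH (DATA_COMPRESSION = PAGE)
GO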

I tried to create an example where we could create similar compressed indexes in 2008, but I was never able to merge multiple indexes together to query the data efficiently. Denali has specific optimisations available for columnstore indexes, and these cannot be reproduced. About the best we can do is to create a read-only compressed covering index, which is nowhere near as impressive.

After a few hours considering this, my conclusion is that you are better off optimising your data warehouse for those queries that you know are going to come, and ad-hoc queries will need to continue to run slowly, at least until you've upgraded to Denali. Frequently run queries will benefit from extra customisation (pre-aggregation, specialised indexes with compression), but there's no chance of a similar payoff inside the database engine.

Not too long to wait now, and you’ll be able to take advantage of columnstore indexes!

Mar 01 2011
 

It's finally happened. The table you created five years ago with a clustered integer IDENTITY column is about to hit 2^31 values (2,147,483,648). Five years ago, you were inexperienced with SQL Server, and had no idea that this table would grow this large – or that you'd be around to see it!
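
Before doing anything drastic, it's worth confirming exactly where the identity value currently sits (LargeTable being our hypothetical table):

-- Returns the last identity value generated for the table
SELECT IDENT_CURRENT('LargeTable') AS LastIdentityValue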

First, you reseed the IDENTITY value, back to -2^31.

DBCC CHECKIDENT (LargeTable, RESEED, -2147483647)

This should allow you to get another couple of years out of the table, and you will have enough time to implement a better table structure.

A flash of inspiration

A few hours later, you're relaxing after work, quietly pleased about the bullet you've dodged today. A thought suddenly strikes you like a lightning bolt. You recall that inserts into a clustered index work best when inserting ever-increasing values at the end of the index. Could you have made a massive error in judgement by forcing all inserts to happen just before identity value 1?

While you toss and turn about this all night, feel free to consider the issue before reading on. What do you think will happen? Will you have an enormous number of page splits as you constantly run into identity value 1? This is obviously a frequently inserted table, given that it has had 2 billion rows added in the past five years (1.1 million per day).

Testing the hypothesis

Here's a nice block of code to test the problem. We create a table whose rows are just under 1,000 bytes long, so that 8 rows fit per page, and seed it with 1,000 rows.

CREATE DATABASE IdentityTest
GO
USE IdentityTest
GO

-- Create a test table, just under 1000 bytes per row. We can fit 8 rows per page.
CREATE TABLE T (i INT IDENTITY PRIMARY KEY, j char(960))
GO

SET NOCOUNT ON

-- Backup the database so we have a consistent log to examine
ALTER DATABASE IdentityTest SET RECOVERY FULL
BACKUP DATABASE IdentityTest TO DISK = 'C:\Work\IdentityTest_Full.bak' WITH INIT
GO

-- Seed the table with 1000 rows with identity ranges from 1-1000
INSERT INTO T (j) VALUES ('AAAAAAA')
GO 1000

-- Reseed back to a large negative number
DBCC CHECKIDENT (T, RESEED, -2147483647)
GO

-- Add another 100 rows with negative identities counting towards positive
INSERT INTO T (j) VALUES ('BBBBBBB')
GO 100

-- Examine the transaction log to see page split operations.
-- Note the "CheckIdent" Transaction Name around row 1136 where we reseeded.
SELECT [Transaction Name], * FROM fn_dblog(null, null)
WHERE [Transaction Name] IS NOT NULL
ORDER BY [Current LSN]
GO

-- Cleanup
USE master
GO
DROP DATABASE IdentityTest
GO

We can now clearly see the page split pattern in the transaction log. As expected, during inserts prior to the re-seeding, we have a pattern of 8 inserts before a Page Split occurs, signalling that no more rows will fit onto this page.

[Image: fn_dblog output, showing a Page Split transaction after every eight inserts]

Analysis of the results

After we run the CheckIdent operation (row 1136), we can see that the next insert operation causes a page split, because we are trying to insert onto the first page of the clustered index. This already has 8 rows on it (IDs 1-8), so ID -2,147,483,646 won’t fit.  The new page ends up with the new row, plus IDs 1, 2, and 3. This leaves room for four more rows to be inserted.

As expected, the fifth insert after the CheckIdent (row 1142) causes another page split. This time, the four negative ID rows are moved to one page, and rows 1,2,3 are on the other page.

So far, the pattern seems bad. We’ve had two splits after inserting five rows, and we get another split on the ninth insert (row 1147). In this case, however, we have not bumped up against the positive IDs. We currently have a single page holding the previous 8 negative ID rows, and the ninth won’t fit at the end, so a new page is allocated with this single row.

Now we’re into a stable state. Every 8 inserts, a new page is allocated, and the inserts never again touch the pages containing the first positive rows.
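
If you'd rather count the splits than eyeball the log output, a quick aggregate does the job – the split transactions show up in fn_dblog under the name 'SplitPage' (run this before the cleanup batch drops the database):

-- Count the page split system transactions in the log
SELECT COUNT(*) AS PageSplits
FROM fn_dblog(NULL, NULL)
WHERE [Transaction Name] = 'SplitPage'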

Conclusion

The results of this test are pretty clear – the page splitting algorithms will quickly re-stabilise after re-seeding. You’ll no longer be inserting into the end of the clustered index B-tree, but you will be inserting into a consistent point, which is separated from the start of the values, and so you should not see any performance degradation.

You were stressing for nothing. Relax!