Category: Articles

How I Learn – T-SQL Tuesday #008

(It’s T-SQL Tuesday #008 – Gettin’ Schooled)

I learn by doing, and by teaching.

Studies have shown that the best way to learn a topic is to teach it to someone else.  I agree wholeheartedly with this – you don’t really know a topic until you’ve had to put it into your own words and get someone else to understand.  Helping out people on the MSDN SQL Forums and the SQLServerCentral forums is a great way of learning.  It’s a very ad-hoc method, as there is no guarantee what you’ll be looking at on any particular day.  Although I might not know the answer, a well written question will pique my interest, and, as one of my strengths is researching how to do things with SQL Server, I’ll attempt to ferret out the answer.  This results in a deeper understanding for me, and (hopefully) a thankful person on the other end. 

Although helping completely unknown people on the Internet can be fun, it’s a lot more satisfying when helping in person, either through teaching courses, giving presentations at user groups, or one-on-one mentoring.  These require you to know the topic thoroughly up front, as there is much less of an opportunity to dart off to Books Online.

I don’t read too many SQL Server books anymore, with a few notable exceptions, such as the Inside SQL Server series, and the SQL Server MVP Deep Dives. These are highly recommended due to their deep technical nature.  The MVP Deep Dives is especially interesting, as it contains a wide range of topics about what MVPs find interesting, as opposed Books Online worded differently.  (This is not to bag authors – there’s definitely an audience for well written books – I’m just happy with Books Online!)  This is a very similar type of format to podcast interviews.  I don’t recall how many different times I’ve heard Paul and Kim go over the same material once more, but it’s always an entertaining listen!  With 90 minutes on a train each day, podcasts are quite useful, as long as they’re not dry technical topics.  Videocasts are not my thing as I rarely have the opportunity.

I keep up with blogs (thank you, Google Reader!) to see what current ideas are floating around, but it’s necessary to filter them – I don’t have time to read every blog post in detail, although many are deserving of that attention!  Instead, I’ll flick over the content to get a feel for the topic, and keep it in mind for later reference.  Blogs can be quite handy when searching, but it’s always worth remembering not to just blindly follow advice given.  Think through the offered steps and consider whether it makes sense before trying it out on your production system.

I believe in the value of certifications, although only as a supplement to experience.  I would love the opportunity to do the SQL Server MCM course as it appears to be an excellent test of all areas of SQL Server, but the wife and kids will insist on spending three weeks in Seattle!

If I had to pick one method of learning that I believe is optimal, I would choose mentoring.  It’s always important to have a mentor, even if you’re considered an expert, if only to bounce ideas off.  And it’s fantastic to give back by mentoring others.

Ors belong in Boats, Not Where Clauses

OK, stop groaning over the title of this post.  It’ll probably be the best pun you read until you flick over to a slightly wittier SQL blog.

I’ve recently been upgrading an application from SQL Server 2000 to SQL Server 2005, and analysing performance between the two.  A common technique with this application is to create reporting stored procedures that have many different parameters, and allow the user to enter as few, or as many as they like.  (And then the text strings get a ‘%’ attached at each end, and thrown to a LIKE operation.)

For example, consider the following stored procedure:

USE AdventureWorks2008
GO

CREATE PROCEDURE dbo.TestOR
      @PersonType nchar(2)
    , @FirstName nvarchar(50)
    , @LastName nvarchar(50)
AS
    SELECT *
    FROM Person.Person
    WHERE (PersonType = @PersonType OR @PersonType IS NULL)
        AND (FirstName = @FirstName OR @FirstName IS NULL)
        AND (LastName = @LastName OR @LastName IS NULL)
GO

EXEC dbo.TestOR @PersonType = 'EM', @FirstName = null, @LastName = null
EXEC dbo.TestOR @PersonType = 'EM', @FirstName = 'Rob', @LastName = null
EXEC dbo.TestOR @PersonType = null, @FirstName = null, @LastName = 'Caron'
GO

You can see that the driving force here is the pattern (FirstName = @FirstName OR @FirstName IS NULL).  This means that if you do not supply a value for @FirstName (or set it to NULL), then the second part of the OR will always return TRUE, and so all rows will be selected, cancelling out the need for first part.

This appears to be a very good method of creating a single stored procedure that can flexibly take many different parameters.  It probably performed quite well in development too, until the amount of data increased.

Let’s have a look at the execution plan:

image

Oh dear.  Table scans.  This example is only using a single table, but you can imagine what would happen if search parameters could be in multiple tables.

An additional problem with this method is that of parameter sniffing – if the query is initially run with a LastName only, then the execution plan will be optimised for a LastName – and this may not be appropriate for the next execution.  This can be demonstrated by running “EXEC sp_recompile dbo.TestOR”, and then running query #3, then #2, then #1.  The execution plans are the same for all three, but the plan has changed, as a different index has been used:

image

There are a few ways to fix this.  One is to create a different path for each combination of variables that are optional.  However, this rapidly becomes unwieldy – for this example, we would need nine different paths, all with very similar code!

Another option is to move to dynamic SQL:

CREATE PROCEDURE dbo.TestDynamic
      @PersonType nchar(2)
    , @FirstName nvarchar(50)
    , @LastName nvarchar(50)
AS
    DECLARE @sql nvarchar(max)
    DECLARE @params nvarchar(max)
    SET @sql = 'SELECT *
                FROM Person.Person
                WHERE (1=1) '
    IF @PersonType IS NOT NULL
        SET @sql = @sql + ' AND PersonType = @PersonType '
    IF @FirstName IS NOT NULL
        SET @sql = @sql + ' AND FirstName = @FirstName '
    IF @LastName IS NOT NULL
        SET @sql = @sql + ' AND LastName = @LastName '

    SET @params = '@PersonType nchar(2), @FirstName nvarchar(50), @LastName nvarchar(50)'

    EXEC sp_executesql @sql, @params, @PersonType = @PersonType
              , @FirstName = @FirstName, @LastName = @LastName
GO

Let’s have a look at the new execution plans:

image

Much better!  We now have three different execution plans, and three different SQL statements being executed.

There are a number of benefits to switching to dynamic SQL in this case:

  1. The query is simpler.  This means that the query plans are likely to be more stable – there is less change of a bad plan being generated.
  2. Each combination of parameters will get its own execution plan, and this will be stored in the cache – in this case, we could have nine different plans.
  3. The code is easier to maintain.  It’s a little harder to read, but you only have a single copy of the query – it’s just built up along the way.
  4. The users still get to have their flexible interface – very important when the upgrade is supposed to change as few things as possible, functionality-wise.

Myth: LocalSystem has no access to Network

A common myth is that the LocalSystem account has no access to networked resources, and so you may have trouble getting SQL Server to backup to remote locations.

Excusing the fact that running SQL Server or IIS as LocalSystem is not a best practice, it is still possible to connect to networked resources. This is done by the domain account DOMAIN\ComputerName$. For example, if my server SQL01 was on the COMPANY domain, there will be an account in Active Directory named COMPANY\SQL01$. Whenever a service running as LocalSystem attempts to connect remotely, it will attempt to use this account. The restriction is that the server must be in a domain – a workgroup will not cut it.

A recent example of where this came in handy recently was an IIS installation that called Crystal Reports, which ran using System DSNs to connect to the database. These DSNs were configured to use a SQL login with no password. This worked quite well, until the security was tightened, and the SQL login was to be given a password. This then caused every report using the DSN to prompt the user for a password, as System DSNs cannot store passwords. Possible solutions were to change all 100+ reports to File DSNs (and protect them well, as the password would be stored in plain text inside the DSN). Alternatively, the System DSN could be modified to log in using Windows Authentication.

As IIS was running as LocalSystem, the initial attempt was to change this to be a domain account. Unfortunately, however, many security issues appeared, particularly with some versions of Internet Explorer. Allowing the DOMAIN\ServerName$ account limited access to the database removed the reliance on the insecure SQL login, and avoided a lot of rework.

As mentioned earlier, it is a better practice to use a dedicated domain account, but this technique may be useful in a pinch.

The Effects of Dropping a Database on the Proc Cache

While looking for a detailed explanation of cache-store flush messages in the SQL Server ERRORLOG, I came across this page: http://blogs.msdn.com/b/sqlprogrammability/archive/2007/01/17/10-0-plan-cache-flush.aspx, which refers to dropping a database causing the procedure cache to be flushed.

As someone that occasionally creates separate databases to hold temporary data, or a subset of an entire database, I did some investigations.

Script 1:

SELECT * FROM sys.dm_os_memory_cache_counters
WHERE name in ('Object Plans', 'SQL Plans', 'Bound Trees')
SELECT * FROM sys.dm_exec_cached_plans

This code simply reports on the state of the procedure cache at the server level.

Script 2:

CREATE DATABASE TestDB
GO
DROP DATABASE TestDB
GO
SELECT * FROM sys.dm_os_memory_cache_counters
WHERE name in ('Object Plans', 'SQL Plans', 'Bound Trees')
SELECT * FROM sys.dm_exec_cached_plans

We create a database, drop it, and then execute the queries from Script 1.

Conveniently, I have two development instances (2005 and 2008) with detailed plan caches that I have no problems with potentially flushing.  In SQL Server 2005, we get the following:

Before (Script 1):

image

After (Script 2):

image

As (unfortunately) expected, the procedure cache is gone.  Now let’s try SQL Server 2008:

Before:

image

After:

image

No change!  Fantastic! The obvious conclusion is that you can drop or detach databases as much as you like in SQL Server 2008, but you may want to be aware of the potential effect you have on the server when using 2005.

Disabling Indexes Before Rebuilding

It is a common practice to rebuild indexes frequently in order to improve SQL Server performance.  The problem with rebuilding indexes is that you need to have space inside the data file to hold the index currently being rebuilt.  This means that a 5 GB index will require an additional 5 GB of space.  The subsequent problem here is that when the operation is over, the database will appear to have 5 GB of free space, and the DBA might decide to shrink the database (a bad idea, as this will re-fragment the data file).

A potential solution, for those building non-clustered indexes offline, is to first disable the non-clustered index.  A disabled index is akin to dropping the index, but keeping the index definition. After an index is disabled, it must be rebuilt before it can be used, as SQL Server has no way of knowing how many, or which, rows were inserted, updated, or deleted during the period the index was disabled.

This means that a 5 GB index can be rebuilt in place, using the same 5 GB.  The operation may be a little bit slower, and temporary space (either in TempDB or the current database) will be needed to re-sort the index (rather than base it off the current “live” copy of the index, as there is none), but it removes the requirement for the data file to have enough empty space to hold a second copy of the index.

Note that this only applies to non-clustered indexes, as disabling a clustered index will result in the entire table being unavailable, and all non-clustered indexes disabled. Ideally, clustered indexes will be based on a small ever-increasing clustered key, which will greatly reduce the need to ever de-fragment the non-clustered indexes (although there are cases, for example when the row is initially inserted very small (due to variable length columns), and then later updated to populate the variable length columns, widening the row and causing it to not fit on the existing page, forcing a page split).

While this may be a useful technique to avoid data file growth, I would worry about an environment that does not have sufficient working room to rebuild the largest non-clustered index.

Example:

IF EXISTS (SELECT * FROM sys.databases WHERE name = 'IndexRebuildTest')
    USE MASTER
    DROP DATABASE IndexRebuildTest
GO
CREATE DATABASE IndexRebuildTest
GO
USE IndexRebuildTest
GO

-- Force the transaction log to grow
ALTER DATABASE [IndexRebuildTest] MODIFY FILE ( NAME = N'IndexRebuildTest_log'
     , SIZE = 51200KB )
GO

CREATE TABLE t1 (
      i int IDENTITY
    , UniqueID UNIQUEIDENTIFIER DEFAULT newid()
    , c CHAR(1000)
    , CONSTRAINT pk_t1 PRIMARY KEY (i)
)
GO
CREATE UNIQUE NONCLUSTERED INDEX ncix_UniqueID ON t1(UniqueID) INCLUDE (c)
GO

SET NOCOUNT ON
GO

INSERT INTO t1 DEFAULT VALUES
GO 10000

-- Note that sp_spaceused reports the database size including transaction log.
EXEC sp_spaceused @updateusage = N'TRUE'

image

We have an 80 MB database –  but 50 MB of that is the transaction log.  The data is about 12 MB, and the index is 18 MB – quite a large, fragmented index!

ALTER INDEX ncix_UniqueID ON t1 REBUILD WITH (SORT_IN_TEMPDB = ON)
GO
EXEC sp_spaceused @updateusage = N'TRUE'

image

Now, our database has grown to 91 MB – an increase of 11 MB.  We’ve also experienced a shrinking in the size of the index due to the rebuild, saving 6 MB.

Let’s run the first listing again (to reset the database size back to 80 MB), and then try disabling the index first.  Note that the 18 MB currently being taken by the index is immediately released when the index is disabled.

ALTER INDEX ncix_UniqueID ON t1 DISABLE
GO
ALTER INDEX ncix_UniqueID ON t1 REBUILD WITH (SORT_IN_TEMPDB = ON)
GO
EXEC sp_spaceused @updateusage = N'TRUE'

image

There you have it.  The database is still 80MB (it fluctuates slightly, depending on how fragmented in the index is each time we reset the database), and there is 6 MB of unallocated space – which is the savings from de-fragmenting the index.

Again, this primarily only useful for rebuilding offline – the index will be unavailable while it’s disabled.  If you do have 2005/2008 Enterprise Edition, you could rebuild online and at least have the table available during the rebuild, but the index will not be.  (And if you do have Enterprise Edition, forking out another $2000 for extra disk space won’t be an issue!)

SQL Server Agent Scheduler Has No AM/PM Selector

I recently came across a situation at a client’s site where I was unable to choose whether a SQL Agent schedule would be AM or PM.  Here’s a screenshot:

image

There is no AM/PM option, and clicking in the area of the control where it should be does nothing.  The first part – the hour – only goes from 1 to 12.  This means it’s only possible to schedule the job to run between midnight and 11:59:59am.

The cause of the problem was in the Standard Operating Environment settings for the organisation, where their default regional settings have the time set to h:mm:ss.  This displays times in 12-hour (instead of 24 hour), and does not display an AM/PM symbol.  This can be solved by changing the regional settings (via the Regional Settings Control Panel) time format on the computer to H:mm:ss (24 hour, no AM/PM) or h:mm:ss tt (12 hour, with AM/PM).  After this is done, reopen the schedule dialog box, and all is well. 

image

This is an issue with both Management Studio 2005 and 2008, running on Windows XP, Vista and Server 2003, but I think it’s rare that the regional settings will be configured in such a way.

The Difference Between Index Seeks and Scans

When is an index scan not a scan?  When it’s a seek!

Consider the following query:

USE AdventureWorks
GO
SELECT * FROM Production.TransactionHistory

As there is no filter on the table, you would expect to see an index scan.  Turn on the Actual Execution Plan, and SET STATISTICS IO ON, and have a look at the results:

image

(113443 row(s) affected)
Table 'TransactionHistory'. Scan count 1, logical reads 792,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.

Just as expected.  Let’s now consider the following query:

SELECT * FROM Production.TransactionHistory WHERE TransactionID >= 1
(113443 row(s) affected)
Table 'TransactionHistory'. Scan count 1, logical reads 792,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.

image 

The query returns the same set of rows, and has identical performance, but shows up in the execution plan as a seek!

The take-home message here is that just because you can see an index seek in your query does not mean that performance is great.  It means that the table was accessed using a filter you’ve placed on the query.  A lesser message, but sometimes overlooked is that a seek can “scan” a long range of values, if each of the values matches the filter.

The Potential Dangers of the Read Committed Snapshot Isolation Level

The Snapshot Isolation Level and Read Committed Snapshot features were a big improvement when SQL Server 2005 came out – finally, SQL Server had optimistic locking as well as pessimistic locking!  This is a feature that allows writers not to block readers, and readers will not block writers – instead, the readers will look at the most recent row, and ignore the fact that it’s currently being written to.  This sounds great on first inspection – long running transactions won’t block other transactions! If you accidentally have a BEGIN TRANSACTION but you haven’t followed up with a COMMIT or a ROLLBACK, other users won’t be blocked.

It recently came up in discussion that you can just turn on Read Committed Snapshot, and a lot of your blocking will be a thing of the past.  While this is true, I strongly discourage turning on Read Committed Snapshot for existing applications unless you know exactly how your application will respond to it.

Read Committed Snapshot is a modification to the Read Committed Isolation level that uses row versioning to read the previous value.  The trouble comes when you realise that Read Committed Snapshot is the default isolation level, and once you’ve turned it on, every single read committed transaction will run as a snapshot. 

If the developers of your application expect that Read Committed will be the default (back when they wrote the application when SQL Server 2000 was the latest version, this was a reasonable assumption), then changing Read Committed’s behaviour can have unintended consequences.

Let’s look at an example.  We’ll create a new database, containing a single table, Inventory.  This is a simple stock control system, and we’ll be looking at the procedure to check out an item. It’s worth noting that there are better and safer methods of writing the stock checkout function to not have this issue (such as using Repeatable Read), but it’s very possible that the developers of your application used something similar, and it worked “good enough”.

CREATE DATABASE ReadCommittedSnapshotTest
GO

USE ReadCommittedSnapshotTest
GO

CREATE TABLE Inventory (
      ItemID int NOT NULL
    , LocationID int NOT NULL
    , Qty int
    , PRIMARY KEY (ItemID, LocationID)
)
CREATE NONCLUSTERED INDEX ncixLocationItem ON Inventory(LocationID, ItemID)
GO

Next, we’ll insert some data.

SET NOCOUNT ON
INSERT INTO Inventory (ItemID, LocationID, Qty)
VALUES(rand() * 1000 + 1, 1, rand() * 40 + 1)
INSERT INTO Inventory (ItemID, LocationID, Qty)
VALUES(rand() * 1000 + 1, 2, rand() * 40 + 1)
INSERT INTO Inventory (ItemID, LocationID, Qty)
VALUES(rand() * 1000 + 1, 3, rand() * 40 + 1)
GO 2000
-- Ignore any key violation errors - we'll still get enough data
INSERT INTO Inventory (ItemID, LocationID, Qty) VALUES (796, 1, 5)

For the next part, we need to run the following query on two separate sessions at the same time.  I’ve added a 10-second delay in processing the transaction – you can imagine that there may be other processing required to complete the checkout, and this may take a second or two.  Ten seconds is a little unrealistic, but provides enough time to run the query in the other window.

The way this transaction works is that we begin a transaction, and read the number of items that are in stock at the moment.  If there are more than we want to take, we update the Inventory.

BEGIN TRANSACTION
DECLARE @QtyRequired int, @QtyRemain int
SELECT @QtyRequired = 4
SELECT @QtyRemain = SUM(QTY) FROM Inventory WHERE ItemID = 796 AND LocationID = 1
IF @QtyRemain - @QtyRequired >= 0
BEGIN
    UPDATE Inventory SET Qty = Qty - @QtyRequired
    WHERE ItemID = 796 AND LocationID = 1
    -- Do other stuff in other tables or databases to check out the item
    WAITFOR DELAY '00:00:10'
    SELECT 'Checkout complete'
END
ELSE
    SELECT 'Not enough qty!'
COMMIT TRANSACTION

SELECT * FROM Inventory WHERE ItemID = 796

If you run the two queries together, you’ll notice that both queries take about 10 seconds to run, and the one that ran first will report “Checkout complete”, and the other will report “Not enough qty!”.  This is good – the second query was blocked until the first was finished.

Let’s turn on READ COMMITTED SNAPSHOT and see what the effect will be.  First we’ll replace the items so we can run the same query again.

-- Replace the inventory
UPDATE Inventory SET Qty = 5 WHERE ItemID = 796 AND LocationID = 1

-- Turn on READ_COMMITTED_SNAPSHOT
ALTER DATABASE ReadCommittedSnapshotTest
SET READ_COMMITTED_SNAPSHOT ON
-- You may need to disconnect the other session for this to complete.

Run the two queries again, side by side.  This time, it still takes 10 seconds to run both queries, but the second query returns a –3 quantity.  Oh dear.

As mentioned, using Repeatable Read would help solve this issue, as you would be guaranteed the data wouldn’t change between the initial stock count check, and the update.  As the two queries are VERY close together, you may never see this issue with the code above, but it still can happen.  As I mentioned, this is an example of “good enough” code.  However, the difference is that the problem only has a narrow window of a few milliseconds to occur with READ COMMITTED, but has 10 seconds to occur with READ COMMITTED SNAPSHOT.

The conclusion of this example is that your application may have unexpected results if you blindly turn on READ COMMITTED SNAPSHOT.  It is a great feature of SQL Server 2005, but unless you know exactly what the effects will be, I don’t recommend turning it on for existing systems.

Oops, didn’t mean to run the entire script…

When developing T-SQL code, I frequently have a number of stored procedures created and run inside one .sql file.  While I’m generally good at highlighting the single statement I want to run, there are some situations where the highlight is misplaced – usually by me getting complacent with the lack of lag on Remote Desktop.  Suddenly I’ll get a dose of lag, my highlight command (shift-uparrow) won’t be processed, but my Execute (F5) command will.

To defend against this, I determined that the best option would be to create a severe error at the top of the .sql file to ensure that execution would cease.  Unfortunately, this only stops the execution of the current batch, and as CREATE PROCEDURE must be at the start of the batch, I have many batches in my code.

Five minutes after requesting suggestions from Twitter, Rob Farley suggested using SQLCMD mode – which is a feature that I haven’t used very often.  This is activated in Management Studio via the “Query” | “SQLCMD Mode” menu item, and can be specified as the default mode for all new query windows.

Placing the following code at the top of the file results in a fatal error that completely halts execution:

:on error exit
SELECT * FROM SomeTableThatDoesntExist
GO

!!dir
GO

The !!dir command will list the contents of the current directory.  You can run this line by itself, but this line won’t be executed if you run the entire .sql file.

The drawback to this method is that you lose Intellisense in SQL Server 2008 and the ability to debug your code, but I’m willing to live with this to prevent having to comment out my more dangerous statements!

Applications using Hashing with SQL Server 2005/2008

In my previous post, I discussed how you can use the HashBytes() function to generate hashes.  I gave a couple of examples where you could use this for your benefit, but a follow-up post is necessary.

Just to recap, a hash is a one-way encryption of a set of data of any size to come up with a smaller (eg, 20 bytes for SHA1) that can almost-uniquely identify the original source.  It is NOT a compressed version of the source – you cannot reconstruct the original source with the hash.

The main reason for using hashes is to see if things have changed.  The MDx hashing algorithms (Message Digest) was originally used to determine if email or Usenet messages had been read.  Instead of sending all messages, you could receive the hashes, and then compare the list of hashes with what you have stored locally, and only download those that you are missing.

A similar use of hashes is used to protect the integrity of executable files available for download.  You can create a hash of the executable, and when the user has downloaded your file, they can re-generate the hash to ensure that nothing has changed it (for example, a virus latching onto the executable).  If the hashes do not match, you would be wary of running the program.

Back in the database world, here are three scenarios where you might like to use hashes:

  1. Password and authentication
  2. Data Integrity
  3. Detecting changes to a table

Passwords and authentication

If you have a website and you’ve rolled your own authentication system, one thing you do not want to do is store passwords in your database.  By running the password through a hashing algorithm, you can store the hash in your database.  When the user tries to log in, you simply hash the password entered, and compare the two hashes to see if they match.  This means that you can never recover the password – you can only reset the password.

This looks great from an initial glance, but has some drawbacks.  If you’ve used a standard hashing algorithm, and an attack can get a copy of the hashed password, it is possible to compare this hash with a known list of hashed passwords, or generate the hashes by brute force until the generated hash matches the hashed password.  Alternatively, if the attacker has write access to your table, they could insert the hash for a known password onto any account.

A way around this is to “salt” the password.  Salting involves adding extra information to the password before hashing, and this works well as long as the attacker does not know what the extra information is.  You could add the username to the password, or add “abcd!%*(&” to the end of the password, and then hash the result.  As long as you do this every time you generate your hash, a standard set of pre-hashed values will not match your system.  Once the attacker has seen your source code, however, they will know how your hashes have been salted.

Data integrity

Data Integrity is another useful feature you can create with hashing.  Imagine that you don’t want your database administrators to be able to update tables via Management Studio – you only want them to update the tables via a specific interface, whether by an encrypted stored procedure, or through your application.  You can create a hash column on your table, and create a hash based on the other columns.  For example, if you had a Student table:

CREATE TABLE Student (
      StudentID int NOT NULL IDENTITY PRIMARY KEY
    , FirstName varchar(25) NOT NULL
    , Surname varchar(25) NOT NULL
    , DateOfBirth datetime NOT NULL
    , EmergencyContactPhone varchar(15) NOT NULL
    , RowHash varchar(50)
)

Now, when you insert your Students, you need to supply a value for the RowHash:

INSERT INTO Student (FirstName, Surname, DateOfBirth
    , EmergencyContactPhone, RowHash)
VALUES ('Robert', 'Jackson', '1995-10-21', '+61399991234'
    , sys.fn_varbintohexstr(HashBytes('SHA1',
        'SALTRobertJackson1995-10-21+61399991234'))
)

As you can see, the hash is the word SALT plus the FirstName, Surname, DateOfBirth and EmergencyContactPhone concatenated together.  Note that if you have CONCATE_NULL_YIELDS_NULL turned off, you’ll end up with an empty string if you try to add a NULL value to a string.

The next step is to look at the data.  As we’re not encrypting the data, your DBA can happily look at the rows.  However, in your application, you want to check the hash every time you load in the row to see if it has been modified and the hash not updated correctly.  For example, to retrieve our Robert’s row:

SELECT
      FirstName, Surname, DateOfBirth
    , EmergencyContactPhone, RowHash
    , sys.fn_varbintohexstr(HashBytes('SHA1', 'SALT'
        + FirstName + Surname
        + CONVERT(varchar(10), DateOfBirth, 121)
        + EmergencyContactPhone)) AS GeneratedHash
FROM Student
WHERE StudentID = 1

We get the existing RowHash, and a GeneratedHash, which should match.  Note that you would only generate this new hash in an encrypted stored procedure, or in your application, otherwise your DBA can see how the hash is constructed and generate a hash for his changed data.

Let’s update the row without updating the hash, and rerun the select query (click the image for a bigger version):

UPDATE Student SET FirstName = 'Bob' WHERE StudentID = 1

image

Oh dear, the hashes don’t match.  Sound an alarm!

Detecting changes to a table

Another use for hashing is to detect changes over a large number of rows.  For example, if you have a Students table that also includes their year level, and you want to check monthly to see if new students have been added to that year, you can either have a Modified Date column to store when rows are modified, keep track of any changes via a trigger, or use a hash function.

Create a separate table to hold the hashed values:

CREATE TABLE MonthlyStudentHash (
      SampleDate datetime
    , YearLevel int
    , GroupHash varchar(50)
    , PRIMARY KEY (SampleDate, YearLevel)
)

Each month, we can run the following query to get a hashed version of the students:

INSERT INTO MonthlyStudentHash
SELECT GETDATE(), YearLevel
    , sys.fn_varbintohexstr(HashBytes('SHA1'
        , CONVERT(varchar(20), NumStudents * SumDateOfBirth )))
FROM (
    SELECT YearLevel
        , COUNT(*) AS NumStudents
        , SUM(CONVERT(int, DateOfBirth)) AS SumDateOfBirth
    FROM Student
    GROUP BY YearLevel
) years

In the inner query (years) we get all the students, and group them by year level, counting the number of students in the YearLevel, and adding together their dates of birth, converted to an integer.  In the outer query, we then multiply the number of students by the SumDateOfBirth and hash this.  No salting is necessary in this case.

The NumStudents * SumDateOfBirth doesn’t make any sense, but it can be whatever you like – the aim is to get a unique value that can be hashed.  (Actually, NumStudents * SumDateOfBirth is likely to be small enough that hashing it will just take up more space, but if you want more accuracy (uniqueness), you can concatenate all the students’ names together and hash that!)

 

Now, each month, run this query, and compare the current hash value to last month’s hash value.  It will be easy to see if there’s been a change.  This is a great solution if you have no power to modify the source table to add a ModifiedDate column.

Conclusion

There are plenty of other uses for hashed values – this article has only mentioned a few uses.  I’d love to hear any other ideas – please leave a comment!

WordPress Themes