Apr 222009

A hashing function is a useful one-way encryption function that can take a large amount of data and create a compact string that can (mostly) uniquely identify the source data.  For example, this entire post could shrink from being 30 KB to just 30 bytes.  I must stress that the conversion is a one-way operation, and the 30 bytes will NOT contain the 30 KB of text. 

So what is this useful for?  A good example is a table/FILESTREAM containing many 5 MB images.  If you are inserting images into your table, but you only want to insert the image if it does not already exist, how can you tell if the image exists?  You can either check metadata on the image – the filename, the modification date, the file size, etc, or you can check the binary data, byte-by-byte.  Alternatively, you can run the 5 MB through a hashing algorithm, and get a string (about 20 characters long) returned that can (mostly) uniquely identify the data.  This hash can then be indexed to look up the photo quickly.

There are a few different hashing algorithms available in SQL Server 2005: MD2, MD4, MD5, SHA, SHA1, with each having pros and cons.

You’ve probably noticed that I’ve mentioned that hashing can “(mostly) uniquely identify” the original data.  This is because it is possible that collisions can occur, when two source strings produce the same hash value.  If this is the case, you can use the hash value to quickly locate rows in your database that match the hash, and then perform a byte-by-byte comparison to double-check that you haven’t got a collision.  The MDx algorithms may have this weakness, so SHA or SHA1 is recommended.  Performance-wise, I didn’t notice any difference between SHA1 and MD5 over a 1000 character source string.

The HashBytes function used to perform a hash:

DECLARE @hash varbinary(max)
SELECT @hash = HashBytes('SHA1', REPLICATE(N'abcdefABCDEF12345', 1000))
SELECT sys.fn_varbintohexstr(@hash)

Just a quick explanation: the HashBytes function takes in a varchar or nvarchar value.  I’ve used REPLICATE() to create a string of 17,000 characters.  The sys.fn_varbintohextstr() function then converts the varbinary data into a varchar, which can then be stored in the database (SHA and SHA1 should be 20 characters long).

  3 Responses to “Hashing Functions in SQL Server 2005/2008”

  1. Thanks for the excellent article!

    You provide a nice clear explanation of a topic which prior authors make sound complicated. You also provide a good example of a “real world” implementation.

    Good stuff. I look forward to reading more.

  2. Can you give more information about SHA 1 with sql 2008 , are there any improvements, where we can look for documentation for this algorithm implementation on sql 2008? what are the possible cons taking this approch.

  3. Nice article. Is there a way to sign data using sha1 in sql server 2008, when using replication? In this manner subscribers/publishers are receiving/sending hashed data. Or will this be automatically accomplished when i use SSL connection to the DB?

 Leave a Reply



You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>