A hashing function is a useful one-way encryption function that can take a large amount of data and create a compact string that can (mostly) uniquely identify the source data. For example, this entire post could shrink from being 30 KB to just 30 bytes. I must stress that the conversion is a one-way operation, and the 30 bytes will NOT contain the 30 KB of text.
So what is this useful for? A good example is a table/FILESTREAM containing many 5 MB images. If you are inserting images into your table, but you only want to insert the image if it does not already exist, how can you tell if the image exists? You can either check metadata on the image – the filename, the modification date, the file size, etc, or you can check the binary data, byte-by-byte. Alternatively, you can run the 5 MB through a hashing algorithm, and get a string (about 20 characters long) returned that can (mostly) uniquely identify the data. This hash can then be indexed to look up the photo quickly.
There are a few different hashing algorithms available in SQL Server 2005: MD2, MD4, MD5, SHA, SHA1, with each having pros and cons.
You’ve probably noticed that I’ve mentioned that hashing can “(mostly) uniquely identify” the original data. This is because it is possible that collisions can occur, when two source strings produce the same hash value. If this is the case, you can use the hash value to quickly locate rows in your database that match the hash, and then perform a byte-by-byte comparison to double-check that you haven’t got a collision. The MDx algorithms may have this weakness, so SHA or SHA1 is recommended. Performance-wise, I didn’t notice any difference between SHA1 and MD5 over a 1000 character source string.
The HashBytes function used to perform a hash:
DECLARE @hash varbinary(max) SELECT @hash = HashBytes('SHA1', REPLICATE(N'abcdefABCDEF12345', 1000)) SELECT sys.fn_varbintohexstr(@hash)
Just a quick explanation: the HashBytes function takes in a varchar or nvarchar value. I’ve used REPLICATE() to create a string of 17,000 characters. The sys.fn_varbintohextstr() function then converts the varbinary data into a varchar, which can then be stored in the database (SHA and SHA1 should be 20 characters long).