In my previous post, I discussed how you can use the HashBytes() function to generate hashes. I gave a couple of examples where you could use this for your benefit, but a follow-up post is necessary.
Just to recap, a hash is a one-way encryption of a set of data of any size to come up with a smaller (eg, 20 bytes for SHA1) that can almost-uniquely identify the original source. It is NOT a compressed version of the source – you cannot reconstruct the original source with the hash.
The main reason for using hashes is to see if things have changed. The MDx hashing algorithms (Message Digest) was originally used to determine if email or Usenet messages had been read. Instead of sending all messages, you could receive the hashes, and then compare the list of hashes with what you have stored locally, and only download those that you are missing.
A similar use of hashes is used to protect the integrity of executable files available for download. You can create a hash of the executable, and when the user has downloaded your file, they can re-generate the hash to ensure that nothing has changed it (for example, a virus latching onto the executable). If the hashes do not match, you would be wary of running the program.
Back in the database world, here are three scenarios where you might like to use hashes:
- Password and authentication
- Data Integrity
- Detecting changes to a table
Passwords and authentication
If you have a website and you’ve rolled your own authentication system, one thing you do not want to do is store passwords in your database. By running the password through a hashing algorithm, you can store the hash in your database. When the user tries to log in, you simply hash the password entered, and compare the two hashes to see if they match. This means that you can never recover the password – you can only reset the password.
This looks great from an initial glance, but has some drawbacks. If you’ve used a standard hashing algorithm, and an attack can get a copy of the hashed password, it is possible to compare this hash with a known list of hashed passwords, or generate the hashes by brute force until the generated hash matches the hashed password. Alternatively, if the attacker has write access to your table, they could insert the hash for a known password onto any account.
A way around this is to “salt†the password. Salting involves adding extra information to the password before hashing, and this works well as long as the attacker does not know what the extra information is. You could add the username to the password, or add “abcd!%*(&†to the end of the password, and then hash the result. As long as you do this every time you generate your hash, a standard set of pre-hashed values will not match your system. Once the attacker has seen your source code, however, they will know how your hashes have been salted.
Data integrity
Data Integrity is another useful feature you can create with hashing. Imagine that you don’t want your database administrators to be able to update tables via Management Studio – you only want them to update the tables via a specific interface, whether by an encrypted stored procedure, or through your application. You can create a hash column on your table, and create a hash based on the other columns. For example, if you had a Student table:
CREATE TABLE Student ( StudentID int NOT NULL IDENTITY PRIMARY KEY , FirstName varchar(25) NOT NULL , Surname varchar(25) NOT NULL , DateOfBirth datetime NOT NULL , EmergencyContactPhone varchar(15) NOT NULL , RowHash varchar(50) )
Now, when you insert your Students, you need to supply a value for the RowHash:
INSERT INTO Student (FirstName, Surname, DateOfBirth , EmergencyContactPhone, RowHash) VALUES ('Robert', 'Jackson', '1995-10-21', '+61399991234' , sys.fn_varbintohexstr(HashBytes('SHA1', 'SALTRobertJackson1995-10-21+61399991234')) )
As you can see, the hash is the word SALT plus the FirstName, Surname, DateOfBirth and EmergencyContactPhone concatenated together. Note that if you have CONCATE_NULL_YIELDS_NULL turned off, you’ll end up with an empty string if you try to add a NULL value to a string.
The next step is to look at the data. As we’re not encrypting the data, your DBA can happily look at the rows. However, in your application, you want to check the hash every time you load in the row to see if it has been modified and the hash not updated correctly. For example, to retrieve our Robert’s row:
SELECT FirstName, Surname, DateOfBirth , EmergencyContactPhone, RowHash , sys.fn_varbintohexstr(HashBytes('SHA1', 'SALT' + FirstName + Surname + CONVERT(varchar(10), DateOfBirth, 121) + EmergencyContactPhone)) AS GeneratedHash FROM Student WHERE StudentID = 1
We get the existing RowHash, and a GeneratedHash, which should match. Note that you would only generate this new hash in an encrypted stored procedure, or in your application, otherwise your DBA can see how the hash is constructed and generate a hash for his changed data.
Let’s update the row without updating the hash, and rerun the select query (click the image for a bigger version):
UPDATE Student SET FirstName = 'Bob' WHERE StudentID = 1
Oh dear, the hashes don’t match. Sound an alarm!
Detecting changes to a table
Another use for hashing is to detect changes over a large number of rows. For example, if you have a Students table that also includes their year level, and you want to check monthly to see if new students have been added to that year, you can either have a Modified Date column to store when rows are modified, keep track of any changes via a trigger, or use a hash function.
Create a separate table to hold the hashed values:
CREATE TABLE MonthlyStudentHash ( SampleDate datetime , YearLevel int , GroupHash varchar(50) , PRIMARY KEY (SampleDate, YearLevel) )
Each month, we can run the following query to get a hashed version of the students:
INSERT INTO MonthlyStudentHash SELECT GETDATE(), YearLevel , sys.fn_varbintohexstr(HashBytes('SHA1' , CONVERT(varchar(20), NumStudents * SumDateOfBirth ))) FROM ( SELECT YearLevel , COUNT(*) AS NumStudents , SUM(CONVERT(int, DateOfBirth)) AS SumDateOfBirth FROM Student GROUP BY YearLevel ) years
In the inner query (years) we get all the students, and group them by year level, counting the number of students in the YearLevel, and adding together their dates of birth, converted to an integer. In the outer query, we then multiply the number of students by the SumDateOfBirth and hash this. No salting is necessary in this case.
The NumStudents * SumDateOfBirth doesn’t make any sense, but it can be whatever you like – the aim is to get a unique value that can be hashed. (Actually, NumStudents * SumDateOfBirth is likely to be small enough that hashing it will just take up more space, but if you want more accuracy (uniqueness), you can concatenate all the students’ names together and hash that!)
Â
Now, each month, run this query, and compare the current hash value to last month’s hash value. It will be easy to see if there’s been a change. This is a great solution if you have no power to modify the source table to add a ModifiedDate column.
Conclusion
There are plenty of other uses for hashed values – this article has only mentioned a few uses. I’d love to hear any other ideas – please leave a comment!