Jul 23 2009
 

When developing T-SQL code, I frequently have a number of stored procedures created and run inside one .sql file.  While I’m generally good at highlighting the single statement I want to run, there are some situations where the highlight is misplaced – usually by me getting complacent with the lack of lag on Remote Desktop.  Suddenly I’ll get a dose of lag, my highlight command (shift-uparrow) won’t be processed, but my Execute (F5) command will.

To defend against this, I determined that the best option would be to create a severe error at the top of the .sql file to ensure that execution would cease.  Unfortunately, this only stops the execution of the current batch, and as CREATE PROCEDURE must be at the start of the batch, I have many batches in my code.

Five minutes after requesting suggestions from Twitter, Rob Farley suggested using SQLCMD mode – which is a feature that I haven’t used very often.  This is activated in Management Studio via the “Query” | “SQLCMD Mode” menu item, and can be specified as the default mode for all new query windows.

Placing the following code at the top of the file results in a fatal error that completely halts execution:

:on error exit
SELECT * FROM SomeTableThatDoesntExist
GO

!!dir
GO

The !!dir command will list the contents of the current directory.  You can run this line by itself, but this line won’t be executed if you run the entire .sql file.

The drawback to this method is that you lose IntelliSense in SQL Server 2008 and the ability to debug your code, but I’m willing to live with this to prevent having to comment out my more dangerous statements!

Apr 23 2009
 

In my previous post, I discussed how you can use the HashBytes() function to generate hashes.  I gave a couple of examples where you could use this for your benefit, but a follow-up post is necessary.

Just to recap, a hash is a one-way encryption of a set of data of any size into a smaller value (e.g. 20 bytes for SHA1) that can almost-uniquely identify the original source.  It is NOT a compressed version of the source – you cannot reconstruct the original source from the hash.

The main reason for using hashes is to see if things have changed.  The MDx (Message Digest) hashing algorithms were originally used to determine if email or Usenet messages had already been read.  Instead of sending all of the messages, you could receive just the hashes, compare that list of hashes with what you have stored locally, and only download those messages that you are missing.

A similar use of hashes is used to protect the integrity of executable files available for download.  You can create a hash of the executable, and when the user has downloaded your file, they can re-generate the hash to ensure that nothing has changed it (for example, a virus latching onto the executable).  If the hashes do not match, you would be wary of running the program.

Back in the database world, here are three scenarios where you might like to use hashes:

  1. Passwords and authentication
  2. Data Integrity
  3. Detecting changes to a table

Passwords and authentication

If you have a website and you’ve rolled your own authentication system, one thing you do not want to do is store passwords in your database.  By running the password through a hashing algorithm, you can store the hash in your database.  When the user tries to log in, you simply hash the password entered, and compare the two hashes to see if they match.  This means that you can never recover the password – you can only reset the password.

This looks great at first glance, but has some drawbacks.  If you’ve used a standard hashing algorithm, and an attacker can get a copy of the hashed password, it is possible to compare this hash with a known list of hashed passwords, or to generate hashes by brute force until the generated hash matches the hashed password.  Alternatively, if the attacker has write access to your table, they could insert the hash for a known password onto any account.

A way around this is to “salt” the password.  Salting involves adding extra information to the password before hashing, and this works well as long as the attacker does not know what the extra information is.  You could add the username to the password, or add “abcd!%*(&” to the end of the password, and then hash the result.  As long as you do this every time you generate your hash, a standard set of pre-hashed values will not match your system.  Once the attacker has seen your source code, however, they will know how your hashes have been salted.
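To make that concrete, here’s a minimal sketch.  The WebUser table, the sample credentials and the use of “abcd!%*(&” as the salt are all made up for illustration:

CREATE TABLE WebUser (
      UserName varchar(50) NOT NULL PRIMARY KEY
    , PasswordHash varbinary(20) NOT NULL   -- SHA1 produces 20 bytes
)
GO

-- Store the salted hash, never the password itself
INSERT INTO WebUser (UserName, PasswordHash)
VALUES ('robert', HashBytes('SHA1', 'robert' + 'MyS3cretPassword' + 'abcd!%*(&'))
GO

-- At login time, salt and hash the attempted password in exactly the same way and compare
DECLARE @UserName varchar(50), @PasswordAttempt varchar(50)
SELECT @UserName = 'robert', @PasswordAttempt = 'MyS3cretPassword'

IF EXISTS (SELECT * FROM WebUser
           WHERE UserName = @UserName
             AND PasswordHash = HashBytes('SHA1', @UserName + @PasswordAttempt + 'abcd!%*(&'))
    PRINT 'Login successful'
ELSE
    PRINT 'Login failed'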

Data integrity

Data Integrity is another useful feature you can create with hashing.  Imagine that you don’t want your database administrators to be able to update tables via Management Studio – you only want them to update the tables via a specific interface, whether by an encrypted stored procedure, or through your application.  You can create a hash column on your table, and create a hash based on the other columns.  For example, if you had a Student table:

CREATE TABLE Student (
      StudentID int NOT NULL IDENTITY PRIMARY KEY
    , FirstName varchar(25) NOT NULL
    , Surname varchar(25) NOT NULL
    , DateOfBirth datetime NOT NULL
    , EmergencyContactPhone varchar(15) NOT NULL
    , RowHash varchar(50)
)

Now, when you insert your Students, you need to supply a value for the RowHash:

INSERT INTO Student (FirstName, Surname, DateOfBirth
    , EmergencyContactPhone, RowHash)
VALUES ('Robert', 'Jackson', '1995-10-21', '+61399991234'
    , sys.fn_varbintohexstr(HashBytes('SHA1',
        'SALTRobertJackson1995-10-21+61399991234'))
)

As you can see, the value being hashed is the word SALT plus the FirstName, Surname, DateOfBirth and EmergencyContactPhone concatenated together.  Note that if you have CONCAT_NULL_YIELDS_NULL turned off, a NULL value is treated as an empty string when concatenated onto a string (with it turned on, which is the default, the whole concatenated string, and therefore the hash, becomes NULL).

The next step is to look at the data.  As we’re not encrypting the data, your DBA can happily look at the rows.  However, in your application, you want to check the hash every time you load the row, to see whether it has been modified without the hash being updated to match.  For example, to retrieve Robert’s row:

SELECT
      FirstName, Surname, DateOfBirth
    , EmergencyContactPhone, RowHash
    , sys.fn_varbintohexstr(HashBytes('SHA1', 'SALT'
        + FirstName + Surname
        + CONVERT(varchar(10), DateOfBirth, 121)
        + EmergencyContactPhone)) AS GeneratedHash
FROM Student
WHERE StudentID = 1

We get the existing RowHash, and a GeneratedHash, which should match.  Note that you would only generate this new hash in an encrypted stored procedure, or in your application, otherwise your DBA can see how the hash is constructed and generate a hash for his changed data.
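If you wanted to scan the whole table for tampering (again, run from the application or an encrypted stored procedure so that the salt stays hidden), something along these lines would flag any row whose stored hash no longer matches:

SELECT StudentID, FirstName, Surname
FROM Student
WHERE RowHash <> sys.fn_varbintohexstr(HashBytes('SHA1', 'SALT'
        + FirstName + Surname
        + CONVERT(varchar(10), DateOfBirth, 121)
        + EmergencyContactPhone))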

Let’s update the row without updating the hash, and rerun the select query:

UPDATE Student SET FirstName = 'Bob' WHERE StudentID = 1

[Screenshot: the query results after the update, with the stored RowHash and the GeneratedHash no longer matching]

Oh dear, the hashes don’t match.  Sound an alarm!

Detecting changes to a table

Another use for hashing is to detect changes over a large number of rows.  For example, if you have a Students table that also includes their year level, and you want to check monthly to see if new students have been added to that year, you can either have a Modified Date column to store when rows are modified, keep track of any changes via a trigger, or use a hash function.

Create a separate table to hold the hashed values:

CREATE TABLE MonthlyStudentHash (
      SampleDate datetime
    , YearLevel int
    , GroupHash varchar(50)
    , PRIMARY KEY (SampleDate, YearLevel)
)

Each month, we can run the following query to get a hashed version of the students:

INSERT INTO MonthlyStudentHash
SELECT GETDATE(), YearLevel
    , sys.fn_varbintohexstr(HashBytes('SHA1'
        , CONVERT(varchar(20), NumStudents * SumDateOfBirth )))
FROM (
    SELECT YearLevel
        , COUNT(*) AS NumStudents
        , SUM(CONVERT(int, DateOfBirth)) AS SumDateOfBirth
    FROM Student
    GROUP BY YearLevel
) years

In the inner query (years) we get all the students, and group them by year level, counting the number of students in the YearLevel, and adding together their dates of birth, converted to an integer.  In the outer query, we then multiply the number of students by the SumDateOfBirth and hash this.  No salting is necessary in this case.

The NumStudents * SumDateOfBirth calculation doesn’t mean anything by itself – it can be whatever you like, as the aim is simply to produce a value that changes whenever the underlying rows change.  (Actually, NumStudents * SumDateOfBirth is likely to be small enough that hashing it will just take up more space, but if you want more accuracy (uniqueness), you can concatenate all the students’ names together and hash that instead; see the sketch below.)
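If you did want to go down that path, FOR XML PATH('') is one way of building the concatenated string in SQL Server 2005 and later.  This is just a sketch: it assumes the Student table has the YearLevel column mentioned above, and keep in mind that HashBytes only accepts up to 8,000 bytes of input, so it only works while each year level’s list of names stays reasonably small.

SELECT yl.YearLevel
    , sys.fn_varbintohexstr(HashBytes('SHA1',
        (SELECT s.FirstName + s.Surname
         FROM Student s
         WHERE s.YearLevel = yl.YearLevel
         ORDER BY s.StudentID
         FOR XML PATH('')))) AS GroupHash
FROM (SELECT DISTINCT YearLevel FROM Student) AS yl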

 

Now, each month, run the INSERT query above, and compare the current hash value to last month’s hash value.  It will be easy to see if there’s been a change.  This is a great solution if you have no power to modify the source table to add a ModifiedDate column.
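The comparison itself can be done in a single query, something along these lines (assuming the table is only populated by the monthly job above):

-- For each year level, line the latest sample up against the one before it
-- and report whether the group hash has changed
SELECT curr.YearLevel
    , prev.SampleDate AS PreviousSample
    , curr.SampleDate AS LatestSample
    , CASE WHEN curr.GroupHash = prev.GroupHash THEN 'No change' ELSE 'Changed' END AS Result
FROM MonthlyStudentHash curr
JOIN MonthlyStudentHash prev
    ON prev.YearLevel = curr.YearLevel
WHERE curr.SampleDate = (SELECT MAX(m.SampleDate)
                         FROM MonthlyStudentHash m
                         WHERE m.YearLevel = curr.YearLevel)
  AND prev.SampleDate = (SELECT MAX(p.SampleDate)
                         FROM MonthlyStudentHash p
                         WHERE p.YearLevel = curr.YearLevel
                           AND p.SampleDate < curr.SampleDate)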

Conclusion

There are plenty of other uses for hashed values – this article has only mentioned a few uses.  I’d love to hear any other ideas – please leave a comment!

Apr 22 2009
 

A hashing function is a useful one-way encryption function that can take a large amount of data and create a compact value that can (mostly) uniquely identify the source data.  For example, this entire post could shrink from 30 KB down to just 20 bytes.  I must stress that the conversion is a one-way operation, and those 20 bytes will NOT contain the 30 KB of text.

So what is this useful for?  A good example is a table/FILESTREAM containing many 5 MB images.  If you are inserting images into your table, but you only want to insert the image if it does not already exist, how can you tell if the image exists?  You can either check metadata on the image – the filename, the modification date, the file size, and so on – or you can check the binary data, byte-by-byte.  Alternatively, you can run the 5 MB through a hashing algorithm, and get back a small value (20 bytes for SHA1) that can (mostly) uniquely identify the data.  This hash can then be indexed to look up the photo quickly.

There are a few different hashing algorithms available in SQL Server 2005: MD2, MD4, MD5, SHA, SHA1, with each having pros and cons.

You’ve probably noticed that I’ve mentioned that hashing can “(mostly) uniquely identify” the original data.  This is because it is possible that collisions can occur, when two source strings produce the same hash value.  If this is the case, you can use the hash value to quickly locate rows in your database that match the hash, and then perform a byte-by-byte comparison to double-check that you haven’t got a collision.  The MDx algorithms may have this weakness, so SHA or SHA1 is recommended.  Performance-wise, I didn’t notice any difference between SHA1 and MD5 over a 1000 character source string.

The HashBytes function is used to perform a hash:

DECLARE @hash varbinary(max)
SELECT @hash = HashBytes('SHA1', REPLICATE(N'abcdefABCDEF12345', 1000))
SELECT sys.fn_varbintohexstr(@hash)

Just a quick explanation: the HashBytes function takes in a varchar, nvarchar or varbinary value.  I’ve used REPLICATE() to create a long test string (note that REPLICATE truncates its result at 8,000 bytes unless the input is varchar(max) or nvarchar(max), so the nvarchar literal above actually produces 4,000 characters rather than 17,000).  The sys.fn_varbintohexstr() function then converts the varbinary hash into a varchar, which can then be stored in the database (a SHA or SHA1 hash is 20 bytes, or 40 hexadecimal characters plus the 0x prefix).
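Tying this back to the photo example above, here’s a rough sketch of the lookup pattern.  The table, procedure and parameter names are invented, and since HashBytes only accepts up to 8,000 bytes of input, a 5 MB image would typically be hashed in the application and passed in alongside the image data:

CREATE TABLE Photo (
      PhotoID int NOT NULL IDENTITY PRIMARY KEY
    , ImageData varbinary(max) NOT NULL
    , ImageHash binary(20) NOT NULL    -- SHA1 of ImageData, calculated by the application
)
GO
-- The 20-byte hash is small enough to index, unlike the image itself
CREATE NONCLUSTERED INDEX ncix_Photo_ImageHash ON Photo (ImageHash)
GO

CREATE PROCEDURE AddPhotoIfNew
      @ImageData varbinary(max)
    , @ImageHash binary(20)
AS
BEGIN
    -- Seek on the hash first; the byte-by-byte comparison only runs against the
    -- handful of rows (usually zero or one) that share the same hash value
    IF NOT EXISTS (SELECT * FROM Photo
                   WHERE ImageHash = @ImageHash AND ImageData = @ImageData)
        INSERT INTO Photo (ImageData, ImageHash)
        VALUES (@ImageData, @ImageHash)
END
GO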

Apr 02 2009
 

Following on from my previous post regarding using Profiler to find the use of deprecated features, I received an email asking if there was a way to determine what the offending statement was that caused the deprecation warning.

While you could trace every statement on a specific SPID to identify where the warnings are coming from, a much better method is to use the SqlHandle column in the Deprecation Announcement Profiler event.  This gives you a SQL plan handle that you can then feed into the SQL Server 2005 dynamic management views/functions to look up the plan, including the original SQL text.

First, save the trace file to disk, and then run the following query:

SELECT Trace.TextData, SqlText.Text, *
FROM fn_trace_gettable('c:\work\deprecatedtrace.trc', 1) AS Trace
OUTER APPLY sys.dm_exec_sql_text(Trace.SqlHandle) AS SqlText
WHERE TextData IS NOT NULL

There is a drawback to this method: the plan must still be in the procedure cache.  If the server has restarted, or the plan has been removed or has expired from the cache, you won’t be able to find it.

Feb 12 2009
 

SQL Server is a mature product, and as the product grows older, new features are added (we all know about those!) and some features are removed.  However, to avoid catching customers out when they upgrade to the latest and greatest version of SQL Server, Microsoft provide warnings that features will be deprecated (a fancy word for “scheduled for removal”) in the next version, or in some future version.

If you’re considering upgrading your application, the first test you should do is detect if you’re using deprecated features.  You can either skim the list of deprecated features (look up “Deprecated features” in the Books Online index), or you can run a Profiler trace.

There are two Deprecation events in Profiler, residing in the not-so-hidden category of “Deprecation”.  These are:

  • Deprecation Announcement – captures those events that will be around for the next version, but will be removed in future.
  • Deprecation Final Support – captures those events that will NOT be in the next version.

Set up a Profiler trace on these two events.  Note that you should tick the “Show All Events” and “Show All Columns” options to be able to capture all relevant information.  Failure to “Show All Columns” can result in not capturing anything.

That done, sit back and wait for the messages to come rolling in.  On the other hand, if you’re running this on a test server with zero workload, you’ll need to produce some deprecated events to capture.  Again, look up “Deprecated Features” in the Books Online index (on the web at: http://msdn.microsoft.com/en-us/library/ms143729.aspx, with a link to the 2005 list), and find a feature to use.
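For example, creating a table that uses the text data type (deprecated in favour of varchar(max)) should be enough to raise a Deprecation Announcement event.  The table below is purely a throwaway for demonstration:

-- text is deprecated; using it should show up in the trace with a message
-- suggesting varchar(max) instead
CREATE TABLE DeprecationDemo (Notes text)
GO
DROP TABLE DeprecationDemo
GO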

[Screenshot: Profiler capturing Deprecation Announcement events, with the suggested replacement shown in the TextData column]

The deprecation warning message is kind enough to give you a hint on how to perform the function in future.

That’s all there is to it!  Now you can upgrade to SQL Server 2008 with a stronger feeling of confidence that Microsoft haven’t dropped support for a particular feature.  This doesn’t replace testing, of course, but will at least reduce the number of errors you get when you do upgrade.

Jan 27 2009
 

I bumped into a former colleague down the street today, and he asked how to go about converting existing Perfmon data from .blg (binary) to a different format, such as CSV or into a SQL database. (Yes, that’s right.  I run into people I haven’t seen in weeks, and we talk about SQL Server.  That’s normal, right?)

Going off on a slight tangent, which format of Perfmon log should you use?  The four choices are Binary (BLG), Tab Delimited, Comma Delimited, and SQL Database.  Binary is useful for most purposes, but becomes harder to extract the data out if you want to run any other processing on it.  Tab and Comma delimited have the benefit of being able to view the data in Excel (if you don’t have too many counters), and graphing can be easier.  SQL Database has the benefit of logging straight into a SQL database, and this allows you to write a query to process your Perfmon data.  For example, you could create an hourly average, or pull out the date/times of peak loads.  The drawback to using a SQL database is that it’s not as lightweight – you quickly store a lot of data in your database, and increase the load on your server. 

A good compromise may be to log into binary format, and then use the relog.exe command line tool to convert the data into SQL or comma/tab delimited format (which can then be opened in Excel or bulk-loaded into a database).

Relog.exe is a simple tool – it allows you to convert between Perfmon file formats, extract specific counters, or change the sample period.  There’s not really much to explain that isn’t mentioned in the Technet article at http://technet.microsoft.com/en-us/library/bb490958.aspx
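To give you a flavour, here are a couple of sample invocations.  The log file name, counter paths, DSN and table name are all placeholders:

rem Convert a binary log to CSV, keeping only two counters of interest
relog.exe SqlBaseline.blg -f csv -o SqlBaseline.csv -c "\Memory\Available MBytes" "\PhysicalDisk(_Total)\Avg. Disk Queue Length"

rem Keep every 10th sample and load the result into a SQL database via an ODBC DSN
relog.exe SqlBaseline.blg -f SQL -o SQL:PerfmonDSN!SqlBaseline -t 10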

Jan 19 2009
 

Wordle seems to be the in thing going around the SQL Server blogging community this week, so I figured I’d knock one up for this site.

[Wordle word cloud generated from this site’s posts]

Apparently I enjoy talking about tables, indexes, queries and Perfmon.  Sounds about right.

Jan 05 2009
 

In my previous post, I discussed using Logman to automatically start Perfmon counters logging.  The solution involved running logman.exe to start the collection, and if the counters were already running you’d get an error message saying words to that effect.

The problem with this solution is that SQL Agent will report that the job failed, as the error code returned by Logman is non-zero.  You can tell SQL Agent a specific code to use as the “success” code, but the user interface will not allow large negative numbers, such as -2144337750 (which is what Logman returns if the counter is already running on my Vista machine; Windows 2003 returns a different code).  While you may be able to enter this value via script, I’m not sure how Management Studio will handle it, and it could cause problems down the line if you edit the job.

Instead, a solution is to use a batch file that can run logman, check the error code returned, and if the error code is “Already running”, it can then exit the batch file specifying an error code that SQL Agent can handle (such as 0):

@echo off
C:\Windows\System32\logman.exe start TestCollection -s SERVERNAME
IF %errorlevel%==-2144337750 (
 echo Collection already started!
 exit /B 0
) ELSE (
 echo Exiting with error level: %errorlevel%
)

That’s it!  Now the job will report as succeeded if the collection is already running.

Dec 19 2008
 

Monitoring your database server is crucial to being able to analyse its performance, and you want to be monitoring continually – it’s no good starting to monitor only when you detect a problem, as a) you might not know that you have a problem, and b) you won’t know what normal operation looks like.

Perfmon is the current standard (although SQL Server 2008 has some nifty features that can complement Perfmon).  I won’t go into details about how to use Perfmon, as there’s plenty of information available on the web, but it can be tricky to configure to get Perfmon running automatically after reboot, or if for some reason Perfmon fails (such as a temporary out of disk space error).

The simplest way is to use the logman.exe program, supplied with Windows 2003 and later, and available for Windows 2000.  This command line program can start or stop Perfmon counter logs on either the local or a remote server, among many other tasks.

The way we use it is to schedule a task in SQL Agent or Windows Scheduler that runs every 5 or 10 minutes and attempts to start the Perfmon counter log specified.  If the counter log is already started, an error is returned, but the counter log is not affected.

The command to start a counter log is (note that logman.exe should be in your C:\Windows\System32 folder):

logman.exe start CounterCollectionName -s RemoteServerName

More information about Logman and its parameters is available at http://technet.microsoft.com/en-us/library/cc755366.aspx.
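A couple of other verbs can also be handy (using the same placeholder names as above):

rem List the counter logs defined on the remote server, and whether they are running
logman.exe query -s RemoteServerName

rem Stop the collection once you have finished monitoring
logman.exe stop CounterCollectionName -s RemoteServerName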

Dec 18 2008
 

I spent time today analysing a system monitoring tool that collects a lot of information about your servers and stores the data into a table.  A lot of data is collected, and the table was currently at 23 GB with 180,000,000 rows, with only a clustered index on a non-unique GUID.  There was no primary key set, although there was also an identity column.

A delete process runs nightly, and deletes anything older than 30 days.  This delete process was highly inefficient, and was timing out after 10 minutes, causing the transaction to roll back with nothing deleted.  Additionally, during these 10 minutes the table was locked and inserts couldn’t be performed.

The code for the deletion looked like this (although this query is from AdventureWorks, and 2500 days is needed for demonstration purposes):

DELETE FROM Sales.SalesOrderHeader 
WHERE datediff(day, ModifiedDate, getdate()) >= 2500 

If you examine the query, you’ll notice two things.  Firstly, as there is no index on ModifiedDate, we’ll need to perform a table scan.  That’s fixable with a non-clustered index:

CREATE NONCLUSTERED INDEX ncix_moddate ON
    Sales.SalesOrderHeader(ModifiedDate)

We’ll ignore the fact that our new index on 180,000,000 rows will take up 4.8 GB – a fast database is a large database!  Changing the clustered index to ModifiedDate, GUID would remove the need for the uniquifier and the new index.

The other thing you’ll notice about this query is that the datediff function must be evaluated for every row to determine the number of days between the ModifiedDate and the current date.  That’s 180,000,000 function calls, and because the column is wrapped in a function, the predicate can’t make use of an index.

A much better way of writing this query would be to determine the boundary date.  Any rows earlier than this date will be deleted.  By checking the ModifiedDate against this boundary date (which is calculated once at the start of the query), the new index on ModifiedDate can be used to find the rows to delete.

DELETE FROM Sales.SalesOrderHeader 
WHERE ModifiedDate <= dateadd(day, -2500, getdate())

If we compare the execution plans, we can see the following:

[Screenshot: execution plan comparison, showing an index scan for the original query and an index seek for the rewritten query]

Note that I’m using a SELECT instead of a delete, as AdventureWorks has cascading deletes on the Sales.SalesOrderDetails table.  The first query takes 61 reads – an entire index scan of my new index, whereas the second query takes just 6 reads – seeking to those parts of the index where the date matches.

The only bad news now is that I can’t modify this code, as it appears to be run by a service and not in a stored procedure.  I’m hoping there will be an option to turn off purging the table, and I can run my own delete command.