Dec 22, 2010

In my previous post, I mentioned that the superior error checking and correcting abilities of a solid state disk (SSD) may allow you to trust your database files to a single drive and avoid RAID.  Let’s have a closer look at the issues.

SSDs will, in theory, be more reliable due to their lack of moving parts and lower power and cooling requirements.  Additionally, they detect and correct errors more effectively (via Hamming codes that allow single error correction and double error detection), which provides more protection against data corruption than a RAID 5 array.  However, it is still possible for the whole drive to fail, and this issue must be considered prior to betting the farm on a single SSD.
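To make "single error correction, double error detection" concrete, here is a minimal sketch of an extended Hamming(8,4) code in Python.  It is a toy illustration only: real SSD controllers use their own (usually much stronger) codes over far larger blocks, and the function names `encode` and `decode` are my own, not from any library.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (a list of 0/1 bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions, with parity bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    parity_broken = sum(code) % 2 == 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "double_error", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                # one flipped bit is repaired
print(decode(word))
word[2] ^= 1                # a second flipped bit is detected, not repaired
print(decode(word))
```

Note what this does and does not cover: it protects against flipped bits inside the device, not against the whole device dying, which is the failure the rest of this post is concerned with.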

Firstly, unless you’re happy to roll back to your previous full backup, transaction logs should be mirrored, whether they’re on magnetic hard disks or SSDs, and appropriate off-server log backups should be taken frequently.

In the case of using a single SSD for TempDB, what happens if the SSD fails?

  1. If you have another server to fail over to, great.  You’ll have a short outage while the other server picks up the workload, and hopefully you won’t lose any transactions.  If the standby server has no SSDs then performance may be slower, but still acceptable.  If there is no standby server, read on.
  2. You’ll have a SQL Server outage for the amount of time it takes to re-home TempDB to another disk.
  3. You’ll need to source a new home for TempDB.  You may already have space available on another attached drive, or you may need to provision more space from the SAN (if you have one).
  4. If you have no additional space, you’re in deeper trouble – you’ll need more disks, or you’ll need to move your databases to another server.
  5. Once you have the space, you’ll also need to consider performance.  Did you originally use an SSD for TempDB because the workload was so high that regular drives could barely handle it?  Your load might have grown since then, and you may simply be unable to handle the TempDB load without an SSD.
  6. How long does it take to get a new SSD shipped in?  Until then, your system may be down.
  7. If you have a spare SSD sitting on the shelf that you can quickly slot in, why didn’t you just RAID it in the first place, or put it in a secondary server?

In this case, we don’t even care about the contents of the SSD – TempDB will be recreated when we restart SQL Server, and things will be back to normal.  The main issues stem from how quickly you can re-home TempDB, and the restrictions on where you can place it.
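As a rough illustration of the re-homing step, the sketch below builds the `ALTER DATABASE tempdb MODIFY FILE` statements you would run to point TempDB at a new location.  The helper name and the example paths are hypothetical, and it assumes the default logical file names (`tempdev` and `templog`); if you have added extra TempDB data files, each needs its own statement.

```python
def tempdb_move_statements(data_path, log_path):
    """Build T-SQL that re-homes TempDB's default data and log files.

    Assumes the default logical names 'tempdev' and 'templog'.
    The new locations only take effect at the next SQL Server restart.
    """
    def quote(path):
        # Double any single quotes so the path is a valid T-SQL string literal.
        return "'" + path.replace("'", "''") + "'"

    return [
        "ALTER DATABASE tempdb MODIFY FILE "
        f"(NAME = tempdev, FILENAME = {quote(data_path)});",
        "ALTER DATABASE tempdb MODIFY FILE "
        f"(NAME = templog, FILENAME = {quote(log_path)});",
    ]


# Hypothetical target paths on a surviving drive.
for stmt in tempdb_move_statements(r"E:\TempDB\tempdb.mdf", r"E:\TempDB\templog.ldf"):
    print(stmt)
```

Bear in mind that if the SSD has already failed, SQL Server may not start normally because it cannot create TempDB, so you may first have to bring the instance up in a minimal configuration before you can run these statements.  It is worth rehearsing that procedure before you need it.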

If you have TempDB on dual SSDs in a RAID configuration, and one fails, then you’ll continue running with no outage (although you may need to schedule one to replace the faulty SSD).  Of course, this is a much more expensive option, and it’s possible that neither SSD will ever fail – but that’s the insurance game.  You pay more for peace of mind.

Data files are a similar story, but there are some differences.  While the loss of TempDB will guarantee a full SQL Server outage, the loss of a single data file will only reduce the availability of the database it belongs to.  If the data is critical, and your system cannot run without it, then you will have a problem until you fail over to a standby, or restore the data elsewhere.  If you can survive without this data, and you can regenerate or restore it later, there’s less of an issue.

The summary is that you need to consider answers to the following questions for each of your SSDs, and then make an appropriate decision.

  1. If this SSD fails, what are the effects in terms of server availability, data loss, and the amount of time to recover?  What is the action plan, with estimated times to get this back online?
  2. During this downtime, what effect will this have on the business?  Will the entire organisation grind to a halt, or will a non-critical data warehouse be unavailable for 24 hours?  What is the cost of this (lack of productivity, lost sales, your job)?
  3. What is the additional cost associated with providing redundancy to ensure that this won’t fail, whether as an additional SSD in a RAID configuration, or a standby server (which may or may not have SSDs)?
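One simple way to weigh these three answers against each other is to compare the expected yearly cost of an outage with the yearly cost of the redundancy that would prevent it.  The sketch below does just that; the function names and every number in it are made-up placeholders, so substitute your own estimates.

```python
def expected_outage_cost(annual_failure_prob, hours_to_recover, cost_per_hour):
    """Expected yearly cost of running without redundancy.

    annual_failure_prob: your estimate of the chance the SSD fails in a year
    hours_to_recover:    time to get back online, from your action plan
    cost_per_hour:       what an hour of downtime costs the business
    """
    return annual_failure_prob * hours_to_recover * cost_per_hour


def redundancy_worthwhile(annual_failure_prob, hours_to_recover,
                          cost_per_hour, redundancy_cost_per_year):
    """True if the expected outage cost exceeds the cost of redundancy."""
    risk = expected_outage_cost(annual_failure_prob, hours_to_recover, cost_per_hour)
    return risk > redundancy_cost_per_year


# Placeholder figures only: 2% yearly failure chance, 24 hours to recover,
# 5,000 per hour of downtime, 1,500 per year for a mirrored SSD.
print(expected_outage_cost(0.02, 24, 5000))
print(redundancy_worthwhile(0.02, 24, 5000, 1500))
```

An expected-value comparison like this ignores risk appetite: if a single outage could sink the business (or your job), you may want the redundancy even when the arithmetic says otherwise.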

A single SSD can replace a RAID 5 array for non-critical data, but you need to thoroughly understand the risks first, and have a solid contingency plan in place.