Why You Absolutely Need Alternate Keys: A Unique Constraint Story

Although often forgotten, alternate keys are an essential part of  database design especially if you’re using surrogate keys. Alternate keys protect your database from duplicate data and can even speed up your application.

For the past few years I’ve been presenting on the top database design mistakes that I’ve seen “in the wild”. One of those “mistakes” was not placing an alternate key on a table when you have a surrogate key. But first let me back up a second and explain a few things. A surrogate key is a unique identifier of a row in a table. A surrogate key has no meaning in relation to the data in the table, it just uniquely identifies a row. In most cases, you would define the surrogate key as the primary key in the table. One way to define a surrogate key in SQL Server is to use the IDENTITY property when creating the table.

IDENTITY will tell SQL Server to auto-increment the column when a row is added. You don’t need to give it a value SQL Server will handle it. Which is nifty.
I fully endorse the practice of placing surrogate keys on tables. It leads to flexibility in changing the data model in the future. If you’ve ever had to change the primary key of a table then you can know how painful that can potentially be. When you change the primary key you need to change all of the relationships and foreign keys related to that key. Messy. Like chest-burster alien messy. Wait, maybe not that messy but you catch my drift.

The problem starts when the data modeler creates the table with surrogate keys but forgets to identify the natural keys of the table. The natural key is the column or group of columns that uniquely identify a row in a table. Say we had a table called Car. The natural key of the Car table most likely be VIN, the Vehicle Identification Number, that’s an auto industry standard. That would uniquely identify each automobile. Let’s just say I put a surrogate key and make that my primary key on the Car table.

Everything looks great right? RIGHT? Notsomuch. What will most likely happen in this situation, scratch that, what will happen is that you’ll get duplicate data. Did you get that? YOU WILL GET DUPLICATE DATA AND IT’S ALL YOUR FAULT. Ok, so maybe that was a bit much. Sorry about that. So how does the data duplication happen? Since we don’t have a unique constraint on VIN there’s nothing preventing the duplicate VIN from being inserted. So you say you have an API that prevents the duplicate data? That’s great! So you’re doing a lookup on VIN before you insert the new row into the Car table? Ok…that sounds like a lookup you shouldn’t have to make. Are you also saying that you’ll NEVER load data into the database without using the API or application? So you can never use tools such as SSIS, Informatica or even Bulk Load without the API? That sounds…slow.

But there is one thing that you forgot about and that thing has the most influence on your database…the DBA. Yes, the dreaded DBA. At some point someone is going to ask the DBA, the keeper of the kingdom, to load data in that table. And that DBA may be a *gasp* junior DBA. My point is that “bad” data is trying to find a way into your system, right now. Adding the alternate key is a great way to prevent the duplicate data from entering the database. So here’s how you would create it

So what if you have a table that already exists? How do you add the alternate key then? We’ll I’m sorry to say that you’re out of luck. You’ll have duplicate data forever. Just kidding. Here’s how you do it.

If you don’t get an error then great! You didn’t have duplicate data in the table! Yay! Hi-five! But what if you get something like this?

Msg 1505, Level 16, State 1, Line 1
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.Car' and the index name 'AK_Car__VIN'. The duplicate key value is (12345678901234567).
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.
The statement has been terminated.

Uh-oh. We’re too late. You have duplicate data. I’m sorry. But the good news is that it’s not terminal. The database will survive but we have to perform surgery. It may be painful but we can do it. First we have to identify the “bad” rows. This is how you would do is manually

You would have to do this manually for each table that you want to put the alternate key on. Yuck. There’s an easier way to find duplicate data for your potential alternate keys, sp_DataProfile. sp_DataProfile is a tool that I developed to gain insight on your data. Just give sp_DataProfile your table name and your column list and it will tell you the data that is duplicated for those columns.

So how do we fix it? You really have two options:

  1. Change the data so that that it’s unique
  2. Delete the data.

Here’s The SQL to look at the full data to see the full output so that you can make a decision on what to do. Of course you would need to do this for every duplicate value.

If you use sp_DataProfile you can just copy the SQL from the ViewDataSQL column and run it.

sp_DataProfile_AKCheck_Output_ViewSQL

Once you’ve cleaned up all of the duplicates you can safely apply the alternate key.

There is one other thing I didn’t mention about alternate keys and that’s how they’re implemented by SQL Server. SQL Server implements alternate keys as indexes. Did you just say “so”? I know you did. Soooo an index gives SQL Server another option to retrieve data faster than hitting the table itself. (Your results may vary. If you query takes more than four hours please consult a professional.) Take a look at this query:

Execution_Plan_AK_Seek

You see that? The query plan that SQL Server chose wasn’t against the clustered index (aka the table) it chose our alternate key (AK_Car__Vin). It’s amazing!

So if you want to eradicate duplicate data in your database create an alternate key. You’ll be glad you did when your reports return the correct numbers.