BackToBasics: Great Debates: Unicode

In my Bad Habits session (and my Become a Better Developer pre-con), one of my slides addresses choosing the wrong data type. I have a demo that shows why date or smalldatetime can often be a better choice than datetime (which people tend to use universally, regardless of precision required), and I give several examples of scenarios where people pick the wrong data type - usually too big, too small, or simply inappropriate for the data they want to store there.

One of the more common dilemmas schema designers face, and I'm realizing it's something that I should probably spend more time on in my presentations, is whether to use varchar or nvarchar for columns that will store string data. As part of my #EntryLevel challenge, I thought I'd start by writing a bit about this here.

In general, I come across two schools of thought on this:

  1. Use varchar unless you know you need to support Unicode.
  2. Use nvarchar unless you know you don't.

[Image: Andy's tweet]

A couple of weeks ago, this came up on Twitter, when my friend Andy Yun (@SQLBek) gave a little jab toward the latter camp. We had a bit of a jovial back-and-forth, because in scenarios where you can't know for certain, I lean in the opposite direction (and don't worry, I'll explain why).

Before I get into that, let me try to lay out how these two data types differ, and why this choice is important.

What's the Difference?

In simplest terms, varchar stores ASCII characters, while nvarchar stores Unicode characters. ASCII is (again, simplistically, ignoring language-specific collations) used for characters in the standard Latin set, while Unicode supports a much larger set of characters from many languages. An important aspect of this difference in SQL Server is that each ASCII character requires one byte, while each Unicode character requires two bytes.*

* For brevity, I'm not going to get into things like collations, code pages, or UTF-8 vs. UTF-16. Also, I will address compression below.
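
If you want to see the difference for yourself, DATALENGTH returns the number of bytes (not characters) a value occupies, so a quick comparison looks something like this (the sample string is just an illustration):

SELECT [varchar_bytes]  = DATALENGTH(CONVERT(VARCHAR(32),  'SQL Server')),  -- 10
       [nvarchar_bytes] = DATALENGTH(CONVERT(NVARCHAR(32), N'SQL Server')); -- 20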

As a quick example, if you wanted to store a Japanese string like 供が開発したリレーちシ in SQL Server (using the default collation), you would need to use nvarchar. And whenever you define a Unicode string literal, you must remember to always use the N prefix. Otherwise bad things happen:

SELECT no_prefix =  '供が開発したリレーちシ',
        [prefix] = N'供が開発したリレーちシ';

Result:

[Results: the no_prefix value comes back as a string of question marks; the prefix value comes back intact]

The N prefix can't always help you, though. If you've defined your variable or column as varchar and attempt to store Unicode data in it, a funny thing happens:

DECLARE @x VARCHAR(32) = N'供が開発したリレーちシ';

CREATE TABLE #x (y VARCHAR(32));

INSERT #x (y)
SELECT N'供が開発したリレーちシ';

SELECT [variable] = @x
     , [column] = y
FROM #x;

Result:

[Results: both the variable and the column come back as question marks]

Consequences of choosing varchar

As you can imagine, if you create varchar columns thinking you'll "never" have to support Unicode data, you could be looking at costly outcomes:

  • You can lose data, with no warning or error message. The original characters aren't captured anywhere, either - not even the transaction log, which only ever sees them after they've been replaced - so this is immediate, unrecoverable data loss, unless you can reconstruct the exact steps that led to the insert in the first place.
  • You may need to refactor, and the actual cost of this - depending on the scope of the change and the level of regression testing - can vary wildly (see the sketch just below this list).
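
The ALTER itself is the easy part of that refactor; the expensive parts are everything around it - parameters, variables, application code, string literals missing the N prefix, indexes that have to be rebuilt. As a minimal sketch of the schema side, using a hypothetical dbo.Customers table:

-- On a large table this is a size-of-data operation, so expect heavy logging and blocking.
ALTER TABLE dbo.Customers
    ALTER COLUMN CustomerName NVARCHAR(100) NOT NULL; -- was VARCHAR(100)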

There are some arguments out there that, since the Windows operating system natively uses Unicode, varchar data incurs conversion overhead. While I haven't explicitly tested this, and off the cuff am not sure how you would reliably identify and isolate that specific overhead, I'm not inclined to believe it would be significant on today's hardware.

Consequences of choosing nvarchar

On the other hand, if you choose nvarchar (even if you currently don't need to support Unicode characters), you of course have to pay "the Unicode tax" - roughly double the storage size for the same string (not *exactly* double because the same space is needed for row metadata like nullability and length). And this has a trickle-down effect, of course, since more space for the pages on disk means more space for the pages in the buffer pool, lower page density, more costly I/O (particularly for things like scans), and so on. Andy has presented on this topic in the past ("Every Byte Counts: Why Your Data Type Choices Matter").

In Enterprise Edition, starting with SQL Server 2008 R2, you can use Unicode compression, which - in exchange for a little CPU - means you can store two bytes per character only when the character actually needs two bytes. So if you have a string like foo発, it will be 5 bytes under Unicode compression, but 8 bytes without.
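
If you're not sure whether compression will pay off for a specific table, you can estimate the savings before committing, then enable row compression (which is what brings Unicode compression along). A rough sketch, using the same hypothetical dbo.Customers table:

-- Estimate first:
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'Customers',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'ROW';

-- Then rebuild with row compression:
ALTER TABLE dbo.Customers REBUILD WITH (DATA_COMPRESSION = ROW);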

My Bias

I appreciate and understand Andy's argument, which is that the storage and performance savings of varchar usually outweigh the risk of having to refactor later. In other words, he believes that it is rare someone will design an application with varchar and later have to support Unicode, and even if that does happen, the storage and performance savings in the meantime are worth the risk.

I am certainly in the second camp, for two reasons:

  1. I have witnessed the data loss and the refactor process, and in a complex application, this is not pretty.
  2. It is much easier to throw hardware at the problem, if it even becomes a problem. Often it doesn't, because we're talking about proper names that are going to take 20-40 bytes instead of 10-20 (though I will confess that if you're storing things like questions/answers, blog posts/comments, etc. then the impact is different).

And honestly, if the use of Unicode tips the performance of your application from acceptable to unacceptable, then it was only a matter of time anyway. If you can't withstand doubling the size of the relevant columns in 500,000 existing rows, you wouldn't be able to withstand scaling to 1,000,000 all-varchar rows, either - ultimately it will amount to roughly the same delta in storage, memory, I/O, and network requirements. So all you'd be doing by betting against a future Unicode requirement is shifting exactly when you'd start observing performance degradation and/or have to throw hardware at the problem.

You may be designing your schema using today's requirements, but if there's any chance you need to support internationalization to any extent in the future, just add that to today's requirements and start with Unicode (and again, if you are on Enterprise Edition, you can do so without paying any of the costs until later). And please note that I am not suggesting it makes sense to use nchar/nvarchar for *every* string column - some will never need to support Unicode, like telephone numbers, zip/postal codes, and VIN numbers. But be careful about proper names (e.g. people, products, addresses, cities, or companies), or anything where people can type arbitrary text (comments or descriptions). You might have already lost data and not even noticed, and I believe my elaboration about this risk has helped change Andy's opinion. Hopefully it will help you, too.
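
One last practical tip: if you want a rough way to check whether that silent loss has already happened, remember that the lost characters come back as literal question marks, and proper names rarely contain those on purpose. So a scan like this (against a hypothetical dbo.Customers table) can surface suspicious rows - note that ? is not a wildcard in LIKE, so it matches the literal character:

SELECT CustomerID, CustomerName
  FROM dbo.Customers
  WHERE CustomerName LIKE '%?%';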