if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
This infamous threat was first published a decade ago by Joel Spolsky. Unfortunately, a lot of people thought he was merely kidding and as a result, many of us still don’t fully understand Unicode and for that matter the difference between Unicode, UTF-8 and UTF-16. And that is the main motivation behind this article.
Without further ado, let us jump straight into action. Say, one fine afternoon, you receive an email from a long lost friend from High School with an attachment in .txt, or as it is often referred to as: the “plain text” format. The attachment consists of the following sequence of bits:
The email itself is empty, adding to the mystery. Before you kickstart your favorite text editor and open the attachment, have you ever wondered how does the text editor interpret the bits pattern to display characters? Specifically, how does your computer know the following two things:
- How the bytes are grouped (E.g. 1 or 2-byte characters?)
- How to map byte or bytes to characters?
The answer to these questions lie in the document’s Character Encoding. Loosely speaking, encoding define two things:
- How the bytes are grouped, for example 8-bits or 16 bits. Also known as Code Unit.
- Mapping of Code Units to Characters (E.g. In ASCII, decimal 65 maps to the letter A).
Character Encodings are tiny bit different from Character Sets but that really isn’t relevant to you unless you are designing a low level library.
One of the most popular encoding schemes, at least in the Western World, of the last century was known as ASCII. The table below shows how code units map of characters in ASCII.
There is a common misconception even amongst seasoned developers that “plain text” uses ASCII and that each character is 8-bits.
Truth be told, there is no such thing as “plain text”. If you have a string in memory or disk and you do not know its encoding, you cannot interpret it or display it. There is absolutely no other way around it.
How can your computer interpret the attachment you just received when it doesn’t specify encoding? Does this mean you can never read what your long lost friend really wanted to tell you? Before we get to the answer, we must travel back in time to the dark ages… where 29 MB hard disk was the best money (and a lot of it) could buy!
Long, long time ago, computer manufacturers had their own way of representing characters. They didn’t bother to talk to one another and came up with whatever algorithm they liked to render “glyphs” on screens. As computers became more and more popular and the competition intensified, people got sick and tired of this “custom” mess as data transfer between different computer systems became a pain in the butt.
Eventually, computer manufacturers got their heads together and came up with a standard way of describing characters. “Lo and behold”, they declared “the low 7-bits in a byte represent character“. And they created a table like the one shown in the first figure to map each of the 7-bit value to a character. For example, the letter A was 65, c was 99, ~ was 126 and so on. And ASCII was born. The original ASCII standard defined characters from 0 to 127, which is all you can fit in 7 bits. Life was good and everyone was happy. That is, for a while…
Why they picked 7 bits and not 8? I don’t exactly care. But a byte can fit in 8 bits. This means 1 whole bit was left completely unused and the range from 128 to 255 was left unregulated by the ASCII guys, who by the way, were Americans, who knew nothing, or even worst, didn’t care about the rest of the world.
People in other countries jumped at this opportunity and they started using the 128-255 range to represent characters in their languages. For example, 144 was گ in Arabic flavour of ASCII, but in Russian, it was ђ. Even in the United States of America, there were many different interpretations of the unused range. IBM PC came out with the “OEM font” or the “Extended ASCII” which provided fancy graphical characters for drawing boxes and supported some of the European characters like the pound (£) symbol.
A “cool” looking DOS splash screen made using IBM’s Extended ASCII Charset.
To recap: the problem with ASCII was that while everyone agreed what to do with codes up to 127, the range 128-255 had many, many different interpretations. You had to tell your computer the flavor of ASCII to display characters in the 128-255 range correctly.
This wasn’t a problem for North Americans and people of British Isles since no matter which ASCII flavor was being used, the latin alphabets stayed in the same – The British had to live with the fact that the original ASCII didn’t include their currency symbol. “Blasphemy! those arseholes.” But that’s water under the bridge.
Meanwhile, in Asia, there was even more madness going on. Asian languages have a lot of characters and shapes that need to be stored. 1 byte isn’t enough. So they started using 2 bytes for their documents.. This was known as DBCS (Double Byte Coding Scheme). In DBCS, String manipulation using pointers was a pain: how could you do str++ or str–?
All this craziness caused nightmares for system developers. For example, MS DOS had to support every single flavour of ASCII since they wanted their software to sell in other countries. They came out with a concept called “Code Pages”. For example, you had to tell DOS that you wish to use the Bulgarian Code Page to display Bulgarian letters, using “chcp” command in DOS. Code Page change was applied system wide. This posed a problem for people working in multiple languages (e.g. English and Turkish) as they had to constantly change back and forth between code pages.
While Code Pages was a good idea, it wasn’t a clean approach. It was rather a hack or “quick” fix to make things work.
Eventually, Americans realized that they need to come up with a standard scheme to represent all characters in all languages of the world to alleviate some of the pain software developers were feeling and to prevent Third World War over Character Encodings. And out of this need, Unicode was born.
The idea behind Unicode is very simple, yet widely misunderstood. Unicode is like a phone book: A mapping between characters and numbers. Joel called them magic number since they may be assigned at random and without explanation. The official term is code points and they always begin with U+. Every single alphabet of every single language (theoretically) is assigned a “magic number” by the Unicode Consortium. For example, The Aleph letter in Hebrew, א, is U+2135, while the letter A is U+0061.
Unicode doesn’t say how characters are represented in bytes. Not at all. It just assigns magic numbers to characters. Nothing else.
Other common myths include: Unicode can only support characters up to 65,536. Or that all Unicode characters must fit 2 bytes. Whoever told you get must immediately get a brain transplant!
Remember, Unicode is just a standard way to map characters to magic numbers. There is no limit on the number of characters Unicode can support. No, Unicode characters don’t have to fit in 2, 3, 4 or any number of bytes.
How Unicode characters are “encoded” as bytes in memory a is separate topic. One that is very well defined by ”Unicode Transformation Formats” or UTF’s.
Two of the most popular Unicode encodings remain the UTF-8 and UTF-16. Let’s look at them in detail.
UTF-8 was an amazing concept: it single handedly and brilliantly handled backward ASCII compatibility making sure that Unicode is adopted by masses. Whoever came up with it must at least receive the Nobel Peace Prize.
In UTF-8, every character from 0-127 is represented by 1 byte, using the same encoding as US-ASCII. This means that that a document written in 1980’s could be opened in UTF-8 without any problem. Only characters from 128 and above are represented using 2, 3,or 4 bytes. For this reason, UTF-8 is called variable width encoding.
Going back to our example at the beginning of this post, the attachment from your long lost high school friend had the following byte stream:
The byte stream in both ASCII and UTF-8 displays the same characters: HELLO.
Another popular variable width encoding for Unicode characters: It uses either 2 bytes or 4 bytes to store characters. However, people are now slowly realizing that UTF-16 may be wasteful and not such a good idea. But that’s another topic.
Little Endian or Big Endian
Endian is pronounced “End-ian” or “Indian”. The term traces its origin to Gulliver’s Travels.
Little or Big Endian is just a convention for storing and reading groups of bytes (called words) from memory. This means when you give your computer the letter A to store in memory in UTF-16 as two bytes, your computer decides using Endianness scheme it is using whether to place the first byte ahead of second byte or the other way around. Ah, this getting confusing. Let’s look at an example: Let’s say you want to save the attachment from your long lost friend you downloaded using UTF-16, you could end up with the following bytes in UTF-16 depending on the computer system you are on:
00 48 00 65 00 6C 00 6C 00 6F (big end, the high order byte is stored first, hence Big Endian)
48 00 65 00 6C 00 6C 00 6F 00 (little end, the low order byte is stored first, hence Little Endian)
Endianness is just a matter of preference by microprocessor architecture designers. For example, Intel uses Little Endian, while Motorola uses Big Endian.
Byte Order Mark
If you regularly transfer documents between Little and Big Endian systems and wish to specify endianness, there is a weird convention known as the Byte Order Mark or BOM for that. A BOM is a cleverly designed character which is placed at the beginning of the document to inform reader about the endianness of the encoded text. In UTF-16, this is acheived by placing FE FF, as the first byte. Depending on the Endianness of the system the document is accessed on, this will appear as either FF FE or FE FF, giving parser immediate hint of the endianness.
BOM, while useful, isn’t neat since people have been using a similar concept called “Magic Byte” to indicate the File Type for ages. The relation between BOM and Magic Byte isn’t well defined and may confuse some parsers.
Alright, that is all folks. Congratulations on making it this far: You must be an endurance reader.
Remember the bit about there being no such thing as “plain text” introduced at the beginning of this post that left you wondering how does your text editor or Internet Browser displays correct text every time? The answer is that the software deceives you and that is why a lot of people don’t know about encoding: when the software cannot detect the encoding, it guesses. Most of the time, it guesses the encoding to be UTF-8 which covers its proper subset ASCII, or for that matter ISO-8859-1, as well as partial coverage for almost every character set ever conceived. Since the latin alphabet used in English is guaranteed to be the same in almost all encodings, including UTF-8, you still see your english language characters displayed correctly even if the encoding guess is wrong.
But, every now and then, you may see � symbol while surfing the web… a clear sign that encoding is not what your browser thought it was. Time to click on View->Encoding menu option of your web browser and start experimenting with encodings.
If you didn’t have time to read the entire document or you skimmed through it, it is Okay. But make sure that you understand the following points at all cost otherwise, you will miss on some of the finest pleasures this life has to offer.
- There is no such thing as plain text. You must know the encoding of every String you want to read.
- Unicode is simply a standard way of mapping characters to numbers. The Brave Unicode people deal with all the politics behind including new characters and assigning numbers.
- Unicode does NOT say how characters are represented as bytes. This is dictated by Encodings and specified by Unicode Transformation Formats (UTF’s).
And, most importantly,
- Always, I mean always, indicate the encoding of your document either by using Content-Type or meta charset tag. By doing this, your are preventing web browsers from guessing the encoding and telling exactly which encoding they should use to render the page.