Unicode isn’t harmful for health – Unicode Myths debunked and encodings demystified

if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

This infamous threat was first published a decade ago by Joel Spolsky. Unfortunately, a lot of people thought he was merely kidding and as a result, many of us still don’t fully understand Unicode and for that matter the difference between Unicode, UTF-8 and UTF-16. And that is the main motivation behind this article.

Without further ado, let us jump straight into action. Say, one fine afternoon, you receive an email from a long lost friend from high school with an attachment in .txt, or as it is often referred to: the “plain text” format. The attachment consists of the following sequence of bits:

0100100001000101010011000100110001001111

The email itself is empty, adding to the mystery. Before you kickstart your favorite text editor and open the attachment, have you ever wondered how the text editor interprets the bit pattern to display characters? Specifically, how does your computer know the following two things:

  1. How are the bytes grouped (e.g. 1-byte or 2-byte characters)?
  2. How are the bytes (or groups of bytes) mapped to characters?

The answer to these questions lies in the document’s Character Encoding. Loosely speaking, an encoding defines two things:

  1. How the bytes are grouped, for example into 8-bit or 16-bit units. Such a group is known as a Code Unit.
  2. How Code Units map to characters (e.g. in ASCII, decimal 65 maps to the letter A).
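
To make this concrete, here is a minimal Python sketch (Python is simply a convenient choice here) showing both halves of that definition for ASCII:

    # Code units: in ASCII, each character occupies a single 8-bit byte (one code unit).
    # Mapping: the number 65 maps to 'A', 72 to 'H', 73 to 'I', and so on.
    print(ord("A"))                         # 65
    print(bytes([72, 73]).decode("ascii"))  # HI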

Character Encodings are a tiny bit different from Character Sets, but that really isn’t relevant to you unless you are designing a low level library.

One of the most popular encoding schemes of the last century, at least in the Western world, was ASCII. The table below shows how code units map to characters in ASCII.

US ASCII Chart

There is a common misconception even amongst seasoned developers that “plain text” uses ASCII and that each character is 8-bits.

Truth be told, there is no such thing as “plain text”. If you have a string in memory or on disk and you do not know its encoding, you cannot interpret it or display it. There is absolutely no way around it.

How can your computer interpret the attachment you just received when it doesn’t specify an encoding? Does this mean you can never read what your long lost friend really wanted to tell you? Before we get to the answer, we must travel back in time to the dark ages… where a 29 MB hard disk was the best that money (and a lot of it) could buy!

Historical Perspective

A long, long time ago, computer manufacturers had their own ways of representing characters. They didn’t bother to talk to one another and came up with whatever algorithm they liked to render “glyphs” on screens. As computers became more and more popular and the competition intensified, people got sick and tired of this “custom” mess, because data transfer between different computer systems had become a pain in the butt.

Eventually, computer manufacturers got their heads together and came up with a standard way of describing characters. “Lo and behold”, they declared, “the low 7 bits in a byte represent a character“. And they created a table like the one shown in the first figure to map each of the 128 possible 7-bit values to a character. For example, the letter A was 65, c was 99, ~ was 126 and so on. And ASCII was born. The original ASCII standard defined characters from 0 to 127, which is all you can fit in 7 bits. Life was good and everyone was happy. That is, for a while…

Why did they pick 7 bits and not 8? I don’t exactly care. But a byte holds 8 bits, which means one whole bit was left completely unused, and the range from 128 to 255 was left unregulated by the ASCII guys, who, by the way, were Americans, and who knew nothing about, or even worse, didn’t care about, the rest of the world.

People in other countries jumped at this opportunity and started using the 128-255 range to represent characters in their languages. For example, 144 was گ in an Arabic flavour of ASCII, but in a Russian flavour it was ђ. Even in the United States of America, there were many different interpretations of the unused range. The IBM PC came out with the “OEM font”, or “Extended ASCII”, which provided fancy graphical characters for drawing boxes and supported some European characters such as the pound (£) symbol.

A “cool” looking DOS splash screen made using IBM’s Extended ASCII Charset.
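
If you are curious what this looks like in practice, here is a small Python sketch. It uses the Windows code pages cp1256 (Arabic script) and cp1251 (Cyrillic) as stand-ins for the “flavours of ASCII” described above; that pairing is my assumption for illustration, not a claim about any particular historical system:

    # The same single byte, decoded under two different "flavours" of extended ASCII.
    b = bytes([144])
    print(b.decode("cp1256"))  # گ (an Arabic-script letter in the Arabic code page)
    print(b.decode("cp1251"))  # ђ (a Cyrillic letter in the Cyrillic code page)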

To recap: the problem with ASCII was that while everyone agreed what to do with codes up to 127, the range 128-255 had many, many different interpretations. You had to tell your computer the flavor of ASCII to display characters in the 128-255 range correctly.

This wasn’t a problem for North Americans or the people of the British Isles, since no matter which ASCII flavour was being used, the Latin alphabet stayed the same. The British did have to live with the fact that the original ASCII didn’t include their currency symbol. “Blasphemy, those arseholes!” But that’s water under the bridge.

Meanwhile, in Asia, there was even more madness going on. Asian languages have a lot of characters and shapes that need to be stored, and one byte isn’t enough. So they started using 2 bytes per character for their documents. This was known as DBCS (Double Byte Character Set). In DBCS, string manipulation using pointers was a pain: how could you do str++ or str--?
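
To see why pointer-style iteration breaks down, here is a short Python sketch using Shift-JIS as a representative double-byte encoding (purely an illustrative choice):

    # In a double-byte encoding, one character no longer equals one byte.
    text = "日本語"
    raw = text.encode("shift_jis")
    print(len(text))  # 3 characters
    print(len(raw))   # 6 bytes -- stepping one byte at a time lands mid-character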

All this craziness caused nightmares for system developers. For example, MS DOS had to support every single flavour of ASCII since they wanted their software to sell in other countries. They came up with a concept called “Code Pages”. For example, you had to tell DOS that you wished to use the Bulgarian code page to display Bulgarian letters, using the “chcp” command. A code page change applied system-wide, which posed a problem for people working in multiple languages (e.g. English and Turkish), as they had to constantly switch back and forth between code pages.

While code pages were a good idea, they weren’t a clean approach. They were more of a hack, a “quick fix” to make things work.

Enter Unicode

Eventually, Americans realized that they needed to come up with a standard scheme to represent all characters in all the languages of the world, to alleviate some of the pain software developers were feeling and to prevent a Third World War over character encodings. And out of this need, Unicode was born.

The idea behind Unicode is very simple, yet widely misunderstood. Unicode is like a phone book: a mapping between characters and numbers. Joel called them magic numbers, since they may as well be assigned at random and without explanation. The official term is code points, and they are always written with a U+ prefix. Every single character of every single language (theoretically) is assigned a “magic number” by the Unicode Consortium. For example, the Hebrew letter Aleph, א, is U+05D0, while the letter A is U+0041.

Unicode doesn’t say how characters are represented in bytes. Not at all. It just assigns magic numbers to characters. Nothing else.
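
You can poke at the character-to-number mapping directly from Python; note that nothing below involves bytes at all:

    import unicodedata

    # ord() gives the code point; unicodedata.name() gives its official name.
    for ch in ["A", "א"]:
        print(ch, "U+%04X" % ord(ch), unicodedata.name(ch))
    # A U+0041 LATIN CAPITAL LETTER A
    # א U+05D0 HEBREW LETTER ALEF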

Other common myths include: that Unicode can only support characters up to 65,536, or that all Unicode characters must fit in 2 bytes. Whoever told you that should immediately get a brain transplant!

Remember, Unicode is just a standard way to map characters to magic numbers. It is not tied to any 16-bit limit (the current standard defines room for over a million code points), and no, Unicode characters don’t have to fit in 2, 3, 4 or any particular number of bytes.

How Unicode characters are “encoded” as bytes in memory is a separate topic, one that is very well defined by the “Unicode Transformation Formats”, or UTFs.

Unicode Encodings

Two of the most popular Unicode encodings are UTF-8 and UTF-16. Let’s look at them in detail.

UTF-8

UTF-8 was an amazing concept: it single-handedly and brilliantly handled backward compatibility with ASCII, making sure Unicode could be adopted by the masses. Whoever came up with it should at least receive the Nobel Peace Prize.

In UTF-8, every character from 0-127 is represented by 1 byte, using the same encoding as US-ASCII. This means that an ASCII document written in the 1980s can be opened as UTF-8 without any problem. Only characters 128 and above are represented using 2, 3, or 4 bytes. For this reason, UTF-8 is called a variable width encoding.
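
A quick way to see the variable width in action (a minimal Python sketch; the particular characters are just examples):

    # ASCII stays at 1 byte; everything else takes 2, 3 or 4 bytes in UTF-8.
    for ch in ["A", "£", "€", "😀"]:
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # A 1, £ 2, € 3, 😀 4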

Going back to our example at the beginning of this post, the attachment from your long lost high school friend had the following byte stream:

0100100001000101010011000100110001001111

In both ASCII and UTF-8, this byte stream decodes to the same characters: HELLO.
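
You can verify this yourself; the sketch below groups the attachment’s bits into bytes (written here in hex) and decodes them both ways:

    data = bytes.fromhex("48 45 4C 4C 4F")
    print(data.decode("ascii"))  # HELLO
    print(data.decode("utf-8"))  # HELLO -- identical, because UTF-8 is ASCII-compatible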

UTF-16

UTF-16 is another popular variable width encoding for Unicode characters: it uses either 2 or 4 bytes per character. However, people are now slowly realizing that UTF-16 may be wasteful and not such a good idea. But that’s another topic.
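
For the curious, a short Python sketch showing both widths (the emoji is simply an example of a character outside the Basic Multilingual Plane):

    # 2 bytes for most characters, 4 bytes (a surrogate pair) for the rest.
    print(len("A".encode("utf-16-be")))   # 2
    print(len("😀".encode("utf-16-be")))  # 4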

Little Endian or Big Endian

Endian is pronounced “End-ian” or “Indian”. The term traces its origin to Gulliver’s Travels.

Little and Big Endian are just conventions for storing and reading groups of bytes (called words) in memory. When you give your computer the letter A to store in memory as two bytes in UTF-16, the endianness scheme it uses decides whether the first (high-order) byte is placed ahead of the second or the other way around. Ah, this is getting confusing. Let’s look at an example: say you save the attachment from your long lost friend as UTF-16. Depending on the computer system you are on, you could end up with either of the following byte sequences:

00 48  00 45  00 4C  00 4C  00 4F (the high order byte is stored first, hence Big Endian)

OR,

48 00  45 00  4C 00  4C 00  4F 00 (the low order byte is stored first, hence Little Endian)

Endianness is just a matter of preference by microprocessor architecture designers. For example, Intel uses Little Endian, while Motorola uses Big Endian.
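
Here is the HELLO example again as a Python sketch, asking explicitly for each byte order (the separator argument to .hex() needs Python 3.8 or newer):

    print("HELLO".encode("utf-16-be").hex(" "))  # 00 48 00 45 00 4c 00 4c 00 4f
    print("HELLO".encode("utf-16-le").hex(" "))  # 48 00 45 00 4c 00 4c 00 4f 00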

Byte Order Mark

If you regularly transfer documents between Little and Big Endian systems and wish to specify the endianness, there is a convention known as the Byte Order Mark, or BOM, for that. A BOM is a cleverly chosen character, U+FEFF, which is placed at the beginning of the document to inform the reader about the endianness of the encoded text. In UTF-16, a big-endian writer stores it as the bytes FE FF, while a little-endian writer stores it as FF FE, giving the parser an immediate hint about the byte order.

The BOM, while useful, isn’t entirely neat, since people have been using a similar concept called “magic bytes” to indicate file types for ages. The relationship between the BOM and magic bytes isn’t well defined and may confuse some parsers.
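
As a rough illustration, Python’s “utf-16” codec relies on exactly this trick: when a BOM is present, it picks the right byte order from it and strips it before handing you the text (a minimal sketch, building the two streams by hand):

    bom = "\ufeff"
    be = (bom + "HELLO").encode("utf-16-be")  # starts with FE FF
    le = (bom + "HELLO").encode("utf-16-le")  # starts with FF FE
    print(be.decode("utf-16"))  # HELLO -- byte order detected from the BOM
    print(le.decode("utf-16"))  # HELLO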

Alright, that is all folks. Congratulations on making it this far: You must be an endurance reader.

Remember the claim at the beginning of this post that there is no such thing as “plain text”, which left you wondering how your text editor or web browser displays the correct text every time? The answer is that the software deceives you, and that is why a lot of people don’t know about encodings: when the software cannot detect the encoding, it guesses. Most of the time it guesses UTF-8, which is a superset of ASCII, or it falls back to a legacy default such as ISO-8859-1. Since the Latin alphabet used in English maps to the same bytes in almost every common encoding, including UTF-8, your English characters still display correctly even when the guess is wrong.
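
You can reproduce the effect of a wrong guess in a couple of lines of Python (cp1252 stands in here for a typical legacy fallback; the actual default varies by browser and platform):

    # ASCII letters survive a wrong guess; only the non-ASCII bytes turn into mojibake.
    data = "café".encode("utf-8")
    print(data.decode("utf-8"))   # café  (correct guess)
    print(data.decode("cp1252"))  # cafÃ©  (wrong guess, but "caf" still reads fine)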

But every now and then, you may see the � symbol while surfing the web… a clear sign that the encoding is not what your browser thought it was. Time to click on the View->Encoding menu of your web browser and start experimenting with encodings.

Summary

If you didn’t have time to read the entire article, or you skimmed through it, that is okay. But make sure you understand the following points at all costs, otherwise you will miss out on some of the finest pleasures this life has to offer.

  • There is no such thing as plain text. You must know the encoding of every string you want to read.
  • Unicode is simply a standard way of mapping characters to numbers. The brave Unicode people deal with all the politics of adding new characters and assigning them numbers.
  • Unicode does NOT say how characters are represented as bytes. This is dictated by Encodings and specified by Unicode Transformation Formats (UTF’s).

And, most importantly,

  • Always, and I mean always, indicate the encoding of your document, either by using the Content-Type header or a meta charset tag. By doing this, you prevent web browsers from guessing the encoding and tell them exactly which encoding they should use to render the page.

The inspiration and ideas for this article came from the best article on Unicode, by Joel.

49 thoughts on “Unicode isn’t harmful for health – Unicode Myths debunked and encodings demystified”

  1. Does encoding apply to source code as well or documents only? How does the python command line know how to read the source code?

    Signed, Confused.

    • I’m a little late here, but the reason to use UTF16 over UTF8 is for those cases, where you frequently encounter characters that need 3 bytes in UTF8 and 2 in UTF16. You see, UTF8 packs only half as many characters into two bytes.

  2. When they say Java “has full Unicode support”, does that mean that Unicode characters can appear in Java strings, or that the source .java file has to be in Unicode? For instance, does javac expect source files to be in Unicode format always?

  3. > The original ASCII standard defined characters from 0 to 126.
    0 to 127. 127 is a power of 2 minus 1, which should be a hint; in specific, it’s two to the seventh minus one, since ASCII defines codepoints for all possible combinations of seven bits, which is 128 possible codepoints, so the enumeration ends at 127 if you count starting from zero, as computer programmers are wont to do.

    > all possible 127 ASCII characters
    128 characters, as mentioned above.

    > the ASCII guys, who by the way, were American
    ASCII stands for American Standard Code for Information Interchange. The ethnocentrism was unfortunate but it isn’t like you weren’t warned.

    > The numbers are called “magic numbers” and they begin with U+.
    He can call them “magic numbers” but everyone else calls them “codepoints”.

    > UTF-8 was an amazing concept: it single handedly and brilliantly handled backward ASCII compatibility making sure that Unicode is adopted by masses. Whoever came up with it must at least receive the Nobel Peace Prize.
    I’m sure Ken Thompson and Rob Pike will be happy to hear someone thinks that way.

    • >ASCII stands for American Standard Code for Information Interchange. The ethnocentrism was unfortunate but it isn’t like you weren’t warned.
      Yea, I think it reflects how much technology development comes from America even back then as well as now.

    • >> UTF-8 was an amazing concept: it single handedly and brilliantly handled backward ASCII compatibility making sure that Unicode is adopted by masses. Whoever came up with it must at least receive the Nobel Peace Prize.
      > I’m sure Ken Thompson and Rob Pike will be happy to hear someone thinks that way.
      If that’s the case then I think the designers of GB18030 deserve it more, because they achieved an encoding that is able to map all Unicode codepoints while being backwards compatible with GB2312, which is itself backwards compatible with ASCII.
      But seriously UTF-8 is like sliced bread after having dealt with we-thought-64K-is-enough-so-lets-all-use-16-bits UCS-2/UTF-16.

  4. I feel like we’re at a point now where articles that just try to ‘demystify’ unicode are almost teaching the controversy if they don’t come out and actually say how you should deal with encodings in new apps.
    It’s about time we actually start pressing for the idea that utf-16 was a terrible idea and that utf-8 should be the dominant wire format for unicode, with ucs4 if you really need to have a linear representation.
    Utf-16 is confusing, complicated, and implementations are routinely broken because the corner cases are rarer. I really hope we’re not still stuck with it in 50 years.

    • I agree about wire formats. However, programming languages that use UTF-8 as the primary string representation tend to have inferior Unicode support to those that use UTF-16. I think this is because with UTF-8, a lot of stuff just seems to work and so it’s easy to ignore the issues, while UTF-16 forces you to come to grips with the realities of Unicode.
      For example, consider a function to open a file by name. If your strings are UTF-8, you can just pass your null terminated buffer to fopen() or something and things will work fine for most files. But if your strings are internally UTF-16, you have to think about which encoding to use, and you research it, and you discover that holy crap, this stuff differs across OSes, and so we better take this problem seriously.

  5. I don’t think it’s fair to say that Americans “knew nothing, or even worst, didn’t care about the rest of the world” because ASCII didn’t include non-Roman characters.

    First of all, the choice of symbols in ASCII was not made by Americans. ASCII was designed to include the characters of the International Telegraph Alphabet, defined by the ITU (in Geneva, Switzerland). The ITA itself was an expansion and codification of the earlier Baudot code, developed by a Frenchman for use in France.

    Secondly, even if an American operator developed a code for use on American telegraphs that includes the characters used in American English, this does not show a disregard for the rest of the world – any more than when the French, for French domestic telegraphy, used a code that failed to include ß and was therefore unable to transmit the German language correctly.

    • “when the software cannot detect the encoding, it guesses. Most of the time, it guesses the encoding to be UTF-8 which covers its proper subset ASCII, or for that matter ISO-8859-1, as well as partial coverage for almost every character set ever conceived. Since the latin alphabet used in English is guaranteed to be the same in almost all encodings, including UTF-8, you still see your english language characters displayed correctly even if the encoding guess is wrong.”

  6. The byte order mark, despite the name, doesn’t denote endianness in UTF-8, since UTF-8 has no byte-order variants. Rather, the BOM simply denotes the encoding of the text.

  7. >Other common myths include: Unicode can only support characters up to 65,536

    Not really a myth. This was UCS2 and was the situation when a load of important early adopters started with Unicode. Windows, Java, JavaScript all got burnt by this and ended up with UTF-16 as a result. Even Python 2.x on Linux is UTF-16 under the covers 😦

    >Unicode is just a standard way to map characters to magic numbers and there is no limit on the number of characters it can represent.

    Unicode now limits itself to 21 bits of data. This is what allows the surrogate pair coding of UTF-16

  8. Regarding the 7 bits: it’s not that “Americans didn’t care”. (Well, they cared very little, but still..)

    The point is that ASCII was created well before general computing, and was used for all sorts of communications (teletypes, …). The 8th bit was often used for signaling – special codes that told the equipment how to handle the text, or what followed it. It was deliberately left unused for actual data.

  9. I wrote this up before, but it is still applicable:
    The reason unicode is hard to understand is because ascii text, which is just a particular binary encoding, looks for all the world like it is understandable by examining the raw bytes in an editor. To understand encodings and unicode, however, you should try really hard to pretend you cannot understand the ascii text at all. It will make your life simpler.

    Instead, let’s take an analogy from music files like mp3. Say you wanted to edit a music file. To change the pitch or something. You’d have to convert the compressed music encoding which is mp3 into its raw form, which is a sequence of samples. You need to do this because the bytes of an mp3 are incomprehensible as music. (By the way this is exactly what a codec does, it decodes the mp3 to raw samples and plays them out your speakers.)
    You’d do your editing. Then, when it’s time to make it a music file again, you’d convert it back, encode it, if you will, back into an mp3.

    Treat text the same way. Treat ascii text as an unknowable blob. Pretend you can’t read it and understand it. Like the bytes of an mp3 file.

    To do something with it, you need to convert it to its raw form, which is unicode. To convert it, you need to know what it is: is it latin-1 encoded / ascii text? Is it utf-8? (similarly, is it an mp3 file? Is it an AAC file?). And, just like with music files, you can guess what the encoding is (mp3, aac, wav, etc.), but the only foolproof way is to know ahead of time. That’s why you need to provide the encoding.

    Only when it is unicode can you begin to understand it, to do stuff with it. Then, when it’s time to save it, or display it, or show it to a user, you encode it back to the localized encoding. You make it an mp3 again. You make it ascii text again. You make it korean text again. You make it utf-8 again.

    At this point, you cannot do anything with it besides copy it verbatim as a chunk of bytes.
    This is the reason behind the principle of decode (to unicode) early, stay in unicode for as long as possible, and only encode back at the last moment.
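
    A minimal Python sketch of that principle (the file name and the choice of UTF-8 are made up for illustration):

        # Decode at the boundary, work on text in the middle, encode at the boundary.
        raw = open("letter.txt", "rb").read()      # bytes: the "mp3" stage
        text = raw.decode("utf-8")                 # decode early (you must know the encoding)
        text = text.upper()                        # all the real work happens on decoded text
        open("letter_out.txt", "wb").write(text.encode("utf-8"))  # encode late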

    • Sometimes rehashing a topic in new words is valuable, especially when common issues with the problem-area involve the mental models people use. (Look how many times people re-explain and re-illustrate version-control systems, for example.)

  10. > If you wish to transfer documents between Little and Big Endian systems in Unicode, UTF-8 and UTF-16 support a convention known as the Byte Order Mark.
    UTF-8 doesn’t require nor recommend using BOM, and using a BOM with it is strictly a Microsoft-ism.

    >Most of the time, it guesses the encoding to be UTF-8 or a subset of it (US-ASCII or ISO-8859-1)
    ISO-8859-1 isn’t a subset of UTF-8, they just are both supersets of ASCII.

    • In the sense that particular byte values match, you’re right. But in the sense that a particular decoded value matches, ’59-1 can be considered a subset of UTF-8 in that ’59-1 value 128 = UTF-8 value 128, etc.

  11. > There is no real limit on the number of letters that Unicode can define

    Now, that may be technically correct (the best kind of correct) in that the Unicode Consortium can vote to change their minds at some point in the future, but at the present time, there are exactly 1,114,112 possible Unicode code points, not all of which have been assigned.

  12. I don’t know whether to be confused or annoyed.

    “Plain Text” (7 bit ASCII) certainly does exist and there are millions (billions?) of lines of code that expect it.
    There are certainly other implementations of multi-byte character sets and their representations, but the article title is just silly.
    I’m thinking that the author is just trying to build some traffic for some reason, since the other links point to articles that are even weaker and make less sense, and some that don’t even exist.

    • But that’s not “plain text”, it’s 7 bit ASCII. That’s the whole point. If you get a bunch of bytes and are not explicitly told the encoding, you do not know how to display it or modify it. Period. You can guess that it’s 7 bit ASCII, but you might be wrong and can’t complain if your code ends up displaying garbage.
      Please read this. We have enough broken code in the World because developers don’t understand character encoding and think there is such a thing as “plain text”.

  13. Here is my explanation:

    Text, like all data, is a stream of bits (like a river of 11101010101011010110 that flows into your program).
    You may have seen things like “inputstream” and “bytestream” and similar stuff in your programs and libraries, but have no clue what it’s talking about. A stream of bits is what it is talking about.
    Once you have a stream, then either the language, or the programmer, has to divide up the stream into chunks. These chunks become letters.
    But how do you know how big the chunks are, or what letter each chunk represents? That is determined by the code page you use to decode the stream. Decoding the stream is the act of breaking the bitstream up into chunks. You have to know the code page in order to decode it.
    Most programming languages will use the default system codepage, which we normally call “plain text”.
    This is called the ANSI codepage (NOT an “ASCII codepage”). In English Windows, this is windows-1252. In Russian Windows, the code page is windows-1251. These are the “plain text” for their respective systems.
    If you have a file that should be Cyrillic, and you want to view it on English Windows, there are only two ways to view it. Either you change the Windows codepage so “plain text” displays as Cyrillic, or you re-create the stream in a different code page, a Unicode page like UTF-8. You would re-create the stream by reading it and then figuring out which Cyrillic characters in windows-1251 map to which Cyrillic characters in UTF-8. If you just open it in the English ANSI codepage, you will get mojibake, something like Îòìåíèòü.
    Codecs for video and audio files that decode video or audio streams are pretty much the same idea (though they are more multi-layered and complex of course).
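
    If you want to reproduce that mojibake yourself, here is a two-line Python sketch (assuming windows-1251 for the Cyrillic text and windows-1252 for the English system, as described above):

        raw = "Отменить".encode("cp1251")  # Cyrillic text as windows-1251 bytes
        print(raw.decode("cp1252"))        # Îòìåíèòü -- the bytes read as English "plain text"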

  14. […] Unicode isn’t harmful for health – Unicode Myths debunked and encodings demystified: A great article on what Unicode is, its history, how to use it, and some details about its implementation. A perhaps more readable, but less in-depth, version of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) […]
