Unicode isn’t harmful to your health – Unicode myths debunked and encodings demystified

If you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

This infamous threat was first published a decade ago by Joel Spolsky. Unfortunately, a lot of people thought he was merely kidding and as a result, many of us still don’t fully understand Unicode and for that matter the difference between Unicode, UTF-8 and UTF-16. And that is the main motivation behind this article.

Without further ado, let us jump straight into action. Say, one fine afternoon, you receive an email from a long-lost friend from high school with an attachment in .txt, or, as it is often called, the “plain text” format. The attachment consists of the following sequence of bits:

0100100001000101010011000100110001001111

The email itself is empty, adding to the mystery. Before you kickstart your favorite text editor and open the attachment, have you ever wondered how the text editor interprets the bit pattern to display characters? Specifically, how does your computer know the following two things:

  1. How the bytes are grouped (E.g. 1 or 2-byte characters?)
  2. How to map byte or bytes to characters?

The answer to these questions lies in the document’s Character Encoding. Loosely speaking, an encoding defines two things:

  1. How the bytes are grouped, for example into 8-bit or 16-bit units. These groups are known as Code Units.
  2. Mapping of Code Units to Characters (E.g. In ASCII, decimal 65 maps to the letter A).

Character Encodings are a tiny bit different from Character Sets, but that really isn’t relevant to you unless you are designing a low-level library.

One of the most popular encoding schemes of the last century, at least in the Western world, was ASCII. The table below shows how code units map to characters in ASCII.

US ASCII Chart

There is a common misconception, even amongst seasoned developers, that “plain text” uses ASCII and that each character is 8 bits.

 Truth be told, there is no such thing as “plain text”.  If you have a string in memory or disk and you do not know its encoding, you cannot interpret it or display it. There is absolutely no other way around it.
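To see why, here is a minimal Java sketch (the byte values and class name are mine, chosen purely for illustration): the very same bytes produce different text depending on the encoding you assume when decoding them.

import java.nio.charset.StandardCharsets;

public class EncodingMatters {
    public static void main(String[] args) {
        // Three bytes sitting in memory or on disk.
        byte[] bytes = {0x48, 0x65, (byte) 0xA3};

        // Decoded as ISO-8859-1, 0xA3 is the pound sign.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // He£

        // Decoded as UTF-8, 0xA3 is an invalid sequence and becomes the
        // replacement character.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));      // He�
    }
}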

How can your computer interpret the attachment you just received when it doesn’t specify an encoding? Does this mean you can never read what your long-lost friend really wanted to tell you? Before we get to the answer, we must travel back in time to the dark ages… where a 29 MB hard disk was the best money (and a lot of it) could buy!

Historical Perspective

Long, long time ago, computer manufacturers had their own way of representing characters. They didn’t bother to talk to one another and came up with whatever algorithm they liked to render “glyphs” on screens. As computers became more and more popular and the competition intensified, people got sick and tired of this “custom” mess as data transfer between different computer systems became a pain in the butt.

Eventually, computer manufacturers got their heads together and came up with a standard way of describing characters. “Lo and behold”, they declared, “the low 7 bits in a byte represent a character.” And they created a table like the one shown in the first figure to map each 7-bit value to a character. For example, the letter A was 65, c was 99, ~ was 126 and so on. And ASCII was born. The original ASCII standard defined characters from 0 to 127, which is all you can fit in 7 bits. Life was good and everyone was happy. That is, for a while…

Why did they pick 7 bits and not 8? I don’t exactly care. But a byte has 8 bits. This means 1 whole bit was left completely unused, and the range from 128 to 255 was left unregulated by the ASCII guys, who, by the way, were Americans who knew nothing about, or even worse, didn’t care about the rest of the world.

People in other countries jumped at this opportunity and started using the 128-255 range to represent characters in their languages. For example, 144 was گ in the Arabic flavour of ASCII, but in Russian, it was ђ. Even in the United States of America, there were many different interpretations of the unused range. The IBM PC came out with the “OEM font”, or “Extended ASCII”, which provided fancy graphical characters for drawing boxes and supported some European characters such as the pound (£) symbol.


 A “cool” looking DOS splash screen made using IBM’s Extended ASCII Charset.

To recap: the problem with ASCII was that while everyone agreed what to do with codes up to 127, the range 128-255 had many, many different interpretations. You had to tell your computer the flavor of ASCII to display characters in the 128-255 range correctly.

This wasn’t a problem for North Americans and the people of the British Isles since, no matter which ASCII flavor was being used, the Latin alphabet stayed the same. The British had to live with the fact that the original ASCII didn’t include their currency symbol. “Blasphemy! Those arseholes.” But that’s water under the bridge.

Meanwhile, in Asia, there was even more madness going on. Asian languages have a lot of characters and shapes that need to be stored. One byte isn’t enough, so they started using 2 bytes for their documents. This was known as DBCS (Double Byte Character Set). In DBCS, string manipulation using pointers was a pain: how could you do str++ or str--?

All this craziness caused nightmares for system developers. For example, MS-DOS had to support every single flavour of ASCII since Microsoft wanted their software to sell in other countries. They came out with a concept called “Code Pages”. For example, you had to tell DOS that you wished to use the Bulgarian Code Page to display Bulgarian letters, using the “chcp” command. A Code Page change was applied system-wide. This posed a problem for people working in multiple languages (e.g. English and Turkish) as they had to constantly switch back and forth between code pages.

While Code Pages were a good idea, they weren’t a clean approach; they were a hack or “quick fix” to make things work.

Enter Unicode

Eventually, Americans realized that they needed to come up with a standard scheme to represent all characters in all languages of the world, to alleviate some of the pain software developers were feeling and to prevent a Third World War over character encodings. And out of this need, Unicode was born.

The idea behind Unicode is very simple, yet widely misunderstood. Unicode is like a phone book: a mapping between characters and numbers. Joel called them magic numbers since they may be assigned at random and without explanation. The official term is code points and they always begin with U+. Every single letter of every single language (theoretically) is assigned a “magic number” by the Unicode Consortium. For example, the Hebrew letter Aleph, א, is U+05D0, while the letter A is U+0041.

Unicode doesn’t say how characters are represented in bytes. Not at all. It just assigns magic numbers to characters. Nothing else.

Other common myths include: Unicode can only support up to 65,536 characters, or that all Unicode characters must fit in 2 bytes. Whoever told you that must immediately get a brain transplant!

Remember, Unicode is just a standard way to map characters to magic numbers. There is no limit on the number of characters Unicode can support, and no, Unicode characters don’t have to fit in 2, 3, 4 or any particular number of bytes.

How Unicode characters are “encoded” as bytes in memory is a separate topic, one that is very well defined by the “Unicode Transformation Formats”, or UTFs.

Unicode Encodings

Two of the most popular Unicode encodings are UTF-8 and UTF-16. Let’s look at them in detail.

UTF-8

UTF-8 was an amazing concept: it single-handedly and brilliantly handled backward compatibility with ASCII, making sure Unicode would be adopted by the masses. Whoever came up with it should at least receive the Nobel Peace Prize.

In UTF-8, every character from 0 to 127 is represented by 1 byte, using the same encoding as US-ASCII. This means that a document written in the 1980s can be opened as UTF-8 without any problem. Only characters from 128 and above are represented using 2, 3, or 4 bytes. For this reason, UTF-8 is called a variable-width encoding.

Going back to our example at the beginning of this post, the attachment from your long lost high school friend had the following byte stream:

0100100001000101010011000100110001001111

The byte stream in both ASCII and UTF-8 displays the same characters: HELLO.
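You can verify this yourself; here is a minimal Java sketch (the class name is mine) showing that “HELLO” encodes to exactly the same bytes in US-ASCII and UTF-8:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HelloBytes {
    public static void main(String[] args) {
        byte[] ascii = "HELLO".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "HELLO".getBytes(StandardCharsets.UTF_8);

        // Both arrays contain 48 45 4C 4C 4F: for characters 0-127,
        // UTF-8 is byte-for-byte identical to ASCII.
        System.out.println(Arrays.equals(ascii, utf8)); // prints true
    }
}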

UTF-16

Another popular variable-width encoding for Unicode characters: it uses either 2 bytes or 4 bytes to store each character. However, people are now slowly realizing that UTF-16 may be wasteful and not such a good idea. But that’s another topic.

Little Endian or Big Endian

Endian is pronounced “End-ian” or “Indian”. The term traces its origin to Gulliver’s Travels.

Little or Big Endian is just a convention for storing and reading groups of bytes (called words) from memory. This means that when you give your computer the letter A to store in memory as two UTF-16 bytes, it uses the endianness scheme of the system to decide whether to place the high-order byte ahead of the low-order byte or the other way around. Ah, this is getting confusing. Let’s look at an example: say you want to save the attachment from your long-lost friend in UTF-16. Depending on the computer system you are on, you could end up with either of the following byte sequences:

00 48  00 65  00 6C  00 6C  00 6F (big end, the high order byte is stored first, hence Big Endian)

OR,

48 00  65 00  6C 00  6C 00  6F 00 (little end, the low order byte is stored first, hence Little Endian)

Endianness is just a matter of preference by microprocessor architecture designers. For example, Intel uses Little Endian, while Motorola uses Big Endian.
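If you want to see both byte orders for yourself, here is a small Java sketch (the names are mine) that encodes “Hello” in UTF-16BE and UTF-16LE and prints the bytes in hex:

import java.nio.charset.StandardCharsets;

public class EndiannessDemo {
    public static void main(String[] args) {
        // Prints:
        // UTF-16BE: 00 48 00 65 00 6C 00 6C 00 6F
        // UTF-16LE: 48 00 65 00 6C 00 6C 00 6F 00
        printHex("UTF-16BE", "Hello".getBytes(StandardCharsets.UTF_16BE));
        printHex("UTF-16LE", "Hello".getBytes(StandardCharsets.UTF_16LE));
    }

    static void printHex(String label, byte[] bytes) {
        System.out.print(label + ": ");
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}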

Byte Order Mark

If you regularly transfer documents between Little and Big Endian systems and wish to specify endianness, there is a weird convention known as the Byte Order Mark, or BOM, for that. A BOM is a cleverly designed character (U+FEFF) placed at the beginning of the document to inform the reader about the endianness of the encoded text. In UTF-16, depending on the endianness the document was encoded with, these first two bytes appear as either FE FF (big endian) or FF FE (little endian), giving the parser an immediate hint of the endianness.
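In Java, for instance, the charset named “UTF-16” (as opposed to UTF-16BE or UTF-16LE) writes a big-endian BOM when encoding, which you can observe with this little sketch (the class name is mine):

import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // The "UTF-16" charset prepends a byte order mark when encoding.
        byte[] bytes = "A".getBytes(StandardCharsets.UTF_16);
        for (byte b : bytes) {
            System.out.printf("%02X ", b); // prints: FE FF 00 41
        }
    }
}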

BOM, while useful, isn’t neat since people have been using a similar concept called “Magic Byte” to indicate the File Type for ages. The relation between BOM and Magic Byte isn’t well defined and may confuse some parsers.

Alright, that is all folks. Congratulations on making it this far: You must be an endurance reader.

Remember the bit about there being no such thing as “plain text”, introduced at the beginning of this post, that left you wondering how your text editor or Internet browser displays the correct text every time? The answer is that the software deceives you, and that is why a lot of people don’t know about encoding: when the software cannot detect the encoding, it guesses. Most of the time it guesses UTF-8 or ISO-8859-1, both of which contain ASCII as a proper subset. Since the Latin alphabet used in English occupies the same ASCII range in almost every encoding ever conceived, including UTF-8, you still see your English characters displayed correctly even if the encoding guess is wrong.

But, every now and then, you may see the � symbol while surfing the web… a clear sign that the encoding is not what your browser thought it was. Time to click on the View->Encoding menu of your web browser and start experimenting with encodings.

Summary

If you didn’t have time to read the entire document or you skimmed through it, that is okay. But make sure that you understand the following points at all costs; otherwise, you will miss out on some of the finest pleasures this life has to offer.

  • There is no such thing as plain text. You must know the encoding of every String you want to read.
  • Unicode is simply a standard way of mapping characters to numbers. The Brave Unicode people deal with all the politics behind including new characters and assigning numbers.
  • Unicode does NOT say how characters are represented as bytes. This is dictated by Encodings and specified by Unicode Transformation Formats (UTF’s).

And, most importantly,

  • Always, I mean always, indicate the encoding of your document, either by using the Content-Type header or a meta charset tag. By doing this, you are preventing web browsers from guessing the encoding and telling them exactly which encoding they should use to render the page.

The inspiration and ideas for this article came from the best article on Unicode, by Joel.

Command Line Flags for the JVM

This post attempts to demystify the command-line options of the Java Virtual Machine (JVM). I’m talking about those strange characters you often have to type when starting Java to run a program. The options are often used to specify environment settings (the class path) and to configure performance characteristics of the JVM (garbage collection frequency), amongst many other things.

For example, some valid command-line invocations are shown below.

java -Xmx6g AClass
java -version
java -Xms4g -Xmx6g SomeClass
java -DSomeVal="foo" MyProgram
java -DSomeVal="foo" -cp MyProgram.jar -Xmx6g -XX:+UseCompressedOops -XX:+UseG1GC com.foo.Bar 

The JVM supports many command-line options which allow users to specify:

  • Minimum and Maximum Size of the Heap Memory
  • Type of Garbage Collector
  • Type of JIT Compiler (Client or Server), or,
  • To display JVM version,
  • etc.

The Java documentation breaks the command-line options into three categories based on their maturity level. The most mature options belong to a category called “Standard Options” and must be supported by all JVMs. Less mature options are called “Non-Standard”. These are specific to a particular JVM (e.g. HotSpot) and are subject to change between releases. There is also a third category of options called “Developer Options”, which are an even more experimental shade of Non-Standard.

Quick Recap: JVM command-line options are used to specify configuration settings to control execution of the Virtual Machine. These options are broken into three categories as shown below:

  1. Standard Options
  2. Non-Standard Options
  3. Developer Options (Experimental)

Let’s look at the categories in detail.

1. Standard Options

The Standard Options are regulated by the Java Virtual Machine Specification and must be supported by all implementations of the Java Virtual Machine, for example OpenJDK, HotSpot, etc. Standard Options are stable and do not change between releases. Standard Options begin with a - followed by the name of the option, e.g. -version.

Some standard options are shown below:

-version:  java -version

-cp:  java -cp <PATH>

-jar:  java -jar MyProgram.jar

-Dproperty=value:  java -DSomeVal="foo"
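As a quick illustration of the -D option, here is a tiny sketch (the class and property names are mine) that reads a system property set on the command line:

public class ReadProperty {
    public static void main(String[] args) {
        // Launch with: java -DSomeVal="foo" ReadProperty
        String someVal = System.getProperty("SomeVal", "<not set>");
        System.out.println("SomeVal = " + someVal);
    }
}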

2. Non-Standard Options

Non-Standard Options always begin with -X. The fact that they are not guaranteed to be supported in all implementations of the JVM, or even between versions, did little to hurt their popularity. Non-Standard Options remain popular and are widely used. These options often take an integer value with a suffix of k, m, or g to specify kilo, mega, or giga. To get a list of all Non-Standard Options supported by your JVM, you can invoke the launcher with -X, e.g. `java -X`. Examples:

-Xms:  java -Xms2g 

-X:  java -X
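To roughly observe the effect of the heap options, you can print the maximum heap the JVM is willing to use; the value approximately tracks -Xmx (the class name below is mine):

public class HeapInfo {
    public static void main(String[] args) {
        // Launch with, e.g.: java -Xms512m -Xmx2g HeapInfo
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (approx.): " + maxBytes / (1024 * 1024) + " MB");
    }
}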

3. Developer Options

Developer Options always begin with -XX. They are also Non-Standard. For boolean options, the format is -XX: followed by either + or - (to indicate true or false), followed by the name of the option.

-XX:+UseCompressedOops (Indicates that the Option UseCompressedOops must be used)

From the Java documentation:

  • Boolean options are turned on with -XX:+<option> and turned off with -XX:-<option>.

  • Numeric options are set with -XX:<option>=<number>. Numbers can include ‘m’ or ‘M’ for megabytes, ‘k’ or ‘K’ for kilobytes, and ‘g’ or ‘G’ for gigabytes (for example, 32k is the same as 32768).

  • String options are set with -XX:<option>=<string>, and are usually used to specify a file, a path, or a list of commands.

Examples:

java -XX:+PrintCompilation

java -XX:ParallelGCThreads=10

Click here for a complete list of Options from Oracle.

Object Naming Conventions in JMX

Every MBean must have a name, or more accurately, an ObjectName. Although an MBean could be named anything, e.g. “DogBean” or “SunnyDay”, it is important to choose consistent and well-defined names to avoid inflicting mental torture on the poor soul who is interacting with your application via JMX.

Fortunately, MBean names follow some standard conventions and names can determine how Clients display MBeans. MBean names look like this:

domain:key=property

Remember this convention. Read the last line carefully. Notice that there are two parts with a : separating them. The first part is called the domain and the second part is called the key property list.

Here’s an example MBean name with both domain and properties: com.somecompany.app:type=ThreadPool

Domain Naming Conventions

The domain could be any arbitrary string, but it cannot contain a : since that is used as the separator. Slash (/) isn’t allowed either. If the domain name is not provided, then the MBean shows up under “DefaultDomain”. As mentioned earlier, domain names should be predictable. According to Oracle Technet:

….if you know that there is going to be an MBean representing a certain object, then you should be able to know what its name will be.

And then adds further:

The domain part of an Object Name should start with a Java package name. This prevents collisions between MBeans coming from different subsystems. There might be additional text after the package name. Examples:

com.sun.someapp:type=Whatsit,name=5
com.sun.appserv.Domain1:type=Whatever

Key=Property Naming Conventions

The key property list is optional, and each entry is in the following format:

key=property

If you wish to specify multiple properties, separate them with commas (,). For example:

com.somecompany.app:type=ThreadPool,poolname=Parser,scope=internal

The key properties should be used to uniquely identify MBean objects; that is, each object in the same domain should have a distinct set of properties. For example:

com.somecompany.app:type=ThreadPool,name=Parser

com.somecompany.app:type=ThreadPool,name=Generator
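To tie this together, here is a minimal sketch of registering an MBean under a name that follows the convention above. The ThreadPool MBean itself is made up purely for illustration:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxNamingExample {

    // A hypothetical MBean interface/implementation pair for illustration.
    public interface ThreadPoolMBean {
        int getPoolSize();
    }

    public static class ThreadPool implements ThreadPoolMBean {
        @Override
        public int getPoolSize() {
            return 10;
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // domain:key=property,... as described above
        ObjectName name = new ObjectName(
                "com.somecompany.app:type=ThreadPool,name=Parser");

        server.registerMBean(new ThreadPool(), name);
        System.out.println("Registered " + name);
    }
}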

That’s the MBean naming convention from 1,000 feet. If you want more information, please read here.

Things every Java developer must know about Exception handling

Exceptions are one of the most misunderstood (and misused) features of the Java programming language. This article describes the absolute minimum every Java developer must know about exceptions. It assumes that the reader is somewhat familiar with Java.

Historical Perspective

Back in the heyday of the “C” programming language, it was customary to return values such as -1 or NULL from functions to indicate errors. You can easily see why this isn’t a great idea – developers had to check and track possible return values and their meanings: a return value of 2 might indicate “host is down” error in library A, whereas in library B, it could mean “illegal filename”.

Attempts were made to standardize error checking by expecting functions to set a global variable (such as errno in C) with a defined value.


James Gosling and other designers of the language felt that this approach would go against the design goals of Java. They wanted:

  1. a cleaner, more robust and portable approach, and
  2. built-in language support for error checking and handling.

Luckily, they didn’t have to look too far. The inspiration for handling errors came from a very fine language of the 60’s: LISP.

Exception Handling

So what is exception handling? It is an unconventional but simple concept: if an error is encountered in a program, halt the normal execution and transfer control to a section specified by the programmer. Let’s look at an example:

try {
   RandomAccessFile f = new RandomAccessFile("list.txt", "rw"); //Will cause an error if the file is not found...
   f.readLine();
   f.writeBytes("another item for the list");
   f.close();
} catch (FileNotFoundException fnfe) { // ... and transfer control to this section on error.
   // Do something with the error: notify user or try reading another location, etc
} catch (IOException ioe) {
   // Handle other I/O errors (e.g. a failed read or write).
}

Exceptions are exceptional conditions that violate some kind of “contract” during program execution. They can be thrown by the language itself (e.g. using a null reference where an object is required) or by the developers of a program or API (e.g. passing a date in British format instead of American). Some examples of exceptions are:

  • Accessing index outside the bounds of an array
  • Divide by 0
  • Programmer defined contract: Invalid SQL or JSON format

Exceptions disrupt the normal program flow. Instead of executing the next instruction in the sequence, the control is transferred to the Java Virtual Machine (JVM) which tries to find an appropriate exception handler in the program and transfer control to it (hence disrupting the normal program flow).

Checked and Unchecked Exceptions

Before we look at the exception classes in Java, let’s understand the two categories of exceptions in Java:

Checked exceptions – You must check and handle these in your program. For example, if you are using an API with a method that declares it could throw a checked exception, you must either catch the exception or declare it in your own method’s throws clause each time you call that method. If you don’t, the compiler will notice and your program will not compile. The designers of Java wanted to encourage developers to use checked exceptions in situations from which programs may wish to recover: for example, if the host is down, the program may wish to try another address.

Unchecked exceptions on the other hand are not required to be handled or caught in the program. For example, if a method could throw unchecked exceptions, the caller of the method is not required to handle or catch the exceptions.

Remember: checked exceptions are mild and programs normally wish to recover from them. They must be caught (or declared), and this rule is enforced by the compiler. The compiler doesn’t care whether you do or do not catch unchecked exceptions.

Many people find the dichotomy between checked and unchecked exceptions confusing and counter-intuitive. Discussing the arguments from both sides is beyond the scope of this post.

Parent of all exception classes: Throwable

All exceptions in Java descend (subclass) from Throwable. It has two direct children:

  1. Exception
  2. Error

Error and its sub-classes are used for serious errors from which programs are not expected to recover, i.e. they are unchecked exceptions.

Exception and its sub-classes are used for mild errors from which programs may wish to recover, i.e. checked exceptions. Right? Well, there is a twist. There is just one sub-class which is different: unlike its parent, the Exception class, it is unchecked. It’s called RuntimeException.


Checked exception classes (mostly): Exception

Exception and its sub-classes must be caught, and as such they force the programmer to think about (and hopefully deal with) the situation. An exception is a signal that something didn’t go as intended, along with some information about what went wrong, and that “someone” should do something about it (e.g. a car’s dashboard indicating that the battery needs service).

According to official documentation:

These are exceptional conditions that a well-written application should anticipate and recover from. For example, suppose an application prompts a user for an input file name,  [..] But sometimes the user supplies the name of a nonexistent file, and the constructor throws java.io.FileNotFoundException. A well-written program will catch this exception and notify the user of the mistake, possibly prompting for a corrected file name.

Source: The Java Tutorials

RuntimeException

RuntimeExceptions are used to indicate programming errors, most commonly a violation of some established contract. They make it impossible to continue further execution.

For example, the contract says that the array index mustn’t go past [array_length - 1]. If you do it, bam, you get a RuntimeException. A real-world analogy would be pumping diesel into a gasoline car: the unwritten contract says that you must not do it. There are no signals, just the white smoke before the car comes to a grinding halt after a while. The message: it was your fault and could’ve been prevented by being smarter in the first place.

These are exceptional conditions that are internal to the application, and that the application usually cannot anticipate or recover from. These usually indicate programming bugs, such as logic errors or improper use of an API.

Source: The Java Tutorials

Error

These exceptional circumstances are like “act-of-god” events. Going back to our previous analogy, if a large-scale alien invasion were to happen, there is nothing you could do to protect your car, or yourself (unless your last name is Ripley). In the software world, this amounts to the disk dying while you are in the process of reading a file from it. The bottom line is that you should not design your program to handle Errors, since something has gone wrong in the grand scheme of things that is beyond your control.

These are exceptional conditions that are external to the application, and that the application usually cannot anticipate or recover from. For example, suppose that an application successfully opens a file for input, but is unable to read the file because of a hardware or system malfunction.

Source: The Java Tutorials

It’s not so black and white

Checked exceptions are often abused in Java. While Java forces developers to catch checked exceptions, it cannot force them to handle these exceptions meaningfully. It’s not hard to find statements like this even in well-written programs:

try {
   Object obj = ...
   Set<String> set = ...
   // perform set operations
} catch (Exception e) {
   // do nothing
}

Should you ever catch Runtime Exceptions?

What’s the point of catching RuntimeExceptions if the condition is irrecoverable? After all, if you were catching every possible run-time exception, your program would be cluttered with exception handling code everywhere.

RuntimeExceptions are rare errors that could be prevented by fixing your code in the first place. For example, dividing a number by 0 will generate a run-time exception, ArithmeticException. But rather than catching the error, you could modify your program to check the arguments of the division function and make sure that the denominator is not 0. If it is, we can halt further execution or even dare to throw an exception of our own: IllegalArgumentException.

In this case, the program got away by verifying the input parameters instead of catching RuntimeExceptions.
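Here is what that looks like in code: a minimal sketch (the names are mine) that validates its arguments up front rather than letting an ArithmeticException surface later.

public class SafeDivision {

    /**
     * Validates its input instead of letting an ArithmeticException
     * surface from deep inside the calculation.
     */
    public static int divide(int numerator, int denominator) {
        if (denominator == 0) {
            throw new IllegalArgumentException("denominator must not be 0");
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.println(divide(10, 2)); // prints 5
        System.out.println(divide(10, 0)); // throws IllegalArgumentException
    }
}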

So when is it OK for an application to catch RuntimeExceptions?

A while back, I architected a high-performance traffic director with the goal of operating in the proximity of 10,000 transactions per second (TPS). The project had very high availability criteria, and one of the requirements was that it “must never exit”.

The director performs a minimal amount of processing on each transaction before passing it further. Transactions came in two flavours, call them A and B. We were only interested in transactions of type A, and we had a transaction handler to process them. Naturally, it “choked”, throwing run-time exceptions, when we passed it transactions of type B. The solution? Create a function and pass it every single transaction. If it returned true, we continued with further processing. Otherwise, we simply ignored the transaction and continued on to the next one.

boolean checkFormat(Transaction t) {
    // Returns true if t is of type A, false otherwise.
}

This worked well, except…..

… the analysis showed that this function returned false only once a year. The reason: 99.99999999999999% of transactions were of type A. Yet we were subjecting every single transaction to this check. That does not sound so bad, but due to the nature of the transactions, the only way to differentiate them was by doing expensive String comparisons on various fields.

When this finding was brought to my attention, I immediately had the `checkFormat(…)` function removed and instead let the handler run its course and throw a RuntimeException upon encountering a transaction of type B. When the exception gets thrown once a year, we catch it, log it and move on to the next transaction. The result: improved performance, and room to squeeze in additional calculations.

Summary

Exceptions in Java are either checked or unchecked. Checked exceptions must be caught (or declared) in the program, otherwise the compiler will complain. While Java encourages developers to follow certain guidelines when it comes to exception handling, there aren’t any hard and fast rules, and the rules are often bent.

Java Multithreading Steeplechase: Executors

Historical Perspective on Tasks & Threads

Tasks are activities that perform some action or do calculations. For example, a task could calculate prime numbers up to some upper limit. Good tasks do not depend on other tasks: they are independent. In this post, when I refer to tasks, I mean tasks that are independent.

Tasks in Java can be represented by a very simple interface called Runnable that has only one method: run(). This single method neither returns a value nor can it throw checked exceptions.

public interface Runnable {
    void run();
}

Many newcomers to Java presume Threads to be the primary abstraction for running tasks: a task is submitted to a thread, which then runs the task. Indeed, the Thread class has constructors which take a Runnable for execution:

Thread(Runnable target)
Thread(Runnable target, String name)
Thread(ThreadGroup group, Runnable target)
...

There are obvious benefits in segregating tasks and threads.

A task, defined by implementing Runnable, is submitted to a Thread for execution. The Thread doesn’t know anything about the task, and the same thread could run several different tasks.

Enter Executor:

Executor was introduced in Java 1.5 as a clean abstraction for executing tasks. The mantle for running tasks was passed from Thread to Executor. According to the Java API, an Executor:

“… executes submitted Runnable tasks.  This interface provides a way of decoupling task submission from the mechanics of how each task will be run, including details of thread use, scheduling, etc.” In essence, Executor is an interface, whose simplicity rivals that of Runnable:

public interface Executor {
    void execute(Runnable command);
}

The ‘very simple’ Executor interface forms the basis of a very powerful asynchronous task execution framework. It is based on the Producer-Consumer pattern: producers produce tasks and consumer threads execute these tasks.

ExecutorService

There is little chance you will ever use Executor directly. It is a very powerful, yet feature-starved, interface with a lone method for executing tasks. Fortunately for us, it has a famous child called ExecutorService, which provides lifecycle support such as shutdown, task tracking and the ability to retrieve results.

Tracking Task Progress via Future

ExecutorService defines a method called `submit(Runnable task)` which returns a `Future` that can be used to track the task’s progress and cancel it (if desired). Future is an interface. From its javadocs:

“A Future represents the result of an asynchronous computation. Methods are provided to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can only be retrieved using method get when the computation has completed, blocking if necessary until it is ready. Cancellation is performed by the cancel method. Additional methods are provided to determine if the task completed normally or was cancelled. Once a computation has completed, the computation cannot be cancelled.”

RunnableFuture

Earlier on, I said that the interface Runnable doesn’t return a value; Runnable tasks can only indicate completion by modifying a shared data structure. RunnableFuture implements both the Future and Runnable interfaces. It can be submitted to any method which expects a Runnable, and its Future side allows its result to be accessed.

So far we have only discussed interfaces (Executor, ExecutorService and Future). Before we look into concrete classes, let us consider one very important concept.

Thread Pool

A design pattern: http://en.wikipedia.org/wiki/Thread_pool_pattern. It has a task queue which holds incoming tasks and a pool of threads which take tasks from the queue and execute them.

thread pool

A sample thread pool (green boxes) with waiting tasks (blue) and completed tasks (yellow)

Benefits of Thread Pools are thread re-use (creating new threads is a significant CPU overhead) and improved responsiveness (there may already be a waiting thread when a task arrives).

Now let us discuss concrete classes.

AbstractExecutorService

This is a skeletal implementation of ExecutorService, providing default implementations for some of its methods.

public abstract class AbstractExecutorService
implements ExecutorService

ThreadPoolExecutor

This is an ExecutorService that applies the Thread Pool pattern to execute tasks. From its javadocs:

“An ExecutorService that executes each submitted task using one of possibly several pooled threads, normally configured using Executors factory methods.” It provides several methods for setting pool and task queue sizes. Its class declaration:

public class ThreadPoolExecutor extends AbstractExecutorService
implements Executor, ExecutorService
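As a rough sketch (the numbers and names are arbitrary), this is how you might construct a ThreadPoolExecutor directly and hand it a task:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSetup {
    public static void main(String[] args) {
        // 2 core threads, up to 4 threads, idle threads above the core
        // size are retired after 60 seconds, and at most 100 queued tasks.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(100));

        pool.execute(new Runnable() {
            @Override
            public void run() {
                System.out.println("Task ran on " + Thread.currentThread().getName());
            }
        });

        pool.shutdown();
    }
}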

FutureTask

Provides an implementation of Future and RunnableFuture. From its javadoc:

“…provides a base implementation of Future, with methods to start and cancel a computation, query to see if the computation is complete, and retrieve the result of the computation.”

Since a FutureTask implements RunnableFuture, you can submit it directly to an ExecutorService.

Callable:

Callables were introduced in Java 5 as the next version of Runnable. Just like Thread passed the mantle to Executor for task execution, Runnable passed the mantle to Callable for representing tasks.

Callables are used to represent tasks. Unlike Runnables, they can return a value and even throw exceptions. They also support generics.
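Putting the pieces together, here is a small sketch (the names and numbers are mine) that submits a Callable to an ExecutorService and retrieves its result through a Future:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);

        // A Callable task that returns a value (and may throw an exception).
        Callable<Integer> task = new Callable<Integer>() {
            @Override
            public Integer call() {
                return 6 * 7;
            }
        };

        Future<Integer> future = executor.submit(task);
        System.out.println("Result: " + future.get()); // blocks until the task completes

        executor.shutdown();
    }
}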

Summary:

Executor and ExecutorService form a very powerful framework for asynchronous task execution. Future is a wrapper that provides a way to track a task’s progress and could be used to cancel it. Callable represents a task and allows the task to return a value and throw exceptions.

So you might ask why we still have Threads and Runnables if better choices are available in the form of Executor and Callable. As far as Callable vs. Runnable is concerned, the reason is purely backwards compatibility. And Threads are not languishing in Java: ExecutorService simply provides a cleaner abstraction for executing tasks; it still relies on Threads to execute them.

Java Multithreading Steeplechase: Stopping Threads

Let us cut to the chase: In Java, there is no way to quickly and reliably stop a thread.

Java language designers got drunk once and attempted to support forced thread termination by releasing the following methods: `Thread.stop()`, `Thread.suspend()` and `Thread.resume()`. However, when they became sober, they quickly realized their mistake and deprecated them. Abrupt thread termination is not so straightforward. A running thread, often described by many writers as a light-weight process, has its own stack and is the master of its own destiny (well, daemons are). It may own files and sockets. It may hold locks. Termination is not always easy: unpredictable consequences may arise if the thread is in the middle of writing to a file and is killed before it can finish writing. Or what about the monitor locks held by the thread when it is shot in the head? For more information on why `Thread.stop()` was deprecated, follow this link: http://docs.oracle.com/javase/6/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html

Anyways, back to the point.

In Java, there is no way to quickly and reliably stop a thread.

To stop threads in Java, we rely on a co-operative mechanism called Interruption. The concept is very simple. To stop a thread, all we can do is deliver it a signal, aka interrupt it, requesting that the thread stops itself at the next available opportunity. That’s all. There is no telling what the receiving thread might do with the signal: it may not even bother to check the signal, or, even worse, ignore it.

Once you start a thread, nothing can (safely) stop it, except the thread itself. At most, the thread could be simply asked – or interrupted – to stop itself.

Hence in Java, stopping threads is a two step procedure:

  • Sending stop signal to thread – aka interrupting it
  • Designing threads to act on interruption

A thread in Java can be interrupted by calling its `interrupt()` method. A thread can check whether it has been interrupted by calling the `Thread.isInterrupted()` method. A good thread must check for interruption at regular intervals. The following code fragment illustrates this:

public static void main(String[] args) throws Exception {

        /**
         * A Thread which is responsive to Interruption.
         */
        class ResponsiveToInterruption extends Thread {
            @Override public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                    System.out.println("[Interruption Responsive Thread] Alive");
                }
                System.out.println("[Interruption Responsive Thread] bye**");

            }
        }

        /**
         * Thread that is oblivious to Interruption. It does not even check its
         * interruption status and doesn't even know it was interrupted.
         */
        class ObliviousToInterruption extends Thread {
            @Override public void run() {
                while (true) {
                    System.out.println("[Interruption Oblivious Thread] Alive");
                }
                // The statement below will never be reached.
                //System.out.println("[Interruption Oblivious Thread] bye");
            }
        }

        Thread theGood = new ResponsiveToInterruption();
        Thread theUgly = new ObliviousToInterruption();

        theGood.start();
        theUgly.start();

        theGood.interrupt(); // The thread will stop itself
        theUgly.interrupt(); // Will do nothing
}

 

[Interruption Oblivious Thread] Alive
[Interruption Responsive Thread] Alive
[Interruption Responsive Thread] Alive
[Interruption Oblivious Thread] Alive
[Interruption Responsive Thread] bye**
[Interruption Oblivious Thread] Alive
[Interruption Oblivious Thread] Alive
[Interruption Oblivious Thread] Alive
[Interruption Oblivious Thread] Alive
....

A well-designed thread checks its interrupt status at regular intervals and takes action when interrupted, usually by cleaning up and stopping itself.

Blocking Methods and Interruption:

A thread can check for interruption at regular intervals – e.g. as a loop condition – and take action when it is interrupted. Life would have been easy if it weren’t for those pesky blocking methods: these methods may “block” and take a long time to return, effectively delaying the calling thread’s ability to check for interruption in a timely manner. Methods like `Thread.sleep()`, `BlockingQueue.put()`, `ServerSocket.accept()` are some examples of blocking methods.

If the code is waiting on a blocked method, it may not check the interrupt status until the blocking method returns.

Blocking methods which support interruption usually throw an exception when they detect interruption, transferring control back to the caller. Blocking methods typically throw either InterruptedException or ClosedByInterruptException to signal interruption to the caller. Let us consider an example. The code below calls `Thread.sleep()`. When it detects interruption, `Thread.sleep()` throws InterruptedException and the caller exits the loop. All blocking methods which throw InterruptedException also clear the interrupted status. You must either act on the interruption when you catch this exception or, at the very least, set the interrupted status again to allow code higher up the stack to act on the interruption.

   @Override
   public void run() {
        while(true) {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException exit) {
                break; //Break out of the loop; ending thread
            }
        }
    }

This may sound preposterous, but code that does nothing on InterruptedException is “swallowing” the interruption, denying other code the chance to act on it.
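If you cannot act on the interruption right where you catch it, the usual remedy is to restore the interrupted status, as in this sketch:

   @Override
   public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Restore the interrupted status so that code higher up the
                // stack (here, the loop condition) can still see it.
                Thread.currentThread().interrupt();
            }
        }
    }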

Interruption Oblivious Blocking Methods:

In the first code example in this post, we have two threads, ResponsiveToInterruption and ObliviousToInterruption. The former checked for interruption regularly – as its loop condition – whereas the latter didn’t even bother to check. Blocking methods in the Java library fall into the same two categories. The good ones throw exceptions when they detect interruption, whereas the ugly ones don’t do anything. Blocking methods in java.net sockets don’t respond to interruption. For example, the thread below cannot be stopped by interruption while it is waiting for clients. Only when a client connects does accept() return a Socket, allowing the caller to check for interruption:

        /**
         * Thread that checks for interruption, but calls a blocking method
         * that doesn't detect Interruptions.
         */
        class InterruptibleShesNot extends Thread {

            @Override
            public void run() {
                try {
                    ServerSocket server = new ServerSocket(8080);
                    while (!Thread.currentThread().isInterrupted()) {
                        Socket client = server.accept(); // This method will not
                                                         // return or 'unblock'
                                                         // until a client connects
                    }
                } catch (IOException ignored) { }

            }

        }

So how do you deal with blocking methods that do not respond to Interruption? You will have to think outside the box and find ways to cancel the operation by other means. For example, Socket operations throw SocketException when the underlying socket is closed (by `Socket.close()`). The code below takes advantage of this fact and closes the underlying socket, forcing all blocking methods such as ServerSocket.accept() to throw SocketException.

package kitchensink;

import java.net.*;
import java.io.*;

/**
 * Demonstrates non-standard thread cancellation.
 *
 * @author umermansoor
 */
public class SocketCancellation {

    /**
     * ServerSocket.accept() doesn't detect or respond to interruption. The
     * class below overrides the interrupt() method to support non-standard
     * cancellation by canceling the underlying ServerSocket forcing the
     * accept() method to throw Exception, on which we act by breaking the while
     * loop.
     *
     * @author umermansoor
     */
    static class CancelleableSocketThread extends Thread {

        private final ServerSocket server;

        public CancelleableSocketThread(int port) throws IOException {
            server = new ServerSocket(port);
        }

        @Override
        public void interrupt() {
            try {
                server.close();
            } catch (IOException ignored) {
            } finally {
                super.interrupt();
            }
        }

        @Override
        public void run() {
            while (true) {
                try {
                    Socket client = server.accept();
                } catch (Exception se) {
                    break;
                }
            }
        }
    }

    /**
     * Main entry point.
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        CancelleableSocketThread cst = new CancelleableSocketThread(8080);
        cst.start();
        Thread.sleep(3000);
        cst.interrupt();
    }
}

Summary:

  • Threads cannot be stopped externally; they can only be delivered a signal to stop
  • It is up to the Thread to: i) check the interruption flag regularly, and ii) to act upon it
  • Sometimes checking interruption is not possible if the thread is blocked on a blocking method, such as `BlockingQueue.put()`. Luckily, most blocking methods detect interruption and throw InterruptedException or ClosedByInterruptException
  • To support blocking methods that do not act on interruptions, non-standard cancellation mechanisms must be used, as illustrated in the last example

Extra:

The Thread class also has a static method called `interrupted()`. This is what it does: it clears the interrupted status and returns its previous value. Use this method only when you know what you are doing, or when you actually want to clear the interrupt status.
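For example, the following lines show the flag-clearing behaviour:

Thread.currentThread().interrupt();        // set the interrupt flag
System.out.println(Thread.interrupted());  // prints true and clears the flag
System.out.println(Thread.interrupted());  // prints false; the flag was cleared above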

Introduction To Apache Hadoop – HDFS & MapReduce

Let’s get something out of the way quickly: Hadoop is NOT a database. It is NOT a library. In reality, there is NO single product called Hadoop. Hadoop is made up of stand-alone modules such as a distributed file system called HDFS, a distributed database named HBASE, a library for parallel processing of large distributed datasets called MapReduce, and the list goes on. An analogy would be Microsoft Office. There is no application called Microsoft Office. It’s the name given to a suite of desktop applications like Word, Excel, etc.

In this post we will focus on the Hadoop Distributed File System (HDFS) and MapReduce. These two are core Hadoop modules and are widely used.

Together, HDFS and MapReduce form a framework for distributed batch processing. HDFS is used for storage, while MapReduce(MR) is used for analysis.

Almost anything you can accomplish with HDFS + MR could be done with built-in Linux utilities like grep, awk, bash, etc. Hadoop excels at large-scale, distributed processing of data, where the data to be processed is spread across hundreds of nodes. With the advent of Cloud Computing, this is quickly becoming the norm. Distributed servers running on multiple nodes producing decentralized logs make it difficult to analyze data in one central place. Consider Google – it runs on thousands of web servers in multiple data centers around the world. Each web server generates a log file which is stored on its local disk, ending up with thousands of log files stored on as many servers. An analytics program should be able to view these dispersed logs as a single logical unit. For example, the following hypothetical queries require processing every single log file to find the count for each server and then adding up the results from all servers to get the final aggregate sum:

  • Number of unique users between 12:00-1:00am.
  • Number of users in a day from Chile.

Hadoop’s true power lies in its ability to scale to hundreds or thousands of nodes to distribute large amounts of work across a set of machines to be performed in parallel.

HDFS

Hadoop modules are built upon a Distributed File System appropriately named Hadoop Distributed File System (HDFS). The most famous `distributed` file system in existence today is the Network File System or NFS. HDFS is different from NFS on many levels, especially with regards to scalability.

Note: The design of HDFS is based on Google File System (GFS) described in this paper.

HDFS is based on a master/slave architecture, and the design requires that a single master node keep track of all the files in the file system. This is called the Name Node. A Name Node stores only the meta-data about the files present on the file system: it doesn’t store the actual data. The data is stored on Data Nodes. Files are stored on HDFS in blocks which are typically 64 MB in size.

Name Node versus Data Node: A Name Node manages the HDFS’s namespace and regulates access to files and directories. It distributes data blocks to Data Nodes and stores this mapping information. Data Nodes are responsible for storing data and serve read/write requests from clients directly.

Let us consider an example: suppose we are storing a 131 MB file on HDFS. The file will be stored in three blocks (64 MB + 64 MB + 3 MB). The Name Node will distribute the three blocks to Data Nodes and keep track of the mapping. To read a file stored on HDFS, the client must have the HDFS client software installed. The client will obtain the file information from the Name Node, such as the number of file blocks and their locations, and then download these blocks directly from the Data Nodes.

For more information on HDFS, I recommend the following link: http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html

Map Reduce

Map Reduce (MR) is a framework, or a library, for writing applications to process large amounts of distributed data in parallel. Like HDFS, its architecture is based on a master/slave model. The master is a special node which coordinates activity between several worker nodes.

Here’s how it works: The master receives the input data which is to be processed. The input data is split into smaller chunks and all these chunks are distributed to and processed in parallel on multiple worker nodes. This is called the Map Phase. The workers send their results back to the master node which aggregates these results to produce the sum total. This is called the Reduce phase.

Note: I’ve oversimplified the inner workings of MapReduce. The Map phase output is written to the local disk of the workers, partitioned into as many regions as there are Reduce workers available. These locations are then passed to the master, which passes them on to the Reduce workers. I recommend this paper on MapReduce by Google, which actually introduced it.

MR applications must provide at least the following three inputs:

  1. Location of the input data (e.g. a directory consisting of a single (rare) or multiple input files).
  2. Programming implementations of Map, Reduce functions and their configuration (e.g. a Java JAR file which is distributed to workers)
  3. Location of the output data (e.g. `/tmp/hadoop-output/`)

You must remember this always: all input and output in MR is based on <key, value> pairs. They are everywhere: the input to the Map function, its output, the input to the Reduce function and its output are all <key, value> pairs.

Map Reduce Example in Java

To wrap our minds around MapReduce, let us consider an example. Suppose you have just become the Development Lead at a company which specializes in reading seismic data that measures earthquake magnitudes around the world. There are thousands of such sensors deployed around the world, recording earthquake data in log files in the following format:

nc,71920701,1,"Saturday, January 12, 2013 19:43:18 UTC",38.7865,-122.7630,1.5,1.10,27,"Northern California"

Each entry consists of a lot of details. The items of interest here are the magnitude of the earthquake (1.5) and the name of the region where the reading was taken (“Northern California”).

There are millions of such log files available. In addition, the logs also contain erroneous entries, such as when a sensor became faulty and went into an infinite loop dumping thousands of lines a second. The input data is stored on 50 machines, and all the log files combined are about 10 terabytes in size. Your Director of Software asks you to perform a simple task: for every region where sensors were deployed, find the highest magnitude of earthquake recorded.

Now, let’s think about this for a moment. This task sounds rather simple. You could use your trusted Linux tools like `grep`, `sort`, or even `awk` to accomplish this if the log files were available on a single computer. But they are not – they are scattered across 50 computers. Processing the data on each computer manually and combining the results would be too inefficient (for a Lead Developer, that is).

This is the kind of problem where you can use Hadoop. Let us see how you would do it:

  1. First you will deploy HDFS on the 50 machines where the input data is stored so that all data can be seen by all machines. Let us say you put all the log files in a folder on HDFS called input/.
  2. Next, you will write a Java application providing implementations of Map & Reduce functions.
        /**
         * The `Mapper` function. It receives a line of input from the file,
         * extracts `region name` and `earthquake magnitude` from it, and outputs
         * the `region name` and `magnitude` in <key, value> manner.
         * @param key - The line offset in the file - ignored.
         * @param value - This is the line itself.
         * @param context - Provides access to the OutputCollector and Reporter.
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public void map(LongWritable key, Text value, Context context) throws
                IOException, InterruptedException {
    
            String[] line = value.toString().split(",", 12);
    
            // Ignore invalid lines
            if (line.length != 12) {
                System.out.println("- " + line.length);
                return;
            }
    
            // The output `key` is the name of the region
            String outputKey = line[11];
    
            // The output `value` is the magnitude of the earthquake
            double outputValue = Double.parseDouble(line[8]);
    
            // Record the output in the Context object
            context.write(new Text(outputKey), new DoubleWritable(outputValue));
        }
    
        /**
         * The `Reducer` function. Iterates through all earthquake magnitudes for a
         * region to find the maximum value. The output is the `region name` and the
         * `maximum value of the magnitude`.
         * @param key - The name of the region
         * @param values - Iterator over earthquake magnitudes in the region
         * @param context - Used for collecting output
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values,
                Context context) throws IOException, InterruptedException {
    
            // Standard algorithm for finding the max value
            double maxMagnitude = Double.MIN_VALUE;
            for (DoubleWritable value : values) {
                maxMagnitude = Math.max(maxMagnitude, value.get());
            }
    
            context.write(key, new DoubleWritable(maxMagnitude));
        }
    
  3. Next, you will configure MapReduce to run the processing on all 50 computers. This achieves data locality – the log files are processed on the same computers where they are located, and only the results are sent back to be reduced, saving bandwidth.
  4. Run `hadoop`, passing it the location of the input folder on HDFS (input/), the MapReduce program (a sketch of a typical driver is shown below), and the location where the output is to be produced (output/). And that’s all you need to do.
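For completeness, here is a rough sketch of what such a driver might look like with the newer org.apache.hadoop.mapreduce API. EarthquakeMapper and EarthquakeReducer are hypothetical names for the classes holding the map() and reduce() methods from step 2:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EarthquakeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max-earthquake-magnitude");
        job.setJarByClass(EarthquakeDriver.class);

        // EarthquakeMapper and EarthquakeReducer are the (hypothetical) classes
        // containing the map() and reduce() methods shown in step 2.
        job.setMapperClass(EarthquakeMapper.class);
        job.setReducerClass(EarthquakeReducer.class);

        // Both phases emit <region name, magnitude> pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path("input/"));
        FileOutputFormat.setOutputPath(job, new Path("output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}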

Example Project on GitHub

You can find the complete source code for the above example on a GitHub repository I have created.

3 Effective Techniques For Software Versioning

Software version numbers are all over the map. At present, my system reports the following versions of various installed software:

17.0
1.2.167.19
11.0.0
1.5.11 (1635)
1.6.0_37 
23.0.1271.101
6.0.0.2968
Mayan Apocalypse (96)
0.6.9
3.7.1 (M20110909-1335)
7.0 (Build 201104080000)  

As you can see, there is no concrete pattern here. The most common theme seems to be:

<major>.<minor>.<patch>.<build>

To be fair, there have been some attempts to standardize version numbers, such as Semantic Versioning, Apache APR, and even the older Apple standards. However…

Not all software is created equal:

We have a wide variety of software, from libraries and frameworks, to complex server software, to simple mobile applications. Some software is characterized by a high build rate (e.g. a few times a day), whereas on the other end of the spectrum, it’s easy to find cases where a release every few months or even years is common practice. Some software is released to the public; other software stays in-house. It is hard to define a single universal scheme that works for all kinds of software. So let us divide software into the following three broad categories:

  1. Libraries and Frameworks
  2. Public Software
  3. In-house or Hosted Software

Libraries and Frameworks

For library components and frameworks, which define their own API, Semantic Versioning (SemVer) looks promising. It requires that the software define a public API in order to use the scheme. From their webpage:

“A normal version number MUST take the form X.Y.Z where X, Y, and Z are non-negative integers. X is the major version, Y is the minor version, and Z is the patch version. Each element MUST increase numerically by increments of one. For instance: 1.9.0 -> 1.10.0 -> 1.11.0.”

Web-based frameworks like Express (for Node.js) and Ruby on Rails can benefit from it.

Public Software

When it comes to public software, things get complicated. Let’s take a moment to ask why your users would ever use the version information. The only answer that comes to my mind is that’s how they decide whether there’s a newer version available for purchase or not. Back in the day, when I was using Windows 95 and heard about this fancy new thing called Windows 98, I knew it was time to upgrade.

Here’s the deal: your users don’t really care about cryptic version numbers. The only time detailed version information such as 4.1.2222A would come in handy is when the user is experiencing issues and is on the phone with support; the support guy will want to know exactly which version of the software they are using, right down to the build number.

Therefore, it is a good idea for public software to have a dual versioning scheme: The first version is a user-friendly name that is easy to remember and use in casual discussions, e.g. Lion.  The second is the detailed version to be used in times of crisis, e.g. 10.7.5 (11G63), such as when browsing the Internet for known vulnerabilities or calling Apple for support.

Microsoft seems to have invented, or at least popularized, the dual versioning system with Windows 95. Windows 95, the immensely popular successor of Windows 3.1x, was actually called Windows 4.0. Windows 98 was 4.1. Apple picked this idea up starting from OS X 10.0 and started naming major releases after cats (Cheetah, Jaguar, Panther, up to the latest Mountain Lion). You can apply your creative juices when coming up with a user-friendly versioning scheme and even involve the marketing group in the process.

For the detailed versioning scheme, I personally prefer <major>.<minor>.<revision>.<build>. You could also use <major>.<minor>.<revision>.<date>. For example:

2.15.45.30098312345
1.0.3 (December 21, 2012)

Embedding the date in the release is actually a very good idea: it conveys the time the build was generated. Others prefer appending build IDs generated by version control systems such as SVN, Git or Mercurial. At my work, I use the following rules:

  • Major: This is usually incremented at the end of a full release cycle; the resulting product is a major upgrade. All stakeholders (Managers, VPs, Marketing Directors) are informed of the major increment, and it is usually followed by press releases and marketing.
  • Minor: When additional functionality or new features are introduced. Software Development managers or Senior Developers typically approve such increments.
  • Revision: Bug fixes, ad-hoc patches, any minor change. Developers increment this number each time they make a minor change or fix a bug.
  • Build: This is the only component of the version that is automatically generated. There is a python script that generates this using the Git commit SHA every time the software is built.

This post from Alex Collins explains the rules for incrementing each component:

“You zero digits to the right of any you increment, so if you fix a bug and introduce a new feature after version 5.3.6 then the new version is 5.4.0. Unstated digits are assumed to be zero, so 5.4.0 is the same as 5.4.0.0 and 5.4.0.0.0.0.0.0.0…”

Bonus: I have written a Java class to hold Version information. It is available under the MIT license so you can freely use it. .NET users have the Version class which comes with the framework.

In-house or Hosted Software

For high-rate builds using automated systems like Jenkins, it is a good idea to have versions automatically time-stamped. Something like <year>.<month>.<day>.<time> is good for providing detailed information to support desk staff and for referencing build information for developers. E.g. 2012.12.29.0059 is the version of a build generated on December 29, 2012 at 12:59 a.m.
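Generating such a version string is a one-liner; for example, a build step could embed something like this (a Java sketch, names are mine):

import java.text.SimpleDateFormat;
import java.util.Date;

public class BuildVersion {
    public static void main(String[] args) {
        // Produces something like 2012.12.29.0059
        String version = new SimpleDateFormat("yyyy.MM.dd.HHmm").format(new Date());
        System.out.println(version);
    }
}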

Be Consistent

Whatever you do, please be consistent and stick to the system you pick. Sun Microsystems (acquired by Oracle) confused developers everywhere by changing their versioning scheme along the way.

An excellent post explaining JavaScript’s Object creation mechanism via prototype.

JavaScript, JavaScript...


JavaScript’s prototype object generates confusion wherever it goes. Seasoned JavaScript professionals, even authors frequently exhibit a limited understanding of the concept. I believe a lot of the trouble stems from our earliest encounters with prototypes, which almost always relate to new, constructor and the very misleading prototype property attached to functions. In fact prototype is a remarkably simple concept. To understand it better, we just need to forget what we ‘learned’ about constructor prototypes and start again from first principles.
