Cassandra Chapter 5: Data Replication Strategies

Whereas data partitioning (discussed in the last chapter) is concerned with picking the node that stores the first copy of a row, replication is all about storing additional copies of that data on multiple nodes so the cluster can tolerate node failures without data loss.

In Cassandra terminology, these copies of data or rows are called replicas.

Replication Factor:

This parameter determines how many nodes in your cluster store copies of the data. For example, if the Replication Factor (RF) is set to 2, there will be two copies of every row, stored on different nodes. As common sense dictates, the RF cannot be greater than the number of nodes in the cluster: you cannot store 10 replicas of data (RF=10) when you only have 8 nodes available. If you try to do this, your writes will fail.

Replication Strategies:

Cassandra has two replication strategies, which I will discuss below. The main factor in choosing between the two is whether your Cassandra cluster lives in a single data center or spans multiple data centers.

  • Simple Strategy: As the name indicates, this is the simplest strategy. The partitioner (e.g. RandomPartitioner) picks the node for the first replica of the data. SimpleStrategy then places each additional replica on the very next node in the ring, without considering racks or data centers.
  • Network Topology Strategy: As the name indicates, this strategy is aware of the network topology (the location of nodes in racks, data centers, etc.) and is much more intelligent than Simple Strategy. This strategy is a must if your Cassandra cluster spans multiple data centers, and it lets you specify how many replicas you want per data center. It also tries to distribute data among racks to minimize failures; that is, when choosing nodes to store replicas, it will try to find a node on a different rack. Let’s talk about a few real-world scenarios from my experience: my team and I were designing a Cassandra cluster spanning 4 data centers (we called them sites). Please note that I’m changing several details about the actual scenario; however, the concepts are the same. Each site also had its own application servers. See the figure below.
    Real World Cassandra Deployment

    We had three concerns:
    1. Allow local reads in a data center to avoid cross data center latency. For example, the Application Server in Site #1 should read from the Cassandra cluster in the same site.
    2. Application should continue to work albeit with limited functionality if connectivity between data centers is disrupted.
    3. Tolerate entire cluster failures in two data centers and some node failures throughout all clusters in all data centers.

    To meet all the requirements, we used NetworkTopologyStrategy with 2 replicas per data center. That is, every single row of data is stored on 8 nodes in total (2 nodes per site x 4 sites). Hence we can tolerate the failure of a single node per site while still allowing local reads. We can also tolerate the failure of an entire cluster, but we will have to incur the extra cost of cross data center latency in the application server of the site where the cluster failed. For example, if the cluster in Site #2 dies completely, the application server in that site will have to retrieve data from Site #1, #3 or #4 with the extra cost of inter-site communication. In this example, we can still have our application working (with decreased performance) even if the Cassandra clusters in 3 of the 4 sites fail completely. A small sketch of this placement logic is shown below.
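    To make this concrete, here is a minimal, purely illustrative Python sketch of the two placement policies (my own simplification, not Cassandra’s actual code). The node names, sites and racks are made up to mirror the deployment above.

# Illustrative sketch only -- not Cassandra's real implementation.
# Each entry is (node, data_center, rack); list order represents the ring.
ring = [
    ("n1", "site1", "rack1"), ("n2", "site1", "rack2"),
    ("n3", "site2", "rack1"), ("n4", "site2", "rack2"),
    ("n5", "site3", "rack1"), ("n6", "site3", "rack2"),
    ("n7", "site4", "rack1"), ("n8", "site4", "rack2"),
]

def simple_strategy(first_index, rf):
    """SimpleStrategy: first replica on the node picked by the partitioner,
    remaining replicas on the next nodes around the ring (topology ignored)."""
    return [ring[(first_index + i) % len(ring)][0] for i in range(rf)]

def network_topology_strategy(first_index, rf_per_dc):
    """NetworkTopologyStrategy-like placement: walk the ring once per data
    center, preferring nodes on racks not already used in that data center."""
    walk = [ring[(first_index + i) % len(ring)] for i in range(len(ring))]
    replicas = []
    for dc in dict.fromkeys(node[1] for node in ring):   # preserve DC order
        dc_nodes = [n for n in walk if n[1] == dc]
        all_racks = {n[2] for n in dc_nodes}
        chosen, racks_used = [], set()
        for name, _, rack in dc_nodes:
            if len(chosen) == rf_per_dc:
                break
            if rack not in racks_used or racks_used == all_racks:
                chosen.append(name)
                racks_used.add(rack)
        replicas.extend(chosen)
    return replicas

print(simple_strategy(first_index=2, rf=3))                   # ['n3', 'n4', 'n5']
print(network_topology_strategy(first_index=2, rf_per_dc=2))  # 2 replicas in each of the 4 sites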

SimpleStrategy Versus NetworkTopologyStrategy – Use Cases:

I have read several posts where people suggest that if you have a single data center, you should use Simple Strategy. However, it’s not so simple. Simple Strategy is completely oblivious to network topology, including rack configuration. If the replicas happen to land on the same rack, and that rack fails, you will suffer data loss. Racks are susceptible to failures due to power, heat or network issues. So, I would recommend using Simple Strategy only when your cluster is in a single data center and on a single rack (or you don’t care about rack configuration).

How Does Cassandra Know Your Network Topology?

There is no magic here. Cassandra doesn’t learn your topology automatically. You must define the topology, i.e. assign nodes to racks and data centers, and give this information to Cassandra, which then uses it. This mapping of nodes to racks and data centers is done using a snitch, which I will discuss later.


Cassandra Chapter 4: Data Partitioning With Random and Byte-Ordered Partitioners

Cassandra is a distributed database that runs on multiple nodes. When you write data to the cluster, the partitioning scheme determines which node in the cluster stores that data. For example, suppose you are inserting some data (a Column-Value pair identified by a Row Key). The data partitioning protocol dictates which node in the cluster is responsible for storing this data. Similarly, when you request data, the partitioning protocol examines the Row Key, finds the node in the cluster responsible for that row key, and retrieves the data from it.

Difference between Partitioning and Replication?

Data partitioning is concerned with picking a node in the cluster to store the first copy of data on. Replication determines the number of additional nodes that will store the same data (for performance and fault tolerance reasons). Replication is discussed in the next section.

Partitioning   => Picking out one node to store first copy of data on
Replication    => Picking out additional nodes to store more copies of data

Types of Partitioners in Cassandra:

Two partitioners are most commonly used in Cassandra:

  • Random Partitioner (RP): It uses a hash of the Row Key to determine which node in the cluster will be responsible for the data. Loosely speaking, this is similar to how a HashMap in Java uses the hashcode to determine which bucket will keep the data. Here’s how this scheme works: the hash value is generated by applying MD5 to the Row Key. The resulting hash value is restricted to the range 0 – 2^127. Each node in the cluster (within a data center) is assigned a section of this range, called a token in Cassandra terminology, and is responsible for storing the data whose Row Key’s hash value falls within that section. For example, suppose you have a 3-node Cassandra cluster with Random Partitioner (RP). RP assigns the range 0 – 2^42 to the first node, 2^42 – 2^84 to the second node and 2^84 – 2^127 to the third node. (Note: I made these numbers up.) Now, when you write data, Cassandra calculates the hash of the Row Key to be written. Suppose the hash comes out to be 2^21. RP then determines that this falls in the range 0 – 2^42, which is assigned to the first node. Hence the data is handed to the first node for storage (a small sketch after this list illustrates both schemes). Remember that RP calculates these ranges or tokens by dividing the possible hash range (0 – 2^127) by the number of nodes in the cluster:
    Token Range = (2^127) ÷ (# of nodes in the cluster)

    If the cluster spans multiple data centers, the tokens are created per data center. (And that’s a good thing, as we’ll see in the next chapter.)

  • Byte Ordered Partitioner (BOP): It allows you to calculate your own tokens and assign them to nodes yourself, as opposed to Random Partitioner doing this for you automatically. As an example, suppose you know that all your keys will be in the range 0 – 999. You have a 10-node cluster and you wish to assign the following ranges to the nodes:
    node_1    =>  0 - 100
    node_2    =>  100 - 200
    node_3    =>  200 - 300
    ......
    node_10  =>  900 - 999

    All keys which fall in the range 0 – 100 (e.g. 72) will be stored on node_1. In effect, ByteOrderedPartitioner allows you to create your own shards of data.
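    To make the token arithmetic concrete, here is a small, purely illustrative Python sketch (my own, not Cassandra’s code) that hashes a Row Key with MD5, divides the 0 – 2^127 space evenly among three nodes the way RandomPartitioner does, and performs a BOP-style lookup against the manually chosen ranges above. The node names are made up.

import hashlib

HASH_SPACE = 2 ** 127        # RandomPartitioner token space (0 .. 2^127)

def md5_token(row_key):
    """Hash the Row Key with MD5 and map it into the 0..2^127 token space."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE

def random_partitioner_node(row_key, num_nodes=3):
    """Evenly divide the hash space among the nodes and pick the owner."""
    token_range = HASH_SPACE // num_nodes            # Token Range = 2^127 / #nodes
    index = min(md5_token(row_key) // token_range, num_nodes - 1)
    return "node_%d" % (index + 1)

# BOP-style: we pick the ranges ourselves (keys known to be in 0..999).
bop_ranges = [(i * 100, (i + 1) * 100, "node_%d" % (i + 1)) for i in range(10)]

def byte_ordered_node(key):
    for low, high, node in bop_ranges:
        if low <= key < high:
            return node

print(random_partitioner_node("some-row-key"))   # owner chosen by the MD5 hash
print(byte_ordered_node(72))                     # node_1 (72 falls in 0 - 100)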

Which Partition Scheme to use?

Random Partitioner is the recommended partitioning scheme. It has the following advantages over Ordered Partitioning as in BOP:

  1. RP ensures that the data is evenly distributed across all nodes in the cluster. BOP can create hotspots where some nodes hold more data than others. Consider our last example in BOP. Let us say we are inserting 10 M rows into the Cassandra cluster and 75% of the keys are in the range 200-300. In this case, node_3 will hold 7.5 M rows whereas the remaining 9 nodes will share the other 2.5 M. node_3 will be a hotspot.
  2. When a new node is added to the cluster, RP can quickly assign it a new token range and move the minimum amount of data from other nodes to the new node that is now responsible for it. With BOP, this has to be done manually. This may not be an issue if the number of nodes in your cluster is set in stone; however, for most applications, this is not the case. Your IT staff should be able to add new nodes effortlessly to a cluster under increased load without worrying about the internal details of partitioning.
  3. Multiple Column Families Issue: BOP can cause uneven distribution of data if you have multiple column families. See this very same problem on Stack Overflow: http://stackoverflow.com/questions/11109162/how-to-calculate-the-token-for-a-byteorderedpartitioner
  4. The only benefit that BOP has over RP is that it allows you to do row slices: you can obtain a cursor, like in an RDBMS, and move over your rows. You can ask for all row keys between 250-295. In the above example, Cassandra knows that node_3 has that range, so its work is easy. This is not easy in RP, but most applications can be redesigned, and indexes in RP can give the same results. [I have some confusion about this part. This post says that it can be done using get_range_slices: http://wiki.apache.org/cassandra/FAQ#iter_world]

Summary:

In this post, we examined what partitioning means in Cassandra, looked at two partitioning schemes, RandomPartitioner and ByteOrderedPartitioner, and compared their advantages and drawbacks. We reached the conclusion that RandomPartitioner should always be preferred over any type of ordered partitioning.

Cassandra Chapter 3 – Data Model

An Associative Array is one of the most basic and useful data structures, where each value is identified by a key, usually a string. In contrast, values in a normal array are identified by indices. An Associative Array maps keys to values. There is a one-to-one relationship between keys and values, such that each key can be mapped to a single value only. This concept exists in many languages: PHP calls it an Associative Array, Python a Dictionary, and Java a HashMap.

[figure]
How a HashMap or Dictionary stores data

In the figure above, if you access the key ‘firstName’, the return value will be ‘Bugs’. In Python 3, I can create this as follows:

mydictionary = { 'firstName' : 'Bugs', 'lastName': 'Bunny', 'location': 'Earth'} #create a dictionary
print(mydictionary['firstName']) #get value associated with key 'firstName'

The Output is:

$ python3 list.py
Bugs

Cassandra follows the same concept as Associative Arrays or Dictionaries, but with a slight twist: the value in Cassandra is itself another embedded Associative Array with its own keys and values. Let me explain. Like an Associative Array, Cassandra has keys which point to values. These top-level keys are called ‘Row Keys‘. The value itself contains sub-keys, called ‘Column Names‘, associated with values. For example, we can store all movies by director in Cassandra, sorted by year. To get the movies directed by Quentin Tarantino and James Cameron in 1994 and 2009 respectively, we can do:

[qtarantino][1994] == 'Pulp Fiction' //qtarantino is the Row Key and 1994 is the sub-key, aka Column Name, or,
[jcameron][2009] == 'Avatar'

Another way of looking at it:
A key in Cassandra can have multiple values. Each of these values has to be assigned another identifier for Cassandra to retrieve that particular value. Cassandra lets you name the individual values of a key so you can retrieve a value with pin-point accuracy without fetching all the values. This name or sub-key is called the “Column Name” in Cassandra. A Column Name is nothing but a key identifying a unique value (one-to-one relation) inside a bigger key, called the “Row Key”.
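Continuing the Python dictionary analogy, here is a tiny, purely illustrative sketch of this “dictionary inside a dictionary” view (a mental model only, not how Cassandra stores data on disk):

movies_by_director = {
    # Row Key           { Column Name : Value }
    "qtarantino": {1994: "Pulp Fiction"},
    "jcameron":   {2009: "Avatar", 1997: "Titanic"},
}

# [Row Key][Column Name] == value, just like the examples above
print(movies_by_director["qtarantino"][1994])   # Pulp Fiction
print(movies_by_director["jcameron"][2009])     # Avatar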

[figure]
column family in Cassandra

Cassandra Data Model

The above picture reminds me of the movie Inception and how it had a dream within a dream. I see a Dictionary inside another Dictionary, if you think of ‘Column Name 1’ as a sub-key with an associated value. I call this the “Inception Concept” and it’s present everywhere in the computing world, not just Cassandra. (Think recursion.)

Column:

A column in Cassandra is very much like a key-value pair: it has a key, called the Column Name, with an associated value. A column in Cassandra also has an additional field called the timestamp.
A Cassandra Column

To understand the timestamp field, recall that Cassandra is a distributed database running on multiple nodes. The timestamp is provided by the client application, and Cassandra uses this value to determine which node has the most up-to-date value. Let us say 2 nodes in Cassandra respond to our query and each returns a column. Cassandra will examine the timestamp field of both columns, and the one that is most recent will be returned to the client. Cassandra will also update the node that returned the older value by doing what is called a ‘Read Repair’.

An important point to remember is that the timestamp value is provided by the application: Cassandra doesn’t automatically update this value on write or update. Most applications ignore timestamp values, which is fine; however, if you are using Cassandra as a real-time data store, the timestamp values become very important.
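As a rough illustration of how the most recent value wins, here is a tiny Python sketch (a simplification of the idea, not Cassandra’s actual read-repair code; the values and timestamps are made up):

# Two replicas return the same column with different timestamps.
replica_responses = [
    {"name": "location", "value": "Earth", "timestamp": 1355760000000},
    {"name": "location", "value": "Mars",  "timestamp": 1355760005000},  # newer write
]

# The column with the highest client-supplied timestamp is returned to the client;
# a replica holding the older value is then updated via read repair.
winner = max(replica_responses, key=lambda column: column["timestamp"])
print(winner["value"])   # Mars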

Cassandra allows null or empty values.

Column Family:

Very, very loosely speaking, a column family in Cassandra is like a table in an RDBMS such as MySQL: it is a container for row keys and their values (Column Names). But the comparison stops there: in an RDBMS you define a table schema and each row must adhere to that schema. In other words, you specify the table columns, their names, data types and whether they can be null or not. In Cassandra, you have the freedom to choose whether you want to specify a schema or not. Cassandra supports two types of Column Families:

1. Static Column Family: You can specify a schema such as Column Names, their data types (more on types later) and indexes (more on this later). At this point, you may be thinking this is like an RDBMS. You are right, it is. However, one difference is that in an RDBMS a table must strictly adhere to the schema, and each row must reserve space for each column defined in the schema, even though the column may be empty or null for some rows. Cassandra rows are not required to reserve storage for every column defined in the schema and can be sparse. Space is only used for the columns that are present in the row.

[figure]

The client application is not required to provide all Columns

2. Dynamic Column Family: There is no schema. The application is free to decide, at run-time, what columns it wants to store and their data types.
[figure]

Note: An application can still insert arbitrary columns into a static column family. However, a column whose name is defined in the schema must meet its contract, e.g. the data type of the value.

Keyspace:

A keyspace in Cassandra is a container for column families, just like a database in an RDBMS is a container for tables. Most applications typically have one keyspace. For example, a voting application could have a keyspace called “voting”. Inside that keyspace, there are several column families: users, stats, campaigns, etc.

So far:

The picture looks like the following:
Cassandra Data Model Tree

Super Columns: Another Inception level

Super Columns add yet another level of nesting inside the row key: a Super Column groups Column Names together. Back to our Inception analogy, starting from the innermost level: Columns are dictionaries nested inside Super Columns, which are dictionaries nested inside the top-most dictionary, the Row Key. Suppose your row key is the UserID. You can have a Super Column called Name which contains the Columns FirstName and LastName. When you retrieve the Name super column, you get all the columns within it, in this case FirstName and LastName.


RowKey => Super Column => Column Names
UserID => Name => Firstname, LastName
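In the dictionary analogy, a Super Column simply adds one more level of nesting. A small illustrative Python sketch (the keys and values are made up):

users = {
    # Row Key      { Super Column : { Column Name : Value } }
    "user42": {
        "Name":    {"FirstName": "Bugs", "LastName": "Bunny"},
        "Address": {"Planet": "Earth"},
    },
}

# Retrieving the 'Name' super column returns all of the columns inside it.
print(users["user42"]["Name"])               # {'FirstName': 'Bugs', 'LastName': 'Bunny'}
print(users["user42"]["Name"]["FirstName"])  # Bugs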

Counter Columns:

If you have used Redis before, you must love the increment feature which lets you increment and retrieve a value at the same time, e.g. incr key_name. Cassandra has something similar: the Counter Column. A Counter Column stores a number which can be incremented or decremented as you would a variable in Java: i++ or i--. Possible use cases of Counter Columns are storing the number of times a web page has been viewed, limits, etc.
Counter columns do not require a timestamp; I would imagine Cassandra tracks this internally. In fact, when you update a Counter Column (increment or decrement), Cassandra internally performs a read from other nodes to make sure it is updating the most recent value. A consistency level of ONE should be used with Counter Columns.

Summary:

Ok, we have covered a lot of ground here. Let’s summarize:

Keyspace: Top level container for Column Families.
Column Family: A container for Row Keys and their Columns
Row Key: The unique identifier for data stored within a Column Family
Column: Name-Value pair with an additional field: timestamp
Super Column: A named group of Columns nested under a Row Key.

Here’s how we will get a value: [Keyspace][ColumnFamily][RowKey][Column] == Column’s Value
Or for a Super Column: [Keyspace][ColumnFamily][RowKey][SuperColumn][Column] == Column’s Value
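Sticking with the nested-dictionary mental model, both lookup paths can be pictured like this (the keyspace, column family and keys are made up; this is an illustration, not Cassandra’s API):

voting = {                                        # Keyspace
    "users": {                                    # Column Family
        "user42": {                               # Row Key
            "location": "Earth",                  # Column Name : Value
            "Name": {"FirstName": "Bugs"},        # Super Column holding Columns
        },
    },
}

print(voting["users"]["user42"]["location"])           # [Keyspace][ColumnFamily][RowKey][Column]
print(voting["users"]["user42"]["Name"]["FirstName"])  # ...[RowKey][SuperColumn][Column]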

Forgiveness is a virtue (of Javascript)

Forgiveness and tolerance are Javascript’s greatest virtues: the language does everything it can to prevent errors and makes erroneous code work as much as it can. This is in huge contrast to ‘prima donna’ languages like C++ or Java(*1), which whine and cry at the slightest hint of error and do not hesitate at all to halt execution at the first opportunity.

Let us consider some examples. In the example below, we have a function that takes 1 argument. In the last line, I call the same function without any argument, and the code works without any complaints from the Javascript interpreter.

function foo(count) //a 1-arg function
{
    for (var i = 0; i < count; i++)
        console.log("count " + i);
}
foo(3); //call the function with an argument
foo(); //***call the same function with no argument ***

Output

objects@london(umermansoor) $ node forgivenessVirtue.js
count 0
count 1
count 2

In fact the code above still works if we remove all semicolons (;)

function foo(count) //a 1-arg function
{
    for (var i = 0; i < count; i++)
        console.log("count " + i) //no semicolon
}

foo(3) //no semicolon here either
foo() //double whammy. Everything works

Note: Semi-colons are optional in Javascript and are required only if two statements occur on the same line.

Let’s look at a slightly advanced example. In Javascript, a constructor is a function that returns a reference to an object when called with new. E.g. var object = new SomeObject().

function Console(make) //a constructor
{
    console.log("The current make is: " + this.make + ". Setting make to: " + make);
    this.make = make; //setting make property of the object
}

var gamingConsole = new Console("Wii-U"); //normally that's how a constructor is called

Console("PS3"); //**calling a constructor directly, no complaints.

Note: You may be curious how the second example works when the constructor is called directly and there is no context for the this keyword. The answer is that Javascript uses the most global context in this case, normally the window object for web pages.

Why is Javascript so forgiving?

Javascript is THE language of the web. It is by far the most popular language on the planet. I would even dare to bet that if you randomly pick two sites, there is a very high probability that both will be running Javascript in some form or fashion.

Javascript was designed to just work and this is the reason we don’t see sites running Javascript crash completely in the presence of errors. Your clients don’t come running to you complaining that the site is down entirely if there are Javascript errors. Please don’t get me wrong: Javascript programs can still contain irrecoverable errors, for example, missing closing “)” in a for statement. The point I’m trying to make is that Javascript will try very hard to prevent errors, but if errors are unavoidable, it will at least run the program as much as it can.

Javascript is very easy to learn and is fun to work with. I recall my first experience with Javascript, when I was able to write a minuscule “Todo” program (or something like that) within an hour of first picking up the language, and the program just worked on the first try. This again is in sharp contrast with Java, which is very strict when it comes to discipline. When I started Java, for quite some time I found it very difficult to write programs in a plain text editor (say vim), without assistance from an IDE (say NetBeans), and run them without any syntax errors on my first attempt.

Now back to the question: in my opinion, Javascript had to be forgiving in order for it to be widely adopted as the de-facto standard language of the web. This freedom, however, was mostly not intentional. Javascript is an outstanding Object Oriented (OO) language; however, it has more than its share of bad parts: lots of different ways of doing the same thing, many different flavors. I could go on for hours talking about the bad parts of the language. In fact, there is an entire book on this subject: JavaScript: The Good Parts.

Summary:

Whether the freedom that Javascript lends to developers is a good thing or a bad thing depends on your perspective. Beginners using Javascript as their first language love it. Veteran developers coming from the background of a disciplined language such as Java may find themselves frustrated because Javascript does not force developers to behave themselves.

It seems to me that the freedom in Javascript sits well balanced on the spectrum between highly expressive languages like Perl and less expressive ones like Java. Javascript is an amazing OO language which is fun to work with and easy to use, and with the arrival of server-side Javascript platforms like Node.js, it can be used as the only language (not counting HTML + CSS) to furnish complete sites (as opposed to PHP + JS, or Python + JS).

*1: Oracle/Sun Java is an extraordinary OO language. Java and Javascript were designed for two very different purposes, had different audiences and goals. Both are very good and popular in what they do. Context is the keyword here.
 


Creating & Using Modules In Node.js & Understanding Paths

Javascript libraries are commonly referred to as “modules”. The modules are typically imported by other Javascript files to use their functionality. Modules could contain class definitions, or helper functions.

Node.js uses the global require() method to load modules. It is similar to the import statement in Java or include in PHP. require is a Node.js function and is not part of standard Javascript.

PHP’s include VS Node.js’ require

To include a file in PHP, we would do something like the following:

include 'filename.php';

To include a module in Node.js:

var module_name = require('filename.js');

The key difference is that PHP merges the content of the included file into the current file’s scope, whereas in Node.js the contents of the imported module are accessible through the variable the module is assigned to.

module.exports Object

A call to the require() function returns the module.exports object. The exports variable is automatically set to the same object as module.exports, so we can do something like this:

//filename: util.js
function add(num1, num2)
{
    return num1+num2;
};
exports.add = add;

This exposes the add() function to the world.

In the calling Javascript file, we can import and use the module as follows:

var utils = require('./util'); // include the module we created

var result = utils.add(1, 2); // call the add function which was exported by the module
console.log(result); // prints 3

The name added to the exports object doesn’t have to be the same name as the internal function it is exposing. For example, consider the following code:

//filename: util.js
function addTwoNumbers(num1, num2)
{
    return num1+num2;
};
exports.add = addTwoNumbers; //Exposing addTwoNumbers() as add()

The calling script can do the following:

var util = require('./util.js');
util.add(1,2);

A common practice among Node.js developers is to use anonymous functions when exporting. Let’s take a look:

//filename: util.js
exports.add = function(num1, num2) { //exposes the anonymous 2-arg function as add
    return num1 + num2;
};

If you are wondering what the difference between `exports` and `module.exports` is, it is nicely explained by Hack Sparrow:

Here is an eye-opener: module.exports is the real deal; exports is just module.exports‘s little helper. Your module ultimately returns module.exports to the caller, not exports. All exports does is collect properties and attach them to module.exports IF module.exports doesn’t have something on it already. If there’s something attached to module.exports already, everything on exports is ignored.

Node.js’ Import Path

Node.js uses a non-trivial algorithm when searching for a module specified in a require() statement. Let’s look at the several cases Node.js considers:

Case I: Absolute or Relative Path

This is the easiest case: If the path in require statement is absolute or relative, Node.js knows exactly where the module is located.

var moduleA = require('./moduleA'); // Module 'moduleA.js' is in the same folder as this script
var moduleB = require('../moduleB'); // moduleB.js is in the parent folder
var moduleC = require('/var/www/html/js/moduleC'); // absolute path

Case II: Core Modules

If the path (relative or absolute) is not specified, and the module that is loaded is one of Node.js’ core modules, it is automatically loaded. Node.js knows where the core module is located. The core modules are sys, http, fs, path, etc.

var moduleA = require('http');
var moduleB = require('fs');

Case III: No path and not core

This is where the algorithm uses non-trivial logic to locate the module: when you don’t specify a path and the module you are loading is not a core module. For example:

var moduleA = require('myModule'); // no path given!

The module search algorithm in Node.js will attempt to find a directory called “node_modules”. It starts with the directory of the running script and keeps moving up the hierarchy (parent directories) until it locates a “node_modules” directory. When the node_modules directory is found, Node.js then checks whether the module is a .js file or a subfolder. If the module is a .js file (e.g. myModule.js), Node will simply load it. If there is a subdirectory called myModule/ (e.g. node_modules/myModule), then Node.js will attempt to load index.js or package.json inside that subdirectory. This may look confusing, so let me use an illustration.

Going back to our example, suppose we require a certain module called “bigbangtheory”.

var moduleA = require('bigbangtheory'); // no path given!

Here’s how Node.js will look for it:

node_modules/bigbangtheory.js
node_modules/bigbangtheory/index.js
node_modules/bigbangtheory/package.json

The search will start in the same directory as the executing script and will work its way up the parent directories.

If the module is still not found, Node.js then uses the “require.path” array, which lists include directory paths. The paths can be set using the environment variable NODE_PATH or programmatically by scripts.

My Thoughts: Newtown Tragedy

I was deeply saddened and heartbroken, while browsing through the pictures of extremely beautiful children & courageous teachers killed in the Sandy Hook School massacre on CNN. My heart goes to the families of those who lost their lives and to the whole community affected by this despicable act of evil.

It is beyond my comprehension why people are allowed to buy automatic assault rifles with such ease.

The shooter was better armed than the soldiers of most professional armies.

When the 2nd Amendment was passed, they had guns that only had one shot before they had to be reloaded. That was the time when they had no standing army and needed reserve forces just in case they were needed to defend their country.

Today, the USA has the best military history has ever witnessed.

I see absolutely no reason, whatsoever why a civilian should be in possession of fully automatic weapons.

It is time to rethink.

Java’s Iterator, ListIterator and Iterable Explained

Recall Collection: A container for Objects in Java. Example: ArrayList<E>, Vector<E>, Set<E>, etc.

An iterator is an object which enables a Collection<E> to be traversed. It allows developers to retrieve data elements from a Collection without any knowledge of the underlying data structure, whether it is an ArrayList, LinkedList, Set or some other kind.

The concept behind iterator is very simple: An Iterator is always returned by a Collection and has methods such as next() which returns the next element in the Collection, hasNext() which returns a boolean value indicating if there are more elements in the Collection or not, etc.

Iterators promote “loose coupling” between Collection classes and the classes using these Collections. For example, a class containing some kind of an algorithm (e.g. Search) is only concerned with traversing the list without knowing the exact list structure or any low level details.

Interface Iterator<E>:

All Iterator objects must implement this interface and are bound by its protocols. The interface is very simple and only has three methods:

–       next() : Returns the next element in the Collection. To get all elements in a Collection, call this method repeatedly. When the end is reached and there are no more elements present, this method throws NoSuchElementException.

–       hasNext() : returns true if the Collection has more elements. false otherwise.

–       remove() : removes the last returned element from the Collection.

A java.util.Iterator can only move in one direction: forward. Once the iterator has reached the end of the list, it cannot be reset to the starting position again. In this case, a new Iterator should be obtained.

Interface ListIterator<E>:

This is a specialized Iterator for Collections implementing the List<E> interface. In other words, it is an iterator for Lists. It has several advantages over Iterator, namely:

  1. It allows traversing in both directions: forward & backward
  2. It allows for obtaining the Iterator position in the Collection, i.e. its index.
  3. It allows for adding or removing elements of the underlying List. [set(..) works as well].

You must be wondering at this point: why a new kind of Iterator for Lists? Why can’t we use the plain old Iterator? But if you really think about it, you’ll see why. Let us say you have two Collections: a Set<E> and a List<E>. You get Iterators from both Collections to traverse them. However, you feel that the Iterator returned by the List can do more: it can return the current index, allow you to add an element to the list, etc. That’s where ListIterators come in. The Iterator returned by Set<E> doesn’t have to do any of this: an element has no position (index) in a Set<E> Collection.

Interface Iterable<E>:

Before I wrap this up, I want to discuss the Iterable<E> interface. It is a very simple interface which defines only one method, called iterator(), which returns an Iterator<E>.

The sole purpose of this interface is to allow Objects implementing it to be used in for-each loop. A for-each loop in Java looks like the following:

for (String element : collectionImplementingIterable) { /* do something with element */ }

For example, ArrayList implements Iterable. This allows you to pass an ArrayList object to a for-each loop and traverse through it.

e.g.

ArrayList<String> as = new ArrayList<String>();
as.add("hippo");
as.add("chicken");
as.add("duck");

// traverse the List using for-each
for (String element : as)
	System.out.println(element);

Summary:

An iterator allows traversing a Collection. An Iterator could be obtained for virtually any kind of Collection, for example:

ArrayList<String> as = ...;
HashSet<String> hs = ...;

Now let us get Iterators for the two Collections defined above:

Iterator asIterator = as.iterator();
Iterator hsIterator = hs.iterator();

To iterate over the two collections:

while(asIterator.hasNext()) //Iterate over the ArrayList
{
    System.out.println(asIterator.next());
}

while(hsIterator.hasNext()) //Iterate over the HashSet
{
    System.out.println(hsIterator.next());
}

Notes:

  • Java iterators are very much like Relational Database Cursors.
  • Prior to Iterators, which were introduced in JDK 1.2, programmers used Enumeration to traverse Collections.

Skeletal Implementations in Java Explained

I use interfaces generously in my programs. I prefer them over using abstract classes for several reasons, some of which I will mention below:

  1. Inheritance does not promote good encapsulation. All sub classes depend on the implementation details of the super class. This may result in broken sub classes when the super class is changed. (Imagine testing all sub classes every time you change the super class!)
  2. Unlike inheritance, where a sub class can only extend from one super class, classes are free to implement as many interfaces as they like to support.
  3. It is very easy to support a new interface in an existing class. Suppose you would like several of your classes to support a new type, say, Serializable. You can simply implement the interface in your classes and define the interface methods. For example, any class in Java can implement the Comparable interface and can be applied everywhere a Comparable type is expected.
    Note: This is called defining a mixin[1]. Comparable is a mixin type. These types are intended to be used by other classes to provide additional functionality.

The above three arguments go directly against the philosophy of abstract classes. A sub class can only have one parent class and abstract classes defeat the purpose of mixins (Imagine Comparable being an Abstract class).

Now that I have tried my best to convince you Inheritance is bad, let me say this:

“Inheritance has its own place in programming. It is helpful in many cases, and decreases programming effort”

This is best explained with an example. Suppose you are writing a program which uses Redis to store its data. You would like to create specialized classes that deal with certain types of data. For instance, one class could open a connection to Redis Database #0 to store running counters and perform all related actions. Another class would connect to Redis Database #1 and store, in a set, all users who have requested to opt out from the service.

Let us define an Interface representing the main Redis Database:

interface RedisConnection {

    int connect();

    boolean isConnected();

    int disconnect();

    int getDatabaseNumber();
}

Let’s write a Counters class which implements this interface:

class RedisCounters implements RedisConnection {

    @Override
    public int connect() {
        //... lots of code to connect to Redis
    }

    @Override
    public boolean isConnected() {
        //... code to check Redis connection
    }

    @Override
    public int disconnect() {
        //... lots of code to disconnect & perform cleanup
    }

    @Override
    public int getDatabaseNumber() {
        //... returns the Redis database number, e.g. 0 for counters
    }
 }

We finish by writing a class which deals with users who have chosen to opt out, in Redis.

class RedisOptOut implements RedisConnection {

    @Override
    public int connect() {
        //... lots of code to connect to Redis
    }

    @Override
    public boolean isConnected() {
        //... code to check Redis connection
    }

    @Override
    public int disconnect() {
       //... lots of code to disconnect & perform cleanup
    }

    @Override
    public int getDatabaseNumber() {
        //... returns the Redis database number, e.g. 1 for opt-outs
    }

    /**
     * Other code specific to handling users who have opted out
     */

    // method specific to this class
    public boolean isOptedOut(String userid) {...}
}

If you look closely at the two classes above, you’ll notice something is not right: both classes repeat the connect(), isConnected() and disconnect() functions verbatim. This type of code repetition is not good for several obvious reasons: imagine if you have 10 classes instead of just two, and you would like to change the way connect() function works. You’ll have to make edits in all 10 classes and test them.

Abstract Classes To the Rescue

The program in the last section presents a classic case where abstract classes excel. We can define an abstract super class which implements the common functionality and makes its methods final to prevent sub classes from overriding them. You’ll end up with something like the following:

abstract class RedisConnection {
	public final int connect() {
		// ... lots of code to connect to Redis
	}

	public final boolean isConnected() {
		//... code to check Redis connection
	}

	public final int disconnect() {
		// ... lots of code to disconnect from Redis and perform cleanup
	}
}

/**
 *  sub class which extends from RedisConnection
 *
 */
class RedisCounts extends RedisConnection {

	/**
	 * There is no need to define connect(), isConnected() and disconnect() as
	 * these functions are defined by the super class.
	 */

	/**
	 * Other code specific to storing and retrieving counters
	 */
}

/**
 * another sub class extending from RedisConnection
 *
 */
class RedisOptOut extends RedisConnection {
	/**
	 * There is no need to define connect(), isConnected() and disconnect() as
	 * these functions are defined by the super class.
	 */

	/**
	 * Other code specific to handling users who have opted out
	 */
}

No doubt, this is a better solution. But at the beginning of this post, I explained why interfaces are preferred over inheritance. Let us take this one step further and combine interfaces and abstract classes, to maximize the benefits.

Abstract Classes + Interfaces = Abstract Interfaces

We can combine Abstract Classes and Interfaces by providing an abstract class, which defines the basic functionality, with every interface where necessary. The interface defines the type, whereas the abstract class does all the work implementing it.

By convention, these classes are named: AbstractInterface [Interface is the name of the interface the abstract class is implementing]. This convention comes from Java. In the Collections API, the abstract class, which goes with the List interface, is called AbstractList, etc.

The key to designing these abstract classes or AbstractInterfaces is to design them properly and document them well for programmers. For example, the class comment of the java.util.AbstractList class defines the methods programmers need to override in their implementations:

“To implement an unmodifiable list, the programmer needs only to extend this class and provide implementations for the get(int) and size() methods.
To implement a modifiable list, the programmer must additionally override the set(int, E) method (which otherwise throws an UnsupportedOperationException). If the list is variable-size the programmer must additionally override the add(int, E) and remove(int) methods.”[2]

Abstract Interfaces (Interfaces + Abstract Classes) give programmers the freedom to choose whether they would like to implement the interface directly or extend the abstract class. In our example, we will have:

/**
 * The Interface
 *
 */
interface RedisConnection
{
    int connect();
    boolean isConnected();
    int disconnect();
    int getDatabaseNumber();
}

/**
 * Abstract class which implements the interface.
 * This is called Abstract Interface
 *
 */
abstract class AbstractRedisConnection implements RedisConnection
{
    @Override
    public final int connect()
    {
        //... lots of code to connect to Redis
    }

    @Override
    public final boolean isConnected()
    {
        //... code to check Redis connection
    }

    @Override
    public final int disconnect()
    {
        //... lots of code to disconnect from Redis and perform cleanup
    }
 }

/**
 * A subclass which extends from the Abstract Interface
 *
 */
class RedisOptOut extends AbstractRedisConnection {...}

In cases where a class cannot extend from the AbstractInterface directly, it can still implement the Interface, and use an inner class which extends from the AbstractInterface and forward all interface method invocations to the inner class. For example:

/**
 * A class showing the forwarding technique. This class implements
 * an interface, but forwards all interface method invocations
 * to an abstract class, the Abstract Interface.
 */
class RedisCounters implements RedisConnection {

	// inner class extending Abstract Interface
	private class RedisConnectionForwarder extends AbstractRedisConnection {
		public RedisConnectionForwarder() {
		}
	}
	RedisConnectionForwarder r = new RedisConnectionForwarder();

	@Override
	public int connect() {
		// Simply forward the request to the Forwarding class.
		return r.connect();

	}

	@Override
	public boolean isConnected() {
		// Simply forward the request to the Forwarding class.
		return r.isConnected();
	}

	@Override
	public int disconnect() {
		// Simply forward the request to the Forwarding class.
		return r.disconnect();
	}

	/**
	 * Other code specific to storing and retrieving **counters**
	 */
}

As a final technique, you can also use static factories that return concrete instances in the form of anonymous inner classes. For example:

/**
 * A static factory method
 */
public static RedisConnection getRedisCountersImpl(...)
{
	return new AbstractRedisConnection() {
		//...
		/**
		 * Other code specific to storing and retrieving counters
		 */
	};
}

Summary

Using interfaces as a general contract has many benefits over inheritance. Inheritance, however, has its own place in programming, and oftentimes is a necessary evil. In this post, we explored Abstract Interfaces, which combine the power of interfaces with inheritance. Abstract Interface is a term for an abstract class which implements the common functionality of an interface. Abstract Interfaces always go with the interfaces they are supporting.

Abstract Interfaces give programmers the freedom to use either the interface or the abstract class, instead of tying them down with inheritance the way plain abstract classes do. We explored two techniques for using the abstract class when extending from the Abstract Interface directly is not possible. The Java API uses Abstract Interfaces generously; the Collections API is filled with these: AbstractList, AbstractSet, AbstractMap, AbstractCollection.

References

[1] http://en.wikipedia.org/wiki/Mixin

[2] http://docs.oracle.com/javase/6/docs/api/java/util/AbstractList.html

Auto generating version & build information in Java

Until recently, I was relying on final Strings and longs to store version information in my programs. However, I soon faced the limitations of this approach, such as forgetting to update version (or revision) information and conflicts between maven and internal version information. So I switched to using Java properties for handling version information and updating the properties at compile time using maven’s antrun plugin. This had its own shortcomings and resulted in complex pom.xml files.

I have to admit, I’m not a big fan of maven and its XML based structure: I don’t like it because it’s a gigantic beast. Every time I have to do something in maven, I find myself researching online for the right plugin and looking up its documentation. As a developer, build management using maven should be the least of my concerns {not having to remember which maven plugin does what}. On the other hand, to be fair to maven, it has some cool features like dependency management, life cycle, and convention-based directory structure, to name a few. But maven tries to do a lot of things, resulting in a complex product.

{At this point, I’m considering switching to Gradle. As a developer, I want to spend my time solving problems in the problem domain, not trying to tame my build management system. But using Gradle requires a grasp of Groovy (Yet Another Scripting Language - YASL). If only there was a build management system written in Python!}

The solution I’m going to discuss in this post uses Java annotations to generate versioning information at build time, using a python script. Maven is used in a very limited way with this approach.

Steps

  1. Create Version annotation and a class which reads these annotations in your Java program
  2. Write a python script to fill in the Java annotation at build time
  3. Structure your pom.xml file to include the generated-sources folder and to run our python script

1. Create Version annotation and a class which reads these annotations in your Java program

The first step is to create an annotation holder in your program which you’ll fill in at build time. Example here.

Then create a class which is going to read the annotation information. Example here.

2. Write a python script to fill in the Java annotation at build time

The next step is to create a python script that generates package-info.java containing the build time & date, version string, hostname, etc. This works by creating a package-info file, which is normally used to provide overall information about a package’s contents; we will fill in our annotations there. An example of such a python script is here. Feel free to use it in your projects.
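Since the linked example isn’t reproduced here, below is a rough sketch of what such a generator script could look like. The annotation name (@Version), its fields, the package name and the output path are all hypothetical; adapt them to the annotation you defined in step 1.

#!/usr/bin/env python3
# Illustrative sketch only; the annotation, its fields, the package and the paths are hypothetical.
import os
import socket
import time

PACKAGE = "com.example.myapp"                    # hypothetical package name
OUT_DIR = os.path.join("target", "generated-sources", "java", *PACKAGE.split("."))

TEMPLATE = """/** Auto-generated at build time -- do not edit. */
@Version(
    version = "%(version)s",
    buildTime = "%(build_time)s",
    builtBy = "%(host)s"
)
package %(package)s;
"""

values = {
    "version": os.environ.get("BUILD_VERSION", "1.0-SNAPSHOT"),
    "build_time": time.strftime("%Y-%m-%d %H:%M:%S"),
    "host": socket.gethostname(),
    "package": PACKAGE,
}

# Write the generated package-info.java where maven can pick it up (see step 3).
os.makedirs(OUT_DIR, exist_ok=True)
with open(os.path.join(OUT_DIR, "package-info.java"), "w") as f:
    f.write(TEMPLATE % values)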

3. Structure your pom.xml file to include the generated-sources folder and to run our python script

You then need to tell maven to pick up the package-info.java which is auto-generated by the python script in the last step. The python script places ‘package-info.java’ in the “target/generated-sources/java” folder. I used the build-helper-maven plugin to include a new source folder. Example here.

The last step is to tell maven to run the python script in the generate-sources phase. I used the exec-maven plugin for this. Example here.

Checkout the complete project

I have uploaded a complete project on Github: https://github.com/umermansoor/Versionaire

To use the project, do the following:

$ git clone git@github.com:umermansoor/Versionaire.git

$ cd Versionaire

$ mvn package

$ java -cp ./target/versonaire-1.0-SNAPSHOT.jar com._10kloc.versionaire.App