Cassandra is a distributed database that runs on multiple nodes. When you write data to the cluster, partitioning scheme determines which node in the cluster stores that data. For example, suppose you are inserting some data (Column-Value pair identified by a Row Key). Data partitioning protocol will dictate which node in the cluster is responsible for storing this data. Similarly, when you request data, the partitioning protocol will examine the Row Key and find the node in the cluster responsible for the row key and retrieve data from it.
Difference between Partitioning and Replication?
Data partitioning is concerned with picking a node in the cluster to store the first copy of data on. Replication determines number of additional nodes that will store the same data (for performance and fault tolerance reasons). Replication is discussed in the next section.
Partitioning => Picking out one node to store first copy of data on
Replication => Picking out additional nodes to store more copies of data
Types of Partitioners in Cassandra:
Two partitioners in Cassandra are used most commonly:
- Random Partitioner (RP): It uses hash on the Row Key to determine which node in the cluster will be responsible for the data. Loosely speaking, similar to how a HashMap in Java uses the Hashcode to determine which bucket will keep the data. Here’s how this scheme works: The hash value is generated by doing MD5 on the Row Key. The resulting hash value is restricted in the 0 - 2^127 range. Each node in the cluster in a data center is assigned sections of this range and is responsible for storing the data whose Row Key’s hash value falls within this range. For example, suppose you have a 3-node Cassandra cluster with Random Partitioner (RP). RP assigns range, in Cassandra terminology called token, 0-2^42 to the first node, 2^42-2^84 to the second node and 2^84-2^127 to the third node. (Note, I made these numbers up). Now, when you write data to a node, it will calculate the hash of the Row Key to be written. Suppose the hash comes out to be 2^21. RP then determines that this falls in the range 0-2^42 which is assigned to the first node. Hence data is handed to the first node for storage.Remember that RP calculates these ranges or tokens by dividing the possible hash range (0-2^127) by the number of nodes in the cluster:
Token Range = (2^127) ÷ (# of nodes in the custer)
If the cluster is spanned across multiple data centers, the tokens are created for individual data centers. (And that’s a good thing as we’ll see in the next chapter).
- Byte Ordered Partitioner (BOP): It allows you to calculate your own tokens and assign to nodes yourself as opposed to Random Partitioner automatically doing this for you. As an example, suppose you know that all your keys will be in the range 0 – 999. You have a 10 node cluster and you wish you assign the following ranges to the nodes:
node_1 => 0 - 100 node_2 => 100 - 200 node_3 => 200 - 300 ...... node_10 => 900
All keys which fall in the range 0 – 100 (e.g. 72) will be stored on node_1. In effect, ByteOrderedPartitioner allows you to create your own shards of data.
Which Partition Scheme to use?
Random Partitioner is the recommended partitioning scheme. It has the following advantages over Ordered Partitioning as in BOP:
- RP ensures that the data is evenly distributed across all nodes in the cluster. BOP can create hotspots of data where some nodes hold more data than the others. Consider our last example in BOP. Let us say we are inserting 10 M rows into the Cassandra cluster and 75% of the keys are in the range 200-300. In this case, node_3 will hold 7.5 M rows where as rest of 9 nodes will have 2.5 M keys. node_3 will be a hotspot.
- When a new node is added to the cluster, RP can quickly assign it a new token range and move minimum amount of data from other nodes to the new node which it is now responsible for. With BOP, this will have to be done manually. This may not be an issue if the # of nodes in your cluster is set in stone. However, for most applications, this is not the case . Your IT staff should be able to add new nodes effortlessly to a cluster under increased load without caring about internal details about partitioning.
- Multiple Column Families Issue: BOP can cause uneven distribution of data if you have multiple column families. See this very same problem on Stack Overflow: http://stackoverflow.com/questions/11109162/how-to-calculate-the-token-for-a-byteorderedpartitioner
- The only benefit that BOP has over RP is that it allows you to do row slices. You can obtain a cursor like in RDBMS and move over your rows. You can ask for all row keys between 250-295. In the above example, Cassandra knows that node_3 has that range and its work is easy. This is not easy in RP, but most applications can be redesigned and indexes in RP can give the same results. [I have some confusion about this part. This post over here say that this could be done using get_range_slices: http://wiki.apache.org/cassandra/FAQ#iter_world)
In this post, we examined what Partitioning means in Cassandra and looked at the definition of two Partitioning schemes in Cassandra: RandomPartitioner and ByteOrderedPartitioner and compared their advantages and drawbacks. We reached the conclusion Random Partitioner should always be preferred over any type of Ordered Partitioning.