Cassandra Chapter 5: Data Replication Strategies

Where as Data Partitioning (discussed in last chapter) is concerned with picking a node to store the first copy of data (row) on, Replication is all about storing additional copies of data on multiple nodes so it can deal with node failures without data loss.

In Cassandra terminology, these copies of data or rows are called replicas.

Replication Factor:

This parameter determine how many nodes in your cluster store copies of data. For example, if Replication Factor (RF) is set to 2, there will be two copies of every data stored on different nodes. As common sense dictates, the RF cannot be greater than the number of nodes in the cluster. You cannot say store 10 replicas of data (RF=10) when you only have 8 nodes available. If you try to do this, your writes will fail.

Replication Strategies:

Cassandra has two replication protocols or strategies which I will discuss below. The main difference between the two is whether you have a single data center or your Cassandra cluster spans multiple data centers.

Simple Strategy: As the name indicates, this is the simplest strategy. Data Partitioner protocol (e.g. RandomPartitioner) picks out the first node for the first replica of data. SimpleStrategy places the second replica on the very next node in the ring.
Network Topology Strategy: As the name indicates, this strategy is aware of the network topology (location of nodes in racks, data centers etc.) and is much intelligent than Simple Strategy. This strategy is a must if your Cassandra cluster spans multiple data centers and lets you specify how many replicas you want per data center. It tries to distribute data among racks to minimize failures. That is, when choosing nodes to store replicas, it will try to find a node on a different rack.Let’s talk about a few real-world scenarios from my experience: My team and I were designing a Cassandra cluster spanning 4 data centers (we called them sites). Please note that I’m changing several details about the actual scenario, however, the concepts are the same. Each site also had own application servers. See figure below.

We had two concerns:
1. Allow local reads in a data center to avoid cross data center latency. For example, the Application Server in Site #1 should read from the Cassandra cluster in the same site.
2. Application should continue to work albeit with limited functionality if connectivity between data centers is disrupted.
3. Tolerate entire cluster failures in two data centers and some node failures throughout all clusters in all data centers.

To meet all the requirements, we used NetworkTopologyStrategy with 2 replicas per data center. That is every single row of data will be stored in 8 nodes in total ( 2 nodes per site x 4 sites). Hence we can tolerate failure a single node per site while still allowing local reads. We can also tolerate failure of an entire cluster but will have to incur the extra cost of cross data center latency in the Application server of the site where the cluster failed. For example, if the cluster in Site #2 dies completely, the Application server in that site can have to retrieve data from any of Site # 1, 2 or 3 with extra cost of inter site communication. In this example, we can still have our application working (with decreased performance) even if Cassandra clusters in 3 sites fail completely.

SimpleStrategy Versus NetworkTopologyStrategy – Use Cases:

I have read in several posts people suggesting that if you have a single data center, you should use Simple Strategy. However, it’s not so simple. Simple Strategy is completely oblivious to network topology, including rack configuration. If the replicas happen to be on the same rack, and the rack fail, you will suffer from data loss. Racks are susceptible to failures due to power, heat or network issues. So, I would recommend using Simple Strategy only when your cluster is in a single data center and on a single rack (or you don’t care about rack configuration).

How Cassandra knows your Network Topology?

There is no magic here. Cassandra doesn’t learn your topology automatically. You must define the topology such as assigning nodes to racks and data centers and give this information to Cassandra which then uses it. This mapping of nodes to racks and data center is done using Snitch, which I will discuss later.

~~~ Ignore below this line ~~~
PlanetCassandra

Cassandra Chapter 4: Data Partitioning With Random and Byte-Ordered Partitioners

Cassandra is a distributed database that runs on multiple nodes. When you write data to the cluster, partitioning scheme determines which node in the cluster stores that data. For example, suppose you are inserting some data (Column-Value pair identified by a Row Key). Data partitioning protocol will dictate which node in the cluster is responsible for storing this data. Similarly, when you request data, the partitioning protocol will examine the Row Key and find the node in the cluster responsible for the row key and retrieve data from it.

Difference between Partitioning and Replication?

Data partitioning is concerned with picking a node in the cluster to store the first copy of data on. Replication determines number of additional nodes that will store the same data (for performance and fault tolerance reasons). Replication is discussed in the next section.

Partitioning => Picking out one node to store first copy of data on
Replication => Picking out additional nodes to store more copies of data

Types of Partitioners in Cassandra:

Two partitioners in Cassandra are used most commonly:

Random Partitioner (RP): It uses hash on the Row Key to determine which node in the cluster will be responsible for the data. Loosely speaking, similar to how a HashMap in Java uses the Hashcode to determine which bucket will keep the data. Here’s how this scheme works: The hash value is generated by doing MD5 on the Row Key. The resulting hash value is restricted in the 0 – 2^127 range. Each node in the cluster in a data center is assigned sections of this range and is responsible for storing the data whose Row Key’s hash value falls within this range. For example, suppose you have a 3-node Cassandra cluster with Random Partitioner (RP). RP assigns range, in Cassandra terminology called token, 0-2^42 to the first node, 2^42-2^84 to the second node and 2^84-2^127 to the third node. (Note, I made these numbers up). Now, when you write data to a node, it will calculate the hash of the Row Key to be written. Suppose the hash comes out to be 2^21. RP then determines that this falls in the range 0-2^42 which is assigned to the first node. Hence data is handed to the first node for storage.Remember that RP calculates these ranges or tokens by dividing the possible hash range (0-2^127) by the number of nodes in the cluster:
```
Token Range = (2^127) ÷ (# of nodes in the custer)
```
If the cluster is spanned across multiple data centers, the tokens are created for individual data centers. (And that’s a good thing as we’ll see in the next chapter).

Byte Ordered Partitioner (BOP): It allows you to calculate your own tokens and assign to nodes yourself as opposed to Random Partitioner automatically doing this for you. As an example, suppose you know that all your keys will be in the range 0 – 999. You have a 10 node cluster and you wish you assign the following ranges to the nodes:
```
node_1    =>  0 - 100
node_2    =>  100 - 200
node_3    =>  200 - 300
......
node_10  =>  900
```
All keys which fall in the range 0 – 100 (e.g. 72) will be stored on node_1. In effect, ByteOrderedPartitioner allows you to create your own shards of data.

Which Partition Scheme to use?

Random Partitioner is the recommended partitioning scheme. It has the following advantages over Ordered Partitioning as in BOP:

RP ensures that the data is evenly distributed across all nodes in the cluster. BOP can create hotspots of data where some nodes hold more data than the others. Consider our last example in BOP. Let us say we are inserting 10 M rows into the Cassandra cluster and 75% of the keys are in the range 200-300. In this case, node_3 will hold 7.5 M rows where as rest of 9 nodes will have 2.5 M keys. node_3 will be a hotspot.
When a new node is added to the cluster, RP can quickly assign it a new token range and move minimum amount of data from other nodes to the new node which it is now responsible for. With BOP, this will have to be done manually. This may not be an issue if the # of nodes in your cluster is set in stone. However, for most applications, this is not the case . Your IT staff should be able to add new nodes effortlessly to a cluster under increased load without caring about internal details about partitioning.
Multiple Column Families Issue: BOP can cause uneven distribution of data if you have multiple column families. See this very same problem on Stack Overflow: http://stackoverflow.com/questions/11109162/how-to-calculate-the-token-for-a-byteorderedpartitioner
The only benefit that BOP has over RP is that it allows you to do row slices. You can obtain a cursor like in RDBMS and move over your rows. You can ask for all row keys between 250-295. In the above example, Cassandra knows that node_3 has that range and its work is easy. This is not easy in RP, but most applications can be redesigned and indexes in RP can give the same results. [I have some confusion about this part. This post over here say that this could be done using get_range_slices: http://wiki.apache.org/cassandra/FAQ#iter_world)

Summary:

In this post, we examined what Partitioning means in Cassandra and looked at the definition of two Partitioning schemes in Cassandra: RandomPartitioner and ByteOrderedPartitioner and compared their advantages and drawbacks. We reached the conclusion Random Partitioner should always be preferred over any type of Ordered Partitioning.

Cassandra Chapter 3 – Data Model

An Associative Array is one of the most basic and useful data structures where each value is identified by a key, usually a string. In contrast, values in a normal array are identified by indices. Associative Array maps keys to values. There is one-to-one relationship between keys and values, such that each key can only be mapped to a single value only. This concept is used by many languages: PHP calls it Associative Array, Dictionary in Python, HashMap in Java.

keyValuePairs — How HashMap or Dictionary stores data

In the figure above, if you access the key ‘firstName’, the return value will be ‘Bugs’. In Python 3, I can create this as follows:

mydictionary = { 'firstName' : 'Bugs', 'lastName': 'Bunny', 'location': 'Earth'} #create a dictionary
print(mydictionary['firstName']); #get value associated with key 'fistName'

The Output is:

$ python3 list.py
Bugs

Cassandra follows the same concept as Associative Maps or Dictionaries, but with a slight twist: The value in Cassandra has another embedded Associative Array with its own keys and values. Let me explain. Like an Associative Array, Cassandra has keys which point to values. These top-level keys are called ‘Row Keys‘. The value itself contains sub-keys, called ‘Column Names‘ associated to values. For example, we can store all movies by director in Cassandra sorted by year. To get movie directed by Quentin Tarantin and James Cameron in 1994 and 2009 respectively, we can:

[qtarantino][1994] == 'Pulp Fiction' //tarantino is the Row Key and 1994 is the sub-key, aka Column Name), or, [jcameron][2009] == 'Avatar'

Another way of looking at it:
A key in Cassandra can have multiple values. Each of these values have to be assigned another identifier for Cassandra to retrieve that particular value. Cassandra lets you name individual values of a key so you can retrieve that value with pin-point accuracy without obtaining all values. This name or sub-key is called “Column Name” in Cassandra. A Column Name is nothing but a key identifying a unique value (one-to-one relation) inside a bigger key, called “Row Key“.

[figure]
column family in Cassandra

Cassandra Data Model

The above picture reminds me of the movie Inception, how it had dream within a dream. I see a Dictionary inside another Dictionary, if you think of ‘Column Name 1’ as a sub key with an associated value. I call this the “Inception Concept” and its present everywhere in the computing world, not just Cassandra. (think Recursion)

Column:

A column in Cassandra is very much like a Key-Value pair: It has a key, called Column Name which has an associated value. A column in Cassandra has an additional field called timestamp.
A Cassandra Column

To understand the timestamp field, let’s recall that Cassandra is a distributed databases running on multiple nodes. timestamp is provided by the client application and Cassandra uses this value to determine which node has the most up-to-date value. Let us say 2 nodes in Cassandra respond to our queue and return a column. Cassandra will examine the timestamp field of both columns and the one that is the most recent will be returned to the client. Cassandra will also update the node that returned the older value by doing what is called a ‘Read Repair’.

An important point to remember is that the timestamp value is provided by the application: Cassandra doesn’t automatically update this value on write or update. Most applications ignore timestamp values which is fine, however if you are using Cassandra as a real-time data store, the timestamp values become very important.

Cassandra allows null or empty values.

Column Family:

Very, very loosely speaking, a column family in Cassandra is like table in RDBMS database like MySQL: it is a container for row keys and their values (Column Names). But the comparison stops there: In RDBMS you define table schema and each row must adhere to that schema. In other words you specify the table columns, their names, data types and whether they can be null or not. In Cassandra, you have the freedom to choose whether you want to specify schema or not. Cassandra supports two types of Column Families:

1. Static Column Family: You can specify schema such as Column Names, their Data Types (more on Types later) and indexes (more on this later). At this point, you may be thinking this is like RDBMS. You are right, it is. However, one difference I can see is that in an RDBMS a table must strictly adhere to the schema and each row must reserve space for each column defined in the schema, even though the column may be empty or null for some rows. Cassandra rows are not required to reserve storage for every column defined in the schema and can be sparsed. Space is only used for the columns that are present in the row.

The client application is not required to provide all Columns

2. Dynamic Column Family: There is no schema. The application is free to store whatever columns it want and their data types at run-time.
dynamic_column_family

Note: An application can still insert arbitrary column in static column family. However, it must meet the contract for a column name that is defined in the schema, e.g. the Data Type of the value.

Keyspace:

A keyspace in Cassandra is a container for column families. Just like a database in RDBMS is a container of tables. Most application typically have one keyspace. For example, a voting application has a keyspace called “voting”. Inside that keyspace, there are several column families: users, stats, campaigns, etc.

So far:

The picture looks like the following:
Cassandra Data Model Tree

Super Columns: Another Inception level

Super Columns are yet another nesting inside row key. It groups Column Names together. Back to our inception analogy, starting from the inner most level: Column Families are dictionaries nested inside Super Columns which is another dictionary nested inside the top most dictionary called Row Key. Suppose your row key is the UserID. You can have a Super Column family called Name which contains the Columns FirstName and LastName. When you retrive the Name super column you get all the column names within it, in this case Fistname and LastName.

RowKey => Super Column => Column Names UserID => Name => Firstname, LastName

Counter Columns:

If you have used Redis before, you must love the increment feature which lets you increment and retreive a value at the same time. E.g. incr key_name. Cassandra has something similar: Counter Column. A Counter Column stores a number which can be incremented or decremented as you would a variable in Java: i++ or i–. Possible use cases of Counter Columns are to store the number of times a web page has been viewed, limits, etc.
Counter columns do not require timestamp. I would imagine Cassandra tracks this internally. In fact, when you update a Counter Column (increment or decrement), Cassandra internally performs a read from other nodes to make sure it is updating the most recent value. A consistency level of ONE should be used with Counter Columns.

Summary:

Ok, we have covered a lot of ground here. Let’s summarize:

Keyspace: Top level container for Column Families.
Column Family: A container for Row Keys and Column Families
Row Key: The unique identifier for data stored within a Column Family
Column: Name-Value pair with an additional field: timestamp
Super Column: A Dictionary of Columns identified by Row Key.

Here’s how we will get a value: [Keyspace][ColumnFamily][RowKey][Column] == Column’s Value
Or for a Super Column: [Keyspace][ColumnFamily][RowKey][SuperColumn][Column] == Column’s Value

Cassandra Chapter 2 – Architecture and Concepts

In terms of it’s architecture, Cassandra is an open source, p2p, distributed data store written in the Java programming language.

Distributed, P2P and Fault Tolerant:

It is designed to run on multiple nodes and all nodes are equals: you do not have to specify any master nodes. One of the main ideas behind Cassandra was that machines fail and that the system should be able to sustain node failures. Cassandra fulfills this vision by partitioning its data among all nodes in the cluster and using replication to store multiple copies of the data.

One of the biggest features of its architecture is that Cassandra is write anywhere, read anywhere. This means that if you have a Cassandra cluster of 50 nodes, you can write some data to any node in the cluster and read it from any node in the cluster. You do not have to deal with partitioning and other messy details on the application level.

Scalability:

Cassandra is scalable. I have read articles which shows that Cassandra can easily scale to handle petabytes of data. Cassandra also claims to scale linearly with respect to performance, so that when nodes are doubled, so is the performance. In our experimentation with Cassandra on RackSpace, Cassandra does seem to hold this point valid as we saw almost double throughput every time we doubled the nodes.

Distribution Among Multiple Data Centers:

This is really an extension of “distributed” feature of Cassandra but this is so important that I’d like to discuss it. The TCP/IP or the OSI model is rack or data center topology unaware. When you provide an IP address to an application using TCP/IP, the networking stack does not know whether the TCP/IP end-point is in the same data center or a different one across the globe. This means that we must do something at the application layer to allow an application to make a determination about the location of a TCP/IP end-point. (define end-points, or use some intelligent algorithm to infer the location of the IP end-point).

Cassandra provides a tool on to define data center configurations. You edit configuration file in Cassandra to specify nodes and then group nodes by racks or data centers. Cassandra does no magic to determine whether the nodes are in the same data center or not, it simply reads it from the configuration. Cassandra uses the term Snitch for mapping nodes to physical locations in the network. A snitch maps a IPs to racks and data centers. It defines network topology in terms of nodes. Snitches are configured in main Cassandra config file: cassandra.yaml

Actually, to make my point, I lied: Cassandra can infact ‘infer’ the node’s location in the network, say a rack, if your network topology is indicated by the IP addresses. RackInferringSnitch does this.

Gossip Protocol:

The nodes in the Cassandra cluster uses Gossip protocols to exchange internal information about what’s going on with each other. For example, when a new node is added to the cluster, it uses the Gossip protocol to signal it’s arrival to the other nodes. When writing Cassandra applications, you do not use the Gossip protocol at all, but it is handy to know about this since it comes up in the official documentation.

Writing Data to Cassandra:

When data is written to Cassandra, the node handling that request writes it to a log file, called the “commit log”. Commit logs are used to ensure data durability and to make sure that writes are not lost if the machines crashes all of a sudden. Cassandra then writes the data to a table in memory and eventually writes the data to disk in SortedStrings (SS) table.

Tunable Consistency:

When you are writing and reading data from any node in a Cassandra cluster, there are issues regarding consistency. In other words, for a real-time system which is writing data thousands of times a second, how do we read the latest copy of data. For example, if you are writing some very critical data, you can make Cassandra writes that data to every node in the cluster. Similarly, if you would like to get the latest copy of data, you can read from all nodes and then take the latest copy (by milliseconds).

Cassandra offers tunable consistency, which means you can set it yourself. If you want high consistency, no problem. Cassandra takes this one step further: consistency can be set on a per operation basis.

Consistency Level ALL:

Consistency Level ONE:

Consistency Level QOURUM:

Schema-Less:

Cassandra uses no fixed schema like RDBMS to define data structure. You can have structure, unstructured data. Here’s the cool thing: Cassandra still allows you to define indexes .

Query Language:

Much like SQL language in the RDBMS universe, Cassandra has CQL to access and modify data. CQL’s syntax is close to SQL and supports core SQL DDL/DML commands like CREATE, SELECT, INSERT, UPDATE, etc. Here’s an example query in CQL:

SELECT * FROM usercounters WHERE NumberOfTimesPoked > 2

Data Compression:

To provide space saving benefits, as of version 1.0 Cassandra also supports data compression using the Google Snappy protocol. You can tell Cassandra to compress the data and when you try to read it, Cassandra will automatically decompress the relavant data and send it to you. This is something that I may not experiment. There is an obvious trade-off in performance as the CPU has to do extra work uncompressing data.

Cassandra Chapter 1 – A first look at Apache Cassandra

Apache Cassandra is Open Source, NoSQL, distributed,Peer-to-Peer, massively scalable data store. Mumbo-jumbo? Let’s take a closer look.

1. NoSQL – A fancy term for “post-relational” database management systems, NoSQL databases are very different from Relational databases like MySQL, or Oracle. By and large, NoSQL databases are magnificent Key/Value stores (think HashMaps in Java or Dictionaries in Python) designed to store very large quantities of data and scale horizontally. Unlike Relational Databases like MySQL or Oracle, they don’t have ACIDic support, schemas are flexible, rows are non-homogeneous and support millions of columns.

2. Distributed – Runs and stores data on multiple interconnected nodes.

3. Peer-To-Peer – Cassandra forgoes the traditional Master-Slave architecture. In Cassandra clusters, all nodes are absolutely equal. Client applications could write to any node in the cluster and vice versa, could read from any node.

4. Scalable – Designed for horizontal scaling; adding additional nodes is a breeze.

5. Fault-Tolerant & Highly available – What so great about Peer-to-Peer architecture? There is no single point of failure. In fact, Cassandra was designed based on the assumption that hardware failures are common and do occur.

Short History

Cassandra was designed and developed by Facebook from grounds up for their Inbox search feature. The architecture is based on Google’s BigTable and Amazon’s Dynamo.

Cassandra in simplest terms is a distributed, key-value data store. It stores data on multiple nodes so it can scale with the load increases. Data replication (storing multiple copies of the same data on different nodes) ensures that the data is not lost when a node fails.

Where to obtain it

To obtain a copy of Cassandra, please vist: http://cassandra.apache.org/. If you are looking to deploy Cassandra in an enterprise grade production environment and need paid support, I would recommend these guys: http://www.datastax.com/

Use Case #1: Using Cassandra for Real-time Counters

FastBook is a global social media company with 500 million active users on any given day. The application servers which serve user requests are spread across 4 geographically separated data centres. Requests are evenly load-balanced amongst the 4 data centres, so an incoming request from a user could get routed to any data centre at random. You are asked to create two features:

1. Each user can be “Poke’d” by other users a maximum of 1000 times a day.
2. Allow users to send a maximum of 3000 messages per day

Possible Solutions:
The solution to both requirement is simple: Store following counters associated with each user account:

DailyNumberOfPokes
DailyNumberOfMessagesSent

The application can update these counters as usage occurs, and once the limits are reached, deny any further usage for the day. Then at the end of the day (or at some point), you reset these counters to 0, since the counters are daily limits.

Sounds simple? It is, at least in theory. Here is the catch: You are doing this for a system that has 500+ million users sending thousands of events a second. One possible solution is to have a centralized MySQL server which stores the 2 counters in a table. But a single server solution is not feasible since the service operates in multiple data centers. Taking a trip across data centers for every request to read or update counters will be too costly. You can setup replication, sharding/partitioning in MySQL, but it is not good enough for real-time queries and will effect performance. Replication and sharding add extra layer of complexity on MySQL and makes thing much more complex. I have nothing against MySQL or other cool RDBMS solutions, but they are not suited for this problem.

Another solution is to use a real-time key value datastore such as Redis or Membase. Redis is a phenomenal system. It has the best speed out of all NoSQL solutions I have evaluated. However, Redis is not suited for multi data center replication, where an event can be randomly handled at any data center.

Enter Cassandra:

This is the type of problem which Cassandra claims to be perfectly suited for. Recall: Cassandra is a real-time read anywhere, write anywhere (p2p, distributed) data store. Cassandra can be setup on nodes in each data center and then as an event arrives, any Cassandra node can be updated and that update is seen by other nodes in other data centers automatically.
This sounds very simple. The programming paradigm is indeed very simple. However, the architecture of Cassandra deployment needs to be well thought. Distributed systems have to ensure that all nodes see the same data at the same time and that the system should tolerate partial failures. Imagine the following two scenarios that may hurt these goals:

Links connecting data centers are so slow that updates take a long to reach other data centers hence some data centers continue to see stale values for some time
Two critical racks become offline

This is a well-studied problem in Computer Science [1]. Cassandra is designed to solve these problems albeit with proper tuning. If you are interested, you can read about the CAP theorem by Eric Brewer which essentially talks about inherent difficulties in distributed computing systems like Cassandra.

Summary

This is it for this post. In this post, we looked at the definition of Cassandra and some of it’s key features. We also considered a use-case and a hypothetical situation where we can apply Cassandra.

In the next post, I will talk about the architecture of Cassandra and it’s inner workings.

10K-LOC

Articles on Software & Management

Cassandra

Cassandra Chapter 5: Data Replication Strategies

Replication Factor:

Replication Strategies:

SimpleStrategy Versus NetworkTopologyStrategy – Use Cases:

How Cassandra knows your Network Topology?

Cassandra Chapter 4: Data Partitioning With Random and Byte-Ordered Partitioners

Difference between Partitioning and Replication?

Partitioning => Picking out one node to store first copy of data on
Replication => Picking out additional nodes to store more copies of data

Types of Partitioners in Cassandra:

Which Partition Scheme to use?

Summary:

Cassandra Chapter 3 – Data Model

Column:

Column Family:

Keyspace:

So far:

Super Columns: Another Inception level

Counter Columns:

Summary:

Cassandra Chapter 1 – A first look at Apache Cassandra

Short History

Where to obtain it

Use Case #1: Using Cassandra for Real-time Counters

Summary

Replication Factor:

Replication Strategies:

SimpleStrategy Versus NetworkTopologyStrategy – Use Cases:

How Cassandra knows your Network Topology?

Difference between Partitioning and Replication?

Partitioning => Picking out one node to store first copy of data on Replication => Picking out additional nodes to store more copies of data

Types of Partitioners in Cassandra:

Which Partition Scheme to use?

Summary:

Column:

Column Family:

Keyspace:

So far:

Super Columns: Another Inception level

Counter Columns:

Summary:

Short History

Where to obtain it

Use Case #1: Using Cassandra for Real-time Counters

Summary

Partitioning => Picking out one node to store first copy of data on
Replication => Picking out additional nodes to store more copies of data