Apache Cassandra is Open Source, NoSQL, distributed,Peer-to-Peer, massively scalable data store. Mumbo-jumbo? Let’s take a closer look.
1. NoSQL – A fancy term for “post-relational” database management systems, NoSQL databases are very different from Relational databases like MySQL, or Oracle. By and large, NoSQL databases are magnificent Key/Value stores (think HashMaps in Java or Dictionaries in Python) designed to store very large quantities of data and scale horizontally. Unlike Relational Databases like MySQL or Oracle, they don’t have ACIDic support, schemas are flexible, rows are non-homogeneous and support millions of columns.
2. Distributed – Runs and stores data on multiple interconnected nodes.
3. Peer-To-Peer – Cassandra forgoes the traditional Master-Slave architecture. In Cassandra clusters, all nodes are absolutely equal. Client applications could write to any node in the cluster and vice versa, could read from any node.
4. Scalable – Designed for horizontal scaling; adding additional nodes is a breeze.
5. Fault-Tolerant & Highly available – What so great about Peer-to-Peer architecture? There is no single point of failure. In fact, Cassandra was designed based on the assumption that hardware failures are common and do occur.
Short History
Cassandra was designed and developed by Facebook from grounds up for their Inbox search feature. The architecture is based on Google’s BigTable and Amazon’s Dynamo.
Cassandra in simplest terms is a distributed, key-value data store. It stores data on multiple nodes so it can scale with the load increases. Data replication (storing multiple copies of the same data on different nodes) ensures that the data is not lost when a node fails.
Where to obtain it
To obtain a copy of Cassandra, please vist: http://cassandra.apache.org/. If you are looking to deploy Cassandra in an enterprise grade production environment and need paid support, I would recommend these guys: http://www.datastax.com/
Use Case #1: Using Cassandra for Real-time Counters
FastBook is a global social media company with 500 million active users on any given day. The application servers which serve user requests are spread across 4 geographically separated data centres. Requests are evenly load-balanced amongst the 4 data centres, so an incoming request from a user could get routed to any data centre at random. You are asked to create two features:
1. Each user can be “Poke’d” by other users a maximum of 1000 times a day.
2. Allow users to send a maximum of 3000 messages per day
Possible Solutions:
The solution to both requirement is simple: Store following counters associated with each user account:
DailyNumberOfPokes
DailyNumberOfMessagesSent
The application can update these counters as usage occurs, and once the limits are reached, deny any further usage for the day. Then at the end of the day (or at some point), you reset these counters to 0, since the counters are daily limits.
Sounds simple? It is, at least in theory. Here is the catch: You are doing this for a system that has 500+ million users sending thousands of events a second. One possible solution is to have a centralized MySQL server which stores the 2 counters in a table. But a single server solution is not feasible since the service operates in multiple data centers. Taking a trip across data centers for every request to read or update counters will be too costly. You can setup replication, sharding/partitioning in MySQL, but it is not good enough for real-time queries and will effect performance. Replication and sharding add extra layer of complexity on MySQL and makes thing much more complex. I have nothing against MySQL or other cool RDBMS solutions, but they are not suited for this problem.
Another solution is to use a real-time key value datastore such as Redis or Membase. Redis is a phenomenal system. It has the best speed out of all NoSQL solutions I have evaluated. However, Redis is not suited for multi data center replication, where an event can be randomly handled at any data center.
Enter Cassandra:
This is the type of problem which Cassandra claims to be perfectly suited for. Recall: Cassandra is a real-time read anywhere, write anywhere (p2p, distributed) data store. Cassandra can be setup on nodes in each data center and then as an event arrives, any Cassandra node can be updated and that update is seen by other nodes in other data centers automatically.
This sounds very simple. The programming paradigm is indeed very simple. However, the architecture of Cassandra deployment needs to be well thought. Distributed systems have to ensure that all nodes see the same data at the same time and that the system should tolerate partial failures. Imagine the following two scenarios that may hurt these goals:
- Links connecting data centers are so slow that updates take a long to reach other data centers hence some data centers continue to see stale values for some time
- Two critical racks become offline
This is a well-studied problem in Computer Science [1]. Cassandra is designed to solve these problems albeit with proper tuning. If you are interested, you can read about the CAP theorem by Eric Brewer which essentially talks about inherent difficulties in distributed computing systems like Cassandra.
Summary
This is it for this post. In this post, we looked at the definition of Cassandra and some of it’s key features. We also considered a use-case and a hypothetical situation where we can apply Cassandra.
In the next post, I will talk about the architecture of Cassandra and it’s inner workings.