Introduction to Apache Cassandra

24 February 2021 at 10:00 by ParTech Media

We live in the information age, an age in which most of us are surrounded by at least two devices at any given time. Most of these devices are also connected to the internet. All this translates to an invisible web that is continuously transmitting data at high speed around you.

So how do you store all this data? Moreover, how do you query and process all this incoming data at acceptable speeds, especially when you are part of a multimillion- or billion-dollar organization whose reputation depends on data consistency?

This is where data stores like Apache Cassandra come into play. In this post, let’s take a detailed look at this data store, starting with the question: what is Apache Cassandra?

Table of Contents

  1. What is Apache Cassandra?
  2. What are the top features of Cassandra?
  3. How does Apache Cassandra work?
  4. What are some common mistakes made while using Apache Cassandra?
  5. Example use case of Cassandra
  6. Verdict

What is Apache Cassandra?

Apache Cassandra is a wide-column data store that allows you to store and manage massive amounts of data. It is an open-source NoSQL data store designed to handle huge inflows of data without a central bottleneck.

If you take a closer look at the platform, it’s a highly decentralized and distributed system that you can deploy in a cloud or hybrid environment. The high availability and fault tolerance of Apache Cassandra enable quick data processing across thousands of data nodes.

Did you know that Apache Cassandra was originally developed at Facebook to power its inbox search? After its success there, many more users started using it. Nowadays Cassandra is used by top organizations like Apple and Netflix to manage their large volumes of incoming data. They use it for its ability to collect and process data from thousands of sensors, appliances and digital applications spread across the organization. Apache Cassandra comes in handy when high write speeds are a priority for the normal functioning of your organization.

What are the top features of Cassandra?

A distributed data store can only provide two of the three fundamental guarantees described by the CAP theorem. But what is the CAP theorem?

The CAP theorem, which is also known as Brewer’s theorem, states that a distributed data store cannot offer more than two of the three guarantees listed below -

  1. Consistency - The promise that every read request receives the most recent write or an error.
  2. Availability - The promise that every request receives a non-error response, without the guarantee that it contains the most recent write.
  3. Partition Tolerance - The promise that the data store continues to operate even when any number of messages between nodes are dropped or delayed by the network.

In other words, no distributed data store can fulfill all three at once. Most data stores commit you to a predetermined trade-off between these guarantees, but Cassandra takes it a step further. It allows you to adjust where you stand on that trade-off.

Apache Cassandra has a tunable consistency model that lets you decide how read and write requests are handled, and you can make this decision on a per-query level. For each request you select a consistency level, which determines how many replica nodes must respond before the request counts as successful. You can also set a default level that applies whenever a query does not specify one.

The consistency levels offered by Apache Cassandra fall into three broad groups -

  • High stale data potential (Low consistency level) - Cassandra returns data as soon as a single replica responds (for example, level ONE). This is fast, but the data can sometimes be out of date. When replicas disagree, the conflict is resolved by choosing the value that has the latest timestamp.
  • Medium stale data potential (Medium consistency level) - Cassandra waits for a majority of the replicas (for example, level QUORUM) before coming back with the result. This strikes a balance between data accuracy and query speed.
  • Strong data consistency (High consistency level) - Cassandra waits for every replica (for example, level ALL) before returning a result. This gives you the highest chance of reading the latest data, but each query takes longer to return, and a single unavailable replica causes the request to fail.
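The effect of these levels on a read can be sketched in a few lines of Python. This is a simplified, in-memory illustration and not how Cassandra itself is implemented: the `Replica` class, the `read` function and the assumption that replicas answer in list order are all made up for this example. The level names ONE, QUORUM and ALL are real Cassandra consistency levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    value: str
    timestamp: int  # write time; a larger number means a newer write


def replicas_to_wait_for(level: str, replication_factor: int) -> int:
    """How many replicas a request waits for at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown level: {level}")


def read(replicas: list[Replica], level: str) -> str:
    """Return the newest value among the replicas that answered."""
    needed = replicas_to_wait_for(level, len(replicas))
    # Assumption for this sketch: the first `needed` replicas answer first.
    answered = replicas[:needed]
    # Conflicts are resolved by picking the latest timestamp.
    return max(answered, key=lambda r: r.timestamp).value


# Three replicas; the first one has not yet received the latest write.
replicas = [
    Replica("burgers_left=100", timestamp=1),
    Replica("burgers_left=99", timestamp=2),
    Replica("burgers_left=99", timestamp=2),
]

low = read(replicas, "ONE")        # may be stale
medium = read(replicas, "QUORUM")  # sees the newer write in this setup
high = read(replicas, "ALL")       # sees the newer write
```

In this setup the read at level ONE returns the stale value, while QUORUM and ALL return the newer one, because at least one of the replicas they wait for already holds the latest write.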

There is no single right mode for your needs, as you will always have to sacrifice something to get something. If you opt for strong data consistency, each query takes longer to return. If you opt for fast responses, your data consistency takes a hit. With Apache Cassandra, the choice is always yours to make.
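One common way to reason about this trade-off is the overlap rule: a read is guaranteed to see the latest successful write when the number of replicas read (R) plus the number of replicas written (W) is greater than the replication factor (RF). The sketch below only does that arithmetic; the function names are hypothetical and chosen for this example.

```python
def quorum(replication_factor: int) -> int:
    """Size of a strict majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3

# Fast reads and fast writes: one replica each, so a read can miss the write.
fast_but_stale = reads_see_latest_write(1, 1, RF)

# Majority reads and majority writes always share at least one replica.
balanced = reads_see_latest_write(quorum(RF), quorum(RF), RF)

# Reading from every replica is safe even with single-replica writes,
# at the cost of slower reads.
slow_reads = reads_see_latest_write(RF, 1, RF)
```

With a replication factor of 3, reading and writing at a single replica does not satisfy the rule, while majority reads combined with majority writes do.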

How does Apache Cassandra work?

Apache Cassandra is scalable and elastic, which makes it one of the most popular data stores on the market. While many other data stores rely on a master-slave architecture, Apache Cassandra takes a totally different, masterless approach in which every node plays the same role.

For example, let's consider a typical data store with a master-slave architecture. When the master node stops working, the data store typically cannot accept writes until a new master is appointed. Apache Cassandra has no master node: every node can accept requests, so when one node stops working, requests are simply redirected to any other available node. This reduces the downtime that is common in master-based data stores.
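The redirect behaviour described above can be sketched as follows. This is a toy model, assuming a cluster represented as a dictionary of node names and up/down flags; `pick_coordinator` is a hypothetical helper, and real client drivers use more sophisticated load-balancing policies.

```python
def pick_coordinator(nodes: dict[str, bool], preferred: str) -> str:
    """Return the preferred node if it is up, otherwise any other live node."""
    if nodes.get(preferred, False):
        return preferred
    for name, is_up in nodes.items():
        if is_up:
            return name
    raise RuntimeError("no live nodes available")


# Every node is a peer; none of them is a master.
cluster = {"node-a": True, "node-b": True, "node-c": True}

first = pick_coordinator(cluster, preferred="node-a")

cluster["node-a"] = False  # node-a goes down
fallback = pick_coordinator(cluster, preferred="node-a")
```

Because no node is special, losing `node-a` only means the request goes to another peer; nothing has to be elected or promoted first.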

Cassandra also uses a table-like structure to organize data. Columns have defined data types, and rows are grouped into partitions by a partition key, so that data that belongs together is stored together. Here’s an added bonus of using Apache Cassandra - whenever a machine is added to or removed from the cluster, Cassandra redistributes the data across the new set of nodes.
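The idea behind this redistribution is consistent hashing: nodes and keys are placed on the same hash ring, and each key belongs to the next node on the ring. The sketch below is a deliberately simplified version. It assumes one token per node and uses MD5 as a stand-in hash, whereas Cassandra by default uses a Murmur3-based partitioner and assigns many virtual nodes to each machine. The `Ring` class and its methods are hypothetical names for this example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    # MD5 is only a stand-in for the partitioner's hash function here.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self._tokens = []  # sorted node tokens
        self._owners = {}  # token -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def remove(self, node: str) -> None:
        t = token(node)
        self._tokens.remove(t)
        del self._owners[t]

    def owner(self, key: str) -> str:
        # First node token at or after the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


keys = [f"user-{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")  # a new machine joins the cluster
after = {k: ring.owner(k) for k in keys}

# Only keys that now belong to the new node change owner.
moved = [k for k in keys if before[k] != after[k]]
```

The useful property is that adding `node-d` moves only the keys that `node-d` takes over; every other key stays where it was, so the cluster does not have to reshuffle all of its data.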

What are some common mistakes made while using Apache Cassandra?

Here are 7 common mistakes that people make when choosing and using Apache Cassandra -

  1. Assuming Cassandra will work for you just because you have a lot of data points
  2. Implementing a domain model directly in Apache Cassandra
  3. Assuming that all data can be queried in any way you like
  4. Trying to add a new column with values based on other columns
  5. Choosing the wrong hardware to deploy Cassandra
  6. Failing to monitor and maintain the data store consistently
  7. Trying to use Cassandra as a queue
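The third point is easier to see with a small model. The sketch below assumes that the point refers to the way Cassandra tables are organized around a partition key: lookups by that key are cheap, while lookups by anything else have to look through all the data. The table, its columns and the helper functions are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical table: tweets partitioned by hashtag.
tweets_by_hashtag = defaultdict(list)


def insert(hashtag: str, user: str, text: str) -> None:
    tweets_by_hashtag[hashtag].append({"user": user, "text": text})


def find_by_hashtag(hashtag: str) -> list:
    """Cheap: goes straight to a single partition."""
    return tweets_by_hashtag.get(hashtag, [])


def find_by_user(user: str) -> list:
    """Expensive: has to scan every partition in the table."""
    return [
        row
        for rows in tweets_by_hashtag.values()
        for row in rows
        if row["user"] == user
    ]


insert("#free100", "alice", "I want a burger")
insert("#free100", "bob", "Me too")
insert("#cassandra", "alice", "Learning about data stores")

campaign = find_by_hashtag("#free100")
alice_tweets = find_by_user("alice")
```

If looking up tweets by user were a frequent query, the usual approach would be to keep a second table partitioned by user rather than scanning this one.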

Example use case of Cassandra

Let’s say that McDonald’s wants to run a campaign where it is offering 100 free burgers to 100 people. All you need to do is retweet McDonald’s most recent tweet with the hashtag #free100.

Now this will bring in tens of thousands of retweets, which translate to tens of thousands of queries to Twitter. Twitter must be equipped to handle that many queries and respond to them efficiently.

This is a situation where a data store like Cassandra could be essential. Its high availability and fast writes would enable Twitter to process all those requests within a short span of time.

Verdict

Apache Cassandra is a great choice if you’re looking for a scalable NoSQL data store. It will help you handle heaps of information without much hassle. Moreover, you can tune the behavior of Apache Cassandra to the needs of your organization as a whole. If you have hundreds of data nodes with millions of queries, Apache Cassandra might just be the data store for you.