If you need to manage data volumes on the order of several terabytes or even petabytes, traditional database systems are often not up to the task. In this case, you need specialized big data applications that scale easily, since it’s often difficult to predict the actual volume of data in advance. One of the most popular examples of such a system is Cassandra, an open-source solution originally developed at Facebook.

What is Apache Cassandra?

Apache Cassandra is an open-source database management system (DBMS) for very large yet structured databases. Because it scales easily, a database can be distributed across the nodes of a cluster, so Cassandra is not bound to a single server.

Cassandra is a column-oriented NoSQL database. In this case, NoSQL means “Not only SQL” and not “no SQL”. When it comes to processing large amounts of data, NoSQL databases offer significant advantages over typical SQL databases because they are not bound by the restrictions of the relational model and of SQL (Structured Query Language). Apache Cassandra has its own query language, the Cassandra Query Language (CQL). Its syntax is similar to SQL, but it is tailored to the special features of Cassandra.

As a NoSQL database, Cassandra relies on redundancy to ensure high resilience: each piece of data is stored on several nodes. By contrast, replication is often harder to set up and scale in relational databases.

Fact

Cassandra was originally developed by Avinash Lakshman and Prashant Malik at Facebook and was first released as open source in 2008. In 2009, the Apache Software Foundation, one of the most important open-source developer communities, accepted the project into the Apache Incubator. In February 2010, Apache Cassandra graduated to a top-level project of the Apache Software Foundation, alongside other popular projects such as Apache HTTP Server, the Solr search server, the Kafka messaging platform, and OpenOffice.

Along with the original developers, other big companies such as IBM, Twitter, and Rackspace, one of the largest IT service providers in the United States, contribute to Cassandra. One major contributor to the project is DataStax, a company specializing in subscription-based support, installation assistance, and training courses in the Cassandra database. DataStax contributes 80% of Cassandra’s open-source releases and also offers DataStax Enterprise, a commercial database solution built on the freely available Cassandra system.

According to the DB-Engines Ranking, Apache Cassandra is currently the most popular column-oriented database, ranking ahead of big competitors like Microsoft Azure Cosmos DB and Google Cloud Bigtable.

Cassandra: core functions

As a truly distributed system, Cassandra does not use a master. All nodes in a cluster have equal roles and can process every database request, which significantly increases performance. Data is distributed across the nodes. The system can also be scaled easily by adding more nodes. After installing Cassandra, all you have to do is distribute the configuration files to the new nodes. Cassandra provides tools for this.
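
How data ends up spread across equal nodes can be illustrated with a small consistent-hashing sketch. This is a simplified model and not Cassandra's actual implementation: the `Ring` class, the node names, the number of ring positions per node, and the use of MD5 as the hash function are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy token ring: every node is a peer, there is no master."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node claims several positions so that load spreads evenly.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def owner(self, key: str) -> str:
        # The first ring position at or after the key's token owns the key;
        # past the last position, wrap around to the first.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


keys = [f"user:{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print(all(after[k] == "node-d" for k in moved))  # True: keys only move to the new node
```

In this model, adding a node only takes over part of the existing key ranges, and the remaining keys stay where they were. That is what makes scaling out by adding nodes comparatively cheap.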

Apache Cassandra features a configurable replication system to ensure resilience and recovery of data in the event of a failure. The impact of a failure is minimized because the data is automatically replicated between the nodes. Failed nodes can be easily replaced, and the system remains available for requests in the meantime.
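
The effect of replication on availability can be sketched in a few lines. The example below is a simplified model under stated assumptions: a replication factor of 3, five hypothetical nodes with one ring position each, replicas placed on consecutive nodes of the ring, and a read that succeeds as soon as one replica is reachable.

```python
import hashlib

REPLICATION_FACTOR = 3  # assumed value for this sketch
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


# One ring position per node, sorted around the ring.
RING = sorted((token(n), n) for n in NODES)


def replicas(key: str, rf: int = REPLICATION_FACTOR) -> list[str]:
    """Return the rf consecutive nodes that hold a copy of the key."""
    key_token = token(key)
    # First node whose position is >= the key's token, wrapping to index 0.
    start = next((i for i, (t, _) in enumerate(RING) if t >= key_token), 0)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]


def readable(key: str, down: set[str]) -> bool:
    """A key stays readable while at least one of its replicas is alive."""
    return any(node not in down for node in replicas(key))


keys = [f"order:{i}" for i in range(500)]
down = {"node-b", "node-d"}  # two simultaneous node failures

print(replicas("order:42"))
print(all(readable(k, down) for k in keys))  # True: three copies survive two lost nodes
```

With three copies of every key on three different nodes, any two nodes can fail and each key still has at least one live replica. How many replicas must answer before a request counts as successful is a separate setting that this sketch leaves out.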

Cassandra prioritizes high availability and partition tolerance. According to the CAP theorem in computer science, a distributed system cannot guarantee consistency, availability, and partition tolerance all at the same time. Consistency, meaning that all nodes see the same data at all times, has the lowest priority in many big data systems. After a failure, consistency can be quickly restored through data recovery, whereas the other two properties must be ensured at all times.
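
One way to picture how replicas converge again after a failure is a last-write-wins repair, in which every value carries a write timestamp and the newest version is copied to all replicas. The following is a minimal sketch of that idea, not Cassandra's repair code. The `Cell` type, the node names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write time; larger means newer


def repair(replicas: dict[str, Optional[Cell]]) -> Cell:
    """Bring all replicas to the newest known version (last write wins)."""
    newest = max(
        (cell for cell in replicas.values() if cell is not None),
        key=lambda cell: cell.timestamp,
    )
    for node in replicas:
        replicas[node] = newest
    return newest


# node-c was down during the second write and still holds the old value;
# node-d was just replaced and holds nothing yet.
replicas = {
    "node-a": Cell("alice@new.example", timestamp=200),
    "node-b": Cell("alice@new.example", timestamp=200),
    "node-c": Cell("alice@old.example", timestamp=100),
    "node-d": None,
}

winner = repair(replicas)
print(winner.value)                      # alice@new.example
print(len(set(replicas.values())) == 1)  # True: all replicas agree again
```

Until the repair runs, a read that reaches only node-c would return the outdated value. This is the temporary inconsistency the system accepts in exchange for staying available.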

Cassandra databases support the MapReduce programming model developed by Google for calculations involving large amounts of data in distributed systems. Cassandra’s own query language, CQL (Cassandra Query Language), is designed especially for Cassandra’s data structures.
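
The MapReduce model itself can be shown on a single machine. The sketch below does not connect to Cassandra or to any MapReduce framework. It only walks through the map, shuffle, and reduce phases on hypothetical temperature readings that are assumed to be stored on three nodes.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical rows, grouped by the node that stores them.
partitions = {
    "node-a": [("berlin", 12.5), ("paris", 14.0)],
    "node-b": [("berlin", 13.5), ("rome", 19.0)],
    "node-c": [("paris", 16.0), ("rome", 21.0), ("berlin", 11.5)],
}


def map_phase(rows):
    """Emit (city, (sum, count)) pairs for the rows of one node."""
    for city, temperature in rows:
        yield city, (temperature, 1)


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine partial (sum, count) pairs into an average."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
    return total / count


mapped = (pair for rows in partitions.values() for pair in map_phase(rows))
averages = {city: reduce_phase(vals) for city, vals in shuffle(mapped).items()}
print(averages)  # {'berlin': 12.5, 'paris': 15.0, 'rome': 20.0}
```

The map step works on each node's rows independently, so it can run where the data is stored. Only the much smaller intermediate pairs have to be brought together for the reduce step.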

What are the benefits of Apache Cassandra?

One of the main advantages of Cassandra is that it combines easy scalability with very high resilience, two fundamental requirements for big data applications. Cassandra is horizontally scalable, which means you can increase the capacity and performance of the system by adding more nodes. This is the opposite of vertical scaling, where you add more powerful CPUs and larger hard drives to a single database server when you need to increase performance or capacity. Horizontal scaling is the cheaper solution in most cases since you can use commodity server hardware.

Cassandra’s data model is based on multidimensional hash tables where each row can have any number of columns. Unlike columns in a traditional database table, these columns do not have to be the same in every row. Apache Cassandra has also performed well against other NoSQL databases in benchmark analyses and real-life application scenarios.
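
The idea of a multidimensional hash table can be approximated with nested dictionaries: a table maps row keys to rows, and each row maps column names to values. This is only a conceptual sketch. The table name, row keys, and columns are hypothetical, and the on-disk storage format is not modeled.

```python
from collections import defaultdict

# table -> row key -> column name -> value
store = defaultdict(lambda: defaultdict(dict))

users = store["users"]
users["alice"].update({"email": "alice@example.com", "city": "Berlin"})
users["bob"].update(
    {"email": "bob@example.com", "last_login": "2024-05-01", "plan": "pro"}
)
users["carol"]["email"] = "carol@example.com"

# Every row carries only the columns that were actually written to it.
for row_key, columns in users.items():
    print(row_key, sorted(columns))

# Looking up a column that a row does not have simply yields nothing.
print(users["alice"].get("plan"))  # None
```

Here the rows for alice, bob, and carol hold two, three, and one column respectively, and no empty placeholders are stored for the missing ones.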

Where is Apache Cassandra used?

One of the main goals in developing Cassandra was to help Facebook users search their inboxes more easily. The company used a cluster of over 150 individual nodes to power this feature. It’s no coincidence that Cassandra, which resembles Amazon Dynamo and Google Bigtable in its basic structures, is now very popular with providers of large social networks in which vast amounts of data are shared between users. Along with Twitter, Instagram, and Spotify, other big-name users include the social bookmarking website Digg and the social news aggregator Reddit.

Note

Facebook has now switched from Cassandra to an in-house solution that combines the HBase database with the HDFS file system, both components of the Apache Hadoop framework.

Many other organizations that handle large amounts of data use Cassandra, either as their main database or as a secondary component for specific tasks. Examples include eBay, GitHub, Netflix, The Weather Channel, and the Large Hadron Collider at CERN, the European Organization for Nuclear Research, which produces around 30,000 terabytes of data per year. Apple has one of the largest Cassandra installations, with 75,000 nodes.

Getting started with Apache Cassandra

Apache Cassandra runs on UNIX-like systems, preferably Linux servers. A Java Runtime Environment is also required because Cassandra is written in Java. Installation packages are provided on Apache servers as Debian or RPM packages. To install Cassandra, you add the corresponding repository to your package manager. After installation, you create the usual data, cache, and log directories and configure the cassandra.yaml file.
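
The directory step can be sketched as follows. The script creates a directory layout under a temporary folder and prints the matching entries for cassandra.yaml. The temporary root and the cluster name are assumptions for illustration; on a real server you would use the paths chosen for your installation.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: a temporary directory stands in for the real storage root.
root = Path(tempfile.mkdtemp(prefix="cassandra-demo-"))

# cassandra.yaml setting -> directory it should point to
directories = {
    "data_file_directories": root / "data",
    "commitlog_directory": root / "commitlog",
    "saved_caches_directory": root / "saved_caches",
}

for path in directories.values():
    path.mkdir(parents=True, exist_ok=True)

# Print the corresponding lines for cassandra.yaml.
print("cluster_name: 'Demo Cluster'")  # assumed name
print("data_file_directories:")
print(f"    - {directories['data_file_directories']}")
print(f"commitlog_directory: {directories['commitlog_directory']}")
print(f"saved_caches_directory: {directories['saved_caches_directory']}")
```

The user account that runs Cassandra needs write access to these directories. The script only creates them and does not change ownership or permissions.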

Cassandra has its own command-line tools for administrative tasks. The most important utility is the Cassandra Query Language shell (cqlsh).

You can use the following command to view a list of the available command-line options:

cqlsh --help

Tip

DataStax offers OpsCenter, a web-based tool for visual management and monitoring of Cassandra systems.
