Traditionally the problem of having consistently fast drill-downs was solved using special multidimensional data structures like OLAP cubes and MDX queries. However, with the advent of NoSQL technologies like Cassandra, we now have faster and cheaper ways of supporting these drill-downs.
Cassandra is an open source, distributed database management system. Its main features are decentralization, replication support (including multi-data-center replication), elastic scalability, fault tolerance, tunable consistency, MapReduce support, and an SQL-like query language (CQL).
To get into the problem of drill-downs, let’s look at some sample data from a shopping cart:
timestamp, transaction_id, item_code, user_id, payment_method
20130304221518, abcd, 3, 5, creditcard
20130304221519, efgh, 4, 6, cashondelivery
20130305180402, ijkl, 4, 5, loyaltypoints
This data isn’t directly useful for analytics. Most analytics is about finding patterns and making marketing or other business decisions based on them. The useful information that can be derived from this raw data is as follows:
{month: march} {number of transactions: 3, unique_users: 2, unique_items_sold: 2}
{month: march, item: 4}{number of transactions: 2, unique_users: 2}
{month: march, day: 4} {number of transactions: 2, unique_users: 2, unique_items_sold: 2}
{month: march, item: 4, day: 4} {number of transactions: 1, unique_users: 1, items_sold: 1}
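To make the pre-computation concrete, here is a minimal sketch of how these aggregates could be derived from the raw rows above. The field and key names are illustrative, not part of any particular library:

```python
from collections import defaultdict

# Raw shopping-cart rows from the sample above:
# (timestamp, transaction_id, item_code, user_id, payment_method)
rows = [
    ("20130304221518", "abcd", 3, 5, "creditcard"),
    ("20130304221519", "efgh", 4, 6, "cashondelivery"),
    ("20130305180402", "ijkl", 4, 5, "loyaltypoints"),
]

# Accumulate transaction counts and distinct users/items
# per dimension combination.
stats = defaultdict(lambda: {"transactions": 0, "users": set(), "items": set()})

for ts, txn, item, user, payment in rows:
    month, day = ts[4:6], ts[6:8]
    # A few of the many possible dimension combinations:
    for key in [("month", month),
                ("month", month, "item", item),
                ("month", month, "day", day)]:
        s = stats[key]
        s["transactions"] += 1
        s["users"].add(user)
        s["items"].add(item)

march = stats[("month", "03")]
print(march["transactions"], len(march["users"]), len(march["items"]))
# prints: 3 2 2
```

Each entry of `stats` corresponds to one of the derived records listed above; in production this loop would run as a batch (or streaming) job writing its results out to Cassandra.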
Clearly, what we are trying to deduce here are aggregate numbers based on various combinations of dimensions. In an RDBMS, this data can be retrieved with queries using COUNT, DISTINCT, and WHERE clauses. It is also evident that as the data grows large, the joins such queries require will result in considerably long-running queries.
Thus, a better way to serve these kinds of aggregations in a reasonable amount of time is to pre-compute the aggregates and store them. This is where NoSQL technologies provide a distinct advantage, thanks to their lack of predefined schemas. That allows us to store different dimension combinations in a single column family (Cassandra's equivalent of a table), which in turn lets us serve extremely fast drill-downs. To illustrate, let's look at how the above data would be stored in Cassandra:
Column Family: Monthly {
  Rowkey: March {
    number_of_transactions: 3,
    unique_users: 2,
    unique_items_sold: 2
  },
  Rowkey: March, Item:4 {
    number_of_transactions: 2,
    unique_users: 2
  },
  Rowkey: March, Day:4 {
    number_of_transactions: 2,
    unique_users: 2,
    unique_items_sold: 2
  },
  Rowkey: March, Item:4, Day:4 {
    number_of_transactions: 1,
    unique_users: 1,
    items_sold: 1
  }
}
With the above data model in place, fetching a drill-down on items for March is as simple as forming the row key and doing a single fetch: fast and join-less.
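The key-based fetch can be sketched as follows, using an in-memory dict as a stand-in for the Cassandra column family (the row-key encoding and helper function are assumptions for illustration, not Cassandra APIs):

```python
# In-memory stand-in for the "Monthly" column family. Each row key
# encodes one dimension combination, so a drill-down is one lookup.
monthly = {
    "March": {"number_of_transactions": 3, "unique_users": 2,
              "unique_items_sold": 2},
    "March:Item:4": {"number_of_transactions": 2, "unique_users": 2},
    "March:Day:4": {"number_of_transactions": 2, "unique_users": 2,
                    "unique_items_sold": 2},
    "March:Item:4:Day:4": {"number_of_transactions": 1, "unique_users": 1,
                           "items_sold": 1},
}

def drill_down(month, **dims):
    """Form the row key from the requested dimensions (in the order the
    caller passes them) and fetch the pre-computed aggregates in one step."""
    key = month + "".join(f":{k}:{v}" for k, v in dims.items())
    return monthly[key]

print(drill_down("March", Item=4))
# prints: {'number_of_transactions': 2, 'unique_users': 2}
```

Against a real cluster the lookup would be a single-row read keyed the same way, so the latency stays flat no matter how much raw transaction data sits behind it.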
The logical question to ask here is: when we start generating these combinations for every value of every dimension in our data, what happens to data size? This is a real concern, as the storage requirements for this kind of data are huge. Cassandra helps here by providing elastic scalability, with performance and storage capacity growing linearly as nodes are added.
However, this problem also warrants some data-specific optimizations. One of these is to limit your combinations to only those values that make business sense: generate data by judiciously selecting the subsets you actually need to serve your end users.
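The scale of the problem is easy to see by counting dimension subsets; the whitelist below is a hypothetical example of such a business-driven selection:

```python
from itertools import combinations

dimensions = ["month", "day", "item", "user", "payment"]

# Materializing every non-empty subset of dimensions explodes
# combinatorially: 2^n - 1 subsets, before even enumerating values.
all_subsets = [c for r in range(1, len(dimensions) + 1)
               for c in combinations(dimensions, r)]
print(len(all_subsets))  # prints: 31

# Restricting pre-computation to combinations that answer real
# business questions keeps the stored data manageable.
useful = [("month",), ("month", "item"), ("month", "day"),
          ("month", "item", "day"), ("item", "payment")]
print(len(useful))  # prints: 5
```

With real-world cardinalities (thousands of items, millions of users) the gap between the full power set and a curated whitelist is what makes or breaks the storage budget.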
Finally, this is only one of the problem cases where Cassandra is useful, and Cassandra is only one possible solution to it. Let us know if you have ever needed to solve this kind of problem, how you did it, or whether you are using Cassandra to solve any other interesting problems.