This article is part of a series.
- Part 1 - This Article
- Part 2 - What is a Data Warehouse
These data have
Given I had a dataset with 3-4 billion rows to analyze=, the two qualities I was looking for in an analytics warehouse are:
- Columnar based
- Massively parallel processing
Having used MariaDB quiet a bit in the past I was excited to discover the ColumnStore engine.
There are several parts to the MariaDB ColumnStore engine.
- User Module – allows users to write SQL to interact with the ColumnStore engine
- Performance Module – takes a user’s query and distributes across many cores on many nodes.
- Storage Module – manages storage for the data, compressing and un-compressing quickly to ensure reduced storage costs
These three modules make up ColumnStore. Essentially, the Performance and Storage module fill the need Snowflake and Redshift try to meet, quick processing of large amounts of data. And the User Modules allows the user to continue to write SQL against the data as normal, with some exceptions.
Let’s Talk Rows and Columns
As the name suggests, MariaDB ColumnStore stores the data in column format. This is referred to a columnar database management system (CDBMS). They differ from the row based database management system (RDBMS) in how they store data. There is lots of history on why, but historically, the world has used RDBMS to store data.
You can think of CDBMS and RDBMS as tools for specific jobs. They both work with data, but they have pros and cons which must be assessed when applying them.
Here is my summary comparison of the two systems.