What is a Data Warehouse

Reading time ~13 minutes

This article is part of a series.

Insights over Data

Data. They are the plastic of the tech world. We’re making way too much of them, you can’t seem to get rid of them, and rarely are they used for their designed purpose. Unfortunately, like water in plastic bottles, invaluable insights are found in these data. And all good systems are built on insight.

I’ve worked as a data engineer for the last few years and have come to realize this is a fairly universal problem. Converting data into insight is hard. There’s too little time. The cloud bill is too high. The data are never clean. And there seems to be an assumption from the C-suite that data innately have value. They don’t. Data are a raw resource, which can be converted into insights with skilled people and the proper tools. And a data warehouse is one of the best tools for making it easier to generate insight from data.

Before moving on, let’s define data and insights.

There are many fancy explanations of what insights are. In my opinion, they are gains in knowledge about how something works. In the visualization above, we can see that females score higher than males, and males score lower than non-binary individuals. This is insight.

Looking at the “Data” pane, little insight can be gained. There is information, such as an email address we could use to contact someone, or their name, by which to address them. However, there isn’t anything which explains how something works. These are data.

Unfortunately, in information technology, converting data into insight is tough, as I hope to demonstrate.

Heart Rate Variability

Throughout this article I’m going to use examples regarding heart rate variability. As the name suggests, it is a measure of the variability between heartbeats. To be clear, not the time between heartbeats, but how much that time varies.

I find heart rate variability, or HRV, fascinating because it is linked to human stress levels. Research seems to indicate that if your HRV is high, you are relaxed, but if your HRV is low, you are stressed. Evolutionarily, this makes sense. During times of stress your heart is one of the most important organs in your body. It focuses and ensures all muscles and organs are ready for fighting hard or running harder. The famous fight-or-flight. However, during times of relaxation, your heart chills out too. Its beats are less precisely timed. Hey, your heart works hard!

It isn’t required to follow this article, but there is plenty of reading available if you want to learn more about the psychological and social aspects of HRV.

I got interested in HRV when I found the Apple Watch now tracks it, and I’ve been curious whether it is actually a valid indicator of my stress levels.

What’s the Problem?

So what does this have to do with data warehouses? Let’s look at the HRV data the Apple Watch produces.

occurrence_id | user_id | start               | end                 | hrv
1             | 1       | 2021-09-01 14:10:01 | 2021-09-01 17:00:00 | 40
2             | 1       | 2021-09-01 19:00:00 | 2021-09-01 23:00:00 | 55
3             | 1       | 2021-09-02 05:00:00 | 2021-09-03 13:00:01 | 120
4             | 1       | 2021-09-04 14:00:00 |                     | 65
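If you want to follow along at home, here is a minimal sketch of this table as SQL DDL (SQLite-flavored; the types are my assumption, not anything Apple publishes):

  CREATE TABLE hrvs (
      occurrence_id INTEGER PRIMARY KEY,
      user_id       INTEGER,
      start         TEXT,  -- when the reading began
      "end"         TEXT,  -- NULL while the reading is still going
      hrv           REAL   -- the heart rate variability score
  );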

These data have a start and end time, and if the end time is blank it means the reading is still going. They also have a user_id and the hrv score. Let’s try to ask the data a question.

  • What is my average HRV per day?

A naive method for querying these data might look like:

  -- naive: groups by the date the reading *started*
  SELECT AVG(hrv)    AS avg_hrv,
         DATE(start) AS day
    FROM hrvs
GROUP BY day

This produces the following:

avg_hrv | day
47.5    | 2021-09-01
120     | 2021-09-02
65      | 2021-09-04

This looks pretty good. We can get a bit of insight; it looks like my average HRV might be around 50 on a typical day, and it looks like 09-02 was an extremely relaxed day. But what happened to 2021-09-03?

This is where things start getting complex. There is a long history in databases of compressing sustained values into a start and a stop value. Databases are highly optimized for updating records quickly and minimizing the amount of disk space needed to store the data. These types of databases are known as online transactional processing (OLTP) databases. And they are the antithesis of a data warehouse. Data warehouses are known as online analytical processing (OLAP) databases. Unlike OLTP, they are optimized for the speed at which they provide insights. I’ll go into more depth about OLTP vs. OLAP a bit later.

Let’s get back to solving the problem.

The Infamous Calendar Table

Let’s look at what the start-stop data actually represent.

avg_hrv | day
47.5    | 2021-09-01
120     | 2021-09-02
120     | 2021-09-03
65      | 2021-09-04

Simple enough, right? But how do we transform the start and stop data into a list of all dates with their sustained values? The most straightforward method is the calendar table.

In SQL, every query must begin with a base table. This is the table associated with the FROM keyword, and it is the base from which all other data are retrieved. In our problem, what we really need is a base table which fills in all the possible missing dates in the start-stop data.

In this situation, it is known as a “calendar table” and is simply a list of possible dates you wish to include in your query.

For example:

date
2021-09-01
2021-09-02
2021-09-03
2021-09-04
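A calendar table doesn’t even have to be stored on disk; many engines can generate one on the fly. Here is a sketch using a recursive CTE (SQLite syntax assumed; PostgreSQL users would likely reach for generate_series() instead):

  WITH RECURSIVE calendar(date) AS (
      SELECT DATE('2021-09-01')
       UNION ALL
      SELECT DATE(date, '+1 day')
        FROM calendar
       WHERE date < DATE('2021-09-04')
  )
  SELECT date FROM calendar;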

Nice! But how do we apply this to our question?

  • What is my average HRV per day?

We can apply it with the following query:

   SELECT AVG(h.hrv) AS avg_hrv,
          c.date
     FROM calendar AS c
LEFT JOIN hrvs AS h
       -- "end" is quoted because END is a reserved keyword in SQL
       ON c.date BETWEEN DATE(h.start)
                     AND COALESCE(DATE(h."end"), DATE('2021-09-05'))
 GROUP BY c.date

This joins all of the HRV start-stop values to the calendar table wherever the calendar.date falls between the hrvs.start and hrvs.end dates. This is the result:

avg_hrv | date
47.5    | 2021-09-01
120     | 2021-09-02
120     | 2021-09-03
65      | 2021-09-04

Perfect, this has exactly the effect we want. It expands the start-stop dates across all days and calculates the average heart rate variability per day. You may ask, “But this only generated one extra date, right? Do we really need a data warehouse for it; can’t a classical database solve it?” Yes, yes it can. But! Is it really only one extra day?

Let’s change our question slightly.

  • What is my average HRV per hour?

Some of you may see where I’m going here, but let’s just draw it out anyway. We can answer the above question with another calendar table, but this one with a finer grain (level of granularity). That is, our calendar table will now have an entry for every hour between the minimum start and the maximum end.

date
2021-09-01 00:00:00
2021-09-01 01:00:00
2021-09-01 02:00:00
2021-09-01 03:00:00
….
2021-09-04 04:00:00
2021-09-04 05:00:00
2021-09-04 06:00:00
2021-09-04 07:00:00
2021-09-04 08:00:00
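Generating this finer grain is the same trick as before, just stepping by hours instead of days (again a sketch assuming SQLite):

  WITH RECURSIVE calendar(date) AS (
      SELECT DATETIME('2021-09-01 00:00:00')
       UNION ALL
      SELECT DATETIME(date, '+1 hour')
        FROM calendar
       WHERE date < DATETIME('2021-09-04 08:00:00')
  )
  SELECT date FROM calendar;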

Then, we can run the following query:

   SELECT AVG(h.hrv) AS avg_hrv,
          c.date
     FROM calendar AS c
LEFT JOIN hrvs AS h
       ON c.date BETWEEN h.start
                     AND COALESCE(h."end", '2021-09-05 23:59:59')
 GROUP BY c.date

Our result would look like the daily table above, only now with one row for every hour in the calendar table.

And herein lies the problem. When we begin to expand the data, they have a tendency to grow in number. These sorts of relationships have all sorts of terms associated with them. In SQL, we say the tables have a many-to-many relationship. That is, many dates in the calendar table refer to many start-stop entries in the hrvs table. And when a join greatly increases the size of the result set, we refer to this as data “fan out.”
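One quick way to see the fan-out for yourself is to count the joined rows before any grouping (a sketch against the same tables as above):

   SELECT COUNT(*) AS joined_rows
     FROM calendar AS c
LEFT JOIN hrvs AS h
       ON c.date BETWEEN h.start
                     AND COALESCE(h."end", '2021-09-05 23:59:59')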

Just One More

At this point you may be wondering if this calendar table is really the best way to answer the questions we have of these data. Well, I’m going to risk boring you by giving you just a few more examples.

Let’s say you want to ask the data:

  • What is the average HRV of everyone, per hour? 🤯

This would take data like this:

occurrence_id | user_id | start               | end                 | hrv
1             | 1       | 2021-09-01 14:10:01 | 2021-09-01 17:00:00 | 40
2             | 1       | 2021-09-01 19:00:00 | 2021-09-01 23:00:00 | 55
3             | 1       | 2021-09-02 05:00:00 | 2021-09-02 13:00:01 | 120
4             | 1       | 2021-09-02 14:00:00 |                     | 65
5             | 2       | 2021-09-01 08:00:00 | 2021-09-01 17:00:00 | 80
6             | 2       | 2021-09-01 18:00:00 | 2021-09-01 22:00:00 | 35
7             | 2       | 2021-09-02 05:30:00 | 2021-09-02 17:00:00 | 25
8             | 2       | 2021-09-02 17:00:00 |                     | 105

And join it to the calendar table. The math, roughly, works out to something like this:

[hours between min and max dates] x [users] = [rows of data]

And to make it even more apparent, what if we ask something like:

  • What is the average HRV, per hour, of everyone who is over 90.7 kilograms (200 lbs)?

Then the math becomes something like this:

[hours between min and max dates] x [users] x [facts] = [rows of data]

Even with two years of data for 100 users the data start getting large.

  • What is the average HRV, per hour, of everyone who is over 90.7 kilograms (200 lbs) and under 50 years of age?
(2 x 365 x 24) x (100 x 2) = 3,504,000

And with every increase in granularity we get another order-of-magnitude (or more) increase in the amount of data.

  • What is the average HRV, per minute, of everyone who is over 90.7 kilograms (200 lbs) and under 50 years of age?
(2 x 365 x 24 x 60) x (100 x 2) = 210,240,000

The sheer number of rows which must be processed is the reason data warehouses exist.

Online Transactional Processing (OLTP)

Let’s examine Online Transactional Processing databases to understand why they have difficulty reading large numbers of rows. First, what exactly are we talking about when we say “an OLTP database”? In the wild, most database technologies are OLTP, as they fit the most common need of a database.

For example,

  • SQLite
  • PostgreSQL
  • MySQL
  • MariaDB
  • Microsoft SQL Server (MSSQL)

There are many, many more.

A clever individual may argue, “Wait, these database technologies are capable of being set up as an OLAP database.” True, many of the above have plugins or settings allowing them to act as an analytics database; however, their primary purpose is transactional processing. This leads us to the question, “What is ‘transactional processing’ anyway?”

Don’t let the name trick you. Transactional processing databases are exactly what they sound like: technologies optimized for large numbers of small transactions. These transactions might retrieve a username, then update a service entry; or retrieve a user ID, then add an address. Most involve at least one read and one write operation, thus “transaction.” These transactions are usually extremely short and involve a small amount of data. Let’s go through the characteristics in a bit more detail.
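For instance, one of the short transactions described above might look like this in SQL (a sketch; the table and column names are invented):

  BEGIN TRANSACTION;
  -- one small read...
  SELECT user_name FROM users WHERE id = 754;
  -- ...followed by one small write
  UPDATE service_entries SET last_used = CURRENT_TIMESTAMP WHERE user_id = 754;
  COMMIT;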

Supports Large Amounts of Concurrency

OLTP DBs are designed for large numbers of users making short queries against the database. This makes them perfect for the backend of many software applications. Let’s take a small online bookstore called Amazingzone. This bookstore often has approximately 300 customers interacting with its webpage at a time. Each time a customer clicks on a product, the web application sends a query to the products table and retrieves the product information. Then it inserts a record into the visitors table indicating the user viewed a product. Given each customer is casually browsing, this may lead to a few thousand queries a minute. Each interaction would look something like this:

SELECT product_name, product_image_path
  FROM products
 WHERE id = 534;

INSERT INTO visitors (user_id, view_date)
VALUES (754, CURRENT_DATE);

Even though there are thousands of queries a minute, each one is retrieving or writing a single record in a single table. This is what is meant when you see a database described as “optimized for concurrency.” It means the database is designed to allow thousands of small queries to be executed simultaneously.

Data are Normalized

Another characteristic of an OLTP database is that the data are often normalized into at least Second Normal Form (2NF). An oversimplified explanation of 2NF would be: “more tables, less data.”

One of the goals of normalization is to reduce the amount of disk space a database needs. This comes from an age when databases were limited to kilobytes of data rather than exabytes. With that in mind, for the following example let’s assume every cell besides the field headers requires one byte, regardless of the datatype or information contained.

id | avg_hrv | date       | user_id | user_name | user_age
0  | 47.5    | 2021-09-01 | 1       | Thomas    | 41
1  | 120     | 2021-09-02 | 1       | Thomas    | 41
2  | 120     | 2021-09-03 | 1       | Thomas    | 41
3  | 65      | 2021-09-04 | 1       | Thomas    | 41

Given the table above, it would take 24 bytes to store all the information on a database. If our goal is reducing disk space, 2nd Normal Form (2NF) can help. Let’s see what these same data look like after “normalization.”

We will split the data into two tables, hrv_recordings and users.

Here’s the hrv_recordings table:

id | avg_hrv | date       | user_id
0  | 47.5    | 2021-09-01 | 1
1  | 120     | 2021-09-02 | 1
2  | 120     | 2021-09-03 | 1
3  | 65      | 2021-09-04 | 1

And the users table:

id | user_name | user_age
1  | Thomas    | 41
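In DDL terms, the normalized schema might be declared something like this (a sketch; the types and constraints are my assumptions):

  CREATE TABLE users (
      id        INTEGER PRIMARY KEY,  -- the Primary Key
      user_name TEXT,
      user_age  INTEGER
  );

  CREATE TABLE hrv_recordings (
      id      INTEGER PRIMARY KEY,
      avg_hrv REAL,
      date    TEXT,
      user_id INTEGER REFERENCES users (id)  -- the Foreign Key
  );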

Now we have two tables related by the id of the users table, the Primary Key, and the user_id in the hrv_recordings table, the Foreign Key. When we query these data we can put the tables back together using the JOIN keyword.

For example:

SELECT *
  FROM hrv_recordings AS hrvr
  JOIN users AS u
    ON hrvr.user_id = u.id

But why should we normalize these data in the first place? Well, if you go back and count the number of cells after normalization, you will find there are 19. For our oversimplified scenario, that means we have saved 5 bytes through normalization! Pretty nifty.

And if 5 bytes doesn’t seem impressive, convert it to a percentage: this represents roughly a 20% savings. Let’s say your database is 100 GB and our normalization savings scale linearly. That means you save 20 GB simply by changing how your data are stored! Nice job, you.

But the skeptics may wonder if there’s a catch. Well, yes, and the catch is one reason normalized databases aren’t great for analytics. Normalization optimizes for reduced disk space and quick inserts, not for how quickly the data can be read back out.

Going back to the SQL query:

SELECT *
  FROM hrv_recordings AS hrvr
  JOIN users AS u
    ON hrvr.user_id = u.id

What’s really happening in this join can be simplified into a comparison loop.

For example:

# A simplified nested-loop join: compare every HRV row's foreign key
# against every user row's primary key and keep the matching pairs.
results = []
for i in range(len(hrv_recordings)):
    for j in range(len(users)):
        if hrv_recordings[i]["user_id"] == users[j]["id"]:
            results.append((hrv_recordings[i], users[j]))

If the above Python snippet doesn’t make sense, ignore it. Just know that joins in SQL come with a computational cost every time a query is executed. And that’s the primary trade-off with normalization: you reduce the amount of disk space needed to store the data at the cost of how quickly you can retrieve them.

Made for Writing

About those quick inserts. As mentioned above, OLTP databases optimize for less disk space and for the speed of inserting new data. This makes sense: almost every interaction a user has with an application involves writing to the database, and often more writing than reading.

Let’s take Amazingzone again. A customer clicks the “Buy it Now” button. This writes to the following tables:

  • The orders table gets a record added corresponding to the purchased item.
  • The shipping table gets a record relating the order to the customer’s address.
  • The users_orders table gets a record relating the orders entry to the customer.
  • The marketing table gets an entry relating the orders entry to the customer’s prior viewing history.

And so forth. Needless to say, there are a lot of write operations but fewer read operations, as some of the data recorded will be returned to the user, such as the order amount, but a lot won’t, such as the marketing data.
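Wrapped in a single transaction, that one “Buy it Now” click might produce writes like these (a hypothetical sketch; every table and column name is invented for illustration):

  BEGIN TRANSACTION;
  INSERT INTO orders       (user_id, product_id, amount)  VALUES (754, 534, 19.99);
  -- 981 stands in for the order id generated by the insert above
  INSERT INTO shipping     (order_id, address_id)         VALUES (981, 112);
  INSERT INTO users_orders (user_id, order_id)            VALUES (754, 981);
  INSERT INTO marketing    (order_id, view_history_id)    VALUES (981, 42);
  COMMIT;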

Data are Often Realtime

OLAP

Extract, Transform, Load (ETL) Prepare Data

Data are Denormalized (Star Schema)

Supports Parallelization of a Single Query

Made for Reading

OLTP and OLAP Compared

Let’s Talk Rows and Columns

As the name suggests, MariaDB ColumnStore stores the data in column format. This is referred to as a columnar database management system (CDBMS). These differ from row-based database management systems (RDBMS) in how they store data. There is a lot of history as to why, but traditionally the world has used row-based systems to store data.
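To make the difference concrete, here is a rough sketch of how the same small table might be laid out on disk under each system (the extra rows are invented for illustration):

  -- The logical table:
  --   id | user_name | user_age
  --   1  | Thomas    | 41
  --   2  | Alice     | 35
  --   3  | Bob       | 29
  --
  -- Row-based (RDBMS) layout: whole records stored together.
  --   (1, Thomas, 41) (2, Alice, 35) (3, Bob, 29)
  --
  -- Column-based (CDBMS) layout: whole columns stored together.
  --   (1, 2, 3) (Thomas, Alice, Bob) (41, 35, 29)

Analytical queries usually scan one or two columns across a huge number of rows, which is exactly what the columnar layout makes cheap.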

You can think of CDBMS and RDBMS as tools for specific jobs. They both work with data, but they have pros and cons which must be assessed when applying them.

Here is my summary comparison of the two systems, pulling together the characteristics above.

OLTP (row-based)                      | OLAP (columnar)
Made for writing                      | Made for reading
Data are normalized                   | Data are denormalized (star schema)
Supports large amounts of concurrency | Supports parallelization of a single query
Data are often realtime               | Data are prepared via Extract, Transform, Load (ETL)
