SQL CASE and R Paste

Reading time ~7 minutes

SQL Case

The SQL CASE function is one of my favorite. The command basically works like if-then command. If you are familiar with if-then commands, then feel free to skip this next bit.

If-Then

One of the reasons we have the amazing devices we do today is because a computer is capable of reasoning. A computer can compare two things and decide which one it likes.

Now, this may sound simple, but it’s actually a subtle miracle. Anyone who has been stuck on the toothpaste isle trying to decide between the 45 kinds of toothpaste probably understands making decisions is difficult. Of course, human decision making and computer decision making are not even on the same level. Humans can make comparisons of all 45 products at once(sort of). Computers, they have to make a decision between two objects, then, two objects, then two objects, so forth, until it has made it through all 45. Fortunately, computers can make these decisions blazing fast.

In computer programming we call this computer decision making process control flow. But let’s write some pseudocode for a little better understanding:

    If (Computer Likes Toothpaste 1) then buy Toothpaste 1

Pretty simple, right? The only thing a computer can’t do is decide if it likes Toothpaste 1 on its own. We have to program it to do that.

Well, this sentence makes sense if a computer is trying to decide to buy toothpaste or no toothpaste, but what if there are more than two toothpaste options? We just create another if-then statement.

    If (Computer Likes Toothpaste 1 Best) then buy Toothpaste 1
    If (Computer Likes Toothpaste 2 Best) then buy Toothpaste 2

Because the computer makes decisions in order it read them, then if it buys Toothpaste 1 then it will not buy Toothpaste 2. However, if he doesn’t like Toothpaste 1 the best, then if he thinks Toothpaste 2 is the best he’ll buy it. Otherwise, he will not buy any toothpaste–which makes sense, computers don’t have teeth.

This is almost everything we need to know about if-then, two more little catches.

First, what do we do if the computer doesn’t like any of the Toothpaste and don’t want him to just give up? We need a way to say, “Look computer, if you don’t like any toothpaste the best then go ask for help.”

In programming this is known as if-then-else statements. They are similar to if-then but with a contingency clause if something goes wrong.

Let’s take a look:

    if (Computer Likes Toothpaste 1 Best) then buy Toothpaste 1
    if (Computer Likes Toothpaste 2 Best) then buy Toothpaste 2
    else Go Ask a Computer Dentist what to buy

Ok, that’s it. Now let’s apply it to SQL.

SQL CASE WHEN

SQL applies if-then logic in several ways. We’ve already looked at the WHERE statement, which basicaly works like an if-then.

    SELECT * FROM data WHERE Name = 'Bob'

See how this could be written as

    SELECT * FROM data IF Name = 'Bob'

But the most likely SQL statement used for if-then-else logic is the CASE WHEN statement.

Here’s an example to be run in R.

library(sqldf)
################### Data DO NOT CHANGE ###########################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"), 
                       FirstName=c("Timmy", "Fela", "Sarah"),
                       LastName=c("Tesa", "Falla", "Kerrigan"),
                       DOB=c("2010-01-01", "1999-1-1", "1992-04-01"))
##################################################################

peopleDf1 <- sqldf("SELECT *, 
                  CASE WHEN DOB > '2000-1-1' THEN 'Yes' ELSE 'No' END As 'Millennial' 
                  FROM peopleDf")

Here is the output:

PersonalID FirstName LastName DOB Gender Millennial
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 Male Yes
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 Female No
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 Female No

The SQL query, specifically the CASE WHEN statement created a column called Millennial, it then went through every person’s date of birth, comparing it. When the query found a person who was born after 2000-01-01 it inserted a ‘Yes’ in the Millennial column. If they were not born after 2000-01-01 then it set the Millennial column to ‘No.’ Nifty, right?

Notice, the ELSE is required to get the ‘No’. Otherwise, the query would leave everyone else blank.

Here’s a few more examples of using CASE WHEN for powerful results.

Using OR with CASE WHEN

peopleDf2 <- sqldf("SELECT *, 
                  CASE WHEN DOB > '2000-1-1' OR FirstName = 'Sarah' THEN 'PersonIsCool' ELSE 'NotHip' END As 'Cool?' 
                  FROM peopleDf")
PersonalID FirstName LastName DOB Gender Cool
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 Male PersonIsCool
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 Female NotHip
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 Female PersonIsCool

Using AND with CASE WHEN

peopleDf3 <- sqldf("SELECT *, 
                  CASE WHEN FirstName = 'Sarah' AND LastName = 'Kerrigan' THEN 'Yes' ELSE '' 
                  END As 'Queen of Blades' 
                  FROM peopleDf")
PersonalID FirstName LastName DOB Gender Queen of Blades
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 Male  
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 Female  
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 Female Yes

Using SUM with CASE WHEN

Using CASE WHEN in combination with SUM is a great way to get counts of different discrete data. Below is an example of getting total counts of males and females within the peopleDf

count1 <- sqldf("SELECT 
                  SUM(
                      CASE WHEN Gender = 'Female' THEN 1 ELSE 0 END
                    ) As 'NumberOfFemales',
                  SUM(
                      CASE WHEN Gender = 'Male' THEN 1 ELSE 0 END
                    ) As 'NumberOfMales'
                   FROM peopleDf")
NumberOfFemales NumberOfMales
2 1

Using Multiple CASES

So far, we’ve only covered one if-then statement, but in our example with the toothpaste we could string them together. The same can be done with CASE WHEN.

peopleDf4 <- sqldf("SELECT *, CASE WHEN DOB >= '1980-01-01' AND DOB < '1990-01-01' THEN 'X'
                           WHEN DOB >= '1990-01-01' AND DOB < '2000-01-01' THEN 'Y'
                           WHEN DOB >= '2000-01-01' AND DOB < '2010-01-01' THEN 'Millennial'
                           WHEN DOB >= '2010-01-01' AND DOB < '2020-01-01' THEN 'NotYetDefined'
                           END As 'Generation'
                   FROM peopleDf")
PersonalID FirstName LastName DOB Gender Generation
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 Male NotYetDefined
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 Female Y
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 Female Y

Paste

The paste() in R is meant for manipulating strings of text. You pass it strings as parameters and it returns one string containing all the strings passed into it. Let’s take a look.

greeting <- paste("Hello how are you,", "Bob?")

After running this line the greeting variable contains the following string Hello how are you, Bob?. This can be used by printing the contents of the variable using the print()

print(greeting)

Side note, print() will actually print out anything you pass it to the console. This can be useful when trying to debug code.

Back to our combined strings, notice whenever the greeting prints out there is a space inserted between ‘you,’ and ‘Bob?’, this is done automatically by paste. It will insert a space between every string you pass it, unless you pass the additional parameter sep. This parameter will take whatever you set it as and insert it between the two strings.

greeting <- paste("Hello how are you,", "Bob?", sep = "!!")
print(greeting)

This time print() will display “Hello how are you,!!Bob?” in the console. But, inserting exclamation marks is probably not what we want. Most of the time we will not want paste to insert anything and we can tell it to insert nothing.

greeting <- paste("Hello how are you,", "Bob?", sep = "")
print(greeting)

Print will spit out “Hello how are you,Bob?”. Notice, there is no longer any character between “you,” and “Bob?”.

Paste is a pretty straightforward function, the one last trick is knowing you can pass in multiple strings.

greeting <- paste("Hello", " how are you,", " Bob?", sep = "")
print(greeting)

This will produce the string “Hello how are you, Bob?”. Notice the spaces were inserted manually so the end string is readable to humans.

Dynamic SQL with Paste()

Prepare to have your mind blown. One of the powers of the paste() is building a sqldf string. Remember using SQLdf like this?

library(sqldf)
################### Data DO NOT CHANGE ###########################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"), 
                       FirstName=c("Timmy", "Fela", "Sarah"),
                       LastName=c("Tesa", "Falla", "Kerrigan"),
                       DOB=c("2010-01-01", "1999-1-1", "1992-04-01"))
##################################################################

peopleDf1 <- sqldf("SELECT * FROM peopleDf WHERE DOB > '2001-01-01'")

This creates the table

PersonalID FirstName LastName DOB
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01

This is a dataframe of everyone who was born after January 1st, 2001. This method of filtering data works for a static date. But let’s say you wanted to easily change out the 2001-01-01 with other dates. You could replace the date with a different date, but when that date is in multiple SQL calls it can be easy to miss one. A better way to do it is using the paste(). And remember, everything inside the sqldf() parentheses is a string.

targetDate <- "2001-01-01"
sqlString <- paste("SELECT * FROM peopleDf WHERE DOB > '", targetDate, "'", sep = "")
peopleDf5 <- sqldf(sqlString)

Ok, let’s take this slow, there’s a lot going on. First, we create a variable called targetDate and assign it the string 2001-01-01. Next, we create a complex string using the paste() which looks a lot like a SQLdf string, but instead of hardcoding the date, we insert the targetDate variable. This creates the following string:

"SELECT * FROM peopleDf WHERE DOB > '2001-01-01'"

Which is then inserted into the variable sqlString, which is a string.

Lastly, we pass the sqlString variable into the sqldf() which executes the fancy SQL query. Awesome, right?

Now, if we want to look at those born after a different date, we simply change the targetDate variable and re-run the script.

targetDate <- "1980-01-01"
sqlString <- paste("SELECT * FROM peopleDf WHERE DOB > '", targetDate, "'", sep = "")
peopleDf5 <- sqldf(sqlString)

Sys.Date()

GSUB

HMIS, R, SQL -- Work Challenge Four

Creating Reusable CodeWriting report code which can be reused is critical to being an effective report specialist. By now, hopefully, yo...… Continue reading

C3 HMIS Graph Gallery

Published on August 07, 2017

C3 Graphing

Published on August 05, 2017