SQL Case
The SQL
CASE
function is one of my favorite. The command basically works like if-then command. If you are familiar with if-then commands, then feel free to skip this next bit.
If-Then
One of the reasons we have the amazing devices we do today is because a computer is capable of reasoning. A computer can compare two things and decide which one it likes.
Now, this may sound simple, but it's actually a subtle miracle. Anyone who has been stuck on the toothpaste isle trying to decide between the 45 kinds of toothpaste probably understands making decisions is difficult. Of course, human decision making and computer decision making are not even on the same level. Humans can make comparisons of all 45 products at once(sort of). Computers, they have to make a decision between two objects, then, two objects, then two objects, so forth, until it has made it through all 45. Fortunately, computers can make these decisions blazing fast.
In computer programming we call this computer decision making process
control flow
. But let's write some pseudocode for a little better understanding:
If (Computer Likes Toothpaste 1) then buy Toothpaste 1
Pretty simple, right? The only thing a computer can't do is decide if it likes Toothpaste 1 on its own. We have to program it to do that.
Well, this sentence makes sense if a computer is trying to decide to buy toothpaste or no toothpaste, but what if there are more than two toothpaste options? We just create another if-then statement.
If (Computer Likes Toothpaste 1 Best) then buy Toothpaste 1
If (Computer Likes Toothpaste 2 Best) then buy Toothpaste 2
Because the computer makes decisions in order it read them, then if it buys Toothpaste 1 then it will not buy Toothpaste 2. However, if he doesn't like Toothpaste 1 the best, then if he thinks Toothpaste 2 is the best he'll buy it. Otherwise, he will not buy any toothpaste--which makes sense, computers don't have teeth.
This is almost everything we need to know about
if-then
, two more little catches.
First, what do we do if the computer doesn't like any of the Toothpaste and don't want him to just give up? We need a way to say, "Look computer, if you don't like any toothpaste the best then go ask for help."
In programming this is known as
if-then-else
statements. They are similar to
if-then
but with a contingency clause if something goes wrong.
Let's take a look:
if (Computer Likes Toothpaste 1 Best) then buy Toothpaste 1
if (Computer Likes Toothpaste 2 Best) then buy Toothpaste 2
else Go Ask a Computer Dentist what to buy
Ok, that's it. Now let's apply it to SQL.
SQL CASE WHEN
SQL applies if-then logic in several ways. We've already looked at the
WHERE
statement, which basicaly works like an
if-then
.
SELECT * FROM data WHERE Name = 'Bob'
See how this could be written as
SELECT * FROM data IF Name = 'Bob'
But the most likely SQL statement used for
if-then-else
logic is the
CASE WHEN
statement.
Here's an example to be run in R.
library(sqldf)
################### Data DO NOT CHANGE ###########################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"),
FirstName=c("Timmy", "Fela", "Sarah"),
LastName=c("Tesa", "Falla", "Kerrigan"),
DOB=c("2010-01-01", "1999-1-1", "1992-04-01"))
##################################################################
peopleDf1 <- sqldf("SELECT *,
CASE WHEN DOB > '2000-1-1' THEN 'Yes' ELSE 'No' END As 'Millennial'
FROM peopleDf")
Here is the output:
PersonalID
|
FirstName
|
LastName
|
DOB
|
Gender
|
Millennial
|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7
|
Timmy
|
Tesa
|
2010-01-01
|
Male
|
Yes
|
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV
|
Fela
|
Falla
|
1999-1-1
|
Female
|
No
|
LASDU89NRABVJWW779W4JGGAN90IQ5B2
|
Sarah
|
Kerrigan
|
1992-04-01
|
Female
|
No
|
The SQL query, specifically the
CASE WHEN
statement created a column called
Millennial
, it then went through every person's date of birth, comparing it. When the query found a person who was born after 2000-01-01 it inserted a 'Yes' in the Millennial column. If they were not born after 2000-01-01 then it set the
Millennial
column to 'No.' Nifty, right?
Notice, the
ELSE
is required to get the 'No'. Otherwise, the query would leave everyone else blank.
Here's a few more examples of using CASE WHEN for powerful results.
Using OR with CASE WHEN
peopleDf2 <- sqldf("SELECT *,
CASE WHEN DOB > '2000-1-1' OR FirstName = 'Sarah' THEN 'PersonIsCool' ELSE 'NotHip' END As 'Cool?'
FROM peopleDf")
PersonalID
|
FirstName
|
LastName
|
DOB
|
Gender
|
Cool
|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7
|
Timmy
|
Tesa
|
2010-01-01
|
Male
|
PersonIsCool
|
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV
|
Fela
|
Falla
|
1999-1-1
|
Female
|
NotHip
|
LASDU89NRABVJWW779W4JGGAN90IQ5B2
|
Sarah
|
Kerrigan
|
1992-04-01
|
Female
|
PersonIsCool
|
Using AND with CASE WHEN
peopleDf3 <- sqldf("SELECT *,
CASE WHEN FirstName = 'Sarah' AND LastName = 'Kerrigan' THEN 'Yes' ELSE ''
END As 'Queen of Blades'
FROM peopleDf")
PersonalID
|
FirstName
|
LastName
|
DOB
|
Gender
|
Queen of Blades
|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7
|
Timmy
|
Tesa
|
2010-01-01
|
Male
|
|
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV
|
Fela
|
Falla
|
1999-1-1
|
Female
|
|
LASDU89NRABVJWW779W4JGGAN90IQ5B2
|
Sarah
|
Kerrigan
|
1992-04-01
|
Female
|
Yes
|
Using SUM with CASE WHEN
Using
CASE WHEN
in combination with
SUM
is a great way to get counts of different discrete data. Below is an example of getting total counts of males and females within the peopleDf
count1 <- sqldf("SELECT
SUM(
CASE WHEN Gender = 'Female' THEN 1 ELSE 0 END
) As 'NumberOfFemales',
SUM(
CASE WHEN Gender = 'Male' THEN 1 ELSE 0 END
) As 'NumberOfMales'
FROM peopleDf")
NumberOfFemales
|
NumberOfMales
|
2
|
1
|
Using Multiple CASES
So far, we've only covered one
if-then
statement, but in our example with the toothpaste we could string them together. The same can be done with
CASE WHEN
.
peopleDf4 <- sqldf("SELECT *, CASE WHEN DOB >= '1980-01-01' AND DOB < '1990-01-01' THEN 'X'
WHEN DOB >= '1990-01-01' AND DOB < '2000-01-01' THEN 'Y'
WHEN DOB >= '2000-01-01' AND DOB < '2010-01-01' THEN 'Millennial'
WHEN DOB >= '2010-01-01' AND DOB < '2020-01-01' THEN 'NotYetDefined'
END As 'Generation'
FROM peopleDf")
PersonalID
|
FirstName
|
LastName
|
DOB
|
Gender
|
Generation
|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7
|
Timmy
|
Tesa
|
2010-01-01
|
Male
|
NotYetDefined
|
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV
|
Fela
|
Falla
|
1999-1-1
|
Female
|
Y
|
LASDU89NRABVJWW779W4JGGAN90IQ5B2
|
Sarah
|
Kerrigan
|
1992-04-01
|
Female
|
Y
|
Paste
The
paste()
in R is meant for manipulating strings of text. You pass it strings as parameters and it returns one string containing all the strings passed into it. Let's take a look.
greeting <- paste("Hello how are you,", "Bob?")
After running this line the
greeting
variable contains the following string
Hello how are you, Bob?
. This can be used by printing the contents of the variable using the
print()
Side note,
print()
will actually print out anything you pass it to the console. This can be useful when trying to debug code.
Back to our combined strings, notice whenever the greeting prints out there is a space inserted between 'you,' and 'Bob?', this is done automatically by paste. It will insert a space between every string you pass it, unless you pass the additional parameter
sep
. This parameter will take whatever you set it as and insert it between the two strings.
greeting <- paste("Hello how are you,", "Bob?", sep = "!!")
print(greeting)
This time
print()
will display "Hello how are you,!!Bob?" in the console. But, inserting exclamation marks is probably not what we want. Most of the time we will not want paste to insert anything and we can tell it to insert nothing.
greeting <- paste("Hello how are you,", "Bob?", sep = "")
print(greeting)
Print will spit out "Hello how are you,Bob?". Notice, there is no longer any character between "you," and "Bob?".
Paste is a pretty straightforward function, the one last trick is knowing you can pass in multiple strings.
greeting <- paste("Hello", " how are you,", " Bob?", sep = "")
print(greeting)
This will produce the string "Hello how are you, Bob?". Notice the spaces were inserted manually so the end string is readable to humans.
Dynamic SQL with Paste()
Prepare to have your mind blown. One of the powers of the
paste()
is building a
sqldf
string. Remember using SQLdf like this?
library(sqldf)
################### Data DO NOT CHANGE ###########################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"),
FirstName=c("Timmy", "Fela", "Sarah"),
LastName=c("Tesa", "Falla", "Kerrigan"),
DOB=c("2010-01-01", "1999-1-1", "1992-04-01"))
##################################################################
peopleDf1 <- sqldf("SELECT * FROM peopleDf WHERE DOB > '2001-01-01'")
This creates the table
PersonalID
|
FirstName
|
LastName
|
DOB
|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7
|
Timmy
|
Tesa
|
2010-01-01
|
This is a dataframe of everyone who was born after January 1st, 2001. This method of filtering data works for a static date. But let's say you wanted to easily change out the
2001-01-01
with other dates. You
could
replace the date with a different date, but when that date is in multiple SQL calls it can be easy to miss one. A better way to do it is using the
paste()
. And remember, everything inside the
sqldf()
parentheses is a string.
targetDate <- "2001-01-01"
sqlString <- paste("SELECT * FROM peopleDf WHERE DOB > '", targetDate, "'", sep = "")
peopleDf5 <- sqldf(sqlString)
Ok, let's take this slow, there's a lot going on. First, we create a variable called
targetDate
and assign it the string
2001-01-01
. Next, we create a complex string using the
paste()
which looks a lot like a SQLdf string, but instead of hardcoding the date, we insert the
targetDate
variable. This creates the following string:
"SELECT * FROM peopleDf WHERE DOB > '2001-01-01'"
Which is then inserted into the variable
sqlString
, which is a string.
Lastly, we pass the
sqlString
variable into the
sqldf()
which executes the fancy SQL query. Awesome, right?
Now, if we want to look at those born after a different date, we simply change the
targetDate
variable and re-run the script.
targetDate <- "1980-01-01"
sqlString <- paste("SELECT * FROM peopleDf WHERE DOB > '", targetDate, "'", sep = "")
peopleDf5 <- sqldf(sqlString)
Sys.Date()
GSUB