Comparing Values in R and SQL

Comparative Functions

Comparing two or more values is an extremely important concept when talking to computers. In writing a report script, it is essential. Comparisons allow us to filter to values within a range, so our report contains only the relevant information.

Take the following data:

######################### Data ###################################
###################### DO NOT CHANGE #############################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"), 
           FirstName=c("Timmy", "Fela", "Sarah"),
           LastName=c("Tesa", "Falla", "Kerrigan"),
           DOB=c("2010-01-01", "1999-1-1", "1992-04-01"))
##################################################################
##################################################################

If you run the above in R you should get a dataframe called peopleDf which looks like this:

PersonalID FirstName LastName DOB
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01

It’s a simple table. But let’s say we wanted to get a list of everyone born before 2000-01-01. Of course, we can easily see Timmy is the only person born after 2000. But if our table were thousands of records, it wouldn’t be possible to assess so quickly.

Luckily, this is pretty straightforward in SQL-R. We will use the less-than operator (<). You probably remember this sign from solving inequalities in high school. Here, though, we will be using it as what’s known as a relational operator.

In short, it states,

Is x less than y

If x is less than y the computer is going to say the statement is true (or 1 in binary). If it is not, then the computer will say it’s false (or 0 in binary). Believe it or not, this simple operation is why you have a device in your pocket which could calculate the internal mass of the sun.
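
If you want to see this in action, here’s a quick sketch you can run in the R console; relational operators simply evaluate to TRUE or FALSE:

5 < 10               # TRUE
10 < 5               # FALSE
as.integer(5 < 10)   # 1 -- TRUE is stored as 1 under the hood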

For us, things are a little simpler. We just want to know who was born before 2000. Let’s re-write the statement above with our problem:

Is Sarah’s DOB less than 2000-01-01

Well, what is Sarah’s DOB? 1992-04-01. Let’s rewrite and assess (gah, this feels like high-school algebra again).

Is 1992-04-01 less than 2000-01-01

Hmm. This can get confusing for humans, but more importantly, confusing to computers.

In English, we’d probably state this as,

Did 1992-04-01 come before 2000-01-01?

Essentially, that’s what we are doing. Just know, the computer will translate all dates into a number. This number is how many seconds transpired since 1970-01-01.

Why? Computer people picked midnight, Thursday, January 1st 1970, in Coordinated Universal Time (UTC) as a shared reference point–it’s known as the Unix epoch. Think of it as the moment the computing world agreed to start counting from. They figured, “Well, if we have to convert dates into a raw number for computers to understand them, it might as well be the number of seconds since that moment.”

Ok, enough history lesson. How is this relevant?

  1. Computers convert dates into seconds since 1970-01-01.
  2. Comparing dates is actually comparing numbers.

Taking our statement again, let’s re-write it with the number of seconds since 1970-01-01

Is number of seconds between 1970-01-01 and 1992-04-01 less than number of seconds between 1970-01-01 and 2000-01-01

Which becomes:

Is 702,086,400 seconds less than 946,684,800 seconds

Aha, now this makes sense. And the result is true. We can now say, in computer speak: Sarah was born before 2000-01-01.
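
If you’d like to double-check those numbers yourself, here’s a small sketch using R’s POSIXct date-time type, which stores a date as seconds since 1970-01-01 (UTC):

as.numeric(as.POSIXct("1992-04-01", tz = "UTC"))   # 702086400
as.numeric(as.POSIXct("2000-01-01", tz = "UTC"))   # 946684800
as.POSIXct("1992-04-01", tz = "UTC") < as.POSIXct("2000-01-01", tz = "UTC")   # TRUE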

Why? Really, dude.

In my world there is a saying: RTFM.

It’s hard to follow nowadays. Everything moves quickly and we don’t have time to dig into the “why.” But, like most things, if you want to be good, you must take the time to do so.

The reason we review how computers understand dates is that it directly impacts how we write reports. Do you remember the date conversion trick needed to get dates to work in SQL from R? That trick exists because R holds dates as the number of seconds since 1970 and passes them to SQL as a string. SQL then tries to convert the date into seconds all over again, screwing everything up.

It pays to RTFM.

Filtering Dataframes by Date

Back to the problem. How do we write a script which provides a dataframe of people born before 2000-01-01?

The code is actually pretty simple,

library(sqldf)
nonMillennialsDf <- sqldf("SELECT * FROM peopleDf WHERE DOB < '2000-01-01'")

This should provide a nonMillennialsDf dataframe, which contains:

PersonalID FirstName LastName DOB
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01

And there we go, for all my nerdsplaining the code’s pretty simple, right?

Well, there are a few gotchas. Notice the date we’ve written: it has the format YYYY-MM-DD and is surrounded by single quotes. Any time you use dates in SQL, they must be written in this format.

Another tricky part is trying to find whether a date falls between two dates. Let’s take the peopleDf and write a query which provides everyone who was born between 1998-01-01 and 2005-01-01.

Here’s the query.

bornBetweenDf <- sqldf("SELECT * FROM peopleDf WHERE DOB > '1998-01-01' AND DOB < '2005-01-01'") 

This should result in a table with only Fela:

PersonalID FirstName LastName DOB
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1

It is important to understand that the first comparison removed Sarah, as 1992-04-01 is less than 1998-01-01. Then, the second comparison got rid of Timmy, as 2010-01-01 is greater than 2005-01-01.
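
As a side note, SQLite (the engine behind sqldf) also has a BETWEEN operator which does the same thing in one shot. Just remember that BETWEEN is inclusive of both endpoints:

bornBetweenDf <- sqldf("SELECT * FROM peopleDf WHERE DOB BETWEEN '1998-01-01' AND '2005-01-01'")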

Now()

There is one more critical command for writing robust date comparisons: the NOW() function. This function is different in R and SQL, but pretty much every programming language has a version of it.

Essentially, NOW() asks the computer what today’s date is at the moment the script runs.

In SQL-R it looks like this:

nowDf <- sqldf("SELECT *, DATE('NOW') As 'TodaysDate' FROM peopleDf")

This should provide:

PersonalID FirstName LastName DOB TodaysDate
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 2017-07-18
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 2017-07-18
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 2017-07-18

And it doesn’t matter when this script is run, it will always insert today’s date in the TodaysDate column. Nifty, right? Trust me, if you don’t see the possibilities yet, give it time. It’ll grow into one of your favorite functions.

Well, we can’t talk about the NOW() function without discussing the DATE() function I slipped in there. What does it do?

As we discussed earlier, the computer looks at dates as the number of seconds since 1970-01-01. When you use the NOW() function by itself, it will return that raw number of seconds–um, not something humans like to read. The DATE() function says, “Take whatever is inside the parentheses and try to convert it into a human-readable date.” Voila! A human-readable date.
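
For the curious, here’s a sketch showing the difference. In SQLite (the engine sqldf uses), strftime('%s', 'now') returns the raw seconds since 1970-01-01, while DATE('now') returns the human-readable version; the LIMIT 1 just keeps the output to a single row:

rawVsReadableDf <- sqldf("SELECT strftime('%s', 'now') As 'SecondsSince1970', DATE('NOW') As 'TodaysDate' FROM peopleDf LIMIT 1")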

Age

Let’s get fancy. We can use the NOW() function and our peopleDf to calculate everyone’s age.

peopleWithAgeDf <- sqldf("SELECT *, (DATE('NOW') - DOB) As 'Age' FROM peopleDf")

This should provide:

PersonalID FirstName LastName DOB Age
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Timmy Tesa 2010-01-01 7
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Fela Falla 1999-1-1 18
LASDU89NRABVJWW779W4JGGAN90IQ5B2 Sarah Kerrigan 1992-04-01 25

Cool, right? Now, it does not matter when the line of code above is run; it will recalculate everyone’s age each time.

One important note: if the date and time are wrong on your computer, this calculation will be incorrect.
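
One hedge from me: as far as I can tell, the subtraction above only ends up comparing the year portion of the dates, so someone who hasn’t had a birthday yet this year will show as a year older than they really are. If you need more precision, a sketch like the following should work in SQLite. Note that JULIANDAY() only understands fully zero-padded YYYY-MM-DD dates, so a value like '1999-1-1' would need to be cleaned up first:

# Days between today and DOB, divided by 365.25, truncated to whole years.
peopleWithPreciseAgeDf <- sqldf("SELECT *, CAST((JULIANDAY('now') - JULIANDAY(DOB)) / 365.25 AS Integer) As 'Age' FROM peopleDf")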

The nerd-judo which can be done with dates in SQL-R is endless. But this covers a lot of the basics.

If you’ve missed the code bits throughout this article, here it is all at once:

######################### Data ###################################
###################### DO NOT CHANGE #############################
peopleDf <- data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "LASDU89NRABVJWW779W4JGGAN90IQ5B2"), 
           FirstName=c("Timmy", "Fela", "Sarah"),
           LastName=c("Tesa", "Falla", "Kerrigan"),
           DOB=c("2010-14-01", "1999-1-1", "1992-04-01"))
##################################################################
##################################################################
library(sqldf)
nonMillennialsDf <- sqldf("SELECT * FROM peopleDf WHERE DOB < '2000-01-01'")
bornBetweenDf <- sqldf("SELECT * FROM peopleDf WHERE DOB > '1998-01-01' AND DOB < '2005-01-01'") 
nowDf <- sqldf("SELECT *, DATE('NOW') As 'TodaysDate' FROM peopleDf")
peopleWithAgeDf <- sqldf("SELECT *, (DATE('NOW') - DOB) As 'Age' FROM peopleDf")

Providing Chronically Homeless List

With this work challenge we are going to take the concepts we’ve learned from the first challenge and build on them. We will combine two dataframes derived from Client.csv and Enrollment.csv. Then, we will apply HUD’s formula to get a by-name-list of those who are chronically homeless.

Data Needed

The current definition of chronically homeless is found in HUD’s federal register:

A “chronically homeless” individual is defined to mean a homeless individual with a disability who lives either in a place not meant for human habitation, a safe haven, or in an emergency shelter, or in an institutional care facility if the individual has been living in the facility for fewer than 90 days and had been living in a place not meant for human habitation, a safe haven, or in an emergency shelter immediately before entering the institutional care facility. In order to meet the “chronically homeless” definition, the individual also must have been living as described above continuously for at least 12 months, or on at least four separate occasions in the last 3 years, where the combined occasions total a length of time of at least 12 months. Each period separating the occasions must include at least 7 nights of living in a situation other than a place not meant for human habitation, in an emergency shelter, or in a safe haven.

There are several data elements which will be needed for us to calculate whether someone is chronically homeless. These data elements are reported to case-managers and entered into a HUD Entry Assessment when a client enters a program.

Here’s a list of the data elements we will use:

  1. DisablingCondition
  2. TimesHomelessPastThreeYears
  3. MonthsHomelessPastThreeYears
  4. DateToStreetESSH

All of the above data elements are found in the Enrollment.csv. Therefore, similar to the last Challenge, we will need to join the Client.csv and the Enrollment.csv.

We’ve covered how to get all data from CSVs into one dataframe using joins. This Challenge will build on that skill. The new concepts here will be combining logic to get to a specific answer.
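
To set the stage, here’s a rough sketch of that join (assuming Client.csv and Enrollment.csv sit in your working directory; adjust the paths to match your own export):

library(sqldf)
clientDf <- read.csv("Client.csv")
enrollmentDf <- read.csv("Enrollment.csv")
# Inner join the two files on PersonalID, as in the earlier Challenge.
clientAndEnrollmentDf <- sqldf("SELECT * FROM clientDf a INNER JOIN enrollmentDf b ON a.PersonalID = b.PersonalID")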

In SQL we will use the following logic operators:

  • IS (==)
  • NOT (!=)
  • AND (&&)
  • OR (||)
  • > (greater than)
  • < (less than)

For example, let’s take the chronically homeless definition and turn it into something a computer can understand using these logic operators. We can do this by re-writing the definition several times, each time dropping what makes sense to humans and leaving what makes sense to computers.

For example, this should make sense to most humans.

A chronically homeless individual is disabled and has been homeless greater than 364 days. Or, is disabled and has been homeless more than three times in three years, with the time spent in homelessness adding up to greater than 364 days.

That paragraph seems a little hard to read, right? But still, humans should be able to understand it. Now, let’s look at the same paragraph emphasizing the logic operators.

A chronically homeless individual IS disabled AND has been homeless GREATER THAN 364 days. OR, IS disabled AND has been homeless GREATER THAN three times in three years AND the time spent in homelessness adds up to GREATER THAN 364 days.

This is the skill of a computational thinker: taking a definition like the one HUD provided and re-writing it from something a human would understand into something a computer will understand.

The next step is re-writing the paragraph in something called pseudo-code.

Chronic Homeless Individual == 
                
                    A person IS Disabled AND
                    A person homeless > 364 days

                    OR

                    A person IS Disabled AND
                    A person homeless >= 4 times AND
                    A person homeless > 12 months within 3 years

This helps us make sure everything is in place to feed to the computer. The next step will be actually writing the SQL code.

Below is the code to identify the chronically homeless:

#############################################
##### Get those with Disabling Condition ###
#############################################
disablingCondition <- sqldf("SELECT PersonalID 
                            FROM clientAndEnrollmentDf 
                            WHERE DisablingCondition = 1")

#############################################
##### Length-of-Stay ########################
#############################################
# Participants who meet the length-of-stay in homelessness requirement
# Either through four or more occurrences with cumulative duration exceeding a year
# Or a consecutive year.
#                 113 = "12 Months"
#                 114 = "More than 12 Months"
chronicityDf <- sqldf("SELECT PersonalID, 'Yes' As 'Meets LOS'
                               FROM activeEnrollment
                               WHERE (TimesHomelessPastThreeYears = 4
                                    AND (
                                          MonthsHomelessPastThreeYears = 113
                                          OR MonthsHomelessPastThreeYears = 114)
                                        )
                               OR (CAST(JULIANDAY('now') - JULIANDAY(DateToStreetESSH) AS Integer) > 364
                                   AND (DateToStreetESSH != '') 
                                  )
                               ")

#############################################
##### Chronically Homeless ##################
#############################################
# Take the distinct PersonalIDs of individuals who meet both chronicity
# and disabling condition.
chronicallyHomeless <- sqldf("SELECT DISTINCT(a.PersonalID)
                              FROM chronicityDf a
                              INNER JOIN disablingCondition b
                              ON a.PersonalID=b.PersonalID
                             ")

This may look overwhelming, but that’s the purpose of this week’s Challenge: to demonstrate this code is actually pretty simple when broken down into its basic parts.

That’s the real lesson here: every complex question may be made extremely simple when taken one piece at a time. The power of computational thinking is extraordinary.

The Goal

We are going to merge the two data sets to discover the following:

  1. A list of individuals who are chronically homeless.
  2. Export this list to an Excel document.

To get this information we will need to do the following:

  1. Load the Client.csv into the dataframe clientDf.
  2. Load the Enrollment.csv into the dataframe enrollmentDf.
  3. Inner join the clientDf to enrollmentDf.
  4. Calculate whether someone is chronically homeless.
  5. Filter to those who are chronically homeless.
  6. Write the by-name-list of individuals to an Excel document.

The Resources

Below are the resources which should help for each step:

Step 1 & 2

  • R Programming A-Z – Video 41 – Loading and Importing Data in R
  • R Programming A-Z – Video 21 – Functions in R
  • Read and Write CSVs in R

Step 3

  • The Complete SQL Bootcamp – Video #51 – Overview of Inner Joins
  • The Complete SQL Bootcamp – Video #52 – Example of Inner Joins
  • HMIS, R, SQL – Basics

Step 4 & 5

Step 6

  • Writing Excel Workbooks – Tutorial Coming

Give me MyFitnessPal Data!

I’m fat. Fatter than I want to be. I’ve not always been fat; I got down to 180 back in 2008. It took counting calories and weighing myself religiously. The key piece for me was having a graph, which I looked at daily, showing my outcomes. Over the course of a year I lost 40 pounds. Well, it’s time to do it again. I’ve gained that 40 back over 10 years–and now it needs to go.

Back in 2008 I was using Google to give me the calories of every item I ate and recording them in an Excel document. This food journal was great, but a little more work than it probably should have been.

Back then, I wasn’t aware of being a hacker. Now, I plan to throw all my hacker skills at this weight loss plan (hell, I might even go to the gym!)

I signed up for MyFitnessPal. Counting calories worked once; I figure, if it ain’t broke. But then I got to looking at how much work it would take to check my progress. I mean, I’d have to actually open the app on my phone and click on the weight loss section. Sheesh–who designed that app? Two actions to get where I needed–ain’t no one got time for that.

Enter hacker skills. I discovered there is a Python library which allows scraping of data.

This wonderful little library is written and provided by CoddingtonBear.

I figure I’d write a Python script to scrape the data, save it to a CSV, create an SQL-R script to join the nutrition and weight information, use ggplot to plot the data, save the plot as a PNG, and then copy this plot to a dedicated spot on Ladvien.com. Lastly, I’d write a bash script to run every night and update the graph. Simples!

And c’mon, opening a webpage is a lot easier than tapping twice.

Well, after a few hours of coding, I’ve got the first step of the project complete.

import myfitnesspal
import csv, sys, os
from datetime import datetime

# Get account info
client = myfitnesspal.Client('cthomasbrittain')
# Set start year
startYear = "2008"
# Get limits
beginningDate = datetime.strptime(startYear, "%Y").date()
beginningYear = beginningDate.year
daysInMonth = {1:31, 2:28, 3:31, 4:30, 5:31, 6:30, 7:31, 8:31, 9:30, 10:31, 11:30, 12:31}
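# Note: February is hardcoded to 28 days, so anything logged on a leap day (Feb 29) will be skipped.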
emptyNutrition = [None, None, None, None, None, None]

print("")
print("################################################")
print("# Scraping MyFitnessPal                        #")
print("# Make sure your account is set to public      #")
print("# and your username and pass are in keychain   #")
print("################################################")
print("")

today = datetime.now().date()
currentYear = today.year

print("")
print("################################################")
print("# Get nutrition and weight information         #")
print("################################################")
print("")

# Loop over years from beginningYear.  Make sure last year is inclusive.
for yearIndex in range(beginningYear, currentYear+1):
    
    # Create a file name based on this year's data
    thisFileName = "healthData_%s.csv" % yearIndex

    # Open CSV as read and write.
    # If file exists, open for read / write
    #   else, create file, write only.
    try:
        f = open(thisFileName, "r+")        # Check to see if file is complete,
        row_count = sum(1 for row in f)     # else, overwrite the file
        if(row_count != 366):               # A year of rows plus headers, and an empty line at end.
            f = open(thisFileName, "w+")
            row_count = 0
    except EnvironmentError:
        f = open(thisFileName, "w+")        # If file does not exist, create it.
        row_count = 0
    
    writer = csv.writer(f)
    
    # Check number of lines. If the year wasn't captured, start over.
    if(row_count < 365):
        # Write headers for totals
        writer.writerow(["Date", "Sodium", "Carbohydrates", "Calories", "Fat", "Sugar", "Protein", "Weight"])
        sys.stdout.write(str(yearIndex)+": ")   # Print has a linefeed.
        sys.stdout.flush()
        for monthIndex in range(1, 12+1):
                
            beginningOfMonthStr = "%s-%s-%s" % (yearIndex, monthIndex, 1)
            endOfMonthStr = "%s-%s-%s" % (yearIndex, monthIndex, daysInMonth[monthIndex])

            beginningOfMonth = datetime.strptime(beginningOfMonthStr, "%Y-%m-%d").date()
            endOfMonth = datetime.strptime(endOfMonthStr, "%Y-%m-%d").date()
            
            thisMonthsWeights = dict(client.get_measurements('Weight', beginningOfMonth, endOfMonth))

            for dayIndex in range(1, daysInMonth[monthIndex]+1):
                
                fullDateIndex = "%s-%s-%s" % (yearIndex, monthIndex, dayIndex)
                thisDate = datetime.strptime(fullDateIndex, "%Y-%m-%d").date()
                if(thisDate > today):
                    break;

                thisDaysNutritionData = client.get_date(yearIndex, monthIndex, dayIndex)
                thisDaysNutritionDataDict = thisDaysNutritionData.totals
                thisDaysNutritionValues = list(thisDaysNutritionDataDict.values())  # list() so it can be concatenated below

                thisDaysWeight = [(thisMonthsWeights.get(thisDate))]
                
                if(len(thisDaysNutritionValues) < 6):
                    thisDaysNutritionValues = emptyNutrition

                dataRow = [fullDateIndex] + thisDaysNutritionValues  + thisDaysWeight
                if dataRow:
                    writer.writerow(dataRow)

            sys.stdout.write("#")
            sys.stdout.flush()
        print(" -- Done.")
        f.close()
    else:
        print((str(yearIndex)+": Exists and is complete."))

And then we add some R to join the data together and automate plotting, and saving the plots as images.

library(ggplot2)
library(scales)

cat("*******************************************************\n")
cat("* Starting R                                          *\n")
cat("*******************************************************\n")
cat("\n")
cat("*******************************************************\n")
cat("* Combining Health Data                               *\n")
cat("*******************************************************\n")
cat("\n")
# Thanks Rich Scriven
# https://stackoverflow.com/questions/25509879/how-can-i-make-a-list-of-all-dataframes-that-are-in-my-global-environment
healthDataRaw <- do.call(rbind, lapply(list.files(pattern = ".csv"), read.csv))
# Fill in missing values for calories
healthDataRaw$Calories[is.na(healthDataRaw$Calories)] <- mean(healthDataRaw$Calories, na.rm = TRUE)

date30DaysAgo <- Sys.Date() - 30
date90DaysAgo <- Sys.Date() - 90
date180DaysAgo <- Sys.Date() - 180

cat("*******************************************************\n")
cat("* Creating Weight Graph                               *\n")
cat("*******************************************************\n")
healthData <- healthDataRaw[!(is.na(healthDataRaw$Weight)),]
healthData$Date <- as.Date(healthData$Date)
healthData <- with(healthData, healthData[(Date >= date30DaysAgo), ])
p <- ggplot(healthData, aes(x = Date, y = Weight))+
  geom_line(color="firebrick", size = 1) +
  labs(title ="Ladvien's Weight", x = "Date", y = "Weight")
p
ggsave("ladviens_weight.png", width = 5, height = 5)

cat("\n")

cat("*******************************************************\n")
cat("* Creating Calories Graph                             *\n")
cat("*******************************************************\n")
cat("\n")
#healthData <- healthDataRaw[!(is.na(healthDataRaw$Calories)),]
healthData$Date <- as.Date(healthData$Date)
healthData <- with(healthData, healthData[(Date >= date30DaysAgo), ])
p2 <- ggplot(healthData, aes(x = Date, y = Calories))+
  geom_line(color="firebrick") 
p2

png(filename="ladviens_calories.png")
plot(p2)
dev.off()

cat("*******************************************************\n")
cat("* Finished R Script                                   *\n")
cat("*******************************************************\n")
cat("\n")

Lastly, let’s write a bash script to run the Python and R code, then copy the images to Ladvien.com

#!/bin/sh
PASSWORD="$(keyring get system ladvien.com)"

python myfitnesspall_scraper.py

Rscript myfitnesspal_data_sort.R

echo ""
echo "*******************************************************"
echo "* Syncing files to Ladvien.com                        *"
echo "*******************************************************"
echo ""

# Used SSHPass
# https://gist.github.com/arunoda/7790979

sshpass -p "$PASSWORD" scp ladviens_weight.png ladviens_calories.png root@ladvien.com:/usr/share/nginx/html/images

And here’s the result:

My weight:

And my calories:

Next, I’ll probably tweak ggplot2 to make the graphs a little prettier. Also, I’ll set up a Raspberry Pi or something to run the bash script once a night. Why? Lolz.
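
For the nightly run, a cron entry is probably the simplest route. Here’s a sketch (the script path is hypothetical; point it at wherever the bash script above actually lives):

# Run the MyFitnessPal sync script every night at 11:30 PM (add via `crontab -e`).
30 23 * * * /bin/sh /home/pi/myfitnesspal_sync.sh >> /home/pi/myfitnesspal_sync.log 2>&1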

Filter to Most Recent HUD Assessment

Enrollment.csv

Many of the CSVs in the HMIS CSV export may contain multiple rows per client. This can make working with HMIS data difficult, as it appears to the non-data person that there are duplicates within your data.

Let’s look at some dataframes:

enrollmentDf

ProjectEntryID PersonalID FirstName EntryDate
L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016
169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015
XB52BYOGJ1YDFESNZVNGDGA58ITDML0A ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 01/01/2013

Notice how Bob has two records? One on 10/17/2016 and one on 01/01/2013. This represents two HUD Entry Assessments completed on Bob. These HUD Entry Assessments could represent two stays in the same program, or one stay in each of two different programs.

Regardless, whenever you go to join this dataframe with another dataframe, like the Client.csv, the resulting dataframe will have two rows representing both of Bob’s enrollments.

Let me walk us through joining the above dataframe with another dataframe.

We are going to join the enrollmentDf (above) with this clientDf:

clientDf

PersonalID FirstName LastName
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall

In R, we can join these two dataframes with the following.

Please copy the code below to R and execute.

####### BEGIN LOADING DATA FRAMES ###############
enrollmentDf = data.frame(ProjectEntryID=c("L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q", "169ACC89JY5SX0U87U7HQ28PMMHNJEXQ", "XB52BYOGJ1YDFESNZVNGDGA58ITDML0A"), 
               PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7"), 
               FirstName=c("Bob","Jane", "Bob"), 
               EntryDate=c("10/17/2016", "05/05/2015", "01/01/2013"))

clientDf = data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV"), 
                      FirstName=c("Bob","Jane"),
                      LastName=c("Beber", "Goodall"))
####### ENDS LOADING DATA FRAMES ###############

# Load the SQLdf package (note, it must be installed first. See install.packages())
library(sqldf)

# Join the two dataframes.
clientAndEnrollmentDf <- sqldf("SELECT * 
                               FROM clientDf 
                               LEFT JOIN enrollmentDf 
                               ON clientDf.PersonalID=enrollmentDf.PersonalID")

Important Sidenote

If you ever see the following error:

In field_types[] <- field_types[names(data)] : number of items to replace is not a multiple of replacement length

It’s a problem with going back-and-forth between R and SQL. To fix it, use the following code on the dataframe you are trying to work with, before executing the line of code causing the error:

dfCausingProblem <- subset(dfCausingProblem)

Ok, back to work.

After executing the code, you should end up with a table like this. Not too shabby.

PersonalID FirstName LastName ProjectEntryID PersonalID FirstName EntryDate
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber XB52BYOGJ1YDFESNZVNGDGA58ITDML0A ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 01/01/2013
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall 169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015

However, notice there are still rows for Bob? These aren’t technically duplicates. A duplicate is when there are two rows where items in every column are exactly the same. But in the case of the dataframe above, notice how the ProjectEntryID and EntryDate columns for Bob’s records are different?

As stated before, this is carried forward from Bob having two HUD Entry Assessments. But to the people to whom we are going to present these data, it looks like duplicates. This is a problem because it will be seen as sloppy work (but remember, it’s not. It’s merely a technical artefact).

Who cares! How do we get rid of it?

First, we have to make a decision among three options: keep only the most recent HUD Entry Assessment per client, keep only the first HUD Entry Assessment ever taken per client, or leave it as it is.

The last option is out, so it’s a choice between the most recent and the oldest one. In our world, homelessness usually gets worse over time, and HUD wants us to be helping those who are the most vulnerable first, so the most recent is probably going to give us the best picture of how vulnerable a client is right now.

Alright, how do we get the most recent HUD Assessment?

In SQL there is a function called MAX(). It takes the greatest value in a column–and for dates, the greatest value is the most recent. Let’s look at how to use it, then we can discuss it.

For the following code to work, make sure all code above has been executed in R.

clientAndEnrollmentDf2 <- sqldf("SELECT *, MAX(EntryDate) FROM clientAndEnrollmentDf")

This should provide you with the following table:

PersonalID FirstName LastName ProjectEntryID PersonalID.1 FirstName.1 EntryDate MAX(EntryDate)
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016 10/17/2016

Wait! What happened to Jane!? Well, the MAX() took the row with the greatest EntryDate, which is 10/17/2016. Unfortunately, Jane’s record wasn’t the most recent, so her record was removed.

Well, that’s not going to work. But we’re close. If only we had a way to take the MAX(EntryDate) per client.

We do. SQL has another command called GROUP BY, which allows us to tell SQL to apply another command within each group of records.

Again, let’s use it and then dissect it.

Copy this to R and execute it.

clientAndEnrollmentDf3 <- sqldf("SELECT *, MAX(EntryDate) FROM clientAndEnrollmentDf GROUP BY PersonalID")

You should end up with a table like this:

PersonalID FirstName LastName ProjectEntryID PersonalID.1 FirstName.1 EntryDate MAX(EntryDate)
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall 169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015 05/05/2015
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016 10/17/2016

Aha! That’s it!

What the GROUP BY did was say, “Ok, SQL, create a group of rows for each PersonalID. Now, for each group, take the row with the greatest EntryDate.”

This gives exactly what we want. A single row per participant.
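
One caution from me before moving on: EntryDate here is stored as MM/DD/YYYY text, so MAX() is really comparing it alphabetically. It happens to pick the right rows in this example, but to be safe you may want to reformat the column to YYYY-MM-DD (the format SQL expects, as discussed earlier) before grouping:

# Reformat EntryDate to YYYY-MM-DD so MAX() compares dates reliably.
clientAndEnrollmentDf$EntryDate <- format(as.Date(clientAndEnrollmentDf$EntryDate, "%m/%d/%Y"), "%Y-%m-%d")
clientAndEnrollmentDf3 <- sqldf("SELECT *, MAX(EntryDate) FROM clientAndEnrollmentDf GROUP BY PersonalID")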

Working with Dates in SQLdf

Mixing R and SQL is powerful. One of the easiest ways to implement this combination is with the R library SQLdf.

If TL;DR, skip to Coerce Date Types into Strings before Passing to SQLdf at bottom.

SQLdf

The power of SQLdf comes from its ability to convert dataframes into SQLite databases on the fly. To the user, it doesn’t appear like anything special is going on, but under the hood R is working together with a SQLite client to create a table which can be queried and manipulated with ANSI SQL calls.

For example,

dataFrame1 <- read.csv(pathToData)
library("sqldf")
dataFrame2 <- sqldf("SELECT FirstName FROM dataFrame")

These three lines do a lot. They load data from a CSV, load a library of functions for converting R dataframes into SQLite databases, and then the sqldf() function call does two things at once: it converts the R dataframe into a SQLite database and then queries it for the FirstName column.

If we were to assume the dataFrame1 variable contained data like this:

PersonalID FirstName LastName
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Person
ASGJ4F95HS85N39DJ12AJB094M59DJ45 Jane People

Then the dataFrame2 <- sqldf("SELECT FirstName FROM dataFrame1") will create a variable called dataFrame2 which contains the FirstName column from dataFrame1

FirstName
Bob
Jane

And this is how we will shape our data in the R-SQL way.

Datatypes

One of the most important things a human can learn about computers is something called datatypes. When computers process information they need a little help from humans to understand what to do with the information. For example, what do these numbers mean to you?

76110, 444-325-7645, 10/24/1980

Most humans (at least in the United States) will know the first number is a ZIP code, the second a phone number, and the last a date. Humans know this because our brains have learned how to discern meaning from context. In the case of the ZIP code, it’s exactly 5 digits; the phone number contains dashes in exact places; and the date contains slashes in exactly the spots we’d expect of a date.

Unfortunately, computers have a little more difficulty with this. Most computers are smart enough nowadays to recognize the phone number and the date, but the ZIP code will confuse the heck out of a computer.

A computer’s initial reaction in seeing the ZIP code is, “Oh, you mean 76,110. That’s a big number.” When really, this number represents a geographic location.

Ok, let’s make this more relevant to HMIS work. The way to help a computer understand what numbers represent is by telling the computer what type of data a particular column holds. This is known as a datatype. We really only deal with a few datatypes, but there are many, many more out there.

In R you can see what datatype a column of your dataframe is by clicking the blue button next to the dataframe name in the Global Environment pane.
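
If you prefer the console, a couple of base R functions show the same information. A quick sketch, using the peopleDf from earlier:

str(peopleDf)          # prints one line per column with its datatype
class(peopleDf$DOB)    # checks a single column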

We will be dealing with the following:

  • Dates (called “POSIXct” in R)
  • Strings (called “chr” in R)
  • Numbers
  • Factors

Of course, every programming language calls these datatypes by different names, thus furthering confusion. (I mean, c’mon, programming is simple enough as it is–we’ve got to make it a little challenging.)

Dates

Date datatypes usually look like this:

10/24/1980

But dates can come in many different formats. It’s probably best to differentiate between datatype and data format. A datatype describes how the information should be used–it’s important to the computer. A data format describes how the computer should display information to the human–therefore, it’s useful to the human.

An example of different formats of the same data:

10/24/1980
1980-10-24
102480

Ok, back to the date datatype. It is used when dealing with dates. By declaring a variable as having a date datatype, we are telling the computer to interpret whatever we put into that variable as a date. Simple enough.
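
Here’s a small sketch of that in R, using the POSIXct date-time type mentioned above; under the hood it is just the seconds-since-1970 number we talked about earlier:

dob <- as.POSIXct("1980-10-24", tz = "UTC")
class(dob)        # "POSIXct" "POSIXt"
as.numeric(dob)   # 341193600 -- seconds since 1970-01-01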

Strings

When we talk about strings we aren’t talking about the fuzzy things kittens chase. A string datatype is a series of characters (each character being one piece of the string) strung together. Anything can be a string. Strings are probably the most important datatype, since they can tell a computer to look at a number and see something else. Confused? Let’s look at some examples.

We tell a computer data is a string by putting it in double quotes "this is a string" or single quotes 'this is also a string'.

Here’s an example of assigning a string in R:

myFirstString <- "this is a string"

Great! But what can we do with it? Well, a lot.

Let’s say we wanted to pass a path of a file to a read.csv() function. We could do so by providing the path as a string.

dataFrame <- read.csv("/Users/user/Downloads/Client.csv")

The above will load the Client.csv file located at the /Users/user/Downloads/ directory–the computer knows how to read the path because it’s a string.

But why are strings so important? Well, they allow us to tell a computer to override its basic instinct and view a piece of data as something other than what the computer would guess it is.

Returning to the ZIP code.

notAZipCode <- 76110
myZipCode <- "76110"

These variable assignments may seem to be exactly the same; however, the first one creates a variable as a number, but the second tells the computer, “This is not a number, it is something else–please don’t pretend to understand it. I’ll tell you exactly what to do with it later.”

Numbers

Number datatypes are easy. It’s pretty much how a computer tries to look at all the data you provide it. One important characteristic of numbers: you can have the computer perform math on them, which you can’t do with strings.

dataframe1 <- 12 * 12 
datafram2 <- "12" * "12"

Above, dataframe1 will contain 144 after being executed. But if the second line is executed, the computer will freak out: “These are strings! I can’t multiply strings! You told me you’d tell me what to do with them…”

Factors

Factors are a special datatype in R. Most of the variables we load into R will start out as factors. Essentially, factors are categories of data.

Red,
Orange,
Green,
Blue,
Indigo,
Violet

The list above is an example of factors: categories of data. The importance of factors will become more evident as we work through these tutorials in R.

If you don’t understand factors, it’s cool. Just think of them as strings. However, if you don’t understand strings, please ask any questions in comments below. Understanding them is critical to working with SQLdf.
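
If it helps, here’s a tiny sketch (my own example, not from the tutorial data) showing how the color list above behaves as a factor and how to turn it back into plain strings:

colors <- factor(c("Red", "Orange", "Green", "Blue", "Indigo", "Violet"))
levels(colors)         # the categories R found, sorted alphabetically
as.character(colors)   # converted back into plain strings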

SQLdf and Datatypes

Anytime you mix two different languages it pays to be careful about meaning. As I learned once by talking about pie as something I liked–come to find out, it was funny to Hispanic friends who were learning English. (Apparently pie is Spanish for foot?)

When mixing R and SQL we must be careful about how the two languages look at datatypes. R sees dates as a POSIXct datatype (essentially a fancy date datatype. Would you like to know more?)

Well, this is all fine and dandy, but when we pass commands from R to SQL it is all passed as a string.

dataFrame2 <- sqldf("SELECT * FROM dataFrame1")

Notice SELECT * FROM dataFrame1 is all in quotation marks? This turns it into a string, which is then passed to SQLite–a step that is hidden from us.

If all this is a bit overwhelming, no worries. Bookmark this page to refer back to later. Just remember the following:

Date columns must be converted into the chr datatype before being passed to SQL. How do we convert datatypes? It’s pretty darn simple. We use something called data coercion.

Coercing Data Types

Let’s go back to that ZIP code and number example. Let’s say the computer reads all your ZIP codes from a file as a number. This happens a lot, since to the computer that’s what it looks like–so it guesses that’s what you are going to want.

But no, we want those ZIP codes to be strings. To do this, we can get a particular column from a dataframe by writing the name of the dataframe, then $, then the name of the column. For example, dataframe$zipCodes will return only the zipCodes column from dataframe.

Alright, now that we have a way to select one column from our dataframe, we can attempt to convert that column’s datatype. To do this, use the as.character() command.

dataframe$zipCodes <- as.character(dataframe$zipCodes)

This converts the zipCodes column from a number into a string, then assigns it back to the zipCodes column. Boom! We’ve told the computer to stop trying to make a ZIP code a number and instead treat it as a string. Later, we will tell the computer exactly how to use ZIP codes.

Coerce Date Types into Strings before Passing to SQLdf

Ok, now for the reason for this entire article. Before passing any dates to SQLdf we need to first convert them to strings. Otherwise, SQLdf will try to treat them as numbers–which will cause a lot of heartache.

For example, a Client.csv file should have a DateCreated column. This represents the date a case-manager put the data into HMIS. The data should look something like this:

DateCreated DateUpdated
10/23/14 0:01 4/23/15 15:27
5/22/13 9:23 10/15/16 1:29
6/3/15 19:22 3/17/17 21:09

Let’s try to get all of the clients who’ve been entered after 10/01/2014.

dataFrameContainingDates <- read.csv("/Users/user/Downloads/Client.csv")
datesEntered <- sqldf("SELECT * FROM dataFrameContainingDates WHERE DateCreated > '2014-10-01'")

The above code should provide every row where the DateCreated date is greater than 2014-10-01. But, instead, it will result in an empty dataframe. Waaah-waah.

Essentially, this is because SQL is comparing a number and a string. It freaks the computer out.

Instead, we should first convert the DateCreated column from a date to a string. Then, SQL will actually convert it from a string to a date itself.

Confused? Imagine me when I was trying to figure this out on my own.

Ok, so, the take away? Before passing any dates to SQL convert them to strings.

dataFrameContainingDates <- read.csv("/Users/user/Downloads/Client.csv")
dataFrameContainingDates$DateCreated <- as.character(dataFrameContainingDates$DateCreated)
datesEntered <- sqldf("SELECT * FROM dataFrameContainingDates WHERE DateCreated > '2014-10-01'")

By using the as.character function to convert the DateCreated column to a string and then assigning it back to the dataframe, it sets SQL up to do the date comparisons correctly. Using the dataframe from above, this should result in the following table:

DateCreated DateUpdated
10/23/14 0:01 4/23/15 15:27
6/3/15 19:22 3/17/17 21:09

Confused as heck? Feel free to ask questions in the comments below!