This article is part of a series.
- Part 1 - HMIS, R, and SQL -- Introduction
- Part 2 - HMIS, R, and SQL -- Basics
- Part 3 - Preamble to Mixing R and SQL
- Part 4 - HMIS, R, SQL -- Work Challenge One
- Part 5 - Read and Write CSVs in R
- Part 6 - Working with Dates in SQLdf
- Part 7 - This Article
- Part 8 - HMIS, R, SQL -- Work Challenge Two
- Part 9 - Comparing Values in R and SQL
- Part 10 - Working with R Excel Libraries
Many of the CSVs in the HMIS CSV may contain multiple rows per client. This can make it difficult when working with HMIS data, as it appears to the non-data person there are duplicates within your data.
Let’s look at some dataframes:
Notice how Bob has two records? One on 10/17/2016 and 01/01/2013. This represents two HUD Entry Assessments completed on Bob. These HUD Entry Assessments could represent two stays in the same program, or one stay in two programs.
Regardless, whenever you go to join this dataframe with a another dataframe, like the Client.csv, it will cause the resulting dataframe to have two rows representing both of Bob’s enrollments.
Let me walk us through joining the above dataframe with another dataframe.
We are going to join the enrollmentDf (above) with this clientDf
In R, we can join these two dataframes with the following.
Please copy the code below to R and execute.
If you ever see the following error:
In field_types <- field_types[names(data)] :
number of items to replace is not a multiple of replacement length
It’s a problem with going back-and-forth between R and SQL. To fix it, use the following code on the dataframe you are trying to work with before executing the line of code causing the error
Ok, back to work.
After executing the code, you should end up with a table like this. Not too shabby.
However, notice there are still rows for Bob? These aren’t technically duplicates. A duplicate is when there are two rows where items in every column are exactly the same. But in the case of the dataframe above, notice how the ProjectEntryID and EntryDate columns for Bob’s records are different?
As stated before, this is carried forth from Bob having two HUD Entry Assessments. But to the people whom we are going to present these data, it looks like duplicates. This is a problem because it will be seen as sloppy work (but remember, it’s not. It’s merely a technical artefact).
Who cares! How do we get rid of it?
First, we have to make a decision among three options. First, we can get only the most HUD Entry Assessment per client, only the the first HUD Entry Assessment ever taken per client, or leave it as it is.
The last option is out, so it’s a choice between most recent and the oldest one. In our world, homelessness usually gets worse and HUD wants us to be helping those who are the most vulnerable first, so the most recent is probably going to give us the best picture how vulnerable a client is right now.
Alright, how do we get the most recent HUD Assessment?
In SQL there is a function called MAX(). It will take the most recent of a record. Let’s look at how to use it, then we can discuss it.
For the following code to work, make sure all code above has been executed in R.
This should provide you with the following table:
Wait! What happened to Jane!? Well, the MAX() took the row with the greatest EntryDate, which is 10/17/2016. Unfortunately, Jane’s record wasn’t the most recent, so her record was removed.
Well, that’s not going to work. But we’re close. If only we had a way to take the MAX(EntryDate) per client.
We do. SQL has another command called GROUP BY, which allows us to tell SQL to apply another command by a group of records.
Again, let’s use it and then dissect it.
Copy this to R and execute it.
You should end up with a table like this:
Aha! That’s it!
What the GROUP BY did was say, “Ok, SQL, create a group of data where the rows PersonalID are the same. Now, for each group, take the row with the greatest EntryDate.”
This gives exactly what we want. A single row per participant.