Continuing to explore the usefulness of R and SQL for HMIS data, I’ve decided to start posting HMIS problems along with their R and SQL solutions.
Problem: One of our Emergency Solutions Grant (ESG) funders requires the subrecipients to produce a report of all the participants who receive services from the shelter. This requirement was written into the contract the ESG funders have with the subrecipient. It worked well enough in 2011 when it was implemented, but it is now 2016 and the data set queried to produce the report contains well over 1.3 million entries. These data are generated every time a participant checks in to a shelter for a meal, a bed, or simply to sit, so it is unlikely the data set will ever get smaller. In short, the data have outgrown the report server currently provided by our software vendor, and since the query is handled server-side, the subrecipients can no longer reliably meet the requirement.
Solution:
In an attempt to circumvent the server-side query, I’ve written an R and SQL script which takes two data files:
Point of Service (PoS)
Demographic and HUD Assessment data (enrolledInTCES)
These data were pulled using the software’s report services, but without any query, which seems to bypass the server-side bottleneck. The script then merges the data, formats it, and aggregates it for the final report.
The report should be something which can be run from a batch file, so I hope to deploy it to the subrecipients. Then, with a little training, they should be able to continue producing the report for the funders.
I’m a hacker. If you find errors, please leave comments below. If you have an opinion I’ll hear it, but I’m often not likely to agree without some argument.
Joins (Merging Data)
Probably the best part of R and SQL is their ability to quickly combine data around a key. For example, in the HMIS CSVs, Client.csv contains most of the demographic information while Enrollment.csv contains most of the assessment information. This makes it difficult to get, say, a count of the total participants who are both veterans and disabled, since the veteran information is in Client.csv and the disability information is in Enrollment.csv. However, both R and SQL provide join functions.
Joins are a hugely expansive topic; I’m not going to try to cover all their quirks, but here are some videos I found helpful:
The two most useful joins for HMIS data are the LEFT JOIN and the INNER JOIN. A left join keeps all the data in the left table plus the matching data from the right table, while an inner join keeps only the data which matches in both tables.
Here’s an example in the context of the Client.csv and Enrollment.csv:
Client.csv
PersonalID | FirstName | VeteranStatus
12345      | Jane      | Yes
54321      | Joe       | No
Enrollment.csv
PersonalID | FirstName | DisablingCondition
12345      | Jane      | Yes
54321      | Joe       | No
45321      | Sven      | Yes
Here are the two join statements and their results for the data above.
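First, the left join, written as a sqldf query. This is just a sketch; it assumes the two tables above have been read into data frames named client and enrollment, and it puts enrollment on the left so that Sven’s record is kept:

    leftJoinResult <- sqldf("SELECT a.PersonalID, a.FirstName, b.VeteranStatus, a.DisablingCondition
                             FROM enrollment a
                             LEFT JOIN client b ON a.PersonalID = b.PersonalID")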
This join should result in the following:
PersonalID | FirstName | VeteranStatus | DisablingCondition
12345      | Jane      | Yes           | Yes
54321      | Joe       | No            | No
45321      | Sven      | NULL          | Yes
Notice Sven was kept, even though he had no entry in the Client.csv. Since there was no matching record for him in the Client.csv, his VeteranStatus comes back NULL after the join.
And the inner join would look like this:
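Again a sketch, using the same data frame names:

    innerJoinResult <- sqldf("SELECT a.PersonalID, a.FirstName, b.VeteranStatus, a.DisablingCondition
                              FROM enrollment a
                              INNER JOIN client b ON a.PersonalID = b.PersonalID")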
This join should result in the following:
PersonalID | FirstName | VeteranStatus | DisablingCondition
12345      | Jane      | Yes           | Yes
54321      | Joe       | No            | No
Counts
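To start, here’s a query for pulling a de-duplicated list of client IDs. It’s a sketch which assumes Client.csv has been read into a data frame named client:

    # one row per unique participant
    peopleIDs <- sqldf("SELECT DISTINCT PersonalID FROM client")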
The method above creates a vector of all the PersonalIDs in the client data frame, which came from the Client.csv. The DISTINCT command keeps only one ID when two or more are identical. In short, it creates a de-duplicated list of participants.
For example,
PersonalID | OtherData
12345      | xxxxxxxxx
56839      | xxxxxxxxx
12345      | xxxxxxxxx
32453      | xxxxxxxxx
This should result in the following:
PersonalID
12345
56839
32453
This is useful for creating a key vector, since other CSVs have a one-to-many relationship with PersonalID. For example,
The Enrollment.csv looks something like this:

PersonalID | ProjectEntryID | EntryDate
12345      | 34523          | 2016-12-01
56839      | 24523          | 2015-09-23
12345      | 23443          | 2014-01-10
32453      | 32454          | 2015-12-30
This reflects a client (i.e., 12345) entering a project twice, once on 2014-01-10 and again on 2016-12-01.
Count of Total Participants:
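A sketch of the count query, again assuming the client data frame:

    totalParticipants <- sqldf("SELECT COUNT(PersonalID) AS 'Total Participants' FROM client")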
This query should give a one-row output, counting the number of clients in the data frame.
  Total Participants
1               1609
However, if there are duplicate PersonalIDs, it will count each entry as a separate ID. To get a count of unique clients in a data frame, add the DISTINCT command.
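Something like this (the same sketch, with DISTINCT added):

    uniqueParticipants <- sqldf("SELECT COUNT(DISTINCT PersonalID) AS 'Total Participants' FROM client")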
Conditional Data
Often in HMIS data it is necessary to find a collection of participants who meet a specific requirement. For example, “How many people in this data set are disabled?” This is where the WHERE statement helps a lot.
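A sketch, assuming the enrollment data frame and a DisablingCondition column storing ‘Yes’/‘No’ values as in the example tables above (the real HMIS CSVs code this field numerically, so adjust to your export):

    disabledIDs <- sqldf("SELECT PersonalID FROM enrollment WHERE DisablingCondition = 'Yes'")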
This statement will return a vector of all the PersonalIDs of participants who stated they were disabled. The total participant query could then be used on that result, but there is an alternative method.
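A sketch of that alternative, doing the count inside the query itself:

    disabledCount <- sqldf("SELECT COUNT(CASE WHEN DisablingCondition = 'Yes'
                                             THEN PersonalID ELSE NULL END) AS 'Number Disabled'
                            FROM enrollment")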
The above statement uses the CASE WHEN ... END statement, which I understand as SQL’s version of an IF statement. Here’s a rough C equivalent (just a sketch of the idea, using stand-in data):
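    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Stand-in rows: only count a row when its condition is met,
           otherwise contribute nothing -- the same idea as CASE WHEN ... END. */
        const char *disablingCondition[] = {"Yes", "No", "Yes"};
        int numberDisabled = 0;
        for (int i = 0; i < 3; i++) {
            if (strcmp(disablingCondition[i], "Yes") == 0) {  /* CASE WHEN ... THEN */
                numberDisabled++;
            }                                                 /* ELSE NULL: skip    */
        }
        printf("Number Disabled: %d\n", numberDisabled);
        return 0;
    }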
BOOL!
Boolean operators can be used to get more complex conditional data:
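A sketch, assuming a merged data frame named allData which has both a DisablingCondition and a Gender column stored as text (in the actual HMIS CSVs these fields live in separate files and are partly coded as numbers, so a join and some recoding may be needed first):

    disabledFemaleIDs <- sqldf("SELECT PersonalID FROM allData
                                WHERE DisablingCondition = 'Yes'
                                AND Gender = 'Female'")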
This statement will provide a vector of all the PersonalIDs for clients who are disabled and female.
I’m an HMIS Database Manager for a living. It’s a dream job: all the nerdy stuff, plus there’s a possibility I’m helping people. Currently, one area where our software really falls short is quickly generating complex reports. It has the ability, but the servers are laggy, it crashes often, and a project which should take 20 minutes will take anywhere from 50 minutes to 40 hours depending on the “report weather.” These issues are probably caused by the reporting platform being web-based and the calculations being done server-side. Regardless, given the amount of time the staff are losing on report projects, I’ve decided to explore alternative systems for generating some of our needed reports.
Luckily, HUD has dictated an HMIS data format, often known as the “CSV version.” The specifications for these data sets are outlined in HUD’s documentation:
These data standards are currently on version 5.1; however, HUD issues tweaks to the standards every October 1st. The point is, since the data are standardized, they should be easy to manipulate using local tools.
Here are a few pros of exploring local reporting tools:
Software vendor ambivalent
No bottleneck due to routing issues
Greater flexibility of reporting
No outage concerns
More control on optimization of queries
And the cons:
Somewhat more difficult to deploy to end-users (integration would probably be through batch files or Excel-VB)
Before jumping into the alternatives, it is important to point out that HUD requires all HMIS software vendors to be able to export a set of CSV files containing all the HUD-mandated data elements (also known as the universal data elements). This export process is reliable, fast, and predictable, at least in my experience. As the alternative tools are explored, the data sets being used will most often be these HMIS CSVs; however, there will probably be other data our CoC reports locally which will be joined to these CSVs using each participant’s unique ID.
Ok! Let’s take a look.
R
R gets me excited. It is a programming language for data miners. It is primarily C under the hood, which potentially makes it blazingly fast. R is meant to be used from a command-line interface, but I’m using RStudio as a convenient overlay. RStudio has a few limitations, for example only 15 columns may be viewed inside the IDE, but nothing show-stopping.
This entry is not meant to be a course in R; however, I’ll add some of my favorite links:
First, it is important to be able to read in CSV and Excel files. The ability to read CSVs is built into R; to load Excel documents, the readxl package (which provides the read_excel() function) will need to be installed. R has a package manager, allowing libraries to be easily added. Pretty much any package can be installed from the command line using install.packages("name_of_package"). For example:
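    # one-time install of the Excel-reading package
    install.packages("readxl")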
A package only needs to be installed once; however, every R session will need to load the library before making calls to its methods. For example,
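    # load the package for this session
    library(readxl)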
After the package has been installed and added to the session, we should be able to import all sorts of data into R using something like the following:
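The file names here are just placeholders; point them at wherever your HMIS exports live:

    client     <- read.csv("Client.csv", stringsAsFactors = FALSE)
    enrollment <- read.csv("Enrollment.csv", stringsAsFactors = FALSE)
    # an Excel file would be read similarly, e.g. pos <- read_excel("PointOfService.xlsx")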
This creates two data frames. One thing I found to be necessary for later functions is the ability to rename column headers. This can be done using something like the following:
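The old column name below is hypothetical; the point is renaming anything with spaces or other special characters:

    # rename a column whose name would trip up SQLite
    colnames(client)[colnames(client) == "Personal ID"] <- "PersonalID"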
This is important later when SQL functions are used inside of R, since SQLite doesn’t like special characters in column names, and the workarounds make the SQL code verbose.
The most important thing which can be done by data people is merging data sets. I’ve only started on this journey, but it looks to be an art which requires mastery to be effective. But to get us going, here’s how to perform a left join in R:
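A sketch using base R’s merge() and the client and enrollment data frames from above:

    # keep every row of client; attach matching enrollment data where it exists
    allData <- merge(client, enrollment, by = "PersonalID", all.x = TRUE)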
If you’re pretty sharp, or a data scientist, you might notice the flaw in the merge above. The HMIS Client.csv should only have one record per participant, but the relationship from Client.csv to Enrollment.csv is one-to-many, meaning each client could have multiple project enrollments. This makes the above code somewhat unpredictable, and I’ve no time to explore the results. Instead, I’ve focused on taking the most recent entry from Enrollment.csv. This can be done with some SQL code.
The SQL to R
Professional data folk may wonder why I’ve chosen to mix R and SQL. Well, it may not be the best reason or explanation, but here goes. R is a powerful tool, but often the syntax is boggish; it is hard to read and figure out what’s going on. SQL, on the other hand, is pretty intuitive. I’m looking to solve problems as quickly as possible, and I’ve found that by mixing the two I get to solutions much more quickly. Often it is a trade-off: if a SQL query is running too slowly, I look for an R solution, and if I’ve re-read an R statement twenty times without being able to spot a bug, then I find a SQL solution. For me, it’s about getting to the result as quickly as possible.
A second reason to mix in SQL is about respect and marketability. R seems to be gaining ground in a lot of the data sciences, and seems to be the tool when it comes to economics and statistics; however, most data exchanges have SQL at their heart. Therefore, when I can use my work as an excuse to develop a marketable skill, I’m going to do it.
If someone still has problems with those assertions, feel free to hate away in the comments below.
Alright, how does one mix SQL into R? It centers around the package sqldf. This package can be installed and added to a session with the following:
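    install.packages("sqldf")   # one-time install
    library(sqldf)              # load for this session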
Underneath the hood of sqldf is SQLite; this is important to note when it comes to debugging SQL queries in R, as we will see in a moment.
But, to get us kicked off, let’s look at how sqldf works in R.
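A minimal example, assuming Client.csv has already been read into a data frame named client as above:

    # pull every PersonalID from the client data frame into a new data frame
    personalIDs <- sqldf("SELECT PersonalID FROM client")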
This is a sample of how sqldf works. Basically, sqldf() makes a SQLite query call against the data frames in the session and returns the results. Here, the vector of PersonalIDs was taken from the Client.csv data and put into a data frame called personalIDs. And that’s pretty much it.
Here’s an example in the context of HMIS CSV data.
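A sketch of the sort of query mentioned earlier: grabbing each participant’s most recent project entry date from the enrollment data frame (column names follow the example tables above):

    mostRecentEntries <- sqldf("SELECT PersonalID, MAX(EntryDate) AS MostRecentEntryDate
                                FROM enrollment
                                GROUP BY PersonalID")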
Alright, from here on out I’m going to outline the SQL queries separately; just know the SQL query will need to be inserted into the sqldf("") call.
Ok – I’m going to stop this article here, since it seems to have gotten us going. However, I’ll continue adding to this series as I write useful queries for HMIS data.