This article is part of a series.
- Part 1 - HMIS, R, and SQL -- Introduction
- Part 2 - HMIS, R, and SQL -- Basics
- Part 3 - Preamble to Mixing R and SQL
- Part 4 - HMIS, R, SQL -- Work Challenge One
- Part 5 - Read and Write CSVs in R
- Part 6 - Working with Dates in SQLdf
- Part 7 - Filter to Most Recent HUD Assessment
- Part 8 - HMIS, R, SQL -- Work Challenge Two
- Part 9 - Comparing Values in R and SQL
- Part 10 - Working with R Excel Libraries
- Part 11 - This Article
- Part 12 - HMIS, R, SQL -- Work Challenge Four
- Part 13 - SQL CASE and R Paste
Creating a Data Error Report by User
In this work challenge we will combine Client and Enrollment data, then we will add a non-HMIS dataframe which contains user account information. After these data are merged, we will then parse them for missing data elements and provide a by-user list of data errors.
As stated above, the data needed are:
- A dataframe containg HMIS user contact info.
The key to this challenge literally is at the end of every HMIS CSV. Each exported CSV contains some metadata which describes how the data were produced.
DateCreated should represent when the respective row was actually entered into the HMIS.
DateUpdated is the last time that row was modified and saved in the HMIS. The
UserID is the case-manager who last modified these data. Lastly, the ExportID is the number which identifies a collection of HMIS CSVs to be in the same batch.
We are going to focus in on the
UserID element. Notice, you will not find the usernames, real names, email address, or really any contact information for individual HMIS users. However, having a unique user ID in each CSV would still allow HUD to use internal validity tests to determine the reliability of the user.
For us, we are going to take another source of data containing all of the UserIDs and contact information for the users. Now, this will probably be different each HMIS software vendor. But each vendor should have a way to export a list of the users in the system with their
UserID, which will allow us to join these data to the HMIS CSVs.
For those participating in the work challenge from my CoC, I’ll provide a CSV with these user data.
After actual user names are joined to the CSVs, then we will begin to parse the CSVs for data errors. If you aren’t yet familiar with the term parse in computer science, think of it as diagraming a setence where we make the computer do all the work. Instead of a sentence, we will be diagraming a row of data to determine if there are any errors.
What’s an HMIS Data Error?
The HMIS Data Dictionary is specific about what a data error is.
- 8 – Client doesn’t know
- 9 – Client refused
- 99 – Data not collected
- Incomplete response
- Non-determinable response
Here’s an example of a Client.csv which contains one of each type of error.
Here are the data errors:
- Tesa is first name blank
- Sarah’s DOB is blank
- Fela’s SSN is an incomplete response (must be 9 digits)
- Sarah’s SSN is non-determinable
- Sarah’s DisablingCondition was not collected.
- Tesa refused to provide a VeteranStatus.
We are going to take a HMIS data and join it with a dataframe containing end-user information. Then, we will create a query to subset the dataframe so we get a dataframe which contains only rows with data errors. Lastly, we will get counts of the types of data errors and the names end-users who’ve caused the most data errors.
The data elements we will look into for errors:
To get this information we will need to do the following:
- Load Client.csv, Enrollment.csv, and Users.xlsx
- Left join the clientDf and enrollmentDf.
- Left join the usersDf to the result of step 2.
- Parse the data elements listed above for data errors
- Create a dataframe which contains only rows with data errors
- Use the SQL Count function to count the number of data errors by the element list above.
- Use the SQL Count function to count how many times a end-users name is associated with a row containing errors.
- Create a dataframe of these counts
- Save the dataframe containing the error counts into an Excel file (.xlsx)
Below are the resources which should help for each step: