Creating a Data Error Report by User
In this work challenge we will combine Client and Enrollment data, then we will add a non-HMIS dataframe which contains user account information. After these data are merged, we will then parse them for missing data elements and provide a by-user list of data errors.
Data Needed
As stated above, the data needed are:
-
Client.csv
-
Enrollment.csv
- A dataframe containg HMIS user contact info.
The key to this challenge literally is at the end of every HMIS CSV. Each exported CSV contains some metadata which describes how the data were produced.
The
DateCreated
should represent when the respective row was actually entered into the HMIS.
DateUpdated
is the last time that row was modified and saved in the HMIS. The
UserID
is the case-manager who last modified these data. Lastly, the ExportID is the number which identifies a collection of HMIS CSVs to be in the same batch.
We are going to focus in on the
UserID
element. Notice, you will not find the usernames, real names, email address, or really any contact information for individual HMIS users. However, having a unique user ID in each CSV would still allow HUD to use
internal validity tests
to determine the reliability of the user.
For us, we are going to take another source of data containing all of the UserIDs
and
contact information for the users. Now, this will probably be different each HMIS software vendor. But each vendor should have a way to export a list of the users in the system with their
UserID
, which will allow us to join these data to the HMIS CSVs.
For those participating in the work challenge from my CoC, I'll provide a CSV with these user data.
After actual user names are joined to the CSVs, then we will begin to parse the CSVs for data errors. If you aren't yet familiar with the term parse in computer science, think of it as diagraming a setence where we make the computer do all the work. Instead of a sentence, we will be diagraming a row of data to determine if there are any errors.
What's an HMIS Data Error?
The HMIS Data Dictionary is specific about what a data error is.
- 8 -- Client doesn’t know
- 9 -- Client refused
- 99 -- Data not collected
- Blank
- Incomplete response
- Non-determinable response
Here's an example of a Client.csv which contains one of each type of error.
PersonalID | FirstName | LastName | DOB | SSN | DisablingCondition | VeteranStatus |
---|---|---|---|---|---|---|
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 | Tesa | 2010-01-01 | 123-45-6789 | 1 | 8 | |
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV | Fela | Falla | 1999-1-1 | 4321 | 1 | 1 |
LASDU89NRABVJWW779W4JGGAN90IQ5B2 | Sarah | Kerrigan | Blahblah | 99 | 0 |
Here are the data errors:
- Tesa is first name blank
- Sarah's DOB is blank
- Fela's SSN is an incomplete response (must be 9 digits)
- Sarah's SSN is non-determinable
- Sarah's DisablingCondition was not collected.
- Tesa refused to provide a VeteranStatus.
The Goal
We are going to take a HMIS data and join it with a dataframe containing end-user information. Then, we will create a query to subset the dataframe so we get a dataframe which contains only rows with data errors. Lastly, we will get counts of the types of data errors and the names end-users who've caused the most data errors.
The data elements we will look into for errors:
- FirstName
- LastName
- DOB
- VeteranStatus
- DisablingCondition
- RelationshipToHoH
To get this information we will need to do the following:
- Load Client.csv, Enrollment.csv, and Users.xlsx
- Left join the clientDf and enrollmentDf.
- Left join the usersDf to the result of step 2.
- Parse the data elements listed above for data errors
- Create a dataframe which contains only rows with data errors
- Use the SQL Count function to count the number of data errors by the element list above.
- Use the SQL Count function to count how many times a end-users name is associated with a row containing errors.
- Create a dataframe of these counts
- Save the dataframe containing the error counts into an Excel file (.xlsx)
The Resources
Below are the resources which should help for each step:
TBD