Creating a List of Domestic Violence Victims
In this first work challenge you will use R and SQL to get a by-name-list of those who are domestic violence victims from an HMIS data pull (5.1).
Data Needed
The HMIS Data Pulls are simply a relational database which are broken into multiple CSVs. These CSVs will change in formatting, as stipulated by HUD. The current version of these CSVs is 5.1. For this work challenge the focus will be on two CSVs.
- Client.csv
- HealthAndDV.csv
The
Client
file will contain one row per client and only one row. Each row will contain most of all the Client's demographic information.
The
HealthAndDV
file will contain a row for every HUD Entry Assessment completed on the participant. Each entry will contain general health information as well the client's domestic violence information, which the client reported during the HUD Entry Assessment.
What ties the two files together is the
PersonalID
column. This ID is meant to be the grand-mamma of all client IDs. It is 32 characters long and contain both numbers and letter:
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB
(note, this ID is called the "Enterprise ID" in our HMIS software)
Both the
Client
and
HealthAndDV
contain the
PersonalID
column. This often referred to as a
key
when dealing with relational databases. This unique ID in both files will allow us to merge the two data sets together using something called
joins
.
Now, in the Client.csv the information is readable to us humans. There will be the
FirstName
,
LastName
,
SSN
columns and many more. But in the HealthAndDV.csv the information is trickier. For this challenge we are going to focus on one column
DomesticViolenceVictim
. When you open the data set you may notice that instead of "Yes" or "No" answers in the columns you will see "1" or "0". That's because computers understand the 1 and 0 much quicker than Yes or No.
This is an important side note for managing data. Make the databases easy for computers to understand and report generation will be much faster. You have to think, if using a 1 instead of Yes could save 0.5 seconds on a calculation, then when you have a dataset which contains 1000 records you just saved 500 seconds 8.3 seconds. Now, multiply by 1,700,000 records. Well, you get the picture.
Ok. Back to the problem at hand. Just know "1" is equal to "Yes" and "0" is equal to "No". So, for this challenge, we will want to find all the clients who have a "1" in the
DomesticViolenceVictim
column
The Goal
We are going to merge the two data sets and to discover the following:
- A list of clients who are victims of domestic violence.
- A count of how many clients are fleeing domestic violence.
Really, the second piece of information is counting how many people are in the list of those who are victims.
To get this information we will need to do the following:
- Load the Client.csv and HealthAnd.csv
-
Filter the HealthAndDV dataset to the most recent according to the column
DateCreated
-
Join (merge) the dataframes where their
PersonalID
are the same -
Filter the merged dataframe to those who've reported
1
inDomesticViolenceVictim
- Write this data to a file.
- Use a function to count how many participants are in this victim list.
The Resources
Below are the resources which should help you understand each step of the process.
Step 1
- R Programming A-Z -- Video 41 -- Loading and Importing Data in R
- R Programming A-Z -- Video 21 -- Functions in R
Step 2
- The Complete SQL Bootcamp -- All Videos in Section 5
- The Complete SQL Bootcamp -- All Videos in Section 6
- Working with Dates in SQLdf
Step 3
- The Complete SQL Bootcamp -- All Videos in Section 8
Step 4 --
Step 5 --
Step 6 --
- The Complete SQL Bootcamp -- All Videos in Section 5
- The Complete SQL Bootcamp -- All Videos in Section 6