Hacker Introduction
I'm a hacker . If you find errors, please leave comments below. If you have an opinion I'll hear it, but I'm often not likely to agree without some argument.
Joins (Merging Data)
Probably the best part of R and SQL is their ability to quickly combine data around a key. For example, in HMIS CSVs the Client.csv contains a lot of demographic information and the Enrollment.csv contains a lot of assessment information. This makes it difficult when needing a count of the total participants who are veterans and disabled, since the veteran information is in Client.csv and disability information is in the Enrollment.csv. However, both R and SQL contain the join functions.
Joins are a hughely expansive topic; I'm not going to try to cover all their quirks, but here's some videos I found helpful:
The two useful joins for HMIS data are LEFT JOIN and INNER JOIN. The left join keeps all the data in the left table and data matching from the right table and the inner join keeps only data which matches.
Here's an example in the context of the Client.csv and Enrollment.csv:
Client.csv
PersonalID | FirstName | VeteranStatus |
12345 | Jane | Yes |
54321 | Joe | No |
Enrollment.csv
PersonalID | FirstName | DisablingCondition |
12345 | Jane | Yes |
54321 | Joe | No |
45321 | Sven | Yes |
Here are the two join statements and their results for the data above
SELECT *
FROM client a
LEFT JOIN enrollment b ON a.Personal=b.PersonalID
This join should result in the following:
PersonalID | FirstName | VeteranStatus | DisablingCondition |
12345 | Jane | Yes | Yes |
54321 | Joe | No | No |
45321 | Sven | NULL | Yes |
Notice Sven was kept, even though he had no entry the Client.csv. After the join, since he had no
And the inner join would look like this:
SELECT *
FROM client a
INNER JOIN enrollment b ON a.Personal=b.PersonalID
This join should result in the following:
PersonalID | FirstName | VeteranStatus | DisablingCondition |
12345 | Jane | Yes | Yes |
54321 | Joe | No | No |
Counts
PersonalID <- sqldf("SELECT DISTINCT PersonalID FROM client")
Method above creates a vector of all the PersonalIDs in the client data-frame, which came from the Client.csv. The DISTINCT command takes only one ID if there are more than two which are identical. In short, it create a de-duplicaed list of participants.
For example,
PersonalID | OtherData |
12345 | xxxxxxxxx |
56839 | xxxxxxxxx |
12345 | xxxxxxxxx |
32453 | xxxxxxxxx |
Should result in the following,
PersonalID |
12345 |
56839 |
32453 |
This is useful in creating a key vector, given other CSVs have a one-to-many relationship for the PersonalID. For example,
The Enrollment.csv looks something like this
PersonalID | ProjectEntryID | EntryDate |
12345 | 34523 | 2016-12-01 |
56839 | 24523 | 2015-09-23 |
12345 | 23443 | 2014-01-10 |
32453 | 32454 | 2015-12-30 |
This reflects a client (i.e., 12345) entering a project twice, once on 2014-01-10 and the other 2016-12-01.
Count of Total Participants:
SELECT COUNT(PersonalID) as 'Total Participants' FROM client
This query should give a on row output, counting the number of clients in the data-frame.
Total Participants | |
1 | 1609 |
However, if there are duplicate PersonalIDs it'll count each entry as an ID. To get a count of unique clients in a data-frame add the DISTINCT command.
SELECT COUNT(DISTINCT(PersonalID)) as 'Unique Total Participants' FROM client
Conditional Data
Often in HMIS data it is necessary to find a collection of participants which meet a specific requirement. For example, "How many people in this data-set are disabled?" This is where the WHERE statement helps a lot.
SELECT PersonlID FROM clientAndEnrollment WHERE disability = 'Yes'
This statement will return a vector of all the PersonalID's of participants who stated they were disabled. The total participant query could be used, but there is an alternative method.
SELECT SUM(CASE WHEN
disability = 'Yes' THEN 1 ELSE 0
END) as DisabledCount
The above statement uses the CASE WHEN END statement, which I understand as SQL's version of the IF statement. Here's C equivalent:
for(int i = 0; i < total_participants; i++)
if(disability == true){
disabilityCounter++;
}
}
BOOL!
Boolean operaters can be used to get more complex conditional data:
SELECT PersonalID FROM clientAndEnrollment
WHERE disability = 'Yes'
AND gender = 'Female'
This statement will provide a vector of all the PersonalID's for clients who are disabled and female.
Ok, good stopping point for now.