Shaping and Combining HMIS Data from ETO

Continuing to explore R and SQL’s usefulness when it comes to HMIS data, I’ve decided to start posting HMIS problems along with R and SQL solutions.

Problem:  Our HMIS implementation has had three shelters entering data into one partition.  This has been a lot like mixing three different colors of sand into one bucket–much easier to put in than to sort out.  It is also a problem since HUD requires Emergency Solutions Grant (ESG) recipients to provide an annual standardized report, known as the CAPER, on data defined by the HMIS Data Dictionary. These data elements are referred to as Universal Data Elements.  With this mixed bucket of data, responsibility for errors becomes a big issue.  The CAPER only allows up to 25% missing data, which makes data quality important.  As for data repair, it should be completed by the agency which created the error, which makes communicating data issues imperative.

Unfortunately, when data from several agencies are mixed, creating an error report is problematic—at least for the HMIS software our continuum of care utilizes.  The data quality reports our HMIS office produces lump all errors together.  This leads to social loafing between the agencies and, in turn, few data repairs.

Solution:  The solution seems to be to sort the data back out, re-assigning it to the respective agency’s partition.  This would allow data quality reports to assign responsibility for repairs.  Currently, our COC uses Social Solutions’ ETO software to manage our HMIS.  The process of moving the data consists of the following steps:

  1. Determine all data which needs to be migrated.  For us, this is Demographic, HUD Assessment, and Bed Stay data.  
  2. Export these data sets.
  3. Sort the data sets to respective agencies.
  4. Import the data using a predefined template.

This article focuses on the third step.  The data has been exported, but how to sort it?

Below is a script written to take a flat file of HUD Assessments and 

  1. Filter to the respective program
  2. Filter HUD Assessments to just Project Entry
  3. Repair the COC code (e.g., “tx601” -> “TX-601”)
  4. Re-assign the data to the proper Site (agency’s data) and Program.
  5. Chop data into sets of no more than 500 rows, making the import process easier
  6. Write the data out to files.

It’s pretty hackish, but it worked.

library("sqldf")
library("readxl")

# Script Settings --------------------------------------------------------

# Name of output file, not including CSV extension.
outputFileName <- "HUD Assessments to Upload - v01 -"

# Paths to input files.
inputFileNameforHUDAssessments <- "TCES HUD Assesment Batch Report of Active Participants on 12-31-2016 - v09.xlsx"
inputFileNameforKeysToMatch <- "PEIDs of TCES Active Participants.xlsx"

# Names of target Site and Program
siteName <- "The Salvation Army Mabee Center"
programName <- "TSA Emergency Shelter Check In"

# ----------------------------------------

# Function to split a data frame into chunks and write each chunk to a CSV file.
splitDataAndWriteFiles <- function(df, chunkSize, filename) {
  success <- FALSE
  count <- 0
  while (!success) {
    # Compute this chunk's row range, clamped so we never index past the data.
    start <- (count * chunkSize) + 1
    end <- min((count + 1) * chunkSize, nrow(df))
    print(paste(start, "  ", end))
    chunk <- df[start:end, ]
    fileName <- paste0(filename, "_", count, ".csv")
    # Write out this chunk of the Active HUD Assessments.
    write.csv(chunk, file = fileName, na = "", row.names = FALSE)
    count <- count + 1
    # Stop once every row has been written.
    success <- (count * chunkSize) >= nrow(df)
  }
  return(success)
}

# Load all HUD Data
hudAssRaw <- read_excel(inputFileNameforHUDAssessments, na = "")

# Re-title columns for easier handling.
colnames(hudAssRaw)[1] <- "peid"
colnames(hudAssRaw)[11] <- "Relation_to_HH"
colnames(hudAssRaw)[12] <- "COC_Code"

# Repairs the COC code for heads of household.
hudAssRaw$COC_Code[with(hudAssRaw, Relation_to_HH == "Self (head of household)")] <- "TX-601"

# Subset Project Entry data.
hudAssRaw <- sqldf("SELECT * FROM hudAssRaw WHERE [At what point is this data being collected?_2270] = 'Project Entry'")

# Re-assign the data to the target Site and Program.
hudAssRaw$'Program Name' <- programName
hudAssRaw$'Site Name' <- siteName
colnames(hudAssRaw)[13] <- "A-3 What is the client's relationship t_2272"
colnames(hudAssRaw)[14] <- "A-5 HUD-assigned CoC code for the clien_2273"

# Set the dates back to YYYY-MM-DD
#hudAssRaw$`Response Date` <- as.Date(hudAssRaw$`Response Date`, "%Y%m%d")
#hudAssRaw$`DOB` <- as.Date(hudAssRaw$`DOB`, "%Y%m%d")
#hudAssRaw$`A-57 Approximate date homelessness star_6115` <- as.Date(hudAssRaw$`A-57 Approximate date homelessness star_6115`, "%Y%m%d")
#hudAssRaw$`A-58 Approximate date homelessness star_6116` <- as.Date(hudAssRaw$`A-58 Approximate date homelessness star_6116`, "%Y%m%d")

# Get target site Participant IDs
targetSiteParticipantIDs <- read_excel(inputFileNameforKeysToMatch)

assessmentsWithTargetPID <- sqldf("SELECT * FROM targetSiteParticipantIDs
                                  INNER JOIN hudAssRaw   
                                  ON hudAssRaw.'Case Number'=targetSiteParticipantIDs.'Case Number'")

# Free up space.
rm(hudAssRaw)
rm(targetSiteParticipantIDs)

colnames(assessmentsWithTargetPID)[1] <- "pid"
colnames(assessmentsWithTargetPID)[12] <- "rid"

# INNER JOIN on self to get -only- the first HUD Assessment
# Thanks SO! http://stackoverflow.com/questions/7745609/sql-select-only-rows-with-max-value-on-a-column
assessmentsWithTargetPID <- sqldf("SELECT *
              FROM assessmentsWithTargetPID a
               INNER JOIN (
                  SELECT pid, MIN(rid) rid
                  FROM assessmentsWithTargetPID
                  GROUP BY pid
                ) b ON a.pid = b.pid AND a.rid = b.rid
              ")

# Remove PEID
assessmentsWithTargetPID <- subset(assessmentsWithTargetPID, select = -c(peid,peid.1))

write.csv(assessmentsWithTargetPID, file = "First HUD Entry Assessments for ESCI.csv", na = "", row.names = FALSE)

# Split the data into chunks and write to files.
splitDataAndWriteFiles(assessmentsWithTargetPID, 500, outputFileName)

R Function to Split CSVs

This is an R function written to split a data set into sets of a particular size, then write them out as CSVs.  Our office often needs a quick way to split files for uploading purposes, since our HMIS software doesn’t handle large uploads well.

For example:

splitDataAndWriteFiles(df, 500, "My_Data")  

Will produce N files named “My_Data_0.csv”, “My_Data_1.csv”, and so on.

options(java.parameters = "-Xmx14336m")  ## memory set to 14 GB
library("XLConnect")

# Function to split a data frame into chunks and write each chunk to a CSV file.
splitDataAndWriteFiles <- function(df, chunkSize, nameOfFiles) {
  success <- FALSE
  count <- 0
  while (!success) {
    # Compute this chunk's row range, clamped so we never index past the data.
    start <- (count * chunkSize) + 1
    end <- min((count + 1) * chunkSize, nrow(df))
    print(paste(start, "_", end))
    chunk <- df[start:end, ]
    fileName <- paste0(nameOfFiles, "_", count, ".csv")
    # Write out this chunk.
    write.csv(chunk, file = fileName, na = "", row.names = FALSE, fileEncoding = "UTF-8")
    count <- count + 1
    # Stop once every row has been written.
    success <- (count * chunkSize) >= nrow(df)
  }
  return(success)
}

fileToSplit <- read.csv("UPLOAD -- Sal Men-- TCES Move -- TSA Bed Data Template.csv")

splitDataAndWriteFiles(fileToSplit, 5000, "Sal_Men_NBN")

Attachment III, aka, The Zombie

Problem: One of our Emergency Solutions Grant (ESG) funders requires the subrecipients to produce a report of all the participants who receive services from the shelter.  This requirement was written into the contract the ESG funders have with the subrecipient.  This worked somewhat in 2011 when it was implemented; however, it is 2016 and the data set which is queried to produce the report is well over 1.3 million entries.  These data are generated every time a participant checks in at a shelter for a meal, a bed, or just to sit.  It is unlikely this data set will ever get smaller.  In short, the data have grown beyond the report server currently provided by our software vendor.  Since the query is handled server-side, this has resulted in the subrecipients being unable to reliably meet the requirement.

Solution:
In an attempt to circumvent the server-side query, I’ve written an R and SQL script which takes two data files:

  1. Point of Service (PoS)
  2. Demographic and HUD Assessment data (enrolledInTCES)

These data were pulled using the software’s report services, but without any query.  This seems to bypass the server-side bottleneck.  The script then merges the data, formats it, and aggregates it for the final report.

The report should be something which could be run using a batch file, so I hope to deploy it to the subrecipients.  Then, with a little training, it should allow them to continue to produce the report for the funders.

nameOfMonth <- readline("Provide the month for which the data was pulled: \n")
nameOfAgency <- readline("Provide the name of your agency: \n")

library("sqldf")
library("readxl")

# Load data
enrolledInTCES <- read_excel("Attachment III Date for Oct and Dec 2016 -- DRC Program.xlsx")
PoS <- read_excel("DRC PoS Check In for November 2016.xlsx")
colnames(enrolledInTCES)[1] <- "peid"
colnames(PoS)[1] <- "peid"

# Remove duplicate check-ins, keeping one peid per client.
PoS <- sqldf("SELECT DISTINCT peid FROM PoS")

# Join PoS and enrolledInTCES
attachmentIIIClientData <- merge(PoS, enrolledInTCES, by="peid")

# Rename columns for easier handling.
colnames(attachmentIIIClientData)[8] <- "ethnicity"
colnames(attachmentIIIClientData)[9] <- "tmi"
colnames(attachmentIIIClientData)[10] <- "race"
colnames(attachmentIIIClientData)[11] <- "disability"
colnames(attachmentIIIClientData)[12] <- "FSI"
colnames(attachmentIIIClientData)[13] <- "gender"
colnames(attachmentIIIClientData)[14] <- "HHSize"

# Replaces all NAs with 0
attachmentIIIClientData[is.na(attachmentIIIClientData)] <- 0
# Create annual income column
attachmentIIIClientData$annualIncome <- attachmentIIIClientData$tmi * 12
# Remove decimals.
attachmentIIIClientData$annualIncome <- as.integer(attachmentIIIClientData$annualIncome)

# AMI Matrix
# N = Number of People in a Household
#  <30%AMI	  30%-50% AMI	 51-80% AMI	 >80% AMI
# 1<$14600	  1<$24300	   1<$38850	  1>$38850
# 2<$16650	  2<$27800	   2<$44400	  2>$44400
# 3<$18750	  3<$31250	   3<$49950	  3>$49950
# 4<$20800	  4<$34700	   4<$55500	  4>$55500
# 5<$22500	  5<$37500	   5<$59950	  5>$59950
# 6<$24150	  6<$40300	   6<$64400	  6>$64400
# 7<$25800	  7<$43050	   7<$68850	  7>$68850
# 8<$27500	  8<$45850     8<$73300 	8>$73300

ami <- sqldf("SELECT 

              (
              SUM (CASE WHEN annualIncome < 14600 AND HHSize == 1 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 16650 AND HHSize == 2 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 18750 AND HHSize == 3 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 20800 AND HHSize == 4 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 22500 AND HHSize == 5 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 24150 AND HHSize == 6 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 25800 AND HHSize == 7 THEN 1 ELSE 0 END) +
              SUM (CASE WHEN annualIncome < 27500 AND HHSize > 7 THEN 1 ELSE 0 END)) as '<30% AMI',

              (
              SUM(CASE WHEN annualIncome < 24300 AND HHSize == 1 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 27800 AND HHSize == 2 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 31250 AND HHSize == 3 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 34700 AND HHSize == 4 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 37500 AND HHSize == 5 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 40300 AND HHSize == 6 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 43050 AND HHSize == 7 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 45850 AND HHSize > 7 THEN 1 ELSE 0 END)) as '30-50% AMI',

              (
              SUM(CASE WHEN annualIncome < 38850 AND HHSize == 1 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 44400 AND HHSize == 2 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 49950 AND HHSize == 3 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 55500 AND HHSize == 4 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 59950 AND HHSize == 5 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 64400 AND HHSize == 6 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 68850 AND HHSize == 7 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome < 73300 AND HHSize > 7 THEN 1 ELSE 0 END)) as '51-80% AMI',

              (
              SUM(CASE WHEN annualIncome > 38850 AND HHSize == 1 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 44400 AND HHSize == 2 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 49950 AND HHSize == 3 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 55500 AND HHSize == 4 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 59950 AND HHSize == 5 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 64400 AND HHSize == 6 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 68850 AND HHSize == 7 THEN 1 ELSE 0 END) +
              SUM(CASE WHEN annualIncome > 73300 AND HHSize > 7 THEN 1 ELSE 0 END)) as '>80% AMI'

              FROM attachmentIIIClientData")

# Remove duplicate counts.
ami$'30-50% AMI' <- (ami$'30-50% AMI' - ami$'<30% AMI')
ami$'51-80% AMI' <- (ami$'51-80% AMI' - ami$'<30% AMI' - ami$'30-50% AMI')
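# For example, if 10 clients fall under the 30% AMI cut-offs and 15 fall under
# the 50% cut-offs, the 30-50% bracket is 15 - 10 = 5; the CASE sums above are
# cumulative, since anyone under a 30% cut-off is also under the 50% cut-off.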

# Aggregates data for Attachment III.
attachmentIIIAggregate <- sqldf("SELECT 

                 COUNT(peid) as 'Total Participants',
                 SUM(CASE WHEN disability = 'Yes' THEN 1 ELSE 0 END) as DisabledCount,

                 SUM(CASE WHEN age < 5 THEN 1 ELSE 0 END) as 'Under 5',
                 SUM(CASE WHEN age > 4 AND age < 13 THEN 1 ELSE 0 END) as '5 to 12',
                 SUM(CASE WHEN age > 12 AND age < 18 THEN 1 ELSE 0 END) as '13 to 17',
                 SUM(CASE WHEN age > 17 AND age < 25 THEN 1 ELSE 0 END) as '18 to 24',
                 SUM(CASE WHEN age > 24 AND age < 35 THEN 1 ELSE 0 END) as '25 to 34',
                 SUM(CASE WHEN age > 34 AND age < 45 THEN 1 ELSE 0 END) as '35 to 44',
                 SUM(CASE WHEN age > 44 AND age < 55 THEN 1 ELSE 0 END) as '45 to 54',
                 SUM(CASE WHEN age > 54 AND age < 62 THEN 1 ELSE 0 END) as '55 to 61',
                 SUM(CASE WHEN age > 61 THEN 1 ELSE 0 END) as '62+',

                 SUM(CASE WHEN race  = 'Black or African American' THEN 1 ELSE 0 END) as 'Black or African American',
                 SUM(CASE WHEN race  = 'White' THEN 1 ELSE 0 END) as 'White',
                 SUM(CASE WHEN race  = 'American Indian or Alaska Native' THEN 1 ELSE 0 END) as 'American Indian or Alaska Native',
                 SUM(CASE WHEN race  = 'Asian' THEN 1 ELSE 0 END) as 'Asian',
                 SUM(CASE WHEN race  = 'Native Hawaiian or Other Pacific Islander' THEN 1 ELSE 0 END) as 'Native Hawaiian or Other Pacific Islander',
                 SUM(CASE WHEN race = 'Black or African American' OR
                 race = 'White' OR
                 race = 'American Indian or Alaska Native' OR
                 race = 'Asian' OR
                 race = 'Native Hawaiian or Other Pacific Islander'
                 THEN 0 ELSE 1 END) as 'Other',

                 SUM(CASE WHEN ethnicity  = 'Non-Hispanic/Non-Latino' THEN 1 ELSE 0 END) as 'Non-Hispanic/Non-Latino',
                 SUM(CASE WHEN ethnicity  = 'Hispanic/Latino' THEN 1 ELSE 0 END) as 'Hispanic/Latino',

                 SUM(CASE WHEN gender = 'Male' THEN 1 ELSE 0 END) as 'Male',
                 SUM(CASE WHEN gender = 'Female' THEN 1 ELSE 0 END) as 'Female'
                 FROM attachmentIIIClientData")

# Both results are a single row, so bind the AMI columns onto the aggregate.
attachmentIIIAggregate <- cbind(attachmentIIIAggregate, ami)

aggregateFileNameString <- paste(nameOfMonth, "_", nameOfAgency, "_attachment_III_aggregate.csv")
write.csv(attachmentIIIAggregate, file = aggregateFileNameString, row.names = FALSE)

clientDataFileNameString <- paste(nameOfMonth, "_", nameOfAgency, "_attachment_III_client_data.csv")
write.csv(attachmentIIIClientData, file = clientDataFileNameString, row.names = FALSE)

HMIS, R, and SQL -- Basics

Hacker Introduction

I’m a hacker.  If you find errors, please leave comments below.  If you have an opinion I’ll hear it, but I’m often not likely to agree without some argument.

Joins (Merging Data)

Probably the best part of R and SQL is their ability to quickly combine data around a key.  For example, in the HMIS CSVs the Client.csv contains a lot of demographic information and the Enrollment.csv contains a lot of assessment information.  This makes it difficult to get a count of the total participants who are veterans and disabled, since the veteran information is in Client.csv and the disability information is in Enrollment.csv.  However, both R and SQL contain join functions.

Joins are a hugely expansive topic, and I’m not going to try to cover all their quirks here.

The two useful joins for HMIS data are LEFT JOIN and INNER JOIN.  A left join keeps all the rows in the left table, plus the matching rows from the right table; an inner join keeps only the rows which match in both tables.

Here’s an example in the context of the Client.csv and Enrollment.csv:

Client.csv

PersonalID FirstName VeteranStatus
12345 Jane Yes
54321 Joe No

Enrollment.csv

PersonalID FirstName DisablingCondition
12345 Jane Yes
54321 Joe No
45321 Sven Yes

Here are the two join statements and their results for the data above

SELECT * 
   FROM enrollment a 
   LEFT JOIN client b ON a.PersonalID=b.PersonalID

This join should result in the following:

PersonalID FirstName VeteranStatus DisablingCondition
12345 Jane Yes Yes
54321 Joe No No
45321 Sven NULL Yes

Notice Sven was kept, even though he had no entry in the Client.csv.  Since he had no match there, his VeteranStatus comes back NULL.
And the inner join would look like this:

SELECT * 
       FROM client a 
       INNER JOIN enrollment b ON a.PersonalID=b.PersonalID

This join should result in the following:

PersonalID FirstName VeteranStatus DisablingCondition
12345 Jane Yes Yes
54321 Joe No No
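
For what it’s worth, the same two joins can be sketched in R with merge(); this is a minimal sketch, assuming the client and enrollment data-frames above:

# Inner join: keep only the clients who appear in both files.
innerJoined <- merge(client, enrollment, by = "PersonalID")

# Left join: keep every row of enrollment, matched or not.
leftJoined <- merge(enrollment, client, by = "PersonalID", all.x = TRUE)

Unmatched rows come back as NA rather than NULL, but the shape is the same.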

Counts

PersonalID <- sqldf("SELECT DISTINCT PersonalID FROM client")

The method above creates a data-frame of all the PersonalIDs in the client data-frame, which came from the Client.csv.  The DISTINCT command keeps only one copy of any ID which appears multiple times.  In short, it creates a de-duplicated list of participants.

For example,

PersonalID  OtherData
12345 xxxxxxxxx
56839 xxxxxxxxx
12345 xxxxxxxxx
32453 xxxxxxxxx

Should result in the following,

PersonalID
12345
56839
32453

This is useful in creating a key vector, given other CSVs have a one-to-many relationship for the PersonalID.  For example,

The Enrollment.csv looks something like this

PersonalID ProjectEntryID  EntryDate
12345 34523 2016-12-01
56839 24523 2015-09-23
12345 23443 2014-01-10
32453 32454 2015-12-30

This reflects a client (i.e., 12345) entering a project twice, once on 2014-01-10 and again on 2016-12-01.
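
As a quick sketch of how that one-to-many relationship shows up in a query (assuming an enrollment data-frame loaded from Enrollment.csv):

# Count how many project entries each client has.
entriesPerClient <- sqldf("SELECT PersonalID, COUNT(*) as Entries
                           FROM enrollment
                           GROUP BY PersonalID")

Client 12345 would come back with an Entries value of 2.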

Count of Total Participants:

SELECT COUNT(PersonalID) as 'Total Participants' FROM client

This query should give a one-row output, counting the number of client records in the data-frame.

Total Participants
1 1609

However, if there are duplicate PersonalIDs, it’ll count each entry separately.  To get a count of unique clients in a data-frame, add the DISTINCT command.

SELECT COUNT(DISTINCT(PersonalID)) as 'Unique Total Participants' FROM client
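
The same count can be had in base R without a query; a minimal sketch, assuming the client data-frame above:

# De-duplicated participant count, equivalent to COUNT(DISTINCT(PersonalID)).
length(unique(client$PersonalID))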

Conditional Data

Often in HMIS data it is necessary to find a collection of participants who meet a specific requirement.  For example, “How many people in this data set are disabled?”  This is where the WHERE statement helps a lot.

SELECT PersonalID FROM clientAndEnrollment WHERE disability = 'Yes'

This statement will return a vector of all the PersonalIDs of participants who stated they were disabled.  The total participants query could then be run against the result, but there is an alternative method.

SELECT SUM(CASE WHEN 
               disability = 'Yes' THEN 1 ELSE 0 
           END) as DisabledCount
       FROM clientAndEnrollment

The above statement uses the CASE WHEN ... END statement, which I understand as SQL’s version of the IF statement.  Here’s the C equivalent:

for(int i = 0; i < total_participants; i++){
    if(disability == true){
       disabilityCounter++;
    }
}

BOOL!

Boolean operators can be used to get more complex conditional data:

SELECT PersonalID FROM clientAndEnrollment 
       WHERE disability = 'Yes' 
       AND gender = 'Female'

This statement will provide a vector of all the PersonalIDs for clients who are disabled and female.
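
The two ideas combine naturally; a minimal sketch, assuming the clientAndEnrollment data-frame above:

# Count unique disabled female participants in one query.
disabledFemaleCount <- sqldf("SELECT COUNT(DISTINCT(PersonalID)) as 'Disabled Female Count'
                              FROM clientAndEnrollment
                              WHERE disability = 'Yes'
                              AND gender = 'Female'")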

Ok, good stopping point for now.

HMIS, R, and SQL -- Introduction

“I love our software, I love our software.”

I’m an HMIS Database Manager for a living.  It’s a dream job–all the nerdy stuff, plus there is a possibility I’m helping people.  Currently, one area where our software really lacks is quickly generating complex reports. It has the ability, but the servers are laggy, it crashes often, and a project which should take 20 minutes will take 50 minutes to 40 hours depending on the “report weather.”  These issues are probably caused by the reporting platform being web-based, with calculations done server-side.  Regardless, given the amount of time the staff are losing on report projects, I’ve decided to explore alternative systems for generating some of our needed reports.

Luckily, HUD has dictated an HMIS data format, often known as the “CSV version.”  The specification of these data sets is outlined in HUD’s HMIS data standards documentation.

These data standards are currently on version 5.1; however, HUD issues tweaks to these standards every October 1st.  The point is, if the data are standardized, they should be easy to manipulate using local tools.

Here are a few pros of exploring local reporting tools:

  • Software vendor agnostic
  • No bottleneck due to routing issues
  • Greater flexibility of reporting
  • No outage concerns
  • More control on optimization of queries

And the cons:

  • Somewhat more difficult to deploy to end-users (integration would probably be through batch files or Excel VBA)

Before jumping into the alternatives, it is important to point out that HUD requires all HMIS software vendors to have the ability to export a set of CSV files which contain all the HUD-mandated data elements (also known as Universal Data Elements).  This export process is reliable, fast, and predictable–at least, from my experience.  As the alternative tools are explored, the data sets being used will most often be these HMIS CSVs; however, there will probably be other data our COC reports locally which will be joined to these CSVs using each participant’s unique ID.

Ok! Let’s take a look.

R

R gets me excited.  It is a programming language for data miners.  It is primarily C under the hood, which potentially makes it blazingly fast. R is meant to be a command-line interface, but I’m using RStudio as a convenient overlay.  RStudio has a few limitations, for example, only 15 columns may be viewed at a time inside the IDE, but nothing show-stopping.

This entry is not meant to be a course in R, however, I’ll add some of my favorite links:

  1. Coursera’s R Courses
  2. Johns Hopkins – R Programming

Here’s the toolchain:

  1. Mirror List for R Download (the CLI)
  2. RStudio (the R IDE used)

Alright, now we’ve got R going!

Ok, let’s dig into some code.

First, it is important to be able to read in CSV and Excel files.  The ability to read in CSVs is built into R.  To start loading Excel documents, the readxl package (which provides the read_excel function) will need to be installed. R has a package manager, allowing method libraries to be easily added.  Pretty much any package can be installed from the CLI using install.packages(“name_of_package”).  For example:

    # Installs the readxl package, which allows Excel files to be
    # read in as data-frames
    install.package("readxl")
    

A package only needs to be installed once, however, every R session will need to refer to the library before making calls to its methods.  For example,

    # Adds the readxl methods to this session.
    library("readxl")

After this package has been installed and added to the session, we should be able to import all sorts of data into R using the following:

    # Load data
    excelData <- read_excel("DRC PoS Check In for October 2016.xlsx")
    csvData <- read.csv("My_CSV_file.csv")

This creates two data-frames.  One action I found to be necessary for later functions is renaming column headers.  This can be done using the following:

    # Make the disability column easier to work with.
    colnames(data_frame)[col_number_to_rename] <- "new_name"

    # For example, renames the header of column 8 to "ethnicity"
    colnames(client)[8] <- "ethnicity"

This is important later when SQL functions are used inside of R, as the default column names often contain special characters SQLite doesn’t like, and the workarounds make the SQL code verbose.
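
If a whole file’s worth of headers needs cleaning, base R’s make.names() can sanitize every column at once; a hypothetical one-liner:

    # Replace spaces and punctuation in every header with SQLite-friendly names.
    colnames(client) <- make.names(colnames(client))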

The most important thing which can be done by data people is merging data sets.  I’ve only started on this journey, but it looks to be an art which requires mastery to be effective.  But to get us going, here’s how to merge two data-frames in R (by default, merge() keeps only the rows whose keys match, i.e., an inner join):

    # Join data_frame_2 to data_frame_1 where the "key" column matches.
    # By default, rows whose keys don't appear in both data-frames are dropped.
    combined_data_frames <- merge(data_frame_1, data_frame_2, by="key")

    # Here's a real example, using HUD HMIS CSVs
    client <- read.csv("Client.csv")
    enrollments <- read.csv("Enrollments.csv")
    client_and_hud_assessments <- merge(client, enrollments, by="PersonalID")

If you’re pretty sharp–or a data scientist–you might notice the flaw in the merger above.  The HMIS Client.csv should only have one record per participant, but the relationship from Client.csv to Enrollments.csv is one-to-many.  Meaning, each client could have multiple project enrollments.  This makes the above code somewhat unpredictable–and I’ve no time to explore the results.  Instead, I’ve focused on taking the most recent entry from Enrollments.csv.  This can be done using some SQL code, sketched below.
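
Here is a minimal sketch of that idea, using sqldf (introduced below) and assuming the client and enrollments data-frames above; it keeps only each client’s most recent EntryDate before merging:

    # For each client, keep only the enrollment with the latest EntryDate.
    latestEnrollments <- sqldf("SELECT a.*
                                FROM enrollments a
                                INNER JOIN (
                                   SELECT PersonalID, MAX(EntryDate) EntryDate
                                   FROM enrollments
                                   GROUP BY PersonalID
                                ) b ON a.PersonalID = b.PersonalID
                                   AND a.EntryDate = b.EntryDate")

    # Now the merge is one-to-one.
    client_and_latest_enrollment <- merge(client, latestEnrollments, by = "PersonalID")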

The SQL to R

Professional data folk may wonder why I’ve chosen to mix R and SQL.  Well, it may not be the best reason or explanation, but here goes.  R is a powerful tool, but often the syntax is boggish.  It is hard to read and figure out what’s going on.  SQL, on the other hand, is pretty intuitive.  For me, I’m looking to solve problems as quickly as possible, and I’ve found that by mixing the two I get to solutions much more quickly.  Often it is a trade-off: if a SQL query is running too slow, I look for an R solution, and if I’ve re-read an R statement twenty times without being able to spot a bug, then I find a SQL solution.  For me, it’s about getting to the result as quickly as possible.

A second reason to mix SQL is about respect and marketability.  R seems to be gaining ground in a lot of data sciences, and seems to be the tool when it comes to economics and statistics, however, most data exchanges have SQL at their heart.  Therefore, when I can use my work as an excuse to develop a marketable skill, I’m going to do it.  

If someone still has problems with those assertions, feel free to hate away in the comments below.

Alright, how does one mix SQL into R?  It centers around the package sqldf.  This package can be installed and added to a session with the following:

    # Install SQLdf package (run once)
    install.package("sqldf")

    # Adds sqldf to the current R session
    library("sqldf")

Underneath the hood of sqldf is SQLite; this is important to note when it comes to debugging SQL queries in R–as we will see in a moment.

But, to get us kicked off, let’s look at how sqldf works in R.

    result_df <- sqldf("YOUR SQL QUERY HERE")

This is a sample of how sqldf works.  Basically, sqldf() passes the query string to SQLite and returns the result as a data-frame.  And that’s pretty much it.

Here’s an example in the context of HMIS CSV data.

    # Add the SQLdf library
    library("sqldf)
    # Load Client CSV data
    client <- read.csv("Client.csv")
    # Get a vector of the client IDs from the Client.csv
    personalIDs <- sqldf("SELECT DISTINCT PersonalID FROM client")

Alright, from here on in I’m going to outline SQL queries separately; just know, the SQL query will need to be inserted into the sqldf(“”) call.

    SELECT DISTINCT PersonalID FROM client

Ok – I’m going to stop this article here, since it seems to have gotten us going.  However, I’ll continue adding to this series as I write useful queries for HMIS data.