Filter to Most Recent HUD Assessment

Enrollment.csv

Many of the CSVs in the HMIS CSV may contain multiple rows per client. This can make it difficult when working with HMIS data, as it appears to the non-data person there are duplicates within your data.

Let’s look at some dataframes:

enrollmentDf

ProjectEntryID PersonalID FirstName EntryDate
L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016
169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015
XB52BYOGJ1YDFESNZVNGDGA58ITDML0A ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 01/01/2013

Notice how Bob has two records? One on 10/17/2016 and 01/01/2013. This represents two HUD Entry Assessments completed on Bob. These HUD Entry Assessments could represent two stays in the same program, or one stay in two programs.

Regardless, whenever you go to join this dataframe with a another dataframe, like the Client.csv, it will cause the resulting dataframe to have two rows representing both of Bob’s enrollments.

Let me walk us through joining the above dataframe with another dataframe.

We are going to join the enrollmentDf (above) with this clientDf

enrollmentDf

ProjectEntryID PersonalID FirstName EntryDate
L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016
169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015
XB52BYOGJ1YDFESNZVNGDGA58ITDML0A ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 01/01/2013

clientDf

PersonalID FirstName LastName
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall

In R, we can join these two dataframes with the following.

Please copy the code below to R and execute.

####### BEGIN LOADING DATA FRAMES ###############
enrollmentDf = data.frame(ProjectEntryID=c("L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q", "169ACC89JY5SX0U87U7HQ28PMMHNJEXQ", "XB52BYOGJ1YDFESNZVNGDGA58ITDML0A"), 
               PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV", "ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7"), 
               FirstName=c("Bob","Jane", "Bob"), 
               EntryDate=c("10/17/2016", "05/05/2015", "01/01/2013"))

clientDf = data.frame(PersonalID=c("ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7", "IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV"), 
                      FirstName=c("Bob","Jane"),
                      LastName=c("Beber", "Goodall"))
####### ENDS LOADING DATA FRAMES ###############

# Load the SQLdf package (note, it must be installed first. See install.packages())
library(sqldf)

# Join the two dataframes.
clientAndEnrollmentDf <- sqldf("SELECT * 
                               FROM clientDf 
                               LEFT JOIN enrollmentDf 
                               ON clientDf.PersonalID=enrollmentDf.PersonalID")

Important Sidenote

If you ever see the following error:

In field_types[] <- field_types[names(data)] : number of items to replace is not a multiple of replacement length

It’s a problem with going back-and-forth between R and SQL. To fix it, use the following code on the dataframe you are trying to work with before executing the line of code causing the error

dfCausingProblem <- subset(dfCausingProblem)

Ok, back to work.

After executing the code, you should end up with a table like this. Not too shabby.

PersonalID FirstName LastName ProjectEntryID PersonalID FirstName EntryDate
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber XB52BYOGJ1YDFESNZVNGDGA58ITDML0A ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 01/01/2013
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall 169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015

However, notice there are still rows for Bob? These aren’t technically duplicates. A duplicate is when there are two rows where items in every column are exactly the same. But in the case of the dataframe above, notice how the ProjectEntryID and EntryDate columns for Bob’s records are different?

As stated before, this is carried forth from Bob having two HUD Entry Assessments. But to the people whom we are going to present these data, it looks like duplicates. This is a problem because it will be seen as sloppy work (but remember, it’s not. It’s merely a technical artefact).

Who cares! How do we get rid of it?

First, we have to make a decision among three options. First, we can get only the most HUD Entry Assessment per client, only the the first HUD Entry Assessment ever taken per client, or leave it as it is.

The last option is out, so it’s a choice between most recent and the oldest one. In our world, homelessness usually gets worse and HUD wants us to be helping those who are the most vulnerable first, so the most recent is probably going to give us the best picture how vulnerable a client is right now.

Alright, how do we get the most recent HUD Assessment?

In SQL there is a function called MAX(). It will take the most recent of a record. Let’s look at how to use it, then we can discuss it.

For the following code to work, make sure all code above has been executed in R.

clientAndEnrollmentDf2 <- sqldf("SELECT *, MAX(EntryDate) FROM clientAndEnrollmentDf")

This should provide you with the following table:

PersonalID FirstName LastName ProjectEntryID PersonalID.1 FirstName.1 EntryDate MAX(EntryDate)
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016 10/17/2016

Wait! What happened to Jane!? Well, the MAX() took the row with the greatest EntryDate, which is 10/17/2016. Unfortunately, Jane’s record wasn’t the most recent, so her record was removed.

Well, that’s not going to work. But we’re close. If only we had a way to take the MAX(EntryDate) per client.

We do. SQL has another command called GROUP BY, which allows us to tell SQL to apply another command by a group of records.

Again, let’s use it and then dissect it.

Copy this to R and execute it.

clientAndEnrollmentDf3 <- sqldf("SELECT *, MAX(EntryDate) FROM clientAndEnrollmentDf GROUP BY PersonalID")

You should end up with a table like this:

PersonalID FirstName LastName ProjectEntryID PersonalID.1 FirstName.1 EntryDate MAX(EntryDate)
IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane Goodall 169ACC89JY5SX0U87U7HQ28PMMHNJEXQ IA26X38HOTOIBHYIRV8CKR5RDS8KNGHV Jane 05/05/2015 05/05/2015
ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob Beber L0TDCLTDEARVHNIQ4F9EDDKXJ764Z65Q ZP1U3EPU2FKAWI6K5US5LDV50KRI1LN7 Bob 10/17/2016 10/17/2016

Aha! That’s it!

What the GROUP BY did was say, “Ok, SQL, create a group of data where the rows PersonalID are the same. Now, for each group, take the row with the greatest EntryDate.”

This gives exactly what we want. A single row per participant.

Working with Dates in SQLdf

Mixing R and SQL is powerful. One of the easiest ways to implement this combination is with the R library SQLdf.

If TL;DR, skip to Coerce Date Types into Strings before Passing to SQLdf at bottom.

SQLdf

The power of SQLdf comes from its ability to convert dataframes into SQLite databases on the fly. To the user, it doesn’t appear like anything special is going on, but under the hood R is working together with a SQLite client to create a table which can be queried and manipulated with ANSI SQL calls.

For example,

dataFrame1 <- read.csv(pathToData)
library("sqldf")
dataFrame2 <- sqldf("SELECT FirstName FROM dataFrame")

These three lines do a lot. It loads data from a CSV, loads a library of functions for convert R dataframes into SQLite databases, and then the sqldf() function call does two things at once. It converts the R dataframe into a SQLite database and then queries it for the FirstName column.

If we were to assume the dataFrame1 variable contained data like this:

PersonalID FirstName LastName
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Person
ASGJ4F95HS85N39DJ12AJB094M59DJ45 Jane People

Then the dataFrame2 <- sqldf("SELECT FirstName FROM dataFrame) will create a variable called dataFrame2 which contains the FirstName column from dataFrame1

FirstName
Bob
Jane

And this is how we will shape our data in the R-SQL way.

Datatypes

One of the most important things a human can learn about computers is something called datatypes. When computers process information they need a little help from humans in attempt to understand what to do with the information. For example, what do these numbers mean to you?

76110, 444-325-7645, 10/24/1980

Most humans (at least in the United States) will know the first number is a ZIP code, the second a phone number, and last date. Humans know this because our brains have learned how to discern from context. In the case of the ZIP code, it’s exactly 5 numbers, the phone contains dashes at at exact places, and the date contains slashes in the exact spot we’d expect of a date.

Unfortunately, computers have a little more difficulty with this. Most computers are smart enough now days to know the phone number and date of birth, but the ZIP code will confuse the heck out of a computer.

A computer’s initial reaction in seeing the ZIP code is, “Oh, you mean 76,110. That’s a big number.” When really, this number represents a geographic location.

Ok, let’s make this more relevant to HMIS work. The way to help a computer understand what numbers are representing is by telling the computer what type of data a particular column is. This is known as a datatype. For us, we really only deal with a few datatypes, but their are hundreds of thousand of datatypes.

In R you can see what datatype a column of your dataframe is by clicking the blue button next to the dataframe name in the Global Environment variables.

We will be dealing with the following:

  • Dates (called “POSXct” in R)
  • Strings (called “chr” in R)
  • Numbers
  • Factors

Of course, every programming language can calls these datatypes by different names, thus, furthering confusion. (I mean, c’mon, programming is simple enough as it is–we’ve got to make it a little challenging.)

Dates

Date datatypes usually look like this:

10/24/1980

But it can come in many different formats. It’s probably best to differentiate between datatype and data format. A data type describes how the information should be used–it’s important for a computer. Data format describes how a computer should display information to the human–therefore, it’s useful for a human.

An example of different formats of the same data:

10/24/1980
1980-10-24
102480

Ok, back to the date datatype. It is used when dealing with dates. By declaring a variable as having a date datatype, it is telling the computer whatever we put into that variable to interpret as a date. Simple enough.

Strings

When we talk about string we aren’t talking about fuzzy things kittens chase. A string datatype is a series of characters (one part of a string) strung together. Anything can be a string. They are probably the most important datatype, since they can tell a computer to look at a number and see something else. Confused? Let’s look at some examples.

We tell a computer data is a string is by putting it in double quotes "this is a string" or single quotes 'this is also a string'.

Here’s an example of assigning a string in R:

myFirstString <- "this is a string"

Great! But what can we do with it? Well, a lot.

Let’s say we wanted to pass a path of a file to a read.csv() function. We could do so by providing the path as a string.

dataFrame <- read.csv("/Users/user/Downloads/Client.csv")

The above will load the Client.csv file located at the /Users/user/Downloads/ directory–the computer knows how to read the path because it’s a string.

But why are strings so important? Well, they allow us to tell a computer to override its basic instinct and view a piece of data as something other than what the computer would guess it is.

Returning to the ZIP code.

notAZipCode <- 76110
myZipCode <- "76110"

These variable assignments may seem to be exactly the same, however, the first one creates a variable as a number, but the second tells the computer, “This is a not a number, it is something else–please don’t pretend to understand it. I’ll tell you exactly what to do with it later.”

Numbers

Number datatypes are easy. It’s pretty much how a computer tries to look at all data you provide it. One important characteristic about numbers, you can have the computer perform math functions on numbers, which you couldn’t on strings.

dataframe1 <- 12 * 12 
datafram2 <- "12" * "12"

Above, dataframe1 will contain 144 after being executed. But if the second line is attempted to be executed the computer will freak out, “This is a string! I can’t add strings! You told me you’d tell me what to do with them…”

Factors

Factors are a special datatype in R. Most of all the variables we load in R will start out as factors. Essentially, factors are categories of data.

Red,
Orange,
Green,
Blue,
Indigo,
Violet

Is an example of factors. They are categories of data. The important of factors will become more evident as we work through these tutorials in R.

If you don’t understand factors, it’s cool. Just think of them as strings. However, if you don’t understand strings, please ask any questions in comments below. Understanding them is critical to working with SQLdf.

SQLdf and Datatypes

Anytime you mix two different languages it pays to be careful about meaning. As I learned once by talking about pie as something I liked–come to find out, it was funny to Hispanic friends who were learning English. (Apparently pie is Spanish for foot?)

When mixing R and SQL we must be careful about how the two languages look at the datatypes. In R it sees dates as a POSXct datatype (this is essentially fancy date datatype. Would you like to know more?)

Well, this is all fine and dandy, but when we pass commands from R to SQL it is all passed as a string.

dataFrame2 <- sqldf("SELECT * FROM dataFrame1")

Notice SELECT * FROM dataFrame1 is all in quotation marks? This turns it into a string then it passes it SQLite, which is hidden to us.

If all this is a bit overwhelming, no worries. Bookmark this page to refer back to later. Just remember the following:

Date columns must be converted into a chr datatype before passing it to SQL. How to we convert datatypes? It’s pretty darn simple. We use something called data coercion.

Coercing Data Types

Let’s go back to that ZIP code and number example. Let’s say the computer reads all your ZIP codes from a file as a number. This happens a lot, since to the computer that’s what it looks like–so it guesses that’s what you are going to want.

But no, we want those ZIP codes to be strings. To do this, we can get a particular column from a dataframe by writing the name of the dataframe then $ then the name of the column. For example, datafram$zipCodes will return only the column zipCodes from dataframe.

Alright, now we have a way to select one column from our dataframe we can attempt to convert that one column’s datatype. To do this use the as.character() command.

dataframe$zipCodes <- as.character(dataFrame$zipCodes)

This will convert the zipCode column from a number into a string, then, it assigns it back to the column zipCodes. Boom! We’ve told the computer to stop trying to make a ZIP code a number. Instead, treat it as a string. And with that, we will tell the computer later how to use ZIP codes.

Coerce Date Types into Strings before Passing to SQLdf

Ok, now for the reason for this entire article. Before passing any dates to SQLdf we need to first convert them to strings. Otherwise, SQLdf will try to treat them as numbers–which will cause a lot of heartache.

For example, a Client.csv file should have a DateCreated column. This represents the date a case-manager put the data into HMIS. The data should look something like this:

DateCreated DateUpdated
10/23/14 0:01 4/23/15 15:27
5/22/13 9:23 10/15/16 1:29
6/3/15 19:22 3/17/17 21:09

Let’s try to get all of the clients who’ve been entered after 10/01/2014.

dataFramContainingDates <- read.csv("/Users/user/Downloads/Client.csv")
datesEntered <- sqldf("SELECT * FROM dataFramContainingDates WHERE DateCreated > '2014-10-01'")

The above code should provide every column where DateCreated date is greater than 2014-10-01. But, instead, it will result in an empty dataframe. Waaah-waah.

Essentially, this is because SQL is comparing a number and a string. It freaks the computer out.

Instead, we should convert the DateCreated column to a string instead of a date. Then, SQL will actually convert it from a string to a date.

Confused? Imagine me when I was trying to figure this out on my own.

Ok, so, the take away? Before passing any dates to SQL convert them to strings.

dataFramContainingDates <- read.csv("/Users/user/Downloads/Client.csv")
dataFrameContaingDates$DateCreated <- as.character(dataFrameContaingDates$DateCreated)
datesEntered <- sqldf("SELECT * FROM dataFramContainingDates WHERE DateCreated > '2014-10-01'")

By using the as.character function to convert the DateCreated column to a string and then assigning it back to the dateframe, it sets SQL up to do the date comparisons correctly. Using the dateframe from above, this should result in the following table:

DateCreated DateUpdated
10/23/14 0:01 4/23/15 15:27
6/3/15 19:22 3/17/17 21:09

Confused as heck? Feel free to ask questions in the comments below!

Read and Write CSVs in R

R let’s us work with HMIS data, but there is still the problem of how to load the data into R from a source. R is actually pretty neat regarding data importing. We can load data from a website, SQL database, text file, Excel file, or CSV.

When working with HMIS data the two most common types of data source are Excel and CSV. Therefore, it’s going to pay to spend a little time on how to bring those files into R.

Importing CSVs into R

CSV stand for Comma Separated Value format. It is a near ancient file format which is used to store data in rows and columns. If you have a CSV file on the computer (in Windows) right-click on it and open it in Notepad. The contents should look something like this:

PersonalID,FirstName,LastName
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB,Bob,Person
ASGJ4F95HS85N39DJ12AJB094M59DJ45,Jane,People

However, if you open the same CSV in Excel it will look something like this:

PersonalID FirstName LastName
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Person
ASGJ4F95HS85N39DJ12AJB094M59DJ45 Jane People

Let’s be honest, the second example is easier for humans to read (well, unless you’re an odd human). And for the most part, we will be looking at CSVs in Excel or in RStudio’s dataview which looks similar to Excel. However, it is important to note the easier to read version can hide data errors that may only be visible by viewing the raw CSV. Nothing to worry about now, but keep it in mind.

Alright, let’s explore how to load a CSV file into R. There is many way’s to do this, but let’s start with a super-easy way:

pathToCsv <- file.choose()
myCsvAsADataFrame <- read.csv(pathToCsv)
 

These two commands, when executed, will force R to create a file selection box. This will allow us to easily select the CSV to load into R. Once selected and we pressOk then R will load the selected file’s path into the variable pathToCsv.

The next command read.csv() takes the path provided and attempts to load the file it points to and converts it into a dataframe. Once R creates a dataframe from the file selected it saves it in the variable myCsvAsADataFrame

And that’s it. The data is now loaded into R and it can be manipulated with other commands.

Writing CSVs

Saving data from R is the other end of importing data. This process is often referred to as exporting data. Really, it’s simply taking a R dataframe and converting it into a file–once in a file form it can be loaded into Excel or emailed to a peer (but, please don’t email personal information–remember, CSVs are simply text).

To write data a CSV use write.csv().

For example:

write.csv(theDataFrameToWrite, "NameOfFile.csv")

That’s it, pretty simple, eh? Well, there are a couple of catches. When R saves a CSV it does a few things which are annoying for using the data in other way. For example, let’s say we have data in R that looks like this:

PersonalID Name SSN
123JJKLDFWE234234JGD0238D2342346   123-45-6589
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Purdy  
DSK329GJB9234J5JFSDF94056NDUNVDF Frank 123-99-9999

However, after it is written to a file it will look like this:

PersonalID Name SSN
123JJKLDFWE234234JGD0238D2342346 NA 123-45-6589
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Purdy NA
DSK329GJB9234J5JFSDF94056NDUNVDF Frank 123-99-9999

Note the added NA. It is a good practice to put an NA in places where there are no data. Unfortunately, when dealing with HMIS data sets the standard is to leave a blank instead. To get R to conform to this standard we use an option in the write.csv() function.

For example:

write.csv(theDataFrameToWrite, "NameOfFile.csv", na="")

The na="" tells R to write the document without changing blanks into NA. The result of the code above should look like:

PersonalID Name SSN
123JJKLDFWE234234JGD0238D2342346   123-45-6589
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB Bob Purdy  
DSK329GJB9234J5JFSDF94056NDUNVDF Frank 123-99-9999

Creating a List of Domestic Violence Victims

In this first work challenge you will use R and SQL to get a by-name-list of those who are domestic violence victims from an HMIS data pull (5.1).

Data Needed

The HMIS Data Pulls are simply a relational database which are broken into multiple CSVs. These CSVs will change in formatting, as stipulated by HUD. The current version of these CSVs is 5.1. For this work challenge the focus will be on two CSVs.

  1. Client.csv
  2. HealthAndDV.csv

The Client file will contain one row per client and only one row. Each row will contain most of all the Client’s demographic information.

The HealthAndDV file will contain a row for every HUD Entry Assessment completed on the participant. Each entry will contain general health information as well the client’s domestic violence information, which the client reported during the HUD Entry Assessment.

What ties the two files together is the PersonalID column. This ID is meant to be the grand-mamma of all client IDs. It is 32 characters long and contain both numbers and letter:

B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB

(note, this ID is called the “Enterprise ID” in our HMIS software)

Both the Client and HealthAndDV contain the PersonalID column. This often referred to as a key when dealing with relational databases. This unique ID in both files will allow us to merge the two data sets together using something called joins.

Now, in the Client.csv the information is readable to us humans. There will be the FirstName, LastName, SSN columns and many more. But in the HealthAndDV.csv the information is trickier. For this challenge we are going to focus on one column DomesticViolenceVictim. When you open the data set you may notice that instead of “Yes” or “No” answers in the columns you will see “1” or “0”. That’s because computers understand the 1 and 0 much quicker than Yes or No.

This is an important side note for managing data. Make the databases easy for computers to understand and report generation will be much faster. You have to think, if using a 1 instead of Yes could save 0.5 seconds on a calculation, then when you have a dataset which contains 1000 records you just saved 500 seconds 8.3 seconds. Now, multiply by 1,700,000 records. Well, you get the picture.

Ok. Back to the problem at hand. Just know “1” is equal to “Yes” and “0” is equal to “No”. So, for this challenge, we will want to find all the clients who have a “1” in the DomesticViolenceVictim column

The Goal

We are going to merge the two data sets and to discover the following:

  1. A list of clients who are victims of domestic violence.
  2. A count of how many clients are fleeing domestic violence.

Really, the second piece of information is counting how many people are in the list of those who are victims.

To get this information we will need to do the following:

  1. Load the Client.csv and HealthAnd.csv
  2. Filter the HealthAndDV dataset to the most recent according to the column DateCreated
  3. Join (merge) the dataframes where their PersonalID are the same
  4. Filter the merged dataframe to those who’ve reported 1 in DomesticViolenceVictim
  5. Write this data to a file.
  6. Use a function to count how many participants are in this victim list.

The Resources

Below are the resources which should help you understand each step of the process.

Step 1

  • R Programming A-Z – Video 41 – Loading and Importing Data in R
  • R Programming A-Z – Video 21 – Functions in R

Step 2

  • The Complete SQL Bootcamp – All Videos in Section 5
  • The Complete SQL Bootcamp – All Videos in Section 6
  • Working with Dates in SQLdf

Step 3

  • The Complete SQL Bootcamp – All Videos in Section 8

Step 4 –

Step 5 –

Step 6 –

  • The Complete SQL Bootcamp – All Videos in Section 5
  • The Complete SQL Bootcamp – All Videos in Section 6

The R and SQL Way

Below is my attempt to describe the method I use to get at HMIS data. In short, I’m mixing two powerful data languages to get answers from a data set quickly.

What is SQL?

SQL stands for Structured Query Language. Which can be translated as, “Asking a question a computer could understand.” This computer language was designed to get data off a remote relational database.

A sample of what SQL looks like:

SELECT Data FROM DataSet WHERE Data='BlueEyes'

SQL Pros:

  • Easy to understand
  • Can be learned quickly
  • Powerful in merging data sets

SQL Cons:

  • No advanced functions
  • Does not allow for additional features to be added by users

Would you like to know more?

What is R?

R is a programming language for super nerds. Often, it is used by those working in:

  • Bio-statistic (genetic sequencing)
  • High-level economics
  • Big-data (think all data collected by Facebook)
  • Machine learning

It is extremely fast computationally. The people who designed it wanted something which could crunch big data sets, on their computer, in a matter of minutes.

One of the neatest parts of R is it allows for user written plugins. Want to translate a data set of ZIP codes into their corresponding GPS coordinateds. Their’s a plugin for that! And did I mention it’s all free!

Would you like to know more?

Mix R and SQL? But why…

Think of our approach to data as Spanglish. We are going to take two major programming languages to get the best of both. When we don’t have the exact word to describe something, we will switch to a different language and use its vocabulary.

R is a powerful tool, but often, the syntax is boggish. It is hard to read and figure out what’s going on. SQL on the other hand, it’s pretty simple. And humans have an easier time reading it.

Personally, I’m looking to solve problems as quickly as possible and I’ve found by mixing the two I get to solutions quick. If I’m having a hard time getting Google to tell me how to do something in SQL, I’ll switch and see if Google know’s how to do it in R. Thus, getting to what I need do much quicker. And let’s not fool ourselves–learning a programming language is learning how to Google well.

A second reason to mix SQL is about respect and marketability. R seems to be gaining ground in a lot of data sciences, and seems to be the tool when it comes to economics and statistics, however, most data exchanges have SQL at their heart. Therefore, when I can use my work as an excuse to develop a marketable skill, I’m going to do it.

A third reason to mix the two, most often the servers where our data calculations are being done are extremely limited in their hardware. This results in calculations taking much longer on a server than they would on or personal PC. For example, a query which would take 45 minutes on our HMIS software vendor’s server takes around 30 seconds on my personal PC.

A fourth reason, by completing our calculations on our personal computers it reduces the number of times client level information would need to be transmitted back-and-forth from the server. Of course, this doesn’t mean the data is more secure, it simply means the opportunities for more secure data are in our hands. This method makes follow proper PC security practices much more important (encrypting hard-drives, not sharing passwords, etc) Localization of data.