Sampling Large Data

This article is part of a series.

Part 1 – Attachment III, aka, The Zombie
Part 2 – R Function to Split CSVs
Part 3 – Shaping and Combining HMIS Data from ETO
Part 4 – Splitting Program Data
Part 5 – This Article
Part 6 – Identifying Chronically Homeless and Veteran Participants throughout a COC
Part 7 – JPS DSRIP Report V2.0
Part 8 – Coordinated Entry By-Name-List using HMIS CSV 5.1, R, and SQL
Part 9 – Veteran's Report 2.0
Part 10 – Choropleth and Heatmaps for HMIS Data
Part 11 – Annualized Count
Part 12 – Stitching Together HMIS Exports

This R script draws a random sample of rows from a data frame. This is helpful when writing a script that will eventually run against a large data frame, because writing the script is iterative. Testing against a sample shortens each iteration while still giving realistic results.

    ## Give the JVM a 14 GB heap; this must be set before XLConnect is loaded
    options(java.parameters = "-Xmx14336m")
    library("sqldf")
    library("XLConnect")
    library("tcltk")

    df <- readWorksheetFromFile("Data_X.xlsx", sheet = 1, startRow = 1)

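    ## Draw up to 30,000 row indices at random, without replacement.
    ## min() avoids an error when the data frame has fewer than 30,000 rows.
    sampleSize <- min(30000, nrow(df))
    sampleVector <- sample(seq_len(nrow(df)), sampleSize)
    df2 <- df[sampleVector, ]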

    write.csv(df2, file="Sample of Data_X (30000).csv", na="")
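
The sampling step above could also be wrapped in a small reusable function. The following is only a sketch: the name `sample_rows` and its `n` and `seed` arguments are hypothetical and not part of the original script, and the built-in `mtcars` data set stands in for the large data frame. Setting a seed is an added assumption here; it makes the same rows come back on every run, which can help when comparing test iterations.

```r
# Hypothetical helper (not from the original script): returns up to n
# randomly chosen rows of df, sampled without replacement.
sample_rows <- function(df, n = 30000, seed = NULL) {
  # Optional seed so repeated runs return the same sample
  if (!is.null(seed)) set.seed(seed)
  # Never ask for more rows than the data frame has
  n <- min(n, nrow(df))
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# Usage with a small built-in data set as a stand-in for the real data
small <- sample_rows(mtcars, n = 10, seed = 42)
nrow(small)
```

Because the sample size is capped at the number of rows, the same call works against both the full data set and a smaller extract without changes.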