Monday, March 17, 2014

Getting a sample from a large data file with R

I'm working on a little project to attempt to cluster the voters in North Carolina into congressional districts.  My goal is to see if there's a way to have a computer draw the districts instead of relying on people with potential biases.

I quickly ran into quite a big wall when I reviewed the file listing all the voters in North Carolina.  I should have suspected that it would be huge!  A file consisting of around 7.5 million voters (both active and invalid) and around 60 columns is about 4.5 gigabytes.  Considering I have 4 gigabytes of RAM, I needed an alternative plan.

Well, I knew that I only wanted a sample of this file.  I'm going to try to geocode these addresses and use the lat/long coordinates for the clustering.  As I've stated in previous posts, geocoding has daily limits, and geocoding tons of addresses can take serious time.

I didn't have much success using the standard read.table function in R.  It has skip and nrows parameters, but they didn't help much with my RAM woes.  I also tried fread from the data.table package, but my data had some flaws and fread wasn't flexible enough to work around them.

I took a really simplistic approach to my problem by using the lowly file function that comes standard with R.  First, a loop over a file connection went through each line and copied only the rows that didn't have problems.  In my situation, there were extra quote symbols in some of the lines.  Those lines weren't worth the trouble, so I skipped them.

I also took out voters that weren't listed as ACTIVE or INACTIVE.

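# Open the raw voter file as a connection so it can be read one line at a time.
v <- file("c:\\users\\doug shartzer\\documents\\data\\ncvoter_Statewide.txt")
open(v)

while (length(line <- readLines(v, 1)) > 0) {
  # Split on the quote character; a well-formed row yields exactly 140 pieces.
  fields <- strsplit(line, '"', fixed = TRUE)[[1]]
  if (length(fields) == 140) {
    # Piece 10 holds the voter status; keep only ACTIVE and INACTIVE voters.
    if (fields[10] %in% c('ACTIVE', 'INACTIVE')) {
      write(line, 'c:\\users\\doug shartzer\\documents\\data\\voter_good_all.txt', append = TRUE)
    }
  } else {
    # Print the malformed rows so they can be looked at later.
    print(line)
  }
}

close(v)
q()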
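To see what the 140-piece check in the loop above is doing, here is a toy version on two made-up rows.  The rows and the count_pieces helper are hypothetical and only have three columns; the real file has many more, which is where the 140 comes from.

```r
# Hypothetical three-column rows, not taken from the real voter file.
good <- '"1"\t"SMITH"\t"ACTIVE"'
bad  <- '"2"\t"O"BRIEN"\t"ACTIVE"'   # stray quote inside the name

# Count the pieces produced by splitting a line on the quote character.
count_pieces <- function(line) {
  length(strsplit(line, '"', fixed = TRUE)[[1]])
}

count_pieces(good)  # 6: two pieces per column
count_pieces(bad)   # 7: the stray quote adds an extra piece
```

A clean row always gives the same number of pieces, so any other count flags a row with a missing or extra quote.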


Although it did take a while to run (16 hours), I didn't run into any problems with memory.
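Much of that run time probably comes from reading one line at a time and from write(..., append=T), which reopens the output file on every call.  Here is a minimal sketch of the same filter that reads a block of lines at once and keeps the output connection open.  The filter_voters name, its arguments, and the chunk size are my own assumptions, and I haven't timed it against the full file.

```r
# filter_voters() is a hypothetical helper. The defaults (140 pieces, status in
# piece 10) mirror the line-by-line loop; chunk_size is an arbitrary choice.
filter_voters <- function(in_path, out_path, chunk_size = 10000,
                          n_pieces = 140, status_piece = 10) {
  input <- file(in_path, "r")
  on.exit(close(input))
  output <- file(out_path, "w")
  on.exit(close(output), add = TRUE)

  kept <- 0
  # Read chunk_size lines per pass, so memory use stays bounded.
  while (length(lines <- readLines(input, n = chunk_size)) > 0) {
    pieces <- strsplit(lines, '"', fixed = TRUE)
    well_formed <- lengths(pieces) == n_pieces

    # Pull the status piece only from the rows that split cleanly.
    status <- rep(NA_character_, length(lines))
    status[well_formed] <- vapply(pieces[well_formed], `[`, character(1), status_piece)

    keep <- well_formed & status %in% c("ACTIVE", "INACTIVE")
    writeLines(lines[keep], output)
    kept <- kept + sum(keep)
  }
  kept  # number of rows written
}
```

Called as filter_voters(in_path, out_path) with the two file paths from the loop above, it should produce the same kind of cleaned file, although it doesn't print the rejected rows.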

After that, I collected a sample and wrote those voters to another file.

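# Draw 75,000 line numbers at random out of roughly 7.5 million.
s <- sample(7500000, 75000)

v <- file("c:\\users\\doug shartzer\\documents\\data\\voter_good_all.txt")
open(v)

x <- 0  # counts the lines read so far
while (length(line <- readLines(v, 1)) > 0) {
  x <- x + 1
  # Copy the line only if its line number was drawn in the sample.
  if (x %in% s) {
    write(line, 'c:\\users\\doug shartzer\\documents\\data\\voter_sample_03142014.txt', append = TRUE)
  }
}

# The start of the following line is cut off in this listing, so it is commented out.
# shartzer\\documents\\data\\voter_run_status.txt',append=T)

close(v)
q()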
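One thing to watch with the sampling step above: the line numbers are drawn from 7,500,000, but the filtered file has fewer lines than that, so the sample comes out a little smaller than 75,000.  A minimal sketch that counts the lines first and then draws from the real count could look like this.  The sample_lines name and its arguments are hypothetical.

```r
# sample_lines() is a hypothetical helper: pass 1 counts the lines, pass 2
# copies a simple random sample of them to out_path.
sample_lines <- function(in_path, out_path, n_sample, chunk_size = 10000) {
  # Pass 1: count the lines without holding the whole file in memory.
  input <- file(in_path, "r")
  n_lines <- 0
  while ((n <- length(readLines(input, n = chunk_size))) > 0) {
    n_lines <- n_lines + n
  }
  close(input)

  # Draw line numbers without replacement from the real line count.
  picked <- sample.int(n_lines, min(n_sample, n_lines))

  # Pass 2: copy only the lines whose numbers were drawn.
  input <- file(in_path, "r")
  on.exit(close(input))
  output <- file(out_path, "w")
  on.exit(close(output), add = TRUE)

  offset <- 0  # number of lines read before the current chunk
  while (length(lines <- readLines(input, n = chunk_size)) > 0) {
    line_numbers <- offset + seq_along(lines)
    writeLines(lines[line_numbers %in% picked], output)
    offset <- offset + length(lines)
  }
  invisible(length(picked))  # number of lines written
}
```

The extra pass over the file costs some time, but the sample size then comes out exactly as requested.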

After this process, I had a much more manageable file to play around with.