Statistics and Data Mining in R

Trying not to lie with data. Jenks classification based on a subset in R

Choropleth maps (often loosely called heat maps) are a popular way to display geo-referenced data, with the value of an attribute mapped to a pre-defined color scheme. However, care is needed to choose the color scheme and class breaks so that the map reveals heterogeneity with minimal bias from the creator's perspective. A number of popular color palettes are available (e.g., http://colorbrewer2.org), along with several methods for classifying values into categories. Popular methods include equal intervals, quantiles, standard deviations, and Jenks natural breaks. This note focuses on Jenks because previous work on an actual problem showed that quantiles did not provide a good visual representation of the data.

Jenks classification assigns a continuous variable to classes and attempts to minimize within-class variance while maximizing between-class variance. In practice, the method is an iterative process and is resource intensive, which causes computational problems in various software. For example, ArcMap (ESRI) has a maximum number of allowed records and will classify based on a subset. Unfortunately, if the data are ordered in time or space, this subset may not accurately represent the entire dataset. The classInt library in R supports Jenks classification through the classIntervals function with style="jenks". Unfortunately, this method also failed on the dataset I was working with (~90,000 records).

I wanted to examine whether a subset of the data could be used to develop categories that are representative of the dataset as a whole. This would reduce computational time and allow Jenks classification in applications where I have been unsuccessful thus far. Note that this is by no means an exhaustive review of applications, software, or methods, but hopefully it is an informative, illustrative example.
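To make the Jenks objective concrete, here is a minimal base-R sketch of the goodness of variance fit (GVF), a common way to score how well a set of breaks separates the data. The helper name gvf and the toy vectors are illustrative assumptions, not part of the original analysis; values closer to 1 mean less within-class variance.

```r
# Hypothetical helper (not from classInt): goodness of variance fit.
# GVF = 1 - (within-class sum of squares) / (total sum of squares)
gvf <- function(x, breaks) {
  # Assign each value to a class; right-closed intervals, lowest value included
  classes <- cut(x, breaks = breaks, include.lowest = TRUE)
  total.ss <- sum((x - mean(x))^2)
  within.ss <- sum(tapply(x, classes, function(v) sum((v - mean(v))^2)),
                   na.rm = TRUE)  # empty classes contribute nothing
  1 - within.ss / total.ss
}

# Toy data with three obvious clusters (made up for illustration)
x <- c(1, 2, 3, 10, 11, 12, 20, 21, 22)
good.breaks <- c(1, 3, 12, 22)   # breaks that follow the clusters
poor.breaks <- c(1, 10, 20, 22)  # breaks that cut across the clusters

gvf(x, good.breaks)
gvf(x, poor.breaks)
```

Breaks that follow the natural gaps in the data score higher, which is the property Jenks classification searches for.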

Begin by loading the classInt library.

library(classInt)

## Warning: package 'classInt' was built under R version 3.1.1

Next, create a sample data set using the rnorm function. I generate 10,000 values with mean = 100 and standard deviation = 25. Many other parameter combinations or distributions could, and should, be explored.

test.data <- rnorm(10000, 100, 25)
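As a sketch of what exploring other distributions might look like, the lines below build a skewed and a bimodal alternative alongside the normal sample. The seed, the object names, and the distribution parameters are arbitrary assumptions chosen for illustration; none of them come from the original analysis.

```r
set.seed(2014)  # arbitrary seed, only so the sketch is reproducible

# Same size as the original sample, three different shapes
test.normal  <- rnorm(10000, mean = 100, sd = 25)
test.skewed  <- rlnorm(10000, meanlog = log(100), sdlog = 0.5)
test.bimodal <- c(rnorm(5000, mean = 60, sd = 10),
                  rnorm(5000, mean = 140, sd = 10))

# Quick side-by-side summary of the three candidates
sapply(list(normal = test.normal,
            skewed = test.skewed,
            bimodal = test.bimodal), summary)
```

Any of these vectors could be substituted for test.data in the steps that follow.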

Create an empty matrix (6 x 6) to hold the results: one row per break point (five classes give six breaks) and one column per sample size.

breaks.matrix <- matrix(ncol=6,nrow=6)

Create a vector with five subset lengths plus the full 10,000 values, which serve as the reference.

n <- c(50,100,1000,2500,7500,10000)
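The 6 x 6 shape follows from the setup: five classes yield six break points, and there are six sample sizes. A minimal sketch, assuming you would rather derive the dimensions than hard-code them; the names n.classes and labelled.matrix are illustrative and not part of the original code.

```r
n <- c(50, 100, 1000, 2500, 7500, 10000)
n.classes <- 5  # number of Jenks classes requested

# One row per break point, one column per sample size, with readable labels
labelled.matrix <- matrix(NA_real_,
                          nrow = n.classes + 1,
                          ncol = length(n),
                          dimnames = list(paste0("break", 0:n.classes),
                                          paste0("n=", n)))
dim(labelled.matrix)
```

With labels in place, the later correlation table would show sample sizes instead of bare column indices.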

Using a for loop, a sample of each length in n is drawn in turn (without replacement), and Jenks classification is applied to each sample. Note that computing time can be long for the larger sample sizes.

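for (i in seq_along(n)) {
  # Draw a random subsample of size n[i] without replacement
  subsample <- sample(test.data, n[i], replace = FALSE)
  # Five Jenks classes give six break points (minimum and maximum included)
  test.breaks <- classIntervals(subsample, n = 5,
                                style = "jenks", rtimes = 3,
                                intervalClosure = "right")
  # Store the six break points in column i
  breaks.matrix[, i] <- test.breaks$brks
  print(i)  # To track progress
}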

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
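To see how computing time grows with sample size, each call can be wrapped in system.time. This is a rough sketch on a smaller data set than the one above so that it finishes quickly; the seed, the object names, and the sample sizes are assumptions for illustration, and actual timings depend on your machine and classInt version.

```r
library(classInt)

set.seed(42)  # arbitrary seed for a reproducible sketch
timing.data <- rnorm(2000, 100, 25)
sizes <- c(50, 500, 2000)

# Elapsed seconds for one Jenks classification at each sample size
elapsed <- sapply(sizes, function(k) {
  s <- sample(timing.data, k, replace = FALSE)
  system.time(classIntervals(s, n = 5, style = "jenks"))[["elapsed"]]
})

data.frame(sample.size = sizes, seconds = elapsed)
```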

Compute the correlation between the break points from each sample size and print the results as a table.

round(cor(breaks.matrix),2)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.99 0.98 0.97 0.96 0.96
## [2,] 0.99 1.00 1.00 0.99 0.99 0.99
## [3,] 0.98 1.00 1.00 1.00 0.99 0.99
## [4,] 0.97 0.99 1.00 1.00 1.00 1.00
## [5,] 0.96 0.99 0.99 1.00 1.00 1.00
## [6,] 0.96 0.99 0.99 1.00 1.00 1.00

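In this example, column 6, which holds the breaks from the full data set, is of direct interest. The breaks from even the smallest samples correlate very highly with those from the full data set (0.96 for n = 50 and 0.99 for n = 100 and n = 1,000), and the correlation rounds to 1.00 once the sample is 25 percent or more of the total (n = 2,500 and above).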
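Correlation between break points is one check; another is to classify the full data set with the breaks from a subset and count how often each record lands in the same class as it does with the full-data breaks. The sketch below is one way to do that, not the original author's method. It uses a smaller data set so it runs quickly, and the seed and object names are assumptions for illustration.

```r
library(classInt)

set.seed(1)  # arbitrary seed for a reproducible sketch
full.data <- rnorm(2000, 100, 25)
sub.data  <- sample(full.data, 500, replace = FALSE)

full.brks <- classIntervals(full.data, n = 5, style = "jenks")$brks
sub.brks  <- classIntervals(sub.data,  n = 5, style = "jenks")$brks

# Stretch the outer subset breaks so every full-data value falls in a class
sub.brks[1] <- min(full.data)
sub.brks[length(sub.brks)] <- max(full.data)

# Integer class codes (1 to 5) for every record under each set of breaks
full.class <- cut(full.data, breaks = full.brks,
                  include.lowest = TRUE, labels = FALSE)
sub.class  <- cut(full.data, breaks = sub.brks,
                  include.lowest = TRUE, labels = FALSE)

# Share of records assigned to the same class by both sets of breaks
agreement <- mean(full.class == sub.class)
agreement

# Cross-tabulation shows where the disagreements occur
table(full = full.class, subset = sub.class)
```

Because class boundaries shift only slightly, most disagreements should sit in adjacent classes near a boundary.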

Uploading to Rpubs on Windows OS from R

http://www.rpubs.com provides a free and easy way to publish R Markdown and HTML output (via the knitr library) to the web. You can register for free at the home page and be uploading R products in minutes. This tool makes it easy to share your code and output with colleagues for review, collaboration, or any other purpose. RPubs is an RStudio product (www.rstudio.com) and can be used to embed R code and output in clean, readable HTML files. Using RStudio, you need to:

1) Write some R code.

2) Select the Compile Notebook option from the File menu.

3) Next, you will see a screen to name your output (and provide authorship if you wish).

4) Click the "Compile" button; your code will execute and be rendered to HTML in a nice-looking format.

5) Select the "Publish" button. A prompt to confirm with RPubs will appear, and you should be done with a single confirmation click on "Publish".

6) However, on my Windows 7 machine I received an upload error.

7) I did some Googling and found that adding an option to my R profile solved the problem. I opened my Rprofile using Notepad++.

8) Last, I pasted the option snippet into my profile.

9) I saved the changes, restarted R, and everything worked as intended.
