Heat maps (i.e., choropleths) are a popular way to display geo-referenced data, with the value of an attribute mapped to a pre-defined color scheme. However, care is necessary when choosing that scheme so the map reveals heterogeneity with minimal bias from the creator's perspective. A number of popular color palettes are available (e.g., http://colorbrewer2.org), along with methods to classify the values into categories. Popular methods include equal intervals, quantiles, standard deviations, and Jenks natural breaks.

This note focuses on Jenks because, in previous work on an actual problem, quantiles did not provide a good visual representation of the data. Jenks classification assigns a continuous variable to classes, attempting to minimize within-class variance and maximize between-class variance. In practice, the method is an iterative process and is resource intensive, which causes computational problems in various software. For example, ArcMap (ESRI) has a maximum number of allowed records and will classify based on a subset; if the data are time- or space-ordered, this subset may not accurately represent the entire dataset. The classInt library in R supports Jenks classification through the classIntervals function with style="jenks". Unfortunately, this method also failed on the dataset I was working with (~90,000 records).

I wanted to examine whether a subset of the data could be used to develop categories that are representative of the dataset as a whole. This would improve computation time and make Jenks classification feasible for applications where I have so far been unsuccessful. Note, this is by no means an exhaustive review of applications, software, or methods, but hopefully an informative, illustrative example.
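To make the objective concrete, here is a minimal sketch of the quantity Jenks minimizes: the within-class sum of squared deviations for a candidate set of breaks. The helper within_class_ss and the toy data are illustrative only, not part of classInt.

# Within-class sum of squared deviations for a given set of breaks.
# Jenks searches over candidate breaks to minimize this quantity
# (equivalently, maximizing between-class variance).
within_class_ss <- function(x, breaks) {
  classes <- cut(x, breaks = breaks, include.lowest = TRUE)
  sum(tapply(x, classes, function(v) sum((v - mean(v))^2)))
}
x <- c(4, 5, 9, 10, 11, 20, 21, 22)
within_class_ss(x, c(4, 7, 15, 22))   # compact classes: 4.5
within_class_ss(x, c(4, 10, 20, 22))  # classes straddle clusters: 67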
Begin by loading the classInt library.
library(classInt)
## Warning: package 'classInt' was built under R version 3.1.1
Next, create a sample dataset using the rnorm function. I generate 10,000 samples with mean = 100 and standard deviation = 25. Many other combinations or distributions could/should be explored; one alternative is sketched below.
test.data <- rnorm(10000, 100, 25)
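For instance, a right-skewed distribution could be substituted for the normal one. The variable name and parameters below are illustrative, and skewed.data is not used in the run that follows.

# Illustrative alternative (not used below): a right-skewed distribution
skewed.data <- rlnorm(10000, meanlog = log(100), sdlog = 0.25)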
Create an empty matrix (6 x 6) to hold the results: five classes yield six break points, and each column will hold the breaks for one sample size.
breaks.matrix <- matrix(ncol=6,nrow=6)
Create a vector with five subset lengths plus the entire 10,000 samples, which serves as the reference.
n <- c(50,100,1000,2500,7500,10000)
Using a for loop, samples of length n are taken successively (without replacement), and Jenks classification is applied to each sample. Note that the computing time can be long for the large sample sizes.
for(i in 1:length(n)){
  subsample <- sample(test.data, n[i], replace = FALSE)
  test.breaks <- classIntervals(subsample, n = 5, style = "jenks",
                                rtimes = 3, intervalClosure = c("right"),
                                dataPrecision = NULL, cutlabels = TRUE)
  breaks.matrix[, i] <- test.breaks$brks
  print(i)  # To track progress
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
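Before correlating, the raw break points can be inspected directly. This is an optional step added here, not part of the original run, so its output is not shown.

round(breaks.matrix, 1)  # column i holds the breaks for sample size n[i]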
Compute the correlation matrix of the break points across sample sizes and print the results in a table.
round(cor(breaks.matrix),2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.99 0.98 0.97 0.96 0.96
## [2,] 0.99 1.00 1.00 0.99 0.99 0.99
## [3,] 0.98 1.00 1.00 1.00 0.99 0.99
## [4,] 0.97 0.99 1.00 1.00 1.00 1.00
## [5,] 0.96 0.99 0.99 1.00 1.00 1.00
## [6,] 0.96 0.99 0.99 1.00 1.00 1.00
In this example, column 6 (the full 10,000-sample dataset) is of direct interest. The results show very high correlations even for small sample sizes, approaching a correlation of 1.0 with the full dataset once the sample was equal to or greater than 25 percent of the total.
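In practice, breaks estimated on a subsample can then be applied to every record using classIntervals with style="fixed". This is a sketch of the workflow these results suggest: the 25 percent sample size is taken from the finding above, and the guard on the outer breaks is my addition, since full-data extremes may lie outside the subsample's range.

# Jenks breaks from a 25% subsample, then applied to the full dataset
sub.breaks <- classIntervals(sample(test.data, 2500), n = 5, style = "jenks")
brks <- sub.breaks$brks
# Guard the outer breaks so every record falls inside an interval
brks[1] <- min(brks[1], min(test.data))
brks[length(brks)] <- max(brks[length(brks)], max(test.data))
full.classes <- classIntervals(test.data, n = 5, style = "fixed",
                               fixedBreaks = brks)

The resulting object can then be passed to findColours along with a palette to color the map.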