Refining a k-Nearest-Neighbor classification

Machine learning algorithms provide methods of classifying objects into one of several groups based on the values of several explanatory variables. Nearest neighbor methods are easily implmented and easy to understand. There is no model associated to them, so errors have to be estimated computationally, but it provides one simple solution to classifying a new object based on known results in a reference set. kNN is a generalization of “if it walks like a duck, looks like a duck, and talks like a duck, it is probably a duck.” That is, objects that a close together with respect to the explanatory variables are likely to have the same classification.

The kNN algorithm

In the simplest setting, like the example we will do here, objects can fall into one of two classes, \( A \) or \( B \). We have a set of \( n \) measurements, \( v_1, \dots, v_n \), of any object in question, and assume these are all numeric variables. We use a distance measure, namely Euclidean distance, to manifest when two objects are close with respect to these variables. So, given objects in the domain \( s \), \( r \), which have \( v_i \) measurements \( x_1, \dots, x_n \), \( y_1, \dots, y_n \), respectively, define \[ dist(s, r) = \sqrt(\sum_i (x_i-y_i)^2). \] Depending on characteristics of the variables, other distance measures may be more appropriate, but we'll stick with Euclidean distance.

The kNN algorithm begins with a training set of objects for which we know not only the values of the explanatory variables but also the classifications \( A \) and \( B \). To predict the classification of a new object \( q \), the \( k=1 \) version of kNN would proceed by finding the element of the training set with the minimum distance from \( q \), suppose it is \( p_1 \), and predict the classification of \( q \) to be the same as \( p_1 \). If \( p_1 \) is rather isolated and there are lots of points in the other class almost as close to \( q \) this could be a misclassification. So, we generally pick some larger odd number \( m \) for \( k \); find \( p_1, \dots,p_m \) closest to \( q \) and vote on what the classification of \( q \) should be. A rule of thumb in machine learning is to pick \( k \) near the square root of the size of the training set. In practice this does a good job of telling signal from noise.

Classifying samples as tumor or normal based on measurements from images

Typically, biopsy tissue samples from breast cancer are preserved on paraffin slide and examined visually by a pathologist. Cancerous tissues have more misshapen cells, larger nuclei, rough surfaces and other abnormal characteristics. Today, automated image analysis can collect measurements of the nuclei of the cells in a picture of a sample without human intervention. From the UC-Irvine machine learning archive we have the Wisconsin Breast Cancer Dataset, with nuclei measurements of 569 samples, some benign and some tumor. This is saved to disk and cleaned up here.

Reference: For definitions and a discussion of the meaning of the measurements used here, consult

W. Wolberg, W.N. Street, O.L. Mangasarian, Importance of nuclear morphology in breast cancer prognosis, Clinical Cancer Research, (1999) Vol. 5, 3542-3548.

setwd("/Users/steve/Documents/Computing with Data/17_Refining_kNN/")
wisc_bc_df <- read.csv("./wisc_bc_data.csv", stringsAsFactors = F)
str(wisc_bc_df)

## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : chr  "B" "B" "B" "B" ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

Clearly a lot of these variables are related but we can't really tell if any are redundant. We'll keep all of the variables for now. The second variable appears to be the classification.

Organizing the data

table(wisc_bc_df$diagnosis)

## 
##   B   M 
## 357 212

sum(is.na(wisc_bc_df$diagnosis))

## [1] 0

This really should be a factor, so

wisc_bc_df$diagnosis <- factor(wisc_bc_df$diagnosis, levels = c("B", "M"), labels = c("benign", 
    "malignant"))

We want to measure distances between samples using the numerical variables. However, some variables are on very different scales. Smoothness means are around 0.1 and areas are around 400. This over-emphasizes area in a distance calculation. So, we need to transform all variables to comparable scales. Typical in such a setting is to replace a measurement by its percentile. We define a function to do the calculation for any column.

normalize <- function(x) {
    y <- (x - min(x))/(max(x) - min(x))
    y
}

We can apply this function to each of the numeric columns using lapply.

wbcd_n_L <- lapply(wisc_bc_df[, 3:32], normalize)
wbcd_n <- data.frame(wbcd_n_L)
wbcd_n[1:3, 1:4]

##   radius_mean texture_mean perimeter_mean area_mean
## 1      0.2527      0.09063         0.2423   0.13599
## 2      0.1713      0.31248         0.1761   0.08607
## 3      0.1921      0.24078         0.1875   0.09743

# Add id labels as rownames to keep track of the patient's data
rownames(wbcd_n) <- wisc_bc_df$id

While we're at it, let's isolate the class labels and tag them with patient ids.

BM_class <- wisc_bc_df[, 2]
names(BM_class) <- wisc_bc_df$id
BM_class[1:3]

## 87139402  8910251   905520 
##   benign   benign   benign 
## Levels: benign malignant

Creating training and test (validation) datasets

A large training set helps to give us a good model, but a large validation set increases the significance of the result you report. You need to strike a balance. The choice of what fraction to use for training may depend on how messy the data is, or how complex the modeling method is. A reasonable balance is 2/3 training and 1/3 validation. Furthermore, in splitting the data we want the training set (and implicitly the validation set) to be representative of the general population as far as this particular problem goes. Here, that means they should contain roughly the same fraction of benign and malignant cases as the whole population.

We'll do the split by randomly permuting the rows of data, selecting 1/3 for validation and having the rest left for training. Our rows are determined by the id variable.

nrow(wisc_bc_df)
rand_permute <- sample(x = 1:569, size = 569)
rand_permute[1:5]
# save(rand_permute, file='rand_permute.RData')

load("rand_permute.RData")

all_id_random <- wisc_bc_df[rand_permute, "id"]
# Select the first third of these for validation
569/3

## [1] 189.7

validate_id <- as.character(all_id_random[1:189])
training_id <- as.character(all_id_random[190:569])

Subset the data we'll use in the knn analysis accordingly.

wbcd_train <- wbcd_n[training_id, ]
wbcd_val <- wbcd_n[validate_id, ]
BM_class_train <- BM_class[training_id]
BM_class_val <- BM_class[validate_id]
table(BM_class_train)

## BM_class_train
##    benign malignant 
##       247       133

table(BM_class_val)

## BM_class_val
##    benign malignant 
##       110        79

Check that training and validation sets have

Executing knn

The knn algorithm is implemented in the package class. Install this if necessary and then load it.

library(class)
`?`(knn)

What should k be? Let's use an odd number near the square root of size of the training set.

sqrt(nrow(wbcd_train))

## [1] 19.49

k <- 19

knn_predict <- knn(wbcd_train, wbcd_val, BM_class_train, k = 19)
knn_predict[1:3]

## [1] benign    malignant benign   
## Levels: benign malignant

Check the agreement with the predictor.

table(knn_predict, BM_class_val)

##            BM_class_val
## knn_predict benign malignant
##   benign       108        10
##   malignant      2        69

prop.table(table(knn_predict, BM_class_val))

##            BM_class_val
## knn_predict  benign malignant
##   benign    0.57143   0.05291
##   malignant 0.01058   0.36508

Not very good as classifications go.

Testing different values of k

Let's sample a few other values of k and see how the performance changes. Let's try k=3, 7, 11, 31.

knn_predict_3 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 3)
knn_predict_7 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 7)
knn_predict_11 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 11)
knn_predict_31 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 31)

table(knn_predict_3, BM_class_val)

##              BM_class_val
## knn_predict_3 benign malignant
##     benign       109         8
##     malignant      1        71

table(knn_predict_7, BM_class_val)

##              BM_class_val
## knn_predict_7 benign malignant
##     benign       109         9
##     malignant      1        70

table(knn_predict_11, BM_class_val)

##               BM_class_val
## knn_predict_11 benign malignant
##      benign       108        10
##      malignant      2        69

table(knn_predict_31, BM_class_val)

##               BM_class_val
## knn_predict_31 benign malignant
##      benign       108        11
##      malignant      2        68

The best was actually 3, by a little. Sometimes, too many points hides subclusters that are fairly small. This can also be very dependent on the choice of training set.

How else can we improve the analysis

The biggest gap I see is in the use of all variables in the distance calculation. Three traits each for 10 measurements almost certainly introduces some redundancy, hence added variance. More worrisome is that we may lose an opportunity to gain important insight into tumor biology. If a couple traits alone can separate benign from malignant then we should should look at the causes of those features as underlying factors in cancer development. In summary, the two factors we want to examine are:

The optimal set of variables to include in the distance calculation;
The optimal k.

An obstacle to testing which parameters give the best fit is that is that we just have one training set and one validation set. We need to reserve the validation set for the final choice of model. We need more training-validation partitions to select these parameters.

Study significance of the variables

We may want to cut down from the full 30 variables. Working in the training set, let's rank these by significance in their variation between benign and malignant. These variables are all continuous, so we'll use the F-statistic of a linear fit.

Ways of getting significance for all variables

Before showing the fast way to do this, let's show some ways that take less data manipulation but more type to compare them.

Copy, paste, rename one at a time.

names(wbcd_train)

##  [1] "radius_mean"       "texture_mean"      "perimeter_mean"   
##  [4] "area_mean"         "smoothness_mean"   "compactness_mean" 
##  [7] "concavity_mean"    "points_mean"       "symmetry_mean"    
## [10] "dimension_mean"    "radius_se"         "texture_se"       
## [13] "perimeter_se"      "area_se"           "smoothness_se"    
## [16] "compactness_se"    "concavity_se"      "points_se"        
## [19] "symmetry_se"       "dimension_se"      "radius_worst"     
## [22] "texture_worst"     "perimeter_worst"   "area_worst"       
## [25] "smoothness_worst"  "compactness_worst" "concavity_worst"  
## [28] "points_worst"      "symmetry_worst"    "dimension_worst"

lm_1 <- lm(radius_mean ~ BM_class_train, data = wbcd_train)
summary(lm_1)

## 
## Call:
## lm(formula = radius_mean ~ BM_class_train, data = wbcd_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3113 -0.0790  0.0034  0.0695  0.5008 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.24499    0.00712    34.4   <2e-16 ***
## BM_class_trainmalignant  0.25418    0.01204    21.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.112 on 378 degrees of freedom
## Multiple R-squared:  0.541,  Adjusted R-squared:  0.54 
## F-statistic:  446 on 1 and 378 DF,  p-value: <2e-16

names(summary(lm_1))

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

summary(lm_1)$fstatistic

## value numdf dendf 
## 445.8   1.0 378.0

# The significance measure we want:
summary(lm_1)$fstatistic[1]

## value 
## 445.8

We need a vector to hold the outputs to avoid working with them one at a time.

exp_var_fstat <- as.numeric(rep(NA, times = 30))
names(exp_var_fstat) <- names(wbcd_train)

Now run through the variables, get the linear fit and store the F-statistic

exp_var_fstat["radius_mean"] <- summary(lm(radius_mean ~ BM_class_train, data = wbcd_train))$fstatistic[1]
exp_var_fstat["texture_mean"] <- summary(lm(texture_mean ~ BM_class_train, data = wbcd_train))$fstatistic[1]
exp_var_fstat["perimeter_mean"] <- summary(lm(perimeter_mean ~ BM_class_train, 
    data = wbcd_train))$fstatistic[1]
exp_var_fstat

##       radius_mean      texture_mean    perimeter_mean         area_mean 
##            445.83             80.29            483.35                NA 
##   smoothness_mean  compactness_mean    concavity_mean       points_mean 
##                NA                NA                NA                NA 
##     symmetry_mean    dimension_mean         radius_se        texture_se 
##                NA                NA                NA                NA 
##      perimeter_se           area_se     smoothness_se    compactness_se 
##                NA                NA                NA                NA 
##      concavity_se         points_se       symmetry_se      dimension_se 
##                NA                NA                NA                NA 
##      radius_worst     texture_worst   perimeter_worst        area_worst 
##                NA                NA                NA                NA 
##  smoothness_worst compactness_worst   concavity_worst      points_worst 
##                NA                NA                NA                NA 
##    symmetry_worst   dimension_worst 
##                NA                NA

Keep going. I'd be tempted to get lazy and not do all of them.

Looping over variable names

Rather than doing the one at a time version we can trying looping over each of the variables. But some careful coding needs to be done.

Again, we'll need a vector to hold the significance measures.

exp_vars <- names(wbcd_train)
exp_var_fstat <- as.numeric(rep(NA, times = 30))
names(exp_var_fstat) <- exp_vars

The following produces an error.

for (j in 1:length(exp_vars)) {
    exp_var_fstat[exp_vars[j]] <- summary(lm(exp_vars[j] ~ BM_class_train, data = wbcd_train))$fstatistic[1]
}

The problem is that there isn't a named variable of the form exp_vars[j]. We need the named variable stored in this “slot”. We can get at the variable and produce an executable formula by:

for (j in 1:length(exp_vars)) {
    exp_var_fstat[exp_vars[j]] <- summary(lm(as.formula(paste(exp_vars[j], " ~ BM_class_train")), 
        data = wbcd_train))$fstatistic[1]
}

This works.

exp_var_fstat

##       radius_mean      texture_mean    perimeter_mean         area_mean 
##         445.83008          80.28969         483.35212         406.02540 
##   smoothness_mean  compactness_mean    concavity_mean       points_mean 
##          53.17654         231.82299         424.86462         624.85237 
##     symmetry_mean    dimension_mean         radius_se        texture_se 
##          43.17113           0.04483         185.76126           0.05889 
##      perimeter_se           area_se     smoothness_se    compactness_se 
##         172.59053         189.61081           2.81093          38.93364 
##      concavity_se         points_se       symmetry_se      dimension_se 
##          31.62212          83.22744           0.13240           3.67540 
##      radius_worst     texture_worst   perimeter_worst        area_worst 
##         599.65263          98.65231         624.91718         491.38056 
##  smoothness_worst compactness_worst   concavity_worst      points_worst 
##          82.54685         224.67379         297.80279         650.50615 
##    symmetry_worst   dimension_worst 
##          71.84282          52.88346

We quickly got all the statistics.

Of course, we can avoid initializing the variable or worrying about an iterator by using an lapply or sapply.

exp_var_fstat2 <- sapply(exp_vars, function(x) {
    summary(lm(as.formula(paste(x, " ~ BM_class_train")), data = wbcd_train))$fstatistic[1]
})
exp_var_fstat2

##       radius_mean.value      texture_mean.value    perimeter_mean.value 
##               445.83008                80.28969               483.35212 
##         area_mean.value   smoothness_mean.value  compactness_mean.value 
##               406.02540                53.17654               231.82299 
##    concavity_mean.value       points_mean.value     symmetry_mean.value 
##               424.86462               624.85237                43.17113 
##    dimension_mean.value         radius_se.value        texture_se.value 
##                 0.04483               185.76126                 0.05889 
##      perimeter_se.value           area_se.value     smoothness_se.value 
##               172.59053               189.61081                 2.81093 
##    compactness_se.value      concavity_se.value         points_se.value 
##                38.93364                31.62212                83.22744 
##       symmetry_se.value      dimension_se.value      radius_worst.value 
##                 0.13240                 3.67540               599.65263 
##     texture_worst.value   perimeter_worst.value        area_worst.value 
##                98.65231               624.91718               491.38056 
##  smoothness_worst.value compactness_worst.value   concavity_worst.value 
##                82.54685               224.67379               297.80279 
##      points_worst.value    symmetry_worst.value   dimension_worst.value 
##               650.50615                71.84282                52.88346

names(exp_var_fstat2) <- exp_vars

If we had 10,000 variables instead of 30 and our function was more time-consumming than lm we might see faster execution because lapply and sapply use a for loop written in C++ and not R.

`plyr` version of the fit

It's awkward working with these 30 variables as separate columns in a data.frame. We'll homogenize the data by forming a list of data.frames with one component for each variable as follows. We'll also package the BM class variable into the data.frames to have all variables in one location.

wbcd_df_L <- lapply(exp_vars, function(x) {
    df <- data.frame(sample = rownames(wbcd_train), variable = x, value = wbcd_train[, 
        x], class = BM_class_train)
    df
})
head(wbcd_df_L[[1]])

##            sample    variable  value     class
## 873885     873885 radius_mean 0.3928 malignant
## 8610175   8610175 radius_mean 0.2522    benign
## 897880     897880 radius_mean 0.1453    benign
## 91544001 91544001 radius_mean 0.2480    benign
## 911391     911391 radius_mean 0.1845    benign
## 903516     903516 radius_mean 0.6924 malignant

head(wbcd_df_L[[5]])

##            sample        variable  value     class
## 873885     873885 smoothness_mean 0.3425 malignant
## 8610175   8610175 smoothness_mean 0.3529    benign
## 897880     897880 smoothness_mean 0.4340    benign
## 91544001 91544001 smoothness_mean 0.5143    benign
## 911391     911391 smoothness_mean 0.4340    benign
## 903516     903516 smoothness_mean 0.5784 malignant

names(wbcd_df_L) <- exp_vars

Now we don't have to be so sensitive to the variable names.

Now use laply in plyr to get the batch of them. Note that you could also use sapply for this – they are about the same thing.

library(plyr)
var_sig_fstats <- laply(wbcd_df_L, function(df) {
    fit <- lm(value ~ class, data = df)
    f <- summary(fit)$fstatistic[1]
    f
})
names(var_sig_fstats) <- names(wbcd_df_L)
var_sig_fstats[1:3]

##    radius_mean   texture_mean perimeter_mean 
##         445.83          80.29         483.35

This version of the calculations used a uniform formula and varied the data. This is a very flexible method. It isn't sensitive to the names of the variables, for example.

Conclusions about significance of the variables

most_sig_stats <- sort(var_sig_fstats, decreasing = T)
most_sig_stats[1:5]

##    points_worst perimeter_worst     points_mean    radius_worst 
##           650.5           624.9           624.9           599.7 
##      area_worst 
##           491.4

most_sig_stats[25:30]

##   concavity_se   dimension_se  smoothness_se    symmetry_se     texture_se 
##       31.62212        3.67540        2.81093        0.13240        0.05889 
## dimension_mean 
##        0.04483

This already gives us some insight. The last variables in the list aren't significant on their own. Adding them to the model may only increase the variance.

To prepare for doing kNN with a varying number of variables, reorder the variables in the training set data.frame by significance.

wbcd_train_ord <- wbcd_train[, names(most_sig_stats)]

The next step is to assess how many variables give the best fit. for that, we use a nice version of cross-validation.

Monte Carlo Cross-Validation

Starting with a training set \( S \) and validation set \( V \), select a large number (1000) of random subsets of \( S \), \( S_i \), \( i\leq 1000 \), of a fixed size. Let \( V_i = S\setminus S_i \), \( i \leq 1000 \). In comparing parameters for a kNN fit, test the options 1000 times with \( V_i \) as the validation set and \( S_i \) as the training (reference) set. Select the parameters with the smallest mean error across the \( V_i \).

With this choice of parameters, validate the final model with \( V \) as validation set and \( S \) as the reference set.

Reference: S. Dudoit and M.J. van der Laan, Asymptotics of cross-validated risk estimation in estimator selection and performance assessment, Statistical Methodology, vol 2, Issue 2, (2005) 131-154.

Selection of the family of training sets

Let's further divide the training set with each \( S_i \) 2/3 the size of training set picked above.

length(training_id)

## [1] 380

(2/3) * length(training_id)

## [1] 253.3

length(training_id) - 253

## [1] 127

# Use 253 as the training set size.

training_family_L <- lapply(1:1000, function(j) {
    perm <- sample(1:380, size = 380, replace = F)
    shuffle <- training_id[perm]
    trn <- shuffle[1:253]
    trn
})
# save(training_family_L, file='training_family_L.RData')

load("training_family_L.RData")

validation_family_L <- lapply(training_family_L, function(x) setdiff(training_id, 
    x))

Finding an optimal set of variables and optimal k

With the variables ordered by significance as above, we'll calculate the distributions of errors over the 1000 training-validation pairs, for different sets of variables. Our options will be the first \( n \) variables for \( n = 1, 3, 5, \ldots, 30 \):

N <- seq(from = 3, to = 29, by = 2)

At the same time we want to test options for k. The rule of thumb says to use the square root of the size of the reference set. This is

sqrt(length(training_family_L[[1]]))

## [1] 15.91

Let's vary from 3 to 19.

K <- seq(from = 3, to = 19, by = 2)

For each training-validation split, each choice of number of variables, and each k we want to calculate the error of the kNN prediction in the validation set. This is a total of

1000 * length(N) * length(K)

## [1] 126000

Execution of the test with loops

Let's nest a number of for loops to calculate the errors for all training sets and choices of model parameters. What are the important factors?

We have to store the errors in a way that associates them with the parameters and training set index; a. We could nest 3 lists and vectors for the 3 possibilities, but that will be hard to work with; b. Storing everything in a data.frame gives us the most options for examining significance and perhaps plotting; c. Opt for a data.frame with variables for training set index, number of variables, k and the calculated errors.
We must add entries to the data.frame within the nested loops that keeps track of where we are in the overall iterations. a. Using rbind to build the data.frame one row at a time will be very slow. b. Better to initialize a data.frame of the right dimensions and store results in the correct row; c. We'll need an iterator that tracks where we are in the triply nested loop to know what row to store the results in.

We begin by initializing the data.frame that will store the 126,000 entries.

paramter_errors_df <- data.frame(mc_index = as.integer(rep(NA, times = 126000)), 
    var_num = as.integer(rep(NA, times = 126000)), k = as.integer(rep(NA, times = 126000)), 
    error = as.numeric(rep(NA, times = 126000)))

We'll write a new function for the core kNN error calculation we need to do for each parameter choice. Here is an example of the core calculation for the first MC split, n=5 variables, k=7

knn_test <- knn(train = wbcd_train_ord[training_family_L[[1]], 1:5], test = wbcd_train_ord[validation_family_L[[1]], 
    1:5], cl = BM_class_train[training_family_L[[1]]], k = 7)
knn_test[1:3]

## [1] benign benign benign
## Levels: benign malignant

tbl_test <- table(knn_test, BM_class_train[validation_family_L[[1]]])
tbl_test

##            
## knn_test    benign malignant
##   benign        82         6
##   malignant      1        38

err_rate <- (tbl_test[1, 2] + tbl_test[2, 1])/length(validation_family_L[[1]])
err_rate

## [1] 0.05512

To make the notation easier to follow, let's write a function that executes this chunk and returns the error rate and the other parameters.

# j = index, n = length of range of variables, k=k
core_knn <- function(j, n, k) {
    knn_predict <- knn(train = wbcd_train_ord[training_family_L[[j]], 1:n], 
        test = wbcd_train_ord[validation_family_L[[j]], 1:n], cl = BM_class_train[training_family_L[[j]]], 
        k = k)
    tbl <- table(knn_predict, BM_class_train[validation_family_L[[j]]])
    err <- (tbl[1, 2] + tbl[2, 1])/length(validation_family_L[[j]])
    err
}
# sample
core_knn(1, 5, 7)

## [1] 0.05512

With this core function in place we'll run our nested loops. Besides the data.frame we need an iterator to keep track of the loop we're in, which will determine the row of the data.frame we store into.

iter <- 1

Now the for loops. We'll only execute this once since it takes awhile.

str_time <- Sys.time()
for (j in 1:1000) {
    for (n in 1:length(N)) {
        for (m in 1:length(K)) {
            err <- core_knn(j, N[n], K[m])
            paramter_errors_df[iter, ] <- c(j, N[n], K[m], err)
            iter <- iter + 1
        }
    }
}
time_lapsed_for <- Sys.time() - str_time
save(paramter_errors_df, time_lapsed_for, file = "for_loop_paramter_errors.RData")

load("for_loop_paramter_errors.RData")
time_lapsed_for

## Time difference of 39.37 mins

Execution with `plyr`

The following differences characterize how these calculations are done with plyr.

We create a data.frame containing all of the paramter and training set indices we'll want to test
ddply replaces the nested loops for(j in 1:1000), for(n in 1:length(N)), and for(m in 1:length(K)), with ddply(parameter_df, .(mc_index, var_num, k), ...)
This latter expression creates a collection of sub-data.frames, one for each combination of the 3 variables. The … above represents a function that takes the sub-data.frame as input. This effects how we express the arguments to whatever function is inside here.
creating the output data.frame including the input paramters and the calculated error is handled by ddply.

Create a data.frame of all our possible parameter combinations.

param_df1 <- merge(data.frame(mc_index = 1:1000), data.frame(var_num = N))
param_df <- merge(param_df1, data.frame(k = K))
str(param_df)

## 'data.frame':    126000 obs. of  3 variables:
##  $ mc_index: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ var_num : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ k       : num  3 3 3 3 3 3 3 3 3 3 ...

We'll nest these inside several **ply functions.

So, range over all these parameters, apply the core and stack all the results in a data.frame.

This is a test of the code using just 20 choices of parameters.

knn_err_est_df_test <- ddply(param_df[1:20, ], .(mc_index, var_num, k), function(df) {
    err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
    err
})
head(knn_err_est_df_test)

##   mc_index var_num k      V1
## 1        1       3 3 0.07874
## 2        2       3 3 0.04724
## 3        3       3 3 0.10236
## 4        4       3 3 0.06299
## 5        5       3 3 0.07087
## 6        6       3 3 0.03937

Do the full run.

str_time <- Sys.time()
knn_err_est_df <- ddply(param_df, .(mc_index, var_num, k), function(df) {
    err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
    err
})
time_lapsed <- Sys.time() - str_time
save(knn_err_est_df, time_lapsed, file = "knn_err_est_df.RData")

load("knn_err_est_df.RData")
time_lapsed

## Time difference of 6.811 mins

head(knn_err_est_df)

##   mc_index var_num  k      V1
## 1        1       3  3 0.03738
## 2        1       3  5 0.03738
## 3        1       3  7 0.05140
## 4        1       3  9 0.04673
## 5        1       3 11 0.04206
## 6        1       3 13 0.04673

names(knn_err_est_df)[4] <- "error"

This is almost a six-fold speed-up on the nested loops.

Getting summary performance statistics

For each range of variables and each k we have 1000 errors from the family of training-validation sets. The mean of these is a good estimate of the performance of these parameters. The choice with minimal mean errors likely generalizes very well.

How do we get the mean errors as easily as possible?

Example:

mean_ex_df <- subset(knn_err_est_df, var_num == 5 & k == 7)
head(mean_ex_df)

##     mc_index var_num k   error
## 12         1       5 7 0.03738
## 138        2       5 7 0.06436
## 264        3       5 7 0.04808
## 390        4       5 7 0.06763
## 516        5       5 7 0.06220
## 642        6       5 7 0.04651

mean(mean_ex_df$error)

## [1] 0.05729

We can generate a data.frame containing the summary mean errors over the 1000 training-validation sets for parameter choices as follows.

The following can be read as “For all combinations (n,m) of variable numbers and values of k, subset the data.frame to the records with var_num = n and k=m, then compute the mean of the error variable over these recoprds”.

mean_errs_df <- ddply(knn_err_est_df, .(var_num, k), function(df) mean(df$error))
head(mean_errs_df)

##   var_num  k      V1
## 1       3  3 0.05564
## 2       3  5 0.05746
## 3       3  7 0.05796
## 4       3  9 0.05612
## 5       3 11 0.05443
## 6       3 13 0.05472

names(mean_errs_df)[3] <- "mean_error"

Here, again, ddply took care of creating a data.frame of results, keeping (n,m) with the corresponding record.

Survey of parameter performance

Let's visualize the performance in a plot with color representing mean error.

library(ggplot2)

ggplot(data = mean_errs_df, aes(x = var_num, y = k, color = mean_error)) + geom_point(size = 10) + 
    theme_bw()

plot of chunk unnamed-chunk-50

Interesting that it performs pretty well with 3 variables and high k, gets worse out to 17 and then gets better. But it's worse 27 or 29 variables and high k.

Let's replot with the variable numbers between 15 and 30.

ggplot(data = subset(mean_errs_df, var_num >= 15), aes(x = var_num, y = k, color = mean_error)) + 
    geom_point(size = 10) + theme_bw()

plot of chunk unnamed-chunk-51

It looks like 19 variables and more with a low k works best. Let's look at the real numbers.

subset(mean_errs_df, var_num == 17)

##    var_num  k mean_error
## 64      17  3    0.02562
## 65      17  5    0.02572
## 66      17  7    0.02818
## 67      17  9    0.03141
## 68      17 11    0.03246
## 69      17 13    0.03217
## 70      17 15    0.03128
## 71      17 17    0.03089
## 72      17 19    0.03115

subset(mean_errs_df, var_num == 19)

##    var_num  k mean_error
## 73      19  3    0.01875
## 74      19  5    0.02421
## 75      19  7    0.02542
## 76      19  9    0.02585
## 77      19 11    0.02631
## 78      19 13    0.02612
## 79      19 15    0.02595
## 80      19 17    0.02625
## 81      19 19    0.02724

subset(mean_errs_df, var_num == 21)

##    var_num  k mean_error
## 82      21  3    0.01948
## 83      21  5    0.02267
## 84      21  7    0.02337
## 85      21  9    0.02400
## 86      21 11    0.02478
## 87      21 13    0.02621
## 88      21 15    0.02738
## 89      21 17    0.02797
## 90      21 19    0.02828

subset(mean_errs_df, var_num == 25)

##     var_num  k mean_error
## 100      25  3    0.01750
## 101      25  5    0.02274
## 102      25  7    0.02370
## 103      25  9    0.02352
## 104      25 11    0.02365
## 105      25 13    0.02446
## 106      25 15    0.02530
## 107      25 17    0.02685
## 108      25 19    0.02820

mean_errs_df[which.min(mean_errs_df$mean_error), ]

##     var_num k mean_error
## 109      27 3    0.01697

So, the best is 27 variables with k=3; with 25 variables and k=3 coming close. Here's a review of variables.

names(wbcd_train_ord)

##  [1] "points_worst"      "perimeter_worst"   "points_mean"      
##  [4] "radius_worst"      "area_worst"        "perimeter_mean"   
##  [7] "radius_mean"       "concavity_mean"    "area_mean"        
## [10] "concavity_worst"   "compactness_mean"  "compactness_worst"
## [13] "area_se"           "radius_se"         "perimeter_se"     
## [16] "texture_worst"     "points_se"         "smoothness_worst" 
## [19] "texture_mean"      "symmetry_worst"    "smoothness_mean"  
## [22] "dimension_worst"   "symmetry_mean"     "compactness_se"   
## [25] "concavity_se"      "dimension_se"      "smoothness_se"    
## [28] "symmetry_se"       "texture_se"        "dimension_mean"

Validation of the final test

We need to use the first 19 variables, ordered in a specific way, so, sort the data.frame of variables for the validation set in this order.

wbcd_val_ord <- wbcd_val[, names(wbcd_train_ord)]

Compute the predictions for the validation set using the whole training set for reference.

bm_val_pred <- knn(train = wbcd_train_ord[, 1:27], wbcd_val_ord[, 1:27], BM_class_train, 
    k = 3)

tbl_bm_val <- table(bm_val_pred, BM_class_val)
tbl_bm_val

##            BM_class_val
## bm_val_pred benign malignant
##   benign       108         6
##   malignant      2        73

(val_error <- tbl_bm_val[1, 2] + tbl_bm_val[2, 1])/length(BM_class_val)

## [1] 0.04233

This is much better than the rule of thumb.

Appendix on speed

It took about 6 minutes to run the core kNN parameter test. We can try speeding this up using a multicore parallelization.

library(doMC)

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

registerDoMC()
# How many cores are we using?
getDoParWorkers()

## [1] 4

str_time <- Sys.time()
knn_err_est_df_par <- ddply(param_df, .(mc_index, var_num, k), function(df) {
    err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
    err
}, .parallel = T)
time_lapsed_par <- Sys.time() - str_time
save(knn_err_est_df_par, time_lapsed_par, file = "knn_err_est_df_par.RData")

load("knn_err_est_df_par.RData")
time_lapsed_par

## Time difference of 4.165 mins

This is about 60% of the earlier runtime. It isn't 25% of the time. It takes some time to move the data to the other cores. The kNN function is fast, so moving the data represents a significant percentage of the processing time. As the run-time of the core function increases, the speedup obtained by parallelization goes up.