Machine learning algorithms provide methods of classifying objects into one of several groups based on the values of several explanatory variables. Nearest neighbor methods are easily implemented and easy to understand. There is no fitted model behind them, so errors have to be estimated computationally, but they provide one simple way to classify a new object based on known results in a reference set. kNN is a generalization of "if it walks like a duck, looks like a duck, and talks like a duck, it is probably a duck." That is, objects that are close together with respect to the explanatory variables are likely to have the same classification.
In the simplest setting, like the example we will do here, objects fall into one of two classes, \( A \) or \( B \). We have a set of \( n \) measurements, \( v_1, \dots, v_n \), of any object in question, and assume these are all numeric variables. We use a distance measure, namely Euclidean distance, to make precise what it means for two objects to be close with respect to these variables. So, given objects \( s \) and \( r \) in the domain, with \( v_i \) measurements \( x_1, \dots, x_n \) and \( y_1, \dots, y_n \), respectively, define \[ dist(s, r) = \sqrt{\sum_i (x_i-y_i)^2}. \] Depending on characteristics of the variables, other distance measures may be more appropriate, but we'll stick with Euclidean distance.
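As a concrete illustration, here is a small helper that computes this distance for two numeric measurement vectors (a sketch only; the name euclid_dist is ours and is not used in the analysis below).

# Euclidean distance between two numeric measurement vectors
euclid_dist <- function(x, y) {
    sqrt(sum((x - y)^2))
}
euclid_dist(c(0, 0), c(3, 4))  # 5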
The kNN algorithm begins with a training set of objects for which we know not only the values of the explanatory variables but also the classifications \( A \) and \( B \). To predict the classification of a new object \( q \), the \( k=1 \) version of kNN finds the element of the training set at minimum distance from \( q \), say \( p_1 \), and predicts the classification of \( q \) to be the same as that of \( p_1 \). If \( p_1 \) is rather isolated and there are lots of points of the other class almost as close to \( q \), this could be a misclassification. So, we generally pick some larger odd number \( m \) for \( k \), find the \( m \) training points \( p_1, \dots, p_m \) closest to \( q \), and let them vote on the classification of \( q \). A rule of thumb in machine learning is to pick \( k \) near the square root of the size of the training set. In practice this does a good job of separating signal from noise.
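Since the whole procedure is so short, here is a bare-bones sketch of it in R (illustrative only; knn_vote is our own name, and later we will use the knn function from the class package instead).

# Classify a single new object q by majority vote of its k nearest neighbors.
# train: rows are reference objects, columns are the numeric variables
# labels: factor of known classifications, one entry per row of train
# k: number of neighbors to vote (odd avoids ties with two classes)
knn_vote <- function(train, labels, q, k) {
    d <- sqrt(rowSums(sweep(train, 2, q)^2))  # distance from q to every reference object
    nearest <- order(d)[1:k]                  # indices of the k closest rows
    votes <- table(labels[nearest])           # tally their classifications
    names(votes)[which.max(votes)]            # majority label wins
}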
Typically, biopsy tissue samples from breast cancer are preserved on a paraffin slide and examined visually by a pathologist. Cancerous tissues have more misshapen cells, larger nuclei, rougher surfaces and other abnormal characteristics. Today, automated image analysis can collect measurements of the nuclei of the cells in a picture of a sample without human intervention. From the UC Irvine machine learning archive we have the Wisconsin Breast Cancer Dataset, with nuclei measurements of 569 samples, some benign and some malignant. It has been saved to disk and is read in and cleaned up here.
Reference: For definitions and a discussion of the meaning of the measurements used here, consult
W. Wolberg, W.N. Street, O.L. Mangasarian, Importance of nuclear morphology in breast cancer prognosis, Clinical Cancer Research, (1999) Vol. 5, 3542-3548.
setwd("/Users/steve/Documents/Computing with Data/17_Refining_kNN/")
wisc_bc_df <- read.csv("./wisc_bc_data.csv", stringsAsFactors = F)
str(wisc_bc_df)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : chr "B" "B" "B" "B" ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
Clearly a lot of these variables are related but we can't really tell if any are redundant. We'll keep all of the variables for now. The second variable appears to be the classification.
table(wisc_bc_df$diagnosis)
##
## B M
## 357 212
sum(is.na(wisc_bc_df$diagnosis))
## [1] 0
This really should be a factor, so
wisc_bc_df$diagnosis <- factor(wisc_bc_df$diagnosis, levels = c("B", "M"), labels = c("benign",
"malignant"))
We want to measure distances between samples using the numerical variables. However, some variables are on very different scales: smoothness means are around 0.1 and areas are around 400. This over-emphasizes area in a distance calculation, so we need to transform all variables to comparable scales. A standard choice in this setting is min-max normalization, which rescales each variable to the interval [0, 1]. We define a function to do the calculation for any column.
normalize <- function(x) {
y <- (x - min(x))/(max(x) - min(x))
y
}
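A quick sanity check on a toy vector (values chosen arbitrarily):

normalize(c(1, 5, 10))  # 0, 4/9 = 0.444..., 1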
We can apply this function to each of the numeric columns using lapply.
wbcd_n_L <- lapply(wisc_bc_df[, 3:32], normalize)
wbcd_n <- data.frame(wbcd_n_L)
wbcd_n[1:3, 1:4]
## radius_mean texture_mean perimeter_mean area_mean
## 1 0.2527 0.09063 0.2423 0.13599
## 2 0.1713 0.31248 0.1761 0.08607
## 3 0.1921 0.24078 0.1875 0.09743
# Add id labels as rownames to keep track of the patient's data
rownames(wbcd_n) <- wisc_bc_df$id
While we're at it, let's isolate the class labels and tag them with patient ids.
BM_class <- wisc_bc_df[, 2]
names(BM_class) <- wisc_bc_df$id
BM_class[1:3]
## 87139402 8910251 905520
## benign benign benign
## Levels: benign malignant
A large training set helps to give us a good model, but a large validation set increases the significance of the result you report. You need to strike a balance. The choice of what fraction to use for training may depend on how messy the data is, or how complex the modeling method is. A reasonable balance is 2/3 training and 1/3 validation. Furthermore, in splitting the data we want the training set (and implicitly the validation set) to be representative of the general population as far as this particular problem goes. Here, that means they should contain roughly the same fraction of benign and malignant cases as the whole population.
We'll do the split by randomly permuting the rows of data, selecting 1/3 for validation, and leaving the rest for training. The rows are identified by the id variable.
nrow(wisc_bc_df)
rand_permute <- sample(x = 1:569, size = 569)
rand_permute[1:5]
# save(rand_permute, file='rand_permute.RData')
load("rand_permute.RData")
all_id_random <- wisc_bc_df[rand_permute, "id"]
# Select the first third of these for validation
569/3
## [1] 189.7
validate_id <- as.character(all_id_random[1:189])
training_id <- as.character(all_id_random[190:569])
Subset the data we'll use in the knn analysis accordingly.
wbcd_train <- wbcd_n[training_id, ]
wbcd_val <- wbcd_n[validate_id, ]
BM_class_train <- BM_class[training_id]
BM_class_val <- BM_class[validate_id]
table(BM_class_train)
## BM_class_train
## benign malignant
## 247 133
table(BM_class_val)
## BM_class_val
## benign malignant
## 110 79
Check that the training and validation sets have roughly the same proportions of benign and malignant cases as the full data set: the training set is about 65% benign and the validation set about 58% benign, compared with 63% overall, so the split is reasonably representative.
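A quick way to check the proportions directly is prop.table (shown as a sketch; the fractions in the comments are computed from the tables above).

prop.table(table(BM_class_train))  # about 0.65 benign, 0.35 malignant
prop.table(table(BM_class_val))    # about 0.58 benign, 0.42 malignant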
The knn algorithm is implemented in the package class. Install this if necessary and then load it.
library(class)
`?`(knn)
What should k be? Let's use an odd number near the square root of the size of the training set.
sqrt(nrow(wbcd_train))
## [1] 19.49
k <- 19
knn_predict <- knn(wbcd_train, wbcd_val, BM_class_train, k = 19)
knn_predict[1:3]
## [1] benign malignant benign
## Levels: benign malignant
Check the agreement of the predictions with the known classifications.
table(knn_predict, BM_class_val)
## BM_class_val
## knn_predict benign malignant
## benign 108 10
## malignant 2 69
prop.table(table(knn_predict, BM_class_val))
## BM_class_val
## knn_predict benign malignant
## benign 0.57143 0.05291
## malignant 0.01058 0.36508
Not very good as classifications go: 10 of the 79 malignant cases in the validation set were classified as benign.
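A single summary number is the overall error rate, read off the off-diagonal cells of the confusion table (tbl_19 is just a scratch name for this check):

tbl_19 <- table(knn_predict, BM_class_val)
(tbl_19[1, 2] + tbl_19[2, 1])/length(BM_class_val)  # (10 + 2)/189, about 0.063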
Let's try a few other values of k and see how the performance changes: k = 3, 7, 11, 31.
knn_predict_3 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 3)
knn_predict_7 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 7)
knn_predict_11 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 11)
knn_predict_31 <- knn(wbcd_train, wbcd_val, BM_class_train, k = 31)
table(knn_predict_3, BM_class_val)
## BM_class_val
## knn_predict_3 benign malignant
## benign 109 8
## malignant 1 71
table(knn_predict_7, BM_class_val)
## BM_class_val
## knn_predict_7 benign malignant
## benign 109 9
## malignant 1 70
table(knn_predict_11, BM_class_val)
## BM_class_val
## knn_predict_11 benign malignant
## benign 108 10
## malignant 2 69
table(knn_predict_31, BM_class_val)
## BM_class_val
## knn_predict_31 benign malignant
## benign 108 11
## malignant 2 68
The best was actually k = 3, by a little. Sometimes too many neighbors hide subclusters that are fairly small. This can also be very dependent on the choice of training set.
The biggest gap I see is the use of all variables in the distance calculation. Three summary statistics for each of 10 nuclear features almost certainly introduce some redundancy, and hence added variance. More worrisome is that we may lose an opportunity to gain important insight into tumor biology: if a couple of traits alone can separate benign from malignant, then we should look at the causes of those features as underlying factors in cancer development. In summary, the two factors we want to examine are the number of variables used in the distance calculation and the value of k.
An obstacle to testing which parameters give the best fit is that we have just one training set and one validation set, and we need to reserve the validation set for the final choice of model. We need more training-validation partitions to select these parameters.
We may want to cut down from the full 30 variables. Working in the training set, let's rank these by significance in their variation between benign and malignant. These variables are all continuous, so we'll use the F-statistic of a linear fit.
Before showing the fast way to do this, let's look at some approaches that take less data manipulation but more typing, so we can compare them.
names(wbcd_train)
## [1] "radius_mean" "texture_mean" "perimeter_mean"
## [4] "area_mean" "smoothness_mean" "compactness_mean"
## [7] "concavity_mean" "points_mean" "symmetry_mean"
## [10] "dimension_mean" "radius_se" "texture_se"
## [13] "perimeter_se" "area_se" "smoothness_se"
## [16] "compactness_se" "concavity_se" "points_se"
## [19] "symmetry_se" "dimension_se" "radius_worst"
## [22] "texture_worst" "perimeter_worst" "area_worst"
## [25] "smoothness_worst" "compactness_worst" "concavity_worst"
## [28] "points_worst" "symmetry_worst" "dimension_worst"
lm_1 <- lm(radius_mean ~ BM_class_train, data = wbcd_train)
summary(lm_1)
##
## Call:
## lm(formula = radius_mean ~ BM_class_train, data = wbcd_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3113 -0.0790 0.0034 0.0695 0.5008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24499 0.00712 34.4 <2e-16 ***
## BM_class_trainmalignant 0.25418 0.01204 21.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.112 on 378 degrees of freedom
## Multiple R-squared: 0.541, Adjusted R-squared: 0.54
## F-statistic: 446 on 1 and 378 DF, p-value: <2e-16
names(summary(lm_1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(lm_1)$fstatistic
## value numdf dendf
## 445.8 1.0 378.0
# The significance measure we want:
summary(lm_1)$fstatistic[1]
## value
## 445.8
We need a vector to hold the outputs to avoid working with them one at a time.
exp_var_fstat <- as.numeric(rep(NA, times = 30))
names(exp_var_fstat) <- names(wbcd_train)
Now run through the variables, fit the linear model, and store the F-statistic.
exp_var_fstat["radius_mean"] <- summary(lm(radius_mean ~ BM_class_train, data = wbcd_train))$fstatistic[1]
exp_var_fstat["texture_mean"] <- summary(lm(texture_mean ~ BM_class_train, data = wbcd_train))$fstatistic[1]
exp_var_fstat["perimeter_mean"] <- summary(lm(perimeter_mean ~ BM_class_train,
data = wbcd_train))$fstatistic[1]
exp_var_fstat
## radius_mean texture_mean perimeter_mean area_mean
## 445.83 80.29 483.35 NA
## smoothness_mean compactness_mean concavity_mean points_mean
## NA NA NA NA
## symmetry_mean dimension_mean radius_se texture_se
## NA NA NA NA
## perimeter_se area_se smoothness_se compactness_se
## NA NA NA NA
## concavity_se points_se symmetry_se dimension_se
## NA NA NA NA
## radius_worst texture_worst perimeter_worst area_worst
## NA NA NA NA
## smoothness_worst compactness_worst concavity_worst points_worst
## NA NA NA NA
## symmetry_worst dimension_worst
## NA NA
Keep going. I'd be tempted to get lazy and not do all of them.
Rather than doing the one-at-a-time version, we can try looping over each of the variables. But some careful coding needs to be done.
Again, we'll need a vector to hold the significance measures.
exp_vars <- names(wbcd_train)
exp_var_fstat <- as.numeric(rep(NA, times = 30))
names(exp_var_fstat) <- exp_vars
The following produces an error.
for (j in 1:length(exp_vars)) {
exp_var_fstat[exp_vars[j]] <- summary(lm(exp_vars[j] ~ BM_class_train, data = wbcd_train))$fstatistic[1]
}
The problem is that lm sees the literal expression exp_vars[j] in the formula rather than the variable name stored in that slot, and a character string can't serve as the response. We need to build a formula containing the actual variable name, which we can do with paste and as.formula.
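For instance, applied to the first variable the idiom produces an ordinary formula object (f is just a scratch name for this illustration):

f <- as.formula(paste(exp_vars[1], " ~ BM_class_train"))
f         # radius_mean ~ BM_class_train
class(f)  # "formula"

Using the same idiom inside the loop: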
for (j in 1:length(exp_vars)) {
exp_var_fstat[exp_vars[j]] <- summary(lm(as.formula(paste(exp_vars[j], " ~ BM_class_train")),
data = wbcd_train))$fstatistic[1]
}
This works.
exp_var_fstat
## radius_mean texture_mean perimeter_mean area_mean
## 445.83008 80.28969 483.35212 406.02540
## smoothness_mean compactness_mean concavity_mean points_mean
## 53.17654 231.82299 424.86462 624.85237
## symmetry_mean dimension_mean radius_se texture_se
## 43.17113 0.04483 185.76126 0.05889
## perimeter_se area_se smoothness_se compactness_se
## 172.59053 189.61081 2.81093 38.93364
## concavity_se points_se symmetry_se dimension_se
## 31.62212 83.22744 0.13240 3.67540
## radius_worst texture_worst perimeter_worst area_worst
## 599.65263 98.65231 624.91718 491.38056
## smoothness_worst compactness_worst concavity_worst points_worst
## 82.54685 224.67379 297.80279 650.50615
## symmetry_worst dimension_worst
## 71.84282 52.88346
We quickly got all the statistics.
Of course, we can avoid initializing the output vector or worrying about an iterator by using lapply or sapply.
exp_var_fstat2 <- sapply(exp_vars, function(x) {
summary(lm(as.formula(paste(x, " ~ BM_class_train")), data = wbcd_train))$fstatistic[1]
})
exp_var_fstat2
## radius_mean.value texture_mean.value perimeter_mean.value
## 445.83008 80.28969 483.35212
## area_mean.value smoothness_mean.value compactness_mean.value
## 406.02540 53.17654 231.82299
## concavity_mean.value points_mean.value symmetry_mean.value
## 424.86462 624.85237 43.17113
## dimension_mean.value radius_se.value texture_se.value
## 0.04483 185.76126 0.05889
## perimeter_se.value area_se.value smoothness_se.value
## 172.59053 189.61081 2.81093
## compactness_se.value concavity_se.value points_se.value
## 38.93364 31.62212 83.22744
## symmetry_se.value dimension_se.value radius_worst.value
## 0.13240 3.67540 599.65263
## texture_worst.value perimeter_worst.value area_worst.value
## 98.65231 624.91718 491.38056
## smoothness_worst.value compactness_worst.value concavity_worst.value
## 82.54685 224.67379 297.80279
## points_worst.value symmetry_worst.value dimension_worst.value
## 650.50615 71.84282 52.88346
names(exp_var_fstat2) <- exp_vars
If we had 10,000 variables instead of 30, and our function were more time-consuming than lm, we might see faster execution because the looping inside lapply and sapply is done in compiled C code rather than in R.
plyr version of the fit

It's awkward working with these 30 variables as separate columns in a data.frame. We'll homogenize the data by forming a list of data.frames with one component per variable, as follows. We'll also package the BM class variable into each data.frame so that all the variables we need are in one place.
wbcd_df_L <- lapply(exp_vars, function(x) {
df <- data.frame(sample = rownames(wbcd_train), variable = x, value = wbcd_train[,
x], class = BM_class_train)
df
})
head(wbcd_df_L[[1]])
## sample variable value class
## 873885 873885 radius_mean 0.3928 malignant
## 8610175 8610175 radius_mean 0.2522 benign
## 897880 897880 radius_mean 0.1453 benign
## 91544001 91544001 radius_mean 0.2480 benign
## 911391 911391 radius_mean 0.1845 benign
## 903516 903516 radius_mean 0.6924 malignant
head(wbcd_df_L[[5]])
## sample variable value class
## 873885 873885 smoothness_mean 0.3425 malignant
## 8610175 8610175 smoothness_mean 0.3529 benign
## 897880 897880 smoothness_mean 0.4340 benign
## 91544001 91544001 smoothness_mean 0.5143 benign
## 911391 911391 smoothness_mean 0.4340 benign
## 903516 903516 smoothness_mean 0.5784 malignant
names(wbcd_df_L) <- exp_vars
Now we don't have to be so sensitive to the variable names.
Now use laply in plyr to get the batch of them. Note that you could also use sapply for this – they do about the same thing.
library(plyr)
var_sig_fstats <- laply(wbcd_df_L, function(df) {
fit <- lm(value ~ class, data = df)
f <- summary(fit)$fstatistic[1]
f
})
names(var_sig_fstats) <- names(wbcd_df_L)
var_sig_fstats[1:3]
## radius_mean texture_mean perimeter_mean
## 445.83 80.29 483.35
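For comparison, the same batch of fits with sapply over the list of data.frames (a sketch; it should reproduce var_sig_fstats, with a ".value" suffix on the names as we saw earlier):

var_sig_fstats_alt <- sapply(wbcd_df_L, function(df) {
    summary(lm(value ~ class, data = df))$fstatistic[1]
})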
This version of the calculations used a uniform formula and varied the data. This is a very flexible method. It isn't sensitive to the names of the variables, for example.
most_sig_stats <- sort(var_sig_fstats, decreasing = T)
most_sig_stats[1:5]
## points_worst perimeter_worst points_mean radius_worst
## 650.5 624.9 624.9 599.7
## area_worst
## 491.4
most_sig_stats[25:30]
## concavity_se dimension_se smoothness_se symmetry_se texture_se
## 31.62212 3.67540 2.81093 0.13240 0.05889
## dimension_mean
## 0.04483
This already gives us some insight. The last variables in the list aren't significant on their own. Adding them to the model may only increase the variance.
To prepare for doing kNN with a varying number of variables, reorder the variables in the training set data.frame by significance.
wbcd_train_ord <- wbcd_train[, names(most_sig_stats)]
The next step is to assess how many variables give the best fit. For that, we use a nice version of cross-validation.
Starting with a training set \( S \) and validation set \( V \), select a large number (1000) of random subsets of \( S \), \( S_i \), \( i\leq 1000 \), of a fixed size. Let \( V_i = S\setminus S_i \), \( i \leq 1000 \). In comparing parameters for a kNN fit, test the options 1000 times with \( V_i \) as the validation set and \( S_i \) as the training (reference) set. Select the parameters with the smallest mean error across the \( V_i \).
With this choice of parameters, validate the final model with \( V \) as validation set and \( S \) as the reference set.
Reference: S. Dudoit and M.J. van der Laan, Asymptotics of cross-validated risk estimation in estimator selection and performance assessment, Statistical Methodology, vol 2, Issue 2, (2005) 131-154.
Let's further divide the training set with each \( S_i \) 2/3 the size of training set picked above.
length(training_id)
## [1] 380
(2/3) * length(training_id)
## [1] 253.3
length(training_id) - 253
## [1] 127
# Use 253 as the training set size.
training_family_L <- lapply(1:1000, function(j) {
perm <- sample(1:380, size = 380, replace = F)
shuffle <- training_id[perm]
trn <- shuffle[1:253]
trn
})
# save(training_family_L, file='training_family_L.RData')
load("training_family_L.RData")
validation_family_L <- lapply(training_family_L, function(x) setdiff(training_id,
x))
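A quick check that the splits have the intended sizes:

length(training_family_L[[1]])    # 253
length(validation_family_L[[1]])  # 127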
With the variables ordered by significance as above, we'll calculate the distributions of errors over the 1000 training-validation pairs for different sets of variables. Our options will be the first \( n \) variables for \( n = 3, 5, \ldots, 29 \):
N <- seq(from = 3, to = 29, by = 2)
At the same time we want to test options for k. The rule of thumb says to use the square root of the size of the reference set. This is
sqrt(length(training_family_L[[1]]))
## [1] 15.91
Let's vary from 3 to 19.
K <- seq(from = 3, to = 19, by = 2)
For each training-validation split, each choice of number of variables, and each k we want to calculate the error of the kNN prediction in the validation set. This is a total of
1000 * length(N) * length(K)
## [1] 126000
Let's nest a number of for loops to calculate the errors for all training sets and choices of model parameters. What are the important factors?

a. Using rbind to build the data.frame one row at a time will be very slow;
b. It is better to initialize a data.frame of the right dimensions and store results in the correct row;
c. We'll need an iterator that tracks where we are in the triply nested loop to know which row to store the results in.

We begin by initializing the data.frame that will store the 126,000 entries.
paramter_errors_df <- data.frame(mc_index = as.integer(rep(NA, times = 126000)),
var_num = as.integer(rep(NA, times = 126000)), k = as.integer(rep(NA, times = 126000)),
error = as.numeric(rep(NA, times = 126000)))
We'll write a new function for the core kNN error calculation we need to do for each parameter choice. Here is an example of the core calculation for the first MC split, with n = 5 variables and k = 7.
knn_test <- knn(train = wbcd_train_ord[training_family_L[[1]], 1:5], test = wbcd_train_ord[validation_family_L[[1]],
1:5], cl = BM_class_train[training_family_L[[1]]], k = 7)
knn_test[1:3]
## [1] benign benign benign
## Levels: benign malignant
tbl_test <- table(knn_test, BM_class_train[validation_family_L[[1]]])
tbl_test
##
## knn_test benign malignant
## benign 82 6
## malignant 1 38
err_rate <- (tbl_test[1, 2] + tbl_test[2, 1])/length(validation_family_L[[1]])
err_rate
## [1] 0.05512
To make the notation easier to follow, let's write a function that executes this chunk for a given split and parameter choice and returns the error rate.
# j = index, n = length of range of variables, k=k
core_knn <- function(j, n, k) {
knn_predict <- knn(train = wbcd_train_ord[training_family_L[[j]], 1:n],
test = wbcd_train_ord[validation_family_L[[j]], 1:n], cl = BM_class_train[training_family_L[[j]]],
k = k)
tbl <- table(knn_predict, BM_class_train[validation_family_L[[j]]])
err <- (tbl[1, 2] + tbl[2, 1])/length(validation_family_L[[j]])
err
}
# sample
core_knn(1, 5, 7)
## [1] 0.05512
With this core function in place we'll run our nested loops. Besides the data.frame we need an iterator to keep track of the loop we're in, which will determine the row of the data.frame we store into.
iter <- 1
Now the for loops. We'll only execute this once since it takes a while.
str_time <- Sys.time()
for (j in 1:1000) {
for (n in 1:length(N)) {
for (m in 1:length(K)) {
err <- core_knn(j, N[n], K[m])
paramter_errors_df[iter, ] <- c(j, N[n], K[m], err)
iter <- iter + 1
}
}
}
time_lapsed_for <- Sys.time() - str_time
save(paramter_errors_df, time_lapsed_for, file = "for_loop_paramter_errors.RData")
load("for_loop_paramter_errors.RData")
time_lapsed_for
## Time difference of 39.37 mins
plyr

The following differences characterize how these calculations are done with plyr: ddply replaces the nested loops for(j in 1:1000), for(n in 1:length(N)), and for(m in 1:length(K)) with a single call of the form ddply(param_df, .(mc_index, var_num, k), ...). To feed ddply, create a data.frame of all our possible parameter combinations.
param_df1 <- merge(data.frame(mc_index = 1:1000), data.frame(var_num = N))
param_df <- merge(param_df1, data.frame(k = K))
str(param_df)
## 'data.frame': 126000 obs. of 3 variables:
## $ mc_index: int 1 2 3 4 5 6 7 8 9 10 ...
## $ var_num : num 3 3 3 3 3 3 3 3 3 3 ...
## $ k : num 3 3 3 3 3 3 3 3 3 3 ...
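An equivalent way to build this grid of parameter combinations is expand.grid (param_df_alt is just an alternative name for comparison; the set of rows is the same, though their order may differ):

param_df_alt <- expand.grid(mc_index = 1:1000, var_num = N, k = K)
nrow(param_df_alt)  # 126000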
We'll nest these inside several **ply functions. So, range over all these parameters, apply the core function, and stack all the results in a data.frame.
This is a test of the code using just 20 choices of parameters.
knn_err_est_df_test <- ddply(param_df[1:20, ], .(mc_index, var_num, k), function(df) {
err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
err
})
head(knn_err_est_df_test)
## mc_index var_num k V1
## 1 1 3 3 0.07874
## 2 2 3 3 0.04724
## 3 3 3 3 0.10236
## 4 4 3 3 0.06299
## 5 5 3 3 0.07087
## 6 6 3 3 0.03937
Do the full run.
str_time <- Sys.time()
knn_err_est_df <- ddply(param_df, .(mc_index, var_num, k), function(df) {
err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
err
})
time_lapsed <- Sys.time() - str_time
save(knn_err_est_df, time_lapsed, file = "knn_err_est_df.RData")
load("knn_err_est_df.RData")
time_lapsed
## Time difference of 6.811 mins
head(knn_err_est_df)
## mc_index var_num k V1
## 1 1 3 3 0.03738
## 2 1 3 5 0.03738
## 3 1 3 7 0.05140
## 4 1 3 9 0.04673
## 5 1 3 11 0.04206
## 6 1 3 13 0.04673
names(knn_err_est_df)[4] <- "error"
This is almost a six-fold speed-up on the nested loops.
For each range of variables and each k we have 1000 errors from the family of training-validation sets. The mean of these is a good estimate of the performance of these parameters. The choice with minimal mean errors likely generalizes very well.
How do we get the mean errors as easily as possible?
Example:
mean_ex_df <- subset(knn_err_est_df, var_num == 5 & k == 7)
head(mean_ex_df)
## mc_index var_num k error
## 12 1 5 7 0.03738
## 138 2 5 7 0.06436
## 264 3 5 7 0.04808
## 390 4 5 7 0.06763
## 516 5 5 7 0.06220
## 642 6 5 7 0.04651
mean(mean_ex_df$error)
## [1] 0.05729
We can generate a data.frame containing the summary mean errors over the 1000 training-validation sets for parameter choices as follows.
The following can be read as: for each combination (n, m) of variable number and value of k, subset the data.frame to the records with var_num = n and k = m, then compute the mean of the error variable over these records.
mean_errs_df <- ddply(knn_err_est_df, .(var_num, k), function(df) mean(df$error))
head(mean_errs_df)
## var_num k V1
## 1 3 3 0.05564
## 2 3 5 0.05746
## 3 3 7 0.05796
## 4 3 9 0.05612
## 5 3 11 0.05443
## 6 3 13 0.05472
names(mean_errs_df)[3] <- "mean_error"
Here, again, ddply took care of creating a data.frame of results, keeping each (n, m) with the corresponding record.
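The same summary can also be written with plyr's summarise helper, which names the output column directly (a sketch; mean_errs_df2 should match mean_errs_df):

mean_errs_df2 <- ddply(knn_err_est_df, .(var_num, k), summarise, mean_error = mean(error))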
Let's visualize the performance in a plot with color representing mean error.
library(ggplot2)
ggplot(data = mean_errs_df, aes(x = var_num, y = k, color = mean_error)) + geom_point(size = 10) +
theme_bw()
Interestingly, it performs pretty well with 3 variables and high k, gets worse out to about 17 variables, and then gets better again. But it's worse again with 27 or 29 variables and high k.
Let's replot with the variable numbers between 15 and 30.
ggplot(data = subset(mean_errs_df, var_num >= 15), aes(x = var_num, y = k, color = mean_error)) +
geom_point(size = 10) + theme_bw()
It looks like 19 or more variables with a low k works best. Let's look at the actual numbers.
subset(mean_errs_df, var_num == 17)
## var_num k mean_error
## 64 17 3 0.02562
## 65 17 5 0.02572
## 66 17 7 0.02818
## 67 17 9 0.03141
## 68 17 11 0.03246
## 69 17 13 0.03217
## 70 17 15 0.03128
## 71 17 17 0.03089
## 72 17 19 0.03115
subset(mean_errs_df, var_num == 19)
## var_num k mean_error
## 73 19 3 0.01875
## 74 19 5 0.02421
## 75 19 7 0.02542
## 76 19 9 0.02585
## 77 19 11 0.02631
## 78 19 13 0.02612
## 79 19 15 0.02595
## 80 19 17 0.02625
## 81 19 19 0.02724
subset(mean_errs_df, var_num == 21)
## var_num k mean_error
## 82 21 3 0.01948
## 83 21 5 0.02267
## 84 21 7 0.02337
## 85 21 9 0.02400
## 86 21 11 0.02478
## 87 21 13 0.02621
## 88 21 15 0.02738
## 89 21 17 0.02797
## 90 21 19 0.02828
subset(mean_errs_df, var_num == 25)
## var_num k mean_error
## 100 25 3 0.01750
## 101 25 5 0.02274
## 102 25 7 0.02370
## 103 25 9 0.02352
## 104 25 11 0.02365
## 105 25 13 0.02446
## 106 25 15 0.02530
## 107 25 17 0.02685
## 108 25 19 0.02820
mean_errs_df[which.min(mean_errs_df$mean_error), ]
## var_num k mean_error
## 109 27 3 0.01697
So, the best is 27 variables with k = 3, with 25 variables and k = 3 coming close. Here's a review of the variables in significance order.
names(wbcd_train_ord)
## [1] "points_worst" "perimeter_worst" "points_mean"
## [4] "radius_worst" "area_worst" "perimeter_mean"
## [7] "radius_mean" "concavity_mean" "area_mean"
## [10] "concavity_worst" "compactness_mean" "compactness_worst"
## [13] "area_se" "radius_se" "perimeter_se"
## [16] "texture_worst" "points_se" "smoothness_worst"
## [19] "texture_mean" "symmetry_worst" "smoothness_mean"
## [22] "dimension_worst" "symmetry_mean" "compactness_se"
## [25] "concavity_se" "dimension_se" "smoothness_se"
## [28] "symmetry_se" "texture_se" "dimension_mean"
We need to use the first 27 variables, ordered in this specific way, so sort the columns of the validation set data.frame into the same order.
wbcd_val_ord <- wbcd_val[, names(wbcd_train_ord)]
Compute the predictions for the validation set using the whole training set for reference.
bm_val_pred <- knn(train = wbcd_train_ord[, 1:27], wbcd_val_ord[, 1:27], BM_class_train,
k = 3)
tbl_bm_val <- table(bm_val_pred, BM_class_val)
tbl_bm_val
## BM_class_val
## bm_val_pred benign malignant
## benign 108 6
## malignant 2 73
(val_error <- (tbl_bm_val[1, 2] + tbl_bm_val[2, 1])/length(BM_class_val))
## [1] 0.04233
This is much better than the rule-of-thumb fit (k = 19 with all 30 variables).
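Since missing a malignant case is the costly error, it is also worth reading the class-specific rates off the final confusion table (simple arithmetic on tbl_bm_val):

tbl_bm_val[2, 2]/sum(tbl_bm_val[, 2])  # 73/79, about 0.92 of malignant cases called malignant
tbl_bm_val[1, 1]/sum(tbl_bm_val[, 1])  # 108/110, about 0.98 of benign cases called benign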
It took about 6 minutes to run the core kNN parameter test. We can try speeding this up using a multicore parallelization.
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC()
# How many cores are we using?
getDoParWorkers()
## [1] 4
str_time <- Sys.time()
knn_err_est_df_par <- ddply(param_df, .(mc_index, var_num, k), function(df) {
err <- core_knn(df$mc_index[1], df$var_num[1], df$k[1])
err
}, .parallel = T)
time_lapsed_par <- Sys.time() - str_time
save(knn_err_est_df_par, time_lapsed_par, file = "knn_err_est_df_par.RData")
load("knn_err_est_df_par.RData")
time_lapsed_par
## Time difference of 4.165 mins
This is about 60% of the earlier runtime, not the 25% we might hope for from four cores. It takes some time to move the data to the other cores, and since the kNN function is fast, that data movement represents a significant fraction of the processing time. As the run-time of the core function increases, the speedup obtained by parallelization goes up.