This notebook walks through an example of using the caret package to tune the hyperparameters of a k-Nearest Neighbour classifier.

library(mclust)
library(dplyr)
library(ggplot2)
library(caret)
library(pROC)

1 Example dataset

The example dataset is the banknote data frame from the mclust package. It contains six measurements made on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes.

data(banknote)
head(banknote)

The six predictor variables are Length, Left, Right, Bottom, Top and Diagonal. Status is the categorical response (class) variable, with two levels: genuine and counterfeit.
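As a quick check of this layout, the sketch below inspects the data frame and separates the predictor names from the response. The variable name `predictors` is introduced here for illustration only and is not part of the original notebook.

```r
library(mclust)
data(banknote)

# Column types: Status should be a factor, the six measurements numeric
str(banknote)

# The two class labels of the response
levels(banknote$Status)

# Every column other than Status is treated as a predictor
# (`predictors` is an illustrative name, not from the original notebook)
predictors <- setdiff(names(banknote), "Status")
predictors
```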

2 Exploratory data analysis

Observe that the dataset is balanced, with 100 observations for each level of Status.

banknote %>%
  group_by(Status) %>%
  summarise(N = n(), 
            Mean_Length = mean(Length),
            Mean_Left = mean(Left),
            Mean_Right = mean(Right),
            Mean_Bottom = mean(Bottom),
            Mean_Top = mean(Top),
            Mean_Diagonal = mean(Diagonal),
            .groups = "keep")

For most of the measurements, with the exception of Length, genuine and counterfeit notes have quite distinct distributions.

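library(tidyr)  # provides pivot_longer()

# Reshape to long format: one row per note and measurement
banknote %>% 
  mutate(ID = 1:n()) %>%
  pivot_longer(Length:Diagonal,
               names_to = "Dimension",
               values_to = "Size") %>%
  mutate(Dimension = factor(Dimension),
         ID = factor(ID)) %>%
  ggplot() +
  aes(y = Size, fill = Status) +
  # One panel per measurement, each with its own y-axis scale
  facet_wrap(~ Dimension, scales = "free") +
  geom_boxplot() +
  # The x-axis carries no information here, so hide its text and ticks
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(y = "Size (mm)", title = "Comparison of bank note dimensions")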
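To complement the boxplots with a number, one option is to compute a standardised mean difference between the two classes for each measurement. This is a minimal sketch rather than part of the original analysis: the choice of statistic (absolute difference in class means divided by a pooled standard deviation) and the names `measurements` and `separation` are assumptions made here for illustration.

```r
library(mclust)
data(banknote)

# Names of the six measurement columns (everything except the class label)
measurements <- setdiff(names(banknote), "Status")

# For each measurement: |mean(genuine) - mean(counterfeit)| / pooled SD.
# The classes have equal size, so the pooled variance is taken as the
# simple average of the two class variances.
separation <- sapply(measurements, function(v) {
  genuine     <- banknote[[v]][banknote$Status == "genuine"]
  counterfeit <- banknote[[v]][banknote$Status == "counterfeit"]
  pooled_sd   <- sqrt((var(genuine) + var(counterfeit)) / 2)
  abs(mean(genuine) - mean(counterfeit)) / pooled_sd
})

# Larger values indicate measurements on which the classes are further apart
sort(separation, decreasing = TRUE)
```

Larger values suggest measurements on which the two classes overlap less. This is only a rough univariate summary and ignores any joint structure among the predictors.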