diff --git a/classification1.Rmd b/classification1.Rmd index 8b5ff43b0..ff7034d26 100644 --- a/classification1.Rmd +++ b/classification1.Rmd @@ -13,7 +13,7 @@ focus on *classification*, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make -predictions. The next will focus on how to evaluate how accurate the +predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. @@ -25,8 +25,8 @@ By the end of the chapter, readers will be able to: - Describe what a training data set is and how it is used in classification - Interpret the output of a classifier - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables -- Explain the K-nearest neighbour classification algorithm -- Perform K-nearest neighbour classification in R using `tidymodels` +- Explain the $K$-nearest neighbor classification algorithm +- Perform $K$-nearest neighbor classification in R using `tidymodels` - Use a `recipe` to preprocess data to be centered, scaled, and balanced - Combine preprocessing and model training using a `workflow` @@ -52,8 +52,8 @@ we use these data to train, or teach, our classifier. Once taught, we can use the classifier to make predictions on new data for which we do not know the class. There are many possible methods that we could use to predict -a categorical class/label for an observation. In this book we will -focus on the simple, widely used **K-nearest neighbours** algorithm. +a categorical class/label for an observation. In this book, we will +focus on the widely used **$K$-nearest neighbors** algorithm. In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about @@ -70,66 +70,60 @@ In this chapter and the next, we will study a data set of [digitized breast cancer image features](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. Each row in the data set represents an -image of a tumour sample including the diagnosis (benign or malignant) and +image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that -we want to answer. Here, the question is *predictive*: can we use the tumour -image measurements available to us to predict whether a future tumour image -(with unknown diagnosis) shows a benign or malignant tumour? Answering this -question is important because traditional, non-data-driven methods for tumour +we want to answer. Here, the question is *predictive*: can we use the tumor +image measurements available to us to predict whether a future tumor image +(with unknown diagnosis) shows a benign or malignant tumor? 
Answering this +question is important because traditional, non-data-driven methods for tumor diagnosis are quite subjective and dependent upon how skilled and experienced -the diagnosing physician is. Furthermore, benign tumours are not normally -dangerous; the cells stay in the same place and the tumour stops growing before -it gets very large. By contrast, in malignant tumours, the cells invade the -surrounding tissue and spread into nearby organs where they can cause serious -damage ([learn more about cancer here](https://www.worldwidecancerresearch.org/who-we-are/cancer-basics/)). -Thus, it is important to quickly and accurately diagnose the tumour type to +the diagnosing physician is. Furthermore, benign tumors are not normally +dangerous; the cells stay in the same place, and the tumor stops growing before +it gets very large. By contrast, in malignant tumors, the cells invade the +surrounding tissue and spread into nearby organs, where they can cause serious +damage (@stanfordhealthcare). +Thus, it is important to quickly and accurately diagnose the tumor type to guide patient treatment. ### Loading the cancer data Our first step is to load, wrangle, and explore the data using visualizations -in order to better understand the data we are working with. We start by -loading the necessary packages for our analysis. Below you'll see (in addition -to the usual `tidyverse`) a new package: `forcats`. -The `forcats` package enables us to easily -manipulate factors in R. As we learned in Chapter \@ref(viz), factors are a special categorical type of variable in -R that are often used for class label data. - +in order to better understand the data we are working with. We start by +loading the `tidyverse` package needed for our analysis. ```{r 05-load-libraries, warning = FALSE, message = FALSE} library(tidyverse) -library(forcats) ``` In this case, the file containing the breast cancer data set is a `.csv` file with headers. We'll use the `read_csv` function with no additional arguments, and then inspect its contents: -```{r 05-read-data} +```{r 05-read-data, message = FALSE} cancer <- read_csv("data/wdbc.csv") cancer ``` ### Describing the variables in the cancer data set -Breast tumours can be diagnosed by performing a *biopsy*, a process where +Breast tumors can be diagnosed by performing a *biopsy*, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine -needle asipiration, used to collect the present data set, extract only a small +needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast -tissue sample collected for this data set, 10 different variables were measured +tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3-12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the -data preparation, these values have been *scaled*; we will discuss what this +data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this means and why we do it later in this chapter. Each image additionally was given -a unique ID and a diagnosis for malignance by a physician. Therefore, the +a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is: -1. ID number -2. 
Class: the diagnosis of **M**alignant or **B**enign +1. ID: identification number +2. Class: the diagnosis (M = malignant or B = benign) 3. Radius: the mean of distances from center to points on the perimeter 4. Texture: the standard deviation of gray-scale values 5. Perimeter: the length of the surrounding contour @@ -151,9 +145,9 @@ the page (instead of across). glimpse(cancer) ``` -We can see from the summary of the data above that `Class` is of type character -(denoted by ``). Since we are going to be working with `Class` as a -categorical statistical variable, we will convert it to factor using the +From the summary of the data above, we can see that `Class` is of type character +(denoted by ``). Since we will be working with `Class` as a +categorical statistical variable, we will convert it to a factor using the function `as_factor`. ```{r 05-class} @@ -163,17 +157,17 @@ glimpse(cancer) ``` Recall factors have what are called "levels", which you can think of as categories. We -can ask for the levels from the `Class` column by using the `levels` function. +can verify the levels of the `Class` column by using the `levels` function. This function should return the name of each category in that column. Given -that we only have 2 different values in our `Class` column (B and M), we -only expect to get two names back. Note that the `levels` function requires -a *vector* argument, while the `select` function outputs a *data frame*; -so we use the `pull` function, which converts a single -column of a data frame into a vector. +that we only have two different values in our `Class` column (B for benign and M +for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument; +so we use the `pull` function to extract a single column (`Class`) and +pass that into the `levels` function to see the categories +in the `Class` column. ```{r 05-levels} cancer |> - pull(Class) |> # turns a data frame into a vector + pull(Class) |> levels() ``` @@ -181,53 +175,49 @@ cancer |> Before we start doing any modelling, let's explore our data set. Below we use the `group_by`, `summarize` and `n` functions to find the number and percentage -of benign and maligant tumour observations in our data set. The `n` function within -the `summarize` function counts the number of observations in each `Class` group. -We have 357 (63\%) benign and 212 (37\%) malignant tumour observations. - +of benign and malignant tumor observations in our data set. The `n` function within +`summarize` when paired with `group_by` counts the number of observations in each `Class` group. +Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. ```{r 05-tally} num_obs <- nrow(cancer) cancer |> group_by(Class) |> summarize( - n = n(), + count = n(), percentage = n() / num_obs * 100 ) ``` Next, let's draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use `ggplot's` default palette, -we define our own colourblind-friendly palette using two colours—`"orange2"` +we select our own colorblind-friendly colors—`"orange2"` for light orange and `"steelblue2"` for light blue—and pass them as the `values` argument to the `scale_color_manual` function. We also make the category labels ("B" and "M") more readable by changing them to "Benign" and "Malignant" using the `labels` argument. 
-```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter coloured by diagnosis label"} +```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label"} perim_concav <- cancer %>% ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + - geom_point(alpha = 0.5) + - labs(color = "Diagnosis") + + geom_point(alpha = 0.6) + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) perim_concav ``` In Figure \@ref(fig:05-scatter), we can see that malignant observations typically fall in -the the upper right-hand corner of the plot area. By contrast, benign -observations typically fall in lower left-hand corner of the plot. In other words, benign observations -tend to have lower concavity and perimeter values, and malignant ones tends to have larger -concavity and perimeter values. Suppose we +the upper right-hand corner of the plot area. By contrast, benign +observations typically fall in the lower left-hand corner of the plot. In other words, +benign observations tend to have lower concavity and perimeter values, and malignant +ones tend to have larger values. Suppose we obtain a new observation not in the current data set that has all the variables measured *except* the label (i.e., an image without the physician's diagnosis -for the tumour class). We could compute the perimeter and concavity values, +for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? Based on the scatter plot, how might -you classify that new observation? How would you classify a new observation with -perimeter value of -1 and concavity value of -0.5? What about 0 and 1? It seems -like the *prediction of an unobserved label* might be possible, based on our -visualization. +you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible. -## Classification with K-nearest neighbours +## Classification with $K$-nearest neighbors ```{r 05-knn-0, echo = FALSE} ## Find the distance between new point and all others in data set @@ -256,51 +246,59 @@ table_with_distances <- function(training, new_point) { new_point <- c(2, 4) attrs <- c("Perimeter", "Concavity") my_distances <- table_with_distances(cancer[, attrs], new_point) -neighbours <- cancer[order(my_distances$Distance), ] +neighbors <- cancer[order(my_distances$Distance), ] ``` In order to actually make predictions for new observations in practice, we will need a classification algorithm. -In this book we will use the K-nearest neighbours classification algorithm. +In this book, we will use the $K$-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign -or malignant), the K-nearest neighbours classifier generally finds the $K$ +or malignant), the $K$-nearest neighbors classifier generally finds the $K$ "nearest" or "most similar" observations in our training set, and then uses their diagnoses to make a prediction for the new observation's diagnosis. 
$K$ is a number that we must choose in advance; for now, we will assume that someone has chosen $K$ for us. We will cover how to choose $K$ ourselves in the next chapter. -To illustrate the concept of K-nearest neighbours classification, we +To illustrate the concept of $K$-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with perimeter of `r new_point[1]` and concavity of `r new_point[2]`, whose -diagnosis "Class" is unknown. This new observation is depicted by the red scatter point in +new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose +diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in Figure \@ref(fig:05-knn-1). -```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"} -perim_concav + - geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, pch = 17) +```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"} +perim_concav_with_new_point <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>% + ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) + + geom_point(alpha = 0.6) + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + + scale_color_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c("steelblue2", "orange2", "red")) + + scale_shape_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(16, 16, 18))+ + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(2, 2, 2.5)) +perim_concav_with_new_point ``` Figure \@ref(fig:05-knn-2) shows that the nearest point to this new observation is **malignant** and -located at the coordinates (`r round(neighbours[1, c(attrs[1], attrs[2])], +located at the coordinates (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`). The idea here is that if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. -```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with malignant nearest neighbour to a new observation highlighted"} -perim_concav + geom_point(aes(x = new_point[1], y = new_point[2]), - color = "red", - size = 2.5, - pch = 17 -) + +```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. 
The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."} +perim_concav_with_new_point + geom_segment(aes( x = new_point[1], y = new_point[2], - xend = pull(neighbours[1, attrs[1]]), - yend = pull(neighbours[1, attrs[2]]) - ), color = "black") + xend = pull(neighbors[1, attrs[1]]), + yend = pull(neighbors[1, attrs[2]]) + ), color = "black", size = 0.5) ``` @@ -308,63 +306,69 @@ perim_concav + geom_point(aes(x = new_point[1], y = new_point[2]), new_point <- c(0.2, 3.3) attrs <- c("Perimeter", "Concavity") my_distances <- table_with_distances(cancer[, attrs], new_point) -neighbours <- cancer[order(my_distances$Distance), ] +neighbors <- cancer[order(my_distances$Distance), ] ``` -Suppose we have another new observation with perimeter `r new_point[1]` and +Suppose we have another new observation with standardized perimeter `r new_point[1]` and concavity of `r new_point[2]`. Looking at the scatter plot in Figure \@ref(fig:05-knn-4), how would you -classify this red observation? The nearest neighbour to this new point is a -**benign** observation at (`r round(neighbours[1, c(attrs[1], attrs[2])], 1)`). -Does this seem like the right prediction to make for the red observation? Probably +classify this red, diamond observation? The nearest neighbor to this new point is a +**benign** observation at (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`). +Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points... -```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with benign nearest neighbour to a new observation highlighted"} -perim_concav + geom_point(aes(x = new_point[1], y = new_point[2]), - color = "red", - size = 2.5, - pch = 17 -) + +```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."} + +perim_concav_with_new_point2 <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>% + ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) + + geom_point(alpha = 0.6) + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + + scale_color_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c("steelblue2", "orange2", "red")) + + scale_shape_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(16, 16, 18))+ + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(2, 2, 2.5)) +perim_concav_with_new_point2 + geom_segment(aes( x = new_point[1], y = new_point[2], - xend = pull(neighbours[1, attrs[1]]), - yend = pull(neighbours[1, attrs[2]]) - ), color = "black") + xend = pull(neighbors[1, attrs[1]]), + yend = pull(neighbors[1, attrs[2]]) + ), color = "black", size = 0.5) ``` To improve the prediction we can consider several -neighbouring points, say $K = 3$, that are closest to the new red observation +neighboring points, say $K = 3$, that are closest to the new observation to predict its diagnosis class. Among those 3 closest points, we use the *majority class* as our prediction for the new observation. 
As shown in Figure \@ref(fig:05-knn-5), we -see that the diagnoses of 2 of the 3 nearest neighbours to our new observation -are malignant. Therefore we take majority vote and classify our new red +see that the diagnoses of 2 of the 3 nearest neighbors to our new observation +are malignant. Therefore we take majority vote and classify our new red, diamond observation as malignant. - + -```{r 05-knn-5, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbours"} -perim_concav + geom_point(aes(x = new_point[1], y = new_point[2]), - color = "red", - size = 2.5, - pch = 17 -) + +```{r 05-knn-5, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors"} +perim_concav_with_new_point2 + geom_segment(aes( x = new_point[1], y = new_point[2], - xend = pull(neighbours[1, attrs[1]]), - yend = pull(neighbours[1, attrs[2]]) - ), color = "black") + + xend = pull(neighbors[1, attrs[1]]), + yend = pull(neighbors[1, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = new_point[1], y = new_point[2], - xend = pull(neighbours[2, attrs[1]]), - yend = pull(neighbours[2, attrs[2]]) - ), color = "black") + + xend = pull(neighbors[2, attrs[1]]), + yend = pull(neighbors[2, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = new_point[1], y = new_point[2], - xend = pull(neighbours[3, attrs[1]]), - yend = pull(neighbours[3, attrs[2]]) - ), color = "black") + xend = pull(neighbors[3, attrs[1]]), + yend = pull(neighbors[3, attrs[2]]) + ), color = "black", size = 0.5) ``` @@ -379,7 +383,7 @@ We decide which points are the $K$ "nearest" to our new observation using the *straight-line distance* (we will often just refer to this as *distance*). Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; -$b_x$ and $b_y$ have similar definitions for observaiton $b$. +$b_x$ and $b_y$ have similar definitions for observation $b$. Then the straight-line distance between observation $a$ and $b$ on the x-y plane can be computed using the following formula: @@ -388,26 +392,35 @@ $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ new_point <- c(0, 3.5) ``` -To find the $K$ nearest neighbours to our new observation, we compute the distance +To find the $K$ nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the $K$ observations corresponding to the -$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbours to classify a new +$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new observation with perimeter of `r new_point[1]` and -concavity of `r new_point[2]`, shown in red in Figure \@ref(fig:05-multiknn-1). Let's calculate the distances +concavity of `r new_point[2]`, shown as a red diamond in Figure \@ref(fig:05-multiknn-1). Let's calculate the distances between our new point and each of the observations in the training set to find -the $K=5$ neighbours that are nearest to our new point. +the $K=5$ neighbors that are nearest to our new point. 
You will see in the `mutate` step below, we compute the straight-line distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. -```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"} -perim_concav <- cancer |> - ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + +```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"} +perim_concav <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) |> + ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) + + geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, pch = 18) + geom_point(alpha = 0.5) + - scale_x_continuous(name = "Perimeter", breaks = seq(-2, 4, 1)) + - scale_y_continuous(name = "Concavity", breaks = seq(-2, 4, 1)) + - labs(color = "Diagnosis") + - scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) + - geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, pch = 17) + scale_x_continuous(name = "Perimeter (standardized)", breaks = seq(-2, 4, 1)) + + scale_y_continuous(name = "Concavity (standardized)", breaks = seq(-2, 4, 1)) + + labs(color = "Diagnosis") + + scale_color_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c("steelblue2", "orange2", "red")) + + scale_shape_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(16, 16, 18))+ + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(2, 2, 2.5)) + perim_concav ``` @@ -421,24 +434,24 @@ cancer |> mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) |> arrange(dist_from_new) |> - slice(1:5) # subset the first 5 rows + slice(1:5) # take the first 5 rows ``` In Table \@ref(tab:05-multiknn-mathtable) we show in mathematical detail how the `mutate` step was used to compute the `dist_from_new` -variable (the distance to the new observation) for each of the 5 nearest neighbours in the training +variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
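Each distance in that table is simply an application of the distance formula above. As a minimal sketch (the helper function `euclidean_dist` and the point used in the example are our own illustration, not part of the original analysis), the same computation can be wrapped in a small R function that accepts two points stored as numeric vectors:

```{r}
# Our own small helper, mirroring the mutate step above: it returns the
# straight-line distance between two points given as numeric vectors.
euclidean_dist <- function(a, b) {
  sqrt(sum((a - b)^2))
}

# For example, the distance between our new observation at (0, 3.5) and a
# hypothetical observation at (1, 1) would be:
euclidean_dist(c(0, 3.5), c(1, 1))
```

Because the helper sums the squared differences over however many coordinates the points have, the same function would also work for the three-predictor distance we compute later in this chapter.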
```{r 05-multiknn-4, echo = FALSE} my_distances <- table_with_distances(cancer[, attrs], new_point) -neighbours <- my_distances[order(my_distances$Distance), ] +neighbors <- my_distances[order(my_distances$Distance), ] k <- 5 -tab <- data.frame(neighbours[1:k, ], cancer[order(my_distances$Distance), ][1:k, c("ID", "Class")]) +tab <- data.frame(neighbors[1:k, ], cancer[order(my_distances$Distance), ][1:k, c("ID", "Class")]) math_table <- tibble(Perimeter = round(tab[1:5,1],2), Concavity = round(tab[1:5,2],2), - dist = round(neighbours[1:5, "Distance"], 2) + dist = round(neighbors[1:5, "Distance"], 2) ) math_table <- math_table %>% mutate(Distance = paste0("$\\sqrt{(", new_obs_Perimeter, " - ", ifelse(Perimeter < 0, "(", ""), Perimeter, ifelse(Perimeter < 0,")",""), ")^2", @@ -449,14 +462,14 @@ math_table <- math_table %>% ``` ```{r 05-multiknn-mathtable, echo = FALSE} -knitr::kable(math_table, booktabs = TRUE, caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbours", escape = FALSE) +knitr::kable(math_table, booktabs = TRUE, caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors", escape = FALSE) ``` -The result of this computation shows that 3 of the 5 nearest neighbours to our new observation are +The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant (`M`); since this is the majority, we classify our new observation as malignant. -These 5 neighbours are circled in Figure \@ref(fig:05-multiknn-3). +These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3). -```{r 05-multiknn-3, echo = FALSE, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbours circled"} +```{r 05-multiknn-3, echo = FALSE, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled"} perim_concav + annotate("path", x = new_point[1] + 1.4 * cos(seq(0, 2 * pi, length.out = 100 @@ -472,87 +485,149 @@ perim_concav + annotate("path", ### More than two explanatory variables Although the above description is directed toward two predictor variables, -exactly the same K-nearest neighbours algorithm applies when you -have a higher number of predictor variables (i.e., a higher-dimensional -predictor space). Each predictor variable may give us new +exactly the same $K$-nearest neighbors algorithm applies when you +have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula -for the distance between points. In particular, let's say we have $m$ predictor +for the distance between points. Suppose we have $m$ predictor variables for two observations $a$ and $b$, i.e., $a = (a_{1}, a_{2}, \dots, a_{m})$ and $b = (b_{1}, b_{2}, \dots, b_{m})$. -Previously, when we had two variables, we added up the squared difference between each of our (two) variables, -and then took the square root. Now we will do the same, except for *all* of our -$m$ variables. In other words, the distance formula becomes + +The distance formula becomes $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.$$ This formula still corresponds to a straight-line distance, just in a space with -more dimensions. For example, suppose we decide to use 3 predictor variables (so, a 3-dimensional space): -perimeter, concavity, and symmetry. Figure \@ref(fig:05-more) shows what -the data look like when we visualize them as a 3-dimensional scatter. 
-In this case, the formula above is just the straight line distance in this 3-dimensional space. +more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5 and symmetry of 1 and +another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables, +and then took the square root. Now we will do the same, except for our +three variables. We calculate the distance as follows + +$$\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.$$ + +Let's calculate the distances between our new observation and each of the observations in the training set to find the $K=5$ neighbors when we have these three predictors. +```{r} +new_obs_Perimeter <- 0 +new_obs_Concavity <- 3.5 +new_obs_Symmetry <- 1 +cancer |> + select(ID, Perimeter, Concavity, Symmetry, Class) |> + mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + + (Concavity - new_obs_Concavity)^2 + + (Symmetry - new_obs_Symmetry)^2)) |> + arrange(dist_from_new) |> + slice(1:5) # take the first 5 rows +``` +Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +Figure \@ref(fig:05-more) shows what the data look like when we visualize them +as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. ```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the symmetry, concavity, and perimeter variables."} library(plotly) -cancer |> - plot_ly( - x = ~Perimeter, +attrs <- c("Perimeter", "Concavity", "Symmetry") + +# create new scaled obs and get NNs +new_obs_3 <- tibble(Perimeter = 0, Concavity = 3.5, Symmetry = 1, Class = "Unknown") +my_distances_3 <- table_with_distances(cancer[, attrs], new_obs_3[, attrs]) +neighbors_3 <- cancer[order(my_distances_3$Distance), ] + +data <- neighbors_3 %>% select(Perimeter, Concavity, Symmetry) %>% slice(1:5) + +# add to the df +scaled_cancer_3 <- bind_rows(cancer, new_obs_3) %>% + mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant"= "M")) + +plot_3d <- scaled_cancer_3 |> + plot_ly() |> + layout(scene = list( + xaxis = list(title = "Perimeter (standardized)"), + yaxis = list(title = "Concavity (standardized)"), + zaxis = list(title = "Symmetry (standardized)") + )) %>% + add_trace(x = ~Perimeter, y = ~Concavity, z = ~Symmetry, color = ~Class, opacity = 0.4, size = 150, - colors = c("orange2", "steelblue2") - ) |> - add_markers() |> - layout(scene = list( - xaxis = list(title = "Perimeter"), - yaxis = list(title = "Concavity"), - zaxis = list(title = "Symmetry") - )) + colors = c("orange2", "steelblue2", "red"), + symbol = ~Class, symbols = c('circle','circle','diamond')) + +x1 <- c(pull(new_obs_3[1]), data$Perimeter[1]) +y1 <- c(pull(new_obs_3[2]), data$Concavity[1]) +z1 <- c(pull(new_obs_3[3]), data$Symmetry[1]) + +x2 <- c(pull(new_obs_3[1]), data$Perimeter[2]) +y2 <- c(pull(new_obs_3[2]), data$Concavity[2]) +z2 <- c(pull(new_obs_3[3]), data$Symmetry[2]) + +x3 <- c(pull(new_obs_3[1]), data$Perimeter[3]) +y3 <- c(pull(new_obs_3[2]), data$Concavity[3]) +z3 <- c(pull(new_obs_3[3]), data$Symmetry[3]) + +x4 <- c(pull(new_obs_3[1]), data$Perimeter[4]) +y4 <- c(pull(new_obs_3[2]), data$Concavity[4]) +z4 <- 
c(pull(new_obs_3[3]), data$Symmetry[4]) + +x5 <- c(pull(new_obs_3[1]), data$Perimeter[5]) +y5 <- c(pull(new_obs_3[2]), data$Concavity[5]) +z5 <- c(pull(new_obs_3[3]), data$Symmetry[5]) + +plot_3d %>% + add_trace(x = x1, y = y1, z = z1, type = "scatter3d", mode = "lines", + name = "lines", showlegend = FALSE, color = I("steelblue2")) %>% + add_trace(x = x2, y = y2, z = z2, type = "scatter3d", mode = "lines", + name = "lines", showlegend = FALSE, color = I("steelblue2")) %>% + add_trace(x = x3, y = y3, z = z3, type = "scatter3d", mode = "lines", + name = "lines", showlegend = FALSE, color = I("steelblue2")) %>% + add_trace(x = x4, y = y4, z = z4, type = "scatter3d", mode = "lines", + name = "lines", showlegend = FALSE, color = I("orange2")) %>% + add_trace(x = x5, y = y5, z = z5, type = "scatter3d", mode = "lines", + name = "lines", showlegend = FALSE, color = I("steelblue2")) ``` *Click and drag the plot above to rotate it, and scroll to zoom. Note that in general we recommend against using 3D visualizations; here we show the data in -3D only to illustrate what "higher dimensions" and "nearest neighbours" look like, +3D only to illustrate what "higher dimensions" and "nearest neighbors" look like, for learning purposes.* -### Summary of K-nearest neighbours algorithm -In order to classify a new observation using a K-nearest neighbour classifier, we have to: +### Summary of $K$-nearest neighbors algorithm + +In order to classify a new observation using a $K$-nearest neighbor classifier, we have to: 1. Compute the distance between the new observation and each observation in the training set. 2. Sort the data table in ascending order according to the distances. 3. Choose the top $K$ rows of the sorted table. -4. Classify the new observation based on a majority vote of the neighbour classes. +4. Classify the new observation based on a majority vote of the neighbor classes. -## K-nearest neighbours with `tidymodels` +## $K$-nearest neighbors with `tidymodels` -Coding the K-nearest neighbours algorithm in R ourselves can get complicated, +Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, -or predicting the class for multiple new observations. Thankfully, in R, -the K-nearest neighbours algorithm is implemented in the `parsnip` package +or predict the class for multiple new observations. Thankfully, in R, +the $K$-nearest neighbors algorithm is implemented in the `parsnip` package included in the -[`tidymodels` meta package](https://www.tidymodels.org/), along with +[`tidymodels` package](https://www.tidymodels.org/), along with many [other models](https://www.tidymodels.org/find/parsnip/) that you will encounter in this and future chapters of the book. The `tidymodels` collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we are likely to make. We -start off by loading `tidymodels`. +less we have to code ourselves, the fewer mistakes we will likely make. We +start by loading `tidymodels`. ```{r 05-tidymodels, warning = FALSE, message = FALSE} library(tidymodels) ``` -Let's walk through how to use `tidymodels` to perform K-nearest neighbours classification. -We will use the cancer data set from above, with -perimeter and concavity as predictors and $K = 5$ neighbours to build our classifier. 
Then +Let's walk through how to use `tidymodels` to perform $K$-nearest neighbors classification. +We will use the `cancer` data set from above, with +perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with -perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our 2 desired -predictor variables and class label and store it as a new dataset named `cancer_train`: +perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired +predictor variables and class label and store them as a new data set named `cancer_train`: ```{r 05-tidymodels-2} cancer_train <- cancer |> @@ -560,17 +635,17 @@ cancer_train <- cancer |> cancer_train ``` -Next, we create a *model specification* for K-nearest neighbours classification -by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbours +Next, we create a *model specification* for $K$-nearest neighbors classification +by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors (we will discuss how to choose $K$ in the next chapter) and the straight-line distance (`weight_func = "rectangular"`). The `weight_func` argument controls -how neighbours vote when classifying a new observation; by setting it to `"rectangular"`, -each of the $K$ nearest neighbours gets exactly 1 vote as described above. Other choices, -which weight each neighbour's vote differently, can be found on -[the tidymodels website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html). -We specify the particular computational -engine (in this case, the `kknn` engine) for training the model with the `set_engine` function. -Finally we specify that this is a classification problem with the `set_mode` function. +how neighbors vote when classifying a new observation; by setting it to `"rectangular"`, +each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +which weigh each neighbor's vote differently, can be found on +[the `tidymodels` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html). +In the `set_engine` argument, we specify which package or system will be used for training +the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification. +Finally, we specify that this is a classification problem with the `set_mode` function. ```{r 05-tidymodels-3} knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |> @@ -580,7 +655,7 @@ knn_spec ``` In order to fit the model on the breast cancer data, we need to pass the model specification -and the dataset to the `fit` function. We also need to specify what variables to use as predictors +and the data set to the `fit` function. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies that `Class` is the target variable (the one we want to predict), and both `Perimeter` and `Concavity` are to be used as the predictors. @@ -602,16 +677,16 @@ knn_fit ``` Here you can see the final trained model summary. It confirms that the computational engine used to train the model was `kknn::train.kknn`. 
It also shows the fraction of errors made by -the nearest neighbour model, but we will ignore this for now and discuss it in more detail +the nearest neighbor model, but we will ignore this for now and discuss it in more detail in the next chapter. Finally it shows (somewhat confusingly) that the "best" weight function was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier, R is just repeating those settings to us here. In the next chapter, we will actually -let R tune the model for us. +let R find the value of $K$ for us. Finally, we make the prediction on the new observation by calling the `predict` function, passing both the fit object we just created and the new observation itself. As above -when we ran the K-nearest neighbours +when we ran the $K$-nearest neighbors classification algorithm manually, the `knn_fit` object classifies the new observation as malignant ("M"). Note that the `predict` function outputs a data frame with a single variable named `.pred_class`. @@ -621,19 +696,25 @@ new_obs <- tibble(Perimeter = 0, Concavity = 3.5) predict(knn_fit, new_obs) ``` +Is this predicted malignant label the true class for this observation? +Well, we don't know because we do not have this +observation's diagnosis— that is what we were trying to predict! The +classifier's prediction is not necessarily correct, but in the next chapter, we will +learn ways to quantify how accurate we think our predictions are. + ## Data preprocessing with `tidymodels` ### Centering and scaling -When using K-nearest neighbour classification, the *scale* of each variable +When using $K$-nearest neighbor classification, the *scale* of each variable (i.e., its size and range of values) matters. Since the classifier predicts -classes by identifying observations that are nearest to it, any variables that -have a large scale will have a much larger effect than variables with a small +classes by identifying observations nearest to it, any variables with +a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale *doesn't mean* that it is more important for making accurate predictions. For example, suppose you have a -data set with two attributes, salary (in dollars) and years of education, and +data set with two features, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the -neighbour distances, a difference of \$1000 is huge compared to a difference of +neighbor distances, a difference of \$1000 is huge compared to a difference of 10 years of education. But for our conceptual understanding and answering of the problem, it's the opposite; 10 years of education is huge compared to a difference of \$1000 in yearly salary! @@ -642,17 +723,21 @@ In many other predictive models, the *center* of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celcius, the two variables would differ by a constant shift of 273 -(even though they contain exactly the same information). Likewise in our +(even though they contain exactly the same information). Likewise, in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. 
Although this doesn't affect the -K-nearest neighbour classification algorithm, this large shift can change the +$K$-nearest neighbor classification algorithm, this large shift can change the outcome of using many other predictive models. -When all variables in a data set have a mean (center) of 0 -and a standard deviation (scale) of 1, we say that the data have been -*standardized*. To illustrate the effect that standardization can have on the K-nearest -neighbour algorithm, we will read in the original, unscaled Wisconsin breast +To scale and center our data, we need to find +our variables' *mean* (the average, which quantifies the "central" value of a +set of numbers) and *standard deviation* (a number quantifying how spread out values are). +For each observed value of the variable, we subtract the mean (i.e., center the variable) +and divide by the standard deviation (i.e., scale the variable). When we do this, the data +is said to be *standardized*, and all variables in a data set will have a mean of 0 +and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest +neighbor algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the `Area`, `Smoothness`, and `Class` variables: @@ -664,12 +749,13 @@ unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |> unscaled_cancer ``` -Looking at the unscaled / uncentered data above, you can see that the difference +Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for -smoothness, and the mean appears to be much larger too. Will this affect +smoothness. Will this affect predictions? In order to find out, we will create a scatter plot of these two -predictors (coloured by diagnosis) for both the unstandardized data we just -loaded, and the standardized version of that same data. +predictors (colored by diagnosis) for both the unstandardized data we just +loaded, and the standardized version of that same data. But first, we need to +standardize the `unscaled_cancer` data set with `tidymodels`. In the `tidymodels` framework, all data preprocessing happens using a [`recipe`](https://tidymodels.github.io/recipes/reference/index.html). @@ -708,16 +794,24 @@ For example: - `-Class`: specify everything except the `Class` variable You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html) -on the recipes home page. -We finally use the `bake` function to apply the recipe to the unscaled data. +on the `recipes` home page. + +At this point, we have calculated the required statistics based on the data input into the +recipe, but the data are not yet scaled and centred. To actually scale and center +the data, we need to apply the `bake` function to the unscaled data. + ```{r 05-scaling-4} scaled_cancer <- bake(uc_recipe, unscaled_cancer) scaled_cancer ``` -At this point, you may wonder why we are doing so much work just to center and -scale our variables. Can't we just manually scale and center the `Area` and -`Smoothness` variables ourselves before building our KNN model? Well, +It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data. + However, we do this in two steps so we can specify a different data set in the `bake` step if we want. 
+ For example, we may want to specify new data that were not part of the training set. + +You may wonder why we are doing so much work just to center and +scale our variables. Can't we just manually scale and center the `Area` and +`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, technically *yes*; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what @@ -729,102 +823,146 @@ yourself. You will see further on in Section automatically apply `prep` and `bake` as necessary without additional coding effort. Figure \@ref(fig:05-scaling-plt) shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for -`scaled_cancer`. Each has the same new observation annotated with its $K=3$ nearest neighbours. -In the plot for the nonstandardized original data, you can see some odd choices -for the three nearest neighbours. In particular, the "neighbours" are visually -well within the cloud of benign observations, and the neighbours are all nearly +`scaled_cancer`. Each has the same new observation annotated with its $K=3$ nearest neighbors. +In the original unstandardized data plot, you can see some odd choices +for the three nearest neighbors. In particular, the "neighbors" are visually +well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there -is only one black line on this plot). Here the computation of nearest -neighbours is dominated by the much larger-scale area variable. On the right, -the plot for standardized data shows a much more intuitively reasonable -selection of nearest neighbours. Thus, standardizing the data can change things -in an important way when we are using predictive algorithms. As a rule of -thumb, standardizing your data should be a part of the preprocessing you do -before any predictive modelling / analysis. - -```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.width = 10, fig.cap = "Comparison of K = 3 nearest neighbours with standardized and unstandardized data"} +is only one black line on this plot). Figure \@ref(fig:05-scaling-plt-zoomed) +shows a close-up of that region on the unstandardized plot. Here the computation of nearest +neighbors is dominated by the much larger-scale area variable. The plot for standardized data +on the right in Figure \@ref(fig:05-scaling-plt) shows a much more intuitively reasonable +selection of nearest neighbors. Thus, standardizing the data can change things +in an important way when we are using predictive algorithms. +Standardizing your data should be a part of the preprocessing you do +before predictive modelling and you should always think carefully about your problem domain and +whether you need to standardize your data. 
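Before looking at those plots, here is one more brief sketch of why the separate `prep` and `bake` steps are useful when new data arrive. The object name `new_images` is our own, and we simply reuse the first three rows of `unscaled_cancer` as stand-ins for new observations; because the recipe has already been prepped on the training data, `bake` applies the *training* data's center and scale to whatever data we hand it, rather than recomputing them.

```{r}
# Stand-ins for new, unstandardized observations (rows of the training data
# are reused here purely for illustration).
new_images <- slice(unscaled_cancer, 1:3)

# The prepped recipe standardizes the new observations using the mean and
# standard deviation computed earlier from unscaled_cancer.
bake(uc_recipe, new_images)
```

This is exactly the behavior we want when classifying future observations, and it is what a `workflow` will do for us automatically later in the chapter.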
+ +```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.width = 10, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data"} attrs <- c("Area", "Smoothness") # create a new obs and get its NNs new_obs <- tibble(Area = 400, Smoothness = 0.135, Class = "unknown") my_distances <- table_with_distances(unscaled_cancer[, attrs], new_obs[, attrs]) -neighbours <- unscaled_cancer[order(my_distances$Distance), ] +neighbors <- unscaled_cancer[order(my_distances$Distance), ] # add the new obs to the df unscaled_cancer <- bind_rows(unscaled_cancer, new_obs) # plot the scatter -unscaled <- ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class, shape = Class)) + +unscaled <- ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class, shape = Class, size = Class)) + geom_point(alpha = 0.6) + scale_color_manual(name = "Diagnosis", labels = c("Benign", "Malignant", "Unknown"), values = c("steelblue2", "orange2", "red")) + scale_shape_manual(name = "Diagnosis", labels = c("Benign", "Malignant", "Unknown"), - values= c(16, 16, 17)) + - ggtitle("Nonstandardized Data") + + values= c(16, 16, 18)) + + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values=c(2,2,2.5)) + + ggtitle("Unstandardized Data") + geom_segment(aes( x = unlist(new_obs[1]), y = unlist(new_obs[2]), - xend = unlist(neighbours[1, attrs[1]]), - yend = unlist(neighbours[1, attrs[2]]) - ), color = "black") + + xend = unlist(neighbors[1, attrs[1]]), + yend = unlist(neighbors[1, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = unlist(new_obs[1]), y = unlist(new_obs[2]), - xend = unlist(neighbours[2, attrs[1]]), - yend = unlist(neighbours[2, attrs[2]]) - ), color = "black") + + xend = unlist(neighbors[2, attrs[1]]), + yend = unlist(neighbors[2, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = unlist(new_obs[1]), y = unlist(new_obs[2]), - xend = unlist(neighbours[3, attrs[1]]), - yend = unlist(neighbours[3, attrs[2]]) - ), color = "black") + xend = unlist(neighbors[3, attrs[1]]), + yend = unlist(neighbors[3, attrs[2]]) + ), color = "black", size = 0.5) # create new scaled obs and get NNs new_obs_scaled <- tibble(Area = -0.72, Smoothness = 2.8, Class = "unknown") my_distances_scaled <- table_with_distances(scaled_cancer[, attrs], new_obs_scaled[, attrs]) -neighbours_scaled <- scaled_cancer[order(my_distances_scaled$Distance), ] +neighbors_scaled <- scaled_cancer[order(my_distances_scaled$Distance), ] # add to the df scaled_cancer <- bind_rows(scaled_cancer, new_obs_scaled) # plot the scatter -scaled <- ggplot(scaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class, shape = Class)) + +scaled <- ggplot(scaled_cancer, aes(x = Area, y = Smoothness, + group = Class, color = Class, shape = Class, size = Class)) + geom_point(alpha = 0.6) + scale_color_manual(name = "Diagnosis", labels = c("Benign", "Malignant", "Unknown"), values = c("steelblue2", "orange2", "red")) + scale_shape_manual(name = "Diagnosis", labels = c("Benign", "Malignant", "Unknown"), - values= c(16, 16, 17)) + + values= c(16, 16, 18)) + + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values=c(2,2,2.5)) + ggtitle("Standardized Data") + + labs(x = "Area (standardized)", y = "Smoothness (standardized)") + # coord_equal(ratio = 1) + geom_segment(aes( x = unlist(new_obs_scaled[1]), y = unlist(new_obs_scaled[2]), - xend = unlist(neighbours_scaled[1, attrs[1]]), - yend = 
unlist(neighbours_scaled[1, attrs[2]]) - ), color = "black") + + xend = unlist(neighbors_scaled[1, attrs[1]]), + yend = unlist(neighbors_scaled[1, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = unlist(new_obs_scaled[1]), y = unlist(new_obs_scaled[2]), - xend = unlist(neighbours_scaled[2, attrs[1]]), - yend = unlist(neighbours_scaled[2, attrs[2]]) - ), color = "black") + + xend = unlist(neighbors_scaled[2, attrs[1]]), + yend = unlist(neighbors_scaled[2, attrs[2]]) + ), color = "black", size = 0.5) + geom_segment(aes( x = unlist(new_obs_scaled[1]), y = unlist(new_obs_scaled[2]), - xend = unlist(neighbours_scaled[3, attrs[1]]), - yend = unlist(neighbours_scaled[3, attrs[2]]) - ), color = "black") + xend = unlist(neighbors_scaled[3, attrs[1]]), + yend = unlist(neighbors_scaled[3, attrs[2]]) + ), color = "black", size = 0.5) gridExtra::grid.arrange(unscaled, scaled, ncol = 2) ``` +```{r 05-scaling-plt-zoomed, fig.height = 4, fig.width = 10, echo = FALSE, fig.cap = "Close up of three nearest neighbors for unstandardized data"} +library(ggforce) +ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class, shape = Class)) + + geom_point(size = 2.5, alpha = 0.6) + + scale_color_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c("steelblue2", "orange2", "red")) + + scale_shape_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(16, 16, 18)) + + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c(1, 1, 2.5)) + + ggtitle("Unstandardized Data") + + geom_segment(aes( + x = unlist(new_obs[1]), y = unlist(new_obs[2]), + xend = unlist(neighbors[1, attrs[1]]), + yend = unlist(neighbors[1, attrs[2]]) + ), color = "black") + + geom_segment(aes( + x = unlist(new_obs[1]), y = unlist(new_obs[2]), + xend = unlist(neighbors[2, attrs[1]]), + yend = unlist(neighbors[2, attrs[2]]) + ), color = "black") + + geom_segment(aes( + x = unlist(new_obs[1]), y = unlist(new_obs[2]), + xend = unlist(neighbors[3, attrs[1]]), + yend = unlist(neighbors[3, attrs[2]]) + ), color = "black") + theme_light() + +# facet_zoom( xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2) + + facet_zoom(x = ( Area > 380 & Area < 420) , + y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) + + theme_bw() +``` ### Balancing Another potential issue in a data set for a classifier is *class imbalance*, i.e., when one label is much more common than another. Since classifiers like -the K-nearest neighbour algorithm use the labels of nearby points to predict +the $K$-nearest neighbor algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the "pattern" of data suggests otherwise). Class imbalance is actually quite a @@ -833,8 +971,8 @@ detection, there are many cases in which the "important" class to identify (presence of disease, malicious email) is much rarer than the "unimportant" class (no disease, normal email). -To better illustrate the problem, let's revisit the breast cancer data; except -now we will remove many of the observations of malignant tumours, simulating +To better illustrate the problem, let's revisit the scaled breast cancer data, +`cancer`; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. 
We will do this by picking only 3 observations randomly from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `slice_sample` @@ -846,71 +984,91 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced). set.seed(3) rare_cancer <- bind_rows( filter(cancer, Class == "B"), - cancer |> filter(Class == "M") |> slice_sample(n = 3) + cancer |> + filter(Class == "M") |> slice_sample(n = 3) ) |> select(Class, Perimeter, Concavity) rare_plot <- rare_cancer |> ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + - labs(color = "Diagnosis") + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) rare_plot ``` > Note: You will see in the code above that we use the `set.seed` function. > This is because we are using `slice_sample` to artificially pick only 3 of -> the malignant tumour observations, which uses random sampling to choose which +> the malignant tumor observations, which uses random sampling to choose which > rows will be in the training set. In order to make the code reproducible, we > use `set.seed` to specify where the random number generator starts for this > process, which then guarantees the same result, i.e., the same choice of 3 > observations, each time the code is run. In general, when your code involves > random numbers, if you want *the same result* each time, you should use -> `set.seed`; if you want a *different result* each time, you should not. - -Suppose we now decided to use $K = 7$ in K-nearest neighbour classification. -With only 3 observations of malignant tumours, the classifier -will *always predict that the tumour is benign, no matter what its concavity and perimeter +> `set.seed`; if you want a *different result* each time, you should not. +> You only need to `set.seed` once at the beginning of your analysis, so the +rest of the analysis uses seemingly random numbers. + + +Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. +With only 3 observations of malignant tumors, the classifier +will *always predict that the tumor is benign, no matter what its concavity and perimeter are!* This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, Figure \@ref(fig:05-upsample) -shows what happens for a new tumour observation that is quite close to three observations +shows what happens for a new tumor observation that is quite close to three observations in the training data that were tagged as malignant. 
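To make the vote counting concrete, here is a minimal sketch in plain base R (not part of the textbook's analysis code) of the best possible case for the malignant class among the $K = 7$ nearest neighbors: even then, the benign label collects four of the seven votes and wins.

```r
# hypothetical labels of the 7 nearest neighbors: the best possible case for "M",
# since only 3 malignant observations exist in the whole (imbalanced) data set
neighbor_labels <- c("B", "B", "B", "B", "M", "M", "M")

votes <- sort(table(neighbor_labels), decreasing = TRUE)
votes            # B gets 4 votes, M gets 3
names(votes)[1]  # the predicted label: "B"
```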
-```{r 05-upsample, echo=FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with 7 nearest neighbours to a new observation highlighted"} +```{r 05-upsample, echo=FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted"} + new_point <- c(2, 2) attrs <- c("Perimeter", "Concavity") my_distances <- table_with_distances(rare_cancer[, attrs], new_point) my_distances <- bind_cols(my_distances, select(rare_cancer, Class)) -neighbours <- rare_cancer[order(my_distances$Distance), ] +neighbors <- rare_cancer[order(my_distances$Distance), ] + + +rare_plot <- bind_rows(rare_cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) |> + ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class)) + + geom_point(alpha = 0.5) + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + + scale_color_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values = c("steelblue2", "orange2", "red")) + + scale_shape_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(16, 16, 18))+ + scale_size_manual(name = "Diagnosis", + labels = c("Benign", "Malignant", "Unknown"), + values= c(2, 2, 2.5)) for (i in 1:7) { clr <- "steelblue2" - if (neighbours$Class[i] == "M") { + if (neighbors$Class[i] == "M") { clr <- "orange2" } rare_plot <- rare_plot + geom_segment( x = new_point[1], y = new_point[2], - xend = pull(neighbours[i, attrs[1]]), - yend = pull(neighbours[i, attrs[2]]), color = clr + xend = pull(neighbors[i, attrs[1]]), + yend = pull(neighbors[i, attrs[2]]), color = clr ) } rare_plot + geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, - pch = 17 + pch = 18 ) ``` -Figure \@ref(fig:05-upsample-2) shows what happens if we set the background colour of -each area of the plot to the decision the K-nearest neighbour +Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of +each area of the plot to the predictions the $K$-nearest neighbor classifier would make. We can see that the decision is -always "benign," corresponding to the blue colour. +always "benign," corresponding to the blue color. -```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background colour indicating the decision of the classifier"} +```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labelled data"} knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |> set_engine("kknn") |> @@ -931,7 +1089,7 @@ rare_plot <- ggplot() + geom_point(data = rare_cancer, mapping = aes(x = Perimeter, y = Concavity, color = Class), alpha = 0.75) + geom_point(data = prediction_table, mapping = aes(x = Perimeter, y = Concavity, color = Class), alpha = 0.02, size = 5.) + - labs(color = "Diagnosis") + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) rare_plot @@ -942,7 +1100,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. 
For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the K-nearest neighbour algorithm. In order to do this, we will add an oversampling +voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling step to the earlier `uc_recipe` recipe with the `step_upsample` function. We show below how to do this, and also use the `group_by` and `summarize` functions to see that our classes are now balanced: @@ -961,14 +1119,14 @@ upsampled_cancer |> group_by(Class) |> summarize(n = n()) ``` -Now suppose we train our K-nearest neighbour classifier with $K=7$ on this *balanced* data. -Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background colour -of each area of our scatter plot to the decision the K-nearest neighbour +Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. +Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color +of each area of our scatter plot to the decision the $K$-nearest neighbor classifier would make. We can see that the decision is more reasonable; when the points are close -to those labelled malignant, the classifier predicts a malignant tumour, and vice versa when they are -closer to the benign tumour observations. +to those labelled malignant, the classifier predicts a malignant tumor, and vice versa when they are +closer to the benign tumor observations. -```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background colour indicating the decision of the classifier"} +```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background color indicating the decision of the classifier"} knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |> set_engine("kknn") |> @@ -986,7 +1144,7 @@ upsampled_plot <- ggplot() + geom_point(data = prediction_table, mapping = aes(x = Perimeter, y = Concavity, color = Class), alpha = 0.02, size = 5.) + geom_point(data = rare_cancer, mapping = aes(x = Perimeter, y = Concavity, color = Class), alpha = 0.75) + - labs(color = "Diagnosis") + + labs(color = "Diagnosis", x = "Perimeter (standardized)", y = "Concavity (standardized)") + scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) upsampled_plot @@ -1001,7 +1159,6 @@ together multiple data analysis steps without a lot of otherwise necessary code To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed: ```{r 05-workflow} - # load the unscaled cancer data and make sure the target Class variable is a factor unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |> mutate(Class = as_factor(Class)) @@ -1036,7 +1193,7 @@ knn_fit <- workflow() |> knn_fit ``` As before, the fit object lists the function that trains the model as well as the "best" settings -for the number of neighbours and weight function (for now, these are just the values we chose +for the number of neighbors and weight function (for now, these are just the values we chose manually when we created `knn_spec` above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps. 
In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new @@ -1050,12 +1207,12 @@ prediction ``` The classifier predicts that the first observation is benign ("B"), while the second is malignant ("M"). Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this -trained K-nearest neighbour model will make on a large range of new observations. -Although you have seen coloured prediction map visualizations like this a few times now, +trained $K$-nearest neighbor model will make on a large range of new observations. +Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. The basic idea is to create a grid of synthetic new observations using the `expand.grid` function, -predict the label of each, and visualize the predictions with a coloured scatter having a very high transparency +predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low `alpha` value) and large point radius. See if you can figure out what each line is doing! > *Understanding this code is not required for the remainder of the textbook. It is included @@ -1063,8 +1220,12 @@ predict the label of each, and visualize the predictions with a coloured scatter ```{r 05-workflow-fit-plot} # create the grid of area/smoothness vals, and arrange in a data frame -are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100) -smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100) +are_grid <- seq(min(unscaled_cancer$Area), + max(unscaled_cancer$Area), + length.out = 100) +smo_grid <- seq(min(unscaled_cancer$Smoothness), + max(unscaled_cancer$Smoothness), + length.out = 100) asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid)) # use the fit workflow to make predictions at the grid points @@ -1074,16 +1235,18 @@ knnPredGrid <- predict(knn_fit, asgrid) prediction_table <- bind_cols(knnPredGrid, asgrid) |> rename(Class = .pred_class) # plot: -# 1. the coloured scatter of the original data -# 2. the faded coloured scatter for the grid points +# 1. the colored scatter of the original data +# 2. the faded colored scatter for the grid points wkflw_plot <- ggplot() + - geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) + - geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5.) 
+ + geom_point(data = unscaled_cancer, + mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) + + geom_point(data = prediction_table, + mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5) + labs(color = "Diagnosis") + scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2")) ``` -```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background colour indicates the decision of the classifier"} +```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier"} wkflw_plot ``` diff --git a/classification2.Rmd b/classification2.Rmd index 5a9f594c7..0d7f6a186 100644 --- a/classification2.Rmd +++ b/classification2.Rmd @@ -13,8 +13,8 @@ By the end of the chapter, readers will be able to: - Describe what training, validation, and test data sets are and how they are used in classification - Split data into training, validation, and test data sets - Evaluate classification accuracy in R using a validation data set and appropriate metrics -- Execute cross-validation in R to choose the number of neighbours in a K-nearest neighbours classifier -- Describe advantages and disadvantages of the K-nearest neighbours classification algorithm +- Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier +- Describe advantages and disadvantages of the $K$-nearest neighbors classification algorithm ## Evaluating accuracy @@ -24,12 +24,12 @@ classifier to make too many wrong predictions. How do we measure how "good" our classifier is? Let's revisit the [breast cancer images example](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and think about how our classifier will be used in practice. A biopsy will be -performed on a *new* patient's tumour, the resulting image will be analyzed, -and the classifier will be asked to decide whether the tumour is benign or +performed on a *new* patient's tumor, the resulting image will be analyzed, +and the classifier will be asked to decide whether the tumor is benign or malignant. The key word here is *new*: our classifier is "good" if it provides -accurate predictions on data *not seen during training*. But then how can we -evaluate our classifier without having to visit the hospital to collect more -tumour images? +accurate predictions on data *not seen during training*. But then, how can we +evaluate our classifier without visiting the hospital to collect more +tumor images? The trick is to split the data into a **training set** and **test set** (Figure \@ref(fig:06-training-test)) and use only the **training set** when building the classifier. @@ -41,7 +41,7 @@ labels for new observations without known class labels. > Note: if there were a golden rule of machine learning, it might be this: *you cannot use the test data to build the model!* > If you do, the model gets to "see" the test data in advance, making it look more accurate than it really is. Imagine -> how bad it would be to overestimate your classifier's accuracy when predicting whether a patient's tumour is malignant or benign! +> how bad it would be to overestimate your classifier's accuracy when predicting whether a patient's tumor is malignant or benign! 
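One way to read the golden rule as code is sketched below. This is only an illustration (it uses the `cancer`, `cancer_train`, and `cancer_test` object names that are introduced later in this chapter, and `final_fit` is a hypothetical fitted workflow): the point is simply that the test set appears exactly once, at the very end of the analysis.

```r
library(tidymodels)

# split the data before making any preprocessing or modeling decisions
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test <- testing(cancer_split)

# ...all preprocessing, training, and tuning uses cancer_train only...

# the test set is touched exactly once, at the very end, to estimate accuracy
# (final_fit stands in for whatever fitted workflow the analysis produces)
# predict(final_fit, cancer_test)
```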
```{r 06-training-test, echo = FALSE, warning = FALSE, fig.cap = "Splitting the data into training and testing sets", fig.retina = 2, out.width = "600"} knitr::include_graphics("img/training_test.jpeg") @@ -52,8 +52,13 @@ the observations in the test set? One way we can do this is to calculate the **prediction accuracy**. This is the fraction of examples for which the classifier made the correct prediction. To calculate this we divide the number of correct predictions by the number of predictions made. -This process is illustrated in Figure \@ref(fig:06-ML-paradigm-test). -Note that there are other measures for how well classifiers perform, such as *precision* and *recall*; + +$$\mathrm{prediction \; accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$ + + +The process for assessing if our predictions match the true labels in the +test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test). Note that there +are other measures for how well classifiers perform, such as *precision* and *recall*; these will not be discussed here, but you will likely encounter them in other more advanced books on this topic. @@ -61,14 +66,14 @@ books on this topic. knitr::include_graphics("img/ML-paradigm-test.png") ``` -In R, we can use the `tidymodels` library collection not only to perform K-nearest neighbours +In R, we can use the `tidymodels` package not only to perform $K$-nearest neighbors classification, but also to assess how well our classification worked. Let's -work through an example of this process using the breast cancer dataset. -We start by loading the necessary libraries, reading in the breast cancer data +work through an example of this process using the breast cancer data set. +We start by loading the necessary packages, reading in the breast cancer data from the previous chapter, and making a quick scatter plot visualization of -tumour cell concavity versus smoothness coloured by diagnosis in Figure \@ref(fig:06-precode). +tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:06-precode). -```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatterplot of tumour cell concavity versus smoothness coloured by diagnosis label", message = F, warning = F} +```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label", message = F, warning = F} # load packages library(tidyverse) library(tidymodels) @@ -77,8 +82,8 @@ library(tidymodels) cancer <- read_csv("data/unscaled_wdbc.csv") |> mutate(Class = as_factor(Class)) # convert the character Class variable to the factor datatype -# create scatter plot of tumour cell concavity versus smoothness, -# labelling the points be diagnosis class +# create scatter plot of tumor cell concavity versus smoothness, +# labeling the points be diagnosis class perim_concav <- cancer |> ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + @@ -100,29 +105,31 @@ using a larger test data set). Here, we will use 75% of the data for training, and 25% for testing. The `initial_split` function from `tidymodels` handles the procedure of splitting -the data for us. It also takes two very important steps when splitting to ensure +the data for us. It also applies two very important steps when splitting to ensure that the accuracy estimates from the test data are reasonable. First, it -**shuffles** the data before splitting. 
This ensures that any ordering present +**shuffles** the data before splitting, which ensures that any ordering present in the data does not influence the data that ends up in the training and testing sets. Second, it **stratifies** the data by the class label, to ensure that roughly -the same proportion of each class ends up in both the training and testing sets. For example, if roughly 65% of the -observations are from the benign class (`B`) and 35% are from the malignant class (`M`), -then `initial_split` ensures that roughly 65% of the training data are benign, -35% of the training data are malignant, +the same proportion of each class ends up in both the training and testing sets. For example, +in our data set, roughly 63% of the +observations are from the benign class (`B`), and 37% are from the malignant class (`M`), +so `initial_split` ensures that roughly 63% of the training data are benign, +37% of the training data are malignant, and the same proportions exist in the testing data. Let's use the `initial_split` function to create the training and testing sets. We will specify that `prop = 0.75` so that 75% of our original data set ends up in the training set. We will also set the `strata` argument to the categorical label variable -(here, `Class`) to ensure that the training and validation subsets contain the +(here, `Class`) to ensure that the training and testing subsets contain the right proportions of each category of observation. The `training` and `testing` functions then extract the training and testing data sets into two separate data frames. + ```{r 06-initial-split} set.seed(1) cancer_split <- initial_split(cancer, prop = 0.75, strata = Class) cancer_train <- training(cancer_split) -cancer_test <- testing(cancer_split) +cancer_test <- testing(cancer_split) ``` > Note: You will see in the code above that we use the `set.seed` function @@ -142,17 +149,38 @@ a train / test split of 75% / 25%, as desired. Recall from Chapter \@ref(classif that we use the `glimpse` function to view data with a large number of columns, as it prints the data such that the columns go down the page (instead of across). +```{r 06-train-prop, echo = FALSE} +train_prop <- cancer_train |> + group_by(Class) |> + summarize(proportion = n()/nrow(cancer_train)) +``` + +We can use `group_by` and `summarize` to find the percentage of malignant and benign classes +in `cancer_train` and we see about `r round(filter(train_prop, Class == "B")$proportion, 2)*100`% of the training +data are benign and `r round(filter(train_prop, Class == "M")$proportion, 2)*100`% +are malignant indicating that our class proportions were roughly preserved when we split the data. + +```{r 06-train-proportion} +cancer_proportions <- cancer_train %>% + group_by(Class) %>% + summarize(n = n()) %>% + mutate(percent = 100*n/nrow(cancer_train)) +cancer_proportions +``` + + + ### Preprocess the data -As we mentioned last chapter, KNN is sensitive to the scale of the predictors, -and so we should perform some preprocessing to standardize them. An +As we mentioned in the last chapter, $K$-nearest neighbors is sensitive to the scale of the predictors, +so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using **only the training data**. This ensures that our test data does not influence any aspect of our model training. 
Once we have created the standardization preprocessor, we can then apply it separately to both the training and test data sets. -Fortunately, the `recipe` framework from `tidymodels` makes it simple to handle +Fortunately, the `recipe` framework from `tidymodels` helps us handle this properly. Below we construct and prepare the recipe using only the training data (due to `data = cancer_train` in the first line). @@ -165,14 +193,14 @@ cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> ### Train the classifier Now that we have split our original data set into training and test sets, we -can create our K-nearest neighbours classifier with only the training set using +can create our $K$-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose -the number $K$ of neighbours to be 3, and use concavity and smoothness as the +the number $K$ of neighbors to be 3, and use concavity and smoothness as the predictors. As before we need to create a model specification, combine the model specification and recipe into a workflow, and then finally use `fit` with the training data `cancer_train` to build the classifier. -```{r 06-create-K-nearest neighbour-classifier} +```{r 06-create-K-nearest neighbor-classifier} set.seed(1) knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |> set_engine("kknn") |> @@ -186,15 +214,15 @@ knn_fit <- workflow() |> knn_fit ``` -> Note: Here again you see the `set.seed` function. In the K-nearest neighbours algorithm, -> there is a tie for the majority neighbour class, the winner is randomly selected. Although there is no chance +> Note: Here again you see the `set.seed` function because in the $K$-nearest neighbors algorithm, +> if there is a tie for the majority neighbor class, the winner is randomly selected. Although there is no chance > of a tie when $K$ is odd (here $K=3$), it is possible that the code may be changed in the future to have an even value of $K$. > Thus, to prevent potential issues with reproducibility, we have set the seed. Note that in your own code, -> you only have to set the seed once at the beginning of your analysis. +> you should only set the seed once at the beginning of your analysis. ### Predict the labels in the test set -Now that we have a K-nearest neighbours classifier object, we can use it to +Now that we have a $K$-nearest neighbors classifier object, we can use it to predict the class labels for our test set. We use the `bind_cols` to add the column of predictions to the original test data, creating the `cancer_test_predictions` data frame. The `Class` variable contains the true @@ -215,7 +243,8 @@ the `truth` and `estimate` arguments: ```{r 06-accuracy} cancer_test_predictions |> - metrics(truth = Class, estimate = .pred_class) + metrics(truth = Class, estimate = .pred_class) |> + filter(.metric == "accuracy") ``` ```{r 06-accuracy-2, echo = FALSE, warning = FALSE} @@ -224,11 +253,12 @@ cancer_acc_1 <- cancer_test_predictions %>% filter(.metric == 'accuracy') ``` -In the metrics data frame we are interested in the `accuracy` row; -looking at the value of the `.estimate` variable +In the metrics data frame we filtered the `.metric` column since we are +interested in the `accuracy` row. Other entries involve more advanced metrics that +are beyond the scope of this book. 
Looking at the value of the `.estimate` variable shows that the estimated accuracy of the classifier on the test data was `r round(100*cancer_acc_1$.estimate, 0)`%. -The other entries involve more advanced metrics that are beyond the scope of this book. + We can also look at the *confusion matrix* for the classifier, which shows the table of predicted labels and correct labels, using the `conf_mat` function: @@ -246,8 +276,9 @@ confu21 <- (confusionmt %>% filter(name == "cell_2_1"))$value confu22 <- (confusionmt %>% filter(name == "cell_2_2"))$value ``` -This table shows that the classifier labelled -`r confu11`+`r confu22` = `r confu11+confu22` observations +The confusion matrix shows `r confu11` observations were correctly predicted +as malignant, and `r confu22` were correctly predicted as benign. Therefore the classifier labelled +`r confu11` + `r confu22` = `r confu11+confu22` observations correctly. It also shows that the classifier made some mistakes; in particular, it classified `r confu21` observations as benign when they were truly malignant, and `r confu12` observations as malignant when they were truly benign. @@ -255,14 +286,14 @@ and `r confu12` observations as malignant when they were truly benign. ### Critically analyze performance We now know that the classifier was `r round(100*cancer_acc_1$.estimate,0)`% accurate -on the test dataset. That sounds pretty good!...Wait, *is* it good? +on the test data set. That sounds pretty good!... Wait, *is* it good? Or do we need something higher? In general, what a *good* value for accuracy is depends on the application. -On a task of predicting whether a tumour is benign or malignant -for a type of tumour that is benign 99% of the time, it is very easy to -obtain a 99% accuracy just by guessing benign for every observation. In this -case, 99% accuracy is probably not good enough. And beyond just accuracy, + For instance, suppose you are predicting whether a tumor is benign or malignant + for a type of tumor that is benign 99% of the time. It is very easy to obtain + a 99% accuracy just by guessing benign for every observation. In this case, + 99% accuracy is probably not good enough. And beyond just accuracy, sometimes the *kind* of mistake the classifier makes is important as well. In the previous example, it might be very bad for the classifier to predict "benign" when the true class is "malignant", as this might result in a patient @@ -275,22 +306,18 @@ also the confusion matrix. However, there is always an easy baseline that you can compare to for any classification problem: the *majority classifier*. The majority classifier *always* guesses the majority class label from the training data, regardless of -what values the predictor variables take. It helps to give you a sense for +the predictor variables' values. It helps to give you a sense of scale when considering accuracies. If the majority classifier obtains a 90% -accuracy on a problem, then you might hope for your K-nearest neighbours +accuracy on a problem, then you might hope for your $K$-nearest neighbors classifier to do better than that. If your classifier provides a significant improvement upon the majority classifier, this means that at least your method is extracting some useful information from your predictor variables. Be careful though: improving on the majority classifier does not *necessarily* mean the classifier is working well enough for your application. 
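As a rough sketch of what this baseline looks like in code (not the textbook's own analysis; it reuses the `cancer_train` and `cancer_test` data frames created earlier in this chapter), the majority classifier's accuracy can be estimated by finding the most common label in the training set and guessing it for every test observation.

```r
# find the most common class label in the training data
majority_label <- cancer_train |>
  count(Class) |>
  slice_max(n, n = 1) |>
  pull(Class) |>
  as.character()

# estimated accuracy of always guessing that label on the test set
mean(cancer_test$Class == majority_label)
```

The resulting number is typically close to the proportion of the majority class in the training data, which is the comparison made next.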
-As an example, in the breast cancer data, the proportions of benign and malignant +As an example, in the breast cancer data, recall the proportions of benign and malignant observations in the training data are as follows: ```{r 06-proportions} -cancer_proportions <- cancer_train %>% - group_by(Class) %>% - summarize(n = n()) %>% - mutate(percent = 100*n/nrow(cancer_train)) cancer_proportions ``` ```{r 06-proportions-2, echo = FALSE, warning = FALSE} @@ -305,15 +332,15 @@ is benign. The estimated accuracy of the majority classifier is usually fairly close to the majority class proportion in the training data. In this case, we would suspect that the majority classifier will have an accuracy of around `r round(cancer_propn_1[1,1], 0)`%. -The K-nearest neighbours classifier we built does quite a bit better than this, +The $K$-nearest neighbors classifier we built does quite a bit better than this, with an accuracy of `r round(100*cancer_acc_1$.estimate, 0)`%. This means that from the perspective of accuracy, -the K-nearest neighbours classifier improved quite a bit on the basic +the $K$-nearest neighbors classifier improved quite a bit on the basic majority classifier. Hooray! But we still need to be cautious; in -this application, it is likely very important not to miss-diagnose any malignant tumours to avoid missing +this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. The confusion matrix above shows -that the classifier does indeed miss-diagnose a significant number of malignant tumours as benign (`r confu21` -out of `r confu11+confu21` malignant tumours, or `r round(100*(confu21)/(confu11+confu21))`%!). +that the classifier does indeed misdiagnose a significant number of malignant tumors as benign (`r confu21` +out of `r confu11+confu21` malignant tumors, or `r round(100*(confu21)/(confu11+confu21))`%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance for the application. @@ -321,22 +348,22 @@ for the application. ## Tuning the classifier The vast majority of predictive models in statistics and machine learning have -*parameters*. A *parameter* is a number that you have to pick in advance that determines -some aspect of how the model behaves. For example, in the K-nearest neighbours +*parameters*. A *parameter* is a number you have to pick in advance that determines +some aspect of how the model behaves. For example, in the $K$-nearest neighbors classification algorithm, $K$ is a parameter that we have to pick -that determines how many neighbours participate in the class vote. +that determines how many neighbors participate in the class vote. By picking different values of $K$, we create different classifiers that make different predictions. -So then how do we pick the *best* value of $K$, i.e., *tune* the model? -And is it possible to make this selection in a principled way? Ideally what -we want is to somehow maximize the performance of our classifier on data *it +So then, how do we pick the *best* value of $K$, i.e., *tune* the model? +And is it possible to make this selection in a principled way? Ideally, +we want somehow to maximize the performance of our classifier on data *it hasn't seen yet*. But we cannot use our test data set in the process of building our model. 
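For example, as a sketch that simply mirrors the model specification syntax used earlier in this chapter, two different choices of $K$ correspond to two different model specifications, each of which would be paired with the same recipe and training data:

```r
# the same model family, but two different classifiers: one with K = 3 votes...
knn_spec_3 <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# ...and one with K = 11 votes
knn_spec_11 <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) |>
  set_engine("kknn") |>
  set_mode("classification")
```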
So we will play the same trick we did before when evaluating our classifier: we'll split our *training data itself* into two subsets, use one to train the model, and then use the other to evaluate it. -In this section we will cover the details of this procedure, as well as -how to use it to help you pick good a parameter value for your classifier. +In this section, we will cover the details of this procedure, as well as +how to use it to help you pick a good parameter value for your classifier. > **Remember:** *don't touch the test set during the tuning process. Tuning is a part of model training!* @@ -360,12 +387,12 @@ value based on __*all*__ of the different results. If we just split our overall data *once*, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. Perhaps using multiple different train/validation splits, we'll get a better estimate of accuracy, -which will lead to a better choice of the number of neighbours $K$ for the +which will lead to a better choice of the number of neighbors $K$ for the overall set of training data. Let's investigate this idea in R! In particular, we will use different seed values in the `set.seed` function to generate five different train/validation -splits of our overall training data, train five different K-nearest neighbours +splits of our overall training data, train five different $K$-nearest neighbors models, and evaluate their accuracy. We will start with just a single split generated by using `set.seed(1)`. ```{r 06-five-splits} @@ -436,7 +463,8 @@ for (i in 1:5) { The accuracy estimate using the split based on `set.seed(1)` is `r round(100*acc,1)`%. Now we repeat the above code 4 more times, using `set.seed(i)` for `i = 2, 3, 4, 5`. With five different seeds, we get five different shuffles of the data, and therefore five different values for -accuracy: `r sprintf("%.1f%%", round(100*accuracies,1))`. None of these is necessarily "more correct" than any other; they're +accuracy: `r sprintf("%.1f%%", round(100*accuracies,1))`. None of these values are +necessarily "more correct" than any other; they're just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here `r round(100*mean(accuracies),0)`%) to try to get a single assessment of our @@ -447,7 +475,7 @@ In practice, we don't use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a validation set only a single time. The name for this strategy is called **cross-validation**. In **cross-validation**, we split our **overall training -data** into $C$ evenly-sized chunks, and then iteratively use $1$ chunk as the +data** into $C$ evenly-sized chunks. Then, iteratively use $1$ chunk as the **validation set** and combine the remaining $C-1$ chunks as the **training set**. This procedure is shown in Figure \@ref(fig:06-cv-image). @@ -472,8 +500,8 @@ Then, when we create our data analysis workflow, we use the `fit_resamples` func instead of the `fit` function for training. This runs cross-validation on each train/validation split. -> **Note:** we set the seed when we call `train` not only because of the potential for ties, but also because we are doing -> cross-validation. Cross-validation uses a random process to select how to partition the training data. 
+> **Note:** we set the seed because we are doing +> cross-validation, which uses a random process to select how to partition the training data. ```{r 06-vfold-workflow} set.seed(1) @@ -506,7 +534,8 @@ You can also ignore the entire second row with `roc_auc` in the `.metric` column as it is beyond the scope of this book. ```{r 06-vfold-metrics} -knn_fit |> collect_metrics() +knn_fit |> + collect_metrics() ``` We can choose any number of folds, and typically the more we use the better our @@ -514,8 +543,8 @@ accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to -consider the size of the data, and the speed of the algorithm (e.g., K-nearest -neighbour) and the speed of your computer. In practice, this is a trial and +consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest +neighbor) and the speed of your computer. In practice, this is a trial and error process, but typically $C$ is chosen to be either 5 or 10. Here we show how the standard error decreases when we use 10-fold cross validation rather than 5-fold: @@ -527,23 +556,24 @@ vfold_metrics <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold) |> - collect_metrics() + collect_metrics() vfold_metrics ``` + ### Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction accuracy of our classifier is somewhere around `r round(100*(vfold_metrics %>% filter(.metric == "accuracy"))$mean,0)`%. Whether that is good or not depends entirely on the downstream application of the data analysis. In the -present situation, we are trying to predict a tumour diagnosis, with expensive, +present situation, we are trying to predict a tumor diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of misprediction. Hence, we might like to do better than `r round(100*(vfold_metrics %>% filter(.metric == "accuracy"))$mean,0)`% for this application. In order to improve our classifier, we have one choice of parameter: the number of -neighbours, $K$. Since cross-validation helps us evaluate the accuracy of our +neighbors, $K$. Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives us the best accuracy. The `tidymodels` package collection provides a very simple @@ -561,7 +591,6 @@ variable that contains the sequence of values of $K$ to try; below we create the data frame with the `neighbors` variable containing each value from $K=1$ to $K=15$ using the `seq` function. Then we pass that data frame to the `grid` argument of `tune_grid`. -We set the seed prior to tuning to ensure results are reproducible: ```{r 06-range-cross-val-2} set.seed(1) k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1)) @@ -569,26 +598,28 @@ knn_results <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = k_vals) |> - collect_metrics() -knn_results -``` -We can select the best value of the number of neighbours (i.e., the one that results -in the highest classifier accuracy estimate) by plotting the accuracy versus $K$. 
-```{r 06-find-k, fig.height = 4, fig.width = 5, fig.cap= "Plot of estimated accuracy versus the number of neighbours"} + collect_metrics() + accuracies <- knn_results |> filter(.metric == "accuracy") +accuracies +``` + +We can select the best value of the number of neighbors (i.e., the one that results +in the highest classifier accuracy estimate) by plotting the accuracy versus $K$ +in Figure \@ref(fig:06-find-k). +```{r 06-find-k, fig.height = 4, fig.width = 5, fig.cap= "Plot of estimated accuracy versus the number of neighbors"} accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = "Neighbors", y = "Accuracy Estimate") accuracy_vs_k ``` -Figure \@ref(fig:06-find-k) suggests that setting the number of -neighbours to $K =$ `r (accuracies %>% arrange(desc(mean)) %>% head(1))$neighbors` -provides the highest accuracy. -But as you can see, there is no exact or perfect answer here; - any selection from $K = 3$ and $15$ would be reasonably justified, as all +Setting the number of +neighbors to $K =$ `r (accuracies %>% arrange(desc(mean)) %>% head(1))$neighbors` +provides the highest accuracy (`r (accuracies %>% arrange(desc(mean)) %>% slice(1) %>% pull(mean) %>% round(4))*100`%). But there is no exact or perfect answer here; +any selection from $K = 3$ and $15$ would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are *estimates* of the true accuracy of our classifier. Although the @@ -612,12 +643,12 @@ $K =$ `r (accuracies %>% arrange(desc(mean)) %>% head(1))$neighbors` for the cla ### Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of -neighbours $K$? In fact, the accuracy actually starts to decrease! +neighbors $K$? In fact, the accuracy actually starts to decrease! Let's specify a much larger range of values of $K$ to try in the `grid` argument of `tune_grid`. Figure \@ref(fig:06-lots-of-ks) shows a plot of estimated accuracy as we vary $K$ from 1 to almost the number of observations in the data set. -```{r 06-lots-of-ks, fig.height = 4, fig.width = 5, fig.cap="Plot of accuracy estimate versus number of neighbours for many K values"} +```{r 06-lots-of-ks, message = FALSE, fig.height = 4, fig.width = 5, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values"} set.seed(1) k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10)) knn_results <- workflow() |> @@ -637,33 +668,36 @@ accuracy_vs_k_lots ``` **Underfitting:** What is actually happening to our classifier that causes -this? As we increase the number of neighbours, more and more of the training +this? As we increase the number of neighbors, more and more of the training observations (and those that are farther and farther away from the point) get a "say" in what the class of a new observation is. This causes a sort of "averaging effect" to take place, making the boundary between where our -classifier would predict a tumour to be malignant versus benign to smooth out +classifier would predict a tumor to be malignant versus benign to smooth out and become *simpler.* If you take this to the extreme, setting $K$ to the total training data set size, then the classifier will always predict the same label regardless of what the new observation looks like. In general, if the model *isn't influenced enough* by the training data, it is said to **underfit** the data. 
-**Overfitting:** In contrast, when we decrease the number of neighbours, each +**Overfitting:** In contrast, when we decrease the number of neighbors, each individual data point has a stronger and stronger vote regarding nearby points. Since the data themselves are noisy, this causes a more "jagged" boundary corresponding to a *less simple* model. If you take this case to the extreme, setting $K = 1$, then the classifier is essentially just matching each new -observation to its closest neighbour in the training data set. This is just as +observation to its closest neighbor in the training data set. This is just as problematic as the large $K$ case, because the classifier becomes unreliable on new data: if we had a different training set, the predictions would be completely different. In general, if the model *is influenced too much* by the training data, it is said to **overfit** the data. -You can see these two effects in Figure \@ref(fig:06-decision-grid-K), -which shows how the classifier changes as we set the number of neighbours $K$ to 1, 7, 20, and 300. +Both overfitting and underfitting are problematic and will lead to a model +that does not generalize well to new data. When fitting a model, we need to strike +a balance between the two. You can see these two effects in Figure +\@ref(fig:06-decision-grid-K), which shows how the classifier changes as +we set the number of neighbors $K$ to 1, 7, 20, and 300.
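The code that produces Figure \@ref(fig:06-decision-grid-K) is not shown, but a sketch of how those four classifiers could be trained is given below (it reuses the `cancer_recipe` and `cancer_train` objects from earlier in this chapter). The only thing that changes between the four fits is the `neighbors` argument of the model specification.

```r
ks <- c(1, 7, 20, 300)

# fit one K-nearest neighbors workflow per value of K
fits <- lapply(ks, function(k) {
  knn_spec_k <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
    set_engine("kknn") |>
    set_mode("classification")

  workflow() |>
    add_recipe(cancer_recipe) |>
    add_model(knn_spec_k) |>
    fit(data = cancer_train)
})
```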
-```{r 06-decision-grid-K, echo = FALSE, fig.height = 7, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting"} +```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 7, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting"} library(gridExtra) ks <- c(1, 7, 20, 300) plots <- list() @@ -701,21 +735,21 @@ grid.arrange(grobs = plots) ## Summary Classification algorithms use one or more quantitative variables to predict the -value of another, categorical variable. The K-nearest neighbours algorithm in -particular does this by first finding the $K$ points in the training data nearest +value of another categorical variable. In particular, the $K$-nearest neighbors algorithm +does this by first finding the $K$ points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can evaluate a classifier by splitting the data randomly into a training and test data set, using the training set to build the classifier, and using the test set to estimate its accuracy. Finally, we -can tune the classifier (e.g., select the number of neighbours $K$ in KNN) -by maximizing estimated accuracy via cross-validation. This +can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN) +by maximizing estimated accuracy via cross-validation. The overall process is summarized in Figure \@ref(fig:06-overview). -```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of KNN classification", fig.retina = 2, out.width = "660"} +```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of K-nn classification", fig.retina = 2, out.width = "660"} knitr::include_graphics("img/train-test-overview.jpeg") ``` -The overall workflow for performing K-nearest neighbours classification using `tidymodels` is as follows: +The overall workflow for performing $K$-nearest neighbors classification using `tidymodels` is as follows: 1. Use the `initial_split` function to split the data into a training and test set. Set the `strata` argument to the class label variable. Put the test set aside for now. 2. Use the `vfold_cv` function to split up the training data for cross validation. @@ -726,15 +760,18 @@ The overall workflow for performing K-nearest neighbours classification using `t 7. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier using the `fit` function. 8. Evaluate the estimated accuracy of the classifier on the test set using the `predict` function. -All algorithms have strengths and weaknesses. We summarize these for the K-nearest neighbours algorithm here. +In these last two chapters, we focused on the $K$-nearest neighbor algorithm, +but there are many other methods we could have used to predict a categorical label. +All algorithms have their strengths and weaknesses, and we summarize these for +the $K$-NN here. -**Strengths:** K-nearest neighbours classification +**Strengths:** $K$-nearest neighbors classification 1. is a simple, intuitive algorithm 2. requires few assumptions about what the data must look like 3. works for binary (two-class) and multi-class (more than 2 classes) classification problems -**Weaknesses:** K-nearest neighbours classification +**Weaknesses:** $K$-nearest neighbors classification 1. becomes very slow as the training data gets larger 2. 
may not perform well with a large number of predictors @@ -749,17 +786,16 @@ pick a subset of useful variables to include as predictors.* Another potentially important part of tuning your classifier is to choose which variables from your data will be treated as predictor variables. Technically, you can choose anything from using a single predictor variable to using every variable in your -data; the K-nearest neighbours algorithm accepts any number of +data; the $K$-nearest neighbors algorithm accepts any number of predictors. However, it is **not** the case that using more predictors always yields better predictions! In fact, sometimes including irrelevant predictors can actually negatively affect classifier performance. ### The effect of irrelevant predictors -Let's take a look at an example where K-nearest neighbours performs -worse when given more predictors to work with. In this example we have modified -the breast cancer data to have only the `Smoothness`, `Concavity`, and `Perimeter` variables from the original data, -and then added irrelevant variables that we created ourselves using a random number generator. +Let's take a look at an example where $K$-nearest neighbors performs +worse when given more predictors to work with. In this example, we modified +the breast cancer data to have only the `Smoothness`, `Concavity`, and `Perimeter` variables from the original data. Then, we added irrelevant variables that we created ourselves using a random number generator. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless of what the value `Class` variable takes. In other words, the irrelevant variables have no meaningful relationship with the `Class` variable. @@ -781,9 +817,9 @@ cancer_irrelevant %>% select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2) ``` -Next, we build a sequence of KNN classifiers that include `Smoothness`, +Next, we build a sequence of $K$-NN classifiers that include `Smoothness`, `Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant -variables. In particular we create 6 datasets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. +variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. Then we build a model, tuned via 5-fold cross-validation, for each data set. Figure \@ref(fig:06-performance-irrelevant-features) shows the estimated cross-validation accuracy versus the number of irrelevant predictors. As @@ -791,7 +827,7 @@ we add more irrelevant predictor variables, the estimated accuracy of our classifier decreases. This is because the irrelevant variables add a random amount to the distance between each pair of observations; the more irrelevant variables there are, the more (random) influence they have, and the more they -corrupt the set of nearest neighbours that vote on the class of the new +corrupt the set of nearest neighbors that vote on the class of the new observation to predict.
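To see this effect in miniature, here is a toy sketch (not the textbook's hidden data-generating code) with two observations that are identical on the real predictors. An irrelevant 0/1 variable can still push them a full unit of distance apart, purely at random.

```r
set.seed(123)

# two observations with identical values of the meaningful predictors,
# plus one irrelevant predictor that is an independent coin flip for each
obs_a <- c(Smoothness = 0.5, Concavity = 0.5, Irrelevant = rbinom(1, size = 1, prob = 0.5))
obs_b <- c(Smoothness = 0.5, Concavity = 0.5, Irrelevant = rbinom(1, size = 1, prob = 0.5))

# the straight-line distance is now 0 or 1, determined entirely by the coin flips
sqrt(sum((obs_a - obs_b)^2))
```

With many such variables these random contributions accumulate, which is exactly what corrupts the set of nearest neighbors described above.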
@@ -874,21 +910,21 @@ Although the accuracy decreases as expected, one surprising thing about Figure \@ref(fig:06-performance-irrelevant-features) is that it shows that the method still outperforms the baseline majority classifier (with about `r round(cancer_propn_1[1,1], 0)`% accuracy) even with 40 irrelevant variables. -How could that be? Figure \@ref(fig:06-neighbours-irrelevant-features) provides the answer: -the tuning procedure for the K-nearest neighbours classifier combats the extra randomness from the irrelevant variables -by increasing the number of neighbours. Of course, because of all the extra noise in the data from the irrelevant -variables, the number of neighbours does not increase smoothly; but the general trend is increasing. +How could that be? Figure \@ref(fig:06-neighbors-irrelevant-features) provides the answer: +the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables +by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant +variables, the number of neighbors does not increase smoothly; but the general trend is increasing. Figure \@ref(fig:06-fixed-irrelevant-features) corroborates -this evidence; if we fix the number of neighbours to $K=3$, the accuracy falls off more quickly. +this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly. -```{r 06-neighbours-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "600", fig.cap = "Tuned number of neighbours for varying number of irrelevant predictors"} +```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "600", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors"} plt_irrelevant_nghbrs <- ggplot(res) + geom_line(mapping = aes(x=ks, y=nghbrs)) + - labs(x = "Number of Irrelevant Predictors", y = "Number of neighbours") + labs(x = "Number of Irrelevant Predictors", y = "Number of neighbors") plt_irrelevant_nghbrs ``` -```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "600", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbours"} +```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "600", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors"} res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"), names_to="Type", values_to="accuracy") plt_irrelevant_nghbrs <- ggplot(res_tmp) + geom_line(mapping = aes(x=ks, y=accuracy, color=Type)) + @@ -899,7 +935,7 @@ plt_irrelevant_nghbrs ### Finding a good subset of predictors -So then if it is not ideal to use all of our variables as predictors without consideration, how +So then, if it is not ideal to use all of our variables as predictors without consideration, how do we choose which variables we *should* use? A simple method is to rely on your scientific understanding of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer data that we have been studying, the `ID` variable is just a unique identifier for the observation. @@ -909,7 +945,7 @@ is less obvious, as all seem like reasonable candidates. It is not clear which subset of them will create the best classifier. 
One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider. -We therefore need a more systematic and programmatic way of choosing variables. +Therefore we need a more systematic and programmatic way of choosing variables. This is a very difficult problem to solve in general, and there are a number of methods that have been developed that apply in particular cases of interest. Here we will discuss two basic @@ -925,7 +961,7 @@ In particular, you 2. tune each one using cross validation 3. pick the subset of predictors that gives you the highest cross-validation accuracy -Best subset selection is applicable to any classification method (KNN or otherwise). +Best subset selection is applicable to any classification method ($K$-NN or otherwise). However, it becomes very slow when you have even a moderate number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets grows very quickly with the number of predictors, and you have to train the model (itself @@ -983,7 +1019,8 @@ as potential predictors, and the `Class` variable as the label. We will also extract the column names for the full set of predictor variables. ```{r 06-fwdsel, warning = FALSE} set.seed(1) -cancer_subset <- cancer_irrelevant %>% select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, Irrelevant3) +cancer_subset <- cancer_irrelevant %>% + select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, Irrelevant3) names <- colnames(cancer_subset %>% select(-Class)) cancer_subset @@ -1012,7 +1049,7 @@ one over increasing predictor set sizes and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below). For each set of predictors to try, we construct a model formula, pass it into a `recipe`, build a `workflow` that tunes -a KNN classifier using 5-fold cross-validation, +a $K$-NN classifier using 5-fold cross-validation, and finally records the estimated accuracy. ```{r 06-fwdsel-2, warning = FALSE} @@ -1065,7 +1102,8 @@ for (i in 1:n_total) { models[[j]] <- model_string } jstar <- which.max(unlist(accs)) - accuracies <- accuracies %>% add_row(size = i, model_string = models[[jstar]], accuracy = accs[[jstar]]) + accuracies <- accuracies %>% + add_row(size = i, model_string = models[[jstar]], accuracy = accs[[jstar]]) selected <- c(selected, names[[jstar]]) names <- names[-jstar] } @@ -1077,7 +1115,7 @@ Interesting! The forward selection procedure first added the three meaningful va visualizes the accuracy versus the number of predictors in the model. You can see that as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number -of neighbours to account for the extra noise. In order to pick the right model from the sequence, you have +of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). 
The way to find that balance is to look for the *elbow* in Figure \@ref(fig:06-fwdsel-3), i.e., the place on the plot where the accuracy stops increasing dramatically and diff --git a/references.bib b/references.bib index 9cb69a64d..151fdd606 100644 --- a/references.bib +++ b/references.bib @@ -106,6 +106,7 @@ @Manual{Rlanguage url = {https://www.R-project.org/}, } + @online{tidyversestyleguide, year = 2020, author = {Hadley Wickham}, @@ -218,3 +219,10 @@ @inproceedings{kluyver2016jupyter volume = {87}, address = {Amsterdam} } + +@online{stanfordhealthcare, + year = 2021, + author = {{Stanford Health Care}}, + title = {What is Cancer?}, + url = {https://stanfordhealthcare.org/medical-conditions/cancer/cancer.html} +} diff --git a/viz.Rmd b/viz.Rmd index 05a729b2f..646b5eda7 100644 --- a/viz.Rmd +++ b/viz.Rmd @@ -511,7 +511,7 @@ morley_hist ``` -Wait a second, we notice that the histogram is still all the same colour! What is going on here? If we look at the printed `morley` data, the column `Expt` is an integer (we see the label `` underneath the `Expt` column name). But, we want to treat it as a categorical variable. To fix this issue we can write `factor(Expt)` in the `fill` aesthetic mapping. By writing `factor(Expt)` we are ensuring that R will treat this variable as a factor and the colour will be mapped discretely. +Wait a second, we notice that the histogram is still all the same colour! What is going on here? If we look at the printed `morley` data, the column `Expt` is an integer (we see the label `` underneath the `Expt` column name). But, we want to treat it as a categorical variable. To fix this issue we can write `factor(Expt)` in the `fill` aesthetic mapping. Factors are a special categorical type of variable in R that are often used for class label data. By writing `factor(Expt)` we are ensuring that R will treat this variable as a factor and the colour will be mapped discretely. ```{r 03-data-morley-hist-with-factor, warning=FALSE, message=FALSE, fig.cap = "Histogram of Michelson's speed of light data coloured by experiment as factor"} morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) +