10.7 Exercises

library(tidyverse)
library(here)

Conceptual

(1) This problem involves the K-means clustering algorithm.

(a) Prove Equation (10.12).

A: A solution can be found at https://rpubs.com/evertonjlima/245990, although I'm not fully convinced of the correctness of all of its steps.

(b) On the basis of this identity, argue that the K-means clustering algorithm (Algorithm 10.1) decreases the objective (10.11) at each iteration.

A: In the first step of each iteration (moving each cluster's centroid to the vector of within-cluster feature means) we minimize the sum of squared deviations from the center within each cluster, because the mean is the minimizer of the sum of squared deviations. Then, reassigning each observation to its nearest centroid can only decrease (or leave unchanged) that total. Since, by the identity (10.12), the objective (10.11) equals twice this total, the objective never increases from one iteration to the next.
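For reference, the objective (10.11) and the identity (10.12) used in this argument are:

$$
\underset{C_1, \dots, C_K}{\text{minimize}} \; \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\} \qquad (10.11)
$$

$$
\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \quad \text{where} \quad \bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} \qquad (10.12)
$$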

(2) Suppose that we have four observations, for which we compute a dissimilarity matrix, given by:

For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

(b) Repeat (a), this time using single linkage clustering.
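The sketches themselves are not reproduced here, but we can let R draw both dendrograms as a check. The dissimilarity matrix below is the one given in the textbook for this exercise (only the 0.3 and 0.8 entries are quoted above), so treat it as an assumption of this check.

d <- as.dist(matrix(c(0,    0.3,  0.4,  0.7,
                      0.3,  0,    0.5,  0.8,
                      0.4,  0.5,  0,    0.45,
                      0.7,  0.8,  0.45, 0),
                    nrow = 4))

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

par(mfrow = c(1, 2))
plot(hc_complete, main = "Complete linkage")
plot(hc_single,   main = "Single linkage")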

(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?

1 and 2 in one cluster and 3 and 4 in the other.

(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?

1, 2 and 3 in one cluster and 4 in the other.
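These two-cluster assignments can be verified with cutree() on the fits from the check above:

cutree(hc_complete, k = 2)
cutree(hc_single, k = 2)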

(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.
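One quick way to draw such an equivalent dendrogram (a sketch, reusing hc_complete from the check above) is rev(), which swaps the two branches at every fusion: the leaves are repositioned, but every fusion height, and hence the meaning of the dendrogram, stays the same.

plot(rev(as.dendrogram(hc_complete)),
     main = "Same fusions, leaves repositioned")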

(3) In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p=2 features. The observations are as follows.

(a) Plot the observations.

df_kmeans <- 
  tibble(
    x1 = c(1, 1, 0, 5, 6, 4),
    x2 = c(4, 3, 4, 1, 2, 0)
  )

qplot(x1, x2, data = df_kmeans)

(b) Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.

set.seed(1989)

(
  df_kmeans <- df_kmeans %>%
   mutate(cluster = sample(c(1, 2), 6, replace = TRUE))
 )
## # A tibble: 6 x 3
##      x1    x2 cluster
##   <dbl> <dbl>   <dbl>
## 1     1     4       2
## 2     1     3       1
## 3     0     4       2
## 4     5     1       1
## 5     6     2       1
## 6     4     0       1

(c) Compute the centroid for each cluster.

(df_centroids <-
   df_kmeans %>%
   group_by(cluster) %>%
   summarise_all(mean))
## # A tibble: 2 x 3
##   cluster    x1    x2
##     <dbl> <dbl> <dbl>
## 1       1   4     1.5
## 2       2   0.5   4

(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

# Create a function to get the closest centroid
assign_cluster <- function(x1_input, x2_input) {
  df_centroids %>%
    # Distance from this point to each centroid
    mutate(distance = sqrt((x1_input - x1)^2 + (x2_input - x2)^2)) %>%
    # Pick the cluster with the lowest distance
    filter(distance == min(distance)) %>%
    pull(cluster)
}

(
df_kmeans <- df_kmeans %>% 
  mutate(new_cluster = map2_dbl(x1, x2, assign_cluster)) %>% 
  rename(old_cluster = cluster)
)
## # A tibble: 6 x 4
##      x1    x2 old_cluster new_cluster
##   <dbl> <dbl>       <dbl>       <dbl>
## 1     1     4           2           2
## 2     1     3           1           2
## 3     0     4           2           2
## 4     5     1           1           1
## 5     6     2           1           1
## 6     4     0           1           1

(e) Repeat (c) and (d) until the answers obtained stop changing.

assign_cluster_iter <- function(x1_input, x2_input) {
  df_centroids %>%
    # Distance from this point to each centroid
    mutate(distance = sqrt((x1_input - x1)^2 + (x2_input - x2)^2)) %>%
    # Pick the cluster with the lowest distance
    filter(distance == min(distance)) %>%
    pull(new_cluster)
}

while (TRUE) {
  # Recompute the centroids from the current assignments
  df_centroids <-
    df_kmeans %>%
    select(-old_cluster) %>% 
    group_by(new_cluster) %>%
    summarise_all(mean)
  
  # Reassign each observation to its nearest centroid
  df_kmeans <- 
    df_kmeans %>% 
    mutate(old_cluster = new_cluster,
           new_cluster = map2_dbl(x1, x2, assign_cluster_iter))
  
  # Stop when the assignments no longer change
  if (all(df_kmeans$old_cluster == df_kmeans$new_cluster)) {
    break
  }
}

df_kmeans
## # A tibble: 6 x 4
##      x1    x2 old_cluster new_cluster
##   <dbl> <dbl>       <dbl>       <dbl>
## 1     1     4           2           2
## 2     1     3           2           2
## 3     0     4           2           2
## 4     5     1           1           1
## 5     6     2           1           1
## 6     4     0           1           1

(f) In your plot from (a), color the observations according to the cluster labels obtained.

qplot(x1, x2, color = factor(new_cluster), data = df_kmeans) +
  labs(color = "cluster")

(4) Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.

(a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

A: Complete linkage defines the dissimilarity between two clusters as the largest pairwise dissimilarity between an observation in one cluster and an observation in the other, while single linkage uses the smallest such pairwise dissimilarity. Therefore the fusion of {1, 2, 3} and {4, 5} under complete linkage occurs at a height greater than or equal to the fusion under single linkage; in most cases it will be strictly higher. Only in the special case where all the pairwise inter-cluster dissimilarities are equal do the two fusions occur at the same height.

(b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

A: Since both clusters contain a single observation, every linkage method reduces to the same pairwise dissimilarity between observations 5 and 6, so the two fusions occur at the same height.

(5) In words, describe the results that you would expect if you performed K-means clustering of the eight shoppers in Figure 10.14, on the basis of their sock and computer purchases, with K=2. Give three answers, one for each of the variable scalings displayed. Explain.

A: With the first scaling (the raw number of items purchased), the sock counts dominate the Euclidean distances, so we would get one cluster of shoppers who bought many socks and another of shoppers who bought few.

With the second scaling (each variable scaled to have standard deviation one), both variables carry comparable weight, and the clusters would most likely split according to the number of computers purchased.

With the third scaling (the number of dollars spent), the computer purchases dominate, so there would be one cluster of shoppers who bought a computer and another of shoppers who did not buy one.

(6) A researcher collects expression measurements for 1,000 genes in 100 tissue samples. The data can be written as a 1, 000 × 100 matrix, which we call X, in which each row represents a gene and each column a tissue sample. Each tissue sample was processed on a different day, and the columns of X are ordered so that the samples that were processed earliest are on the left, and the samples that were processed later are on the right. The tissue samples belong to two groups: control (C) and treatment (T). The C and T samples were processed in a random order across the days. The researcher wishes to determine whether each gene’s expression measurements differ between the treatment and control groups.

(a) Explain what it means that the first principal component “explains 10% of the variation”.

A: It means that the sum of the squared scores of the first principal component amounts to 10% of the total variance of the data set, where the total variance is the sum, over all variables, of the squared (centered) observations.
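In symbols, this is Equation 10.8 in the text: the proportion of variance explained by the m-th principal component is

$$
\text{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2} = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2},
$$

so "explains 10% of the variation" means this ratio equals 0.10 for m = 1.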

(b)

Skipped. The solution found at https://blog.princehonest.com/stat-learning/ch10/6.html suggests that the researcher should add a new feature to the original data set recording the machine on which each sample was processed, and then perform PCA including that additional column.

(c) Design and run a small simulation experiment to demonstrate the superiority of your idea.

(Answer copied from the same link: https://blog.princehonest.com/stat-learning/ch10/6.html)

set.seed(1989)

Control = matrix(rnorm(50*1000), ncol=50)

Treatment = matrix(rnorm(50*1000), ncol=50)

X = cbind(Control, Treatment)

# linear trend in one dimension
X[1,] = seq(-18, 18 - .36, .36) 

pr_out <- prcomp(scale(X))
summary(pr_out)$importance[,1]
##     Standard deviation Proportion of Variance  Cumulative Proportion 
##               3.165053               0.100180               0.100180

The proportion of variance explained is approximately 10%. Now let's try appending a row which represents the machine used:

X = rbind(X, c(rep(10, 50), rep(0, 50)))
pr_out2 = prcomp(scale(X))
summary(pr_out2)$importance[,1]
##     Standard deviation Proportion of Variance  Cumulative Proportion 
##               3.409989               0.116280               0.116280

The proportion of variance explained increases to about 11.6%.

Applied

(7) In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let r_ij denote the correlation between the ith and jth observations, then the quantity 1 − r_ij is proportional to the squared Euclidean distance between the ith and jth observations.

On the USArrests data, show that this proportionality holds.
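To see why this is expected: if each observation has been standardized across its p features (mean zero, sample standard deviation one), then

$$
\lVert x_i - x_{i'} \rVert^2 = \sum_{j=1}^{p} x_{ij}^2 + \sum_{j=1}^{p} x_{i'j}^2 - 2 \sum_{j=1}^{p} x_{ij} x_{i'j} = (p-1) + (p-1) - 2\,(p-1)\,r_{ii'} = 2\,(p-1)\,(1 - r_{ii'}),
$$

so 1 − r_ii' equals the squared Euclidean distance divided by the constant 2(p − 1).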

us_arrests <- USArrests %>% as_tibble() 
us_arrests
## # A tibble: 50 x 4
##    Murder Assault UrbanPop  Rape
##     <dbl>   <int>    <int> <dbl>
##  1   13.2     236       58  21.2
##  2   10       263       48  44.5
##  3    8.1     294       80  31  
##  4    8.8     190       50  19.5
##  5    9       276       91  40.6
##  6    7.9     204       78  38.7
##  7    3.3     110       77  11.1
##  8    5.9     238       72  15.8
##  9   15.4     335       80  31.9
## 10   17.4     211       60  25.8
## # ... with 40 more rows

First, center and scale the variables:

(
  us_arrests_scaled <- 
    us_arrests %>% 
    mutate_all(~scale(.)[,1])
)
## # A tibble: 50 x 4
##     Murder Assault UrbanPop     Rape
##      <dbl>   <dbl>    <dbl>    <dbl>
##  1  1.24     0.783   -0.521 -0.00342
##  2  0.508    1.11    -1.21   2.48   
##  3  0.0716   1.48     0.999  1.04   
##  4  0.232    0.231   -1.07  -0.185  
##  5  0.278    1.26     1.76   2.07   
##  6  0.0257   0.399    0.861  1.86   
##  7 -1.03    -0.729    0.792 -1.08   
##  8 -0.433    0.807    0.446 -0.580  
##  9  1.75     1.97     0.999  1.14   
## 10  2.21     0.483   -0.383  0.488  
## # ... with 40 more rows
us_arrests_scaled %>% summary()
##      Murder           Assault           UrbanPop             Rape        
##  Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874  
##  1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574  
##  Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277  
##  Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444

Then compute the pairwise Euclidean distances between all the observations:

distance_matrix <- 
  dist(us_arrests_scaled) %>% 
  as.matrix()

Now compute the pairwise correlations between observations, and then take 1 − correlation.

cors_transformed <- 
  us_arrests_scaled %>% 
  as.matrix() %>% 
  t() %>% 
  cor() %>% 
  magrittr::multiply_by(-1) %>% 
  magrittr::add(1)

Finally we can plot the squared distances against the 1 − correlation values. The relationship is evident:

tibble(
  squared_distance = as.vector(distance_matrix)^2,
  one_minus_correlation = as.vector(cors_transformed)
) %>% 
  ggplot(aes(squared_distance, one_minus_correlation)) +
  geom_point()

(8) In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function.

On the USArrests data, calculate PVE in two ways:

(a) Using the sdev output of the prcomp() function, as was done in Section 10.2.3.

(b) By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.

These two approaches should give the same results.

A: I’m going to use scaled and centered data in both cases.

Case (a), using sdev

pr_out_usarrests <- prcomp(us_arrests_scaled,
                           center = FALSE, scale = FALSE)

pr_out_usarrests$sdev %>% (function(x){ x^2/sum(x^2)} )
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

Case (b), using Equation 10.8

# Sum of squared scores (numerator in Equation 10.8)
variability_components <- 
  pr_out_usarrests$x %>% 
  magrittr::raise_to_power(2) %>% 
  colSums()

# Variability in the data (denominator in the formula)
variability_data <- 
  us_arrests_scaled %>% 
  as.matrix() %>% 
  magrittr::raise_to_power(2) %>% 
  colSums() %>% 
  sum()

variability_components/variability_data
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

(9) Consider the USArrests data. We will now perform hierarchical clustering on the states.

(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

clust_us_arrest1 <- 
  hclust(dist(USArrests), method = "complete")

(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

tibble(
  states = rownames(USArrests),
  clusters = cutree(clust_us_arrest1, 3)
)
## # A tibble: 50 x 2
##    states      clusters
##    <chr>          <int>
##  1 Alabama            1
##  2 Alaska             1
##  3 Arizona            1
##  4 Arkansas           2
##  5 California         1
##  6 Colorado           2
##  7 Connecticut        3
##  8 Delaware           1
##  9 Florida            1
## 10 Georgia            2
## # ... with 40 more rows
factoextra::fviz_dend(clust_us_arrest1, k = 3,
          cex = 0.5,                 # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
          color_labels_by_k = TRUE,  # color labels by groups
          ggtheme = theme_gray()     # Change theme
          )

(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

clust_us_arrest2 <- 
  hclust(dist(scale(USArrests)), method = "complete")
factoextra::fviz_dend(clust_us_arrest2, k = 3,
          cex = 0.5,                 # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
          color_labels_by_k = TRUE,  # color labels by groups
          ggtheme = theme_gray()     # Change theme
          )

(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

A: Visually, the clusters are now less "balanced", but that is not necessarily a bad thing. As a general rule, variables should be scaled before the dissimilarities are computed whenever they are measured on different scales or units, as in this case (Assault has a much larger variance than the other variables, so it dominates the unscaled distances).
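A quick way to see how much the assignments actually change is to cross-tabulate the two 3-cluster solutions (a small check using the fits from (b) and (c)):

table(
  unscaled = cutree(clust_us_arrest1, 3),
  scaled   = cutree(clust_us_arrest2, 3)
)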

(10) In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

(
  simulated_data <-
    matrix(c(rnorm(20 * 50, mean = 1),
             rnorm(20 * 50, mean = 5),
             rnorm(20 * 50, mean = 10)), ncol = 50, byrow = TRUE) %>%
    as_tibble(.name_repair = ~ as.character(1:50)) %>%
    mutate(class = rep(1:3, each = 20))
)
## # A tibble: 60 x 51
##       `1`    `2`     `3`   `4`    `5`     `6`    `7`    `8`     `9`
##     <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
##  1 -1.12   1.11  -0.0995 1.30  -0.429 -0.0795 -0.282  2.72   0.742 
##  2 -0.688 -1.18   1.08   1.25   0.746  0.484   2.35   1.18   1.92  
##  3  1.71   0.459  1.86   0.696  1.93   1.16    0.229  1.95   1.21  
##  4  0.723  0.697  0.792  0.531  2.68  -0.0447  1.50   1.31   1.30  
##  5 -0.165  0.620  1.58   2.65   1.52   1.55    2.40   1.64   1.11  
##  6  1.06   1.55   0.550  1.51  -0.670  1.23   -0.382  1.09   2.65  
##  7  1.71   0.589  1.09   1.22   1.00   2.01    1.68   1.24  -0.0661
##  8  1.14   0.949 -0.0373 1.08   2.11   1.93   -0.359 -0.284  1.89  
##  9  1.00   0.639  0.637  1.61   2.85   1.28    2.84   1.12   2.64  
## 10  1.37   0.277  0.799  1.07  -0.214  0.554  -0.899  2.20   1.10  
## # ... with 50 more rows, and 42 more variables: `10` <dbl>, `11` <dbl>,
## #   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>,
## #   `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>,
## #   `22` <dbl>, `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>,
## #   `27` <dbl>, `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>,
## #   `32` <dbl>, `33` <dbl>, `34` <dbl>, `35` <dbl>, `36` <dbl>,
## #   `37` <dbl>, `38` <dbl>, `39` <dbl>, `40` <dbl>, `41` <dbl>,
## #   `42` <dbl>, `43` <dbl>, `44` <dbl>, `45` <dbl>, `46` <dbl>,
## #   `47` <dbl>, `48` <dbl>, `49` <dbl>, `50` <dbl>, class <int>

(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.

pr_out_simulated <- 
  simulated_data %>% 
  select(-class) %>% 
  prcomp(scale. = TRUE)

pr_out_simulated$x %>% 
  as_tibble() %>% 
  dplyr::select(PC1, PC2) %>% 
  mutate(class = simulated_data$class) %>% 
  ggplot(aes(PC1, PC2, color = factor(class))) +
  geom_point()

(c) Perform K-means clustering of the observations with K=3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

set.seed(1989)
k3_sim <- 
  simulated_data %>%
  select(-class) %>% 
  kmeans(centers = 3, nstart = 100)

simulated_data %>% 
  mutate(cluster = factor(k3_sim$cluster)) %>% 
  select(class, cluster) %>% 
  table()
##      cluster
## class  1  2  3
##     1  0  0 20
##     2  0 20  0
##     3 20  0  0

The clusters obtained map perfectly to the true class labels (it’s not surprising since the classes are well separated).

(d) Perform K-means clustering with K = 2. Describe your results.

set.seed(1989)
k2_sim <- 
  simulated_data %>%
  select(-class) %>% 
  kmeans(centers = 2, nstart = 100)

simulated_data %>% 
  mutate(cluster = factor(k2_sim$cluster)) %>% 
  select(class, cluster) %>% 
  table()
##      cluster
## class  1  2
##     1  0 20
##     2  0 20
##     3 20  0

Two classes are merged into one cluster.

(e) Now perform K-means clustering with K = 4, and describe your results.

set.seed(1989)
k4_sim <- 
  simulated_data %>%
  select(-class) %>% 
  kmeans(centers = 4, nstart = 100)

simulated_data %>% 
  mutate(cluster = factor(k4_sim$cluster)) %>% 
  select(class, cluster) %>% 
  table()
##      cluster
## class  1  2  3  4
##     1  0  0 20  0
##     2  0 20  0  0
##     3  9  0  0 11

Class 3 is split into two clusters.

(f) Now perform K-means clustering with K =3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

set.seed(1989)
sim_pc1_2 <- 
  pr_out_simulated$x %>% 
  as_tibble() %>% 
  dplyr::select(PC1:PC2)

k3_pca <- 
  kmeans(sim_pc1_2, centers = 3, nstart = 1000)

simulated_data %>% 
  mutate(cluster = factor(k3_pca$cluster)) %>% 
  select(class, cluster) %>% 
  table()
##      cluster
## class  1  2  3
##     1  0  0 20
##     2  0 20  0
##     3 20  0  0

As expected, the clusters obtained have a perfect mapping to the original classes.

(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

set.seed(1989)
k3_sim_sca <- 
  simulated_data %>%
  scale(center = FALSE, scale = TRUE) %>% 
  as_tibble() %>% 
  dplyr::select(-class) %>% 
  kmeans(centers = 3, nstart = 100)

simulated_data %>% 
  mutate(cluster = factor(k3_sim_sca$cluster)) %>% 
  select(class, cluster) %>% 
  table()
##      cluster
## class  1  2  3
##     1  0  0 20
##     2  0 20  0
##     3 20  0  0

The results are the same. Again, the clusters obtained can be mapped without error onto the original classes.

(11) On the book website, www.StatLearning.com, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

(a) Load the data

data_ex11 <- read_csv(here("data", "Ch10Ex11.csv"),
                      col_names = FALSE)
## Parsed with column specification:
## cols(
##   .default = col_double()
## )

## See spec(...) for full column specifications.

(b) Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?

dd <- as.dist(1 - cor(data_ex11))

hclust_ex11 <- function(method_hclust) {
  corr_clust <- hclust(dd, method = method_hclust)
  
  factoextra::fviz_dend(
    corr_clust,
    k = 2,
    cex = 0.5,                 # label size
    k_colors = c("#2E9FDF", "#E7B800"),
    color_labels_by_k = TRUE,  # color labels by groups
    ggtheme = theme_gray()     # Change theme
  )
}
hclust_ex11("complete")

hclust_ex11("single")

hclust_ex11("average")

The results depend on the type of linkage chosen. Complete linkage gives the split that comes closest to 50/50 between the two groups.

(c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

A: The dendrogram above clusters the 40 samples, not the genes, so fusion heights do not directly tell us which genes differ across the two groups. A simple way to answer the question is to compute, for each gene, the difference between its mean expression in the healthy samples and its mean expression in the diseased samples, and then rank the genes by the absolute value of that difference (a per-gene two-sample t-statistic would be a more formal version of the same idea).
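A minimal sketch of this mean-difference approach (using the fact, stated in the exercise, that the first 20 columns of data_ex11 are the healthy samples and the last 20 the diseased ones):

gene_diffs <- tibble(
  gene          = seq_len(nrow(data_ex11)),
  mean_healthy  = rowMeans(data_ex11[, 1:20]),
  mean_diseased = rowMeans(data_ex11[, 21:40])
) %>% 
  mutate(abs_diff = abs(mean_diseased - mean_healthy)) %>% 
  arrange(desc(abs_diff))

# The genes at the top of this ranking are the ones that differ most
# between the two groups
head(gene_diffs, 10)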