---
title: "Case Study 2"
author: "Max Pagan"
date: "2023-12-06"
output: html_document
---
## DDSAnalytics Case Study
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
[Link to YouTube video of presentation](https://youtu.be/znxO6Q0A8sI)
### Introduction
DDSAnalytics, a company that provides data-driven insights to Fortune 100 companies, is attempting to gain an advantage over the competition by using data science, statistics, and machine learning techniques to predict attrition, that is, whether an employee will leave their position. This project aims to predict attrition accurately by first conducting a thorough analysis of the available information to determine which factors contribute the most to attrition. Next, the project aims to create a supervised machine learning model to predict attrition. Finally, the project aims to provide additional insights into other aspects of the data, such as job-role-specific trends, and another supervised machine learning model to predict an employee's monthly income with significant accuracy.
### Cleaning Data and EDA
```{r load}
# Loading the data
data <- read.csv("~/Downloads/CaseStudy2-data.csv")
library(ggplot2)
library(dplyr)
# Function to calculate the number of unique values in a column
count_unique <- function(column) {
  n_distinct(column)
}
# Apply the function to each column in the dataset
unique_counts <- sapply(data, count_unique)
# Print the results
print(unique_counts)
# Drop the three constant (single-value) columns
data <- data[, !colnames(data) %in% c("Over18", "EmployeeCount", "StandardHours")]
```
After loading the data, cleaning it and removing redundancies is critical. Three variables were immediately found to be unhelpful: Over18, EmployeeCount, and StandardHours. These values are identical for every entry in the dataset, so they can provide no insight, and the columns are therefore removed. Additionally, it appears that EmployeeNumber is just another form of ID, seeing as it has 870 unique entries. However, without deeper background knowledge we cannot be certain that it is entirely random, so it will remain in the dataset for now, and I will keep a close eye on it as I continue with the exploratory data analysis.
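As a quick sanity check on that observation (a small addition of my own, not part of the original cleaning step), the chunk below confirms that EmployeeNumber takes a distinct value for every row, which is what we would expect of an identifier rather than a true feature.
```{r employee-number-check}
# Sanity check: EmployeeNumber should have one unique value per row if it is an ID
n_distinct(data$EmployeeNumber) == nrow(data)
```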
The first step in accomplishing the goals laid out above is to conduct an Exploratory Data Analysis (EDA). The EDA aims to introduce us to the data through the use of univariate and bivariate analysis, statistical tests, and visualizations. Because there were many variables in the data, I created two separate RShiny apps: one to visualize relationships between variables in the form of scatterplots separated by JobRole, and another to conduct appropriate two-sample t-tests to determine if there were statistically significant differences in the mean values of certain variables, like job satisfaction or monthly income, between the two groups of different Attrition outcomes ('Yes' meaning the employees left their position, 'No' meaning they did not leave their position). These two RShiny apps are linked below:
### RShiny App Exploration
[RShiny for scatterplots based on JobRole](https://amaxpagan.shinyapps.io/rshiny_cs2_eda/)
[RShiny for T-Tests based on Attrition](https://amaxpagan.shinyapps.io/rshiny_case_study_2/)
One particularly insightful scatterplot in this EDA plotted MonthlyIncome (y-axis) against TotalWorkingYears (x-axis); for all job roles, it showed a positive linear relationship. Another variable that demonstrated a largely positive linear relationship with monthly income was JobLevel. By setting the x-axis to ID, one can change the y-axis to visualize additional patterns between job roles. For example, setting the y-axis to StockOptionLevel indicates that Sales Representatives, followed by Human Resources employees and Managers, appear to have the lowest proportion of employees with stock option levels higher than 0 or 1. Additionally, by coloring points by Attrition, we can see that Sales Representatives appear to have a higher proportion of employees who left their position. We will go further into this statistic later on.
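For readers who prefer not to open the app, the chunk below is a minimal ggplot2 sketch (my own re-creation, not the RShiny code itself) of the first view described above: MonthlyIncome against TotalWorkingYears, colored by Attrition and faceted by JobRole.
```{r income-scatter-sketch}
# Re-creating one of the RShiny scatterplot views described above
ggplot(data, aes(x = TotalWorkingYears, y = MonthlyIncome, color = Attrition)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ JobRole) +
  labs(title = "Monthly Income vs. Total Working Years by Job Role",
       x = "Total Working Years", y = "Monthly Income")
```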
The second RShiny app provides two-sample t-tests and their corresponding p-values. Using a 0.05 level of significance, we can determine that many variables have mean values that are very likely different between employees who left and employees who didn't. For example, choosing MonthlyIncome for the y-axis gives an incredibly low p-value of 4.422*10^-6, which gives our findings tremendous statistical significance, along with a confidence interval of (-2760.22, -1114.21). We can interpret this by saying, with 95% confidence, that the mean monthly income of an employee who left is between 1,114 and 2,760 dollars lower than that of an employee who didn't leave. This has strong implications for the ability of monthly income to predict employee attrition, seeing as there is overwhelming evidence for a concrete difference between the incomes of employees who left and those who stayed. Other variables that produced statistically significant p-values when comparing employees who left with employees who stayed were: Age, DistanceFromHome, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager. By conducting these t-tests, and confirming that these variables have statistically significant differences in their means between the two groups of employees, we are able to determine which variables can actually help us predict attrition.
(Note: The RShiny app also reports the ratio of the two groups' standard deviations for the variable in question, and uses that ratio to decide whether to conduct a Student's t-test (if the ratio is close to 1) or a Welch's t-test (if the ratio is greater than 2 or less than 0.5). Also, the data for each variable are either approximately normally distributed or come from groups large enough for the central limit theorem to apply. Together, these facts mean that for every t-test conducted by the RShiny app, the necessary assumptions are reasonably met.)
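To make that decision rule concrete, here is a minimal sketch (my own illustration, not the RShiny app's code) of the test it applies, using MonthlyIncome as the example: compute the ratio of the two groups' standard deviations, then run a pooled-variance Student's t-test when the spreads are similar and Welch's t-test otherwise.
```{r ttest-sketch}
# Sketch of the t-test logic described above, applied to MonthlyIncome
income_yes <- data$MonthlyIncome[data$Attrition == "Yes"]
income_no  <- data$MonthlyIncome[data$Attrition == "No"]

# Ratio of standard deviations decides between Student's and Welch's t-test
sd_ratio  <- sd(income_yes) / sd(income_no)
equal_var <- sd_ratio > 0.5 & sd_ratio < 2

t.test(income_yes, income_no, var.equal = equal_var, conf.level = 0.95)
```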
### EDA Continued
After exploring the RShiny apps, a few variables remain that cannot be analyzed with a t-test because they are categorical, but they might still affect our ability to predict attrition. These variables are JobRole, MaritalStatus, OverTime, Gender, EducationField, Department, and BusinessTravel. To assess their usefulness for predicting attrition, we will look at the percentage of Attrition = Yes within each level of each variable. The code to do so is below:
```{r percentages}
# Categorical variables to examine
categorical_vars <- c("Department", "JobRole", "Gender", "BusinessTravel",
                      "EducationField", "MaritalStatus", "OverTime")
# For each variable, compute the percentage of Attrition = "Yes" within each level
for (var in categorical_vars) {
  percentage_yes <- data %>%
    group_by(across(all_of(c(var, "Attrition")))) %>%
    summarize(count = n(), .groups = "drop_last") %>%
    mutate(percentage = count / sum(count) * 100) %>%
    filter(Attrition == "Yes") %>%
    select(all_of(var), percentage)
  # Print the result
  print(percentage_yes)
}
```
This information is highly insightful, as it tells us which of these categorical variables will help our prediction of attrition. The tables above show the percentage of 'Attrition = Yes' within each level of each variable. One insight based on job role is that Sales Representatives are the most likely of all roles to leave their position, with a staggering 45% of employees in this role having an Attrition value of Yes. Another variable with a strong relationship to attrition is OverTime: 31% of employees who worked overtime left their position, compared to only 9% of employees who didn't. These percentages will prove useful in building a model to predict attrition.
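As a quick visual companion to the JobRole table (an illustration added here, not part of the original analysis), the chunk below plots the attrition rate for each role.
```{r attrition-rate-by-role}
# Attrition rate (% of Attrition == "Yes") for each job role, as a bar chart
attrition_by_role <- data %>%
  group_by(JobRole) %>%
  summarize(attrition_rate = mean(Attrition == "Yes") * 100)

ggplot(attrition_by_role, aes(x = reorder(JobRole, attrition_rate), y = attrition_rate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Job Role", y = "Attrition rate (%)",
       title = "Percentage of Employees Who Left, by Job Role")
```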
### Model attempt 1: KNN
Below is the first attempt at creating a supervised learning model to predict attrition. It is a k-nearest neighbor classifier model that only uses some of the most impactful continuous variables, and none of the categorical variables we just discussed. As the results will show, it was not particularly successful. It had a relatively high accuracy and sensitivity (true positive rate), but an incredibly low specificity (true negative rate), which seemed to persist no matter how many continuous variables I included from the dataset. The second model will make use of continuous and categorical variables to improve specificity.
```{r knn}
selected_columns<- c("JobInvolvement","TotalWorkingYears","JobLevel", "Attrition")
# Ensure that "Attrition" is at the back of the list
selected_columns <- c(selected_columns[setdiff(seq_along(selected_columns), which(selected_columns == "Attrition"))], "Attrition")
length <- length(selected_columns)
val <- length - 1
# Load necessary packages
#if (!require(dplyr)) install.packages("dplyr")
library(dplyr)
library(caret)
library(e1071)
library(class)
# Number of iterations
num_iterations <- 1000
# Initialize vectors to store results
accuracies <- numeric(num_iterations)
sensitivities <- numeric(num_iterations)
specificities <- numeric(num_iterations)
# Run the KNN model with different seeds
for (i in 1:num_iterations) {
  set.seed(i)
  # Create a subset of data with selected columns
  subset_data <- data[selected_columns]
  # Convert 'Attrition' to a binary factor variable (row 1 = "Yes", row 2 = "No" in the confusion matrix)
  subset_data$Attrition <- factor(subset_data$Attrition, levels = c("Yes", "No"))
  # Create a training set (80%) and testing set (20%)
  splitIndex <- sample(1:nrow(subset_data), 0.8 * nrow(subset_data))
  train_data <- subset_data[splitIndex, ]
  test_data <- subset_data[-splitIndex, ]
  # Create the KNN model
  k_value <- 5 # You can adjust the value of 'k'
  knn_model <- knn(train = train_data[, -4], # Excluding the target variable 'Attrition'
                   test = test_data[, -4],   # Excluding the target variable 'Attrition'
                   cl = train_data$Attrition,
                   k = k_value)
  # Confusion matrix (rows = actual, columns = predicted)
  conf_matrix <- table(Actual = test_data$Attrition, Predicted = knn_model)
  # Calculate accuracy
  accuracies[i] <- sum(diag(conf_matrix)) / sum(conf_matrix)
  # Calculate sensitivity (True Positive Rate), here treating the majority class "No" as positive
  sensitivities[i] <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
  # Calculate specificity (True Negative Rate): correctly identified "Yes" (attrition) cases
  specificities[i] <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
}
# Create a histogram of accuracies
hist(accuracies, main = "Histogram of Accuracies", xlab = "Accuracy", col = "lightblue", border = "black")
# Calculate mean sensitivity, mean specificity, and mean accuracy
mean_sensitivity <- mean(sensitivities)
mean_specificity <- mean(specificities)
mean_accuracy <- mean(accuracies)
# Add a vertical line at the mean accuracy
abline(v = mean_accuracy, col = "red", lty = 2)
# Add a label for the vertical line
text(mean_accuracy, 50, labels = paste("Mean Accuracy:", round(mean_accuracy, 4)), col = "red")
# Print sensitivity and specificity
print(paste("Mean Sensitivity (True Positive Rate):", mean_sensitivity))
print(paste("Mean Specificity (True Negative Rate):", mean_specificity))
```
### Model attempt 2: Naive-Bayes
The second model is a Naive-Bayes classifier, which uses many of both the continuous and the categorical variables discussed above in the EDA. This model has a much higher specificity, and passes the required threshold of sensitivity and specificity both being above 0.60. We will use this model to predict the unknown values of attrition in the extra dataset below, and output those predictions to a csv file.
```{r naive bayes}
library(dplyr)
library(caret)
#selected_columns <- setdiff(all_columns, columns_to_drop)
selected_columns<- c("JobRole","OverTime", "MaritalStatus","BusinessTravel","Age","Gender","JobInvolvement","MonthlyIncome","TotalWorkingYears", "JobLevel", "YearsInCurrentRole","YearsWithCurrManager","YearsAtCompany", "StockOptionLevel","WorkLifeBalance","JobSatisfaction","EnvironmentSatisfaction","DistanceFromHome","Attrition")
# Ensure that "Attrition" is at the back of the list
selected_columns <- c(selected_columns[setdiff(seq_along(selected_columns), which(selected_columns == "Attrition"))], "Attrition")
library(e1071)
# Initialize vectors to store results
num_iterations <- 1000
accuracies <- numeric(num_iterations)
sensitivities <- numeric(num_iterations)
specificities <- numeric(num_iterations)
# Create a subset of data with selected columns
subset_data <- data[selected_columns]
# Convert 'Attrition' to a binary factor variable
subset_data$Attrition <- factor(subset_data$Attrition, levels = c("Yes", "No"))
# Run the Naive Bayes model with different seeds
for (i in 1:num_iterations) {
  set.seed(i)
  # Create a subset of data with selected columns
  subset_data <- data[selected_columns]
  # Convert 'Attrition' to a binary factor variable
  subset_data$Attrition <- factor(subset_data$Attrition, levels = c("Yes", "No"))
  # Split the data into training and testing sets
  splitIndex <- sample(1:nrow(subset_data), 0.8 * nrow(subset_data))
  train_data <- subset_data[splitIndex, ]
  test_data <- subset_data[-splitIndex, ]
  # Create the Naive Bayes model
  nb_model <- naiveBayes(Attrition ~ ., data = train_data)
  # Make predictions on the test set
  predictions <- predict(nb_model, newdata = test_data)
  # Confusion matrix (rows = actual, columns = predicted)
  conf_matrix <- table(Actual = test_data$Attrition, Predicted = predictions)
  # Calculate accuracy
  accuracies[i] <- sum(diag(conf_matrix)) / sum(conf_matrix)
  # Calculate sensitivity (True Positive Rate), here treating the majority class "No" as positive
  sensitivities[i] <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
  # Calculate specificity (True Negative Rate): correctly identified "Yes" (attrition) cases
  specificities[i] <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
}
# Calculate mean sensitivity, mean specificity, and mean accuracy
mean_sensitivity <- mean(sensitivities)
mean_specificity <- mean(specificities)
mean_accuracy <- mean(accuracies)
# Print mean sensitivity, mean specificity, and mean accuracy
print(paste("Mean Sensitivity (True Positive Rate):", round(mean_sensitivity, 4)))
print(paste("Mean Specificity (True Negative Rate):", round(mean_specificity, 4)))
print(paste("Mean Accuracy:", round(mean_accuracy, 4)))
```
The below chunk will use the model we just created to predict attrition for the 'No Attrition' dataset and output those predictions to a CSV file.
```{r predicting attrition}
set.seed(42)
# Split the data into training and testing sets
splitIndex <- sample(1:nrow(subset_data), 0.8 * nrow(subset_data))
train_data <- subset_data[splitIndex, ]
test_data <- subset_data[-splitIndex, ]
# Create the Naive Bayes model
nb_model <- naiveBayes(Attrition ~ ., data = train_data)
predictions <- predict(nb_model, newdata = test_data)
conf_matrix <- table(Actual = test_data$Attrition, Predicted = predictions)
accuracy_nb <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity_nb <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_nb <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
print(paste("Sensitivity of NB (seed = 42):", sensitivity_nb))
print(paste("Specificity of NB (seed = 42):", specificity_nb))
print(paste("Accuracy of NB (seed = 42):", accuracy_nb))
#loading in the dataframe with no attrition
no_attrition <- read.csv("~/Downloads/CaseStudy2CompSet No Attrition (1).csv")
#the same selected_columns as before, but this time without attrition
selected_columns<- c("JobRole","OverTime", "MaritalStatus","BusinessTravel","Age","Gender","JobInvolvement","MonthlyIncome","TotalWorkingYears", "JobLevel", "YearsInCurrentRole","YearsWithCurrManager","YearsAtCompany", "StockOptionLevel","WorkLifeBalance","JobSatisfaction","EnvironmentSatisfaction","DistanceFromHome")
# Create a subset of 'no_attrition' with selected columns
subset_no_attrition <- no_attrition[selected_columns]
# Make predictions using the Naive Bayes model
predictions_of_attrition <- predict(nb_model, newdata = subset_no_attrition)
# Creating a data frame with "ID" and predicted values
output_data_attrition <- data.frame(ID = no_attrition$ID, Attrition = predictions_of_attrition)
# Write the data frame to a CSV file
write.csv(output_data_attrition, file = "predicted_attrition.csv", row.names = FALSE)
# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)
```
### Linear Regression for Income
The chunk below will use insights from the EDA to create a linear regression model that predicts monthly income. Based on the information gathered above, the variables I will use in this multiple linear regression are TotalWorkingYears and JobLevel, as they appeared to have the strongest visual linear relationship with income. One way to potentially improve this model would be to also include JobRole as an interaction term (see the sketch after the chunk below). However, the current model has an RMSE well below the 3,000 dollar threshold, so we will accept it for now. The model will output its prediction results to a CSV.
```{r predicting monthly income}
no_income <- read.csv("~/Downloads/CaseStudy2CompSet No Salary.csv")
#setting seed again
set.seed(42)
#use 70% of dataset as training set and 30% as test set
train_data <- data %>% sample_frac(0.70)
test_data <- anti_join(data, train_data, by = 'ID')
# Create a Linear Regression model
model <- lm(MonthlyIncome ~ TotalWorkingYears + JobLevel, data = train_data)
# Make predictions on the held-out test set for calculating RMSE
predictions <- predict(model, newdata = test_data)
# Calculate residuals on the test set
residuals <- test_data$MonthlyIncome - predictions
# Calculate RMSE
rmse <- sqrt(mean(residuals^2))
# Print RMSE
print(paste("Root Mean Square Error (RMSE):", round(rmse, 4)))
predictions_of_income <- predict(model, newdata = no_income)
# Creating a data frame with "ID" and predicted values
output_data_income <- data.frame(ID = no_income$ID, MonthlyIncome = predictions_of_income)
# Write the data frame to a CSV file
write.csv(output_data_income, file = "predicted_income.csv", row.names = FALSE)
```
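As a follow-up to the possible improvement mentioned above, here is a rough sketch (not part of the submitted model) of what adding a JobRole interaction could look like; it lets the TotalWorkingYears slope vary by role and reports the corresponding test-set RMSE for comparison.
```{r income-interaction-sketch}
# Alternative model with a TotalWorkingYears-by-JobRole interaction term
model_interaction <- lm(MonthlyIncome ~ TotalWorkingYears * JobRole + JobLevel,
                        data = train_data)
# Compare test-set RMSE of the interaction model against the simpler model
pred_interaction <- predict(model_interaction, newdata = test_data)
rmse_interaction <- sqrt(mean((test_data$MonthlyIncome - pred_interaction)^2))
print(paste("Interaction model RMSE:", round(rmse_interaction, 4)))
```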
### The Most Impactful Factors on Attrition
To determine the three variables with the greatest impact on attrition, I will create Naive Bayes models and KNN models using one predictor variable at a time from the list in the existing model, and record an F1 score for each. The F1 scores will be ranked, and the variables with the three best scores will be selected as the three most important predictors of attrition.
```{r F1 calculation}
library(class)
# List of variables
candidate_variables <- c("JobRole","OverTime", "MaritalStatus","BusinessTravel","Age","Gender","JobInvolvement","MonthlyIncome","TotalWorkingYears", "JobLevel", "YearsInCurrentRole","YearsWithCurrManager","YearsAtCompany", "StockOptionLevel","WorkLifeBalance","JobSatisfaction","EnvironmentSatisfaction","DistanceFromHome")
# Initializing dataframe to store results
variable_performance_1 <- data.frame(Variable = character(), F1_Score = numeric(), stringsAsFactors = FALSE)
variable_performance_2 <- data.frame(Variable = character(), F1_Score = numeric(), stringsAsFactors = FALSE)
# Loop through each candidate variable
for (variable in candidate_variables) {
  # Create a dataset with the current variable
  current_data <- data[, c("Attrition", variable), drop = FALSE]
  # Splitting the dataset into training and testing sets
  set.seed(42) # Setting seed for reproducibility
  split_index <- createDataPartition(current_data$Attrition, p = 0.7, list = FALSE)
  train_data <- current_data[split_index, ]
  test_data <- current_data[-split_index, ]
  test_data$Attrition <- as.factor(test_data$Attrition)
  # Train the Naive Bayes model
  nb_model_2 <- naiveBayes(Attrition ~ ., data = train_data)
  # Make predictions on the test set
  predictions <- predict(nb_model_2, newdata = test_data)
  # Evaluate model performance: F1 score for the "Yes" (attrition) class
  confusion_matrix <- table(Actual = test_data$Attrition, Predicted = predictions)
  f1_score <- confusion_matrix[2, 2] / (confusion_matrix[2, 2] + 0.5 * (confusion_matrix[2, 1] + confusion_matrix[1, 2]))
  # Append results to the data frame, keeping the column names
  variable_performance_1 <- rbind(variable_performance_1,
                                  data.frame(Variable = variable, F1_Score = f1_score, stringsAsFactors = FALSE))
}
# Print the results
print(variable_performance_1)
#no categorical vars this time!
candidate_variables_2 <- c("Age","JobInvolvement","MonthlyIncome","TotalWorkingYears", "JobLevel", "YearsInCurrentRole","YearsWithCurrManager","YearsAtCompany", "StockOptionLevel","WorkLifeBalance","JobSatisfaction","EnvironmentSatisfaction","DistanceFromHome")
# Loop through each candidate variable
for (variable in candidate_variables_2) {
  # Create a dataset with the current variable
  current_data <- data[, c(variable, "Attrition"), drop = FALSE]
  # Splitting the dataset into training and testing sets
  set.seed(42) # Setting seed for reproducibility
  split_index <- createDataPartition(current_data$Attrition, p = 0.7, list = FALSE)
  train_data <- current_data[split_index, ]
  test_data <- current_data[-split_index, ]
  # Train the KNN model (column 1 is the single predictor; column 2 is 'Attrition')
  knn_model_2 <- knn(train = train_data[, -2, drop = FALSE], # Excluding the target variable 'Attrition'
                     test = test_data[, -2, drop = FALSE],   # Excluding the target variable 'Attrition'
                     cl = train_data$Attrition,
                     k = 5)
  # Predictions on the test set come directly from the KNN classifier
  predictions <- knn_model_2
  # Evaluate model performance: F1 score for the "Yes" (attrition) class
  confusion_matrix <- table(Actual = test_data$Attrition, Predicted = predictions)
  f1_score <- confusion_matrix[2, 2] / (confusion_matrix[2, 2] + 0.5 * (confusion_matrix[2, 1] + confusion_matrix[1, 2]))
  # Append results to the data frame, keeping the column names
  variable_performance_2 <- rbind(variable_performance_2,
                                  data.frame(Variable = variable, F1_Score = f1_score, stringsAsFactors = FALSE))
}
# Print the results
print(variable_performance_2)
```
Based on the data, it is clear that JobRole, JobInvolvement, and Age, in that order, are the only variables that produce an F1 score greater than 0 on their own. This means they are the only variables with inherent predictive power, independent of other variables. Other variables become predictive when used in combination; however, given the inherent predictive power of these three, DDSAnalytics would benefit greatly from monitoring them with priority.
### Conclusion
After conducting a thorough analysis of the data, and creating two models, one to predict attrition and another to predict monthly income, we can say we have a strong understanding of the many different factors that go into talent management. Using the tools available to us with statistics and data science, we can make the necessary predictions and connections to further our future goals, like improving employee retention.