---
title: "MLR on Time-Series Financial Dataset"
author: "Leah Nguyen"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: yes
    smart: no
    number_sections: true
    toc: true
    toc_float: true
    code_folding: hide
---
# Introduction
```{r setup, echo=FALSE, message = FALSE,warning = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
This article aims to analyse the monthly transaction data set and provide insights that lead to a better understanding of customer transaction patterns. It also offers a study of the linear regression model, an important concept in the field of machine learning, and discusses how this model can assist in the decision-making process of identifying trends in bank transactions between 2013 and 2016.
To capture this information well, the CRISP-DM management model is adopted to provide a structured planning approach to a data mining project with 6 high-level phases. In particular, these phases help companies comprehend the data mining process and serve as a road map for planning and executing a data mining project (Medeiros, 2021). This study explores each of the six phases and their associated tasks, in the following order:
* [Business understanding](#bu)\
* [Data understanding](#du)\
* [Data preparation](#dp)\
* [Modeling](#m)\
* [Evaluation](#e)\
* [Deployment](#d)
```{r residuals, echo=FALSE, out.width="40%",fig.cap="Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000)",fig.align="center"}
knitr::include_graphics("img/CRISP-DM.png")
```
# Business Understanding {#bu}
**Business Understanding** is the first step in the CRISP-DM methodology. In this stage, the main task is to understand the purpose of the analysis and to provide a clear and crisp definition of the problem with respect to the *Business objectives* and *Data mining objectives*.
In our case study, the business question, paraphrased from the sales manager's request, is: “What is driving the trends and increasing total monthly revenue?”. In turn, we pursue the data mining objective by applying data visualization tools to identify any underlying patterns in the dataset.
# Data Understanding {#du}
Following that, the **Data Understanding** phase is where we focus on understanding the data collected, to support the Business Understanding phase and resolve the business challenge (Wijaya, 2021). Data preprocessing and data visualization techniques play an essential role here. Thus, I'm going to divide this section into 2 main components:
* [Exploratory Data Analysis (Part 1) - The Dataset](#eddf)\
    * [**Stage 1:** Basic Exploration](#be)\
    * [**Stage 2:** Univariate, Bivariate & Multivariate Analysis](#ubma)\
* [Exploratory Data Analysis (Part 2) - The Business Insights](#eda2)
The data was imported into the software package R to construct visualizations representing the findings made during the analysis.
## Exploratory Data Analysis (Part 1) - The Dataset {#eddf}
### Stage 1: Basic Exploration {#be}
First, I will load the libraries necessary for reading and manipulating our data and then constructing the graphs.
```{r, message = FALSE, warning = FALSE}
##----------------------------------------------------------------
## Load the Libraries --
##----------------------------------------------------------------
library(here) # build file paths
library(DataExplorer) # EDA visualizations
library(tidyverse) # data wrangling
library(kableExtra) # write table
library(bannerCommenter) # create comment banner
library(ggplot2) # data visualization
library(forecast) # times-series forecasting
library(ggradar) # plot seasonal trend
library(sqldf) # using SQL
library(dplyr) # data processing
library(ggpubr) # combine plots into single page
theme_set(theme_pubr())
library(reshape2) # transpose table
library(fmsb) # create radar chart
library(modelr) # computing regression model performance metrics
library(caret) # streamline the model training process
library(xts) # convert df to ts object
```
Once the libraries are loaded, we explore the data with the goal of understanding its dimensions, data types, and distribution of values. In this assignment, a time-series data set of financial transactions is used as the major source of data. The attribute information is presented in [Appendix 1](#appendix1).
```{r message=FALSE, center=TRUE}
##----------------------------------------------------------------
## Dataset Overview --
##----------------------------------------------------------------
# Load Data File
df <- read_csv(here("dataset/transactions.csv"))
# Quick overview of the data set then transpose for better table display.
# The information including:
# * index and column data types
# * non-null values
# * memory usage
df_overview <- introduce(df) %>% t()
# turn the table into the Markdown format
df_overview %>% kbl() %>% kable_styling(bootstrap_options = "striped", full_width = F)
# rows 94247
# columns 5
# discrete_columns 4
# continuous_columns 1
# all_missing_columns 0
# total_missing_values 0
# complete_rows 94247
# total_observations 471235
# memory_usage 4201744
```
As is apparent from the table, the data contains 94,000+ rows of bank transactions across 5 columns, or 470,000+ individual values in total. The 5 features in this data set, `date`, `customer_id`, `industry`, `location`, and `monthly_amount`, clearly indicate the total transaction amounts for customers each month, spanning 2013 to 2016 over a range of industries and locations. The column names are self-explanatory, so no renaming is needed.
```{r}
# inspect columns data type
sapply(df, class)
# date customer_id industry location monthly_amount
# "character" "numeric" "numeric" "numeric" "numeric"
```
It is also worth noting that the features come in multiple formats, including both numerical and time-series data. However, the output shows that the `date` column has the wrong data type and will need to be converted to a date format later.
Additionally, I investigate further by looking at the response field. Recalling the business question, we expect to use the `monthly_amount` column as the target field, since our goal is to predict the monthly transaction value for the following month. Since the observations in this column are continuous, I can conclude that our problem is a supervised regression problem. Knowing this is essential for selecting the right machine learning model in the later stages of this report.
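As a quick sanity check on the target, a minimal sketch (using the `df` loaded above) confirms it is continuous:
```{r eval=FALSE}
# summary statistics and number of distinct values of the target:
# a large number of distinct numeric values confirms a continuous response
summary(df$monthly_amount)
n_distinct(df$monthly_amount)
```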
```{r fig.align="center", fig.cap="Missing values plot"}
# Check for missing value in each column by plotting
plot_missing(df)
```
The plot shows that there are no missing values in any field. Nevertheless, some data sets encode missing observations in categorical/character columns as a new category such as `"NA"`, `"NULL"`, etc., so there is a chance that we could miss such observations, which would distort the real data distribution. Consequently, the categorical columns need a further check for these disguised missing values in order to confirm this observation.
```{r}
# convert character values of columns to upper case for better missing value inspection
missing_df <- data.frame(
lapply(df, function(v) {
if (is.character(v)) return(toupper(v))
else return(v)
}))
# check if there are missing values assigned under a new category
##::::::::::::::::::
## 1. date column
##::::::::::::::::::
sprintf(paste0("Is there any missing value observation categories in date column (T/F)?: ",
missing_df[1] %in% c("NA","N/A","NULL","")))
# "FALSE"
##::::::::::::::::::::::::::
## 2. customer_id column
##::::::::::::::::::::::::::
sprintf(paste0("Is there any missing value observation categories in customer_id column (T/F)?: ",
missing_df[1] %in% c("NA","N/A","NULL","")))
# "FALSE"
# 3. Check for any transaction with zero values
sprintf(paste0("How many rows contained 0 value in monthly transaction amount?: ",
sum(df$monthly_amount==0)))
# "FALSE"
```
The code output above shows that no new missing-value category exists in the categorical columns. Thus, we can confirm our hypothesis that there are no missing values in either the numerical or the categorical columns of this data set. Furthermore, it also indicates that there is 1 row containing a zero value in the `monthly_amount` column that will need to be resolved.
### Stage 2: Univariate, Bivariate & Multivariate Analysis {#ubma}
To evaluate the impact of each feature on the phenomenon, univariate, bivariate, and multivariate analyses are performed on all features.
#### Univariate: Check the distribution of each field
Univariate analysis is the study of the distribution of a single variable. As Sharma (2020) notes, the distributions of the independent variables and the target variable are a crucial component in building linear models. Therefore, understanding the skewness of the data helps us create better models.
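For instance, the skew of the target can be quantified numerically; here is a minimal sketch (the `e1071` package is not loaded in the setup above, so this assumes it is installed):
```{r eval=FALSE}
# skewness > 0 indicates a right-skewed distribution, which often
# suggests a log transform before fitting linear models
library(e1071)
skewness(df$monthly_amount)
```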
Firstly, I will plot histograms to check which industry and location groups account for the largest share of transactions.
```{r fig.align="center", fig.cap="Data distribution", message=FALSE, warning=FALSE}
##---------------------------------------------------------------
## Data Distribution --
##---------------------------------------------------------------
##:::::::::::::::::::::::::::::::
## 1. Data Skewness/Imbalance
##:::::::::::::::::::::::::::::::
# combine 2 plots into 1 plot
par(mfrow=c(1,2))
# plot data distribution
# 1. MONTHLY_AMOUNT group by INDUSTRY
hist(df$industry,
main = "Trans by Industry",
xlab="Industry",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
## 2. MONTHLY_AMOUNT group by LOCATION
hist(df$location,
main = "Trans by Location",
xlab="Location",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
```
As can be seen from the plot, industries 1 and 2 make the top contributions in the `industry` column, while locations 1 and 2 account for the highest frequencies in the `location` column. These results imply that the model may perform better at predicting next month's total transaction amount for locations 1 and 2 and/or industries 1 and 2.
```{r fig.align="center", fig.cap="Boxplot to check for outliers when plotting Monthly Amount against Location & Industry", message=FALSE, warning=FALSE}
##---------------------------------------------------------------
## Data Distribution --
##---------------------------------------------------------------
##:::::::::::::::::::::::::::::::
## 2. Data Outliers
##:::::::::::::::::::::::::::::::
# combine 2 plots into 1 plot
par(mfrow=c(1,2))
# plot boxplot to check for outliers
# 1. industry
boxplot(monthly_amount~industry,
data=df,
main='Transaction vs. Industry',
xlab='Industry',
ylab='Transaction Amount',
horizontal=TRUE)
# 2. location
boxplot(monthly_amount~location,
data=df,
main='Transaction vs. Location',
xlab='Location',
ylab='Transaction Amount',
horizontal=TRUE)
```
Next, the boxplots of sales transactions by industry and location show high variance with a considerable number of outliers. The median spending per customer is highest for industries 6 and 9, at over 500,000, while the lowest medians belong to industries 1 and 10, at under 200,000. In terms of locations, most locations have a median spending of less than 500,000.
#### Bivariate Analysis: Relationship between each column and target field & Collinearity
Now that we know the distribution of our `transaction` dataset, it is essential to also check the correlation and collinearity assumptions between fields in the bivariate analysis. Some basic data-type transformations are performed beforehand for the sake of plotting the visualizations.
```{r fig.align="center", fig.cap="Correlation Plot", message=FALSE, warning=FALSE}
##---------------------------------------------------------------
## Data Transformation --
##---------------------------------------------------------------
# convert date column into the date format
df$date <- as.Date(df$date,"%d/%m/%Y")
# convert customer_id column into character format
df$customer_id <- as.character(df$customer_id)
# convert location column into character format
df$location <- as.character(df$location)
# convert industry column into character format
df$industry <- as.character(df$industry)
# filter out value with 0 transaction amount
df<-filter(df, monthly_amount!=0)
########################################################
# DATA VISUALIZATION: CORRELATION PLOT
########################################################
plot_correlation(df)
```
This information is essential for gaining a better understanding of the transaction data set, and it provides useful guidance for transforming the data in the later stages.
## Exploratory Data Analysis (Part 2) - The Business Insights {#eda2}
```{r fig.align="center", fig.cap="Transaction amount vs. transaction number trend over time", message=FALSE, warning=FALSE}
##----------------------------------------------------------------
## Data Visualization --
##----------------------------------------------------------------
##:::::::::::::::::::::::::::::::::::::::::
## 1. Transaction Amount & # Transaction
##:::::::::::::::::::::::::::::::::::::::::
# create new df contain total transaction amount
transaction_amount <- sqldf(
"SELECT
date,
'Transaction Amount' as type, -- specify value type
SUM(monthly_amount) AS value -- sum total transaction amount
FROM df
GROUP BY date -- filter by date
ORDER BY date
"
)
# create new df contain number of transaction
transaction_count <- sqldf(
"SELECT
date,
'Transaction Count' as type, -- specify value type
COUNT(*) as value -- count total number of transactions
FROM df
GROUP BY date -- filter by date
ORDER BY date
"
)
# merge 2 df into 1 new TRANSACTION df vertically
transaction_df <- rbind(transaction_amount,
transaction_count)
# plot transaction amount over time
monthly_amount_plot <- transaction_df %>%
# filter by transaction amount only
filter(type=="Transaction Amount") %>%
# assign x and y-axis from the dataset
ggplot(aes(x = date, y = value/1e6)) +
# add the line graph, color, and the size
geom_line(color = "indianred", size=1.6) +
# the relationship graph between x and y
geom_smooth(formula = y~x,
method = 'loess') +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Monthly Transaction Amount",
subtitle = "2013 to 2016") +
theme_minimal()
# plot total transaction number over time
monthly_count_plot <- transaction_df %>%
# filter by total transaction number count only
filter(type=="Transaction Count") %>%
# assign x and y-axis from the dataset
ggplot(aes(x = date, y = value)) +
# add the line graph, color, and the size
geom_line(color = "indianred", size=1.6) +
# the relationship graph between x and y
geom_smooth(formula = y~x,
method = 'loess') +
labs(x = "Year",
y = "Total transaction number",
title = "Total Transaction Number",
subtitle = "2013 to 2016") +
theme_minimal()
## combine individual plots into a single page
ggarrange(monthly_amount_plot,
monthly_count_plot,
ncol = 2, nrow = 1)
```
The number of transactions and the total amount of sales rose sharply over the years from 2013 to 2016. A seasonal pattern can be seen in the total amount of sales, while the upward trend in the number of transactions is quite smooth.
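One way to make the seasonal component explicit is a classical decomposition; here is a minimal sketch, assuming the `transaction_amount` data frame built in the chunk above and a January 2013 start:
```{r eval=FALSE}
# build a monthly time series of total transaction amounts and split it
# into trend, seasonal, and remainder components with an STL decomposition
amount_ts <- ts(transaction_amount$value, start = c(2013, 1), frequency = 12)
plot(stl(amount_ts, s.window = "periodic"))
```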
We can see that there is a seasonal pattern, although the trend is not yet clear. To investigate further, we make a yearly polar plot:
```{r fig.align="center", fig.cap="Seasonal Trend Over the Years", message=FALSE, warning=FALSE}
##----------------------------------------------------------------
## Data Visualization --
##----------------------------------------------------------------
##::::::::::::::::::::::::::::::::::::::
## 2. Time-series and Seasonal Trend
##::::::::::::::::::::::::::::::::::::::
## Create new MONTHLY_TREND df to plot seasonal transaction trend
new_ts_df <- sqldf(
"SELECT
strftime('%m', --extract month from date column
date * 3600 * 24,
'unixepoch') as month,
strftime('%Y', --extract year from date column
date * 3600 * 24,
'unixepoch') as year,
SUM(monthly_amount) AS transaction_amount
FROM df
GROUP BY
month,
year
ORDER BY
month,
year
"
)
## transpose the dataset to prepare for the data visualization
monthly_trend <- recast(new_ts_df,
year + variable ~ month,
id.var = c("month", "year"))
monthly_trend <- data.frame(monthly_trend[,-1], # drop the year column from the data body
row.names = monthly_trend[,1]) # use the year column as row names
monthly_trend <- subset(monthly_trend, select = -variable) # remove the unnecessary column
## create new vector specify month column names
colnames(monthly_trend) <- c('January', 'February', 'March', 'April',
'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December')
## To use the fmsb package, I have to add 2 lines to the dataframe:
## the max and min of each variable to show on the plot!
data <- rbind(rep(1400e6,12) , rep(0,12) , monthly_trend)
## create color vector
colr_1 <- rgb(0.2,0.5,0.5,0.9)
colr_2 <- rgb(0.8,0.2,0.5,0.9)
colr_3 <- rgb(0.7,0.5,0.1,0.9)
colr_4 <- "#FC4E07"
## set color theme for radar border
colors_border=c(colr_1,
colr_2,
colr_3,
colr_4)
## plot with default options:
radarchart(data, axistype=0,
#custom polygon
pcol=colors_border, plwd=2 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", cglwd=0.8,
#custom labels
vlcex=0.8, title=paste("Transaction Seasonal Trend"), cex.main = 1
)
## Add a legend
legend(x=1.5, y=0.5,
legend =
rownames(data[-c(1,2),]),
bty = "n", pch=15 ,
col=colors_border,
text.col = "black",
cex=1.2,
pt.cex=1,
title = "Year")
# Individual Years Bar Charts
ggplot(new_ts_df) +
aes(x = month, fill = year, weight = transaction_amount/1e6) +
geom_bar() +
scale_fill_brewer(palette = "Dark2",
direction = -1) +
labs(y = "Total transaction amount (in millions)",
title = "Monthly Seasonal Trend",
subtitle = "Individual Year (from 2013 to 2016)") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size = 15L)) +
facet_wrap(vars(year), scales = "free", nrow = 1L)
```
A closer examination reveals that the total transaction amount increases significantly from January to October and subsequently decreases from November towards the end of the year. This pattern can be ascribed to the fact that people trade less during the holidays, particularly in the months surrounding big holidays like Christmas and New Year.
This, however, might be based on a variety of different factors rather than on individual conclusions about each region or industry. As a result, additional information is required to substantiate these hypotheses.
```{r fig.align="center", fig.cap="Transaction amount by Location vs. Industry", message=FALSE, warning=FALSE}
##::::::::::::::::::::::::::::::::::::::::::::::::
## 3. Transaction Amount by Location & Industry
##::::::::::::::::::::::::::::::::::::::::::::::::
# create new INDUSTRY df contain total transaction by industry over time
industry <- sqldf(
"SELECT
date,
industry,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- filter by date
industry -- filter by industry
ORDER BY date
"
)
# create new LOCATION df contain total transaction by location over time
location <- sqldf(
"SELECT
date,
location,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- filter by date
location -- filter by location
ORDER BY date
"
)
# plot transaction info by industry
industry_amount_plot <- ggplot(industry) +
aes(x = date, y = transaction_amount/1e6, colour = industry) +
geom_line(size = 1) +
scale_color_hue(direction = 1) +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Transaction Amount",
subtitle = "By industry (from 2013 to 2016)") +
theme_minimal()
# plot transaction info by location
location_amount_plot <- ggplot(location) +
aes(x = date, y = transaction_amount/1e6, colour = location) +
geom_line(size = 1) +
scale_color_hue(direction = 1) +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Transaction Amount",
subtitle = "By location (from 2013 to 2016)") +
theme_minimal()
# combine plots into a single page
ggarrange(industry_amount_plot, location_amount_plot,
ncol = 2, nrow = 1)
```
When looking at the monthly amount by `location` and `industry`, it is not surprising that total sales in locations 1 and 2 increased significantly compared to other locations. Meanwhile, in terms of industry, industries 1, 2 and 3 show rapid growth over the years while the others progress quite slowly.
# Data Preparation {#dp}
## Feature Engineering {#fe}
When the data has been fully understood, data scientists generally need to go back to the data collection and data cleaning phases of the data science pipeline to transform the data set as per the expected business outcomes. To expand the information already at hand and better represent it, the best practice is to perform **Data Preparation**, or *Feature Engineering*: the creation of new features from the ones already existing.
In this case study, the data will need to be modified as we will be applying a linear regression model later on.
```{r message=FALSE, warning=FALSE}
# write a reusable function
aggregate_transactions <- function(df) {
# aggregate the data, grouping by date, industry and location,
# and calculating the mean monthly_amount
output = df %>%
group_by(date, industry, location) %>%
summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))
# create a column for the month number and another one for year number
output = output %>%
# create new column for month number
mutate(month_number = format(as.Date(date), "%m")) %>%
# create new column for year number
mutate(year_number = format(as.Date(date), "%Y"))
# keep the new columns as characters so lm() treats them as
# categorical (dummy-coded) variables rather than as numeric trends
output$month_number = as.character(output$month_number)
output$year_number = as.character(output$year_number)
return(output)
}
# create a new variable that store new df with transformed features
aggregated_transactions <- aggregate_transactions(df)
# A tibble: 3,886 x 6
# Groups: date, industry [470]
# date industry location monthly_amount month_number year_number
# <date> <chr> <chr> <dbl> <chr> <chr>
# 1 2013-01-01 1 1 136081. 01 2013
# 2 2013-01-01 1 10 188735. 01 2013
# 3 2013-01-01 1 2 177840. 01 2013
# 4 2013-01-01 1 3 141632. 01 2013
# 5 2013-01-01 1 4 221058. 01 2013
# 6 2013-01-01 1 5 178138. 01 2013
# 7 2013-01-01 1 6 133400. 01 2013
# 8 2013-01-01 1 7 231599. 01 2013
# 9 2013-01-01 1 8 143778. 01 2013
# 10 2013-01-01 1 9 157416. 01 2013
# ... with 3,876 more rows
# turn the df into a Markdown table format
rmarkdown::paged_table(aggregated_transactions)
```
An aggregated data set is created using the fields `date`, `industry` and `location`, with the mean of the monthly amount. There are 3,886 rows in total, each presenting the mean monthly amount for one date-industry-location combination between 2013 and 2016.
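As a quick sanity check on the aggregation (a minimal sketch), every date-industry-location combination should now appear exactly once:
```{r eval=FALSE}
# expect zero rows: no duplicated date-industry-location combinations
aggregated_transactions %>%
  ungroup() %>%
  count(date, industry, location) %>%
  filter(n > 1)
```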
## Train-Test Split
Now that we have a new, adjusted data set, I'm going to split the data into a train set and a test set with the aim of building a prediction model. The train set includes three years of data, from 2013 to 2015, while the test set includes the final year of data, 2016.
```{r}
#################################################################
## Task 2A.iii - Modelling ##
#################################################################
##---------------------------------------------------------------
## 1. Train-Test Split --
##---------------------------------------------------------------
# create helper function with the creation of train-test split
create_train_test <- function(df){
# Test set
# Use data for the year 2016 to test accuracy
test_df <- df[df$date >= "2016-01-01",]
# Training set
# Use data prior to 2016 to train the model
train_df <- df[(df$date < "2016-01-01"),]
return (list(train_df, test_df))
}
# Create train and test set
# create new variable and apply train-test split function
split_list <- create_train_test(aggregated_transactions)
# 1. train set
train_set <- split_list[[1]]
# 2. test set
test_set <- split_list[[2]]
```
Additionally, we have 2 requirements in this assignment, which are:
1. **Basic Model Fitting**: Developing a linear regression model with monthly_amount as the target for `industry = 1` and `location = 1`.
2. **Advanced Model Fitting**: Developing a linear regression model with monthly_amount as the target for all industries and locations.
I will generate an additional data set filtered to only the Industry 1 and Location 1 records. The train-test split for the Advanced Model Fitting section can be kept the same, as no further adjustments are needed.
```{r}
# create new df filter by only Location 1 & Industry 1
agg_1_1 <- aggregated_transactions %>%
filter(industry == 1, location == 1)
# Create train and test sets
# use the adjusted data set with new features
split_list_1_1 <- create_train_test(agg_1_1)
# 1. train set: filter by only Location 1 & Industry 1
train_set_1_1 <- split_list_1_1[[1]]
# 2. test set: filter by only Location 1 & Industry 1
test_set_1_1 <- split_list_1_1[[2]]
# create new df to plot our test-train set
# add new column to specify the value type
train_set_1_1$type <- "Train"
test_set_1_1$type <- "Test"
# create data frame to display year range of test and train data
train_test <- sqldf(
"SELECT * FROM train_set_1_1
UNION ALL
SELECT * FROM test_set_1_1
ORDER BY type;
"
)
# plot year range of train set & test set (L1 & I1)
ggplot(train_test) +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
labs(x = "Year", y = "Total monthly amount",
title = "Train-Test Split",
subtitle = "Location 1 & Industry 1") +
theme_classic() +
theme(legend.position = "bottom", plot.title = element_text(size = 15L))
```
Now that the new dataset is created, I will also use it to draw a line plot of the variable `monthly_amount` for `industry = 1` and `location = 1`, with the purpose of gaining more insight into the targeted segment.
```{r message=FALSE, warning=FALSE}
#############################################################################
## Task 2A.ii - Visualization: Mean Monthly Transaction Amount (L1 & I1) ##
#############################################################################
agg_1_1 %>%
ggplot() +
aes(x = date, y = monthly_amount, colour = year_number) +
geom_line(size = 0.7) +
geom_point(size=1) +
scale_color_brewer(palette = "Dark2", direction = 1) +
labs(
x = "year",
y = "Transaction amount",
title = "Mean Monthly Transaction Amount",
subtitle = "Industry 1 & Location 1",
color = "Year"
) +
ggthemes::theme_par() +
theme(
plot.title = element_text(size = 12L),
plot.subtitle = element_text(hjust = 0.5)
)
```
It is clear from the graph that there is a seasonal trend in the mean transaction amount for Industry 1 & Location 1. More specifically, a downtrend at the end of each year is followed by an uptrend at the beginning of the next: December and January are low months for this industry and location, and sales bounce back from March to June. This pattern of fluctuation repeats each year over the span from 2013 to 2016. On average, the monthly mean sales amount increases slowly over time.
However, it is worth mentioning that the year-end trend in 2016 was upward, the inverse of previous years. As a result, we need to take a closer look at this occurrence by examining the transaction amount by month for each year in the chart below.
```{r}
# Plot Mean Transaction Amount Bar chart
agg_1_1 %>%
ggplot() +
aes(x = month_number, fill = year_number, weight = monthly_amount) +
geom_bar() +
scale_fill_manual(values = c(`2013` = "#EBEBAE", `2014` = "#BBE395", `2015` = "#379E54", `2016` = "#004529"
)) +
labs(x = "Month", y = "Total transaction amount", title = "Transaction Amount by Month", subtitle = "Location 1 & Industry 1",
fill = "year") +
theme_classic() +
theme(legend.position = "bottom",
plot.title = element_text(size = 15L),
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
facet_wrap(vars(year_number), nrow = 1L) +
coord_flip()
```
As can be seen, the anomalous increase towards the end of 2016 noticed earlier is the result of missing transaction data for December 2016. This is another insight discovered by observing the trend chart above.
# Modeling {#m}
## Basic Model Fitting
In this section, I will focus on developing an MLR model filtered by Location 1 & Industry 1.
### Model Development
Numerous Multiple Linear Regression (MLR) models will be developed and assessed with various combinations of predictor variables. As the approach, I adopt a stepwise model selection method, backwards elimination: I start with a full model, i.e. a model with all possible covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached.
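As a side note, base R's `step()` can automate the same backwards elimination by AIC; here is a minimal illustrative sketch (using the training set defined above and the full model that is fitted manually below):
```{r eval=FALSE}
# start from the full model and let step() drop terms while the AIC improves
full_model <- lm(monthly_amount ~ date + month_number, data = train_set_1_1)
reduced_model <- step(full_model, direction = "backward", trace = FALSE)
summary(reduced_model)
```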
#### Model 1: Full Model
Note that even though we start with a model containing all variables, I will exclude `location`, `industry` and `year`: since we filter to Location 1, Industry 1 and the years 2013-2015, the first two columns are constant in the training data, and the year is already captured by `date`, so they would add no useful information to our MLR model.
```{r}
##::::::::::::::::::::::::::::
## M1: date + month_number
##::::::::::::::::::::::::::::
M1_1 <- lm(monthly_amount~date+month_number, data =train_set_1_1)
summary(M1_1)
# Call:
# lm(formula = monthly_amount ~ date + month_number, data = train_set_1_1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -12943.4 -1553.1 494.6 2858.6 10733.4
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.758e+05 5.029e+04 -3.495 0.001954 **
# date 1.981e+01 3.123e+00 6.342 1.80e-06 ***
# month_number02 1.828e+04 4.561e+03 4.008 0.000551 ***
# month_number03 2.508e+04 4.563e+03 5.496 1.38e-05 ***
# month_number04 9.805e+03 4.568e+03 2.146 0.042623 *
# month_number05 2.188e+04 4.575e+03 4.783 8.00e-05 ***
# month_number06 1.448e+04 4.584e+03 3.159 0.004389 **
# month_number07 1.903e+04 4.595e+03 4.142 0.000395 ***
# month_number08 2.004e+04 4.608e+03 4.349 0.000236 ***
# month_number09 1.453e+04 4.622e+03 3.144 0.004546 **
# month_number10 2.475e+04 4.639e+03 5.335 2.04e-05 ***
# month_number11 1.983e+04 4.658e+03 4.258 0.000296 ***
# month_number12 4.360e+03 4.678e+03 0.932 0.360955
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 5584 on 23 degrees of freedom
# Multiple R-squared: 0.8329, Adjusted R-squared: 0.7457
# F-statistic: 9.555 on 12 and 23 DF, p-value: 2.627e-06
```
The month number variable is introduced to account for the seasonality of the sales amount. As summarized in the linear model with the formula `monthly_amount ~ date + month_number`, this model performs quite impressively, with an Adjusted R-squared of 0.7457. In other words, approximately 74.57% of the variance in the training set is explained by the model.
#### Model 2: Fit the model with `month_number` variable
```{r}
##::::::::::::::::::::::::::::
## M2: month_number
##::::::::::::::::::::::::::::
M1_2 <-lm(monthly_amount~month_number, data =train_set_1_1)
summary(M1_2)
# Call:
# lm(formula = monthly_amount ~ month_number, data = train_set_1_1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -14715.9 -6444.4 625.6 6639.3 12664.7
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 142568 5233 27.244 < 2e-16 ***
# month_number02 18895 7401 2.553 0.017455 *
# month_number03 26248 7401 3.547 0.001641 **
# month_number04 11588 7401 1.566 0.130484
# month_number05 24258 7401 3.278 0.003179 **
# month_number06 17471 7401 2.361 0.026703 *
# month_number07 22616 7401 3.056 0.005431 **
# month_number08 24239 7401 3.275 0.003200 **
# month_number09 19347 7401 2.614 0.015209 *
# month_number10 30155 7401 4.075 0.000436 ***
# month_number11 25854 7401 3.493 0.001873 **
# month_number12 10976 7401 1.483 0.151073
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 9064 on 24 degrees of freedom
# Multiple R-squared: 0.5407, Adjusted R-squared: 0.3302
# F-statistic: 2.569 on 11 and 24 DF, p-value: 0.02583
```
Based on the Multiple R-squared value, Model 2 can only account for approximately 54% of the variance. This indicates that `month_number` alone is a moderate predictor of `monthly_amount` and performs clearly worse than the first model. The overall F-test p-value of 0.02583 confirms this: the model is still significant at the 5% level, but the evidence for it is far weaker than for Model 1 (p-value 2.627e-06).
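To make this comparison formal, the two nested models can be compared with an F-test; a minimal sketch:
```{r eval=FALSE}
# Model 2 (month_number only) is nested in Model 1 (date + month_number),
# so anova() tests whether adding `date` significantly improves the fit
anova(M1_2, M1_1)
```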
#### Model 3: Fit the model with `date` variable
```{r}
##::::::::::::::::::::::::::::
## M3: 1 VARIABLE - date
##::::::::::::::::::::::::::::
M1_3 <-lm(monthly_amount~date, data =train_set_1_1)
summary(M1_3)
# Call:
# lm(formula = monthly_amount ~ date, data = train_set_1_1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -17585 -7236 1484 6569 17030
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.764e+05 7.681e+04 -2.297 0.0279 *
# date 2.083e+01 4.729e+00 4.405 0.0001 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 8966 on 34 degrees of freedom
# Multiple R-squared: 0.3633, Adjusted R-squared: 0.3446
# F-statistic: 19.4 on 1 and 34 DF, p-value: 1e-04
```
With the third model, where we fit only the `date` variable, performance is worse still: only 36% of the variability in the average monthly sales amount is explained, leaving a whopping 64% of the variance unexplained.
#### Model Selection
In conclusion, Model 1 provides the best fit of the three combinations. Thus, I will use this model to make a prediction of `monthly_amount` for December 2016.
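As a supporting check (a minimal sketch), the three candidates can also be ranked by AIC, where lower is better:
```{r eval=FALSE}
# compare the three fitted candidate models by AIC
AIC(M1_1, M1_2, M1_3)
```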
### Model Prediction
Having chosen Model 1 as the final model, my next step is to create a new data frame specifying only the 2016 records. I then make the prediction for the transaction amount in December 2016 with the `predict()` function, as follows.
```{r message=FALSE, warning=FALSE}
##################################################################
## Task 2A.iv - Making the Prediction on Dec 2016 ##
##################################################################
##----------------------------------------------------------------
## Dec 2016 Prediction --
##----------------------------------------------------------------
# create new vector that specify column names
x <- c("date", "industry", "location", "monthly_amount",
"month_number","year_number","type")
# create new df for fitted df
fit_df <- data.frame(train_set_1_1[-7]) %>%
# assign the value type for the dataset
mutate(type="Fitted")
# Rename the fitted dataset column names
colnames(fit_df) <- x
# create new df for prediction df
testset_all <- data.frame(test_set_1_1[-7]) %>%
# assign the value type for the dataset
mutate(type="Predicted")
# Rename the predicted dataset column names
colnames(testset_all) <- x
# create new df for prediction row
dec2016<-data.frame(date = "2016/12/01",
industry="1",
location="1",
monthly_amount = 0,
month_number="12",
year_number="2016",
type="Predicted"
)
# convert the date column to the right date format to merge with predicted df later
dec2016$date <- as.Date(dec2016$date,format = "%Y/%m/%d")
## Use predict function + model we just built and apply to dec_2016 data frame
fit_df$monthly_amount <- predict(M1_1, train_set_1_1)
testset_all$monthly_amount <- predict(M1_1, test_set_1_1)
pred <- predict(M1_1, dec2016)
dec2016$monthly_amount <- pred
# combine the predicted df with the new predicted row
testset_all <- sqldf(
"SELECT * FROM testset_all
UNION ALL
SELECT * FROM dec2016"
)
```
Next, I will examine whether our December 2016 forecast is reasonable by drawing a line plot with the predicted data.
```{r}
# create new column to specify value type for the dataset
agg_1_1$type <- "Real"
# make sure our variables have the right format for plotting
fit_df <- as.data.frame(fit_df)
testset_all <- as.data.frame(testset_all)
# merge all necessary variables into 1 single df variable with rbind
newData <- rbind(fit_df, testset_all,agg_1_1)
# Plot goodness of fit from our MLR
newData %>%
filter(!(year_number %in% "2016")) %>%
filter(!(type %in% "Predicted")) %>%
ggplot() +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
labs(
x = "Year",
y = "Mean Transaction Amount",
title = "Fit from MLR"
) +
theme_light()
# Plot the comparison between our MLR prediction on test set with the actual values
ggplot(newData) +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.7) +
scale_color_hue(direction = 1) +
labs(x = "Year", y = "Mean transaction amount",
title = "December 2016 Prediction") +
theme_light() +
theme(plot.title = element_text(face = "bold"))
```
We can quantify the residuals by calculating a number of commonly used evaluation metrics. We'll focus on the following three (a by-hand sketch of each follows the list):
* **Root Mean Square Error (RMSE):** The square root of the mean of the squared differences between predicted and actual values (i.e. of the MSE). This yields an absolute metric in the same unit as the label; the smaller the value, the better the model.
* **Mean Absolute Error (MAE):** The mean of the absolute differences between predicted and actual values, also in the same unit as the label; again, the smaller the value, the better the model.
* **Coefficient of Determination (usually known as R-squared or R2):** A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
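Before the helper functions in the next chunk, here is a by-hand sketch of these formulas (assuming `M1_1` and `test_set_1_1` from the earlier chunks):
```{r eval=FALSE}
actual    <- test_set_1_1$monthly_amount
predicted <- predict(M1_1, newdata = test_set_1_1)
rmse <- sqrt(mean((actual - predicted)^2))   # Root Mean Square Error
mae  <- mean(abs(actual - predicted))        # Mean Absolute Error
r2   <- 1 - sum((actual - predicted)^2) /
            sum((actual - mean(actual))^2)   # Coefficient of Determination
c(RMSE = rmse, MAE = mae, R2 = r2)
```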
```{r}
#################################################################
## Task 2B - Describe the model ##
#################################################################
##----------------------------------------------------------------
## Evaluation Metrics --
##----------------------------------------------------------------
# Predictions vs Test data
evaluation_metrics_test_1_1 <- data.frame(
X = "Predictions vs Test data",
R2 = modelr::rsquare(M1_1, data = test_set_1_1),
RMSE = modelr::rmse(M1_1, data = test_set_1_1),
MAE = modelr::mae(M1_1, data = test_set_1_1)
)
# table formatting
rmarkdown::paged_table(evaluation_metrics_test_1_1)
```
```{r}
library(modelr)
# Fitted value vs Train data
evaluation_metrics_train_1_1 <- data.frame(
X = "Fitted value vs Train data",
R2 = rsquare(M1_1, train_set_1_1),
RMSE = modelr::rmse(M1_1, train_set_1_1),
MAE = modelr::mae(M1_1, train_set_1_1)
)