-
Fixed some bugs
-
Disabled Belgian special case
-
Disabled 5-year time-window
- Fixed a bug in the parallel computation of SHAP values.
-
Compute data set sizes and data set sizes by age groups separately as the former cannot be derived from the latter.
-
Preprocessing now returns the five most recent previous Hb values: previous_Hb, previous_Hb2, ..., previous_Hb5, together with the corresponding variables days_to_previous_Hb[1-5]. These can now be used as predictors.
-
RF and SVM models can now be exported and imported (through the Prefitted input format). Note that an exported SVM model contains individual-level data!
-
The computation of SHAP values is now parallelised.
-
Small improvements to the user interface.
-
Variable summary tables are now exported. The table includes counts, counts of NA values, min, max, mean, median, 1st quartile and 3rd quartile.
-
Better error message if an input file is missing.
-
Added imbalance option that allows selecting between no upsampling or SMOTE.
-
Print (and export) dataset count statistics by age groups
-
Handling of sex in the user interface is now either pooled, stratified, male or female. The last two are new options and allow running only one sex. As a temporary solution, since hyperparameters for pooled mode have not been optimized, the average of the male and female parameters is used.
-
Visualize the dependence of deferral on the age group.
-
Added Dutch hyperparameter values.
-
Histograms of donation-specific variables are stored as a data frame in a file. Bins with fewer than five points have their count set to zero.
-
"Final mode" is enabled.
- Dramatically reduced the memory usage of the LMM.
-
The Finnish hyperparameters have now been reoptimized.
-
Now using package `ranger` instead of `randomForest` for random forests.
-
Attempted to fix the output problem of LMMs on Windows.
-
Use SMOTE sampling instead of upsampling in RF and SVM to handle class imbalance.
-
Export dataset sizes to the sizes.csv file.
-
Use SVM's raw decision values instead of probabilities, because Platt scaling randomly fails.
-
To ease debugging, one can download (most of) the fitted models and their train/validate input data with
`docker cp nameoftherunningcontainer:/tmp/tmp_rds.zip .`
Note that these contain private data.
-
The SHAP values of linear models are now actually exported to the shap-value.csv file.
-
SVM is now using radial kernel instead of polynomial. Also, the Finnish hyperparameters have been optimized again.
-
Reordered the operations in preprocessing. Because of this change, preprocessing must be rerun.
-
Optimize the probability threshold that is used to compute the F1 score and the confusion matrix. The threshold is chosen to maximize the F1 score on the training data. Hopefully this will get rid of the NA F1 scores.
-
SHAP value computation for the linear mixed models now works.
-
Reoptimized Finnish random forest hyperparameters (`mtry`, `nodesize`, `ntree`). Had to modify Caret to allow extra hyperparameters.
-
The id column was removed from the file `shap-value.csv`. In addition, the rows in `shap-value.csv` and `prediction.csv` are permuted. The SHAP values are computed from a sample of 1000 donors. The file `shap-value.csv` shows only the normalized variables computed from the sample. So, if there are more than 1000 donors in the test set, the individual data should be scrambled enough to prevent recovering the original data or ids.
-
The number of cores used in parallel computation can now be specified in the user interface. By reducing the number of cores one may try to reduce the memory usage.
-
Fixed a bug in stratified sampling. Now, if you specify a sample size, e.g. 10 000, and stratification by sex is selected, the sample will contain 10 000 male donors and 10 000 female donors.
-
Variable `nb_donat` is no longer included as a predictor by default. Fixed a small upsampling problem with the baseline model. Fixed logging of data exclusions.
-
The downloadable preprocessed data is no longer filtered by time-series length.
-
The final numbers of donations, donors and deferrals are now reported in table form in each Rmd.
-
Donors are no longer dropped even if the date_first_donation field is NA.
-
Implemented the five-year time window. The recent donations variable now refers to the last five years.
- Disabled parallelism in computing confidence intervals of AUPR and F1. This avoids crashes due to insufficient memory on systems with little memory.
-
Removed overlapping and overflowing content from pdf reports.
-
Elapsed time in the web UI now works when it is longer than one day.
-
Added a downloadable zip file that contains all the results except for the preprocessed data.
-
The minority class (deferrals) is now oversampled in the training data so that 50% of the last donations are deferrals. This is done for the baseline, random forest, and support vector machine models.
-
Drop too short time series after preprocessing but before subsetting.
-
SHAP value computation now works for the baseline, RF, and SVM models.
-
Belgian data is now handled differently.
-
The container can now also run preprocessing only without fitting any models.
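
The entry above on optimizing the probability threshold can be illustrated with a small sketch. This is not the project's own implementation. It is a minimal Python example, with hypothetical function names and made-up toy data, that scans the observed training scores as candidate thresholds and keeps the one with the highest F1. Returning 0 when F1 is undefined is an assumption made here to keep the search simple.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = deferral). Returns 0.0 when F1 is undefined (assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 on the given (training) data.

    Candidate thresholds are the distinct observed scores; a donation is
    predicted as a deferral when its score is >= the threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy training data (made up for illustration only).
train_labels = [0, 0, 1, 0, 1, 1]
train_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
threshold, f1 = best_f1_threshold(train_labels, train_scores)
print(threshold, round(f1, 3))
```

The chosen threshold would then be applied unchanged to the validation scores when computing the F1 score and the confusion matrix.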
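
The histogram entry above (bins with fewer than five points set to zero) can be sketched as follows. This is an illustrative Python example, not the project's code: the function names, the bin edges and the Hb values are all hypothetical, and the bin convention (left-closed bins, last bin closed on both sides) is an assumption.

```python
def histogram_counts(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def suppress_small_bins(counts, min_count=5):
    """Set the count of every bin with fewer than min_count points to zero."""
    return [c if c >= min_count else 0 for c in counts]


# Made-up Hb values and bin edges, for illustration only.
hb = [120, 122, 125, 128, 129, 131, 133, 135, 136, 137, 138, 139, 141, 152]
edges = [120, 130, 140, 150, 160]
raw = histogram_counts(hb, edges)
safe = suppress_small_bins(raw)
print(raw, safe)
```

Suppressing sparse bins before export keeps rare values, which might identify individual donors, out of the shared histogram file.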
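
The oversampling entry above (deferrals oversampled until they make up 50% of the training rows) can be sketched in the same spirit. This is a hedged Python illustration with hypothetical names (`oversample_minority`, the `deferred` key) and toy data. It uses plain random duplication with replacement and does not reproduce the SMOTE approach adopted in a later entry.

```python
import random


def oversample_minority(rows, label_key="deferred", seed=0):
    """Duplicate randomly chosen minority rows until both classes are equally large.

    rows is a list of dicts; rows[i][label_key] is 1 for a deferral, else 0.
    Assumes deferrals are the minority; otherwise the input is returned unchanged.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    if not minority or len(minority) >= len(majority):
        return list(rows)
    # Sample with replacement to fill the gap between the two classes.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy training set: 8 accepted donations and 2 deferrals (made up).
train = [{"donor": i, "deferred": 0} for i in range(8)]
train += [{"donor": 100 + i, "deferred": 1} for i in range(2)]
balanced = oversample_minority(train)
n_deferred = sum(r["deferred"] for r in balanced)
print(len(balanced), n_deferred)
```

Only the training data would be balanced this way; the validation data keeps its original class proportions.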