# Final-Project

## Class distribution and feature ranking

Before training, since the problem is a binary classification, I checked the class distribution and found the two classes were roughly balanced (about half and half), which is good. The features were then ranked by their chi-square statistic, which measures how strongly each feature is associated with the class label:

| Rank | Feature | chi2 |
|---|---|---|
| 0 | gill-color | 5957.764469 |
| 1 | ring-type | 1950.610146 |
| 2 | gill-size | 1636.606833 |
| 3 | bruises | 1194.277352 |
| 4 | gill-spacing | 826.795274 |
| 5 | habitat | 751.309489 |
| 6 | spore-print-color | 379.132729 |
| 7 | population | 311.766736 |
| 8 | stalk-surface-above-ring | 222.982400 |
| 9 | cap-surface | 214.068544 |
| 10 | stalk-surface-below-ring | 206.648180 |

## Training with all features

First I trained the data with 5 algorithms (Logistic Regression, KNN, Neural Network, Decision Tree, and Naive Bayes) on all 21 remaining features (one feature with more than 1000 missing values was dropped). All 5 algorithms were used with their default settings and gave near-perfect prediction accuracy (almost 100%). Results:

| Model (default settings) | Training accuracy | Testing accuracy | TN | FP | FN | TP | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 1.0 | 1.0 | 2297 | 0 | 0 | 141 | 1.0 | 1.0 | 1.0 |
| KNN | 0.999824129441 | 0.998769483183 | 2297 | 0 | 3 | 138 | 1.0 | 0.978723404255 | 0.989247311828 |
| Neural Network | 1.0 | 1.0 | 2297 | 0 | 0 | 141 | 1.0 | 1.0 | 1.0 |
| Decision Tree | 1.0 | 1.0 | 2297 | 0 | 0 | 141 | 1.0 | 1.0 | 1.0 |
| Naive Bayes | 1.0 | 1.0 | 2297 | 0 | 0 | 141 | 1.0 | 1.0 | 1.0 |

## Dimensionality reduction

Since there was almost no room left to improve the models in terms of classification accuracy, the next step was to reduce the dimensionality of the data. The models were trained ten times, using the top 1 to top 10 most important features according to the ranking above, and the accuracy of each algorithm was plotted against the number of features used. Based on that plot I decided to use the 6 most important features, because they give fairly high accuracy and adding more features brings no significant further increase.

The top 6 most useful features are: gill-color, ring-type, gill-size, bruises, gill-spacing, habitat. Results with these 6 features:

| Model (default settings) | Training accuracy | Testing accuracy | TN | FP | FN | TP | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.934605323896 | 0.936615384615 | 791 | 34 | 69 | 731 | 0.955555555556 | 0.91375 | 0.934185303514 |
| KNN | 0.972611170949 | 0.964307692308 | 780 | 45 | 13 | 787 | 0.945913461538 | 0.98375 | 0.964460784314 |
| Neural Network | 0.983228188952 | 0.985846153846 | 821 | 4 | 19 | 781 | 0.994904458599 | 0.97625 | 0.985488958991 |
| Decision Tree | 0.983228188952 | 0.985846153846 | 821 | 4 | 19 | 781 | 0.994904458599 | 0.97625 | 0.985488958991 |
| Naive Bayes | 0.920295430066 | 0.92 | 785 | 40 | 90 | 710 | 0.946666666667 | 0.8875 | 0.916129032258 |

With the lower-dimensional data it is no longer possible to reach 100% accuracy.
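The feature names suggest a UCI Mushroom-style dataset. Below is a minimal sketch, using scikit-learn, of how the chi-square ranking and the default-classifier comparison above could be reproduced. The file name (`mushrooms.csv`), the label column (`class`), the choice of positive class, the dropped column, the integer encoding of the categorical features, and the train/test split are all assumptions, so the exact numbers will differ from those reported above.

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Assumed file and column names (UCI Mushroom-style CSV with a 'class' label).
df = pd.read_csv("mushrooms.csv")
df = df.drop(columns=["stalk-root"])  # assumption: the feature with >1000 missing values
y = (df["class"] == "e").astype(int)  # assumption: positive class chosen so a false positive is the dangerous error
X = df.drop(columns=["class"]).apply(lambda col: col.astype("category").cat.codes)  # integer-encode categories

# Chi-square ranking: a larger statistic means a stronger association with the label.
scores, _ = chi2(X, y)
ranking = pd.DataFrame({"feature": X.columns, "chi2": scores}).sort_values("chi2", ascending=False)
print(ranking.head(11))

# Five classifiers with default settings, evaluated on a held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "LogReg": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "NN": MLPClassifier(),
    "Tree": DecisionTreeClassifier(),
    "NB": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(f"{name}: train acc {model.score(X_train, y_train):.4f}, "
          f"test acc {model.score(X_test, y_test):.4f}, "
          f"tn/fp/fn/tp {tn}/{fp}/{fn}/{tp}, "
          f"P {precision_score(y_test, pred):.4f}, "
          f"R {recall_score(y_test, pred):.4f}, "
          f"F1 {f1_score(y_test, pred):.4f}")
```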
## Model selection

Therefore, considering the practical significance of the classification, we want to choose a model that gives high prediction accuracy and a low number of false positives, because a false negative does not really hurt anyone but a false positive might kill someone. For the final experiment I trained the 5 algorithms on the 6 most important features 100 times and calculated the average accuracy and the average number of false positives. Among the 5 algorithms, the Decision Tree gives the lowest average false positives and, interestingly, the highest average accuracy at the same time (~98.4%).

| Model | Avg training accuracy | Avg testing accuracy | Avg false positives |
|---|---|---|---|
| Logistic Regression | 0.935082320357 | 0.935082320357 | 31.37 |
| KNN | 0.980847822742 | 0.980847822742 | 11.4 |
| Neural Network | 0.983682104939 | 0.983682104939 | 6.66 |
| Decision Tree | 0.983786736421 | 0.983786736421 | 6.45 |
| Naive Bayes | 0.92216802585 | 0.92216802585 | 40.0 |
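For reference, here is a minimal sketch of the repeated-training comparison, reusing `X`, `y`, and the `models` dictionary from the earlier snippet. The 100 runs come from the description above; the 20% test fraction and the per-run random split are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Top-6 features from the chi-square ranking above.
top6 = ["gill-color", "ring-type", "gill-size", "bruises", "gill-spacing", "habitat"]
X6 = X[top6]

n_runs = 100
accuracies = {name: [] for name in models}
false_positives = {name: [] for name in models}

for run in range(n_runs):
    # A fresh random split per run (the 20% test fraction is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(X6, y, test_size=0.2, random_state=run)
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        accuracies[name].append((tn + tp) / len(y_test))
        false_positives[name].append(fp)

for name in models:
    print(f"{name}: avg test accuracy {np.mean(accuracies[name]):.4f}, "
          f"avg false positives {np.mean(false_positives[name]):.2f}")
```

Averaging over many random splits makes the accuracy and false-positive comparison less sensitive to any single lucky or unlucky split, which is the point of repeating the training 100 times.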