
Exploring the factors that drive people into software piracy by training two machine learning models to predict whether a person with given characteristics and sentiments is likely to possess pirated software, using a dataset collected via a survey of music production software users.


MiroKeimioniemi/classifying-software-pirates


Classifying Software Pirates in the Music Production Software Industry

Below is a short excerpt of the Classifying Software Pirates in the Music Production Software Industry.pdf report, briefly summarizing the rationale, methodology and results of the project.

Introduction

This project digs deeper into the dataset used for the report “The Pricing of Digital Goods in the Music Production Software Industry” and tries to classify people, based on a variety of features, into those who have pirated music production software and those who have not. The resulting models can then be used to explore the factors that drive people into software piracy and to gain more insight into this prominent modern phenomenon, which extends to all online markets. This information can unlock economic insights into people’s online behavior and help software companies maximize their profits through appropriate customer segmentation, which would likely also benefit customers who have not previously been able to afford the products.

Conclusion

Two machine learning models, DecisionTreeClassifier and LogisticRegression, were developed to classify software pirates using one-hot encoded categorical data, demographic and similar. Their performance was practically identical, so LogisticRegression was selected for its better interpretability. The model suggests that the factors most strongly correlated with online piracy are how easy piracy is and the person's age and region of residence, the latter two of which usually directly affect disposable income. This implies that there may still be room for further market segmentation, for example through country-specific pricing and student discounts.

The selected LogisticRegression model has an accuracy of 0.729729 and an F1 score of 0.741379, which is quite good for a dataset this small, biased and noisy. This was enough to reveal overall trends and rank them by their approximate influence on the amount of piracy. As a classification system, however, the model would mislabel over a quarter of the people considered, so its individual predictions should be treated with great care. For this reason, the project focuses on the predictive features rather than on the predictions themselves.
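For reference, the two metrics quoted above can be computed from true and predicted labels as in the minimal sketch below. The function name `accuracy_and_f1` and the label lists are made up for illustration and do not reproduce the project's test set or its scores.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


# Hypothetical labels: 1 = owns pirated software, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Accuracy counts all correct labels equally, while F1 only rewards correctly identified positives, which is why the two numbers can differ on the same predictions.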
