Predicting whether a given user would read a particular book and predicting the genre of a book given book reviews.

annadinov/BookReadAndGenrePrediction

Read Prediction: To predict whether a book was read, I iterated through the validation set and collected the following for each (user, book) pair. For the book, I found the users who had read it in the training set; for the user, I found the books that user had read in the training set. For each of those books, I then found the users who had read it. Next, I computed the Jaccard similarity between the set of training-set users who had read the validation book and the set of users who had read each book read by the validation user. If the maximum similarity was greater than 5 or the number of times the book had been read was greater than 35 (I tried many thresholds, and these performed best), I added the book to a set called return2. To make the predictions, I checked whether the book was in return2: if it was, I predicted 1 (the book is predicted to have been read); otherwise, I predicted 0 (the book is predicted to be unread).
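The procedure above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`jaccard`, `build_indexes`, `predict_read`), the `(user, book)` pair format, and the parameter names are all assumptions. It returns the prediction directly instead of collecting books in a set such as return2, and it treats "number of times the book had been read" as the number of distinct training-set readers. Because Jaccard similarity always lies between 0 and 1, the similarity threshold is left as a parameter rather than hard-coded.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_indexes(train_pairs):
    """Build book -> readers and user -> books lookups from (user, book) pairs."""
    users_per_book = defaultdict(set)
    books_per_user = defaultdict(set)
    for user, book in train_pairs:
        users_per_book[book].add(user)
        books_per_user[user].add(book)
    return users_per_book, books_per_user


def predict_read(user, book, users_per_book, books_per_user,
                 sim_threshold, popularity_threshold):
    """Return 1 if the (user, book) pair is predicted as read, else 0."""
    readers = users_per_book.get(book, set())

    # Popularity rule: the book has many readers in the training set.
    if len(readers) > popularity_threshold:
        return 1

    # Similarity rule: compare this book's readers with the readers of
    # every other book the user has read in the training set.
    max_sim = max(
        (jaccard(readers, users_per_book.get(other, set()))
         for other in books_per_user.get(user, set())
         if other != book),
        default=0.0,
    )
    return 1 if max_sim > sim_threshold else 0
```

A hypothetical call might look like `predict_read("u1", "b2", users_per_book, books_per_user, sim_threshold=0.5, popularity_threshold=35)`. The 0.5 here is only a placeholder for illustration, not the author's setting.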

Category Prediction: To predict the category of a book, I wrote a feature function that builds a feature vector for each data point: every time one of the 4000 most common words appears in the review's text, 1 is added to the position in the feature vector that corresponds to that word. A 1 is also appended to the end of the feature vector as the constant term. I built a matrix of the feature vectors for all data points in the dataset and fit a linear regression model on that matrix and the genreID of each data point. I then built a feature matrix for the test set and used the model to predict its genres. For the model, I tried many different values for C and found that 0.01 gave the best accuracy.
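The feature function described above might look something like the sketch below. The names `tokenize`, `build_vocabulary`, and `feature` are hypothetical, and the tokenization (lowercasing and stripping punctuation) is an assumption, since the text does not say how words were extracted from the reviews.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def build_vocabulary(reviews, size=4000):
    """Map each of the `size` most common words to a position in the vector."""
    counts = Counter()
    for text in reviews:
        counts.update(tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}


def feature(text, word_index):
    """Count occurrences of each common word, then set a trailing constant 1."""
    vec = [0] * (len(word_index) + 1)
    for word in tokenize(text):
        idx = word_index.get(word)
        if idx is not None:
            vec[idx] += 1
    vec[-1] = 1  # constant (bias) term
    return vec
```

Stacking `feature(review, word_index)` for every review gives the feature matrix. One plausible way to fit it, assuming a scikit-learn classifier was used, is `LogisticRegression(C=0.01)`, where `C` is the inverse regularization strength; the text itself does not name the library.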
