Skip to content

Latest commit

 

History

History
3 lines (2 loc) · 1.82 KB

README.md

File metadata and controls

3 lines (2 loc) · 1.82 KB

Read Prediction: To predict whether or not a book was read, I iterated through the validation set to collect the following information. For the book in the validation set, I found the users that had read that book in the training set and for the user, I found the books that user read in the training set. Now, for each of these books from the training set that the user had read, I determined the users that had read them. Then I found the Jaccard similarities between the users in the training set that had read the book in the validation set and the users that read each book read by the user in the validation set. If the maximum similarity was greater than 5 or the number of times the book had been read was greater than 35 (I tried many thresholds, and this was the most optimal), I added the book to a set called return2. To make the predictions, I checked if the book was in the set return2. If the book was in the return2 set, I predicted 1 (meaning that the book is predicted to have been read) and if the book was not in the set, I predicted 0 (meaning that the book is predicted to be unread).

Category Prediction: To predict the category of books, I created a feature function that creates a feature vector for a data point in which every time of the 4000 most common words is seen in the review's text, 1 is added to the location in the feature vector that corresponds to that common word. Also, a 1 is added to the end of the feature vector for the constant term. I created a matrix of all the feature vectors for all data points in the dataset and did a linear regression model on the matrix and the genreIDs for each data point. I was able to then create a feature matrix for the test set and use the model to predict the genres of the test set. For the model, I tried many different values for C and found that 0.01 had the best accuracy.