---
title: "Text Classification with R"
author: "Wouter van Atteveldt"
date: "June 1, 2016"
output: pdf_document
---
```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```
Machine learning, or automatic text classification, is a set of techniques for training a statistical model on a set of annotated (coded) training texts; that model can then be used to predict the category or class of new texts.
R has a number of packages for various machine learning algorithms, such as maximum entropy modeling, neural networks, and support vector machines.
`RTextTools` provides an easy way to access a large number of these algorithms.
In principle, like 'classical' statistical models, machine learning uses a number of (independent) variables called _features_ to predict a target (dependent) category or class.
In text mining, the features (independent variables) are generally the term frequencies, so the input for the machine learning is the document-term matrix.
RTextTools can be installed directly from CRAN:
```{r, eval=F}
install.packages("RTextTools")
```
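To make the idea of features concrete, the sketch below builds a document-term matrix from two toy sentences (the sentences and the `toy` name are illustrations, not part of the review data used below):
```{r, eval=F}
library(RTextTools)
# each column of the matrix is a term (feature), each cell the frequency
# of that term in a document
toy = create_matrix(c("good great movie", "bad terrible movie"))
as.matrix(toy)
```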
Obtaining data
----
For this example, we will use Amazon reviews from http://jmcauley.ucsd.edu/data/amazon/ and classify whether they are positive or negative.
See the 'Getting Sentiment Resources' hand-out for how to download and prepare these yourself,
or you can download them directly from [http://rawgit.com/vanatteveldt/learningr/master/data/reviews.rds](http://rawgit.com/vanatteveldt/learningr/master/data/reviews.rds).
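If you have not downloaded the file yet, you can also do that from within R (a minimal sketch, assuming a `data` folder exists in your working directory):
```{r, eval=F}
# save the prepared reviews as data/reviews.rds
download.file("http://rawgit.com/vanatteveldt/learningr/master/data/reviews.rds",
              destfile = "data/reviews.rds", mode = "wb")
```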
```{r}
reviews = readRDS("data/reviews.rds")
```
Creating the Document Term Matrix
----
The first step is to create a document-term matrix.
To make the examples run faster, we take a limited data set here.
Since most reviews are positive (taking positive to be 4 or 5 stars),
we sample 500 positive and 500 negative reviews to get a balanced set.
(Since `sample` draws randomly, you can call `set.seed` first if you want reproducible results.)
```{r}
reviews$id = 1:nrow(reviews)
# code a review as positive if it has 4 or 5 stars
reviews$positive = as.numeric(reviews$overall >= 4)
# draw 500 random positive and 500 random negative review ids
pos = sample(reviews$id[reviews$positive == 1], 500)
neg = sample(reviews$id[reviews$positive == 0], 500)
# keep only the sampled reviews
reviews = reviews[reviews$id %in% c(pos, neg), ]
```
Now, we can create a dtm:
```{r, message=F}
library(RTextTools)
dtm = create_matrix(reviews[c("summary", "reviewText")], language="english", stemWords=T)
```
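As a quick sanity check, we can inspect the dimensions of the resulting matrix, i.e. the number of documents and of unique (stemmed) terms:
```{r}
dim(dtm)
```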
Of course, now that we have a DTM we can plot a word cloud to get a feeling for the most frequent words:
```{r, warning=F, message=F}
library(corpustools)
dtm.wordcloud(dtm)
```
(We could also e.g. compare the words in positive and negative reviews, or run a topic model on only the positive or negative reviews; see the hand-outs `comparing.pdf` and `lda.pdf`, respectively.)
Preparing the training and testing data
---
The next step is to create the RTextTools _container_.
This contains both the document-term matrix and the manually coded classes,
and it specifies which parts to use for training and which for testing.
To make sure that we get a random sample of documents for training and testing,
we sample 80% of the set for training and use the remainder for testing.
(Note that it is important to sort the indices, as otherwise GLMNET will fail.)
```{r}
n = nrow(dtm)
train = sort(sample(1:n, n*.8))
test = sort(setdiff(1:n, train))
```
Now, we are ready to create the container. Setting `virgin=F` indicates that the test documents also have known labels, so we can evaluate performance later:
```{r}
c = create_container(dtm, reviews$positive, trainSize=train, testSize=test, virgin=F)
```
Using this container, we can train different models:
```{r}
SVM <- train_model(c,"SVM")
MAXENT <- train_model(c,"MAXENT")
GLMNET <- train_model(c,"GLMNET")
```
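SVM, maximum entropy, and GLMNET are just three of the algorithms that RTextTools wraps; the package's `print_algorithms()` function lists all the algorithm names that can be passed to `train_model`:
```{r, eval=F}
print_algorithms()
```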
Testing model performance
----
Using the same container, we can classify the 'test' dataset:
```{r}
SVM_CLASSIFY <- classify_model(c, SVM)
MAXENT_CLASSIFY <- classify_model(c, MAXENT)
GLMNET_CLASSIFY <- classify_model(c, GLMNET)
```
Let's have a look at what these classifications yield:
```{r}
head(SVM_CLASSIFY)
```
For each document in the test set, the predicted label and probability are given.
We can compare these predictions to the correct classes manually:
```{r}
t = table(SVM_CLASSIFY$SVM_LABEL, as.character(reviews$positive[test]))
t
```
(Note that the `as.character` cast is necessary to align the rows and columns.)
And compute the accuracy:
```{r}
sum(diag(t)) / sum(t)
```
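Accuracy treats both classes alike; from the same table we can also compute precision and recall for one class by hand. A minimal sketch for the positive ('1') class, using the fact that the rows of `t` are the predictions and the columns the true classes:
```{r}
# precision: which fraction of documents predicted positive is truly positive?
t["1", "1"] / sum(t["1", ])
# recall: which fraction of truly positive documents did we predict as positive?
t["1", "1"] / sum(t[, "1"])
```
The built-in analytics shown below compute these metrics for all algorithms and classes at once.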
Analytics
----
To make it easier to compute the relevant metrics, RTextTools has a built-in analytics function:
```{r}
analytics <- create_analytics(c, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))
```
The `algorithm_summary` gives the performance of the various algorithms, with precision, recall, and f-score given per algorithm:
```{r}
head(analytics@algorithm_summary)
```
The `label_summary` gives the performance per label (class):
```{r}
head(analytics@label_summary)
```
Finally, the `ensemble_summary` gives an indication of how performance changes with the number of classifiers that agree on the classification:
```{r}
head(analytics@ensemble_summary)
```
The last attribute, `document_summary`, contains the classifications of the various algorithms per document, and also lists how many algorithms agree and whether the consensus and the highest-probability classifier were correct:
```{r}
head(analytics@document_summary)
```
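We can also work with `document_summary` directly, for instance to compute the accuracy of the consensus vote. A sketch, assuming the column names `MANUAL_CODE` and `CONSENSUS_CODE` shown in the output above:
```{r, eval=F}
d = analytics@document_summary
# proportion of test documents where the consensus of the three algorithms
# matches the manual coding
mean(d$CONSENSUS_CODE == d$MANUAL_CODE)
```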
Classifying new material
----
New material (called 'virgin data' in RTextTools) can be coded by placing the old and new material in a single container.
Let's assume that we don't know the sentiment of the first 200 reviews (20% of our material):
```{r}
reviews$positive[1:200] = NA
```
We now set all documents with a sentiment score as training material, and specify `virgin=T` to indicate that we don't have coded classes on the test material:
```{r}
coded = which(!is.na(reviews$positive))
c = create_container(dtm, reviews$positive, trainSize=coded, virgin=T)
```
We can now build and test the model as before:
```{r}
SVM <- train_model(c,"SVM")
MAXENT <- train_model(c,"MAXENT")
GLMNET <- train_model(c,"GLMNET")
SVM_CLASSIFY <- classify_model(c, SVM)
MAXENT_CLASSIFY <- classify_model(c, MAXENT)
GLMNET_CLASSIFY <- classify_model(c, GLMNET)
analytics <- create_analytics(c, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))
```
As you can see, the analytics now only has the `label_summary` and `document_summary`:
```{r}
analytics@label_summary
head(analytics@document_summary)
```
The label summary now only contains an overview of how many documents were coded using consensus and probability. The `document_summary` lists the output of all algorithms, plus the consensus and probability codes per document.
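To actually use the predicted codes, we can copy them back into the data frame. A sketch, assuming the `CONSENSUS_CODE` column and assuming that the rows of `document_summary` correspond, in order, to the uncoded documents:
```{r, eval=F}
d = analytics@document_summary
virgin = which(is.na(reviews$positive))  # the 200 documents without a manual code
reviews$predicted = NA
reviews$predicted[virgin] = d$CONSENSUS_CODE
head(reviews[virgin, c("summary", "predicted")])
```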