-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy pathamcat.Rmd
211 lines (154 loc) · 5.37 KB
/
amcat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: "Text analysis with R and AmCAT"
author: "Wouter van Atteveldt"
date: "May 25, 2016"
output: pdf_document
---
You might need these packages
```{r, eval=F}
install.packages("devtools")
devtools::install_github("amcat/amcat-r")
devtools::install_github("kasperwelbers/corpus-tools")
devtools::install_github("vanatteveldt/rsyntax")
```
# Keyword queries with AmCAT
First, let's use the AmCAT API to do simple keyword queries:
On every computer you need to save your AmCAT password once (if you don't have an account you can create one for free at https://amcat.nl):
```{r, eval=F}
library(amcatr)
amcat.save.password("https://amcat.nl", "your_username", "your_password")
```
```{r, echo=F, message=F}
library(amcatr)
```
Next, you can connect using the `amcat.connect` function, storing the connection details in `conn`
```{r}
conn = amcat.connect("https://amcat.nl")
```
You can use this connection to retrieve e.g. the meta-information about articles in a specific set:
```{r, message=F}
meta = amcat.getarticlemeta(conn, 1006, 25173, dateparts = T)
head(meta)
```
You can also use a keyword query which returns the number of hits per document for a query:
```{r, message=F}
h = amcat.hits(conn, "referend*", sets=25173)
head(h)
```
Now, we can merge the information and plot the line over time:
```{r}
meta = meta[meta$medium %in% c("De Volkskrant", "Trouw", "NRC Handelsblad"),]
h = merge(meta, h)
perweek = aggregate(h["count"], h[c("week", "medium")], sum)
library(ggplot2)
ggplot(perweek, aes(x=week, y=count, color=medium)) + geom_line()
```
# Corpus analysis: document-term matrix
The main primitive in corpus analysis is the document-term matrix.
We can create one from text using the `create_matrix` function in `RTextTools`
```{r, message=F}
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)
as.matrix(m)
```
Note that a DTM is normally a sparse matrix, which means only the non-zero values are stored.
In a real world matrix, you easily have millions of cells, so converting it to a regular matrix with `as.matrix` can cause memory problems.
Let's try to clean some of the noise from the data set:
```{r}
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
as.matrix(m)
```
So for English this works reasonably well.
Now let's try for Dutch:
```{r}
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
```
## Tokens and NLP Preprocessing
The `amcat.gettokens` command allows us to get document word lists from AmCAT:
```{r, message=F}
tokens = amcat.gettokens(conn, 1006, 25173, page_size=100, max_page=2)
head(tokens)
```
We can create a DTM (and a word cloud) from this:
```{r, message=F}
library(corpustools)
dtm = dtm.create(tokens$aid, tokens$term)
dtm.wordcloud(dtm)
```
But that's still no good: we need some preprocessing:
```{r, message=F}
tokens = amcat.gettokens(conn, 1006, 25173, module="morphosyntactic", page_size=100, max_page=1, only_cached=T)
head(tokens)
```
So you can see this lemmatizes (stems) words and gives their part of speech (noun, verb, etc.)
Let's plot only the names:
```{r, warning=F, message=F}
subset = tokens[tokens$pos == "name", ]
dtm = dtm.create(subset$aid, subset$lemma)
dtm.wordcloud(dtm, nterms = 200)
```
# Corpus analysis: The State-of-the-Unions
We've prepared a data set containing state of the union speeches by Obama and Bush:
```{r}
data(sotu)
head(sotu.tokens)
aggregate(cbind(Freq=sotu.meta$id), list(Speaker=sotu.meta$headline), length)
```
We can easily get the most frequent terms with the `term.statistics` function:
```{r, message=F}
dtm = dtm.create(sotu.tokens$aid, sotu.tokens$lemma)
stats = term.statistics(dtm)
stats = arrange(stats, -termfreq)
head(stats)
```
Let's limit that to adjectives:
```{r, message=F}
dtm = with(subset(sotu.tokens, pos1 == "A"), dtm.create(aid, lemma))
stats = term.statistics(dtm)
stats = arrange(stats, -termfreq)
head(stats)
```
## Comparing corpora
It is often more informative to compare two corpora, e.g. compare Bush' words to Obama's words:
```{r}
dtm = with(sotu.tokens[sotu.tokens$pos1 %in% c("N", "A", "M"), ],
dtm.create(aid, lemma))
obama = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
cmp = corpora.compare(dtm, select.rows = obama)
cmp = arrange(cmp, over)
head(cmp)
```
So, words like terror and freedom are mostly used by Bush (their overrepresentation for Obama is below 1).
We can also plot these words with a 'directed' word cloud:
```{r}
h = rescale(log(cmp$over), c(1, .6666))
s = rescale(sqrt(cmp$chi), c(.25,1))
cmp$col = hsv(h, s, .33 + .67*s)
with(head(cmp, 130), plotWords(x=log(over), words=term,
wordfreq=termfreq, random.y = T, col=col, scale=1))
```
## Topic Modeling
A final example of corpus analysis is topic modeling.
In topic modeling, words are automatically assigned to clusters (similar to factor analysis):
```{r}
set.seed(123)
m = lda.fit(dtm, K=10, alpha=.1)
terms(m, 10)
```
We can see how often each topic occurs in each document:
```{r}
tpd = topics.per.document(m, as.wordassignments = T)
head(tpd)
```
And merge this back with the meta information
```{r}
tpd = merge(sotu.meta, tpd)
head(tpd)
```
And use this to e.g. figure out whether a topic like Iraq (topic 1) is mostly affiliated with which president:
```{r}
t.test(tpd$X1 ~ tpd$headline)
```