---
title: "Text analysis with R and AmCAT"
author: "Wouter van Atteveldt"
date: "May 25, 2016"
output: pdf_document
---


You might need to install these packages first:
```{r, eval=F}
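# devtools is needed for the GitHub installs; RTextTools and ggplot2 are used further below
install.packages(c("devtools", "RTextTools", "ggplot2"))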
devtools::install_github("amcat/amcat-r")
devtools::install_github("kasperwelbers/corpus-tools")
devtools::install_github("vanatteveldt/rsyntax")
```

# Keyword queries with AmCAT

First, let's use the AmCAT API to run some simple keyword queries.


On every computer you need to save your AmCAT password once (if you don't have an account you can create one for free at https://amcat.nl):

```{r, eval=F}
library(amcatr)
amcat.save.password("https://amcat.nl", "your_username", "your_password")
```
```{r, echo=F, message=F}
library(amcatr)
```

Next, connect using the `amcat.connect` function and store the connection details in `conn`:

```{r}
conn = amcat.connect("https://amcat.nl")
```

You can use this connection to retrieve, for example, the meta-information about the articles in a specific set (here, article set 25173 in project 1006):

```{r, message=F}
meta = amcat.getarticlemeta(conn, 1006, 25173, dateparts = T)
head(meta)
```

You can also run a keyword query, which returns the number of hits per document:

```{r, message=F}
h = amcat.hits(conn, "referend*", sets=25173)
head(h)
```

Now we can merge the hits with the meta-information and plot the number of hits per week for each medium:

```{r}
meta = meta[meta$medium %in%  c("De Volkskrant", "Trouw", "NRC Handelsblad"),]
h = merge(meta, h)
perweek = aggregate(h["count"], h[c("week", "medium")], sum)
library(ggplot2)
ggplot(perweek, aes(x=week, y=count, color=medium)) + geom_line()
```
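
To see what `merge` and `aggregate` do in the chunk above, here is a minimal sketch on a toy data set. All ids, media, weeks and counts are made up, and the `toy_` names are only used for this illustration:

```{r}
# Toy data: every value below is invented for illustration
toy_meta = data.frame(id = 1:4,
                      medium = c("A", "A", "B", "B"),
                      week = as.Date(c("2016-01-04", "2016-01-11",
                                       "2016-01-04", "2016-01-04")))
toy_hits = data.frame(id = c(1L, 2L, 4L), count = c(2, 1, 3))

# merge() joins on the shared column (id); by default, rows without a match
# (here article 3, which has no hits) are dropped
toy_merged = merge(toy_meta, toy_hits)

# Sum the hit counts for every combination of week and medium
toy_perweek = aggregate(toy_merged["count"], toy_merged[c("week", "medium")], sum)
toy_perweek
```

Note that articles without any hits disappear in the merge, so weeks without hits are simply missing from the aggregated data rather than counted as zero.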


# Corpus analysis: document-term matrix

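The main primitive in corpus analysis is the document-term matrix (DTM), in which each row is a document, each column is a term, and each cell holds the number of times that term occurs in that document.
We can create one from text using the `create_matrix` function in `RTextTools`: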

```{r, message=F}
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)
as.matrix(m)
```
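
To make the structure of such a matrix concrete, the following sketch builds the same kind of table by hand in base R. It is only an illustration of the idea, not how `create_matrix` works internally, and the `toy_` names and document ids are assumptions:

```{r}
toy_ids  = c("d1", "d2")
toy_docs = c("Chickens are birds", "The bird eats")

# Lowercase and split on whitespace to get one vector of words per document
toy_words = strsplit(tolower(toy_docs), "\\s+")

# Long format: one row per word occurrence, labelled with its document
toy_long = data.frame(doc = rep(toy_ids, lengths(toy_words)),
                      term = unlist(toy_words, use.names = FALSE))

# Cross-tabulate documents against terms: rows are documents, columns are terms
toy_dtm = table(toy_long$doc, toy_long$term)
toy_dtm
```

Without any preprocessing, "bird" and "birds" end up in separate columns.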

Note that a DTM is normally a sparse matrix, which means only the non-zero values are stored.
A real-world DTM easily has millions of cells, so converting it to a regular matrix with `as.matrix` can cause memory problems.
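
The difference in memory use can be sketched as follows. This uses the `Matrix` package purely as an illustration (the DTM returned by `create_matrix` uses its own sparse format), and the dimensions and cell positions are arbitrary:

```{r}
library(Matrix)  # ships with standard R installations

# A 1000 x 1000 matrix with only three non-zero cells (positions are arbitrary)
toy_sparse = sparseMatrix(i = c(1, 500, 1000), j = c(1, 20, 300),
                          x = c(2, 1, 4), dims = c(1000, 1000))
toy_dense = as.matrix(toy_sparse)

# The sparse version only stores the non-zero values,
# the dense version stores all 1,000,000 cells
object.size(toy_sparse)
object.size(toy_dense)
```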

Let's try to clean up some of the noise by removing stop words and stemming the remaining words:

```{r}
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
as.matrix(m)
```

For English, this works reasonably well.
Now let's try the same for Dutch:

```{r}
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
```

## Tokens and NLP Preprocessing

The `amcat.gettokens` command retrieves the list of tokens (words) per document from AmCAT:

```{r, message=F}
tokens = amcat.gettokens(conn, 1006, 25173, page_size=100, max_page=2)
head(tokens)
```

We can create a DTM (and a word cloud) from this:

```{r, message=F}
library(corpustools)
dtm = dtm.create(tokens$aid, tokens$term)
dtm.wordcloud(dtm)
```

That is still not very informative, so we need some preprocessing. Let's ask AmCAT for the output of the morphosyntactic module instead:

```{r, message=F}
tokens = amcat.gettokens(conn, 1006, 25173, module="morphosyntactic", page_size=100, max_page=1, only_cached=T)
head(tokens)
```

As you can see, this lemmatizes the words (reduces each word to its dictionary form, which is more precise than stemming) and gives their part of speech (noun, verb, etc.).
Let's plot only the names:

```{r, warning=F, message=F}
# Keep only the tokens tagged as names (name_tokens avoids shadowing base::subset)
name_tokens = tokens[tokens$pos == "name", ]
dtm = dtm.create(name_tokens$aid, name_tokens$lemma)
dtm.wordcloud(dtm, nterms = 200)
```
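
The effect of lemmatizing and selecting on part of speech can be illustrated with a small hypothetical token list. The column names mirror the ones used above, but the `word` column and all values are invented for this sketch:

```{r}
# Hypothetical token list: all values are made up
toy_tokens = data.frame(
  aid   = c(1, 1, 1, 2, 2, 2),
  word  = c("Obama", "visited", "chickens", "Bush", "visits", "chicken"),
  lemma = c("Obama", "visit", "chicken", "Bush", "visit", "chicken"),
  pos   = c("name", "verb", "noun", "name", "verb", "noun"))

# Counting raw words treats "visited"/"visits" and "chickens"/"chicken" as different terms
table(toy_tokens$aid, toy_tokens$word)

# Counting lemmata merges the inflected forms into a single column
table(toy_tokens$aid, toy_tokens$lemma)

# Selecting on part of speech keeps only the names
toy_names = toy_tokens[toy_tokens$pos == "name", ]
table(toy_names$aid, toy_names$lemma)
```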

# Corpus analysis: State of the Union speeches

We've prepared a data set containing the State of the Union speeches by Obama and Bush:

```{r}
data(sotu)
head(sotu.tokens)
aggregate(cbind(Freq=sotu.meta$id), list(Speaker=sotu.meta$headline), length)
```

We can easily get the most frequent terms with the `term.statistics` function:

```{r, message=F}
dtm = dtm.create(sotu.tokens$aid, sotu.tokens$lemma)
stats = term.statistics(dtm)
stats = arrange(stats, -termfreq)
head(stats)
```
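
As a rough sketch of what such term statistics contain, the two most basic ones can be computed by hand in base R. The token list below is made up, and the column name `docfreq` is an assumption chosen for this illustration:

```{r}
# Made-up token list: one row per token, with the document it came from
toy_tok = data.frame(doc   = c(1, 1, 1, 2, 2, 3),
                     lemma = c("tax", "tax", "job", "tax", "war", "war"))

# Term frequency: how often each lemma occurs in the whole corpus
toy_termfreq = table(toy_tok$lemma)

# Document frequency: in how many distinct documents each lemma occurs
# (unique() keeps one row per document-lemma combination)
toy_docfreq = table(unique(toy_tok)$lemma)

toy_stats = data.frame(term = names(toy_termfreq),
                       termfreq = as.vector(toy_termfreq),
                       docfreq = as.vector(toy_docfreq))

# Sort by descending term frequency
toy_stats[order(-toy_stats$termfreq), ]
```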


Let's limit that to adjectives:

```{r, message=F}
dtm = with(subset(sotu.tokens, pos1 == "A"), dtm.create(aid, lemma))
stats = term.statistics(dtm)
stats = arrange(stats, -termfreq)
head(stats)
```

## Comparing corpora

It is often more informative to compare two corpora, e.g. to compare Bush's words to Obama's words:

```{r}
dtm = with(sotu.tokens[sotu.tokens$pos1 %in% c("N", "A", "M"), ],
           dtm.create(aid, lemma))
obama = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
cmp  = corpora.compare(dtm, select.rows = obama)
cmp = arrange(cmp, over)
head(cmp)
```

So, words like terror and freedom are mostly used by Bush (their overrepresentation for Obama is below 1).
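
The idea behind such an overrepresentation score can be sketched in base R as a smoothed ratio of relative frequencies. This is an illustration under assumptions, not necessarily the exact computation in `corpora.compare`: the counts are invented and the smoothing constant is an arbitrary choice.

```{r}
# Invented term counts for two corpora (x and y)
toy_cmp = data.frame(term   = c("terror", "job", "freedom", "health"),
                     freq.x = c(2, 40, 5, 30),
                     freq.y = c(30, 20, 25, 5))

# Relative frequency of each term within its own corpus
toy_cmp$rel.x = toy_cmp$freq.x / sum(toy_cmp$freq.x)
toy_cmp$rel.y = toy_cmp$freq.y / sum(toy_cmp$freq.y)

# Smoothed ratio: > 1 means overrepresented in x, < 1 means overrepresented in y
# (the smoothing constant avoids division by zero; its value is an assumption)
toy_smooth = 0.001
toy_cmp$over = (toy_cmp$rel.x + toy_smooth) / (toy_cmp$rel.y + toy_smooth)

# Sort ascending, so the terms most typical of corpus y come first
toy_cmp[order(toy_cmp$over), c("term", "over")]
```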

We can also plot these words with a 'directed' word cloud:

```{r}
h = rescale(log(cmp$over), c(1, .6666))
s = rescale(sqrt(cmp$chi), c(.25,1))
cmp$col = hsv(h, s, .33 + .67*s)
with(head(cmp, 130), plotWords(x=log(over), words=term, 
    wordfreq=termfreq, random.y = T, col=col, scale=1))
```

## Topic Modeling

A final example of corpus analysis is topic modeling.
In topic modeling, words are automatically clustered into topics based on how they co-occur in documents (similar in spirit to factor analysis):

```{r}
set.seed(123)
m = lda.fit(dtm, K=10, alpha=.1)
terms(m, 10)
```

We can see how often each topic occurs in each document:

```{r}
tpd = topics.per.document(m, as.wordassignments = T)
head(tpd)
```

We can merge this back with the meta-information:

```{r}
tpd = merge(sotu.meta, tpd)
head(tpd)
```

We can then use this to test, for example, which president a topic like Iraq (topic 1) is most associated with:

```{r}
t.test(tpd$X1 ~ tpd$headline)
```
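
To show what this test compares, here is a minimal sketch on invented data: the topic counts and speaker labels are made up, and the column name `X1` simply mirrors the one used above.

```{r}
# Invented topic counts per speech for two speakers
toy_tpd = data.frame(speaker = rep(c("A", "B"), each = 5),
                     X1 = c(12, 15, 11, 14, 13, 3, 5, 2, 4, 6))

# Mean use of the topic per speaker
toy_means = tapply(toy_tpd$X1, toy_tpd$speaker, mean)
toy_means

# Welch two-sample t-test: is the difference between the two means
# larger than we would expect by chance?
toy_test = t.test(X1 ~ speaker, data = toy_tpd)
toy_test$p.value
```

The formula `X1 ~ speaker` splits the topic counts by speaker, so the test only works when the grouping variable has exactly two levels.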