3_ggplot.rmd

---
title: |-
  R Crash Course \
   Part 3 -- Customizing *ggplot()*
author: "Rob Colautti"
---

# Getting Started

Before following this tutorial, you should be familiar with the [qplot() tutorial]().

This [ggplot cheat sheet](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) is a downloadable pdf that provides a good summary and quick-reference guide.

Load the ggplot2 library and a custom theme

```{r}
library(ggplot2)
source("http://bit.ly/theme_pub")
theme_set(theme_pub())
```

The `source` function loads an external file, in this case from the internet. The file is just a .R file with a custom function defining different aspects of the graph (e.g. text size, line width, etc.) You can open the link in a web browser or download and open in a text editor to see the file.

The `theme_set()` command sets our custom theme (`theme_pub`) to the default plotting theme. Since the theme is a function in R, we need the extra brackets: `theme_pub()`

<br>

# Graphical concepts

Begin with [this presentation](https://colauttilab.github.io/RCrashCourse/Graphics_small.pdf) 

# Rules of thumb

Published graphs in professional journals can vary depending on format and discipline, but there are a number of useful 'rules of thumb' to keep in mind. These are not hard and fast rules but helpful for new researchers who aren't sure how or where to start.

## 1. Minimize 'ink'

In the old days, when most papers were actually printed and mailed to journal subscribers, black ink was expensive and printing in colour was very expensive. Printing is still expensive but of course most research articles are available online where there is no additional cost to colour or extra ink. However, the concept of minimizing ink (or pixels) can go a long way toward keeping a graph free from clutter and unnecessary distraction.

## 2. Use space wisely

Empty space is not necessarily bad, but ask yourself if it is necessary and what you want the reader to take away. Consider the next two graphs:

```{r, echo=F}
Y<-rnorm(100)+60
X<-rbinom(100,1,0.5)
Y<-Y+(10*X)
X<-as.factor(X)
qplot(X,Y)
```

> Above: Y-axis scaled to the data

```{r, echo=F}
qplot(X,Y) + scale_y_continuous(limits=c(0,max(100)))
```

> Above: Y-axis scaled between 0 and 100

What are the benefits/drawbacks of scaling the axes? When might you choose to use one over ther other?

## 3. Choose a colour palette

Colour has three basic components

  a. Hue -- the amount of red vs green vs blue light
  b. Saturation -- how vivid the colour is 
  c. Brightness -- the amount of white (vs black) in the colour
  
In R these can be easily defined with the `rgb()` function. For example:

`rgb(1,0,0)` -- a saturated red
`rgb(0.1,0,0)` -- a dark red (low brightness, low saturation)
`rgb(1,0.9,0.9)` -- a light red (high brightness, low saturation)

Don't underestimate the impact of choosing a good colour palette, especially for presentations. Colour theory can get a bit overwhelming but here are a few good websites to help:

  * Quickly generate your own palette using [coolors](https://coolors.co)
  * Use a colour wheel to find complementary colours using [Adobe](https://color.adobe.com/create)
  * Browse some pre-made palettes or create one from a picture [colorfavs](http://www.colorfavs.com)

## 4. Colours have meaning

What's wrong with this graph? 

```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Lat=X,Long=Y,Temp=Y/3)
qplot(Lat,Long,colour=Temp, data=D) + scale_color_gradient(high="blue", low="red")
```

Humans naturally associate colours with particular feelings. Be mindful of these associations when choosing a colour palette

Another important consideration is that not everyone sees colour the same way. About 5% to 10% of the population has colour blindness. In order to make colour graphs readable to everyone, you can use different 

## 5. Use high contrast

Colours that are too similar will be hard to distinguish

```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Lat=X,Long=Y,Precip=rnorm(100))
qplot(Lat,Long,colour=Precip, data=D) + scale_color_gradient(high="#56B4E9", low="#56B499")
```

## 6. Keep relevant information

Make sure to include proper axis **labels** (i.e. names) and **tick marks** (i.e. numbers or categories showing the different values). These labels, along with the figure caption, should act as a stand-alone unit. The reader should be able to understand the figure without having to read through the rest of the paper.

## 7. Choose the right graph

Often the same data can be presented in different ways but some are easier to interpret than others. Think carefully about the story you want to present and the main ideas you want your reader to get from your figures. Look at these two graphs that show the same data and see which one is more intuitive

```{r}
X<-rnorm(100)
Y<-X+rnorm(100)
qplot(c(X,Y),fill=c(rep("X",100),rep("Y",100)),posit="dodge")
qplot(X,Y)
```


# Example

To get to know ggplot better, let's do a step-by-step example of a figure published in a [paper by Colautti & Lau](https://doi.org/10.1111/mec.13162) in the journal Molecular Ecology (2015)

## 1. Setup

### Import data

Download selection dataset from Colautti & Lau (2015) -- this data is archived on Dryad:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.gt678

Extract the file called `Selection_Data.csv` and save it to your working directory. Then use `read.csv` to import the dataset:

```{r}
SelData<-read.csv("Selection_Data.csv",header=T)
```

#### Change column names

To make them more intuitive in R

```{r}
names(SelData)<-c("Collector","Author","Year","Journal","Vol","Species","Native","N","Fitness.measure","Trait","s","s.SE","s.P","B","B.SE","B.P")
```

#### Replace s with its absolute value 

We are interested in magnitude, not direction in the meta-analysis

```{r}
SelData$s<-abs(SelData$s)
```

#### Add random variables

We'll use these later to explore some additionall ggplot options

```{r}
SelData$Rpoint<-rnorm(nrow(SelData)) # Random, normally distributed
SelData$Rgroup<-sample(c(0,1),nrow(SelData),replace=T) # Random binary value
```

#### A quick look at the data

```{r}
head(SelData)
```

#### One more thing...

Note the missing data (denoted NA)

```{r, eval=FALSE}
print(SelData$s)
```

We can subset to remove mising data

```{r}
SelData<-SelData[!is.na(SelData$s),]
```

Recall from the intro tutorial that `!` means 'not' or 'invert'

similarly, we could use `filter` from dplyr

```{r, eval=F}
library(dplyr)
SelData<-SelData %>%
  filter(!is.na(s))
```


dplyr also has a convenient `drop_na` function in the `tidyr` package

```{r, eval=F}
library(tidyr)
SelData<-SelData %>%
  drop_na(s)
```


<br>

***

<br>

## 2. *ggplot()*  vs  *qplot()*

We can create the same graph using qplot and ggplot, just the syntax changes:

### Histogram

#### qplot

```{r, error=TRUE}
BarPlot<-qplot(s, data=SelData, fill=Native, geom="bar")
print(BarPlot)
```

#### ggplot

```{r, error=TRUE}
BarPlot <-ggplot(aes(s, fill=Native), data=SelData) 
print(BarPlot)
```

No layers! We only loaded in the data info for plotting

We have to specify which geom(s) we want

```{r}
BarPlot<- BarPlot + geom_bar() # info from ggplot() passed to geom_bar()
BarPlot
```

Explore the components of our BarPlot object:

```{r}
summary(BarPlot)
```

For more information on geom_bar()

```{r, eval=FALSE}
?geom_bar
```

### Bivariate geom

```{r}
BivPlot<-ggplot(data=SelData, aes(x=s, y=Rpoint)) + geom_point()
print(BivPlot)
```

Looks like a classic log-normal variable, so let's log-transform x

```{r}
BivPlot<-ggplot(data=SelData, aes(x=log(s+1), y=Rpoint)) + geom_point()
print(BivPlot)
```


Add linear regression

```{r}
BivPlot + geom_smooth(method="lm",colour="steelblue",size=2)
```

Add separate regression lines for each group

```{r}
BivPlot + geom_smooth(method="lm",size=2,aes(group=Native,colour=Native))
```

<br>

***

<br>

## 3. Full ggplot

Recreate the selection histograms from Colautti & Lau:

  1. Create separate data for native vs. introduced species

  2. Use a bootstrap to estimate non-parametric mean and 95% confidence intervals
  
  3. Plot all of the components on a single graph

### 3.1. Separate data

```{r}
NatSVals<-SelData$s[SelData$Native=="yes"] # s values for Native species
IntSVals<-SelData$s[SelData$Native=="no"] # s values for Introduced species
```

### 3.2. Bootstrap

#### 3.2a. Setup

```{r}
IterN<-100 # Number of iterations
NatSims<-{} # Dummy objects to hold output
IntSims<-{}
```

#### 3.2b. For loop

  * Sample, with replacement and calculate average
  
  * Store average in NatSims or IntSims

```{r}
for (i in 1:IterN){
  NatSims[i]<-mean(sample(NatSVals,length(NatSVals),replace=T))
  IntSims[i]<-mean(sample(IntSVals,length(IntSVals),replace=T))
}
```

#### 3.2c. Calculate 95% confidence intervals

Sort from low to high

```{r}
NatSims<-sort(NatSims)
IntSims<-sort(IntSims)
```

Calculate 95%iles from simulations

```{r}
CIs<-c(sort(NatSims)[round(IterN*0.025,0)], # Native, lower 2.5%
       sort(NatSims)[round(IterN*0.975,0)], # Native, upper 97.5%
       sort(IntSims)[round(IterN*0.025,0)], # Intro, lower 2.5%
       sort(IntSims)[round(IterN*0.975,0)]) # Intro, upper 97.5%
```

### 3.3. Plot components

#### Combine output for plotting

```{r}
HistData<-data.frame(s=SelData$s,Native=SelData$Native)
```

*NOTE:* This creates a 'stacked' dataset for plotting

```{r}
p <- ggplot() + theme_classic()
p <- p + geom_freqpoly(data=HistData[HistData$Native=="yes",], aes(s,y=(..count..)/sum(..count..)),alpha = 0.6,colour="#1fcebd",size=2)
print(p) # native species histogram
p <- p + geom_freqpoly(data=HistData[HistData$Native=="no",], aes(s,y=(..count..)/sum(..count..)),alpha = 0.5,colour="#f53751",size=2)
print(p) # introduced species histogram
p <- p + geom_rect(aes(xmin=CIs[1],xmax=CIs[2],ymin=0,ymax=0.01),colour="white",fill="#1fcebd88")
print(p) # native species 95% CI bar
p <- p + geom_line(aes(x=mean(NatSims),y=c(0,0.01)),colour="#1d76bf",size=1)
print(p) # native species bootstrap mean
p <- p + geom_rect(aes(xmin=CIs[3],xmax=CIs[4],ymin=0,ymax=0.01),colour="white",fill="#f5375188")
print(p) # introduced species 95% CI bar
p <- p + geom_line(aes(x=mean(IntSims),y=c(0,0.01)),colour="#f53751",size=1)
print(p) # introduced species bootstrap mean
p <- p + ylab("Frequency") + scale_x_continuous(limits = c(0, 1.5))
print(p) # labels added, truncated x-axis
```

<br>

***

<br>

## 4. Custom theme

You can customize various aspects such as font size, line widths, colours, etc.

This is already done in the custom theme at http://bit.ly/theme_pub. However, you could save this (or the text below) as a file called "MyTheme.R" in your project directory. Then you could edit the paremeters and use `source("MyTheme.R")` at the beginning of your code/markdown/notebook to load your own theme.

```{r, eval=F}
# Clean theme for presentations & publications used in the Colautti Lab
theme_pub <- function (base_size = 12, base_family = "") {
  theme_classic(base_size = base_size, base_family = base_family) %+replace% 
    theme(
      axis.text = element_text(colour = "black"),
      axis.title.x = element_text(size=16, margin=margin(t=5)),
      axis.text.x = element_text(size=10),
      axis.title.y = element_text(size=16,angle=90, margin=margin(r=5)),
      axis.text.y = element_text(size=10),
      axis.ticks = element_blank(), 
      panel.background = element_rect(fill="white"),
      panel.border = element_blank(),
      plot.title=element_text(face="bold", size=20),
      legend.position="none"
    ) 
}
```

<br>

***

<br>

## 5. Multi-graph

In the qplot tutorial, we looked at facets. Facets allow us to plot the same graph types but separated by category. Sometimes you might want to produce a multi-panel figure with different plots in each panel. There is a package for that...

### Setup

Install 'gridExtra' with `install.packages("gridExtra")`

```{r}
library(gridExtra)
```

### grid.arrange()
#### Combine multiple plots
Prints graphs in rows, then columns, from top left to bottom right

Use ***nrow =*** and ***ncol =*** to control layout 

```{r, warning=F, message=F}
grid.arrange(p,BivPlot,BarPlot,ncol=1)
grid.arrange(p,BivPlot,BarPlot,nrow=2)
```

> Note: You might get some warnings based on missing values or wrong binwidth. You will also see some weird things with different text sizes in the graphs. Normally, you would want to fix these for a final published figure but here we are just focused on showing what is possible with the layouts.

Apply consistent formatting

```{r, warning=F, message=F}
HistPlot<-p 
BarPlot<-BarPlot 
BivPlot<-BivPlot 
grid.arrange(HistPlot,BivPlot,BarPlot,HistPlot,nrow=2)
```

### viewport & newpage

What if we want to have graphs of different sizes? Or what if we want one figure to be inside another? We can make some even more advanced graphs using the `grid` package. F


Install 'gridExtra' if you haven't already
`install.packages("grid")`

```{r}
library(grid)
```

Control layout as a custom grid

```{r, warning=F, message=F}
grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(3, 2))) # Create 3x2 grid layout
print(HistPlot, vp = viewport(layout.pos.row = 3, layout.pos.col = 1:2)) # Add fig in row 3 and across columns 1:2
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 1)) # add fig acros rows 1:3 in column 1
print(BarPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 2))
```

Use viewport to add insets

```{r, warning=F, message=F}
HistPlot
pushViewport(viewport(layout = grid.layout(4, 4))) # Create 4x4 grid layout (number of cells, will determine size/location of graph)
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 3:4))
```

<br>

***

<br>

## 6. Reference

The comprehensive source for ggplot by Hadley Wickham:

http://link.springer.com/book/10.1007%2F978-0-387-98141-3

http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf