Why is ggtree special?

Guangchuang Yu edited this page Aug 30, 2015 · 43 revisions

The innovations of ggtree include:

  • parsing data from several evolutionary analysis programs
    • not only for visualization in ggtree, but also to bring these data to R users for further analysis (e.g. summarization, visualization)
  • viewing and annotating phylogenetic trees programmatically in R
    • other R packages that can view phylogenetic trees, including those implemented with ggplot2, only provide plot functions for special cases
  • supporting the grammar of graphics implemented in ggplot2
    • only ggtree supports the grammar of graphics for phylogenetic tree annotation

See user comments.

It differs from other tree viewers, which provide only pre-defined, specific kinds of tree view. ggtree doesn't define how annotation should be presented. Users face no restrictions on presenting data in the way they prefer, and complex tree views can be achieved by combining multiple layers of annotation.
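A minimal sketch of this layering, assuming the ape and ggtree packages are installed. The random tree, colors and sizes are made up for illustration and are not taken from any real analysis.

```r
library(ape)     # rtree() simulates a random tree
library(ggtree)  # ggtree(), geom_tiplab(), geom_tippoint(), geom_nodepoint()

set.seed(42)
tree <- rtree(8)  # hypothetical example tree with 8 tips

# Start from a bare tree view ...
p <- ggtree(tree)

# ... then add one annotation layer at a time, as with any ggplot object.
p_annotated <- p +
    geom_tiplab(size = 3) +                            # tip labels
    geom_tippoint(color = "steelblue") +               # points on tips
    geom_nodepoint(color = "firebrick", alpha = 0.5)   # points on internal nodes

print(p_annotated)
```

Each layer can be dropped or replaced independently, so the final figure is decided by the user rather than by a fixed set of function parameters.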

The grammar is extended from ggplot2, which has been widely used in biomedicine and ecology. Many researchers in these fields are already familiar with the grammar.

Several packages implement a tree viewer using ggplot2, including ggphylo, OutbreakTools and phyloseq.

Using ggplot2 doesn't guarantee that the grammar of graphics is supported. Among these packages, only ggtree supports the grammar of graphics, while the others only implement a tree viewer for a specific need.

ggphylo

This package is designed for viewing a phylogenetic tree with an alignment. It has not been updated since 2012, and the alignment part is not yet implemented.

PS. Viewing a phylogenetic tree with an alignment is supported in ggtree.

For the tree viewer, it mimics the function call of plot.phylo defined in ape. The ggphylo function is complex, and how a tree is viewed is pre-defined, with parameters to control its behavior.

As shown in the screenshot, it creates several data.frames and the tree is drawn by q <- ggplot(lines.df). ggphylo parses a tree as a collection of lines, which carry no meaning by themselves, because the information to display is related to taxa.

OutbreakTools

OutbreakTools is designed for disease outbreak analysis, and viewing phylogenetic trees is not its major focus.

The tree view function plotggphy is only applicable to the obkData class defined within this package. It can't be used directly to view a phylogenetic tree parsed from a Newick file.

As shown in the screenshot, its design is similar to that of ggphylo: it creates several data.frames and draws the tree via p <- ggplot(data=df.edge). It also parses a tree as a collection of lines.

phyloseq

phyloseq is designed for viewing microbiome census data.

The tree viewer defined in phyloseq only applies to the phyloseq class. It can't be used directly to view a tree parsed from a Newick file either.

Internally, it calls ape to calculate edge positions.

It draws horizontal lines followed by vertical lines.

Common drawbacks

  1. designed for a specific need
    • ggphylo for alignment (not implemented yet)
    • OutbreakTools for outbreak data
    • phyloseq for microbiome census data
  2. not applicable to widely used tree file formats
    • plotggphy in OutbreakTools assumes its input is an instance of obkData
    • plot_tree in phyloseq assumes its input is an instance of phyloseq
  3. not extensible
    • the tree is drawn as lines, but information is related to taxa (nodes & tips)
    • data are separated into different data.frame/data.table objects, making it impossible for users to modify the tree further
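To make the "lines versus taxa" point concrete, here is a small sketch using only ape and base R. It is not how any of the packages above store their data internally. The tree, the `traits` table and its `host` column are hypothetical, chosen only to show why per-taxon information joins naturally onto a table with one row per node.

```r
library(ape)  # read.tree() parses Newick text into a "phylo" object

tree <- read.tree(text = "((A:1,B:1):1,C:2);")

# Line-oriented view: one row per edge. Fine for drawing segments,
# but there is no obvious place to attach information about a taxon.
edges <- data.frame(parent = tree$edge[, 1],
                    child  = tree$edge[, 2],
                    length = tree$edge.length)

# Taxon-oriented view: one row per node (tips first, then internal nodes).
n_tip  <- length(tree$tip.label)
n_node <- n_tip + tree$Nnode
nodes  <- data.frame(node  = seq_len(n_node),
                     label = c(tree$tip.label, rep(NA, tree$Nnode)),
                     isTip = seq_len(n_node) <= n_tip)

# Hypothetical per-taxon data can now be merged by label.
traits    <- data.frame(label = c("A", "B", "C"),
                        host  = c("human", "swine", "human"))
annotated <- merge(nodes, traits, by = "label", all.x = TRUE)
print(annotated)
```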

As I mentioned at the beginning, only ggtree supports the grammar of graphics.

In ggphylo:

  lines.df <- subset(layout.df, type=='line')
  nodes.df <- subset(layout.df, type=='node')
  labels.df <- subset(layout.df, type=='label')
  internal.labels.df <- subset(layout.df, type=='internal.label')
  q <- ggplot(lines.df)

      geom.fn <- switch(aes.type,
        line='geom_joinedsegment',
        node='geom_point',
        label='geom_text',
        internal.label='geom_text'
      )
      q <- q + do.call(geom.fn, geom.list)

In OutbreakTools:

    ggphy <- phylo2ggphy(phylo, tip.dates = tip.dates, branch.unit = branch.unit)


    ##TODO: allow edge and node attributes and merge with df.edge and df.node
    df.tip <- ggphy[[1]]
    df.node <- ggphy[[2]]
    df.edge <- ggphy[[3]]

    p <- ggplot(data = df.edge)
    p <- p + geom_segment(data = df.edge, aes(x = x.beg, xend = x.end, y = y.beg, yend = y.end), lineend = "round")
    p <- p + scale_y_continuous("", breaks = NULL)

    if (show.tip.label) {
        p <- p + geom_text(data = df.tip, aes(x = x, y = y, label = label), hjust = 0, size = tip.label.size)
    }

In phyloseq:

  treeSegs <- tree_layout(phy_tree(physeq), ladderize=ladderize)
  edgeMap = aes(x=xleft, xend=xright, y=y, yend=y)
  vertMap = aes(x=x, xend=x, y=vmin, yend=vmax)
  # Initialize phylogenetic tree.
  # Naked, lines-only, unannotated tree as first layers. Edge (horiz) first, then vertical.
  p = ggplot(data=treeSegs$edgeDT) + geom_segment(edgeMap) + 
    geom_segment(vertMap, data=treeSegs$vertDT)

  if(!is.null(label.tips)){
    # `tiplabDT` has only one row per tip, the farthest horizontal
    # adjusted position (one for each taxa)
    tiplabDT = dodgeDT
    tiplabDT[, xfartiplab:=max(xdodge), by=OTU]
    tiplabDT <- tiplabDT[h.adj.index==1, .SD, by=OTU]
    if(!is.null(color)){
      if(color %in% sample_variables(physeq, errorIfNULL=FALSE)){
        color <- NULL
      }
    }
    labelMap <- NULL
    if(justify=="jagged"){
      labelMap <- aes_string(x="xfartiplab", y="y", label=label.tips, color=color)
    } else {
      labelMap <- aes_string(x="max(xfartiplab, na.rm=TRUE)", y="y", label=label.tips, color=color)
    }
    # Add labels layer to plotting object.
    p <- p + geom_text(labelMap, tiplabDT, size=I(text.size), hjust=-0.1, na.rm=TRUE)
  } 

These tree view functions are just ordinary plot functions. Although they use ggplot2, so that we can, for example, use theme to change the background, use scale_X functions to change the x and y axes, and add a nonsense layer on top of the tree (just as we can produce a grammatically correct sentence that is nonsense), this is not the philosophy of the grammar of graphics. We want to add layers that relate to taxa in the tree, which is mostly impossible with these implementations.

The tree view can only be controlled via pre-defined parameters. As the code above shows, if we create a tree without labels, we can't even add a layer of tip labels afterwards, since that information is created within the function and is not returned to us (we only have the positions of the lines after the tree was drawn).

For example, in OutbreakTools:

    if (show.tip.label) {
        p <- p + geom_text(data = df.tip, aes(x = x, y = y, label = label), hjust = 0, size = tip.label.size)
    }

If show.tip.label = FALSE, df.tip is thrown away when p is returned, and it is then impossible to add tip labels. The only way is to pass show.tip.label=TRUE at the very beginning when calling plotggphy. The implementations in ggphylo and phyloseq are similar. Users have no way to add related information if it is not pre-defined in those functions.
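For contrast, a short sketch of the same situation in ggtree, assuming ape and ggtree are installed and using a made-up four-tip tree. The plot object keeps one row per node in its data, so tip labels can be added long after the tree was first drawn.

```r
library(ape)
library(ggtree)

tree <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")

# Draw the tree without any labels.
p <- ggtree(tree)

# The node-level data travel with the plot object:
# one row per node, including the label and an isTip flag.
print(head(p$data))

# So a tip-label layer can still be added later.
p_labelled <- p + geom_tiplab()
print(p_labelled)
```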

The design of ggtree

ggtree is different, with the following features:

  • extending ggplot
  • parsing a tree as a collection of taxa

First, we separate tree parsing (including common software output) from visualization. Second, we didn't create a complex plot function; instead, we extend ggplot to support tree objects.

A tree is viewed via the geom_tree layer created in ggtree, and complex tree views can be achieved by adding annotation layers that are freely controlled by users.
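As a rough illustration of the geom_tree layer, the sketch below passes a tree object straight to ggplot. It assumes ape, ggplot2 and ggtree are installed, and the Newick string is a made-up example.

```r
library(ape)
library(ggplot2)
library(ggtree)  # provides geom_tree(), theme_tree() and tree support for ggplot()

tree <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")

# The tree is the data; geom_tree() is the layer that draws it.
p <- ggplot(tree, aes(x, y)) + geom_tree() + theme_tree()

# Further layers are added exactly as in any other ggplot.
p_tips <- p + geom_tiplab()
print(p_tips)
```

The ggtree() function used elsewhere on this page can be thought of as a shortcut for the ggplot(...) + geom_tree() + theme_tree() line.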

update example here

The plot functions defined in ggphylo, OutbreakTools and phyloseq are all special cases that can be easily implemented with a few layers in ggtree.

OutbreakTools example

library(OutbreakTools)
data(FluH1N1pdm2009)
attach(FluH1N1pdm2009)

x <- new("obkData", individuals = individuals, dna = FluH1N1pdm2009$dna,
      dna.individualID = samples$individualID, dna.date = samples$date,
      trees = FluH1N1pdm2009$trees)

p <- plotggphy(x, ladderize = TRUE, branch.unit = "year",
               tip.color = "location", tip.size = 3, tip.alpha = 0.75)

In this figure, the date is used as the x-axis and the tree is annotated with location. In an epidemic, the time and location at which the virus was isolated are important information, and plotggphy is designed for viewing such information.

In ggtree, it's very easy to reproduce such a figure.

library(ggtree)

## the first tree stored in the obkData object
tree <- x@trees[[1]]
g <- ggtree(tree, right=TRUE) + theme_tree2()
## convert x axis to corresponding date
g$data$x <- g$data$x * 365 + as.Date("2009-02-01")
g <- g + theme(panel.grid.major = element_line(color="grey"), 
               panel.grid.major.y=element_blank())
## attach additional information to tree view via %<+% operator 
## that was created in ggtree
loc <- data.frame(tip=tree$tip.label, location=x@individuals)
g <- g %<+% loc
g + geom_tippoint(aes(color=location), size=3, alpha=.75) + 
    theme(legend.position="right")
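For readers without OutbreakTools installed, the %<+% step can be tried on its own. This is a minimal sketch on a simulated tree; the `loc` table and its location values are invented for illustration and have nothing to do with the FluH1N1pdm2009 data.

```r
library(ape)
library(ggplot2)
library(ggtree)

set.seed(1)
tree <- rtree(6)  # hypothetical random tree with 6 tips

# First column holds tip labels; remaining columns are attached to the tree data.
loc <- data.frame(tip      = tree$tip.label,
                  location = rep(c("Europe", "Asia"), 3))

g <- ggtree(tree) %<+% loc

# The attached column is now available to any later layer.
g2 <- g + geom_tippoint(aes(color = location), size = 3, alpha = .75) +
    theme(legend.position = "right")
print(g2)
```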