treeman: an R package for efficient and intuitive manipulation of phylogenetic trees
© The Author(s) 2017
Received: 6 September 2016
Accepted: 13 December 2016
Published: 7 January 2017
Phylogenetic trees are hierarchical structures used for representing the inter-relationships between biological entities. They are the most common tool for representing evolution and are essential to a range of fields across the life sciences. The manipulation of phylogenetic trees—in terms of adding or removing tips—is often performed by researchers not just for reasons of management but also for performing simulations in order to understand the processes of evolution. Despite this, the most common programming language among biologists, R, has few class structures well suited to these tasks.
We present an R package that contains a new class, called TreeMan, for representing the phylogenetic tree. This class has a list structure allowing phylogenetic trees to be manipulated more efficiently. Computational running times are reduced because methods can be readily vectorised and parallelised. Development is also simplified because manipulation tasks require fewer lines of code.
We present three use cases—pinning missing taxa to a supertree, simulating evolution with a tree-growth model and detecting significant phylogenetic turnover—that demonstrate the new package’s speed and simplicity.
Keywords: Phylogenetic trees, Evolution, Tree simulation, R, Statistical computing
Phylogenetic trees have been a mainstay of the R statistical software environment since the release of Emmanuel Paradis’ APE package in 2002 [1, 2]. This package introduced the phylo object, an S3 class for the presentation and manipulation of phylogenetic tree data in the R environment. In its most basic implementation, the phylo object contains a list of three elements: an edge matrix, a vector of tip labels and an integer giving the number of internal nodes. The use of an edge matrix facilitates phylogenetically structured statistical analyses because it is convenient for generating distance, cophenetic or covariance matrices. For this reason the APE package’s phylo is the dominant class for phylogenetic tree representation in R and is used by many well-known phylogenetic R packages (e.g. phangorn, phytools). Since phylo’s first incarnation the number of available functions in the APE package has risen from 28 to 171 (versions 0.1–3.4), and to date there are 147 reverse dependencies, i.e. packages on CRAN that depend on the phylo class. More recently, the phylo class has been updated to S4 as part of the phylobase package.
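The three-element layout described above can be sketched in base R without loading APE. The block below is only an illustration: the list is built by hand for the rooted tree ((A,B),C), and the helper tips_below is a hypothetical name introduced here, not an APE function.

```r
# Hand-built list mirroring the three core elements of a phylo object
# for the rooted tree ((A,B),C). Tips are numbered 1 to 3, the root is 4
# and the remaining internal node is 5.
tree <- list(
  edge = matrix(c(4L, 5L,
                  5L, 1L,
                  5L, 2L,
                  4L, 3L),
                ncol = 2, byrow = TRUE),  # one row per edge: parent, child
  tip.label = c("A", "B", "C"),
  Nnode = 2L
)
class(tree) <- "phylo"

# Hypothetical helper: collect the tip labels below a node by following
# integer indices through the edge matrix.
tips_below <- function(tree, node) {
  n_tip <- length(tree$tip.label)
  children <- tree$edge[tree$edge[, 1] == node, 2]
  out <- character(0)
  for (child in children) {
    if (child <= n_tip) {
      out <- c(out, tree$tip.label[child])   # child is a tip
    } else {
      out <- c(out, tips_below(tree, child)) # child is an internal node
    }
  }
  out
}

tips_below(tree, 5L)
```

Every lookup goes through integer positions, which is what makes matrix-based summaries convenient to compute.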
An edge matrix, however, creates a dependence on index referencing, and so there are computational scenarios in which the phylo object performs poorly: in particular, analyses that require the manipulation of the tree itself (i.e. tip and node addition/deletion). Such analyses include simulating, comparing, pruning, and merging trees, and calculating phylogenetic statistics such as measures of phylogenetic richness and evolutionary distinctness. These have become the preserve of software solutions external to R, e.g. [8, 9], hindering their integration with the many packages for biomolecular, evolutionary and ecological studies already available for R. Although there are alternatives to the phylo class for phylogenetics or, more generally, ‘networks’ available in R, these packages and classes are rarely used for phylogenetics and may lack an intuitive functional framework for manipulating evolutionary trees.
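To make the cost of index referencing concrete, the following base R sketch removes one tip from a hand-built edge matrix. It is a deliberately simplified illustration, not how APE implements tip removal: drop_tip_naive is a hypothetical helper, and the tree is a three-tip polytomy so that no internal node needs collapsing.

```r
# Polytomy (A,B,C): tips are 1 to 3, the root is 4.
tree <- list(
  edge = matrix(c(4L, 1L,
                  4L, 2L,
                  4L, 3L),
                ncol = 2, byrow = TRUE),
  tip.label = c("A", "B", "C"),
  Nnode = 1L
)

# Hypothetical helper: remove one tip, assuming its parent keeps at
# least two children afterwards.
drop_tip_naive <- function(tree, label) {
  idx <- match(label, tree$tip.label)
  stopifnot(!is.na(idx))
  # delete the edge leading to the tip
  edge <- tree$edge[tree$edge[, 2] != idx, , drop = FALSE]
  # every index above the removed tip must shift down by one
  edge[edge > idx] <- edge[edge > idx] - 1L
  tree$edge <- edge
  tree$tip.label <- tree$tip.label[-idx]
  tree
}

pruned <- drop_tip_naive(tree, "B")
match("C", tree$tip.label)    # position of C before removal
match("C", pruned$tip.label)  # position of C after removal
```

Removing a single tip changes the index of tip C and of the root, so any index recorded before the manipulation no longer points at the same node.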
The TreeMan object in R is an S4 formal class whose main data slot is a list, which in R is a vector whose elements can be named. All nodes in a TreeMan object are named elements in this list (ndlst). Each node usually contains the following data slots: the node ID (id), the length of the preceding edge (spn, for “span”), the IDs of all connecting ascending/ancestral nodes to root (pre-node IDs, prid), the IDs of the immediately descending nodes (post-node IDs, ptid), and the IDs of all descending tips (kids). Additionally, if all nodes in a tree contain the spn slot, then each node will also contain: the total edge length of all descending nodes (phylogenetic diversity, pd), the total edge length of all connected pre-nodes (prdst; in a rooted tree this is the root-to-tip distance), and the relative distance of the node in the tree (age, for a time-calibrated rooted tree). Every node must have a prid slot, a ptid slot or both: tip nodes have only a prid slot, root nodes have only a ptid slot, and internal nodes have both. These slots must contain IDs that are found within the ndlst; if they do not, an error is raised. These core slots are supplemented by optional slots: a non-unique taxonomic name that can be used to generate lineages (txnym) and user-defined slots that can contain any kind of information. In addition to the ndlst, the TreeMan object contains informative slots that are generated upon reading or generating the tree, and are updated whenever the tree is modified. Basic tree information can be seen by printing the tree to console.
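The node list described above can be mimicked with a plain named list in base R. This is a minimal sketch under stated assumptions rather than the package's internal code: the helper names check_ndlst, calc_prdst and calc_pd are hypothetical, and the root is given a spn of 0 purely for convenience.

```r
# Rooted tree ((A,B)n2,C)n1 as a named list of nodes. Slot names follow
# the text: id, spn, prid (all ancestors up to the root), ptid (children).
ndlst <- list(
  n1 = list(id = "n1", spn = 0, ptid = c("n2", "C")),  # root spn of 0 is an assumption
  n2 = list(id = "n2", spn = 1, prid = "n1", ptid = c("A", "B")),
  A  = list(id = "A",  spn = 1, prid = c("n2", "n1")),
  B  = list(id = "B",  spn = 1, prid = c("n2", "n1")),
  C  = list(id = "C",  spn = 2, prid = "n1")
)

# Hypothetical validity check: every node needs a prid and/or ptid, and
# every ID they mention must be a name in the list.
check_ndlst <- function(ndlst) {
  ids <- names(ndlst)
  for (nd in ndlst) {
    linked <- c(nd$prid, nd$ptid)
    if (length(linked) == 0) {
      stop("node ", nd$id, " has neither prid nor ptid")
    }
    unknown <- setdiff(linked, ids)
    if (length(unknown) > 0) {
      stop("node ", nd$id, " refers to unknown ID(s): ",
           paste(unknown, collapse = ", "))
    }
  }
  invisible(TRUE)
}

# prdst: own span plus the spans of all pre-nodes.
calc_prdst <- function(ndlst, id) {
  nd <- ndlst[[id]]
  nd$spn + sum(vapply(nd$prid, function(p) ndlst[[p]]$spn, numeric(1)))
}

# pd: summed spans of every node that lists `id` among its pre-nodes.
calc_pd <- function(ndlst, id) {
  sum(vapply(ndlst, function(nd) if (id %in% nd$prid) nd$spn else 0,
             numeric(1)))
}

check_ndlst(ndlst)
calc_prdst(ndlst, "A")
calc_pd(ndlst, "n1")
```

Because nodes are addressed by name, the derived values are computed without any positional index.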
Taxonomy of treeman functions
- Retrieve specific information about parts of a tree, often nodes: getNdAge, getPrnt, getNdKids, getNdLng, getNdPrid, getNdPtid, getPath, getSubtree
- Calculate tree statistics and tree associated information: calcDstMtrx, calcTrDst, calcPhyDv, calcFrPrp
- Set node or overall tree values: setNdSpn, setAge, setPD, setRoot, setTol
- Change tree structure by adding or removing tips and nodes: addTip, rmTip, pinTip
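As an illustration of the last category, the sketch below removes a tip from a plain named node list in base R. It does not reproduce the package's rmTip: rm_tip_sketch is a hypothetical helper, it assumes the first element of prid is the immediate parent, and it leaves the root untouched even if the root ends up with a single child.

```r
# Rooted tree ((A,B)n2,C)n1 as a named list of nodes.
ndlst <- list(
  n1 = list(id = "n1", spn = 0, ptid = c("n2", "C")),
  n2 = list(id = "n2", spn = 1, prid = "n1", ptid = c("A", "B")),
  A  = list(id = "A",  spn = 1, prid = c("n2", "n1")),
  B  = list(id = "B",  spn = 1, prid = c("n2", "n1")),
  C  = list(id = "C",  spn = 2, prid = "n1")
)

# Hypothetical helper: remove tip `tid` and collapse its parent if the
# parent is left with a single child.
rm_tip_sketch <- function(ndlst, tid) {
  parent <- ndlst[[tid]]$prid[1]  # assumption: first prid is the parent
  ndlst[[tid]] <- NULL
  sibs <- setdiff(ndlst[[parent]]$ptid, tid)
  ndlst[[parent]]$ptid <- sibs
  if (length(sibs) == 1 && !is.null(ndlst[[parent]]$prid)) {
    sib <- sibs
    grand <- ndlst[[parent]]$prid[1]
    # merge the parent's edge into the remaining child
    ndlst[[sib]]$spn <- ndlst[[sib]]$spn + ndlst[[parent]]$spn
    # the grandparent now points directly at the remaining child
    ndlst[[grand]]$ptid[ndlst[[grand]]$ptid == parent] <- sib
    ndlst[[parent]] <- NULL
    # drop the collapsed node from every remaining prid vector
    for (id in names(ndlst)) {
      if (!is.null(ndlst[[id]]$prid)) {
        ndlst[[id]]$prid <- setdiff(ndlst[[id]]$prid, parent)
      }
    }
  }
  ndlst
}

pruned <- rm_tip_sketch(ndlst, "B")
names(pruned)
```

The nodes that remain keep their names, so an ID recorded before the manipulation still identifies the same node afterwards.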
Because the TreeMan class is built on the ndlst, all functions that run over this list are vectorised. Wherever possible, treeman functions use plyr vectorisation, providing substantial performance benefits, as computation no longer takes place at the scripting level. Through plyr these functions can also be parallelised using the “.parallel” argument, which is passed on to plyr functions and works in conjunction with parallel R packages such as doMC and doSNOW.
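The pattern of running one function over every node can be sketched as follows. This is an assumption-laden illustration rather than package code: get_spn is a hypothetical helper, the node list is hand-built, and the block falls back to base R's vapply when plyr is not installed.

```r
# Hand-built node list; only the spn slot matters for this example.
ndlst <- list(
  n1 = list(id = "n1", spn = 0),
  n2 = list(id = "n2", spn = 1),
  A  = list(id = "A",  spn = 1),
  B  = list(id = "B",  spn = 1),
  C  = list(id = "C",  spn = 2)
)

# Hypothetical per-node function applied across the whole list.
get_spn <- function(nd) nd$spn

if (requireNamespace("plyr", quietly = TRUE)) {
  # Setting .parallel = TRUE would additionally need a registered
  # foreach backend, e.g. doMC::registerDoMC(); it is not enabled here.
  spns <- unlist(plyr::llply(ndlst, get_spn, .parallel = FALSE))
} else {
  spns <- vapply(ndlst, get_spn, numeric(1))
}

total_length <- sum(spns)  # summed edge length of the whole tree
total_length
```

The per-node function never needs to know where a node sits in the list, which is why the same call can be handed to a parallel backend unchanged.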
Results and discussion
To show the TreeMan class in use and how the treeman functions can be combined to complete complex tasks, we present three use cases: pinning missing taxa to a molecular phylogenetic tree using online taxonomic databases; simulating phylogenetic trees through time using different models of evolution; and testing for significant phylogenetic turnover between ecological communities.
Tip pinning: adding missing taxa to a tree using online taxonomies
Tree simulation: generating trees using different models of evolution
Testing for significant phylogenetic turnover
TreeMan is an S4 class that encodes a phylogenetic tree using a node list. The advantages of a node list are faster computational processing and the ready capacity to track nodes between manipulations and to vectorise or parallelise large-scale tree manipulations. The treeman package introduces new terminology to describe the different elements of a tree and uses a naming convention that combines these terms into a more intuitive set of methods for tree manipulation.
DJB initiated the project, developed the code and wrote the paper. STT and MDS provided supervision, ideas and critically reviewed the final manuscript. All authors read and approved the final manuscript.
Thanks to Susy Echeverría-Londoño for testing the initial code, and to the volunteers at CRAN for making the code available.
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the project GitHub repository, https://github.com/DomBennett/treeman.
This project was funded by a Natural Environment Research Council (NERC, UK) PhD grant.
Project home page—https://github.com/DomBennett/treeman
Operating system(s)—platform independent
Other requirements—R v. 3+
Any restrictions to use by non-academics—none.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Paradis E, Blomberg S, Bolker B, Claude J, Cuong HS, Desper R, Didier G, Durand B, Dutheil J, Gascuel O. ape: Analyses of phylogenetics and evolution. 2016. https://cran.r-project.org/web/packages/ape/index.html. Accessed 24 Feb 2016.
2. Paradis E, Claude J, Strimmer K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20(2):289–90. doi:https://doi.org/10.1093/bioinformatics/btg412.
3. Schliep KP. phangorn: phylogenetic analysis in R. Bioinformatics. 2011;27(4):592–3. doi:https://doi.org/10.1093/bioinformatics/btq706.
4. Revell LJ. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol Evol. 2012;3:217–23. doi:https://doi.org/10.1111/j.2041-210X.2011.00169.x.
5. Michonneau F. phylobase: Base package for phylogenetic structures and comparative data. 2016. https://cran.r-project.org/web/packages/phylobase/index.html. Accessed 3 Sep 2016.
6. Faith D. Conservation evaluation and phylogenetic diversity. Biol Conserv. 1992;61(1):1–10. doi:https://doi.org/10.1016/0006-3207(92).
7. Isaac NJB, Turvey ST, Collen B, Waterman C, Baillie JEM. Mammals on the EDGE: conservation priorities based on threat and phylogeny. PLoS ONE. 2007;2(3):e296. doi:https://doi.org/10.1371/journal.pone.0000296.
8. Smith SA, Dunn CW. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics. 2008;24(5):715–6. doi:https://doi.org/10.1093/bioinformatics/btm619.
9. Bogdanowicz D, Giaro K, Wróbel B. TreeCmp: comparison of trees in polynomial time. Evolut Bioinform Online. 2012;8:475–87. doi:https://doi.org/10.4137/EBO.S9657.
10. Csardi G. igraph: network analysis and visualization. 2015. https://cran.r-project.org/web/packages/igraph/index.html. Accessed 3 Sep 2016.
11. Wickham H. The split-apply-combine strategy for data. J Stat Softw. 2011;40:1–29. doi:https://doi.org/10.18637/jss.v040.i01.
12. Calaway R, Weston S, Revolution analytics. doMC: foreach parallel adaptor for ‘parallel’. 2015. https://cran.r-project.org/web/packages/doMC/index.html. Accessed 17 April 2016.
13. Calaway R, Weston S, Revolution analytics. doSNOW: foreach parallel adaptor for the ‘snow’ package. 2015. https://cran.r-project.org/web/packages/doSNOW/index.html. Accessed 17 April 2016.
14. Thomas GH, Hartmann K, Jetz W, Joy JB, Mimoto A, Mooers AO. PASTIS: an R package to facilitate phylogenetic assembly with soft taxonomic inferences. Methods Ecol Evol. 2013;4(11):1011–7. doi:https://doi.org/10.1111/2041-210X.12117.
15. Bininda-Emonds ORP, Cardillo M, Jones KE, MacPhee RDE, Beck RMD, Grenyer R, Price SA, Vos R, Gittleman JL, Purvis A. The delayed rise of present-day mammals. Nature. 2007;446(7135):507–12. doi:https://doi.org/10.1038/nature05634.
16. Federhen S. The NCBI taxonomy database. Nucleic Acids Res. 2012;40(Database issue):D136–43. doi:https://doi.org/10.1093/nar/gkr1178.
17. Mooers AO, Heard SB. Inferring evolutionary process from phylogenetic tree shape. Q Rev Biol. 1997;72:31–54.
18. Purvis A, Fritz SA, Rodríguez J, Harvey PH, Grenyer R. The shape of mammalian phylogeny: patterns, processes and scales. Philos Trans Royal Soc Lond B. 2011;366(1577):2462–77. doi:https://doi.org/10.1098/rstb.2011.0025.
19. Hagen O, Hartmann K, Steel M, Stadler T. Age-dependent speciation can explain the shape of empirical phylogenies. Systematic Biol. 2015;64(3):432–40.
20. Rabosky DL, Goldberg EE. Model inadequacy and mistaken inferences of trait-dependent speciation. Syst Biol. 2015;64(2):340–55.
21. Bennett DJ, Sutton MD, Turvey ST. Evolutionarily distinct “living fossils” require both lower speciation and lower extinction rates. Paleobiology. (In press).
22. Frishkoff L, Karp D, M’Gonigle L. Loss of avian phylogenetic diversity in neotropical agricultural systems. Science. 2014. doi:https://doi.org/10.7910/DVN/26910.
23. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71(12):8228–35. doi:https://doi.org/10.1128/AEM.71.12.8228-8235.2005.
24. Bennett DJ. MoreTreeTools: more phylogenetic tree tools in R (development copy). 2016. https://zenodo.org/badge/latestdoi/4641/DomBennett/MoreTreeTools. Accessed 24 July 2016.