GO Trimming: Systematically reducing redundancy in large Gene Ontology datasets

Background: The increased accessibility of gene expression tools has enabled a wide variety of experiments utilizing transcriptomic analyses. As these tools increase in prevalence, the need for improved standardization in processing and presentation of data increases, as does the need to guard against interpretation bias. Gene Ontology (GO) analysis is a powerful method of interpreting and summarizing biological functions. However, while there are many tools available to investigate GO enrichment, there remains a need for methods that directly remove redundant terms from enriched GO lists that often provide little, if any, additional information.
Findings: Here we present a simple yet novel method called GO Trimming that utilizes an algorithm designed to reduce redundancy in lists of enriched GO categories. Depending on the needs of the user, this method can be performed with variable stringency. In the example presented here, an initial list of 90 terms was reduced to 54, eliminating 36 largely redundant terms. We also compare this method to existing methods and find that GO Trimming, while simple, performs well to eliminate redundant terms in a large dataset throughout the depth of the GO hierarchy.
Conclusions: The GO Trimming method provides an alternative to other procedures, some of which involve removing large numbers of terms prior to enrichment analysis. This method should free up the researcher from analyzing overly large, redundant lists, and instead enable the concise presentation of manageable, informative GO lists. The implementation of this tool is freely available at: http://lucy.ceh.uvic.ca/go_trimming/cbr_go_trimming.py


Background
Transcriptomic experiments conducted using high-density microarrays or RNA-seq often compare two or more states and can generate differentially expressed gene lists comprising hundreds or thousands of genes. These datasets generally require further analysis to identify reliable patterns in expression profiles, such as developmental changes at certain time points, and variable biological processes, such as metabolic pathways. Further analyses such as Gene Ontology (GO) enrichment, pathway enrichment, or clustering methods [1,2] can aid in both the discovery and summarization of important large-scale expression patterns.
GO vocabularies are structured as directed acyclic graphs with a clearly defined hierarchical structure. However, this hierarchy contains an added complexity by allowing terms to have multiple parents, or ascendants [3]. An ascendant and a descendant exist in a defined parent-child relationship and constitute a path through the GO hierarchy, connected by zero or more intermediate GO terms. A gene annotated with any term is also annotated with every term that is an ascendant, or parent term, of the more specific term; each GO category will contain all of the genes from each of its children categories. As a gene will be annotated by a term and every ancestor of this term, terms at a variety of depths in the hierarchy will appear in an enriched GO list, given that the GO tool being used recognizes all levels of annotation for an input gene. In most cases, multiple terms from the same hierarchical path will appear in a significant GO list. These multiple categories of differing specificities are not necessarily problematic. On the contrary, they allow for several levels of interpretation, ranging from specific terms that encompass few genes, to higher-level categories that may describe large-scale effects on the system being studied.
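The inheritance rule described above can be illustrated with a minimal sketch. It assumes the hierarchy is available as a child-to-parents mapping; the function names `all_ancestors` and `propagate_annotations`, as well as the toy terms and genes, are hypothetical and are not taken from the GO Trimming implementation.

```python
from collections import defaultdict


def all_ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given a child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate_annotations(direct, parents):
    """Map each term to the genes annotated to it directly or through a descendant."""
    full = defaultdict(set)
    for term, genes in direct.items():
        full[term] |= genes
        for ancestor in all_ancestors(term, parents):
            full[ancestor] |= genes
    return dict(full)


# Hypothetical toy hierarchy: "child" has two parents that share one root.
parents = {
    "child": ["parent_1", "parent_2"],
    "parent_1": ["root"],
    "parent_2": ["root"],
}
# Hypothetical direct annotations (term -> genes annotated at that exact term).
direct = {"child": {"g1", "g2"}, "parent_1": {"g3"}}

full = propagate_annotations(direct, parents)
for term in sorted(full):
    print(term, sorted(full[term]))
```

In this toy case every gene of the child category also appears in both parents and in the root, which is why several terms from one path can be enriched by the same group of genes.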
A current area of study to improve GO analysis focuses on the issue of interdependence between terms in the GO hierarchy, the problem being that many tools used to investigate GO enrichment search for enrichment on a term-for-term basis and do not account for correlations among terms along a path in the hierarchy [4]. Due to the detailed structure and incremental specificity of the GO database, as well as the correlation among enriched terms in a path, there will often be instances in which multiple categories from the same path appear in a list and differ only slightly, or not at all in gene content. When this occurs, often the parent term provides no additional information to the researcher or reader, especially when the terms themselves differ only in a small qualifier. An incorrect assumption can be made by the researcher that the multiplicity of similar categories increases the importance of the function or process to which they relate, whereas it is more likely that one group of genes is causing the inclusion of multiple terms. Accordingly, it makes analysis and presentation clearer when closely related terms containing the same genes are removed. However, this term removal can introduce another issue if the subset of terms to be presented is selected mainly due to the specific interest of the researcher. Reducing the size of GO lists in this manner typically uses arbitrary criteria and can be misleading, as the selected subset may not adequately reflect the entire dataset.
Several tools and databases have been developed for the purpose of reducing the inclusion of terms of varying specificity. One type of method proposed to address this issue involves reducing the size of the input database. The Gene Ontology Consortium has produced the GO Slim database, which is a subset comprising more general GO terms [5]. Alternatively, the GO Fat database, developed as part of the Annotation Tool of the DAVID suite of bioinformatics resources, is a subset comprising more specific terms [6]. Both of these methods function by limiting information prior to the enrichment test and therefore do not fully utilize the complete GO database.
An alternative way to reduce the number of enriched terms is to apply multiple test corrections (MTC) during GO enrichment tests. These methods not only reduce the output dataset, but can also reduce false positives. Goeman and Mansmann [7] present a method that uses the structure of the GO hierarchy to perform MTC in a top-down, bottom-up, or bi-directional 'focus level' manner (working in both directions from a user-defined level of the hierarchy), depending on the desired objective of the researcher.
Other methods use the full GO database for enrichment analysis and remove terms or modify relationships between terms during the enrichment test to address the interdependency issue [4,8]. Alexa et al. [8] present iterative bottom-up algorithms that either remove genes from parent categories when a child in the same path is significantly enriched (elim), or reduce the weight of genes in categories that have more significant neighbours in their path (weight). Grossmann et al. [4] present the parent-child intersection/union algorithms to reduce the inheritance problem by investigating enrichment in the context of parent-child relationships; a term is significant only if the enrichment is due to its own enriched gene set, rather than due to genes inherited from other categories.
Alterovitz et al. [9] have provided a method to investigate categories across a specified information level of the GO hierarchy by generating a numerical value of the "information content" of each GO term and thus could be used to reduce the size of the output dataset.
Clustering GO terms with similar content has also been implemented in both the GOstat and DAVID tools [10,11]. Clustering GO terms using these tools increases an understanding of commonalities between terms due to containment of similar sets of genes. However, these methods do not select or remove terms from the total list of enriched terms, and therefore the large dataset remains. Additionally, these clusters do not use the GO hierarchy or follow the parent-child path, but rather cluster based on gene content alone.
We have developed a simple, systematic method called GO Trimming that removes redundancy from a GO category list after enrichment scores are given to terms and that is independent of any statistical package or analysis method. This method consists of an algorithm that is executed in two phases. We present an example of this process performed on a sample GO dataset, and highlight the categories that would be removed by the GO Trimming process according to different levels of stringency. Additionally, we compare GO Trimming to several of the aforementioned approaches performed on a second published dataset.

Algorithm
The GO Trimming algorithm is fully described by the flowchart in Figure 1 and is outlined here. GO Trimming consists of two phases, each requiring one pass through the list of significant GO terms. In the first phase, terms are connected to all other terms that share a common path by labelling with common identifiers (Figure 1A). In the second phase, terms are removed from the list based on levels of redundancy between terms found in a given path (Figure 1B).
Figure 1. GO Trimming algorithm flowchart. A) Phase 1 of GO Trimming: identification of parent-child relationships in GO hierarchy. B) Phase 2 of GO Trimming: strict and soft trimming using 0% and 40% uniqueness thresholds. Green boxes represent start and endpoints for the algorithm. Blue parallelograms represent input and output steps. Red rectangles represent an action required by the user and yellow diamonds represent questions that determine the flow of the algorithm. Input for the algorithm is the query list (list of enriched GO terms) and the GO tree (hierarchy of all GO terms). Output is a list of GO terms with soft and strict trimmed terms removed.
The list of significantly enriched GO terms is retrieved by the researcher (Figure 1A) from any tool that performs statistical testing on GO categories (e.g. GeneSpring GX (Agilent, Santa Clara, CA); GOstat [10]). In addition, each term must have the following associated information: the number of differentially regulated genes annotated with the term (labelled "Diff. Genes" in Table 1) and the total number of genes annotated with the GO term (e.g. all genes on a microarray annotated with the term; labelled "Total Genes" in Table 1). Since these are the numbers used to test for enrichment, these values should be available from the same tool used to generate the list of significant terms. The hierarchical relationships of these categories are then obtained from the GO tree, whether from within a software package like GeneSpring GX, from the web platform AmiGO [12], or from a downloadable database [13]. GO categories are considered query terms and are sorted in ascending order of "Total Genes". This order of query terms is maintained throughout both phases. Care should be taken after sorting to ensure that in cases of equal totals for a pair of terms in a parent-child relationship, parents are ranked after children in the list to avoid missed identification.
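The ordering requirement can be sketched as follows. This is one possible way to satisfy the tie rule, not necessarily the one used in the published script: it assumes each term's full ancestor set is known and breaks ties by the number of ancestors present in the query list, since a child always has more listed ancestors than its own parent. The function name `sort_query_terms` and the toy terms are hypothetical.

```python
def sort_query_terms(total_genes, ancestors):
    """Sort terms by ascending "Total Genes", placing children before parents on ties.

    total_genes: dict mapping term -> total number of annotated genes
    ancestors:   dict mapping term -> set of all its ancestors in the GO tree
    """
    in_list = set(total_genes)

    def key(term):
        listed_ancestors = len(ancestors.get(term, set()) & in_list)
        # More listed ancestors means a more specific term, so it sorts first on ties.
        return (total_genes[term], -listed_ancestors)

    return sorted(total_genes, key=key)


# Hypothetical query list in which a parent and child have equal totals.
total_genes = {"parent": 12, "child": 12, "broad": 40, "other": 7}
ancestors = {"child": {"parent", "broad"}, "parent": {"broad"}}

ordered = sort_query_terms(total_genes, ancestors)
print(ordered)
```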
For any set of genes annotated with a common GO term, it can be said that this GO category contains this set of genes. For each term in the list of GO categories, all ancestors of the term (i.e. having parent-child relationships with the term) are examined, again starting with the term containing the fewest genes and moving towards broader terms. All parent-child paths are labelled with unique identifiers to mark the hierarchical connections between terms. Two terms may have a common parent, so each of these two children will be labelled with different identifiers, and the parent will be labelled with both identifiers. A term may have no parents present in the GO list, and therefore will have no identifier.
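A minimal sketch of this labelling phase is given below. Because the flowchart itself is not reproduced here, the exact rule for issuing identifiers is an assumption: a term that has at least one ancestor in the list and no identifier yet starts a new identifier, and every listed ancestor inherits the identifiers of the term. The name `label_paths` and the toy terms are hypothetical.

```python
from itertools import count


def label_paths(sorted_terms, ancestors):
    """Assign path identifiers to query terms (phase 1, sketched under assumptions).

    sorted_terms: terms in ascending order of "Total Genes" (children first)
    ancestors:    dict mapping term -> set of all its ancestors in the GO tree
    """
    in_list = set(sorted_terms)
    ids = {term: set() for term in sorted_terms}
    next_id = count(1)
    for term in sorted_terms:
        listed = ancestors.get(term, set()) & in_list
        if not listed:
            continue  # no ancestor in the list, so this term starts no path
        if not ids[term]:
            ids[term].add(next(next_id))  # assumed rule: open a new path here
        for ancestor in listed:
            ids[ancestor] |= ids[term]  # ancestors carry every descendant ID
    return ids


# Hypothetical list: two children share one parent; "lonely" has no relatives.
sorted_terms = ["child_a", "child_b", "parent", "lonely"]
ancestors = {"child_a": {"parent"}, "child_b": {"parent"}}

ids = label_paths(sorted_terms, ancestors)
for term in sorted_terms:
    print(term, sorted(ids[term]))
```

In the toy list the two children receive different identifiers, the shared parent carries both, and the unrelated term carries none, matching the behaviour described above.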
Once the entire list has been processed in the first phase, two types of trimming can be applied (Figure 1B). The first, stricter approach removes terms that are entirely redundant. The second approach, soft trimming, uses more relaxed stringency, and terms that are largely redundant can be removed. A uniqueness threshold was designed to filter terms based on the respective gene sets contained by the parent and child categories. A value of 0% is used for the strict approach and a value of 40% is used for the soft trimming approach. If the parent category contains the same set of genes as the child category (i.e. parent contains 0% unique genes), it is deemed fully redundant and removed from the strict list (and soft list). With respect to the soft trimming threshold, if the number of additional genes in the parent term is 40% or less of the number of genes in the child term (e.g. the child term contains ten genes; the parent term contains these ten plus an additional four), the parent term is removed from the soft trimmed list. In both of these examples, the more specific child category is retained.
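The threshold test can be written out as a short sketch. It assumes that uniqueness is the number of genes found only in the parent divided by the number of genes in the child, which is the reading consistent with the ten-plus-four example; which gene sets are compared (for instance, the differentially regulated genes of each term) is likewise an assumption. The names `uniqueness` and `trimming_class` are hypothetical.

```python
def uniqueness(parent_genes, child_genes):
    """Genes unique to the parent, as a fraction of the child's gene count (assumed definition)."""
    return len(parent_genes - child_genes) / len(child_genes)


def trimming_class(parent_genes, child_genes, soft_threshold=0.40):
    """Classify a parent term relative to one child: 'strict', 'soft', or 'keep'."""
    value = uniqueness(parent_genes, child_genes)
    if value == 0:
        return "strict"  # fully redundant: removed from strict and soft lists
    if value <= soft_threshold:
        return "soft"  # largely redundant: removed from the soft list only
    return "keep"


# The worked example from the text: ten child genes, parent adds four more.
child = {f"gene_{i}" for i in range(10)}
parent = child | {"extra_1", "extra_2", "extra_3", "extra_4"}

print(uniqueness(parent, child))
print(trimming_class(parent, child))
print(trimming_class(child, child))
```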
If a category is involved in multiple paths in the list and so has more than one identifier, it can only be removed if a descendant shares all identifiers (i.e. has the same ID set) and indicates that the parent category is redundant. When both soft and strict trimming are performed concurrently, soft trimmed terms should not be removed from the list until the end of the process as they may still be used in the strict trimming approach. Note that some IDs may be processed multiple times through the course of the list with different untrimmed terms as the child. This allows for broader terms to be trimmed based on the representation of intermediate terms. During the second phase, it is still important to check for ancestry before trimming, since the structure of the GO hierarchy (i.e. a term can have multiple parents) may allow two terms to have the same ID set, yet not be in a parent-child relationship. As a final note, the 40% value is somewhat arbitrary, and can be raised or lowered, but provides a cut-off that can be used to eliminate terms that seem to be generally unworthy of separate discussion from related terms. Once trimming is complete, all terms that were soft or strict trimmed can be fully removed from the GO list and the reduced list can be presented as fully processed.
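The second phase can be sketched for a single threshold as shown below. Several points are assumptions rather than details confirmed by the text: the uniqueness measure is taken as the parent's additional genes relative to the child's gene count, a term already trimmed at the current threshold is not used as a child, and strict and soft trimming are run as two independent calls rather than concurrently. The name `trim` and all toy data are hypothetical.

```python
def trim(sorted_terms, ancestors, ids, genes, threshold):
    """Return the terms judged redundant at one uniqueness threshold (phase 2 sketch).

    sorted_terms: terms in ascending order of "Total Genes"
    ancestors:    dict mapping term -> set of all its ancestors in the GO tree
    ids:          dict mapping term -> set of path identifiers from phase 1
    genes:        dict mapping term -> non-empty set of genes in the category
    threshold:    0.0 for strict trimming, e.g. 0.40 for soft trimming
    """
    in_list = set(sorted_terms)
    trimmed = set()
    for child in sorted_terms:
        if child in trimmed or not ids[child]:
            continue  # assumed: only untrimmed, labelled terms act as children
        # Iterating over true ancestors enforces the ancestry check.
        for parent in ancestors.get(child, set()) & in_list:
            if parent in trimmed or ids[parent] != ids[child]:
                continue  # the descendant must share the whole ID set
            extra = len(genes[parent] - genes[child])
            if extra / len(genes[child]) <= threshold:
                trimmed.add(parent)
    return trimmed


# Hypothetical single path a -> b -> c plus an unrelated term d.
sorted_terms = ["a", "b", "c", "d"]
ancestors = {"a": {"b", "c"}, "b": {"c"}}
ids = {"a": {1}, "b": {1}, "c": {1}, "d": set()}
base = {f"g{i}" for i in range(10)}
genes = {"a": base, "b": set(base), "c": base | {"x1", "x2", "x3", "x4"}, "d": {"y1"}}

strict_trimmed = trim(sorted_terms, ancestors, ids, genes, threshold=0.0)
soft_trimmed = trim(sorted_terms, ancestors, ids, genes, threshold=0.40)
kept = [term for term in sorted_terms if term not in soft_trimmed]
print(sorted(strict_trimmed), sorted(soft_trimmed), kept)
```

Running each threshold separately keeps the full list intact until the end, in line with the recommendation that soft trimmed terms not be deleted while strict trimming is still in progress.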

Testing
We present here an example of the use of this method in removing redundant terms from an enriched Gene Ontology term list. This sample dataset was taken from a recent experiment exploring the transcriptional effects of sea lice (Lepeophtheirus salmonis) on pink salmon (Oncorhynchus gorbuscha) [14]. Although the parameters for the data presented here are slightly different from those for the GO lists shown in this earlier experiment, the biological question and the majority of the information remain the same. An initial list of 90 GO terms (Table 1) was enriched from an input list of 3388 differentially regulated entities. GeneSpring GX 11.0 was used to determine GO enrichment and the hierarchical relationships between enriched terms; the GO database used was from November 04, 2010.
Transcriptomic analysis can typically result in many enriched Gene Ontology terms, and in an attempt to reduce the number of terms enriched by chance, and to present a more manageable example dataset, only GO terms containing 5 or more differentially expressed genes were retained. This pre-filtering was done simply by imposing a threshold on the "Diff. Genes" value in a spreadsheet. This is an independent step from GO Trimming and should not have a substantial influence on the procedure. The DAVID tool [11] for finding enrichment offers a gene count threshold as well (default is 2) for the reason that terms with very few genes are less trustworthy as real trends. Through the GO Trimming process with 0%, 40%, and 50% uniqueness thresholds, the list was substantially reduced (Table 1), with 19 of 90 terms identified as completely redundant (0% threshold; bolded text). With the use of a 40% soft trimming threshold, another 15 terms were found to be largely redundant (bolded and italicized text). To show the relative flexibility of the threshold value, we also performed the procedure with a threshold of 50% (bolded, italicized and underlined text). With this reduced stringency, only two additional terms were removed from the list when compared with the list after the use of the 40% threshold. Note that several of the most specific and the most general terms are retained, and many of intermediate specificity are discarded.
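The pre-filtering step, which was done in a spreadsheet in this study, could equally be scripted. The sketch below assumes each enriched term is held as a small record; the field names, the function name `filter_by_diff_genes`, and the example rows are hypothetical.

```python
def filter_by_diff_genes(rows, min_diff_genes=5):
    """Keep only terms with at least `min_diff_genes` differentially expressed genes."""
    return [row for row in rows if row["diff_genes"] >= min_diff_genes]


# Hypothetical enriched terms with their "Diff. Genes" and "Total Genes" counts.
rows = [
    {"term": "term_a", "diff_genes": 12, "total_genes": 40},
    {"term": "term_b", "diff_genes": 3, "total_genes": 9},
    {"term": "term_c", "diff_genes": 5, "total_genes": 22},
]

kept_rows = filter_by_diff_genes(rows)
print([row["term"] for row in kept_rows])
```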
Looking at the list trimmed using the conservative approach (0% threshold), 19 of 90 terms were shown to be redundant. For example, "oxidoreductase activity, oxidizing metal ions" and "oxidoreductase activity, oxidizing metal ions, oxygen as acceptor" are removed and the term "ferroxidase activity" is retained.
With slightly less stringency, we can remove a number of terms that offer little additional information to the analysis. For example, "polysaccharide catabolic process" adds only one gene to the set of those annotated with the term "chitin catabolic process".
Highly similar terms that remain in the list are often a result of being from different GO domains, such as Biological Process and Molecular Function, the top-level categories of "chitin catabolic process" and "chitinase activity", respectively. Also "sister" terms that appear quite similar but are not in a parent-child relationship (e.g. "cell wall chitin metabolic process" and "chitin catabolic process") cannot be eliminated because they are from different hierarchical paths, and therefore may refer to distinct processes or functions.
In addition to this sample dataset, we performed a comparison of GO Trimming with other methods that attempt to ease interpretation or take into account the interdependencies in the hierarchy during enrichment testing. For this comparison, zebrafish (Danio rerio) was deemed to be a suitable organism of study because each of the methods we wished to compare provided the ability to use a ZFIN identifier [15] in annotating genes with GO terms. Accordingly, we found a microarray experiment studying hypoxia in D. rerio [16] that resulted in a large number of differentially expressed genes using a well annotated array [17]. Of 1520 significantly differentially regulated entities, 1017 had official gene symbols which were used to link to ZFIN IDs in the GO Consortium's ZFIN annotation file (May 27, 2011) [18]. 617 genes had a corresponding ZFIN ID with GO annotation. This set of 617 ZFIN IDs was the sample dataset or the list of differentially regulated genes. In the same manner, of 42990 entities on the whole array, 24888 had a gene symbol, of which 12674 had a linked ZFIN ID associated with GO annotation. This was the population or total set of genes.
It is apparent that level of annotation and choice of statistical test have a large influence on the results of enrichment testing. For example, GOstat uses a χ² test and does not permit custom annotation files [10]. Therefore, the default ZFIN annotation database was used, which appeared to have annotation for all but 17 terms in the sample list. This resulted in a large difference in significant terms when compared to the traditional term-for-term method employed in the Ontologizer (175 vs. 236 terms; p-value ≤ 0.1) [19]. Additionally, GOstat did not include an option for a 0.05 p-value cut-off. The DAVID tool also used its own associations with ZFIN IDs, resulting in a different list of enriched terms.
Due to the differences in statistical tests and annotation, we restricted the formal comparison to those tests which could be performed using the Ontologizer [19], including elim, weight, and the parent-child methods [4,8]. The traditional term-for-term method was also included, and the output of this method was used as the input list for the GO Trimming process. A p-value cutoff of 0.05 was employed, and no MTC was used, as the impact of MTC may differ between methods. In this sample dataset, no gene count threshold was used.
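For orientation, a term-for-term enrichment test can be sketched as below. The text does not spell out the statistic, so the sketch assumes a one-sided hypergeometric test, a common choice for this kind of analysis. The population (12674) and sample (617) sizes come from the dataset described above, whereas the per-term counts and the function name `enrichment_p_value` are hypothetical.

```python
from math import comb


def enrichment_p_value(pop_total, pop_term, sample_total, sample_term):
    """One-sided hypergeometric p-value: P(X >= sample_term).

    pop_total:    genes in the population set
    pop_term:     population genes annotated with the term
    sample_total: genes in the sample (differentially regulated) set
    sample_term:  sample genes annotated with the term
    """
    upper = min(pop_term, sample_total)
    denominator = comb(pop_total, sample_total)
    tail = sum(
        comb(pop_term, k) * comb(pop_total - pop_term, sample_total - k)
        for k in range(sample_term, upper + 1)
    )
    return tail / denominator


# Population and sample sizes from the text; the term counts are made up.
p_value = enrichment_p_value(pop_total=12674, pop_term=100,
                             sample_total=617, sample_term=12)
print(p_value, p_value <= 0.05)  # compared against the 0.05 cutoff, no MTC
```

Each term is tested independently in this scheme, which is why correlated terms along one path can all pass the cutoff and why a post hoc step such as GO Trimming is useful.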
The significantly enriched terms (p-value ≤ 0.05) resulting from each method are presented in Additional file 1. Any term enriched through one or more methods is listed in the table, and the enrichment is represented by a p-value. In summary, the term-for-term method resulted in the largest number of enriched terms (147), followed by elim (137), weight (86), GO Trimming (80), parent-child union (78), and parent-child intersection (46). The term-for-term output included all but 24 terms; the elim, parent-child intersection and parent-child union methods resulted in the inclusion of some additional terms.
We used the D. rerio dataset on two other methods: GOstat clustering [10] and DAVID clustering [11] (data not shown). Instead of reducing the number of terms produced by enrichment testing, the significant terms are clustered into groups that aim to improve interpretation of the results.

Discussion
GO Trimming was designed with the idea of reducing redundancy while fully utilizing the size and detail of the GO database. We believe this method is versatile and can be tailored to the needs of researchers while still being systematic by nature so that it can be easily integrated into an analysis workflow.
The first sample dataset (Table 1) provides a good example of what GO Trimming does and does not do. In this example, there is no real biological information lost to the researcher through the trimming process. Nor in fact, is there any biological information added that was not already present. The p-value of individual terms is not adjusted. Neither does GO Trimming serve to intentionally eliminate false positives. In fact this process is independent of MTC, as MTC can be applied during the enrichment testing, and GO Trimming may be performed on the output list.
After trimming, the information in Table 1 becomes more focused and balanced, making interpretation easier. Redundant terms no longer overwhelm the list as in the cases of IDs 3 and 12. Terms in a unique path, such as "immune response", a biologically important term, do not become lost in long lists [20]. Once redundant terms are removed, such as the terms related to polysaccharide, actin, and oxidoreductase functions, it becomes easier to consider and present the entire list, and terms such as "ruffle", which was important in the context of the experiment [14], can come to the forefront. Additionally, it becomes more feasible to present the list in the manuscript, instead of picking out only a select few to discuss. Ultimately it is up to the researcher how to interpret and discuss results; however, GO Trimming provides a way to assist in this process.
Although the GO Trimming output list is easier to manage and interpret than an untrimmed list, it is important to use a non-destructive workflow where the full list of terms is retained for potential further exploration of specific results. We encourage researchers to append the full GO category list to published articles as supplemental documents, but for general table and text presentation, the trimmed list should be used.
Furthermore, working with GO lists in this way can familiarize the researcher with the general patterns and functions present in the data. Adding identifiers to parent-child relationships in the GO lists not only assists with the trimming process, but also connects terms with related functions and properties, allowing for ease in locating recurring themes. For example, in the sample dataset above (Table 1), "actin cytoskeleton reorganization" and "myofibril assembly" share the parent "actin cytoskeleton organization and biogenesis".
Terms with multiple identifiers represent a synthesis of information, as they represent the union of multiple paths in the GO hierarchy. Alternatively, terms without identifiers are those that have no parents or children present in the list. This in itself provides some information about the term and the associated genes. Increased understanding of the connections between terms will allow for increased comprehension of the processes under investigation.
Understanding these connections between terms is a similar benefit to that offered by GOstat or DAVID clustering [10,11]. These clustering methods cannot be directly compared to GO Trimming, primarily due to the structure of the output. Instead of reducing terms in the output lists, terms are organized into categories that share information. One benefit of GO Trimming is the more manageable presentation. With clustering, either the researcher can present all terms, which can result in very sizeable lists, or the researcher can select a representative from each cluster to present. If a representative is selected, such as the most significant term, a main function or process may be preserved, but other valuable information could be lost.
With respect to GOstat specifically, the clustering method is highly inclusive while creating clusters. Any term containing a subset of genes annotated to another term will be clustered together. This does not take into account the GO hierarchy, which in some cases may be beneficial, in that closely related terms under different roots (e.g. "biological process" and "molecular function") can be clustered together. However, this can also result in more disparate terms being grouped into a cluster, simply by containing common genes. Regarding the clustered output of DAVID, non-significant terms appear to be included, which could be removed after clustering. Also, clusters can consist entirely of terms that are essentially redundant. Overall, these clustering methods can aid in interpretation of results; however, the problem of redundancy is not addressed, the GO hierarchy is not taken into account, and clusters can be too inclusive.
The comparison between GO Trimming and methods employed by the Ontologizer [19] provides insight into benefits and drawbacks of each method. There are a few trends identifiable based on specificity of enriched terms (Additional file 1). It is apparent that elim and weight methods produce more specific enriched terms and fewer general terms. This is not necessarily negative, since specific terms are arguably more interesting and informative to a researcher, although it can be informative to examine higher level terms. One major benefit of Gene Ontology is the ability to identify functions and processes at different depths [21]. Furthermore, there appear to be many redundant terms enriched using the elim and weight methods through the Ontologizer (e.g. many parents of "negative regulation of neutrophil chemotaxis").
The parent-child union method seems to result in a much lower level of redundancy and provides a lighter but still informative set of terms. The method behind it appears to be strong, in checking for enrichment of a term in the context of its parent(s), but it too results in some redundancy in the list (e.g. "branching morphogenesis of a tube", "morphogenesis of branching epithelium", "morphogenesis of a branching structure"; Additional file 1). The parent-child intersection method seems to result in far fewer terms being enriched, and may be too stringent, resulting in information being lost. These methods are more directed towards decorrelating terms from each other so as to minimize the effect of genes being inherited through the hierarchy and to reduce false positives [4].
Compared to these other methods, GO Trimming is highly effective at reducing redundancy at both specific and general levels. For example, regarding the parents of "negative regulation of neutrophil chemotaxis", GO Trimming removes 10-12 closely related terms, many of which are included in the results of the other methods (with the exception of parent-child intersection). At the general level, many parent terms of "ATP binding" are removed by GO Trimming. While it does not address the issue of false positives, GO Trimming specifically targets and reduces redundancy without losing information, which may occur through more stringent methods.

Conclusion
We have developed a novel method for systematically reducing redundancy in Gene Ontology datasets. The simplicity of this method allows for ease of incorporation into a typical transcriptomic workflow, while still using the full structure of the GO hierarchy. It focuses on improving interpretation and presentation, and compares well against other GO enrichment methods that take into consideration interdependencies in the GO hierarchy. With the exception of the stringent parent-child intersection method, the resulting list of terms contains the least redundancy, offering a cleaner, more focused representation of the dataset. With this method, researchers are able to analyze and present terms in a way that will provide the most information about the genes and systems being studied.

Additional material
Additional file 1: Supplementary Table 1. Comparison of GO Trimming and enrichment methods on D. rerio dataset. Using the Ontologizer tool, a number of methods produce statistically enriched GO terms (p-value ≤ 0.05; no MTC) from a set of 617 differentially regulated genes. The union of GO terms enriched by one or more methods is presented, along with the p-values each method produced for enriched terms, sorted by "Total Genes" from specific terms to general terms. GO Trimming was performed on the output of the traditional term-for-term method using a 40% soft trimming threshold. P-values from the term-for-term method are presented for those terms retained by the GO Trimming method.