- Technical Note
- Open Access
DivA: detection of non-homologous and very divergent regions in protein sequence alignments
BMC Research Notesvolume 7, Article number: 806 (2014)
Sequence alignments are used to find evidence of homology but sometimes contain regions that are difficult to align which can interfere with the quality of the subsequent analyses. Although it is possible to remove problematic regions manually, this is non-practical in large genome scale studies, and the results suffer from irreproducibility arising from subjectivity. Some automated alignment trimming methods have been developed to remove problematic regions in alignments but these mostly act by removing complete columns or complete sequences from the MSA, discarding a lot of informative sites.
Here we present a tool that identifies Divergent windows in protein sequence Alignments (DivA). DivA makes no assumptions on evolutionary models, and it is ideal for detecting incorrectly annotated segments within individual gene sequences. DivA works with a sliding-window approach to estimate four divergence-based parameters and their outlier values. It then classifies a window of a sequence of an alignment as very divergent (potentially non-homologous) if it presents a combination of outlier values for the four parameters it calculates. The windows classified as very divergent can optionally be masked in the alignment.
DivA automatically identifies very divergent and incorrectly annotated genic regions in MSAs avoiding the subjective and time-consuming problem of manual annotation. The output is clear to interpret and allows the user to take more informed decisions for reducing the amount of sequence discarded but still finding the potentially erroneous and non-homologous regions.
Multiple sequence alignments (MSAs) are the basis of comparative analyses that rely on sequence homology [1–4]. Alignments of homologous sequences are used to characterize protein domains, predict protein function, detect motifs and describe gene families, as well as to infer evolutionary relationships between species. However, often there are sections in MSAs that can contain sequences that are erroneously aligned. These correspond to regions that are i) under a rapid evolutionary rate, ii) non-homologous because of the choice of different splicing variants in the comparison between species, iii) wrongly annotated intron-exon barriers, iv) local structural rearrangements in a single species, etc. It is difficult to classify portions of an alignment as either very divergent or non-homologous but it has been shown that phylogenetic results are improved after removing divergent and ambiguously aligned blocks from protein sequence alignments . Sequences in a MSA should be neither so similar that they are devoid of variation among the sites nor so divergent that positions are saturated by multiple substitutions, especially for phylogenetic analyses [6, 7]. Some methods have been developed to automatically clean alignments, but they mostly work by removing complete sequences if determined to be unrelated  or by deleting complete columns of the MSA . Other approaches should be taken into account, such as the one used by Guidance , which can detect problematic sections of individual sequences located within regions with high alignment uncertainty (e.g. Figure 1). Alternatively, manual adjustment can be performed to remove or mask potential non-homologous regions by removing a minimum amount of sequence information, but this leads to biases and irreproducibility of the results and is impractical for large-scale genomic analyses.
We developed DivA, a method that uses a sliding window approach to detect sections of individual sequences that are very divergent to the rest of the alignment. DivA first calculates the distribution of four parameters for each sequence in each window based on sequence similarity using both sequence weighting with position specific counts and distance-based methods. The windows for each sequence are then classified as homologous or very divergent depending on automatically calculated thresholds based on the outlier values of each parameter determined using all windows from all alignments (Figure 2). The user can change the stringency of the thresholds by choosing a desired standard deviation of the parameter values. The output is a table containing the coordinates of the very divergent windows and the values of the parameters (Table 1). Also, the user has the option to output an alignment where the outlier segments are masked.
To test the performance of DivA we used 200 MSAs with sets of orthologous proteins generated by a recent avian phylogenomics project  and corresponding manual annotations of highly divergent sequence segments [see Additional file 1]. The 200 MSAs were chosen randomly from the 8295 orthologous sets in Jarvis et al (this paper presents whole genome data and the corresponding annotations for 48 bird species representing 36 orders of birds). Each MSA contains protein sequences from up to 48 birds and the corresponding orthologues from the more distantly related species Homo sapiens, Alligator mississippiensis, Chelonia mydas, and Anolis carolinensis (human, alligator, turtle, and lizard, respectively). DivA was run with the default program parameters on all the alignments, both including and excluding non-birds species. We also created subsets containing 50, 100 and 200 MSAs from these 200 MSAs in which we only keep the bird sequences and exclude the sequences of the distantly related species.
The first parameter A is based on the probability of observing the amino acids from a sequence in a given window of the alignment. The second parameter B is based on the smallest pairwise distance to another amino acid in that position calculated using blosum62 . The other two parameters used are the Z-scores per sequence per window of the A and B parameters, Z A and Z B , respectively. A detailed description of parameters A and B is presented next.
1. The first parameter A is based on the probability of observing the amino acids (AA s) in a sequence in a given window of the alignment. The probability of observing amino acid a in position i in a sequence S in an alignment corresponds to the counts for that amino acid, c(a i S), divided by the number of sequences N:
The parameter A for a window of sequence S corresponds to the sum of these probabilities for each position of the window divided by the length of the window L:
2. The second parameter B is based on the smallest pairwise distance to another amino acid in that position calculated using blosum62 . The use of a column-by-column score lowers the probability of an orthologous sequence that shares high sequence similarity with different orthologs in the different sites of the window to be labeled as very divergent. For each position i of the sequence S, a i S is compared to each of the amino acids on that position in the other sequences. The distance to amino acid b i in another sequence X that has the smallest dissimilarity to a i S corresponds to the highest pairwise blosum62  distance:
The B parameter corresponds to the average of those distances:
For the calculation of the parameters, sequence segments were discarded when presenting more than 40% gaps or a P(a i S) <70%, indicative of weakly conserved sequences. In order to ensure that only the smallest required amount of sequence is discarded, if the probability of observing the first or last amino acid in a given window of a given sequence is higher than 0.9 that amino acid is removed and the parameters are re-calculated for the resized window.
Determination of parameter thresholds
We started by analyzing the distribution of the four parameters values calculated for the 200 bird-only alignments (Figure 3). The distribution of the values on the very divergent and homologous windows is not clearly differentiated, with some values of some parameters from very divergent windows overlapping values from homologous windows, thus posing difficulties for a straightforward thresholds definition. For the calculation of the parameter thresholds, sequence segments were discarded when presenting more than 40% gaps, indicative of very conserved sequences. The outliers were defined in terms of the Z scores for all four parameters: Z( A ) < 1, Z( B ) < 2, Z( Z A ) > 2, and Z( Z B ) > 2. We used the decision tree method from the R package ‘tree’  to find the thresholds that define an outlier sequence segment according to the manual annotation (Additional file 2: Figure S1A).
We performed a 10-fold cross-validation with half of the dataset using the prune.tree function and the misclass method in R (package ‘tree’). This showed a low misclassification error rate (1.3e-05) and residual mean deviance (3.2e-04) (Additional file 2: Figure S1B and S2C). Furthermore, we used the other half of the dataset as validation set with the predict function, and classified it with the previously defined thresholds. In agreement with the previous result, we obtained a low misclassification rate of 1.3e-05 (Additional file 2: Figure S1D and S2E). The misclassification rate was calculated as the number of misclassifications divided by the total number of data points in the validation dataset.
Guidance performance comparison to DivA
Guidance  is a currently available program for detecting problems in single sequences in a MSA; it works by assigning a confidence value for every position of the alignment. We compared the output from Guidance and that from DivA using the dataset of 200-only bird alignment and the dataset of 200 alignments including the very divergent species. In Guidance two thresholds of confidence values were used; one considering the window as very divergent and potentially non-homologous if having a Guidance score equal or less than 0.4; and a second very relaxed one that considers the window as such if the threshold is less than 0.8.
Results and discussion
To test the impact of the size of the dataset, DivA was applied to the datasets of 50, 100, and 200 only-birds alignments. Efficiency tests were applied to the classified windows and the results show that the model performs best with big datasets (Table 2), as expected in a phylogenomics analysis where up to thousands of alignments are concatenated (Additional file 2: Figure S2). We further explored the impact of the divergence to the classification and showed that DivA has a very high sensitivity (81% TPR) for the bird-only alignments, compared to the alignments including other vertebrate species (47% TPR) (Table 2). This can be explained by the increasing difficulty in distinguishing between true divergence and error. We also examined the divergence in the alignment with the highest number of TP very divergent windows and the one with highest number of FP very divergent windows from the 200 only-birds alignments (Additional file 2: Table S1).
To our knowledge, the only other method currently available for detecting problems in single sequences in a MSA is Guidance , but it does so only when the sequence segment is located in a region of high alignment uncertainty (Additional file 2: Figure S3). The comparison of the performance of Guidance and DivA showed that DivA produced better efficiency test results for the two datasets used with the two score thresholds in Guidance (Table 3).
The present method was developed to solve the subjective and time-consuming problem of manual annotation and identification of incorrect gene annotation in genomic projects with phylogenomic studies. It uses a statistical framework that takes into account the next information: i) probability of an amino acids appearing in a position in the window alignment, ii) the smallest pairwise distance to another amino acid in that position in the window, and iii) the Z-score of i) and ii). That information is then integrated into a binary decision making model for the window to be classified as very divergent or truly homologous. It is easy to use; it does not require a manual annotation or input training set, and its parameter values are obtained automatically. The output is clear to interpret and allows the user to take more informed decisions for reducing the amount of sequence discarded but still finding the potentially erroneous and non-homologous regions.
Availability and requirements
Project name: DivA
Project home page: https://github.com/lisandracady/DivA
Operating system(s): Platform independent
Programming language: Python 2.7
Other requirements: Python packages numpy, re, os, sys, argparse, Bio
License: Lesser GPL 3 (LGPL 3)
Any restrictions to use by non-academics: None.
Availability of supporting data
The data sets supporting the results of this article are available in the “The avian phylogenomic project” data repository, http://gigadb.org/dataset/101000. In particular, the ones used in this article are in https://github.com/lisandracady/DivA/tree/master/MUSCLEalns and https://github.com/lisandracady/DivA/tree/master/MUSCLE_birdsOnly.
Multiple sequence alignment
True positive rate
True negative rate
Positive predictive value
Notredame C, Higgins DG, Heringa J: T-coffee: a novel method for fast and accurate multiple sequence alignment Edited by J. Thornton. J Mol Biol. 2000, 302 (1): 205-217. 10.1006/jmbi.2000.4042.
Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999, 27 (13): 2682-2690. 10.1093/nar/27.13.2682.
Reinert K, Stoye J, Will T: An iterative method for faster sum-of-pairs multiple sequence alignment. Bioinformatics. 2000, 16 (9): 808-814. 10.1093/bioinformatics/16.9.808.
Löytynoja A: Phylogeny-aware alignment with PRANK. Methods Mol Biol. 2014, 1079: 155-170. 10.1007/978-1-62703-646-7_10.
Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007, 56 (4): 564-577. 10.1080/10635150701472164.
Goldman N: Phylogenetic information and experimental design in molecular systematics. Proc Biol Sci. 1998, 265 (1407): 1779-1786. 10.1098/rspb.1998.0502.
Yang Z: On the best evolutionary rate for phylogenetic analysis. Syst Biol. 1998, 47 (1): 125-133. 10.1080/106351598261067.
Thompson JD, Plewniak F, Ripp R, Thierry JC, Poch O: Towards a reliable objective function for multiple sequence alignments. J Mol Biol. 2001, 314 (4): 937-951. 10.1006/jmbi.2001.5187.
Penn O, Privman E, Landan G, Graur D, Pupko T: An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol. 2010, 27 (8): 1759-1767. 10.1093/molbev/msq066.
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.
Zhang G, Li B, Gilbert MTP, Jarvis E: The avian phylogenomic project data. GigaScience Datavase. Available at: http://gigadb.org/dataset/101000
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SYW, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, da Fonseca RR, Li J, Zhang F, Li H, Zhou L, Narula N, Liu L, Ganapathy G, Boussau B, Bayzid MS, Zavidovych V, Subramanian S, Gabaldón T, Capella-Gutiérrez S, Huerta-Cepas J, Rekepalli B, Munch K, Schierup M: Whole genome analyses resolve early branches in the tree of life of modern birds. Science. in press
Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992, 89 (22): 10915-10919. 10.1073/pnas.89.22.10915.
Ripley B: R Packag version 10-34. Tree: Classification and Regression Trees. 2013,http://CRAN.R-project.org/package=tree,
We are grateful to Tom Gilbert, Anders Albrechtsen, Tandy Warnow, Erich D. Jarvis and Siavash Mirarab for comments and suggestions. This work was supported by: Lundbeck Foundation (R52-A5062), Danish Council for Independent Research (DFF-10-081390), Marie Curie (FP7-PEOPLE-2010-IEF, Proposal-272927) and the Portuguese Science Foundation (PTDC/MAR/115347/2009; COMPETE-FCOMP-01-0124-FEDER-015453).
The authors declare that they have no competing interests.
RRdF designed and implemented the method first used in Jarvis, et al (submitted), participated in the new implementation, as well as in the drafting of the manuscript. MLZM refined the implementation and modified the python scripts accordingly, planned the tests, performed execution, and drafted the manuscript. SN revised the manuscript critically and gave insightful ideas. All authors read and approved the final manuscript.