PR2ALIGN: a stand-alone software program and a web-server for protein sequence alignment using weighted biochemical properties of amino acids
© Kuznetsov and McDuffie. 2015
Received: 6 May 2014
Accepted: 24 April 2015
Published: 7 May 2015
Alignment of amino acid sequences is the main sequence comparison method used in computational molecular biology. The selection of the amino acid substitution matrix best suitable for a given alignment problem is one of the most important decisions the user has to make. In a conventional amino acid substitution matrix all elements are fixed and their values cannot be easily adjusted. Moreover, most existing amino acid substitution matrices account for the average (dis)similarities between amino acid types and do not distinguish the contribution of a specific biochemical property to these (dis)similarities.
PR2ALIGN is a stand-alone software program and a web-server that provide the functionality for implementing flexible user-specified alignment scoring functions and aligning pairs of amino acid sequences based on the comparison of the profiles of biochemical properties of these sequences. Unlike the conventional sequence alignment methods that use 20x20 fixed amino acid substitution matrices, PR2ALIGN uses a set of weighted biochemical properties of amino acids to measure the distance between pairs of aligned residues and to find an optimal minimal distance global alignment. The user can provide any number of amino acid properties and specify a weight for each property. The higher the weight for a given property, the more this property affects the final alignment. We show that in many cases the approach implemented in PR2ALIGN produces better quality pair-wise alignments than the conventional matrix-based approach.
PR2ALIGN will be helpful for researchers who wish to align amino acid sequences by using flexible user-specified alignment scoring functions based on the biochemical properties of amino acids instead of the amino acid substitution matrix. To the best of the authors’ knowledge, there are no existing stand-alone software programs or web-servers analogous to PR2ALIGN. The software is freely available from http://pr2align.rit.albany.edu.
Alignment of amino acid sequences is the main sequence comparison method used in computational molecular biology. Dynamic programming provides a computationally efficient way of finding an optimal sequence alignment of two amino acid sequences, given an alignment scoring function [1,2]. This optimal alignment found by dynamic programming depends on the choice of the alignment scoring function, which typically consists of an amino acid substitution matrix used to account for matches/mismatches and gap penalties used to account for insertions/deletions [3,4]. After the advent of sequence alignment algorithms that use dynamic programming and substitution matrix-based scoring functions, several novel alignment algorithms that use more sophisticated scoring functions based on Hidden Markov Models (HMMs) have been developed [5-10]. However, global pair-wise alignment with dynamic programming and substitution matrices is still extensively used in sequence analysis, including such fundamental applications as homology modeling  and multiple sequence alignment algorithms [12,13].
Most amino acid substitution matrices are based on the same basic assumption: if two given amino acid types are frequently observed in the equivalent positions in related proteins, they have similar biochemical properties and vice versa [14-21]. The individual elements of a substitution matrix are obtained by averaging the amino acid frequencies over all sequence positions in a large collection of protein sequences. As a result of such averaging, most existing amino acid substitution matrices are general-purpose matrices that account for the average (dis)similarities between amino acid types and do not distinguish the contribution of a specific biochemical property to these (dis)similarities. The selection of the amino acid substitution matrix best suitable for a given alignment problem is one of the most important decisions the user has to make , because all matrix elements are fixed and their values cannot be easily adjusted. This lack of flexibility in the conventional substitution matrix-based sequence alignment may limit the options of a user who wishes to align and compare amino acid sequences by applying a user-specified scoring function based on the biochemical properties of amino acids, such as hydrophobicity, size, charge, etc. The biochemical properties of the amino acids have been extensively used to construct global and local descriptors of protein sequences in various applications for alignment-free comparison and classification of proteins, including prediction of DNA-binding proteins, sub-cellular localization, and protein disorder [23-26]. Hundreds of numerical indices that represent various properties of the amino acids are available from the AAindex database . However, applications that use biochemical properties of the amino acids for alignment-based comparison of protein sequences are lacking. ProtScale  is an on-line tool that allows the user to construct a graphical plot that displays the profile of some user-selected biochemical property of the amino acids, such as the hydrophobicity profile, for the input protein sequence. The user can construct plots that show profiles for two protein sequences and perform a qualitative visual comparison of these profiles. However, ProtScale does not provide capabilities for quantitative alignment-based comparison of two protein sequences based on the profiles of biochemical properties.
PR2ALIGN is a stand-alone software program and a web-server that provide the functionality for implementing flexible user-specified alignment scoring functions and aligning pairs of amino acid sequences based on the comparison of the profiles of biochemical properties of these sequences. Unlike the conventional sequence alignment methods that use 20×20 fixed amino acid substitution matrices, PR2ALIGN uses a flexible set of weighted biochemical properties of amino acids to measure the distance between pairs of aligned residues and to find an optimal minimal distance global alignment. The user can provide any number of amino acid properties and specify a weight for each property. The higher the weight for a given property, the more this property affects the final alignment. To the best of the authors’ knowledge, there is no existing stand-alone or on-line software analogous to PR2ALIGN.
Where α is the gap opening penalty, β is the gap extension penalty, and r L is the length of the L th gap (α ≥ 0, β ≥ 0, r L ≥ 1).
Where w(b) is the weight assigned to the amino acid property b.
Where D 0,0 = 0, D i,0 = g(i), D 0,j = g(j).
The alignment corresponding to the optimal score is found by tracing back from D n,m to D 0,0 .
Where p i (x j ) is the value of biochemical property i for amino acid x j ; min i and max i are the minimum and maximum values of the i-th biochemical property. The user may choose to disable normalization and use the raw biochemical properties (not recommended).
Default amino acid properties, property weights, and gap penalties
By default, PR2ALIGN uses the following four amino acid properties: hydrophobicity , size , coil propensity , and the presence of thiol group. These four properties were selected for the following two reasons. First, the total number of properties was limited to four due to the computational complexity of the process of optimization of property weights and gap penalties (described below). This process involves a grid search that has computational complexity n*N (k+2), where n is the number of sequence pairs in the benchmark dataset, N is the number of grid points, k is the number of amino acid properties, and the factor of 2 accounts for gap initiation and gap extension penalties. For instance, using 6,000 sequence pairs, a coarse 20-point grid, and 4 amino acid properties involves performing 6*103*204+2 ≈ 3.8*1011 pair-wise alignments. For more than 4 properties the procedure becomes computationally too expensive. Second, it was shown that the three main factors that explain a significant proportion of the total variability in amino acid properties are related to hydrophobicity, size, and structural propensity . Coil propensity was selected because it is correlated with other structural propensities and helps to distinguish such structurally important amino acids as glycine and proline . The presence of the thiol group was selected because it is a unique property of the amino acid cysteine, which tends to be highly conservative in homologous proteins and plays a special role in sequence alignment .
The optimized property weights and gap penalties for the four default amino acid properties
Pair-wise sequence identity
Weight for hydrophobicity
Weight for size
Weight for coil propensity
Weight for thiol group
Gap initiation penalty
Gap extension penalty
Where n I (i,j) is the number of residue pairs aligned identically in the test and the reference alignments; l R (i,j) is the length of the reference alignment; l T (i,j) is the length of the test alignment.
In order to minimize the effect of over-represented protein families and to reduce the computational time required for the grid search, if some SABmark super-family had more than 10 sequence pairs, only 10 pairs were randomly selected from this super-family. For property weights, a 20-point grid with 0.05 increments was used. For gap penalties, a 10-point grid with 0.1 increments was used.
Comparison of PR2ALIGN and matrix-based alignment
The optimized gap penalties for the VTML200 matrix
Pair-wise sequence identity
Gap initiation penalty
Gap extension penalty
Stand-alone software program
The stand-alone alignment program is written in C++. The source code and pre-compiled Windows and Linux executables are freely available under a GNU General Public License (http://www.gnu.org/licenses/) from http://pr2align.rit.albany.edu/download.html. The program reads amino acid sequences in the FASTA format (see “Example of FASTA file” below). The user has options to save the alignment as an HTML-formatted file or as a FASTA-formatted text file. The compilation instructions and command line options are described in the README file included in the distribution.
Example of FASTA file
FASTA file consists of a header line that begins with ">" character, followed by an optional sequence name and the sequence itself:
>Sequence name goes here…
“Sequences to align” – the user should provide two amino acid sequences in FASTA format. In this input field the user can also choose the option to automatically select property weights and gap penalties. If this option is checked, the web-server will attempt to estimate the expected percentage of sequence identity and will select the property weights and gap penalties based on this expected percentage identity. This option works only for the four default amino acid properties. The expected percentage of sequence identity is estimated by aligning the input sequences using the conventional global sequence alignment with the VTML200 amino acid similarity matrix  and gap initiation penalty of −15 and gap extension penalty of −1 (gap penalties for this matrix optimized on the benchmark dataset that consists of the entire SUP sub-set of SABmark ). The user should be aware that this option provides just a rough estimate which may differ from the final percentage sequence identity displayed in PR2ALIGN output.
“Amino acid properties to use for the alignment” – the user can either use the four default amino acid properties and weights listed in Table 1 or upload a text file that contains any set of user-defined properties and weights. The file format is described in the “Help” pages of the web-server and is shown below in “Example of file with amino acid properties and weights”.
Example of file with amino acid properties and weights
An example of the file with amino acid properties and weights that can be uploaded to PR2ALIGN web-server. The file must begin with a header line. The second line must contain 20 comma-delimited standard amino acid letters in the specified order. For each property, the file must contain a line that begins with "#PROPERTY" followed by the property name. The line after the property name must contain 20 comma-delimited numbers quantifying this property for each individual amino acid. These numbers must be in the same order as the 20 amino acid letters listed in line 2. For instance, in this example the first hydrophobicity number of 0.25 corresponds to A, the second hydrophobicity number of -1.76 corresponds to R, etc. The line that begins with "W:"' after each property gives the weight assigned to this property. For instance, in this example "Hydrophobicity" has the weight of 0.6 and "Size" has the weight of 0.4:
Header line goes here…
“Other alignment parameters” – the user can enter the gap initiation and gap extension penalties and choose whether or not to normalize the amino acid properties. By default, properties are normalized to be in range [0, 1] according to Eq.6.
“Retrieval of the results” – By default, the alignment is displayed in the web-browser window. Alternatively, the user can choose to receive the results by E-mail.
Availability and requirements
Project name: PR2ALIGN
Project home page: http://pr2align.rit.albany.edu
Operating system(s): Platform independent
Other requirements: None
License: GNU GPL
Any restrictions to use by non-academics: license needed
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–53.PubMedView ArticleGoogle Scholar
- Waterman MS. Global distance alignment. In: Introduction to computational biology. 1st ed. London: Chapman and Hall; 1995. p. 192–7.View ArticleGoogle Scholar
- Vogt G, Etzold T, Argos P. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol. 1995;249:816–31.PubMedView ArticleGoogle Scholar
- Edgar RC. Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics. 2009;10:396.PubMed CentralPubMedView ArticleGoogle Scholar
- Edgar RC, Sjölander K. SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics. 2003;19:1404–11.PubMedView ArticleGoogle Scholar
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–40.PubMed CentralPubMedView ArticleGoogle Scholar
- Pei J, Grishin NV. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res. 2006;34:4364–74.PubMed CentralPubMedView ArticleGoogle Scholar
- Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on - pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010;26:1958–64.PubMedView ArticleGoogle Scholar
- Eddy SR. Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1995;3:114–20.PubMedGoogle Scholar
- Meier A, Söding J. Context similarity scoring improves protein sequence alignments in the midnight zone. Bioinformatics 2014. [Epub ahead of print].Google Scholar
- Cavasotto CN, Phatak SS. Homology modeling in drug discovery: current trends and applications. Drug Discov Today. 2009;14:676–83.PubMedView ArticleGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–80.PubMed CentralPubMedView ArticleGoogle Scholar
- Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–17.PubMedView ArticleGoogle Scholar
- Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992;89:10915–9.PubMed CentralPubMedView ArticleGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC. Establishing homologies in protein sequences. Methods Enzymol. 1983;91:524–45.PubMedGoogle Scholar
- Johnson MS, Overington JP. A structural basis for sequence comparison: an evaluation of scoring methodologies. J Mol Biol. 1993;233:716–38.PubMedView ArticleGoogle Scholar
- Prlić A, Domingues FS, Sippl MJ. Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng. 2000;13:545–50.PubMedView ArticleGoogle Scholar
- Blake JD, Cohen FE. Pairwise sequence alignment below the twilight zone. J Mol Biol. 2001;307:721–35.PubMedView ArticleGoogle Scholar
- Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13.PubMedView ArticleGoogle Scholar
- Vilim RB, Cunningham RM, Lu B, Kheradpour P, Stevens FJ. Fold-specific substitution matrices for protein classification. Bioinformatics. 2004;20:847–53.PubMedView ArticleGoogle Scholar
- Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:194–205.PubMedView ArticleGoogle Scholar
- Kuznetsov IB. Protein sequence alignment with family-specific amino acid similarity matrices. BMC Research Notes. 2011;4:296.PubMed CentralPubMedView ArticleGoogle Scholar
- Huang HL, Lin IC, Liou YF, Tsai CT, Hsu KT, Huang WL, et al. Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics. 2011;12 Suppl 1:S47.PubMed CentralPubMedView ArticleGoogle Scholar
- Tantoso E, Li KB. AAIndexLoc: predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices. Amino Acids. 2008;35(2):345–53.PubMedView ArticleGoogle Scholar
- Han P, Zhang X, Feng Z-P. Predicting disordered regions in proteins using the profiles of amino acid indices. BMC Bioinformatics. 2009;10 Suppl 1:S42.PubMed CentralPubMedView ArticleGoogle Scholar
- Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics. 2013;14:90.PubMed CentralPubMedView ArticleGoogle Scholar
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008;36(Database issue):D202–5.PubMed CentralPubMedGoogle Scholar
- Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, et al. Protein Identification and analysis tools on the ExPASy server. In: The proteomics protocols handbook. Totowa, New Jersey: Humana Press; 2005. p. 571–607.View ArticleGoogle Scholar
- Eisenberg D, Weiss RM, Terwilliger TC. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA. 1984;81:140–4.PubMed CentralPubMedView ArticleGoogle Scholar
- Krigbaum WR, Komoriya A. Local interactions as a structure determinant for protein molecules. Biochim Biophys Acta. 1979;576:204–28.PubMedView ArticleGoogle Scholar
- Deléage G, Roux B. An algorithm for protein secondary structure prediction based on class prediction. Protein Eng. 1987;1:289–94.PubMedView ArticleGoogle Scholar
- Kidera A, Konishi Y, Oka M, Ooi T, Scheraga HA. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Prot Chem. 1985;4:23–55.View ArticleGoogle Scholar
- Solis AD, Rackovsky S. Optimized representations and maximal information in proteins. Proteins. 2000;38:149–64.PubMedView ArticleGoogle Scholar
- Wong JWH, Ho SYW, Hogg PJ. Disulfide bond acquisition through eukaryotic protein evolution. Mol Biol Evol. 2011;28(1):327–34.PubMedView ArticleGoogle Scholar
- Van Walle I, Lasters I, Wyns L. SABmark - a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics. 2005;21:1267–8.PubMedView ArticleGoogle Scholar
- Sauder JM, Artur JW, Dunbrack RL. Large-scale comparison of protein sequence alignment algorithms with structural alignments. Proteins. 2000;40:6–22.PubMedView ArticleGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.