 Research article
 Open Access
Effect of positional dependence and alignment strategy on modeling transcription factor binding sites
BMC Research Notes volume 5, Article number: 340 (2012)
Abstract
Background
Many consensus-based and Position Weight Matrix-based methods for recognizing transcription factor binding sites (TFBS) are not well suited to the variability in the lengths of binding sites. Besides, many methods discard known binding sites while building the model. Moreover, the impact of Information Content (IC) and the positional dependence of nucleotides within an aligned set of TFBS has not been well researched for modeling variable-length binding sites. In this paper, we propose MLConsensus (Mixed-Length Consensus): a consensus model for variable-length TFBS which does not exclude any reported binding sites.
Methods
We consider Pairwise Score (PS) as a measure of positional dependence of nucleotides within an alignment of TFBS. We investigate how the prediction accuracy of MLConsensus is affected by the incorporation of IC and PS with a particular binding site alignment strategy. We perform cross-validations for datasets of six species from the TRANSFAC public database, and analyze the results using ROC curves and the Wilcoxon matched-pair signed-ranks test.
Results
We observe that the incorporation of IC and PS in MLConsensus results in statistically significant improvement in the prediction accuracy of the model. Moreover, the existence of a core region among the known binding sites (of any length) is witnessed by the pairwise coexistence of nucleotides within the core length.
Conclusions
These observations suggest the possibility of an efficient multiple sequence alignment algorithm for aligning TFBS, accommodating known binding sites of any length, for optimal (or near-optimal) TFBS prediction. However, designing such an algorithm is a matter of further investigation.
Background
Transcription factors (TF) are proteins that bind to specific locations of DNA (referred to as binding sites, BS) and facilitate/repress the transcription process. In many cases binding sites of a transcription factor contain a common nucleotide pattern [1]. DNA motif-finding algorithms use various models to represent this pattern [1]. One of these models is the consensus, a sequence representation derived from a multiple sequence alignment of binding sites [2, 3]. The consensus sequence retains only the most conserved base at any position, resulting in loss of information about other bases at that position. Position weight matrix (PWM), also known as probabilistic sequence model or scoring matrix, is another representation model which records the frequency (or probability) of every base at each position of the multiple sequence alignment [1, 4, 5]. The survey by Das and Dai provides a classification of DNA motif-finding methods based on different representation models [6].
Both basic consensus-based and PWM-based methods need equal-length sequences. Although this is acceptable for cases where there is no variability in lengths of binding sites (e.g., the bacterial dataset described in [7]), there are other datasets where TFBS show remarkable variability in lengths (e.g., datasets described in Section Input, training, and testing). In order to circumvent this variability, variants of these methods apply constraints and assumptions on the nature of binding sites. For example, only fixed-length sites are considered, or only sites containing a fixed-length subsequence are considered [8]. It is not confirmed, however, whether the protein-DNA binding mechanism indeed follows such constraints. Therefore it is necessary to modify these models for allowing variability in TFBS lengths. Some studies described such a PWM-based model that allows gaps in the PWM and thus accommodates variable-length binding sites [9].
There are models which can accommodate binding sites of different lengths. A widely used TFBS prediction program is PMATCH, which uses Gibbs sampling [10] to align binding sites of different lengths [11]. However, PMATCH excludes some documented binding sites based on constraints on the lengths of the sites, and imposes a constraint on the core region; it defines the core region as the five most conserved positions within the alignment [11].
Models that involve a matrix representation (PWM/consensus) must make a multiple sequence alignment from the known binding sites. Therefore, the multiple sequence alignment algorithm associated with such a model will influence its performance because the alignment (and therefore, the scoring matrix or consensus) generated by different algorithms will be different. An excellent survey of multiple sequence alignment algorithms can be found in [12]. On the other hand, the TFBS prediction algorithm SiTaR does not align input sequences at all [13]. By not aligning, SiTaR avoids many uncertainties arising from the generalizations made by multiple sequence alignment.
Basic consensus-based and PWM-based models assume that positions in a binding site are independent. However, some biological studies suggest that positions in a binding site are correlated [14, 15]. Several computational models for this correlation have been proposed [16, 17]. Some studies described pairwise score (PS), a method that computes interdependence of any two positions that are located within a fixed distance from one another in a binding site [18]. This distance is called the scope of PS. It has been shown that the addition of PS to basic consensus-based and PWM-based models results in statistically significant improvement in performance [18]. However, pairwise correlation is not the same as the statistical measure “correlation”; rather, it is a measure of co-occurrence of bases within a given proximity (i.e., scope). The mathematical definition of pairwise score can be found in Section Scoring function with pairwise score (PS).
Information content (IC) of an alignment of binding sites is a measure of conservation of any base at any given position in that alignment. It has been shown that the addition of IC in basic consensus-based and PWM-based models results in statistically significant improvement in performance [18]. However, these results regarding PS and IC were demonstrated on a dataset that does not have any variability in the lengths of binding sites for a TF [7].
Our research
In this paper, we define a consensus model (Mixed-length Consensus or MLConsensus) for recognizing variable-length TFBS. Our model does not exclude any known/reported binding site while building the model for a set of TFBS. Moreover, MLConsensus does not make any assumption on the lengths of binding sites or on the length/composition of the core region. However, it assumes that there exists one core region for a set of TFBS, and that this core region is present, in part or whole, in every binding site. This assumption is used in constructing the naïve multiple alignment algorithm associated with this model (described later in this section).
Our input data covers six species from the TRANSFAC public database [19]. We study the effect of pairwise correlation of nucleotides, information content, and multiple sequence alignment strategy on the prediction accuracy of our model.
If each binding site of any given TF has the same length (e.g., the E. coli dataset in [7]), it is trivial to align them and get the consensus or scoring matrix. Otherwise, one needs to make a multiple sequence alignment from those sequences in order to derive a scoring matrix or a consensus. TFBS prediction tools employ various methods for aligning binding sites [6]. All other things remaining the same, effectiveness of two multiple sequence alignment algorithms for aligning TFBS can be evaluated by comparing the performance of a TFBS prediction model using those two alignment strategies.
In our study, our goal was to evaluate the effectiveness of commonly used multiple sequence alignment strategies in aligning TFBS. We present a naïve sorting-based multiple sequence alignment algorithm and compare it to ClustalW2, a widely used multiple sequence alignment algorithm [20]. We pick ClustalW2 as a representative of sophisticated alignment algorithms; our simple-sorted alignment algorithm is so naïve that, when comparing it to another algorithm, the implementation specifics of the other algorithm do not matter, provided that it is one of the established, sophisticated algorithms. Our algorithm (see Appendix A: The naïve sorting-based multiple sequence alignment algorithm) operates on a simple principle: it picks the shortest yet-to-align sequence and adds it to the temporary alignment. This is done based on the assumption that all binding sites of a TF have some pattern in common (i.e., a core region), and therefore the probability that any given position of a binding site would be a part of the core region is higher in a short binding site than in a long binding site. On the other hand, ClustalW2 creates the alignment from the phylogenetic tree built from pairwise alignments of the input sequences. In our experiments, we used ClustalW2 without iterative refinement. Table 1 shows different alignments produced by these two algorithms from the same input sequences.
Pairwise score (PS) is a measure of the dependence of nucleotides at two positions that are situated within a given distance in an alignment of TFBS. Whereas other studies (e.g., [18]) have discussed the effect of PS on consensus/matrix-based TFBS models for fixed-length binding sites, we study the effect of PS on MLConsensus, a model for variable-length binding sites. Specifically, we perform experiments with different PS scopes to find whether there is any regularity with which a change in PS scope affects the performance of MLConsensus. We consider the following choices for PS: no PS, PS scopes 1–10, and a large scope value that covers the entire overlap between a test sequence and the consensus while scoring that sequence against the consensus.
As mentioned earlier, MLConsensus has three configuration variables: pairwise score, information content, and multiple sequence alignment strategy. We construct one experiment-configuration for each combination of variables (e.g., ClustalW2 alignment using IC and PS scope 2, etc.). We conduct leave-one-out cross-validation for training/testing our model on TFBS data for six species extracted from the TRANSFAC public database [19]. We used ROC curves and the Wilcoxon matched-pair signed-ranks test for statistical evaluation of the performance data.
Our results show that the adoption of IC or PS in the scoring function of MLConsensus results in significant improvement in performance. Moreover, a large PS scope (e.g., the full scope) does not produce the best performance for a given configuration; performance decreases after PS scope is larger than a certain value. Not only is this observation counterintuitive, but it also provides a way to estimate the core length. Our results also suggest that it is possible to design a TFBS-specific multiple sequence alignment algorithm that will perform better than general-purpose algorithms by means of utilizing prior information and assumptions about TFBS. However, we do not present such an algorithm yet since it is subject to further investigation.
The main contributions of this paper are the following: (1) We describe a new model for TFBS prediction which accommodates all known binding sites of different lengths. (2) We show that incorporating information content and pairwise correlation into the scoring function for this model improves the prediction accuracy. (3) We study the effect of different PS scopes on the prediction accuracy of this model. (4) We show that it is possible to estimate the length of the core region in a set of TFBS. (5) We show that it is possible to design a multiple sequence alignment algorithm which will do better than general-purpose algorithms while aligning TFBS.
Results and discussion
In the following discussion AUC refers to the area under the ROC curve. A configuration is an experiment with any particular settings for IC, PS scope, and alignment strategy. The AUC of a configuration is taken as a measure of its performance (i.e., prediction accuracy). However, when comparing the performances of two configurations, the statistical significance of the difference in performance is considered. If the difference is significant, we say that configuration A performs better than configuration B. Otherwise, we say that the two configurations are equivalent. For a given configuration, the peak in its AUC denotes the PS scope value which, among all scopes, produces the highest AUC for that configuration. The phrases naïve alignment, simple sorting-based alignment and simple-sorted alignment all refer to our heuristic, sorting-based, multiple sequence alignment algorithm presented in Appendix A: The naïve sorting-based multiple sequence alignment algorithm.
Some adjacent PS scopes produce significantly better performance than other scopes
By definition (see Section Scoring function with pairwise score (PS)), all information gathered in a smaller PS scope is retained in a larger PS scope. However, Figure 1 and Figure 2 show that the performance of a configuration starts to decrease when PS scope grows larger than a certain value. The location of the peak (i.e., the PS scope which produces the highest area under the ROC curve) for a configuration varies in different species. Figure 2 depicts whether the change in performance between successive PS scopes is statistically significant. We observe that there is always a range of PS scopes where performance, after initially increasing significantly, does not change significantly with a change in PS scope. After this range, however, performance decreases significantly. We call this range of PS scopes a significance plateau.
The above observation can be explained as follows. For any leave-one-out experiment over a given set of TFBS, the known positive example may have one or more mismatches with respect to the consensus. These positions may get involved in position-pair matches between the consensus and a known negative example. (We term this event noise.) If such an event takes place, it increases the probability that the known negative example would score higher than the known positive example, producing a false positive. PS scopes larger than a certain value do not capture any new position-pair matches, yet continue picking up noise. This is why we observe a decrease in performance of a configuration with increase in PS scopes beyond a certain value. This scope indicates the maximum distance within which two positions in an aligned set of TFBS are correlated. In an alignment, only positions that form the core region will be correlated. Therefore the core region (for the sites in the alignment) should be at most as long as this scope value. However, this scope value is found by running a given experiment-configuration over all sets of TFBS for a given species, and therefore it is associated with the overall TFBS dataset for the species and not with any particular set of TFBS. Additionally, different experiment-configurations produce possibly different significance plateaus for any given species dataset. Therefore, the location of the plateau depends on which experiment-configuration is in use. Our suggestion is that for a given species, we should choose the experiment-configuration that produces the highest area under its ROC curves across all PS scopes, thus having the highest discriminatory power.
ClustalW2 does not perform as expected
We perform the Wilcoxon matched-pair signed-ranks test in order to determine whether ClustalW2 performs significantly better than the naïve alignment algorithm. Since ClustalW2 is a sophisticated algorithm, the null hypothesis is that ClustalW2 should perform significantly better (with p ≤ 0.05) than the simple sorted alignment algorithm in all combinations of other variables. However, if the difference in performance is found insignificant, it should be considered as evidence against the null hypothesis. We divide all configurations into pairs (based on alignment strategy), and then compute the statistical significance of the difference in performance of the two configurations in each pair. Table 2 shows the statistical significance of the difference in performance of configurations using different alignment algorithms according to the null hypothesis mentioned above. It can be seen that the null hypothesis does not hold true in four out of six species with p ≤ 0.01. This means ClustalW2 does not perform significantly better than the naïve sorting-based alignment strategy in all experiments.
Since the naïve algorithm operates on simple assumptions, and it does not do anything as involved as common multiple sequence alignment algorithms do, the naïve algorithm has much room for improvement. Since the performance of this algorithm is already as good as (or better than) the performance of ClustalW2 in most situations, we can say that it is possible to design a TFBS-specific multiple sequence alignment algorithm that will perform better than general-purpose algorithms (e.g., ClustalW2, etc.) by means of utilizing prior information and assumptions about TFBS. For example, an assumption made by the naïve alignment algorithm is that there is a core region contained by all binding sites. In addition, an example of prior information about TFBS is the core length suggested by the PS scopes in the significance plateau. However, the relationship between the core length and the significance plateau is not known yet.
Performance of configurations using the same alignment algorithm varies across different species. Figure 3 shows that simple sorted alignment performs better than ClustalW2 in M. musculus. On the other hand, Figure 4 shows that ClustalW2 performs better than the naïve sorting-based algorithm in R. norvegicus. Although we do not know why this happens, our hypothesis is that it may be due to the differences in the composition of binding sites (i.e., number and lengths of binding sites, the nature of the core region, etc.) for each species. This observation requires further investigation.
Both IC and PS lead to improved performance
The addition of IC to a configuration without IC always improves its performance. However, the improvement is more prominent for larger scopes, which can be seen in Figure 1 and Figure 2. In these figures, the AUC of configurations without IC drops quickly at large PS scopes. However, curves for configurations with IC tend to be flatter at large PS scopes. Similarly, addition of PS (with an appropriate scope value) to configurations without PS results in improved performance. These observations are in accordance with the observations made by [18] regarding the influence of IC and PS on models for fixed-length binding sites.
Conclusions and future works
In this paper we describe MLConsensus, a consensus model for recognizing variable-length transcription factor binding sites. We show that certain PS scope values indicate the range within which positions in a binding site are correlated. However, the statistical correlation of nucleotides in a set of binding sites is out of the scope of this research, and is a matter of future work. We also show that in most cases, configurations that use ClustalW2 as the alignment algorithm do not perform significantly better than configurations that use a naïve sorting-based heuristic alignment algorithm. This suggests that it is possible to improve the naïve algorithm into a TFBS-specific multiple sequence alignment algorithm (using information/assumptions about TFBS) which would perform better than general-purpose multiple sequence alignment algorithms. However, designing such an algorithm is another direction of future investigation. Lastly, although we use a consensus model, our approach and methods can be extended to a PWM-based model for variable-length binding sites.
Methods
In this section, we start by presenting the mathematical definition of the MLConsensus model and its various parts. Next we describe how we collected and processed the input data to build training and testing datasets. Then we describe how we performed the statistical evaluation of the experiments through ROC curves and the Wilcoxon matched-pair signed-ranks test.
Model definition
The MLConsensus model has the following parts: (1) Building a multiple sequence alignment from a given set of binding sites, (2) Generating the consensus sequence from this alignment, (3) A basic scoring function which compares a given DNA sequence with this consensus and tells how close they are; this scoring function can be modified to incorporate information content (IC) and pairwise score (PS).
Building a consensus
Let S be the set of N binding sites for a particular transcription factor. Let A be a multiple sequence alignment of S with a width of M. A gap in the alignment A is denoted by ‘-’.
Let $n_j(b)$ be the number of times base b∈{A,C,G,T} appears at the jth position of A. Let $f_j(b) = n_j(b)/N$ be the corresponding frequency. Similarly, let n(b) be the number of times base b appears overall in A, and f(b) be the overall frequency for base b in A.
A letter representing more than one nucleotide is called the ambiguity code for those nucleotides. Let amb(b,d) be the ambiguity code for two bases b,d∈{A,C,G,T} as described in Table 3, and amb(b,∗) be any ambiguity code involving base b. Let C be the consensus sequence derived from A, and $C_j$ be the jth base in C.
$C_j$ is computed as follows. For each position j of A:
If $f_j(b) > 0.5$ for a base b∈{A,C,G,T}, set $C_j = b$.
Otherwise, if $f_j(b) + f_j(d) > 0.75$ for any two bases b,d∈{A,C,G,T}, set $C_j$ = amb(b,d).
Otherwise, set $C_j$ = ‘-’, the gap character.
Table 1 shows how to derive a consensus from two different sequence alignments produced by two different alignment algorithms. Computing $f_j(b)$ for all j,b takes O(NM) time.
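The consensus rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' C# implementation: the function name `build_consensus` is hypothetical, the gap is assumed to be written as '-', and the ambiguity codes are assumed to be the standard IUPAC two-base codes (Table 3 is not reproduced here).

```python
from collections import Counter

GAP = "-"  # assumed gap symbol
# Standard IUPAC two-base ambiguity codes (assumed to match Table 3).
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def build_consensus(alignment):
    """Return the consensus string for a list of equal-width aligned rows."""
    n_sites = len(alignment)
    consensus = []
    for j in range(len(alignment[0])):
        # Count bases only; frequencies are taken over all N rows,
        # so gaps in a column lower every base frequency.
        counts = Counter(row[j] for row in alignment if row[j] in "ACGT")
        top = counts.most_common(2)
        if top and top[0][1] / n_sites > 0.5:
            consensus.append(top[0][0])
        elif len(top) == 2 and (top[0][1] + top[1][1]) / n_sites > 0.75:
            # The two most frequent bases give the largest pair sum,
            # so checking them is enough for the "any two bases" rule.
            consensus.append(IUPAC_PAIR[frozenset(top[0][0] + top[1][0])])
        else:
            consensus.append(GAP)
    return "".join(consensus)
```

For the toy alignment `["ACGT", "ACGA", "AGTC", "-CTG"]`, the third column holds G and T at frequency 0.5 each and becomes K, while the fourth column has no dominant base and becomes a gap.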
Scoring function
Let t be a putative binding site. Let $t_j$ be the jth base of t. To compute the score of t with respect to consensus C, we used a sliding window approach where t is shifted along C, from left to right. At each point of shifting there is an overlap between t and C. For each overlap w let $C_{w,i}$ be the base in the consensus corresponding to the ith position in w. Define $t_{w,i}$ in a similar way. For each overlap w we computed σ(t,C,w), the score of t at that particular overlap; this score is equal to the number of matches between t and C at w:
$$\sigma(t,C,w) = \sum_{i=1}^{|w|} \mathrm{Match}(w,i)$$
where
$$\mathrm{Match}(w,i) = \begin{cases} 1 & \text{if } C_{w,i} = t_{w,i} \text{ or } C_{w,i} = \mathrm{amb}(t_{w,i},\ast) \\ 0 & \text{otherwise} \end{cases}$$
Computing Match(w,i) takes O(1) time, and computing σ(t,C,w) takes O(M) time since the size of w is O(M). Finally, the score of t with respect to C is the maximum score obtained over all overlaps, which takes O(M^{2}) time since there can be at most O(M) overlaps.
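A minimal sketch of the sliding-window score follows. It rests on two assumptions that the text does not state explicitly: a consensus ambiguity code counts as a match for either of its two bases, and partial overlaps at both ends of C are allowed as long as at least one column is shared. The function names are illustrative.

```python
# IUPAC two-base codes; assumed to correspond to Table 3.
AMBIGUITY = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def match(t_base, c_base):
    """Return 1 if the site base agrees with the consensus symbol, else 0."""
    if c_base == t_base:
        return 1
    return 1 if t_base in AMBIGUITY.get(c_base, "") else 0

def overlap_score(t, consensus, offset):
    """Count matches when t[0] is placed at consensus[offset]."""
    total = 0
    for i, t_base in enumerate(t):
        j = offset + i
        if 0 <= j < len(consensus):  # only overlapping columns count
            total += match(t_base, consensus[j])
    return total

def score(t, consensus):
    """Maximum overlap score over all shifts sharing at least one column."""
    offsets = range(-(len(t) - 1), len(consensus))
    return max(overlap_score(t, consensus, off) for off in offsets)
```

With the consensus `"ACK-"`, the site `"CG"` scores 2 at offset 1 (C matches C, and G is covered by K), and a gap column in the consensus never contributes a match.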
Scoring function with information content (IC)
Information Content (also called entropy) at any position j of the alignment A is a measure of conservation of any base at that position [4, 21]. If a base is highly conserved at a position, the chance of encountering a different base at that position is small; thus the information content at that position is low. The IC at position j of the alignment matrix A is defined as:
$$\mathrm{IC}(A,j) = -\sum_{b\in\{A,C,G,T\}} f_j(b)\,\log f_j(b)$$
where the term $f_j(b)\log f_j(b)$ becomes zero whenever $f_j(b)$ becomes zero, thus avoiding evaluation of log 0. IC(A,j) for all j can be computed in O(M) time. Let A(w,i) be the position in A that corresponds to the ith position in w. When IC is used in scoring, the scoring function for the overlap becomes:
This takes O(M) time when the IC(A,j) values are precomputed.
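The per-position quantity can be illustrated as follows, assuming (as the phrase "also called entropy" suggests) that IC at a column is the Shannon entropy of its base frequencies. The logarithm base is not stated in the text, so base 2 is an assumption here, and `column_entropy` is a hypothetical name. How this value enters the overlap score is not covered by this sketch.

```python
import math
from collections import Counter

def column_entropy(alignment, j, log_base=2):
    """Entropy of column j, with frequencies taken over all N rows."""
    n_sites = len(alignment)
    counts = Counter(row[j] for row in alignment if row[j] in "ACGT")
    entropy = 0.0
    # Bases that never occur are absent from `counts`, so log(0)
    # is never evaluated, mirroring the convention in the text.
    for count in counts.values():
        f = count / n_sites
        entropy -= f * math.log(f, log_base)
    return entropy
```

A fully conserved column gives 0, and a column with all four bases equally frequent gives the maximum of 2 bits, which agrees with the statement that conserved positions have low IC under this definition.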
Scoring function with pairwise score (PS)
Pairwise score is a measure of interdependence among positions in a binding site with respect to the consensus [18]. Two different positions in an overlap w are correlated if there are matches in both positions. In overlap w, let positions i and i + k be separated by k positions. The match-score for this position-pair, MatchPair(w,i,k), is defined as follows:
$$\mathrm{MatchPair}(w,i,k) = \begin{cases} 1 & \text{if } \mathrm{Match}(w,i) = 1 \text{ and } \mathrm{Match}(w,i+k) = 1 \\ 0 & \text{otherwise} \end{cases}$$
This takes O(1) time.
Let K be the maximum distance considered between any two positions, and |w| be the length of the overlap. K is called the scope of PS. The pairwise score of t at overlap w, $\sigma_{PS}(t,C,w)$, is defined as the total number of position-pair matches for all positions situated within the scope of PS:
$$\sigma_{PS}(t,C,w) = \sum_{k=1}^{K} \sum_{i=1}^{|w|-k} \mathrm{MatchPair}(w,i,k)$$
This operation takes O(M K^{2}) time. The score at any PS scope contains all matches from all smaller scopes along with new matches at the said scope. Thus it does not lose any information about position-matches gathered in previous scopes.
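A small sketch of the pairwise count, assuming the per-position match values for an overlap are already available as a list of 0/1 flags (the function name and this input format are illustrative choices, not taken from the paper):

```python
def pairwise_score(matches, scope):
    """Count position pairs (i, i + k), 1 <= k <= scope, that both match.

    matches: list of 0/1 match flags over one overlap.
    scope:   K, the largest separation considered.
    """
    total = 0
    for k in range(1, scope + 1):
        # range() is empty once k reaches the overlap length,
        # so an oversized scope behaves like the full scope.
        for i in range(len(matches) - k):
            total += matches[i] * matches[i + k]
    return total
```

For the flags `[1, 1, 0, 1]` the count grows from 1 at scope 1 to 2 at scope 2 and 3 at scope 3, then stays at 3 for any larger scope, which illustrates why a larger scope never loses pairs found at a smaller one.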
Scoring function with both information content and pairwise score
At any overlap w, let $n_{ij}(b,d)$ be the number of times two bases b and d appear together at positions i and j, respectively. Let $f_{ij}(b,d) = n_{ij}(b,d)/N$ be the corresponding frequency. Then, the IC of position-pair (i,j) in the alignment matrix A is defined as follows:
$$\mathrm{IC}_{pair}(A,i,j) = -\sum_{b,d\in\{A,C,G,T\}} f_{ij}(b,d)\,\log f_{ij}(b,d)$$
Computing $f_{ij}(b,d)$ for all i,j,b,d takes O(M^{2}) time. After that, computing $\mathrm{IC}_{pair}(A,i,j)$ for all i,j takes O(M^{2}) time.
Let A(w,i) be the position in A that corresponds to the ith position in w. Let $\mathrm{IC}_{pair}(w,i,k)$ be the information content of the position-pair (i,i + k) in w, which is defined as follows:
$$\mathrm{IC}_{pair}(w,i,k) = \mathrm{IC}_{pair}(A, A(w,i), A(w,i+k))$$
Finally, the score of t at overlap w is defined as follows:
This takes O(K^{2}M) time because all $\mathrm{IC}_{pair}(A,i,j)$ values are already computed for all i,j.
Experiment design
We studied the effect of three variables on the performance of MLConsensus: multiple sequence alignment strategy, IC, and PS. The alignment algorithm can be either ClustalW2 or the simple sorted alignment algorithm. There are two choices for IC: either using IC, or not using IC. However, PS can have twelve possible values: not using PS; PS scopes 1–10; and lastly full PS scope, which means the scope spans the entire overlap between a putative site and the consensus. Therefore, there are 2×2×12=48 possible experiment-configurations, one for each combination of the three variables. Each of these configurations was trained and tested using the same input, training, and testing data.
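The configuration grid can be enumerated directly. The labels below (`"clustalw2"`, `"simple_sorted"`, `"full"`) are illustrative names chosen for this sketch, not identifiers from the authors' code.

```python
from itertools import product

ALIGNERS = ("clustalw2", "simple_sorted")
USE_IC = (False, True)
# None = no PS, 1..10 = fixed scopes, "full" = scope spans the whole overlap.
PS_SCOPES = (None, *range(1, 11), "full")

# One tuple (aligner, use_ic, ps_scope) per experiment-configuration.
configurations = list(product(ALIGNERS, USE_IC, PS_SCOPES))
```

The list has 2 × 2 × 12 = 48 entries, matching the count given above.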
Input, training, and testing
We extracted TFBS data from the TRANSFAC public database [19]. We considered TFs with at least three binding sites. Table 4 shows basic statistics for this data. Figure 5 shows the variability in TFBS lengths in the input data. The x-axis shows the ratio of the population SD to the mean of BS length, computed for a set of TFBS. From the figure it can be observed that 9.5% of TFs have a small deviation in size (the first bin of the histogram) but they cover only 7.5% of total BSs. From the first three bins, it can be seen that 40% of TFs (covering 29% of BSs) have low variability ($\mathrm{SD}/\mathrm{mean} < 0.3$). From the next three bins, it can be observed that another 49% of TFs (covering 60% of BSs) have much higher variability ($0.3 \le \mathrm{SD}/\mathrm{mean} < 0.6$). The remaining 11% of TFs have extreme variability, and they cover the remaining 11% of BSs.
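The variability measure on the x-axis of Figure 5 is the ratio of the population standard deviation to the mean of the site lengths. As a small illustration (the function name is hypothetical and the example sites are made up):

```python
from statistics import mean, pstdev

def length_variability(sites):
    """Population SD of site lengths divided by their mean."""
    lengths = [len(site) for site in sites]
    return pstdev(lengths) / mean(lengths)
```

Three sites of lengths 4, 6 and 8 give a ratio of about 0.27, which would fall in the low-variability group (ratio below 0.3); sites of identical length give 0.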
We conducted leave-one-out cross-validation for all TFs over their respective BS data. For each TF, the training data contained all its binding sites except the one left out. The test data contained all known negative examples (binding sites of other TFs of the same species) and one known positive example (the left-out site). In accordance with [18], we removed any site from the set of negative examples for this TF if the site is also a BS of this TF.
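The construction of training and test sets can be sketched as below. The input format (a dictionary from TF name to a list of site strings) and the function name are assumptions made for illustration.

```python
def leave_one_out_splits(sites_by_tf, tf):
    """Yield (training_sites, positive, negatives) for each site of `tf`."""
    own = sites_by_tf[tf]
    own_set = set(own)
    # Negatives: sites of every other TF, minus any site that is
    # also a known binding site of this TF.
    negatives = [site
                 for other, sites in sites_by_tf.items() if other != tf
                 for site in sites if site not in own_set]
    for idx, positive in enumerate(own):
        training = own[:idx] + own[idx + 1:]
        yield training, positive, negatives
```

Each TF with n sites produces n runs, and the same negative set is reused across the runs of that TF.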
Statistical evaluation
If a known negative example scored higher than the only known positive example, it was treated as a false positive. We needed to know which false positive rate corresponds to which true positive rate in order to draw an ROC curve for a configuration. In our case there was only one known positive example. Because several negative sites may score higher than the positive one, our model must allow these false positives (compromising prediction accuracy) in order to correctly classify the known positive example. We considered allowable false positive rates from 0% to 20%. This range was discretized into several slots (each denoting a smaller range in false positive rate values). For each leave-one-out experiment over a given set of TFBS we computed the true positive rate corresponding to each of these slots. These values were used to generate an ROC curve for this configuration. Details of this construction are given in Appendix B: Construction of ROC curves. The area under the ROC curve of a configuration is a measure of its discriminatory power. However, it should be noted that the performance of two configurations cannot be compared solely by the areas under their respective ROC curves if the two curves intersect at one or more points [22].
We used the Wilcoxon matched-pair signed-ranks test to compare the performance of any two configurations at p-values 0.05 and 0.01 [23]. This test is well suited to our experiments because the underlying distribution of the data is unknown, yet we know that individual data points are independent. We used the number of false positives for each individual leave-one-out experiment for a TF as the rank of the experiment. Therefore, a high rank indicated poor performance.
Availability of supporting data
The data set, supplementary data, source code (C#), and figures supporting the results of this article are available at http://biogrid.engr.uconn.edu/mlconsensus/.
Appendix A: The naïve sorting-based multiple sequence alignment algorithm
The assumption behind this algorithm is that the core region is shared by all sites, and therefore on average, positions in short sites are more likely to constitute the core region (than positions in long sites). The steps of the algorithm are as follows:

1. Sort all binding sites from shortest to longest.
2. Take the shortest site that is yet unaligned. If more than one site has the smallest length, pick one at random. This makes up the initial alignment, A.
3. Compute C, the consensus, from A.
4. Let s be the next shortest, unaligned site.
5. If such an s does not exist, go to step 8. Otherwise, continue with step 6.
6. Shift s along C from left to right. Find the alignment which produces the highest score of s with respect to C. Add s to A at this alignment.
7. Go to step 3.
8. Output: A is the complete multiple sequence alignment of S.
It can be observed that the order of choosing sites affects the resultant alignment, and a heterogeneous short site is likely to negatively impact the rest of the alignment. However, according to the assumption of the algorithm, if the short site contains the core region then it will not be heterogeneous.
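The steps of the algorithm can be sketched as follows. This is a hedged illustration rather than the authors' C# code: ties in length keep input order instead of being broken at random, ties in score take the leftmost shift, the gap is written as '-', partial overlaps with the consensus are allowed, and the ambiguity codes are assumed to be the IUPAC two-base codes. All function names are hypothetical.

```python
from collections import Counter

GAP = "-"
# IUPAC two-base codes; assumed to correspond to Table 3.
AMBIGUITY = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
PAIR_CODE = {frozenset(bases): code for code, bases in AMBIGUITY.items()}

def pad(placed):
    """Turn (start, site) pairs into equal-width rows; return rows, left edge."""
    left = min(start for start, _ in placed)
    right = max(start + len(site) for start, site in placed)
    rows = [GAP * (start - left) + site + GAP * (right - start - len(site))
            for start, site in placed]
    return rows, left

def consensus(rows):
    """Consensus using the >0.5 and >0.75 thresholds from the Methods."""
    n = len(rows)
    out = []
    for j in range(len(rows[0])):
        top = Counter(r[j] for r in rows if r[j] != GAP).most_common(2)
        if top and top[0][1] / n > 0.5:
            out.append(top[0][0])
        elif len(top) == 2 and (top[0][1] + top[1][1]) / n > 0.75:
            out.append(PAIR_CODE[frozenset(top[0][0] + top[1][0])])
        else:
            out.append(GAP)
    return "".join(out)

def matches(site, cons, offset):
    """Number of matching columns when site[0] sits at cons[offset]."""
    total = 0
    for i, base in enumerate(site):
        j = offset + i
        if 0 <= j < len(cons) and (
                cons[j] == base or base in AMBIGUITY.get(cons[j], "")):
            total += 1
    return total

def naive_align(sites):
    ordered = sorted(sites, key=len)      # step 1 (stable for equal lengths)
    placed = [(0, ordered[0])]            # step 2
    for s in ordered[1:]:                 # steps 4, 5 and 7
        rows, left = pad(placed)
        cons = consensus(rows)            # step 3
        offsets = range(-(len(s) - 1), len(cons))
        best = max(offsets, key=lambda off: matches(s, cons, off))  # step 6
        placed.append((left + best, s))
    return pad(placed)[0]                 # step 8
```

For the input `["ACG", "TACGT", "ACGA"]` the sketch returns the rows `"-ACG-"`, `"-ACGA"` and `"TACGT"`, with the shared ACG core stacked in the same columns.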
Appendix B: Construction of ROC curves
Let TP, TN, FP and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
False Positive Rate, or FPR, is defined as the fraction of incorrectly classified known negative examples, i.e., FPR = FP/(FP + TN). Similarly, True Positive Rate, or TPR, is defined as the fraction of correctly classified known positive examples, i.e., TPR = TP/(TP + FN).
Let $N_{\mathrm{TF}}$ be the number of TFs for the given species. Let $\mathrm{TF}_i$ be the ith TF. Let $N_{\mathrm{BS}}^{i}$ be the number of known binding sites for $\mathrm{TF}_i$. A leave-one-out cross-validation is conducted for each of the $N_{\mathrm{BS}}^{i}$ binding sites. If a known negative example scores more than the known positive example, it is considered a false positive.
We computed an ROC (Receiver Operating Characteristic) curve for each configuration over each species. FPR and TPR were placed along the x-axis and y-axis, respectively, and the curve indicates the TPR obtained at different values of FPR. The computation for each configuration was done in three steps. At first, we computed TPR and FPR for each leave-one-out experiment involving a known binding site. Next, these values were averaged over all BSs for each TF. Lastly, these values were further averaged over all TFs for a given species.
Step One: Individual binding sites
Let _{BSj,i} be the jth BS of _{TFi}. Let _{FPRmax}be the maximum false positive rate considered for drawing the ROC curve. We used _{FPRmax}=0.20, or 20%. Let the range 0≤FPR≤_{FPRmax} be divided into M equal slots. Let {\mathrm{FPR}}_{k}^{\mathrm{slot}} denote the false positive rate corresponding to the kth slot.
Let $\mathrm{FP}_{j,i}$ be the number of false positives in the leave-one-out run which involves $\mathrm{BS}_{j,i}$ as the known positive binding site. Let $\mathrm{FPR}_{j,i}$ be the observed false positive rate. For any given allowable false positive rate, if $\mathrm{FPR}_{j,i}$ is greater than the allowable FPR, the given configuration will not be able to identify the known positive example. $T_j(i,k)$ denotes whether the known positive example could be identified (i.e., occurrence of a true positive) by setting the allowable FPR equal to the false positive rate for the $k$th FPR slot:

\[
T_j(i,k) =
\begin{cases}
1, & \text{if } \mathrm{FPR}_{j,i} \le \mathrm{FPR}_{k}^{\mathrm{slot}} \\
0, & \text{otherwise}
\end{cases}
\]

for $1 \le j \le N_{\mathrm{BS}}^{i}$, $1 \le i \le N_{\mathrm{TF}}$, $1 \le k \le M$.
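Step One can be sketched for a single binding site as below. This is only an illustration under stated assumptions: the source does not give the slot values explicitly, so the sketch assumes the $k$th slot corresponds to $k \cdot \mathrm{FPR}_{\max} / M$, and the function names and the observed FPR of 0.12 are hypothetical.

```python
from typing import List


def slot_fprs(fpr_max: float, m: int) -> List[float]:
    """Assumed slot values k * fpr_max / m for k = 1..m (not stated in the source)."""
    return [k * fpr_max / m for k in range(1, m + 1)]


def true_positive_indicators(observed_fpr: float, slots: List[float]) -> List[int]:
    """For each slot, 1 if the observed FPR does not exceed the allowable FPR, else 0."""
    return [1 if observed_fpr <= allowable else 0 for allowable in slots]


# FPR_max = 0.20 split into M = 4 slots: roughly 0.05, 0.10, 0.15, 0.20.
slots = slot_fprs(0.20, 4)

# Hypothetical leave-one-out run with an observed FPR of 0.12.
indicators = true_positive_indicators(0.12, slots)
```

Here the held-out site is missed in the two strictest slots and recovered in the remaining two, so the indicator vector is non-decreasing in $k$, as it must be for any single run.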
Step Two: Averaging over all BSs for a given TF
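For $\mathrm{TF}_i$, let $T_{\mathrm{BS}}(i,k)$ be the average number of true positives obtained by setting the allowable FPR equal to the false positive rate for the $k$th FPR slot:

\[
T_{\mathrm{BS}}(i,k) = \frac{1}{N_{\mathrm{BS}}^{i}} \sum_{j=1}^{N_{\mathrm{BS}}^{i}} T_j(i,k)
\]

for $1 \le i \le N_{\mathrm{TF}}$, $1 \le k \le M$.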
Step Three: Averaging over all TFs for a species
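Let $T_{\mathrm{TF}}(k)$ be the average number of true positives, across all TFs, obtained by setting the allowable FPR equal to the false positive rate for the $k$th FPR slot:

\[
T_{\mathrm{TF}}(k) = \frac{1}{N_{\mathrm{TF}}} \sum_{i=1}^{N_{\mathrm{TF}}} T_{\mathrm{BS}}(i,k)
\]

for $1 \le k \le M$. The ROC curve is produced by plotting $T_{\mathrm{TF}}(k)$ at the $k$th FPR slot.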
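The two averaging steps can be sketched as follows. The indicator values are made up for illustration, and the helper name `average_rows` is an assumption. The point of the sketch is that each TF contributes equally to the final curve regardless of how many binding sites it has.

```python
from typing import List, Sequence


def average_rows(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally long rows (one column per FPR slot)."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


# Hypothetical per-site true-positive indicators over M = 4 FPR slots.
tf1_sites = [[0, 0, 1, 1],
             [0, 1, 1, 1]]
tf2_sites = [[1, 1, 1, 1],
             [0, 0, 0, 1],
             [0, 0, 1, 1],
             [0, 1, 1, 1]]

# Step Two: average over the binding sites of each TF.
t_bs = [average_rows(sites) for sites in (tf1_sites, tf2_sites)]

# Step Three: average the per-TF curves over all TFs of the species.
t_tf = average_rows(t_bs)
```

Averaging per TF first means the four sites of the second TF do not outweigh the two sites of the first, which is different from pooling all six sites into one mean.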
We considered only the 0%–20% false positive rate range when computing the area under an ROC curve. Since the FPR slots are discrete, we used the sum of the TPR values in this FPR range as the area under an ROC curve.
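A minimal sketch of this area computation is given below, using a made-up curve. The second function is not from the source: it is an assumed rectangle-rule variant that scales the sum by the slot width $\mathrm{FPR}_{\max} / M$, included only to show that, with equal-width slots, the plain sum differs from it by a constant factor.

```python
from typing import Sequence


def area_as_tpr_sum(tpr_per_slot: Sequence[float]) -> float:
    """Area surrogate described in the text: the plain sum of TPR values over the slots."""
    return sum(tpr_per_slot)


def area_rectangle_rule(tpr_per_slot: Sequence[float], fpr_max: float) -> float:
    """Assumed alternative: the same sum scaled by the slot width fpr_max / M."""
    width = fpr_max / len(tpr_per_slot)
    return width * sum(tpr_per_slot)


# Hypothetical averaged TPR values for M = 4 slots in the 0-20% FPR range.
curve = [0.125, 0.5, 0.875, 1.0]

score = area_as_tpr_sum(curve)
scaled = area_rectangle_rule(curve, 0.20)
```

Because the scaling factor is the same for every configuration evaluated with the same $M$ and $\mathrm{FPR}_{\max}$, ranking configurations by the plain sum gives the same order as ranking them by the scaled area.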
References
1. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.
2. Hertz GZ, Hartzell GW, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer App Biosci: CABIOS. 1990, 6 (2): 81-92. [http://www.ncbi.nlm.nih.gov/pubmed/2193692]
3. Day WHE, McMorris FR: Critical comparison of consensus methods for molecular sequences. Nucleic Acids Res. 1992, 20 (5): 1093-1099. 10.1093/nar/20.5.1093.
4. Stormo GD, Fields DS: Specificity, Free Energy and Information Content in Protein-DNA Interactions. Trends Biochem Sci. 1998, 23: 109-113. 10.1016/S0968-0004(98)01187-6.
5. Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18 (20): 6097-6100. 10.1093/nar/18.20.6097. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=332411&tool=pmcentrez&rendertype=abstract]
6. Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007, 8 (Suppl 7): S21. 10.1186/1471-2105-8-S7-S21.
7. Robison K, McGuire AM, Church GM: E. coli DNA-Binding Site Matrices Applied to the Complete E. coli K-12 Genome. J Mol Biol. 1998, 284: 241-254. 10.1006/jmbi.1998.2160. [http://arep.med.harvard.edu/ecoli_matrices/]
8. Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005, 21 (13): 2933-2942. 10.1093/bioinformatics/bti473.
9. Reid JE, Evans KJ, Dyer N, Wernisch L, Ott S: Variable structure motifs for transcription factor binding sites. BMC Genomics. 2010, 11 (30): 30. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2824720&tool=pmcentrez&rendertype=abstract]
10. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Sci, New Ser. 1993, 262 (5131): 208-214.
11. Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res. 2005, 33 (Web Server issue): W432-W437. [http://www.ncbi.nlm.nih.gov/pubmed/15980505]
12. Notredame C: Recent Evolutions of Multiple Sequence Alignment Algorithms. PLoS Comput Biol. 2007, 3 (8): 4. [http://www.ncbi.nlm.nih.gov/pubmed/17784778]
13. Fazius E, Shelest V, Shelest E: SiTaR: a novel tool for transcription factor binding site prediction. Bioinformatics (Oxford, England). 2011, 27 (20): 2806-2811. 10.1093/bioinformatics/btr492. [http://www.ncbi.nlm.nih.gov/pubmed/21893518]
14. Badis G, et al.: Diversity and Complexity in DNA Recognition by Transcription Factors. Science. 2009, 324: 1720-1723. 10.1126/science.1162327.
15. Bulyk ML, Johnson PLF, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002, 30 (5): 1255-1261. 10.1093/nar/30.5.1255.
16. Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in protein-DNA binding sites. Proceedings Seventh Annu Int Conference Comput Mol Biol - RECOMB ’03. 2003, New York, New York USA: ACM Press, 28-37. [http://dl.acm.org/citation.cfm?id=640075.640079]
17. Zhou Q, Liu JS: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics (Oxford, England). 2004, 20 (6): 909-916. 10.1093/bioinformatics/bth006. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/6/909]
18. Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004, 20 (18): 3516-3525. 10.1093/bioinformatics/bth438. [http://www.ncbi.nlm.nih.gov/pubmed/15297295]
19. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000, 28: 316-319. 10.1093/nar/28.1.316.
20. Larkin MA, et al.: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
21. Schneider TD, Stormo GD, Gold L: Information Content of Binding Sites on Nucleotide Sequences. J Mol Biol. 1986, 188: 415-431. 10.1016/0022-2836(86)90165-8.
22. Sonego P, Kocsor A, Pongor S: ROC analysis: applications to the classification of biological sequences and 3D structures. Briefings Bioinf. 2008, 9 (3): 198-209. 10.1093/bib/bbm064. [http://www.ncbi.nlm.nih.gov/pubmed/18192302]
23. Sheskin DJ: Handbook of Parametric and Nonparametric Statistical Procedures. 2000, Boca Raton, Florida: Chapman & Hall/CRC.
Acknowledgements
This research is supported in part by the National Science Foundation (US) under grant CCF-0755373.
Additional information
Competing interests
Financial competing interests
• In the past five years have you received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Is such an organization financing this manuscript (including the articleprocessing charge)? If so, please specify. No
• Do you hold any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? If so, please specify. No
• Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? If so, please specify. No
• Do you have any other financial competing interests? If so, please specify. No
Nonfinancial competing interests
• Are there any nonfinancial competing interests (political, personal, religious, ideological, academic, intellectual, commercial or any other) to declare in relation to this manuscript? If so, please specify. No
Authors’ contributions
CHH planned and directed the research, described the model, and proposed the naïve sorting-based multiple sequence alignment algorithm. SQ implemented the model and methodology, carried out the experiments, and performed the statistical evaluation of the results. Both authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Quader, S., Huang, CH. Effect of positional dependence and alignment strategy on modeling transcription factor binding sites. BMC Res Notes 5, 340 (2012). https://doi.org/10.1186/1756-0500-5-340
DOI: https://doi.org/10.1186/1756-0500-5-340
Keywords
 Information Content
 False Positive Rate
 Multiple Sequence Alignment
 Core Region
 Transcription Factor Binding Site