Calculation of Splicing Potential from the Alternative Splicing Mutation Database

Background: The Alternative Splicing Mutation Database (ASMD) presents a collection of all known mutations inside human exons which affect splicing enhancers and silencers and cause changes in the alternative splicing pattern of the corresponding genes.

Findings: An algorithm was developed to derive a Splicing Potential (SP) table from the ASMD information. This table characterizes the influence of each oligonucleotide on the splicing effectiveness of the exon containing it. If the SP value for an oligonucleotide is positive, it promotes exon retention, while negative SP values mean the sequence favors exon skipping. The merit of the SP approach is the ability to separate splicing signals from a wide range of sequence motifs enriched in exonic sequences that are attributed to protein-coding properties and/or translation efficiency. Due to its direct derivation from observed splice site selection, SP has an advantage over other computational approaches for predicting alternative splicing.

Conclusion: We show that a vast majority of known exonic splicing enhancers have highly positive cumulative SP values, while known splicing silencers have core motifs with strongly negative cumulative SP values. Our approach allows for computation of the cumulative SP value of any sequence segment and, thus, gives researchers the ability to measure the possible contribution of any sequence to the pattern of splicing.


Background
Key regulators of alternative splicing include a large variety of short sequence motifs inside exons, known as exonic splicing enhancers (ESE) and exonic splicing silencers (ESS). These regulatory sequences have been characterized by several experimental techniques [1][2][3][4][5] and also by different computational approaches [6][7][8][9][10][11][12][13][14]. Despite this progress, one still cannot predict predisposition to alternative splicing from genomic data. In this respect, a set of mutations known to be associated with alternative splicing effects (reviewed by [5,15]) is a valuable raw material for the investigation of the fine regulation of splicing. A novel database of these mutations, named the Alternative Splicing Mutation Database (ASMD), is described in the accompanying paper [16]. The ASMD represents a collection of human exon sequences with internal mutations that change the balance of alternatively spliced mRNA isoforms or cause the appearance of new mRNA isoforms. The ASMD includes only those mutations that change exonic enhancers and silencers and does not encompass those that change splice sites. Here we present a novel statistical approach for processing ASMD mutational datasets, converting them into a table of "Splicing Potential" (SP) values for every possible short oligonucleotide. If the SP value for an oligonucleotide is positive, it promotes exon retention, while negative SP values mean the sequence favors exon skipping. SP appears to be a valuable tool for evaluating the influence of a given sequence on splicing, for finding and testing putative ESE and ESS motifs, and for predicting the effect of a given mutation on splicing.

Algorithm for calculation of Splicing Potential
Our SP algorithm processes all oligonucleotides that appear and disappear in the mutations described in the ASMD. Due to the limited size of the current ASMD dataset, we only calculate SP values for triplets. For example, the mutation (G -> T) in the 14th exon of the gene BRCA1 (entry '10asmd') occurs in the exonic region gctGagt -> gctTagt (the mutation site is in the middle and is shown in capital letters). This mutation generates three new triplets (ctT, tTa, and Tag) and, at the same time, eliminates three triplets from the wild-type sequence (ctG, tGa, and Gag). The splicing effect of this mutation is SE = -1, meaning that this mutation causes the 14th exon to be skipped in all gene transcripts. Because the wild-type triplets ctG, tGa, and Gag strengthen splicing of the exon, the algorithm increases their potential values by $SP_i = \log_{10}(w)$, where $w = 1 + |SE|$ and the index $i$ is the case identifier. In this example, SE is equal to -1, and thus w = 2. In our algorithm, w is simply a weight factor that gives more impact to mutations that cause more dramatic changes in splicing patterns. In addition, because the mutant triplets (ctT, tTa, and Tag) weaken the splicing of the exon, the algorithm decreases their SP values by the same amount, $SP_i = -\log_{10}(w)$. The final potential value for a particular triplet xyz is the sum of $SP_i(xyz)$ over all cases in the ASMD where this triplet appears or disappears due to mutations. Finally, to make the SP values independent of the ASMD sample size, we normalize them by the standard deviation of SP values ($\sigma_{SP}$). Thus, final SP values are calculated by the formula:

$$SP(xyz) = \frac{\sum_i SP_i(xyz)}{\sigma_{SP}} \qquad (1)$$

We compared SP(xyz) with the coding potential, CP(xyz), for the triplet xyz, which was calculated in one of its simplest forms using equation (2), where $F_c(xyz)$ is the frequency of the triplet xyz inside coding exonic regions, and $F_i(xyz)$ is the frequency of xyz inside introns. Throughout this study we multiplied all SP and CP values by 0.243, the $\sigma_{CP}$ for the entire sample of non-redundant human genes. If CP(xyz) has a positive value, the xyz triplet is more abundant in exons than in introns. When CP(xyz) is negative, the opposite is true and xyz is more abundant in introns. There are several much more advanced formulas for computing CP, which take into account additional information such as reading frames, exon length, and overall genome composition [17,18]. Usually these approaches use advanced statistics, such as Markov models. However, for a proper comparison of our initial SP values with CP values, we deliberately used formula (2). Neither formula (1) nor formula (2) accounts for reading frames and other genomic peculiarities. Such restrictions are appropriate for the limited size of the current ASMD dataset.
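The SP calculation described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the authors' Perl implementation. The record layout (wild-type sequence, 0-based mutation position, mutant base, SE), the function names, the mirrored sign for mutations with SE > 0, and the choice of a population standard deviation over the triplets actually observed are all assumptions that the text does not specify.

```python
import math
from collections import defaultdict
from statistics import pstdev


def overlapping_kmers(seq, pos, k=3):
    """Return every k-mer of seq (lowercased) that covers 0-based position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k].lower() for i in range(first, last + 1)]


def splicing_potential(mutations, k=3):
    """Return (raw, normalized) SP tables as dictionaries keyed by k-mer.

    mutations: iterable of (wild_type_seq, pos, mutant_base, se) tuples.
    This record layout is hypothetical and is not the ASMD schema.
    """
    raw = defaultdict(float)
    for wild_type, pos, mutant_base, se in mutations:
        mutant = wild_type[:pos] + mutant_base + wild_type[pos + 1:]
        delta = math.log10(1 + abs(se))  # SP_i magnitude with w = 1 + |SE|
        # Assumption: a mutation that increases exon retention (SE > 0)
        # is scored with mirrored signs; the text only describes SE < 0.
        if se > 0:
            delta = -delta
        for kmer in overlapping_kmers(wild_type, pos, k):
            raw[kmer] += delta  # wild-type k-mers supported retention
        for kmer in overlapping_kmers(mutant, pos, k):
            raw[kmer] -= delta  # mutant k-mers are associated with skipping
    if not raw:
        return {}, {}
    # Assumption: sigma_SP is the population standard deviation of the
    # summed values over the k-mers that were observed at least once.
    sigma = pstdev(list(raw.values()))
    if sigma == 0:
        raise ValueError("cannot normalize: all summed SP values are identical")
    return dict(raw), {kmer: value / sigma for kmer, value in raw.items()}


# The BRCA1 exon 14 example from the text: gctGagt -> gctTagt, SE = -1.
raw_table, sp_table = splicing_potential([("gctGagt", 3, "T", -1.0)])
```

With this single record, the wild-type triplets ctg, tga, and gag each receive a raw value of log10(2), about 0.301, and the mutant triplets ctt, tta, and tag each receive the negative of that value.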
The SP and CP values for all 64 triplets are shown in Table 1. The more positive the SP value of the triplet, the more frequently its appearance is associated with retention of the exon containing it. Conversely, the more negative the SP value, the more frequently its inclusion is associated with exon skipping.

Table 2 presents a list of experimentally determined, non-redundant ESE sequences that have also been evaluated by another computational approach in Down et al. [9]. We do not consider ESEs with ambiguous bases in their internal regions (for instance, the tgcngyy sequence) because even a single nucleotide substitution in the analyzed motif could dramatically change its cumulative SP value. All ESEs in Table 2 have high, positive cumulative SP values. Their average cumulative SP value per triplet is 0.17.

Table 3 presents the consensus sequences of computer-predicted and verified ESE motifs obtained with the RESCUE-ESE method [11]. Eight out of ten of these RESCUE-ESE sequences also have positive cumulative SP values. Yet, the average SP per triplet (0.07) of these ten RESCUE-ESEs is much less than that of the experimentally determined ESEs (0.17). Two out of ten RESCUE-ESE sequences have negative SP values (motifs #2 and #7, Table 3). A different computational approach based on a machine learning strategy also found that these two motifs do not significantly affect splicing, assigning them "negative status" (Down et al. [9]). We also processed the total list of 238 putative RESCUE-ESE human splicing enhancers from the Hollywood exon annotation database [19]. The average SP value per triplet for this list is 0.08. Of these ESEs, 200 have positive cumulative SP values and 38 have negative ones.

Table 4 presents experimentally verified ESS sequences from the RegRNA database [20]. We do not include very long (>50 nt) and very short (<5 nt) ESSs.
We also excluded a controversial GAAGAAGA silencer motif because it overlaps with the well-known ESE motif AAGAA, as well as ESE #5 from Table 3. Table 4 demonstrates that three out of five of these ESSs have negative cumulative SP values. However, all ESSs from this list have core sequences (shown in bold) with highly negative cumulative SP values (shown in the last column in Table 4). In Table 1, the triplets are sorted by CP value in descending order.
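The cumulative SP value of a motif and its average per triplet, as used for Tables 2 to 4, can be sketched as follows. This is an illustrative Python sketch, not the authors' code. The function name is hypothetical, the three SP values in `toy_table` are invented for the example and are not taken from Table 1, and treating triplets absent from the table as contributing zero is an assumption.

```python
def cumulative_sp(seq, sp_table, k=3):
    """Return (cumulative SP, average SP per k-mer) for a sequence.

    The cumulative value is the sum of the table entries of all
    overlapping k-mers; a sequence of length L has L - k + 1 of them.
    """
    seq = seq.lower()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Assumption: k-mers missing from the table contribute zero.
    total = sum(sp_table.get(kmer, 0.0) for kmer in kmers)
    average = total / len(kmers) if kmers else 0.0
    return total, average


# Invented values for illustration only (not from Table 1).
toy_table = {"gaa": 0.2, "aag": 0.1, "aga": -0.05}
total, average = cumulative_sp("GAAGAA", toy_table)
```

Here the sequence GAAGAA is decomposed into the four triplets gaa, aag, aga, and gaa, so the cumulative value is the sum of four table entries and the per-triplet average divides that sum by four.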

Testing the ability of SP to distinguish exons and introns
We examined the capability of SP and CP to distinguish between exons and introns. For each exon and intron, we obtained the complete set of overlapping triplets (a sequence of L nucleotides is represented by L-2 triplets). The average SP and CP values of exons and introns were calculated by summing all triplet values and dividing by the number of triplets. The distributions of average SP and CP values per length for exons (red curve) and introns (blue curve) are shown in Fig. 1. The overlapping area of the peaks represented by exon (blue) and intron (red) curves from Fig. 1

We also examined the distribution of average SP and CP values in the alternatively spliced exons of humans. Fig. 1 shows the distribution of average SP and CP values in a special case of alternative splicing: skipped exons with a high skipping/retaining ratio (shown as green curves). Fig. 1A demonstrates that the average SP values of skipped exons are very similar to those of constitutive exons (95% of the curves overlap), yet the curve for skipped exons has a slight, consistent shift toward the intron curve for every data point. The corresponding data for the average CP-value curves (Fig. 1B) are not as smooth. There are several intersections between the average CP curves for skipped and constitutive exons. Thus, the CP data are less amenable to interpretation.
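The per-sequence averaging and the comparison of two distributions can be sketched as below. This is a minimal Python illustration under stated assumptions: the function names, the fixed-width binning, and the definition of overlap as the summed bin-wise minimum of two normalized histograms are choices made for this sketch, and the text does not state how the overlapping area in Fig. 1 was computed.

```python
def average_sp(seq, sp_table, k=3):
    """Average table value over the L - k + 1 overlapping k-mers of seq."""
    seq = seq.lower()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    # Assumption: k-mers missing from the table contribute zero.
    return sum(sp_table.get(kmer, 0.0) for kmer in kmers) / len(kmers)


def histogram(values, lo, hi, bins):
    """Relative frequencies of values in `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        if lo <= value <= hi:
            index = min(int((value - lo) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def overlap_area(values_a, values_b, lo, hi, bins):
    """Shared area of two normalized histograms (1.0 identical, 0.0 disjoint)."""
    hist_a = histogram(values_a, lo, hi, bins)
    hist_b = histogram(values_b, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))
```

In such a setup, `average_sp` would be applied to every exon and every intron, and `overlap_area` would be evaluated on the two resulting lists of averages; a smaller overlap indicates better separation of the two sequence classes.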

Discussion
Splicing Potential is a statistical approach for evaluating the involvement of oligonucleotides in splicing that is based solely on the ASMD dataset. For each mutation we study the entire group of triplets overlapping this mutation because we do not know their individual contributions to splicing. Plausibly, a number of the triplets in these groups have no significant effect on splicing. These sequences produce statistical "noise," appearing in our processing algorithm in some instances as splicing enhancers (having positive $SP_i$ values) and in others as splicing silencers (with negative $SP_i$ values). Collecting more data on splicing mutations should statistically resolve such irrelevant oligonucleotides, bringing their SP values closer to zero.
Enlarging the ASMD dataset will present the opportunity to compute the SP values for larger oligonucleotides. To generate a reliable SP table for 4-mers we need to know at least 250 mutations that affect splicing; for 5-mers, 800 mutations; and for 6-mers, at least 3000 mutations. It is well known that the predictive power of the coding potential (CP) increases dramatically with longer oligonucleotides: (n+1)-mers are always much better than n-mers, and 6-mers are the most commonly used oligonucleotides in real-world computations [21]. By analogy, we expect that the predictive power of SP will dramatically increase when SP values for longer oligonucleotides (up to 6-mers) have been computed.
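One way to read these thresholds is as a rough count of observations per oligonucleotide. The short Python sketch below is an illustration only: it assumes that every mutation is a single-nucleotide substitution away from the sequence ends, so that it removes k wild-type k-mers and creates k mutant k-mers, and that observations are spread evenly over all 4^k possible k-mers. The text does not state how the thresholds were derived, and the function name is hypothetical.

```python
def observations_per_kmer(n_mutations, k):
    """Average number of k-mer observations per possible k-mer.

    Assumes each point substitution touches 2 * k k-mers
    (k eliminated, k created) and that there are 4 ** k possible k-mers.
    """
    return 2 * k * n_mutations / 4 ** k


for n_mutations, k in [(115, 3), (250, 4), (800, 5), (3000, 6)]:
    print(f"{k}-mers, {n_mutations} mutations: "
          f"{observations_per_kmer(n_mutations, k):.2f} observations per k-mer")
```

Under these assumptions, the current set of 115 mutations gives about 10.8 observations per triplet, and the quoted thresholds correspond to about 7.8 observations per 4-mer, 7.8 per 5-mer, and 8.8 per 6-mer.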
We currently operate with the small set of 115 mutations in the ASMD. Even this limited dataset demonstrated an impressive trend in distinguishing between exonic and intronic sequences and also a very small, yet consistent, difference between constitutive and skipped exons. The SP values of triplets obtained from only 115 mutations are 1.5 times better at separating exons and introns than the analysis of triplet frequencies using Coding Potential. Further expansion of the ASMD dataset should dramatically increase the accuracy of the SP values and add power to this new tool for the prediction of exon/intron gene structures and, hopefully, alternative splicing.
Supplementary Methods can be found in Additional file 1.

Operating system(s): Platform-independent
Programming Language: Perl
Other requirements: a Perl 5 interpreter

License: GNU GPL v3
Restrictions to use by non-academics: None (not applicable under GPL)