- Technical Note
Calculation of Splicing Potential from the Alternative Splicing Mutation Database
BMC Research Notes volume 1, Article number: 4 (2008)
The Alternative Splicing Mutation Database (ASMD) presents a collection of all known mutations inside human exons which affect splicing enhancers and silencers and cause changes in the alternative splicing pattern of the corresponding genes.
An algorithm was developed to derive a Splicing Potential (SP) table from the ASMD information. This table characterizes the influence of each oligonucleotide on the splicing effectiveness of the exon containing it. If the SP value for an oligonucleotide is positive, it promotes exon retention, while negative SP values mean the sequence favors exon skipping. The merit of the SP approach is the ability to separate splicing signals from a wide range of sequence motifs enriched in exonic sequences that are attributed to protein-coding properties and/or translation efficiency. Due to its direct derivation from observed splice site selection, SP has an advantage over other computational approaches for predicting alternative splicing.
We show that a vast majority of known exonic splicing enhancers have highly positive cumulative SP values, while known splicing silencers have core motifs with strongly negative cumulative SP values. Our approach allows for computation of the cumulative SP value of any sequence segment and, thus, gives researchers the ability to measure the possible contribution of any sequence to the pattern of splicing.
One of the key regulators of alternative splicing is a large variety of short sequence motifs inside exons known as exonic splicing enhancers (ESE) and exonic splicing silencers (ESS). These regulatory sequences have been characterized by several experimental techniques [1–5] and also by different computational approaches [6–14]. Despite this progress, one still cannot predict predisposition to alternative splicing from genomic data. In this respect, a set of mutations known to be associated with alternative splicing effects (reviewed by [15, 5]) is a valuable raw material for the investigation of the fine regulation of splicing. A novel database of these mutations, named the Alternative Splicing Mutation Database (ASMD), is described in the accompanying paper. The ASMD represents a collection of human exon sequences with internal mutations that change the balance of alternatively spliced mRNA isoforms or cause the appearance of new mRNA isoforms. The ASMD includes only those mutations that change exonic enhancers and silencers and does not encompass those that change splice sites. Here we present a novel statistical approach for processing ASMD mutational datasets, converting them into a table of "Splicing Potential" (SP) values for every possible short oligonucleotide. If the SP value for an oligonucleotide is positive, it promotes exon retention, while negative SP values mean the sequence favors exon skipping. SP appears to be a valuable tool for evaluating the influence of a given sequence on splicing, for finding and testing putative ESE and ESS motifs, and for predicting the effect of a given mutation on splicing.
Algorithm for calculation of Splicing Potential
Our SP algorithm processes all oligonucleotides that appear and disappear in the mutations described in the ASMD. Due to the limited size of the current ASMD dataset, we only calculate SP values for triplets. For example, the mutation (G -> T) in the 14th exon of the gene BRCA1 (entry '10asmd') occurs in the exonic region gctGagt -> gctTagt (the mutation site is in the middle and is shown in capital letters). This mutation generates three new triplets (ctT, tTa, and Tag) and, at the same time, eliminates three triplets from the wild-type sequence (ctG, tGa, and Gag). The splicing effect of this mutation is SE = -1, meaning that this mutation causes the 14th exon to be skipped in all gene transcripts. Because the wild-type triplets ctG, tGa, and Gag strengthen splicing of the exon, the algorithm increases their potential values by SPi = log10(w), where w = 1 + abs(SE) and the index i is the case identifier. In this example, SE is equal to -1, and thus w = 2. In our algorithm, w is simply the weight factor that awards more impact to those mutations that cause more dramatic changes in splicing patterns. In addition, because the mutant triplets (ctT, tTa, and Tag) weaken the splicing of the exon, the algorithm decreases their SP values by the same amount, SPi = -log10(w). The final potential value for a particular triplet xyz is the sum of SPi(xyz) over all cases in the ASMD where this triplet appears or disappears due to mutations. Finally, to make the SP values independent of the ASMD sample size, we normalize them by the standard deviation of SP values (σSP). Thus, final SP values are calculated by the formula:
SP(xyz) = sum(SPi(xyz))/σSP. (1)
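The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Perl implementation: the function names, the tuple layout of a case (wild-type sequence, mutant sequence, 0-based mutation position, SE), the mirrored treatment of mutations with positive SE, and the choice to take σSP over the triplets actually observed are all assumptions made here.

```python
import math
from collections import defaultdict
from statistics import pstdev


def overlapping_kmers(seq, pos, k=3):
    """Return every k-mer of seq that covers the 0-based index pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]


def splicing_potential(cases, k=3):
    """Build a normalized SP table from (wild, mutant, pos, se) tuples.

    The tuple layout is hypothetical; the ASMD itself is not read here.
    """
    raw = defaultdict(float)
    for wild, mutant, pos, se in cases:
        if se == 0:
            continue  # no splicing effect, nothing to score
        step = math.log10(1 + abs(se))  # SPi = log10(w), w = 1 + abs(SE)
        # SE < 0: the mutation promotes skipping, so lost wild-type k-mers
        # are credited and gained mutant k-mers are penalized.
        # SE > 0 is treated as the mirror image (an assumption).
        sign = 1.0 if se < 0 else -1.0
        for kmer in overlapping_kmers(wild.lower(), pos, k):
            raw[kmer] += sign * step
        for kmer in overlapping_kmers(mutant.lower(), pos, k):
            raw[kmer] -= sign * step
    if not raw:
        return {}
    sigma = pstdev(list(raw.values()))  # sigma_SP over observed k-mers
    if sigma == 0:
        raise ValueError("all raw SP values are identical; cannot normalize")
    return {kmer: value / sigma for kmer, value in raw.items()}


# The BRCA1 example from the text: gctGagt -> gctTagt, SE = -1.
table = splicing_potential([("gctGagt", "gctTagt", 3, -1)])
for kmer in sorted(table):
    print(kmer, round(table[kmer], 3))
```

With this single case the three wild-type triplets receive equal positive scores and the three mutant triplets equal negative scores; real values only become informative once many cases are summed.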
We compared SP(xyz) with the coding potential, CP(xyz), for the triplet xyz, which was calculated by one of the simplest forms using equation:
CP(xyz) = log10(Fc(xyz)/Fi(xyz))/σCP, (2)
where Fc(xyz) is the frequency of the triplet xyz inside coding exonic regions, and Fi(xyz) is the frequency of xyz inside introns. Throughout this study we multiplied all SP and CP values by 0.243, the σCP for the entire sample of non-redundant human genes. If CP(xyz) has a positive value, the xyz triplet is more abundant in exons than in introns. When CP(xyz) is negative, the opposite is true and xyz is more abundant in introns. There are several much more advanced formulas for computing CP, which take into account additional information such as reading frames, exon length, overall genome composition, etc. [17, 18]. Usually these approaches use advanced statistics, such as Markov models. However, to allow a like-for-like comparison of our initial SP values with CP values, we deliberately used formula (2). Neither formula (1) nor formula (2) accounts for reading frames or other genomic peculiarities. Such restrictions are appropriate for the limited size of the current ASMD dataset.
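Formula (2) can be illustrated with a short Python sketch. The function names and the toy sequences are hypothetical, and two simplifications are assumed here that the text does not specify: triplets missing from either sample are skipped rather than given pseudocounts, and σCP is taken over the triplets that remain.

```python
import math
from collections import Counter
from statistics import pstdev


def triplet_frequencies(sequences):
    """Relative frequency of every overlapping triplet in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {triplet: n / total for triplet, n in counts.items()}


def coding_potential(exons, introns):
    """CP(xyz) = log10(Fc(xyz) / Fi(xyz)) / sigma_CP for shared triplets."""
    fc = triplet_frequencies(exons)
    fi = triplet_frequencies(introns)
    shared = fc.keys() & fi.keys()  # assumption: no pseudocounts
    raw = {t: math.log10(fc[t] / fi[t]) for t in shared}
    sigma = pstdev(list(raw.values()))
    if sigma == 0:
        raise ValueError("all log-ratios are identical; cannot normalize")
    return {t: value / sigma for t, value in raw.items()}


# Toy input: 'aaa' is three times as frequent in the "exon" as in the "intron".
cp = coding_potential(["aaaaac"], ["aaaccc"])
print(cp)
```

In this toy input the triplet aaa is enriched in the exon sample and gets a positive CP, while aac is equally frequent in both and gets a CP of zero.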
The SP and CP values for all 64 triplets are shown in Table 1. The more positive the SP value of the triplet, the more frequently its appearance is associated with retention of the exon containing it. Conversely, the more negative the SP value, the more frequently its inclusion is associated with exon skipping. Table 1 reveals a considerable Pearson correlation (r = 0.59) between coding potential (CP) and splicing potential (SP) values of triplets. A majority of triplets with positive CP values (meaning that their frequency in exons is greater than in introns) also have positive SP values, while triplets with negative CP values frequently have negative SP values. However, the SP and CP values of a given triplet can differ significantly (for instance, see triplets ccg, ccc, ggg, taa, att).
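A Pearson correlation between two triplet tables can be computed as in the sketch below. The helper names are hypothetical and the numbers in the usage lines are made-up placeholders, not values from Table 1.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def table_correlation(sp_table, cp_table):
    """Correlate SP and CP over the triplets present in both tables."""
    shared = sorted(sp_table.keys() & cp_table.keys())
    return pearson_r([sp_table[t] for t in shared],
                     [cp_table[t] for t in shared])


# Placeholder values for illustration only.
sp_demo = {"aaa": 0.1, "ccc": -0.2, "ggg": 0.3}
cp_demo = {"aaa": 0.2, "ccc": -0.4, "ggg": 0.6}
print(table_correlation(sp_demo, cp_demo))
```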
Testing of SP values of splicing enhancers and silencers
Tables 2–5 present cumulative SP values for a number of known ESE and ESS sequences, calculated by summing the SP values of all triplets composing them. Since ESEs and ESSs can have different lengths, we also calculated the average SP value per triplet. For example, the first motif in Table 2, aggacagagc, is composed of eight triplets (agg, gga, gac, aca, cag, aga, gag, agc). The sum of the SP values of these triplets is 0.89 and the value per triplet is 0.89/8 = 0.11.
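The cumulative and per-triplet scores described above reduce to a sliding-window sum. The sketch below assumes a complete 64-entry SP table held in a dictionary; the function names are hypothetical, and the table used in the usage lines is a uniform placeholder because the Table 1 values are not reproduced here.

```python
from itertools import product


def triplets(seq):
    """All overlapping triplets of a sequence, in order (L - 2 of them)."""
    seq = seq.lower()
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def cumulative_sp(seq, sp_table):
    """Return (cumulative SP, average SP per triplet) for a sequence."""
    parts = triplets(seq)
    if not parts:
        raise ValueError("sequence must contain at least three nucleotides")
    total = sum(sp_table[t] for t in parts)
    return total, total / len(parts)


# Placeholder table: every triplet scores 0.1 (not real SP values).
toy_table = {"".join(p): 0.1 for p in product("acgt", repeat=3)}

print(triplets("aggacagagc"))
print(cumulative_sp("aggacagagc", toy_table))
```

With the real Table 1 values in place of the placeholder table, the same call would give the cumulative and per-triplet figures of the kind reported in Tables 2 to 5.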
Table 2 presents a list of experimentally-determined non-redundant ESE sequences that have also been evaluated by another computational approach in Down et al. We do not consider ESEs with ambiguous bases in their internal regions (for instance, the sequence tgcngyy) because even a single nucleotide substitution in the analyzed motif could dramatically change its cumulative SP value. All ESEs in Table 2 have high, positive, cumulative SP values. Their average cumulative SP value per triplet is 0.17.
Table 3 presents the consensus sequences of computer-predicted and verified ESE motifs obtained with the RESCUE-ESE method. Eight out of ten of these RESCUE-ESE sequences also have positive cumulative SP values. Yet, the average SP per triplet (0.07) of these ten RESCUE-ESEs is much less than that of the experimentally-determined ESEs (0.17). Two out of ten RESCUE-ESE sequences have negative SP values (motifs #2 and #7, Table 3). Using a different computational approach based on a machine learning strategy, Down et al. also found that these two motifs have an insignificant impact on splicing, or "negative status." We also processed the total list of 238 putative RESCUE-ESE human splicing enhancers from the Hollywood exon annotation database. The average SP value per triplet for this list is 0.08. Of the ESEs in this list, 200 have positive cumulative SP values and 38 have negative values.
Table 4 presents experimentally verified ESS sequences from the RegRNA database. We do not include very long (>50 nt) and very short (<5 nt) ESSs. We also excluded a controversial GAAGAAGA silencer motif because it overlaps with the well-known ESE motif AAGAA, as well as ESE #5 from Table 3. Table 4 demonstrates that three out of five of these ESSs have negative cumulative SP values. However, all ESSs from this list have core sequences (shown in bold) with highly negative cumulative SP values (shown in the last column in Table 4).
Finally, Table 5 presents a list of 21 in vitro selected putative ESSs published by Wang et al. Eighteen ESSs from this table have negative SP values, while two sequences have slightly positive SP values and only one sequence (ESS4) stands apart with a highly positive SP value (0.16 per triplet). The average cumulative SP for this group of 21 putative ESSs is -0.17.
All in all we see a strong tendency for ESE to have positive cumulative SP-values and for core sequences of ESS to have negative cumulative SP-values. Therefore, this approach could be used for evaluation of a broad range of sequences for their contribution to the pre-mRNA splicing process.
Testing the ability of SP to distinguish exons and introns
The capability of SP and CP to distinguish between exons and introns has been examined. The complete sets of triplets composing each single exon and intron have been obtained (a sequence of L nucleotides is represented by (L-2) triplets). The average SP and CP values of exons and introns were calculated by summing all triplet values and dividing by the number of triplets. The distributions of average SP and CP values per length for exons (red curve) and introns (blue curve) are shown in Fig. 1.
The overlapping area of the exon and intron peaks in Fig. 1 is 1.5 times smaller for the average SP values (46% overlap) than for the average CP values (68% overlap). Moreover, the SP values are significantly less variable than the CP values, which enhances the discriminating ability of SP in statistical tests (such as the t-test).
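The text does not say how the overlap percentages were measured. One common way to quantify the overlap of two distributions, assumed here purely for illustration, is to bin both samples on a shared grid, normalize each histogram to unit area, and sum the bin-wise minima. The function names, bin count, and sample values below are hypothetical.

```python
def normalized_histogram(values, lo, hi, bins):
    """Fraction of in-range values falling into each of `bins` equal bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi lands in the last bin.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside [lo, hi]")
    return [c / total for c in counts]


def distribution_overlap(sample_a, sample_b, lo, hi, bins=20):
    """Overlap in [0, 1]: sum of bin-wise minima of the two histograms."""
    hist_a = normalized_histogram(sample_a, lo, hi, bins)
    hist_b = normalized_histogram(sample_b, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Made-up per-sequence average scores, for illustration only.
exon_scores = [0.05, 0.06, 0.08, 0.02]
intron_scores = [-0.04, -0.01, 0.02, -0.06]
print(distribution_overlap(exon_scores, intron_scores, -0.1, 0.1, bins=10))
```

An overlap of 1 means the two histograms coincide and 0 means they share no bin; the result depends on the bin width, so values from this sketch are not directly comparable with the percentages quoted above.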
We also examined the distribution of average SP and CP values in the alternatively spliced exons of humans. Fig. 1 shows the distribution of average SP and CP values in a special case of alternative splicing: skipped exons with a high skipping/retaining ratio (shown as green curves). Fig. 1A demonstrates that the distribution of average SP values of skipped exons is very similar to that of constitutive exons (the curves overlap by 95%), yet the curve for skipped exons has a slight, consistent shift toward the intron curve at every data point. The corresponding curves for average CP values (Fig. 1B) are not as smooth. There are several intersections between the average CP curves for skipped and constitutive exons. Thus the CP data are less amenable to interpretation.
Splicing Potential is a statistical approach for evaluating the involvement of oligonucleotides in splicing that is based solely on the ASMD dataset. For each mutation we study the entire group of triplets overlapping this mutation because we do not know their individual contributions to splicing. Plausibly, a number of the triplets in these groups have no significant effect on splicing. These sequences produce statistical "noise," appearing in our processing algorithm in one set of instances as splicing enhancers (having positive SPi values) and in other cases as splicing silencers (with negative SPi values). Collecting more data on splicing mutations should statistically resolve such irrelevant oligonucleotides, bringing their SP values closer to zero.
Enlarging the ASMD dataset will present the opportunity to compute the SP values for larger oligonucleotides. To generate a reliable SP table for 4-mers we need at least 250 mutations that affect splicing; for 5-mers, 800 mutations; and for 6-mers, at least 3000 mutations. It is well known that the predictive power of the coding potential (CP) increases dramatically with longer oligonucleotides: (n+1)-mers are always much better than n-mers, and 6-mers are the most commonly used oligonucleotides in real-world computations. By analogy, we expect that the predictive power of SP will dramatically increase when SP values for longer oligonucleotides (up to 6-mers) have been computed.
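The text does not explain how these thresholds were derived. One rough way to see why the required sample grows with oligonucleotide length, offered here only as a back-of-the-envelope assumption, is to note that there are 4^k possible k-mers and that a single substitution away from the ends of the exon removes k wild-type k-mers and creates k mutant ones. The function name below is hypothetical.

```python
def mean_observations_per_kmer(k, n_mutations):
    """Average number of gain/loss events per possible k-mer.

    Assumes every mutation is a single substitution far enough from the
    sequence ends to overlap k lost and k gained k-mers, and that events
    spread evenly over all 4**k k-mers (a simplification).
    """
    events = 2 * k * n_mutations
    return events / 4 ** k


# Current dataset (triplets) and the thresholds quoted in the text.
for k, n in [(3, 115), (4, 250), (5, 800), (6, 3000)]:
    print(f"k={k}: {4 ** k} k-mers, {n} mutations, "
          f"{mean_observations_per_kmer(k, n):.1f} events per k-mer")
```

Under these assumptions each quoted threshold corresponds to an average of several gain or loss events per k-mer, which is of the same order as the current triplet table built from 115 mutations.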
We currently operate with a small set of 115 mutations in the ASMD. Even this limited dataset demonstrates an impressive trend in distinguishing between exonic and intronic sequences and also a very small, yet consistent, difference between constitutive and skipped exons. The SP values of triplets obtained from only 115 mutations are 1.5 times better at separating exons and introns than the analysis of triplet frequencies using Coding Potential. Further expansion of the ASMD dataset should dramatically increase the accuracy of the SP values and add power to this new tool for the prediction of exon/intron gene structures and, hopefully, alternative splicing.
Supplementary Methods can be found in Additional file 1.
Availability and requirements
Project name: Splicing Potential
Project home page: http://mco321125.meduohio.edu/~jbechtel/asmd/
Operating system(s): Platform-independent
Programming Language: Perl
Other requirements: a Perl 5 interpreter
License: GNU GPL v3
Restrictions to use by non-academics: None (not applicable under GPL)
Wang Z, Xiao X, Van Nostrand E, Burge CB: General and specific functions of exonic splicing silencers in splicing control. Mol Cell. 2006, 23: 61-70. 10.1016/j.molcel.2006.05.018.
Tian H, Kole R: Selection of novel exon recognition elements from a pool of random sequences. Mol Cell Biol. 1995, 15: 6291-6298.
Coulter LR, Landree MA, Cooper TA: Identification of a new class of exonic splicing enhancers by in vivo selection. Mol Cell Biol. 1997, 17: 2143-2150.
Liu HX, Zhang M, Krainer AR: Identification of functional exonic splicing enhancer motifs recognized by individual SR proteins. Genes Dev. 1998, 12: 1998-2012. 10.1101/gad.12.13.1998.
Valentine CR: The association of nonsense codons with exon skipping. Mutat Res. 1998, 411: 87-117. 10.1016/S1383-5742(98)00010-6.
Zhang XH, Leslie CS, Chasin LA: Computational searches for splicing signals. Methods. 2005, 37: 292-305. 10.1016/j.ymeth.2005.07.011.
Zhang XH, Kangsamaksin T, Chao MS, Banerjee JK, Chasin LA: Exon inclusion is dependent on predictable exonic splicing enhancers. Mol Cell Biol. 2005, 25: 7323-7332. 10.1128/MCB.25.16.7323-7332.2005.
Stadler MB, Shomron N, Yeo GW, Schneider A, Xiao X, Burge CB: Inference of splicing regulatory activities by sequence neighborhood analysis. PLoS Genet. 2006, 2: e191. 10.1371/journal.pgen.0020191.
Down TA, Leong B, Hubbard TJP: A machine learning strategy to identify candidate binding sites in human protein-coding sequence. BMC Bioinformatics. 2006, 7: 419. 10.1186/1471-2105-7-419.
Wang Z, Bolish ME, Yeo G, Tung V, Mawson M, Burge CB: Systematic identification and analysis of exonic splicing silencers. Cell. 2004, 119: 831-845. 10.1016/j.cell.2004.11.010.
Fairbrother WG, Yeh RF, Sharp PA, Burge CB: Predictive identification of exonic splicing enhancers in human genes. Science. 2002, 297: 1007-1013. 10.1126/science.1073774.
Fedorov A, Saxonov S, Fedorova L, Daizadeh I: Comparison of intron-containing and intron-lacking genes elucidates putative exonic splicing enhancers. Nucleic Acids Res. 2001, 29: 1464-1469. 10.1093/nar/29.7.1464.
Cartegni L, Wang J, Zhu Z, Zhang MQ, Krainer AR: ESEfinder: A web resource to identify exonic splicing enhancers. Nucleic Acids Res. 2003, 31: 3568-3571. 10.1093/nar/gkg616.
Pertea M, Mount SM, Salzberg SL: A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana. BMC Bioinformatics. 2007, 8: 159. 10.1186/1471-2105-8-159.
Cartegni L, Chew SL, Krainer AR: Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet. 2002, 3: 285-298. 10.1038/nrg775.
Bechtel JM, Rajesh P, Ilikchyan I, Deng Y, Mishra PK, Wang Q, Wu X, Afonin KA, Grose WE, Wang Y, Khuder S, Fedorov A: The Alternative Splicing Mutation Database: a hub for investigations of alternative splicing using mutational evidence. BMC Res Notes.
Frishman D, Mironov A, Mewes HW, Gelfand M: Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998, 26: 2941-2947. 10.1093/nar/26.12.2941.
Brocchieri L, Kledal TN, Karlin S, Mocarski ES: Predicting coding potential from genome sequence: application to betaherpesviruses infecting rats and mice. J Virol. 2005, 79: 7570-7596. 10.1128/JVI.79.12.7570-7596.2005.
Holste D, Huo G, Tung V, Burge CB: HOLLYWOOD: a comparative relational database of alternative splicing. Nucleic Acids Res. 2006, 34 (Database issue): D56-62. 10.1093/nar/gkj048.
RegRNA: A Regulatory RNA Motifs and Elements Database. [http://bidlab.life.nctu.edu.tw/RegRNA2/website/]
Azad RK, Borodovsky M: Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory. Brief Bioinform. 2004, 5: 118-130. 10.1093/bib/5.2.118.
Sakabe NJ, Vibranovski MD, de Souza SJ: A bioinformatics analysis of alternative exon usage in human genes coding for extracellular matrix proteins. Genet Mol Res. 2004, 30: 532-544.
Shepelev V, Fedorov A: Advances in the Exon-Intron Database (EID). Briefings in Bioinformatics. 2006, 7: 178-185. 10.1093/bib/bbl003.
This project is supported by NSF Career grant MCB-0643542, "Investigation of intron cellular roles".
The authors declare that they have no competing interests.
The Splicing Potential algorithm was conceptualized and developed by JMB, PR, II, YD, PKM, QW, XW, KAA, WEG, YW, and AF. SK was responsible for all statistical analyses. AF supervised the project, provided guidance, and wrote the draft. All authors have read and approved the final manuscript.