Integrating heterogeneous sequence information for transcriptome-wide microarray design; a Zebrafish example
© Breit et al; licensee BioMed Central Ltd. 2010
Received: 16 March 2010
Accepted: 13 July 2010
Published: 13 July 2010
A complete gene-expression microarray should preferably detect all genomic sequences that can be expressed as RNA in an organism, i.e. the transcriptome. However, our knowledge of a transcriptome of any organism still is incomplete and transcriptome information is continuously being updated. Here, we present a strategy to integrate heterogeneous sequence information that can be used as input for an up-to-date microarray design.
Our algorithm consists of four steps. In the first step transcripts from different resources are grouped into Transcription Clusters (TCs) by looking at the similarity of all transcripts. TCs are groups of transcripts with a similar length. If a transcript is much smaller than a TC to which it is highly similar, it will be annotated as a subsequence of that TC and is used for probe design only if the probe designed for the TC does not query the subsequence. Secondly, all TCs are mapped to a genome assembly and gene information is added to the design. Thirdly TC members are ranked according to their trustworthiness and the most reliable sequence is used for the probe design. The last step is the actual array design. We have used this strategy to build an up-to-date zebrafish microarray.
With our strategy and the software developed, it is possible to use a set of heterogeneous transcript resources for microarray design, reduce the number of candidate target sequences on which the design is based and reduce redundancy. By changing the parameters in the procedure it is possible to control the similarity within the TCs and thus the amount of candidate sequences for the design. The annotation of the microarray is carried out simultaneously with the design.
The best scientific experiments are the ones based on the most recent scientific knowledge. Thus, in expression studies, our detector, i.e. the microarray, preferably would be based on the most recent and complete understanding of the genome and transcriptome. Although the annotation of commercially available microarrays is or can be [1, 2] updated on a regular basis, the microarray designs themselves tend to stay unchanged for long periods of time, also due to legacy issues. Furthermore, the microarray design strategy is in many cases proprietary to the microarray manufacturer. Therefore, and apart from the design of the individual probes, for which a variety of software tools exists [3–6], we need a way to translate our knowledge of the genome and transcriptome into a strategy for microarray design for an organism. This is a trivial task neither on the biological nor on a technical level.
The concept of a gene has evolved from a stretch on the genome that encodes one protein to an entity that represents many and complex relations that exist between sequence and biological function. The definition of a gene by Gerstein et al. [7, 8] as 'a union of genomic sequences encoding a coherent set of potentially overlapping functional products' allows genes to have an overlapping sequence, to be alternatively spliced and to exert functions other than protein coding. However, it makes a gene less tangible and thus less prone for microarray probe design because in many cases a gene is not just one distinct physical entity.
Number of Transcripts and Genes in Ensembl
# known protein-coding genes
Thus, microarray probes should be designed on transcripts or predicted transcripts, be annotated with gene information and use the most recent transcriptome resources. Because of the exploratory nature of transcriptomics experiments, most scientists wish to detect as many different transcripts as possible, rather than to limit themselves to established transcripts and genes only. The ongoing miniaturization in microarray manufacture also allows such an approach. A simple strategy would be to design probes for all resources separately and put these together on the microarray. However, this approach causes serious difficulties in the expression analysis, such as problems in gene set enrichment and overrepresentation analysis due to redundancy of probes representing the same transcript. Here we will show a strategy to integrate the heterogeneous sequence information for transcriptome-wide microarray design and show the result of our approach for the zebrafish transcriptome.
The transcript clustering is started by ordering all sequences by length and, starting with the longest sequence, mapping them onto one another using the BLAST algorithm . A similarity threshold is used as to consider only sequences with a matching part larger than a threshold T (Figure 2). Hence, if a sequence is only similar to itself, a new TC containing that transcript is made. If the sequence can be mapped by using the similarity threshold to more sequences and if those target BLAST sequences that are as long as or longer than the query sequence itself have a matching part larger than a threshold U, the query sequence is added to the TC to which the BLAST target sequence belongs. Query sequences that have a high similarity to the BLAST target sequence but are much smaller may be actual biologically distinct molecules as compared to the BLAST target sequences, e.g. a splice variant or a member of gene family. These sequences are set aside and are further processed in the Array Design step. Thus, if the sequence can be mapped to more sequences, but if those target BLAST sequences that are as long as or longer than the sequence itself have a matching part smaller than a threshold U, the query sequence is categorized as a subsequence unless it contains a non-similar end of at least H nucleotides in comparison with the BLAST target sequence. H is taken sufficiently large as to make the design of a probe possible. This step in the algorithm distinguishes a protruding query sequence from the very similar BLAST target sequence. The parameter H facilitates the distinction between subsequences for which probes might be designed (see the Array Design paragraph) and protruding sequences that are organized by introducing a new TC for which a probe must be designed. If the nucleotide order and composition of a sequence has low complexity, no BLAST hits are returned. These sequences are marked as low-complexity (LC) sequences and are discarded from the design.
In order to make a gene annotation for each TCs, the table of TCs is mapped to Ensembl using R-BioMart :. The TC is split if its sequences map to more than one Ensembl gene (Figure 2, Rule 1). In that case, sequences without a mapping to Ensembl are discarded. If a sequence is mapped to more than one Ensembl gene, the TC is only split, if this does not introduce redundancy, i.e. different TCs containing identical transcripts.
Next the most-trustworthy sequence in a TC is chosen for the actual probe design. Resources that apply a higher level of biological evidence are deemed to have a higher trustworthiness and sequences in a resource that are annotated based on biological evidence are chosen over transcripts that are in silico predictions (Figure 2, Rule 2). For instance: Ensembl transcripts are prioritized in the order 'known', 'novel' and 'pseudogene'. RefSeq sequences are chosen in the order of their prefix NM_ (mature messenger RNA transcripts), XM_ (model mRNA), NR_ (non-coding transcripts) and XR_ (model non-coding transcripts). If there are more UniGenes in the TC, the 'complete cdss' are favored. In all cases, if there is a draw, the longest sequence is taken and if then still no decision can be made the first sequence is chosen. If the 5' to 3' direction of EST-based UniGenes is not known, also the reverse complement of the candidate sequence is made. For a different organism or for a different choice of transcript resources this procedure can be easily adapted.
With the resulting list of probe target candidate sequences, microarray probes are designed. Next, the designed probes are mapped onto the subsequences using the BLAST algorithm. Subsequences that do not show any similarity with a probe are subjected to a second probe design run.
The final step of this strategy is only applicable to procedures in which the probe design software also can produce cross-hybridizing probes. Probes representing different TCs that cross-hybridize with a high identity can be grouped together, i.e. these probes can be attributed to a specific sequence that is common to a group of TCs. However, probes with a large number of low to medium cross-hybridization events are not interpretable and can be removed from the design (Figure 2, Rule 3).
All scripts used in this procedure are written in R or Perl and are available via our website http://staff.science.uva.nl/~rauwerda/resource_integration_array_design. The script, in which the TCs are constructed, uses a local installation of BLAST and is computationally the most expensive step in the procedure. The computational requirements for the actual probe-design depend on the software used and whether the design of cross-hybridizing probes is included.
A Zebrafish Example
For this array design we have chosen to include cross-hybridizing probes. Therefore we have chosen to be rather strict on the similarity threshold, but avoid a high increase rate of the number of clusters. Hence similarity parameters T and U and the overhang parameter H were set as 95, 95 and 100 respectively. The first BLAST procedure produced 96,189 TCs, 10,080 sub-sequences and 12 low-complexity sequences were produced. The splitting of TCs according to the Ensembl gene classification resulted in 97,316 TCs. 2,303 sequences for which the 5'-3'-direction could not be established were designed in both possible transcript directions. Afters the second BLAST procedure, in which the TC probes were aligned against the subsequences (Figure 2), 5,299 subsequences remained to be designed.
Oligopicker  has been used to design up to 5 probes per sequence with a 3' sequence preference. A probe is considered to cross-hybridize, if it contains a stretch of 16 nucleotides or has a bitscore higher than 32.5. If only cross-hybridizing probes could be designed, such a probe was added to the design, together with the information to which other sequences this probe can cross-hybridize.
The construction of the TCs took less than a day on a 2.8 GHz Dual-Core AMD Opteron 2220 machine. The microarray design has been carried out on a 20 node Pentium-D computer cluster and took 10 hours.
Characteristics of the Zebrafish Microarray Design
Probe target candidate sequences
TCs/sequences with 5 probes
TCs/sequences with 4 probes
TCs/sequences with 3 probes
TCs/sequences with 2 probes
TCs/sequences with 1 probe
TCs/sequences without cross-hybridizing probes
TCs/sequences with a cross-hybridizing probe
Total queried TCs/sequences
The workflow presented here facilitates the integration of heterogeneous sequence information for transcriptome-wide microarray design and minimizes by construction of Transcription Clusters, the redundancy of transcripts represented on the microarray by probes. Together with the microarray design, the annotation of the microarray is drawn up. Inherently to biology, some probes can never be mapped to individual genes. However, with this approach, all information which transcripts and genes a probe refers to is available. In this zebrafish example we have chosen to also design cross-hybridizing probes. If the research question that has to be answered by the microarray experiment does not need to investigate the biological mechanism at hand, such as in biomarker studies, these cross-hybridizing probes can prove to be quite useful.
With the presented workflow we have developed a tool for microarray design that allows the use of as many heterogeneous genome resources as desired. The easy to design up-to-date microarrays in the current era of high-density custom-designed microarrays makes this workflow a valuable tool for whole-transcriptome studies.
This work was part of the BioRange program of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
- de Leeuw WW, Rauwerda HH, Jonker MM, Breit TT: Salvaging Affymetrix probes after probe-level re-annotation. BMC Res Notes. 2008, 1: 66-10.1186/1756-0500-1-66.PubMed CentralPubMedView ArticleGoogle Scholar
- Neerincx P, Rauwerda H, Nie H, Groenen M, Breit T, Leunissen J: OligoRAP - an Oligo Re-Annotation Pipeline to improve annotation and estimate target specificity. BMC Proceedings. 2009, 3: S4-10.1186/1753-6561-3-s4-s4.PubMed CentralPubMedView ArticleGoogle Scholar
- Gordon PM, Sensen CW: Osprey: a comprehensive tool employing novel methods for the design of oligonucleotides for DNA sequencing and microarrays. Nucleic Acids Res. 2004, 32: e133-10.1093/nar/gnh127.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang X, Seed B: Selection of oligonucleotide probes for protein coding sequences. Bioinformatics. 2003, 19: 796-802. 10.1093/bioinformatics/btg086.PubMedView ArticleGoogle Scholar
- Rouillard JM, Zuker M, Gulari E: OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res. 2003, 31: 3057-3062. 10.1093/nar/gkg426.PubMed CentralPubMedView ArticleGoogle Scholar
- Lemoine S, Combes F, Le Crom S: An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucleic Acids Res. 2009, 37: 1726-1739. 10.1093/nar/gkp053.PubMed CentralPubMedView ArticleGoogle Scholar
- Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M: What is a gene, post-ENCODE? History and updated definition. Genome Res. 2007, 17: 669-681. 10.1101/gr.6339607.PubMedView ArticleGoogle Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.PubMedView ArticleGoogle Scholar
- Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008, 36: D753-760. 10.1093/nar/gkm987.PubMed CentralPubMedView ArticleGoogle Scholar
- Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl automatic gene annotation system. Genome Res. 2004, 14: 942-950. 10.1101/gr.1858004.PubMed CentralPubMedView ArticleGoogle Scholar
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-140. 10.1093/nar/29.1.137.PubMed CentralPubMedView ArticleGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005, 33: D39-45. 10.1093/nar/gki062.PubMed CentralPubMedView ArticleGoogle Scholar
- Pontius JU, Wagner L, Schuler GD: UniGene: A Unified View of the Transcriptome. 2003Google Scholar
- Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, Bello SM, et al: The Mouse Genome Database (MGD): from genes to mice--a community resource for mouse biology. Nucleic Acids Res. 2005, 33: D471-475. 10.1093/nar/gki113.PubMed CentralPubMedView ArticleGoogle Scholar
- Sprague J, Bayraktaroglu L, Clements D, Conlin T, Fashena D, Frazer K, Haendel M, Howe DG, Mani P, Ramachandran S, et al: The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res. 2006, 34: D581-585. 10.1093/nar/gkj086.PubMed CentralPubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.PubMedView ArticleGoogle Scholar
- Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005, 21: 3439-3440. 10.1093/bioinformatics/bti525.PubMedView ArticleGoogle Scholar