A general framework for optimization of probes for gene expression microarray and its application to the fungus Podospora anserina
© Debuchy et al; licensee BioMed Central Ltd. 2010
Received: 21 March 2010
Accepted: 18 June 2010
Published: 18 June 2010
The development of new microarray technologies makes custom long oligonucleotide arrays affordable for many experimental applications, notably gene expression analyses. Reliable results depend on the quality of probe design and selection. The probe design strategy should cope with the limited accuracy of de novo gene prediction programs and with annotation updates. We present a novel in silico procedure that addresses these issues and includes experimental screening, because an empirical approach is the best way to identify the optimal probes among those selected in silico.
We used four criteria for in silico probe selection: cross-hybridization, hairpin stability, probe location relative to coding sequence end and intron position. The last criterion is critical when exon-intron gene structure predictions for intron-rich genes are inaccurate. For each coding sequence (CDS), we selected a subset of four probes. These probes were included in a test microarray, which was used to evaluate the hybridization behavior of each probe. The best probe for each CDS was selected according to three experimental criteria: signal-to-noise ratio, signal reproducibility, and representative signal intensities. This procedure was applied to the development of a gene expression Agilent platform for the filamentous fungus Podospora anserina and to the selection of a single 60-mer probe for each of the 10,556 P. anserina CDS.
A reliable gene expression microarray version based on the Agilent 44K platform was developed with four spot replicates of each probe to increase statistical significance of analysis.
Development of a gene expression microarray comprises several time-consuming and complex steps. Probe libraries are generated by commercial services or specialized design programs, which analyze nucleic acid physical parameters to identify probes that offer the best theoretical characteristics in terms of specificity and sensitivity. Optimal probe design is a compromise between these two features, which are predicted by computational methods that assume probes are in solution, whereas arrays in fact consist of surface-immobilized probes. An empirical approach therefore appears to be the best strategy to assess the quality of the probe design outcome [2–4]. This experimental step has long been overlooked because of microarray cost and reluctance to modify a fixed design. In situ synthesized oligomer arrays now offer great flexibility for changing probes, which encourages the inclusion of real hybridizations in the probe selection process. Probe design should also take into account uncertainties in gene structure predictions [5, 6] and re-annotations of genome databases. Informatics tools for updating probe collections are available, but we are not aware of any established method for dealing with potential annotation errors.
We chose medium length probes (60mers), which offer the best compromise between long oligonucleotide probes (50-80mers), which are prone to cross-hybridization [8, 9], and short oligonucleotide probes (25-30mers), which produce low signal intensity. We used an ink-jet Agilent microarray platform and the Agilent commercial service for designing probes, which delivers up to ten candidate probes per coding sequence (CDS). A single 60-mer probe can successfully detect gene expression at a low level. We present computational and experimental processes to identify the optimal probe for each CDS.
Computational selection of probes
Cross-hybridization capacity for non-target sequences. Each probe was aligned against the whole set of CDS using BLAST with custom parameters (W = 7, z = 1 000 000, r = 2). These parameters were estimated from simulated data sets to detect a minimal identity of 70% over 20 contiguous bases. A cross-hybridization identity (CHI) score was attributed to each probe, based on its identity with any non-target CDS (Table 1).
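The CHI computation can be illustrated with a minimal sketch. It assumes the BLAST hits have already been parsed into a simplified record; the `Hit` fields, the `target_of` mapping and the use of the reported percent identity as the CHI value are illustrative assumptions, not the authors' implementation. The conversion from CHI to a score follows Table 1 and is not reproduced here.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit in a simplified, hypothetical tabular form."""
    probe_id: str
    cds_id: str       # CDS matched by the probe
    identity: float   # percent identity of the alignment (assumed CHI measure)
    length: int       # alignment length in bases


def chi_percent(hits: Iterable[Hit], target_of: dict[str, str],
                min_length: int = 20) -> dict[str, float]:
    """Return, per probe, the highest identity with any non-target CDS."""
    chi = {probe_id: 0.0 for probe_id in target_of}
    for hit in hits:
        if hit.cds_id == target_of[hit.probe_id]:
            continue  # a match to the intended target is not cross-hybridization
        if hit.length < min_length:
            continue  # shorter than the 20-base criterion quoted in the text
        chi[hit.probe_id] = max(chi[hit.probe_id], hit.identity)
    return chi
```

A probe with no qualifying non-target hit keeps a CHI of 0.0 in this sketch.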
Table 1. Scores for in silico selection (score values per criterion).
Hairpin stability (ΔG): > -8 kcal/mol; ≤ -8 kcal/mol
Probe position in CDS (nucleotides numbered from CDS 3' end): 500 - 1000
Probe position relative to intron: classes defined in Figure 1
Mismatch with updated CDS: used after genome re-annotation
Thermodynamic properties and secondary structure stability. Secondary structures can compromise hybridization between the probe and its target. Possible hairpin structures were analyzed and the corresponding free energy (ΔG) was computed. The parameters of the design program excluded probes with a low self-folding energy distribution, and therefore a high disqualifying score was not necessary (Table 1).
Probe location relative to CDS 3' end. Labeling methods start from the polyA tail and become attenuated as the enzymes progress toward the 5' end. Therefore, the selection procedure gives the best scores to probes located near the 3' end of the CDS (Table 1).
Relative positions of probe and intron. It has been reported that only 15% of gene structures are predicted correctly across the coding region in some organisms. Most probe design software does not select probes according to their position relative to introns, although this criterion appears critical, notably for genomes with inaccurate intron prediction, often due to a lack of ESTs. We therefore developed probe scores (Table 1) based on probe position relative to predicted introns (Figure 1). Probes that overlapped one or more introns were given a high score to ensure that they were rejected. The 3' boundaries of introns show little variation, but the consensus is small, and prediction of the intron 3' end is therefore uncertain. Consequently, probes located immediately adjacent to and downstream of the putative 3' end of an intron were attributed a sub-optimal score.
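The positional test behind the intron criterion can be sketched as follows. The class names, the 10-base `margin` used for "immediately adjacent", and the coordinate convention (0-based, half-open intervals on the gene sequence in transcript orientation) are assumptions made for illustration; the actual classes are those defined in Figure 1.

```python
from typing import NamedTuple, Sequence


class Interval(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # exclusive


def intron_class(probe: Interval, introns: Sequence[Interval],
                 margin: int = 10) -> str:
    """Classify a probe relative to predicted introns (hypothetical classes)."""
    # Sharing at least one base with a predicted intron rejects the probe.
    if any(probe.start < i.end and i.start < probe.end for i in introns):
        return "overlap"
    # A probe starting within `margin` bases downstream of an intron 3' end
    # depends on an uncertain boundary prediction, hence a sub-optimal class.
    if any(0 <= probe.start - i.end < margin for i in introns):
        return "downstream_adjacent"
    return "clear"
```

Each class would then be mapped to its score from Table 1.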
A final score for the in silico quality of each probe was calculated as the sum of these four scores. A first round of selection identified probes with a final score below 4. If more than four probes matched a single CDS, we selected the four probes closest to the 3' end of the CDS. Probes that started within the last 100 nucleotides of a CDS were excluded to circumvent annotation uncertainties, which are more frequent in the 3' region of CDS. For any CDS with fewer than four probes, additional probes were selected in a second round that allowed scores of up to 8 and recovered probes overlapping intron(s) confirmed by EST(s). We excluded, however, probes that displayed a CHI of over 85% and probes that started upstream of the 3' terminal 1500 nucleotides of the CDS. A further probe-design stage was carried out for CDS that had no probe, or only one, after the second selection round. For speed, the probe design software ROSO [1, 12] was used for this and subsequent designs instead of the Agilent commercial service. ROSO parameters are indicated in Additional file 1. Probes from this new design were submitted to the in silico selection described above.
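The two selection rounds can be summarized in a short sketch for the candidate probes of one CDS. The `Probe` fields are hypothetical. Two points are interpretations rather than statements from the text: the intron score is waived in the second round when the overlapped intron is confirmed by EST(s), and the 100-nucleotide exclusion is applied in both rounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    chi_score: int
    hairpin_score: int
    position_score: int
    intron_score: int
    chi_percent: float                  # identity with the closest non-target CDS
    start_from_3prime: int              # probe start, in nt from the CDS 3' end
    intron_est_confirmed: bool = False  # overlapped intron supported by EST(s)

    def final_score(self, waive_intron: bool = False) -> int:
        """Sum of the four criterion scores, optionally without the intron score."""
        intron = 0 if waive_intron else self.intron_score
        return self.chi_score + self.hairpin_score + self.position_score + intron


def select_probes(candidates: list[Probe], wanted: int = 4) -> list[Probe]:
    """Two-round selection of up to `wanted` probes for one CDS."""
    # Probes starting within the last 100 nt of the CDS are never used.
    usable = [p for p in candidates if p.start_from_3prime > 100]
    by_3prime = sorted(usable, key=lambda p: p.start_from_3prime)

    # Round 1: final score below 4, keeping the probes closest to the 3' end.
    chosen = [p for p in by_3prime if p.final_score() < 4][:wanted]
    if len(chosen) >= wanted:
        return chosen

    # Round 2: scores up to 8, with the CHI and distance exclusions.
    for p in by_3prime:
        if len(chosen) == wanted:
            break
        if p in chosen or p.chi_percent > 85 or p.start_from_3prime > 1500:
            continue
        if p.final_score(waive_intron=p.intron_est_confirmed) <= 8:
            chosen.append(p)
    return chosen
```

A CDS for which this function returns zero or one probe would go to the further design stage described above.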
When a genome re-annotation was released, probes were aligned against the updated set of CDS using BLAST to identify probe-deficient CDS. New probes were then designed using ROSO and the in silico scoring procedure was applied once again. Re-annotations also led to CDS modifications that resulted in mismatches with previously designed probes. These probes were attributed a score of 20 to ensure that they would be discarded from further analyses (Table 1).
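The re-annotation step can be illustrated with a simplified sketch in which an exact substring search stands in for the BLAST alignment used in the procedure. The data layout and function names are hypothetical, and probe sequences are assumed to be stored in the same orientation as the CDS.

```python
REANNOTATION_PENALTY = 20  # score given in the text to mismatching probes


def reannotation_scores(probes: dict[str, tuple[str, str]],
                        updated_cds: dict[str, str]) -> dict[str, int]:
    """Map probe id -> 0 if the probe still matches its CDS exactly, else 20.

    probes maps probe id -> (target CDS id, probe sequence).
    """
    scores = {}
    for probe_id, (cds_id, sequence) in probes.items():
        target = updated_cds.get(cds_id, "")  # CDS may have been removed
        scores[probe_id] = 0 if sequence in target else REANNOTATION_PENALTY
    return scores


def probe_deficient_cds(probes: dict[str, tuple[str, str]],
                        updated_cds: dict[str, str],
                        scores: dict[str, int]) -> set[str]:
    """Return updated CDS ids that have no perfectly matching probe left."""
    covered = {cds_id for probe_id, (cds_id, _) in probes.items()
               if scores[probe_id] == 0}
    return set(updated_cds) - covered
```

The CDS returned by `probe_deficient_cds` are those for which new probes would need to be designed.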
Experimental selection of probes
Signal-to-noise ratio. The determination of a signal-to-noise ratio (SNR) threshold is essential to distinguish a true signal from its background, and thus for the generation of high-quality microarray data. Subsequent data processing and biological interpretation of microarray results depend on the accuracy of this threshold. Two metrics were used to calculate the SNR values for each probe: (i) the signal-to-standard-deviation ratio (SSR) and (ii) the signal-to-background ratio (SBR). SSR values greater than 10 are considered indicative of high quality arrays. Probes with an SSR < 10 and an SBR < 2 for all samples, or for all samples but one, were discarded, as they possibly had a defective design.
Signal reproducibility. The reproducibility of each probe is usually assessed from a normalized measure of signal dispersion, the signal coefficient of variation (CV). As our experimental design consists exclusively of biological replicates, the CV measures biological heterogeneity as well as technical variation. We minimized biological heterogeneity by using biological replicates with minimal genetic polymorphism ( and references therein). A lack of signal reproducibility (high CV) was therefore attributed mainly to probe defects. The threshold for CV was set at 0.75, so as to reject no more than approximately 1% of the total number of CDS. Probes with a CV > 0.75 for any condition were submitted to expert supervision to determine possible biological causes of heterogeneity, and were rejected if none was found.
Signal intensity per CDS and per condition. We adapted the strategy of Paredes et al., in which it was assumed that a probe targeted to a given CDS should have an optimal signal intensity that is similar to the average signal intensity of all probes targeted to this CDS. This rationale was applied to calculate two types of metrics. (i) Two median metrics were calculated from the normalized signal intensities obtained with the common reference RNA pool: the median of each probe targeted to a CDS (Mprobe) and the median for all probes targeted to this CDS (MCDS). Probes with Mprobe outside the interquartile range of MCDS were rejected. (ii) The average intensity (Marray) of all probes targeted to a given CDS in each array and its 95% confidence interval (CI) were calculated from the normalized signal intensities obtained from hybridization with sample RNA. Probes were discarded if their signal intensity was outside Marray ± 1.5 CI for all arrays.
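The discard rule for the signal-to-noise criterion can be written as a small predicate. This sketch assumes that the SSR and SBR values have already been computed per sample for one probe; the function name and argument layout are illustrative.

```python
def fails_snr(ssr: list[float], sbr: list[float],
              ssr_min: float = 10.0, sbr_min: float = 2.0) -> bool:
    """True if the probe is below both thresholds in all samples, or all but one."""
    if len(ssr) != len(sbr) or not ssr:
        raise ValueError("ssr and sbr must be non-empty and of equal length")
    # A sample counts as weak only when both metrics are below threshold.
    weak = sum(1 for s, b in zip(ssr, sbr) if s < ssr_min and b < sbr_min)
    # max(1, ...) avoids discarding a probe measured once with a good signal.
    return weak >= max(1, len(ssr) - 1)
```

A probe with a low SSR but an SBR of at least 2 in several samples is kept under this reading of the rule.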
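As an illustration of the reproducibility criterion, the following sketch computes the CV of one probe per condition and lists the conditions that exceed the 0.75 threshold. The use of the sample standard deviation and the dictionary layout are assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(signals: list[float]) -> float:
    """Sample standard deviation divided by the mean of the replicate signals."""
    average = mean(signals)
    if average == 0:
        raise ValueError("CV is undefined for a zero mean signal")
    return stdev(signals) / average


def conditions_to_review(signals_by_condition: dict[str, list[float]],
                         cv_max: float = 0.75) -> list[str]:
    """Return the conditions in which the probe's CV exceeds cv_max."""
    return [condition
            for condition, signals in signals_by_condition.items()
            if coefficient_of_variation(signals) > cv_max]
```

A non-empty result corresponds to a probe submitted to expert supervision, not to an automatic rejection.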
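Both intensity metrics can be sketched for the probes of one CDS. Several choices here are assumptions: the interquartile range is taken over the pooled reference signals of all probes of the CDS, quartiles use the inclusive method, and the 95% CI is a half-width from a normal approximation, which is crude for four probes. The function names and data layout are hypothetical.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96 (normal approximation)


def outside_iqr(reference: dict[str, list[float]]) -> set[str]:
    """Probes whose median reference signal (Mprobe) lies outside the
    interquartile range of the pooled signals of all probes of the CDS."""
    pooled = [x for values in reference.values() for x in values]
    q1, _mcds, q3 = quantiles(pooled, n=4, method="inclusive")
    return {probe for probe, values in reference.items()
            if not q1 <= median(values) <= q3}


def outside_marray(sample: dict[str, list[float]],
                   width: float = 1.5) -> set[str]:
    """Probes outside Marray +/- width * CI on every array.

    sample[probe][k] is the normalized signal of the probe on array k.
    """
    probes = list(sample)
    n_arrays = len(sample[probes[0]])
    always_out = set(probes)
    for k in range(n_arrays):
        column = [sample[p][k] for p in probes]
        m_array = mean(column)
        half_ci = Z95 * stdev(column) / len(column) ** 0.5
        for p in probes:
            # One array within the interval is enough to keep the probe.
            if abs(sample[p][k] - m_array) <= width * half_ci:
                always_out.discard(p)
    return always_out
```

The outcome of the second test depends strongly on how the confidence interval is defined, so the interval computation should be adapted to the definition actually used.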
Application to Podospora anserina
The draft genome assembly of Podospora anserina contained 10,824 CDS when this work started (P. Silar and O. Lespinet, unpublished results) and was updated to 10,545 CDS as work progressed. A total of 5,032 CDS have at least one intron but no EST to confirm intron position, which emphasizes the value of selecting probes that do not overlap introns. Elimination of short and long CDS resulted in 10,539 CDS. The in silico ranking was reapplied, resulting in 41,843 unique probes (Microarray v.2).
Table: Results of experimental scoring of probes (headings: CDS with probes; Mprobe and Marray).
The probe set is available at http://podospora.igmors.u-psud.fr/download.php. The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE20734 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20734). The final microarray is available from Agilent under the reference AMADID 018343.
The authors are grateful to Lon Aggerbeck for his advice and support, and to Stéphane Le Crom for critical reading of the manuscript. RD is greatly indebted to B. Gillian Turgeon for English proofreading of the manuscript. The computer cluster used for probe design and selection, and the microarray scanner were provided by the Programme Pluri-Formation (PPF) of Univ Paris-Sud 11 "Plate-forme Puces à ADN Gif-Orsay 2005-2008". The http://podospora.igmors.u-psud.fr/download.php address was hosted on a server funded by IFR115. This study and the salary of FB were funded by the French National Research Agency (L'Agence Nationale de la Recherche, ANR) grant number ANR-05-BLAN-0385, project SexDevMycol, coordinator R. Debuchy.
1. Lemoine S, Combes F, Le Crom S: An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucleic acids research. 2009, 37: 1726-1739. 10.1093/nar/gkp053.
2. Jourdren L, Duclos A, Brion C, Portnoy T, Mathis H, Margeot A, Le Crom S: Teolenn: an efficient and customizable workflow to design high-quality probes for microarray experiments. Nucleic acids research. 2010, 38: e117. 10.1093/nar/gkq110.
3. Kronick MN: Creation of the whole human genome microarray. Expert review of proteomics. 2004, 1: 19-28. 10.1586/147894126.96.36.199.
4. Paredes CJ, Senger RS, Spath IS, Borden JR, Sillers R, Papoutsakis ET: A general framework for designing and validating oligomer-based DNA microarrays and its application to Clostridium acetobutylicum. Applied and environmental microbiology. 2007, 73: 4631-4638. 10.1128/AEM.00144-07.
5. Brent MR, Guigo R: Recent advances in gene structure prediction. Current opinion in structural biology. 2004, 14: 264-272. 10.1016/j.sbi.2004.05.007.
6. Salzberg SL: Genome re-annotation: a wiki solution? Genome biology. 2007, 8: 102. 10.1186/gb-2007-8-6-r102.
7. Golfier G, Lemoine S, van Miltenberg A, Bendjoudi A, Rossier J, Le Crom S, Potier MC: Selection of oligonucleotides for whole-genome microarrays with semi-automatic update. Bioinformatics (Oxford, England). 2009, 25: 128-129. 10.1093/bioinformatics/btn573.
8. Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, Shannon KW, Lefkowitz SM, Ziman M, Schelter JM, Meyer MR: Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nature biotechnology. 2001, 19: 342-347. 10.1038/86730.
9. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic acids research. 2000, 28: 4552-4557. 10.1093/nar/28.22.4552.
10. Chou CC, Chen CH, Lee TT, Peck K: Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression. Nucleic acids research. 2004, 32: e99. 10.1093/nar/gnh099.
11. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
12. Reymond N, Charles H, Duret L, Calevro F, Beslon G, Fayard JM: ROSO: optimizing oligonucleotide probes for microarrays. Bioinformatics (Oxford, England). 2004, 20: 271-273. 10.1093/bioinformatics/btg401.
13. Rychlik W, Rhoads R: A computer program for choosing optimal oligonucleotides for filter hybridization, sequencing and in vitro amplification of DNA. Nucleic acids research. 1989, 17: 8543-8551. 10.1093/nar/17.21.8543.
14. Do JH, Choi D-K: cDNA labeling strategies for microarrays using fluorescent dyes. Eng Life Sci. 2007, 7: 26-34. 10.1002/elsc.200620169.
15. Irimia M, Roy SW: Evolutionary convergence on highly-conserved 3' intron structures in intron-poor eukaryotes and insights into the ancestral eukaryotic genome. PLoS genetics. 2008, 4: e1000148. 10.1371/journal.pgen.1000148.
16. Leiske DL, Karimpour-Fard A, Hume PS, Fairbanks BD, Gill RT: A comparison of alternative 60-mer probe designs in an in-situ synthesized oligonucleotide microarray. BMC genomics. 2006, 7: 72. 10.1186/1471-2164-7-72.
17. He Z, Zhou J: Empirical evaluation of a new method for calculating signal-to-noise ratio for microarray data analysis. Applied and environmental microbiology. 2008, 74: 2957-2966. 10.1128/AEM.02536-07.
18. Zakharkin SO, Kim K, Mehta T, Chen L, Barnes S, Scheirer KE, Parrish RS, Allison DB, Page GP: Sources of variation in Affymetrix microarray experiments. BMC bioinformatics. 2005, 6: 214. 10.1186/1471-2105-6-214.
19. Espagne E, Lespinet O, Malagnac F, Da Silva C, Jaillon O, Porcel BM, Couloux A, Aury JM, Segurens B, Poulain J: The genome sequence of the model ascomycete fungus Podospora anserina. Genome biology. 2008, 9: R77. 10.1186/gb-2008-9-5-r77.
20. Debuchy R, Berteaux-Lecellier V, Silar P: Mating systems and sexual morphogenesis in Ascomycetes. Cellular and Molecular Biology of Filamentous Fungi. Edited by: Borkovich KA, Ebbole DJ. 2010, Washington, DC: ASM Press, 501-535.
21. Coppin E, de Renty C, Debuchy R: The function of the coding sequences for the putative pheromone precursors in Podospora anserina is restricted to fertilization. Eukaryotic cell. 2005, 4: 407-420. 10.1128/EC.4.2.407-420.2005.
22. Lorin S, Dufour E, Sainsard-Chanet A: Mitochondrial metabolism and aging in the filamentous fungus Podospora anserina. Biochimica et biophysica acta. 2006, 1757: 604-610. 10.1016/j.bbabio.2006.03.005.
23. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature biotechnology. 2006, 24: 1151-1161. 10.1038/nbt1239.
24. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002, 30: 207-210. 10.1093/nar/30.1.207.