- Research article
- Open Access
SNP markers retrieval for a non-model species: a practical approach
© Shahin et al; licensee BioMed Central Ltd. 2012
- Received: 29 September 2011
- Accepted: 29 January 2012
- Published: 29 January 2012
SNP (Single Nucleotide Polymorphism) markers are rapidly becoming the markers of choice for applications in breeding because of next generation sequencing technology developments. For SNP development by NGS technologies, correct assembly of the huge amounts of sequence data generated is essential. Little is known about assembler performance, especially when dealing with highly heterogeneous species that show a high genome complexity, or about the possible consequences of differences in assemblies for SNP retrieval. This study tested two assemblers (CAP3 and CLC) on 454 data from four lily genotypes and compared the results with respect to SNP retrieval.
CAP3 assembly resulted in higher numbers of contigs, lower numbers of reads per contig, and shorter average read lengths compared to CLC. Blast comparisons showed that CAP3 contigs were highly redundant. Contrastingly, CLC in rare cases combined paralogs in one contig. Redundant and chimeric contigs may lead to erroneous SNPs. Filtering for redundancy can be done by blasting selected SNP markers to the contigs and discarding all the SNP markers that show more than one blast hit. Results on chimeric contigs showed that only four out of 2,421 SNP markers were selected from chimeric contigs.
In practice, CLC performs better than CAP3 in assembling highly heterogeneous genome sequences, and consequently SNP retrieval is more efficient. Additionally, a simple flow scheme is suggested for SNP marker retrieval that can be applied to all non-model species.
- Assembly Quality
- CAP3 Assembly
- Transcriptome Size
- Redundant Contigs
- Illumina Golden Gate
In the last few years, the development of next-generation sequencing (NGS) technologies, which have the capacity to generate millions of short reads in a single run, has led to a revolution in sequencing applications. NGS technologies have not only boosted re-sequencing and allele mining studies in model species, but are also very useful for the development of SNP markers in species with no or hardly any genetic resources.
SNP development using NGS technologies has become cheaper and faster, but it has also created new requirements such as genome complexity reduction, assembly of sequences, and high-throughput SNP identification. The latter two steps are still considered challenging. Many different assemblers are currently available, but few studies have discussed the performance of different assemblers in relation to assembly quality, or the influence of genome complexity and heterogeneity on the quality of the assembly. Assembly quality is generally assessed by the lengths of the contigs (mean, minimum and maximum lengths, or N50, depending on the assembler) and by the accuracy or correctness of the assembly (how well the contigs can be mapped to the reference genome). Two different assemblers (Newbler and MIRA) were compared on an insect sequence dataset using public Sanger EST data and 454 transcriptome data. Another study compared six assemblers (CAP3, MIRA, Newbler 2.3, Newbler 2.5, SeqMan, and CLC) with respect to the number and length of contigs, speed of assembly and assembly redundancy in de novo assembly of a nematode. The quality of the contigs was checked by aligning the contigs to four reference sequence sets (ESTs, proteome, gene families, and protein data from databases). Similarly, the performance of six aligners (BLAT, SSAHA2, Bowtie, SeqMap, MAQ, and CLC) was compared using in silico generated transcripts from four model organisms (human, Arabidopsis, Drosophila, and yeast) that were mapped to the transcriptome or the complete genome from sequence databases. Results showed that mapping became more accurate with increasing sequence read length, while more reads were incorrectly mapped with increasing genome heterozygosity.
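The contig length measures listed above (mean, minimum, maximum, N50) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy lengths are hypothetical, and the N50 definition used (the contig length at which the running total over length-sorted contigs first reaches half of all assembled bases) is the common one, not necessarily the one each assembler reports.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def length_summary(contig_lengths):
    """Mean, minimum, maximum and N50 of a list of contig lengths."""
    return {
        "mean": sum(contig_lengths) / len(contig_lengths),
        "min": min(contig_lengths),
        "max": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


# Toy contig lengths in bp (illustrative values, not data from this study).
toy_lengths = [200, 300, 500, 1000, 2000]
print(length_summary(toy_lengths))
```

Length statistics of this kind say nothing about correctness, which is why the studies above combine them with mapping to reference sequences.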
Recently, a comparison was published in which eight short-read assemblers were evaluated against two types of simulated short-read datasets (allowing a 0.1% error rate) derived from four different genomes (nematode, yeast, bacteria, and virus). The assemblers' computational time, memory cost, assembly accuracy and completeness, and the size distribution of the assembled contigs were studied by mapping to reference genomes. All these studies used relatively small genomes, often from inbred organisms, and assessed assembly accuracy with general parameters and by mapping to reference genomes or Sanger sequencing data [2–5]. These studies also showed that there is currently no commonly accepted and standardized method for evaluating assembler performance; none of them checked assembly quality with respect to SNP marker retrieval, and none defined clear guidance for assembler selection. Because we are involved in ornamental breeding where, in general, crops are outcrossing and highly heterogeneous and lack reference sequences, our goal was to study the effects of two different assemblers on assembly performance and SNP retrieval in heterogeneous outcrossing species, using our model crop lily as an example. In such a study, confirmation of assembly quality by mapping to a reference genome would be optimal. However, species with reference genomes do not represent the same level of heterogeneity and genome complexity as is found in most outbreeding non-model species. In our study we analyzed a highly divergent sequence dataset of the non-model species lily, which allowed us to investigate a real case and to develop a flow scheme that can be followed in SNP marker development studies for similar non-model species.
When working with Lilium, which has an assumed high level of diversity, a large genome size of 36 Gb with an accompanying high genome complexity, and a lack of genetic resources, assembly is an important step in SNP retrieval. Since clear criteria for choosing an assembler are lacking, we focused on two widely used assemblers (CAP3 and CLC) that represent the two different approaches used in assemblers. CAP3 was selected because it uses the overlap algorithm for assembly and was successfully used to assemble EST genebank data in heterozygous species such as Zea mays and potato [7, 8]. Recently it was used to assemble sequences from apricot (Prunus armeniaca L.), castor bean, mulberry (Morus sp.), Pigeonpea (Cajanus cajan L.), rice, and grape [9–15]. Furthermore, CAP3 is implemented in QualitySNP, a pipeline to identify SNPs that was used in SNP mining studies [8, 16]. The CLC assembler was selected because it uses the de Bruijn algorithm and because it was used in several comparison studies, in which it was shown to produce a good quality assembly [3, 4]. It is user friendly, since it does not require the command line and offers a complete package (cleaning, trimming, clonality removal, SNP and InDel counting, and assembly, in addition to advanced visualization of the assemblies), which makes it appealing to use. Moreover, the CLC assembler supports both short-read and long-read assembly, as well as de novo assembly of paired-end data. CLC was also chosen because it was reported to perform better in mapping artificial datasets with increased heterogeneity. Additionally, recent papers on the performance of assemblers used both assemblers [3, 17], which indicates the importance and usability of both.
In this study CAP3 and CLC were used for de novo assembly of 454 transcriptome reads derived from Lilium. The goals of this study were: 1) to compare the performance of CAP3 and CLC in de novo assembly, 2) to show the influence of the assembler on the reliable detection of alleles and SNPs, and 3) to suggest a simple flow scheme for generating reliable SNP markers from such heterozygous species.
In this study, we generated sequences for a large number of genes from the genus Lilium. In total, 1,282,735 reads with an average length of 340 bp were obtained using 454 pyrosequencing. The lowest number of reads was obtained from 'Connecticut King' (139,480 reads) and the highest from the Trumpet genotype (442,476 reads). From 'White Fox' and 'Star Gazer', 326,539 and 374,240 reads were obtained, respectively. This difference in the number of reads might be related to the quality of the RNA that was used for each genotype and to variation in the initial amount of cDNA of each sample that was used for sequencing.
Cleaning the data showed that 85,719 reads (6.7%) were discarded either because of poor quality, being too short (less than 100 bp), being too long (over 800 bp), or missing the barcode sequence. Around 1,191,938 reads with an average length of 283 bp (after trimming) were kept for further analysis.
Next, all the duplicated reads were removed. The presence of duplicated reads affects the reliability of a SNP call. In sequence data analyses for SNP retrieval, reads are assumed to come from independently derived DNA fragments. A polymorphism that is observed independently twice is considered reliable, whereas a polymorphism observed only once could also be due to errors in the cDNA synthesis and PCR steps. Duplicated reads that carry PCR errors could cause these errors to be selected as SNPs and should therefore be avoided. The number of initial transcripts and the effects of differential amplification during preparation of the sequencing libraries determine the final library quality (the goal being as wide a variety of transcripts as possible), and thereby affect the percentage of duplicated reads. The more diverse a library is, the fewer duplicated reads it contains. All clonal reads were excluded (412,826 reads, 35%) and only the longest read among each set of clonal reads was retained, leaving a total of 779,112 reads (220,716,355 bp) for the assembly step. Similar levels of clonal reads were found in previous studies, in which 11% to 35% of the sequences were reported as potential artificial replicates. Gomez-Alvarez et al. suggested that this phenomenon could be explained by the binding of amplified DNA fragments generated in the emulsion PCR step of 454 pyrosequencing to empty beads. However, clonality of reads is not limited to a specific mechanism, since it was recorded in the GS20, GS-FLX, and GS-FLX Titanium systems as well as in Illumina's Solexa, which suggests that another explanation is possible. The relatively high clonality found with different sequencing technologies could be related to cDNA library preparation, in which PCR steps are often used to generate sufficient quantities of cDNA for sequencing.
In particular, the second PCR after the normalization step (using primers for the A and B adapters) may increase the number of duplicated reads. In our case, we could detect duplicated reads because no shearing of fragments was applied; instead, fragments were generated by using randomized primers for cDNA synthesis, adapter primers were used in the first PCR step, and size selection was done by gel electrophoresis. The same method of cDNA synthesis and normalization was also applied in other studies [20, 21]. However, none of these checked for duplicated reads. Library construction and normalization protocols that minimize PCR steps and prevent the occurrence of duplicate reads would be preferable. Nevertheless, data should always be checked for duplicated reads so that they can be removed.
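The idea of keeping only the longest read among clonal reads can be illustrated with a small Python sketch. It is not the CD-HIT procedure used in this study: as a simplifying assumption, two reads are treated as clonal only when they share the same first 6 nucleotides and the shorter read is an exact prefix of the longer one, whereas the study allowed >98% similarity. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def remove_clonal_reads(reads, prefix_len=6):
    """Keep one representative (the longest) per group of clonal reads.

    Simplification: reads are clonal when they share the first `prefix_len`
    bases and the shorter read is an exact prefix of the longer one.
    """
    by_prefix = defaultdict(list)
    for read in reads:
        by_prefix[read[:prefix_len]].append(read)

    kept = []
    for group in by_prefix.values():
        representatives = []
        # Longest first, so every representative is at least as long
        # as any read compared against it later.
        for read in sorted(group, key=len, reverse=True):
            if not any(rep.startswith(read) for rep in representatives):
                representatives.append(read)
        kept.extend(representatives)
    return kept


# Toy reads (illustrative only).
toy_reads = ["ACGTACGTAA", "ACGTACGT", "ACGTACTTAA", "TTGACCAGT", "ACGTACGTAA"]
print(remove_clonal_reads(toy_reads))
```

Grouping by the first bases keeps the comparison cheap, because only reads that start identically are compared with each other.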
Assembly and SNPs detection
Table 1 Comparison between CAP3 and CLC assembly results (rows include average contig length and number of SNP markers)
CLC uses the de Bruijn algorithm, which is also used in several other assembler software packages such as Velvet, Oases [http://www.ebi.ac.uk/~zerbino/oases/], ABySS [http://www.bcgsc.ca/platform/bioinfo/software/abyss], and SOAPdenovo [http://soap.genomics.org.cn/soapdenovo.html]. As with CAP3, a cutoff threshold of 95% was used, which resulted in the assembly of 646,424 reads into 55,433 contigs (30.8 Mb) with an average of 12 reads per contig (Table 1). Around 17% (132,688) of the reads were left out as singletons. The average length of the contigs was 555 bp, and 177 contigs (0.32%) were less than 200 bp in length. Around 8.5% (4,709) of the contigs were longer than 1 kb, and 485 contigs exceeded 2 kb, the longest being 9,420 bp (Figure 1). A total of 2,421 SNP markers were identified by QualitySNP as reliable markers.
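The derived CLC figures quoted above follow directly from the reported counts, as the short Python check below shows. The function name is hypothetical; the inputs are the numbers reported in the text (779,112 reads entering assembly, 646,424 assembled reads, 55,433 contigs, 30.8 Mb).

```python
def assembly_summary(input_reads, assembled_reads, n_contigs, assembled_bp):
    """Derive singleton and per-contig figures from basic assembly counts."""
    singletons = input_reads - assembled_reads
    return {
        "singletons": singletons,
        "singleton_pct": 100 * singletons / input_reads,
        "reads_per_contig": assembled_reads / n_contigs,
        "mean_contig_bp": assembled_bp / n_contigs,
    }


# Counts reported for the CLC assembly in the text.
clc = assembly_summary(779_112, 646_424, 55_433, 30.8e6)
for name, value in clc.items():
    print(f"{name}: {value:,.1f}")
```

The singleton count, the singleton percentage, the reads per contig and the mean contig length all agree with the rounded values given in the text.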
A reasonable percentage of the Lilium transcriptome was covered, as estimated from the transcriptome size of the monocot model species rice, which has 41,000 genes with an average gene length of 2,000 bp. Assuming that lily has a comparable transcriptome size, the CAP3 contigs cover around 47% of the Lilium transcriptome while the CLC contigs cover 38%, not counting the singletons that could be added to the total coverage.
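The coverage estimate can be reproduced with a few lines of Python under the stated assumption of a rice-like transcriptome (41,000 genes of 2,000 bp on average, i.e. 82 Mb). The function names are hypothetical. The CAP3 total below is a back-calculation from the quoted 47%, not a reported figure.

```python
def transcriptome_coverage_pct(assembled_bp, n_genes=41_000, mean_gene_bp=2_000):
    """Assembled bases as a percentage of an assumed transcriptome size."""
    return 100 * assembled_bp / (n_genes * mean_gene_bp)


def implied_assembled_bp(coverage_pct, n_genes=41_000, mean_gene_bp=2_000):
    """Total contig length implied by a coverage percentage (back-calculation)."""
    return coverage_pct / 100 * n_genes * mean_gene_bp


# CLC: 30.8 Mb of contigs, as reported in the text.
print(f"CLC coverage: {transcriptome_coverage_pct(30.8e6):.1f}%")

# CAP3: the quoted 47% implies roughly this many megabases of contigs.
print(f"CAP3 implied total: {implied_assembled_bp(47) / 1e6:.1f} Mb")
```

Because contigs from redundant assemblies can represent the same locus more than once, an estimate of this kind is an upper bound on the fraction of distinct transcripts covered.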
Notable differences in assembler performance were recorded in this study. Differences in assembler performance were also found in another study, in which different assemblers (Velvet, Oases and SeqMan NGen) were compared on a non-model species (snail) and the assembly was shown to depend strongly on the assembler. In this study, CLC assembled more reads than CAP3 and also generated longer contigs with a higher average read coverage. However, CAP3 contigs generated more SNP markers and appeared to have a higher coverage in total sequence length. Both assemblers, in addition to several others, were compared in a previous study with respect to the number and mean length of the contigs, the assembled reads, and the assembly redundancy. In contrast to our results, CAP3 and CLC performed comparably in that study. To our knowledge, no studies have been published in which assembler performance was evaluated with respect to SNP retrieval. SNP markers will segregate properly in mapping studies if the SNP is true (reliable) and the marker is unique throughout the genome (high quality). The first step in generating reliable and high quality SNP markers is building contigs in which alleles are joined and paralogs are preferably separated.
In order to choose the best assembler with respect to the identification of high quality reliable SNP markers for genetic mapping, we performed several tests to compare the performance of the assemblers.
Comparison between the CAP3 and CLC assemblies
Contigs blast to public sequence data base ESTs
Blasting generated SNPs vs. the contigs
BLAST results for QualitySNP-selected SNPs (with 50 bp of flanking sequence on each side) against the contigs of the assembly from which they originated provided an additional criterion for SNP marker selection. Many species have undergone genome duplications during their evolution. Even when paralogs are assembled in different contigs, a SNP marker selected from one of these contigs may also be present in a paralogous gene assembled in another contig. Thus, it is vital to check that SNP markers map back only to the contig from which they were selected. This paralog detection is important in any study aiming to generate SNP markers for species that lack other genetic resources. Selected SNP marker sequences (101 bp) of each assembler were blasted against all contigs using an E-value threshold of 1E-20.
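The uniqueness check described above can be sketched as a small Python filter. The sketch assumes that the BLAST hits are available as tab-separated lines in the standard 12-column tabular layout (query id in column 1, subject id in column 2, E-value in column 11); this file layout, the function name and the toy hits are assumptions made for illustration, not details taken from this study.

```python
from collections import defaultdict


def unique_snp_markers(blast_lines, source_contig, max_evalue=1e-20):
    """Return marker ids whose only hits at or below `max_evalue` are to
    the contig they were selected from.

    blast_lines: iterable of tab-separated hit lines (12-column layout assumed).
    source_contig: dict mapping marker id -> id of its contig of origin.
    """
    hits = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        marker, contig, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits[marker].add(contig)
    return sorted(
        marker for marker, contigs in hits.items()
        if contigs == {source_contig.get(marker)}
    )


# Toy hits (illustrative only): snp2 also hits a second contig strongly,
# snp3 has a weak second hit that does not pass the threshold.
toy_hits = [
    "snp1\tcontigA\t100.0\t101\t0\t0\t1\t101\t250\t350\t2e-48\t187",
    "snp2\tcontigB\t100.0\t101\t0\t0\t1\t101\t10\t110\t2e-48\t187",
    "snp2\tcontigC\t97.0\t101\t3\t0\t1\t101\t40\t140\t1e-40\t160",
    "snp3\tcontigD\t100.0\t101\t0\t0\t1\t101\t5\t105\t2e-48\t187",
    "snp3\tcontigE\t85.0\t60\t9\t0\t1\t60\t5\t64\t1e-05\t50",
]
toy_origin = {"snp1": "contigA", "snp2": "contigB", "snp3": "contigD"}
print(unique_snp_markers(toy_hits, toy_origin))
```

Collecting hit contigs in a set means that several alignments to the same contig still count as one hit, so only hits to a different contig cause a marker to be discarded.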
In CLC, 77% of SNP markers were unique. Only 13 SNP marker sequences had more than 5 blast hits (Figure 6). The 22% redundancy among CLC-SNPs can be related to the presence of paralogs assembled in different contigs. In general, a number of genes in any genome are expected to be duplicated, especially in a huge genome like that of Lilium. The percentage of paralogous genes differs between species. For example, in rice around 15 to 62% of genes were estimated to be duplicated. With a strict method of defining paralogs, the 22% redundancy among CLC-contigs is more in line with expectations than the 78% among CAP3 contigs, especially when taking into consideration that not all paralogous genes will be expressed at the time of sampling. To check whether CLC combined paralogs in contigs, haplotype numbers were assessed. Only 0.7% (364) of the CLC-contigs combined paralogs and contained more than the maximum expected 8 alleles (expected from four heterozygous diploid cultivars). The actual number of CLC-contigs with paralogs may be slightly higher but is not likely to cause high numbers of erroneous SNP markers in mapping. Thus, CLC appeared to perform reasonably well for SNP marker retrieval even with the sequence data of this highly polymorphic species. This corresponds with an earlier comparison in which CLC was among the two best programs for de novo sequence assembly. In contrast, CAP3 could not handle such high levels of heterogeneity.
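The haplotype-count check can be written down as a one-line rule: with four diploid genotypes, a contig that joins only alleles of one locus can hold at most 4 × 2 = 8 haplotypes, so any contig above that number is a candidate for combined paralogs. The Python sketch below shows the rule; the function name, the input dictionary and the toy counts are hypothetical.

```python
def flag_possible_paralog_contigs(haplotypes_per_contig, n_genotypes=4, ploidy=2):
    """Return ids of contigs with more haplotypes than a single locus allows."""
    max_alleles = n_genotypes * ploidy
    return sorted(
        contig for contig, n_haplotypes in haplotypes_per_contig.items()
        if n_haplotypes > max_alleles
    )


# Toy haplotype counts per contig (illustrative only).
toy_counts = {"contig1": 3, "contig2": 8, "contig3": 11}
print(flag_possible_paralog_contigs(toy_counts))

# Share of CLC contigs reported above the limit in the text.
print(f"{100 * 364 / 55_433:.1f}%")
```

The rule only catches contigs that exceed the limit; a contig that mixes paralogs but stays at or below 8 haplotypes would pass, which is consistent with the remark that the actual number may be slightly higher.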
Examples of assembly differences between the two assemblers
SNP markers are becoming the markers of choice in genetic studies, and researchers working on many species are therefore likely to start retrieving SNPs from NGS data. Our results clearly showed that sequence assembly, and consequently SNP marker retrieval, is significantly affected by the assembler. In our study, we tested two widely used assemblers that use different algorithms. The procedures followed can be used to assess assembly quality in any species that has few genetic resources. Importantly, blasting the selected SNP markers against the contigs from which they were generated (when supporting information from databases is missing) or against the whole genome, if available, is essential to avoid false positive SNPs. Results obtained with Lilium cDNAs are likely to be valid in other highly heterogeneous species as well. There seems to be a strong correlation between the level of heterozygosity in the studied species and the performance of the assemblers.
Overall, we believe that for inbreeding species both assemblers can be used, while for outbreeding and highly heterozygous species CLC is preferred.
Four lily genotypes that represent the four main hybrid groups of the genus Lilium were used for sequencing: cv 'Star Gazer' (Oriental), breeding line 'Trumpet 061099' (Trumpet), cv 'White Fox' (Longiflorum), and cv 'Connecticut King' (Asiatic). Young leaves (500 mg) were collected and kept at -80°C until RNA isolation.
RNA isolation and cDNA library preparation
Using the Trizol protocol (Invitrogen, Carlsbad, CA, USA), the RNA of the four genotypes was isolated and subsequently purified using the RNeasy MinElute kit (Qiagen, Hilden, Germany).
RNA library processing, i.e. cDNA synthesis, normalization of the cDNA and adaptor ligation for GS FLX Titanium sequencing, was performed by Vertis Biotechnologie AG (Freising, Germany). In short, 45 µg of total RNA of each of the four samples was treated with DNase and then primed with 6-nucleotide random primers for first strand cDNA synthesis. Next, 454 adapters A and B with a unique 6-nucleotide barcode for each cultivar were ligated to the 5' and 3' ends of the cDNAs. These cDNAs were subjected to two steps of PCR using a proofreading enzyme: one before the normalization step (around 18 cycles) and one after it (around 8 cycles). Normalization was carried out by one cycle of denaturation and re-association of the cDNAs and subsequent column purification. For Titanium sequencing, the cDNAs in the size range of 500-600 bp were eluted from preparative agarose gels.
454 sequencing procedures
The four cDNA libraries were mixed in equal concentrations and sequenced on a Life Sciences GS-FLX Titanium according to standard procedures (454 Life Sciences) at Wageningen UR Greenomics (Wageningen, the Netherlands).
Raw sequence data are available at ENA-SRA (European Nucleotide Archive-Sequence Read Archive) with the accession number ERP001106.
Raw unprocessed sequences were cleaned before assembly using both the reads and the accompanying sequence quality information (SFF files). Trimming consisted of removing 5' and 3' adapter sequences, low quality bases (limit 0.05), ambiguous nucleotides (maximum 2 nucleotides allowed), and terminal nucleotides (one nucleotide from the 5' end and 15 nucleotides from the 3' end), and discarding all reads shorter than 100 or longer than 800 nucleotides.
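Part of this cleaning step can be sketched in Python. The sketch covers only terminal trimming, the ambiguous-nucleotide limit and the length window; adapter removal and quality trimming are left out. Two things are assumptions made for illustration: the order of the operations, and the handling of ambiguous nucleotides (a read with more than 2 non-ACGT symbols is simply discarded here, which may differ from how the original software trims). The function name is hypothetical.

```python
def clean_read(seq, trim_5=1, trim_3=15, max_ambiguous=2, min_len=100, max_len=800):
    """Return the trimmed read, or None if it fails a filter."""
    # Remove terminal nucleotides from both ends.
    end = max(len(seq) - trim_3, trim_5)
    trimmed = seq[trim_5:end]
    # Count symbols other than A, C, G, T as ambiguous.
    ambiguous = sum(1 for base in trimmed.upper() if base not in "ACGT")
    if ambiguous > max_ambiguous:
        return None
    # Keep only reads inside the 100-800 nt window.
    if not (min_len <= len(trimmed) <= max_len):
        return None
    return trimmed


# Toy reads (illustrative only).
print(len(clean_read("A" * 216)))          # long enough after trimming
print(clean_read("A" * 100))               # too short after trimming
print(clean_read("A" * 50 + "NNN" + "A" * 200))  # too many ambiguous bases
```

Returning `None` for rejected reads lets the caller count how many reads were discarded, which is the figure reported alongside the number of reads kept.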
Next, all duplicated reads, i.e. reads that have the same first 6 nucleotides and the same sequence (>98% similarity), were excluded (clonality removal) using CD-HIT. After trimming and clonality removal, all the reads were submitted to the standard CAP3 using the default parameters (threshold identity cutoff 95% over 100 bp) and to CLC Genomics Workbench software (CLC bio, Denmark, http://www.clcbio.com/). The de novo assembly using CLC was done with the following parameters: conflict resolution (vote), similarity 95%, 100 bp over read length, and alignment mode (global, do not allow InDels). Throughout this study, a few terms are used frequently:
Assembler's performance: refers to the number of contigs and their average length, the number of singletons, and assembly redundancy.
Assembly redundancy: occurs when the assembler tends to split sequences belonging to the same locus over different contigs.
All the contigs resulting from CAP3 and CLC were submitted to an updated version of QualitySNP to detect reliable single nucleotide variants within each genotype (between the alleles in one genotype, intra SNPs) and between the four genotypes (between the alleles of the four genotypes, inter SNPs). SNPs were chosen using the QualitySNP program based on the following criteria: high quality sequence, not within or adjacent to a homopolymeric tract, at least 2 reads of each allele, and 50 bp of flanking sequence on each side free of other SNPs and InDels (criteria required by the Illumina Golden Gate platform for SNP genotyping). Any SNP fitting these criteria is considered and referred to as a 'reliable SNP marker'; reliable SNP markers are referred to as 'high quality' if they are uniquely present in the genome. For the latter, the SNP with 50 bp of sequence on either side was compared against all contigs of the same assembler using BLASTN with an Expectation value of 1E-20. Only SNPs that mapped uniquely to the contig from which they were selected (i.e. high quality SNPs) were retained for marker analysis.
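Two of the selection criteria above, at least 2 reads of each allele and 50 bp of flanking sequence on each side free of other SNPs and InDels, can be illustrated with the Python sketch below. It is not the QualitySNP implementation: the data layout (a list of variant dictionaries per contig with 0-based positions), the function names and the toy values are hypothetical, and the sequence quality and homopolymer criteria are not covered.

```python
def select_marker_candidates(contig_length, variants, flank=50, min_reads_per_allele=2):
    """Return positions of SNPs that pass the read-support and flank criteria.

    variants: list of dicts with 'pos' (0-based), 'kind' ('snp' or 'indel')
    and 'allele_reads' (dict mapping allele -> number of supporting reads).
    """
    selected = []
    for variant in variants:
        if variant["kind"] != "snp":
            continue
        reads = variant["allele_reads"]
        if len(reads) < 2 or min(reads.values()) < min_reads_per_allele:
            continue
        # The contig must supply `flank` bases on both sides of the SNP.
        if variant["pos"] < flank or variant["pos"] + flank >= contig_length:
            continue
        # No other SNP or InDel may lie inside the flanking sequence.
        crowded = any(
            other is not variant and abs(other["pos"] - variant["pos"]) <= flank
            for other in variants
        )
        if not crowded:
            selected.append(variant["pos"])
    return selected


def marker_sequence(contig_seq, pos, flank=50):
    """SNP base plus `flank` bases on each side (101 bp for flank=50)."""
    return contig_seq[pos - flank:pos + flank + 1]


# Toy contig of 300 bp with four variants (illustrative only).
toy_variants = [
    {"pos": 60, "kind": "snp", "allele_reads": {"A": 3, "G": 2}},
    {"pos": 150, "kind": "snp", "allele_reads": {"C": 5, "T": 1}},
    {"pos": 240, "kind": "snp", "allele_reads": {"C": 4, "T": 4}},
    {"pos": 280, "kind": "snp", "allele_reads": {"A": 2, "G": 2}},
]
print(select_marker_candidates(300, toy_variants))
print(len(marker_sequence("A" * 300, 60)))
```

In this sketch every variant in the list, including those with weak read support, blocks the flank of its neighbours; that is a conservative design choice made here, not a documented property of QualitySNP.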
We are thankful for the financial support from: Foundation Technological Top Institute Green Genetics and the Dutch lily breeding companies: De Jong Lelies BV, Van Zanten Flowerbulbs BV, Vletter and Den Haan BV, Marklily CV, World Breeding BV, Van den Bos Breeding BV, Mak Breeding BV and C. Steenvoorden BV. Hans de Jong and Harm Nijveen are gratefully acknowledged for their constructive comments. We are thankful to Nasim Mansoori for her beneficial English editing.
A.SH is grateful to the Ministry of Higher Education in Syria (Damascus University) for providing her a fellowship to conduct this research.
- Paszkiewicz K, Studholme D: De novo assembly of short sequence reads. Brief Bioinform. 2010, 11 (5): 457-472. 10.1093/bib/bbq020.
- Papanicolaou A, Stierli R, ffrench-Constant R, Heckel D: Next generation transcriptomes for next generation genomes using est2assembly. BMC Bioinformatics. 2009, 10 (1): 447. 10.1186/1471-2105-10-447.
- Kumar S, Blaxter M: Comparing de novo assemblers for 454 transcriptome data. BMC Genomics. 2010, 11 (1): 571. 10.1186/1471-2164-11-571.
- Palmieri N, Schlötterer C: Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PLoS One. 2009, 4 (7): e6323. 10.1371/journal.pone.0006323.
- Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One. 2011, 6 (3): e17915. 10.1371/journal.pone.0017915.
- Emrich SJ, Aluru S, Fu Y, Wen T-J, Narayanan M, Guo L, Ashlock DA, Schnable PS: A strategy for assembling the maize (Zea mays L.) genome. Bioinformatics. 2004, 20 (2): 140-147. 10.1093/bioinformatics/bth017.
- Tang J, Vosman B, Voorrips R, van der Linden CG, Leunissen J: QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species. BMC Bioinformatics. 2006, 7 (1): 438. 10.1186/1471-2105-7-438.
- Anithakumari AM, Tang J, van Eck HJ, Visser RG, Leunissen JA, Vosman B, van der Linden CG: A pipeline for high throughput detection and mapping of SNPs from EST databases. Mol Breeding: New Strategies In Plant Improvement. 2010, 26 (1): 65-75.
- Vera Ruiz EM, Soriano JM, Romero C, Zhebentyayeva T, Terol J, Zuriaga E, Llácer G, Abbott AG, Badenes ML: Narrowing down the apricot Plum pox virus resistance locus and comparative analysis with the peach genome syntenic region. Mol Plant Pathol. 2011, 12 (6): 535-547. 10.1111/j.1364-3703.2010.00691.x.
- Rivarola M, Foster JT, Chan AP, Williams AL, Rice DW, Liu X, Melake-Berhan A, Creasy HH, Puiu D, Rosovitz MJ, et al: Castor Bean Organelle genome sequencing and worldwide genetic diversity analysis. PLoS One. 2011, 6 (7): e21743. 10.1371/journal.pone.0021743.
- Gulyani V, Khurana P: Identification and expression profiling of drought-regulated genes in mulberry (Morus sp.) by suppression subtractive hybridization of susceptible and tolerant cultivars. Tree Genet Genomes. 2011, 7 (4): 725-738. 10.1007/s11295-011-0369-3.
- Dubey A, Farmer A, Schlueter J, Cannon SB, Abernathy B, Tuteja R, Woodward J, Shah T, Mulasmanovic B, Kudapa H, et al: Defining the transcriptome assembly and its use for genome dynamics and transcriptome profiling studies in Pigeonpea (Cajanus cajan L.). DNA Research. 2011, 18 (3): 153-164. 10.1093/dnares/dsr007.
- Franssen SU, Shrestha RP, Bräutigam A, Bornberg-Bauer E, Weber APM: Comprehensive transcriptome analysis of the highly complex Pisum sativum genome using next generation sequencing. BMC Genomics. 2011, 12 (1): 227. 10.1186/1471-2164-12-227.
- Sakai H, Ikawa H, Tanaka T, Numa H, Minami H, Fujisawa M, Shibata M, Kurita K, Kikuta A, Hamada M, et al: Distinct evolutionary patterns of Oryza glaberrima deciphered by genome sequencing and comparative analysis. Plant J. 2011, 66 (5): 796-805. 10.1111/j.1365-313X.2011.04539.x.
- Tillett RL, Ergül A, Albion RL, Schlauch KA, Cramer GR, Cushman JC: Identification of tissue-specific, abiotic stress-responsive gene expression patterns in wine grape (Vitis vinifera L.) based on curation and mining of large-scale EST data sets. BMC Plant Biol. 2011, 11 (1): 86. 10.1186/1471-2229-11-86.
- Singhal D, Gupta P, Sharma P, Kashyap N, Anand S, Sharma H: In-silico single nucleotide polymorphisms (SNP) mining of Sorghum bicolor genome. Afr J Biotechnol. 2011, 10 (4): 580-583.
- Bräutigam A, Mullick T, Schliesky S, Weber APM: Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C3 and C4 species. J Exp Bot. 2011, 62 (9): 3093-3102. 10.1093/jxb/err029.
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3 (11): 1314-1317. 10.1038/ismej.2009.72.
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Meth. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.
- Parchman T, Geist K, Grahnen J, Benkman C, Buerkle CA: Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010, 11 (1): 180. 10.1186/1471-2164-11-180.
- Wall PK, Leebens-Mack J, Chanderbali A, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho L, Hu Y, Carlson J, et al: Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009, 10 (1): 347. 10.1186/1471-2164-10-347.
- Tang J, Leunissen J, Voorrips R, van der Linden CG, Vosman B: HaploSNPer: a web-based allele and SNP detection tool. BMC Genet. 2008, 9 (1): 23.
- Zerbino D, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
- Oases. [http://www.ebi.ac.uk/~zerbino/oases/]
- ABySS. [http://www.bcgsc.ca/platform/bioinfo/software/abyss]
- SOAPdenovo. [http://soap.genomics.org.cn/soapdenovo.html]
- Sterck L, Rombauts S, Vandepoele K, Rouzé P, Van de Peer Y: How many genes are there in plants (... and why are they there)?. Curr Opin Plant Biol. 2007, 10 (2): 199-203. 10.1016/j.pbi.2007.01.004.
- Feldmeyer B, Wheat C, Krezdorn N, Rotter B, Pfenninger M: Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics. 2011, 12 (1): 317. 10.1186/1471-2164-12-317.
- Lin H, Ouyang S, Egan A, Nobuta K, Haas B, Zhu W, Gu X, Silva J, Meyers B, Buell CR: Characterization of paralogous protein families in rice. BMC Plant Biol. 2008, 8 (1): 18. 10.1186/1471-2229-8-18.
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
- Huang X, Madan A: CAP3: a DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.