Sequencing and assembly of A- and D-genome ESTs
Sequence libraries were made from mRNA prepared from multiple tissues including young leaves, roots, stems, ovules, and fibers in G. arboreum and young leaves, stem and whole flowers in G. raimondii. To increase the abundance of poorly expressed genes, mRNA libraries were normalized using the procedure involving kamchatka crab duplex-specific nucleases (Evrogen and Axxora, LLC, Farmingdale, NY) [19]. To increase the sequencing coverage and depth, two independent mRNA libraries were constructed and sequenced. The first set of 454 Titanium reads was assembled at The University of Texas at Austin and included a total of 1,699,776 reads from the A-genome, and 1,464,815 reads from the D-genome. The assembly of these reads resulted in 62,609 contigs with an average contig size (N50) of 1,032-bp for A-genome and 34,908 contigs (N50 of 1,107-bp) for D genome. The second assembly included 89,185 contigs (N50 of 628.6-bp) from the A genome and 68,984 contigs (N50 of 675.6-bp) from the D genome at Brigham Young University [17].
To improve assembly quality, we filtered out contigs shorter than 200-bp because of the poor base quality and short size and merged two sets of EST assemblies. As a result, the contig number greater than 500-bp was increased (1.7-fold) from 27,574 to 47,670 for the A genome and from 21,763 to 37,869 for the D genome. Likewise, the average size greater than 500-bp increased from 940.7 bp to 1,184.5 bp (243.8-bp increase) for A-genome and from 994.5 bp to 1,198.4 bp (203.9-bp increase) for D-genome. The number of contigs remained similar for both A- (from 40,726 to 47,670) and D- (from 35,197 to 37,869) genomes. The merged ESTs were assembled into 89,588 contigs for the A-genome (GaA, N50 of 805.9-bp) and 65,542 contigs for the D-genome (GrD, N50 of 839.6-bp). Over 80% of A (88%) and D (84%) contigs ranged from 200 to 1,500-bp (Figure 1 and Additional file 1: Table S1). The average length of ESTs for GaA and GrD is longer than those of previously published ESTs in allotetraploid cotton (763 bp) [11].
To estimate the transcriptome coverage and novelty, we compared GaA and GrD ESTs with those in CGI11 that comprised 117,992 EST contigs in cotton varieties of diploid and allotetraploid species using reciprocal BLASTN with an e-value score below e-10 (Figure 2A). Using CGI11 as the query, the majority of contigs in CGI11 matched GaA (81.2%) and GrD (82.4%), respectively, suggesting a good representation of GaA and GrD ESTs in the current EST collections. Using the reverse query, GaA and GrD ESTs matched 61.7% and 70.1% of contigs, respectively in CGI11, indicating 38.3% of new transcripts in GaA and 29.9% in GrD, respectively. This is probably because of increased sequencing depth in two independent libraries, as well as the normalization of mRNA libraries to include the poorly expressed genes. These GaA and GrD EST assemblies provide unique and additional useful resources for genomic studies.
GaA and GrD EST assemblies were searched against annotated transcripts based on DOE Joint Genome Institute: Cotton D V2.0 (JGI-D) (http://www.phytozome.net/cotton.php#C3) [7]. With the e-value cut-off of 1e-10, we identified unique ESTs present in GaA and GrD EST assemblies. Compared to the JGI-D annotation, 27,537 contigs (30.7%) were unique to A-genome, and 10,452 contigs (15.9%) were unique to D-genome (Figure 2B). These unique ESTs are the genes that are expressed in other tissues such as roots and shoots, poorly expressed genes, non-coding RNAs and/or gene fragments in the D-genome. These new ESTs provide additional transcriptional information for cotton genome annotation.
Characterization of A and D transcriptomes
Most plants including cotton underwent one or more rounds of whole genome duplication or polyploidization [20–22]. To estimate duplicated transcripts in the A- and D-genome species, we performed self-BLASTN for GaA and GrD with e-value less than e-100 (Figure 3A). Compared to the total contigs, 48.3% and 42.7% of GaA and GrD ESTs and contigs were multiple-copy, respectively, and numbers of single-copy contigs were 43,302 for GaA and 28,019 for GrD (Figure 3A). The results remained similar using the annotated transcripts in the JGI-D in the place of GrD (Figure 3A). The JGI-D contained 37,505 annotated genes and 77,267 protein-coding transcripts [7].
Multiple genome duplication events occurred during cotton evolution [7, 23]. Evolutionary analysis indicated that in the D-genome cotton (G. ramondii) whole genome duplication (WGD) occurred after split from the closest genome Theobroma cacao [7]. This polyploidy event was estimated to be 13–20 Mya. As a result, 2,355 duplicate blocks were present across the D-genome, and ~40% of paralogous genes were present in more than one block [24]. We used the rate of nonsynonymous substitution per gene (Ks) to determine WGD events. The value of the highest peak (Ks = 0.4) of the Ks distributions of orthologs and paralogs between G. ramondii itself and G. arboreum and G. ramondii were very similar (Figure 3B). These data may imply that WGD in A- and D-genomes occurred at a similar timeframe. The WGD could increase the rate of evolution for these related species. Our analysis confirmed expression of these duplicate genes from WGD, which are preserved after polyploidization.
We next tested divergence between transcriptomes of A- and D-genomes that were predicted to diverge 7–10 Mya [25]. Using reciprocal BLASTN searches between GaA and GrD ESTs, we found that 73.2% (65,591) of GaA contigs and 80.8% (52,922) of GrD contigs were shared (Figure 3C). This suggests that the majority of transcriptomes was conserved after divergence between A- and D-genome species. The unique genes between these two species could be related to species-specific gene loss or gain and tissue-specific expression of genes and/or proportion of genes that are important to fiber development because these two species have very different fiber morphologies and properties [26].
Proteins and peptides encoded by GaA and GrD ESTs were assigned using BLAST against multiple protein databases. With the e-value below e-10 and the identity of >50% of aligned amino-acid sequences, 39.4% of GaA contigs and 50.9% of GrD contigs matched targets in the Arabidopsis protein database (Figure 4A). A total of 35, 335 GaA contigs and 33, 333 GrD contigs matched 13,851 and 14,222 Arabidopsis protein-coding sequences, respectively. Additionally, the 11.7% of GaA contigs (10, 493) and 9.6% of GrD contigs (6, 300) matched protein entries in the Pfam-A database [27], using the e-value less than e-05. Additional 0.9% and 2% of GaA (846) and GrD (1, 328) contigs, respectively matched entries in the Uniprot database [28]. Approximately 48% of GaA contigs (46, 674) and 37% of GrD (40, 961) contigs could not be annotated with protein functions, which could be due to the genes related to cotton-specific biological processes such as fiber and cell wall biosynthesis, unknown transposable elements, non-coding RNAs, and/or computational errors.
Gene ontology analysis of A and D ESTs
Gene ontology (GO) analysis using the molecular function classification showed significant increases of percentages of predicted genes in GaA and GrD ESTs relative to those in Arabidopsis genes in classes of glucosyltransferase, ligase, molecular transducer, and transporter activities (P < 1e-4, Fisher’s exact test) (Figure 4B). Cotton fiber cell differentiation and cell elongation require rapid carbohydrate biosynthesis and transport activities [29, 30]. The glucosyltransferase pathway is part of the primary metabolic process, which is required for biosynthesis of cellulose and primary and secondary cell walls. In addition, transport activities are required for rapid transport of sucrose and other sugars and nutrients that are important to fiber cell biosynthesis. Enrichment of the genes in these GO classes in cotton A and D-genome ESTs suggests evolutionary conservation of biological activities related to fiber production in cotton.
Identification and characterization of genome-specific SNPs (GNPs)
GNP is useful to discriminate allelic expression differences between species and in allotetraploid cotton. Here we analyzed and generated high-quality GNPs by the mapping A-genome 454 raw reads to D contigs using Newbler 2.3 (http://454.com/products/analysis-software/index.asp). The mapping efficiency is ~61%. Using high-stringent parameters (> = 90% identity, > = 8 × read coverage, and Q-value > =25), we identified 75,754 GNPs and 2,390 INDELs (insertion or deletion of bases) between A- and D-genome contigs (Additional file 2: DataSet1 and Additional file 3: DataSet2). More than 10,885 contigs, that represented 29% of the shared genes between A- and D-genomes, were aligned with at least one GNP. These GNP-containing contigs included genes encoding MYB transcription factors, epigenetic factors, and other protein factors important to fiber development (Additional file 4: Table S2). Cotton GNPs were not evenly distributed throughout expressed transcripts. The average length between GNPs is 300-bp, and most GNPs appeared on every 100–600 bp (Figure 5A).
GNPs may result from transition or transversion. Transition results from the interchanges within two-ring purines (A/G) or one-ring pyrimidines (C/T), while transversion is the interchange between two-ring purine and one-ring pyrimidine bases. Thus, transition is generally more frequent than transversion because the nature of these changes. The ratio between transition and transversion was 1.9:1 between A and D-genome ESTs) (Figure 5B), which is lower than the ratio (2.3:1) between the two allotetraploids, G. hirsutum and G. barbadense, as previously reported [31]. Both G. hirsutum and G. barbadense are evolved from the same ancestors of ancient A and D genome. Less transitions/transversions ratio between diploid cotton species than between allotetraploid species may reflect an evolutionary distance, which is larger in the former than in the latter.
Prediction of GNPs in GrD ESTs was computationally validated using the D-genome sequence [7]. Among 77,267 annotated transcripts in the D-genome, there were only 917 SNPs in 336 (0.4%) transcripts and 124 INDELs in 91 (0.1%) transcripts between annotated transcripts and GrD ESTs. This small amount of discrepancy suggests a low amount of sequence variation between these two sequencing libraries from the same G. raimondii accession likely due to a small population size.
To validate computationally predicted GNPs, we amplified 200–300-bp fragments that flank individual GNPs in G. arboreum (A-genome), G. raimondii (D-genome), and the allotetraploid G. hirsutum TM-1 (AD-genome) by PCR and sequenced individual fragments. Among 72 candidate GNPs for validation (Additional file 5: Figure S1), 52 (72.2%) perfectly matched the GNPs, which were found in the GaA and GrD ESTs were also present in A and D subgenomes in the allotetraploid. The remaining 20 (28.8%) GNPs showed biased amplifications towards one of the subgenomes in the allotetraploid. This could be due to non-specific amplification using the primer pairs in the PCR, sequence variation between the exact progenitors in the allotetraploid and extant diploid species used in the assay, and/or sequence changes between A and D-subgenomes after allotetraploid formation, as predicted in another study [17]. These data suggest that these EST assemblies in A- and D-genome species are useful resources for genome annotation and gene expression studies in allotetraploid cotton.