Evolutionary constraints and expression analysis of gene duplications in Rhodobacter sphaeroides 2.4.1
© Peters et al.; licensee BioMed Central Ltd. 2012
Received: 29 October 2011
Accepted: 5 April 2012
Published: 25 April 2012
Gene duplication is a major force that contributes to the evolution of new metabolic functions in all organisms. Rhodobacter sphaeroides 2.4.1 is a bacterium that displays a wide degree of metabolic versatility and genome complexity and therefore is a fitting model for the study of gene duplications in bacteria. A comprehensive analysis of 234 duplicate gene-pairs in R. sphaeroides was performed using structural constraint and expression analysis.
The results revealed that most gene-pairs in in-paralogs are maintained under negative selection (ω ≤ 0.3), but the strength of selection differed among in-paralog gene-pairs. Although in-paralogs located on different replicons are maintained under purifying selection, the duplicated genes distributed between the primary chromosome (CI) and the second chromosome (CII) are relatively less selectively constrained than the gene-pairs located within each chromosome. The mRNA expression patterns of duplicate gene-pairs were examined through microarray analysis of this organism grown under seven different growth conditions. Results revealed that ~62% of paralogs have similar expression patterns (cosine ≥ 0.90) over all of these growth conditions, while only ~7% of paralogs are very different in their expression patterns (cosine < 0.50).
The overall findings of the study suggest that only a small proportion of paralogs contribute to the metabolic diversity and the evolution of novel metabolic functions in R. sphaeroides. In addition, the lack of relationships between structural constraints and gene-pair expression suggests that patterns of gene-pair expression are likely associated with conservation or divergence of gene-pair promoter regions and other coregulation mechanisms.
KeywordsGene duplication In-paralog Out-paralog, Evolutionary constraint Microarray Gene expression Rhodobacter sphaeroides
Rhodobacter sphaeroides is a well-studied photosynthetic eubacterium that belongs to the α-3 subgroup of the Proteobacteria[1, 2]. R. sphaeroides 2.4.1 is a model strain for this organism and is noteworthy since its genome consists of two chromosomes, chromosome I (CI; ~3.2 Mb) and chromosome II (CII; ~0.9 Mb), and five endogenous plasmids[3–6]. It possesses significant metabolic diversity[7–14] and is capable of growing under aerobic, semiaerobic, and photosynthetic growth conditions, while utilizing a wide variety of carbon and nitrogen nutrient sources[15, 16]. Therefore, R. sphaeroides is an ideal model for the examination and study of gene duplications and their roles in both the evolution of genomic complexity and the metabolic plasticity.
Homologous genes can be classified into two different groups, orthologs or paralogs, depending on the relationship between the genes. Orthologs are homologous genes in different species that originated from a common ancestral gene and they normally retain the same function during the course of evolution. In contrast, paralogs are homologous genes that originated by gene duplication and often contribute to functional innovations that are maintained for adaptation to specific ecological niches. Some gene duplications occur prior to speciation and exist in an ancestral lineage as a pair. These gene duplications are also recognized as paralogs because they are present as a pair in an ancestral genome but were passed on as “co-orthologs” by a speciation event. These types of gene-pairs are referred to as “out-paralogs”, with traditional paralogs (duplication occurring within lineage) being referred to as “in-paralogs”[18, 19]. In a previous study, 234 duplicate gene-pairs (paralogs) were identified in R. sphaeroides 2.4.1. Of these, 180 pairs were out-paralogs and 54 were in-paralogs. Duplicated genes often do not evolve to have novel functions, usually becoming silenced, but they do have a short opportunity to develop new metabolic capabilities during a brief period of relaxed selection after a duplication event. Additionally, there are several predictions for how duplicated genes are preserved and how they maintain or diverge their functions, and some models include neofunctionalization, pseudogenization, subfunctionalization, specialization, and increased protein dosage[19, 23].
Microarray expression profiles of R. sphaeroides 2.4.1 have revealed that differential expression exists among genes in certain pathways, such as in the components of light harvesting complexes, secondary metabolites, and energy production. The genome of R. sphaeroides exhibits genome complexity, an abundance of duplicated genes between its two chromosomes, and varied gene organization (solitary genes and short or long operons) resulting in coordinated, varied metabolic capabilities. The transcription of these complex gene-operons is regulated by transcription regulator(s) under varying environmental conditions, such as oxygen tension and light intensity.
In the current study, four hypotheses were examined. As mentioned above, out-paralogs have vertically descended from an earlier common ancestor as “co-orthologs” while in-paralogs have originated within R. sphaeroides. Therefore, the first hypothesis was that in-paralogs have experienced varying levels of selective constraints. Second, it was expected that gene paralogs on different chromosomes (CI and CII) have experienced different selective pressures, since CII sequences evolve more rapidly than the CI sequences in R. sphaeroides[25, 26]. Third, since protein size is an indicator of gene complexity in eukaryotes, it is hypothesized that gene-pairs encoding larger proteins in complex prokaryotic genomes like R. sphaeroides are maintained by stronger selection than gene-pairs encoding smaller proteins. Fourth, since transcription of genes is controlled primarily by the interaction of transcription factors (inducers or repressors) and regulatory sequences within the promoter region rather than coding sequences that determine protein structure, it is expected that the differences in gene expression among gene-pairs will not be correlated with nucleotide substitution rate measures in R. sphaeroides.
In the current study, the nonsynonymous and synonymous substitution rates (i.e., K a and K s , respectively) were computed for all 234 duplicated gene-pairs in R. sphaeroides. Since out-paralogs and old in-paralogs show saturated level of synonymous substitution rates, selective constraint was measured only for those duplicated gene-pairs, which had K s values lower than 1.1, so as to provide a reliable estimate of selection. Modes of selection on in-paralogs were also examined and the results were discussed. In addition, microarray expression patterns of duplicate gene-pairs were examined and relationships with the structural constraints measures were investigated. The results of these analyses provide target genes for detailed molecular and biochemical characterization for duplicated genes in the R. sphaeroides genome.
Variability of K a and K s for homologous gene-pairs
Summary of structural constraint values for paralogs in R. Sphaeroides
Nonsynonymous substitution rate (K a )
Synonymous substitution rate (K s )
KS Test (P)
KS Test (P)
Structural constraints operating on duplicated genes
Duplicate gene-pairs in R. sphaeroides with selective constraint (ω) < 1 (ordered by ω)
The average ω values of protein-pairs grouped by length are roughly similar (data is not shown), therefore protein length was not correlated with structural constraint (ω). Accordingly, it may indicate that genes with longer length appear to be subject to similar levels of purifying selection as those with shorter length.
Different gene expression patterns among duplicated genes
Correlations between K a and expression divergence and between K s and expression divergence for in-paralogs are shown in Additional file2: Figure A1. Both correlations were not strong as follows: correlation between K a and expression divergence (R = 0.359, p = 2.70E-2) and correlation between K s and expression divergence (R = 0.441, p = 5.61E-3).
Duplicated genes are maintained by purifying selection
In the current study, four specific hypotheses were examined. The first hypothesis was that two types of paralogs (namely in-paralogs and out-paralogs) in R. sphaeroides genome will have varying levels of evolutionary constraints (K a and K s ). In particular, wider degree of variability in evolutionary constraints was observed in out-paralogs than in in-paralogs. This supports the previous finding that most ancient gene duplications in R. sphaeroides saturate synonymous substitutions (K s ), and the corresponding selective constraint (ω) would decrease with increasing time as found in several bacteria including Escherichia coli and Staphylococcus aureus[30, 31]. Furthermore, duplicated genes in Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, bacteria, and mammals have all evolved under purifying selection. Since older duplications (out-paralogs and many old in-paralogs) are inherent to have K s > 1, the measure of selective constraint may not be reliable; however, majority of young in-paralogs showed that these gene-pairs are maintained under purifying selection.
Duplicated genes tend to experience a brief period of relaxed selection soon after the duplication event, thereby leaving opportunities for metabolic innovation[21, 33–35]. Some in-paralogs have not reached their maximum level of non-synonymous changes and so gene copies are allowed to harbor mutations, which may lead to expansion of gene functions. Relaxation of constraints can be the cause of an adaptive change that can alter protein function, so positive selection may occur at ω < 1.
Gene-pairs were determined to be relatively recent in R. sphaeroides genome, if they possessed K s < 1.1 and divergence < 50% as shown in Table2. If functional constraints are actually relaxed immediately after duplication, genes with these characteristics should also have elevated ω values. The average ω value for the most recent gene duplications within R. sphaeroides 2.4.1 was high and thus it indicates that relaxed selection may be acting on these gene-pairs, allowing them to gain metabolic novelties. Of the 24 in-paralogs listed in Table2, 15 in-paralogs code for hypothetical proteins and the others code for functions like transport, transposase, protease, conjugation, receptor, and phage-related protein function. As all of these in-paralogs originated in the R. sphaeroides lineage, varied gene function within a gene-pair is consistent with previous findings that in-paralogs experience accelerated evolution and can accumulate more amino acid substitutions. In a similar vein, as analyzed in Escherichia coli, Helicobacter pylori, and Neisseria meningitides, gene duplications evolve at faster rates compared to unique single-copy genes but this finding was not consistent in Chlamydophila pneumoniae. These findings also confirm that duplicated genes are often involved in less critical functions and that these may be responsible for strain-specific differences.
Similar level of evolutionary constraints on duplicate genes located on CI and CII
In R. sphaeroides 2.4.1, CII has been shown to be rapidly evolving[25, 26]. For this reason, the second hypothesis was that gene paralogs on different replicons might have experienced different selective pressures within R. sphaeroides and specifically that paralogs between CI and CII might have experienced different levels of selection from those contained strictly within CI, within CII, or within plasmids. Our result revealed that the mean ω value of in-paralogs distributed between CI and CII (ω = 0.284 ± 0.321) is higher than the mean ω value of in-paralogs located within CI (ω = 0.123 ± 0.085) and also within CII (ω = 0.145 ± 0.172), but the difference in ω is not statistically significant. Thus, the selective constraint is not the primary force for the rapid divergence of CII in R. sphaeroides. These findings are consistent with the previous observation that 28 gene orthologs within CI and CII among four strains of R. sphaeroides are maintained under purifying selection.
In an analysis of 28 gene-pairs common among four strains of R. sphaeroides (2.4.1, ATCC17025, ATCC17029, and KD131), purifying selection (ω ≤ 0.3) was detected under the MYN model, indicating that these gene-pairs have significantly conserved their functions, and have not possibly evolved via convergent evolution. Some duplicated genes in R. sphaeroides 2.4.1 might possess evidences of being subfunctionalized or specialized for similar functions, but other pairs may also be maintained through neofunctionalization. As a note, very few genes in genomes generally possess significantly high ω values. This is partly due to the fact that genomic streamlining can lead to limited size, and microbial genomes lack a permissive environment for expansion of nun-functional DNA due to different population genetic environments.
Different structural constraints exist for genes with different functions. Essential genes are highly conserved within bacterial genomes[40, 41] and duplicated genes in R. sphaeroides follow this trend. Herein then, one significant explanation for the differences in structural constraints across R. sphaeroides replicons could lie in the distribution of essential genes across its replicons. Generally, for bacteria with two chromosomes, there is one large primary chromosome (CI) that contains a great deal of essential genes, while the accessory chromosome (CII) contains a sizeable percentage of nonessential genes, coding for hypothetical proteins of unknown function. One issue that arises, however, is whether the presence of nonessential genes contributes to more variability of selection pressures on CII or weaker selection pressures on CII allow for the presence of nonessential genes to thrive and diverge. Although it is not immediately clear which predominates, it is probably a combination of both situations. As hypothesized before, it is likely then that relaxed structural constraints on the CII of bacteria significantly contributes to their increased evolution and divergence compared to CI, although the findings of the current study may stand in contrast to that idea.
Protein size does not influence selective constraints on coding sequence evolution
Although protein size is an indicator of gene complexity[27, 43, 44], K a , K s , and ω were not correlated with gene length in R. sphaeroides 2.4.1. In accord, the third hypothesis, which postulated that other gene-pairs coding for larger proteins would be maintained by stronger selection than other gene-pairs coding for smaller proteins, was not substantiated. However, owing to the limited number of duplicate genes that code proteins of varying lengths, the hypothesis remains to be tested on a larger data set. As a note, the stronger selection pressure exists on larger prokaryotic genomes compared to smaller ones. However, in eukaryotes, bigger proteins are likely more important compared to in prokaryotes, therefore longer and more complex genes are often selected to be duplicated in eukaryotes.
Gene expression patterns and structural constraints
Duplicated genes display a variety of expression patterns across seven growth conditions. Most in-paralogs show high cosine values, which suggests that these gene-pairs are similarly expressed, while the majority of out-paralogs possess a wide range of cosine values, which is indicative of their varied expression patterns. Varied expression patterns among out-paralogs suggest that out-paralog gene pairs evolved either to produce more proteins required by specific metabolic functions or to specialize new gene functions. The present results are in contrast to previous findings that gene duplications solely serve to increase protein dosage and our findings also suggest that gene duplication may play an important role in specializing new functions as a variety of expression patterns were observed for many of these duplicate gene pairs[47–50].
As illustrated in Figure9, there exist 16 dominant or discriminating expression patterns in both in-paralogs and out-paralogs. These paralogs have high cosine expression similarity at gene–to-gene and cluster-to-cluster levels, indicating that individual gene patterns in out-paralogs are similar to those in in-paralogs. Furthermore, all the latent major patterns observed in out-paralogs were preserved in in-paralogs with a very little variation. No distinct expression patterns in either out-paralogs or in-paralogs were observed.
Figure8 presents four examples of expression patterns for gene-pairs corresponding to each of the four previously defined groups (A, B, C, and D). It is immediately evident that gene-pairs with high cosine (>0.5) exhibit fairly similar expression patterns across all growth conditions while those with low cosine (≤0.5) exhibit dissimilar expression patterns. More specifically, as shown in Figure8C and8D, there is a divergent level of expression for the low cosine gene-pairs with the genes in each pair peaking at different conditions. In addition, protein divergence does not seem to be relevant to the expression similarity patterns as shown in Figure8A and8B. Additionally, although gene expression in out-paralogs does not appear to be related to structural constraints, K a and K s were weakly correlated (p < 0.05) with expression divergence in in-paralogs. As in-paralogs are relatively newer duplications compared to out-paralogs, these findings could suggest that in a gene duplication event, expression divergence is first partially due to structural constraints acting upon the gene-pair. However, as time increases, such structural constraints begin to matter less and other regulatory factors come into play. The fourth hypothesis submits that gene expression is not correlated with K a and K s because changes in regulatory elements will have a greater impact on the level of expression than changes in the coding sequence. This hypothesis is only partially supported, as there is a lack of significant correlations between those factors in out-paralogs but weak correlations exist in in-paralogs.
Therefore, in out-paralogs and possibly old in-paralogs, it seems then that other factors might be at work in contributing to expression divergence. One such factor is the promoter region of these genes. For instance, during a period of relaxed selection, some regions of a gene may be maintained or conserved, while others are altered and interact differently with transcriptional regulators. As such, it is possible that promoter sequences in some duplicated genes in R. sphaeroides have been selectively conserved after duplication but coding sequences have diverged so that gene-pair functions can differ but promoters can respond to similar environmental or physiological cues. On the other hand, if a similar gene structure or coding region is maintained but promoters have diverged, a gene-pair can be specialized for functions in a particular environmental condition. In prokaryotes, paralogs have been associated with advantageous changes or mutations such as in methyltransferase which are adaptive to the environment. Similarly, it has been proposed that gene duplications are useful for fluctuations in the environment so that each copy can perform the same function under different conditions. For instance, in R. sphaeroides, duplicated cbb genes participate in carbon fixation pathways, but different forms of the enzyme function under different levels of oxygen tension. Such an understanding is further substantiated by data shown in Figure8. For instance, as mentioned before, Figure8C and8D present gene-pairs with low cosine. The genes in Figure8C are RSP_0476 and RSP_2364 and they code for L-fuculose-phosphate aldolase (Class I) and L-fuculose phosphate aldolase (Class II ), respectively. The graph demonstrates that these two genes are expressed differently under the growth conditions, although their amino acid divergence is fairly low (29%). Likewise, in Figure8D, the genes RSP_1998 and RSP_3049 code for molybdopterin molybdochelatase and molybdenum cofactor biosynthesis protein, respectively. These genes follow a similar pattern as those in Figure8C, even though their amino acid divergence is high (62%). Therefore, it is more likely that diverged promoter sequences for these sets of genes have resulted in the differential gene-pair expression.
As such, in the cases in which differential expression exists between genes within a duplicate pair, it is possible that the promoter regions are not conserved within the pair. In a duplication event, promoter sequences may or may not have been duplicated with the coding region or the duplicated promoter regions may have diverged, if the duplicate copy is located close to another operon and is now controlled by a different, new promoter. Either case would therefore result in different patterns of expression between the gene copies. Additionally, paralogs have been shown to diverge with respect to regulatory control rather than biochemical function of the associated protein, possibly leading to differences in expression patterns. Analysis of promoter regions of duplicate genes is under investigation, and therefore the result is not currently available.
The current study demonstrated that duplicated genes have distinct patterns of gene expression that may be related to the function and maintenance of each gene copy in the genome of R. sphaeroides. Through analysis of structural and functional constraints on duplicated genes in R. sphaeroides, many paralogs were maintained under negative selection, indicating that their functions are constrained. Because most gene duplications are out-paralogs, the period in which they experienced relaxed selection may not be detectable. Moreover, duplicated genes are maintained with varying levels of amino acid divergence, and ~62% have nearly identical expression across seven different growth conditions. However, these results also mean that gene copies are allowed to diverge and if a new adaptive function is acquired by mutation, the new diverged gene may become subject to negative selection in order to maintain that function.
Furthermore, the results of this study have set the cornerstone for detailed molecular analysis of duplicated genes in prokaryotes and specifically in R. sphaeroides. Future studies may incorporate techniques such as promoter-swapping and gene knockout to examine important gene regulation mechanisms and regulatory circuits that facilitate functional innovations in R. sphaeroides and other prokaryotic organisms. In addition, future studies should examine gene-gene interaction networks with respect to gene duplication as well as in relation to the growth environments. For instance, one recent study suggests that the role of gene duplications are overstated in the evolution of metabolic functions and protein families in that duplicated genes are primarily important in gene dosage but horizontal gene transfer is the primary method of acquiring of new and evolved functions in organisms. As such, examination of not only major evolutionary forces, such as horizontal gene transfer and gene duplication, but also the interplay of these forces is essential to understand prokaryotic genome complexity and the evolution of new gene functions.
Aside the variability in structural-functional constraints (K s , K a , and ω) among duplicate gene pairs, a majority of out-paralog and in-paralog gene-pairs in R. sphaeroides are maintained under negative evolutionary pressure (purifying selection), and the finding is consistent with the results previously reported on other species. Only a small percentage of paralogs evolves into novel metabolic functions in R. sphaeroides. Two chromosomes (CI and CII) revealed a very similar level of evolutionary constraint and therefore the selective constraint on the duplicated gene-pairs is not the major force for the rapid genetic divergence of CII. In addition, expression patterns of duplicated gene-pairs suggest that a majority of duplicated genes in R. sphaeroides are similarly expressed over several growth conditions, however the level of similarity (cosine values) varies among duplicated gene-pairs. Only 14 gene-pairs have very divergent gene expression patterns. This study concludes that gene duplication not only means to innovate gene functions, but also contributes towards increasing protein dosages, which help this organism to adapt to its environment.
Selective constraint analysis
All 234 gene-pairs identified in a previous study were used for analysis. To analyze the selective constraints operating on the duplicate genes, automated software was employed in a PERL script and is available upon request. The program performs the following steps: first, nucleic acid sequences in FASTA format are translated to amino acids using the standard codon table; second, the translated protein sequences are stored in a separate file for each pair; third, the protein sequences for each duplicate protein-pair are then aligned using MUSCLE; fourth, the aligned protein sequences are back-translated into the original corresponding DNA sequences using PAL2NAL; fifth, each pair of DNA sequences is then subjected to KaKs_Calculator for analysis. KaKs_Calculator identifies the synonymous (K s ) and non-synonymous (K a ) nucleotide substitutions between two sequences and then computes the nonsynonymous/synonymous rate ratio (ω = K a /K s ). It is better to align protein sequences rather than nucleic acid sequences because protein alignments prevent the introduction of frame shifts that may occur due to the incorrect placement of gaps during alignments. KaKs_Calculator implements several estimation models, including the MYN (Modified Yang-Nielsen) method[58, 59]. In addition, it includes model selection, model averaging, and γ-distribution, which can be applied to these methods for the estimation of ω[60–62]. Plots and regression analysis were generated with MATLAB 7.11. Kolmogorov-Smirnov tests were performed using online software (http://www.physics.csbsju.edu/stats/KS-test.html). The level of significance for all statistical tests was designated at α = 0.05.
The MYN method adopts the Tamura-Nei (TN) model of substitution at each of the three steps in its calculation. A modified form of MYN, γ-MYN, is based on the assumption that the evolutionary rate at each site follows a mode of γ-distribution because unequal substitution rates affect K a and K s . This distribution can also be applied to other methods incorporating a γ-distribution shape parameter to suggest that variable substitution rates across sites under negative and positive selection have different effects on the estimation of ω. Different mutation models using different evolutionary parameters may produce biased results so it is important to choose the appropriate modeling for a particular sequence. Since the MYN method assumes that different nucleotide positions evolve at the same rate, the addition of the γ-distribution parameter attempts to increase biological realism. Added parameters can cause redundancy, but the γ-MYN method models γ-distribution at the amino acid level, which likely avoids this problem.
Gene expression analysis
The microarray expression data for all genes in R. sphaeroides 2.4.1 grown under seven different growth conditions have been published. The seven growth conditions include aerobic, anaerobic, semi-aerobic, dark/dimethyl-sulfoxide (DMSO), and three photosynthetic conditions under different light intensities (3 watts, 10 watts, and 100 watts), where three replicates for each growth condition were presented. It is noted that the probes for the microarray were designed so that under stringent hybridization condition, it produces minimal cross-hybridization. In addition, the results from the microarray experiment were independently verified with previous observations on gene expression performed by northern blot and qPCR analysis. Of the 234 duplicate gene-pairs, the gene expression levels of 213 duplicate gene-pairs were available in the microarray expression data. The three replicates were averaged to form the expression level for each growth condition, resulting in a rectangular real-value data matrix with 426 rows (213 gene-pairs) and seven columns (seven growth conditions). Then, rows (genes) of the gene expression data matrix were normalized by z-score transformation to reduce difference in scale among each gene expression, resulting in mean 0 and variance 1 of every gene expression over the seven growth conditions. The normalized expressions of homologs are clustered with the usual Hierarchical Agglomerative Clustering (HAC) and shown in Additional file3: Figure A2. From a biological point of view, the relative up- or down-regulation of gene expression is interesting, instead of the absolute amplitude changes. Therefore, z-score transformation has been used to emphasize the relative variation in intensity among genes or samples/conditions in gene expression[65, 66].
Pearson’s correlation coefficient (R) was used to quantify the correlation between two gene expression patterns of each duplicate pair and then linear regression analysis was employed to investigate the relationship between R and each of the evolution constraints (K a , K s , or ω), where the transformed R(= ln[(1 + R)/(1 - R)]) was used to follow the previously published approach that uses the transformed R to change the scale of R to be appropriate for a linear regression analysis. Linear regression analysis was also used to analyze relationships between expression divergence and each of the evolution constraints. In addition, the cosine similarity metric was used to measure the similarity between two gene expression patterns of each duplicate pair. Each gene is represented in the vector of the seven growth conditions (i.e., in the seven dimensional space) and cosine similarity measures the cosine of the angle formed by the two gene vectors, through which cosine quantifies how closely two gene vectors point in the high dimensional space. If the two genes are similarly expressed, cosine is close to one (i.e., 1), since the angle between the two gene vectors is close to 0 degrees, pointing in the same direction. If they are differently expressed, cosine is close to zero (i.e., 0), since the angle is close to 90 degrees.
In addition, to reveal the latent expression patterns in the microarray data, both HAC with average linkage and k-means clustering algorithms were applied to cluster genes into groups of similar or consistent patterns across the growth conditions. As a note, one minus the cosine was used as the distance metric for both the algorithms. HAC is utilized as an initialization for k-means, which resolves the initialization problem and thus results in a deterministic clustering with k-means. HAC has been successfully applied as an initialization for other clustering algorithms[66, 68–70].
To identify conserved patterns between in-paralogs and out-paralogs as well as paralog-specific co-expression patterns either in in-paralogs or in out-paralogs, a very simple and straightforward method was developed as follows. First, the clustering for in-paralogs and the clustering for out-paralogs were obtained separately using k-means with HAC initialization. Then, each cluster (as a query) in the clustering of in-paralogs was matched to the closest cluster (as a library) in the clustering of paralogs and vice versa, where cosine similarity was also used to measure the closeness between query and library clusters. This process resembles the BLAST search and results in two pairs of the best matches, one for in-paralogs to out-paralogs match and the other for out-paralogs to in-paralogs match. The similarity of all the matched cluster pairs was quantified using the average cosine similarity, the conserved patterns (i.e., the best matched clusters) were visualized, and the biological functions of the conserved patterns were discussed.
Availability of supporting data
The supplementary data describing the results in this article are included as additional files.
- K s :
ynonymous substitution rate
- K a :
Non-synonymous substitution rate
- ω :
Modified Yang-Nielsen method
Hierarchical agglomerative clustering
ATP binding cassette
Horizontal gene transfer DMSO: Dar/demethyl-sulfoxide.
We would like to thank Dr. Samuel Kaplan for the microarray data. This work was supported by the Enhancement Grant for Research (EGR) from Sam Houston State University to Madhusudan Choudhary.
- Woese CR: Bacterial evolution. Microbiol Rev. 1987, 51 (2): 221-271.PubMedPubMed CentralGoogle Scholar
- Woese CR, Stackebrandt E, Weisburg WG, Paster BJ, Madigan MT, Fowler VJ, Hahn CM, Blanz P, Gupta R, Nealson KH, et al: The phylogeny of purple bacteria: the alpha subdivision. Syst Appl Microbiol. 1984, 5: 315-326. 10.1016/S0723-2020(84)80034-X.PubMedView ArticleGoogle Scholar
- Mackenzie C, Choudhary M, Larimer FW, Predki PF, Stilwagen S, Armitage JP, Barber RD, Donohue TJ, Hosler JP, Newman JE, et al: The home stretch, a first analysis of the nearly completed genome of Rhodobacter sphaeroides 2.4.1. Photosynth Res. 2001, 70 (1): 19-41. 10.1023/A:1013831823701.PubMedView ArticleGoogle Scholar
- Suwanto A, Kaplan S: Physical and genetic mapping of the Rhodobacter sphaeroides 2.4.1 genome: genome size, fragment identification, and gene localization. J Bacteriol. 1989, 171 (11): 5840-5849.PubMedPubMed CentralGoogle Scholar
- Suwanto A, Kaplan S: Physical and genetic mapping of the Rhodobacter sphaeroides 2.4.1 genome: presence of two unique circular chromosomes. J Bacteriol. 1989, 171 (11): 5850-5859.PubMedPubMed CentralGoogle Scholar
- Suwanto A, Kaplan S: Chromosome transfer in Rhodobacter sphaeroides: Hfr formation and genetic evidence for two unique circular chromosomes. J Bacteriol. 1992, 174 (4): 1135-1145.PubMedPubMed CentralGoogle Scholar
- Zeilstra-Ryalls J, Gomelsky M, Eraso JM, Yeliseev A, O'Gara J, Kaplan S: Control of photosystem formation in Rhodobacter sphaeroides. J Bacteriol. 1998, 180 (11): 2801-2809.PubMedPubMed CentralGoogle Scholar
- Jenney FE, Daldal F: A novel membrane-associated c-type cytochrome, cyt cy, can mediate the photosynthetic growth of Rhodobacter capsulatus and Rhodobacter sphaeroides. EMBO J. 1993, 12 (4): 1283-1292.PubMedPubMed CentralGoogle Scholar
- Grishanin RN, Gauden DE, Armitage JP: Photoresponses in Rhodobacter sphaeroides: role of photosynthetic electron transport. J Bacteriol. 1997, 179 (1): 24-30.PubMedPubMed CentralGoogle Scholar
- Brandner JP, McEwan AG, Kaplan S, Donohue TJ: Expression of the Rhodobacter sphaeroides cytochrome c2 structural gene. J Bacteriol. 1989, 171 (1): 360-368.PubMedPubMed CentralGoogle Scholar
- Moore MD, Kaplan S: Identification of intrinsic high-level resistance to rare-earth oxides and oxyanions in members of the class Proteobacteria: characterization of tellurite, selenite, and rhodium sesquioxide reduction in Rhodobacter sphaeroides. J Bacteriol. 1992, 174 (5): 1505-1514.PubMedPubMed CentralGoogle Scholar
- Neidle EL, Kaplan S: Expression of the Rhodobacter sphaeroides hemA and hemT genes, encoding two 5-aminolevulinic acid synthase isozymes. J Bacteriol. 1993, 175 (8): 2292-2303.PubMedPubMed CentralGoogle Scholar
- Zeilstra-Ryalls JH, Kaplan S: Control of hemA expression in Rhodobacter sphaeroides 2.4.1: regulation through alterations in the cellular redox state. J Bacteriol. 1996, 178 (4): 985-993.PubMedPubMed CentralGoogle Scholar
- Barber RD, Donohue TJ: Function of a glutathione-dependent formaldehyde dehydrogenase in Rhodobacter sphaeroides formaldehyde oxidation and assimilation. Biochemistry. 1998, 37 (2): 530-537. 10.1021/bi971463t.PubMedView ArticleGoogle Scholar
- Mouncey N, Choudhary M, Kaplan S: Characterization of genes encoding dimethyl sulfoxide reductase of Rhodobacter sphaeroides 2. 4. 1: an essential metabolic gene function encoded on chromosome II. J Bacteriol. 1997, 179 (24): 7617-7624.PubMedPubMed CentralGoogle Scholar
- Kiley PJ, Kaplan S: Molecular Genetics of Photosynthetic Membrane Biosynthesis in Rhodobacter sphaeroides. MIicrobiol Rev. 1988, 52 (1): 50-69.Google Scholar
- Fitch W: Distinguishing homologous from analogous proteins. Systematic Biol. 1970, 19 (2): 99-113.Google Scholar
- Koonin E: Orthologs, paralogs and evolutionary genomics. Annu Rev Genet. 2005, 39: 309-338. 10.1146/annurev.genet.39.073003.114725.PubMedView ArticleGoogle Scholar
- Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA: Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. J Exp Zoolog B Mol Dev Evol. 2007, 308: 58-73.View ArticleGoogle Scholar
- Bavishi A, Lin L, Schroeder K, Peters A, Cho H, Choudhary M: The prevalence of gene duplications and their ancient origin in Rhodobacter sphaeroides 2.4.1. BMC Microbiol. 2010, 10 (1): 331-10.1186/1471-2180-10-331.PubMedPubMed CentralView ArticleGoogle Scholar
- Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science. 2000, 290: 1151-1155. 10.1126/science.290.5494.1151.PubMedView ArticleGoogle Scholar
- Lynch M, Force AG: Gene duplication and the origin of interspecific genomic incompatibility. Am Nat. 2000, 156: 590-605. 10.1086/316992.View ArticleGoogle Scholar
- Innan H, Kondrashov F: The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010, 11 (2): 97-108.PubMedGoogle Scholar
- Pappas CT, Sram J, Moskvin OV, Ivanov PS, Mackenzie RC, Choudhary M, Land ML, Larimer FW, Kaplan S, Gomelsky M: Construction and validation of the Rhodobacter sphaeroides 2.4. 1 DNA microarray: transcriptome flexibility at diverse growth modes. J Bacteriol. 2004, 186 (14): 4748-4758. 10.1128/JB.186.14.4748-4758.2004.PubMedPubMed CentralView ArticleGoogle Scholar
- Choudhary M, Zanhua X, Fu YX, Kaplan S: Genome analyses of three strains of Rhodobacter sphaeroides: evidence of rapid evolution of chromosome II. J Bacteriol. 2007, 189 (5): 1914-1921. 10.1128/JB.01498-06.PubMedPubMed CentralView ArticleGoogle Scholar
- Bavishi A, Abhishek A, Lin L, Choudhary M: Complex prokaryotic genome structure: rapid evolution of chromosome II. Genome. 2010, 53 (9): 675-687. 10.1139/G10-046.PubMedView ArticleGoogle Scholar
- He X, Zhang J: Gene complexity and gene duplicability. Curr Biol. 2005, 15 (11): 1016-1021. 10.1016/j.cub.2005.04.035.PubMedView ArticleGoogle Scholar
- Wang DP, Wan HL, Zhang S, Yu J: γ-MYN: a new algorithm for estimating Ka and Ks with consideration of variable substitution rates. Biol Direct. 2009, 4 (1): 20-10.1186/1745-6150-4-20.PubMedPubMed CentralView ArticleGoogle Scholar
- Novichkov PS, Wolf YI, Dubchak I, Koonin EV: Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol. 2009, 191 (1): 65-73. 10.1128/JB.01237-08.PubMedPubMed CentralView ArticleGoogle Scholar
- Ho SYW, Shapiro B, Phillips MJ, Cooper A, Drummond AJ: Evidence for time dependency of molecular rate estimates. Systematic Biol. 2007, 56 (3): 515-522. 10.1080/10635150701435401.View ArticleGoogle Scholar
- Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ: Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol. 2006, 239 (2): 226-235. 10.1016/j.jtbi.2005.08.037.PubMedView ArticleGoogle Scholar
- Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biol. 2002, 3 (2): research0008.1-research0008.9. 10.1186/gb-2002-3-2-research0008.View ArticleGoogle Scholar
- Hughes AL: The evolution of functionally novel proteins after gene duplication. Proceedings: Biological Sciences. 1994, 256 (1346): 119-124. 10.1098/rspb.1994.0058.Google Scholar
- Zhang J, Rosenberg HF, Nei M: Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA. 1998, 95: 3708-3713. 10.1073/pnas.95.7.3708.PubMedPubMed CentralView ArticleGoogle Scholar
- Conery JS, Lynch M: Nucleotide substitutions and the evolution of duplicate genes. Pacific Symposium on Biocomputing. 2001, 6: 167-178.Google Scholar
- Liberles DA: Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol Biol Evol. 2001, 18 (11): 2040-2047. 10.1093/oxfordjournals.molbev.a003745.PubMedView ArticleGoogle Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Microevolutionary genomics of bacteria. Theor Popul Biol. 2002, 61 (4): 435-447. 10.1006/tpbi.2002.1588.PubMedView ArticleGoogle Scholar
- Fay JC, Wu C: Sequence divergence, functional constraint, and selection in protein evolution. Annu Rev Genom Hum G. 2003, 4: 213-235. 10.1146/annurev.genom.4.020303.162528.View ArticleGoogle Scholar
- Lynch M: Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol. 2006, 60: 327-349. 10.1146/annurev.micro.60.080805.142300.PubMedView ArticleGoogle Scholar
- Shakhnovich BE, Koonin EV: Origins and impact of constraints in evolution of gene families. Genome Res. 2006, 16: 1529-1536. 10.1101/gr.5346206.PubMedPubMed CentralView ArticleGoogle Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 2002, 12 (6): 962-968.PubMedPubMed CentralView ArticleGoogle Scholar
- Egan ES, Fogel MA, Waldor MK: Divided genomes: negotiating the cell cycle in prokaryotes with multiple chromosomes. Mol Microbiol. 2005, 56 (5): 1129-1138. 10.1111/j.1365-2958.2005.04622.x.PubMedView ArticleGoogle Scholar
- Zhang J: Protein-length distributions for the three domains of life. Trends Genet. 2000, 16 (3): 107-109. 10.1016/S0168-9525(99)01922-8.PubMedView ArticleGoogle Scholar
- Rubin G, Yandell M, Wortman J, Gabor Miklos G, Nelson C, Hariharan I, Fortini M, Li P, Apweiler R, Fleischmann W: Comparative genomics of the eukaryotes. Science. 2000, 287 (5461): 2204-2215. 10.1126/science.287.5461.2204.PubMedPubMed CentralView ArticleGoogle Scholar
- Kondrashov FA, Koonin EV: A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 2004, 20 (7): 287-290. 10.1016/j.tig.2004.05.001.PubMedView ArticleGoogle Scholar
- Treangen TJ, Rocha EP: Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 2011, 7 (1): e1001284-10.1371/journal.pgen.1001284.PubMedPubMed CentralView ArticleGoogle Scholar
- Gu Z, Nicolae D, Lu HH, Li WH: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 2002, 18 (12): 609-613. 10.1016/S0168-9525(02)02837-8.PubMedView ArticleGoogle Scholar
- Castillo-Davis CI, Hartl DL, Achaz G: Cis-regulatory and protein evolution in orthologous and duplicate genes. Genome Res. 2004, 14 (8): 1530-1536. 10.1101/gr.2662504.PubMedPubMed CentralView ArticleGoogle Scholar
- Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 2003, 13 (7): 1638-1645. 10.1101/gr.1133803.PubMedPubMed CentralView ArticleGoogle Scholar
- Li W, Yang J, Gu X: Expression divergence between duplicate genes. Trends Genet. 2005, 21 (11): 602-607. 10.1016/j.tig.2005.08.006.PubMedView ArticleGoogle Scholar
- Bratlie MS, Johansen J, Sherman BT, Huang DW, Lempicki RA, Drablos F: Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genomics. 2010, 11 (1): 588-10.1186/1471-2164-11-588.PubMedPubMed CentralView ArticleGoogle Scholar
- Sanchez-Perez G, Mira A, Nyiro G, Pasic L, Rodriguez-Valera F: Adapting to environmental changes using specialized paralogs. Trends Genet. 2008, 24 (4): 154-158. 10.1016/j.tig.2008.01.002.PubMedView ArticleGoogle Scholar
- Wapinski I, Pfeffer A, Friedman N, Regev A: Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007, 449: 54-61. 10.1038/nature06107.PubMedView ArticleGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.PubMedPubMed CentralView ArticleGoogle Scholar
- Suyama M, Torrents D, Bork P: PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006, 34: 609-612. 10.1093/nar/gkl315.View ArticleGoogle Scholar
- Zhang Z, Li J, Zhao X-Q, Wang J, Wong GK-S, Yu J: KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Geno Prot Bioinfo. 2006, 4 (4): 259-263.View ArticleGoogle Scholar
- Suyama M, Harrington E, Bork P, Torrents D: Identification and analysis of genes and pseudogenes within duplicated regions in the human and mouse genomes. PLoS Comput Biol. 2006, 2 (6): 0627-0636.View ArticleGoogle Scholar
- Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000, 17 (1): 32-43. 10.1093/oxfordjournals.molbev.a026236.PubMedView ArticleGoogle Scholar
- Zhang Z, Li J, Yu J: Computing Ka and Ks with a consideration of unequal transitional substitutions. BMC Evol Biol. 2006, 6: 44-10.1186/1471-2148-6-44.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang Z, Yu J: Evaluation of six methods for estimating synonymous and nonsynonymous substitution rates. Geno Prot Bioinfo. 2006, 4: 173-181.View ArticleGoogle Scholar
- Wang D, Zhang S, He F, Zhu J, Hu S, Yu J: How Do Variable Substitution Rates Influence Ka and Ks Calculations?. Geno Prot Bioinfo. 2009, 7 (3): 116-127.View ArticleGoogle Scholar
- Yampolsky L, Bouzinier M: Evolutionary patterns of amino acid substitutions in 12 Drosophila genomes. BMC Genomics. 2010, 11 (Suppl 4): S10-10.1186/1471-2164-11-S4-S10.PubMedPubMed CentralView ArticleGoogle Scholar
- Lilliefors HW: On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Stat Assoc. 1967, 62 (318): 399-402. 10.1080/01621459.1967.10482916.View ArticleGoogle Scholar
- Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10: 512-526.PubMedGoogle Scholar
- Cheadle C, Vawter MP, Freed WJ, Becker KG: Analysis of microarray data using Z score transformation. J Mol Diagn. 2003, 5 (2): 73-81. 10.1016/S1525-1578(10)60455-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Cho H, Dhillon IS: Co-clustering of human cancer microarrays using Minimum Sum-Squared Residue coclustering. IEEE/ACM Trans Compt Biol and Bioinform. 2008, 5 (3): 385-400.View ArticleGoogle Scholar
- McGill MJ, Salton G: Introduction to Modern Information Retrieval. 1987, McGraw-Hill, New York, NYGoogle Scholar
- Costa IG, Carvalho FdATd, Souto MCPd: Comparative analysis of clustering methods for gene expression time course data. Genet and Mol Biol. 2004, 27 (4): 623-631. 10.1590/S1415-47572004000400025.View ArticleGoogle Scholar
- Chen J, He X, Li L: Identifying the Patterns of Hematopoietic Stem Cells Gene Expressions Using Clustering Methods: Comparison and Summary. J Data Science. 2004, 2: 297-379.Google Scholar
- Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17 (4): 309-318. 10.1093/bioinformatics/17.4.309.PubMedView ArticleGoogle Scholar