Prioritizing orphan proteins for further study using phylogenomics and gene expression profiles in Streptomyces coelicolor

Background Streptomyces coelicolor, a model organism of antibiotic producing bacteria, has one of the largest genomes of the bacterial kingdom, including 7825 predicted protein coding genes. A large number of these genes, nearly 34%, are functionally orphan (hypothetical proteins with unknown function). However, in gene expression time course data, many of these functionally orphan genes show interesting expression patterns. Results In this paper, we analyzed all functionally orphan genes of Streptomyces coelicolor and identified a list of "high priority" orphans by combining gene expression analysis and additional phylogenetic information (i.e. the level of evolutionary conservation of each protein). Conclusions The prioritized orphan genes are promising candidates to be examined experimentally in the lab for further characterization of their function.


Background
Here we present an analysis of orphan genes (hypothetical genes with unknown function) in the Streptomyces coelicolor genome, combining gene expression analysis and comparative genomics. The aim is to prioritize orphan genes for further study. In our gene expression studies [1,2], we frequently encountered genes that showed interesting expression patterns, but had no known function. To identify which of these genes merit in-depth experimental analysis, we developed a strategy for prioritizing protein encoding genes for additional characterization, combining phylogenomic information [3] (i.e. the level of evolutionary conservation of each protein), and gene expression data from a large gene expression time series [1]. We postulate that widely conserved proteins that show a physiologically relevant dynamic expression pattern are the most promising candidates for further experimental study, e.g. using gene overexpression and knock-out or knock-down approaches.
The functional annotation of orphan genes is not only relevant for its basic biological interest, but is also an important help for the improvement of genome-scale metabolic models based on genome annotation. These models in their initial form almost always contain gaps that need to be filled by manual curation or automated gap-filling algorithms that add missing essential metabolic activities to the models [2,[4][5][6][7].
During our previous studies of genome-scale metabolic models of Streptomyces coelicolor and its relatives, we regularly had to postulate enzymatic functions that had not been assigned to specific proteins in the organisms [2,7]. Assigning specific enzyme-coding genes to these orphan metabolic activities is very important for the subsequent analysis and interpretation of the models, and several approaches have been developed to assign sequences to the orphan metabolic activities: they employ, for example, mRNA co-expression analysis [8], phylogenetic profile information [9][10][11], pattern recognition techniques [12] or comparative genomics [13]. These approaches are organism specific and have mostly been employed for well-studied model organisms such as Escherichia coli and Saccharomyces cerevisiae.

Results and discussion
Of the 7825 predicted protein coding genes in the Streptomyces coelicolor genome [14], according to a re-annotation of the genome sequence in 2009, 2688 (34%) are coding for functionally orphan proteins, i.e. proteins that are annotated as "hypothetical protein", "conserved protein", "putative membrane protein" or "putative secreted protein". Of these orphan proteins, 26 are conserved in all and 381 are present in at least half (22/44) of the 44 analyzed complete actinomycete genomes (see Methods section for a complete species list). 683 orphan proteins are present in at least 11 (25%) and 177 are conserved in at least 33 (75%) actinomycete genomes.
Of the 381 generally conserved actinomycete orphan proteins (i.e., those that are present in at least half of the analyzed genomes), 25 are also encoded in all species in a selected set of diverse non-actinomycete bacterial genomes (Bacillus subtilis, Escherichia coli K12, Lactobacillus plantarum WCFS1, Staphylococcus aureus, and Streptococcus pneumonia AP200), and 73 are present in at least three of the representative bacterial genomes (see Additional File 1: Supplementary Table 1.xls).
Of these 73 ultra-highly conserved bacterial orphan genes, 22 also have putative homologues (reciprocal best BLAST hits) in at least half of the species in a representative set of eight non-bacterial genomes (the eukaryotes Caenorhabditis elegans, Arabidopsis thaliana, Plasmodium falciparum, Drosophila melanogaster, Saccharomyces cerevisiae and Homo sapiens, and the archaea Haloterrigena turkmenica and Methanosarcina acetivorans). These proteins are therefore almost universally conserved; however, although there seems to be significant conservation of some orphan proteins, none of them is truly universal, i.e. none has a putative homologue in all of the 58 studied genomes. This is most likely due to the fact that some of the included actinomycete genomes are highly reduced, as a result of the parasitic lifestyle of the organism, and the large phylogenetic distance covered (with the corresponding major differences in basic physiological processes).
To prioritize the orphan proteins for further characterization, we therefore summarized the phylogenomic information (i.e. the level of evolutionary conservation of each protein) in a single "conservation" score, which expresses the degree of conservation across the three domains examined (actinomycetes, non-actinomycete bacteria, non-bacteria). This score was combined with a second measure of expression dynamics across a large gene expression time series studying the metabolic switch caused by phosphate starvation. In this experiment, a fermenter culture of S. coelicolor was grown in phosphate-limited conditions, and gene expression data were obtained at 32 finely spaced time points throughout the duration of the experiment. Phosphate was depleted after about 35 hours, triggering a metabolic switch from primary to secondary metabolism, accompanied by a rapid global reorganization of the transcriptome, involving genes with a wide range of biological functions, from central metabolism and antibiotic biosynthesis, to cellular development and maintenance [1]. The "expression dynamics" score described in the Methods section identifies genes that show a smooth expression trend across (part of) the time series and favors those genes that show a particularly strong (steplike) expression change at one time point. This is intended to allow to focus on genes that are not only passively following the expression change during nutrient depletion but that show evidence for active regulation, which is indicative of a central function in cellular physiology. Based on the p-value of the "expression dynamics" score, we assigned a rank to each gene, and averaged this value with the rank of the "conservation" score. 734 orphan genes are significantly up-or downregulated with expression dynamic p-values less than 0.00001 (significant after multiple-testing correction).
Using the averaged conservation and expression dynamics rank, we arrived at a list of 30 top orphan proteins. These were examined in more detail to determine if their function was really unknown: we checked the most recent versions of the Uniprot [15] and StrepDB database for annotations, performed a PSI-BLAST against the Uniprot database, compared the annotation of the homologs in E. coli, yeast and human where these were available, and analyzed the domain architecture using SMART tool (Simple Modular Architecture Research Tool) [16]. Using this information, we asked three microbiologist and bioinformaticians to independently score the genes according to their "orphanicity", i. e. their confidence in the absence of a known potential function. The three raters used a large collection of automatically provided evidence for all candidate genes, including annotation from the most recent versions of the Uniprot and StrepDB database, output of a PSI-BLAST against the Uniprot database, and the output of a domain architecture analysis using the SMART tool (Simple Modular Architecture Research Tool). In addition, they were free to do their own literature research and sequence analysis, although this did not generally identify useful extra information. The average score of the three raters was combined with the average score of the conservation and expression dynamics to arrive at a final ranking for the most interesting orphan genes for further study: the top genes are those for which we have absolutely no information about their function, that are ultra-highly conserved across species, and show a highly significant dynamics in their gene expression (Table 1).
Based on the gene expression profiles (Figure 1), the candidate genes SCO5746 and SCO1222 are particularly interesting: they show a very strong switch upon phosphate starvation, and their expression increases in stationary phase similar to the expression pattern of the antibiotic biosynthesis gene clusters act and red. All other genes show a decrease of expression along the time course. SCO5746 has a putative uncharacterized homolog in E. coli and contains a domain of the DegT/DnrJ/EryC1/ StrS aminotransferase family. The aminotransferase activity was demonstrated for purified StsC protein, which acts as an L-glutamine:scyllo-inosose aminotransferase and catalyses the first amino acid transfer in the biosynthesis of the streptidine subunit of antibiotic streptomycin. It is therefore tempting to speculate that the SCO5746 gene has some role in the biosynthesis of a new antibiotic in S. coelicolor as well, and the same might be the case for the completely uncharacterized SCO1222. The closest putative antibiotic biosynthesis clusters are SCO5799-SCO5801 (siderophore synthetase type) and SCO1206-SCO1208 (chalcone synthetase type; [17]), both of which seem unlikely candidates for interacting with SCO5746 or SCO1222. However, it is possible that these genes contribute to a dispersed biosynthetic pathway, not involving a dense genomic clustering. Of course, they could also be contributing to any other stationary phase process.
Interestingly, we see a strong neighborhood conservation of most of the candidate orphan genes in other Streptomyces species (Figure 2). In some cases, the annotation of the neighbors does suggest at least a broad functional The proteins are prioritized according to their conservation across actinomycetes, bacteria and non-bacteria; their expression dynamics (summarized in the pvalue); and their orphanicity, i.e. the absence of any functional information, assessed as described in the text.
category: for example, SCO1521/1522 might be involved in DNA remodeling during recombination, as their conserved neighbors are a Holliday junction resolvase and DNA helicase (RuvABC complex); and SCO2081 might play a role in cell division, matching its conserved neighbor, the cell division protein ftsZ [18]. However, most of the conserved neighbors are hypothetical proteins themselves and do not seem to immediately identify a putative function for most of the orphan genes; nonetheless, the neighborhood information will be valuable for the design and interpretation of the most efficient experimental perturbations. The dynamic expression pattern of each of the neighborhoods depicted in Figure 2 is shown in Additional File 2: SupplementaryFile1.pdf. This illustrates, e.g., that the expression of SCO1522 shows a very similar expression trend compared with its left and right neighbors (SCO1521 and SCO1523), confirming the relevance of adjacency on the genome for predicting gene functionality.
The prioritization reported in this paper of course depends on implicit assumptions about what constitutes a protein of interest. Here we were a priori interested in any protein that is maintained by purifying selection in a large number of genomes, indicating that it is involved in a generally important physiological process. On the other hand, we assumed that genes that show strong gene expression responses to a major physiological perturbation are more likely to be functionally relevant under the conditions studied. Genes that are not expressed or not finely controlled are more likely to have more specialized functions. This approach does not exclude the identification of housekeeping genes, which may not be directly involved in the physiological process studied in the gene expression analysis, as these genes still tend to show dynamic expression patterns (as evidenced, e.g., by the ribosomal protein genes [1]). The results are, however, affected by the availability of gene expression data sets and will become more informative once other large-scale expression studies, comparable to the one used here, become available.

Conclusions
Our aim was to prioritize protein coding orphan genes (hypothetical proteins with unknown function) for further experimental characterization of their function. We developed an algorithm to detect dynamic switches in a large gene expression time course data set, and assigned an "expression dynamics" score to every orphan gene, arguing that genes that show substantial expression changes corresponding to biologically relevant events would be most interesting to follow up. We also summarized the available evolutionary information in a "conservation" score across a broad range of organisms (many actinomycetes, other bacteria and various nonbacterial species). We combined the "expression dynamics" rank and "conservation" rank to identify a robust list of 30 high priority orphan genes, which are promising candidates for detailed experimental study.

Genome sequence analysis
For the phylogenomic profiling, we studied the complete genome sequences of the 44 actinomycete species, which were also used in our earlier phylogenetic study [3] The conservation score was calculated in three steps: (1) the genes were independently ranked according to the number of species of actinomycetes, other bacteria, and non-bacteria in which they have a putative homolog; (2) their ranks in the bacteria and non-bacteria lists were averaged; and (3) the resulting rank and the rank in the actinomycete list were averaged again to produce the final rank.

Gene expression data
Details about the gene expression dataset and experimental conditions can be found in [1,2]. Briefly, mRNA abundance was assessed at 32 time points during a 68hour phosphate-limited fermentor culture of S. coelicolor, using custom-designed Affymetrix genechips; the data reveal a complex sequence of gene expression switches, affecting a large diversity of biological processes, from phosphate uptake to secondary metabolism and protein biosynthesis.

Dynamic expression detection
To identify genes that show a dynamic expression along the time course, and in particular genes that have a clear expression switch at one time point, we used the following iterative algorithm (in pseudo code): Input: a vector v of gene expression data Output: minPvalue, the p-value of the switch-like dynamic expression Initialize: minPvalue: = 1 For each value i in the set (2 to (length(v) -2)), do j : = i + 1 MaxWindowSize < -min(i, length (v)i) For each position p in the set ((i -MaxWindow-Size + 1) to i -1), do q : = j + (ip) Pvalue: = p-value of the t-test comparing v[p: i] and v[j:q] If (Pvalue < minPvalue), then minPvalue: = Pvalue end end end return minPvalue An R implementation of the algorithm is available from the authors upon request.

Additional material
Additional file 1: Table of generally conserved actinomycete orphan proteins. The table lists 381 Streptomyces coelicolor orphan genes that are generally conserved in other actinomycetes genomes (more than half of the 44 actinomycete species examined; blue gene numbers). Of these, the top 73 are present in at least three of five representative nonactinomycete bacterial genomes (red; 25 are present in all five of these species). The top 22 genes also have putative homologues (reciprocal best BLAST hits) in at least half of the species in a representative set of eight non-bacterial genomes (green).
Additional file 2: Gene expression profile of gene neighborhoods. Expression profile of the genes shown in Figure 2.