Streptomyces coelicolor, a model organism of antibiotic producing bacteria, has one of the largest genomes of the bacterial kingdom, including 7825 predicted protein coding genes. A large number of these genes, nearly 34%, are functionally orphan (hypothetical proteins with unknown function). However, in gene expression time course data, many of these functionally orphan genes show interesting expression patterns.
In this paper, we analyzed all functionally orphan genes of Streptomyces coelicolor and identified a list of "high priority" orphans by combining gene expression analysis and additional phylogenetic information (i.e. the level of evolutionary conservation of each protein).
The prioritized orphan genes are promising candidates to be examined experimentally in the lab for further characterization of their function.
Here we present an analysis of orphan genes (hypothetical genes with unknown function) in the Streptomyces coelicolor genome, combining gene expression analysis and comparative genomics. The aim is to prioritize orphan genes for further study. In our gene expression studies [1, 2], we frequently encountered genes that showed interesting expression patterns, but had no known function. To identify which of these genes merit in-depth experimental analysis, we developed a strategy for prioritizing protein encoding genes for additional characterization, combining phylogenomic information  (i.e. the level of evolutionary conservation of each protein), and gene expression data from a large gene expression time series . We postulate that widely conserved proteins that show a physiologically relevant dynamic expression pattern are the most promising candidates for further experimental study, e.g. using gene overexpression and knock-out or knock-down approaches.
The functional annotation of orphan genes is not only relevant for its basic biological interest, but is also an important help for the improvement of genome-scale metabolic models based on genome annotation. These models in their initial form almost always contain gaps that need to be filled by manual curation or automated gap-filling algorithms that add missing essential metabolic activities to the models [2, 4–7].
During our previous studies of genome-scale metabolic models of Streptomyces coelicolor and its relatives, we regularly had to postulate enzymatic functions that had not been assigned to specific proteins in the organisms [2, 7]. Assigning specific enzyme-coding genes to these orphan metabolic activities is very important for the subsequent analysis and interpretation of the models, and several approaches have been developed to assign sequences to the orphan metabolic activities: they employ, for example, mRNA co-expression analysis , phylogenetic profile information [9–11], pattern recognition techniques  or comparative genomics . These approaches are organism specific and have mostly been employed for well-studied model organisms such as Escherichia coli and Saccharomyces cerevisiae.
Results and discussion
Of the 7825 predicted protein coding genes in the Streptomyces coelicolor genome , according to a re-annotation of the genome sequence in 2009, 2688 (34%) are coding for functionally orphan proteins, i.e. proteins that are annotated as "hypothetical protein", "conserved protein", "putative membrane protein" or "putative secreted protein". Of these orphan proteins, 26 are conserved in all and 381 are present in at least half (22/44) of the 44 analyzed complete actinomycete genomes (see Methods section for a complete species list). 683 orphan proteins are present in at least 11 (25%) and 177 are conserved in at least 33 (75%) actinomycete genomes.
Of the 381 generally conserved actinomycete orphan proteins (i.e., those that are present in at least half of the analyzed genomes), 25 are also encoded in all species in a selected set of diverse non-actinomycete bacterial genomes (Bacillus subtilis, Escherichia coli K12, Lactobacillus plantarum WCFS1, Staphylococcus aureus, and Streptococcus pneumonia AP200), and 73 are present in at least three of the representative bacterial genomes (see Additional File 1: Supplementary Table 1.xls).
Of these 73 ultra-highly conserved bacterial orphan genes, 22 also have putative homologues (reciprocal best BLAST hits) in at least half of the species in a representative set of eight non-bacterial genomes (the eukaryotes Caenorhabditis elegans, Arabidopsis thaliana, Plasmodium falciparum, Drosophila melanogaster, Saccharomyces cerevisiae and Homo sapiens, and the archaea Haloterrigena turkmenica and Methanosarcina acetivorans). These proteins are therefore almost universally conserved; however, although there seems to be significant conservation of some orphan proteins, none of them is truly universal, i.e. none has a putative homologue in all of the 58 studied genomes. This is most likely due to the fact that some of the included actinomycete genomes are highly reduced, as a result of the parasitic lifestyle of the organism, and the large phylogenetic distance covered (with the corresponding major differences in basic physiological processes).
To prioritize the orphan proteins for further characterization, we therefore summarized the phylogenomic information (i.e. the level of evolutionary conservation of each protein) in a single "conservation" score, which expresses the degree of conservation across the three domains examined (actinomycetes, non-actinomycete bacteria, non-bacteria). This score was combined with a second measure of expression dynamics across a large gene expression time series studying the metabolic switch caused by phosphate starvation. In this experiment, a fermenter culture of S. coelicolor was grown in phosphate-limited conditions, and gene expression data were obtained at 32 finely spaced time points throughout the duration of the experiment. Phosphate was depleted after about 35 hours, triggering a metabolic switch from primary to secondary metabolism, accompanied by a rapid global reorganization of the transcriptome, involving genes with a wide range of biological functions, from central metabolism and antibiotic biosynthesis, to cellular development and maintenance . The "expression dynamics" score described in the Methods section identifies genes that show a smooth expression trend across (part of) the time series and favors those genes that show a particularly strong (step-like) expression change at one time point. This is intended to allow to focus on genes that are not only passively following the expression change during nutrient depletion but that show evidence for active regulation, which is indicative of a central function in cellular physiology. Based on the p-value of the "expression dynamics" score, we assigned a rank to each gene, and averaged this value with the rank of the "conservation" score. 734 orphan genes are significantly up- or down-regulated with expression dynamic p-values less than 0.00001 (significant after multiple-testing correction).
Using the averaged conservation and expression dynamics rank, we arrived at a list of 30 top orphan proteins. These were examined in more detail to determine if their function was really unknown: we checked the most recent versions of the Uniprot  and StrepDB database for annotations, performed a PSI-BLAST against the Uniprot database, compared the annotation of the homologs in E. coli, yeast and human where these were available, and analyzed the domain architecture using SMART tool (Simple Modular Architecture Research Tool) . Using this information, we asked three microbiologist and bioinformaticians to independently score the genes according to their "orphanicity", i.e. their confidence in the absence of a known potential function. The three raters used a large collection of automatically provided evidence for all candidate genes, including annotation from the most recent versions of the Uniprot and StrepDB database, output of a PSI-BLAST against the Uniprot database, and the output of a domain architecture analysis using the SMART tool (Simple Modular Architecture Research Tool). In addition, they were free to do their own literature research and sequence analysis, although this did not generally identify useful extra information. The average score of the three raters was combined with the average score of the conservation and expression dynamics to arrive at a final ranking for the most interesting orphan genes for further study: the top genes are those for which we have absolutely no information about their function, that are ultra-highly conserved across species, and show a highly significant dynamics in their gene expression (Table 1).
Based on the gene expression profiles (Figure 1), the candidate genes SCO5746 and SCO1222 are particularly interesting: they show a very strong switch upon phosphate starvation, and their expression increases in stationary phase similar to the expression pattern of the antibiotic biosynthesis gene clusters act and red. All other genes show a decrease of expression along the time course. SCO5746 has a putative uncharacterized homolog in E. coli and contains a domain of the DegT/DnrJ/EryC1/StrS aminotransferase family. The aminotransferase activity was demonstrated for purified StsC protein, which acts as an L-glutamine:scyllo-inosose aminotransferase and catalyses the first amino acid transfer in the biosynthesis of the streptidine subunit of antibiotic streptomycin. It is therefore tempting to speculate that the SCO5746 gene has some role in the biosynthesis of a new antibiotic in S. coelicolor as well, and the same might be the case for the completely uncharacterized SCO1222. The closest putative antibiotic biosynthesis clusters are SCO5799-SCO5801 (siderophore synthetase type) and SCO1206-SCO1208 (chalcone synthetase type; ), both of which seem unlikely candidates for interacting with SCO5746 or SCO1222. However, it is possible that these genes contribute to a dispersed biosynthetic pathway, not involving a dense genomic clustering. Of course, they could also be contributing to any other stationary phase process.
Interestingly, we see a strong neighborhood conservation of most of the candidate orphan genes in other Streptomyces species (Figure 2). In some cases, the annotation of the neighbors does suggest at least a broad functional category: for example, SCO1521/1522 might be involved in DNA remodeling during recombination, as their conserved neighbors are a Holliday junction resolvase and DNA helicase (RuvABC complex); and SCO2081 might play a role in cell division, matching its conserved neighbor, the cell division protein ftsZ . However, most of the conserved neighbors are hypothetical proteins themselves and do not seem to immediately identify a putative function for most of the orphan genes; nonetheless, the neighborhood information will be valuable for the design and interpretation of the most efficient experimental perturbations. The dynamic expression pattern of each of the neighborhoods depicted in Figure 2 is shown in Additional File 2: SupplementaryFile1.pdf. This illustrates, e.g., that the expression of SCO1522 shows a very similar expression trend compared with its left and right neighbors (SCO1521 and SCO1523), confirming the relevance of adjacency on the genome for predicting gene functionality.
The prioritization reported in this paper of course depends on implicit assumptions about what constitutes a protein of interest. Here we were a priori interested in any protein that is maintained by purifying selection in a large number of genomes, indicating that it is involved in a generally important physiological process. On the other hand, we assumed that genes that show strong gene expression responses to a major physiological perturbation are more likely to be functionally relevant under the conditions studied. Genes that are not expressed or not finely controlled are more likely to have more specialized functions. This approach does not exclude the identification of housekeeping genes, which may not be directly involved in the physiological process studied in the gene expression analysis, as these genes still tend to show dynamic expression patterns (as evidenced, e.g., by the ribosomal protein genes ). The results are, however, affected by the availability of gene expression data sets and will become more informative once other large-scale expression studies, comparable to the one used here, become available.
Our aim was to prioritize protein coding orphan genes (hypothetical proteins with unknown function) for further experimental characterization of their function. We developed an algorithm to detect dynamic switches in a large gene expression time course data set, and assigned an "expression dynamics" score to every orphan gene, arguing that genes that show substantial expression changes corresponding to biologically relevant events would be most interesting to follow up. We also summarized the available evolutionary information in a "conservation" score across a broad range of organisms (many actinomycetes, other bacteria and various non-bacterial species). We combined the "expression dynamics" rank and "conservation" rank to identify a robust list of 30 high priority orphan genes, which are promising candidates for detailed experimental study.
Genome sequence analysis
For the phylogenomic profiling, we studied the complete genome sequences of the 44 actinomycete species, which were also used in our earlier phylogenetic study : Arthrobacter aurescens TC1, Acidothermus cellulolyticus 11B, Bifidobacterium adolescentis ATCC 15703, Bifidobacterium longum NCC2705, Corynebacterium diphtheriae NCTC 13129, Corynebacterium efficiens YS-314, Corynebacterium glutamicum ATCC 13032, Corynebacterium jeikeium K411, Clavibacter michiganensis subsp michiganensis NCPPB 382, Frankia alni ACN14a, Frankia sp CcI3, Frankia sp EAN1pec, Kineococcus radiotolerans SRS30216, Leifsonia xyli subsp xyli str CTCB07, Mycobacterium avium subsp, paratuberculosis str k10, Mycobacterium avium 104, Mycobacterium bovis BCG Pasteur 1173P2, Mycobacterium bovis subsp bovis AF2122 97, Mycobacterium gilvum PYR-GCK, Mycobacterium sp JLS, Mycobacterium sp KMS, Mycobacterium leprae TN, Mycobacterium sp MCS, Mycobacterium tuberculosis H37Ra, Mycobacterium smegmatis str MC2155, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis F11, Mycobacterium tuberculosis H37Rv, Mycobacterium ulcerans Agy99, Mycobacterium vanbaalenii PYR-1, Nocardioides sp JS614, Nocardia farcinica IFM 10152, Propionibacterium acnes KPA171202, Rhodococcus sp RHA1, Renibacterium salmoninarum ATCC 33209, Salinispora arenicola CNS 205, Streptomyces avermitilis MA 4680, Saccharopolyspora erythraea NRRL 2338, Streptomyces griseus strain IFO13350, Streptomyces scabies strain 8722, Salinispora tropica CNB 440, Thermobifida fusca YX, Tropheryma whipplei str Twist, Tropheryma whipplei TW08 27. This was complemented by the genomes of 6 eukaryotes (Caenorhabditis elegans, Arabidopsis thaliana, Homo sapiens, Plasmodium falciparum 3D7, Drosophila melanogaster, Saccharomyces cerevisiae), 2 archaea (Haloterrigena turkmenica, Methanosarcina acetivorans), and 5 other model bacteria from different taxonomical classes (Bacillus subtilis, Escherichia coli K12, Lactobacillus plantarum WCFS1, Staphylococcus aureus, Streptococcus pneumonia AP200). Putative homologs were identified as reciprocal best BLAST hits. The conservation score was calculated in three steps: (1) the genes were independently ranked according to the number of species of actinomycetes, other bacteria, and non-bacteria in which they have a putative homolog; (2) their ranks in the bacteria and non-bacteria lists were averaged; and (3) the resulting rank and the rank in the actinomycete list were averaged again to produce the final rank.
Gene expression data
Details about the gene expression dataset and experimental conditions can be found in [1, 2]. Briefly, mRNA abundance was assessed at 32 time points during a 68-hour phosphate-limited fermentor culture of S. coelicolor, using custom-designed Affymetrix genechips; the data reveal a complex sequence of gene expression switches, affecting a large diversity of biological processes, from phosphate uptake to secondary metabolism and protein biosynthesis.
Dynamic expression detection
To identify genes that show a dynamic expression along the time course, and in particular genes that have a clear expression switch at one time point, we used the following iterative algorithm (in pseudo code):
Input: a vector v of gene expression data
Output: minPvalue, the p-value of the switch-like dynamic expression
Initialize: minPvalue: = 1
For each value i in the set (2 to (length(v) - 2)), do
j : = i + 1
MaxWindowSize < - min(i, length (v) - i)
For each position p in the set ((i - MaxWindowSize + 1) to i - 1), do
q : = j + (i - p)
Pvalue: = p-value of the t-test comparing v[p:i] and v[j:q]
If (Pvalue < minPvalue), then
minPvalue: = Pvalue
An R implementation of the algorithm is available from the authors upon request.
Nieselt K, Battke F, Herbig A, Bruheim P, Wentzel A, Jakobsen ØM, Sletta H, Alam MT, Merlo ME, Moore J, Omara WAM, Morrissey ER, Juarez-Hermosillo MA, Rodríguez-García A, Nentwich M, Thomas L, Iqbal M, Legaie R, Gaze WH, Challis GL, Jansen RC, Dijkhuizen L, Rand DA, Wild DL, Bonin M, Reuther J, Wohlleben W, Smith MCM, Burroughs NJ, Martín JF, Hodgson DA, Takano E, Breitling R, Ellingsen TE, Wellington EMH: The dynamic architecture of the metabolic switch in Streptomyces coelicolor. BMC Genomics. 2010, 11: 10-10.1186/1471-2164-11-10.
Alam MT, Merlo ME, Hodgson DA, Wellington EMH, Takano E, Breitling R: Metabolic modeling and analysis of the metabolic switch in Streptomyces coelicolor. BMC Genomics. 2010, 11: 202-10.1186/1471-2164-11-202.
Medema MH, Trefzer A, Kovalchuk A, van den Berg M, Müller U, Heijne W, Wu L, Alam MT, Ronning CM, Nierman WC, Bovenberg RAL, Breitling R, Takano E: The Sequence of a 1.8-Mb Bacterial Linear Plasmid Reveals a Rich Evolutionary Reservoir of Secondary Metabolic Pathways. Genome Biology and Evolution. 2010, 2: 212-224. 10.1093/gbe/evq013.
Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA: The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 2009, 37: D310-314. 10.1093/nar/gkn877.
Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: Identification of signaling domains. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95: 5857-5864. 10.1073/pnas.95.11.5857.
Additional file 1: Table of generally conserved actinomycete orphan proteins. The table lists 381 Streptomyces coelicolor orphan genes that are generally conserved in other actinomycetes genomes (more than half of the 44 actinomycete species examined; blue gene numbers). Of these, the top 73 are present in at least three of five representative non-actinomycete bacterial genomes (red; 25 are present in all five of these species). The top 22 genes also have putative homologues (reciprocal best BLAST hits) in at least half of the species in a representative set of eight non-bacterial genomes (green). (XLS 435 KB)
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Alam, M.T., Takano, E. & Breitling, R. Prioritizing orphan proteins for further study using phylogenomics and gene expression profiles in Streptomyces coelicolor.
BMC Res Notes4, 325 (2011). https://doi.org/10.1186/1756-0500-4-325