Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses

Background In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity, resulting in poor identification of divergent homologies.
Results Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. Members of the order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively).
Conclusions We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species, thereby mining functional information to identify shared domains as well as domains exclusive to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.


Background
Many species interact persistently in symbiosis through mutualistic, commensalistic, or parasitic relationships. Such symbiotic associations can lead to long histories of coevolution, promoting horizontal transfer of genes between the corresponding species. Acquired genetic material has afforded both prokaryotes and eukaryotes several advantageous new functions, including antibiotic resistance, nitrogen fixation, and even photosynthesis [1].
In the case of parasitic symbionts like viruses, most of the documented cases of gene transfer involve proteins with functions related to host immune system control or evasion. The large DNA viruses are particularly notorious for encoding homologs of cellular components of both the innate and adaptive arms of the immune response [2].
Profile search tools like PSI-BLAST and HMMER therefore provide more sensitive homology searches than BLAST or FASTA. Very few studies have investigated host-viral similarities (e.g. [15,16]) using profile search tools, however. Moreover, recent improvements to profile-based comparison algorithms have increased sensitivity further [17], thus improving their ability to identify even more distant homologies. In addition, previous studies of divergent host-viral similarities using profile search tools implemented their protocols as ad hoc solutions for specific viruses, thus impeding automation and application in other viral systems.
The present article investigates functional similarity at the protein domain level by surveying similarities between the human genome and the genomes of an arbitrary but representative set of eleven viruses impacting human health. It also examines similarities between the human genome and the genomes of forty-one viruses of the order Mononegavirales. The functional comparisons are made with a bioinformatics pipeline that we call Domain Analysis of Symbionts and Hosts (DASH).
HHV-8 is the causative agent of Kaposi's sarcoma, the most common AIDS-associated cancer [14], and it has also been associated with primary effusion lymphoma [15] and multicentric Castleman's disease [16]. Because many studies have documented a cellular origin for many genes in HHV-8 [2,13], HHV-8 provides an ideal model virus for validating our approach. This article therefore scrutinizes HHV-8 more closely than the other fifty-one viruses. Our comparative genomics survey includes DNA and RNA viruses of various genome sizes and transcription strategies, thereby providing a snapshot of the prevalence of functional similarities across a representative set of viruses impacting human health. The results for the fifty-two viruses surveyed here validate the methodology and show that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems.

Methods
DASH's computational pipeline
DASH compares each of the genes in a pathogen genome against a local collection of protein domain families (Figure 1). DASH performs the sequence comparisons using HMMER's hmmscan v. 3.0 [17] against all the HMM profiles in the Pfam-A subset of Pfam v. 26 [18]. Pfam-A features 13,672 protein domain models, making it a relatively exhaustive repository, one manually built by experts from representative sets of sequences. DASH records all significant similarities to the Pfam models for each of the pathogen's genes (Figure 1). A parallel analysis also functionally annotates the host genome (Figure 1). DASH distinguishes the protein domains exclusive to the host and pathogen from those shared between them (Figure 1, Figure 2).
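The final step of the pipeline, separating shared domains from exclusive ones, amounts to set operations on the domain annotations of the two genomes. The following is a minimal sketch of that idea, not DASH's actual implementation: the function name `classify_domains`, the input layout (protein name mapped to a set of Pfam accessions), and the placeholder accessions `PF_A` to `PF_D` are all assumptions made for illustration.

```python
from typing import Dict, Set


def classify_domains(
    host: Dict[str, Set[str]], pathogen: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Split domain accessions into shared, host-exclusive and pathogen-exclusive sets.

    Each argument maps a protein name to the set of domain models that the
    protein matched significantly (hypothetical input layout).
    """
    # Pool the annotations of every protein in each genome.
    host_domains = set().union(*host.values())
    pathogen_domains = set().union(*pathogen.values())
    return {
        "shared": host_domains & pathogen_domains,
        "host_exclusive": host_domains - pathogen_domains,
        "pathogen_exclusive": pathogen_domains - host_domains,
    }


# Placeholder annotations; these are not real Pfam accessions or real proteins.
host_annotations = {"host_protein_1": {"PF_A", "PF_B"}, "host_protein_2": {"PF_C"}}
virus_annotations = {"viral_protein_1": {"PF_B"}, "viral_protein_2": {"PF_D"}}

result = classify_domains(host_annotations, virus_annotations)
```

In this toy input, only `PF_B` is annotated in both genomes, so it is the single shared domain, while `PF_D` is exclusive to the pathogen.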

Detecting functional similarities between the human and viral genomes
DASH compared eighty-six proteins in the HHV-8 genome (reported by NCBI's RefSeq viral genome collection as of May 11, 2011) against the 13,672 Pfam-A v. 26 models. DASH also analyzed 226,230 human proteins from NCBI's NR database (release date: February 2, 2012). The reported similarities for searches in both genomes were considered significant under a threshold E-value < 1e-3 [17] after applying a Bonferroni multiple-test correction. The E-values in this study were multiple-test corrected to account for multiple comparisons (number of Pfam domain models times the number of proteins in the relevant organism, host or pathogen). When DASH matches a sequence to a Pfam domain model, it reports domain coverage, the fraction of the length of the domain model matched. To avoid spurious short sequence matches, this study reports a match only if the corresponding domain coverage exceeds 0.50. The same protocol described for the functional analysis of the human and HHV-8 genomes was applied to analyze the eleven viruses in the comparative genomics survey and the forty-one viruses in the examination of the order Mononegavirales. The fifty-two viruses were extracted from the RefSeq viral genome collection of May 11, 2011.

Results and discussion
DASH: An automated system for the whole-genome detection of functional similarity in host-symbiont systems
DASH automates the functional characterization of host and symbiont genomes through genome-wide profile-based similarity searches. By using HMMER's hmmscan, DASH compares all pathogen proteins against all Pfam domain models, thereby generating a list of putative functional annotations. Likewise, the pipeline functionally annotates the host proteome in parallel (Figure 1). All DASH output relevant to this paper is publicly available at http://tinyurl.com/spouge-dash.
DASH allows users to compare the functional annotations of two genomes simultaneously to distinguish shared protein domains from domains exclusive to each organism (Figure 2). The user can input specific scoring parameters or allow the system to provide statistical guidance for choosing appropriate scoring thresholds (Figure 2A). In its current prototype, DASH allows the analysis of fifty-two reference viral genomes. Plans for future releases include modifications to allow users to analyze custom sequences (e.g. different isolates of the same virus for intra-viral comparison).
The DASH output on the web classifies the functions into shared and pathogen-exclusive functions (Figure 2B). If DASH marked functional annotations on the pathogen as shared domains, the DASH output can be expanded to display information about their functional counterparts in the host (Figure 2B). In addition, the site has been linked to other in-house (e.g. NCBI taxonomy, NR) and external analysis resources (e.g. Pfam) to facilitate the exploration and functional characterization of pathogenic sequences.
Of the twenty-four HHV-8 genes DASH identifies as cellular homologs, fourteen have immunomodulatory functions and ten have metabolic functions. DASH misses the putative homology between ORF63 and human NLRP that Gregory et al. reported as having a blastp E-value = 2e-4 [27]. Although we attempted to reproduce the results of Gregory et al. with DASH and blastp, we were unable to find significant similarities between ORF63 and NLRP. It appears that the marginal E-value observed by Gregory et al. was obtained using blast2seq, which compares a single query against only a single subject sequence. Blast2seq calculates E-values using a database length equal to the length of only one sequence (as opposed to using the length of thousands of sequences). Thus, the difference in our E-values is likely the result of differing database sizes, and therefore dependent on the relevant multiple-test correction. Gregory et al. do adduce experimental evidence to validate the functional similarities between the two proteins, however.
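The dependence of an E-value on database size can be made concrete with the Karlin-Altschul relation E = K m n exp(-lambda S), in which, for a fixed query and alignment score, E is proportional to the search-space length n. The sketch below rescales an E-value from one database length to another under that proportionality. It ignores edge-effect (effective length) corrections, and both lengths used in the example are hypothetical round numbers chosen for illustration, not values taken from Gregory et al. or from this study.

```python
def rescale_evalue(evalue: float, old_db_length: int, new_db_length: int) -> float:
    """Rescale an E-value to a different database length.

    Assumes E is directly proportional to database length for a fixed query
    and score, and ignores effective-length corrections.
    """
    if old_db_length <= 0 or new_db_length <= 0:
        raise ValueError("database lengths must be positive")
    return evalue * new_db_length / old_db_length


# E-value of 2e-4 from a pairwise comparison against a single subject sequence
# (hypothetical length: 1,000 residues), rescaled to a hypothetical database
# of 100 million residues.
pairwise_evalue = 2e-4
rescaled = rescale_evalue(pairwise_evalue, old_db_length=1_000, new_db_length=100_000_000)
```

Under these assumed lengths, the rescaled E-value is about 20, far from significant, which illustrates how a marginal pairwise E-value can vanish in a database-wide search.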
When coupling computational functional analyses on HHV-8 with manual searches of the literature, it was reassuring to learn that our results agreed with previous experimental reports. Computational methods alone were able to repeat previous findings of functional similarity identified between HHV-8 and human, thereby validating our approach.
DASH also detected that 51/86 genes in HHV-8 have functions exclusive to the virus (Table 2). DASH confirmed previously published annotations for 30/51 HHV-8 genes as viral structural/metabolic proteins. The remaining 21/51 genes in Table 2 are viral-exclusive genes displaying conserved domains with unknown functions.
Because the similarities listed in Table 1 and Table 2 have highly significant E-values, they are likely true homologs. Protein domain sequence similarities at E-value < 1e-03 are widely accepted as being significant [17]; indeed, scores approximating this threshold are expected in cases of distant similarities. Thus, the homologies reported here become more compelling still, given that even after the multiple-test correction, for most of the proteins the E-values are much smaller than the accepted 1e-03 threshold. Table 1 lists 21/24 shared protein domains with multiple-test-corrected E-values ranging from 2.1e-141 to 1.1e-3 in HHV-8 (and 7.1e-211 to 1.4e-04 in human). Because of multiple-test correction, however, 3/24 HHV-8 homologs fall below accepted thresholds of similarity. Table 2 shows fifty-one viral-exclusive genes with E-values ranging from 0.0 to 5.7e-15. As additional evidence of the reported functional similarities, ninety-five percent of the homologs shown in Table 1 and Table 2 have a domain coverage > 0.75 (lowest overall domain coverage = 0.65). The high fraction of domain coverage suggests that the domains are largely complete and are thus functional instances (i.e. working, as opposed to degraded, copies) of the domain.
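The two reporting criteria used in this study (a multiple-test-corrected E-value < 1e-3 and a domain coverage > 0.50) can be sketched as a simple filter. This is an illustration under stated assumptions rather than DASH's code: it assumes that a per-comparison p-value is available for each hit and that coverage is computed from the matched span of the domain model. The `DomainHit` record, its field names, and the example hits are hypothetical; only the thresholds, the number of Pfam-A models, and the number of HHV-8 proteins come from the text.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    protein: str
    pfam_model: str
    p_value: float      # per-comparison p-value (assumed to be available)
    hmm_from: int       # first matched position in the model (1-based, inclusive)
    hmm_to: int         # last matched position in the model (inclusive)
    model_length: int   # total length of the domain model


def corrected_evalue(hit: DomainHit, n_models: int, n_proteins: int) -> float:
    """Bonferroni-style correction: scale by the number of comparisons made."""
    return hit.p_value * n_models * n_proteins


def domain_coverage(hit: DomainHit) -> float:
    """Fraction of the domain model's length covered by the match."""
    return (hit.hmm_to - hit.hmm_from + 1) / hit.model_length


def is_reported(hit: DomainHit, n_models: int, n_proteins: int,
                max_evalue: float = 1e-3, min_coverage: float = 0.50) -> bool:
    """Apply both thresholds described in the Methods."""
    return (corrected_evalue(hit, n_models, n_proteins) < max_evalue
            and domain_coverage(hit) > min_coverage)


N_MODELS = 13_672        # Pfam-A v. 26 models (from the text)
N_HHV8_PROTEINS = 86     # HHV-8 proteins analyzed (from the text)

# Hypothetical hits for illustration only.
hits = [
    DomainHit("viral_protein_1", "PF_A", 1e-12, 11, 180, 200),  # passes both tests
    DomainHit("viral_protein_2", "PF_B", 1e-8, 1, 150, 200),    # fails after correction
    DomainHit("viral_protein_3", "PF_C", 1e-20, 1, 40, 200),    # coverage too low
]
reported = [h for h in hits if is_reported(h, N_MODELS, N_HHV8_PROTEINS)]
```

With 13,672 models and 86 proteins, the correction factor is 1,175,792 comparisons, so a raw value of 1e-8 no longer meets the 1e-3 threshold, whereas 1e-12 still does.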

DASH identifies homologies between humans and both the DNA and RNA viruses
To measure the prevalence of protein domain similarities between human and various types of viruses, we used DASH to survey the eleven viruses in Table 3, which have implications for human health. To make the survey as broad as possible, the viruses in Table 3 were selected to have disparate genome sizes and to represent varying classes of viruses (five DNA viruses and six RNA viruses; four retro-transcribing viruses and seven non-retro-transcribing viruses). The present study confirms previous reports of cellular homologies in the DNA viruses (detailed lists of the similarity hits in Table 3 can be found at http://tinyurl.com/spouge-dash). DASH detects human homologs in all but one of the DNA viruses tested (4/5), both large and small. In the DNA viruses, the percent of viral genome shared with the host ranges from 0% in Human parvovirus B19 to 29% in HHV-8. Thus, HHV-8 appears to share a relatively large percentage of its genome with its host, compared to the rest of the DNA viruses examined.
The retro-transcribing RNA viruses in Table 3 displayed larger fractions of functional similarities to humans than any of the DNA viruses. The retroviruses analyzed share 30% (HTLV-2) to 55% (HTLV-1) of their genomes with their human host. The non-retro-transcribing RNA viruses in Table 3 showed no evidence of homology to humans, however.
Sequence similarity methods such as the one used by DASH cannot alone assign a definitive directionality to gene transfer between a pair of organisms with similar proteins. Nonetheless, DASH provides useful evidence suggesting similar directionality tendencies for the human DNA and RNA retro-transcribing viruses as those identified previously with other methods.
Most of the domain similarities DASH identifies in the DNA viruses feature domains involved in cellular (as opposed to viral) processes (e.g. apoptosis regulation, cytokine signaling), in accord with the notion that cellular homologs in the DNA viruses are the result of molecular mimicry by the viruses to subvert the host immune system [3,28-30]. In contrast, the annotations on the homologies between RNA retro-transcribing viruses and humans suggest retroviruses as donors of the shared genetic material, because most of the domains common to both allude to viral functions (e.g. integrase, retroviral aspartyl protease). The predominance of retrovirus-to-host transfer is consistent with the knowledge that endogenous retroviral genes constitute 7-8% of the human genome [31].

Homologies between humans and non-retro-transcribing RNA viruses
Non-retro-transcribing RNA viruses have no obvious means of capturing DNA from their host. To examine horizontal gene transfer in a non-retro-transcribing RNA viral order and the corresponding hosts, we used DASH to analyze all the human mononegaviruses in the NCBI repository.
DASH detects homologs between 39/41 mononegaviruses and human (Figure 3 and Table 4). The percent of viral genome shared with the host in the thirty-nine mononegaviruses ranged from 8% (Pneumonia virus of mice) to 75% (Hendra virus), with a median of 44%. Based on the results of Table 4, the median non-retro-transcribing RNA virus shows higher fractions of homologous host genes than the median DNA virus (11%) or median retrovirus (41%) from Table 3. Figure 3 maps the homologies DASH identifies between humans and thirty-nine mononegaviruses onto the mononegaviral species tree. Several of the homologous domains in Figure 3 are present in complete taxa. For instance, PF12803 is present in all the Paramyxovirinae, while PF14314 is conserved in all the Rhabdoviridae. DASH also confirmed the human homologs in Bornaviridae and Filoviridae identified by [32,33] as showing non-retro-transcribing RNA viruses as contributors to the make-up of vertebrate genomes (Figure 3 and Table 5).
Inferring homology from significant sequence similarity has been a routine bioinformatics practice since the 1990s. Homology can be reliably inferred for proteins sharing statistically significant sequence similarity [34], permitting inferences about the structure and function of unknown molecules with characterized homologs. Sequencing and annotation errors, however, can mislead homology inference.
We identified an example of DASH's susceptibility to annotation errors in the putative homologies it reported for some mononegaviruses. DASH identified two human proteins (AngRem104 and AngRem52) as significantly similar to the F, M, V, and P genes of the paramyxoviruses in the order Mononegavirales (Table 5). A 2003 study annotated AngRem104 and AngRem52 in the public databases as the products of two human genes upregulated by Angiotensin II in mesangial cells [35]. Yet, later studies demonstrated that AngRem104 and AngRem52 were actually proteins coded by two new paramyxoviruses [36,37]. Thus, it appears that AngRem104 and AngRem52 are not human, but viral homologs of paramyxoviruses. Table 5 flags the domains that are unreliable because of possible sequence misannotations.
We also caution DASH users to consider E-values, domain coverage, and the number of similarity hits on the host and virus when evaluating putative homologies. For instance, the human homologs of mRNA methyltransferase (PF12803) in Table 5

DASH reports a subset of non-retrotranscribing viruses with no homology to humans
The apparent lack of homology between human genes and those of some non-retro-transcribing viruses (Table 3 and Table 4) has several possible causes. Perhaps a molecular mechanism peculiar to their unique biology has prevented gene transfer between humans (or their ancestors) and non-retro-transcribing viruses. Although no clear evidence of recombination has been detected for non-retro-transcribing viruses [38], vertebrate genomes have endogenized some non-retro-transcribing viral elements [32,33]. Table 4 and Table 5 therefore suggest that humans (or their ancestors) might have endogenized elements from the majority of human mononegaviruses.
Although most (if not all) non-retro-transcribing viruses lack a mechanism to integrate host genes, our methods did detect homologies between humans and mononegaviruses. Our methods cannot speak directly to the mechanism by which a homology is present or absent, although undetected homologies might always be a consequence of excessive sequence divergence. The genes of the non-retro-transcribing viruses for which no human similarity was evident may simply have evolved too far for our methods to detect the corresponding homology. Significant sequence similarity indicates homology; but lack of sequence similarity does not rule it out.

Our approach to the detection of functional similarities in host-pathogen interactions
DASH provides a framework to automate the sequence analysis of the complete genomes of two interacting species (e.g. a host and a pathogen), because it does not depend on curating or creating multiple alignments of orthologous genes to identify homology. Instead, the multiple alignments are implicit in the functional domains modeled by Pfam.
DASH identifies similarities between the pathogen and host sequences at the protein domain level. By targeting domain-based similarities, the sequence searches gain sensitivity, particularly in cases where only part of the protein is conserved. Consider, e.g., the cytokine receptors acquired by poxviruses. The cellular homologs of the tumor necrosis factor receptor and of the gamma-interferon receptor in myxoma virus, as well as the IL-1 receptor homolog in vaccinia, lack the membrane anchor and the cytoplasmic signaling domains. By coding only the ligand-binding domain, the viral homologs remain soluble, which, in turn, increases virulence since each can bind and neutralize the corresponding host cytokine, preventing the cellular receptors from delivering an antiviral signal [29]. Domain-restricted homology in viruses is therefore commonplace, particularly when the cellular homolog is a receptor or membrane-bound protein [29]. Domain-based sequence searches are particularly likely to detect these domain-restricted homologies, especially in cases of divergent similarities. Previous experience informed our decision to use HMMER, instead of other profile-based sequence comparison methods such as PSI-BLAST. HMMER has been shown to be less susceptible to profile corruption, tends to have a higher sensitivity, and its programs are more amenable to searching against Pfam models [39]. In addition, by searching against curated functional models instead of building them iteratively, DASH takes advantage of the transitivity of homology to identify more divergent similarities. DASH does not attempt to establish homology between the symbiont and the host sequence directly. Rather, by transitivity, it identifies whether two genes, one from a symbiont and one from a host, share homology to a common HMM profile before it reports them as functional similarity candidates.
In contrast, methods like PSI-BLAST first search a comprehensive protein sequence library. PSI-BLAST then builds a profile progressively over several search iterations, attempting to create an alignment phylogenetically diverse enough to detect host proteins homologous to a viral query protein, or vice versa. By taking advantage of the transitivity of homology, DASH exploits pre-computed, curated sequence alignments, identifying distant functional similarities more systematically. See Additional file 1: Table S1 for a comparison of the results obtained with PSI-BLAST when searching with the viral genes in Table 1.
Thus, the present study might have been successful because it avoids ad hoc iterative homology searches. In iterative homology searches, if a viral sequence is too distant from the other members of its functional family, an iterative profile might lack sensitivity because it recruits too few close homologs of the viral sequence or does not weight them heavily enough. In contrast, experts have manually built Pfam profiles from representative sets of sequences, so its profiles are usually weighted evenly across the phylogenetic tree of the functional family. Moreover, Pfam features approximately 14,000 different functions, making it a relatively exhaustive repository of functional domains.
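The transitivity argument in this section can be sketched as a join on the profile identifier: a symbiont gene and a host gene become a candidate pair only when both match the same HMM profile. The function `candidate_pairs`, the (gene, profile) tuple layout, and the gene and profile labels below are hypothetical and serve only to illustrate the idea; they do not describe DASH's internal data structures.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple


def candidate_pairs(
    symbiont_hits: Iterable[Tuple[str, str]],
    host_hits: Iterable[Tuple[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Pair symbiont and host genes that match a common profile.

    Each hit is a (gene, profile) tuple that already passed the significance
    thresholds. Returns a mapping: profile -> sorted (symbiont gene, host gene) pairs.
    """
    symbiont_by_profile = defaultdict(set)
    host_by_profile = defaultdict(set)
    for gene, profile in symbiont_hits:
        symbiont_by_profile[profile].add(gene)
    for gene, profile in host_hits:
        host_by_profile[profile].add(gene)

    pairs = {}
    # Only profiles matched in both genomes yield candidates (transitivity).
    for profile in symbiont_by_profile.keys() & host_by_profile.keys():
        pairs[profile] = sorted(
            product(symbiont_by_profile[profile], host_by_profile[profile])
        )
    return pairs


# Placeholder hits; gene and profile names are illustrative only.
viral = [("viral_gene_A", "PF_X"), ("viral_gene_B", "PF_Y")]
human = [("human_gene_1", "PF_X"), ("human_gene_2", "PF_X"), ("human_gene_3", "PF_Z")]
pairs = candidate_pairs(viral, human)
```

Here `viral_gene_A` is paired with both human genes that match `PF_X`, although the viral and human sequences were never compared with each other directly.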

Limitations of our approach
DASH requires sequenced genomes for both organisms in the analysis. Although DASH can analyze partial sequences, the resulting coverage will depend on the quantity and quality of sequences available. Fortunately, sequence data is widely available and, often, it is the only information available for newly identified organisms.
Likewise, the availability and phylogenetic breadth of the domain models in Pfam can limit DASH's approach: if the HMM profiles searched do not include a given function, the method cannot indicate the corresponding functional similarity. Similarly, the method will experience limitations anytime the given HMM does not represent the organisms adequately (e.g. when the model is not phylogenetically broad enough to include them). But, we expect these limitations to be neither frequent nor severe, especially since DASH uses Pfam, which is an extensive database. Missing functions should be the exception and not the norm.
As for any method relying on sequence data, DASH is susceptible without much warning to sequencing and annotation errors. Table 4 and Table 5, e.g., show an annotation error in the analysis of the mononegaviruses. Public sequence databases seldom correct such annotation errors. Therefore, investigators should consider the similarities identified by DASH as helpful but still controvertible evidence of putative homologies, meriting further investigation as biological interest indicates.

Conclusions
The present article described a genome-wide survey of protein domain similarity between an arbitrary but representative set of viruses and their human host. As a proof of concept, we used DASH to analyze the homologies between HHV-8 and human, which have been extensively documented in the past two decades. Several of the HHV-8 proteins have been reported as being of cellular origin, which DASH confirmed. Our work also confirmed functional similarities between human and both the DNA and RNA viruses, with viral genomes of various sizes and regardless of transcription strategies. Our examination of the order Mononegavirales confirmed that retroviruses have not been the only RNA viruses donating genetic material to cellular genomes. DASH also provided supporting evidence that non-retro-transcribing RNA viruses have contributed endogenized elements to the human genome.
The fractions of homologs between humans and the fifty-two viruses reported here are likely underestimates of the actual fractions. This study analyzed only the protein-coding regions of the genomes. Quite possibly, transfer of non-coding genes (e.g. non-coding functional RNA (rRNA, tRNA), cis-regulatory elements, etc.) may also have occurred.
In all likelihood, the proteins shared by viruses and their hosts today have been acquired through horizontal gene transfer (HGT) at some point in the past. Our functional analyses of the fifty-two human viruses suggest that genetic transfers from host to virus seem to have been predominant in the DNA viruses. Our results also show that among the viruses, the RNA viruses have been predominant donors of genetic material to the host regardless of viral transcription strategy. Our remarks on directionality are based on the annotations on the homologies, which confirm reports published elsewhere (e.g. [5]).
Sequence similarity methods can suggest cases of HGT, but determination of HGT directionality is more difficult to automate, because it requires a remarkably detailed phylogenetic investigation in large part directed by human input. Yet, the usefulness of a sequence-based method like DASH is in its ability to scan large amounts of data to streamline the list of protein candidates for further phylogenetics or experimental characterization. Moreover, DASH provides a platform to analyze the complete genomes of two interacting species. The analysis identifies common domains as well as those exclusive to each organism.
The current version of DASH, a prototype, allows the analysis of fifty-two reference viral genomes. The results we have shown here validate the methodology and show the potential of the pipeline to analyze other host-symbiont systems rapidly. In principle, the DASH pipeline can annotate new genomes or characterize different isolates of the same virus. DASH output can also augment the analysis of host-pathogen interaction or coevolution data. In addition to detecting functional similarities, DASH provides sets of possibly orthologous genes for phylogenetic analysis and evolutionary gene reconstruction. DASH can also generate lists of genes and proteins potentially unique to the virus to aid rational drug design. For instance, if a peptide-binding site were unique to a virus, the design of peptide drugs would then avoid an autoimmune response in the host. Knowledge of the genetic material captured or donated by pathogens should give insights into the etiology of the diseases they cause and help inform effective drug and vaccine design.