Viral pathogens have been a major cause of epidemics worldwide [1–3]. In many viral epidemics, routine surveillance systems fail to identify the virus directly unless guided by telltale clinical features and complications [4, 5]. In many cases the disease-causing pathogen is never identified [6], which contributes to inadequate and inappropriate management of the condition. Many disease outbreaks are also caused by changes in the environmental niche [7], and at least in some cases these changes lead to the emergence of new pathogens [6, 8–11]. It is estimated that at least 33 new pathogens have emerged during the last three decades [12]. Identifying the causative organism during a viral outbreak remains difficult, primarily because of the tedious methods involved in isolating and culturing the pathogen [13, 14]. The relatively low concentration of viral genetic material, especially in the case of RNA viruses, together with heavy interference from host cell nucleic acids and other metagenomes, makes sequencing-based approaches ineffective unless performed at high depth [15, 16].
The availability of next-generation sequencing (NGS) technology has made it possible to address biological questions from a genomic perspective at large scale and with relative ease [17, 18]. The throughput of NGS permits deep sequencing of nucleic acids, yielding enough reads from the pathogen even when interference from host genetic material is very high. Metagenomics has been one of the major applications of NGS technology for understanding the composition and dynamics of mixed populations of organisms [19]. The field has now grown into a vibrant area of genomics that seeks to understand a wide spectrum of environmental niches, from the human body in both diseased and healthy states to natural geographical niches [20, 21]. Although NGS technology has been used successfully to address a large diversity of biological questions, its large-scale application to bio-surveillance and emerging infectious diseases has been limited [22], in spite of the unprecedented opportunity offered by its scale and speed of operation [17].
The hologenome, a term borrowed from evolutionary biology, is defined as the sum of the genetic information of the host and its microbiota [23]. Hologenomics is an emerging field in genomics that deals with mixed populations of genomes, such as the interacting populations of hosts, pathogens and commensals. The hologenome differs from the widely used term metagenome, which relates to the study of communities of microbes directly in their natural environments [24].
Cultured populations of viruses co-exist and interact intricately with their host genomes [23]. Although this presents a technical challenge in isolating individual genomes from mixed populations, it offers an enormous opportunity to understand the interactions and dynamics between the genomes in real time.
Here we report a sequencing and analysis methodology that uses computational algorithms for reference mapping and de novo sequence assembly to accurately identify viral pathogens in mixed populations of genomes. The pipeline relies on the specificity of sequence mappings and the differential distribution of the mapped reads across genomes. As a proof of concept, we applied the methodology to a cell culture hologenome consisting of human, bacterial and viral genomes, and could specifically identify the viral pathogen. This methodology could potentially be applied for rapid and specific identification of viral pathogens during epidemic outbreaks.
Sample collection and RNA isolation
The sample was collected during an epidemic of acute viral encephalitis from an anonymous patient suffering from fever and acute encephalitis at Baba Raghav Das (BRD) Medical College and Nehru Hospital, Gorakhpur, India. The sample was procured and processed as per the ethical procedures laid down by BRD Medical College, Gorakhpur, India and the National Institute of Virology, Pune, India. Using standard virus isolation protocols, samples were inoculated into human rhabdomyosarcoma (RD) and baby hamster kidney (BHK) cell lines for virus isolation [25, 26]. Cells were observed for cytopathic effects (CPE) and passaged three times. The cell culture supernatant was filtered through 0.22 μm Millipore filters at every passage. RNA was isolated using the QIAamp Viral RNA Mini Kit (Qiagen) as per the manufacturer's instructions and eluted in 60 μl of AVE buffer.
Library preparation, sequencing and genome assembly
The RNA library was prepared according to the manufacturer's instructions using the RNA Sample-prep kit (Illumina Inc, USA) for sequencing on the Illumina sequencing platform. Two micrograms of total RNA were fragmented using divalent cations. The cleaved RNA was converted to cDNA using reverse transcriptase (SSRT-II, Invitrogen) and random primers. The fragments were then subjected to second-strand cDNA synthesis using DNA polymerase as per the manufacturer's instructions. End repair, followed by A-base addition and adapter ligation, was then performed on the cDNA fragments. Products of approximately 350 base pairs were separated by gel excision and enriched by PCR to create the final library.
Clusters were generated on the flow cell using the cBot Paired end cluster generation kit (Illumina Inc, USA) as per the manufacturer's instructions. The sequencing runs were performed on an Illumina sequencing platform (Illumina Inc, USA) using 76 × 2 base reads. The sequence-quality files generated were converted to Sanger quality scores using custom scripts.
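The custom conversion scripts are not described in the text. The following is a minimal sketch of one common form of this conversion, assuming the reads carry Phred scores with an ASCII offset of 64 (as produced by Illumina pipeline versions 1.3 to 1.7); older Solexa-scaled scores would additionally need a logarithmic transformation, which is not shown. The function names are hypothetical.

```python
def illumina64_to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Assumption: the input uses Phred scores with ASCII offset 64.
    """
    converted = []
    for ch in qual:
        q = ord(ch) - 64  # recover the numeric Phred score
        if q < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(q + 33))  # re-encode with the Sanger offset
    return "".join(converted)


def convert_fastq(lines):
    """Yield FASTQ lines, re-encoding the 4th line of every record.

    Assumes the simple 4-line-per-record FASTQ layout.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield illumina64_to_sanger(line) if i % 4 == 3 else line
```

For example, a quality string of `hhhh` (Phred 40 at offset 64) becomes `IIII` at offset 33.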
The paired-end reads were mapped to the reference datasets using the Mapping and Assembly with Qualities (MAQ) software [27]. Datasets of 3735 viral genomes, 2352 bacterial genomes and the human genome corresponding to the GRCh37/hg19 build were downloaded from NCBI [28] and used for mapping. The mapped reads were then analyzed and compared to identify reads that overlapped between the reference sets. The genomes to which the maximum number of reads mapped after alignment were parsed using custom scripts and considered for further analysis. Single nucleotide variations and insertion-deletion (InDel) events were called using MAQ scripts. Mapping and functional analysis of the variations were performed using custom scripts. The entire pipeline for data generation and analysis is summarized in Figure 1.
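The read-counting step can be illustrated with a short sketch. It is not the authors' script: it assumes the alignments have been exported as tab-delimited text with the read name in column 1, the reference name in column 2 and the mapping quality in column 7 (the layout assumed here for `maq mapview` output), and the `min_mapq` threshold and all function names are hypothetical.

```python
from collections import Counter


def count_reads_per_reference(lines, ref_col=1, mapq_col=6, min_mapq=0):
    """Count mapped reads per reference sequence.

    Column indices are 0-based and are assumptions about the input table.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(ref_col, mapq_col):
            continue  # skip malformed or truncated rows
        if int(fields[mapq_col]) >= min_mapq:
            counts[fields[ref_col]] += 1
    return counts


def top_references(counts, n=5):
    """Return the n references that attracted the most reads."""
    return counts.most_common(n)


def shared_reads(reads_by_set):
    """Return read names that mapped in more than one reference set.

    reads_by_set maps a set label (e.g. 'viral') to a collection of read names.
    """
    seen = Counter()
    for names in reads_by_set.values():
        seen.update(set(names))
    return {name for name, n in seen.items() if n > 1}
```

Ranking references by read count and flagging reads shared between the human, bacterial and viral sets gives a simple view of how the mapped reads are distributed across genomes.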
Velvet [29], a widely used de novo assembly algorithm based on de Bruijn graphs, was used for the de novo assembly. The entire read dataset was partitioned into smaller subsets for analysis. De novo assembly was attempted on the subsets with different k-mer lengths. The results were compiled and compared using in-house scripts.
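The partitioning and k-mer sweep can be sketched as follows. The subset size, the k-mer values, the output directory naming and the file names are all hypothetical, since the text does not state them; the sketch only builds the `velveth` and `velvetg` command lines for an interleaved paired-end FASTQ subset and does not run them.

```python
from itertools import islice


def partition(records, chunk_size):
    """Split an iterable of read records into consecutive subsets."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def velvet_commands(subset_fastq, kmers, out_prefix="asm"):
    """Build velveth/velvetg command lines for each k-mer length.

    Assumptions: reads are interleaved paired-end FASTQ; k values are odd,
    as Velvet expects; out_prefix is a hypothetical naming scheme.
    """
    cmds = []
    for k in kmers:
        if k % 2 == 0:
            raise ValueError("Velvet expects odd k-mer lengths")
        outdir = f"{out_prefix}_k{k}"
        # velveth builds the k-mer hash table for this subset
        cmds.append(["velveth", outdir, str(k), "-fastq", "-shortPaired", subset_fastq])
        # velvetg builds and traverses the de Bruijn graph
        cmds.append(["velvetg", outdir])
    return cmds
```

Each command list could be passed to `subprocess.run` once Velvet is installed; the resulting contig sets can then be compared across subsets and k-mer lengths.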
RT-PCR validation
Experimental validation of the virus was performed using reverse transcription polymerase chain reaction (RT-PCR). Primers specific for Japanese encephalitis virus (JEV) were used: JEV PM1R (reverse primer), 5'-CGGARTCTCCTGCTTCGCTTGG-3', and JEV C1F (forward primer), 5'-GGCAGAAAGCAAAACAAAAGA-3'. RNA was first reverse transcribed with JEV PM1R at 50°C using Superscript II reverse transcriptase (Invitrogen, Life Science Technologies). The cDNA was then amplified using the forward primer JEV C1F and the reverse primer JEV PM1R. PCR amplification consisted of an initial denaturation at 94°C for 5 min, followed by 35 cycles of 94°C for 30 sec, 58°C for 30 sec and 72°C for 1 min, and a final extension of 3 min, using Taq DNA polymerase (Fermentas).
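The reverse primer contains the IUPAC ambiguity code R (A or G), so it represents two concrete sequences. As an illustration only, and not part of the reported protocol, the sketch below expands a degenerate primer into its concrete variants and computes a reverse complement, which is useful when checking primers against assembled contigs. The function names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(table)[::-1]


variants = expand_primer("CGGARTCTCCTGCTTCGCTTGG")
# -> ['CGGAATCTCCTGCTTCGCTTGG', 'CGGAGTCTCCTGCTTCGCTTGG']
```

Searching a contig for the forward primer and for the reverse complement of each reverse-primer variant indicates where the amplicon would be expected to lie.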