Clonorchis sinensis is the human liver fluke of the class Trematoda (phylum Platyhelminthes: Digenea). Humans are infected by consuming raw or inadequately cooked freshwater fish containing C. sinensis metacercariae. Clonorchiasis is a common infectious disease in many eastern Asian countries, including Korea, China, Japan and Vietnam. It is estimated that clonorchiasis affects approximately 35 million people worldwide [1] and that more than 600 million people are at risk of infection in East Asia and Eastern Europe [2].
Metacercariae of C. sinensis excyst in the small intestine, and the juvenile worms migrate up through the ampulla of Vater and the common bile duct [3]. The livers of patients with clonorchiasis appear almost normal in light infections, but slightly dilated and thickened peripheral bile ducts are present in heavy infections. Patients with clonorchiasis persistently suffer from fatigue, jaundice, abdominal distress and indigestion [4]. Chronic infection can cause several hepatobiliary disease manifestations, such as cholangitis, cholecystitis, cholelithiasis, hepatomegaly and fibrosis of the periportal tract [5–7]. Although C. sinensis is officially recognized as a Group 1 biological carcinogen in humans by the International Agency for Research on Cancer (IARC) and the World Health Organization (WHO) [8], the molecular and cellular biology of C. sinensis remains significantly underexplored because of the lack of genomic database resources from well-isolated full-length cDNAs.
Next-generation sequencing platforms such as GS 454 [9], Solexa [10] and SOLiD [11] are revolutionizing molecular biology by generating hundreds of thousands of sequencing reads in parallel. The genome and transcriptome sequences of a growing number of model organisms have been published in recent years, yielding new insights for parasite research [12, 13]. However, de novo large-scale sequencing of a non-model parasite is still a laborious and challenging task. Expressed sequence tags (ESTs) are a cost-effective alternative and a powerful tool that provides substantial information about functional proteins. In particular, ESTs from a full-length cDNA library allow researchers to isolate and clone genes, and they provide source material with which many intriguing biological issues can be addressed.
The aim of this study is to provide an important database resource for the characterization and understanding of the functional genes of C. sinensis. Here, we constructed and describe ClonorESTdb, a web-based EST database resource with systematic functional annotation that comprises 55,736 high-quality ESTs from three full-length enriched cDNA libraries. The ESTs were assembled into 13,305 C. sinensis Assembled EST sequences (CsAEs), comprising 6,497 clusters and 6,808 singletons. The CsAEs were then annotated by aligning them against the non-redundant NCBI NR database, UniProt and KEGG, and by analysis with InterProScan and Gene Ontology (GO). The ClonorESTdb database described here provides key insights into the differential gene expression of C. sinensis in a range of developmentally relevant conditions.
Database architecture
The ClonorESTdb database runs on a Red Hat Enterprise Linux 5.5 platform with the Apache web server version 2.2. We used the relational Oracle database 11g (standard version) to develop and support an integrated database schema for storing sequence data, preprocessed data and final functional annotation. The web application was implemented with JSP (JavaServer Pages), Java Servlet technology and AJAX. The web interfaces were written in HTML with some JavaScript, and the pages are styled with cascading style sheets (CSS). The database is currently optimized to work best with Microsoft Internet Explorer 8 (optimal resolution 1024 × 800).
Data source
We constructed full-length cDNA libraries (adult, metacercaria and egg) from C. sinensis and generated a large-scale dataset of 60,768 ESTs by 5’-end sequencing of individual clones [14]. All of the raw and cleaned data can be downloaded from the ClonorESTdb database.
The pipeline for constructing the database
In our study, a total of 55,736 C. sinensis EST sequences that were derived from three cDNA libraries (adult, metacercaria and egg) were used to construct the database. To analyze the data, we developed a pipeline for the ClonorEST Project divided into three steps: sequence cleaning, sequence clustering and assembly, and automatic annotation.
Sequence cleaning
Cleaning is an important processing step used to obtain high-quality EST datasets from raw EST sequences. After base calling with Phred [15], the cleaning process used Cross_match (version 0.990329) to mask vector and contaminant sequences; SeqClean (http://seqclean.sourceforge.net/) to eliminate undetermined bases, poly(A) tails and low-complexity elements; and RepeatMasker (http://www.repeatmasker.org/) to remove interspersed repeats, such as SINEs (short interspersed nuclear elements), LINEs (long interspersed nuclear elements), LTRs (long terminal repeats) and DNA elements included in the Repbase repetitive element library (http://www.girinst.org/) [16] (Figure 1A).
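The kind of filtering performed in this step can be illustrated with a minimal Python sketch. It is not the algorithm used by Cross_match or SeqClean; the function name `clean_est`, the length and N-fraction thresholds, and the poly(A) pattern are hypothetical values chosen only for illustration. The sketch assumes that vector-masked bases appear as `X`, which is how Cross_match writes screened regions.

```python
import re

# Hypothetical thresholds for illustration only; these are NOT the
# settings used by SeqClean or by the ClonorESTdb pipeline.
MIN_LENGTH = 100
MAX_N_FRACTION = 0.03
POLY_A_TAIL = re.compile(r"A{10,}$")  # a run of 10 or more A's at the 3' end


def clean_est(seq, min_length=MIN_LENGTH, max_n_fraction=MAX_N_FRACTION):
    """Return the trimmed sequence, or None if the read should be discarded."""
    seq = seq.upper()
    # Trim terminal vector-masked bases (assumed to be written as 'X').
    seq = seq.strip("X")
    # Remove a trailing poly(A) tail, then any terminal undetermined bases.
    seq = POLY_A_TAIL.sub("", seq)
    seq = seq.strip("N")
    # Discard reads that are empty or too short after trimming.
    if not seq or len(seq) < min_length:
        return None
    # Discard reads with too many internal undetermined bases.
    if seq.count("N") / len(seq) > max_n_fraction:
        return None
    return seq
```

For example, a read consisting of masked vector, 120 informative bases and a 15-base poly(A) tail is reduced to the 120 informative bases, whereas a 40-base read is discarded.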
Sequence clustering and assembly
Clustering is a basic step used to collect overlapping EST sequences that originate from the same transcript of a single gene, which reduces redundancy. Assembly then aligns and merges the overlapping EST sequences into a longer sequence to reconstruct a putative full-length transcript. For clustering and assembly, we used TGICL to group the EST sequences and CAP3 to assemble the clustered EST sequences [17, 18] (Figure 1B).
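The idea of grouping overlapping ESTs can be sketched with a simple union-find (single-linkage) procedure in Python. This is an illustration of the concept only, not the implementation used by TGICL or CAP3; the function name `cluster_ests` and the input format (a list of EST identifiers plus a list of pairs already judged to overlap) are assumptions made for the example.

```python
from collections import defaultdict


def cluster_ests(est_ids, overlaps):
    """Group ESTs connected by pairwise overlaps.

    est_ids  -- iterable of EST identifiers (hypothetical input format)
    overlaps -- iterable of (id_a, id_b) pairs judged to overlap
    Returns (clusters, singletons): groups of two or more ESTs, and
    ESTs that overlap with nothing.
    """
    est_ids = list(est_ids)
    parent = {e: e for e in est_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every overlapping pair (transitive closure).
    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for e in est_ids:
        groups[find(e)].append(e)

    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    singletons = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, singletons
```

With five ESTs `e1` to `e5` and overlaps `(e1, e2)` and `(e2, e3)`, the sketch yields one cluster containing `e1`, `e2` and `e3` and two singletons, `e4` and `e5`. In a real pipeline, each cluster would then be passed to an assembler to build a consensus sequence.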
Automatic functional annotation
To improve the accuracy and breadth of the functional annotation, we used several annotation algorithms and public databases. First, we assigned putative functions to the CsAEs based on BLASTN (Query Coverage ≥ 80.0%, Identity ≥ 70.0%, E-value ≤ 1.0e-5), BLASTX and TBLASTX (Match No. ≥ 30 aa, Identity ≥ 25.0%, E-value ≤ 1.0e-5) searches against the GenBank NT and NR databases (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/). Second, metabolic pathways are extremely important for correctly inferring pathogen invasion, host defense, adaptation, the pathogen life cycle and host-pathogen interactions, and identifying parasite-specific metabolic pathways provides a rationale for the development of anti-parasitic drugs and vaccines. To identify these pathways, additional annotations were created against the UniProtKB database (http://www.ebi.ac.uk/uniprot) and the KEGG database (http://www.genome.jp) using BLASTX (Match No. ≥ 30 aa, Identity ≥ 25.0%, E-value ≤ 1.0e-5). All BLAST algorithms were run on a TimeLogic DeCypher system (Active Motif, Inc., http://www.activemotif.com). We also used the InterProScan tool to extract additional functional domains (E-value ≤ 1.0e-4). To better classify the biological functions of the CsAEs, they were analyzed using GO terms in three categories: molecular function, biological process and cellular component. Finally, we used Tandem Repeats Finder (TRF) and AutoSNP to detect structural variations [19, 20] (Figure 1C).
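The BLAST cutoffs listed above can be expressed as a small filter, sketched here in Python. The `Hit` record and its field names are hypothetical and do not correspond to the DeCypher output format; the sketch also assumes that "Match No. ≥ 30 aa" refers to the length of the aligned region in amino acids.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1.0e-5  # E-value threshold stated in the text


@dataclass
class Hit:
    """One BLAST hit; the field names are illustrative assumptions."""
    program: str                 # "BLASTN", "BLASTX" or "TBLASTX"
    identity: float              # percent identity
    evalue: float
    align_len: int = 0           # aligned length in aa (translated searches)
    query_coverage: float = 0.0  # percent of the query covered (BLASTN)


def passes_cutoffs(hit):
    """Apply the thresholds given in the text to a single hit."""
    if hit.evalue > E_VALUE_CUTOFF:
        return False
    if hit.program == "BLASTN":
        return hit.query_coverage >= 80.0 and hit.identity >= 70.0
    if hit.program in ("BLASTX", "TBLASTX"):
        # Assumes "Match No." means the aligned length in amino acids.
        return hit.align_len >= 30 and hit.identity >= 25.0
    raise ValueError(f"unsupported program: {hit.program}")


def filter_hits(hits):
    """Keep only the hits that satisfy the cutoffs."""
    return [h for h in hits if passes_cutoffs(h)]
```

A BLASTX hit with 45% identity over 120 aa at an E-value of 1e-20 would be retained, while a hit with the same identity over only 20 aa would be rejected.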