Genetic region characterization (Gene RECQuest) - software to assist in identification and selection of candidate genes from genomic regions
- Rajani S Sadasivam†1,
- Gayathri Sundar†2,
- Laura K Vaughan†3,
- Murat M Tanik†2 and
- Donna K Arnett†4
© Sadasivam et al; licensee BioMed Central Ltd. 2009
Received: 13 July 2009
Accepted: 30 September 2009
Published: 30 September 2009
The availability of research platforms like the web tools of the National Center for Biotechnology Information (NCBI) has transformed the time-consuming task of identifying candidate genes from genetic studies into an interactive process in which data from a variety of sources are gathered to select likely genes for follow-up. This process presents its own set of challenges, as the genetic researcher has to interact with several tools in a time-intensive, manual, and cumbersome manner. To address these challenges, we developed a method and implemented a software system through a multidisciplinary collaboration between professional software developers and domain experts. The method presented in this paper, Gene RECQuest, simplifies the interaction with existing research platforms through the use of advanced integration technologies.
Gene RECQuest is a web-based application that assists in the identification of candidate genes from linkage and association studies using information from Online Mendelian Inheritance in Man (OMIM) and PubMed. To illustrate the utility of Gene RECQuest we used it to identify genes physically located within a linkage region as potential candidate genes for a quantitative trait locus (QTL) for very low density lipoprotein (VLDL) response on chromosome 18.
Gene RECQuest provides a tool which enables researchers to easily identify and organize literature supporting their own expertise and make informed decisions. It is important to note that Gene RECQuest is a data acquisition and organization software, and not a data analysis method.
Complex genetic diseases arise from common variants acting alone or in combination with other genes or the environment. For the past several years, genetic linkage has been a mainstay for the analysis of these complex diseases. Although there has been some discussion about the future of genetic linkage studies, there is little debate that they have had tremendous success in identifying regions of the genome that contribute to a wide variety of complex phenotypes. However, identifying the gene (or genes) that drives the signal underlying a linkage peak or a region of association remains a significant challenge.
Linkage studies typically identify regions of association that cover 10-30 cM, which can be 10-30 Mb in length and contain up to 300 genes. In order to identify the causal variant, these regions must then be subjected to fine-mapping, where the area under each region is saturated with additional molecular markers, and these markers are then tested for association with the trait in question. This technique can reduce the area under each region, leaving dozens of putative candidate genes per region. These reduced regions are then subjected to sequence analysis to further refine the search area. Numerous sequence variants, both in coding and non-coding regions, can exist within each region. Because a complex trait signal can result from several variants within the same region, each variant must be tested independently for functionality, as must combinations of the variants. With a linkage study of a complex trait, multiple regions are expected to be identified, resulting in several hundred, if not thousands, of putative candidate loci. This necessitates fine-mapping and subsequent sequencing for numerous genomic regions, which could be prohibitively costly and time-consuming [3–6].
In an effort to make this gene selection process more efficient, streamlined and organized, we have developed a software system called GENEtic REgion Characterization Quest (Gene RECQuest). For genes located within a region of interest, Gene RECQuest automatically retrieves and stores the PubMed and OMIM citations using XML and Web Services, and allows for a key phrase-based search using database technologies. To demonstrate its utility, we applied our method to a recently identified linkage peak for VLDL response on chromosome 18.
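The first step described above, selecting the genes that are physically located within a region of interest, can be sketched as follows. This is only an illustrative sketch and not the Gene RECQuest implementation: the `Gene` record, the `genes_in_region` helper, and the gene symbols and positions are hypothetical placeholders, and the 60-120 cM interval on chromosome 18 is borrowed from the example region reported later in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; field names are assumptions for illustration."""
    symbol: str
    chromosome: str
    position_cm: float  # genetic map position in centimorgans


def genes_in_region(genes, chromosome, start_cm, end_cm):
    """Return the genes on `chromosome` located within [start_cm, end_cm]."""
    return [
        g for g in genes
        if g.chromosome == chromosome and start_cm <= g.position_cm <= end_cm
    ]


# Placeholder catalog: symbols and positions are invented, not real annotations.
catalog = [
    Gene("GENE_A", "18", 42.0),
    Gene("GENE_B", "18", 75.5),
    Gene("GENE_C", "18", 118.0),
    Gene("GENE_D", "17", 80.0),
]

candidates = genes_in_region(catalog, "18", 60.0, 120.0)
print([g.symbol for g in candidates])
```

Each gene returned by such a filter would then be the unit for which citations are retrieved and stored.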
Our overall goal was to design and develop a web-based tool that integrates and customizes the existing NCBI web tools to simplify and streamline genetic linkage studies. We approached the problem with a user-centered design and a modified service-oriented architecture approach that is described elsewhere [8–11]. Full instructions for use are available on the software's website, and the key implemented features are as follows:
Integration with Map Viewer
Integration with NCBI Web Services
NCBI Web Services provide a Web Service Description Language (WSDL)-based Application Programming Interface (API) to a collection of Entrez E-utilities. The Gene RECQuest NCBI tools-as-services wrapper uses the NCBI WSDL-API to access and retrieve the necessary information (Gene, OMIM, and PubMed information) from the NCBI database. Because of the amount of data and the usage restrictions set by NCBI, the initial search must be conducted during "off" hours; registration with an email address is therefore used to notify the user when the search is complete. Integration with the NCBI WSDL-based API allows Gene RECQuest to integrate and automate access to data from the NCBI databases. For this purpose, we have developed a Windows service as part of Gene RECQuest that automatically updates the necessary NCBI data in the MySQL database whenever new genes are uploaded to the system. Keeping the MySQL database updated allows the user-friendly search application to use the most recent data available.
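To make the retrieval step more concrete, the following minimal sketch shows how a PubMed search request to the Entrez E-utilities might be composed and how the returned identifier list might be parsed. It is an assumption-laden illustration, not the Gene RECQuest wrapper: it swaps in the URL-based form of the `esearch` E-utility in place of the WSDL interface described above, the helper names are hypothetical, and the sample response contains invented identifiers. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base address of the URL-based Entrez E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Compose an esearch request URL (hypothetical helper, no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(esearch_xml):
    """Extract record identifiers from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [el.text for el in root.findall("./IdList/Id")]


# Sample response in the esearch layout; the IDs are invented placeholders.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>11111111</Id><Id>22222222</Id></IdList>"
    "</eSearchResult>"
)

# GENE_B is a placeholder symbol, not a real gene.
url = build_esearch_url("pubmed", "GENE_B[Title/Abstract]")
print(url)
print(parse_id_list(SAMPLE_RESPONSE))
```

In a setup like the one described, the parsed identifiers would be stored alongside the gene so that later key word searches run against the local database rather than against NCBI.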
User-friendly search through PubMed and OMIM articles
The implementation of the user-friendly search application has two parts:
Setting up a database to collect and organize data
Key word or key phrases based search interface
One of the key features of this software is the ability to easily and quickly search for key words contained in the literature associated with genes in the chromosomal region of interest. This feature can be found in the "Search Genes" tab on the Gene RECQuest website. The user simply selects the region of interest from their list and uses the search form to find genes of interest. We leverage the Boolean full-text search feature of the MySQL database to implement the Gene RECQuest full-text search on PubMed and OMIM articles for each gene in a region. Instructions are provided to help the researcher construct the search query. Briefly, the researcher can search using key words or phrases of interest, but must explicitly include each variant of interest (e.g. both the singular and plural forms of a term). The search results first list the genes that have both OMIM annotations and PubMed articles matching the search phrases, in the same format as the "Genes in Regions" list.
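As an illustration of how such a query could be assembled, the sketch below builds a search expression in MySQL Boolean full-text syntax, where `+` marks a required term, parentheses group alternatives, and double quotes mark an exact phrase. It is a hedged example rather than the Gene RECQuest code: the `build_boolean_query` helper, the table name `pubmed_articles`, and the column names are hypothetical, and the statement is shown only as a parameterized string.

```python
def quote_term(term):
    """Wrap multi-word phrases in double quotes for exact-phrase matching."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_boolean_query(concepts):
    """Build a Boolean-mode expression from a list of concepts.

    Each concept is a list of spelling variants (e.g. singular and plural).
    Every concept is required (+); any one of its variants satisfies it.
    """
    groups = []
    for variants in concepts:
        quoted = [quote_term(v) for v in variants]
        if len(quoted) == 1:
            groups.append("+" + quoted[0])
        else:
            groups.append("+(" + " ".join(quoted) + ")")
    return " ".join(groups)


# Hypothetical table and column names; the %s placeholder would be bound
# to the expression by a MySQL client library.
SEARCH_SQL = (
    "SELECT gene_symbol, pmid FROM pubmed_articles "
    "WHERE MATCH(title, abstract) AGAINST (%s IN BOOLEAN MODE)"
)

expression = build_boolean_query([["lipoprotein", "lipoproteins"], ["fatty acid"]])
print(expression)
```

Listing the variants explicitly mirrors the instruction given to the researcher above. MySQL Boolean mode also offers a trailing `*` truncation operator, which is an alternative way to cover singular and plural forms.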
Application to a research question
We chose to use one of our own research questions to illustrate the utility of Gene RECQuest. As a part of the Genetics of Lipid lowering drugs and Diet Network (GOLDN) project, a QTL analysis was conducted for VLDL response to a high fat meal (see  for full details). The study was conducted in 1254 individuals. Briefly, the size of the high fat challenge meal was determined by body surface area, with the meal containing 700 kilocalories per m² of body surface area. The meal composition was 83% of calories from fat, 14% from carbohydrates, and 3% from protein. VLDL was measured twice before the meal (the day before and the day of the meal), and at 3.5 and 6 hours after the meal, using proton nuclear magnetic resonance (NMR) spectroscopy. The VLDL response was calculated using a mixed model in which a growth curve was fitted across the four measurement points. Variance components linkage analysis was implemented using the program SOLAR (Sequential Oligogenic Linkage Analysis Routines) for the VLDL response trait.
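The meal sizing described above is simple arithmetic, which the following sketch works through. The 700 kilocalories per m² and the 83/14/3 percent split come from the text; the body surface area of 2.0 m² and the `meal_energy` helper are assumptions chosen only for illustration.

```python
KCAL_PER_M2 = 700  # energy per square meter of body surface area (from the text)
MACRO_PERCENT = {"fat": 83, "carbohydrate": 14, "protein": 3}  # from the text


def meal_energy(bsa_m2):
    """Return total meal energy (kcal) and its breakdown by macronutrient."""
    total = KCAL_PER_M2 * bsa_m2
    breakdown = {name: total * pct / 100 for name, pct in MACRO_PERCENT.items()}
    return total, breakdown


# Hypothetical participant with a body surface area of 2.0 square meters.
total, breakdown = meal_energy(2.0)
print(total, breakdown)
```

For this hypothetical participant the meal provides 1400 kilocalories, of which 1162 come from fat, 196 from carbohydrates, and 42 from protein.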
Search for potential candidate genes
Table: Gene RECQuest results for key word search. Chromosome 18, region 60-120 cM: 237 genes in the region and 480 genes on chromosome 18.
It is important to note that Gene RECQuest is a data acquisition and organization software, and not a data analysis method. There is a multitude of data analysis methods designed to aid in the selection of candidate genes from lists generated by linkage or association studies [14–17]. Many of these methods rely on some sort of annotation analysis, either comparison to a list of known genes (e.g. ENDEAVOUR) or over-representation enrichment (e.g. GSEA), to identify genes of interest. Methods that use the same input data and annotation sources often produce very different output. As a result, it has been suggested that the best approach is to use multiple prioritization tools. Although these methods are potentially very powerful and have a place in one's toolbox, investigators often approach them with hesitation and want to read the relevant literature to aid in their decisions. Gene RECQuest provides a tool that enables researchers to easily identify and organize literature supporting their own expertise and to make informed decisions.
Availability and requirements
Project name: Gene RECQuest
Project home page: http://www.ncbisearch.cme.uab.edu/Default.aspx
Anonymous accounts (no e-mail address for registration is needed): http://www.ncbisearch.cme.uab.edu/AnonymousLogin.aspx
Operating systems: any OS (that has an internet browser application)
Programming language: .Net, ASP.Net, C#, MySQL
API: Application Programming Interface
GOLDN: Genetics of Lipid lowering drugs and Diet Network
NCBI: National Center for Biotechnology Information
NMR: Nuclear magnetic resonance
OMIM: Online Mendelian Inheritance in Man
QTL: Quantitative trait locus
SOLAR: Sequential Oligogenic Linkage Analysis Routines
VLDL: Very low density lipoprotein cholesterol
WSDL: Web Service Description Language
XML: Extensible Markup Language
This work has been supported by the following NIH grants:
DA - 1R01 HL091357-01
- Clerget-Darpoux F, Elston RC: Are linkage analysis and the collection of family data dead? Prospects for family studies in the age of genome-wide association. Hum Hered. 2007, 64 (2): 91-96. 10.1159/000101960.
- Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nature Genetics. 2002, 31 (3): 316-319.
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33 (20): e175. 10.1093/nar/gni179.
- Farrall M, Morris AP: Gearing up for genome-wide gene-association studies. Hum Mol Genet. 2005, 14 (Spec No. 2): R157-162. 10.1093/hmg/ddi273.
- Evans DM, Cardon LR: Genome-wide association: a promising start to a long race. Trends Genet. 2006, 22 (7): 350-354. 10.1016/j.tig.2006.05.001.
- DiPetrillo K, Wang X, Stylianou IM, Paigen B: Bioinformatics toolbox for narrowing rodent quantitative trait loci. Trends Genet. 2005, 21 (12): 683-692. 10.1016/j.tig.2005.09.008.
- Vredenburg K, Isensee S, Righi C: User-centered design: an integrated approach. 2002, Upper Saddle River, NJ: Prentice Hall PTR.
- Sadasivam RS: An Architecture Framework for Process-Personalized Composite Services: Service-oriented Architecture, Web Services, Business-Process Engineering, and Human Interaction Management. 2008, Saarbrücken, Germany: VDM Verlag.
- Sadasivam RS, Tanik MM: Composite process-personalization with service-oriented architecture. Handbook of Research In Mobile Business: Technical, Methodological And Social Perspectives. Edited by: Unhelkar B. 2008, IGI Global, 470-2
- Sadasivam RS, Sundar G, Tanik MM, Jololian L, Tanju MN: A process personalization model for enabling biological research. Int Design and Process Technology: June 3-8 2007. 2007, Antalya, Turkey: Society for Design and Process Science, 168-174.
- Sadasivam RS, Tanik MM, Jololian L: WO/2007/008687: Drag and drop communication of data via a computer network. World Intellectual Property Organization. vol. WO/2007/008687. 2007.
- Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 2004, 32 (Database issue): D35-D40. 10.1093/nar/gkh073.
- Corella D, Arnett DK, Tsai MY, Kabagambe EK, Peacock JM, Hixson JE, Straka RJ, Province M, Lai CQ, Parnell LD, et al: The -256T>C polymorphism in the apolipoprotein A-II gene promoter is associated with body mass index and food intake in the genetics of lipid lowering drugs and diet network study. Clin Chem. 2007, 53 (6): 1144-1152. 10.1373/clinchem.2006.084863.
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations and open problems. Bioinformatics. 2005, 21 (18): 3587-3595. 10.1093/bioinformatics/bti565.
- Tiffin N, Adie E, Turner F, Brunner H, van Driel MA, et al: Computational disease gene identification: A concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nuc Acids Res. 2006, 34 (10): 3067-3081. 10.1093/nar/gkl381.
- Nam D, Kim S-Y: Gene-set approach for expression pattern analysis. Briefings in Bioinformatics. 2008, 9 (3): 189-197. 10.1093/bib/bbn001.
- Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nuc Acids Res. 2009, 37 (1): 1-13. 10.1093/nar/gkn923.
- Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent L-C, De Moor B, Marynen P, Hassan B, Carmeliet P, Moreau Y: Gene prioritization through genomic data fusion. Nature Biotechnology. 2006, 24 (5): 537-544. 10.1038/nbt1203.
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. PNAS. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.
- Thornblad TA, Elliott KS, Jowett J, Visscher PM: Prioritization of positional candidate genes using multiple web-based software tools. Twin Research and Human Genetics. 2007, 10 (6): 861-870. 10.1375/twin.10.6.861.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.