
India Allele Finder: a web-based annotation tool for identifying common alleles in next-generation sequencing data of Indian origin

Abstract

Objective

We built India Allele Finder, an online searchable database and command-line tool that gives researchers access to variant frequencies of Indian Telugu individuals, derived from publicly available fastq data from the 1000 Genomes Project. Access to appropriate population-based genomic variant annotation can accelerate the interpretation of genomic sequencing data. In particular, exome analysis of individuals of Indian descent identifies population variants that are not reflected in European exomes, which complicates genomic analysis for such individuals.

Results

India Allele Finder offers improved ease of use to investigators seeking to analyze and annotate sequencing data from Indian populations. We describe the use of India Allele Finder to identify common population variants in a whole exome dataset from a disease quartet, reducing the number of candidate single nucleotide variants from 84 to 7. India Allele Finder is freely available to investigators to annotate genomic sequencing data from Indian populations. It allows efficient identification of population variants in genomic sequencing data, and is an example of a population-specific annotation tool that simplifies analysis and encourages international collaboration in genomics research.

Introduction

Whole exome sequencing (WES) has revolutionized genomic diagnostics and is a key tool in identifying the causal genes underlying rare Mendelian disorders [1,2,3]. A critical strategy in post-sequencing analysis involves screening a proband’s exome variants against exomes from reference individuals matching the ethnic makeup of the proband. While these data are widely available for individuals of European and African American descent [4, 5], such reference data are less accessible when analyzing exomes from individuals from India. We present India Allele Finder (IAF), an online database of allele frequencies of individuals from the Indian subcontinent.

The 1000 Genomes web browser (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) effectively presents complete allele frequencies, but rapid queries are more difficult, and annotation of local variant call files (vcfs) is not possible. In contrast, the IAF website and its accompanying command-line tool focus only on the South Indian population, and allow researchers to easily annotate their own exome data sets. Clinicians who want a more ordered method of browsing 1000 Genomes data will find the query-based website intuitive to use, while bioinformaticians who work with vcfs can readily adopt the IAF command-line tool into their workflow.

Main text

Accessing 1000 Genomes data

Fastq data of individuals specific to Indian populations (flagged with “ITU”, indicating Indian Telugu ancestry) were retrieved via ftp from the 1000 Genomes Project [6] and combined into two fastq files per individual, one per paired-end read. We downloaded fastqs for 100 of the 118 ITU individuals available in the 1000 Genomes data set. Automated shell scripts handled the downloading of fastq files, while an aggregator written in Python concatenated the fastqs of each paired end so that each individual had two fastq files of equal size.
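
The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' aggregator: the file naming pattern (SAMPLE_RUN_END.fastq.gz) and the function names `group_fastqs` and `concatenate` are assumptions made for the example. Sorting the inputs keeps the runs in the same order for both read ends, so mate pairs stay aligned across the two merged files.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_fastqs(paths):
    """Group fastq paths by (sample, read end).

    Assumes hypothetical names of the form SAMPLE_RUN_END.fastq.gz,
    where END is "1" or "2".
    """
    groups = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        stem = path.name.split(".")[0]      # e.g. "S1_R1_1"
        sample, _run, end = stem.split("_")
        groups[(sample, end)].append(path)
    return groups


def concatenate(groups, out_dir):
    """Write one merged file per (sample, read end).

    Gzip files can be concatenated byte-for-byte and remain valid gzip.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for (sample, end), parts in groups.items():
        target = out_dir / f"{sample}_{end}.fastq.gz"
        with open(target, "wb") as out:
            for part in parts:
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[(sample, end)] = target
    return merged
```

Under these assumptions, each individual ends up with exactly two files, one per read end, built from the same runs in the same order.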

Data analysis

Fastqs were mapped to hg19 with the Burrows–Wheeler alignment (BWA) tool 0.7.9a. The resulting bam files were then analyzed with SAMtools 0.1.19, Picard 1.114, and the Genome Analysis Toolkit (GATK) 3.1.1. Annotation of the resulting vcfs was performed with ANNOVAR. A command-line Python script, indiaAlleleAnnotator.py, takes a tab-delimited vcf as input and outputs a modified vcf with an additional column representing the allele frequency among the Indian Telugu population.
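
To make the annotation step concrete, the sketch below appends a frequency column to each vcf record. It is an illustration under stated assumptions, not the code of indiaAlleleAnnotator.py: the function name `annotate_vcf_lines`, the column name `IAF_AF`, the `(chrom, pos, ref, alt)` lookup key, and the use of `.` for variants absent from the data set are all hypothetical choices.

```python
def annotate_vcf_lines(vcf_lines, iaf_freqs, column="IAF_AF"):
    """Yield vcf lines with one extra tab-separated frequency column.

    iaf_freqs maps (chrom, pos, ref, alt) to an allele frequency.
    Variants missing from the mapping get "." (an assumed convention).
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line                       # meta lines pass through unchanged
        elif line.startswith("#"):
            yield f"{line}\t{column}"        # extend the column header line
        elif line:
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            freq = iaf_freqs.get((chrom, int(pos), ref, alt))
            yield f"{line}\t{'.' if freq is None else freq}"
```

A record that receives a frequency is present in the population data set and is therefore a candidate for exclusion in downstream filtering.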

Database schema

The vcf generated from the analysis was converted into structured query language (SQL) format and imported into a MySQL v.14.14 database as a single table. The database is accessed online via a Perl Catalyst front-end. The files for this implementation, including the raw SQL file, are available at https://github.com/Paciorkowski-Lab/IndiaAlleleFinder.
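
A single-table layout of this kind can be sketched as below. The example uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table name, column names, and helper functions are assumptions for illustration; the actual schema is defined by the SQL file in the repository.

```python
import sqlite3

# Hypothetical one-table schema; the real column set may differ.
SCHEMA = """
CREATE TABLE variants (
    chrom       TEXT    NOT NULL,
    pos         INTEGER NOT NULL,
    rsid        TEXT,
    ref         TEXT    NOT NULL,
    alt         TEXT    NOT NULL,
    gene        TEXT,
    allele_freq REAL
)
"""


def build_table(rows):
    """Create an in-memory database holding all variants in one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_by_gene(conn, gene):
    """Return (chrom, pos, ref, alt, allele_freq) rows for one gene symbol."""
    cursor = conn.execute(
        "SELECT chrom, pos, ref, alt, allele_freq FROM variants "
        "WHERE gene = ? ORDER BY pos",
        (gene,),
    )
    return cursor.fetchall()
```

Keeping everything in one table means each supported query type reduces to a single `WHERE` clause on a different column.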

IAF allows variants to be queried through its web-based database, and provides a command-line tool to annotate exome vcfs. Accepted formats for the web-based query include gene symbol, variant genomic location, or rsID number. The command-line annotation tool identifies variants that are present in the IAF data set; these are likely to be population variants and may be excluded from further analysis in disease gene identification studies. The IAF workflow is represented in Fig. 1.
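
One way to route the three accepted query formats is sketched below. The exact input syntax accepted by the IAF website is not specified here, so the patterns (an `rs` prefix for rsIDs, `chrom:pos` or `chrom:start-end` for locations) and the function name `classify_query` are assumptions for illustration.

```python
import re

# Assumed input patterns; the website's actual parsing rules may differ.
RSID = re.compile(r"^rs\d+$", re.IGNORECASE)
LOCATION = re.compile(
    r"^(?:chr)?([0-9]{1,2}|[XYM]):(\d+)(?:-(\d+))?$", re.IGNORECASE
)


def classify_query(text):
    """Classify a search string as an rsID, a genomic location, or a gene symbol."""
    text = text.strip()
    if RSID.match(text):
        return ("rsid", text.lower())
    match = LOCATION.match(text)
    if match:
        chrom = match.group(1).upper()
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) else start
        return ("location", chrom, start, end)
    # Anything else is treated as a gene symbol and passed through unchanged.
    return ("gene", text)
```

Treating the gene symbol as the fallback keeps the classifier permissive; a lookup that finds no rows then signals an unknown symbol.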

Fig. 1

Workflow of the analysis of publicly available ITU fastqs from 1000 Genomes used to construct the IAF dataset. Users wishing to annotate exome results with frequency data from IAF may do so using either the web-based or the command-line interface.

IAF use case study

Subjects MP14-001a1 and MP14-001a2, two siblings presenting with achalasia–addisonianism–alacrima syndrome (AAAS), and their father and mother were selected for study. Saliva-derived DNA underwent WES using the Agilent Sure-Select 50 Mb whole exome capture kit, and 100 base pair paired-end reads were generated on an Illumina HiSeq 2500 machine at the University of Rochester Genomics Research Center. Sequences were aligned and analyzed as described previously. De novo, autosomal recessive, and X-linked variants were identified, and common variants in the database of single nucleotide polymorphisms (dbSNP) version 137 were excluded. We then used IAF to identify and exclude variants found in the 100 Indian Telugu individuals from 1000 Genomes. After filtering by pedigree hypothesis, candidate variants were reduced from 84 to seven when using IAF. We found that MP14-001a1 and MP14-001a2 were homozygous for the c.43C>A/p.Q15K variant, a known AAAS sequence variation [7]. Their mother and father were both heterozygous for this variant.
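
The combination of a pedigree hypothesis with population filtering can be illustrated for the autosomal recessive case: keep variants that are homozygous in both affected siblings, heterozygous in both parents, and absent from the IAF data set. The sketch below is a simplified assumption of how such a filter might look, not the pipeline used in the study; the data layout and the name `recessive_candidates` are hypothetical, and the de novo and X-linked models are not covered.

```python
HOM_ALT = {"1/1", "1|1"}
HET = {"0/1", "1/0", "0|1", "1|0"}


def recessive_candidates(variants, affected, parents, iaf_keys):
    """Keep variants that fit a recessive model and are absent from IAF.

    variants: iterable of dicts with a "key" (chrom, pos, ref, alt) and a
    "genotypes" mapping of sample name to a VCF-style GT string.
    iaf_keys: set of variant keys present in the population data set.
    """
    kept = []
    for variant in variants:
        if variant["key"] in iaf_keys:
            continue  # seen in the reference population: likely a common variant
        gts = variant["genotypes"]
        if all(gts.get(s) in HOM_ALT for s in affected) and all(
            gts.get(p) in HET for p in parents
        ):
            kept.append(variant)
    return kept
```

A variant such as the one reported above, homozygous in both siblings and heterozygous in both parents, passes this filter as long as it is not in the population set.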

The analysis of exome data from populations other than European and African American can be challenging due to difficulty accessing appropriate normal population data sets. This can result in an excess of candidate variants in disease gene identification studies. We have designed IAF to fit into existing workflows.

There are differences between the results reported in 1000 Genomes and those in IAF. Overall, the IAF data set reports fewer variants, likely due to our use of the newer GATK v3.1.1 rather than v2.4 [8]. Additionally, we sampled from a smaller group of 100 individuals, whereas 1000 Genomes collected data from 2535 individuals from 26 different populations for its phase 3 study. As a result, 1000 Genomes aggregated over 5.2 million entries for chromosome 5 alone, while our data set for chromosome 5 contains 8520 entries aggregated from 100 individuals. We anticipate that more variants will be represented in IAF as more exomes from the Indian continental population are added.

Limitations

IAF is a proof of concept implementation of a filtering mechanism based on population-derived variant frequencies. It is a unique tool to further annotate vcfs for the specific purpose of analyzing WES data from individuals of Indian subcontinent descent. We anticipate a proliferation of reference databases for populations that are not of European origin. Additional features are planned for the IAF website, including the ability to input multiple variants, and access a subset of the vcf output corresponding to the genes and/or variants queried. Further exome data sets from individuals of continental Indian ancestry will be added in the future as they become available.

Abbreviations

AAAS: achalasia–addisonianism–alacrima syndrome

BWA: Burrows–Wheeler alignment tool

dbSNP: database of single nucleotide polymorphisms

GATK: Genome Analysis Toolkit

ITU: Indian Telugu

SQL: structured query language

vcf: variant call file

WES: whole exome sequencing

References

  1. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12:745–55.

  2. Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, Braxton A, Beuten J, Xia F, Niu Z, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med. 2013;369:1502–11.

  3. Zhu X, Petrovski S, Xie P, Ruzzo EK, Lu Y-F, McSweeney KM, Ben-Zeev B, Nissenkorn A, Anikster Y, Oz-Levi D, et al. Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios. Genet Med. 2015;17:774–81.

  4. Johnston JJ, Biesecker LG. Databases of genomic variation and phenotypes: existing resources and future needs. Hum Mol Genet. 2013;22:R27–31.

  5. Song W, Gardner SA, Hovhannisyan H, Natalizio A, Weymouth KS, Chen W, Thibodeau I, Bogdanova E, Letovsky S, Willis A, et al. Exploring the landscape of pathogenic genetic variation in the ExAC population database: insights of relevance to variant classification. Genet Med. 2015;18:850–4.

  6. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–73.

  7. Papageorgiou L, Mimidis K, Katsani KR, Fakis G. The genetic basis of triple A (Allgrove) syndrome in a Greek family. Gene. 2013;512:505–9.

  8. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, et al. A global reference for human genetic variation. Nature. 2015;526:68–74.


Authors’ contributions

JFZ study design, acquisition, analysis, and interpretation of data, manuscript preparation. FJ analysis of data, manuscript preparation. AS acquisition of data, manuscript preparation. KMG acquisition of data, manuscript preparation. ARP study conception and design, acquisition, analysis, and interpretation of data, manuscript preparation. All authors read and approved the final manuscript.

Acknowledgements

We would like to acknowledge the University of Rochester Genomics Research Center for sequencing support, and the University of Rochester Center for Integrated Research Computing for providing high-performance computing resources.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

IAF is hosted at https://iaf.urmc.rochester.edu. The source code and SQL file are available to download at https://github.com/Paciorkowski-Lab/IndiaAlleleFinder, where instructions are provided to set up a local instance of the database backend to IAF. Vcfs may be annotated with IAF data using a Python script available at https://www.iaf.urmc.rochester.edu/static/assets/commandline_vcf_annotator.tar.gz. This allows for integration of IAF into command-line workflows.

Consent for publication

Individuals in this study underwent informed consent through research protocols approved by the Research Subjects Review Board of the University of Rochester Medical Center and the research ethics board of Manipal University, which included consent to publish.

Ethics

Individuals in this study underwent informed consent through research protocols approved by the Research Subjects Review Board of the University of Rochester Medical Center and the research ethics board of Manipal University.

Funding

Research reported in this work was supported by the National Institutes of Health, National Institute of Neurologic Disorders and Stroke under Award Number K08NS078054 (to A.R.P.).

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Alex R. Paciorkowski.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Zhang, J.F., James, F., Shukla, A. et al. India Allele Finder: a web-based annotation tool for identifying common alleles in next-generation sequencing data of Indian origin. BMC Res Notes 10, 233 (2017). https://doi.org/10.1186/s13104-017-2556-2
