Retrieval of 1000 Genomes Project (1KGP) variant calls
Variant calls from the final phase 3 release, in variant call format (VCF, version 4.2; *.vcf files), were downloaded from the 1KGP FTP site and its mirror (EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/; NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/). The Perl API scripts of VCFtools (v0.1.11) were used to subset the VCF files. Population-specific VCF files for the PJL sub-population were generated by extracting the 96 PJL sample (individual) IDs, retaining only sites that carry an alternative allele in at least one PJL sample and skipping sites at which all PJL samples carry only the reference (REF) allele. BCFtools stats (version 1.1 + htslib-1.1, https://samtools.github.io/bcftools/bcftools.html) was used to count SNPs and InDels and to compute the transition/transversion (Ts/Tv) ratio. SNP densities were calculated in bins of 1 Mb using the SNPdensity output statistics option of VCFtools.
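The site filter and summary statistics described above can be illustrated with a minimal Python sketch. It is not the VCFtools or BCFtools implementation: the record layout (tuples of chromosome, position, REF, ALT and genotype strings), the function names and the binning convention (bin start = position rounded down to a multiple of the bin size) are all assumptions made for illustration, and only biallelic SNPs are handled.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def carries_alt(genotypes):
    """True if any genotype string (e.g. '0|1') holds a non-reference allele."""
    for gt in genotypes:
        alleles = gt.replace("/", "|").split("|")
        if any(a not in ("0", ".") for a in alleles):
            return True
    return False


def summarize(records, bin_size=1_000_000):
    """Count Ts, Tv and per-bin SNP density for sites with an ALT allele.

    records: iterable of (chrom, pos, ref, alt, genotypes) for biallelic
    SNPs, where genotypes lists the GT strings of the selected samples.
    The binning convention used here is an assumption, not VCFtools' own.
    """
    ts = tv = 0
    density = Counter()
    for chrom, pos, ref, alt, genotypes in records:
        if not carries_alt(genotypes):
            continue  # all selected samples are REF at this site
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
        bin_start = (pos // bin_size) * bin_size
        density[(chrom, bin_start)] += 1
    return ts, tv, density
```

The Ts/Tv ratio is then `ts / tv` whenever `tv` is non-zero, and `density` maps each (chromosome, bin start) pair to its SNP count.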
Annotation of genomic variants
Selection of transcript set
The ENSEMBL transcript set (version 83, December 2015) provides genome resources for chordate genomes, with a particular focus on human genome data. ENSEMBL makes available substantial and diverse transcript information, including the Consensus Coding Sequence (CCDS) [7, 8], Human and Vertebrate Analysis and Annotation (HAVANA) (https://www.sanger.ac.uk/research/projects/vertebrategenome/havana/), Vertebrate Genome Annotation (VEGA) (Wilming et al.), ENCODE data, and the GENCODE gene and transcript sets. A total of 204,940 transcripts in ENSEMBL version 83 were used for annotation.
Variant annotations were obtained using the software tools ANNOVAR (version 2015 Dec 14) and SnpEff (version 4.2, build 2015-12-15).
Annotations by ANNOVAR: ANNOVAR was used to functionally annotate genomic variants by two methods: (1) gene-based annotation using the ENSEMBL genes (ensGene) annotation database, and (2) filter-based annotation using the snp138, clinvar_20150629, cosmic68, cosmic70 and 1000g2015aug_all annotation databases. A broad definition of splicing regions was used, so that all variants within six bases of an intron/exon boundary fall into ANNOVAR’s splicing annotation category. ANNOVAR returns a single annotation for each variant: if several transcripts are relevant to a particular variant, it returns the annotation with the most severe consequence according to its rules of precedence.
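The idea of reducing several per-transcript annotations to a single one by precedence can be sketched as follows. This is an illustrative simplification, not ANNOVAR's code: the ordering below approximates the precedence given in the ANNOVAR documentation (exonic and splicing first, then ncRNA, UTR, intronic, upstream/downstream, intergenic), ties between equally ranked categories are ignored, and the function name `most_severe` is hypothetical.

```python
# Simplified precedence list, most severe first. ANNOVAR treats some of
# these categories as ties (e.g. exonic and splicing); that is not modelled.
PRECEDENCE = [
    "exonic",
    "splicing",
    "ncRNA",
    "UTR5",
    "UTR3",
    "intronic",
    "upstream",
    "downstream",
    "intergenic",
]
RANK = {category: i for i, category in enumerate(PRECEDENCE)}


def most_severe(categories):
    """Return the highest-precedence category among per-transcript calls.

    Unknown categories are ranked last rather than raising an error.
    """
    return min(categories, key=lambda c: RANK.get(c, len(PRECEDENCE)))
```

For a variant that is intronic in one transcript and exonic in another, `most_severe(["intronic", "exonic"])` returns the exonic call.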
Annotations by SnpEff: Variant annotations were also obtained using SnpEff with the GRCh37.75 database. Because SnpEff returns all possible annotations for each variant (given the transcripts present at the variant’s location in the genome), we prioritized annotations by the impact of their consequence so that the SnpEff results are directly comparable with those from ANNOVAR.
Statistical analysis and plotting were performed using libraries loaded into the R statistical package (version 3.2.1, https://www.r-project.org/).
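One way to reduce SnpEff's multiple annotations to a single call per variant is sketched below in Python. It assumes the annotations are stored in the `ANN` INFO field, with the effect in the second pipe-separated subfield and the impact (HIGH, MODERATE, LOW or MODIFIER) in the third. The function name `top_annotation` and the tie-breaking rule (first entry wins) are assumptions of this sketch, not a description of the procedure actually used.

```python
# Lower rank means higher putative impact.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def top_annotation(info):
    """Return (effect, impact) of the highest-impact ANN entry, or None.

    info: the INFO column of one VCF record, e.g. 'AC=2;ANN=T|...|...'.
    Entries with equal impact keep their original order (first one wins).
    """
    ann = next(
        (field[4:] for field in info.split(";") if field.startswith("ANN=")),
        None,
    )
    if ann is None:
        return None
    best = None
    for entry in ann.split(","):
        subfields = entry.split("|")
        if len(subfields) < 3:
            continue  # malformed entry, skip it
        effect, impact = subfields[1], subfields[2]
        rank = IMPACT_RANK.get(impact, len(IMPACT_RANK))
        if best is None or rank < best[0]:
            best = (rank, effect, impact)
    return None if best is None else (best[1], best[2])
```

Applied to a record whose `ANN` field lists an intron_variant (MODIFIER) and a missense_variant (MODERATE), the function keeps the missense call.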
Although 1KGP data are available at http://www.internationalgenome.org/, we compiled everything related to PJL in one place so that researchers and the non-scientific community do not need to search from scratch. The PJL sub-population data comprise a total of 158 individuals, but not all of them have undergone the same kind of analysis. Individuals can be grouped on the basis of analysis and data collection; some individuals are not sequenced at all (Additional file 1). Genetic variants of the sequenced individuals, selected on the basis of the low-coverage WGS released in phase 3, are analyzed for the number of SNVs and InDels, SNP densities, the frequency with which they occur, substitution types and their counts, and the Ts/Tv ratio (Additional file 2: Figures S2–S6). The SNP counts of the PJL sub-population are further compared with the 1KGP SNP counts; for this analysis, the 1KGP counts include all SNPs except those of the PJL sub-population (Additional file 2: Table S1 and Figures S7–S9).
We began our investigation by using multiple annotation tools in order to evaluate the influence of each algorithm on the resulting annotations. Here, we compared the variant annotation results for the PJL sub-population obtained with ANNOVAR and SnpEff using the ENSEMBL transcript set (Additional file 2: Table S2). First, we compared the annotation terms categorized by both tools. Categories that correspond exactly are treated as individual effects, while particular SnpEff categories are combined for comparison against the broader ANNOVAR categories. We refer to an exact match when the annotations from the two tools are exactly equivalent, for example when both tools annotate a variant as intronic or as intergenic (Additional file 2: Table S3).
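The category comparison can be illustrated with a small Python sketch that collapses SnpEff effect terms into broader ANNOVAR-style categories before testing for an exact match. The mapping below is illustrative only and is not the mapping of Table S3; the handling of combined SnpEff effects joined by `&` (first mappable term wins) is likewise an assumption.

```python
# Illustrative mapping from SnpEff effect terms to ANNOVAR-style categories.
# The mapping actually used in the study is the one in Table S3.
SNPEFF_TO_ANNOVAR = {
    "missense_variant": "exonic",
    "synonymous_variant": "exonic",
    "stop_gained": "exonic",
    "frameshift_variant": "exonic",
    "splice_acceptor_variant": "splicing",
    "splice_donor_variant": "splicing",
    "splice_region_variant": "splicing",
    "5_prime_UTR_variant": "UTR5",
    "3_prime_UTR_variant": "UTR3",
    "intron_variant": "intronic",
    "upstream_gene_variant": "upstream",
    "downstream_gene_variant": "downstream",
    "intergenic_region": "intergenic",
}


def to_annovar_category(snpeff_effect):
    """Map a SnpEff effect (possibly 'a&b') to a broad category, or None."""
    for term in snpeff_effect.split("&"):
        if term in SNPEFF_TO_ANNOVAR:
            return SNPEFF_TO_ANNOVAR[term]
    return None


def is_exact_match(annovar_category, snpeff_effect):
    """True when both tools place the variant in the same broad category."""
    return to_annovar_category(snpeff_effect) == annovar_category
```

Under this mapping, a variant called intronic by ANNOVAR and `intron_variant` by SnpEff counts as an exact match, whereas `splice_region_variant&intron_variant` does not.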
In total, 62,411 variants are annotated as exonic by either ANNOVAR or SnpEff (Additional file 2: Table S4). Of these, 23,678 (37.94%) are annotated as exonic by both tools. Interestingly, both annotation tools have a good share of individual match rate (the number of variants annotated by either ANNOVAR or SnpEff alone, which could be termed private annotations): 61.5% for ANNOVAR and 98.64% for SnpEff. Intronic variants have the highest collective share of annotations (1,521,361) as identified by both tools. Almost all of these annotations, whether found by ANNOVAR or SnpEff, are concordant, with match rates of 99.90% for ANNOVAR and 95.19% for SnpEff. Intergenic annotations have a similar match rate, indicating that both tools use a similar approach to identify non-exonic variants. For splicing variants, a 100% ANNOVAR match rate is observed for common variants; however, only 10.84% of those splice variants are annotated by SnpEff. Since SnpEff can predict a much broader range of Sequence Ontology effects for splice variants, the greater number of splice variants provides more information about these locations. Likewise, upstream and downstream variants show the same trend as splice variants, with an overall exact match of 6% for both tools. Considering all annotation categories, ANNOVAR and SnpEff show a substantial amount of disagreement in annotating genetic variants, even when using the same transcripts. A comprehensive analysis of the data suggests that splicing, upstream, downstream, and non-coding exonic variants show negligible concurrence. Further in-depth analysis will focus on exonic versus intronic and intergenic variants, since these account for the largest numbers of identified variants within the dataset. We do not discourage the use of either ANNOVAR or SnpEff; rather, the differences in annotated variants highlight the care researchers need to take when analyzing annotated data.
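The arithmetic behind such concordance figures can be sketched in Python using sets of variant identifiers. The shared percentage follows the exonic example above (23,678 of 62,411 is 37.94%). The per-tool rate is defined here as shared calls divided by that tool's own calls, which is an assumption of this sketch and may differ from the definition behind the reported per-tool match rates; the function name and dictionary keys are hypothetical.

```python
def concordance(annovar_ids, snpeff_ids):
    """Summarize overlap between two sets of variant IDs in one category.

    Returns counts plus percentages rounded to two decimals. The per-tool
    percentages (shared / that tool's total) are an assumed definition.
    """
    shared = annovar_ids & snpeff_ids
    union = annovar_ids | snpeff_ids

    def pct(numerator, denominator):
        return round(100 * numerator / denominator, 2) if denominator else 0.0

    return {
        "union": len(union),
        "shared": len(shared),
        "shared_pct": pct(len(shared), len(union)),
        "annovar_pct": pct(len(shared), len(annovar_ids)),
        "snpeff_pct": pct(len(shared), len(snpeff_ids)),
    }


# Check of the exonic figure quoted in the text: 23,678 of 62,411 variants.
exonic_shared_pct = round(100 * 23678 / 62411, 2)
```

With toy inputs `{"v1", "v2", "v3", "v4"}` and `{"v3", "v4", "v5"}`, two of five variants in the union are shared, giving a shared percentage of 40.0.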
Our comparisons may highlight these discrepancies to some extent. The Sequence Ontology Project helps to minimize the effect of apparent differences in variant definitions (for example, of splice variants), which could eventually improve annotations for clinical use. In our experience, variants annotated with at least two tools should be cross-referenced with gene expression databases, such as GTEx, when considering functional assay validation of potential candidate variants of interest. Variants with opposing annotations, or annotations missed by one tool, demand special handling.