Whole genome mapping and identification of single nucleotide polymorphisms of four Bangladeshi individuals and their functional significance

The major objective of the study was to sequence the whole genome of four Bangladeshi individuals and identify variants that are known to be associated with functional changes or disease states. We also carried out an ontology analysis to identify the functions and pathways most likely to be affected by these variants. We identified around 900,000 common variants and close to 5 million unique ones in all four of the individuals. This included over 11,500 variants that caused nonsynonymous changes in proteins. Heart function associated pathways were heavily implicated by the ontology analysis; corroborating previous studies that claimed the Bangladeshi population as highly susceptible to heart disorders. Two variants were found that have been previously identified as pathogenic factors in familial hypercholesteremia and structural disorders of the heart. Other pathogenic variants we found were associated with pseudoxanthoma elasticum, cancer progression, polyagglutinable erythrocyte syndrome, preeclampsia, and others.


Introduction
The original Human Reference Genome had no representation from the subcontinent. Subsequently, the 1000 genome project [1,2] added genome data from the region. Overall, the current database of human variants is still lacking in heavy representation from this region. The addition of more genomic and variation data from this region will improve our understanding of how different genetic markers and predispositions are distributed globally; especially since many countries across the world have a large Bangladeshi populations.
Here we undertake a pilot study using the whole genome sequences from four Bangladeshi individuals, labelled samples S1, S6, S19 and S21, to gain an understanding of the functionally relevant single nucleotide variations (SNVs) that can occur in this population. The primary goal of this study was to identify variants that can be associated with functional traits and disease states. . The individuals chosen for this pilot study did not have any known underlying conditions or genetic disorders. A small aliquot (~ 5 ml) of blood sample was collected from each person and genomic DNA was extracted using Maxwell RSC whole blood DNA extraction kit (Promega) according to the manufacturer's instructions. 300 ng gDNA from each of the four samples was used to prepare paired-end libraries with the Nextera ™ DNA Flex Library Preparation kit with an average insert size of 600 bp. All manufacturer guidelines were followed (Illumina Inc., San Diego, CA). Sequencing was done using Novaseq 6000 sequencing platform.

Variant calling, annotation and analysis
Illumina Basespace Sequence hub, Dragon Germline 3.4.5 (DRAGEN Host Software Version 05.021.332.3.4.5 and Bio-IT Processor Version 0x04261818) was used for mapping the sequenced genomes with the human reference genome (GRCh38.p2) and the subsequent variant calling. The VCF files were annotated using Annovar [3]. These were subsequently compared with known human variation databases to identify the presence of functionally relevant mutations (for example pathogenic variants). Human variation datasets were obtained from the UCSC and NCBI repositories [4,5]. Known variants were identified in our samples using bedtools [6] and manual filtering using R. Strand bias was accounted for using the Fisher Exact Test. This was carried out for each variant to determine if the concerned allele was supported by one strand more so than the other. Variants with scores of 0 or close to 0 were included in the subsequent analysis. Finally, we looked at the genomic locations of the exonic variants to list all the genes that contained these changes. The genes were then used to carry out an ontology search in order to identify the biological pathways and functions that are most likely to be effected as a result of these SNPs. This was done using DAVID, with all parameters set to default. The importance of these genes was also visualized using ReactomePA. This was done to highlight the major pathways associated with these genes. Briefly this gave us the pathways that are most likely to be impacted as a result of functional alterations in the genes that contained the aforementioned variants. Once more all parameters were kept at default.

Results
The sequencing and mapping produced between 1.1 billion and 1.46 billion reads for each of the samples. Samples S1, S6 and S19 produced between 1.3 and 1.46 billion reads,; while for S21 the reads dropped down to around 1.1 billion. Under 25% of all reads were unmapped for all samples. The total number of reads aligned to the reference genome were 1387640908, 1498505945, 13876409 08,1023927992 for samples S1, S6, S19 and S21, respectively. The coverage was 63 ×, 69 ×, 63 × and 47 × for S1, S6, S19, and S21 respectively (Additional file 1: Table S1).
All four samples contained between 5 million and 5.5 million variants. S1 gave 5,279,748 variants and after removing variants with QUAL < 20 the total number of variants came down to 5,000,704. Sample S6 had 5,345,421 variants initially and 5,064,885 after removing low quality calls. Sample S19 produced 5,269,076 variants with low quality calls and 4,970,655 calls without them. Lastly Sample S21 produced 5,260,335 calls with low quality variants and 4,966,352 variants after they had been filtered. Approximately 900,000 previously identified variants were found in each of the datasets.
Additional file 2: Figure S1 summarizes the variant calls. The QUAL scores were generally above the threshold of 20. The mapping qualities were mostly very high, with the highest peaks observed near 250. Read depth was concentrated mostly around 50 as an approximate median. The variant count per window summarizes the number of variants observed per each genomic interval of 1 kilobase pair (kbp) along the genome. The peak close to zero indicates most windows did not contain any variants (Additional file 2: Figure S1). This was visualized using the vcfR package on R [7].
The number of variants per chromosome correlated with chromosome size. Chromosome 1 had the most number of variants, averaging around 4.13 million for four samples, while Chromosome 22 had the least, averaging around 80,500. As for the base changes associated with the single nucleotide variants (SNVs), the four most common types of changes were to A to G, G to A, T to C, and C to T. Additional file 3: Figure S2 shows heatmaps of these four types of changes for each chromosome. The patterns are as expected with the larger chromosomes harboring more variants. The identical color band gradient of each chromosome for all four variant types indicates each chromosome contained near equal proportions of each type of base substitution.
The protein coding genes with the highest numbers of exonic variants were identical for all four samples. A total of 9702 genes harbored variants in all four samples. Among the exonic variants, 11,582 were nonsynonymous mutations, 12,296 were synonymous, 192 were nonframeshift insertions, 218 were nonframeshift deletions, 110 were stop-gain variants, 116 were frameshift deletions, 98 were frameshift insertions, 10 were stoploss and 378 were unknown. Out of all the exonic variants, 9524 were homozygous variants in all four genomes.
Afterwards we investigated to the functional significance of these variants. The potential effects of these variants on different metabolic pathways was visualized using the Reactome online sever (Fig. 1). As it can be seen, the disease associated pathways are shaded in a darker yellow, suggesting that the genes containing the most variants are strongly connected to these pathways.
Afterwards we focused on the known clinical variants present in the genomes under investigation. A total of 3628 clinical variants were found in the Bangladeshi individuals. Seventeen of these were known pathogenic variants. 5 of these were associated with Pseudoxan-thoma_elasticum. The rest of the implicated diseases were all represented by 1 variant each (Table 1). Fig. 1 Visualization of metabolic pathways potentially affected by the highly variable genes in Bangladeshi individuals. Most notably, disease and immune system related pathways have been shaded in a darker yellow, indicating that a number of cascades and pathways whose dysregulation is connected to disease manifestations are likely to be impacted as a result of the genomic variants Table 1 The diseases that are associated with the pathogenic variants found within our samples The NCBI Clinvar database was used to identify these previously known disease causing mutations. Two heart function associated disorders are implicated, which would seem to coincide with the relatively high incidence of heart diseases in Bangladesh. The most heavily implicated disease is Pseudonanthoma elasticum, for which the Bangladeshi individuals harbored 5 known pathogenic variants

Disease
Variants Pathogenic Pseudoxanthoma elasticum 5 5 Bardet-biedl syndrome 2/6, digenic 1 1 Cancer progression and tumor cell motility 1 1 Deafness, autosomal recessive 9 1 1 Diamond-Blackfan anemia_4 1 1 Encephalopathy,_progressive,_early-onset,_with brain edema_and/or_leukoencephalopathy In terms of functional importance, we found that a considerable number of our variants occurred in genes involved in cholesterol metabolism and cardiac functions, as per DAVID. Although adjusted p-values for significance were above 0.05 for all of these. However two of the aforementioned disease causing variants were directly associated with structural heart defects and hypercholesteremia respectively; suggesting the possibility that the variants occurring in heart function genes may be significant still. These two variants in question are the 11100236 G to A change in the LDLR gene in chromosome 19 and the 4 nucleotide deletion at position 56633141 of chromosome 14 in the TMEM260 gene. The functions that did have significant adjusted p-values or Benjamini scores of significance include glycosylation functions, signal peptidase functions, neuroactive ligand-receptor interactions, and glycoprotein associated functions; these were the five with the lowest Benjamini values (Additional file 5: Table S2A-D). Figure 2 displays this.

Discussion
This study was the first preliminary step in a large scale population wide cataloguing of genomic features in the Bangladeshi population. Our primary focus was on the characterization of specific genetic signatures that impact various phenotypic features of Bangladeshi individuals.
As part of this initial study involving four individuals, we identified a number of clinically relevant variants.
A number of genes containing significant numbers of variants in the Bangladeshi samples implicated various heart associated disorders when analyzed. The genes CSMD1, EPHA3, and PTPRD were all linked with heart or heart associated disorders by DAVID's algorithm [8,9]. Previous studies have often listed cardiovascular disease as one of the biggest causes of mortality in Bangladesh [10] and possible genetic links that may predispose the population to these conditions should be investigated further. A number of non-pathogenic clinically relevant variants were also identified in these genes; for example the chr6:121447564 G > A change which is classified as benign [11]. There were also a number of genes which only contained non-pathogenic variants in our samples but have been known to harbor pathogenic ones in other populations, such as GABRD and SKI (both known to contain pathogenic variants associated with heart diseases) [12,13]. SKI contained two nonsynonymous substitutions in our samples.
The two heart disease associated pathogenic variants occurred in the low density lipoprotein receptor and the transmembrane protein 260. The former holds obvious importance with regards to cardiac health. In fact a number of mutations in this gene have been associated with familial hypercholesterolemia in other populations [14]. Furthermore, this protein is also believed to interact with C. difficile toxins and facilitate toxin entry into cells [15]. This is a pathogenic bacteria that causes infections of the gut and diarrhea. Such diseases have high incidence in Bangladesh, albeit caused by other agents such as Vibrio cholerae. It is interesting to speculate whether the presence of these variants can also make individuals more vulnerable to other gut infection causing microbes such as Vibrio cholerae. The variants associated with Pseudox-anthoma_elasticum also hold potential significance in this regard. One of the major clinical manifestations of this genetic disorder is atherosclerosis; providing yet more evidence of genetic predisposition to heart conditions [16].
Finally, we identified 28 genes with previously unidentified unique mutations. Most of these genes were associated with alternative splicing and other functions connected with polymorphisms in the DNA/RNA (Additional file 6: Table S3A, and 3B).
Although we have used four individuals in this study, the genome analysis of these individuals have provided many known and unknown variants which may have an impact on the health of these individuals as well as the broader Bangladeshi population. Additionally, as we add more subjects to this initiative, we will then be able identify and validate variants that are unique to the Bangladeshi population, as well the resultant genetic predispositions.

Limitations
The major limitation of this study was the small sample size and the consequent limited analysis, especially in finding the disease-variant correlation. While we are confident in the fact that our major findings are the clinical variants that have been known to cause diseases in other populations, thus rendering the possibility of these variants arising as a result of sequencing error very small, a larger sample size would nonetheless add more validation. The findings listed and discussed here, along with the data available in the additional files can be used to focus genome research endeavors in the future.