Improved library preparation protocols for amplicon sequencing-based noninvasive fetal genotyping for RHD-positive D antigen-negative alleles

We aimed to simplify our fetal RHD genotyping protocol by changing the method to attach Illumina’s sequencing adaptors to PCR products from the ligation-based method to a PCR-based method, and to improve its reliability and robustness by introducing unique molecular indexes, which allow us to count the numbers of DNA fragments used as PCR templates and to minimize the effects of PCR and sequencing errors. Both of the newly established protocols reduced time and cost compared with our conventional protocol. Removal of PCR duplicates using UMIs reduced the frequencies of erroneously mapped sequences reads likely generated by PCR and sequencing errors. The modified protocols will help us facilitate implementing fetal RHD genotyping for East Asian populations into clinical practice.


Introduction
Alloantibodies against Rh antigens represent the main cause of hemolytic disease of the fetus and newborn (HDFN). The D antigen is the most highly immunogenic among Rh antigens. RhD-negative women become sensitized to the D antigen and subsequently produce anti-D antibodies when they carry an RhD-positive fetus. Although anti-D prophylaxis by postnatal and antenatal anti-D Ig administration has been highly successful in reducing the incidence of HDFN worldwide [1], it is unnecessary for RhD-negative women who carry an RhD-negative fetus. Fetal RHD genotyping makes it possible to prevent unnecessary anti-D administration in such pregnancy cases. The fetal RHD genotyping method widely implemented in western countries is designed to detect the presence or absence of the RHD wild-type allele of fetal origin in the plasma of RhD-negative pregnant women, over 99.9% of whom are homozygous for RHD deletion alleles in Caucasian populations. Because of relatively high frequencies of RHD-positive RhD-negative alleles, RHD*01EL.01 and RHD*01N.04,

BMC Research Notes
*Correspondence: nakabaya-k@ncchd.go.jp; hata-k@ncchd.go.jp † Kazuhiko Nakabayashi and Kenichiro Hata contributed equally to this work 1 Department of Maternal-Fetal Biology, National Research Institute for Child Health and Development, 2-10-1 Okura, Setagaya, Tokyo 157-8535, Japan 7 Laboratory of Developmental Genomics, National Research Institute for Child Health and Development, 2-10-1 Okura, Setagaya, Tokyo 157-8535, Japan Full list of author information is available at the end of the article among RhD-negative individuals (9.0% and 2.9%, respectively, in the Japanese population), the same genotyping method was not applicable in East Asian countries. The RHD*01EL.01 allele contains a single nucleotide variant at the last nucleotide of exon 9 (c.1227G/A), which likely disrupts normal splicing [2]. The RHD*01N.04 allele is a hybrid allele, in which exons 3-9 of the RHD gene are replaced with those of RHCE [3].
We have recently established an amplicon-based noninvasive fetal genotyping method that distinguishes the wild-type RHD allele not only from the RHD-negative D antigen-negative allele (the RHD deletion allele), but also from RHD-positive D antigen-negative alleles [4]. This method requires PCR amplification from four genomic intervals, upstream and downstream Rhesus boxes, RHD exon 9, and RHCE exon 9. Because of extremely high sequence similarities between two Rhesus boxes and between RHD exon 9 and RHCE exon 9, we designed two primer pairs to amplify these four regions (Fig. 1A, Additional file 1: Fig. S1). One primer pair perfectly matches with two genomic intervals, upstream and downstream Rhesus boxes, and amplifies 105-bp PCR products. The other primer pair also perfectly matches with two genomic intervals, RHD exon 9 and RHCE exon 9 regions, and amplifies 148-bp PCR products. The 105bp PCR products contain one base difference that distinguishes two Rhesus box sequences. The 148-bp PCR products contain two base differences that distinguish two genes, and also cover the point mutation site in exon 9 of the RHD*01EL.01 allele (c.1227G/A) that distinguishes it from the wild-type allele (RHD*01). Although two regions are co-amplified with one primer pair, attachment of adaptor sequences to the PCR amplicons followed by NGS allowed us to accurately map each of the co-amplified sequences to its origin because of the one or two base differences between two regions, in the data analysis procedure [4].
In this study, we simplified our fetal RHD genotyping protocol by changing the adaptor attachment method from ligation (Fig. 1B, C) to a one-step PCR (Fig. 1E, F). We also evaluated whether introduction of unique molecular indexes (UMIs) [5,6] (Fig. 1H, I) improves the quantitative accuracy in measuring the ratios of RHD alleles in cfDNA.

Blood collection and DNA extraction
Cell-free DNA in maternal plasma was extracted using the Mag MAX Cell-Free DNA Isolation Kit (Thermo-Fisher Scientific, A29319) as described previously [4]. Individuals with three major RhD-negative genotypes, RHD*01N.01/RHD*01N.01, RHD*01N.01/RHD*01.04, and RHD*01N.01/RHD*01EL.01 in the Japanese population, and those with two RhD-positive genotypes, RHD*01/RHD*01 and RHD*01/RHD*01N.01, were identified as described previously [4]. (Fig. 1E, F) A tailed-forward primer ("Tailed_F") contains the Illumina forward (P5) adaptor sequence (70 bases) including an 8-bp index followed by a target-specific forward primer sequence (20 or 22 bases) at the 3′ end (90 or 92 bases in total). A tailed-reverse primer ("Tailed_R") contains the Illumina reverse (P7) adaptor sequence (66 In the panels C, F, and I, genomic DNA is shown by a black line; PCR primers that target genomic DNA sequences are shown by blue arrows; adaptor sequences are shown in dark/light green/orange; 12-base UMIs are shown in red. Index sequences to de-multiplex sequence reads after sequencing of pooled libraries are omitted for simplicity. Panels D, G, and J show representative electropherograms of Rhesus box amplicons ("box"), for RHD/RHCE exon 9 amplicons ("ex9"), and for multiplex amplification by two pairs of primers ("multiplex"). Horizontal and vertical axes of each electropherogram represent fluorescent intensity and DNA size (bp), respectively. Lower and upper marker peaks are present at the positions of 35 bp and 10,380 bp, respectively. In our conventional [4] and one-step PCR protocols (C, F), multiplex PCR using two pairs of primers gave rise to specific amplification of target genomic intervals (D, G). K A data analysis workflow showing the informatics tools used and their step-by-step functions. L Bam coverage tracks of the RHD genotyping results by the one-step PCR protocol (F) for three combinations of the mixtures of genomic DNA (combinations #1, #2 and #3 in Additional file 2: Table S2) and for two cfDNA samples (#46 and #59) from RhD-negative pregnant women (Table 1) visualized using IGV. The mapped read patterns indicated that the RHD genotypes of the mother (Mo) and the fetus (Fe) were RHD*01N.01/RHD*01N.01 and RHD*01/RHD*01N.01, respectively, in both cases. The vertical ranges of mapped read numbers were 0-25,000 for the top three panels, 0-40,000 for cfDNA#46, and 0-150,000 for cfDNA#59 Step 3

Preparation of amplicon sequencing libraries by one-step PCR
Step 3 Step 9 multiplex box ex9 Step

Preparation of amplicon sequencing libraries with unique molecular identifier (UMI) sequences (Fig. 1H, I)
Cell-free DNA was concentrated using a centrifugal evaporator (  Twenty out of the 50 µL reactions were purified using 0.9 times the volume (18 µL) of Agencourt AMPure XP repeated twice, and eluted with 10 µL distilled water. The final libraries were electrophoresed and sequenced as described above.
Data analysis (Fig. 1K) For amplicon libraries without UMIs, fastq files were generated using bcl2fastq V2.20.0.422 (Illumina), and trimmed for adaptor sequences using fastp ver.0.21.0 [7]. Read 1 and read 2 sequences were merged using FLASH ver.1.2.11 [8] with a parameter of "-max-mismatch-density = 0", and the merged sequences were filtered by their expected size (105 bases for Rhesus boxes and 148 bases for RHD/RHCE exon 9) to remove reads with unexpected sizes (such as primer dimers and PCR artefacts). The merged sequences were mapped to the hg19 reference genome using "bwa aln" and "bwa samse" commands of bwa-0.7.17 [9]. By using samtools -1.4.1 [10], the resultant bam file was sorted in a positional order and filtered by base quality scores (cutoff 25) to remove low quality reads, by mapping scores (cutoff 23 for Rhesus boxes and 37 for RHD/RHCE exon 9) to select uniquely mapped reads. The mapped read numbers were counted for each of the four bases at each nucleotide position of the amplicons using IGVTools_2.3.94 (https:// softw are. broad insti tute. org/ softw are/ igv/ igvto ols) (igvtools count-w 1-bases), and output as a.wig file. Subsequently, the numbers of G at chr1:25,592,628 and of A at chr1:25,662,955 were extracted as read counts of the upstream and the downstream Rhesus boxes, respectively. The numbers of G and A at chr1:25,648,453 as the read counts of the wildtype allele and the c1227G > A allele of RHD exon 9, and the number of C at chr1:25,696,958 as the read count of RHCE exon 9 were extracted. The mapping results were further examined for the existence of unexpected variants or potential sequence errors by visualizing the bam file data for four regions corresponding to the amplicons using IGV (https:// softw are. broad insti tute. org/ softw are/ igv/), and by inspecting the minor allele frequency of each nucleotide position of the amplicons using the text data (.wig file) generated by IGVTools. For amplicon libraries with UMIs, fastq generation, adaptor trimming, and merge of paired reads were performed as described above. The resultant merged sequences were further processed with AmpUMI.py 1.2 [6] to remove PCR duplicate reads using 12-base UMI sequences located at the beginning of the original read 1. The sequence reads after removing PCR duplicates were further processed as described above.
The.wig file data were used to calculate the ratio of the read number containing the bases other than the reference base to the total read number at a single nucleotide (i.e., error ratio). These read numbers were initially counted for each nucleotide of the four amplicon regions (upstream and downstream Rhesus boxes, RHD exon 9, and RHCE exon 9). Error ratios were subsequently calculated using the total numbers for the positionally identical bases between the upstream and downstream at Rhesus box amplicons (for 105 positions) and between RHD exon9 and RHCE exon 9 amplicons (for 147 positions excluding the position of the c.1227A > G variation at chr1: 25,648,453).

Results and discussion
We established one-step PCR conditions (Fig. 1E-G) and tested on twelve combinations of genomic DNA mixtures of two individuals (A and B) at a 10:1 ratio, which served as approximation models of cfDNA from RhD-negative pregnant women. "A" corresponds to the mother and is any of three RhD-negative genotypes (RHD*01N.01/ RHD*01N.01, RHD*01N.01/RHD*01EL.01, or RHD*01N.01/RHD*01N.04), and "B" corresponds to the fetus and is any of four genotypes (one RhD-positive genotype [RHD*01/RHD*01N.01] or three RhD-negative genotypes). These twelve combinations cover 93.6% of possible genotype combinations of a fetus and an RhDnegative pregnant woman in the Japanese population. The observed ratios of amplicons from Rhesus boxes and from RHD/RHCE exon 9 were mostly consistent with the expected ratios (Additional file 2: Table S2). The reliability for estimating the fetal RhD type of the newly established one-step PCR protocol was confirmed to be comparable with that of the conventional protocol ( Fig. 1B-D) [4]. Examples of mapped read data visualized by IGV are provided (Fig. 1L).
UMIs have been used to confidently detect PCR duplicates in NGS applications [6], and have been shown to reduce sequencing error rates and to increase analytical specificity in various studies, including those for NIPT [11,12]. We established linear and subsequent exponential amplification conditions to introduce 12-base UMIs to the amplicon libraries ( Fig. 1H-J), and tested the protocol on the twelve combinations of genomic DNA mixtures described above (Additional file 2: Table S3). The results we obtained by the UMI protocol with our standard filtering conditions were almost completely free of erroneous reads, and were not suitable to assess the efficacy of UMI to reduce errors. When we analyzed the same dataset with less stringent filtering conditions, we confirmed that the UMI-based removal of PCR duplicates lowered the ratio of erroneous reads for the majority of cases (data not shown). We observed a stronger tendency of the amplification bias towards RHCE exon 9 over RHD exon 9 in the dataset by the linear & PCR amplification protocol (Additional file 2: Table S3) than that by one-step PCR (Additional file 2: Table S2). Further optimization of the linear and the PCR amplification conditions is required to minimize the observed amplification biases.
PCR and sequencing errors are inherent in the current NGS technologies [13]. We calculated the ratios of such errors presumably generated during library preparation and sequencing procedures and retained after applying read filtering conditions in the data analysis procedure. When the mapped read data of twelve samples in Additional file 2: Table S2 were analyzed, the highest error ratios were 0.06% and 0.11% for Rhesus box amplicons and RHD/RHCE exon 9 amplicons, respectively (Additional file 1: Fig. S2). Although potential PCR or sequencing errors at the nucleotide positions to distinguish the origin of each sequence read were observed at low frequencies, as shown in red in Table 1, Additional file 2: Tables S2 and S3, their ratios were below the calculated background levels except for one case (0.19%) detected for the fourth combination of genomic DNA mixture (Additional file 2: Table S2). Such a high error ratio indicates the possibility of carry-over contamination from previous PCR assays. UMI is expected to be useful as a means to remove such contaminated reads.
In NIPT methods, fetal fraction (FF), the ratio of fetal DNA in cfDNA in maternal plasma, has been recognized as the most critical factor for their diagnostic accuracy [14]. The sensitivity of trisomy 21 detection dropped from 99 to 75% when FF was below 8% [15]. In our method, when the RHD*01 allele is detected in cfDNA from maternal plasma, it determinately demonstrates that the fetus is RhD-positive. On the other hand, when the RHD*01 allele is undetected, it leaves two possibilities, namely, that the fetus is RhD-negative and homozygous for the RHD deletion (RHD*01N.01) allele, or that the assay failed to detect an RhDpositive allele of the fetus due to a low FF. Accurate determination of the target molecule number in a genotyping PCR reaction through UMI-based removal of PCR duplicates helps us better presume which one of the possibilities is more likely.