Dynamic bimodal changes in CpG and non-CpG methylation genome-wide upon CGGBP1 loss-of-function

Objectives Although CpG methylation is well studied, mechanisms of non-CpG methylation in mammals remains elusive. Studying proteins with non-CpG cytosine methylation-sensitive DNA-binding, such as human CGGBP1, can unveil cytosine methylation regulatory mechanisms. Here we have resequenced a published genome-wide bisulfite sequencing library and analyzed it at base level resolution. CpG, CHG and CHH (where H is any nucleotide other than G) methylation states in non-targeting or CGGBP1-targeting shmiR lentivirus-transduced cells have been analyzed to identify how CGGBP1 regulates CpG and non-CpG methylation. Results We report that CGGBP1 acts as a dynamic bimodal balancer of methylation. Both gain and loss of methylation observed upon CGGBP1 depletion were spatially overlapping at annotated functional regions and not identifiable with any sequence motifs but clearly associated with GC-skew. CGGBP1 depletion caused clustered methylation changes in cis, upstream of R-loop forming promoters. This was complemented by clustered occurrences of methylation changes in proximity of transcription start sites of known cytosine methylation regulatory genes, altered expression of which can regulate cytosine methylation in trans. Despite low coverage, our data provide reliable estimates of the spectrum of methylation changes regulated by CGGBP1 in all cytosine contexts genome-wide through a combination of cis and trans-acting mechanisms. Electronic supplementary material The online version of this article (10.1186/s13104-018-3516-1) contains supplementary material, which is available to authorized users.


Legends for Supplementary
Percentage of A+T and G+C in total sequenced data of S1 (in presence of CGGBP1) and S2 (after depletion of CGGBP1). Table S2: Percentage distribution of A+T and G+C in mapped and unmapped sequences of S1 and S2. For a 2 x 2 contingency analysis (Chi-square test) of A+T and G+C in S1 and S2, the p value < 0.0001 for mapped, unmapped and all combined reads. Table S3: Distribution of cytosines undergoing change in methylation (GoM or LoM) upon CGGBP1 loss-of-function in repeats (N) and non-repeats (C or G). Table S4: Context-wise abundance of GoM and LoM (as ratios) of all cytosine sequenced in both the samples. Table S5: Context-wise abundance of total methylated and total unmethylated cytosines (as ratios) of all cytosine sequenced in both the samples. Table S6: Chi-square test (using GraphPad Prism 7) for significance for the data presented in figure  1 (D to I) show that the difference between S1 and S2 are highly significant. Table S7: Repeat content analysis (using RepeatMasker) in consistently differentially methylated sequences between S1 and S2. Table S8. Pattern of methylation change occurring in different genomic landmarks upon CGGBP1 depletion. Observation summaries presented here are derived out of only those cytosines that were sequenced in both samples S1 and S2 and methylated in at least on of them.    Table S11: Inter-and intra-strand correlation of GoM and LoM at replication origins. Replication origins with no change in methylation in both S1 and S2 have been excluded. Remarkably and unexpectedly, the GoM and LoM events are highly correlated on the same strand as compared to either GoM or LoM on separate strand. This shows that at the replication origins the methylation change in opposite directions tend to be concurrent on the same strand. The values presented are R correlation coefficients. Number of pairs = 5788 minimum and 7741 maximum for ORC1, 5666 minimum and 6704 maximum for ORCA Early, 5211 minimum and 6159 maximum for ORCA Mid, 715 minimum and 806 maximum for ORCA Late, and, 14989 minimum and 12409 maximum for PHIP.

Legends for Supplementary figures
Fig S1: A: Distribution of cytosines either exhibiting change (GoM or LoM) or no change (RoM or RuN) of methylation upon CGGBP1 loss-of-function in Giemsa bands. Higher GoM/LoM ratios were associated with G-rich Giemsa-negative R-bands with progressive decline in GoM/LoM ratio from G-negative towards fully Giemsa-positive G-100 bands. Welch's unpaired T test was performed between the GoM/LoM values for all combinations of G-neg (n=780), G-pos-25 (n=87), G-pos-50 (n=121), G-pos-75 (n=89) and G-pos-100 (n=81). The p values ranged between, maximum 0.0093 and minimum <0.0001. B to D: A base level measurement of cytosine methylation states in the three contexts CpG, CHG and CHH showed a highly context-dependent effect of CGGBP1-depletion on cytosine methylation. In CpG context (B) the normally higher levels of methylated over unmethylated cytosines in S1 was reversed by CGGBP1 depletion. Hence in S2 the net methylation state of CpG was more unmethylated than methylated (S1 total methylation compared to S2 total methylation, Obs/Exp Chi-square p value ). C and D: The unmethylated cytosine count also increased in the CHG (C) and CHH (D) contexts highly significantly (Obs/Exp Chi-square p value < 0.0001 for all contexts). However the balance between unmethylated and methylated cytosines was not reversed by CGGBP1 depletion. This differential effect of CGGBP1 depletion on CpG unlike CHG and CHH contexts could be partially due to the prevalence of the CHG and CHH contexts (the Y axes in C and D are higher by an order of magnitude than in B).

Fig S2:
CGGBP1 depletion causes methylation change that is associated with functional genomic regions. A: Cytosine methylation levels in 2.5 Kb flank from midpoint of CTCF-binding sites was plotted. Increased cytosine methylation level was observed at the midpoint of CTCF-binding sites. B: Similarly, cytosine methylation level in 10 Kb flank from midpoint of permissive enhancers was plotted. Decreased cytosine methylation level was observed at the midpoint of the permissive enhancer elements. C and D: Distribution of methylated cytosines in 10 Kb flank from TAD boundaries (TAD start sites (C) and TAD end sites (D)) was plotted. Cytosine methylation level was not altered at TAD boundary in presence and absence of CGGBP1. E and F: Distribution of methylated cytosines in 1 Mb flanks of LAD boundaries (LAD start sites (E) and LAD end sites (F)) was plotted. Mild long-range increase in cytosine methylation was observed in S2 compared to S1 at LAD start and end sites. Red line = S2, Blue line = S1. X axis represents genomic location from the centre of the genomic coordinates of the indicated genomic landmarks. Y axis represents methylated cytosine counts in bins with sizes as indicated. Plots were generated using deepTools and coordinates plotted correspond to Hg38 assembly. and LoM (B) was plotted for 10kb flanks in S1 (blue) and S2 (red). Differential abrupt increases in cytosine methylation was observed as expected at the centre of LINE-1 sequences as these constitute approximately 20% of GoM and LoM regions. Despite difference in levels of cytosine methylation in the centre of LINE-1 sequences, it was consistently lower in S2 throughout the 10kb flanking regions. Thus, the disturbance in methylation caused by CGGBP1 depletion seemed to be occurring differently at the LINE-1 elements than at the flanking regions. Red line = S2, Blue line = S1. X axis represents genomic location from the centre of LINE-1 regions in GoM or LoM regions. Y axis represents methylated cytosine counts in 100 bp bins. LINE-1 coordinates from RepeatMasker output on GoM and LoM sequences were used to extract sequences from Hg38 assembly. Plots were generated using deepTools.

Fig S4:
Differential methylation regulation at G4-forming replication origins by CGGBP1. A and B: Distribution of methylated cytosines for S1 and S2 from the centre of origin (10Kb flanks) was plotted. Replication origins occupied by ORC1 maintained low levels of cytosine methylation at the origin centres as compared to the flanking regions (A). However, a subset of ORC1 peaks that forms G4 structure showed relative higher cytosine methylation at the centre upon CGGBP1 depletion (B). C and D: Similarly, replication origins occupied by PHIP showed relative decrease in the cytosine methylation at the centre (C) as compared to that of flanking regions. In contrast, there was significant increase in methylated cytosine at the centre of G4-forming PHIP origins upon CGGBP1 depletion (D). Red line = S2, Blue line = S1. X axis represents genomic location from the centre of ORC1 and PHIP ChIP seq peaks. Y axis represents methylated cytosine counts in 100 bp bins.

Fig S5:
Distribution of strand specific cytosine methylation at promoters with GC-skew shows a consistent increase in methylation only upstream of TSS. In presence of CGGBP1, a positive gradient of cytosine methylation from 1kb upstream to 1kb downstream is observed. Upon CGGBP1 depletion this gradient is disturbed giving rise to a mild but very consistent region of methylation gain immediately upstream of the TSS with respect to direction of transcription. Red line = S2, Blue line = S1. X axis represents genomic location from the centre of GC-skew promoter coordinates. Y axis represents methylated cytosine counts in 50 bp bins.

Cell culture, shmiR transduction, bisulfite conversion of DNA, library preparation and Illumina sequencing and raw data analysis
The sequences analyzed in this study were obtained by sequencing of libraries described earlier [1]. The source of DNA, methods of DNA extraction, bisulfite conversion (including lambda phage DNA as spike in for calculating conversion efficiency) and library preparation apply as described before. The sequences obtained for this analysis are different from the ones already published [1]. Briefly, normal human foreskin fibroblasts (1064Sk, passage 12), cultured in MEM (SIGMA) supplemented with 10% FCS (SIGMA) and 0.05% Glutamine (SIGMA) were transduced with CGGBP1-targetting or non-targeting lentiviral shmiRs (Dharmacon Thermoscientific; V3LHS_351828, V3LHS_351826 and V3LHS_351824 for specific and RHS4348 for non-targeting constructs). 96 h post-tranduction, CGGBP1 knock-down was confirmed by western blot and genomic DNA was extracted. After bisulfite conversion (EZ methylation Kit, Zymo Research), library was prepared by using NEXTflex Bisulfite (Boo Scientific (5119-01)) as per manufacturer's protocol. Per 1 microgram DNA, 500 pg Lambda DNA spike (500 pg/ul) to be used as a spike-in control. Paired sequence reads (Illumina) were unpaired before mapping. After trimming of adaptor sequences and filtering out of sequences with ambiguous nucleotides (N), the mapping was performed against unmasked Hg38 using Bismark (version 0.16.2) with inbuilt Bowtie2. The mapping parameters were: "-f --score-min L,0,-0.2 --ignore-quals". Subsequently, Bismark methylextractor was used to extract context and strand specific methylation data for cytosines with specified base locations. In both samples the Lambda DNA spike in controls indicated 95% conversion efficiency.

Nucleotide Composition analyses
For nucleotide composition analyses of the entire sequence data, sequences of all reads (premapping) were used. For nucleotide composition analyses of the mapped sequence data, sequences of all the uniquely mappable reads were used. The data for unmapped reads were derived by subtracting the values for mapped reads from the total reads. Nucleotide compositions were obtained using the EMBOSS tools Compseq and NucBed.

Analyses of genomic coordinates and sequences of regions
GoM, LoM, RoM and RuN locations were obtained by using Bedtools intersect tool on the Bismark methylextractor output. Bedtools merge tool was used to generate regions out of single base coordinates for analyses at the level of regions and their sequences. The parameters of merging are detailed in figure S2. Sequences of thus obtained bed coordinates were extracted from Hg38 (repeat-masked or unmasked as described) using Bedtools getfasta tool. To identify the number of cytosines (methylated or unmethylated as described) in any bed coordinate, Bedtools coverage tool was used with coverage count option. The distance between two different sets of bed coordinates were obtained using Bedtools Closest tool. For methylation analyses at region level, we retained only the cytosines exhibiting change of methylation only between S1 and S2 (consistently methylated or unmethylated) and filtered out those exhibiting opposite methylation states within a sample (inconsistently methylated).

Repeat content analyses
The repeat-masked and unmasked genome Hg38 were used as available from UCSC genome browser. The subsets of sequence data were repeat-masked using RepeatMasker [2]. The repeat search engine used was RMBLAST (NCBI) and the repeat database used was obtained from RepBase[2,3].

Motif finding
Motif search was done using locally installed versions of MEME (version 4.11.2) [4] suite tools dreme and fimo. Unless described, the motif searched were performed using default options with changes in -k value ranging from 5 to 14.

G4 quadruplex prediction
G4 quadruplex prediction in replication origin peaks was performed using QGRS tool [5]. For identification of G4 forming sequences in larger datasets, the sequence signature {G(3)N(3-

GC skew
GC skew was calculated using the formula {(G-C)/(G+C)} [4,6]. The calculations were performed using either inbuilt functions of OpenOffice Spreadsheet or using shell commands of UNIX.

Plotting of signals in genomic coordinates
Methylation signals from S1 and S2 were plotted along genomic coordinates using deepTools [7]. Bed to BigWig conversion was done using UCSC scripts bedItemOverlapCount, bedGraphToBigWig. Plots were generated using plotProfile function on the matrix generated using computeMatrix function.

Statistical analyses
All statistical tests were performed using Prism 7 (GraphPad) on numerical data generated from the above mentioned tools including OpenOffice Spreadsheet.