ssvQC: an integrated CUT&RUN quality control workflow for histone modifications and transcription factors

Objective Among the different methods to profile the genome-wide patterns of transcription factor binding and histone modifications in cells and tissues, CUT&RUN has emerged as a more efficient approach that allows for a higher signal-to-noise ratio using fewer number of cells compared to ChIP-seq. The results from CUT&RUN and other related sequence enrichment assays requires comprehensive quality control (QC) and comparative analysis of data quality across replicates. While several computational tools currently exist for read mapping and analysis, a systematic reporting of data quality is lacking. Our aims were to (1) compare methods for using frozen versus fresh cells for CUT&RUN and (2) to develop an easy-to-use pipeline for assessing data quality. Results We compared a workflow for CUT&RUN with fresh and frozen samples, and present an R package called ssvQC for quality control and comparison of data quality derived from CUT&RUN and other enrichment-based sequence data. Using ssvQC, we evaluate results from different CUT&RUN protocols for transcription factors and histone modifications from fresh and frozen tissue samples. Overall, this process facilitates evaluation of data quality across datasets and permits inspection of peak calling analysis, replicate analysis of different data types. The package ssvQC is readily available at https://github.com/FrietzeLabUVM/ssvQC. Supplementary Information The online version contains supplementary material available at 10.1186/s13104-021-05781-8.


Introduction
The genome-wide profiling of chromatin-associated proteins and posttranslational modifications (PTMs) to histones has revolutionized research in epigenetic gene regulation [1][2][3]. One of the mainstay techniques used to map the enrichment patterns of PTMs and proteins is ChIP-seq, in which proteins are physically crosslinked to target DNA, immunoprecipitated with specific antibodies, and their crosslinks are reversed for downstream deep sequencing and analysis of enriched DNA [4,5].
ChIP-seq assays typically require a large amount of starting material in the range of millions of cells per reaction, and depending on the antibody, can result in a high amount of background. A recently developed method termed Cleavage Under Targets and Release Using Nuclease (CUT&RUN) is becoming widely employed in laboratories for mapping the genomic interactions of proteins [6,7]. CUT&RUN is an in situ genome-wide profiling method that employs an antibody-targeted micrococcal nuclease (MNase) fusion protein to selectively digest and release DNA fragments at protein binding sites. Overall, this method results in considerably lower background compared to ChIP-seq [7], and can be employed using much less starting material [8].
While multiple tools exist for CUT&RUN data processing and peak calling [9,10], there is a need for software to conduct CUT&RUN data quality control (QC), which should be performed prior to detailed data analysis and interpretation. To increase the reproducibility and reliability of the data, a detailed evaluation of the data is essential when performing the assays in different sources of biological material and when testing new antibodies. In the event of a low-quality assay (high background or low signal), a comparative QC process is vital to understanding possible sources of error to adjust experimental conditions and improve data quality.
Here we introduce ssvQC as a framework for CUT&RUN data quality assessment. We use this tool to evaluate cell storage protocols for performing CUT&RUN from a limited number of cells and demonstrate that frozen cells from spleen can be used to obtain consistent genome-wide enrichment profiles for transcription factor and histone modifications generated by CUT&RUN. Overall, ssvQC integrates a set of useful QC metrics to reliably summarize data quality and can be implemented as a user-friendly R package.

Ethics statement
Animal experiments were approved by University of California San Francisco animal research (# AN177290) and University of Vermont animal research ethics committees (IUACUC# PROTO201900021) and were carried out in accordance with approved guidelines.

Mice
C57BL/6 mice were obtained from The Jackson Laboratory (Bar Harbor, USA). Mice (6-8 weeks old) of both genders were used for experiments. The animals were bred and maintained either at the animal facilities at University of California San Francisco, or at the University of Vermont. Animals were sacrificed by CO 2 asphyxiation followed by cervical dislocation.

Magnetic B cell enrichment
Spleens were harvested and single-cell suspensions were prepared. Cells were washed once with PBS and red blood cells were lysed with ACK lysis buffer (150 mM NH 4 Cl, 10 mM KHCO 3 , 0.1 mM Na 2 EDTA). B cells were isolated using MACS (Miltenyi Biotech, cat # 130-090-862). Purities of enriched B cells of > 90% were determined by flow cytometry. Freezing was performed by resuspending cell pellets in 10% DMSO/90% FBS and then slowly frozen at -1 °C/minute.

Data processing, peak calling and analysis
For data preprocessing, we used the CUT&RUN tools pipeline as described [10].

Implementation of ssvQC
ssvQC is implemented in R and can run on both MacOS, Windows, and Linux (bigWig operations will not work on Windows). The package can be found at https:// github. com/ Friet zeLab UVM/ ssvQC. The tool is flexible to use and can be run with the minimal inputs of several bam files and corresponding peak files or can be fully controlled via a plain text configuration file. Data processing operations are implemented in parallel where possible for speed and results are cached to prevent redundant work. All plot outputs are ggplot objects to allow easy customization and intermediate tidy formatted data objects are easily accessible to allow complete control by users. A complete QC report is outputted as a CSV file.

Results
To study the gene regulatory mechanisms of B cells, we aimed to generate genome-wide datasets for regulatory histone modifications and transcription factors (TFs) from a mature B cell population isolated from fresh mouse spleen tissue. Because cell isolation from tissue can be time consuming and the downstream processing of fresh material is not always possible, we wanted to evaluate if isolated cells could be frozen using a slow freezing protocol (see methods) for CUT&RUN assays and how the results would compare to that of fresh material (Fig. 1A). We therefore performed a pilot CUT&RUN experiment using antibodies specific for the histone modification H3K4me3, the TF Ikaros and control IgG, using fresh or frozen cells.
To assess the overall data quality, we implemented a uniform data processing pipeline and developed a quality control package called ssvQC to systematically collect and represent CUT&RUN data quality metrics. The overall data processing workflow is illustrated in Fig. 1B, showing the preprocessing steps leading up to the input data files for ssvQC. The output of preprocessing steps includes mapped BAM files, a signal BigWig file, and the significantly enriched regions in narrowPeak, broadPeak and SEACR bed file format. We also implemented a routine to upload signal files to UCSC genome browser for generating shared sessions, facilitating sharing of data. ssvQC can similarly be applied to other sequence enrichment assays, including ChIP-seq and ATAC-seq.
We used ssvQC to compare the results of Ikaros and H3K4me3 CUT&RUN profiles generated from fresh and frozen B cell samples. The total number of mapped reads ranged from 6-14 million reads per dataset (Additional file 3: Table S1). The number of peaks for H3K4me3 was similar between samples, whereas the number of peaks called for Ikaros differed between datasets (10,809 and 22,022 for fresh and frozen samples, respectively). The FRiP score is defined as the fraction of reads that fall into a peak and is often used as a measure of data quality. Each dataset showed a relatively high FRiP scores, > 0.25 for H3K4me3 and > 0.13 for Ikaros (Additional file 3: Table S1). These scores are in accord with the ENCODE guidelines for ChIP-Seq (FRiP score > 0.1) [4] and are similar to CUT&RUN/CUT&Tag reported scores [11]. Thus, slow freezing of B cells provides equivalent FRiP scores to that of fresh B cells. Next, we performed CUT&RUN for Ikaros and the histone modification H3K4me3 with frozen biological replicates and compared data quality with ssvQC. Overall, the data was consistent between replicates, exhibiting a comparable number of mapped reads, number of peaks, FRiP scores, and Pearson correlation of read counts ( Fig. 2A-D). ssvQC was further used to directly compare the peak overlap for each factor, and depicted a moderate degree of overlap across datasets (Fig. 2E, F and Additional file 1: Fig. S1). The data stored by ssvQC is compatible with methods such as ChIP-peakAnno that perform permutation tests to assess statistical significance of peak overlap [12]. Additionally, the strand cross-correlation (SCC) analysis of ssvQC detects the clustering of reads in CUT&RUN dataset independently of peak calling. ssvQC determines the strand-shift profile based on the position of reads on either strand and shifts them towards one another, and then calculates the correlation between the position of reads on the two strands at each progressive shift. Overall, each dataset produced provided a similar SCC score, providing an assessment of data quality independently of peak call (Additional file 2: Fig. S2). Finally, ssvQC can be used to cluster datasets to provide a valuable visual inspection of data quality across genomic regions of interest (Fig. 3).

Discussion
Despite the increasing application of CUT&RUN for epigenomic profiling, there is not currently any agreement on experimental protocols starting with different types of biological material. Quality control is a critical process when validating these methods. While there are a variety of methods for gathering quality control metrics (library size, alignment rate, read duplication rate, fragment size distribution, E. coli aligned read fraction, etc.), including as part of the CUT&RUNTools pipeline, it is still difficult and time consuming to sythesize metrics into an overall view of data quality. Along with numerical metrics, ssvQC provides visualizations of signal enrichment which allow users to make qualitative judgments about their data.
In our experience, fresh biological material from cultured cells or tissues generally provides superior results over frozen material for CUT&RUN. However, when performing lengthy cell isolation procedures such as tissue dissection and cell sorting, and when working with collaborating laboratories at different institutions where shipping of biological material is required, it is not always feasible to proceed directly with sample processing. We therefore wanted to evaluate a slow freezing protocol To compare the quality of datasets from different sample sources, we developed the ssvQC package to incorporate a set of valuable QC metrics for the evaluation of data quality of CUT&RUN experiments. The ssvQC package reports these metrics in a clear report with understandable graphics. Because ssvQC utilizes the data.table package, it can efficiently process large collections of CUT&RUN, ChIP-seq, ATAC-seq and other sequence enrichment data. Users can reproduce data QC and analysis results in a uniform processing pipeline and Fig. 2 Evaluation of CUT&RUN peaks for transcription factors and histone modifications with replicates using the ssvQC package. The ssvQC output plots for H3K4me3 and Ikaros CUT&RUN replicates from frozen B cells, showing A total mapped reads, B total number of called peaks, C fraction of reads in peaks (FRiP) scores D Genome-wide correlations of reads in 10 kb bins. The overlaps of peak regions for E H3K4me3 and F Ikaros were compared via binary heatmap, upset, and venn diagrams integrated software package in R. We performed replicate CUT&RUN from frozen B cells and show features of ssvQC to assess data quality of multiple sequence enrichment datasets.

Limitations
This study employed an analysis of single fresh and frozen samples, and more replicates and conditions could be incorporated to more rigorously evaluate the optimal experimental conditions.
Additional file 1: Figure S1. Analysis of peakset with ssvQC. A) a barplot showing peak counts or each dataset, as well as venn diagrams for H3K4me3 and Ikaros datasets. B) As in A, but peak counts were filtered by qValue scores. Fig. 3 Evaluation of CUT&RUN enrichment profiles with clustered signal heatmap. A Heatmap showing the normalized (reads per million (RPM) signal of each CUT&RUN dataset over consensus peak regions for A Ikaros replicate datasets (frozen) with IgG controls, or B H3K4me3 replicate datasets (frozen) with IgG controls. Clustering was performed with kmeans = 6 for both datasets, and shows an overall similar distribution of signal for called peaks across replicates, exhibiting variable signal across clusters. Also, in either dataset, cluster 1 regions show strong signal in IgG controls and are likely artifacts. These peaks do not map to known blacklisted regions but are associated with heterochromatin. These peaks should be filtered to obtain a final dataset