Quantification of histone modification ChIP-seq enrichment for data mining and machine learning applications
© Bekiranov et al; licensee BioMed Central Ltd. 2011
Received: 24 May 2011
Accepted: 11 August 2011
Published: 11 August 2011
The advent of ChIP-seq technology has made the investigation of epigenetic regulatory networks a computationally tractable problem. Several groups have applied statistical computing methods to ChIP-seq datasets to gain insight into the epigenetic regulation of transcription. However, methods for estimating enrichment levels in ChIP-seq data for these computational studies are understudied and variable. Since the conclusions drawn from these data mining and machine learning applications strongly depend on the enrichment level inputs, a comparison of estimation methods with respect to the performance of statistical models should be made.
Various methods were used to estimate the gene-wise ChIP-seq enrichment levels for 20 histone methylations and the histone variant H2A.Z. The Multivariate Adaptive Regression Splines (MARS) algorithm was applied for each estimation method using the estimation of enrichment levels as predictors and gene expression levels as responses. The methods used to estimate enrichment levels included tag counting and model-based methods that were applied to whole genes and specific gene regions. These methods were also applied to various sizes of estimation windows. The MARS model performance was assessed with the Generalized Cross-Validation Score (GCV). We determined that model-based methods of enrichment estimation that spatially weight enrichment based on average patterns provided an improvement over tag counting methods. Also, methods that included information across the entire gene body provided improvement over methods that focus on a specific sub-region of the gene (e.g., the 5' or 3' region).
The performance of data mining and machine learning methods when applied to histone modification ChIP-seq data can be improved by using data across the entire gene body, and incorporating the spatial distribution of enrichment. Refinement of enrichment estimation ultimately improved accuracy of model predictions.
Recent advances in high-throughput DNA sequencing technology have facilitated the generation of vast amounts of epigenomic ChIP-seq data. The availability of these datasets has provided the opportunity to utilize the power of statistical computing to model epigenomic networks. Unlike conventional biochemical approaches, the application of machine learning and data mining techniques to ChIP-seq data is capable of providing a broad, network-level view of the epigenetic regulatory system. These strategies can provide insights into mechanisms of genomic control, such as the so-called "histone code [1–3]," by facilitating an integrated analysis of the many histone posttranslational modifications (PTMs) that have been described, as well as other epigenetic chromatin modifications. The histone code is a particularly attractive problem for computational applications, since it has become apparent that histone PTMs are regulated in a network fashion and are deposited combinatorially [4–6]. However, a thorough study of the gene-biased quantification of ChIP-seq enrichment for the application of data mining and machine learning techniques has not yet been done.
Several groups have applied various data mining and machine learning techniques to epigenomic ChIP-seq data in a gene-biased fashion, including Bayesian networks [7, 8], support vector machines and regression , linear [6, 8, 9] and nonlinear regressions . These models have focused on histone modifications due in part to their known role in transcriptional regulation [10–12] and the availability of rich datasets exhibiting a wide variety of types of histone modifications . These models have been built using individual genes as observations, ChIP-seq enrichments as predictors, and gene expression as responses in the case of supervised learning methods. In these models the predictors are histone modification/variant ChIP-seq enrichment levels for individual genes, so model quality is highly dependent on the accuracy of enrichment estimation. Since ChIP-seq enrichment levels are strongly dependent on genomic coordinate, providing a gene-wise estimate of ChIP-seq enrichment that accurately captures the relevant enrichment information across all genes--which vary in length over four orders of magnitude--is a challenging task.
The most straightforward way to estimate per-gene ChIP-seq enrichment is to simply count the number of sequence reads associated with a given gene. Indeed, counting the number of reads in the promoter region of each gene was an approach taken in some previous studies [7, 9]; however, this method has several limitations. First, tag-counting methods equally weight every position within the counting window, and thus ignore the spatial component of the enrichment data. Second, not every histone modification is 5' biased. Several modifications have greater enrichment into the body of genes, such as H3K36me3 [10, 12, 13]. This modification highlights the pitfalls associated with 5' biased enrichment estimation. It has greater enrichment in the bodies of genes, is indeed 3' biased, and has a proclivity for enrichment in exons [13, 14]. H3K36me3 also has a strong correlation with transcriptional elongation as determined by various biochemical studies [10, 14–16]. It is worth noting that a machine learning study by Yu et al. used a 5' tag counting method and found little correlation between H3K36me3 enrichment and gene expression . The consequence for models that use 5' proximal enrichment estimation methods is that the effects of histone modifications with gene body or 3' biased enrichment are underestimated or greatly obscured.
Histone methylations tend to have unique average spatial deposition patterns . For example, in contrast to H3K36me3, H3K4me3 has high enrichment around the transcription start site, with depletion in the nucleosome-free region. Some modifications seem to be deposited in a specific genic region with respect to the absolute positions of nucleosomes relative to gene boundaries, while the patterns of others seem to scale with gene length. These effects can be attributed, at least in part, to the recruitment of histone methyltransferases that are dependent on the phosphorylation state of the C-terminal domain of Pol II before and during the elongation process [13, 16–18]. Estimating ChIP-seq enrichment is complicated by the different modes of histone PTM deposition coupled with the wide variability in gene lengths. Part of the information content associated with a given histone modification is encoded within the spatial distribution of the enrichment data, and so it should also be considered when estimating enrichment levels.
The selection of a genomic window used for the calculation of enrichment levels is important in capturing relevant enrichment data, since enrichment information may be 5' or 3' proximal, intergenic, or intragenic . Choosing a window size that extends too far outside of gene boundaries may incorporate data from neighboring regions, and selecting a window size that is too small may exclude useful data. The goal in window selection is to maximize useful information content and minimize the incorporation of noise, while being generalizable across a variety of marks.
Since the quality of a statistical model is largely dependent on the quality of the observation data used to build it, refining enrichment estimation methods is important for future statistical analyses of ChIP-seq data. To resolve some of the issues involved with enrichment estimation, we compared the performance of models built using a ChIP-seq dataset of histone methylations/variant in CD4+ T-cells generated by Barski et al. . This dataset has been used in several other machine learning and data mining studies [6, 7, 9]. We applied the Multivariate Adaptive Regression Splines (MARS) algorithm [19, 20] to build nonlinear regression models using enrichment levels of 20 histone lysine and arginine methylations plus histone variant H2A.Z. Given that gene expression levels have been shown to be highly dependent on histone modification levels, we used the Generalized Cross-Validation (GCV)  and R2 metrics to assess the quality of MARS model fits and rank enrichment estimation methods.
Several different strategies were employed for estimating gene-wise enrichment levels, including tag counting and model-based approaches, which use average enrichment patterns to spatially weight enrichment of individual genes in a set of genomic windows. We also investigated the selection of window sizes for our gene-wise enrichment estimation methods. By comparing models using GCV and R2 values, we demonstrate that the performance of regression models using histone modification enrichment levels as predictors can be greatly affected by the chosen enrichment estimation method. We conclude that methods that incorporate the spatial distribution of ChIP-seq enrichment offer an improvement in a regression fit over tag counting methods. We also observe that whole-gene estimation windows produce superior results relative to estimations restricted to specific genic regions. Indeed, incorporating data across the entire body of the gene was the most important factor in improving the fit of our models. These improvements of gene-wise ChIP-seq enrichment estimation can improve the sensitivity and specificity of the predictions derived from data mining and machine learning models.
A list of gene transcript annotations was downloaded from the NCBI36 Homo sapiens database, Ensembl 54 (June 24, 2009), which was then filtered to only include transcripts that had Ensembl, UCSC, and RefSeq IDs. Genes that did not have corresponding expression data associated with them were removed from the list. Many of the transcripts within this list contain multiple annotated start and stop sites. Using the same procedure described by Xu et al. , we select a single Transcription Start Site (TSS) and Transcription End Site (TES) for each gene.
Some of the enrichment estimation methods described in this study calculate enrichment within a window around the TSSs and TESs, the largest of which we employed was ±3000 bp around each site. To avoid overlap between windows where the enrichment estimate was a combination of estimates from both ends of the genes, the gene list was further filtered to only include transcripts of length 6002 bp or greater. Although not all enrichment estimation strategies have this limitation, the filtered list was used for each enrichment estimation method to allow a fair comparison of the final models. After implementing each of the aforementioned filters, the final gene list totaled 9882 genes.
Tag Repeat Filter
PCR sequence artifacts or phenomena inherent to the sequencing technology may cause repeat sequences to be produced. These artifacts manifest as large numbers of tags that map to precisely the same genomic coordinates. With the exception of H3K79 methylations, maximum repeats ranged from 231 for H3K4me2 to 4231 for H3K9me3. H3K79me1/2/3 had far fewer repeats with maximums of 23, 26, and 42, respectively. We identified these tag "pile-ups" by searching for multiple tags that mapped to the genome with precisely the same start and stop coordinates (or for differing tag lengths, within the margin of the difference in length). A cutoff of 75 repeats was chosen empirically for the modification H3K4me3 (max repeats = 1166) to filter repeat artifacts from H3K4me3 data. We assumed that the typical number of tags in these piles for a given mark crudely scaled with the total number of tags. Thus, the cutoff was scaled for other modifications by the total tag count relative to H3K4me3, and ranged from 21 (H3K79me2) to 75 (H3K4me3). H3K4me3 was chosen to determine the cutoff because it had the largest total tag count, and its tendency to form large localized peaks relative to the other modifications. This helped ensure that our cutoff was not overly stringent and was only sensitive to extreme outliers.
Using this filtering scheme H3K79me1/2/3 had data removal percentages of 0%, 0.001%, and 0.004%, respectively. H4K20me3, which had relatively large numbers of reads that mapped to repeat sequences, had 5% of all data removed. All other marks ranged from 0.01% (H3K4me1) to 0.5% (H3K9me3) of data removed. Thus, we removed extreme outliers while minimally affecting the overall data set.
Tag counting is the summation of ChIP-seq reads within a genomic window. Any part of a read falling within a window was included in our tag counts. Following previous studies using tag counting methods, tag count enrichments were calculated in ±500 bp, ±1000 bp, ±2000 bp, and ±3000 bp windows relative to both the annotated Transcription Start Sites (TSS) and the Transcription End Sites (TES). While evaluating which genic sub-regions to include in tag counting methods, we assessed how the inclusion of tag counts within exons as a genic sub-region category improved model performance and found their contribution was negligible (Additional file 1). We therefore did not consider tag counts solely within exons further. For clarity, exons were not excluded from other counting methods. Another set (one for each window size) of gene-wise count-based enrichment estimates were produced by summing the TSS counts with the TES counts multiplied by a scaling factor for each of the 21 histone marks:
where Y is gene expression level. The correlation was optimized numerically with respect to α, for each modification type, k.
A set of whole-gene tag count enrichments was calculated within a window defined by the gene boundaries plus flanking intergenic regions immediately adjacent to the annotated gene boundaries. Counts were normalized by dividing by the length of the counting window. Sets of normalized counts were calculated for the gene bodies plus 0 bp, 500 bp, 1000 bp, 2000 bp, and 3000 bp overhangs up and downstream of the gene boundaries.
Iterative model-based enrichment estimation
Equations 6 and 7 can be solved iteratively, subject to the template normalization constraint given by equation 4. An iterative solution of these equations minimizes the least squares difference between the modeled enrichment data and the actual data c ijk . In the case of the iterative solution, the template is the enrichment estimate weighted average tag count across genomic coordinate, i. The value of X jk is ultimately a weighted average of enrichment across a genomic window, providing a single-value estimate of enrichment that incorporates information from the spatial distribution of the enrichment data. For our calculations the iterative process continued until the average difference between the nth and the (n+1)th set of enrichment estimations converged to less than 5% of the nth set values.
Using this iterative model-based strategy, enrichment levels were estimated around both the TSS and TES in ±500 bp, ±1000 bp, ±2000 bp, and ±3000 bp windows, with single base pair resolution (i.e., i corresponds to a single base pair in the window). Enrichment estimates were also made with templates consisting of the TSS and TES windows combined (calculated as a single template) using the same four window sizes. In summary, a set of 5', 3', and 5'+3' enrichment estimations were made for each of the window sizes.
In another set of enrichment estimates, genes were scaled to correspond to a fixed number of bins. The scaling procedure described by Xu et al.  was used with bin number equal to 33,346--the median gene length in the filtered gene list. The template procedure was applied to the scaled genes plus an intergenic overhang of 0 bp, 500 bp, 1000 bp, 2000 bp, and 3000 bp beyond the TSS and TES. The resolution of the genes is equal to gene length divided by bin number, while the overhang regions have base-pair resolution.
Non-iterative model-based enrichment estimate
The process of iteratively solving equations 4, 6, and 7 is computationally expensive. A non-iterative enrichment estimation can be made with equation 6 using the non-weighted average template shown in equations 3 and 4. To examine the trade off between computational efficiency and template optimization, we produced one set of enrichment estimates calculated non-iteratively for every set calculated using the iterative method.
Evaluation of template models
where n is the number of indices in the template. This metric was used to compare the fit of iterative and non-iterative template models.
MARS model construction and evaluation
MARS models were built with each set of enrichment estimations (51 in total) using the earth package in R http://cran.r-project.org/web/packages/earth/index.html. Following Xu et al., each model was allowed terms with up to 3 degrees of interaction. The quality of each model was evaluated using R2 values and GCV scores. The GCV score evaluates the fit of the model while penalizing model complexity, whereas the R2 only considers the fit of the model to the data. A description of the MARS algorithm and GCV scores can be found in the 'Methods' section of Xu et al. .
Results and discussion
Overview of model construction
Template model error analysis
CV(RMSD) for whole gene templates plus a 2000 bp intergenic overhang
The iterative and non-iterative H4K20me3 template models had the worst CV(RMSD)s (8.11 and 5.87, respectively). Moreover, the iterative template performed much more poorly than the non-iterative. In this case, H4K20me3 is highly enriched in members of the zinc-finger (ZNF) gene family, and at low levels with a different enrichment profile across the genes in the rest of the genome [10, 22]. Thus, for H4K20me3, there are at least two classes of enrichment profiles across genes. The iterative template is weighted by enrichment, and hence biased toward the ZNF genes. Thus it yields a poor CV(RMSD) for the majority of genes in the genome that have a different profile and have relatively low levels of H4K20me3 across their bodies. One way of resolving this problem is to apply clustering analysis to the H4K20me3 enrichment profiles across genes and identify the two or three dominant deposition profiles and apply the appropriate template to each subset of genes. Nevertheless, the iterative template method required significantly more computational resources than the non-iterative method and only marginal improvements in the CV(RMSD) of 13 of the 21 marks, and in MARS model performance (as discussed in the "Enrichment estimation and model performance" section). This suggests that the non-iterative template approach may be preferable to the iterative enrichment estimation method for many applications.
Enrichment estimation and model performance
The iterative enrichment estimation method was intended to improve the fit of the template to the data; however, this does not mean that the estimated enrichment level produces a final MARS model with a better fit. Indeed, we found this to be true in our models. Of the 17 pairs of iterative and non-iterative template-based enrichment estimations, 10 produced models in which the iterative method was superior, and 7 in which the non-iterative method was superior. However, both methods produced models with similar statistics (Figure 1). A possible explanation for this result is that the iterative method yields a template that is an estimation-weighted average of enrichment across genomic coordinates. Thus, genes with large outlier enrichment values for a given mark could be driving the shape of the iterative template. For H4K20me3, which produced the poorest CV(RMSD)s and a largest increase in CV(RMSD) in the iterative estimate relative to the non-iterative estimate, outliers did drive poor performance. As previously discussed ZNF repeats are highly enriched for this mark while most non-ZNF genes have an extremely modest enrichment. The genes that had the largest absolute deviation between iterative and non-iterative enrichment estimates were indeed ZNF genes. This suggests that datasets with extreme outliers may be poorly represented by the iterative enrichment estimate. Incorporating robust estimation procedures (e.g., trim mean) into template and enrichment estimation calculations may improve the results of the iterative enrichment estimation method.
Selection of window size was also a factor in model performance. For the window sizes considered, larger window sizes always yielded improved model fits for any 5' and/or 3'-focused method. Not surprisingly, an increase in the amount of data used to calculate the predictors generally improved the performance of the models. However, this does not hold true for the whole-gene tag counting and scaled-gene methods. For these methods, the size of the overhang region is a relatively small fraction of the total genomic coverage used in the per-gene enrichment estimation.
The estimate method with the best performance based on GCV score was the whole-gene, non-iterative, template-based method with a 2000 bp intergenic overhang, which achieved a GCV score of 2.656. The whole-gene method that received the poorest GCV score was the tag count with 3000 bp intergenic overhangs, which had a GCV score of 2.696. The difference in model GCV score between the best and worst whole-gene enrichment estimation methods was only 0.04, corresponding to 1.5% difference in GCV score; the associated p-value < 0.001. Significance was assessed by randomly permuting mark amplitudes with respect to genes. MARS was then applied to the randomly permuted data and GVC scores were calculated for the best and worst whole-gene enrichment estimation method as well as the percent difference in GCV score. A null distribution and corresponding p-value were calculated by repeating this procedure 1000 times. The worst whole-gene estimation method had a GCV score, which was 0.129 (4.6%) below that of the best method based on specific genic regions (5'+3' iterative template with 3000 bp intergenic overhang). The associated p-value < 0.001 based on the same random permutation procedure described above. This suggests that the most important factor when estimating gene-wise ChIP-seq enrichment is the inclusion of data across the entire length of gene bodies. Additionally, unlike the methods based on localized regions, the whole-gene methods do not show a strong correlation between model performance and window size; further suggesting that the enrichment data in the body of the gene contains the majority of the information content for a given gene.
The Spearman correlation between the iterative and non-iterative template-based (2000 bp intergenic overhang) enrichment estimates was 0.994 or better across all marks. As expected, the largest deviations between the methods were in estimates of H4K20me3 in the ZNF genes. Correlations of enrichment estimations between whole-gene tag counting and template-based methods (2000 bp intergenic overhang) had a median value of 0.983, and exceeded 0.925 for all marks except for H2A.Z and H3K4me3. The correlations between the tag counting method and the iterative and non-iterative methods for H2A.Z were 0.659 and 0.675 respectively, and 0.775 and 0.771 for H3K4me3. These relatively low correlations can be attributed to the fact that on average these two marks have extremely high enrichment within a few hundred base pairs of the TSS, which rapidly falls to nearly zero beyond 2000 bp into the gene body. No other marks show such a dramatic difference between the gene body and TSS region. For extremely large genes, this means an underestimation of the enrichment using the length-normalized tag count. Indeed, many of the largest deviations between the estimation methods for these marks were for genes that were on the mega-base scale in length (Additional file 2). Large deviations also occurred when few tags were observed within the estimation window. In these cases, differences between enrichment estimation methods can be attributed to coordinate-dependent differences in weighting. In some cases of 5' proximal marks, genes that were not enriched for the mark were flanked by genes that were (Additional file 3). The 5' enrichment of the neighbor would sometimes bleed into the 3' region of the non-enriched gene, causing a large enrichment estimate using the tag counting method relative to the template-based methods. Since the template-based methods are a weighting scheme based on the average enrichment pattern, the intruding enrichment is down-weighted. The template-based methods are subsequently able to deconvolve enrichment signals of genes that are close neighbors, and therefore represent an advantage of these methods over tag counting.
The accuracy and precision of amplitude estimation for all of the methods considered could be improved by subtracting background read levels and applying appropriate noise filtering. High throughput sequence analysis of input DNA samples revealed that chromatin structure affects shearing and other aspects of ChIP sample preparation, and hence introduces biases in ChIP-seq data . This together with sequence dependent biases coming from PCR amplification of ChIP samples argues for methods that assume an inhomogeneous background. One approach would be to use input DNA or other control samples to estimate inhomogeneous background levels; however, an accurate method, which performs this analysis remains to be developed. Indeed a recent comparison of ChIP-chip and ChIP-seq data  showed that using Input-seq data as background from an unmatched sample can remove GC-content biases better than use of a matched Input-seq sample. Thus, accurate background estimation and subtraction is still an area of active research. One ChIP-seq peak finding method, SICER , which is designed to identify significantly enriched domains in histone modification data can also be applied as a background noise filter. SICER performs the filtering based on significance. The genome is segmented into windows and those that are not members of significantly enriched islands are filtered out (i.e., set to zero). However, a significance-based filtering approach is not ideal for amplitude estimation and statistical learning applications because accurate estimates of even low, albeit insignificant, enrichments are important. High frequency noise could be removed by applying low pass filters using wavelets. Indeed, wavelet analysis has been applied to genomic tiling array ChIP-chip data to denoise the data  and could be generalized for ChIP-seq noise filtering.
Enrichment profile and gene length
To determine if H3K36me3 and H3K79me2,3--all strongly associated with gene activation--were driving the superior performance of the scaled-gene models, we rebuilt all 51 MARS models without these predictors (data not shown). Surprisingly, the scaled gene method with no intergenic overhang yielded the best model, though the 2nd and 3rd best models were based on whole-gene tag counts. This suggests that although for many marks the scaled template is less representative of the deposition pattern of very large and very small genes, the scaled template strategy offers good performance even on marks whose enrichment profiles do not appear to scale significantly with gene length.
Regulatory information embedded in spatial deposition patterns
Interestingly, the slopes of the enrichment profiles of the three marks that scale appear to be approximately similar across gene lengths from the TSS to approximately 1 kb into the gene body. Beyond approximately 1 kb into the gene body the slopes of the enrichment profiles begin to differ dramatically. For example, for the shortest 20%-ile of genes, average H3K79me3 enrichment rapidly decreases beyond 1 kb into the gene body. For the longest 20%-ile of genes the enrichment profile has a steady, positive slope for the same genomic window, which is about 1 kb to 6 kb into the gene body. However, from the TSS to approximately 1 kb into the gene body the enrichment profiles of these extreme length groups are nearly identical. This suggest that for these scaling marks, there is a region near the TSS, which is approximately 1 kb in size, where these modifications are deposited in a length-independent manner, but beyond which the modifications are deposited in a length-dependent fashion.
Of the non-scaling marks, all but four of the modifications in this study show differences in absolute enrichment levels across gene lengths (Additional file 4). Those that do not are H3K27me2, H3R2me2, H4R3me2, and H4K20me3. The greatest differences appear in the longest 20%-ile, which has relatively low enrichment for marks that are explicitly known to be associated with gene activation, and relatively high enrichment for genes associated with gene repression, suggesting a global reduction of transcription in large genes relative to small genes. Indeed, we find that the genes in the largest 20%-ile of gene length show significantly lower gene expression than other genes (Additional file 5). For methods that did not use the whole gene to arrive at enrichment estimates, we rebuilt models with gene length included as a predictor to determine if the superior performance of the whole-gene estimation methods were driven by the gene length bias. The best performing model of this set was the 5'+3' non-iterative template with a 3000 bp window, which had a GCV score of 2.831. The best model based on estimates in a specific genic region and without gene length as a predictor had a GCV score of 2.824. The lack of improvement after including gene length as a predictor suggests that the performance of the whole-gene enrichment estimation methods was not driven by the gene length bias.
In addition to revealing information about transcriptional regulation, templates and enrichment estimates may also provide information on co-regulation of PTMs. For example, H4K20me1 and H3K9me1 are known to be preferentially deposited on the same nucleosomes in vivo . The correlation (Spearman) between the templates of these two marks was 0.716, and the correlation between their enrichment estimates across genes was 0.876. These strongly positive correlations of templates and enrichment estimates of marks known to co-occur suggest that co-regulatory information can be gleaned from spatial distribution and magnitude of the enrichment data (see Additional files 6 and 7 for all enrichment and template correlations). For example, our data show enrichment correlations of 0.889, 0.762, and 0.761 between H2BK5me1 and H3K79me1/2/3, respectively. Template correlations between H2BK5me1 and H3K79me1/2/3 were 0.956, 0.910, and 0.924, respectively. The high correlation between H2BK5me1 and H3K79 methylation deposition patterns and levels across the genome suggest that there may be a mechanistic link (e.g., the enzymes that deposit these marks could be on the same complex) between these histone PTMs that has not yet been reported in biochemical studies. This is one of many cases where both the correlation between two marks' spatial profiles and enrichment levels across genes is high (see Additional files 6 and 7 for all enrichment and template correlations). Using both enrichment level and spatial deposition patterns across genomes could prove to be powerful at identifying biologically relevant synergies between histone modifications, which make up the "histone code" [1–4].
Generalizing a method for estimating ChIP-seq enrichment for multiple histone modifications is complicated by the variability in the way different modifications are deposited. This variability ultimately creates different gene-wise ChIP-seq enrichment patterns, some of which scale with gene length and some which do not. Tag counting methods can yield high quality predictors for regression modeling, but ultimately some of the information content coded in the spatial distribution of the data is lost. Although many modifications are highly enriched at the 5' end of genes, much of the useful data associated with a given gene is encoded in the body of the gene. Many previous studies have attempted to estimate enrichment by only focusing on the promoter region, and in doing so, have forgone much of the relevant data.
Using the MARS regression algorithm to build regression models with enrichment levels as predictors and gene expression as responses, we compared various strategies for estimating gene-wise ChIP-seq enrichment for 20 histone methylations and histone variant H2A.Z in human CD4+ T cells . Enrichment estimation methods were assessed and ranked by the quality of the models produced, which was measured by GCV scores. We have demonstrated that, with respect to the cis-regulatory role that the histone modifications/variant surveyed in this study play in controlling gene expression, the majority of the significant enrichment data lies within gene boundaries. Also, the incorporation of data across whole genes, as well as spatially weighting enrichment for single-value estimations of gene-wise ChIP-seq enrichment can provide significant improvement over strategies that focus on specific genic regions. Improving methods for the quantification of ChIP-seq data for statistical modeling serves to sharpen the resolution of the models and ultimately improves the conclusions that can be drawn from them.
ChIP-seq technology facilitates the computational interrogation of genomic control networks, and the conclusions drawn by this study can serve to increase depth at which we can probe these networks using this technology. The methods outlined in this work can be applied to almost any machine learning or data mining application that uses gene-wise ChIP-seq enrichment as predictors or responses.
List of abbreviations
Multivariate Adaptive Regression Splines
Coefficient of Variation of the Root Mean Square Deviation.
This work was supported in part by the National Institutes of Health grant CA104397 from the National Cancer Institute (NCI). SAH was supported by the University of Virginia Cancer Training Grant from the NCI (5T32CA009109).
- Strahl BD, Allis CD: The language of covalent histone modifications. Nature. 2000, 403 (6765): 41-45. 10.1038/47412.PubMedView ArticleGoogle Scholar
- Jenuwein T, Allis CD: Translating the histone code. Science. 2001, 293 (5532): 1074-1080. 10.1126/science.1063127.PubMedView ArticleGoogle Scholar
- Goldberg AD, Allis CD, Bernstein E: Epigenetics: a landscape takes shape. Cell. 2007, 128 (4): 635-638. 10.1016/j.cell.2007.02.006.PubMedView ArticleGoogle Scholar
- Latham JA, Dent SY: Cross-regulation of histone modifications. Nat Struct Mol Biol. 2007, 14 (11): 1017-1024. 10.1038/nsmb1307.PubMedView ArticleGoogle Scholar
- Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, et al: Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet. 2008, 40 (7): 897-903. 10.1038/ng.154.PubMedPubMed CentralView ArticleGoogle Scholar
- Xu X, Hoang S, Mayo MW, Bekiranov S: Application of machine learning methods to histone methylation ChIP-Seq data reveals H4R3me2 globally represses gene expression. BMC Bioinformatics. 11: 396-
- Yu H, Zhu S, Zhou B, Xue H, Han JD: Inferring causal relationships among different histone modifications and gene expression. Genome Res. 2008, 18 (8): 1314-1324. 10.1101/gr.073080.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Cheng C, Yan KK, Yip KY, Rozowsky J, Alexander R, Shou C, Gerstein M: A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol. 12 (2): R15-
- Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci USA. 107 (7): 2926-2931.
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129 (4): 823-837. 10.1016/j.cell.2007.05.009.PubMedView ArticleGoogle Scholar
- Kouzarides T: Histone methylation in transcriptional control. Curr Opin Genet Dev. 2002, 12 (2): 198-209. 10.1016/S0959-437X(02)00287-3.PubMedView ArticleGoogle Scholar
- Campos EI, Reinberg D: Histones: annotating chromatin. Annu Rev Genet. 2009, 43: 559-599. 10.1146/annurev.genet.032608.103928.PubMedView ArticleGoogle Scholar
- Hon GC, Hawkins RD, Ren B: Predictive chromatin signatures in the mammalian genome. Hum Mol Genet. 2009, 18 (R2): R195-201. 10.1093/hmg/ddp409.PubMedPubMed CentralView ArticleGoogle Scholar
- Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J: Differential chromatin marking of introns and expressed exons by H3K36me3. Nat Genet. 2009, 41 (3): 376-381. 10.1038/ng.322.PubMedPubMed CentralView ArticleGoogle Scholar
- Bannister AJ, Schneider R, Myers FA, Thorne AW, Crane-Robinson C, Kouzarides T: Spatial distribution of di- and tri-methyl lysine 36 of histone H3 at active genes. J Biol Chem. 2005, 280 (18): 17732-17736.PubMedView ArticleGoogle Scholar
- Krogan NJ, Kim M, Tong A, Golshani A, Cagney G, Canadien V, Richards DP, Beattie BK, Emili A, Boone C, et al: Methylation of histone H3 by Set2 in Saccharomyces cerevisiae is linked to transcriptional elongation by RNA polymerase II. Mol Cell Biol. 2003, 23 (12): 4207-4218. 10.1128/MCB.23.12.4207-4218.2003.PubMedPubMed CentralView ArticleGoogle Scholar
- Komarnitsky P, Cho EJ, Buratowski S: Different phosphorylated forms of RNA polymerase II and associated mRNA processing factors during transcription. Genes Dev. 2000, 14 (19): 2452-2460. 10.1101/gad.824700.PubMedPubMed CentralView ArticleGoogle Scholar
- Ng HH, Robert F, Young RA, Struhl K: Targeted recruitment of Set1 histone methylase by elongating Pol II provides a localized mark and memory of recent transcriptional activity. Mol Cell. 2003, 11 (3): 709-719. 10.1016/S1097-2765(03)00092-3.PubMedView ArticleGoogle Scholar
- Friedman JH: Multivariate Adaptive Regression Splines. Annals of Statistics. 1991, 19 (1): 1-67. 10.1214/aos/1176347963.View ArticleGoogle Scholar
- Friedman JH: Multivariate Adaptive Regression Splines - Rejoinder. Annals of Statistics. 1991, 19 (1): 123-141. 10.1214/aos/1176347973.View ArticleGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.PubMedPubMed CentralView ArticleGoogle Scholar
- Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 28 (8): 817-825.
- Teytelman L, Ozaydin B, Zill O, Lefrancois P, Snyder M, Rine J, Eisen MB: Impact of chromatin structures on DNA processing for genomic analyses. PLoS One. 2009, 4 (8): e6700-10.1371/journal.pone.0006700.PubMedPubMed CentralView ArticleGoogle Scholar
- Ho JW, Bishop E, Karchenko PV, Negre N, White KP, Park PJ: ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis. BMC Genomics. 12: 134-
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009, 25 (15): 1952-1958. 10.1093/bioinformatics/btp340.PubMedPubMed CentralView ArticleGoogle Scholar
- Karpikov A, Rozowsky J, Gerstein M: Tiling array data analysis: a multiscale approach using wavelets. BMC Bioinformatics. 12: 57-
- Sims JK, Houston SI, Magazinnik T, Rice JC: A trans-tail histone code defined by monomethylated H4 Lys-20 and H3 Lys-9 demarcates distinct regions of silent chromatin. J Biol Chem. 2006, 281 (18): 12760-12766. 10.1074/jbc.M513462200.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.