Research Article
Published Online: 11 March 2020

ChExMix: A Method for Identifying and Classifying Protein–DNA Interaction Subtypes

Publication: Journal of Computational Biology
Volume 27, Issue Number 3

Abstract

Regulatory proteins can employ multiple direct and indirect modes of interaction with the genome. The ChIP-exo mixture model (ChExMix) provides a principled approach to detecting multiple protein–DNA interaction modes in a single ChIP-exo experiment. ChExMix discovers and characterizes binding event subtypes in ChIP-exo data by leveraging both protein–DNA cross-linking signatures and DNA motifs. In this study, we present a summary of the major features and applications of ChExMix. We demonstrate that ChExMix does not require high-resolution protein–DNA binding assay data to detect binding event subtypes. Specifically, we apply ChExMix to analyze 393 ChIP-seq data profiles in K562 cells. Similar binding event subtypes are discovered across multiple proteins, suggesting the existence of colocalized regulatory protein modules that are recruited to DNA through a particular sequence-specific transcription factor. Our results thus suggest that ChExMix can characterize protein–DNA binding interaction modes using data from multiple types of protein–DNA interaction assays.

1. Introduction

Regulatory proteins associate with the genome either by directly binding cognate DNA motifs or through protein–protein interactions with other regulators. ChExMix is designed to discover distinct protein–DNA binding modes in ChIP-exo experiments (Rhee and Pugh, 2011; Yamada et al., 2019). ChExMix assumes that different regulatory complexes will result in distinct protein–DNA cross-linking signatures within ChIP-exo data, and thus analysis of ChIP-exo sequencing tag patterns should enable detection of multiple protein–DNA binding modes for a given regulatory protein. ChExMix uses a mixture model framework to probabilistically model the genomic locations and subtype membership of protein–DNA binding events, leveraging both ChIP-exo tag enrichment patterns and DNA sequence information.

2. Results

Figure 1 shows a screenshot of ChExMix's HTML output using TAL1 ChIP-exo data in estradiol-treated G1E-ER4 cells (Han et al., 2015). The HTML output summarizes command-line options and parameters, input data information, discovered protein–DNA binding event subtypes, and replicate consistency information (Fig. 1A–D). The full list of command-line arguments for the ChExMix run is shown in a drop-down window by clicking on “command” (Fig. 1A). Input data information shows replicate name, total number of tag counts, automatically estimated scaling ratio between signal and control experiments, and the estimated fraction of the sample's data that represents foreground signal as opposed to background (Fig. 1B). ChExMix analysis results show condition name, total number of detected binding events, and a link to the binding event file (Fig. 1C). The heatmap shows per-base tag counts for forward and reverse strand in a 500 bp window around discovered binding events. The sequence plot shows DNA sequences, represented using four colors, in a 30 bp window around binding events. Tag distribution, motif (when available), and number of subtype events are shown for each discovered subtype.
FIG. 1. A screenshot of ChExMix HTML output using TAL1 ChIP-exo data in mouse estradiol-treated G1E-ER4 cells. (A) Command-line options and parameters for the ChExMix run. Users can click on “Command” to see the full list of command-line arguments. (B) Input data information including name of replicate, total tag count, automatically estimated scaling ratio used between signal and control experiments, and the estimated fraction of the sample's data that represents foreground signal as opposed to background. (C) ChExMix protein–DNA binding event subtype analysis results, including the name of the condition, the total number of detected binding events, and a link to the binding event file. The heatmap shows per-base tag counts for forward (blue) and reverse (red) strands alongside the corresponding sequence plots. For each discovered subtype, tag distribution, motif (when available), and number of subtype events are shown. (D) Replicate information flat files that provide information on ChIP enrichment consistency between replicates.
After the publication of the initial release, we added replicate consistency analysis to ChExMix, which examines agreement on tag counts associated with binding events across replicates (Fig. 1D). ChExMix internally combines replicates to gain statistical power in detecting binding events, but models each replicate's subtype-specific tag distributions and per-binding event tag counts separately. Replicate consistency analysis can be run in three modes: (1) the “standard” mode reports events that pass the tag count significance threshold calculated for the condition as a whole, (2) the “lenient” mode reports events that pass significance thresholds in more than one replicate or in the condition as a whole, and (3) the “lenientplus” mode reports events that pass significance thresholds in more than one replicate or in the condition as a whole and additionally requires no significant difference in tag counts between replicates. ChExMix thereby enables us to examine the consistency of per-binding event tag counts across replicates.
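The decision logic of the three reporting modes can be sketched as follows. This is a minimal illustration of the rules as described above, not ChExMix's implementation; the `EventCall` class, its field names, and the `report_event` function are hypothetical, and the upstream significance tests are assumed to have been run already.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventCall:
    """Hypothetical per-event summary; field names are illustrative only."""
    passes_condition: bool        # significant in the condition as a whole
    passes_replicate: List[bool]  # significant in each individual replicate
    replicates_differ: bool       # tag counts differ significantly between replicates


def report_event(call: EventCall, mode: str = "standard") -> bool:
    """Return True if the event would be reported under the given mode."""
    # "More than one replicate" is read here as at least two replicates passing.
    passes_lenient = call.passes_condition or sum(call.passes_replicate) > 1
    if mode == "standard":
        return call.passes_condition
    if mode == "lenient":
        return passes_lenient
    if mode == "lenientplus":
        return passes_lenient and not call.replicates_differ
    raise ValueError(f"unknown mode: {mode}")


# An event that is significant in two replicates but not in the pooled
# condition, and whose replicate tag counts differ significantly.
example = EventCall(passes_condition=False,
                    passes_replicate=[True, True],
                    replicates_differ=True)
for mode in ("standard", "lenient", "lenientplus"):
    print(mode, report_event(example, mode))
```

Under this reading, “lenientplus” is the “lenient” rule with one additional filter, so an event reported in “lenientplus” mode is always reported in “lenient” mode as well.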
ChExMix can be run with different modes to discover binding event subtypes: (i) 5′ tag positions centered on discovered motifs, (ii) clustering tag distributions using the affinity propagation algorithm, and (iii) a combination of (i) and (ii). Subtype characterization using only motifs may become useful when analyzing lower-resolution data such as ChIP-seq; in contrast to ChIP-exo, 5′ tag positions in ChIP-seq do not have precise relationships to protein–DNA cross-linking positions. In addition, if users have prior knowledge of motifs associated with target or cooperating factors, binding event subtypes can be initialized using user-defined position frequency matrices, for instance matrices sourced from databases of known motifs.
We encourage the use of control data such as immunoprecipitation with a non-specific antibody or sequenced whole cell extract controls. Control data provide information on local background signal levels, which ChExMix can use in finding potentially enriched genomic regions in the initial discovery step, as well as when calculating the statistical significance of tag enrichment at candidate binding events. ChExMix is flexible in input data types, allowing input in BAM/SAM, BED, and IDX formats. Troubleshooting advice can be found in Table 1.
Table 1. Troubleshooting Advice
Problem | Possible reason | Solution
Error message indicating insufficient memory | Java does not have access to enough memory or there is insufficient memory available on the system | Increase the memory Java has access to by increasing the value associated with the “-Xmx” flag. Alternatively, if you are using the “thread” flag, set it to a small nonzero integer value to reduce the number of processors, and thus memory used simultaneously, at a cost of additional run time
Discovered subtype tag distributions contain “tower blocks” or “spikes” | High duplication rate in data | Use “--readfilter” flag to enforce per-base limit on tag count using a Poisson distribution
No motifs are discovered | (a) Motif discovery is not performed; (b) Binding events do not contain enriched sequences | (a) Install MEME (Bailey and Elkan, 1994) and provide a path to MEME bin; (b) Nature of data
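To illustrate the idea behind the “--readfilter” entry in Table 1, the following sketch derives a per-base tag count limit from a Poisson distribution and caps observed counts at that limit. It is only an illustration under stated assumptions: the way the Poisson mean is estimated, the tail probability cutoff (`p_cutoff`), and both function names are hypothetical and are not taken from the ChExMix source.

```python
import math


def poisson_per_base_limit(mean_tags_per_base: float, p_cutoff: float = 1e-3) -> int:
    """Smallest count k >= 1 such that P(X > k) < p_cutoff for X ~ Poisson(mean).

    The cutoff value is an assumption made for this sketch.
    """
    pmf = math.exp(-mean_tags_per_base)  # P(X = 0)
    cdf = pmf
    k = 0
    while 1.0 - cdf >= p_cutoff:
        k += 1
        pmf *= mean_tags_per_base / k    # P(X = k) from P(X = k - 1)
        cdf += pmf
    return max(k, 1)


def cap_counts(counts, limit):
    """Cap each per-base tag count at the limit, trimming duplicate 'spikes'."""
    return [min(c, limit) for c in counts]


# Toy example: on average 0.1 tags start at each base.
limit = poisson_per_base_limit(0.1)
print(limit)                              # 2
print(cap_counts([0, 1, 5, 40], limit))   # [0, 1, 2, 2]
```

Positions with more tags than the limit are the “tower blocks” or “spikes” that a high duplication rate produces; capping them prevents a few positions from dominating the estimated tag distributions.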
Each of the subtypes output by ChExMix is associated with a tag distribution, but may or may not be associated with a motif. We advise caution when interpreting subtypes that are presented by ChExMix as being defined by only a tag distribution, and having no centrally enriched motif. ChExMix simultaneously models both tag distributions and motif features at each subtype; if no motif feature is enriched at a fixed offset from a defined tag distribution, ChExMix may determine that the subtype is more favorably modeled using only the tag distribution, and no motif will be presented. Therefore, ChExMix binding event subtypes that are presented without motifs may still have particular DNA motifs enriched in the vicinity of the binding events, even motifs that are similar to those found centrally enriched at other subtypes.
We originally developed ChExMix for application to genome-wide ChIP-exo data, demonstrating that ChExMix can discover distinct protein–DNA interaction subtypes using motifs and tag distribution patterns. In this study, we demonstrate that ChExMix can also be applied to find protein–DNA binding event modes in ChIP-seq data. ChIP-seq profiles protein–DNA interactions at lower resolution compared with ChIP-exo. The distribution of sequenced tags around protein–DNA binding events is determined by the fragmentation and size-selection strategies in ChIP-seq, and so different protein–DNA binding modes should not be expected to display different tag distributions in ChIP-seq, as they do in ChIP-exo. However, ChExMix can still determine protein–DNA binding modes in ChIP-seq data by relying solely on enriched motif patterns (or their absence).
To demonstrate the utility of ChExMix analysis in ChIP-seq data, we applied it to analyze 393 ChIP-seq experiments performed in K562 cells (Dunham et al., 2012). The numbers of binding events detected by ChExMix range from 5 to 367,946 (median = 8444; Fig. 2A). Forty-three percent of K562 ChIP-seq experiments contain ChExMix-discovered binding subtypes that are associated with enriched motifs (Fig. 2B). For example, the results from MEIS2 ChIP-seq analysis show five subtypes, each containing distinct motifs (Fig. 2C, D). We selected the 161 K562 ENCODE ChIP-seq data sets in which ChExMix produced motif-containing subtypes. We then consolidated subtypes across factors according to the similarity of their encapsulated motifs (Fig. 2E). Clusters in Figure 2E represent sets of factors that share one or more motif-defined binding event subtypes. These clusters arise either because multiple factors can recognize and bind to the same motif, or because some factors are recruited to DNA through protein–protein interactions with a sequence-specific transcription factor (and thus will be associated with a binding event subtype that contains the sequence-specific factor's cognate binding motif).
FIG. 2. ChExMix discovers distinct subtypes in K562 ENCODE ChIP-seq data. (A, B) Number of binding events (A) and number of subtypes with motifs (B) discovered by ChExMix in K562 ENCODE ChIP-seq data. (C) Motif, sequence color plot, and heatmap of five subtypes identified in MEIS2 ChIP-seq data. Sites within each subtype are aligned by the ChExMix-defined binding event position and orientation. (D) MEIS2 tag patterns associated with subtypes 1–5. (E) Each cell in the heatmap represents the fraction of a factor's total binding events that are associated with a particular subtype. The columns contain subtype motifs that are clustered by STAMP. The left dendrogram was computed by applying hierarchical clustering on the regulatory module matrix with Euclidean distance and single linkage.
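The matrix and dendrogram described for Figure 2E can be sketched in a few lines. The example below builds a factor-by-subtype matrix of binding event fractions and applies single-linkage agglomerative clustering with Euclidean distance. It is a plain-Python illustration rather than the code used in this study; the factor names, subtype labels, and counts are invented toy values, and all function names are hypothetical.

```python
import math
from itertools import combinations


def fraction_matrix(event_counts):
    """Convert {factor: {subtype: n_events}} into rows of per-factor fractions."""
    factors = sorted(event_counts)
    subtypes = sorted({s for counts in event_counts.values() for s in counts})
    rows = []
    for factor in factors:
        total = sum(event_counts[factor].values())
        rows.append([event_counts[factor].get(s, 0) / total if total else 0.0
                     for s in subtypes])
    return factors, subtypes, rows


def cluster_distance(a, b, rows):
    """Single linkage: the smallest Euclidean distance between any two members."""
    return min(math.dist(rows[i], rows[j]) for i in a for j in b)


def single_linkage(rows):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], rows))
        merges.append((a, b, cluster_distance(a, b, rows)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy counts (invented for illustration, not results from the study).
counts = {
    "factorA": {"motif1": 80, "motif2": 20},
    "factorB": {"motif1": 70, "motif2": 30},
    "factorC": {"motif3": 100},
}
factors, subtypes, rows = fraction_matrix(counts)
for a, b, d in single_linkage(rows):
    print([factors[i] for i in a], [factors[i] for i in b], round(d, 3))
```

In this toy example the two factors that share motif-defined subtypes merge first, which is the pattern that the clusters in Figure 2E summarize across factors.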
The K562 factor clusters shown in Figure 2E, as defined by ChExMix-discovered subtype motifs, capture several sets of regulatory proteins that are known to interact with each other. For example, the master regulators GATA1, GATA2, and TAL1 are in the same cluster, and these factors are known to have protein–protein interactions with each other (Cantor and Orkin, 2002). We also find CBFA2T3, STAT5A, and TCF12 in the same cluster. These regulatory proteins, GATA1, GATA2, TAL1, STAT5A, CBFA2T3, and TCF12, show ChIP-seq tag enrichment at the union of subtype-specific binding events, indicating that they share binding regions (Fig. 3A). Similarly, we identified the homeodomain proteins PKNOX1, MEIS2, and PBX2 within the same regulatory group. The MEIS, PBX, and PKNOX homeodomain protein families bind DNA cooperatively (Knoepfler et al., 1997). PKNOX1, MEIS2, and PBX2 show ChIP-seq tag enrichment at the union of subtype binding events (Fig. 3B).
FIG. 3. Clustered regulatory proteins share binding regions. (A) ChIP-seq tag enrichment of CBFA2T3, GATA1, GATA2, TAL1, STAT5A, and TCF12 in a 1 kbp window around the union of subtype binding events. The heatmaps are sorted by decreasing CBFA2T3 occupancies. (B) ChIP-seq tag enrichment of PKNOX1, MEIS2, and PBX2 in a 1 kbp window around the union of subtype binding events. The heatmaps are sorted by decreasing PKNOX1 occupancies.
Our results suggest that ChExMix can successfully identify consistent subtypes across independent analyses of ChIP-seq experiments. Consistent ChIP-enrichment patterns across proteins that share particular subtypes (Fig. 3) provide evidence of protein colocalization at the underlying binding events. Therefore, each set of regulatory proteins that share a common protein–DNA binding subtype potentially points to a regulatory complex of interacting proteins that are recruited to DNA by a particular sequence-specific transcription factor. However, the lack of resolution in ChIP-seq prevents us from confirming that the proteins that share a particular subtype are all specifically organized around a particular motif. Future reanalysis with ChIP-exo or other higher-resolution assays may provide further characterization of these candidate regulatory complexes.
In summary, ChExMix is a flexible method for analyzing protein–DNA binding modes in ChIP-exo and ChIP-seq data. Our applications to different types of protein–DNA binding sequencing data demonstrate that ChExMix analysis can shed light on possible mechanisms by which DNA-binding proteins are recruited through sequence-specific transcription factors.

3. Methods

ChExMix can be installed using a conda package (https://anaconda.org/bioconda/chexmix). The accompanying online material is available at https://github.com/seqcode/chexmix, which consists of software documentation, a link to executables, dependencies, and instructions on building from source code. Links for synthetic test data and test code are also provided. The example test code runs ChExMix in two modes that (1) define subtypes using DNA motifs and ChIP-exo tag distributions and (2) define subtypes using only ChIP-exo tag distributions. The documentation lists available command-line options, file links such as genome length and Markov background models for the genomic sequences of major organisms, and expected output file formats. The supplementary data provided on the ChExMix website also include the tag distribution files used for making the synthetic data sets described in the original article. Simulation data can be created using the ChIPOverlapReadSimulator module in the ChExMix code.
TAL1 ChIP-exo data from the G1E-ER4 cell line treated with 10 nM beta-estradiol for 3 hours were obtained from GEO accession no. GSE68964 (Han et al., 2015). The TAL1 ChIP-exo data were aligned against mm9 using BWA (Li and Durbin, 2009) version 0.6.2 with default options.
To run ChExMix on ChIP-seq data, BAM files for ChIP-seq data in K562 cells targeting transcription factors, histone modifications, and corresponding controls were downloaded from the ENCODE project website (https://www.encodeproject.org/) (Dunham et al., 2012). ChExMix was used to call binding events using the aligned tags of ChIP-seq experiments and the corresponding control experiments. We ran ChExMix with the options “--noclustering --readfilter”. The first option turns off affinity propagation clustering of tag distributions so that subtypes are defined only using motifs. The second option enforces a global per-base limit from a Poisson distribution to remove PCR duplicates. After obtaining collections of ChExMix-defined subtypes from the K562 ChIP-seq data, we trimmed low information content flanking columns (<0.2) from the motifs. We then used STAMP (Mahony and Benos, 2007) to assess pairwise similarities of the subtype-specific motifs. We ran STAMP with the option “-align SWU” to use the ungapped Smith–Waterman algorithm as the alignment method. We then applied hierarchical clustering on a matrix that summarized the fractions of each factor's total binding events that belong to each subtype.
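The motif trimming step can be illustrated with a short sketch. The text above gives only the threshold (<0.2), so the example assumes that information content is measured in bits as relative entropy against a uniform background, which may differ from the definition actually used. The function names and the toy motif are hypothetical.

```python
import math


def column_information(column, background=(0.25, 0.25, 0.25, 0.25)):
    """Relative entropy (bits) of one motif column (A, C, G, T) vs. a background.

    Bits against a uniform background are an assumption made for this sketch.
    """
    total = sum(column)
    info = 0.0
    for count, bg in zip(column, background):
        p = count / total
        if p > 0:
            info += p * math.log2(p / bg)
    return info


def trim_flanks(matrix, min_info=0.2):
    """Drop leading and trailing columns below min_info; keep internal columns."""
    infos = [column_information(col) for col in matrix]
    start, end = 0, len(matrix)
    while start < end and infos[start] < min_info:
        start += 1
    while end > start and infos[end - 1] < min_info:
        end -= 1
    return matrix[start:end]


# Toy motif: uninformative flanks around an informative core. The weak
# internal column is kept because only flanking columns are trimmed.
motif = [
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.30, 0.30, 0.20, 0.20],
    [0.05, 0.05, 0.85, 0.05],
    [0.26, 0.24, 0.25, 0.25],
]
print(len(trim_flanks(motif)))  # 3
```

Trimming only from the edges preserves the spacing between informative positions, which matters when the trimmed motifs are subsequently aligned and compared.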

References

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.
Cantor, A.B., and Orkin, S.H. 2002. Transcriptional regulation of erythropoiesis: An affair involving multiple partners. Oncogene 21, 3368–3376.
Dunham, I., Kundaje, A., Aldred, S.F., et al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.
Han, G.C., Vinayachandran, V., Bataille, A.R., et al. 2015. Genome-wide organization of GATA1 and TAL1 determined at high resolution. Mol Cell Biol 36, 157–172.
Knoepfler, P.S., Calvo, K.R., Chen, H., et al. 1997. Meis1 and PKnox1 bind DNA cooperatively with Pbx1 utilizing an interaction surface disrupted in oncoprotein E2a-Pbx1. Proc Natl Acad Sci U S A 94, 14553–14558.
Li, H., and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.
Mahony, S., and Benos, P.V. 2007. STAMP: A web tool for exploring DNA-binding motif similarities. Nucleic Acids Res 35(Web Server issue), W253–W258.
Rhee, H.S., and Pugh, B.F. 2011. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419.
Yamada, N., Lai, W.K.M., Farrell, N., et al. 2019. Characterizing protein-DNA binding event subtypes in ChIP-exo data. Bioinformatics 35, 903–913.

Published In

Journal of Computational Biology
Volume 27, Issue Number 3, March 2020
Pages: 429 - 435
PubMed: 32023130

History

Published online: 11 March 2020
Published in print: March 2020
Published ahead of print: 5 February 2020

Authors

Affiliations

Naomi Yamada
Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania.
Prashant Kumar Kuntala
Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania.
B. Franklin Pugh
Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania.
Shaun Mahony [email protected]
Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania.

Notes

Address correspondence to: Dr. Shaun Mahony, Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, 404 South Frear Laboratory, University Park, PA 16802 [email protected]

Author Disclosure Statement

B.F.P. has a financial interest in Peconic, LLC, which utilizes the ChIP-exo technology implemented in this study and could potentially benefit from the outcomes of this research.

Funding Information

This study was supported by NIH R01 GM125722 (S.M. and B.F.P.).
