18 research outputs found

    An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding

    Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding, we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS's multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulatory signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. Funding: National Science Foundation (U.S.) (Graduate Research Fellowship under Grant 0645960); National Institutes of Health (U.S.) (grant P01 NS055923); Pennsylvania State University, Center for Eukaryotic Gene Regulation.

    Deconvolving sequence features that discriminate between overlapping regulatory annotations.

    Genomic loci with regulatory potential can be annotated with various properties. For example, genomic sites bound by a given transcription factor (TF) can be divided according to whether they are proximal or distal to known promoters. Sites can be further labeled according to the cell types and conditions in which they are active. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between the labels; e.g., if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or with promoters. To meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, SeqUnwinder is able to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.

    Overview of SeqUnwinder, which takes an input list of annotated genomic sites and identifies label-specific discriminative motifs.

    (A) Schematic showing a typical input instance for SeqUnwinder: a list of genomic coordinates and corresponding annotation labels. (B) The underlying classification framework implemented in SeqUnwinder. Subclasses (combinations of annotation labels) are treated as different classes in a multi-class classification framework. The label-specific properties are implicitly modeled using L1-regularization. (C) Weighted k-mer models are used to identify 10-15 bp focus regions called hills. MEME is used to identify motifs at hills. (D) De novo identified motifs in (C) are scored using the weighted k-mer model to obtain label-specific scores.
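
    The input representation described in this caption (sites carrying several annotation labels, grouped into subclasses, and described by k-mer features) can be illustrated with a minimal Python sketch. This is not SeqUnwinder's code: the function names, the k-mer range of 4 to 5, and the toy sequences are assumptions made for illustration, and the L1-regularized multi-class classifier itself is not implemented here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers of length k in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def featurize(seq, k_min=4, k_max=5):
    """Pool k-mer counts over a range of k (the range is an assumed default)."""
    counts = Counter()
    for k in range(k_min, k_max + 1):
        counts.update(kmer_counts(seq, k))
    return counts


def subclass(labels):
    """Combine a site's annotation labels into a single subclass name."""
    return "+".join(sorted(labels))


# Hypothetical toy input: (sequence, set of annotation labels) per site.
sites = [
    ("ACGTACGTAC", {"Early", "ES-active"}),
    ("TTGACCATGG", {"Late", "ES-inactive"}),
]

# Each site becomes (subclass name, k-mer feature counts); a multi-class
# classifier would then be trained with the subclass names as classes.
training_rows = [(subclass(labels), featurize(seq)) for seq, labels in sites]
```

    Treating each label combination as its own class keeps overlapping labels separable at training time; how label-specific weights are then recovered from the subclass models is specific to the published method and is not reproduced in this sketch.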

    SeqUnwinder analysis of Lhx3 binding classes during iMN programming.

    (A) Lhx3 binding sites labeled using their dynamic binding behavior and ES chromatin activity statuses. (B) Label-specific scores of de novo motifs identified at Lhx3 binding sites defined in “A” using MCC (left) and SeqUnwinder (right) models. For consistency across figures, we fix the color saturation values to -0.4 and 0.4. (C) Log-odds score distribution of the de novo discovered Onecut-like motif at “ES-active”, “ES-inactive”, “Early”, “Shared”, and “Late” sites (left panel). Distribution of Onecut2 (48hr) ChIP-seq tag counts in log-scale at “ES-active”, “ES-inactive”, “Early”, “Shared”, and “Late” sites (right panel). (D) Log-odds score distribution of the de novo discovered Oct4-like motif at “ES-active”, “ES-inactive”, “Early”, “Shared”, and “Late” sites (left panel). Distribution of Oct4 (0hr) ChIP-seq tag counts in log-scale at “ES-active”, “ES-inactive”, “Early”, “Shared”, and “Late” sites (right panel). Statistical significance was calculated using the Mann-Whitney-Wilcoxon test (*: P-value < 0.001).
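
    The Mann-Whitney-Wilcoxon comparison named in this caption rests on the U statistic, which can be sketched in a few lines of Python. The function name and the toy values are illustrative assumptions, not the authors' analysis code; for P-values, an established routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j.

    Tied pairs contribute 0.5 each, which is the usual tie convention.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical motif log-odds scores at two classes of sites.
scores_a = [3, 4, 5]
scores_b = [1, 2, 3]

u_a = mann_whitney_u(scores_a, scores_b)  # 8.5
u_b = mann_whitney_u(scores_b, scores_a)  # 0.5
# The two statistics always sum to len(x) * len(y), here 9.
```

    The pairwise loop is quadratic in the sample sizes, so it is only meant to make the definition concrete; rank-based implementations are preferable for the tens of thousands of sites typical of ChIP-seq analyses.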

    Discriminative sequence feature analysis at DHS sites in 6 different ENCODE cell-lines using SeqUnwinder.

    (A) ~140K DHS sites annotated with 6 different cell-line labels used to identify cell-line specific and shared sequence features. (B) Label-specific scores of all the de novo motifs identified at DHS sites in “A”. For consistency across figures, we fix the color saturation values to -0.4 and 0.4.

    Performance of SeqUnwinder on simulated datasets.

    (A) 9000 simulated genomic sites with corresponding motif associations. (B) Label-specific scores for all de novo motifs identified using MCC (left) and SeqUnwinder (right) models on simulated genomic sites in “A”. For consistency across figures, we fix the color saturation values to -0.4 and 0.4. (C) Schematic showing 100 genomic datasets with 6000 genomic sites and varying degrees of label overlap ranging from 0.5 to 0.99. (D-F) Performance of MCC (multi-class logistic classifier), DREME, and SeqUnwinder on simulated datasets in “C”, measured using (D) the F1-score, (E) true positive rates, and (F) false positive rates.
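
    The three performance measures named in panels (D) to (F) follow the standard definitions and can be computed from confusion counts as in the Python sketch below. How the benchmark counts a true or false positive (for example, a motif correctly or incorrectly assigned to a label) is not stated in this caption, so the counts used here are hypothetical.

```python
def classification_rates(tp, fp, tn, fn):
    """Return F1-score, true positive rate, and false positive rate.

    tp, fp, tn, fn are counts of true/false positives and negatives.
    Ratios with an empty denominator are reported as 0.0.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall / sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + tpr:
        f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    else:
        f1 = 0.0
    return {"f1": f1, "tpr": tpr, "fpr": fpr}


# Hypothetical counts for one simulated dataset.
rates = classification_rates(tp=8, fp=2, tn=18, fn=2)
# tpr = 8/10 = 0.8, fpr = 2/20 = 0.1, precision = 8/10 = 0.8, f1 = 0.8
```

    Reporting the false positive rate alongside F1 matters in this setting because overlapping labels mainly cause motifs to be credited to the wrong label, an error that the true positive rate alone would not reveal.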

    SeqUnwinder analysis of sequence features at multi-condition TF binding sites for ENCODE YY1 datasets.

    (A) Heatmaps showing the YY1 ChIP-seq reads at curated YY1 binding sites, stratified based on binding across cell-lines and distance from annotated mRNA TSS. The order of subclasses is: Shared and Proximal, Shared and Distal, K562 and Proximal, K562 and Distal, GM12878 and Proximal, GM12878 and Distal, H1-hESC and Proximal, and H1-hESC and Distal. (B) De novo motifs and corresponding label-specific scores identified using SeqUnwinder at events defined in (A). For consistency across figures, we fix the color saturation values to -0.4 and 0.4.

    Proneural factors Ascl1 and Neurog2 contribute to neuronal subtype identities by establishing distinct chromatin landscapes

    12 pages, 7 figures. Developmental programs that generate the astonishing neuronal diversity of the nervous system are not completely understood and thus present a major challenge for clinical applications of guided cell differentiation strategies. Using direct neuronal programming of embryonic stem cells, we found that two main vertebrate proneural factors, Ascl1 and neurogenin 2 (Neurog2), induce different neuronal fates by binding to largely different sets of genomic sites. Their divergent binding patterns are not determined by the previous chromatin state, but are distinguished by enrichment of specific E-box sequences that reflect the binding preferences of the DNA-binding domains. The divergent Ascl1 and Neurog2 binding patterns result in distinct chromatin accessibility and enhancer activity profiles that differentially shape the binding of downstream transcription factors during neuronal differentiation. This study provides a mechanistic understanding of how transcription factors constrain terminal cell fates, and it delineates the importance of choosing the right proneural factor in neuronal reprogramming strategies. This work is supported by NICHD (R01HD079682) and Project ALS (A13-0416) to E.O.M. and by NYSTEM pre-doctoral training grant (C026880) to B.A. S.M. is supported by NIGMS (R01GM125722) and the National Science Foundation ABI Innovation Grant No. DBI1564466. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. M.R. is supported by NYU MSTP (T32GM007308) and Developmental Genetics T32 (T32HD007520) grants. N.F. and M.M.E. are supported by ERC Starting Grant (2011-281920). The authors would like to thank Link Tejavibulya and Apeksha Ashokkumar for their help with molecular biology; Mohammed Khalfan for his help with scRNA-seq analysis; Michael Cammer from the NYU Medical Center Microscopy Core for the ImageJ script used in calcium imaging analysis; and NYU Genomics Core facility. Finally, we would like to thank Steve Small, Nikos Konstantinidis, Pinar Onal, Orly Wapinski, Sevinç Ercan, Chris Rushlow, Claude Desplan and Mazzoni lab members for their helpful suggestions on the manuscript. Peer reviewed.
