13 research outputs found

    Sequence preferences of GM12878 DNase-seq peaks.

    <p>(A) Heatmap showing the group scores for top groups over 16,891 GM12878 DNase-seq peaks. Note that some peaks have a sequence signal for a single factor while others have signals for multiple factors (FDR-corrected <i>p</i> < 0.01). All the group predictions identified by “*” have been validated by ChIP-seq data, while “#” indicates no ChIP-seq data available in ENCODE. The dashed boxes highlight the specific examples illustrated in Fig 5B. (B) The left panel shows a DNase peak with a strong score for a single transcription factor (NRF1). The middle panel shows DNase peaks with moderate scores for both BATF and RUNX. The right panel shows different binding preferences in adjacent split peaks derived from a single broad peak. All the predictions are validated by ChIP-seq.</p>

    SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps

    <div><p>Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel <i>de novo</i> motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a <i>k</i>-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL successfully scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL was able to identify a number of ChIP-seq validated sequence signals that were not found by traditional motif discovery algorithms. Thus, compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at <a href="http://cbio.mskcc.org/public/Leslie/SeqGL/" target="_blank">http://cbio.mskcc.org/public/Leslie/SeqGL/</a>.</p></div>
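The <i>k</i>-mer feature representation mentioned in the abstract can be illustrated with a minimal sketch. This is not SeqGL's code: the function names, the toy sequences, and the choice of k = 2 are assumptions made only for illustration, and plain overlapping counts are used without wildcards.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_vector(seq, k):
    """Return k-mer counts in a fixed lexicographic order over all 4**k k-mers."""
    counts = kmer_counts(seq, k)
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocab]


# Hypothetical toy data: one "peak" sequence (label 1) and one "flank" (label 0).
peak = "ACGTACGT"
flank = "AAAAAAAA"
X = [feature_vector(peak, 2), feature_vector(flank, 2)]
y = [1, 0]
```

Each row of `X` is a fixed-length count vector, so peaks and flanks of any length can be fed to the same discriminative model.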

    SeqGL performs significantly better than traditional motif discovery methods across different settings.

    <p>(A) Plot showing the <i>k</i>-mer weights inferred by group lasso regularized logistic regression for PAX5 ChIP-seq in the GM12878 cell line. A number of groups are uniformly set to 0 (Group 5), while other groups are significantly predictive of either peaks or flanks (Group 3 and Group 2, respectively). Motifs identified for groups that are strongly predictive of peaks and the corresponding TFs are also shown. (B) PAX5 ChIP-seq auROC on the test set comparing the discriminative performance of SeqGL with motif finding tools and <i>k</i>-mer methods. The different colors correspond to the colors in Fig 2C. (C) Plots showing auROCs on test sets for 105 ChIP-seq experiments using different tools and settings. Three different settings were used for the motif finding tools HOMER, DREME and HOMER (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004271#pcbi.1004271.s001" target="_blank">S1 Fig</a>). ‘Best motif’ uses the highest-ranking motif from each method, as defined by the <i>p</i>-value; ‘Max motif’ uses the motif with maximum log odds score in each example; and ‘Motif elastic’ uses elastic net logistic regression across all motifs determined by the respective method. Only the ‘Motif elastic’ methods are shown in the performance plots, since they outperform the ‘Best motif’ and ‘Max motif’ methods. SeqGL and other <i>k</i>-mer methods significantly outperform the different motif finding tools across all settings (Wilcoxon rank sum <i>p</i>-values < 7e-3). gkm-SVM performs marginally (but not significantly) better than SeqGL with the 5K top discriminative features (Wilcoxon rank sum <i>p</i>-value = 0.06); SeqGL using a larger feature set (30K) gives identical performance to gkm-SVM (no difference in the distribution of auROC scores based on a Wilcoxon rank sum test, using <i>p</i>-value < 0.05 as our threshold of significance). Furthermore, the elastic-net regressor on the full SeqGL feature space using 10-mers with 3 wildcards (similar to settings used by gkm-SVM) also yields identical performance. While the discriminative accuracy is comparable, unlike other <i>k</i>-mer methods, SeqGL identifies multiple distinct DNA binding signals from the same ChIP-seq experiment (<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004271#pcbi.1004271.s015" target="_blank">S3 Table</a>).</p>
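For readers unfamiliar with the auROC metric used in this caption, the sketch below computes it directly from its pairwise definition: the fraction of (peak, flank) pairs in which the peak receives the higher score, with ties counted as one half. This equals the normalized Mann-Whitney U statistic. The function name and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def auroc(pos_scores, neg_scores):
    """auROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half. This quadratic-time form is meant for clarity,
    not for large test sets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for held-out peaks and flanks.
peak_scores = [0.9, 0.8, 0.4]
flank_scores = [0.7, 0.3, 0.2]
print(auroc(peak_scores, flank_scores))
```

In the toy data, eight of the nine peak/flank pairs are ordered correctly. The Wilcoxon rank sum tests quoted in the caption are a separate step: they compare the distributions of such auROC values across experiments between methods.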

    SeqGL identifies binding profiles in genome-wide regulatory maps.

    <p>SeqGL uses sparse group lasso to identify the most important <i>k</i>-mer groups that discriminate between ChIP-seq/DNase-seq peaks and flanks. Hierarchical clustering of <i>k</i>-mer counts across peak and flank sequences reveals a block structure that defines <i>k</i>-mer groups. A representative heatmap of <i>k</i>-mer frequencies for a subset of peaks and flanks is shown. Sparse group lasso regression sets some groups uniformly to zero; groups with non-zero weights define group signals that may represent binding sequence signals for individual TFs. Significant hits for each group signal are identified, and sequence windows around these hits are extracted. HOMER is then applied to the windows to associate these group signals with motifs for visualization and identification.</p>
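Why a sparse group lasso penalty sets whole groups uniformly to zero can be seen from its proximal operator. The sketch below is not SeqGL's solver. It assumes the unweighted penalty lam_l1 * ||w||_1 + lam_group * sum_g ||w_g||_2 over non-overlapping groups, and all names and numbers are hypothetical.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t, clipping at zero (the lasso proximal step)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def sparse_group_prox(weights, groups, lam_l1, lam_group):
    """Proximal operator of lam_l1*||w||_1 + lam_group*sum_g ||w_g||_2.

    `groups` is a list of index lists that partition the weights. Each group
    is soft-thresholded elementwise, then rescaled by max(0, 1 - lam_group/norm),
    which removes the whole group when its norm is at most lam_group.
    """
    out = list(weights)
    for idx in groups:
        u = [soft_threshold(weights[i], lam_l1) for i in idx]
        norm = math.sqrt(sum(v * v for v in u))
        scale = max(0.0, 1.0 - lam_group / norm) if norm > 0 else 0.0
        for i, v in zip(idx, u):
            out[i] = scale * v
    return out


# Hypothetical weights for two k-mer groups: one strong, one weak.
w = [3.0, -4.0, 0.3, 0.4]
shrunk = sparse_group_prox(w, [[0, 1], [2, 3]], lam_l1=0.0, lam_group=1.0)
```

Here the strong group is only rescaled, while the weak group, whose norm falls below `lam_group`, is set to zero as a block. This is the behavior the caption describes for groups with zero weight.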

    SeqGL identifies sequence signals underlying ATAC-seq peaks.

    <p>Distribution of top-scoring TFs across (A) DNase-seq and (B) ATAC-seq peaks. The TFs identified by SeqGL from ATAC-seq peaks include IRF, BATF and other cell-type factors that are strongly represented across peaks in both data types, whereas NRF and CTCF show enrichment in DNase-seq and ATAC-seq, respectively. The fraction of intergenic enhancer peaks is significantly higher in ATAC-seq, potentially explaining the higher occurrence of CTCF.</p>

    Context determinants of DNase binding signals.

    <p>(A) DNase peaks across cell types show significant differences in the underlying sequence preferences depending on the context. DNase peaks that are common to both GM12878 and H1-hESC cell lines show strong preferences for either insulator proteins (CTCF) or promoter-associated regulators (NFY), whereas cell-type specific peaks show preferences for cell-type specific regulators (BATF, IRF, RUNX in GM12878 and OCT, SMAD, NANOG in H1-hESC). (B) The chromatin context of a DNase peak also defines a context for specific binding preferences. DNase peaks in active promoter regions are associated with NRF, SP1 and NFY motifs, whereas peaks in enhancer regions are associated with CTCF, SOX and TEAD motifs.</p>

    SeqGL identifies binding signals in ChIP-seq occupancy profiles.

    <p>(A) Heatmaps show predicted binding signals and ChIP occupancy of PAX5 and co-factors. SeqGL analysis of PAX5 ChIP-seq predicts BATF and PU.1 as the most significant binding partners of PAX5. The top panel shows the group scores associated with three TFs from the PAX5 model, and the bottom panel shows the corresponding ChIP-seq read counts. This shows that a number of PAX5 ChIP-seq peaks are indirect and obtained through DNA binding of partners rather than PAX5 itself. The dashed boxes highlight the specific examples illustrated in Fig 3B. (B) Specific examples of PAX5 profiles show various modes of binding detected by SeqGL. The left panel shows direct binding of a TF (PAX5 recognizing its motif). The middle panel shows that the sequence signal is associated with BATF, and hence the PAX5 peak at this location is due to interaction of the two factors and/or long-distance looping. The right panel shows an example of co-binding of PAX5 and PU.1, each recognizing its respective binding motif.</p>

    Candidate binding partners for a transcription factor are dependent on genomic context, cell type, and expression levels of partners.

    <p>(A) Multitask learning identifies different candidate binding partners of POU2F2 at locations proximal and distal to genes. The proximal partners are associated with factors that preferentially bind to CG-rich motifs, whereas the distal partners are associated with cell-type specific regulators. As expected, the POU/octamer motif is present in both contexts. (B) Similarly, TCF12 binding profiles in GM12878 and H1-hESC show a preference for cell-type specific regulators at the cell-type specific sites, and the TCF motif is associated with peaks of both cell types. (C-D) The enhancer-bound factor p300 shows highly cell-type specific binding and predicted TF signals. The cell-type specific expression of the p300 candidate binding partners partially explains the differing binding profiles of p300 across cell types.</p>

    <i>In vivo</i> quantification of ribosome-bound and total RNA levels revealed a broad range of ribosome recruitment efficiencies amongst mRNA transcripts.

    <p>(A) Distribution of mRNA expression in ribosome-bound and total RNA pools from PDGF-driven glioma identified differential TE (N = 4). (B) TE values for each biological replicate (black points) plotted with the average of the other three replicates (red line) demonstrated reproducibility of measurements. (C) Signal-to-noise ratios of TE measurements (blue bars) identified the range of high-confidence measurements relative to a normal distribution (red line). (D) GSEA identified statistical overrepresentation of defined gene ontologies amongst efficiently and inefficiently translated genes. Black bars represent the distribution of mRNAs from the indicated gene set amongst all genes ranked by signal-to-noise ratio (top panel). Red line represents the GSEA output enrichment score. R = Pearson correlation coefficient.</p>