31 research outputs found

    Computational modeling of gene regulatory programs in differentiation and disease

    Cell state transitions are tightly controlled by numerous regulatory mechanisms to achieve cellular differentiation. Dysregulation of these regulatory mechanisms through the acquisition of somatic mutations and/or copy number changes can lead to oncogenic transformation. Binding of transcription factors (TFs) to regulatory elements is a primary mechanism controlling gene expression. TFs work in conjunction with chromatin to either activate or repress specific genes. miRNA-mediated degradation is another key regulatory mechanism involved in post-transcriptional repression of genes. Genomics projects like ENCODE, Roadmap Epigenomics, TCGA and others are generating rich datasets across cell lines, primary tissues and cancers. These datasets enable computational modeling of transcriptional and miRNA-mediated regulation. In this thesis, I will present our work on integrating multimodal datasets along with DNA sequence information to decipher novel regulatory programs in human disease and differentiation. First, we use the TCGA-generated GBM dataset as a case study to infer gene regulatory programs in disease. We model the gene expression change in GBM relative to normal brain as a function of the copy number of the gene, and of TF and miRNA binding sites in the promoter and 3'UTR, respectively. We use regularized least squares regression to fit the expression change of all genes for each sample. This framework achieves significant accuracy compared to randomized gene expression values, and clustering of the regression models recapitulates expression subtypes. We then employ a multi-task learning framework to learn regression models of all samples simultaneously and define a feature-scoring scheme to identify subtype-specific and common regulators. Using these experiments and a literature search, we were able to identify a core regulatory network centered at the REST repression complex in the proneural subtype of GBM.
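The per-sample regression described above can be illustrated with a minimal sketch. It assumes each gene is described by a small feature vector (copy number change plus binary indicators for a TF site in the promoter and a miRNA site in the 3'UTR) and fits ridge regression, which is one common form of regularized least squares. The abstract does not state which penalty was used, so the L2 penalty, the function names, and the toy numbers below are all assumptions for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(features, targets, lam):
    """Return weights minimizing ||Xw - y||^2 + lam * ||w||^2."""
    p = len(features[0])
    # Normal equations: (X^T X + lam * I) w = X^T y
    gram = [[sum(row[i] * row[j] for row in features) for j in range(p)]
            for i in range(p)]
    for i in range(p):
        gram[i][i] += lam
    rhs = [sum(row[i] * y for row, y in zip(features, targets))
           for i in range(p)]
    return solve_linear(gram, rhs)


# Hypothetical toy data for one tumor sample: one row per gene.
# Columns: copy number change, TF site in promoter (0/1),
# miRNA site in 3'UTR (0/1).
X = [
    [1.0, 1, 0],
    [0.0, 1, 1],
    [-1.0, 0, 1],
    [0.5, 0, 0],
    [0.0, 1, 0],
]
# Made-up expression change of each gene relative to normal tissue.
y = [1.6, 0.1, -1.4, 0.4, 0.7]

weights = ridge_fit(X, y, lam=0.1)
print(weights)
```

In this toy setup the fitted weight on the miRNA column comes out negative and the other two positive, which is the kind of per-feature signal a feature-scoring scheme could then compare across samples.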
I will then present our work on characterizing regulatory changes in hematopoietic differentiation, primarily using DNase-seq enhancer maps from the Roadmap Epigenomics project. We first developed a tool, SeqGL, which demonstrates significantly greater sensitivity to binding signals underlying enhancer maps compared to traditional motif discovery algorithms. We then characterize the locus complexity, defined as the number of DNase peaks assigned to a gene, in the hematopoietic system and observe that high-complexity genes tend to be cell-type specific in expression and are enriched for functionally relevant ontologies. Furthermore, we observe extensive poising of enhancers in progenitor cells for function in differentiated cell types. We then use SeqGL scores to predict gene expression change in a transition from stem and progenitor cells to differentiated cell types with high accuracy and identify a potentially novel mechanistic role for PU.1 in B cell and monocyte specification.
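Locus complexity, as defined above, is a simple per-gene count. A minimal sketch, assuming the peaks have already been assigned to genes by some rule (the abstract does not state the assignment rule, and the peak and gene names below are hypothetical):

```python
from collections import Counter

# Hypothetical (peak_id, assigned_gene) pairs. How a peak is assigned
# to a gene is outside the scope of this sketch.
peak_assignments = [
    ("peak1", "GENE_A"),
    ("peak2", "GENE_A"),
    ("peak3", "GENE_B"),
    ("peak4", "GENE_A"),
    ("peak5", "GENE_C"),
]


def locus_complexity(assignments):
    """Count the number of DNase peaks assigned to each gene."""
    return Counter(gene for _, gene in assignments)


complexity = locus_complexity(peak_assignments)
print(complexity.most_common())
```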

    SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps

    Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel de novo motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a k-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL successfully scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL was able to identify a number of ChIP-seq validated sequence signals that were not found by traditional motif discovery algorithms. Thus, compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at http://cbio.mskcc.org/public/Leslie/SeqGL/.
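The k-mer feature representation mentioned above can be sketched as follows. This is a simplified illustration and not SeqGL's own code: it counts plain overlapping k-mers without wildcards or reverse-complement handling, and the sequences, labels, and function names are made up for the example.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def featurize(seqs, k):
    """Map sequences to count vectors over all 4**k possible k-mers."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    for seq in seqs:
        counts = kmer_counts(seq, k)
        rows.append([counts.get(kmer, 0) for kmer in vocab])
    return vocab, rows


# Made-up sequences: label 1 for peak sequences, 0 for flanking regions.
peaks = ["ACGTGACGTG", "TTACGTGAAC"]
flanks = ["TTTTTTAAAA", "CCCCGGGGAA"]
vocab, X = featurize(peaks + flanks, k=3)
labels = [1] * len(peaks) + [0] * len(flanks)

print(len(vocab), X[0][vocab.index("CGT")])
```

The resulting matrix `X` and `labels` are the inputs a discriminative model (for example, a regularized logistic regression) would be trained on to separate peaks from flanks.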

    Sequence preferences of GM12878 DNase-seq peaks.

    (A) Heatmap showing the group scores for top groups over 16,891 GM12878 DNase-seq peaks. Note that some peaks have a sequence signal for a single factor while others have signals for multiple factors (FDR-corrected p < 0.01). All the group predictions identified by “*” have been validated by ChIP-seq data, while “#” indicates no ChIP-seq data available in ENCODE. The dashed boxes highlight the specific examples illustrated in Fig 5B. (B) The left panel shows a DNase peak with a strong score for a single transcription factor (NRF1). The middle panel shows DNase peaks with moderate scores for both BATF and RUNX. The right panel shows different binding preferences in adjacent split peaks derived from a single broad peak. All the predictions are validated by ChIP-seq.

    SeqGL performs significantly better than traditional motif discovery methods across different settings.

    (A) Plot showing the k-mer weights inferred by group lasso regularized logistic regression for PAX5 ChIP-seq in the GM12878 cell line. A number of groups are uniformly set to 0 (Group 5), while other groups are significantly predictive of either peaks or flanks (Group 3 and Group 2, respectively). Motifs identified for groups that are strongly predictive of peaks and the corresponding TFs are also shown. (B) PAX5 ChIP-seq auROC on the test set comparing the discriminative performance of SeqGL with motif finding tools and k-mer methods. The different colors correspond to the colors in Fig 2C. (C) Plots showing auROCs on test sets for 105 ChIP-seq experiments using different tools and settings. Three different settings were used for the motif finding tools HOMER, DREME and HOMER (see S1 Fig, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004271#pcbi.1004271.s001). ‘Best motif’ uses the highest-ranking motif from each method, as defined by the p-value; ‘Max motif’ uses the motif with maximum log odds score in each example; and ‘Motif elastic’ uses elastic net logistic regression across all motifs determined by the respective method. Only the ‘Motif elastic’ methods are shown in the performance plots, since they outperform the ‘Best motif’ and ‘Max motif’ methods. SeqGL and other k-mer methods significantly outperform the different motif finding tools across all settings (Wilcoxon rank sum p-values < 7e-3). gkm-SVM performs marginally (but not significantly) better than SeqGL with 5K top discriminative features (Wilcoxon rank sum p-value = 0.06); SeqGL using a larger feature set (30K) gives identical performance to gkm-SVM (no difference in the distribution of auROC scores based on a Wilcoxon rank sum test, using p-value < 0.05 as our threshold of significance).
Furthermore, the elastic-net regressor on the full SeqGL feature space using 10-mers with 3 wildcards (similar to settings used by gkm-SVM) also yields identical performance. While the discriminative accuracy is comparable, unlike other k-mer methods, SeqGL identifies multiple distinct DNA binding signals from the same ChIP-seq experiment (S3 Table, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004271#pcbi.1004271.s015).
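The auROC values reported in this caption measure how well a model ranks peak sequences above flank sequences. A minimal sketch of that computation, assuming binary labels (1 for peak, 0 for flank) and made-up scores: it counts the fraction of peak/flank pairs ranked correctly, which equals the Mann-Whitney U statistic divided by the number of pairs. The function name and data are illustrative only and are not taken from the paper's code.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Each (positive, negative) pair scores 1 if the positive is ranked
    higher, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up test-set scores for two peaks (label 1) and two flanks (label 0).
test_scores = [0.9, 0.8, 0.4, 0.3]
test_labels = [1, 0, 1, 0]
print(auroc(test_scores, test_labels))  # 3 of 4 pairs ranked correctly: 0.75
```

The pairwise loop is quadratic in the number of examples; for large test sets a rank-based formulation gives the same value more efficiently.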

    Candidate binding partners for a transcription factor are dependent on genomic context, cell type, and expression levels of partners.

    (A) Multitask learning identifies different candidate binding partners of POU2F2 at proximal and distal locations to genes. The proximal partners are associated with factors that preferentially bind to CG-rich motifs, whereas the distal partners are associated with cell-type specific regulators. As expected, the POU/octamer motif is present in both contexts. (B) Similarly, TCF12 binding profiles in GM12878 and H1-hESC show a preference for cell-type specific regulators at the cell-type specific sites, and the TCF motif is associated with peaks of both cell types. (C-D) The enhancer-bound factor p300 shows highly cell-type specific binding and predicted TF signals. The cell-type specific expression of the p300 candidate binding partners partially explains the differing binding profiles of p300 across cell types.

    SeqGL identifies binding profiles in genome-wide regulatory maps.

    SeqGL uses sparse group lasso to identify the most important k-mer groups that discriminate between ChIP-seq/DNase-seq peaks and flanks. Hierarchical clustering of k-mer counts across peak and flank sequences reveals a block structure that defines k-mer groups. A representative heatmap of k-mer frequencies for a subset of peaks and flanks is shown. Sparse group lasso regression sets some groups uniformly to zero; groups with non-zero weights define group signals that may represent binding sequence signals for individual TFs. Significant hits for each group signal are identified, and sequence windows around these hits are extracted. HOMER is then applied to the windows to associate these group signals with motifs for visualization and identification.
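The statement that sparse group lasso sets some groups uniformly to zero can be made concrete with the proximal operator that many solvers apply at each iteration. The sketch below is a generic textbook form and not SeqGL's actual solver: it soft-thresholds each weight (the L1 part) and then shrinks each group's vector toward zero by its Euclidean norm (the group part). The penalty values, the group layout, and the omission of per-group size weights are assumptions made for illustration.

```python
import math


def soft_threshold(x, t):
    """Shrink a single weight toward zero by t (the L1 step)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_group_prox(weights, groups, l1, l2):
    """Apply the sparse group lasso proximal step.

    groups is a list of index lists, one per k-mer group. A group whose
    soft-thresholded norm is at most l2 is set to zero as a whole.
    """
    out = [0.0] * len(weights)
    for idx in groups:
        shrunk = [soft_threshold(weights[i], l1) for i in idx]
        norm = math.sqrt(sum(v * v for v in shrunk))
        scale = max(0.0, 1.0 - l2 / norm) if norm > 0 else 0.0
        for i, v in zip(idx, shrunk):
            out[i] = scale * v
    return out


# Hypothetical k-mer weights split into three groups.
w = [0.9, -0.8, 0.05, -0.02, 0.6]
groups = [[0, 1], [2, 3], [4]]
shrunk_w = sparse_group_prox(w, groups, l1=0.1, l2=0.2)
print(shrunk_w)
```

Here the second group, whose weights are all small, is zeroed out entirely, while the other two groups keep non-zero (but shrunken) weights.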

    Context determinants of DNase binding signals.

    (A) DNase peaks across cell types show significant differences in the underlying sequence preferences depending on the context. DNase peaks that are common to both GM12878 and H1-hESC cell lines show strong preferences for either insulator proteins (CTCF) or promoter-associated regulators (NFY), whereas, again, cell-type specific peaks show preferences for cell-type specific regulators (BATF, IRF, RUNX in GM12878 and OCT, SMAD, NANOG in H1-hESC). (B) The chromatin context of a DNase peak also defines a context for specific binding preferences. DNase peaks in active promoter regions are associated with NRF, SP1 and NFY motifs, whereas peaks in the enhancer regions are associated with CTCF, SOX and TEAD motifs.

    SeqGL identifies sequence signals underlying ATAC-seq peaks.

    Distribution of top-scoring TFs across (A) DNase-seq and (B) ATAC-seq peaks. TFs identified by SeqGL using ATAC-seq peaks include IRF, BATF and other cell-type factors that are strongly represented across peaks in both data types, whereas NRF and CTCF show enrichment in DNase-seq and ATAC-seq, respectively. The fraction of intergenic enhancer peaks is significantly higher in ATAC-seq, potentially explaining the higher occurrence of CTCF.

    SeqGL identifies binding signals in ChIP-seq occupancy profiles.

    (A) Heatmaps show predicted binding signals and ChIP occupancy of PAX5 and co-factors. SeqGL analysis of PAX5 ChIP-seq predicts BATF and PU.1 as the most significant binding partners of PAX5. The top panel shows the group scores associated with three TFs from the PAX5 model, and the bottom panel shows the corresponding ChIP-seq read counts. This shows that a number of PAX5 ChIP-seq peaks are indirect and obtained through DNA binding of partners rather than PAX5 itself. The dashed boxes highlight the specific examples illustrated in Fig 3B. (B) Specific examples of PAX5 profiles show various modes of binding detected by SeqGL. The left panel shows direct binding of a TF (PAX5 recognizing its motif). The middle panel shows that the sequence signal is associated with BATF, and hence the PAX5 peak at this location is due to interaction of the two factors and/or long-distance looping. The right panel shows an example of co-binding of PAX5 and PU.1, each recognizing its respective binding motif.