28 research outputs found

    Analysis, Visualization, and Machine Learning of Epigenomic Data

    The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions that are bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully used the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical impossibility of assaying every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and we demonstrate that machine learning is critical to help decipher functional regions of the genome.

    Factorbook: an Updated Catalog of Transcription Factor Motifs and Candidate Regulatory Motif Sites [preprint]

    The human genome contains roughly 1,600 transcription factors (TFs) (1), DNA-binding proteins recognizing characteristic sequence motifs to exert regulatory effects on gene expression. The binding specificities of these factors have been profiled both in vitro, using techniques such as HT-SELEX (2), and in vivo, using techniques including ChIP-seq (3, 4). We previously developed Factorbook, a TF-centric database of annotations, motifs, and integrative analyses based on ChIP-seq data from Phase II of the ENCODE Project. Here we present an update to Factorbook which significantly expands the breadth of cell type and TF coverage. The update includes an expanded motif catalog derived from thousands of ENCODE Phase II and III ChIP-seq experiments and HT-SELEX experiments; this motif catalog is integrated with the ENCODE registry of candidate cis-regulatory elements to annotate a comprehensive collection of genome-wide candidate TF binding sites. The database also offers novel tools for applying the motif models within machine learning frameworks and using these models for integrative analysis, including annotation of variants and disease and trait heritability. We will continue to expand the resource as ENCODE Phase IV data are released

    Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers

    Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks
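The reranking idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy signal values and labels are invented, the plain subtraction of a contrast-tissue signal is only one possible differential score and not necessarily the authors' formulation, and average precision is used as a common stand-in for PR-AUC.

```python
from typing import List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Mean of the precision values measured at each true positive,
    after ranking items by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def differential_scores(target: List[float], contrast: List[float]) -> List[float]:
    """Hypothetical differential score: target-tissue signal minus
    contrast-tissue signal for each peak."""
    return [t - c for t, c in zip(target, contrast)]


# Toy values (invented): signal at five peaks in the tissue of interest
# and in a distant contrast tissue, plus made-up validation labels.
target_signal = [9.0, 8.0, 7.0, 6.0, 5.0]
contrast_signal = [8.5, 1.0, 6.8, 0.5, 4.9]
is_enhancer = [0, 1, 0, 1, 0]

raw_ap = average_precision(target_signal, is_enhancer)
diff_ap = average_precision(
    differential_scores(target_signal, contrast_signal), is_enhancer
)
print(f"raw ranking AP: {raw_ap:.2f}, differential ranking AP: {diff_ap:.2f}")
```

In this toy case the peaks that are strong in both tissues drop in the ranking once the contrast signal is subtracted, which is the intuition behind reranking called peaks by tissue specificity.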

    Expanded encyclopaedias of DNA elements in the human and mouse genomes

    All data are available on the ENCODE data portal: www.encodeproject.org. All code is available on GitHub from the links provided in the methods section. Code related to the Registry of cCREs can be found at https://github.com/weng-lab/ENCODE-cCREs. Code related to SCREEN can be found at https://github.com/weng-lab/SCREEN. © The Author(s) 2020. The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE(1) and Roadmap Epigenomics(2) data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes. This work was supported by grants from the NIH under U01HG007019, U01HG007033, U01HG007036, U01HG007037, U41HG006992, U41HG006993, U41HG006994, U41HG006995, U41HG006996, U41HG006997, U41HG006998, U41HG006999, U41HG007000, U41HG007001, U41HG007002, U41HG007003, U54HG006991, U54HG006997, U54HG006998, U54HG007004, U54HG007005, U54HG007010 and UM1HG009442.

    A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods

    BACKGROUND: Many genome-wide collections of candidate cis-regulatory elements (cCREs) have been defined using genomic and epigenomic data, but it remains a major challenge to connect these elements to their target genes. RESULTS: To facilitate the development of computational methods for predicting target genes, we develop a Benchmark of candidate Enhancer-Gene Interactions (BENGI) by integrating the recently developed Registry of cCREs with experimentally derived genomic interactions. We use BENGI to test several published computational methods for linking enhancers with genes, including signal correlation and the TargetFinder and PEP supervised learning methods. We find that while TargetFinder is the best-performing method, it is only modestly better than a baseline distance method for most benchmark datasets when trained and tested with the same cell type and that TargetFinder often does not outperform the distance method when applied across cell types. CONCLUSIONS: Our results suggest that current computational methods need to be improved and that BENGI presents a useful framework for method development and testing
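For readers unfamiliar with the baseline distance method mentioned in this abstract, the following is a minimal sketch under stated assumptions. It assumes that each candidate enhancer-gene pair is scored by the inverse of the linear distance between an enhancer position and a gene's transcription start site (TSS); the identifiers, coordinates, and function name are hypothetical and are not taken from BENGI.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def distance_scores(
    enhancer_pos: Dict[str, int],
    gene_tss: Dict[str, int],
    pairs: List[Pair],
) -> Dict[Pair, float]:
    """Score each (enhancer, gene) pair so that closer pairs score higher.
    Assumes all positions lie on the same chromosome."""
    return {
        (enh, gene): 1.0 / (1 + abs(enhancer_pos[enh] - gene_tss[gene]))
        for enh, gene in pairs
    }


# Hypothetical coordinates in base pairs (invented for illustration).
enhancer_pos = {"E1": 100_000, "E2": 495_000}
gene_tss = {"GENE_A": 120_000, "GENE_B": 480_000}
pairs = [(e, g) for e in enhancer_pos for g in gene_tss]

scores = distance_scores(enhancer_pos, gene_tss, pairs)
ranked = sorted(pairs, key=lambda pair: -scores[pair])
for enh, gene in ranked:
    print(enh, gene, f"{scores[(enh, gene)]:.2e}")
```

A ranking like this needs no training data, which is why it serves as a natural reference point when judging supervised methods.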

    ATLAS: A database linking binding affinities with structures for wild-type and mutant TCR-pMHC complexes

    The ATLAS (Altered TCR Ligand Affinities and Structures) database (https://zlab.umassmed.edu/atlas/web/) is a manually curated repository containing the binding affinities for wild-type and mutant T cell receptors (TCRs) and their antigens, peptides presented by the major histocompatibility complex (pMHC). The database links experimentally measured binding affinities with the corresponding three-dimensional (3D) structures for TCR-pMHC complexes. The user can browse and search affinities, structures, and experimental details for TCRs, peptides, and MHCs of interest. We expect this database to facilitate the development of next-generation protein design algorithms targeting TCR-pMHC interactions. ATLAS can be easily parsed using modeling software that builds protein structures for training and testing. As an example, we provide structural models for all mutant TCRs in ATLAS, built using the Rosetta program. Using these structures, we report a correlation of 0.63 between experimentally measured changes in binding energies and our predicted changes.

    The PsychENCODE project

    Recent research on disparate psychiatric disorders has implicated rare variants in genes involved in global gene regulation and chromatin modification, as well as many common variants located primarily in regulatory regions of the genome. Understanding precisely how these variants contribute to disease will require a deeper appreciation for the mechanisms of gene regulation in the developing and adult human brain. The PsychENCODE project aims to produce a public resource of multidimensional genomic data using tissue- and cell type–specific samples from approximately 1,000 phenotypically well-characterized, high-quality healthy and disease-affected human post-mortem brains, as well as functionally characterize disease-associated regulatory elements and variants in model systems. We are beginning with a focus on autism spectrum disorder, bipolar disorder and schizophrenia, and expect that this knowledge will apply to a wide variety of psychiatric disorders. This paper outlines the motivation and design of PsychENCODE

    Color Fundus Photography Versus Fluorescein Angiography in Identification of the Macular Center and Zone in Retinopathy of Prematurity

    PURPOSE: To examine the utility of fluorescein angiography (FA) in identification of the macular center and the diagnosis of zone in patients with retinopathy of prematurity (ROP). DESIGN: Validity and reliability analysis of diagnostic tools. METHODS: Thirty-two sets (16 color fundus photographs; 16 color fundus photographs paired with the corresponding FA) of wide-angle retinal images obtained from 16 eyes of eight infants with ROP were compiled on a secure web site. Nine ROP experts (3 pediatric ophthalmologists; 6 vitreoretinal surgeons) participated in the study. For each image set, experts identified the macular center and provided a diagnosis of zone. MAIN OUTCOME MEASURES: (1) Sensitivity and specificity of zone diagnosis; (2) “Computer facilitated diagnosis of zone,” based on precise measurement of the macular center, optic disc center, and peripheral ROP. RESULTS: Computer facilitated diagnosis of zone agreed with the expert’s diagnosis of zone in 28/45 (62%) cases using color fundus photographs and in 31/45 (69%) cases using FA. Mean (95% CI) sensitivity for detection of zone I by experts, as compared to a consensus reference standard diagnosis, was 47% (35.3% – 59.3%) when interpreting the color fundus photographs alone and 61.1% (48.9% – 72.4%) when interpreting the color fundus photographs and FA (t(9) ≥ (2.063), p = 0.073). CONCLUSIONS: There is a marginally significant difference in zone diagnosis when using color fundus photographs compared to using color fundus photographs and the corresponding fluorescein angiograms. Traditional zone diagnosis (based on ophthalmoscopic exam and image review) is inconsistent with a computer-facilitated diagnosis of zone.

    Neuronal and glial 3D chromatin architecture informs the cellular etiology of brain disorders

    Cellular heterogeneity in the human brain obscures the identification of robust cellular regulatory networks, which is necessary to understand the function of non-coding elements and the impact of non-coding genetic variation. Here we integrate genome-wide chromosome conformation data from purified neurons and glia with transcriptomic and enhancer profiles, to characterize the gene regulatory landscape of two major cell classes in the human brain. We then leverage cell-type-specific regulatory landscapes to gain insight into the cellular etiology of several brain disorders. We find that Alzheimer's disease (AD)-associated epigenetic dysregulation is linked to neurons and oligodendrocytes, whereas genetic risk factors for AD highlighted microglia, suggesting that different cell types may contribute to disease risk, via different mechanisms. Moreover, integration of glutamatergic and GABAergic regulatory maps with genetic risk factors for schizophrenia (SCZ) and bipolar disorder (BD) identifies shared (parvalbumin-expressing interneurons) and distinct cellular etiologies (upper layer neurons for BD, and deeper layer projection neurons for SCZ). Collectively, these findings shed new light on cell-type-specific gene regulatory networks in brain disorders.