4,411 research outputs found

    ModHMM: A Modular Supra-Bayesian Genome Segmentation Method

    Genome segmentation methods are powerful tools for obtaining cell type- or tissue-specific genome-wide annotations and are frequently used to discover regulatory elements. However, traditional segmentation methods show low predictive accuracy, and their data-driven annotations have some undesirable properties. As an alternative, we developed ModHMM, a highly modular genome segmentation method. Inspired by the supra-Bayesian approach, it incorporates predictions from a set of classifiers. This allows genome segmentations to be computed with state-of-the-art methodology. We demonstrate the method on ENCODE data and show that it outperforms traditional segmentation methods not only in predictive performance but also in qualitative aspects. ModHMM is therefore a valuable alternative for studying the epigenetic and regulatory landscape across and within cell types or tissues.

    Methods for developing a machine learning framework for precise 3D domain boundary prediction at base-level resolution

    High-throughput chromosome conformation capture technology (Hi-C) has revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, the 3D domains critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions of several kilobases in size) prevents precise mapping of domain boundaries by conventional TAD/loop-callers. However, high-resolution genomic annotations associated with boundaries, such as CTCF and members of the cohesin complex, suggest a computational approach for precisely locating domain boundaries. We developed preciseTAD, an optimized machine learning framework that leverages a random forest model to improve the location of domain boundaries. Our method introduces three concepts (shifted binning, distance-type predictors, and random under-sampling), which we use to build classification models for predicting boundary regions. The algorithm then uses density-based clustering (DBSCAN) and partitioning around medoids (PAM) to extract the most biologically meaningful domain boundary from models trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data. We benchmarked our method against a popular TAD-caller and a novel chromatin loop prediction algorithm. Boundaries predicted by preciseTAD were more enriched for known molecular drivers of 3D chromatin including CTCF, RAD21, SMC3, and ZNF143. preciseTAD-predicted boundaries were more conserved across cell lines, highlighting their higher biological significance. Additionally, models pre-trained in one cell line accurately predict boundaries in another cell line. Using cell line-specific genomic annotations, the pre-trained models enable detecting domain boundaries in cells without Hi-C data. The research presented provides a unified approach for precisely predicting domain boundaries. This improved precision will provide insight into the association between genomic regulators and the 3D genome organization. Furthermore, our methods will provide researchers with flexible and easy-to-use tools to continue to annotate the 3D structure of the human genome without relying on costly high-resolution Hi-C data. The preciseTAD R package and supplementary ExperimentHub package, preciseTADhub, are available on Bioconductor (version 3.13; https://bioconductor.org/packages/preciseTAD/; https://bioconductor.org/packages/preciseTADhub/).

    An image representation based convolutional network for DNA classification

    The folding structure of the DNA molecule combined with helper molecules, also referred to as the chromatin, is highly relevant for the functional properties of DNA. The chromatin structure is largely determined by the underlying primary DNA sequence, though the interaction is not yet fully understood. In this paper we develop a convolutional neural network that takes an image-representation of primary DNA sequence as its input, and predicts key determinants of chromatin structure. The method is developed such that it is capable of detecting interactions between distal elements in the DNA sequence, which are known to be highly relevant. Our experiments show that the method outperforms several existing methods both in terms of prediction accuracy and training time. Comment: Published at ICLR 2018, https://openreview.net/pdf?id=HJvvRoe0

    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, generating large amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information that are mitotically and/or meiotically heritable and that determine gene expression, cellular differentiation, and cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their lives and accumulate with ageing. Both defects, together or individually, can result in loss of control over cell growth and thus in cancer development. Data mining techniques could then be used to extract these patterns. This work reviews some of the most important applications of data mining to epigenetics. Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366. Galicia. Consellería de Economía e Industria; 10SIN105004PR. Instituto de Salud Carlos III; RD07/0067/000

    Predicting gene expression using morphological cell responses to nanotopography

    Cells respond in complex ways to their environment, making it challenging to predict a direct relationship between the two. A key problem is the lack of informative representations of parameters that translate directly into biological function. Here we present a platform to relate the effects of cell morphology to gene expression induced by nanotopography. This platform utilizes the ‘morphome’, a multivariate dataset of cell morphology parameters. We create a Bayesian linear regression model that uses the morphome to robustly predict changes in bone, cartilage, muscle and fibrous gene expression induced by nanotopography. Furthermore, through this model we effectively predict nanotopography-induced gene expression from a complex co-culture microenvironment. The information from the morphome uncovers previously unknown effects of nanotopography on altering cell–cell interaction and osteogenic gene expression at the single-cell level. The predictive relationship between morphology and gene expression arising from cell-material interaction shows promise for exploration of new topographies.