1,509 research outputs found

Occupancy Classification of Position Weight Matrix-Inferred Transcription Factor Binding Sites

Author: A Barski
A Valouev
Aaron Cohen
B Lenhard
CC Chang
D Karolchik
DH Wolpert
DL Daniels
E Roulet
FN Jensen
G Cooper
G Pavesi
G Robertson
GC Prendergast
GD Stormo
Gregory Yochum
Hollis Wright
IH Witten
Indra Neil Sarkar
J Cohen
JE Darnell
Kemal Sönmez
KI Zeller
KJ Won
M Tompa
MA Hall
N Friedman
ND Heintzman
OJ Sansom
P Hatzis
PJ Collins
Q Sun
R Staden
S Cawley
S Sinha
Shannon McWeeney
SL Schreiber
TL Bailey
TY Roh
UM Fayyad
V Matys
VN Vapnik
Y Chen
YJ Shann
Publication venue: Public Library of Science
Publication date: 04/11/2011

BACKGROUND: Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques that use additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification, and compare and contrast these feature sets between the factors. RESULTS: Our results demonstrate good performance of classifiers both on TFBS for the transcription factors used for initial training and on TFBS for other factors in cross-classification experiments. We find that distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, are effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF-specific. In our experiments, Bayesian network classifiers outperform SVM classifiers. CONCLUSIONS: Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.
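As an illustration of the kind of feature the abstract describes, the sketch below computes the distance from a predicted site to the nearest histone modification island. The function name, the `(start, end)` interval representation, and the example coordinates are assumptions made for this sketch; the paper's actual feature extraction is not described in the abstract.

```python
from bisect import bisect_left


def distance_to_nearest_island(site_pos, islands):
    """Distance in bp from site_pos to the nearest island (0 if inside one).

    islands: list of (start, end) tuples on one chromosome, sorted by
    start and non-overlapping (an assumption of this sketch).
    """
    if not islands:
        raise ValueError("no islands given")
    starts = [start for start, _ in islands]
    i = bisect_left(starts, site_pos)
    best = None
    # Only the island just before and the island at/after the site
    # can be nearest when intervals are sorted and non-overlapping.
    for j in (i - 1, i):
        if 0 <= j < len(islands):
            start, end = islands[j]
            if start <= site_pos <= end:
                d = 0
            elif site_pos < start:
                d = start - site_pos
            else:
                d = site_pos - end
            best = d if best is None else min(best, d)
    return best


# Hypothetical island coordinates, for illustration only.
example_islands = [(100, 200), (500, 600)]
example_distance = distance_to_nearest_island(250, example_islands)
```

One such distance per chromatin mark could then be fed to a classifier as a feature vector for each predicted site.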

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Exploring Patterns of Epigenetic Information With Data Mining Techniques

Author: Aguiar-Pulido Vanessa
Dorado Julián
Gestal M.
Seoane José A.
Publication venue: Bentham Science Publishers Ltd.
Publication date: 01/01/2013

Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information that are mitotically and/or meiotically heritable and that determine gene expression and cellular differentiation, as well as cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in loss of control over cell growth and thus cause cancer development. Data mining techniques could then be used to extract these patterns. This work reviews some of the most important applications of data mining to epigenetics. Funding: Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366. Galicia. Consellería de Economía e Industria; 10SIN105004PR. Instituto de Salud Carlos III; RD07/0067/000

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Explore Bristol Research

Peak-valley-peak pattern of histone modifications delineates active regulatory elements and their directionality

Author: Bagger Frederik Otzen
Lauridsen Felicia Kathrine Bratt
Porse Bo Torben
Pundhir Sachin
Rapin Nicolas Philippe Jean-Pierre
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2016

Formation of a nucleosome-free region (NFR), accompanied by specific histone modifications at flanking nucleosomes, is an important prerequisite for enhancer and promoter activity. Due to this process, active regulatory elements often exhibit a distinct shape of histone signal in the form of a peak-valley-peak (PVP) pattern. However, different features of PVP patterns and their robustness in predicting active regulatory elements have never been systematically analyzed. Here, we present PARE, a novel computational method that systematically analyzes the H3K4me1 or H3K4me3 PVP patterns to predict NFRs. We show that NFRs predicted by H3K4me1 and H3K4me3 patterns are associated with active enhancers and promoters, respectively. Furthermore, asymmetry in the height of the peaks flanking the central valley can predict the directionality of stable transcription at promoters. Using PARE on ChIP-seq histone modification data from four ENCODE cell lines and four hematopoietic differentiation stages, we identified several enhancers whose regulatory activity is stage-specific and correlates positively with the expression of proximal genes in a particular stage. In conclusion, our results demonstrate that PVP patterns delineate both the histone modification landscape and the transcriptional activities governed by active enhancers and promoters, and can therefore be used for their prediction. PARE is freely available at http://servers.binf.ku.dk/pare
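To make the peak-valley-peak idea concrete, here is a minimal sketch that scans a binned signal for two local maxima separated by a sufficiently deep minimum. This is not PARE's algorithm, which the abstract does not specify; the function name, the `min_drop` threshold, and the toy signal are all assumptions of this sketch.

```python
def find_pvp(signal, min_drop=0.5):
    """Return (left_peak, valley, right_peak) index triples.

    A valley between two consecutive local maxima qualifies when its
    value is at most min_drop times the lower of the two flanking peaks.
    min_drop is a hypothetical tuning parameter.
    """
    # Local maxima: strictly above the left neighbour, not below the right.
    peaks = [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    hits = []
    for left, right in zip(peaks, peaks[1:]):
        # Lowest bin strictly between the two peaks.
        valley = min(range(left + 1, right), key=lambda k: signal[k])
        if signal[valley] <= min_drop * min(signal[left], signal[right]):
            hits.append((left, valley, right))
    return hits


# Toy binned histone signal, invented for illustration.
toy_signal = [0, 5, 8, 2, 1, 3, 9, 4, 0]
toy_hits = find_pvp(toy_signal)

# Height difference between flanking peaks; the abstract reports that
# such asymmetry relates to transcription directionality at promoters.
asymmetry = [toy_signal[r] - toy_signal[l] for l, _, r in toy_hits]
```

On real ChIP-seq coverage the signal would need smoothing and normalization first, which this sketch leaves out.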

Copenhagen University Research Information System

PubMed Central

Iterative Random Forests to detect predictive and stable high-order interactions

Author: Basu Sumanta
Brown James B.
Kumbier Karl
Yu Bin
Publication date: 23/12/2017

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
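The stability criterion quoted in the abstract (an interaction is stable if it is returned in more than half of bootstrap replicates) can be sketched as a simple counting step. This covers only that final filter, not iRF itself; the function name and the replicate contents below are invented for illustration.

```python
from collections import Counter


def stable_interactions(replicates, threshold=0.5):
    """Map each interaction to the fraction of replicates returning it.

    replicates: list with one entry per bootstrap replicate, each an
    iterable of interactions given as tuples of feature names.
    Only interactions whose fraction exceeds threshold are kept.
    """
    counts = Counter()
    for found in replicates:
        # frozenset ignores feature order; the outer set counts an
        # interaction at most once per replicate.
        counts.update({frozenset(interaction) for interaction in found})
    n = len(replicates)
    return {
        interaction: count / n
        for interaction, count in counts.items()
        if count / n > threshold
    }


# Made-up replicate outputs, not results from the paper.
example_replicates = [
    [("Zld", "Gt")],
    [("Gt", "Zld"), ("Zld", "Twi")],
    [("Zld", "Twi")],
    [("Zld", "Gt")],
]
example_stable = stable_interactions(example_replicates)
```

Here the Zld/Gt pair appears in three of four replicates and is kept, while Zld/Twi appears in exactly half and is dropped because the criterion is strictly "more than half".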

arXiv.org e-Print Archive

University of Birmingham Research Portal

eScholarship - University of California

Linking the Epigenome to the Genome: Correlation of Different Features to DNA Methylation of CpG Islands

Author: A Barski
A Bird
A Henckel
A Jeltsch
A Meissner
A Siepel
AH Ting
Andreas Zell
AP Bird
B Rhead
BE Bernstein
Brock C. Christensen
C Bock
C Previti
C Wrzodek
CC Chang
CD Bustos
Clemens Wrzodek
D Jia
D Takai
D Zilberman
DE Schones
E Schilling
EJ Gardiner
ES Lander
F Antequera
F Eckhardt
F Fang
F Fuks
F Mohn
FA Feltus
Finja Büchel
Florian Mittag
GD Stormo
Georg Hinselmann
H Cedar
H Vikas
JF Costello
JG Cleary
Johannes Eichner
JT Bell
KL Thu
M Burset
M Esteller
M Gardiner-Garden
M Hall
M Oka
P Baldi
P Dehan
P Hajkova
PA Jones
R Das
R Fan
R Lister
RA Rollins
RM Brena
S Aerts
S Fan
S Kim
S Kochanek
SE Celniker
SKT Ooi
W Reik
WJ Kent
Y Wang
Y Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2012

DNA methylation of CpG islands plays a crucial role in the regulation of gene expression. More than half of all human promoters contain CpG islands with a tissue-specific methylation pattern in differentiated cells. The process by which DNA methyltransferases determine which regions should be methylated is still not completely understood. There are many hypotheses about which genomic features are correlated with the epigenome that have not yet been evaluated. Furthermore, many exploratory approaches to measuring DNA methylation are limited to a subset of the genome and thus cannot be employed, e.g., for genome-wide biomarker prediction methods. In this study, we evaluated the correlation of genetic, epigenetic and hypothesis-driven features with DNA methylation of CpG islands. To this end, various binary classifiers were trained and evaluated by cross-validation on a dataset comprising DNA methylation data for 190 CpG islands in HEPG2, HEK293, fibroblasts and leukocytes. We achieved an accuracy of up to 91% with an MCC of 0.8 using ten-fold cross-validation and ten repetitions. With these models, we extended the existing dataset to the whole genome and thus predicted the methylation landscape for the given cell types. The method used for these predictions is also validated on another external whole-genome dataset. Our results reveal features correlated with DNA methylation and confirm or disprove various hypotheses about DNA methylation-related features. This study confirms correlations between DNA methylation and histone modifications, DNA structure, DNA sequence, genomic attributes and CpG island properties. Furthermore, the method has been validated on a genome-wide dataset from the ENCODE consortium. The developed software, as well as the predicted datasets and a web-service to compare methylation states of CpG islands, are available at http://www.cogsys.cs.uni-tuebingen.de/software/dna-methylation/
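For readers unfamiliar with the MCC (Matthews correlation coefficient) reported alongside accuracy, the sketch below computes both from a binary confusion matrix using the standard definition. The counts in the example are hypothetical and chosen only to show how an accuracy near 90% can coincide with an MCC of 0.8; they are not the paper's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical balanced confusion matrix, for illustration only.
example_acc = accuracy(tp=90, tn=90, fp=10, fn=10)
example_mcc = mcc(tp=90, tn=90, fp=10, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the methylated and unmethylated classes are unbalanced.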

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Publikationsserver der Universität Tübingen

Analysis, Visualization, and Machine Learning of Epigenomic Data

Author: Purcaro Michael J.
Publication venue: eScholarship@UMassChan
Publication date: 12/12/2017

The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions that are bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and demonstrate that machine learning is critical to help decipher functional regions of the genome.

eScholarship@UMMS

Exploratory analysis of genomic segmentations with Segtools

Author: Buske Orion J
Hoffman Michael M
Le Roch Karine G
Noble William Stafford
Ponts Nadia
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: As genome-wide experiments and annotations become more prevalent, researchers increasingly require tools to help interpret data at this scale. Many functional genomics experiments involve partitioning the genome into labeled segments, such that segments sharing the same label exhibit one or more biochemical or functional traits. For example, a collection of ChIP-seq experiments yields a compendium of peaks, each labeled with one or more associated DNA-binding proteins. Similarly, manually or automatically generated annotations of functional genomic elements, including cis-regulatory modules and protein-coding or RNA genes, can also be summarized as genomic segmentations. Results: We present a software toolkit called Segtools that simplifies and automates the exploration of genomic segmentations. The software operates as a series of interacting tools, each of which provides one mode of summarization. These various tools can be pipelined and summarized in a single HTML page. We describe the Segtools toolkit and demonstrate its use in interpreting a collection of human histone modification data sets and Plasmodium falciparum local chromatin structure data sets. Conclusions: Segtools provides a convenient, powerful means of interpreting a genomic segmentation.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

eScholarship - University of California

ProdInra