Search CORE

6,667 research outputs found

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication venue
Publication date: 09/05/2018
Field of study

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWM's) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.Comment: 33 page

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

A flexible integrative approach based on random forest improves prediction of transcription factor binding sites

Author: Abeel
Afflerbach
Angarica
Bailey
Bart Hooghe
Bauer
Benos
Breiman
Bulyk
Burden
Calladine
Camenisch
Chen
Cho
Cordell
Davis
Dickerson
Ehret
Ernst
Frans van Roy
Friedel
Fujii
Fulton
Gama-Castro
Gardiner
Gartenberg
Gershenzon
Goodsell
Gorin
Gowrisankar
Greenbaum
Gunewardena
Hall
Hendrickson
Hu
Juo
Kajimura
Kaplan
Karas
Kel
Kim
Lavery
Lewis
Liu
Liu
Liu
Long
Lu
Lu
Lu
Lunetta
Man
Marco
Marinescu
Martinez-Hackert
Matys
Medina-Rivera
Meysman
Michel
Mokry
Morozov
Narang
Naughton
O'Flanagan
Olson
Paillard
Pan
Parker
Parvin
Pieter De Bleser
Ponomarenko
Portales-Casamar
Powell
Pudimat
Ramsey
Rohs
Rohs
Rohs
Ruiz
Satchwell
Schneider
Shakked
Sharon
Shi
Spolar
Stefan Broos
Stormo
Svozil
Thayer
Tomovic
Toro-Roman
Travers
Tullius
Wunderlich
Zhang
Zhang
Zhu
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012
Field of study

Transcription factor binding sites (TFBSs) are DNA sequences of 6-15 base pairs. Interaction of these TFBSs with transcription factors (TFs) is largely responsible for most spatiotemporal gene expression patterns. Here, we evaluate to what extent sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPDs) and the nucleotide sequence-dependent structure of DNA. We make use of the random forest algorithm to flexibly exploit both types of information. Results in this study show that both the structural method and the NPD method can be valuable for the prediction of TFBSs. Moreover, their predictive values seem to be complementary, even to the widely used position weight matrix (PWM) method. This led us to combine all three methods. Results obtained for five eukaryotic TFs with different DNA-binding domains show that our method improves classification accuracy for all five eukaryotic TFs compared with other approaches. Additionally, we contrast the results of seven smaller prokaryotic sets with high-quality data and show that with the use of high-quality data we can significantly improve prediction performance. Models developed in this study can be of great use for gaining insight into the mechanisms of TF binding

Communication as the Main Characteristic of Life

Author: Witzany Guenther
Publication venue
Publication date: 01/01/2018
Field of study

PhilPapers