547 research outputs found

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication date: 09/05/2018

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over (a) the motif scanning method and (b) the coexpression-based association method. The top motif outperformed the 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross-species database (Tompa et al., 2005) we match the best performer without much human intervention. The ensemble also improved performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (Problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.
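As a rough illustration of the position weight matrices the abstract starts from, the sketch below builds a log-odds PWM from a handful of aligned sites and scans a promoter-like string for its best-scoring window. It is not the authors' ensemble: the toy sites, the uniform background of 0.25, the pseudocount, and the helper names `build_pwm` and `best_hit` are all assumptions made for this example.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length binding sites.

    Assumes a uniform background frequency for A, C, G and T.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)


# Hypothetical aligned sites and promoter fragment, for illustration only.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
toy_pwm = build_pwm(toy_sites)
score, offset = best_hit(toy_pwm, "CCGTTGACTCAGGA")
```

In an ensemble setting, one such PWM per component algorithm could supply per-promoter scores as classifier features; how the paper actually derives its subspaces is not specified in the abstract.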

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Transcription factor-DNA binding via machine learning ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark A.
Publication date: 27/05/2018

The network of interactions between transcription factors (TFs) and their regulatory gene targets governs many of the behaviors and responses of cells. Construction of a transcriptional regulatory network involves three interrelated problems, defined for any regulator: finding (1) its target genes, (2) its binding motif and (3) its DNA binding sites. Many tools have been developed in the last decade to solve these problems. However, performance of algorithms for these has not been consistent for all transcription factors. Because machine learning algorithms have shown advantages in integrating information of different types, we investigate a machine-based approach to integrating predictions from an ensemble of commonly used motif exploration algorithms.

Boston University Institutional Repository (OpenBU)

POIMs: positional oligomer importance matrices—understanding support vector machine-based signal detectors

Author: Alexander Zien
Gunnar Rätsch
Petra Philips
Sören Sonnenburg
Publication venue: Oxford University Press
Publication date: 01/01/2008

Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.

Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information

Author: Engel James Douglas
Hero Alfred O
Rao Arvind
States David J
Publication venue: BioMed Central
Publication date: 01/01/2007

Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Coupled with the complexity underlying tissue-specific gene expression, there are several motifs that are putatively responsible for expression in a certain cell type. This has important implications in understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. Firstly, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity of any sequence of interest. Such analysis yields several novel interesting motifs that merit further experimental characterization. Furthermore, this approach leads to a principled framework for the prospective examination of any chosen motif as a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable the large-scale investigation of the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The folded k-spectrum kernel: A machine learning approach to detecting transcription factor binding sites with gapped nucleotide dependencies

Author: Dresch Jacqueline M.
Elmas Abdulkadir
Wang Xiaodong
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2017

Understanding the molecular machinery involved in transcriptional regulation is central to improving our knowledge of an organism’s development, disease, and evolution. The building blocks of this complex molecular machinery are an organism’s genomic DNA sequence and transcription factor proteins. Despite the vast amount of sequence data now available for many model organisms, predicting where transcription factors bind, often referred to as ‘motif detection’, is still incredibly challenging. In this study, we develop a novel bioinformatic approach to binding site prediction. We do this by extending pre-existing SVM approaches in an unbiased way to include all possible gapped k-mers, representing different combinations of complex nucleotide dependencies within binding sites. We show the advantages of this new approach when compared to existing SVM approaches, through a rigorous set of cross-validation experiments. We also demonstrate the effectiveness of our new approach by reporting on its improved performance on a set of 127 genomic regions known to regulate gene expression along the anterior-posterior axis in early Drosophila embryos.
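The idea of using gapped k-mers as kernel features can be sketched generically as follows. This is a plain gapped k-mer spectrum kernel written for illustration, not the folded k-spectrum kernel of the paper; the function names, the `window` and `k` parameters, and the `-` wildcard symbol are assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, window=4, k=3):
    """Count gapped k-mers in a sequence.

    Every substring of length `window` contributes one feature per way of
    keeping k of its positions and masking the rest with '-'.
    """
    counts = Counter()
    for start in range(len(seq) - window + 1):
        word = seq[start:start + window]
        for kept in combinations(range(window), k):
            feature = "".join(word[i] if i in kept else "-" for i in range(window))
            counts[feature] += 1
    return counts


def spectrum_kernel(seq_a, seq_b, window=4, k=3):
    """Inner product of the gapped k-mer count vectors of two sequences."""
    a = gapped_kmer_counts(seq_a, window, k)
    b = gapped_kmer_counts(seq_b, window, k)
    return sum(n * b[feature] for feature, n in a.items())


# Toy sequences, for illustration only.
same = spectrum_kernel("ACGT", "ACGT")      # one window, four gapped 3-mers shared
disjoint = spectrum_kernel("ACGT", "TTTT")  # no gapped 3-mer in common
```

A Gram matrix filled with such values could be passed to any SVM implementation that accepts precomputed kernels; the choice of `window` and `k` would need tuning by cross-validation.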

Columbia University Academic Commons

Directory of Open Access Journals

FigShare

Kernel methods in genomics and computational biology

Author: Vert Jean-Philippe
Publication date: 17/10/2005

Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

arXiv.org e-Print Archive

HAL-MINES ParisTech

Prediction of Alternative Splice Sites in Human Genes

Author: Simmons Douglas
Publication venue: SJSU ScholarWorks
Publication date: 01/01/2007

This thesis addresses the problem of predicting alternative splice sites in human genes. The most common ways to identify alternative splice sites are the use of expressed sequence tags and microarray data. Since genes only produce alternative proteins under certain conditions, these methods are limited to detecting only alternative splice sites in genes whose alternative protein forms are expressed under the tested conditions. I have introduced three multiclass support vector machines that predict upstream and downstream alternative 3’ splice sites, upstream and downstream alternative 5’ splice sites, and the 3’ splice site of skipped and cryptic exons. On a test set extracted from the Alternative Splice Annotation Project database, I was able to correctly classify about 68% of the splice sites in the alternative 3’ set, about 62% of the splice sites in the alternative 5’ set, and about 66% in the exon skipping set.

SJSU ScholarWorks

KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

Author: Gunnar Rätsch
Jan U. Lohmann
Oliver Kohlbacher
Sebastian J. Schultheiss
Wolfgang Busch
Publication venue: Oxford University Press
Publication date: 01/01/2009

Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Integrating Diverse Datasets Improves Developmental Enhancer Prediction

Author: Dennis Kostka
Genevieve D. Erwin
John A. Capra
Karl K. Murphy
Katherine S. Pollard
Nadav Ahituv
Nir Oksenberg
Rebecca M. Truty
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 27/09/2013

Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci. 
Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology.

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

D-Scholarship@Pitt

FigShare