38 research outputs found

Discriminative motif discovery in DNA and protein sequences using the DEME algorithm

Author: A Price
AD Smith
BJ Davids
CT Harbison
CT Workman
D La
E Segal
E Segal
Emma Redhead
GD Stormo
GD Stormo
GD Stormo
GE Crooks
GZ Hertz
H Marks
HCM Leung
J Buhler
J Fang
J Zhu
JD Hughes
JJ Hu
KD Macisaac
M Akerman
M Brown
M Giufrè
M Tompa
MC Frith
MO Dayhoff
OG Berg
PA Pevzner
R Durbin
R Sharan
S Gupta
S Sinha
S Sinha
SR Krig
TD Schneider
Timothy L Bailey
TL Bailey
TL Bailey
TL Bailey
WH Press
WP Lehrach
X Liu
XS Liu
Y Barash
ZN Wang
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.
Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. The input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.
Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/
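The idea of scoring a motif by how well it separates a "positive" set from a "negative" set can be illustrated with a minimal Python sketch. This is not DEME's algorithm: the log-odds matrix, the uniform background, the hit threshold, the toy sequences, and all function names are assumptions made for illustration, and the simple objective used here (hit rate in positives minus hit rate in negatives) stands in for DEME's probabilistic objective, which the abstract does not spell out.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Column-wise log-odds matrix from aligned sites (uniform background assumed)."""
    width = len(sites[0])
    background = 1.0 / len(alphabet)
    pwm = []
    for col in range(width):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return pwm

def best_score(pwm, seq):
    """Highest log-odds score over all windows of seq; -inf if seq is too short."""
    width = len(pwm)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        best = max(best, score)
    return best

def discrimination(pwm, positives, negatives, threshold):
    """Fraction of positives with a hit minus fraction of negatives with a hit."""
    def hit(seq):
        return best_score(pwm, seq) >= threshold
    pos_rate = sum(hit(s) for s in positives) / len(positives)
    neg_rate = sum(hit(s) for s in negatives) / len(negatives)
    return pos_rate - neg_rate

# Toy data, made up for this sketch: a 7-mer planted in the positives only.
pwm = build_pwm(["TGACTCA", "TGACTCA", "TGAGTCA"])
positives = ["AAATGACTCAGG", "CCTGAGTCATT"]
negatives = ["AAAAAAAAAAAA", "CCCCGGGGCCCC"]
score = discrimination(pwm, positives, negatives, threshold=5.0)
print(score)  # prints 1.0
```

A score near 1 means the motif hits most positives and few negatives; a motif that is equally common in both sets (a "decoy") scores near 0, which is why a discriminative objective can ignore it.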

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication date: 09/05/2018

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over (a) the motif scanning method and (b) the coexpression-based association method. The top motif outperformed the 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross-species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (Problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information. Comment: 33 page
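For readers unfamiliar with the F1 score used above to measure target gene identification, a minimal Python sketch follows. The function name and the toy gene sets are assumptions for illustration and are not taken from the paper; the formula itself is the standard harmonic mean of precision and recall.

```python
def f1_score(predicted, actual):
    """F1 = 2PR / (P + R) for a set of predicted targets against the true targets."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0  # no overlap: precision and recall are both zero
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: three predicted target genes, two of them correct.
example_f1 = f1_score({"geneA", "geneB", "geneC"}, {"geneB", "geneC", "geneD"})
print(round(example_f1, 3))  # prints 0.667
```

An improvement of "about 10 percentage points" therefore means a change such as 0.55 to 0.65 on this 0-to-1 scale.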

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

DLocalMotif: a discriminative approach for discovering local motifs in protein sequences

Author: Ahmed M. Mehdi
Austin
Bailey
Bostjan Kobe
Chatfield
Crooks
Dingwall
Dogruel
Elrod-Erickson
Engelmann
Erb
Ettwiller
Fink
Finn
Giri
Harbison
Hawkins
Huang
Keilwagen
Kosugi
Lee
Lee
Linhart
Mikael Bodén
Muhammad Shoaib B. Sehgal
Mullen
Munro
Narang
Neuberger
Ohler
Pavesi
Qiu
Redhead
Roepcke
Rose-John
Saijou
Sigrist
Stark
Thijs
Timothy L. Bailey
Vardhanabhuti
Wilks
Xie
Yamasaki
Yan
Yun
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2013

Motivation: Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery

Crossref

University of Queensland eSpace

De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference

Author: AD Smith
AM Benotmane
C Linhart
CE Lawrence
CT Harbison
DJ Galas
DJ Lockhart
DJC MacKay
DS Johnson
E Redhead
E Wingender
G Mönke
G Pavesi
GA Wray
GK Sandve
H Wettig
Harmen J. Bussemaker
HM Wallach
IA Paponov
Ivan A. Paponov
Ivo Grosse
J Cerquides
J Davis
J Wu
Jan Grau
JC Bryne
JD Hughes
Jens Keilwagen
LM Hellman
LV Sun
M Tompa
Marc Strickert
NK Kim
O Elemento
S Sonnenburg
S Sonnenburg
Stefan Posch
T Ulmasov
T Ulmasov
TD Schneider
TJ Guilfoyle
TL Bailey
V Matys
VV Raghavan
W Ao
W Thompson
WA Thompson
WD Teale
Publication venue: Public Library of Science
Publication date: 10/02/2011

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. 
Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

PhyloGibbs-MP: Module Prediction and Discriminative Motif-Finding by Gibbs Sampling

Author: Siddharthan Rahul
Publication venue: Public Library of Science
Publication date: 01/08/2008

PhyloGibbs, our recent Gibbs-sampling motif-finder, takes phylogeny into account in detecting binding sites for transcription factors in DNA and assigns posterior probabilities to its predictions obtained by sampling the entire configuration space. Here, in an extension called PhyloGibbs-MP, we widen the scope of the program, addressing two major problems in computational regulatory genomics. First, PhyloGibbs-MP can localise predictions to small, undetermined regions of a large input sequence, thus effectively predicting cis-regulatory modules (CRMs) ab initio while simultaneously predicting binding sites in those modules—tasks that are usually done by two separate programs. PhyloGibbs-MP's performance at such ab initio CRM prediction is comparable with or superior to dedicated module-prediction software that use prior knowledge of previously characterised transcription factors. Second, PhyloGibbs-MP can predict motifs that differentiate between two (or more) different groups of regulatory regions, that is, motifs that occur preferentially in one group over the others. While other “discriminative motif-finders” have been published in the literature, PhyloGibbs-MP's implementation has some unique features and flexibility. Benchmarks on synthetic and actual genomic data show that this algorithm is successful at enhancing predictions of differentiating sites and suppressing predictions of common sites and compares with or outperforms other discriminative motif-finders on actual genomic data. Additional enhancements include significant performance and speed improvements, the ability to use “informative priors” on known transcription factors, and the ability to output annotations in a format that can be visualised with the Generic Genome Browser. In stand-alone motif-finding, PhyloGibbs-MP remains competitive, outperforming PhyloGibbs-1.0 and other programs on benchmark data

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Genome Analysis Reveals Interplay between 5′UTR Introns and Nuclear mRNA Export for Secretory and Mitochondrial Genes

Author: A Nott
Abdalla Akef
Adnan Derti
AF Palazzo
Alexander F. Palazzo
C Cenik
Can Cenik
CC Hentschel
D Barrell
D Roise
D Wu
E Redhead
F Rodríguez-Trelles
FP Roth
Frederick P. Roth
G Blobel
G Pesole
GE Crooks
GF Berriz
GF Berriz
GF Berriz
GV Heijne
H Cheng
H Le Hir
Hon Nian Chua
Hui Zhang
J Sylvestre
JD Keene
K Sträßer
KD Pruitt
L Timchenko
M Garcia
M Hesse
M Luo
Melissa J. Moore
Michael Snyder
MJ Moore
Murat Tasan
N Visa
N Wiwatwattana
P Grüter
S Gueroussov
S Haider
S William Roy
SF Altschul
Stefan P. Tarnawsky
VN Kim
X Hong
XM Ma
Y Kino
Y Niimura
Publication venue: Public Library of Science
Publication date: 01/04/2011

In higher eukaryotes, messenger RNAs (mRNAs) are exported from the nucleus to the cytoplasm via factors deposited near the 5′ end of the transcript during splicing. The signal sequence coding region (SSCR) can support an alternative mRNA export (ALREX) pathway that does not require splicing. However, most SSCR–containing genes also have introns, so the interplay between these export mechanisms remains unclear. Here we support a model in which the furthest upstream element in a given transcript, be it an intron or an ALREX–promoting SSCR, dictates the mRNA export pathway used. We also experimentally demonstrate that nuclear-encoded mitochondrial genes can use the ALREX pathway. Thus, ALREX can also be supported by nucleotide signals within mitochondrial-targeting sequence coding regions (MSCRs). Finally, we identified and experimentally verified novel motifs associated with the ALREX pathway that are shared by both SSCRs and MSCRs. Our results show strong correlation between 5′ untranslated region (5′UTR) intron presence/absence and sequence features at the beginning of the coding region. They also suggest that genes encoding secretory and mitochondrial proteins share a common regulatory mechanism at the level of mRNA export

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Transcription factor-DNA binding via machine learning ensembles

Author: Delisi Charles
Fan Yue
Kon Mark A.
Publication date: 27/05/2018

The network of interactions between transcription factors (TFs) and their regulatory gene targets governs many of the behaviors and responses of cells. Construction of a transcriptional regulatory network involves three interrelated problems, defined for any regulator: finding (1) its target genes, (2) its binding motif and (3) its DNA binding sites. Many tools have been developed in the last decade to solve these problems. However, performance of algorithms for these has not been consistent for all transcription factors. Because machine learning algorithms have shown advantages in integrating information of different types, we investigate a machine-based approach to integrating predictions from an ensemble of commonly used motif exploration algorithms. Published versio

Boston University Institutional Repository (OpenBU)

Sequence-Based Classification Using Discriminatory Motif Feature Selection

Author: Daniel Capurso
Hao Xiong
Mark R. Segal
Mikael Boden
Śaunak Sen
Publication venue: Public Library of Science
Publication date: 01/01/2011

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of limited length, such that potentially important, longer predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/
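The three-level partitioning described in this abstract (discovery, training, validation) can be sketched in a few lines of Python. This is an illustrative sketch only and is not the authors' pipeline: the function names, the split fractions, the fixed seed, and the presence/absence feature encoding are all assumptions.

```python
import random

def three_way_split(items, fractions=(0.3, 0.4, 0.3), seed=0):
    """Shuffle items and cut them into discovery, training and validation partitions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    cut1 = round(n * fractions[0])
    cut2 = cut1 + round(n * fractions[1])
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]

def motif_features(seq, motifs):
    """One 0/1 feature per motif: does the motif string occur in the sequence?"""
    return [int(m in seq) for m in motifs]

# Hypothetical usage: motifs would come from a motif finder run on `discovery` only.
discovery, training, validation = three_way_split(range(10))
features = motif_features("ACGTACGT", ["CGT", "TTT"])
print(len(discovery), len(training), len(validation), features)
```

Keeping the discovery partition separate from the training partition means the features are chosen without seeing the data the classifier is fitted on, and the validation partition is touched by neither step.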

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

AMD, an Automated Motif Discovery Tool Using Stepwise Refinement of Gapped Consensuses

Author: A Barski
A Chakravarty
A Marson
A Stark
AD Smith
AD Smith
Ashok Aiyar
C Bi
C Linhart
CT Harbison
D GuhaThakurta
DS Johnson
E Redhead
E Valen
E Wijaya
FP Roth
G Laux
G Pavesi
H Ji
HJ Bussemaker
JD Hughes
Ji Zhang
Jiantao Shi
JM Vaquerizas
JS Carroll
K Weigelt
Kankan Wang
KD MacIsaac
L Ettwiller
M Kellis
M Lupien
M Tompa
MC Frith
Mingjie Chen
O Elemento
P Tamayo
PA Pevzner
S Prabhakar
S Sinha
SA Vokes
TL Bailey
V Matys
Wentao Yang
X Xie
X Xie
XS Liu
Y Zhang
Yanzhi Du
Publication venue: Public Library of Science
Publication date: 01/01/2011

Motif discovery is essential for deciphering regulatory codes from high throughput genomic data, such as those from ChIP-chip/seq experiments. However, there remains a lack of effective and efficient methods for the identification of long and gapped motifs in many relevant tools reported to date. We describe here an automated tool that allows for de novo discovery of transcription factor binding sites, regardless of whether the motifs are long or short, gapped or contiguous

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Discriminative Learning for Probabilistic Sequence Analysis

Author: Maaskola J.
Publication date: 16/04/2015

MPG.PuRe