26 research outputs found

Efficient motif finding algorithms for large-alphabet inputs

Author: Kuksa Pavel P
Pavlovic Vladimir
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: We consider the problem of identifying motifs, recurring or conserved patterns, in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. Results: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem, our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobulin families. Conclusions: Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in the important and difficult cases of large-alphabet sequences.
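
To make the planted motif problem named in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm: it simply enumerates every candidate pattern of length `l` and keeps those that occur in every input string with at most `d` mismatches. The function names and the toy sequences are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(pattern, seq, d):
    """True if some window of seq matches pattern with at most d mismatches."""
    l = len(pattern)
    return any(
        hamming(pattern, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """Exhaustively list every length-l pattern found in all sequences.

    Illustrative baseline only: the candidate space has len(alphabet) ** l
    patterns, so the cost grows quickly with alphabet size and motif length.
    """
    found = []
    for cand in product(alphabet, repeat=l):
        pattern = "".join(cand)
        if all(occurs_within(pattern, s, d) for s in seqs):
            found.append(pattern)
    return found


# Toy input (hypothetical): "ACGT" is planted with at most one mismatch.
seqs = ["TTACGTAA", "GGACCTCC", "CAACGAGT"]
motifs = planted_motifs(seqs, l=4, d=1)
```

Because the number of candidates is the alphabet size raised to the motif length, this naive enumeration becomes impractical for protein-sized alphabets, which is the regime the abstract singles out as difficult.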

CiteSeerX

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective

Author: Jensen Shane T
Liu Jun S
Liu X. Shirley
Zhou Qing
Publication venue: ScholarlyCommons
Publication date: 01/01/2004

The Bayesian approach together with Markov chain Monte Carlo techniques has provided an attractive solution to many important bioinformatics problems such as multiple sequence alignment, microarray analysis and the discovery of gene regulatory binding motifs. The employment of such methods and, more broadly, explicit statistical modeling, has revolutionized the field of computational biology. After reviewing several heuristics-based computational methods, this article presents a systematic account of Bayesian formulations and solutions to the motif discovery problem. Generalizations are made to further enhance the Bayesian approach. Motivated by the need for a fast algorithm, we also provide a perspective of the problem from the viewpoint of optimizing a scoring function. We observe that scoring functions resulting from proper posterior distributions, or approximations to such distributions, show the best performance and can be used to improve upon existing motif-finding programs. Simulation analyses and a real-data example are used to support our observation.

ScholarlyCommons@Penn

A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval

Author: A Gelli
A Morozov
A Sandelin
A Siepel
AK Jain
AP Gasch
C Csank
C Harbison
D Che
D Gordon
D Karolchik
D Martin
EP Xing
Ernest Fraenkel
G Stormo
G Thijs
G Yona
H Madhani
Hanah Margalit
IG Choi
J Hughes
J Lin
J Rutherford
J Schaber
J Zeitlinger
J Zhu
JL DeRisi
K MacIsaac
K Sjolander
M Bulyk
M Courel
M DeGroot
M Harris
M Kellis
MB Eisen
N Friedman
Naomi Habib
Nir Friedman
P Benos
PT Spellman
R Osada
S Aerts
S Chou
S Gupta
S Mahony
S Pietrokovski
S Roepcke
T Bailey
T Kaplan
T Wang
TL Bailey
Tommy Kaplan
V Matys
W Day
X Liu
X Xie
Y Barash
Y Wang
Publication venue: Public Library of Science
Publication date: 01/02/2008

Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as the semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

Author: A Jolma
A Mathelier
AE Kel
B Wilczynski
Bartek Wilczynski
BE Bernstein
Bozena Kaminska
E Portales-Casamar
EP Xing
G Badis
GZ Hertz
I Krystkowiak
IV Kulakovskiy
Izabella Krystkowiak
JV Turatsinze
K Cartharius
K Quandt
L Yang
M Pachkov
MF Berger
Michal Dabrowski
Norbert Dojer
P Flicek
PJA Cock
R Pique-Regi
R Worsley Hunt
S Rahmann
T Kaplan
TD Schneider
U Mudunuri
V Matys
X Xie
Y Zhao
Publication venue: Springer Science and Business Media LLC

Crossref

Equi-energy sampler with applications in statistical inference and statistical mechanics

Author: Kou S. C.
Wong Wing Hung
Zhou Qing
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2006

We introduce a new sampling algorithm, the equi-energy sampler, for efficient statistical sampling and estimation. Complementary to the widely used temperature-domain methods, the equi-energy sampler, utilizing the temperature-energy duality, targets the energy directly. The focus on the energy function not only facilitates efficient sampling, but also provides a powerful means for statistical estimation, for example, the calculation of the density of states and microcanonical averages in statistical mechanics. The equi-energy sampler is applied to a variety of problems, including exponential regression in statistics, motif sampling in computational biology and protein folding in biophysics. Comment: This paper is discussed in [math.ST/0611217], [math.ST/0611219], [math.ST/0611221], [math.ST/0611222]. Rejoinder in [math.ST/0611224]. Published at http://dx.doi.org/10.1214/009053606000000515 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

arXiv.org e-Print Archive

CiteSeerX

Crossref

Harvard University - DASH

Improved benchmarks for computational motif discovery

Author: Abul Osman
Drabløs Finn
Sandve Geir Kjetil
Walseng Vegard
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research. Results: We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models. Conclusion: Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.
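
As a rough illustration of the PWM motif model mentioned in this abstract, the following Python sketch builds a log-odds position weight matrix from aligned binding sites and scans a sequence for its best-scoring window. It is a generic textbook construction, not the optimal-discrimination algorithm the paper describes; the pseudocount, the uniform background, and the toy sites are assumptions.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Log-odds PWM from equal-length aligned sites (one dict per column)."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pwm = []
    for pos in range(len(sites[0])):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({
            a: math.log2((counts[a] / total) / background[a])
            for a in alphabet
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(col[ch] for col, ch in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))


# Toy binding sites (hypothetical), then a scan of a longer sequence.
sites = ["TGACA", "TGACA", "TGACT", "TGATA"]
pwm = build_pwm(sites)
top_score, top_start = best_hit(pwm, "CCCTGACACCC")
```

Thresholding such window scores is one simple way to separate putative binding sites from the remaining sequence, which is the discrimination task the benchmark suites are built around.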

Springer - Publisher Connector

PubMed Central

NORA - Norwegian Open Research Archives

Discovery and prediction of protein binding sites in DNA and RNA sequences using Bayesian Markov models

Author: Ge Wanwan
Publication date: 10/07/2020

Georg-August-University Göttingen

Evolutionary Computation and QSAR Research

Author: Aguiar-Pulido Vanessa
Cruz-Monteagudo Maykel
Dorado Julián
Gestal M.
Munteanu Cristian-Robert
Rabuñal Juan R.
Publication venue: Bentham Science Publishers Ltd.
Publication date: 01/01/2013

The successful high throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. Virtual molecular filtering and screening rely greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect and poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning or artificial intelligence. QSAR modeling relies on three main steps: molecular structure codification into molecular descriptors, selection of relevant variables in the context of the analyzed activity, and search of the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, the selection methods for high-dimensional data in QSAR, the methods to build QSAR models, the current evolutionary feature selection methods and applications in QSAR, and the future trends in joint or multi-task feature selection methods. Funding: Instituto de Salud Carlos III, PIO52048; Instituto de Salud Carlos III, RD07/0067/0005; Ministerio de Industria, Comercio y Turismo, TSI-020110-2009-53; Galicia. Consellería de Economía e Industria, 10SIN105004P.

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas