3,092 research outputs found

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Author: A Bernal
A Culotta
A Feelders
AE Kel
AL Berger
AY Ng
C Burge
CM Bishop
D Cai
D Grossman
D Heckerman
D Klein
E Redhead
E Segal
F Pernkopf
G Yeo
GD Stormo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Cerquides
J Davis
J Goodman
J Grau
J Keilwagen
Jan Grau
Jens Keilwagen
L Narlikar
M Arita
M Meila-Predoviciu
M Tompa
M Zhang
MI Jordan
NK Kim
O Schulte
O Yakhnenko
P Grünwald
R Castelo
R Greiner
R Staden
S Chen
S Sonnenburg
SL Salzberg
Stefan Posch
T Fawcett
TH Kim
TM Chen
WL Buntine
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract
Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.
Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.
Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable easy application to comparative studies of different learning principles in the field of sequence analysis.
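
As a rough illustration of the commonly used product-Dirichlet prior that this abstract takes as its starting point, the sketch below estimates a position weight matrix as the posterior mean under a symmetric Dirichlet prior at each position. It does not reproduce the generalized prior derived in the paper, and it does not use the Jstacs API; the function name `pwm_posterior_mean` and the equivalent sample size parameter `ess` are hypothetical names chosen for this example.

```python
from collections import Counter

ALPHABET = "ACGT"


def pwm_posterior_mean(sites, ess=4.0):
    """Posterior-mean PWM under a symmetric product-Dirichlet prior.

    `ess` (hypothetical name) is an equivalent sample size that is spread
    uniformly over the four nucleotides at every position.
    """
    if not sites:
        raise ValueError("need at least one aligned site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    alpha = ess / len(ALPHABET)      # Dirichlet hyperparameter per nucleotide
    total = len(sites) + ess         # observed counts plus pseudocounts
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Posterior mean of a Dirichlet-multinomial: (n_a + alpha) / (N + ess)
        pwm.append({a: (counts[a] + alpha) / total for a in ALPHABET})
    return pwm


example_pwm = pwm_posterior_mean(["ACG", "ACT", "AAG"], ess=4.0)
```

With larger `ess` the estimate is pulled more strongly toward the uniform distribution, which is one simple way to see why the choice of prior can affect a comparison between learning principles.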

Unifying generative and discriminative learning principles

Author: A Bernal
A Culotta
A Feelders
A Mccallum
AE Kel
AY Ng
BP Lewis
C Burge
CM Bishop
D Cai
D Grossman
E Redhead
E Segal
E Wingender
F Pernkopf
G Bouchard
G Stormo
G Yeo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Aldrich
J Cerquides
J Grau
J Keilwagen
JA Lasserre
Jan Grau
Jens Keilwagen
JH Xue
M Maragkakis
M Tompa
M Zhang
Marc Strickert
O Yakhnenko
P Grünwald
R Greiner
R Raina
R Staden
RA Fisher
S Sonnenburg
SL Salzberg
Stefan Posch
T Abeel
T Hastie
TH Kim
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract
Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too.
Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
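
One way to picture how several learning principles can be special cases of a single objective is a weighted sum of the conditional log-likelihood, the joint log-likelihood, and the log-prior. The sketch below is an assumption made for illustration: the abstract does not state the paper's parameterization, and the names `weighted_objective` and `SPECIAL_CASES` are hypothetical. Only the four classical cases, whose definitions are standard, are listed.

```python
def weighted_objective(cond_ll, joint_ll, log_prior, weights):
    """Weighted combination of three terms for one parameter setting.

    cond_ll   : log P(classes | sequences, parameters)
    joint_ll  : log P(classes, sequences | parameters)
    log_prior : log of the prior density at the parameters
    weights   : (w_cond, w_joint, w_prior), all non-negative (assumed scheme)
    """
    w_cond, w_joint, w_prior = weights
    if min(weights) < 0:
        raise ValueError("weights must be non-negative")
    return w_cond * cond_ll + w_joint * joint_ll + w_prior * log_prior


# Weight settings that recover the classical objectives (up to scaling).
SPECIAL_CASES = {
    "ML": (0, 1, 0),    # maximum likelihood: joint likelihood only
    "MAP": (0, 1, 1),   # maximum a posteriori: joint likelihood plus prior
    "MCL": (1, 0, 0),   # maximum conditional likelihood
    "MSP": (1, 0, 1),   # maximum supervised posterior: conditional plus prior
}

map_value = weighted_objective(-2.0, -5.0, -1.0, SPECIAL_CASES["MAP"])
```

Under this assumed scheme, weights that are positive on both likelihood terms would interpolate between generative and discriminative learning, which is the intuition behind the trade-off principles named in the abstract.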

Hidden Markov Models for Gene Sequence Classification: Classifying the VSG genes in the Trypanosoma brucei Genome

Author: Alvarez-Valin Fernando
Basterrech Sebastián
Guerberoff Gustavo
Mesa Andrea
Publication venue: Springer Science and Business Media LLC
Publication date: 21/10/2015

The article presents an application of Hidden Markov Models (HMMs) for pattern recognition on genome sequences. We apply HMMs to identify genes encoding the Variant Surface Glycoprotein (VSG) in the genomes of Trypanosoma brucei (T. brucei) and other African trypanosomes. These parasitic protozoa are the causative agents of sleeping sickness and several diseases in domestic and wild animals. They have a peculiar strategy to evade the host's immune system that consists in periodically changing their predominant cellular surface protein (VSG). The motivation for using pattern recognition methods to identify these genes, instead of traditional homology-based ones, is that the levels of sequence identity (amino acid and DNA sequence) amongst these genes are often below what is considered reliable for these methods. Among pattern recognition approaches, HMMs are particularly suitable to tackle this problem because they can handle the determination of gene edges more naturally. We evaluate the performance of the model using different numbers of states in the Markov model, as well as several performance metrics. The model is applied using public genomic data. Our empirical results show that the VSG genes of T. brucei can be safely identified (high sensitivity and low rate of false positives) using HMMs.
Comment: Accepted in July 2015 in Pattern Analysis and Applications, Springer. The article contains 23 pages, 4 figures, 8 tables and 51 references.

Hybrid MM/SVM structural sensors for stochastic sequential data

Author: AA Markov
Brian Roux
CE Shannon
CJC Burges
Cortes Corinna
R Durbin
S Winters-Hilt
Shannon
Stephen Winters-Hilt
Publication venue: BioMed Central
Publication date: 01/01/2008

In this paper we present preliminary results stemming from a novel application of Markov Models and Support Vector Machines to splice site classification of Intron-Exon and Exon-Intron (5' and 3') splice sites. We present the use of Markov-based statistical methods, in a log-likelihood discriminator framework, to create a non-summed, fixed-length feature vector for SVM-based classification. We also explore the use of Shannon-entropy-based analysis for automated identification of minimal-size models (where smaller models have known information loss according to the specified Shannon entropy representation). We evaluate a variety of kernels and kernel parameters in the classification effort. We present results of the algorithms for splice-site datasets consisting of sequences from a variety of species for comparison.
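
The idea of a non-summed, fixed-length feature vector can be sketched as follows: instead of adding per-position log-likelihood ratios into one score, each ratio is kept as its own feature. The example is a simplification under stated assumptions: it uses zeroth-order position-specific models with pseudocounts rather than the Markov models of the paper, it stops before the SVM step, and the names `position_models` and `log_odds_features` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_models(seqs, pseudocount=1.0):
    """Per-position nucleotide distributions estimated from aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = len(seqs) + pseudocount * len(ALPHABET)
    models = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        models.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return models


def log_odds_features(seq, pos_model, neg_model):
    """One log-likelihood ratio per position, kept as a vector (not summed)."""
    if not (len(seq) == len(pos_model) == len(neg_model)):
        raise ValueError("sequence and models must have the same length")
    return [math.log(p[base] / n[base])
            for base, p, n in zip(seq, pos_model, neg_model)]


# Toy data, invented for illustration only.
positives = ["GTAAG", "GTGAG", "GTAAG"]
negatives = ["ACCTA", "TGCAT", "CATGC"]
features = log_odds_features("GTAAG",
                             position_models(positives),
                             position_models(negatives))
```

Summing `features` would give the usual log-likelihood ratio score; keeping the vector lets a downstream classifier weight positions individually.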

CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction

Author: Chuong B Do
Marina Sirota
Samuel S Gross
Serafim Batzoglou
Publication venue: BioMed Central
Publication date: 01/01/2007

CONTRAST is a gene predictor that directly incorporates information from multiple alignments and uses discriminative machine learning techniques to give large improvements in prediction over previous methods.

VOMBAT: prediction of transcription factor binding sites using variable order Bayesian trees

Author: Ben-Gal Irad
Grau Jan
Grosse Ivo
Posch Stefan
Publication venue: Oxford University Press
Publication date: 01/01/2006

Variable order Markov models and variable order Bayesian trees have been proposed for the recognition of transcription factor binding sites, and they have been shown to outperform traditional models, such as position weight matrices, Markov models and Bayesian trees. We develop a web server for the recognition of DNA binding sites based on variable order Markov models and variable order Bayesian trees offering the following functionality: (i) given datasets with annotated binding sites and genomic background sequences, variable order Markov models and variable order Bayesian trees can be trained; (ii) given a set of trained models, putative DNA binding sites can be predicted in a given set of genomic sequences; and (iii) given a dataset with annotated binding sites and a dataset with genomic background sequences, cross-validation experiments for different model combinations with different parameter settings can be performed. Several of the offered services are computationally demanding, such as genome-wide predictions of DNA binding sites in mammalian genomes or sets of 10^4-fold cross-validation experiments for different model combinations based on problem-specific data sets. In order to execute these jobs, and in order to serve multiple users at the same time, the web server is attached to a Linux cluster with 150 processors. VOMBAT is available at

Optimized mixed Markov models for motif identification

Author: AE Kel
B Matthews
B Negre
C Burge
D Cai
David M Umbach
E Roulet
E Wingender
G Schwarz
G Yeo
GA Wray
GD Stormo
GE Crooks
H Akaike
I Carmel
J Rissanen
JP Staley
K Ellrott
K Nandabalan
K Nelson
K Quandt
Leping Li
M Kellis
MG Reese
ML Bulyk
MP Ponomarenko
MQ Zhang
N Saitou
P Agarwal
P Bühlmann
PV Benos
Q Zhou
R Staden
RP Ketterling
S Salzberg
T Thanaraj
TD Schneider
TK Man
U Ohler
Uwe Ohler
W Krivan
Weichun Huang
X Xie
X Zhao
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Identifying functional elements, such as transcription factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to the limited availability of training samples. RESULTS: We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods to allow adjustment of model complexity for different motifs. In comparison with other leading methods, OMiMa can incorporate more than NNSplice's pairwise dependencies; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our OMiMa system, to our knowledge, is the only motif finding tool that incorporates automatic selection of the best model. OMiMa is freely available at [1]. CONCLUSION: Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif. Our model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods.
