13 research outputs found

Kernel Methods for Predictive Sequence Analysis

Author: Ong C.
Rätsch G.
Publication date: 01/09/2006

This tutorial is meant for a broad audience: students, researchers, biologists and computer scientists interested in (a) an overview of general and efficient algorithms for statistical learning used in computational biology, and (b) sequence kernels for problems such as promoter or splice site detection. No specific knowledge is required, since the tutorial is self-contained and most fundamental concepts are introduced during the course.

MPG.PuRe

Identification and analysis of methylation call differences between bisulfite microarray and bisulfite sequencing data with statistical learning techniques

Author: AE Teschendorff
G Rätsch
Gilles Gasparoni
Jasmin Gries
Jörn Walter
Karl Nordström
Matthias Döring
Nico Pfeifer
P Meinicke
Pavlo Lutsik
S Dedeurwaerder
S Sonnenburg
T Gärtner
Y Liu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

Author: Bernhard Schölkopf
Gunnar Rätsch
Hanh Witte
Jagan Srinivasan
Klaus-R Müller
Ralf-J Sommer
Sören Sonnenburg
The Caenorhabditis elegans sequencing consortium
Uwe Ohler
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2007

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

Caltech Authors

MPG.PuRe

The Feature Importance Ranking Measure

Author: A. Graf
B. Schölkopf
B. Üstün
C. Strobl
G. Rätsch
G.R.G. Lanckriet
J. Friedman
J. Schäfer
K. Bennett
M. Laan van der
R. Tibshirani
S. Sonnenburg
Publication date: 01/01/2009

The most accurate predictions are typically obtained by learning machines with complex feature spaces (e.g., as induced by kernels). Unfortunately, such decision rules are hardly accessible to humans and cannot easily be used to gain insights about the application domain. Therefore, one often resorts to linear models in combination with variable selection, thereby sacrificing some predictive power for presumptive interpretability. Here, we introduce the Feature Importance Ranking Measure (FIRM), which, by retrospective analysis of arbitrary learning machines, makes it possible to achieve both excellent predictive performance and superior interpretation. In contrast to standard raw feature weighting, FIRM takes the underlying correlation structure of the features into account. Thereby, it is able to discover the most relevant features, even if their appearance in the training data is entirely prevented by noise. The desirable properties of FIRM are investigated analytically and illustrated in simulations. Comment: 15 pages, 3 figures. To appear in the Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 200

arXiv.org e-Print Archive

Pseudo-Marginal Bayesian Inference for Gaussian Processes

Author: Filippone Maurizio
Girolami Mark
Publication date: 07/04/2014

The main challenges that arise when adopting Gaussian Process priors in probabilistic modeling are how to carry out exact Bayesian inference and how to account for uncertainty on model parameters when making model-based predictions on out-of-sample data. Using probit regression as an illustrative working example, this paper presents a general and effective methodology based on the pseudo-marginal approach to Markov chain Monte Carlo that efficiently addresses both of these issues. The results presented in this paper show improvements over existing sampling methods to simulate from the posterior distribution over the parameters defining the covariance function of the Gaussian Process prior. This is particularly important as it offers a powerful tool to carry out full Bayesian inference of Gaussian Process based hierarchic statistical models in general. The results also demonstrate that Monte Carlo based integration of all model parameters is actually feasible in this class of models, providing a superior quantification of uncertainty in predictions. Extensive comparisons with respect to state-of-the-art probabilistic classifiers confirm this assertion. Comment: 14 pages, double column
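The pseudo-marginal idea named in the abstract above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' Gaussian Process implementation: the function names, the one-dimensional target, and the uniform noise model are all hypothetical. The point it shows is that the exact likelihood is replaced by a non-negative unbiased estimate, and the estimate attached to the current state is reused rather than recomputed.

```python
import math
import random


def pseudo_marginal_mh(log_prior, estimate_likelihood, theta0, n_steps,
                       step=0.5, seed=0):
    """Random-walk Metropolis-Hastings driven by a noisy likelihood estimate.

    estimate_likelihood(theta, rng) must return a non-negative, unbiased
    estimate of p(data | theta) and must be positive at theta0.
    """
    rng = random.Random(seed)
    theta = theta0
    lik = estimate_likelihood(theta, rng)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        lik_prop = estimate_likelihood(proposal, rng)
        if lik_prop > 0.0:
            log_ratio = (math.log(lik_prop) + log_prior(proposal)
                         - math.log(lik) - log_prior(theta))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                # Keep the estimate together with the accepted state; it is
                # never refreshed for the current theta.
                theta, lik = proposal, lik_prop
        samples.append(theta)
    return samples


# Hypothetical toy model: broad Gaussian prior, Gaussian likelihood around 1.
def toy_log_prior(theta):
    return -0.5 * theta * theta / 100.0


def toy_likelihood_estimate(theta, rng):
    exact = math.exp(-0.5 * (theta - 1.0) ** 2)
    # Multiplicative noise with mean 1 keeps the estimate unbiased.
    return exact * rng.uniform(0.5, 1.5)


samples = pseudo_marginal_mh(toy_log_prior, toy_likelihood_estimate,
                             theta0=0.0, n_steps=2000)
```

In the setting the abstract describes, the estimator would presumably come from approximating the integral over the latent Gaussian Process function values; here it is simulated by perturbing a known likelihood.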

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Enlighten

Large Scale Genomic Sequence SVM Classifiers

Author: Rätsch G.
Schölkopf B.
Sonnenburg S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/08/2005

In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree kernel (WD). In particular, we suggest several extensions using Suffix Trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the spectrum kernel and WD kernel, large scale SVM training can be accelerated by factors of 20 and 4 times, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of Support Vectors). Our method allows us to train on sets as large as one million sequences.
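For readers unfamiliar with the spectrum kernel mentioned above, a minimal sketch follows. It assumes the usual definition (the inner product of the k-mer count vectors of two sequences) and uses plain hash tables; it does not reproduce the suffix-tree or SMO accelerations the abstract describes, and the function names are illustrative only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Iterate over the smaller table; k-mers absent from the other
    # table contribute zero because Counter returns 0 for missing keys.
    if len(cx) > len(cy):
        cx, cy = cy, cx
    return sum(n * cy[kmer] for kmer, n in cx.items())
```

Each kernel evaluation here costs time linear in the two sequence lengths, which is the baseline that faster data structures aim to improve on when millions of sequences are involved.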

MPG.PuRe

Fast splice site detection using information content and feature reduction

Author: AKMA Baten
BCH Chang
C Burge
C Cortes
CE Shannon
D Cai
G Dror
G Ratsch
G Yeo
H Drucker
H Itoh
H Liu
JCaHLS Rajapakse
JSaRD Chuang
L Zhang
M Burset
M Pertea
M Zhang
MB Shapiro
MG Reese
N Cristianini
P Waddell
R Castelo
S Brunak
S Buckingham
S Degroeve
S Salzberg
S Sonnenburg
S Washietl
SA Marashi
SK Halgamuge
SM Hebsgaard
T Golub
T-M Chen
TD Schneider
v Vapnik
XH-F Zhang
Y Saeys
YF Sun
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Many computational methods have already been proposed for the detection of splice sites, and some of them show high prediction accuracy. However, most of these methods are limited by their long computation time when applied to whole genome sequence data. Results: In this paper we propose a hybrid algorithm which combines several effective and informative input features with the state-of-the-art support vector machine (SVM). To obtain the input features we employ an information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to remove the less informative features from the input data. Conclusion: In this study we propose a new feature-based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when its performance is compared with various state-of-the-art and well-known methods.
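As a rough illustration of a Shannon-based information content feature, the sketch below computes the per-position information content of a set of aligned, equal-length sequences, in the style of a sequence logo. This is an assumption about what such a feature might look like, not the paper's exact method; the function name and the uniform-background choice are hypothetical.

```python
import math
from collections import Counter


def information_content(aligned_seqs, alphabet="ACGT"):
    """Per-position information content in bits for equal-length sequences.

    Uses log2(|alphabet|) minus the Shannon entropy of the observed
    symbol frequencies at each position (uniform background assumed).
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must have equal length")
    max_entropy = math.log2(len(alphabet))
    scores = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        scores.append(max_entropy - entropy)
    return scores
```

A fully conserved position scores 2 bits for DNA and a uniformly mixed one scores 0, so positions around a splice site that carry signal stand out from the background.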

ePublications@SCU

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

Sequence-Based Classification Using Discriminatory Motif Feature Selection

Author: Daniel Capurso
Hao Xiong
Mark R. Segal
Mikael Boden
Śaunak Sen
Publication venue: Public Library of Science
Publication date: 01/01/2011

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/
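The three-way partitioning described in the abstract can be sketched as follows. This is not the authors' pipeline: the function names, the split fractions, and the presence/absence encoding are assumptions made for illustration, and the motif finder itself is left out (any list of motif strings stands in for its output).

```python
import random


def three_way_split(n_items, fractions=(0.2, 0.6, 0.2), seed=0):
    """Shuffle indices and split them into discovery, training, validation."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_disc = int(fractions[0] * n_items)
    n_train = int(fractions[1] * n_items)
    discovery = indices[:n_disc]
    training = indices[n_disc:n_disc + n_train]
    # Whatever remains after rounding goes to validation.
    validation = indices[n_disc + n_train:]
    return discovery, training, validation


def motif_features(seq, motifs):
    """Encode a sequence as presence (1) or absence (0) of each motif."""
    return [int(m in seq) for m in motifs]
```

Motifs would be found on the discovery indices only, a classifier fitted on `motif_features` of the training sequences, and performance measured on the validation sequences, so that no sequence influences both feature discovery and assessment.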

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California