Search CORE

62 research outputs found

Fast protein superfamily classification using principal component null space analysis.

Author: French Leon
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2005

The protein family classification problem, which consists of determining the family memberships of unknown protein sequences, matters to biologists for practical reasons such as drug discovery, prediction of molecular functions, and medical diagnosis. Neural networks and Bayesian methods have performed well on this problem, achieving accuracies ranging from 90% to 98%, but they run relatively slowly in the learning stage. In this thesis, we apply a principal component null space analysis (PCNSA) linear classifier to the problem and report excellent results compared to those of neural networks and support vector machines. The two main parameters of PCNSA are linked to the high dimensionality of the dataset used, and were optimized exhaustively to maximize accuracy. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .F74. Source: Masters Abstracts International, Volume: 44-03, page: 1400. Thesis (M.Sc.)--University of Windsor (Canada), 2005

Scholarship at UWindsor

Mining protein database using machine learning techniques

Author: Camargo Renata
Niranjan Mahesan
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/06/2008

With a large amount of information relating to proteins accumulating in databases widely available online, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can help reduce the space over which experiment designers need to search in order to improve our understanding of the biochemical properties. Previously it has been suggested that features computable by comparing a pair of proteins can be integrated by an artificial neural network, hence predicting the degree to which the two proteins may be evolutionarily related and homologous. We compiled two datasets of pairs of proteins, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features for the problem of separating remote homologous from analogous pairs, and we note that a significant performance gain was obtained by the inclusion of sequence and structure information. We find that a linear classifier was enough to discriminate a protein pair at the family level. However, at the superfamily level, detecting remote homologous pairs was a relatively harder problem, and we find that nonlinear classifiers achieve significantly higher accuracies. In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins, and from extensive cross-validation and feature selection studies we quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins

Southampton (e-Prints Soton)

Crossref

A Protein Classification Benchmark collection for machine learning

Author: Dhir Somdutta
Gáspári Zoltán
Kertész-Farkas Attila
Kocsor András
Leunissen Jack A.M.
Pacurar Mircea
Pongor Sándor
Sonego Paolo
Publication venue: Oxford University Press
Publication date: 01/01/2006

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative and training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all-vs.-all comparisons of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms

ELTE Digital Institutional Repository (EDIT)

Building multiclass classifiers for remote homology detection and fold recognition

Author: A Heger
A Krogh
A Sun
AG Murzin
B Taskar
C Leslie
CA Orengo
CD Huang
CH Ding
D Mittelman
E le
E Lindahl
EL Allwein
F Aiolli
F Rosenblatt
George Karypis
H Rangwala
H Saigo
Huzefa Rangwala
I Tsochantaridis
J Rousu
J Shi
J Weston
K Crammer
L Holm
L Liao
M Collins
M Marti-Renom
P Baldi
R Kuang
R Rifkin
S Altschul
SB Needleman
SE Brenner
T Jaakkola
T Joachims
TF Smith
TG Dietterich
V Vapnik
W Pearson
Y Guermeur
Y Hou
Z Barutcuoglu
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION: Analyzing the performance achieved by the different approaches on four different datasets we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results
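As a rough illustration of combining a set of binary classifiers into a multiclass prediction, the sketch below picks the class whose binary score is largest, optionally after a per-class rescaling that stands in for a very simple second-level model. It is not the authors' method: the function name `predict_one_vs_rest`, the `scale` and `shift` parameters, and the example labels and scores are all assumptions made for illustration.

```python
from typing import Dict, Optional


def predict_one_vs_rest(
    scores: Dict[str, float],
    scale: Optional[Dict[str, float]] = None,
    shift: Optional[Dict[str, float]] = None,
) -> str:
    """Return the class label with the highest adjusted binary score.

    scores: output of one binary classifier per class (hypothetical values).
    scale, shift: optional per-class weights, a stand-in for a learned
        second-level combination; classes not listed keep scale 1 and shift 0.
    """
    scale = scale or {}
    shift = shift or {}
    adjusted = {
        label: scale.get(label, 1.0) * score + shift.get(label, 0.0)
        for label, score in scores.items()
    }
    # The predicted class is the one with the largest adjusted score.
    return max(adjusted, key=adjusted.get)


# Hypothetical decision values from three binary classifiers.
example_scores = {"fold_a": 0.2, "fold_b": 0.7, "fold_c": -1.1}
plain_choice = predict_one_vs_rest(example_scores)
rescaled_choice = predict_one_vs_rest(example_scores, scale={"fold_a": 5.0})
```

With no rescaling this reduces to plain one-vs-rest voting; the per-class weights show where a learned combination would plug in.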

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Minnesota Digital Conservancy

Comparing Kernels For Predicting Protein Binding Sites From Amino Acid Sequence

Author: Dobbs Drena
Honavar Vasant
Olson Byron
Wu Feihong
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2006

The ability to identify protein binding sites and to detect specific amino acid residues that contribute to the specificity and affinity of protein interactions has important implications for problems ranging from rational drug design to analysis of metabolic and signal transduction networks. Support vector machines (SVM) and related kernel methods offer an attractive approach to predicting protein binding sites. An appropriate choice of the kernel function is critical to the performance of SVM. Kernel functions offer a way to incorporate domain-specific knowledge into the classifier. We compare the performance of three types of kernel functions (identity kernel, sequence-alignment kernel, and amino acid substitution matrix kernel) for predicting protein-protein, protein-DNA and protein-RNA binding sites. The results show that the identity kernel is quite effective on all three tasks, while the substitution kernel, based on amino acid substitution matrices that take into account structural or evolutionary conservation or physicochemical properties of amino acids, yields a modest improvement in the performance of the resulting SVM classifiers for predicting protein-protein, protein-DNA and protein-RNA binding sites
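To make the difference between an identity kernel and a substitution-matrix kernel concrete, here is a minimal sketch over fixed-length residue windows. It assumes the identity kernel counts positions where two windows carry the same residue, and that the substitution kernel sums a pairwise score over aligned positions; the abstract does not give the exact definitions, so both are assumptions. `TOY_SCORES` is a made-up table, not a real substitution matrix.

```python
from typing import Dict, Tuple


def identity_kernel(x: str, y: str) -> int:
    """Count aligned positions where both windows have the same residue."""
    if len(x) != len(y):
        raise ValueError("windows must have equal length")
    return sum(a == b for a, b in zip(x, y))


def substitution_kernel(
    x: str, y: str, scores: Dict[Tuple[str, str], float]
) -> float:
    """Sum a pairwise residue score over aligned positions.

    The lookup is symmetric; residue pairs missing from the table score 0.
    """
    if len(x) != len(y):
        raise ValueError("windows must have equal length")
    return sum(
        scores.get((a, b), scores.get((b, a), 0.0)) for a, b in zip(x, y)
    )


# Hypothetical toy scores: identical residues score 1, the similar pair L/I 0.5.
TOY_SCORES = {
    ("A", "A"): 1.0,
    ("L", "L"): 1.0,
    ("I", "I"): 1.0,
    ("L", "I"): 0.5,
}

identity_value = identity_kernel("ALI", "ALL")
substitution_value = substitution_kernel("ALI", "ALL", TOY_SCORES)
```

The identity kernel is the dot product of one-hot encoded windows, so it is always a valid kernel. A sum of arbitrary substitution scores is not guaranteed to be positive semidefinite, so a real matrix may need adjusting before use in an SVM.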

Digital Repository @ Iowa State University (ISU)

Crossref

Building Combined Classifiers

Author: Eastwood Mark
Gabrys Bogdan
Publication venue: EXIT Publishing House
Publication date: 01/01/2008

This chapter covers different approaches that may be taken when building an ensemble method, through studying specific examples of each approach from research conducted by the authors. A method called Negative Correlation Learning illustrates a decision level combination approach with individual classifiers trained co-operatively. The Model level combination paradigm is illustrated via a tree combination method. Finally, another variant of the decision level paradigm, with individuals trained independently instead of co-operatively, is discussed as applied to churn prediction in the telecommunications industry

Bournemouth University Research Online

Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

Author: Bernhard Schölkopf
Gunnar Rätsch
Hanh Witte
Jagan Srinivasan
Klaus-R Müller
Ralf-J Sommer
Sören Sonnenburg
The Caenorhabditis elegans sequencing consortium
Uwe Ohler
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2007

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology

CiteSeerX

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

Caltech Authors

MPG.PuRe

Support vector machines with profile-based kernels for remote protein homology detection

Author: Abela John
Busuttil Steven
Pace Gordon J.
The 15th International Conference on Genome Informatics (GIW'04)
Publication venue: GIW
Publication date: 01/01/2004

Two new techniques for remote protein homology detection, particularly suited to sparse data, are introduced. These methods are based on position-specific scoring matrices, or profiles, and use a support vector machine (SVM) for discrimination. Their performance on standard benchmarks exceeds that of previous non-discriminative techniques and is comparable to that of other SVM-based methods, while giving distinct advantages. Peer-reviewed

OAR@UM

A hybrid generative/discriminative method for EEG evoked potential detection

Author: Deniz Erdogmus
Kenneth E Hild
Misha Pavel
Santosh Mathan
Yonghong Huang
Publication date: 06/03/2020

I. INTRODUCTION Generative and discriminative learning approaches are two prevailing and powerful, yet different, paradigms in machine learning. Generative learning models, such as Bayesian inference [1], attempt to model the underlying distributions of the variables in order to compute classification and regression functions. These methods provide a rich framework for learning from prior knowledge. Discriminative learning models, such as support vector machines (SVM) [2], avoid generative modeling by directly optimizing a mapping from the inputs to the desired outputs, adjusting the resulting classification boundary. These latter methods commonly demonstrate superior performance in classification. Recently, researchers have investigated the relationship between these two learning paradigms and have attempted to combine their complementary strengths

CiteSeerX