Comparing Kernels For Predicting Protein Binding Sites From Amino Acid Sequence
The ability to identify protein binding sites and to detect specific amino acid residues that contribute to the specificity and affinity of protein interactions has important implications for problems ranging from rational drug design to analysis of metabolic and signal transduction networks. Support vector machines (SVM) and related kernel methods offer an attractive approach to predicting protein binding sites. An appropriate choice of the kernel function is critical to the performance of an SVM, because kernel functions offer a way to incorporate domain-specific knowledge into the classifier. We compare the performance of three types of kernel functions (identity kernel, sequence-alignment kernel, and amino acid substitution matrix kernel) for predicting protein-protein, protein-DNA and protein-RNA binding sites. The results show that the identity kernel is quite effective on all three tasks, while the substitution kernel, based on amino acid substitution matrices that take into account structural or evolutionary conservation or physicochemical properties of amino acids, yields a modest improvement in the performance of the resulting SVM classifiers.
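To make the contrast between the identity kernel and a substitution-style kernel concrete, the sketch below compares two equal-length residue windows position by position. It is only an illustration: the window representation, the function names, and the residue grouping in `GROUPS` are assumptions introduced here, and the toy scores (1.0 for identical residues, 0.5 for the same physicochemical group) stand in for a real substitution matrix, which the abstract does not specify.

```python
# Hypothetical grouping of residues by physicochemical class (illustrative only;
# this is NOT a published substitution matrix).
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def _check_windows(w1, w2):
    if len(w1) != len(w2):
        raise ValueError("windows must have the same length")


def identity_kernel(w1, w2):
    """Number of positions at which the two windows carry the same residue."""
    _check_windows(w1, w2)
    return float(sum(a == b for a, b in zip(w1, w2)))


def toy_substitution_score(a, b):
    """1.0 for identical residues, 0.5 for the same toy group, 0.0 otherwise."""
    if a == b:
        return 1.0
    group_a = GROUP_OF.get(a)
    if group_a is not None and group_a == GROUP_OF.get(b):
        return 0.5
    return 0.0


def substitution_kernel(w1, w2):
    """Sum of per-position toy substitution scores between two windows."""
    _check_windows(w1, w2)
    return sum(toy_substitution_score(a, b) for a, b in zip(w1, w2))


if __name__ == "__main__":
    print(identity_kernel("ACDKL", "ACEKV"))
    print(substitution_kernel("ACDKL", "ACEKV"))
```

The toy score table is positive semidefinite by construction (it is a sum of an identity term and a group-membership term), so the summed kernel is valid for an SVM. A raw substitution matrix does not always have this property and may need to be transformed before it is used as a kernel.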
Algebraic shortcuts for leave-one-out cross-validation in supervised network inference
Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to predict new interactions in a given network and using it to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might differ dramatically between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models.
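The idea of an algebraic cross-validation shortcut can be illustrated on ordinary (single-kernel) kernel ridge regression, where the leave-one-out prediction follows from the hat matrix H = K(K + λI)^{-1} without refitting. The sketch below shows only this textbook identity, not the paper's two-step algorithm or its setting-specific shortcuts; the function names and the Gaussian kernel are assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np


def gaussian_kernel(X, gamma=0.5):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def loo_shortcut(K, y, lam):
    """Leave-one-out predictions from a single fit, via the hat matrix."""
    n = K.shape[0]
    H = K @ np.linalg.inv(K + lam * np.eye(n))  # hat matrix of kernel ridge regression
    fitted = H @ y
    h = np.diag(H)
    # Identity: y_i - f^{(-i)}(x_i) = (y_i - fitted_i) / (1 - h_i)
    return (fitted - h * y) / (1.0 - h)


def loo_naive(K, y, lam):
    """Leave-one-out predictions obtained by refitting n times (for comparison)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        alpha = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
        preds[i] = K[i, keep] @ alpha
    return preds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 3))
    y = rng.normal(size=12)
    K = gaussian_kernel(X)
    print(np.allclose(loo_shortcut(K, y, 1.0), loo_naive(K, y, 1.0)))
```

The shortcut needs one matrix inversion where the naive loop solves n separate linear systems, which is the kind of saving the abstract refers to.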
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach
Virtual screening (VS) is widely used during computational drug discovery to
reduce costs. Chemogenomics-based virtual screening (CGBVS) can be used to
predict new compound-protein interactions (CPIs) from known CPI network data
using several methods, including machine learning and data mining. Although
CGBVS facilitates highly efficient and accurate CPI prediction, it has poor
performance for prediction of new compounds for which CPIs are unknown. The
pairwise kernel method (PKM) is a state-of-the-art CGBVS method and shows high
accuracy for prediction of new compounds. In this study, on the basis of link
mining, we improved the PKM by combining link indicator kernel (LIK) and
chemical similarity and evaluated the accuracy of these methods. The proposed
method obtained an average area under the precision-recall curve (AUPR) value
of 0.562, which was higher than that achieved by the conventional Gaussian
interaction profile (GIP) method (0.425), and the calculation time was only
increased by a few percent.
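For readers unfamiliar with the Gaussian interaction profile (GIP) baseline mentioned above, the sketch below computes a GIP-style kernel from binary interaction profiles (one row per compound, one column per protein). It assumes the commonly used form K(i, j) = exp(-γ ||y_i - y_j||²), with γ normalised by the mean squared profile norm. The function name and the `gamma_prime` parameter are illustrative, and the abstract does not state the exact settings used in the study.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP-style kernel matrix from interaction profiles (list of equal-length rows)."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    # Assumed convention: bandwidth scaled by the average squared profile norm.
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[math.exp(-gamma * sq_dist(p, q)) for q in profiles] for p in profiles]


if __name__ == "__main__":
    # Three hypothetical compounds, three hypothetical proteins.
    profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
    for row in gip_kernel(profiles):
        print([round(v, 3) for v in row])
```

A compound with no known interactions has an all-zero profile, so a kernel of this form carries little information about it. This matches the abstract's remark that predictions for new compounds are the hard case.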
Predicting protein-protein interactions as a one-class classification problem
Protein-protein interactions represent a key step in understanding protein functions, because proteins usually work in the context of other proteins and rarely function alone. Machine learning techniques have been used to predict protein-protein interactions. However, most of these techniques treat the task as a binary classification problem. While it is easy to obtain a dataset of interacting proteins as positive examples, there are no experimentally confirmed non-interacting proteins that can serve as a negative set. Therefore, in this paper we treat the task as a one-class classification problem using a One-Class SVM (OCSVM). Using only positive examples (interacting protein pairs) for training, the OCSVM achieves an accuracy of 80%. These results imply that protein-protein interactions can be predicted with reliable accuracy using a one-class classifier.
Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences
as queries to search for functionally related enzymes in online databases. To
this end, one usually departs from some notion of similarity, comparing two
enzymes by looking for correspondences in their sequences, structures or
surfaces. For a given query, the search operation results in a ranking of the
enzymes in the database, from very similar to dissimilar enzymes, while
information about the biological function of annotated database enzymes is
ignored.
In this work we show that rankings of that kind can be substantially improved
by applying kernel-based learning algorithms. This approach enables the
detection of statistical dependencies between similarities of the active cleft
and the biological function of annotated enzymes. This is in contrast to
search-based approaches, which do not take annotated training data into
account. Similarity measures based on the active cleft are known to outperform
sequence-based or structure-based measures under certain conditions. We
consider the Enzyme Commission (EC) classification hierarchy for obtaining
annotated enzymes during the training phase. The results of a set of sizeable
experiments indicate a consistent and significant improvement for a set of
similarity measures that exploit information about small cavities in the
surface of enzymes.
Metric learning pairwise kernel for graph inference
Much recent work in bioinformatics has focused on the inference of various
types of biological networks, representing gene regulation, metabolic
processes, protein-protein interactions, etc. A common setting involves
inferring network edges in a supervised fashion from a set of high-confidence
edges, possibly characterized by multiple, heterogeneous data sets (protein
sequence, gene expression, etc.). Here, we distinguish between two modes of
inference in this setting: direct inference based upon similarities between
nodes joined by an edge, and indirect inference based upon similarities between
one pair of nodes and another pair of nodes. We propose a supervised approach
for the direct case by translating it into a distance metric learning problem.
A relaxation of the resulting convex optimization problem leads to the support
vector machine (SVM) algorithm with a particular kernel for pairs, which we
call the metric learning pairwise kernel (MLPK). We demonstrate, using several
real biological networks, that this direct approach often improves upon the
state-of-the-art SVM for indirect inference with the tensor product pairwise
kernel.
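The two pairwise kernels contrasted above can be written down compactly once a base kernel between individual nodes is given. The sketch below uses the standard forms: the tensor product pairwise kernel (TPPK), K(a,c)K(b,d) + K(a,d)K(b,c), and the metric learning pairwise kernel (MLPK), (K(a,c) - K(a,d) - K(b,c) + K(b,d))². The linear base kernel and the function names are assumptions chosen to keep the example self-contained.

```python
def linear_kernel(x, y):
    """Base kernel between two nodes, here a plain dot product (an assumption)."""
    return sum(a * b for a, b in zip(x, y))


def tppk(pair1, pair2, base=linear_kernel):
    """Tensor product pairwise kernel between unordered pairs (a, b) and (c, d)."""
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


def mlpk(pair1, pair2, base=linear_kernel):
    """Metric learning pairwise kernel between unordered pairs (a, b) and (c, d)."""
    (a, b), (c, d) = pair1, pair2
    return (base(a, c) - base(a, d) - base(b, c) + base(b, d)) ** 2


if __name__ == "__main__":
    a, b = (1.0, 0.0), (0.0, 1.0)
    c, d = (2.0, 0.0), (0.0, 1.0)
    print(tppk((a, b), (c, d)))  # compares pairs through products of node similarities
    print(mlpk((a, b), (c, d)))  # compares pairs through the differences a - b and c - d
```

With a linear base kernel, the MLPK equals the squared dot product of the difference vectors a - b and c - d, which is why it suits direct inference: it measures how similarly the two nodes within each pair differ from one another.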
A Comparative Study of Pairwise Learning Methods based on Kernel Ridge Regression
Many machine learning problems can be formulated as predicting labels for a
pair of objects. Problems of that kind are often referred to as pairwise
learning, dyadic prediction or network inference problems. During the last
decade kernel methods have played a dominant role in pairwise learning. They
still obtain a state-of-the-art predictive performance, but a theoretical
analysis of their behavior has been underexplored in the machine learning
literature.
In this work we review and unify existing kernel-based algorithms that are
commonly used in different pairwise learning settings, ranging from matrix
filtering to zero-shot learning. To this end, we focus on closed-form efficient
instantiations of Kronecker kernel ridge regression. We show that independent
task kernel ridge regression, two-step kernel ridge regression and a linear
matrix filter arise naturally as special cases of Kronecker kernel ridge
regression, implying that all these methods implicitly minimize a squared loss.
In addition, we analyze universality, consistency and spectral filtering
properties. Our theoretical results provide valuable insights in assessing the
advantages and limitations of existing pairwise learning methods.
Comment: arXiv admin note: text overlap with arXiv:1606.0427
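As a rough illustration of the closed-form instantiations mentioned in this abstract, the sketch below solves Kronecker kernel ridge regression for a complete label matrix Y through the eigendecompositions of the two kernel matrices, and places two-step kernel ridge regression next to it for comparison. It is a sketch under stated assumptions rather than the authors' implementation: K (rows) and G (columns) are assumed to be symmetric positive semidefinite, the function names are made up for the example, and NumPy is assumed to be available.

```python
import numpy as np


def kronecker_krr(K, G, Y, lam):
    """Dual parameters A solving K A G + lam * A = Y (Kronecker kernel ridge)."""
    evals_k, U = np.linalg.eigh(K)
    evals_g, V = np.linalg.eigh(G)
    projected = U.T @ Y @ V
    # Spectral filter: divide by (product of eigenvalues + lam), entry by entry.
    C = projected / (np.outer(evals_k, evals_g) + lam)
    return U @ C @ V.T


def two_step_krr(K, G, Y, lam_u, lam_v):
    """Dual parameters A = (K + lam_u I)^{-1} Y (G + lam_v I)^{-1}."""
    n, m = Y.shape
    left = np.linalg.solve(K + lam_u * np.eye(n), Y)
    # G is symmetric, so solving against left.T and transposing applies the inverse on the right.
    return np.linalg.solve(G + lam_v * np.eye(m), left.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Xu, Xv = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
    K, G = Xu @ Xu.T, Xv @ Xv.T
    Y = rng.normal(size=(5, 4))
    A = kronecker_krr(K, G, Y, lam=0.5)
    print(np.allclose(K @ A @ G + 0.5 * A, Y))
    print((K @ two_step_krr(K, G, Y, 0.5, 0.5) @ G).shape)  # fitted label matrix
```

In both cases the fitted matrix is K A G. The two methods differ only in how the eigenvalues of K and G enter the filter: a sum of their product and λ for the Kronecker model, versus a product of two shifted terms for the two-step model.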