Algebraic shortcuts for leave-one-out cross-validation in supervised network inference
Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to predict new interactions in a given network and using it to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might differ dramatically between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models.
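The "new vertex" setting described in this abstract can be illustrated with a minimal sketch. It assumes two-step kernel ridge regression produces fitted values of the form H_u Y H_v, where H = K (K + λI)^{-1} is the ridge hat matrix of each vertex kernel, and it applies the classic ridge leave-one-out identity to whole rows of the label matrix. The function names and the explicit matrix inverses are illustrative choices made for readability; this is not the authors' implementation and does not reproduce the efficient training the abstract refers to.

```python
import numpy as np


def hat_matrix(K, lam):
    """Ridge hat matrix H = K (K + lam I)^{-1} for a symmetric kernel K."""
    n = K.shape[0]
    return K @ np.linalg.inv(K + lam * np.eye(n))


def two_step_krr_fit(K, G, Y, lam_u, lam_v):
    """Fitted values H_u Y H_v, plus the two hat matrices (hypothetical helper)."""
    H_u = hat_matrix(K, lam_u)
    H_v = hat_matrix(G, lam_v)
    return H_u @ Y @ H_v, H_u, H_v


def loo_new_row_vertex(Y, H_u, H_v):
    """Shortcut: prediction for each row vertex as if its whole row were held out.

    Uses the standard ridge identity f_loo_i = (f_i - H_ii z_i) / (1 - H_ii),
    applied to the column-smoothed targets Z = Y H_v.
    """
    Z = Y @ H_v
    F = H_u @ Z
    h = np.diag(H_u)[:, None]  # leverage of each row vertex
    return (F - h * Z) / (1.0 - h)


def loo_new_row_vertex_bruteforce(K, G, Y, lam_u, lam_v):
    """Reference: refit without row i, then predict row i from its kernel values."""
    n = K.shape[0]
    H_v = hat_matrix(G, lam_v)
    out = np.empty_like(Y, dtype=float)
    for i in range(n):
        keep = np.arange(n) != i
        K_train = K[np.ix_(keep, keep)]
        coef = np.linalg.solve(K_train + lam_u * np.eye(n - 1), Y[keep] @ H_v)
        out[i] = K[i, keep] @ coef
    return out
```

The shortcut needs one fit of the full model, whereas the brute-force reference refits once per held-out vertex; comparing the two on small random kernels is a quick sanity check of the identity.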
Classifying pairs with trees for supervised biological network inference
Networks are ubiquitous in biology, and computational approaches to their
inference have been widely investigated. In particular, supervised machine
learning methods can be used to complete a partially known network by
integrating various measurements. Two main supervised frameworks have been
proposed: the local approach, which trains a separate model for each network
node, and the global approach, which trains a single model over pairs of nodes.
Here, we systematically investigate, theoretically and empirically, the
exploitation of tree-based ensemble methods in the context of these two
approaches for biological network inference. We first formalize the problem of
network inference as classification of pairs, unifying in the process
homogeneous and bipartite graphs and discussing two main sampling schemes. We
then present the global and the local approaches, extending the latter for the
prediction of interactions between two unseen network nodes, and discuss their
specializations to tree-based ensemble methods, highlighting their
interpretability and drawing links with clustering techniques. Extensive
computational experiments are carried out with these methods on various
biological networks that clearly highlight that these methods are competitive
with existing methods. Comment: 22 pages
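The difference between the global and the local approach is easiest to see in how the training sets are built. The sketch below assumes a bipartite network given as a 0/1 adjacency matrix with one feature vector per row node and per column node; the function names are hypothetical, and the classifier itself (tree-based ensembles in the paper) is deliberately left out.

```python
from typing import List, Tuple

Vector = List[float]
Sample = Tuple[Vector, int]


def global_training_set(
    row_feats: List[Vector], col_feats: List[Vector], adjacency: List[List[int]]
) -> List[Sample]:
    """Global approach: one sample per node pair, features concatenated."""
    samples = []
    for i, x_r in enumerate(row_feats):
        for j, x_c in enumerate(col_feats):
            samples.append((x_r + x_c, adjacency[i][j]))
    return samples


def local_training_sets(
    row_feats: List[Vector], adjacency: List[List[int]]
) -> List[List[Sample]]:
    """Local approach: one training set per column node j.

    Each set uses the row-node features as inputs and column j of the
    adjacency matrix as labels, so a separate model is trained per node.
    """
    n_cols = len(adjacency[0])
    return [
        [(x_r, adjacency[i][j]) for i, x_r in enumerate(row_feats)]
        for j in range(n_cols)
    ]
```

With 2 row nodes and 3 column nodes, the global construction yields a single set of 6 pair samples, while the local construction yields 3 sets of 2 samples each.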
Identifying networks with common organizational principles
Many complex systems can be represented as networks, and the problem of
network comparison is becoming increasingly relevant. There are many techniques
for network comparison, from simply comparing network summary statistics to
sophisticated but computationally costly alignment-based approaches. Yet it
remains challenging to accurately cluster networks that are of a different size
and density, but hypothesized to be structurally similar. In this paper, we
address this problem by introducing a new network comparison methodology that
is aimed at identifying common organizational principles in networks. The
methodology is simple, intuitive and applicable in a wide variety of settings
ranging from the functional classification of proteins to tracking the
evolution of a world trade network. Comment: 26 pages, 7 figures
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach
Virtual screening (VS) is widely used during computational drug discovery to
reduce costs. Chemogenomics-based virtual screening (CGBVS) can be used to
predict new compound-protein interactions (CPIs) from known CPI network data
using several methods, including machine learning and data mining. Although
CGBVS facilitates highly efficient and accurate CPI prediction, it has poor
performance for prediction of new compounds for which CPIs are unknown. The
pairwise kernel method (PKM) is a state-of-the-art CGBVS method and shows high
accuracy for prediction of new compounds. In this study, on the basis of link
mining, we improved the PKM by combining link indicator kernel (LIK) and
chemical similarity and evaluated the accuracy of these methods. The proposed
method obtained an average area under the precision-recall curve (AUPR) value
of 0.562, which was higher than that achieved by the conventional Gaussian
interaction profile (GIP) method (0.425), and the calculation time was only
increased by a few percent.
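For readers unfamiliar with the AUPR metric quoted above, the following sketch computes average precision, one common estimator of the area under the precision-recall curve. The abstract does not say which estimator the authors used, so this is an assumption, and the function name is illustrative.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values at each true positive.

    Candidates are ranked by decreasing score; ties keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos
```

A ranking that places every true interaction ahead of every non-interaction scores 1.0, and the score falls as non-interactions are ranked above true ones.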