Bayesian leave-one-out cross-validation for large data
Model inference, such as model comparison, model checking, and model selection, is an important part of model development. Leave-one-out cross-validation (LOO) is a general approach for assessing the generalizability of a model, but unfortunately, LOO does not scale well to large datasets. We propose combining approximate inference techniques with probability-proportional-to-size (PPS) sampling for fast LOO model evaluation on large datasets. We provide both theoretical and empirical results showing good properties for large data.

Comment: Accepted to ICML 2019. This version is the submitted paper.
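The PPS component of this approach can be sketched independently of the paper's details. Below is a minimal numpy sketch, assuming we already have a cheap per-observation approximation `approx_elpd` (e.g. from an approximate posterior) and an expensive callable `exact_loo_i` that returns the exact LOO log predictive density for a single observation; both names are hypothetical, and the paper's actual estimator and bias corrections may differ.

```python
import numpy as np

def pps_loo_estimate(approx_elpd, exact_loo_i, m, seed=None):
    """Estimate the total LOO log predictive density (elpd_loo) by
    probability-proportional-to-size (PPS) subsampling.

    approx_elpd : (n,) array of cheap per-observation approximations (the "sizes")
    exact_loo_i : callable i -> exact LOO log predictive density of observation i
    m           : subsample size, m << n
    """
    rng = np.random.default_rng(seed)
    n = len(approx_elpd)
    # Sampling probabilities proportional to the magnitude of the approximation.
    p = np.abs(approx_elpd)
    p = p / p.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    # Hansen-Hurwitz estimator of the total: mean of exact values scaled by 1/p_i.
    exact = np.array([exact_loo_i(i) for i in idx])
    return np.mean(exact / p[idx])
```

With replacement sampling, the Hansen-Hurwitz estimator above is unbiased for the sum of the exact per-observation values, so only `m` expensive LOO evaluations are needed instead of `n`.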
Leave-one-out Distinguishability in Machine Learning
We introduce a new analytical framework to quantify the changes in a machine
learning algorithm's output distribution following the inclusion of a few data
points in its training set, a notion we define as leave-one-out
distinguishability (LOOD). This problem is key to measuring data
**memorization** and **information leakage** in machine learning, and the
**influence** of training data points on model predictions. We illustrate how
our method broadens and refines existing empirical measures of memorization and
privacy risks associated with training data. We use Gaussian processes to model
the randomness of machine learning algorithms, and validate LOOD with extensive
empirical analysis of information leakage using membership inference attacks.
Our theoretical framework enables us to investigate the causes of information
leakage and where the leakage is high. For example, we analyze the influence of
activation functions on data memorization. Additionally, our method allows us
to optimize queries that disclose the most significant information about the
training data in the leave-one-out setting. We illustrate how optimal queries
can be used for accurate **reconstruction** of training data.
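To make the LOOD idea concrete, here is a minimal sketch under strong assumptions: the algorithm's randomness is modeled by plain GP regression with an RBF kernel, and distinguishability is measured as the KL divergence between the two Gaussian predictive distributions at a set of query points. The kernel, the divergence, and all function names are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel between rows of A (n, d) and B (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, Xq, noise=1e-2, ls=1.0):
    """GP posterior predictive mean and covariance at query points Xq."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    Ks, Kss = rbf(Xq, X, ls), rbf(Xq, Xq, ls)
    Kinv = np.linalg.inv(K)
    return Ks @ Kinv @ y, Kss - Ks @ Kinv @ Ks.T

def gaussian_kl(mu0, cov0, mu1, cov1, jitter=1e-8):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians."""
    k = len(mu0)
    cov0 = cov0 + jitter * np.eye(k)
    cov1 = cov1 + jitter * np.eye(k)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - k + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def lood(X, y, leave_out, Xq):
    """Divergence between GP predictive distributions trained with and
    without the points indexed by `leave_out`, evaluated at queries Xq."""
    keep = np.setdiff1d(np.arange(len(X)), leave_out)
    mu_full, cov_full = gp_posterior(X, y, Xq)
    mu_loo, cov_loo = gp_posterior(X[keep], y[keep], Xq)
    return gaussian_kl(mu_full, cov_full, mu_loo, cov_loo)
```

In this sketch, query points `Xq` where `lood` is large are the ones that leak the most information about the left-out points, which is the quantity the query-optimization step would maximize.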
Algebraic shortcuts for leave-one-out cross-validation in supervised network inference
Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to predict new interactions in a given network and using it to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might dramatically differ between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models.
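The flavor of these shortcuts is easiest to see in the classical single-step case: for kernel ridge regression, the leave-one-out predictions follow directly from the hat matrix, with no refitting. A minimal numpy sketch of that standard identity is below; the paper's two-step variant additionally exploits the pairwise (Kronecker) structure of the network kernel and covers the different vertex/edge prediction settings, which this sketch does not show.

```python
import numpy as np

def krr_loo_predictions(K, y, lam=1.0):
    """Leave-one-out predictions for kernel ridge regression, computed
    from a single fit via the hat-matrix shortcut.

    K   : (n, n) kernel matrix
    y   : (n,) targets
    lam : ridge regularization parameter
    """
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))   # hat matrix
    y_hat = H @ y                                # in-sample predictions
    h = np.diag(H)
    # Standard identity: y_hat_{-i} = (y_hat_i - h_ii * y_i) / (1 - h_ii)
    return (y_hat - h * y) / (1.0 - h)
```

Because the shortcut reuses one factorization for all n held-out points, hyperparameters such as `lam` can be tuned against the LOO error at essentially the cost of a single fit, which is the efficiency argument the two-step framework extends to full network cross-validation.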