95 research outputs found
Bayesian experimental design using regularized determinantal point processes
In experimental design, we are given $n$ vectors in $d$ dimensions, and our
goal is to select $k \ll n$ of them to perform expensive measurements, e.g., to
obtain labels/responses, for a linear regression task. Many statistical
criteria have been proposed for choosing the optimal design, with popular
choices including A- and D-optimality. If prior knowledge is given, typically
in the form of a $d \times d$ precision matrix $\mathbf{A}$, then all of the
criteria can be extended to incorporate that information via a Bayesian
framework. In this paper, we demonstrate a new fundamental connection between
Bayesian experimental design and determinantal point processes, the latter
being widely used for sampling diverse subsets of data. We use this connection
to develop new efficient algorithms for finding $(1+\epsilon)$-approximations
of optimal designs under four optimality criteria: A, C, D and V. Our
algorithms can achieve this when the desired subset size is
$k = \Omega\big(\tfrac{d_{\mathbf{A}}}{\epsilon} + \tfrac{\log(1/\epsilon)}{\epsilon^2}\big)$,
where $d_{\mathbf{A}} \leq d$ is the $\mathbf{A}$-effective dimension, which can
often be much smaller than $d$. Our results offer direct improvements over a
number of prior works, for both Bayesian and classical experimental design, in
terms of algorithm efficiency, approximation quality, and range of applicable
criteria.
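As a point of reference for the bound above, the following Python sketch (our own illustration, using the standard ridge-style definition rather than anything specific to the paper) computes the effective dimension $d_{\mathbf{A}} = \mathrm{tr}\big(\boldsymbol{\Sigma}(\boldsymbol{\Sigma} + \mathbf{A})^{-1}\big)$ with $\boldsymbol{\Sigma} = \mathbf{X}^\top\mathbf{X}$, showing how a stronger prior precision shrinks it below $d$.

```python
import numpy as np

def effective_dimension(X, A):
    """A-effective dimension d_A = tr(Sigma (Sigma + A)^{-1}),
    where Sigma = X^T X (the standard definition, assumed here)."""
    Sigma = X.T @ X
    return np.trace(Sigma @ np.linalg.inv(Sigma + A))

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.standard_normal((n, d))

A = 2000.0 * np.eye(d)   # strong prior precision -> d_A well below d
print(effective_dimension(X, A))
print(effective_dimension(X, 1e-8 * np.eye(d)))  # approaches d as A -> 0
```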
Diversity in Machine Learning
Machine learning methods have achieved good performance and have been widely
applied in various real-world applications. They can learn a model adaptively
and thus be better suited to the special requirements of different tasks.
Generally, a good machine learning system is composed of plentiful training
data, a good model training process, and accurate inference. Many factors can
affect the performance of the machine learning process, among which diversity
is an important one. Diversity can help each procedure contribute to a good
overall result: diversity of the training data ensures that the data provide
more discriminative information for the model; diversity of the learned model
(diversity in the parameters of each model, or diversity among different base
models) makes each parameter/model capture unique or complementary
information; and diversity in inference can provide multiple choices, each of
which corresponds to a specific plausible locally optimal result. Even though
diversity plays an important role in the machine learning process, there has
been no systematic analysis of diversification in machine learning systems. In
this paper, we systematically summarize methods for data diversification,
model diversification, and inference diversification in the machine learning
process, respectively. In addition, we survey typical applications where
diversity technology has improved machine learning performance, including
remote sensing imaging tasks, machine translation, camera relocalization,
image segmentation, object detection, topic modeling, and others. Finally, we
discuss some challenges of diversity technology in machine learning and point
out some directions for future work.

Comment: Accepted by IEEE Access
Subsampling for Ridge Regression via Regularized Volume Sampling
Given $n$ vectors $\mathbf{x}_i \in \mathbb{R}^d$, we want to fit a linear
regression model for noisy labels $y_i \in \mathbb{R}$. The ridge estimator is a
classical solution to this problem. However, when labels are expensive, we are
forced to select only a small subset of vectors $\mathbf{x}_i$ for which we
obtain the labels $y_i$. We propose a new procedure for selecting the subset of
vectors, such that the ridge estimator obtained from that subset offers strong
statistical guarantees in terms of the mean squared prediction error over the
entire dataset of $n$ labeled vectors. The number of labels needed is
proportional to the statistical dimension of the problem, which is often much
smaller than $d$. Our method is an extension of a joint subsampling procedure
called volume sampling. A second major contribution is that we speed up volume
sampling so that it is essentially as efficient as leverage scores, which is
the main i.i.d. subsampling procedure for this task. Finally, we show
theoretically and experimentally that volume sampling has a clear advantage
over any i.i.d. sampling when labels are expensive.
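For intuition, here is a minimal Python sketch of the reverse iterative strategy behind regularized volume sampling, based on our reading of the idea rather than the paper's optimized implementation: starting from all $n$ rows, repeatedly drop one index with probability proportional to the regularized volume of what remains, which the matrix determinant lemma reduces to $1 - \mathbf{x}_i^\top(\mathbf{X}_S^\top\mathbf{X}_S + \lambda\mathbf{I})^{-1}\mathbf{x}_i$.

```python
import numpy as np

def regularized_volume_sample(X, k, lam, rng):
    """Reverse iterative sketch: from all n rows, repeatedly drop one
    index with probability proportional to the determinant of the
    regularized covariance of the remaining rows, until k are left."""
    S = list(range(X.shape[0]))
    d = X.shape[1]
    while len(S) > k:
        Xs = X[S]
        Zinv = np.linalg.inv(Xs.T @ Xs + lam * np.eye(d))
        # det after removing row i is proportional to 1 - x_i^T Z^{-1} x_i
        p = 1.0 - np.einsum('ij,jk,ik->i', Xs, Zinv, Xs)
        p = np.clip(p, 0.0, None)  # numerical safety; p >= 0 with lam > 0
        p /= p.sum()
        S.pop(rng.choice(len(S), p=p))
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
print(sorted(regularized_volume_sample(X, k=10, lam=1.0, rng=rng)))
```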
On proportional volume sampling for experimental design in general spaces
Optimal design for linear regression is a fundamental task in statistics. For
finite design spaces, recent progress has shown that random designs drawn using
proportional volume sampling (PVS) lead to approximation guarantees for
A-optimal design. PVS strikes a balance between design nodes that jointly
fill the design space and nodes that marginally stay in regions of high mass
under the solution of a relaxed convex version of the original problem. In this
paper, we examine some of the statistical implications of a new variant of PVS
for (possibly Bayesian) optimal design. Using point process machinery, we treat
the case of a generic Polish design space. We show that not only are the
A-optimality approximation guarantees preserved, but we obtain similar
guarantees for D-optimal design that tighten recent results. Moreover, we show
that PVS can be sampled in polynomial time. Unfortunately, in spite of its
elegance and tractability, we demonstrate on a simple example that the
practical implications of general PVS are likely limited. In the second part of
the paper, we focus on applications and investigate the use of PVS as a
subroutine for stochastic search heuristics. We demonstrate that PVS is a
robust addition to the practitioner's toolbox, especially when the regression
functions are nonstandard and the design space, while low-dimensional, has a
complicated shape (e.g., nonlinear boundaries, several connected components).
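Concretely, on a finite ground set PVS draws a size-$k$ design $S$ with probability proportional to $\mu(S)\,\det\big(\sum_{i\in S}\mathbf{v}_i\mathbf{v}_i^\top\big)$, where $\mu$ is a product measure derived from the relaxed solution. The Python sketch below (our own brute-force illustration; the weights `w` stand in for a relaxed design) samples by enumeration, which is feasible only for tiny instances.

```python
import itertools
import numpy as np

def proportional_volume_sample(V, w, k, rng):
    """Brute-force PVS on a finite ground set: P(S) proportional to
    (prod_{i in S} w_i) * det(sum_{i in S} v_i v_i^T), |S| = k.
    Exponential in n; real PVS samplers run in polynomial time."""
    n = V.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    probs = np.array([
        np.prod(w[list(S)]) * np.linalg.det(V[list(S)].T @ V[list(S)])
        for S in subsets
    ])
    probs /= probs.sum()
    return subsets[rng.choice(len(subsets), p=probs)]

rng = np.random.default_rng(0)
V = rng.standard_normal((8, 3))      # 8 candidate design points in R^3
w = rng.dirichlet(np.ones(8)) * 8    # stand-in weights from a relaxed design
print(proportional_volume_sample(V, w, k=4, rng=rng))
```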
Towards Bursting Filter Bubble via Contextual Risks and Uncertainties
A rising topic in computational journalism is how to enhance the diversity in
news served to subscribers to foster exploration behavior in news reading.
Despite the success of preference learning in personalized news recommendation,
its over-exploitation causes a filter bubble that isolates readers from
opposing viewpoints and hurts long-term user experience through a lack of
serendipity. Since news providers cannot recommend opposing or diversified
opinions when the unpopularity of those articles is predicted with certainty,
they can only bet on articles whose click-through-rate forecasts involve
high variability (risks) or high estimation errors (uncertainties). We propose
a novel Bayesian model of uncertainty-aware scoring and ranking for news
articles. The Bayesian binary classifier models the probability of success
(defined as a news click) as a Beta-distributed random variable conditional on
a context vector (user features, article features, and other contextual
features). The posterior of the contextual coefficients can be computed
efficiently using a low-rank version of Laplace's method via thin Singular
Value Decomposition. Efficiency in the personalized targeting of exceptional
articles, which are chosen by each subscriber in the test period, is evaluated
on real-world news datasets. The proposed estimator slightly outperformed
existing training and scoring algorithms in terms of efficiency in identifying
successful outliers.

Comment: The filter bubble problem; Uncertainty-aware scoring; Empirical-Bayes
method; Low-rank Laplace's method
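The paper's Beta-distributed model is not reproduced here, but the following Python sketch conveys the general recipe with a swapped-in logistic likelihood and Thompson-style scoring: fit MAP coefficients, form the Laplace posterior from the Hessian, and rank items by a sampled rather than mean click probability. A full matrix inverse is used below; a low-rank variant would instead go through a thin SVD of the design matrix. All names and parameters are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_fit(X, y, alpha=1.0, iters=50):
    """MAP logistic regression by Newton's method, plus the Laplace
    approximation: posterior ~ N(w_map, H^{-1}), H the regularized
    Hessian at w_map. (Not the paper's Beta model; a low-rank version
    could diagonalize H via a thin SVD instead of a full inverse.)"""
    n, d = X.shape
    w = np.zeros(d)
    H = alpha * np.eye(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + alpha * w
        H = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w, np.linalg.inv(H)

def uncertainty_aware_scores(X_new, w_map, cov, rng):
    """Thompson-style scoring: draw one coefficient sample from the
    Laplace posterior, so items with uncertain CTR sometimes rank high."""
    w_sample = rng.multivariate_normal(w_map, cov)
    return sigmoid(X_new @ w_sample)

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))   # synthetic context features
y = (sigmoid(X @ rng.standard_normal(10)) > rng.random(400)).astype(float)
w_map, cov = laplace_fit(X, y)
print(uncertainty_aware_scores(X[:5], w_map, cov, rng))
```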
Determinantal Point Processes Implicitly Regularize Semi-parametric Regression Problems
Semi-parametric regression models are used in several applications which
require comprehensibility without sacrificing accuracy. Typical examples are
spline interpolation in geophysics, or non-linear time series problems, where
the system includes a linear and non-linear component. We discuss here the use
of a finite Determinantal Point Process (DPP) for approximating semi-parametric
models. Recently, Barthelmé, Tremblay, Usevich, and Amblard introduced a
novel representation of some finite DPPs. These authors formulated extended
L-ensembles, which can conveniently represent partial-projection DPPs, and suggested
their use for optimal interpolation. With the help of this formalism, we derive
a key identity illustrating the implicit regularization effect of determinantal
sampling for semi-parametric regression and interpolation. Also, a novel
projected Nyström approximation is defined and used to derive a bound on the
expected risk for the corresponding approximation of semi-parametric
regression. This work naturally extends similar results obtained for kernel
ridge regression.

Comment: 26 pages. Extended results. Typos corrected
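For background on the building block involved, here is a small self-contained Python sketch (our illustration of the plain Nyström method with determinantal landmarks, not the paper's projected variant): landmarks are drawn from a k-DPP over the kernel matrix by brute-force enumeration, then used to form the usual Nyström approximation.

```python
import itertools
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kdpp_sample(K, k, rng):
    """Brute-force k-DPP: P(S) proportional to det(K_S), |S| = k.
    Exponential in n; fine for illustration only."""
    n = K.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    probs = np.array([np.linalg.det(K[np.ix_(S, S)]) for S in subsets])
    probs /= probs.sum()
    return list(subsets[rng.choice(len(subsets), p=probs)])

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 2))
K = rbf_kernel(X, X)

S = kdpp_sample(K, k=4, rng=rng)
# Nystrom approximation: K ~ K[:, S] K[S, S]^{-1} K[S, :]
K_nys = K[:, S] @ np.linalg.solve(K[np.ix_(S, S)], K[S, :])
print(np.linalg.norm(K - K_nys) / np.linalg.norm(K))  # relative error
```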
Correcting the bias in least squares regression with volume-rescaled sampling
Consider linear regression where the examples are generated by an unknown
distribution on $\mathbb{R}^d$. Without any assumptions on the noise, the linear
least squares solution for any i.i.d. sample will typically be biased w.r.t.
the least squares optimum over the entire distribution. However, we show that
if an i.i.d. sample of any size $k$ is augmented by a certain small additional
sample, then the solution of the combined sample becomes unbiased. We show this
when the additional sample consists of $d$ points drawn jointly according to the
input distribution that is rescaled by the squared volume spanned by the
points. Furthermore, we propose algorithms to sample from this volume-rescaled
distribution when the data distribution is only known through an i.i.d. sample.
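The simplest case to check numerically is the pure $d$-point volume-rescaled sample with no i.i.d. part, where unbiasedness already holds. In the Python sketch below (our own brute-force illustration, with a small finite pool standing in for the unknown distribution), subsets are drawn with probability proportional to the squared spanned volume $\det(\mathbf{X}_S)^2$, and the expected subset solution matches the least squares optimum exactly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                     # arbitrary noisy labels

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # optimum over the whole pool

# Volume-rescaled sampling of exactly d points:
# P(S) proportional to det(X_S)^2 over all size-d subsets S.
subsets = list(itertools.combinations(range(n), d))
probs = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
probs /= probs.sum()

# The expectation over S of the d-point solution equals w_star exactly.
w_bar = sum(p * np.linalg.solve(X[list(S)], y[list(S)])
            for p, S in zip(probs, subsets))
print(np.allclose(w_bar, w_star))              # True
```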
Determinantal Point Processes in Randomized Numerical Linear Algebra
Randomized Numerical Linear Algebra (RandNLA) uses randomness to develop
improved algorithms for matrix problems that arise in scientific computing,
data science, machine learning, etc. Determinantal Point Processes (DPPs), a
seemingly unrelated topic in pure and applied mathematics, are a class of
stochastic point processes with probability distributions characterized by
sub-determinants of a kernel matrix. Recent work has uncovered deep and
fruitful connections between DPPs and RandNLA which lead to new guarantees and
improved algorithms that are of interest to both areas. We provide an overview
of this exciting new line of research, including brief introductions to RandNLA
and DPPs, as well as applications of DPPs to classical linear algebra tasks
such as least squares regression, low-rank approximation and the Nyström
method. For example, random sampling with a DPP leads to new kinds of unbiased
estimators for least squares, enabling more refined statistical and inferential
understanding of these algorithms; a DPP is, in some sense, an optimal
randomized algorithm for the Nyström method; and a RandNLA technique called
leverage score sampling can be derived as the marginal distribution of a DPP.
We also discuss recent algorithmic developments, illustrating that, while not
quite as efficient as standard RandNLA techniques, DPP-based algorithms are
only moderately more expensive.
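The last connection is easy to verify numerically: size-$d$ volume sampling, $P(S) \propto \det(\mathbf{X}_S)^2$, is a projection DPP whose marginals $P(i \in S)$ are exactly the leverage scores. A brute-force Python check (our own illustration) follows.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.standard_normal((n, d))

# Leverage scores: diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Size-d volume sampling, P(S) proportional to det(X_S)^2, is the
# projection DPP with kernel H; its marginals are the leverage scores.
subsets = list(itertools.combinations(range(n), d))
probs = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
probs /= probs.sum()

marginals = np.zeros(n)
for p, S in zip(probs, subsets):
    marginals[list(S)] += p

print(np.allclose(marginals, leverage))  # True
```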
Exact expressions for double descent and implicit regularization via surrogate random design
Double descent refers to the phase transition that is exhibited by the
generalization error of unregularized learning models when varying the ratio
between the number of parameters and the number of training samples. The recent
success of highly over-parameterized machine learning models such as deep
neural networks has motivated a theoretical analysis of the double descent
phenomenon in classical models such as linear regression which can also
generalize well in the over-parameterized regime. We provide the first exact
non-asymptotic expressions for double descent of the minimum norm linear
estimator. Our approach involves constructing a special determinantal point
process, which we call surrogate random design, to replace the standard i.i.d.
design of the training sample. This surrogate design admits exact expressions
for the mean squared error of the estimator while preserving the key properties
of the standard design. We also establish an exact implicit regularization
result for over-parameterized training samples. In particular, we show that,
for the surrogate design, the implicit bias of the unregularized minimum norm
estimator precisely corresponds to solving a ridge-regularized least squares
problem on the population distribution. In our analysis we introduce a new
mathematical tool of independent interest: the class of random matrices for
which determinant commutes with expectation.

Comment: Minor typo corrections and clarifications; moved the proofs into the
appendix
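While the surrogate design yields closed-form expressions, the qualitative shape of double descent is easy to reproduce with an ordinary i.i.d. Gaussian design. The Python sketch below (our own illustration; all constants are arbitrary) tracks the test error of the minimum norm least squares estimator as the sample size $n$ crosses the parameter count $d$, where the error peaks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test, trials = 30, 1000, 200
w_true = rng.standard_normal(d) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d))
y_test = X_test @ w_true

for n in [10, 20, 25, 29, 30, 31, 35, 45, 60, 120]:
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ w_true + 0.5 * rng.standard_normal(n)
        w = np.linalg.pinv(X) @ y        # minimum norm least squares
        errs.append(np.mean((X_test @ w - y_test) ** 2))
    print(f"n={n:4d}  test MSE={np.mean(errs):.3f}")  # peaks near n = d
```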
Leveraged volume sampling for linear regression
Suppose an $n \times d$ design matrix in a linear regression problem is
given, but the response for each point is hidden unless explicitly requested.
The goal is to sample only a small number $k$ of the responses, and then
produce a weight vector whose sum of squares loss over all points is at most
$1+\epsilon$ times the minimum. When $k$ is very small (e.g., $k = d$), jointly
sampling diverse subsets of points is crucial. One such method called volume
sampling has a unique and desirable property that the weight vector it produces
is an unbiased estimate of the optimum. It is therefore natural to ask if this
method offers the optimal unbiased estimate in terms of the number of responses
needed to achieve a $1+\epsilon$ loss approximation.
Surprisingly, we show that volume sampling can have poor behavior when we
require a very accurate approximation -- indeed worse than some i.i.d. sampling
techniques whose estimates are biased, such as leverage score sampling. We then
develop a new rescaled variant of volume sampling that produces an unbiased
estimate which avoids this bad behavior and has at least as good a tail bound
as leverage score sampling: sample size $k = O(d \log d + d/\epsilon)$ suffices to
guarantee total loss at most $1+\epsilon$ times the minimum with high
probability. Thus, we improve on the best previously known sample size for an
unbiased estimator, $k = O(d^2/\epsilon)$.
Our rescaling procedure leads to a new efficient algorithm for volume
sampling which is based on a determinantal rejection sampling technique with
potentially broader applications to determinantal point processes. Other
contributions include introducing the combinatorics needed for rescaled volume
sampling and developing tail bounds for sums of dependent random matrices which
arise in the process.
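To give a flavor of the determinantal rejection sampling step (a simplified Python sketch based on our reading; the full algorithm follows the accepted draw with a further volume sampling pass, omitted here): after whitening so that $\mathbf{X}^\top\mathbf{X} = \mathbf{I}$, draw $k$ indices i.i.d. from the leverage score distribution $q$ and accept with probability $\det\big(\frac{1}{k}\sum_{i\in S} q_i^{-1}\mathbf{x}_i\mathbf{x}_i^\top\big)$, which is at most one since the matrix has trace $d$.

```python
import numpy as np

def determinantal_rejection_sample(X, k, rng, max_tries=1000):
    """Draw k indices i.i.d. from the leverage score distribution and
    accept with probability det((1/k) sum_i x_i x_i^T / q_i), computed
    in a whitened basis where X^T X = I; AM-GM makes this at most 1."""
    Q, _ = np.linalg.qr(X)               # whitened rows: Q^T Q = I
    lev = (Q ** 2).sum(axis=1)           # leverage scores, sum to d
    q = lev / lev.sum()                  # i.i.d. sampling distribution
    d = X.shape[1]
    for _ in range(max_tries):
        S = rng.choice(len(q), size=k, p=q)   # with replacement (i.i.d.)
        M = (Q[S].T / q[S]) @ Q[S] / k        # (1/k) sum x_i x_i^T / q_i
        # tr(M) = d, so det(M) <= (tr(M)/d)^d = 1: a valid accept prob.
        if rng.random() < np.linalg.det(M):
            return S
    raise RuntimeError("no sample accepted")

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
print(determinantal_rejection_sample(X, k=40, rng=rng))
```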