
    Bayesian experimental design using regularized determinantal point processes

    In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task. Many statistical criteria have been proposed for choosing the optimal design, with popular choices including A- and D-optimality. If prior knowledge is given, typically in the form of a $d\times d$ precision matrix $\mathbf A$, then all of the criteria can be extended to incorporate that information via a Bayesian framework. In this paper, we demonstrate a new fundamental connection between Bayesian experimental design and determinantal point processes, the latter being widely used for sampling diverse subsets of data. We use this connection to develop new efficient algorithms for finding $(1+\epsilon)$-approximations of optimal designs under four optimality criteria: A, C, D and V. Our algorithms can achieve this when the desired subset size $k$ is $\Omega(\frac{d_{\mathbf A}}{\epsilon} + \frac{\log 1/\epsilon}{\epsilon^2})$, where $d_{\mathbf A}\leq d$ is the $\mathbf A$-effective dimension, which can often be much smaller than $d$. Our results offer direct improvements over a number of prior works, for both Bayesian and classical experimental design, in terms of algorithm efficiency, approximation quality, and range of applicable criteria.
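
    As a concrete reference for the quantities in this abstract, the short sketch below evaluates the Bayesian A-optimality objective tr((X_S^T X_S + A)^{-1}) and the A-effective dimension d_A for a toy design; the matrices X and A and the subset S are illustrative placeholders, and this is a definitional sketch rather than the paper's sampling algorithm.

```python
# Hedged sketch: Bayesian A-optimality value and A-effective dimension for a
# candidate design subset, using the standard definitions only. X, A, and S
# below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))          # n candidate measurement vectors
A = np.eye(d)                            # prior precision matrix (d x d)

def bayesian_a_value(X, A, S):
    """A-optimality objective tr((X_S^T X_S + A)^{-1}) for subset S."""
    XS = X[list(S)]
    return np.trace(np.linalg.inv(XS.T @ XS + A))

def effective_dimension(X, A):
    """d_A = tr(X^T X (X^T X + A)^{-1}) <= d, often much smaller than d."""
    G = X.T @ X
    return np.trace(G @ np.linalg.inv(G + A))

S = rng.choice(n, size=30, replace=False)
print(bayesian_a_value(X, A, S), effective_dimension(X, A), d)
```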

    Diversity in Machine Learning

    Machine learning methods have achieved good performance and been widely applied in various real-world applications. They can learn the model adaptively and be better fit to the special requirements of different tasks. Generally, a good machine learning system is composed of plentiful training data, a good model training process, and accurate inference. Many factors can affect the performance of the machine learning process, among which diversity is an important one. Diversity helps each stage of the pipeline: diversity of the training data ensures that the data provide more discriminative information for the model; diversity of the learned model (diversity in the parameters of each model, or diversity among different base models) makes each parameter/model capture unique or complementary information; and diversity in inference provides multiple choices, each of which corresponds to a specific plausible locally optimal result. Even though diversity plays an important role in the machine learning process, there is no systematic analysis of diversification in machine learning systems. In this paper, we systematically summarize methods for data diversification, model diversification, and inference diversification in the machine learning process. In addition, we survey typical applications where diversity techniques improve machine learning performance, including remote sensing imaging tasks, machine translation, camera relocalization, image segmentation, object detection, topic modeling, and others. Finally, we discuss some challenges of diversity techniques in machine learning and point out directions for future work. (Comment: Accepted by IEEE Access)

    Subsampling for Ridge Regression via Regularized Volume Sampling

    Given $n$ vectors $\mathbf{x}_i\in \mathbb{R}^d$, we want to fit a linear regression model for noisy labels $y_i\in\mathbb{R}$. The ridge estimator is a classical solution to this problem. However, when labels are expensive, we are forced to select only a small subset of vectors $\mathbf{x}_i$ for which we obtain the labels $y_i$. We propose a new procedure for selecting the subset of vectors, such that the ridge estimator obtained from that subset offers strong statistical guarantees in terms of the mean squared prediction error over the entire dataset of $n$ labeled vectors. The number of labels needed is proportional to the statistical dimension of the problem, which is often much smaller than $d$. Our method is an extension of a joint subsampling procedure called volume sampling. A second major contribution is that we speed up volume sampling so that it is essentially as efficient as leverage scores, which is the main i.i.d. subsampling procedure for this task. Finally, we show theoretically and experimentally that volume sampling has a clear advantage over any i.i.d. sampling when labels are expensive.
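
    A minimal brute-force sketch of the subset-selection idea described above, assuming the regularized volume sampling distribution weights a size-k subset S proportionally to det(X_S^T X_S + λI): it enumerates all subsets of a tiny toy problem, draws one, and fits the ridge estimator on the selected labels. The sizes and λ are placeholders, and the paper's fast sampler is not reproduced here.

```python
# Hedged sketch: brute-force "regularized volume sampling" on a toy problem,
# drawing a subset S with probability proportional to det(X_S^T X_S + lam*I)
# and fitting the ridge estimator on the selected labels. Enumeration is only
# feasible for tiny n; this illustrates the distribution, not a fast algorithm.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, k, lam = 12, 3, 5, 1.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

subsets = list(combinations(range(n), k))
weights = np.array([np.linalg.det(X[list(S)].T @ X[list(S)] + lam * np.eye(d))
                    for S in subsets])
S = subsets[rng.choice(len(subsets), p=weights / weights.sum())]

XS, yS = X[list(S)], y[list(S)]
ridge = np.linalg.solve(XS.T @ XS + lam * np.eye(d), XS.T @ yS)
print("selected subset:", S, "ridge estimate:", ridge)
```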

    On proportional volume sampling for experimental design in general spaces

    Optimal design for linear regression is a fundamental task in statistics. For finite design spaces, recent progress has shown that random designs drawn using proportional volume sampling (PVS) lead to approximation guarantees for A-optimal design. PVS strikes a balance between choosing design nodes that jointly fill the design space and marginally staying in regions of high mass under the solution of a relaxed convex version of the original problem. In this paper, we examine some of the statistical implications of a new variant of PVS for (possibly Bayesian) optimal design. Using point process machinery, we treat the case of a generic Polish design space. We show that not only are the A-optimality approximation guarantees preserved, but we obtain similar guarantees for D-optimal design that tighten recent results. Moreover, we show that PVS can be sampled in polynomial time. Unfortunately, in spite of its elegance and tractability, we demonstrate on a simple example that the practical implications of general PVS are likely limited. In the second part of the paper, we focus on applications and investigate the use of PVS as a subroutine for stochastic search heuristics. We demonstrate that PVS is a robust addition to the practitioner's toolbox, especially when the regression functions are nonstandard and the design space, while low-dimensional, has a complicated shape (e.g., nonlinear boundaries, several connected components).
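
    A hedged, brute-force illustration of the PVS weighting on a finite toy design space, where a size-k subset S is drawn with probability proportional to (prod_{i in S} pi_i) * det(X_S^T X_S); the marginal weights pi below are random placeholders standing in for the relaxed convex-design solution, which this sketch does not compute.

```python
# Hedged sketch: brute-force proportional volume sampling (PVS) on a finite
# toy design space, P(S) proportional to (prod_{i in S} pi_i) * det(X_S^T X_S).
# The weights pi are placeholders, not the relaxed convex-design solution.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, d, k = 10, 2, 4
X = rng.standard_normal((n, d))
pi = rng.dirichlet(np.ones(n)) * k          # placeholder marginal weights (sum to k)

subsets = list(combinations(range(n), k))
w = np.array([np.prod(pi[list(S)]) * np.linalg.det(X[list(S)].T @ X[list(S)])
              for S in subsets])
S = subsets[rng.choice(len(subsets), p=w / w.sum())]
print("PVS draw:", S)
```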

    Towards Bursting Filter Bubble via Contextual Risks and Uncertainties

    A rising topic in computational journalism is how to enhance the diversity of news served to subscribers in order to foster exploratory behavior in news reading. Despite the success of preference learning in personalized news recommendation, its over-exploitation causes a filter bubble that isolates readers from opposing viewpoints and hurts long-term user experience through a lack of serendipity. Since news providers cannot recommend either opposing or diversified opinions if the unpopularity of these articles is confidently predicted, they can only bet on articles whose click-through rate forecasts involve high variability (risks) or high estimation errors (uncertainties). We propose a novel Bayesian model of uncertainty-aware scoring and ranking for news articles. The Bayesian binary classifier models the probability of success (defined as a news click) as a Beta-distributed random variable conditional on a vector of context (user features, article features, and other contextual features). The posterior of the contextual coefficients can be computed efficiently using a low-rank version of Laplace's method via thin singular value decomposition. Efficiency in the personalized targeting of exceptional articles, which are chosen by each subscriber in the test period, is evaluated on real-world news datasets. The proposed estimator slightly outperformed existing training and scoring algorithms in terms of efficiency in identifying successful outliers. (Comment: The filter bubble problem; uncertainty-aware scoring; empirical-Bayes method; low-rank Laplace's method)
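
    The sketch below shows the general flavor of uncertainty-aware scoring with a Laplace-approximated Bayesian click model, ranking candidate articles by an optimistic score mean + beta * std; it is a generic stand-in (logistic likelihood, full-rank Laplace posterior) rather than the paper's Beta/thin-SVD estimator, and X, clicks, and beta are illustrative.

```python
# Hedged sketch: uncertainty-aware scoring via a Laplace approximation to a
# Bayesian logistic (click) model, ranking articles by mean + beta * std.
# Generic stand-in only; data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 8
X = rng.standard_normal((n, d))                   # context vectors
w_true = rng.standard_normal(d)
clicks = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))

# MAP fit by Newton's method with a standard-normal prior on the coefficients.
w = np.zeros(d)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (clicks - p) - w
    H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d)
    w += np.linalg.solve(H, grad)

cov = np.linalg.inv(H)                            # Laplace posterior covariance

def optimistic_score(x, beta=1.0):
    mean = x @ w
    std = np.sqrt(x @ cov @ x)
    return mean + beta * std                      # favors risky/uncertain articles

candidates = rng.standard_normal((20, d))
ranking = np.argsort([-optimistic_score(x) for x in candidates])
print("top articles:", ranking[:5])
```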

    Determinantal Point Processes Implicitly Regularize Semi-parametric Regression Problems

    Semi-parametric regression models are used in several applications which require comprehensibility without sacrificing accuracy. Typical examples are spline interpolation in geophysics, or non-linear time series problems, where the system includes a linear and a non-linear component. We discuss here the use of a finite Determinantal Point Process (DPP) for approximating semi-parametric models. Recently, Barthelmé, Tremblay, Usevich, and Amblard introduced a novel representation of some finite DPPs. These authors formulated extended L-ensembles that can conveniently represent partial-projection DPPs, and they suggest their use for optimal interpolation. With the help of this formalism, we derive a key identity illustrating the implicit regularization effect of determinantal sampling for semi-parametric regression and interpolation. Also, a novel projected Nyström approximation is defined and used to derive a bound on the expected risk for the corresponding approximation of semi-parametric regression. This work naturally extends similar results obtained for kernel ridge regression. (Comment: 26 pages. Extended results. Typos corrected)
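
    For orientation only, the following is a standard Nyström approximation of a kernel matrix from a sampled landmark subset, K ≈ K[:,S] K[S,S]^+ K[S,:]; the paper's projected Nyström variant and its DPP-based analysis are not reproduced here, and the kernel, bandwidth, and subset are placeholders.

```python
# Hedged sketch: plain Nystrom approximation of a kernel matrix from a landmark
# subset S. This is the standard construction only, not the paper's variant.
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-1, 1, 80))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)    # Gaussian kernel

S = rng.choice(len(x), size=15, replace=False)                # landmark subset
K_nys = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]
print("Nystrom error (Frobenius):", np.linalg.norm(K - K_nys))
```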

    Correcting the bias in least squares regression with volume-rescaled sampling

    Consider linear regression where the examples are generated by an unknown distribution on $\mathbb{R}^d\times \mathbb{R}$. Without any assumptions on the noise, the linear least squares solution for any i.i.d. sample will typically be biased w.r.t. the least squares optimum over the entire distribution. However, we show that if an i.i.d. sample of any size $k$ is augmented by a certain small additional sample, then the solution of the combined sample becomes unbiased. We show this when the additional sample consists of $d$ points drawn jointly according to the input distribution that is rescaled by the squared volume spanned by the points. Furthermore, we propose algorithms to sample from this volume-rescaled distribution when the data distribution is only known through an i.i.d. sample.
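
    A small Monte Carlo sanity check of the augmentation idea, under the simplifying assumption that the data distribution is uniform over N fixed points: each trial combines an i.i.d. sample of size k with d points drawn with probability proportional to the squared volume det(X_T X_T^T), and the average of the resulting least squares estimators is compared with the population optimum. All sizes and the noise model are illustrative.

```python
# Hedged sketch: Monte Carlo check of volume-rescaled augmentation on a toy
# "distribution" that is uniform over N fixed labeled points.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N, d, k = 20, 2, 3
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0]) + rng.standard_normal(N)        # arbitrary noise

w_pop = np.linalg.lstsq(X, y, rcond=None)[0]                  # population optimum

pairs = list(combinations(range(N), d))
# det(X_T X_T^T) = det(X_T)^2 is the squared volume spanned by the d points.
sqvol = np.array([np.linalg.det(X[list(T)] @ X[list(T)].T) for T in pairs])
sqvol /= sqvol.sum()

est, trials = np.zeros(d), 4000
for _ in range(trials):
    iid = rng.integers(0, N, size=k)                          # i.i.d. part
    T = pairs[rng.choice(len(pairs), p=sqvol)]                # volume-rescaled part
    idx = np.concatenate([iid, np.array(T)])
    est += np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] / trials

print("population LS:", w_pop, "average augmented-sample LS:", est)
```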

    Determinantal Point Processes in Randomized Numerical Linear Algebra

    Randomized Numerical Linear Algebra (RandNLA) uses randomness to develop improved algorithms for matrix problems that arise in scientific computing, data science, machine learning, etc. Determinantal Point Processes (DPPs), a seemingly unrelated topic in pure and applied mathematics, are a class of stochastic point processes with probability distributions characterized by sub-determinants of a kernel matrix. Recent work has uncovered deep and fruitful connections between DPPs and RandNLA which lead to new guarantees and improved algorithms that are of interest to both areas. We provide an overview of this exciting new line of research, including brief introductions to RandNLA and DPPs, as well as applications of DPPs to classical linear algebra tasks such as least squares regression, low-rank approximation, and the Nyström method. For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nyström method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP. We also discuss recent algorithmic developments, illustrating that, while not quite as efficient as standard RandNLA techniques, DPP-based algorithms are only moderately more expensive.
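
    One of the connections mentioned above can be checked in a few lines: for the projection DPP with marginal kernel P = X(X^T X)^{-1}X^T, the probability that row i is included equals its leverage score, i.e., the i-th diagonal entry of P. The sketch below verifies this identity numerically on random data (sizes are arbitrary).

```python
# Hedged sketch: leverage scores coincide with the marginal inclusion
# probabilities (diagonal of the kernel) of the projection DPP.
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 5
X = rng.standard_normal((n, d))

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column span
Q, _ = np.linalg.qr(X)
leverage = (Q ** 2).sum(axis=1)               # standard leverage scores
print(np.allclose(np.diag(P), leverage))      # True: DPP marginals = leverage scores
```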

    Exact expressions for double descent and implicit regularization via surrogate random design

    Double descent refers to the phase transition that is exhibited by the generalization error of unregularized learning models when varying the ratio between the number of parameters and the number of training samples. The recent success of highly over-parameterized machine learning models such as deep neural networks has motivated a theoretical analysis of the double descent phenomenon in classical models such as linear regression, which can also generalize well in the over-parameterized regime. We provide the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator. Our approach involves constructing a special determinantal point process which we call surrogate random design, to replace the standard i.i.d. design of the training sample. This surrogate design admits exact expressions for the mean squared error of the estimator while preserving the key properties of the standard design. We also establish an exact implicit regularization result for over-parameterized training samples. In particular, we show that, for the surrogate design, the implicit bias of the unregularized minimum norm estimator precisely corresponds to solving a ridge-regularized least squares problem on the population distribution. In our analysis we introduce a new mathematical tool of independent interest: the class of random matrices for which determinant commutes with expectation. (Comment: Minor typo corrections and clarifications; moved the proofs into the appendix)
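
    A quick numerical illustration of the double descent phenomenon for the minimum-norm least squares estimator (computed via the pseudoinverse), sweeping the training sample size across the interpolation threshold n = d; this reproduces only the qualitative curve and does not implement the surrogate design or the paper's exact expressions. The dimensions and noise level are arbitrary.

```python
# Hedged sketch: empirical double-descent curve for the minimum-norm least
# squares estimator; test error typically peaks near the interpolation
# threshold n = d. Purely illustrative.
import numpy as np

rng = np.random.default_rng(7)
d, n_test = 40, 2000
w_true = rng.standard_normal(d) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d))
y_test = X_test @ w_true

for n in [10, 20, 30, 38, 40, 42, 60, 100, 200]:
    errs = []
    for _ in range(50):
        X = rng.standard_normal((n, d))
        y = X @ w_true + 0.1 * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X) @ y                  # minimum-norm solution
        errs.append(np.mean((X_test @ w_hat - y_test) ** 2))
    print(f"n={n:4d}  test MSE={np.mean(errs):.3f}")
```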

    Leveraged volume sampling for linear regression

    Suppose an $n \times d$ design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number $k \ll n$ of the responses, and then produce a weight vector whose sum of squares loss over all points is at most $1+\epsilon$ times the minimum. When $k$ is very small (e.g., $k=d$), jointly sampling diverse subsets of points is crucial. One such method called volume sampling has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses $k$ needed to achieve a $1+\epsilon$ loss approximation. Surprisingly we show that volume sampling can have poor behavior when we require a very accurate approximation -- indeed worse than some i.i.d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability. Thus, we improve on the best previously known sample size for an unbiased estimator, $k=O(d^2/\epsilon)$. Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a determinantal rejection sampling technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.
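
    For comparison, the sketch below implements the i.i.d. leverage score sampling baseline discussed in this abstract, with the usual importance reweighting, and reports the ratio of the subsampled solution's total loss to the minimum; the paper's rescaled volume sampling and determinantal rejection sampling are not reproduced here, and the sizes are illustrative.

```python
# Hedged sketch: i.i.d. leverage score sampling with importance reweighting,
# solving a weighted least squares problem on k sampled responses.
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 2000, 10, 120
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

Q, _ = np.linalg.qr(X)
lev = (Q ** 2).sum(axis=1)                         # leverage scores, sum to d
p = lev / lev.sum()                                # sampling probabilities

idx = rng.choice(n, size=k, replace=True, p=p)
w = 1.0 / np.sqrt(k * p[idx])                      # importance weights
w_hat = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)[0]

full = np.linalg.lstsq(X, y, rcond=None)[0]
print("loss ratio:", np.sum((X @ w_hat - y) ** 2) / np.sum((X @ full - y) ** 2))
```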