Distributed Feature Screening via Componentwise Debiasing
Feature screening is a powerful tool in the analysis of high-dimensional
data. When the sample size and the number of features are both large,
the implementation of classic screening methods can be numerically challenging.
In this paper, we propose a distributed screening framework for the big-data setup.
In the spirit of "divide-and-conquer", the proposed framework expresses a
correlation measure as a function of several component parameters, each of
which can be distributively estimated using a natural U-statistic from data
segments. With the component estimates aggregated, we obtain a final
correlation estimate that can be readily used for screening features. This
framework enables distributed storage and parallel computing and thus is
computationally attractive. Because the component parameters are estimated
distributively without bias, the final aggregated estimate achieves high
accuracy that is insensitive to the number of data segments, whether that
number is dictated by the problem itself or chosen by the user. Under mild
conditions, we show that the aggregated correlation estimator is as efficient
as the classic centralized estimator in terms of the probability convergence
bound, and that the corresponding screening procedure enjoys the sure
screening property for a wide range of correlation measures. The promising
performance of the new method is supported by extensive numerical examples.
Comment: 28 pages, 2 figures, 4 tables
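To make the screening pipeline concrete, here is a minimal sketch assuming Pearson correlation as the screening measure: each segment contributes unbiased moment estimates that are averaged before being plugged into the correlation formula. The names segment_moments, distributed_corr, and screen are illustrative, not the authors' implementation.

```python
import numpy as np

def segment_moments(x, y):
    # Unbiased component estimates: sample means of E[X], E[Y], E[XY],
    # E[X^2], E[Y^2] computed on one data segment.
    return np.array([x.mean(), y.mean(), (x * y).mean(),
                     (x ** 2).mean(), (y ** 2).mean()])

def distributed_corr(x, y, n_segments):
    # Each segment estimates the component parameters without bias;
    # aggregation is a simple average, so segments can live on different
    # machines and be processed in parallel.
    xs = np.array_split(x, n_segments)
    ys = np.array_split(y, n_segments)
    m = np.mean([segment_moments(xs[k], ys[k])
                 for k in range(n_segments)], axis=0)
    ex, ey, exy, ex2, ey2 = m
    # Express the correlation as a function of the aggregated components.
    return (exy - ex * ey) / np.sqrt((ex2 - ex ** 2) * (ey2 - ey ** 2))

def screen(X, y, n_segments, top_k):
    # Rank features by the absolute aggregated correlation estimate and
    # keep the top_k, as in marginal correlation screening.
    scores = np.array([abs(distributed_corr(X[:, j], y, n_segments))
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]
```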
Randomized maximum-contrast selection: subagging for large-scale regression
We introduce a very general method for sparse and large-scale variable
selection. The large-scale regression setting is such that both the number of
parameters and the number of samples are extremely large. The proposed method
is based on careful combination of penalized estimators, each applied to a
random projection of the sample space into a low-dimensional space. In one
special case that we study in detail, the random projections are divided into
non-overlapping blocks, each consisting of only a small portion of the original
data. Within each block we select the projection yielding the smallest
out-of-sample error. Our random ensemble estimator then aggregates the results
according to a new maximum-contrast voting scheme to determine the final selected
set. Our theoretical results illuminate how performance is affected by
increasing the number of non-overlapping blocks. Moreover, we demonstrate that statistical
optimality is retained along with the computational speedup. The proposed
method achieves minimax rates for approximate recovery over all estimators
using the full set of samples. Furthermore, our theoretical results allow the
number of subsamples to grow with the subsample size and do not require the
irrepresentable condition. The estimator is also compared empirically with
several other popular high-dimensional estimators via an extensive simulation
study, which reveals its excellent finite-sample performance.
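A hedged sketch of the block-wise selection-and-voting idea follows, assuming the lasso as the penalized estimator, non-overlapping sample blocks, a held-out split inside each block to pick the best penalty, and a plain vote-count threshold standing in for the paper's maximum-contrast voting rule; block_support and subagging_select are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Lasso

def block_support(X, y, alphas, rng):
    # Within a block, hold out half the data and keep the penalized fit
    # with the smallest out-of-sample error.
    n = X.shape[0]
    idx = rng.permutation(n)
    tr, te = idx[: n // 2], idx[n // 2:]
    best_err, best_coef = np.inf, None
    for a in alphas:
        model = Lasso(alpha=a).fit(X[tr], y[tr])
        err = np.mean((y[te] - model.predict(X[te])) ** 2)
        if err < best_err:
            best_err, best_coef = err, model.coef_
    return best_coef != 0  # selected support of the winning fit

def subagging_select(X, y, n_blocks, alphas, vote_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    votes = np.zeros(X.shape[1])
    # Each non-overlapping block sees only a small portion of the data,
    # so the blocks can be fit in parallel.
    for block in np.array_split(idx, n_blocks):
        votes += block_support(X[block], y[block], alphas, rng)
    # Simple vote threshold; the paper's maximum-contrast rule is a more
    # refined way of setting this cutoff.
    return np.flatnonzero(votes >= vote_frac * n_blocks)
```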
Descent-to-Delete: Gradient-Based Methods for Machine Unlearning
We study the data deletion problem for convex models. By leveraging
techniques from convex optimization and reservoir sampling, we give the first
data deletion algorithms that are able to handle an arbitrarily long sequence
of adversarial updates while promising both per-deletion run-time and
steady-state error that do not grow with the length of the update sequence. We
also introduce several new conceptual distinctions: for example, we can ask
that after a deletion, the entire state maintained by the optimization
algorithm is statistically indistinguishable from the state that would have
resulted had we retrained, or we can ask for the weaker condition that only the
observable output is statistically indistinguishable from the observable output
that would have resulted from retraining. We are able to give more efficient
deletion algorithms under this weaker deletion criterion.
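As a rough illustration of the weaker criterion, the sketch below maintains a ridge regression model, re-optimizes with a fixed budget of full-gradient steps after each deletion, and releases only Gaussian-perturbed parameters; the step count and noise scale are illustrative knobs, not the calibrated values from the paper.

```python
import numpy as np

def grad(theta, X, y, lam):
    # Gradient of the ridge objective (1/n)||X@theta - y||^2 + lam*||theta||^2.
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - y) + 2.0 * lam * theta

class UnlearnableRidge:
    def __init__(self, X, y, lam=0.1, lr=0.1, steps=50, sigma=0.01, seed=0):
        self.X, self.y = X.copy(), y.copy()
        self.lam, self.lr, self.steps, self.sigma = lam, lr, steps, sigma
        self.rng = np.random.default_rng(seed)
        self.theta = np.zeros(X.shape[1])
        self._descend()

    def _descend(self):
        # A fixed budget of gradient steps from the current iterate, so the
        # per-deletion cost does not grow with the length of the sequence.
        for _ in range(self.steps):
            self.theta -= self.lr * grad(self.theta, self.X, self.y, self.lam)

    def delete(self, i):
        # Drop point i, then re-optimize briefly from the warm start.
        self.X = np.delete(self.X, i, axis=0)
        self.y = np.delete(self.y, i)
        self._descend()

    def publish(self):
        # Only a noisy version of the parameters is ever released, so the
        # observable output can be made indistinguishable from retraining.
        return self.theta + self.rng.normal(0.0, self.sigma, self.theta.shape)
```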