Efficient Data Representation by Selecting Prototypes with Importance Weights
Prototypical examples that best summarize and compactly represent an
underlying complex data distribution communicate meaningful insights to humans
in domains where simple explanations are hard to extract. In this paper we
present algorithms with strong theoretical guarantees to mine these data sets
and select prototypes, a.k.a. representatives, that optimally describe them. Our
work notably generalizes the recent work by Kim et al. (2016): in addition
to selecting prototypes, we also associate non-negative weights that indicate
each prototype's importance. This extension provides a single coherent
framework under which both prototypes and criticisms (i.e. outliers) can be
found. Furthermore, our framework works for any symmetric positive definite
kernel, thus addressing one of the key open questions laid out in Kim et al.
(2016). By establishing that our objective function enjoys the key property of
weak submodularity, we present a fast ProtoDash algorithm and derive
approximation guarantees for it. We demonstrate the efficacy of
our method on diverse domains such as retail, digit recognition (MNIST), and 40
publicly available health questionnaires obtained from the Centers for
Disease Control (CDC) website maintained by the US Dept. of Health. We validate
the results quantitatively as well as qualitatively based on expert feedback
and recently published scientific studies on public health, thus showcasing the
power of our technique in providing actionability (for retail), utility (for
MNIST), and insight (on CDC datasets), which arguably are the hallmarks of an
effective data mining method.
Comment: Accepted for publication in the International Conference on Data Mining
(ICDM) 201
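To make the weighted-prototype idea concrete, here is a minimal sketch of a greedy selection loop in the spirit described above: candidates are scored by how much they would improve a kernel-based match to the full data distribution, and the non-negative weights of the selected set are re-fit after every pick. The RBF kernel, the NNLS reformulation of the weight fit, and all function names are illustrative assumptions, not the authors' released ProtoDash implementation.

```python
import numpy as np
from scipy.optimize import nnls

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF similarities between rows of A and rows of B.
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def greedy_weighted_prototypes(X, m, gamma=1.0):
    """Greedily pick m prototypes with non-negative importance weights so the
    weighted set approximates the mean kernel embedding of the data (sketch)."""
    K = rbf_kernel(X, X, gamma)      # kernel among all candidate points
    mu = K.mean(axis=1)              # average similarity of each point to the data
    selected, w = [], np.zeros(0)
    for _ in range(m):
        # Gradient of the objective w.r.t. each currently unselected weight.
        grad = mu.copy()
        if selected:
            grad -= K[:, selected] @ w
        grad[selected] = -np.inf     # never re-pick a prototype
        selected.append(int(np.argmax(grad)))
        # Re-fit weights over the selected set:
        #   min_w 0.5 w^T K_SS w - mu_S^T w,  w >= 0,  via an NNLS reformulation.
        K_SS = K[np.ix_(selected, selected)] + 1e-8 * np.eye(len(selected))
        L = np.linalg.cholesky(K_SS)
        w, _ = nnls(L.T, np.linalg.solve(L, mu[selected]))
    return selected, w
```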
Data drift correction via time-varying importance weight estimator
Real-world deployment of machine learning models is challenging when data
evolves over time. And data does evolve over time. While no model can work when
data evolves in an arbitrary fashion, if there is some pattern to these
changes, we might be able to design methods to address them. This paper
addresses situations in which data evolves gradually. We introduce a novel
time-varying
importance weight estimator that can detect gradual shifts in the distribution
of data. Such an importance weight estimator allows the training method to
selectively sample past data -- not just data that is similar to the present,
as a standard importance weight estimator would, but also data that evolved in
a similar fashion in the past. Our time-varying importance weight is quite
general. We demonstrate different ways of implementing it that exploit some
known structure in the evolution of data. We demonstrate and evaluate this
approach on a variety of problems ranging from supervised learning tasks
(multiple image classification datasets), where the data undergoes a sequence of
gradual shifts of our design, to reinforcement learning tasks (robotic
manipulation and continuous control), where the data shifts organically
as the policy or the task changes.
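As a rough illustration of the kind of estimator discussed here, the sketch below uses the standard classifier-based density-ratio trick to weight past batches against the current batch, then discounts older batches so that the weights vary with time. The logistic-regression probe, the exponential decay, and the function names are simplifying assumptions for illustration; the paper's time-varying estimator is more general than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(past_X, current_X):
    """Classifier-based density-ratio estimate w(x) ~ p_current(x) / p_past(x)
    for every row of past_X (the standard importance-weighting building block)."""
    X = np.vstack([past_X, current_X])
    y = np.concatenate([np.zeros(len(past_X)), np.ones(len(current_X))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(past_X)[:, 1]          # P(sample came from "current" | x)
    ratio = (p / (1.0 - p)) * (len(past_X) / len(current_X))
    return np.clip(ratio, 0.0, 20.0)             # clip extreme ratios for stability

def time_varying_weights(past_batches, current_X, decay=0.7):
    """Illustrative time-varying variant: weight each past batch against the
    current data, then discount older batches with an exponential decay so that
    when a sample was seen also influences how much it is replayed."""
    weights = []
    for t, Xb in enumerate(past_batches):
        w = importance_weights(Xb, current_X)
        age = len(past_batches) - 1 - t          # 0 = most recent past batch
        weights.append(w * (decay ** age))
    return weights
```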
Stratified Learning: a general-purpose statistical method for improved learning under Covariate Shift
Covariate shift arises when the labelled training (source) data is not
representative of the unlabelled (target) data due to systematic differences in
the covariate distributions. A supervised model trained on the source data
subject to covariate shift may suffer from poor generalization on the target
data. We propose a novel, statistically principled and theoretically justified
method to improve learning under covariate shift conditions, based on
propensity score stratification, a well-established methodology in causal
inference. We show that the effects of covariate shift can be reduced or
altogether eliminated by conditioning on propensity scores. In practice, this
is achieved by fitting learners on subgroups ("strata") constructed by
partitioning the data based on the estimated propensity scores, leading to
balanced covariates and much-improved target prediction. We demonstrate the
effectiveness of our general-purpose method on contemporary research questions
in observational cosmology, and on additional benchmark examples, matching or
outperforming state-of-the-art importance weighting methods, widely studied in
the covariate shift literature. We obtain the best reported AUC (0.958) on the
updated "Supernovae photometric classification challenge" and improve upon
existing conditional density estimation of galaxy redshift from Sloan Digital
Sky Survey (SDSS) data.
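A bare-bones version of propensity-score stratification for covariate shift might look like the sketch below: a domain classifier estimates each point's propensity of belonging to the target set, source and target are cut into quantile strata of that score, and a separate learner is fitted per stratum. The choice of five strata, the logistic-regression propensity model, and the fallback for thin strata are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def stratified_fit_predict(X_src, y_src, X_tgt, n_strata=5, base_learner=None):
    """Fit one learner per propensity-score stratum and predict the target
    points that fall in the same stratum (illustrative sketch)."""
    base_learner = base_learner if base_learner is not None else RandomForestClassifier()
    # 1. Propensity model: probability that a point belongs to the target domain.
    X_all = np.vstack([X_src, X_tgt])
    domain = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    prop = LogisticRegression(max_iter=1000).fit(X_all, domain)
    e_src = prop.predict_proba(X_src)[:, 1]
    e_tgt = prop.predict_proba(X_tgt)[:, 1]
    # 2. Quantile strata of the pooled propensity scores.
    edges = np.quantile(np.concatenate([e_src, e_tgt]), np.linspace(0, 1, n_strata + 1))
    s_src = np.clip(np.digitize(e_src, edges[1:-1]), 0, n_strata - 1)
    s_tgt = np.clip(np.digitize(e_tgt, edges[1:-1]), 0, n_strata - 1)
    # 3. One learner per stratum; within a stratum, covariates are roughly balanced.
    y_pred = np.empty(len(X_tgt), dtype=y_src.dtype)
    for s in range(n_strata):
        tgt_mask = s_tgt == s
        if not tgt_mask.any():
            continue
        src_mask = s_src == s
        model = clone(base_learner)
        if src_mask.sum() >= 2:
            model.fit(X_src[src_mask], y_src[src_mask])
        else:                        # fall back to all source data for a thin stratum
            model.fit(X_src, y_src)
        y_pred[tgt_mask] = model.predict(X_tgt[tgt_mask])
    return y_pred
```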
The out-of-sample prediction error of the square-root-LASSO and related estimators
We study the classical problem of predicting an outcome variable, $Y$, using
a linear combination of a $d$-dimensional covariate vector, $\mathbf{X}$. We
are interested in linear predictors whose coefficients solve:
\begin{align*}
\inf_{\boldsymbol{\beta} \in \mathbb{R}^d} \left( \mathbb{E}_{\mathbb{P}_n}
\left[ \left(Y-\mathbf{X}^{\top}\boldsymbol{\beta} \right)^r \right] \right)^{1/r} + \delta
\, \rho\left(\boldsymbol{\beta}\right),
\end{align*}
where $\delta$ is a regularization parameter, $\rho$ is a convex
penalty function, $\mathbb{P}_n$ is the empirical distribution of the data, and
$r \geq 1$. We present three sets of new results. First, we provide conditions
under which linear predictors based on these estimators solve a
\emph{distributionally robust optimization} problem: they minimize the
worst-case prediction error over distributions that are close to each other in
a type of \emph{max-sliced Wasserstein metric}. Second, we provide a detailed
finite-sample and asymptotic analysis of the statistical properties of the
balls of distributions over which the worst-case prediction error is analyzed.
Third, we use the distributionally robust optimality and our statistical
analysis to present i) an oracle recommendation for the choice of
regularization parameter, $\delta$, that guarantees good out-of-sample
prediction error; and ii) a test-statistic to rank the out-of-sample
performance of two different linear estimators. None of our results rely on
sparsity assumptions about the true data generating process; thus, they broaden
the scope of use of the square-root lasso and related estimators in prediction
problems.
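For reference, the $r = 2$, $\ell_1$-penalty instance of the display above is the square-root LASSO, which can be written as a small convex program. The sketch below uses cvxpy to solve it; the delta value in the demo follows the common $\sqrt{\log d / n}$ scaling from the square-root-LASSO literature rather than the oracle recommendation developed in the paper.

```python
import cvxpy as cp
import numpy as np

def sqrt_lasso(X, y, delta):
    """Square-root LASSO: the r = 2, l1-penalty instance of the display above,
    solved as a small second-order-cone program."""
    n, d = X.shape
    beta = cp.Variable(d)
    rms = cp.norm(y - X @ beta, 2) / np.sqrt(n)   # (E_Pn[(Y - X'beta)^2])^{1/2}
    cp.Problem(cp.Minimize(rms + delta * cp.norm(beta, 1))).solve()
    return beta.value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.normal(size=(n, d))
    beta_true = np.zeros(d)
    beta_true[:3] = [2.0, -1.5, 1.0]
    y = X @ beta_true + rng.normal(scale=0.5, size=n)
    # A common pivotal scaling for delta in the square-root-LASSO literature is
    # proportional to sqrt(log(d) / n); the paper derives its own oracle rule.
    print(np.round(sqrt_lasso(X, y, delta=np.sqrt(np.log(d) / n)), 2))
```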