Multiple Correspondence Analysis & the Multilogit Bilinear Model
Multiple Correspondence Analysis (MCA) is a dimension reduction method which
plays a large role in the analysis of tables with categorical nominal variables
such as survey data. Though it is usually motivated and derived using geometric
considerations, in fact we prove that it amounts to a single proximal Newton
step of a natural bilinear exponential family model for categorical data, the
multinomial logit bilinear model. We compare and contrast the behavior of MCA
with that of the model on simulations and discuss new insights on the
properties of both exploratory multivariate methods and their cognate models.
One main conclusion is that we recommend approximating the multilogit
model parameters using MCA. Indeed, estimating the parameters of the model is
not a trivial task, whereas MCA has the great advantage of being easily solved
by singular value decomposition and scalable to large data.
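
The SVD route mentioned above can be sketched as follows. This is a minimal illustration of textbook MCA (correspondence analysis applied to the indicator matrix), not the authors' code; the helper names `one_hot` and `mca` and the simulated integer-coded data are assumptions made for the example.

```python
import numpy as np

def one_hot(codes):
    """Stack one indicator block per categorical column of integer codes."""
    blocks = []
    for j in range(codes.shape[1]):
        levels = np.unique(codes[:, j])
        blocks.append((codes[:, j][:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)

def mca(codes, n_components=2):
    """MCA as correspondence analysis of the indicator matrix, via one SVD."""
    Z = one_hot(codes)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]    # principal coordinates of rows
    col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]    # principal coordinates of categories
    return row_coords, col_coords, s[:k] ** 2                # last item: principal inertias

# Hypothetical survey: 200 respondents, 4 questions with 3 answer levels each.
rng = np.random.default_rng(0)
codes = rng.integers(0, 3, size=(200, 4))
row_coords, col_coords, inertia = mca(codes)
```

The whole computation is a single dense SVD of an n-by-J matrix, which is what makes MCA cheap compared with iteratively fitting a bilinear multinomial logit model.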
Local case-control sampling: Efficient subsampling in imbalanced data sets
For classification problems with significant class imbalance, subsampling can
reduce computational costs at the price of inflated variance in estimating
model parameters. We propose a method for subsampling efficiently for logistic
regression by adjusting the class balance locally in feature space via an
accept-reject scheme. Our method generalizes standard case-control sampling,
using a pilot estimate to preferentially select examples whose responses are
conditionally rare given their features. The biased subsampling is corrected by
a post-hoc analytic adjustment to the parameters. The method is simple and
requires one parallelizable scan over the full data set. Standard case-control
sampling is inconsistent under model misspecification for the population
risk-minimizing coefficients $\theta^*$. By contrast, our estimator is
consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under
correct specification and with a consistent, independent pilot estimate, our
estimator has exactly twice the asymptotic variance of the full-sample MLE,
even if the selected subsample comprises a minuscule fraction of the full data
set, as happens when the original data are severely imbalanced. The factor of
two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance
probabilities by $c>1$ (and weight points with acceptance probability greater
than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the
subsample. Experiments on simulated and real data show that our method can
substantially outperform standard case-control subsampling.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1220 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
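
The accept-reject scheme described in the abstract can be sketched as follows. The abstract does not spell out the acceptance rule or the correction, so this sketch assumes an acceptance probability of |y - p_pilot(x)| and a correction that adds the pilot coefficients back to the subsample fit; both are assumptions of the example, as are the function names and the simulated data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression fitted by Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept-reject, fit on the subsample, adjust."""
    p_pilot = sigmoid(X @ theta_pilot)
    # Assumed rule: keep a point with probability |y - p_pilot(x)|, so responses
    # that are conditionally rare given their features are kept more often.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: add the pilot coefficients back.
    return theta_sub + theta_pilot, keep

# Hypothetical imbalanced simulation with a correctly specified model.
rng = np.random.default_rng(0)
theta_true = np.array([-3.0, 1.0, -1.0])

def simulate(n):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)
    return X, y

X_pilot, y_pilot = simulate(5_000)               # independent pilot sample
theta_pilot = fit_logistic(X_pilot, y_pilot)
X, y = simulate(100_000)
theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
```

In this sketch the accept-reject step touches each row once and independently of the others, so it could be run in parallel over chunks of the full data set; only the much smaller subsample is passed to the Newton solver.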