    Multiple Correspondence Analysis & the Multilogit Bilinear Model

    Multiple Correspondence Analysis (MCA) is a dimension reduction method that plays a large role in the analysis of tables of categorical nominal variables, such as survey data. Though it is usually motivated and derived using geometric considerations, we prove that it amounts to a single proximal Newton step of a natural bilinear exponential family model for categorical data, the multinomial logit bilinear model. We compare and contrast the behavior of MCA with that of the model on simulations and discuss new insights into the properties of both exploratory multivariate methods and their cognate models. One main conclusion is that we recommend approximating the multilogit model parameters using MCA. Indeed, estimating the parameters of the model is not a trivial task, whereas MCA has the great advantage of being easily solved by singular value decomposition and of scaling to large data.
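The abstract notes that MCA is solved by a singular value decomposition. A minimal sketch of that computation follows, using the textbook formulation (correspondence analysis of the indicator matrix), not the multilogit bilinear model itself. The function names `one_hot` and `mca` and the integer-coded input layout are assumptions made for illustration.

```python
import numpy as np

def one_hot(codes):
    """Indicator matrix for an (n, q) integer array of category codes."""
    blocks = []
    for j in range(codes.shape[1]):
        levels = np.unique(codes[:, j])  # only observed levels get a column
        blocks.append((codes[:, [j]] == levels[None, :]).astype(float))
    return np.hstack(blocks)

def mca(codes, n_components=2):
    """Textbook MCA: SVD of the standardized residuals of the indicator matrix."""
    Z = one_hot(codes)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]   # principal coordinates of rows
    col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]   # principal coordinates of categories
    return row_coords, col_coords, s[:k] ** 2               # last item: principal inertias
```

The cost is dominated by one SVD, which is what makes the method easy to scale; for large tables a truncated or randomized SVD could replace `np.linalg.svd`.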

    Local case-control sampling: Efficient subsampling in imbalanced data sets

    For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^*$. By contrast, our estimator is consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE - even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling. Comment: Published at http://dx.doi.org/10.1214/14-AOS1220 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org
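The accept-reject scheme and the post-hoc correction can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each point is accepted with probability $|y - \tilde p(x)|$, where $\tilde p$ is the pilot model's fitted probability, and that the correction consists of adding the pilot coefficients back to the subsample fit. The helper names (`sigmoid`, `fit_logistic`, `local_case_control`) are hypothetical, and the plain Newton solver stands in for any logistic regression routine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Unweighted logistic regression MLE via Newton's method (illustrative solver)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Hessian of the negative log-likelihood, with a tiny ridge for numerical stability
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One scan over the data: accept-reject, fit on the subsample, then correct.

    Assumed acceptance rule: keep (x, y) with probability |y - p_pilot(x)|,
    so points the pilot finds surprising are kept preferentially.
    Assumed correction: add the pilot coefficients to the subsample estimate.
    """
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob   # each row is decided independently
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep
```

Because each row's acceptance depends only on that row and the pilot, the scan parallelizes trivially. The $c>1$ variant mentioned in the abstract is not implemented here.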