Addressing class imbalance for logistic regression
The challenge of class imbalance arises in classification problems when the minority class is observed far less often than the majority class, a characteristic endemic to many domains. Work by [Owen, 2007] has shown, in a theoretical setting of infinite imbalance, that logistic regression behaves such that all data in the rare class can be replaced by their mean vector without changing the coefficient estimates. This result suggests that cluster structure within the minority class may pose a specific problem for highly imbalanced logistic regression. In this thesis, we focus on highly imbalanced logistic regression and develop mitigation methods and diagnostic tools.
Theoretically, we extend the results of [Owen, 2007] by showing that the same phenomenon holds for both weighted and penalized likelihood methods in the infinitely imbalanced regime, which suggests that these alternatives to standard logistic regression are not sufficient for handling severe imbalance.
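To make the mean-replacement phenomenon concrete, here is a minimal numerical sketch (not from the thesis): a one-feature logistic regression, fit by Newton's method, on a large synthetic majority class plus two rare-class points. Refitting after collapsing the rare points to their mean gives nearly the same slope. The data, sample sizes, and the `fit_logistic` helper are all illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, iters=60):
    """Newton's method for one-feature logistic regression.
    Returns (intercept, slope)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            r, w = y - p, p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

random.seed(0)
# Large majority class (label 0) and a tiny minority class (label 1).
majority = [random.gauss(0.0, 1.0) for _ in range(20000)]
minority = [1.5, 2.5]                       # two rare-class points, mean 2.0
xs = majority + minority
ys = [0] * len(majority) + [1] * len(minority)
_, slope_real = fit_logistic(xs, ys)

# Replace every rare-class point by the rare-class mean.
mean_x = sum(minority) / len(minority)
xs_mean = majority + [mean_x] * len(minority)
_, slope_mean = fit_logistic(xs_mean, ys)

# Under heavy imbalance the two slope estimates nearly coincide,
# illustrating the infinite-imbalance phenomenon.
print(slope_real, slope_mean)
```

Because the rare-class fitted probabilities are vanishingly small at this level of imbalance, the score equations depend on the rare points almost exclusively through their mean, so the two slopes agree to several decimal places.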
As a mitigation method, we propose a novel relabeling solution that handles the imbalance problem in logistic regression by assigning new labels to the minority-class observations. Two algorithms (a genetic algorithm and an expectation-maximization algorithm) are formalized as tools for computing this relabeling. In simulation and real-data experiments, we show that plain logistic regression does not deliver the best out-of-sample predictive performance, and that our relabeling approach, which can capture underlying structure in the minority class, is often superior.
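The relabeling idea can be sketched with a toy EM pass over a one-dimensional minority class. This is only a sketch under strong assumptions (a two-component Gaussian mixture with fixed unit variances, well-separated synthetic clusters, and the hypothetical `em_relabel` helper); the thesis's EM and genetic algorithms are more general.

```python
import math
import random

def em_relabel(xs, iters=200):
    """Split 1-D minority-class observations into two subclasses using a
    two-component Gaussian-mixture EM (variances fixed at 1 for
    simplicity).  Returns a new label, 1 or 2, per observation."""
    mu1, mu2 = min(xs), max(xs)   # spread-out initialization
    pi1 = 0.5
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            d1 = pi1 * math.exp(-0.5 * (x - mu1) ** 2)
            d2 = (1.0 - pi1) * math.exp(-0.5 * (x - mu2) ** 2)
            resp.append(d1 / (d1 + d2))
        # M-step: update component means and the mixing weight.
        s1 = sum(resp)
        s2 = len(xs) - s1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / s1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / s2
        pi1 = s1 / len(xs)
    # Hard relabeling: assign each observation to its likelier subclass.
    return [1 if r > 0.5 else 2 for r in resp]

random.seed(0)
# Minority class drawn from two well-separated clusters.
minority_x = ([random.gauss(-3.0, 0.5) for _ in range(30)]
              + [random.gauss(3.0, 0.5) for _ in range(30)])
new_labels = em_relabel(minority_x)
```

After relabeling, the original binary fit can be replaced by a multiclass (e.g. one-vs-rest) logistic regression over labels {0, 1, 2}, so each minority subclass is modeled around its own center rather than around the pooled minority mean.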
As diagnostic tools for detecting highly imbalanced logistic regression, we propose several hypothesis-testing methods along with a graphical tool, based on the mathematical insights about highly imbalanced logistic regression. Simulation studies provide evidence that combining our diagnostic tools with the mitigation methods as a systematic strategy has the potential to alleviate the class imbalance problem in logistic regression.
Local case-control sampling: Efficient subsampling in imbalanced data sets
For classification problems with significant class imbalance, subsampling can
reduce computational costs at the price of inflated variance in estimating
model parameters. We propose a method for subsampling efficiently for logistic
regression by adjusting the class balance locally in feature space via an
accept-reject scheme. Our method generalizes standard case-control sampling,
using a pilot estimate to preferentially select examples whose responses are
conditionally rare given their features. The biased subsampling is corrected by
a post-hoc analytic adjustment to the parameters. The method is simple and
requires one parallelizable scan over the full data set. Standard case-control
sampling is inconsistent under model misspecification for the population
risk-minimizing coefficients θ*. By contrast, our estimator is
consistent for θ* provided that the pilot estimate is. Moreover, under
correct specification and with a consistent, independent pilot estimate, our
estimator has exactly twice the asymptotic variance of the full-sample MLE,
even if the selected subsample comprises a minuscule fraction of the full data
set, as happens when the original data are severely imbalanced. The factor of
two improves to 1 + 1/c if we multiply the baseline acceptance
probabilities by some c > 1 (and weight points with acceptance probability
greater than 1), taking roughly c times as many data points into the
subsample. Experiments on simulated and real data show that our method can
substantially outperform standard case-control subsampling.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/), DOI: http://dx.doi.org/10.1214/14-AOS1220, by the Institute of Mathematical Statistics.
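The accept-reject scheme and post-hoc adjustment described above can be sketched as follows. This is a toy one-feature version under stated assumptions: synthetic data, a pilot fit on a uniform subsample, and baseline acceptance probabilities (i.e. multiplier c = 1, so no probability exceeds 1 and no reweighting is needed); the `fit_logistic` Newton helper is an illustrative stand-in for any logistic solver.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, iters=60):
    """Newton's method for one-feature logistic regression -> (intercept, slope)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            r, w = y - p, p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

random.seed(1)
# Severely imbalanced data from a true logistic model P(y=1|x) = s(-6 + x).
N = 50000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [1 if random.random() < sigmoid(-6.0 + x) else 0 for x in xs]

# Pilot estimate from a small uniform subsample of the full data.
pilot_idx = random.sample(range(N), 5000)
a_p, b_p = fit_logistic([xs[i] for i in pilot_idx], [ys[i] for i in pilot_idx])

# One accept-reject scan: keep (x, y) with probability |y - p_pilot(x)|,
# preferentially selecting responses that are conditionally rare given x.
sub_x, sub_y = [], []
for x, y in zip(xs, ys):
    if random.random() < abs(y - sigmoid(a_p + b_p * x)):
        sub_x.append(x)
        sub_y.append(y)

# Fit on the (roughly balanced, much smaller) subsample; the post-hoc
# analytic adjustment adds the pilot coefficients back.
a_s, b_s = fit_logistic(sub_x, sub_y)
a_hat, b_hat = a_s + a_p, b_s + b_p
print(len(sub_x), a_hat, b_hat)
```

The correction works because, conditional on acceptance, the subsample follows a logistic model with parameters equal to the true parameters minus the pilot's, so adding the pilot back recovers an estimate of the original coefficients from a subsample that is a tiny fraction of the full data set.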