Deep Over-sampling Framework for Classifying Imbalanced Data
Class imbalance is a challenging issue in practical classification problems
for deep learning models as well as traditional models. Traditionally
successful countermeasures such as synthetic over-sampling have had limited
success with complex, structured data handled by deep learning models. In this
paper, we propose Deep Over-sampling (DOS), a framework for extending the
synthetic over-sampling method to exploit the deep feature space acquired by a
convolutional neural network (CNN). Its key feature is an explicit, supervised
representation learning, for which the training data presents each raw input
sample with a synthetic embedding target in the deep feature space, which is
sampled from the linear subspace of in-class neighbors. We implement an
iterative process of training the CNN and updating the targets, which induces
smaller in-class variance among the embeddings, to increase the discriminative
power of the deep representation. We present an empirical study using public
benchmarks, which shows that the DOS framework not only counteracts class
imbalance better than the existing method, but also improves the performance of
the CNN in the standard, balanced settings.
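
To make the idea of a synthetic embedding target concrete, here is a minimal sketch of one way such a target could be drawn from in-class neighbors. It assumes the target is a random convex combination of the k nearest same-class embeddings. The abstract says only that targets are sampled from the linear subspace of in-class neighbors, so the weighting scheme, the function name `synthetic_target`, and the parameter `k` are illustrative assumptions rather than the authors' implementation.

```python
import random


def synthetic_target(embedding, same_class_embeddings, k=3, rng=random):
    """Sample a target as a random convex combination of in-class neighbors.

    Hypothetical helper: the convex weighting is an assumption made for
    illustration and is not taken from the DOS paper.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # The k nearest embeddings from the same class. If `embedding` is itself
    # in the list, it counts as its own nearest neighbor.
    neighbors = sorted(same_class_embeddings,
                       key=lambda e: sq_dist(embedding, e))[:k]

    # Random positive weights, normalized to sum to 1.
    raw = [rng.random() + 1e-12 for _ in neighbors]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Weighted average of the neighbors, one dimension at a time.
    dim = len(embedding)
    return [sum(w * n[d] for w, n in zip(weights, neighbors))
            for d in range(dim)]
```

In an iterative scheme like the one the abstract describes, targets of this kind would be recomputed after each round of CNN training, since the embeddings (and therefore the neighbors) change as the network is updated.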
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expectation-maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, more accurate, and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
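
As a rough illustration of the cost-sensitive part only, the sketch below evaluates a class-weighted hinge-loss objective for a linear SVM. The inverse-frequency weighting, the function names, and the parameter `C` are assumptions chosen for the example. The abstract does not specify how costs are set, and the multilevel framework and the imputation step are not shown.

```python
def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    This weighting rule is an assumption for illustration, not the paper's.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, c = len(labels), len(counts)
    return {y: n / (c * cnt) for y, cnt in counts.items()}


def weighted_hinge_objective(w, b, X, y, weights, C=1.0):
    """0.5 * ||w||^2 + C * sum_i weight[y_i] * max(0, 1 - y_i * (w . x_i + b)).

    Labels are expected to be +1 or -1.
    """
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # Errors on the rarer class are penalized more heavily.
        loss += weights[yi] * max(0.0, 1.0 - margin)
    return reg + C * loss
```

With one positive and three negative samples, the positive class receives weight 2.0 and the negative class roughly 0.67, so a misclassified positive costs three times as much as a misclassified negative.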
Box Drawings for Learning with Imbalanced Data
The vast majority of real world classification problems are imbalanced,
meaning there are far fewer data from the class of interest (the positive
class) than from other classes. We propose two machine learning algorithms to
handle highly imbalanced classification problems. The classifiers constructed
by both methods are created as unions of axis-parallel rectangles around the
positive examples, and thus have the benefit of being interpretable. The first
algorithm uses mixed integer programming to optimize a weighted balance between
positive and negative class accuracies. Regularization is introduced to improve
generalization performance. The second method uses an approximation in order to
assist with scalability. Specifically, it follows a "characterize then
discriminate" approach, where the positive class is characterized first by
boxes, and then each box boundary becomes a separate discriminative classifier.
This method has the computational advantages that it can be easily
parallelized, and considers only the relevant regions of feature space
- …
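
A classifier built as a union of axis-parallel boxes can be sketched in a few lines. The example below shows only prediction and a simple weighted score that trades off positive against negative accuracy. The box representation, the function names, and the form of the score are assumptions for illustration. The mixed integer program and the regularization described in the abstract are not reproduced.

```python
def in_box(point, box):
    """True if the point lies inside the box (bounds inclusive).

    A box is a (lower, upper) pair of coordinate lists.
    """
    lower, upper = box
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


def predict(point, boxes):
    """Label +1 if the point falls in any box, otherwise -1."""
    return 1 if any(in_box(point, b) for b in boxes) else -1


def weighted_score(points, labels, boxes, pos_weight=1.0):
    """pos_weight * (correct positives) + (correct negatives).

    A hypothetical stand-in for the weighted balance between positive and
    negative class accuracies mentioned in the abstract.
    """
    score = 0.0
    for p, y in zip(points, labels):
        if predict(p, boxes) == y:
            score += pos_weight if y == 1 else 1.0
    return score
```

Because each box is just a pair of lower and upper bounds per feature, a prediction can be explained by naming the box that contains the point, which is the interpretability benefit the abstract refers to.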