Supervised Classification Using Balanced Training
We examine supervised learning for multi-class, multi-label text classification. We are interested in exploring classification in a real-world setting, where the distribution of labels may change dynamically over time. First, we compare the performance of an array of binary classifiers trained on the label distribution found in the original corpus against classifiers trained on balanced data, where we try to make the label distribution as nearly uniform as possible. We discuss the performance trade-offs between balanced vs. unbalanced training, and highlight the advantages of balancing the training set. Second, we compare the performance of two classifiers, Naive Bayes and SVM, with several feature-selection methods, using balanced training. We combine a Named-Entity-based rote classifier with the statistical classifiers to obtain better performance than either method alone.
Peer reviewed
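The balanced-training idea above can be illustrated with a minimal sketch: oversample each label until every label appears as often as the most frequent one, which approximates a uniform label distribution (the multi-label case is harder, since duplicating one example affects all of its labels at once). The function name and data layout here are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def balance_training_set(examples, seed=0):
    """Oversample minority labels so every label appears as often as the
    most frequent one (a simple approximation of 'balanced training')."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # pad minority labels with random duplicates up to the target count
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# toy corpus: label "x" occurs three times, label "y" once
data = [("a", "x"), ("b", "x"), ("c", "x"), ("d", "y")]
balanced = balance_training_set(data)
```

Downsampling the majority labels instead would achieve the same uniformity at the cost of discarding data, which is the trade-off the abstract alludes to.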
Deep Over-sampling Framework for Classifying Imbalanced Data
Class imbalance is a challenging issue in practical classification problems
for deep learning models as well as traditional models. Traditionally
successful countermeasures such as synthetic over-sampling have had limited
success with complex, structured data handled by deep learning models. In this
paper, we propose Deep Over-sampling (DOS), a framework for extending the
synthetic over-sampling method to exploit the deep feature space acquired by a
convolutional neural network (CNN). Its key feature is an explicit, supervised
representation learning, for which the training data presents each raw input
sample with a synthetic embedding target in the deep feature space, which is
sampled from the linear subspace of in-class neighbors. We implement an
iterative process of training the CNN and updating the targets, which induces
smaller in-class variance among the embeddings, to increase the discriminative
power of the deep representation. We present an empirical study using public
benchmarks, which shows that the DOS framework not only counteracts class
imbalance better than the existing method, but also improves the performance of
the CNN in the standard, balanced settings.
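The core DOS target step described above can be sketched as follows: for an embedding, take its k nearest in-class neighbors in the deep feature space and sample a random convex combination of them as the synthetic embedding target. This is a simplified, assumption-laden sketch (function name and neighbor count are illustrative; the paper's iterative CNN retraining is omitted).

```python
import numpy as np

def synthetic_embedding_target(z, in_class_embeddings, k=3, rng=None):
    """Sample a synthetic target for embedding z from the linear (convex)
    span of its k nearest in-class neighbors in deep feature space."""
    rng = np.random.default_rng() if rng is None else rng
    dists = np.linalg.norm(in_class_embeddings - z, axis=1)
    neighbors = in_class_embeddings[np.argsort(dists)[:k]]
    weights = rng.random(k)
    weights /= weights.sum()      # convex weights: non-negative, sum to 1
    return weights @ neighbors    # target lies among the in-class neighbors

# toy deep-feature space: three nearby in-class points and one outlier
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
z = np.array([0.1, 0.1])
target = synthetic_embedding_target(z, embeddings, k=3, rng=np.random.default_rng(0))
```

Because the target is a convex combination of nearby in-class points, repeatedly training the embedding toward such targets pulls in-class samples together, which is the smaller in-class variance the abstract mentions.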
Co-supervised learning paradigm with conditional generative adversarial networks for sample-efficient classification
Classification using supervised learning requires annotating a large amount
of class-balanced data for model training and testing. This has practically
limited the scope of applications with supervised learning, in particular deep
learning. To address the issues associated with limited and imbalanced data,
this paper introduces a sample-efficient co-supervised learning paradigm
(SEC-CGAN), in which a conditional generative adversarial network (CGAN) is
trained alongside the classifier and supplements semantics-conditioned,
confidence-aware synthesized examples to the annotated data during the training
process. In this setting, the CGAN not only serves as a co-supervisor but also
provides complementary quality examples to aid the classifier training in an
end-to-end fashion. Experiments demonstrate that the proposed SEC-CGAN
outperforms the external classifier GAN (EC-GAN) and a baseline ResNet-18
classifier. For the comparison, all classifiers in above methods adopt the
ResNet-18 architecture as the backbone. Particularly, for the Street View House
Numbers dataset, using the 5% of training data, a test accuracy of 90.26% is
achieved by SEC-CGAN as opposed to 88.59% by EC-GAN and 87.17% by the baseline
classifier; for the highway image dataset, using the 10% of training data, a
test accuracy of 98.27% is achieved by SEC-CGAN, compared to 97.84% by EC-GAN
and 95.52% by the baseline classifier.Comment: 14 pages, 5 figure
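The "confidence-aware" filtering described above can be sketched with a simple gate: keep only synthesized examples whose classifier confidence for the conditioning label exceeds a threshold before mixing them into the real training batch. The function name, the threshold value, and the agreement check are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def filter_confident_synthetic(fake_images, fake_labels, class_probs, threshold=0.9):
    """Keep CGAN-synthesized examples only if the classifier assigns high
    probability to the label they were conditioned on; such examples can
    then supplement the annotated batch."""
    confidence = class_probs.max(axis=1)
    predicted = class_probs.argmax(axis=1)
    keep = (confidence >= threshold) & (predicted == fake_labels)
    return fake_images[keep], fake_labels[keep]

# toy batch: 3 synthesized samples conditioned on label 0,
# with the classifier's softmax outputs for each
fake_images = np.arange(3).reshape(3, 1)
fake_labels = np.array([0, 0, 0])
class_probs = np.array([[0.95, 0.05],   # confident, agrees -> kept
                        [0.60, 0.40],   # low confidence   -> dropped
                        [0.05, 0.95]])  # confident, wrong label -> dropped
kept_images, kept_labels = filter_confident_synthetic(fake_images, fake_labels, class_probs)
```

Gating on both confidence and label agreement is one plausible way to keep low-quality generator output from polluting the classifier's training signal.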
FAULT DETECTION FRAMEWORK FOR IMBALANCED AND SPARSELY-LABELED DATA SETS USING SELF-ORGANIZING MAPS
While machine learning techniques developed for fault detection usually assume that the classes in the training data are balanced, in real-world applications this is seldom the case. These techniques also usually require labeled training data, which is costly and time-consuming to obtain. In this context, a data-driven framework is developed to detect faults in systems where the condition monitoring data is either imbalanced or consists of mostly unlabeled observations. To mitigate the problem of class imbalance, self-organizing maps (SOMs) are trained in a supervised manner, using the same map size for both classes of data, prior to performing classification. The optimal SOM size for balancing the classes in the data, the size of the neighborhood function, and the learning rate are determined by performing multiobjective optimization on SOM quality measures, such as quantization error and information entropy, and on performance measures, such as training time and classification error. For training data sets that contain a majority of unlabeled observations, a transductive semi-supervised approach is used to label the neurons of an unsupervised SOM before performing supervised SOM classification on the test data set. The developed framework is validated using artificial and real-world fault detection data sets.
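The per-class supervised-SOM classification described above can be sketched as: train one equally sized map per class, then assign a test sample to the class whose map contains the closest neuron (its best-matching unit). Map training itself is omitted here, and the function names and two-class labels are illustrative assumptions.

```python
import numpy as np

def nearest_quantization_error(x, som_weights):
    """Distance from sample x to its best-matching unit on one SOM,
    where som_weights is an (n_neurons, n_features) array."""
    return np.min(np.linalg.norm(som_weights - x, axis=1))

def classify_with_per_class_soms(x, som_healthy, som_faulty):
    """Assign x to the class whose (pre-trained, equally sized) SOM
    holds the closest neuron."""
    d_healthy = nearest_quantization_error(x, som_healthy)
    d_faulty = nearest_quantization_error(x, som_faulty)
    return "healthy" if d_healthy <= d_faulty else "faulty"

# toy pre-trained maps: two neurons per class in a 2-D feature space
som_healthy = np.array([[0.0, 0.0], [1.0, 1.0]])
som_faulty = np.array([[10.0, 10.0], [11.0, 11.0]])
label = classify_with_per_class_soms(np.array([0.5, 0.5]), som_healthy, som_faulty)
```

Using the same map size for both classes, as the abstract describes, keeps the two quantization errors comparable even when the raw classes are imbalanced.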
A COMPARISON OF ARTIFICIAL NEURAL NETWORK AND NAIVE BAYES CLASSIFICATION USING UNBALANCED DATA HANDLING
Classification is a supervised learning method that predicts the class of objects whose labels are unknown. Classification in machine learning performs well when the classes of the response variable are balanced; unbalanced classification is therefore a problem that must be taken seriously. This study handles unbalanced data using the Synthetic Minority Over-Sampling Technique (SMOTE). Two popular classification methods are the Naïve Bayes Classifier (NB) and the Resilient Backpropagation Artificial Neural Network (Rprop-ANN). The data come from the Health Nutrition Research and Development Agency (Balitbangkes) and consist of 2499 observations. This study examines the use of NB and ANN with SMOTE to classify the incidence of anemia in young women in Indonesia. Models are fit on 80% of the data as training data, with predictions made on the remaining 20% as test data. The analysis shows that handling the imbalance with SMOTE performs better than leaving the data unbalanced. Based on the results of the study, the best method for predicting the incidence of anemia is Naïve Bayes, with a sensitivity of 82%.
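The SMOTE step named above can be sketched in a few lines: each synthetic minority point is a random interpolation between a minority sample and one of its k nearest minority-class neighbors. This is a simplified sketch of the technique, not the exact implementation used in the study.

```python
import numpy as np

def smote_oversample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbors, and interpolate a
    random fraction of the way between them."""
    rng = np.random.default_rng(0) if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        # nearest neighbors, skipping index 0 of argsort (the point itself)
        neighbor_idx = np.argsort(dists)[1:k + 1]
        neighbor = minority[rng.choice(neighbor_idx)]
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)

# toy minority class: four points at the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_oversample(minority, n_new=5, k=2)
```

Because each new point lies on a segment between two existing minority points, SMOTE densifies the minority region instead of merely duplicating observations, which is why it often beats naive oversampling.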