Class prediction for high-dimensional class-imbalanced data
Background: The goal of class prediction studies is to develop rules that accurately predict the class membership of new samples. The rules are derived from the values of the variables available for each subject; the defining characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Classifiers are frequently developed on class-imbalanced data, i.e., data sets where the number of samples in each class differs. Standard classification methods applied to class-imbalanced data often produce classifiers that predict the minority class poorly: the prediction is biased towards the majority class. In this paper we investigate whether high dimensionality poses additional challenges for class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of several strategies available to counter the effect of class imbalance.
Results: The evaluated classifiers are highly sensitive to class imbalance, and variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class of the training set unless the difference between the classes is very large; as a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging with embedded variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.
Conclusions: Matching the prevalence of the classes in training and test sets does not guarantee good classifier performance, and the problems of classification with class-imbalanced data are exacerbated in high dimensions. Researchers using class-imbalanced data should assess the predictive accuracy of their classifiers carefully and, unless the class imbalance is mild, should always use an appropriate method for dealing with it.
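The down-sizing strategy the abstract finds effective amounts to randomly discarding majority-class samples until the classes are balanced. A minimal sketch (the function name and API are illustrative, not the authors' code):

```python
import numpy as np

def downsize(X, y, seed=0):
    """Randomly drop samples so every class is reduced to the size
    of the smallest class (random undersampling / "down-sizing")."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]
```

Because samples are discarded at random, results vary with the seed; asymmetric bagging repeats this draw many times and aggregates the resulting classifiers, which recovers some of the information lost by a single down-sizing.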
Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Learning from class-imbalanced data continues to be a common and challenging
problem in supervised learning as standard classification algorithms are
designed to handle balanced class distributions. While different strategies
exist to tackle this problem, methods which generate artificial data to achieve
a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the
training data, allowing any classifier to be used with class-imbalanced
datasets. Many algorithms have been proposed for this task, but most are
complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE
oversampling, which avoids the generation of noise and effectively overcomes
imbalances between and within classes. Empirical results of extensive
experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE
consistently outperforms other popular oversampling methods. An implementation
is made available in the Python programming language.
Comment: 19 pages, 8 figures
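The core idea described above can be sketched in a few lines: cluster the input space with k-means, keep only clusters dominated by the minority class (this is what avoids generating noise near the majority class), and interpolate between minority samples inside those safe clusters. This is a simplified illustration, not the authors' implementation; parameter names and the uniform allocation across clusters are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_smote(X, y, minority_label, n_clusters=3, n_new=20, min_share=0.5, seed=0):
    """Sketch of k-means SMOTE: cluster, filter clusters by minority share,
    then apply SMOTE-style interpolation only inside the safe clusters."""
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    eligible = []
    for c in range(n_clusters):
        members = clusters == c
        minority = X[members & (y == minority_label)]
        # filter step: skip clusters where the minority class is not dominant
        if len(minority) >= 2 and len(minority) / members.sum() >= min_share:
            eligible.append(minority)
    synthetic = []
    for _ in range(n_new):
        pts = eligible[rng.integers(len(eligible))]   # pick a safe cluster
        a, b = pts[rng.choice(len(pts), size=2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))  # SMOTE interpolation step
    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_new, minority_label)])
    return X_new, y_new
```

The full method additionally weights how many samples each safe cluster receives by its sparsity, which addresses within-class imbalance; the even allocation above is a simplification.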
Impact of Biases in Big Data
The underlying paradigm of big data-driven machine learning reflects the
desire of deriving better conclusions from simply analyzing more data, without
the necessity of looking at theory and models. Is having simply more data
always helpful? In 1936, The Literary Digest collected 2.3 million completed
questionnaires to predict the outcome of that year's US presidential election.
The outcome of this big data prediction proved to be entirely wrong, whereas
George Gallup only needed 3K handpicked people to make an accurate prediction.
Generally, biases occur in machine learning whenever the distributions of
training set and test set are different. In this work, we provide a review of
different sorts of biases in (big) data sets in machine learning. We provide
definitions and discussions of the most commonly appearing biases in machine
learning: class imbalance and covariate shift. We also show how these biases
can be quantified and corrected. This work is an introductory text for both
researchers and practitioners to become more aware of this topic and thus to
derive more reliable models for their learning problems.
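One standard way to quantify and correct covariate shift, in the spirit of the review above, is importance weighting via the density-ratio trick: train a probabilistic classifier to distinguish training from test inputs and convert its probabilities into per-sample weights. This is a generic sketch of that technique, not code from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_train, X_test):
    """Estimate w(x) = p_test(x) / p_train(x) with a classifier that
    discriminates training inputs from test inputs (density-ratio trick).
    The weights can then be passed to a learner's sample_weight argument."""
    X = np.vstack([X_train, X_test])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0=train, 1=test
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_test = clf.predict_proba(X_train)[:, 1]
    # w(x) is proportional to P(test|x) / P(train|x); the constant
    # factor n_train / n_test is omitted in this sketch
    return p_test / (1.0 - p_test)
```

Training samples that look like test samples receive weights above 1, so the reweighted training distribution approximates the test distribution.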
GraphSR: A Data Augmentation Algorithm for Imbalanced Node Classification
Graph neural networks (GNNs) have achieved great success in node
classification tasks. However, existing GNNs are naturally biased towards the
majority classes, which have more labelled data, and neglect minority classes
with relatively few labelled nodes. Traditional techniques often resort to
over-sampling methods, but these may cause overfitting. More recently, some
works propose synthesizing additional nodes for the minority classes from the
labelled nodes; however, there is no guarantee that the generated nodes truly
represent the corresponding minority classes. In fact, improperly synthesized
nodes may impair the generalization of the algorithm.
To resolve the problem, in this paper we seek to automatically augment the
minority classes from the massive unlabelled nodes of the graph. Specifically,
we propose \textit{GraphSR}, a novel self-training strategy to augment the
minority classes with significant diversity of unlabelled nodes, which is based
on a Similarity-based selection module and a Reinforcement Learning (RL)
selection module. The first module finds a subset of unlabelled nodes which are
most similar to those labelled minority nodes, and the second one further
determines the representative and reliable nodes from the subset via an RL
technique. Furthermore, the RL-based module can adaptively determine the
sampling scale according to the current training data. This strategy is
general and can easily be combined with different GNN models. Our experiments
demonstrate
the proposed approach outperforms the state-of-the-art baselines on various
class-imbalanced datasets.
Comment: Accepted by AAAI202
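The first stage described above, similarity-based selection, can be illustrated by ranking unlabelled node embeddings against the centroid of the labelled minority embeddings. This is a hypothetical sketch of that stage only (the similarity measure and centroid choice are assumptions, and the RL stage that further filters these candidates is omitted):

```python
import numpy as np

def similarity_select(H_unlabelled, H_minority, k=5):
    """Rank unlabelled node embeddings by cosine similarity to the centroid
    of the labelled minority-class embeddings; return the top-k candidate
    indices, which a second (RL-based) stage would then filter."""
    centroid = H_minority.mean(axis=0)
    num = H_unlabelled @ centroid
    denom = np.linalg.norm(H_unlabelled, axis=1) * np.linalg.norm(centroid)
    sims = num / np.maximum(denom, 1e-12)  # guard against zero-norm embeddings
    return np.argsort(sims)[::-1][:k]
```

In the full method the embeddings would come from a GNN trained on the labelled nodes, and the RL module decides how many of these candidates to pseudo-label into the minority class.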