Informative sample generation using class aware generative adversarial networks for classification of chest Xrays
Training robust deep learning (DL) systems for disease detection from medical
images is challenging due to limited images covering different disease types
and severity. The problem is especially acute when there is a severe class
imbalance. We propose an active learning (AL) framework that selects the most
informative samples for training our model using a Bayesian neural network.
Informative samples are then used within a novel class aware generative
adversarial network (CAGAN) to generate realistic chest X-ray images for data
augmentation by transferring characteristics from one class label to another.
Experiments show our proposed AL framework is able to achieve state-of-the-art
performance while using only a portion of the full dataset, thus saving significant
time and effort over conventional methods.
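The selection step described above can be illustrated with a minimal sketch. The abstract does not say which uncertainty measure the authors use, so predictive entropy over Monte Carlo forward passes is an assumption here, and the function names and sample ids are hypothetical.

```python
import math

def predictive_entropy(mc_probs):
    """Entropy of the mean class distribution over Monte Carlo forward passes.

    mc_probs: list of per-pass probability vectors for one image.
    """
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)

def select_most_informative(pool, k):
    """Return the ids of the k pool items with the highest predictive entropy.

    pool: dict mapping a sample id to its list of Monte Carlo probability vectors.
    """
    ranked = sorted(pool, key=lambda sid: predictive_entropy(pool[sid]), reverse=True)
    return ranked[:k]

# Hypothetical two-class predictions from three stochastic passes per image.
pool = {
    "img_a": [[0.98, 0.02], [0.97, 0.03], [0.99, 0.01]],  # confident
    "img_b": [[0.55, 0.45], [0.40, 0.60], [0.52, 0.48]],  # uncertain
    "img_c": [[0.80, 0.20], [0.75, 0.25], [0.85, 0.15]],  # in between
}
print(select_most_informative(pool, 2))  # ['img_b', 'img_c']
```

Images whose averaged prediction is closest to uniform rank first, which is one common way to define "most informative"; other acquisition functions would rank the pool differently.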
Incremental learning of concept drift from imbalanced data
Learning from data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolves over time. Thus learners must adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning how to play a game while the rules of the game are changing, and it is traditionally referred to as learning concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are commonly faced in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary & imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address the aforementioned phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing a class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes.
The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE, to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
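For readers unfamiliar with SMOTE, the following is a minimal sketch of its core idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbors. This is not the Learn++.CDS implementation; the function name, parameters, and data points are illustrative assumptions.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors.

    minority: list of feature vectors (at least two) from the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Hypothetical two-dimensional minority-class points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating existing points exactly.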
IMPROVING CREDIT CARD FRAUD DETECTION USING TRANSFER LEARNING AND DATA RESAMPLING TECHNIQUES
This Culminating Experience Project explores the use of machine learning algorithms to detect credit card fraud. The research questions are: Q1. What cross-domain techniques developed in other domains can be effectively adapted and applied to mitigate or eliminate credit card fraud, and how do these techniques compare in terms of fraud detection accuracy and efficiency? Q2. To what extent do synthetic data generation methods effectively mitigate the challenges posed by imbalanced datasets in credit card fraud detection, and how do these methods impact classification performance? Q3. To what extent can the combination of transfer learning and innovative data resampling techniques improve the accuracy and efficiency of credit card fraud detection systems when dealing with imbalanced datasets, and what novel strategies can be developed to address this common challenge?
The main findings are: Q1. Unconventional cross-domain methods improved fraud detection, holding promise for enhanced security. Q2. The problems caused by unbalanced datasets in credit card fraud detection were effectively addressed by the synthetic data generation techniques SMOTE and ADASYN, resulting in a more balanced dataset suitable for fraud classification. Q3. The combination of neural networks and data resampling techniques, such as SMOTE and ADASYN, significantly improved credit card fraud detection accuracy.
The main conclusions are: Q1. Cross-domain methods are useful for credit card fraud detection, especially when it comes to online transactions. Q2. When used with various classifiers, neural networks show remarkable accuracy rates: 97% for unbalanced data, 99.47% for SMOTE, and 99.11% for ADASYN. Q3. Model evaluation on imbalanced data yields a fraud recall of 0.99, with 12,155 correct predictions out of 12,336 and 181 incorrect ones. The areas identified for further study include testing our model on larger datasets and optimizing hyperparameters for further improvement.
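The prediction counts quoted above can be turned into an overall accuracy with a short calculation. The sketch below also shows why accuracy and recall should be read separately on imbalanced data; the second set of counts is hypothetical and not taken from the project, and the abstract does not give the per-class counts needed to recompute its recall figure.

```python
def accuracy(correct, total):
    """Fraction of all predictions that are correct."""
    return correct / total

def recall(true_positives, false_negatives):
    """Fraction of actual positives (frauds) that the model catches."""
    return true_positives / (true_positives + false_negatives)

# Counts quoted in the abstract: 12,155 correct out of 12,336 predictions.
reported_accuracy = accuracy(12155, 12336)
print(round(reported_accuracy, 4))  # 0.9853

# Hypothetical counts: 10,000 transactions with 50 frauds, scored by a model
# that labels every transaction as legitimate.
majority_only_accuracy = accuracy(9950, 10000)
majority_only_recall = recall(0, 50)
print(majority_only_accuracy, majority_only_recall)  # 0.995 0.0
```

The hypothetical case shows a model that reaches 99.5% accuracy while catching no fraud at all, which is why recall on the fraud class is usually reported alongside accuracy.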
A Survey of Methods for Handling Disk Data Imbalance
Class imbalance exists in many classification problems, and since the data is
designed for accuracy, imbalance in data classes can lead to classification
challenges with a few classes having higher misclassification costs. The
Backblaze dataset, a widely used dataset related to hard disks, has a small
amount of failure data and a large amount of healthy data, and thus exhibits a
serious class imbalance. This paper provides a comprehensive overview of
research in the field of imbalanced data classification. The discussion is
organized into three main aspects: data-level methods, algorithmic-level
methods, and hybrid methods. For each type of method, we summarize and analyze
the existing problems, algorithmic ideas, strengths, and weaknesses.
Additionally, the challenges of unbalanced data classification are discussed,
along with strategies to address them. This overview should help researchers
choose the appropriate method according to their needs.
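As one concrete instance of an algorithmic-level method, the sketch below computes class weights inversely proportional to class frequency, so that errors on the rare class cost more during training. The survey does not prescribe this particular scheme; the weighting formula is a common heuristic, and the label names and counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical disk-health labels with a 99:1 imbalance.
labels = ["healthy"] * 990 + ["failed"] * 10
weights = balanced_class_weights(labels)
print(weights["failed"])  # 50.0
```

With these weights, the 10 failed drives and the 990 healthy drives contribute equally to a weighted loss, which is the effect a cost-sensitive learner relies on. A data-level method would instead change the sample counts themselves, for example by oversampling the failed class.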