23 research outputs found
Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data
Transductive graph-based semi-supervised learning methods usually build an
undirected graph utilizing both labeled and unlabeled samples as vertices.
Those methods propagate label information of labeled samples to neighbors
through their edges in order to get the predicted labels of unlabeled samples.
Most popular semi-supervised learning approaches are sensitive to the initial label
distribution found in imbalanced labeled datasets. The class boundary will
be severely skewed by the majority classes in an imbalanced classification. In
this paper, we propose a simple and effective approach to alleviate the
unfavorable influence of the imbalance problem by iteratively selecting a few
unlabeled samples and adding them into the minority classes to form a balanced
labeled dataset for the learning methods afterwards. The experiments on UCI
datasets and the MNIST handwritten digits dataset show that the proposed approach
outperforms other existing state-of-the-art methods.
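The graph-based propagation step described in this abstract can be illustrated with a minimal sketch. This is a generic label propagation routine with clamped labeled nodes, not the paper's method; the function name `propagate_labels`, the toy chain graph, and the iteration count are assumptions made for illustration only.

```python
def propagate_labels(edges, labels, n_nodes, n_classes, n_iter=50):
    """Generic label propagation on an undirected graph (illustrative only).

    edges:  list of (i, j) pairs, treated as undirected
    labels: dict mapping labeled node index -> class index
    """
    adj = [[] for _ in range(n_nodes)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    # scores[v][c] is the current belief that node v belongs to class c
    scores = [[0.0] * n_classes for _ in range(n_nodes)]
    for v, c in labels.items():
        scores[v][c] = 1.0

    for _ in range(n_iter):
        new_scores = []
        for v in range(n_nodes):
            if v in labels or not adj[v]:
                # labeled nodes stay clamped; isolated nodes keep their scores
                new_scores.append(scores[v][:])
            else:
                # unlabeled nodes take the average of their neighbors' scores
                new_scores.append([
                    sum(scores[u][c] for u in adj[v]) / len(adj[v])
                    for c in range(n_classes)
                ])
        scores = new_scores

    # predicted class = highest-scoring class per node
    return [max(range(n_classes), key=lambda c: scores[v][c])
            for v in range(n_nodes)]


# Hypothetical toy graph: a chain 0-1-2-3-4-5 with only the two ends labeled
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
predicted = propagate_labels(edges, {0: 0, 5: 1}, n_nodes=6, n_classes=2)
```

With one labeled sample per class the chain splits evenly. When one class has many more labeled nodes than the other, its scores reach more of the graph, which is the skew the abstract attributes to imbalanced labeled data.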
Tackling Diverse Minorities in Imbalanced Classification
Imbalanced datasets are commonly observed in various real-world applications,
presenting significant challenges in training classifiers. When working with
large datasets, the imbalanced issue can be further exacerbated, making it
exceptionally difficult to train classifiers effectively. To address the
problem, over-sampling techniques have been developed to linearly interpolate
data instances between minorities and their neighbors. However, in many
real-world scenarios such as anomaly detection, minority instances are often
dispersed diversely in the feature space rather than clustered together.
Inspired by domain-agnostic data mix-up, we propose generating synthetic
samples iteratively by mixing data samples from both minority and majority
classes. Developing such a framework is non-trivial: the challenges include
source sample selection, mix-up strategy selection, and the coordination
between the underlying model and mix-up strategies. To tackle these challenges,
we formulate the problem of iterative data mix-up as a Markov decision process
(MDP) that maps data attributes onto an augmentation strategy. To solve the
MDP, we employ an actor-critic framework to adapt the discrete-continuous
decision space. This framework is utilized to train a data augmentation policy
and design a reward signal that explores classifier uncertainty and encourages
performance improvement, irrespective of the classifier's convergence. We
demonstrate the effectiveness of our proposed framework through extensive
experiments conducted on seven publicly available benchmark datasets using
three different types of classifiers. The results of these experiments showcase
the potential and promise of our framework in addressing imbalanced datasets
with diverse minorities.
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to its simplicity in the design of the procedure, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has also significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, and multi-instance learning, among others. It is a
standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE and its applications, and also identify the next set of challenges
to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795.
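As a rough illustration of the interpolation idea behind SMOTE, the sketch below creates synthetic minority points on the line segment between a minority sample and one of its nearest minority neighbors. It is a simplified sketch under stated assumptions, not the reference implementation: the function name `smote_sketch`, the parameters `k`, `n_new`, and `seed`, and the toy data are all hypothetical.

```python
import random


def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (illustrative only)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical toy minority class in two dimensions
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority)
```

Every synthetic point lies between two existing minority samples, so the method fills in the region the minority class already occupies rather than extending beyond it.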
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Minority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance.
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
Towards Data-centric Graph Machine Learning: Review and Outlook
Data-centric AI, with its primary focus on the collection, management, and
utilization of data to drive AI models and applications, has attracted
increasing attention in recent years. In this article, we conduct an in-depth
and comprehensive review, offering a forward-looking outlook on the current
efforts in data-centric AI pertaining to graph data, the fundamental data
structure for representing and capturing intricate dependencies among massive
and diverse real-life entities. We introduce a systematic framework,
Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of
the graph data lifecycle, including graph data collection, exploration,
improvement, exploitation, and maintenance. A thorough taxonomy of each stage
is presented to answer three critical graph-centric questions: (1) how to
enhance graph data availability and quality; (2) how to learn from graph data
with limited availability and low quality; (3) how to build graph MLOps systems
from the graph data-centric view. Lastly, we pinpoint the future prospects of
the DC-GML domain, providing insights to navigate its advancements and
applications. Comment: 42 pages, 9 figures
Hyperspectral Image Classification -- Traditional to Deep Models: A Survey for Future Prospects
Hyperspectral Imaging (HSI) has been extensively utilized in many real-life
applications because it benefits from the detailed spectral information
contained in each pixel. Notably, the complex characteristics of HSI data, i.e., the
nonlinear relation between the captured spectral information and the
corresponding object, make accurate classification challenging for
traditional methods. In the last few years, Deep Learning (DL) has been
substantiated as a powerful feature extractor that effectively addresses the
nonlinear problems that appear in a number of computer vision tasks. This
has prompted the deployment of DL for HSI classification (HSIC), which has shown good
performance. This survey presents a systematic overview of DL for HSIC and
compares state-of-the-art strategies on the topic. First, we
summarize the main challenges of traditional machine learning for HSIC and
then describe the advantages of DL in addressing these problems. This
survey breaks down the state-of-the-art DL frameworks into spectral-feature,
spatial-feature, and joint spatial-spectral feature approaches to systematically
analyze the achievements (and future research directions) of these
frameworks for HSIC. Moreover, we consider the fact that DL requires a
large number of labeled training examples, whereas acquiring such a number for
HSIC is challenging in terms of time and cost. Therefore, this survey discusses
some strategies to improve the generalization performance of DL strategies,
which can provide some future guidelines.
Deficient data classification with fuzzy learning
This thesis first proposes a novel algorithm for handling both missing values and imbalanced data classification problems. Then, algorithms for addressing the class imbalance problem in Twitter spam detection (a network security problem) are proposed. Finally, the security profile of SVM against deliberate attacks is simulated and analysed.