1,380 research outputs found
Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.
Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
A systematic review of data quality issues in knowledge discovery tasks
Hay un gran crecimiento en el volumen de datos porque las organizaciones capturan permanentemente la cantidad colectiva de datos para lograr un mejor proceso de toma de decisiones. El desafío mas fundamental es la exploración de los grandes volúmenes de datos y la extracción de conocimiento útil para futuras acciones por medio de tareas para el descubrimiento del conocimiento; sin embargo, muchos datos presentan mala calidad. Presentamos una revisión sistemática de los asuntos de calidad de datos en las áreas del descubrimiento de conocimiento y un estudio de caso aplicado a la enfermedad agrícola conocida como la roya del café.Large volume of data is growing because the organizations are continuously capturing the collective amount of data for better decision-making process. The most fundamental challenge is to explore the large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks, nevertheless many data has poor quality. We presented a systematic review of the data quality issues in knowledge discovery tasks and a case study applied to agricultural disease named coffee rust
Recommended from our members
GENETIC PROGRAMMING TO OPTIMIZE PERFORMANCE OF MACHINE LEARNING ALGORITHMS ON UNBALANCED DATA SET
Data collected from the real world is often imbalanced, meaning that the distribution of data across known classes is biased or skewed. When using machine learning classification models on such imbalanced data, predictive performance tends to be lower because these models are designed with the assumption of balanced classes or a relatively equal number of instances for each class. To address this issue, we employ data preprocessing techniques such as SMOTE (Synthetic Minority Oversampling Technique) for oversampling data and random undersampling for undersampling data on unbalanced datasets. Once the dataset is balanced, genetic programming is utilized for feature selection to enhance performance and efficiency. For this experiment, we consider an imbalanced bank marketing dataset from the UCI Machine Learning Repository. To assess the effectiveness of the technique, it is implemented on four different classification algorithms: Decision Tree, Logistic Regression, KNN (K-Nearest Neighbors), and SVM (Support Vector Machines). Various metrics including accuracy, balanced accuracy, recall, F-score, ROC (Receiver Operating Characteristics) curve, and PR (Precision-Recall) curve are compared for unbalanced data, oversampled data, undersampled data, and cleaned data with Tomek-Links for each algorithm. The results indicate that all four algorithms perform better when oversampling the minority class to half of the majority class and undersampling the majority class examples to match the minority class, followed by performing Tomek-Links on the balanced dataset
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered \de facto" standard in the framework of learning from imbalanced data. This
is due to its simplicity in the design of the procedure, as well as its robustness when applied
to di erent type of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several di erent domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has also signi cantly
contributed to new supervised learning paradigms, including multilabel classi cation, incremental
learning, semi-supervised learning, multi-instance learning, among others. It is
standard benchmark for learning from imbalanced data. It is also featured in a number of
di erent software packages | from open source to commercial. In this paper, marking the
fteen year anniversary of SMOTE, we re
ect on the SMOTE journey, discuss the current
state of a airs with SMOTE, its applications, and also identify the next set of challenges
to extend SMOTE for Big Data problems.This work have been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundaci on BBVA a Equipos de Investigaci on Cient ca 2016;
and the National Science Foundation (NSF) Grant IIS-1447795
An under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts
In Data mining and Knowledge Discovery hidden and valuable knowledge from the data sources is discovered. The traditional algorithms used for knowledge discovery are bottle necked due to wide range of data sources availability. Class imbalance is a one of the problem arises due to data source which provide unequal class i.e. examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the problem of class imbalance, including resampling algorithms, and feature selection approaches to this problem. In this paper, we present a new hybrid frame work dubbed as Majority Under-sampling based on Cluster Disjunct (MAJOR_CD) for learning from skewed training data. This algorithm provides a simpler and faster alternative by using cluster disjunct concept. We conduct experiments using twelve UCI data sets from various application domains using five algorithms for comparison on six evaluation metrics. The empirical study suggests that MAJOR_CD have been believed to be effective in addressing the class imbalance problem
Generative Adversarial Networks Selection Approach for Extremely Imbalanced Fault Diagnosis of Reciprocating Machinery
At present, countless approaches to fault diagnosis in reciprocating machines have been proposed, all considering that the available machinery dataset is in equal proportions for all conditions. However, when the application is closer to reality, the problem of data imbalance is increasingly evident. In this paper, we propose a method for the creation of diagnoses that consider an extreme imbalance in the available data. Our approach first processes the vibration signals of the machine using a wavelet packet transform-based feature-extraction stage. Then, improved generative models are obtained with a dissimilarity-based model selection to artificially balance the dataset. Finally, a Random Forest classifier is created to address the diagnostic task. This methodology provides a considerable improvement with 99% of data imbalance over other approaches reported in the literature, showing performance similar to that obtained with a balanced set of data.National Natural Science Foundation of China, under Grant 51605406National Natural Science Foundation of China under Grant 7180104
- …