6,280 research outputs found

    Coupling different methods for overcoming the class imbalance problem

    Many classification problems must deal with imbalanced datasets where one class (the majority class) outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

    HAR-MI method for multi-class imbalanced datasets

    Research by a number of researchers on multi-class imbalance faces obstacles in the form of poor data diversity and a large number of classifiers. The Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method is a hybrid ensemble method developed from the Hybrid Approach Redefinition (HAR) method. This study compares its results with those of the Dynamic Ensemble Selection-Multiclass Imbalance (DES-MI) method in handling multi-class imbalance. In the HAR-MI method, the preprocessing stage was carried out using the random balance ensembles method and dynamic ensemble selection to produce a candidate ensemble, and the processing stage was carried out using different contribution sampling and dynamic ensemble selection to produce a candidate ensemble. The research was conducted using multi-class imbalanced datasets sourced from the KEEL repository. The results show that the HAR-MI method can overcome multi-class imbalance with better data diversity, a smaller number of classifiers, and better classifier performance than the DES-MI method. These results were tested with a Wilcoxon signed-rank test, which showed the superiority of the HAR-MI method over the DES-MI method.

    On the relevance of preprocessing in predictive maintenance for dynamic systems

    Real-time, data-driven monitoring of dynamic systems for predictive maintenance usually involves great complexity. To a greater or lesser degree, any data-driven approach is sensitive to data preprocessing, understood as any data treatment applied before the monitoring model, and this preprocessing is sometimes crucial for the final development of the monitoring technique employed. The aim of this work is to quantify exhaustively the sensitivity of data-driven predictive maintenance models in dynamic systems. We consider a couple of predictive maintenance scenarios, each defined by publicly available data. For each scenario, we consider its properties and apply several techniques for each of the successive preprocessing steps, e.g. data cleaning, missing-value treatment, outlier detection, feature selection, or imbalance compensation. The pretreatment configurations, i.e. sequential combinations of techniques from different preprocessing steps, are considered together with different monitoring approaches in order to determine the relevance of data preprocessing for predictive maintenance in dynamic systems.

    Predicting progression of mild cognitive impairment to dementia using neuropsychological data: a supervised learning approach using time windows

    Background: Predicting progression from a stage of Mild Cognitive Impairment (MCI) to dementia is a major pursuit in current research. It is broadly accepted that cognition declines along a continuum between MCI and dementia. As such, cohorts of MCI patients are usually heterogeneous, containing patients at different stages of the neurodegenerative process. This hampers the prognostic task. Nevertheless, when learning prognostic models, most studies use the entire cohort of MCI patients regardless of their disease stages. In this paper, we propose a Time Windows approach to predict conversion to dementia, learning with patients stratified using time windows, thus fine-tuning the prognosis regarding the time to conversion. Methods: In the proposed Time Windows approach, we grouped patients based on the clinical information of whether they converted (converter MCI) or remained MCI (stable MCI) within a specific time window. We tested time windows of 2, 3, 4 and 5 years. We developed a prognostic model for each time window using clinical and neuropsychological data and compared this approach with the one commonly used in the literature, where all patients are used to learn the models, named the First Last approach. This makes it possible to move from the traditional question "Will an MCI patient convert to dementia somewhere in the future" to the question "Will an MCI patient convert to dementia in a specific time window". Results: The proposed Time Windows approach outperformed the First Last approach. The results showed that we can predict conversion to dementia as early as 5 years before the event with an AUC of 0.88 in the cross-validation set and 0.76 in an independent validation set. Conclusions: Prognostic models using time windows have higher performance when predicting progression from MCI to dementia than the prognostic approach commonly used in the literature. Furthermore, the proposed Time Windows approach is more relevant from a clinical point of view, predicting conversion within a temporal interval rather than sometime in the future and allowing clinicians to adjust treatments and clinical appointments in a timely manner.
    Funding: FCT under the Neuroclinomics2 project [PTDC/EEI-SII/1937/2014, SFRH/BD/95846/2013]; INESC-ID plurianual [UID/CEC/50021/2013]; LASIGE Research Unit [UID/CEC/00408/2013]
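
The grouping step described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the function name, the field names, the "cMCI"/"sMCI" labels, and in particular the rule that patients with too little follow-up are left out of a window are all assumptions made for illustration.

```python
from typing import Optional


def time_window_label(years_to_conversion: Optional[float],
                      years_of_followup: float,
                      window: float) -> Optional[str]:
    """Label one patient for one time window (all values in years).

    years_to_conversion is None when no conversion to dementia was
    observed during follow-up (hypothetical field, not from the paper).
    """
    if years_to_conversion is not None and years_to_conversion <= window:
        return "cMCI"  # converter MCI: converted inside the window
    if years_of_followup >= window:
        return "sMCI"  # stable MCI: still MCI when the window closes
    return None        # assumption: follow-up too short, patient excluded


# One label per window, using the window lengths named in the abstract.
patient = {"years_to_conversion": 3.5, "years_of_followup": 3.5}
labels = {w: time_window_label(patient["years_to_conversion"],
                               patient["years_of_followup"], w)
          for w in (2, 3, 4, 5)}
```

Under these assumptions the same patient can be stable in a short window and a converter in a longer one, which is what lets a separate model be learned per window.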

    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data usually contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems.
    This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
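
The abstract does not spell out the procedure, so here is a minimal sketch of the core idea behind SMOTE: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sketch`, its parameters, and the brute-force neighbour search are illustrative assumptions, not the reference implementation or any library's API.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature lists (at least 2 points).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Brute-force search for the k nearest minority neighbours of x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(points, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region the minority class already occupies. The pairwise distance search is quadratic in the number of minority samples, which hints at why extending SMOTE to Big Data is listed as an open challenge.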