2,356 research outputs found

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    Get PDF
    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered \de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to di erent type of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several di erent domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also signi cantly contributed to new supervised learning paradigms, including multilabel classi cation, incremental learning, semi-supervised learning, multi-instance learning, among others. It is standard benchmark for learning from imbalanced data. It is also featured in a number of di erent software packages | from open source to commercial. In this paper, marking the fteen year anniversary of SMOTE, we re ect on the SMOTE journey, discuss the current state of a airs with SMOTE, its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems.This work have been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundaci on BBVA a Equipos de Investigaci on Cient ca 2016; and the National Science Foundation (NSF) Grant IIS-1447795

    a literature review

    Get PDF
    Fonseca, J., & Bacao, F. (2023). Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10, 1-37. [115]. https://doi.org/10.1186/s40537-023-00792-7 --- This research was supported by two research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked; Literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, distinguish the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data and provide recommendations for future research. We expect this study to assist researchers and practitioners identify relevant gaps in the literature and design better and more informed practices with synthetic data.publishersversionpublishe

    Incremental learning of concept drift from imbalanced data

    Get PDF
    Learning data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolve over time. Thus learners must adapt their knowledge base, including their structure or parameters, to remain as strong predictors. This phenomenon of learning from an evolving data source is akin to learning how to play a game while the rules of the game are changed, and it is traditionally referred to as learning concept drift. Climate data, financial data, epidemiological data, spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are commonly faced in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, anomaly detection are examples of applications that feature imbalanced datasets, where data from category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary & imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address the aforementioned phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and replacing a class independent error weighting mechanism - which normally favors majority class - with a set of measures that emphasize good predictive accuracy on all classes. The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE, to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Early hospital mortality prediction using vital signals

    Full text link
    Early hospital mortality prediction is critical as intensivists strive to make efficient medical decisions about the severely ill patients staying in intensive care units. As a result, various methods have been developed to address this problem based on clinical records. However, some of the laboratory test results are time-consuming and need to be processed. In this paper, we propose a novel method to predict mortality using features extracted from the heart signals of patients within the first hour of ICU admission. In order to predict the risk, quantitative features have been computed based on the heart rate signals of ICU patients. Each signal is described in terms of 12 statistical and signal-based features. The extracted features are fed into eight classifiers: decision tree, linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM, and K-nearest neighborhood (K-NN). To derive insight into the performance of the proposed method, several experiments have been conducted using the well-known clinical dataset named Medical Information Mart for Intensive Care III (MIMIC-III). The experimental results demonstrate the capability of the proposed method in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The decision tree classifier satisfies both accuracy and interpretability better than the other classifiers, producing an F1-score and AUC equal to 0.91 and 0.93, respectively. It indicates that heart rate signals can be used for predicting mortality in patients in the ICU, achieving a comparable performance with existing predictions that rely on high dimensional features from clinical records which need to be processed and may contain missing information.Comment: 11 pages, 5 figures, preprint of accepted paper in IEEE&ACM CHASE 2018 and published in Smart Health journa

    Predicting Cholera Positive Cases in Haiti

    Get PDF
    While Western countries typically run census surveys frequently, poorer countries such as Haiti do not have the money to do so; thus research into how Haitians live is severely lacking. Furthermore, studies that do exist tend to be not only old and outdated, but also lacking in depth. Using new census data recently collected from Haiti, I attempt to predict if certain behaviors and living situations can be used as indicators for determining if someone has cholera. Challenges for exploring this data center on getting the surveys into a format suitable for analysis and the severe class imbalance between the number of cholera positive people and cholera negative people. Numerous solutions to this problem are attempted including using different sampling techniques, using ensembles with models like CART and SVM, and Bayesian model averaging. Better survey designs and questions to add to future surveys are also discussed
    corecore