27 research outputs found

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and it is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795
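As a reference point, the core interpolation step of SMOTE can be written in a few lines. The sketch below assumes a purely numeric minority-class matrix and omits details of the published algorithm such as per-sample bookkeeping and the handling of nominal features.

```python
# A minimal sketch of the SMOTE interpolation step, assuming numeric features only.
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic points by interpolating each minority sample
    towards one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude the point itself
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                    # random minority sample
        j = rng.choice(neighbours[i])          # one of its minority neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Example: oversample a toy 2-D minority class of 20 points by 40 synthetic points.
X_min = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sample(X_min, n_synthetic=40, k=5, rng=0).shape)  # (40, 2)
```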

    Improved imbalanced classification through convex space learning

    Imbalanced datasets for classification problems, characterised by an unequal distribution of samples, are abundant in practical scenarios. Oversampling algorithms generate synthetic data to improve classification performance on such datasets. In this thesis, I discuss two algorithms, LoRAS and ProWRAS, which improve on the state of the art, as shown through rigorous benchmarking on publicly available datasets. A biological application for the detection of rare cell types from single-cell transcriptomics data is also discussed. The thesis also provides a deeper theoretical understanding of oversampling
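To make the notion of convex space learning concrete, the sketch below generates synthetic minority points as random convex combinations of small minority neighbourhoods. It is a simplified illustration of the convex-combination idea, not the actual LoRAS or ProWRAS procedures, whose additional steps (e.g. Gaussian shadowsamples, proximity weighting) are omitted.

```python
# Illustrative convex-space oversampling: new points lie in the convex hull
# of small minority-class neighbourhoods.
import numpy as np

def convex_oversample(X_min, n_synthetic, neighbourhood=5, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X_min)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, :neighbourhood]   # neighbourhood incl. the point itself
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(n)
        idx = nbrs[i]
        w = rng.dirichlet(np.ones(len(idx)))           # convex weights summing to 1
        out.append(w @ X_min[idx])                     # point inside the neighbourhood's hull
    return np.asarray(out)

X_min = np.random.default_rng(1).normal(size=(30, 4))
print(convex_oversample(X_min, 50, rng=1).shape)       # (50, 4)
```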

    AI for in-line vehicle sequence controlling: development and evaluation of an adaptive machine learning artifact to predict sequence deviations in a mixed-model production line

    Customers in the manufacturing sector, especially in the automotive industry, have a high demand for individualized products at price levels comparable to traditional mass production. The conflicting objectives of providing a variety of products and operating at minimum cost have introduced a high degree of production planning and control mechanisms based on a stable order sequence for mixed-model assembly lines. A major threat to this development is sequence scrambling, triggered by both operational and product-related root causes. Despite the introduction of just-in-time and fixed production times, the problem of sequence scrambling remains partially unresolved in the automotive industry. Negative downstream effects range from disruptions in the just-in-sequence supply chain to a halt of the production process. A precise prediction of sequence deviations at an early stage allows counteractions to be introduced to stabilize the sequence before disorder emerges. While procedural causes are widely addressed in research, the work at hand takes a different, product-related perspective. Built on unique data from a real-world global automotive manufacturer, a supervised classification model is trained and evaluated. This covers all the steps necessary to design, implement, and assess an AI artifact, including data gathering, preprocessing, algorithm selection, and evaluation. To ensure long-term prediction stability, we include a continuous learning module to counter data drifts. We show that up to 50% of the major deviations can be predicted in advance. However, we do not consider any process-related information, such as machine conditions and shift plans, but solely focus on the exploitation of product features like body type, powertrain, color, and special equipment
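To illustrate how a continuously updated classifier over categorical product features might look, the sketch below uses scikit-learn's DictVectorizer and an incrementally trained linear model. The feature names, labels, and update scheme are illustrative assumptions, not the artifact described in the paper.

```python
# A minimal sketch of incremental (continuous) learning on categorical product
# features; the feature values and labels are invented for illustration.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

vec = DictVectorizer()                 # one-hot encodes the categorical product features
clf = SGDClassifier(random_state=0)    # linear model that supports incremental partial_fit

# Initial training batch: product features -> major sequence deviation (1) vs. none (0).
batch0 = [{"body": "suv", "powertrain": "ev", "color": "red"},
          {"body": "sedan", "powertrain": "ice", "color": "blue"}]
clf.partial_fit(vec.fit_transform(batch0), np.array([1, 0]), classes=[0, 1])

# Later batches keep the model current as the order mix (and data distribution) drifts.
batch1 = [{"body": "suv", "powertrain": "ice", "color": "red"}]
clf.partial_fit(vec.transform(batch1), np.array([1]))

print(clf.predict(vec.transform([{"body": "sedan", "powertrain": "ev", "color": "blue"}])))
```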

    A Priori Synthetic Sampling for Increasing Classification Sensitivity in Imbalanced Data Sets

    Building accurate classifiers for predicting group membership is made difficult when data is skewed or imbalanced, which is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented group. This is known as the class imbalance problem, which induces bias into the classifier, particularly when the imbalance is high. Class-imbalanced data usually suffer from intrinsic data properties beyond the imbalance alone. The problem is intensified at larger levels of imbalance, most commonly found in observational studies. Extreme cases of class imbalance occur in many domains, including fraud detection, cancer mammography, and post-term births. These rare events are usually the most costly or carry the highest level of risk and are therefore of most interest. To combat class imbalance, the machine learning community has relied upon embedded, data-preprocessing, and ensemble-learning approaches. Exploratory research has linked several factors that perpetuate the issue of misclassification in class-imbalanced data. However, among the competing approaches there remains a lack of understanding of the relationship between the learner and imbalanced data. Current data-preprocessing approaches are appealing because they divide the problem space in two, which allows for simpler models. However, most of these approaches have little theoretical basis, although in some cases there is empirical evidence supporting the improvement. The main goal of this research is to introduce newly proposed a priori based re-sampling methods that improve concept learning within class-imbalanced data. The results in this work highlight the robustness of these techniques' performance on publicly available data sets from different domains containing various levels of imbalance. In this research, the theoretical and empirical reasons are explored and discussed

    Image Diversification via Deep Learning based Generative Models

    Machine-learning-driven pattern recognition from imagery, such as object detection, has become prevalent in society due to the high demand for autonomy and the recent remarkable advances in such technology. Machine learning technologies acquire an abstraction of the existing data and enable inference of the patterns of future inputs. However, such technologies require a vast number of images as a training dataset that covers the distribution of future inputs well in order to predict the proper patterns, whereas in many cases it is impracticable to prepare a sufficient variety of images. To address this problem, this thesis pursues methods to diversify image datasets so as to fully enable the capability of machine-learning-driven applications. Focusing on the plausible image-synthesis ability of generative models, we investigate a number of approaches to expand the variety of the output images using image-to-image translation, mixup, and diffusion models, along with a technique that makes the diffusion approach efficient in both computation and training data. First, we propose the combined use of unpaired image-to-image translation and mixup for data augmentation on limited non-visible imagery. Second, we propose diffusion-based image-to-image translation that generates higher-quality images than previous adversarial-training-based translation methods. Third, we propose patch-wise and discrete conditional training of the diffusion method, enabling a reduction in computation and robustness on small training datasets. Subsequently, we discuss a remaining open challenge concerning evaluation and the direction of future work. Lastly, we draw an overall conclusion after stating the social impact of this research field
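Of the diversification techniques named above, mixup is simple enough to show in full. The sketch below blends two images and their one-hot labels with a Beta-distributed weight; it illustrates only the generic mixup step, not the thesis's translation- or diffusion-based contributions.

```python
# A minimal sketch of mixup augmentation on toy image arrays.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two images and their one-hot labels with a Beta-distributed weight."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy example with 32x32 grayscale "images" and two classes.
rng = np.random.default_rng(0)
img_a, img_b = rng.random((32, 32)), rng.random((32, 32))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(img_a, lab_a, img_b, lab_b, rng=0)
print(x_mix.shape, y_mix)   # (32, 32) and a soft label such as [0.97, 0.03]
```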

    Semi-supervised learning and fairness-aware learning under class imbalance

    With the advent of Web 2.0 and rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information; rather, the quality of data (the veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms rely solely on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still dramatically affects machine learning algorithms. Class imbalance causes classification models to learn one particular class (the majority) effectively while ignoring other classes (the minority). Beyond this issue, machine learning models applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented within the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgent need to mitigate discriminatory outcomes from machine learning algorithms. In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
    • Tackling class imbalance in semi-supervised learning – The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class imbalance in semi-supervised learning. We address this problem by employing data augmentation in the semi-supervised learning process in order to equalize class distributions, as sketched after this list. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
    • Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention over the last years. A growing body of pre-, in- and post-processing approaches has been proposed to mitigate algorithmic bias; however, these methods consider error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes
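A minimal sketch of the first contribution's general idea, coupling self-training with augmentation to equalise class distributions, is given below. Random oversampling stands in for the thesis's actual augmentation method, and the data, rounds, and confidence threshold are placeholders.

```python
# Self-training with a class-balanced labelled pool; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_balanced(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        # Equalise class counts in the labelled pool by random oversampling.
        classes, counts = np.unique(y_lab, return_counts=True)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y_lab == c), size=counts.max(), replace=True)
            for c in classes
        ])
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        if len(X_unlab) == 0:
            break
        # Pseudo-label the unlabelled points the model is confident about.
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        pseudo = clf.classes_[proba.argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unlab = X_unlab[~confident]
    return clf

rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(40, 5)), (rng.random(40) < 0.2).astype(int)  # imbalanced labels
X_u = rng.normal(size=(200, 5))
model = self_train_balanced(X_l, y_l, X_u)
```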

    Generative Methods, Meta-learning, and Meta-heuristics for Robust Cyber Defense

    Cyberspace is the digital communications network that supports the internet of battlefield things (IoBT), the model by which defense-centric sensors, computers, actuators, and humans are digitally connected. A secure IoBT infrastructure facilitates real-time implementation of the observe, orient, decide, act (OODA) loop across distributed subsystems. Successful hacking efforts by cyber criminals and strategic adversaries suggest that cyber systems such as the IoBT are not secure. Three lines of effort demonstrate a path towards a more robust IoBT. First, a baseline dataset of enterprise cyber network traffic was collected and modelled with generative methods, allowing the generation of realistic, synthetic cyber data. Next, adversarial examples of cyber packets were algorithmically crafted to fool network intrusion detection systems while maintaining packet functionality. Finally, a framework is presented that uses meta-learning to combine the predictive power of various weak models. This resulted in a meta-model that outperforms all baseline classifiers with respect to both overall packet-classification accuracy and adversarial-example detection rate. The National Defense Strategy underscores cybersecurity as an imperative to defend the homeland and maintain a military advantage in the information age. This research provides both academic perspective and applied techniques to further the cybersecurity posture of the Department of Defense into the information age
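The meta-learning step can be illustrated with a standard stacking ensemble, in which a meta-model is trained on the predictions of several weak base learners. The base learners, meta-model, and synthetic data below are placeholders rather than the models or traffic data used in the dissertation.

```python
# A minimal stacking sketch: weak base classifiers combined by a trained meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Placeholder imbalanced "packet" data standing in for real network traffic features.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = [("tree", DecisionTreeClassifier(max_depth=3)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0))]
meta = StackingClassifier(estimators=base, final_estimator=LogisticRegression(max_iter=1000))
meta.fit(X_tr, y_tr)
print("meta-model accuracy:", meta.score(X_te, y_te))
```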

    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Digital media enables fast sharing not only of information, but also of disinformation. One prominent case of an event leading to the circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or used proxies for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis; in particular, we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag-based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators
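For orientation, a hashtag-based baseline of the kind the neural classifier is compared against can be as simple as counting seed hashtags per stance. The hashtag lists below are illustrative placeholders, not the study's seed lists.

```python
# A toy hashtag-counting baseline; the seed hashtags are invented placeholders.
PRO_RU = {"#example_proru_tag_1", "#example_proru_tag_2"}
PRO_UA = {"#example_proua_tag_1", "#example_proua_tag_2"}

def hashtag_baseline(tweet: str) -> str:
    tags = {t.lower() for t in tweet.split() if t.startswith("#")}
    ru, ua = len(tags & PRO_RU), len(tags & PRO_UA)
    if ru > ua:
        return "pro-Russian"
    if ua > ru:
        return "pro-Ukrainian"
    return "neutral/undecided"

print(hashtag_baseline("Justice must be served #example_proua_tag_1"))
```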

    Advanced machine learning methods for oncological image analysis

    Cancer is a major public health problem, accounting for an estimated 10 million deaths worldwide in 2020 alone. Rapid advances in the field of image acquisition and hardware development over the past three decades have resulted in modern medical imaging modalities that can capture high-resolution anatomical, physiological, functional, and metabolic quantitative information from cancerous organs. Therefore, the applications of medical imaging have become increasingly crucial in the clinical routines of oncology, providing screening, diagnosis, treatment monitoring, and non/minimally-invasive evaluation of disease prognosis. The essential need for medical images, however, has resulted in the acquisition of a tremendous number of imaging scans. Considering the growing role of medical imaging data on one side and the challenges of manually examining such an abundance of data on the other, the development of computerized tools to automatically or semi-automatically examine the image data has attracted considerable interest. Hence, a variety of machine learning tools have been developed for oncological image analysis, aiming to assist clinicians with repetitive tasks in their workflow. This thesis aims to contribute to the field of oncological image analysis by proposing new ways of quantifying tumor characteristics from medical image data. Specifically, this thesis consists of six studies: the first two focus on introducing novel methods for tumor segmentation, and the last four aim to develop quantitative imaging biomarkers for cancer diagnosis and prognosis. The main objective of Study I is to develop a deep learning pipeline capable of capturing the appearance of lung pathologies, including lung tumors, and to integrate this pipeline into segmentation networks to improve segmentation accuracy. The proposed pipeline was tested on several comprehensive datasets, and the numerical quantifications show the superiority of the proposed prior-aware DL framework compared to the state of the art. Study II addresses a crucial challenge faced by supervised segmentation models: dependency on large-scale labeled datasets. In this study, an unsupervised segmentation approach based on the concept of image inpainting is proposed to segment lung and head-neck tumors in images from single and multiple modalities. The proposed autoinpainting pipeline shows great potential in synthesizing high-quality tumor-free images and outperforms a family of well-established unsupervised models in terms of segmentation accuracy. Studies III and IV aim to automatically discriminate benign from malignant pulmonary nodules by analyzing low-dose computed tomography (LDCT) scans. In Study III, a dual-pathway deep classification framework is proposed to simultaneously take into account the local intra-nodule heterogeneities and the global contextual information. Study IV compares the discriminative power of a series of carefully selected conventional radiomics methods, end-to-end Deep Learning (DL) models, and deep features-based radiomics analysis on the same dataset. The numerical analyses show the potential of fusing the learned deep features into radiomic features for boosting the classification power. Study V focuses on the early assessment of lung tumor response to the applied treatments by proposing a novel feature set that can be interpreted physiologically. This feature set was employed to quantify the changes in tumor characteristics from longitudinal PET-CT scans in order to predict the overall survival status of patients two years after the last treatment session. The discriminative power of the introduced imaging biomarkers was compared against conventional radiomics, and the quantitative evaluations verified the superiority of the proposed feature set. Whereas Study V focuses on a binary survival prediction task, Study VI addresses the prediction of survival rate in patients diagnosed with lung and head-neck cancer by investigating the potential of spherical convolutional neural networks and comparing their performance against other types of features, including radiomics. While comparable results were achieved in intra-dataset analyses, the proposed spherical-based features show more predictive power in inter-dataset analyses. In summary, the six studies incorporate different imaging modalities and a wide range of image processing and machine-learning techniques in the methods developed for the quantitative assessment of tumor characteristics, and they contribute to the essential procedures of cancer diagnosis and prognosis
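The feature-fusion idea highlighted in Study IV, concatenating learned deep features with hand-crafted radiomic features before a conventional classifier, can be sketched as follows. The feature values here are random placeholders standing in for real LDCT-derived descriptors.

```python
# Illustrative early fusion of deep features and radiomic features per nodule.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_nodules = 200
deep_feats = rng.normal(size=(n_nodules, 128))     # e.g. penultimate-layer activations
radiomic_feats = rng.normal(size=(n_nodules, 30))  # e.g. shape/texture descriptors
labels = rng.integers(0, 2, size=n_nodules)        # benign (0) vs malignant (1), placeholder

fused = np.hstack([deep_feats, radiomic_feats])    # simple early fusion by concatenation
clf = LogisticRegression(max_iter=2000)
print(cross_val_score(clf, fused, labels, cv=5).mean())
```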