
    New Data Mining Techniques for Social and Healthcare Sciences

    Data mining is an analytic process for discovering systematic relationships between variables and for finding patterns in data. Using those findings, data mining can create predictive models (e.g., target variable forecasting, label classification) or identify distinct groups within data (e.g., clustering). The principal objective of this dissertation is to develop data mining algorithms that outperform conventional data mining techniques in the social and healthcare sciences. Toward this objective, the dissertation develops two data mining techniques, each of which addresses the limitations of a conventional technique when applied in these contexts.

    The first part (Part I) addresses the problem of identifying important factors that promote or hinder population growth. Previous studies of this problem included input factors without considering the statistical dependence among them; as a result, most exhibit multicollinearity between the input variables. We propose a novel methodology that, even in the presence of multicollinearity among input factors, can (1) identify significant factors affecting population growth and (2) rank these factors by their level of influence on population growth. To measure the influence of each input factor, the proposed method combines decision tree clustering with Cohen's d index. We applied the method to a real county-level United States dataset and determined the level of influence of an extensive list of input factors on population growth. Among other findings, we show that poverty ratio is a highly important factor for population growth, whereas no previous study had found it significant because of its strong linear relationship with other input factors.

    The second part (Part II) proposes a classification method for imbalanced data, i.e., data where the majority class has significantly more instances than the minority class. The specific problem addressed is that conventional classification methods have poor minority-class detection performance on imbalanced datasets, since they tend to classify the vast majority of test instances as majority instances. To address this problem, we developed a guided undersampling method that combines two instance-selection techniques, ensemble outlier filtering and normalized-cut sampling, to obtain a clean and well-represented subset of the original training instances. Our proposed imbalanced-data classification method uses guided undersampling to select the training data and then applies support vector machines to the sampled data to construct the classification model (i.e., decide the final class boundary). Our computational results show that the proposed method outperforms several state-of-the-art imbalanced-data classification methods, including cost-sensitive, sampling, and synthetic data generation approaches, on eleven open datasets, most of them related to the healthcare sciences.
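
    The abstract names Cohen's d as the effect-size measure behind the factor ranking in Part I. Below is a minimal sketch of that ranking step; the simple threshold split into high- and low-growth counties stands in for the dissertation's decision-tree clustering, and the `pop_growth` column name and threshold value are illustrative assumptions, not the published method:

```python
import numpy as np
import pandas as pd

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d: standardized mean difference between two groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def rank_factors(df: pd.DataFrame, growth_col: str = "pop_growth",
                 threshold: float = 0.0) -> pd.Series:
    """Rank numeric input factors by |Cohen's d| between high- and low-growth
    groups. The threshold split is an assumed stand-in for the dissertation's
    decision-tree clustering step."""
    high = df[df[growth_col] > threshold]
    low = df[df[growth_col] <= threshold]
    factors = [c for c in df.columns if c != growth_col]
    scores = {f: abs(cohens_d(high[f].to_numpy(), low[f].to_numpy()))
              for f in factors}
    return pd.Series(scores).sort_values(ascending=False)
```

    Because Cohen's d is computed per factor against the two growth groups, a factor can rank highly even when it is strongly correlated with other inputs, which is how the method sidesteps the multicollinearity problem described above.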

    An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

    In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head, or most frequent) classes have sufficient samples, the minority (the tail, or less frequent or rare) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which reduces the feature space, is a conventional technique for reducing noise and inconsistencies in a dataset. However, the possible synergy between feature selection and data re-sampling for high-performance imbalance classification has rarely been investigated. To address this issue, we carry out a comprehensive empirical study on the joint influence of feature selection and re-sampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification, applying feature selection either before or after data re-sampling. We conduct a large number of experiments, with a total of 9225 tests, on 52 publicly available datasets, using 9 feature selection methods, 6 re-sampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines; thus both should be considered when deriving the best-performing model for imbalance classification. We find that the performance of an imbalance classification model depends not only on the classifier adopted and the ratio between majority and minority samples, but also on the ratio between the number of samples and the number of features. Overall, this study should provide new reference value for researchers and practitioners in imbalance learning.
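
    A minimal sketch of the two opposite pipelines the study compares, written with scikit-learn and imbalanced-learn. The specific choices of SelectKBest, SMOTE, and a decision tree are illustrative stand-ins, not the paper's exact 9 selectors, 6 samplers, and 3 classifiers:

```python
# Sketch: feature selection before vs. after re-sampling on imbalanced data.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class imbalanced dataset (90% majority, 10% minority).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Pipeline A: feature selection first, then re-sampling.
fs_then_rs = Pipeline([("fs", SelectKBest(f_classif, k=10)),
                       ("rs", SMOTE(random_state=0)),
                       ("clf", DecisionTreeClassifier(random_state=0))])

# Pipeline B: re-sampling first, then feature selection.
rs_then_fs = Pipeline([("rs", SMOTE(random_state=0)),
                       ("fs", SelectKBest(f_classif, k=10)),
                       ("clf", DecisionTreeClassifier(random_state=0))])

for name, pipe in [("FS -> RS", fs_then_rs), ("RS -> FS", rs_then_fs)]:
    score = cross_val_score(pipe, X, y, scoring="f1", cv=5).mean()
    print(f"{name}: mean F1 = {score:.3f}")
```

    Swapping the order of the "fs" and "rs" steps is the entire experimental manipulation; everything else is held fixed. Note that imbalanced-learn's Pipeline applies the sampler only during fitting, so cross-validation scores are computed on unresampled test folds.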

    Deep Over-sampling Framework for Classifying Imbalanced Data

    Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures, such as synthetic over-sampling, have had limited success with the complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework that extends synthetic over-sampling to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is explicit, supervised representation learning, in which the training data pairs each raw input sample with a synthetic embedding target in the deep feature space, sampled from the linear subspace of its in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings and thereby increases the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in the standard, balanced settings.
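
    A rough sketch of the target-sampling idea only, not the full DOS framework: for each minority embedding, a synthetic target is drawn from its k in-class nearest neighbors in the deep feature space. Using a random convex combination of the neighbors, and the value of k, are assumptions about how the linear subspace is sampled; the iterative CNN retraining loop from the paper is omitted:

```python
import numpy as np

def dos_targets(feats: np.ndarray, k: int = 5, rng=None) -> np.ndarray:
    """For each in-class deep-feature vector, sample a synthetic embedding
    target from its k nearest in-class neighbors. The random convex
    combination is an assumed reading of "linear subspace" sampling."""
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise Euclidean distances within the class.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    # Random convex weights over each sample's neighbors.
    w = rng.random((feats.shape[0], k))
    w /= w.sum(axis=1, keepdims=True)
    # Each target is a weighted average of its neighbors' embeddings.
    return np.einsum("ik,ikd->id", w, feats[nn])
```

    Pulling each embedding toward a mixture of its in-class neighbors is what shrinks in-class variance over the iterations described in the abstract.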