
    Deep Over-sampling Framework for Classifying Imbalanced Data

    Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with the complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework that extends synthetic over-sampling to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is explicit, supervised representation learning, in which the training data pairs each raw input sample with a synthetic embedding target in the deep feature space, sampled from the linear subspace spanned by its in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings and increases the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in standard, balanced settings.
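    As a rough illustration of the target-sampling step described in this abstract, the following NumPy sketch draws a synthetic embedding target for each minority-class embedding as a random convex combination of its k nearest in-class neighbours. The function name, the Dirichlet weighting and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sample_embedding_targets(embeddings, k=5, rng=None):
    """For each embedding, draw a synthetic target from the convex hull of its
    k nearest in-class neighbours (a sketch of DOS-style target sampling).

    embeddings : (n, d) array of deep features for ONE class.
    Returns an (n, d) array of synthetic embedding targets.
    """
    rng = np.random.default_rng(rng)
    n = len(embeddings)
    # pairwise squared Euclidean distances within the class
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude the sample itself
    targets = np.empty_like(embeddings)
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]       # indices of k nearest in-class neighbours
        w = rng.dirichlet(np.ones(k))      # random convex-combination weights
        targets[i] = w @ embeddings[nbrs]  # synthetic embedding target
    return targets

# toy usage: 20 minority-class samples embedded in an 8-dimensional feature space
feats = np.random.default_rng(0).normal(size=(20, 8))
print(sample_embedding_targets(feats, k=5, rng=0).shape)   # (20, 8)
```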

    A methodology for the generation of efficient error detection mechanisms

    A dependable software system must contain error detection mechanisms and error recovery mechanisms. Software components for the detection of errors are typically designed based on a system specification or the experience of software engineers, with their efficiency typically being measured using fault injection and metrics such as coverage and latency. In this paper, we introduce a methodology for the design of highly efficient error detection mechanisms. The proposed methodology combines fault injection analysis and data mining techniques in order to generate predicates for efficient error detection mechanisms. The results presented demonstrate the viability of the methodology as an approach for the development of efficient error detection mechanisms, as the predicates generated yield a true positive rate of almost 100% and a false positive rate very close to 0% for the detection of failure-inducing states. The main advantage of the proposed methodology over current state-of-the-art approaches is that efficient detectors are obtained by design, rather than by relying on specification-based detector design or the experience of software engineers.
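    As a hedged sketch of the general idea only (not the paper's actual methodology): candidate predicates for detectors can be mined from labelled fault-injection data with an off-the-shelf decision tree, whose branches read off as detection conditions. The monitored variables and the synthetic data below are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# pretend program-state snapshots from fault-injection runs: two monitored
# variables plus a label saying whether the injected run eventually failed
buffer_len = rng.integers(0, 200, size=1000)
retry_count = rng.integers(0, 10, size=1000)
failed = (buffer_len > 128) & (retry_count > 3)   # hidden "true" failure condition

X = np.column_stack([buffer_len, retry_count])
tree = DecisionTreeClassifier(max_depth=3).fit(X, failed)

# the learned branches can be read off as candidate error-detection predicates
print(export_text(tree, feature_names=["buffer_len", "retry_count"]))
```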

    Class imbalance impact on the prediction of complications during home hospitalization: a comparative study.

    Home hospitalization (HH) is presented as a healthcare alternative capable of providing high standards of care when patients no longer need hospital facilities. Although HH appears to lower healthcare costs by shortening hospital stays and improving patients' quality of life, the lack of continuous observation at home may lead to complications in some patients. Since blood tests have been proven to provide relevant prognostic information in many diseases, this paper analyzes the impact of different sampling methods on the prediction of HH outcomes. After a first exploratory analysis, some variables extracted from routine blood tests performed at the moment of HH admission, such as hemoglobin, lymphocytes or creatinine, were found to reveal statistically significant differences between patients undergoing successful and unsuccessful HH stays. Predictive models were then built with these data in order to identify unsuccessful cases eventually needing hospital facilities. However, since such hospital admissions during HH programs are rare, their identification through conventional machine-learning approaches is challenging. Thus, several sampling strategies designed to face class imbalance were overviewed and compared. Among the analyzed approaches, over-sampling strategies, such as ROSE (Random Over-Sampling Examples) and conventional random over-sampling, showed the best performances. Nevertheless, further improvements should be proposed in the future so as to better identify those patients not benefiting from HH.
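    As a minimal illustration of conventional random over-sampling, one of the strategies compared above, the following sketch uses the imbalanced-learn package on made-up data that stands in for the blood-test variables; it is not the study's code.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # e.g. hemoglobin, lymphocytes, creatinine
y = (rng.random(500) < 0.05).astype(int)      # ~5% "unsuccessful HH stay" cases

# duplicate minority-class rows at random until both classes have equal counts
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```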

    Instance-dependent cost-sensitive learning: do we really need it?

    Traditionally, classification algorithms aim to minimize the number of errors. However, this approach can lead to sub-optimal results for the common case where the actual goal is to minimize the total cost of errors rather than their number. To address this issue, a variety of cost-sensitive machine learning techniques have been suggested, with methods developed for dealing with both class- and instance-dependent costs. In this article, we ask whether we really need instance-dependent rather than class-dependent cost-sensitive learning. To this end, we compare the effects of training cost-sensitive classifiers with instance- and class-dependent costs in an extensive empirical evaluation using real-world data from a range of application areas. We find that using instance-dependent costs instead of class-dependent costs leads to improved performance on cost-sensitive performance measures, but worse performance on cost-insensitive metrics. These results confirm that instance-dependent methods are useful for the many applications where the goal is to minimize costs.
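    The distinction between class- and instance-dependent costs can be sketched with per-example weights in scikit-learn: class-dependent costs give every error within a class the same weight, while instance-dependent costs weight each example individually. The "amount" variable below is a hypothetical stand-in for an instance-level misclassification cost, not data from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)
amount = rng.lognormal(mean=3.0, size=1000)       # per-example misclassification cost

# class-dependent: one fixed cost ratio between the two classes
clf_class = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

# instance-dependent: each example weighted by its own cost
clf_inst = LogisticRegression().fit(X, y, sample_weight=amount)

print(clf_class.coef_)
print(clf_inst.coef_)
```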

    An Examination of the Smote and Other Smote-based Techniques That Use Synthetic Data to Oversample the Minority Class in the Context of Credit-Card Fraud Classification

    This research project investigates sampling techniques that generate and use synthetic data to oversample the minority class as a means of handling the imbalanced distribution between non-fraudulent (majority class) and fraudulent (minority class) transactions in a credit-card fraud dataset. The purpose of the project is to assess the effectiveness of these techniques in the context of fraud detection, a highly imbalanced and cost-sensitive problem. Learning from highly unbalanced datasets is difficult because many traditional learning algorithms are not designed to cope with large differentials between classes. For that reason, various methods have been developed to help tackle this problem; oversampling and undersampling are examples of techniques that address class imbalance through sampling. This paper evaluates oversampling techniques that use synthetic data to balance the minority class. The idea of using synthetic data to compensate for the minority class was first proposed by Chawla et al. (2002) and is known as the Synthetic Minority Over-sampling Technique (SMOTE). Following its development, other techniques were derived from it. This paper evaluates the SMOTE technique along with other popular SMOTE-based extensions of the original technique.
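    The interpolation idea behind SMOTE (Chawla et al., 2002) can be sketched in a few lines of NumPy: a synthetic minority example is placed at a random point on the segment joining a minority sample and one of its nearest minority neighbours. This is an illustrative sketch only, not the reference implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by linear interpolation."""
    rng = np.random.default_rng(rng)
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]              # k nearest minority neighbours
    new = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))                  # pick a minority sample...
        nb = X_min[rng.choice(nbrs[i])]               # ...and one of its neighbours
        new[j] = X_min[i] + rng.random() * (nb - X_min[i])   # interpolate between them
    return new

X_fraud = np.random.default_rng(0).normal(size=(30, 5))      # minority-class examples
print(smote_like(X_fraud, n_new=100, k=5, rng=0).shape)      # (100, 5)
```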

    Applying AdaBoost to Resolve Class Imbalance in Predicting Student Graduation with the Decision Tree Method

    Universitas Pamulang is a university with a large student population, but its historical data show an imbalance between the number of students graduating on time and those graduating late (not on time). The decision tree method performs well in classifying on-time versus late graduation, but it is weak when the degree of class imbalance is high. This problem can be addressed with a method that balances the classes and increases accuracy. AdaBoost is a boosting method that can balance the classes by weighting misclassified instances, thereby changing the data distribution. Experiments were conducted by applying AdaBoost to a decision tree (DT) to obtain optimal results and good accuracy. The decision tree alone achieved an accuracy of 87.18%, an AUC of 0.864, and an RMSE of 0.320, whereas the decision tree with AdaBoost (DTBoost) achieved an accuracy of 90.45%, an AUC of 0.951, and an RMSE of 0.273. It can therefore be concluded that, for predicting student graduation, combining the decision tree with AdaBoost resolves the class imbalance problem, increases accuracy, and reduces the classification error rate.
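    A rough scikit-learn sketch of the DT versus DTBoost comparison described above, run on a made-up imbalanced dataset (the study's student records are not available here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

dt = DecisionTreeClassifier(max_depth=4, random_state=0)
# AdaBoost re-weights misclassified examples at every round, which is how it
# compensates for the skewed class distribution.
boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=4),
                             n_estimators=100, random_state=0)  # `base_estimator` in scikit-learn < 1.2

for name, model in [("DT", dt), ("DTBoost", boosted)]:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    print(name,
          "accuracy:", round(accuracy_score(y_te, model.predict(X_te)), 3),
          "AUC:", round(roc_auc_score(y_te, proba), 3))
```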

    Economic actors and the problem of externalities : could financial markets play a role in democratic backsliding?

    Purpose: Economic actors tend to exert a powerful impact on socio-economic and political developments around the globe, including producing financial and political crises in developed democracies. Approach/Methodology/Design: While a number of studies discuss the impact of finance on political and societal reality, research on the interlink between finance and democratic processes is very limited. Drawing on secondary literature and a case study of two young Central European democracies, this paper contends that there is a relationship between the financial economy and democratic backsliding. Findings: The findings challenge the existing conventional accounts of the reversal to authoritarian politics in Poland and Hungary. Practical Implications: They also identify a mismatch between the constitutional foundations for embedding the market within society and its institutions on the one hand, and the political-institutional reality in contemporary democracies on the other. Originality/Value: The research provides theoretical assumptions encouraging further study of the unwelcome externalities produced by financial markets.

    Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets

    During the process of knowledge discovery in data, imbalanced learning data often emerges and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, i.e. neural networks and support vector machines, and of classical classification methods represented by RIPPER and the Naïve Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. To measure the quality of classification, the accuracy and the area under the ROC curve (AUC) are used. The results indicate that the neural network and the support vector machine show an improvement in the AUC measure when applied to balanced data but, at the same time, a deterioration in classification accuracy. The RIPPER results are similar, although the changes are of smaller magnitude, while the results of the Naïve Bayes classifier show an overall deterioration on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results show the potential of the SVM classifier for ensemble creation on imbalanced datasets.
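    The accuracy-versus-AUC effect reported above can be sketched as follows: train placeholder classifiers on the original skewed data and on a naively re-balanced copy, then compare both measures. The dataset and classifiers below are illustrative, not those of the KEEL experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# naive balancing: duplicate minority training examples up to parity
idx_min = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(idx_min, size=(y_tr == 0).sum() - len(idx_min))
X_bal, y_bal = np.vstack([X_tr, X_tr[extra]]), np.concatenate([y_tr, y_tr[extra]])

for clf in (SVC(probability=True, random_state=0), GaussianNB()):
    for label, (Xi, yi) in [("imbalanced", (X_tr, y_tr)), ("balanced", (X_bal, y_bal))]:
        model = clf.fit(Xi, yi)
        p = model.predict_proba(X_te)[:, 1]
        print(type(clf).__name__, label,
              "acc:", round(accuracy_score(y_te, model.predict(X_te)), 3),
              "AUC:", round(roc_auc_score(y_te, p), 3))
```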

    Out of the crisis. A radical change of strategy for the eurozone

    The paper argues that the crisis, mistakenly interpreted as a standard fiscal/balance-of-payments problem, was generated by the incomplete nature of the European institutions and a disregard for the consequences of differences in the stages of development of the member countries. The ideological preconception that markets are self-equilibrating through price competition has been used to justify disastrous internal devaluation policies, in the belief that an austerity regime associated with institutions close to those assumed to prevail in ‘core’ countries would create the ‘right’ environment for resuming growth in the periphery. An analysis of the main phases of the development of European countries since the Second World War provides evidence of wide differences in the productive structures of the countries of the centre and the southern periphery of Europe at the start of the Europeanization process. These differences entailed an asymmetric capacity of countries at differing levels of development to adjust to external shocks. This longer-term perspective helps us to better assess the limitations of the two alternatives that have been suggested to steer the EZ economy out of its present quagmire: internal devaluation (wage flexibility) in the deficit (Southern European) countries, or expansion of internal demand in the ‘core’ countries (Germany). Neither measure, it is argued, goes to the root of the development and debt-sustainability problems of Southern European countries, which continue to lack a sufficiently broad and differentiated productive structure. Given the differences in the levels of development of the various EU countries and their varying capacities to cope with change, fiscal policy should be assigned two complementary roles: actively promoting, through investment, the removal of development bottlenecks and the renewal of the productive base; and a redistributive and compensative function. This new strategy entails assigning strategic importance to investment guidance by the State through industrial policies geared to diversifying, innovating and strengthening the economic structures of peripheral countries. The paper concludes that this change of strategy is even more important today, since the crisis marks another important structural break in world trade, similar to those of the 1970s and the first decade of the new millennium.

    On the suitability of resampling techniques for the class imbalance problem in credit scoring

    In real-life credit scoring applications, the case in which the class of defaulters is under-represented in comparison with the class of non-defaulters is very common, but it has still received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have been artificially modified to derive different imbalance ratios (proportions of defaulter and non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance obtained with the original imbalanced data. It is also important to note that, in general, over-sampling techniques perform better than any under-sampling approach. This work has been partially supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.
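    A hypothetical sketch of how such artificial imbalance ratios might be derived from a credit data set: keep all non-defaulters and subsample defaulters until the minority share matches a chosen ratio. The data and ratios below are illustrative, not those of the paper.

```python
import numpy as np
from sklearn.datasets import make_classification

# placeholder "credit" data: class 1 plays the role of defaulters
X, y = make_classification(n_samples=10000, weights=[0.7, 0.3], random_state=0)

def subsample_to_ratio(X, y, minority_fraction, rng=0):
    """Return a copy of (X, y) whose class-1 (defaulter) share equals minority_fraction."""
    rng = np.random.default_rng(rng)
    maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
    n_min = int(minority_fraction / (1 - minority_fraction) * len(maj))
    keep = np.concatenate([maj, rng.choice(mino, size=n_min, replace=False)])
    return X[keep], y[keep]

for frac in (0.20, 0.10, 0.05, 0.02):
    _, y_r = subsample_to_ratio(X, y, frac)
    print(f"target {frac:.0%} -> actual {y_r.mean():.2%} defaulters, n={len(y_r)}")
```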