24 research outputs found

    Application of AdaBoost to Resolve Class Imbalance in Determining Student Graduation with the Decision Tree Method

    Universitas Pamulang is a university with a large student body, but its historical data show an imbalance between the numbers of students who graduate on time and those who graduate late (not on time). The decision tree method performs well in classifying on-time versus late graduation, but it is weak when the degree of class imbalance is high. This problem can be addressed with a method that rebalances the classes and improves accuracy. AdaBoost is a boosting method that can rebalance the classes by assigning weights based on the classification error, thereby changing the data distribution. Experiments were carried out by applying AdaBoost to a decision tree (DT) to obtain optimal results and good accuracy. The decision tree alone achieved an accuracy of 87.18%, an AUC of 0.864, and an RMSE of 0.320, whereas the decision tree with AdaBoost (DTBoost) achieved an accuracy of 90.45%, an AUC of 0.951, and an RMSE of 0.273. It can therefore be concluded that, for determining student graduation, the decision tree combined with AdaBoost resolves the class imbalance problem, improves accuracy, and lowers the classification error rate.
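    A minimal sketch of the kind of experiment described above, assuming scikit-learn and a synthetic imbalanced dataset in place of the Universitas Pamulang student records; the accuracy, AUC, and RMSE figures quoted in the abstract come from the paper, not from this code.

    # Decision tree vs. AdaBoost-boosted decision tree on an imbalanced binary
    # task (a stand-in for on-time vs. late graduation); synthetic data only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=10,
                               weights=[0.85, 0.15], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    models = {
        "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
        # AdaBoost re-weights misclassified samples each round, shifting the
        # effective training distribution toward the harder (minority) cases.
        "DTBoost": AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=5, random_state=0),
            n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
        rmse = np.sqrt(np.mean((y_te - proba) ** 2))  # RMSE of predicted probabilities
        print(name,
              "acc=%.3f" % accuracy_score(y_te, (proba >= 0.5).astype(int)),
              "auc=%.3f" % roc_auc_score(y_te, proba),
              "rmse=%.3f" % rmse)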

    An AUC-based Permutation Variable Importance Measure for Random Forests

    The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that, as class imbalance increases, the standard permutation VIM loses its ability to discriminate between predictors associated with the response and predictors not associated with it. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
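    The same idea can be approximated outside the authors' R implementation. The sketch below is only an analogue, assuming scikit-learn (the paper's AUC-based VIM is computed on out-of-bag samples inside the party package): it contrasts accuracy-scored and AUC-scored permutation importance for a random forest on an imbalanced held-out set.

    # Permutation importance scored by accuracy vs. by AUC on imbalanced data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=15, n_informative=4,
                               weights=[0.95, 0.05], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
    rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

    for scoring in ("accuracy", "roc_auc"):
        # Permuting a predictor breaks its association with the response; the
        # drop in the chosen score is that predictor's importance.
        imp = permutation_importance(rf, X_te, y_te, scoring=scoring,
                                     n_repeats=20, random_state=1)
        print(scoring, "top predictors:", imp.importances_mean.argsort()[::-1][:5])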

    An Approach to Tacit Knowledge Classification in a Manufacturing Company

    This article attempts to classify the tacit knowledge acquired, using the example of the research and development (R&D) department of a manufacturing company. Based on studies in the literature and direct interviews in the company analysed, the authors' own model for classifying the tacit knowledge in an R&D department was proposed. The description of this model is divided into two parts. The first part, the classification of knowledge, proceeds through three planes: (1) selection of algorithm inputs and grouping of the knowledge accumulated in the enterprise; (2) algorithm activity, that is, the use of clustering-based algorithms for the calculations; (3) interpretation of the results. A Bayesian network was used for this purpose, modelled on the defined relationships between representations of tacit knowledge. Then, on the basis of the case study, the classification of knowledge was prepared according to: (1) definition of the knowledge in the R&D department and its modelling, (2) implementation of a suitable number of training sets, (3) verification of the knowledge base, that is, declaration of the value of the observed knowledge, followed by (4) assignment of the probability of returning to the node of network clusters containing interpretations of business benefits.
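    The article's model is specific to the company studied; the toy sketch below only illustrates the general mechanism, assuming the pgmpy library and two hypothetical variables (KnowledgeCluster, BusinessBenefit): a small Bayesian network is fitted to interview-derived observations and queried for the probability of a business benefit given an observed knowledge cluster.

    # Toy Bayesian network: knowledge-cluster membership -> expected benefit.
    import pandas as pd
    from pgmpy.estimators import MaximumLikelihoodEstimator
    from pgmpy.inference import VariableElimination
    from pgmpy.models import BayesianNetwork

    # Hypothetical training set; each row is one interview-derived observation.
    data = pd.DataFrame({
        "KnowledgeCluster": ["design", "process", "design", "testing", "process"],
        "BusinessBenefit":  ["high",   "low",     "high",   "low",     "high"],
    })

    model = BayesianNetwork([("KnowledgeCluster", "BusinessBenefit")])
    model.fit(data, estimator=MaximumLikelihoodEstimator)  # learn the CPDs

    # Probability of each benefit level given an observed knowledge cluster.
    posterior = VariableElimination(model).query(
        variables=["BusinessBenefit"], evidence={"KnowledgeCluster": "design"})
    print(posterior)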

    Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets

    During the process of knowledge discovery in data, imbalanced learning data often emerge and present a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, i.e. neural networks and support vector machines, and on the classification results of classical classification methods represented by RIPPER and the Naïve Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. To measure the quality of classification, the accuracy and the area under the ROC curve (AUC) measures are used. The results of the research indicate that the neural network and the support vector machine show an improvement in the AUC measure when applied to balanced data but, at the same time, a deterioration of results in terms of classification accuracy. The RIPPER results are similar, but the changes are of a smaller magnitude, while the results of the Naïve Bayes classifier show an overall deterioration of results on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results have shown the potential of the SVM classifier for ensemble creation on imbalanced datasets.
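    A condensed sketch of this kind of comparison, assuming scikit-learn, a single synthetic imbalanced dataset instead of the 30 KEEL datasets, and simple random oversampling as the balancing step; the printed numbers are illustrative only.

    # Accuracy vs. AUC for SVM and Naive Bayes, trained on the original
    # imbalanced data and on a rebalanced (minority-oversampled) copy.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.utils import resample

    X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

    # Oversample the minority class of the training set to parity.
    X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
    X_up = resample(X_min, n_samples=len(X_maj), replace=True, random_state=2)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.hstack([np.zeros(len(X_maj), dtype=int), np.ones(len(X_up), dtype=int)])

    for name, clf in [("SVM", SVC(probability=True, random_state=2)),
                      ("NaiveBayes", GaussianNB())]:
        for label, Xt, yt in [("imbalanced", X_tr, y_tr), ("balanced", X_bal, y_bal)]:
            proba = clf.fit(Xt, yt).predict_proba(X_te)[:, 1]
            print(name, label,
                  "acc=%.3f" % accuracy_score(y_te, (proba >= 0.5).astype(int)),
                  "auc=%.3f" % roc_auc_score(y_te, proba))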

    Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?

    Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
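    The effect is easy to probe empirically; the sketch below is a simulation under an assumed Gaussian model with shared covariance (it is not the note's formal argument): LDA is trained on the original and on a fully rebalanced training set, and the two AUCs are compared on the same unbalanced test data.

    # AUC of LDA on unbalanced test data, with and without oversampling the
    # minority class of the training set to equal size.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import roc_auc_score
    from sklearn.utils import resample

    rng = np.random.default_rng(3)

    def draw(n0, n1):
        # Two Gaussian classes with a shared (identity) covariance.
        X = np.vstack([rng.normal(0.0, 1.0, size=(n0, 5)),
                       rng.normal(0.7, 1.0, size=(n1, 5))])
        y = np.hstack([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
        return X, y

    X_tr, y_tr = draw(2000, 100)   # unbalanced training data
    X_te, y_te = draw(2000, 100)   # test data keeps the original imbalance

    n_maj = int((y_tr == 0).sum())
    X_up = resample(X_tr[y_tr == 1], n_samples=n_maj, replace=True, random_state=3)
    X_bal = np.vstack([X_tr[y_tr == 0], X_up])
    y_bal = np.hstack([np.zeros(n_maj, dtype=int), np.ones(n_maj, dtype=int)])

    for label, Xt, yt in [("original", X_tr, y_tr), ("rebalanced", X_bal, y_bal)]:
        score = LinearDiscriminantAnalysis().fit(Xt, yt).decision_function(X_te)
        print(label, "AUC=%.3f" % roc_auc_score(y_te, score))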

    An Evaluation of Feature Selection Robustness on Class Noisy Data

    With the increasing growth of data dimensionality, feature selection has become a crucial step in a variety of machine learning and data mining applications. In fact, it allows identifying the most important attributes of the task at hand, improving the efficiency, interpretability, and final performance of the induced models. In recent literature, several studies have examined the strengths and weaknesses of the available feature selection methods from different points of view. Still, little work has been performed to investigate how sensitive they are to the presence of noisy instances in the input data. This is the specific field in which our work wants to make a contribution. Indeed, since noise is arguably inevitable in several application scenarios, it is important to understand the extent to which the different selection heuristics can be affected by noise, in particular class noise (which is more harmful in supervised learning tasks). Such an evaluation may be especially important in the context of class-imbalanced problems, where any perturbation in the set of training records can strongly affect the final selection outcome. In this regard, we provide a two-fold contribution by presenting (i) a general methodology to evaluate feature selection robustness on class noisy data and (ii) an experimental study that involves different selection methods, both univariate and multivariate. The experiments have been conducted on eight high-dimensional datasets chosen to be representative of different real-world domains, with interesting insights into the intrinsic degree of robustness of the considered selection approaches.
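    A minimal version of such an evaluation, assuming scikit-learn, a single univariate filter (ANOVA F-score via SelectKBest), synthetic data, and Jaccard overlap of the selected feature sets as the robustness measure; the study itself covers several univariate and multivariate selectors on eight real high-dimensional datasets.

    # Robustness of a univariate filter to class noise: Jaccard overlap between
    # the features selected on clean labels and on noise-injected labels.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(4)
    X, y = make_classification(n_samples=500, n_features=200, n_informative=15,
                               weights=[0.9, 0.1], random_state=4)

    def selected(labels, k=20):
        return set(SelectKBest(f_classif, k=k).fit(X, labels).get_support(indices=True))

    clean = selected(y)
    for noise in (0.05, 0.10, 0.20):
        y_noisy = y.copy()
        flip = rng.choice(len(y), size=int(noise * len(y)), replace=False)
        y_noisy[flip] = 1 - y_noisy[flip]          # flip a fraction of class labels
        noisy = selected(y_noisy)
        print("noise=%.2f  Jaccard stability=%.2f"
              % (noise, len(clean & noisy) / len(clean | noisy)))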

    A New Under-Sampling Method to Face Class Overlap and Imbalance

    Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining, as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm, used to remove noisy samples and clean the decision boundary, with a minimum spanning tree algorithm used to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.
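    The sketch below is only a rough approximation of the described pipeline, assuming scikit-learn and SciPy: DBSCAN is run on the majority class to discard its noise points, and a minimum spanning tree over the remaining majority samples drives a simple placeholder pruning rule (dropping points with very short incident MST edges, i.e. points in dense regions); the paper's actual second-stage criterion differs.

    # Two-stage majority-class under-sampling (approximation, not the paper's
    # exact algorithm): DBSCAN noise removal, then MST-based pruning.
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_classification
    from sklearn.metrics import pairwise_distances

    X, y = make_classification(n_samples=1500, n_features=2, n_informative=2,
                               n_redundant=0, weights=[0.9, 0.1], random_state=5)
    X_maj = X[y == 0]

    # Stage 1: DBSCAN marks isolated majority points as noise (label -1); drop them.
    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_maj)
    X_kept = X_maj[labels != -1]

    # Stage 2: build an MST over the remaining majority samples and, as a crude
    # placeholder rule, drop the quarter of points whose shortest incident MST
    # edge is smallest (i.e. points sitting in the densest regions).
    dist = pairwise_distances(X_kept)
    mst = minimum_spanning_tree(dist).toarray()
    mst = np.where(mst > 0, mst, np.inf)
    edge = np.minimum(mst.min(axis=0), mst.min(axis=1))  # shortest MST edge per point
    X_under = X_kept[edge > np.quantile(edge, 0.25)]

    print("majority: %d -> %d samples; minority kept at %d"
          % (len(X_maj), len(X_under), int((y == 1).sum())))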