
    The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction

    This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, a cost-sensitive learning technique. The paper investigates six different cost-sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data include 15 software measurement datasets obtained from several high-assurance systems. In addition to providing a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost-based performance of a defect prediction model. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448–459 DOI: 10.1002/widm.38. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/87156/1/38_ftp.pd
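Random Undersampling, the technique this abstract ranks best, simply discards majority-class examples until the classes are balanced before the tree is trained. A minimal sketch (toy data and the function name are illustrative, not from the paper):

```python
import random

def random_undersample(samples, labels, seed=0):
    """Balance a binary dataset by undersampling the majority class.

    A toy illustration of RUS; a real study would apply this before
    fitting C4.5 or Random Forest to the training split only.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    kept = rng.sample(majority, len(minority))  # drop the surplus
    idx = sorted(minority + kept)
    return [samples[i] for i in idx], [labels[i] for i in idx]

# 10 non-defective (0) vs 2 defective (1) modules
X = [[m] for m in range(12)]
y = [0] * 10 + [1] * 2
Xb, yb = random_undersample(X, y)
print(sum(yb), len(yb) - sum(yb))  # 2 2 — classes now balanced
```

Unlike the cost-weighting techniques (AdaCost, MetaCost, Weighting), this changes the training distribution itself rather than the loss.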

    Further thoughts on precision

    Background: There has been much discussion amongst automated software defect prediction researchers regarding the use of the precision and false positive rate classifier performance metrics. Aim: To demonstrate and explain why failing to report precision when using data with highly imbalanced class distributions may provide an overly optimistic view of classifier performance. Method: Well-documented examples of how class distribution affects the suitability of performance measures. Conclusions: When using data where the minority class represents less than around 5 to 10 percent of data points in total, failing to report precision may be a critical mistake. Furthermore, deriving the precision values omitted from studies can reveal valuable insight into true classifier performance. Peer reviewed. Final Accepted Version
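The abstract's central point can be made concrete with a small confusion-matrix calculation: at low prevalence, a seemingly modest false positive rate still swamps the true positives, so precision collapses. The numbers below are hypothetical, chosen only to illustrate the effect:

```python
def precision_and_fpr(tp, fp, tn, fn):
    """Precision and false positive rate from a binary confusion matrix."""
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    return precision, fpr

# Hypothetical dataset: 50 defective vs 950 clean modules (5% minority).
# The classifier finds 40 of the 50 defects but mislabels 95 clean modules.
prec, fpr = precision_and_fpr(tp=40, fp=95, tn=855, fn=10)
print(round(fpr, 2))   # 0.1 — looks excellent reported on its own
print(round(prec, 2))  # 0.3 — fewer than a third of flagged modules are defective
```

Reporting only the 10% false positive rate would paint an overly optimistic picture; the derived precision tells the practitioner what actually happens when flagged modules are inspected.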

    A Cost-sensitive Intelligent Prediction Model for Outsourced Software Project Risk

    Outsourcing is one of the main ways software is developed, and outsourced software projects have a high failure rate. An intelligent risk prediction model can help identify high-risk projects in time. However, existing models are mostly based on the hypothesis that all misclassification costs are equal, which is inconsistent with the reality of software project risk prediction: the cost of predicting a fail-prone project as success-prone differs from the cost of predicting a success-prone project as fail-prone. To the best of our knowledge, the cost-sensitive learning method has not yet been applied in the domain of outsourced software project risk management, though it has been widely used in a variety of other fields. Given this situation, we selected five classifiers and introduced a cost-sensitive learning method to build an intelligent prediction model with each. This paper collected 292 real records of outsourced software projects for modeling. Experimental results showed that, under the cost-sensitive scenario, the polynomial-kernel support vector machine is the best of the five classifiers for outsourced software project risk prediction, owing to its high prediction accuracy, stability, and low cost
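The asymmetric-cost argument at the heart of this abstract can be sketched as a scoring function: two models with identical accuracy can have very different total costs once missing a fail-prone project is penalised more heavily than a false alarm. The 5:1 cost ratio below is an assumption for illustration, not a figure from the paper:

```python
def total_cost(actual, predicted, cost_fn=5.0, cost_fp=1.0):
    """Total misclassification cost with asymmetric penalties.

    Labels: 1 = fail-prone project, 0 = success-prone project.
    cost_fn penalises predicting a fail-prone project as success-prone,
    which the abstract argues is costlier than the reverse (cost_fp).
    The 5:1 ratio here is illustrative only.
    """
    cost = 0.0
    for y, p in zip(actual, predicted):
        if y == 1 and p == 0:
            cost += cost_fn    # missed a risky project
        elif y == 0 and p == 1:
            cost += cost_fp    # false alarm on a healthy project
    return cost

# Two models, one error each (same accuracy), very different costs:
y_true = [1, 1, 0, 0]
print(total_cost(y_true, [0, 1, 0, 0]))  # 5.0 — misses a fail-prone project
print(total_cost(y_true, [1, 1, 1, 0]))  # 1.0 — a false alarm is cheaper
```

Under such a metric, model selection (here, among the five classifiers) is driven by expected cost rather than raw accuracy.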

    Ensemble multiboost based on ripper classifier for prediction of imbalanced software defect data

    Identifying defective software entities is essential to ensure software quality during software development. However, the high dimensionality and imbalanced class distribution of software defect data seriously affect software defect prediction performance. To solve this problem, this paper proposes an Ensemble MultiBoost based on the RIPPER classifier for prediction of imbalanced software defect data, called EMR_SD. First, the algorithm uses principal component analysis (PCA) to find the most effective features among the original features of the dataset, achieving dimensionality reduction and redundancy removal. Next, a combined sampling method of adaptive synthetic sampling (ADASYN) and random sampling without replacement is applied to solve the class imbalance problem. The classifier establishes association rules based on attributes and classes, using MultiBoost to reduce bias and variance and thereby reduce classification error. The proposed prediction model is evaluated experimentally on the NASA MDP public datasets and compared with existing similar algorithms. The results show that the EMR_SD algorithm is superior to DNC, CEL and other defect prediction techniques on most evaluation indicators, which demonstrates the effectiveness of the algorithm
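ADASYN, one half of the combined sampling step, decides how many synthetic minority samples to create per minority point: points surrounded by more majority-class neighbours are "harder" and receive more synthesis. A minimal sketch of that allocation rule (the hardness values and function name are illustrative, not taken from the EMR_SD experiments):

```python
def adasyn_allocation(n_majority, n_minority, hardness, beta=1.0):
    """Number of synthetic samples ADASYN assigns to each minority point.

    hardness[i] is the fraction of point i's k nearest neighbours that
    belong to the majority class; harder points get more synthesis.
    beta=1.0 targets a fully balanced class distribution.
    """
    g_total = (n_majority - n_minority) * beta   # samples to generate
    norm = sum(hardness) or 1.0                  # normalise the weights
    return [round(h / norm * g_total) for h in hardness]

# 90 clean vs 10 defective modules; illustrative per-point hardness
alloc = adasyn_allocation(90, 10, hardness=[0.2, 0.8, 1.0, 0.6, 0.4,
                                            0.2, 0.8, 1.0, 0.6, 0.4])
print(sum(alloc))  # 80 synthetic defects, balancing the two classes
```

The synthetic points themselves are then interpolated between each minority point and its minority-class neighbours; random sampling without replacement trims the majority side, as in the combined scheme the abstract describes.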

    The Integrity of Machine Learning Algorithms against Software Defect Prediction

    The increased computerization of recent years has resulted in the production of a wide variety of software; however, measures need to be taken to ensure that the produced software isn't defective. Many researchers have worked in this area and have developed different machine learning-based approaches that predict whether software is defective or not. This issue can't be resolved simply by using conventional classifiers, because the dataset is highly imbalanced: the number of defective samples detected is far smaller than the number of non-defective samples. Therefore, more sophisticated methods are required. The methods developed by researchers can be broadly classified into resampling-based methods, cost-sensitive learning-based methods, and ensemble learning. This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al. against several classifiers, such as Logistic Regression, Support Vector Machine, Random Forest, and Naïve Bayes, after oversampling the data. OS-ELM trains faster than conventional deep neural networks and always converges to the globally optimal solution. A comparison is performed on the original dataset as well as the over-sampled dataset. The oversampling technique used is Cluster-based Over-Sampling with Noise Filtering, which outperforms several state-of-the-art oversampling techniques. The analysis is carried out on 3 NASA projects: KC1, PC4 and PC3. The metrics used for measurement are recall and balanced accuracy. The results are higher for OS-ELM than for the other classifiers in both scenarios. Comment: 7 pages, 4 figures
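The report's two evaluation metrics are both derived from the confusion matrix, and both are chosen because plain accuracy is misleading on imbalanced defect data. A minimal sketch (the confusion-matrix numbers are hypothetical, not results from KC1, PC4 or PC3):

```python
def recall_and_balanced_accuracy(tp, fp, tn, fn):
    """Recall and balanced accuracy from a binary confusion matrix.

    Recall (true positive rate) measures how many defective modules are
    caught; balanced accuracy averages the per-class rates, so it is not
    inflated by the large non-defective majority the way accuracy is.
    """
    recall = tp / (tp + fn)        # sensitivity on the defective class
    specificity = tn / (tn + fp)   # true negative rate on clean modules
    return recall, (recall + specificity) / 2

# Hypothetical confusion matrix on an imbalanced project dataset
rec, bal = recall_and_balanced_accuracy(tp=30, fp=50, tn=400, fn=20)
print(round(rec, 2))  # 0.6
print(round(bal, 2))  # 0.74
```

Note that plain accuracy on the same matrix would be (30 + 400) / 500 = 0.86, flattered by the 450 non-defective modules, which is exactly why the report reports balanced accuracy instead.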