The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction
This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, a cost-sensitive learning technique. The paper investigates six different cost-sensitive learning techniques: AdaCost, AdaC2, CSB2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high-assurance systems. In addition to offering a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost-based performance of a defect prediction model. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 448–459. DOI: 10.1002/widm.38
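As a minimal illustrative sketch of the study's best-ranked technique, the snippet below applies Random Undersampling before training a tree ensemble and scores the result with an asymmetric misclassification cost. It assumes scikit-learn and imbalanced-learn; the synthetic dataset, the CART-based learner standing in for C4.5, and the 10:1 cost ratio are all assumptions, not the paper's NASA/high-assurance setup.

```python
# Hedged sketch: RUS + Random Forest with cost-based evaluation.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a software measurement dataset (10% defective).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# RUS: discard majority-class (non-defective) examples until classes balance,
# so the cost concern is addressed during training rather than after it.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

model = RandomForestClassifier(random_state=0).fit(X_rus, y_rus)
tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()

# Cost-based evaluation: weight missed defects (false negatives) more heavily.
COST_FN, COST_FP = 10.0, 1.0  # assumed cost ratio, for illustration only
print("total misclassification cost:", COST_FN * fn + COST_FP * fp)
```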
Further thoughts on precision
Background: There has been much discussion among automated software defect prediction researchers regarding the use of the precision and false positive rate classifier performance metrics. Aim: To demonstrate and explain why failing to report precision when using data with highly imbalanced class distributions may provide an overly optimistic view of classifier performance. Method: Well-documented examples of how dependence on class distribution affects the suitability of performance measures. Conclusions: When using data where the minority class represents less than around 5 to 10 percent of data points in total, failing to report precision may be a critical mistake. Furthermore, deriving the precision values omitted from studies can reveal valuable insight into true classifier performance.
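A worked toy example of the abstract's point: with a heavily imbalanced class distribution, a classifier can show a seemingly low false positive rate yet very poor precision. The counts below are invented purely for illustration.

```python
# Precision collapse under class imbalance (2% minority class).
pos, neg = 200, 9800          # 200 defective, 9800 non-defective instances
recall, fpr = 0.70, 0.10      # reasonable-looking recall and FPR

tp = recall * pos             # 140 true positives
fp = fpr * neg                # 980 false positives -- the silent problem
precision = tp / (tp + fp)    # 140 / 1120 = 0.125

print(f"precision = {precision:.3f}")  # only 12.5%, despite a 10% FPR
```

With the same recall and FPR on a balanced dataset (5000/5000), precision would be 3500 / (3500 + 500) = 0.875, which is why reporting FPR alone can look deceptively good.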
A Cost-sensitive Intelligent Prediction Model for Outsourced Software Project Risk
Outsourced software projects are one of the main modes of software development, and they suffer from a high failure rate. An intelligent risk prediction model can help identify high-risk projects in time. However, existing models are mostly built on the hypothesis that all misclassification costs are equal, which is inconsistent with the reality of software project risk prediction: the cost of predicting a fail-prone project as success-prone differs from the cost of predicting a success-prone project as fail-prone. To the best of our knowledge, cost-sensitive learning has not yet been applied in the domain of outsourced software project risk management, though it has been widely used in a variety of fields. Given this situation, we selected five classifiers and introduced cost-sensitive learning to build an intelligent prediction model with each. In total, 292 real outsourced software project records were collected for modeling. Experimental results showed that, under the cost-sensitive scenario, the polynomial kernel support vector machine is the best of the five classifiers for outsourced software project risk prediction, due to its high prediction accuracy, stability, and low cost.
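A minimal sketch of a cost-sensitive polynomial-kernel SVM, the classifier the abstract ranks best. In scikit-learn, per-class weights are one standard way to fold asymmetric misclassification costs into the SVM's penalty term; the 5:1 weight ratio and the synthetic data are assumptions for illustration, not the paper's 292 real project records or its exact method.

```python
# Hedged sketch: cost-sensitive SVM via class weights.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for project risk data (class 1 = fail-prone).
X, y = make_classification(n_samples=292, weights=[0.7, 0.3], random_state=0)

# Penalize misclassifying a fail-prone project (class 1) 5x more than
# misclassifying a success-prone one -- an assumed cost ratio.
clf = SVC(kernel="poly", degree=3, class_weight={0: 1.0, 1: 5.0})
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```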
Ensemble MultiBoost based on RIPPER classifier for prediction of imbalanced software defect data
Identifying defective software entities is essential to ensure software quality during software development. However, the high dimensionality and imbalanced class distribution of software defect data seriously degrade defect prediction performance. To solve this problem, this paper proposes an Ensemble MultiBoost based on the RIPPER classifier for prediction of imbalanced software defect data, called EMR_SD. First, the algorithm uses the principal component analysis (PCA) method to select the most effective features from the original feature set, achieving dimensionality reduction and redundancy removal. Then, a combined sampling method of adaptive synthetic sampling (ADASYN) and random sampling without replacement is applied to resolve the class imbalance. The classifier establishes association rules between attributes and classes, using MultiBoost to reduce bias and variance and thereby classification error. The proposed prediction model is evaluated experimentally on the NASA MDP public datasets and compared with existing similar algorithms. The results show that the EMR_SD algorithm outperforms DNC, CEL, and other defect prediction techniques on most evaluation indicators, which demonstrates the effectiveness of the algorithm.
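A sketch of an EMR_SD-style preprocessing pipeline: PCA for dimensionality reduction, then ADASYN to rebalance classes, then a boosted classifier. Since neither RIPPER nor MultiBoost ships with scikit-learn, AdaBoost over shallow trees stands in for the boosted rule learner here; the dataset, component count, and estimator settings are all illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: PCA -> ADASYN -> boosted classifier (AdaBoost as a
# stand-in for the paper's RIPPER + MultiBoost combination).
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic high-dimensional, imbalanced stand-in for defect data.
X, y = make_classification(n_samples=1500, n_features=40,
                           weights=[0.85, 0.15], random_state=0)

X_red = PCA(n_components=10).fit_transform(X)                # drop redundancy
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_red, y) # rebalance classes

booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=100, random_state=0)
booster.fit(X_bal, y_bal)
print("training accuracy:", booster.score(X_bal, y_bal))
```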
The Integrity of Machine Learning Algorithms against Software Defect Prediction
The increased computerization in recent years has resulted in the production of a wide variety of software; however, measures need to be taken to ensure that the produced software is not defective. Many researchers have worked in this area and have developed different machine learning-based approaches that predict whether the software is defective or not. This issue cannot be resolved simply by using conventional classifiers, because the dataset is highly imbalanced: the number of defective samples detected is extremely small compared to the number of non-defective samples. Therefore, more sophisticated methods are required, and those developed by researchers can be broadly classified into resampling-based methods, cost-sensitive learning-based methods, and ensemble learning. This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al. against several classifiers, such as Logistic Regression, Support Vector Machine, Random Forest, and Naïve Bayes, after oversampling the data. OS-ELM trains faster than conventional deep neural networks and always converges to the globally optimal solution. A comparison is performed on the original dataset as well as the oversampled dataset. The oversampling technique used is Cluster-based Over-Sampling with Noise Filtering, which outperforms several state-of-the-art oversampling techniques. The analysis is carried out on three NASA projects: KC1, PC4, and PC3. The metrics used for measurement are recall and balanced accuracy. The results are higher for OS-ELM than for the other classifiers in both scenarios; a sketch of the comparison protocol follows.
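The snippet below sketches that comparison protocol: train each baseline classifier on the original and the oversampled training data, then score recall and balanced accuracy on a held-out set. OS-ELM has no scikit-learn implementation and SMOTE stands in for the report's Cluster-based Over-Sampling with Noise Filtering; the synthetic data replaces the NASA KC1/PC4/PC3 projects, so this is an assumed protocol shape, not the report's exact experiment.

```python
# Hedged sketch: baseline classifiers on original vs. oversampled data,
# scored with the report's two metrics (recall, balanced accuracy).
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE as a stand-in for cluster-based oversampling with noise filtering.
X_os, y_os = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

classifiers = [("logreg", LogisticRegression(max_iter=1000)),
               ("svm", SVC()),
               ("rf", RandomForestClassifier(random_state=0)),
               ("nb", GaussianNB())]

for name, clf in classifiers:
    for label, (Xf, yf) in [("original", (X_tr, y_tr)),
                            ("oversampled", (X_os, y_os))]:
        pred = clf.fit(Xf, yf).predict(X_te)
        print(f"{name:7s} {label:12s} "
              f"recall={recall_score(y_te, pred):.3f} "
              f"bal_acc={balanced_accuracy_score(y_te, pred):.3f}")
```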