The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction
This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, a cost-sensitive learning technique. The paper investigates six different cost-sensitive learning techniques: AdaCost, AdaC2, CSB2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high-assurance systems. In addition to offering a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost-based performance of a defect prediction model. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 448–459. DOI: 10.1002/widm.38
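As a minimal illustrative sketch of the study's best-ranked technique, the snippet below applies Random Undersampling before training a tree ensemble and scores the result with an asymmetric misclassification cost. It assumes scikit-learn and imbalanced-learn; the synthetic dataset, the CART-based learner standing in for C4.5, and the 10:1 cost ratio are all assumptions, not the paper's NASA/high-assurance setup.

```python
# Hedged sketch: RUS + Random Forest with cost-based evaluation.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a software measurement dataset (10% defective).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# RUS: discard majority-class (non-defective) examples until classes balance,
# so the cost concern is addressed during training rather than after it.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

model = RandomForestClassifier(random_state=0).fit(X_rus, y_rus)
tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()

# Cost-based evaluation: weight missed defects (false negatives) more heavily.
COST_FN, COST_FP = 10.0, 1.0  # assumed cost ratio, for illustration only
print("total misclassification cost:", COST_FN * fn + COST_FP * fp)
```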
Further thoughts on precision
Background: There has been much discussion among automated software defect prediction researchers regarding the use of the precision and false positive rate classifier performance metrics. Aim: To demonstrate and explain why failing to report precision when using data with highly imbalanced class distributions may provide an overly optimistic view of classifier performance. Method: Well-documented examples of how dependence on class distribution affects the suitability of performance measures. Conclusions: When using data where the minority class represents less than around 5 to 10 percent of data points in total, failing to report precision may be a critical mistake. Furthermore, deriving the precision values omitted from studies can reveal valuable insight into true classifier performance.
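A worked toy example of the abstract's point: with a heavily imbalanced class distribution, a classifier can show a seemingly low false positive rate yet very poor precision. The counts below are invented purely for illustration.

```python
# Precision collapse under class imbalance (2% minority class).
pos, neg = 200, 9800          # 200 defective, 9800 non-defective instances
recall, fpr = 0.70, 0.10      # reasonable-looking recall and FPR

tp = recall * pos             # 140 true positives
fp = fpr * neg                # 980 false positives -- the silent problem
precision = tp / (tp + fp)    # 140 / 1120 = 0.125

print(f"precision = {precision:.3f}")  # only 12.5%, despite a 10% FPR
```

With the same recall and FPR on a balanced dataset (5000/5000), precision would be 3500 / (3500 + 500) = 0.875, which is why reporting FPR alone can look deceptively good.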
A Cost-sensitive Intelligent Prediction Model for Outsourced Software Project Risk
Outsourced software projects are one of the main modes of software development, and they suffer from a high failure rate. An intelligent risk prediction model can help identify high-risk projects in time. However, existing models are mostly built on the hypothesis that all misclassification costs are equal, which is inconsistent with the reality of software project risk prediction: the cost of predicting a fail-prone project as success-prone differs from the cost of predicting a success-prone project as fail-prone. To the best of our knowledge, cost-sensitive learning has not yet been applied in the domain of outsourced software project risk management, though it has been widely used in a variety of fields. Given this situation, we selected five classifiers and introduced cost-sensitive learning to build an intelligent prediction model with each. In total, 292 real outsourced software project records were collected for modeling. Experimental results showed that, under the cost-sensitive scenario, the polynomial kernel support vector machine is the best of the five classifiers for outsourced software project risk prediction, due to its high prediction accuracy, stability, and low cost.
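A minimal sketch of a cost-sensitive polynomial-kernel SVM, the classifier the abstract ranks best. In scikit-learn, per-class weights are one standard way to fold asymmetric misclassification costs into the SVM's penalty term; the 5:1 weight ratio and the synthetic data are assumptions for illustration, not the paper's 292 real project records or its exact method.

```python
# Hedged sketch: cost-sensitive SVM via class weights.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for project risk data (class 1 = fail-prone).
X, y = make_classification(n_samples=292, weights=[0.7, 0.3], random_state=0)

# Penalize misclassifying a fail-prone project (class 1) 5x more than
# misclassifying a success-prone one -- an assumed cost ratio.
clf = SVC(kernel="poly", degree=3, class_weight={0: 1.0, 1: 5.0})
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```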
Ensemble MultiBoost based on RIPPER classifier for prediction of imbalanced software defect data
Identifying defective software entities is essential to ensure software quality during software development. However, the high dimensionality and imbalanced class distribution of software defect data seriously degrade defect prediction performance. To solve this problem, this paper proposes an Ensemble MultiBoost based on the RIPPER classifier for prediction of imbalanced software defect data, called EMR_SD. First, the algorithm uses the principal component analysis (PCA) method to select the most effective features from the original feature set, achieving dimensionality reduction and redundancy removal. Then, a combined sampling method of adaptive synthetic sampling (ADASYN) and random sampling without replacement is applied to resolve the class imbalance. The classifier establishes association rules between attributes and classes, using MultiBoost to reduce bias and variance and thereby classification error. The proposed prediction model is evaluated experimentally on the NASA MDP public datasets and compared with existing similar algorithms. The results show that the EMR_SD algorithm outperforms DNC, CEL, and other defect prediction techniques on most evaluation indicators, which demonstrates the effectiveness of the algorithm.
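A sketch of an EMR_SD-style preprocessing pipeline: PCA for dimensionality reduction, then ADASYN to rebalance classes, then a boosted classifier. Since neither RIPPER nor MultiBoost ships with scikit-learn, AdaBoost over shallow trees stands in for the boosted rule learner here; the dataset, component count, and estimator settings are all illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: PCA -> ADASYN -> boosted classifier (AdaBoost as a
# stand-in for the paper's RIPPER + MultiBoost combination).
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic high-dimensional, imbalanced stand-in for defect data.
X, y = make_classification(n_samples=1500, n_features=40,
                           weights=[0.85, 0.15], random_state=0)

X_red = PCA(n_components=10).fit_transform(X)                # drop redundancy
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_red, y) # rebalance classes

booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=100, random_state=0)
booster.fit(X_bal, y_bal)
print("training accuracy:", booster.score(X_bal, y_bal))
```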
The Integrity of Machine Learning Algorithms against Software Defect Prediction
The increased computerization in recent years has resulted in the production of a wide variety of software; however, measures need to be taken to ensure that the produced software is not defective. Many researchers have worked in this area and have developed different machine learning-based approaches that predict whether the software is defective or not. This issue cannot be resolved simply by using conventional classifiers, because the dataset is highly imbalanced: the number of defective samples detected is extremely small compared to the number of non-defective samples. Therefore, more sophisticated methods are required, and those developed by researchers can be broadly classified into resampling-based methods, cost-sensitive learning-based methods, and ensemble learning. This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al. against several classifiers, such as Logistic Regression, Support Vector Machine, Random Forest, and Naïve Bayes, after oversampling the data. OS-ELM trains faster than conventional deep neural networks and always converges to the globally optimal solution. A comparison is performed on the original dataset as well as the oversampled dataset. The oversampling technique used is Cluster-based Over-Sampling with Noise Filtering, which outperforms several state-of-the-art oversampling techniques. The analysis is carried out on three NASA projects: KC1, PC4, and PC3. The metrics used for measurement are recall and balanced accuracy. The results are higher for OS-ELM than for the other classifiers in both scenarios; a sketch of the comparison protocol follows.
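The snippet below sketches that comparison protocol: train each baseline classifier on the original and the oversampled training data, then score recall and balanced accuracy on a held-out set. OS-ELM has no scikit-learn implementation and SMOTE stands in for the report's Cluster-based Over-Sampling with Noise Filtering; the synthetic data replaces the NASA KC1/PC4/PC3 projects, so this is an assumed protocol shape, not the report's exact experiment.

```python
# Hedged sketch: baseline classifiers on original vs. oversampled data,
# scored with the report's two metrics (recall, balanced accuracy).
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE as a stand-in for cluster-based oversampling with noise filtering.
X_os, y_os = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

classifiers = [("logreg", LogisticRegression(max_iter=1000)),
               ("svm", SVC()),
               ("rf", RandomForestClassifier(random_state=0)),
               ("nb", GaussianNB())]

for name, clf in classifiers:
    for label, (Xf, yf) in [("original", (X_tr, y_tr)),
                            ("oversampled", (X_os, y_os))]:
        pred = clf.fit(Xf, yf).predict(X_te)
        print(f"{name:7s} {label:12s} "
              f"recall={recall_score(y_te, pred):.3f} "
              f"bal_acc={balanced_accuracy_score(y_te, pred):.3f}")
```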