303 research outputs found

The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction

Author: Berenson
Breiman
Breiman
Domingos
Drummond
Elkan
Emam
Fan
Freund
Hulse
Jiang
Khoshgoftaar
Khoshgoftaar
Khoshgoftaar
Khoshgoftaar
Khoshgoftaar
Khoshgoftaar
Lessmann
Liu
Sayyad Shirabad
Seliya
Seliya
Sun
Ting
Witten
Publication venue: 'Wiley'
Publication date: 01/09/2011

This empirical study investigates two commonly used decision tree classification algorithms in the context of cost‐sensitive learning. A review of the literature shows that the cost‐based performance of a software quality prediction model is usually determined after the model‐training process has been completed. In contrast, we incorporate cost‐sensitive learning during the model‐training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, any cost‐sensitive learning technique. The paper investigates six different cost‐sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data include 15 software measurement datasets obtained from several high‐assurance systems. In addition to providing a unique insight into the cost‐based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model‐training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost‐based performance of a defect prediction model. RUS is ranked as the best cost‐sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448–459 DOI: 10.1002/widm.38 Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/87156/1/38_ftp.pd
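To make the top-ranked technique concrete, here is a minimal sketch of random undersampling for a binary defect dataset. The abstract does not say how the sampling ratio is chosen, so the function name `random_undersample` and the `majority_per_minority` parameter are assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import Counter


def random_undersample(rows, labels, majority_per_minority=1.0, seed=0):
    """Randomly drop majority-class rows until the requested ratio is met.

    majority_per_minority is a hypothetical knob: 1.0 gives a balanced
    result, larger values keep more majority-class rows.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles binary labels only")
    (majority, n_majority), (minority, n_minority) = counts.most_common(2)
    keep_majority = min(n_majority, int(round(n_minority * majority_per_minority)))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    # Every minority-class row is kept; only the majority class is sampled.
    kept = sorted(minority_idx + rng.sample(majority_idx, keep_majority))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 8 not-fault-prone ("nfp") modules and 2 fault-prone ("fp") ones.
rows = [[i] for i in range(10)]
labels = ["nfp"] * 8 + ["fp"] * 2
sampled_rows, sampled_labels = random_undersample(rows, labels)
```

Raising `majority_per_minority` keeps more of the majority class, which is one plausible way to reflect a lower relative cost of missing a defect.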

Crossref

Deep Blue Documents at the University of Michigan

An Empirical Investigation of Filter Attribute Selection Techniques for Software Quality Classification

Author: Gao Kehan
Khoshgoftaar Taghi M.
Wang Huanjing
Publication venue: TopSCHOLAR®
Publication date: 01/08/2009

Attribute selection is an important activity in data preprocessing for software quality modeling and other data mining problems. Software quality models have been used to improve the fault detection process. Finding faulty components in a software system during the early stages of the software development process can lead to a more reliable final product and can reduce development and maintenance costs. Some studies have shown that the prediction accuracy of the models improves when irrelevant and redundant features are removed from the original data set. In this study, we investigated four filter attribute selection techniques, Automatic Hybrid Search (AHS), Rough Sets (RS), Kolmogorov-Smirnov (KS), and Probabilistic Search (PS), and conducted experiments by applying them to a very large telecommunications software system. To evaluate classification performance on the smaller subsets of attributes selected by the different approaches, we built several classification models using five different classifiers. The empirical results demonstrated that by applying an attribute selection approach we can build classification models with an accuracy comparable to that of models built with the complete set of attributes. Moreover, the smaller subset contains fewer than 15 percent of the complete set of attributes. Therefore, the metrics collection, model calibration, model validation, and model evaluation times of future software development efforts for similar systems can be significantly reduced. In addition, we demonstrated that our recently proposed attribute selection technique, KS, outperformed the other three attribute selection techniques.
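The abstract does not describe how the KS technique scores attributes. One plausible reading, assumed here, is that each attribute is scored by the two-sample Kolmogorov-Smirnov statistic between its values in fault-prone and not-fault-prone modules. The names `ks_statistic` and `rank_attributes` and the class labels are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between two empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so tied values are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def rank_attributes(columns, labels, positive="fp"):
    """Rank attributes by how well they separate the two classes."""
    scores = {}
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y == positive]
        neg = [v for v, y in zip(values, labels) if y != positive]
        scores[name] = ks_statistic(pos, neg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: "loc" separates the classes perfectly, "noise" not at all.
columns = {"loc": [10, 12, 11, 50, 55, 60], "noise": [1, 2, 3, 1, 2, 3]}
labels = ["nfp"] * 3 + ["fp"] * 3
ranking = rank_attributes(columns, labels)
```

A filter of this kind never trains a classifier, which is why it stays cheap even on a very large system.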

TopSCHOLAR

Crossref

A review of data mining using big data in health informatics

Author: Matthew Herland
Randall Wald
Taghi M Khoshgoftaar
Publication venue: Springer Nature
Publication date: 01/01/2014

Springer - Publisher Connector

High-Dimensional Software Engineering Data and Feature Selection

Author: Gao Kehan
Khoshgoftaar Taghi M.
Wang Huanjing
Publication venue: TopSCHOLAR®
Publication date: 01/11/2009

Software metrics collected during project development play a critical role in software quality assurance. A software practitioner is very keen on learning which software metrics to focus on for software quality prediction. While a concise set of software metrics is often desired, a typical project collects a very large number of metrics. Minimal attention has been devoted to finding the minimum set of software metrics that has the same predictive capability as a larger set of metrics; we strive to address that question in this paper. We present a comprehensive comparison between seven commonly used filter-based feature ranking techniques (FRT) and our proposed hybrid feature selection (HFS) technique. Our case study consists of a very high-dimensional (42 software attributes) software measurement data set obtained from a large telecommunications system. The empirical analysis indicates that HFS performs better than FRT; however, the Kolmogorov-Smirnov feature ranking technique demonstrates competitive performance. For the telecommunications system, it is found that only 10% of the software attributes are sufficient for effective software quality prediction.

TopSCHOLAR

A Comparative Study of Threshold-based Feature Selection Techniques

Author: Hulse Jason Van
Khoshgoftaar Taghi M.
Wang Huanjing
Publication venue: TopSCHOLAR®
Publication date: 01/08/2010

Given high-dimensional software measurement data, researchers and practitioners often use feature (metric) selection techniques to improve the performance of software quality classification models. This paper presents our newly proposed threshold-based feature selection techniques and compares their performance by building classification models using five commonly used classifiers. To evaluate the effectiveness of the different feature selection techniques, the models are evaluated using eight different performance metrics separately, since a given performance metric usually captures only one aspect of classification performance. All experiments are conducted on three Eclipse data sets with different levels of class imbalance. The experiments demonstrate that the choice of a performance metric may significantly influence the results. In this study, we found four distinct patterns when using eight performance metrics to order 11 threshold-based feature selection techniques. Moreover, the performance of the software quality models either improves or remains unchanged despite the removal of over 96% of the software metrics (attributes).

TopSCHOLAR

Crossref

Mining Data from Multiple Software Development Projects

Author: Gao Kehan
Khoshgoftaar Taghi M.
Seliya Naeem
Wang Huanjing
Publication venue: TopSCHOLAR®
Publication date: 01/12/2009

A large system often goes through multiple software project development cycles, in part due to changes in operation and development environments. For example, rapid turnover of the development team between releases can influence software quality, making it important to mine software project data over multiple system releases when building defect predictors. Data collection of software attributes is often conducted independently of the quality improvement goals, leading to the availability of a large number of attributes for analysis. The problems associated with variations in development process, data collection, and quality goals from one release to another emphasize the importance of selecting a best set of software attributes for software quality prediction. Moreover, it is intuitive to remove attributes that do not add to, or have an adverse effect on, the knowledge of the consequent model. Based on data from real-world software projects, we present a large case study that compares wrapper-based feature ranking techniques (WRT) and our proposed hybrid feature selection (HFS) technique. The comparison is done using both three-fold cross-validation (CV) and three-fold cross-validation with risk impact (CVR). It is shown that HFS is better than WRT, while CV is superior to CVR.

TopSCHOLAR

Crossref

Choosing software metrics for defect prediction: an investigation on feature selection techniques

Author: Aha
Aha
Berenson
Cameron
Chen
Coppin
Dash
Domingos
Fawcett
Fogarty
Forman
Furlanello
Guyon
Hall
Harman
Haykin
Hudepohl
Khoshgoftaar
Khoshgoftaar
Khoshgoftaar
Le Cessie
Lessmann
Liu
Menzies
Pfleeger
Platt
Shawe-Taylor
Tan
Votta
Witten
Wohlin
Publication venue: 'Wiley'
Publication date: 25/04/2011

The selection of software metrics for building software quality prediction models is a search-based software engineering problem. An exhaustive search for such metrics is usually not feasible due to limited project resources, especially if the number of available metrics is large. Defect prediction models are necessary to help project managers make better use of valuable project resources for software quality improvement. The efficacy and usefulness of a fault-proneness prediction model are only as good as the quality of the software measurement data. This study focuses on the problem of attribute selection in the context of software quality estimation. A comparative investigation is presented for evaluating our proposed hybrid attribute selection approach, in which feature ranking is first used to reduce the search space, followed by feature subset selection. A total of seven different feature ranking techniques are evaluated, while four different feature subset selection approaches are considered. The models are trained using five commonly used classification algorithms. The case study is based on software metrics and defect data collected from multiple releases of a large real-world software system. The results demonstrate that while some feature ranking techniques performed similarly, the automatic hybrid search algorithm performed the best among the feature subset selection methods. Moreover, performances of the defect prediction models either improved or remained unchanged when over 85% of the software metrics were eliminated. Copyright © 2011 John Wiley & Sons, Ltd. Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/83475/1/1043_ftp.pd
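The two-stage idea in this abstract (rank first to shrink the search space, then search for a subset) can be sketched as follows. The abstract does not describe its four subset selection approaches, so greedy forward selection is swapped in as a stand-in here. The function `hybrid_select`, the `evaluate` callback, and the toy scores are all hypothetical.

```python
def hybrid_select(scores, evaluate, top_k):
    """Stage 1: keep the top_k ranked features. Stage 2: greedy forward search.

    scores   -- dict mapping feature name to a ranking score (higher is better)
    evaluate -- callable taking a list of features, returning a quality value
    """
    # Stage 1: feature ranking reduces the search space.
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Stage 2: add whichever feature helps most until nothing improves.
    selected, best = [], float("-inf")
    while candidates:
        trial_best, trial_feature = best, None
        for feature in candidates:
            value = evaluate(selected + [feature])
            if value > trial_best:
                trial_best, trial_feature = value, feature
        if trial_feature is None:
            break  # no remaining candidate improves the subset
        selected.append(trial_feature)
        candidates.remove(trial_feature)
        best = trial_best
    return selected, best


# Toy evaluation: summed usefulness minus a small penalty per feature.
usefulness = {"a": 0.5, "b": 0.3, "c": 0.4}


def toy_evaluate(subset):
    return sum(usefulness.get(f, 0.0) for f in subset) - 0.05 * len(subset)


ranking_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
chosen, quality = hybrid_select(ranking_scores, toy_evaluate, top_k=2)
```

In the toy data, feature `c` is useful but ranks low, so stage 1 discards it before the search runs. That is the trade-off of pruning by rank: a cheaper search at the risk of dropping a feature that only pays off in combination.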

Crossref

Deep Blue Documents at the University of Michigan

Relative error prediction via kernel regression smoothers

Author: Chatfield
Eagleson
Fan
Fan
Farnum
Heungsun Park
Härdle
Key-Il Shin
Khoshgoftaar
Khoshgoftaar
Kitchenham
M.C. Jones
Narula
Park
Park
Renshaw
Ruppert
S.K. Vines
Seok-Oh Jeong
Shen
Simonoff
Wand
Publication venue: 'Elsevier BV'
Publication date: 01/10/2008

In this article, we introduce and study local constant and local linear nonparametric regression estimators when it is appropriate to assess performance in terms of mean squared relative error of prediction. We give asymptotic results for both boundary and non-boundary cases. These are special cases of more general asymptotic results that we provide concerning the estimation of the ratio of conditional expectations of two functions of the response variable. We also provide a good bandwidth selection method for the estimators. Examples of application, limited simulation results, and a discussion of related problems and approaches are also given.
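The "ratio of conditional expectations" can be made concrete. For a positive response Y, minimizing E[((Y - g)/Y)^2 | X = x] over g gives g(x) = E[1/Y | X = x] / E[1/Y^2 | X = x]. The sketch below is a local constant version of that ratio. The abstract gives no formulas, so the Gaussian kernel, the function name `local_constant_relative`, and the fixed bandwidth are assumptions rather than the authors' choices.

```python
import math


def local_constant_relative(xs, ys, x0, bandwidth):
    """Kernel-weighted estimate of E[1/Y | x0] / E[1/Y^2 | x0].

    Assumes every response is strictly positive, since relative error
    divides by Y. The Gaussian kernel is an illustrative choice.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if any(y <= 0 for y in ys):
        raise ValueError("relative error needs strictly positive responses")
    numerator = denominator = 0.0
    for x, y in zip(xs, ys):
        weight = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        numerator += weight / y          # estimates E[1/Y | x0]
        denominator += weight / (y * y)  # estimates E[1/Y^2 | x0]
    return numerator / denominator


# Two equally weighted responses, 1 and 2, at the same design point.
estimate = local_constant_relative([0.0, 0.0], [1.0, 2.0], x0=0.0, bandwidth=1.0)
```

With responses 1 and 2 the estimate is (1 + 0.5) / (1 + 0.25) = 1.2, below the arithmetic mean of 1.5. Relative error penalizes over-prediction of small responses more heavily, so the predictor is pulled toward the smaller value.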

Crossref

Open Research Online (The Open University)

An empirical study on developer-related factors characterizing fix-inducing commits

Author: Baeza-Yates
Bavota
Cohen
Conover
D'Ambros
Grissom
Jolliffe
Khoshgoftaar
Kim
Lehman
Williams
Zar
Zeller
Publication venue: W&M ScholarWorks
Publication date: 01/01/2017

This paper analyzes developer-related factors that could influence the likelihood for a commit to induce a fix. Specifically, we focus on factors that could potentially hinder developers' ability to correctly understand the code components involved in the change to be committed as follows: (i) the coherence of the commit (i.e., how much it is focused on a specific topic); (ii) the experience level of the developer on the files involved in the commit; and (iii) the interfering changes performed by other developers on the files involved in past commits. The results of our study indicate that 'fix-inducing' commits (i.e., commits that induced a fix) are significantly less coherent than 'clean' commits (i.e., commits that did not induce a fix). Surprisingly, 'fix-inducing' commits are performed by more experienced developers; yet, those are the developers performing more complex changes in the system. Finally, 'fix-inducing' commits have a higher number of past interfering changes as compared with 'clean' commits. Our empirical study sheds light on previously unexplored factors and presents significant results that can be used to improve approaches for defect prediction. Copyright (c) 2016 John Wiley & Sons, Ltd.

Università degli Studi del Molise: IRIS

Crossref

Archivio della Ricerca - Università di Salerno

College of William & Mary: W&M Publish

Experimental Exploration of Compact Convolutional Neural Network Architectures for Non-temporal Real-time Fire Detection

Author: Bhowmik N.
Breckon T.P.
Khoshgoftaar Taghi M.
Samarth G.
Seliya Naeem (Jim)
Wang Dingding
Wang Huanjing
Wani M. Arif
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 16/12/2019

In this work we explore different Convolutional Neural Network (CNN) architectures and their variants for non-temporal binary fire detection and localization in video or still imagery. We consider the performance of experimentally defined, reduced-complexity deep CNN architectures for this task and evaluate the effects of different optimization and normalization techniques applied to different CNN architectures (spanning the Inception, ResNet and EfficientNet architectural concepts). Contrary to contemporary trends in the field, our work illustrates a maximum overall accuracy of 0.96 for full-frame binary fire detection and 0.94 for superpixel localization using an experimentally defined reduced CNN architecture based on the concept of InceptionV4. We notably achieve a lower false positive rate of 0.06 compared to prior work in the field, presenting an efficient, robust and real-time solution for fire region detection.

Durham Research Online