18 research outputs found

    Searching for rules to detect defective modules: A subgroup discovery approach

    Data mining methods in software engineering are becoming increasingly important because they can support several aspects of the software development life-cycle, such as quality. In this work, we present a data mining approach to induce rules, extracted from static software metrics, that characterise fault-prone modules. Due to the special characteristics of defect prediction data (imbalance, inconsistency, redundancy), not all classification algorithms can handle this task well. To deal with these problems, Subgroup Discovery (SD) algorithms can be used to find groups of statistically different data given a property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery), an SD algorithm based on evolutionary computation that induces rules describing only fault-prone modules. Rules are a well-known model representation that can be easily understood and applied by project managers and quality engineers. Thus, rules can help them to develop software systems that can be justifiably trusted. Contrary to other approaches in SD, our algorithm has the advantage of working with continuous variables, as the conditions of the rules are defined using intervals. We describe the rules obtained by applying our algorithm to seven publicly available datasets from the PROMISE repository, showing that they are capable of characterising subgroups of fault-prone modules. We also compare our results with three other well-known SD algorithms, and the EDER-SD algorithm performs well in most cases. Ministerio de Educación y Ciencia TIN2007-68084-C02-00; Ministerio de Educación y Ciencia TIN2010-21715-C02-0
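
To make the idea of interval-based rules concrete, here is a minimal sketch of how a rule over continuous metrics could be scored. It is not EDER-SD itself: the abstract does not say which quality measure the algorithm optimises, so the sketch assumes weighted relative accuracy (WRAcc), a measure commonly used in subgroup discovery. The metric names (`loc`, `cyclomatic`), the rule, and the data are hypothetical.

```python
from typing import Dict, List, Tuple

# A rule maps a metric name to a closed interval (low, high).
Rule = Dict[str, Tuple[float, float]]


def covers(rule: Rule, module: Dict[str, float]) -> bool:
    """True if every metric named in the rule falls inside its interval."""
    return all(low <= module[name] <= high for name, (low, high) in rule.items())


def wracc(rule: Rule, modules: List[Dict[str, float]], labels: List[int]) -> float:
    """Weighted relative accuracy: coverage * (precision - base defect rate).

    labels use 1 for fault-prone and 0 for non-fault-prone (an assumption).
    """
    n = len(modules)
    covered = [lab for mod, lab in zip(modules, labels) if covers(rule, mod)]
    if n == 0 or not covered:
        return 0.0
    coverage = len(covered) / n
    precision = sum(covered) / len(covered)
    base_rate = sum(labels) / n
    return coverage * (precision - base_rate)


# Hypothetical toy data: four modules described by two static metrics.
modules = [
    {"loc": 120, "cyclomatic": 15},
    {"loc": 30, "cyclomatic": 2},
    {"loc": 200, "cyclomatic": 22},
    {"loc": 45, "cyclomatic": 3},
]
labels = [1, 0, 1, 0]

# Hypothetical rule: "100 <= loc <= 500 AND 10 <= cyclomatic <= 50".
rule = {"loc": (100, 500), "cyclomatic": (10, 50)}
score = wracc(rule, modules, labels)
print(score)
```

A positive score means the rule covers a subgroup whose defect rate is higher than the dataset's overall rate; an evolutionary search would vary the interval bounds to look for rules with high scores.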

    Choosing software metrics for defect prediction: an investigation on feature selection techniques

    The selection of software metrics for building software quality prediction models is a search-based software engineering problem. An exhaustive search for such metrics is usually not feasible due to limited project resources, especially if the number of available metrics is large. Defect prediction models are necessary to help project managers make better use of valuable project resources for software quality improvement. The efficacy and usefulness of a fault-proneness prediction model is only as good as the quality of the software measurement data. This study focuses on the problem of attribute selection in the context of software quality estimation. A comparative investigation is presented for evaluating our proposed hybrid attribute selection approach, in which feature ranking is first used to reduce the search space, followed by feature subset selection. A total of seven different feature ranking techniques are evaluated, while four different feature subset selection approaches are considered. The models are trained using five commonly used classification algorithms. The case study is based on software metrics and defect data collected from multiple releases of a large real-world software system. The results demonstrate that while some feature ranking techniques performed similarly, the automatic hybrid search algorithm performed the best among the feature subset selection methods. Moreover, performances of the defect prediction models either improved or remained unchanged when over 85% of the software metrics were eliminated. Copyright © 2011 John Wiley & Sons, Ltd. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/83475/1/1043_ftp.pd
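
The two-stage idea (rank first to shrink the search space, then search over subsets) can be sketched as below. This is only an illustration under stated assumptions, not the study's method: the ranker is absolute Pearson correlation and the subset search is greedy forward selection, both swapped in for simplicity. The metric names, data, and `toy_score` function (a stand-in for a cross-validated model score) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(columns: Dict[str, List[float]], target: List[float], keep: int) -> List[str]:
    """Stage 1 (filter): keep the `keep` features most correlated with the target."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)
    return ranked[:keep]


def forward_select(candidates: List[str], evaluate: Callable[[List[str]], float]) -> List[str]:
    """Stage 2 (subset search): greedily add the feature that most improves the score."""
    chosen: List[str] = []
    best = evaluate(chosen)
    improved = True
    while improved:
        improved = False
        pick = None
        for name in candidates:
            if name in chosen:
                continue
            score = evaluate(chosen + [name])
            if score > best:
                best, pick, improved = score, name, True
        if pick is not None:
            chosen.append(pick)
    return chosen


# Hypothetical metrics for four modules; target is 1 for defective modules.
columns = {
    "loc": [10, 200, 30, 400],
    "cyclomatic": [1, 20, 3, 35],
    "author_id": [1, 2, 2, 1],
}
target = [0, 1, 0, 1]


def toy_score(subset: List[str]) -> float:
    """Hypothetical stand-in for model performance: utility minus a size penalty."""
    utility = {"cyclomatic": 0.3, "loc": 0.05}
    return sum(utility.get(name, 0.0) for name in subset) - 0.1 * len(subset)


shortlist = rank_features(columns, target, keep=2)
selected = forward_select(shortlist, toy_score)
print(shortlist, selected)
```

In practice the `evaluate` callable would train and validate a classifier on the candidate subset, which is why reducing the candidate list in stage 1 matters: each evaluation is expensive.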

    Experience: Quality benchmarking of datasets used in software effort estimation

    Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used to both improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets

    Exploring Essential Content of Defect Prediction and Effort Estimation through Data Reduction

    Mining Software Repositories provides the opportunity to explore the behaviors, distinct patterns and features of software development processes, from which stakeholders can generate models to perform estimations, make predictions and take decisions on these projects. When using data mining on project data in software engineering, it is important to generate models that are easy for business users to understand. The business users should be able to gain insight on how to improve the project using these models. Software engineering data are often too large to discern. To understand the intricacies of software analytics, one approach is to reduce software engineering data to its essential content, then reason about that reduced set. This thesis explores methods for (a) removing spurious and redundant columns, then (b) clustering the data set and replacing each cluster by one exemplar per cluster, then (c) making conclusions by extrapolating between the exemplars (via k=2 nearest neighbor between cluster centroids). Numerous defect data sets were reduced to around 25 exemplars containing around 6 attributes. These tables of 25 × 6 values were then used for (a) effective and simple defect prediction as well as (b) simple presentation of that data. Also, in an investigation of numerous common clustering methods, we find that the details of the clustering method are less important than ensuring that those methods produce enough clusters (which, for defect data sets, seems to be around 25 clusters). For effort estimation data sets, conclusive results for the ideal number of clusters could not be determined due to the smaller size of the data sets.
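
Step (c), extrapolating between the two nearest exemplars, could look roughly like the sketch below. The abstract does not specify how the two neighbors are combined, so inverse-distance weighting is an assumption here, as are the exemplar values and the `predict` helper name.

```python
import math
from typing import List, Sequence, Tuple

# An exemplar is (centroid vector, value to predict), e.g. a cluster's defect rate.
Exemplar = Tuple[Sequence[float], float]


def predict(row: Sequence[float], exemplars: List[Exemplar]) -> float:
    """Interpolate between the two exemplars nearest to `row` (k=2).

    The closer exemplar gets the larger weight (inverse-distance weighting,
    an assumption of this sketch).
    """
    if len(exemplars) < 2:
        raise ValueError("need at least two exemplars")
    ranked = sorted(exemplars, key=lambda e: math.dist(row, e[0]))
    (c1, y1), (c2, y2) = ranked[:2]
    d1, d2 = math.dist(row, c1), math.dist(row, c2)
    if d1 + d2 == 0:
        # Both exemplars coincide with the row: fall back to the plain mean.
        return (y1 + y2) / 2
    return (y1 * d2 + y2 * d1) / (d1 + d2)


# Hypothetical reduced table: three exemplars over two attributes.
exemplars = [
    ((0.0, 0.0), 0.0),
    ((10.0, 0.0), 1.0),
    ((100.0, 100.0), 0.5),
]
estimate = predict((2.5, 0.0), exemplars)
print(estimate)
```

Because the reduced table is small (around 25 rows in the thesis), a linear scan over all exemplars is enough; no spatial index is needed.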

    Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques

    Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s and has been reviewed several times. An SDEE model is appropriate if it provides both accuracy and confidence before a software project contract is signed. Due to the uncertain nature of development estimates, and in order to increase accuracy, researchers have recently focused on machine learning techniques. Choosing the most effective features is crucial to achieving higher accuracy in machine learning. In this paper, to narrow the semantic gap in SDEE, a hierarchical method of filter and wrapper Feature Selection (FS) techniques and a fused measurement criterion are developed in a two-phase approach. In the first phase, two-stage filter FS methods provide start sets for wrapper FS techniques. In the second phase, a fused criterion is proposed for measuring accuracy in wrapper FS techniques. Experimental results show the validity and efficiency of the proposed approach for SDEE over a variety of standard datasets.

    Selecting Best Practices for Effort Estimation

    Evaluation of Classifiers in Software Fault-Proneness Prediction

    The reliability of software depends on its fault-prone modules: the fewer fault-prone units the software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in a piece of software, it becomes possible to judge its reliability. In predicting fault-prone modules, software metrics are among the contributing features by which one can classify modules into fault-prone and non-fault-prone ones. To make such a classification, we investigated 17 classifiers whose features (attributes) are software metrics (39 metrics) and whose instances (software modules) come from 13 datasets reported by NASA. However, two important issues influence prediction accuracy when data mining methods are used: (1) selecting the best or most influential features (i.e. software metrics) when there is a wide diversity of them, and (2) instance sampling to balance imbalanced data, since with two imbalanced classes the classifier is biased towards the majority class. Based on feature selection and instance sampling, we considered 4 scenarios in appraising the 17 classifiers for predicting fault-prone modules. To select features, we used Correlation-based Feature Selection (CFS), and to sample instances we used the Synthetic Minority Oversampling Technique (SMOTE). Empirical results showed that suitable sampling of software modules significantly influences the accuracy of predicting software reliability, but metric selection has no considerable effect on the prediction.
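
For readers unfamiliar with SMOTE, the core step is to create synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below shows only that step, under simplifying assumptions (Euclidean distance, numeric features, a fixed seed); the function name, parameters, and data are hypothetical and this is not the implementation used in the study.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def smote_sketch(minority: List[Point], n_new: int, k: int = 2, seed: int = 0) -> List[Point]:
    """Generate `n_new` synthetic points from the minority class.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic: List[Point] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical fault-prone modules described by (loc, cyclomatic complexity).
fault_prone = [(120.0, 15.0), (200.0, 22.0), (150.0, 18.0)]
new_points = smote_sketch(fault_prone, n_new=5)
print(len(new_points))
```

The synthetic points are added only to the training data, so the classifier sees a more balanced class distribution while the test data keep their original imbalance.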