Exploring Essential Content of Defect Prediction and Effort Estimation through Data Reduction

Abstract

Mining software repositories offers the opportunity to explore the behaviors, distinct patterns, and features of software development processes. Stakeholders can use these to build models that support estimation, prediction, and decision making on their projects.

When applying data mining to software engineering project data, it is important to generate models that business users can easily understand. Business users should be able to gain insight from these models into how to improve the project. Software engineering data sets are often too large to comprehend directly. One approach to understanding the intricacies of software analytics is to reduce software engineering data to its essential content and then reason about that reduced set.

This thesis explores methods for (a) removing spurious and redundant columns, then (b) clustering the data set and replacing each cluster with one exemplar, then (c) drawing conclusions by extrapolating between the exemplars (via k=2 nearest neighbor between cluster centroids).

Numerous defect data sets were reduced to around 25 exemplars containing around 6 attributes. These tables of 25*6 values were then used for (a) effective and simple defect prediction as well as (b) simple presentation of that data. Also, in an investigation of numerous common clustering methods, we find that the details of the clustering method are less important than ensuring that those methods produce enough clusters (which, for defect data sets, seems to be around 25 clusters). For effort estimation data sets, the ideal number of clusters could not be determined conclusively because the data sets are smaller.
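The three-step reduction described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name a column-selection method, a clustering algorithm, or a weighting scheme, so this sketch assumes constant-column removal as a stand-in for step (a), plain k-means for step (b), and inverse-distance weighting of the two nearest centroids for step (c). All function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drop_constant_columns(rows):
    """Step (a), simplified: keep only the columns whose values vary.
    (Assumption: a stand-in for real spurious/redundant column removal.)"""
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def assign(rows, centroids):
    """Group row indices by their nearest centroid."""
    groups = [[] for _ in centroids]
    for i, row in enumerate(rows):
        nearest = min(range(len(centroids)), key=lambda c: dist(row, centroids[c]))
        groups[nearest].append(i)
    return groups


def kmeans_exemplars(rows, targets, k, iters=20, seed=1):
    """Step (b): cluster with k-means (assumed algorithm) and return one
    exemplar per non-empty cluster as (centroid, mean target)."""
    centroids = random.Random(seed).sample(rows, k)
    width = len(rows[0])
    for _ in range(iters):
        for c, members in enumerate(assign(rows, centroids)):
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [
                    sum(rows[i][j] for i in members) / len(members)
                    for j in range(width)
                ]
    exemplars = []
    for c, members in enumerate(assign(rows, centroids)):
        if members:
            mean_target = sum(targets[i] for i in members) / len(members)
            exemplars.append((centroids[c], mean_target))
    return exemplars


def predict(exemplars, row):
    """Step (c): k=2 nearest neighbor between centroids, combined by
    inverse-distance weighting (assumed). Needs at least two exemplars."""
    (d1, y1), (d2, y2) = sorted((dist(row, c), y) for c, y in exemplars)[:2]
    if d1 == 0:
        return y1  # the row coincides with an exemplar
    w1, w2 = 1 / d1, 1 / d2
    return (w1 * y1 + w2 * y2) / (w1 + w2)
```

For a defect data set, `rows` would hold the numeric attributes of each module and `targets` the defect labels; calling `kmeans_exemplars` with `k` near 25 would produce a reduced table of roughly the size reported in the abstract, and `predict` would then score new modules against that table alone.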
