3 research outputs found

    Exploring Essential Content of Defect Prediction and Effort Estimation through Data Reduction

    Mining Software Repositories provides the opportunity to explore the behaviors, distinct patterns, and features of software development processes, from which stakeholders can generate models to make estimations and predictions and to support decisions about these projects.

    When using data mining on project data in software engineering, it is important to generate models that are easy for business users to understand, so that those users can gain insight into how to improve the project. Software engineering data are often too large to discern directly. One approach to understanding the intricacies of software analytics is to reduce software engineering data to its essential content, then reason about that reduced set.

    This thesis explores methods that (a) remove spurious and redundant columns, then (b) cluster the data set and replace each cluster with one exemplar per cluster, then (c) draw conclusions by extrapolating between the exemplars (via k=2 nearest neighbor between cluster centroids).

    Numerous defect data sets were reduced to around 25 exemplars containing around 6 attributes. These tables of 25×6 values were then used for (a) effective and simple defect prediction as well as (b) simple presentation of that data. In an investigation of numerous common clustering methods, we also find that the details of the clustering method matter less than ensuring that those methods produce enough clusters (which, for defect data sets, seems to be around 25). For effort estimation data sets, conclusive results on the ideal number of clusters could not be determined due to the smaller size of those data sets.
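    The three-step pipeline in this abstract is concrete enough to sketch. Below is a minimal illustration, assuming a numeric defect table with binary labels and using scikit-learn for column pruning and clustering; all function names, thresholds, and parameter choices are hypothetical and are not taken from the thesis itself.

    # A minimal sketch of the reduction pipeline described above, assuming a
    # numeric defect table X with binary labels y. Names (reduce_table, predict,
    # n_clusters) are illustrative only; this is not the thesis's code.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import VarianceThreshold

    def reduce_table(X, y, n_clusters=25):
        """Prune columns, then keep ~n_clusters exemplars (cluster centroids)."""
        y = np.asarray(y)
        # (a) drop near-constant columns (a simple stand-in for pruning
        #     spurious/redundant attributes)
        selector = VarianceThreshold(threshold=1e-3)
        X_pruned = selector.fit_transform(X)
        # (b) cluster the rows; each cluster is replaced by its centroid,
        #     labelled with the majority class of its members
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_pruned)
        exemplars = km.cluster_centers_
        labels = np.array([y[km.labels_ == c].mean() >= 0.5 for c in range(n_clusters)])
        return selector, exemplars, labels

    def predict(x, selector, exemplars, labels, k=2):
        """(c) classify by extrapolating between the k=2 nearest exemplars."""
        x = selector.transform(np.atleast_2d(x))
        dists = np.linalg.norm(exemplars - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return labels[nearest].mean() >= 0.5  # defect-prone if most near exemplars are

    In this sketch the reduced table is just the centroid matrix plus one label per centroid, which is why a 25-cluster, 6-attribute data set collapses to a 25×6 table that can be both inspected by hand and used for prediction.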

    Evaluating defect prediction approaches: a benchmark and an extensive comparison

    Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity, and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: classification of entities as defect-prone or not, and ranking of the entities, with and without taking into account the effort to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor.
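    To make the experimental design in this abstract more tangible, the sketch below shows one of the experiment types it describes: scoring two prediction approaches on several systems with the same learner and checking whether the per-system differences are statistically significant. The metric, learner, and all names are assumptions for illustration, not the paper's benchmark code.

    # Illustrative sketch (not the paper's benchmark): compare two hypothetical
    # defect-prediction approaches across systems, then test significance with
    # a paired non-parametric test.
    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def auc(X, y):
        """Cross-validated AUC of a logistic-regression defect predictor."""
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()

    def compare(systems, features_a, features_b):
        """systems: list of (raw_data, labels); features_*: feature extractors."""
        a = [auc(features_a(raw), y) for raw, y in systems]
        b = [auc(features_b(raw), y) for raw, y in systems]
        stat, p = wilcoxon(a, b)  # are the per-system differences significant?
        return np.mean(a), np.mean(b), p

    Repeating such a comparison with several learners is one simple way to probe the stability question raised in the abstract; the paper's own evaluation also covers ranking-based and effort-aware indicators beyond the single metric used here.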