
    Evaluating prediction systems in software project estimation

    This is the pre-print version of the article. Copyright © 2012 Elsevier. Context: Software engineering has a problem in that, when we empirically evaluate competing prediction systems, we obtain conflicting results. Objective: To reduce the inconsistency amongst validation study results and to provide a more formal foundation for interpreting results, with a particular focus on continuous prediction systems. Method: A new framework is proposed for evaluating competing prediction systems, based upon (1) an unbiased statistic, Standardised Accuracy; (2) testing the likelihood of a result relative to the baseline technique of random 'predictions', that is, guessing; and (3) calculation of effect sizes. Results: Previously published empirical evaluations of prediction systems are re-examined, and the original conclusions are shown to be unsafe. Additionally, even the strongest results are shown to have no more than a medium effect size relative to random guessing. Conclusions: Biased accuracy statistics such as MMRE are deprecated. By contrast, this new empirical validation framework leads to meaningful results. Such steps will assist in performing future meta-analyses and in providing more robust and usable recommendations to practitioners. Martin Shepperd was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/H050329.
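
    For concreteness, the framework's headline statistic can be sketched compactly: SA = (1 - MAR / MAR_P0) x 100, where MAR is the mean absolute residual of the predictions and MAR_P0 is the mean absolute residual of random guessing. The sketch below is a minimal illustration assuming NumPy; approximating the guessing baseline by shuffling the actual values is an assumption for brevity, and the paper defines the exact procedure.

```python
import numpy as np

def mar(actual, predicted):
    """Mean Absolute Residual: mean of |actual - predicted|."""
    return np.mean(np.abs(np.asarray(actual, float) - np.asarray(predicted, float)))

def standardised_accuracy(actual, predicted, n_runs=1000, seed=0):
    """SA = (1 - MAR / MAR_P0) * 100, where MAR_P0 is estimated by random
    guessing: each project is 'predicted' with the actual effort of another,
    randomly chosen project (approximated here by shuffling the actuals)."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, float)
    mar_p0 = np.mean([mar(actual, rng.permutation(actual)) for _ in range(n_runs)])
    return (1.0 - mar(actual, predicted) / mar_p0) * 100.0
```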

    Adopting the appropriate performance measures for soft computing based estimation by analogy

    Soft-computing-based estimation by analogy is an attractive research domain for the software engineering community, and a considerable number of models have been proposed in this area. Researchers are therefore interested in comparing these models to identify the best one for software development effort estimation. This research shows that most studies use the mean magnitude of relative error (MMRE) and percentage of predictions (PRED) to compare their estimation models, even though renowned authors have raised substantial criticisms of accuracy statistics such as MMRE and PRED. In particular, MMRE has been found to be an unbalanced, biased, and inappropriate performance measure for identifying the best among competing estimation models. Domain researchers nevertheless still adopt MMRE and PRED in their evaluation criteria, citing only that they are "widely used", which is not a valid reason. This study finds that researchers keep adopting these measures because no practical replacement for MMRE and PRED has been provided so far. This paper tries the approach of partitioning a large dataset into subsamples, using an estimation-by-analogy (EBA) model. One small and one large dataset were considered, namely Desharnais and ISBSG Release 11, the latter being large relative to the former. The ISBSG dataset was partitioned into subsamples. The results suggest that, when the large dataset is partitioned, MMRE produces the same, or nearly the same, results as it produces for the small dataset, indicating that MMRE can be trusted as a performance metric if large datasets are partitioned into subsamples.
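
    For reference, the two accuracy statistics under discussion have simple standard definitions. A minimal sketch, assuming NumPy (note that MRE is undefined when an actual effort is zero); the study's subsampling step then amounts to splitting the large dataset, e.g. with np.array_split:

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, level=0.25):
    """PRED(l): proportion of projects whose relative error is at most `level`
    (PRED(25) corresponds to level=0.25)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual <= level)
```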

    Evaluating an automated procedure of machine learning parameter tuning for software effort estimation

    Software effort estimation requires accurate prediction models. Machine learning algorithms have been used to create more accurate estimation models. However, these algorithms are sensitive to factors such as the choice of hyper-parameters. To reduce this sensitivity, automated approaches for hyper-parameter tuning have recently been investigated. There is a need for further research on the effectiveness of such approaches in the context of software effort estimation. These evaluations could help us understand which hyper-parameter settings can be adjusted to improve model accuracy, and in which specific contexts tuning can benefit model performance. The goal of this work is to develop an automated procedure for machine learning hyper-parameter tuning in the context of software effort estimation. The automated procedure builds and evaluates software effort estimation models to determine the most accurate evaluation schemes. The methodology followed in this work consists of first performing a systematic mapping study to characterize existing hyper-parameter tuning approaches in software effort estimation, then developing the procedure to automate the evaluation of hyper-parameter tuning, and finally conducting controlled quasi-experiments to evaluate the automated procedure. From the systematic literature mapping we discovered that the effort estimation literature has favored the use of grid search. The results we obtained in our quasi-experiments demonstrate that fast, less exhaustive tuners are viable in place of grid search. These results indicate that randomly evaluating 60 hyper-parameter configurations can be as good as grid search, and that multiple state-of-the-art tuners were more effective than this random search in only 6% of the evaluated dataset-model combinations. We endorse random search, genetic algorithms, FLASH, differential evolution, tabu search, and harmony search as effective tuners.
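
    The "60 random configurations" finding has a direct practical reading: random search with a budget of 60 evaluations is a strong default tuner. A hedged sketch using scikit-learn follows; the SVR model and the search space below are illustrative assumptions, not the study's actual setup.

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Hypothetical search space; X (project features) and y (effort) are assumed given.
param_distributions = {
    "C": loguniform(1e-2, 1e3),
    "epsilon": loguniform(1e-3, 1e0),
    "gamma": loguniform(1e-4, 1e1),
}

search = RandomizedSearchCV(
    SVR(),
    param_distributions,
    n_iter=60,                          # 60 random configurations, per the finding above
    scoring="neg_mean_absolute_error",  # effort estimation favours absolute-error metrics
    cv=3,
    random_state=0,
)
# search.fit(X, y); search.best_params_ then holds the selected configuration.
```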

    Multi-objective software effort estimation

    We introduce a bi-objective effort estimation algorithm that combines Confidence Interval Analysis and assessment of Mean Absolute Error. We evaluate our proposed algorithm against three alternative formulations, baseline comparators, and current state-of-the-art effort estimators, applied to five real-world datasets from the PROMISE repository, involving 724 different software projects in total. The results reveal that our algorithm outperforms the baseline, the state of the art, and all three alternative formulations statistically significantly (p < 0.001) and with large effect size (Â12 ≥ 0.9) over all five datasets. We also provide evidence that our algorithm establishes a new state of the art, which lies within currently claimed industrial human-expert-based thresholds, thereby demonstrating that our findings have actionable conclusions for practicing software engineers.
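
    The effect size reported here is the Vargha-Delaney Â12 statistic. A small, self-contained sketch of its usual definition (when the compared measure is an error, where lower is better, Â12 is read in the direction of improvement):

```python
def a12(treatment, control):
    """Vargha-Delaney A12: probability that a value drawn from `treatment`
    exceeds one drawn from `control`, counting ties as one half.
    0.5 means no difference; >= 0.71 is conventionally a large effect."""
    wins = sum(t > c for t in treatment for c in control)
    ties = sum(t == c for t in treatment for c in control)
    return (wins + 0.5 * ties) / (len(treatment) * len(control))
```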

    Investigating the Effectiveness of Clustering for Story Point Estimation

    Automated techniques for estimating Story Points (SP) for user stories in agile software development came to the fore a decade ago. Yet the accuracy of state-of-the-art estimation techniques has room for improvement. In this paper, we present a new approach to SP estimation based on analysing the textual features of software issues using latent Dirichlet allocation (LDA) and clustering. We first use LDA to represent issue reports in a new space of generated topics. We then use hierarchical clustering to agglomerate issues into clusters based on their topic similarities. Next, we build estimation models using the issues in each cluster. Finally, we find the cluster closest to a newly arriving issue and use that cluster's model to estimate its SP. Our approach is evaluated on a dataset of 26 open-source projects with a total of 31,960 issues and compared against both baselines and state-of-the-art SP estimation techniques. The results show that the estimation performance of our proposed approach is as good as the state of the art. However, none of these approaches is statistically significantly better than more naive estimators in all cases, which does not justify their additional complexity. We therefore encourage future work to develop alternative strategies for story point estimation. The experimental data and scripts we used in this work are publicly available to allow for replication and extension.
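
    A rough sketch of the pipeline described above, assuming scikit-learn; the vectorizer choice, topic count, and cluster count are illustrative assumptions rather than the paper's settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

def cluster_issues(issue_texts, n_topics=20, n_clusters=5):
    """Represent issue reports as LDA topic distributions, then agglomerate
    issues into clusters of topically similar reports."""
    counts = CountVectorizer(stop_words="english").fit_transform(issue_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(counts)                 # issues in topic space
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(topics)
    return topics, labels

# A per-cluster SP model is then trained on each cluster's issues, and a new
# issue is routed to the cluster whose issues are closest in topic space.
```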

    Multi-Objective Software Effort Estimation: A Replication Study

    Replication studies increase our confidence in previous results when the findings are similar each time, and they help mature our knowledge by addressing both internal and external validity aspects. However, such studies are still rare in certain software engineering fields. In this paper, we replicate and extend a previous study of CoGEE, the current state of the art in multi-objective software effort estimation. We investigate the original research questions with an independent implementation and the inclusion of a more robust baseline (LP4EE), carried out by the first author, who was not involved in the original study. Through this replication, we strengthen both the internal and external validity of the original study. We also answer two new research questions investigating the effectiveness of CoGEE with four additional evolutionary algorithms (i.e., IBEA, MOCell, NSGA-III, SPEA2) and a well-known Java framework for evolutionary computation, JMetal (rather than the previously used R software), which allows us to further strengthen the external validity of the original study. The results of our replication confirm that: (1) CoGEE outperforms both the baseline and state-of-the-art benchmarks statistically significantly (p < 0.001); (2) CoGEE's multi-objective nature is what enables it to reach such good performance; (3) CoGEE's estimation errors lie within claimed industrial human-expert-based thresholds. Moreover, our new results show that the effectiveness of CoGEE is generally neither limited to nor dependent on the choice of the multi-objective algorithm. Using CoGEE with NSGA-II, NSGA-III, or MOCell produces human-competitive results in less than a minute. The Java version of CoGEE decreases the running time by over 99.8% with respect to its R counterpart. We have made the Java code of CoGEE publicly available to ease its adoption, as well as the data used in this study, to allow for future replication and extension of our work.
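
    The replication swaps different evolutionary algorithms into the same bi-objective formulation (confidence interval and mean absolute error). As a rough Python analogue of that setup using pymoo (the study itself uses the jMetal Java framework, and the objectives below are simplified stand-ins for CoGEE's actual formulation, over toy data):

```python
import numpy as np
from pymoo.core.problem import Problem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

class BiObjectiveEffort(Problem):
    """Search linear-model coefficients minimising (1) the mean absolute
    error and (2) the width of the residuals' confidence interval."""
    def __init__(self, X, y):
        self.X, self.y = X, y
        super().__init__(n_var=X.shape[1], n_obj=2, xl=-10.0, xu=10.0)

    def _evaluate(self, W, out, *args, **kwargs):
        residuals = self.y[None, :] - W @ self.X.T      # (population, projects)
        mae = np.mean(np.abs(residuals), axis=1)
        ci = 1.96 * residuals.std(axis=1) / np.sqrt(residuals.shape[1])
        out["F"] = np.column_stack([mae, ci])

rng = np.random.default_rng(0)
X, y = rng.random((50, 3)), 100 * rng.random(50)        # hypothetical project data
res = minimize(BiObjectiveEffort(X, y), NSGA2(pop_size=50), ("n_gen", 100), seed=1)
# res.F holds the Pareto front of (MAE, CI-width) trade-offs; NSGA-III, MOCell,
# or SPEA2 could be dropped in the same way.
```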

    Linear Programming as a Baseline for Software Effort Estimation

    Software effort estimation studies still suffer from discordant empirical results (i.e., conclusion instability), mainly due to the lack of rigorous benchmarking methods. So far, only one baseline model, the Automatically Transformed Linear Model (ATLM), has been proposed, yet it has not been extensively assessed. In this article, we propose a novel method based on linear programming (dubbed Linear Programming for Effort Estimation, LP4EE) and carry out a thorough empirical study to evaluate the effectiveness of both LP4EE and ATLM for benchmarking widely used effort estimation techniques. The results of our study confirm the need to benchmark every new proposal against accurate and robust baselines. They also reveal that LP4EE is more accurate than ATLM in 17% of the experiments and more robust than ATLM against different data splits and cross-validation methods in 44% of the cases. These results suggest that using LP4EE as a baseline can help reduce conclusion instability. We make an open-source implementation of LP4EE publicly available in order to facilitate its adoption in future studies.
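
    The core idea of an LP-based baseline can be stated compactly: fitting linear coefficients that minimise the sum of absolute errors is a classic linear program. A minimal sketch with SciPy, as an illustration of the idea rather than LP4EE's exact formulation, which is given in the article:

```python
import numpy as np
from scipy.optimize import linprog

def lp_fit(X, y):
    """Fit w minimising sum_i |y_i - X_i w| as a linear program.
    Decision variables are [w, e], with constraints e_i >= |y_i - X_i w|."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])   # minimise the sum of e
    A_ub = np.block([[-X, -np.eye(n)],              # y - Xw <= e
                     [ X, -np.eye(n)]])             # Xw - y <= e
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * n   # w free, e non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]                                # fitted coefficients
```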