
    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
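    The comparison in this abstract (k-NN imputation followed by a decision tree versus a tree that tolerates missing values directly) can be sketched roughly as below. This is a minimal illustration, not the study's setup: scikit-learn's CART regressor stands in for C4.5, which the library does not provide, the project data are synthetic, and fitting a tree on NaN inputs assumes a recent scikit-learn release (1.3 or later).

    import numpy as np
    from sklearn.impute import KNNImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 4))                      # hypothetical project features
    y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)  # "cost"

    # Simulate an MCAR mechanism: delete 20% of the feature values at random.
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < 0.20] = np.nan

    # Pipeline A: impute with k-NN, then fit the tree.
    impute_then_tree = make_pipeline(KNNImputer(n_neighbors=5),
                                     DecisionTreeRegressor(random_state=0))
    # Pipeline B: let the tree handle the NaNs itself (C4.5-style tolerance).
    tree_only = DecisionTreeRegressor(random_state=0)

    for name, model in [("k-NN + tree", impute_then_tree), ("tree only", tree_only)]:
        mae = -cross_val_score(model, X_missing, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(f"{name}: mean absolute error = {mae:.2f}")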

    Towards an automation of the traceability of bugs from development logs: A study based on open source software

    Context: Information and tracking of defects can be severely incomplete in almost every Open Source project, resulting in reduced traceability of defects into the development logs (i.e., version control commit logs). In particular, defect data often appears out of sync with what developers logged as their actions. Synchronising or completing the missing data of the bug repositories with the logs detailing the actions of developers would benefit various branches of empirical software engineering research: prediction of software faults, software reliability, traceability, software quality, effort and cost estimation, bug prediction and bug fixing. Objective: To design a framework that automates the process of synchronising and filling the gaps of the development logs and bug issue data for open source software projects. Method: We instantiate the framework with a sample of OSS projects from GitHub, parsing, linking and filling the gaps found in their bug issue data and development logs. UML diagrams show the relevant modules that will be used to merge, link and connect the bug issue data with the development data. Results: Analysing a sample of over 300 OSS projects, we observed that around half of the bug-related data is present in either the development logs or the issue tracker logs; the rest is missing from one or the other source. We designed an automated approach that fills the gaps of either source by making use of the available data, and we successfully mapped all the missing data of the analysed projects when using one heuristic for annotating bugs. Other heuristics still need to be investigated and implemented. Conclusion: We designed a framework that synchronises the development logs and bug data used in empirical software engineering, automatically filling the missing parts of the development logs and bug issue data.
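    The linking step described in the Method can be illustrated with a small sketch. Everything below is hypothetical rather than the paper's implementation: the commit and issue records are toy dictionaries, and the single heuristic shown marks a commit as bug-related when its message references an issue number (e.g. "fixes #42").

    import re

    # One heuristic: a commit is bug-related if its message cites an issue number.
    ISSUE_REF = re.compile(r"(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)?\s*#(\d+)",
                           re.IGNORECASE)

    commits = [  # hypothetical development log entries
        {"sha": "a1b2c3", "message": "Fixes #42: null pointer in parser"},
        {"sha": "d4e5f6", "message": "Refactor build scripts"},
    ]
    issues = {42: {"title": "Parser crash on empty input", "state": "closed"}}  # hypothetical issue data

    def link_commits_to_issues(commits, issues):
        """Return (linked, unmatched) commits so gaps in either source stay visible."""
        linked, unmatched = [], []
        for commit in commits:
            refs = [int(n) for n in ISSUE_REF.findall(commit["message"])]
            hits = [issues[n] for n in refs if n in issues]
            (linked if hits else unmatched).append((commit, hits))
        return linked, unmatched

    linked, unmatched = link_commits_to_issues(commits, issues)
    print(f"{len(linked)} commit(s) linked to issues, {len(unmatched)} left to fill in")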

    Missing.... presumed at random: cost-analysis of incomplete data

    When collecting patient-level resource use data for statistical analysis, for some patients and in some categories of resource use the required count will not be observed. Although this problem must arise in most reported economic evaluations containing patient-level data, it is rare for authors to detail how the problem was overcome. Statistical packages may default to handling missing data through a so-called complete case analysis, while some recent cost-analyses have appeared to favour an available case approach. Both of these methods are problematic: complete case analysis is inefficient and is likely to be biased; available case analysis, by employing different numbers of observations for each resource use item, generates severe problems for standard statistical inference. Instead, we explore imputation methods that generate replacement values for the missing data, permitting complete case analysis on the whole data set, and we illustrate these methods using two data sets with incomplete resource use information.
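    A minimal sketch of the contrast the abstract draws, using hypothetical patient-level resource use counts and illustrative unit costs: complete case analysis discards every patient with any missing count, while a simple (unconditional mean) imputation retains all patients for the cost analysis. The paper's actual imputation methods are richer; this only shows where the approaches diverge.

    import numpy as np
    import pandas as pd

    unit_costs = pd.Series({"gp_visits": 30.0, "inpatient_days": 400.0})  # illustrative unit costs
    resource_use = pd.DataFrame({                                         # hypothetical counts per patient
        "gp_visits":      [2, 5, np.nan, 1, 3],
        "inpatient_days": [0, np.nan, 4, 1, np.nan],
    })

    # Complete case analysis: drop every patient with any missing count.
    complete = resource_use.dropna()
    cc_mean_cost = (complete * unit_costs).sum(axis=1).mean()

    # Mean imputation: replace each missing count with the item's observed mean,
    # so all patients can enter the cost analysis.
    imputed = resource_use.fillna(resource_use.mean())
    imp_mean_cost = (imputed * unit_costs).sum(axis=1).mean()

    print(f"complete case mean cost per patient: {cc_mean_cost:.2f}")
    print(f"mean-imputed mean cost per patient:  {imp_mean_cost:.2f}")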

    Warranty Data Analysis: A Review

    Warranty claims and supplementary data contain useful information about product quality and reliability. Analysing such data can therefore benefit manufacturers in identifying early warnings of abnormalities in their products, providing useful information about failure modes to aid design modification, estimating product reliability for deciding on warranty policy, and forecasting future warranty claims needed for preparing fiscal plans. In the last two decades, considerable research has been conducted in warranty data analysis (WDA) from several different perspectives. This article attempts to summarise and review the research and developments in WDA, with emphasis on models, methods and applications. It concludes with a brief discussion of current practices and possible future trends in WDA.

    Standard error estimation for EM applications related to Latent class models

    The EM algorithm is a popular method for computing maximum likelihood estimates. It tends to be numerically stable, reduces execution time compared to other estimation procedures, and is easy to implement in latent class models. However, the EM algorithm fails to provide a consistent estimator of the standard errors of maximum likelihood estimates in incomplete data applications. Correct standard errors can be obtained by numerical differentiation. The technique requires computation of a complete-data gradient vector and Hessian matrix, but not those associated with the incomplete-data likelihood. Obtaining first and second derivatives numerically is computationally very intensive, and execution time may become very expensive when fitting latent class models using a Newton-type algorithm. When the execution time is too high, one is motivated to use the EM solution to initialize the Newton-Raphson algorithm. We also investigate the effect on execution time when a final Newton-Raphson step follows the EM algorithm after convergence. In this paper we compare the standard errors provided by the EM and Newton-Raphson algorithms for two models and analyze how the bias in the EM-based standard errors is affected by the number of parameters in the fitted model.
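    The core idea (EM gives point estimates but not standard errors, which can be recovered by numerically differentiating the observed-data log-likelihood) can be sketched on a deliberately simple one-parameter problem rather than a latent class model. Everything below is illustrative: a two-component normal mixture with known component densities, EM for the mixing weight, then a numerical second derivative to obtain the observed information and hence the standard error.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    z = rng.random(500) < 0.3                      # latent class membership
    x = np.where(z, rng.normal(2.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))

    f1, f2 = norm(2.0, 1.0).pdf(x), norm(-1.0, 1.0).pdf(x)   # known component densities

    def loglik(p):
        return np.log(p * f1 + (1 - p) * f2).sum()

    # EM for the mixing weight p: the E-step computes posterior class
    # probabilities, the M-step sets p to their mean.
    p = 0.5
    for _ in range(200):
        resp = p * f1 / (p * f1 + (1 - p) * f2)
        p = resp.mean()

    # Standard error from the numerical second derivative of the observed-data
    # log-likelihood (the observed information), which EM itself never computes.
    h = 1e-5
    d2 = (loglik(p + h) - 2 * loglik(p) + loglik(p - h)) / h ** 2
    se = 1.0 / np.sqrt(-d2)
    print(f"p-hat = {p:.3f}, standard error = {se:.3f}")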

    An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems

    OBJECTIVE - the aim of this investigation is to build up a picture of the nature and type of data sets being used to develop and evaluate different software project effort prediction systems. We believe this to be important since there is a growing body of published work that seeks to assess different prediction approaches. Unfortunately, results to date are rather inconsistent, so we are interested in the extent to which this might be explained by different data sets. METHOD - we performed an exhaustive search, from 1980 onwards, of three software engineering journals for research papers that used project data sets to compare cost prediction systems. RESULTS - this identified a total of 50 papers that used, one or more times, a total of 74 unique project data sets. We observed that some of the better known and publicly accessible data sets were used repeatedly, making them potentially disproportionately influential. Such data sets also tend to be amongst the oldest, with potential problems of obsolescence. We also note that only about 70% of all data sets are in the public domain, and this can be particularly problematic when the data set description is incomplete or limited. Finally, extracting relevant information from research papers has been time consuming due to different styles of presentation and levels of contextual information. CONCLUSIONS - we believe there are two lessons to learn. First, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need to assess the way results are presented in order to facilitate meta-analysis, and whether a standard protocol would be appropriate.