The Prevalence of Errors in Machine Learning Experiments
Context: Conducting experiments is central to machine learning research for benchmarking, evaluating and comparing learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments. Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is simple arithmetical and statistical errors. Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., the marginal probabilities must sum to one. We also check for multiple statistical significance testing errors. Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these, 7 were statistical and 16 related to confusion matrix inconsistency (one paper contained both classes of error). Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, and thus, as a community, reduce this worryingly high error rate in our computational experiments.
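The constraint checking described above can be sketched in a few lines; this is a minimal illustration of one constraint class (complementary rates from the same confusion matrix must sum to one), with function name and tolerance chosen for illustration rather than taken from the authors' actual tooling.

```python
import math

def rates_consistent(tpr, fnr, tnr, fpr, tol=1e-3):
    """Check that reported complementary rates are internally consistent:
    TPR + FNR and TNR + FPR must each equal one (up to rounding)."""
    return (math.isclose(tpr + fnr, 1.0, abs_tol=tol)
            and math.isclose(tnr + fpr, 1.0, abs_tol=tol))

# Illustrative reported values: the second paper's figures cannot all
# come from the same confusion matrix, since 0.80 + 0.25 != 1.
print(rates_consistent(0.80, 0.20, 0.90, 0.10))  # True
print(rates_consistent(0.80, 0.25, 0.90, 0.10))  # False
```

A tolerance is needed because published results are typically rounded to two or three decimal places.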
The impact of using biased performance metrics on software defect prediction research
Context: Software engineering researchers have undertaken many experiments
investigating the potential of software defect prediction algorithms.
Unfortunately, some widely used performance metrics are known to be
problematic, most notably F1, yet it nevertheless remains in common use.
Objective: To investigate the potential impact of using F1 on the validity of
this large body of research.
Method: We undertook a systematic review to locate relevant experiments and
then extracted all pairwise comparisons of defect prediction performance using
F1 and the unbiased Matthews correlation coefficient (MCC).
Results: We found a total of 38 primary studies. These contain 12,471 pairs
of results. Of these, 21.95% changed direction when the MCC metric is used
instead of the biased F1 metric. Unfortunately, we also found evidence
suggesting that F1 remains widely used in software defect prediction research.
Conclusions: We reiterate the concerns of statisticians that F1 is a
problematic metric outside of an information retrieval context, since we are
concerned about both classes (defect-prone and not defect-prone units). This
inappropriate usage has led to a substantial number (more than one fifth) of
erroneous (in terms of direction) results. Therefore we urge researchers to (i)
use an unbiased metric and (ii) publish detailed results including confusion
matrices such that alternative analyses become possible.
Comment: Submitted to the journal Information & Software Technology. It is a
greatly extended version of "Assessing Software Defection Prediction
Performance: Why Using the Matthews Correlation Coefficient Matters"
presented at EASE 202
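A direction change of the kind counted above can be reproduced with a small sketch. The two confusion matrices below are invented for illustration (not taken from the primary studies): classifier A labels nearly everything defect-prone, which inflates F1 because F1 ignores true negatives, while MCC penalises it.

```python
import math

def f1(tp, fp, fn, tn):
    # F1 ignores tn entirely
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient uses all four cells
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Two hypothetical classifiers on the same data set
# (100 defect-prone units, 50 not defect-prone)
A = dict(tp=98, fp=50, fn=2, tn=0)    # predicts almost everything defective
B = dict(tp=70, fp=10, fn=30, tn=40)

print(f1(**A) > f1(**B))    # True: A "wins" under F1
print(mcc(**A) > mcc(**B))  # False: B wins under MCC, direction flips
```

Here A's MCC is actually negative (it does worse than chance on the minority class), yet its F1 is the higher of the two, which is exactly the kind of reversal reported in 21.95% of the extracted pairs.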
Use and misuse of the term "Experiment" in mining software repositories research
The significant momentum and importance of Mining Software Repositories (MSR) in Software Engineering (SE) has fostered new opportunities and challenges for extensive empirical research. However, MSR researchers seem to struggle to characterize the empirical methods they use within the existing empirical SE body of knowledge. This is especially the case for MSR experiments. To provide evidence on the special characteristics of MSR experiments and their differences from experiments traditionally acknowledged in SE so far, we elicited the hallmarks that differentiate an experiment from other types of empirical studies and characterized the hallmarks and types of experiments in MSR. We analyzed MSR literature obtained from a small-scale systematic mapping study to assess the use of the term experiment in MSR. We found that 19% of the papers claiming to be an experiment are in fact not experiments at all but observational studies, so they use the term in a misleading way. Of the remaining 81% of the papers, only one refers to a genuine controlled experiment while the others are experiments with limited control. MSR researchers tend to overlook such limitations, compromising the interpretation of the results of their studies. We provide recommendations and insights to support the improvement of MSR experiments.
This work has been partially supported by the Spanish project MCI PID2020-117191RB-I00. Peer reviewed. Postprint (author's final draft).
Remaining Useful Life Estimation of Bearings: Meta-Analysis of Experimental Procedure
In the domain of predictive maintenance, when trying to replicate and compare research in remaining useful life (RUL) estimation, several inconsistencies and errors were identified in the experimental methodology used by various researchers. This makes the replication and comparison of results difficult, thus severely hindering both progress in this research domain and its practical application to industry. We survey the literature to evaluate the experimental procedures that were used, and identify the most common errors and omissions in both experimental procedures and reporting.
A total of 70 papers on RUL were audited. From this meta-analysis we estimate that approximately 11% of the papers present work that will allow for replication and comparison. Surprisingly, only about 24.3% (17 of the 70 articles) compared their results with previous work. Of the remaining work, 41.4% generated and compared several models of their own and, somewhat unsettlingly, 31.4% of the researchers made no comparison whatsoever. The remaining 2.9% did not use the same data set for comparisons. The results of this study were also aggregated into 3 categories: problem class selection, model fitting best practices and evaluation best practices. We conclude that model evaluation is the most problematic one.
The main contribution of the article is a proposal of an experimental protocol and several recommendations that specifically target model evaluation. Adherence to this protocol should substantially facilitate the research and application of RUL prediction models. The goals are to promote the collaboration between scholars and practitioners alike and advance the research in this domain.