Reliability and validity in comparative studies of software prediction models
Empirical studies of software prediction models do not converge on the question "which prediction model is best?", and the reason for this lack of convergence is poorly understood. In this simulation study, we examine a frequently used research procedure comprising three main ingredients: a single data sample, an accuracy indicator, and cross validation. Typically, such empirical studies compare a machine learning model with a regression model, so in our study we use simulation to compare one model of each kind. The results suggest that it is the research procedure itself that is unreliable, and this unreliability may strongly contribute to the lack of convergence. Our findings thus cast doubt on the conclusions of any study of competing software prediction models that used this research procedure as the basis of model comparison. We therefore need to develop more reliable research procedures before we can have confidence in the conclusions of comparative studies of software prediction models.
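The procedure the abstract describes can be illustrated with a minimal sketch. Everything here is hypothetical: a synthetic data sample, two stand-in predictors (ordinary least squares and 1-nearest-neighbour), and MMRE as the accuracy indicator, with repeated hold-out splits playing the role of cross validation.

```python
# Hypothetical sketch of the research procedure under study: one data sample,
# one accuracy indicator (MMRE), and repeated train/validation splits used to
# declare a "winner" between two prediction models.
import random

random.seed(1)

def simulate_sample(n=60):
    # synthetic sample; assumed underlying truth: effort = 2*size + noise
    data = []
    for _ in range(n):
        x = random.uniform(10, 100)
        data.append((x, 2.0 * x + random.gauss(0, 15)))
    return data

def fit_regression(train):
    # ordinary least squares for y = a + b*x
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest_neighbour(train):
    # 1-NN: predict the y of the closest training case
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mmre(model, test):
    # mean magnitude of relative error: the accuracy indicator
    return sum(abs(y - model(x)) / abs(y) for x, y in test) / len(test)

sample = simulate_sample()
wins = {"regression": 0, "1-NN": 0}
for _ in range(30):                        # repeated hold-out splits
    random.shuffle(sample)
    train, test = sample[:40], sample[40:]
    r = mmre(fit_regression(train), test)
    k = mmre(fit_nearest_neighbour(train), test)
    wins["regression" if r < k else "1-NN"] += 1
print(wins)  # tally of wins by accuracy indicator
```

Rerunning the loop with a different seed, or a different split size, is enough to probe how sensitive the declared winner is to the sampling, which is the reliability concern the abstract raises.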
Comparing software prediction techniques using simulation
The need for accurate software prediction systems increases as software becomes larger and more complex. We believe that the underlying characteristics of the data set: size, number of features, type of distribution, etc., influence the choice of prediction system. For this reason, we would like to control the characteristics of such data sets in order to systematically explore the relationship between accuracy, choice of prediction system, and data set characteristics. It would also be useful to have a large validation data set. Our solution is to simulate data, allowing both control and the possibility of large (1,000-case) validation sets. We compare four prediction techniques: regression, rule induction, nearest neighbor (a form of case-based reasoning), and neural nets. The results suggest that there are significant differences depending upon the characteristics of the data set; consequently, researchers should consider the prediction context when evaluating competing prediction systems. We observed that the messier the data and the more complex its relationship with the dependent variable, the more variable the results. In the more complex cases, results differed significantly depending upon the particular training set sampled from the underlying data set. However, our most important result is that it is more fruitful to ask which prediction system is best in a particular context rather than which is the "best" prediction system overall.
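The simulation design described above can be sketched as follows. This is an assumed setup, not the paper's: "messiness" is modelled as noise level, "complexity" as a nonlinear term, a k-nearest-neighbour predictor stands in for the case-based reasoner, and each condition gets a 1,000-case validation set.

```python
# Hypothetical sketch: simulate data sets with controlled characteristics
# (noise = "messiness", nonlinearity = "complexity") and evaluate a simple
# case-based predictor against a large (1000-case) validation set.
import math
import random

random.seed(2)

def make_dataset(n, noise_sd, nonlinear=False):
    rows = []
    for _ in range(n):
        x = random.uniform(0, 10)
        signal = math.sin(x) * 5 + x if nonlinear else 2 * x
        rows.append((x, signal + random.gauss(0, noise_sd)))
    return rows

def knn_predict(train, x, k=3):
    # k-nearest-neighbour: average the k closest training cases
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

def mae(train, validation):
    # mean absolute error over the validation set
    return (sum(abs(y - knn_predict(train, x)) for x, y in validation)
            / len(validation))

for noise in (0.5, 5.0):                  # clean vs messy data
    for nonlinear in (False, True):       # simple vs complex relationship
        train = make_dataset(30, noise, nonlinear)
        validation = make_dataset(1000, noise, nonlinear)  # large validation set
        print(noise, nonlinear, round(mae(train, validation), 2))
```

Because every characteristic of the data is controlled, re-sampling `train` under a fixed condition directly exposes how much the accuracy estimate varies with the particular training set drawn.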
Meta-analysis of massively parallel reporter assays enables prediction of regulatory function across cell types.
Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications for understanding the genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure the regulatory activity of thousands of DNA sequences and their variants in a single experiment. With an increasing number of publicly available MPRA data sets, one can now develop data-driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta-analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrated that predictive models trained in one cellular context can be used to predict MPRA output in another, with the loss of accuracy attributed to cell-type-specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" Challenge for predicting the effects of single-nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide the effective design of future experiments by better prioritizing regions of interest.
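The cross-context idea in the abstract can be sketched in miniature. This is an assumed toy setup, not the paper's pipeline: sequences are featurised as dinucleotide counts, a linear model is fitted by stochastic gradient descent on synthetic "MPRA" activities from one context, and then scored against a second context driven by the same shared features.

```python
# Hypothetical sketch of cross-cell-type prediction: featurise DNA sequences
# as k-mer counts, fit on synthetic MPRA-style activities from one context,
# then evaluate on another context that shares the same predictive features.
import itertools
import random

random.seed(3)
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=2)]  # 16 dimers

def features(seq):
    # count each dinucleotide in the sequence
    return [sum(seq[i:i + 2] == k for i in range(len(seq) - 1)) for k in KMERS]

def make_context(n, weights):
    # synthetic "MPRA" measurements: activity = weighted k-mer content + noise
    data = []
    for _ in range(n):
        seq = "".join(random.choice("ACGT") for _ in range(50))
        f = features(seq)
        y = sum(w * x for w, x in zip(weights, f)) + random.gauss(0, 0.5)
        data.append((f, y))
    return data

def fit(data, lr=0.001, epochs=200):
    # plain least-squares fit by stochastic gradient descent
    w = [0.0] * len(KMERS)
    for _ in range(epochs):
        for f, y in data:
            err = sum(wi * xi for wi, xi in zip(w, f)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, f)]
    return w

shared = [random.gauss(0, 1) for _ in KMERS]   # features shared across contexts
train_ctx = make_context(200, shared)          # "cell type A"
test_ctx = make_context(50, shared)            # "cell type B", same drivers
w = fit(train_ctx)
mse = (sum((sum(wi * xi for wi, xi in zip(w, f)) - y) ** 2
           for f, y in test_ctx) / len(test_ctx))
```

In this toy the two contexts share all their drivers, so the model transfers cleanly; adding context-specific weight components to one context would reproduce the accuracy loss the abstract attributes to cell-type-specific features.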
Prediction of the functional properties of ceramic materials from composition using artificial neural networks
We describe the development of artificial neural networks (ANN) for the prediction of the properties of ceramic materials. The ceramics studied here include polycrystalline, inorganic, non-metallic materials and are investigated on the basis of their dielectric and ionic properties. Dielectric materials are of interest in telecommunication applications where they are used in tuning and filtering equipment. Ionic and mixed conductors are the subjects of a concerted effort in the search for new materials that can be incorporated into efficient, clean electrochemical devices of interest in energy production and greenhouse gas reduction applications. Multi-layer perceptron ANNs are trained using the back-propagation algorithm and utilise data obtained from the literature to learn composition-property relationships between the inputs and outputs of the system. The trained networks use compositional information to predict the relative permittivity and oxygen diffusion properties of ceramic materials. The results show that ANNs are able to produce accurate predictions of the properties of these ceramic materials, which can be used to develop materials suitable for use in telecommunication and energy production applications.
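The modelling setup can be illustrated with a minimal multi-layer perceptron trained by back-propagation. Everything below is an assumption for illustration: a three-component composition vector, a synthetic linear composition-property relationship, one tanh hidden layer, and a single property output; the paper's networks, data, and architecture are not reproduced here.

```python
# Minimal sketch of an MLP trained by back-propagation to map a composition
# vector (three molar fractions summing to 1) to a single property value.
# The data and the target relationship are synthetic, for illustration only.
import math
import random

random.seed(4)
H = 5  # hidden units

def init(n_in):
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(H)]
    w_o = [random.uniform(-0.5, 0.5) for _ in range(H)]
    return w_h, w_o

def forward(w_h, w_o, x):
    # one tanh hidden layer, linear output unit
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    return h, sum(w * hi for w, hi in zip(w_o, h))

def train(data, epochs=500, lr=0.05):
    w_h, w_o = init(len(data[0][0]))
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(w_h, w_o, x)
            err = y - t
            # back-propagate: hidden deltas use the pre-update output weights
            for j in range(H):
                delta_h = err * w_o[j] * (1 - h[j] ** 2)
                w_o[j] -= lr * err * h[j]
                for i in range(len(x)):
                    w_h[j][i] -= lr * delta_h * x[i]
    return w_h, w_o

def sample():
    # synthetic "composition -> property" pair: fractions sum to 1
    a, b = sorted((random.random(), random.random()))
    x = [a, b - a, 1 - b]
    return x, 0.8 * x[0] - 0.3 * x[1] + 0.5 * x[2]

data = [sample() for _ in range(100)]
w_h, w_o = train(data)
_, pred = forward(w_h, w_o, [0.2, 0.3, 0.5])
```

Once trained, the same `forward` pass can screen candidate compositions, which is the use the abstract envisages for materials design.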
Making inferences with small numbers of training sets
A potential methodological problem with empirical studies that assess project effort prediction systems is discussed. Frequently, a hold-out strategy is deployed so that the data set is split into a training and a validation set, and inferences are then made concerning the relative accuracy of the different prediction techniques under examination. This is typically done on very small numbers of sampled training sets. It is shown that such studies can lead to almost random results, particularly where relatively small effects are being studied. To illustrate this problem, two data sets are analysed using a configuration problem for case-based prediction, with results generated from 100 training sets. This enables results to be produced with quantified confidence limits. From this it is concluded that in both cases using fewer than five training sets leads to untrustworthy results, and ideally more than 20 sets should be deployed. Unfortunately, this calls into question a number of empirical validations of prediction techniques, so further research is suggested as a matter of urgency.
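The core sampling argument can be sketched numerically. This is an assumed toy model, not the paper's data: the accuracy obtained from one sampled training set is treated as a noisy draw around a true value, and we measure how much the mean-accuracy estimate fluctuates when a study relies on only a handful of training sets.

```python
# Hypothetical sketch of the sampling problem: how unstable is a study's
# mean-accuracy estimate when it uses only n sampled training sets?
import random
import statistics

random.seed(5)

def accuracy_from_one_training_set():
    # stand-in for "train a predictor on one sampled training set and
    # measure its accuracy": a noisy draw around a true accuracy of 0.70
    return random.gauss(0.70, 0.08)

def spread_of_estimates(n_sets, trials=500):
    # simulate many studies, each averaging accuracy over n_sets training
    # sets, and report how much those study-level estimates vary
    means = [statistics.mean(accuracy_from_one_training_set()
                             for _ in range(n_sets))
             for _ in range(trials)]
    return statistics.stdev(means)

for n in (3, 5, 20, 100):
    print(n, round(spread_of_estimates(n), 3))
```

The spread shrinks roughly as the square root of the number of training sets, which is why estimates based on fewer than five sets are untrustworthy while 20 or more give usable confidence limits.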