Data sets and data quality in software engineering
OBJECTIVE - to assess the extent and types of techniques used to manage quality within software engineering data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets.
METHOD - we perform a systematic review of available empirical software engineering studies.
RESULTS - only 23 of the many hundreds of studies assessed explicitly considered data quality.
CONCLUSIONS - first, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need more research into means of identifying, and ideally repairing, noisy cases. Third, it should become routine to use sensitivity analysis to assess conclusion stability with respect to the assumptions that must be made concerning noise levels.
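The third recommendation lends itself to a short illustration. The sketch below uses synthetic effort-like data and a deliberately simple estimator (none of it drawn from the review itself), re-running the analysis under a range of assumed noise levels to see whether the accuracy conclusion is stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic effort-like data: one size feature, effort roughly linear in size.
X = rng.uniform(10, 500, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 20, size=200)

# Sensitivity analysis: repeat the analysis under several assumed noise
# levels and check whether the conclusion holds at each one.
for noise_sd in [0, 10, 25, 50, 100]:
    y_noisy = y + rng.normal(0, noise_sd, size=y.shape)
    pred = cross_val_predict(LinearRegression(), X, y_noisy, cv=5)
    print(f"assumed noise sd={noise_sd:>3}  MAE={mean_absolute_error(y_noisy, pred):.1f}")
```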
Finding Temporal Patterns in Noisy Longitudinal Data: A Study in Diabetic Retinopathy
This paper describes an approach to temporal pattern mining using the concept of user-defined temporal prototypes to define the nature of the trends of interest. The temporal patterns are defined in terms of sequences of support values associated with identified frequent patterns. The prototypes are defined mathematically so that they can be mapped onto the temporal patterns. The focus for the advocated temporal pattern mining process is a large longitudinal patient database collected as part of a diabetic retinopathy screening programme. The data set is, in itself, also of interest as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The diabetic retinopathy application, the data warehousing and cleaning process, and the frequent pattern mining procedure (together with the application of the prototype concept) are all described in the paper. An evaluation of the frequent pattern mining process is also presented.
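By way of illustration only, the sketch below matches support-value sequences against one mathematically defined prototype, an increasing trend expressed as a minimum least-squares slope; the sequences and threshold are invented, not taken from the study:

```python
import numpy as np

# Invented support sequences: the support of each frequent pattern at
# successive time stamps (the paper's actual representation may differ).
support_seqs = {
    "pattern_A": np.array([0.10, 0.15, 0.22, 0.30, 0.41]),
    "pattern_B": np.array([0.35, 0.33, 0.34, 0.32, 0.33]),
}

def matches_increasing(seq, min_slope=0.02):
    """One mathematically defined prototype: an increasing trend, expressed
    as a minimum least-squares slope over the time index."""
    slope = np.polyfit(np.arange(len(seq)), seq, 1)[0]
    return slope >= min_slope

for name, seq in support_seqs.items():
    print(name, "matches the 'increasing' prototype:", matches_increasing(seq))
```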
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
Experience: Quality benchmarking of datasets used in software effort estimation
Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the "fitness for purpose" of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.
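The article's taxonomy-based template is not reproduced in the abstract. As a rough, hypothetical stand-in, a dataset benchmarking sketch of this kind might report a few generic quality indicators per dataset:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.Series:
    """Toy dataset benchmark over a few generic quality indicators
    (incompleteness, duplication, crude outlier counts). The article's
    actual template is taxonomy-driven and richer than this."""
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return pd.Series({
        "rows": len(df),
        "mean_missing_rate": df.isna().mean().mean(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outlier_cells_abs_z_gt_3": int((z.abs() > 3).sum().sum()),
    })

# Usage: quality_report(pd.read_csv("some_effort_dataset.csv"))  # hypothetical file
```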
Evaluation of imputation techniques with varying percentage of missing data
Missing data is a common problem that has consistently plagued statisticians and applied analytical researchers. While replacement methods like mean-based or hot-deck imputation have been well researched, emerging imputation techniques enabled by improved computational resources have had limited formal assessment. This study formally considers five more recently developed imputation methods - Amelia, Mice, mi, Hmisc and missForest - and compares their performance, using RMSE against actual values, with the well-established mean-based replacement approach. The RMSE measure was consolidated by method using a ranking approach. Our results indicate that the missForest algorithm performed best and the mi algorithm performed worst.
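The five methods named above are R packages. A comparable masking-and-scoring experiment can be sketched in Python with scikit-learn, using IterativeImputer as a rough stand-in for MICE-style chained equations and a random-forest estimator as a rough stand-in for missForest; this mirrors the study's design, not its actual code:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
X = load_diabetes().data.copy()

# Mask a varying percentage of cells so the true values remain known.
for frac in [0.1, 0.3, 0.5]:
    mask = rng.random(X.shape) < frac
    X_miss = X.copy()
    X_miss[mask] = np.nan
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        # Chained-equations imputation, in the spirit of MICE.
        "iterative": IterativeImputer(random_state=0),
        # Random-forest imputation, in the spirit of missForest.
        "forest": IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=20, random_state=0),
            random_state=0),
    }
    for name, imp in imputers.items():
        X_hat = imp.fit_transform(X_miss)
        rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
        print(f"{int(frac * 100)}% missing, {name}: RMSE={rmse:.4f}")
```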
Data cleaning techniques for software engineering data sets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact on decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition for data quality describes it as 'fitness for purpose', and the issue of poor data quality can be addressed either by introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter, with a special focus on noise handling.
Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real-world software engineering data set. The first investigation tested the techniques' ability to improve predictive accuracy at differing noise levels. All three techniques improved predictive accuracy in comparison to the do-nothing approach, with filtering and polish the most successful. The second investigation, using the same large real-world data set, tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. Robust filtering and predictive filtering decreased the number of instances with implausible values, but also substantially decreased the size of the data set. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set.
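A minimal classification-flavoured sketch of the three approaches, using scikit-learn decision trees; the thesis's data and exact tree configuration are not reproduced here:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def robust_filter(X, y):
    """Robust filtering: the tree is trained and applied on the same data,
    and instances it misclassifies are discarded as noisy."""
    pred = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
    keep = pred == y
    return X[keep], y[keep]

def predictive_filter(X, y, cv=5):
    """Predictive filtering: training and test sets differ, so each instance
    is judged by a tree that never saw it."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    keep = pred == y
    return X[keep], y[keep]

def filter_and_polish(X, y, cv=5):
    """Filtering and polish: suspect instances are kept, but their class
    labels are corrected to the tree's prediction."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    return X, np.where(pred == y, y, pred)
```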
Since the data set contained historical software project data, it was not possible to know the real extent of the noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain-specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain-specific characteristics of the real world with control over the simulated data. This is seen as a special strength of this evaluation approach.
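The key property of the simulation, that the true extent of the injected noise is known, can be shown generically (the domain-specific modelling of the real data set is not reproduced):

```python
import numpy as np

def inject_label_noise(y, rate, n_classes, rng=np.random.default_rng(0)):
    """Flip a known fraction of class labels and return the flipped indices,
    so any noise detector's precision and recall can be computed exactly."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    shift = rng.integers(1, n_classes, size=idx.size)  # never the original class
    y_noisy[idx] = (y[idx] + shift) % n_classes
    return y_noisy, set(idx)

# A detector's flagged indices can then be scored directly, e.g.
# precision = len(flagged & truly_noisy) / max(len(flagged), 1).
```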
The results of the evaluation on the simulated data showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly, and on the evidence of this evaluation they would not be recommended for the task of noise reduction. Predictive filtering was the best performing technique in this evaluation, but even it did not perform particularly well.
An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community.
The work in this thesis highlights an important gap in empirical software engineering. It provides a clarification of, and distinction between, the terms noise and outliers: the two concepts overlap, but they are fundamentally different. Since noise and outliers are often treated as the same in noise handling techniques, a clarification of the two terms was necessary.
To investigate the capabilities of noise handling techniques, a single investigation was deemed insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques in which noise and outliers are combined. Therefore three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as part of a multi-pronged approach.
This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process in which the input of domain knowledge and the replicability of the data cleaning process are ensured.
Deep generative modeling for single-cell transcriptomics.
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
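Basic usage follows the pattern below, written against the current scvi-tools API, which postdates the paper and may differ from the original release in naming; the input file and the batch and cell-type keys are hypothetical:

```python
import scanpy as sc
import scvi

adata = sc.read_h5ad("my_counts.h5ad")  # hypothetical raw-count AnnData file

# Register the data, train the variational model, then query it.
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()

latent = model.get_latent_representation()    # batch-aware embedding for clustering
denoised = model.get_normalized_expression()  # expression with noise modeled out
de = model.differential_expression(groupby="cell_type")
```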
A New Method for Estimation of Missing Data Based on Sampling Methods for Data Mining
Today we collect large amounts of data and receive more than we can handle; the accumulated data are often raw and far from being of good quality, containing missing values and noise. The presence of missing values in data is a major disadvantage for most data mining algorithms. Intuitively, the pertinent information is embedded in many attributes, and its extraction is only possible if the original data are cleaned and pre-treated. In this paper we propose a new technique for preprocessing data that aims to estimate the missing values, in order to obtain representative samples of good quality and to ensure that the extracted information is safer and more reliable.
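The abstract does not spell out the proposed sampling scheme, so the sketch below shows only the general flavour of sampling-based (hot-deck-style) estimation of missing values, with hypothetical column names; it should not be read as the paper's specific technique:

```python
import numpy as np
import pandas as pd

def sample_impute(df, col, group_col, rng=np.random.default_rng(0)):
    """Replace each missing value in `col` with a value sampled from observed
    donors in the same `group_col` group (a generic hot-deck illustration)."""
    out = df.copy()
    for _, sub in out.groupby(group_col):
        donors = sub[col].dropna().to_numpy()
        missing = sub.index[sub[col].isna()]
        if len(donors) and len(missing):
            out.loc[missing, col] = rng.choice(donors, size=len(missing))
    return out
```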
Parameter estimation in Cox models with missing failure indicators and the OPPERA study
In a prospective cohort study, examining all participants for incidence of
the condition of interest may be prohibitively expensive. For example, the
"gold standard" for diagnosing temporomandibular disorder (TMD) is a physical
examination by a trained clinician. In large studies, examining all
participants in this manner is infeasible. Instead, it is common to use
questionnaires to screen for incidence of TMD and perform the "gold standard"
examination only on participants who screen positively. Unfortunately, some
participants may leave the study before receiving the "gold standard"
examination. Within the framework of survival analysis, this results in missing
failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and
Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a
method for parameter estimation in survival models with missing failure
indicators. We estimate the probability of being an incident case for those
lacking a "gold standard" examination using logistic regression. These
estimated probabilities are used to generate multiple imputations of case
status for each missing examination that are combined with observed data in
appropriate regression models. The variance introduced by the procedure is
estimated using multiple imputation. The method can be used to estimate both
regression coefficients in Cox proportional hazard models as well as incidence
rates using Poisson regression. We simulate data with missing failure
indicators and show that our method performs as well as or better than
competing methods. Finally, we apply the proposed method to data from the
OPPERA study.Comment: Version 4: 23 pages, 0 figure
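A sketch of the scheme described above, with scikit-learn for the logistic model and lifelines for the Cox fits; the column names are hypothetical, and the pooling shown is plain Rubin's rules, which may differ in detail from the authors' variance estimator:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.linear_model import LogisticRegression

def cox_missing_indicators(df, covariates, m=20, rng=np.random.default_rng(0)):
    """Estimate P(case) among verified participants, impute case status m
    times for the unverified, fit a Cox model per completed data set, and
    pool coefficients and variances with Rubin's rules."""
    known = df["event"].notna()  # hypothetical columns: event, time, covariates
    clf = LogisticRegression().fit(df.loc[known, covariates], df.loc[known, "event"])
    p = clf.predict_proba(df.loc[~known, covariates])[:, 1]

    betas, variances = [], []
    for _ in range(m):
        completed = df.copy()
        completed.loc[~known, "event"] = (rng.random(p.size) < p).astype(float)
        cph = CoxPHFitter().fit(completed[covariates + ["time", "event"]],
                                duration_col="time", event_col="event")
        betas.append(cph.params_.to_numpy())
        variances.append(np.diag(cph.variance_matrix_.to_numpy()))

    betas, variances = np.array(betas), np.array(variances)
    within, between = variances.mean(axis=0), betas.var(axis=0, ddof=1)
    return betas.mean(axis=0), within + (1 + 1 / m) * between  # Rubin's rules
```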