
    Ensemble missing data techniques for software effort prediction

    Constructing an accurate effort prediction model is a challenge in software engineering. The development and validation of models used for prediction tasks require good-quality data. Unfortunately, software engineering datasets tend to suffer from incompleteness, which can lead to inaccurate decision making in project management and implementation. Recently, machine learning algorithms, including ensemble (combining) classifiers, have proven to be of great practical value in solving a variety of software engineering problems, including software prediction. Research indicates that ensembles of individual classifiers significantly improve classification performance by having the members vote for the most popular class. This paper proposes a method for improving the accuracy of software effort predictions produced by a decision tree learning algorithm, by generating an ensemble that uses two imputation methods as its elements. Benchmarking results on ten industrial datasets show that the proposed ensemble strategy can improve prediction accuracy compared to an individual imputation method, especially when multiple imputation is a component of the ensemble.
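
    The following sketch illustrates the ensemble idea in the abstract above: the same training data is imputed by two different methods, a decision tree is trained on each imputed copy, and the predictions are averaged. It is a minimal illustration assuming scikit-learn; SimpleImputer and IterativeImputer stand in for the paper's imputation methods (IterativeImputer as a rough proxy for multiple imputation), and the variables X_train, y_train and X_test are hypothetical.

        # Minimal sketch of an ensemble over imputation methods; not the paper's exact setup.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import SimpleImputer, IterativeImputer
        from sklearn.tree import DecisionTreeRegressor

        def ensemble_effort_prediction(X_train, y_train, X_test):
            """Train one decision tree per imputation method and average the predictions."""
            preds = []
            for imputer in (SimpleImputer(strategy="mean"),
                            IterativeImputer(random_state=0)):  # rough proxy for multiple imputation
                tree = DecisionTreeRegressor(random_state=0)
                tree.fit(imputer.fit_transform(X_train), y_train)
                preds.append(tree.predict(imputer.transform(X_test)))
            return np.mean(preds, axis=0)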

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyse predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
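
    A hedged sketch of the impute-then-learn setup described above, assuming scikit-learn: KNNImputer fills in the missing values before a decision tree is trained. Note that scikit-learn implements CART rather than C4.5, so DecisionTreeClassifier is only a stand-in; the variable names and k = 5 are illustrative.

        # k-NN imputation followed by a decision tree (CART as a stand-in for C4.5).
        from sklearn.impute import KNNImputer
        from sklearn.pipeline import make_pipeline
        from sklearn.tree import DecisionTreeClassifier

        knn_then_tree = make_pipeline(KNNImputer(n_neighbors=5),
                                      DecisionTreeClassifier(random_state=0))
        # Usage (hypothetical data): knn_then_tree.fit(X_train, y_train)
        #                            knn_then_tree.score(X_test, y_test)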

    Imputation techniques for improving survey outcomes in Nigeria: the case of the business expectation survey (BES) of the central bank of Nigeria

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Information Analysis and Management.

    Over the years, respondents' apathy, missing data, and item non-response in particular have remained major concerns in the analysis of survey-based studies undertaken by the Central Bank of Nigeria (CBN). Research and policy analysis within the CBN have been plagued by the growing quantum of item non-response. This dissertation will attempt to empirically analyse and recommend the best imputation technique for item non-response in surveys undertaken by the Bank. The case in point will be the Business Expectations Survey (BES) conducted quarterly by the CBN. It will take specific items/questions in the BES for which there are complete responses and undertake a multiple correspondence analysis (MCA) of the responses. Using a completely randomized scheme (a table of random numbers), it will exclude 15–35 percent of responses as if they were item non-responses and proceed to replace them through various imputation techniques. The MCA will then be repeated for each of the derived datasets and the results compared with those of the original datasets. The matrices of principal coordinates are compared using the RV coefficient (Escoufier, 1973), a measure of similarity between two datasets such that a value of 1 indicates complete similarity and 0 indicates complete dissimilarity; this coefficient is a generalization of the square of Pearson's correlation coefficient. The results of the RV coefficient analysis, as well as analyses of selected summary statistics, will be used to recommend the best imputation technique for such item non-responses in future surveys.
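
    For concreteness, below is a minimal NumPy implementation of the RV coefficient (Escoufier, 1973) used above to compare matrices of principal coordinates. It assumes both matrices are column-centred and have the same number of rows; the function name is ours, not from the dissertation.

        # RV coefficient between two configurations with the same number of rows.
        import numpy as np

        def rv_coefficient(X, Y):
            """RV = trace(XX'YY') / sqrt(trace((XX')^2) * trace((YY')^2))."""
            Sx, Sy = X @ X.T, Y @ Y.T
            return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))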

    Childbearing intentions in a low fertility context: the case of Romania

    This paper applies the Theory of Planned Behaviour (TPB) to identify the predictors of fertility intentions in Romania, a low-fertility country. We analyse how attitudes, subjective norms and perceived behavioural control relate to the intention to have a child among childless individuals and one-child parents. Principal axis factor analysis confirms which items proposed by the Generations and Gender Survey (GGS 2005) act as valid and reliable measures of the suggested theoretical socio-psychological factors. Four parity-specific logistic regression models are applied to evaluate the relationship between the socio-psychological factors and childbearing intentions. Social pressure emerges as the most important aspect of fertility decision-making among childless individuals and one-child parents, and positive attitudes towards childbearing are a strong component in planning for a child. The paper also underlines the importance of region-specific factors when studying childbearing intentions: planning for a second child differs significantly among the development regions, which represent the cultural and socio-economic divisions of the Romanian territory.

    Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

    Missing data is one of the most common preprocessing problems. In this paper, we experimentally investigate the use of generative and non-generative models for feature reconstruction. The Variational Autoencoder with Arbitrary Conditioning (VAEAC) and the Generative Adversarial Imputation Network (GAIN) were studied as representatives of generative models, while the denoising autoencoder (DAE) represented non-generative models. The performance of the models is compared to the traditional methods k-nearest neighbours (k-NN) and Multiple Imputation by Chained Equations (MICE). Moreover, we introduce WGAIN, a Wasserstein modification of GAIN, which turns out to be the best imputation model when the degree of missingness is less than or equal to 30%. Experiments were performed on real-world and artificial datasets with continuous features where different percentages of features, varying from 10% to 50%, were missing. Algorithms were evaluated by measuring the accuracy of a classification model previously trained on the uncorrupted dataset. The results show that GAIN, and especially WGAIN, are the best imputers regardless of the conditions. In general, they outperform or are comparable to MICE, k-NN, DAE, and VAEAC.

    Comment: Preprint of the conference paper (ICCS 2020), part of the Lecture Notes in Computer Science.
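
    The evaluation protocol described above can be sketched as follows, assuming NumPy plus any scikit-learn-style classifier and imputer: test-set features are masked at a given rate, re-imputed, and then scored with a classifier previously trained on the uncorrupted data. The 30% masking rate, the uniform masking scheme, and all names are illustrative assumptions, not the paper's exact code.

        # Sketch of "train on clean data, score on masked-and-imputed data".
        import numpy as np

        def evaluate_imputer(clf, imputer, X_test, y_test, missing_rate=0.3, seed=0):
            """clf must already be trained on the uncorrupted training set."""
            rng = np.random.default_rng(seed)
            X_missing = X_test.astype(float).copy()
            mask = rng.random(X_missing.shape) < missing_rate  # uniform masking
            X_missing[mask] = np.nan
            # Fitting the imputer on the masked test set is a simplification.
            return clf.score(imputer.fit_transform(X_missing), y_test)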

    Illuminate the unknown: Evaluation of imputation procedures based on the SAVE Survey

    Questions about monetary variables (such as income, wealth or savings) are key components of questionnaires on household finances. However, missing information on such sensitive topics is a well-known phenomenon which can seriously bias any inference based only on complete-case analysis. Many imputation techniques have been developed and implemented in several surveys. Using the German SAVE data, this paper evaluates different techniques for the imputation of monetary variables by means of a simulation study, in which a random pattern of missingness is imposed on the observed values of the variables of interest. New estimation techniques are necessary to overcome the upward bias of monetary variables caused by the initially implemented imputation procedure. A Monte Carlo simulation based on the observed data shows the superiority of the newly implemented smearing estimate in reconstructing the missing data structure. All waves are consistently imputed using the new method.
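
    The "smearing estimate" mentioned above is presumably Duan's (1983) smearing estimator for retransformation bias; a minimal sketch under that assumption follows. A log-linear regression is fitted on the observed cases, and imputations are retransformed with the mean of the exponentiated residuals rather than a naive exponentiation, which would be biased. The variables and model specification are illustrative, not the SAVE implementation.

        # Regression imputation of a positive monetary variable with Duan's smearing factor.
        import numpy as np

        def smearing_predict(X_obs, y_obs, X_miss):
            """Regress log(y) on X for observed cases, then retransform with smearing."""
            Xd = np.column_stack([np.ones(len(X_obs)), X_obs])
            beta, *_ = np.linalg.lstsq(Xd, np.log(y_obs), rcond=None)
            resid = np.log(y_obs) - Xd @ beta
            smear = np.mean(np.exp(resid))           # Duan's smearing factor
            Xm = np.column_stack([np.ones(len(X_miss)), X_miss])
            return np.exp(Xm @ beta) * smear         # bias-corrected imputations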

    Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

    Missing data is one of the most common issues encountered in the data cleaning process, especially when dealing with medical datasets. Real-world datasets are prone to be incomplete, inconsistent, noisy and redundant for reasons such as human error, instrument failure, and adverse events such as death. Therefore, sophisticated algorithms are needed to impute the missing values accurately. Many machine learning algorithms have been applied to impute missing data with plausible values. Among them, the k-NN algorithm has been widely adopted for missing data imputation owing to its robustness and simplicity, and it is a promising method that often outperforms other machine learning methods. This paper provides a comprehensive review of the different imputation techniques used to replace missing data. The goal of the review is to draw attention to potential improvements to existing methods and to give readers a better grasp of trends in imputation techniques.
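
    As a concrete reference point for the k-NN imputation the review highlights, here is a self-contained scikit-learn example on toy data; the array and k = 2 are purely illustrative.

        # k-NN imputation on a toy matrix with missing entries.
        import numpy as np
        from sklearn.impute import KNNImputer

        X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
        X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
        # Each NaN is replaced by the mean of that feature over the 2 nearest rows,
        # where distances are computed on the features observed in both rows.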

    Autoencoder for clinical data analysis and classification : data imputation, dimensional reduction, and pattern recognition

    Over the last decade, research has focused on machine learning and data mining to develop frameworks that can improve data analysis and output performance, and to build accurate decision support systems that benefit from real-life datasets. This leads to the field of clinical data analysis, which has attracted a significant amount of interest in the computing, information systems, and medical fields. To create and develop models with machine learning algorithms, the existing algorithms need a particular type of data to build an efficient model. Clinical datasets pose several issues that can affect classification: missing values, high dimensionality, and class imbalance. In order to build a framework for mining the data, it is necessary first to preprocess it, by eliminating patient records that have too many missing values, imputing the remaining missing values, addressing high dimensionality, and classifying the data for decision support.

    This thesis investigates a real clinical dataset and addresses these challenges. An autoencoder is employed as a tool that can compress the data mining methodology, extracting features and classifying data in one model. The first step in the methodology is to impute missing values, so several imputation methods are analysed and employed. High dimensionality is then addressed by discarding irrelevant and redundant features, in order to improve prediction accuracy and reduce computational complexity. Class imbalance is manipulated to investigate its effect on feature selection and classification algorithms.

    The first stage of the analysis investigates the role of missing values; the results show that imputation techniques based on class separation outperform other techniques in predictive ability. The next stage investigates high dimensionality and class imbalance. A small set of features was found that can improve classification performance, while balancing the classes does not affect performance as much as the class imbalance itself.
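
    A rough PyTorch sketch of the denoising-autoencoder idea described above: inputs are randomly corrupted during training and the network learns to reconstruct the original values, so missing entries can later be filled in from the reconstruction. The architecture, corruption rate, and all names are assumptions for illustration, not the thesis's actual model.

        # Denoising autoencoder that learns to reconstruct corrupted inputs.
        import torch
        import torch.nn as nn

        class DenoisingAE(nn.Module):
            def __init__(self, n_features, hidden=32):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
                self.decoder = nn.Linear(hidden, n_features)

            def forward(self, x):
                return self.decoder(self.encoder(x))

        def train_step(model, optimizer, x, corrupt_p=0.2):
            noisy = x * (torch.rand_like(x) > corrupt_p)    # randomly zero out entries
            loss = nn.functional.mse_loss(model(noisy), x)  # reconstruct the originals
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()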

    Data mining for heart failure : an investigation into the challenges in real life clinical datasets

    Clinical data presents a number of challenges, including missing data, class imbalance, high dimensionality and non-normal distributions. A motivation for this research is to investigate and analyse the manner in which these challenges affect the performance of algorithms. The challenges were explored with the help of a real-life heart failure clinical dataset known as Hull LifeLab, obtained from a live cardiology clinic at the Hull Royal Infirmary Hospital. A Clinical Data Mining Workflow (CDMW) was designed with three intuitive stages, namely descriptive, predictive and prescriptive. The naming of these stages reflects the nature of the analysis that is possible within each stage; a number of different algorithms are therefore employed. Most algorithms require the data to be normally distributed, yet the distribution is not used explicitly within the algorithms. Approaches based on Bayes use the properties of the distributions very explicitly, and thus provide valuable insight into the nature of the data.

    The first stage of the analysis investigates whether the assumptions made for Bayes hold, e.g. the strong independence assumption and the assumption of a Gaussian distribution. The next stage investigates the role of missing values. The results show that imputation does not affect performance as much as the records that are complete from the outset; these records are often not outliers, but contain problem variables, and a method was developed to identify them. The effect of skew in the data was also investigated within the CDMW; methods based on Bayes were able to handle it, albeit with a small variability in performance. The thesis provides insight into the reasons why clinical data often causes problems. Even class imbalance is not an issue, since Bayes methods are independent of it.
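
    The first-stage assumption checks described above might look roughly like the following, assuming NumPy and SciPy: a per-feature normality test for the Gaussian assumption, and a scan of pairwise correlations for the independence assumption. The tests and thresholds are our illustrative choices, not those of the thesis.

        # Checking the Gaussian and independence assumptions behind naive Bayes.
        import numpy as np
        from scipy import stats

        def check_naive_bayes_assumptions(X, alpha=0.05):
            """Return per-feature normality flags and the largest off-diagonal correlation."""
            normal = [stats.shapiro(X[:, j]).pvalue > alpha for j in range(X.shape[1])]
            corr = np.corrcoef(X, rowvar=False)
            max_corr = np.abs(corr - np.eye(X.shape[1])).max()
            return normal, max_corr  # strong correlations undermine independence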

    Determinants of the acceptance of domestic use of recycled water by use type

    In the circular economy model, the recycling of water is an alternative that can reduce the pressure on water resources and guarantee water supply. This water policy measure is currently widespread in agriculture, but thus far few countries have opted for the domestic use of recycled water. In part, this is because it is the source of water with the lowest levels of public acceptance, which poses a threat to the success of the necessary investment. We analyse the degree of acceptance of recycled water for different domestic uses. The main contribution of this study is the analysis of the determinants of acceptance of recycled water by use type. The research was based on data from a questionnaire given to 844 university students in Andalusia, southern Spain. Results are obtained from ordinary least squares regressions that relate the determinants of recycled water acceptance to each of the water use classes. The 'yuck factor', variously defined as 'disgust' or 'psychological repugnance', and the perceived risk are found to be the main determinants of the low degree of acceptance of recycled water for ingestion by people and pets. For other uses, such as body washing, laundry and cleaning, environmental awareness stands out as a determining factor. The main conclusion is that if authorities were to opt for measures to promote the use of recycled water, they should take into account that the reluctance to use recycled water and the determinants of acceptance differ according to the intended use.

    Funding: European Regional Development Fund; Spanish Agencia Estatal de Investigación; Regional Government of Andalusia.