
    Ensemble missing data techniques for software effort prediction

    Constructing an accurate effort prediction model is a challenge in software engineering. The development and validation of models used for prediction tasks require good-quality data. Unfortunately, software engineering datasets tend to suffer from incompleteness, which can lead to inaccurate decision making in project management and implementation. Recently, machine learning algorithms, including ensemble (combining) classifiers, have proven to be of great practical value in solving a variety of software engineering problems, including software prediction. Research indicates that ensembles of individual classifiers significantly improve classification performance by having the members vote for the most popular class. This paper proposes a method for improving the accuracy of software effort predictions produced by a decision tree learning algorithm, by generating an ensemble that uses two imputation methods as its elements. Benchmarking results on ten industrial datasets show that the proposed ensemble strategy can improve prediction accuracy compared to an individual imputation method, especially when multiple imputation is a component of the ensemble.
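
    The following sketch illustrates the ensemble idea in the abstract above: the same training data is imputed by two different methods, a decision tree is trained on each imputed copy, and the predictions are averaged. It is a minimal illustration assuming scikit-learn; SimpleImputer and IterativeImputer stand in for the paper's imputation methods (IterativeImputer as a rough proxy for multiple imputation), and the variables X_train, y_train and X_test are hypothetical.

        # Minimal sketch of an ensemble over imputation methods; not the paper's exact setup.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import SimpleImputer, IterativeImputer
        from sklearn.tree import DecisionTreeRegressor

        def ensemble_effort_prediction(X_train, y_train, X_test):
            """Train one decision tree per imputation method and average the predictions."""
            preds = []
            for imputer in (SimpleImputer(strategy="mean"),
                            IterativeImputer(random_state=0)):  # rough proxy for multiple imputation
                tree = DecisionTreeRegressor(random_state=0)
                tree.fit(imputer.fit_transform(X_train), y_train)
                preds.append(tree.predict(imputer.transform(X_test)))
            return np.mean(preds, axis=0)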

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyse predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
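
    A hedged sketch of the impute-then-learn setup described above, assuming scikit-learn: KNNImputer fills in the missing values before a decision tree is trained. Note that scikit-learn implements CART rather than C4.5, so DecisionTreeClassifier is only a stand-in; the variable names and k = 5 are illustrative.

        # k-NN imputation followed by a decision tree (CART as a stand-in for C4.5).
        from sklearn.impute import KNNImputer
        from sklearn.pipeline import make_pipeline
        from sklearn.tree import DecisionTreeClassifier

        knn_then_tree = make_pipeline(KNNImputer(n_neighbors=5),
                                      DecisionTreeClassifier(random_state=0))
        # Usage (hypothetical data): knn_then_tree.fit(X_train, y_train)
        #                            knn_then_tree.score(X_test, y_test)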

    Imputation techniques for improving survey outcomes in Nigeria: the case of the business expectation survey (BES) of the central bank of Nigeria

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Information Analysis and Management.

    Over the years, respondents' apathy, missing data, and item non-response in particular have remained major concerns in the analysis of survey-based studies undertaken by the Central Bank of Nigeria (CBN). Research and policy analysis within the CBN have been plagued by the growing quantum of item non-response. This dissertation will attempt to empirically analyse and recommend the best imputation technique for item non-response in surveys undertaken by the Bank. The case in point will be the Business Expectations Survey (BES) conducted quarterly by the CBN. It will take specific items/questions in the BES for which there are complete responses and undertake a multiple correspondence analysis (MCA) of the responses. Using a completely randomized scheme (a table of random numbers), it will exclude 15–35 percent of responses as if they were item non-responses and proceed to replace them through various imputation techniques. The MCA will then be repeated for each of the derived datasets and the results compared with those of the original datasets. The matrices of principal coordinates are compared using the RV coefficient (Escoufier, 1973), a measure of similarity between two datasets such that a value of 1 indicates complete similarity and 0 indicates complete dissimilarity; this coefficient is a generalization of the square of Pearson's correlation coefficient. The results of the RV coefficient analysis, as well as analyses of selected summary statistics, will be used to recommend the best imputation technique for such item non-responses in future surveys.
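
    For concreteness, below is a minimal NumPy implementation of the RV coefficient (Escoufier, 1973) used above to compare matrices of principal coordinates. It assumes both matrices are column-centred and have the same number of rows; the function name is ours, not from the dissertation.

        # RV coefficient between two configurations with the same number of rows.
        import numpy as np

        def rv_coefficient(X, Y):
            """RV = trace(XX'YY') / sqrt(trace((XX')^2) * trace((YY')^2))."""
            Sx, Sy = X @ X.T, Y @ Y.T
            return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))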

    Childbearing intentions in a low fertility context: the case of Romania

    This paper applies the Theory of Planned Behaviour (TPB) to identify the predictors of fertility intentions in Romania, a low-fertility country. We analyse how attitudes, subjective norms and perceived behavioural control relate to the intention to have a child among childless individuals and one-child parents. Principal axis factor analysis confirms which items proposed by the Generations and Gender Survey (GGS 2005) act as valid and reliable measures of the suggested theoretical socio-psychological factors. Four parity-specific logistic regression models are applied to evaluate the relationship between the socio-psychological factors and childbearing intentions. Social pressure emerges as the most important aspect of fertility decision-making among childless individuals and one-child parents, and positive attitudes towards childbearing are a strong component in planning for a child. The paper also underlines the importance of region-specific factors when studying childbearing intentions: planning for a second child differs significantly among the development regions, which represent the cultural and socio-economic divisions of the Romanian territory.

    Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

    Missing data is one of the most common preprocessing problems. In this paper, we experimentally investigate the use of generative and non-generative models for feature reconstruction. The Variational Autoencoder with Arbitrary Conditioning (VAEAC) and the Generative Adversarial Imputation Network (GAIN) were studied as representatives of generative models, while the denoising autoencoder (DAE) represented non-generative models. The performance of the models is compared to the traditional methods k-nearest neighbours (k-NN) and Multiple Imputation by Chained Equations (MICE). Moreover, we introduce WGAIN, a Wasserstein modification of GAIN, which turns out to be the best imputation model when the degree of missingness is less than or equal to 30%. Experiments were performed on real-world and artificial datasets with continuous features where different percentages of features, varying from 10% to 50%, were missing. Algorithms were evaluated by measuring the accuracy of a classification model previously trained on the uncorrupted dataset. The results show that GAIN, and especially WGAIN, are the best imputers regardless of the conditions. In general, they outperform or are comparable to MICE, k-NN, DAE, and VAEAC.

    Comment: Preprint of the conference paper (ICCS 2020), part of the Lecture Notes in Computer Science.
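
    The evaluation protocol described above can be sketched as follows, assuming NumPy plus any scikit-learn-style classifier and imputer: test-set features are masked at a given rate, re-imputed, and then scored with a classifier previously trained on the uncorrupted data. The 30% masking rate, the uniform masking scheme, and all names are illustrative assumptions, not the paper's exact code.

        # Sketch of "train on clean data, score on masked-and-imputed data".
        import numpy as np

        def evaluate_imputer(clf, imputer, X_test, y_test, missing_rate=0.3, seed=0):
            """clf must already be trained on the uncorrupted training set."""
            rng = np.random.default_rng(seed)
            X_missing = X_test.astype(float).copy()
            mask = rng.random(X_missing.shape) < missing_rate  # uniform masking
            X_missing[mask] = np.nan
            # Fitting the imputer on the masked test set is a simplification.
            return clf.score(imputer.fit_transform(X_missing), y_test)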

    Illuminate the unknown: Evaluation of imputation procedures based on the SAVE Survey

    Questions about monetary variables (such as income, wealth or savings) are key components of questionnaires on household finances. However, missing information on such sensitive topics is a well-known phenomenon which can seriously bias any inference based only on complete-case analysis. Many imputation techniques have been developed and implemented in several surveys. Using the German SAVE data, this paper evaluates different techniques for the imputation of monetary variables by means of a simulation study, in which a random pattern of missingness is imposed on the observed values of the variables of interest. New estimation techniques are necessary to overcome the upward bias of monetary variables caused by the initially implemented imputation procedure. A Monte Carlo simulation based on the observed data shows the superiority of the newly implemented smearing estimate in reconstructing the missing data structure. All waves are consistently imputed using the new method.
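
    The "smearing estimate" mentioned above is presumably Duan's (1983) smearing estimator for retransformation bias; a minimal sketch under that assumption follows. A log-linear regression is fitted on the observed cases, and imputations are retransformed with the mean of the exponentiated residuals rather than a naive exponentiation, which would be biased. The variables and model specification are illustrative, not the SAVE implementation.

        # Regression imputation of a positive monetary variable with Duan's smearing factor.
        import numpy as np

        def smearing_predict(X_obs, y_obs, X_miss):
            """Regress log(y) on X for observed cases, then retransform with smearing."""
            Xd = np.column_stack([np.ones(len(X_obs)), X_obs])
            beta, *_ = np.linalg.lstsq(Xd, np.log(y_obs), rcond=None)
            resid = np.log(y_obs) - Xd @ beta
            smear = np.mean(np.exp(resid))           # Duan's smearing factor
            Xm = np.column_stack([np.ones(len(X_miss)), X_miss])
            return np.exp(Xm @ beta) * smear         # bias-corrected imputations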

    Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

    Missing data is one of the most common issues encountered in the data cleaning process, especially when dealing with medical datasets. Real-world datasets are prone to be incomplete, inconsistent, noisy and redundant for reasons such as human error, instrument failure, and adverse events such as death. Therefore, sophisticated algorithms are needed to impute the missing values accurately. Many machine learning algorithms have been applied to impute missing data with plausible values. Among them, the k-NN algorithm has been widely adopted for missing data imputation owing to its robustness and simplicity, and it is a promising method that often outperforms other machine learning methods. This paper provides a comprehensive review of the different imputation techniques used to replace missing data. The goal of the review is to draw attention to potential improvements to existing methods and to give readers a better grasp of trends in imputation techniques.
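
    As a concrete reference point for the k-NN imputation the review highlights, here is a self-contained scikit-learn example on toy data; the array and k = 2 are purely illustrative.

        # k-NN imputation on a toy matrix with missing entries.
        import numpy as np
        from sklearn.impute import KNNImputer

        X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
        X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
        # Each NaN is replaced by the mean of that feature over the 2 nearest rows,
        # where distances are computed on the features observed in both rows.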

    Autoencoder for clinical data analysis and classification : data imputation, dimensional reduction, and pattern recognition

    Over the last decade, research has focused on machine learning and data mining to develop frameworks that can improve data analysis and output performance, and to build accurate decision support systems that benefit from real-life datasets. This leads to the field of clinical data analysis, which has attracted a significant amount of interest in the computing, information systems, and medical fields. To create and develop models with machine learning algorithms, the existing algorithms need a particular type of data to build an efficient model. Clinical datasets pose several issues that can affect classification: missing values, high dimensionality, and class imbalance. In order to build a framework for mining the data, it is necessary first to preprocess it, by eliminating patient records that have too many missing values, imputing the remaining missing values, addressing high dimensionality, and classifying the data for decision support.

    This thesis investigates a real clinical dataset and addresses these challenges. An autoencoder is employed as a tool that can compress the data mining methodology, extracting features and classifying data in one model. The first step in the methodology is to impute missing values, so several imputation methods are analysed and employed. High dimensionality is then addressed by discarding irrelevant and redundant features, in order to improve prediction accuracy and reduce computational complexity. Class imbalance is manipulated to investigate its effect on feature selection and classification algorithms.

    The first stage of the analysis investigates the role of missing values; the results show that imputation techniques based on class separation outperform other techniques in predictive ability. The next stage investigates high dimensionality and class imbalance. A small set of features was found that can improve classification performance, while balancing the classes does not affect performance as much as the class imbalance itself.
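
    A rough PyTorch sketch of the denoising-autoencoder idea described above: inputs are randomly corrupted during training and the network learns to reconstruct the original values, so missing entries can later be filled in from the reconstruction. The architecture, corruption rate, and all names are assumptions for illustration, not the thesis's actual model.

        # Denoising autoencoder that learns to reconstruct corrupted inputs.
        import torch
        import torch.nn as nn

        class DenoisingAE(nn.Module):
            def __init__(self, n_features, hidden=32):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
                self.decoder = nn.Linear(hidden, n_features)

            def forward(self, x):
                return self.decoder(self.encoder(x))

        def train_step(model, optimizer, x, corrupt_p=0.2):
            noisy = x * (torch.rand_like(x) > corrupt_p)    # randomly zero out entries
            loss = nn.functional.mse_loss(model(noisy), x)  # reconstruct the originals
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()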

    Data mining for heart failure : an investigation into the challenges in real life clinical datasets

    Clinical data presents a number of challenges, including missing data, class imbalance, high dimensionality and non-normal distributions. A motivation for this research is to investigate and analyse the manner in which these challenges affect the performance of algorithms. The challenges were explored with the help of a real-life heart failure clinical dataset known as Hull LifeLab, obtained from a live cardiology clinic at the Hull Royal Infirmary Hospital. A Clinical Data Mining Workflow (CDMW) was designed with three intuitive stages, namely descriptive, predictive and prescriptive. The naming of these stages reflects the nature of the analysis that is possible within each stage; a number of different algorithms are therefore employed. Most algorithms require the data to be normally distributed, yet the distribution is not used explicitly within the algorithms. Approaches based on Bayes use the properties of the distributions very explicitly, and thus provide valuable insight into the nature of the data.

    The first stage of the analysis investigates whether the assumptions made for Bayes hold, e.g. the strong independence assumption and the assumption of a Gaussian distribution. The next stage investigates the role of missing values. The results show that imputation does not affect performance as much as the records that are complete from the outset; these records are often not outliers, but contain problem variables, and a method was developed to identify them. The effect of skew in the data was also investigated within the CDMW; methods based on Bayes were able to handle it, albeit with a small variability in performance. The thesis provides insight into the reasons why clinical data often causes problems. Even class imbalance is not an issue, since Bayes methods are independent of it.
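
    The first-stage assumption checks described above might look roughly like the following, assuming NumPy and SciPy: a per-feature normality test for the Gaussian assumption, and a scan of pairwise correlations for the independence assumption. The tests and thresholds are our illustrative choices, not those of the thesis.

        # Checking the Gaussian and independence assumptions behind naive Bayes.
        import numpy as np
        from scipy import stats

        def check_naive_bayes_assumptions(X, alpha=0.05):
            """Return per-feature normality flags and the largest off-diagonal correlation."""
            normal = [stats.shapiro(X[:, j]).pvalue > alpha for j in range(X.shape[1])]
            corr = np.corrcoef(X, rowvar=False)
            max_corr = np.abs(corr - np.eye(X.shape[1])).max()
            return normal, max_corr  # strong correlations undermine independence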

    Determinants of the acceptance of domestic use of recycled water by use type

    In the circular economy model, the recycling of water is an alternative that can reduce the pressure on water resources and guarantee water supply. This water policy measure is currently widespread in agriculture, but thus far few countries have opted for the domestic use of recycled water. In part, this is because it is the source of water with the lowest levels of public acceptance, which poses a threat to the success of the necessary investment. We analyse the degree of acceptance of recycled water for different domestic uses. The main contribution of this study is the analysis of the determinants of acceptance of recycled water by use type. The research was based on data from a questionnaire given to 844 university students in Andalusia, southern Spain. Results are obtained from ordinary least squares regressions that relate the determinants of recycled water acceptance to each of the water use classes. The 'yuck factor', variously defined as 'disgust' or 'psychological repugnance', and the perceived risk are found to be the main determinants of the low degree of acceptance of recycled water for ingestion by people and pets. For other uses, such as body washing, laundry and cleaning, environmental awareness stands out as a determining factor. The main conclusion is that if authorities were to opt for measures to promote the use of recycled water, they should take into account that the reluctance to use recycled water and the determinants of acceptance differ according to the intended use.

    Funding: European Regional Development Fund; Spanish Agencia Estatal de Investigación; Regional Government of Andalusia.