The impact of missing data imputation on HIV classification
Missing data are a part of research and data analysis that often cannot be ignored. Although a
number of methods have been developed for handling and imputing missing data, the problem
remains, for the most part, unsolved, and many researchers still struggle with it.
Owing to the availability of software and advances in computational power, maximum
likelihood and multiple imputation, as well as neural networks combined with genetic
algorithms (AANN-GA), have been introduced as solutions to the missing data problem. Although these
methods have produced considerable results in this domain, the impact that missing data and
missing data imputation have on decision making had, until recently, not been assessed. This
dissertation contributes to this knowledge by first introducing a new computational intelligence
model that integrates Neuro-Fuzzy (N-F) modelling, Principal Component Analysis and
genetic algorithms to impute missing data. The performance of this model is then compared
to that of the AANN-GA, as well as to the independent use of the N-F architecture. In order to
determine whether the data are predictable, and to assist in processing the data for training, an
analysis of the HIV sero-prevalence data is performed.
Two classification decision-making frameworks are then presented in order to assess the
effect of missing data. These decision frameworks are trained to classify between two
conditions when presented with a set of data variables. The first uses a Bayesian
neural network, which is statistical in nature; the second is based on the fuzzy ARTMAP
(FAM) classifier, which has incremental learning abilities. The two methods are compared in
order to assess the degree of effect that missing data, and the imputation thereof, have on decision
making. The effect of missing data differs for the two frameworks: while the Bayesian neural
network fails in the presence of missing data, the FAM classifier attempts to classify with
decreased accuracy. This work has shown that although missing data and the imputation
thereof have an effect on decision making, the degree of that effect depends on the
decision-making framework and on the model used for data imputation.
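The idea behind model-plus-optimiser imputation, as in the AANN-GA, can be sketched in miniature: treat the missing entry as a free variable and choose the value the trained model reconstructs best. The sketch below is a simplified stand-in and assumes nothing from the dissertation itself: a one-component PCA plays the role of the trained model, and a grid search replaces the genetic algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with two strongly correlated columns (a stand-in for survey data).
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=n)])

# "Train" a one-component PCA model on the complete data.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
v = Vt[0]  # first principal direction

def reconstruct(row):
    """Project a row onto the principal component and back."""
    return mu + ((row - mu) @ v) * v

# Pretend the second feature of one record is missing; search for the value
# that minimises the model's reconstruction error (grid search here stands
# in for the genetic algorithm).
row = X[0].copy()
true_val = row[1]
candidates = np.linspace(X[:, 1].min(), X[:, 1].max(), 1000)

def error(val):
    trial = np.array([row[0], val])
    return np.sum((trial - reconstruct(trial)) ** 2)

imputed = candidates[np.argmin([error(c) for c in candidates])]
print(imputed, true_val)
```

Because the two columns are nearly collinear, the reconstruction-error minimum sits close to the value the model expects given the observed column, which is the intuition behind optimisation-based imputation.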
Robustness of Multiple Imputation under Missing at Random (MAR) Mechanism: A Simulation Study
Missing data are an unavoidable issue in controlled clinical trials and in public health research and practice. The presence of missing data, combined with inappropriate methods of analysis, produces biased estimates and reduces the power of a study. It is very important for investigators to use appropriate methods of analysis to deal with missing data in order to maintain the internal (power of study) and external (generalization of sample results to the larger population) validity of a study. The focus of this dissertation is to compare different methods of dealing with missing data in controlled clinical trials and public health research and practice. In addition, this dissertation discusses how current approaches to missing data might not produce valid inferences and may affect the internal and external validity of results. Furthermore, emphasis is placed on demonstrating how well multiple imputation deals with missing data under the Missing at Random (MAR) mechanism, with monotonic and non-monotonic missing data patterns, for a range of percentages of missingness under both normal and non-normal distributions. The results of this dissertation showed that multiple imputation is an efficient technique for obtaining valid inferences compared to single imputation methods. In addition, estimates obtained from multiple imputation also preserve the internal validity of the study.
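The core claim, that under MAR multiple imputation yields less biased estimates than complete-case analysis, can be illustrated with a minimal simulation. The data-generating model and the simplified stochastic-regression imputer below are illustrative assumptions, not the dissertation's actual procedure; the pooling step is Rubin's rule for the point estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate (x, y) where y depends on x; make y missing at random (MAR):
# the probability that y is missing depends only on the observed x.
n = 500
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))  # larger x -> more missing
y_obs = y.copy()
y_obs[miss] = np.nan

# Fit y ~ x on observed cases, then draw m imputed data sets by adding
# residual noise to the predictions (simplified stochastic regression).
obs = ~np.isnan(y_obs)
A = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(A, y_obs[obs], rcond=None)
resid_sd = np.std(y_obs[obs] - A @ beta)

m = 20
means = []
for _ in range(m):
    y_imp = y_obs.copy()
    y_imp[miss] = beta[0] + beta[1] * x[miss] + rng.normal(scale=resid_sd,
                                                           size=miss.sum())
    means.append(y_imp.mean())

# Rubin's rule for the point estimate: average over the m imputations.
pooled_mean = np.mean(means)
complete_case_mean = y_obs[obs].mean()  # biased, since missingness tracks x
print(pooled_mean, complete_case_mean)  # true mean of y is 3.0
```

Because missingness is driven by the observed x, complete-case analysis systematically drops high-x records and underestimates the mean, while the pooled multiple-imputation estimate stays close to the true value.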
Partial least squares structural equation modelling with incomplete data. An investigation of the impact of imputation methods.
Despite considerable advances in missing data imputation methods over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions. These techniques can be categorised into two classes: statistical methods of data imputation and computational intelligence methods of data imputation. Owing to the longstanding use of statistical methods for handling missing data problems, it has taken some time for computational intelligence methods to gain serious attention, even though they achieve comparable accuracy. The merits of both classes have been discussed at length in the literature, but only a limited number of studies make a substantive comparison between them.
This thesis contributes to knowledge by, firstly, conducting a comprehensive comparison of standard statistical methods of data imputation, namely mean substitution (MS), regression imputation (RI), expectation maximisation (EM), tree imputation (TI) and multiple imputation (MI), on data sets that are missing completely at random (MCAR). Secondly, this study compares the efficacy of these methods with a computational intelligence method of data imputation, namely a neural network (NN), on data sets that are missing not at random (MNAR). The significance of the differences in performance between the methods is presented. Thirdly, a novel procedure for handling missing data is presented: a hybrid combination of each of these statistical methods with a NN, referred to here as the post-processing procedure, is adopted to approximate MNAR data sets. Simulation studies for each of these imputation approaches were conducted to assess the impact of missing values on partial least squares structural equation modelling (PLS-SEM), based on the estimated accuracy of both structural and measurement parameters.
The best method for dealing with each missing data mechanism is identified. Several significant insights were drawn from the simulation results. It was found that, for the MCAR problem, MI performs better than the other statistical methods of data imputation for all percentages of missing data. Another unique contribution emerges when comparing the results before and after the NN post-processing procedure. The improvement in accuracy may result from the neural network's ability to derive meaning from the imputed data set produced by the statistical methods. Based on these results, the NN post-processing procedure is capable of assisting MS in producing a significant improvement in the accuracy of the approximated values. This is a promising result, as MS is the weakest method in this study. The evidence is also informative, as MS is often the default method available to users of PLS-SEM software.
Funding: Ministry of Higher Education Malaysia and Universiti Utara Malaysia.
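The finding that mean substitution is the weakest method has a simple mechanical explanation: replacing missing values with a constant shrinks a variable's variance and attenuates its correlations, distorting the covariance structure that PLS-SEM estimation relies on. A minimal sketch, using made-up data rather than the thesis's simulation design:

```python
import numpy as np

rng = np.random.default_rng(7)

# A correlated pair; knock out 40% of the second variable completely at random.
n = 1000
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.4

y_ms = y.copy()
y_ms[miss] = y[~miss].mean()  # mean substitution

# Mean substitution shrinks the variance of y and attenuates its
# correlation with x relative to the fully observed data.
var_ratio = y_ms.var() / y.var()
corr_full = np.corrcoef(x, y)[0, 1]
corr_ms = np.corrcoef(x, y_ms)[0, 1]
print(var_ratio, corr_full, corr_ms)
```

With roughly 40% of values replaced by a constant, the variance ratio falls to about the observed fraction of the data, and the correlation is deflated by roughly the square root of that fraction, which is why any post-processing step that restores structure can improve markedly on MS.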