
829 research outputs found

Nearest neighbours in least-squares data imputation algorithms with different missing patterns

Author: Atkeson
Berry
Boris Mirkin
Davies
Dempster
Everrit
Gabriel
Golub
Grung
Heiser
Holter
Holzinger
Hunt
Ito Wasito
Jollife
Kamakashi
Kenney
Kiers
Krzanowski
Laaksonen
Little
Mirkin
Myrtveit
Roweis
Rubin
Rubin
Schafer
Shum
Tipping
Troyanskaya
Wold
Publication venue: 'Elsevier BV'
Publication date: 01/01/2006

Methods for imputation of missing data in the so-called least-squares approximation approach, a non-parametric computationally efficient multidimensional technique, are experimentally compared. Contributions are made to each of the three components of the experiment setting: (a) algorithms to be compared, (b) data generation, and (c) patterns of missing data. Specifically, "global" methods for least-squares data imputation are reviewed and extensions to them are proposed based on the nearest neighbours (NN) approach. A conventional generator of mixtures of Gaussian distributions is theoretically analysed and, then, modified to scale clusters differently. Patterns of missing data are defined in terms of rows and columns according to three different mechanisms that are referred to as Random missings, Restricted random missings, and Merged database. It appears that NN-based versions almost always outperform their global counterparts. With the Random missings pattern, the winner is always the authors' two-stage method M, which combines global and local imputation algorithms

Crossref

Birkbeck Institutional Research Online

Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

Author: Abidin Nadzurah Zainal
Ismail Amelia Ritahani
Maen Mhd Khaled
Publication venue: 'Universitas Muhammadiyah Yogyakarta'
Publication date: 05/02/2022

Missing data is one of the most common issues encountered in the data cleaning process, especially when dealing with medical datasets. A real collected dataset is prone to be incomplete, inconsistent, noisy and redundant for reasons such as human errors, instrumental failures, and adverse death. Therefore, to deal accurately with incomplete data, a sophisticated algorithm is proposed to impute the missing values. Many machine learning algorithms have been applied to impute missing data with plausible values. Among these, the KNN algorithm has been widely adopted for imputation because of its robustness and simplicity, and it is a promising candidate to outperform other machine learning methods. This paper provides a comprehensive review of different imputation techniques used to replace missing data. The goal of the review is to draw attention to potential improvements to existing methods and to give readers a better grasp of trends in imputation techniques
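
To make the KNN imputation idea mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is not taken from the reviewed paper: the function name `knn_impute`, the distance (root-mean-square difference over the columns both rows observed), the choice of `k`, and the toy data are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is the root-mean-square difference over the columns that both
    rows have observed (an illustrative choice, not a prescribed one).
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    completed = [list(r) for r in rows]  # copy so the input stays untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no row observed this column
                completed[i][j] = sum(nearest) / len(nearest)
    return completed

# Hypothetical toy data: the first row is missing its third value.
data = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [8.0, 9.0, 20.0],
]
filled = knn_impute(data, k=2)
```

In this toy case the gap is filled from the two rows closest to the incomplete one, so the distant fourth row does not influence the estimate.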

The International Islamic University Malaysia Repository

Leading & Enlightening Journal UMY

Multivariate Data Imputation using Trees

Author: Bárcena Ruiz María Jesús
Tusell Palmer Fernando Jorge

We address the problem of completing two files with records containing a fully observed common subset of variables. The technique investigated involves the use of regression and/or classification trees. An extension of current methodology (the intersection-seeking or "forest-climbing" algorithm) is proposed to deal with multivariate response variables. The method is demonstrated and shown to be feasible and to have some desirable properties. Keywords: file completion, data imputation, regression trees

Research Papers in Economics

Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

Author: Gheyas Iffat A.
Publication venue: University of Stirling
Publication date: 24/11/2009

This thesis addresses three major issues in data mining regarding feature subset selection in large dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability to avoid being trapped in local minima of Simulated Annealing with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of Generalized Regression Neural Networks (GRNNs) trained on different subsets of features generated by SAGA and the predictions of base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both base classifiers and the top level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme. This is in contrast to the existing ensemble approaches which belong to the simple voting and static weighting strategy. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model

Stirling Online Research Repository

Influence of missing values substitutes on multivariate analysis of metabolomics data

Author: Brereton
Dixon
Duda
Duran
Efron
Everitt
Haenlein
Hair
Hastie
Hrydziuszko
Jolliffe
Kopka
Kotze
Little
Macfie
Manly
Stacklies
Szekely
Team
Varmuza
Venables
Walczak
Xia
Publication venue: 'MDPI AG'
Publication date: 01/01/2014

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values cover about 10%–20% of all data and can originate from various sources, including analytical, computational, and biological ones. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this aspect of data analysis in their metabolomics pipeline so routine that they do not even mention using this replacement approach. However, it may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons have been demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of replacement method for missing values may have a considerable effect on classification accuracy; if performed incorrectly, it may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF should be favoured for imputing missing values over the other tested methods. This approach displayed excellent results in terms of classification rate for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods

Multidisciplinary Digital Publishing Institute

University of Salford Institutional Repository

Crossref

Directory of Open Access Journals

PubMed Central

Enlighten

The University of Manchester - Institutional Repository

Non-parametric regression for space-time forecasting under missing data

Author: Cheng T
Haworth J
Publication date: 30/09/2012

As more and more real time spatio-temporal datasets become available at increasing spatial and temporal resolutions, the provision of high quality, predictive information about spatio-temporal processes becomes an increasingly feasible goal. However, many sensor networks that collect spatio-temporal information are prone to failure, resulting in missing data. To complicate matters, the missing data is often not missing at random, and is characterised by long periods where no data is observed. The performance of traditional univariate forecasting methods such as ARIMA models decreases with the length of the missing data period because they do not have access to local temporal information. However, if spatio-temporal autocorrelation is present in a space–time series then spatio-temporal approaches have the potential to offer better forecasts. In this paper, a non-parametric spatio-temporal kernel regression model is developed to forecast the future unit journey time values of road links in central London, UK, under the assumption of sensor malfunction. Only the current traffic patterns of the upstream and downstream neighbouring links are used to inform the forecasts. The model performance is compared with another form of non-parametric regression, K-nearest neighbours, which is also effective in forecasting under missing data. The methods show promising forecasting performance, particularly in periods of high congestion
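
The kernel regression idea in this abstract can be illustrated with a generic Nadaraya-Watson estimator, sketched below in plain Python. This is not the authors' spatio-temporal model: the Gaussian kernel, the function name `kernel_forecast`, the bandwidth, and the toy values (pairs of neighbouring-link readings with the journey time that followed) are all assumptions made for illustration.

```python
import math

def kernel_forecast(history, query, bandwidth=1.0):
    """Nadaraya-Watson estimate: a kernel-weighted average of past outcomes.

    history   -- list of (pattern, outcome) pairs, pattern being a tuple of
                 readings from neighbouring links (hypothetical layout)
    query     -- the current pattern to forecast from
    bandwidth -- Gaussian kernel width; smaller values favour closer matches
    """
    num = 0.0
    den = 0.0
    for pattern, outcome in history:
        sq_dist = sum((p - q) ** 2 for p, q in zip(pattern, query))
        weight = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        num += weight * outcome
        den += weight
    if den == 0.0:
        raise ValueError("all kernel weights underflowed; increase bandwidth")
    return num / den

# Hypothetical history: (upstream, downstream) readings and the value observed next.
history = [
    ((0.20, 0.22), 0.21),
    ((0.21, 0.20), 0.20),
    ((0.60, 0.65), 0.70),
    ((0.62, 0.61), 0.68),
]
forecast = kernel_forecast(history, query=(0.61, 0.63), bandwidth=0.05)
```

Because the query resembles the two congested patterns, their outcomes dominate the weighted average, and the target link itself needs no recent observations, which is the property that matters when its sensor has failed.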

Elsevier - Publisher Connector

UCL Discovery

Classifiers accuracy improvement based on missing data imputation

Author: Jordanov Ivan
Petrov Nedyalko
Petrozziello Alessio
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2018

In this paper we investigate further and extend our previous work on radar signal identification and classification based on a data set which comprises continuous, discrete and categorical data that represent radar pulse train characteristics such as signal frequencies, pulse repetition, type of modulation, intervals, scan period, scanning type, etc. Like most real-world datasets, it also contains a high percentage of missing values, and to deal with this problem we investigate three imputation techniques: Multiple Imputation (MI); K-Nearest Neighbour Imputation (KNNI); and Bagged Tree Imputation (BTI). We apply these methods to data samples with up to 60% missingness, in this way doubling the number of instances with complete values in the resulting dataset. The imputation models' performance is assessed with Wilcoxon’s test for statistical significance and Cohen’s effect size metrics. To solve the classification task, we employ three intelligent approaches: Neural Networks (NN); Support Vector Machines (SVM); and Random Forests (RF). Subsequently, we critically analyse which imputation method most influences the classifiers’ performance, using a multiclass classification accuracy metric based on the area under the ROC curves. We consider two superclasses (‘military’ and ‘civil’), each containing several ‘subclasses’, and propose two new metrics, inner class accuracy (IA) and outer class accuracy (OA), in addition to the overall classification accuracy (OCA) metric. We conclude that they can be used as complements to the OCA when choosing the best classifier for the problem at hand

Biblioteka Nauki - repozytorium artykułów

Portsmouth University Research Portal (Pure)

Techniques for clustering gene expression data

Author: Crane Martin
Doolan Padraig
Kerr Gráinne
Ruskin Heather J.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007

Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered

CiteSeerX

Irish Universities

DCU Online Research Access Service

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments

Author: Celton Magalie
de Brevern Alexandre G
Lelandais Gaëlle
Malpertuy Alain
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by a hierarchical algorithm. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we have evaluated twelve different usable methods and their influence on the quality of gene clustering. We used several datasets, from both kinetic and non-kinetic experiments on yeast and human. Results: We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially the one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, which are the most difficult to predict. We showed that the imputed MVs still have important effects on the stability of the gene clusters. The improvement in the clustering obtained by hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and gene cluster stability. Even though the comparison between clustering algorithms is a complex task, we observed that the k-means approach is more efficient at conserving gene associations. Conclusions: More than 6,000,000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have been made since our last study. The EM_array approach is an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor in gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and therefore for dedicated benchmarks. A noticeable point is the specific influence of some biological datasets.

Springer - Publisher Connector

Directory of Open Access Journals