
1,054 research outputs found

A systematic review of data quality issues in knowledge discovery tasks

Author: Corrales David Camilo
Corrales Juan Carlos
Ledezma Agapito Ismael
Publication venue: 'Universidad de Medellin'
Publication date: 07/11/2015

The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Universidad de Medellín: Revistas Científicas

Repositorio Institucional Universidad de Medellín

DIALNET

A split questionnaire survey design applied to German media and consumer surveys

Author: Koller Florian
Mäenpää Christine
Rässler Susanne

On the basis of real data sets, it is shown that splitting a questionnaire survey according to technical rather than qualitative criteria can reduce costs and respondent burden remarkably. Household interview surveys about media and consumer behavior are analyzed and split into components. Following the matrix sampling approach, respondents are asked only varying subsets of the components, which induces missing data by design. These missing data are imputed afterwards to create a complete data set. In an iterative algorithm, every variable with missing values is regressed on all other variables, which are either originally complete or contain current imputations. The imputation procedure itself is based on so-called predictive mean matching. In this contribution, the validity of split and imputation is discussed based on the preservation of empirical distributions, bivariate associations, conditional associations, and regression inference. We find that many empirical distributions of the complete data are reproduced well in the imputed data sets. For these long media and consumer questionnaires, we conclude that nearly the same inference can be achieved by means of such a split design, with reduced costs and lower respondent burden.
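To make the imputation step in the abstract above concrete, here is a minimal sketch of predictive mean matching for a single variable. It is not the authors' implementation: the function name `pmm_impute`, the ordinary least squares model, and the donor pool size `k` are assumptions chosen for illustration, and the iterative cycling over all incomplete variables described in the abstract is omitted.

```python
import numpy as np

def pmm_impute(X, y, k=5, rng=None):
    """Impute NaN entries of y by predictive mean matching (illustrative sketch).

    X: (n, p) fully observed predictors; y: (n,) target containing NaNs.
    The linear model and the pool size k are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    obs = ~np.isnan(y)
    # Fit ordinary least squares on the observed cases only.
    beta, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    pred = X1 @ beta
    donors_pred, donors_y = pred[obs], y[obs]
    out = y.copy()
    for i in np.flatnonzero(~obs):
        # Donors are the observed cases whose predicted mean is closest
        # to the predicted mean of the case with the missing value.
        nearest = np.argsort(np.abs(donors_pred - pred[i]))[:k]
        # Borrow the actually observed value of one randomly chosen donor.
        out[i] = donors_y[rng.choice(nearest)]
    return out
```

Because every imputed value is copied from an observed respondent, the imputations stay within the range of real answers, which is one reason the method is often used when empirical distributions should be preserved.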

Research Papers in Economics

Multivariate Data Imputation using Trees

Author: Bárcena Ruiz María Jesús
Tusell Palmer Fernando Jorge

We address the problem of completing two files with records containing a fully observed common subset of variables. The technique investigated involves the use of regression and/or classification trees. An extension of current methodology (the intersection-seeking or "forest-climbing" algorithm) is proposed to deal with multivariate response variables. The method is demonstrated and shown to be feasible and to have some desirable properties. Keywords: file completion, data imputation, regression trees

Research Papers in Economics

Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

Author: Marko Nicholas
Razzaghi Talayeh
Roderick Oleg
Safro Ilya
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 07/04/2016

This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: it is noisy, has missing entries, and shows imbalance in the classes of interest, all of which lead to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization (EM) imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Author: Blakesley Richard E
Brock Guy N
Lotz Meredith J
Shaffer John R
Tseng George C
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature, many methods have been proposed to estimate missing values via information on the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results: We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS, and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
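The abstract above does not give the formula for its entropy measure. As a hedged illustration of how the "difficulty in mapping the matrix to a lower-dimensional subspace" could be quantified, the sketch below computes a normalized entropy of the squared singular values. This particular definition and the name `svd_entropy` are assumptions for illustration and may differ from the measure used in the paper.

```python
import numpy as np

def svd_entropy(M):
    """Normalized entropy of the singular value spectrum, in [0, 1].

    Assumed definition (not taken from the paper): values near 0 mean one
    component dominates (low complexity); values near 1 mean the variation
    is spread evenly over all components (high complexity).
    """
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    total = np.sum(s**2)
    if len(s) < 2 or total == 0:
        return 0.0
    # Fraction of the total squared magnitude carried by each component.
    p = s**2 / total
    p = p[p > 0]  # skip exact zeros so that log is defined
    return float(-np.sum(p * np.log(p)) / np.log(len(s)))
```

Under this assumed measure, a rank-one matrix scores close to 0 and a matrix with equal singular values scores 1, so a threshold on the score could in principle steer the choice between global and neighbour-based methods.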

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach

Author: Chivers Benedict
Cole Steven
Fry Matthew
Leontidis Georgios
Sebek Ondrej
Stanley Simon
Wallbank John
Publication venue: 'Elsevier BV'
Publication date: 02/05/2020

This research was supported by a UKRI-NERC Constructing a Digital Environment Strategic Priority grant “Engineering Transformation for the Integration of Sensor Networks: A Feasibility Study” [NE/S016236/1 & NE/S016244/1]. Peer reviewed. Postprint.

arXiv.org e-Print Archive

Aberdeen University Research

NERC Open Research Archive

A binary neural k-nearest neighbour technique

Author: Austin J.
Hodge V.J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/02/2005

K-Nearest Neighbour (k-NN) is a widely used technique for classifying and clustering data. k-NN is effective but is often criticised for its polynomial run-time growth, because it calculates the distance to every other record in the data set for each record in turn. This paper evaluates a novel k-NN classifier with linear growth and faster run-time, built from binary neural networks. The binary neural approach uses robust encoding to map standard ordinal, categorical and real-valued data sets onto a binary neural network. The binary neural network uses high-speed pattern matching to recall the k best matches. We compare various configurations of the binary approach to a conventional approach for memory overheads, training speed, retrieval speed and retrieval accuracy. We demonstrate the superior performance of the binary approach compared to the standard approach with respect to speed and memory requirements, and we pinpoint the optimal configurations.
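For reference, the conventional approach that the abstract above compares against can be sketched as a brute-force classifier. This is a generic illustration, not the paper's baseline code or its binary neural method; the name `knn_predict` and the Euclidean distance are assumptions.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query record by majority vote of its k nearest neighbours."""
    train_X = np.asarray(train_X, dtype=float)
    # The distance from the query to every stored record is computed, so one
    # query costs O(n) and classifying all n records in turn costs O(n^2).
    dists = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The full scan over `train_X` for every query is the source of the run-time growth that the abstract criticises.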

White Rose Research Online

Classifiers accuracy improvement based on missing data imputation

Author: Jordanov Ivan
Petrov Nedyalko
Petrozziello Alessio
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2018

In this paper we investigate further and extend our previous work on radar signal identification and classification, based on a data set that comprises continuous, discrete and categorical data representing radar pulse train characteristics such as signal frequencies, pulse repetition, type of modulation, intervals, scan period, scanning type, etc. Like most real-world datasets, it also contains a high percentage of missing values, and to deal with this problem we investigate three imputation techniques: Multiple Imputation (MI); K-Nearest Neighbour Imputation (KNNI); and Bagged Tree Imputation (BTI). We apply these methods to data samples with up to 60% missingness, thereby doubling the number of instances with complete values in the resulting dataset. The performance of the imputation models is assessed with Wilcoxon’s test for statistical significance and Cohen’s effect size metrics. To solve the classification task, we employ three intelligent approaches: Neural Networks (NN); Support Vector Machines (SVM); and Random Forests (RF). Subsequently, we critically analyse which imputation method most influences the classifiers’ performance, using a multiclass classification accuracy metric based on the area under the ROC curves. We consider two superclasses (‘military’ and ‘civil’), each containing several ‘subclasses’, and propose two new metrics, inner class accuracy (IA) and outer class accuracy (OA), in addition to the overall classification accuracy (OCA) metric. We conclude that they can be used as complements to the OCA when choosing the best classifier for the problem at hand.
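Of the three imputation techniques named in the abstract above, K-Nearest Neighbour Imputation is the simplest to illustrate. The sketch below is a generic version for continuous features only and is not the authors' implementation; the name `knn_impute`, the root-mean-square distance over shared columns, and the mean of neighbour values are all assumptions.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs with the mean of the k nearest rows that observe that column."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        # Distance to every other row, using only columns both rows observe.
        dists = np.full(len(X), np.inf)
        for j in range(len(X)):
            shared = ~missing[i] & ~missing[j]
            if j != i and shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff**2))
        for c in np.flatnonzero(missing[i]):
            # Candidate neighbours must have an observed value in column c.
            cand = np.flatnonzero(~missing[:, c] & np.isfinite(dists))
            if cand.size == 0:
                continue  # no usable neighbour: leave the entry missing
            nearest = cand[np.argsort(dists[cand])[:k]]
            out[i, c] = X[nearest, c].mean()
    return out
```

Discrete and categorical fields, which the radar data set also contains, would need a different distance and a mode or vote instead of a mean.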

Biblioteka Nauki - repozytorium artykułów

Portsmouth University Research Portal (Pure)


A collateral missing value estimation algorithm for DNA microarrays

Author: Dooley L.
Gondal I.
Sehgal M. S. B.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2005

Genetic microarray expression data often contain multiple missing values that can significantly affect the performance of statistical and machine learning algorithms. This paper presents an innovative missing value estimation technique, called collateral missing value estimation (CMVE), which has demonstrated superior estimation performance compared with the K-nearest neighbour (KNN) imputation algorithm, the least square impute (LSImpute) and Bayesian principal component analysis (BPCA) techniques. Experimental results confirm that CMVE provides an improvement of 89%, 12% and 10% for the BRCA1, BRCA2 and sporadic ovarian cancer mutations, respectively, compared to the average error rate of the KNN, LSImpute and BPCA imputation methods, over a range of randomly selected missing values. The underlying theory behind CMVE also means that it is not restricted to bioinformatics data, but can be successfully applied to any correlated data set.

Open Research Online (The Open University)

Missing value imputation improves clustering and interpretation of gene expression microarray data

Author: AG de Brevern
D Wang
G Feten
H Kim
H Kuhn
H Yoshimoto
I Scheel
J Handl
J He
J Hu
J Tuikkala
JJ Wyrick
JL DeRisi
Johannes Tuikkala
Laura L Elo
M Al-Daoud
M Hirao
M Kankainen
M Ronen
M Shapira
MJ Brauer
O Troyanskaya
Olli S Nevalainen
P D'haeseleer
PT Spellman
R Jörnsten
S Oba
S Tavazoie
T Lange
Tero Aittokallio
TR Golub
X Gan
X Wang
Y Shi
Z Cai
Publication venue: BioMed Central
Publication date: 01/04/2008

Background: Missing values frequently pose problems in gene expression microarray experiments, as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods, since their performance seems to vary drastically depending on the dataset being used. Results: We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods. Conclusion: The results demonstrate that, while missing values still severely complicate microarray data analysis, their impact on the discovery of biologically meaningful gene groups can, up to a certain degree, be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
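The measurement-level evaluation mentioned in the abstract above can be sketched by hiding known entries, imputing them, and scoring the error. The example covers only the two simple baselines the abstract names (zeros and average values) on synthetic data; the normalized root-mean-square error, the 10% missing rate, and all function names are assumptions for illustration, and the cluster-level evaluation is not shown.

```python
import numpy as np

def nrmse(true, imputed, mask):
    """Root-mean-square error on the hidden entries, scaled by their spread."""
    err = imputed[mask] - true[mask]
    return float(np.sqrt(np.mean(err**2)) / np.std(true[mask]))

def zero_fill(X):
    """Replace every missing entry with zero."""
    return np.where(np.isnan(X), 0.0, X)

def row_mean_fill(X):
    """Replace missing entries with the mean of the observed values in the row."""
    means = np.nanmean(X, axis=1, keepdims=True)
    return np.where(np.isnan(X), means, X)

# Synthetic stand-in for an expression matrix (rows = genes, columns = arrays).
rng = np.random.default_rng(0)
full = rng.normal(loc=5.0, scale=1.0, size=(50, 12))
mask = rng.random(full.shape) < 0.1  # hide roughly 10% of the entries
holed = full.copy()
holed[mask] = np.nan

scores = {
    "zero": nrmse(full, zero_fill(holed), mask),
    "row mean": nrmse(full, row_mean_fill(holed), mask),
}
```

A more advanced imputation method would be added as another entry in `scores` and compared on the same mask.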

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central