A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets
Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest-neighbor substitution approach, on Gaussian-distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets with those obtained on the originally complete data sets. The results are also compared with those obtained by applying two other missing data handling methods: constant default value substitution and ignoring the missing data (non-substitution). The experimental results provide valuable insight into improving the accuracy of data discrimination and knowledge discovery on large data sets containing missing values.
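The two substitution strategies contrasted in this abstract can be illustrated with a minimal sketch. The abstract does not specify how the pseudo-nearest-neighbor method works, so the code below assumes a generic nearest-neighbor substitution (copy the missing entries from the closest fully observed row, with distance measured on the observed features only). The function names `nn_substitute` and `constant_substitute` are hypothetical and are not taken from the paper.

```python
import numpy as np


def nn_substitute(X):
    """Fill NaNs in each row from the closest fully observed row.

    Assumption: a generic nearest-neighbor substitution, not the paper's
    exact pseudo-nearest-neighbor procedure. Distance is Euclidean over
    the features that are observed in the incomplete row.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]  # candidate donor rows
    if complete.shape[0] == 0:
        raise ValueError("at least one complete row is required")
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        # Compare only on the coordinates this row actually has.
        dist = np.linalg.norm(complete[:, observed] - row[observed], axis=1)
        donor = complete[np.argmin(dist)]
        filled[i, missing] = donor[missing]
    return filled


def constant_substitute(X, default=0.0):
    """Replace every NaN with the same default value."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), default, X)


# Two well-separated groups; the last two rows each miss their second feature.
data = np.array([
    [0.0, 0.0],
    [10.0, 10.0],
    [0.5, np.nan],
    [9.0, np.nan],
])
print(nn_substitute(data))
print(constant_substitute(data, default=0.0))
```

In this toy case the nearest-neighbor fill keeps each incomplete row inside its own group, whereas the constant default pulls the row near (10, 10) toward the other group, which is the kind of distortion a downstream clustering step would pick up.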
Graph Estimation From Multi-attribute Data
Many real-world network problems concern multivariate nodal attributes
such as image, textual, and multi-view feature vectors on nodes, rather than
simple univariate nodal attributes. The existing graph estimation methods built
on Gaussian graphical models and covariance selection algorithms cannot handle
such data, nor can the theories developed around such methods be directly
applied. In this paper, we propose a new principled framework for estimating
graphs from multi-attribute data. Instead of estimating the partial correlation
as in the current literature, our method estimates the partial canonical
correlations that naturally accommodate complex nodal features.
Computationally, we provide an efficient algorithm which utilizes the
multi-attribute structure. Theoretically, we provide sufficient conditions
which guarantee consistent graph recovery. Extensive simulation studies
demonstrate the performance of our method under various conditions. Furthermore, we
provide illustrative applications to uncovering gene regulatory networks from
gene and protein profiles, and uncovering brain connectivity graphs from
functional magnetic resonance imaging data.
Comment: Extended simulation study. Added an application to a new data se
Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach
Numerical data imputation algorithms replace missing values by estimates to
leverage incomplete data sets. Current imputation methods seek to minimize the
error between the unobserved ground truth and the imputed values. But this
strategy can create artifacts leading to poor imputation in the presence of
multimodal or complex distributions. To tackle this problem, we introduce the
kNN×KDE algorithm: a data imputation method combining nearest neighbor
estimation (kNN) and density estimation with Gaussian kernels (KDE). We
compare our method with previous data imputation methods using artificial and
real-world data with different missing-data scenarios and various missing-data
rates, and show that our method can cope with complex original data structures,
yields lower data imputation errors, and provides probabilistic estimates with
higher likelihood than current methods. We release the code as open source for
the community: https://github.com/DeltaFloflo/knnxkde
Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility certification
- …
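The idea of combining nearest-neighbor estimation with Gaussian kernel density estimation can be sketched as follows. This is a simplified illustration under stated assumptions, not the code released in the repository above: the function name `knn_kde_impute`, the equal weighting of the k neighbors, and the fixed `bandwidth` are all hypothetical choices. The sketch returns samples from a Gaussian mixture centered on the neighbors' values, so a multimodal missing value is not averaged into a single point between the modes.

```python
import numpy as np


def knn_kde_impute(X, row_idx, col, k=5, bandwidth=0.5, n_samples=1000, rng=None):
    """Draw samples for the missing entry X[row_idx, col].

    Assumption: equal-weight mixture of Gaussians centered on the values
    that the k nearest neighbors take in column `col`. Neighbors are found
    with Euclidean distance on the target row's observed features.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    target = X[row_idx]
    obs = ~np.isnan(target)
    obs[col] = False
    # Usable donors have `col` observed and every feature the target has.
    usable = ~np.isnan(X[:, col]) & ~np.isnan(X[:, obs]).any(axis=1)
    usable[row_idx] = False
    donors = X[usable]
    if donors.shape[0] == 0:
        raise ValueError("no usable donor rows")
    dist = np.linalg.norm(donors[:, obs] - target[obs], axis=1)
    centers = donors[np.argsort(dist)[:k], col]
    # Sample the mixture: pick a neighbor value, then add Gaussian kernel noise.
    picks = rng.choice(centers, size=n_samples)
    return picks + rng.normal(0.0, bandwidth, size=n_samples)


# Bimodal toy data: the second feature alternates between +3 and -3.
x = np.linspace(0.0, 1.0, 40)
y = np.where(np.arange(40) % 2 == 0, 3.0, -3.0)
table = np.column_stack([x, y])
table = np.vstack([table, [0.5, np.nan]])  # row 40 misses its second feature

samples = knn_kde_impute(table, row_idx=40, col=1, k=6, bandwidth=0.3,
                         rng=np.random.default_rng(0))
print("share of samples near +3:", np.mean(samples > 0))
print("mean of samples:", samples.mean())
```

On this toy data a mean-based point estimate would land near 0, a value that never occurs in the second feature, while the samples stay close to the two observed modes.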