Multivariate Data Imputation using Trees
We address the problem of completing two files with records containing a fully observed common subset of variables. The technique investigated involves the use of regression and/or classification trees. An extension of current methodology (the intersection-seeking or "forest-climbing" algorithm) is proposed to deal with multivariate response variables. The method is demonstrated and shown to be feasible and to have some desirable properties.
Keywords: file completion, data imputation, regression trees
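The core idea, completing one file's missing variable from a tree fit on a fully observed file, can be sketched minimally. This is an illustrative stand-in (a one-split regression stump, with hypothetical names), not the paper's forest-climbing algorithm:

```python
# Sketch of tree-based file completion: records in file B miss variable y;
# a regression stump fit on file A's common variable x predicts y for B.
# All data and names here are hypothetical illustrations.

def fit_stump(xs, ys):
    """Fit a one-split regression tree: pick the split on x minimizing
    total squared error, predicting the mean of y in each half."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        def sse(vals):
            m = sum(vals) / len(vals)
            return sum((v - m) ** 2 for v in vals)
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, threshold, sum(left) / len(left), sum(right) / len(right))
    _, t, mean_left, mean_right = best
    return lambda x: mean_left if x <= t else mean_right

# File A: (x, y) fully observed; file B: x observed, y missing.
file_a_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
file_a_y = [5.0, 5.2, 4.8, 20.0, 19.5, 20.5]
stump = fit_stump(file_a_x, file_a_y)

file_b_x = [2.5, 10.5]
imputed = [stump(x) for x in file_b_x]
print(imputed)  # each B record gets the mean y of its matching A partition
```

A full classification/regression tree simply repeats this split recursively within each half; the multivariate extension in the abstract predicts a vector of responses per leaf rather than a single mean.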
Data Imputation through the Identification of Local Anomalies
We introduce a comprehensive statistical framework, in a model-free
setting, for a complete treatment of localized data corruptions due to severe
noise sources, e.g., an occluder in the case of a visual recording. Within this
framework, we propose i) a novel algorithm to efficiently separate, i.e.,
detect and localize, possible corruptions from a given suspicious data instance
and ii) a Maximum A Posteriori (MAP) estimator to impute the corrupted data. As
a generalization to Euclidean distance, we also propose a novel distance
measure, which is based on the ranked deviations among the data attributes and
empirically shown to be superior in separating the corruptions. Our algorithm
first splits the suspicious instance into parts through a binary partitioning
tree in the space of data attributes and iteratively tests those parts to
detect local anomalies using the nominal statistics extracted from an
uncorrupted (clean) reference data set. Once each part is labeled as anomalous
vs normal, the corresponding binary patterns over this tree that characterize
corruptions are identified and the affected attributes are imputed. Under a
certain conditional independency structure assumed for the binary patterns, we
analytically show that the false alarm rate of the introduced algorithm in
detecting the corruptions is independent of the data and can be directly set
without any parameter tuning. The proposed framework is tested over several
well-known machine learning data sets with synthetically generated corruptions;
and experimentally shown to produce remarkable improvements in terms of
classification performance, with strong corruption-separation capabilities. Our
experiments also indicate that the proposed algorithms outperform typical
approaches and are robust to varying training-phase conditions.
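The part-wise detect-then-impute loop described above can be sketched in a few lines. This is a hedged illustration (the z-score test, threshold, and mean imputation are simple stand-ins, not the paper's MAP estimator or its conditional-independence analysis):

```python
# Illustrative sketch: split a suspicious instance's attributes recursively,
# flag a part as anomalous when its average absolute deviation from the clean
# reference (in standardized units) exceeds a threshold, then impute flagged
# attributes with the reference mean. Threshold and statistics are assumptions.

def detect_and_impute(instance, reference, threshold=3.0):
    n = len(instance)
    # Nominal per-attribute statistics from the uncorrupted reference set.
    means = [sum(r[j] for r in reference) / len(reference) for j in range(n)]
    stds = [max((sum((r[j] - means[j]) ** 2 for r in reference)
                 / len(reference)) ** 0.5, 1e-9) for j in range(n)]

    def test(lo, hi):
        """Test attribute range [lo, hi); recurse into halves when anomalous."""
        z = sum(abs(instance[j] - means[j]) / stds[j]
                for j in range(lo, hi)) / (hi - lo)
        if z <= threshold:
            return set()
        if hi - lo == 1:
            return {lo}
        mid = (lo + hi) // 2
        return test(lo, mid) | test(mid, hi)

    flagged = test(0, n)
    imputed = [means[j] if j in flagged else instance[j] for j in range(n)]
    return imputed, flagged

reference = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0],
             [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
instance = [0.4, 0.6, 9.0, 0.5]  # attribute 2 carries a localized corruption
imputed, flagged = detect_and_impute(instance, reference)
print(flagged, imputed)
```

The recursion mirrors the binary partitioning tree in the abstract: clean parts are accepted wholesale, and only anomalous parts are subdivided until the affected attributes are localized.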
Improving Missing Data Imputation with Deep Generative Models
Datasets with missing values are very common in industry applications, and
they can have a negative impact on machine learning models. Recent studies
introduced solutions to the problem of imputing missing values based on deep
generative models. Previous experiments with Generative Adversarial Networks
and Variational Autoencoders showed interesting results in this domain, but it
is not clear which method is preferable for different use cases. The goal of
this work is twofold: we present a comparison between missing data imputation
solutions based on deep generative models, and we propose improvements over
those methodologies. We run our experiments using known real life datasets with
different characteristics, removing values at random and reconstructing them
with several imputation techniques. Our results show that the presence or
absence of categorical variables can alter the selection of the best model, and
that some models are more stable than others across repeated runs with
different random number generator seeds.
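The evaluation protocol this abstract describes, remove values at random, reconstruct them, and compare stability across seeds, can be sketched with a trivial stand-in imputer (column-mean fill here, not the paper's generative models):

```python
# Hedged sketch of the experimental protocol: mask cells at random under
# several seeds, impute, and score RMSE on the held-out cells. The data and
# the mean imputer are illustrative assumptions.
import random

def mask_at_random(data, frac, seed):
    """Replace a random fraction of cells with None; return hole positions."""
    rng = random.Random(seed)
    masked = [row[:] for row in data]
    holes = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < frac:
                holes.append((i, j))
                row[j] = None
    return masked, holes

def mean_impute(masked):
    """Baseline imputer: fill each hole with its column's observed mean."""
    n_cols = len(masked[0])
    means = []
    for j in range(n_cols):
        obs = [row[j] for row in masked if row[j] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in masked]

def rmse(truth, imputed, holes):
    se = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in holes]
    return (sum(se) / len(se)) ** 0.5

truth = [[float(i + j) for j in range(4)] for i in range(50)]
scores = []
for seed in range(5):
    masked, holes = mask_at_random(truth, 0.2, seed)
    scores.append(rmse(truth, mean_impute(masked), holes))
# A small spread of scores across seeds indicates a stable imputer.
print(min(scores), max(scores))
```

Swapping `mean_impute` for a GAN- or VAE-based imputer and repeating across datasets with and without categorical variables reproduces the shape of the comparison the abstract reports.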
Multi-Output Gaussian Processes for Crowdsourced Traffic Data Imputation
Traffic speed data imputation is a fundamental challenge for data-driven
transport analysis. In recent years, with the ubiquity of GPS-enabled devices
and the widespread use of crowdsourcing alternatives for the collection of
traffic data, transportation professionals increasingly look to such
user-generated data for many analysis, planning, and decision support
applications. However, due to the mechanics of the data collection process,
crowdsourced traffic data such as probe-vehicle data is highly prone to missing
observations, making accurate imputation crucial for the success of any
application that makes use of that type of data. In this article, we propose
the use of multi-output Gaussian processes (GPs) to model the complex spatial
and temporal patterns in crowdsourced traffic data. While the Bayesian
nonparametric formalism of GPs allows us to model observation uncertainty, the
multi-output extension based on convolution processes effectively enables us to
capture complex spatial dependencies between nearby road segments. Using 6
months of crowdsourced traffic speed data or "probe vehicle data" for several
locations in Copenhagen, the proposed approach is empirically shown to
significantly outperform popular state-of-the-art imputation methods.
Comment: 10 pages, IEEE Transactions on Intelligent Transportation Systems,
201
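The GP-posterior imputation at the heart of this approach can be sketched in its simplest single-output form. This toy stands in for the multi-output convolution-process model in the abstract; the kernel, lengthscale, and data are assumptions:

```python
# Minimal single-output GP regression sketch: observed speeds at some time
# points, posterior mean fills a gap. An RBF kernel is assumed for
# illustration; the paper uses multi-output convolution processes.
import numpy as np

def rbf(a, b, lengthscale=2.0, variance=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_impute(t_obs, y_obs, t_missing, noise=1e-2):
    """Posterior mean of a zero-mean GP at the missing time points."""
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_star = rbf(t_missing, t_obs)
    return K_star @ np.linalg.solve(K, y_obs)

# Toy speed profile with a gap at t = 3.
t_obs = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 6.0])
y_obs = np.sin(t_obs)
t_missing = np.array([3.0])
print(gp_impute(t_obs, y_obs, t_missing))  # interpolates near sin(3)
```

The multi-output extension replaces the scalar kernel with cross-covariances between road segments, so observations on nearby segments inform the posterior for a segment with missing probe-vehicle data.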
Lung Segmentation from Chest X-rays using Variational Data Imputation
Pulmonary opacification is inflammation in the lungs caused by many
respiratory ailments, including the novel coronavirus disease 2019 (COVID-19).
Chest X-rays (CXRs) with such opacifications render regions of lungs
imperceptible, making it difficult to perform automated image analysis on them.
In this work, we focus on segmenting lungs from such abnormal CXRs as part of a
pipeline aimed at automated risk scoring of COVID-19 from CXRs. We treat the
high opacity regions as missing data and present a modified CNN-based image
segmentation network that utilizes a deep generative model for data imputation.
We train this model on normal CXRs with extensive data augmentation and
demonstrate that it extends to cases with extreme abnormalities.
Comment: Accepted to be presented at the first Workshop on the Art of Learning
with Missing Values (Artemiss) hosted by the 37th International Conference on
Machine Learning (ICML). Source code, training data and the trained models
are available here: https://github.com/raghavian/lungVAE
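The "treat high opacity as missing, then impute" step can be illustrated on a toy image. This is a loose sketch only: a neighbor-mean fill stands in for the deep generative imputer, and the image, threshold, and function name are assumptions:

```python
# Illustrative masking-and-imputation step: pixels above an opacity threshold
# are treated as missing and filled from the mean of their valid 4-neighbors.
# A simple stand-in for the VAE-based imputer described in the abstract.
def impute_opaque(image, threshold):
    h, w = len(image), len(image[0])
    missing = {(i, j) for i in range(h) for j in range(w)
               if image[i][j] > threshold}
    out = [row[:] for row in image]
    for i, j in missing:
        nbrs = [image[i + di][j + dj]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= i + di < h and 0 <= j + dj < w
                and (i + di, j + dj) not in missing]
        if nbrs:  # leave the pixel untouched if no valid neighbor exists
            out[i][j] = sum(nbrs) / len(nbrs)
    return out

image = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.2],  # 0.9: high-opacity pixel treated as missing
    [0.1, 0.2, 0.1],
]
print(impute_opaque(image, 0.8))  # center filled from its four neighbors
```

In the paper's pipeline, the generative model plays this filling role with learned lung anatomy, so the downstream segmentation network sees plausible lung fields rather than opacified regions.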