A Common-Factor Approach for Multivariate Data Cleaning with an Application to Mars Phoenix Mission Data
Data quality is fundamentally important to ensure the reliability of data for
stakeholders to make decisions. In real world applications, such as scientific
exploration of extreme environments, it is unrealistic to require raw data
collected to be perfect. As data miners, when it is infeasible to physically
know the why and the how in order to clean up the data, we propose to seek the
intrinsic structure of the signal to identify the common factors of
multivariate data. Using our new data driven learning method, the common-factor
data cleaning approach, we address an interdisciplinary challenge on
multivariate data cleaning when complex external impacts appear to interfere
with multiple data measurements. Existing data analyses typically process one
signal measurement at a time without considering the associations among all
signals. We analyze all signal measurements simultaneously to find the hidden
common factors that drive all measurements to vary together, but not as a
result of the true data measurements. We use common factors to reduce the
variations in the data without changing the base mean level of the data to
avoid altering the physical meaning.
Comment: 12 pages, 10 figures, 1 table
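As an illustration of the general idea (not the authors' method, which the abstract does not specify), the sketch below estimates shared variation across signals with a plain SVD of the mean-centered data and subtracts it, leaving each signal's mean level unchanged. The function name `remove_common_factors`, the parameter `n_factors`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def remove_common_factors(X, n_factors=1):
    """Subtract the leading shared components from multivariate data.

    X has shape (n_samples, n_signals). The per-signal mean is left
    untouched; only variation around the mean is reduced.
    Hypothetical helper: a PCA-style stand-in, not the paper's algorithm.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data; the leading components capture variation
    # that the signals share.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    common = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]
    # 'common' has zero column means, so subtracting it preserves the means.
    return X - common


# Synthetic example: three signals with different base levels, all
# disturbed by one shared external influence plus small independent noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
disturbance = np.sin(2 * np.pi * 3 * t)
base_levels = np.array([10.0, 20.0, 30.0])
loadings = np.array([1.0, 0.5, -0.8])
X = base_levels + np.outer(disturbance, loadings) \
    + 0.05 * rng.standard_normal((200, 3))

cleaned = remove_common_factors(X, n_factors=1)
```

In this sketch, choosing `n_factors` is left to the user; a real application would need a principled way to decide how many components reflect external interference rather than genuine signal.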
Metrics for Graph Comparison: A Practitioner's Guide
Comparison of graph structure is a ubiquitous task in data analysis and
machine learning, with diverse applications in fields such as neuroscience,
cyber security, social network analysis, and bioinformatics, among others.
Discovery and comparison of structures such as modular communities, rich clubs,
hubs, and trees in data in these fields yields insight into the generative
mechanisms and functional properties of the graph.
Often, two graphs are compared via a pairwise distance measure, with a small
distance indicating structural similarity and vice versa. Common choices
include spectral distances (also known as λ distances) and distances
based on node affinities. However, there has as yet been no comparative study
of the efficacy of these distance measures in discerning between common graph
topologies and different structural scales.
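To make the notion of a spectral distance concrete, here is a minimal sketch that compares two graphs by the Euclidean distance between their sorted adjacency eigenvalues. It uses only NumPy and does not reproduce the NetComp API; the function names `adjacency_spectral_distance`, `ring_graph`, and `star_graph` are hypothetical, and the adjacency spectrum is only one of several possible choices.

```python
import numpy as np


def adjacency_spectral_distance(A1, A2, k=None):
    """Euclidean distance between the k largest adjacency eigenvalues.

    A1 and A2 are symmetric adjacency matrices. If k is None, the
    smaller of the two spectrum lengths is used.
    """
    # eigvalsh returns real eigenvalues of a symmetric matrix.
    ev1 = np.sort(np.linalg.eigvalsh(A1))[::-1]
    ev2 = np.sort(np.linalg.eigvalsh(A2))[::-1]
    if k is None:
        k = min(len(ev1), len(ev2))
    return float(np.linalg.norm(ev1[:k] - ev2[:k]))


def ring_graph(n):
    """Adjacency matrix of an n-node cycle."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1
    A[(idx + 1) % n, idx] = 1
    return A


def star_graph(n):
    """Adjacency matrix of a star: node 0 joined to all other nodes."""
    A = np.zeros((n, n))
    A[0, 1:] = 1
    A[1:, 0] = 1
    return A


ring = ring_graph(8)
star = star_graph(8)

# Relabeling the nodes does not change the spectrum, so an isomorphic
# copy of the ring should be at (numerically) zero distance from it.
perm = np.random.default_rng(1).permutation(8)
ring_relabeled = ring[perm][:, perm]

d_same = adjacency_spectral_distance(ring, ring_relabeled)
d_diff = adjacency_spectral_distance(ring, star)
```

Because the distance depends only on eigenvalues, it is invariant to node relabeling, but it can also assign zero distance to non-isomorphic graphs that happen to share a spectrum.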
In this work, we compare commonly used graph metrics and distance measures,
and demonstrate their ability to discern between common topological features
found in both random graph models and empirical datasets. We put forward a
multi-scale picture of graph structure, in which the effect of global and local
structure upon the distance measures is considered. We make recommendations on
the applicability of different distance measures to empirical graph data
problems based on this multi-scale view. Finally, we introduce the Python
library NetComp which implements the graph distances used in this work
- …