Sequence-to-Sequence Imputation of Missing Sensor Data
Although the sequence-to-sequence (encoder-decoder) model is considered the
state-of-the-art in deep learning sequence models, there is little research
into using this model for recovering missing sensor data. The key challenge is
that the missing sensor data problem typically comprises three sequences (a
sequence of observed samples, followed by a sequence of missing samples,
followed by another sequence of observed samples), whereas the
sequence-to-sequence model only considers two sequences (an input sequence and
an output sequence). We address this problem by formulating a
sequence-to-sequence model in a novel way. A forward RNN encodes the data observed
before the missing sequence and a backward RNN encodes the data observed after
the missing sequence. A decoder combines the outputs of the two encoders in a novel way to
predict the missing data. We demonstrate that this model produces the lowest
errors in 12% more cases than the current state-of-the-art
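The data flow described in this abstract can be sketched as follows. This is a minimal, untrained illustration in plain Python, not the authors' model: the one-unit tanh RNN, the way the two encoder states are mixed into the decoder state, and every name in `params` are assumptions made for the example.

```python
import math


def rnn_encode(seq, w_in, w_rec):
    """Run a one-unit tanh RNN over seq and return its final hidden state."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def impute_gap(before, after, gap_len, p):
    """Predict gap_len missing samples between `before` and `after`."""
    # Forward encoder reads the samples observed before the gap, oldest first.
    h_fwd = rnn_encode(before, p["enc_f_in"], p["enc_f_rec"])
    # Backward encoder reads the samples observed after the gap, newest first,
    # so that it finishes at the sample adjacent to the gap.
    h_bwd = rnn_encode(after[::-1], p["enc_b_in"], p["enc_b_rec"])
    # Assumption: the decoder state is initialised from both encoder states.
    h = math.tanh(p["mix_f"] * h_fwd + p["mix_b"] * h_bwd)
    y = before[-1]  # start decoding from the last observed sample
    out = []
    for _ in range(gap_len):
        h = math.tanh(p["dec_in"] * y + p["dec_rec"] * h)
        y = p["dec_out"] * h
        out.append(y)
    return out


# Hypothetical, hand-picked weights; a real model would learn these.
params = {
    "enc_f_in": 0.8, "enc_f_rec": 0.5,
    "enc_b_in": 0.8, "enc_b_rec": 0.5,
    "mix_f": 0.6, "mix_b": 0.6,
    "dec_in": 0.7, "dec_rec": 0.4, "dec_out": 1.0,
}
predicted = impute_gap([0.1, 0.2, 0.3], [0.7, 0.8], 3, params)
```

The point of the sketch is only that the prediction depends on both sides of the gap: changing `after` changes `predicted`, which a plain two-sequence encoder-decoder fed only with `before` would not do.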
Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation (Extended Version)
Sensor data streams occur widely in various real-time applications in the
context of the Internet of Things (IoT). However, sensor data streams feature
missing values due to factors such as sensor failures, communication errors, or
depleted batteries. Missing values can compromise the quality of real-time
analytics tasks and downstream applications. Existing imputation methods either
make strong assumptions about streams or have low efficiency. In this study, we
aim to accurately and efficiently impute missing values in data streams that
satisfy only general characteristics in order to benefit real-time applications
more widely. First, we propose a message propagation imputation network (MPIN)
that is able to recover the missing values of data instances in a time window.
We give a theoretical analysis of why MPIN is effective. Second, we present a
continuous imputation framework that consists of data update and model update
mechanisms to enable MPIN to perform continuous imputation both effectively and
efficiently. Extensive experiments on multiple real datasets show that MPIN can
outperform the existing data imputers by wide margins and that the continuous
imputation framework is efficient and accurate.
Comment: Accepted at VLDB 202
Simultaneous Measurement Imputation and Outcome Prediction for Achilles Tendon Rupture Rehabilitation
Achilles Tendon Rupture (ATR) is one of the typical soft tissue injuries.
Rehabilitation after such a musculoskeletal injury remains a prolonged process
with a very variable outcome. Accurately predicting rehabilitation outcome is
crucial for treatment decision support. However, it is challenging to train an
automatic method for predicting the ATR rehabilitation outcome from treatment
data, due to a massive amount of missing entries in the data recorded from ATR
patients, as well as complex nonlinear relations between measurements and
outcomes. In this work, we design an end-to-end probabilistic framework to
impute missing data entries and predict rehabilitation outcomes simultaneously.
We evaluate our model on a real-life ATR clinical cohort, comparing with
various baselines. The proposed method demonstrates clear superiority over
traditional methods, which typically perform imputation and prediction in two
separate stages.
In-network Sparsity-regularized Rank Minimization: Algorithms and Applications
Given a limited number of entries from the superposition of a low-rank matrix
plus the product of a known fat compression matrix times a sparse matrix,
recovery of the low-rank and sparse components is a fundamental task subsuming
compressed sensing, matrix completion, and principal components pursuit. This
paper develops algorithms for distributed sparsity-regularized rank
minimization over networks, when the nuclear- and $\ell_1$-norm are used as
surrogates to the rank and nonzero entry counts of the sought matrices,
respectively. While nuclear-norm minimization has well-documented merits when
centralized processing is viable, non-separability of the singular-value sum
challenges its distributed minimization. To overcome this limitation, an
alternative characterization of the nuclear norm is adopted which leads to a
separable, yet non-convex cost minimized via the alternating-direction method
of multipliers. The novel distributed iterations entail reduced-complexity
per-node tasks, and affordable message passing among single-hop neighbors.
Interestingly, upon convergence the distributed (non-convex) estimator provably
attains the global optimum of its centralized counterpart, regardless of
initialization. Several application domains are outlined to highlight the
generality and impact of the proposed framework. These include unveiling
traffic anomalies in backbone networks, predicting networkwide path latencies,
and mapping the RF ambiance using wireless cognitive radios. Simulations with
synthetic and real network data corroborate the convergence of the novel
distributed algorithm, and its centralized performance guarantees.
Comment: 30 pages, submitted for publication on the IEEE Trans. Signal Proces
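The abstract does not spell out the "alternative characterization of the nuclear norm". A well-known identity that fits the description (separable across the factors, but non-convex) is the bilinear factorization form below. The symbols $X$, $P$, and $Q$ are introduced here for illustration, and whether the paper uses exactly this form is an assumption.

```latex
\|X\|_* \;=\; \min_{\{P,\,Q\}\,:\,X = PQ^\top}\;
\frac{1}{2}\left( \|P\|_F^2 + \|Q\|_F^2 \right)
```

Because the Frobenius norms are sums of squared entries, the right-hand side splits row by row across $P$ and $Q$, which is what would let each network node work on its own block, whereas the sum of singular values on the left does not split this way.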
Mind the large gap : novel algorithm using seasonal decomposition and elastic net regression to impute large intervals of missing data in air quality data
Air quality data sets are widely used in numerous analyses. Missing values are ubiquitous in air quality data sets because the data are collected through sensors. Recovery of missing data is a challenging task in the data preprocessing stage. This task becomes more challenging in time series data, as time is an implicit variable that cannot be ignored. Although existing methods for handling missing data in time series perform well when the percentage of missing values is relatively low and the gap size is small, their performance degrades on large gaps. This paper presents a novel algorithm based on seasonal decomposition and elastic net regression to impute large gaps in time series data when correlated variables exist. The method outperforms several existing univariate approaches in imputing large gaps, namely Kalman smoothing on ARIMA models, Kalman smoothing on structural time series models, linear interpolation, and mean imputation. However, it is applicable only when one or more variables are correlated with the time series that contains the large gaps.
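One way to combine the two ingredients named in this abstract is sketched below in plain Python. It is not the authors' algorithm: the per-phase seasonal means, the coordinate-descent elastic net, and all function and parameter names (`seasonal_means`, `elastic_net`, `impute_large_gap`, `alpha`, `l1_ratio`) are assumptions for illustration, and the correlated series `x` is assumed to be fully observed.

```python
def seasonal_means(series, period):
    """Per-phase mean minus overall mean, using observed (non-None) values."""
    sums, counts = [0.0] * period, [0] * period
    for t, v in enumerate(series):
        if v is not None:
            sums[t % period] += v
            counts[t % period] += 1
    overall = sum(sums) / sum(counts)
    return [sums[k] / counts[k] - overall if counts[k] else 0.0
            for k in range(period)]


def soft_threshold(z, t):
    """Shrink z toward zero by t (the L1 proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw - b||^2 + alpha*(l1_ratio*|w|_1 + (1-l1_ratio)/2*|w|_2^2)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                others = sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
            rho /= n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = (soft_threshold(rho, alpha * l1_ratio)
                    / (z + alpha * (1.0 - l1_ratio)))
    b = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, b


def impute_large_gap(y, x, period, alpha=0.01, l1_ratio=0.5):
    """Fill None entries of y using one correlated, fully observed series x."""
    s_y = seasonal_means(y, period)
    s_x = seasonal_means(x, period)
    x_adj = [v - s_x[t % period] for t, v in enumerate(x)]
    observed = [t for t, v in enumerate(y) if v is not None]
    # Regress the deseasonalised target on the deseasonalised predictor.
    w, b = elastic_net([[x_adj[t]] for t in observed],
                       [y[t] - s_y[t % period] for t in observed],
                       alpha, l1_ratio)
    # Predict inside the gap, then add the seasonal component back.
    return [v if v is not None else b + w[0] * x_adj[t] + s_y[t % period]
            for t, v in enumerate(y)]
```

For example, with `y = [None if 16 <= t < 28 else v for t, v in enumerate(truth)]` and a correlated `x`, `impute_large_gap(y, x, period=4)` returns a copy of `y` with the twelve-sample gap filled. Observed values are passed through unchanged.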
Load curve data cleansing and imputation via sparsity and low rank
The smart grid vision is to build an intelligent power network with an
unprecedented level of situational awareness and controllability over its
services and infrastructure. This paper advocates statistical inference methods
to robustify power monitoring tasks against the outlier effects owing to faulty
readings and malicious attacks, as well as against missing data due to privacy
concerns and communication errors. In this context, a novel load cleansing and
imputation scheme is developed leveraging the low intrinsic-dimensionality of
spatiotemporal load profiles and the sparse nature of "bad data." A robust
estimator based on principal components pursuit (PCP) is adopted, which effects
a twofold sparsity-promoting regularization through an $\ell_1$-norm of the
outliers, and the nuclear norm of the nominal load profiles. Upon recasting the
non-separable nuclear norm into a form amenable to decentralized optimization,
a distributed (D-) PCP algorithm is developed to carry out the imputation and
cleansing tasks using networked devices comprising the so-termed advanced
metering infrastructure. If D-PCP converges and a qualification inequality is
satisfied, the novel distributed estimator provably attains the performance of
its centralized PCP counterpart, which has access to all networkwide data.
Computer simulations and tests with real load curve data corroborate the
convergence and effectiveness of the novel D-PCP algorithm.
Comment: 8 figures, submitted to IEEE Transactions on Smart Grid - Special
issue on "Optimization methods and algorithms applied to smart grid
Embedded Data Imputation for Environmental Intelligent Sensing: A Case Study
Recent developments in cloud computing and the Internet of Things have enabled smart environments, in terms of both monitoring and actuation. Unfortunately, this often results in unsustainable cloud-based solutions, whereby, in the interest of simplicity, a wealth of raw (unprocessed) data are pushed from sensor nodes to the cloud. Herein, we advocate the use of machine learning at sensor nodes to perform essential data-cleaning operations, to avoid the transmission of corrupted (often unusable) data to the cloud. Starting from a public pollution dataset, we investigate how two machine learning techniques (kNN and missForest) may be embedded on Raspberry Pi to perform data imputation, without impacting the data collection process. Our experimental results demonstrate the accuracy and computational efficiency of edge-learning methods for filling in missing data values in corrupted data series. We find that kNN and missForest correctly impute up to 40% of randomly distributed missing values, with a density distribution of values that is indistinguishable from the benchmark. We also show a trade-off analysis for the case of bursty missing values, with recoverable blocks of up to 100 samples. Computation times are shorter than sampling periods, allowing for data imputation at the edge in a timely manner. Our work is supported by the Open Access Publishing Fund of the Free University of Bozen-Bolzano
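As a rough illustration of the kNN side of this study, the sketch below imputes missing entries from the nearest rows in plain Python, which keeps it light enough for a small device. It is a generic kNN imputer, not the implementation evaluated in the paper: the function name `knn_impute`, the distance over mutually observed features, and the sample readings are assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is the root-mean-square difference over features that both rows
    have observed; rows sharing no observed feature are never used as donors.
    """
    def distance(a, b):
        shared = [(u, v) for u, v in zip(a, b)
                  if u is not None and v is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Hypothetical readings: [sensor_a, sensor_b]; the last row lost sensor_b.
readings = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.05, None]]
completed = knn_impute(readings, k=2)
```

Here the missing `sensor_b` value is filled from the two rows whose `sensor_a` reading is closest, so the outlying third row does not contribute.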