359 research outputs found

    Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm

    Imputing missing data from a multivariate time series dataset remains a challenging problem. There is an abundance of research on techniques for imputing missing, biased, or corrupted values in a dataset. Although a great amount of work has been done in this field, most imputation methodologies are centered on a specific application, typically involving static data analysis and simple time series modelling. These approaches fall short when the data originate from a multivariate time series. The objective of this paper is to introduce a new algorithm for handling missing data in multivariate time series datasets. The approach is based on a vector autoregressive (VAR) model and combines an expectation-maximization (EM) algorithm with the prediction error minimization (PEM) method. The new algorithm is called the vector autoregressive imputation method (VAR-IM). A description of the algorithm is presented, along with a case study applying VAR-IM to a real-world data set of electrocardiogram (ECG) data. VAR-IM was compared with both traditional methods (listwise deletion and linear regression substitution) and modern methods (Multivariate Auto-Regressive State-Space (MARSS) and the expectation-maximization (EM) algorithm). Generally, VAR-IM achieved a significant improvement in the imputation tasks compared with the other methods. A summary of the limitations and restrictions of VAR-IM is also presented.
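
As a rough illustration of the idea of imputing from a fitted VAR model, the sketch below alternates between fitting a VAR(1) model by ordinary least squares and refilling the missing entries with its one-step predictions. It is a simplified stand-in written for this listing, not the VAR-IM algorithm from the paper: the function name `var1_impute`, the lag order of 1, the mean initialisation, and the fixed iteration count are all assumptions.

```python
import numpy as np

def var1_impute(data, n_iter=20):
    """Hypothetical helper: iterative VAR(1)-based imputation (not VAR-IM itself)."""
    x = np.asarray(data, dtype=float).copy()
    missing = np.isnan(x)

    # Start from column means so the first model fit has complete data.
    col_means = np.nanmean(x, axis=0)
    x[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        # Fit x_t = c + A x_{t-1} by least squares on the current filled series.
        past = np.hstack([np.ones((len(x) - 1, 1)), x[:-1]])
        coef, *_ = np.linalg.lstsq(past, x[1:], rcond=None)

        # Replace only the originally missing entries (rows 1..T-1) with
        # one-step predictions; row 0 has no predecessor and keeps its mean fill.
        pred = past @ coef
        fill = missing[1:]
        x[1:][fill] = pred[fill]
    return x

# Usage: two coupled series with one missing value.
t = np.arange(50)
series = np.column_stack([np.sin(0.3 * t), np.cos(0.3 * t)])
series[10, 0] = np.nan
filled = var1_impute(series)
```

Observed values are never modified; only the entries that were `NaN` on input are updated between iterations.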

    Handling of Missing Values in Static and Dynamic Data Sets

    This thesis makes three contributions. First, it conducts a comparative study of traditional and modern classifications, highlighting the differences in their performance. Second, it presents an algorithm to enhance the prediction of values to be used for data imputation with nonlinear models. Third, it presents a novel model selection algorithm to enhance prediction performance in the presence of missing data. It includes an overview of nonlinear model selection with complete data and provides summary descriptions of the Box-Tidwell and fractional polynomial methods for model selection. In particular, it focuses on the fractional polynomial method for nonlinear modelling in cases of missing data. An analysis example is presented to illustrate the performance of this method.

    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach leverages the missing data handling properties of Gaussian mixture models (GMMs) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and provides outstanding results for MTS with missing data.

    TIME SERIES IMPUTATION USING VAR-IM (CASE STUDY: WEATHER DATA IN METEOROLOGICAL STATION OF CITEKO)

    Univariate imputation methods are defined as imputation methods that use only the information of the target variable to estimate missing values. While univariate imputation methods are convenient and flexible since no other variable is required, multivariate imputation methods can potentially improve imputation accuracy, provided that the other variables are relevant to the target variable. Many multivariate imputation methods have been proposed, one of which is the Vector Autoregression Imputation Method (VAR-IM). This study aims to compare the imputation results of VAR-IM-based methods and univariate imputation methods on time series data, specifically on long-lag seasonal data such as daily weather data. Three modified VAR-IM methods were studied using simulations with three steps: deletion, imputation, and evaluation. The deletion step was conducted using six different schemes with six missing proportions. The simulations were conducted on secondary daily weather data collected from the meteorological station of Citeko from January 1, 1991, to June 22, 2013. Nine weather variables were examined, namely the minimum, maximum, and average temperatures, average humidity, rainfall rate, duration of solar radiation, maximum and average wind speed, and wind direction at maximum speed. The simulation results show that the three modified VAR-IM methods can improve accuracy in around 75% of cases. The simulation results also show that the imputation results of VAR-IM-based methods tend to be more stable in accuracy as the missing proportion increases, compared to those of univariate imputation methods.

    A wavelet-based approach for imputation in nonstationary multivariate time series

    Many multivariate time series observed in practice are second-order nonstationary, i.e. their covariance properties vary over time. In addition, missing observations in such data are encountered in many applications of interest, due to recording failures or sensor dropout, hindering successful analysis. This article introduces a novel method for data imputation in multivariate nonstationary time series, based on the so-called locally stationary wavelet modelling paradigm. Our methodology is shown to perform well across a range of simulation scenarios, with a variety of missingness structures, as well as being competitive in the stationary time series setting. We also demonstrate our technique on data arising in a health monitoring application.

    Real-time data-driven missing data imputation for short-term sensor data of marine systems. A comparative study

    In the maritime industry, sensors are utilised to implement condition-based maintenance (CBM) to assist decision-making processes for energy-efficient operation of marine machinery. However, the employment of sensors presents several challenges, including the imputation of missing values. Data imputation is a crucial pre-processing step that estimates identified missing values to avoid under-utilisation of data, which can lead to biased results. Although various studies have addressed this topic, none so far has considered imputing incomplete values in real time to assist instant data-driven decision-making strategies. Hence, a methodological comparative study has been developed that examines a total of 20 widely implemented machine learning and time series forecasting algorithms. Moreover, a case study on a total of 7 machinery system parameters obtained from sensors installed on a cargo vessel is utilised to highlight the implementation of the proposed methodology. To assess the models’ performance, seven metrics are estimated (Execution time, MSE, MSLE, RMSE, MAPE, MedAE, Max Error). In all cases, ARIMA outperforms the remaining models, yielding a MedAE of 0.08 r/min and a Max Error of 2.4 r/min for the main engine rotational speed parameter.

    Scalable Low-Rank Tensor Learning for Spatiotemporal Traffic Data Imputation

    The missing value problem in spatiotemporal traffic data has long been a challenging topic, in particular for large-scale and high-dimensional data with complex missing mechanisms and diverse degrees of missingness. Recent studies based on the tensor nuclear norm have demonstrated the superiority of tensor learning in imputation tasks by effectively characterizing the complex correlations/dependencies in spatiotemporal data. However, despite the promising results, these approaches do not scale well to large data tensors. In this paper, we focus on addressing the missing data imputation problem for large-scale spatiotemporal traffic data. To achieve both high accuracy and efficiency, we develop a scalable tensor learning model, Low-Tubal-Rank Smoothing Tensor Completion (LSTC-Tubal), based on the existing framework of Low-Rank Tensor Completion, which is well-suited for spatiotemporal traffic data characterized by the multidimensional structure location × time of day × day. In particular, the proposed LSTC-Tubal model involves a scalable tensor nuclear norm minimization scheme that integrates a linear unitary transformation. Therefore, tensor nuclear norm minimization can be solved by singular value thresholding on the transformed matrix of each day, while the day-to-day correlation can be effectively preserved by the unitary transform matrix. We compare LSTC-Tubal with state-of-the-art baseline models and find that LSTC-Tubal can achieve competitive accuracy with a significantly lower computational cost. In addition, LSTC-Tubal will also benefit other tasks in modeling large-scale spatiotemporal traffic data, such as network-level traffic forecasting.
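
The singular value thresholding step mentioned in this abstract can be illustrated on a single matrix. The sketch below shows only the generic operator (soft-thresholding of singular values, the proximal operator of the nuclear norm); it is not the LSTC-Tubal model, and the function name `singular_value_threshold` and the threshold `tau` are assumptions introduced for illustration.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Hypothetical helper: shrink each singular value of `matrix` by `tau`, flooring at zero."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Rebuild the matrix from the shrunken spectrum.
    return (u * s_shrunk) @ vt

# Usage: small singular values are removed, which lowers the rank.
m = np.diag([3.0, 1.0])
low_rank = singular_value_threshold(m, tau=1.0)
```

In a completion algorithm this operator would typically be applied repeatedly inside an iterative solver, with observed entries re-imposed between steps; that outer loop is omitted here.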

    Statistical approaches for handling longitudinal and cross sectional discrete data with missing values focusing on multiple imputation and probability weighting.

    Doctor of Philosophy in Science. University of KwaZulu-Natal, Pietermaritzburg, 2018. Abstract available in PDF file.

    Vector autoregression with varied frequency data

    The Vector Autoregression (VAR) model has been extensively applied in macroeconomics. A typical VAR requires its component variables to be sampled at a uniform frequency, regardless of the fact that some macro data are available monthly and some only quarterly. Practitioners invariably align variables to the same frequency by either aggregation or imputation, regardless of the information lost or the noise gained. We study a VAR model with varied frequency data in a Bayesian context. Lower frequency (aggregated) data are essentially a linear combination of higher frequency (disaggregated) data. The observed aggregated data impose linear constraints on the autocorrelation structure of the latent disaggregated data. The perception of a constrained multivariate normal distribution is crucial to our Gibbs sampler. Furthermore, the Markov property of the VAR series enables a block Gibbs sampler, which performs faster for evenly aggregated data. Lastly, our approach is applied to two classic structural VAR analyses, one with long-run and the other with short-run identification constraints. These applications demonstrate that it is both feasible and sensible to use data of different frequencies in a new VAR model, one that keeps the branding of the economic ideas underlying the structural VAR model but makes only minimal modifications from a technical perspective. Keywords: Vector Autoregression; Bayesian; Temporal aggregation.
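
The statement that lower-frequency data are a linear combination of higher-frequency data can be made concrete with a small aggregation matrix. The sketch below assumes quarterly values are plain averages of three monthly values; the averaging rule, the function name `aggregation_matrix`, and its parameters are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def aggregation_matrix(n_low, ratio):
    """Hypothetical helper: matrix A with y_low = A @ x_high, averaging each block of `ratio` points."""
    block = np.full((1, ratio), 1.0 / ratio)
    # Block-diagonal layout: one averaging row per low-frequency period.
    return np.kron(np.eye(n_low), block)

# Usage: six monthly observations aggregated into two quarterly averages.
monthly = np.arange(1.0, 7.0)
A = aggregation_matrix(n_low=2, ratio=3)
quarterly = A @ monthly
```

Each observed quarterly value then acts as one linear constraint on the three latent monthly values it averages.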

    Novel methods for distributed acoustic sensing data

    In this thesis, we propose novel methods for analysing nonstationary, multivariate time series, focusing in particular on the problems of classification and imputation within this context. Many existing methods for time series classification are static, in that they assign the entire series to one class and do not allow for temporal dependence within the signal. In the first part of this thesis, we propose a computationally efficient extension of an existing dynamic classification method to the online setting. Dependence within the series is captured by adopting the multivariate locally stationary wavelet (mvLSW) framework, and the signal is classified at each time point into one of a number of known classes. We apply the method to multivariate acoustic sensing data in order to detect anomalous regions and evaluate the results against alternative methods in the literature. The second part of this thesis considers imputation in multivariate locally stationary time series containing missing values. We first introduce a method for estimating the local wavelet spectral matrix that can be used in the presence of missingness. We then propose a novel method for imputing missing values that uses the local auto- and cross-covariance functions of an mvLSW process to perform one-step-ahead forecasting and backcasting. The performance of this nonstationary imputation approach is then assessed against competitor methods on simulated examples and a case study involving a dataset from a Carbon Capture and Storage facility. The software that implements this imputation scheme is also described, together with examples of the R package functionality.