2,452 research outputs found
Extending time series forecasting methods using functional principal components analysis
Traffic volume forecasts are used by many transportation analysis and management systems to better characterize and react to fluctuating traffic patterns. Most current forecasting methods do not take advantage of the underlying functional characteristics of the time series to make predictions. This paper presents a methodology that uses Functional Principal Components Analysis (FPCA) to create smooth and differentiable daily traffic forecasts. The methodology is validated with a data set of 1,813 days of 15-minute aggregated traffic volume time series. Both the FPCA-based forecasts and the associated prediction intervals outperform traditional Seasonal Autoregressive Integrated Moving Average (SARIMA)-based methods. --Abstract, page iii
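As a rough illustration of the FPCA idea behind such forecasts (the synthetic data, the 96-sample daily grid, and the choice of three components are assumptions for this sketch, not the paper's actual setup):

```python
import numpy as np

# Synthetic stand-in data: each row is one day of 15-minute traffic
# volumes (96 samples); the real study used 1,813 observed days.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 96)
days = np.array([50 + 30 * np.sin(2 * np.pi * t + rng.normal(0, 0.1))
                 + rng.normal(0, 1, t.size) for _ in range(200)])

# FPCA via SVD of the centred daily curves.
mean_curve = days.mean(axis=0)
centred = days - mean_curve
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
k = 3                            # functional principal components kept
scores = centred @ Vt[:k].T      # per-day FPC scores

# A day is represented (or forecast, once its scores are predicted by
# a time-series model fitted to the score sequences) as a smooth
# combination of the mean curve and the leading components.
reconstructed = mean_curve + scores @ Vt[:k]
```

Forecasting then reduces to predicting the low-dimensional score sequences, e.g. with an ARIMA-type model per component, and mapping the predicted scores back through the components.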
A Comprehensive Survey on Generative Diffusion Models for Structured Data
In recent years, generative diffusion models have driven a rapid paradigm shift in deep generative modelling by showing groundbreaking performance across various applications. Meanwhile, structured data, encompassing tabular and time series data, has received comparatively limited attention from the deep learning research community, despite its omnipresence and extensive applications. Thus, there is still a lack of literature and reviews on structured data modelling via diffusion models, compared to other data modalities such as visual and textual data. To address this gap, we present a comprehensive review of recently proposed diffusion models in the field of structured data. First, this survey provides a concise overview of score-based diffusion model theory, subsequently proceeding to technical descriptions of the majority of pioneering works that used structured data in both data-driven general tasks and domain-specific applications. Thereafter, we analyse and discuss the limitations and challenges shown in existing works and suggest potential research directions. We hope this review serves as a catalyst for the research community, promoting developments in generative diffusion models for structured data. Comment: 20 pages, 1 figure, 2 tables
Imputation of Rainfall Data Using the Sine Cosine Function Fitting Neural Network
Missing rainfall data reduce the quality of hydrological data analysis because rainfall is an essential input for hydrological modeling. Much research has focused on rainfall data imputation. However, the compatibility of precipitation (rainfall) and non-precipitation (meteorology) data as inputs has received less attention. First, we propose a novel pre-processing mechanism for non-precipitation data using principal component analysis (PCA). Before the imputation, PCA is used to extract the most relevant features from the meteorological data. The final output of the PCA is combined with the rainfall data from the nearest neighbor gauging stations and then used as the input to the neural network for missing data imputation. Second, a sine cosine algorithm is presented to optimize the neural network for infilling the missing rainfall data. The proposed sine cosine function fitting neural network (SC-FITNET) was compared with the sine cosine feedforward neural network (SC-FFNN), feedforward neural network (FFNN) and long short-term memory (LSTM) approaches. The results showed that the proposed SC-FITNET outperformed the LSTM, SC-FFNN and FFNN imputations in terms of mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R), with an average accuracy of 90.9%. This study revealed that as the percentage of missingness increased, the precision of the four imputation methods decreased. In addition, this study also revealed that PCA has potential for pre-processing meteorological data into a format suitable for missing data imputation.
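The PCA pre-processing step described above can be sketched as follows; the feature counts, the 95% variance threshold, and the two neighbour stations are illustrative assumptions, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical inputs: daily meteorological features and rainfall from
# two nearby gauging stations (names and sizes are illustrative).
meteo = rng.normal(size=(365, 8))          # temperature, humidity, wind, ...
neighbour_rain = rng.gamma(2.0, 3.0, size=(365, 2))

# Step 1: PCA on the standardised meteorological data.
z = (meteo - meteo.mean(axis=0)) / meteo.std(axis=0)
cov = np.cov(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep enough components to explain 95% of the variance.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
pcs = z @ eigvecs[:, :k]

# Step 2: combine the PCA output with neighbour-station rainfall to
# form the input for the imputation network.
net_input = np.hstack([pcs, neighbour_rain])
```

The combined `net_input` would then feed the SC-FITNET (or any imputation network) in place of the raw meteorological variables.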
Deep learning for the early detection of harmful algal blooms and improving water quality monitoring
Climate change will affect how water sources are managed and monitored. The frequency of algal blooms will increase with climate change, as it presents favourable conditions for the reproduction of phytoplankton. During monitoring, possible sensor failures in monitoring systems result in partially filled data, which may affect critical systems. Therefore, imputation becomes necessary to decrease error and increase data quality. This work investigates two issues in water quality data analysis: improving data quality and anomaly detection. It consists of three main topics: data imputation, early algal bloom detection using in-situ data, and early algal bloom detection using multiple modalities. The data imputation problem is addressed by experimenting with various methods on a water quality dataset that includes four locations around the North Sea and the Irish Sea with different characteristics and high miss rates, testing model generalisability. A novel neural network architecture with self-attention is proposed in which imputation is done in a single pass, reducing execution time. The self-attention components increase the interpretability of the imputation process at each stage of the network, providing knowledge to domain experts. After data curation, algal activity is predicted using transformer networks, between 1 and 7 days ahead, and the importance of the input with regard to the output of the prediction model is explained using SHAP, aiming to explain model behaviour to domain experts, which is overlooked in previous approaches. The prediction model improves bloom detection performance by 5% on average, and the explanation summarises the complex structure of the model as input-output relationships. Performance improvements on the initial unimodal bloom detection model are made by incorporating multiple modalities into the detection process, which were previously only used for validation purposes. The problem of missing data is also tackled by using coordinated representations, replacing low-quality in-situ data with satellite data and vice versa, instead of imputation, which may lead to biased results.
Enhancing Missing Data Imputation of Non-stationary Signals with Harmonic Decomposition
Dealing with time series with missing values, including those afflicted by low quality or over-saturation, presents a significant signal processing challenge. The task of recovering these missing values, known as imputation, has led to the development of several algorithms. However, we have observed that the efficacy of these algorithms tends to diminish when the time series exhibit non-stationary oscillatory behavior. In this paper, we introduce a novel algorithm, coined Harmonic Level Interpolation (HaLI), which enhances the performance of existing imputation algorithms for oscillatory time series. After running any chosen imputation algorithm, HaLI leverages the harmonic decomposition of the initial imputation, based on the adaptive non-harmonic model, to improve imputation accuracy for oscillatory time series. Experimental assessments conducted on synthetic and real signals consistently show that HaLI enhances the performance of existing imputation algorithms. The algorithm is made publicly available as readily employable Matlab code for other researchers to use.
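The reference implementation is in Matlab; a minimal Python sketch of the two-stage idea (any baseline imputation, then refinement of the gap with a least-squares harmonic fit) might look as follows. The known fundamental frequency and fixed number of harmonics are simplifications: HaLI estimates the harmonic decomposition adaptively.

```python
import numpy as np

# Toy oscillatory series with a gap; the frequency is assumed known
# here (HaLI estimates it adaptively, which we do not reproduce).
n = 500
t = np.arange(n) / 100.0
x = np.cos(2 * np.pi * 3 * t) + 0.3 * np.cos(2 * np.pi * 6 * t)
gap = slice(200, 260)
observed = np.ones(n, dtype=bool)
observed[gap] = False

# Step 1: any baseline imputation (linear interpolation as a stand-in).
x0 = x.copy()
x0[gap] = np.interp(t[gap], t[observed], x[observed])

# Step 2: fit a harmonic model (fundamental plus harmonics) to the
# initial imputation by least squares, then refine the gap with it.
f0, n_harm = 3.0, 2
design = np.column_stack(
    [np.ones(n)] +
    [fn(2 * np.pi * f0 * (h + 1) * t) for h in range(n_harm)
     for fn in (np.cos, np.sin)])
coef, *_ = np.linalg.lstsq(design, x0, rcond=None)
refined = x0.copy()
refined[gap] = design[gap] @ coef
```

The harmonic model extrapolates the oscillation through the gap, so the refined values track the signal's phase where a generic imputer flattens it out.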
Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery
This thesis addresses three major issues in data mining: feature subset selection in high-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of simulated annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of genetic algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, with the predictions of the base classifiers combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNNs are used both as base classifiers and as the top-level combiner classifier. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches, which rely on simple voting or static weighting strategies. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model.
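A GRNN, the building block used for both the base learners and the combiner above, is in essence Nadaraya-Watson kernel regression; a compact sketch (toy data and the `sigma` bandwidth are illustrative, not the thesis's settings):

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Generalized regression neural network: a Nadaraya-Watson kernel
    regressor with Gaussian kernels centred on the training points.
    Predictions are weighted averages of training targets, with weights
    decaying smoothly with distance from the query."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)

# Toy check: smooth y = x^2 and query an unseen point.
X = np.linspace(-1, 1, 101)[:, None]
y = X[:, 0] ** 2
pred = grnn_predict(X, y, np.array([[0.5]]), sigma=0.1)
```

Because the weights are recomputed for every query, the ensemble built from such units naturally weights scenarios by their similarity to the new input, which is the "dynamic weighting" property the abstract refers to.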
Data-Driven Copy-Paste Imputation for Energy Time Series
Smart meters are a cornerstone of the worldwide transition to smart grids. They typically collect and provide energy time series that are vital for various applications, such as grid simulations, fault detection, load forecasting, load analysis, and load management. Unfortunately, these time series are often characterized by missing values that must be handled before the data can be used. A common approach to handling missing values in time series is imputation. However, existing imputation methods are designed for power time series and do not take into account the total energy of gaps, resulting in jumps or constant shifts when imputing energy time series. In order to overcome these issues, the present paper introduces the new Copy-Paste Imputation (CPI) method for energy time series. The CPI method copies data blocks with similar properties and pastes them into gaps of the time series while preserving the total energy of each gap. The new method is evaluated on a real-world dataset that contains six shares of artificially inserted missing values ranging from 1 to 30%. It outperforms the three selected benchmark imputation methods by far. The comparison furthermore shows that the CPI method uses matching patterns and preserves the total energy of each gap while requiring only a moderate run-time. Comment: 8 pages, 7 figures, submitted to IEEE Transactions on Smart Grid; the first two authors contributed equally to this work
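The copy-and-rescale idea can be sketched as follows; the context-matching rule, the candidate set, and the energy estimate are simplified stand-ins for the CPI method's actual criteria:

```python
import numpy as np

def copy_paste_impute(series, gap_start, gap_len, day_len=96, gap_energy=None):
    """Simplified sketch of the Copy-Paste Imputation idea: copy the
    same time-of-day block from another day, then rescale it so the
    pasted block carries a target amount of energy."""
    gap = slice(gap_start, gap_start + gap_len)
    offset = gap_start % day_len

    def ctx(start):
        # Context used for matching: the four values before a block.
        return series[start - 4:start]

    # Candidate blocks: the same time-of-day window on other days.
    starts = [d * day_len + offset for d in range(len(series) // day_len)]
    starts = [s for s in starts
              if s != gap_start and s >= 4
              and s + gap_len <= len(series)
              and not np.isnan(series[s:s + gap_len]).any()
              and not np.isnan(ctx(s)).any()]

    # Copy the block whose context best matches the gap's context...
    best = min(starts, key=lambda s: np.sum((ctx(s) - ctx(gap_start)) ** 2))
    block = series[best:best + gap_len]

    # ...and paste it, rescaled to carry the target gap energy (in
    # practice derived from meter readings; here a crude estimate).
    if gap_energy is None:
        gap_energy = np.mean([series[s:s + gap_len].sum() for s in starts])
    out = series.copy()
    out[gap] = block * (gap_energy / block.sum())
    return out

# Toy usage: a 10-day periodic power profile with an 8-step gap on day 5.
day = 2.0 + np.sin(np.linspace(0, 2 * np.pi, 96, endpoint=False))
series = np.tile(day, 10)
truth = series.copy()
gap_start, gap_len = 5 * 96 + 10, 8
series[gap_start:gap_start + gap_len] = np.nan
imputed = copy_paste_impute(series, gap_start, gap_len)
```

The final rescaling is what distinguishes the approach from plain pattern copying: whatever block is pasted, the gap's total energy is preserved, which avoids the jumps and constant shifts the abstract describes.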