Time Series classification through transformation and ensembles
The problem of time series classification (TSC), where we consider any real-valued ordered data a time series, offers a specific challenge. Unlike traditional classification
problems, the ordering of attributes is often crucial for identifying discriminatory features between classes. TSC problems arise across a diverse range of domains, and this
variety has meant that no single approach outperforms all others.
The general consensus is that the benchmark for TSC is nearest neighbour (NN) classifiers using Euclidean distance or Dynamic Time Warping (DTW). Though conceptually simple, NN classifiers are widely reported to be very difficult to beat, and new work is often compared against them. The majority of approaches have focused on classification in the time domain, typically proposing alternative elastic similarity measures
for NN classification. Other work has investigated more specialised approaches, such as building support vector machines on variable intervals and creating tree-based
ensembles with summary measures.
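As a concrete illustration of this benchmark, a minimal 1-NN classifier with full-window DTW might look as follows; this is an unoptimised sketch (no warping window, lower bounding, or early abandoning), not code from the thesis:

```python
import numpy as np

def dtw_distance(a, b):
    """Full-window Dynamic Time Warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2  # squared pointwise cost
            # extend the cheapest of the three admissible warping steps
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

def nn_classify(query, train_X, train_y):
    """1-NN: return the label of the training series nearest under DTW."""
    dists = [dtw_distance(query, x) for x in train_X]
    return train_y[int(np.argmin(dists))]
```

The quadratic dynamic program is what makes DTW "elastic": it allows points of one series to align with stretched or compressed regions of the other.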
We wish to answer a specific research question: given a new TSC problem without any prior, specialised knowledge, what is the best way to approach the problem? Our thesis is that the best methodology is to first transform data into alternative representations where discriminatory features are more easily detected, and then build ensemble
classifiers on each representation. In support of our thesis, we propose an elastic ensemble classifier that we believe is the first ever to significantly outperform DTW on the widely used UCR datasets. Next, we propose the shapelet-transform, a new data transformation that allows complex classifiers to be coupled with shapelets, which outperforms the original algorithm and is competitive with DTW. Finally, we combine these two works with heterogeneous ensembles built on autocorrelation and spectral-transformed data to propose a collective of transformation-based ensembles (COTE). The results of COTE are, we believe, the best ever published on the UCR datasets.
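The shapelet-transform idea can be sketched as follows: each series becomes a vector of its minimum distances to a set of shapelets, on which any standard classifier can then be trained. Shapelet discovery, quality measures and z-normalisation are omitted, and the function names are our own:

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance from the shapelet to any same-length
    subsequence of the series (no z-normalisation, for simplicity)."""
    x, s = np.asarray(series, float), np.asarray(shapelet, float)
    l = len(s)
    return min(np.linalg.norm(x[i:i + l] - s) for i in range(len(x) - l + 1))

def shapelet_transform(X, shapelets):
    """Map each series to its vector of distances to the k shapelets."""
    return np.array([[shapelet_distance(x, s) for s in shapelets] for x in X])
```

Because the output is an ordinary feature matrix, the transform decouples shapelet discovery from classification, which is what allows complex (ensemble) classifiers to be coupled with shapelets.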
Benchmarking Multivariate Time Series Classification Algorithms
Time Series Classification (TSC) involves building predictive models for a
discrete target variable from ordered, real-valued attributes. Over recent
years, a new set of TSC algorithms have been developed which have made
significant improvement over the previous state of the art. The main focus has
been on univariate TSC, i.e. the problem where each case has a single series
and a class label. In reality, it is more common to encounter multivariate TSC
(MTSC) problems where multiple series are associated with a single label.
Despite this, much less consideration has been given to MTSC than the
univariate case. The UEA archive of 30 MTSC problems released in 2018 has made
comparison of algorithms easier. We review recently proposed bespoke MTSC
algorithms based on deep learning, shapelets and bag of words approaches. The
simplest approach to MTSC is to ensemble univariate classifiers over the
multivariate dimensions. We compare the bespoke algorithms to these dimension
independent approaches on the 26 of the 30 MTSC archive problems where the data
are all of equal length. We demonstrate that the independent ensemble of
HIVE-COTE classifiers is the most accurate, but that, unlike with univariate
classification, dynamic time warping is still competitive at MTSC.
Comment: Data Min Knowl Disc (2020)
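The dimension-independent baseline can be sketched as below: fit one univariate classifier per channel and average the class probabilities. The toy nearest-centroid classifier here only stands in for a strong univariate classifier such as HIVE-COTE, and everything in this sketch is our own illustration:

```python
import numpy as np

class NearestCentroid1D:
    """Toy univariate classifier: nearest class-mean series under Euclidean distance."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[np.asarray(y) == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict_proba(self, X):
        d = np.array([[np.linalg.norm(x - c) for c in self.centroids_] for x in X])
        inv = 1.0 / (d + 1e-9)          # closer centroid -> higher probability
        return inv / inv.sum(axis=1, keepdims=True)

class DimensionEnsemble:
    """Fit one univariate classifier per channel of a multivariate series
    and average the class probabilities (the dimension-independent baseline)."""
    def __init__(self, make_clf):
        self.make_clf = make_clf
        self.clfs = []

    def fit(self, X, y):
        # X has shape (n_cases, n_channels, series_length)
        self.clfs = [self.make_clf().fit(X[:, d, :], y)
                     for d in range(X.shape[1])]
        return self

    def predict(self, X):
        probas = np.mean([c.predict_proba(X[:, d, :])
                          for d, c in enumerate(self.clfs)], axis=0)
        return self.clfs[0].classes_[np.argmax(probas, axis=1)]
```

The appeal of this baseline is that any univariate TSC algorithm can be reused unchanged; its weakness is that interactions between channels are ignored.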
Mining time-series data using discriminative subsequences
Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a
comprehensible method of understanding both data and models.
For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative
of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers.
We use two methods to increase interpretability: First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length,
reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular
shapelets, a representation that is straightforward to interpret and understand.
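The presence/absence representation amounts to thresholding the shapelet-distance matrix. In this sketch the thresholds are supplied directly, whereas in practice they would be learned per shapelet (for example by information gain); the function is our own illustration:

```python
import numpy as np

def binary_shapelet_transform(distances, thresholds):
    """A shapelet is 'present' in a series when its minimum distance to that
    series falls below the shapelet's threshold; 1 = present, 0 = absent."""
    return (np.asarray(distances) < np.asarray(thresholds)).astype(int)
```

Rule induction can then operate directly on these 0/1 attributes, which is what makes the resulting models easy to read.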
We supplement the ensemble classifier with partial classification. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce
the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model.
Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of
speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically
and in an unsupervised manner using a local-shape-based approach.
QUANT: A Minimalist Interval Method for Time Series Classification
We show that it is possible to achieve the same accuracy, on average, as the
most accurate existing interval methods for time series classification on a
standard set of benchmark datasets using a single type of feature (quantiles),
fixed intervals, and an 'off the shelf' classifier. This distillation of
interval-based approaches represents a fast and accurate method for time series
classification, achieving state-of-the-art accuracy on the expanded set of 142
datasets in the UCR archive with a total compute time (training and inference)
of less than 15 minutes using a single CPU core.
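In the spirit of this approach (though not the published algorithm, which differs in its interval scheme and input representations), quantile features over fixed dyadic intervals can be sketched as:

```python
import numpy as np

def quantile_interval_features(X, depth=3, n_quantiles=4):
    """For each series, split it into 1, 2, 4, ... equal intervals and take
    n_quantiles evenly spaced quantiles of each interval; the resulting
    fixed-length vectors feed an 'off the shelf' classifier."""
    feats = []
    for x in X:
        row = []
        for level in range(depth):
            for seg in np.array_split(np.asarray(x, float), 2 ** level):
                row.extend(np.quantile(seg, np.linspace(0, 1, n_quantiles)))
        feats.append(row)
    return np.array(feats)
```

Because the intervals are fixed rather than randomly sampled, the transform is deterministic and cheap, which is consistent with the small compute budget reported above.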
Plotting Time: Exploring Visual Representations for Time Series Classification
Master's thesis, Engenharia Informática, 2022, Universidade de Lisboa, Faculdade de Ciências.
Time series data is a collection of data points acquired in successive order over a period of
time, allowing us to obtain temporal information and make time-based predictions by applying
Machine Learning (ML) algorithms. Time series are prevalent in crucial sectors
for society’s development, such as Economy, Health, Weather, and Astronomy, with the objective
of improving the quality of life through the prediction of climate changes, economic variations,
earthquakes, and other types of events. These sectors require models with good predictive abilities
and capable of scaling as the volume of data gradually increases. We can address this issue by
using Deep Learning (DL) models that can keep a good performance while increasing the amount
of data. One example is the Convolutional Neural Network (CNN), which uses images as input in
several activity sectors. Little existing work combines deep learning models with image
generation for time series. Our objective is therefore to develop new image-generation methods
and train a simple CNN on the resulting images. We focus on time series data to create a new algorithm for
converting non-image time series data into graphical images that contain either box plots or violin
plots with statistical information. We hypothesize that CNNs can interpret and learn different
elements of the plots, and by comparing two different approaches, we can verify this statement.
Our results indicate that CNNs may not understand some elements of the box and violin plots,
for example, the outliers and quartiles, and focus more on the density and distribution of the data.
In the future, it would be interesting to study alternative image generation algorithms and explore
graphical representations in multivariate datasets.
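The elements a box plot renders, and which the CNN is expected to read off the generated images, can be computed directly. This numpy sketch is our own rather than the thesis's code, and uses the usual 1.5 * IQR fence for outliers:

```python
import numpy as np

def boxplot_stats(values):
    """Return the whisker ends, quartiles and outliers that a box plot draws."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = x[(x >= lo_fence) & (x <= hi_fence)]   # drawn as box + whiskers
    outliers = x[(x < lo_fence) | (x > hi_fence)]    # drawn as individual points
    return inliers.min(), q1, median, q3, inliers.max(), outliers
```

The finding that the CNN attends more to overall density than to these discrete elements suggests the plot markers carry relatively few pixels compared with the body of the distribution.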
A Bag of Receptive Fields for Time Series Extrinsic Predictions
High-dimensional time series data poses challenges due to its dynamic nature,
varying lengths, and presence of missing values. This kind of data requires
extensive preprocessing, limiting the applicability of existing Time Series
Classification and Time Series Extrinsic Regression techniques. For this
reason, we propose BORF, a Bag-Of-Receptive-Fields model, which incorporates
notions from time series convolution and 1D-SAX to handle univariate and
multivariate time series with varying lengths and missing values. We evaluate
BORF on Time Series Classification and Time Series Extrinsic Regression tasks
using the full UEA and UCR repositories, demonstrating its competitive
performance against state-of-the-art methods. Finally, we outline how this
representation can naturally provide saliency and feature-based explanations.
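As background for the 1D-SAX component (1D-SAX extends classic SAX with per-segment slopes), the classic SAX discretisation can be sketched as:

```python
import numpy as np

def sax(series, n_segments, breakpoints=(-0.67, 0.0, 0.67)):
    """Z-normalise, reduce to n_segments piecewise-aggregate means, then map
    each mean to a symbol using (approximate) Gaussian breakpoints for a
    four-symbol alphabet."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-9)
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    return "".join("abcd"[i] for i in np.searchsorted(breakpoints, paa))
```

Discretising into words is what makes bag-of-words style counting (as in a bag of receptive fields) possible over real-valued series.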
Simulation Analytics for Deeper Comparisons
Output analysis for stochastic simulation has traditionally focused on obtaining statistical summaries of time-averaged and replication-averaged performance measures. Although providing a useful overview of expected long-run results, this focus ignores the finer behaviour and dynamic interactions that characterise a stochastic system, motivating an opening for simulation analytics. Data analysis directed towards the detailed event logs of simulation sample paths can extend the analytical toolkit of simulation beyond static summaries of long-run behaviour.
This thesis contributes novel methodologies to the field of simulation analytics. Through careful mining of sample path data and the application of appropriate machine learning techniques, we unlock new opportunities for understanding and improving the performance of stochastic systems.
Our first area of focus is the real-time prediction of dynamic performance measures, and we demonstrate a k-nearest neighbours model on the multivariate state of a simulation. In conjunction with this, metric learning is employed to refine a system-specific distance measure that operates between simulation states. The involvement of metric learning is found not only to enhance prediction accuracy, but also to offer insight into the driving factors behind a system's stochastic performance. Our main contribution within this approach is the adaptation of a metric learning formulation to accommodate the type of data that is typical of simulation sample paths.
Secondly, we explore the continuous-time trajectories of simulation variables. Shapelets are found to identify the patterns that characterise and distinguish the trajectories of competing systems. By tailoring them to the structure of discrete-event sample paths, we pursue a deeper understanding and comparison of the dynamic behaviours of stochastic simulations.
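The real-time prediction step might be sketched as a weighted k-nearest-neighbours lookup over logged simulation states, with per-feature weights standing in for the thesis's learned metric; all names in this sketch are our own:

```python
import numpy as np

def knn_state_predict(states, outcomes, query, k=3, weights=None):
    """Predict a dynamic performance measure for a query state as the mean
    outcome of its k nearest logged states under a diagonal (weighted) metric."""
    S = np.asarray(states, dtype=float)
    w = np.ones(S.shape[1]) if weights is None else np.asarray(weights, float)
    # weighted Euclidean distance from the query to every logged state
    d = np.sqrt(((S - np.asarray(query, float)) ** 2 * w).sum(axis=1))
    return float(np.mean(np.asarray(outcomes, float)[np.argsort(d)[:k]]))
```

Learning the weights, rather than fixing them, is what lets the metric reveal which state variables actually drive the system's stochastic performance.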