Highly comparative feature-based time-series classification
A highly comparative, feature-based approach to time series classification is
introduced that uses an extensive database of algorithms to extract thousands
of interpretable features from time series. These features are derived from
across the scientific time-series analysis literature, and include summaries of
time series in terms of their correlation structure, distribution, entropy,
stationarity, scaling properties, and fits to a range of time-series models.
After computing thousands of features for each time series in a training set,
those that are most informative of the class structure are selected using
greedy forward feature selection with a linear classifier. The resulting
feature-based classifiers automatically learn the differences between classes
using a reduced number of time-series properties, and circumvent the need to
calculate distances between time series. Representing time series in this way
results in orders of magnitude of dimensionality reduction, allowing the method
to perform well on very large datasets containing long time series or time
series of different lengths. For many of the datasets studied, classification
performance exceeded that of conventional instance-based classifiers, including
one-nearest-neighbor classifiers using Euclidean distances and dynamic time
warping. Most importantly, the features selected provide an understanding
of the properties of the dataset, insight that can guide further scientific
investigation.
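The selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes features are already computed into a matrix, swaps in a nearest-class-mean classifier as a simple stand-in for the linear classifier, and scores candidates by training accuracy. The function names and the `max_features` parameter are hypothetical.

```python
from statistics import mean


def centroid_accuracy(X, y, cols):
    """Training accuracy of a nearest-class-mean classifier using only `cols`."""
    classes = sorted(set(y))
    centroids = {
        c: [mean(row[j] for row, lab in zip(X, y) if lab == c) for j in cols]
        for c in classes
    }
    correct = 0
    for row, lab in zip(X, y):
        # Predict the class whose centroid is closest in the selected features.
        pred = min(
            classes,
            key=lambda c: sum((row[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == lab
    return correct / len(y)


def greedy_forward_selection(X, y, max_features=3):
    """Add one feature at a time, keeping the one that most improves accuracy."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(centroid_accuracy(X, y, selected + [j]), j) for j in remaining]
        # Ties are broken in favour of the lowest feature index.
        score, j = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break  # no candidate improves the current feature set
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Toy feature matrix: only column 1 separates the two classes.
X = [[1, 0.0, 3], [2, 0.1, 1], [1, 1.0, 3], [2, 0.9, 1]]
y = [0, 0, 1, 1]
selected, best = greedy_forward_selection(X, y)
```

In practice a held-out or cross-validated score would replace training accuracy to avoid selecting features that overfit.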
DancingLines: An Analytical Scheme to Depict Cross-Platform Event Popularity
Nowadays, events usually burst and are propagated online through multiple
modern media like social networks and search engines. Various studies discuss
event dissemination trends on individual media, while
few studies focus on event popularity analysis from a cross-platform
perspective. Challenges come from the vast diversity of events and media,
limited access to aligned datasets across different media and a great deal of
noise in the datasets. In this paper, we design DancingLines, an innovative
scheme that captures and quantitatively analyzes event popularity between
pairwise text media. It contains two models: TF-SW, a semantic-aware popularity
quantification model, based on an integrated weight coefficient leveraging
Word2Vec and TextRank; and wDTW-CD, a pairwise event popularity time series
alignment model matching different event phases adapted from Dynamic Time
Warping. We also propose three metrics to interpret event popularity trends
between pairwise social platforms. Experimental results on eighteen real-world
event datasets from an influential social network and a popular search engine
validate the effectiveness and applicability of our scheme. DancingLines is
shown to have broad application potential for discovering knowledge about
various aspects of events and different media.
On-line Elastic Similarity Measures for time series
The way similarity is measured among time series is of paramount importance in many data mining and machine learning tasks. For instance, Elastic Similarity Measures are widely used to determine whether two time series are similar to each other. Indeed, in off-line time series mining, these measures have been shown to be very effective due to their ability to handle time distortions and mitigate their effect on the resulting distance. In the on-line setting, where available data increase continuously over time and not necessarily in a stationary manner, stream mining approaches are required to be fast with limited memory consumption and capable of adapting to different stationary intervals. In this sense, the computational complexity of Elastic Similarity Measures and their lack of flexibility to accommodate different stationary intervals make these similarity measures incompatible with the requirements mentioned. To overcome these issues, this paper adapts the family of Elastic Similarity Measures (which includes Dynamic Time Warping, Edit Distance, Edit Distance for Real Sequences and Edit Distance with Real Penalty) to the on-line setting. The proposed adaptation is based on two main ideas: a forgetting mechanism and incremental computation. The former makes the similarity consistent with streaming time series characteristics by giving more importance to recent observations, whereas the latter reduces the computational complexity by avoiding unnecessary computations. In order to assess the behavior of the proposed similarity measure in on-line settings, two different experiments have been carried out. The first aims at showing the efficiency of the proposed adaptation; to do so, we calculate and compare the computation time for the elastic measures and their on-line adaptation.
By analyzing the results drawn from a distance-based streaming machine learning model, the second experiment intends to show the effect of the forgetting mechanism on the resulting similarity value. The experimentation shows, for the aforementioned Elastic Similarity Measures, that the proposed adaptation meets the memory, computational complexity and flexibility constraints imposed by streaming data.
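To illustrate the forgetting idea, the sketch below computes Dynamic Time Warping with a decay factor applied to the accumulated cost, so that older alignments contribute less than recent ones. The abstract does not give the recurrence, so this form is an assumption rather than the paper's definition. The parameter name `rho` is hypothetical, and the incremental computation across stream chunks is not shown.

```python
import math


def dtw_with_forgetting(x, y, rho=1.0):
    """DTW cost where past accumulated cost is scaled by rho at every step.

    rho = 1.0 gives standard DTW (squared pointwise cost); 0 < rho < 1
    down-weights older observations. This recurrence is an assumed form.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    n, m = len(x), len(y)
    # D[i][j] holds the cost of aligning x[:i] with y[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + rho * min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


x = [1, 2, 3]
y = [2, 2, 3]
standard = dtw_with_forgetting(x, y)            # plain DTW
forgetful = dtw_with_forgetting(x, y, rho=0.5)  # early mismatch is discounted
```

Here the two series differ only at their first sample, so the discounted variant returns a smaller value than standard DTW.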
Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency
Subsequence matching has appeared to be an ideal approach for solving many
problems related to the fields of data mining and similarity retrieval. It has
been shown that almost any data class (audio, image, biometrics, signals) is or
can be represented by some kind of time series or string of symbols, which can
be seen as an input for various subsequence matching approaches. The variety of
data types, specific tasks and their partial or full solutions is so wide that
the choice, implementation and parametrization of a suitable solution for a
given task might be complicated and time-consuming; a possibly fruitful
combination of fragments from different research areas may be neither obvious
nor easy to realize. The leading authors of this field also mention the
implementation bias that makes a proper comparison of competing approaches
difficult. Therefore we present a new generic Subsequence Matching Framework
(SMF) that tries to overcome the aforementioned problems by a uniform frame
that simplifies and speeds up the design, development and evaluation of
subsequence matching related systems. We identify several relatively separate
subtasks solved differently across the literature, and SMF enables combining
them in a straightforward manner, achieving new quality and efficiency. This framework
can be used in many application domains and its components can be reused
effectively. Its strictly modular architecture and openness also enable the
integration of efficient solutions from different fields, for instance
efficient metric-based indexes. This is an extended version of a paper
published at DEXA 2012.
Calibration by correlation using metric embedding from non-metric similarities
This paper presents a new intrinsic calibration method that allows us to calibrate a generic single-viewpoint camera just
by waving it around. From the video sequence obtained while the camera undergoes random motion, we compute the pairwise time
correlation of the luminance signal for a subset of the pixels. We show that, if the camera undergoes a random uniform motion, then
the pairwise correlation of any pixel pair is a function of the distance between the pixel directions on the visual sphere. This leads to
formalizing calibration as a problem of metric embedding from non-metric measurements: we want to find the disposition of pixels on
the visual sphere from similarities that are an unknown function of the distances. This problem is a generalization of multidimensional
scaling (MDS) that has so far resisted a comprehensive observability analysis (can we reconstruct a metrically accurate embedding?)
and a solid generic solution (how to do so?). We show that the observability depends both on the local geometric properties (curvature)
and on the global topological properties (connectedness) of the target manifold. We show that, in contrast to the Euclidean case,
on the sphere we can recover the scale of the point distribution, therefore obtaining a metrically accurate solution from non-metric
measurements. We describe an algorithm that is robust across manifolds and can recover a metrically accurate solution when the metric
information is observable. We demonstrate the performance of the algorithm for several cameras (pin-hole, fish-eye, omnidirectional),
and we obtain results comparable to calibration using classical methods. Additional synthetic benchmarks show that the algorithm
performs as theoretically predicted for all corner cases of the observability analysis.
Time series classification with ensembles of elastic distance measures
Several alternative distance measures for comparing time series have recently been proposed and evaluated on time series classification (TSC) problems. These include variants of dynamic time warping (DTW), such as weighted and derivative DTW, and edit distance-based measures, including longest common subsequence, edit distance with real penalty, time warp with edit, and move-split-merge. These measures have the common characteristic that they operate in the time domain and compensate for potential localised misalignment through some elastic adjustment. Our aim is to experimentally test two hypotheses related to these distance measures. Firstly, we test whether there is any significant difference in accuracy for TSC problems between nearest neighbour classifiers using these distance measures. Secondly, we test whether combining these elastic distance measures through simple ensemble schemes gives significantly better accuracy. We test these hypotheses by carrying out one of the largest experimental studies ever conducted into time series classification. Our first key finding is that there is no significant difference between the elastic distance measures in terms of classification accuracy on our data sets. Our second finding, and the major contribution of this work, is to define an ensemble classifier that significantly outperforms the individual classifiers. We also demonstrate that the ensemble is more accurate than approaches not based in the time domain. Nearly all TSC papers in the data mining literature cite DTW (with warping window set through cross validation) as the benchmark for comparison. We believe that our ensemble is the first ever classifier to significantly outperform DTW and as such raises the bar for future work in this area.
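A simple ensemble of nearest neighbour classifiers over elastic distances can be sketched as below. This is an illustration under stated assumptions, not the paper's ensemble: it uses only three members (full DTW, window-constrained DTW, and a derivative DTW approximated with first differences) and an unweighted majority vote. All function names are hypothetical.

```python
import math
from collections import Counter


def dtw(a, b, window=None):
    """DTW with squared pointwise cost and an optional warping window."""
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide for a path to exist.
    w = max(n, m) if window is None else max(window, abs(n - m))
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


def derivative_dtw(a, b):
    """DTW on first differences (a simplification of derivative DTW)."""
    da = [q - p for p, q in zip(a, a[1:])]
    db = [q - p for p, q in zip(b, b[1:])]
    return dtw(da, db)


def one_nn(train, labels, query, dist):
    """Label of the training series closest to `query` under `dist`."""
    best = min(range(len(train)), key=lambda k: dist(train[k], query))
    return labels[best]


def ensemble_predict(train, labels, query, dists):
    """Unweighted majority vote over one 1-NN classifier per distance."""
    votes = Counter(one_nn(train, labels, query, d) for d in dists)
    return votes.most_common(1)[0][0]


train = [[0, 1, 2, 1, 0, 0], [0, 0, 0, 0, 0, 0]]
labels = ["bump", "flat"]
dists = [dtw, lambda a, b: dtw(a, b, window=1), derivative_dtw]
prediction = ensemble_predict(train, labels, [0, 0, 1, 2, 1, 0], dists)
```

A weighted vote, for example with weights taken from each member's cross-validated accuracy, would be a natural extension of the unweighted vote used here.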