The Extended Edit Distance Metric
Similarity search is an important problem in information retrieval. This
similarity is based on a distance. Symbolic representation of time series has
attracted many researchers recently, since it reduces the dimensionality of
these high dimensional data objects. We propose a new distance metric that is
applied to symbolic data objects and we test it on time series data bases in a
classification task. We compare it to other distances that are well known in
the literature for symbolic data objects. We also prove mathematically that
our distance is a metric. Comment: Technical report
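As a point of reference for the kind of symbolic distances this abstract compares against, here is a minimal sketch of the classic (Levenshtein) edit distance applied to discretized time series. It is not the proposed extended metric, whose definition the abstract does not give. The helper `symbolize`, its breakpoints, and the four-letter alphabet are illustrative assumptions.

```python
from bisect import bisect_right


def symbolize(series, breakpoints, alphabet="abcd"):
    """Map each value to a symbol by the bin it falls into.

    Assumes len(alphabet) == len(breakpoints) + 1 and sorted breakpoints
    (hypothetical discretization, chosen only for illustration).
    """
    return "".join(alphabet[bisect_right(breakpoints, x)] for x in series)


def edit_distance(a, b):
    """Classic Levenshtein distance between two symbol strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


breaks = [-0.5, 0.0, 0.5]
s1 = symbolize([-1.0, 0.1, 0.9, -0.2], breaks)
s2 = symbolize([-1.0, 0.6, 0.9, -0.2], breaks)
d = edit_distance(s1, s2)
```

The two example series differ in a single symbol after discretization, so their edit distance is one substitution.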
Feature-based time-series analysis
This work presents an introduction to feature-based time-series analysis. The
time series as a data type is first described, along with an overview of the
interdisciplinary time-series analysis literature. I then summarize the range
of feature-based representations for time series that have been developed to
aid interpretable insights into time-series structure. Particular emphasis is
given to emerging research that facilitates wide comparison of feature-based
representations that allow us to understand the properties of a time-series
dataset that make it suited to a particular feature-based representation or
analysis algorithm. The future of time-series analysis is likely to embrace
approaches that exploit machine learning methods to partially automate human
learning to aid understanding of the complex dynamical patterns in the time
series we measure from the world. Comment: 28 pages, 9 figures
Advanced Analysis on Temporal Data
Due to the increase in CPU power and ever-increasing data storage
capabilities, more and more data of all kinds is recorded, including temporal
data. Time series, the most prevalent type of temporal data, arise in a
broad range of application domains. Prominent examples include stock price data
in economics, gene expression data in biology, the course of environmental
parameters in meteorology, or data of moving objects recorded by traffic
sensors.
This large amount of raw data can only be analyzed by automated data mining
algorithms in order to generate new knowledge. One of the most basic data
mining operations is the similarity query, which computes a similarity or
distance value for two objects. Two aspects of such a similarity function are
of special interest: first, the semantics of the similarity function, and second,
the computational cost of calculating a similarity value. The semantics
is the actual similarity notion and is highly dependent on the analysis task at
hand.
This thesis addresses both aspects. We introduce a number of new
similarity measures for time series data and show how they can
efficiently be calculated by means of index structures and query
algorithms.
The first of the new similarity measures is threshold-based. Two
time series are considered similar if they exceed a user-given
threshold during similar time intervals. Aside from formally
defining this similarity measure, we show how to represent time
series in such a way that threshold-based queries can be efficiently
calculated. Our representation allows for the specification of the
threshold value at query time. This is useful, for example, for data
mining tasks that try to determine crucial thresholds.
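The threshold-based idea can be sketched as follows, under stated assumptions: each series is reduced to the time intervals during which it exceeds a threshold `tau` supplied at query time, and two series are compared by how much those intervals overlap. The Jaccard-style dissimilarity is an illustrative stand-in, not the measure defined in the thesis, and the function names are hypothetical.

```python
def intervals_above(series, tau):
    """Return half-open index intervals [start, end) where series > tau."""
    intervals = []
    start = None
    for i, x in enumerate(series):
        if x > tau and start is None:
            start = i                      # an interval opens
        elif x <= tau and start is not None:
            intervals.append((start, i))   # the interval closes
            start = None
    if start is not None:                  # series ends above the threshold
        intervals.append((start, len(series)))
    return intervals


def threshold_dissimilarity(a, b, tau):
    """1 - Jaccard overlap of the time steps above tau (illustrative only)."""
    above_a = {t for s, e in intervals_above(a, tau) for t in range(s, e)}
    above_b = {t for s, e in intervals_above(b, tau) for t in range(s, e)}
    union = above_a | above_b
    if not union:                          # neither series exceeds tau
        return 0.0
    return 1.0 - len(above_a & above_b) / len(union)


x = [0, 2, 3, 0, 4]
y = [0, 2, 0, 0, 4]
d = threshold_dissimilarity(x, y, tau=1)   # tau is chosen at query time
```

Because `tau` is an ordinary argument, the same stored series can be queried with different thresholds, which mirrors the query-time flexibility described above.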
The next similarity measure considers a relevant amplitude range.
This range is scanned with a certain resolution and for each
considered amplitude value features are extracted. We consider the
change in the feature values over the amplitude values and thus
generate so-called feature sequences. Different features can finally
be combined to answer amplitude-level-based similarity queries. In
contrast to traditional approaches which aggregate global feature
values along the time dimension, we capture local characteristics
and monitor their change for different amplitude values.
Furthermore, our method enables the user to specify a relevant range
of amplitude values to be considered and so the similarity notion
can be adapted to the current requirements.
Next, we introduce so-called interval-focused similarity queries. A
user can specify one or several time intervals that should be
considered for the calculation of the similarity value. Our main
focus for this similarity measure was the efficient support of the
corresponding query. In particular, we try to avoid loading the
complete time series objects into main memory if only a relatively
small portion of a time series is of interest. We propose a time
series representation which can be used to calculate upper and
lower distance bounds, so that only a few time series objects have
to be completely loaded and refined. Again, the relevant time
intervals do not have to be known in advance.
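A minimal sketch of the filter-and-refine idea, assuming a simple per-block (min, max) summary rather than the representation actually proposed in the thesis: the summary alone yields lower and upper bounds on the Euclidean distance over any query-time interval, and only candidates that the bounds cannot decide need to be loaded in full. All names and the block size are hypothetical.

```python
import math


def summarize(series, block):
    """Compact summary: (min, max) of each block of `block` points."""
    return [(min(series[i:i + block]), max(series[i:i + block]))
            for i in range(0, len(series), block)]


def interval_bounds(query, summary, block, start, end):
    """Lower/upper bounds on Euclidean distance over [start, end).

    Uses only the summary of the stored series, not its raw values.
    """
    lo_sq = hi_sq = 0.0
    for t in range(start, end):
        mn, mx = summary[t // block]
        q = query[t]
        gap = max(mn - q, q - mx, 0.0)          # closest the value can be
        far = max(abs(q - mn), abs(q - mx))     # farthest the value can be
        lo_sq += gap * gap
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def exact_distance(query, series, start, end):
    """Refinement step: needs the full series in memory."""
    return math.sqrt(sum((query[t] - series[t]) ** 2
                         for t in range(start, end)))


series = [1, 3, 2, 2, 5, 7]
query = [2, 2, 2, 2, 2, 2]
summary = summarize(series, block=2)
lo, hi = interval_bounds(query, summary, 2, start=2, end=6)
d = exact_distance(query, series, 2, 6)
```

The interval `[start, end)` is passed at query time, so the summary does not have to be built with any particular interval in mind.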
Finally, we define a similarity measure for so-called uncertain time series,
where several amplitude values are given for each point in time. This can be
due to multiple recordings or to errors in measurements, so that no exact value
can be specified. We show how to efficiently support queries on uncertain time
series.
The last part of this thesis shows how data mining methods can be used to
discover crucial threshold parameters for the threshold-based similarity
measure. Furthermore, we present a data mining tool for time series.
A quick search method for audio signals based on a piecewise linear representation of feature trajectories
This paper presents a new method for a quick similarity-based search through
long unlabeled audio streams to detect and locate audio clips provided by
users. The method involves feature-dimension reduction based on a piecewise
linear representation of a sequential feature trajectory extracted from a long
audio stream. Two techniques enable us to obtain a piecewise linear
representation: the dynamic segmentation of feature trajectories and the
segment-based Karhunen-Loève (KL) transform. In principle, the proposed search
method guarantees the same search results as a search without the proposed
feature-dimension reduction. Experimental results indicate
significant improvements in search speed. For example, the proposed method
reduced the total search time to approximately 1/12 that of previous methods
and detected queries in approximately 0.3 seconds from a 200-hour audio
database. Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and
Language Processing