Due to the increase in CPU power and ever-increasing data storage
capabilities, more and more data of all kinds is recorded, including temporal
data. Time series, the most prevalent type of temporal data, arise in a
broad range of application domains. Prominent examples include stock price data
in economics, gene expression data in biology, the course of environmental
parameters in meteorology, and data on moving objects recorded by traffic
sensors.
This large amount of raw data can only be analyzed by automated data mining
algorithms in order to generate new knowledge. One of the most basic data
mining operations is the similarity query, which computes a similarity or
distance value for two objects. Two aspects of such a similarity function are
of special interest: first, the semantics of the similarity function, and second,
the computational cost of calculating a similarity value. The semantics
is the actual similarity notion and is highly dependent on the analysis task at
hand.
This thesis addresses both aspects. We introduce a number of new
similarity measures for time series data and show how they can
be calculated efficiently by means of index structures and query
algorithms.
The first of the new similarity measures is threshold-based. Two
time series are considered similar if they exceed a user-given
threshold during similar time intervals. Aside from formally
defining this similarity measure, we show how to represent time
series in such a way that threshold-based queries can be efficiently
calculated. Our representation allows the threshold value to be
specified at query time. This is useful, for example, for data
mining tasks that try to determine crucial thresholds.
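
As a minimal sketch of the threshold-based notion described above (and not the representation or distance developed in this thesis), the following Python code extracts the time intervals during which a series exceeds a threshold and compares two series by the overlap of those intervals. The function names `intervals_above` and `threshold_distance`, and the use of a Jaccard distance on time points, are assumptions made purely for illustration.

```python
from typing import List, Tuple


def intervals_above(series: List[float], tau: float) -> List[Tuple[int, int]]:
    """Return maximal index intervals (start, end), inclusive, where series[i] > tau."""
    intervals = []
    start = None
    for i, value in enumerate(series):
        if value > tau and start is None:
            start = i                      # an interval above tau begins
        elif value <= tau and start is not None:
            intervals.append((start, i - 1))  # the interval ended at i - 1
            start = None
    if start is not None:                  # series ends while still above tau
        intervals.append((start, len(series) - 1))
    return intervals


def threshold_distance(a: List[float], b: List[float], tau: float) -> float:
    """Illustrative distance: Jaccard distance of the time points above tau."""
    points_a = {i for s, e in intervals_above(a, tau) for i in range(s, e + 1)}
    points_b = {i for s, e in intervals_above(b, tau) for i in range(s, e + 1)}
    union = points_a | points_b
    if not union:                          # neither series exceeds tau
        return 0.0
    return 1.0 - len(points_a & points_b) / len(union)
```

Because `tau` is an ordinary argument, the same series can be compared under different thresholds at query time, which mirrors the requirement stated above. This naive version scans the raw series for every query; it does not reflect the efficient representation the thesis proposes.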
The next similarity measure considers a relevant amplitude range.
This range is scanned with a certain resolution, and features are
extracted for each considered amplitude value. We consider the
change in the feature values over the amplitude values and thus
generate so-called feature sequences. Different features can finally
be combined to answer amplitude-level-based similarity queries. In
contrast to traditional approaches, which aggregate global feature
values along the time dimension, we capture local characteristics
and monitor their change for different amplitude values.
Furthermore, our method enables the user to specify a relevant range
of amplitude values to be considered, so the similarity notion
can be adapted to the current requirements.
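
The idea of a feature sequence can be illustrated with a small Python sketch. It assumes one very simple feature, the fraction of time points lying above an amplitude level, and compares two feature sequences with the Euclidean distance. Both choices, as well as the names `feature_sequence` and `amplitude_level_distance`, are hypothetical and stand in for the features and combination scheme defined in the thesis.

```python
import math
from typing import List


def feature_sequence(series: List[float], low: float, high: float, steps: int) -> List[float]:
    """Scan the amplitude range [low, high] with `steps` levels (steps >= 2) and
    record, for each level, the fraction of time points above that level."""
    levels = [low + k * (high - low) / (steps - 1) for k in range(steps)]
    return [sum(1 for v in series if v > level) / len(series) for level in levels]


def amplitude_level_distance(a: List[float], b: List[float],
                             low: float, high: float, steps: int) -> float:
    """Euclidean distance between the feature sequences of two series."""
    fa = feature_sequence(a, low, high, steps)
    fb = feature_sequence(b, low, high, steps)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

The arguments `low`, `high`, and `steps` correspond to the user-specified amplitude range and scanning resolution mentioned above; changing them adapts the similarity notion without touching the data.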
Next, we introduce so-called interval-focused similarity queries. A
user can specify one or several time intervals that should be
considered for the calculation of the similarity value. Our main
focus for this similarity measure is the efficient support of the
corresponding query. In particular, we try to avoid loading
complete time series objects into main memory if only a relatively
small portion of a time series is of interest. We propose a time
series representation which can be used to calculate upper and
lower distance bounds, so that only a few time series objects have
to be completely loaded and refined. Again, the relevant time
intervals do not have to be known in advance.
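
The filter-and-refine principle behind such distance bounds can be sketched as follows. The sketch assumes a deliberately simple summary, the minimum and maximum of each fixed-length segment, and the Euclidean distance restricted to one query interval. The functions `summarize`, `distance_bounds`, and `exact_distance` are illustrative names and are not the representation proposed in the thesis.

```python
import math
from typing import List, Tuple


def summarize(series: List[float], segment_len: int) -> List[Tuple[float, float]]:
    """Compact stand-in representation: (min, max) of each fixed-length segment."""
    return [(min(series[i:i + segment_len]), max(series[i:i + segment_len]))
            for i in range(0, len(series), segment_len)]


def distance_bounds(query: List[float], summary: List[Tuple[float, float]],
                    segment_len: int, start: int, end: int) -> Tuple[float, float]:
    """Lower and upper bounds on the Euclidean distance between `query` and the
    summarized series over the time points start..end (inclusive)."""
    lower_sq = upper_sq = 0.0
    for t in range(start, end + 1):
        lo, hi = summary[t // segment_len]
        q = query[t]
        gap = max(lo - q, q - hi, 0.0)         # closest the true value can be
        far = max(abs(q - lo), abs(q - hi))    # farthest the true value can be
        lower_sq += gap * gap
        upper_sq += far * far
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


def exact_distance(query: List[float], series: List[float], start: int, end: int) -> float:
    """Refinement step: exact Euclidean distance on the interval (needs the full series)."""
    return math.sqrt(sum((query[t] - series[t]) ** 2 for t in range(start, end + 1)))
```

In a range query with radius `r`, a series whose lower bound exceeds `r` can be discarded and one whose upper bound is at most `r` can be reported without loading it; only the remaining candidates need `exact_distance`. Since `start` and `end` are passed at query time, the summary does not depend on the intervals being known in advance.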
Finally, we define a similarity measure for so-called uncertain time series,
where several amplitude values are given for each point in time. This can be
due to multiple recordings or to errors in measurements, so that no exact value
can be specified. We show how to efficiently support queries on uncertain time
series.
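
To make the data model concrete, the following Python sketch represents an uncertain time series as a list of sample lists, one per point in time, and computes the smallest and largest Euclidean distance that two such series can have. It assumes that the value at each point in time can be any one of its samples, chosen independently of the other points in time. The function `uncertain_distance_range` is a hypothetical helper and not the similarity measure defined in the thesis.

```python
import math
from itertools import product
from typing import List, Tuple


def uncertain_distance_range(a: List[List[float]],
                             b: List[List[float]]) -> Tuple[float, float]:
    """Return (min, max) possible Euclidean distance between two uncertain
    time series, each given as one list of sampled values per time point."""
    min_sq = max_sq = 0.0
    for samples_a, samples_b in zip(a, b):
        # All pairwise differences between the samples at this time point.
        diffs = [abs(x - y) for x, y in product(samples_a, samples_b)]
        min_sq += min(diffs) ** 2
        max_sq += max(diffs) ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)
```

Such a range could serve as a coarse filter: if even the largest possible distance is within a query radius, the pair qualifies under every combination of samples.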
The last part of this thesis shows how data mining methods can be used to
discover crucial threshold parameters for the threshold-based similarity
measure. Furthermore, we present a data mining tool for time series