Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency
Subsequence matching has emerged as an ideal approach for solving many
problems related to the fields of data mining and similarity retrieval. It has
been shown that almost any data class (audio, image, biometrics, signals) is or
can be represented by some kind of time series or string of symbols, which can
be seen as an input for various subsequence matching approaches. The variety of
data types, specific tasks and their partial or full solutions is so wide that
the choice, implementation and parametrization of a suitable solution for a
given task might be complicated and time-consuming; a possibly fruitful
combination of fragments from different research areas may be neither obvious
nor easy to realize. The leading authors of this field also mention the
implementation bias that makes a proper comparison of competing approaches
difficult. Therefore, we present a new generic Subsequence Matching Framework
(SMF) that tries to overcome the aforementioned problems by a uniform frame
that simplifies and speeds up the design, development and evaluation of
subsequence matching related systems. We identify several relatively separate
subtasks that are solved differently across the literature, and SMF enables
them to be combined in a straightforward manner, achieving new quality and
efficiency. This framework
can be used in many application domains and its components can be reused
effectively. Its strictly modular architecture and openness also enable the
involvement of efficient solutions from different fields, for instance
efficient metric-based indexes. This is an extended version of a paper
published at DEXA 2012.
Efficient Algorithms for Similarity and Skyline Summary on Multidimensional Datasets.
Efficient management of large multidimensional datasets has attracted much attention
in the database research community. Such large multidimensional datasets are common
and efficient algorithms are needed for analyzing these data sets for a variety of applications.
In this thesis, we focus our study on two very common classes of analysis: similarity
and skyline summarization. We first focus on similarity when one of the dimensions in the
multidimensional dataset is temporal. We then develop algorithms for evaluating skyline
summaries effectively for both temporal and low-cardinality attribute domain datasets and
propose different methods for improving the effectiveness of the skyline summary operation.
This thesis begins by studying similarity measures for time-series datasets and efficient
algorithms for time-series similarity evaluation. The first contribution of this thesis is
a new algorithm which can be
used to evaluate similarity methods whose matching criterion is bounded by a specified
threshold value.
The second contribution of this thesis is the development of a new time-interval skyline
operator, which continuously computes the current skyline over a data stream. We present
a new algorithm called LookOut for evaluating such queries efficiently, and empirically
demonstrate the scalability of this algorithm.
Current skyline evaluation techniques follow a common paradigm that eliminates data
elements from skyline consideration by finding other elements in the dataset that dominate
them. The performance of such techniques is heavily influenced by the underlying data
distribution. The third contribution of this thesis is a novel technique called the Lattice
Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on
datasets with attributes that are drawn from low-cardinality domains.
The utility of the skyline as a data summarization technique is often diminished by the
volume of points in the skyline. The final contribution of this thesis is a novel scheme
which remedies the skyline volume problem by
ranking the elements of the skyline based on their importance to the skyline summary.
Collectively, the techniques described in this thesis present efficient methods for two
common and computationally intensive analysis operations on large multidimensional
datasets.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/57643/2/mmorse_1.pd
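The dominance-based paradigm that the abstract above attributes to existing skyline techniques can be illustrated with a minimal sketch. This is a naive quadratic baseline, not the thesis's LookOut or Lattice Skyline algorithm; the `dominates` and `skyline` names, the "smaller is better" convention, and the hotel data are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in every dimension and
    strictly better in at least one (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to beach) for five hotels.
hotels = [(50, 3.0), (80, 1.0), (60, 2.5), (90, 2.0), (50, 4.0)]
print(skyline(hotels))
```

Here (90, 2.0) is eliminated because (80, 1.0) dominates it, and (50, 4.0) is eliminated by (50, 3.0). Because every point may be compared against every other, the running time depends on how quickly dominating points are found, which is consistent with the abstract's remark that performance is influenced by the data distribution.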
Semi-Lazy Learning Approach to Dynamic Spatio-Temporal Data Analysis
Ph.D., Doctor of Philosophy
Fast trajectory search for real-world applications
With the popularity of smartphones equipped with GPS, a vast amount of trajectory data is being produced by location-based services, such as Uber, Google Maps, and Foursquare. We broadly divide trajectory data into three types: 1) commuter trajectories from taxicabs and ride-sharing apps; 2) vehicle trajectories from GPS navigation apps; 3) activity trajectories from social network check-ins and travel blogs. We investigate efficient and effective search on each of the three types of trajectory data, each of which has a real-world application. In particular: 1) commuter trajectory search can support transport capacity estimation and route planning; 2) vehicle trajectory search can help with real-time traffic monitoring and trend analysis; 3) activity trajectory search can be used in interactive and personalized trip planning. As the most straightforward trajectory data, a commuter trajectory contains only two points, an origin and a destination indicating a passenger’s movement, which is valuable for transportation decision making. In this thesis, we propose a novel query, RkNNT, to estimate the capacity of a bus route in the transport network. Answering RkNNT is challenging due to the large volume of data from commuters. We propose efficient solutions to prune most trajectories which cannot choose a query route as their nearest one. Further, we apply RkNNT to the optimal route planning problem, MaxRkNNT. A vehicle trajectory has more points than a commuter trajectory, as it tracks the whole trace of a vehicle, and can further support traffic monitoring applications. We summarize the common queries over trajectory data for monitoring purposes and propose a search engine, Torch, to manage and search trajectories with map matching over a road network, instead of storing raw data sampled from GPS at a high cost.
Besides improving the efficiency of search, Torch also supports compression, effectiveness evaluation of various existing similarity measures, and large-scale k-paths clustering with a novel similarity measure, LORS. Exploring activity trajectory data, which contains textual information, can help plan personalized trips for tourists. Based on the spatial indexes which we propose for commuter and vehicle trajectory data, we further develop a unified search paradigm to process various top-k queries over activity trajectory and POI data (hotels, restaurants, attractions, etc.) at the same time. In particular, a new point-wise similarity measure, PATS, and an indexing framework with a unified search paradigm are proposed.
Eddy current defect response analysis using sum of Gaussian methods
This dissertation is a study of methods to automatically detect and produce approximations of eddy current differential coil defect signatures in terms of a summed collection of Gaussian functions (SoG). Datasets consisting of varying material, defect size, inspection frequency, and coil diameter were investigated. Dimensionally reduced representations of the defect responses were obtained utilizing common existing reduction methods and novel enhancements to them utilizing SoG representations. The efficacy of the SoG-enhanced representations was studied utilizing common interpretable Machine Learning (ML) classifier designs, with the SoG representations indicating significant improvement of common analysis metrics.
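As a rough illustration of what a sum-of-Gaussians (SoG) representation looks like, the sketch below evaluates a signal modelled as a sum of terms of the form a * exp(-(x - mu)^2 / (2 * sigma^2)). The function name, the (amplitude, centre, width) parameterisation, and the two-component example are assumptions chosen purely for illustration; the dissertation's actual fitting procedure and parameters are not given in the abstract.

```python
import math
from typing import List, Tuple

# Each component is (amplitude, centre, width); hypothetical parameterisation.
Component = Tuple[float, float, float]


def sum_of_gaussians(x: float, components: List[Component]) -> float:
    """Evaluate sum_k a_k * exp(-(x - mu_k)^2 / (2 * sigma_k^2)) at x."""
    return sum(
        a * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
        for a, mu, sigma in components
    )


# Two lobes of opposite sign, picked only to show how few parameters
# (six numbers here) can describe a whole sampled curve.
components = [(1.0, -1.0, 0.5), (-1.0, 1.0, 0.5)]
samples = [sum_of_gaussians(i / 10.0, components) for i in range(-30, 31)]
print(len(samples), max(samples), min(samples))
```

The appeal of such a representation for dimensionality reduction is that a sampled response of arbitrary length is summarised by three numbers per component.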
Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping
The proliferation and ubiquity of temporal data across many disciplines has
sparked interest in similarity, classification and clustering methods
specifically designed to handle time series data. A core issue when dealing
with time series is determining their pairwise similarity, i.e., the degree to
which a given time series resembles another. Traditional distance measures such
as the Euclidean distance are not well suited due to the time-dependent nature of the
data. Elastic metrics such as dynamic time warping (DTW) offer a promising
approach, but are limited by their computational complexity,
non-differentiability and sensitivity to noise and outliers. This thesis
proposes novel elastic alignment methods that use parametric and diffeomorphic
warping transformations as a means of overcoming the shortcomings of DTW-based
metrics. The proposed method is differentiable and invertible, well-suited for
deep learning architectures, robust to noise and outliers, computationally
efficient, and is expressive and flexible enough to capture complex patterns.
Furthermore, a closed-form solution was developed for the gradient of these
diffeomorphic transformations, which allows an efficient search in the
parameter space, leading to better solutions at convergence. Leveraging the
benefits of these closed-form diffeomorphic transformations, this thesis
proposes a suite of advancements that include: (a) an enhanced temporal
transformer network for time series alignment and averaging, (b) a
deep-learning based time series classification model to simultaneously align
and classify signals with high accuracy, (c) an incremental time series
clustering algorithm that is warping-invariant, scalable and can operate under
limited computational and time resources, and finally, (d) a normalizing flow
model that enhances the flexibility of affine transformations in coupling and
autoregressive layers.
Comment: PhD Thesis, defended at the University of Navarra on July 17, 2023.
277 pages, 8 chapters, 1 appendix
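For context on the limitations of dynamic time warping (DTW) named in the abstract above, here is a minimal sketch of the classic dynamic-programming formulation. It is the textbook baseline, not the diffeomorphic method the thesis proposes; the function name and the absolute-difference local cost are assumptions made for illustration.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW: fill an (n+1) x (m+1) table of cumulative alignment costs."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # The hard minimum over three predecessors is what makes the
            # measure non-differentiable with respect to the inputs.
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1] for a new a[i-1]
                cost[i][j - 1],      # repeat a[i-1] for a new b[j-1]
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

The nested loops visit every cell of the table, so time and memory grow with the product of the two series lengths, which is the computational cost the abstract refers to. The first call shows the elastic behaviour: the repeated value in the second series is absorbed by the warping at no cost.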