30 research outputs found

    Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency

    Subsequence matching has emerged as an ideal approach for solving many problems in data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can serve as input for various subsequence matching approaches. The variety of data types, specific tasks and their partial or full solutions is so wide that the choice, implementation and parametrization of a suitable solution for a given task can be complicated and time-consuming; a possibly fruitful combination of fragments from different research areas may be neither obvious nor easy to realize. The leading authors in this field also mention the implementation bias that makes a proper comparison of competing approaches difficult. We therefore present a new generic Subsequence Matching Framework (SMF) that tries to overcome these problems with a uniform framework that simplifies and speeds up the design, development and evaluation of subsequence matching systems. We identify several relatively separate subtasks that are solved differently across the literature, and SMF makes it possible to combine them in a straightforward manner, achieving new quality and efficiency. The framework can be used in many application domains and its components can be reused effectively. Its strictly modular architecture and openness also enable the integration of efficient solutions from different fields, for instance efficient metric-based indexes. This is an extended version of a paper published at DEXA 2012.

    Efficient Algorithms for Similarity and Skyline Summary on Multidimensional Datasets.

    Efficient management of large multidimensional datasets has attracted much attention in the database research community. Such datasets are common, and efficient algorithms are needed to analyze them for a variety of applications. In this thesis, we focus our study on two very common classes of analysis: similarity and skyline summarization. We first focus on similarity when one of the dimensions in the multidimensional dataset is temporal. We then develop algorithms for evaluating skyline summaries effectively for both temporal and low-cardinality attribute domain datasets and propose different methods for improving the effectiveness of the skyline summary operation. This thesis begins by studying similarity measures for time-series datasets and efficient algorithms for time-series similarity evaluation. The first contribution of this thesis is a new algorithm which can be used to evaluate similarity methods whose matching criterion is bounded by a specified threshold value. The second contribution of this thesis is the development of a new time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called LookOut for evaluating such queries efficiently, and empirically demonstrate the scalability of this algorithm. Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution. The third contribution of this thesis is a novel technique called the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. The utility of the skyline as a data summarization technique is often diminished by the volume of points in the skyline. The final contribution of this thesis is a novel scheme which remedies the skyline volume problem by ranking the elements of the skyline based on their importance to the skyline summary. Collectively, the techniques described in this thesis present efficient methods for two common and computationally intensive analysis operations on large multidimensional datasets.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57643/2/mmorse_1.pd
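The dominance-based paradigm mentioned in the abstract above can be illustrated with a minimal sketch. This is a naive quadratic baseline, not LookOut or the Lattice Skyline Algorithm from the thesis; the function names `dominates` and `skyline` are hypothetical, and the sketch assumes that smaller values are better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b (assumption: smaller is better).

    a dominates b when a is no worse than b in every dimension
    and strictly better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    data = [(1, 4), (2, 2), (3, 3), (4, 1)]
    # (3, 3) is dominated by (2, 2); the other three points are incomparable.
    print(skyline(data))
```

Because every point is compared against every other point, the running time of this baseline depends heavily on how many points survive, which is consistent with the abstract's remark that such techniques are sensitive to the data distribution.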

    Semi-Lazy Learning Approach to Dynamic Spatio-Temporal Data Analysis

    Ph.D., Doctor of Philosophy

    Fast trajectory search for real-world applications

    With the popularity of smartphones equipped with GPS, a vast amount of trajectory data is being produced by location-based services such as Uber, Google Maps, and Foursquare. We broadly divide trajectory data into three types: 1) commuter trajectories from taxicabs and ride-sharing apps; 2) vehicle trajectories from GPS navigation apps; 3) activity trajectories from social network check-ins and travel blogs. We investigate efficient and effective search on each of the three types of trajectory data, each of which has a real-world application. In particular: 1) commuter trajectory search can serve transport capacity estimation and route planning; 2) vehicle trajectory search can help real-time traffic monitoring and trend analysis; 3) activity trajectory search can be used in interactive and personalized trip planning. As the simplest kind of trajectory data, a commuter trajectory contains only two points, an origin and a destination indicating a passenger’s movement, which is valuable for transportation decision making. In this thesis, we propose a novel query, RkNNT, to estimate the capacity of a bus route in the transport network. Answering RkNNT is challenging due to the large volume of commuter data. We propose efficient solutions to prune most trajectories that cannot choose a query route as their nearest one. Further, we apply RkNNT to the optimal route planning problem, MaxRkNNT. A vehicle trajectory has more points than a commuter trajectory, as it tracks the whole trace of a vehicle, and can further support traffic monitoring applications. We summarize the common queries over trajectory data for monitoring purposes and propose a search engine, Torch, to manage and search trajectories with map matching over a road network, instead of storing raw data sampled from GPS at a high cost. Besides improving the efficiency of search, Torch also supports compression, effectiveness evaluation of various existing similarity measures, and large-scale k-paths clustering with a novel similarity measure, LORS. Exploring activity trajectory data, which contains textual information, can help plan personalized trips for tourists. Based on the spatial indexes that we propose for commuter and vehicle trajectory data, we further develop a unified search paradigm to process various top-k queries over activity trajectory and POI data (hotels, restaurants, attractions, etc.) at the same time. In particular, a new point-wise similarity measure, PATS, and an indexing framework with a unified search paradigm are proposed.

    Eddy current defect response analysis using sum of Gaussian methods

    This dissertation is a study of methods to automatically detect and approximate eddy current differential coil defect signatures as a sum of Gaussian functions (SoG). Datasets covering varying materials, defect sizes, inspection frequencies, and coil diameters were investigated. Dimensionally reduced representations of the defect responses were obtained using common existing reduction methods and novel enhancements to them based on SoG representations. The efficacy of the SoG-enhanced representations was studied using common interpretable machine learning (ML) classifier designs, with the SoG representations indicating significant improvement in common analysis metrics.

    SharkDB: an in-memory storage system for large scale trajectory data management


    Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping

    The proliferation and ubiquity of temporal data across many disciplines has sparked interest in similarity, classification and clustering methods specifically designed to handle time series data. A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. Traditional distance measures such as the Euclidean distance are not well suited due to the time-dependent nature of the data. Elastic metrics such as dynamic time warping (DTW) offer a promising approach, but are limited by their computational complexity, non-differentiability and sensitivity to noise and outliers. This thesis proposes novel elastic alignment methods that use parametric and diffeomorphic warping transformations as a means of overcoming the shortcomings of DTW-based metrics. The proposed method is differentiable and invertible, well suited for deep learning architectures, robust to noise and outliers, computationally efficient, and expressive and flexible enough to capture complex patterns. Furthermore, a closed-form solution was developed for the gradient of these diffeomorphic transformations, which allows an efficient search in the parameter space, leading to better solutions at convergence. Leveraging the benefits of these closed-form diffeomorphic transformations, this thesis proposes a suite of advancements that include: (a) an enhanced temporal transformer network for time series alignment and averaging, (b) a deep-learning based time series classification model to simultaneously align and classify signals with high accuracy, (c) an incremental time series clustering algorithm that is warping-invariant, scalable and can operate under limited computational and time resources, and finally, (d) a normalizing flow model that enhances the flexibility of affine transformations in coupling and autoregressive layers.
    Comment: PhD Thesis, defended at the University of Navarra on July 17, 2023. 277 pages, 8 chapters, 1 appendix
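For context on the DTW baseline that the abstract above contrasts itself with, here is a minimal sketch of classic dynamic time warping. It is the textbook dynamic program, not the diffeomorphic method proposed in the thesis; the function name `dtw_distance` is hypothetical, and the sketch assumes one-dimensional sequences with absolute difference as the local cost.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two 1-D sequences.

    Assumptions: local cost is |a[i] - b[j]|, no warping-window
    constraint. Runs in O(len(a) * len(b)) time and O(len(b)) memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # prev[j] holds the best cost of aligning a[:i-1] with b[:j].
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, diagonal.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


if __name__ == "__main__":
    # The second series repeats one sample; DTW absorbs the repetition.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The nested loops make the quadratic cost visible, and the `min` over three predecessors is the step that makes the measure non-differentiable, the two limitations the abstract names.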

    Coping with distance and location dependencies in spatial, temporal and uncertain data
