675 research outputs found

    Similarity Measures and Dimensionality Reduction Techniques for Time Series Data Mining

    The chapter is organized as follows. Section 2 introduces the similarity matching problem on time series. We note the importance of using efficient data structures to perform search, and of choosing an adequate distance measure. Section 3 presents some of the most widely used distance measures for time series data mining. Section 4 reviews the above-mentioned dimensionality reduction techniques.

    Fast and Flexible Multivariate Time Series Subsequence Search

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which often contain several gigabytes of data. Surprisingly, research on MTS search is very limited. Most of the existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two algorithms to solve this problem: (1) a List Based Search (LBS) algorithm, which uses sorted lists for indexing, and (2) an R*-tree Based Search (RBS) algorithm, which uses Minimum Bounding Rectangles (MBR) to organize the subsequences. Both algorithms guarantee that all matching patterns within the specified thresholds will be returned (no false dismissals). The very few false alarms can be removed by a post-processing step. Since our framework is also capable of Univariate Time-Series (UTS) subsequence search, we first demonstrate the efficiency of our algorithms on several UTS datasets previously used in the literature. We follow this up with experiments using two large MTS databases from the aviation domain, each containing several million observations. Both sets of tests show that our algorithms have very high prune rates (>99%), thus needing actual disk access for less than 1% of the observations. To the best of our knowledge, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.

    Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

    The class of molecules called short RNAs (sRNAs) is known to play a key role in gene regulation. They are typically sequences of 21-25 nucleotides in length. The identification, clustering and classification of sRNA has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is somehow unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule-based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than the methods used to obtain the results. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining problems. Both methods are based on treating the occurrences on a chromosome, or “hit count” data, as a time series, then running a sliding window along a chromosome and measuring unusualness. This formulation means we can treat finding unusual areas of candidate RNA activity as a variety of time series anomaly detection problem. The first set of approaches is model-based. We specify a null hypothesis distribution for not being a sRNA, then estimate the p-values along the chromosome. The second approach is instance-based. We identify some typical shapes from known sRNA, then use dynamic time warping and Fourier transform based distance to measure how closely the candidate series matches. We demonstrate that these methods can find known sRNA on Arabidopsis thaliana chromosomes and illustrate the benefits of the added information provided by these algorithms.
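
    The model-based idea in the abstract above (a null distribution plus p-values along a sliding window) can be illustrated with a minimal sketch. The abstract does not say which null distribution the authors use; the Poisson null, the `rate` parameter and the function names below are assumptions made purely for illustration.

```python
import math

def poisson_upper_tail(k, mu):
    """Return P(X >= k) for X ~ Poisson(mu)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mu)      # P(X = 0)
    cdf = term
    for i in range(1, k):     # accumulate P(X = 1) .. P(X = k - 1)
        term *= mu / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def sliding_window_pvalues(hits, width, rate):
    """P-value of each window's total hit count under an assumed Poisson null.

    hits  : integer hit counts per chromosome position
    width : window length in positions
    rate  : assumed background hits per position (hypothetical parameter)
    """
    if width <= 0 or len(hits) < width:
        return []
    mu = width * rate
    total = sum(hits[:width])
    pvals = [poisson_upper_tail(total, mu)]
    for start in range(1, len(hits) - width + 1):
        # Slide the window one position: add the new right edge, drop the old left edge.
        total += hits[start + width - 1] - hits[start - 1]
        pvals.append(poisson_upper_tail(total, mu))
    return pvals

hits = [0, 1, 0, 0, 5, 6, 4, 0, 1, 0]
pvals = sliding_window_pvalues(hits, width=3, rate=0.5)
most_unusual = min(range(len(pvals)), key=pvals.__getitem__)
print(most_unusual, pvals[most_unusual])
```

    Windows with the smallest p-values are the candidates for unusual activity. In practice the p-values would need a multiple-testing correction, because neighbouring windows overlap and many windows are tested.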

    The Influence of Global Constraints on Similarity Measures for Time-Series Databases

    A time series consists of a series of values or events obtained over repeated measurements in time. Analysis of time series represents an important tool in many application areas, such as stock market analysis, process and quality control, observation of natural phenomena, medical treatments, etc. A vital component in many types of time-series analysis is the choice of an appropriate distance/similarity measure. Numerous measures have been proposed to date, with the most successful ones based on dynamic programming. Because these measures have quadratic time complexity, however, global constraints are often employed to limit the search space in the matrix during the dynamic programming procedure, in order to speed up computation. Furthermore, it has been reported that such constrained measures can also achieve better accuracy. In this paper, we investigate two representative time-series distance/similarity measures based on dynamic programming, Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), and the effects of global constraints on them. Through extensive experiments on a large number of time-series data sets, we demonstrate how global constraints can significantly reduce the computation time of DTW and LCS. We also show that, if the constraint parameter is tight enough (less than 10-15% of time-series length), the constrained measure becomes significantly different from its unconstrained counterpart, in the sense of producing qualitatively different 1-nearest neighbor graphs. This observation explains the potential for accuracy gains when using constrained measures, highlighting the need for careful tuning of constraint parameters in order to achieve a good trade-off between speed and accuracy.
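
    To make the notion of a global constraint concrete, here is a minimal sketch of DTW restricted to a band around the matrix diagonal (commonly called a Sakoe-Chiba band). The abstract does not name the specific constraint studied, so the band shape, the absolute-difference local cost and the `window` parameter are assumptions for illustration.

```python
import math

def dtw_distance(a, b, window=None):
    """DTW distance between sequences a and b.

    window : half-width of the band around the diagonal, in samples;
             None means unconstrained (the full quadratic matrix is filled).
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the final cell (n, m).
    window = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= window are evaluated.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(a, b))            # unconstrained
print(dtw_distance(a, b, window=1))  # narrow band
print(dtw_distance(a, b, window=0))  # diagonal only, no warping
```

    With a band of half-width w, only about (2w + 1) cells per row are filled instead of the whole row, which is where the speed-up comes from. A constrained distance can never be smaller than the unconstrained one, since the band only removes candidate warping paths.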

    NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA

    Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rules mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to mine such data. In spatial data mining, the spatial co-location rule problem is different from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. Therefore, we have proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships, especially negative relationships, in the patterns. The relationships can be obtained from only the maximal clique patterns, which have never been used until now. Our approach is applied to a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving along paths close to each other for a predefined time. One approach to processing a “flock query” is to map ST data into high-dimensional space and to reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures rapidly deteriorates in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection and present experimental results that show the possibility of managing the curse of dimensionality in a ST setting by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series, which always yields the optimal result. This is in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the existence of similarity and/or correlation between the time series: the greater the similarity between the time series, the less space is required to compute the DTW between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. We discover an interesting pattern by applying the SparseDTW algorithm: “pairs trading” in a large stock-market dataset of daily index prices from the Australian stock exchange (ASX) from 1980 to 2002.
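
    The random-projection preprocessing mentioned in the abstract above can be illustrated with a small sketch. The abstract does not describe the projection the thesis actually uses; the Gaussian projection matrix, the dimensions and all function names below are assumptions chosen only to show why distances survive the reduction approximately.

```python
import math
import random

def make_projection(in_dim, out_dim, seed=0):
    """Random Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(matrix, point):
    """Map a point from in_dim dimensions down to out_dim dimensions."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical high-dimensional points, e.g. flattened trajectories.
rng = random.Random(1)
p = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
q = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

R = make_projection(in_dim=1000, out_dim=200)
ratio = euclidean(project(R, p), project(R, q)) / euclidean(p, q)
print(ratio)  # close to 1 when out_dim is large enough
```

    Because the projection approximately preserves pairwise distances, a range query in the original space can be approximated by a range query in the lower-dimensional space, where a conventional spatial index degrades less. The approximation is probabilistic, so candidates near the query boundary would still need to be checked against the original data.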

    Searching and mining trillions of time series subsequences under dynamic time warping
