
    Efficient Retrieval of Similar Time Sequences Using DFT

    We propose an improvement of the known DFT-based indexing technique for fast retrieval of similar time sequences. We use the last few Fourier coefficients in the distance computation without storing them in the index, since every coefficient at the end is the complex conjugate of a coefficient at the beginning and as strong as its counterpart. We show analytically that this observation can accelerate the search time of the index by more than a factor of two. This result was confirmed by our experiments, which were carried out on real stock prices and synthetic data.
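
    To illustrate the symmetry being exploited: for a real-valued sequence of length n, the DFT satisfies X[n-j] = conj(X[j]), so the contribution of a trailing coefficient to the Euclidean distance can be accounted for by doubling that of its stored counterpart. A minimal NumPy sketch (illustrative only; the function name and the truncation parameter k are our own, not the paper's):

```python
import numpy as np

def truncated_dft_distance(x, y, k):
    """Lower bound on the Euclidean distance between real sequences x and y
    using only DFT coefficients 0..k; each stored coefficient (except the DC
    term and, for even n, the Nyquist term) is counted twice to stand in for
    its unstored complex-conjugate mirror."""
    n = len(x)
    X = np.fft.fft(x, norm="ortho")   # orthonormal DFT preserves Euclidean distance
    Y = np.fft.fft(y, norm="ortho")
    d2 = np.abs(X[0] - Y[0]) ** 2     # DC term has no mirror
    for j in range(1, k + 1):
        w = 1.0 if (n % 2 == 0 and j == n // 2) else 2.0   # Nyquist term is also unmirrored
        d2 += w * np.abs(X[j] - Y[j]) ** 2
    return np.sqrt(d2)
```

    Because the omitted terms are non-negative, the truncated value never exceeds the true distance, which is what makes it safe as an index-level filter.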

    Implications of Z-normalization in the matrix profile

    Companies are increasingly measuring their products and services, resulting in a rising amount of available time series data and a growing need for techniques to extract usable information from it. One state-of-the-art technique for time series is the Matrix Profile, which has been used for various applications including motif/discord discovery, visualization, and semantic segmentation. Internally, the Matrix Profile uses the z-normalized Euclidean distance to compare the shape of subsequences between two series. However, when comparing subsequences that are relatively flat and contain noise, the resulting distance is high despite their visual similarity. This property violates some of the assumptions made by Matrix Profile based techniques, resulting in worse performance when series contain flat and noisy subsequences. By studying the properties of the z-normalized Euclidean distance, we derived a method to eliminate this effect that requires only an estimate of the standard deviation of the noise. In this paper we describe various practical properties of the z-normalized Euclidean distance and show how these can be used to correct the performance of Matrix Profile related techniques. We demonstrate our techniques on anomaly detection using a Yahoo! Webscope anomaly dataset, semantic segmentation on the PAMAP2 activity dataset, and data visualization on a UCI activity dataset, all containing real-world data, and obtain overall better results after applying our technique. Our technique is a straightforward extension of the distance calculation in the Matrix Profile and will benefit any derived technique dealing with time series containing flat and noisy subsequences.
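
    For reference, the z-normalized Euclidean distance on which the Matrix Profile is built can be sketched as follows (NumPy, illustrative; the paper's noise correction additionally subtracts a term derived from the noise standard deviation, which is not reproduced here):

```python
import numpy as np

def znorm_euclidean(a, b):
    """Z-normalized Euclidean distance: compares the shape of two equal-length
    subsequences by removing their mean and scale before measuring distance."""
    a_z = (a - a.mean()) / a.std()
    b_z = (b - b.mean()) / b.std()
    return np.linalg.norm(a_z - b_z)
```

    When both subsequences are nearly flat, the standard deviations in the denominators approach zero and the normalization inflates whatever noise is present, which is exactly the effect the correction targets.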

    Generalized Distance Metric as a Robust Similarity Measure for Mobile Object Trajectories

    In this paper, we propose a novel generalized distance metric based on a model that incorporates the time axis explicitly. The proposed metric is based fundamentally on the Mahalanobis distance metric, which eliminates the correlation and scaling errors in similarity searches on trajectory databases. We further incorporate a weight matrix into the proposed distance metric, which allows easy manipulation of the degree of significance of the different spatial and/or temporal dimensions.
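
    A minimal sketch of the idea, assuming trajectory points are stacked as spatio-temporal vectors and that the weight matrix multiplies the inverse covariance on both sides (the paper's exact formulation may place the weights differently):

```python
import numpy as np

def weighted_mahalanobis(p, q, cov, W=None):
    """Generalized distance between two spatio-temporal points p and q
    (e.g. stacked as [x_coord, y_coord, t]).

    cov : covariance matrix estimated from the trajectory data; its inverse
          removes correlation and scaling effects (the Mahalanobis part).
    W   : optional weight matrix controlling how strongly each spatial or
          temporal dimension contributes; identity when omitted.
    """
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    if W is None:
        W = np.eye(d.size)
    M = W @ np.linalg.inv(cov) @ W.T        # weighted inverse-covariance metric
    return float(np.sqrt(d @ M @ d))
```

    For this to define a valid metric, cov must be positive definite and W of full rank; setting W to the identity recovers the plain Mahalanobis distance.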

    PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries

    Range aggregate queries find frequent application in data analytics. In some use cases, approximate results are preferred over accurate results if they can be computed rapidly and satisfy approximation guarantees. Inspired by a recent indexing approach, we provide means of representing a discrete point data set by continuous functions that can then serve as compact index structures. More specifically, we develop a polynomial-based indexing approach, called PolyFit, for processing approximate range aggregate queries. PolyFit is capable of supporting multiple types of range aggregate queries, including COUNT, SUM, MIN and MAX aggregates, with guaranteed absolute and relative error bounds. Experimental results show that PolyFit is faster, more accurate, and more compact than existing learned index structures.
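
    The flavor of the approach can be sketched by fitting a single polynomial to the cumulative count function F(x) and answering COUNT(a, b] as F(b) - F(a); the actual PolyFit index fits piecewise polynomials with guaranteed error bounds, so the snippet below is illustrative only:

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=100_000))
cum = np.arange(1, data.size + 1)        # exact cumulative counts at the sorted keys
coeffs = P.polyfit(data, cum, deg=7)     # one global fit; PolyFit uses piecewise fits

def approx_count(a, b):
    """Approximate COUNT of keys in the interval (a, b] from the fitted polynomial."""
    return max(0.0, P.polyval(b, coeffs) - P.polyval(a, coeffs))

# Compare against the exact answer, which requires touching the data
exact = np.searchsorted(data, 1.0, side="right") - np.searchsorted(data, -1.0, side="right")
print(round(approx_count(-1.0, 1.0)), exact)
```

    The query is answered from a handful of polynomial coefficients rather than from the data itself, which is what makes the index compact and fast.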

    The EDAM Project: Mining Atmospheric Aerosol Datasets

    Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years. EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data-collection capabilities, and severely limits our ability to understand the data and act upon it in a timely manner. In this paper, we present an overview of the EDAM project. The goal of the project, which is in its early stages, is to develop novel data mining algorithms and approaches to managing and monitoring multiple complex data streams. An important objective is data quality assurance, and real-time data mining offers great potential. The approach that we take should also provide good techniques to deal with gas-phase and semi-volatile data. While atmospheric aerosol analysis is an important and challenging domain that motivates us with real problems and serves as a concrete test of our results, our objective is to develop techniques that have broader applicability, and to explore some fundamental challenges in data mining that are not specific to any given application domain.

    A data-driven framework for remaining useful life estimation

    Remaining useful life (RUL) estimation is one of the most common tasks in the field of prognostics and structural health management. The aim of this research is to estimate the remaining useful life of an unspecified complex system using data-driven approaches. The approaches are suitable for problems in which a data library of complete runs of a system is available. Given an incomplete run of the system, the RUL can be predicted using these approaches. Three main RUL prediction algorithms, covering centralized data processing, decentralized data processing, and an approach in between, are introduced and evaluated on the data of the PHM'08 Challenge Problem. The methods also make use of other data processing techniques, including wavelet denoising and similarity search. Experimental results show that all of the approaches are effective in performing RUL prediction.
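
    A simple similarity-search baseline in the spirit of these approaches (illustrative only; the paper's three algorithms and its wavelet denoising step are not reproduced here) slides the observed partial run over each complete run in the library and averages the remaining life at the best match:

```python
import numpy as np

def similarity_rul(partial_run, library):
    """Estimate remaining useful life of a unit from its partial degradation
    signal by matching it against a library of complete run-to-failure signals.

    partial_run : 1-D array observed so far for the target unit.
    library     : iterable of 1-D arrays, each a complete run until failure.
    """
    m = len(partial_run)
    estimates = []
    for run in library:
        if len(run) <= m:
            continue
        # best-matching alignment of the partial run inside the complete run
        dists = [np.linalg.norm(run[i:i + m] - partial_run)
                 for i in range(len(run) - m + 1)]
        best = int(np.argmin(dists))
        estimates.append(len(run) - (best + m))   # life remaining after the matched window
    return float(np.mean(estimates)) if estimates else None
```

    In practice the signals would be denoised (e.g. with wavelets) before the search, and the matched runs would typically be weighted by their distance rather than averaged uniformly.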