13 research outputs found

    Development and Applications of Similarity Measures for Spatial-Temporal Event and Setting Sequences

    Similarity or distance measures between data objects are applied frequently in many fields, such as geography, environmental science, biology, economics, computer science, linguistics, logic, business analytics, and statistics. One area where similarity measures are particularly important is the analysis of spatiotemporal event sequences and their associated environs or settings. This dissertation develops a framework for modeling, representing, and constructing new similarity measures for sequences of spatiotemporal events and their corresponding settings, which can be applied to different event data types and used in different areas of data science. The first core part of this dissertation presents a matrix-based spatiotemporal event sequence representation that unifies punctual and interval-based representations of events. This framework supports different event data types and provides support for data mining, sequence classification, and clustering. The similarity measure is based on a modified Jaccard index with temporal order constraints and accommodates different event data types. The approach is demonstrated through simulated data examples, and the performance of the similarity measures is evaluated with a k-nearest neighbor (k-NN) classification test on synthetic datasets. These similarity measures are incorporated into a clustering method and prove useful in a case study of event sequences extracted from space-time series of a water quality monitoring system. This dissertation further proposes a new similarity measure for event setting sequences, which involve the space and time in which events occur. While similarity measures for spatiotemporal event sequences have been studied, the settings and setting sequences have not yet been considered.
While modeling event setting sequences, spatial and temporal scales are considered to define the bounds of the setting and incorporate dynamic variables along with static variables. Using a matrix-based representation and an extended Jaccard index, new similarity measures are developed to allow for the use of all variable data types. With these similarity measures coupled with other multivariate statistical analysis approaches, results from a case study involving setting sequences and pollution event sequences associated with the same monitoring stations support the hypothesis that more similar spatial-temporal settings or setting sequences may generate more similar events or event sequences. To test the scalability of the STES similarity measure on a larger dataset and its application in a different field, this dissertation compares and contrasts the prospective space-time scan statistic with the STES similarity approach for identifying COVID-19 hotspots. The COVID-19 pandemic has highlighted the importance of detecting hotspots or clusters of COVID-19 to provide decision makers at various levels with better information for managing the distribution of human and technical resources as the outbreak in the USA continues to grow. The prospective space-time scan statistic has been used to help identify emerging disease clusters, yet results from this approach can be limited by the spatial constraints of the scanning window. The STES-based approach adapted for this pandemic context computes the similarity of evolving normalized COVID-19 daily cases by county and clusters these to identify counties with similarly evolving COVID-19 case histories. This dissertation analyzes the spread of COVID-19 within the continental US through four periods beginning in late January 2020, using the COVID-19 datasets maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE).
The results of the two approaches complement each other and, taken together, can aid in tracking the progression of the pandemic. Overall, the dissertation highlights the importance of developing similarity measures for analyzing spatiotemporal event sequences and associated settings, which can be applied to different event data types and used for data mining, sequence classification, and clustering.
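
As a point of reference for the abstract above, here is a minimal sketch of the standard (unmodified) Jaccard index on two sets of event labels. It is not the dissertation's measure: the modified index described above also enforces temporal order constraints, which this sketch ignores. The function name `jaccard` and the event labels are hypothetical, chosen only for illustration.

```python
def jaccard(a, b):
    """Standard Jaccard index: |A intersect B| / |A union B| on the distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed here: two empty sequences count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical event labels for two monitoring stations (not from the source).
seq_a = ["flood", "algal_bloom", "low_oxygen"]
seq_b = ["algal_bloom", "low_oxygen", "fish_kill"]

similarity = jaccard(seq_a, seq_b)  # 2 shared labels out of 4 distinct labels
print(similarity)
```

Because the sequences are reduced to sets, reordering the events does not change the score, which is exactly the limitation that an order-aware variant would need to address.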

    Multivariate Time Series Similarity Searching

    Multivariate time series (MTS) datasets are very common in various financial, multimedia, and hydrological fields. In this paper, a dimension-combination method is proposed to search for similar sequences in MTS. First, the similarity of each single-dimension series is calculated; then the overall similarity of the MTS is obtained by combining the single-dimension similarities using a weighted BORDA voting method. The dimension-combination method can reuse existing similarity searching methods. Several experiments, which used classification accuracy as a measure, were performed on six datasets from the UCI KDD Archive to validate the method. The results show that, in some respects, the approach has an advantage on MTS datasets over traditional similarity measures such as Euclidean distance (ED), dynamic time warping (DTW), point distribution (PD), the PCA similarity factor (SPCA), and the extended Frobenius norm (Eros). Our experiments also demonstrate that no measure fits all datasets, and that the proposed measure is one option for similarity searches.
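
The dimension-combination idea above can be illustrated with a minimal sketch of a weighted Borda count. This is an assumption-laden illustration, not the paper's implementation: the function name `weighted_borda`, the scoring rule (the closest of n candidates earns n - 1 points per dimension), and the distances and weights are all hypothetical.

```python
def weighted_borda(distances_per_dim, weights):
    """Combine per-dimension rankings of candidates into one score per candidate.

    distances_per_dim[d][i] is the distance between the query and candidate i
    measured on dimension d alone (any single-dimension measure could be used).
    """
    n = len(distances_per_dim[0])
    scores = [0.0] * n
    for dists, w in zip(distances_per_dim, weights):
        # Rank candidates on this dimension, closest first.
        order = sorted(range(n), key=lambda i: dists[i])
        for rank, i in enumerate(order):
            # Borda points: n - 1 for the closest candidate, 0 for the farthest.
            scores[i] += w * (n - 1 - rank)
    return scores


# Hypothetical distances for 3 candidates on 2 dimensions (not from the source).
distances = [
    [0.2, 0.9, 0.5],  # dimension 0
    [0.4, 0.1, 0.8],  # dimension 1
]
weights = [0.75, 0.25]  # assumed per-dimension weights

scores = weighted_borda(distances, weights)
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

Since only the rank order within each dimension is used, dimensions with very different distance scales can be combined without normalizing them first.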

    Data Mining and Analysis on Multiple Time Series Object Data

    A huge amount of data is available in our society, and the need to turn such data into useful information and knowledge is urgent. Data mining is an important field addressing that need, and significant progress has been achieved in the last decade. In several important application areas, data arises in the format of Multiple Time Series Object (MTSO) data, where each data object is an array of time series over a large set of features and each has an associated class or state. Very little research has been conducted on this kind of data. Examples include computational toxicology, where each data object consists of a set of time series over thousands of genes, and operational stress management, where each data object consists of a set of time series over different measuring points on the human body. The purpose of this dissertation is to conduct a systematic data mining study of microarray time series data, with applications to computational toxicology. More specifically, we aim to consider several issues: feature selection algorithms for different classification cases, gene marker or feature set selection for toxic chemical exposure detection, toxic chemical exposure time prediction, wildness concept development and applications, and organizing a diversified and parsimonious committee. We will formalize and analyze these research problems, design algorithms to address them, and perform experimental evaluations of the proposed algorithms. All these studies are based on a microarray time series data set provided by Dr. McDougal.

    Answering topband queries in time series data

    Master's thesis (Master of Science)

    Feature-based Time Series Analytics

    Time series analytics is a fundamental prerequisite for decision-making as well as automation and occurs in several applications such as energy load control, weather research, and consumer behavior analysis. It encompasses time series engineering, i.e., the representation of time series exhibiting important characteristics, and data mining, i.e., the application of the representation to a specific task. Due to the exhaustive data gathering that results from the "Industry 4.0" vision and its shift towards automation and digitalization, time series analytics is undergoing a revolution. Big datasets with very long time series are gathered, which is challenging for engineering techniques. Traditionally, one focus has been on raw-data-based or shape-based engineering. These techniques assess the time series' similarity in shape, which is only suitable for short time series. Another focus has been on model-based engineering. It assesses the time series' similarity in structure, which is suitable for long time series but requires larger models or time-consuming modeling. Feature-based engineering tackles these challenges by efficiently representing time series and comparing their similarity in structure. However, current feature-based techniques are unsatisfactory as they are designed for specific data-mining tasks. In this work, we introduce a novel feature-based engineering technique. It efficiently provides a short representation of time series, focusing on their structural similarity. Based on a design rationale, we derive important time series characteristics such as the long-term and cyclically repeated characteristics as well as distribution and correlation characteristics. Moreover, we define a feature-based distance measure for their comparison. Both the representation technique and the distance measure provide desirable properties regarding storage and runtime.
Subsequently, we introduce techniques based on our feature-based engineering and apply them to important data-mining tasks such as time series generation, time series matching, time series classification, and time series clustering. First, our feature-based generation technique outperforms state-of-the-art techniques regarding the accuracy of evolved datasets. Second, with our features, a matching method retrieves a match for a time series query much faster than with current representations. Third, our features provide discriminative characteristics to classify datasets as accurately as state-of-the-art techniques, but orders of magnitude faster. Finally, our features recommend an appropriate clustering of time series, which is crucial for subsequent data-mining tasks. All these techniques are assessed on datasets from the energy, weather, and economic domains and thus demonstrate their applicability to real-world use cases. The findings demonstrate the versatility of our feature-based engineering and suggest several courses of action for designing and improving analytical systems for the paradigm shift of Industry 4.0.

    Interconnected Services for Time-Series Data Management in Smart Manufacturing Scenarios

    xvii, 218 p. The rise of Smart Manufacturing, together with the strategic initiatives carried out worldwide, has promoted its adoption among manufacturers, who are increasingly interested in boosting data-driven applications for different purposes, such as product quality control and predictive maintenance of equipment. However, the adoption of these approaches faces diverse technological challenges with regard to the data-related technologies supporting the manufacturing data life-cycle. The main contributions of this dissertation focus on two specific challenges related to the early stages of the manufacturing data life-cycle: optimized storage of the massive amounts of data captured during production processes and efficient pre-processing of that data. The first contribution is the design and development of a system that facilitates the pre-processing of the captured time-series data through an automated approach that helps select the most adequate pre-processing techniques to apply to each data type. The second contribution is the design and development of a three-level hierarchical architecture for time-series data storage in cloud environments that helps to manage and reduce the required data storage resources (and consequently the associated costs). Moreover, with regard to the later stages, a third contribution is proposed that leverages advanced data analytics to build an alarm prediction system. This system enables predictive maintenance of equipment by anticipating the activation of different types of alarms that can occur in a real Smart Manufacturing scenario.

    Subseries Join and Compression of Time Series Data Based on Non-uniform Segmentation

    A time series is composed of a sequence of data items that are measured at uniform intervals. Many application areas generate or manipulate time series, including finance, medicine, digital audio, and motion capture. Efficiently searching a large time series database is still a challenging problem, especially when partial or subseries matches are needed. This thesis proposes a new definition of subseries join, a symmetric generalization of subseries matching, which finds similar subseries in two or more time series datasets. A solution is proposed to compute the subseries join based on a hierarchical feature representation. This hierarchical feature representation is generated by an anisotropic diffusion scale-space analysis and a non-uniform segmentation method. Each segment is represented by a minimal polynomial envelope in a reduced-dimensionality space. Based on the hierarchical feature representation, all features in a dataset are indexed in an R-tree, and candidate matching features of two datasets are found by an R-tree join operation. Given candidate matching features, a dynamic programming algorithm is developed to compute the final subseries join. To improve storage efficiency, a hierarchical compression scheme is proposed to compress features. The minimal polynomial envelope representation is transformed to a Bézier spline envelope representation. The control points of each Bézier spline are then hierarchically differenced, and arithmetic coding is used to compress these differences. To empirically evaluate their effectiveness, the proposed subseries join and compression techniques are tested on various publicly available datasets. A large motion capture database is also used to verify the techniques in a real-world application.
The experiments show that the proposed subseries join technique tolerates noise and local scaling better than previous work, and that the proposed compression technique achieves about 85% higher compression rates than previous work at the same distortion error.