
    NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA

    Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rule mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to mine such data. In spatial data mining, the spatial co-location rule problem differs from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. We therefore propose an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships, especially negative relationships, in the patterns. These relationships can be obtained only from the maximal clique patterns, which have not previously been used for this purpose. Our approach is applied to a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving along paths close to each other for a predefined time. One approach to processing a "flock query" is to map ST data into high-dimensional space and reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures deteriorates rapidly in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection, and we present experimental results showing that the curse of dimensionality can be managed in an ST setting by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series that always yields the optimal result. This is in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the similarity and/or correlation between the time series: the more similar the time series, the less space is required to compute the DTW distance between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. By applying the SparseDTW algorithm, we discover an interesting pattern, "pairs trading", in a large stock-market dataset of daily index prices from the Australian Stock Exchange (ASX) from 1980 to 2002.
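
    The SparseDTW algorithm itself is not reproduced in this listing. As a reference point only, the sketch below shows a minimal full-matrix dynamic time warping distance in Python, i.e. the quadratic-space baseline that SparseDTW improves on by materialising only cells in similar or correlated regions of the two series; function and parameter names are illustrative, not taken from the thesis.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal full-matrix DTW: the O(n*m)-space baseline that
    SparseDTW improves on by filling only 'promising' cells."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # each cell extends the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

if __name__ == "__main__":
    a = np.sin(np.linspace(0, 2 * np.pi, 50))
    b = np.sin(np.linspace(0, 2 * np.pi, 60) + 0.3)  # phase-shifted, resampled copy
    print(dtw_distance(a, b))
```

    Replacing the dense matrix with a sparse store of only the cells worth visiting is, at a high level, how SparseDTW saves space while still returning the optimal distance.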

    Advances in Processing, Mining, and Learning Complex Data: From Foundations to Real-World Applications

    Processing, mining, and learning complex data refers to an advanced study area of data mining and knowledge discovery concerning the development and analysis of approaches for discovering patterns and learning models from data with a complex structure (e.g., multirelational data, XML data, text data, image data, time series, sequences, graphs, streaming data, and trees) [1–5]. These kinds of data are commonly encountered in many social, economic, scientific, and engineering applications. Complex data pose new challenges for current research in data mining and knowledge discovery, as they require new methods for processing, mining, and learning. Traditional data analysis methods often require the data to be represented as vectors [6]. However, many data objects in real-world applications, such as chemical compounds in biopharmacy, brain regions in brain health data, users in business networks, and time-series information in medical data, contain rich structure information (e.g., relationships between data and temporal structures). Such a simple feature-vector representation inherently loses the structure information of the objects. In reality, objects may have complicated characteristics, depending on how the objects are assessed and characterized. Meanwhile, the data may come from heterogeneous domains [7], such as traditional tabular-based data, sequential patterns, graphs, time-series information, and semistructured data. Novel data analytics methods are needed to discover meaningful knowledge in advanced applications from data objects with complex characteristics. This special issue contributes to the fundamental research in processing, mining, and learning complex data, focusing on the analysis of complex data sources.

    Time series segmentation based on stationarity analysis to improve new samples prediction

    A wide range of applications based on sequential data, known as time series, has become increasingly popular in recent years, mainly those based on the Internet of Things (IoT). Several different machine learning algorithms exploit the patterns extracted from sequential data to support multiple tasks. However, such data can suffer from unreliable readings that lead to low-accuracy models due to the low-quality training sets available. Detecting the change points between highly representative segments is an important ally in finding and treating biased subsequences. By constructing a framework based on the Augmented Dickey-Fuller (ADF) test for data stationarity, two proposals to automatically segment subsequences in a time series were developed. The former, called Change Detector segmentation, relies on change detection methods from data stream mining. The latter, called ADF-based segmentation, is built on a new change detector derived from the ADF test alone. Experiments over real-life IoT databases and benchmarks showed the improvement provided by our proposals for prediction tasks with traditional Autoregressive Integrated Moving Average (ARIMA) and deep learning (Long Short-Term Memory and Temporal Convolutional Network) methods. Results obtained by the Long Short-Term Memory predictive model reduced the relative prediction error from 1 to 0.67 compared to time series without segmentation.
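
    The exact Change Detector and ADF-based segmentation procedures are not spelled out in this abstract. The snippet below is only a rough illustration of the underlying idea, using the Augmented Dickey-Fuller test from statsmodels to place a candidate boundary whenever a growing segment stops looking stationary; the window size and p-value threshold are hypothetical parameters, not values from the paper.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def adf_segment(series, window=100, p_threshold=0.05):
    """Rough sketch: grow a segment until the ADF test can no longer
    reject non-stationarity, then cut and start a new segment."""
    boundaries = [0]
    start, end = 0, window
    while end <= len(series):
        pvalue = adfuller(series[start:end])[1]
        if pvalue > p_threshold:      # cannot reject a unit root: segment looks non-stationary
            boundaries.append(end)    # mark a candidate change point
            start = end               # begin a new segment
        end += window
    if boundaries[-1] != len(series):
        boundaries.append(len(series))
    return boundaries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stationary noise followed by a random walk (non-stationary)
    data = np.concatenate([rng.normal(size=300), np.cumsum(rng.normal(size=300))])
    print(adf_segment(data, window=100))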

    Pattern mining under different conditions

    New requirements and demands on pattern mining arise in modern applications and cannot be fulfilled using conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but infrequent patterns. However, existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility. A series of datasets is generated at once, waiting for clustering, and can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameter tuning is necessary for each dataset. Many scientific datasets are extremely noisy: they contain considerably more noise than in-cluster data points, while most existing clustering algorithms can only handle noise up to a moderate level. Temporal pattern mining is also important in scientific research. Existing temporal pattern mining algorithms only consider point-based events, yet most activities in the real world are interval-based, with a starting and an ending timestamp. This thesis develops novel pattern mining algorithms for various data mining tasks under different conditions. The first part of this thesis investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focuses on itemsets that occur less frequently. The algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such itemsets efficiently. NIIMiner utilizes the negative itemset tree to extract all patterns that occur less often than a given support threshold in a top-down, depth-first manner. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve better performance in mining less frequent patterns. LSCMiner investigates the problem of mining less frequent closed patterns. The second part of this thesis studies the problem of interval-based temporal pattern mining in the stream environment. Interval-based temporal patterns are sequential patterns in which each event is associated with a starting and an ending timestamp. The ability to handle interval-based events and stream data is lacking in existing approaches, so a novel interval-based temporal pattern mining algorithm for stream data is described in this part. The last part of this thesis studies new problems in clustering on numeric datasets. The first problem tackled in this part is shape alternation adaptability in clustering. In applications such as scientific data analysis, scientists need to deal with a series of datasets generated from one experiment, in which cluster sizes and shapes differ. A kNN density-based clustering algorithm, kadaClus, is proposed to provide shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem studied in this part is clustering in an extremely noisy dataset. Many real-world datasets contain considerably more noise than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters of arbitrary shape in extremely noisy datasets. Both clustering algorithms are kNN-based and require only one parameter, k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed, and intensive experiments on synthetic and real-world datasets are conducted to show the benefits of the proposed algorithms over conventional approaches.
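
    NIIMiner, RaCloMiner, and LSCMiner are only named in this abstract, so the sketch below is a deliberately naive baseline for the "less frequent itemset" task: it enumerates small itemsets and keeps those whose support is positive but below a user-given threshold. It does not use the negative itemset tree or the hybrid bottom-up/top-down strategies described above, and all names and parameters are illustrative.

```python
from itertools import combinations
from collections import Counter

def less_frequent_itemsets(transactions, max_support, max_len=3):
    """Brute-force baseline: return every itemset (up to max_len items)
    that occurs at least once but fewer than max_support times."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, min(max_len, len(items)) + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if 0 < c < max_support}

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "d"]]
    for itemset, support in sorted(less_frequent_itemsets(db, max_support=2).items()):
        print(itemset, support)
```

    The brute-force enumeration is exponential in the transaction width, which is exactly why dedicated structures such as the negative itemset tree are needed at scale.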

    Feature-based time-series analysis

    This work presents an introduction to feature-based time-series analysis. The time series as a data type is first described, along with an overview of the interdisciplinary time-series analysis literature. I then summarize the range of feature-based representations for time series that have been developed to aid interpretable insights into time-series structure. Particular emphasis is given to emerging research that facilitates wide comparison of feature-based representations and allows us to understand the properties of a time-series dataset that make it suited to a particular feature-based representation or analysis algorithm. The future of time-series analysis is likely to embrace approaches that exploit machine learning methods to partially automate human learning and thereby aid understanding of the complex dynamical patterns in the time series we measure from the world.
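
    As a toy illustration of the feature-based idea, the sketch below maps a univariate series to a small fixed-length vector of summary statistics. The specific features are hand-picked for illustration and are not the representations recommended in this work; comprehensive libraries such as hctsa or catch22 implement far larger feature sets.

```python
import numpy as np

def basic_features(x):
    """Map a univariate time series to a small fixed-length feature vector.
    The chosen statistics are illustrative only."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)
    centered = x - x.mean()
    # lag-1 autocorrelation as a simple measure of serial dependence
    acf1 = np.corrcoef(x[:-1], x[1:])[0, 1] if len(x) > 2 else 0.0
    return {
        "mean": x.mean(),
        "std": x.std(),
        "acf_lag1": acf1,
        "mean_abs_change": np.abs(diffs).mean(),
        "skewness_proxy": (centered ** 3).mean() / (x.std() ** 3 + 1e-12),
    }

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 200)
    series = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
    print(basic_features(series))
```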