
    Consistent algorithms for clustering time series

    The problem of clustering is considered for the case where every point is a time series. The time series are either given in one batch (offline setting), or they are allowed to grow with time and new time series can be added along the way (online setting). We propose a natural notion of consistency for this problem, and show that there are simple, computationally efficient algorithms that are asymptotically consistent under extremely weak assumptions on the distributions that generate the data. The notion of consistency is as follows. A clustering algorithm is called consistent if it places two time series into the same cluster if and only if the distribution that generates them is the same. In the considered framework the time series are allowed to be highly dependent, and the dependence can have arbitrary form. If the number of clusters is known, the only assumption we make is that the (marginal) distribution of each time series is stationary ergodic. No parametric, memory or mixing assumptions are made. When the number of clusters is unknown, stronger assumptions are provably necessary, but it is still possible to devise nonparametric algorithms that are consistent under very general conditions. The theoretical findings of this work are illustrated with experiments on both synthetic and real data.
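
The consistency notion in the abstract above can be read as a check on pairs: two series share a cluster exactly when they share a generating distribution. The following is a minimal sketch of that check for a finite labeled sample. The function name `same_partition` and the use of integer labels for distributions are assumptions made for illustration; they do not come from the paper.

```python
from itertools import combinations


def same_partition(predicted, truth):
    """Return True if both labelings group the items identically.

    `predicted[i]` is the cluster assigned to series i and `truth[i]`
    identifies the distribution that generated it (hypothetical labels).
    Label names are ignored; only the grouping matters.
    """
    if len(predicted) != len(truth):
        raise ValueError("labelings must have the same length")
    for i, j in combinations(range(len(truth)), 2):
        together_pred = predicted[i] == predicted[j]
        together_true = truth[i] == truth[j]
        # The "if and only if" condition: both must agree for every pair.
        if together_pred != together_true:
            return False
    return True
```

For example, `same_partition([0, 0, 1], [5, 5, 7])` holds because the grouping matches even though the label names differ, while `same_partition([0, 1, 1], [5, 5, 7])` does not.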

    Independence clustering (without a matrix)

    The independence clustering problem is considered in the following formulation: given a set $S$ of random variables, it is required to find the finest partitioning $\{U_1,\dots,U_k\}$ of $S$ into clusters such that the clusters $U_1,\dots,U_k$ are mutually independent. Since mutual independence is the target, pairwise similarity measurements are of no use, and thus traditional clustering algorithms are inapplicable. The distribution of the random variables in $S$ is, in general, unknown, but a sample is available. Thus, the problem is cast in terms of time series. Two forms of sampling are considered: i.i.d. and stationary time series, with the main emphasis being on the latter, more general, case. A consistent, computationally tractable algorithm for each of the settings is proposed, and a number of open directions for further research are outlined.

    ASYMPTOTIC STATISTICAL ANALYSIS OF STATIONARY ERGODIC TIME SERIES

    It is shown how to construct asymptotically consistent efficient algorithms for various statistical problems concerning stationary ergodic time series. The considered problems include clustering, hypothesis testing, change-point estimation and others. The presented approach is based on empirical estimates of the distributional distance. Some open problems are also discussed.
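
The abstract above refers to empirical estimates of the distributional distance without spelling them out. The following is a minimal sketch of one simplified form, assuming sequences over a finite alphabet: it compares the empirical frequencies of all words of length 1 to `max_len` and weights length-m words by 2^-m. The weights, the truncation at `max_len`, and the function names are assumptions made for illustration, not the paper's definition.

```python
from collections import Counter


def word_frequencies(seq, m):
    """Empirical frequency of every length-m word occurring in seq."""
    n = len(seq) - m + 1
    if n <= 0:
        return {}
    counts = Counter(tuple(seq[i:i + m]) for i in range(n))
    return {word: c / n for word, c in counts.items()}


def empirical_distance(x, y, max_len=3):
    """Weighted sum of frequency differences over words up to max_len.

    Assumed simplification: finite alphabet, weight 2**-m for words of
    length m, and a hard cutoff at max_len.
    """
    total = 0.0
    for m in range(1, max_len + 1):
        fx = word_frequencies(x, m)
        fy = word_frequencies(y, m)
        words = set(fx) | set(fy)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in words)
        total += 2.0 ** -m * diff
    return total
```

Under these assumptions the distance of a sequence to itself is zero, and two constant sequences over different symbols are as far apart as the truncated weights allow. A distance of this kind could then be passed to any distance-based clustering routine.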

    Clustering piecewise stationary processes

    The problem of time-series clustering is considered in the case where each data-point is a sample generated by a piecewise stationary process. While stationary processes comprise one of the most general classes of processes in nonparametric statistics, and in particular, allow for arbitrary long-range dependencies, their key assumption of stationarity remains restrictive for some applications. We address this shortcoming by considering piecewise stationary processes, studied here for the first time in the context of clustering. It turns out that this problem allows for a rather natural definition of consistency of clustering algorithms. Efficient algorithms are proposed which are shown to be asymptotically consistent without any additional assumptions beyond piecewise stationarity. The theoretical results are complemented with experimental evaluations

    An assessment of the application of cluster analysis techniques to the Johannesburg Stock Exchange

    Includes bibliographical references. Cluster analysis is becoming an increasingly popular method in modern finance because of its ability to summarise large amounts of data and so help individual and institutional investors to make timeous and informed investment decisions. This is no less true for investors in smaller, emerging markets, such as the Johannesburg Stock Exchange, than it is for those in the larger global markets. This study examines the application of two clustering techniques to the Johannesburg Stock Exchange. First, the application of Salvador and Chan's (2003) L method stopping rule to a hierarchical clustering of time series return data was analysed as a method for determining the number of latent groups in the data set. Using Ward's method and the Euclidean distance function, this method appears to be able to detect the correct number of clusters on the JSE. Second, the ability of three different clustering algorithms to generate consistent clusters and cluster members over time on the Johannesburg Stock Exchange was analysed. The variation of information was used to measure the consistency of cluster members through time. Hierarchical clustering using Ward's method and the Euclidean distance measure proved to produce the most consistent results, while the K-means algorithms generated the least consistent cluster members.
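
The abstract above uses the variation of information to compare cluster memberships across time. The following is a minimal sketch of the standard definition, VI(A, B) = H(A|B) + H(B|A), computed from two label lists over the same items. The function name and the use of natural logarithms are assumptions made for illustration; the study's own implementation is not shown in the abstract.

```python
from collections import Counter
from math import log


def variation_of_information(a, b):
    """Variation of information between two clusterings of the same items.

    a[i] and b[i] are the cluster labels of item i under each clustering.
    Uses natural logarithms (an assumed convention); 0.0 means the two
    clusterings group the items identically.
    """
    if len(a) != len(b):
        raise ValueError("clusterings must label the same items")
    n = len(a)
    if n == 0:
        return 0.0
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    vi = 0.0
    for (label_a, label_b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = size_a[label_a] / n
        p_b = size_b[label_b] / n
        # Each term contributes to H(A|B) + H(B|A).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

Lower values indicate more stable cluster membership between two periods. The measure ignores label names, so relabelled but otherwise identical clusterings score zero.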

    Spatial Clustering Algorithm for Time Series Rainfall Data Using X-Means Data Splitting

    The aim of this study is to present a new spatial clustering process for time series data. It has become an important and demanding application when the data involves chronological long time series and huge datasets. A great challenge in clustering is to achieve an optimal solution in searching similarity along the series. Furthermore, it also involves a very large-scale data analysis. Unfortunately, the existing time series clustering algorithms have become impractical since they do not scale properly to longer time series. The performance of a clustering algorithm gets even worse if it relies on actual data, and many clustering algorithms struggle with high-dimensional data. In the case of spatial time series, the problem can be solved by unsupervised approaches rather than supervised classification, with appropriate preprocessing techniques to transform the actual data. The unsupervised solution using time series clustering algorithms is capable of extracting valuable information and identifying structure in complex and massive datasets such as spatial time series. Therefore, a clustering algorithm that introduces data transformation using X-means data splitting is proposed to investigate the spatial homogeneity of time series rainfall data. Hierarchical clustering was used to demonstrate the similarity once the data was divided into training and testing sets. The proposed algorithm is compared with five types of data transformation techniques, namely mean and median in monthly data and the rest in daily data, such as binary, cumulative and actual values. Results indicate that data transformation using X-means data splitting in hierarchical clustering outperformed the other transformation techniques and was more consistent between training and testing datasets based on similarity measures.

    Predicting wine quality and/or taste through the use of a latent ODE-RNN Neural Net

    It is common for recommendation systems to use clustering techniques for finding similar products for the downstream user. These models do not always incorporate time as a variable when recommending an item. If our recommendation models do not include time, it may be difficult to surface the correct product to downstream users, given that seasonality tends to affect user behaviors. Time is not frequently used in recommendation algorithms due to the difficulty of obtaining continuous or consistent time series data of user interactions. Recently, Ordinary Differential Equation Recurrent Neural Networks (ODE-RNNs) have been flagged as a possible solution for predicting inconsistent time series data. This algorithm can bypass the need for consistent time data via its Recurrent Neural Network (RNN) encoder, which transforms the data with inconsistent time steps into hidden latent states that capture its temporal element. These encoded states are fed into the Ordinary Differential Equation (ODE) block of the computational graph to solve the initial value problem of the hidden latent states. This solution results in a function that describes how the states change in continuous time. This new development is a possible solution for creating specific recommendations accounting for how tastes change over time. To determine the feasibility of the above method for recommendations, a high-dimensional time series dataset is reduced into a two-dimensional dataset with time as a feature. This dataset is used to train an ODE-RNN model to predict how it changes over time. Reviews from the Wine Enthusiast are used to create the original high-dimensional time series dataset. The wine reviewers will represent the users to predict, and the high scoring wines will be used to predict the taste trends of the reviewer.

    Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments

    Background: Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important in discovering functionally related and co-regulated genes. Early clustering algorithms do not take advantage of the ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that can incorporate the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments that contain a large number of genes. Its computational burden also makes it difficult to assess the clustering reliability, which is a very important measure when clustering noisy microarray data. Results: We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also takes account of the ordering in time-course microarray experiments by embedding the order-restricted inference into a model selection framework. Genes are assigned to the profile which they best match, as determined by a newly proposed information criterion for order-restricted inference. In addition, we also developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method and is hundreds of times faster. Under some scenarios, its accuracy is also better than some other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3]. Conclusion: Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.

    Reducing statistical time-series problems to binary classification

    We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Comment: In proceedings of NIPS 2012, pp. 2069-207