113,889 research outputs found

Consistent algorithms for clustering time series

Author: Khaleghi Azadeh
Mary Jérémie
Preux Philippe
Ryabko Daniil
Publication date: 01/01/2016

The problem of clustering is considered for the case where every point is a time series. The time series are either given in one batch (offline setting), or they are allowed to grow with time and new time series can be added along the way (online setting). We propose a natural notion of consistency for this problem, and show that there are simple, computationally efficient algorithms that are asymptotically consistent under extremely weak assumptions on the distributions that generate the data. The notion of consistency is as follows. A clustering algorithm is called consistent if it places two time series into the same cluster if and only if the distribution that generates them is the same. In the considered framework the time series are allowed to be highly dependent, and the dependence can have arbitrary form. If the number of clusters is known, the only assumption we make is that the (marginal) distribution of each time series is stationary ergodic. No parametric, memory or mixing assumptions are made. When the number of clusters is unknown, stronger assumptions are provably necessary, but it is still possible to devise nonparametric algorithms that are consistent under very general conditions. The theoretical findings of this work are illustrated with experiments on both synthetic and real data
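The consistency notion above can be illustrated with a small sketch. This is not the authors' published algorithm; it assumes series over a small discrete alphabet, a truncated empirical distance built from word frequencies with weights 2^-length (chosen here for illustration only), and a known number of clusters k. All function names are hypothetical.

```python
from collections import Counter


def word_frequencies(series, length):
    """Empirical frequency of each word (window) of the given length."""
    n_windows = len(series) - length + 1
    if n_windows <= 0:
        return {}
    counts = Counter(tuple(series[i:i + length]) for i in range(n_windows))
    return {word: c / n_windows for word, c in counts.items()}


def empirical_distance(x, y, max_length=3):
    """Weighted sum of word-frequency differences (assumed weights 2^-length)."""
    total = 0.0
    for length in range(1, max_length + 1):
        fx = word_frequencies(x, length)
        fy = word_frequencies(y, length)
        words = set(fx) | set(fy)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in words)
        total += 2.0 ** (-length) * diff
    return total


def distance_to_centres(series_list, i, centres, max_length):
    """Distance from series i to its nearest already-chosen centre."""
    return min(
        empirical_distance(series_list[i], series_list[c], max_length)
        for c in centres
    )


def cluster(series_list, k, max_length=3):
    """Farthest-point choice of k centres, then nearest-centre assignment."""
    centres = [0]  # start from the first series
    while len(centres) < k:
        candidates = [i for i in range(len(series_list)) if i not in centres]
        nxt = max(
            candidates,
            key=lambda i: distance_to_centres(series_list, i, centres, max_length),
        )
        centres.append(nxt)
    labels = []
    for s in series_list:
        dists = [empirical_distance(s, series_list[c], max_length) for c in centres]
        labels.append(dists.index(min(dists)))
    return labels
```

For instance, calling `cluster` with k = 2 on two alternating series (`[0, 1] * 50`, `[0, 1] * 40`) and two block-patterned series (`[0, 0, 1, 1] * 25`, `[0, 0, 1, 1] * 30`) groups series by generating pattern rather than by length, which is the behaviour the consistency definition asks for.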

INRIA a CCSD electronic archive server

HAL Descartes

Lancaster E-Prints

Hal-Diderot

Independence clustering (without a matrix)

Author: Ryabko Daniil
Publication date: 20/03/2017

The independence clustering problem is considered in the following formulation: given a set $S$ of random variables, it is required to find the finest partitioning $\{U_1,\dots,U_k\}$ of $S$ into clusters such that the clusters $U_1,\dots,U_k$ are mutually independent. Since mutual independence is the target, pairwise similarity measurements are of no use, and thus traditional clustering algorithms are inapplicable. The distribution of the random variables in $S$ is, in general, unknown, but a sample is available. Thus, the problem is cast in terms of time series. Two forms of sampling are considered: i.i.d. and stationary time series, with the main emphasis being on the latter, more general, case. A consistent, computationally tractable algorithm for each of the settings is proposed, and a number of open directions for further research are outlined.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Asymptotic statistical analysis of stationary ergodic time series

Author: Ryabko Daniil
Publication venue: HAL CCSD
Publication date: 01/08/2012

It is shown how to construct asymptotically consistent efficient algorithms for various statistical problems concerning stationary ergodic time series. The considered problems include clustering, hypothesis testing, change-point estimation and others. The presented approach is based on empirical estimates of the distributional distance. Some open problems are also discussed.
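The abstract does not define the distributional distance. As a hedged reminder, one standard general form is written below; the weights $w_k$, the family of sets $B_k$ and the frequency notation $\nu$ are assumptions introduced here for illustration, not taken from the listing.

```latex
% Distributional distance between process distributions \rho_1 and \rho_2,
% for a fixed countable family of sets (B_k) and summable positive weights (w_k):
d(\rho_1, \rho_2) = \sum_{k=1}^{\infty} w_k \, \bigl| \rho_1(B_k) - \rho_2(B_k) \bigr|

% Empirical estimate from samples x and y: each unknown probability is
% replaced by the observed frequency \nu(\cdot, B_k) of the set B_k in the sample:
\hat{d}(x, y) = \sum_{k=1}^{\infty} w_k \, \bigl| \nu(x, B_k) - \nu(y, B_k) \bigr|
```

For a stationary ergodic process the frequencies $\nu(x, B_k)$ converge to $\rho(B_k)$ almost surely, which is why estimates of this kind can be consistent without mixing or parametric assumptions.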

HAL - Lille 3

INRIA a CCSD electronic archive server

Clustering piecewise stationary processes

Author: Khaleghi Azadeh
Ryabko Daniil
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 26/06/2019

The problem of time-series clustering is considered in the case where each data-point is a sample generated by a piecewise stationary process. While stationary processes comprise one of the most general classes of processes in nonparametric statistics, and in particular, allow for arbitrary long-range dependencies, their key assumption of stationarity remains restrictive for some applications. We address this shortcoming by considering piecewise stationary processes, studied here for the first time in the context of clustering. It turns out that this problem allows for a rather natural definition of consistency of clustering algorithms. Efficient algorithms are proposed which are shown to be asymptotically consistent without any additional assumptions beyond piecewise stationarity. The theoretical results are complemented with experimental evaluations

arXiv.org e-Print Archive

Crossref

Lancaster E-Prints

An assessment of the application of cluster analysis techniques to the Johannesburg Stock Exchange

Author: Tully Robyn
Publication venue: Department of Finance and Tax
Publication date: 01/01/2014

Includes bibliographical references. Cluster analysis is becoming an increasingly popular method in modern finance because of its ability to summarise large amounts of data and so help individual and institutional investors to make timeous and informed investment decisions. This is no less true for investors in smaller, emerging markets - such as the Johannesburg Stock Exchange - than it is for those in the larger global markets. This study examines the application of two clustering techniques to the Johannesburg Stock Exchange. First, the application of Salvador and Chan's (2003) L method stopping rule to a hierarchical clustering of time series return data was analysed as a method for determining the number of latent groups in the data set. Using Ward's method and the Euclidean distance function, this method appears to be able to detect the correct number of clusters on the JSE. Second, the ability of three different clustering algorithms to generate consistent clusters and cluster members over time on the Johannesburg Stock Exchange was analysed. The variation of information was used to measure the consistency of cluster members through time. Hierarchical clustering using Ward's method and the Euclidean distance measure proved to produce the most consistent results, while the K-means algorithms generated the least consistent cluster members.
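The variation of information mentioned above can be computed directly from two label assignments over the same items. The sketch below uses the standard definition VI = H(A|B) + H(B|A) with natural logarithms; the function name and the list-of-labels input format are assumptions made for illustration, not details from the study.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    size_a = Counter(labels_a)               # cluster sizes in clustering A
    size_b = Counter(labels_b)               # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise intersections
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = size_a[a] / n
        p_b = size_b[b] / n
        # Contribution to H(B|A) + H(A|B) from this pair of clusters.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

A value of zero means the two clusterings are identical up to relabelling; larger values mean cluster membership changed more between the two periods being compared.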

Cape Town University OpenUCT

Spatial Clustering Algorithm for Time Series Rainfall Data Using X-Means Data Splitting

Author: Ali Noor Rasidah
Ku Mahamud Ku Ruhana
Publication venue: Maxwell Scientific Publication Corp.
Publication date: 01/01/2017

The aim of this study is to present a new spatial clustering process for time series data. Such clustering has become an important and demanding application when the data involve long chronological time series and huge datasets. A great challenge in clustering is to achieve an optimal solution when searching for similarity along the series. Furthermore, it also involves very large-scale data analysis. Unfortunately, the existing time series clustering algorithms have become impractical because they do not scale properly to longer time series. The performance of a clustering algorithm gets even worse if it relies on actual data, and many clustering algorithms have difficulty handling high-dimensional data. In the case of spatial time series, the problem can be solved by unsupervised approaches rather than supervised classification, with appropriate preprocessing techniques to transform the actual data. The unsupervised solution using time series clustering algorithms is capable of extracting valuable information and identifying structure in complex and massive datasets such as spatial time series. Therefore, a clustering algorithm that introduces data transformation using X-means data splitting is proposed to investigate the spatial homogeneity of time series rainfall data. Hierarchical clustering was used to demonstrate the similarity once the data was divided into training and testing sets. The proposed algorithm is compared with five types of data transformation techniques, namely mean and median for monthly data, and binary, cumulative and actual values for daily data. Results indicate that data transformation using X-means data splitting in hierarchical clustering outperformed the other transformation techniques and was more consistent between training and testing datasets based on similarity measures.

UUM Repository

Predicting wine quality and/or taste through the use of a latent ODE-RNN Neural Net

Author: Beattie Alexandra
Publication date: 07/12/2019

It is common for recommendation systems to use clustering techniques for finding similar products for the downstream user. These models do not always incorporate time as a variable when recommending an item. If our recommendation models do not include time, it may be difficult to surface the correct product to downstream users, given that seasonality tends to affect user behaviors. Time is not frequently used in recommendation algorithms due to the difficulty of obtaining continuous or consistent time series data of user interactions. Recently, Ordinary Differential Equation Recurrent Neural Networks (ODE-RNNs) have been flagged as a possible solution for predicting inconsistent time series data. This algorithm can bypass the need for consistent time data via its Recurrent Neural Network (RNN) encoder, which transforms the data with inconsistent time steps into hidden latent states that capture its temporal element. These encoded states are inputted into the Ordinary Differential Equation (ODE) block of the computational graph to solve the initial value problem of the hidden latent states. This solution results in a function that describes how the states change in continuous time. This new development is a possible solution for creating specific recommendations accounting for how tastes change over time. To determine the feasibility of the above method for recommendations, a high-dimensional time series dataset is reduced into a two-dimensional dataset with time as a feature. This dataset is used to train an ODE-RNN model to predict how it changes over time. Reviews from the Wine Enthusiast are used to create the original high-dimensional time series dataset. The wine reviewers will represent the users to predict, and the high scoring wines will be used to predict the taste trends of the reviewer.

SHAREOK repository

Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments

Author: Lin Nan
Liu Tianqing
Shi Ningzhong
Zhang Baoxue
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important in discovering functionally related and co-regulated genes. Early clustering algorithms do not take advantage of the ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that can incorporate the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments that contain a large number of genes. Its computational burden also makes it difficult to assess the clustering reliability, which is a very important measure when clustering noisy microarray data.
Results: We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also takes account of the ordering in time-course microarray experiments by embedding the order-restricted inference into a model selection framework. Genes are assigned to the profile that they best match, as determined by a newly proposed information criterion for order-restricted inference. In addition, we developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method and requires hundreds of times less computational time. Under some scenarios, its accuracy is also better than that of some other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3].
Conclusion: Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Commons@Becker

Reducing statistical time-series problems to binary classification

Author: Mary Jérémie
Ryabko Daniil
Publication date: 01/12/2012

We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Comment: In proceedings of NIPS 2012, pp. 2069-207

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server