3,089 research outputs found

    Greedy Gaussian Segmentation of Multivariate Time Series

    We consider the problem of breaking a multivariate (vector) time series into segments over which the data is well explained as independent samples from a Gaussian distribution. We formulate this as a covariance-regularized maximum likelihood problem, which can be reduced to a combinatorial optimization problem of searching over the possible breakpoints, or segment boundaries. This problem can be solved using dynamic programming, with complexity that grows with the square of the time series length. We propose a heuristic method that approximately solves the problem in linear time with respect to this length, and always yields a locally optimal choice, in the sense that no change of any one breakpoint improves the objective. Our method, which we call greedy Gaussian segmentation (GGS), easily scales to problems with vectors of dimension over 1000 and time series of arbitrary length. We discuss methods that can be used to validate such a model using data, and also to automatically choose appropriate values of the two hyperparameters in the method. Finally, we illustrate our GGS approach on financial time series and Wikipedia text data.
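The breakpoint search described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `segment_score`, `best_split`, and `lam` are hypothetical, and the score is a simplified form that assumes the covariance of a segment of length m is estimated as S + (lam/m) I, with constants dropped. It shows only the step of choosing one breakpoint, not the full greedy procedure or its breakpoint adjustment.

```python
import numpy as np

def segment_score(X, lam):
    """Simplified regularized Gaussian log-likelihood of one segment.

    Assumption (not taken from the abstract): the covariance estimate is
    S + (lam / m) * I, and additive constants are dropped.
    """
    m, n = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / m            # empirical covariance
    reg = S + (lam / m) * np.eye(n)          # regularized covariance
    _, logdet = np.linalg.slogdet(reg)
    return -0.5 * m * logdet

def best_split(X, lam):
    """Return (t, gain): the single breakpoint that most improves the score.

    Rows X[:t] form the left segment and rows X[t:] the right one.
    Returns (None, 0.0) if no split improves on the unsplit series.
    """
    T = X.shape[0]
    base = segment_score(X, lam)
    best_t, best_gain = None, 0.0
    for t in range(1, T):
        gain = segment_score(X[:t], lam) + segment_score(X[t:], lam) - base
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Usage: two Gaussian regimes with a mean shift after 50 samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])
t, gain = best_split(X, lam=1.0)
print(t, gain)
```

Applying `best_split` repeatedly to the resulting segments would give a top-down heuristic in the spirit of the abstract. The regularization weight `lam` is one hyperparameter that would need to be chosen, for example by validation on held-out data.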

    ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data

    There are many different ways in which change point analysis can be performed, from purely parametric methods to those that are distribution free. The ecp package is designed to perform multiple change point analysis while making as few assumptions as possible. While many other change point methods are applicable only to univariate data, this R package is suitable for both univariate and multivariate observations. Estimation can be based upon either a hierarchical divisive or agglomerative algorithm. Divisive estimation sequentially identifies change points via a bisection algorithm. The agglomerative algorithm estimates change point locations by determining an optimal segmentation. Both approaches are able to detect any type of distributional change within the data. This provides an advantage over many existing change point algorithms, which are only able to detect changes within the marginal distributions.
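A single bisection step of a divisive, distribution-free method like the one described above can be sketched in Python as follows. This is an illustration, not the ecp package's code or API: the function name `energy_split` and its parameters are hypothetical, the divergence is a sample energy distance assumed here as the nonparametric criterion, and any significance check on the detected split is omitted.

```python
import numpy as np

def energy_split(X, min_size=5, alpha=1.0):
    """Find the split point that maximizes a scaled sample energy distance.

    Hypothetical helper for illustration. X has shape (T,) or (T, d);
    rows X[:tau] and X[tau:] are compared for every admissible tau.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    T = len(X)
    # Pairwise Euclidean distances raised to the power alpha.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** alpha
    best_tau, best_q = None, -np.inf
    for tau in range(min_size, T - min_size + 1):
        m, n = tau, T - tau
        between = D[:tau, tau:].mean()
        # Within-sample means over distinct pairs (the diagonal is zero).
        within_a = D[:tau, :tau].sum() / (m * (m - 1))
        within_b = D[tau:, tau:].sum() / (n * (n - 1))
        q = (m * n / (m + n)) * (2.0 * between - within_a - within_b)
        if q > best_q:
            best_tau, best_q = tau, q
    return best_tau, best_q

# Usage: a noise-free step in a univariate series of length 60.
x = np.concatenate([np.zeros(30), np.full(30, 10.0)])
tau, q = energy_split(x)
print(tau, q)
```

A divisive procedure would apply this step recursively to each resulting segment and stop when a split is no longer judged significant. Because the criterion compares whole empirical distributions, the same code applies unchanged to multivariate input.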

    Seeded Binary Segmentation: A general methodology for fast and optimal change point detection

    In recent years, there has been an increasing demand for efficient algorithms for large-scale change point detection problems. To this end, we propose seeded binary segmentation, an approach relying on a deterministic construction of background intervals, called seeded intervals, in which single change points are searched. The final selection of change points based on the candidates from seeded intervals can be done in various ways, adapted to the problem at hand. Thus, seeded binary segmentation is easy to adapt to a wide range of change point detection problems, be they univariate, multivariate or even high-dimensional. We consider the univariate Gaussian change in mean setup in detail. For this specific case we show that seeded binary segmentation leads to a near-linear time approach (i.e. linear up to a logarithmic factor) independent of the underlying number of change points. Furthermore, using appropriate selection methods, the methodology is shown to be asymptotically minimax optimal. While computationally more efficient, the finite sample estimation performance remains competitive compared to state-of-the-art procedures. Moreover, we illustrate the methodology for high-dimensional settings with an inverse covariance change point detection problem where our proposal leads to massive computational gains while still exhibiting good statistical performance.
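The idea of searching for single change points inside a deterministic family of intervals can be sketched as below for the univariate change-in-mean case. Everything here is an assumption made for illustration rather than the paper's code: the multiscale layout in `seeded_intervals` (layers whose length shrinks by a factor `decay`, with evenly shifted intervals in each layer), the CUSUM statistic, the greedy selection rule, and all function names.

```python
import math
import numpy as np

def seeded_intervals(T, decay=0.5, min_len=2):
    """Deterministic multiscale intervals (start, end), covering x[start:end]."""
    intervals = set()
    k = 1
    while T * decay ** (k - 1) >= min_len:
        length = T * decay ** (k - 1)
        n_k = 2 * math.ceil((1.0 / decay) ** (k - 1)) - 1
        shift = (T - length) / (n_k - 1) if n_k > 1 else 0.0
        for i in range(n_k):
            start = math.floor(i * shift)
            end = min(math.ceil(i * shift + length), T)
            intervals.add((start, end))
        k += 1
    return sorted(intervals)

def best_cusum_split(x, start, end):
    """Best single split t in x[start:end] by the CUSUM statistic."""
    seg = x[start:end]
    n = len(seg)
    csum = np.cumsum(seg)
    t = np.arange(1, n)
    left_mean = csum[:-1] / t
    right_mean = (csum[-1] - csum[:-1]) / (n - t)
    stat = np.sqrt(t * (n - t) / n) * np.abs(left_mean - right_mean)
    j = int(np.argmax(stat))
    return start + int(t[j]), float(stat[j])

def detect(x, threshold):
    """Greedy selection: take the strongest candidates first and skip any
    interval that already contains an accepted change point."""
    cands = []
    for s, e in seeded_intervals(len(x)):
        t, stat = best_cusum_split(x, s, e)
        if stat > threshold:
            cands.append((stat, t, s, e))
    cands.sort(reverse=True)
    found = []
    for stat, t, s, e in cands:
        if all(not (s < cp < e) for cp in found):
            found.append(t)
    return sorted(found)

# Usage: a noise-free step of height 5 in a series of length 64.
x = np.concatenate([np.zeros(32), np.full(32, 5.0)])
print(detect(x, threshold=1.0))
```

Since each layer's total length is a bounded multiple of the series length and there are logarithmically many layers, the total work over all intervals stays near-linear, which matches the complexity claim in the abstract. The `threshold` value is a tuning choice assumed here, not one given in the source.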

    Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data

    Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios. Comment: This revised version fixes two small typos in the published version.

    A posteriori Trading-inspired Model-free Time Series Segmentation

    Within the context of multivariate time series segmentation, this paper proposes a method inspired by a posteriori optimal trading. After a normalization step, time series are treated channel-wise as surrogate stock prices that can be traded optimally a posteriori in a virtual portfolio holding either stock or cash. Linear transaction costs are interpreted as hyperparameters for noise filtering. The resulting trading signals, together with those obtained on the reversed time series, are used for unsupervised labeling, before a consensus over channels is reached that determines the segmentation time instants. The method is model-free in that no model prescriptions for segments are made. Benefits of the proposed approach include simplicity, computational efficiency and adaptability to a wide range of different shapes of time series. Performance is demonstrated on synthetic and real-world data, including a large-scale dataset comprising a multivariate time series of dimension 1000 and length 2709. The proposed method is compared to a popular model-based bottom-up approach fitting piecewise affine models and to a recent model-based top-down approach fitting Gaussian models, and is found to be consistently faster while producing more intuitive results. Comment: 9 pages, double column, 13 figures, 2 tables.

    Tail-greedy bottom-up data decompositions and fast multiple change-point detection

    This article proposes a ‘tail-greedy’, bottom-up transform for one-dimensional data, which results in a nonlinear but conditionally orthonormal, multiscale decomposition of the data with respect to an adaptively chosen Unbalanced Haar wavelet basis. The ‘tail-greediness’ of the decomposition algorithm, whereby multiple greedy steps are taken in a single pass through the data, both enables fast computation and makes the algorithm applicable in the problem of consistent estimation of the number and locations of multiple change-points in data. The resulting agglomerative change-point detection method avoids the disadvantages of the classical divisive binary segmentation, and offers very good practical performance. It is implemented in the R package breakfast, available from CRAN.