Greedy Gaussian Segmentation of Multivariate Time Series
We consider the problem of breaking a multivariate (vector) time series into
segments over which the data is well explained as independent samples from a
Gaussian distribution. We formulate this as a covariance-regularized maximum
likelihood problem, which can be reduced to a combinatorial optimization
problem of searching over the possible breakpoints, or segment boundaries. This
problem can be solved using dynamic programming, with complexity that grows
with the square of the time series length. We propose a heuristic method that
approximately solves the problem in linear time with respect to this length,
and always yields a locally optimal choice, in the sense that no change of any
one breakpoint improves the objective. Our method, which we call greedy
Gaussian segmentation (GGS), easily scales to problems with vectors of
dimension over 1000 and time series of arbitrary length. We discuss methods
that can be used to validate such a model using data, and also to automatically
choose appropriate values of the two hyperparameters in the method. Finally, we
illustrate our GGS approach on financial time series and Wikipedia text data.
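As a rough sketch of the kind of objective involved (not the paper's exact formulation), the snippet below scores a candidate segment by a covariance-regularized Gaussian log-likelihood, shrinking the empirical covariance by lam * I, and exhaustively searches for a single best breakpoint; lam, min_len and the helper names are illustrative assumptions.

```python
import numpy as np

def segment_loglik(X, lam=0.1):
    # Regularized Gaussian log-likelihood of one segment (up to constants),
    # with the sample mean plugged in and the empirical covariance shrunk by
    # lam * I (illustrative; not the exact GGS objective).
    m, n = X.shape
    S = np.cov(X, rowvar=False, bias=True) + lam * np.eye(n)
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * m * (logdet + n * np.log(2 * np.pi) + n)

def best_split(X, lam=0.1, min_len=2):
    # Exhaustive search for the one breakpoint that maximizes the summed objective.
    best_t, best_val = None, -np.inf
    for t in range(min_len, len(X) - min_len):
        val = segment_loglik(X[:t], lam) + segment_loglik(X[t:], lam)
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val
```

Repeating such split-and-adjust steps greedily, rather than running the full quadratic-time dynamic program, is the spirit of the linear-time heuristic described above.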
ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data
There are many different ways in which change point analysis can be
performed, from purely parametric methods to those that are distribution free.
The ecp package is designed to perform multiple change point analysis while
making as few assumptions as possible. While many other change point methods
are applicable only for univariate data, this R package is suitable for both
univariate and multivariate observations. Estimation can be based upon either a
hierarchical divisive or agglomerative algorithm. Divisive estimation
sequentially identifies change points via a bisection algorithm. The
agglomerative algorithm estimates change point locations by determining an
optimal segmentation. Both approaches are able to detect any type of
distributional change within the data. This provides an advantage over many
existing change point algorithms which are only able to detect changes within
the marginal distributions.
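As a loose illustration of a single divisive (bisection) step, the sketch below scores every candidate split of a multivariate sample with an empirical energy distance, a generic nonparametric divergence; the statistic, weighting and stopping rules actually used by ecp are not reproduced here, and the function names are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(A, B):
    # Empirical energy distance between two multivariate samples:
    # 2 E||X - Y|| - E||X - X'|| - E||Y - Y'|| with plug-in averages.
    return 2 * cdist(A, B).mean() - cdist(A, A).mean() - cdist(B, B).mean()

def bisect(X, min_len=5):
    # One divisive step: pick the split that maximizes the divergence
    # between the two resulting sub-samples.
    best_t, best_val = None, -np.inf
    for t in range(min_len, len(X) - min_len):
        val = energy_distance(X[:t], X[t:])
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val
```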
Seeded Binary Segmentation: A general methodology for fast and optimal change point detection
In recent years, there has been an increasing demand for efficient algorithms
for large scale change point detection problems. To this end, we propose seeded
binary segmentation, an approach relying on a deterministic construction of
background intervals, called seeded intervals, in which single change points
are searched. The final selection of change points based on the candidates from
seeded intervals can be done in various ways, adapted to the problem at hand.
Thus, seeded binary segmentation is easy to adapt to a wide range of change
point detection problems, be they univariate, multivariate or even
high-dimensional.
We consider the univariate Gaussian change in mean setup in detail. For this
specific case we show that seeded binary segmentation leads to a near-linear
time approach (i.e. linear up to a logarithmic factor) independent of the
underlying number of change points. Furthermore, using appropriate selection
methods, the methodology is shown to be asymptotically minimax optimal. Despite
being computationally more efficient, its finite-sample estimation performance
remains competitive with state-of-the-art procedures. Moreover, we
illustrate the methodology for high-dimensional settings with an inverse
covariance change point detection problem where our proposal leads to massive
computational gains while still exhibiting good statistical performance.
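As an illustration of the deterministic construction, the sketch below generates one plausible family of seeded intervals: a few layers of overlapping intervals whose lengths shrink geometrically with a decay parameter. The layer counts, shifts and the default decay value are assumptions; the paper's exact construction may differ.

```python
import math

def seeded_intervals(n, decay=1 / 2 ** 0.5):
    # Deterministic seeded intervals on {0, ..., n}: layer k holds evenly
    # shifted intervals of length roughly n * decay**(k - 1).
    intervals = [(0, n)]
    depth = math.ceil(math.log(n, 1 / decay))
    for k in range(2, depth + 1):
        length = n * decay ** (k - 1)
        n_k = 2 * math.ceil((1 / decay) ** (k - 1)) - 1
        shift = (n - length) / (n_k - 1)
        for i in range(n_k):
            left = int(round(i * shift))
            right = int(round(i * shift + length))
            if right - left >= 2:
                intervals.append((left, right))
    return sorted(set(intervals))
```

Each seeded interval would then be scanned with a single-change-point statistic such as CUSUM, and the final breakpoints chosen from the resulting candidates by whatever selection rule fits the problem, in line with the flexibility the abstract describes.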
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data
Subsequence clustering of multivariate time series is a useful tool for
discovering repeated patterns in temporal data. Once these patterns have been
discovered, seemingly complicated datasets can be interpreted as a temporal
sequence of only a small number of states, or clusters. For example, raw sensor
data from a fitness-tracking application can be expressed as a timeline of a
select few actions (e.g., walking, sitting, running). However, discovering
these patterns is challenging because it requires simultaneous segmentation and
clustering of the time series. Furthermore, interpreting the resulting clusters
is difficult, especially when the data is high-dimensional. Here we propose a
new method of model-based clustering, which we call Toeplitz Inverse
Covariance-based Clustering (TICC). Each cluster in the TICC method is defined
by a correlation network, or Markov random field (MRF), characterizing the
interdependencies between different observations in a typical subsequence of
that cluster. Based on this graphical representation, TICC simultaneously
segments and clusters the time series data. We solve the TICC problem through
alternating minimization, using a variation of the expectation maximization
(EM) algorithm. We derive closed-form solutions to efficiently solve the two
resulting subproblems in a scalable way, through dynamic programming and the
alternating direction method of multipliers (ADMM), respectively. We validate
our approach by comparing TICC to several state-of-the-art baselines in a
series of synthetic experiments, and we then demonstrate on an automobile
sensor dataset how TICC can be used to learn interpretable clusters in
real-world scenarios.
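The dynamic-programming subproblem mentioned above assigns each time point to a cluster while discouraging rapid switching. The sketch below is a generic Viterbi-style version of such a step: given per-cluster negative log-likelihoods and a switching penalty (called beta here, an assumed name), it returns a temporally smooth labeling; the actual TICC cost terms and the surrounding ADMM covariance updates are not shown.

```python
import numpy as np

def assign_clusters(neg_loglik, beta=50.0):
    # neg_loglik: (T, K) array, cost of assigning time t to cluster k.
    # beta: penalty paid whenever consecutive points change cluster.
    T, K = neg_loglik.shape
    cost = neg_loglik[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        stay = cost                      # remain in the same cluster
        switch = cost.min() + beta       # switch from the cheapest cluster
        back[t] = np.where(stay <= switch, np.arange(K), cost.argmin())
        cost = np.minimum(stay, switch) + neg_loglik[t]
    labels = np.empty(T, dtype=int)
    labels[-1] = cost.argmin()
    for t in range(T - 1, 0, -1):        # backtrack the optimal path
        labels[t - 1] = back[t, labels[t]]
    return labels
```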
A posteriori Trading-inspired Model-free Time Series Segmentation
Within the context of multivariate time series segmentation, this paper
proposes a method inspired by a posteriori optimal trading. After a
normalization step, time series are treated channel-wise as surrogate stock
prices that can be traded optimally a posteriori in a virtual portfolio holding
either stock or cash. Linear transaction costs are interpreted as
hyperparameters for noise filtering. The resulting trading signals, together
with those obtained on the reversed time series, are used for unsupervised
labeling, before a consensus over channels is reached that determines the
segmentation time instants. The method is model-free in that no
model prescriptions for the segments are made. Benefits of the proposed approach
include simplicity, computational efficiency and adaptability to a wide range
of different shapes of time series. Performance is demonstrated on synthetic
and real-world data, including a large-scale dataset comprising a multivariate
time series of dimension 1000 and length 2709. The proposed method is compared
to a popular model-based bottom-up approach fitting piecewise affine models and
to a recent model-based top-down approach fitting Gaussian models, and is found
to be consistently faster while producing more intuitive results.
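To make the trading analogy concrete, the sketch below computes, for a single normalized channel, a posteriori optimal long/flat positions under a proportional transaction cost by dynamic programming over a cash state and an invested state; the switch instants of the returned position sequence are the per-channel labels that would then be combined across channels. The cost model, the end-of-series convention and all names are assumptions rather than the paper's exact bookkeeping.

```python
import numpy as np

def optimal_positions(price, cost=0.01):
    # price: 1-D array of surrogate prices (one normalized channel).
    # cost: proportional transaction cost; larger values filter more noise.
    T = len(price)
    cash = np.empty(T)                    # best wealth when flat at time t
    shares = np.empty(T)                  # best share count when invested at time t
    from_stock = np.zeros(T, dtype=bool)  # cash[t] reached by selling at t
    from_cash = np.zeros(T, dtype=bool)   # shares[t] reached by buying at t
    cash[0], shares[0] = 1.0, (1.0 - cost) / price[0]
    from_cash[0] = True
    for t in range(1, T):
        sell = shares[t - 1] * price[t] * (1.0 - cost)
        cash[t] = max(cash[t - 1], sell)
        from_stock[t] = sell > cash[t - 1]
        buy = cash[t - 1] * (1.0 - cost) / price[t]
        shares[t] = max(shares[t - 1], buy)
        from_cash[t] = buy > shares[t - 1]
    pos = np.zeros(T, dtype=int)          # backtrack, ending flat (in cash)
    in_stock = False
    for t in range(T - 1, -1, -1):
        pos[t] = int(in_stock)
        if in_stock and from_cash[t]:
            in_stock = False              # bought at t, so flat before t
        elif not in_stock and from_stock[t]:
            in_stock = True               # sold at t, so invested before t
    return pos
```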
Tail-greedy bottom-up data decompositions and fast multiple change-point detection
This article proposes a ‘tail-greedy’, bottom-up transform for one-dimensional data, which results in a nonlinear but conditionally orthonormal, multiscale decomposition of the data with respect to an adaptively chosen Unbalanced Haar wavelet basis. The ‘tail-greediness’ of the decomposition algorithm, whereby multiple greedy steps are taken in a single pass through the data, both enables fast computation and makes the algorithm applicable in the problem of consistent estimation of the number and locations of multiple change-points in data. The resulting agglomerative change-point detection method avoids the disadvantages of the classical divisive binary segmentation, and offers very good practical performance. It is implemented in the R package breakfast, available from CRAN.
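For intuition, the sketch below performs one ‘tail-greedy’ pass of a bottom-up scheme: it computes Unbalanced-Haar-type detail coefficients for all pairs of adjacent regions and merges the fraction with the smallest coefficients in a single sweep. The rho parameter and the region bookkeeping are illustrative assumptions; the transform, thresholding and change-point estimation implemented in breakfast are more involved.

```python
import numpy as np

def uh_coefficient(sum1, n1, sum2, n2):
    # Unbalanced Haar detail coefficient for merging two adjacent regions with
    # the given element counts and sums (proportional to the mean difference).
    n = n1 + n2
    return np.sqrt(n2 / (n * n1)) * sum1 - np.sqrt(n1 / (n * n2)) * sum2

def tail_greedy_pass(x, rho=0.1):
    # Merge the rho-fraction of adjacent region pairs with the smallest
    # detail coefficients in one sweep; regions are (start, length, sum).
    regions = [(i, 1, float(v)) for i, v in enumerate(x)]
    details = [abs(uh_coefficient(a[2], a[1], b[2], b[1]))
               for a, b in zip(regions, regions[1:])]
    n_merge = max(1, int(np.ceil(rho * len(details))))
    used, merges = set(), []
    for i in np.argsort(details):
        if i in used or i + 1 in used:
            continue                      # skip regions already merged this pass
        merges.append(i)
        used.update({i, i + 1})
        if len(merges) == n_merge:
            break
    for i in sorted(merges, reverse=True):
        s1, n1, v1 = regions[i]
        _, n2, v2 = regions[i + 1]
        regions[i:i + 2] = [(s1, n1 + n2, v1 + v2)]
    return regions
```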