One-Step or Two-Step Optimization and the Overfitting Phenomenon: A Case Study on Time Series Classification
Optimization has been developing at a fast rate for the last few decades.
Bio-inspired optimization algorithms are metaheuristics inspired by nature.
These algorithms have been applied to solve problems in engineering,
economics, and other domains, as well as in branches of information
technology such as networking and software engineering. Time series data
mining is a field of information technology that has its share of these
applications too. In previous work we showed how bio-inspired algorithms
such as genetic algorithms and differential evolution can be used to find
the locations of the breakpoints used in the symbolic aggregate
approximation (SAX) representation of time series, and in another work we
showed how particle swarm optimization, one of the best-known bio-inspired
algorithms, can be used to assign weights to the different segments of the
SAX representation. In this paper we present, through two different
approaches, a new meta-optimization process that produces optimal locations
of the breakpoints together with optimal weights of the segments. The time
series classification experiments we conducted show an interesting example
of how overfitting, a frequently encountered problem in data mining that
occurs when a model fits its training set too closely, can interfere with
the optimization process and mask the superior performance of an
optimization algorithm.
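
To make the objects of this optimization concrete, below is a minimal Python
sketch of the two ingredients being tuned: SAX symbolization with movable
breakpoints, and a MINDIST-style distance with one weight per segment. The
function names and the exact form of the weighting are illustrative
assumptions, not definitions taken from the papers; an optimizer such as a
genetic algorithm or particle swarm optimization would search over the
breakpoints and weights vectors, scoring each candidate by training-set
classification accuracy, which is exactly where the overfitting discussed
above can enter.

    import numpy as np

    def paa(series, n_segments):
        # Piecewise Aggregate Approximation: mean of each equal-width segment.
        return np.array([seg.mean() for seg in np.array_split(series, n_segments)])

    def sax_symbols(series, n_segments, breakpoints):
        # Map each PAA mean to the index of the breakpoint interval it falls
        # in; breakpoints are sorted, so symbols range over 0..len(breakpoints).
        return np.searchsorted(np.sort(breakpoints), paa(series, n_segments))

    def weighted_mindist(sym_a, sym_b, breakpoints, weights, n):
        # MINDIST-style lower bound with a per-segment weight (assumed form;
        # the exact weighting scheme is not given in the abstract above).
        bps = np.sort(breakpoints)
        total = 0.0
        for s_a, s_b, wt in zip(sym_a, sym_b, weights):
            lo, hi = min(s_a, s_b), max(s_a, s_b)
            cell = 0.0 if hi - lo <= 1 else bps[hi - 1] - bps[lo]
            total += wt * cell ** 2
        return np.sqrt(n / len(sym_a)) * np.sqrt(total)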
A Better Alternative to Piecewise Linear Time Series Segmentation
Time series are difficult to monitor, summarize, and predict. Segmentation
organizes a time series into a few intervals with uniform characteristics
(flatness, linearity, modality, monotonicity, and so on). For scalability,
we require fast, linear-time algorithms. The popular piecewise linear model
can determine where the data goes up or down and at what rate.
Unfortunately, when the data does not follow a linear model, the computation
of the local slope creates overfitting. We propose an adaptive time series
model where the polynomial degree of each interval varies (constant, linear,
and so on). Given a number of regressors, the cost of each interval is the
number of regressors it uses: constant intervals cost one regressor, linear
intervals cost two regressors, and so on. Our goal is to minimize the
Euclidean (l_2) error for a given model complexity. Experimentally, we
investigate the model where intervals can be either constant or linear. Over
synthetic random walks, historical stock market prices, and
electrocardiograms, the adaptive model provides a more accurate segmentation
than the piecewise linear model, without increasing the cross-validation
error or the running time, while providing a richer vocabulary
to applications. Implementation issues, such as numerical stability and
real-world performance, are discussed.
Comment: to appear in SIAM Data Mining 2007
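
To make the cost model above concrete, here is a minimal dynamic-programming
sketch in Python: it covers the series with constant intervals (one
regressor) or linear intervals (two regressors) and minimizes the total
squared l_2 error under a given regressor budget. This brute-force
recurrence costs roughly O(n^2 * budget) polynomial fits, so it only
illustrates the model; the paper explicitly requires fast linear-time
algorithms, and this is not the authors' method.

    import numpy as np

    def fit_cost(y, degree):
        # Squared l_2 residual of the best least-squares polynomial fit to y.
        if len(y) <= degree:
            return 0.0
        x = np.arange(len(y))
        resid = y - np.polyval(np.polyfit(x, y, degree), x)
        return float(resid @ resid)

    def adaptive_segmentation(y, budget):
        # best[i][b]: minimal error covering y[:i] with at most b regressors.
        # Constant intervals (degree 0) cost 1 regressor, linear ones cost 2.
        y = np.asarray(y, dtype=float)
        n, INF = len(y), float("inf")
        best = [[INF] * (budget + 1) for _ in range(n + 1)]
        choice = [[None] * (budget + 1) for _ in range(n + 1)]
        for b in range(budget + 1):
            best[0][b] = 0.0
        for i in range(1, n + 1):
            for b in range(1, budget + 1):
                for j in range(i):                  # last interval is y[j:i]
                    for degree, cost in ((0, 1), (1, 2)):
                        if cost > b:
                            continue
                        cand = best[j][b - cost] + fit_cost(y[j:i], degree)
                        if cand < best[i][b]:
                            best[i][b] = cand
                            choice[i][b] = (j, degree, b - cost)
        # Backtrack into (start, end, degree) triples.
        segments, i, b = [], n, budget
        while i > 0 and choice[i][b] is not None:
            j, degree, b = choice[i][b]
            segments.append((j, i, degree))
            i = j
        return best[n][budget], segments[::-1]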
Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios
At their core, many time series data mining algorithms can be reduced to reasoning about the shapes of time series subsequences. This requires a distance measure, and most algorithms use Euclidean distance or Dynamic Time Warping (DTW) as their core subroutine. We argue that these distance measures are not as robust as the community believes. The undue faith in them derives from an overreliance on benchmark datasets and from self-selection bias: the community is reluctant to address the more difficult domains for which current distance measures are ill-suited. In this work, we introduce a novel distance measure, MPdist. We show that our proposed distance measure is much more robust than current distance measures, and that it allows us to successfully mine datasets that would defeat any Euclidean- or DTW-based algorithm. Additionally, we show that our distance measure can be computed so efficiently that it allows analytics on fast streams.
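
For orientation, here is a minimal brute-force Python sketch of an
MPdist-style computation as the abstract describes it: pool, for every
subsequence of each series, the z-normalized Euclidean distance to its
nearest neighbor in the other series, then return the k-th smallest pooled
value, with k a small fraction of the combined series length. The quadratic
nearest-neighbor search and the 5% default are simplifying assumptions for
illustration; the paper obtains the underlying profiles with fast matrix
profile algorithms, which is what makes the measure practical on streams.

    import numpy as np

    def znorm(x):
        # Z-normalize a subsequence; constant subsequences map to all zeros.
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()

    def nn_profile(a, b, m):
        # For each length-m subsequence of a, its distance to the nearest
        # length-m subsequence of b (brute force, O(len(a) * len(b) * m)).
        subs_b = [znorm(b[j:j + m]) for j in range(len(b) - m + 1)]
        return [min(np.linalg.norm(znorm(a[i:i + m]) - s) for s in subs_b)
                for i in range(len(a) - m + 1)]

    def mpdist(a, b, m, quantile=0.05):
        # k-th smallest value of the concatenated AB and BA profiles.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        p_abba = np.array(nn_profile(a, b, m) + nn_profile(b, a, m))
        k = min(int(np.ceil(quantile * (len(a) + len(b)))), len(p_abba))
        return float(np.partition(p_abba, k - 1)[k - 1])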
Network Lasso: Clustering and Optimization in Large Graphs
Convex optimization is an essential tool for modern data analysis, as it
provides a framework to formulate and solve many problems in machine learning
and data mining. However, general convex optimization solvers do not scale
well, and scalable solvers are often specialized to only work on a narrow class
of problems. Therefore, there is a need for simple, scalable algorithms that
can solve many common optimization problems. In this paper, we introduce the
\emph{network lasso}, a generalization of the group lasso to a network setting
that allows for simultaneous clustering and optimization on graphs. We develop
an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to
solve this problem in a distributed and scalable manner, which allows for
guaranteed global convergence even on large graphs. We also examine a
non-convex extension of this approach. We then demonstrate that many types
of problems can be expressed in our framework. We focus on three in
particular: binary classification, predicting housing prices, and event
detection in time series data. Comparing the network lasso to baseline
approaches, we show that it is both a fast and an accurate method for
solving large optimization problems.
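
For reference, the problem the abstract describes can be written out as
follows; this is a reconstruction from the description above, with f_i the
convex loss at node i, w_jk >= 0 the edge weights, and \lambda >= 0 the
regularization parameter:

    \min_{x_1, \ldots, x_n} \; \sum_{i \in \mathcal{V}} f_i(x_i)
        \;+\; \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \, \lVert x_j - x_k \rVert_2

The edge penalty is a sum of unsquared l_2 norms, i.e. a group lasso applied
to edge differences: it drives x_j and x_k to agree exactly across many
edges, which is what yields the simultaneous clustering, while ADMM gives
each edge local copies of its endpoint variables so that the node and edge
updates decouple and scale.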