Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Similarity-based approaches represent a promising direction for time series
analysis. However, many such methods rely on parameter tuning, and some have
shortcomings if the time series are multivariate (MTS), due to dependencies
between attributes, or the time series contain missing data. In this paper, we
address these challenges within the powerful context of kernel methods by
proposing the robust \emph{time series cluster kernel} (TCK). The approach
taken leverages the missing data handling properties of Gaussian mixture models
(GMM) augmented with informative prior distributions. An ensemble learning
approach is exploited to ensure robustness to parameters by combining the
clustering results of many GMMs to form the final kernel.
We evaluate the TCK on synthetic and real data and compare to other
state-of-the-art techniques. The experimental results demonstrate that the TCK
is robust to parameter choices, provides competitive results for MTS without
missing data and outstanding results for missing data. Comment: 23 pages, 6 figures
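The ensemble step described above can be illustrated with a minimal sketch. It assumes that each mixture model in the ensemble has already been fitted and yields a posterior cluster-membership vector per time series, and that the kernel accumulates inner products of those vectors across models. The function name `ensemble_kernel` and the toy posteriors are hypothetical, and the GMM fitting with informative priors and missing-data handling is not shown.

```python
def ensemble_kernel(posteriors):
    """Sum, over ensemble members, the inner products of membership vectors.

    posteriors[q][i] is the (assumed, precomputed) posterior membership
    vector of series i under ensemble member q. Members may use different
    numbers of clusters.
    """
    n = len(posteriors[0])
    K = [[0.0] * n for _ in range(n)]
    for post in posteriors:
        for i in range(n):
            for j in range(n):
                K[i][j] += sum(a * b for a, b in zip(post[i], post[j]))
    return K


# Toy input: two hypothetical ensemble members, three series.
posteriors = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0], [0.1, 0.1, 0.8]],
]
K = ensemble_kernel(posteriors)
```

In this toy input, series 0 and 1 receive similar memberships in both members, so their kernel value is larger than that of series 0 and 2. Because the matrix is a sum of Gram matrices, it is symmetric and positive semidefinite by construction.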
Penalized Clustering of Large Scale Functional Data with Multiple Covariates
In this article, we propose a penalized clustering method for large scale
data with multiple covariates through a functional data approach. In the
proposed method, responses and covariates are linked together through
nonparametric multivariate functions (fixed effects), which have great
flexibility in modeling a variety of function features, such as jump points,
branching, and periodicity. Functional ANOVA is employed to further decompose
multivariate functions in a reproducing kernel Hilbert space and provide
associated notions of main effect and interaction. Parsimonious random effects
are used to capture various correlation structures. The mixed-effect models are
nested under a general mixture model, in which the heterogeneity of functional
data is characterized. We propose a penalized Henderson's likelihood approach
for model-fitting and design a rejection-controlled EM algorithm for the
estimation. Our method selects smoothing parameters through generalized
cross-validation. Furthermore, the Bayesian confidence intervals are used to
measure the clustering uncertainty. Simulation studies and real-data examples
are presented to investigate the empirical performance of the proposed method.
Open-source code is available in the R package MFDA.
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
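As a concrete instance of the definition above, the following sketch flags values that lie far from the bulk of a univariate sample using the modified z-score based on the median absolute deviation (MAD). It is one simple unsupervised rule chosen for illustration, not a technique taken from the paper. The function name `mad_outliers`, the threshold of 3.5, and the sample data are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Over half the values are identical; the score is undefined.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


sample = [10, 11, 9, 10, 12, 10, 50]
flagged = mad_outliers(sample)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely shifts them, so the outlier does not mask itself.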
Segmentation of Fault Networks Determined from Spatial Clustering of Earthquakes
We present a new method of data clustering applied to earthquake catalogs,
with the goal of reconstructing the seismically active part of fault networks.
We first use an original method to separate clustered events from uncorrelated
seismicity using the distribution of volumes of tetrahedra defined by closest
neighbor events in the original and randomized seismic catalogs. The spatial
disorder of the complex geometry of fault networks is then taken into account
by defining faults as probabilistic anisotropic kernels, whose structures are
motivated by properties of discontinuous tectonic deformation and previous
empirical observations of the geometry of faults and of earthquake clusters at
many spatial and temporal scales. Combining this a priori knowledge with
information theoretical arguments, we propose the Gaussian mixture approach
implemented in an Expectation-Maximization (EM) procedure. A cross-validation
scheme is then used and allows the determination of the number of kernels that
should be used to provide an optimal data clustering of the catalog. This
three-step approach is applied to a high quality relocated catalog of the
seismicity following the 1986 Mount Lewis () event in California and
reveals that events cluster along planar patches of about 2 km, i.e.
comparable to the size of the main event. The finite thickness of those
clusters (about 290 m) suggests that events do not occur on well-defined
euclidean fault core surfaces, but rather that the damage zone surrounding
faults may be seismically active at depth. Finally, we propose a connection
between our methodology and multi-scale spatial analysis, based on the
derivation of spatial fractal dimension of about 1.8 for the set of hypocenters
in the Mount Lewis area, consistent with recent observations on relocated
catalogs.
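The first step above relies on volumes of tetrahedra spanned by neighboring events. A minimal sketch of that geometric quantity follows, assuming each event is a 3D point (x, y, z) and using the standard scalar triple product formula. The function name `tetrahedron_volume` is hypothetical, and the neighbor search and the comparison against randomized catalogs are not shown.

```python
def tetrahedron_volume(a, b, c, d):
    """Volume of the tetrahedron with vertices a, b, c, d (3D points).

    Computed as |det[b - a, c - a, d - a]| / 6.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    det = (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )
    return abs(det) / 6.0


# Unit corner tetrahedron and a degenerate (coplanar) case.
unit = tetrahedron_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
flat = tetrahedron_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0))
```

Four events lying on a common plane give a volume of zero, so small volumes indicate locally planar or tightly clustered seismicity.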