
    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMMs) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameter choices by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and outstanding results for MTS with missing data. Comment: 23 pages, 6 figures
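    The ensemble construction described above can be illustrated in a few lines. The sketch below is a simplified stand-in, assuming toy complete data and scikit-learn's plain GaussianMixture (not the authors' informative-prior GMM with missing-data handling): it builds a similarity matrix from co-assignment frequencies across many GMM fits with varied hyperparameters.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy stand-in data: 20 series flattened to fixed-length vectors,
    # drawn from two well-separated groups (no missing values here).
    X = np.vstack([rng.normal(0.0, 1.0, (10, 8)),
                   rng.normal(3.0, 1.0, (10, 8))])

    # Ensemble step: fit many GMMs with varied hyperparameters and count
    # how often each pair of series lands in the same mixture component.
    n_runs = 30
    K = np.zeros((len(X), len(X)))
    for i in range(n_runs):
        gmm = GaussianMixture(n_components=2 + i % 3,
                              covariance_type="diag",
                              random_state=i).fit(X)
        labels = gmm.predict(X)
        K += (labels[:, None] == labels[None, :]).astype(float)
    K /= n_runs  # co-assignment frequencies in [0, 1]: the similarity matrix
    ```

    Averaging co-assignments over many randomized fits is what makes the resulting kernel insensitive to any single choice of the number of components.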

    Penalized Clustering of Large Scale Functional Data with Multiple Covariates

    In this article, we propose a penalized clustering method for large-scale data with multiple covariates through a functional data approach. In the proposed method, responses and covariates are linked together through nonparametric multivariate functions (fixed effects), which offer great flexibility in modeling a variety of function features, such as jump points, branching, and periodicity. Functional ANOVA is employed to further decompose the multivariate functions in a reproducing kernel Hilbert space and to provide the associated notions of main effect and interaction. Parsimonious random effects are used to capture various correlation structures. The mixed-effect models are nested under a general mixture model, in which the heterogeneity of the functional data is characterized. We propose a penalized Henderson's likelihood approach for model fitting and design a rejection-controlled EM algorithm for the estimation. Our method selects smoothing parameters through generalized cross-validation. Furthermore, Bayesian confidence intervals are used to measure the clustering uncertainty. Simulation studies and real-data examples are presented to investigate the empirical performance of the proposed method. Open-source code is available in the R package MFDA.
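    The core pipeline — represent each observed curve by a smooth function, then cluster in that functional representation — can be sketched minimally. This is a toy Python illustration using polynomial smoothing and k-means, not the paper's penalized Henderson's-likelihood method or the MFDA package:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    t = np.linspace(0, 1, 50)
    # Two latent groups of noisy curves with different mean functions
    curves = np.vstack([
        np.sin(2 * np.pi * t) + rng.normal(0, 0.2, (20, 50)),
        np.cos(2 * np.pi * t) + rng.normal(0, 0.2, (20, 50)),
    ])

    # Functional-data step: replace each raw curve by a smooth fit
    # (degree-5 polynomial), then cluster the smoothed curves.
    smooth = np.array([np.polyval(np.polyfit(t, y, deg=5), t) for y in curves])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(smooth)
    ```

    Smoothing before clustering is the essential functional-data idea: it suppresses pointwise noise so that group structure in the underlying mean functions drives the partition.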

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that differs significantly from the other values in a data set. Outliers may be instances of error or may indicate interesting events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to further discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework, together with two decision trees for selecting the most suitable technique based on the characteristics of the data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.
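    As a concrete instance of the definition above, here is a minimal unsupervised detector for univariate numeric data, using Tukey's IQR fences — an illustrative choice, not one of the specific techniques surveyed in the paper:

    ```python
    import numpy as np

    def iqr_outliers(x, k=1.5):
        """Flag observations outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return (x < q1 - k * iqr) | (x > q3 + k * iqr)

    data = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 30.0])
    flags = iqr_outliers(data)  # only the last value is flagged
    ```

    Quartile-based fences are robust to the outliers themselves, unlike mean/standard-deviation rules, which a single extreme value can distort.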

    Segmentation of Fault Networks Determined from Spatial Clustering of Earthquakes

    We present a new method of data clustering applied to earthquake catalogs, with the goal of reconstructing the seismically active part of fault networks. We first use an original method to separate clustered events from uncorrelated seismicity, using the distribution of volumes of tetrahedra defined by closest-neighbor events in the original and randomized seismic catalogs. The spatial disorder of the complex geometry of fault networks is then taken into account by defining faults as probabilistic anisotropic kernels, whose structures are motivated by properties of discontinuous tectonic deformation and previous empirical observations of the geometry of faults and of earthquake clusters at many spatial and temporal scales. Combining this a priori knowledge with information-theoretical arguments, we propose the Gaussian mixture approach implemented in an Expectation-Maximization (EM) procedure. A cross-validation scheme is then used to determine the number of kernels that should be used to provide an optimal clustering of the catalog. This three-step approach is applied to a high-quality relocated catalog of the seismicity following the 1986 Mount Lewis (Ml = 5.7) event in California and reveals that events cluster along planar patches of about 2 km^2, i.e. comparable to the size of the main event. The finite thickness of those clusters (about 290 m) suggests that events do not occur on well-defined Euclidean fault core surfaces, but rather that the damage zone surrounding faults may be seismically active at depth. Finally, we propose a connection between our methodology and multi-scale spatial analysis, based on the derivation of a spatial fractal dimension of about 1.8 for the set of hypocenters in the Mount Lewis area, consistent with recent observations on relocated catalogs.
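    The model-selection step described above — choosing the number of Gaussian kernels by cross-validating the held-out likelihood of an EM fit — can be sketched as follows. This uses scikit-learn's EM-based GaussianMixture on synthetic anisotropic point clouds standing in for fault patches; the data and API are illustrative assumptions, not the paper's catalog or implementation:

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    # Two elongated 3-D point clouds standing in for planar fault patches
    patch1 = rng.normal(0.0, 1.0, (150, 3)) * [5.0, 1.0, 0.3]
    patch2 = rng.normal(0.0, 1.0, (150, 3)) * [5.0, 1.0, 0.3] + [0.0, 8.0, 0.0]
    X = np.vstack([patch1, patch2])

    train, test = train_test_split(X, test_size=0.3, random_state=0)
    # EM fits with full (anisotropic) covariances; the number of kernels is
    # chosen to maximize the average log-likelihood of held-out events.
    scores = {k: GaussianMixture(n_components=k, covariance_type="full",
                                 random_state=0).fit(train).score(test)
              for k in range(1, 6)}
    best_k = max(scores, key=scores.get)
    ```

    Scoring on held-out events rather than the training set penalizes over-fitting, so adding kernels beyond the true cluster structure stops improving the criterion.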