Quantile-Based Fuzzy Clustering of Multivariate Time Series in the Frequency Domain
Funded for open access publication: Universidade da Coruña/CISUG. [Abstract] A novel procedure to perform fuzzy clustering of multivariate time series generated from different dependence models is proposed. Different amounts of dissimilarity between the generating models, or changes in the dynamic behaviour over time, are some of the arguments justifying a fuzzy approach, where each series is associated with all the clusters with specific membership levels. Our procedure considers quantile-based cross-spectral features and consists of three stages: (i) each element is characterized by a vector of proper estimates of the quantile cross-spectral densities, (ii) principal component analysis is carried out to capture the main differences while reducing the effects of noise, and (iii) the squared Euclidean distance between the first retained principal components is used to perform clustering through the standard fuzzy C-means and fuzzy C-medoids algorithms. The performance of the proposed approach is evaluated in a broad simulation study where several types of generating processes are considered, including linear, nonlinear and dynamic conditional correlation models. Assessment is done in two different ways: by directly measuring the quality of the resulting fuzzy partition, and by taking into account the ability of the technique to determine the overlapping nature of series located equidistant from well-defined clusters. The procedure is compared with the few alternatives suggested in the literature, substantially outperforming all of them whatever the underlying process and the evaluation scheme. Two specific applications involving air quality and financial databases illustrate the usefulness of our approach. The authors are grateful to the anonymous referees for their comments and suggestions. The research of Ángel López-Oriona and José A. Vilar has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF).
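Stage (iii) above relies on the standard fuzzy C-means membership rule. The following is a minimal sketch of that rule only, assuming the feature vectors (for example, retained principal component scores) and the centroids are already available; the function names and the fuzziness parameter `m` are illustrative choices, not taken from the paper.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_memberships(point, centroids, m=2.0):
    """Standard fuzzy C-means membership degrees of one point.

    u_c = 1 / sum_k (d2_c / d2_k) ** (1 / (m - 1)), where d2 denotes the
    squared Euclidean distance to each centroid and m > 1 is the fuzziness
    parameter (an assumed default of 2 is used here).
    """
    d2 = [squared_euclidean(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs fully to them.
    zeros = [i for i, d in enumerate(d2) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d2))]
    expo = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** expo for dk in d2) for di in d2]
```

A point equidistant from two centroids receives membership 0.5 in each, which is the kind of overlapping behaviour the second evaluation scheme in the abstract looks for.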
New methodological contributions in time series clustering
Official Doctoral Programme in Statistics and Operations Research. 555V01[Abstract]
This thesis presents new procedures to address the cluster analysis of time
series. First of all a two-stage procedure based on comparing frequencies and
magnitudes of the absolute maxima of the spectral densities is proposed. Assuming
that the clustering purpose is to group series according to the underlying
dependence structures, a detailed study of the behavior in clustering of a dissimilarity
based on comparing estimated quantile autocovariance functions (QAF)
is also carried out. A prediction-based resampling algorithm proposed by Dudoit
and Fridlyand is adjusted to select the optimal number of clusters. The
asymptotic behavior of the sample quantile autocovariances is studied and an
algorithm to determine optimal combinations of lags and pairs of quantile levels
to perform clustering is introduced. The proposed metric is used to perform
hard and soft partitioning-based clustering. First, a broad simulation study
examines the behavior of the proposed metric in crisp clustering using hierarchical
and PAM procedures. Then, a novel fuzzy C-medoids algorithm based on
the QAF-dissimilarity is proposed. Three different robust versions of this fuzzy
algorithm are also presented to deal with data containing outlier time series.
Finally, other ways of soft clustering analysis are explored, namely probabilistic
D-clustering and clustering based on mixture models.
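As an illustration of the QAF-based dissimilarity described in this abstract, here is a minimal sketch. It assumes the usual definition of the quantile autocovariance at lag l for levels (tau1, tau2) as the covariance between the indicators I(X_t <= q_tau1) and I(X_{t+l} <= q_tau2); the quantile rule, the default lags and levels, and all function names are illustrative assumptions, not the thesis's implementation.

```python
import math


def empirical_quantile(series, tau):
    """Simple order-statistic quantile (one of several possible conventions)."""
    ordered = sorted(series)
    k = max(1, math.ceil(tau * len(ordered)))
    return ordered[k - 1]


def sample_qaf(series, lag, tau1, tau2):
    """Sample quantile autocovariance at a given lag and pair of levels."""
    q1 = empirical_quantile(series, tau1)
    q2 = empirical_quantile(series, tau2)
    n = len(series) - lag
    joint = sum(1 for t in range(n)
                if series[t] <= q1 and series[t + lag] <= q2)
    return joint / n - tau1 * tau2


def qaf_dissimilarity(x, y, lags=(1, 2), taus=(0.1, 0.5, 0.9)):
    """Squared Euclidean distance between the QAF feature vectors of x and y."""
    return sum((sample_qaf(x, l, a, b) - sample_qaf(y, l, a, b)) ** 2
               for l in lags for a in taus for b in taus)
```

The resulting pairwise dissimilarities could then be passed to a hierarchical, PAM, or fuzzy C-medoids routine; choosing the lags and quantile levels is the step the thesis treats with a dedicated algorithm, which is not reproduced here.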
Hierarchical clustering for smart meter electricity loads based on quantile autocovariances
In order to improve the efficiency and sustainability of electricity systems, most countries worldwide are
deploying advanced metering infrastructures, and in particular household smart meters, in the residential sector.
This technology is able to record electricity load time series at very high frequency, information that can
be exploited to develop new clustering models to group individual households by similar consumption patterns.
To this end, in this work we propose three hierarchical clustering methodologies that allow capturing different
characteristics of the time series. These are based on a set of “dissimilarity” measures computed over different
features: quantile auto-covariances, and simple and partial autocorrelations. The main advantage is that they allow
summarizing each time series in a few representative features so that they are computationally efficient, robust
against outliers, easy to automate, and scalable to hundreds of thousands of smart meter series. We evaluate the
performance of each clustering model in a real-world smart meter dataset with thousands of half-hourly time series.
The results show how the obtained clusters identify relevant consumption behaviors of households and capture part
of their geo-demographic segmentation. Moreover, we apply a supervised classification procedure to explore which
features are more relevant to define each cluster. This work was supported in part by the Spanish
Government through Project under Grant MTM2017-88979-P, and in part by
the Fundación Iberdrola through “Ayudas a la Investigación en Energía y
Medio Ambiente 2018.” The work of Andrés M. Alonso was supported in part
by the Spanish Government through Project under Grant ECO2015-66593-P.
Paper no. TSG-01702-2019
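The feature-based idea in this abstract (summarize each load series by a few features, then compare the summaries) can be sketched for the simple autocorrelation case. This is only an assumed illustration: the function names, the number of lags, and the plain Euclidean distance are choices made here, and the resulting distance matrix would still need to be passed to a hierarchical clustering routine.

```python
import math


def acf_features(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag (biased estimator, divisor n)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n / c0
            for k in range(1, max_lag + 1)]


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(all_series, max_lag=2):
    """Pairwise distances between series, computed on their ACF features."""
    feats = [acf_features(s, max_lag) for s in all_series]
    return [[euclidean(f, g) for g in feats] for f in feats]
```

Because each series is reduced to `max_lag` numbers before any comparison, the cost of the pairwise step does not depend on the series length, which is the scalability argument made in the abstract.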
Quantile Cross-Spectral Density: A Novel and Effective Tool for Clustering Multivariate Time Series
Funded for open access publication: Universidade da Coruña/CISUG. [Abstract] Clustering of multivariate time series is a central problem in data mining with applications in many fields. Frequently, the clustering target is to identify groups of series generated by the same multivariate stochastic process. Most of the approaches to this problem either include a prior dimensionality reduction step, which may result in a loss of information, or consider dissimilarity measures based on correlations and cross-correlations that ignore the serial dependence structure. We propose a novel approach to measure dissimilarity between multivariate time series aimed at jointly capturing both cross dependence and serial dependence. Specifically, each series is characterized by a set of matrices of estimated quantile cross-spectral densities, where each matrix corresponds to a pair of quantile levels. Then the dissimilarity between every pair of series is evaluated by comparing their estimated quantile cross-spectral densities, and the pairwise dissimilarity matrix is taken as the starting point to develop a partitioning around medoids algorithm. Since the quantile-based cross-spectra capture dependence in quantiles of the joint distribution, the proposed metric has a high capability to discriminate between high-level dependence structures. An extensive simulation study shows that our clustering procedure outperforms a wide range of alternative methods and exhibits robustness to the noise distribution, besides being computationally efficient. A real data application involving bivariate financial time series illustrates the usefulness of the proposed approach. The procedure is also applied to cluster nonstationary series from the UEA multivariate time series classification archive. This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF).
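The abstract takes a pairwise dissimilarity matrix as the input to partitioning around medoids. A minimal sketch of the swap phase follows, assuming the matrix has already been computed; it uses a greedy first-improvement swap rather than the full build-and-swap procedure, and the function names are illustrative.

```python
def total_cost(dist, medoids):
    """Sum over all items of the dissimilarity to the closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam_swap(dist, initial_medoids):
    """Greedy swap phase: accept any medoid/non-medoid swap that lowers the cost."""
    medoids = list(initial_medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(len(medoids)):
            for cand in range(len(dist)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids = trial
                    improved = True
    return sorted(medoids)


def assign(dist, medoids):
    """Hard assignment of every item to its closest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]
```

Since only the dissimilarity matrix is used, the same routine works whatever features (quantile cross-spectral or otherwise) produced the dissimilarities.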
Fuzzy clustering of ordinal time series based on two novel distances with economic applications
Time series clustering is a central machine learning task with applications
in many fields. While the majority of the methods focus on real-valued time
series, very few works consider series with discrete response. In this paper,
the problem of clustering ordinal time series is addressed. To this aim, two
novel distances between ordinal time series are introduced and used to
construct fuzzy clustering procedures. Both metrics are functions of the
estimated cumulative probabilities, thus automatically taking advantage of the
ordering inherent to the series' range. The resulting clustering algorithms are
computationally efficient and able to group series generated from similar
stochastic processes, reaching accurate results even though the series come
from a wide variety of models. Since the dynamics of the series may vary over
time, we adopt a fuzzy approach, thus enabling the procedures to locate
each series into several clusters with different membership degrees. An
extensive simulation study shows that the proposed methods outperform several
alternative procedures. Weighted versions of the clustering algorithms are also
presented and their advantages with respect to the original methods are
discussed. Two specific applications involving economic time series illustrate
the usefulness of the proposed approaches.
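To show why working with cumulative probabilities respects the ordering of the categories, here is a deliberately simplified sketch. It uses only estimated marginal cumulative probabilities and a squared Euclidean comparison; it is a hypothetical stand-in for illustration and does not reproduce the two distances introduced in the paper. Categories are assumed to be coded as integers 0, 1, ..., n_categories - 1.

```python
def cumulative_probs(series, n_categories):
    """Estimated P(X_t <= s_i) for every category except the last (always 1)."""
    counts = [0] * n_categories
    for x in series:
        counts[x] += 1
    cum, running = [], 0
    for c in counts[:-1]:
        running += c
        cum.append(running / len(series))
    return cum


def cumulative_distance(x, y, n_categories):
    """Squared Euclidean distance between estimated cumulative probabilities."""
    fx = cumulative_probs(x, n_categories)
    fy = cumulative_probs(y, n_categories)
    return sum((a - b) ** 2 for a, b in zip(fx, fy))
```

With three categories, a series stuck in category 0 is at distance 1.0 from one stuck in category 1 but at distance 2.0 from one stuck in category 2, so categories that are further apart in the ordering yield larger distances. A comparison of plain category frequencies would treat both pairs alike.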
Copula-based fuzzy clustering of spatial time series
This paper contributes to the existing literature on the analysis of spatial time series by presenting a new clustering algorithm called COFUST, i.e. COpula-based FUzzy clustering algorithm for Spatial Time series. The underlying idea of this algorithm is to perform a fuzzy Partitioning Around Medoids (PAM) clustering using a copula-based approach to interpret comovements of time series. This generalisation makes it possible both to extend the usual clustering methods for time series based on Pearson’s correlation and to capture the uncertainty that arises when assigning units to clusters. Furthermore, its flexibility makes it possible to include spatial information directly in the algorithm. Our approach is presented and discussed using both simulated and real data, highlighting its main advantages.
The Bootstrap for Testing the Equality of Two Multivariate Stochastic Processes with an Application to Financial Markets
[Abstract] The problem of testing the equality of the generating processes of two multivariate time series is addressed in this work. To this end, we construct two tests based on a distance measure between stochastic processes. The metric is defined in terms of the quantile cross-spectral densities of both processes. A proper estimate of this dissimilarity is the cornerstone of the proposed tests. Both techniques are based on the bootstrap. Specifically, extensions of the moving block bootstrap and the stationary bootstrap are used for their construction. The approaches are assessed in a broad range of scenarios under the null and the alternative hypotheses. The results from the analyses show that the procedure based on the stationary bootstrap exhibits the best overall performance in terms of both size and power. The proposed techniques are used to answer the question of whether or not the dotcom bubble crash of the 2000s permanently impacted global market behavior. This research has been supported by MINECO (MTM2017-82724-R and PID2020-113578RB-100), the Xunta de Galicia (ED431C-2020-14), and “CITIC” (ED431G 2019/01).
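For readers unfamiliar with the resampling scheme mentioned above, the following is a minimal sketch of the basic moving block bootstrap. It is the textbook version, not the extension constructed in the paper; the block length and function name are illustrative assumptions. Passing a list of observation vectors (one per time point) resamples whole rows, which keeps the cross-sectional dependence of a multivariate series intact.

```python
import random


def moving_block_bootstrap(series, block_len, rng=None):
    """Resample a series by concatenating randomly chosen contiguous blocks.

    Blocks of `block_len` consecutive observations are drawn with replacement
    from all overlapping blocks and joined until the original length is
    reached; the last block is truncated if necessary.
    """
    rng = rng or random.Random()
    n = len(series)
    last_start = n - block_len
    out = []
    while len(out) < n:
        start = rng.randint(0, last_start)  # randint includes both endpoints
        out.extend(series[start:start + block_len])
    return out[:n]
```

The stationary bootstrap differs in that block lengths are random (geometrically distributed) rather than fixed; it is not sketched here.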
Spatio-temporal clustering: Neighbourhoods based on median seasonal entropy
In this research, a new uncertainty clustering method is developed and applied to spatial time series with seasonality. The new unsupervised grouping method is based on Neighbourhoods and Median Seasonal Entropy. This classification method aims to discover similar behaviours within a group of time series and to find a dissimilarity measure with respect to a reference series r. The Neighbourhood’s Internal Verification Coefficient criterion makes it possible to measure intra-group similarity. This clustering criterion can accommodate spatial information. Our empirical approach allows us to measure accommodation decisions for tourists who visit Spain and decide to stay either in hotels or in tourist apartments. The results show the existence of dynamic seasonal patterns of behaviour. These insights support the decisions of economic agents. This research is associated with the “Social Indicators-SEJ157” group of the Faculty of Economic and Business Sciences at the University of Malaga. The research group has funded the professional English editing service. Funding for open access charge: Universidad de Málaga/CBUA.
Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences
Funded for open access publication: Universidade da Coruña/CISUG. [Abstract] Two novel distances between categorical time series are introduced. Both of them measure discrepancies between extracted features describing the underlying serial dependence patterns. One distance is based on well-known association measures, namely Cramér's v and Cohen's κ. The other one relies on the so-called binarization of a categorical process, which indicates the presence of each category by means of a canonical vector. Binarization is used to construct a set of innovative association measures which make it possible to identify different types of serial dependence. The metrics are used to perform crisp and fuzzy clustering of nominal series. The proposed approaches are able to group together series generated from similar stochastic processes, achieve accurate results with series coming from a broad range of models, and are computationally efficient. Extensive simulation studies show that both hard and soft clustering algorithms outperform several alternative procedures proposed in the literature. Two applications involving biological sequences from different species highlight the usefulness of the introduced techniques. The research of Ángel López-Oriona and José A. Vilar has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). The author Ángel López-Oriona is very grateful to researcher Maite Freire for her lessons about DNA theory.
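One of the association measures named above, Cramér's v, can be evaluated at a given lag to summarize serial dependence in a categorical series. The sketch below assumes the common sample form v(l) = sqrt( (1/(r-1)) * sum_ij (p_ij(l) - pi_i pi_j)^2 / (pi_i pi_j) ), with r categories, marginal frequencies pi taken from the whole series, and lagged joint frequencies p_ij(l); the function name and these estimation details are assumptions, not the paper's code.

```python
import math
from collections import Counter


def cramers_v_at_lag(series, lag, categories):
    """Sample Cramér's v between X_t and X_{t+lag} for a categorical series."""
    n = len(series)
    marginal = Counter(series)
    pi = {c: marginal[c] / n for c in categories}
    pairs = Counter(zip(series[:-lag], series[lag:]))
    n_pairs = n - lag
    total = 0.0
    for a in categories:
        for b in categories:
            expected = pi[a] * pi[b]
            if expected > 0:  # skip categories that never occur
                total += (pairs[(a, b)] / n_pairs - expected) ** 2 / expected
    return math.sqrt(total / (len(categories) - 1))
```

Values close to 0 suggest little dependence at that lag and values close to 1 suggest strong dependence; with short series the estimate can slightly exceed 1 because marginal and lagged frequencies are computed from different numbers of observations. A feature vector of such values over several lags could then be compared across series.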