502 research outputs found
Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering
We propose two probability-like measures of individual cluster-membership
certainty which can be applied to a hard partition of the sample such as that
obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical
clustering or k-means clustering. One measure extends the individual silhouette
widths and the other is obtained directly from the pairwise dissimilarities in
the sample. Unlike the classic silhouette, however, the measures behave like
probabilities and can be used to investigate an individual's tendency to belong
to a cluster. We also suggest two possible ways to evaluate the hard partition.
We evaluate the performance of both measures in individuals with ambiguous
cluster membership, using simulated binary datasets that have been partitioned
by the PAM algorithm or continuous datasets that have been partitioned by
hierarchical clustering and k-means clustering. For comparison, we also present
results from soft clustering algorithms such as fuzzy analysis clustering
(FANNY) and two model-based clustering methods. Our proposed measures perform
comparably to the posterior-probability estimators from either FANNY or the
model-based clustering methods. We also illustrate the proposed measures by
applying them to Fisher's classic iris data set
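As background for the abstract above, the following is a minimal sketch of the classic individual silhouette width, s(i) = (b - a) / max(a, b), computed directly from a pairwise dissimilarity matrix and a hard partition. It is not the probability-like measures proposed in the paper, which the abstract does not specify. The function name and the toy data are illustrative assumptions, and at least two clusters are assumed.

```python
def silhouette_widths(dissim, labels):
    """Classic silhouette width for each observation of a hard partition.

    dissim: square matrix (list of lists) of pairwise dissimilarities.
    labels: cluster label of each observation (at least two clusters assumed).
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dissim[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            widths.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        a = mean_to[own]                                    # within-cluster mean
        b = min(v for c, v in mean_to.items() if c != own)  # nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy one-dimensional example (hypothetical data).
points = [0.0, 1.0, 5.0, 6.0]
dissim = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
widths = silhouette_widths(dissim, labels)
```

Values close to 1 indicate an observation that sits well inside its cluster; values near 0 or below indicate ambiguous membership, which is the situation the abstract focuses on.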
Fuzzy clustering with spatial-temporal information
Clustering geographical units based on a set of quantitative features observed at several time occasions requires dealing with the complexity of both space and time information. In particular, one should consider (1) the spatial nature of the units to be clustered, (2) the characteristics of the space of multivariate time trajectories, and (3) the uncertainty related to the assignment of a geographical unit to a given cluster on the basis of the above complex features. This paper discusses a novel spatially constrained multivariate time series clustering for units characterised by different levels of spatial proximity. In particular, the Fuzzy Partitioning Around Medoids algorithm with Dynamic Time Warping dissimilarity measure and spatial penalization terms is applied to classify multivariate spatial-temporal series. The clustering method has been theoretically presented and discussed using both simulated and real data, highlighting its main features. In particular, the capability of embedding different levels of proximity among units, and the ability of considering time series with different length
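The Dynamic Time Warping dissimilarity mentioned above can be illustrated with a minimal sketch of the textbook dynamic-programming recursion. This version is univariate, uses the absolute difference as local cost, and ignores the spatial penalization terms, so it is only an illustration under those assumptions and not the multivariate procedure of the paper. The function name is hypothetical.

```python
def dtw_distance(x, y):
    """Textbook DTW distance between two univariate series of any lengths."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats
                cost[i][j - 1],      # y advances, x repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Series of different lengths can be compared directly.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because the warping path may repeat elements, the two series need not have the same length, which is the property the abstract highlights.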
Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences
Funded for open access publication: Universidade da Coruña/CISUG. [Abstract]: Two novel distances between categorical time series are introduced. Both of them measure discrepancies between extracted features describing the underlying serial dependence patterns. One distance is based on well-known association measures, namely Cramér's V and Cohen's κ. The other one relies on the so-called binarization of a categorical process, which indicates the presence of each category by means of a canonical vector. Binarization is used to construct a set of innovative association measures that make it possible to identify different types of serial dependence. The metrics are used to perform crisp and fuzzy clustering of nominal series. The proposed approaches are able to group together series generated from similar stochastic processes, achieve accurate results with series coming from a broad range of models, and are computationally efficient. Extensive simulation studies show that both hard and soft clustering algorithms outperform several alternative procedures proposed in the literature. Two applications involving biological sequences from different species highlight the usefulness of the introduced techniques. Xunta de Galicia; ED431G 2019/01. Xunta de Galicia; ED431C-2020-14. The research of Ángel López-Oriona and José A. Vilar has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by Universidade da Coruña/CISUG. The author Ángel López-Oriona is very grateful to researcher Maite Freire for her lessons about DNA theory
A Machine Learning-Based Framework for Clustering Residential Electricity Load Profiles to Enhance Demand Response Programs
Load shapes derived from smart meter data are frequently employed to analyze
daily energy consumption patterns, particularly in the context of applications
like Demand Response (DR). Nevertheless, one of the most important challenges
to this endeavor lies in identifying the most suitable consumer clusters with
similar consumption behaviors. In this paper, we present a novel machine
learning based framework in order to achieve optimal load profiling through a
real case study, utilizing data from almost 5000 households in London. Four
widely used clustering algorithms are applied, specifically K-means, K-medoids,
Hierarchical Agglomerative Clustering and Density-based Spatial Clustering. An
empirical analysis as well as multiple evaluation metrics are leveraged to
assess those algorithms. Following that, we redefine the problem as a
probabilistic classification one, with the classifier emulating the behavior of
a clustering algorithm, leveraging Explainable AI (xAI) to enhance the
interpretability of our solution. According to the clustering algorithm
analysis the optimal number of clusters for this case is seven. Despite that,
our methodology shows that two of the clusters, almost 10% of the dataset,
exhibit significant internal dissimilarity and thus it splits them even further
to create nine clusters in total. The scalability and versatility of our
solution make it an ideal choice for power utility companies aiming to segment
their users for creating more targeted Demand Response programs. Comment: 29 pages, 19 figures
Quantile-Based Fuzzy Clustering of Multivariate Time Series in the Frequency Domain
Funded for open access publication: Universidade da Coruña/CISUG. [Abstract] A novel procedure to perform fuzzy clustering of multivariate time series generated from different dependence models is proposed. Different amounts of dissimilarity between the generating models or changes in the dynamic behaviours over time are some arguments justifying a fuzzy approach, where each series is associated to all the clusters with specific membership levels. Our procedure considers quantile-based cross-spectral features and consists of three stages: (i) each element is characterized by a vector of proper estimates of the quantile cross-spectral densities, (ii) principal component analysis is carried out to capture the main differences reducing the effects of the noise, and (iii) the squared Euclidean distance between the first retained principal components is used to perform clustering through the standard fuzzy C-means and fuzzy C-medoids algorithms. The performance of the proposed approach is evaluated in a broad simulation study where several types of generating processes are considered, including linear, nonlinear and dynamic conditional correlation models. Assessment is done in two different ways: by directly measuring the quality of the resulting fuzzy partition and by taking into account the ability of the technique to determine the overlapping nature of series located equidistant from well-defined clusters. The procedure is compared with the few alternatives suggested in the literature, substantially outperforming all of them whatever the underlying process and the evaluation scheme. Two specific applications involving air quality and financial databases illustrate the usefulness of our approach. The authors are grateful to the anonymous referees for their comments and suggestions. The research of Ángel López-Oriona and José A. Vilar has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by Universidade da Coruña/CISUG. Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0
Augmented Session Similarity Based Framework for Measuring Web User Concern from Web Server Logs
In this paper, an augmented session similarity based framework is proposed to measure web user concern from web server logs. The proposed framework considers the best usage similarity between two web sessions based on accessed page relevance and the URL-based syntactic structure of the website within the session. The framework is implemented using K-medoids clustering algorithms with independent and combined similarity measures. Cluster quality is evaluated by measuring average intra-cluster and inter-cluster distances. The experimental results show that the combined augmented session dissimilarity metric outperformed the independent augmented session dissimilarity measures in terms of cluster validity measures.
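The cluster-quality evaluation mentioned above can be sketched as follows. This is one simple reading of "average intra-cluster and inter-cluster distances", pooling all pairs of observations; the paper may define the averages differently (for example, per cluster). The function name and toy data are illustrative assumptions.

```python
from itertools import combinations


def average_intra_inter(dissim, labels):
    """Mean dissimilarity over within-cluster pairs and over between-cluster pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            intra.append(dissim[i][j])
        else:
            inter.append(dissim[i][j])

    def mean(values):
        # NaN signals that no pair of that kind exists.
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy example (hypothetical data): two tight, well-separated clusters.
points = [0.0, 1.0, 5.0, 6.0]
dissim = [[abs(p - q) for q in points] for p in points]
intra_avg, inter_avg = average_intra_inter(dissim, [0, 0, 1, 1])
```

A good partition under this criterion has a small average intra-cluster distance relative to the average inter-cluster distance.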
Review of Clustering Methods for Slow Coherency-Based Generator Grouping
Slow coherency is one of the most relevant concepts used in power systems dynamics to group generators that exhibit similar response to disturbances. Among the approaches developed for generator grouping based on slow coherency, clustering algorithms play a significant role. This paper reviews the clustering algorithms applied in model-based and data-driven approaches, highlighting the metrics used, the feature selection, the types of algorithms and the comparison among the results obtained considering simulated or measured data
Fuzzy clustering of spatial interval-valued data
In this paper, two fuzzy clustering methods for spatial interval-valued
data are proposed, i.e. the fuzzy C-Medoids clustering
of spatial interval-valued data with and without entropy regularization.
Both methods are based on the Partitioning Around
Medoids (PAM) algorithm, inheriting the great advantage of
obtaining non-fictitious representative units for each cluster.
In both methods, the units are endowed with a relation
of contiguity, represented by a symmetric binary matrix. This
can be understood both as contiguity in a physical space and as
a more abstract notion of contiguity. The performance of the
methods is demonstrated by simulation, testing the methods with
different contiguity matrices associated with natural clusters of
units. In order to show the effectiveness of the methods in
empirical studies, three applications are presented: the clustering
of municipalities based on interval-valued pollutants levels, the
clustering of European fact-checkers based on interval-valued
data on the average number of impressions received by their
tweets and the clustering of the residential zones of the city of
Rome based on the interval of price values
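To make the fuzzy C-medoids idea above more concrete, here is a minimal sketch of the standard membership update used in fuzzy C-medoids and fuzzy C-means, u_ic proportional to (1 / d_ic)^(1 / (m - 1)), given the dissimilarities between each unit and each medoid. It deliberately omits the spatial contiguity term and the entropy regularization described in the abstract, whose exact form is not given here. The function name and the fuzziness parameter default m = 2 are assumptions.

```python
def fuzzy_memberships(dist_to_medoids, m=2.0):
    """Standard fuzzy membership degrees from unit-to-medoid dissimilarities.

    dist_to_medoids: one row per unit, one column per medoid (non-negative).
    m: fuzziness parameter, m > 1 (larger values give softer memberships).
    """
    memberships = []
    for row in dist_to_medoids:
        zeros = [k for k, d in enumerate(row) if d == 0]
        if zeros:
            # The unit coincides with a medoid: full membership goes there.
            u = [1.0 / len(zeros) if k in zeros else 0.0 for k in range(len(row))]
        else:
            inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in row]
            total = sum(inv)
            u = [v / total for v in inv]
        memberships.append(u)
    return memberships


# Hypothetical dissimilarities of three units to two medoids.
u = fuzzy_memberships([[1.0, 3.0], [0.0, 2.0], [2.0, 2.0]])
```

Each row of the result sums to 1, and a unit equidistant from both medoids receives equal membership in both clusters.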
Fuzzy clustering of ordinal time series based on two novel distances with economic applications
Time series clustering is a central machine learning task with applications
in many fields. While the majority of the methods focus on real-valued time
series, very few works consider series with discrete response. In this paper,
the problem of clustering ordinal time series is addressed. To this aim, two
novel distances between ordinal time series are introduced and used to
construct fuzzy clustering procedures. Both metrics are functions of the
estimated cumulative probabilities, thus automatically taking advantage of the
ordering inherent to the series' range. The resulting clustering algorithms are
computationally efficient and able to group series generated from similar
stochastic processes, reaching accurate results even though the series come
from a wide variety of models. Since the dynamics of the series may vary over
time, we adopt a fuzzy approach, thus enabling the procedures to assign
each series to several clusters with different membership degrees. An
extensive simulation study shows that the proposed methods outperform several
alternative procedures. Weighted versions of the clustering algorithms are also
presented and their advantages with respect to the original methods are
discussed. Two specific applications involving economic time series illustrate
the usefulness of the proposed approaches
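The idea of building a distance on estimated cumulative probabilities can be illustrated with a deliberately simplified sketch. It uses only the marginal cumulative probabilities of each series and a Euclidean distance between them; the two distances proposed in the paper are not specified in the abstract and are likely richer, so this is an assumption-laden illustration rather than the authors' method. Categories are assumed to be coded as integers 0, 1, ..., n_categories - 1 in their natural order, and the function names are hypothetical.

```python
def cumulative_probabilities(series, n_categories):
    """Estimated P(X <= k) for k = 0, ..., n_categories - 2."""
    counts = [0] * n_categories
    for value in series:
        counts[value] += 1
    total = len(series)
    cumulative, running = [], 0
    # The last cumulative probability is always 1, so it is dropped.
    for count in counts[:-1]:
        running += count
        cumulative.append(running / total)
    return cumulative


def cumulative_distance(x, y, n_categories):
    """Euclidean distance between estimated marginal cumulative probabilities."""
    fx = cumulative_probabilities(x, n_categories)
    fy = cumulative_probabilities(y, n_categories)
    return sum((a - b) ** 2 for a, b in zip(fx, fy)) ** 0.5


# Two short hypothetical ordinal series with three ordered categories.
d = cumulative_distance([0, 0, 1, 2], [2, 2, 1, 0], 3)
```

Working with cumulative rather than raw category probabilities means that mass shifted to a neighbouring category counts as a smaller change than mass shifted to a distant one, which is how the ordering of the range enters the distance.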