Kernel Spectral Clustering and applications
In this chapter we review the main literature related to kernel spectral
clustering (KSC), an approach to clustering cast within a kernel-based
optimization setting. KSC represents a least-squares support vector machine
based formulation of spectral clustering described by a weighted kernel PCA
objective. Just as in the classifier case, the binary clustering model is
expressed by a hyperplane in a high dimensional space induced by a kernel. In
addition, the multi-way clustering can be obtained by combining a set of binary
decision functions via an Error Correcting Output Codes (ECOC) encoding scheme.
Because of its model-based nature, the KSC method encompasses three main steps:
training, validation, testing. In the validation stage model selection is
performed to obtain tuning parameters, like the number of clusters present in
the data. This is a major advantage compared to classical spectral clustering
where the determination of the clustering parameters is unclear and relies on
heuristics. Once a KSC model is trained on a small subset of the entire data,
it is able to generalize well to unseen test points. Beyond the basic
formulation, sparse KSC algorithms based on the Incomplete Cholesky
Decomposition (ICD) and on sparsity-inducing regularization such as Group Lasso are
reviewed. In that respect, we show how it is possible to handle large scale
data. Also, two possible ways to perform hierarchical clustering and a soft
clustering method are presented. Finally, real-world applications such as image
segmentation, power load time-series clustering, document clustering and big
data learning are considered.
Comment: chapter contribution to the book "Unsupervised Learning Algorithms"
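The training and out-of-sample steps described in the abstract above can be sketched for the binary (two-cluster) case. This is a minimal sketch under stated assumptions, not the chapter's implementation: it assumes the commonly cited KSC dual eigenproblem D^-1 M_D Omega alpha = lambda alpha with an RBF kernel, it uses plain power iteration in place of a full eigensolver, and the names `rbf`, `ksc_train` and `ksc_assign` are hypothetical.

```python
import math
import random


def rbf(x, y, sigma):
    """RBF kernel between two points (the kernel choice is an assumption)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))


def ksc_train(points, sigma, iters=300, seed=0):
    """Approximate the leading eigenvector of D^-1 M_D Omega by power iteration."""
    n = len(points)
    omega = [[rbf(p, q, sigma) for q in points] for p in points]
    deg = [sum(row) for row in omega]          # diagonal of the degree matrix D
    s = sum(1.0 / d for d in deg)              # 1^T D^-1 1
    rng = random.Random(seed)
    alpha = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        t = [sum(w * a for w, a in zip(row, alpha)) for row in omega]  # Omega alpha
        c = sum(ti / di for ti, di in zip(t, deg)) / s                 # weighted mean
        alpha = [(ti - c) / di for ti, di in zip(t, deg)]              # D^-1 M_D Omega alpha
        norm = math.sqrt(sum(a * a for a in alpha))
        alpha = [a / norm for a in alpha]
    t = [sum(w * a for w, a in zip(row, alpha)) for row in omega]
    bias = -sum(ti / di for ti, di in zip(t, deg)) / s
    return alpha, bias


def ksc_assign(train, alpha, bias, x, sigma):
    """Out-of-sample extension: the sign of the score gives the cluster indicator."""
    score = sum(a * rbf(x, p, sigma) for a, p in zip(alpha, train)) + bias
    return 1 if score >= 0 else -1
```

The model is trained once on a subset and then applied to unseen points through `ksc_assign`. A multi-way model would use k-1 eigenvectors and a codebook of sign patterns, which this sketch omits.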
Flow time series clustering for demand pattern recognition in drinking water distribution systems: New insights about the most adequate methods
This study presents a proposal of clustering methodologies for demand pattern recognition
using network flow data collected from a large set of drinking water distribution networks in
Portugal. Most of the existing studies about clustering in flow time series rely on hierarchical
or k-Means clustering algorithms with inelastic distance measures. This study explores
alternative clustering algorithms, distance measures, comparison time windows, internal
index metrics and clustering prototypes. The performance of the alternative clustering
methodology was assessed in terms of multiple internal index metrics and the characterization
of the cluster centroids.
The best-performing methods were the Partition Algorithm with DTW (dynamic time warping)
distance, PAM prototype and a 15-minute time window, and the Partition Algorithm with GAK
(global alignment kernel) distance, PAM prototype and a 15-minute time window, because they
allow a clear partition of the flow time series into three clusters. The first method
identifies a night consumption pattern, a
typical weekend pattern and a typical working day pattern, whereas the second one identifies
a pattern with small variability between night and daily consumption.
To improve knowledge extraction, in terms of typical and anomalous existing patterns,
additional clustering operations were performed with the flow data set that belongs to
the cluster with small variability between night and daily consumption. New clusters were
identified and characterized regarding weekday, geographical location, and dry months and
wet months, showing that patterns associated with garden irrigation are independent of the
period of the day and season of the year, which indicates inefficient water use.
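The combination of an elastic distance with medoid prototypes described in this abstract can be illustrated as follows. This is a rough sketch, not the study's pipeline: it assumes an absolute-difference local cost for DTW, it uses a simple alternating assign/update k-medoids loop rather than the full PAM swap search, and the names `dtw_distance` and `k_medoids` are hypothetical. The GAK distance and the 15-minute windowing are not reproduced.

```python
def dtw_distance(s, t):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    m = len(t)
    prev = [0.0] + [inf] * m                  # row 0 of the cumulative cost table
    for i in range(1, len(s) + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def k_medoids(series, k, init, max_iter=50):
    """Alternating k-medoids on a precomputed DTW matrix (simplified, not full PAM)."""
    n = len(series)
    dist = [[dtw_distance(a, b) for b in series] for a in series]

    def assign(medoids):
        # Each series goes to the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(init)
    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:                   # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)
```

Because DTW allows warping along the time axis, two daily profiles with the same peak at slightly different times can have zero distance, which an inelastic measure such as the Euclidean distance would not give.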
Mining Extremes through Fuzzy Clustering
Archetypes are extreme points that synthesize data representing "pure" individual types.
Archetypes are characterized by the most discriminating features of the data points and are almost
always useful in applications where one is interested in extremes rather than commonalities.
Recent applications include talent analysis in sports and science, fraud detection,
profiling of users and products in recommendation systems, climate extremes, as well as
other machine learning applications.
The furthest-sum Archetypal Analysis (FS-AA) (Mørup and Hansen, 2012) and the
Fuzzy Clustering with Proportional Membership (FCPM) (Nascimento, 2005) propose
distinct models to find clusters with extreme prototypes. Even though the FCPM model
does not impose its prototypes to lie in the convex hull of data, it belongs to the framework
of data recovery from clustering (Mirkin, 2005), a powerful property for unsupervised
cluster analysis. The baseline version of FCPM, FCPM-0, provides central prototypes
whereas its smooth version, FCPM-2, provides extreme prototypes, like AA archetypes.
The comparative study between FS-AA and FCPM algorithms conducted in this dissertation
covers the following aspects. First, the analysis of FS-AA on data recovery from
clustering using a collection of 100 data sets of diverse dimensionalities, generated with
a proper data generator (FCPM-DG), as well as 14 real-world data sets. Second, testing the
robustness of the clustering algorithms in the presence of outliers, noting the peculiar behaviour
of FCPM-0 in removing the proper number of prototypes from the data. Third, a
collection of five popular fuzzy validation indices is explored for assessing the quality
of clustering results. Fourth, the algorithms undergo a study to evaluate how different
initializations affect their convergence as well as the quality of the clustering partitions.
The Iterative Anomalous Pattern (IAP) algorithm improves the convergence of the
FCPM algorithm and allows fine-tuning the level of resolution at which clustering results are
viewed, which is an advantage over FS-AA. Proper visualization functionalities for FS-AA and
FCPM support the easy interpretation of the clustering results.
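The "furthest-sum" idea behind FS-AA, selecting candidate archetypes that lie far from each other, can be sketched as an initialization routine. This is a simplified sketch of the procedure as it is commonly described, not the dissertation's code: the Euclidean distance, the single seed-replacement pass and the name `furthest_sum` are assumptions made for illustration.

```python
import math


def furthest_sum(points, k, seed_index=0):
    """Pick k indices, each maximizing the summed distance to those already chosen."""
    n = len(points)
    chosen = [seed_index]
    # sums[i] holds the total distance from point i to all chosen points.
    sums = [math.dist(p, points[seed_index]) for p in points]
    while len(chosen) < k:
        nxt = max((i for i in range(n) if i not in chosen), key=lambda i: sums[i])
        chosen.append(nxt)
        for i in range(n):
            sums[i] += math.dist(points[i], points[nxt])
    # The seed was arbitrary, so drop it and select a replacement the same way.
    first = chosen.pop(0)
    for i in range(n):
        sums[i] -= math.dist(points[i], points[first])
    chosen.append(max((i for i in range(n) if i not in chosen), key=lambda i: sums[i]))
    return chosen
```

On a point cloud whose extremes are well defined, the selected indices tend to fall on the boundary of the data even when the seed is an interior point, which matches the interest in extremes rather than central prototypes.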
- …