Automatic identification of the number of clusters in hierarchical clustering
Hierarchical clustering is one of the most suitable tools for discovering the underlying structure of a dataset in unsupervised learning, where the ground truth is unknown and classical machine learning classifiers are not applicable. In many real applications it provides a perspective on the inner structure of the data and is preferred to partitional methods. However, determining the resulting number of clusters in hierarchical clustering requires human expertise to deduce it from the dendrogram, and this represents a major challenge in building fully automatic systems such as those required for decision support in Industry 4.0. This research proposes a general criterion to perform the cut of a dendrogram automatically, comparing six original criteria based on the Calinski-Harabasz index. The performance of each criterion on 95 real-life dendrograms of different topologies is evaluated against the number of classes proposed by experts, and a winning criterion is determined. This research is framed within a larger project to build an Intelligent Decision Support System that assesses the performance of 3D printers from sensor data in real time, although the proposed criteria can be used in other real applications of hierarchical clustering. The methodology is applied to a real-life dataset from the 3D printers, and the large reduction in CPU time is shown by comparing the CPU time before and after this modification of the entire clustering method. The approach also reduces dependence on a human expert to provide the number of clusters by inspecting the dendrogram. Furthermore, it allows hierarchical clustering to be applied in automatic mode in real-life industrial applications, enables continuous monitoring of real 3D printers in production, and helps in building an Intelligent Decision Support System to detect operational modes, anomalies, and other behavioral patterns.
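The core idea of cutting a dendrogram automatically by scoring candidate cuts with the Calinski-Harabasz index can be sketched as follows. This is a minimal illustration under assumed parameters (candidate range, Ward linkage, synthetic data), not a reproduction of the paper's six criteria:

```python
# Sketch: cut a dendrogram at each candidate k and keep the cut that
# maximizes the Calinski-Harabasz index. Candidate range is illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def auto_cut(X, k_min=2, k_max=8):
    """Build a Ward dendrogram once, then score every candidate cut."""
    Z = linkage(X, method="ward")
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(k_min, k_max + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = calinski_harabasz_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
k, labels = auto_cut(X)
```

For well-separated clusters the index peaks sharply at the true number of groups, which is what makes an automatic cut feasible without inspecting the dendrogram.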
An approach to validity indices for clustering techniques in Big Data
Clustering analysis is one of the most used Machine Learning techniques to discover groups among data objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist several cluster validity indices that help us to approximate the optimal number of clusters of the dataset. However, such indices are not suitable to deal with Big Data due to its size limitation and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low computational time. Our indices are based on redefinitions of traditional indices by simplifying the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyze the performance of the proposed indices. First, we test the indices with small and medium size datasets to verify that our indices have a similar effectiveness to the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that both indices can handle Big Data in a very low computational time, with an effectiveness similar to the traditional indices, using the Apache Spark framework.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
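The key simplification described above, redefining the intra-cluster distance, can be illustrated with a hedged sketch. The two functions below use hypothetical names and are not the paper's indices; they merely contrast the quadratic all-pairs intra-cluster distance with a linear distance-to-centroid approximation:

```python
# Hypothetical sketch of the simplification idea: replace the O(n^2)
# all-pairs intra-cluster distance with an O(n) distance-to-centroid sum.
import numpy as np

def intra_pairwise(X):
    """Mean pairwise distance inside one cluster (quadratic cost)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    return d.sum() / (n * (n - 1)) if n > 1 else 0.0

def intra_centroid(X):
    """Linear-cost approximation: mean distance to the cluster centroid.
    By Jensen's inequality it never exceeds the mean pairwise distance."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1).mean()
```

The centroid version needs a single pass over the data, which is why it scales to millions of rows and maps naturally onto a distributed framework such as Spark.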
Bootstrap–CURE: A novel clustering approach for sensor data: an application to 3D printing industry
The agenda of Industry 4.0 highlights smart manufacturing by making machines smart enough to make data-driven decisions. Large-scale 3D printers, one of the important pillars of Industry 4.0, are equipped with smart sensors to continuously monitor print processes and make automated decisions. One of the biggest challenges in decision autonomy is to consume data quickly along the process and extract knowledge from the printer that is suitable for improving the printing process. This paper presents an innovative unsupervised learning approach, bootstrap–CURE, to decode the sensor patterns and operation modes of 3D printers by analyzing multivariate sensor data. An automatic technique to detect the suitable number of clusters using the dendrogram is developed. The proposed methodology is scalable and significantly reduces computational cost compared to classical CURE. A distinct combination of the 3D printer's sensors is found, and its impact on the printing process is discussed. A real application is presented to illustrate the performance and usefulness of the proposal. In addition, a new state of the art for sensor data analysis is presented. This work was supported in part by KEMLG-at-IDEAI (UPC) under Grant SGR-2017-574 from the Catalan government.
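The scalability idea behind sampling-based hierarchical clustering can be sketched as follows. This is a simplified illustration under assumed parameters, not the exact bootstrap–CURE algorithm (which uses CURE's shrunken representative points): cluster a bootstrap resample hierarchically, then label the full dataset by its nearest sampled point.

```python
# Sketch: hierarchical clustering on a bootstrap resample, then nearest-
# representative assignment for the rest. Parameters are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

def bootstrap_hierarchical(X, sample_size=200, n_clusters=3):
    """Cluster a bootstrap resample, then label all points by nearest sample."""
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=True)
    S = X[idx]                                  # bootstrap resample
    Z = linkage(S, method="ward")
    sample_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    nearest = cdist(X, S).argmin(axis=1)        # nearest sampled representative
    return sample_labels[nearest]

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=0.5, random_state=1)
labels = bootstrap_hierarchical(X, sample_size=150, n_clusters=3)
```

The dendrogram is built on only `sample_size` points, so the quadratic linkage cost is paid on a small sample while the full dataset is labeled in a single linear assignment pass.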
Investigation Of Multi-Criteria Clustering Techniques For Smart Grid Datasets
The processing of data arising from connected smart grid technology is an important area of research for the next generation power system. The volume of data allows for increased awareness and efficiency of operation but poses challenges for analyzing the data and turning it into meaningful information. This thesis showcases the utility of clustering algorithms applied to three separate smart-grid data sets and analyzes their ability to improve awareness and operational efficiency.
Hierarchical clustering for anomaly detection in phasor measurement unit (PMU) datasets is identified as an appropriate method for fault and anomaly detection. It showed an increase in anomaly detection efficiency according to the Dunn Index (DI) and improved computational performance compared to currently employed techniques such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
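The Dunn Index used above as the evaluation criterion has a compact definition: the smallest between-cluster separation divided by the largest within-cluster diameter. A minimal sketch:

```python
# Minimal Dunn Index: min inter-cluster distance / max intra-cluster diameter.
# Higher values indicate compact, well-separated clusters.
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # largest diameter over all clusters
    max_diam = max(cdist(c, c).max() for c in clusters)
    # smallest distance between points in different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam
```

For example, two tight clusters at distance 10 with diameter 1 yield a Dunn Index of 10; mixing their labels collapses the value toward zero, which is why the index is sensitive to anomalous points absorbed into the wrong cluster.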
The efficacy of betweenness-centrality (BC) based clustering in a novel clustering scheme for the determination of microgrids from large scale bus systems is demonstrated and compared against a multitude of other graph clustering algorithms. The BC based clustering showed an overall decrease in economic dispatch cost when compared to other methods of graph clustering. Additionally, the utility of BC for identification of critical buses was showcased.
Finally, this work demonstrates the utility of partitional dynamic time warping (DTW) and k-shape clustering methods for classifying power demand profiles of households with and without electric vehicles (EVs). The utility of DTW time-series clustering was compared against other methods of time-series clustering and tested based upon demand forecasting using traditional and deep-learning techniques. Additionally, a novel process for selecting an optimal time-series clustering scheme based upon a scaled sum of cluster validity indices (CVIs) was developed. Forecasting schemes based on DTW and k-shape demand profiles showed an overall increase in forecast accuracy.
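The scheme-selection step described above, a scaled sum of cluster validity indices, can be sketched in a hedged form. The function name and the example index names are illustrative, not the thesis's exact formulation: each CVI is min-max scaled across the candidate schemes (flipping indices where lower is better) and the scheme with the highest sum wins.

```python
# Hypothetical sketch: pick a clustering scheme by summing min-max scaled CVIs.
import numpy as np

def select_scheme(scores, lower_is_better=()):
    """scores: {scheme: {cvi_name: value}}. Returns the scheme with the
    highest sum of scaled CVIs; 'lower is better' indices are inverted."""
    schemes = list(scores)
    cvis = list(next(iter(scores.values())))
    total = {s: 0.0 for s in schemes}
    for cvi in cvis:
        vals = np.array([scores[s][cvi] for s in schemes], dtype=float)
        lo, hi = vals.min(), vals.max()
        scaled = np.zeros_like(vals) if hi == lo else (vals - lo) / (hi - lo)
        if cvi in lower_is_better:
            scaled = 1.0 - scaled       # invert so that higher is always better
        for s, v in zip(schemes, scaled):
            total[s] += v
    return max(total, key=total.get)
```

Scaling each index to [0, 1] before summing keeps any single CVI (whose raw magnitudes can differ by orders of magnitude) from dominating the choice.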
In summary, the use of clustering methods for three distinct types of smart grid datasets is demonstrated. The use of clustering algorithms as a means of processing data can lead to overall methods that improve forecasting, economic dispatch, event detection, and overall system operation. Ultimately, the techniques demonstrated in this thesis give analytical insights and foster data-driven management and automation for smart grid power systems of the future
Neuroengineering of Clustering Algorithms
Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review of adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective, by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective, by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting, in the reduced-dimension and topology-preserving SOM grid, information-theoretic similarity measures between neighboring neurons. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature. --Abstract, page iv
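The online computation of cluster descriptors that iCVIs build on can be illustrated with a standard Welford-style update of a cluster's mean and within-cluster scatter. The class below is an illustrative sketch, not the dissertation's formulation:

```python
# Sketch: online (single-pass) cluster descriptors in the spirit of iCVIs.
# Welford-style updates give the exact batch mean and scatter incrementally.
import numpy as np

class ClusterStats:
    """Running mean and sum of squared distances to the mean for one cluster."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.sq_dist = 0.0   # sum of squared distances to the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.sq_dist += delta @ (x - self.mean)

    def compactness(self):
        """Mean squared distance to the centroid (a within-cluster scatter term)."""
        return self.sq_dist / self.n if self.n else 0.0
```

Because each sample is consumed once and then discarded, descriptors like these let a validity index track streaming data without storing or re-scanning past points.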
New internal and external validation indices for clustering in Big Data
This thesis, presented as a compendium of research articles, analyses
the concept of clustering validation indices and provides new measures of
goodness for datasets that could be considered Big Data. In addition, these
measures have been applied in real projects and their future application is
proposed for the improvement of clustering algorithms.
Clustering is one of the most popular unsupervised machine learning
techniques. This technique allows us to group data into clusters so that the
instances that belong to the same cluster have characteristics or attributes
with similar values, and are dissimilar to those that belong to the other
clusters. The similarity of the data is normally given by the proximity in
space, which is measured using a distance function. In the literature, there
are so-called clustering validation indices, which can be defined as measures
for the quantification of the quality of a clustering result. These indices are
divided into two types: internal validation indices, which measure the quality
of clustering based on the attributes with which the clusters have been built;
and external validation indices, which are those that quantify the quality of
clustering from attributes that have not intervened in the construction of
the clusters, and that are normally of nominal type or labels.
In this doctoral thesis, two internal validation indices are proposed for
clustering based on other indices existing in the literature, which enable
large amounts of data to be handled, and provide the results in a reasonable
time. The proposed indices have been tested with synthetic datasets and
compared with other indices in the literature. The conclusions of this work
indicate that these indices offer very promising results in comparison with
their competitors.
On the other hand, a new external clustering validation index based on
the chi-squared statistical test has been designed. This index enables the
quality of the clustering to be measured by basing the result on how the
clusters have been distributed with respect to a given label in the distribution.
The results of this index show a significant improvement compared to
other external indices in the literature when used with datasets of different
dimensions and characteristics.
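The idea of an external index built on the chi-squared test can be sketched with standard SciPy tooling. This is a hedged illustration of the general mechanism (cluster-versus-label contingency table, chi-squared statistic), not the thesis's exact index:

```python
# Sketch: score a clustering against external labels via the chi-squared
# statistic of the cluster x label contingency table.
import numpy as np
from scipy.stats import chi2_contingency

def chi2_external_score(cluster_labels, true_labels):
    """Larger statistic / smaller p-value = stronger association between
    the clustering and the external label."""
    clusters = np.unique(cluster_labels)
    labels = np.unique(true_labels)
    table = np.array([[np.sum((cluster_labels == c) & (true_labels == l))
                       for l in labels] for c in clusters])
    stat, p_value, _, _ = chi2_contingency(table)
    return stat, p_value
```

A clustering that perfectly separates the label classes concentrates the contingency table on its diagonal, yielding a large statistic and a near-zero p-value; a clustering independent of the label yields a statistic near zero.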
In addition, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral
thesis. For the first project, a methodology has been developed to analyse
the electrical consumption of buildings in a smart city. For this study, an
optimal clustering analysis has been carried out by applying the aforementioned
internal indices. In the second project, both internal and external
indices have been applied in order to perform a comparative analysis of the
Spanish labour market in two different economic periods. This analysis was
carried out using data from the Ministry of Labour, Migration, and Social
Security, and the results could be taken into account to help decision-making
for the improvement of employment policies. In the third project, data from
the customers of an electric company has been employed to characterise the
different types of existing consumers. In this study, consumption patterns
have been analysed so that electricity companies can offer new rates to consumers.
Conclusions show that consumers could adapt their usage to these
rates and hence the generation of energy could be optimised by eliminating
the consumption peaks that currently exist.
A holistic approach for measuring the survivability of SCADA systems
Supervisory Control and Data Acquisition (SCADA) systems are responsible for controlling and monitoring Industrial Control Systems (ICS) and Critical Infrastructure Systems (CIS), among others. Such systems provide services our society relies on, such as gas, electricity, and water distribution. They process our waste and manage our railways and traffic. Needless to say, they are vital for our society, and any disruption of such systems may produce anything from financial disaster to, ultimately, loss of life. SCADA systems have evolved over the years from standalone, proprietary solutions and closed networks into large-scale, highly distributed software systems operating over open networks such as the internet. In addition, the hardware and software utilised by SCADA systems are now, in most cases, based on COTS (Commercial Off-The-Shelf) solutions. As they evolved, they became vulnerable to malicious attacks. Over the last few years there has been a push from the computer security industry to adapt its security tools and techniques to address the security issues of SCADA systems. Such a move is welcome; however, it is not sufficient, otherwise successful malicious attacks on computer systems would be non-existent. We strongly believe that rather than trying to stop and detect every attack on SCADA systems, it is imperative to focus on providing critical services in the presence of malicious attacks. This motivation is aligned with the concept of survivability, a discipline that integrates areas of computer science such as performance, security, fault-tolerance, and reliability. In this thesis we present a new concept of survivability: holistic survivability is an analysis framework suitable for a new era of data-driven networked systems. It extends the current view of survivability by incorporating service interdependencies as a key property, together with aspects of machine learning.
The framework uses the formalism of probabilistic graphical models to quantify survivability and introduces new metrics and heuristics to learn and identify essential services automatically. Current definitions of survivability are often limited, since they either apply performance as the measurement metric or use security metrics without any survivability context. Holistic survivability addresses such issues by providing a flexible framework in which performance and security metrics can be tailored to the context of survivability. In other words, by applying performance and security, our work aims to support key survivability properties such as recognition and resistance. The models and metrics introduced here are applied to SCADA systems, as the insecurity of such systems is one of the motivations of this work. We believe that the proposed work goes beyond the current status of survivability models. Holistic survivability is flexible enough to support the addition of other metrics and can easily be used with different models. Because it is based on a well-known formalism, its definition and implementation are easy to grasp and apply. Perhaps more importantly, this proposed work is aimed at a new era in which data is being produced and consumed on a large scale. Holistic survivability aims to be the catalyst for new models based on data that will provide better and more accurate insights into the survivability of systems.
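The role of service interdependencies can be illustrated with a deliberately tiny sketch. This is not the thesis's probabilistic graphical model; it only shows, under an independence assumption and made-up availability figures, how survivability estimates propagate through a service dependency graph:

```python
# Toy sketch: a service survives only if it is up AND every service it
# depends on survives (dependencies assumed independent for simplicity).
def survivability(service, avail, deps, memo=None):
    if memo is None:
        memo = {}
    if service not in memo:
        p = avail[service]
        for d in deps.get(service, ()):
            p *= survivability(d, avail, deps, memo)
        memo[service] = p
    return memo[service]

# Illustrative numbers: SCADA needs the network, which needs power.
avail = {"scada": 0.99, "net": 0.95, "power": 0.9}
deps = {"scada": ["net", "power"], "net": ["power"]}
p = survivability("scada", avail, deps)
```

Even this toy version shows why interdependencies matter: a highly available service inherits the fragility of everything beneath it, which is the intuition the graphical-model formalism makes rigorous.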
Environmental data stream mining through a case-based stochastic learning approach
This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/. Environmental data stream mining is an open challenge for Data Science. Common methods are static, because they analyze a static set of data and provide static data-driven models. Environmental systems, however, are dynamic and generate a continuous data stream, so Data Science must provide dynamic methods that cope with the temporal nature of the data. Our proposal is to model each environmental information unit, timely generated, as a new case/experience in a Case-Based Reasoning (CBR) system. This contribution aims to incrementally build and manage a Dynamic Adaptive Case Library (DACL). In this paper, a stochastic method for the learning of new cases and the management of prototypes to create and manage the DACL in an incremental way is introduced. This stochastic method works in two main moments. An evaluation of the method has been carried out using an air-quality data stream from the city of Obregón, Sonora, México, with good results. In addition, other datasets have been mined to ensure the generality of the approach.
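The incremental build-up of a case library with prototypes can be sketched as follows. The class, threshold, and update rule are illustrative assumptions, not the paper's stochastic method: each incoming case either refines its nearest prototype or, if too far away, founds a new one.

```python
# Toy sketch of incremental case/prototype management (threshold is illustrative).
import numpy as np

class DynamicCaseLibrary:
    """Each incoming case updates the nearest prototype's running mean,
    or creates a new prototype when no existing one is close enough."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.prototypes = []   # list of (mean_vector, case_count)

    def learn(self, case):
        case = np.asarray(case, dtype=float)
        if self.prototypes:
            dists = [np.linalg.norm(case - m) for m, _ in self.prototypes]
            i = int(np.argmin(dists))
            if dists[i] <= self.threshold:
                m, n = self.prototypes[i]
                self.prototypes[i] = (m + (case - m) / (n + 1), n + 1)
                return i            # case absorbed into prototype i
        self.prototypes.append((case, 1))
        return len(self.prototypes) - 1  # case founded a new prototype
```

Because each case is processed once and only prototype summaries are kept, the library grows with the number of distinct behaviors in the stream rather than with the number of observations.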
Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation
Active learning, a label-efficient paradigm, empowers models to interactively
query an oracle for labeling new data. In the realm of LiDAR semantic
segmentation, the challenges stem from the sheer volume of point clouds,
rendering annotation labor-intensive and cost-prohibitive. This paper presents
Annotator, a general and efficient active learning baseline, in which a
voxel-centric online selection strategy is tailored to efficiently probe and
annotate the salient and exemplar voxel grids within each LiDAR scan, even
under distribution shift. Concretely, we first execute an in-depth analysis of
several common selection strategies such as Random, Entropy, Margin, and then
develop voxel confusion degree (VCD) to exploit the local topology relations
and structures of point clouds. Annotator excels in diverse settings, with a
particular focus on active learning (AL), active source-free domain adaptation
(ASFDA), and active domain adaptation (ADA). It consistently delivers
exceptional performance across LiDAR semantic segmentation benchmarks, spanning
both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator
exhibits remarkable efficiency, requiring significantly fewer annotations,
e.g., just labeling five voxels per scan in the SynLiDAR-to-SemanticKITTI task.
This results in impressive performance, achieving 87.8% fully-supervised
performance under AL, 88.5% under ASFDA, and 94.4% under ADA. We envision that
Annotator will offer a simple, general, and efficient solution for
label-efficient 3D applications. Project page:
https://binhuixie.github.io/annotator-web (NeurIPS 2023)
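Among the baseline selection strategies analyzed above, the entropy criterion is the simplest to sketch: rank units by predictive entropy and annotate the most uncertain ones up to the per-scan budget. The function below is an illustrative sketch of that baseline (not the paper's VCD criterion), operating on softmax outputs:

```python
# Sketch of the Entropy baseline: pick the `budget` most uncertain voxels.
import numpy as np

def select_voxels_by_entropy(probs, budget=5):
    """probs: (n_voxels, n_classes) softmax outputs for one scan.
    Returns indices of the `budget` voxels with the highest entropy."""
    eps = 1e-12                       # avoid log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]
```

VCD goes beyond this by also exploiting the local topology of the point cloud, but the budget mechanism (e.g., five voxels per scan) plugs into the same selection loop.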