1,335 research outputs found
Performance Evaluation of Cluster Validity Indices (CVIs) on Multi/Hyperspectral Remote Sensing Datasets
The number of clusters (i.e., the number of classes) for unsupervised classification has long been recognized as an important part of remote sensing image clustering analysis. The number of classes is usually determined by cluster validity indices (CVIs). Although many CVIs have been proposed, few studies have compared and evaluated their effectiveness on remote sensing datasets. In this paper, the performance of 16 representative and commonly used CVIs was comprehensively tested by applying the fuzzy c-means (FCM) algorithm to cluster nine types of remote sensing datasets, including multispectral (QuickBird, Landsat TM, Landsat ETM+, FLC1, and GaoFen-1) and hyperspectral datasets (Hyperion, HYDICE, ROSIS, and AVIRIS). The preliminary experimental results showed that most CVIs, including the commonly used DBI (Davies-Bouldin index) and XBI (Xie-Beni index), were not suitable for remote sensing images (especially hyperspectral images) due to significant between-cluster overlaps; the only index effective for both multispectral and hyperspectral datasets was the WSJ index (WSJI). These conclusions can serve as a guideline for future remote sensing image clustering applications.
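To make the evaluation procedure concrete, here is a minimal sketch of scoring candidate cluster counts with one such CVI, the Davies-Bouldin index. It uses scikit-learn's KMeans as a stand-in for FCM (an assumption; FCM is not part of scikit-learn) and synthetic blobs rather than remote sensing imagery:

```python
# Sketch: scoring candidate cluster counts with the Davies-Bouldin index.
# KMeans stands in for fuzzy c-means; the datasets and index roster in the
# paper are far larger than this illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)  # lower is better

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

The same loop generalizes to any index with the signature `score(X, labels)`; indices where higher is better are selected with `max` instead.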
Investigation Of Multi-Criteria Clustering Techniques For Smart Grid Datasets
The processing of data arising from connected smart grid technology is an important area of research for the next generation power system. The volume of data allows for increased awareness and efficiency of operation but poses challenges for analyzing the data and turning it into meaningful information. This thesis showcases the utility of clustering algorithms applied to three separate smart-grid data sets and analyzes their ability to improve awareness and operational efficiency.
Hierarchical clustering for anomaly detection in phasor measurement unit (PMU) datasets is identified as an appropriate method for fault and anomaly detection. It showed an increase in anomaly detection efficiency according to the Dunn Index (DI) and improved computational performance compared to currently employed techniques such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
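A minimal sketch of this idea, using synthetic stand-in data for PMU features and a textbook Dunn index (smallest between-cluster distance over largest cluster diameter); the thesis' actual pipeline and features are not reproduced here:

```python
# Sketch: hierarchical (Ward) clustering of PMU-like measurements, scored
# with a simple Dunn index. The data is an invented stand-in: a tight
# "nominal operation" cloud plus a handful of fault-like outliers.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, pdist

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(200, 3))   # nominal operation
anomaly = rng.normal(3.0, 0.1, size=(10, 3))   # fault-like outliers
X = np.vstack([normal, anomaly])

labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Smallest distance between points in different clusters.
    inter = min(cdist(a, b).min() for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    # Largest within-cluster diameter.
    intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return inter / intra

print(dunn_index(X, labels))
```

A higher Dunn index indicates compact, well-separated clusters, which is why it serves as a proxy for how cleanly anomalies are split from nominal data.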
The efficacy of betweenness-centrality (BC) based clustering in a novel clustering scheme for the determination of microgrids from large scale bus systems is demonstrated and compared against a multitude of other graph clustering algorithms. The BC based clustering showed an overall decrease in economic dispatch cost when compared to other methods of graph clustering. Additionally, the utility of BC for identification of critical buses was showcased.
Finally, this work demonstrates the utility of partitional dynamic time warping (DTW) and k-shape clustering methods for classifying power demand profiles of households with and without electric vehicles (EVs). The utility of DTW time-series clustering was compared against other methods of time-series clustering and tested based upon demand forecasting using traditional and deep-learning techniques. Additionally, a novel process for selecting an optimal time-series clustering scheme based upon a scaled sum of cluster validity indices (CVIs) was developed. Forecasting schemes based on DTW and k-shape demand profiles showed an overall increase in forecast accuracy.
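The scaled-sum idea can be sketched as follows, under the assumption that each index is min-max scaled across the candidate clusterings (inverted where lower is better) before summing; the exact scaling used in the thesis may differ:

```python
# Sketch: choosing among candidate clusterings with a scaled sum of CVIs.
# Each index is min-max scaled across candidates so the scores can be
# summed fairly; Davies-Bouldin is inverted because lower is better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)
candidates = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
              for k in range(2, 7)}

def scale(vals, lower_is_better=False):
    v = np.asarray(vals, dtype=float)
    v = (v - v.min()) / (v.max() - v.min())
    return 1.0 - v if lower_is_better else v

ks = sorted(candidates)
sil = scale([silhouette_score(X, candidates[k]) for k in ks])
ch = scale([calinski_harabasz_score(X, candidates[k]) for k in ks])
db = scale([davies_bouldin_score(X, candidates[k]) for k in ks],
           lower_is_better=True)

combined = sil + ch + db
best_k = ks[int(np.argmax(combined))]
print(best_k)
```

In the thesis the candidates are different time-series clustering schemes (DTW, k-shape, etc.) rather than different k values, but the selection mechanism is the same.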
In summary, the use of clustering methods for three distinct types of smart grid datasets is demonstrated. Clustering algorithms, used as a means of processing data, can improve forecasting, economic dispatch, event detection, and overall system operation. Ultimately, the techniques demonstrated in this thesis give analytical insights and foster data-driven management and automation for the smart grid power systems of the future.
Behaviour modelling with data obtained from the Internet and contributions to cluster validation
This PhD thesis makes contributions to modelling behaviours found in different types of data acquired from the Internet and to the field of clustering evaluation. Two different types of Internet data were processed: on the one hand, Internet traffic, with the objective of attack detection, and on the other hand, web surfing activity, with the objective of web personalization, both types of data being sequential in nature. To this aim, machine learning techniques, mostly unsupervised ones, were applied. Moreover, contributions were made in cluster evaluation in order to simplify the selection of the best partition in clustering problems.
With regard to network attack detection, the gureKDDCup database was first generated, adding payload data to the KDDCup99 connection attributes because payload content is essential for detecting non-flood attacks. Then, by modelling this data, a network Intrusion Detection System (nIDS) was proposed in which context-independent payload processing achieved satisfactory detection rates.
In the web mining context, web surfing activity was modelled for web personalization. Generic and non-invasive systems to extract knowledge were proposed using only the information stored in web server log files. Contributions were made in two areas: problem detection and link suggestion. In the first application, a meaningful list of navigation attributes was proposed for each user session in order to group sessions and detect different navigation profiles. In the second, a general and non-invasive link suggestion system was proposed and evaluated with satisfactory results in a link prediction context.
With regard to the analysis of Cluster Validity Indices (CVIs), the most extensive CVI comparison reported to date was carried out using an evaluation methodology based on partition similarity measures. Moreover, we analysed the behaviour of CVIs in a real web mining application with a large number of clusters, a setting in which they tend to be unstable, and proposed a procedure that automatically selects the best partition by analysing the slope of the different CVI values.
Grants: Basque Government (ref.: BFI08.226); Ministry of Economy and Competitiveness of the Spanish Government (ref.: BES-2011-045989); research stay grant of the Spanish Ministry of Economy and Competitiveness (ref.: EEBB-I-14-08862); University of the Basque Country UPV/EHU (BAILab, grant UFI11/45); Department of Education, Universities and Research of the Basque Government (grant IT-395-10); Ministry of Economy and Competitiveness of the Spanish Government and the European Regional Development Fund - ERDF (eGovernAbility, grant TIN2014-52665-C2-1-R)
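A loose illustration of slope-based selection, assuming a simple "largest drop in slope" rule on a single CVI curve (the thesis' actual procedure combines the slopes of several CVIs and is more elaborate):

```python
# Sketch: picking the cluster count from the slope of a CVI curve rather
# than its raw optimum, an elbow-style rule. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=2)
ks = list(range(2, 10))
scores = [calinski_harabasz_score(
              X, KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X))
          for k in ks]

slopes = np.diff(scores)                 # first differences of the CVI curve
elbow = int(np.argmin(np.diff(slopes)))  # where the slope drops the most
best_k = ks[elbow + 1]
print(best_k)
```

Slope rules of this kind are useful precisely in the high-cluster-count regime described above, where raw CVI values become unstable.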
Hyperparameter optimization of the DBSCAN algorithm using a novel genetic-algorithm-based method
Ship traffic is a major source of global greenhouse gas emissions, and the pressure on the maritime industry to lower its carbon footprint is constantly growing. One easy way for ships to lower their emissions would be to lower their sailing speed. Global ship traffic has long followed a practice called "sail fast, then wait": ships try to reach their destination as fast as possible and then wait at an anchorage near the harbor for a mooring place to become available. This method is easy to execute logistically, but it does not optimize sailing speeds to take emissions into account. An alternative tactic would be to calculate traffic patterns at the destination and use this information to plan the voyage so that the time at anchorage is minimized. This would allow ships to sail at lower speeds without lengthening the total journey.
To create a model for scheduling arrivals at ports, traffic patterns need to be formed describing how ships interact with port infrastructure. However, port infrastructure data is not widely available in an easy-to-use form, which makes it difficult to develop models capable of predicting traffic patterns. Ship voyage information, on the other hand, is readily available from commercial Automatic Identification System (AIS) data. In this thesis, I present a novel implementation that extracts information on port infrastructure from AIS data using the DBSCAN clustering algorithm.
In addition to clustering the AIS data, the implementation presented in this thesis uses a novel optimization method to search for optimal hyperparameters for the DBSCAN algorithm. The optimization process evaluates candidate solutions using cluster validity indices (CVIs), metrics that represent the goodness of a clustering. Different CVIs are compared to narrow down the most effective way to cluster AIS data for recovering information on port infrastructure.
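The evaluation loop at the core of such a search can be sketched as follows; a plain grid search stands in for the genetic algorithm, and silhouette stands in for the CVIs compared in the thesis (both are assumptions for illustration):

```python
# Sketch: searching DBSCAN hyperparameters by scoring each candidate
# partition with a CVI (silhouette). Noise points (label -1) are excluded
# before scoring. A genetic algorithm would explore this same space.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=3)

best = (None, -1.0)
for eps in (0.3, 0.5, 0.8, 1.2):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1                 # ignore noise points
        if len(set(labels[mask])) < 2:
            continue                        # silhouette needs >= 2 clusters
        score = silhouette_score(X[mask], labels[mask])
        if score > best[1]:
            best = ((eps, min_samples), score)

print(best)
```

A genetic algorithm replaces the two nested loops with mutation and crossover over (eps, min_samples) candidates, but the fitness function is exactly this CVI evaluation.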
Cluster-Based Control of Transition-Independent MDPs
This work studies the ability of a third-party influencer to control the
behavior of a multi-agent system. The controller exerts actions with the goal
of guiding agents to attain target joint strategies. Under mild assumptions,
this can be modeled as a Markov decision problem and solved to find a control
policy. This setup is refined by introducing more degrees of freedom to the
control; the agents are partitioned into disjoint clusters such that each
cluster can receive a unique control. Solving for a cluster-based policy
through standard techniques like value iteration or policy iteration, however,
takes exponentially more computation time due to the expanded action space. A
solution is presented in the Clustered Value Iteration (CVI) algorithm, which
iteratively solves for an optimal control via a round-robin approach across the
clusters. CVI converges exponentially faster than standard value iteration, and
can find policies that closely approximate the MDP's true optimal value. For
MDPs with separable reward functions, CVI will converge to the true optimum.
While an optimal clustering assignment is difficult to compute, a good
clustering assignment for the agents may be found with a greedy splitting
algorithm, whose associated values form a monotonic, submodular lower bound to
the values of optimal clusters. Finally, these control ideas are demonstrated
on simulated examples. (Comment: 22 pages, 3 figures)
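The baseline that Clustered Value Iteration refines is standard value iteration; a self-contained toy example on an invented three-state MDP (transition and reward arrays are made up for illustration, not taken from the paper):

```python
# Sketch: standard value iteration on a toy MDP. Clustered Value Iteration
# replaces the single max over the joint action space with a round-robin
# sweep over per-cluster actions; the Bellman backup itself is the same.
import numpy as np

n_states, gamma = 3, 0.9
# P[a, s, s'] = transition probability; R[s, a] = expected reward.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.8, 0.0, 0.2]]])
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])

V = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Bellman backup
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:          # converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)
print(V, policy)
```

The exponential blow-up the paper addresses appears when the action axis is a joint action over many clusters; sweeping clusters one at a time keeps each backup's max small.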
A correlation-based fuzzy cluster validity index with secondary options detector
The optimal number of clusters is one of the main concerns when applying
cluster analysis. Several cluster validity indexes have been introduced to
address this problem. However, in some situations, there is more than one
option that can be chosen as the final number of clusters. This aspect has been
overlooked by most of the existing works in this area. In this study, we
introduce a correlation-based fuzzy cluster validity index known as the
Wiroonsri-Preedasawakul (WP) index. This index is defined based on the
correlation between the actual distance between a pair of data points and the
distance between adjusted centroids with respect to that pair. We evaluate and
compare the performance of our index with several existing indexes, including
Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2.
We conduct this evaluation on four types of datasets: artificial datasets,
real-world datasets, simulated datasets with ranks, and image datasets, using
the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not
all, of these indexes in terms of accurately detecting the optimal number of
clusters and providing accurate secondary options. Moreover, our index remains
effective even when the fuzziness parameter is set to a large value. Our R
package, WPfuzzyCVIs, used in this work is available at
https://github.com/nwiroonsri/WPfuzzyCVIs. (Comment: 19 pages)
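The correlation idea behind the index can be illustrated as follows. This is NOT the published WP formula (which uses adjusted centroids and fuzzy memberships); it only shows the underlying construction of correlating actual pairwise distances with partition-implied ones:

```python
# Sketch of a correlation-based validity measure: compare actual pairwise
# distances with the distances implied by mapping each point to its
# cluster centroid. High correlation = partition mirrors the geometry.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=4)
km = KMeans(n_clusters=4, n_init=10, random_state=4).fit(X)

actual = pdist(X)                       # true pairwise distances
rep = km.cluster_centers_[km.labels_]   # each point replaced by its centroid
implied = pdist(rep)                    # partition-implied distances

corr, _ = pearsonr(actual, implied)
print(corr)
```

Because the measure is a correlation, nearby candidate partitions can score almost equally well, which is exactly why the paper's secondary-options detector is useful.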
How do they pay as they go?: Learning payment patterns from solar home system users data in Rwanda and Kenya
Pay-as-you-go (PAYGo) financing models play a vital role in boosting the distribution of solar home systems (SHSs) to electrify rural Sub-Saharan Africa. This financing model improves the affordability of SHSs by supporting the payment flexibility required in these contexts. Such flexibility comes at a cost, and yet the assumptions that guide PAYGo model design remain largely untested. To close this gap, this paper proposes a methodology based on unsupervised machine learning algorithms to analyse the payment records of over 32,000 Rwandan and 25,000 Kenyan SHS users from Bboxx Ltd., and in so doing gain detailed insights into users' payment behavioural patterns. More precisely, the method first applies three clustering algorithms to automatically learn the main payment behavioural groups in each country separately; it then determines the preferred customer segmentation through a validation procedure that combines quantitative and qualitative insights. The results highlight six behavioural groups in Rwanda and four in Kenya; however, several parallels can be drawn between the two country profiles. These groups highlight the diversity of payment patterns found in the PAYGo model. Further analysis of their payment performance suggests that a one-size-fits-all approach leads to inefficiencies and that tailored plans should be considered to effectively cater to all SHS users.
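The quantitative half of such a validation step can be sketched like this, with synthetic stand-in features and three common clustering algorithms (which algorithms the paper actually used is not specified here, so these are assumptions):

```python
# Sketch: run several clustering algorithms on stand-in payment features
# and compare the partitions with a CVI. The real study works on Bboxx
# payment records and also weighs qualitative domain insight.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=5, random_state=5)

partitions = {
    "kmeans": KMeans(n_clusters=5, n_init=10, random_state=5).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=5).fit_predict(X),
    "gmm": GaussianMixture(n_components=5, random_state=5).fit_predict(X),
}
for name, labels in partitions.items():
    print(name, round(silhouette_score(X, labels), 3))
```

The quantitative scores narrow the field; the final segmentation choice in the paper additionally requires the groups to be interpretable to domain experts.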
Modelling Coastal Vulnerability: An integrated approach to coastal management using Earth Observation techniques in Belize
This thesis presents an adapted method to derive coastal vulnerability through the application of Earth Observation (EO) data in the quantification of forcing variables. A modelled assessment of vulnerability has been produced using the Coastal Vulnerability Index (CVI) approach developed by Gornitz (1991), enhanced using machine learning (ML) clustering. ML has been employed to divide the coastline based on the observed geotechnical conditions in order to establish relative vulnerability. This has been demonstrated to alleviate bias and enhance the scalability of the approach, especially in areas with poor data coverage, a known hindrance to the CVI approach (Koroglu et al., 2019). Belize provides a demonstrator for this novel methodology due to limited existing data coverage and the recent removal of the Mesoamerican Reef from the International Union for Conservation of Nature (IUCN) List of World Heritage In Danger. A strong characterization of the coastal zone and its associated pressures is paramount to support effective management and enhance resilience to ensure this status is retained. Areas of consistent vulnerability, predominantly Caye Caulker and San Pedro, have been identified using the KMeans classifier. The ability to automatically scale to conditions in Belize has revealed disparities in vulnerability along the coastline and has provided more realistic estimates than the traditional CVI groups. The resulting vulnerability assessments indicate that 19% of the coastline is at the highest risk, with high risk distributed towards the seaward side. Using data derived from Sentinel-2, this study has also increased the accuracy of existing habitat maps and enhanced survey coverage of uncharted areas. Results from this investigation are discussed in terms of their potential to enhance community resilience by supporting regional policies.
Further research should be completed to test the robustness of this model through application in regions with different geographic conditions and with higher-resolution input datasets.
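The ML-enhanced step can be sketched as clustering coastline segments on standardized forcing variables; the variable names and values below are invented stand-ins for the EO-derived inputs, and the ranking heuristic is an illustrative assumption, not the thesis' method:

```python
# Sketch: KMeans grouping of coastline segments into relative-vulnerability
# classes from (invented) forcing variables, in place of fixed CVI bins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# Columns: elevation (m), slope (%), shoreline change (m/yr), wave height (m)
segments = rng.uniform([0, 0, -5, 0.2], [10, 12, 5, 2.5], size=(500, 4))

Xs = StandardScaler().fit_transform(segments)   # put variables on one scale
labels = KMeans(n_clusters=5, n_init=10, random_state=6).fit_predict(Xs)

# Illustrative: order clusters by mean standardized forcing so they can be
# read as relative vulnerability levels (real rankings need sign-aware
# treatment, e.g. high elevation reduces vulnerability).
order = np.argsort([Xs[labels == c].mean() for c in range(5)])
print(order)
```

Letting the clusters adapt to local conditions, rather than applying fixed global CVI thresholds, is what gives the approach its scalability in data-poor regions.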
New internal and external validation indices for clustering in Big Data
This thesis, presented as a compendium of research articles, analyses
the concept of clustering validation indices and provides new measures of
goodness for datasets that could be considered Big Data. In addition, these
measures have been applied in real projects and their future application is
proposed for the improvement of clustering algorithms.
Clustering is one of the most popular unsupervised machine learning
techniques. This technique allows us to group data into clusters so that the
instances that belong to the same cluster have characteristics or attributes
with similar values, and are dissimilar to those that belong to the other
clusters. The similarity of the data is normally given by the proximity in
space, which is measured using a distance function. In the literature, there
are so-called clustering validation indices, which can be defined as measures
for the quantification of the quality of a clustering result. These indices are
divided into two types: internal validation indices, which measure the quality
of clustering based on the attributes with which the clusters have been built;
and external validation indices, which are those that quantify the quality of
clustering from attributes that have not intervened in the construction of
the clusters, and that are normally of nominal type or labels.
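The internal/external distinction can be shown in a few lines: an internal index uses only the clustering features, while an external index compares the partition against a label column that played no part in building the clusters:

```python
# Internal vs external validation in one example: silhouette needs only X
# and the partition; adjusted Rand additionally needs the held-out labels y.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

internal = silhouette_score(X, labels)      # no ground truth needed
external = adjusted_rand_score(y, labels)   # uses the label column y
print(internal, external)
```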
In this doctoral thesis, two internal validation indices are proposed for
clustering based on other indices existing in the literature, which enable
large amounts of data to be handled, and provide the results in a reasonable
time. The proposed indices have been tested with synthetic datasets and
compared with other indices in the literature. The conclusions of this work
indicate that these indices offer very promising results in comparison with
their competitors.
On the other hand, a new external clustering validation index based on
the chi-squared statistical test has been designed. This index enables the
quality of the clustering to be measured by basing the result on how the
clusters have been distributed with respect to a given label in the distribution.
The results of this index show a significant improvement compared to
other external indices in the literature when used with datasets of different
dimensions and characteristics.
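The statistical machinery underneath such an index can be sketched with a contingency table and scipy's chi-squared test; the thesis' index is its own construction, and this only illustrates the underlying cluster-vs-label independence test:

```python
# Sketch: chi-squared test of independence between cluster assignments and
# an external label, via a clusters-by-labels contingency table.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400, centers=3, random_state=8)
labels = KMeans(n_clusters=3, n_init=10, random_state=8).fit_predict(X)

# Contingency table: rows = clusters, columns = external labels.
table = np.zeros((3, 3), dtype=int)
for c, lab in zip(labels, y):
    table[c, lab] += 1

chi2, p, dof, _ = chi2_contingency(table)
print(chi2, p, dof)
```

A tiny p-value means the clusters are strongly associated with the external label, i.e. the partition recovers the labelled structure.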
In addition, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral
thesis. For the first project, a methodology has been developed to analyse
the electrical consumption of buildings in a smart city. For this study, an
optimal clustering analysis has been carried out by applying the aforementioned
internal indices. In the second project, both internal and external
indices have been applied in order to perform a comparative analysis of the
Spanish labour market in two different economic periods. This analysis was
carried out using data from the Ministry of Labour, Migration, and Social
Security, and the results could be taken into account to help decision-making
for the improvement of employment policies. In the third project, data from
the customers of an electric company has been employed to characterise the
different types of existing consumers. In this study, consumption patterns
have been analysed so that electricity companies can offer new rates to consumers.
Conclusions show that consumers could adapt their usage to these
rates and hence the generation of energy could be optimised by eliminating
the consumption peaks that currently exist.