
    Normality-based validation for crisp clustering

    This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, 43, 36, (2010), DOI 10.1016/j.patcog.2009.09.018.
    We introduce a new validity index for crisp clustering that is based on the average normality of the clusters. Unlike methods based on inter-cluster and intra-cluster distances, this index emphasizes the cluster shape by using a high-order characterization of its probability distribution. The normality of a cluster is characterized by its negentropy, a standard measure of the distance to normality which evaluates the difference between the cluster's entropy and the entropy of a normal distribution with the same covariance matrix. The definition of the negentropy involves the distribution's differential entropy. However, we show that it is possible to avoid its explicit computation by considering only negentropy increments with respect to the initial data distribution, where all the points are assumed to belong to the same cluster. The resulting negentropy increment validity index only requires the computation of covariance matrices. We have applied the new index to an extensive set of artificial and real problems where it provides, in general, better results than other indices, both with respect to the prediction of the correct number of clusters and to the similarity between the real clusters and those inferred.
    This work has been partially supported with funds from MEC BFU2006-07902/BFI, CAM S-SEM-0255-2006 and CAM/UAM CCG08-UAM/TIC-442.
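Since the index only requires covariance matrices and cluster priors, its computation can be sketched compactly. The following is a minimal illustration, assuming the negentropy increment reduces to cluster-weighted covariance log-determinants minus the whole-data log-determinant plus a prior-entropy term; function and variable names are ours, not the paper's:

```python
import numpy as np

def negentropy_increment(X, labels):
    """Negentropy increment of a crisp partition relative to the whole
    data set treated as a single cluster (lower is better). Only
    covariance log-determinants and cluster priors are required."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Log-determinant of the covariance of the full data set.
    _, logdet0 = np.linalg.slogdet(np.cov(X, rowvar=False))
    score = -0.5 * logdet0
    for k in np.unique(labels):
        Xk = X[labels == k]
        pk = len(Xk) / n
        _, logdet_k = np.linalg.slogdet(np.cov(Xk, rowvar=False))
        # 0.5 * p_k * log|Sigma_k| plus the prior-entropy term -p_k * log p_k
        score += 0.5 * pk * logdet_k - pk * np.log(pk)
    return score
```

With well-separated Gaussian clusters, a partition matching the true groups yields a lower (more negative) increment than a partition that mixes them.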

    Cluster validation by measurement of clustering characteristics relevant to the user

    There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method run with different parameters such as different numbers of clusters. There are many cluster validation indexes in the literature. Most of them attempt to measure the overall quality of a clustering by a single number, but this can be inappropriate. There are various different characteristics of a clustering that can be relevant in practice, depending on the aim of clustering, such as low within-cluster distances and high between-cluster separation. In this paper, a number of validation criteria will be introduced that refer to different desirable characteristics of a clustering, and that characterise a clustering in a multidimensional way. In specific applications the user may be interested in some of these criteria rather than others. A focus of the paper is on methodology to standardise the different characteristics so that users can aggregate them in a suitable way, specifying weights for the various criteria that are relevant in the clustering application at hand.
    Comment: 20 pages, 2 figures
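The standardise-then-aggregate idea can be illustrated with a small sketch. The z-standardisation and weighting scheme below is our own simplification, not necessarily the calibration proposed in the paper; criterion names and weights are illustrative:

```python
import numpy as np

def aggregate_scores(scores, weights, higher_is_better):
    """Combine several validation criteria, computed for a set of
    candidate clusterings, into one weighted quality score per
    clustering. Each criterion is z-standardised across the candidates
    so that the user-chosen weights act on comparable scales.
    `scores[c]` is a list with one value per candidate clustering."""
    names = list(scores)
    total = np.zeros(len(scores[names[0]]))
    for name in names:
        v = np.asarray(scores[name], dtype=float)
        z = (v - v.mean()) / (v.std() + 1e-12)  # standardise across candidates
        if not higher_is_better[name]:
            z = -z                              # align direction of preference
        total += weights[name] * z
    return total
```

A user interested mainly in separation would, for example, give "sep" a larger weight than "compact" and pick the candidate clustering with the highest aggregated score.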

    A New Approach to Cohesion Measurement: Region-Based Clustering Validation

    Clustering assigns objects to clusters based on similarity, aiming to ensure that objects within the same cluster are similar and those in different clusters are dissimilar. Evaluating clustering quality is crucial and challenging; researchers have therefore proposed clustering validation indices, namely internal and external validation indices. Internal indices assess clustering quality using the intrinsic information within a dataset, and we focus on internal validation indices for their real-world applicability. In this paper, we propose a novel region-based internal validation (RCV) index. Our index divides each cluster into three distinct regions, the inner, middle, and outer regions, according to the cluster's centre and its corresponding radius. The average distance is then computed for each region, and a penalty factor is applied to these average distances. Summing the three penalised average distances yields an RCV score for each cluster, and the RCV scores of all clusters are summed to give an overall measure of cluster validity. A lower index value indicates better clustering quality. Experimental results on synthetic and real-world datasets demonstrate the usability and effectiveness of the RCV index.
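The region-splitting step can be sketched as follows. The abstract does not give the penalty factors or the exact region boundaries, so the values below (thirds of the radius, increasing penalties toward the outer region) are placeholders of ours, not the authors' definition:

```python
import numpy as np

def rcv_score(X, labels, penalties=(1.0, 2.0, 3.0)):
    """Sketch of a region-based cohesion score: each cluster is split
    into inner, middle and outer regions by thirds of its radius, the
    mean point-to-centre distance of each region is multiplied by a
    penalty factor, and the penalised means are summed over regions
    and clusters. Lower is better. Penalty factors are placeholders."""
    X = np.asarray(X, dtype=float)
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        centre = pts.mean(axis=0)
        d = np.linalg.norm(pts - centre, axis=1)
        radius = d.max()
        # Region index 0/1/2 for inner/middle/outer third of the radius.
        idx = np.minimum((3 * d / (radius + 1e-12)).astype(int), 2)
        for region, w in enumerate(penalties):
            mask = idx == region
            if mask.any():
                total += w * d[mask].mean()
    return total
```

Under this sketch, compact clusters produce smaller region-wise average distances and hence a lower score than loosely scattered ones.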

    Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study

    Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation. This is an important gap because of the value of clustering in real-time data streaming and other online learning applications. Therefore, interest has grown in providing online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). This paper also provides a thorough comparative study of the behavior of these iCVIs under correct, under- and over-partitioning, together with the Partition Separation (PS) index as well as four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn's indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework that was designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behaviors of the majority of these iCVIs, over-partitioning was found to be a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather than under-partitioning, was more prominently detected in the real-world data experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real time, wherein samples cannot be reprocessed due to memory and/or application constraints.
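The common ingredient of such iCVIs is that per-cluster statistics are updated one sample at a time instead of being recomputed in batch. The sketch below shows the idea for a Calinski-Harabasz-style index using Welford-style running updates; it is a generic illustration of ours, not the authors' iCH implementation:

```python
import numpy as np

class IncrementalCH:
    """Sketch of an incremental Calinski-Harabasz index: per-cluster
    counts, means and within-cluster scatter are updated one sample at
    a time, so the index can be read out at any point of a stream
    without reprocessing past samples."""

    def __init__(self):
        self.n = 0
        self.mean = None   # running global mean
        self.stats = {}    # label -> [count, mean, within-cluster SS]

    def update(self, x, label):
        x = np.asarray(x, dtype=float)
        self.n += 1
        if self.mean is None:
            self.mean = np.zeros_like(x)
        self.mean += (x - self.mean) / self.n
        c = self.stats.setdefault(label, [0, np.zeros_like(x), 0.0])
        c[0] += 1
        delta = x - c[1]
        c[1] += delta / c[0]
        c[2] += float(delta @ (x - c[1]))  # Welford update of scatter

    def value(self):
        k = len(self.stats)
        if k < 2 or self.n <= k:
            return 0.0
        W = sum(c[2] for c in self.stats.values())
        B = sum(c[0] * float((c[1] - self.mean) @ (c[1] - self.mean))
                for c in self.stats.values())
        return (B / (k - 1)) / (W / (self.n - k))
```

Each `update` costs O(d) per sample, which is what makes such indices usable in streaming settings where batch recomputation is ruled out.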

    Cluster validation based on the negentropy of the partitions

    Clustering techniques group a set of points according to a similarity criterion, seeking partitions in which the points of a cluster are more similar to each other than to the remaining points. The main objective of this final-year project is the study and evaluation of cluster validation methods based on negentropy, and their comparison with more traditional methods. To that end, a survey of the state of the art has been carried out, evaluating different clustering methods as well as different validation methods. The clustering technique used in this project fits a mixture of Gaussians to the data using the EM algorithm; each Gaussian in the returned model corresponds to a cluster. Each data set is fitted with different numbers of Gaussians, yielding models with different numbers of clusters. The models returned by the EM algorithm are then evaluated with different cluster validation methods, which provide a measure of the quality of the different models according to the criterion used by each validation method. These include the method under study in this project, Negentropy-based Validation, and two methods already established in the context of mixture distributions, AIC and BIC, against which the comparisons are made. To evaluate the method, a battery of synthetic problems was generated, choosing the variables involved in each problem so that the results obtained at the end of the analysis allow the performance of the three methods to be compared over a very wide range of situations.
The analysis leads to the following conclusions: AIC performs very poorly, whereas the negentropy-based method improves on the performance of BIC in most cases, making it a strong candidate for use in applications with real data. Part of the results obtained in this study have been published in an international journal (1).
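The fit-with-varying-component-counts loop that the project describes can be sketched with scikit-learn's EM-based `GaussianMixture`, scoring each model with BIC; this is our illustrative stand-in, not the project's own code, and the data are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D Gaussian clusters.
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0, 5, 10)])

# Fit mixtures with 1..6 components via EM and score each model.
bic = {k: GaussianMixture(k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)  # BIC: lower is better
print(best_k)
```

AIC could be compared in the same loop via `.aic(X)`; the project's point is that a negentropy-based score can play the same model-selection role as these two criteria.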

    Intelligent Analysis of Cerebral Magnetic Resonance Images: Extracting Relevant Information from Small Datasets

    Unpublished doctoral thesis, read at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defence: 21-09-2017.
    Machine learning methods applied to medical images are becoming powerful tools for the analysis and diagnosis of patients. The wide availability of image repositories of different modalities has favoured the development of systems that learn to extract relevant features and build predictive models from large amounts of information, for example deep learning methods. However, the analysis of image sets coming from a smaller number of subjects, as is the case for images acquired in clinical and pre-clinical research settings, has received considerably less attention. The goal of this thesis is to implement a set of advanced tools to address this problem, enabling the robust analysis of Magnetic Resonance Images (MRI) when few study subjects are available. In this context, the proposed tools are used to automatically analyse data sets obtained from functional MR images of the brain in studies of appetite regulation in rodents and humans, and from functional and structural MR images of tumour development in animal models and humans. The proposed methods derive from the idea of treating each voxel of the image set as a pattern, instead of the conventional notion of treating each image as a pattern. Chapter 1 describes the motivation of this thesis, including the proposed objectives, the overall structure of the document and the contributions of this research.
Chapter 2 contains an up-to-date introduction to the state of the art in MRI, the most widely used image pre-processing procedures, and the most useful machine learning algorithms and their applications to MRI. Chapter 3 presents the experimental design and the pre-processing steps applied to the appetite-regulation and tumour-development data sets. Chapter 4 implements new supervised learning methods for the analysis of MRI data sets obtained from a small set of subjects. This approach is illustrated by first presenting the Fisher Maps methodology, which allows the quantitative, non-invasive visualisation of the cerebral appetite circuitry through the automatic analysis of Diffusion Weighted Images (DWI). This methodology is then extended to the classification of whole images by combining the predictions obtained for each pixel. Chapter 5 proposes a new unsupervised learning algorithm, illustrating its performance on synthetic data and on data from studies of brain tumours and tumour growth. Finally, Chapter 6 summarises the main conclusions of this work and outlines broad avenues for future development. In summary, this thesis presents a new approach capable of working in contexts with low availability of study subjects, proposing supervised and unsupervised learning algorithms. These methods can easily be generalised to other paradigms or pathologies, and even to different imaging modalities.
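The voxel-as-pattern idea is, at its core, a transposition of the usual design matrix: instead of few image-level samples with many voxel features, one obtains many voxel-level samples with one feature per subject. A minimal sketch with illustrative shapes (the actual image dimensions and feature definitions are the thesis's, not shown here):

```python
import numpy as np

# Illustrative study: 6 subjects, each with a 4x4x3 image volume.
n_subjects, nx, ny, nz = 6, 4, 4, 3
images = np.random.default_rng(0).normal(size=(n_subjects, nx, ny, nz))

# Conventional view: n_subjects patterns of nx*ny*nz features each.
per_image = images.reshape(n_subjects, -1)

# Voxel-as-pattern view: nx*ny*nz patterns of n_subjects features each,
# turning a small-sample problem into a large-sample one.
per_voxel = per_image.T

print(per_image.shape, per_voxel.shape)  # (6, 48) (48, 6)
```

This rearrangement is what lets standard learners be trained robustly even when only a handful of subjects is available.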

    Contributions to improve human-computer interaction using machine learning

    181 p. (eng.), 189 p. (eus.)
    This PhD thesis contributes to designing and applying data mining techniques targeting the improvement of Human-Computer Interaction (HCI) in different contexts. The main objective of the thesis is to design systems based on data mining methods for modelling behaviour from interaction and use data. Moreover, having to work often in unsupervised learning contexts has led to a methodological contribution to clustering validation regardless of the context, an unsolved problem in machine learning. Cluster Validity Indexes (CVIs) partially solve this problem by providing a quality score for partitions, but none of them has proven to cope robustly with the broad range of possible conditions. In this regard, in the first contribution several CVI decision-fusion (voting) approaches are proposed, showing that they are promising strategies for clustering validation.
In the Human-Computer Interaction context, the contributions are structured in three different areas. The accessibility area is analysed in the first one, where an efficient system to automatically detect navigation problems of users, with and without disabilities, is presented. The next contribution focuses on medical informatics and analyses the interaction with a medical dashboard used to support the decision-making of clinicians (SMASH). On the one hand, connections between visual and interaction behaviours on SMASH are studied. On the other hand, based on the interaction behaviours observed in SMASH, two main cohorts of users are automatically detected and characterised: primary (pharmacists) vs. secondary (non-pharmacists).
Finally, two contributions are made in the e-Services area, focusing on interaction and use respectively. In the first one, potential students aiming to enrol at the University of the Basque Country (UPV/EHU) are satisfactorily modelled based on the interactive behaviours they showed on the university's website. The second one empirically analyses and characterises the use of e-Government services in different European countries based on survey data provided by Eurostat.
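One simple form of CVI decision fusion is majority voting: each index votes for the number of clusters it considers best, and the most-voted value wins. The sketch below uses three standard CVIs from scikit-learn; the specific indices, the k-means clusterer and the tie-breaking rule are our illustrative choices, not necessarily those proposed in the thesis:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def vote_for_k(X, k_values):
    """Each CVI votes for the number of clusters it scores best; the k
    with the most votes wins (ties broken by the smallest k)."""
    votes = {k: 0 for k in k_values}
    labels = {k: KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X)
              for k in k_values}
    for index, lower_better in [(silhouette_score, False),
                                (calinski_harabasz_score, False),
                                (davies_bouldin_score, True)]:
        scores = {k: index(X, labels[k]) for k in k_values}
        best = (min(scores, key=scores.get) if lower_better
                else max(scores, key=scores.get))
        votes[best] += 1
    return max(sorted(votes), key=votes.get)
```

The appeal of fusion is robustness: no single index handles all data conditions well, but indices that fail tend to fail differently, so the vote can recover the correct partition more often than any single CVI.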

    Distance construction and clustering of football player performance data

    I present a new idea to map football player information using multidimensional scaling and to cluster football players. The actual goal is to define a proper distance measure between players. The data were assembled from whoscored.com. Variables are of mixed type, containing nominal, ordinal, count and continuous information. In the data pre-processing stage, four different steps are followed for continuous and count variables: 1) representation (i.e., considerations regarding how the relevant information is most appropriately represented, e.g., relative to minutes played); 2) transformation (football knowledge, as well as the skewness of the distributions of some count variables, indicates that a transformation should be used to decrease the effective distance between higher values compared with the distances between lower values); 3) standardisation (in order to make within-variable variations comparable); and 4) variable weighting, including variable selection. In a final phase, all the different types of distance measures are combined using the principle of the Gower dissimilarity (Gower, 1971). In the second part of this thesis, the aim was to choose a suitable clustering technique and to estimate the best number of clusters for the dissimilarity measure obtained from the football player data set. For this aim, different clustering quality indexes have been introduced and, as first proposed by Hennig (2017), a new concept to calibrate the clustering quality indexes has been presented. In this respect, Hennig (2017) proposed two random clustering algorithms, which generate random clusterings from which standardised clustering quality index values can be calculated and aggregated in an appropriate way. In this thesis, two new additional random clustering algorithms have been proposed, and the aggregation of clustering quality indexes has been examined with different types of simulated and real data sets.
In the end, this new concept has been applied to the dissimilarity measure of the football players.
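The Gower principle combines per-variable contributions on a common [0, 1] scale: range-scaled absolute differences for numeric variables and 0/1 mismatches for nominal ones, averaged with user-chosen weights. A minimal sketch of this combination (variable kinds, weights and the two-record interface are our illustrative simplification; the thesis's construction additionally covers ordinal and transformed count variables):

```python
import numpy as np

def gower(a, b, kinds, weights=None, ranges=None):
    """Gower-style dissimilarity between two records of mixed type.
    kinds[j] is 'num' or 'nom'; ranges[j] is the observed range of
    numeric variable j (used to scale it into [0, 1]); weights allow
    variable weighting, including selection via zero weights."""
    if weights is None:
        weights = np.ones(len(kinds))
    parts = []
    for j, kind in enumerate(kinds):
        if kind == 'num':
            parts.append(abs(a[j] - b[j]) / ranges[j])  # range-scaled
        else:
            parts.append(0.0 if a[j] == b[j] else 1.0)  # simple matching
    parts = np.asarray(parts)
    return float((weights * parts).sum() / weights.sum())
```

For example, two players with a rating gap of 2 on a variable whose observed range is 4, and different positions, are at dissimilarity (0.5 + 1.0) / 2 = 0.75 with equal weights.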