69 research outputs found

    Agglomerative Nesting (AGNES) Method and Divisive Analysis (DIANA) Method For Hierarchical Clustering On Some Distance Measurement Concepts

    Hierarchical clustering can be performed with the Agglomerative Nesting (AGNES) method or the Divisive Analysis (DIANA) method. The objective of this research is to compare the two methods under Euclidean and Manhattan distance measures. The agglomerative procedures are explored with all linkage techniques, including single linkage, complete linkage, average linkage, and Ward. The data come from the National Socio-Economic Survey (SUSENAS) and give, for each province and for both urban and rural areas, the percentage of residents over 5 years old who accessed the internet in the last 3 months in 2017, classified by purpose of access. Using the Mean Square Error (MSE) for 2 and 3 clusters, it is concluded that single linkage performs best for both Euclidean and Manhattan distances.
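    The comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`agnes`, `within_cluster_mse`) are hypothetical, Ward linkage and DIANA are omitted, and the MSE is assumed here to mean the average squared Euclidean distance from each point to its cluster centroid, which the abstract does not specify.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Linkage rules: how point-to-point distances become a cluster-to-cluster distance.
LINKAGES = {
    "single": min,                            # closest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
}

def agnes(points, k, metric=euclidean, linkage="single"):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > k:
        def gap(pair):
            a, b = pair
            return combine([metric(points[i], points[j])
                            for i in clusters[a] for j in clusters[b]])
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

def within_cluster_mse(points, clusters):
    """Assumed MSE: mean squared Euclidean distance to the cluster centroid."""
    dim = len(points[0])
    total = 0.0
    for c in clusters:
        centroid = [sum(points[i][d] for i in c) / len(c) for d in range(dim)]
        total += sum((points[i][d] - centroid[d]) ** 2 for i in c for d in range(dim))
    return total / len(points)

if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # illustrative data only
    for name, metric in (("euclidean", euclidean), ("manhattan", manhattan)):
        for linkage in LINKAGES:
            result = agnes(toy, 2, metric, linkage)
            print(name, linkage, result, within_cluster_mse(toy, result))
```

    On well-separated toy data every combination yields the same partition; differences between linkages and distance measures only appear on less clear-cut data such as the survey percentages studied in the abstract.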

    Probabilistic clustering of interval data

    In this paper we address the problem of clustering interval data, adopting a model-based approach. To this end, we use parametric models for interval-valued variables whose variance-covariance matrix configurations take the nature of interval data directly into account. Results on both synthetic and empirical data clearly show that the proposed approach is well founded. The method succeeds in finding parsimonious heteroscedastic models, which is a critical feature in many applications. Furthermore, the analysis of the different data sets made clear the need to explicitly consider the intrinsic variability present in interval data.

    3rd Workshop in Symbolic Data Analysis: book of abstracts

    This workshop is the third regular meeting of researchers interested in Symbolic Data Analysis. The main aim of the event is to bring people together and encourage the exchange of ideas from the different fields that contribute to Symbolic Data Analysis, among them Mathematics, Statistics, Computer Science, Engineering and Economics.

    Development of a R package to facilitate the learning of clustering techniques

    This project explores the development of a tool, in the form of an R package, that eases the process of learning clustering techniques: how they work and what their pros and cons are. The tool should provide implementations of several clustering techniques, with explanations, so that students can become familiar with the characteristics of each algorithm by testing it on different datasets while deepening their understanding through the explanations. Additionally, these explanations should adapt to the input data, making the tool suitable not only for self-regulated learning but also for teaching.

    Some problems in the theory and application of the methods of numerical taxonomy

    Several of the methods of numerical taxonomy are compared and shown to be variants of a tripartite grouping procedure associated with a generalised intercluster similarity function involving ten computational parameters. Clustering by the techniques of hierarchic fusion, monothetic division and iterative relocation is obtained using different arithmetic combinations of the function parameters to both compute similarities and effect changes in cluster membership. The combinatorial solution for Ward's method is found, and the centroid sorting combinatorial solution is extended for size difference, shape difference, dispersion and dot product coefficients. It is suggested that clusters are characterised more by the choice of similarity criterion than by the choice of method, and it is demonstrated that some common criteria such as distance and the error sum of squares are inclined to force spherical 'minimum-variance' classes. These are contrasted by 'natural' classes, which correspond to closed density surfaces defined for a multi-variate sample space by the underlying probability density function. A method for mode-seeking is developed from this probabilistic model through various theoretical and experimental phases, and it is shown to perform slightly better than iterative relocation with the minimum-variance criteria using several Gaussian test populations. A fast algorithm is proposed for the solution of the Jardine-Sibson method for generating overlapping classes, and it is observed that this technique finds natural classes and is closely related to the probabilistic model. Some aspects of computational procedures are discussed, and in particular, it is proposed that a generalised system involving a statistical language, conversational mode package and program suite could be developed from a basic subroutine system. 
    Paging and simulation techniques for the organisation of direct-access data files are suggested, and a comprehensive package of computer programs for cluster analysis is described.

    Performance Assessment of The Extended Gower Coefficient on Mixed Data with Varying Types of Functional Data.

    Clustering is a widely used technique in data mining applications to source, manage, analyze and extract vital information from large amounts of data. Most clustering procedures are limited in their performance when it comes to data with mixed attributes. In recent times, mixed data have evolved to include directional and functional data. In this study, we will give an introduction to clustering with an eye towards the application of the extended Gower coefficient by Hendrickson (2014). We will conduct a simulation study to assess the performance of this coefficient on mixed data whose functional component has strictly-decreasing signal curves and also those whose functional component has a mixture of strictly-decreasing signal curves and periodic tendencies. We will assess how four different hierarchical clustering algorithms perform on mixed data simulated under varying conditions with and without weights. The comparison of the various clustering solutions will be done using the Rand Index
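    Two ingredients named in this abstract can be sketched briefly. The code below shows the classic Gower dissimilarity for numeric and categorical variables and the plain (unadjusted) Rand Index. It does not implement the extended Gower coefficient of Hendrickson (2014), whose handling of directional and functional components is not described here. The function and parameter names (`gower_distance`, `kinds`, `ranges`) are hypothetical.

```python
from itertools import combinations

def gower_distance(x, y, kinds, ranges):
    """Classic Gower dissimilarity between two records with equal variable weights.

    kinds[i]  is "num" or "cat" for variable i.
    ranges[i] is the observed range (max - min) of numeric variable i;
              it is ignored for categorical variables.
    """
    total = 0.0
    for xi, yi, kind, r in zip(x, y, kinds, ranges):
        if kind == "num":
            # Range-normalised absolute difference, between 0 and 1.
            total += abs(xi - yi) / r if r else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if xi == yi else 1.0
    return total / len(kinds)

def rand_index(labels_a, labels_b):
    """Share of object pairs on which two partitions agree (same/same or different/different)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
    return agree / len(pairs)
```

    The Rand Index depends only on which objects are grouped together, so relabelling the clusters of one partition leaves it unchanged.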

    Archetypes for histogram-valued data

    The main innovative contribution of this work is an extension of archetypal analysis to histogram-valued data. As for the methodological framework for analysing histogram-valued data, which are complex in nature, the work draws on the insights of Symbolic Data Analysis (SDA) and on the intrinsic relationships between interval-valued and histogram-valued data. After discussing the technique, developed in the Matlab environment, together with its operation and properties on a toy example, the technique is proposed in the applied section as a tool for quantitative benchmarking analysis. Specifically, the main results are presented from an application of archetypes for histogram-valued data to a case of internal benchmarking of the school system, using data from the INVALSI test for the 2015/2016 school year. In this context the unit of analysis is the individual school, operationally defined through the distributions of its pupils' scores, considered jointly in the form of histogram-valued symbolic objects.

    Studies in numerical taxonomy of soils

    A series of established numerical taxonomic strategies was applied to soil data from three sources: USDA (1975), De Alwis (1971) and the Soil Survey of England and Wales. The first two sources provided data for 41 soil profiles, which were classified without reference to their geographical location. The data obtained from the Soil Survey of England and Wales related to a particular geographical area (West Sussex Coastal Plain) and the geographical relationship between soil individuals was also examined. Two methods of soil characterization (soil profile models) were compared with respect to their effect on the results produced by two hierarchical agglomerative strategies based on two measures of inter-individual similarity. The results obtained from the agglomerative strategies for the two soil profile models were compared. The nature of inter-attribute correlation for depth levels modelled as arrays of independent attributes was examined, and all attributes were classified on the basis of inter-attribute correlation. Seven hierarchical agglomerative strategies were examined with respect to their goodness-of-fit in the original space, and the relationship between goodness-of-fit and clarity of clusters was also examined. From these comparisons, two agglomerative strategies were chosen to represent two classes of strategy: (a) strategies with minimal distortion, (b) strategies with greater distortion but clear clusters. The average linkage method from the first category and Ward's error sum of squares (ESS) method from the second category were selected. These two strategies were applied to the data sets described above using two measures of similarity, namely (a) squared Euclidean distance and (b) Mahalanobis D², and a divisive strategy, REMUL, was also applied to classify the soil populations.
    The classifications obtained from these strategies were compared by Wilks' criterion Λ, and the classification with the lowest Λ was treated as the best initial partition. The best two partitions of the two populations obtained from the agglomerative strategy, Ward's ESS method, were further analysed. The optimum number of groups (G) in each population was decided by the relationship between ΛG² and G. The soil profile groups produced by these methods were further examined and improved by a reallocation strategy based on the Mahalanobis distance between individuals and the group centroids. Reallocation was done using 30 attributes from the uppermost soil horizons. Canonical analysis was performed on the populations both before and after the classification. Canonical plots were produced and a comparison was made with the dendrograms obtained for the best partitions. The classifications obtained were examined in relation to parent material classes. The spatial relationship of the soil groups of the West Sussex Coastal Plain was also investigated. As shown by this study, numerical taxonomic methods can produce a better classification of soils than traditional methods. To this end, it is not necessary to use all attributes of soils; a sufficiently large number of properties, which can be determined empirically, is adequate for producing a natural classification. The soil groups produced by numerical methods showed a closer association with parent materials.
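    The partition comparison in this abstract relies on Wilks' criterion. Assuming it denotes the usual Wilks' Λ, the ratio of the determinant of the pooled within-group scatter matrix to that of the total scatter matrix, a smaller value indicates tighter groups. The sketch below is restricted to two attributes so that the determinant can be written out by hand. The function names are hypothetical and this is not the thesis's own procedure, which used many more attributes.

```python
def scatter_matrix(rows):
    """Sums of squares and cross-products about the mean, for two-attribute rows.

    Returns (sxx, sxy, syy), the distinct entries of the symmetric 2x2 matrix.
    """
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows)
    syy = sum((r[1] - my) ** 2 for r in rows)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows)
    return (sxx, sxy, syy)

def det2(m):
    """Determinant of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = m
    return sxx * syy - sxy ** 2

def wilks_lambda(rows, labels):
    """Wilks' Lambda = det(W) / det(T); lower values mean better-separated groups."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    within = [0.0, 0.0, 0.0]
    for members in groups.values():
        # Pool the within-group scatter over all groups.
        for k, value in enumerate(scatter_matrix(members)):
            within[k] += value
    return det2(within) / det2(scatter_matrix(rows))
```

    Under this reading, the partition with the lowest Λ among candidates with the same number of groups would be kept as the initial partition.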