
    Nonparametric Feature Extraction from Dendrograms

    We propose a nonparametric approach to feature extraction from dendrograms. Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms, developing a generalized framework in which different distance measures can be inferred from different types of dendrograms, level functions, and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, enabling many numerical machine learning algorithms to employ them. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances in solution space and in representation space, in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects across different solutions, in the context of ensemble methods, and use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the sequential combination of different distances and features, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
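
    As a minimal sketch of the single-linkage special case (our illustration, not the paper's code): under single linkage, the minimax distance between two points equals the cophenetic distance, i.e. the merge level at which they first fall into the same cluster. The toy data, and the use of classical MDS for the embedding step, are assumptions for demonstration only.

        import numpy as np
        from scipy.spatial.distance import pdist, squareform
        from scipy.cluster.hierarchy import linkage, cophenet
        from sklearn.manifold import MDS

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50, 3))            # toy data; any feature matrix works

        Z = linkage(pdist(X), method="single")  # single-linkage dendrogram

        # Under single linkage, the cophenetic distance between two points is
        # the smallest achievable "largest hop" over all paths connecting
        # them, i.e. exactly the minimax distance.
        minimax = squareform(cophenet(Z))

        # One way to obtain a vector-based representation of the inferred
        # distances (assumption: the paper's embedding may differ; classical
        # MDS is a stand-in here).
        features = MDS(n_components=2, dissimilarity="precomputed",
                       random_state=0).fit_transform(minimax)

    Any numerical learning algorithm can then consume the rows of features directly, which is the point of the embedding step.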

    Methods for fast and reliable clustering


    HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis

    Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and the indices used for internal cluster validation. Consequently, there is no consensus regarding best practice for rigorous benchmarking, or whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play in supporting the flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
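
    A rough sketch of the evolutionary idea (an illustration, not the HAWKS package or its API): each individual encodes a set of Gaussian cluster means, and fitness rewards datasets whose silhouette width lands near a user-chosen target difficulty. The target value, population size, and mutation scale below are arbitrary assumptions.

        import numpy as np
        from sklearn.metrics import silhouette_score

        rng = np.random.default_rng(0)
        N_CLUSTERS, DIM, PTS = 3, 2, 40
        POP, GENS = 20, 30
        TARGET = 0.5                  # hypothetical difficulty target

        def sample_dataset(means):
            # Draw PTS points from each unit-variance Gaussian cluster.
            X = np.vstack([rng.normal(m, 1.0, size=(PTS, DIM)) for m in means])
            y = np.repeat(np.arange(N_CLUSTERS), PTS)
            return X, y

        def fitness(means):
            # Datasets whose silhouette width is close to TARGET score best;
            # a low target evolves overlapping, harder-to-cluster benchmarks.
            X, y = sample_dataset(means)
            return -abs(silhouette_score(X, y) - TARGET)

        # Truncation selection plus Gaussian mutation over the cluster means.
        population = [rng.uniform(-5, 5, size=(N_CLUSTERS, DIM)) for _ in range(POP)]
        for _ in range(GENS):
            parents = sorted(population, key=fitness, reverse=True)[: POP // 2]
            population = parents + [p + rng.normal(0, 0.3, p.shape) for p in parents]

        X_bench, y_bench = sample_dataset(max(population, key=fitness))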

    Clustering in multivariate data: visualization, case and variable reduction

    Cluster analysis is a very common problem for multivariate data, and it is receiving intense attention due to the current boom in data warehousing and mining driven by the growth of information technology. Technology allows us to collect massive data sets, large in both cases and variables, and to develop sophisticated interactive and dynamic graphics. Three current issues for cluster analysis are: visualizing cluster structure, reducing the number of cases, and reducing the number of variables in very large data sets. This thesis addresses each of these issues. The lower-dimensional projection of the data found by projection pursuit, which preserves the cluster structure, helps clustering by eliminating the influence of nuisance variables. Initially partitioning the data into a set of small classifications improves the efficiency of hierarchical agglomerative clustering by saving the time and memory required in the early stages of clustering. A minimal spanning tree is used for this partitioning.
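
    A rough sketch of the pre-partitioning idea (our reading of the abstract, not the thesis code): cut the heaviest minimal-spanning-tree edges to form many small groups, then agglomerate the group centroids so the costly early merge stages operate on far fewer objects. The data, group count, and final cluster count are hypothetical.

        import numpy as np
        from scipy.spatial.distance import pdist, squareform
        from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
        from scipy.cluster.hierarchy import linkage, fcluster

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 4))
        N_GROUPS = 50                           # size of the initial partition

        mst = minimum_spanning_tree(squareform(pdist(X))).toarray()

        # Removing the N_GROUPS - 1 heaviest MST edges leaves exactly
        # N_GROUPS connected components: the small initial classifications.
        edges = np.argwhere(mst > 0)
        order = np.argsort(mst[edges[:, 0], edges[:, 1]])
        for i, j in edges[order[-(N_GROUPS - 1):]]:
            mst[i, j] = 0.0
        _, group = connected_components(mst, directed=False)

        # Hierarchical agglomerative clustering on 50 centroids, not 500 points.
        centroids = np.vstack([X[group == g].mean(axis=0)
                               for g in range(group.max() + 1)])
        Z = linkage(centroids, method="average")
        labels = fcluster(Z, t=3, criterion="maxclust")[group]  # map back to points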