418 research outputs found

    Doctor of Philosophy

    Get PDF
    dissertationWith the tremendous growth of data produced in the recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from the data by discovering patterns and structures present in the data. In this dissertation, we will largely focus on clustering which is often the first step in any exploratory data mining task, where items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well founded metric that is spatially aware to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide the users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters and use this to develop adaptive sampling techniques to speedup machine learning algorithms. This dissertation is therefore a systematic approach to study the landscape of clusterings in an attempt to provide a better understanding of the data

    Coping with new Challenges in Clustering and Biomedical Imaging

    Get PDF
    The last years have seen a tremendous increase of data acquisition in different scientific fields such as molecular biology, bioinformatics or biomedicine. Therefore, novel methods are needed for automatic data processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning points of a data set into distinct groups in order to minimize the intra cluster similarity and to maximize the inter cluster similarity. In contrast to unsupervised learning like clustering, the classification problem is known as supervised learning that aims at the prediction of group membership of data objects on the basis of rules learned from a training set where the group membership is known. Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with problems from conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method that is based on a hierarchical variant of the Minimum Description Length (MDL) principle which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies) that combines the benefits from genetic algorithms with information-theory. In this way the search space is explored more effectively. Furthermore, we propose INTEGRATE a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed type data. Besides clustering methods for single data objects we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications. In the second part, we focus on the analysis of high resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering and classification. As a result, a set of highly selective features discriminating patients with Alzheimer and healthy people has been identified. However, the analysis of the high dimensional MR images is extremely time-consuming. Therefore we developed JGrid, a scalable distributed computing solution designed to allow for a large scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic for patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well among healthy and diseased people

    Deep Divergence-Based Approach to Clustering

    Get PDF
    A promising direction in deep learning research consists in learning representations and simultaneously discovering cluster structure in unlabeled data by optimizing a discriminative loss function. As opposed to supervised deep learning, this line of research is in its infancy, and how to design and optimize suitable loss functions to train deep neural networks for clustering is still an open question. Our contribution to this emerging field is a new deep clustering network that leverages the discriminative power of information-theoretic divergence measures, which have been shown to be effective in traditional clustering. We propose a novel loss function that incorporates geometric regularization constraints, thus avoiding degenerate structures of the resulting clustering partition. Experiments on synthetic benchmarks and real datasets show that the proposed network achieves competitive performance with respect to other state-of-the-art methods, scales well to large datasets, and does not require pre-training steps

    Cluster validity in clustering methods

    Get PDF
    • …
    corecore