
    Weighting Policies for Robust Unsupervised Ensemble Learning

    Unsupervised ensemble learning, or consensus clustering, consists of finding the optimal combination strategy for individual partitions, one that is robust to the selection of the algorithmic clustering pool. Despite its strong properties, this approach assigns the same weight to the contribution of each clustering to the final solution. We propose a weighting policy for this problem based on internal clustering quality measures and compare it against other modern approaches. Results on publicly available datasets show that weights can significantly improve accuracy while retaining the robustness properties. Since determining an appropriate number of clusters, a primary input for many clustering methods, is one of the significant challenges, we also use the same methodology to predict the correct or most suitable number of clusters. Among various methods, using internal validity indexes in conjunction with a suitable algorithm is one of the most popular ways to determine the appropriate number of clusters. We therefore use weighted consensus clustering together with four indexes: Silhouette (SH), Calinski-Harabasz (CH), Davies-Bouldin (DB), and Consensus (CI). Our experiments indicate that weighted consensus clustering with these indexes is a useful method for determining the correct or most appropriate number of clusters, in comparison to individual clustering methods (e.g., k-means) and to plain consensus clustering. Lastly, to decrease the variance of the proposed weighted consensus clustering, we borrow the core idea of Markowitz portfolio theory and apply it to the clustering domain: we optimize the combination of individual clustering methods to minimize the variance of clustering accuracy. This is a new weighting policy that produces a partition with lower variance, which may be crucial for a decision maker. Our study shows that using the idea of Markowitz portfolio theory creates a partition with less variation than traditional consensus clustering and the proposed weighted consensus clustering.
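    As a rough illustration of the weighting idea (not the paper's implementation), the Python sketch below weights each base partition by an internal quality measure, here the Silhouette index, accumulates a weighted co-association matrix, and extracts a consensus partition from it; the dataset, the k-means pool and the final agglomerative step are assumptions made for the example.

```python
# A minimal sketch of index-weighted consensus clustering, assuming Silhouette
# scores as the internal quality measure; the data, the k-means pool and the
# final agglomerative step are illustrative choices, not the paper's pipeline.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
k = 4

# Pool of base clusterings (different random initialisations).
base_labels = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
               for s in range(5)]

# Weight each partition by its (shifted) Silhouette score so that higher-quality
# partitions contribute more evidence to the co-association matrix.
weights = np.array([silhouette_score(X, lab) + 1.0 for lab in base_labels])
weights /= weights.sum()

n = X.shape[0]
coassoc = np.zeros((n, n))
for w, lab in zip(weights, base_labels):
    coassoc += w * (lab[:, None] == lab[None, :])

# Consensus partition: hierarchical clustering on the weighted co-association
# evidence, treated as a precomputed distance.
consensus = AgglomerativeClustering(
    n_clusters=k, metric="precomputed", linkage="average"
).fit_predict(1.0 - coassoc)
```

    Calinski-Harabasz (sklearn.metrics.calinski_harabasz_score) or Davies-Bouldin (sklearn.metrics.davies_bouldin_score) weights could be substituted in the weighting line, keeping in mind that Davies-Bouldin is a lower-is-better index and would need to be inverted.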

    Ensemble clustering via heuristic optimisation

    This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Traditional clustering algorithms have different criteria and biases, and no single algorithm is the best solution for a wide range of data sets. This problem often presents a significant obstacle to analysts in revealing meaningful information buried among huge amounts of data. Ensemble Clustering has been proposed as a way to avoid these biases and improve the accuracy of clustering. The difficulty in developing Ensemble Clustering methods is to combine external information (provided by the input clusterings) with internal information (i.e. the characteristics of the given data) effectively to improve the accuracy of clustering. The work presented in this thesis focuses on enhancing the clustering accuracy of Ensemble Clustering by employing heuristic optimisation techniques to achieve a robust combination of relevant information during the consensus clustering stage. Two novel heuristic optimisation-based Ensemble Clustering methods, Multi-Optimisation Consensus Clustering (MOCC) and K-Ants Consensus Clustering (KACC), are developed and introduced in this thesis. These methods use two heuristic optimisation algorithms (Simulated Annealing and Ant Colony Optimisation) in their Ensemble Clustering frameworks and have been shown to outperform other methods in the area. Extensive experimental results, together with a detailed analysis, are presented in this thesis.
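    To make the idea of heuristic optimisation at the consensus stage concrete, here is a toy Python sketch that uses simulated annealing to search over cluster assignments, rewarding pairs of objects that frequently co-cluster across the ensemble. It illustrates the general principle under stated assumptions and is not the MOCC or KACC algorithm.

```python
# A toy simulated-annealing consensus step: maximise the total co-association
# "gain" of same-cluster pairs by moving one object at a time. Assumed setup,
# not the thesis implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
k, n = 3, X.shape[0]

# Co-association matrix from a pool of base k-means runs.
base = [KMeans(n_clusters=k, n_init=5, random_state=s).fit_predict(X) for s in range(8)]
coassoc = sum((lab[:, None] == lab[None, :]).astype(float) for lab in base) / len(base)
gain = coassoc - 0.5            # reward pairs that usually co-cluster, penalise the rest

labels = base[0].copy()         # start from one base partition
T, cooling = 1.0, 0.999
for _ in range(20000):
    i, new_c = rng.integers(n), rng.integers(k)
    old_c = labels[i]
    if new_c == old_c:
        continue
    # Objective change from moving object i into cluster new_c.
    delta = gain[i, labels == new_c].sum() - gain[i, labels == old_c].sum() + gain[i, i]
    if delta > 0 or rng.random() < np.exp(delta / T):   # Metropolis acceptance
        labels[i] = new_c
    T *= cooling
```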

    Neuroengineering of Clustering Algorithms

    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. In particular, the first paper provides a comprehensive review of adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective, by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective, by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting, in the reduced-dimension and topology-preserving SOM grid, information-theoretic similarity measures between neighboring neurons. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature.
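    To make the VAT preprocessing step mentioned above concrete, the sketch below implements the standard VAT reordering as a Prim-like sweep over a dissimilarity matrix; presenting samples to offline ART training in this order is one way to mitigate ordering effects. The random data and its pairing with ART here are assumptions for illustration.

```python
# A minimal sketch of VAT (Visual Assessment of cluster Tendency) reordering.
import numpy as np
from scipy.spatial.distance import cdist

def vat_order(D):
    """Return an index permutation that reorders dissimilarity matrix D VAT-style."""
    n = D.shape[0]
    i, _ = np.unravel_index(np.argmax(D), D.shape)  # start at an endpoint of the largest dissimilarity
    order, remaining = [int(i)], set(range(n)) - {int(i)}
    while remaining:
        rem = np.fromiter(remaining, dtype=int)
        sub = D[np.ix_(order, rem)]
        j = rem[np.argmin(sub.min(axis=0))]         # closest remaining object to the ordered set
        order.append(int(j))
        remaining.remove(int(j))
    return np.array(order)

X = np.random.default_rng(0).normal(size=(60, 2))
D = cdist(X, X)
p = vat_order(D)
D_vat = D[np.ix_(p, p)]   # dark diagonal blocks in D_vat suggest cluster structure;
                          # X[p] gives the presentation order for offline training
```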

    Unsupervised learning on social data


    Image Based Biomarkers from Magnetic Resonance Modalities: Blending Multiple Modalities, Dimensions and Scales.

    The successful analysis and processing of medical imaging data is a multidisciplinary effort that requires the application and combination of knowledge from diverse fields, such as medical engineering, medicine, computer science and pattern classification. Imaging biomarkers are biologic features detectable by imaging modalities, and their use offers the prospect of more efficient clinical studies and of improvement in both diagnosis and therapy assessment. The use of Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) and its application to diagnosis and therapy have been extensively validated; nevertheless, the issue of an appropriate or optimal processing of the data to extract relevant biomarkers that highlight the differences between heterogeneous tissue still remains. Together with DCE-MRI, the data extracted from Diffusion MRI (DWI-MR and DTI-MR) represent a promising and complementary tool. This project initially proposes the exploration of diverse techniques and methodologies for the characterization of tissue, following an analysis and classification of voxel-level time-intensity curves from DCE-MRI data, mainly through the exploration of dissimilarity-based representations and models. We explore metrics and representations to correlate the multidimensional data acquired through diverse imaging modalities, work which starts with an appropriate elastic registration methodology between DCE-MRI and DWI-MR on the breast and its corresponding validation. It has been shown that the combination of multi-modal MRI images improves the discrimination of diseased tissue. However, the fusion of dissimilar imaging data for classification and segmentation purposes is not a trivial task; there is an inherent difference in information domains, dimensionality and scales. This work also proposes a multi-view consensus clustering methodology for the integration of multi-modal MR images into a unified segmentation of tumoral lesions for heterogeneity assessment. Using a variety of metrics and distance functions, this multi-view imaging approach calculates multiple vectorial dissimilarity spaces for each of the MRI modalities and makes use of the concepts behind cluster ensembles to combine a set of base unsupervised segmentations into a unified partition of the voxel-based data. The methodology is specially designed for combining DCE-MRI and DTI-MR, for which a manifold learning step is implemented in order to account for the geometric constraints of the high-dimensional diffusion information.
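    The sketch below illustrates the multi-view consensus idea on synthetic stand-in data: each "modality" gets its own representation and base clusterings, the higher-dimensional view passes through a manifold-learning step, and the ensemble evidence from all views is merged into one partition. Isomap and the co-association combination are illustrative substitutes, not the DCE-MRI/DTI-MR pipeline itself.

```python
# Illustrative multi-view cluster-ensemble combination on synthetic stand-in data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.manifold import Isomap
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, k = 300, 3
shift = np.repeat(np.arange(k), n // k)[:, None]
view_a = rng.normal(size=(n, 5)) + 2.0 * shift    # stand-in for one modality
view_b = rng.normal(size=(n, 20)) + 0.8 * shift   # stand-in for a higher-dimensional modality

# Manifold-learning step for the high-dimensional view (mirroring the idea of
# respecting its geometric structure before clustering).
view_b = Isomap(n_neighbors=10, n_components=3).fit_transform(view_b)

coassoc, runs = np.zeros((n, n)), 0
for view in (view_a, view_b):
    Z = StandardScaler().fit_transform(view)
    for seed in range(5):                         # base segmentations per view
        lab = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(Z)
        coassoc += lab[:, None] == lab[None, :]
        runs += 1
coassoc /= runs

# Unified partition from the combined evidence of both views.
labels = AgglomerativeClustering(
    n_clusters=k, metric="precomputed", linkage="average"
).fit_predict(1.0 - coassoc)
```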

    Clustering ensemble method

    Clustering is an unsupervised learning paradigm that partitions a given dataset into clusters so that objects in the same cluster are more similar to each other than to objects in other clusters. However, when clustering algorithms are used individually, their results are often inconsistent and unreliable. This research applies the philosophy of ensemble learning, which combines multiple partitions using a consensus function, in order to address these issues and improve clustering performance. A clustering ensemble framework is presented consisting of three phases: Ensemble Member Generation, Consensus and Evaluation. This research focuses on two points: the consensus function and ensemble diversity. For the first, we propose three new consensus functions: the Object-Neighbourhood Clustering Ensemble (ONCE), the Dual-Similarity Clustering Ensemble (DSCE), and the Adaptive Clustering Ensemble (ACE). ONCE takes into account the neighbourhood relationship between object pairs in the similarity matrix, while DSCE and ACE are based on two similarity measures: cluster similarity and membership similarity. The proposed ensemble methods were tested on benchmark real-world and artificial datasets. The results demonstrate that ONCE outperforms other similar methods and is more consistent and reliable than k-means. Furthermore, DSCE and ACE were compared to the ONCE, CO, MCLA and DICLENS clustering ensemble methods, and on average ACE outperforms these state-of-the-art methods. On diversity, we experimentally investigated all the existing measures to determine their relationship with ensemble quality. The results indicate that none of them is capable of discovering a clear relationship, for two reasons: (1) they are all inappropriately defined to measure the useful difference between the members, and (2) none of them has been used directly by any consensus function. We therefore point out that these two issues need to be addressed in future research.
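    As a rough illustration of the neighbourhood idea behind ONCE (the refinement rule below is an invented stand-in, not the thesis's definition), this sketch builds the usual co-association similarity from the ensemble members and then adjusts each pairwise entry using the two objects' shared nearest neighbours before extracting the consensus.

```python
# Co-association similarity refined by object-neighbourhood evidence; the
# refinement rule is an illustrative assumption, not the ONCE formulation.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
k, n = 3, X.shape[0]

members = [KMeans(n_clusters=k, n_init=5, random_state=s).fit_predict(X) for s in range(10)]
S = sum((m[:, None] == m[None, :]).astype(float) for m in members) / len(members)

m_neigh = 10
neigh = np.argsort(-S, axis=1)[:, 1:m_neigh + 1]   # top neighbours of each object under S
S_ref = np.empty_like(S)
for i in range(n):
    for j in range(n):
        shared = np.union1d(neigh[i], neigh[j])
        # Blend direct similarity with how consistently both objects relate
        # to their joint neighbourhood.
        S_ref[i, j] = 0.5 * S[i, j] + 0.5 * np.minimum(S[i, shared], S[j, shared]).mean()
np.fill_diagonal(S_ref, 1.0)

labels = AgglomerativeClustering(
    n_clusters=k, metric="precomputed", linkage="average"
).fit_predict(1.0 - S_ref)
```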

    An integrated clustering analysis framework for heterogeneous data

    Big data is a growing area of research with some important research challenges that motivate our work. We focus on one such challenge, the variety aspect. First, we introduce our problem by defining heterogeneous data as data about objects that are described by different data types, e.g., structured data, text, time series, images, etc. Throughout our work we use five datasets for experimentation: a real dataset of prostate cancer data and four synthetic datasets that we have created and made publicly available. Each dataset covers a different combination of data types used to describe its objects. Our strategy for clustering is based on fusion approaches, and we compare intermediate and late fusion schemes. We propose an intermediate fusion approach, Similarity Matrix Fusion (SMF), where the integration takes place at the level of calculating similarities. SMF produces a single fused distance matrix and two uncertainty expression matrices. We then propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids algorithm that utilises the uncertainty calculations to improve clustering performance. We evaluate our results by comparing them to clusterings produced using the individual elements and show that the fusion approach produces equal or significantly better results. We also show that there are advantages in utilising the uncertainty information, as Hk-medoids does. In addition, from a theoretical point of view, our proposed Hk-medoids algorithm has lower computational complexity than the popular PAM implementation of the k-medoids algorithm. Finally, we employ late fusion, which aggregates the results of clustering by individual elements by combining cluster labels using an object co-occurrence matrix technique; the final clustering is then derived by a hierarchical clustering algorithm. We show that intermediate fusion for clustering heterogeneous data is a feasible and efficient approach using our proposed Hk-medoids algorithm.
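    The sketch below illustrates intermediate fusion in the spirit of SMF under simplifying assumptions: per-view distance matrices are min-max normalised and averaged into one fused matrix, their per-pair disagreement is kept as a crude stand-in for the uncertainty matrices, and the fused matrix is clustered with a plain k-medoids loop. Hk-medoids would additionally exploit the uncertainty during assignment; none of this is the thesis code.

```python
# Intermediate fusion of per-view distances plus a plain k-medoids pass;
# illustrative reconstruction, not the SMF / Hk-medoids implementation.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n, k = 150, 3
shift = np.repeat(np.arange(k), n // k)[:, None]
views = {
    "structured":  rng.normal(size=(n, 6)) + 2.0 * shift,
    "time_series": rng.normal(size=(n, 40)) + 0.5 * shift,
}

# Per-view distance matrices, min-max normalised so the views are comparable.
dists = []
for V in views.values():
    D = squareform(pdist(V))
    dists.append(D / D.max())
fused = np.mean(dists, axis=0)        # single fused distance matrix
uncertainty = np.std(dists, axis=0)   # view disagreement per pair (unused in this sketch)

# Plain k-medoids (PAM-lite) on the fused, precomputed distances.
medoids = rng.choice(n, size=k, replace=False)
for _ in range(50):
    labels = np.argmin(fused[:, medoids], axis=1)
    new_medoids = medoids.copy()
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size:                  # keep the old medoid if a cluster empties
            new_medoids[c] = idx[np.argmin(fused[np.ix_(idx, idx)].sum(axis=1))]
    if np.array_equal(new_medoids, medoids):
        break
    medoids = new_medoids
```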

    Machine Learning

    Machine Learning can be defined in various ways, but broadly it refers to a scientific domain concerned with the design and development of theoretical and implementation tools that allow systems to be built with some human-like intelligent behavior. More specifically, machine learning addresses the ability of such systems to improve automatically through experience.

    Efficient data mining algorithms for time series and complex medical data
